PySpark Handle Null and NA: Practical Cleaning Recipes

Nulls break metrics, joins, and exports when left unchecked. Silent dropna calls can also erase valuable rows. Teams need deliberate patterns to inspect, fill, drop, or safely compare nulls in PySpark.

PySpark’s built-in null handlers—fillna, dropna, na.replace, coalesce, and null-safe comparisons—keep pipelines predictable and transparent.

Quick decision guide

| Goal | API | Notes |
| --- | --- | --- |
| Fill defaults | fillna | Dict per column; match types |
| Drop sparse rows | dropna | Control how, thresh, subset |
| Replace sentinels | na.replace | Swap placeholder strings or numbers |
| Pick first non-null | coalesce | Combine columns by priority |
| Null-safe equality | <=> or eqNullSafe | Treat nulls as equal |

Setup

from pyspark.sql import SparkSession, functions as F
 
spark = SparkSession.builder.appName("null-handling").getOrCreate()
 
df = spark.createDataFrame(
    [
        (1, "Alice", None, None),
        (2, "Bob", "basic", 0),
        (3, None, "premium", None),
    ],
    "id INT, name STRING, tier STRING, score INT",
)

Inspect null patterns

null_stats = df.select(
    *[
        F.sum(F.col(c).isNull().cast("int")).alias(f"{c}_nulls")
        for c in df.columns
    ]
)
  • Count nulls per column before deciding to fill or drop.
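Running null_stats on the sample DataFrame makes the decision concrete (output approximate):

null_stats.show()
# +--------+----------+----------+-----------+
# |id_nulls|name_nulls|tier_nulls|score_nulls|
# +--------+----------+----------+-----------+
# |       0|         1|         1|          2|
# +--------+----------+----------+-----------+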

Fill missing values

filled = df.fillna({"name": "unknown", "tier": "basic", "score": 0})
  • Provide a dict to keep types aligned; avoid filling numeric with strings.
  • fillna skips date/timestamp columns, so fill those with a domain default via coalesce instead (see the sketch below).
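A minimal sketch of the date workaround, assuming the DataFrame had a hypothetical signup_date column of DateType:

df_dates = df.withColumn(
    "signup_date_clean",
    # signup_date is hypothetical; coalesce substitutes the domain default
    F.coalesce(F.col("signup_date"), F.to_date(F.lit("1970-01-01"))),
)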

Drop rows selectively

drop_sparse = df.dropna(thresh=3)          # keep rows with >=3 non-null values
drop_missing_tier = df.dropna(subset=["tier"])
  • thresh guards against over-dropping.
  • Prefer column subsets over blanket dropna() to preserve useful rows.
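On the sample DataFrame, thresh=3 keeps only Bob's row; the other two rows carry just two non-null values each (output approximate):

drop_sparse.show()
# +---+----+-----+-----+
# | id|name| tier|score|
# +---+----+-----+-----+
# |  2| Bob|basic|    0|
# +---+----+-----+-----+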

Replace sentinel values

clean_sent = df.na.replace({"N/A": None, "": None}, subset=["tier", "name"])
  • Convert placeholder strings to proper nulls before filling or dropping.
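The sample DataFrame contains no sentinel strings, so here is a small hypothetical frame that shows the swap:

raw = spark.createDataFrame(
    [(1, "N/A"), (2, "premium"), (3, "")],
    "id INT, tier STRING",
)
cleaned = raw.na.replace({"N/A": None, "": None}, subset=["tier"])
# rows 1 and 3 now hold proper nulls in tier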

Coalesce to pick first non-null

with_fallback = df.withColumn("primary_tier", F.coalesce("tier", F.lit("basic")))
  • coalesce returns the first non-null in the argument list.
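coalesce also ranks several columns by priority; a sketch assuming a hypothetical preferred_tier column that should win over tier:

with_priority = df.withColumn(
    "effective_tier",
    # preferred_tier is hypothetical; first non-null wins, the literal is the last resort
    F.coalesce(F.col("preferred_tier"), F.col("tier"), F.lit("basic")),
)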

Null-safe comparisons

matches = df.where(F.expr("tier <=> 'premium'"))
  • <=> (or eqNullSafe) treats two nulls as equal; regular = returns null for null comparisons.
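The DataFrame API equivalent, including a null-to-null match:

matches_api = df.where(F.col("tier").eqNullSafe("premium"))
null_tiers = df.where(F.col("tier").eqNullSafe(None))  # rows where tier IS NULL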

Guarding downstream joins

  • Fill join keys only when business rules allow; otherwise filter null keys explicitly.
  • For fact-to-dimension joins, decide: drop null keys before join, or preserve and handle later.
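A minimal sketch of the filter-first pattern, assuming a hypothetical dim_tiers dimension keyed on tier:

dim_tiers = spark.createDataFrame(
    [("basic", 1), ("premium", 2)],
    "tier STRING, tier_rank INT",
)
facts_keyed = df.where(F.col("tier").isNotNull())  # drop null join keys explicitly
joined = facts_keyed.join(dim_tiers, on="tier", how="left")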

Minimal cleaning pipeline

clean = (
    df
    .na.replace({"N/A": None, "": None})
    .dropna(subset=["name"])  # require a name
    .fillna({"tier": "basic", "score": 0})
    .withColumn("tier_clean", F.coalesce("tier", F.lit("basic")))
)

This sequence standardizes placeholders, keeps required fields, fills with business defaults, and leaves a clear fallback column for reporting.