PySpark Handle Null and NA: Practical Cleaning Recipes
Nulls break metrics, joins, and exports when left unchecked. Silent dropna calls can also erase valuable rows. Teams need deliberate patterns to inspect, fill, drop, or safely compare nulls in PySpark.
PySpark’s built-in null handlers—fillna, dropna, na.replace, coalesce, and null-safe comparisons—keep pipelines predictable and transparent.
Quick decision guide
| Goal | API | Notes |
|---|---|---|
| Fill defaults | `fillna` | Dict per column; match types |
| Drop sparse rows | `dropna` | Control `how`, `thresh`, `subset` |
| Replace sentinels | `na.replace` | Swap placeholder strings or numbers |
| Pick first non-null | `coalesce` | Combine columns by priority |
| Null-safe equality | `<=>` or `eqNullSafe` | Treat nulls as equal |
Setup
```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("null-handling").getOrCreate()

df = spark.createDataFrame(
    [
        (1, "Alice", None, None),
        (2, "Bob", "basic", 0),
        (3, None, "premium", None),
    ],
    "id INT, name STRING, tier STRING, score INT",
)
```
Inspect null patterns
```python
null_stats = df.select(
    *[
        F.sum(F.col(c).isNull().cast("int")).alias(f"{c}_nulls")
        for c in df.columns
    ]
)
```
- Count nulls per column before deciding to fill or drop.
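On large tables, raw counts can be hard to read; the same pattern can report null fractions instead. A minimal sketch reusing the `df` from the setup:

```python
total = df.count()

# Express nulls as a share of all rows, rounded for readability
null_frac = df.select(
    *[
        F.round(F.sum(F.col(c).isNull().cast("int")) / total, 3).alias(f"{c}_null_frac")
        for c in df.columns
    ]
)
null_frac.show()
```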
Fill missing values
```python
filled = df.fillna({"name": "unknown", "tier": "basic", "score": 0})
```
- Provide a dict to keep types aligned; avoid filling numeric columns with strings.
- For dates/timestamps, use domain defaults (e.g. `to_date('1970-01-01')`), as sketched below.
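`fillna` only accepts bool, int, float, and string replacements, so date columns are usually defaulted through `coalesce` instead. A minimal sketch, where `events` and `signup_date` are hypothetical:

```python
# Hypothetical frame with a date column; fillna cannot fill dates directly
events = spark.createDataFrame(
    [(1, None), (2, "2024-03-01")],
    "id INT, signup_date STRING",
).withColumn("signup_date", F.to_date("signup_date"))

# Fall back to a domain default date via coalesce
events = events.withColumn(
    "signup_date",
    F.coalesce("signup_date", F.to_date(F.lit("1970-01-01"))),
)
```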
Drop rows selectively
```python
drop_sparse = df.dropna(thresh=3)  # keep rows with >= 3 non-null values
drop_missing_tier = df.dropna(subset=["tier"])
```
- `thresh` guards against over-dropping.
- Prefer column subsets over a blanket `dropna()` to preserve useful rows.
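The decision table above also lists `how`; a short sketch of the difference on the same `df`:

```python
# how="any" (the default) drops a row if any column is null
strict = df.dropna(how="any")

# how="all" drops a row only when every column is null
lenient = df.dropna(how="all")
```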
Replace sentinel values
```python
clean_sent = df.na.replace({"N/A": None, "": None}, subset=["tier", "name"])
```
- Convert placeholder strings to proper nulls before filling or dropping.
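Numeric sentinels can be normalized the same way; a sketch that treats `-999` as a hypothetical placeholder in `score`:

```python
# -999 is a hypothetical sentinel here; map it to a real null
clean_scores = df.na.replace(-999, None, subset=["score"])
```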
Coalesce to pick first non-null
```python
with_fallback = df.withColumn("primary_tier", F.coalesce("tier", F.lit("basic")))
```
- `coalesce` returns the first non-null value in the argument list.
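Because `coalesce` takes any number of arguments, fallback chains stay readable; a sketch with a hypothetical `legacy_tier` column:

```python
# Hypothetical frame carrying an older tier column during a migration
accounts = spark.createDataFrame(
    [(1, None, "gold"), (2, "basic", None), (3, None, None)],
    "id INT, tier STRING, legacy_tier STRING",
)

# Current tier first, then the legacy value, then a literal default
accounts = accounts.withColumn(
    "tier_resolved",
    F.coalesce("tier", "legacy_tier", F.lit("basic")),
)
```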
Null-safe comparisons
```python
matches = df.where(F.expr("tier <=> 'premium'"))
```
- `<=>` (or `eqNullSafe`) treats two nulls as equal; a regular `=` returns null when either side is null.
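The same check is available on the Column API for pipelines that avoid SQL strings:

```python
# eqNullSafe is the DataFrame-API counterpart of <=>
matches_api = df.where(F.col("tier").eqNullSafe("premium"))
```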
Guarding downstream joins
- Fill join keys only when business rules allow; otherwise filter null keys explicitly.
- For fact-to-dimension joins, decide up front: drop null keys before the join, or preserve them and handle them later, as sketched below.
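A minimal sketch of the filter-first option, assuming a hypothetical `dim_tiers` dimension table:

```python
# Hypothetical dimension table of tier attributes
dim_tiers = spark.createDataFrame(
    [("basic", 0), ("premium", 1)],
    "tier STRING, priority INT",
)

# Split null keys out explicitly rather than letting the join discard them silently
facts_with_key = df.where(F.col("tier").isNotNull())
facts_null_key = df.where(F.col("tier").isNull())  # route to review or quarantine

joined = facts_with_key.join(dim_tiers, on="tier", how="left")
```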
Minimal cleaning pipeline
```python
clean = (
    df
    .na.replace({"N/A": None, "": None})
    .dropna(subset=["name"])  # require a name
    .fillna({"tier": "basic", "score": 0})
    .withColumn("tier_clean", F.coalesce("tier", F.lit("basic")))
)
```
This sequence standardizes placeholders, keeps required fields, fills with business defaults, and leaves a clear fallback column for reporting.
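To verify the result, the inspection pattern from earlier can be re-run on `clean`; required columns should report zero nulls:

```python
# Re-check null counts on the cleaned frame
clean.select(
    *[
        F.sum(F.col(c).isNull().cast("int")).alias(f"{c}_nulls")
        for c in clean.columns
    ]
).show()
```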
