PySpark Select, Filter, and withColumn: Core DataFrame Recipes

Messy DataFrames slow every pipeline: wrong columns, unsafe filters, and fragile casts create silent errors. Teams need repeatable patterns for selecting, filtering, and deriving columns to keep Spark jobs predictable.

PySpark’s select, filter/where, and withColumn APIs solve this by keeping transformations explicit, type-safe, and testable. This guide shows the key patterns and how to avoid common pitfalls.

Reader’s map

| Task | API | When to use |
| --- | --- | --- |
| Pick/rename columns | select, alias | Keep only needed columns; avoid selectExpr when simple |
| Row filtering | filter / where | Both identical; chain conditions with & and \| |
| Derived/conditional columns | withColumn, when/otherwise | Add or replace columns with logic |
| SQL-like expressions | selectExpr, expr | Quick arithmetic or SQL functions without many imports |
| Safe casting | cast, try_cast (Spark 3.5+) | Enforce types without throwing on bad values |

Setup and sample data

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
 
spark = SparkSession.builder.appName("select-filter-withcolumn").getOrCreate()
 
df = spark.createDataFrame(
    [
        (1, "Alice", "2025-11-01", "premium", "42"),
        (2, "Bob",   "2025-11-02", "basic",   "x"),
        (3, "Cara",  "2025-11-02", None,      "7"),
    ],
    "id INT, name STRING, signup_date STRING, tier STRING, score STRING",
)

Select only what you need

slim = df.select(
    "id",
    F.col("name").alias("customer_name"),
    F.to_date("signup_date").alias("signup_dt"),
)
  • select keeps projections explicit and avoids wide scans.
  • Use alias for friendly names; select also accepts a plain Python list of columns (see the sketch below).
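
When the column set comes from configuration, passing a list keeps the projection explicit; a minimal sketch (the wanted list is made up for illustration):

wanted = ["id", "name", "tier"]  # hypothetical config-driven column list
slim_dynamic = df.select([F.col(c) for c in wanted])
# equivalent: df.select(*wanted)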

Filter / where: identical APIs

active = slim.where(
    (F.col("signup_dt") >= "2025-11-01")
    & (F.col("customer_name") != "Bob")
)
  • filter and where are the same; choose one for readability.
  • Combine conditions with & and |; wrap each condition in parentheses.
  • Null checks: isNull() / isNotNull() avoid surprises; see the sketch below.
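
A minimal sketch of the null-check point on the sample data: a bare equality test silently drops rows with a NULL tier, so make the NULL branch explicit.

# tier == 'premium' alone would exclude Cara (tier is NULL), because
# NULL comparisons evaluate to NULL rather than true inside a filter
premium_or_unknown = df.where(
    (F.col("tier") == "premium") | F.col("tier").isNull()
)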

withColumn for derived and conditional values

scored = active.withColumn(
    "score_int",
    F.when(F.col("score").rlike("^[0-9]+$"), F.col("score").cast("int"))
     .otherwise(None),
)
  • withColumn adds or replaces; keep names unique to avoid accidental overwrite.
  • when/otherwise sets clear branch logic; chained when calls handle more than two branches (sketched below).
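
A short sketch of a multi-branch chain that buckets tier; the bucket labels here are illustrative, not part of the original example.

bucketed = scored.withColumn(
    "tier_bucket",
    F.when(F.col("tier") == "premium", "paid")
     .when(F.col("tier") == "basic", "free")
     .otherwise("unknown"),  # also catches NULL tiers
)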

selectExpr for quick expressions

expr_df = df.selectExpr(
    "id",
    "upper(name) AS name_upper",
    "to_date(signup_date) AS signup_dt",
    "CASE WHEN tier = 'premium' THEN 1 ELSE 0 END AS is_premium",
)
  • Handy for SQL-style calculations without importing many helpers from pyspark.sql.functions.
  • Keep complex logic in regular withColumn for readability and testing; F.expr (sketched below) bridges the two styles.
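
F.expr returns the same SQL snippet as a Column object, so it composes with withColumn; a brief sketch:

is_premium_df = df.withColumn(
    "is_premium",
    F.expr("CASE WHEN tier = 'premium' THEN 1 ELSE 0 END"),
)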

Safe casting patterns

typed = (
    df
    .withColumn("score_int", F.col("score").cast("int"))
    .withColumn("signup_ts", F.to_timestamp("signup_date"))
)
  • Prefer explicit casts; avoid relying on schema inference from CSV sources.
  • On Spark 3.5+, try_cast returns null instead of failing on bad values; see the sketch below.
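
A minimal sketch of that pattern through a SQL expression, assuming a runtime where the try_cast expression is available (per the note above, Spark 3.5+):

safe_typed = df.withColumn(
    "score_int",
    F.expr("try_cast(score AS INT)"),  # yields NULL for values like 'x' instead of raising
)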

Quick checks for data quality

from pyspark.sql import functions as F
 
bad_counts = df.select(
    F.sum(F.col("score").rlike("^[0-9]+$").cast("int")).alias("valid_scores"),
    # negate the same pattern so mixed values like '1x' count as invalid;
    # NULL scores fall into neither bucket because rlike(NULL) is NULL
    F.sum((~F.col("score").rlike("^[0-9]+$")).cast("int")).alias("invalid_scores"),
)
  • Validate before writing; small aggregations surface issues early.
  • Use count + isNull to gauge missing rates on key columns (sketched below).
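
A small sketch of that missing-rate check on the sample columns:

null_rates = df.select(
    (F.sum(F.col("tier").isNull().cast("int")) / F.count(F.lit(1))).alias("tier_null_rate"),
    (F.sum(F.col("score").isNull().cast("int")) / F.count(F.lit(1))).alias("score_null_rate"),
)
null_rates.show()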

Common pitfalls and fixes

  • Unparenthesized conditions: always wrap each boolean clause when using &/|.
  • Accidental column overwrite: check df.columns or use new names with withColumn.
  • String dates without parsing: convert with to_date/to_timestamp before comparisons (see the sketch after this list).
  • Null-sensitive comparisons: rely on isNull/isNotNull to avoid three-valued logic surprises.
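
A compact sketch that combines the date-parsing and parenthesization fixes on the sample data:

parsed = df.withColumn("signup_dt", F.to_date("signup_date"))
recent_premium = parsed.where(
    (F.col("signup_dt") >= F.lit("2025-11-02").cast("date"))  # date-to-date comparison, not string
    & (F.col("tier") == "premium")
)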

Minimal pipeline example

clean = (
    df
    .select("id", "name", "signup_date", "tier", "score")
    .where(F.col("tier").isNotNull())
    .withColumn("signup_dt", F.to_date("signup_date"))
    .withColumn(
        "score_int",
        F.when(F.col("score").rlike("^[0-9]+$"), F.col("score").cast("int")),
    )
)

This keeps selections tight, filters explicit, and derived columns typed before downstream writes.
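
For the write itself, a minimal sketch assuming a Parquet target; the /tmp/clean_customers path is made up for illustration.

(
    clean
    .write
    .mode("overwrite")
    .partitionBy("signup_dt")
    .parquet("/tmp/clean_customers")  # hypothetical output path
)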