
PySpark UDF Tutorial (Beginner-Friendly): What, Why, and How to Use PySpark UDFs

PySpark is great at processing big data fast—until you write logic Spark doesn’t understand. That’s where a UDF (User Defined Function) comes in: it lets you run custom Python code on Spark DataFrames.

The catch: UDFs can also make Spark slower if used carelessly. This tutorial shows when to use UDFs, how to write them correctly, and how to pick faster alternatives (like built-in functions or Pandas UDFs).


The problem UDFs solve

You have a DataFrame column and want to apply custom logic:

  • domain-specific parsing (weird IDs, custom rules)
  • custom scoring functions
  • text normalization not covered by built-in Spark SQL functions
  • mapping values using a complex rule

Spark’s built-in functions are optimized in the JVM. But pure Python logic isn’t automatically “Spark-native”.


The solution options (fastest → slowest)

In practice, prefer this order:

  1. Built-in Spark functions (pyspark.sql.functions) ✅ fastest
  2. SQL expressions (expr, when, regexp_extract) ✅ fast
  3. Pandas UDFs (vectorized with Apache Arrow) ✅ often fast
  4. Regular Python UDFs (row-by-row Python) ⚠️ often slowest

This post focuses on (3) and (4), with strong guidance on when to avoid them.


Setup: a small DataFrame to play with

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, IntegerType, DoubleType
 
spark = SparkSession.builder.getOrCreate()
 
df = spark.createDataFrame(
    [
        (" Alice  ", "US", 10),
        ("bob", "UK", 3),
        (None, "US", 7),
    ],
    ["name", "country", "visits"]
)
 
df.show()

Example 1: A simple Python UDF (string cleaning)

Goal

Trim spaces, lowercase, and handle nulls safely.

from pyspark.sql.functions import udf
 
def clean_name(x: str) -> str:
    if x is None:
        return None
    return x.strip().lower()
 
clean_name_udf = udf(clean_name, StringType())
 
df2 = df.withColumn("name_clean", clean_name_udf(F.col("name")))
df2.show()

Notes

  • Specify the return type (StringType() here); if you omit it, udf() defaults to StringType.
  • A Python UDF runs row by row in Python worker processes, with serialization overhead between the JVM and Python (can be slow on big data).

Example 2: Prefer built-in functions when possible

The previous UDF can be replaced by built-ins (faster):

df_fast = df.withColumn(
    "name_clean",
    F.lower(F.trim(F.col("name")))
)
df_fast.show()

Rule of thumb: if you can express it with F.*, do that.
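Option (2) from the earlier list (SQL-style expressions) works the same way. Here is a minimal sketch using F.when and F.expr on the same df; regexp_extract follows the same pattern, and the visits_label and country_lower column names are just illustrative:

# Option (2) sketch: SQL-style expressions also run in the JVM and keep
# Catalyst optimizations, just like the F.* functions above.
df_expr = df.withColumn(
    "visits_label",
    F.when(F.col("visits") >= 5, "frequent").otherwise("occasional")
).withColumn(
    "country_lower",
    F.expr("lower(country)")
)
df_expr.show()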


Example 3: UDF with multiple inputs (custom scoring)

Let’s create a score based on country + visits.

def score(country: str, visits: int) -> float:
    if country is None or visits is None:
        return 0.0
    bonus = 1.5 if country == "US" else 1.0
    return float(visits) * bonus
 
score_udf = udf(score, DoubleType())
 
df3 = df.withColumn("score", score_udf(F.col("country"), F.col("visits")))
df3.show()

Example 4: Register a UDF for Spark SQL

If you want to use it in SQL:

spark.udf.register("score_udf_sql", score, DoubleType())
 
df.createOrReplaceTempView("t")
spark.sql("""
  SELECT name, country, visits, score_udf_sql(country, visits) AS score
  FROM t
""").show()

Common mistakes (and how to avoid them)

1) Returning the wrong type

If Spark expects DoubleType() and you return a string sometimes, you’ll get runtime errors or nulls.
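A hypothetical sketch of the pitfall (bad_length and its string fallback are made up for illustration):

# Declared return type is DoubleType(), but the function sometimes returns a
# string. Depending on the Spark version, the mismatched rows either raise a
# runtime error or silently come back as null.
def bad_length(name):
    if name is None:
        return "missing"          # wrong type: string instead of float
    return float(len(name))

bad_length_udf = udf(bad_length, DoubleType())
df.withColumn("name_len", bad_length_udf(F.col("name"))).show()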

2) Using UDFs in joins/filters unnecessarily

UDFs are opaque to Spark's Catalyst optimizer, so putting them directly in join or filter conditions can block optimizations such as predicate pushdown. Try to compute a derived column first, or use built-ins.
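For example, instead of filtering directly on a UDF call, compute the column once and filter on the plain column (a sketch reusing score_udf from Example 3):

# Materialize the UDF result as a column first, then filter on that column.
df_scored = df.withColumn("score", score_udf(F.col("country"), F.col("visits")))
df_scored.filter(F.col("score") > 5.0).show()

# Harder for Spark to optimize: the UDF sits inside the filter condition.
# df.filter(score_udf(F.col("country"), F.col("visits")) > 5.0)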

3) Forgetting null handling

Always assume inputs can be None.
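Guard inside the UDF (as clean_name does in Example 1), or clean up nulls on the Spark side before the UDF runs. A small sketch of the second approach (name_filled is just an illustrative column name):

# Fill nulls with F.coalesce so the UDF only ever sees real strings.
df_nonnull = df.withColumn("name_filled", F.coalesce(F.col("name"), F.lit("")))
df_nonnull.withColumn("name_clean", clean_name_udf(F.col("name_filled"))).show()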


Faster alternative: Pandas UDF (vectorized)

A Pandas UDF receives whole batches of rows as pandas Series (transferred efficiently via Apache Arrow) and processes them vectorized, which is often much faster than a regular row-by-row UDF.

Example 5: Pandas UDF for name cleaning

import pandas as pd
from pyspark.sql.functions import pandas_udf
 
@pandas_udf("string")
def clean_name_pandas(s: pd.Series) -> pd.Series:
    return s.str.strip().str.lower()
 
df4 = df.withColumn("name_clean", clean_name_pandas(F.col("name")))
df4.show()

When Pandas UDFs help

  • logic is easier in pandas
  • you need vectorized operations
  • performance matters and built-ins aren’t enough

Example 6: Pandas UDF for element-wise scoring

@pandas_udf("double")
def score_pandas(country: pd.Series, visits: pd.Series) -> pd.Series:
    bonus = country.eq("US").astype("float64") * 0.5 + 1.0
    visits_filled = visits.fillna(0).astype("float64")
    return visits_filled * bonus
 
df5 = df.withColumn("score", score_pandas(F.col("country"), F.col("visits")))
df5.show()

Practical best practices checklist

  • ✅ Try built-in functions first (F.lower, F.when, F.regexp_extract, etc.)
  • ✅ If custom logic is needed, consider Pandas UDF before regular UDF
  • ✅ Always specify correct return types
  • ✅ Handle nulls (None) explicitly
  • ✅ Keep UDF logic pure and deterministic (same input → same output)
  • ⚠️ Avoid UDFs in filters/joins when possible (they can reduce optimizations)
  • ⚠️ Measure: compare runtime with/without the UDF on a sample dataset (see the sketch below)
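A minimal timing sketch for that last point, comparing the Python UDF from Example 1 against the built-in version. Counting the new column's non-null values forces it to actually be computed; a bare .count() could let Spark skip the column entirely.

import time

# Time the Python UDF version.
start = time.time()
df.withColumn("name_clean", clean_name_udf(F.col("name"))) \
  .agg(F.count("name_clean")).collect()
print("python udf:", time.time() - start)

# Time the built-in version.
start = time.time()
df.withColumn("name_clean", F.lower(F.trim(F.col("name")))) \
  .agg(F.count("name_clean")).collect()
print("built-ins :", time.time() - start)

On this tiny demo DataFrame the numbers mean little; run the comparison on a realistic sample of your data.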

When you should use a regular Python UDF

Use a regular UDF when:

  • data size is not huge (or performance is not critical)
  • logic is hard to express with built-ins
  • pandas/Arrow isn’t available or doesn’t fit the transformation

If performance matters, reach for built-ins or Pandas UDFs first.