
PySpark UDF Tutorial (Beginner-Friendly): What, Why, and How to Use PySpark UDFs

PySpark is great at processing big data fast—until you write logic Spark doesn’t understand. That’s where a UDF (User Defined Function) comes in: it lets you run custom Python code on Spark DataFrames.

The catch: UDFs can also make Spark slower if used carelessly. This tutorial shows when to use UDFs, how to write them correctly, and how to pick faster alternatives (like built-in functions or Pandas UDFs).


The problem UDFs solve

You have a DataFrame column and want to apply custom logic:

  • domain-specific parsing (weird IDs, custom rules)
  • custom scoring functions
  • text normalization not covered by built-in Spark SQL functions
  • mapping values using a complex rule

Spark’s built-in functions are optimized in the JVM. But pure Python logic isn’t automatically “Spark-native”.


The solution options (fastest → slowest)

In practice, prefer this order:

  1. Built-in Spark functions (pyspark.sql.functions) ✅ fastest
  2. SQL expressions (expr, when, regexp_extract) ✅ fast
  3. Pandas UDFs (vectorized with Apache Arrow) ✅ often fast
  4. Regular Python UDFs (row-by-row Python) ⚠️ often slowest

This post focuses on (3) and (4), with strong guidance on when to avoid them.


Setup: a small DataFrame to play with

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, IntegerType, DoubleType
 
spark = SparkSession.builder.getOrCreate()
 
df = spark.createDataFrame(
    [
        (" Alice  ", "US", 10),
        ("bob", "UK", 3),
        (None, "US", 7),
    ],
    ["name", "country", "visits"]
)
 
df.show()

Example 1: A simple Python UDF (string cleaning)

Goal

Trim spaces, lowercase, and handle nulls safely.

from pyspark.sql.functions import udf
 
def clean_name(x: str) -> str:
    if x is None:
        return None
    return x.strip().lower()
 
clean_name_udf = udf(clean_name, StringType())
 
df2 = df.withColumn("name_clean", clean_name_udf(F.col("name")))
df2.show()

Notes

  • Specify the return type (StringType() here); if you omit it, udf() defaults to StringType.
  • A Python UDF runs row by row in Python worker processes, with serialization overhead between the JVM and Python (can be slow on big data).

Example 2: Prefer built-in functions when possible

The previous UDF can be replaced by built-ins (faster):

df_fast = df.withColumn(
    "name_clean",
    F.lower(F.trim(F.col("name")))
)
df_fast.show()

Rule of thumb: if you can express it with F.*, do that.
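Option (2) from the earlier list (SQL-style expressions) works the same way. Here is a minimal sketch using F.when and F.expr on the same df; regexp_extract follows the same pattern, and the visits_label and country_lower column names are just illustrative:

# Option (2) sketch: SQL-style expressions also run in the JVM and keep
# Catalyst optimizations, just like the F.* functions above.
df_expr = df.withColumn(
    "visits_label",
    F.when(F.col("visits") >= 5, "frequent").otherwise("occasional")
).withColumn(
    "country_lower",
    F.expr("lower(country)")
)
df_expr.show()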


Example 3: UDF with multiple inputs (custom scoring)

Let’s create a score based on country + visits.

def score(country: str, visits: int) -> float:
    if country is None or visits is None:
        return 0.0
    bonus = 1.5 if country == "US" else 1.0
    return float(visits) * bonus
 
score_udf = udf(score, DoubleType())
 
df3 = df.withColumn("score", score_udf(F.col("country"), F.col("visits")))
df3.show()

Example 4: Register a UDF for Spark SQL

If you want to use it in SQL:

spark.udf.register("score_udf_sql", score, DoubleType())
 
df.createOrReplaceTempView("t")
spark.sql("""
  SELECT name, country, visits, score_udf_sql(country, visits) AS score
  FROM t
""").show()

Common mistakes (and how to avoid them)

1) Returning the wrong type

If Spark expects DoubleType() and you return a string sometimes, you’ll get runtime errors or nulls.
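A hypothetical sketch of the pitfall (bad_length and its string fallback are made up for illustration):

# Declared return type is DoubleType(), but the function sometimes returns a
# string. Depending on the Spark version, the mismatched rows either raise a
# runtime error or silently come back as null.
def bad_length(name):
    if name is None:
        return "missing"          # wrong type: string instead of float
    return float(len(name))

bad_length_udf = udf(bad_length, DoubleType())
df.withColumn("name_len", bad_length_udf(F.col("name"))).show()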

2) Using UDFs in joins/filters unnecessarily

UDFs are opaque to Spark's Catalyst optimizer, so putting them directly in join or filter conditions can block optimizations such as predicate pushdown. Try to compute a derived column first, or use built-ins.
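For example, instead of filtering directly on a UDF call, compute the column once and filter on the plain column (a sketch reusing score_udf from Example 3):

# Materialize the UDF result as a column first, then filter on that column.
df_scored = df.withColumn("score", score_udf(F.col("country"), F.col("visits")))
df_scored.filter(F.col("score") > 5.0).show()

# Harder for Spark to optimize: the UDF sits inside the filter condition.
# df.filter(score_udf(F.col("country"), F.col("visits")) > 5.0)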

3) Forgetting null handling

Always assume inputs can be None.
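Guard inside the UDF (as clean_name does in Example 1), or clean up nulls on the Spark side before the UDF runs. A small sketch of the second approach (name_filled is just an illustrative column name):

# Fill nulls with F.coalesce so the UDF only ever sees real strings.
df_nonnull = df.withColumn("name_filled", F.coalesce(F.col("name"), F.lit("")))
df_nonnull.withColumn("name_clean", clean_name_udf(F.col("name_filled"))).show()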


Faster alternative: Pandas UDF (vectorized)

A Pandas UDF receives whole batches of rows as pandas Series (transferred efficiently via Apache Arrow) and processes them vectorized, which is often much faster than a regular row-by-row UDF.

Example 5: Pandas UDF for name cleaning

import pandas as pd
from pyspark.sql.functions import pandas_udf
 
@pandas_udf("string")
def clean_name_pandas(s: pd.Series) -> pd.Series:
    return s.str.strip().str.lower()
 
df4 = df.withColumn("name_clean", clean_name_pandas(F.col("name")))
df4.show()

When Pandas UDFs help

  • logic is easier in pandas
  • you need vectorized operations
  • performance matters and built-ins aren’t enough

Example 6: Pandas UDF for element-wise scoring

@pandas_udf("double")
def score_pandas(country: pd.Series, visits: pd.Series) -> pd.Series:
    bonus = country.eq("US").astype("float64") * 0.5 + 1.0
    visits_filled = visits.fillna(0).astype("float64")
    return visits_filled * bonus
 
df5 = df.withColumn("score", score_pandas(F.col("country"), F.col("visits")))
df5.show()

Practical best practices checklist

  • ✅ Try built-in functions first (F.lower, F.when, F.regexp_extract, etc.)
  • ✅ If custom logic is needed, consider Pandas UDF before regular UDF
  • ✅ Always specify correct return types
  • ✅ Handle nulls (None) explicitly
  • ✅ Keep UDF logic pure and deterministic (same input → same output)
  • ⚠️ Avoid UDFs in filters/joins when possible (they can reduce optimizations)
  • ⚠️ Measure: compare runtime with/without the UDF on a sample dataset (see the sketch below)
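A minimal timing sketch for that last point, comparing the Python UDF from Example 1 against the built-in version. Counting the new column's non-null values forces it to actually be computed; a bare .count() could let Spark skip the column entirely.

import time

# Time the Python UDF version.
start = time.time()
df.withColumn("name_clean", clean_name_udf(F.col("name"))) \
  .agg(F.count("name_clean")).collect()
print("python udf:", time.time() - start)

# Time the built-in version.
start = time.time()
df.withColumn("name_clean", F.lower(F.trim(F.col("name")))) \
  .agg(F.count("name_clean")).collect()
print("built-ins :", time.time() - start)

On this tiny demo DataFrame the numbers mean little; run the comparison on a realistic sample of your data.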

When you should use a regular Python UDF

Use a regular UDF when:

  • data size is not huge (or performance is not critical)
  • logic is hard to express with built-ins
  • pandas/Arrow isn’t available or doesn’t fit the transformation

If performance matters, reach for built-ins or Pandas UDFs first.