PySpark UDF Tutorial (Beginner-Friendly): What, Why, and How to Use PySpark UDFs
PySpark is great at processing big data fast—until you write logic Spark doesn’t understand. That’s where a UDF (User Defined Function) comes in: it lets you run custom Python code on Spark DataFrames.
The catch: UDFs can also make Spark slower if used carelessly. This tutorial shows when to use UDFs, how to write them correctly, and how to pick faster alternatives (like built-in functions or Pandas UDFs).
The problem UDFs solve
You have a DataFrame column and want to apply custom logic:
- domain-specific parsing (weird IDs, custom rules)
- custom scoring functions
- text normalization not covered by built-in Spark SQL functions
- mapping values using a complex rule
Spark’s built-in functions are optimized in the JVM. But pure Python logic isn’t automatically “Spark-native”.
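To make that concrete, here is a small illustration (normalize is just a hypothetical helper, not one of the examples below). Built-in functions build Column expressions that Spark evaluates in the JVM; a plain Python function expects actual values, so Spark cannot apply it per row until you wrap it as a UDF.
from pyspark.sql import functions as F
def normalize(x):
    # Ordinary Python: operates on an actual string value.
    return None if x is None else x.strip().lower()
# Built-ins produce Column expressions, evaluated in the JVM:
cleaned_expr = F.lower(F.trim(F.col("name")))
# Calling the plain function on a Column does not run it per row; it receives
# a Column expression object instead of each row's string, and typically fails:
# normalize(F.col("name"))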
The solution options (fastest → slowest)
In practice, prefer this order:
1. Built-in Spark functions (pyspark.sql.functions) ✅ fastest
2. SQL expressions (expr, when, regexp_extract) ✅ fast
3. Pandas UDFs (vectorized with Apache Arrow) ✅ often fast
4. Regular Python UDFs (row-by-row Python) ⚠️ often slowest
This post focuses on (3) and (4), with strong guidance on when to avoid them.
Setup: a small DataFrame to play with
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, IntegerType, DoubleType
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        (" Alice ", "US", 10),
        ("bob", "UK", 3),
        (None, "US", 7),
    ],
    ["name", "country", "visits"]
)
df.show()
Example 1: A simple Python UDF (string cleaning)
Goal
Trim spaces, lowercase, and handle nulls safely.
from pyspark.sql.functions import udf
def clean_name(x: str) -> str:
    if x is None:
        return None
    return x.strip().lower()
clean_name_udf = udf(clean_name, StringType())
df2 = df.withColumn("name_clean", clean_name_udf(F.col("name")))
df2.show()
Notes
- You must provide a return type (StringType()).
- A Python UDF runs row-by-row in Python workers (can be slow on big data).
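If you prefer decorators, the same UDF can be declared with @udf directly. A minimal sketch reusing the imports above (clean_name_dec is just an illustrative name):
@udf(returnType=StringType())
def clean_name_dec(x: str) -> str:
    if x is None:
        return None
    return x.strip().lower()
df.withColumn("name_clean", clean_name_dec(F.col("name"))).show()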
Example 2: Prefer built-in functions when possible
The previous UDF can be replaced by built-ins (faster):
df_fast = df.withColumn(
    "name_clean",
    F.lower(F.trim(F.col("name")))
)
df_fast.show()
Rule of thumb: if you can express it with F.*, do that.
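Conditional logic usually stays in built-ins as well. A small sketch (the "tier" column is just an illustrative example, not part of the original dataset):
df_tier = df.withColumn(
    "tier",
    F.when(F.col("visits") >= 5, "frequent").otherwise("occasional")
)
df_tier.show()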
Example 3: UDF with multiple inputs (custom scoring)
Let’s create a score based on country + visits.
def score(country: str, visits: int) -> float:
    if country is None or visits is None:
        return 0.0
    bonus = 1.5 if country == "US" else 1.0
    return float(visits) * bonus
score_udf = udf(score, DoubleType())
df3 = df.withColumn("score", score_udf(F.col("country"), F.col("visits")))
df3.show()
Example 4: Register a UDF for Spark SQL
If you want to use it in SQL:
spark.udf.register("score_udf_sql", score, DoubleType())
df.createOrReplaceTempView("t")
spark.sql("""
    SELECT name, country, visits, score_udf_sql(country, visits) AS score
    FROM t
""").show()
Common mistakes (and how to avoid them)
1) Returning the wrong type
If Spark expects DoubleType() and you return a string sometimes, you’ll get runtime errors or nulls.
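For example (bad_udf is just an illustrative sketch on the DataFrame above): the declared type says double, but the function returns strings, so depending on the Spark version you get nulls or a runtime error instead of the values you expect.
# Declared DoubleType, but actually returns strings:
bad_udf = udf(lambda v: "high" if v is not None and v > 5 else "low", DoubleType())
df.withColumn("level", bad_udf(F.col("visits"))).show()  # "level" likely comes back as all nulls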
2) Using UDFs in joins/filters unnecessarily
UDFs can block Spark optimizations. Try to compute a derived column first, or use built-ins.
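A sketch of the difference, reusing clean_name_udf from Example 1: the UDF version is opaque to the Catalyst optimizer (for instance, the predicate generally cannot be pushed down to the data source), while the built-in version can be optimized.
df.filter(clean_name_udf(F.col("name")) == "alice")   # opaque to the optimizer
df.filter(F.lower(F.trim(F.col("name"))) == "alice")  # optimizer-friendly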
3) Forgetting null handling
Always assume inputs can be None.
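A quick illustration of what goes wrong (bad_upper is just a hypothetical example): without a None check, the job fails as soon as the null name reaches Python.
bad_upper = udf(lambda x: x.upper(), StringType())
# df.withColumn("u", bad_upper(F.col("name"))).show()
# fails at runtime: x is None for one row, and None has no .upper()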
Faster alternative: Pandas UDF (vectorized)
A Pandas UDF processes batches of data at once (vectorized), often much faster than regular UDFs.
Example 5: Pandas UDF for name cleaning
import pandas as pd
from pyspark.sql.functions import pandas_udf
@pandas_udf("string")
def clean_name_pandas(s: pd.Series) -> pd.Series:
    return s.str.strip().str.lower()
df4 = df.withColumn("name_clean", clean_name_pandas(F.col("name")))
df4.show()
When Pandas UDFs help
- logic is easier in pandas
- you need vectorized operations
- performance matters and built-ins aren’t enough
Example 6: Pandas UDF for element-wise scoring
@pandas_udf("double")
def score_pandas(country: pd.Series, visits: pd.Series) -> pd.Series:
    # bonus is 1.5 for "US" and 1.0 for everything else (including nulls)
    bonus = country.eq("US").astype("float64") * 0.5 + 1.0
    visits_filled = visits.fillna(0).astype("float64")
    return visits_filled * bonus
df5 = df.withColumn("score", score_pandas(F.col("country"), F.col("visits")))
df5.show()
Practical best practices checklist
- ✅ Try built-in functions first (F.lower, F.when, F.regexp_extract, etc.)
- ✅ If custom logic is needed, consider a Pandas UDF before a regular UDF
- ✅ Always specify correct return types
- ✅ Handle nulls (None) explicitly
- ✅ Keep UDF logic pure and deterministic (same input → same output)
- ⚠️ Avoid UDFs in filters/joins when possible (they can reduce optimizations)
- ⚠️ Measure: compare runtime with/without UDF on a sample dataset
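One rough way to do that measurement (timed is just an illustrative helper; on the tiny demo DataFrame the numbers are meaningless, so run it on a realistic sample of your own data):
import time
def timed(frame):
    # Force execution and return wall-clock seconds (a coarse, driver-side measure).
    start = time.perf_counter()
    frame.count()
    return time.perf_counter() - start
print("python udf:", timed(df.withColumn("name_clean", clean_name_udf(F.col("name")))))
print("built-in  :", timed(df.withColumn("name_clean", F.lower(F.trim(F.col("name"))))))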
When you should use a regular Python UDF
Use a regular UDF when:
- data size is not huge (or performance is not critical)
- logic is hard to express with built-ins
- pandas/Arrow isn’t available or doesn’t fit the transformation
If performance matters, reach for built-ins or Pandas UDFs first.