
PySpark UDF vs Pandas UDF vs mapInPandas: Which One Should You Use?

When you need custom logic in PySpark, you’ll typically reach for one of three tools:

  • Regular Python UDF (udf(...))
  • Pandas UDF (@pandas_udf)
  • mapInPandas (DataFrame → iterator of Pandas DataFrames)

They can all “run Python on Spark”, but they behave very differently in performance, flexibility, and how much of Spark’s optimization you keep.

This guide gives you a practical decision framework, plus examples you can copy.


Mental model: what changes between the three?

1) Regular UDF (row-by-row Python)

  • Spark serializes rows and ships them to Python worker processes.
  • Your function runs one row at a time.
  • Often the slowest.
  • Can block Spark’s optimizer and code generation.

Use it when: logic is simple, data is small, or speed doesn’t matter.


2) Pandas UDF (vectorized batches via Arrow)

  • Spark sends data to Python in columnar batches using Apache Arrow.
  • Your function runs on Pandas Series / DataFrames (vectorized).
  • Usually much faster than regular UDF.

Use it when: you need custom column logic and want better performance.


3) mapInPandas (full control per partition batch)

  • Spark calls your function once per partition, handing it an iterator of Pandas DataFrames (one per Arrow batch).
  • You can do multi-column logic, complex transformations, even row expansion.
  • Great for “mini ETL” steps in Python while still parallelized by Spark.

Use it when: you need complex transformations that don’t fit the “one column in → one column out” shape.


Quick decision table

  • Simple custom transform, low volume → Regular UDF
  • Column-wise transform, medium/large data → Pandas UDF
  • Complex logic (multiple columns, multiple output rows, joins inside pandas, heavy Python libs) → mapInPandas
  • Maximum performance if possible → Built-in Spark SQL functions (avoid all three)

Rule: Built-in Spark functions > Pandas UDF > mapInPandas > regular UDF (typical, not absolute).


Example dataset

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
 
spark = SparkSession.builder.getOrCreate()
 
df = spark.createDataFrame(
    [(" Alice  ", "US", 10),
     ("bob", "UK", 3),
     (None, "US", 7)],
    ["name", "country", "visits"]
)

Regular UDF example (simple but slow)

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
 
def clean_name(x):
    if x is None:
        return None
    return x.strip().lower()
 
clean_name_udf = udf(clean_name, StringType())
 
df_udf = df.withColumn("name_clean", clean_name_udf("name"))
df_udf.show()

Pros

  • Easiest to understand
  • Works everywhere

Cons

  • Python row-by-row overhead
  • Often prevents Spark from doing aggressive optimizations
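
Both cons disappear for this particular task, because trimming and lower-casing are built in. Here is a minimal sketch of the native version (it reuses df and the F alias from the dataset snippet above):

df_builtin = df.withColumn("name_clean", F.lower(F.trim("name")))  # stays in the JVM, fully optimizable
df_builtin.show()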

Pandas UDF example (vectorized, usually faster)

import pandas as pd
from pyspark.sql.functions import pandas_udf
 
@pandas_udf("string")
def clean_name_vec(s: pd.Series) -> pd.Series:
    return s.str.strip().str.lower()
 
df_pandas_udf = df.withColumn("name_clean", clean_name_vec("name"))
df_pandas_udf.show()

Pros

  • Batch processing + vectorization
  • Much better throughput on large columns

Cons

  • Requires Arrow support and compatible environment
  • Still Python-side, still not as optimizable as built-ins
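
On the Arrow point: pandas UDFs receive data in Arrow batches, and the batch size can be tuned if the default batches turn out too large or too small for your workload. A minimal sketch, assuming a Spark version that supports this config (the default is 10,000 rows per batch):

# cap the number of rows per Arrow batch handed to pandas UDFs
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000")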

mapInPandas example (most flexible)

Use case: output multiple derived columns + custom rules

Maybe you want:

  • cleaned name
  • score based on country and visits
  • label buckets

import pandas as pd
 
def transform(pdf_iter):
    for pdf in pdf_iter:
        pdf["name_clean"] = pdf["name"].astype("string").str.strip().str.lower()
        # do the math on a float copy, but keep the visits column itself long to match the schema
        visits = pdf["visits"].fillna(0).astype("float64")
        pdf["score"] = visits * pdf["country"].eq("US").astype("float64").add(1.0)  # US -> 2.0x, else 1.0x
        # cast the categorical result to string so it matches the declared string column
        pdf["bucket"] = pd.cut(visits, bins=[-1, 0, 5, float("inf")], labels=["none", "low", "high"]).astype("string")
        yield pdf
 
out_schema = "name string, country string, visits long, name_clean string, score double, bucket string"
 
df_map = df.mapInPandas(transform, schema=out_schema)
df_map.show()

Pros

  • Extremely flexible
  • Great for “per-partition pandas pipelines”
  • Can expand rows, compute multiple outputs, call external libs (carefully)

Cons

  • You must define schema correctly
  • More opportunity to accidentally create skew / large partitions
  • Still Python/Arrow overhead

What about performance?

You don’t need perfect benchmarks to choose correctly. Use these practical heuristics:

Regular UDF is usually worst when:

  • tens of millions of rows
  • simple transforms that Spark could do natively
  • used inside filters/joins
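
To make the filter point concrete, here is a sketch that reuses clean_name_udf and F from the earlier examples:

# the UDF version forces a round trip to Python for every row before filtering
df.filter(clean_name_udf("name") == "alice").show()
 
# the built-in version stays in the JVM and is far easier for the optimizer to work with
df.filter(F.lower(F.trim("name")) == "alice").show()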

Pandas UDF shines when:

  • you’re transforming one or more columns with vectorized operations
  • you can write logic with pandas/numpy efficiently

mapInPandas is best when:

  • you need multi-step transforms that would be painful in Spark SQL
  • you want to create multiple columns at once
  • you need row expansion or complex conditional logic
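
To illustrate the row-expansion point, here is a toy sketch (the "one output row per visit" rule is invented for the example) that yields more rows than it receives, reusing the example df:

def expand_rows(pdf_iter):
    for pdf in pdf_iter:
        # repeat each input row once per visit; yielding more rows than you received is fine
        repeated = pdf.loc[pdf.index.repeat(pdf["visits"].clip(lower=1))]
        yield repeated[["name", "country"]]
 
df_expanded = df.mapInPandas(expand_rows, schema="name string, country string")
df_expanded.show()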

Correctness and schema gotchas

Pandas UDF

  • Output must match declared type exactly.
  • Nulls/NaNs can appear; handle them explicitly.

mapInPandas

  • Output DataFrame must match schema: column names + dtypes + ordering.
  • Be careful with Python object dtype; cast to string/float explicitly.
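
A defensive pattern (just a sketch; the finalize helper is hypothetical, not part of the API) is to cast and reorder explicitly right before yielding, so the Arrow conversion never has to guess:

def finalize(pdf):
    # make dtypes and column order line up with the declared schema before yielding
    pdf["bucket"] = pdf["bucket"].astype("string")   # avoid categorical/object surprises
    pdf["score"] = pdf["score"].astype("float64")
    return pdf[["name", "country", "visits", "name_clean", "score", "bucket"]]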

The “avoid these” list (common anti-patterns)

  • Using regular UDF for basic string ops (lower, trim, regex) → use Spark built-ins.
  • Calling network APIs inside UDFs → will be slow, flaky, and hard to retry safely.
  • Heavy per-row Python loops inside Pandas UDF/mapInPandas → kills vectorization benefits.
  • Returning inconsistent types (sometimes int, sometimes string) → runtime failures / nulls.
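
On the per-row-loop point: the whole benefit of the pandas path is vectorization, so looping (or using .apply) over rows inside the batch gives most of it back. A quick sketch on a toy batch:

import pandas as pd
 
pdf = pd.DataFrame({"name": [" Alice  ", "bob", None]})
 
# slow: row-by-row Python inside the batch defeats vectorization
pdf["slow"] = pdf["name"].apply(lambda x: x.strip().lower() if x else x)
 
# fast: vectorized string ops over the whole batch
pdf["fast"] = pdf["name"].astype("string").str.strip().str.lower()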

Recommended decision flow

  1. Can Spark built-ins do it? → Use built-ins.

  2. Is it a column-wise transform (same number of rows in/out)? → Use Pandas UDF.

  3. Do you need multi-column logic, multi-output columns, or row expansion? → Use mapInPandas.

  4. Is it small data / quick prototype? → Regular UDF is acceptable.


A final tip: validate with Spark plans

Even without full benchmarking, you can learn a lot with:

df_pandas_udf.explain(True)

If the plan shows extra Python evaluation steps around your logic (regular UDFs typically surface as BatchEvalPython, pandas UDFs as ArrowEvalPython) and less optimization than before, that’s a clue to try built-ins or restructure.