
PySpark UDF vs Pandas UDF vs mapInPandas: Which One Should You Use?

When you need custom logic in PySpark, you’ll typically reach for one of three tools:

  • Regular Python UDF (udf(...))
  • Pandas UDF (@pandas_udf)
  • mapInPandas (DataFrame → iterator of Pandas DataFrames)

They can all “run Python on Spark”, but they behave very differently in performance, flexibility, and how much of Spark’s optimization you keep.

This guide gives you a practical decision framework, plus examples you can copy.


Mental model: what changes between the three?

1) Regular UDF (row-by-row Python)

  • Spark serializes rows and ships them to Python worker processes.
  • Your function runs one row at a time.
  • Often the slowest.
  • Can block Spark’s optimizer and code generation.

Use it when: logic is simple, data is small, or speed doesn’t matter.


2) Pandas UDF (vectorized batches via Arrow)

  • Spark sends data to Python in columnar batches using Apache Arrow.
  • Your function runs on Pandas Series / DataFrames (vectorized).
  • Usually much faster than regular UDF.

Use it when: you need custom column logic and want better performance.


3) mapInPandas (full control per partition batch)

  • Spark calls your function once per partition, handing it an iterator of Pandas DataFrames (one per Arrow batch).
  • You can do multi-column logic, complex transformations, even row expansion.
  • Great for “mini ETL” steps in Python while still parallelized by Spark.

Use it when: you need complex transformations that don’t fit the “one column in → one column out” shape.


Quick decision table

  • Simple custom transform, low volume → Regular UDF
  • Column-wise transform, medium/large data → Pandas UDF
  • Complex logic (multiple columns, multiple output rows, joins inside pandas, heavy Python libs) → mapInPandas
  • Maximum performance if possible → Built-in Spark SQL functions (avoid all three)

Rule: Built-in Spark functions > Pandas UDF > mapInPandas > regular UDF (typical, not absolute).


Example dataset

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
 
spark = SparkSession.builder.getOrCreate()
 
df = spark.createDataFrame(
    [(" Alice  ", "US", 10),
     ("bob", "UK", 3),
     (None, "US", 7)],
    ["name", "country", "visits"]
)

Regular UDF example (simple but slow)

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
 
def clean_name(x):
    if x is None:
        return None
    return x.strip().lower()
 
clean_name_udf = udf(clean_name, StringType())
 
df_udf = df.withColumn("name_clean", clean_name_udf("name"))
df_udf.show()

Pros

  • Easiest to understand
  • Works everywhere

Cons

  • Python row-by-row overhead
  • Often prevents Spark from doing aggressive optimizations
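
Both cons disappear for this particular task, because trimming and lower-casing are built in. Here is a minimal sketch of the native version (it reuses df and the F alias from the dataset snippet above):

df_builtin = df.withColumn("name_clean", F.lower(F.trim("name")))  # stays in the JVM, fully optimizable
df_builtin.show()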

Pandas UDF example (vectorized, usually faster)

import pandas as pd
from pyspark.sql.functions import pandas_udf
 
@pandas_udf("string")
def clean_name_vec(s: pd.Series) -> pd.Series:
    return s.str.strip().str.lower()
 
df_pandas_udf = df.withColumn("name_clean", clean_name_vec("name"))
df_pandas_udf.show()

Pros

  • Batch processing + vectorization
  • Much better throughput on large columns

Cons

  • Requires Arrow support and compatible environment
  • Still Python-side, still not as optimizable as built-ins
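
On the Arrow point: pandas UDFs receive data in Arrow batches, and the batch size can be tuned if the default batches turn out too large or too small for your workload. A minimal sketch, assuming a Spark version that supports this config (the default is 10,000 rows per batch):

# cap the number of rows per Arrow batch handed to pandas UDFs
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000")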

mapInPandas example (most flexible)

Use case: output multiple derived columns + custom rules

Maybe you want:

  • cleaned name
  • score based on country and visits
  • label buckets

import pandas as pd
 
def transform(pdf_iter):
    for pdf in pdf_iter:
        pdf["name_clean"] = pdf["name"].astype("string").str.strip().str.lower()
        # do the math on a float copy, but keep the visits column itself long to match the schema
        visits = pdf["visits"].fillna(0).astype("float64")
        pdf["score"] = visits * pdf["country"].eq("US").astype("float64").add(1.0)  # US -> 2.0x, else 1.0x
        # cast the categorical result to string so it matches the declared string column
        pdf["bucket"] = pd.cut(visits, bins=[-1, 0, 5, float("inf")], labels=["none", "low", "high"]).astype("string")
        yield pdf
 
out_schema = "name string, country string, visits long, name_clean string, score double, bucket string"
 
df_map = df.mapInPandas(transform, schema=out_schema)
df_map.show()

Pros

  • Extremely flexible
  • Great for “per-partition pandas pipelines”
  • Can expand rows, compute multiple outputs, call external libs (carefully)

Cons

  • You must define schema correctly
  • More opportunity to accidentally create skew / large partitions
  • Still Python/Arrow overhead

What about performance?

You don’t need perfect benchmarks to choose correctly. Use these practical heuristics:

Regular UDF is usually worst when:

  • tens of millions of rows
  • simple transforms that Spark could do natively
  • used inside filters/joins
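
To make the filter point concrete, here is a sketch that reuses clean_name_udf and F from the earlier examples:

# the UDF version forces a round trip to Python for every row before filtering
df.filter(clean_name_udf("name") == "alice").show()
 
# the built-in version stays in the JVM and is far easier for the optimizer to work with
df.filter(F.lower(F.trim("name")) == "alice").show()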

Pandas UDF shines when:

  • you’re transforming one or more columns with vectorized operations
  • you can write logic with pandas/numpy efficiently

mapInPandas is best when:

  • you need multi-step transforms that would be painful in Spark SQL
  • you want to create multiple columns at once
  • you need row expansion or complex conditional logic
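
To illustrate the row-expansion point, here is a toy sketch (the "one output row per visit" rule is invented for the example) that yields more rows than it receives, reusing the example df:

def expand_rows(pdf_iter):
    for pdf in pdf_iter:
        # repeat each input row once per visit; yielding more rows than you received is fine
        repeated = pdf.loc[pdf.index.repeat(pdf["visits"].clip(lower=1))]
        yield repeated[["name", "country"]]
 
df_expanded = df.mapInPandas(expand_rows, schema="name string, country string")
df_expanded.show()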

Correctness and schema gotchas

Pandas UDF

  • Output must match declared type exactly.
  • Nulls/NaNs can appear; handle them explicitly.

mapInPandas

  • Output DataFrame must match schema: column names + dtypes + ordering.
  • Be careful with Python object dtype; cast to string/float explicitly.
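
A defensive pattern (just a sketch; the finalize helper is hypothetical, not part of the API) is to cast and reorder explicitly right before yielding, so the Arrow conversion never has to guess:

def finalize(pdf):
    # make dtypes and column order line up with the declared schema before yielding
    pdf["bucket"] = pdf["bucket"].astype("string")   # avoid categorical/object surprises
    pdf["score"] = pdf["score"].astype("float64")
    return pdf[["name", "country", "visits", "name_clean", "score", "bucket"]]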

The “avoid these” list (common anti-patterns)

  • Using regular UDF for basic string ops (lower, trim, regex) → use Spark built-ins.
  • Calling network APIs inside UDFs → will be slow, flaky, and hard to retry safely.
  • Heavy per-row Python loops inside Pandas UDF/mapInPandas → kills vectorization benefits.
  • Returning inconsistent types (sometimes int, sometimes string) → runtime failures / nulls.
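
On the per-row-loop point: the whole benefit of the pandas path is vectorization, so looping (or using .apply) over rows inside the batch gives most of it back. A quick sketch on a toy batch:

import pandas as pd
 
pdf = pd.DataFrame({"name": [" Alice  ", "bob", None]})
 
# slow: row-by-row Python inside the batch defeats vectorization
pdf["slow"] = pdf["name"].apply(lambda x: x.strip().lower() if x else x)
 
# fast: vectorized string ops over the whole batch
pdf["fast"] = pdf["name"].astype("string").str.strip().str.lower()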

Recommended decision flow

  1. Can Spark built-ins do it? → Use built-ins.

  2. Is it a column-wise transform (same number of rows in/out)? → Use Pandas UDF.

  3. Do you need multi-column logic, multi-output columns, or row expansion? → Use mapInPandas.

  4. Is it small data / quick prototype? → Regular UDF is acceptable.


A final tip: validate with Spark plans

Even without full benchmarking, you can learn a lot with:

df_pandas_udf.explain(True)

If the plan shows extra Python evaluation steps around your logic (regular UDFs typically surface as BatchEvalPython, pandas UDFs as ArrowEvalPython) and less optimization than before, that’s a clue to try built-ins or restructure.