Pandas GroupBy: Aggregation, Transform, Apply (2025 Guide)
Updated on
Working with split-apply-combine patterns is where Pandas shines. The catch: new users often struggle with choosing between agg, transform, and apply, or get tripped up by sorting, dropped categories, or the dreaded SettingWithCopy side effects.
This guide uses the PAS approach:
- Problem: Grouped calculations feel opaque and slow when you only need totals or per-row ratios.
- Agitate: Misusing
applyor forgettingas_index=Falseproduces surprising shapes, duplicate columns, and sluggish code. - Solution: Use a small set of repeatable patterns—
aggfor summaries,transformfor broadcasted metrics,applyonly when necessary—plus options for sorting and missing groups.
Quick Reference: When to Use Each Method
| Method | Returns shape | Best for | Example |
|---|---|---|---|
agg / aggregate | One row per group | Summaries and multi-metric outputs | df.groupby("team").agg(avg_score=("score","mean")) |
transform | Same length as input | Group-aware features to join back to rows | df["z"] = df.groupby("team")["score"].transform("zscore") |
apply | Flexible | Complex reshaping when agg/transform fall short | df.groupby("team").apply(custom_fn) |
Sample Data
import pandas as pd
data = {
"team": ["A", "A", "B", "B", "B", "C"],
"player": ["x1", "x2", "y1", "y2", "y3", "z1"],
"score": [9, 7, 8, 6, 10, 3],
"minutes": [30, 25, 28, 32, 20, 15],
}
df = pd.DataFrame(data)Aggregation Patterns (agg)
summary = (
df.groupby("team", as_index=False)
.agg(
avg_score=("score", "mean"),
max_score=("score", "max"),
minutes_played=("minutes", "sum"),
)
)- Use named aggregation to keep tidy column names.
as_index=Falsekeepsteamas a column (handy for merges/plots).sort=Falsepreserves input order if you care about original grouping order.
Multi-aggregation on one column
df.groupby("team", as_index=False).agg(
score_mean=("score", "mean"),
score_std=("score", "std"),
score_count=("score", "size"),
)Broadcast Features (transform)
Transform keeps the original shape so you can attach group metrics back to each row.
df["score_pct_of_team"] = (
df["score"] / df.groupby("team")["score"].transform("sum")
)Another example: z-score per team.
df["score_z"] = (
df.groupby("team")["score"]
.transform(lambda s: (s - s.mean()) / s.std(ddof=0))
)Use transform for:
- Ratios relative to group totals
- Normalized scores
- Group-level flags (e.g.,
rank,cumcount)
When apply Is Worth It
apply is flexible but slower; reserve it for logic that cannot be expressed with agg or transform.
def top_n(group, n=2):
return group.nlargest(n, "score")
top_players = df.groupby("team").apply(top_n, n=1).reset_index(drop=True)Prefer agg/transform first; reach for apply when:
- You need to return variable row counts per group.
- The operation requires custom Python code not available as vectorized methods.
Handling Missing Groups and Sorting
result = (
df.groupby("team", dropna=False, sort=False)
.agg(avg_score=("score", "mean"))
)dropna=FalsekeepsNaNgroup labels (e.g., unknown teams).sort=Falseavoids reordering groups—useful when order carries meaning (time, priority).
GroupBy With Multiple Keys and Indexes
multi = (
df.groupby(["team", "player"], as_index=False)
.agg(score_mean=("score", "mean"))
)Grouping on index levels:
df2 = df.set_index(["team", "player"])
df2.groupby(level="team")["score"].mean()Common Pitfalls and Fixes
| Pitfall | Fix |
|---|---|
Shape surprises (Series vs DataFrame) | Set as_index=False or use .reset_index() after aggregation. |
| Duplicate column names after merge | Use named aggregation to control output names. |
Slow code from apply | Replace with agg/transform or vectorized methods when possible. |
| Missing categories in output | Set observed=False (default) for categorical keys or ensure categories exist; use dropna=False for NaN. |
Export and Visualization Tips
- After aggregation, sort for presentation:
summary.sort_values("avg_score", ascending=False). - For charts, pivot the aggregated table:
summary.pivot(index="team", values="avg_score").
Related Guides
Key Takeaways
- Use
aggfor summaries,transformfor per-row features,applysparingly. - Control shape with
as_indexandreset_index. - Manage ordering and missing labels with
sortanddropna. - Start with vectorized methods to keep pipelines fast.