Pandas GroupBy: Aggregation, Transform, Apply (2026 Guide)
Updated on
Working with split-apply-combine patterns is where Pandas shines. The catch: new users often struggle with choosing between agg, transform, and apply, or get tripped up by sorting, dropped categories, or the dreaded SettingWithCopy side effects.
Every data analysis project eventually needs grouped calculations -- total sales by region, average scores by class, daily user counts. Without groupby, you would write loops, filter manually, and glue results together. With it, a single chain handles the split, the calculation, and the reassembly.
This guide covers the three core methods (agg, transform, apply), shows when to use each, and walks through the most common pitfalls so you can write correct, fast grouped operations on the first try.
Quick Reference: When to Use Each Method
| Method | Returns shape | Best for | Example |
|---|---|---|---|
agg / aggregate | One row per group | Summaries and multi-metric outputs | df.groupby("team").agg(avg_score=("score","mean")) |
transform | Same length as input | Group-aware features to join back to rows | df["z"] = df.groupby("team")["score"].transform("zscore") |
apply | Flexible | Complex reshaping when agg/transform fall short | df.groupby("team").apply(custom_fn) |
Sample Data
import pandas as pd
data = {
"team": ["A", "A", "B", "B", "B", "C"],
"player": ["x1", "x2", "y1", "y2", "y3", "z1"],
"score": [9, 7, 8, 6, 10, 3],
"minutes": [30, 25, 28, 32, 20, 15],
}
df = pd.DataFrame(data)Aggregation Patterns (agg)
summary = (
df.groupby("team", as_index=False)
.agg(
avg_score=("score", "mean"),
max_score=("score", "max"),
minutes_played=("minutes", "sum"),
)
)- Use named aggregation to keep tidy column names.
as_index=Falsekeepsteamas a column (handy for merges/plots).sort=Falsepreserves input order if you care about original grouping order.
Multi-aggregation on one column
df.groupby("team", as_index=False).agg(
score_mean=("score", "mean"),
score_std=("score", "std"),
score_count=("score", "size"),
)Broadcast Features (transform)
Transform keeps the original shape so you can attach group metrics back to each row.
df["score_pct_of_team"] = (
df["score"] / df.groupby("team")["score"].transform("sum")
)Another example: z-score per team.
df["score_z"] = (
df.groupby("team")["score"]
.transform(lambda s: (s - s.mean()) / s.std(ddof=0))
)Use transform for:
- Ratios relative to group totals
- Normalized scores
- Group-level flags (e.g.,
rank,cumcount)
When apply Is Worth It
apply is flexible but slower; reserve it for logic that cannot be expressed with agg or transform.
def top_n(group, n=2):
return group.nlargest(n, "score")
top_players = df.groupby("team").apply(top_n, n=1).reset_index(drop=True)Prefer agg/transform first; reach for apply when:
- You need to return variable row counts per group.
- The operation requires custom Python code not available as vectorized methods.
Handling Missing Groups and Sorting
result = (
df.groupby("team", dropna=False, sort=False)
.agg(avg_score=("score", "mean"))
)dropna=FalsekeepsNaNgroup labels (e.g., unknown teams). You can also fill missing values before grouping to ensure cleaner groups.sort=Falseavoids reordering groups—useful when order carries meaning (time, priority).
GroupBy With Multiple Keys and Indexes
multi = (
df.groupby(["team", "player"], as_index=False)
.agg(score_mean=("score", "mean"))
)Grouping on index levels:
df2 = df.set_index(["team", "player"])
df2.groupby(level="team")["score"].mean()Common Pitfalls and Fixes
| Pitfall | Fix |
|---|---|
Shape surprises (Series vs DataFrame) | Set as_index=False or use .reset_index() after aggregation. |
| Duplicate column names after merge | Use named aggregation to control output names. |
Slow code from apply | Replace with agg/transform or vectorized methods when possible. |
| Missing categories in output | Set observed=False (default) for categorical keys or ensure categories exist; use dropna=False for NaN. |
Export and Visualization Tips
- After aggregation, sort for presentation:
summary.sort_values("avg_score", ascending=False). - For charts, pivot the aggregated table:
summary.pivot(index="team", values="avg_score").
Related Guides
- Pandas Apply: Custom Functions on DataFrames -- use apply() with groupby for flexible group transformations
- Pandas Concat: Combine DataFrames -- stack grouped results back together
- Pandas Sort Values -- sort aggregated results for presentation
- Pandas value_counts: Count Unique Values -- quick frequency analysis before grouping
- Pandas Pivot Table -- reshape grouped data into summary tables
- Pandas Merge and Join
- Pandas Pivot and Melt
- Pandas Data Cleaning
- Pandas Visualization
Visualize GroupBy Results with PyGWalker
After running groupby aggregations, you typically want to visualize the results -- bar charts of group totals, distributions within groups, or trend comparisons. Instead of writing separate matplotlib or seaborn code for each chart, you can use PyGWalker (opens in a new tab), an open-source Python library that turns any DataFrame into an interactive, Tableau-like visual exploration interface inside Jupyter Notebook.
import pandas as pd
import pygwalker as pyg
# Run your groupby aggregation
summary = df.groupby("team", as_index=False).agg(
avg_score=("score", "mean"),
total_minutes=("minutes", "sum"),
player_count=("player", "count"),
)
# Launch interactive visualization -- drag columns to axes
walker = pyg.walk(summary)With PyGWalker, you can drag team to the x-axis and avg_score to the y-axis to instantly create a bar chart comparing team performance -- no plotting code needed. This is especially powerful for exploring multi-level groupby results where you want to pivot between different metrics quickly.
Try PyGWalker in Google Colab (opens in a new tab) or install with
pip install pygwalker.
FAQ
What is the difference between agg and transform in pandas groupby?
agg() returns one row per group (a summary), while transform() returns a value for every row in the original DataFrame (same length as input). Use agg when you need totals, averages, or counts per group. Use transform when you need to attach a group-level metric back to each row, such as the percentage of a group total or a z-score within a group.
How do I groupby multiple columns in pandas?
Pass a list of column names to groupby(): df.groupby(["col_a", "col_b"]).agg(...). This creates groups based on unique combinations of both columns. Each unique pair of (col_a, col_b) values becomes a separate group.
Why does pandas groupby return a Series instead of a DataFrame?
This happens when you select a single column before aggregation, like df.groupby("team")["score"].mean(). To get a DataFrame, either use named aggregation (df.groupby("team", as_index=False).agg(avg_score=("score", "mean"))) or call .reset_index() on the result.
How do I iterate over pandas groupby groups?
Use a for loop: for name, group in df.groupby("team"):. The name variable holds the group key, and group is the subset DataFrame for that key. However, avoid iteration when possible -- agg, transform, and apply are faster and more idiomatic.
How do I filter groups in pandas groupby?
Use the .filter() method: df.groupby("team").filter(lambda g: g["score"].mean() > 5). This keeps all rows belonging to groups that satisfy the condition. Unlike agg, filter returns the original rows, not summaries.
Key Takeaways
- Use
aggfor summaries,transformfor per-row features,applysparingly. - Control shape with
as_indexandreset_index. - Manage ordering and missing labels with
sortanddropna. - Start with vectorized methods to keep pipelines fast.