Pandas GroupBy: Aggregation, Transform, Apply (2025 Guide)

Name: Rajiv Chandra

Updated on 11/19/2025

Working with split-apply-combine patterns is where Pandas shines. The catch: new users often struggle with choosing between agg, transform, and apply, or get tripped up by sorting, dropped categories, or the dreaded SettingWithCopy side effects.

This guide uses the PAS approach:

Problem: Grouped calculations feel opaque and slow when you only need totals or per-row ratios.
Agitate: Misusing apply or forgetting as_index=False produces surprising shapes, duplicate columns, and sluggish code.
Solution: Use a small set of repeatable patterns—agg for summaries, transform for broadcasted metrics, apply only when necessary—plus options for sorting and missing groups.

Quick Reference: When to Use Each Method

Method	Returns shape	Best for	Example
`agg` / `aggregate`	One row per group	Summaries and multi-metric outputs	`df.groupby("team").agg(avg_score=("score","mean"))`
`transform`	Same length as input	Group-aware features to join back to rows	`df["z"] = df.groupby("team")["score"].transform("zscore")`
`apply`	Flexible	Complex reshaping when `agg`/`transform` fall short	`df.groupby("team").apply(custom_fn)`

Sample Data

import pandas as pd
 
data = {
    "team": ["A", "A", "B", "B", "B", "C"],
    "player": ["x1", "x2", "y1", "y2", "y3", "z1"],
    "score": [9, 7, 8, 6, 10, 3],
    "minutes": [30, 25, 28, 32, 20, 15],
}
df = pd.DataFrame(data)

Aggregation Patterns (`agg`)

summary = (
    df.groupby("team", as_index=False)
      .agg(
          avg_score=("score", "mean"),
          max_score=("score", "max"),
          minutes_played=("minutes", "sum"),
      )
)

Use named aggregation to keep tidy column names.
as_index=False keeps team as a column (handy for merges/plots).
sort=False preserves input order if you care about original grouping order.

Multi-aggregation on one column

df.groupby("team", as_index=False).agg(
    score_mean=("score", "mean"),
    score_std=("score", "std"),
    score_count=("score", "size"),
)

Broadcast Features (`transform`)

Transform keeps the original shape so you can attach group metrics back to each row.

df["score_pct_of_team"] = (
    df["score"] / df.groupby("team")["score"].transform("sum")
)

Another example: z-score per team.

df["score_z"] = (
    df.groupby("team")["score"]
      .transform(lambda s: (s - s.mean()) / s.std(ddof=0))
)

Use transform for:

Ratios relative to group totals
Normalized scores
Group-level flags (e.g., rank, cumcount)

When `apply` Is Worth It

apply is flexible but slower; reserve it for logic that cannot be expressed with agg or transform.

def top_n(group, n=2):
    return group.nlargest(n, "score")
 
top_players = df.groupby("team").apply(top_n, n=1).reset_index(drop=True)

Prefer agg/transform first; reach for apply when:

You need to return variable row counts per group.
The operation requires custom Python code not available as vectorized methods.

Handling Missing Groups and Sorting

result = (
    df.groupby("team", dropna=False, sort=False)
      .agg(avg_score=("score", "mean"))
)

dropna=False keeps NaN group labels (e.g., unknown teams).
sort=False avoids reordering groups—useful when order carries meaning (time, priority).

GroupBy With Multiple Keys and Indexes

multi = (
    df.groupby(["team", "player"], as_index=False)
      .agg(score_mean=("score", "mean"))
)

Grouping on index levels:

df2 = df.set_index(["team", "player"])
df2.groupby(level="team")["score"].mean()

Common Pitfalls and Fixes

Pitfall	Fix
Shape surprises (`Series` vs `DataFrame`)	Set `as_index=False` or use `.reset_index()` after aggregation.
Duplicate column names after merge	Use named aggregation to control output names.
Slow code from `apply`	Replace with `agg`/`transform` or vectorized methods when possible.
Missing categories in output	Set `observed=False` (default) for categorical keys or ensure categories exist; use `dropna=False` for `NaN`.

Export and Visualization Tips

After aggregation, sort for presentation: summary.sort_values("avg_score", ascending=False).
For charts, pivot the aggregated table: summary.pivot(index="team", values="avg_score").

Related Guides

Key Takeaways

Use agg for summaries, transform for per-row features, apply sparingly.
Control shape with as_index and reset_index.
Manage ordering and missing labels with sort and dropna.
Start with vectorized methods to keep pipelines fast.