Skip to content

Pandas GroupBy: Aggregation, Transform, Apply (2025 Guide)

Updated on

Working with split-apply-combine patterns is where Pandas shines. The catch: new users often struggle with choosing between agg, transform, and apply, or get tripped up by sorting, dropped categories, or the dreaded SettingWithCopy side effects.

This guide uses the PAS approach:

  • Problem: Grouped calculations feel opaque and slow when you only need totals or per-row ratios.
  • Agitate: Misusing apply or forgetting as_index=False produces surprising shapes, duplicate columns, and sluggish code.
  • Solution: Use a small set of repeatable patterns—agg for summaries, transform for broadcasted metrics, apply only when necessary—plus options for sorting and missing groups.

Quick Reference: When to Use Each Method

MethodReturns shapeBest forExample
agg / aggregateOne row per groupSummaries and multi-metric outputsdf.groupby("team").agg(avg_score=("score","mean"))
transformSame length as inputGroup-aware features to join back to rowsdf["z"] = df.groupby("team")["score"].transform("zscore")
applyFlexibleComplex reshaping when agg/transform fall shortdf.groupby("team").apply(custom_fn)

Sample Data

import pandas as pd
 
data = {
    "team": ["A", "A", "B", "B", "B", "C"],
    "player": ["x1", "x2", "y1", "y2", "y3", "z1"],
    "score": [9, 7, 8, 6, 10, 3],
    "minutes": [30, 25, 28, 32, 20, 15],
}
df = pd.DataFrame(data)

Aggregation Patterns (agg)

summary = (
    df.groupby("team", as_index=False)
      .agg(
          avg_score=("score", "mean"),
          max_score=("score", "max"),
          minutes_played=("minutes", "sum"),
      )
)
  • Use named aggregation to keep tidy column names.
  • as_index=False keeps team as a column (handy for merges/plots).
  • sort=False preserves input order if you care about original grouping order.

Multi-aggregation on one column

df.groupby("team", as_index=False).agg(
    score_mean=("score", "mean"),
    score_std=("score", "std"),
    score_count=("score", "size"),
)

Broadcast Features (transform)

Transform keeps the original shape so you can attach group metrics back to each row.

df["score_pct_of_team"] = (
    df["score"] / df.groupby("team")["score"].transform("sum")
)

Another example: z-score per team.

df["score_z"] = (
    df.groupby("team")["score"]
      .transform(lambda s: (s - s.mean()) / s.std(ddof=0))
)

Use transform for:

  • Ratios relative to group totals
  • Normalized scores
  • Group-level flags (e.g., rank, cumcount)

When apply Is Worth It

apply is flexible but slower; reserve it for logic that cannot be expressed with agg or transform.

def top_n(group, n=2):
    return group.nlargest(n, "score")
 
top_players = df.groupby("team").apply(top_n, n=1).reset_index(drop=True)

Prefer agg/transform first; reach for apply when:

  • You need to return variable row counts per group.
  • The operation requires custom Python code not available as vectorized methods.

Handling Missing Groups and Sorting

result = (
    df.groupby("team", dropna=False, sort=False)
      .agg(avg_score=("score", "mean"))
)
  • dropna=False keeps NaN group labels (e.g., unknown teams).
  • sort=False avoids reordering groups—useful when order carries meaning (time, priority).

GroupBy With Multiple Keys and Indexes

multi = (
    df.groupby(["team", "player"], as_index=False)
      .agg(score_mean=("score", "mean"))
)

Grouping on index levels:

df2 = df.set_index(["team", "player"])
df2.groupby(level="team")["score"].mean()

Common Pitfalls and Fixes

PitfallFix
Shape surprises (Series vs DataFrame)Set as_index=False or use .reset_index() after aggregation.
Duplicate column names after mergeUse named aggregation to control output names.
Slow code from applyReplace with agg/transform or vectorized methods when possible.
Missing categories in outputSet observed=False (default) for categorical keys or ensure categories exist; use dropna=False for NaN.

Export and Visualization Tips

  • After aggregation, sort for presentation: summary.sort_values("avg_score", ascending=False).
  • For charts, pivot the aggregated table: summary.pivot(index="team", values="avg_score").

Related Guides


Key Takeaways

  • Use agg for summaries, transform for per-row features, apply sparingly.
  • Control shape with as_index and reset_index.
  • Manage ordering and missing labels with sort and dropna.
  • Start with vectorized methods to keep pipelines fast.