Skip to content

Pandas GroupBy: Aggregation, Transform, Apply (2026 Guide)

Updated on

Working with split-apply-combine patterns is where Pandas shines. The catch: new users often struggle with choosing between agg, transform, and apply, or get tripped up by sorting, dropped categories, or the dreaded SettingWithCopy side effects.

Every data analysis project eventually needs grouped calculations -- total sales by region, average scores by class, daily user counts. Without groupby, you would write loops, filter manually, and glue results together. With it, a single chain handles the split, the calculation, and the reassembly.

This guide covers the three core methods (agg, transform, apply), shows when to use each, and walks through the most common pitfalls so you can write correct, fast grouped operations on the first try.

📚

Quick Reference: When to Use Each Method

MethodReturns shapeBest forExample
agg / aggregateOne row per groupSummaries and multi-metric outputsdf.groupby("team").agg(avg_score=("score","mean"))
transformSame length as inputGroup-aware features to join back to rowsdf["z"] = df.groupby("team")["score"].transform("zscore")
applyFlexibleComplex reshaping when agg/transform fall shortdf.groupby("team").apply(custom_fn)

Sample Data

import pandas as pd
 
data = {
    "team": ["A", "A", "B", "B", "B", "C"],
    "player": ["x1", "x2", "y1", "y2", "y3", "z1"],
    "score": [9, 7, 8, 6, 10, 3],
    "minutes": [30, 25, 28, 32, 20, 15],
}
df = pd.DataFrame(data)

Aggregation Patterns (agg)

summary = (
    df.groupby("team", as_index=False)
      .agg(
          avg_score=("score", "mean"),
          max_score=("score", "max"),
          minutes_played=("minutes", "sum"),
      )
)
  • Use named aggregation to keep tidy column names.
  • as_index=False keeps team as a column (handy for merges/plots).
  • sort=False preserves input order if you care about original grouping order.

Multi-aggregation on one column

df.groupby("team", as_index=False).agg(
    score_mean=("score", "mean"),
    score_std=("score", "std"),
    score_count=("score", "size"),
)

Broadcast Features (transform)

Transform keeps the original shape so you can attach group metrics back to each row.

df["score_pct_of_team"] = (
    df["score"] / df.groupby("team")["score"].transform("sum")
)

Another example: z-score per team.

df["score_z"] = (
    df.groupby("team")["score"]
      .transform(lambda s: (s - s.mean()) / s.std(ddof=0))
)

Use transform for:

  • Ratios relative to group totals
  • Normalized scores
  • Group-level flags (e.g., rank, cumcount)

When apply Is Worth It

apply is flexible but slower; reserve it for logic that cannot be expressed with agg or transform.

def top_n(group, n=2):
    return group.nlargest(n, "score")
 
top_players = df.groupby("team").apply(top_n, n=1).reset_index(drop=True)

Prefer agg/transform first; reach for apply when:

  • You need to return variable row counts per group.
  • The operation requires custom Python code not available as vectorized methods.

Handling Missing Groups and Sorting

result = (
    df.groupby("team", dropna=False, sort=False)
      .agg(avg_score=("score", "mean"))
)
  • dropna=False keeps NaN group labels (e.g., unknown teams). You can also fill missing values before grouping to ensure cleaner groups.
  • sort=False avoids reordering groups—useful when order carries meaning (time, priority).

GroupBy With Multiple Keys and Indexes

multi = (
    df.groupby(["team", "player"], as_index=False)
      .agg(score_mean=("score", "mean"))
)

Grouping on index levels:

df2 = df.set_index(["team", "player"])
df2.groupby(level="team")["score"].mean()

Common Pitfalls and Fixes

PitfallFix
Shape surprises (Series vs DataFrame)Set as_index=False or use .reset_index() after aggregation.
Duplicate column names after mergeUse named aggregation to control output names.
Slow code from applyReplace with agg/transform or vectorized methods when possible.
Missing categories in outputSet observed=False (default) for categorical keys or ensure categories exist; use dropna=False for NaN.

Export and Visualization Tips


Related Guides


Visualize GroupBy Results with PyGWalker

After running groupby aggregations, you typically want to visualize the results -- bar charts of group totals, distributions within groups, or trend comparisons. Instead of writing separate matplotlib or seaborn code for each chart, you can use PyGWalker (opens in a new tab), an open-source Python library that turns any DataFrame into an interactive, Tableau-like visual exploration interface inside Jupyter Notebook.

import pandas as pd
import pygwalker as pyg
 
# Run your groupby aggregation
summary = df.groupby("team", as_index=False).agg(
    avg_score=("score", "mean"),
    total_minutes=("minutes", "sum"),
    player_count=("player", "count"),
)
 
# Launch interactive visualization -- drag columns to axes
walker = pyg.walk(summary)

With PyGWalker, you can drag team to the x-axis and avg_score to the y-axis to instantly create a bar chart comparing team performance -- no plotting code needed. This is especially powerful for exploring multi-level groupby results where you want to pivot between different metrics quickly.

Try PyGWalker in Google Colab (opens in a new tab) or install with pip install pygwalker.


FAQ

What is the difference between agg and transform in pandas groupby?

agg() returns one row per group (a summary), while transform() returns a value for every row in the original DataFrame (same length as input). Use agg when you need totals, averages, or counts per group. Use transform when you need to attach a group-level metric back to each row, such as the percentage of a group total or a z-score within a group.

How do I groupby multiple columns in pandas?

Pass a list of column names to groupby(): df.groupby(["col_a", "col_b"]).agg(...). This creates groups based on unique combinations of both columns. Each unique pair of (col_a, col_b) values becomes a separate group.

Why does pandas groupby return a Series instead of a DataFrame?

This happens when you select a single column before aggregation, like df.groupby("team")["score"].mean(). To get a DataFrame, either use named aggregation (df.groupby("team", as_index=False).agg(avg_score=("score", "mean"))) or call .reset_index() on the result.

How do I iterate over pandas groupby groups?

Use a for loop: for name, group in df.groupby("team"):. The name variable holds the group key, and group is the subset DataFrame for that key. However, avoid iteration when possible -- agg, transform, and apply are faster and more idiomatic.

How do I filter groups in pandas groupby?

Use the .filter() method: df.groupby("team").filter(lambda g: g["score"].mean() > 5). This keeps all rows belonging to groups that satisfy the condition. Unlike agg, filter returns the original rows, not summaries.

Key Takeaways

  • Use agg for summaries, transform for per-row features, apply sparingly.
  • Control shape with as_index and reset_index.
  • Manage ordering and missing labels with sort and dropna.
  • Start with vectorized methods to keep pipelines fast.
📚