PySpark groupBy and Aggregation: Accurate Summaries at Scale
Reporting breaks when aggregates double-count, skip null groups, or hide cardinality issues. PySpark’s groupBy and agg keep rollups accurate, but only when the right functions and aliases are chosen.
This guide shows dependable aggregation patterns: multi-metric calculations, distinct counting options, handling null groups, and ordering results for downstream use.
Want to quickly create data visualizations from a Python pandas DataFrame with no code?
PyGWalker is a Python library for exploratory data analysis with visualization. PyGWalker can simplify your Jupyter Notebook data analysis and visualization workflow by turning your pandas DataFrame (and Polars DataFrame) into a Tableau-alternative user interface for visual exploration.
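As a quick taste, here is a minimal sketch of launching PyGWalker on a pandas DataFrame inside a notebook; the sample data is illustrative, and it assumes pygwalker is installed.

```python
import pandas as pd
import pygwalker as pyg

# Illustrative sample data; any pandas DataFrame works.
df = pd.DataFrame({"date": ["2025-11-01", "2025-11-02"], "revenue": [200.0, 130.0]})

# Opens the drag-and-drop exploration UI in the notebook output cell.
walker = pyg.walk(df)
```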
Aggregation choices
| Need | Function | Notes |
|---|---|---|
| Counts (includes nulls) | count("*") | Fast; counts rows |
| Distinct count | countDistinct(col) | Exact but heavier |
| Approx distinct | approx_count_distinct(col, rsd) | Tunable error; faster |
| Sum/avg with null safety | sum, avg | Ignore nulls by default |
| Multi-metric in one pass | agg with aliases | Keep names stable |
Setup
```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("groupby-agg").getOrCreate()

sales = spark.createDataFrame(
    [
        ("2025-11-01", "US", "mobile", 120.0, "A123"),
        ("2025-11-01", "US", "web", 80.0, "A124"),
        ("2025-11-01", "CA", "mobile", None, "A125"),
        ("2025-11-02", "US", "mobile", 200.0, "A126"),
        ("2025-11-02", None, "web", 50.0, "A127"),
    ],
    "date STRING, country STRING, channel STRING, revenue DOUBLE, order_id STRING",
)
```
Multi-aggregation with aliases
```python
daily = (
    sales.groupBy("date", "country")
    .agg(
        F.count("*").alias("orders"),
        F.sum("revenue").alias("revenue_sum"),
        F.avg("revenue").alias("revenue_avg"),
    )
)
```
- Aliases keep downstream column names stable.
- Null `revenue` rows are ignored by `sum`/`avg`.
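To see the null handling concretely, counting a specific column skips its nulls while counting `"*"` counts every row. A small sketch reusing the sales DataFrame from the Setup section:

```python
null_check = (
    sales.groupBy("date")
    .agg(
        F.count("*").alias("rows_total"),           # every row, including null revenue
        F.count("revenue").alias("rows_with_rev"),  # only rows where revenue is not null
        F.sum("revenue").alias("revenue_sum"),      # nulls are skipped, not treated as 0
    )
)
```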
Distinct and approximate counts
```python
unique_orders = (
    sales.groupBy("date")
    .agg(
        F.countDistinct("order_id").alias("orders_exact"),
        F.approx_count_distinct("order_id", 0.02).alias("orders_approx"),
    )
)
```
- Use `approx_count_distinct` when performance matters and a small error is acceptable.
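The second argument to `approx_count_distinct` is the maximum relative standard deviation (rsd); a smaller value is more accurate but uses more memory. A sketch comparing an exact count with two illustrative rsd settings:

```python
accuracy_check = sales.agg(
    F.countDistinct("order_id").alias("exact"),
    F.approx_count_distinct("order_id", 0.05).alias("approx_loose"),  # looser bound, cheaper
    F.approx_count_distinct("order_id", 0.01).alias("approx_tight"),  # tighter bound, heavier
)
```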
Filtering aggregates with having-like logic
```python
high_rev = daily.where(F.col("revenue_sum") > 150)
```
- Filter after aggregation to mimic SQL `HAVING`.
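The same rollup plus filter can also be written in Spark SQL, where `HAVING` applies directly; a minimal sketch, with the temp view name being an arbitrary choice:

```python
sales.createOrReplaceTempView("sales")  # view name is an illustrative choice

high_rev_sql = spark.sql("""
    SELECT date, country, SUM(revenue) AS revenue_sum
    FROM sales
    GROUP BY date, country
    HAVING SUM(revenue) > 150
""")
```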
Ordering results
```python
ranked = daily.orderBy(F.col("revenue_sum").desc(), "date")
```
- Sort after aggregation; Spark will not guarantee order without `orderBy`.
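A group whose every revenue is null ends up with a null `revenue_sum`, and its position depends on sort direction (Spark defaults to nulls first for ascending and nulls last for descending). The nulls-aware column methods make the placement explicit; a small sketch:

```python
ranked_explicit = daily.orderBy(
    F.col("revenue_sum").desc_nulls_last(),  # explicitly keep null sums at the end
    "date",
)
```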
Handling null group keys
```python
null_country_stats = (
    sales.groupBy("country")
    .agg(F.count("*").alias("rows"))
)
```
- `groupBy` keeps null keys; decide whether to impute (`fillna`) before aggregation.
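If the null country should instead roll up under an explicit label, fill it before grouping; a sketch using a placeholder label:

```python
labeled_country_stats = (
    sales.fillna({"country": "UNKNOWN"})  # "UNKNOWN" is an illustrative placeholder
    .groupBy("country")
    .agg(F.count("*").alias("rows"))
)
```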
Pivots for simple cross-tabs
```python
pivoted = (
    sales.groupBy("date")
    .pivot("channel", ["mobile", "web"])
    .agg(F.sum("revenue"))
)
```
- Constrain pivot values to avoid exploding columns.
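Pivot cells with no matching rows come back as null; whether to keep them null or treat them as zero revenue is a data decision. A small sketch of the zero-fill option:

```python
pivoted_filled = pivoted.fillna(0, subset=["mobile", "web"])  # treat missing channel revenue as 0
```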
Minimal end-to-end rollup
```python
summary = (
    sales
    .groupBy("date", "country")
    .agg(
        F.count("*").alias("orders"),
        F.sum("revenue").alias("revenue_sum"),
        F.countDistinct("order_id").alias("unique_orders"),
    )
    .orderBy("date", "country")
)
```
This produces stable daily metrics, preserves null country handling decisions, and keeps column names ready for downstream joins or exports.
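For the export step, a minimal sketch that writes the rollup to Parquet; the output path is a placeholder to replace with your own destination:

```python
(
    summary
    .coalesce(1)                    # optional: small result, a single output file is convenient
    .write.mode("overwrite")
    .parquet("/tmp/daily_summary")  # placeholder path
)
```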
