PySpark groupBy and Aggregation: Accurate Summaries at Scale

Reporting breaks when aggregates double-count, skip null groups, or hide cardinality issues. PySpark’s groupBy and agg keep rollups accurate, but only when the right functions and aliases are chosen.

This guide shows dependable aggregation patterns: multi-metric calculations, distinct counting options, handling null groups, and ordering results for downstream use.

Want to quickly create data visualizations from a Python Pandas DataFrame with no code?

PyGWalker is a Python library for exploratory data analysis with visualization. PyGWalker can simplify your Jupyter Notebook analysis and visualization workflow by turning your pandas DataFrame (or Polars DataFrame) into a Tableau-alternative user interface for visual exploration.

PyGWalker for data visualization
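
If you want to try it, here is a minimal sketch (assuming PyGWalker is installed, e.g. via pip install pygwalker; the sample DataFrame is made up for illustration):

import pandas as pd
import pygwalker as pyg
 
# Hypothetical sample data; substitute your own pandas DataFrame.
df = pd.DataFrame({"country": ["US", "CA", "US"], "revenue": [120.0, 95.5, 200.0]})
 
# pyg.walk renders the drag-and-drop exploration UI in the notebook output.
walker = pyg.walk(df)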

Aggregation choices

Need | Function | Notes
Counts (includes nulls) | count("*") | Fast; counts rows
Distinct count | countDistinct(col) | Exact but heavier
Approx distinct | approx_count_distinct(col, rsd) | Tunable error; faster
Sum/avg with null safety | sum, avg | Ignore nulls by default
Multi-metric in one pass | agg with aliases | Keep names stable

Setup

from pyspark.sql import SparkSession, functions as F
 
spark = SparkSession.builder.appName("groupby-agg").getOrCreate()
 
sales = spark.createDataFrame(
    [
        ("2025-11-01", "US", "mobile", 120.0, "A123"),
        ("2025-11-01", "US", "web",    80.0,  "A124"),
        ("2025-11-01", "CA", "mobile", None,  "A125"),
        ("2025-11-02", "US", "mobile", 200.0, "A126"),
        ("2025-11-02", None, "web",    50.0,  "A127"),
    ],
    "date STRING, country STRING, channel STRING, revenue DOUBLE, order_id STRING",
)

Multi-aggregation with aliases

daily = (
    sales.groupBy("date", "country")
    .agg(
        F.count("*").alias("orders"),
        F.sum("revenue").alias("revenue_sum"),
        F.avg("revenue").alias("revenue_avg"),
    )
)
  • Aliases keep downstream column names stable.
  • Null revenue rows are ignored by sum/avg.
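
To make that null handling visible, a short sketch (column names here are illustrative) that counts rows with and without revenue next to the sum:

null_check = (
    sales.groupBy("date", "country")
    .agg(
        F.count("*").alias("rows_all"),                 # every row, null revenue included
        F.count("revenue").alias("rows_with_revenue"),  # non-null revenue only
        F.sum("revenue").alias("revenue_sum"),          # nulls skipped, not treated as 0
    )
)

If a null should count as zero revenue instead, wrap the column before summing, e.g. F.sum(F.coalesce(F.col("revenue"), F.lit(0.0))).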

Distinct and approximate counts

unique_orders = (
    sales.groupBy("date")
    .agg(
        F.countDistinct("order_id").alias("orders_exact"),
        F.approx_count_distinct("order_id", 0.02).alias("orders_approx"),
    )
)
  • Use approx_count_distinct when performance matters and a small, tunable relative error (the rsd argument, 0.02 above) is acceptable.
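
One way to judge whether an rsd of 0.02 is good enough is to compare both counts on representative data; a sketch (the approx_error column name is arbitrary):

comparison = unique_orders.withColumn(
    "approx_error",
    F.abs(F.col("orders_exact") - F.col("orders_approx")) / F.col("orders_exact"),
)
comparison.show()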

Filtering aggregates with having-like logic

high_rev = daily.where(F.col("revenue_sum") > 150)
  • Filter after aggregation to mimic SQL HAVING.
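
If a SQL surface is preferred, the same filter can be written with an actual HAVING clause; a sketch (the temporary view name is chosen here):

sales.createOrReplaceTempView("sales")
 
high_rev_sql = spark.sql("""
    SELECT date, country, SUM(revenue) AS revenue_sum
    FROM sales
    GROUP BY date, country
    HAVING SUM(revenue) > 150
""")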

Ordering results

ranked = daily.orderBy(F.col("revenue_sum").desc(), "date")
  • Sort after aggregation; Spark does not guarantee row order without an explicit orderBy.
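
For a bounded report, combine the sort with limit; a sketch (the cutoff of 3 is arbitrary):

top_rows = (
    daily.orderBy(F.col("revenue_sum").desc())
    .limit(3)  # keep only the highest-revenue groups
)

orderBy is a global sort and triggers a full shuffle, so apply it once, as late in the pipeline as possible.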

Handling null group keys

null_country_stats = (
    sales.groupBy("country")
    .agg(F.count("*").alias("rows"))
)
  • groupBy keeps null keys; decide whether to impute (fillna) before aggregation.
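
If null countries should roll up under an explicit label instead, impute before grouping; a sketch (the "UNKNOWN" label is an arbitrary choice):

labeled = (
    sales.fillna({"country": "UNKNOWN"})  # replace null keys before grouping
    .groupBy("country")
    .agg(F.count("*").alias("rows"))
)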

Pivots for simple cross-tabs

pivoted = (
    sales.groupBy("date")
    .pivot("channel", ["mobile", "web"])
    .agg(F.sum("revenue"))
)
  • Pass an explicit value list to pivot; otherwise Spark scans for distinct values and the column count can explode.
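
Pivot cells with no matching rows come back as null; filling them with zero keeps downstream arithmetic simple. A sketch, assuming the pivoted DataFrame above:

pivoted_filled = pivoted.fillna(0.0, subset=["mobile", "web"])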

Minimal end-to-end rollup

summary = (
    sales
    .groupBy("date", "country")
    .agg(
        F.count("*").alias("orders"),
        F.sum("revenue").alias("revenue_sum"),
        F.countDistinct("order_id").alias("unique_orders"),
    )
    .orderBy("date", "country")
)

This produces stable daily metrics, preserves null country handling decisions, and keeps column names ready for downstream joins or exports.
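
To hand the rollup to downstream consumers, one common option is to write it out partitioned by date; a sketch (the output path is a placeholder):

(
    summary.write
    .mode("overwrite")
    .partitionBy("date")
    .parquet("/tmp/daily_sales_summary")  # placeholder output path
)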