
Pandas Data Cleaning: Practical Workflow

Dirty data blocks analysis long before modeling starts. Manual fixes with ad-hoc fillna calls or string replacements often leave hidden issues: wrong dtypes, silent duplicates, and outliers that skew metrics.

Why a checklist beats one-off fixes:

  • Problem: Mixed types, missing values, and duplicated records break joins and bias statistics.
  • Cost: Quick one-off fixes hide problems until late in a project, causing rework.
  • Solution: Use a repeatable cleaning checklist that standardizes dtypes, fills or drops nulls intentionally, handles outliers, and validates before export.

Cleaning Workflow at a Glance

Step | What to do | Example
1. Profile | Inspect columns, nulls, uniques | df.info(), df.describe(include="all")
2. Normalize columns | Trim/standardize names & text | df.columns = df.columns.str.strip().str.lower()
3. Fix dtypes | Use to_datetime, to_numeric, astype("category") | df["date"] = pd.to_datetime(df["date"], errors="coerce")
4. Handle missing | Drop or fill with rules | df["age"] = df["age"].fillna(df["age"].median())
5. Handle outliers | Winsorize/clip or flag | df["revenue_capped"] = df["revenue"].clip(upper=df["revenue"].quantile(0.99))
6. Deduplicate | Drop exact/partial duplicates | df.drop_duplicates(subset=["id", "date"], keep="last")
7. Validate | Assert ranges, categories | assert df["score"].between(0, 100).all()
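
The snippets below assume a DataFrame named df is already loaded. As a minimal, purely hypothetical example (not from any real dataset), the frame sketched here has the columns the later steps touch, plus the usual problems: stray whitespace, mixed types, missing values, and a duplicated (id, date) pair.

import pandas as pd

# Small synthetic frame with the issues each step below addresses.
df = pd.DataFrame({
    "id": [1, 2, 2, 3, None],
    "date": ["2024-01-05", "2024-01-05", "2024-01-05", "not a date", "2024-02-10"],
    "amount": ["10.5", "x", "20", "30", None],
    "age": [34, None, 29, 41, 38],
    "city": ["  berlin", "Paris", "Paris", None, "tokyo "],
    "country": [" germany", "france", "france", "Japan ", "japan"],
    "segment": ["Pro", "basic", "basic", None, "Enterprise"],
    "score": [88, 97, 97, 54, 61],
    "revenue": [120.0, 95.0, 95.0, 15000.0, 80.0],
})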

1) Profile Quickly

# Quick profile: size, null counts, and cardinality per column.
summary = {
    "rows": len(df),
    "columns": df.shape[1],
    "nulls": df.isna().sum(),
    "unique_counts": df.nunique(),
}
  • Spot mixed types early with df.info(); a per-column check is sketched below.
  • Check skewed distributions with df.describe(percentiles=[0.01, 0.99]).
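
df.info() reports only the declared dtype, so an object column can still mix strings and numbers. A minimal sketch of a per-column check; the helper name mixed_type_columns is made up for illustration:

def mixed_type_columns(frame):
    """Return object columns whose non-null values span more than one Python type."""
    mixed = {}
    for col in frame.select_dtypes(include="object").columns:
        # Map each value to its Python type and count distinct types.
        n_types = frame[col].dropna().map(type).nunique()
        if n_types > 1:
            mixed[col] = n_types
    return mixed

print(mixed_type_columns(df))  # e.g. {"amount": 2} when numbers and strings are mixed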

2) Normalize Columns and Text

df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df["country"] = df["country"].str.strip().str.title()
  • Standard names simplify joins and downstream code.
  • Use .str.normalize("NFKC") when cleaning user-entered text with odd characters; see the sketch below.
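
Building on the country example above, a minimal sketch that combines Unicode normalization with whitespace cleanup (assumes the nullable string dtype is acceptable for this column):

df["country"] = (
    df["country"]
    .astype("string")
    .str.normalize("NFKC")                  # fold full-width/compatibility characters
    .str.replace(r"\s+", " ", regex=True)   # collapse runs of whitespace
    .str.strip()
    .str.title()
)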

3) Fix Data Types

df["date"] = pd.to_datetime(df["date"], errors="coerce", utc=True)
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df["segment"] = df["segment"].astype("category")
  • errors="coerce" surfaces bad values as NaT/NaN for later handling.
  • convert_dtypes() provides nullable, memory-friendly types in bulk; a sketch follows this list.
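
A minimal sketch of converting the remaining columns in bulk after the targeted fixes above; the resulting dtypes depend on the data, so inspect them afterwards:

# Infer nullable extension dtypes (Int64, Float64, string, boolean) in one pass.
df = df.convert_dtypes()
print(df.dtypes)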

4) Handle Missing Values

df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna("Unknown")
df = df.dropna(subset=["id"])  # required key
  • Choose fills per column: numeric → median/mean; categorical → mode/placeholder.
  • Preserve signal by flagging imputations with df["age_imputed"] = df["age"].isna(), computed before the fill so the flag is not all False; see the sketch below.
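
A minimal sketch of the flag-then-fill order for the age column used above:

# Record which rows were missing *before* imputing, then fill with the median.
df["age_imputed"] = df["age"].isna()
df["age"] = df["age"].fillna(df["age"].median())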

5) Handle Outliers

upper = df["revenue"].quantile(0.99)
lower = df["revenue"].quantile(0.01)
df["revenue_capped"] = df["revenue"].clip(lower=lower, upper=upper)
  • For ratio-style metrics, consider z-score filtering (abs(z) < 3); a sketch follows this list.
  • For financial data, prefer caps to outright removal to avoid changing counts.
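
A minimal sketch of z-score flagging in plain pandas, using revenue as the example column and flagging rows rather than dropping them:

# Standardize, then flag values more than 3 standard deviations from the mean.
z = (df["revenue"] - df["revenue"].mean()) / df["revenue"].std()
df["revenue_outlier"] = z.abs() >= 3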

6) Deduplicate Safely

df = df.drop_duplicates(subset=["id", "date"], keep="last")
  • If a uniqueness key is expected, confirm: assert df.duplicated(["id"]).sum() == 0.
  • For fuzzy matches, create normalized keys (e.g., lowercased emails) before deduping; see the sketch below.
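
A minimal sketch of deduplicating on a normalized key, assuming a hypothetical email column:

# Normalize so "Jane.Doe@Example.com " and "jane.doe@example.com" collapse to one key.
df["email_key"] = df["email"].str.strip().str.lower()
df = df.drop_duplicates(subset=["email_key"], keep="first")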

7) Validate Before Export

assert df["score"].between(0, 100).all()
valid_segments = {"basic", "pro", "enterprise"}
assert df["segment"].isin(valid_segments).all()
  • Use pd.testing.assert_frame_equal in tests when comparing cleaned outputs.
  • Add lightweight row counts and null checks in pipelines to catch regressions; a sketch of such a check follows.
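
A minimal sketch of such a pipeline check; the helper name validate and the required-column list are illustrative, and it raises a readable error instead of a bare assert:

def validate(frame, required=("id", "date", "amount")):
    """Raise ValueError if the frame is empty or required columns are missing/null."""
    if frame.empty:
        raise ValueError("cleaned frame is empty")
    missing = [c for c in required if c not in frame.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    nulls = frame[list(required)].isna().sum()
    bad = nulls[nulls > 0]
    if not bad.empty:
        raise ValueError(f"nulls in required columns:\n{bad}")
    return frame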

End-to-End Mini Pipeline

import pandas as pd

def clean(df):
    df = df.copy()  # don't mutate the caller's frame
    df.columns = df.columns.str.strip().str.lower()
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    df["amount"] = df["amount"].fillna(df["amount"].median())
    df["segment"] = df["segment"].fillna("unknown").str.lower()
    df = df.dropna(subset=["id"])          # required key
    df = df.drop_duplicates(subset=["id", "date"], keep="last")
    assert df["amount"].ge(0).all()        # cheap sanity check before export
    return df
  • Copy first to avoid mutating inputs in pipelines.
  • Pair this with df.pipe(clean) for readable workflows, as in the usage example below.
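
A minimal usage sketch, assuming raw is the unclean frame loaded elsewhere:

# .pipe keeps method-chained workflows readable and composable.
cleaned = raw.pipe(clean)
print(len(raw), "->", len(cleaned), "rows after cleaning")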

Key Takeaways

  • Standardize names, coerce dtypes, and handle nulls intentionally.
  • Cap or flag outliers instead of silently dropping rows.
  • Validate keys and ranges before exporting to analytics or BI layers.