Pandas Data Cleaning: Practical Workflow
Dirty data blocks analysis long before modeling starts. Manual fixes with ad-hoc fillna or string replacements often leave hidden issues—wrong dtypes, silent duplicates, or outliers that skew metrics.
PAS framing:
- Problem: Mixed types, missing values, and duplicated records break joins and bias statistics.
- Agitate: Quick one-off fixes hide problems until late in a project, causing rework.
- Solution: Use a repeatable cleaning checklist that standardizes dtypes, fills or drops nulls intentionally, caps or flags outliers, and validates before export.
Cleaning Workflow at a Glance
| Step | What to do | Example |
|---|---|---|
| 1. Profile | Inspect columns, nulls, uniques | df.info(), df.describe(include="all") |
| 2. Normalize columns | Trim/standardize names & text | df.columns = df.columns.str.strip().str.lower() |
| 3. Fix dtypes | Use to_datetime, to_numeric, astype("category") | df["date"] = pd.to_datetime(df["date"], errors="coerce") |
| 4. Handle missing | Drop or fill with rules | df["age"] = df["age"].fillna(df["age"].median()) |
| 5. Handle outliers | Winsorize/clip or flag | df["revenue_capped"] = df["revenue"].clip(upper=df["revenue"].quantile(0.99)) |
| 6. Deduplicate | Drop exact/partial duplicates | df.drop_duplicates(subset=["id", "date"], keep="last") |
| 7. Validate | Assert ranges, categories | assert df["score"].between(0,100).all() |
1) Profile Quickly
```python
summary = {
    "rows": len(df),
    "columns": df.shape[1],
    "nulls": df.isna().sum(),
    "unique_counts": df.nunique(),
}
```

- Spot mixed types early with `df.info()`; the sketch below shows one way to flag object columns that mix types.
- Check skewed distributions with `df.describe(percentiles=[0.01, 0.99])`.
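A minimal sketch of the mixed-type check mentioned above; `mixed_type_columns` is an illustrative helper name, and `df` is assumed to be the frame you are profiling.

```python
import pandas as pd

def mixed_type_columns(df: pd.DataFrame) -> list[str]:
    """Return object columns whose non-null values mix Python types."""
    mixed = []
    for col in df.select_dtypes(include="object").columns:
        if df[col].dropna().map(type).nunique() > 1:
            mixed.append(col)
    return mixed

# Usage: columns listed here usually need to_numeric/to_datetime or explicit casting.
# mixed_type_columns(df)
```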
2) Normalize Columns and Text
```python
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df["country"] = df["country"].str.strip().str.title()
```

- Standard names simplify joins and downstream code.
- Use `.str.normalize("NFKC")` when cleaning user-entered text with odd characters (see the sketch below).
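A short sketch of Unicode normalization on user-entered text; the sample values are invented to show the effect of folding full-width and compatibility characters.

```python
import pandas as pd

# Hypothetical user-entered company names with full-width characters and stray spaces.
raw = pd.Series(["Ａｃｍｅ Corp", "  acme corp ", "Acme\u00a0Corp"])

cleaned = (
    raw.str.normalize("NFKC")  # fold full-width/compatibility forms (incl. non-breaking spaces)
       .str.strip()
       .str.lower()
)
```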
3) Fix Data Types
df["date"] = pd.to_datetime(df["date"], errors="coerce", utc=True)
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df["segment"] = df["segment"].astype("category")errors="coerce"surfaces bad values asNaT/NaNfor later handling.convert_dtypes()provides nullable, memory-friendly types in bulk.
4) Handle Missing Values
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna("Unknown")
df = df.dropna(subset=["id"]) # required key- Choose fills per column: numeric → median/mean; categorical → mode/placeholder.
- Preserve signal by flagging imputations:
df["age_imputed"] = df["age"].isna().
5) Handle Outliers
upper = df["revenue"].quantile(0.99)
lower = df["revenue"].quantile(0.01)
df["revenue_capped"] = df["revenue"].clip(lower=lower, upper=upper)- For ratios, consider z-score filtering:
abs(z) < 3. - For financial data, prefer caps to outright removal to avoid changing counts.
6) Deduplicate Safely
```python
df = df.drop_duplicates(subset=["id", "date"], keep="last")
```

- If a uniqueness key is expected, confirm it: `assert df.duplicated(["id"]).sum() == 0`.
- For fuzzy matches, create normalized keys (e.g., lowercased emails) before deduping (see the sketch below).
7) Validate Before Export
assert df["score"].between(0, 100).all()
valid_segments = {"basic", "pro", "enterprise"}
assert df["segment"].isin(valid_segments).all()- Use
pd.testing.assert_frame_equalin tests when comparing cleaned outputs. - Add lightweight row counts and null checks in pipelines to catch regressions.
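A minimal test sketch using `pandas.testing.assert_frame_equal`; the cleaning step under test and its sample values are hypothetical.

```python
import pandas as pd
import pandas.testing as pdt

def test_amount_cleaning():
    # Hypothetical step under test: coerce amount to numeric and fill bad values with 0.
    raw = pd.DataFrame({"amount": ["10", "oops", "12.5"]})
    cleaned = raw.assign(amount=pd.to_numeric(raw["amount"], errors="coerce").fillna(0.0))
    expected = pd.DataFrame({"amount": [10.0, 0.0, 12.5]})
    pdt.assert_frame_equal(cleaned, expected)
```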
End-to-End Mini Pipeline
```python
import pandas as pd

def clean(df):
    df = df.copy()
    df.columns = df.columns.str.strip().str.lower()
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    df["amount"] = df["amount"].fillna(df["amount"].median())
    df["segment"] = df["segment"].fillna("unknown").str.lower()
    df = df.dropna(subset=["id"])
    df = df.drop_duplicates(subset=["id", "date"], keep="last")
    assert df["amount"].ge(0).all()
    return df
```

- Copy first to avoid mutating inputs in pipelines.
- Pair this with `df.pipe(clean)` for readable workflows (see the usage sketch below).
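A usage sketch for the `pipe` pattern; the `raw` frame is invented, and the example assumes the `clean()` function defined above is in scope.

```python
import pandas as pd

# Hypothetical raw extract with the columns clean() expects.
raw = pd.DataFrame({
    "id": [1, 1, 2],
    "date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "amount": ["100", "100", "oops"],
    "segment": ["Pro", None, "Basic"],
})

cleaned = raw.pipe(clean)  # same as clean(raw), but reads left-to-right in longer chains
print(cleaned)
```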
Key Takeaways
- Standardize names, coerce dtypes, and handle nulls intentionally.
- Cap or flag outliers instead of silently dropping rows.
- Validate keys and ranges before exporting to analytics or BI layers.