Data Wrangling: Complete Guide to Cleaning and Transforming Data with Python

Every data project starts with the same uncomfortable truth: raw data is messy. Columns have inconsistent formats. Dates arrive as strings. Customer records contain duplicates. CSV files from different departments use different naming conventions for the same fields. You cannot build reliable models, dashboards, or reports on data you do not trust.

Data wrangling is the process of turning that raw, unreliable data into something clean, structured, and ready for analysis. It consumes up to 80% of a data professional's time according to multiple industry surveys, yet it remains one of the least-taught skills in data science curricula. This guide walks through the full data wrangling pipeline using Python and Pandas, with practical code for every step.

What Is Data Wrangling?

Data wrangling (also called data munging) is the process of discovering, structuring, cleaning, enriching, and validating raw data so it can be used for analysis or machine learning. It sits between data collection and data analysis in the workflow.

Data Wrangling vs Data Cleaning vs ETL

These three terms overlap but mean different things:

| Term | Scope | Focus | Typical Tools |
| --- | --- | --- | --- |
| Data Wrangling | Broad -- covers the full process from raw to analysis-ready | Exploration, reshaping, cleaning, enrichment | Pandas, Polars, Python scripts |
| Data Cleaning | Narrow -- a subset of wrangling | Fixing errors, handling missing values, removing duplicates | Pandas, OpenRefine, SQL |
| ETL (Extract, Transform, Load) | System-level -- automated pipelines | Moving data between systems at scale | Airflow, dbt, Spark, AWS Glue |

Data cleaning is one step within data wrangling. ETL is a production-level pipeline that may include wrangling logic but focuses on system integration and automation.

The Data Wrangling Pipeline

A structured wrangling workflow has five stages:

1. Discover -- Load the data and understand its shape, types, distributions, and problems.

2. Structure -- Reshape columns and rows so the data matches the schema you need. Rename columns, split or combine fields, pivot or unpivot tables.

3. Clean -- Handle missing values, fix data types, remove duplicates, and correct errors.

4. Enrich -- Add derived columns, join with external datasets, bin continuous values, or encode categorical variables.

5. Validate -- Confirm the output meets expectations: no nulls in required fields, values within expected ranges, row counts match.

The rest of this guide covers each stage with Python code.

Stage 1: Discover Your Data

Start every wrangling project by understanding what you have:

import pandas as pd
 
df = pd.read_csv("sales_data.csv")
 
# Shape and basic info
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
df.info()
 
# Statistical summary
df.describe(include="all")
 
# Check missing values
df.isnull().sum().sort_values(ascending=False)
 
# Check for duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")
 
# Preview unique values in categorical columns
for col in df.select_dtypes(include="object").columns:
    print(f"{col}: {df[col].nunique()} unique values")
    print(df[col].value_counts().head(5))
    print()

This discovery step tells you what problems exist before you start fixing them. Skip it and you will miss issues that surface later as bugs in your analysis.

For a visual approach to data discovery, PyGWalker turns any Pandas DataFrame into an interactive Tableau-like interface. You can drag columns to axes, spot distributions, and identify outliers visually before writing any cleaning code:

import pygwalker as pyg
walker = pyg.walk(df)

Stage 2: Structure and Reshape

Renaming Columns

# Standardize column names: lowercase, underscores, no spaces
df.columns = (
    df.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
    .str.replace(r"[^\w]", "", regex=True)
)

Reshaping with Pivot and Melt

Pivot converts long data to wide format:

# Long format: each row is one product-month combination
# Wide format: months become columns
sales_wide = df.pivot_table(
    index="product",
    columns="month",
    values="revenue",
    aggfunc="sum"
)

Melt converts wide data to long format:

# Wide format back to long
sales_long = sales_wide.reset_index().melt(
    id_vars="product",
    var_name="month",
    value_name="revenue"
)

Stack and Unstack

# Stack: pivot columns into rows (creates MultiIndex)
stacked = df.set_index(["region", "product"]).stack()
 
# Unstack: pivot rows into columns
unstacked = stacked.unstack(level="product")

Use pivot_table when you need aggregation. Use melt when normalizing wide tables from spreadsheets. Use stack/unstack when working with hierarchical indices.
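
A related practical point: pivot raises an error when the same index/column pair appears more than once, while pivot_table resolves duplicates through its aggfunc. Here is a minimal sketch with made-up toy data to illustrate the difference:

import pandas as pd

# Toy long-format data where the ("A", "Jan") pair appears twice
long_df = pd.DataFrame({
    "product": ["A", "A", "A", "B"],
    "month": ["Jan", "Jan", "Feb", "Jan"],
    "revenue": [100, 150, 200, 300],
})

# pivot() cannot handle the duplicate pair and raises ValueError
# long_df.pivot(index="product", columns="month", values="revenue")

# pivot_table() aggregates the duplicates instead
wide = long_df.pivot_table(
    index="product", columns="month", values="revenue", aggfunc="sum"
)
print(wide)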

Stage 3: Clean the Data

Handling Missing Values

Missing data is the most common data quality problem. Pandas gives you several strategies:

# Strategy 1: Drop rows with any missing values
df_dropped = df.dropna()
 
# Strategy 2: Drop rows missing values in specific columns only
df_dropped = df.dropna(subset=["revenue", "customer_id"])
 
# Strategy 3: Fill with a constant
df["category"] = df["category"].fillna("Unknown")
 
# Strategy 4: Fill with column statistics
df["revenue"] = df["revenue"].fillna(df["revenue"].median())
 
# Strategy 5: Forward fill (good for time series)
df["price"] = df["price"].ffill()
 
# Strategy 6: Interpolation (good for numeric time series)
df["temperature"] = df["temperature"].interpolate(method="linear")

Which strategy to choose:

| Situation | Recommended Approach |
| --- | --- |
| Less than 5% missing, random | Drop rows |
| Categorical column | Fill with "Unknown" or mode |
| Numeric column, normal distribution | Fill with mean |
| Numeric column, skewed distribution | Fill with median |
| Time series data | Forward fill or interpolation |
| Column has > 50% missing | Consider dropping the column |
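
Two of the rows above (fill with mode, drop high-missing columns) are not shown in the code earlier, so here is a short, hedged sketch; the segment column and the 50% threshold are illustrative assumptions:

# Fill a categorical column with its mode (most frequent value);
# assumes the column has at least one non-null value
df["segment"] = df["segment"].fillna(df["segment"].mode().iloc[0])

# Drop columns where more than half of the values are missing
missing_share = df.isnull().mean()
df = df.drop(columns=missing_share[missing_share > 0.5].index)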

Data Type Conversions

Wrong data types cause silent errors. A "price" column stored as strings concatenates instead of summing. A "date" column stored as an object cannot be filtered by month.

# String to datetime
df["order_date"] = pd.to_datetime(df["order_date"], format="%Y-%m-%d")
 
# String to numeric (coerce turns unparseable values to NaN)
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
 
# Object to category (saves memory and enables ordering)
df["status"] = df["status"].astype("category")
 
# Boolean conversion
df["is_active"] = df["is_active"].map({"yes": True, "no": False, "Y": True, "N": False})
 
# Extract components from datetime
df["order_year"] = df["order_date"].dt.year
df["order_month"] = df["order_date"].dt.month
df["day_of_week"] = df["order_date"].dt.day_name()

String Cleaning

Real-world string data is full of inconsistencies:

# Strip whitespace
df["name"] = df["name"].str.strip()
 
# Standardize case
df["city"] = df["city"].str.lower()
df["state"] = df["state"].str.upper()
 
# Replace patterns with regex
df["phone"] = df["phone"].str.replace(r"[^\d]", "", regex=True)
 
# Extract patterns
df["zip_code"] = df["address"].str.extract(r"(\d{5})")
 
# Fix common inconsistencies
df["country"] = df["country"].replace({
    "US": "United States",
    "USA": "United States",
    "U.S.A.": "United States",
    "uk": "United Kingdom",
    "UK": "United Kingdom",
})

Handling Duplicates

# Find duplicates
print(f"Total duplicates: {df.duplicated().sum()}")
 
# Find duplicates based on specific columns
dupes = df[df.duplicated(subset=["customer_id", "order_date"], keep=False)]
print(f"Duplicate orders: {len(dupes)}")
 
# Remove duplicates, keeping the first occurrence
df = df.drop_duplicates(subset=["customer_id", "order_date"], keep="first")
 
# Remove exact duplicate rows
df = df.drop_duplicates()

Stage 4: Enrich -- Merging and Joining Datasets

Most analysis requires combining data from multiple sources:

# Inner join: only matching rows
merged = pd.merge(orders, customers, on="customer_id", how="inner")
 
# Left join: all orders, with customer info where available
merged = pd.merge(orders, customers, on="customer_id", how="left")
 
# Join on different column names
merged = pd.merge(
    orders, products,
    left_on="prod_code",
    right_on="product_id",
    how="left"
)
 
# Concatenate DataFrames vertically (stacking rows)
all_months = pd.concat([jan_df, feb_df, mar_df], ignore_index=True)

Join type reference:

| Join Type | Keeps | Use When |
| --- | --- | --- |
| inner | Only matching rows from both tables | You need complete records only |
| left | All rows from left, matches from right | Left table is your primary dataset |
| right | All rows from right, matches from left | Right table is your primary dataset |
| outer | All rows from both tables | You need the full union of both datasets |
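
Joins are also the easiest place to introduce silent row duplication, so it pays to check the result immediately. A small sketch using the orders and customers frames from above; the validate argument is a built-in Pandas safeguard, and the row-count assertion is an assumption about this particular join:

before = len(orders)

merged = pd.merge(
    orders, customers,
    on="customer_id",
    how="left",
    validate="many_to_one",  # raises MergeError if customer_id repeats in customers
    indicator=True           # adds a _merge column showing where each row matched
)

# A many-to-one left join should not change the row count
assert len(merged) == before, "Join changed the row count unexpectedly"
print(merged["_merge"].value_counts())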

Adding Derived Columns

# Calculated columns
df["profit"] = df["revenue"] - df["cost"]
df["profit_margin"] = (df["profit"] / df["revenue"] * 100).round(2)
 
# Binning continuous values
df["revenue_tier"] = pd.cut(
    df["revenue"],
    bins=[0, 100, 500, 1000, float("inf")],
    labels=["Low", "Medium", "High", "Premium"]
)
 
# Conditional columns
df["is_high_value"] = df["revenue"] > df["revenue"].quantile(0.9)

Stage 5: Validate

After wrangling, confirm the data meets your expectations:

def validate_dataframe(df):
    """Run basic validation checks on a wrangled DataFrame."""
    checks = {}
 
    # No nulls in required columns
    required = ["customer_id", "order_date", "revenue"]
    for col in required:
        null_count = df[col].isnull().sum()
        checks[f"no_nulls_{col}"] = null_count == 0
        if null_count > 0:
            print(f"WARNING: {col} has {null_count} null values")
 
    # Revenue should be positive
    neg_revenue = (df["revenue"] < 0).sum()
    checks["positive_revenue"] = neg_revenue == 0
    if neg_revenue > 0:
        print(f"WARNING: {neg_revenue} rows with negative revenue")
 
    # Dates should be within expected range
    min_date = df["order_date"].min()
    max_date = df["order_date"].max()
    checks["date_range"] = min_date >= pd.Timestamp("2020-01-01")
    print(f"Date range: {min_date} to {max_date}")
 
    # No duplicate orders
    dupe_count = df.duplicated(subset=["customer_id", "order_date"]).sum()
    checks["no_duplicates"] = dupe_count == 0
 
    passed = sum(checks.values())
    total = len(checks)
    print(f"\nValidation: {passed}/{total} checks passed")
    return checks
 
validate_dataframe(df)

Outlier Detection and Handling

Outliers can distort analysis. Detect them first, then decide what to do:

import numpy as np
 
# Method 1: IQR (Interquartile Range)
Q1 = df["revenue"].quantile(0.25)
Q3 = df["revenue"].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
 
outliers = df[(df["revenue"] < lower) | (df["revenue"] > upper)]
print(f"IQR outliers: {len(outliers)} rows")
 
# Method 2: Z-score
from scipy import stats
df["z_score"] = np.abs(stats.zscore(df["revenue"].dropna()))
outliers_z = df[df["z_score"] > 3]
print(f"Z-score outliers: {len(outliers_z)} rows")
 
# Handling options
# Option A: Remove outliers
df_clean = df[(df["revenue"] >= lower) & (df["revenue"] <= upper)]
 
# Option B: Cap outliers (winsorization)
df["revenue_capped"] = df["revenue"].clip(lower=lower, upper=upper)
 
# Option C: Flag outliers for manual review
df["is_outlier"] = (df["revenue"] < lower) | (df["revenue"] > upper)

Full Wrangling Pipeline Example

Here is a complete, reproducible pipeline that ties every step together:

import pandas as pd
import numpy as np
 
def wrangle_sales_data(filepath):
    """Complete data wrangling pipeline for sales data."""
 
    # 1. Load
    df = pd.read_csv(filepath)
 
    # 2. Standardize column names
    df.columns = (
        df.columns.str.strip().str.lower().str.replace(" ", "_")
    )
 
    # 3. Fix data types
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
    df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")
 
    # 4. Clean strings
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.strip()
 
    # 5. Handle missing values
    df = df.dropna(subset=["order_date", "customer_id"])
    df["revenue"] = df["revenue"].fillna(df["revenue"].median())
    df["category"] = df["category"].fillna("Unknown")
 
    # 6. Remove duplicates
    df = df.drop_duplicates(
        subset=["customer_id", "order_date", "product_id"],
        keep="first"
    )
 
    # 7. Add derived columns
    df["order_year"] = df["order_date"].dt.year
    df["order_month"] = df["order_date"].dt.month
 
    # 8. Remove outliers (IQR method on revenue)
    Q1 = df["revenue"].quantile(0.25)
    Q3 = df["revenue"].quantile(0.75)
    IQR = Q3 - Q1
    df = df[
        (df["revenue"] >= Q1 - 1.5 * IQR)
        & (df["revenue"] <= Q3 + 1.5 * IQR)
    ]
 
    # 9. Reset index
    df = df.reset_index(drop=True)
 
    return df
 
# Usage
clean_df = wrangle_sales_data("raw_sales.csv")
print(f"Clean dataset: {clean_df.shape[0]} rows, {clean_df.shape[1]} columns")

Pandas vs Polars vs SQL for Data Wrangling

Pandas is the default choice, but it is not the only option. Here is how the three main wrangling tools compare:

| Criteria | Pandas | Polars | SQL |
| --- | --- | --- | --- |
| Language | Python | Python / Rust | SQL |
| Speed (1GB file) | Slow (single-threaded) | Fast (multi-threaded, Rust) | Depends on engine |
| Memory usage | High (eager evaluation) | Lower (lazy evaluation) | Engine-managed |
| Syntax | df.groupby().agg() | df.group_by().agg() | GROUP BY |
| Missing values | NaN / None | null (native) | NULL |
| Pivot / Melt | Built-in | Built-in | Varies by engine |
| Learning curve | Moderate | Moderate | Easy for SQL users |
| Best for | Exploratory analysis, prototyping | Large datasets, performance | Database-resident data |
| Ecosystem | Massive (NumPy, Scikit-learn) | Growing | Depends on DB |

When to use each:

  • Pandas: You are doing exploratory analysis, your data fits in memory, and you need the broadest library compatibility.
  • Polars: Your data is large (1GB+), you need speed, or you want lazy evaluation for memory efficiency.
  • SQL: Your data is already in a database and you want transformations to run where the data lives.
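
To make the syntax row in the table concrete, here is the same aggregation sketched in both libraries. The file and column names are placeholders, and the Polars snippet assumes a recent version where the method is spelled group_by:

import pandas as pd
import polars as pl

# Pandas: eager, single-threaded
pandas_result = (
    pd.read_csv("sales_data.csv")
    .groupby("region")["revenue"]
    .sum()
    .reset_index()
)

# Polars: lazy scan, multi-threaded; nothing is read until .collect()
polars_result = (
    pl.scan_csv("sales_data.csv")
    .group_by("region")
    .agg(pl.col("revenue").sum())
    .collect()
)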

Best Practices and Common Pitfalls

Best Practices

  1. Always explore before cleaning. Run df.info(), df.describe(), and df.isnull().sum() before writing any transformation code.

  2. Work on a copy. Use df_clean = df.copy() so you can compare results against the original.

  3. Chain operations. Pandas method chaining keeps pipelines readable:

clean_df = (
    df
    .rename(columns=str.lower)
    .dropna(subset=["customer_id"])
    .assign(revenue=lambda x: pd.to_numeric(x["revenue"], errors="coerce"))
    .query("revenue > 0")
    .drop_duplicates(subset=["customer_id", "order_date"])
    .reset_index(drop=True)
)

  4. Document your decisions. When you fill missing values or remove outliers, add a comment explaining why.

  5. Validate at each stage. Check row counts, null counts, and value distributions after each transformation (see the checkpoint sketch below).
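
A small checkpoint helper used with .pipe() makes each step in a chain report its effect. This is a sketch, not a fixed recipe; log_shape is a hypothetical helper name and the steps are illustrative:

def log_shape(df, step):
    """Checkpoint helper: print the shape after a step and pass the frame through."""
    print(f"{step}: {df.shape[0]} rows, {df.shape[1]} columns")
    return df

clean_df = (
    df
    .pipe(log_shape, "loaded")
    .dropna(subset=["customer_id"])
    .pipe(log_shape, "dropped missing customer_id")
    .drop_duplicates()
    .pipe(log_shape, "deduplicated")
)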

Common Pitfalls

| Pitfall | What Goes Wrong | Fix |
| --- | --- | --- |
| Ignoring SettingWithCopyWarning | Silent bugs from chained indexing | Use .loc[] or .copy() |
| Filling all NaN with 0 | Zeros are meaningful in most contexts | Use median, mode, or "Unknown" |
| Dropping rows too aggressively | Losing valid data | Only drop when necessary, document the criteria |
| Not checking join results | Duplicated rows from many-to-many joins | Check merged.shape after every merge |
| Hardcoding column names | Pipeline breaks on new data | Use pattern matching or config files |
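
The first pitfall deserves a concrete example. Chained indexing can write to a temporary object, so the assignment may silently never reach the original DataFrame; here is a minimal sketch of the problem and both fixes (the status column is hypothetical):

# Problematic: chained indexing may modify a temporary copy
df[df["revenue"] > 1000]["status"] = "review"  # SettingWithCopyWarning, change may be lost

# Fix 1: a single .loc assignment on the original frame
df.loc[df["revenue"] > 1000, "status"] = "review"

# Fix 2: take an explicit copy when you want a separate DataFrame
high_value = df[df["revenue"] > 1000].copy()
high_value["status"] = "review"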

AI-Assisted Data Wrangling

Manual data wrangling is time-consuming and error-prone. AI tools can accelerate the process by suggesting transformations, detecting anomalies, and generating cleaning code.

RunCell brings an AI agent directly into your Jupyter notebook. Instead of writing every cleaning step by hand, you can describe what you need in natural language -- "convert the date column to datetime, fill missing revenue with the median, and remove duplicate orders" -- and RunCell generates the Pandas code. This is especially useful for repetitive wrangling tasks across multiple datasets with similar structures.

FAQ

What is data wrangling in simple terms?

Data wrangling is the process of taking raw, messy data and converting it into a clean, structured format suitable for analysis. It includes fixing missing values, correcting data types, removing duplicates, reshaping tables, and merging datasets from multiple sources.

What is the difference between data wrangling and data cleaning?

Data cleaning is one step within data wrangling. Cleaning focuses specifically on fixing errors -- missing values, duplicates, incorrect entries. Data wrangling is broader and includes everything from initial exploration and restructuring to enrichment and validation.

What tools are used for data wrangling?

Python with Pandas is the most common tool. Other options include Polars (faster for large datasets), SQL (for database-resident data), R with dplyr, and commercial tools like Trifacta and Alteryx. For visual exploration before wrangling, PyGWalker provides an interactive drag-and-drop interface on top of Pandas DataFrames.

How much time do data scientists spend on data wrangling?

Industry surveys consistently report that data professionals spend 60-80% of their time on data wrangling and preparation. This is why efficient wrangling practices and automation tools are critical for data team productivity.

What are the steps in data wrangling?

The five main steps are: (1) Discover -- explore the data's shape, types, and quality issues; (2) Structure -- rename columns, reshape tables; (3) Clean -- handle missing values, fix types, remove duplicates; (4) Enrich -- add derived columns, join with external data; (5) Validate -- confirm the output meets quality requirements.

Conclusion

Data wrangling is the unglamorous foundation that makes every downstream analysis possible. Without it, dashboards show wrong numbers, models train on garbage, and business decisions get made with bad information.

The key takeaways: always explore before you clean, use the five-stage pipeline (discover, structure, clean, enrich, validate), pick the right missing-value strategy for each column, validate your output, and document your decisions. Pandas handles most wrangling tasks well; switch to Polars when dataset size demands it and SQL when data lives in a database.

Build your wrangling logic into reusable functions, not one-off scripts. The data will change. The sources will change. Your pipeline should handle both gracefully.
