Data Wrangling: Complete Guide to Cleaning and Transforming Data with Python
Every data project starts with the same uncomfortable truth: raw data is messy. Columns have inconsistent formats. Dates arrive as strings. Customer records contain duplicates. CSV files from different departments use different naming conventions for the same fields. You cannot build reliable models, dashboards, or reports on data you do not trust.
Data wrangling is the process of turning that raw, unreliable data into something clean, structured, and ready for analysis. It consumes up to 80% of a data professional's time according to multiple industry surveys, yet it remains one of the least-taught skills in data science curricula. This guide walks through the full data wrangling pipeline using Python and Pandas, with practical code for every step.
What Is Data Wrangling?
Data wrangling (also called data munging) is the process of discovering, structuring, cleaning, enriching, and validating raw data so it can be used for analysis or machine learning. It sits between data collection and data analysis in the workflow.
Data Wrangling vs Data Cleaning vs ETL
These three terms overlap but mean different things:
| Term | Scope | Focus | Typical Tools |
|---|---|---|---|
| Data Wrangling | Broad -- covers the full process from raw to analysis-ready | Exploration, reshaping, cleaning, enrichment | Pandas, Polars, Python scripts |
| Data Cleaning | Narrow -- a subset of wrangling | Fixing errors, handling missing values, removing duplicates | Pandas, OpenRefine, SQL |
| ETL (Extract, Transform, Load) | System-level -- automated pipelines | Moving data between systems at scale | Airflow, dbt, Spark, AWS Glue |
Data cleaning is one step within data wrangling. ETL is a production-level pipeline that may include wrangling logic but focuses on system integration and automation.
The Data Wrangling Pipeline
A structured wrangling workflow has five stages:
1. Discover -- Load the data and understand its shape, types, distributions, and problems.
2. Structure -- Reshape columns and rows so the data matches the schema you need. Rename columns, split or combine fields, pivot or unpivot tables.
3. Clean -- Handle missing values, fix data types, remove duplicates, and correct errors.
4. Enrich -- Add derived columns, join with external datasets, bin continuous values, or encode categorical variables.
5. Validate -- Confirm the output meets expectations: no nulls in required fields, values within expected ranges, row counts match.
The rest of this guide covers each stage with Python code.
Stage 1: Discover Your Data
Start every wrangling project by understanding what you have:
import pandas as pd
df = pd.read_csv("sales_data.csv")
# Shape and basic info
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
df.info()
# Statistical summary
df.describe(include="all")
# Check missing values
df.isnull().sum().sort_values(ascending=False)
# Check for duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")
# Preview unique values in categorical columns
for col in df.select_dtypes(include="object").columns:
print(f"{col}: {df[col].nunique()} unique values")
print(df[col].value_counts().head(5))
print()

This discovery step tells you what problems exist before you start fixing them. Skip it and you will miss issues that surface later as bugs in your analysis.
For a visual approach to data discovery, PyGWalker turns any Pandas DataFrame into an interactive Tableau-like interface. You can drag columns to axes, spot distributions, and identify outliers visually before writing any cleaning code:
import pygwalker as pyg
walker = pyg.walk(df)
Stage 2: Structure and Reshape
Renaming Columns
# Standardize column names: lowercase, underscores, no spaces
df.columns = (
df.columns
.str.strip()
.str.lower()
.str.replace(" ", "_")
.str.replace(r"[^\w]", "", regex=True)
)

Reshaping with Pivot and Melt
Pivot converts long data to wide format:
# Long format: each row is one product-month combination
# Wide format: months become columns
sales_wide = df.pivot_table(
index="product",
columns="month",
values="revenue",
aggfunc="sum"
)

Melt converts wide data to long format:
# Wide format back to long
sales_long = sales_wide.reset_index().melt(
id_vars="product",
var_name="month",
value_name="revenue"
)

Stack and Unstack
# Stack: pivot columns into rows (creates MultiIndex)
stacked = df.set_index(["region", "product"]).stack()
# Unstack: pivot rows into columns
unstacked = stacked.unstack(level="product")

Use pivot_table when you need aggregation. Use melt when normalizing wide tables from spreadsheets. Use stack/unstack when working with hierarchical indices.
Stage 3: Clean the Data
Handling Missing Values
Missing data is the most common data quality problem. Pandas gives you several strategies:
# Strategy 1: Drop rows with any missing values
df_dropped = df.dropna()
# Strategy 2: Drop rows missing values in specific columns only
df_dropped = df.dropna(subset=["revenue", "customer_id"])
# Strategy 3: Fill with a constant
df["category"] = df["category"].fillna("Unknown")
# Strategy 4: Fill with column statistics
df["revenue"] = df["revenue"].fillna(df["revenue"].median())
# Strategy 5: Forward fill (good for time series)
df["price"] = df["price"].ffill()
# Strategy 6: Interpolation (good for numeric time series)
df["temperature"] = df["temperature"].interpolate(method="linear")

Which strategy to choose:
| Situation | Recommended Approach |
|---|---|
| Less than 5% missing, random | Drop rows |
| Categorical column | Fill with "Unknown" or mode |
| Numeric column, normal distribution | Fill with mean |
| Numeric column, skewed distribution | Fill with median |
| Time series data | Forward fill or interpolation |
| Column has > 50% missing | Consider dropping the column |
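If you find yourself applying these rules to many columns, they can be folded into a small helper. The sketch below is illustrative only -- the fill_missing name and the 50% / skew thresholds are assumptions, not a standard Pandas API:

import pandas as pd

def fill_missing(df, skew_threshold=1.0, drop_threshold=0.5):
    """Apply the missing-value rules from the table above, column by column (illustrative sketch)."""
    df = df.copy()
    for col in df.columns:
        missing_ratio = df[col].isnull().mean()
        if missing_ratio == 0:
            continue
        if missing_ratio > drop_threshold:
            # More than 50% missing: candidate for dropping the column entirely
            df = df.drop(columns=[col])
        elif pd.api.types.is_numeric_dtype(df[col]):
            if abs(df[col].skew()) > skew_threshold:
                # Skewed distribution: median is more robust than the mean
                df[col] = df[col].fillna(df[col].median())
            else:
                # Roughly symmetric distribution: mean is acceptable
                df[col] = df[col].fillna(df[col].mean())
        elif df[col].dtype == "object":
            # Text / categorical column: fill with a sentinel value
            df[col] = df[col].fillna("Unknown")
        # Other dtypes (e.g. datetime) are left for case-by-case handling

Call it once after discovery (df = fill_missing(df)) and tune the thresholds to match your data.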
Data Type Conversions
Wrong data types cause silent errors. A "price" column stored as a string cannot be summed. A "date" column stored as an object cannot be filtered by month.
# String to datetime
df["order_date"] = pd.to_datetime(df["order_date"], format="%Y-%m-%d")
# String to numeric (coerce turns unparseable values to NaN)
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
# Object to category (saves memory and enables ordering)
df["status"] = df["status"].astype("category")
# Boolean conversion
df["is_active"] = df["is_active"].map({"yes": True, "no": False, "Y": True, "N": False})
# Extract components from datetime
df["order_year"] = df["order_date"].dt.year
df["order_month"] = df["order_date"].dt.month
df["day_of_week"] = df["order_date"].dt.day_name()

String Cleaning
Real-world string data is full of inconsistencies:
# Strip whitespace
df["name"] = df["name"].str.strip()
# Standardize case
df["city"] = df["city"].str.lower()
df["state"] = df["state"].str.upper()
# Replace patterns with regex
df["phone"] = df["phone"].str.replace(r"[^\d]", "", regex=True)
# Extract patterns
df["zip_code"] = df["address"].str.extract(r"(\d{5})")
# Fix common inconsistencies
df["country"] = df["country"].replace({
"US": "United States",
"USA": "United States",
"U.S.A.": "United States",
"uk": "United Kingdom",
"UK": "United Kingdom",
})

Handling Duplicates
# Find duplicates
print(f"Total duplicates: {df.duplicated().sum()}")
# Find duplicates based on specific columns
dupes = df[df.duplicated(subset=["customer_id", "order_date"], keep=False)]
print(f"Duplicate orders: {len(dupes)}")
# Remove duplicates, keeping the first occurrence
df = df.drop_duplicates(subset=["customer_id", "order_date"], keep="first")
# Remove exact duplicate rows
df = df.drop_duplicates()

Stage 4: Enrich -- Merging and Joining Datasets
Most analysis requires combining data from multiple sources:
# Inner join: only matching rows
merged = pd.merge(orders, customers, on="customer_id", how="inner")
# Left join: all orders, with customer info where available
merged = pd.merge(orders, customers, on="customer_id", how="left")
# Join on different column names
merged = pd.merge(
orders, products,
left_on="prod_code",
right_on="product_id",
how="left"
)
# Concatenate DataFrames vertically (stacking rows)
all_months = pd.concat([jan_df, feb_df, mar_df], ignore_index=True)

Join type reference:
| Join Type | Keeps | Use When |
|---|---|---|
| inner | Only matching rows from both tables | You need complete records only |
| left | All rows from left, matches from right | Left table is your primary dataset |
| right | All rows from right, matches from left | Right table is your primary dataset |
| outer | All rows from both tables | You need the full union of both datasets |
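Joins are also where silent row explosions happen. Here is a hedged sketch that uses pd.merge's validate and indicator arguments to catch bad joins early, reusing the orders and customers frames from the example above:

# Guard against accidental many-to-many joins and track match quality
merged = pd.merge(
    orders, customers,
    on="customer_id",
    how="left",
    validate="many_to_one",  # raises MergeError if customers has duplicate customer_id values
    indicator=True           # adds a _merge column: both / left_only / right_only
)

# How many orders failed to find a matching customer?
print(merged["_merge"].value_counts())

# A many-to-one left join should never change the row count
assert len(merged) == len(orders), "Join unexpectedly changed the row count"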
Adding Derived Columns
# Calculated columns
df["profit"] = df["revenue"] - df["cost"]
df["profit_margin"] = (df["profit"] / df["revenue"] * 100).round(2)
# Binning continuous values
df["revenue_tier"] = pd.cut(
df["revenue"],
bins=[0, 100, 500, 1000, float("inf")],
labels=["Low", "Medium", "High", "Premium"]
)
# Conditional columns
df["is_high_value"] = df["revenue"] > df["revenue"].quantile(0.9)

Stage 5: Validate
After wrangling, confirm the data meets your expectations:
def validate_dataframe(df):
"""Run basic validation checks on a wrangled DataFrame."""
checks = {}
# No nulls in required columns
required = ["customer_id", "order_date", "revenue"]
for col in required:
null_count = df[col].isnull().sum()
checks[f"no_nulls_{col}"] = null_count == 0
if null_count > 0:
print(f"WARNING: {col} has {null_count} null values")
# Revenue should be positive
neg_revenue = (df["revenue"] < 0).sum()
checks["positive_revenue"] = neg_revenue == 0
if neg_revenue > 0:
print(f"WARNING: {neg_revenue} rows with negative revenue")
# Dates should be within expected range
min_date = df["order_date"].min()
max_date = df["order_date"].max()
checks["date_range"] = min_date >= pd.Timestamp("2020-01-01")
print(f"Date range: {min_date} to {max_date}")
# No duplicate orders
dupe_count = df.duplicated(subset=["customer_id", "order_date"]).sum()
checks["no_duplicates"] = dupe_count == 0
passed = sum(checks.values())
total = len(checks)
print(f"\nValidation: {passed}/{total} checks passed")
return checks
validate_dataframe(df)

Outlier Detection and Handling
Outliers can distort analysis. Detect them first, then decide what to do:
import numpy as np
# Method 1: IQR (Interquartile Range)
Q1 = df["revenue"].quantile(0.25)
Q3 = df["revenue"].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df["revenue"] < lower) | (df["revenue"] > upper)]
print(f"IQR outliers: {len(outliers)} rows")
# Method 2: Z-score
from scipy import stats
# nan_policy="omit" keeps the output the same length as the DataFrame when revenue has NaNs
df["z_score"] = np.abs(stats.zscore(df["revenue"], nan_policy="omit"))
outliers_z = df[df["z_score"] > 3]
print(f"Z-score outliers: {len(outliers_z)} rows")
# Handling options
# Option A: Remove outliers
df_clean = df[(df["revenue"] >= lower) & (df["revenue"] <= upper)]
# Option B: Cap outliers (winsorization)
df["revenue_capped"] = df["revenue"].clip(lower=lower, upper=upper)
# Option C: Flag outliers for manual review
df["is_outlier"] = (df["revenue"] < lower) | (df["revenue"] > upper)

Full Wrangling Pipeline Example
Here is a complete, reproducible pipeline that ties every step together:
import pandas as pd
import numpy as np
def wrangle_sales_data(filepath):
"""Complete data wrangling pipeline for sales data."""
# 1. Load
df = pd.read_csv(filepath)
# 2. Standardize column names
df.columns = (
df.columns.str.strip().str.lower().str.replace(" ", "_")
)
# 3. Fix data types
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")
# 4. Clean strings
for col in df.select_dtypes(include="object").columns:
df[col] = df[col].str.strip()
# 5. Handle missing values
df = df.dropna(subset=["order_date", "customer_id"])
df["revenue"] = df["revenue"].fillna(df["revenue"].median())
df["category"] = df["category"].fillna("Unknown")
# 6. Remove duplicates
df = df.drop_duplicates(
subset=["customer_id", "order_date", "product_id"],
keep="first"
)
# 7. Add derived columns
df["order_year"] = df["order_date"].dt.year
df["order_month"] = df["order_date"].dt.month
# 8. Remove outliers (IQR method on revenue)
Q1 = df["revenue"].quantile(0.25)
Q3 = df["revenue"].quantile(0.75)
IQR = Q3 - Q1
df = df[
(df["revenue"] >= Q1 - 1.5 * IQR)
& (df["revenue"] <= Q3 + 1.5 * IQR)
]
# 9. Reset index
df = df.reset_index(drop=True)
return df
# Usage
clean_df = wrangle_sales_data("raw_sales.csv")
print(f"Clean dataset: {clean_df.shape[0]} rows, {clean_df.shape[1]} columns")

Pandas vs Polars vs SQL for Data Wrangling
Pandas is the default choice, but it is not the only option. Here is how the three main wrangling tools compare:
| Criteria | Pandas | Polars | SQL |
|---|---|---|---|
| Language | Python | Python / Rust | SQL |
| Speed (1GB file) | Slow (single-threaded) | Fast (multi-threaded, Rust) | Depends on engine |
| Memory usage | High (eager evaluation) | Lower (lazy evaluation) | Engine-managed |
| Syntax | df.groupby().agg() | df.group_by().agg() | GROUP BY |
| Missing values | NaN / None | null (native) | NULL |
| Pivot / Melt | Built-in | Built-in | Varies by engine |
| Learning curve | Moderate | Moderate | Easy for SQL users |
| Best for | Exploratory analysis, prototyping | Large datasets, performance | Database-resident data |
| Ecosystem | Massive (NumPy, Scikit-learn) | Growing | Depends on DB |
When to use each:
- Pandas: You are doing exploratory analysis, your data fits in memory, and you need the broadest library compatibility.
- Polars: Your data is large (1GB+), you need speed, or you want lazy evaluation for memory efficiency.
- SQL: Your data is already in a database and you want transformations to run where the data lives.
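To make the syntax difference concrete, here is a minimal sketch of the same aggregation in both libraries. It assumes a hypothetical sales.csv with product and revenue columns, and a recent Polars release where the method is spelled group_by (as in the table above):

import pandas as pd
import polars as pl

# Pandas: eager evaluation -- the file is read and aggregated immediately
pd_result = (
    pd.read_csv("sales.csv")
    .groupby("product", as_index=False)["revenue"]
    .sum()
)

# Polars: lazy evaluation -- the query is optimized and only executed at .collect()
pl_result = (
    pl.scan_csv("sales.csv")
    .group_by("product")
    .agg(pl.col("revenue").sum())
    .collect()
)

The lazy scan_csv path is what gives Polars the lower memory footprint noted in the comparison: only the columns and rows the query actually needs are loaded.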
Best Practices and Common Pitfalls
Best Practices
- Always explore before cleaning. Run df.info(), df.describe(), and df.isnull().sum() before writing any transformation code.
- Work on a copy. Use df_clean = df.copy() so you can compare results against the original.
- Chain operations. Pandas method chaining keeps pipelines readable:
clean_df = (
df
.rename(columns=str.lower)
.dropna(subset=["customer_id"])
.assign(revenue=lambda x: pd.to_numeric(x["revenue"], errors="coerce"))
.query("revenue > 0")
.drop_duplicates(subset=["customer_id", "order_date"])
.reset_index(drop=True)
)

- Document your decisions. When you fill missing values or remove outliers, add a comment explaining why.
- Validate at each stage. Check row counts, null counts, and value distributions after each transformation.
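One lightweight way to validate between steps without breaking a method chain is a pass-through logger handed to .pipe(). The log_shape helper below is an illustrative sketch, not a built-in:

def log_shape(df, step=""):
    """Print the row/column count at a pipeline step, then pass the DataFrame through unchanged."""
    print(f"{step}: {df.shape[0]} rows, {df.shape[1]} columns")
    return df

clean_df = (
    df
    .pipe(log_shape, step="loaded")
    .dropna(subset=["customer_id"])
    .pipe(log_shape, step="after dropna")
    .drop_duplicates()
    .pipe(log_shape, step="after dedupe")
)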
Common Pitfalls
| Pitfall | What Goes Wrong | Fix |
|---|---|---|
| Ignoring SettingWithCopyWarning | Silent bugs from chained indexing | Use .loc[] or .copy() |
| Filling all NaN with 0 | Zero is a meaningful value in most contexts, so fills distort sums and averages | Use median, mode, or "Unknown" |
| Dropping rows too aggressively | Losing valid data | Only drop when necessary, document the criteria |
| Not checking join results | Duplicated rows from many-to-many joins | Check merged.shape after every merge |
| Hardcoding column names | Pipeline breaks on new data | Use pattern matching or config files |
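The first pitfall is worth a concrete illustration. Under default pandas 2.x settings, a chained-indexing assignment may modify a temporary copy and emit a SettingWithCopyWarning; the discount column here is purely illustrative:

# Chained indexing: the assignment may target a temporary copy and be silently lost
df[df["revenue"] > 1000]["discount"] = 0.1  # emits SettingWithCopyWarning

# Fix 1: a single .loc call on the original DataFrame -- unambiguous
df.loc[df["revenue"] > 1000, "discount"] = 0.1

# Fix 2: if you want an independent subset, make the copy explicit
high_value = df[df["revenue"] > 1000].copy()
high_value["discount"] = 0.1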
AI-Assisted Data Wrangling
Manual data wrangling is time-consuming and error-prone. AI tools can accelerate the process by suggesting transformations, detecting anomalies, and generating cleaning code.
RunCell brings an AI agent directly into your Jupyter notebook. Instead of writing every cleaning step by hand, you can describe what you need in natural language -- "convert the date column to datetime, fill missing revenue with the median, and remove duplicate orders" -- and RunCell generates the Pandas code. This is especially useful for repetitive wrangling tasks across multiple datasets with similar structures.
FAQ
What is data wrangling in simple terms?
Data wrangling is the process of taking raw, messy data and converting it into a clean, structured format suitable for analysis. It includes fixing missing values, correcting data types, removing duplicates, reshaping tables, and merging datasets from multiple sources.
What is the difference between data wrangling and data cleaning?
Data cleaning is one step within data wrangling. Cleaning focuses specifically on fixing errors -- missing values, duplicates, incorrect entries. Data wrangling is broader and includes everything from initial exploration and restructuring to enrichment and validation.
What tools are used for data wrangling?
Python with Pandas is the most common tool. Other options include Polars (faster for large datasets), SQL (for database-resident data), R with dplyr, and commercial tools like Trifacta and Alteryx. For visual exploration before wrangling, PyGWalker provides an interactive drag-and-drop interface on top of Pandas DataFrames.
How much time do data scientists spend on data wrangling?
Industry surveys consistently report that data professionals spend 60-80% of their time on data wrangling and preparation. This is why efficient wrangling practices and automation tools are critical for data team productivity.
What are the steps in data wrangling?
The five main steps are: (1) Discover -- explore the data's shape, types, and quality issues; (2) Structure -- rename columns, reshape tables; (3) Clean -- handle missing values, fix types, remove duplicates; (4) Enrich -- add derived columns, join with external data; (5) Validate -- confirm the output meets quality requirements.
Conclusion
Data wrangling is the unglamorous foundation that makes every downstream analysis possible. Without it, dashboards show wrong numbers, models train on garbage, and business decisions get made with bad information.
The key takeaways: always explore before you clean, use the five-stage pipeline (discover, structure, clean, enrich, validate), pick the right missing-value strategy for each column, validate your output, and document your decisions. Pandas handles most wrangling tasks well; switch to Polars when dataset size demands it and SQL when data lives in a database.
Build your wrangling logic into reusable functions, not one-off scripts. The data will change. The sources will change. Your pipeline should handle both gracefully.