Pandas fillna(): Handle Missing Values in DataFrames
Missing values are the silent saboteur of data analysis. A single NaN hiding in a critical column can cause an aggregation to return NaN, a machine learning model to throw an error at training time, or a dashboard chart to render a blank gap where a trend line should be. Real-world datasets almost always contain gaps -- sensor readings drop out, survey respondents skip questions, API responses return null fields, and CSV imports arrive with empty cells. The question is never whether you will encounter missing data, but how you will handle it. For a broader overview of missing data strategies, see the pandas missing values guide.
The pandas fillna() method is the primary tool for replacing missing values with something meaningful. This guide covers every parameter, demonstrates common fill strategies (scalar, dictionary, forward fill, backward fill, mean/median/mode), compares fillna() against dropna() and interpolate(), and shows how to chain these operations into a clean data pipeline. Every code example is copy-ready with expected output.
Detecting Missing Values Before Filling
Before filling anything, you need to know where the gaps are. Pandas provides two detection functions, each with an alias:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'name': ['Alice', 'Bob', None, 'Diana', 'Eve'],
'age': [28, np.nan, 35, np.nan, 42],
'salary': [55000, 62000, np.nan, 48000, np.nan]
})
# isna() returns True for missing values (alias: isnull())
print(df.isna())
Output:
name age salary
0 False False False
1 False True False
2 True False True
3 False True False
4 False False True
Quick summary of missing counts
# Count missing values per column
print(df.isna().sum())
Output:
name 1
age 2
salary 2
dtype: int64
notna() for the inverse check
# notna() returns True for non-missing values
print(df.notna().sum())
Output:
name 4
age 3
salary 3
dtype: int64
| Function | Returns True when | Alias |
|---|---|---|
| isna() | Value is NaN, None, or NaT | isnull() |
| notna() | Value is not missing | notnull() |
These functions work on both DataFrames and individual Series. Use them to audit your data before deciding on a fill strategy.
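As a quick illustration, the same audit works on a standalone Series:
import pandas as pd
import numpy as np
s = pd.Series([1.0, np.nan, 3.0, np.nan])
print(s.isna().sum())
print(s.notna().sum())
Output:
2
2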
Basic fillna() with a Scalar Value
The simplest use of fillna() replaces every NaN in the DataFrame with a single value:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'product': ['Widget', 'Gadget', 'Gizmo'],
'price': [19.99, np.nan, 29.99],
'stock': [100, 50, np.nan]
})
print("Before:")
print(df)
df_filled = df.fillna(0)
print("\nAfter fillna(0):")
print(df_filled)
Output:
Before:
product price stock
0 Widget 19.99 100.0
1 Gadget NaN 50.0
2 Gizmo 29.99 NaN
After fillna(0):
product price stock
0 Widget 19.99 100.0
1 Gadget 0.00 50.0
2 Gizmo 29.99 0.0
This works, but filling a price column with 0 is misleading -- it suggests the product is free. For string columns, you might fill with "Unknown". The key is choosing a fill value that makes semantic sense for each column.
Full Method Signature
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None)
| Parameter | Type | Default | Description |
|---|---|---|---|
| value | scalar, dict, Series, or DataFrame | None | The value to fill missing entries with |
| method | 'ffill', 'bfill', or None | None | Propagation method for filling gaps (deprecated since pandas 2.1 in favor of ffill()/bfill()) |
| axis | 0 or 1 | None | Axis along which to fill: 0 fills down each column, 1 fills across each row |
| inplace | bool | False | If True, modifies the DataFrame in place |
| limit | int | None | Maximum number of consecutive NaNs to fill |
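The value parameter's flexibility deserves a short illustration: a Series passed as value is aligned on column labels, so you can fill every numeric column with its own statistic in a single call. A minimal sketch:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 5.0, 6.0]})
# The Series from df.mean() has the column names as its index,
# so each column is filled with its own mean.
print(df.fillna(df.mean(numeric_only=True)))
Output:
a b
0 1.0 5.5
1 2.0 5.0
2 3.0 6.0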
fillna() with a Dictionary: Different Values per Column
In most real datasets, each column represents a different type of measurement, and a single fill value does not make sense everywhere. Pass a dictionary to fillna() to specify per-column fill values:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'name': ['Alice', None, 'Charlie', 'Diana'],
'age': [28, 34, np.nan, 45],
'department': ['Engineering', 'Sales', None, 'Marketing'],
'salary': [75000, np.nan, 68000, np.nan]
})
fill_values = {
'name': 'Unknown',
'age': df['age'].median(),
'department': 'Unassigned',
'salary': df['salary'].mean()
}
df_filled = df.fillna(fill_values)
print(df_filled)
Output:
name age department salary
0 Alice 28.0 Engineering 75000.0
1 Unknown 34.0 Sales 71500.0
2 Charlie 34.0 Unassigned 68000.0
3 Diana 45.0 Marketing 71500.0
This is the recommended approach for production data pipelines because it gives you explicit control over what each column receives.
Forward Fill (ffill) and Backward Fill (bfill)
Time-series data and ordered datasets often benefit from propagation-based filling. Forward fill carries the last known value forward; backward fill takes the next known value backward.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'date': pd.date_range('2026-01-01', periods=7, freq='D'),
'temperature': [22.1, np.nan, np.nan, 24.5, np.nan, 26.0, np.nan]
})
print("Original:")
print(df)
print("\nForward fill (ffill):")
print(df.fillna(method='ffill'))
print("\nBackward fill (bfill):")
print(df.fillna(method='bfill'))
Output:
Original:
date temperature
0 2026-01-01 22.1
1 2026-01-02 NaN
2 2026-01-03 NaN
3 2026-01-04 24.5
4 2026-01-05 NaN
5 2026-01-06 26.0
6 2026-01-07 NaN
Forward fill (ffill):
date temperature
0 2026-01-01 22.1
1 2026-01-02 22.1
2 2026-01-03 22.1
3 2026-01-04 24.5
4 2026-01-05 24.5
5 2026-01-06 26.0
6 2026-01-07 26.0
Backward fill (bfill):
date temperature
0 2026-01-01 22.1
1 2026-01-02 24.5
2 2026-01-03 24.5
3 2026-01-04 24.5
4 2026-01-05 26.0
5 2026-01-06 26.0
6 2026-01-07 NaN
Notice that backward fill leaves the last row as NaN because there is no subsequent value to pull from. You can combine both methods to close all gaps:
df_filled = df.fillna(method='ffill').fillna(method='bfill')
print(df_filled)
Note that pandas 2.1 deprecated the method parameter of fillna(). The dedicated df.ffill() and df.bfill() methods are the forward-compatible equivalents of fillna(method='ffill') and fillna(method='bfill').
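The same gap-closing chain with the dedicated methods:
# Equivalent to the chained fillna calls above, minus the
# deprecated method parameter:
df_filled = df.ffill().bfill()
print(df_filled)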
Limiting Propagation with limit
When a sensor drops out for days, forward-filling indefinitely can mask real data gaps. The limit parameter caps how many consecutive NaNs get filled:
import pandas as pd
import numpy as np
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])
print("limit=1:")
print(s.fillna(method='ffill', limit=1))
print("\nlimit=2:")
print(s.fillna(method='ffill', limit=2))
Output:
limit=1:
0 1.0
1 1.0
2 NaN
3 NaN
4 5.0
dtype: float64
limit=2:
0 1.0
1 1.0
2 1.0
3 NaN
4 5.0
dtype: float64
This is critical for time-series data where you want to fill small gaps but flag longer outages for manual review.
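One useful pattern, sketched here with an illustrative limit, is to fill short gaps and then flag whatever remains for review:
import pandas as pd
import numpy as np
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])
# Fill at most one consecutive gap, then report the leftovers.
filled = s.ffill(limit=1)
print("Positions needing manual review:", list(filled.index[filled.isna()]))
Output:
Positions needing manual review: [2, 3]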
fillna() with Mean, Median, and Mode
Statistical imputation replaces missing values with a summary statistic computed from the non-missing values in that column. This is the most common strategy for numerical features before feeding data into a model:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'math_score': [85, np.nan, 92, 78, np.nan, 88],
'reading_score': [np.nan, 76, 81, np.nan, 90, 85],
'grade': ['A', 'B', 'A', np.nan, 'B', np.nan]
})
# Fill numerical columns with their column mean
df['math_score'] = df['math_score'].fillna(df['math_score'].mean())
df['reading_score'] = df['reading_score'].fillna(df['reading_score'].median())
# Fill categorical column with mode (most frequent value)
df['grade'] = df['grade'].fillna(df['grade'].mode()[0])
print(df)
Output:
math_score reading_score grade
0 85.00 83.00 A
1 85.75 76.00 B
2 92.00 81.00 A
3 78.00 83.00 A
4 85.75 90.00 B
5 88.00 85.00 A
| Strategy | Best for | Notes |
|---|---|---|
| mean() | Numerical data with roughly symmetric distributions | Sensitive to outliers |
| median() | Numerical data with skewed distributions or outliers | More robust than mean |
| mode() | Categorical data or discrete numerical values | Returns the most common value; mode()[0] grabs the first if tied |
For machine learning pipelines, consider using sklearn.impute.SimpleImputer which integrates with scikit-learn pipelines and handles train/test split imputation correctly. You can also fill missing values per group using .groupby() combined with transform(), or use .apply() for custom per-column fill logic.
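As a sketch of the groupby-based approach (the column names here are illustrative): transform('mean') returns a Series aligned with the original index, which fillna() can consume directly.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'department': ['Eng', 'Eng', 'Sales', 'Sales'],
'salary': [70000, np.nan, 50000, np.nan]
})
# Each missing salary gets its own department's mean,
# not the global mean.
df['salary'] = df['salary'].fillna(df.groupby('department')['salary'].transform('mean'))
print(df)
Output:
department salary
0 Eng 70000.0
1 Eng 70000.0
2 Sales 50000.0
3 Sales 50000.0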
interpolate() for Numerical Data
When data follows a trend (stock prices, sensor readings, growth metrics), interpolate() estimates missing values based on surrounding data points rather than using a flat fill:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'day': range(1, 8),
'revenue': [1000, np.nan, np.nan, 1600, np.nan, 2000, np.nan]
})
df['fillna_ffill'] = df['revenue'].fillna(method='ffill')
df['interpolated'] = df['revenue'].interpolate(method='linear')
print(df)
Output:
day revenue fillna_ffill interpolated
0 1 1000.0 1000.0 1000.0
1 2 NaN 1000.0 1200.0
2 3 NaN 1000.0 1400.0
3 4 1600.0 1600.0 1600.0
4 5 NaN 1600.0 1800.0
5 6 2000.0 2000.0 2000.0
6 7 NaN 2000.0 2000.0
Notice how interpolate() produces a smooth linear progression (1000, 1200, 1400, 1600, 1800, 2000) while ffill creates flat plateaus. Pandas supports multiple interpolation methods:
| Method | Description |
|---|---|
| 'linear' | Default. Draws a straight line between known points. |
| 'time' | Linear interpolation weighted by time index. |
| 'index' | Uses the actual numerical index values. |
| 'polynomial' | Fits a polynomial of specified order. |
| 'spline' | Fits a spline of specified order for smooth curves. |
Use interpolate() when the data has a natural ordering and trend. Use fillna() when you have a known replacement value or need propagation-based filling.
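To see how 'time' differs from the default, consider irregularly spaced timestamps (the dates below are illustrative):
import pandas as pd
import numpy as np
s = pd.Series(
[10.0, np.nan, 30.0],
index=pd.to_datetime(['2026-01-01', '2026-01-02', '2026-01-05'])
)
# Jan 2 sits one quarter of the way from Jan 1 to Jan 5, so
# 'time' yields 15.0 where positional 'linear' would give 20.0.
print(s.interpolate(method='time'))
Output:
2026-01-01 10.0
2026-01-02 15.0
2026-01-05 30.0
dtype: float64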
The inplace Parameter
Like most pandas methods, fillna() returns a new DataFrame by default. Setting inplace=True modifies the original:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan, 5, 6]})
# Method 1: assignment (recommended)
df_new = df.fillna(0)
print(f"Original unchanged: {df.isna().sum().sum()} NaNs")
print(f"New copy: {df_new.isna().sum().sum()} NaNs")
# Method 2: inplace (modifies original)
df.fillna(0, inplace=True)
print(f"After inplace: {df.isna().sum().sum()} NaNs")Output:
Original unchanged: 2 NaNs
New copy: 0 NaNs
After inplace: 0 NaNs
Modern pandas best practice favors assignment over inplace=True because assignment works naturally in method chains and makes data flow explicit.
Comparison: fillna() vs dropna() vs interpolate()
Choosing the right missing-data strategy depends on your dataset, the missingness pattern, and your downstream use case. Here is a side-by-side comparison:
| Aspect | fillna() | dropna() | interpolate() |
|---|---|---|---|
| What it does | Replaces NaN with a specified value | Removes rows or columns containing NaN | Estimates NaN from surrounding values |
| Row count | Preserved | Reduced | Preserved |
| Best for | Known replacement values, categorical data, statistical imputation | Small percentage of missing rows, or when imputation would distort analysis | Ordered/time-series numerical data with a natural trend |
| Risk | Introduces bias if fill value is poorly chosen | Loses data; can bias results if missingness is not random | Assumes a smooth underlying pattern that may not exist |
| Typical use case | Fill missing survey answers with "No response", fill prices with column mean | Drop rows with no target variable before model training | Fill gaps in daily stock prices or temperature readings |
| Handles categorical data | Yes | Yes (by dropping) | No (numerical only) |
| Chain-friendly | Yes | Yes | Yes |
Decision rule of thumb:
- If less than 5% of rows are missing and the data is missing completely at random, dropna() is safe.
- If you have a meaningful default or can compute a reasonable statistic, use fillna().
- If the data is ordered and numerical with a trend, use interpolate().
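A quick audit along these lines can support that decision (the column names and threshold are illustrative):
import pandas as pd
import numpy as np
df = pd.DataFrame({
'target': [1.0, np.nan, 3.0, 4.0],
'feature': [np.nan, np.nan, 0.5, 0.7]
})
# Fraction missing per column: low fractions are dropna()
# candidates, heavier ones call for fillna() or interpolate().
print(df.isna().mean())
Output:
target 0.25
feature 0.50
dtype: float64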
fillna() on Specific Columns
You do not always want to fill the entire DataFrame. Apply fillna() to individual columns or a subset:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'city': ['NYC', None, 'LA', None, 'Chicago'],
'temperature': [32.1, np.nan, 75.3, np.nan, 28.5],
'humidity': [45, 60, np.nan, np.nan, 55]
})
# Fill only the city column
df['city'] = df['city'].fillna('Unknown')
# Fill only the temperature column with its mean
df['temperature'] = df['temperature'].fillna(df['temperature'].mean())
# Leave humidity NaNs untouched for now
print(df)
Output:
city temperature humidity
0 NYC 32.100000 45.0
1 Unknown 45.300000 60.0
2 LA 75.300000 NaN
3 Unknown 45.300000 NaN
4 Chicago 28.500000 55.0
This selective approach is important when different columns require different treatment -- or when some missing values are intentional (e.g., humidity might not apply to indoor measurements).
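If you later decide the humidity gaps should be filled after all, one way (sketched here) is to select a column subset and fill it in a single pass:
# Median-fill the remaining gaps across a column subset;
# the median Series aligns on column labels.
cols = ['temperature', 'humidity']
df[cols] = df[cols].fillna(df[cols].median())
print(df['humidity'].tolist())
Output:
[45.0, 60.0, 55.0, 55.0, 55.0]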
Chaining fillna() with Other Operations
Pandas method chaining lets you build readable data pipelines. fillna() fits naturally into these chains:
import pandas as pd
import numpy as np
raw = pd.DataFrame({
'customer_id': [101, 102, 101, 103, 102, 104],
'purchase': [25.0, np.nan, 30.0, np.nan, 15.0, np.nan],
'channel': ['web', 'store', None, 'web', None, 'store']
})
result = (
raw
.fillna({'purchase': 0, 'channel': 'unknown'})
.drop_duplicates(subset=['customer_id'], keep='first')
.sort_values('customer_id')
.reset_index(drop=True)
)
print(result)
Output:
customer_id purchase channel
0 101 25.0 web
1 102 0.0 store
2 103 0.0 web
3 104 0.0 store
This pipeline fills missing values, deduplicates by customer ID, sorts, and resets the index in a single readable expression.
Real-World Pipeline: Cleaning Sales Data
Here is a more realistic chain that combines multiple cleaning steps:
import pandas as pd
import numpy as np
sales = pd.DataFrame({
'date': ['2026-01-01', '2026-01-02', '2026-01-03', '2026-01-04', '2026-01-05'],
'product': ['Widget', None, 'Widget', 'Gadget', None],
'units': [10, np.nan, 15, np.nan, 8],
'unit_price': [9.99, 9.99, np.nan, 14.99, np.nan],
'region': ['East', 'East', None, 'West', 'West']
})
clean = (
sales
.assign(date=lambda d: pd.to_datetime(d['date']))
.fillna({
'product': 'Unknown',
'region': 'Unassigned',
'units': sales['units'].median(),
'unit_price': sales['unit_price'].median()
})
.assign(total=lambda d: d['units'] * d['unit_price'])
.sort_values('date')
.reset_index(drop=True)
)
print(clean)
Output:
date product units unit_price region total
0 2026-01-01 Widget 10.0 9.99 East 99.90
1 2026-01-02 Unknown 10.0 9.99 East 99.90
2 2026-01-03 Widget 15.0 9.99 Unassigned 149.85
3 2026-01-04 Gadget 10.0 14.99 West 149.90
4 2026-01-05 Unknown 8.0 9.99 West 79.92
The assign() calls create or transform columns, fillna() handles the gaps, and the chain flows top to bottom in logical order.
Visualize Missing Data Patterns with PyGWalker
Before choosing a fill strategy, it helps to see where the missing values are concentrated. Are they scattered randomly, clustered in certain columns, or correlated with specific time periods? Visual inspection often reveals patterns that summary statistics miss.
PyGWalker is an open-source Python library that turns any pandas DataFrame into an interactive, Tableau-like visualization interface directly in Jupyter Notebook. You can drag columns onto axes, switch chart types, and filter data with clicks instead of writing matplotlib boilerplate.
import pandas as pd
import pygwalker as pyg
# Load your data and mark missing patterns
df = pd.read_csv('your_data.csv')
# Add a column counting missing values per row
df['missing_count'] = df.isna().sum(axis=1)
# Launch interactive explorer
walker = pyg.walk(df)
Inside the PyGWalker interface, you can create bar charts showing the count of missing values per column, heatmaps revealing which rows have the most gaps, and scatter plots to check if missingness correlates with other variables. This kind of visual audit often changes which fill strategy you choose.
Install PyGWalker with pip install pygwalker, or try it in Google Colab.
FAQ
What is the difference between fillna() and dropna()?
fillna() replaces missing values with a value you specify, keeping all rows intact. dropna() removes entire rows (or columns) that contain missing values. Use fillna() when you have a reasonable replacement value and want to preserve your row count. Use dropna() when the missing rows are few and imputation would introduce unacceptable bias.
Can I fill NaN values with the mean of a column?
Yes. Use df['column'] = df['column'].fillna(df['column'].mean()). This computes the mean from the non-missing values and fills every NaN in that column with the result. For skewed data, median() is often a better choice because it is less affected by extreme outliers.
What does the limit parameter do in fillna()?
The limit parameter caps the maximum number of consecutive NaN values that get filled. For example, df.fillna(method='ffill', limit=2) (or the modern equivalent df.ffill(limit=2)) will forward-fill at most 2 consecutive gaps. Any longer sequence of missing values will be only partially filled, leaving the remaining gaps as NaN. This is useful for time-series data where you want to fill short gaps but flag extended outages.
How do I fill NaN with different values for different columns?
Pass a dictionary to fillna() where keys are column names and values are the fill values: df.fillna({'age': 0, 'name': 'Unknown', 'salary': df['salary'].median()}). Each column gets its own fill value, and columns not listed in the dictionary are left unchanged.
Does fillna() change the original DataFrame?
No, by default fillna() returns a new DataFrame and the original remains unchanged. To modify the original, either use assignment (df = df.fillna(0)) or pass inplace=True. The assignment approach is recommended because it works with method chaining and makes the data flow explicit.
Conclusion
Missing values are inevitable in real-world data. The pandas fillna() method gives you precise control over how to handle them:
- Use scalar fillna for simple, uniform replacements across the entire DataFrame.
- Use dictionary fillna to apply different fill strategies per column -- the most common pattern in production code.
- Use forward fill (ffill) and backward fill (bfill) for ordered and time-series data where propagating known values makes sense.
- Use mean, median, or mode for statistical imputation of numerical and categorical columns.
- Use interpolate() when the data follows a natural trend and you want smooth estimated values rather than flat fills.
- Use the limit parameter to prevent propagation-based methods from filling excessively long gaps.
- Prefer assignment over inplace=True for cleaner, more readable code.
- Always detect and audit missing values with isna() and notna() before choosing a fill strategy.
Once your missing values are handled, tools like PyGWalker let you interactively explore the cleaned data without writing chart code -- helping you verify that your fill logic produced sensible results and move straight into analysis.
Related Guides
- Pandas Missing Values: Complete Guide -- broader overview of detecting, analyzing, and handling missing data
- Remove Duplicate Rows -- clean duplicates alongside missing values
- Pandas GroupBy -- fill missing values per group with groupby + transform
- Pandas Apply -- apply custom fill logic across rows or columns
- Pandas Data Cleaning Guide -- end-to-end data cleaning workflow