
Pandas fillna(): Handle Missing Values in DataFrames


Missing values are the silent saboteur of data analysis. A single NaN hiding in a critical column can cause an aggregation to return NaN, a machine learning model to throw an error at training time, or a dashboard chart to render a blank gap where a trend line should be. Real-world datasets almost always contain gaps -- sensor readings drop out, survey respondents skip questions, API responses return null fields, and CSV imports arrive with empty cells. The question is never whether you will encounter missing data, but how you will handle it. For a broader overview of missing data strategies, see the pandas missing values guide.

The pandas fillna() method is the primary tool for replacing missing values with something meaningful. This guide covers every parameter, demonstrates common fill strategies (scalar, dictionary, forward fill, backward fill, mean/median/mode), compares fillna() against dropna() and interpolate(), and shows how to chain these operations into a clean data pipeline. Every code example is copy-ready with expected output.


Detecting Missing Values Before Filling

Before filling anything, you need to know where the gaps are. Pandas provides three detection functions:

import pandas as pd
import numpy as np
 
df = pd.DataFrame({
    'name': ['Alice', 'Bob', None, 'Diana', 'Eve'],
    'age': [28, np.nan, 35, np.nan, 42],
    'salary': [55000, 62000, np.nan, 48000, np.nan]
})
 
# isna() returns True for missing values (alias: isnull())
print(df.isna())

Output:

    name    age  salary
0  False  False   False
1  False   True   False
2   True  False    True
3  False   True   False
4  False  False    True

Quick summary of missing counts

# Count missing values per column
print(df.isna().sum())

Output:

name      1
age       2
salary    2
dtype: int64

notna() for the inverse check

# notna() returns True for non-missing values
print(df.notna().sum())

Output:

name      4
age       3
salary    3
dtype: int64

Function   Returns True when            Alias
isna()     Value is NaN, None, or NaT   isnull()
notna()    Value is not missing         notnull()

These functions work on both DataFrames and individual Series. Use them to audit your data before deciding on a fill strategy.
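Beyond per-column counts, it often helps to pull out the actual rows that contain gaps before deciding how to fill them. A short sketch, reusing the DataFrame from above:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['Alice', 'Bob', None, 'Diana', 'Eve'],
    'age': [28, np.nan, 35, np.nan, 42],
    'salary': [55000, 62000, np.nan, 48000, np.nan]
})

# Select rows with at least one missing value
rows_with_gaps = df[df.isna().any(axis=1)]
print(rows_with_gaps)

# Share of missing values per column, as a percentage
print((df.isna().mean() * 100).round(1))
```

The boolean mask from isna().any(axis=1) keeps only rows with gaps (4 of the 5 rows here), and isna().mean() gives the missing fraction per column, which is often more useful than raw counts on large datasets.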

Basic fillna() with a Scalar Value

The simplest use of fillna() replaces every NaN in the DataFrame with a single value:

import pandas as pd
import numpy as np
 
df = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Gizmo'],
    'price': [19.99, np.nan, 29.99],
    'stock': [100, 50, np.nan]
})
 
print("Before:")
print(df)
 
df_filled = df.fillna(0)
 
print("\nAfter fillna(0):")
print(df_filled)

Output:

Before:
  product  price  stock
0  Widget  19.99  100.0
1  Gadget    NaN   50.0
2   Gizmo  29.99    NaN

After fillna(0):
  product  price  stock
0  Widget  19.99  100.0
1  Gadget   0.00   50.0
2   Gizmo  29.99    0.0

This works, but filling a price column with 0 is misleading -- it suggests the product is free. For string columns, you might fill with "Unknown". The key is choosing a fill value that makes semantic sense for each column.

Full Method Signature

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None)

Parameter   Type                                 Default   Description
value       scalar, dict, Series, or DataFrame   None      The value to fill missing entries with
method      'ffill', 'bfill', or None            None      Propagation method for filling gaps
axis        0 or 1                               None      Fill along rows (0) or columns (1)
inplace     bool                                 False     If True, modifies the DataFrame in place
limit       int                                  None      Maximum number of consecutive NaNs to fill

fillna() with a Dictionary: Different Values per Column

In most real datasets, each column represents a different type of measurement, and a single fill value does not make sense everywhere. Pass a dictionary to fillna() to specify per-column fill values:

import pandas as pd
import numpy as np
 
df = pd.DataFrame({
    'name': ['Alice', None, 'Charlie', 'Diana'],
    'age': [28, 34, np.nan, 45],
    'department': ['Engineering', 'Sales', None, 'Marketing'],
    'salary': [75000, np.nan, 68000, np.nan]
})
 
fill_values = {
    'name': 'Unknown',
    'age': df['age'].median(),
    'department': 'Unassigned',
    'salary': df['salary'].mean()
}
 
df_filled = df.fillna(fill_values)
print(df_filled)

Output:

      name   age   department   salary
0    Alice  28.0  Engineering  75000.0
1  Unknown  34.0        Sales  71500.0
2  Charlie  34.0   Unassigned  68000.0
3    Diana  45.0    Marketing  71500.0

This is the recommended approach for production data pipelines because it gives you explicit control over what each column receives.

Forward Fill (ffill) and Backward Fill (bfill)

Time-series data and ordered datasets often benefit from propagation-based filling. Forward fill carries the last known value forward; backward fill takes the next known value backward.

import pandas as pd
import numpy as np
 
df = pd.DataFrame({
    'date': pd.date_range('2026-01-01', periods=7, freq='D'),
    'temperature': [22.1, np.nan, np.nan, 24.5, np.nan, 26.0, np.nan]
})
 
print("Original:")
print(df)
 
print("\nForward fill (ffill):")
print(df.fillna(method='ffill'))
 
print("\nBackward fill (bfill):")
print(df.fillna(method='bfill'))

Output:

Original:
        date  temperature
0 2026-01-01         22.1
1 2026-01-02          NaN
2 2026-01-03          NaN
3 2026-01-04         24.5
4 2026-01-05          NaN
5 2026-01-06         26.0
6 2026-01-07          NaN

Forward fill (ffill):
        date  temperature
0 2026-01-01         22.1
1 2026-01-02         22.1
2 2026-01-03         22.1
3 2026-01-04         24.5
4 2026-01-05         24.5
5 2026-01-06         26.0
6 2026-01-07         26.0

Backward fill (bfill):
        date  temperature
0 2026-01-01         22.1
1 2026-01-02         24.5
2 2026-01-03         24.5
3 2026-01-04         24.5
4 2026-01-05         26.0
5 2026-01-06         26.0
6 2026-01-07          NaN

Notice that backward fill leaves the last row as NaN because there is no subsequent value to pull from. You can combine both methods to close all gaps:

df_filled = df.fillna(method='ffill').fillna(method='bfill')
print(df_filled)

Note that pandas 2.1 deprecated the method parameter of fillna(). The standalone df.ffill() and df.bfill() methods are the direct replacements (and accept the same limit argument), so prefer them in new code.
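A minimal sketch of the standalone methods, which chain the same way as fillna():

```python
import pandas as pd
import numpy as np

s = pd.Series([22.1, np.nan, np.nan, 24.5, np.nan])

# Forward fill first, then backward fill to close any remaining gaps
filled = s.ffill().bfill()
print(filled.tolist())  # [22.1, 22.1, 22.1, 24.5, 24.5]
```

The trailing NaN that ffill() handles here would need bfill() only if the Series started with a gap; chaining both covers either case.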

Limiting Propagation with limit

When a sensor drops out for days, forward-filling indefinitely can mask real data gaps. The limit parameter caps how many consecutive NaNs get filled:

import pandas as pd
import numpy as np
 
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])
 
print("limit=1:")
print(s.fillna(method='ffill', limit=1))
 
print("\nlimit=2:")
print(s.fillna(method='ffill', limit=2))

Output:

limit=1:
0    1.0
1    1.0
2    NaN
3    NaN
4    5.0
dtype: float64

limit=2:
0    1.0
1    1.0
2    1.0
3    NaN
4    5.0
dtype: float64

This is critical for time-series data where you want to fill small gaps but flag longer outages for manual review.
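One way to act on that advice is to fill the short gaps and then collect whatever positions are still missing. A sketch, assuming the integer index marks observation order:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0, np.nan, 7.0])

# Fill runs of at most one consecutive NaN
filled = s.ffill(limit=1)

# Anything still missing belongs to a longer outage -- flag it for review
flagged = filled[filled.isna()].index.tolist()
print(flagged)  # [2, 3]
```

The single-NaN gap at position 5 is filled, while the three-NaN run is only partially filled, leaving positions 2 and 3 flagged.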

fillna() with Mean, Median, and Mode

Statistical imputation replaces missing values with a summary statistic computed from the non-missing values in that column. This is the most common strategy for numerical features before feeding data into a model:

import pandas as pd
import numpy as np
 
df = pd.DataFrame({
    'math_score': [85, np.nan, 92, 78, np.nan, 88],
    'reading_score': [np.nan, 76, 81, np.nan, 90, 85],
    'grade': ['A', 'B', 'A', np.nan, 'B', np.nan]
})
 
# Fill numerical columns with their column mean
df['math_score'] = df['math_score'].fillna(df['math_score'].mean())
df['reading_score'] = df['reading_score'].fillna(df['reading_score'].median())
 
# Fill categorical column with mode (most frequent value)
df['grade'] = df['grade'].fillna(df['grade'].mode()[0])
 
print(df)

Output:

   math_score  reading_score grade
0       85.00          83.00     A
1       85.75          76.00     B
2       92.00          81.00     A
3       78.00          83.00     A
4       85.75          90.00     B
5       88.00          85.00     A

Strategy   Best for                                               Notes
mean()     Numerical data with roughly symmetric distributions    Sensitive to outliers
median()   Numerical data with skewed distributions or outliers   More robust than the mean
mode()     Categorical data or discrete numerical values          Returns the most common value; mode()[0] grabs the first if tied

For machine learning pipelines, consider sklearn.impute.SimpleImputer, which integrates with scikit-learn pipelines and handles train/test imputation correctly (statistics are fit on the training set and reused on the test set). You can also fill missing values per group by combining .groupby() with transform(), or use .apply() for custom per-column fill logic.
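The groupby-based approach mentioned above can be sketched like this -- each group's missing values get that group's own statistic rather than a global one (the column and group names here are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'department': ['Eng', 'Eng', 'Sales', 'Sales', 'Eng'],
    'salary': [70000.0, np.nan, 50000.0, np.nan, 80000.0]
})

# transform('mean') returns a Series aligned to the original index,
# holding each row's group mean -- which makes it a valid fillna() argument
df['salary'] = df['salary'].fillna(
    df.groupby('department')['salary'].transform('mean')
)
print(df['salary'].tolist())  # [70000.0, 75000.0, 50000.0, 50000.0, 80000.0]
```

The missing Eng salary becomes the Eng mean (75000) and the missing Sales salary becomes the Sales mean (50000), instead of both receiving the overall average.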

interpolate() for Numerical Data

When data follows a trend (stock prices, sensor readings, growth metrics), interpolate() estimates missing values based on surrounding data points rather than using a flat fill:

import pandas as pd
import numpy as np
 
df = pd.DataFrame({
    'day': range(1, 8),
    'revenue': [1000, np.nan, np.nan, 1600, np.nan, 2000, np.nan]
})
 
df['fillna_ffill'] = df['revenue'].fillna(method='ffill')
df['interpolated'] = df['revenue'].interpolate(method='linear')
 
print(df)

Output:

   day  revenue  fillna_ffill  interpolated
0    1   1000.0        1000.0        1000.0
1    2      NaN        1000.0        1200.0
2    3      NaN        1000.0        1400.0
3    4   1600.0        1600.0        1600.0
4    5      NaN        1600.0        1800.0
5    6   2000.0        2000.0        2000.0
6    7      NaN        2000.0        2000.0

Notice how interpolate() produces a smooth linear progression (1000, 1200, 1400, 1600, 1800, 2000) while ffill creates flat plateaus. Pandas supports multiple interpolation methods:

Method         Description
'linear'       Default. Draws a straight line between known points.
'time'         Linear interpolation weighted by a datetime index.
'index'        Uses the actual numerical index values.
'polynomial'   Fits a polynomial of specified order.
'spline'       Fits a spline of specified order for smooth curves.

Use interpolate() when the data has a natural ordering and trend. Use fillna() when you have a known replacement value or need propagation-based filling.
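The difference between 'linear' and 'time' matters most with an irregular datetime index, where 'time' weights the estimate by elapsed time instead of row position. A small sketch:

```python
import pandas as pd
import numpy as np

# Unequally spaced timestamps: one day, then two days, between readings
s = pd.Series(
    [10.0, np.nan, 40.0],
    index=pd.to_datetime(['2026-01-01', '2026-01-02', '2026-01-04'])
)

# 'time' lands one third of the way from 10 to 40, since Jan 2
# is one day into a three-day gap
print(s.interpolate(method='time').round(2).tolist())    # [10.0, 20.0, 40.0]

# position-based 'linear' ignores the index and gives the midpoint
print(s.interpolate(method='linear').round(2).tolist())  # [10.0, 25.0, 40.0]
```

On an evenly spaced datetime index the two methods agree, so 'time' only pays off when the sampling interval varies.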

The inplace Parameter

Like most pandas methods, fillna() returns a new DataFrame by default. Setting inplace=True modifies the original:

import pandas as pd
import numpy as np
 
df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan, 5, 6]})
 
# Method 1: assignment (recommended)
df_new = df.fillna(0)
print(f"Original unchanged: {df.isna().sum().sum()} NaNs")
print(f"New copy: {df_new.isna().sum().sum()} NaNs")
 
# Method 2: inplace (modifies original)
df.fillna(0, inplace=True)
print(f"After inplace: {df.isna().sum().sum()} NaNs")

Output:

Original unchanged: 2 NaNs
New copy: 0 NaNs
After inplace: 0 NaNs

Modern pandas best practice favors assignment over inplace=True because assignment works naturally in method chains and makes data flow explicit.

Comparison: fillna() vs dropna() vs interpolate()

Choosing the right missing-data strategy depends on your dataset, the missingness pattern, and your downstream use case. Here is a side-by-side comparison:

fillna()
  What it does: Replaces NaN with a specified value
  Row count: Preserved
  Best for: Known replacement values, categorical data, statistical imputation
  Risk: Introduces bias if the fill value is poorly chosen
  Typical use case: Fill missing survey answers with "No response", or fill prices with the column mean
  Handles categorical data: Yes

dropna()
  What it does: Removes rows or columns containing NaN
  Row count: Reduced
  Best for: Small percentage of missing rows, or when imputation would distort analysis
  Risk: Loses data; can bias results if missingness is not random
  Typical use case: Drop rows with no target variable before model training
  Handles categorical data: Yes (by dropping)

interpolate()
  What it does: Estimates NaN from surrounding values
  Row count: Preserved
  Best for: Ordered/time-series numerical data with a natural trend
  Risk: Assumes a smooth underlying pattern that may not exist
  Typical use case: Fill gaps in daily stock prices or temperature readings
  Handles categorical data: No (numerical only)

All three methods are chain-friendly and slot naturally into method pipelines.

Decision rule of thumb:

  1. If less than 5% of rows are missing and the data is missing completely at random, dropna() is safe.
  2. If you have a meaningful default or can compute a reasonable statistic, use fillna().
  3. If the data is ordered and numerical with a trend, use interpolate().
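Those rules are easier to apply with the missing percentages in front of you. A small helper sketch (missing_report is a hypothetical name, not a pandas function):

```python
import pandas as pd
import numpy as np

def missing_report(df):
    """Hypothetical helper: percent missing per column, worst first."""
    pct = df.isna().mean().mul(100).round(1)
    return pct[pct > 0].sort_values(ascending=False)

df = pd.DataFrame({
    'a': [1, np.nan, 3, 4],
    'b': [np.nan, np.nan, 3, 4],
    'c': [1, 2, 3, 4]
})
print(missing_report(df))  # b 50.0, a 25.0 (c has no gaps, so it is omitted)
```

Columns near the top of the report are candidates for careful imputation or exclusion; columns below a few percent are usually safe for dropna().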

fillna() on Specific Columns

You do not always want to fill the entire DataFrame. Apply fillna() to individual columns or a subset:

import pandas as pd
import numpy as np
 
df = pd.DataFrame({
    'city': ['NYC', None, 'LA', None, 'Chicago'],
    'temperature': [32.1, np.nan, 75.3, np.nan, 28.5],
    'humidity': [45, 60, np.nan, np.nan, 55]
})
 
# Fill only the city column
df['city'] = df['city'].fillna('Unknown')
 
# Fill only the temperature column with its mean
df['temperature'] = df['temperature'].fillna(df['temperature'].mean())
 
# Leave humidity NaNs untouched for now
print(df)

Output:

      city  temperature  humidity
0      NYC    32.100000      45.0
1  Unknown    45.300000      60.0
2       LA    75.300000       NaN
3  Unknown    45.300000       NaN
4  Chicago    28.500000      55.0

This selective approach is important when different columns require different treatment -- or when some missing values are intentional (e.g., humidity might not apply to indoor measurements).

Chaining fillna() with Other Operations

Pandas method chaining lets you build readable data pipelines. fillna() fits naturally into these chains:

import pandas as pd
import numpy as np
 
raw = pd.DataFrame({
    'customer_id': [101, 102, 101, 103, 102, 104],
    'purchase': [25.0, np.nan, 30.0, np.nan, 15.0, np.nan],
    'channel': ['web', 'store', None, 'web', None, 'store']
})
 
result = (
    raw
    .fillna({'purchase': 0, 'channel': 'unknown'})
    .drop_duplicates(subset=['customer_id'], keep='first')
    .sort_values('customer_id')
    .reset_index(drop=True)
)
 
print(result)

Output:

   customer_id  purchase  channel
0          101      25.0      web
1          102       0.0    store
2          103       0.0      web
3          104       0.0    store

This pipeline fills missing values, deduplicates by customer ID, sorts, and resets the index in a single readable expression.

Real-World Pipeline: Cleaning Sales Data

Here is a more realistic chain that combines multiple cleaning steps:

import pandas as pd
import numpy as np
 
sales = pd.DataFrame({
    'date': ['2026-01-01', '2026-01-02', '2026-01-03', '2026-01-04', '2026-01-05'],
    'product': ['Widget', None, 'Widget', 'Gadget', None],
    'units': [10, np.nan, 15, np.nan, 8],
    'unit_price': [9.99, 9.99, np.nan, 14.99, np.nan],
    'region': ['East', 'East', None, 'West', 'West']
})
 
clean = (
    sales
    .assign(date=lambda d: pd.to_datetime(d['date']))
    .fillna({
        'product': 'Unknown',
        'region': 'Unassigned',
        'units': sales['units'].median(),
        'unit_price': sales['unit_price'].median()
    })
    .assign(total=lambda d: d['units'] * d['unit_price'])
    .sort_values('date')
    .reset_index(drop=True)
)
 
print(clean)

Output:

        date  product  units  unit_price      region   total
0 2026-01-01   Widget   10.0        9.99        East   99.90
1 2026-01-02  Unknown   10.0        9.99        East   99.90
2 2026-01-03   Widget   15.0        9.99  Unassigned  149.85
3 2026-01-04   Gadget   10.0       14.99        West  149.90
4 2026-01-05  Unknown    8.0        9.99        West   79.92

The assign() calls create or transform columns, fillna() handles the gaps, and the chain flows top to bottom in logical order.

Visualize Missing Data Patterns with PyGWalker

Before choosing a fill strategy, it helps to see where the missing values are concentrated. Are they scattered randomly, clustered in certain columns, or correlated with specific time periods? Visual inspection often reveals patterns that summary statistics miss.

PyGWalker is an open-source Python library that turns any pandas DataFrame into an interactive, Tableau-like visualization interface directly in Jupyter Notebook. You can drag columns onto axes, switch chart types, and filter data with clicks instead of writing matplotlib boilerplate.

import pandas as pd
import pygwalker as pyg
 
# Load your data and mark missing patterns
df = pd.read_csv('your_data.csv')
 
# Add a column counting missing values per row
df['missing_count'] = df.isna().sum(axis=1)
 
# Launch interactive explorer
walker = pyg.walk(df)

Inside the PyGWalker interface, you can create bar charts showing the count of missing values per column, heatmaps revealing which rows have the most gaps, and scatter plots to check if missingness correlates with other variables. This kind of visual audit often changes which fill strategy you choose.

Install PyGWalker with pip install pygwalker or try it in Google Colab.

FAQ

What is the difference between fillna() and dropna()?

fillna() replaces missing values with a value you specify, keeping all rows intact. dropna() removes entire rows (or columns) that contain missing values. Use fillna() when you have a reasonable replacement value and want to preserve your row count. Use dropna() when the missing rows are few and imputation would introduce unacceptable bias.

Can I fill NaN values with the mean of a column?

Yes. Use df['column'] = df['column'].fillna(df['column'].mean()). This computes the mean from the non-missing values and fills every NaN in that column with the result. For skewed data, median() is often a better choice because it is less affected by extreme outliers.

What does the limit parameter do in fillna()?

The limit parameter caps the maximum number of consecutive NaN values that get filled. For example, df.fillna(method='ffill', limit=2) will forward-fill at most 2 consecutive gaps. Any longer sequence of missing values will be only partially filled, leaving the remaining gaps as NaN. This is useful for time-series data where you want to fill short gaps but flag extended outages.

How do I fill NaN with different values for different columns?

Pass a dictionary to fillna() where keys are column names and values are the fill values: df.fillna({'age': 0, 'name': 'Unknown', 'salary': df['salary'].median()}). Each column gets its own fill value, and columns not listed in the dictionary are left unchanged.

Does fillna() change the original DataFrame?

No, by default fillna() returns a new DataFrame and the original remains unchanged. To modify the original, either use assignment (df = df.fillna(0)) or pass inplace=True. The assignment approach is recommended because it works with method chaining and makes the data flow explicit.

Conclusion

Missing values are inevitable in real-world data. The pandas fillna() method gives you precise control over how to handle them:

  • Use scalar fillna for simple, uniform replacements across the entire DataFrame.
  • Use dictionary fillna to apply different fill strategies per column -- the most common pattern in production code.
  • Use forward fill (ffill) and backward fill (bfill) for ordered and time-series data where propagating known values makes sense.
  • Use mean, median, or mode for statistical imputation of numerical and categorical columns.
  • Use interpolate() when the data follows a natural trend and you want smooth estimated values rather than flat fills.
  • Use the limit parameter to prevent propagation-based methods from filling excessively long gaps.
  • Prefer assignment over inplace=True for cleaner, more readable code.
  • Always detect and audit missing values with isna() and notna() before choosing a fill strategy.

Once your missing values are handled, tools like PyGWalker let you interactively explore the cleaned data without writing chart code -- helping you verify that your fill logic produced sensible results and move straight into analysis.
