The Ultimate Guide: How to Use Scikit-learn Imputer
When dealing with large datasets, it's almost inevitable to encounter missing values. Handling missing data efficiently is a fundamental step in the data preprocessing journey. Scikit-learn's impute module provides a set of robust strategies for this task. In this article, we'll dive into Scikit-learn's SimpleImputer, IterativeImputer, KNNImputer, and how to leverage them to handle numerical and categorical data.
Want to quickly create Data Visualization from Python Pandas Dataframe with No code?
PyGWalker is a Python library for Exploratory Data Analysis with Visualization. PyGWalker can simplify your Jupyter Notebook data analysis and data visualization workflow by turning your pandas DataFrame (or polars DataFrame) into a Tableau-style user interface for visual exploration.
Unmasking the Imputer
Before we delve into the "how-tos," let's first understand what an Imputer is. Essentially, an imputer is an estimator that fills in missing values in your dataset. For numerical data, it leverages strategies like mean, median, or constant, while for categorical data, it uses the most frequent value or a constant. Also, you can train your model to predict the missing values, depending on the complexity of your data and the resources at your disposal.
In this guide, we'll primarily focus on Scikit-learn’s SimpleImputer, IterativeImputer, and KNNImputer. Plus, we'll walk you through the process of creating a pipeline to impute both categorical and numerical features seamlessly and feed them into a machine learning model.
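To give a feel for the two model-based imputers named above, here is a minimal sketch on a made-up two-column array in which the second column is exactly twice the first. The data and the parameter choices (`n_neighbors=2`, `max_iter=10`, `random_state=0`) are illustrative assumptions, not recommendations.

```python
import numpy as np

# IterativeImputer is still experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

# Toy data: the second column is twice the first, with one gap.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
    [5.0, 10.0],
])

# KNNImputer averages the feature over the nearest rows that have it.
knn_imputer = KNNImputer(n_neighbors=2)
X_knn = knn_imputer.fit_transform(X)

# IterativeImputer models each feature with missing values as a
# function of the other features (BayesianRidge by default).
iterative_imputer = IterativeImputer(max_iter=10, random_state=0)
X_iter = iterative_imputer.fit_transform(X)

print(X_knn[2])
print(X_iter[2])
```

Both imputers use the relationship between columns, which a per-column statistic such as the mean cannot do, at the cost of more computation.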
Setting Up Scikit-learn's Imputer
Scikit-learn’s imputation functions offer a convenient way to deal with missing data using a few lines of code. More so, they allow you to create pipelines, making it easy to reproduce results and improve your machine learning development workflow.
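As a rough sketch of what such a pipeline can look like, the example below imputes one numerical and one categorical column and passes the result to a classifier. The toy DataFrame, its column names (`age`, `city`, `label`), and the choice of LogisticRegression are all hypothetical and stand in for your own data and model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with gaps in both feature columns.
toy = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 51.0, 33.0, np.nan],
    "city": pd.Series(
        ["Paris", "Lyon", np.nan, "Paris", "Lyon", "Paris"], dtype="object"
    ),
    "label": [0, 0, 1, 1, 0, 1],
})
X_toy = toy[["age", "city"]]
y_toy = toy["label"]

# Numerical branch: fill gaps with the median.
numeric_steps = Pipeline([("impute", SimpleImputer(strategy="median"))])

# Categorical branch: fill gaps with the most frequent value, then encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Route each column to the matching branch.
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["age"]),
    ("cat", categorical_steps, ["city"]),
])

# Imputation and the model are fitted together as one estimator.
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression()),
])
model.fit(X_toy, y_toy)
predictions = model.predict(X_toy)
print(predictions)
```

Because the imputers live inside the pipeline, the fill values are learned only from the data passed to `fit`, and the same values are reused whenever the pipeline transforms new data.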
Environment and Dataset
Before we dive into the specifics of using Scikit-learn's Imputer, let's set up our environment and acquire a dataset. For this tutorial, we'll be using the Pima Indians Diabetes Database, a popular dataset in the machine learning community.
This dataset is freely available on Kaggle, and we can download it directly using the Kaggle API. First, though, we need to install the necessary Python libraries:
!pip install numpy pandas scikit-learn kaggle
Next, we'll download the Pima Indians Diabetes Database using the Kaggle API:
!kaggle datasets download -d uciml/pima-indians-diabetes-database
!unzip pima-indians-diabetes-database.zip -d ./dataset
Now that we've got our dataset, let's load it into a pandas DataFrame and take a look:
import pandas as pd
df = pd.read_csv("./dataset/diabetes.csv")
print(df.head())
Dive into the Dataset
Once we have the data loaded, it's essential to understand its structure. The Pima Indians Diabetes Database has eight numerical features and one binary target column, "Outcome". The features relate to various health metrics, such as glucose level, blood pressure, and BMI, while the target column indicates whether or not an individual has diabetes.
First, let's examine the shape of our dataset:
print(df.shape)
With this output, we can confirm that our dataset contains 768 rows and 9 columns.
Unveiling Missing Values
In real-world datasets, missing values are quite common. Therefore, before diving into machine learning modeling, we must identify and appropriately handle these missing values.
Let's check for missing values in our dataset:
missing_values = df.isnull().sum()
print(missing_values)
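In an ideal scenario, this command would return zero for every column. Keep in mind that a zero count only means there are no NaN entries: in this dataset, missing measurements are typically recorded as 0 (for example, a blood pressure of 0) rather than as NaN, so they need to be marked as missing before an imputer can find them. We'll cover how to handle missing values in the next sections.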
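Here is a minimal sketch of marking placeholder zeros as missing, assuming that a 0 in these measurement columns stands for an unrecorded value. The rows below reuse the dataset's column names, but the values are made up so the snippet runs on its own; which columns to treat this way is a judgment call.

```python
import numpy as np
import pandas as pd

# Stand-in rows using the dataset's column names; values are made up.
sample = pd.DataFrame({
    "Glucose": [148, 85, 0],
    "BloodPressure": [72, 0, 64],
    "BMI": [33.6, 26.6, 0.0],
    "Outcome": [1, 0, 1],
})

# Columns where a zero is treated as "not recorded" (an assumption).
zero_means_missing = ["Glucose", "BloodPressure", "BMI"]

# Work on a copy and turn the placeholder zeros into NaN.
marked = sample.copy()
marked[zero_means_missing] = marked[zero_means_missing].replace(0, np.nan)

# isnull() now reports the gaps that were hidden behind zeros.
missing_counts = marked.isnull().sum()
print(missing_counts)
```

The target column "Outcome" is left out on purpose, because 0 is a legitimate class label there.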
Imputing Numerical Values
In the realm of Scikit-learn's Imputer, we distinguish between numerical and categorical imputation. Numerical imputation is the process of replacing missing numerical values with statistical estimates. It's common to use the mean, median, or mode as the replacement value.
For the purposes of our tutorial, let's consider that the 'BloodPressure' column has missing values. Our first task is to confirm the number of missing values:
print(df['BloodPressure'].isnull().sum())
In the upcoming section, we will learn how to impute these missing values using Scikit-learn's SimpleImputer.
Imputing Numerical Values with Scikit-learn's SimpleImputer
Scikit-learn's SimpleImputer provides a straightforward way to handle missing numerical values. It offers various strategies, including replacing missing values with the mean, median, or a constant value. Let's walk through an example of using SimpleImputer to impute the missing values in the 'BloodPressure' column.
from sklearn.impute import SimpleImputer
# Create an instance of SimpleImputer that fills gaps with the column mean
imputer = SimpleImputer(strategy='mean')
# Reshape the column into a 2D array, since the imputer expects 2D input
blood_pressure = df['BloodPressure'].values.reshape(-1, 1)
# Learn the mean from the observed values and impute the missing ones
imputed_blood_pressure = imputer.fit_transform(blood_pressure)
# Flatten the (n_samples, 1) result and write it back to the DataFrame
df['BloodPressure'] = imputed_blood_pressure.ravel()
By setting the strategy parameter to 'mean', the SimpleImputer calculates the mean of the available values and replaces the missing values with that mean value. You can also use 'median' or 'constant' as the strategy, depending on the nature of your data.
After imputing the missing values, it's always a good practice to double-check if any missing values remain:
print(df['BloodPressure'].isnull().sum())
With the missing values imputed, the output should be 0, indicating that there are no remaining missing values in the 'BloodPressure' column.
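To see how the choice of strategy changes the result, the sketch below applies 'mean', 'median', and 'constant' to the same made-up readings, one of which is an outlier. The numbers and the sentinel `fill_value=-1.0` are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up readings with one gap and one outlier (200.0).
readings = np.array([[60.0], [70.0], [np.nan], [80.0], [200.0]])

mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")
# With strategy="constant", the replacement is taken from fill_value.
constant_imputer = SimpleImputer(strategy="constant", fill_value=-1.0)

mean_filled = mean_imputer.fit_transform(readings)
median_filled = median_imputer.fit_transform(readings)
constant_filled = constant_imputer.fit_transform(readings)

# Compare the value each strategy put in place of the gap (row 2).
print(mean_filled[2, 0], median_filled[2, 0], constant_filled[2, 0])
```

The outlier pulls the mean well above the typical reading while the median stays close to it, which is why the median is often preferred for skewed columns.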
Imputing Categorical Values
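Now, let's move on to the imputation of categorical values. For this example, let's consider the 'SkinThickness' column. Strictly speaking, it is a numerical measurement (this dataset has no categorical features), but it serves as a stand-in here because the imputation steps are the same for a categorical column.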
Similar to the previous section, we'll first check how many missing values are present in the 'SkinThickness' column:
print(df['SkinThickness'].isnull().sum())
To impute categorical values, we can use the 'most_frequent' strategy provided by the SimpleImputer. This strategy replaces missing values with the most frequent value in the column.
# Create an instance of SimpleImputer for categorical values
imputer_categorical = SimpleImputer(strategy='most_frequent')
# Reshape the column into a 2D array, since the imputer expects 2D input
skin_thickness = df['SkinThickness'].values.reshape(-1, 1)
# Find the most frequent value and impute the missing entries with it
imputed_skin_thickness = imputer_categorical.fit_transform(skin_thickness)
# Flatten the (n_samples, 1) result and write it back to the DataFrame
df['SkinThickness'] = imputed_skin_thickness.ravel()
After imputing the missing values, let's confirm if any missing values remain in the 'SkinThickness' column:
print(df['SkinThickness'].isnull().sum())
The output should be 0, indicating that all missing values in the 'SkinThickness' column have been successfully imputed.
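For a column that actually holds text labels, the same strategy can be sketched as follows. The `color` column and its values are hypothetical and are not part of the diabetes dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with two missing labels.
colors = pd.DataFrame({
    "color": pd.Series(["red", "blue", np.nan, "red", np.nan], dtype="object")
})

# 'most_frequent' works on string data as well as on numbers.
label_imputer = SimpleImputer(strategy="most_frequent")
filled = label_imputer.fit_transform(colors[["color"]])

# Flatten the (n_samples, 1) result back into the column.
colors["color"] = filled.ravel()
print(colors["color"].tolist())
```

The 'mean' and 'median' strategies are only defined for numerical data, so 'most_frequent' and 'constant' are the options for string columns.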
Conclusion
In this part of the guide, we learned how to use Scikit-learn's SimpleImputer to handle missing values in both numerical and categorical columns. We explored strategies like mean, median, and most frequent to impute missing values based on the nature of the data.
Imputation is an essential step in the data preprocessing pipeline, as it allows us to retain valuable information and ensure that our models can learn from complete datasets. By leveraging Scikit-learn's SimpleImputer, data scientists and machine learning practitioners can efficiently handle missing data and build robust models.