Confusion Matrix in Sklearn: How to Evaluate Classification Models

Your classification model reports 95% accuracy, so you deploy it. Then you discover it misses 80% of the positive cases you actually care about -- fraudulent transactions, disease diagnoses, defective products. Accuracy alone hides critical information about where and how a model fails.

A single accuracy number collapses all types of errors into one metric. A spam filter that lets every spam email through and correctly classifies all legitimate emails still achieves high accuracy if spam is only 5% of the total. You need to see the full picture: how many positives does the model catch, how many negatives does it misclassify, and where exactly the errors fall.

The confusion matrix breaks model performance into its four components -- true positives, true negatives, false positives, and false negatives. Combined with derived metrics like precision, recall, and F1-score, it gives you actionable insight into what your model gets right and wrong. Scikit-learn provides confusion_matrix, classification_report, and ConfusionMatrixDisplay to make this analysis straightforward.

What Is a Confusion Matrix?

A confusion matrix is a table that compares predicted labels against actual labels for a classification model. For binary classification, it is a 2x2 grid:

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |

Each cell counts the number of samples that fall into that category:

  • True Positive (TP): Model predicted positive, and it was actually positive. Correct.
  • True Negative (TN): Model predicted negative, and it was actually negative. Correct.
  • False Positive (FP): Model predicted positive, but it was actually negative. Type I error.
  • False Negative (FN): Model predicted negative, but it was actually positive. Type II error.
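These four counts can be tallied directly from the label pairs before reaching for any library -- a minimal sketch with illustrative data:

```python
# Tally TP, TN, FP, FN by comparing each prediction to its true label
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

print(tp, tn, fp, fn)
# 4 4 1 1
```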

Basic Confusion Matrix with Sklearn

from sklearn.metrics import confusion_matrix
import numpy as np
 
# Actual and predicted labels
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1]
 
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[6 1]
#  [2 6]]

Reading this output: sklearn sorts the class labels, so row 0 = actual negative (label 0) and row 1 = actual positive (label 1).

| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | TN = 6 | FP = 1 |
| Actual 1 | FN = 2 | TP = 6 |

So the model correctly identified 6 negatives and 6 positives, while making 1 false positive and 2 false negative errors.
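If you prefer the textbook layout with the positive class first, the labels parameter controls row and column order -- a quick sketch with the same arrays:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1]

# labels=[1, 0] puts the positive class in row 0 / column 0
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm)
# [[6 2]
#  [1 6]]
```

Now the top-left cell is TP and the bottom-right is TN, matching the table in the first section.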

Extracting Individual Values

from sklearn.metrics import confusion_matrix
 
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1]
 
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
 
print(f"True Negatives:  {tn}")
print(f"False Positives: {fp}")
print(f"False Negatives: {fn}")
print(f"True Positives:  {tp}")
# True Negatives:  6
# False Positives: 1
# False Negatives: 2
# True Positives:  6

Precision, Recall, F1-Score, and Accuracy

These metrics are derived directly from the confusion matrix:

| Metric | Formula | What It Answers |
|---|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Of all predictions, how many were correct? |
| Precision | TP / (TP + FP) | Of predicted positives, how many were actually positive? |
| Recall (Sensitivity) | TP / (TP + FN) | Of actual positives, how many did we catch? |
| Specificity | TN / (TN + FP) | Of actual negatives, how many did we correctly identify? |
| F1-Score | 2 * (Precision * Recall) / (Precision + Recall) | Harmonic mean of precision and recall |

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score
)
 
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1]
 
print(f"Accuracy:  {accuracy_score(y_true, y_pred):.4f}")
print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall:    {recall_score(y_true, y_pred):.4f}")
print(f"F1-Score:  {f1_score(y_true, y_pred):.4f}")
# Accuracy:  0.8000
# Precision: 0.8571
# Recall:    0.7500
# F1-Score:  0.8000
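As a sanity check, the same numbers fall out of the formula table when applied to the raw confusion-matrix counts:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1]

# Unpack counts and apply the formulas by hand
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, "
      f"Recall: {recall:.4f}, F1: {f1:.4f}")
# Accuracy: 0.8000, Precision: 0.8571, Recall: 0.7500, F1: 0.8000
```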

When to Prioritize Precision vs Recall

| Scenario | Prioritize | Why |
|---|---|---|
| Spam detection | Precision | False positives (legit email marked spam) annoy users |
| Disease screening | Recall | False negatives (missed disease) are dangerous |
| Fraud detection | Recall | Missing fraud is more costly than investigating false alarms |
| Search engine results | Precision | Irrelevant results degrade user experience |
| Manufacturing defect detection | Recall | Defective products reaching customers is costly |
| Content recommendation | Precision | Irrelevant recommendations reduce engagement |
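In practice this trade-off is usually tuned through the decision threshold rather than by changing the model. A minimal sketch on synthetic data (the make_classification settings and the 0.2 threshold are illustrative): lowering the threshold below the default 0.5 raises recall at the expense of precision.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 90% negative, 10% positive)
X, y = make_classification(
    n_samples=2000, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

# Compare the default 0.5 threshold to a lower one
for threshold in [0.5, 0.2]:
    preds = (probs >= threshold).astype(int)
    p = precision_score(y_test, preds)
    r = recall_score(y_test, preds)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Lowering the threshold can only flip negatives to positives, so recall never decreases; whether the precision hit is acceptable depends on the error costs in the table above.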

Classification Report

Sklearn's classification_report computes precision, recall, F1-score, and support (number of actual occurrences) for each class in one call:

from sklearn.metrics import classification_report
 
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1]
 
print(classification_report(y_true, y_pred, target_names=['Negative', 'Positive']))

Output:

              precision    recall  f1-score   support

    Negative       0.75      0.86      0.80         7
    Positive       0.86      0.75      0.80         8

    accuracy                           0.80        15
   macro avg       0.80      0.80      0.80        15
weighted avg       0.81      0.80      0.80        15

  • macro avg: Unweighted mean across classes. Treats all classes equally.
  • weighted avg: Mean weighted by class support. Accounts for class imbalance.
  • support: Number of actual samples in each class.
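To consume these numbers programmatically instead of parsing the printed text, classification_report accepts output_dict=True and returns a nested dictionary keyed by class name:

```python
from sklearn.metrics import classification_report

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1]

report = classification_report(
    y_true, y_pred, target_names=['Negative', 'Positive'], output_dict=True
)
print(report['Positive']['recall'])      # 0.75
print(report['macro avg']['f1-score'])   # macro-averaged F1
```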

Visualizing the Confusion Matrix

Using ConfusionMatrixDisplay

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
 
# Load and split data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
 
# Train model (see /topics/Scikit-Learn/sklearn-random-forest for details)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
 
# Plot confusion matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=data.target_names
)
disp.plot(cmap='Blues')
plt.title('Breast Cancer Classification')
plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=150)
plt.show()

Using Seaborn Heatmap

For more customization, use seaborn directly. See our detailed seaborn heatmap guide for advanced styling options:

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
 
# Load, split, train
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
 
# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)
 
# Plot with seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(
    cm,
    annot=True,
    fmt='d',
    cmap='Blues',
    xticklabels=data.target_names,
    yticklabels=data.target_names,
    square=True,
    linewidths=0.5
)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix - Breast Cancer Classification')
plt.tight_layout()
plt.savefig('confusion_matrix_seaborn.png', dpi=150)
plt.show()

Normalized Confusion Matrix

Raw counts can be misleading when classes have different sizes. Normalizing shows proportions instead:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
 
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
 
# Normalized confusion matrix (by true labels)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
 
# Raw counts
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred,
    display_labels=data.target_names,
    cmap='Blues',
    ax=axes[0]
)
axes[0].set_title('Raw Counts')
 
# Normalized (rows sum to 1)
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred,
    display_labels=data.target_names,
    normalize='true',
    cmap='Blues',
    values_format='.2%',
    ax=axes[1]
)
axes[1].set_title('Normalized by True Label')
 
plt.tight_layout()
plt.savefig('confusion_matrix_normalized.png', dpi=150)
plt.show()

The normalize parameter accepts three options:

| Value | Normalization | Use Case |
|---|---|---|
| 'true' | Rows sum to 1 (divide by actual class count) | See recall per class |
| 'pred' | Columns sum to 1 (divide by predicted class count) | See precision per class |
| 'all' | All cells sum to 1 (divide by total count) | See overall distribution |
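The same normalize option is available on confusion_matrix itself, if you want the proportions without a plot:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1]

# Each row divided by its actual class count; the diagonal is per-class recall
cm_norm = confusion_matrix(y_true, y_pred, normalize='true')
print(cm_norm.round(2))
# [[0.86 0.14]
#  [0.25 0.75]]
```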

Multi-Class Confusion Matrix

The confusion matrix extends naturally to more than two classes. Each row represents one actual class, and each column represents one predicted class:

from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
 
# Load iris dataset (3 classes)
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
 
# Train and predict
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
 
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
 
# Classification report
print("\nClassification Report:")
print(classification_report(
    y_test, y_pred,
    target_names=iris.target_names
))
 
# Visualize
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=iris.target_names
)
disp.plot(cmap='Blues')
plt.title('Iris Classification - 3 Classes')
plt.tight_layout()
plt.savefig('multi_class_confusion_matrix.png', dpi=150)
plt.show()

Multi-Class Averaging Strategies

When computing precision, recall, and F1 for multi-class problems, you need to choose an averaging method:

from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
 
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
 
for avg in ['micro', 'macro', 'weighted']:
    p = precision_score(y_test, y_pred, average=avg)
    r = recall_score(y_test, y_pred, average=avg)
    f1 = f1_score(y_test, y_pred, average=avg)
    print(f"{avg:8s} -- Precision: {p:.4f}, Recall: {r:.4f}, F1: {f1:.4f}")

| Average | Method | Best For |
|---|---|---|
| micro | Aggregates TP, FP, FN across all classes | Overall performance; equals accuracy for single-label multi-class |
| macro | Unweighted mean of per-class scores | When all classes are equally important |
| weighted | Mean of per-class scores weighted by support | Imbalanced datasets where larger classes should count more |
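Passing average=None returns the underlying per-class scores, which makes the averaging explicit -- a small sketch with illustrative labels showing that macro is just the unweighted mean:

```python
import numpy as np
from sklearn.metrics import f1_score

# Small 3-class example; the labels are illustrative
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0, 2, 0]

per_class = f1_score(y_true, y_pred, average=None)   # one F1 per class
macro = f1_score(y_true, y_pred, average='macro')

print(per_class.round(4))
assert np.isclose(macro, per_class.mean())  # macro = mean of per-class F1
```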

Complete Example: End-to-End Classification Evaluation

from sklearn.metrics import (
    confusion_matrix, classification_report, ConfusionMatrixDisplay,
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
)
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
import numpy as np
 
# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)
 
# Build pipeline (see /topics/Scikit-Learn/sklearn-pipeline for a full guide)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', GradientBoostingClassifier(
        n_estimators=200, max_depth=3, random_state=42
    ))
])
 
# Train
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
y_prob = pipeline.predict_proba(X_test)[:, 1]
 
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
 
print("=" * 50)
print("MODEL EVALUATION REPORT")
print("=" * 50)
print(f"\nConfusion Matrix:")
print(f"  TP={tp}, FP={fp}")
print(f"  FN={fn}, TN={tn}")
print(f"\nAccuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
print(f"F1-Score:  {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_prob):.4f}")
print(f"\nDetailed Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))
 
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred, display_labels=data.target_names,
    cmap='Blues', ax=axes[0]
)
axes[0].set_title('Raw Counts')
 
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred, display_labels=data.target_names,
    normalize='true', values_format='.1%', cmap='Blues', ax=axes[1]
)
axes[1].set_title('Normalized')
plt.tight_layout()
plt.savefig('full_evaluation.png', dpi=150)
plt.show()

Exploring Classification Results with PyGWalker

After building your confusion matrix, dig deeper into misclassifications by exploring the raw data interactively. PyGWalker turns your prediction results into a drag-and-drop visual analytics interface in Jupyter:

import pandas as pd
import pygwalker as pyg
 
# Build results DataFrame with features and predictions
results = pd.DataFrame(X_test, columns=data.feature_names)
results['actual'] = y_test
results['predicted'] = y_pred
results['correct'] = y_test == y_pred
results['confidence'] = y_prob
 
# Launch interactive exploration
walker = pyg.walk(results)

Filter by misclassified samples, compare feature distributions between TP/FP/FN/TN groups, and identify patterns that explain where the model struggles.
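A useful first step for that kind of exploration (sketched here with illustrative arrays; in practice you would apply it to the results DataFrame above) is to label each row with its confusion-matrix cell so you can group on it:

```python
import numpy as np
import pandas as pd

# Illustrative actual/predicted labels
actual = np.array([1, 0, 1, 1, 0, 1, 0, 0])
predicted = np.array([1, 0, 0, 1, 1, 1, 0, 0])

def outcome(a, p):
    """Map one (actual, predicted) pair to its confusion-matrix cell."""
    if a == 1 and p == 1:
        return 'TP'
    if a == 0 and p == 0:
        return 'TN'
    if a == 0 and p == 1:
        return 'FP'
    return 'FN'

df = pd.DataFrame({'actual': actual, 'predicted': predicted})
df['outcome'] = [outcome(a, p) for a, p in zip(actual, predicted)]

# Counts per cell -- the 'outcome' column is now ready for
# groupby comparisons of any feature column
print(df['outcome'].value_counts())
```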

For iterating on classification experiments in Jupyter -- adjusting thresholds, testing different models, or exploring feature combinations -- RunCell provides an AI agent that accelerates the experimentation loop.

FAQ

What is a confusion matrix in sklearn?

A confusion matrix is a table that shows the counts of correct and incorrect predictions for each class. In sklearn, confusion_matrix(y_true, y_pred) returns a 2D numpy array where rows represent actual classes and columns represent predicted classes. For binary classification, it shows true positives, true negatives, false positives, and false negatives.

How do I read a confusion matrix?

In sklearn's confusion matrix, rows are actual labels and columns are predicted labels. For binary classification: top-left is true negatives (TN), top-right is false positives (FP), bottom-left is false negatives (FN), and bottom-right is true positives (TP). The diagonal elements are correct predictions.

What is the difference between precision and recall?

Precision measures how many of the predicted positives are actually positive (TP / (TP + FP)). Recall measures how many of the actual positives the model captured (TP / (TP + FN)). Precision answers "when the model says positive, how often is it right?" while recall answers "of all actual positives, how many did the model find?"

When should I use F1-score instead of accuracy?

Use F1-score when your classes are imbalanced. If 95% of samples are negative, a model that always predicts negative gets 95% accuracy but 0% recall on positives. F1-score is the harmonic mean of precision and recall, so it penalizes models that sacrifice one for the other.
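A quick demonstration of that failure mode, using a degenerate model that always predicts the majority class:

```python
from sklearn.metrics import accuracy_score, f1_score

# 95% negative, 5% positive; the "model" predicts negative for everything
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")                 # 0.95
print(f"F1-Score: {f1_score(y_true, y_pred, zero_division=0):.2f}")      # 0.00
```

High accuracy, zero F1: the model never finds a single positive. (zero_division=0 suppresses the undefined-precision warning when no positives are predicted.)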

How do I plot a confusion matrix in Python?

Use ConfusionMatrixDisplay.from_predictions(y_true, y_pred) for the quickest method. For more customization, compute the matrix with confusion_matrix() and plot it with seaborn.heatmap(). Both approaches support normalized matrices, custom color maps, and class labels.

What does normalize='true' do in ConfusionMatrixDisplay?

Setting normalize='true' divides each row by the total number of actual samples in that class, so each row sums to 1. This shows recall per class as a percentage. Use normalize='pred' to see precision per class, or normalize='all' to see the overall proportion.

Conclusion

The confusion matrix is the foundation of classification model evaluation. Accuracy alone is insufficient -- you need to see the specific types of errors your model makes. Use confusion_matrix and classification_report from sklearn to get the full picture, visualize with ConfusionMatrixDisplay or seaborn heatmaps for presentations and reports, and normalize when class sizes differ. Wrap your entire preprocessing and model workflow in an sklearn Pipeline to ensure consistent evaluation. Choose your primary metric based on the business cost of each error type: precision when false positives are expensive, recall when false negatives are dangerous, and F1-score when you need a balanced measure.
