
Confusion Matrix in Sklearn: How to Evaluate Classification Models


Your classification model reports 95% accuracy, so you ship it. Then you discover it misses 80% of the positive cases you actually care about: fraudulent transactions, disease diagnoses, defective products. Accuracy alone hides critical information about where and how your model fails.

A single accuracy number compresses every type of error into one figure. If spam makes up only 5% of all email, a filter that lets every spam message through but correctly identifies all legitimate mail still scores high on accuracy. You need the full picture: how many positives the model caught, how many negatives it misclassified, and exactly where the errors fall.

A confusion matrix breaks model performance into four components: true positives, true negatives, false positives, and false negatives. Combined with derived metrics like precision, recall, and F1-score, it gives actionable insight into where the model succeeds and where it fails. Scikit-learn makes this analysis straightforward with confusion_matrix, classification_report, and ConfusionMatrixDisplay.


What Is a Confusion Matrix?

A confusion matrix is a table that compares a classification model's predicted labels against the true labels. For binary classification, it is a 2x2 grid:

|                 | Predicted Positive  | Predicted Negative  |
|-----------------|---------------------|---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |

Each cell counts the samples that fall into it:

  • True Positive (TP): the model predicted positive and the actual label is positive. Correct.
  • True Negative (TN): the model predicted negative and the actual label is negative. Correct.
  • False Positive (FP): the model predicted positive but the actual label is negative. A Type I error.
  • False Negative (FN): the model predicted negative but the actual label is positive. A Type II error.
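These four counts are simple tallies over paired labels. A minimal sketch (with made-up labels) of what confusion_matrix computes under the hood:

```python
# Tally TP/TN/FP/FN by comparing each predicted label to its true label
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

print(tp, tn, fp, fn)  # 2 1 1 1
```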

A Basic Confusion Matrix with Sklearn

from sklearn.metrics import confusion_matrix
import numpy as np
 
# Actual and predicted labels
y_true = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
 
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[5 1]
#  [2 7]]

How to read this output: sklearn arranges the matrix so that row 0 = actual negative and row 1 = actual positive.

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | TN = 5      | FP = 1      |
| Actual 1 | FN = 2      | TP = 7      |

So the model correctly identified 5 negatives and 7 positives, while making 1 false positive and 2 false negative errors.

Extracting Individual Values

from sklearn.metrics import confusion_matrix
 
y_true = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
 
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
 
print(f"True Negatives:  {tn}")
print(f"False Positives: {fp}")
print(f"False Negatives: {fn}")
print(f"True Positives:  {tp}")
# True Negatives:  5
# False Positives: 1
# False Negatives: 2
# True Positives:  7

Precision, Recall, F1-Score, and Accuracy

All of these metrics derive directly from the confusion matrix:

| Metric               | Formula                                             | What It Answers                                              |
|----------------------|-----------------------------------------------------|--------------------------------------------------------------|
| Accuracy             | (TP + TN) / (TP + TN + FP + FN)                     | Of all predictions, how many were correct?                   |
| Precision            | TP / (TP + FP)                                      | Of the samples predicted positive, how many actually are?    |
| Recall (Sensitivity) | TP / (TP + FN)                                      | Of all actual positives, how many did we capture?            |
| Specificity          | TN / (TN + FP)                                      | Of all actual negatives, how many did we correctly identify? |
| F1-Score             | 2 * (Precision * Recall) / (Precision + Recall)     | Harmonic mean of precision and recall                        |
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score
)
 
y_true = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
 
print(f"Accuracy:  {accuracy_score(y_true, y_pred):.4f}")
print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall:    {recall_score(y_true, y_pred):.4f}")
print(f"F1-Score:  {f1_score(y_true, y_pred):.4f}")
# Accuracy:  0.8000
# Precision: 0.8750
# Recall:    0.7778
# F1-Score:  0.8235
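As a sanity check, the formulas in the table above can be applied by hand to the four matrix cells; the hand-computed values agree with sklearn's metric functions. A minimal sketch:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Apply the formulas from the table directly
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# The hand-computed values match sklearn's implementations
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-12
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}")
# Precision: 0.8750, Recall: 0.7778, F1: 0.8235
```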

When to Prioritize Precision vs. Recall

| Scenario                       | Prioritize | Why                                                              |
|--------------------------------|------------|------------------------------------------------------------------|
| Spam detection                 | Precision  | False positives (legitimate mail flagged as spam) frustrate users |
| Disease screening              | Recall     | False negatives (missed diagnoses) carry high risk               |
| Fraud detection                | Recall     | Missing fraud usually costs more than reviewing false alarms     |
| Search engine results          | Precision  | Irrelevant results significantly degrade the user experience     |
| Manufacturing defect detection | Recall     | Defective products reaching customers is expensive               |
| Content recommendation         | Precision  | Irrelevant recommendations reduce engagement and retention       |
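The precision/recall balance is not fixed at training time. Classifiers that expose predict_proba let you move the decision threshold away from the default 0.5 to trade one metric for the other. A sketch on the breast cancer dataset (the dataset and model here are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    *load_breast_cancer(return_X_y=True), test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

# A higher threshold predicts positive less often:
# recall can only fall, while precision tends to rise
for threshold in [0.3, 0.5, 0.7]:
    preds = (probs >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_test, preds):.3f}, "
          f"recall={recall_score(y_test, preds):.3f}")
```

For a binary LogisticRegression, predict is equivalent to thresholding the positive-class probability at 0.5.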

Classification Report

Sklearn's classification_report computes precision, recall, F1-score, and support (the number of true occurrences of each class) for every class at once:

from sklearn.metrics import classification_report
 
y_true = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
 
print(classification_report(y_true, y_pred, target_names=['Negative', 'Positive']))

Output:

              precision    recall  f1-score   support

    Negative       0.71      0.83      0.77         6
    Positive       0.88      0.78      0.82         9

    accuracy                           0.80        15
   macro avg       0.80      0.81      0.80        15
weighted avg       0.81      0.80      0.80        15
  • macro avg: the unweighted mean across classes, treating every class equally.
  • weighted avg: the mean weighted by each class's support, reflecting the effect of class imbalance.
  • support: the number of true samples in each class.
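When you need these numbers programmatically (for logging, dashboards, or tests) rather than as printed text, classification_report also accepts output_dict=True and returns nested dictionaries:

```python
from sklearn.metrics import classification_report

y_true = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1]

# Keys are the class labels as strings, plus 'accuracy',
# 'macro avg', and 'weighted avg'
report = classification_report(y_true, y_pred, output_dict=True)
print(report['1']['recall'])            # positive-class recall
print(report['macro avg']['f1-score'])  # macro-averaged F1
print(report['accuracy'])               # overall accuracy
```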

Visualizing the Confusion Matrix

Using ConfusionMatrixDisplay

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
 
# Load and split data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
 
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
 
# Plot confusion matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=data.target_names
)
disp.plot(cmap='Blues')
plt.title('Breast Cancer Classification')
plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=150)
plt.show()

Using a Seaborn Heatmap

For more customization, use seaborn directly:

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
 
# Load, split, train
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
 
# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)
 
# Plot with seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(
    cm,
    annot=True,
    fmt='d',
    cmap='Blues',
    xticklabels=data.target_names,
    yticklabels=data.target_names,
    square=True,
    linewidths=0.5
)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix - Breast Cancer Classification')
plt.tight_layout()
plt.savefig('confusion_matrix_seaborn.png', dpi=150)
plt.show()

Normalized Confusion Matrix

When class sizes differ, raw counts can be misleading. Normalization shows proportions instead, which makes classes comparable:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
 
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
 
# Normalized confusion matrix (by true labels)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
 
# Raw counts
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred,
    display_labels=data.target_names,
    cmap='Blues',
    ax=axes[0]
)
axes[0].set_title('Raw Counts')
 
# Normalized (rows sum to 1)
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred,
    display_labels=data.target_names,
    normalize='true',
    cmap='Blues',
    values_format='.2%',
    ax=axes[1]
)
axes[1].set_title('Normalized by True Label')
 
plt.tight_layout()
plt.savefig('confusion_matrix_normalized.png', dpi=150)
plt.show()

The normalize parameter supports three options:

| Value  | Normalization                                       | Use Case                      |
|--------|-----------------------------------------------------|-------------------------------|
| 'true' | Each row sums to 1 (divide by true class counts)    | See per-class recall          |
| 'pred' | Each column sums to 1 (divide by predicted counts)  | See per-class precision       |
| 'all'  | All cells sum to 1 (divide by total samples)        | See the overall distribution  |
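The three options are just different divisions of the raw counts. A small sketch (with made-up labels) verifying that normalize='true' divides each row by its total, so the diagonal equals per-class recall:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
cm_norm = confusion_matrix(y_true, y_pred, normalize='true')

# Each row of the normalized matrix is the raw row divided by its sum
assert np.allclose(cm_norm, cm / cm.sum(axis=1, keepdims=True))

# The diagonal is the recall of each class
print(cm_norm.diagonal())  # class 0 recall = 1/3, class 1 recall = 4/5
```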

Multi-Class Confusion Matrix

The confusion matrix extends naturally to more than two classes. Each row represents a true class and each column a predicted class:

from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
 
# Load iris dataset (3 classes)
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
 
# Train and predict
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
 
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
 
# Classification report
print("\nClassification Report:")
print(classification_report(
    y_test, y_pred,
    target_names=iris.target_names
))
 
# Visualize
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=iris.target_names
)
disp.plot(cmap='Blues')
plt.title('Iris Classification - 3 Classes')
plt.tight_layout()
plt.savefig('multi_class_confusion_matrix.png', dpi=150)
plt.show()

Averaging Strategies for Multi-Class Metrics

Computing precision, recall, and F1 on a multi-class problem requires choosing an averaging strategy:

from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
 
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
 
for avg in ['micro', 'macro', 'weighted']:
    p = precision_score(y_test, y_pred, average=avg)
    r = recall_score(y_test, y_pred, average=avg)
    f1 = f1_score(y_test, y_pred, average=avg)
    print(f"{avg:8s} -- Precision: {p:.4f}, Recall: {r:.4f}, F1: {f1:.4f}")
| Average  | Method                                               | Best For                                 |
|----------|------------------------------------------------------|------------------------------------------|
| micro    | Pool TP, FP, FN across all classes, then compute     | When accounting for class imbalance is critical |
| macro    | Unweighted mean of per-class metrics                 | When every class matters equally         |
| weighted | Mean of per-class metrics weighted by class support  | A common default for imbalanced data     |
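The difference between macro and micro averaging is easiest to see computed by hand from the multi-class confusion matrix. A minimal sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 2, 0]

cm = confusion_matrix(y_true, y_pred)
tp = cm.diagonal()          # correct predictions per class
predicted = cm.sum(axis=0)  # column sums = predictions per class

# macro: compute precision per class, then take the unweighted mean
macro = np.mean(tp / predicted)
# micro: pool TP and FP across all classes before dividing
micro = tp.sum() / predicted.sum()

assert np.isclose(macro, precision_score(y_true, y_pred, average='macro'))
assert np.isclose(micro, precision_score(y_true, y_pred, average='micro'))
print(f"macro={macro:.4f}, micro={micro:.4f}")
```

For single-label multi-class problems, micro-averaged precision, recall, and F1 all equal plain accuracy, because every false positive for one class is a false negative for another.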

Complete Example: End-to-End Classification Evaluation

from sklearn.metrics import (
    confusion_matrix, classification_report, ConfusionMatrixDisplay,
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
)
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
import numpy as np
 
# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)
 
# Build pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', GradientBoostingClassifier(
        n_estimators=200, max_depth=3, random_state=42
    ))
])
 
# Train
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
y_prob = pipeline.predict_proba(X_test)[:, 1]
 
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
 
print("=" * 50)
print("MODEL EVALUATION REPORT")
print("=" * 50)
print(f"\nConfusion Matrix:")
print(f"  TN={tn}, FP={fp}")
print(f"  FN={fn}, TP={tp}")
print(f"\nAccuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
print(f"F1-Score:  {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_prob):.4f}")
print(f"\nDetailed Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))
 
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred, display_labels=data.target_names,
    cmap='Blues', ax=axes[0]
)
axes[0].set_title('Raw Counts')
 
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred, display_labels=data.target_names,
    normalize='true', values_format='.1%', cmap='Blues', ax=axes[1]
)
axes[1].set_title('Normalized')
plt.tight_layout()
plt.savefig('full_evaluation.png', dpi=150)
plt.show()

Exploring Classification Results with PyGWalker

After building a confusion matrix, you can dig into why misclassifications happen by exploring the raw data interactively. PyGWalker turns your prediction results into a drag-and-drop visual analysis interface inside Jupyter:

import pandas as pd
import pygwalker as pyg
 
# Build results DataFrame with features and predictions
results = pd.DataFrame(X_test, columns=data.feature_names)
results['actual'] = y_test
results['predicted'] = y_pred
results['correct'] = y_test == y_pred
results['confidence'] = y_prob
 
# Launch interactive exploration
walker = pyg.walk(results)

You can filter to misclassified samples, compare feature distributions across the TP/FP/FN/TN groups, and identify patterns that explain where the model struggles.

If you iterate on classification experiments in Jupyter (tuning thresholds, testing different models, or exploring feature combinations), RunCell provides an AI agent that speeds up the whole experiment loop.

FAQ

What is a confusion matrix in sklearn?

A confusion matrix is a table showing the counts of correct and incorrect predictions for each class. In sklearn, confusion_matrix(y_true, y_pred) returns a 2D numpy array where rows represent true classes and columns represent predicted classes. For binary classification, the cells correspond to true positives, true negatives, false positives, and false negatives.

How do I read a confusion matrix?

In sklearn's confusion matrix, rows are true labels and columns are predicted labels. For binary classification: top-left is true negatives (TN), top-right is false positives (FP), bottom-left is false negatives (FN), and bottom-right is true positives (TP). The diagonal elements are the correctly predicted samples.

What is the difference between precision and recall?

Precision measures how many of the samples predicted positive are actually positive (TP / (TP + FP)). Recall measures how many of the actual positives the model captured (TP / (TP + FN)). Precision answers "when the model says positive, how often is it right?"; recall answers "of all the true positives, how many did the model find?"

When should I use F1-score instead of accuracy?

Use F1-score when classes are imbalanced. If 95% of samples are negative, a model that always predicts negative scores 95% accuracy but has zero recall on the positive class. F1-score is the harmonic mean of precision and recall, so it penalizes models that sacrifice one for the other.

How do I plot a confusion matrix in Python?

The fastest way is ConfusionMatrixDisplay.from_predictions(y_true, y_pred). For more customization, compute the matrix with confusion_matrix() and visualize it with seaborn.heatmap(). Both approaches support normalized matrices, custom colormaps, and class labels.

What does normalize='true' mean in ConfusionMatrixDisplay?

Setting normalize='true' divides each row by the total number of samples in that true class, so every row sums to 1. This is equivalent to displaying per-class recall as percentages. Use normalize='pred' to see per-class precision, and normalize='all' to see overall proportions.

Conclusion

Confusion matrices are fundamental to evaluating classification models. Accuracy alone is not enough: you need to see exactly which errors the model makes. Use sklearn's confusion_matrix and classification_report for the complete picture; visualize with ConfusionMatrixDisplay or seaborn heatmaps for presentations and reports; and normalize when class sizes differ. Finally, choose your primary metric based on the business cost of each error type: prioritize precision when false positives are expensive, recall when false negatives are dangerous, and F1-score when you need a balanced measure.
