
Confusion Matrix in Sklearn: How to Evaluate Classification Models


Your classification model reports 95% accuracy, so you ship it. Then you discover it misses 80% of the positive cases you actually care about: fraudulent transactions, disease diagnoses, defective products. Accuracy alone hides critical information about where and how your model fails.

A single accuracy number compresses every type of error into one figure. If spam makes up only 5% of all email, a filter that lets every spam message through but correctly identifies all legitimate mail still scores high on accuracy. You need the full picture: how many positives the model caught, how many negatives it misclassified, and exactly where the errors fall.

A confusion matrix breaks model performance into four components: true positives, true negatives, false positives, and false negatives. Combined with derived metrics like precision, recall, and F1-score, it gives actionable insight into where the model succeeds and where it fails. Scikit-learn makes this analysis straightforward with confusion_matrix, classification_report, and ConfusionMatrixDisplay.


What Is a Confusion Matrix?

A confusion matrix is a table that compares a classification model's predicted labels against the true labels. For binary classification, it is a 2x2 grid:

|                 | Predicted Positive  | Predicted Negative  |
|-----------------|---------------------|---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |

Each cell counts the samples that fall into it:

  • True Positive (TP): the model predicted positive and the actual label is positive. Correct.
  • True Negative (TN): the model predicted negative and the actual label is negative. Correct.
  • False Positive (FP): the model predicted positive but the actual label is negative. A Type I error.
  • False Negative (FN): the model predicted negative but the actual label is positive. A Type II error.
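These four counts are simple tallies over paired labels. A minimal sketch (with made-up labels) of what confusion_matrix computes under the hood:

```python
# Tally TP/TN/FP/FN by comparing each predicted label to its true label
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

print(tp, tn, fp, fn)  # 2 1 1 1
```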

A Basic Confusion Matrix with Sklearn

from sklearn.metrics import confusion_matrix
import numpy as np
 
# Actual and predicted labels
y_true = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
 
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[5 1]
#  [2 7]]

How to read this output: sklearn arranges the matrix so that row 0 = actual negative and row 1 = actual positive.

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | TN = 5      | FP = 1      |
| Actual 1 | FN = 2      | TP = 7      |

So the model correctly identified 5 negatives and 7 positives, while making 1 false positive and 2 false negative errors.

Extracting Individual Values

from sklearn.metrics import confusion_matrix
 
y_true = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
 
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
 
print(f"True Negatives:  {tn}")
print(f"False Positives: {fp}")
print(f"False Negatives: {fn}")
print(f"True Positives:  {tp}")
# True Negatives:  5
# False Positives: 1
# False Negatives: 2
# True Positives:  7

Precision, Recall, F1-Score, and Accuracy

All of these metrics derive directly from the confusion matrix:

| Metric               | Formula                                             | What It Answers                                              |
|----------------------|-----------------------------------------------------|--------------------------------------------------------------|
| Accuracy             | (TP + TN) / (TP + TN + FP + FN)                     | Of all predictions, how many were correct?                   |
| Precision            | TP / (TP + FP)                                      | Of the samples predicted positive, how many actually are?    |
| Recall (Sensitivity) | TP / (TP + FN)                                      | Of all actual positives, how many did we capture?            |
| Specificity          | TN / (TN + FP)                                      | Of all actual negatives, how many did we correctly identify? |
| F1-Score             | 2 * (Precision * Recall) / (Precision + Recall)     | Harmonic mean of precision and recall                        |
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score
)
 
y_true = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
 
print(f"Accuracy:  {accuracy_score(y_true, y_pred):.4f}")
print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall:    {recall_score(y_true, y_pred):.4f}")
print(f"F1-Score:  {f1_score(y_true, y_pred):.4f}")
# Accuracy:  0.8000
# Precision: 0.8750
# Recall:    0.7778
# F1-Score:  0.8235
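As a sanity check, the formulas in the table above can be applied by hand to the four matrix cells; the hand-computed values agree with sklearn's metric functions. A minimal sketch:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Apply the formulas from the table directly
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# The hand-computed values match sklearn's implementations
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-12
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}")
# Precision: 0.8750, Recall: 0.7778, F1: 0.8235
```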

When to Prioritize Precision vs. Recall

| Scenario                       | Prioritize | Why                                                              |
|--------------------------------|------------|------------------------------------------------------------------|
| Spam detection                 | Precision  | False positives (legitimate mail flagged as spam) frustrate users |
| Disease screening              | Recall     | False negatives (missed diagnoses) carry high risk               |
| Fraud detection                | Recall     | Missing fraud usually costs more than reviewing false alarms     |
| Search engine results          | Precision  | Irrelevant results significantly degrade the user experience     |
| Manufacturing defect detection | Recall     | Defective products reaching customers is expensive               |
| Content recommendation         | Precision  | Irrelevant recommendations reduce engagement and retention       |
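The precision/recall balance is not fixed at training time. Classifiers that expose predict_proba let you move the decision threshold away from the default 0.5 to trade one metric for the other. A sketch on the breast cancer dataset (the dataset and model here are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    *load_breast_cancer(return_X_y=True), test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

# A higher threshold predicts positive less often:
# recall can only fall, while precision tends to rise
for threshold in [0.3, 0.5, 0.7]:
    preds = (probs >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_test, preds):.3f}, "
          f"recall={recall_score(y_test, preds):.3f}")
```

For a binary LogisticRegression, predict is equivalent to thresholding the positive-class probability at 0.5.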

Classification Report

Sklearn's classification_report computes precision, recall, F1-score, and support (the number of true occurrences of each class) for every class at once:

from sklearn.metrics import classification_report
 
y_true = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
 
print(classification_report(y_true, y_pred, target_names=['Negative', 'Positive']))

Output:

              precision    recall  f1-score   support

    Negative       0.71      0.83      0.77         6
    Positive       0.88      0.78      0.82         9

    accuracy                           0.80        15
   macro avg       0.80      0.81      0.80        15
weighted avg       0.81      0.80      0.80        15
  • macro avg: the unweighted mean across classes, treating every class equally.
  • weighted avg: the mean weighted by each class's support, reflecting the effect of class imbalance.
  • support: the number of true samples in each class.
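When you need these numbers programmatically (for logging, dashboards, or tests) rather than as printed text, classification_report also accepts output_dict=True and returns nested dictionaries:

```python
from sklearn.metrics import classification_report

y_true = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1]

# Keys are the class labels as strings, plus 'accuracy',
# 'macro avg', and 'weighted avg'
report = classification_report(y_true, y_pred, output_dict=True)
print(report['1']['recall'])            # positive-class recall
print(report['macro avg']['f1-score'])  # macro-averaged F1
print(report['accuracy'])               # overall accuracy
```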

Visualizing the Confusion Matrix

Using ConfusionMatrixDisplay

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
 
# Load and split data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
 
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
 
# Plot confusion matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=data.target_names
)
disp.plot(cmap='Blues')
plt.title('Breast Cancer Classification')
plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=150)
plt.show()

Using a Seaborn Heatmap

For more customization, use seaborn directly:

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
 
# Load, split, train
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
 
# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)
 
# Plot with seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(
    cm,
    annot=True,
    fmt='d',
    cmap='Blues',
    xticklabels=data.target_names,
    yticklabels=data.target_names,
    square=True,
    linewidths=0.5
)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix - Breast Cancer Classification')
plt.tight_layout()
plt.savefig('confusion_matrix_seaborn.png', dpi=150)
plt.show()

Normalized Confusion Matrix

When class sizes differ, raw counts can be misleading. Normalization shows proportions instead, which makes classes comparable:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
 
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
 
# Normalized confusion matrix (by true labels)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
 
# Raw counts
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred,
    display_labels=data.target_names,
    cmap='Blues',
    ax=axes[0]
)
axes[0].set_title('Raw Counts')
 
# Normalized (rows sum to 1)
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred,
    display_labels=data.target_names,
    normalize='true',
    cmap='Blues',
    values_format='.2%',
    ax=axes[1]
)
axes[1].set_title('Normalized by True Label')
 
plt.tight_layout()
plt.savefig('confusion_matrix_normalized.png', dpi=150)
plt.show()

The normalize parameter supports three options:

| Value  | Normalization                                       | Use Case                      |
|--------|-----------------------------------------------------|-------------------------------|
| 'true' | Each row sums to 1 (divide by true class counts)    | See per-class recall          |
| 'pred' | Each column sums to 1 (divide by predicted counts)  | See per-class precision       |
| 'all'  | All cells sum to 1 (divide by total samples)        | See the overall distribution  |
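The three options are just different divisions of the raw counts. A small sketch (with made-up labels) verifying that normalize='true' divides each row by its total, so the diagonal equals per-class recall:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
cm_norm = confusion_matrix(y_true, y_pred, normalize='true')

# Each row of the normalized matrix is the raw row divided by its sum
assert np.allclose(cm_norm, cm / cm.sum(axis=1, keepdims=True))

# The diagonal is the recall of each class
print(cm_norm.diagonal())  # class 0 recall = 1/3, class 1 recall = 4/5
```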

Multi-Class Confusion Matrix

The confusion matrix extends naturally to more than two classes. Each row represents a true class and each column a predicted class:

from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
 
# Load iris dataset (3 classes)
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
 
# Train and predict
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
 
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
 
# Classification report
print("\nClassification Report:")
print(classification_report(
    y_test, y_pred,
    target_names=iris.target_names
))
 
# Visualize
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=iris.target_names
)
disp.plot(cmap='Blues')
plt.title('Iris Classification - 3 Classes')
plt.tight_layout()
plt.savefig('multi_class_confusion_matrix.png', dpi=150)
plt.show()

Averaging Strategies for Multi-Class Metrics

Computing precision, recall, and F1 on a multi-class problem requires choosing an averaging strategy:

from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
 
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
 
for avg in ['micro', 'macro', 'weighted']:
    p = precision_score(y_test, y_pred, average=avg)
    r = recall_score(y_test, y_pred, average=avg)
    f1 = f1_score(y_test, y_pred, average=avg)
    print(f"{avg:8s} -- Precision: {p:.4f}, Recall: {r:.4f}, F1: {f1:.4f}")
| Average  | Method                                               | Best For                                 |
|----------|------------------------------------------------------|------------------------------------------|
| micro    | Pool TP, FP, FN across all classes, then compute     | When accounting for class imbalance is critical |
| macro    | Unweighted mean of per-class metrics                 | When every class matters equally         |
| weighted | Mean of per-class metrics weighted by class support  | A common default for imbalanced data     |
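The difference between macro and micro averaging is easiest to see computed by hand from the multi-class confusion matrix. A minimal sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 2, 0]

cm = confusion_matrix(y_true, y_pred)
tp = cm.diagonal()          # correct predictions per class
predicted = cm.sum(axis=0)  # column sums = predictions per class

# macro: compute precision per class, then take the unweighted mean
macro = np.mean(tp / predicted)
# micro: pool TP and FP across all classes before dividing
micro = tp.sum() / predicted.sum()

assert np.isclose(macro, precision_score(y_true, y_pred, average='macro'))
assert np.isclose(micro, precision_score(y_true, y_pred, average='micro'))
print(f"macro={macro:.4f}, micro={micro:.4f}")
```

For single-label multi-class problems, micro-averaged precision, recall, and F1 all equal plain accuracy, because every false positive for one class is a false negative for another.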

Complete Example: End-to-End Classification Evaluation

from sklearn.metrics import (
    confusion_matrix, classification_report, ConfusionMatrixDisplay,
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
)
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
import numpy as np
 
# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)
 
# Build pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', GradientBoostingClassifier(
        n_estimators=200, max_depth=3, random_state=42
    ))
])
 
# Train
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
y_prob = pipeline.predict_proba(X_test)[:, 1]
 
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
 
print("=" * 50)
print("MODEL EVALUATION REPORT")
print("=" * 50)
print(f"\nConfusion Matrix:")
print(f"  TN={tn}, FP={fp}")
print(f"  FN={fn}, TP={tp}")
print(f"\nAccuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
print(f"F1-Score:  {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_prob):.4f}")
print(f"\nDetailed Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))
 
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred, display_labels=data.target_names,
    cmap='Blues', ax=axes[0]
)
axes[0].set_title('Raw Counts')
 
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred, display_labels=data.target_names,
    normalize='true', values_format='.1%', cmap='Blues', ax=axes[1]
)
axes[1].set_title('Normalized')
plt.tight_layout()
plt.savefig('full_evaluation.png', dpi=150)
plt.show()

Exploring Classification Results with PyGWalker

After building a confusion matrix, you can dig into why misclassifications happen by exploring the raw data interactively. PyGWalker turns your prediction results into a drag-and-drop visual analysis interface inside Jupyter:

import pandas as pd
import pygwalker as pyg
 
# Build results DataFrame with features and predictions
results = pd.DataFrame(X_test, columns=data.feature_names)
results['actual'] = y_test
results['predicted'] = y_pred
results['correct'] = y_test == y_pred
results['confidence'] = y_prob
 
# Launch interactive exploration
walker = pyg.walk(results)

You can filter to misclassified samples, compare feature distributions across the TP/FP/FN/TN groups, and identify patterns that explain where the model struggles.

If you iterate on classification experiments in Jupyter (tuning thresholds, testing different models, or exploring feature combinations), RunCell provides an AI agent that speeds up the whole experiment loop.

FAQ

What is a confusion matrix in sklearn?

A confusion matrix is a table showing the counts of correct and incorrect predictions for each class. In sklearn, confusion_matrix(y_true, y_pred) returns a 2D numpy array where rows represent true classes and columns represent predicted classes. For binary classification, the cells correspond to true positives, true negatives, false positives, and false negatives.

How do I read a confusion matrix?

In sklearn's confusion matrix, rows are true labels and columns are predicted labels. For binary classification: top-left is true negatives (TN), top-right is false positives (FP), bottom-left is false negatives (FN), and bottom-right is true positives (TP). The diagonal elements are the correctly predicted samples.

What is the difference between precision and recall?

Precision measures how many of the samples predicted positive are actually positive (TP / (TP + FP)). Recall measures how many of the actual positives the model captured (TP / (TP + FN)). Precision answers "when the model says positive, how often is it right?"; recall answers "of all the true positives, how many did the model find?"

When should I use F1-score instead of accuracy?

Use F1-score when classes are imbalanced. If 95% of samples are negative, a model that always predicts negative scores 95% accuracy but has zero recall on the positive class. F1-score is the harmonic mean of precision and recall, so it penalizes models that sacrifice one for the other.

How do I plot a confusion matrix in Python?

The fastest way is ConfusionMatrixDisplay.from_predictions(y_true, y_pred). For more customization, compute the matrix with confusion_matrix() and visualize it with seaborn.heatmap(). Both approaches support normalized matrices, custom colormaps, and class labels.

What does normalize='true' mean in ConfusionMatrixDisplay?

Setting normalize='true' divides each row by the total number of samples in that true class, so every row sums to 1. This is equivalent to displaying per-class recall as percentages. Use normalize='pred' to see per-class precision, and normalize='all' to see overall proportions.

Conclusion

Confusion matrices are fundamental to evaluating classification models. Accuracy alone is not enough: you need to see exactly which errors the model makes. Use sklearn's confusion_matrix and classification_report for the complete picture; visualize with ConfusionMatrixDisplay or seaborn heatmaps for presentations and reports; and normalize when class sizes differ. Finally, choose your primary metric based on the business cost of each error type: prioritize precision when false positives are expensive, recall when false negatives are dangerous, and F1-score when you need a balanced measure.
