# Confusion Matrix in Sklearn: How to Evaluate Classification Models
Your classification model reports 95% accuracy, so you ship it. Then you discover it misses 80% of the positives you actually care about: fraudulent transactions, disease diagnoses, defective products. Looking at accuracy alone hides the critical information about where, and in what way, the model fails.
A single accuracy number compresses every kind of error into one metric. If spam makes up only 5% of all email, a filter that lets every spam message through but correctly identifies all legitimate mail still scores high accuracy. You need the full picture: how many positives the model caught, how many negatives it misclassified, and exactly where the errors fall.
A confusion matrix breaks model performance down into four components: true positives, true negatives, false positives, and false negatives. Combined with derived metrics such as precision, recall, and F1-score, it yields actionable insight into what the model gets right and where it goes wrong. Scikit-learn provides confusion_matrix, classification_report, and ConfusionMatrixDisplay, which make this analysis straightforward.
## What Is a Confusion Matrix?
A confusion matrix is a table that compares a classification model's predicted labels against the true labels. For binary classification, it is a 2x2 grid:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
Each cell counts the samples that fall into that category:
- True Positive (TP): the model predicted positive and the actual label is positive. Correct.
- True Negative (TN): the model predicted negative and the actual label is negative. Correct.
- False Positive (FP): the model predicted positive but the actual label is negative. A Type I error.
- False Negative (FN): the model predicted negative but the actual label is positive. A Type II error.
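Before reaching for sklearn, it can help to see that these four counts are nothing more than pairwise comparisons. A minimal pure-Python sketch with made-up labels:

```python
# Count the four confusion-matrix cells by comparing label pairs directly.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

print(tp, tn, fp, fn)  # 4 4 1 1
```

sklearn's confusion_matrix does exactly this bookkeeping for you, for any number of classes.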
## A Basic Confusion Matrix with Sklearn
```python
from sklearn.metrics import confusion_matrix

# Actual and predicted labels
y_true = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[5 1]
#  [2 7]]
```

How to read this output: sklearn arranges the matrix so that row 0 = actual negative and row 1 = actual positive.
| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | TN = 5 | FP = 1 |
| Actual 1 | FN = 2 | TP = 7 |
So the model correctly identified 5 negatives and 7 positives, while producing 1 false positive and 2 false negatives.
## Extracting Individual Values
```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(f"True Negatives: {tn}")
print(f"False Positives: {fp}")
print(f"False Negatives: {fn}")
print(f"True Positives: {tp}")
# True Negatives: 5
# False Positives: 1
# False Negatives: 2
# True Positives: 7
```

## Precision, Recall, F1-Score, and Accuracy
All of these metrics can be derived directly from the confusion matrix:
| Metric | Formula | What It Answers |
|---|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Of all predictions, how many were correct? |
| Precision | TP / (TP + FP) | Of the samples predicted positive, how many actually are positive? |
| Recall (Sensitivity) | TP / (TP + FN) | Of all actual positives, how many did we catch? |
| Specificity | TN / (TN + FP) | Of all actual negatives, how many did we correctly identify? |
| F1-Score | 2 * (Precision * Recall) / (Precision + Recall) | The harmonic mean of precision and recall |
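As a sanity check, these formulas can be applied by hand to the counts from the earlier example (TN=5, FP=1, FN=2, TP=7):

```python
tp, tn, fp, fn = 7, 5, 1, 2

accuracy = (tp + tn) / (tp + tn + fp + fn)       # 12/15
precision = tp / (tp + fp)                       # 7/8
recall = tp / (tp + fn)                          # 7/9
specificity = tn / (tn + fp)                     # 5/6
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:    {accuracy:.4f}")     # 0.8000
print(f"Precision:   {precision:.4f}")    # 0.8750
print(f"Recall:      {recall:.4f}")       # 0.7778
print(f"Specificity: {specificity:.4f}")  # 0.8333
print(f"F1:          {f1:.4f}")           # 0.8235
```

The sklearn helpers below compute the same numbers from the raw labels.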
```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score
)

y_true = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1]

print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")
print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall: {recall_score(y_true, y_pred):.4f}")
print(f"F1-Score: {f1_score(y_true, y_pred):.4f}")
# Accuracy: 0.8000
# Precision: 0.8750
# Recall: 0.7778
# F1-Score: 0.8235
```

## When to Prioritize Precision vs. Recall
| Scenario | Prioritize | Why |
|---|---|---|
| Spam detection | Precision | False positives (legitimate mail flagged as spam) frustrate users |
| Disease screening | Recall | False negatives (missed diagnoses) carry high risk |
| Fraud detection | Recall | Missing fraud usually costs more than reviewing false alarms |
| Search engine results | Precision | Irrelevant results sharply degrade the user experience |
| Manufacturing defect detection | Recall | Defective products reaching customers is costly |
| Content recommendation | Precision | Irrelevant recommendations hurt engagement and retention |
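Which side of the trade-off you favor can often be adjusted without retraining, by moving the decision threshold applied to predicted probabilities. A minimal sketch on synthetic data (the dataset, model, and thresholds here are illustrative, not taken from the examples in this article):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (illustrative only)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]

# Lowering the threshold catches more positives (higher recall);
# raising it makes positive predictions more trustworthy (higher precision).
recalls = {}
for threshold in [0.3, 0.5, 0.7]:
    preds = (probs >= threshold).astype(int)
    p = precision_score(y_te, preds, zero_division=0)
    r = recall_score(y_te, preds)
    recalls[threshold] = r
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Recall can never increase as the threshold rises, since a higher threshold only removes positive predictions.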
## Classification Report
Sklearn's classification_report computes precision, recall, F1-score, and support (the number of true occurrences of each class) for every class at once:
```python
from sklearn.metrics import classification_report

y_true = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1]

print(classification_report(y_true, y_pred, target_names=['Negative', 'Positive']))
```

Output:

```
              precision    recall  f1-score   support

    Negative       0.71      0.83      0.77         6
    Positive       0.88      0.78      0.82         9

    accuracy                           0.80        15
   macro avg       0.80      0.81      0.80        15
weighted avg       0.81      0.80      0.80        15
```

- macro avg: the unweighted mean across classes, treating every class equally.
- weighted avg: the mean weighted by class support, reflecting class imbalance.
- support: the number of true samples in each class.
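The difference between the two averages is easy to recompute by hand from the per-class F1 scores and supports in the report above:

```python
# Per-class F1 and support, as printed by classification_report above
f1_negative, support_negative = 0.77, 6
f1_positive, support_positive = 0.82, 9

# macro: every class counts equally
macro_f1 = (f1_negative + f1_positive) / 2

# weighted: every sample counts equally, via class support
weighted_f1 = (f1_negative * support_negative + f1_positive * support_positive) / (
    support_negative + support_positive
)

print(f"macro avg F1:    {macro_f1:.2f}")     # 0.80 (rounded from 0.795)
print(f"weighted avg F1: {weighted_f1:.2f}")  # 0.80
```

With balanced classes the two coincide; the more imbalanced the data, the more they diverge.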
## Visualizing the Confusion Matrix
### Using ConfusionMatrixDisplay
```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt

# Load and split data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Plot confusion matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=data.target_names
)
disp.plot(cmap='Blues')
plt.title('Breast Cancer Classification')
plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=150)
plt.show()
```

### Using a Seaborn Heatmap
If you need more customization, use seaborn directly:
```python
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
import seaborn as sns

# Load, split, train
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot with seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(
    cm,
    annot=True,
    fmt='d',
    cmap='Blues',
    xticklabels=data.target_names,
    yticklabels=data.target_names,
    square=True,
    linewidths=0.5
)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix - Breast Cancer Classification')
plt.tight_layout()
plt.savefig('confusion_matrix_seaborn.png', dpi=150)
plt.show()
```

## Normalized Confusion Matrix
When class sizes differ, raw counts can be misleading. Normalization shows proportions instead, making classes directly comparable:
```python
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Side-by-side: raw counts vs. normalized confusion matrix
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Raw counts
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred,
    display_labels=data.target_names,
    cmap='Blues',
    ax=axes[0]
)
axes[0].set_title('Raw Counts')

# Normalized (rows sum to 1)
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred,
    display_labels=data.target_names,
    normalize='true',
    cmap='Blues',
    values_format='.2%',
    ax=axes[1]
)
axes[1].set_title('Normalized by True Label')

plt.tight_layout()
plt.savefig('confusion_matrix_normalized.png', dpi=150)
plt.show()
```

The normalize parameter supports three options:
| Value | Normalization | Use Case |
|---|---|---|
| `'true'` | Each row sums to 1 (divide by true-class counts) | Inspect per-class recall |
| `'pred'` | Each column sums to 1 (divide by predicted-class counts) | Inspect per-class precision |
| `'all'` | All cells sum to 1 (divide by the total sample count) | Inspect the overall distribution |
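If you want the normalized arrays themselves rather than a plot, the same options are accepted by the normalize parameter of confusion_matrix (available since scikit-learn 0.22). A short sketch reusing the earlier binary labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1]

cm_true = confusion_matrix(y_true, y_pred, normalize='true')  # rows sum to 1
cm_pred = confusion_matrix(y_true, y_pred, normalize='pred')  # columns sum to 1
cm_all = confusion_matrix(y_true, y_pred, normalize='all')    # all cells sum to 1

print(np.round(cm_true, 2))
```

The row-normalized diagonal is exactly the per-class recall; the column-normalized diagonal is the per-class precision.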
## Multi-Class Confusion Matrix
The confusion matrix extends naturally beyond two classes. Each row represents a true class and each column a predicted class:
```python
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load iris dataset (3 classes)
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# Train and predict
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

# Classification report
print("\nClassification Report:")
print(classification_report(
    y_test, y_pred,
    target_names=iris.target_names
))

# Visualize
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=iris.target_names
)
disp.plot(cmap='Blues')
plt.title('Iris Classification - 3 Classes')
plt.tight_layout()
plt.savefig('multi_class_confusion_matrix.png', dpi=150)
plt.show()
```

## Averaging Strategies for Multi-Class Metrics
When computing precision, recall, and F1 on a multi-class problem, you must choose an averaging strategy:
```python
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

for avg in ['micro', 'macro', 'weighted']:
    p = precision_score(y_test, y_pred, average=avg)
    r = recall_score(y_test, y_pred, average=avg)
    f1 = f1_score(y_test, y_pred, average=avg)
    print(f"{avg:8s} -- Precision: {p:.4f}, Recall: {r:.4f}, F1: {f1:.4f}")
```

| Average | Method | Best For |
|---|---|---|
| `micro` | Pools TP, FP, and FN across all classes, then computes the metric | When the effect of class imbalance on overall performance matters |
| `macro` | Unweighted mean of the per-class metrics | When every class matters equally |
| `weighted` | Mean of the per-class metrics weighted by support | A common default for imbalanced data |
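A useful property to remember: for single-label multi-class problems, micro-averaged precision, recall, and F1 all equal plain accuracy, because every misclassification is simultaneously a false positive for one class and a false negative for another. A quick check on a small made-up 3-class example:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Small 3-class example (illustrative labels)
y_true = [0, 1, 2, 2, 1, 0, 2, 1, 0, 2]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1, 0, 2]

acc = accuracy_score(y_true, y_pred)
micro_p = precision_score(y_true, y_pred, average='micro')
micro_r = recall_score(y_true, y_pred, average='micro')
micro_f1 = f1_score(y_true, y_pred, average='micro')

print(acc, micro_p, micro_r, micro_f1)  # all four are the same value
```

So on multi-class problems, micro averaging adds no information beyond accuracy; macro and weighted are where the per-class story shows up.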
## Complete Example: End-to-End Classification Evaluation
```python
from sklearn.metrics import (
    confusion_matrix, classification_report, ConfusionMatrixDisplay,
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
)
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt

# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

# Build pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', GradientBoostingClassifier(
        n_estimators=200, max_depth=3, random_state=42
    ))
])

# Train
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
y_prob = pipeline.predict_proba(X_test)[:, 1]

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

print("=" * 50)
print("MODEL EVALUATION REPORT")
print("=" * 50)
print("\nConfusion Matrix:")
print(f"  TN={tn}, FP={fp}")
print(f"  FN={fn}, TP={tp}")
print(f"\nAccuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
print("\nDetailed Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred, display_labels=data.target_names,
    cmap='Blues', ax=axes[0]
)
axes[0].set_title('Raw Counts')
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred, display_labels=data.target_names,
    normalize='true', values_format='.1%', cmap='Blues', ax=axes[1]
)
axes[1].set_title('Normalized')
plt.tight_layout()
plt.savefig('full_evaluation.png', dpi=150)
plt.show()
```

## Exploring Classification Results with PyGWalker
After building the confusion matrix, you can dig further into why misclassifications happen by exploring the raw data interactively. PyGWalker turns your prediction results into a drag-and-drop visual analysis interface inside Jupyter:
```python
import pandas as pd
import pygwalker as pyg

# Build a results DataFrame with features and predictions
results = pd.DataFrame(X_test, columns=data.feature_names)
results['actual'] = y_test
results['predicted'] = y_pred
results['correct'] = y_test == y_pred
results['confidence'] = y_prob

# Launch interactive exploration
walker = pyg.walk(results)
```

You can filter to the misclassified samples, compare feature distributions across the TP/FP/FN/TN groups, and spot the patterns that explain where the model struggles.
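Even without an interactive tool, plain pandas filtering on such a results frame surfaces the misclassifications. A self-contained sketch with stand-in values (the column names mirror the frame built above, but the data here is invented):

```python
import pandas as pd

# Stand-in for the results frame built above (illustrative values only)
results = pd.DataFrame({
    'mean radius': [14.2, 20.1, 11.8, 17.5],
    'actual':      [1, 0, 1, 0],
    'predicted':   [1, 1, 1, 0],
    'confidence':  [0.97, 0.62, 0.88, 0.15],
})
results['correct'] = results['actual'] == results['predicted']

# Misclassified rows, most confident mistakes first
errors = results[~results['correct']].sort_values('confidence', ascending=False)
print(errors)
print(f"{len(errors)} misclassified of {len(results)}")
```

High-confidence mistakes are usually the most informative rows to inspect, since they point at systematic blind spots rather than borderline cases.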
If you iterate on classification experiments in Jupyter (tuning thresholds, testing different models, or exploring feature combinations), RunCell provides an AI agent that accelerates that experiment loop.
## FAQ
### What is a confusion matrix in sklearn?
A confusion matrix is a table showing the counts of correct and incorrect predictions for each class. In sklearn, confusion_matrix(y_true, y_pred) returns a 2D numpy array in which rows represent true classes and columns represent predicted classes. For binary classification, the cells correspond to true positives, true negatives, false positives, and false negatives.
### How do I read a confusion matrix?
In sklearn's confusion matrix, rows are true labels and columns are predicted labels. For binary classification: top-left is true negatives (TN), top-right is false positives (FP), bottom-left is false negatives (FN), and bottom-right is true positives (TP). The diagonal elements are the correctly predicted samples.
### What is the difference between precision and recall?
Precision measures how many of the samples predicted positive actually are positive (TP / (TP + FP)). Recall measures how many of the actual positives the model caught (TP / (TP + FN)). Precision answers "when the model says positive, how often is it right?"; recall answers "of all true positives, how many did the model find?"
### When should I use F1-score instead of accuracy?
Use F1-score when classes are imbalanced. If 95% of samples are negative, a model that always predicts negative scores 95% accuracy but has zero recall on the positive class. Because F1-score is the harmonic mean of precision and recall, it penalizes models that sacrifice one for the other.
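This accuracy paradox is easy to reproduce with a degenerate model that always predicts the majority class:

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# 95% of samples are negative; the "model" always predicts the majority class
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"Accuracy: {acc:.2f}")  # 0.95
print(f"Recall:   {rec:.2f}")  # 0.00
print(f"F1:       {f1:.2f}")   # 0.00
```

High accuracy, zero recall, zero F1: exactly the failure mode accuracy alone cannot reveal.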
### How do I plot a confusion matrix in Python?
The quickest route is ConfusionMatrixDisplay.from_predictions(y_true, y_pred). For more customization, compute the matrix with confusion_matrix() and visualize it with seaborn.heatmap(). Both approaches support normalized matrices, custom colormaps, and class labels.
### What does normalize='true' mean in ConfusionMatrixDisplay?
Setting normalize='true' divides each row by the total number of samples in that true class, so every row sums to 1. This is equivalent to displaying per-class recall as percentages. Use normalize='pred' for per-class precision and normalize='all' for overall proportions.
## Conclusion
The confusion matrix is fundamental to evaluating classification models. Accuracy alone is not enough; you need to see exactly which errors the model makes. Use sklearn's confusion_matrix and classification_report for the complete picture; for presentations and reports, visualize with ConfusionMatrixDisplay or seaborn heatmaps; and remember to normalize when class sizes differ. Finally, choose your primary metric based on the business cost of each error type: prioritize precision when false positives are expensive, recall when false negatives are dangerous, and F1-score when you need a balanced measure.