Sklearn Random Forest: A Complete Guide to Classification and Regression in Python
You built a decision tree that hit 95% accuracy on the training set but only 62% on new data. A single decision tree memorizes its training set: every split and every leaf is tuned to the samples it has seen. The result is a model that looks good on paper but fails in production.
This overfitting problem is not just academic. Teams deploy models that perform well in a development notebook but produce unreliable predictions on live data. A single decision tree has high variance: small changes in the training data produce completely different tree structures. You cannot trust a model that sensitive to its training data.
Random Forest solves this by building hundreds of decision trees on random subsets of the data and features, then combining their predictions through majority voting (classification) or averaging (regression). This ensemble approach dramatically reduces variance while preserving accuracy. Scikit-learn's RandomForestClassifier and RandomForestRegressor provide production-ready implementations with built-in feature importance, out-of-bag (OOB) evaluation, and parallel training.
What Is a Random Forest?
A random forest is an ensemble learning method that combines many decision trees into a single, more robust prediction. It relies on a technique called bagging (Bootstrap Aggregating):
- Bootstrap sampling: create multiple random subsets of the training data by sampling with replacement. Each subset covers roughly 63% of the original rows.
- Random feature selection: at every split in every tree, consider only a random subset of the features (typically sqrt(n_features) for classification, n_features/3 for regression).
- Independent training: train a decision tree on each bootstrap sample under the random feature constraint.
- Aggregation: combine predictions by majority vote (classification) or mean (regression).
The randomness in both data sampling and feature selection decorrelates the individual trees. Even if one tree overfits a particular pattern, most of the others will not, and the ensemble averages the noise away.
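The bootstrap and aggregation steps above can be sketched with plain numpy; the `tree_preds` array below is a made-up stand-in for the outputs of three trained trees:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Bootstrap sample: draw n row indices with replacement
sample = rng.integers(0, n, size=n)
coverage = len(np.unique(sample)) / n
print(f"Unique rows in one bootstrap sample: {coverage:.1%}")  # ~63.2%

# Aggregation: majority vote across three hypothetical trees' predictions
tree_preds = np.array([
    [0, 1, 1],   # tree 1's predictions for 3 samples
    [0, 1, 0],   # tree 2
    [1, 1, 1],   # tree 3
])
votes = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, tree_preds)
print(votes)  # [0 1 1]
```

The ~63% figure is not arbitrary: the probability that a given row appears in a bootstrap sample of size n is 1 - (1 - 1/n)^n, which approaches 1 - 1/e ≈ 0.632.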
When to Use Random Forest
| Scenario | Good fit for Random Forest? | Why |
|---|---|---|
| Tabular data with mixed feature types | Yes | Handles numeric and categorical features without scaling |
| Need a feature importance ranking | Yes | Built-in feature_importances_ attribute |
| Small to medium datasets (up to ~100k rows) | Yes | Parallel training is fast |
| Imbalanced classification | Yes | Supports class_weight='balanced' |
| Need interpretable predictions | Moderate | Individual trees are interpretable; the ensemble less so |
| Very high-dimensional sparse data (text) | No | Linear models or gradient boosting usually work better |
| Real-time inference with strict latency | With caution | Large forests can be slow at prediction time |
RandomForestClassifier: A Classification Example
Here is a complete classification example using the wine dataset:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.datasets import load_wine
# Load dataset
wine = load_wine()
X, y = wine.data, wine.target
feature_names = wine.feature_names
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Classes: {wine.target_names}")
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Train Random Forest
rf = RandomForestClassifier(
n_estimators=100,
max_depth=None,
min_samples_split=2,
min_samples_leaf=1,
random_state=42,
n_jobs=-1
)
rf.fit(X_train, y_train)
# Evaluate
y_pred = rf.predict(X_test)
print(f"\nAccuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=wine.target_names))
Key Parameters Explained
| Parameter | Default | Description | Tuning advice |
|---|---|---|---|
| n_estimators | 100 | Number of trees in the forest | More trees = better but slower. Typically 100-500. |
| max_depth | None | Maximum depth of each tree | None grows trees fully. Set 10-30 to curb overfitting. |
| min_samples_split | 2 | Minimum samples required to split a node | Raise to 5-20 to prevent overfitting on noisy data. |
| min_samples_leaf | 1 | Minimum samples in a leaf node | Raise to 2-10 for smoother predictions. |
| max_features | 'sqrt' | Features considered at each split | 'sqrt' for classification; also try 'log2' or a fraction. |
| bootstrap | True | Use bootstrap sampling | Set False on tiny datasets to give every tree all the data. |
| class_weight | None | Per-class weights | Use 'balanced' for imbalanced datasets. |
| n_jobs | None | Number of parallel jobs | Set -1 to use all CPU cores. |
| oob_score | False | Evaluate on out-of-bag samples | Set True for a built-in validation estimate without a hold-out set. |
Out-of-Bag (OOB) Score
Each tree trains on roughly 63% of the data. The remaining 37% (the out-of-bag samples) serve as a free validation set:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)
rf = RandomForestClassifier(
n_estimators=200,
oob_score=True,
random_state=42,
n_jobs=-1
)
rf.fit(X_train, y_train)
print(f"OOB Score: {rf.oob_score_:.4f}")
print(f"Test Score: {rf.score(X_test, y_test):.4f}")
The OOB score gives you a validation estimate without a separate hold-out set, which is especially useful when data is scarce.
RandomForestRegressor: A Regression Example
Random forest regression predicts continuous values by averaging the outputs of all trees:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.datasets import fetch_california_housing
import numpy as np
# Load California housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target
feature_names = housing.feature_names
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train regressor
rf_reg = RandomForestRegressor(
n_estimators=200,
max_depth=20,
min_samples_leaf=5,
random_state=42,
n_jobs=-1
)
rf_reg.fit(X_train, y_train)
y_pred = rf_reg.predict(X_test)
# Evaluation metrics
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
print(f"R-squared: {r2:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
Comparing Regressors
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.datasets import fetch_california_housing
import numpy as np
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
regressors = {
'Linear Regression': LinearRegression(),
'Ridge': Ridge(alpha=1.0),
'Decision Tree': DecisionTreeRegressor(max_depth=10, random_state=42),
'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=20, random_state=42, n_jobs=-1),
'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, max_depth=5, random_state=42),
}
print(f"{'Model':<25} {'CV R² (mean)':>12} {'CV R² (std)':>12}")
print("-" * 52)
for name, model in regressors.items():
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2', n_jobs=-1)
print(f"{name:<25} {scores.mean():>12.4f} {scores.std():>12.4f}")
On datasets with nonlinear relationships, random forests typically outperform single decision trees and linear models while remaining competitive with gradient boosting.
Hyperparameter Tuning
GridSearchCV: Exhaustive Search
GridSearchCV tests every combination of the specified parameter values:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.datasets import load_wine
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [None, 10, 20],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
}
rf = RandomForestClassifier(random_state=42, n_jobs=-1)
grid_search = GridSearchCV(
rf,
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train, y_train)
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best CV Score: {grid_search.best_score_:.4f}")
print(f"Test Score: {grid_search.score(X_test, y_test):.4f}")
RandomizedSearchCV: Efficient Search
When the parameter space is large, RandomizedSearchCV samples a fixed number of parameter combinations instead of trying them all:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.datasets import load_wine
from scipy.stats import randint, uniform
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)
param_distributions = {
'n_estimators': randint(50, 500),
'max_depth': [None, 5, 10, 15, 20, 30],
'min_samples_split': randint(2, 20),
'min_samples_leaf': randint(1, 10),
'max_features': ['sqrt', 'log2', 0.3, 0.5, 0.7],
'bootstrap': [True, False],
}
rf = RandomForestClassifier(random_state=42, n_jobs=-1)
random_search = RandomizedSearchCV(
rf,
param_distributions,
n_iter=100,
cv=5,
scoring='accuracy',
random_state=42,
n_jobs=-1,
verbose=1
)
random_search.fit(X_train, y_train)
print(f"Best Parameters: {random_search.best_params_}")
print(f"Best CV Score: {random_search.best_score_:.4f}")
print(f"Test Score: {random_search.score(X_test, y_test):.4f}")
Which Parameters to Tune First
Not all parameters matter equally. Focus your tuning budget on the ones with the most impact:
| Parameter | Impact | Priority | Notes |
|---|---|---|---|
| n_estimators | High | 1st | More trees almost always help, until diminishing returns (~200-500) |
| max_depth | High | 2nd | Directly controls overfitting. Try None, 10, 20, 30 |
| min_samples_leaf | Medium | 3rd | Smooths predictions. Try 1, 2, 5, 10 |
| max_features | Medium | 4th | Controls tree diversity. 'sqrt' is usually fine for classification |
| min_samples_split | Low | 5th | Less practical impact than min_samples_leaf |
| bootstrap | Low | 6th | True is almost always better. Try False only on very small datasets |
Feature Importance
One of random forest's strongest advantages is built-in feature importance. Knowing which features drive predictions helps with model interpretation, feature selection, and domain insight.
Impurity-Based Feature Importance
The default feature_importances_ attribute measures how much each feature reduces impurity (Gini for classification, variance for regression) across all trees:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
# Get feature importances
importances = rf.feature_importances_
feature_names = wine.feature_names
indices = np.argsort(importances)[::-1]
# Print ranked features
print("Feature Ranking:")
for i, idx in enumerate(indices):
print(f" {i+1}. {feature_names[idx]:25s} ({importances[idx]:.4f})")
# Plot
plt.figure(figsize=(10, 6))
plt.barh(range(len(indices)), importances[indices[::-1]], align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices[::-1]])
plt.xlabel('Feature Importance (Gini)')
plt.title('Random Forest Feature Importance - Wine Dataset')
plt.tight_layout()
plt.savefig('rf_feature_importance.png', dpi=150)
plt.show()
Permutation Importance
Impurity-based importance can be biased toward high-cardinality features. Permutation importance instead measures how much model performance drops when a feature's values are randomly shuffled:
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
# Compute permutation importance on the test set
perm_imp = permutation_importance(
rf, X_test, y_test,
n_repeats=30,
random_state=42,
n_jobs=-1
)
# Sort and display
sorted_idx = perm_imp.importances_mean.argsort()[::-1]
print("Permutation Importance (test set):")
for idx in sorted_idx:
mean = perm_imp.importances_mean[idx]
std = perm_imp.importances_std[idx]
print(f" {wine.feature_names[idx]:25s}: {mean:.4f} +/- {std:.4f}")
# Plot with error bars
plt.figure(figsize=(10, 6))
plt.barh(
range(len(sorted_idx)),
perm_imp.importances_mean[sorted_idx[::-1]],
xerr=perm_imp.importances_std[sorted_idx[::-1]],
align='center'
)
plt.yticks(range(len(sorted_idx)), [wine.feature_names[i] for i in sorted_idx[::-1]])
plt.xlabel('Decrease in Accuracy')
plt.title('Permutation Importance - Wine Dataset')
plt.tight_layout()
plt.savefig('rf_permutation_importance.png', dpi=150)
plt.show()
Which Importance Method Should You Use?
| Method | Pros | Cons | Best for |
|---|---|---|---|
| Impurity-based (feature_importances_) | Fast, no extra computation | Biased toward high-cardinality features | Quick screening, early exploration |
| Permutation importance | Unbiased, works on test data | Slower, affected by correlated features | Final feature selection, reporting |
Cross-Validation with Random Forests
Basic Cross-Validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_wine
wine = load_wine()
X, y = wine.data, wine.target
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
print(f"CV Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
print(f"Per-fold: {scores}")
StratifiedKFold for Imbalanced Data
StratifiedKFold preserves the class distribution in every fold, which is essential for imbalanced datasets:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.datasets import load_wine
import numpy as np
wine = load_wine()
X, y = wine.data, wine.target
# Stratified 10-fold cross-validation
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
scores = cross_val_score(rf, X, y, cv=skf, scoring='accuracy')
print(f"Stratified 10-Fold Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
# Multiple metrics
from sklearn.model_selection import cross_validate
results = cross_validate(
rf, X, y, cv=skf,
scoring=['accuracy', 'f1_weighted', 'precision_weighted', 'recall_weighted'],
n_jobs=-1
)
for metric in ['test_accuracy', 'test_f1_weighted', 'test_precision_weighted', 'test_recall_weighted']:
vals = results[metric]
name = metric.replace('test_', '')
print(f"{name:>20s}: {vals.mean():.4f} (+/- {vals.std():.4f})")
Handling Imbalanced Data
When one class has far more samples than the others, a model can score high accuracy by always predicting the majority class. Random forest offers several tools for handling this.
Using class_weight='balanced'
The class_weight='balanced' parameter automatically adjusts weights inversely proportional to class frequencies:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import make_classification
# Create imbalanced dataset (95% class 0, 5% class 1)
X, y = make_classification(
n_samples=2000,
n_features=20,
weights=[0.95, 0.05],
flip_y=0,
random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
# Without class weight
rf_default = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf_default.fit(X_train, y_train)
print("=== Without class_weight ===")
print(classification_report(y_test, rf_default.predict(X_test)))
# With balanced class weight
rf_balanced = RandomForestClassifier(
n_estimators=200,
class_weight='balanced',
random_state=42,
n_jobs=-1
)
rf_balanced.fit(X_train, y_train)
print("=== With class_weight='balanced' ===")
print(classification_report(y_test, rf_balanced.predict(X_test)))
Oversampling with SMOTE
SMOTE (Synthetic Minority Oversampling Technique) creates synthetic samples for the minority class. Use it with an imblearn pipeline:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
# Create imbalanced dataset
X, y = make_classification(
n_samples=2000,
n_features=20,
weights=[0.95, 0.05],
flip_y=0,
random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
# SMOTE + Random Forest pipeline
pipeline = ImbPipeline([
('smote', SMOTE(random_state=42)),
('rf', RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1))
])
pipeline.fit(X_train, y_train)
print("=== SMOTE + Random Forest ===")
print(classification_report(y_test, pipeline.predict(X_test)))
Model Evaluation
Classification Report and Confusion Matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
classification_report, confusion_matrix,
ConfusionMatrixDisplay, accuracy_score
)
from sklearn.datasets import load_wine
import matplotlib.pyplot as plt
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
# Metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"\n{classification_report(y_test, y_pred, target_names=wine.target_names)}")
# Confusion matrix plot
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=wine.target_names)
disp.plot(cmap='Blues')
plt.title('Random Forest - Wine Classification')
plt.tight_layout()
plt.savefig('rf_confusion_matrix.png', dpi=150)
plt.show()
ROC Curve for Binary Classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score, RocCurveDisplay
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
cancer.data, cancer.target, test_size=0.2, random_state=42, stratify=cancer.target
)
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
# Predict probabilities
y_prob = rf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)
# Plot ROC curve
RocCurveDisplay.from_estimator(rf, X_test, y_test)
plt.title(f'Random Forest ROC Curve (AUC = {auc:.4f})')
plt.tight_layout()
plt.savefig('rf_roc_curve.png', dpi=150)
plt.show()
Random Forest vs. Other Algorithms
| Property | Random Forest | XGBoost | Gradient Boosting | Decision Tree |
|---|---|---|---|---|
| Ensemble type | Bagging (parallel) | Boosting (sequential) | Boosting (sequential) | Single model |
| Accuracy | High | Very high | Very high | Medium |
| Training speed | Fast (parallelizable) | Medium | Slow (sequential) | Very fast |
| Prediction speed | Medium | Fast | Medium | Very fast |
| Overfitting risk | Low | Low (with tuning) | Low (with tuning) | High |
| Hyperparameter sensitivity | Low | High | High | Medium |
| Needs feature scaling | No | No | No | No |
| Handles missing values | No (impute first) | Yes (built-in) | No (impute first) | No |
| Built-in feature importance | Yes | Yes | Yes | Yes |
| Interpretability | Medium | Low | Low | High |
| Best for | General-purpose first model | Kaggle competitions, maximum accuracy | Structured tabular data | Quick baselines, small datasets |
When to choose random forest over the alternatives:
- You need a strong baseline with minimal tuning
- Training speed matters and you have a multi-core CPU
- You want reliable feature importance estimates
- You are not chasing the last 0.5% of accuracy that gradient boosting methods might offer
Real-World Pipeline: An End-to-End Example
This pipeline combines preprocessing, feature engineering, model training, evaluation, and prediction into a production-style workflow:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd
# Load and prepare data
cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target
# Introduce some missing values to simulate real data
np.random.seed(42)
mask = np.random.random(df.shape) < 0.05
df_missing = df.mask(mask.astype(bool))
df_missing['target'] = cancer.target # Keep target clean
X = df_missing.drop('target', axis=1)
y = df_missing['target']
# Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Build preprocessing + model pipeline
numeric_features = X.columns.tolist()
numeric_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
]
)
pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(
n_estimators=300,
max_depth=20,
min_samples_leaf=2,
class_weight='balanced',
random_state=42,
n_jobs=-1
))
])
# Cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=skf, scoring='accuracy')
print(f"Cross-validation accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
# Train final model
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
# Evaluation
print(f"\nTest Set Results:")
print(classification_report(y_test, y_pred, target_names=cancer.target_names))
# Make predictions on new data
sample = X_test.iloc[:3]
predictions = pipeline.predict(sample)
probabilities = pipeline.predict_proba(sample)
print(f"\nSample Predictions:")
for i, (pred, prob) in enumerate(zip(predictions, probabilities)):
class_name = cancer.target_names[pred]
confidence = prob[pred]
print(f" Sample {i+1}: {class_name} (confidence: {confidence:.2%})")
Saving and Loading the Model
import joblib
# Save the trained pipeline
joblib.dump(pipeline, 'rf_pipeline.joblib')
# Load and use later
loaded_pipeline = joblib.load('rf_pipeline.joblib')
new_predictions = loaded_pipeline.predict(X_test[:5])
print(f"Loaded model predictions: {new_predictions}")
Exploring Results with PyGWalker
After training a random forest, you often want to dig into feature importance patterns, prediction distributions, and misclassified cases. PyGWalker turns your results DataFrame into an interactive, Tableau-like exploration interface inside Jupyter:
import pandas as pd
import pygwalker as pyg
# Build a results DataFrame
results = pd.DataFrame(X_test.values, columns=cancer.feature_names)
results['actual'] = y_test.values
results['predicted'] = y_pred
results['correct'] = y_test.values == y_pred
results['prob_malignant'] = pipeline.predict_proba(X_test)[:, 0]
results['prob_benign'] = pipeline.predict_proba(X_test)[:, 1]
# Launch interactive exploration
walker = pyg.walk(results)
Drag features onto the axes, filter to misclassified samples, and color by prediction confidence to spot where the model struggles. This kind of visual analysis helps you decide which features to engineer or which samples deserve a closer look.
For running complete machine learning experiment workflows, from data loading through model comparison to final evaluation, RunCell provides an AI-powered Jupyter environment that helps you iterate on experiments faster, auto-generate evaluation code, and manage notebook workflows.
Frequently Asked Questions
How many trees should I use in a random forest?
Start with 100-200 trees. Accuracy generally improves with more trees but plateaus at some point; use cross-validation to find the sweet spot. Beyond about 500 trees, the gains are usually negligible while training time keeps growing. Monitor the OOB score as you increase n_estimators: when it stops improving, you have enough trees.
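A minimal sketch of that monitoring loop, using `warm_start=True` to grow a single forest incrementally on the wine dataset instead of retraining from scratch at each size:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)

# warm_start keeps existing trees and only fits the new ones on each call,
# so we can watch the OOB score as the forest grows
rf = RandomForestClassifier(
    n_estimators=0, warm_start=True, oob_score=True,
    random_state=42, n_jobs=-1
)
for n in [50, 100, 200, 300]:
    rf.set_params(n_estimators=n)
    rf.fit(X, y)
    print(f"n_estimators={n:3d}  OOB score={rf.oob_score_:.4f}")
```

When consecutive OOB scores stop moving, adding more trees only costs time.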
Does random forest need feature scaling?
No. Random forests split on feature value thresholds, so a feature's absolute scale does not affect split decisions. Unlike logistic regression, SVMs, or neural networks, they handle features on different ranges naturally. However, if your pipeline includes other components such as PCA or distance-based preprocessing, those steps may still require scaling.
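A quick sanity check of this scale invariance: train one forest on the raw wine features and an identically seeded forest on standardized features, and compare their predictions:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Same hyperparameters and seed; only the feature scale differs
rf_raw = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
rf_scaled = RandomForestClassifier(n_estimators=100, random_state=42).fit(
    StandardScaler().fit_transform(X), y
)

same = np.array_equal(
    rf_raw.predict(X),
    rf_scaled.predict(StandardScaler().fit_transform(X))
)
print(f"Identical predictions with and without scaling: {same}")
```

Standardization is a monotonic per-feature transform, so split orderings, and hence the learned partitions, are unchanged.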
How does random forest handle missing values?
Scikit-learn's RandomForestClassifier and RandomForestRegressor do not handle missing values natively. You must impute before training: use SimpleImputer with a median or mean strategy for numeric features, or a more advanced method such as IterativeImputer. Other implementations like H2O or LightGBM can handle missing values directly.
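A minimal sketch of that imputation step, simulating 10% missing values on the wine dataset and filling them with column medians inside a pipeline:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

X, y = load_wine(return_X_y=True)

# Simulate real-world missingness: knock out ~10% of values at random
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan

# Median imputation runs before the forest ever sees the data
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipe.fit(X_missing, y)
print(f"Training accuracy after imputation: {pipe.score(X_missing, y):.3f}")
```

Putting the imputer inside the pipeline ensures the same medians learned on training data are reused at prediction time, avoiding leakage.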
What is the difference between random forest and gradient boosting?
Random forest builds trees independently and in parallel (bagging), while gradient boosting builds them sequentially, with each tree correcting the errors of the previous one (boosting). Random forest reduces variance; gradient boosting reduces bias. In practice, gradient boosting (especially XGBoost) often reaches slightly higher accuracy, but random forest is easier to tune and less prone to overfitting.
随机森林可以用于特征选择吗?
可以。使用 feature_importances_ 进行快速排名,或使用 permutation_importance 获得更可靠的估计。然后可以删除低重要性特征并重新训练。或者,在管道中使用 SelectFromModel 配合随机森林估计器,自动选择高于阈值的特征。
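A short sketch of the SelectFromModel approach on the wine dataset, keeping only features whose importance exceeds the mean importance:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_wine(return_X_y=True)

# threshold='mean' keeps features with above-average importance
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=42),
    threshold='mean'
)
X_reduced = selector.fit_transform(X, y)

print(f"Selected {X_reduced.shape[1]} of {X.shape[1]} features")
print("Kept:", [load_wine().feature_names[i]
                for i in selector.get_support(indices=True)])
```

`X_reduced` can then be fed to any downstream estimator, or the selector can sit as a step inside a Pipeline so the selection is refit per cross-validation fold.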
Conclusion
Random forest is one of the most reliable and versatile algorithms in machine learning. It reduces overfitting by combining hundreds of decorrelated decision trees, handles classification and regression without feature scaling, and ships with built-in feature importance rankings. For most tabular-data problems it makes an excellent first model, and it is often good enough for production.
Start with RandomForestClassifier or RandomForestRegressor at default settings as a baseline. Tune n_estimators first until returns diminish, then max_depth and min_samples_leaf to control overfitting. Use class_weight='balanced' for imbalanced data, permutation importance for trustworthy feature rankings, and StratifiedKFold cross-validation for robust evaluation. Reach for gradient boosting or XGBoost when you need the absolute highest accuracy on structured data, but random forest remains the safest default that rarely performs badly.