Sklearn Random Forest: Complete Guide to Classification and Regression in Python
You built a decision tree that gets 95% training accuracy, then it scores 62% on new data. A single decision tree memorizes the training set -- every split, every leaf is tuned to the exact samples it saw. The result is a model that looks great on paper but fails in production.
This overfitting problem is not just academic. Teams deploy models that perform well in development notebooks but generate unreliable predictions on live data. A single decision tree has high variance: small changes in the training data produce entirely different tree structures. You cannot trust a model that is this sensitive to its training data.
Random Forest solves this by building hundreds of decision trees on random subsets of data and features, then combining their predictions through majority voting (classification) or averaging (regression). This ensemble approach dramatically reduces variance while maintaining accuracy. Scikit-learn's RandomForestClassifier and RandomForestRegressor provide a production-ready implementation with built-in feature importance, out-of-bag evaluation, and parallel training.
What Is Random Forest?
Random Forest is an ensemble learning method that combines multiple decision trees to produce a single, more robust prediction. It uses a technique called bagging (Bootstrap Aggregating):
- Bootstrap sampling: Create multiple random subsets of the training data by sampling with replacement. Each subset is roughly 63% of the original data.
- Random feature selection: At each split in each tree, consider only a random subset of features (typically sqrt(n_features) for classification, n_features/3 for regression).
- Independent training: Train a decision tree on each bootstrap sample with the random feature constraint.
- Aggregation: Combine predictions by majority vote (classification) or mean (regression).
The randomness in both data sampling and feature selection ensures that individual trees are decorrelated. Even if one tree overfits a particular pattern, the majority of other trees will not, and the ensemble averages out the noise.
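The four steps above can be sketched by hand with plain decision trees. This is a minimal illustration of the bagging idea, not the actual RandomForestClassifier internals, using the wine dataset:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

rng = np.random.default_rng(42)
n_trees = 25
all_preds = []
for i in range(n_trees):
    # Bootstrap sampling: draw len(X_train) rows with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Random feature selection happens inside the tree via max_features
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    all_preds.append(tree.predict(X_test))

# Aggregation: majority vote across trees for each test sample
votes = np.stack(all_preds)  # shape (n_trees, n_test_samples)
ensemble_pred = np.apply_along_axis(
    lambda col: np.bincount(col).argmax(), axis=0, arr=votes
)
print(f"Hand-rolled bagging accuracy: {(ensemble_pred == y_test).mean():.4f}")
```

In practice you would never do this by hand -- RandomForestClassifier runs the same loop in parallel and adds OOB bookkeeping -- but it shows why the ensemble's vote is more stable than any single tree.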
When to Use Random Forest
| Scenario | Random Forest? | Why |
|---|---|---|
| Tabular data with mixed feature types | Yes | Handles numeric and categorical features, no scaling needed |
| You need feature importance rankings | Yes | Built-in feature_importances_ attribute |
| Small to medium datasets (up to ~100K rows) | Yes | Fast training with parallel processing |
| Imbalanced classification | Yes | Supports class_weight='balanced' |
| You need interpretable predictions | Moderate | Individual trees are interpretable, but the ensemble is less so |
| Very high-dimensional sparse data (text) | No | Linear models or gradient boosting are typically better |
| Real-time inference with strict latency | Careful | Large forests can be slow at prediction time |
RandomForestClassifier: Classification Example
Here is a complete classification example using the wine dataset:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.datasets import load_wine
# Load dataset
wine = load_wine()
X, y = wine.data, wine.target
feature_names = wine.feature_names
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Classes: {wine.target_names}")
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Train Random Forest
rf = RandomForestClassifier(
n_estimators=100,
max_depth=None,
min_samples_split=2,
min_samples_leaf=1,
random_state=42,
n_jobs=-1
)
rf.fit(X_train, y_train)
# Evaluate
y_pred = rf.predict(X_test)
print(f"\nAccuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=wine.target_names))

Key Parameters Explained
| Parameter | Default | Description | Tuning Tip |
|---|---|---|---|
| n_estimators | 100 | Number of trees in the forest | More trees = better performance but slower. 100-500 is typical. |
| max_depth | None | Maximum depth of each tree | None means fully grown. Set to 10-30 to reduce overfitting. |
| min_samples_split | 2 | Minimum samples to split a node | Increase to 5-20 to prevent overfitting on noisy data. |
| min_samples_leaf | 1 | Minimum samples in a leaf node | Increase to 2-10 for smoother predictions. |
| max_features | 'sqrt' | Features considered at each split | 'sqrt' for classification, 'log2' or a fraction for alternatives. |
| bootstrap | True | Use bootstrap sampling | Set False for small datasets to use all data per tree. |
| class_weight | None | Weights for each class | Use 'balanced' for imbalanced datasets. |
| n_jobs | None | Number of parallel jobs | Set to -1 to use all CPU cores. |
| oob_score | False | Use out-of-bag samples for evaluation | Set True for a built-in validation estimate without a holdout set. |
Out-of-Bag (OOB) Score
Each tree is trained on roughly 63% of the data. The remaining 37% (out-of-bag samples) can be used as a free validation set:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)
rf = RandomForestClassifier(
n_estimators=200,
oob_score=True,
random_state=42,
n_jobs=-1
)
rf.fit(X_train, y_train)
print(f"OOB Score: {rf.oob_score_:.4f}")
print(f"Test Score: {rf.score(X_test, y_test):.4f}")

The OOB score gives you a validation estimate without needing a separate holdout set. It is especially useful when data is limited.
RandomForestRegressor: Regression Example
Random Forest regression predicts continuous values by averaging the outputs of all trees:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.datasets import fetch_california_housing
import numpy as np
# Load California housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target
feature_names = housing.feature_names
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train regressor
rf_reg = RandomForestRegressor(
n_estimators=200,
max_depth=20,
min_samples_leaf=5,
random_state=42,
n_jobs=-1
)
rf_reg.fit(X_train, y_train)
y_pred = rf_reg.predict(X_test)
# Evaluation metrics
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
print(f"R-squared: {r2:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")

Comparing Regressors
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.datasets import fetch_california_housing
import numpy as np
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
regressors = {
'Linear Regression': LinearRegression(),
'Ridge': Ridge(alpha=1.0),
'Decision Tree': DecisionTreeRegressor(max_depth=10, random_state=42),
'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=20, random_state=42, n_jobs=-1),
'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, max_depth=5, random_state=42),
}
print(f"{'Model':<25} {'CV R² (mean)':>12} {'CV R² (std)':>12}")
print("-" * 52)
for name, model in regressors.items():
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2', n_jobs=-1)
print(f"{name:<25} {scores.mean():>12.4f} {scores.std():>12.4f}")

Random Forest typically outperforms a single decision tree and linear models on datasets with non-linear relationships, while being competitive with gradient boosting.
Hyperparameter Tuning
GridSearchCV: Exhaustive Search
GridSearchCV tests every combination of specified parameter values:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.datasets import load_wine
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [None, 10, 20],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
}
rf = RandomForestClassifier(random_state=42, n_jobs=-1)
grid_search = GridSearchCV(
rf,
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train, y_train)
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best CV Score: {grid_search.best_score_:.4f}")
print(f"Test Score: {grid_search.score(X_test, y_test):.4f}")

RandomizedSearchCV: Efficient Search
When the parameter space is large, RandomizedSearchCV samples a fixed number of parameter combinations instead of trying all of them:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.datasets import load_wine
from scipy.stats import randint, uniform
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)
param_distributions = {
'n_estimators': randint(50, 500),
'max_depth': [None, 5, 10, 15, 20, 30],
'min_samples_split': randint(2, 20),
'min_samples_leaf': randint(1, 10),
'max_features': ['sqrt', 'log2', 0.3, 0.5, 0.7],
'bootstrap': [True, False],
}
rf = RandomForestClassifier(random_state=42, n_jobs=-1)
random_search = RandomizedSearchCV(
rf,
param_distributions,
n_iter=100,
cv=5,
scoring='accuracy',
random_state=42,
n_jobs=-1,
verbose=1
)
random_search.fit(X_train, y_train)
print(f"Best Parameters: {random_search.best_params_}")
print(f"Best CV Score: {random_search.best_score_:.4f}")
print(f"Test Score: {random_search.score(X_test, y_test):.4f}")

Parameter Importance for Tuning
Not all parameters have equal impact. Focus your tuning budget on the parameters that matter most:
| Parameter | Impact | Priority | Notes |
|---|---|---|---|
| n_estimators | High | 1st | More trees almost always helps until diminishing returns (~200-500) |
| max_depth | High | 2nd | Controls overfitting directly. Try None, 10, 20, 30 |
| min_samples_leaf | Medium | 3rd | Smooths predictions. Try 1, 2, 5, 10 |
| max_features | Medium | 4th | Controls tree diversity. 'sqrt' is usually good for classification |
| min_samples_split | Low | 5th | Less impact than min_samples_leaf in practice |
| bootstrap | Low | 6th | True is almost always better. Only try False on very small datasets |
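To see the impact of a single high-priority parameter in isolation before running a full search, validation_curve is handy. A quick sketch on the wine dataset, sweeping max_depth:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

X, y = load_wine(return_X_y=True)

# Cross-validated train/validation scores for each max_depth value
param_range = [2, 5, 10, None]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    X, y,
    param_name='max_depth',
    param_range=param_range,
    cv=5,
    scoring='accuracy',
)
for depth, tr, va in zip(param_range, train_scores, val_scores):
    print(f"max_depth={str(depth):>4s}  train={tr.mean():.4f}  val={va.mean():.4f}")
```

A large gap between train and validation scores at a given depth is the overfitting signal that tells you where to cap max_depth.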
Feature Importance
One of Random Forest's strongest advantages is built-in feature importance. Understanding which features drive predictions helps with model interpretation, feature selection, and domain insights.
Impurity-Based Feature Importance
The default feature_importances_ attribute measures how much each feature decreases impurity (Gini for classification, variance for regression) across all trees:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
# Get feature importances
importances = rf.feature_importances_
feature_names = wine.feature_names
indices = np.argsort(importances)[::-1]
# Print ranked features
print("Feature Ranking:")
for i, idx in enumerate(indices):
print(f" {i+1}. {feature_names[idx]:25s} ({importances[idx]:.4f})")
# Plot
plt.figure(figsize=(10, 6))
plt.barh(range(len(indices)), importances[indices[::-1]], align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices[::-1]])
plt.xlabel('Feature Importance (Gini)')
plt.title('Random Forest Feature Importance - Wine Dataset')
plt.tight_layout()
plt.savefig('rf_feature_importance.png', dpi=150)
plt.show()

Permutation Importance
Impurity-based importance can be biased toward high-cardinality features. Permutation importance measures the drop in model performance when a feature's values are randomly shuffled:
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
# Compute permutation importance on the test set
perm_imp = permutation_importance(
rf, X_test, y_test,
n_repeats=30,
random_state=42,
n_jobs=-1
)
# Sort and display
sorted_idx = perm_imp.importances_mean.argsort()[::-1]
print("Permutation Importance (test set):")
for idx in sorted_idx:
mean = perm_imp.importances_mean[idx]
std = perm_imp.importances_std[idx]
print(f" {wine.feature_names[idx]:25s}: {mean:.4f} +/- {std:.4f}")
# Plot with error bars
plt.figure(figsize=(10, 6))
plt.barh(
range(len(sorted_idx)),
perm_imp.importances_mean[sorted_idx[::-1]],
xerr=perm_imp.importances_std[sorted_idx[::-1]],
align='center'
)
plt.yticks(range(len(sorted_idx)), [wine.feature_names[i] for i in sorted_idx[::-1]])
plt.xlabel('Decrease in Accuracy')
plt.title('Permutation Importance - Wine Dataset')
plt.tight_layout()
plt.savefig('rf_permutation_importance.png', dpi=150)
plt.show()

Which Importance Method to Use?
| Method | Pros | Cons | Best For |
|---|---|---|---|
| Impurity-based (feature_importances_) | Fast, no extra computation | Biased toward high-cardinality features | Quick screening, initial exploration |
| Permutation importance | Unbiased, works on test data | Slower, affected by correlated features | Final feature selection, reporting |
Cross-Validation with Random Forest
Basic Cross-Validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_wine
wine = load_wine()
X, y = wine.data, wine.target
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
print(f"CV Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
print(f"Per-fold: {scores}")

StratifiedKFold for Imbalanced Data
StratifiedKFold preserves the class distribution in each fold, which is critical for imbalanced datasets:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.datasets import load_wine
import numpy as np
wine = load_wine()
X, y = wine.data, wine.target
# Stratified 10-fold cross-validation
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
scores = cross_val_score(rf, X, y, cv=skf, scoring='accuracy')
print(f"Stratified 10-Fold Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
# Multiple metrics
from sklearn.model_selection import cross_validate
results = cross_validate(
rf, X, y, cv=skf,
scoring=['accuracy', 'f1_weighted', 'precision_weighted', 'recall_weighted'],
n_jobs=-1
)
for metric in ['test_accuracy', 'test_f1_weighted', 'test_precision_weighted', 'test_recall_weighted']:
vals = results[metric]
name = metric.replace('test_', '')
print(f"{name:>20s}: {vals.mean():.4f} (+/- {vals.std():.4f})")

Handling Imbalanced Data
When one class has far more samples than others, a model can achieve high accuracy by always predicting the majority class. Random Forest provides several tools to handle this.
Using class_weight='balanced'
The class_weight='balanced' parameter automatically adjusts weights inversely proportional to class frequencies:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import make_classification
# Create imbalanced dataset (95% class 0, 5% class 1)
X, y = make_classification(
n_samples=2000,
n_features=20,
weights=[0.95, 0.05],
flip_y=0,
random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
# Without class weight
rf_default = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf_default.fit(X_train, y_train)
print("=== Without class_weight ===")
print(classification_report(y_test, rf_default.predict(X_test)))
# With balanced class weight
rf_balanced = RandomForestClassifier(
n_estimators=200,
class_weight='balanced',
random_state=42,
n_jobs=-1
)
rf_balanced.fit(X_train, y_train)
print("=== With class_weight='balanced' ===")
print(classification_report(y_test, rf_balanced.predict(X_test)))

Integrating SMOTE for Oversampling
SMOTE (Synthetic Minority Oversampling Technique) creates synthetic samples for the minority class. Use it with imblearn's pipeline:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
# Create imbalanced dataset
X, y = make_classification(
n_samples=2000,
n_features=20,
weights=[0.95, 0.05],
flip_y=0,
random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
# SMOTE + Random Forest pipeline
pipeline = ImbPipeline([
('smote', SMOTE(random_state=42)),
('rf', RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1))
])
pipeline.fit(X_train, y_train)
print("=== SMOTE + Random Forest ===")
print(classification_report(y_test, pipeline.predict(X_test)))

Model Evaluation
Classification Report and Confusion Matrix
For a deep dive into interpreting these metrics, see our confusion matrix guide.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
classification_report, confusion_matrix,
ConfusionMatrixDisplay, accuracy_score
)
from sklearn.datasets import load_wine
import matplotlib.pyplot as plt
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
# Metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"\n{classification_report(y_test, y_pred, target_names=wine.target_names)}")
# Confusion matrix plot
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=wine.target_names)
disp.plot(cmap='Blues')
plt.title('Random Forest - Wine Classification')
plt.tight_layout()
plt.savefig('rf_confusion_matrix.png', dpi=150)
plt.show()

ROC Curve for Binary Classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score, RocCurveDisplay
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
cancer.data, cancer.target, test_size=0.2, random_state=42, stratify=cancer.target
)
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
# Predict probabilities
y_prob = rf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)
# Plot ROC curve
RocCurveDisplay.from_estimator(rf, X_test, y_test)
plt.title(f'Random Forest ROC Curve (AUC = {auc:.4f})')
plt.tight_layout()
plt.savefig('rf_roc_curve.png', dpi=150)
plt.show()

Random Forest vs Other Algorithms
| Feature | Random Forest | XGBoost | Gradient Boosting | Decision Tree |
|---|---|---|---|---|
| Ensemble Type | Bagging (parallel) | Boosting (sequential) | Boosting (sequential) | Single model |
| Accuracy | High | Very High | Very High | Moderate |
| Training Speed | Fast (parallelizable) | Moderate | Slow (sequential) | Very Fast |
| Prediction Speed | Moderate | Fast | Moderate | Very Fast |
| Overfitting Risk | Low | Low (with tuning) | Low (with tuning) | High |
| Hyperparameter Sensitivity | Low | High | High | Moderate |
| Feature Scaling Required | No | No | No | No |
| Handles Missing Values | No (needs imputation) | Yes (built-in) | No (needs imputation) | No |
| Built-in Feature Importance | Yes | Yes | Yes | Yes |
| Interpretability | Moderate | Low | Low | High |
| Best For | General-purpose, first model | Kaggle competitions, maximum accuracy | Structured tabular data | Quick baselines, small datasets |
When to choose Random Forest over alternatives:
- You need a strong baseline model with minimal tuning
- Training speed matters and you have multiple CPU cores
- You want reliable feature importance estimates
- You are not chasing the last 0.5% of accuracy that boosting methods might provide
Real-World Pipeline: End-to-End Example
This pipeline combines preprocessing, feature engineering, model training, evaluation, and prediction in a production-style workflow:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd
# Load and prepare data
cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target
# Introduce some missing values to simulate real data
np.random.seed(42)
mask = np.random.random(df.shape) < 0.05
df_missing = df.mask(mask)
df_missing['target'] = cancer.target # Keep target clean
X = df_missing.drop('target', axis=1)
y = df_missing['target']
# Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Build preprocessing + model pipeline
numeric_features = X.columns.tolist()
numeric_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
]
)
pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(
n_estimators=300,
max_depth=20,
min_samples_leaf=2,
class_weight='balanced',
random_state=42,
n_jobs=-1
))
])
# Cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=skf, scoring='accuracy')
print(f"Cross-validation accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
# Train final model
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
# Evaluation
print(f"\nTest Set Results:")
print(classification_report(y_test, y_pred, target_names=cancer.target_names))
# Make predictions on new data
sample = X_test.iloc[:3]
predictions = pipeline.predict(sample)
probabilities = pipeline.predict_proba(sample)
print(f"\nSample Predictions:")
for i, (pred, prob) in enumerate(zip(predictions, probabilities)):
class_name = cancer.target_names[pred]
confidence = prob[pred]
print(f" Sample {i+1}: {class_name} (confidence: {confidence:.2%})")

Saving and Loading the Model
import joblib
# Save the trained pipeline
joblib.dump(pipeline, 'rf_pipeline.joblib')
# Load and use later
loaded_pipeline = joblib.load('rf_pipeline.joblib')
new_predictions = loaded_pipeline.predict(X_test[:5])
print(f"Loaded model predictions: {new_predictions}")

Exploring Results with PyGWalker
After training your Random Forest model, you often need to explore feature importance patterns, prediction distributions, and misclassification cases in detail. PyGWalker lets you turn your results DataFrame into an interactive Tableau-like exploration interface directly in Jupyter:
import pandas as pd
import pygwalker as pyg
# Build a results DataFrame
results = pd.DataFrame(X_test.values, columns=cancer.feature_names)
results['actual'] = y_test.values
results['predicted'] = y_pred
results['correct'] = y_test.values == y_pred
results['prob_malignant'] = pipeline.predict_proba(X_test)[:, 0]
results['prob_benign'] = pipeline.predict_proba(X_test)[:, 1]
# Launch interactive exploration
walker = pyg.walk(results)

Drag features to axes, filter by misclassified samples, and color-code by prediction confidence to identify where the model struggles. This kind of visual analysis helps you decide which features to engineer or which samples need closer inspection.
For running your full ML experimentation workflow -- from data loading through model comparison to final evaluation -- RunCell provides an AI-powered Jupyter environment that helps you iterate faster on experiments, auto-generate evaluation code, and manage your notebook workflow.
FAQ
How many trees should I use in a Random Forest?
Start with 100-200 trees. Accuracy generally improves with more trees but plateaus after a certain point. Use cross-validation to find the sweet spot. Beyond 500 trees, gains are usually negligible while training time increases. Monitor the OOB score as you increase n_estimators -- when it stops improving, you have enough trees.
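One way to monitor the OOB score as trees are added is warm_start=True, which makes each call to fit() grow only the new trees instead of retraining from scratch. A minimal sketch on the wine dataset:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)

# warm_start=True keeps already-fitted trees between fit() calls,
# so each step only trains the additional trees
rf = RandomForestClassifier(
    warm_start=True, oob_score=True, random_state=42, n_jobs=-1
)
for n in [50, 100, 200, 400]:
    rf.set_params(n_estimators=n)
    rf.fit(X, y)
    print(f"n_estimators={n:4d}  OOB score: {rf.oob_score_:.4f}")
```

When the printed OOB score stops improving between steps, adding more trees only costs training and prediction time.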
Does Random Forest need feature scaling?
No. Random Forest makes splits based on feature value thresholds, so the absolute scale of features does not affect the splitting decisions. Unlike logistic regression, SVM, or neural networks, Random Forest handles features with different ranges naturally. However, if your pipeline includes other components (like PCA or distance-based preprocessing), scaling may still be required for those steps.
How does Random Forest handle missing values?
Scikit-learn's RandomForestClassifier and RandomForestRegressor do not handle missing values natively. You must impute missing data before training -- use SimpleImputer with median or mean strategy for numeric features, or use more advanced imputation methods like IterativeImputer. Some other implementations like H2O or LightGBM can handle missing values directly.
What is the difference between Random Forest and Gradient Boosting?
Random Forest builds trees independently in parallel (bagging), while Gradient Boosting builds trees sequentially where each tree corrects the errors of the previous one (boosting). Random Forest reduces variance, Gradient Boosting reduces bias. In practice, Gradient Boosting (especially XGBoost) often achieves slightly higher accuracy, but Random Forest is easier to tune and less prone to overfitting.
Can Random Forest be used for feature selection?
Yes. Use feature_importances_ for a quick ranking or permutation_importance for a more reliable estimate. You can then drop low-importance features and retrain. Alternatively, use SelectFromModel with a Random Forest estimator inside a pipeline to automatically select features above a threshold.
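A minimal sketch of the SelectFromModel approach, keeping only the features whose importance exceeds the mean importance:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_wine(return_X_y=True)

# Fit a forest, then keep features with above-mean importance
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1),
    threshold='mean'
)
X_reduced = selector.fit_transform(X, y)
print(f"Selected {X_reduced.shape[1]} of {X.shape[1]} features")
print(f"Kept feature mask: {selector.get_support()}")
```

Placed as the first step of a Pipeline, the same selector applies consistently at both training and prediction time.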
Conclusion
Random Forest is one of the most reliable and versatile algorithms in machine learning. It reduces overfitting by combining hundreds of decorrelated decision trees, handles both classification and regression tasks without feature scaling, and provides built-in feature importance rankings. For most tabular data problems, it serves as an excellent first model that often performs well enough for production use.
Start with RandomForestClassifier or RandomForestRegressor with default parameters as your baseline. Tune n_estimators first for diminishing-returns analysis, then max_depth and min_samples_leaf to control overfitting. Use class_weight='balanced' for imbalanced data, permutation importance for reliable feature rankings, and StratifiedKFold cross-validation for robust evaluation. For simpler linear problems, sklearn LinearRegression may be sufficient. When you need the absolute highest accuracy on structured data, consider Gradient Boosting or XGBoost, but Random Forest remains the safest default choice that rarely fails badly.
Related Guides
- Sklearn Confusion Matrix -- evaluate classifier performance with precision, recall, and F1
- Sklearn Pipeline -- wrap preprocessing and Random Forest into a single deployable object
- Sklearn Linear Regression -- a simpler model for linear regression tasks
- Pandas read_csv -- load datasets from CSV files before training