
Sklearn Random Forest: Complete Guide to Classification and Regression in Python


You built a decision tree that gets 95% training accuracy, then it scores 62% on new data. A single decision tree memorizes the training set -- every split, every leaf is tuned to the exact samples it saw. The result is a model that looks great on paper but fails in production.

This overfitting problem is not just academic. Teams deploy models that perform well in development notebooks but generate unreliable predictions on live data. A single decision tree has high variance: small changes in the training data produce entirely different tree structures. You cannot trust a model that is this sensitive to its training data.

Random Forest solves this by building hundreds of decision trees on random subsets of data and features, then combining their predictions through majority voting (classification) or averaging (regression). This ensemble approach dramatically reduces variance while maintaining accuracy. Scikit-learn's RandomForestClassifier and RandomForestRegressor provide a production-ready implementation with built-in feature importance, out-of-bag evaluation, and parallel training.


What Is Random Forest?

Random Forest is an ensemble learning method that combines multiple decision trees to produce a single, more robust prediction. It uses a technique called bagging (Bootstrap Aggregating):

  1. Bootstrap sampling: Create multiple random subsets of the training data by sampling with replacement. Each subset is roughly 63% of the original data.
  2. Random feature selection: At each split in each tree, consider only a random subset of features (typically sqrt(n_features) for classification, n_features/3 for regression).
  3. Independent training: Train a decision tree on each bootstrap sample with the random feature constraint.
  4. Aggregation: Combine predictions by majority vote (classification) or mean (regression).

The randomness in both data sampling and feature selection ensures that individual trees are decorrelated. Even if one tree overfits a particular pattern, the majority of other trees will not, and the ensemble averages out the noise.
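The four steps can be sketched by hand with plain decision trees and NumPy. This is an illustrative toy version, not how you would use the library in practice (sklearn handles all of this internally):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_wine

rng = np.random.default_rng(42)
X, y = load_wine(return_X_y=True)
n_samples = X.shape[0]

trees = []
for i in range(25):
    # 1. Bootstrap sample: draw n_samples indices with replacement
    idx = rng.integers(0, n_samples, size=n_samples)
    # 2-3. Train a tree that considers only sqrt(n_features) features per split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# 4. Aggregate: majority vote across all trees
all_preds = np.array([t.predict(X) for t in trees])  # shape (25, n_samples)
majority = np.apply_along_axis(
    lambda col: np.bincount(col).argmax(), axis=0, arr=all_preds
)
print(f"Ensemble training accuracy: {(majority == y).mean():.3f}")
```

Each tree sees a different bootstrap sample and a different feature subset at each split, which is exactly what decorrelates them.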

When to Use Random Forest

| Scenario | Random Forest? | Why |
| --- | --- | --- |
| Tabular data with mixed feature types | Yes | Handles numeric and (encoded) categorical features; no scaling needed |
| You need feature importance rankings | Yes | Built-in feature_importances_ attribute |
| Small to medium datasets (up to ~100K rows) | Yes | Fast training with parallel processing |
| Imbalanced classification | Yes | Supports class_weight='balanced' |
| You need interpretable predictions | Moderate | Individual trees are interpretable, but the ensemble is less so |
| Very high-dimensional sparse data (text) | No | Linear models or gradient boosting are typically better |
| Real-time inference with strict latency | Caution | Large forests can be slow at prediction time |

RandomForestClassifier: Classification Example

Here is a complete classification example using the wine dataset:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.datasets import load_wine
 
# Load dataset
wine = load_wine()
X, y = wine.data, wine.target
feature_names = wine.feature_names
 
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Classes: {wine.target_names}")
 
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
 
# Train Random Forest
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)
 
# Evaluate
y_pred = rf.predict(X_test)
print(f"\nAccuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=wine.target_names))

Key Parameters Explained

| Parameter | Default | Description | Tuning Tip |
| --- | --- | --- | --- |
| n_estimators | 100 | Number of trees in the forest | More trees improve performance but train slower. 100-500 is typical. |
| max_depth | None | Maximum depth of each tree | None means fully grown. Set to 10-30 to reduce overfitting. |
| min_samples_split | 2 | Minimum samples to split a node | Increase to 5-20 to prevent overfitting on noisy data. |
| min_samples_leaf | 1 | Minimum samples in a leaf node | Increase to 2-10 for smoother predictions. |
| max_features | 'sqrt' (classifier), 1.0 (regressor) | Features considered at each split | 'sqrt' for classification; 'log2' or a fraction as alternatives. |
| bootstrap | True | Use bootstrap sampling | Set False for small datasets to use all data per tree. |
| class_weight | None | Weights for each class | Use 'balanced' for imbalanced datasets. |
| n_jobs | None | Number of parallel jobs | Set to -1 to use all CPU cores. |
| oob_score | False | Use out-of-bag samples for evaluation | Set True for a built-in validation estimate without a holdout set. |

Out-of-Bag (OOB) Score

Each tree is trained on roughly 63% of the data. The remaining 37% (out-of-bag samples) can be used as a free validation set:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
 
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)
 
rf = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)
 
print(f"OOB Score:  {rf.oob_score_:.4f}")
print(f"Test Score: {rf.score(X_test, y_test):.4f}")

The OOB score gives you a validation estimate without needing a separate holdout set. It is especially useful when data is limited.
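The 63%/37% split is a direct consequence of sampling n items with replacement: each sample has probability (1 - 1/n)^n ≈ e⁻¹ ≈ 0.368 of never being drawn. A quick simulation (not part of the example above, purely to verify the figure) confirms it:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
fractions = []
for _ in range(100):
    idx = rng.integers(0, n, size=n)           # one bootstrap draw with replacement
    fractions.append(np.unique(idx).size / n)  # fraction of distinct (in-bag) samples

print(f"Mean fraction in-bag: {np.mean(fractions):.4f}")  # ~0.632
print(f"Theoretical 1 - 1/e:  {1 - np.exp(-1):.4f}")
```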

RandomForestRegressor: Regression Example

Random Forest regression predicts continuous values by averaging the outputs of all trees:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.datasets import fetch_california_housing
import numpy as np
 
# Load California housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target
feature_names = housing.feature_names
 
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
 
# Train regressor
rf_reg = RandomForestRegressor(
    n_estimators=200,
    max_depth=20,
    min_samples_leaf=5,
    random_state=42,
    n_jobs=-1
)
rf_reg.fit(X_train, y_train)
y_pred = rf_reg.predict(X_test)
 
# Evaluation metrics
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
 
print(f"R-squared: {r2:.4f}")
print(f"RMSE:      {rmse:.4f}")
print(f"MAE:       {mae:.4f}")

Comparing Regressors

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.datasets import fetch_california_housing
import numpy as np
 
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)
 
regressors = {
    'Linear Regression': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Decision Tree': DecisionTreeRegressor(max_depth=10, random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=20, random_state=42, n_jobs=-1),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, max_depth=5, random_state=42),
}
 
print(f"{'Model':<25} {'CV R² (mean)':>12} {'CV R² (std)':>12}")
print("-" * 52)
 
for name, model in regressors.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2', n_jobs=-1)
    print(f"{name:<25} {scores.mean():>12.4f} {scores.std():>12.4f}")

Random Forest typically outperforms a single decision tree and linear models on datasets with non-linear relationships, while being competitive with gradient boosting.

Hyperparameter Tuning

GridSearchCV: Exhaustive Search

GridSearchCV tests every combination of specified parameter values:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.datasets import load_wine
 
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)
 
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
 
rf = RandomForestClassifier(random_state=42, n_jobs=-1)
grid_search = GridSearchCV(
    rf,
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)
grid_search.fit(X_train, y_train)
 
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best CV Score:   {grid_search.best_score_:.4f}")
print(f"Test Score:      {grid_search.score(X_test, y_test):.4f}")

RandomizedSearchCV: Efficient Search

When the parameter space is large, RandomizedSearchCV samples a fixed number of parameter combinations instead of trying all of them:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.datasets import load_wine
from scipy.stats import randint, uniform
 
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)
 
param_distributions = {
    'n_estimators': randint(50, 500),
    'max_depth': [None, 5, 10, 15, 20, 30],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': ['sqrt', 'log2', 0.3, 0.5, 0.7],
    'bootstrap': [True, False],
}
 
rf = RandomForestClassifier(random_state=42, n_jobs=-1)
random_search = RandomizedSearchCV(
    rf,
    param_distributions,
    n_iter=100,
    cv=5,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1,
    verbose=1
)
random_search.fit(X_train, y_train)
 
print(f"Best Parameters: {random_search.best_params_}")
print(f"Best CV Score:   {random_search.best_score_:.4f}")
print(f"Test Score:      {random_search.score(X_test, y_test):.4f}")

Parameter Importance for Tuning

Not all parameters have equal impact. Focus your tuning budget on the parameters that matter most:

| Parameter | Impact | Priority | Notes |
| --- | --- | --- | --- |
| n_estimators | High | 1st | More trees almost always help until diminishing returns (~200-500) |
| max_depth | High | 2nd | Controls overfitting directly. Try None, 10, 20, 30 |
| min_samples_leaf | Medium | 3rd | Smooths predictions. Try 1, 2, 5, 10 |
| max_features | Medium | 4th | Controls tree diversity. 'sqrt' is usually good for classification |
| min_samples_split | Low | 5th | Less impact than min_samples_leaf in practice |
| bootstrap | Low | 6th | True is almost always better. Only try False on very small datasets |
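Since n_estimators has the highest priority, one cheap way to find the diminishing-returns point is to grow the forest incrementally with warm_start=True (which keeps already-fitted trees and only trains the new ones) while tracking the OOB score. A sketch on the wine dataset:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)

rf = RandomForestClassifier(
    warm_start=True,   # keep existing trees; subsequent fits only add new ones
    oob_score=True,
    random_state=42,
    n_jobs=-1,
)

scores = {}
for n in [25, 50, 100, 200, 400]:
    rf.set_params(n_estimators=n)
    rf.fit(X, y)                  # trains only the additional trees
    scores[n] = rf.oob_score_
    print(f"n_estimators={n:>3}: OOB = {rf.oob_score_:.4f}")
```

When the OOB score flattens out between two successive sizes, the smaller value is usually enough.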

Feature Importance

One of Random Forest's strongest advantages is built-in feature importance. Understanding which features drive predictions helps with model interpretation, feature selection, and domain insights.

Impurity-Based Feature Importance

The default feature_importances_ attribute measures how much each feature decreases impurity (Gini for classification, variance for regression) across all trees:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
 
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)
 
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
 
# Get feature importances
importances = rf.feature_importances_
feature_names = wine.feature_names
indices = np.argsort(importances)[::-1]
 
# Print ranked features
print("Feature Ranking:")
for i, idx in enumerate(indices):
    print(f"  {i+1}. {feature_names[idx]:25s} ({importances[idx]:.4f})")
 
# Plot
plt.figure(figsize=(10, 6))
plt.barh(range(len(indices)), importances[indices[::-1]], align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices[::-1]])
plt.xlabel('Feature Importance (Gini)')
plt.title('Random Forest Feature Importance - Wine Dataset')
plt.tight_layout()
plt.savefig('rf_feature_importance.png', dpi=150)
plt.show()

Permutation Importance

Impurity-based importance can be biased toward high-cardinality features. Permutation importance measures the drop in model performance when a feature's values are randomly shuffled:

from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
 
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)
 
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
 
# Compute permutation importance on the test set
perm_imp = permutation_importance(
    rf, X_test, y_test,
    n_repeats=30,
    random_state=42,
    n_jobs=-1
)
 
# Sort and display
sorted_idx = perm_imp.importances_mean.argsort()[::-1]
 
print("Permutation Importance (test set):")
for idx in sorted_idx:
    mean = perm_imp.importances_mean[idx]
    std = perm_imp.importances_std[idx]
    print(f"  {wine.feature_names[idx]:25s}: {mean:.4f} +/- {std:.4f}")
 
# Plot with error bars
plt.figure(figsize=(10, 6))
plt.barh(
    range(len(sorted_idx)),
    perm_imp.importances_mean[sorted_idx[::-1]],
    xerr=perm_imp.importances_std[sorted_idx[::-1]],
    align='center'
)
plt.yticks(range(len(sorted_idx)), [wine.feature_names[i] for i in sorted_idx[::-1]])
plt.xlabel('Decrease in Accuracy')
plt.title('Permutation Importance - Wine Dataset')
plt.tight_layout()
plt.savefig('rf_permutation_importance.png', dpi=150)
plt.show()

Which Importance Method to Use?

| Method | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Impurity-based (feature_importances_) | Fast, no extra computation | Biased toward high-cardinality features | Quick screening, initial exploration |
| Permutation importance | Unbiased, works on test data | Slower, affected by correlated features | Final feature selection, reporting |

Cross-Validation with Random Forest

Basic Cross-Validation

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_wine
 
wine = load_wine()
X, y = wine.data, wine.target
 
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
 
print(f"CV Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
print(f"Per-fold:    {scores}")

StratifiedKFold for Imbalanced Data

StratifiedKFold preserves the class distribution in each fold, which is critical for imbalanced datasets:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.datasets import load_wine
import numpy as np
 
wine = load_wine()
X, y = wine.data, wine.target
 
# Stratified 10-fold cross-validation
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
 
scores = cross_val_score(rf, X, y, cv=skf, scoring='accuracy')
print(f"Stratified 10-Fold Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
 
# Multiple metrics
from sklearn.model_selection import cross_validate
 
results = cross_validate(
    rf, X, y, cv=skf,
    scoring=['accuracy', 'f1_weighted', 'precision_weighted', 'recall_weighted'],
    n_jobs=-1
)
 
for metric in ['test_accuracy', 'test_f1_weighted', 'test_precision_weighted', 'test_recall_weighted']:
    vals = results[metric]
    name = metric.replace('test_', '')
    print(f"{name:>20s}: {vals.mean():.4f} (+/- {vals.std():.4f})")

Handling Imbalanced Data

When one class has far more samples than others, a model can achieve high accuracy by always predicting the majority class. Random Forest provides several tools to handle this.
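To see the problem concretely, compare against a baseline that always predicts the majority class. Using sklearn's DummyClassifier on the same synthetic 95/5 split used in the examples below, accuracy looks excellent while the minority class is never predicted at all:

```python
from sklearn.dummy import DummyClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Imbalanced dataset: ~95% class 0, ~5% class 1
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.95, 0.05],
    flip_y=0, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Baseline that ignores the features entirely
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = dummy.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")  # ~0.95, yet useless
print(f"F1 (minority class): {f1_score(y_test, y_pred, zero_division=0):.2f}")
```

This is why the examples below report the full classification report rather than accuracy alone.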

Using class_weight='balanced'

The class_weight='balanced' parameter automatically adjusts weights inversely proportional to class frequencies:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import make_classification
 
# Create imbalanced dataset (95% class 0, 5% class 1)
X, y = make_classification(
    n_samples=2000,
    n_features=20,
    weights=[0.95, 0.05],
    flip_y=0,
    random_state=42
)
 
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
 
# Without class weight
rf_default = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf_default.fit(X_train, y_train)
print("=== Without class_weight ===")
print(classification_report(y_test, rf_default.predict(X_test)))
 
# With balanced class weight
rf_balanced = RandomForestClassifier(
    n_estimators=200,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)
rf_balanced.fit(X_train, y_train)
print("=== With class_weight='balanced' ===")
print(classification_report(y_test, rf_balanced.predict(X_test)))

Integrating SMOTE for Oversampling

SMOTE (Synthetic Minority Oversampling Technique) creates synthetic samples for the minority class. Use it with imblearn's pipeline:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
 
# Create imbalanced dataset
X, y = make_classification(
    n_samples=2000,
    n_features=20,
    weights=[0.95, 0.05],
    flip_y=0,
    random_state=42
)
 
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
 
# SMOTE + Random Forest pipeline
pipeline = ImbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('rf', RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1))
])
pipeline.fit(X_train, y_train)
 
print("=== SMOTE + Random Forest ===")
print(classification_report(y_test, pipeline.predict(X_test)))

Model Evaluation

Classification Report and Confusion Matrix

For a deep dive into interpreting these metrics, see our confusion matrix guide.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    classification_report, confusion_matrix,
    ConfusionMatrixDisplay, accuracy_score
)
from sklearn.datasets import load_wine
import matplotlib.pyplot as plt
 
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)
 
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
 
# Metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"\n{classification_report(y_test, y_pred, target_names=wine.target_names)}")
 
# Confusion matrix plot
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=wine.target_names)
disp.plot(cmap='Blues')
plt.title('Random Forest - Wine Classification')
plt.tight_layout()
plt.savefig('rf_confusion_matrix.png', dpi=150)
plt.show()

ROC Curve for Binary Classification

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score, RocCurveDisplay
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
 
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42, stratify=cancer.target
)
 
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
 
# Predict probabilities
y_prob = rf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)
 
# Plot ROC curve
RocCurveDisplay.from_estimator(rf, X_test, y_test)
plt.title(f'Random Forest ROC Curve (AUC = {auc:.4f})')
plt.tight_layout()
plt.savefig('rf_roc_curve.png', dpi=150)
plt.show()

Random Forest vs Other Algorithms

| Feature | Random Forest | XGBoost | Gradient Boosting | Decision Tree |
| --- | --- | --- | --- | --- |
| Ensemble Type | Bagging (parallel) | Boosting (sequential) | Boosting (sequential) | Single model |
| Accuracy | High | Very High | Very High | Moderate |
| Training Speed | Fast (parallelizable) | Moderate | Slow (sequential) | Very Fast |
| Prediction Speed | Moderate | Fast | Moderate | Very Fast |
| Overfitting Risk | Low | Low (with tuning) | Low (with tuning) | High |
| Hyperparameter Sensitivity | Low | High | High | Moderate |
| Feature Scaling Required | No | No | No | No |
| Handles Missing Values | No (needs imputation) | Yes (built-in) | No (needs imputation) | No |
| Built-in Feature Importance | Yes | Yes | Yes | Yes |
| Interpretability | Moderate | Low | Low | High |
| Best For | General-purpose, first model | Kaggle competitions, maximum accuracy | Structured tabular data | Quick baselines, small datasets |

When to choose Random Forest over alternatives:

  • You need a strong baseline model with minimal tuning
  • Training speed matters and you have multiple CPU cores
  • You want reliable feature importance estimates
  • You are not chasing the last 0.5% of accuracy that boosting methods might provide

Real-World Pipeline: End-to-End Example

This pipeline combines preprocessing, feature engineering, model training, evaluation, and prediction in a production-style workflow:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd
 
# Load and prepare data
cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target
 
# Introduce some missing values to simulate real data
np.random.seed(42)
mask = np.random.random(df.shape) < 0.05
df_missing = df.mask(mask.astype(bool))
df_missing['target'] = cancer.target  # Keep target clean
 
X = df_missing.drop('target', axis=1)
y = df_missing['target']
 
# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
 
# Build preprocessing + model pipeline
numeric_features = X.columns.tolist()
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
 
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
    ]
)
 
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(
        n_estimators=300,
        max_depth=20,
        min_samples_leaf=2,
        class_weight='balanced',
        random_state=42,
        n_jobs=-1
    ))
])
 
# Cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=skf, scoring='accuracy')
print(f"Cross-validation accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
 
# Train final model
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
 
# Evaluation
print(f"\nTest Set Results:")
print(classification_report(y_test, y_pred, target_names=cancer.target_names))
 
# Make predictions on new data
sample = X_test.iloc[:3]
predictions = pipeline.predict(sample)
probabilities = pipeline.predict_proba(sample)
 
print(f"\nSample Predictions:")
for i, (pred, prob) in enumerate(zip(predictions, probabilities)):
    class_name = cancer.target_names[pred]
    confidence = prob[pred]
    print(f"  Sample {i+1}: {class_name} (confidence: {confidence:.2%})")

Saving and Loading the Model

import joblib
 
# Save the trained pipeline
joblib.dump(pipeline, 'rf_pipeline.joblib')
 
# Load and use later
loaded_pipeline = joblib.load('rf_pipeline.joblib')
new_predictions = loaded_pipeline.predict(X_test[:5])
print(f"Loaded model predictions: {new_predictions}")

Exploring Results with PyGWalker

After training your Random Forest model, you often need to explore feature importance patterns, prediction distributions, and misclassification cases in detail. PyGWalker lets you turn your results DataFrame into an interactive Tableau-like exploration interface directly in Jupyter:

import pandas as pd
import pygwalker as pyg
 
# Build a results DataFrame
results = pd.DataFrame(X_test.values, columns=cancer.feature_names)
results['actual'] = y_test.values
results['predicted'] = y_pred
results['correct'] = y_test.values == y_pred
results['prob_malignant'] = pipeline.predict_proba(X_test)[:, 0]
results['prob_benign'] = pipeline.predict_proba(X_test)[:, 1]
 
# Launch interactive exploration
walker = pyg.walk(results)

Drag features to axes, filter by misclassified samples, and color-code by prediction confidence to identify where the model struggles. This kind of visual analysis helps you decide which features to engineer or which samples need closer inspection.

For running your full ML experimentation workflow -- from data loading through model comparison to final evaluation -- RunCell provides an AI-powered Jupyter environment that helps you iterate faster on experiments, auto-generate evaluation code, and manage your notebook workflow.

FAQ

How many trees should I use in a Random Forest?

Start with 100-200 trees. Accuracy generally improves with more trees but plateaus after a certain point. Use cross-validation to find the sweet spot. Beyond 500 trees, gains are usually negligible while training time increases. Monitor the OOB score as you increase n_estimators -- when it stops improving, you have enough trees.

Does Random Forest need feature scaling?

No. Random Forest makes splits based on feature value thresholds, so the absolute scale of features does not affect the splitting decisions. Unlike logistic regression, SVM, or neural networks, Random Forest handles features with different ranges naturally. However, if your pipeline includes other components (like PCA or distance-based preprocessing), scaling may still be required for those steps.
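Because splits depend only on the ordering of feature values, any strictly increasing rescaling leaves the learned partitions, and hence the predictions, unchanged. A quick sanity check (assuming the same random_state for both fits so the tree-building randomness is identical):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # affine, order-preserving rescale

rf_raw = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
rf_scaled = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_scaled, y)

# Same seed + order-preserving transform => same splits, same predictions
same = np.array_equal(rf_raw.predict(X), rf_scaled.predict(X_scaled))
print(f"Identical predictions: {same}")
```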

How does Random Forest handle missing values?

Scikit-learn's RandomForestClassifier and RandomForestRegressor do not handle missing values natively. You must impute missing data before training -- use SimpleImputer with median or mean strategy for numeric features, or use more advanced imputation methods like IterativeImputer. Some other implementations like H2O or LightGBM can handle missing values directly.

What is the difference between Random Forest and Gradient Boosting?

Random Forest builds trees independently in parallel (bagging), while Gradient Boosting builds trees sequentially where each tree corrects the errors of the previous one (boosting). Random Forest reduces variance, Gradient Boosting reduces bias. In practice, Gradient Boosting (especially XGBoost) often achieves slightly higher accuracy, but Random Forest is easier to tune and less prone to overfitting.

Can Random Forest be used for feature selection?

Yes. Use feature_importances_ for a quick ranking or permutation_importance for a more reliable estimate. You can then drop low-importance features and retrain. Alternatively, use SelectFromModel with a Random Forest estimator inside a pipeline to automatically select features above a threshold.
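A minimal SelectFromModel sketch (the threshold='mean' choice and the wine dataset are illustrative; adjust both to your problem):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)

pipe = Pipeline([
    # Keep only features whose importance exceeds the mean importance
    ('select', SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42),
        threshold='mean',
    )),
    ('clf', RandomForestClassifier(n_estimators=200, random_state=42)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy with selected features: {scores.mean():.4f}")

pipe.fit(X, y)
n_kept = pipe.named_steps['select'].get_support().sum()
print(f"Features kept: {n_kept} of {X.shape[1]}")
```

Doing the selection inside a pipeline keeps it inside each cross-validation fold, which avoids leaking test-fold information into the feature ranking.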

Conclusion

Random Forest is one of the most reliable and versatile algorithms in machine learning. It reduces overfitting by combining hundreds of decorrelated decision trees, handles both classification and regression tasks without feature scaling, and provides built-in feature importance rankings. For most tabular data problems, it serves as an excellent first model that often performs well enough for production use.

Start with RandomForestClassifier or RandomForestRegressor with default parameters as your baseline. Tune n_estimators first, stopping once gains taper off, then max_depth and min_samples_leaf to control overfitting. Use class_weight='balanced' for imbalanced data, permutation importance for reliable feature rankings, and StratifiedKFold cross-validation for robust evaluation. For simple linear problems, sklearn's LinearRegression may be sufficient. When you need the absolute highest accuracy on structured data, consider Gradient Boosting or XGBoost, but Random Forest remains the safest default choice that rarely fails badly.
