Sklearn Random Forest: Complete Guide to Classification and Regression in Python
You built a decision tree that gets 95% training accuracy, then it scores 62% on new data. A single decision tree memorizes the training set -- every split, every leaf is tuned to the exact samples it saw. The result is a model that looks great on paper but fails in production.
This overfitting problem is not just academic. Teams deploy models that perform well in development notebooks but generate unreliable predictions on live data. A single decision tree has high variance: small changes in the training data produce entirely different tree structures. You cannot trust a model that is this sensitive to its training data.
Random Forest solves this by building hundreds of decision trees on random subsets of data and features, then combining their predictions through majority voting (classification) or averaging (regression). This ensemble approach dramatically reduces variance while maintaining accuracy. Scikit-learn's RandomForestClassifier and RandomForestRegressor provide a production-ready implementation with built-in feature importance, out-of-bag evaluation, and parallel training.
What Is Random Forest?
Random Forest is an ensemble learning method that combines multiple decision trees to produce a single, more robust prediction. It uses a technique called bagging (Bootstrap Aggregating):
- Bootstrap sampling: Create multiple random subsets of the training data by sampling with replacement. Each subset is roughly 63% of the original data.
- Random feature selection: At each split in each tree, consider only a random subset of features (typically sqrt(n_features) for classification, n_features/3 for regression).
- Independent training: Train a decision tree on each bootstrap sample with the random feature constraint.
- Aggregation: Combine predictions by majority vote (classification) or mean (regression).
The randomness in both data sampling and feature selection ensures that individual trees are decorrelated. Even if one tree overfits a particular pattern, the majority of other trees will not, and the ensemble averages out the noise.
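The four steps above can be sketched by hand with plain decision trees. This is a minimal illustration of the bagging idea, not the actual RandomForestClassifier internals, using the wine dataset:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

rng = np.random.default_rng(42)
n_trees = 25
all_preds = []
for i in range(n_trees):
    # Bootstrap sampling: draw len(X_train) rows with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Random feature selection happens inside the tree via max_features
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    all_preds.append(tree.predict(X_test))

# Aggregation: majority vote across trees for each test sample
votes = np.stack(all_preds)  # shape (n_trees, n_test_samples)
ensemble_pred = np.apply_along_axis(
    lambda col: np.bincount(col).argmax(), axis=0, arr=votes
)
print(f"Hand-rolled bagging accuracy: {(ensemble_pred == y_test).mean():.4f}")
```

In practice you would never do this by hand -- RandomForestClassifier runs the same loop in parallel and adds OOB bookkeeping -- but it shows why the ensemble's vote is more stable than any single tree.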
When to Use Random Forest
| Scenario | Random Forest? | Why |
|---|---|---|
| Tabular data with mixed feature types | Yes | Handles numeric and categorical features, no scaling needed |
| You need feature importance rankings | Yes | Built-in feature_importances_ attribute |
| Small to medium datasets (up to ~100K rows) | Yes | Fast training with parallel processing |
| Imbalanced classification | Yes | Supports class_weight='balanced' |
| You need interpretable predictions | Moderate | Individual trees are interpretable, but the ensemble is less so |
| Very high-dimensional sparse data (text) | No | Linear models or gradient boosting are typically better |
| Real-time inference with strict latency | Careful | Large forests can be slow at prediction time |
RandomForestClassifier: Classification Example
Here is a complete classification example using the wine dataset:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.datasets import load_wine
# Load dataset
wine = load_wine()
X, y = wine.data, wine.target
feature_names = wine.feature_names
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Classes: {wine.target_names}")
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Train Random Forest
rf = RandomForestClassifier(
n_estimators=100,
max_depth=None,
min_samples_split=2,
min_samples_leaf=1,
random_state=42,
n_jobs=-1
)
rf.fit(X_train, y_train)
# Evaluate
y_pred = rf.predict(X_test)
print(f"\nAccuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=wine.target_names))

Key Parameters Explained
| Parameter | Default | Description | Tuning Tip |
|---|---|---|---|
| n_estimators | 100 | Number of trees in the forest | More trees = better performance but slower. 100-500 is typical. |
| max_depth | None | Maximum depth of each tree | None means fully grown. Set to 10-30 to reduce overfitting. |
| min_samples_split | 2 | Minimum samples to split a node | Increase to 5-20 to prevent overfitting on noisy data. |
| min_samples_leaf | 1 | Minimum samples in a leaf node | Increase to 2-10 for smoother predictions. |
| max_features | 'sqrt' | Features considered at each split | 'sqrt' for classification, 'log2' or a fraction for alternatives. |
| bootstrap | True | Use bootstrap sampling | Set False for small datasets to use all data per tree. |
| class_weight | None | Weights for each class | Use 'balanced' for imbalanced datasets. |
| n_jobs | None | Number of parallel jobs | Set to -1 to use all CPU cores. |
| oob_score | False | Use out-of-bag samples for evaluation | Set True for a built-in validation estimate without a holdout set. |
Out-of-Bag (OOB) Score
Each tree is trained on roughly 63% of the data. The remaining 37% (out-of-bag samples) can be used as a free validation set:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)
rf = RandomForestClassifier(
n_estimators=200,
oob_score=True,
random_state=42,
n_jobs=-1
)
rf.fit(X_train, y_train)
print(f"OOB Score: {rf.oob_score_:.4f}")
print(f"Test Score: {rf.score(X_test, y_test):.4f}")

The OOB score gives you a validation estimate without needing a separate holdout set. It is especially useful when data is limited.
RandomForestRegressor: Regression Example
Random Forest regression predicts continuous values by averaging the outputs of all trees:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.datasets import fetch_california_housing
import numpy as np
# Load California housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target
feature_names = housing.feature_names
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train regressor
rf_reg = RandomForestRegressor(
n_estimators=200,
max_depth=20,
min_samples_leaf=5,
random_state=42,
n_jobs=-1
)
rf_reg.fit(X_train, y_train)
y_pred = rf_reg.predict(X_test)
# Evaluation metrics
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
print(f"R-squared: {r2:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")

Comparing Regressors
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.datasets import fetch_california_housing
import numpy as np
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
regressors = {
'Linear Regression': LinearRegression(),
'Ridge': Ridge(alpha=1.0),
'Decision Tree': DecisionTreeRegressor(max_depth=10, random_state=42),
'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=20, random_state=42, n_jobs=-1),
'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, max_depth=5, random_state=42),
}
print(f"{'Model':<25} {'CV R² (mean)':>12} {'CV R² (std)':>12}")
print("-" * 52)
for name, model in regressors.items():
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2', n_jobs=-1)
print(f"{name:<25} {scores.mean():>12.4f} {scores.std():>12.4f}")

Random Forest typically outperforms a single decision tree and linear models on datasets with non-linear relationships, while being competitive with gradient boosting.
Hyperparameter Tuning
GridSearchCV: Exhaustive Search
GridSearchCV tests every combination of specified parameter values:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.datasets import load_wine
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [None, 10, 20],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
}
rf = RandomForestClassifier(random_state=42, n_jobs=-1)
grid_search = GridSearchCV(
rf,
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train, y_train)
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best CV Score: {grid_search.best_score_:.4f}")
print(f"Test Score: {grid_search.score(X_test, y_test):.4f}")

RandomizedSearchCV: Efficient Search
When the parameter space is large, RandomizedSearchCV samples a fixed number of parameter combinations instead of trying all of them:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.datasets import load_wine
from scipy.stats import randint, uniform
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)
param_distributions = {
'n_estimators': randint(50, 500),
'max_depth': [None, 5, 10, 15, 20, 30],
'min_samples_split': randint(2, 20),
'min_samples_leaf': randint(1, 10),
'max_features': ['sqrt', 'log2', 0.3, 0.5, 0.7],
'bootstrap': [True, False],
}
rf = RandomForestClassifier(random_state=42, n_jobs=-1)
random_search = RandomizedSearchCV(
rf,
param_distributions,
n_iter=100,
cv=5,
scoring='accuracy',
random_state=42,
n_jobs=-1,
verbose=1
)
random_search.fit(X_train, y_train)
print(f"Best Parameters: {random_search.best_params_}")
print(f"Best CV Score: {random_search.best_score_:.4f}")
print(f"Test Score: {random_search.score(X_test, y_test):.4f}")

Parameter Importance for Tuning
Not all parameters have equal impact. Focus your tuning budget on the parameters that matter most:
| Parameter | Impact | Priority | Notes |
|---|---|---|---|
| n_estimators | High | 1st | More trees almost always helps until diminishing returns (~200-500) |
| max_depth | High | 2nd | Controls overfitting directly. Try None, 10, 20, 30 |
| min_samples_leaf | Medium | 3rd | Smooths predictions. Try 1, 2, 5, 10 |
| max_features | Medium | 4th | Controls tree diversity. 'sqrt' is usually good for classification |
| min_samples_split | Low | 5th | Less impact than min_samples_leaf in practice |
| bootstrap | Low | 6th | True is almost always better. Only try False on very small datasets |
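To see the impact of a single high-priority parameter in isolation before running a full search, validation_curve is handy. A quick sketch on the wine dataset, sweeping max_depth:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

X, y = load_wine(return_X_y=True)

# Cross-validated train/validation scores for each max_depth value
param_range = [2, 5, 10, None]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    X, y,
    param_name='max_depth',
    param_range=param_range,
    cv=5,
    scoring='accuracy',
)
for depth, tr, va in zip(param_range, train_scores, val_scores):
    print(f"max_depth={str(depth):>4s}  train={tr.mean():.4f}  val={va.mean():.4f}")
```

A large gap between train and validation scores at a given depth is the overfitting signal that tells you where to cap max_depth.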
Feature Importance
One of Random Forest's strongest advantages is built-in feature importance. Understanding which features drive predictions helps with model interpretation, feature selection, and domain insights.
Impurity-Based Feature Importance
The default feature_importances_ attribute measures how much each feature decreases impurity (Gini for classification, variance for regression) across all trees:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
# Get feature importances
importances = rf.feature_importances_
feature_names = wine.feature_names
indices = np.argsort(importances)[::-1]
# Print ranked features
print("Feature Ranking:")
for i, idx in enumerate(indices):
print(f" {i+1}. {feature_names[idx]:25s} ({importances[idx]:.4f})")
# Plot
plt.figure(figsize=(10, 6))
plt.barh(range(len(indices)), importances[indices[::-1]], align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices[::-1]])
plt.xlabel('Feature Importance (Gini)')
plt.title('Random Forest Feature Importance - Wine Dataset')
plt.tight_layout()
plt.savefig('rf_feature_importance.png', dpi=150)
plt.show()

Permutation Importance
Impurity-based importance can be biased toward high-cardinality features. Permutation importance measures the drop in model performance when a feature's values are randomly shuffled:
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
# Compute permutation importance on the test set
perm_imp = permutation_importance(
rf, X_test, y_test,
n_repeats=30,
random_state=42,
n_jobs=-1
)
# Sort and display
sorted_idx = perm_imp.importances_mean.argsort()[::-1]
print("Permutation Importance (test set):")
for idx in sorted_idx:
mean = perm_imp.importances_mean[idx]
std = perm_imp.importances_std[idx]
print(f" {wine.feature_names[idx]:25s}: {mean:.4f} +/- {std:.4f}")
# Plot with error bars
plt.figure(figsize=(10, 6))
plt.barh(
range(len(sorted_idx)),
perm_imp.importances_mean[sorted_idx[::-1]],
xerr=perm_imp.importances_std[sorted_idx[::-1]],
align='center'
)
plt.yticks(range(len(sorted_idx)), [wine.feature_names[i] for i in sorted_idx[::-1]])
plt.xlabel('Decrease in Accuracy')
plt.title('Permutation Importance - Wine Dataset')
plt.tight_layout()
plt.savefig('rf_permutation_importance.png', dpi=150)
plt.show()

Which Importance Method to Use?
| Method | Pros | Cons | Best For |
|---|---|---|---|
| Impurity-based (feature_importances_) | Fast, no extra computation | Biased toward high-cardinality features | Quick screening, initial exploration |
| Permutation importance | Unbiased, works on test data | Slower, affected by correlated features | Final feature selection, reporting |
Cross-Validation with Random Forest
Basic Cross-Validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_wine
wine = load_wine()
X, y = wine.data, wine.target
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
print(f"CV Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
print(f"Per-fold: {scores}")

StratifiedKFold for Imbalanced Data
StratifiedKFold preserves the class distribution in each fold, which is critical for imbalanced datasets:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.datasets import load_wine
import numpy as np
wine = load_wine()
X, y = wine.data, wine.target
# Stratified 10-fold cross-validation
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
scores = cross_val_score(rf, X, y, cv=skf, scoring='accuracy')
print(f"Stratified 10-Fold Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
# Multiple metrics
from sklearn.model_selection import cross_validate
results = cross_validate(
rf, X, y, cv=skf,
scoring=['accuracy', 'f1_weighted', 'precision_weighted', 'recall_weighted'],
n_jobs=-1
)
for metric in ['test_accuracy', 'test_f1_weighted', 'test_precision_weighted', 'test_recall_weighted']:
vals = results[metric]
name = metric.replace('test_', '')
print(f"{name:>20s}: {vals.mean():.4f} (+/- {vals.std():.4f})")

Handling Imbalanced Data
When one class has far more samples than others, a model can achieve high accuracy by always predicting the majority class. Random Forest provides several tools to handle this.
Using class_weight='balanced'
The class_weight='balanced' parameter automatically adjusts weights inversely proportional to class frequencies:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import make_classification
# Create imbalanced dataset (95% class 0, 5% class 1)
X, y = make_classification(
n_samples=2000,
n_features=20,
weights=[0.95, 0.05],
flip_y=0,
random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
# Without class weight
rf_default = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf_default.fit(X_train, y_train)
print("=== Without class_weight ===")
print(classification_report(y_test, rf_default.predict(X_test)))
# With balanced class weight
rf_balanced = RandomForestClassifier(
n_estimators=200,
class_weight='balanced',
random_state=42,
n_jobs=-1
)
rf_balanced.fit(X_train, y_train)
print("=== With class_weight='balanced' ===")
print(classification_report(y_test, rf_balanced.predict(X_test)))

Integrating SMOTE for Oversampling
SMOTE (Synthetic Minority Oversampling Technique) creates synthetic samples for the minority class. Use it with imblearn's pipeline:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
# Create imbalanced dataset
X, y = make_classification(
n_samples=2000,
n_features=20,
weights=[0.95, 0.05],
flip_y=0,
random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
# SMOTE + Random Forest pipeline
pipeline = ImbPipeline([
('smote', SMOTE(random_state=42)),
('rf', RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1))
])
pipeline.fit(X_train, y_train)
print("=== SMOTE + Random Forest ===")
print(classification_report(y_test, pipeline.predict(X_test)))

Model Evaluation
Classification Report and Confusion Matrix
For a deep dive into interpreting these metrics, see our confusion matrix guide.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
classification_report, confusion_matrix,
ConfusionMatrixDisplay, accuracy_score
)
from sklearn.datasets import load_wine
import matplotlib.pyplot as plt
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
# Metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"\n{classification_report(y_test, y_pred, target_names=wine.target_names)}")
# Confusion matrix plot
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=wine.target_names)
disp.plot(cmap='Blues')
plt.title('Random Forest - Wine Classification')
plt.tight_layout()
plt.savefig('rf_confusion_matrix.png', dpi=150)
plt.show()

ROC Curve for Binary Classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score, RocCurveDisplay
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
cancer.data, cancer.target, test_size=0.2, random_state=42, stratify=cancer.target
)
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
# Predict probabilities
y_prob = rf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)
# Plot ROC curve
RocCurveDisplay.from_estimator(rf, X_test, y_test)
plt.title(f'Random Forest ROC Curve (AUC = {auc:.4f})')
plt.tight_layout()
plt.savefig('rf_roc_curve.png', dpi=150)
plt.show()

Random Forest vs Other Algorithms
| Feature | Random Forest | XGBoost | Gradient Boosting | Decision Tree |
|---|---|---|---|---|
| Ensemble Type | Bagging (parallel) | Boosting (sequential) | Boosting (sequential) | Single model |
| Accuracy | High | Very High | Very High | Moderate |
| Training Speed | Fast (parallelizable) | Moderate | Slow (sequential) | Very Fast |
| Prediction Speed | Moderate | Fast | Moderate | Very Fast |
| Overfitting Risk | Low | Low (with tuning) | Low (with tuning) | High |
| Hyperparameter Sensitivity | Low | High | High | Moderate |
| Feature Scaling Required | No | No | No | No |
| Handles Missing Values | No (needs imputation) | Yes (built-in) | No (needs imputation) | No |
| Built-in Feature Importance | Yes | Yes | Yes | Yes |
| Interpretability | Moderate | Low | Low | High |
| Best For | General-purpose, first model | Kaggle competitions, maximum accuracy | Structured tabular data | Quick baselines, small datasets |
When to choose Random Forest over alternatives:
- You need a strong baseline model with minimal tuning
- Training speed matters and you have multiple CPU cores
- You want reliable feature importance estimates
- You are not chasing the last 0.5% of accuracy that boosting methods might provide
Real-World Pipeline: End-to-End Example
This pipeline combines preprocessing, feature engineering, model training, evaluation, and prediction in a production-style workflow:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd
# Load and prepare data
cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target
# Introduce some missing values to simulate real data
np.random.seed(42)
mask = np.random.random(df.shape) < 0.05
df_missing = df.mask(mask)
df_missing['target'] = cancer.target # Keep target clean
X = df_missing.drop('target', axis=1)
y = df_missing['target']
# Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Build preprocessing + model pipeline
numeric_features = X.columns.tolist()
numeric_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
]
)
pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(
n_estimators=300,
max_depth=20,
min_samples_leaf=2,
class_weight='balanced',
random_state=42,
n_jobs=-1
))
])
# Cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=skf, scoring='accuracy')
print(f"Cross-validation accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
# Train final model
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
# Evaluation
print(f"\nTest Set Results:")
print(classification_report(y_test, y_pred, target_names=cancer.target_names))
# Make predictions on new data
sample = X_test.iloc[:3]
predictions = pipeline.predict(sample)
probabilities = pipeline.predict_proba(sample)
print(f"\nSample Predictions:")
for i, (pred, prob) in enumerate(zip(predictions, probabilities)):
class_name = cancer.target_names[pred]
confidence = prob[pred]
print(f" Sample {i+1}: {class_name} (confidence: {confidence:.2%})")

Saving and Loading the Model
import joblib
# Save the trained pipeline
joblib.dump(pipeline, 'rf_pipeline.joblib')
# Load and use later
loaded_pipeline = joblib.load('rf_pipeline.joblib')
new_predictions = loaded_pipeline.predict(X_test[:5])
print(f"Loaded model predictions: {new_predictions}")

Exploring Results with PyGWalker
After training your Random Forest model, you often need to explore feature importance patterns, prediction distributions, and misclassification cases in detail. PyGWalker lets you turn your results DataFrame into an interactive Tableau-like exploration interface directly in Jupyter:
import pandas as pd
import pygwalker as pyg
# Build a results DataFrame
results = pd.DataFrame(X_test.values, columns=cancer.feature_names)
results['actual'] = y_test.values
results['predicted'] = y_pred
results['correct'] = y_test.values == y_pred
results['prob_malignant'] = pipeline.predict_proba(X_test)[:, 0]
results['prob_benign'] = pipeline.predict_proba(X_test)[:, 1]
# Launch interactive exploration
walker = pyg.walk(results)

Drag features to axes, filter by misclassified samples, and color-code by prediction confidence to identify where the model struggles. This kind of visual analysis helps you decide which features to engineer or which samples need closer inspection.
For running your full ML experimentation workflow -- from data loading through model comparison to final evaluation -- RunCell provides an AI-powered Jupyter environment that helps you iterate faster on experiments, auto-generate evaluation code, and manage your notebook workflow.
FAQ
How many trees should I use in a Random Forest?
Start with 100-200 trees. Accuracy generally improves with more trees but plateaus after a certain point. Use cross-validation to find the sweet spot. Beyond 500 trees, gains are usually negligible while training time increases. Monitor the OOB score as you increase n_estimators -- when it stops improving, you have enough trees.
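One way to monitor the OOB score as trees are added is warm_start=True, which makes each call to fit() grow only the new trees instead of retraining from scratch. A minimal sketch on the wine dataset:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)

# warm_start=True keeps already-fitted trees between fit() calls,
# so each step only trains the additional trees
rf = RandomForestClassifier(
    warm_start=True, oob_score=True, random_state=42, n_jobs=-1
)
for n in [50, 100, 200, 400]:
    rf.set_params(n_estimators=n)
    rf.fit(X, y)
    print(f"n_estimators={n:4d}  OOB score: {rf.oob_score_:.4f}")
```

When the printed OOB score stops improving between steps, adding more trees only costs training and prediction time.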
Does Random Forest need feature scaling?
No. Random Forest makes splits based on feature value thresholds, so the absolute scale of features does not affect the splitting decisions. Unlike logistic regression, SVM, or neural networks, Random Forest handles features with different ranges naturally. However, if your pipeline includes other components (like PCA or distance-based preprocessing), scaling may still be required for those steps.
How does Random Forest handle missing values?
Scikit-learn's RandomForestClassifier and RandomForestRegressor do not handle missing values natively. You must impute missing data before training -- use SimpleImputer with median or mean strategy for numeric features, or use more advanced imputation methods like IterativeImputer. Some other implementations like H2O or LightGBM can handle missing values directly.
What is the difference between Random Forest and Gradient Boosting?
Random Forest builds trees independently in parallel (bagging), while Gradient Boosting builds trees sequentially where each tree corrects the errors of the previous one (boosting). Random Forest reduces variance, Gradient Boosting reduces bias. In practice, Gradient Boosting (especially XGBoost) often achieves slightly higher accuracy, but Random Forest is easier to tune and less prone to overfitting.
Can Random Forest be used for feature selection?
Yes. Use feature_importances_ for a quick ranking or permutation_importance for a more reliable estimate. You can then drop low-importance features and retrain. Alternatively, use SelectFromModel with a Random Forest estimator inside a pipeline to automatically select features above a threshold.
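A minimal sketch of the SelectFromModel approach, keeping only the features whose importance exceeds the mean importance:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_wine(return_X_y=True)

# Fit a forest, then keep features with above-mean importance
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1),
    threshold='mean'
)
X_reduced = selector.fit_transform(X, y)
print(f"Selected {X_reduced.shape[1]} of {X.shape[1]} features")
print(f"Kept feature mask: {selector.get_support()}")
```

Placed as the first step of a Pipeline, the same selector applies consistently at both training and prediction time.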
Conclusion
Random Forest is one of the most reliable and versatile algorithms in machine learning. It reduces overfitting by combining hundreds of decorrelated decision trees, handles both classification and regression tasks without feature scaling, and provides built-in feature importance rankings. For most tabular data problems, it serves as an excellent first model that often performs well enough for production use.
Start with RandomForestClassifier or RandomForestRegressor with default parameters as your baseline. Tune n_estimators first for diminishing-returns analysis, then max_depth and min_samples_leaf to control overfitting. Use class_weight='balanced' for imbalanced data, permutation importance for reliable feature rankings, and StratifiedKFold cross-validation for robust evaluation. For simpler linear problems, sklearn LinearRegression may be sufficient. When you need the absolute highest accuracy on structured data, consider Gradient Boosting or XGBoost, but Random Forest remains the safest default choice that rarely fails badly.
Related Guides
- Sklearn Confusion Matrix -- evaluate classifier performance with precision, recall, and F1
- Sklearn Pipeline -- wrap preprocessing and Random Forest into a single deployable object
- Sklearn Linear Regression -- a simpler model for linear regression tasks
- Pandas read_csv -- load datasets from CSV files before training