Sklearn Train Test Split: Complete Guide to Splitting Data in Python
Training a machine learning model on your entire dataset and then evaluating it on the same data leads to a critical problem: your model will appear to perform well, but it has simply memorized the data rather than learned patterns. This overfitting means your model will fail spectacularly when it encounters new, unseen data. Data scientists need a reliable way to evaluate model performance on data the model has never seen during training.
The solution is train-test splitting. By holding back a portion of your data for evaluation, you get an honest assessment of how your model will perform in the real world. Sklearn's train_test_split function makes this process straightforward, but using it incorrectly can still lead to data leakage, poor generalization, and misleading performance metrics.
This guide covers everything you need to know about sklearn's train_test_split, from basic usage to advanced techniques for time series data, imbalanced classes, and multi-output problems.
What is Train Test Split?
Train test split is the fundamental technique for evaluating machine learning models. You divide your dataset into two parts: a training set used to fit the model, and a test set used to evaluate the model's performance on unseen data.
The train_test_split function from scikit-learn (sklearn) automates this process, handling the random shuffling and splitting with just one line of code.
from sklearn.model_selection import train_test_split
# Basic usage
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In this example, X contains your features (input variables) and y contains your target variable (what you want to predict). The function returns four arrays: training features, test features, training labels, and test labels.
Basic train_test_split Syntax
The simplest usage of train_test_split requires just two arguments: your features and your target variable.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
# Load sample data
iris = load_iris()
X = iris.data
y = iris.target
# Split the data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")This splits your data randomly, with 80% going to training and 20% to testing. However, this basic usage has a critical flaw: the split is different every time you run the code, making results irreproducible.
Essential Parameters
test_size and train_size
The test_size parameter controls how much data goes to the test set. You can specify it as:
- A float between 0.0 and 1.0 (proportion of the dataset)
- An integer (absolute number of test samples)
# 30% test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# 50 samples in test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=50)
# Alternatively, specify train_size
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
If you specify both test_size and train_size, their sum must not exceed 1.0 (or the total number of samples if using integers); a sum below 1.0 simply leaves the remaining samples out of both sets. In most cases, specifying just test_size is sufficient.
random_state for Reproducibility
The random_state parameter is crucial for reproducible results. Without it, you get a different split every time you run your code, making it impossible to debug or compare experiments.
# Without random_state - different split each time
X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size=0.2)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.2)
print(f"Same split? {(X_train1 == X_train2).all()}") # False
# With random_state - same split every time
X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size=0.2, random_state=42)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Same split? {(X_train1 == X_train2).all()}") # TrueUse any integer for random_state. The specific number doesn't matter; what matters is using the same number consistently across your project.
shuffle Parameter
By default, train_test_split shuffles the data before splitting. For most machine learning tasks, this is exactly what you want. However, for time series data or when the order matters, you should disable shuffling.
# Shuffle enabled (default)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
# Shuffle disabled (for time series)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
When shuffle=False, the function simply takes the first portion for training and the last portion for testing, maintaining the original order.
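A quick way to see this behavior is to split a small ordered array and inspect which rows land where; a minimal sketch with toy data (the array here is purely illustrative):
import numpy as np
from sklearn.model_selection import train_test_split
# Rows 0-9 in order; with shuffle=False the tail becomes the test set
X_ordered = np.arange(10).reshape(-1, 1)
y_ordered = np.arange(10)
X_tr, X_te, y_tr, y_te = train_test_split(X_ordered, y_ordered, test_size=0.2, shuffle=False)
print(X_tr.ravel())  # [0 1 2 3 4 5 6 7] - first 80%, original order preserved
print(X_te.ravel())  # [8 9] - last 20%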
Parameter Reference Table
| Parameter | Type | Default | Description |
|---|---|---|---|
| test_size | float or int | None | Proportion (0.0-1.0) or number of samples for the test set; defaults to 0.25 when both sizes are None |
| train_size | float or int | None | Proportion (0.0-1.0) or number of samples for the train set |
| random_state | int | None | Random seed for reproducibility |
| shuffle | bool | True | Whether to shuffle the data before splitting |
| stratify | array-like | None | Labels (typically y) used to preserve class proportions in both sets |
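For reference, here is a single call that sets every parameter from the table explicitly; the specific values are illustrative, not recommendations:
# One call touching every parameter in the table (values chosen for illustration)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,    # 25% of samples for the test set
    train_size=0.75,   # remaining 75% for the training set
    random_state=0,    # fixed seed so the split is reproducible
    shuffle=True,      # shuffle before splitting (the default)
    stratify=y,        # keep class proportions equal in both sets
)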
Stratified Splitting for Imbalanced Data
When your dataset has imbalanced classes (some classes have far fewer samples than others), random splitting can create training or test sets that poorly represent the overall distribution. This is especially problematic for classification tasks.
The stratify parameter ensures that the class distribution in the training and test sets matches the original dataset.
import numpy as np
from sklearn.model_selection import train_test_split
# Create imbalanced dataset (90% class 0, 10% class 1)
X = np.random.randn(1000, 5)
y = np.array([0] * 900 + [1] * 100)
# Without stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Train distribution: Class 0: {sum(y_train == 0)}, Class 1: {sum(y_train == 1)}")
print(f"Test distribution: Class 0: {sum(y_test == 0)}, Class 1: {sum(y_test == 1)}")
# With stratification
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"\nStratified train distribution: Class 0: {sum(y_train == 0)}, Class 1: {sum(y_train == 1)}")
print(f"Stratified test distribution: Class 0: {sum(y_test == 0)}, Class 1: {sum(y_test == 1)}")With stratification, both training and test sets maintain the 90/10 class distribution. Without it, you might get lucky and have a representative split, or you might end up with a test set that has only 5% of class 1, leading to unreliable evaluation metrics.
Splitting Multiple Arrays
You can split multiple arrays at once, and sklearn will ensure they're split in the same way (same indices for all arrays).
import numpy as np
X = np.random.randn(100, 5)
y = np.random.randint(0, 2, 100)
sample_weights = np.random.rand(100)
# Split all three arrays
X_train, X_test, y_train, y_test, weights_train, weights_test = train_test_split(
X, y, sample_weights, test_size=0.2, random_state=42
)
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"weights_train shape: {weights_train.shape}")This is particularly useful when you have sample weights, multiple target variables, or additional metadata that needs to be split consistently.
Train/Test Split vs Cross-Validation vs Holdout
Different validation strategies serve different purposes. Here's a comparison:
| Method | Data Usage | Computational Cost | Best For | Limitations |
|---|---|---|---|---|
| Train/Test Split | 70-80% train, 20-30% test | Low | Quick model evaluation, large datasets | Single evaluation, might be lucky/unlucky with split |
| Cross-Validation | 100% used for training/testing (k-fold) | High (k times slower) | Small datasets, reliable performance estimate | Computationally expensive, not for time series |
| Train/Val/Test (Holdout) | 60% train, 20% val, 20% test | Medium | Hyperparameter tuning, final evaluation | More data needed, more complex workflow |
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
model = RandomForestClassifier(random_state=42)
# Method 1: Simple train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
print(f"Train/Test Split Score: {model.score(X_test, y_test):.3f}")
# Method 2: 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)
print(f"Cross-Validation Score: {scores.mean():.3f} (+/- {scores.std():.3f})")
# Method 3: Train/Val/Test split
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)
model.fit(X_train, y_train)
print(f"Validation Score: {model.score(X_val, y_val):.3f}")
print(f"Test Score: {model.score(X_test, y_test):.3f}")For most projects, start with a simple train/test split. Use cross-validation when you have limited data or need a more robust performance estimate. Use train/val/test when you need to tune hyperparameters.
Advanced Splitting Techniques
Time Series Splitting
For time series data, random shuffling destroys the temporal order, which can lead to data leakage (using future information to predict the past). Use TimeSeriesSplit instead:
from sklearn.model_selection import TimeSeriesSplit
import numpy as np
X = np.random.randn(100, 5)
y = np.random.randn(100)
tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(X):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
print(f"Train size: {len(train_index)}, Test size: {len(test_index)}")TimeSeriesSplit creates multiple train/test splits where each training set includes all past data up to a certain point, and the test set includes the immediately following period. This simulates real-world forecasting where you only have past data to predict the future.
GroupShuffleSplit for Grouped Data
When your data has groups (e.g., multiple measurements from the same patient, multiple transactions from the same customer), you need to ensure entire groups stay together in either the training or test set to avoid data leakage.
from sklearn.model_selection import GroupShuffleSplit
import numpy as np
X = np.random.randn(100, 5)
y = np.random.randint(0, 2, 100)
groups = np.array([0] * 25 + [1] * 25 + [2] * 25 + [3] * 25) # 4 groups
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in gss.split(X, y, groups):
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
print(f"Train groups: {np.unique(groups[train_idx])}")
print(f"Test groups: {np.unique(groups[test_idx])}")This ensures that all samples from a given group are in either the training set or the test set, never both.
Stratified Multi-Output Splitting
For multi-output classification problems, you can't directly use stratify with a 2D array. Instead, create a single label that represents the combination of all outputs:
import numpy as np
from sklearn.model_selection import train_test_split
X = np.random.randn(1000, 10)
y = np.random.randint(0, 2, (1000, 3)) # 3 binary outputs
# Create combined labels for stratification
y_combined = y[:, 0] * 4 + y[:, 1] * 2 + y[:, 2]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y_combined
)
print(f"Original distribution: {np.unique(y_combined, return_counts=True)[1]}")
print(f"Train distribution: {np.unique(y_combined[:(len(X) - len(X_test))], return_counts=True)[1]}")Best Practices for Train Test Split
Choosing the Right Split Ratio
The most common split ratios are:
- 80/20: Standard choice for medium to large datasets (10,000+ samples)
- 70/30: Better for smaller datasets (1,000-10,000 samples) to have more robust test evaluation
- 90/10: For very large datasets (100,000+ samples) where even 10% provides ample test samples
- 60/20/20: For train/validation/test when tuning hyperparameters
import numpy as np
def recommend_split_ratio(n_samples):
if n_samples < 1000:
return "Consider cross-validation instead of simple split"
elif n_samples < 10000:
return "70/30 split recommended"
elif n_samples < 100000:
return "80/20 split recommended"
else:
return "90/10 or 80/20 split recommended"
sample_sizes = [500, 5000, 50000, 500000]
for size in sample_sizes:
print(f"{size} samples: {recommend_split_ratio(size)}")Avoiding Data Leakage
Data leakage occurs when information from the test set influences the training process. Common sources:
- Preprocessing before splitting: Always split first, then preprocess
- Feature scaling on combined data: Fit scaler only on training data
- Feature selection on combined data: Select features using only training data
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np
X = np.random.randn(1000, 10)
y = np.random.randint(0, 2, 1000)
# WRONG: Scale before splitting (data leakage!)
scaler_wrong = StandardScaler()
X_scaled_wrong = scaler_wrong.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled_wrong, y, test_size=0.2)
# CORRECT: Split first, then scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit on training data
X_test_scaled = scaler.transform(X_test)  # Transform test data using training stats
The wrong approach uses information from the entire dataset (including test samples) to calculate scaling parameters, which leaks information about the test set into the training process.
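One way to make this class of mistake harder is to wrap preprocessing and the model in a scikit-learn Pipeline, which fits the scaler only on whatever data is passed to fit(). A minimal sketch, assuming the split from the correct example above (the choice of LogisticRegression is just for illustration):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# The pipeline fits the scaler on X_train only inside fit(),
# then reuses those training statistics when scoring on X_test
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
print(f"Test accuracy: {pipe.score(X_test, y_test):.3f}")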
Stratifying Whenever Possible
For classification problems, always use stratified splitting unless you have a specific reason not to. This is especially critical for:
- Imbalanced datasets
- Small datasets
- Multi-class problems with rare classes
from sklearn.model_selection import train_test_split
import numpy as np
# Rare disease dataset: 1% positive cases
X = np.random.randn(1000, 20)
y = np.array([0] * 990 + [1] * 10)
# Without stratification - might have no positive cases in test set!
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=123)
print(f"Non-stratified test positives: {sum(y_test)}")
# With stratification - guarantees proportional representation
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.1, random_state=123, stratify=y
)
print(f"Stratified test positives: {sum(y_test)}")Common Mistakes to Avoid
1. Forgetting random_state
Without random_state, your results change every time you run the code. This makes debugging impossible and experiments irreproducible.
# BAD: No random_state
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# GOOD: Set random_state
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
2. Not Stratifying Imbalanced Classes
For imbalanced datasets, random splitting can create highly unrepresentative test sets, leading to unreliable performance metrics.
# BAD: No stratification for imbalanced data
X_train, X_test, y_train, y_test = train_test_split(X, y_imbalanced, test_size=0.2)
# GOOD: Use stratification
X_train, X_test, y_train, y_test = train_test_split(
X, y_imbalanced, test_size=0.2, stratify=y_imbalanced, random_state=42
)
3. Splitting Time Series Data with Shuffle
Time series models depend on temporal order. Shuffling destroys this structure and can lead to severe data leakage.
# BAD: Shuffling time series data
X_train, X_test, y_train, y_test = train_test_split(
X_timeseries, y_timeseries, test_size=0.2, shuffle=True
)
# GOOD: Disable shuffling or use TimeSeriesSplit
X_train, X_test, y_train, y_test = train_test_split(
X_timeseries, y_timeseries, test_size=0.2, shuffle=False
)
4. Preprocessing Before Splitting
Fitting preprocessors (scalers, imputers, encoders) on the entire dataset before splitting causes data leakage.
# BAD: Preprocessing before split
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)
# GOOD: Split first, then preprocess
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
5. Using Test Set for Hyperparameter Tuning
The test set should only be used for final evaluation. If you use it to choose hyperparameters, you're essentially training on your test data.
# BAD: Tuning on test set
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
best_score = 0
best_params = None
for n_estimators in [10, 50, 100]:
model = RandomForestClassifier(n_estimators=n_estimators)
model.fit(X_train, y_train)
score = model.score(X_test, y_test) # Using test set!
if score > best_score:
best_score = score
best_params = n_estimators
# GOOD: Use validation set or cross-validation
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)
best_score = 0
best_params = None
for n_estimators in [10, 50, 100]:
model = RandomForestClassifier(n_estimators=n_estimators)
model.fit(X_train, y_train)
score = model.score(X_val, y_val) # Using validation set
if score > best_score:
best_score = score
best_params = n_estimators
Practical Example: Complete Workflow
Here's a complete machine learning workflow using train_test_split correctly:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
# Load data
np.random.seed(42)
X = np.random.randn(1000, 10)
y = (X[:, 0] + X[:, 1] > 0).astype(int) # Binary classification
# Step 1: Split data (stratified for balanced test set)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Step 2: Preprocess (fit on training data only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Step 3: Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# Step 4: Evaluate
y_pred = model.predict(X_test_scaled)
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
# Step 5: Check for overfitting
train_score = model.score(X_train_scaled, y_train)
test_score = model.score(X_test_scaled, y_test)
print(f"\nTrain accuracy: {train_score:.3f}")
print(f"Test accuracy: {test_score:.3f}")
print(f"Overfitting gap: {train_score - test_score:.3f}")Using RunCell for Interactive Data Splitting
When working in Jupyter notebooks, experimenting with different split ratios and parameters can be tedious. RunCell provides an AI agent specifically designed for data science workflows in Jupyter. It can help you:
- Automatically test multiple split ratios and compare results
- Detect data leakage in your preprocessing pipeline
- Suggest optimal stratification strategies for your specific dataset
- Generate validation curves to choose the right train/test ratio
RunCell integrates directly into your Jupyter environment, making it easy to iterate on your data splitting strategy without writing repetitive code.
Visualizing Your Data with PyGWalker
After splitting your data, it's crucial to verify that your training and test sets have similar distributions. PyGWalker turns your pandas DataFrames into interactive Tableau-style visualizations, making it easy to:
- Compare feature distributions between train and test sets
- Identify potential sampling bias in your splits
- Visualize class imbalances and verify stratification worked correctly
- Explore relationships between features in your training data
import pygwalker as pyg
import pandas as pd
# Convert to DataFrames for visualization
train_df = pd.DataFrame(X_train, columns=[f'feature_{i}' for i in range(X_train.shape[1])])
train_df['dataset'] = 'train'
test_df = pd.DataFrame(X_test, columns=[f'feature_{i}' for i in range(X_test.shape[1])])
test_df['dataset'] = 'test'
combined = pd.concat([train_df, test_df])
# Create interactive visualization
pyg.walk(combined)
This lets you interactively explore whether your train and test distributions match, which is critical for reliable model evaluation.
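If you want a quick programmatic sanity check before (or instead of) interactive exploration, comparing per-feature summary statistics between the two sets catches obvious mismatches; a minimal sketch reusing the combined DataFrame built above:
# Compare per-feature means between train and test; large gaps hint at a skewed split
feature_cols = [c for c in combined.columns if c != 'dataset']
means = combined.groupby('dataset')[feature_cols].mean().T
means['abs_diff'] = (means['train'] - means['test']).abs()
print(means.sort_values('abs_diff', ascending=False).head())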
FAQ
How do I choose between 80/20 and 70/30 split?
Use 80/20 for datasets larger than 10,000 samples, and 70/30 for smaller datasets (1,000-10,000 samples). The key is ensuring your test set has enough samples for reliable evaluation—typically at least 200-500 samples for classification problems. For very large datasets (100,000+ samples), you can use 90/10 or even 95/5 since even 5% provides thousands of test samples.
What is random_state and why does it matter?
random_state is the seed for the random number generator that shuffles your data before splitting. Using the same random_state value ensures you get the identical split every time you run your code, which is essential for reproducibility and debugging. Without it, you'll get different train/test splits each time, making it impossible to determine whether performance changes are due to model improvements or just lucky/unlucky data splits.
When should I use stratify parameter?
Use stratify=y for all classification problems, especially when you have imbalanced classes or small datasets. Stratification ensures that the class distribution in your training and test sets matches the overall distribution. For example, if 10% of your data is positive cases, stratification guarantees that both training and test sets have approximately 10% positive cases, preventing evaluation bias from unrepresentative splits.
Can I use train_test_split for time series data?
No, you should not use train_test_split with shuffle=True for time series data, as it destroys temporal ordering and causes data leakage (using future data to predict the past). Instead, either use train_test_split with shuffle=False for a simple chronological split, or use TimeSeriesSplit for cross-validation that respects temporal order. For time series, always ensure training data comes before test data chronologically.
How is train_test_split different from cross-validation?
train_test_split creates a single train/test partition (typically 80/20), giving you one performance estimate. Cross-validation (like k-fold) creates multiple train/test splits and averages the results, providing a more robust performance estimate. Use train_test_split for quick evaluation and large datasets. Use cross-validation for small datasets (less than 1,000 samples) or when you need more reliable performance estimates. Cross-validation is k times slower (e.g., 5-fold is 5× slower) but reduces variance in your performance metrics.
Conclusion
Sklearn's train_test_split is the foundational tool for evaluating machine learning models. By properly splitting your data, you get honest performance estimates that predict real-world model behavior. Remember the key principles: always set random_state for reproducibility, use stratify for classification problems, avoid preprocessing before splitting, and choose your split ratio based on dataset size.
Master these fundamentals, and you'll avoid the most common pitfalls that lead to overfit models and misleading performance metrics. Whether you're building a simple classifier or a complex deep learning system, proper train-test splitting is the first step toward reliable machine learning.