Sklearn Linear Regression: Complete Guide with Python Examples
You have a dataset with features and a continuous target variable. You want to predict outcomes -- housing prices, sales revenue, temperature trends -- but you are not sure which approach to use or how to set it up correctly in Python. The wrong model choice or missing preprocessing steps lead to poor predictions and wasted time debugging.
Linear regression is the most widely used algorithm for continuous prediction tasks, yet getting it right involves more than calling .fit() and .predict(). You need to understand how the model works internally, when it fails, how to evaluate it properly, and when to switch to regularized variants like Ridge or Lasso. Skipping these steps means deploying models that perform well on training data but break down on new observations.
Scikit-learn provides LinearRegression along with a complete ecosystem of tools for preprocessing, evaluation, and regularization. This guide covers everything from basic usage to production-ready regression pipelines.
What Is Linear Regression?
Linear regression models the relationship between one or more input features and a continuous output by fitting a straight line (or hyperplane) that minimizes the sum of squared residuals. The equation for a model with n features is:
y = b0 + b1*x1 + b2*x2 + ... + bn*xn

Where b0 is the intercept (bias term), b1...bn are the coefficients (weights) for each feature, and y is the predicted value.
The model finds the coefficients that minimize the Ordinary Least Squares (OLS) cost function:
Cost = Sum of (y_actual - y_predicted)^2

This has a closed-form solution, so training is fast even on large datasets.
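The closed-form solution can be checked directly: solving the normal equation with NumPy recovers the same intercept and slope that LinearRegression finds. A minimal sketch on synthetic data (the true coefficients 4.0 and 2.5 are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: one feature with a known linear trend plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 4.0 + 2.5 * X.ravel() + rng.normal(0, 0.5, 50)

# Closed-form OLS: prepend a column of ones for the intercept,
# then solve the normal equation b = (X'X)^-1 X'y
X_design = np.column_stack([np.ones(len(X)), X])
b = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

model = LinearRegression().fit(X, y)
# Both approaches produce the same intercept and slope
print(b[0], model.intercept_)
print(b[1], model.coef_[0])
```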
Simple Linear Regression with Sklearn
Simple linear regression uses a single feature to predict the target. Here is a complete example:
from sklearn.linear_model import LinearRegression
import numpy as np
# Sample data: years of experience vs salary (in thousands)
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([35, 40, 45, 55, 60, 62, 70, 75, 82, 90])
# Create and train the model
model = LinearRegression()
model.fit(X, y)
# Model parameters
print(f"Coefficient (slope): {model.coef_[0]:.4f}")
print(f"Intercept: {model.intercept_:.4f}")
# Predict salary for 12 years of experience
prediction = model.predict([[12]])
print(f"Predicted salary for 12 years: ${prediction[0]:.2f}k")
# Coefficient (slope): 5.9758
# Intercept: 28.5333
# Predicted salary for 12 years: $100.24k

Understanding the Output
| Attribute | Meaning | Example Value |
|---|---|---|
| model.coef_ | Weight for each feature | [5.98] -- salary increases ~$5,976 per year |
| model.intercept_ | Predicted y when all features are 0 | 28.53 -- base salary of $28,533 |
| model.score(X, y) | R-squared on the given data | 0.99 |
Multiple Linear Regression
When you have more than one feature, the model fits a hyperplane instead of a line:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
import numpy as np
# Load California housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target
feature_names = housing.feature_names
print(f"Features: {feature_names}")
print(f"Dataset shape: {X.shape}") # (20640, 8)
# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Print coefficients for each feature
print("\nFeature Coefficients:")
for name, coef in zip(feature_names, model.coef_):
    print(f" {name:12s}: {coef:+.6f}")
print(f" {'Intercept':12s}: {model.intercept_:+.6f}")
# Evaluate
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"\nR² (train): {train_score:.4f}")
print(f"R² (test): {test_score:.4f}")

Model Evaluation: R-squared, MSE, and RMSE
R-squared alone does not tell the full story. Use multiple metrics to evaluate regression models:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.datasets import fetch_california_housing
import numpy as np
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Calculate metrics
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
print(f"R² Score: {r2:.4f}")
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
# R² Score: 0.5758
# MSE: 0.5559
# RMSE: 0.7456
# MAE: 0.5332

Evaluation Metrics Explained
| Metric | Formula | Range | Interpretation |
|---|---|---|---|
| R-squared (R²) | 1 - (SS_res / SS_tot) | (-inf, 1] | Proportion of variance explained. 1.0 = perfect, 0 = no better than mean |
| MSE | mean((y - y_pred)²) | [0, inf) | Average squared error. Penalizes large errors more |
| RMSE | sqrt(MSE) | [0, inf) | Same units as target variable. Easier to interpret than MSE |
| MAE | mean(|y - y_pred|) | [0, inf) | Average absolute error. Robust to outliers |
A low R-squared does not always mean a bad model. For noisy real-world data (like housing prices), R² = 0.6 can be reasonable. Always compare RMSE against the scale of your target variable. For classification tasks, see our guide on the sklearn confusion matrix for appropriate evaluation metrics.
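The formulas in the table map directly to a few lines of NumPy. A quick sketch with made-up actual and predicted values, confirming that the manual R² definition matches sklearn's r2_score and that RMSE is simply the square root of MSE:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Hypothetical actual vs predicted values (made up for illustration)
y_true = np.array([2.0, 3.5, 1.0, 4.0, 2.5])
y_hat = np.array([2.2, 3.1, 1.3, 3.6, 2.4])

# R² from its definition: 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE is the square root of MSE, in the target's own units
rmse = np.sqrt(mean_squared_error(y_true, y_hat))

print(f"manual R²: {r2_manual:.4f}  sklearn R²: {r2_score(y_true, y_hat):.4f}")
print(f"RMSE: {rmse:.4f}")
```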
Feature Scaling for Linear Regression
Standard LinearRegression does not require feature scaling: rescaling a feature simply rescales the corresponding coefficient, so predictions are unchanged. However, scaling becomes essential when using regularization:
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
# Without scaling (fine for basic LinearRegression)
model_no_scale = LinearRegression()
model_no_scale.fit(X_train, y_train)
print(f"LinearRegression R² (no scaling): {model_no_scale.score(X_test, y_test):.4f}")
# With scaling via Pipeline (required for regularized models)
pipeline = Pipeline([
('scaler', StandardScaler()),
('ridge', Ridge(alpha=1.0))
])
pipeline.fit(X_train, y_train)
print(f"Ridge R² (with scaling): {pipeline.score(X_test, y_test):.4f}")

Why scaling matters for regularization: Ridge and Lasso penalize large coefficients equally. If one feature ranges 0-1 and another ranges 0-100,000, the penalty disproportionately shrinks the small-range feature's coefficient. Scaling puts all features on the same scale so the penalty is applied fairly.
Polynomial Features: Modeling Non-Linear Relationships
When the relationship between features and target is not linear, polynomial features can capture curves and interactions:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import numpy as np
# Generate non-linear data
np.random.seed(42)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 3 * X.ravel()**2 - 5 * X.ravel() + 10 + np.random.randn(200) * 15
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Linear model
linear = LinearRegression()
linear.fit(X_train, y_train)
print(f"Linear R²: {r2_score(y_test, linear.predict(X_test)):.4f}")
# Polynomial (degree 2) model
poly_pipeline = Pipeline([
('poly', PolynomialFeatures(degree=2, include_bias=False)),
('linear', LinearRegression())
])
poly_pipeline.fit(X_train, y_train)
print(f"Poly (d=2) R²: {r2_score(y_test, poly_pipeline.predict(X_test)):.4f}")
# Polynomial (degree 3) model
poly3_pipeline = Pipeline([
('poly', PolynomialFeatures(degree=3, include_bias=False)),
('linear', LinearRegression())
])
poly3_pipeline.fit(X_train, y_train)
print(f"Poly (d=3) R²: {r2_score(y_test, poly3_pipeline.predict(X_test)):.4f}")

Warning: High-degree polynomials overfit quickly. Use cross-validation to select the right degree, and prefer regularization for polynomial models.
Regularization: Ridge, Lasso, and ElasticNet
When your model has many features or polynomial terms, regularization prevents overfitting by adding a penalty to large coefficients.
Ridge Regression (L2 Penalty)
Ridge adds the sum of squared coefficients to the cost function. It shrinks coefficients toward zero but never sets them exactly to zero.
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.datasets import fetch_california_housing
import numpy as np
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
# Find best alpha with cross-validation
pipeline = Pipeline([
('scaler', StandardScaler()),
('ridge', Ridge())
])
param_grid = {'ridge__alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2')
grid.fit(X_train, y_train)
print(f"Best alpha: {grid.best_params_['ridge__alpha']}")
print(f"Best CV R²: {grid.best_score_:.4f}")
print(f"Test R²: {grid.score(X_test, y_test):.4f}")

Lasso Regression (L1 Penalty)
Lasso adds the sum of absolute coefficients. It can set coefficients exactly to zero, performing automatic feature selection:
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.datasets import fetch_california_housing
import numpy as np
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
pipeline = Pipeline([
('scaler', StandardScaler()),
('lasso', Lasso(max_iter=10000))
])
param_grid = {'lasso__alpha': [0.001, 0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2')
grid.fit(X_train, y_train)
print(f"Best alpha: {grid.best_params_['lasso__alpha']}")
print(f"Test R²: {grid.score(X_test, y_test):.4f}")
# Show which features were selected (non-zero coefficients)
lasso_model = grid.best_estimator_.named_steps['lasso']
feature_names = housing.feature_names
for name, coef in zip(feature_names, lasso_model.coef_):
    status = "KEPT" if abs(coef) > 1e-6 else "DROPPED"
    print(f" {name:12s}: {coef:+.6f} [{status}]")

ElasticNet (L1 + L2 Penalty)
ElasticNet combines Ridge and Lasso penalties. The l1_ratio parameter controls the mix: 0 = pure Ridge, 1 = pure Lasso.
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
pipeline = Pipeline([
('scaler', StandardScaler()),
('elasticnet', ElasticNet(max_iter=10000))
])
param_grid = {
'elasticnet__alpha': [0.01, 0.1, 1.0],
'elasticnet__l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]
}
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2')
grid.fit(X_train, y_train)
print(f"Best alpha: {grid.best_params_['elasticnet__alpha']}")
print(f"Best l1_ratio: {grid.best_params_['elasticnet__l1_ratio']}")
print(f"Test R²: {grid.score(X_test, y_test):.4f}")

Comparison: LinearRegression vs Ridge vs Lasso vs ElasticNet
| Model | Penalty | Feature Selection | When to Use | Scaling Required |
|---|---|---|---|---|
| LinearRegression | None | No | Few features, no multicollinearity, good signal-to-noise ratio | No |
| Ridge | L2 (squared) | No (shrinks toward zero) | Many correlated features, want to keep all features | Yes |
| Lasso | L1 (absolute) | Yes (sets coefficients to zero) | Many features, want automatic feature selection | Yes |
| ElasticNet | L1 + L2 | Yes (partial) | Many correlated features, want some feature selection | Yes |
Choosing the Right Model
Use LinearRegression as your baseline. If the model overfits (big gap between train and test R-squared), try Ridge first. If you suspect many irrelevant features, try Lasso. If features are correlated and you want selection, try ElasticNet. For non-linear problems, consider a Random Forest instead. Always use cross-validation to compare.
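That decision procedure can be sketched as a quick diagnostic: fit the baseline, measure the train-test R² gap, and check whether Ridge narrows it. The synthetic data and the degree-12 polynomial below are illustrative choices, not a recipe:

```python
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import numpy as np

# Noisy non-linear data where a high-degree polynomial invites overfitting
rng = np.random.default_rng(42)
X = rng.uniform(0, 5, size=(60, 1))
y = np.sin(X.ravel()) + rng.normal(0, 0.3, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gaps = {}
for name, reg in [("LinearRegression", LinearRegression()),
                  ("Ridge", Ridge(alpha=10.0))]:
    pipe = Pipeline([
        ("poly", PolynomialFeatures(degree=12, include_bias=False)),
        ("scaler", StandardScaler()),
        ("model", reg),
    ])
    pipe.fit(X_train, y_train)
    # A big train-test gap signals overfitting; regularization should shrink it
    gaps[name] = pipe.score(X_train, y_train) - pipe.score(X_test, y_test)
    print(f"{name:<16} train-test R² gap: {gaps[name]:+.4f}")
```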
Assumptions of Linear Regression
Linear regression produces reliable results when these assumptions hold:
- Linearity -- The relationship between features and target is linear (or linearizable with transformations).
- Independence -- Observations are independent of each other. Violated in time-series data without accounting for autocorrelation.
- Homoscedasticity -- The variance of residuals is constant across all levels of the predicted values.
- Normality of residuals -- Residuals follow a normal distribution. Matters most for confidence intervals and hypothesis tests, less for prediction accuracy.
- No multicollinearity -- Features are not highly correlated with each other. Multicollinearity inflates coefficient variance and makes individual coefficients unreliable.
Checking Assumptions in Code
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
import numpy as np
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
residuals = y_test - y_pred
# Check residual statistics
print(f"Residual mean: {residuals.mean():.6f}") # Should be near 0
print(f"Residual std: {residuals.std():.4f}")
print(f"Residual skewness: {float(np.mean((residuals - residuals.mean())**3) / residuals.std()**3):.4f}")
# Check for multicollinearity (correlation matrix)
corr_matrix = np.corrcoef(X_train, rowvar=False)
print(f"\nMax feature correlation: {np.max(np.abs(corr_matrix - np.eye(corr_matrix.shape[0]))):.4f}")

Complete Pipeline: Real-World Regression
Here is a production-style pipeline that combines preprocessing, feature engineering, and model selection:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_california_housing
import numpy as np
# Load data
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
# Define models to compare
models = {
'LinearRegression': Pipeline([
('scaler', StandardScaler()),
('model', LinearRegression())
]),
'Ridge (alpha=1)': Pipeline([
('scaler', StandardScaler()),
('model', Ridge(alpha=1.0))
]),
'Lasso (alpha=0.01)': Pipeline([
('scaler', StandardScaler()),
('model', Lasso(alpha=0.01, max_iter=10000))
]),
'Poly(2) + Ridge': Pipeline([
('poly', PolynomialFeatures(degree=2, include_bias=False)),
('scaler', StandardScaler()),
('model', Ridge(alpha=10.0))
])
}
# Evaluate all models
print(f"{'Model':<25} {'CV R² (mean)':>12} {'CV R² (std)':>12} {'Test R²':>10}")
print("-" * 62)
for name, pipeline in models.items():
    cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='r2')
    pipeline.fit(X_train, y_train)
    test_r2 = pipeline.score(X_test, y_test)
    print(f"{name:<25} {cv_scores.mean():>12.4f} {cv_scores.std():>12.4f} {test_r2:>10.4f}")

Exploring Regression Results with PyGWalker
After training your model, understanding prediction patterns is critical. PyGWalker lets you visually explore residuals, feature importances, and predicted-vs-actual relationships through an interactive drag-and-drop interface in Jupyter:
import pandas as pd
import pygwalker as pyg
# Build a results DataFrame from the held-out test split
# (train_test_split shuffles rows, so use X_test directly rather than slicing housing.data)
results = pd.DataFrame(X_test, columns=housing.feature_names)
results['actual'] = y_test
results['predicted'] = y_pred
results['residual'] = y_test - y_pred
results['abs_error'] = np.abs(y_test - y_pred)
# Launch interactive exploration
walker = pyg.walk(results)

You can drag features to axes, color-code by residual magnitude, and identify which segments of your data the model struggles with -- all without writing plotting code.
For running experiments iteratively in Jupyter, RunCell provides an AI agent that helps you test different feature combinations, hyperparameters, and preprocessing steps without manually rewriting cells.
FAQ
What is LinearRegression in sklearn?
sklearn.linear_model.LinearRegression is an Ordinary Least Squares (OLS) regression model. It fits a linear equation to the data by minimizing the sum of squared differences between actual and predicted values. It is the simplest and most interpretable regression model in scikit-learn.
How do I interpret the R-squared score?
R-squared measures the proportion of variance in the target variable explained by the model. An R-squared of 0.80 means 80% of the variance is explained. A value of 1.0 is a perfect fit, 0.0 means the model is no better than predicting the mean, and negative values mean the model is worse than just using the mean.
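The boundary cases are easy to demonstrate with tiny made-up arrays: predicting the mean scores exactly 0, and anything worse goes negative.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0])

# Predicting the mean of y_true scores exactly 0
mean_pred = np.full(3, y_true.mean())
print(r2_score(y_true, mean_pred))   # 0.0

# Predictions worse than the mean score negative:
# SS_res = 5, SS_tot = 2, so R² = 1 - 5/2 = -1.5
bad_pred = np.array([3.0, 3.0, 3.0])
print(r2_score(y_true, bad_pred))    # -1.5
```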
When should I use Ridge vs Lasso vs ElasticNet?
Use Ridge when you want to keep all features but reduce overfitting (multicollinear features). Use Lasso when you want automatic feature selection (it sets irrelevant feature coefficients to zero). Use ElasticNet when features are correlated and you want a balance of Ridge's stability and Lasso's sparsity.
Does LinearRegression need feature scaling?
Basic LinearRegression does not require feature scaling: rescaling a feature just rescales its coefficient, so predictions stay the same. However, Ridge, Lasso, and ElasticNet all require scaling because their penalties treat all coefficient magnitudes equally. Always scale features before regularized regression.
How do I handle categorical features in linear regression?
Convert categorical features to numeric using OneHotEncoder or pd.get_dummies() before fitting. Sklearn's LinearRegression only accepts numeric input. For pipelines, use ColumnTransformer to apply different transformations to numeric and categorical columns -- see our sklearn Pipeline guide for a full walkthrough.
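A minimal sketch of that pattern with ColumnTransformer. The DataFrame here is entirely made up (column names, cities, and prices are hypothetical):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Hypothetical mixed-type data: one numeric and one categorical feature
df = pd.DataFrame({
    "sqft": [850, 1200, 1500, 950, 2000, 1100],
    "city": ["austin", "dallas", "austin", "houston", "dallas", "houston"],
})
prices = [180, 260, 320, 200, 410, 230]  # target in $1000s

# Scale the numeric column; one-hot encode the categorical one
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["sqft"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
pipe = Pipeline([("prep", preprocess), ("model", LinearRegression())])
pipe.fit(df, prices)

preds = pipe.predict(df)
print(preds)
```

The handle_unknown="ignore" option keeps prediction from failing when a category unseen during training appears at inference time.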
What is the difference between MSE and RMSE?
MSE (Mean Squared Error) is the average of squared differences between actual and predicted values. RMSE (Root Mean Squared Error) is the square root of MSE. RMSE has the same units as the target variable, making it easier to interpret. For example, if predicting house prices in dollars, RMSE of 50,000 means average prediction error of about $50,000.
Conclusion
Sklearn's LinearRegression is the starting point for any regression task in Python. It is fast, interpretable, and effective when the underlying relationship is approximately linear. For real-world datasets with noise, multicollinearity, or many features, Ridge, Lasso, and ElasticNet provide regularization that improves generalization. Always evaluate with multiple metrics (R-squared, RMSE, MAE), use train-test splits to avoid overfitting, and check residual patterns to verify your model's assumptions hold. Build pipelines with StandardScaler and PolynomialFeatures to keep your workflow clean and reproducible.
Related Guides
- Sklearn Pipeline -- chain preprocessing and regression into a single deployable object
- Sklearn Confusion Matrix -- evaluation metrics for classification tasks
- Sklearn Random Forest -- a non-linear alternative when linear models underperform
- Pandas read_csv -- loading datasets from CSV files for regression analysis