Sklearn Linear Regression: A Complete Guide with Python Examples
You have a dataset with features and a continuous target variable. You want to predict an outcome: house prices, sales revenue, temperature trends. But you are not sure which method to use, or how to set it up correctly in Python. Choosing the wrong model, or skipping a necessary preprocessing step, leads to poor predictions and hours lost to debugging.
Linear regression is the most widely used algorithm for continuous-value prediction, but using it well takes more than calling .fit() and .predict(). You need to understand how the model works internally, where it breaks down, how to evaluate it properly, and when to switch to a regularized variant such as Ridge or Lasso. Skipping these steps tends to produce models that look great on the training set and fall apart on new data.
Scikit-learn provides LinearRegression along with a full ecosystem of preprocessing, evaluation, and regularization tools. This guide covers everything from basic usage to a production-ready regression pipeline.
What Is Linear Regression?
Linear regression models the relationship between one or more input features and a continuous output by fitting a straight line (or hyperplane) that minimizes the sum of squared residuals. For a model with n features, it takes the form:
y = b0 + b1*x1 + b2*x2 + ... + bn*xn

where b0 is the intercept (bias term), b1...bn are the coefficients (weights) for each feature, and y is the predicted value.
The model solves for the coefficients by minimizing the Ordinary Least Squares (OLS) cost function:
Cost = Sum of (y_actual - y_predicted)^2

This problem has a closed-form solution, so training is usually fast even on large datasets.
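The closed-form solution is easy to verify directly. A minimal sketch that solves the normal equations (X^T X) b = X^T y with NumPy and checks the result against LinearRegression, using the same toy salary data as the example in the next section:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: years of experience vs salary in $1000s
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.array([35., 40., 45., 55., 60., 62., 70., 75., 82., 90.])

# Closed-form OLS: prepend a column of ones for the intercept,
# then solve the normal equations (X^T X) b = X^T y
X_b = np.hstack([np.ones((len(X), 1)), X])
b = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

model = LinearRegression().fit(X, y)
print(b)                               # [intercept, slope]
print(model.intercept_, model.coef_)   # matches to numerical precision
```

Scikit-learn uses a least-squares solver rather than forming the normal equations explicitly (which is numerically safer), but both routes recover the same coefficients here.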
Simple Linear Regression with Sklearn
Simple linear regression uses a single feature to predict the target variable. Here is a complete example:
from sklearn.linear_model import LinearRegression
import numpy as np
# Sample data: years of experience vs salary (in thousands)
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([35, 40, 45, 55, 60, 62, 70, 75, 82, 90])
# Create and train the model
model = LinearRegression()
model.fit(X, y)
# Model parameters
print(f"Coefficient (slope): {model.coef_[0]:.4f}")
print(f"Intercept: {model.intercept_:.4f}")
# Predict salary for 12 years of experience
prediction = model.predict([[12]])
print(f"Predicted salary for 12 years: ${prediction[0]:.2f}k")
# Coefficient (slope): 5.9758
# Intercept: 28.5333
# Predicted salary for 12 years: $100.24k

Understanding the Output
| Attribute | Meaning | Example value |
|---|---|---|
| model.coef_ | Weight for each feature | [5.98], salary rises by about $5,976 per year |
| model.intercept_ | Predicted y when all features equal 0 | 28.53, a base salary of about $28,533 |
| model.score(X, y) | R-squared on the given data | 0.99 |
Multiple Linear Regression
With more than one feature, the model fits a hyperplane rather than a line:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
import numpy as np
# Load California housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target
feature_names = housing.feature_names
print(f"Features: {feature_names}")
print(f"Dataset shape: {X.shape}") # (20640, 8)
# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Print coefficients for each feature
print("\nFeature Coefficients:")
for name, coef in zip(feature_names, model.coef_):
print(f" {name:12s}: {coef:+.6f}")
print(f" {'Intercept':12s}: {model.intercept_:+.6f}")
# Evaluate
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"\nR² (train): {train_score:.4f}")
print(f"R² (test): {test_score:.4f}")

Model Evaluation: R-squared, MSE, and RMSE
R-squared alone does not tell the whole story. Evaluate regression models with a combination of metrics:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.datasets import fetch_california_housing
import numpy as np
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Calculate metrics
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
print(f"R² Score: {r2:.4f}")
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
# R² Score: 0.5758
# MSE: 0.5559
# RMSE: 0.7456
# MAE: 0.5332

Understanding the Evaluation Metrics
| Metric | Formula | Range | Interpretation |
|---|---|---|---|
| R-squared (R²) | 1 - (SS_res / SS_tot) | (-inf, 1] | Proportion of variance explained. 1.0 = perfect fit, 0 = no better than predicting the mean |
| MSE | mean((y - y_pred)²) | [0, inf) | Average squared error. Penalizes large errors more heavily |
| RMSE | sqrt(MSE) | [0, inf) | Same units as the target, more intuitive than MSE |
| MAE | mean(\|y - y_pred\|) | [0, inf) | Average absolute error. More robust to outliers |
A low R² does not automatically mean a bad model. For noisy real-world data such as house prices, R² = 0.6 can be perfectly reasonable. Always compare RMSE against the scale of the target variable to judge whether the error is acceptable.
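To make that comparison concrete, here is a tiny sketch (with made-up numbers) that puts an RMSE in context of the target's spread:

```python
import numpy as np

# Hypothetical target values and an RMSE, purely illustrative
y_test = np.array([1.2, 2.5, 0.8, 3.1, 1.9, 2.2])  # e.g. prices in $100k units
rmse = 0.75

# Relate the error to the target's range and standard deviation
print(f"RMSE / target range: {rmse / (y_test.max() - y_test.min()):.2f}")
print(f"RMSE / target std:   {rmse / y_test.std():.2f}")
```

An RMSE close to (or above) the target's standard deviation means the model is barely beating a constant mean prediction, regardless of what R² alone suggests.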
Feature Scaling for Linear Regression
Plain LinearRegression does not require feature scaling, because it uses the closed-form OLS solution. Once regularization enters the picture, scaling becomes critical:
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
# Without scaling (fine for basic LinearRegression)
model_no_scale = LinearRegression()
model_no_scale.fit(X_train, y_train)
print(f"LinearRegression R² (no scaling): {model_no_scale.score(X_test, y_test):.4f}")
# With scaling via Pipeline (required for regularized models)
pipeline = Pipeline([
('scaler', StandardScaler()),
('ridge', Ridge(alpha=1.0))
])
pipeline.fit(X_train, y_train)
print(f"Ridge R² (with scaling): {pipeline.score(X_test, y_test):.4f}")

**Why scaling matters for regularization:** Ridge and Lasso apply the same form of penalty to every coefficient's magnitude. If one feature ranges from 0 to 1 and another from 0 to 100,000, the penalty affects their coefficients disproportionately. Scaling puts all features on a comparable footing so the penalty applies fairly to each of them.
Polynomial Features: Modeling Non-Linear Relationships
When the relationship between features and target is not linear, polynomial features can capture curves and interaction terms:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import numpy as np
# Generate non-linear data
np.random.seed(42)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 3 * X.ravel()**2 - 5 * X.ravel() + 10 + np.random.randn(200) * 15
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Linear model
linear = LinearRegression()
linear.fit(X_train, y_train)
print(f"Linear R²: {r2_score(y_test, linear.predict(X_test)):.4f}")
# Polynomial (degree 2) model
poly_pipeline = Pipeline([
('poly', PolynomialFeatures(degree=2, include_bias=False)),
('linear', LinearRegression())
])
poly_pipeline.fit(X_train, y_train)
print(f"Poly (d=2) R²: {r2_score(y_test, poly_pipeline.predict(X_test)):.4f}")
# Polynomial (degree 3) model
poly3_pipeline = Pipeline([
('poly', PolynomialFeatures(degree=3, include_bias=False)),
('linear', LinearRegression())
])
poly3_pipeline.fit(X_train, y_train)
print(f"Poly (d=3) R²: {r2_score(y_test, poly3_pipeline.predict(X_test)):.4f}")

**Warning:** High-degree polynomials overfit very quickly. Use cross-validation to pick the degree, and prefer combining polynomial features with regularization.
Regularization: Ridge, Lasso, and ElasticNet
When a model has many features or polynomial terms, regularization curbs overfitting by penalizing large coefficients.
Ridge Regression (L2 Penalty)
Ridge adds the sum of squared coefficients to the cost function. It shrinks coefficients toward zero but rarely sets any of them exactly to zero.
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.datasets import fetch_california_housing
import numpy as np
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
# Find best alpha with cross-validation
pipeline = Pipeline([
('scaler', StandardScaler()),
('ridge', Ridge())
])
param_grid = {'ridge__alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2')
grid.fit(X_train, y_train)
print(f"Best alpha: {grid.best_params_['ridge__alpha']}")
print(f"Best CV R²: {grid.best_score_:.4f}")
print(f"Test R²: {grid.score(X_test, y_test):.4f}")

Lasso Regression (L1 Penalty)
Lasso adds the sum of absolute coefficient values to the cost function. It can shrink some coefficients exactly to zero, which gives you automatic feature selection:
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.datasets import fetch_california_housing
import numpy as np
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
pipeline = Pipeline([
('scaler', StandardScaler()),
('lasso', Lasso(max_iter=10000))
])
param_grid = {'lasso__alpha': [0.001, 0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2')
grid.fit(X_train, y_train)
print(f"Best alpha: {grid.best_params_['lasso__alpha']}")
print(f"Test R²: {grid.score(X_test, y_test):.4f}")
# Show which features were selected (non-zero coefficients)
lasso_model = grid.best_estimator_.named_steps['lasso']
feature_names = housing.feature_names
for name, coef in zip(feature_names, lasso_model.coef_):
status = "KEPT" if abs(coef) > 1e-6 else "DROPPED"
print(f" {name:12s}: {coef:+.6f} [{status}]")

ElasticNet (L1 + L2 Penalty)
ElasticNet combines the Ridge and Lasso penalties. l1_ratio controls the mix: 0 = pure Ridge, 1 = pure Lasso.
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
pipeline = Pipeline([
('scaler', StandardScaler()),
('elasticnet', ElasticNet(max_iter=10000))
])
param_grid = {
'elasticnet__alpha': [0.01, 0.1, 1.0],
'elasticnet__l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]
}
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2')
grid.fit(X_train, y_train)
print(f"Best alpha: {grid.best_params_['elasticnet__alpha']}")
print(f"Best l1_ratio: {grid.best_params_['elasticnet__l1_ratio']}")
print(f"Test R²: {grid.score(X_test, y_test):.4f}")

Comparison: LinearRegression vs Ridge vs Lasso vs ElasticNet
| Model | Penalty | Feature Selection | When to Use | Scaling Required |
|---|---|---|---|---|
| LinearRegression | None | No | Few features, no multicollinearity, decent signal-to-noise ratio | No |
| Ridge | L2 (squared) | No (shrinks toward 0) | Many correlated features that you want to keep | Yes |
| Lasso | L1 (absolute) | Yes (coefficients can reach 0) | Many features, want automatic feature selection | Yes |
| ElasticNet | L1 + L2 | Yes (partial) | Correlated features plus some degree of selection | Yes |
How to Choose the Right Model
Start with LinearRegression as a baseline. If the model overfits (a clear gap between train and test R-squared), try Ridge first. If you suspect many irrelevant features, try Lasso. If features are strongly correlated and you still want selection, try ElasticNet. In every case, compare the candidates with cross-validation.
Assumptions of Linear Regression
Linear regression tends to give reliable results when the following assumptions hold:
- Linearity: the relationship between features and target is linear (or can be linearized by a transformation).
- Independence: observations are independent of each other. This breaks down in time-series data when autocorrelation is left unhandled.
- Homoscedasticity: residual variance stays constant across the range of predicted values.
- Normality of residuals: residuals are approximately normally distributed. This matters more for confidence intervals and hypothesis tests than for prediction accuracy.
- No multicollinearity: features should not be highly correlated with one another. Collinearity inflates coefficient variance and makes individual coefficients unreliable to interpret.
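The multicollinearity point can be quantified with variance inflation factors (VIF). A minimal sketch on synthetic data, computing VIF with scikit-learn alone by regressing each feature on the others (a VIF above roughly 5-10 is a common warning sign):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with deliberate collinearity: x2 is nearly a copy of x0
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = rng.normal(size=500)
x2 = x0 + rng.normal(scale=0.1, size=500)  # highly correlated with x0
X = np.column_stack([x0, x1, x2])

# VIF for feature j: 1 / (1 - R²) from regressing feature j on the others
for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    print(f"x{j} VIF: {1.0 / (1.0 - r2):.2f}")
```

Here x0 and x2 get large VIFs while x1 stays near 1, flagging exactly the pair of features that would make individual coefficients unstable.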
Checking the Assumptions in Code
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
import numpy as np
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
residuals = y_test - y_pred
# Check residual statistics
print(f"Residual mean: {residuals.mean():.6f}") # Should be near 0
print(f"Residual std: {residuals.std():.4f}")
print(f"Residual skewness: {float(np.mean((residuals - residuals.mean())**3) / residuals.std()**3):.4f}")
# Check for multicollinearity (correlation matrix)
corr_matrix = np.corrcoef(X_train, rowvar=False)
print(f"\nMax feature correlation: {np.max(np.abs(corr_matrix - np.eye(corr_matrix.shape[0]))):.4f}")

A Complete Pipeline: A Real-World Regression Workflow
Here is a production-style pipeline that ties together preprocessing, feature engineering, and model comparison:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_california_housing
import numpy as np
# Load data
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
# Define models to compare
models = {
'LinearRegression': Pipeline([
('scaler', StandardScaler()),
('model', LinearRegression())
]),
'Ridge (alpha=1)': Pipeline([
('scaler', StandardScaler()),
('model', Ridge(alpha=1.0))
]),
'Lasso (alpha=0.01)': Pipeline([
('scaler', StandardScaler()),
('model', Lasso(alpha=0.01, max_iter=10000))
]),
'Poly(2) + Ridge': Pipeline([
('poly', PolynomialFeatures(degree=2, include_bias=False)),
('scaler', StandardScaler()),
('model', Ridge(alpha=10.0))
])
}
# Evaluate all models
print(f"{'Model':<25} {'CV R² (mean)':>12} {'CV R² (std)':>12} {'Test R²':>10}")
print("-" * 62)
for name, pipeline in models.items():
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='r2')
pipeline.fit(X_train, y_train)
test_r2 = pipeline.score(X_test, y_test)
print(f"{name:<25} {cv_scores.mean():>12.4f} {cv_scores.std():>12.4f} {test_r2:>10.4f}")

Exploring Regression Results with PyGWalker
Training the model is only half the job; understanding the prediction patterns matters just as much. PyGWalker gives you an interactive drag-and-drop interface inside Jupyter for exploring residuals, feature importance, predicted-versus-actual relationships, and more:
import pandas as pd
import pygwalker as pyg
# Build a results DataFrame
results = pd.DataFrame(X_test, columns=housing.feature_names)  # use the test split, not a tail slice
results['actual'] = y_test
results['predicted'] = y_pred
results['residual'] = y_test - y_pred
results['abs_error'] = np.abs(y_test - y_pred)
# Launch interactive exploration
walker = pyg.walk(results)

Drag features onto the axes, color-encode points by residual magnitude, and spot the data segments where the model underperforms, all without writing plotting code.
If you iterate on experiments in Jupyter, RunCell provides an AI agent that helps you test different feature combinations, hyperparameters, and preprocessing steps without rewriting cells by hand.
FAQ
What is LinearRegression in sklearn?
sklearn.linear_model.LinearRegression is an ordinary least squares (OLS) regression model. It fits a linear equation by minimizing the sum of squared differences between actual and predicted values. It is the most basic and most interpretable regression model in scikit-learn.
How do I interpret the R-squared score?
R-squared is the proportion of the target's variance that the model explains. An R-squared of 0.80 means 80% of the variance is explained. 1.0 is a perfect fit, 0.0 is no better than predicting the mean, and negative values mean the model is worse than simply predicting the mean.
When should I use Ridge, Lasso, or ElasticNet?
Use Ridge when you want to keep all features but reduce overfitting (strongly collinear features). Use Lasso when you need automatic feature selection (coefficients of irrelevant features pushed to zero). Use ElasticNet when features are correlated and you want a balance between Ridge's stability and Lasso's sparsity.
Does LinearRegression require feature scaling?
Basic LinearRegression does not require scaling, because the OLS solution is insensitive to feature scale. Ridge, Lasso, and ElasticNet do, because their penalties treat all coefficient magnitudes equally. Always scale features before fitting a regularized regression.
How does linear regression handle categorical features?
Convert categorical features to numbers with OneHotEncoder or pd.get_dummies() before fitting. Sklearn's LinearRegression accepts only numeric input. Inside a pipeline, use ColumnTransformer to apply different transformations to numeric and categorical columns.
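A minimal sketch of that pattern (the column names and data here are hypothetical):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical housing data with one categorical column
df = pd.DataFrame({
    "sqft": [850, 1200, 1500, 2000, 950, 1750],
    "city": ["austin", "dallas", "austin", "houston", "dallas", "houston"],
    "price": [200, 260, 310, 400, 215, 360],  # in $1000s
})

# Scale the numeric column, one-hot encode the categorical one
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["sqft"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

pipe = Pipeline([("prep", preprocess), ("model", LinearRegression())])
pipe.fit(df[["sqft", "city"]], df["price"])
print(pipe.predict(pd.DataFrame({"sqft": [1100], "city": ["austin"]})))
```

handle_unknown="ignore" keeps prediction from failing when a category unseen during training shows up; it simply encodes it as all zeros.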
What is the difference between MSE and RMSE?
MSE (Mean Squared Error) is the average of the squared differences between predictions and actual values; RMSE (Root Mean Squared Error) is its square root. RMSE is in the same units as the target, so it is easier to interpret. When predicting house prices, an RMSE of 50,000 means the typical prediction error is around $50,000.
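A quick sketch with made-up numbers showing the relationship between the two:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([250., 300., 180., 410.])  # hypothetical prices in $1000s
y_pred = np.array([240., 320., 200., 390.])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
print(f"MSE:  {mse:.1f}")   # 325.0, in squared units
print(f"RMSE: {rmse:.1f}")  # 18.0, same units as the target ($1000s)
```

The individual errors here are 10, -20, -20, and 20, so MSE averages their squares while RMSE brings the result back to the target's own scale.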
Summary
Sklearn's LinearRegression is the starting point for any regression task in Python. It trains fast, is easy to interpret, and works well when the underlying relationship is roughly linear. For real-world datasets with noise, collinearity, or many features, the regularization offered by Ridge, Lasso, and ElasticNet usually improves generalization. Evaluate with multiple metrics (R-squared, RMSE, MAE), use a train-test split to guard against overfitting, and inspect residual patterns to validate the model's assumptions. Building pipelines with StandardScaler and PolynomialFeatures keeps the workflow clean and reproducible.