
Sklearn Linear Regression: A Complete Guide with Python Examples


You have a dataset with features and a continuous target variable. You want to predict an outcome: house prices, sales revenue, temperature trends. But you aren't sure which method to use, or how to set it up correctly in Python. Picking the wrong model or skipping a required preprocessing step leads to poor predictions and hours lost to debugging.

Linear regression is the most widely used algorithm for continuous-value prediction, but using it well takes more than calling .fit() and .predict(). You need to understand how the model works internally, where it breaks down, how to evaluate it properly, and when to switch to a regularized variant like Ridge or Lasso. Skip those steps and you often end up deploying a model that looks great on the training set and collapses on new data.

Scikit-learn provides LinearRegression along with a full ecosystem of preprocessing, evaluation, and regularization tools. This guide covers everything from basic usage to production-ready regression pipelines.


What Is Linear Regression?

Linear regression models the relationship between one or more input features and a continuous output by fitting a straight line (or hyperplane) that minimizes the sum of squared residuals. For a model with n features, the form is:

y = b0 + b1*x1 + b2*x2 + ... + bn*xn

where b0 is the intercept (bias term), b1...bn are the coefficients (weights) for each feature, and y is the predicted value.

The model solves for the coefficients by minimizing the Ordinary Least Squares (OLS) cost function:

Cost = Sum of (y_actual - y_predicted)^2

Because this problem has a closed-form solution, training is usually fast even on large datasets.
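That closed-form solution is the normal equation, b = (X^T X)^-1 X^T y. As a quick sanity check, here is a minimal NumPy sketch on made-up data:

```python
import numpy as np

# Toy data: design matrix with a leading column of ones for the intercept
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([3.1, 5.0, 6.9, 9.2])

# Normal equation: b = (X^T X)^-1 X^T y
# (np.linalg.lstsq is the numerically safer choice in practice)
b = np.linalg.inv(X.T @ X) @ X.T @ y
print(f"intercept = {b[0]:.3f}, slope = {b[1]:.3f}")
# intercept = 1.000, slope = 2.020
```

LinearRegression solves the same least-squares problem internally (via a stable decomposition rather than an explicit matrix inverse).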

Simple Linear Regression with Sklearn

Simple linear regression uses a single feature to predict the target. Here is a complete example:

from sklearn.linear_model import LinearRegression
import numpy as np
 
# Sample data: years of experience vs salary (in thousands)
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([35, 40, 45, 55, 60, 62, 70, 75, 82, 90])
 
# Create and train the model
model = LinearRegression()
model.fit(X, y)
 
# Model parameters
print(f"Coefficient (slope): {model.coef_[0]:.4f}")
print(f"Intercept: {model.intercept_:.4f}")
 
# Predict salary for 12 years of experience
prediction = model.predict([[12]])
print(f"Predicted salary for 12 years: ${prediction[0]:.2f}k")
# Coefficient (slope): 5.9758
# Intercept: 28.5333
# Predicted salary for 12 years: $100.24k

Understanding the Output

| Attribute | Meaning | Example value |
| --- | --- | --- |
| model.coef_ | Weight for each feature | [5.98]: salary rises about $5,980 per year |
| model.intercept_ | Predicted y when all features are 0 | 28.53: base salary of about $28,530 |
| model.score(X, y) | R-squared on the given data | 0.99 |

Multiple Linear Regression

With multiple features, the model fits a hyperplane instead of a line:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
import numpy as np
 
# Load California housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target
feature_names = housing.feature_names
 
print(f"Features: {feature_names}")
print(f"Dataset shape: {X.shape}")  # (20640, 8)
 
# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
 
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
 
# Print coefficients for each feature
print("\nFeature Coefficients:")
for name, coef in zip(feature_names, model.coef_):
    print(f"  {name:12s}: {coef:+.6f}")
print(f"  {'Intercept':12s}: {model.intercept_:+.6f}")
 
# Evaluate
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"\nR² (train): {train_score:.4f}")
print(f"R² (test):  {test_score:.4f}")

Model Evaluation: R-squared, MSE, and RMSE

R-squared alone does not tell the whole story. Evaluate regression models with several metrics together:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.datasets import fetch_california_housing
import numpy as np
 
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)
 
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
 
# Calculate metrics
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
 
print(f"R² Score:  {r2:.4f}")
print(f"MSE:       {mse:.4f}")
print(f"RMSE:      {rmse:.4f}")
print(f"MAE:       {mae:.4f}")
# R² Score:  0.5758
# MSE:       0.5559
# RMSE:      0.7456
# MAE:       0.5332

Evaluation Metrics Explained

| Metric | Formula | Range | Interpretation |
| --- | --- | --- | --- |
| R-squared (R²) | 1 - (SS_res / SS_tot) | (-inf, 1] | Proportion of variance explained. 1.0 = perfect fit, 0 = no better than predicting the mean |
| MSE | mean((y - y_pred)²) | [0, inf) | Average squared error. Penalizes large errors more heavily |
| RMSE | sqrt(MSE) | [0, inf) | Same units as the target, more interpretable than MSE |
| MAE | mean(\|y - y_pred\|) | [0, inf) | Average absolute error. More robust to outliers |

A low R² does not automatically mean a bad model. For noisy real-world data such as housing prices, R² = 0.6 may be perfectly reasonable. Always compare the RMSE against the scale of your target variable to judge whether the error is acceptable.
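One quick sanity check along those lines is to compare the RMSE to the target's standard deviation. The numbers below are invented for illustration; the ratio RMSE/std is what matters, and (with the population std, computed on the same set) R² = 1 - (RMSE/std)²:

```python
import numpy as np

# Hypothetical targets and predictions (e.g. prices in units of $100k)
y_true = np.array([2.1, 3.5, 1.8, 4.2, 2.9])
y_pred = np.array([2.3, 3.1, 2.0, 4.0, 3.2])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
spread = y_true.std()  # population std, matching the identity above

# A ratio well below 1.0 means the model beats a constant mean prediction
print(f"RMSE = {rmse:.3f}, target std = {spread:.3f}, ratio = {rmse / spread:.2f}")
```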

Feature Scaling for Linear Regression

Plain LinearRegression does not require feature scaling because it uses the closed-form OLS solution. Scaling becomes critical once you add regularization:

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_california_housing
 
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)
 
# Without scaling (fine for basic LinearRegression)
model_no_scale = LinearRegression()
model_no_scale.fit(X_train, y_train)
print(f"LinearRegression R² (no scaling): {model_no_scale.score(X_test, y_test):.4f}")
 
# With scaling via Pipeline (required for regularized models)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(alpha=1.0))
])
pipeline.fit(X_train, y_train)
print(f"Ridge R² (with scaling):          {pipeline.score(X_test, y_test):.4f}")

**Why scaling matters for regularization:** Ridge and Lasso apply the same form of penalty to every coefficient's magnitude. If one feature ranges from 0 to 1 and another from 0 to 100,000, the penalty affects their coefficients disproportionately. Scaling puts all features on the same scale so the penalty applies fairly to each.
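To see the effect concretely, here is a small synthetic sketch (data and numbers invented): two equally informative features on wildly different scales, fit with Ridge with and without standardization:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x_small = rng.uniform(0, 1, 200)          # range ~0-1
x_large = rng.uniform(0, 100_000, 200)    # range ~0-100,000
# Both features carry the same signal strength
y = 2.0 * x_small + 2.0 * (x_large / 100_000) + rng.normal(0, 0.1, 200)
X = np.column_stack([x_small, x_large])

# Unscaled: the wide-range feature only needs a tiny coefficient,
# so the L2 penalty barely touches it while shrinking the other one
ridge_raw = Ridge(alpha=10.0).fit(X, y)
print("raw coefs:   ", ridge_raw.coef_)

# Scaled: both coefficients are comparable and penalized evenly
X_std = StandardScaler().fit_transform(X)
ridge_std = Ridge(alpha=10.0).fit(X_std, y)
print("scaled coefs:", ridge_std.coef_)
```

In the unscaled fit, the first coefficient is visibly shrunk below its true value of 2.0 while the second (on the order of 2e-5) is effectively unpenalized; after standardization, the two coefficients come out nearly equal.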

Polynomial Features: Modeling Non-Linear Relationships

When the relationship between features and target is not linear, polynomial features can capture curves and interaction terms:

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import numpy as np
 
# Generate non-linear data
np.random.seed(42)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 3 * X.ravel()**2 - 5 * X.ravel() + 10 + np.random.randn(200) * 15
 
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
 
# Linear model
linear = LinearRegression()
linear.fit(X_train, y_train)
print(f"Linear R²: {r2_score(y_test, linear.predict(X_test)):.4f}")
 
# Polynomial (degree 2) model
poly_pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('linear', LinearRegression())
])
poly_pipeline.fit(X_train, y_train)
print(f"Poly (d=2) R²: {r2_score(y_test, poly_pipeline.predict(X_test)):.4f}")
 
# Polynomial (degree 3) model
poly3_pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=3, include_bias=False)),
    ('linear', LinearRegression())
])
poly3_pipeline.fit(X_train, y_train)
print(f"Poly (d=3) R²: {r2_score(y_test, poly3_pipeline.predict(X_test)):.4f}")

**Warning:** High-degree polynomials overfit quickly. Use cross-validation to choose the degree, and prefer pairing polynomial models with regularization.
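Following that advice, the degree itself can be treated as a hyperparameter and chosen with cross-validation. A sketch on the same kind of quadratic data as above (note the shuffled folds: since X is generated in sorted order, unshuffled folds would turn each split into an extrapolation test):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(42)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 3 * X.ravel()**2 - 5 * X.ravel() + 10 + np.random.randn(200) * 15

pipe = Pipeline([
    ('poly', PolynomialFeatures(include_bias=False)),
    ('linear', LinearRegression())
])

# Shuffle folds because the rows are sorted by X
cv = KFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(pipe, {'poly__degree': [1, 2, 3, 4, 5]}, cv=cv, scoring='r2')
grid.fit(X, y)
print(f"Best degree: {grid.best_params_['poly__degree']}")
print(f"Best CV R²:  {grid.best_score_:.4f}")
```

Because the true relationship is quadratic, degree 1 scores clearly worse while degrees 2 and above score about the same; cross-validation then favors the simplest degree that fits.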

Regularization: Ridge, Lasso, and ElasticNet

When a model has many features or polynomial terms, regularization curbs overfitting by penalizing large coefficients.

Ridge Regression (L2 Penalty)

Ridge adds the sum of squared coefficients to the cost function. It shrinks coefficients toward zero but rarely sets them exactly to zero.

from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.datasets import fetch_california_housing
import numpy as np
 
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)
 
# Find best alpha with cross-validation
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge())
])
 
param_grid = {'ridge__alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2')
grid.fit(X_train, y_train)
 
print(f"Best alpha: {grid.best_params_['ridge__alpha']}")
print(f"Best CV R²: {grid.best_score_:.4f}")
print(f"Test R²:    {grid.score(X_test, y_test):.4f}")

Lasso Regression (L1 Penalty)

Lasso adds the sum of absolute coefficient values to the cost function. It can drive some coefficients exactly to zero, which acts as automatic feature selection:

from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.datasets import fetch_california_housing
import numpy as np
 
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)
 
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso', Lasso(max_iter=10000))
])
 
param_grid = {'lasso__alpha': [0.001, 0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2')
grid.fit(X_train, y_train)
 
print(f"Best alpha: {grid.best_params_['lasso__alpha']}")
print(f"Test R²:    {grid.score(X_test, y_test):.4f}")
 
# Show which features were selected (non-zero coefficients)
lasso_model = grid.best_estimator_.named_steps['lasso']
feature_names = housing.feature_names
for name, coef in zip(feature_names, lasso_model.coef_):
    status = "KEPT" if abs(coef) > 1e-6 else "DROPPED"
    print(f"  {name:12s}: {coef:+.6f}  [{status}]")

ElasticNet (L1 + L2 Penalty)

ElasticNet combines the Ridge and Lasso penalties. The l1_ratio parameter controls the mix: 0 = pure Ridge, 1 = pure Lasso.

from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.datasets import fetch_california_housing
 
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)
 
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('elasticnet', ElasticNet(max_iter=10000))
])
 
param_grid = {
    'elasticnet__alpha': [0.01, 0.1, 1.0],
    'elasticnet__l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]
}
 
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2')
grid.fit(X_train, y_train)
 
print(f"Best alpha:    {grid.best_params_['elasticnet__alpha']}")
print(f"Best l1_ratio: {grid.best_params_['elasticnet__l1_ratio']}")
print(f"Test R²:       {grid.score(X_test, y_test):.4f}")

Comparison: LinearRegression vs Ridge vs Lasso vs ElasticNet

| Model | Penalty | Feature Selection | When to use | Needs scaling |
| --- | --- | --- | --- | --- |
| LinearRegression | None | No | Few features, no multicollinearity, decent signal-to-noise | No |
| Ridge | L2 (squared) | No (shrinks toward 0) | Many correlated features you want to keep | Yes |
| Lasso | L1 (absolute value) | Yes (coefficients can become 0) | Many features, automatic selection desired | Yes |
| ElasticNet | L1 + L2 | Yes (partial) | Correlated features plus some selection | Yes |

How to Choose the Right Model

Start with LinearRegression as a baseline. If the model overfits (a clear gap between train and test R-squared), try Ridge first. If you suspect many irrelevant features, try Lasso. If features are strongly correlated and you still want selection, try ElasticNet. In every case, compare the candidates with cross-validation.

Assumptions of Linear Regression

Linear regression tends to give more reliable results when the following assumptions hold:

  1. Linearity: the relationship between features and target is linear (or can be linearized with a transformation).
  2. Independence: observations are independent of each other. Unhandled autocorrelation in time-series data violates this.
  3. Homoscedasticity: residual variance stays constant across levels of the predicted value.
  4. Normality of residuals: residuals are approximately normally distributed. This matters more for confidence intervals and hypothesis tests than for prediction accuracy.
  5. No multicollinearity: features should not be highly correlated with each other. Collinearity inflates coefficient variance and makes individual coefficients unreliable to interpret.

Checking Assumptions in Code

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
import numpy as np
 
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)
 
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
residuals = y_test - y_pred
 
# Check residual statistics
print(f"Residual mean:     {residuals.mean():.6f}")   # Should be near 0
print(f"Residual std:      {residuals.std():.4f}")
print(f"Residual skewness: {float(np.mean((residuals - residuals.mean())**3) / residuals.std()**3):.4f}")
 
# Check for multicollinearity (correlation matrix)
corr_matrix = np.corrcoef(X_train, rowvar=False)
print(f"\nMax feature correlation: {np.max(np.abs(corr_matrix - np.eye(corr_matrix.shape[0]))):.4f}")
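Pairwise correlations can miss collinearity that involves three or more features. The variance inflation factor (VIF) catches it by regressing each feature on all the others. Here is a sketch on invented synthetic data where one feature is nearly the sum of two others (the common VIF > 10 cutoff is a rule of thumb, not a hard threshold):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = rng.normal(size=500)
x2 = x0 + x1 + rng.normal(0, 0.05, 500)   # nearly a linear combination of x0 and x1
X = np.column_stack([x0, x1, x2])

def vif(X, j):
    """Variance inflation factor: 1 / (1 - R²) of feature j regressed on the rest."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

for j in range(X.shape[1]):
    print(f"feature {j}: VIF = {vif(X, j):.1f}")
```

Here the pairwise correlation between x0 and x1 is near zero, yet all three VIFs come out very large, because the collinearity only appears when the features are combined.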

Complete Pipeline: A Real-World Regression Workflow

Here is a production-style pipeline that combines preprocessing, feature engineering, and model comparison:

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_california_housing
import numpy as np
 
# Load data
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)
 
# Define models to compare
models = {
    'LinearRegression': Pipeline([
        ('scaler', StandardScaler()),
        ('model', LinearRegression())
    ]),
    'Ridge (alpha=1)': Pipeline([
        ('scaler', StandardScaler()),
        ('model', Ridge(alpha=1.0))
    ]),
    'Lasso (alpha=0.01)': Pipeline([
        ('scaler', StandardScaler()),
        ('model', Lasso(alpha=0.01, max_iter=10000))
    ]),
    'Poly(2) + Ridge': Pipeline([
        ('poly', PolynomialFeatures(degree=2, include_bias=False)),
        ('scaler', StandardScaler()),
        ('model', Ridge(alpha=10.0))
    ])
}
 
# Evaluate all models
print(f"{'Model':<25} {'CV R² (mean)':>12} {'CV R² (std)':>12} {'Test R²':>10}")
print("-" * 62)
 
for name, pipeline in models.items():
    cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='r2')
    pipeline.fit(X_train, y_train)
    test_r2 = pipeline.score(X_test, y_test)
    print(f"{name:<25} {cv_scores.mean():>12.4f} {cv_scores.std():>12.4f} {test_r2:>10.4f}")

Exploring Regression Results with PyGWalker

After training a model, understanding its prediction patterns matters just as much. PyGWalker gives you an interactive drag-and-drop interface inside Jupyter for exploring residuals, feature importance, predicted-vs-actual relationships, and more:

import pandas as pd
import pygwalker as pyg
 
# Build a results DataFrame (X_test, y_test, y_pred come from the evaluation section;
# note that train_test_split shuffles, so slicing housing.data would NOT match y_test)
results = pd.DataFrame(X_test, columns=housing.feature_names)
results['actual'] = y_test
results['predicted'] = y_pred
results['residual'] = y_test - y_pred
results['abs_error'] = np.abs(y_test - y_pred)
 
# Launch interactive exploration
walker = pyg.walk(results)

Drag features onto the axes, color-encode by residual magnitude, and spot the data segments where the model underperforms, all without writing plotting code.

If you iterate inside Jupyter, RunCell provides an AI agent that helps you test different feature combinations, hyperparameters, and preprocessing steps without rewriting cells by hand.

FAQ

What is LinearRegression in sklearn?

sklearn.linear_model.LinearRegression is an ordinary least squares (OLS) regression model. It fits a linear equation by minimizing the sum of squared differences between actual and predicted values. It is the most basic and most interpretable regression model in scikit-learn.

How do I interpret the R-squared score?

R-squared is the proportion of the target's variance that the model explains. An R-squared of 0.80 means 80% of the variance is explained. 1.0 is a perfect fit, 0.0 is no better than predicting the mean, and negative values mean the model is worse than simply predicting the mean.
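The definition is easy to verify by hand. A tiny worked example with invented numbers, cross-checked against sklearn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.3, 8.8])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares = 0.18
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares = 20.0
r2 = 1 - ss_res / ss_tot
print(f"R² = {r2:.4f}")                          # R² = 0.9910
print(f"sklearn: {r2_score(y_true, y_pred):.4f}")  # matches the manual value
```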

When should I use Ridge, Lasso, or ElasticNet?

Use Ridge when you want to keep all features but reduce overfitting (strong collinearity). Use Lasso when you want automatic feature selection (irrelevant coefficients driven to zero). Use ElasticNet when features are correlated and you want a balance between Ridge's stability and Lasso's sparsity.

Does LinearRegression need feature scaling?

Basic LinearRegression does not require scaling because the OLS solution is scale-invariant. Ridge, Lasso, and ElasticNet do, because their penalties treat all coefficient magnitudes equally. Always scale features before regularized regression.

How does linear regression handle categorical features?

Convert categorical features to numbers with OneHotEncoder or pd.get_dummies() before fitting. Sklearn's LinearRegression only accepts numeric input. In a pipeline, use ColumnTransformer to apply different transformations to numeric and categorical columns.
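A minimal sketch of that ColumnTransformer setup, with invented column names and data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mixed-type data: one numeric column, one categorical column
df = pd.DataFrame({
    'sqft': [850, 1200, 1500, 950, 1800, 1100],
    'city': ['A', 'B', 'A', 'C', 'B', 'C'],
})
y = [210, 340, 400, 230, 500, 290]  # price in $k

# Scale the numeric column, one-hot encode the categorical one
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['sqft']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city']),
])

model = Pipeline([('prep', preprocess), ('reg', LinearRegression())])
model.fit(df, y)
print(f"Train R²: {model.score(df, y):.3f}")
```

handle_unknown='ignore' keeps prediction from failing when an unseen category appears at inference time; the unknown category simply encodes as all zeros.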

What's the difference between MSE and RMSE?

MSE (Mean Squared Error) is the average of the squared differences between predicted and actual values; RMSE (Root Mean Squared Error) is its square root. RMSE is in the same units as the target, which makes it easier to interpret. For house price prediction, an RMSE of 50,000 means the typical prediction error is about $50,000.

Summary

Sklearn's LinearRegression is the starting point for any regression task in Python. It trains fast, is easy to interpret, and works well when the true relationship is roughly linear. For real-world datasets with noise, collinearity, or many features, the regularization in Ridge, Lasso, and ElasticNet usually improves generalization. Evaluate with multiple metrics (R-squared, RMSE, MAE), use a train-test split to avoid overfitting, and inspect residual patterns to verify the model's assumptions. Building pipelines with StandardScaler and PolynomialFeatures keeps your workflow clean and reproducible.
