
Sklearn Linear Regression: A Complete Guide with Python Examples


You have a dataset with features and a continuous target variable. You want to predict an outcome: house prices, sales revenue, temperature trends. But you aren't sure which method to use, or how to set it up correctly in Python. Picking the wrong model or skipping a required preprocessing step leads to poor predictions and hours lost to debugging.

Linear regression is the most widely used algorithm for continuous-value prediction, but using it well takes more than calling .fit() and .predict(). You need to understand how the model works internally, where it breaks down, how to evaluate it properly, and when to switch to a regularized variant like Ridge or Lasso. Skip those steps and you often end up deploying a model that looks great on the training set and collapses on new data.

Scikit-learn provides LinearRegression along with a full ecosystem of preprocessing, evaluation, and regularization tools. This guide covers everything from basic usage to production-ready regression pipelines.


What Is Linear Regression?

Linear regression models the relationship between one or more input features and a continuous output by fitting a straight line (or hyperplane) that minimizes the sum of squared residuals. For a model with n features, the form is:

y = b0 + b1*x1 + b2*x2 + ... + bn*xn

where b0 is the intercept (bias term), b1...bn are the coefficients (weights) for each feature, and y is the predicted value.

The model solves for the coefficients by minimizing the Ordinary Least Squares (OLS) cost function:

Cost = Sum of (y_actual - y_predicted)^2

Because this problem has a closed-form solution, training is usually fast even on large datasets.
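That closed-form solution is the normal equation, b = (X^T X)^-1 X^T y. As a quick sanity check, here is a minimal NumPy sketch on made-up data:

```python
import numpy as np

# Toy data: design matrix with a leading column of ones for the intercept
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([3.1, 5.0, 6.9, 9.2])

# Normal equation: b = (X^T X)^-1 X^T y
# (np.linalg.lstsq is the numerically safer choice in practice)
b = np.linalg.inv(X.T @ X) @ X.T @ y
print(f"intercept = {b[0]:.3f}, slope = {b[1]:.3f}")
# intercept = 1.000, slope = 2.020
```

LinearRegression solves the same least-squares problem internally (via a stable decomposition rather than an explicit matrix inverse).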

Simple Linear Regression with Sklearn

Simple linear regression uses a single feature to predict the target. Here is a complete example:

from sklearn.linear_model import LinearRegression
import numpy as np
 
# Sample data: years of experience vs salary (in thousands)
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([35, 40, 45, 55, 60, 62, 70, 75, 82, 90])
 
# Create and train the model
model = LinearRegression()
model.fit(X, y)
 
# Model parameters
print(f"Coefficient (slope): {model.coef_[0]:.4f}")
print(f"Intercept: {model.intercept_:.4f}")
 
# Predict salary for 12 years of experience
prediction = model.predict([[12]])
print(f"Predicted salary for 12 years: ${prediction[0]:.2f}k")
# Coefficient (slope): 5.9758
# Intercept: 28.5333
# Predicted salary for 12 years: $100.24k

Understanding the Output

| Attribute | Meaning | Example value |
| --- | --- | --- |
| model.coef_ | Weight for each feature | [5.98]: salary rises about $5,980 per year |
| model.intercept_ | Predicted y when all features are 0 | 28.53: base salary of about $28,530 |
| model.score(X, y) | R-squared on the given data | 0.99 |

Multiple Linear Regression

With multiple features, the model fits a hyperplane instead of a line:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
import numpy as np
 
# Load California housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target
feature_names = housing.feature_names
 
print(f"Features: {feature_names}")
print(f"Dataset shape: {X.shape}")  # (20640, 8)
 
# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
 
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
 
# Print coefficients for each feature
print("\nFeature Coefficients:")
for name, coef in zip(feature_names, model.coef_):
    print(f"  {name:12s}: {coef:+.6f}")
print(f"  {'Intercept':12s}: {model.intercept_:+.6f}")
 
# Evaluate
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"\nR² (train): {train_score:.4f}")
print(f"R² (test):  {test_score:.4f}")

Model Evaluation: R-squared, MSE, and RMSE

R-squared alone does not tell the whole story. Evaluate regression models with several metrics together:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.datasets import fetch_california_housing
import numpy as np
 
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)
 
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
 
# Calculate metrics
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
 
print(f"R² Score:  {r2:.4f}")
print(f"MSE:       {mse:.4f}")
print(f"RMSE:      {rmse:.4f}")
print(f"MAE:       {mae:.4f}")
# R² Score:  0.5758
# MSE:       0.5559
# RMSE:      0.7456
# MAE:       0.5332

Evaluation Metrics Explained

| Metric | Formula | Range | Interpretation |
| --- | --- | --- | --- |
| R-squared (R²) | 1 - (SS_res / SS_tot) | (-inf, 1] | Proportion of variance explained. 1.0 = perfect fit, 0 = no better than predicting the mean |
| MSE | mean((y - y_pred)²) | [0, inf) | Average squared error. Penalizes large errors more heavily |
| RMSE | sqrt(MSE) | [0, inf) | Same units as the target, more interpretable than MSE |
| MAE | mean(\|y - y_pred\|) | [0, inf) | Average absolute error. More robust to outliers |

A low R² does not automatically mean a bad model. For noisy real-world data such as housing prices, R² = 0.6 may be perfectly reasonable. Always compare the RMSE against the scale of your target variable to judge whether the error is acceptable.
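One quick sanity check along those lines is to compare the RMSE to the target's standard deviation. The numbers below are invented for illustration; the ratio RMSE/std is what matters, and (with the population std, computed on the same set) R² = 1 - (RMSE/std)²:

```python
import numpy as np

# Hypothetical targets and predictions (e.g. prices in units of $100k)
y_true = np.array([2.1, 3.5, 1.8, 4.2, 2.9])
y_pred = np.array([2.3, 3.1, 2.0, 4.0, 3.2])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
spread = y_true.std()  # population std, matching the identity above

# A ratio well below 1.0 means the model beats a constant mean prediction
print(f"RMSE = {rmse:.3f}, target std = {spread:.3f}, ratio = {rmse / spread:.2f}")
```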

Feature Scaling for Linear Regression

Plain LinearRegression does not require feature scaling because it uses the closed-form OLS solution. Scaling becomes critical once you add regularization:

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_california_housing
 
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)
 
# Without scaling (fine for basic LinearRegression)
model_no_scale = LinearRegression()
model_no_scale.fit(X_train, y_train)
print(f"LinearRegression R² (no scaling): {model_no_scale.score(X_test, y_test):.4f}")
 
# With scaling via Pipeline (required for regularized models)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(alpha=1.0))
])
pipeline.fit(X_train, y_train)
print(f"Ridge R² (with scaling):          {pipeline.score(X_test, y_test):.4f}")

**Why scaling matters for regularization:** Ridge and Lasso apply the same form of penalty to every coefficient's magnitude. If one feature ranges from 0 to 1 and another from 0 to 100,000, the penalty affects their coefficients disproportionately. Scaling puts all features on the same scale so the penalty applies fairly to each.
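To see the effect concretely, here is a small synthetic sketch (data and numbers invented): two equally informative features on wildly different scales, fit with Ridge with and without standardization:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x_small = rng.uniform(0, 1, 200)          # range ~0-1
x_large = rng.uniform(0, 100_000, 200)    # range ~0-100,000
# Both features carry the same signal strength
y = 2.0 * x_small + 2.0 * (x_large / 100_000) + rng.normal(0, 0.1, 200)
X = np.column_stack([x_small, x_large])

# Unscaled: the wide-range feature only needs a tiny coefficient,
# so the L2 penalty barely touches it while shrinking the other one
ridge_raw = Ridge(alpha=10.0).fit(X, y)
print("raw coefs:   ", ridge_raw.coef_)

# Scaled: both coefficients are comparable and penalized evenly
X_std = StandardScaler().fit_transform(X)
ridge_std = Ridge(alpha=10.0).fit(X_std, y)
print("scaled coefs:", ridge_std.coef_)
```

In the unscaled fit, the first coefficient is visibly shrunk below its true value of 2.0 while the second (on the order of 2e-5) is effectively unpenalized; after standardization, the two coefficients come out nearly equal.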

Polynomial Features: Modeling Non-Linear Relationships

When the relationship between features and target is not linear, polynomial features can capture curves and interaction terms:

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import numpy as np
 
# Generate non-linear data
np.random.seed(42)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 3 * X.ravel()**2 - 5 * X.ravel() + 10 + np.random.randn(200) * 15
 
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
 
# Linear model
linear = LinearRegression()
linear.fit(X_train, y_train)
print(f"Linear R²: {r2_score(y_test, linear.predict(X_test)):.4f}")
 
# Polynomial (degree 2) model
poly_pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('linear', LinearRegression())
])
poly_pipeline.fit(X_train, y_train)
print(f"Poly (d=2) R²: {r2_score(y_test, poly_pipeline.predict(X_test)):.4f}")
 
# Polynomial (degree 3) model
poly3_pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=3, include_bias=False)),
    ('linear', LinearRegression())
])
poly3_pipeline.fit(X_train, y_train)
print(f"Poly (d=3) R²: {r2_score(y_test, poly3_pipeline.predict(X_test)):.4f}")

**Warning:** High-degree polynomials overfit quickly. Use cross-validation to choose the degree, and prefer pairing polynomial models with regularization.
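Following that advice, the degree itself can be treated as a hyperparameter and chosen with cross-validation. A sketch on the same kind of quadratic data as above (note the shuffled folds: since X is generated in sorted order, unshuffled folds would turn each split into an extrapolation test):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(42)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 3 * X.ravel()**2 - 5 * X.ravel() + 10 + np.random.randn(200) * 15

pipe = Pipeline([
    ('poly', PolynomialFeatures(include_bias=False)),
    ('linear', LinearRegression())
])

# Shuffle folds because the rows are sorted by X
cv = KFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(pipe, {'poly__degree': [1, 2, 3, 4, 5]}, cv=cv, scoring='r2')
grid.fit(X, y)
print(f"Best degree: {grid.best_params_['poly__degree']}")
print(f"Best CV R²:  {grid.best_score_:.4f}")
```

Because the true relationship is quadratic, degree 1 scores clearly worse while degrees 2 and above score about the same; cross-validation then favors the simplest degree that fits.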

Regularization: Ridge, Lasso, and ElasticNet

When a model has many features or polynomial terms, regularization curbs overfitting by penalizing large coefficients.

Ridge Regression (L2 Penalty)

Ridge adds the sum of squared coefficients to the cost function. It shrinks coefficients toward zero but rarely sets them exactly to zero.

from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.datasets import fetch_california_housing
import numpy as np
 
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)
 
# Find best alpha with cross-validation
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge())
])
 
param_grid = {'ridge__alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2')
grid.fit(X_train, y_train)
 
print(f"Best alpha: {grid.best_params_['ridge__alpha']}")
print(f"Best CV R²: {grid.best_score_:.4f}")
print(f"Test R²:    {grid.score(X_test, y_test):.4f}")

Lasso Regression (L1 Penalty)

Lasso adds the sum of absolute coefficient values to the cost function. It can drive some coefficients exactly to zero, which acts as automatic feature selection:

from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.datasets import fetch_california_housing
import numpy as np
 
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)
 
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso', Lasso(max_iter=10000))
])
 
param_grid = {'lasso__alpha': [0.001, 0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2')
grid.fit(X_train, y_train)
 
print(f"Best alpha: {grid.best_params_['lasso__alpha']}")
print(f"Test R²:    {grid.score(X_test, y_test):.4f}")
 
# Show which features were selected (non-zero coefficients)
lasso_model = grid.best_estimator_.named_steps['lasso']
feature_names = housing.feature_names
for name, coef in zip(feature_names, lasso_model.coef_):
    status = "KEPT" if abs(coef) > 1e-6 else "DROPPED"
    print(f"  {name:12s}: {coef:+.6f}  [{status}]")

ElasticNet (L1 + L2 Penalty)

ElasticNet combines the Ridge and Lasso penalties. The l1_ratio parameter controls the mix: 0 = pure Ridge, 1 = pure Lasso.

from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.datasets import fetch_california_housing
 
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)
 
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('elasticnet', ElasticNet(max_iter=10000))
])
 
param_grid = {
    'elasticnet__alpha': [0.01, 0.1, 1.0],
    'elasticnet__l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]
}
 
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2')
grid.fit(X_train, y_train)
 
print(f"Best alpha:    {grid.best_params_['elasticnet__alpha']}")
print(f"Best l1_ratio: {grid.best_params_['elasticnet__l1_ratio']}")
print(f"Test R²:       {grid.score(X_test, y_test):.4f}")

Comparison: LinearRegression vs Ridge vs Lasso vs ElasticNet

| Model | Penalty | Feature Selection | When to use | Needs scaling |
| --- | --- | --- | --- | --- |
| LinearRegression | None | No | Few features, no multicollinearity, decent signal-to-noise | No |
| Ridge | L2 (squared) | No (shrinks toward 0) | Many correlated features you want to keep | Yes |
| Lasso | L1 (absolute value) | Yes (coefficients can become 0) | Many features, automatic selection desired | Yes |
| ElasticNet | L1 + L2 | Yes (partial) | Correlated features plus some selection | Yes |

How to Choose the Right Model

Start with LinearRegression as a baseline. If the model overfits (a clear gap between train and test R-squared), try Ridge first. If you suspect many irrelevant features, try Lasso. If features are strongly correlated and you still want selection, try ElasticNet. In every case, compare the candidates with cross-validation.

Assumptions of Linear Regression

Linear regression tends to give more reliable results when the following assumptions hold:

  1. Linearity: the relationship between features and target is linear (or can be linearized with a transformation).
  2. Independence: observations are independent of each other. Unhandled autocorrelation in time-series data violates this.
  3. Homoscedasticity: residual variance stays constant across levels of the predicted value.
  4. Normality of residuals: residuals are approximately normally distributed. This matters more for confidence intervals and hypothesis tests than for prediction accuracy.
  5. No multicollinearity: features should not be highly correlated with each other. Collinearity inflates coefficient variance and makes individual coefficients unreliable to interpret.

Checking Assumptions in Code

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
import numpy as np
 
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)
 
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
residuals = y_test - y_pred
 
# Check residual statistics
print(f"Residual mean:     {residuals.mean():.6f}")   # Should be near 0
print(f"Residual std:      {residuals.std():.4f}")
print(f"Residual skewness: {float(np.mean((residuals - residuals.mean())**3) / residuals.std()**3):.4f}")
 
# Check for multicollinearity (correlation matrix)
corr_matrix = np.corrcoef(X_train, rowvar=False)
print(f"\nMax feature correlation: {np.max(np.abs(corr_matrix - np.eye(corr_matrix.shape[0]))):.4f}")
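Pairwise correlations can miss collinearity that involves three or more features. The variance inflation factor (VIF) catches it by regressing each feature on all the others. Here is a sketch on invented synthetic data where one feature is nearly the sum of two others (the common VIF > 10 cutoff is a rule of thumb, not a hard threshold):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = rng.normal(size=500)
x2 = x0 + x1 + rng.normal(0, 0.05, 500)   # nearly a linear combination of x0 and x1
X = np.column_stack([x0, x1, x2])

def vif(X, j):
    """Variance inflation factor: 1 / (1 - R²) of feature j regressed on the rest."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

for j in range(X.shape[1]):
    print(f"feature {j}: VIF = {vif(X, j):.1f}")
```

Here the pairwise correlation between x0 and x1 is near zero, yet all three VIFs come out very large, because the collinearity only appears when the features are combined.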

Complete Pipeline: A Real-World Regression Workflow

Here is a production-style pipeline that combines preprocessing, feature engineering, and model comparison:

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_california_housing
import numpy as np
 
# Load data
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)
 
# Define models to compare
models = {
    'LinearRegression': Pipeline([
        ('scaler', StandardScaler()),
        ('model', LinearRegression())
    ]),
    'Ridge (alpha=1)': Pipeline([
        ('scaler', StandardScaler()),
        ('model', Ridge(alpha=1.0))
    ]),
    'Lasso (alpha=0.01)': Pipeline([
        ('scaler', StandardScaler()),
        ('model', Lasso(alpha=0.01, max_iter=10000))
    ]),
    'Poly(2) + Ridge': Pipeline([
        ('poly', PolynomialFeatures(degree=2, include_bias=False)),
        ('scaler', StandardScaler()),
        ('model', Ridge(alpha=10.0))
    ])
}
 
# Evaluate all models
print(f"{'Model':<25} {'CV R² (mean)':>12} {'CV R² (std)':>12} {'Test R²':>10}")
print("-" * 62)
 
for name, pipeline in models.items():
    cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='r2')
    pipeline.fit(X_train, y_train)
    test_r2 = pipeline.score(X_test, y_test)
    print(f"{name:<25} {cv_scores.mean():>12.4f} {cv_scores.std():>12.4f} {test_r2:>10.4f}")

Exploring Regression Results with PyGWalker

After training a model, understanding its prediction patterns matters just as much. PyGWalker gives you an interactive drag-and-drop interface inside Jupyter for exploring residuals, feature importance, predicted-vs-actual relationships, and more:

import pandas as pd
import pygwalker as pyg
 
# Build a results DataFrame (X_test, y_test, y_pred come from the evaluation section;
# note that train_test_split shuffles, so slicing housing.data would NOT match y_test)
results = pd.DataFrame(X_test, columns=housing.feature_names)
results['actual'] = y_test
results['predicted'] = y_pred
results['residual'] = y_test - y_pred
results['abs_error'] = np.abs(y_test - y_pred)
 
# Launch interactive exploration
walker = pyg.walk(results)

Drag features onto the axes, color-encode by residual magnitude, and spot the data segments where the model underperforms, all without writing plotting code.

If you iterate inside Jupyter, RunCell provides an AI agent that helps you test different feature combinations, hyperparameters, and preprocessing steps without rewriting cells by hand.

FAQ

What is LinearRegression in sklearn?

sklearn.linear_model.LinearRegression is an ordinary least squares (OLS) regression model. It fits a linear equation by minimizing the sum of squared differences between actual and predicted values. It is the most basic and most interpretable regression model in scikit-learn.

How do I interpret the R-squared score?

R-squared is the proportion of the target's variance that the model explains. An R-squared of 0.80 means 80% of the variance is explained. 1.0 is a perfect fit, 0.0 is no better than predicting the mean, and negative values mean the model is worse than simply predicting the mean.
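The definition is easy to verify by hand. A tiny worked example with invented numbers, cross-checked against sklearn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.3, 8.8])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares = 0.18
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares = 20.0
r2 = 1 - ss_res / ss_tot
print(f"R² = {r2:.4f}")                          # R² = 0.9910
print(f"sklearn: {r2_score(y_true, y_pred):.4f}")  # matches the manual value
```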

When should I use Ridge, Lasso, or ElasticNet?

Use Ridge when you want to keep all features but reduce overfitting (strong collinearity). Use Lasso when you want automatic feature selection (irrelevant coefficients driven to zero). Use ElasticNet when features are correlated and you want a balance between Ridge's stability and Lasso's sparsity.

Does LinearRegression need feature scaling?

Basic LinearRegression does not require scaling because the OLS solution is scale-invariant. Ridge, Lasso, and ElasticNet do, because their penalties treat all coefficient magnitudes equally. Always scale features before regularized regression.

How does linear regression handle categorical features?

Convert categorical features to numbers with OneHotEncoder or pd.get_dummies() before fitting. Sklearn's LinearRegression only accepts numeric input. In a pipeline, use ColumnTransformer to apply different transformations to numeric and categorical columns.
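A minimal sketch of that ColumnTransformer setup, with invented column names and data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mixed-type data: one numeric column, one categorical column
df = pd.DataFrame({
    'sqft': [850, 1200, 1500, 950, 1800, 1100],
    'city': ['A', 'B', 'A', 'C', 'B', 'C'],
})
y = [210, 340, 400, 230, 500, 290]  # price in $k

# Scale the numeric column, one-hot encode the categorical one
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['sqft']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city']),
])

model = Pipeline([('prep', preprocess), ('reg', LinearRegression())])
model.fit(df, y)
print(f"Train R²: {model.score(df, y):.3f}")
```

handle_unknown='ignore' keeps prediction from failing when an unseen category appears at inference time; the unknown category simply encodes as all zeros.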

What's the difference between MSE and RMSE?

MSE (Mean Squared Error) is the average of the squared differences between predicted and actual values; RMSE (Root Mean Squared Error) is its square root. RMSE is in the same units as the target, which makes it easier to interpret. For house price prediction, an RMSE of 50,000 means the typical prediction error is about $50,000.

Summary

Sklearn's LinearRegression is the starting point for any regression task in Python. It trains fast, is easy to interpret, and works well when the true relationship is roughly linear. For real-world datasets with noise, collinearity, or many features, the regularization in Ridge, Lasso, and ElasticNet usually improves generalization. Evaluate with multiple metrics (R-squared, RMSE, MAE), use a train-test split to avoid overfitting, and inspect residual patterns to verify the model's assumptions. Building pipelines with StandardScaler and PolynomialFeatures keeps your workflow clean and reproducible.
