Decomposing Model Prediction Error: Bias and Variance


Model error decomposition
Model error decomposition can be understood by analogy with variance decomposition in statistics: there, variance is attributed to different factors across experimental groups, and the same idea applies here, so we need to decide which variable the "experiment" controls. Since we want to measure how strongly a model is affected by its training set, the training set is that controlled variable. Split the available data into n training subsets D_1, ..., D_n; to rule out effects of subset size, all n subsets must contain the same number of samples and the same features, and the same algorithm must be used throughout (if linear regression, then linear regression everywhere; if logistic regression, then logistic regression everywhere). In short, everything is held fixed except the content of the training set. Each subset D_i yields one fitted model, and for the same input x the n models produce n predictions ŷ_1, ŷ_2, ..., ŷ_n. For a given algorithm, the average of these predictions over all data subsets represents the algorithm's prediction, and the influence of the training subsets shows up as fluctuation of the predictions around that average. This gives two definitions:

Definition 1 (the algorithm's prediction): the average prediction of the models, each trained with the same algorithm on one of the subsets,
$$\hat{y} = \frac{1}{n}\sum_{i=1}^{n}\hat{y}_i.$$

Definition 2 (variance): since everything except the training set is held fixed, the effect of changing the training set can be measured by the fluctuation of the predictions; this fluctuation is the variance. Note that it is the variance of the predictions across training sets at a fixed x, not the ordinary variance of the data:
$$\mathrm{var}(\hat{y} \mid x) = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - \hat{y})^2.$$
In practice, the training samples are observations of the true values. Temperature, for example, is recorded with a thermometer, and the measurement is itself only an estimate of the true value, so the process carries an observation error. This motivates two more definitions:

Definition 3 (true value): the actual value of the target being observed, denoted y.

Definition 4 (observed value): the observation of the true value, denoted y_D. If there were no observation error, the observed value would equal the true value, but in practice they usually differ, for instance because of labeling mistakes or because the true value is not directly accessible (as with temperature). There is therefore always some observation error; assume it has zero mean and denote it ε. Then:
$$y_D = y + \epsilon, \qquad E(\epsilon) = 0, \qquad \mathrm{var}(\epsilon) = E\big[(y_D - y)^2\big].$$
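To make Definitions 1 and 2 concrete, here is a small sketch. It is not from the original post: the choice of DecisionTreeRegressor is arbitrary and the data-generating process is borrowed from the example further down. It trains the same algorithm on n equally sized training sets and then measures, at each test point, the average prediction and the spread of the predictions across training sets.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)

def f(x):
    # true data-generating process (same as in the example below)
    return np.exp(-x ** 2) + 1.5 * np.exp(-(x - 2) ** 2)

n_sets, n_samples = 30, 100
X_grid = np.linspace(-5, 5, 50).reshape(-1, 1)

preds = np.zeros((n_sets, X_grid.shape[0]))
for i in range(n_sets):
    # each D_i: same size, same features, same algorithm -- only the random draw differs
    X = rng.uniform(-5, 5, n_samples).reshape(-1, 1)
    y = f(X.ravel()) + rng.normal(0, 0.1, n_samples)
    preds[i] = DecisionTreeRegressor().fit(X, y).predict(X_grid)

y_bar = preds.mean(axis=0)   # Definition 1: the algorithm's prediction at each x
var_y = preds.var(axis=0)    # Definition 2: var(y_hat | x), the fluctuation caused by the training set
print(y_bar[:3], var_y[:3])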
Finally, the bias. We build a predictive model because we want it to be accurate, and accuracy is judged against the true value, not the observed value. The bias therefore measures the prediction error relative to the true value; since the error can be positive or negative, it is squared to remove the sign, which gives the definition:

Bias: the squared error of the average prediction relative to the true value,
$$\mathrm{bias}^2(x) = (y - \hat{y})^2.$$
Now return to the decomposition itself. What gets decomposed is the (expected squared) error: every training subset D produces a prediction f(x; D) of the observed value, and hence a prediction error, called the residual. For a fixed input x,
$$\mathrm{res}_D = y_D - f(x; D),$$
and the expected squared residual over training sets is the quantity to decompose. It can be shown that
$$\mathrm{var}(\mathrm{res}_D) := E_D\big[(y_D - f(x; D))^2\big] = \mathrm{bias}^2(x) + \mathrm{var}(\hat{y} \mid x) + \mathrm{var}(\epsilon).$$
For the full proof see Zhou Zhihua, "Machine Learning"; a sketch of the argument follows.
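The derivation is short enough to outline here (this sketch is not in the original post; it assumes the observation noise $\epsilon = y_D - y$ has zero mean and is independent of the training set $D$):

$$
\begin{aligned}
E_D\big[(y_D - f(x;D))^2\big]
&= E\Big[\big((y_D - y) + (y - \hat{y}) + (\hat{y} - f(x;D))\big)^2\Big] \\
&= E\big[(y_D - y)^2\big] + (y - \hat{y})^2 + E_D\big[(f(x;D) - \hat{y})^2\big] \\
&= \mathrm{var}(\epsilon) + \mathrm{bias}^2(x) + \mathrm{var}(\hat{y} \mid x),
\end{aligned}
$$

where the three cross terms vanish: $E[y_D - y] = E(\epsilon) = 0$, $E_D[\hat{y} - f(x;D)] = 0$ by Definition 1, and the noise is independent of $D$.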
In practice this can be combined with k-fold validation: if the sample is split into, say, 10 folds, each train/test split plays the role of one of the D_i above, and, assuming the observation error is zero, the decomposition of the squared residuals can be estimated. The decomposition is useful for two things:

Model selection: on the test set, estimate the bias and variance of different models; comparing them guides which model to pick.

Early stopping: on the training data, track how the same model's bias and variance evolve as training proceeds. When the model is over-trained, the bias keeps decreasing while the variance may start to grow; that is the point to stop training and prevent overfitting (a rough sketch of this idea follows).
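The sketch below illustrates the early-stopping idea. It is not from the original post: it uses GradientBoostingRegressor and treats the number of boosting stages as "training time", and it borrows the data-generating f(x) from the example further down. It re-estimates bias squared and variance across several training sets after every stage and stops roughly where their sum bottoms out.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)

def f(x):
    # true data-generating process (same as in the example below)
    return np.exp(-x ** 2) + 1.5 * np.exp(-(x - 2) ** 2)

n_sets, n_samples, n_stages, noise_sd = 20, 80, 200, 0.1
X_grid = np.linspace(-5, 5, 200).reshape(-1, 1)

# preds[s, k, :] holds the prediction on X_grid of the model trained on set s after k+1 stages
preds = np.zeros((n_sets, n_stages, X_grid.shape[0]))
for s in range(n_sets):
    X = rng.uniform(-5, 5, n_samples).reshape(-1, 1)
    y = f(X.ravel()) + rng.normal(0, noise_sd, n_samples)
    model = GradientBoostingRegressor(n_estimators=n_stages, max_depth=3).fit(X, y)
    for k, y_hat in enumerate(model.staged_predict(X_grid)):
        preds[s, k, :] = y_hat

y_bar = preds.mean(axis=0)                                  # average prediction at each stage
bias2 = ((y_bar - f(X_grid.ravel())) ** 2).mean(axis=1)     # bias^2, averaged over x
variance = preds.var(axis=0).mean(axis=1)                   # variance across training sets, averaged over x

# bias^2 keeps falling with more stages while the variance creeps up;
# stop roughly where their sum stops improving.
print("suggested number of boosting stages:", int(np.argmin(bias2 + variance)) + 1)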
Two examples follow.

Example 1: combining models (bagging) reduces overfitting, that is, it reduces variance. The data-generating process is
$$f(x) = e^{-x^2} + 1.5\, e^{-(x-2)^2},$$
with Gaussian noise added on top. For a given range of x and a DecisionTreeRegressor with fixed parameters, compare the variance and bias of a BaggingRegressor built on that tree against the single DecisionTreeRegressor, to verify the claim above. Running the code below produces a figure in which the error of bagging is clearly smaller than that of the single DecisionTreeRegressor, and the green (variance) curve drops the most, showing that bagging greatly reduces the variance. Bagging is essentially a pooling/hedging idea: in probability theory one can show rigorously that the variance of a combination is smaller than that of a single component.
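A quick supporting calculation for that last claim, not in the original post: assume the n base predictors have a common variance $\sigma^2$ and a common pairwise correlation $\rho \in [0, 1]$. Then

$$\mathrm{var}\Big(\frac{1}{n}\sum_{i=1}^{n}\hat{y}_i\Big) = \rho\,\sigma^2 + \frac{1-\rho}{n}\,\sigma^2 \le \sigma^2,$$

with equality only when $\rho = 1$ (or n = 1); for independent predictors ($\rho = 0$) the variance of the average drops to $\sigma^2/n$. Bootstrap resampling in bagging pushes toward this regime by decorrelating the individual trees.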
"""
============================================================
Single estimator versus bagging: bias-variance decomposition
============================================================

This example illustrates and compares the bias-variance decomposition of the
expected mean squared error of a single estimator against a bagging ensemble.

In regression, the expected mean squared error of an estimator can be
decomposed in terms of bias, variance and noise. On average over datasets of
the regression problem, the bias term measures the average amount by which the
predictions of the estimator differ from the predictions of the best possible
estimator for the problem (i.e., the Bayes model). The variance term measures
the variability of the predictions of the estimator when fit over different
instances LS of the problem. Finally, the noise measures the irreducible part
of the error which is due to the variability in the data.

The upper left figure illustrates the predictions (in dark red) of a single
decision tree trained over a random dataset LS (the blue dots) of a toy 1d
regression problem. It also illustrates the predictions (in light red) of other
single decision trees trained over other (and different) randomly drawn
instances LS of the problem. Intuitively, the variance term here corresponds to
the width of the beam of predictions (in light red) of the individual
estimators. The larger the variance, the more sensitive are the predictions for
`x` to small changes in the training set. The bias term corresponds to the
difference between the average prediction of the estimator (in cyan) and the
best possible model (in dark blue). On this problem, we can thus observe that
the bias is quite low (both the cyan and the blue curves are close to each
other) while the variance is large (the red beam is rather wide).

The lower left figure plots the pointwise decomposition of the expected mean
squared error of a single decision tree. It confirms that the bias term (in
blue) is low while the variance is large (in green). It also illustrates the
noise part of the error which, as expected, appears to be constant and around
`0.01`.

The right figures correspond to the same plots but using instead a bagging
ensemble of decision trees. In both figures, we can observe that the bias term
is larger than in the previous case. In the upper right figure, the difference
between the average prediction (in cyan) and the best possible model is larger
(e.g., notice the offset around `x=2`). In the lower right figure, the bias
curve is also slightly higher than in the lower left figure. In terms of
variance however, the beam of predictions is narrower, which suggests that the
variance is lower. Indeed, as the lower right figure confirms, the variance
term (in green) is lower than for single decision trees. Overall, the
bias-variance decomposition is therefore no longer the same. The tradeoff is
better for bagging: averaging several decision trees fit on bootstrap copies of
the dataset slightly increases the bias term but allows for a larger reduction
of the variance, which results in a lower overall mean squared error (compare
the red curves in the lower figures). The script output also confirms this
intuition. The total error of the bagging ensemble is lower than the total
error of a single decision tree, and this difference indeed mainly stems from a
reduced variance.

For further details on bias-variance decomposition, see section 7.3 of [1]_.

References
----------

.. [1] T. Hastie, R. Tibshirani and J. Friedman,
       "Elements of Statistical Learning", Springer, 2009.
"""
print(__doc__)
# Author: Gilles Louppe <g.>
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
# Settings
n_repeat = 50    # Number of iterations for computing expectations
n_train = 50     # Size of the training set
n_test = 1000    # Size of the test set
noise = 0.1      # Standard deviation of the noise

np.random.seed(0)
# Change this for exploring the bias-variance decomposition of other
# estimators. This should work well for estimators with high variance (e.g.,
# decision trees or KNN), but poorly for estimators with low variance (e.g.,
# linear models).
estimators = [("Tree", DecisionTreeRegressor()),
              ("Bagging(Tree)", BaggingRegressor(DecisionTreeRegressor()))]
n_estimators = len(estimators)
# Generate data
def f(x):
    """
    The true data-generating process; note that X is flattened to 1d first.

    :param x: input array
    :return: f(x) evaluated elementwise
    """
    x = x.ravel()
    return np.exp(-x ** 2) + 1.5 * np.exp(-(x - 2) ** 2)
def generate(n_samples, noise, n_repeat=1):
    """
    Generate observed values: the true data-generating process plus zero-mean
    Gaussian white noise with standard deviation `noise`.
    Supports generating several observation subsets at once.

    :param n_samples: number of samples in each observation subset
    :param noise: standard deviation of the white noise
    :param n_repeat: number of observation subsets to generate
    :return: X of shape (n_samples, 1); y of shape (n_samples,) or (n_samples, n_repeat)
    """
    X = np.random.rand(n_samples) * 10 - 5
    X = np.sort(X)

    if n_repeat == 1:
        y = f(X) + np.random.normal(0.0, noise, n_samples)
    else:
        y = np.zeros((n_samples, n_repeat))
        for i in range(n_repeat):
            y[:, i] = f(X) + np.random.normal(0.0, noise, n_samples)

    X = X.reshape((n_samples, 1))

    return X, y
X_train = []
y_train = []
for i in range(n_repeat):
    X, y = generate(n_samples=n_train, noise=noise)
    X_train.append(X)
    y_train.append(y)

X_test, y_test = generate(n_samples=n_test, noise=noise, n_repeat=n_repeat)
plt.figure(figsize=(10, 8))

# Loop over estimators to compare
for n, (name, estimator) in enumerate(estimators):
    # Compute predictions
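    # NOTE: the original post is truncated at this point. The remainder below is a
    # completion that follows the scikit-learn example "Single estimator versus
    # bagging: bias-variance decomposition" that this script reproduces; the exact
    # plotting details are reconstructed from that example and may differ slightly
    # from whatever the missing original contained.
    y_predict = np.zeros((n_test, n_repeat))
    for i in range(n_repeat):
        estimator.fit(X_train[i], y_train[i])
        y_predict[:, i] = estimator.predict(X_test)

    # Bias^2 + variance + noise decomposition of the mean squared error: average
    # the squared difference between every noisy test draw and every repetition's predictions.
    y_error = np.zeros(n_test)
    for i in range(n_repeat):
        for j in range(n_repeat):
            y_error += (y_test[:, j] - y_predict[:, i]) ** 2
    y_error /= (n_repeat * n_repeat)

    y_noise = np.var(y_test, axis=1)                         # irreducible noise
    y_bias = (f(X_test) - np.mean(y_predict, axis=1)) ** 2   # squared bias w.r.t. the true f
    y_var = np.var(y_predict, axis=1)                        # variance across training sets

    print("{0}: {1:.4f} (error) = {2:.4f} (bias^2) + {3:.4f} (var) + {4:.4f} (noise)".format(
        name, np.mean(y_error), np.mean(y_bias), np.mean(y_var), np.mean(y_noise)))

    # Upper row: the true function, one training set, and the beam of predictions
    plt.subplot(2, n_estimators, n + 1)
    plt.plot(X_test, f(X_test), "b", label="$f(x)$")
    plt.plot(X_train[0], y_train[0], ".b", label="LS ~ $y = f(x)+noise$")
    for i in range(n_repeat):
        if i == 0:
            plt.plot(X_test, y_predict[:, i], "r", label=r"$\hat{y}(x)$")
        else:
            plt.plot(X_test, y_predict[:, i], "r", alpha=0.05)
    plt.plot(X_test, np.mean(y_predict, axis=1), "c", label=r"$E_{LS} \hat{y}(x)$")
    plt.xlim([-5, 5])
    plt.title(name)
    if n == n_estimators - 1:
        plt.legend(loc=(1.1, .5))

    # Lower row: pointwise decomposition of the expected mean squared error
    plt.subplot(2, n_estimators, n_estimators + n + 1)
    plt.plot(X_test, y_error, "r", label="$error(x)$")
    plt.plot(X_test, y_bias, "b", label="$bias^2(x)$")
    plt.plot(X_test, y_var, "g", label="$variance(x)$")
    plt.plot(X_test, y_noise, "c", label="$noise(x)$")
    plt.xlim([-5, 5])
    plt.ylim([0, 0.1])
    if n == n_estimators - 1:
        plt.legend(loc=(1.1, .5))

plt.subplots_adjust(right=.75)
plt.show()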
