Stochastic gradient boosting

Computational Statistics & Data Analysis 38 (2002) 367–378
Jerome H. Friedman
Department of Statistics and Stanford Linear Accelerator Center, Stanford University,
Stanford, CA 94305, USA
Abstract
Gradient boosting constructs additive regression models by sequentially fitting a simple parameterized function (base learner) to current "pseudo"-residuals by least squares at each iteration. The pseudo-residuals are the gradient of the loss functional being minimized, with respect to the model values at each training data point evaluated at the current step. It is shown that both the approximation accuracy and execution speed of gradient boosting can be substantially improved by incorporating randomization into the procedure. Specifically, at each iteration a subsample of the training data is drawn at random (without replacement) from the full training data set. This randomly selected subsample is then used in place of the full sample to fit the base learner and compute the model update for the current iteration. This randomized approach also increases robustness against overcapacity of the base learner. © 2002 Elsevier Science B.V. All rights reserved.
1. Gradient boosting
In the function estimation problem one has a system consisting of a random "output" or "response" variable $y$ and a set of random "input" or "explanatory" variables $x = \{x_1, \ldots, x_n\}$. Given a "training" sample $\{y_i, x_i\}_1^N$ of known $(y, x)$ values, the goal is to find a function $F^*(x)$ that maps $x$ to $y$, such that over the joint distribution of all $(y, x)$ values, the expected value of some specified loss function $\Psi(y, F(x))$ is minimized

F^*(x) = \arg\min_{F(x)} E_{y,x}\, \Psi(y, F(x)).    (1)

Boosting approximates $F^*(x)$ by an "additive" expansion of the form

F(x) = \sum_{m=0}^{M} \beta_m h(x; a_m),    (2)
where the functions $h(x; a)$ ("base learner") are usually chosen to be simple functions of $x$ with parameters $a = \{a_1, a_2, \ldots\}$. The expansion coefficients $\{\beta_m\}_0^M$ and the parameters $\{a_m\}_0^M$ are jointly fit to the training data in a forward "stagewise" manner. One starts with an initial guess $F_0(x)$, and then for $m = 1, 2, \ldots, M$

(\beta_m, a_m) = \arg\min_{\beta, a} \sum_{i=1}^{N} \Psi(y_i, F_{m-1}(x_i) + \beta h(x_i; a))    (3)
and
F_m(x) = F_{m-1}(x) + \beta_m h(x; a_m).    (4)

Gradient boosting (Friedman, 1999) approximately solves (3) for arbitrary (differentiable) loss functions $\Psi(y, F(x))$ with a two-step procedure. First, the function $h(x; a)$ is fit by least squares

a_m = \arg\min_{a, \rho} \sum_{i=1}^{N} [\tilde{y}_{im} - \rho\, h(x_i; a)]^2    (5)

to the current "pseudo"-residuals

\tilde{y}_{im} = -\left[ \frac{\partial \Psi(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}.    (6)

Then, given $h(x; a_m)$, the optimal value of the coefficient $\beta_m$ is determined

\beta_m = \arg\min_{\beta} \sum_{i=1}^{N} \Psi(y_i, F_{m-1}(x_i) + \beta h(x_i; a_m)).    (7)
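As a simple illustration (not part of the original text), consider squared-error loss $\Psi(y, F) = (y - F)^2/2$, where the factor $1/2$ is included only to simplify the derivative. Then (6) reduces to the ordinary residuals and the line search (7) has a closed form,

\tilde{y}_{im} = y_i - F_{m-1}(x_i), \qquad \beta_m = \frac{\sum_{i=1}^{N} \tilde{y}_{im}\, h(x_i; a_m)}{\sum_{i=1}^{N} h(x_i; a_m)^2},

so each boosting iteration simply fits the base learner to the residuals of the current model.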
This strategy replaces a potentially difficult function optimization problem (3) by one based on least squares (5), followed by a single parameter optimization (7) based on the general loss criterion $\Psi$. Gradient tree boosting specializes this approach to the case where the base learner $h(x; a)$ is an $L$ terminal node regression tree. At each iteration $m$, a regression tree partitions the $x$ space into $L$ disjoint regions $\{R_{lm}\}_{l=1}^{L}$ and predicts a separate constant value in each one

h(x; \{R_{lm}\}_1^L) = \sum_{l=1}^{L} \bar{y}_{lm}\, 1(x \in R_{lm}).    (8)

Here $\bar{y}_{lm} = \mathrm{mean}_{x_i \in R_{lm}}(\tilde{y}_{im})$ is the mean of (6) in each region $R_{lm}$. The parameters of this base learner are the splitting variables and corresponding split points defining the tree, which in turn define the corresponding regions $\{R_{lm}\}_1^L$ of the partition at the $m$th iteration. These are induced in a top-down "best-first" manner using a least squares splitting criterion (Friedman et al., 1998). With regression trees, (7) can be solved separately within each region $R_{lm}$ defined by the corresponding terminal node $l$ of the $m$th tree. Because tree (8) predicts a constant value $\bar{y}_{lm}$ within each region $R_{lm}$, the solution to (7) reduces to a simple "location" estimate based on the
criterion

\gamma_{lm} = \arg\min_{\gamma} \sum_{x_i \in R_{lm}} \Psi(y_i, F_{m-1}(x_i) + \gamma).

The current approximation $F_{m-1}(x)$ is then separately updated in each corresponding region

F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_{lm}\, 1(x \in R_{lm}).
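For example, with least-absolute-deviation loss $\Psi(y, F) = |y - F|$, the terminal-node solution above is the median of the current residuals in each region,

\gamma_{lm} = \mathrm{median}_{x_i \in R_{lm}} \{ y_i - F_{m-1}(x_i) \},

since the median minimizes a sum of absolute deviations; this is the location estimate used by the LAD TreeBoost procedure of Friedman (1999).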
The "shrinkage" parameter $0 < \nu \leq 1$ controls the learning rate of the procedure. Empirically (Friedman, 1999), it was found that small values ($\nu \leq 0.1$) lead to much better generalization error. This leads to the following algorithm for generalized boosting of decision trees:
Algorithm 1: Gradient TreeBoost
1  $F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} \Psi(y_i, \gamma)$
2  For $m = 1$ to $M$ do:
3    $\tilde{y}_{im} = -\left[ \partial \Psi(y_i, F(x_i)) / \partial F(x_i) \right]_{F(x) = F_{m-1}(x)}, \quad i = 1, N$
4    $\{R_{lm}\}_1^L = L\text{-terminal node tree}(\{\tilde{y}_{im}, x_i\}_1^N)$
5    $\gamma_{lm} = \arg\min_{\gamma} \sum_{x_i \in R_{lm}} \Psi(y_i, F_{m-1}(x_i) + \gamma)$
6    $F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_{lm}\, 1(x \in R_{lm})$
7  endFor
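The following is a minimal Python sketch of Algorithm 1, not from the original paper. It assumes squared-error loss (so the pseudo-residuals of line 3 are ordinary residuals and the terminal-node constants of line 5 are region means, which the fitted tree already returns), uses scikit-learn's DecisionTreeRegressor as the L-terminal-node base learner, and takes X and y as NumPy arrays; the function name is illustrative.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_tree_boost(X, y, M=100, L=6, nu=0.1):
    # line 1: the constant minimizing the sum of squared errors is the mean of y
    F0 = np.mean(y)
    F = np.full(len(y), F0)              # current model values F_{m-1}(x_i)
    trees = []
    for m in range(M):                   # line 2
        residuals = y - F                # line 3: pseudo-residuals for squared-error loss
        tree = DecisionTreeRegressor(max_leaf_nodes=L)   # line 4: L-terminal-node regression tree
        tree.fit(X, residuals)
        F = F + nu * tree.predict(X)     # lines 5-6: shrunken region-wise update
        trees.append(tree)

    def predict(X_new):                  # evaluate the final additive model F_M(x)
        out = np.full(X_new.shape[0], F0)
        for tree in trees:
            out = out + nu * tree.predict(X_new)
        return out

    return predict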
Friedman (1999) presented specific algorithms based on this template for several loss criteria including least squares: $\Psi(y, F) = (y - F)^2$, least absolute deviation: $\Psi(y, F) = |y - F|$, Huber M: $\Psi(y, F) = (y - F)^2\, 1(|y - F| \leq \delta) + 2\delta(|y - F| - \delta/2)\, 1(|y - F| > \delta)$, and, for classification, the $K$-class multinomial negative log-likelihood.
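As a sketch of how these criteria enter line 3 of Algorithm 1, the helper below (illustrative, not from the paper) computes the corresponding pseudo-residuals. The Huber threshold delta is passed here as a fixed argument, whereas Friedman (1999) adapts it at each iteration as a quantile of the absolute residuals, and the multinomial classification case is omitted.

import numpy as np

def pseudo_residuals(y, F, loss="ls", delta=1.0):
    """Negative gradient of the loss with respect to F, at the current model values.

    loss: 'ls'    -> Psi = (y - F)^2   (least squares)
          'lad'   -> Psi = |y - F|     (least absolute deviation)
          'huber' -> the Huber-M criterion given in the text, with threshold delta
    """
    r = y - F
    if loss == "ls":
        return 2.0 * r                        # -d/dF of (y - F)^2 is 2(y - F)
    if loss == "lad":
        return np.sign(r)                     # subgradient of |y - F|
    if loss == "huber":
        return np.where(np.abs(r) <= delta,   # quadratic zone: 2(y - F)
                        2.0 * r,
                        2.0 * delta * np.sign(r))   # linear zone: 2*delta*sign(y - F)
    raise ValueError(loss)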
2. Stochastic gradient boosting
With his "bagging" procedure, Breiman (1996) introduced the notion that injecting randomness into function estimation procedures could improve their performance. Early implementations of AdaBoost (Freund and Schapire, 1996) also employed random sampling, but this was considered an approximation to deterministic weighting when the implementation of the base learner did not support observation weights, rather than as an essential ingredient. Recently, Breiman (1999) proposed a hybrid bagging-boosting procedure ("adaptive bagging") intended for least squares fitting of additive expansions (2). It replaces the base learner in regular boosting procedures with the corresponding bagged base learner, and substitutes "out of bag" residuals for the ordinary residuals at each boosting step.

Motivated by Breiman (1999), a minor modification was made to gradient boosting (Algorithm 1) to incorporate randomness as an integral part of the procedure. Specifically, at each iteration a subsample of the training data is drawn at random
(without replacement) from the full training data set. This randomly selected subsample is then used, instead of the full sample, to fit the base learner (line 4) and compute the model update for the current iteration (line 5). Let $\{y_i, x_i\}_1^N$ be the entire training data sample and $\{\pi(i)\}_1^N$ be a random permutation of the integers $\{1, \ldots, N\}$. Then a random subsample of size $\tilde{N} < N$ is given by $\{y_{\pi(i)}, x_{\pi(i)}\}_1^{\tilde{N}}$. The stochastic gradient boosting algorithm is then
Algorithm 2: Stochastic Gradient TreeBoost
1  $F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} \Psi(y_i, \gamma)$
2  For $m = 1$ to $M$ do:
3    $\{\pi(i)\}_1^N = \mathrm{rand\_perm}\{i\}_1^N$
4    $\tilde{y}_{\pi(i)m} = -\left[ \partial \Psi(y_{\pi(i)}, F(x_{\pi(i)})) / \partial F(x_{\pi(i)}) \right]_{F(x) = F_{m-1}(x)}, \quad i = 1, \tilde{N}$
5    $\{R_{lm}\}_1^L = L\text{-terminal node tree}(\{\tilde{y}_{\pi(i)m}, x_{\pi(i)}\}_1^{\tilde{N}})$
6    $\gamma_{lm} = \arg\min_{\gamma} \sum_{x_{\pi(i)} \in R_{lm}} \Psi(y_{\pi(i)}, F_{m-1}(x_{\pi(i)}) + \gamma)$
7    $F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_{lm}\, 1(x \in R_{lm})$
8  endFor
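A corresponding Python sketch of Algorithm 2, under the same illustrative assumptions as the earlier sketch (squared-error loss, scikit-learn trees, NumPy arrays): the only change is that each tree is fit to the first $\tilde{N} = f \cdot N$ entries of a fresh random permutation, while the model update is still applied at all training points.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def stochastic_gradient_tree_boost(X, y, M=100, L=6, nu=0.1, f=0.5, random_state=0):
    """Sketch of Algorithm 2 with squared-error loss; f = N_tilde / N is the subsample fraction."""
    rng = np.random.default_rng(random_state)
    N = len(y)
    n_sub = max(1, int(f * N))           # N_tilde = f * N
    F0 = np.mean(y)
    F = np.full(N, F0)
    trees = []
    for m in range(M):
        perm = rng.permutation(N)        # line 3: random permutation {pi(i)}
        sub = perm[:n_sub]               # the first N_tilde indices form the subsample
        residuals = y[sub] - F[sub]      # line 4: pseudo-residuals on the subsample only
        tree = DecisionTreeRegressor(max_leaf_nodes=L)
        tree.fit(X[sub], residuals)      # line 5: tree fit to the subsample
        F = F + nu * tree.predict(X)     # lines 6-7: update the model on all training points
        trees.append(tree)

    def predict(X_new):
        out = np.full(X_new.shape[0], F0)
        for tree in trees:
            out = out + nu * tree.predict(X_new)
        return out

    return predict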
Using $\tilde{N} = N$ introduces no randomness and causes Algorithm 2 to return the same result as Algorithm 1. The smaller the fraction $f = \tilde{N}/N$, the more the random samples used in successive iterations will differ, thereby introducing more overall randomness into the procedure. Using the value $f = 1/2$ is roughly equivalent to drawing bootstrap samples at each iteration. Using $\tilde{N} = f \cdot N$ also reduces computation by a factor of $f$. However, making the value of $f$ smaller reduces the amount of data available to train the base learner at each iteration. This will cause the variance associated with the individual base learner estimates to increase.
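In practice, the fraction $f$ corresponds to the subsample argument of scikit-learn's GradientBoostingRegressor; a brief usage sketch follows, with parameter values chosen only for illustration (setting max_leaf_nodes grows trees in a best-first manner, roughly matching the trees used here).

from sklearn.ensemble import GradientBoostingRegressor

# subsample=1.0 reproduces deterministic gradient boosting (Algorithm 1);
# subsample=0.5 draws a without-replacement subsample of half the data at each iteration.
model = GradientBoostingRegressor(
    n_estimators=400,     # M
    learning_rate=0.1,    # shrinkage nu
    max_leaf_nodes=6,     # L terminal nodes
    subsample=0.5,        # f = N_tilde / N
)
# model.fit(X_train, y_train); model.predict(X_test)   # X_train, y_train are placeholders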
3. Simulation studies
The effect of randomization on gradient tree boost procedures will likely depend on the particular problem at hand. Important characteristics of problems that affect performance include training sample size $N$, true underlying "target" function $F^*(x)$ (1), and the distribution of the departures, $\varepsilon$, of $y\,|\,x$ from $F^*(x)$. In order to gauge the value of any estimation method it is necessary to accurately evaluate its performance over many different situations. This is most conveniently accomplished through Monte Carlo simulation, where data can be generated according to a wide variety of prescriptions, and resulting performance accurately calculated.

One of the most important characteristics of any problem affecting performance is the true underlying target function $F^*(x)$ (1). Since the nature of the target function can vary greatly over different problems, and is seldom known, we evaluate the relative merits of randomized gradient tree boosting on a variety of different targets randomly drawn from a broad "realistic" class of functions. The procedure used here to generate the random functions is described in Friedman (1999). The simulation studies below are based on the same 100 randomly generated target functions used in Friedman (1999).
Performance is based on the average absolute error of the derived estimate $\hat{F}(x)$ in approximating each target $F^*(x)$

A(\hat{F}) = E_x |F^*(x) - \hat{F}(x)|    (9)

as estimated from a large independent test data set. Performance comparisons among several different estimates $\{\hat{F}_k(x)\}_1^K$ are based on the absolute error (9) of each one relative to the best performer

R(\hat{F}_k) = A(\hat{F}_k) \,/\, \min_l \{A(\hat{F}_l)\}_1^K.    (10)

Thus, for each of the 100 target functions, the best method $k^* = \arg\min_k \{A(\hat{F}_k)\}_1^K$ receives the value $R(\hat{F}_{k^*}) = 1.0$, and the others receive a larger value $\{R(\hat{F}_k) > 1.0\}_{k \neq k^*}$. If a particular method was best (smallest error) for every target, its distribution of (10) over all 100 target functions would be a point mass at the value 1.0.
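A small sketch of how (9) and (10) can be computed from predictions on a large independent test set; the function and array names are illustrative.

import numpy as np

def average_absolute_error(F_true, F_hat):
    """A(F_hat) in (9): average absolute approximation error over the test inputs."""
    return np.mean(np.abs(F_true - F_hat))

def relative_errors(F_true, F_hats):
    """R(F_hat_k) in (10): error of each method relative to the best performer."""
    A = np.array([average_absolute_error(F_true, F_hat) for F_hat in F_hats])
    return A / A.min()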
3.1. Regression
In this section, the effect of randomization on the (Huber) M TreeBoost procedure is investigated. Among the regression procedures derived in Friedman (1999), M TreeBoost has the best overall performance and was considered the method of choice. Its breakdown parameter was set to the default value $\alpha = 0.9$. For small data sets ($N = 500$) the shrinkage parameter $\nu$ (Algorithm 2) was set to $\nu = 0.005$. For the larger ones ($N = 5000$) it was set to $\nu = 0.05$. Best-first regression trees with six terminal nodes were used as the base learner.
Here, we compare various levels of randomization in terms of performance over the 100 target functions for two different error distributions. One hundred data sets $\{y_i, x_i\}_1^N$ were generated according to

y_i = F^*(x_i) + \varepsilon_i,    (11)

where $F^*(x)$ represents each of the 100 randomly generated target functions. For the first study, the errors $\varepsilon_i$ were generated from a Gaussian distribution with zero mean, and variance adjusted so that

E|\varepsilon| = E_x |F^*(x) - \mathrm{median}_x F^*(x)|,    (12)

giving a 1/1 signal-to-noise ratio. For the second study the errors were generated from a "slash" distribution, $\varepsilon_i = s \cdot (u/v)$, where $u \sim N(0,1)$ and $v \sim U[0,1]$. The scale factor $s$ is adjusted to give a 1/1 signal-to-noise ratio (12). The slash distribution has very thick tails and is often used as an extreme to test robustness.
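A sketch of the error-generation step (11)-(12), with F_star standing in for one of the randomly generated target functions and the expectation in (12) estimated over the supplied inputs. For Gaussian errors the scale follows from $E|\varepsilon| = \sigma\sqrt{2/\pi}$; since the slash distribution has no finite mean absolute deviation, the slash scale is matched empirically on the realized sample, which is one reasonable reading of (12) rather than the paper's exact recipe.

import numpy as np

def generate_data(F_star, X, noise="gaussian", random_state=0):
    """Generate y_i = F*(x_i) + eps_i as in (11), with the 1/1 signal-to-noise convention (12)."""
    rng = np.random.default_rng(random_state)
    signal = F_star(X)
    target = np.mean(np.abs(signal - np.median(signal)))   # estimate of E_x|F*(x) - median F*(x)|
    n = len(signal)
    if noise == "gaussian":
        sigma = target / np.sqrt(2.0 / np.pi)   # E|eps| = sigma * sqrt(2/pi) for N(0, sigma^2)
        eps = rng.normal(0.0, sigma, size=n)
    elif noise == "slash":
        u = rng.normal(0.0, 1.0, size=n)
        v = rng.uniform(0.0, 1.0, size=n)
        eps = u / v
        # no finite mean absolute deviation, so match the scale on the realized sample
        eps = eps * (target / np.mean(np.abs(eps)))
    else:
        raise ValueError(noise)
    return signal + eps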
3.1.1. Gaussian errors
Fig. 1 compares the performance of M TreeBoost for different degrees of randomization, for small training data sets ($N = 500$). The degree of randomness is controlled by the fraction $f = \tilde{N}/N$ of randomly drawn observations used to train the regression tree at each iteration. Shown are the distributions of $\{R(\hat{F}_f)\}_1^8$ (10)
