Nonlinear Estimation and Classification, Springer, 2003.
The Boosting Approach to Machine Learning
An Overview
Robert E. Schapire
AT&T Labs Research
Shannon Laboratory
180 Park Avenue, Room A203
Florham Park, NJ 07932 USA
December 19, 2001
Abstract
Boosting is a general method for improving the accuracy of any given learning algorithm. Focusing primarily on the AdaBoost algorithm, this chapter overviews some of the recent work on boosting, including analyses of AdaBoost's training error and generalization error; boosting's connection to game theory and linear programming; the relationship between boosting and logistic regression; extensions of AdaBoost for multiclass classification problems; methods of incorporating human knowledge into boosting; and experimental and applied work using boosting.
1 Introduction
Machine learning studies automatic techniques for learning to make accurate predictions based on past observations. For example, suppose that we would like to build an email filter that can distinguish spam (junk) email from non-spam. The machine-learning approach to this problem would be the following: Start by gathering as many examples as possible of both spam and non-spam emails. Next, feed these examples, together with labels indicating if they are spam or not, to your favorite machine-learning algorithm, which will automatically produce a classification or prediction rule. Given a new, unlabeled email, such a rule attempts to predict if it is spam or not. The goal, of course, is to generate a rule that makes the most accurate predictions possible on new test examples.
Building a highly accurate prediction rule is certainly a difficult task. On the other hand, it is not hard at all to come up with very rough rules of thumb that are only moderately accurate. An example of such a rule is something like the following: "If the phrase 'buy now' occurs in the email, then predict it is spam." Such a rule will not even come close to covering all spam messages; for instance, it really says nothing about what to predict if 'buy now' does not occur in the message. On the other hand, this rule will make predictions that are significantly better than random guessing.
Boosting, the machine-learning method that is the subject of this chapter, is based on the observation that finding many rough rules of thumb can be a lot easier than finding a single, highly accurate prediction rule. To apply the boosting approach, we start with a method or algorithm for finding the rough rules of thumb. The boosting algorithm calls this "weak" or "base" learning algorithm repeatedly, each time feeding it a different subset of the training examples (or, to be more precise, a different distribution or weighting over the training examples). Each time it is called, the base learning algorithm generates a new weak prediction rule, and after many rounds, the boosting algorithm must combine these weak rules into a single prediction rule that, hopefully, will be much more accurate than any one of the weak rules.
To make this approach work, there are two fundamental questions that must be answered: first, how should each distribution be chosen on each round, and second, how should the weak rules be combined into a single rule? Regarding the choice of distribution, the technique that we advocate is to place the most weight on the examples most often misclassified by the preceding weak rules; this has the effect of forcing the base learner to focus its attention on the "hardest" examples. As for combining the weak rules, simply taking a (weighted) majority vote of their predictions is natural and effective.
There is also the question of what to use for the base learning algorithm, but this question we purposely leave unanswered so that we will end up with a general boosting procedure that can be combined with any base learning algorithm.
Boosting refers to a general and provably effective method of producing a very accurate prediction rule by combining rough and moderately inaccurate rules of thumb in a manner similar to that suggested above. This chapter presents an overview of some of the recent work on boosting, focusing especially on the AdaBoost algorithm, which has undergone intense theoretical study and empirical testing.
2 AdaBoost

Given: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in \{-1, +1\}$.
Initialize: $D_1(i) = 1/m$ for $i = 1, \ldots, m$.
For $t = 1, \ldots, T$:
  Train base learner using distribution $D_t$.
  Get base classifier $h_t : X \rightarrow \mathbb{R}$.
  Choose $\alpha_t \in \mathbb{R}$.
  Update:
  \[ D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t} \]
  where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ will be a distribution).
Output the final classifier:
  \[ H(x) = \mathrm{sign}\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right). \]

Figure 1: The boosting algorithm AdaBoost.
Pseudocode for AdaBoost is given in Figure 1. The algorithm takes as input a training set $(x_1, y_1), \ldots, (x_m, y_m)$ where each $x_i$ belongs to some domain or instance space $X$ and each label $y_i$ is in $\{-1, +1\}$. AdaBoost calls a given weak or base learning algorithm repeatedly in a series of rounds $t = 1, \ldots, T$. One of the main ideas of the algorithm is to maintain a distribution or set of weights over the training set. The weight of this distribution on training example $i$ on round $t$ is denoted $D_t(i)$. Initially, all weights are set equally, but on each round, the weights of incorrectly classified examples are increased so that the base learner is forced to focus on the hard examples in the training set.
The base learner's job is to find a base classifier $h_t : X \rightarrow \mathbb{R}$ appropriate for the distribution $D_t$. (Base classifiers were also called rules of thumb or weak prediction rules in Section 1.) In the simplest case, the range of each $h_t$ is binary, i.e., restricted to $\{-1, +1\}$; the base learner's job then is to minimize the error
\[ \epsilon_t = \Pr_{i \sim D_t}\left[ h_t(x_i) \neq y_i \right]. \]
Once the base classifier $h_t$ has been received, AdaBoost chooses a parameter $\alpha_t \in \mathbb{R}$ that intuitively measures the importance that it assigns to $h_t$. In the figure, we have deliberately left the choice of $\alpha_t$ unspecified. For binary $h_t$, we typically set
\[ \alpha_t = \frac{1}{2} \ln\left( \frac{1 - \epsilon_t}{\epsilon_t} \right) \qquad (1) \]
as in the original description of AdaBoost given by Freund and Schapire [32]. More on choosing $\alpha_t$ follows in Section 3. The distribution $D_t$ is then updated using the rule shown in the figure. The final or combined classifier $H$ is a weighted majority vote of the $T$ base classifiers, in which $\alpha_t$ is the weight assigned to $h_t$.
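To make the pseudocode concrete, the following is a minimal sketch of AdaBoost in Python (using NumPy), with decision stumps standing in for the base learner; the stump learner, the helper names (train_stump, adaboost), and all defaults are illustrative assumptions rather than part of the original presentation:

import numpy as np

def train_stump(X, y, D):
    # Base learner (illustrative): pick the threshold stump with the
    # smallest weighted error under the current distribution D.
    m, n = X.shape
    best = None
    for j in range(n):
        for thresh in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] <= thresh, sign, -sign)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thresh, sign)
    err, j, thresh, sign = best
    return (lambda Z: np.where(Z[:, j] <= thresh, sign, -sign)), err

def adaboost(X, y, T=50):
    # AdaBoost as in Figure 1: maintain weights D_t, call the base learner,
    # set alpha_t by Eq. (1), and reweight so misclassified examples gain weight.
    m = len(y)
    D = np.full(m, 1.0 / m)                    # D_1(i) = 1/m
    hs, alphas = [], []
    for t in range(T):
        h, eps = train_stump(X, y, D)          # base classifier and its weighted error
        eps = np.clip(eps, 1e-10, 1 - 1e-10)   # guard against division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)  # Eq. (1)
        D = D * np.exp(-alpha * y * h(X))      # update rule from Figure 1
        D = D / D.sum()                        # normalize by Z_t
        hs.append(h)
        alphas.append(alpha)
    # Final classifier: sign of the weighted vote of the base classifiers.
    return lambda Z: np.sign(sum(a * h(Z) for a, h in zip(alphas, hs)))

Usage would look like clf = adaboost(X_train, y_train, T=100) followed by clf(X_test), with labels in {-1, +1}; any base learner that returns a classifier together with its weighted error could be substituted for train_stump.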
3 Analyzing the training error
The most basic theoretical property of AdaBoost concerns its ability to reduce the training error, i.e., the fraction of mistakes on the training set. Specifically, Schapire and Singer [70], in generalizing a theorem of Freund and Schapire [32], show that the training error of the final classifier $H$ is bounded as follows:
\[ \frac{1}{m} \left| \{ i : H(x_i) \neq y_i \} \right| \leq \frac{1}{m} \sum_i \exp(-y_i f(x_i)) = \prod_t Z_t \qquad (2) \]
where henceforth we define
\[ f(x) = \sum_t \alpha_t h_t(x) \qquad (3) \]
so that $H(x) = \mathrm{sign}(f(x))$. (For simplicity of notation, we write $\sum_i$ and $\sum_t$ as shorthand for $\sum_{i=1}^{m}$ and $\sum_{t=1}^{T}$, respectively.) The inequality follows from the fact that $e^{-y_i f(x_i)} \geq 1$ if $y_i \neq H(x_i)$. The equality can be proved straightforwardly by unraveling the recursive definition of $D_t$.
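The equality in Eq. (2) can also be checked numerically: unraveling the update in Figure 1 shows that the average of $\exp(-y_i f(x_i))$ equals the product of the normalization factors $Z_t$. The following small check is a sketch that assumes arbitrary (here random) binary base classifiers and labels, and simply compares the three quantities in Eq. (2):

import numpy as np

rng = np.random.default_rng(0)
m, T = 200, 20
y = rng.choice([-1, 1], size=m)
H = rng.choice([-1, 1], size=(T, m))         # row t holds the predictions h_t(x_i)

D = np.full(m, 1.0 / m)
Z_prod, f = 1.0, np.zeros(m)
for t in range(T):
    eps = D[H[t] != y].sum()                 # weighted error of h_t under D_t
    alpha = 0.5 * np.log((1 - eps) / eps)    # Eq. (1)
    w = D * np.exp(-alpha * y * H[t])
    Z = w.sum()                              # normalization factor Z_t
    D = w / Z
    Z_prod *= Z
    f += alpha * H[t]

train_err = np.mean(np.sign(f) != y)
exp_loss = np.mean(np.exp(-y * f))
print(train_err, "<=", exp_loss, "==", Z_prod)   # the three quantities in Eq. (2)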
Eq. (2) suggests that the training error can be reduced most rapidly (in a greedy way) by choosing $\alpha_t$ and $h_t$ on each round to minimize
\[ Z_t = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i)). \qquad (4) \]
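For binary $h_t$, this minimization has a closed form; the following short derivation, which is only sketched here and is not spelled out in the original text, shows how Eq. (1) arises. Splitting the sum in Eq. (4) according to whether $h_t$ is correct on example $i$ gives
\[ Z_t = (1 - \epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t}, \]
and setting the derivative with respect to $\alpha_t$ to zero yields
\[ \epsilon_t\, e^{\alpha_t} = (1 - \epsilon_t)\, e^{-\alpha_t} \quad \Longrightarrow \quad \alpha_t = \frac{1}{2} \ln\left( \frac{1 - \epsilon_t}{\epsilon_t} \right), \]
which is Eq. (1); substituting back gives $Z_t = 2\sqrt{\epsilon_t (1 - \epsilon_t)}$, the factor appearing in Eq. (5) below.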
In the case of binary classifiers, this leads to the choice of $\alpha_t$ given in Eq. (1) and gives a bound on the training error of
\[ \prod_t \left[ 2\sqrt{\epsilon_t (1 - \epsilon_t)} \right] = \prod_t \sqrt{1 - 4\gamma_t^2} \leq \exp\left( -2 \sum_t \gamma_t^2 \right) \qquad (5) \]
where we define $\gamma_t = 1/2 - \epsilon_t$. This bound was first proved by Freund and Schapire [32]. Thus, if each base classifier is slightly better than random so that $\gamma_t \geq \gamma$ for some $\gamma > 0$, then the training error drops exponentially fast in $T$, since the bound in Eq. (5) is at most $e^{-2\gamma^2 T}$. This bound, combined with the bounds on generalization error given below, proves that AdaBoost is indeed a boosting algorithm in the sense that it can efficiently convert a true weak learning algorithm (that can always generate a classifier with a weak edge for any distribution) into a strong learning algorithm (that can generate a classifier with an arbitrarily low error rate, given sufficient data).
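As a concrete, purely hypothetical illustration of this rate: if every base classifier has edge $\gamma_t \geq \gamma = 0.1$ (weighted error at most $0.4$) and the training set has $m = 1000$ examples, then the bound $e^{-2\gamma^2 T}$ drops below $1/m$ once $T > (\ln m)/(2\gamma^2) \approx 346$; since the training error is a multiple of $1/m$, it must then be exactly zero. These numbers can be checked with a few lines of Python (the values of $\gamma$ and $m$ are, again, only for illustration):

import numpy as np

gamma, m = 0.1, 1000                           # hypothetical edge and training-set size
print(np.ceil(np.log(m) / (2 * gamma**2)))     # about 346 rounds until the bound falls below 1/m
for T in (50, 200, 500):
    print(T, np.exp(-2 * gamma**2 * T))        # right-hand side of Eq. (5) after T rounds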
Eq. (2) points to the fact that, at heart, AdaBoost is a procedure for finding a linear combination $f$ of base classifiers which attempts to minimize
\[ \sum_i \exp(-y_i f(x_i)) = \sum_i \exp\left( -y_i \sum_t \alpha_t h_t(x_i) \right). \qquad (6) \]
Essentially, on each round, AdaBoost chooses $h_t$ (by calling the base learner) and then sets $\alpha_t$ to add one more term to the accumulating weighted sum of base classifiers in such a way that the sum of exponentials above will be maximally reduced. In other words, AdaBoost is doing a kind of steepest descent search to minimize Eq. (6) where the search is constrained at each step to follow coordinate directions (where we identify coordinates with the weights assigned to base classifiers). This view of boosting and its generalization are examined in considerable detail by Duffy and Helmbold [23], Mason et al. [51, 52] and Friedman [35]. See also Section 6.
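This coordinate-descent interpretation can be checked directly: freeze a current combination $f$, take one candidate base classifier $h$, and search over the step size $\alpha$; the exactly optimal step is the $\alpha_t$ of Eq. (1) (with $\epsilon_t$ measured under the weights $\exp(-y_i f(x_i))$), and it shrinks the exponential loss of Eq. (6) by the factor $Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)}$. The sketch below uses synthetic data and illustrative variable names; it is a numerical check of this identity, not part of the original presentation:

import numpy as np

rng = np.random.default_rng(1)
m = 300
y = rng.choice([-1, 1], size=m)
h = rng.choice([-1, 1], size=m)        # predictions of one candidate base classifier
f = 0.3 * rng.normal(size=m)           # some current weighted combination f(x_i)

w = np.exp(-y * f)                     # exponential-loss weights; D_t(i) = w_i / w.sum()
eps = w[h != y].sum() / w.sum()        # weighted error of h under D_t
alpha = 0.5 * np.log((1 - eps) / eps)  # Eq. (1)

old_loss = w.sum()
new_loss = np.sum(w * np.exp(-alpha * y * h))        # exponential loss after f <- f + alpha*h
print(np.isclose(new_loss / old_loss, 2 * np.sqrt(eps * (1 - eps))))   # reduction factor is Z_t

# No other step size along this coordinate does better (Z_t is convex in alpha):
grid = np.linspace(alpha - 2, alpha + 2, 2001)
print(all(np.sum(w * np.exp(-a * y * h)) >= new_loss - 1e-9 for a in grid))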
Schapire and Singer [70] discuss the choice of $\alpha_t$ and $h_t$ in the case that $h_t$ is real-valued (rather than binary). In this case, $h_t(x)$ can be interpreted as a "confidence-rated prediction" in which the sign of $h_t(x)$ is the predicted label, while the magnitude $|h_t(x)|$ gives a measure of confidence. Here, Schapire and Singer advocate choosing $\alpha_t$ and $h_t$ so as to minimize $Z_t$ (Eq. (4)) on each round.
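When $h_t$ is real-valued, Eq. (4) generally has no closed-form minimizer like Eq. (1), but $Z_t$ is a convex function of $\alpha_t$, so a one-dimensional numerical search suffices. The sketch below chooses $\alpha_t$ by bisection on the derivative of $Z_t$; the bisection, the bracketing interval, and the synthetic confidence-rated predictions are all illustrative assumptions, not a method prescribed by the text:

import numpy as np

def choose_alpha(D, y, h_vals, lo=-10.0, hi=10.0, iters=100):
    # Minimize Z(alpha) = sum_i D(i) * exp(-alpha * y_i * h_t(x_i)) for a
    # real-valued (confidence-rated) base classifier. Z is convex in alpha,
    # so we bisect on its (increasing) derivative.
    def dZ(a):
        return np.sum(D * (-y * h_vals) * np.exp(-a * y * h_vals))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dZ(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Illustrative usage with synthetic confidence-rated predictions in [-1, 1]:
rng = np.random.default_rng(2)
m = 500
y = rng.choice([-1, 1], size=m)
h_vals = np.clip(0.6 * y + rng.normal(scale=0.8, size=m), -1, 1)  # sign ~ label, magnitude ~ confidence
D = np.full(m, 1.0 / m)
alpha = choose_alpha(D, y, h_vals)
print(alpha, np.sum(D * np.exp(-alpha * y * h_vals)))  # Z_t < 1 when h correlates with y under D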