
Online Policy Adaptation for Ensemble Classifiers
Christos Dimitrakakis and Samy Bengio
IDIAP, P.O. Box 592, CH-1920 Martigny, Switzerland
Abstract. Ensemble algorithms can improve the performance of a given learning algorithm through the combination of multiple base classifiers into an ensemble. In this paper, the idea of using an adaptive policy for training and combining the base classifiers is put forward. The effectiveness of this approach for online learning is demonstrated by experimental results on several UCI benchmark databases.
1 Introduction
The problem of pattern classification has been addressed in the past using supervised learning methods. In this context, a set of $N$ example patterns $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ is presented to the learning machine, which adapts its parameter vector so that, when input vector $x_i$ is presented to it, the machine outputs the corresponding class $y_i$. Let us denote the output of a learning machine for a particular vector $x_i$ as $h(x_i)$. The classification error for that particular example can be designated as $\ell_i = 1$ if $h(x_i) \neq y_i$ and $0$ otherwise. Thus, the classification error for the set of examples $D$ can be summarised as the empirical error $L = \sum_i \ell_i / N$. If $D$ is a sufficiently large representative sample taken from a distribution $\mathcal{D}$, the generalisation error $\bar{L} = \int p_{\mathcal{D}}(x)\, \ell(x)\, dx$ would be close to $L$. In practice, however, the training set provides limited sampling of the distribution $\mathcal{D}$, leading to problems such as overfitting. Adding the effects of the classifier's inherent bias and variance, we will have $\bar{L} > L$.
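As a small illustration of the empirical error defined above, the following Python sketch computes $L$ on a labelled set; the `classifier`, `X` and `y` arguments are hypothetical names, not taken from the paper:

```python
import numpy as np

def empirical_error(classifier, X, y):
    """Empirical error L = (1/N) * sum_i 1[h(x_i) != y_i] over a labelled set."""
    predictions = np.array([classifier(x) for x in X])
    return float(np.mean(predictions != np.asarray(y)))
```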
Since the generalisation error cannot be directly observed, it has been common to use a part of the training data for validation in order to estimate it. This has led to the development of techniques mainly aimed at reducing overfitting caused by limited sampling, such as early stopping and K-fold cross-validation.
Another possible solution is offered by ensemble methods, such as the mixtures of experts (MOE) architecture [7], bagging [4] and boosting [6]. The boosting algorithm AdaBoost has been shown to significantly outperform other ensemble techniques.*

* This work is supported by the Swiss National Science Foundation through the National Centre of Competence in Research on "Interactive Multimodal Information Management".

Theoretical results explaining the effectiveness of AdaBoost relate it to the margin of classification [11]. The margin distribution for the two-class case can be defined as:
$$\mathrm{margin}_f(x, y) = y f(x),$$

where $x \in X$, $y \in \{-1, 1\}$ and $f : X \to [-1, 1]$. In general, the hypothesis $h(x)$ can be derived from $f(x)$ by setting $h(x) = \mathrm{sign}(f(x))$. In this case, $|f(x)|$ can be interpreted as the confidence in the label prediction. For the multi-class case, let $f_y(x)$ be the model's estimate of the probability of class $y$ given input $x$. In this case the margin is defined as:

$$\mathrm{margin}_f(x, y) = f_y(x) - \max_{y' \neq y} f_{y'}(x). \tag{1}$$
Thus the margin can serve as a measure of how far away from the threshold classification decisions are made. A particular measure is the minimum margin over the set:

$$\mathrm{margin}_{\min}(D) = \min_{(x, y) \in D} \mathrm{margin}_f(x, y).$$
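To make these definitions concrete, the following Python sketch computes the multi-class margin of equation (1) for a single example and the minimum margin over a set, given class-probability estimates $f_y(x)$; the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def multiclass_margin(probs, y):
    """Margin f_y(x) - max_{y' != y} f_{y'}(x) for a single example.

    probs: 1-D array of class-probability estimates f_{y'}(x).
    y:     index of the correct class.
    """
    others = np.delete(probs, y)          # estimates for all incorrect classes
    return probs[y] - others.max()

def minimum_margin(prob_matrix, labels):
    """Minimum margin over a set D (one row of prob_matrix per example)."""
    return min(multiclass_margin(p, y) for p, y in zip(prob_matrix, labels))
```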
It is argued [11] that AdaBoost is indirectly maximising this margin, leading to more robust performance. Although there exist counterexamples for which the minimum margin is not an adequate predictor of generalisation [5], attempts to apply algorithms that directly maximise the margin have obtained some success [10, 9].
In this work the possibility of using an adaptive rather than a fixed policy for training and combining base classifiers is investigated. The field of reinforcement learning [12] provides natural candidates for use in adaptive policies. In particular, the policy is adapted using Q-learning [13], a method that improves a policy through the iterative approximation of an evaluation function Q. Previously, Q-learning had been used in a similar mixture model applied to a control task [1]. An Expectation-Maximisation based mixtures of experts (MOE) algorithm for supervised learning was presented in [8]. In this paper, we attempt to solve the same task as in the standard MOE model, but through the use of reinforcement learning rather than expectation maximisation techniques.
The rest of the paper is organised as follows. The framework of Reinforcement Learning (RL) is introduced in Section 2. Section 2.1 outlines how the RL methods are employed in this work and describes how the system is implemented. Experiments are described in Section 3, followed by conclusions and suggestions for future research.
2 General Architecture
The RL classifier ensemble consists of a set of $n$ base classifiers, or experts, $E = \{e_1, e_2, \ldots, e_n\}$, and a controlling agent that selects the experts to make classification decisions and to train on particular examples. The specific RL algorithm employed in this work is outlined below. The following section describes in what manner it was used to control the classifier ensemble.
For the controlling agent we define a set of states $s \in S$ and a set of actions $a \in A$. At each time step $t$, the agent is at state $s_t = s$ and chooses action $a_t = a$. After the action is taken, the agent receives a reward $r_t$ and it enters a new state $s_{t+1} = s'$. A policy $\pi : (S, A) \to [0, 1]$ is defined as a set of probabilities

$$\pi = \{\, p(a \mid s) : (s, a) \in (S, A) \,\}$$
for selecting an action $a$ given the state $s$. The objective is to find the policy that maximises the discounted future return of the system, starting at time $t$, which is defined as:

$$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},$$
where $\gamma \in [0, 1)$ is a discount factor. This can be achieved by the iterative application of two steps. First, we estimate the return of actions under the current policy. More specifically, we define $Q^{\pi} : (S, A) \to \mathbb{R}$ as the expected return of taking action $a$ when being at state $s$ at time $t$ and following $\pi$ thereafter:

$$Q^{\pi}(s, a) = E\!\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s,\, a_t = a \right].$$
$Q^{\pi}$ itself is unknown and we maintain instead an estimate $Q$ for each state-action pair. Herein we employ the Q-learning update when action $a_j$ is selected in state $s$:

$$Q(s, a_j) \leftarrow Q(s, a_j) + \alpha \left( r + \gamma \max_i Q(s', a_i) - Q(s, a_j) \right), \quad \alpha > 0. \tag{2}$$
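For illustration only, a minimal tabular version of update (2) might look as follows; the paper itself uses a neural-network approximation of $Q$, described in Section 2.1:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply update (2) to a tabular estimate Q[state, action] and return it."""
    td_error = r + gamma * Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * td_error
    return Q
```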
The second step involves deriving a policy $\pi$ from the updated estimates $Q$. This can be done using the evaluations $Q(s, a)$ either deterministically, by always selecting the action $a_j$ with the largest expected return, or stochastically. $\epsilon$-greedy action selection selects the highest evaluated action with probability $(1 - \epsilon)$, with $\epsilon \in [0, 1]$, and otherwise selects a random action. Softmax action selection selects action $a_j$ with probability $e^{Q(s, a_j)} / \sum_i e^{Q(s, a_i)}$. Stochastic action selection is in general necessary so that all state-action pairs are sampled frequently enough for us to have accurate estimates of their expected return.
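The two stochastic selection rules could be sketched as follows (a simplified illustration with a unit softmax temperature, as in the text; the function names are not from the paper):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick the highest-valued action with probability 1 - epsilon, else a random one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax_selection(q_values, rng):
    """Pick action a_j with probability exp(Q(s, a_j)) / sum_i exp(Q(s, a_i))."""
    z = np.exp(q_values - np.max(q_values))   # subtract the max for numerical stability
    return int(rng.choice(len(q_values), p=z / z.sum()))
```

Both take the vector of current estimates $Q(s, a_i)$ and a NumPy random generator (e.g. `np.random.default_rng()`).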
2.1 Implementation
We employ an architecture with $n$ experts, implemented as multi-layer perceptrons (MLPs), and a further MLP with $n$ outputs and parameters $\theta$, which acts as the controlling agent. At each time step $t$ a new example $x$ is presented to the ensemble and each expert $e_i$ emits a decision $h_i(x)$. The state space of the controlling agent is $S \equiv X$, the same as the classifiers' input space. Its outputs approximate $Q(s, a_j)$ in order to select actions from $A$. We examine the case in which each action $a_j$ corresponds to selecting expert $e_j$ for training on the current example. The decisions of the experts themselves can be combined with a weighted sum $f(x) = \sum_i w_i h_i(x)$, where $w_i = e^{Q(x, a_i)} / \sum_j e^{Q(x, a_j)}$. Alternatively, we can make hard decisions by setting $f(x) = h_j(x)$, where $j = \arg\max_i Q(s, a_i)$. The classification decision results in a return $r \in \{0, 1\}$, which is $1$ if $f(x) = y$ and $0$ otherwise.
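A sketch of the two combination schemes, assuming the expert decisions $h_i(x)$ and the controller's Q-values for $x$ are already available (names are illustrative, not from the paper):

```python
import numpy as np

def combine_soft(expert_outputs, q_values):
    """Weighted sum f(x) = sum_i w_i h_i(x), with w_i the softmax of Q(x, a_i)."""
    z = np.exp(q_values - np.max(q_values))
    weights = z / z.sum()
    return weights @ np.asarray(expert_outputs)

def combine_hard(expert_outputs, q_values):
    """Hard decision f(x) = h_j(x), where j = argmax_i Q(x, a_i)."""
    return expert_outputs[int(np.argmax(q_values))]
```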
The Q-learning update remains essentially the same as in (2) but, because of the parameterised representation, we perform gradient descent to update our estimates, with the back-propagated error being $\delta = r + \gamma \max_i Q(s', a_i) - Q(s, a_j)$ and learning rate $\alpha > 0$. The algorithm is implemented as follows (a code sketch of the loop is given after the list):

1. Select example $x_t$ randomly from $X$.
2. Given $s = x_t$, choose $a_j \in A$ according to a policy derived from $Q$ (for example using $\epsilon$-greedy action selection).
3. Take action $a_j$, observe $r$ and the next state $s' = x_{t+1}$, chosen randomly from $X$.
4. Calculate $\delta \leftarrow r + \gamma \max_i Q(s', a_i) - Q(s, a_j)$.
5. $\theta \leftarrow \theta + \alpha \delta \nabla_{\theta} Q(s, a_j)$.
6. $s \leftarrow s'$.
7. Loop to 2, unless the termination condition is met.
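A highly simplified sketch of one pass through steps 1-7. A linear approximator stands in for the MLP controller, the experts are assumed to expose scikit-learn-style `partial_fit`/`predict` methods, and the reward follows the hard combination rule; all names here are hypothetical, not from the paper:

```python
import numpy as np

def rl_ensemble_step(theta, experts, X, Y, rng, alpha=0.01, gamma=0.0, epsilon=0.1):
    """One pass of steps 1-7 with a linear controller Q(s, a) = theta[a] . s."""
    t = int(rng.integers(len(X)))                 # 1. pick example x_t at random
    s, y = X[t], Y[t]
    q = theta @ s                                 #    Q(s, a_i) for every expert
    if rng.random() < epsilon:                    # 2. epsilon-greedy choice of expert
        a = int(rng.integers(len(experts)))
    else:
        a = int(np.argmax(q))
    experts[a].partial_fit([s], [y])              # 3. train the selected expert on (x_t, y_t)
    j = int(np.argmax(q))                         #    hard combination: f(x) = h_j(x)
    r = 1.0 if experts[j].predict([s])[0] == y else 0.0
    s_next = X[int(rng.integers(len(X)))]         #    next state: another random example
    delta = r + gamma * (theta @ s_next).max() - q[a]   # 4. back-propagated error
    theta[a] += alpha * delta * s                 # 5. gradient step on Q(s, a_j)
    return theta                                  # 6.-7. the caller repeats with s <- s'
```

In the paper both the experts and the controller are MLPs trained by gradient descent; the linear controller above only illustrates the flow of the update.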
3 Experimental results

A set of experiments has been performed, in order to evaluate the effectiveness of this approach, on 9 datasets that are available from the UCI Machine Learning Repository [3]. For each dataset, cross-validation was used to select the number of hidden units for the base classifier. Each classifier was trained for 100 iterations and a learning rate of 0.01 was used. The discount parameter for the controlling agent was set to 0.¹ The results reported here are for $\epsilon$-greedy action selection and for the hard combination method. Results with softmax action selection and weighted combination are not significantly different.
A comparison was made between the RL-controlled mixture, a single MLP, the Mixture of Experts and AdaBoost using MLPs. As can be seen in Table 1, the ensembles generally manage to improve test performance compared to the base classifier, with the RL mixture outperforming AdaBoost and MOE 4 and 7 times out of 9 respectively. For each dataset we have also calculated the cumulative margin distribution resulting from equation (1). For the RL mixture there was a constant improvement in the distribution in most datasets when the number of experts was increased (cf. Figure 1), though this did not always result in an improvement in generalisation performance.

¹ The classification task is similar to an n-armed bandit problem, since the next state is not influenced by the agent's actions. However, it is more accurately described as a partially observable process, since the parameters of the classifiers constitute a state which changes depending on the agent's actions.

[Figure 1: Cumulative margin distribution for RL on the ionosphere dataset, with an increasing number of experts (1, 2, 4, 8, 16 and 32 experts).]

Table 1: Classification error on the UCI breast, forest, heart, ionosphere, letter, optdigits, pendigits, spambase and vowel datasets using 32 experts.

Dataset      MLP     Boost   MOE     RL
breast       7.28%   1.21%   4.84%   2.43%
forest       32.8%   16.2%   31.6%   29.1%
heart        15.0%   16.2%   13.7%   15.0%
ionosphere   5.96%   5.96%   5.96%   3.08%
letter       4.10%   2.52%   4.55%   3.73%
optdigits    2.63%   1.42%   2.13%   2.3%
pendigits    2.72%   3.10%   2.80%   2.69%
spambase     8.33%   6.48%   7.75%   7.41%
vowel        56.1%   61.9%   68.1%   48.3%
4 Conclusions and Future Research
The aim of this work was to demonstrate the feasibility of using adaptive policies to train and combine a set of base classifiers. While this purpose has arguably been reached, there still remain some questions to be answered, such as under what conditions the margin of classification is increased when using this approach.
In the future we would like to explore the relationship between RL and EM techniques for training ensembles. Furthermore, it would be interesting to investigate the application of RL when the agent's state space is extended to include information about each expert. In this case the state would no longer consist of i.i.d. samples, so the agent's actions will affect its future state. However, perhaps the most promising direction in this domain would be to extend the set of possible actions so that more interesting policies can be developed. For such enlarged spaces it would appear necessary to replace action-value methods for policy improvement with direct gradient descent in policy space [2]. The latter methods have also been theoretically proven to converge in the case of multiple agents and are much more suitable for problems in partially observable environments and with large state-action spaces.
References
[1] C. Anderson and Z. Hong. Reinforcement learning with modular neural networks for control, 1994.
[2] Jonathan Baxter and Peter L. Bartlett. Reinforcement learning in POMDP's via direct gradient ascent. In Proc. 17th International Conference on Machine Learning, pages 41-48. Morgan Kaufmann, San Francisco, CA, 2000.
[3] C. L. Blake and C. J. Merz. UCI repository of machine learning databases. www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[4] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[5] Leo Breiman. Arcing the edge. Technical report, Department of Statistics, University of California, Berkeley, CA, 1997.
[6] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
[7] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79-87, 1991.
[8] Michael I. Jordan and Robert A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181-214, 1994.
[9] Yi Li and Philip M. Long. The relaxed online maximum margin algorithm. Machine Learning, 46(1/3):361, 2002.
[10] Llew Mason, Peter L. Bartlett, and Jonathan Baxter. Improved generalization through explicit optimization of margins. Machine Learning, 38(3):243, 2000.
[11] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. In Proc. 14th International Conference on Machine Learning, pages 322-330. Morgan Kaufmann, 1997.
[12] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[13] Christopher J. C. H. Watkins and Peter Dayan. Technical note: Q-learning. Machine Learning, 8:279, 1992.
