
Online Policy Adaptation for Ensemble Classifiers
Christos Dimitrakakis and Samy Bengio
IDIAP, P.O. Box 592, CH-1920 Martigny, Switzerland
Abstract. Ensemble algorithms can improve the performance of a given learning algorithm through the combination of multiple base classifiers into an ensemble. In this paper, the idea of using an adaptive policy for training and combining the base classifiers is put forward. The effectiveness of this approach for online learning is demonstrated by experimental results on several UCI benchmark databases.
1 Introduction
The problem of pattern classification has been addressed in the past using supervised learning methods. In this context, a set of $N$ example patterns $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ is presented to the learning machine, which adapts its parameter vector so that, when input vector $x_i$ is presented to it, the machine outputs the corresponding class $y_i$. Let us denote the output of a learning machine for a particular vector $x_i$ as $h(x_i)$. The classification error for that particular example can be designated as $\ell_i = 1$ if $h(x_i) \neq y_i$ and $0$ otherwise. Thus, the classification error for the set of examples $D$ can be summarised as the empirical error $L = \sum_i \ell_i / N$. If $D$ is a sufficiently large representative sample taken from a distribution $\mathcal{D}$, the generalisation error $\bar{L} = \int p_{\mathcal{D}}(x)\, \ell(x)\, dx$ would be close to $L$. In practice, however, the training set provides limited sampling of the distribution $\mathcal{D}$, leading to problems such as overfitting. Adding the effects of the classifier's inherent bias and variance, we will have $\bar{L} > L$.
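As a small illustration of the empirical error defined above, the following Python sketch computes $L$ on a labelled set; the `classifier`, `X` and `y` arguments are hypothetical names, not taken from the paper:

```python
import numpy as np

def empirical_error(classifier, X, y):
    """Empirical error L = (1/N) * sum_i 1[h(x_i) != y_i] over a labelled set."""
    predictions = np.array([classifier(x) for x in X])
    return float(np.mean(predictions != np.asarray(y)))
```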
Since the generalisation error cannot be directly observed, it has been common to use a part of the training data for validation in order to estimate it. This has led to the development of techniques mainly aimed at reducing overfitting caused by limited sampling, such as early stopping and K-fold cross-validation.
Another possible solution is offered by ensemble methods, such as the mixtures of experts (MOE) architecture [7], bagging [4] and boosting [6]. The boosting algorithm AdaBoost has been shown to significantly outperform other ensemble techniques.*

* This work is supported by the Swiss National Science Foundation through the National Centre of Competence in Research on "Interactive Multimodal Information Management".

Theoretical results explaining the effectiveness of AdaBoost relate it to the margin of classification [11]. The margin distribution for the two-class case can be defined as:
$$\mathrm{margin}_f(x, y) = y f(x),$$

where $x \in X$, $y \in \{-1, 1\}$ and $f : X \to [-1, 1]$. In general, the hypothesis $h(x)$ can be derived from $f(x)$ by setting $h(x) = \mathrm{sign}(f(x))$. In this case, $|f(x)|$ can be interpreted as the confidence in the label prediction. For the multi-class case, let $f_y(x)$ be the model's estimate of the probability of class $y$ given input $x$. In this case the margin is defined as:

$$\mathrm{margin}_f(x, y) = f_y(x) - \max_{y' \neq y} f_{y'}(x). \tag{1}$$
Thus the margin can serve as a measure of how far away from the threshold classification decisions are made. A particular measure is the minimum margin over the set:

$$\mathrm{margin}_{\min}(D) = \min_{(x, y) \in D} \mathrm{margin}_f(x, y).$$
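To make these definitions concrete, the following Python sketch computes the multi-class margin of equation (1) for a single example and the minimum margin over a set, given class-probability estimates $f_y(x)$; the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def multiclass_margin(probs, y):
    """Margin f_y(x) - max_{y' != y} f_{y'}(x) for a single example.

    probs: 1-D array of class-probability estimates f_{y'}(x).
    y:     index of the correct class.
    """
    others = np.delete(probs, y)          # estimates for all incorrect classes
    return probs[y] - others.max()

def minimum_margin(prob_matrix, labels):
    """Minimum margin over a set D (one row of prob_matrix per example)."""
    return min(multiclass_margin(p, y) for p, y in zip(prob_matrix, labels))
```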
It is argued [11] that AdaBoost is indirectly maximising this margin, leading to more robust performance. Although there exist counterexamples for which the minimum margin is not an adequate predictor of generalisation [5], attempts to apply algorithms that directly maximise the margin have obtained some success [10, 9].
In this work the possibility of using an adaptive rather than a fixed policy for training and combining base classifiers is investigated. The field of reinforcement learning [12] provides natural candidates for use in adaptive policies. In particular, the policy is adapted using Q-learning [13], a method that improves a policy through the iterative approximation of an evaluation function Q. Previously, Q-learning had been used in a similar mixture model applied to a control task [1]. An Expectation-Maximisation based mixtures of experts (MOE) algorithm for supervised learning was presented in [8]. In this paper, we attempt to solve the same task as in the standard MOE model, but through the use of reinforcement learning rather than expectation maximisation techniques.
The rest of the paper is organised as follows. The framework of Reinforcement Learning (RL) is introduced in Section 2. Section 2.1 outlines how the RL methods are employed in this work and describes how the system is implemented. Experiments are described in Section 3, followed by conclusions and suggestions for future research.
2 General Architecture
The RL classifier ensemble consists of a set of $n$ base classifiers, or experts, $E = \{e_1, e_2, \ldots, e_n\}$, and a controlling agent that selects the experts to make classification decisions and to train on particular examples. The specific RL algorithm employed in this work is outlined below. The following section describes in what manner it was used to control the classifier ensemble.
For the controlling agent we define a set of states $s \in S$ and a set of actions $a \in A$. At each time step $t$, the agent is at state $s_t = s$ and chooses action $a_t = a$. After the action is taken, the agent receives a reward $r_t$ and it enters a new state $s_{t+1} = s'$. A policy $\pi : (S, A) \to [0, 1]$ is defined as a set of probabilities

$$\pi = \{\, p(a \mid s) : (s, a) \in (S, A) \,\}$$
for selecting an action $a$ given the state $s$. The objective is to find the policy that maximises the discounted future return of the system, starting at time $t$, which is defined as:

$$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},$$
where $\gamma \in [0, 1)$ is a discount factor. This can be achieved by the iterative application of two steps. First, we estimate the return of actions under the current policy. More specifically, we define $Q^{\pi} : (S, A) \to \mathbb{R}$ as the expected return of taking action $a$ when being at state $s$ at time $t$ and following $\pi$ thereafter:

$$Q^{\pi}(s, a) = E\!\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s,\, a_t = a \right].$$
$Q^{\pi}$ itself is unknown and we maintain instead an estimate $Q$ for each state-action pair. Herein we employ the Q-learning update when action $a_j$ is selected in state $s$:

$$Q(s, a_j) \leftarrow Q(s, a_j) + \alpha \left( r + \gamma \max_i Q(s', a_i) - Q(s, a_j) \right), \quad \alpha > 0. \tag{2}$$
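For illustration only, a minimal tabular version of update (2) might look as follows; the paper itself uses a neural-network approximation of $Q$, described in Section 2.1:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply update (2) to a tabular estimate Q[state, action] and return it."""
    td_error = r + gamma * Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * td_error
    return Q
```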
The second step involves deriving a policy $\pi$ from the updated estimates $Q$. This can be done using the evaluations $Q(s, a)$ either deterministically, by always selecting the action $a_j$ with the largest expected return, or stochastically. $\epsilon$-greedy action selection selects the highest evaluated action with probability $(1 - \epsilon)$, with $\epsilon \in [0, 1]$, and otherwise selects a random action. Softmax action selection selects action $a_j$ with probability $e^{Q(s, a_j)} / \sum_i e^{Q(s, a_i)}$. Stochastic action selection is in general necessary so that all state-action pairs are sampled frequently enough for us to have accurate estimates of their expected return.
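The two stochastic selection rules could be sketched as follows (a simplified illustration with a unit softmax temperature, as in the text; the function names are not from the paper):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick the highest-valued action with probability 1 - epsilon, else a random one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax_selection(q_values, rng):
    """Pick action a_j with probability exp(Q(s, a_j)) / sum_i exp(Q(s, a_i))."""
    z = np.exp(q_values - np.max(q_values))   # subtract the max for numerical stability
    return int(rng.choice(len(q_values), p=z / z.sum()))
```

Both take the vector of current estimates $Q(s, a_i)$ and a NumPy random generator (e.g. `np.random.default_rng()`).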
2.1 Implementation
We employ an architecture with $n$ experts, implemented as multi-layer perceptrons (MLPs), and a further MLP with $n$ outputs and parameters $\theta$, which acts as the controlling agent. At each time step $t$ a new example $x$ is presented to the ensemble and each expert $e_i$ emits a decision $h_i(x)$. The state space of the controlling agent is $S \equiv X$, the same as the classifiers' input space. Its outputs approximate $Q(s, a_j)$ in order to select actions from $A$. We examine the case in which each action $a_j$ corresponds to selecting expert $e_j$ for training on the current example. The decisions of the experts themselves can be combined with a weighted sum $f(x) = \sum_i w_i h_i(x)$, where $w_i = e^{Q(x, a_i)} / \sum_j e^{Q(x, a_j)}$. Alternatively, we can make hard decisions by setting $f(x) = h_j(x)$, where $j = \arg\max_i Q(s, a_i)$. The classification decision results in a return $r \in \{0, 1\}$, which is $1$ if $f(x) = y$ and $0$ otherwise.
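A sketch of the two combination schemes, assuming the expert decisions $h_i(x)$ and the controller's Q-values for $x$ are already available (names are illustrative, not from the paper):

```python
import numpy as np

def combine_soft(expert_outputs, q_values):
    """Weighted sum f(x) = sum_i w_i h_i(x), with w_i the softmax of Q(x, a_i)."""
    z = np.exp(q_values - np.max(q_values))
    weights = z / z.sum()
    return weights @ np.asarray(expert_outputs)

def combine_hard(expert_outputs, q_values):
    """Hard decision f(x) = h_j(x), where j = argmax_i Q(x, a_i)."""
    return expert_outputs[int(np.argmax(q_values))]
```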
The Q-learning update remains essentially the same as in (2) but, because of the parameterised representation, we perform gradient descent to update our estimates, with the back-propagated error being $\delta = r + \gamma \max_i Q(s', a_i) - Q(s, a_j)$ and learning rate $\alpha > 0$. The algorithm is implemented as follows (a code sketch of the loop is given after the list):

1. Select example $x_t$ randomly from $X$.
2. Given $s = x_t$, choose $a_j \in A$ according to a policy derived from $Q$ (for example using $\epsilon$-greedy action selection).
3. Take action $a_j$, observe $r$ and the next state $s' = x_{t+1}$, chosen randomly from $X$.
4. Calculate $\delta \leftarrow r + \gamma \max_i Q(s', a_i) - Q(s, a_j)$.
5. $\theta \leftarrow \theta + \alpha \delta \nabla_{\theta} Q(s, a_j)$.
6. $s \leftarrow s'$.
7. Loop to 2, unless the termination condition is met.
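A highly simplified sketch of one pass through steps 1-7. A linear approximator stands in for the MLP controller, the experts are assumed to expose scikit-learn-style `partial_fit`/`predict` methods, and the reward follows the hard combination rule; all names here are hypothetical, not from the paper:

```python
import numpy as np

def rl_ensemble_step(theta, experts, X, Y, rng, alpha=0.01, gamma=0.0, epsilon=0.1):
    """One pass of steps 1-7 with a linear controller Q(s, a) = theta[a] . s."""
    t = int(rng.integers(len(X)))                 # 1. pick example x_t at random
    s, y = X[t], Y[t]
    q = theta @ s                                 #    Q(s, a_i) for every expert
    if rng.random() < epsilon:                    # 2. epsilon-greedy choice of expert
        a = int(rng.integers(len(experts)))
    else:
        a = int(np.argmax(q))
    experts[a].partial_fit([s], [y])              # 3. train the selected expert on (x_t, y_t)
    j = int(np.argmax(q))                         #    hard combination: f(x) = h_j(x)
    r = 1.0 if experts[j].predict([s])[0] == y else 0.0
    s_next = X[int(rng.integers(len(X)))]         #    next state: another random example
    delta = r + gamma * (theta @ s_next).max() - q[a]   # 4. back-propagated error
    theta[a] += alpha * delta * s                 # 5. gradient step on Q(s, a_j)
    return theta                                  # 6.-7. the caller repeats with s <- s'
```

In the paper both the experts and the controller are MLPs trained by gradient descent; the linear controller above only illustrates the flow of the update.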
3 Experimental results

A set of experiments has been performed, in order to evaluate the effectiveness of this approach, on 9 datasets that are available from the UCI Machine Learning Repository [3]. For each dataset, cross-validation was used to select the number of hidden units for the base classifier. Each classifier was trained for 100 iterations and a learning rate of 0.01 was used. The discount parameter for the controlling agent was set to 0.¹ The results reported here are for $\epsilon$-greedy action selection and for the hard combination method. Results with softmax action selection and weighted combination are not significantly different.
A comparison was made between the RL-controlled mixture, a single MLP, the Mixture of Experts and AdaBoost using MLPs. As can be seen in Table 1, the ensembles generally manage to improve test performance compared to the base classifier, with the RL mixture outperforming AdaBoost and MOE 4 and 7 times out of 9 respectively. For each dataset we have also calculated the cumulative margin distribution resulting from equation (1). For the RL mixture there was a constant improvement in the distribution in most datasets when the number of experts was increased (cf. Figure 1), though this did not always result in an improvement in generalisation performance.

¹ The classification task is similar to an n-armed bandit problem, since the next state is not influenced by the agent's actions. However, it is more accurately described as a partially observable process, since the parameters of the classifiers constitute a state which changes depending on the agent's actions.

[Figure 1: Cumulative margin distribution for RL on the ionosphere dataset, with an increasing number of experts (1, 2, 4, 8, 16 and 32 experts).]

Table 1: Classification error on the UCI breast, forest, heart, ionosphere, letter, optdigits, pendigits, spambase and vowel datasets using 32 experts.

Dataset      MLP     Boost   MOE     RL
breast       7.28%   1.21%   4.84%   2.43%
forest       32.8%   16.2%   31.6%   29.1%
heart        15.0%   16.2%   13.7%   15.0%
ionosphere   5.96%   5.96%   5.96%   3.08%
letter       4.10%   2.52%   4.55%   3.73%
optdigits    2.63%   1.42%   2.13%   2.3%
pendigits    2.72%   3.10%   2.80%   2.69%
spambase     8.33%   6.48%   7.75%   7.41%
vowel        56.1%   61.9%   68.1%   48.3%
4 Conclusions and Future Research
The aim of this work was to demonstrate the feasibility of using adaptive policies to train and combine a set of base classifiers. While this purpose has arguably been reached, there still remain some questions to be answered, such as under what conditions the margin of classification is increased when using this approach.
In the future we would like to explore the relationship between RL and EM techniques for training ensembles. Furthermore, it would be interesting to investigate the application of RL when the agent's state space is extended to include information about each expert. In this case the state would no longer consist of i.i.d. samples, so the agent's actions will affect its future state. However, perhaps the most promising direction in this domain would be to extend the set of possible actions so that more interesting policies can be developed. For such enlarged spaces it would appear necessary to replace action-value methods for policy improvement with direct gradient descent in policy space [2]. The latter methods have also been theoretically proven to converge in the case of multiple agents and are much more suitable for problems in partially observable environments and with large state-action spaces.
References
[1] C. Anderson and Z. Hong. Reinforcement learning with modular neural networks for control, 1994.
[2] Jonathan Baxter and Peter L. Bartlett. Reinforcement learning in POMDP's via direct gradient ascent. In Proc. 17th International Conference on Machine Learning, pages 41-48. Morgan Kaufmann, San Francisco, CA, 2000.
[3] C. L. Blake and C. J. Merz. UCI repository of machine learning databases. www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[4] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[5] Leo Breiman. Arcing the edge. Technical report, Department of Statistics, University of California, Berkeley, CA, 1997.
[6] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
[7] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79-87, 1991.
[8] Michael I. Jordan and Robert A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181-214, 1994.
[9] Yi Li and Philip M. Long. The relaxed online maximum margin algorithm. Machine Learning, 46(1/3):361, 2002.
[10] Llew Mason, Peter L. Bartlett, and Jonathan Baxter. Improved generalization through explicit optimization of margins. Machine Learning, 38(3):243, 2000.
[11] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. In Proc. 14th International Conference on Machine Learning, pages 322-330. Morgan Kaufmann, 1997.
[12] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[13] Christopher J. C. H. Watkins and Peter Dayan. Technical note: Q-learning. Machine Learning, 8:279, 1992.
