Bandit based Monte-Carlo Planning
Levente Kocsis and Csaba Szepesvári
Computer and Automation Research Institute of the
Hungarian Academy of Sciences, Kende u. 13-17, 1111 Budapest, Hungary
kocsis@sztaki.hu
Abstract. For large state-space Markovian Decision Problems, Monte-Carlo planning is one of the few viable approaches to find near-optimal solutions. In this paper we introduce a new algorithm, UCT, that applies bandit ideas to guide Monte-Carlo planning. In finite-horizon or discounted MDPs the algorithm is shown to be consistent and finite sample bounds are derived on the estimation error due to sampling. Experimental results show that in several domains, UCT is significantly more efficient than its alternatives.
1 Introduction
Consider the problem of finding a near-optimal action in large state-space Markovian Decision Problems (MDPs) under the assumption that a generative model of the MDP is available. One of the few viable approaches is to carry out sampling based lookahead search, as proposed by Kearns et al. [8], whose sparse lookahead search procedure builds a tree with its nodes labelled by either states or state-action pairs in an alternating manner, and the root corresponding to the initial state from where planning is initiated. Each node labelled by a state is followed in the tree by a fixed number of nodes associated with the actions available at that state, whilst each corresponding state-action labelled node is followed by a fixed number of state-labelled nodes sampled using the generative model of the MDP. During sampling, the sampled rewards are stored with the edges connecting state-action nodes and state nodes. The tree is built in a stage-wise manner, from the root to the leaves. Its depth is fixed. The computation of the values of the actions at the initial state happens from the leaves by propagating the values up in the tree: the value of a state-action labelled node is computed based on the average of the sum of the rewards along the edges originating at the node and the values at the corresponding successor nodes, whilst the value of a state node is computed by taking the maximum of the values of its children. Kearns et al. showed that in order to find an action at the initial state whose value is within the $\epsilon$-vicinity of that of the best, for discounted MDPs with discount factor $0 < \gamma < 1$, $K$ actions and uniformly bounded rewards, fixed size trees suffice regardless of the size of the state-space [8]. In particular, the depth of the tree is proportional to $\frac{1}{1-\gamma}\log\frac{1}{\epsilon(1-\gamma)}$, whilst its width is proportional to $K/(\epsilon(1-\gamma))$.
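To get a sense of why, as the next paragraph notes, this amount of work can be overwhelming in practice, the following back-of-the-envelope sketch in Python plugs illustrative numbers into the depth and width expressions above (the proportionality constants are set to 1 here purely for illustration; they are not the constants of [8]):

import math

def sparse_tree_size(gamma, eps, K):
    """Rough size of a sparse lookahead tree with depth ~ (1/(1-gamma)) * log(1/(eps*(1-gamma)))
    and per-node branching ~ K/(eps*(1-gamma)); all proportionality constants set to 1."""
    depth = math.ceil(1.0 / (1.0 - gamma) * math.log(1.0 / (eps * (1.0 - gamma))))
    width = math.ceil(K / (eps * (1.0 - gamma)))
    return width ** depth  # number of leaves of a tree with this depth and branching factor

if __name__ == "__main__":
    print(f"{sparse_tree_size(gamma=0.9, eps=0.1, K=10):.3e}")

Even for $\gamma = 0.9$, $\epsilon = 0.1$ and $K = 10$ actions this rough count gives a tree with on the order of $10^{141}$ leaves.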
Although this result looks promising,¹ in practice the amount of work needed to compute just a single almost-optimal action at a given state can be overwhelmingly large. In this paper we are interested in improving the performance of this vanilla Monte-Carlo planning algorithm. In particular, we are interested in Monte-Carlo planning algorithms with two important characteristics: (1) small error probability if the algorithm is stopped prematurely, and (2) convergence to the best action if enough time is given.
Besides MDPs, we are also interested in game-tree search. Over the years, Monte-Carlo simulation based search algorithms have been used successfully in many non-deterministic and imperfect information games, including backgammon [14], poker [4] and Scrabble [12]. Recently, Monte-Carlo search proved to be competitive in deterministic games with large branching factors, viz. in Go [5]. For real-time strategy games, due to their enormous branching factors and stochasticity, Monte-Carlo simulation seems to be one of the few feasible approaches for planning [7]. Intriguingly, the Monte-Carlo search algorithms used by today's game programs use either uniform sampling of actions or some heuristic biasing of the action selection probabilities that comes with no guarantees.
The main idea of the algorithm proposed in this paper is to sample actions selectively. In order to motivate our approach, let us consider problems with a large number of actions and assume that the lookahead is carried out at a fixed depth $D$. If sampling can be restricted to, say, half of the actions at all stages then the overall work reduction is $(1/2)^D$. Hence, if one is able to identify a large subset of the suboptimal actions early in the sampling procedure then huge performance improvements can be expected.
By definition, an action is suboptimal for a given state if its value is less than the best of the action-values for the same state. Since action-values depend on the values of successor states, the problem boils down to getting the estimation error of the state-values for such states to decay fast. In order to achieve this, an efficient algorithm must balance between testing the alternatives that currently look best, so as to obtain precise estimates, and exploring currently suboptimal-looking alternatives, so as to ensure that no good alternatives are missed because of early estimation errors. Obviously, these criteria are contradictory, and the problem of finding the right balance is known as the exploration-exploitation dilemma. The most basic form of this dilemma shows up in multi-armed bandit problems [1].
The main idea in this paper is to apply a particular bandit algorithm, UCB1 (UCB stands for Upper Confidence Bounds), to rollout-based Monte-Carlo planning. The new algorithm, called UCT (UCB applied to trees), is described in Section 2. Theoretical results show that the new algorithm is consistent, whilst experimental results (Section 3) for artificial game domains (P-games) and the sailing domain (a specific MDP) studied earlier in a similar context by others [11] indicate that UCT has a significant performance advantage over its closest competitors.
¹ In fact, as also noted by [8], the bound might be unimprovable, though this still remains an open problem.
2 The UCT algorithm
2.1 Rollout-based planning
In this paper we consider Monte-Carlo planning algorithms that we call rollout-based. As opposed to the algorithm described in the introduction (stage-wise tree building), a rollout-based algorithm builds its lookahead tree by repeatedly sampling episodes from the initial state. An episode is a sequence of state-action-reward triplets that are obtained using the domain's generative model. The tree is built by adding the information gathered during an episode to it in an incremental manner.
The reason that we consider rollout-based algorithms is that they allow us to keep track of estimates of the actions' values at the sampled states encountered in earlier episodes. Hence, if some state is re-encountered then the estimated action-values can be used to bias the choice of what action to follow, potentially speeding up the convergence of the value estimates. If the portion of states that are encountered multiple times in the procedure is small then the performance of rollout-based sampling degenerates to that of vanilla (non-selective) Monte-Carlo planning. On the other hand, for domains where the set of successor states concentrates on a few states only, rollout-based algorithms implementing selective sampling might have an advantage over other methods.
The generic scheme of rollout-based Monte-Carlo planning is given in Figure 1. The algorithm iteratively generates episodes (line 3), and returns the action with the highest average observed long-term reward (line 5).² In procedure UpdateValue the total reward q is used to adjust the estimated value for the given state-action pair at the given depth, completed by increasing the counter that stores the number of visits of the state-action pair at the given depth. Episodes are generated by the search function that selects and effectuates actions recursively until some terminal condition is satisfied. This can be reaching a terminal state, or episodes can be cut at a certain depth (line 8). Alternatively, as suggested by Peret and Garcia [11] and motivated by iterative deepening, the search can be implemented in phases where in each phase the depth of the search is increased. An approximate way to implement iterative deepening, which we also follow in our experiments, is to stop the episodes with a probability that is inversely proportional to the number of visits to the state.
The effectiveness of the whole algorithm will crucially depend on how the actions are selected in line 9. In vanilla Monte-Carlo planning (referred to as MC in the following) the actions are sampled uniformly. The main contribution of the present paper is the introduction of a bandit algorithm for the implementation of the selective sampling of actions.
 1: function MonteCarloPlanning(state)
 2:   repeat
 3:     search(state, 0)
 4:   until Timeout
 5:   return bestAction(state, 0)

 6: function search(state, depth)
 7:   if Terminal(state) then return 0
 8:   if Leaf(state, depth) then return Evaluate(state)
 9:   action := selectAction(state, depth)
10:   (nextstate, reward) := simulateAction(state, action)
11:   q := reward + γ search(nextstate, depth + 1)
12:   UpdateValue(state, action, q, depth)
13:   return q

Fig. 1. The pseudocode of a generic Monte-Carlo planning algorithm.

² The function bestMove is trivial, and is omitted due to the lack of space.
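To make the scheme of Figure 1 concrete, here is a minimal runnable rendering in Python (an illustrative sketch, not the authors' implementation). It assumes a generative model supplied as a function simulate(state, action) returning a (next_state, reward) pair, a terminal(state) predicate, and a fixed global action set; selectAction samples uniformly, i.e. this is vanilla MC, and Section 2.3 replaces exactly this step:

import random
from collections import defaultdict

class MonteCarloPlanner:
    """Generic rollout-based Monte-Carlo planner (vanilla MC: uniform action sampling)."""

    def __init__(self, actions, simulate, terminal, gamma=1.0, max_depth=20):
        self.actions = actions           # actions assumed available in every state (assumption)
        self.simulate = simulate         # generative model: (state, action) -> (next_state, reward)
        self.terminal = terminal         # predicate: is `state` terminal?
        self.gamma = gamma
        self.max_depth = max_depth
        self.value = defaultdict(float)  # (state, action, depth) -> mean observed total reward
        self.count = defaultdict(int)    # (state, action, depth) -> number of visits

    def plan(self, state, n_episodes=1000):
        for _ in range(n_episodes):
            self.search(state, 0)
        # bestAction: highest average observed long-term reward at the root
        return max(self.actions, key=lambda a: self.value[(state, a, 0)])

    def search(self, state, depth):
        if self.terminal(state):
            return 0.0
        if depth >= self.max_depth:      # episode cut-off; could instead stop with a
            return 0.0                   # probability ~ 1/visits(state) (iterative-deepening style)
        action = self.select_action(state, depth)
        next_state, reward = self.simulate(state, action)
        q = reward + self.gamma * self.search(next_state, depth + 1)
        self.update_value(state, action, q, depth)
        return q

    def select_action(self, state, depth):
        return random.choice(self.actions)  # vanilla MC; UCT replaces this step (Section 2.3)

    def update_value(self, state, action, q, depth):
        key = (state, action, depth)
        self.count[key] += 1
        self.value[key] += (q - self.value[key]) / self.count[key]  # incremental mean

A usage would supply the domain's simulate and terminal functions and the action set, then call plan(initial_state).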
2.2 Stochastic bandit problems and UCB1

A bandit problem with $K$ arms (actions) is defined by the sequence of random payoffs $X_{it}$, $i = 1,\ldots,K$, $t \ge 1$, where each $i$ is the index of a gambling machine (the "arm" of a bandit). Successive plays of machine $i$ yield the payoffs $X_{i1}, X_{i2}, \ldots$. For simplicity, we shall assume that $X_{it}$ lies in the interval $[0,1]$. An allocation policy is a mapping that selects the next arm to be played based on the sequence of past selections and payoffs obtained. The expected regret of an allocation policy $A$ after $n$ plays is defined by
$$ R_n = \max_i \mathbb{E}\Big[\sum_{t=1}^{n} X_{it}\Big] - \mathbb{E}\Big[\sum_{j=1}^{K}\sum_{t=1}^{T_j(n)} X_{j,t}\Big], $$
where $I_t \in \{1,\ldots,K\}$ is the index of the arm selected at time $t$ by policy $A$, and where $T_i(t) = \sum_{s=1}^{t} \mathbb{I}(I_s = i)$ is the number of times arm $i$ was played up to time $t$ (including $t$). Thus, the regret is the loss caused by the policy not always playing the best machine. For a large class of payoff distributions, there is no policy whose regret would grow slower than $O(\ln n)$ [10]. For such payoff distributions, a policy is said to resolve the exploration-exploitation tradeoff if its regret growth rate is within a constant factor of the best possible regret rate.
Algorithm UCB1, whose finite-time regret is studied in detail by [1], is a simple yet attractive algorithm that succeeds in resolving the exploration-exploitation tradeoff. It keeps track of the average rewards $\bar{X}_{i,T_i(t-1)}$ for all the arms and chooses the arm with the best upper confidence bound:
$$ I_t = \operatorname*{argmax}_{i \in \{1,\ldots,K\}} \Big\{ \bar{X}_{i,T_i(t-1)} + c_{t-1,T_i(t-1)} \Big\}, \qquad (1) $$
where $c_{t,s}$ is a bias sequence chosen to be
$$ c_{t,s} = \sqrt{\frac{2 \ln t}{s}}. \qquad (2) $$
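To make the selection rule (1) and (2) concrete, here is a minimal UCB1 sketch in Python (illustrative only; the pull function standing in for the gambling machines and the Bernoulli test arms are assumptions of this example, not part of the paper):

import math
import random

def ucb1(pull, K, n_plays):
    """Run UCB1 for n_plays rounds over K arms; `pull(i)` returns a payoff in [0, 1]."""
    counts = [0] * K            # T_i(t): number of times arm i was played so far
    means = [0.0] * K           # running average reward of arm i
    for t in range(1, n_plays + 1):
        if t <= K:
            arm = t - 1         # play every arm once to initialise the estimates
        else:
            # argmax over mean + bias term c_{t-1, T_i(t-1)} = sqrt(2 ln(t-1) / T_i(t-1))
            arm = max(range(K),
                      key=lambda i: means[i] + math.sqrt(2.0 * math.log(t - 1) / counts[i]))
        x = pull(arm)
        counts[arm] += 1
        means[arm] += (x - means[arm]) / counts[arm]  # incremental mean update
    return means, counts

if __name__ == "__main__":
    # Two Bernoulli arms with success probabilities 0.4 and 0.6 (illustrative).
    probs = [0.4, 0.6]
    means, counts = ucb1(lambda i: float(random.random() < probs[i]), K=2, n_plays=10_000)
    print(means, counts)  # the better arm should accumulate the vast majority of the plays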
The bias sequence is such that if $X_{it}$ were independently and identically distributed then the inequalities
$$ \mathbb{P}\big( \bar{X}_{is} \ge \mu_i + c_{t,s} \big) \le t^{-4}, \qquad (3) $$
$$ \mathbb{P}\big( \bar{X}_{is} \le \mu_i - c_{t,s} \big) \le t^{-4} \qquad (4) $$
were satisfied. This follows from Hoeffding's inequality: for $s$ i.i.d. payoffs in $[0,1]$, $\mathbb{P}(\bar{X}_{is} \ge \mu_i + c_{t,s}) \le \exp(-2 s c_{t,s}^2) = \exp(-4\ln t) = t^{-4}$. In our case, UCB1 is used in the internal nodes to select the actions to be sampled next. Since for any given node the sampling probability of the actions at nodes below that node (in the tree) is changing, the payoff sequences experienced will drift in time. Hence, in UCT, the above expression for the bias terms $c_{t,s}$ needs to be replaced by a term that takes this drift of payoffs into account. One of our main results will show that despite this drift, bias terms of the form $c_{t,s} = 2 C_p \sqrt{\frac{\ln t}{s}}$ with appropriate constants $C_p$ can still be constructed for the payoff sequences experienced at any of the internal nodes such that the above tail inequalities are still satisfied.

2.3 The proposed algorithm
In UCT the action selection problem is treated as a separate multi-armed bandit for every (explored) internal node. The arms correspond to actions and the payoffs to the cumulated (discounted) rewards of the paths originating at the node. In particular, in state $s$, at depth $d$, the action that maximizes $Q_t(s,a,d) + c_{N_{s,d}(t),\,N_{s,a,d}(t)}$ is selected, where $Q_t(s,a,d)$ is the estimated value of action $a$ in state $s$ at depth $d$ and time $t$, $N_{s,d}(t)$ is the number of times state $s$ has been visited up to time $t$ at depth $d$, and $N_{s,a,d}(t)$ is the number of times action $a$ was selected when state $s$ has been visited, up to time $t$ at depth $d$.³
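As a sketch of how this selection rule can replace selectAction in the generic planner of Figure 1, consider the following illustrative Python rendering (not the authors' code; the exploration constant C_p is a free parameter here, and untried actions are given infinite priority so that the division by zero mentioned in footnote 3 is avoided):

import math
from collections import defaultdict

class UCTStats:
    """Per-(state, depth) bandit statistics implementing the UCT selection rule."""

    def __init__(self, c_p=1.0):
        self.c_p = c_p
        self.q = defaultdict(float)    # Q_t(s, a, d): estimated value of (s, a, d)
        self.n_sa = defaultdict(int)   # N_{s,a,d}(t): times action a was selected in (s, d)
        self.n_s = defaultdict(int)    # N_{s,d}(t):   visits of (s, d)

    def select_action(self, state, depth, actions):
        n_state = self.n_s[(state, depth)]

        def ucb(a):
            n = self.n_sa[(state, a, depth)]
            if n == 0:
                return float("inf")    # force each action to be tried once (avoids division by zero)
            bias = 2.0 * self.c_p * math.sqrt(math.log(n_state) / n)
            return self.q[(state, a, depth)] + bias

        return max(actions, key=ucb)

    def update_value(self, state, action, total_reward, depth):
        self.n_s[(state, depth)] += 1
        key = (state, action, depth)
        self.n_sa[key] += 1
        self.q[key] += (total_reward - self.q[key]) / self.n_sa[key]  # incremental mean

The bias term uses the drift-aware form $c_{t,s} = 2 C_p \sqrt{\ln t / s}$ introduced above; the value of $C_p$ is a design choice left open in this sketch.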
2.4 Theoretical analysis
The analysis is broken down into first analysing UCB1 for non-stationary bandit problems where the payoff sequences might drift, then showing that the payoff sequences experienced at the internal nodes of the tree satisfy the drift-conditions (see below), and finally proving the consistency of the whole procedure.
The so-called drift-conditions that we impose on the non-stationary payoff sequences are as follows. For simplicity, we assume that $0 \le X_{it} \le 1$. We assume that the expected values of the averages $\bar{X}_{in} = \frac{1}{n}\sum_{t=1}^{n} X_{it}$ converge. We let $\mu_{in} = \mathbb{E}[\bar{X}_{in}]$ and $\mu_i = \lim_{n\to\infty}\mu_{in}$. Further, we define $\delta_{in}$ by $\mu_{in} = \mu_i + \delta_{in}$ and assume that the tail inequalities (3), (4) are satisfied for $c_{t,s} = 2 C_p \sqrt{\frac{\ln t}{s}}$ with an appropriate constant $C_p > 0$. Throughout the analysis of non-stationary bandit problems we shall always assume, without explicitly stating it, that the drift-conditions are satisfied for the payoff sequences.
For the sake of simplicity we assume that there exists a single optimal action.⁴ Quantities related to this optimal arm shall be upper indexed by $*$: $\mu^*$, $T^*(t)$, $\bar{X}^*_t$, etc. Due to the lack of space the proofs of most of the results are omitted.
We let $\Delta_i = \mu^* - \mu_i$. We assume that $C_p$ is such that there exists an integer $N_0$ such that for $s \ge N_0$, $c_{ss} \ge 2|\delta_{is}|$ for any suboptimal arm $i$. Clearly, when UCT is applied in a tree then at the leaves $\delta_{is} = 0$ and this condition is automatically satisfied with $N_0 = 1$. For upper levels, we will argue by induction: by showing an upper bound on $\delta_{ts}$ for the lower levels, $C_p$ can be selected so as to make $N_0 < +\infty$.
Our first result is a generalization of Theorem 1 due to Auer et al. [1]. The proof closely follows this earlier proof.
³ The algorithm has to be implemented such that division by zero is avoided.
⁴ The generalization of the results to the case of multiple optimal arms follows easily.
Theorem 1. Consider UCB1 applied to a non-stationary problem. Let $T_i(n)$ denote the number of plays of arm $i$. Then if $i$ is the index of a suboptimal arm and $n > K$, then
$$ \mathbb{E}[T_i(n)] \le \frac{16 C_p^2 \ln n}{(\Delta_i/2)^2} + 2 N_0 + \frac{\pi^2}{3}. $$
At those internal nodes of the lookahead tree that are labelled by some state, the state values are estimated by averaging all the (cumulative) payoffs for the episodes starting from that node. These values are then used in calculating the value of the action leading to the given state. Hence, the rate of convergence of the bias of the estimated state-values will influence the rate of convergence of values further up in the tree. The next result, building on Theorem 1, gives a bound on this bias.

Theorem 2. Let $\bar{X}_n = \sum_{i=1}^{K} \frac{T_i(n)}{n}\, \bar{X}_{i,T_i(n)}$. Then
$$ \big| \mathbb{E}[\bar{X}_n] - \mu^* \big| \le |\delta^*_n| + O\!\left( \frac{K\,(C_p^2 \ln n + N_0)}{n} \right). \qquad (5) $$

UCB1 never stops exploring. This allows us to derive that the average rewards at the internal nodes concentrate quickly around their means. The following theorem shows that the number of times an arm is pulled can actually be lower bounded by the logarithm of the total number of trials:
Theorem 3 (Lower Bound). There exists some positive constant $\rho$ such that for all arms $i$ and all $n$, $T_i(n) \ge \lceil \rho \log(n) \rceil$.
Among the drift-conditions that we made on the payoff process was that the average payoffs concentrate around their means quickly. The following result shows that this property is kept intact; in particular, this result completes the proof that if the payoff processes further down in the tree satisfy the drift-conditions then the payoff processes higher up in the tree will satisfy the drift-conditions, too:

Theorem 4. Fix $\delta > 0$ and let $\Delta_n = 9\sqrt{2 n \ln(2/\delta)}$. The following bounds hold true provided that $n$ is sufficiently large:
$$ \mathbb{P}\big( n\bar{X}_n \ge n\,\mathbb{E}[\bar{X}_n] + \Delta_n \big) \le \delta, \qquad \mathbb{P}\big( n\bar{X}_n \le n\,\mathbb{E}[\bar{X}_n] - \Delta_n \big) \le \delta. $$
Finally, as we will be interested in the failure probability of the algorithm at the root, we prove the following result:
Theorem 5 (Convergence of Failure Probability). Let $\hat{I}_t = \operatorname*{argmax}_i \bar{X}_{i,T_i(t)}$. Then
$$ \mathbb{P}(\hat{I}_t \ne i^*) \le C \left( \frac{1}{t} \right)^{\frac{\rho}{2}\left( \frac{\min_{i \ne i^*} \Delta_i}{36} \right)^2} $$
with some constant $C$. In particular, it holds that $\lim_{t\to\infty} \mathbb{P}(\hat{I}_t \ne i^*) = 0$.

Our main result now follows:
Theorem 6. Consider a finite-horizon MDP with rewards scaled to lie in the $[0,1]$ interval. Let the horizon of the MDP be $D$, and the number of actions per state be $K$. Consider algorithm UCT such that the bias terms of UCB1 are multiplied by $D$. Then the bias of the estimated expected payoff, $\bar{X}_n$, is $O(\log(n)/n)$. Further, the failure probability at the root converges to zero at a polynomial rate as the number of episodes grows to infinity.
