Bandit based Monte-Carlo Planning
Levente Kocsis and Csaba Szepesvári
Computer and Automation Research Institute of the
Hungarian Academy of Sciences, Kende u. 13-17, 1111 Budapest, Hungary
kocsis@sztaki.hu
Abstract. For large state-space Markovian Decision Problems, Monte-Carlo planning is one of the few viable approaches to find near-optimal solutions. In this paper we introduce a new algorithm, UCT, that applies bandit ideas to guide Monte-Carlo planning. In finite-horizon or discounted MDPs the algorithm is shown to be consistent and finite sample bounds are derived on the estimation error due to sampling. Experimental results show that in several domains, UCT is significantly more efficient than its alternatives.
1 Introduction
Consider the problem of finding a near-optimal action in large state-space Markovian Decision Problems (MDPs) under the assumption that a generative model of the MDP is available. One of the few viable approaches is to carry out sampling based lookahead search, as proposed by Kearns et al. [8], whose sparse lookahead search procedure builds a tree with its nodes labelled by either states or state-action pairs in an alternating manner, and the root corresponding to the initial state from where planning is initiated. Each node labelled by a state is followed in the tree by a fixed number of nodes associated with the actions available at that state, whilst each corresponding state-action labelled node is followed by a fixed number of state-labelled nodes sampled using the generative model of the MDP. During sampling, the sampled rewards are stored with the edges connecting state-action nodes and state nodes. The tree is built in a stage-wise manner, from the root to the leaves. Its depth is fixed. The computation of the values of the actions at the initial state happens from the leaves by propagating the values up in the tree: the value of a state-action labelled node is computed based on the average of the sum of the rewards along the edges originating at the node and the values at the corresponding successor nodes, whilst the value of a state node is computed by taking the maximum of the values of its children. Kearns et al. showed that in order to find an action at the initial state whose value is within the $\epsilon$-vicinity of that of the best, for discounted MDPs with discount factor $0 < \gamma < 1$, $K$ actions and uniformly bounded rewards, fixed size trees suffice regardless of the size of the state-space [8]. In particular, the depth of the tree is proportional to $\frac{1}{1-\gamma}\log\frac{1}{\epsilon(1-\gamma)}$, whilst its width is proportional to $K/(\epsilon(1-\gamma))$.
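To get a sense of why, as the next paragraph notes, this amount of work can be overwhelming in practice, the following back-of-the-envelope sketch in Python plugs illustrative numbers into the depth and width expressions above (the proportionality constants are set to 1 here purely for illustration; they are not the constants of [8]):

import math

def sparse_tree_size(gamma, eps, K):
    """Rough size of a sparse lookahead tree with depth ~ (1/(1-gamma)) * log(1/(eps*(1-gamma)))
    and per-node branching ~ K/(eps*(1-gamma)); all proportionality constants set to 1."""
    depth = math.ceil(1.0 / (1.0 - gamma) * math.log(1.0 / (eps * (1.0 - gamma))))
    width = math.ceil(K / (eps * (1.0 - gamma)))
    return width ** depth  # number of leaves of a tree with this depth and branching factor

if __name__ == "__main__":
    print(f"{sparse_tree_size(gamma=0.9, eps=0.1, K=10):.3e}")

Even for $\gamma = 0.9$, $\epsilon = 0.1$ and $K = 10$ actions this rough count gives a tree with on the order of $10^{141}$ leaves.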
Although this result looks promising,¹ in practice the amount of work needed to compute just a single almost-optimal action at a given state can be overwhelmingly large. In this paper we are interested in improving the performance of this vanilla Monte-Carlo planning algorithm. In particular, we are interested in Monte-Carlo planning algorithms with two important characteristics: (1) small error probability if the algorithm is stopped prematurely, and (2) convergence to the best action if enough time is given.
Besides MDPs, we are also interested in game-tree search. Over the years, Monte-Carlo simulation based search algorithms have been used successfully in many non-deterministic and imperfect information games, including backgammon [14], poker [4] and Scrabble [12]. Recently, Monte-Carlo search proved to be competitive in deterministic games with large branching factors, viz. in Go [5]. For real-time strategy games, due to their enormous branching factors and stochasticity, Monte-Carlo simulation seems to be one of the few feasible approaches for planning [7]. Intriguingly, the Monte-Carlo search algorithms used by today's game programs use either uniform sampling of actions or some heuristic biasing of the action selection probabilities that comes with no guarantees.
The main idea of the algorithm proposed in this paper is to sample actions selectively. In order to motivate our approach, let us consider problems with a large number of actions and assume that the lookahead is carried out at a fixed depth $D$. If sampling can be restricted to, say, half of the actions at all stages then the overall work reduction is $(1/2)^D$. Hence, if one is able to identify a large subset of the suboptimal actions early in the sampling procedure then huge performance improvements can be expected.
By definition, an action is suboptimal for a given state if its value is less than the best of the action-values for the same state. Since action-values depend on the values of successor states, the problem boils down to getting the estimation error of the state-values for such states to decay fast. In order to achieve this, an efficient algorithm must balance between testing the alternatives that currently look best, so as to obtain precise estimates, and exploring currently suboptimal-looking alternatives, so as to ensure that no good alternatives are missed because of early estimation errors. Obviously, these criteria are contradictory, and the problem of finding the right balance is known as the exploration-exploitation dilemma. The most basic form of this dilemma shows up in multi-armed bandit problems [1].
The main idea in this paper is to apply a particular bandit algorithm, UCB1 (UCB stands for Upper Confidence Bounds), to rollout-based Monte-Carlo planning. The new algorithm, called UCT (UCB applied to trees), is described in Section 2. Theoretical results show that the new algorithm is consistent, whilst experimental results (Section 3) for artificial game domains (P-games) and the sailing domain (a specific MDP) studied earlier in a similar context by others [11] indicate that UCT has a significant performance advantage over its closest competitors.
¹ In fact, as also noted by [8], the bound might be unimprovable, though this still remains an open problem.
2 The UCT algorithm
2.1 Rollout-based planning
In this paper we consider Monte-Carlo planning algorithms that we call rollout-based. As opposed to the algorithm described in the introduction (stage-wise tree building), a rollout-based algorithm builds its lookahead tree by repeatedly sampling episodes from the initial state. An episode is a sequence of state-action-reward triplets that are obtained using the domain's generative model. The tree is built by adding the information gathered during an episode to it in an incremental manner.
The reason that we consider rollout-based algorithms is that they allow us to keep track of estimates of the actions' values at the sampled states encountered in earlier episodes. Hence, if some state is re-encountered then the estimated action-values can be used to bias the choice of what action to follow, potentially speeding up the convergence of the value estimates. If the portion of states that are encountered multiple times in the procedure is small then the performance of rollout-based sampling degenerates to that of vanilla (non-selective) Monte-Carlo planning. On the other hand, for domains where the set of successor states concentrates on a few states only, rollout-based algorithms implementing selective sampling might have an advantage over other methods.
The generic scheme of rollout-based Monte-Carlo planning is given in Figure 1. The algorithm iteratively generates episodes (line 3), and returns the action with the highest average observed long-term reward (line 5).² In procedure UpdateValue the total reward q is used to adjust the estimated value for the given state-action pair at the given depth, completed by increasing the counter that stores the number of visits of the state-action pair at the given depth. Episodes are generated by the search function that selects and effectuates actions recursively until some terminal condition is satisfied. This can be reaching a terminal state, or episodes can be cut at a certain depth (line 8). Alternatively, as suggested by Peret and Garcia [11] and motivated by iterative deepening, the search can be implemented in phases where in each phase the depth of the search is increased. An approximate way to implement iterative deepening, which we also follow in our experiments, is to stop the episodes with a probability that is inversely proportional to the number of visits to the state.
The effectiveness of the whole algorithm will crucially depend on how the actions are selected in line 9. In vanilla Monte-Carlo planning (referred to as MC in the following) the actions are sampled uniformly. The main contribution of the present paper is the introduction of a bandit algorithm for the implementation of the selective sampling of actions.
 1: function MonteCarloPlanning(state)
 2:   repeat
 3:     search(state, 0)
 4:   until Timeout
 5:   return bestAction(state, 0)

 6: function search(state, depth)
 7:   if Terminal(state) then return 0
 8:   if Leaf(state, depth) then return Evaluate(state)
 9:   action := selectAction(state, depth)
10:   (nextstate, reward) := simulateAction(state, action)
11:   q := reward + γ search(nextstate, depth + 1)
12:   UpdateValue(state, action, q, depth)
13:   return q

Fig. 1. The pseudocode of a generic Monte-Carlo planning algorithm.

² The function bestMove is trivial, and is omitted due to the lack of space.
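To make the scheme of Figure 1 concrete, here is a minimal runnable rendering in Python (an illustrative sketch, not the authors' implementation). It assumes a generative model supplied as a function simulate(state, action) returning a (next_state, reward) pair, a terminal(state) predicate, and a fixed global action set; selectAction samples uniformly, i.e. this is vanilla MC, and Section 2.3 replaces exactly this step:

import random
from collections import defaultdict

class MonteCarloPlanner:
    """Generic rollout-based Monte-Carlo planner (vanilla MC: uniform action sampling)."""

    def __init__(self, actions, simulate, terminal, gamma=1.0, max_depth=20):
        self.actions = actions           # actions assumed available in every state (assumption)
        self.simulate = simulate         # generative model: (state, action) -> (next_state, reward)
        self.terminal = terminal         # predicate: is `state` terminal?
        self.gamma = gamma
        self.max_depth = max_depth
        self.value = defaultdict(float)  # (state, action, depth) -> mean observed total reward
        self.count = defaultdict(int)    # (state, action, depth) -> number of visits

    def plan(self, state, n_episodes=1000):
        for _ in range(n_episodes):
            self.search(state, 0)
        # bestAction: highest average observed long-term reward at the root
        return max(self.actions, key=lambda a: self.value[(state, a, 0)])

    def search(self, state, depth):
        if self.terminal(state):
            return 0.0
        if depth >= self.max_depth:      # episode cut-off; could instead stop with a
            return 0.0                   # probability ~ 1/visits(state) (iterative-deepening style)
        action = self.select_action(state, depth)
        next_state, reward = self.simulate(state, action)
        q = reward + self.gamma * self.search(next_state, depth + 1)
        self.update_value(state, action, q, depth)
        return q

    def select_action(self, state, depth):
        return random.choice(self.actions)  # vanilla MC; UCT replaces this step (Section 2.3)

    def update_value(self, state, action, q, depth):
        key = (state, action, depth)
        self.count[key] += 1
        self.value[key] += (q - self.value[key]) / self.count[key]  # incremental mean

A usage would supply the domain's simulate and terminal functions and the action set, then call plan(initial_state).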
2.2 Stochastic bandit problems and UCB1

A bandit problem with $K$ arms (actions) is defined by the sequence of random payoffs $X_{it}$, $i = 1,\ldots,K$, $t \ge 1$, where each $i$ is the index of a gambling machine (the "arm" of a bandit). Successive plays of machine $i$ yield the payoffs $X_{i1}, X_{i2}, \ldots$. For simplicity, we shall assume that $X_{it}$ lies in the interval $[0,1]$. An allocation policy is a mapping that selects the next arm to be played based on the sequence of past selections and payoffs obtained. The expected regret of an allocation policy $A$ after $n$ plays is defined by
$$ R_n = \max_i \mathbb{E}\Big[\sum_{t=1}^{n} X_{it}\Big] - \mathbb{E}\Big[\sum_{j=1}^{K}\sum_{t=1}^{T_j(n)} X_{j,t}\Big], $$
where $I_t \in \{1,\ldots,K\}$ is the index of the arm selected at time $t$ by policy $A$, and where $T_i(t) = \sum_{s=1}^{t} \mathbb{I}(I_s = i)$ is the number of times arm $i$ was played up to time $t$ (including $t$). Thus, the regret is the loss caused by the policy not always playing the best machine. For a large class of payoff distributions, there is no policy whose regret would grow slower than $O(\ln n)$ [10]. For such payoff distributions, a policy is said to resolve the exploration-exploitation tradeoff if its regret growth rate is within a constant factor of the best possible regret rate.
Algorithm UCB1, whose finite-time regret is studied in detail by [1], is a simple yet attractive algorithm that succeeds in resolving the exploration-exploitation tradeoff. It keeps track of the average rewards $\bar{X}_{i,T_i(t-1)}$ for all the arms and chooses the arm with the best upper confidence bound:
$$ I_t = \operatorname*{argmax}_{i \in \{1,\ldots,K\}} \Big\{ \bar{X}_{i,T_i(t-1)} + c_{t-1,T_i(t-1)} \Big\}, \qquad (1) $$
where $c_{t,s}$ is a bias sequence chosen to be
$$ c_{t,s} = \sqrt{\frac{2 \ln t}{s}}. \qquad (2) $$
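To make the selection rule (1) and (2) concrete, here is a minimal UCB1 sketch in Python (illustrative only; the pull function standing in for the gambling machines and the Bernoulli test arms are assumptions of this example, not part of the paper):

import math
import random

def ucb1(pull, K, n_plays):
    """Run UCB1 for n_plays rounds over K arms; `pull(i)` returns a payoff in [0, 1]."""
    counts = [0] * K            # T_i(t): number of times arm i was played so far
    means = [0.0] * K           # running average reward of arm i
    for t in range(1, n_plays + 1):
        if t <= K:
            arm = t - 1         # play every arm once to initialise the estimates
        else:
            # argmax over mean + bias term c_{t-1, T_i(t-1)} = sqrt(2 ln(t-1) / T_i(t-1))
            arm = max(range(K),
                      key=lambda i: means[i] + math.sqrt(2.0 * math.log(t - 1) / counts[i]))
        x = pull(arm)
        counts[arm] += 1
        means[arm] += (x - means[arm]) / counts[arm]  # incremental mean update
    return means, counts

if __name__ == "__main__":
    # Two Bernoulli arms with success probabilities 0.4 and 0.6 (illustrative).
    probs = [0.4, 0.6]
    means, counts = ucb1(lambda i: float(random.random() < probs[i]), K=2, n_plays=10_000)
    print(means, counts)  # the better arm should accumulate the vast majority of the plays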
The bias sequence is such that if $X_{it}$ were independently and identically distributed then the inequalities
$$ \mathbb{P}\big( \bar{X}_{is} \ge \mu_i + c_{t,s} \big) \le t^{-4}, \qquad (3) $$
$$ \mathbb{P}\big( \bar{X}_{is} \le \mu_i - c_{t,s} \big) \le t^{-4} \qquad (4) $$
were satisfied. This follows from Hoeffding's inequality: for $s$ i.i.d. payoffs in $[0,1]$, $\mathbb{P}(\bar{X}_{is} \ge \mu_i + c_{t,s}) \le \exp(-2 s c_{t,s}^2) = \exp(-4\ln t) = t^{-4}$. In our case, UCB1 is used in the internal nodes to select the actions to be sampled next. Since for any given node the sampling probability of the actions at nodes below that node (in the tree) is changing, the payoff sequences experienced will drift in time. Hence, in UCT, the above expression for the bias terms $c_{t,s}$ needs to be replaced by a term that takes this drift of payoffs into account. One of our main results will show that despite this drift, bias terms of the form $c_{t,s} = 2 C_p \sqrt{\frac{\ln t}{s}}$ with appropriate constants $C_p$ can still be constructed for the payoff sequences experienced at any of the internal nodes such that the above tail inequalities are still satisfied.

2.3 The proposed algorithm
In UCT the action selection problem is treated as a separate multi-armed bandit for every (explored) internal node. The arms correspond to actions and the payoffs to the cumulated (discounted) rewards of the paths originating at the node. In particular, in state $s$, at depth $d$, the action that maximizes $Q_t(s,a,d) + c_{N_{s,d}(t),\,N_{s,a,d}(t)}$ is selected, where $Q_t(s,a,d)$ is the estimated value of action $a$ in state $s$ at depth $d$ and time $t$, $N_{s,d}(t)$ is the number of times state $s$ has been visited up to time $t$ at depth $d$, and $N_{s,a,d}(t)$ is the number of times action $a$ was selected when state $s$ has been visited, up to time $t$ at depth $d$.³
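As a sketch of how this selection rule can replace selectAction in the generic planner of Figure 1, consider the following illustrative Python rendering (not the authors' code; the exploration constant C_p is a free parameter here, and untried actions are given infinite priority so that the division by zero mentioned in footnote 3 is avoided):

import math
from collections import defaultdict

class UCTStats:
    """Per-(state, depth) bandit statistics implementing the UCT selection rule."""

    def __init__(self, c_p=1.0):
        self.c_p = c_p
        self.q = defaultdict(float)    # Q_t(s, a, d): estimated value of (s, a, d)
        self.n_sa = defaultdict(int)   # N_{s,a,d}(t): times action a was selected in (s, d)
        self.n_s = defaultdict(int)    # N_{s,d}(t):   visits of (s, d)

    def select_action(self, state, depth, actions):
        n_state = self.n_s[(state, depth)]

        def ucb(a):
            n = self.n_sa[(state, a, depth)]
            if n == 0:
                return float("inf")    # force each action to be tried once (avoids division by zero)
            bias = 2.0 * self.c_p * math.sqrt(math.log(n_state) / n)
            return self.q[(state, a, depth)] + bias

        return max(actions, key=ucb)

    def update_value(self, state, action, total_reward, depth):
        self.n_s[(state, depth)] += 1
        key = (state, action, depth)
        self.n_sa[key] += 1
        self.q[key] += (total_reward - self.q[key]) / self.n_sa[key]  # incremental mean

The bias term uses the drift-aware form $c_{t,s} = 2 C_p \sqrt{\ln t / s}$ introduced above; the value of $C_p$ is a design choice left open in this sketch.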
2.4 Theoretical analysis
The analysis is broken down into first analysing UCB1 for non-stationary bandit problems where the payoff sequences might drift, then showing that the payoff sequences experienced at the internal nodes of the tree satisfy the drift-conditions (see below), and finally proving the consistency of the whole procedure.
The so-called drift-conditions that we impose on the non-stationary payoff sequences are as follows. For simplicity, we assume that $0 \le X_{it} \le 1$. We assume that the expected values of the averages $\bar{X}_{in} = \frac{1}{n}\sum_{t=1}^{n} X_{it}$ converge. We let $\mu_{in} = \mathbb{E}[\bar{X}_{in}]$ and $\mu_i = \lim_{n\to\infty}\mu_{in}$. Further, we define $\delta_{in}$ by $\mu_{in} = \mu_i + \delta_{in}$ and assume that the tail inequalities (3), (4) are satisfied for $c_{t,s} = 2 C_p \sqrt{\frac{\ln t}{s}}$ with an appropriate constant $C_p > 0$. Throughout the analysis of non-stationary bandit problems we shall always assume, without explicitly stating it, that the drift-conditions are satisfied for the payoff sequences.
For the sake of simplicity we assume that there exists a single optimal action.⁴ Quantities related to this optimal arm shall be upper indexed by $*$: $\mu^*$, $T^*(t)$, $\bar{X}^*_t$, etc. Due to the lack of space the proofs of most of the results are omitted.
We let $\Delta_i = \mu^* - \mu_i$. We assume that $C_p$ is such that there exists an integer $N_0$ such that for $s \ge N_0$, $c_{ss} \ge 2|\delta_{is}|$ for any suboptimal arm $i$. Clearly, when UCT is applied in a tree then at the leaves $\delta_{is} = 0$ and this condition is automatically satisfied with $N_0 = 1$. For upper levels, we will argue by induction: by showing an upper bound on $\delta_{ts}$ for the lower levels, $C_p$ can be selected so as to make $N_0 < +\infty$.
Our first result is a generalization of Theorem 1 due to Auer et al. [1]. The proof closely follows this earlier proof.
³ The algorithm has to be implemented such that division by zero is avoided.
⁴ The generalization of the results to the case of multiple optimal arms follows easily.
Theorem 1. Consider UCB1 applied to a non-stationary problem. Let $T_i(n)$ denote the number of plays of arm $i$. Then if $i$ is the index of a suboptimal arm and $n > K$, then
$$ \mathbb{E}[T_i(n)] \le \frac{16 C_p^2 \ln n}{(\Delta_i/2)^2} + 2 N_0 + \frac{\pi^2}{3}. $$
At those internal nodes of the lookahead tree that are labelled by some state, the state values are estimated by averaging all the (cumulative) payoffs for the episodes starting from that node. These values are then used in calculating the value of the action leading to the given state. Hence, the rate of convergence of the bias of the estimated state-values will influence the rate of convergence of values further up in the tree. The next result, building on Theorem 1, gives a bound on this bias.

Theorem 2. Let $\bar{X}_n = \sum_{i=1}^{K} \frac{T_i(n)}{n}\, \bar{X}_{i,T_i(n)}$. Then
$$ \big| \mathbb{E}[\bar{X}_n] - \mu^* \big| \le |\delta^*_n| + O\!\left( \frac{K\,(C_p^2 \ln n + N_0)}{n} \right). \qquad (5) $$

UCB1 never stops exploring. This allows us to derive that the average rewards at the internal nodes concentrate quickly around their means. The following theorem shows that the number of times an arm is pulled can actually be lower bounded by the logarithm of the total number of trials:
Theorem 3 (Lower Bound). There exists some positive constant $\rho$ such that for all arms $i$ and all $n$, $T_i(n) \ge \lceil \rho \log(n) \rceil$.
Among the drift-conditions that we made on the payoff process was that the average payoffs concentrate around their means quickly. The following result shows that this property is kept intact; in particular, this result completes the proof that if the payoff processes further down in the tree satisfy the drift-conditions then the payoff processes higher up in the tree will satisfy the drift-conditions, too:

Theorem 4. Fix $\delta > 0$ and let $\Delta_n = 9\sqrt{2 n \ln(2/\delta)}$. The following bounds hold true provided that $n$ is sufficiently large:
$$ \mathbb{P}\big( n\bar{X}_n \ge n\,\mathbb{E}[\bar{X}_n] + \Delta_n \big) \le \delta, \qquad \mathbb{P}\big( n\bar{X}_n \le n\,\mathbb{E}[\bar{X}_n] - \Delta_n \big) \le \delta. $$
Finally, as we will be interested in the failure probability of the algorithm at the root, we prove the following result:
Theorem 5 (Convergence of Failure Probability). Let $\hat{I}_t = \operatorname*{argmax}_i \bar{X}_{i,T_i(t)}$. Then
$$ \mathbb{P}(\hat{I}_t \ne i^*) \le C \left( \frac{1}{t} \right)^{\frac{\rho}{2}\left( \frac{\min_{i \ne i^*} \Delta_i}{36} \right)^2} $$
with some constant $C$. In particular, it holds that $\lim_{t\to\infty} \mathbb{P}(\hat{I}_t \ne i^*) = 0$.

Our main result now follows:
Theorem 6. Consider a finite-horizon MDP with rewards scaled to lie in the $[0,1]$ interval. Let the horizon of the MDP be $D$, and the number of actions per state be $K$. Consider algorithm UCT such that the bias terms of UCB1 are multiplied by $D$. Then the bias of the estimated expected payoff, $\bar{X}_n$, is $O(\log(n)/n)$. Further, the failure probability at the root converges to zero at a polynomial rate as the number of episodes grows to infinity.
