Appearing in Proceedings of the 36th Annual Symposium on Foundations of Computer Science, November 1995.

Gambling in a rigged casino:
The adversarial multi-armed bandit problem
Peter Auer    Nicolò Cesa-Bianchi    Yoav Freund    Robert E. Schapire

Computer & Information Sciences
University of California
Santa Cruz, CA 95064
pauer@cse.ucsc.edu

DSI
Università di Milano
20135 Milano (Italy)
cesabian@dsi.unimi.it

AT&T Bell Laboratories
600 Mountain Avenue
Murray Hill, NJ 07974
yoav,schapire@
Abstract
In the multi-armed bandit problem, a gambler must decide which arm of $K$ non-identical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of the slot machines.
In this work, we make no statistical assumptions whatsoever about the nature of the process generating the payoffs of the slot machines. We give a solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs. In a sequence of $T$ plays, we prove that the expected per-round payoff of our algorithm approaches that of the best arm at the rate $O(T^{-1/3})$, and we give an improved rate of convergence when the best arm has fairly low payoff.
We also consider a setting in which the player has a team of "experts" advising him on which arm to play; here, we give a strategy that will guarantee expected payoff close to that of the best expert. Finally, we apply our result to the problem of learning to play an unknown repeated matrix game against an all-powerful adversary.
1 Introduction
In the well studied multi-armed bandit problem, originally proposed by Robbins [9], a gambler must choose which of $K$ slot machines to play. At each time step, he pulls the arm of one of the machines and receives a reward or payoff (possibly zero or negative). The gambler's purpose is to maximize his total reward over a sequence of trials. Since each arm is assumed to have a different distribution of rewards, the goal is to find the arm with the best expected return as early as possible, and then to keep gambling using that arm.
The problem is a classical example of the trade-off between exploration and exploitation. On the one hand, if the gambler plays exclusively on the machine that he thinks is best ("exploitation"), he may fail to discover that one of the other arms actually has a higher average return. On the other hand, if he spends too much time trying out all the machines and gathering statistics ("exploration"), he may fail to play the best arm often enough to get a high total return.
As a more practically motivated example, consider the task of repeatedly choosing a route for transmitting packets between two points in a communication network. Suppose there are $K$ possible routes and the transmission cost is reported back to the sender. Then the problem can be seen as that of selecting a route for each packet so that the total cost of transmitting a large set of packets would not be much larger than the cost incurred by sending them all on the single best route.
In the past, the bandit problem has almost always been studied with the aid of statistical assumptions on the process generating the rewards for each arm. In the gambling example, for instance, it might be natural to assume that the distribution of rewards for each arm is Gaussian and time-invariant. However, it is likely that the costs associated with each route in the routing example cannot be modeled by a stationary distribution, so a more sophisticated set of statistical assumptions would be required. In general, it may be difficult or impossible to determine the right statistical assumptions for a given domain, and some domains may be inherently adversarial in nature so that no such assumptions are appropriate.
In this paper, we present a variant of the bandit problem in which no statistical assumptions are made about the generation of rewards. In our model, the reward associated with each arm is determined at each time step by an adversary with unbounded computational power rather than by some benign stochastic process. We only assume that the rewards are chosen from a bounded range. The performance of any player is measured in terms of regret, i.e., the difference between the total reward scored by the player and the total reward scored by the best arm.
At first it may seem impossible that the player should stand a chance against such a powerful opponent. Indeed, a deterministic player will fare very badly against an adversary who assigns low payoff to the chosen arm and high payoff to all the other arms. However, in this paper we present a very efficient, randomized player algorithm that performs well against any adversary. We prove that the regret suffered by our algorithm is at most $O\bigl(T^{2/3}(K\log K)^{1/3}\bigr)$, where $K$ is the number of arms and $T$ is the number of time steps. Note that the average per-time-step regret approaches zero at the rate $O(T^{-1/3})$.
We also present more refined bounds in which the dependence on $T$ is replaced by the total reward of the best arm (or an assumed upper bound thereof).
Our worst-case bounds may appear weaker than the bounds proved using statistical assumptions, such as those shown by Lai and Robbins [6] of the form $O(\log T)$. However, when comparing our results to those in the statistics literature, it is important to point out two differences between our framework and theirs:
1. They define the regret as the difference between the expected total reward of the player and the maximum of the expected total rewards of any arm. Our definition, in contrast, measures regret with respect to the specific sequence of payoffs actually generated by the adversary.
2. They assume that the distribution of rewards that is associated with each arm is fixed as the number of iterations $T$ increases to infinity. In contrast, our bounds hold for any finite $T$, and, by the generality of our model, the bounds are applicable when the payoffs are randomly (or adversarially) chosen in a manner that does depend on $T$.

While it might seem that the apparent weakness of our bounds is a result of the adversarial nature of our framework, in fact, each of the differences described above suffices to show that an upper bound of $O(\log T)$ is impossible in our model. This holds even when the rewards are generated randomly and independently of the player's actions as in the standard statistical framework.
Consider the first difference. In a statistical setting, the difference in the definition of the regret corresponds to the difference between the maximum expected total reward of any arm and the expected maximum total reward of any arm in a sequence of $T$ trials. The two measures can be far apart since, due to random variation, the expected maximum is typically much larger than the maximal expectation. As shown by Cesa-Bianchi et al. [2], this idea can be used to construct a lower bound for the regret of any algorithm of the form $\Omega\bigl(\sqrt{T\log K}\bigr)$.
A non-stochastic bandit problem was also considered by Gittins [4] and Ishikida and Varaiya [5]. However, their version of the bandit problem is very different from ours: they
[...]
repeated matrix games.
2 Notation and terminology
We formalize the bandit problem as a game between a player choosing actions and an adversary choosing the rewards associated with each action. The game is parameterized by the number $K$ of possible actions, where each action is denoted by an integer $i$, $1 \le i \le K$. We will assume that all the rewards belong to the unit interval $[0,1]$. The generalization to rewards in $[a,b]$ for arbitrary $a < b$ is straightforward.
The game is played in a sequence of trials $t = 1, 2, \ldots, T$. We distinguish two variants: the partial information game, which captures the adversarial multi-armed bandit problem; and the full information game, which is essentially equivalent to the framework studied by Freund and Schapire [3]. On each trial $t$ of the full information game:
1. The adversary selects a vector $x(t) \in [0,1]^K$ of current rewards. The $i$th component $x_i(t)$ is interpreted as the reward associated with action $i$ at trial $t$.

2. Without knowledge of the adversary's choice, the player chooses an action by picking a number $i_t \in \{1, 2, \ldots, K\}$ and scores the corresponding reward $x_{i_t}(t)$.

3. The player observes the entire vector $x(t)$ of current rewards.
The partial information game corresponds to the above description of the full information game, but with step 3 replaced by:

3'. The player observes only the reward $x_{i_t}(t)$ for the chosen action $i_t$.
Let $G_A = \sum_{t=1}^{T} x_{i_t}(t)$ be the total reward of player $A$ choosing actions $i_1, i_2, \ldots, i_T$.
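For concreteness, this interaction protocol can be written as a short simulation loop. The sketch below is ours, not the paper's: the `Player` interface, its method names, and the `full_information` flag are illustrative assumptions.

```python
from typing import Callable, List, Optional, Protocol

class Player(Protocol):
    def choose(self) -> int: ...                       # pick an action in {0, ..., K-1}
    def observe(self, feedback: List[Optional[float]]) -> None: ...

def play_game(player: Player,
              adversary: Callable[[List[int]], List[float]],
              T: int,
              full_information: bool) -> float:
    """Run T trials and return the player's total reward G_A."""
    history: List[int] = []                 # past plays i_1, ..., i_{t-1}
    total = 0.0
    for _ in range(T):
        x = adversary(history)              # reward vector x(t) in [0,1]^K
        i_t = player.choose()               # chosen without knowledge of x(t)
        total += x[i_t]
        if full_information:
            player.observe(list(x))         # step 3: the entire vector is revealed
        else:
            feedback: List[Optional[float]] = [None] * len(x)
            feedback[i_t] = x[i_t]          # step 3': only the chosen reward
            player.observe(feedback)
        history.append(i_t)
    return total
```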
We formally define an adversary as a deterministic rule mapping the past history of play $i_1,\ldots,i_{t-1}$ to the current reward vector $x(t)$. (Since all our results are worst-case with respect to the adversary, there is no additional power to be gained by allowing the adversary to be randomized.) As a special case, we say that an adversary is oblivious if it is independent of the player's actions, i.e., if the reward at trial $t$ is a function of $t$ only. Clearly, all of our results, which are proved for a non-oblivious adversary, hold for an oblivious adversary as well.

As our player algorithms will be randomized, fixing an adversary and a player algorithm defines a probability distribution over the set $\{1,\ldots,K\}^T$ of sequences of actions. All the probabilities and expectations considered in this paper will be with respect to this distribution. For an oblivious adversary, the rewards are fixed quantities with respect to this distribution, but for a non-oblivious adversary, each reward $x_i(t)$ is a random variable defined on the set $\{1,\ldots,K\}^{t-1}$ of player actions up to trial $t-1$. We will not use explicit notation to represent this dependence, but will refer to it in the text when appropriate.
The measure of the performance of our algorithm is the regret, which is the expected value of the difference between the total reward of the algorithm and the total reward of the best action. Formally, we define the expected total reward of algorithm $A$ by
$$\mathbf{E}[G_A] = \mathbf{E}\Bigl[\sum_{t=1}^{T} x_{i_t}(t)\Bigr],$$
the expected total reward of the best action by
$$G_{\mathrm{best}} = \max_{j}\,\mathbf{E}\Bigl[\sum_{t=1}^{T} x_j(t)\Bigr],$$
and the regret of algorithm $A$ by $R_A = G_{\mathrm{best}} - \mathbf{E}[G_A]$. This definition is easiest to interpret for an oblivious adversary since, in this case, $G_{\mathrm{best}}$ truly measures what could have been gained had the best action been played for the entire sequence. However, for a non-oblivious adversary, the definition of regret is a bit strange: although it still compares the total reward of the algorithm to the sum of rewards that were associated with taking some action $j$ on all iterations, had action $j$ actually been taken, the rewards chosen by the adversary would have been different than those actually generated, since the variable $x_j(t)$ depends on the past history of plays $i_1,\ldots,i_{t-1}$. Although the definition of $R_A$ looks difficult to interpret in this case, in Section 8 we prove that our bounds on the regret for a non-oblivious adversary can also be used to derive an interesting result in the context of repeated matrix games.
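As a concrete illustration of these definitions, the snippet below computes $G_A$, $G_{\mathrm{best}}$, and the regret for a deterministic player against an oblivious adversary, in which case the expectations are trivial. The reward matrix and all names in the code are our own illustrative choices.

```python
def total_reward(x, actions):
    """G_A = sum over t of x_{i_t}(t) for the action sequence i_1, ..., i_T."""
    return sum(x[t][i_t] for t, i_t in enumerate(actions))

def best_action_reward(x):
    """G_best = max over j of the total reward of playing j on every trial."""
    K = len(x[0])
    return max(sum(x[t][j] for t in range(len(x))) for j in range(K))

def regret(x, actions):
    """R_A = G_best - G_A (nothing is random here, so no expectations)."""
    return best_action_reward(x) - total_reward(x, actions)

# Example: K = 2 actions, T = 3 trials; x[t][j] is the reward of action j at trial t.
x = [[0.9, 0.1],
     [0.8, 0.2],
     [0.1, 1.0]]
print(regret(x, actions=[1, 1, 1]))   # best fixed action is 0: 1.8 - 1.3 = 0.5
```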
3 The full information game
In this section, we describe an algorithm, called Hedge, for the full information game, which will also be used as a building block in the design of our algorithm for the partial information game. The version of Hedge presented here is a slight variant of the algorithm introduced by Freund and Schapire [3] as a generalization of Littlestone and Warmuth's Weighted Majority algorithm [7].
Hedge is described in Figure 1. The main idea is simply to choose action $i$ at time $t$ with probability proportional to $(1+\eta)^{G_i(t-1)}$, where $\eta > 0$ is a parameter and $G_i(t-1)$ is the total reward scored so far by action $i$. Thus, actions yielding high rewards quickly gain a high probability of being chosen.
The following is a straightforward variant of Freund and Schapire's Theorem 2. For completeness, a proof is provided in Appendix A.
Algorithm Hedge

Parameter: A real number $\eta > 0$.
Initialization: Set $G_i(0) := 0$ for $i = 1, \ldots, K$.

Repeat for $t = 1, 2, \ldots$ until game ends
1. Choose action $i_t$ according to the distribution $p(t)$, where
$$p_i(t) = \frac{(1+\eta)^{G_i(t-1)}}{\sum_{j=1}^{K}(1+\eta)^{G_j(t-1)}}$$
for all actions $i = 1, \ldots, K$.
2. Receive the reward vector $x(t)$ and score gain $x_{i_t}(t)$.
3. Set $G_i(t) := G_i(t-1) + x_i(t)$ for all actions $i = 1, \ldots, K$.

Figure 1: Algorithm Hedge for the full information game.
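A direct implementation of this weighting scheme is sketched below. It is our own illustration (the class and method names are not from the paper) and assumes per-trial rewards in $[0,1]$.

```python
import random

class Hedge:
    """Full-information weighting: play action i with probability
    proportional to (1 + eta) ** G_i(t-1)."""

    def __init__(self, K: int, eta: float):
        self.eta = eta
        self.G = [0.0] * K       # G_i(t-1): total reward scored so far by each action

    def distribution(self):
        # Subtracting max(G) before exponentiating leaves the normalized
        # probabilities unchanged but avoids numerical overflow in long games.
        m = max(self.G)
        w = [(1.0 + self.eta) ** (g - m) for g in self.G]
        s = sum(w)
        return [wi / s for wi in w]

    def choose(self) -> int:
        p = self.distribution()
        return random.choices(range(len(self.G)), weights=p)[0]

    def update(self, rewards) -> None:
        # Accumulate the full reward vector: G_i <- G_i + x_i(t).
        for i, xi in enumerate(rewards):
            self.G[i] += xi
```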
[...]

Taking expectations with respect to the random choice of plays and using standard approximations for the $\ln$ function gives a lower bound on the expected gain:
Theorem 3.2 For $\eta > 0$, the expected gain of algorithm Hedge in the full information game is at least
$$\mathbf{E}[G_{\mathrm{Hedge}}] \;\ge\; \frac{\ln(1+\eta)}{\eta}\,G_{\mathrm{best}} - \frac{\ln K}{\eta}.$$
In particular, for an appropriate choice of $\eta$ this implies a regret of at most $\sqrt{2\,G_{\mathrm{best}}\ln K} \le \sqrt{2\,T\ln K}$ in the full information game.
4 The partial information game
In this section, we move to the analysis of the partial information game. We present an algorithm, called Exp3, that runs the algorithm Hedge of Section 3 as a subroutine. (Exp3 stands for "Exponential-weight algorithm for Exploration and Exploitation.") The algorithm is described in Figure 2. On each trial, Exp3 receives the distribution vector $p(t)$ from Hedge and selects an action $i_t$ according to the distribution $\hat{p}(t)$, which is a mixture of $p(t)$ and the uniform distribution.
Algorithm Exp3

Parameters: Reals $\eta > 0$ and $\gamma \in (0,1]$.
Initialization: Initialize Hedge($\eta$).

Repeat for $t = 1, 2, \ldots$ until game ends
1. Get the distribution $p(t)$ from Hedge.
2. Select action $i_t$ to be $j$ with probability $\hat{p}_j(t) = (1-\gamma)\,p_j(t) + \gamma/K$.
3. Receive the reward $x_{i_t}(t) \in [0,1]$.
4. Feed the simulated reward vector $\hat{x}(t)$ back to Hedge, where
$$\hat{x}_j(t) = \begin{cases} \dfrac{\gamma}{K}\cdot\dfrac{x_{i_t}(t)}{\hat{p}_{i_t}(t)} & \text{if } j = i_t, \\[4pt] 0 & \text{otherwise.} \end{cases}$$
Figure 2: Algorithm Exp3 for the partial information game.

Intuitively, mixing in the uniform distribution is done in order to make sure that the algorithm tries out all $K$ actions and gets good estimates of the rewards for each. Otherwise, the algorithm might miss a good action because the initial rewards it observes for this action are low, and large rewards that occur later are not observed because the action is not selected.
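A sketch of this procedure, reusing the Hedge sketch above, might look as follows. The $\gamma/K$ scaling of the simulated reward, and all class and method names, are our assumptions rather than the paper's definitive choices.

```python
import random

class Exp3:
    """Partial-information play: mix Hedge's distribution with the uniform
    distribution, and feed an importance-weighted simulated reward back."""

    def __init__(self, K: int, eta: float, gamma: float):
        self.K = K
        self.gamma = gamma
        self.hedge = Hedge(K, eta)          # the Hedge sketch above
        self._p_hat = [1.0 / K] * K

    def choose(self) -> int:
        p = self.hedge.distribution()       # step 1: distribution p(t) from Hedge
        self._p_hat = [(1 - self.gamma) * pi + self.gamma / self.K for pi in p]
        return random.choices(range(self.K), weights=self._p_hat)[0]   # step 2

    def observe(self, i_t: int, reward: float) -> None:
        # Step 4: simulated reward vector, zero except for the chosen action,
        # scaled by gamma/K so that it stays in [0, 1] (our assumed scaling).
        x_hat = [0.0] * self.K
        x_hat[i_t] = (self.gamma / self.K) * reward / self._p_hat[i_t]
        self.hedge.update(x_hat)
```

On each trial one would call `choose()`, play the returned action $i_t$, and then call `observe(i_t, reward)` with the single observed reward.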
After Exp3 receives the reward $x_{i_t}(t)$ associated with the chosen action, it generates a simulated reward vector $\hat{x}(t)$ for Hedge. As Hedge requires full information, all components of this vector must be filled in, even for the actions that were not selected. For actions not chosen, we set $\hat{x}_j(t)$ to be zero. For the chosen action $i_t$, we set the simulated reward $\hat{x}_{i_t}(t)$ proportional to $x_{i_t}(t)/\hat{p}_{i_t}(t)$. This compensates the reward of actions that are unlikely to be chosen and guarantees that the expected simulated gain associated with any fixed action is proportional to the actual gain of the action, i.e., that $\mathbf{E}[\hat{x}_j(t)\mid i_1,\ldots,i_{t-1}] = (\gamma/K)\,x_j(t)$.

Theorem 4.1 For $\eta > 0$ and $\gamma \in (0,1]$, the expected gain of algorithm Exp3 in the partial information game is at least
$$\mathbf{E}[G_{\mathrm{Exp3}}] \;\ge\; (1-\gamma)\left[\frac{\ln(1+\eta)}{\eta}\,G_{\mathrm{best}} - \frac{K\ln K}{\gamma\,\eta}\right].$$
By making an appropriate choice of the parameters $\gamma$ and $\eta$, we get the following bound.
Corollary 4.2 If $G_{\mathrm{best}} \le M$ and algorithm Exp3 is run with input parameters $\gamma = \sqrt[3]{K\ln K/(2M)}$ and $\eta = 2\gamma$, then the regret of algorithm Exp3 is at most
$$R_{\mathrm{Exp3}} \;\le\; 3\,M^{2/3}\Bigl(\frac{K\ln K}{2}\Bigr)^{1/3} \;=\; \frac{3}{\sqrt[3]{2}}\,M^{2/3}\,(K\ln K)^{1/3}.$$
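The property driving the analysis is that, conditioned on the past, the simulated reward is an unbiased estimate of the true reward up to the $\gamma/K$ factor. The short Monte Carlo check below illustrates this for a single trial; the specific numbers and the scaling convention are our assumptions.

```python
import random

def simulated_reward(j, i_t, x, p_hat, gamma):
    """x_hat_j(t) when action i_t was drawn (gamma/K scaling assumed)."""
    K = len(x)
    return (gamma / K) * x[i_t] / p_hat[i_t] if j == i_t else 0.0

x = [0.3, 0.9, 0.5]                                   # true rewards x_j(t)
p = [0.2, 0.5, 0.3]                                   # Hedge's distribution p(t)
gamma, K = 0.3, len(x)
p_hat = [(1 - gamma) * pi + gamma / K for pi in p]    # Exp3's mixture distribution

n = 200_000
draws = random.choices(range(K), weights=p_hat, k=n)  # i_t drawn from p_hat
for j in range(K):
    avg = sum(simulated_reward(j, i, x, p_hat, gamma) for i in draws) / n
    print(j, round(avg, 3), round((gamma / K) * x[j], 3))   # agree up to sampling error
```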
Proof of Theorem 4.1. We begin by showing a lower bound on the total reward that holds for any sequence of actions $i_1,\ldots,i_T$:
$$G_{\mathrm{Exp3}} \;=\; \sum_{t=1}^{T} x_{i_t}(t) \;\ge\; (1-\gamma)\,\frac{K}{\gamma}\sum_{t=1}^{T}\sum_{i=1}^{K} p_i(t)\,\hat{x}_i(t) \;\ge\; (1-\gamma)\,\frac{K}{\gamma}\left[\frac{\ln(1+\eta)}{\eta}\sum_{t=1}^{T}\hat{x}_j(t) - \frac{\ln K}{\eta}\right]. \qquad (2)$$
Equation (2) holds for any action $j$ by Lemma 3.1, since the simulated reward vectors $\hat{x}(t)$ are fed back to Hedge at each time step. For any $j$ we have
$$\mathbf{E}[\hat{x}_j(t)] \;=\; \mathbf{E}\bigl[\,\mathbf{E}[\hat{x}_j(t)\mid i_1,\ldots,i_{t-1}]\,\bigr] \;=\; \mathbf{E}\left[\hat{p}_j(t)\cdot\frac{\gamma}{K}\cdot\frac{x_j(t)}{\hat{p}_j(t)} + (1-\hat{p}_j(t))\cdot 0\right] \;=\; \frac{\gamma}{K}\,\mathbf{E}[x_j(t)].$$
[...]

Dividing each observed reward by the probability of taking the action guarantees that the expected value of each estimate equals the expected actual gain, $\mathbf{E}[\tilde{G}_j(t)] = \mathbf{E}\bigl[\sum_{s=1}^{t} x_j(s)\bigr]$ for all $1 \le j \le K$ and all $1 \le t \le T$, where $\tilde{G}_j(t)$ denotes the value of $\tilde{G}_j$ at the end of trial $t$. Using these estimates, the algorithm detects (approximately) when the actual gain of some action has advanced beyond the current guess. When this happens, the algorithm increments the round number and restarts Exp3 with a larger bound on the maximal gain.
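The guess-and-restart strategy just described can be sketched roughly as follows, on top of the Exp3 sketch above. The initial guess, the doubling schedule, the termination test, and the per-round parameter choices are our assumptions about details not spelled out here, not the paper's exact specification.

```python
import math

def exp3_point1(K: int, T: int, get_reward):
    """Guess-and-restart wrapper around Exp3 (rough sketch only).
    get_reward(i) returns the bandit feedback x_i(t) for the chosen action."""
    G_tilde = [0.0] * K                # unbiased running estimates of each action's gain
    bound = K * math.log(K)            # initial guess for the maximal gain (assumption)
    total = 0.0
    t = 0
    while t < T:
        # Per-round parameters in the spirit of Corollary 4.2 (assumption).
        gamma = min(1.0, (K * math.log(K) / (2 * bound)) ** (1 / 3))
        exp3 = Exp3(K, eta=2 * gamma, gamma=gamma)
        while t < T and max(G_tilde) <= bound:
            i_t = exp3.choose()
            reward = get_reward(i_t)
            exp3.observe(i_t, reward)
            total += reward
            # Importance-weighted estimate of the actual (unscaled) gain.
            G_tilde[i_t] += reward / exp3._p_hat[i_t]
            t += 1
        bound *= 2                     # an estimate passed the guess: restart with a larger bound
    return total
```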
The performance of the algorithm is characterized by the following theorem.
Theorem 5.1 For $K \ge 2$, the regret suffered by algorithm Exp3.1 is at most
$$R_{\mathrm{Exp3.1}} \;\le\; c\,G_{\mathrm{best}}^{2/3}\,(K\ln K)^{1/3}$$
for an absolute constant $c$.
(We did not attempt to optimize the constants in this theorem.) The proof of the theorem is divided into two lemmas. The first bounds the regret caused by each round, and the second bounds the number of rounds. We define the following random variables: $T_r$ denotes the trial on which the $r$th round begins, and $R$ denotes the total number of rounds (i.e., the value of $r$ on the last trial). For notational convenience, we also define $T_{R+1}$ to be $T+1$.

[...]
Proof. We use Equation (2) from the proof of Theorem 4.1. We replace [...]. We bound the second occurrence of the sum by adding non-negative terms: [...]. From the definition of the termination condition, we know that [...]. Substituting this bound and our choices for $\gamma$ and $\eta$ into the last equation above, we get the statement of the lemma.
The next lemma shows that, with high probability, there are not too many rounds.
Lemma 5.3 Let $K \ge 2$, [...] $G_{\mathrm{best}}$, and [...] $\log_2 G_{\mathrm{best}}$. Then for any $r \ge [\ldots]$,
$$\Pr[R \ge r] \;\le\; [\ldots].$$

Proof. (Sketch) The idea of the proof is simple. If algorithm Exp3.1 terminated round number $r$ before iteration $T$, then the value of at least one of the estimators $\tilde{G}_1(t),\ldots,\tilde{G}_K(t)$ at some iteration $t$ has to be larger than the bound used on round $r$. However, the expected value of each estimator is at most $G_{\mathrm{best}}$, so reaching round $r+1$ implies a large estimation error. Simple arguments from probability theory show that the probability of such a large error in any of the estimators is very small. (Details given in Appendix B.)
Proof of Theorem 5.1. Since the theorem holds trivially when $G_{\mathrm{best}} \le [\ldots]$, we assume without loss of generality that $G_{\mathrm{best}} > [\ldots]$. We fix some action $j$, partition the expected total gain into rounds, and bound the gain from each round using Lemma 5.2:
$$[\ldots]$$
This gives the statement of the theorem.
6 A lower bound

In this section, we prove an information-theoretic lower bound on the regret of any player, i.e., a lower bound that holds [...]
