Appearing in Proceedings of the 36th Annual Symposium on Foundations of Computer Science, November 1995.

Gambling in a rigged casino:
The adversarial multi-armed bandit problem
Peter Auer    Nicolò Cesa-Bianchi    Yoav Freund    Robert E. Schapire

Computer & Information Sciences
University of California
Santa Cruz, CA 95064
pauer@cse.ucsc.edu

DSI
Università di Milano
20135 Milano (Italy)
cesabian@dsi.unimi.it

AT&T Bell Laboratories
600 Mountain Avenue
Murray Hill, NJ 07974
yoav,schapire@
Abstract
In the multi-armed bandit problem, a gambler must decide which arm of $K$ non-identical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of the slot machines.
In this work, we make no statistical assumptions whatsoever about the nature of the process generating the payoffs of the slot machines. We give a solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs. In a sequence of $T$ plays, we prove that the expected per-round payoff of our algorithm approaches that of the best arm at the rate $O(T^{-1/3})$, and we give an improved rate of convergence when the best arm has fairly low payoff.
We also consider a setting in which the player has a team of "experts" advising him on which arm to play; here, we give a strategy that will guarantee expected payoff close to that of the best expert. Finally, we apply our result to the problem of learning to play an unknown repeated matrix game against an all-powerful adversary.
1 Introduction
In the well studied multi-armed bandit problem, originally proposed by Robbins [9], a gambler must choose which of $K$ slot machines to play. At each time step, he pulls the arm of one of the machines and receives a reward or payoff (possibly zero or negative). The gambler's purpose is to maximize his total reward over a sequence of trials. Since each arm is assumed to have a different distribution of rewards, the goal is to find the arm with the best expected return as early as possible, and then to keep gambling using that arm.
The problem is a classical example of the trade-off between exploration and exploitation. On the one hand, if the gambler plays exclusively on the machine that he thinks is best ("exploitation"), he may fail to discover that one of the other arms actually has a higher average return. On the other hand, if he spends too much time trying out all the machines and gathering statistics ("exploration"), he may fail to play the best arm often enough to get a high total return.
As a more practically motivated example, consider the task of repeatedly choosing a route for transmitting packets between two points in a communication network. Suppose there are $K$ possible routes and the transmission cost is reported back to the sender. Then the problem can be seen as that of selecting a route for each packet so that the total cost of transmitting a large set of packets would not be much larger than the cost incurred by sending them all on the single best route.
In the past, the bandit problem has almost always been studied with the aid of statistical assumptions on the process generating the rewards for each arm. In the gambling example, for instance, it might be natural to assume that the distribution of rewards for each arm is Gaussian and time-invariant. However, it is likely that the costs associated with each route in the routing example cannot be modeled by a stationary distribution, so a more sophisticated set of statistical assumptions would be required. In general, it may be difficult or impossible to determine the right statistical assumptions for a given domain, and some domains may be inherently adversarial in nature so that no such assumptions are appropriate.
In this paper, we present a variant of the bandit problem in which no statistical assumptions are made about the generation of rewards. In our model, the reward associated with each arm is determined at each time step by an adversary with unbounded computational power rather than by some benign stochastic process. We only assume that the rewards are chosen from a bounded range. The performance of any player is measured in terms of regret, i.e., the difference between the total reward scored by the player and the total reward scored by the best arm.
At first it may seem impossible that the player should stand a chance against such a powerful opponent. Indeed, a deterministic player will fare very badly against an adversary who assigns low payoff to the chosen arm and high payoff to all the other arms. However, in this paper we present a very efficient, randomized player algorithm that performs well against any adversary. We prove that the regret suffered by our algorithm is at most $O\bigl(T^{2/3}(K\log K)^{1/3}\bigr)$, where $K$ is the number of arms and $T$ is the number of time steps. Note that the average per-time-step regret approaches zero at the rate $O(T^{-1/3})$.
We also present more refined bounds in which the dependence on $T$ is replaced by the total reward of the best arm (or an assumed upper bound thereof).
Our worst-case bounds may appear weaker than the bounds proved using statistical assumptions, such as those shown by Lai and Robbins [6] of the form $O(\log T)$. However, when comparing our results to those in the statistics literature, it is important to point out two differences between our framework and theirs:
1. They define the regret as the difference between the expected total reward of the player and the maximum of the expected total rewards of any arm. Our definition, in contrast, measures regret with respect to the specific sequence of payoffs actually generated by the adversary.
2. They assume that the distribution of rewards that is associated with each arm is fixed as the number of iterations $T$ increases to infinity. In contrast, our bounds hold for any finite $T$, and, by the generality of our model, the bounds are applicable when the payoffs are randomly (or adversarially) chosen in a manner that does depend on $T$.

While it might seem that the apparent weakness of our bounds is a result of the adversarial nature of our framework, in fact, each of the differences described above suffices to show that an upper bound of $O(\log T)$ is impossible in our model. This holds even when the rewards are generated randomly and independently of the player's actions as in the standard statistical framework.
Consider the first difference. In a statistical setting, the difference in the definition of the regret corresponds to the difference between the maximum expected total reward of any arm and the expected maximum total reward of any arm in a sequence of $T$ trials. The two measures can be far apart since, due to random variation, the expected maximum is typically much larger than the maximal expectation. As shown by Cesa-Bianchi et al. [2], this idea can be used to construct a lower bound for the regret of any algorithm of the form $\Omega\bigl(\sqrt{T\log K}\bigr)$.
A non-stochastic bandit problem was also considered by Gittins [4] and Ishikida and Varaiya [5]. However, their version of the bandit problem is very different from ours: they
[...]
repeated matrix games.
2 Notation and terminology
We formalize the bandit problem as a game between a player choosing actions and an adversary choosing the rewards associated with each action. The game is parameterized by the number $K$ of possible actions, where each action is denoted by an integer $i$, $1 \le i \le K$. We will assume that all the rewards belong to the unit interval $[0,1]$. The generalization to rewards in $[a,b]$ for arbitrary $a < b$ is straightforward.
The game is played in a sequence of trials $t = 1, 2, \ldots, T$. We distinguish two variants: the partial information game, which captures the adversarial multi-armed bandit problem; and the full information game, which is essentially equivalent to the framework studied by Freund and Schapire [3]. On each trial $t$ of the full information game:
1. The adversary selects a vector $x(t) \in [0,1]^K$ of current rewards. The $i$th component $x_i(t)$ is interpreted as the reward associated with action $i$ at trial $t$.

2. Without knowledge of the adversary's choice, the player chooses an action by picking a number $i_t \in \{1, 2, \ldots, K\}$ and scores the corresponding reward $x_{i_t}(t)$.

3. The player observes the entire vector $x(t)$ of current rewards.
The partial information game corresponds to the above description of the full information game, but with step 3 replaced by:

3'. The player observes only the reward $x_{i_t}(t)$ for the chosen action $i_t$.
Let $G_A = \sum_{t=1}^{T} x_{i_t}(t)$ be the total reward of player $A$ choosing actions $i_1, i_2, \ldots, i_T$.
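For concreteness, this interaction protocol can be written as a short simulation loop. The sketch below is ours, not the paper's: the `Player` interface, its method names, and the `full_information` flag are illustrative assumptions.

```python
from typing import Callable, List, Optional, Protocol

class Player(Protocol):
    def choose(self) -> int: ...                       # pick an action in {0, ..., K-1}
    def observe(self, feedback: List[Optional[float]]) -> None: ...

def play_game(player: Player,
              adversary: Callable[[List[int]], List[float]],
              T: int,
              full_information: bool) -> float:
    """Run T trials and return the player's total reward G_A."""
    history: List[int] = []                 # past plays i_1, ..., i_{t-1}
    total = 0.0
    for _ in range(T):
        x = adversary(history)              # reward vector x(t) in [0,1]^K
        i_t = player.choose()               # chosen without knowledge of x(t)
        total += x[i_t]
        if full_information:
            player.observe(list(x))         # step 3: the entire vector is revealed
        else:
            feedback: List[Optional[float]] = [None] * len(x)
            feedback[i_t] = x[i_t]          # step 3': only the chosen reward
            player.observe(feedback)
        history.append(i_t)
    return total
```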
We formally define an adversary as a deterministic rule mapping the past history of play $i_1,\ldots,i_{t-1}$ to the current reward vector $x(t)$. (Since all our results are worst-case with respect to the adversary, there is no additional power to be gained by allowing the adversary to be randomized.) As a special case, we say that an adversary is oblivious if it is independent of the player's actions, i.e., if the reward at trial $t$ is a function of $t$ only. Clearly, all of our results, which are proved for a non-oblivious adversary, hold for an oblivious adversary as well.

As our player algorithms will be randomized, fixing an adversary and a player algorithm defines a probability distribution over the set $\{1,\ldots,K\}^T$ of sequences of actions. All the probabilities and expectations considered in this paper will be with respect to this distribution. For an oblivious adversary, the rewards are fixed quantities with respect to this distribution, but for a non-oblivious adversary, each reward $x_i(t)$ is a random variable defined on the set $\{1,\ldots,K\}^{t-1}$ of player actions up to trial $t-1$. We will not use explicit notation to represent this dependence, but will refer to it in the text when appropriate.
The measure of the performance of our algorithm is the regret, which is the expected value of the difference between the total reward of the algorithm and the total reward of the best action. Formally, we define the expected total reward of algorithm $A$ by
$$\mathbf{E}[G_A] = \mathbf{E}\Bigl[\sum_{t=1}^{T} x_{i_t}(t)\Bigr],$$
the expected total reward of the best action by
$$G_{\mathrm{best}} = \max_{j}\,\mathbf{E}\Bigl[\sum_{t=1}^{T} x_j(t)\Bigr],$$
and the regret of algorithm $A$ by $R_A = G_{\mathrm{best}} - \mathbf{E}[G_A]$. This definition is easiest to interpret for an oblivious adversary since, in this case, $G_{\mathrm{best}}$ truly measures what could have been gained had the best action been played for the entire sequence. However, for a non-oblivious adversary, the definition of regret is a bit strange: although it still compares the total reward of the algorithm to the sum of rewards that were associated with taking some action $j$ on all iterations, had action $j$ actually been taken, the rewards chosen by the adversary would have been different than those actually generated, since the variable $x_j(t)$ depends on the past history of plays $i_1,\ldots,i_{t-1}$. Although the definition of $R_A$ looks difficult to interpret in this case, in Section 8 we prove that our bounds on the regret for a non-oblivious adversary can also be used to derive an interesting result in the context of repeated matrix games.
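As a concrete illustration of these definitions, the snippet below computes $G_A$, $G_{\mathrm{best}}$, and the regret for a deterministic player against an oblivious adversary, in which case the expectations are trivial. The reward matrix and all names in the code are our own illustrative choices.

```python
def total_reward(x, actions):
    """G_A = sum over t of x_{i_t}(t) for the action sequence i_1, ..., i_T."""
    return sum(x[t][i_t] for t, i_t in enumerate(actions))

def best_action_reward(x):
    """G_best = max over j of the total reward of playing j on every trial."""
    K = len(x[0])
    return max(sum(x[t][j] for t in range(len(x))) for j in range(K))

def regret(x, actions):
    """R_A = G_best - G_A (nothing is random here, so no expectations)."""
    return best_action_reward(x) - total_reward(x, actions)

# Example: K = 2 actions, T = 3 trials; x[t][j] is the reward of action j at trial t.
x = [[0.9, 0.1],
     [0.8, 0.2],
     [0.1, 1.0]]
print(regret(x, actions=[1, 1, 1]))   # best fixed action is 0: 1.8 - 1.3 = 0.5
```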
3 The full information game
In this section, we describe an algorithm, called Hedge, for the full information game, which will also be used as a building block in the design of our algorithm for the partial information game. The version of Hedge presented here is a slight variant of the algorithm introduced by Freund and Schapire [3] as a generalization of Littlestone and Warmuth's Weighted Majority algorithm [7].
Hedge is described in Figure 1. The main idea is simply to choose action $i$ at time $t$ with probability proportional to $(1+\eta)^{G_i(t-1)}$, where $\eta > 0$ is a parameter and $G_i(t-1)$ is the total reward scored so far by action $i$. Thus, actions yielding high rewards quickly gain a high probability of being chosen.
The following is a straightforward variant of Freund and Schapire's Theorem 2. For completeness, a proof is provided in Appendix A.
Algorithm Hedge

Parameter: A real number $\eta > 0$.
Initialization: Set $G_i(0) := 0$ for $i = 1, \ldots, K$.

Repeat for $t = 1, 2, \ldots$ until game ends
1. Choose action $i_t$ according to the distribution $p(t)$, where
$$p_i(t) = \frac{(1+\eta)^{G_i(t-1)}}{\sum_{j=1}^{K}(1+\eta)^{G_j(t-1)}}$$
for all actions $i = 1, \ldots, K$.
2. Receive the reward vector $x(t)$ and score gain $x_{i_t}(t)$.
3. Set $G_i(t) := G_i(t-1) + x_i(t)$ for all actions $i = 1, \ldots, K$.

Figure 1: Algorithm Hedge for the full information game.
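A direct implementation of this weighting scheme is sketched below. It is our own illustration (the class and method names are not from the paper) and assumes per-trial rewards in $[0,1]$.

```python
import random

class Hedge:
    """Full-information weighting: play action i with probability
    proportional to (1 + eta) ** G_i(t-1)."""

    def __init__(self, K: int, eta: float):
        self.eta = eta
        self.G = [0.0] * K       # G_i(t-1): total reward scored so far by each action

    def distribution(self):
        # Subtracting max(G) before exponentiating leaves the normalized
        # probabilities unchanged but avoids numerical overflow in long games.
        m = max(self.G)
        w = [(1.0 + self.eta) ** (g - m) for g in self.G]
        s = sum(w)
        return [wi / s for wi in w]

    def choose(self) -> int:
        p = self.distribution()
        return random.choices(range(len(self.G)), weights=p)[0]

    def update(self, rewards) -> None:
        # Accumulate the full reward vector: G_i <- G_i + x_i(t).
        for i, xi in enumerate(rewards):
            self.G[i] += xi
```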
[...]

Taking expectations with respect to the random choice of plays and using standard approximations for the $\ln$ function gives a lower bound on the expected gain:
Theorem 3.2 For $\eta > 0$, the expected gain of algorithm Hedge in the full information game is at least
$$\mathbf{E}[G_{\mathrm{Hedge}}] \;\ge\; \frac{\ln(1+\eta)}{\eta}\,G_{\mathrm{best}} - \frac{\ln K}{\eta}.$$
In particular, for an appropriate choice of $\eta$ this implies a regret of at most $\sqrt{2\,G_{\mathrm{best}}\ln K} \le \sqrt{2\,T\ln K}$ in the full information game.
4 The partial information game
In this section, we move to the analysis of the partial information game. We present an algorithm, called Exp3, that runs the algorithm Hedge of Section 3 as a subroutine. (Exp3 stands for "Exponential-weight algorithm for Exploration and Exploitation.") The algorithm is described in Figure 2. On each trial, Exp3 receives the distribution vector $p(t)$ from Hedge and selects an action $i_t$ according to the distribution $\hat{p}(t)$, which is a mixture of $p(t)$ and the uniform distribution.
Algorithm Exp3

Parameters: Reals $\eta > 0$ and $\gamma \in (0,1]$.
Initialization: Initialize Hedge($\eta$).

Repeat for $t = 1, 2, \ldots$ until game ends
1. Get the distribution $p(t)$ from Hedge.
2. Select action $i_t$ to be $j$ with probability $\hat{p}_j(t) = (1-\gamma)\,p_j(t) + \gamma/K$.
3. Receive the reward $x_{i_t}(t) \in [0,1]$.
4. Feed the simulated reward vector $\hat{x}(t)$ back to Hedge, where
$$\hat{x}_j(t) = \begin{cases} \dfrac{\gamma}{K}\cdot\dfrac{x_{i_t}(t)}{\hat{p}_{i_t}(t)} & \text{if } j = i_t, \\[4pt] 0 & \text{otherwise.} \end{cases}$$
Figure 2: Algorithm Exp3 for the partial information game.

Intuitively, mixing in the uniform distribution is done in order to make sure that the algorithm tries out all $K$ actions and gets good estimates of the rewards for each. Otherwise, the algorithm might miss a good action because the initial rewards it observes for this action are low, and large rewards that occur later are not observed because the action is not selected.
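A sketch of this procedure, reusing the Hedge sketch above, might look as follows. The $\gamma/K$ scaling of the simulated reward, and all class and method names, are our assumptions rather than the paper's definitive choices.

```python
import random

class Exp3:
    """Partial-information play: mix Hedge's distribution with the uniform
    distribution, and feed an importance-weighted simulated reward back."""

    def __init__(self, K: int, eta: float, gamma: float):
        self.K = K
        self.gamma = gamma
        self.hedge = Hedge(K, eta)          # the Hedge sketch above
        self._p_hat = [1.0 / K] * K

    def choose(self) -> int:
        p = self.hedge.distribution()       # step 1: distribution p(t) from Hedge
        self._p_hat = [(1 - self.gamma) * pi + self.gamma / self.K for pi in p]
        return random.choices(range(self.K), weights=self._p_hat)[0]   # step 2

    def observe(self, i_t: int, reward: float) -> None:
        # Step 4: simulated reward vector, zero except for the chosen action,
        # scaled by gamma/K so that it stays in [0, 1] (our assumed scaling).
        x_hat = [0.0] * self.K
        x_hat[i_t] = (self.gamma / self.K) * reward / self._p_hat[i_t]
        self.hedge.update(x_hat)
```

On each trial one would call `choose()`, play the returned action $i_t$, and then call `observe(i_t, reward)` with the single observed reward.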
After Exp3 receives the reward $x_{i_t}(t)$ associated with the chosen action, it generates a simulated reward vector $\hat{x}(t)$ for Hedge. As Hedge requires full information, all components of this vector must be filled in, even for the actions that were not selected. For actions not chosen, we set $\hat{x}_j(t)$ to be zero. For the chosen action $i_t$, we set the simulated reward $\hat{x}_{i_t}(t)$ proportional to $x_{i_t}(t)/\hat{p}_{i_t}(t)$. This compensates the reward of actions that are unlikely to be chosen and guarantees that the expected simulated gain associated with any fixed action is proportional to the actual gain of the action, i.e., that $\mathbf{E}[\hat{x}_j(t)\mid i_1,\ldots,i_{t-1}] = (\gamma/K)\,x_j(t)$.

Theorem 4.1 For $\eta > 0$ and $\gamma \in (0,1]$, the expected gain of algorithm Exp3 in the partial information game is at least
$$\mathbf{E}[G_{\mathrm{Exp3}}] \;\ge\; (1-\gamma)\left[\frac{\ln(1+\eta)}{\eta}\,G_{\mathrm{best}} - \frac{K\ln K}{\gamma\,\eta}\right].$$
By making an appropriate choice of the parameters $\gamma$ and $\eta$, we get the following bound.
Corollary 4.2 If $G_{\mathrm{best}} \le M$ and algorithm Exp3 is run with input parameters $\gamma = \sqrt[3]{K\ln K/(2M)}$ and $\eta = 2\gamma$, then the regret of algorithm Exp3 is at most
$$R_{\mathrm{Exp3}} \;\le\; 3\,M^{2/3}\Bigl(\frac{K\ln K}{2}\Bigr)^{1/3} \;=\; \frac{3}{\sqrt[3]{2}}\,M^{2/3}\,(K\ln K)^{1/3}.$$
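The property driving the analysis is that, conditioned on the past, the simulated reward is an unbiased estimate of the true reward up to the $\gamma/K$ factor. The short Monte Carlo check below illustrates this for a single trial; the specific numbers and the scaling convention are our assumptions.

```python
import random

def simulated_reward(j, i_t, x, p_hat, gamma):
    """x_hat_j(t) when action i_t was drawn (gamma/K scaling assumed)."""
    K = len(x)
    return (gamma / K) * x[i_t] / p_hat[i_t] if j == i_t else 0.0

x = [0.3, 0.9, 0.5]                                   # true rewards x_j(t)
p = [0.2, 0.5, 0.3]                                   # Hedge's distribution p(t)
gamma, K = 0.3, len(x)
p_hat = [(1 - gamma) * pi + gamma / K for pi in p]    # Exp3's mixture distribution

n = 200_000
draws = random.choices(range(K), weights=p_hat, k=n)  # i_t drawn from p_hat
for j in range(K):
    avg = sum(simulated_reward(j, i, x, p_hat, gamma) for i in draws) / n
    print(j, round(avg, 3), round((gamma / K) * x[j], 3))   # agree up to sampling error
```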
Proof of Theorem 4.1. We begin by showing a lower bound on the total reward that holds for any sequence of actions $i_1,\ldots,i_T$:
$$G_{\mathrm{Exp3}} \;=\; \sum_{t=1}^{T} x_{i_t}(t) \;\ge\; (1-\gamma)\,\frac{K}{\gamma}\sum_{t=1}^{T}\sum_{i=1}^{K} p_i(t)\,\hat{x}_i(t) \;\ge\; (1-\gamma)\,\frac{K}{\gamma}\left[\frac{\ln(1+\eta)}{\eta}\sum_{t=1}^{T}\hat{x}_j(t) - \frac{\ln K}{\eta}\right]. \qquad (2)$$
Equation (2) holds for any action $j$ by Lemma 3.1, since the simulated reward vectors $\hat{x}(t)$ are fed back to Hedge at each time step. For any $j$ we have
$$\mathbf{E}[\hat{x}_j(t)] \;=\; \mathbf{E}\bigl[\,\mathbf{E}[\hat{x}_j(t)\mid i_1,\ldots,i_{t-1}]\,\bigr] \;=\; \mathbf{E}\left[\hat{p}_j(t)\cdot\frac{\gamma}{K}\cdot\frac{x_j(t)}{\hat{p}_j(t)} + (1-\hat{p}_j(t))\cdot 0\right] \;=\; \frac{\gamma}{K}\,\mathbf{E}[x_j(t)].$$
[...]

Dividing each observed reward by the probability of taking the action guarantees that the expected value of each estimate equals the expected actual gain, $\mathbf{E}[\tilde{G}_j(t)] = \mathbf{E}\bigl[\sum_{s=1}^{t} x_j(s)\bigr]$ for all $1 \le j \le K$ and all $1 \le t \le T$, where $\tilde{G}_j(t)$ denotes the value of $\tilde{G}_j$ at the end of trial $t$. Using these estimates, the algorithm detects (approximately) when the actual gain of some action has advanced beyond the current guess. When this happens, the algorithm increments the round number and restarts Exp3 with a larger bound on the maximal gain.
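The guess-and-restart strategy just described can be sketched roughly as follows, on top of the Exp3 sketch above. The initial guess, the doubling schedule, the termination test, and the per-round parameter choices are our assumptions about details not spelled out here, not the paper's exact specification.

```python
import math

def exp3_point1(K: int, T: int, get_reward):
    """Guess-and-restart wrapper around Exp3 (rough sketch only).
    get_reward(i) returns the bandit feedback x_i(t) for the chosen action."""
    G_tilde = [0.0] * K                # unbiased running estimates of each action's gain
    bound = K * math.log(K)            # initial guess for the maximal gain (assumption)
    total = 0.0
    t = 0
    while t < T:
        # Per-round parameters in the spirit of Corollary 4.2 (assumption).
        gamma = min(1.0, (K * math.log(K) / (2 * bound)) ** (1 / 3))
        exp3 = Exp3(K, eta=2 * gamma, gamma=gamma)
        while t < T and max(G_tilde) <= bound:
            i_t = exp3.choose()
            reward = get_reward(i_t)
            exp3.observe(i_t, reward)
            total += reward
            # Importance-weighted estimate of the actual (unscaled) gain.
            G_tilde[i_t] += reward / exp3._p_hat[i_t]
            t += 1
        bound *= 2                     # an estimate passed the guess: restart with a larger bound
    return total
```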
The performance of the algorithm is characterized by the following theorem.
Theorem 5.1 For $K \ge 2$, the regret suffered by algorithm Exp3.1 is at most
$$R_{\mathrm{Exp3.1}} \;\le\; c\,G_{\mathrm{best}}^{2/3}\,(K\ln K)^{1/3}$$
for an absolute constant $c$.
(We did not attempt to optimize the constants in this theorem.) The proof of the theorem is divided into two lemmas. The first bounds the regret caused by each round, and the second bounds the number of rounds. We define the following random variables: $T_r$ denotes the trial on which the $r$th round begins, and $R$ denotes the total number of rounds (i.e., the value of $r$ on the last trial). For notational convenience, we also define $T_{R+1}$ to be $T+1$.

[...]
Proof. We use Equation (2) from the proof of Theorem 4.1. We replace [...]. We bound the second occurrence of the sum by adding non-negative terms: [...]. From the definition of the termination condition, we know that [...]. Substituting this bound and our choices for $\gamma$ and $\eta$ into the last equation above, we get the statement of the lemma.
The next lemma shows that, with high probability, there are not too many rounds.
Lemma 5.3 Let $K \ge 2$, [...] $G_{\mathrm{best}}$, and [...] $\log_2 G_{\mathrm{best}}$. Then for any $r \ge [\ldots]$,
$$\Pr[R \ge r] \;\le\; [\ldots].$$

Proof. (Sketch) The idea of the proof is simple. If algorithm Exp3.1 terminated round number $r$ before iteration $T$, then the value of at least one of the estimators $\tilde{G}_1(t),\ldots,\tilde{G}_K(t)$ at some iteration $t$ has to be larger than the bound used on round $r$. However, the expected value of each estimator is at most $G_{\mathrm{best}}$, so reaching round $r+1$ implies a large estimation error. Simple arguments from probability theory show that the probability of such a large error in any of the estimators is very small. (Details given in Appendix B.)
Proof of Theorem 5.1. Since the theorem holds trivially when $G_{\mathrm{best}} \le [\ldots]$, we assume without loss of generality that $G_{\mathrm{best}} > [\ldots]$. We fix some action $j$, partition the expected total gain into rounds, and bound the gain from each round using Lemma 5.2:
$$[\ldots]$$
This gives the statement of the theorem.
6 A lower bound

In this section, we prove an information-theoretic lower bound on the regret of any player, i.e., a lower bound that holds [...]
