Adversarial Reinforcement Learning
William Uther and Manuela Veloso
January 2003
CMU-CS-03-107
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
This manuscript was originally submitted for publication in April 1997. Corrections were never completed, and so the paper was not published. However, a copy was placed on the web and a number of people referenced the work from there. It is now being published as a technical report for ease of reference. With the exception of the addition of this title page, the work is unmodified from the 1997 original.
Keywords: Reinforcement Learning, Markov Games, Adversarial Reinforcement Learning
Abstract
Reinforcement Learning has been used for a number of years in single agent environments. This article reports on our investigation of Reinforcement Learning techniques in a multi-agent and adversarial environment with continuous observable state information. We introduce a new framework, two-player hexagonal grid soccer, in which to evaluate algorithms. We then compare the performance of several single-agent Reinforcement Learning techniques in that environment. These are further compared to a previously developed adversarial Reinforcement Learning algorithm designed for Markov games. Building upon these efforts, we introduce new algorithms to handle the multi-agent, the adversarial, and the continuous-valued aspects of the domain. We introduce a technique for modelling the opponent in an adversarial game. We introduce an extension to Prioritized Sweeping that allows generalization of learnt knowledge over neighboring states in the domain; and we introduce an extension to the U Tree generalizing algorithm that allows the handling of continuous state spaces. Extensive empirical evaluation is conducted in the grid soccer domain.
Adversarial Reinforcement Learning
William Uther and Manuela Veloso
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213
{uther,veloso}@cs.cmu.edu
April 24, 1997
Abstract
Reinforcement Learning has been used for a number of years in single agent environments. This article reports on our investigation of Reinforcement Learning techniques in a multi-agent and adversarial environment with continuous observable state information. We introduce a new framework, two-player hexagonal grid soccer, in which to evaluate algorithms. We then compare the performance of several single-agent Reinforcement Learning techniques in that environment. These are further compared to a previously developed adversarial Reinforcement Learning algorithm designed for Markov games. Building upon these efforts, we introduce new algorithms to handle the multi-agent, the adversarial, and the continuous-valued aspects of the domain. We introduce a technique for modelling the opponent in an adversarial game. We introduce an extension to Prioritized Sweeping that allows generalization of learnt knowledge over neighboring states in the domain; and we introduce an extension to the U Tree generalizing algorithm that allows the handling of continuous state spaces. Extensive empirical evaluation is conducted in the grid soccer domain.
1 Introduction
Multi-agent adversarial environments have traditionally been addressed as game playing situations. Indeed, one of the first areas to be studied in Artificial Intelligence was game playing. For example, the pioneering checkers playing algorithm by [Samuel, 1959] used both search and machine learning strategies. Interestingly, his approach is similar to modern Reinforcement Learning techniques [Kaelbling et al., 1996]. An evaluation function that guides the selection of moves is represented as a parameterized weighted sum of game features. Parameters are incrementally refined as a function of the game playing performance. This is a similar method to classical Reinforcement Learning, which also provides for incremental update of an evaluation function, although in this case it is represented as a table of values.
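This weighted-sum evaluation can be written compactly; the notation below is ours, and the update is only a sketch of the general incremental scheme rather than Samuel's exact procedure. With features f_i(s) of a game position s, weights w_i, a step size \alpha, and a prediction error \delta (the gap between the current evaluation and a backed-up target), the evaluation and its refinement are

\[ V(s) = \sum_i w_i f_i(s), \qquad w_i \leftarrow w_i + \alpha\,\delta\,f_i(s). \]

The tabular representation of classical Reinforcement Learning is the special case in which each state has its own indicator feature, so each update touches exactly one table entry.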
Since Samuel's work, however, Reinforcement Learning techniques were not used again in an adversarial setting until quite recently. [Tesauro, 1995, Thrun, 1995] have both used neural nets in a Reinforcement Learning paradigm. [Tesauro, 1995]'s work in the game of backgammon was successful, but required hand tuned features being fed to the algorithm for high quality play. [Thrun, 1995] was moderately successful in using similar techniques in chess, but the techniques were not as successful as they had been in the backgammon domain. This work has been repeated in other domains, but again, without the same success as in the backgammon domain (in [Kaelbling et al., 1996]).
[Littman, 1994] took standard Q Learning, [Watkins and Dayan, 1992], and modified it to work with Markov games. He replaced the simple update used in standard Q Learning with a mixed strategy (probabilistic) update. He then evaluated this by playing against both standard Q Learning and random players in a simple game. The game used in [Littman, 1994] is a small two player grid soccer game designed to be able to be solved quickly by traditional Q Learning techniques. He trained 4 different players for his game. Two players used his algorithm, two used normal Q Learning. One of each was trained against a random opponent, the other against an opponent of the same type. Littman then froze those four players and trained ‘challengers’ against them. His results showed that his algorithm, which learned a probabilistic strategy, performed better under these conditions than Q Learning, which learned a deterministic strategy, or his hand coded, but again deterministic, strategy.
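Concretely, Littman's mixed-strategy backup replaces the usual maximization over the agent's own actions with the value of a one-shot matrix game. Writing Q(s, a, o) for the value of taking action a while the opponent takes action o in state s, with learning rate \alpha and discount factor \gamma (notation adapted here for brevity), the update is

\[ Q(s,a,o) \leftarrow (1-\alpha)\,Q(s,a,o) + \alpha\bigl(r + \gamma V(s')\bigr), \qquad V(s) = \max_{\pi \in PD(A)} \min_{o \in O} \sum_{a \in A} \pi_a\, Q(s,a,o), \]

where PD(A) is the set of probability distributions over the agent's actions; the maximizing \pi is the mixed strategy the player actually follows.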
We use a similar environment to that used by [Littman, 1994] to investigate Markov games. Our environment is larger, both in number of states and number of actions per state, to more effectively test the generalization capabilities of our algorithms. We conduct tests where both players are learning as they play. This allows learning to take the place of a mixed, or probabilistic, strategy. We look at a number of standard Reinforcement Learning algorithms and compare them in a simple game. None of the algorithms we test perform any internal search or lookahead when deciding actions; they all use just the current state and their learnt evaluation for that state. While search would improve performance, we considered it orthogonal, and a future step, to learning the evaluation function.
In the Reinforcement Learning paradigm an agent is placed in a situation without knowledge of any goals or other information about the environment. As the agent acts in the environment it is given feedback: a reinforcement value or reward that defines the utility of being in the current state. Over time the agent is supposed to customize its actions to the environment so as to maximize the sum of this reward. By only giving the agent reward when a goal is reached, the agent learns to achieve its goals.
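In symbols (the notation here is ours, not something defined elsewhere in this paper), the agent seeks to maximize the expected, typically discounted, sum of rewards, and the tabular Q Learning of [Watkins and Dayan, 1992] cited above does so with the one-step update

\[ \max \; E\Bigl[\textstyle\sum_{t=0}^{\infty} \gamma^{t} r_{t}\Bigr], \qquad Q(s,a) \leftarrow Q(s,a) + \alpha\bigl(r + \gamma \max_{a'} Q(s',a') - Q(s,a)\bigr), \]

where \gamma \in [0,1) is a discount factor and \alpha a learning rate; the discounting is the standard formulation and is assumed here, since the text above speaks only of the sum of reward.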
In an adversarial setting there are multiple (at least two) agents in the world. In particular, in a game with two players, when an agent wins a game it is given a positive reinforcement and its opponent is given negative reinforcement. Maximizing reward corresponds directly to winning games. Over time the agent is learning to act so that it wins the game.
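If, purely for illustration, a win is rewarded with +1 and a loss with -1 (the magnitudes are an assumption, not something fixed by this section), then the two players' rewards satisfy

\[ r_1 = -r_2, \qquad V_1(s) = -V_2(s), \]

i.e. the game is zero-sum and one player's value function is the negation of the other's, which is exactly the structure the minimax backup above exploits.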
In this paper we investigate the performance of some previously published algorithms in an adversarial environment: Q Learning, Minimax Q Learning, and Prioritized Sweeping. We also introduce a new algorithm, Opponent Modelling Q Learning, to try to improve upon these algorithms. All of these techniques rely on a table of values and actions and do not generalize between similar or equivalent states. The learned tables are “state-specific.” We introduce Fitted Prioritized Sweeping and a modification of the U Tree algorithm [McCallum, 1995], Continuous U Tree, as examples of algorithms that generalize over multiple states. Finally, we look at what can be learned by looking at the world from your opponent's point of view.
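As a concrete illustration of what such a state-specific table looks like once an opponent model is added, the following sketch keeps a tabular Q(s, a, o) alongside empirical counts of the opponent's actions in each state and acts greedily against the resulting empirical mixed strategy. It is a minimal sketch of the general opponent-modelling idea under our own assumptions; the class name, the epsilon-greedy exploration, and the parameters alpha, gamma and epsilon are illustrative choices, and the precise algorithm is specified later in the paper.

import random
from collections import defaultdict

class OpponentModelQLearner:
    """Illustrative sketch only: a state-specific Q(s, a, o) table plus
    empirical counts of the opponent's actions in each state."""

    def __init__(self, actions, opp_actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.actions = actions            # our action set
        self.opp_actions = opp_actions    # opponent's action set
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = defaultdict(float)                           # (s, a, o) -> value
        self.counts = defaultdict(lambda: defaultdict(int))   # s -> {o: count}

    def _expected_q(self, state, action):
        """Expected value of an action against the empirical opponent strategy."""
        total = sum(self.counts[state].values())
        if total == 0:
            # No observations yet in this state: assume a uniform opponent.
            return sum(self.q[(state, action, o)] for o in self.opp_actions) / len(self.opp_actions)
        return sum(self.counts[state][o] / total * self.q[(state, action, o)]
                   for o in self.opp_actions)

    def value(self, state):
        """State value: best expected action value under the opponent model."""
        return max(self._expected_q(state, a) for a in self.actions)

    def choose_action(self, state):
        """Epsilon-greedy choice against the modelled opponent."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self._expected_q(state, a))

    def update(self, state, action, opp_action, reward, next_state, done=False):
        """Record the observed opponent action and do a one-step backup."""
        self.counts[state][opp_action] += 1
        target = reward if done else reward + self.gamma * self.value(next_state)
        key = (state, action, opp_action)
        self.q[key] += self.alpha * (target - self.q[key])

Because every (state, action, opponent-action) triple has its own entry, nothing learned in one state transfers to a neighbouring one; that lack of generalization is precisely what Fitted Prioritized Sweeping and Continuous U Tree, described below, are designed to address.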