Ann Nowé, Peter Vrancx, and Yann-Michaël De Hauwere
Abstract. Reinforcement Learning was originally developed for Markov Decision Processes (MDPs). It allows a single agent to learn a policy that maximizes a possibly delayed reward signal in a stochastic stationary environment. It guarantees convergence to the optimal policy, provided that the agent can sufficiently experiment and the environment in which it is operating is Markovian. However, when multiple agents apply reinforcement learning in a shared environment, the setting may fall outside the MDP model. In such systems, the optimal policy of an agent depends not only on the environment, but on the policies of the other agents as well. These situations arise naturally in a variety of domains, such as robotics, telecommunications, economics, distributed control, auctions and traffic light control. In these domains multi-agent learning is used, either because of the complexity of the domain or because control is inherently decentralized. In such systems it is important that agents are capable of discovering good solutions to the problem at hand, either by coordinating with other learners or by competing with them. This chapter focuses on the application of reinforcement learning techniques in multi-agent systems. We describe a basic learning framework based on the economic research into game theory, and illustrate the additional complexity that arises in such systems. We also describe a representative selection of algorithms for the different areas of multi-agent reinforcement learning research.
The reinforcement learning techniques studied throughout this book enable a single agent to learn optimal behavior through trial-and-error interactions with its environment. Various RL techniques have been developed which allow an agent to optimize its behavior in a wide range of circumstances. However, when multiple learners simultaneously apply reinforcement learning in a shared environment, the traditional approaches often fail.
In the multi-agent setting, the assumptions that are needed to guarantee convergence are often violated. Even in the most basic case, where agents share a stationary environment and need to learn a strategy for a single state, many new complexities arise. When agent objectives are aligned and all agents try to maximize the same reward signal, coordination is still required to reach the global optimum. When agents have opposing goals, a clear optimal solution may no longer exist. In this case, an equilibrium between agent strategies is usually searched for. In such an equilibrium, no agent can improve its payoff when the other agents keep their actions fixed.
When, in addition to multiple agents, we assume a dynamic environment which requires multiple sequential decisions, the problem becomes even more complex. Now agents not only have to coordinate, they also have to take into account the current state of their environment. This problem is further complicated by the fact that agents typically have only limited information about the system. In general, they may not be able to observe actions or rewards of other agents, even though these actions have a direct impact on their own rewards and their environment. In the most challenging case, an agent may not even be aware of the presence of other agents, making the environment seem non-stationary. In other cases, the agents have access to all this information, but learning in a fully joint state-action space is in general impractical, both due to the computational complexity and in terms of the coordination required between the agents. In order to develop a successful multi-agent approach, all these issues need to be addressed. Figure 14.1 depicts a standard model of Multi-Agent Reinforcement Learning.
Despite the added learning complexity, a real need for multi-agent systems exists. Often systems are inherently decentralized, and a central, single agent learning approach is not feasible. This situation may arise because data or control is physically distributed, because multiple, possibly conflicting, objectives should be met, or simply because a single centralized controller requires too many resources. Examples of such systems are multi-robot set-ups, decentralized network routing, distributed load-balancing, electronic auctions, traffic control and many others.
The need for adaptive multi-agent systems, combined with the complexities of dealing with interacting learners, has led to the development of a multi-agent reinforcement learning field, which is built on two basic pillars: the reinforcement learning research performed within AI, and the interdisciplinary work on game theory. While early game theory focused on purely competitive games, it has since developed into a general framework for analyzing strategic interactions. It has attracted interest from fields as diverse as psychology, economics and biology. With the advent of multi-agent systems, it has also gained importance within the AI community and computer science in general. In this chapter we discuss how game theory provides both a means to describe the problem setting for multi-agent learning and the tools to analyze the outcome of learning.
The multi-agent systems considered in this chapter are characterized by strategic interactions between the agents. By this we mean that the agents are autonomous entities, who have individual goals and independent decision making capabilities, but who are also influenced by each other’s decisions. We distinguish this setting from the approaches that can be regarded as distributed or parallel reinforcement learning. In such systems multiple learners collaboratively learn a single objective. This includes systems where multiple agents update the policy in parallel (Mariano and Morales, 2001), swarm based techniques (Dorigo and Stützle, 2004) and approaches dividing the learning state space among agents (Steenhaut et al., 1997). Many of these systems can be treated as advanced exploration techniques for standard reinforcement learning and are still covered by the single agent theoretical frameworks, such as the framework described by Tsitsiklis (1994). The convergence of these algorithms remains valid as long as outdated information is eventually discarded. For example, this framework allows the use of outdated Q-values in the max-operator on the right hand side of the standard Q-learning update rule (described in Chapter 1). This is particularly interesting when the Q-values belong to different agents, each exploring their own part of the environment and only now and then exchanging their Q-values. The systems covered by this chapter, however, go beyond the standard single agent theory, and as such require a different framework.
An overview of multi-agent research based on strategic interactions between agents is given in Table 14.1. The techniques listed are categorized based on their applicability and the kind of information they use while learning in a multi-agent system. We distinguish between techniques for stateless games, which focus on dealing with multi-agent interactions while assuming that the environment is stationary, and Markov game techniques, which deal with both multi-agent interactions and a dynamic environment. Furthermore, we also show the information used by the agents for learning. Independent learners learn based only on their own reward observation, while joint action learners also use observations of actions and possibly rewards of the other agents.
Table 14.1 Overview of current MARL approaches. Algorithms are classified by their applicability (common interest or general Markov games) and their information requirement (scalar feedback or joint-action information).
In the following section we will describe the repeated games framework. This setting introduces many of the complexities that arise from interactions between learning agents. However, the repeated game setting only considers static, stateless environments, where the learning challenges stem only from the interactions with other agents. In Section 14.3 we introduce Markov Games. This framework generalizes the Markov Decision Process (MDP) setting usually employed for single agent RL. It considers both interactions between agents and a dynamic environment. We explain both value iteration and policy iteration approaches for solving these Markov games. Section 14.4 describes the current state of the art in multi-agent research, which takes the middle ground between independent learning techniques and Markov game techniques operating in the full joint-state joint-action space. Finally, in Section 14.5, we briefly describe other interesting background material.
14.2 Repeated Games
14.2.1 Game Theory
The central idea of game theory is to model strategic interactions as a game between a set of players. A game is a mathematical object, which describes the consequences of interactions between player strategies in terms of individual payoffs. Different representations for a game are possible. For example, traditional AI research often focuses on extensive form games, which are used as a representation of situations where players take turns to perform an action. This representation is used, for instance, with the classical minimax algorithm (Russell and Norvig, 2003). In this chapter, however, we will focus on the so-called normal form games, in which game players simultaneously select an individual action to perform. This setting is often used as a testbed for multi-agent learning approaches. Below we review the basic game theoretic terminology and define some common solution concepts in games.
Normal Form Games
Definition 14.1. A normal form game is a tuple (n, A_{1,…,n}, R_{1,…,n}), where
· 1, …, n is a collection of participants in the game, called players;
· A_k is the individual (finite) set of actions available to player k;
· R_k : A_1 × … × A_n → ℝ is the individual reward function of player k, specifying the expected payoff he receives for a play a ∈ A_1 × … × A_n.
A game is played by allowing each player k to independently select an individual action a_k from its private action set A_k. The combination of actions of all players constitutes a joint action or action profile a from the joint action set A = A_1 × … × A_n. For each joint action a ∈ A, R_k(a) denotes agent k’s expected payoff.
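As a rough illustration of the definition, the tuple of players, action sets and reward functions can be written down directly in code. The sketch below is only one possible encoding: the class name NormalFormGame, its methods, and the small two-player example (a matching-pennies-style game chosen for illustration, not taken from the chapter's tables) are all assumptions introduced here.

```python
from itertools import product


class NormalFormGame:
    """Hypothetical container for a normal form game (n, A_1..n, R_1..n)."""

    def __init__(self, action_sets, reward_fns):
        # One finite action set A_k and one reward function R_k per player.
        if len(action_sets) != len(reward_fns):
            raise ValueError("need exactly one reward function per player")
        self.n = len(action_sets)
        self.action_sets = [tuple(actions) for actions in action_sets]
        self.reward_fns = list(reward_fns)

    def joint_actions(self):
        """Enumerate the joint action set A = A_1 x ... x A_n."""
        return list(product(*self.action_sets))

    def payoffs(self, joint_action):
        """Return (R_1(a), ..., R_n(a)) for a joint action a."""
        return tuple(r(joint_action) for r in self.reward_fns)


# Illustrative reward functions (assumed values): player 1 wins on a match,
# player 2 receives the negated payoff.
def r1(a):
    return 1 if a[0] == a[1] else -1


def r2(a):
    return -r1(a)


game = NormalFormGame([("H", "T"), ("H", "T")], [r1, r2])

for a in game.joint_actions():
    print(a, game.payoffs(a))
```

Each player contributes only its own component of the joint action, yet every reward function takes the full joint action as input. This is what makes a player's payoff depend on the choices of the others.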
Normal form games are represented by their payoff matrix. Some typical 2-player games are given in Table 14.2. In this case the action selected by player 1 refers to a row in the matrix, while that of player 2 determines the column. The corresponding entry in the matrix then gives the payoffs player 1 and player 2 receive for the play. Players 1 and 2 are also referred to as the row and the column player, respectively. Using higher-dimensional matrices, n-player games can be represented, where each entry in the matrix contains the payoff for each of the agents for the corresponding combination of actions.
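The row and column lookup described above, and its extension to n players, can be sketched with nested lists. The payoff values below are assumptions picked for illustration (a Prisoner's-Dilemma-like pattern and a toy 3-player table); they are not the entries of Table 14.2, and the helper name payoff is hypothetical.

```python
from functools import reduce

# Illustrative 2-player payoff matrix (assumed values).
# Entry [i][j] holds (payoff to the row player, payoff to the column player)
# when player 1 selects row i and player 2 selects column j.
TWO_PLAYER = [
    [(3, 3), (0, 5)],
    [(5, 0), (1, 1)],
]


def payoff(matrix, joint_action):
    """Index a nested payoff table with one action index per player."""
    return reduce(lambda table, action: table[action], joint_action, matrix)


# Toy 3-player game stored as a 2 x 2 x 2 table: every player receives the
# sum of the chosen action indices (an arbitrary choice for this sketch).
THREE_PLAYER = [
    [[(i + j + k,) * 3 for k in range(2)] for j in range(2)]
    for i in range(2)
]

print(payoff(TWO_PLAYER, (1, 0)))       # row player picks row 1, column player picks column 0
print(payoff(THREE_PLAYER, (1, 0, 1)))  # one index per player
```

The same lookup function serves any number of players because each additional player simply adds one more level of nesting. The size of the table grows exponentially with the number of players, which is one reason why learning in the full joint action space quickly becomes impractical.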