Game Theory and Multi-agent Reinforcement Learning
Ann Nowé, Peter Vrancx, and Yann-Michaël De Hauwere
Abstract. Reinforcement Learning was originally developed for Markov Decision Processes (MDPs). It allows a single agent to learn a policy that maximizes a possibly delayed reward signal in a stochastic stationary environment. It guarantees convergence to the optimal policy, provided that the agent can sufficiently experiment and the environment in which it is operating is Markovian. However, when multiple agents apply reinforcement learning in a shared environment, this might be beyond the MDP model. In such systems, the optimal policy of an agent depends not only on the environment, but on the policies of the other agents as well. These situations arise naturally in a variety of domains, such as: robotics, telecommunications, economics, distributed control, auctions, traffic light control, etc. In these domains multi-agent learning is used, either because of the complexity of the domain or because control is inherently decentralized. In such systems it is important that agents are capable of discovering good solutions to the problem at hand, either by coordinating with other learners or by competing with them. This chapter focuses on the application of reinforcement learning techniques in multi-agent systems. We describe a basic learning framework based on the economic research into game theory, and illustrate the additional complexity that arises in such systems. We also describe a representative selection of algorithms for the different areas of multi-agent reinforcement learning research.
14.1 Introduction
The reinforcement learning techniques studied throughout this book enable a single agent to learn optimal behavior through trial-and-error interactions with its environment. Various RL techniques have been developed which allow an agent to optimize its behavior in a wide range of circumstances. However, when multiple learners simultaneously apply reinforcement learning in a shared environment, the traditional approaches often fail.
In the multi-agent setting, the assumptions that are needed to guarantee convergence are often violated. Even in the most basic case, where agents share a stationary environment and need to learn a strategy for a single state, many new complexities arise. When agent objectives are aligned and all agents try to maximize the same reward signal, coordination is still required to reach the global optimum. When agents have opposing goals, a clear optimal solution may no longer exist. In this case, an equilibrium between agent strategies is usually searched for. In such an equilibrium, no agent can improve its payoff when the other agents keep their actions fixed.
When, in addition to multiple agents, we assume a dynamic environment which requires multiple sequential decisions, the problem becomes even more complex. Now agents do not only have to coordinate, they also have to take into account the current state of their environment. This problem is further complicated by the fact that agents typically have only limited information about the system. In general, they may not be able to observe actions or rewards of other agents, even though these actions have a direct impact on their own rewards and their environment. In the most challenging case, an agent may not even be aware of the presence of other agents, making the environment seem non-stationary. In other cases, the agents have access to all this information, but learning in a fully joint state-action space is in general impractical, both due to the computational complexity and in terms of the coordination required between the agents. In order to develop a successful multi-agent approach, all these issues need to be addressed. Figure 14.1 depicts a standard model of Multi-Agent Reinforcement Learning.
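To make this standard model concrete, the following is a minimal Python sketch of the interaction loop it describes: each agent independently selects an action, the joint action drives the environment, and every agent observes its own reward and the resulting state. The environment interface (reset, step returning per-agent rewards) and the class and method names are illustrative assumptions, not part of the chapter.

```python
import random

class RandomAgent:
    """Placeholder learner: picks actions uniformly and ignores feedback."""
    def __init__(self, actions):
        self.actions = actions

    def select_action(self, state):
        return random.choice(self.actions)

    def update(self, state, action, reward, next_state):
        pass  # a real learner would adapt its policy here

def run_episode(env, agents, max_steps=100):
    # env is assumed to expose reset() and step(joint_action) -> (next_state, rewards, done),
    # where rewards contains one (possibly different) payoff per agent.
    state = env.reset()
    for _ in range(max_steps):
        # Each agent chooses independently; together the choices form a joint action.
        joint_action = [agent.select_action(state) for agent in agents]
        next_state, rewards, done = env.step(joint_action)
        # Every agent learns only from its own reward observation.
        for agent, action, reward in zip(agents, joint_action, rewards):
            agent.update(state, action, reward, next_state)
        state = next_state
        if done:
            break
```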
Despite the added learning complexity, a real need for multi-agent systems exists. Often systems are inherently decentralized, and a central, single agent learning approach is not feasible. This situation may arise because data or control is physically distributed, because multiple, possibly conflicting, objectives should be met, or simply because a single centralized controller requires too many resources. Examples of such systems are multi-robot set-ups, decentralized network routing, distributed load-balancing, electronic auctions, traffic control and many others.
The need for adaptive multi-agent systems, combined with the complexities of dealing with interacting learners, has led to the development of a multi-agent reinforcement learning field, which is built on two basic pillars: the reinforcement learning research performed within AI, and the interdisciplinary work on game theory. While early game theory focused on purely competitive games, it has since developed into a general framework for analyzing strategic interactions. It has attracted interest from fields as diverse as psychology, economics and biology. With the advent of multi-agent systems, it has also gained importance within the AI community and computer science in general. In this chapter we discuss how game theory provides both a means to describe the problem setting for multi-agent learning and the tools to analyze the outcome of learning.
The multi-agent systems considered in this chapter are characterized by strategic interactions between the agents. By this we mean that the agents are autonomous entities, who have individual goals and independent decision making capabilities, but who are also influenced by each other's decisions. We distinguish this setting from approaches that can be regarded as distributed or parallel reinforcement learning. In such systems multiple learners collaboratively learn a single objective. This includes systems where multiple agents update the policy in parallel (Mariano and Morales, 2001), swarm based techniques (Dorigo and Stützle, 2004) and approaches dividing the learning state space among agents (Steenhaut et al, 1997). Many of these systems can be treated as advanced exploration techniques for standard reinforcement learning and are still covered by the single agent theoretical frameworks, such as the framework described in (Tsitsiklis, 1994). The convergence of these algorithms remains valid as long as outdated information is eventually discarded. For example, this allows the use of outdated Q-values in the max-operator on the right hand side of the standard Q-learning update rule (described in Chapter 1). This is particularly interesting when the Q-values belong to different agents, each exploring their own part of the environment, and only now and then exchanging their Q-values. The systems covered by this chapter, however, go beyond the standard single agent theory, and as such require a different framework.
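As an illustration of this point, the sketch below shows a tabular Q-learning update in which the bootstrap term is taken from a possibly stale copy of another agent's Q-table. The function and variable names are hypothetical, and the dictionary-of-dictionaries representation of the Q-table is an assumption made only for this example.

```python
def q_update(local_q, stale_q, state, action, reward, next_state,
             alpha=0.1, gamma=0.95):
    """Standard Q-learning update, except that the max over next-state values
    is evaluated on a (possibly outdated) Q-table received from another agent."""
    # Both Q-tables map state -> {action: value}.
    bootstrap = max(stale_q[next_state].values())  # may rest on stale estimates
    td_target = reward + gamma * bootstrap
    local_q[state][action] += alpha * (td_target - local_q[state][action])

# Example usage with two single-state Q-tables:
# q_a = {0: {'left': 0.0, 'right': 0.0}}
# q_b = {0: {'left': 0.2, 'right': 0.5}}   # copy exchanged some time ago
# q_update(q_a, q_b, state=0, action='right', reward=1.0, next_state=0)
```

In line with the argument above, such updates remain sound only because the outdated copies are eventually refreshed and discarded.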
An overview of multi-agent research based on strategic interactions between agents is given in Table 14.1. The techniques listed are categorized based on their applicability and the kind of information they use while learning in a multi-agent system. We distinguish between techniques for stateless games, which focus on dealing with multi-agent interactions while assuming that the environment is stationary, and Markov game techniques, which deal with both multi-agent interactions and a dynamic environment. Furthermore, we also show the information used by the agents for learning. Independent learners learn based only on their own reward observation, while joint action learners also use observations of actions and possibly rewards of the other agents.
Table 14.1 Overview of current MARL approaches. Algorithms are classified by their applicability (common interest or general Markov games) and their information requirement (scalar feedback or joint-action information).
In the following section we will describe the repeated games framework. This setting introduces many of the complexities that arise from interactions between learning agents. However, the repeated game setting only considers static, stateless environments, where the learning challenges stem only from the interactions with other agents. In Section 14.3 we introduce Markov games. This framework generalizes the Markov Decision Process (MDP) setting usually employed for single agent RL. It considers both interactions between agents and a dynamic environment. We explain both value iteration and policy iteration approaches for solving Markov games. Section 14.4 describes the current state of the art in multi-agent research, which takes the middle ground between independent learning techniques and Markov game techniques operating in the full joint-state joint-action space. Finally, in Section 14.5, we shortly describe other interesting background material.
14.2 Repeated Games
14.2.1 Game Theory
The central idea of game theory is to model strategic interactions as a game between a set of players. A game is a mathematical object, which describes the consequences of interactions between player strategies in terms of individual payoffs. Different representations for a game are possible. For example, traditional AI research often focuses on extensive form games, which are used as a representation of situations where players take turns to perform an action. This representation is used, for instance, with the classical minimax algorithm (Russell and Norvig, 2003). In this chapter, however, we will focus on the so called normal form games, in which game players simultaneously select an individual action to perform. This setting is often used as a testbed for multi-agent learning approaches. Below we review basic game theoretic terminology and define some common solution concepts in games.
14.2.1.1 Normal Form Games
Definition 14.1. A normal form game is a tuple (n, A1,…,An, R1,…,Rn), where

· 1,…,n is a collection of participants in the game, called players;
· Ak is the individual (finite) set of actions available to player k;
· Rk : A1 × … × An → R is the individual reward function of player k, specifying the expected payoff he receives for a play a ∈ A1 × … × An.

A game is played by allowing each player k to independently select an individual action a from its private action set Ak. The combination of the actions of all players constitutes a joint action or action profile a from the joint action set A = A1 × … × An. For each joint action a ∈ A, Rk(a) denotes agent k's expected payoff.
Normal form games are represented by their payoff matrix. Some typical 2-player games are given in Table 14.2. In this case the action selected by player 1 refers to a row in the matrix, while that of player 2 determines the column. The corresponding entry in the matrix then gives the payoffs player 1 and player 2 receive for this play. Players 1 and 2 are also referred to as the row and the column player, respectively. Using higher dimensional matrices, n-player games can be represented, where each entry in the matrix contains the payoff for each of the agents for the corresponding combination of actions.
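To tie the definition to the matrix representation, the following is a small Python sketch of a 2-player normal form game stored as payoff matrices, together with a check of the equilibrium condition described in the introduction (no player can improve its payoff by deviating unilaterally). The payoff numbers are invented for illustration and are not taken from Table 14.2.

```python
# payoff[k][i][j] = expected payoff Rk(a) of player k for the joint action
# a = (i, j), where i is the row player's action and j the column player's.
payoff = [
    [[3, 0],   # player 1 (row player)
     [5, 1]],
    [[3, 5],   # player 2 (column player)
     [0, 1]],
]

def payoffs(joint_action):
    """Return the tuple (R1(a), R2(a)) for a joint action a = (i, j)."""
    i, j = joint_action
    return payoff[0][i][j], payoff[1][i][j]

def is_equilibrium(joint_action):
    """True if neither player gains by changing only its own action."""
    i, j = joint_action
    row_best = all(payoff[0][i][j] >= payoff[0][i2][j] for i2 in range(2))
    col_best = all(payoff[1][i][j] >= payoff[1][i][j2] for j2 in range(2))
    return row_best and col_best

print(payoffs((0, 0)))         # (3, 3)
print(is_equilibrium((1, 1)))  # True for these illustrative payoffs
```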