The Complexity of Decentralized Control of Markov Decision Processes


Daniel S. Bernstein, Shlomo Zilberstein, and Neil Immerman
Department of Computer Science
University of Massachusetts
Amherst, Massachusetts 01003
bern, shlomo, immerman@cs.umass.edu
Abstract
Planning for distributed agents with partial state information is considered from a decision-theoretic perspective. We describe generalizations of both the MDP and POMDP models that allow for decentralized control. For even a small number of agents, the finite-horizon problems corresponding to both of our models are complete for nondeterministic exponential time. The complexity results illustrate a fundamental difference between centralized and decentralized control of Markov processes. In contrast to the MDP and POMDP problems, the problems we consider provably do not admit polynomial-time algorithms and most likely require doubly exponential time to solve in the worst case. We have thus provided mathematical evidence corresponding to the intuition that decentralized planning problems cannot easily be reduced to centralized problems and solved exactly using established techniques.
1 Introduction
Among researchers in artificial intelligence, there has been growing interest in problems with multiple distributed agents working to achieve a common goal (Grosz & Kraus, 1996; Lesser, 1998; desJardins et al., 1999; Durfee, 1999; Stone & Veloso, 1999). In many of these problems, interagent communication is costly or impossible. For instance, consider two robots cooperating to push a box (Mataric, 1998). Communication between the robots may take time that could otherwise be spent performing physical actions. Thus, it may be suboptimal for the robots to communicate frequently. A planner is faced with the difficult task of deciding what each robot should do in between communications, when it only has access to its own sensory information. Other problems of planning for distributed agents with limited communication include maximizing the throughput of a multiple access broadcast channel (Ooi & Wornell, 1996) and coordinating multiple spacecraft on a mission together (Estlin et al., 1999). We are interested in the question of whether these planning problems are computationally harder to solve than problems that involve planning for a single agent or multiple agents with access to the exact same information.
We focus on centralized planning for distributed agents, with the Markov decision process (MDP) framework as the basis for our model of agents interacting with an environment. A partially observable Markov decision process (POMDP) is a generalization of an MDP in which an agent must base its decisions on incomplete information about the state of the environment (White, 1993). We extend the POMDP model to allow for multiple distributed agents to each receive local observations and base their decisions on these observations. The state transitions and expected rewards depend on the actions of all of the agents. We call this a decentralized partially observable Markov decision process (DEC-POMDP). An interesting special case of a DEC-POMDP satisfies the assumption that at any time step the state is uniquely determined from the current set of observations of the agents. This is denoted a decentralized Markov decision process (DEC-MDP). The MDP, POMDP, and DEC-MDP can all be viewed as special cases of the DEC-POMDP. The relationships among the models are shown in Figure 1.
There has been some related work in AI. Boutilier (1999) studies multi-agent Markov decision processes (MMDPs), but in this model the agents all have access to the same information. In the framework we describe, this assumption is not made. Peshkin et al. (2000) use essentially the DEC-POMDP model (although they refer to it as a partially observable identical payoff stochastic game (POIPSG)) and discuss algorithms for obtaining approximate solutions to the corresponding optimization problem. The models that we study also exist in the control theory literature (Ooi et al., 1997; Aicardi et al., 1987). However, the computational complexity inherent in these models has not been studied. One closely related piece of work is that of Tsitsiklis and Athans (1985), in which the complexity of nonsequential decentralized decision problems is studied.
Figure 1: The relationships among the models.
We discuss the computational complexity of finding optimal policies for the finite-horizon versions of these problems. It is known that solving an MDP is P-complete and that solving a POMDP is PSPACE-complete (Papadimitriou & Tsitsiklis, 1987). We show that solving a DEC-POMDP with a constant number, $m \geq 2$, of agents is complete for the complexity class nondeterministic exponential time (NEXP). Furthermore, solving a DEC-MDP with a constant number, $m \geq 3$, of agents is NEXP-complete. This has a few consequences. One is that these problems provably do not admit polynomial-time algorithms. This trait is not shared by the MDP problems nor the POMDP problems. Another consequence is that any algorithm for solving either problem will most likely take doubly exponential time in the worst case. In contrast, the exact algorithms for finite-horizon POMDPs take "only" exponential time in the worst case. Thus, our results shed light on the fundamental differences between centralized and decentralized control of Markov decision processes. We now have mathematical evidence corresponding to the intuition that decentralized planning problems are more difficult to solve than their centralized counterparts. The results can steer researchers away from trying to find easy reductions from the decentralized problems to centralized ones and toward completely different approaches.
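To see where the doubly exponential figure comes from, here is an illustrative count of our own (not a bound stated in this paragraph), using the notation introduced in Sections 2 and 3 and assuming for simplicity that agent $i$ has a single action set $A_i$ and at least two possible observations. A local policy must assign an action to every local observation history of length less than $T$, so

$$\underbrace{\sum_{t=0}^{T-1} |\Omega_i|^{t}}_{\text{number of local histories}} = \frac{|\Omega_i|^{T}-1}{|\Omega_i|-1}, \qquad \text{number of local policies for agent } i = |A_i|^{(|\Omega_i|^{T}-1)/(|\Omega_i|-1)}.$$

Since $T$ may be as large as $|S|-1$, a single policy already has exponentially many entries, and a naive search over joint policies must examine a doubly exponential number of candidates.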
A precise categorization of the two-agent DEC-MDP problem presents an interesting mathematical challenge. The extent of our present knowledge is that the problem is PSPACE-hard and is contained in NEXP.
2 Centralized Models
A Markov decision process (MDP) models an agent acting in a stochastic environment to maximize its long-term reward. The type of MDP that we consider contains a finite set $S$ of states, with $s^0 \in S$ as the start state. For each state $s \in S$, $A_s$ is a finite set of actions available to the agent. $P$ is the table of transition probabilities, where $P(s' \mid s, a)$ is the probability of a transition to state $s'$ given that the agent performed action $a$ in state $s$. $R$ is the reward function, where $R(s, a)$ is the expected reward received by the agent given that it chose action $a$ in state $s$.
There are several different ways to define "long-term reward" and thus several different measures of optimality. In this paper, we focus on finite-horizon optimality, for which the aim is to maximize the expected sum of rewards received over $T$ time steps. Formally, the agent should maximize
$$E\left[\sum_{t=0}^{T-1} r_t\right],$$
where $r_t$ is the reward received at time step $t$. A policy for a finite-horizon MDP is a mapping from each state and time step to an action, $\delta : S \times \{0, \ldots, T-1\} \rightarrow A$. This is called a non-stationary policy. The decision problem corresponding to a finite-horizon MDP is as follows: Given an MDP $M$, a positive integer $T$, and an integer $K$, is there a policy that yields total reward at least $K$?
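As a concrete illustration of this decision problem, the following is a minimal sketch of finite-horizon backward induction (our own example, not part of the paper; the dictionary encoding of $P$ and $R$, the function name, and the assumption that the available actions do not vary with time are all choices made for the sketch).

```python
def solve_finite_horizon_mdp(S, A, P, R, T, K, s0):
    """Decide whether some policy earns expected total reward >= K.

    S  : list of states
    A  : dict mapping state -> list of actions available in that state
    P  : dict mapping (s, a) -> dict {next_state: probability}
    R  : dict mapping (s, a) -> expected immediate reward
    T  : horizon (number of time steps)
    """
    # V[s] is the optimal expected reward-to-go from s with t steps remaining.
    V = {s: 0.0 for s in S}
    for _ in range(T):  # backward induction over the horizon
        V = {
            s: max(
                R[(s, a)] + sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in A[s]
            )
            for s in S
        }
    return V[s0] >= K
```

Because the dynamic program touches each (state, action, time) triple once, it runs in time polynomial in the size of the MDP, consistent with the P-completeness result discussed in Section 4.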
An MDP can be generalized so that the agent does not necessarily observe the exact state of the environment at each time step. This is called a partially observable Markov decision process (POMDP). A POMDP has a state set $S$, a start state $s^0 \in S$, a table of transition probabilities, and a reward function, just as an MDP does. Additionally, it contains a finite set $\Omega$ of observations, and a table $O$ of observation probabilities, where $O(o \mid a, s')$ is the probability that $o$ is observed, given that action $a$ was taken and led to state $s'$. For each observation $o \in \Omega$, $A_o$ is a finite set of actions available to the agent. A policy $\delta$ is now a mapping from histories of observations $\bar{o} = o^1, \ldots, o^t$ to actions in $A_{o^t}$. The decision problem for a POMDP is stated in exactly the same way as for an MDP.
3 Decentralized Models
A decentralized partially observable Markov decision process (DEC-POMDP) is a generalization of a POMDP to allow for distributed control by $m$ agents that may not be able to observe the exact state. A DEC-POMDP contains a finite set $S$ of states, with $s^0 \in S$ as the start state. The transition probabilities $P(s' \mid s, a^1, \ldots, a^m)$ and expected rewards $R(s, a^1, \ldots, a^m)$ depend on the actions of all agents. $\Omega_i$ is a finite set of observations for agent $i$, and $O$ is a table of observation probabilities, where $O(o^1, \ldots, o^m \mid a^1, \ldots, a^m, s')$ is the probability that $o^1, \ldots, o^m$ are observed by agents $1, \ldots, m$, respectively, given that the action tuple $(a^1, \ldots, a^m)$ was taken and led to state $s'$. Each agent $i$ has a set of actions $A^i_{o^i}$ for each observation $o^i \in \Omega_i$. Notice that this model reduces to a POMDP in the one-agent case.
For each $s'$ and $a^1, \ldots, a^m$, let $\Omega(a^1, \ldots, a^m, s')$ denote the set of observation tuples that have a nonzero chance of occurring given that the action tuple $(a^1, \ldots, a^m)$ was taken and led to state $s'$. To form a decentralized Markov decision process (DEC-MDP), we add the requirement that for each $a^1, \ldots, a^m$ and $s'$, and each $(o^1, \ldots, o^m) \in \Omega(a^1, \ldots, a^m, s')$, the state $s'$ is uniquely determined by $(o^1, \ldots, o^m)$. In the one-agent case, this model is essentially the same as an MDP.
We define a local policy, $\delta_i$, to be a mapping from local histories of observations $\bar{o}_i = o_i^1, \ldots, o_i^t$ to actions $a_i \in A^i_{o_i^t}$. A joint policy, $\delta = \langle \delta_1, \ldots, \delta_m \rangle$, is defined to be a tuple of local policies. We wish to find a joint policy that maximizes the total expected return over the finite horizon. The decision problem is stated as follows: Given a DEC-POMDP $M$, a positive integer $T$, and an integer $K$, is there a joint policy that yields total reward at least $K$? Let DEC-POMDP$_m$ and DEC-MDP$_m$ denote the decision problems for the $m$-agent DEC-POMDP and the $m$-agent DEC-MDP, respectively.
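To make the objects in this decision problem concrete, here is a minimal sketch of joint-policy evaluation (our own example; the dictionary encodings, the convention that the first action is chosen on an empty observation history, and the function name are assumptions of the sketch, not definitions from the paper).

```python
def evaluate_joint_policy(P, R, O, policies, s0, T):
    """Expected T-step reward of a joint policy in a DEC-POMDP.

    P        : dict mapping (s, joint_action) -> {next_state: probability}
    R        : dict mapping (s, joint_action) -> expected immediate reward
    O        : dict mapping (joint_action, next_state) -> {joint_obs: probability}
    policies : one dict per agent, mapping that agent's local observation
               history (a tuple of observations) to one of its actions
    """
    def expect(state, histories, t):
        if t == T:
            return 0.0
        joint_action = tuple(pol[h] for pol, h in zip(policies, histories))
        value = R[(state, joint_action)]
        for s_next, p_s in P[(state, joint_action)].items():
            for joint_obs, p_o in O[(joint_action, s_next)].items():
                next_histories = tuple(h + (o,) for h, o in zip(histories, joint_obs))
                value += p_s * p_o * expect(s_next, next_histories, t + 1)
        return value

    # Each agent starts with an empty local history of observations.
    return expect(s0, tuple(() for _ in policies), 0)
```

The answer to the decision problem is "yes" exactly when some joint policy makes this expectation at least $K$; the difficulty is that the number of candidate joint policies grows doubly exponentially with the horizon, as noted in Section 1.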
4 Complexity Results
It is necessary to consider only problems for which $T < |S|$. If we place no restrictions on $T$, then the upper bounds do not necessarily hold. Also, we assume that each of the elements of the tables for the transition probabilities and expected rewards can be represented with a constant number of bits. With these restrictions, it was shown in (Papadimitriou & Tsitsiklis, 1987) that the decision problem for an MDP is P-complete. In the same paper, the authors showed that the decision problem for a POMDP is PSPACE-complete and thus probably does not admit a polynomial-time algorithm. We prove that for all $m \geq 2$, DEC-POMDP$_m$ is NEXP-complete, and for all $m \geq 3$, DEC-MDP$_m$ is NEXP-complete, where NEXP $= \bigcup_c \mathrm{NTIME}(2^{n^c})$ (Papadimitriou, 1994). Since P $\neq$ NEXP, we can be certain that there does not exist a polynomial-time algorithm for either problem. Moreover, there probably is not even an exponential-time algorithm that solves either problem.
For our reduction, we use a problem called TILING (Papadimitriou, 1994), which is described as follows: We are given a set of square tile types $L = \{\mathit{tile}_0, \ldots, \mathit{tile}_k\}$, together with two relations $H, V \subseteq L \times L$ (the horizontal and vertical compatibility relations, respectively). We are also given an integer $n$ in binary. A tiling is a function $f : \{0, \ldots, n-1\} \times \{0, \ldots, n-1\} \rightarrow L$. A tiling $f$ is consistent if and only if (a) $f(0,0) = \mathit{tile}_0$, and (b) for all $x, y$, $(f(x,y), f(x+1,y)) \in H$ and $(f(x,y), f(x,y+1)) \in V$. The decision problem is to tell, given $L$, $H$, $V$, and $n$, whether a consistent tiling exists. It is known that TILING is NEXP-complete.
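For intuition, verifying a candidate tiling is easy; what makes the problem hard is that $n$ is given in binary, so the $n \times n$ grid is exponentially large in the input size. The following consistency check is a sketch of ours (the dictionary representation of $f$ and the function name are assumptions), not part of the reduction itself.

```python
def is_consistent(f, H, V, n, tile0):
    """Check conditions (a) and (b) for a candidate tiling.

    f     : dict mapping (x, y) -> tile type, for 0 <= x, y < n
    H, V  : sets of allowed horizontal and vertical tile pairs
    tile0 : the tile type required at position (0, 0)
    """
    if f[(0, 0)] != tile0:                                   # condition (a)
        return False
    for x in range(n):
        for y in range(n):
            if x + 1 < n and (f[(x, y)], f[(x + 1, y)]) not in H:
                return False                                 # horizontal compatibility
            if y + 1 < n and (f[(x, y)], f[(x, y + 1)]) not in V:
                return False                                 # vertical compatibility
    return True
```

Guessing $f$ and running this check takes time polynomial in $n^2$, i.e., exponential in the length of the binary encoding of $n$, which is why TILING sits naturally in NEXP.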
Theorem 1. For all $m \geq 2$, DEC-POMDP$_m$ is NEXP-complete.
Proof. First, we will show that the problem is in NEXP. We can guess a joint policy $\delta$ and write it down in exponential time. This is because a joint policy consists of $m$ mappings from local histories to actions, and since $T < |S|$, all histories have length less than $|S|$. A DEC-POMDP together with a joint policy can be viewed as a POMDP together with a policy, where the observations in the POMDP correspond to the observation tuples in the DEC-POMDP. In exponential time, each of the exponentially many possible sequences of observations can be converted into belief states. The transition probabilities and expected rewards for the corresponding "belief MDP" can be computed in exponential time (Kaelbling et al., 1998). It is possible to use dynamic programming to determine whether the policy yields expected reward at least $K$ in this belief MDP. This takes at most exponential time.
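The conversion to belief states used above is the standard Bayes filter for POMDPs (Kaelbling et al., 1998). A minimal sketch in the dictionary encoding of the earlier examples (our own illustration, not code from the paper):

```python
def update_belief(belief, action, obs, P, O, states):
    """One step of belief revision after taking `action` and observing `obs`.

    belief : dict mapping state -> probability (sums to 1)
    P      : dict mapping (s, action) -> {next_state: probability}
    O      : dict mapping (action, next_state) -> {observation: probability}
    """
    new_belief = {}
    for s_next in states:
        # Pr(obs | action, s_next) * sum_s Pr(s_next | s, action) * belief(s)
        new_belief[s_next] = O[(action, s_next)].get(obs, 0.0) * sum(
            P[(s, action)].get(s_next, 0.0) * b for s, b in belief.items()
        )
    norm = sum(new_belief.values())  # Pr(obs | belief, action)
    return {s: p / norm for s, p in new_belief.items()} if norm > 0 else new_belief
```

Applying this update along each of the exponentially many observation sequences of length less than $|S|$ yields the belief MDP referred to in the proof.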
Now we show that the problem is NEXP-hard. For simplicity, we consider only the two-agent case. Clearly, the problem with more agents can be no easier. We are given an arbitrary instance of TILING. From it, we construct a DEC-POMDP such that the existence of a joint policy that yields a reward of at least zero is equivalent to the existence of a consistent tiling in the original problem. Furthermore, $T < |S|$ in the DEC-POMDP that is constructed. Intuitively, a local policy in our DEC-POMDP corresponds to a mapping from tile positions to tile types, i.e., a tiling, and thus a joint policy corresponds to a pair of tilings. The process works as follows: In the position choice phase, two tile positions are randomly "chosen" by the environment. Then, at the tile choice step, each agent sees a different position and must use its policy to determine a tile to be placed in that position. Based on information about where the two positions are in relation to each other, the environment checks whether the tile types placed in the two positions could be part of one consistent tiling. Only if the necessary conditions hold do the agents obtain a nonnegative reward. It turns out that the agents can obtain a nonnegative expected reward if and only if the conditions hold for all pairs of positions the environment can choose, i.e., there exists a consistent tiling.
We now present the construction in detail. During the position choice phase, each agent only has one action available to it, and a reward of zero is obtained at each step. The states and the transition probability matrix comprise the nontrivial aspect of this phase. Recall that this phase intuitively represents the choosing of two tile positions. First, let the two tile positions be denoted $(x_1, y_1)$ and $(x_2, y_2)$, where $0 \leq x_1, y_1, x_2, y_2 < n$. There are $4 \log n$ steps in this phase, and each step is devoted to the choosing of one bit of one of the numbers. (We assume that $n$ is a power of two. It is straightforward to modify the proof to deal with the more general case.) The order in which the bits are chosen is important, and it is as follows: The bits of $x_1$ and $x_2$ are chosen from least significant up to most significant, alternating between the two numbers at each step. Then $y_1$ and $y_2$ are chosen in the same way. As the bits of the numbers are being determined, information about the relationships between the numbers is being recorded in the state. How we express all of this as a Markov process is explained below.
Each state has six components, and each component represents a necessary piece of information about the two tile positions being chosen. We describe how each of these components changes with time. A time step in our process can be viewed as having two parts, which we refer to as the stochastic part and the deterministic part. During the stochastic part, the environment "flips a coin" to choose either the number 0 or the number 1, each with equal probability. After this choice is made, the change in each component of the state can be described by a deterministic finite automaton that takes as input a string of 0's and 1's (the environment's coin flips). The semantics of the components, along with their associated automata, are described below:

1) Bit Chosen in the Last Step
This component of the state says whether 0 or 1 was just chosen by the environment. The corresponding automaton consists of only two states.
2) Number of Bits Chosen So Far
This component simply counts up to $4 \log n$, in order to determine when the position choice phase should end. Its automaton consists of $O(\log n)$ states.
3) Equal Tile Positions
After the $4 \log n$ steps, this component tells us whether the two tile positions chosen are equal or not. For this automaton, along with the following three, we need to have a notion of an accept state. Consider the following regular expression:
$$(00 \cup 11)^*$$
Note that the DFA corresponding to the above expression, on an input of length $4 \log n$, ends in an accept state if and only if $x_1 = x_2$ and $y_1 = y_2$.
4) Upper Left Tile Position
This component is used to check whether the first tile position is the upper left corner of the grid. Its regular expression is as follows:
$$(0(0 \cup 1))^*$$
The corresponding DFA, on an input of length $4 \log n$, ends in an accept state if and only if $x_1 = 0$ and $y_1 = 0$.
5) Horizontally Adjacent Tile Positions
This component is used to check whether the first tile position is directly to the left of the second one. Its regular expression is as follows:

The corresponding DFA, on an input of length $4 \log n$, ends in an accept state if and only if $x_2 = x_1 + 1$ and $y_1 = y_2$.

6) Vertically Adjacent Tile Positions
This component is used to check whether the first tile position is directly above the second one. Its regular expression is as follows:

The corresponding DFA, on an input of length $4 \log n$, ends in an accept state if and only if $x_1 = x_2$ and $y_2 = y_1 + 1$.
So far we have described the six automata that determine how each of the six components of the state evolves based on input (0 or 1) from the environment. We can take the cross product of the six automata to get a new automaton that is only polynomially bigger and describes how the entire state evolves based on the sequence of 0's and 1's chosen by the environment. This automaton, along with the environment's "coin flips," corresponds to a Markov process. The number of states of the process is polylogarithmic in $n$, and hence polynomial in the size of the TILING instance. The start state is a tuple of the start states of the six automata. The table of transition probabilities for this process can be constructed in time polylogarithmic in $n$.
We have described the states, actions, state transitions, and rewards for the position choice phase, and we now describe the observation function. In this DEC-POMDP, the observations are uniquely determined from the state. For the states after which a bit of $x_1$ or $y_1$ has been chosen, agent one observes the first component of the state, while agent two observes a dummy observation. The reverse is true for the states after which a bit of $x_2$ or $y_2$ has been chosen. Intuitively, agent one "sees" only $(x_1, y_1)$, and agent two "sees" only $(x_2, y_2)$.
When the second component of the state reaches its limit, the tile positions have been chosen, and the last four components of the state contain information about the tile positions and how they are related. Of course, the exact tile positions are not recorded in the state, as this would require exponentially many states. This marks the end of the position choice phase. In the next step, which we call the tile choice step, each agent has $|L|$ actions available to it, corresponding to the tile types $\mathit{tile}_0, \ldots, \mathit{tile}_k$. We denote agent one's choice $t^1$ and agent two's choice $t^2$. No matter which actions are chosen, the state transitions deterministically to some final state. The reward function for this step is the nontrivial part. After the actions are chosen, the following statements are checked for validity:
1) If $(x_1, y_1) = (x_2, y_2)$, then $t^1 = t^2$.
2) If $(x_1, y_1) = (0, 0)$, then $t^1 = \mathit{tile}_0$.
3) If $x_2 = x_1 + 1$ and $y_1 = y_2$, then $(t^1, t^2) \in H$.
4) If $x_1 = x_2$ and $y_2 = y_1 + 1$, then $(t^1, t^2) \in V$.
If all of these are true, then a reward of 0 is received. Otherwise, a reward of $-1$ is received. This reward function can be computed from the TILING instance in polynomial time. To complete the construction, the horizon is set to exactly the number of steps it takes the process to reach the tile choice step, which is fewer than the number of states.
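A sketch of the reward check at the tile choice step (our own illustration; in the actual construction the four predicates are read off the last four components of the state rather than recomputed from the positions, since the positions themselves are never stored):

```python
def tile_choice_reward(pos1, pos2, t1, t2, H, V, tile0):
    """Reward when agent one places tile t1 at pos1 and agent two places t2 at pos2."""
    (x1, y1), (x2, y2) = pos1, pos2
    checks = [
        pos1 != pos2 or t1 == t2,                             # equal positions -> same tile
        pos1 != (0, 0) or t1 == tile0,                        # upper left corner -> tile_0
        not (x2 == x1 + 1 and y1 == y2) or (t1, t2) in H,     # horizontally adjacent -> H
        not (x1 == x2 and y2 == y1 + 1) or (t1, t2) in V,     # vertically adjacent -> V
    ]
    return 0 if all(checks) else -1
```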
Now we argue that the expected reward is zero if and only if there exists a consistent tiling. First, suppose a consistent tiling exists. This tiling corresponds to a local policy for an agent. If each of the two agents follows this policy, then no matter which two positions are chosen by the environment, the agents choose tile types for those positions so that the conditions checked at the end evaluate to true. Thus, no matter what sequence of 0's and 1's the environment chooses, the agents receive a reward of zero. Hence, the expected reward for the agents is zero.
For the converse, suppose the expected reward is zero. Then the reward is zero no matter what sequence of 0's and 1's the environment chooses, i.e., no matter which two tile positions are chosen. This implies that the four conditions mentioned above are satisfied for any two tile positions that are chosen. The first condition ensures that for all pairs of tile positions, if the positions are equal, then the tile types chosen are the same. This implies that the two agents' tilings are exactly the same. The last three conditions ensure that this tiling is consistent.
Theorem 2. For all $m \geq 3$, DEC-MDP$_m$ is NEXP-complete.
Proof. (Sketch) Inclusion in NEXP follows from the fact that a DEC-MDP is a special case of a DEC-POMDP. For NEXP-hardness, we can reduce a DEC-POMDP with two agents to a DEC-MDP with three agents. We simply add a third agent to the DEC-POMDP and impose the following requirement: The state is uniquely determined by just the third agent's observation, but the third agent always has just one action and cannot affect the state transitions or rewards received. It is clear that the new problem qualifies as a DEC-MDP and is essentially the same as the original DEC-POMDP.
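A sketch of this reduction in the dictionary encoding used in the earlier examples (our own illustration; the names are assumptions): the third agent observes the successor state directly and has a single dummy action, so the joint observation determines the state while the original two agents' problem is left untouched.

```python
NOOP = "noop"  # the third agent's only action

def add_observer_agent(P, R, O):
    """Turn a two-agent DEC-POMDP into a three-agent DEC-MDP.

    P : dict mapping (s, (a1, a2)) -> {next_state: probability}
    R : dict mapping (s, (a1, a2)) -> expected reward
    O : dict mapping ((a1, a2), next_state) -> {(o1, o2): probability}
    """
    P3 = {(s, (a1, a2, NOOP)): dist for (s, (a1, a2)), dist in P.items()}
    R3 = {(s, (a1, a2, NOOP)): r for (s, (a1, a2)), r in R.items()}
    # Agent three observes the successor state itself, so the joint
    # observation uniquely determines the state (the DEC-MDP condition).
    O3 = {
        ((a1, a2, NOOP), s_next): {(o1, o2, s_next): p for (o1, o2), p in dist.items()}
        for ((a1, a2), s_next), dist in O.items()
    }
    return P3, R3, O3
```

Any joint policy for the original two agents carries over unchanged, since the third agent's local policy is forced.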
The reduction described above can also be used to construct a two-agent DEC-MDP from a POMDP and hence show that DEC-MDP$_2$ is PSPACE-hard. However, this technique is not powerful enough to prove the NEXP-hardness of the problem. In fact, the question of whether DEC-MDP$_2$ is NEXP-hard remains open. Note that in the reduction in the proof of Theorem 1, the observation function is such that there are some parts of the state that are hidden from both agents. This needs to somehow be avoided in order to reduce to a two-agent DEC-MDP. A simpler task may actually be to derive a better upper bound for the problem. For example, it may be possible that DEC-MDP$_2 \in$ co-NEXP, where co-NEXP is the class of problems whose complements are in NEXP.
References
Aicardi, M., Franco, D. & Minciardi, R. (1987). Decentralized optimal control of Markov chains with a common past information set. IEEE Transactions on Automatic Control, AC-32(11), 1028–1031.

Babai, L., Fortnow, L. & Lund, C. (1991). Non-deterministic exponential time has two-prover interactive protocols. Computational Complexity, 1, 3–40.

Boutilier, C. (1999). Multiagent systems: Challenges and opportunities for decision-theoretic planning. AI Magazine, 20(4), 35–43.

Cassandra, A., Littman, M. L. & Zhang, N. L. (1997). Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proceedings of the Thirteenth Annual Conference on Uncertainty in Artificial Intelligence (pp. 54–61).

desJardins, M. E., Durfee, E. H., Ortiz, C. L. & Wolverton, M. J. (1999). A survey of research in distributed, continual planning. AI Magazine, 20(4), 13–22.

Durfee, E. H. (1999). Distributed problem solving and planning. In Multiagent Systems (pp. 121–164). Cambridge, MA: The MIT Press.

Estlin, T., Gray, A., Mann, T., Rabideau, G., Castaño, R., Chien, S. & Mjolsness, E. (1999). An integrated system for multi-rover scientific exploration. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (pp. 541–548).

Grosz, B. & Kraus, S. (1996). Collaborative plans for complex group action. Artificial Intelligence, 86(2), 269–357.

Hansen, E. (1998). Solving POMDPs by searching in policy space. In Proceedings of the Fourteenth Annual Conference on Uncertainty in Artificial Intelligence (pp. 211–219).

Jaakkola, T., Singh, S. P. & Jordan, M. I. (1995). Reinforcement learning algorithm for partially observable Markov decision problems. In Proceedings of Advances in Neural Information Processing Systems 7 (pp. 345–352).

Kaelbling, L. P., Littman, M. L. & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2), 99–134.

Lesser, V. R. (1998). Reflections on the nature of multi-agent coordination and its implications for an agent architecture. Autonomous Agents and Multi-Agent Systems, 1, 89–111.

Madani, O., Hanks, S. & Condon, A. (1999). On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision process problems. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (pp. 541–548).

Mataric, M. J. (1998). Using communication to reduce locality in distributed multi-agent learning. Journal of Experimental and Theoretical Artificial Intelligence, 10(3), 357–369.

Ooi, J. M., Verbout, S. M., Ludwig, J. T. & Wornell, G. W. (1997). A separation theorem for periodic sharing information patterns in decentralized control. IEEE Transactions on Automatic Control, 42(11), 1546–1550.

Ooi, J. M. & Wornell, G. W. (1996). Decentralized control of a multiple access broadcast channel: Performance bounds. In Proceedings of the 35th Conference on Decision and Control (pp. 293–298).

Papadimitriou, C. H. (1994). Computational Complexity. Reading, MA: Addison-Wesley.

Papadimitriou, C. H. & Tsitsiklis, J. N. (1987). The complexity of Markov decision processes. Mathematics of Operations Research, 12(3), 441–450.

Peshkin, L., Kim, K.-E., Meuleau, N. & Kaelbling, L. P. (2000). Learning to cooperate via policy search. In Proceedings of the Sixteenth International Conference on Uncertainty in Artificial Intelligence.

Peterson, G. L. & Reif, J. H. (1979). Multiple-person alternation. In 20th Annual Symposium on Foundations of Computer Science (pp. 348–363).

Stone, P. & Veloso, M. (1999). Task decomposition, dynamic role assignment, and low-bandwidth communication for real-time strategic teamwork. Artificial Intelligence, 110(2), 241–273.

Tsitsiklis, J. N. & Athans, M. (1985). On the complexity of decentralized decision making and detection problems. IEEE Transactions on Automatic Control, AC-30(5), 440–446.

White, D. J. (1993). Markov Decision Processes. West Sussex, England: John Wiley & Sons.
