Deterministic Policy Gradient Algorithms (DPG)


Deterministic Policy Gradient Algorithms
David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra & Martin Riedmiller
Abstract
In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
1. Introduction
Policy gradient algorithms are widely used in reinforcement learning problems with continuous action spaces. The basic idea is to represent the policy by a parametric probability distribution $\pi_\theta(a \mid s) = \mathbb{P}[a \mid s; \theta]$ that stochastically selects action $a$ in state $s$ according to parameter vector $\theta$. Policy gradient algorithms typically proceed by sampling this stochastic policy and adjusting the policy parameters in the direction of greater cumulative reward.
In this paper we instead consider deterministic policies $a = \mu_\theta(s)$. It is natural to wonder whether the same approach can be followed as for stochastic policies: adjusting the policy parameters in the direction of the policy gradient. It was previously believed that the deterministic policy gradient did not exist, or could only be obtained when using a model (Peters, 2010). However, we show that the deterministic policy gradient does indeed exist, and furthermore it has a simple model-free form that simply follows the gradient of the action-value function. In addition, we show that the deterministic policy gradient is the limiting case, as policy variance tends to zero, of the stochastic policy gradient.

From a practical viewpoint, there is a crucial difference between the stochastic and deterministic policy gradients. In the stochastic case, the policy gradient integrates over both state and action spaces, whereas in the deterministic case it only integrates over the state space. As a result, computing the stochastic policy gradient may require more samples, especially if the action space has many dimensions.
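For orientation, the deterministic policy gradient that the paper goes on to derive takes the following form, quoted here as a preview. In contrast to the stochastic policy gradient of Section 2.2, the expectation is over states only:

$$\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho^\mu}\!\left[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s,a)\big|_{a=\mu_\theta(s)}\right]$$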
In order to explore the full state and action space, a stochastic policy is often necessary. To ensure that our deterministic policy gradient algorithms continue to explore satisfactorily, we introduce an off-policy learning algorithm. The basic idea is to choose actions according to a stochastic behaviour policy (to ensure adequate exploration), but to learn about a deterministic target policy (exploiting the efficiency of the deterministic policy gradient). We use the deterministic policy gradient to derive an off-policy actor-critic algorithm that estimates the action-value function using a differentiable function approximator, and then updates the policy parameters in the direction of the approximate action-value gradient. We also introduce a notion of compatible function approximation for deterministic policy gradients, to ensure that the approximation does not bias the policy gradient.
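To make the actor update concrete, here is a hypothetical toy sketch (not from the paper): a linear deterministic policy is improved by ascending the action-value gradient of an assumed, already-learned critic, while an exploratory behaviour action is what would actually be executed in the environment. The critic Q(s, a) = -(a - a_star)^2, the constant `a_star`, and the dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, alpha = 3, 0.05
theta = np.zeros(state_dim)   # parameters of the deterministic target policy mu_theta(s) = theta . s
a_star = 1.5                  # assumption: the toy critic below is maximised at a = a_star

def mu(theta, s):
    """Deterministic target policy (linear, scalar action)."""
    return theta @ s

def dq_da(s, a):
    """Action-gradient of an assumed, already-learned critic Q(s, a) = -(a - a_star)^2.
    In the full algorithm this critic would itself be estimated off-policy."""
    return -2.0 * (a - a_star)

for step in range(2000):
    s = rng.normal(size=state_dim)
    a_behaviour = mu(theta, s) + rng.normal(scale=0.3)  # exploratory behaviour action (what would be executed)
    # Actor update: follow grad_theta mu_theta(s) * dQ/da evaluated at a = mu_theta(s);
    # for a linear policy, grad_theta mu_theta(s) = s.
    theta += alpha * s * dq_da(s, mu(theta, s))
```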
We apply our deterministic actor-critic algorithms to several benchmark problems: a high-dimensional bandit; several standard benchmark reinforcement learning tasks with low-dimensional action spaces; and a high-dimensional task for controlling an octopus arm. Our results demonstrate a significant performance advantage to using deterministic policy gradients over stochastic policy gradients, particularly in high-dimensional tasks. Furthermore, our algorithms require no more computation than prior methods: the computational cost of each update is linear in the action dimensionality and the number of policy parameters. Finally, there are many applications (for example in robotics) where a differentiable control policy is provided, but where there is no functionality to inject noise into the controller. In these cases, the stochastic policy gradient is inapplicable, whereas our methods may still be useful.
2. Background
2.1. Preliminaries
We study reinforcement learning and control problems in which an agent acts in a stochastic environment by sequentially choosing actions over a sequence of time steps, in order to maximise a cumulative reward. We model the problem as a Markov decision process (MDP) which comprises: a state space $\mathcal{S}$, an action space $\mathcal{A}$, an initial state distribution with density $p_1(s_1)$, a stationary transition dynamics distribution with conditional density $p(s_{t+1} \mid s_t, a_t)$ satisfying the Markov property $p(s_{t+1} \mid s_1, a_1, \ldots, s_t, a_t) = p(s_{t+1} \mid s_t, a_t)$ for any trajectory $s_1, a_1, s_2, a_2, \ldots, s_T, a_T$ in state-action space, and a reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. A policy is used to select actions in the MDP. In general the policy is stochastic and denoted by $\pi_\theta : \mathcal{S} \to \mathcal{P}(\mathcal{A})$, where $\mathcal{P}(\mathcal{A})$ is the set of probability measures on $\mathcal{A}$, $\theta \in \mathbb{R}^n$ is a vector of $n$ parameters, and $\pi_\theta(a_t \mid s_t)$ is the conditional probability density at $a_t$ associated with the policy. The agent uses its policy to interact with the MDP to give a trajectory of states, actions and rewards, $h_{1:T} = s_1, a_1, r_1, \ldots, s_T, a_T, r_T$ over $\mathcal{S} \times \mathcal{A} \times \mathbb{R}$. The return $r_t^\gamma$ is the total discounted reward from time-step $t$ onwards, $r_t^\gamma = \sum_{k=t}^{\infty} \gamma^{k-t} r(s_k, a_k)$, where $0 < \gamma < 1$. Value functions are defined to be the expected total discounted reward, $V^\pi(s) = \mathbb{E}[r_1^\gamma \mid S_1 = s; \pi]$ and $Q^\pi(s,a) = \mathbb{E}[r_1^\gamma \mid S_1 = s, A_1 = a; \pi]$. The agent's goal is to obtain a policy which maximises the cumulative discounted reward from the start state, denoted by the performance objective $J(\pi) = \mathbb{E}[r_1^\gamma \mid \pi]$.
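As a concrete illustration of the return defined above, the following minimal sketch (not from the paper; the episode is finite, so the infinite sum is truncated at the episode end) computes $r_t^\gamma$ from a recorded reward sequence:

```python
def discounted_return(rewards, gamma=0.99, t=0):
    """Truncated discounted return r_t^gamma = sum_{k >= t} gamma^(k-t) * r(s_k, a_k),
    computed from a finite list of observed rewards."""
    return sum(gamma ** (k - t) * r for k, r in enumerate(rewards) if k >= t)

# Example: rewards observed over a short episode
rewards = [1.0, 0.0, 0.5, 1.0]
print(discounted_return(rewards, gamma=0.9, t=0))  # 1.0 + 0.9*0.0 + 0.81*0.5 + 0.729*1.0 = 2.134
```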
We denote the density at state $s'$ after transitioning for $t$ time steps from state $s$ by $p(s \to s', t, \pi)$. We also denote the (improper) discounted state distribution by $\rho^\pi(s') := \int_{\mathcal{S}} \sum_{t=1}^{\infty} \gamma^{t-1} p_1(s)\, p(s \to s', t, \pi)\, \mathrm{d}s$. We can then write the performance objective as an expectation,

$$J(\pi_\theta) = \int_{\mathcal{S}} \rho^\pi(s) \int_{\mathcal{A}} \pi_\theta(s,a)\, r(s,a)\, \mathrm{d}a\, \mathrm{d}s = \mathbb{E}_{s \sim \rho^\pi, a \sim \pi_\theta}\left[r(s,a)\right] \qquad (1)$$

where $\mathbb{E}_{s \sim \rho}[\cdot]$ denotes the (improper) expected value with respect to the discounted state distribution $\rho(s)$. In the remainder of the paper we suppose for simplicity that $\mathcal{A} = \mathbb{R}^m$ and that $\mathcal{S}$ is a compact subset of $\mathbb{R}^d$.
2.2. Stochastic Policy Gradient Theorem
Policy gradient algorithms are perhaps the most popular class of continuous-action reinforcement learning algorithms. The basic idea behind these algorithms is to adjust the parameters $\theta$ of the policy in the direction of the performance gradient $\nabla_\theta J(\pi_\theta)$. The fundamental result underlying these algorithms is the policy gradient theorem (Sutton et al., 1999),

$$\nabla_\theta J(\pi_\theta) = \int_{\mathcal{S}} \rho^\pi(s) \int_{\mathcal{A}} \nabla_\theta \pi_\theta(a \mid s)\, Q^\pi(s,a)\, \mathrm{d}a\, \mathrm{d}s = \mathbb{E}_{s \sim \rho^\pi, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s,a)\right] \qquad (2)$$
The policy gradient is surprisingly simple. In particular, despite the fact that the state distribution $\rho^\pi(s)$ depends on the policy parameters, the policy gradient does not depend on the gradient of the state distribution.
This theorem has important practical value, because it reduces the computation of the performance gradient to a simple expectation. The policy gradient theorem has been used to derive a variety of policy gradient algorithms (Degris et al., 2012a), by forming a sample-based estimate of this expectation. One issue that these algorithms must address is how to estimate the action-value function $Q^\pi(s,a)$. Perhaps the simplest approach is to use a sample return $r_t^\gamma$ to estimate the value of $Q^\pi(s_t, a_t)$, which leads to a variant of the REINFORCE algorithm (Williams, 1992).
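As a rough sketch of how this estimator can be implemented (not code from the paper; the linear-Gaussian policy and all names here are illustrative assumptions), the gradient in Equation 2 is approximated by averaging the score function times the sample return over a trajectory sampled from $\pi_\theta$:

```python
import numpy as np

def score_gaussian(theta, s, a, sigma=0.5):
    """grad_theta log pi_theta(a|s) for a Gaussian policy with mean theta . s and fixed std sigma."""
    return (a - theta @ s) * s / sigma**2

def reinforce_gradient(theta, trajectory, gamma=0.99, sigma=0.5):
    """Sample-based estimate of Equation 2: the sample return r_t^gamma stands in
    for Q^pi(s_t, a_t), giving a REINFORCE-style estimator."""
    grad = np.zeros_like(theta)
    rewards = [r for (_, _, r) in trajectory]
    for t, (s, a, _) in enumerate(trajectory):
        ret_t = sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))
        grad += score_gaussian(theta, s, a, sigma) * ret_t
    return grad / len(trajectory)

# usage: trajectory is a list of (state, action, reward) tuples generated by pi_theta,
# then: theta = theta + learning_rate * reinforce_gradient(theta, trajectory)
```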
2.3. Stochastic Actor-Critic Algorithms
The actor-critic is a widely used architecture based on the policy gradient theorem (Sutton et al., 1999; Peters et al., 2005; Bhatnagar et al., 2007; Degris et al., 2012a). The actor-critic consists of two eponymous components. An actor adjusts the parameters $\theta$ of the stochastic policy $\pi_\theta(s)$ by stochastic gradient ascent of Equation 2. Instead of the unknown true action-value function $Q^\pi(s,a)$ in Equation 2, an action-value function $Q^w(s,a)$ is used, with parameter vector $w$. A critic estimates the action-value function $Q^w(s,a) \approx Q^\pi(s,a)$ using an appropriate policy evaluation algorithm such as temporal-difference learning. In general, substituting a function approximator $Q^w(s,a)$ for the true action-value function $Q^\pi(s,a)$ may introduce bias. However, if the function approximator is compatible such that i) $Q^w(s,a) = \nabla_\theta \log \pi_\theta(a \mid s)^\top w$ and ii) the parameters $w$ are chosen to minimise the mean-squared error $\epsilon^2(w) = \mathbb{E}_{s \sim \rho^\pi, a \sim \pi_\theta}\left[(Q^w(s,a) - Q^\pi(s,a))^2\right]$, then there is no bias (Sutton et al., 1999),

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim \rho^\pi, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^w(s,a)\right] \qquad (3)$$

More intuitively, condition i) says that compatible function approximators are linear in "features" of the stochastic policy, $\nabla_\theta \log \pi_\theta(a \mid s)$, and condition ii) requires that the parameters are the solution to the linear regression problem that estimates $Q^\pi(s,a)$ from these features. In practice, condition ii) is usually relaxed in favour of policy evaluation algorithms that estimate the value function more efficiently by temporal-difference learning (Bhatnagar et al., 2007; Degris et al., 2012b; Peters et al., 2005); indeed, if both i) and ii) are satisfied then the overall algorithm is equivalent to not using a critic at all (Sutton et al., 2000), much like the REINFORCE algorithm (Williams, 1992).
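To make conditions i) and ii) concrete, here is a hypothetical sketch (not from the paper) for a linear-Gaussian policy: the compatible critic is linear in the score features $\nabla_\theta \log \pi_\theta(a \mid s)$, and its weights $w$ are fit by least squares to sampled targets standing in for $Q^\pi(s,a)$:

```python
import numpy as np

def score_gaussian(theta, s, a, sigma=0.5):
    """Policy 'features': grad_theta log pi_theta(a|s) for a Gaussian policy with mean theta . s."""
    return (a - theta @ s) * s / sigma**2

def fit_compatible_critic(theta, samples, sigma=0.5):
    """Condition ii): choose w to minimise the mean-squared error between
    Q^w(s,a) = phi(s,a) . w (condition i) and targets for Q^pi(s,a), e.g. sample returns."""
    phi = np.array([score_gaussian(theta, s, a, sigma) for (s, a, _) in samples])
    targets = np.array([q for (_, _, q) in samples])
    w, *_ = np.linalg.lstsq(phi, targets, rcond=None)
    return w

def q_compatible(theta, w, s, a, sigma=0.5):
    """Compatible approximator Q^w(s,a) = grad_theta log pi_theta(a|s)^T w."""
    return score_gaussian(theta, s, a, sigma) @ w
```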
2.4. Off-Policy Actor-Critic
It is often useful to estimate the policy gradient off-policy, from trajectories sampled from a distinct behaviour policy $\beta(a \mid s) \neq \pi_\theta(a \mid s)$. In an off-policy setting, the performance objective is typically modified to be the value function of the target policy, averaged over the state distribution of the behaviour policy (Degris et al., 2012b),

$$J_\beta(\pi_\theta) = \int_{\mathcal{S}} \rho^\beta(s)\, V^\pi(s)\, \mathrm{d}s = \int_{\mathcal{S}} \int_{\mathcal{A}} \rho^\beta(s)\, \pi_\theta(a \mid s)\, Q^\pi(s,a)\, \mathrm{d}a\, \mathrm{d}s$$
Differentiating the performance objective and applying an approximation gives the off-policy policy gradient (Degris et al., 2012b),

$$\nabla_\theta J_\beta(\pi_\theta) \approx \int_{\mathcal{S}} \int_{\mathcal{A}} \rho^\beta(s)\, \nabla_\theta \pi_\theta(a \mid s)\, Q^\pi(s,a)\, \mathrm{d}a\, \mathrm{d}s \qquad (4)$$

$$= \mathbb{E}_{s \sim \rho^\beta, a \sim \beta}\!\left[\frac{\pi_\theta(a \mid s)}{\beta_\theta(a \mid s)}\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s,a)\right] \qquad (5)$$
This approximation drops a term that depends on the action-value gradient $\nabla_\theta Q^\pi(s,a)$; Degris et al. (2012b) argue that this is a good approximation since it can preserve the set of local optima to which gradient ascent converges. The Off-Policy Actor-Critic (OffPAC) algorithm (Degris et al., 2012b) uses a behaviour policy $\beta(a \mid s)$ to generate trajectories. A critic estimates a state-value function, $V^v(s) \approx V^\pi(s)$, off-policy from these trajectories, by gradient temporal-difference learning (Sutton et al., 2009). An actor updates the policy parameters $\theta$, also off-policy from these trajectories, by stochastic gradient ascent of Equation 5. Instead of the unknown action-value function $Q^\pi(s,a)$ in Equation 5, the temporal-difference error $\delta_t$ is used, $\delta_t = r_{t+1} + \gamma V^v(s_{t+1}) - V^v(s_t)$; this can be shown to provide an approximation to the true gradient (Bhatnagar et al., 2007). Both the actor and the critic use an importance sampling ratio $\frac{\pi_\theta(a \mid s)}{\beta_\theta(a \mid s)}$ to adjust for the fact that actions were selected according to $\beta$ rather than $\pi$.
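A minimal sketch of the OffPAC-style updates described above, under assumed linear parameterisations (the function and variable names are illustrative, not the paper's): the critic learns $V^v$ with a plain TD(0)-style update and the actor follows Equation 5 with the TD error $\delta_t$ in place of $Q^\pi$, both corrected by the importance sampling ratio. The full OffPAC algorithm uses gradient-TD learning and eligibility traces; only the core update directions are shown here.

```python
import numpy as np

def offpac_update(theta, v, transition, pi_prob, beta_prob, score, gamma=0.99,
                  alpha_actor=1e-3, alpha_critic=1e-2):
    """One off-policy actor-critic update from a transition generated by the behaviour
    policy beta. pi_prob / beta_prob are the densities of the sampled action under the
    target and behaviour policies; score is grad_theta log pi_theta(a|s).
    States are represented by NumPy feature vectors."""
    s_feat, a, r, s_next_feat = transition
    rho = pi_prob / beta_prob                              # importance sampling ratio
    delta = r + gamma * (v @ s_next_feat) - v @ s_feat     # TD error with linear V^v(s) = v . phi(s)
    v = v + alpha_critic * rho * delta * s_feat            # critic: move V^v(s) toward the TD target
    theta = theta + alpha_actor * rho * delta * score      # actor: Equation 5 with delta in place of Q^pi
    return theta, v

# Hypothetical usage with placeholder data:
theta, v = np.zeros(4), np.zeros(4)
phi_s, phi_s_next = np.ones(4), np.ones(4)
theta, v = offpac_update(theta, v, (phi_s, 0.2, 1.0, phi_s_next),
                         pi_prob=0.4, beta_prob=0.5, score=0.1 * np.ones(4))
```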
