Deterministic Policy Gradient Algorithms (DPG)


Deterministic Policy Gradient Algorithms
David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra & Martin Riedmiller
Abstract
In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
1. Introduction
Policy gradient algorithms are widely used in reinforcement learning problems with continuous action spaces. The basic idea is to represent the policy by a parametric probability distribution $\pi_\theta(a \mid s) = \mathbb{P}[a \mid s; \theta]$ that stochastically selects action $a$ in state $s$ according to parameter vector $\theta$. Policy gradient algorithms typically proceed by sampling this stochastic policy and adjusting the policy parameters in the direction of greater cumulative reward.
In this paper we instead consider deterministic policies $a = \mu_\theta(s)$. It is natural to wonder whether the same approach can be followed as for stochastic policies: adjusting the policy parameters in the direction of the policy gradient. It was previously believed that the deterministic policy gradient did not exist, or could only be obtained when using a model (Peters, 2010). However, we show that the deterministic policy gradient does indeed exist, and furthermore it has a simple model-free form that simply follows the gradient of the action-value function. In addition, we show that the deterministic policy gradient is the limiting case, as policy variance tends to zero, of the stochastic policy gradient.

From a practical viewpoint, there is a crucial difference between the stochastic and deterministic policy gradients. In the stochastic case, the policy gradient integrates over both state and action spaces, whereas in the deterministic case it only integrates over the state space. As a result, computing the stochastic policy gradient may require more samples, especially if the action space has many dimensions.
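For orientation, the deterministic policy gradient that the paper goes on to derive takes the following form, quoted here as a preview. In contrast to the stochastic policy gradient of Section 2.2, the expectation is over states only:

$$\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho^\mu}\!\left[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s,a)\big|_{a=\mu_\theta(s)}\right]$$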
In order to explore the full state and action space, a stochastic policy is often necessary. To ensure that our deterministic policy gradient algorithms continue to explore satisfactorily, we introduce an off-policy learning algorithm. The basic idea is to choose actions according to a stochastic behaviour policy (to ensure adequate exploration), but to learn about a deterministic target policy (exploiting the efficiency of the deterministic policy gradient). We use the deterministic policy gradient to derive an off-policy actor-critic algorithm that estimates the action-value function using a differentiable function approximator, and then updates the policy parameters in the direction of the approximate action-value gradient. We also introduce a notion of compatible function approximation for deterministic policy gradients, to ensure that the approximation does not bias the policy gradient.
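To make the actor update concrete, here is a hypothetical toy sketch (not from the paper): a linear deterministic policy is improved by ascending the action-value gradient of an assumed, already-learned critic, while an exploratory behaviour action is what would actually be executed in the environment. The critic Q(s, a) = -(a - a_star)^2, the constant `a_star`, and the dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, alpha = 3, 0.05
theta = np.zeros(state_dim)   # parameters of the deterministic target policy mu_theta(s) = theta . s
a_star = 1.5                  # assumption: the toy critic below is maximised at a = a_star

def mu(theta, s):
    """Deterministic target policy (linear, scalar action)."""
    return theta @ s

def dq_da(s, a):
    """Action-gradient of an assumed, already-learned critic Q(s, a) = -(a - a_star)^2.
    In the full algorithm this critic would itself be estimated off-policy."""
    return -2.0 * (a - a_star)

for step in range(2000):
    s = rng.normal(size=state_dim)
    a_behaviour = mu(theta, s) + rng.normal(scale=0.3)  # exploratory behaviour action (what would be executed)
    # Actor update: follow grad_theta mu_theta(s) * dQ/da evaluated at a = mu_theta(s);
    # for a linear policy, grad_theta mu_theta(s) = s.
    theta += alpha * s * dq_da(s, mu(theta, s))
```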
We apply our deterministic actor-critic algorithms to several benchmark problems: a high-dimensional bandit; several standard benchmark reinforcement learning tasks with low-dimensional action spaces; and a high-dimensional task for controlling an octopus arm. Our results demonstrate a significant performance advantage to using deterministic policy gradients over stochastic policy gradients, particularly in high-dimensional tasks. Furthermore, our algorithms require no more computation than prior methods: the computational cost of each update is linear in the action dimensionality and the number of policy parameters. Finally, there are many applications (for example in robotics) where a differentiable control policy is provided, but where there is no functionality to inject noise into the controller. In these cases, the stochastic policy gradient is inapplicable, whereas our methods may still be useful.
2. Background
2.1. Preliminaries
We study reinforcement learning and control problems in which an agent acts in a stochastic environment by sequentially choosing actions over a sequence of time steps, in order to maximise a cumulative reward. We model the problem as a Markov decision process (MDP) which comprises: a state space $\mathcal{S}$, an action space $\mathcal{A}$, an initial state distribution with density $p_1(s_1)$, a stationary transition dynamics distribution with conditional density $p(s_{t+1} \mid s_t, a_t)$ satisfying the Markov property $p(s_{t+1} \mid s_1, a_1, \ldots, s_t, a_t) = p(s_{t+1} \mid s_t, a_t)$ for any trajectory $s_1, a_1, s_2, a_2, \ldots, s_T, a_T$ in state-action space, and a reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. A policy is used to select actions in the MDP. In general the policy is stochastic and denoted by $\pi_\theta : \mathcal{S} \to \mathcal{P}(\mathcal{A})$, where $\mathcal{P}(\mathcal{A})$ is the set of probability measures on $\mathcal{A}$, $\theta \in \mathbb{R}^n$ is a vector of $n$ parameters, and $\pi_\theta(a_t \mid s_t)$ is the conditional probability density at $a_t$ associated with the policy. The agent uses its policy to interact with the MDP to give a trajectory of states, actions and rewards, $h_{1:T} = s_1, a_1, r_1, \ldots, s_T, a_T, r_T$ over $\mathcal{S} \times \mathcal{A} \times \mathbb{R}$. The return $r_t^\gamma$ is the total discounted reward from time-step $t$ onwards, $r_t^\gamma = \sum_{k=t}^{\infty} \gamma^{k-t} r(s_k, a_k)$, where $0 < \gamma < 1$. Value functions are defined to be the expected total discounted reward, $V^\pi(s) = \mathbb{E}[r_1^\gamma \mid S_1 = s; \pi]$ and $Q^\pi(s,a) = \mathbb{E}[r_1^\gamma \mid S_1 = s, A_1 = a; \pi]$. The agent's goal is to obtain a policy which maximises the cumulative discounted reward from the start state, denoted by the performance objective $J(\pi) = \mathbb{E}[r_1^\gamma \mid \pi]$.
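As a concrete illustration of the return defined above, the following minimal sketch (not from the paper; the episode is finite, so the infinite sum is truncated at the episode end) computes $r_t^\gamma$ from a recorded reward sequence:

```python
def discounted_return(rewards, gamma=0.99, t=0):
    """Truncated discounted return r_t^gamma = sum_{k >= t} gamma^(k-t) * r(s_k, a_k),
    computed from a finite list of observed rewards."""
    return sum(gamma ** (k - t) * r for k, r in enumerate(rewards) if k >= t)

# Example: rewards observed over a short episode
rewards = [1.0, 0.0, 0.5, 1.0]
print(discounted_return(rewards, gamma=0.9, t=0))  # 1.0 + 0.9*0.0 + 0.81*0.5 + 0.729*1.0 = 2.134
```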
We denote the density at state $s'$ after transitioning for $t$ time steps from state $s$ by $p(s \to s', t, \pi)$. We also denote the (improper) discounted state distribution by $\rho^\pi(s') := \int_{\mathcal{S}} \sum_{t=1}^{\infty} \gamma^{t-1} p_1(s)\, p(s \to s', t, \pi)\, \mathrm{d}s$. We can then write the performance objective as an expectation,

$$J(\pi_\theta) = \int_{\mathcal{S}} \rho^\pi(s) \int_{\mathcal{A}} \pi_\theta(s,a)\, r(s,a)\, \mathrm{d}a\, \mathrm{d}s = \mathbb{E}_{s \sim \rho^\pi, a \sim \pi_\theta}\left[r(s,a)\right] \qquad (1)$$

where $\mathbb{E}_{s \sim \rho}[\cdot]$ denotes the (improper) expected value with respect to the discounted state distribution $\rho(s)$. In the remainder of the paper we suppose for simplicity that $\mathcal{A} = \mathbb{R}^m$ and that $\mathcal{S}$ is a compact subset of $\mathbb{R}^d$.
2.2. Stochastic Policy Gradient Theorem
Policy gradient algorithms are perhaps the most popular class of continuous-action reinforcement learning algorithms. The basic idea behind these algorithms is to adjust the parameters $\theta$ of the policy in the direction of the performance gradient $\nabla_\theta J(\pi_\theta)$. The fundamental result underlying these algorithms is the policy gradient theorem (Sutton et al., 1999),

$$\nabla_\theta J(\pi_\theta) = \int_{\mathcal{S}} \rho^\pi(s) \int_{\mathcal{A}} \nabla_\theta \pi_\theta(a \mid s)\, Q^\pi(s,a)\, \mathrm{d}a\, \mathrm{d}s = \mathbb{E}_{s \sim \rho^\pi, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s,a)\right] \qquad (2)$$
The policy gradient is surprisingly simple. In particular, despite the fact that the state distribution $\rho^\pi(s)$ depends on the policy parameters, the policy gradient does not depend on the gradient of the state distribution.
This theorem has important practical value, because it reduces the computation of the performance gradient to a simple expectation. The policy gradient theorem has been used to derive a variety of policy gradient algorithms (Degris et al., 2012a), by forming a sample-based estimate of this expectation. One issue that these algorithms must address is how to estimate the action-value function $Q^\pi(s,a)$. Perhaps the simplest approach is to use a sample return $r_t^\gamma$ to estimate the value of $Q^\pi(s_t, a_t)$, which leads to a variant of the REINFORCE algorithm (Williams, 1992).
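As a rough sketch of how this estimator can be implemented (not code from the paper; the linear-Gaussian policy and all names here are illustrative assumptions), the gradient in Equation 2 is approximated by averaging the score function times the sample return over a trajectory sampled from $\pi_\theta$:

```python
import numpy as np

def score_gaussian(theta, s, a, sigma=0.5):
    """grad_theta log pi_theta(a|s) for a Gaussian policy with mean theta . s and fixed std sigma."""
    return (a - theta @ s) * s / sigma**2

def reinforce_gradient(theta, trajectory, gamma=0.99, sigma=0.5):
    """Sample-based estimate of Equation 2: the sample return r_t^gamma stands in
    for Q^pi(s_t, a_t), giving a REINFORCE-style estimator."""
    grad = np.zeros_like(theta)
    rewards = [r for (_, _, r) in trajectory]
    for t, (s, a, _) in enumerate(trajectory):
        ret_t = sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))
        grad += score_gaussian(theta, s, a, sigma) * ret_t
    return grad / len(trajectory)

# usage: trajectory is a list of (state, action, reward) tuples generated by pi_theta,
# then: theta = theta + learning_rate * reinforce_gradient(theta, trajectory)
```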
2.3. Stochastic Actor-Critic Algorithms
The actor-critic is a widely used architecture based on the policy gradient theorem (Sutton et al., 1999; Peters et al., 2005; Bhatnagar et al., 2007; Degris et al., 2012a). The actor-critic consists of two eponymous components. An actor adjusts the parameters $\theta$ of the stochastic policy $\pi_\theta(s)$ by stochastic gradient ascent of Equation 2. Instead of the unknown true action-value function $Q^\pi(s,a)$ in Equation 2, an action-value function $Q^w(s,a)$ is used, with parameter vector $w$. A critic estimates the action-value function $Q^w(s,a) \approx Q^\pi(s,a)$ using an appropriate policy evaluation algorithm such as temporal-difference learning. In general, substituting a function approximator $Q^w(s,a)$ for the true action-value function $Q^\pi(s,a)$ may introduce bias. However, if the function approximator is compatible such that i) $Q^w(s,a) = \nabla_\theta \log \pi_\theta(a \mid s)^\top w$ and ii) the parameters $w$ are chosen to minimise the mean-squared error $\epsilon^2(w) = \mathbb{E}_{s \sim \rho^\pi, a \sim \pi_\theta}\left[(Q^w(s,a) - Q^\pi(s,a))^2\right]$, then there is no bias (Sutton et al., 1999),

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim \rho^\pi, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^w(s,a)\right] \qquad (3)$$

More intuitively, condition i) says that compatible function approximators are linear in "features" of the stochastic policy, $\nabla_\theta \log \pi_\theta(a \mid s)$, and condition ii) requires that the parameters are the solution to the linear regression problem that estimates $Q^\pi(s,a)$ from these features. In practice, condition ii) is usually relaxed in favour of policy evaluation algorithms that estimate the value function more efficiently by temporal-difference learning (Bhatnagar et al., 2007; Degris et al., 2012b; Peters et al., 2005); indeed, if both i) and ii) are satisfied then the overall algorithm is equivalent to not using a critic at all (Sutton et al., 2000), much like the REINFORCE algorithm (Williams, 1992).
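To make conditions i) and ii) concrete, here is a hypothetical sketch (not from the paper) for a linear-Gaussian policy: the compatible critic is linear in the score features $\nabla_\theta \log \pi_\theta(a \mid s)$, and its weights $w$ are fit by least squares to sampled targets standing in for $Q^\pi(s,a)$:

```python
import numpy as np

def score_gaussian(theta, s, a, sigma=0.5):
    """Policy 'features': grad_theta log pi_theta(a|s) for a Gaussian policy with mean theta . s."""
    return (a - theta @ s) * s / sigma**2

def fit_compatible_critic(theta, samples, sigma=0.5):
    """Condition ii): choose w to minimise the mean-squared error between
    Q^w(s,a) = phi(s,a) . w (condition i) and targets for Q^pi(s,a), e.g. sample returns."""
    phi = np.array([score_gaussian(theta, s, a, sigma) for (s, a, _) in samples])
    targets = np.array([q for (_, _, q) in samples])
    w, *_ = np.linalg.lstsq(phi, targets, rcond=None)
    return w

def q_compatible(theta, w, s, a, sigma=0.5):
    """Compatible approximator Q^w(s,a) = grad_theta log pi_theta(a|s)^T w."""
    return score_gaussian(theta, s, a, sigma) @ w
```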
2.4. Off-Policy Actor-Critic
It is often useful to estimate the policy gradient off-policy, from trajectories sampled from a distinct behaviour policy $\beta(a \mid s) \neq \pi_\theta(a \mid s)$. In an off-policy setting, the performance objective is typically modified to be the value function of the target policy, averaged over the state distribution of the behaviour policy (Degris et al., 2012b),

$$J_\beta(\pi_\theta) = \int_{\mathcal{S}} \rho^\beta(s)\, V^\pi(s)\, \mathrm{d}s = \int_{\mathcal{S}} \int_{\mathcal{A}} \rho^\beta(s)\, \pi_\theta(a \mid s)\, Q^\pi(s,a)\, \mathrm{d}a\, \mathrm{d}s$$
Differentiating the performance objective and applying an approximation gives the off-policy policy gradient (Degris et al., 2012b),

$$\nabla_\theta J_\beta(\pi_\theta) \approx \int_{\mathcal{S}} \int_{\mathcal{A}} \rho^\beta(s)\, \nabla_\theta \pi_\theta(a \mid s)\, Q^\pi(s,a)\, \mathrm{d}a\, \mathrm{d}s \qquad (4)$$

$$= \mathbb{E}_{s \sim \rho^\beta, a \sim \beta}\!\left[\frac{\pi_\theta(a \mid s)}{\beta_\theta(a \mid s)}\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s,a)\right] \qquad (5)$$
This approximation drops a term that depends on the action-value gradient $\nabla_\theta Q^\pi(s,a)$; Degris et al. (2012b) argue that this is a good approximation since it can preserve the set of local optima to which gradient ascent converges. The Off-Policy Actor-Critic (OffPAC) algorithm (Degris et al., 2012b) uses a behaviour policy $\beta(a \mid s)$ to generate trajectories. A critic estimates a state-value function, $V^v(s) \approx V^\pi(s)$, off-policy from these trajectories, by gradient temporal-difference learning (Sutton et al., 2009). An actor updates the policy parameters $\theta$, also off-policy from these trajectories, by stochastic gradient ascent of Equation 5. Instead of the unknown action-value function $Q^\pi(s,a)$ in Equation 5, the temporal-difference error $\delta_t$ is used, $\delta_t = r_{t+1} + \gamma V^v(s_{t+1}) - V^v(s_t)$; this can be shown to provide an approximation to the true gradient (Bhatnagar et al., 2007). Both the actor and the critic use an importance sampling ratio $\frac{\pi_\theta(a \mid s)}{\beta_\theta(a \mid s)}$ to adjust for the fact that actions were selected according to $\beta$ rather than $\pi$.
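A minimal sketch of the OffPAC-style updates described above, under assumed linear parameterisations (the function and variable names are illustrative, not the paper's): the critic learns $V^v$ with a plain TD(0)-style update and the actor follows Equation 5 with the TD error $\delta_t$ in place of $Q^\pi$, both corrected by the importance sampling ratio. The full OffPAC algorithm uses gradient-TD learning and eligibility traces; only the core update directions are shown here.

```python
import numpy as np

def offpac_update(theta, v, transition, pi_prob, beta_prob, score, gamma=0.99,
                  alpha_actor=1e-3, alpha_critic=1e-2):
    """One off-policy actor-critic update from a transition generated by the behaviour
    policy beta. pi_prob / beta_prob are the densities of the sampled action under the
    target and behaviour policies; score is grad_theta log pi_theta(a|s).
    States are represented by NumPy feature vectors."""
    s_feat, a, r, s_next_feat = transition
    rho = pi_prob / beta_prob                              # importance sampling ratio
    delta = r + gamma * (v @ s_next_feat) - v @ s_feat     # TD error with linear V^v(s) = v . phi(s)
    v = v + alpha_critic * rho * delta * s_feat            # critic: move V^v(s) toward the TD target
    theta = theta + alpha_actor * rho * delta * score      # actor: Equation 5 with delta in place of Q^pi
    return theta, v

# Hypothetical usage with placeholder data:
theta, v = np.zeros(4), np.zeros(4)
phi_s, phi_s_next = np.ones(4), np.ones(4)
theta, v = offpac_update(theta, v, (phi_s, 0.2, 1.0, phi_s_next),
                         pi_prob=0.4, beta_prob=0.5, score=0.1 * np.ones(4))
```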
