Reinforcement Learning (4) -- The DDPG Algorithm

The previous article introduced REINFORCE, the Monte-Carlo (MC) branch of the policy-gradient family: it updates only after a whole episode has finished. The other major branch is the Actor-Critic family, which updates with temporal-difference (TD) targets. This article covers DDPG, the most widely used Actor-Critic algorithm in practice, which directly outputs deterministic continuous actions.
1. The DDPG Algorithm
For a detailed walkthrough of the algorithm I again recommend Mr. Ke's course (). TD-style updating means the algorithm is updated at every step of every episode.
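To make the update schedule concrete, here is a minimal sketch (the agent and env objects are placeholders; the full training loop is given in Section 2). Unlike an MC method, which waits until the episode ends, the agent calls learn() after every single step:

    for episode in range(num_episodes):
        obs = env.reset()
        for step in range(max_steps):
            action = agent.choose_action(obs)
            next_obs, reward, done, _ = env.step(action)
            agent.store_transition(obs, action, reward, next_obs)
            agent.learn()   # TD-style: update at every step, not only at episode end
            obs = next_obs
            if done:
                break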
DDPG has an Actor network and a Critic network. The Actor network outputs a floating-point value, i.e. a deterministic policy, while the Critic network evaluates the action produced by the Actor (this evaluation is the familiar Q value). Its structure is shown in the figure below:
Actor network: updates the policy according to the Critic network's score;
Critic network: scores the action output by the Actor at every step, and adjusts its evaluation according to the reward from the environment so as to maximize the total future return.
The loss functions for optimizing the Actor network and the Critic network therefore follow directly:
- Actor network: Loss = -Q;
- Critic network: Loss = MSE(Q_eval, Q_target).
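Written as a minimal PyTorch sketch (the tensors batch_s, batch_a, batch_r, batch_s_ and the four networks are placeholders here; their full definitions come in Section 2), the two losses look like this:

    import torch
    import torch.nn.functional as F

    # Actor loss: maximize Q(s, actor(s)), i.e. minimize its negative
    actor_loss = -critic_eval(batch_s, actor_eval(batch_s)).mean()

    # Critic loss: MSE between the current Q estimate and the TD target
    with torch.no_grad():
        Q_target = batch_r + GAMMA * critic_target(batch_s_, actor_target(batch_s_))
    critic_loss = F.mse_loss(critic_eval(batch_s, batch_a), Q_target)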
Note that the Target Network and Replay Memory from DQN can also be carried over into DDPG, which means four neural networks are needed: the Critic network, the Target Critic network, the Actor network, and the Target Actor network. The overall DDPG framework is shown in the figure below:
This diagram captures the core of the DDPG algorithm; it is worth going over a few times until it sinks in.
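The target networks track their online counterparts through a soft ("Polyak") update rather than a hard copy. A minimal sketch, assuming a small coefficient TAU as in the code below (the implementation in Section 2 does the same thing by iterating over state_dict keys):

    def soft_update(target_net, eval_net, tau=0.01):
        # target <- (1 - tau) * target + tau * eval, parameter by parameter
        for t_param, e_param in zip(target_net.parameters(), eval_net.parameters()):
            t_param.data.copy_((1.0 - tau) * t_param.data + tau * e_param.data)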
2. DDPG Algorithm Code
Once the core ideas of DDPG are clear, implementing it is fairly straightforward; it closely follows the DQN code from the earlier articles. The complete code is given below, and with the comments it should be easy to follow:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import gym
####################  hyper parameters  ####################
EPISODES = 1000          # number of training episodes
STEPS = 200              # max steps per episode
TEST = 5                 # number of episodes per evaluation run
LR_ACTOR = 0.001         # learning rates of the Actor and the Critic
LR_CRITIC = 0.002
GAMMA = 0.9              # discount factor
MEMORY_CAPACITY = 10000  # size of the replay buffer
BATCH_SIZE = 32          # mini-batch size sampled from the replay buffer
ENV_NAME = 'Pendulum-v0'

env = gym.make(ENV_NAME)
env = env.unwrapped      # unwrapped gives the raw env, which can be stepped indefinitely instead of stopping after 200 steps
env.seed(1)              # random seed
s_dim = env.observation_space.shape[0]   # dimension of the state
a_dim = env.action_space.shape[0]        # dimension of the action
a_bound = env.action_space.high          # upper bound of the action
a_low_bound = env.action_space.low       # lower bound of the action
TAU = 0.01               # coefficient for the soft update of the target networks
####################  DDPG Framework  ####################
# #### Actor (policy) network: 2 fully connected layers ####
class ActorNet(nn.Module):
    def __init__(self, s_dim, a_dim):
        super(ActorNet, self).__init__()
        self.fc1 = nn.Linear(s_dim, 30)
        self.fc1.weight.data.normal_(0, 0.1)   # initialization of FC1
        self.out = nn.Linear(30, a_dim)
        self.out.weight.data.normal_(0, 0.1)   # initialization of OUT

    def forward(self, x):
        x = self.fc1(x)
        x = F.relu(x)
        x = self.out(x)
        x = torch.tanh(x)   # tanh squashes the output into [-1, 1]
        actions = x * 2     # for the game "Pendulum-v0", the action range is [-2, 2]
        return actions

# #### Critic (value) network: 3 fully connected layers ####
class CriticNet(nn.Module):
    def __init__(self, s_dim, a_dim):
        super(CriticNet, self).__init__()
        self.fcs = nn.Linear(s_dim, 30)
        self.fcs.weight.data.normal_(0, 0.1)
        self.fca = nn.Linear(a_dim, 30)
        self.fca.weight.data.normal_(0, 0.1)
        self.out = nn.Linear(30, 1)
        self.out.weight.data.normal_(0, 0.1)

    def forward(self, s, a):
        x = self.fcs(s)
        y = self.fca(a)
        actions_value = self.out(F.relu(x + y))   # the critic takes both s and a as input
        return actions_value
####################  DDPG Class  ####################
# #### the main DDPG agent ####
class DDPG(nn.Module):
    def __init__(self, act_dim, obs_dim, a_bound):
        super(DDPG, self).__init__()
        # dimensions of the state and action
        self.act_dim = act_dim
        self.obs_dim = obs_dim
        self.a_bound = a_bound
        # replay-buffer pointer; learning starts once the buffer has been filled
        self.pointer = 0
        # build the four networks
        self.actor_eval = ActorNet(obs_dim, act_dim)
        self.actor_target = ActorNet(obs_dim, act_dim)
        self.critic_eval = CriticNet(obs_dim, act_dim)
        self.critic_target = CriticNet(obs_dim, act_dim)
        # build the replay buffer: each row stores (s, a, r, s_)
        self.memory = np.zeros((MEMORY_CAPACITY, obs_dim * 2 + act_dim + 1), dtype=np.float32)
        # build the optimizers and the loss function
        self.actor_optimizer = torch.optim.Adam(self.actor_eval.parameters(), lr=LR_ACTOR)
        self.critic_optimizer = torch.optim.Adam(self.critic_eval.parameters(), lr=LR_CRITIC)
        self.loss_func = nn.MSELoss()

    # action selection
    def choose_action(self, obs):
        obs = torch.unsqueeze(torch.FloatTensor(obs), 0)
        action = self.actor_eval(obs)[0].detach()
        return action

    # store a transition in the replay buffer
    def store_transition(self, obs, action, reward, next_obs):
        transition = np.hstack((obs, action, [reward], next_obs))
        index = self.pointer % MEMORY_CAPACITY   # replace the old data with new data
        self.memory[index, :] = transition
        self.pointer += 1

    # one learning step
    def learn(self):
        # soft update of the target networks
        # state_dict() holds the network's parameters (and buffers), keyed by name
        for x in self.actor_target.state_dict().keys():
            eval('self.actor_target.' + x + '.data.mul_((1-TAU))')
            eval('self.actor_target.' + x + '.data.add_(TAU*self.actor_eval.' + x + '.data)')
        for x in self.critic_target.state_dict().keys():
            eval('self.critic_target.' + x + '.data.mul_((1-TAU))')
            eval('self.critic_target.' + x + '.data.add_(TAU*self.critic_eval.' + x + '.data)')
        # sample a batch from the replay buffer
        indices = np.random.choice(MEMORY_CAPACITY, size=BATCH_SIZE)
        batch_trans = self.memory[indices, :]
        # extract data from the mini-batch of transitions: s, a, r, s_
        batch_s = torch.FloatTensor(batch_trans[:, :self.obs_dim])
        batch_a = torch.FloatTensor(batch_trans[:, self.obs_dim:self.obs_dim + self.act_dim])
        batch_r = torch.FloatTensor(batch_trans[:, -self.obs_dim - 1:-self.obs_dim])
        batch_s_ = torch.FloatTensor(batch_trans[:, -self.obs_dim:])
        # update the Actor (policy) network
        action = self.actor_eval(batch_s)
        Q = self.critic_eval(batch_s, action)
        actor_loss = -torch.mean(Q)
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()
        # update the Critic (value) network
        act_target = self.actor_target(batch_s_)
        q_tmp = self.critic_target(batch_s_, act_target).detach()   # the TD target carries no gradient
        Q_target = batch_r + GAMMA * q_tmp
        Q_eval = self.critic_eval(batch_s, batch_a)
        td_error = self.loss_func(Q_eval, Q_target)
        self.critic_optimizer.zero_grad()
        td_error.backward()
        self.critic_optimizer.step()
####################  Training  ####################
# #### main function ####
def main():
    var = 3   # variance of the Gaussian exploration noise
    agent = DDPG(a_dim, s_dim, a_bound)   # create the DDPG agent
    for episode in range(EPISODES):
        obs = env.reset()
        for step in range(STEPS):
            action = agent.choose_action(obs)   # select an action
            # add exploration noise, then clip to the valid action range
            action = np.clip(np.random.normal(action, var), a_low_bound, a_bound)
            next_obs, reward, done, _ = env.step(action)
            agent.store_transition(obs, action, reward, next_obs)   # store in the replay buffer
            if agent.pointer > MEMORY_CAPACITY:   # start learning once the buffer is full
                var *= 0.9995                     # gradually reduce the exploration noise
                agent.learn()
            obs = next_obs
            if done:
                break
        # evaluate every 20 episodes
        if episode % 20 == 0:
            total_reward = 0
            for i in range(TEST):   # average over 5 test episodes
                obs = env.reset()
                for j in range(STEPS):
                    action = agent.choose_action(obs)
                    action = np.clip(np.random.normal(action, var), a_low_bound, a_bound)
                    next_obs, reward, done, _ = env.step(action)
                    obs = next_obs
                    total_reward += reward
                    if done:
                        break
            avg_reward = total_reward / TEST   # average test reward
            print('Episode: ', episode, 'Test_reward: ', avg_reward)

if __name__ == '__main__':
    main()
3. Results of the DDPG Algorithm
The environment used here is gym's 'Pendulum-v0', a continuous-action environment whose action is a float in [-2, 2]. The goal is to keep the pole at zero angle (vertical) with the smallest possible rotational speed and the least effort.
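For reference, the state and action dimensions relied on by the hyperparameter block above can be checked directly with a short snippet (values shown are for the classic gym Pendulum-v0):

    import gym

    env = gym.make('Pendulum-v0')
    print(env.observation_space.shape)                  # (3,)  -> s_dim = 3
    print(env.action_space.shape)                       # (1,)  -> a_dim = 1
    print(env.action_space.low, env.action_space.high)  # [-2.] [2.]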
After about 200 episodes of training, the pendulum stays upright for longer and longer, which shows that the DDPG algorithm is working.
