Reinforcement Learning Notes (7): Actor-Critic Algorithms and a PyTorch Implementation
Continuing from the previous set of notes. The last section summarized Policy Gradient methods and the Monte Carlo REINFORCE implementation; this section looks at the Actor-Critic algorithm. Actor-Critic was proposed in the paper "Actor-Critic Algorithms" published at NIPS in 2000. It combines policy-based and value-based methods; see the opening slide of UCL Lecture 7 (figure below).
Q1: What does Actor-Critic mean, and how does it differ from a pure policy gradient method?
The first role is the Actor. It is a relatively independent model, which you can still think of as a neural network whose job is to learn actions. Optimizing it is not very different from optimizing the network inside an ordinary DQN.
The other role is the Critic. It evaluates the Actor's performance and guides the Actor's actions in the next stage. This role is also an independent model. Under this way of thinking, value estimation is an independent, optimizable task fitted by one model, and action output is another task fitted by another model.
The idea is somewhat like the generator and discriminator in a GAN: the two models supervise and constrain each other and eventually reach a good result. If you followed DQN, the policy gradient ascent formula, and the Monte Carlo REINFORCE algorithm from earlier, this is easy to understand.
Unlike Monte Carlo Policy Gradient, Actor-Critic no longer uses the sampled return to estimate the true value function. Instead the Critic uses function approximation methods, i.e. a neural network, to estimate the value, so the update follows an approximate policy gradient rather than the true policy gradient.
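Roughly, in the notation of the previous section (my summary, not the post's), the only change is what multiplies the score function in the update:
$$\Delta\theta = \alpha\,\nabla_\theta \log \pi_\theta(s_t, a_t)\,G_t \qquad \text{(Monte Carlo REINFORCE: sampled return)}$$
$$\Delta\theta = \alpha\,\nabla_\theta \log \pi_\theta(s_t, a_t)\,Q_w(s_t, a_t) \qquad \text{(Actor-Critic: the Critic's estimate)}$$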
Q2: Understanding the Baseline and the Advantage Function
The policy gradient update formula is:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s,a)\, Q^{w}(s,a)\right]$$
As discussed in the previous section, $\nabla_\theta \log \pi_\theta(s,a)$ is the score function (Score Function); it stays as it is, because the policy $\pi_\theta$ is exactly the variable we are updating. What about the other factor, $Q^{w}(s,a)$? A short derivation shows that we can subtract a baseline function $B(s)$: as long as it does not vary with the action $a$, the equality above still holds, because the subtracted term is zero:
$$\mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s,a)\, B(s)\right] = \sum_{s\in\mathcal{S}} d^{\pi_\theta}(s) \sum_{a\in\mathcal{A}} \nabla_\theta \pi_\theta(s,a)\, B(s) = \sum_{s\in\mathcal{S}} d^{\pi_\theta}(s)\, B(s)\, \nabla_\theta \sum_{a\in\mathcal{A}} \pi_\theta(s,a) = \sum_{s\in\mathcal{S}} d^{\pi_\theta}(s)\, B(s)\, \nabla_\theta 1 = 0$$
Let $B(s) = V^{\pi_\theta}(s)$ and introduce the advantage function (Advantage Function) $A^{\pi_\theta}(s,a) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s)$. The policy gradient can then be rewritten as
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s,a)\, A^{\pi_\theta}(s,a)\right]$$
The two expectations are the same, so what does the advantage function buy us? In the Monte Carlo policy gradient method of the previous section, the update term is the sampled target return, which leads to very high variance. Using the advantage function instead reduces the variance of the policy gradient significantly.
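A toy illustration of the variance reduction, not from the original post (the two-action bandit, the rewards 9 and 10, and the single policy parameter are all made up for the example): subtracting the state value as a baseline leaves the single-sample gradient estimate unbiased but shrinks its variance, because the large common offset in the returns no longer multiplies the score function.

import numpy as np

rng = np.random.default_rng(0)
theta = 0.3                                 # single policy parameter (logit of action 1)
p1 = 1.0 / (1.0 + np.exp(-theta))           # pi(a=1)
rewards = {0: 9.0, 1: 10.0}                 # both rewards are large and positive
baseline = p1 * rewards[1] + (1 - p1) * rewards[0]  # state value V(s) = E[r]

def grad_sample(use_baseline):
    a = int(rng.random() < p1)              # sample an action from the policy
    score = (1.0 - p1) if a == 1 else -p1   # d/dtheta of log pi(a)
    target = rewards[a] - (baseline if use_baseline else 0.0)
    return score * target

for use_baseline in (False, True):
    g = np.array([grad_sample(use_baseline) for _ in range(20000)])
    print(f"baseline={use_baseline}: mean={g.mean():+.4f}  var={g.var():.4f}")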
The Critic's task is to estimate this advantage function, and the expectation of the TD error is exactly the advantage: with $\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s)$,
$$\mathbb{E}_{\pi_\theta}\!\left[\delta^{\pi_\theta} \mid s, a\right] = \mathbb{E}\!\left[r + \gamma V^{\pi_\theta}(s') \mid s, a\right] - V^{\pi_\theta}(s) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s) = A^{\pi_\theta}(s,a)$$
So the sampled TD error can be used in place of the advantage in the policy gradient update.
Actor-Critic implementation in PyTorch
Reference TensorFlow version: /ljpzzz/machinelearning/blob/master/reinforcement-learning/actor_critic.py
It took me a long time to get the code right, and I got stuck in some very simple places. The main thing to watch out for is that td_error is computed by the Critic's Q network, so if it is returned directly it still carries the first network's computation graph and gradients. That gradient has to be stripped before the value is handed to the Actor, otherwise the Actor update raises an error.
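A minimal sketch of that point (my own example; the two tensors just stand in for the Critic's outputs): the TD error can either be computed inside a torch.no_grad() block, as train_Q_network below does, or detached explicitly before being handed to the Actor.

import torch

v = torch.tensor([1.0], requires_grad=True)       # stands in for the Critic's V(s)
v_next = torch.tensor([1.5], requires_grad=True)  # stands in for the Critic's V(s')
reward, gamma = 1.0, 0.95

td_error = (reward + gamma * v_next - v).detach()  # cut the Critic's graph here
print(td_error.requires_grad)                      # False: safe to use in the Actor's loss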
Also, this code is hard to get to converge; my evaluation score stayed stuck around 9. Still, writing it out once really helps with understanding the policy-value two-network setup.
Program flow
1. Instantiate actor / critic and initialize the hyperparameters
2. for each episode:
   for each step:
   ① Use the actor to choose an action
   ② Step(state, action) to get the state transition
   ③ Use the critic's Q_Net to compute V(s) and V(s'),
      giving TD_error = r + γV(s') - V(s);
      train the Q_Network on the mean squared error of TD_error along the way
   ④ Feed TD_error back to the Actor and train the Actor with the Policy Gradient formula
      (both losses are written out below)
   ⑤ state = next_state
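Written out as losses (my summary of the listing below; the TD error δ is treated as a constant where it multiplies the log-probability), steps ③ and ④ amount to minimizing
$$L_{\text{critic}}(w) = \big(r + \gamma V_w(s') - V_w(s)\big)^2, \qquad L_{\text{actor}}(\theta) = -\log \pi_\theta(a \mid s)\,\delta, \quad \delta = r + \gamma V_w(s') - V_w(s).$$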
"""
@ Author: Peter Xiao
@ Date: 2020/7/23
@ Filename: Actor_critic.py
挽救的反义词
@ Brief: 使⽤ Actor-Critic算法训练CartPole-v0
热带风暴水上乐园"""
import gym
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import time
# Hyper Parameters for Actor
GAMMA = 0.95  # discount factor
LR = 0.01     # learning rate

# Use GPU when available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.backends.cudnn.enabled = False  # disable cuDNN to avoid its non-deterministic algorithms
class PGNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(PGNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, 20)
        self.fc2 = nn.Linear(20, action_dim)

    def forward(self, x):
        out = F.relu(self.fc1(x))
        out = self.fc2(out)
        return out

    def initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data, 0, 0.1)
                nn.init.constant_(m.bias.data, 0.01)
class Actor(object):
    # policy-gradient agent
    def __init__(self, env):
        # dimensions of the state space and action space
        self.state_dim = env.observation_space.shape[0]
        self.action_dim = env.action_space.n

        # init network parameters
        self.network = PGNetwork(state_dim=self.state_dim, action_dim=self.action_dim).to(device)
        self.optimizer = torch.optim.Adam(self.network.parameters(), lr=LR)

        # init some parameters
        self.time_step = 0

    def choose_action(self, observation):
        observation = torch.FloatTensor(observation).to(device)
        with torch.no_grad():
            network_output = self.network.forward(observation)
            prob_weights = F.softmax(network_output, dim=0).cpu().numpy()
            # prob_weights = F.softmax(network_output, dim=0).detach().numpy()
        action = np.random.choice(range(prob_weights.shape[0]),
                                  p=prob_weights)  # sample the action according to the policy's probabilities
        return action
    def learn(self, state, action, td_error):
        self.time_step += 1
        # Step 1: forward pass
        softmax_input = self.network.forward(torch.FloatTensor(state).to(device)).unsqueeze(0)
        action = torch.LongTensor([action]).to(device)
        neg_log_prob = F.cross_entropy(input=softmax_input, target=action, reduction='none')

        # Step 2: backward pass
        # cross_entropy already gives -log pi(a|s), so maximizing log pi(a|s) * td_error
        # means minimizing neg_log_prob * td_error
        loss_a = neg_log_prob * td_error
        self.optimizer.zero_grad()
        loss_a.backward()
        self.optimizer.step()
# Hyper Parameters for Critic
EPSILON = 0.01            # final value of epsilon
REPLAY_SIZE = 10000       # experience replay buffer size
BATCH_SIZE = 32           # size of minibatch
REPLACE_TARGET_FREQ = 10  # frequency to update target Q network
class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, 20)
        self.fc2 = nn.Linear(20, 1)  # unlike before, the output is a single state value, not action_dim

    def forward(self, x):
        out = F.relu(self.fc1(x))
        out = self.fc2(out)
        return out

    def initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data, 0, 0.1)
                nn.init.constant_(m.bias.data, 0.01)
class Critic(object):
    def __init__(self, env):
        # dimensions of the state space and action space
        self.state_dim = env.observation_space.shape[0]
        self.action_dim = env.action_space.n

        # init network parameters
        self.network = QNetwork(state_dim=self.state_dim, action_dim=self.action_dim).to(device)
        self.optimizer = torch.optim.Adam(self.network.parameters(), lr=LR)
        self.loss_func = nn.MSELoss()

        # init some parameters
        self.time_step = 0
        self.epsilon = EPSILON  # epsilon keeps decreasing over time (leftover from DQN, unused here)
    def train_Q_network(self, state, reward, next_state):
        s, s_ = torch.FloatTensor(state).to(device), torch.FloatTensor(next_state).to(device)
        # forward pass
        v = self.network.forward(s)    # v(s)
        v_ = self.network.forward(s_)  # v(s')

        # backward pass: regress V(s) towards the TD target r + gamma * V(s')
        loss_q = self.loss_func(reward + GAMMA * v_, v)
        self.optimizer.zero_grad()
        loss_q.backward()
        self.optimizer.step()

        # recompute the TD error without tracking gradients, so the Actor receives
        # a value that is detached from the Critic's computation graph
        with torch.no_grad():
            td_error = reward + GAMMA * v_ - v
        return td_error
# Hyper Parameters
ENV_NAME = 'CartPole-v0'
EPISODE = 3000  # Episode limitation
STEP = 3000     # Step limitation in an episode
TEST = 10       # number of evaluation episodes run every 100 training episodes
def main():
    # initialize the OpenAI Gym env and the two agents
    env = gym.make(ENV_NAME)
    actor = Actor(env)
    critic = Critic(env)

    for episode in range(EPISODE):
        # initialize task
        state = env.reset()
        # Train
        for step in range(STEP):
            action = actor.choose_action(state)  # action sampled from the softmax policy
            next_state, reward, done, _ = env.step(action)
            td_error = critic.train_Q_network(state, reward, next_state)  # gradient = grad[(r + gamma * V(s_) - V(s))^2]
            actor.learn(state, action, td_error)  # true_gradient = grad[logPi(s,a) * td_error]
            state = next_state
            if done:
                break
        # Test every 100 episodes
        if episode % 100 == 0:
            total_reward = 0
            for i in range(TEST):
                state = env.reset()
                for j in range(STEP):
                    action = actor.choose_action(state)  # direct action for test
                    state, reward, done, _ = env.step(action)
                    total_reward += reward
                    if done:
                        break
            ave_reward = total_reward / TEST
            print('episode: ', episode, 'Evaluation Average Reward:', ave_reward)
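The listing defines main() but never calls it. A minimal entry point (my addition, which also puts the otherwise unused time import to work by reporting wall-clock training time) could be:

if __name__ == '__main__':
    start = time.time()
    main()
    print('Finished in %.1f seconds' % (time.time() - start))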