Option-Critic Code Analysis
1. Analysis of option-critic_network.py
a. State Network
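state_model passes the input through three convolutional layers, flattens the result into a one-dimensional vector flattened, and feeds it to a fully connected layer that computes relu(flattened * weights4 + bias1).
My understanding: this step extracts features from the image and uses them as the observable state for further processing.
q_model computes matmul(input, q_weights1) + q_bias1. The input of q_model is the output of state_model.
In this process, network_params holds the parameters of the three filters plus weights4 and bias1; Q_params holds q_weights1 and q_bias1.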
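To make the last step concrete, here is a minimal pure-Python sketch of the affine map that q_model applies, i.e. what tf.nn.xw_plus_b computes for a single input row. The helper name xw_plus_b, the toy sizes (3 features, 2 outputs) and all numeric values are assumptions chosen for illustration; the real network uses the shapes passed in weight_shape.

```python
def xw_plus_b(x, weights, bias):
    """Return x @ weights + bias for one row vector x, using plain lists."""
    return [
        sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]


# Toy state features, standing in for the output of state_model.
state_features = [1.0, 2.0, 3.0]

# One column of weights per output unit (illustrative values only).
q_weights1 = [
    [0.5, -1.0],
    [0.0, 1.0],
    [1.0, 0.0],
]
q_bias1 = [0.5, 0.0]

q_values = xw_plus_b(state_features, q_weights1, q_bias1)
print(q_values)
```

Each output is just a weighted sum of the state features plus a bias, so q_model adds no nonlinearity on top of the features produced by state_model.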
Why apply convolution to the image?
Convolving the raw image with a kernel extracts specific features from the image.
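What is 2D convolution?
In the figure, the green 5x5 box can be viewed as a local pixel matrix of a grayscale image, the moving yellow 3x3 box is the kernel, and the pink box is the result of the convolution.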
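The sliding-window operation described above can be sketched in a few lines of pure Python. The pixel and kernel values below are assumptions picked for illustration (they are not taken from the figure), and conv2d_valid is a hypothetical helper name. As in deep learning frameworks, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image with stride 1 and no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            # Multiply the kernel with the patch under it and sum the products.
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# A 5x5 single-channel "image" (illustrative values).
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
# A 3x3 kernel (illustrative values).
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)
```

A 5x5 input and a 3x3 kernel give a 3x3 result because the kernel fits in only three positions along each axis.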
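Filter vs. kernel in neural networks
kernel: a 2D matrix, height x width.
filter: a 3D block, height x width x depth, where the depth is the number of kernels it contains. The kernel is the basic element of a filter: several kernels stacked together form one filter.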
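Channels in neural networks
A CNN layer has input channels and output channels. The number of input channels equals the depth of the input data, and the number of output channels equals the depth of the output data.
The convolution function in TensorFlow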
tf.nn.conv2d(
    input,
    filter,
    strides,
    padding,
    use_cudnn_on_gpu=True,
    data_format='NHWC',
    name=None)
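Args:
• input: the input tensor, i.e. the image to be convolved. conv2d requires a 4-D input with dimensions [batch, in_height, in_width, in_channels]: the batch size, the height and width of the input image, and the number of channels per image.
• filter: the convolution kernel, also 4-D, with dimensions [filter_height, filter_width, in_channels, out_channels]: the kernel height and width, the number of input channels, and the number of output channels.
• strides: the step size, i.e. how far the kernel moves at each step of the convolution. It is usually given as [1, stride_h, stride_w, 1], where stride_h and stride_w are the steps along the height and width. The first 1 is the step over the batch dimension and the last 1 is the step over the channel dimension. TensorFlow currently requires strides[0] = strides[3] = 1, i.e. batches and channels cannot be skipped. In the animation above, stride_h and stride_w are both 1.
• padding: how edges are handled, either "SAME" or "VALID".
Because the kernel has a finite size, some of its elements have no matching pixel when it reaches the edge of the image.
In "SAME" mode the missing positions are zero-padded and the convolution proceeds. With strides [1,1,1,1] the output has the same size as the input, hence the name "SAME".
In "VALID" mode no convolution is computed where the kernel would overhang the edge, so the output is smaller than the input.
Returns:
A Tensor (4-D).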
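The effect of strides and padding on the output size can be checked with a short calculation. The sketch below uses the output-size rules documented for TensorFlow ("VALID": ceil((in - filter + 1) / stride), "SAME": ceil(in / stride)) and applies them to the 84x84 input and the three kernel sizes and strides used by state_model in this file. The helper name conv_output_size is an assumption for illustration.

```python
import math


def conv_output_size(in_size, filter_size, stride, padding):
    """Spatial output size of a convolution along one axis."""
    if padding == "VALID":
        return math.ceil((in_size - filter_size + 1) / stride)
    if padding == "SAME":
        return math.ceil(in_size / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")


# (filter_size, stride) of the three convolutional layers in state_model.
layers = [(8, 4), (4, 2), (3, 1)]

size = 84  # the input images are 84x84
sizes = []
for filter_size, stride in layers:
    size = conv_output_size(size, filter_size, stride, "VALID")
    sizes.append(size)

# The last layer has 64 output channels, so the flattened vector has
# size * size * 64 entries.
flattened_len = size * size * 64
print(sizes, flattened_len)
```

The flattened length of 3136 is exactly the first dimension of weight_shapes = [[3136, 512]], which is why the fully connected layer can consume the flattened output of the third convolution.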
self.inputs = tf.placeholder(
    shape=[None, 84, 84, 4], dtype=tf.uint8, name="inputs")
scaled_image = tf.to_float(self.inputs) / 255.0
create_state_network
def state_model(self, input, kernel_shapes, weight_shapes):
    # kernel_shapes=[[8, 8, 4, 32], [4, 4, 32, 64], [3, 3, 64, 64]]
    # [filter_height, filter_width, in_channels, out_channels]
    # weight_shapes=[[3136, 512]]
    weights1 = tf.get_variable(
        "weights1", kernel_shapes[0],
        initializer=tf.contrib.layers.xavier_initializer())
    weights2 = tf.get_variable(
        "weights2", kernel_shapes[1],
        initializer=tf.contrib.layers.xavier_initializer())
    weights3 = tf.get_variable(
        "weights3", kernel_shapes[2],
        initializer=tf.contrib.layers.xavier_initializer())
    weights4 = tf.get_variable(
        "weights5", weight_shapes[0],
        initializer=tf.contrib.layers.xavier_initializer())
    bias1 = tf.get_variable(
        "q_bias1", weight_shapes[0][1],
        initializer=tf.constant_initializer())
    # Convolve
    conv1 = tf.nn.relu(tf.nn.conv2d(
        input, weights1, strides=[1, 4, 4, 1], padding='VALID'))
    conv2 = tf.nn.relu(tf.nn.conv2d(
        conv1, weights2, strides=[1, 2, 2, 1], padding='VALID'))
    conv3 = tf.nn.relu(tf.nn.conv2d(
        conv2, weights3, strides=[1, 1, 1, 1], padding='VALID'))
    # Flatten and Feedforward
    flattened = tf.contrib.layers.flatten(conv3)
    net = tf.nn.relu(tf.nn.xw_plus_b(flattened, weights4, bias1))
    return net
q_model
def q_model(self, input, weight_shape):
    weights1 = tf.get_variable(
        "q_weights1", weight_shape,
        initializer=tf.contrib.layers.xavier_initializer())
    # The Xavier initializer keeps the variance of each layer's output roughly equal.
    bias1 = tf.get_variable(
        "q_bias1", weight_shape[1],
        initializer=tf.constant_initializer())
    # Initializes the variable to a given constant (0 by default).
    return tf.nn.xw_plus_b(input, weights1, bias1)
    # Computes matmul(input, q_weights1) + q_bias1.
b. Prime Network
The Prime Network is the network that is continually updated. It consists mainly of create_state_network and target_q_model.
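create_state_network works like the state network above: it passes the image through three convolutional layers and one fully connected layer to obtain the observed state. The shapes of the filters and of the input are the same in both networks, so their parameters match one-to-one in number and shape. In other words, create_state_network duplicates the state network.
target_q_model implements target_Q_out = input * weights1 + bias1, which gives the action-value function $Q_\pi(s, a)$. target_network_params holds the parameters created in create_state_network; target_Q_params holds weights1 and bias1.
The parameters of the target network (Prime Network) are updated from the current network (State Network), with tau = 0.001.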
self.update_target_network_params = \
    [self.target_network_params[i].assign(
        tf.multiply(self.network_params[i], self.tau) +
        tf.multiply(self.target_network_params[i], 1. - self.tau))
     for i in range(len(self.target_network_params))]
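The assign expression above is a soft update: each target parameter moves a fraction tau toward the corresponding current parameter. A minimal pure-Python sketch of the same rule follows, using plain floats instead of tensors. The function name soft_update and the toy parameter values are assumptions for illustration; tau = 0.001 is the value stated above.

```python
def soft_update(target_params, online_params, tau):
    """Return tau * online + (1 - tau) * target for each parameter pair."""
    return [
        tau * w + (1.0 - tau) * w_target
        for w, w_target in zip(online_params, target_params)
    ]


tau = 0.001
online = [1.0, 0.0]    # toy "current network" parameters
target = [0.0, 10.0]   # toy "target network" parameters

# A single update barely moves the target.
target_after_one = soft_update(target, online, tau)

# Repeating the update with fixed online parameters makes the target
# drift slowly toward them.
target_after_many = target
for _ in range(1000):
    target_after_many = soft_update(target_after_many, online, tau)

print(target_after_one)
print(target_after_many)
```

With tau this small, the target still carries a factor of 0.999 ** 1000 (roughly 37%) of its initial offset after 1000 updates, which is what keeps the bootstrapped targets stable during training.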
create_state_network
def create_state_network(self, scaledImage):
    # Convolve
    # kernel_size is the size of the convolution kernel; stride is the step size;
    # padding is the edge handling: 'VALID' computes no convolution where the kernel would overhang the edge.
    # With 'VALID' padding: out_height = ceil((in_height - filter_height + 1) / stride_height)
    # num_outputs is the number of output channels, equal to the number of filters.
    # filters: Integer, the dimensionality of the output space (i.e. the number of filters in the convolution).
    conv1 = slim.conv2d(
        inputs=scaledImage, num_outputs=32, kernel_size=[8, 8], stride=[4, 4],
        padding='VALID', biases_initializer=None)
    conv2 = slim.conv2d(
        inputs=conv1, num_outputs=64, kernel_size=[4, 4], stride=[2, 2],
        padding='VALID', biases_initializer=None)
    conv3 = slim.conv2d(
        inputs=conv2, num_outputs=64, kernel_size=[3, 3], stride=[1, 1],
        padding='VALID', biases_initializer=None)
    # Flatten and Feedforward
    flattened = tf.contrib.layers.flatten(conv3)
    net = tf.contrib.layers.fully_connected(
        inputs=flattened,
        num_outputs=self.h_size,
        activation_fn=tf.nn.relu)
    return net
target_q_model
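def target_q_model(self, input, weight_shape):
    weights1 = tf.get_variable(
        "target_q_weights1", weight_shape,
        initializer=tf.contrib.layers.xavier_initializer())
    bias1 = tf.get_variable(
        "target_q_bias1", weight_shape[1],
        initializer=tf.constant_initializer())
    return tf.nn.xw_plus_b(input, weights1, bias1)
c. Action Network
This network produces the policy action_probs, $\pi_{\omega,\theta}(a \mid s)$. The image first passes through the three convolutional layers and the fully connected layer of state_model, which outputs a feature vector of h_size = 512 units representing the observable state of the image.
A single fully connected layer then serves as the function approximator for $\pi_{\omega,\theta}(a \mid s)$. action_params holds the parameters of this fully connected layer.
state_out is the output of state_model, with shape [batch, h_size]; action_dim=6; option_dim=8; h_size=512
Tensor operations
Arithmetic operation    Description
tf.add(x, y)            Adds x and y element-wise
tf.subtract(x, y)       Subtracts y from x element-wise
tf.multiply(x, y)       Multiplies x and y element-wise
tf.divide(x, y)         Divides x by y element-wise
tf.mod(x, y)            Element-wise remainder of x divided by y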
self.action_input = tf.concat(
    [self.state_out, self.state_out, self.state_out, self.state_out,
     self.state_out, self.state_out, self.state_out, self.state_out], 1)
self.action_input = tf.reshape(
    self.action_input, shape=[-1, self.o_dim, 1, self.h_size])
oh = tf.reshape(self.options_onehot, shape=[-1, self.o_dim, 1])
self.action_input = tf.reshape(
    tf.reduce_sum(
        tf.squeeze(self.action_input, [2]) * oh, [1]),  # sum over the option axis
    shape=[-1, 1, self.h_size])
# tf.reduce_sum computes the sum of elements across the given dimensions of a tensor.
self.action_probs = tf.contrib.layers.fully_connected(
    inputs=self.action_input,
    num_outputs=self.a_dim,
    activation_fn=tf.nn.softmax)
self.action_probs = tf.squeeze(self.action_probs, [1])
self.action_params = tf.trainable_variables()[
    len(self.network_params) + len(self.target_network_params) +
    len(self.Q_params) + len(self.target_Q_params):]
d. Termination Network
termination_model uses a fully connected layer, tf.nn.xw_plus_b(input, weights1, bias1), as the function approximator for the termination function $\beta_{\omega,\vartheta}(s)$, stored in option_term_prob. The parameters of this network are collected in termination_params. Note that termination_params and action_params are both parameters of the option policy. next_option_term_prob denotes $\beta_{\omega,\vartheta}(s')$.
with tf.variable_scope("termination_probs") as term_scope:
    self.termination_probs = self.apply_termination_model(
        tf.stop_gradient(self.state_out))  # Block gradients from flowing back.
    term_scope.reuse_variables()
    self.next_termination_probs = self.apply_termination_model(
        tf.stop_gradient(self.next_state_out))
self.termination_params = tf.trainable_variables()[-2:]
self.option_term_prob = tf.reduce_sum(
    self.termination_probs * self.options_onehot, [1])
self.next_option_term_prob = tf.reduce_sum(
    self.next_termination_probs * self.options_onehot, [1])
self.reward = tf.placeholder(tf.float32, [None, 1], name="reward")
self.done = tf.placeholder(tf.float32, [None, 1], name="done")
# self.disc_option_term_prob = tf.placeholder(tf.float32, [None, 1])
disc_option_term_prob = tf.stop_gradient(self.next_option_term_prob)
termination_model