Two Exploration-Exploitation Balancing Strategies in Reinforcement Learning

Updated: 2023-06-20 19:33:40

ε-greedy method

UCB (Upper Confidence Bound) method

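Both methods address a classic problem in reinforcement learning, the exploration and exploitation tradeoff: should we spend effort exploring in order to estimate the rewards more accurately, or should we act on the information we already have and choose the action with the highest expected reward?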

Stated this way the problem may be hard to grasp, so here is a small example:

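Suppose you want to buy a book on Taobao. As soon as you type in the title, you see that the first link is priced at 10 yuan, the second at 9.9 yuan, and the third at 11 yuan. At this point you have two choices. One is to buy the 9.9 yuan copy right away, because it is the cheapest price you have seen so far. This is exploitation. In practice you would not do that, or at least most people would not: you would keep scrolling down the list, where you might find a price of 8 yuan. That price is clearly a better deal, but you have spent more time and effort to find it. This is exploration. More exploration may bring a higher reward but costs more effort, while exploiting directly saves effort but does not necessarily bring a higher reward, so exploration and exploitation have to be balanced.

ε-greedy and UCB are two commonly used strategies for the exploration and exploitation tradeoff.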

1. ε-greedy method

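$$a = \begin{cases} \arg\max_{a} q(a), & \text{with probability } 1-\varepsilon \\ \text{a random action}, & \text{with probability } \varepsilon \end{cases}$$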

With probability 1-ε, choose the action with the largest q value; with probability ε, choose a random action.
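The rule above can be checked empirically with a short sketch. The function name `epsilon_greedy`, the q values, ε = 0.1, and the sample size are all illustrative assumptions, not taken from the article. Because the random branch is assumed to be uniform over all actions (the greedy one included), the greedy action should be picked with frequency close to 1 - ε + ε/k for k actions.

```python
import numpy as np

def epsilon_greedy(q, epsilon, rng):
    """Return a 0-based action index chosen by the epsilon-greedy rule."""
    if rng.random() < epsilon:
        # Explore: uniform over all actions (the greedy one included).
        return int(rng.integers(len(q)))
    # Exploit: action with the largest estimated value.
    return int(np.argmax(q))

rng = np.random.default_rng(0)
q = np.array([0.2, 0.3, 0.6])   # illustrative q values (assumption)
epsilon = 0.1                   # illustrative exploration rate (assumption)

picks = [epsilon_greedy(q, epsilon, rng) for _ in range(20000)]
freq = np.bincount(picks, minlength=len(q)) / len(picks)

# Greedy action is hit by the exploit branch and by 1/k of the explore branch.
expected_greedy = 1 - epsilon + epsilon / len(q)
print("empirical frequencies:", freq)
print("expected greedy frequency:", expected_greedy)
```

The non-greedy actions together receive roughly ε(k-1)/k of the picks, which is the exploration budget the rest of this section discusses.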

Example:

                    a1            a2            a3
Current time step   q(a1) = 0.2   q(a2) = 0.3   q(a3) = 0.6
Next time step      q(a1) = 0.2   q(a2) = 1.5   q(a3) = 0.3

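Under a purely greedy policy we would choose the action with the largest q value, namely a3. But the q values change at the next time step, so the greedy policy ends with a final reward of only 0.6 + 0.3 = 0.9.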

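Under the ε-greedy method, however, the random branch (taken with probability ε) can select action a2 at the current time step, in which case the final reward is 0.3 + 1.5 = 1.8, clearly higher than what the greedy policy obtains.

Code implementation:

    # Method of the KB_Game class listed later; numpy is imported as np.
    def choose_action(self, policy, **kwargs):
        if np.random.random() < kwargs['epsilon']:
            # Explore: pick one of the actions 1, 2, 3 uniformly at random.
            action = np.random.randint(1, 4)
        else:
            # Exploit: pick the action with the largest q value (1-based).
            action = np.argmax(self.q) + 1
        return action

2. UCB method

After the ε-greedy method has converged, it still takes a random, possibly non-optimal action with probability ε, which easily wastes effort.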

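Written out to match the code in this article, with c standing for c_ratio and N(a) for the number of times action a has been selected, the UCB rule is:

$$a = \arg\max_{a} \left[ q(a) + c \sqrt{\frac{\ln n}{N(a)}} \right]$$

The formula has two parts. The first part is the current q value and can be understood as exploitation. The second part can be seen as an influence factor, where n is the total number of actions taken so far. This part shrinks as the number of times action a has been selected grows, so the rule leans toward exploring actions that have been executed less often; it can be understood as exploration.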
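To make the two parts concrete, here is a minimal sketch that evaluates q(a) + c·sqrt(ln n / N(a)), the same expression as `c_ratio*np.sqrt(np.log(...) / action_counts)` in this article's code. The helper name `ucb_scores` and the numbers are hypothetical, and the sketch assumes every action has already been tried at least once so that no count is zero.

```python
import numpy as np

def ucb_scores(q, action_counts, total_count, c_ratio):
    """Return (scores, bonus) for the UCB rule; all counts must be > 0."""
    q = np.asarray(q, dtype=float)
    action_counts = np.asarray(action_counts, dtype=float)
    # Exploration part: shrinks as an action is selected more often.
    bonus = c_ratio * np.sqrt(np.log(total_count) / action_counts)
    # Exploitation part (q) plus exploration part (bonus).
    return q + bonus, bonus

# Three actions with identical value estimates but very different counts
# (illustrative numbers, not from the article).
q = [0.5, 0.5, 0.5]
counts = [1, 10, 100]
scores, bonus = ucb_scores(q, counts, total_count=sum(counts), c_ratio=0.5)

print("bonus :", bonus)
print("scores:", scores)
print("selected action (0-based):", int(np.argmax(scores)))
```

With equal q values, the least-tried action gets the largest bonus and is selected, which is the exploration behaviour described above.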

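The ε-greedy method can also be given a decay factor, so that ε decays as the number of iterations increases and eventually reaches a very small value, at which point the ε-greedy method effectively becomes the greedy algorithm. This also lets ε-greedy reach roughly the same efficiency as UCB.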
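The article does not show how the decay is implemented, so the following is only one possible sketch. The exponential form, the function name `decayed_epsilon`, and the parameters `eps_start`, `eps_min`, and `decay` are all assumptions chosen for illustration.

```python
def decayed_epsilon(step, eps_start=1.0, eps_min=0.01, decay=0.995):
    """Exponentially decay epsilon with the step index, never below eps_min."""
    return max(eps_min, eps_start * decay ** step)

# Epsilon at a few points during training (step indices are illustrative).
steps = (0, 100, 500, 2000)
schedule = [decayed_epsilon(t) for t in steps]
for t, eps in zip(steps, schedule):
    print(f"step {t:5d}: epsilon = {eps:.4f}")
```

The value returned here would be passed as `epsilon` on each call to the selection routine instead of a constant. The floor `eps_min` keeps a small amount of exploration; setting it to 0 would make the policy purely greedy in the limit.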

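Code implementation:

    # Method of the KB_Game class listed later; numpy is imported as np.
    def choose_action(self, policy, **kwargs):
        c_ratio = kwargs['c_ratio']
        if 0 in self.action_counts:
            # Try every action once before applying the UCB formula.
            action = np.where(self.action_counts == 0)[0][0] + 1
        else:
            # q value (exploitation) plus confidence bonus (exploration).
            value = self.q + c_ratio * np.sqrt(np.log(self.counts) / self.action_counts)
            action = np.argmax(value) + 1
        return action

3. Summary

To compare the strengths and weaknesses of the two methods, both are run on the same multi-armed bandit example. The code is as follows: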

    import numpy as np
    import matplotlib.pyplot as plt


    class KB_Game:
        def __init__(self, *args, **kwargs):
            self.q = np.array([0.0, 0.0, 0.0])        # estimated value of each arm
            self.action_counts = np.array([0, 0, 0])  # times each arm was played
            self.current_cumulative_rewards = 0.0
            self.actions = [1, 2, 3]
            self.counts = 0                           # total number of plays
            self.counts_history = []
            self.cumulative_rewards_history = []
            self.a = 1
            self.reward = 0

        def step(self, a):
            # Each arm pays a normally distributed reward with standard deviation 1.
            r = 0
            if a == 1:
                r = np.random.normal(1, 1)
            if a == 2:
                r = np.random.normal(2, 1)
            if a == 3:
                r = np.random.normal(1.5, 1)
            return r

        def choose_action(self, policy, **kwargs):
            action = 0
            if policy == 'e_greedy':
                if np.random.random() < kwargs['epsilon']:
                    action = np.random.randint(1, 4)
                else:
                    action = np.argmax(self.q) + 1
            if policy == 'ucb':
                c_ratio = kwargs['c_ratio']
                if 0 in self.action_counts:
                    action = np.where(self.action_counts == 0)[0][0] + 1
                else:
                    value = self.q + c_ratio * np.sqrt(np.log(self.counts) / self.action_counts)
                    action = np.argmax(value) + 1
            if policy == 'boltzmann':
                tau = kwargs['temperature']
                p = np.exp(self.q / tau) / (np.sum(np.exp(self.q / tau)))
                action = np.random.choice([1, 2, 3], p=p.ravel())
            return action

        def train(self, play_total, policy, **kwargs):
            reward_1 = []
            reward_2 = []
            reward_3 = []
            for i in range(play_total):
                action = 0
                if policy == 'e_greedy':
                    action = self.choose_action(policy, epsilon=kwargs['epsilon'])
                if policy == 'ucb':
                    action = self.choose_action(policy, c_ratio=kwargs['c_ratio'])
                if policy == 'boltzmann':
                    action = self.choose_action(policy, temperature=kwargs['temperature'])
                self.a = action
                # print(self.a)
                # Interact with the environment once
                self.r = self.step(self.a)
                self.counts += 1
                # Update the value function (running average of rewards per arm)
                self.q[self.a-1] = (self.q[self.a-1] * self.action_counts[self.a-1] + self.r) / (self.action_counts[self.a-1] + 1)
                self.action_counts[self.a-1] += 1
                reward_1.append([self.q[0]])
                reward_2.append([self.q[1]])
                reward_3.append([self.q[2]])
                self.current_cumulative_rewards += self.r
                # self.cumulative_rewards_history.append(self.current_cumulative_rewards)
                self.cumulative_rewards_history.append(self.current_cumulative_rewards / self.counts)  # average reward
                self.counts_history.append(i)
                # self.action_history.append(self.a)
            # plt.figure(1)
            # plt.plot(self.counts_history, reward_1, 'r')
            # plt.plot(self.counts_history, reward_2, 'g')
            # plt.plot(self.counts_history, reward_3, 'b')
            # plt.draw()
            # plt.figure(2)
            # plt.plot(self.counts_history, self.cumulative_rewards_history, 'k')
            # plt.draw()
            # plt.show()

        def reset(self):
            self.q = np.array([0.0, 0.0, 0.0])
            self.action_counts = np.array([0, 0, 0])
            self.current_cumulative_rewards = 0.0
            self.counts = 0
            self.counts_history = []
            self.cumulative_rewards_history = []
            self.a = 1
            self.reward = 0

        def plot(self, colors, policy, style):
            # The plotted series is the running average reward stored by train.
            plt.figure(1)
            plt.plot(self.counts_history, self.cumulative_rewards_history, colors, label=policy, linestyle=style)
            plt.legend()
            plt.xlabel('n', fontsize=18)
            plt.ylabel('total rewards', fontsize=18)
            # plt.figure(2)
            # plt.plot(self.counts_history, self.action_history, colors, label=policy)
            # plt.legend()
            # plt.xlabel('n', fontsize=18)
            # plt.ylabel('action', fontsize=18)

    if __name__ == '__main__':
        np.random.seed(0)
        k_gamble = KB_Game()
        total = 2000
        k_gamble.train(play_total=total, policy='e_greedy', epsilon=0.05)
        k_gamble.plot(colors='r', policy='e_greedy', style='-.')
        k_gamble.reset()
        # k_gamble.train(play_total=total, policy='boltzmann', temperature=1)
        # k_gamble.plot(colors='b', policy='boltzmann', style='--')
        # k_gamble.reset()
        k_gamble.train(play_total=total, policy='ucb', c_ratio=0.5)
        k_gamble.plot(colors='g', policy='ucb', style='-')
        plt.show()
        # k_gamble.plot(colors='r', strategy='e_greedy')
        # k_gamble.reset()
        # k_gamble.train(steps=200, strategy='ucb', c_ratio=0.5)
        # k_gamble.plot(colors='g', strategy='ucb')
        # k_gamble.reset()
        # k_gamble.train(steps=200, strategy='boltzmann', a_ratio=0.1)
        # k_gamble.plot(colors='b', strategy='boltzmann')
        # plt.show()

The code produces the following result:

[Figure: reward curves for e_greedy and ucb plotted against the number of plays n]

The figure shows that the UCB algorithm reaches the maximum reward faster and is more stable. The cumulative reward of UCB is also higher than that of ε-greedy.
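The value update inside `train` keeps each q value equal to the sample mean of the rewards seen for that arm. The sketch below checks this on synthetic rewards and shows an algebraically equivalent incremental form; the helper names `update_mean` and `update_incremental` and the reward distribution are assumptions made for illustration.

```python
import numpy as np

def update_mean(q, n, r):
    """Update as written in train: q is the mean of the n rewards seen so far."""
    return (q * n + r) / (n + 1)

def update_incremental(q, n, r):
    """Equivalent form: move q toward the new reward r by a step of 1/(n+1)."""
    return q + (r - q) / (n + 1)

rng = np.random.default_rng(0)
rewards = rng.normal(2.0, 1.0, size=1000)  # synthetic rewards for one arm

q_a = q_b = 0.0
for n, r in enumerate(rewards):
    q_a = update_mean(q_a, n, r)
    q_b = update_incremental(q_b, n, r)

print("running average :", q_a)
print("incremental form:", q_b)
print("sample mean     :", rewards.mean())
```

Both forms track the sample mean, so the q values in the comparison above converge toward the true mean reward of each arm as that arm is played more often.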

Published: 2023-06-20 19:33:40

Article link: https://www.wtabcd.cn/fanwen/fan/90/151766.html
