CS229 Lecture notes
Andrew Ng
Part XIII
Reinforcement Learning and Control
We now begin our study of reinforcement learning and adaptive control.
In supervised learning, we saw algorithms that tried to make their outputs mimic the labels y given in the training set. In that setting, the labels gave an unambiguous “right answer” for each of the inputs x. In contrast, for many sequential decision making and control problems, it is very difficult to provide this type of explicit supervision to a learning algorithm. For example, if we have just built a four-legged robot and are trying to program it to walk, then initially we have no idea what the “correct” actions to take are to make it walk, and so do not know how to provide explicit supervision for a learning algorithm to try to mimic.
In the reinforcement learning framework, we will instead provide our algorithms only a reward function, which indicates to the learning agent when it is doing well, and when it is doing poorly. In the four-legged walking example, the reward function might give the robot positive rewards for moving forwards, and negative rewards for either moving backwards or falling over. It will then be the learning algorithm’s job to figure out how to choose actions over time so as to obtain large rewards.
Reinforcement learning has been successful in applications as diverse as autonomous helicopter flight, robot legged locomotion, cell-phone network routing, marketing strategy selection, factory control, and efficient web-page indexing. Our study of reinforcement learning will begin with a definition of the Markov decision process (MDP), which provides the formalism in which RL problems are usually posed.
1 Markov decision process
A Markov decision process is a tuple (S, A, {P_sa}, γ, R), where:
• S is a set of states. (For example, in autonomous helicopter flight, S might be the set of all possible positions and orientations of the helicopter.)
• A is a set of actions. (For example, the set of all possible directions in which you can push the helicopter’s control sticks.)
• P_sa are the state transition probabilities. For each state s ∈ S and action a ∈ A, P_sa is a distribution over the state space. We’ll say more about this later, but briefly, P_sa gives the distribution over what states we will transition to if we take action a in state s.
• γ ∈ [0, 1) is called the discount factor.
• R : S × A → ℝ is the reward function. (Rewards are sometimes also written as a function of the state s only, in which case we would have R : S → ℝ.)
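As a concrete illustration, the tuple (S, A, {P_sa}, γ, R) for a small finite MDP can be stored as plain arrays. The 3-state, 2-action numbers below are made-up values for illustration, not anything from the notes:

```python
import numpy as np

# A minimal sketch of the MDP tuple (S, A, {P_sa}, gamma, R) for a
# hypothetical 3-state, 2-action problem.
n_states, n_actions = 3, 2

# P[s, a] is the distribution P_sa over successor states: P[s, a, s']
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.8, 0.2, 0.0]   # action 0 in state 0 mostly stays put
P[0, 1] = [0.1, 0.9, 0.0]
P[1, 0] = [0.5, 0.5, 0.0]
P[1, 1] = [0.0, 0.2, 0.8]
P[2, 0] = [0.0, 0.0, 1.0]   # state 2 is absorbing under both actions
P[2, 1] = [0.0, 0.0, 1.0]

gamma = 0.9                    # discount factor in [0, 1)
R = np.array([0.0, 0.0, 1.0])  # state-only rewards R(s)

# Each row of P[s, a] must be a valid probability distribution.
assert np.allclose(P.sum(axis=2), 1.0)
```

Storing P_sa as a dense (|S|, |A|, |S|) array is only reasonable for small MDPs; for large state spaces a sparse or sampled representation would be used instead.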
The dynamics of an MDP proceed as follows: We start in some state s_0, and get to choose some action a_0 ∈ A to take in the MDP. As a result of our choice, the state of the MDP randomly transitions to some successor state s_1, drawn according to s_1 ∼ P_{s_0 a_0}. Then, we get to pick another action a_1. As a result of this action, the state transitions again, now to some s_2 ∼ P_{s_1 a_1}. We then pick a_2, and so on. Pictorially, we can represent this process as follows:
s_0 --a_0--> s_1 --a_1--> s_2 --a_2--> s_3 --a_3--> ...
Upon visiting the sequence of states s_0, s_1, ... with actions a_0, a_1, ..., our total payoff is given by

R(s_0, a_0) + γ R(s_1, a_1) + γ^2 R(s_2, a_2) + ···.
Or, when we are writing rewards as a function of the states only, this becomes

R(s_0) + γ R(s_1) + γ^2 R(s_2) + ···.
For most of our development, we will use the simpler state-rewards R(s), though the generalization to state-action rewards R(s, a) offers no special difficulties.
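For instance, the discounted sum above is straightforward to compute for any visited sequence of states; the reward values below are hypothetical:

```python
# A small sketch of the total payoff R(s0) + γ R(s1) + γ^2 R(s2) + ...
# for a made-up visited state sequence, using state-only rewards.
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 1.0]   # R(s0), R(s1), R(s2), R(s3): illustrative values

# Reward at timestep t is weighted by gamma^t.
payoff = sum((gamma ** t) * r for t, r in enumerate(rewards))
# payoff = 0 + 0 + 0.81*1 + 0.729*1 = 1.539
```

Note how the same reward is worth less the later it arrives, which is exactly why the agent prefers to accrue positive rewards early.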
Our goal in reinforcement learning is to choose actions over time so as to maximize the expected value of the total payoff:
E[R(s_0) + γ R(s_1) + γ^2 R(s_2) + ···].
Note that the reward at timestep t is discounted by a factor of γ^t. Thus, to make this expectation large, we would like to accrue positive rewards as soon as possible (and postpone negative rewards as long as possible). In economic applications where R(·) is the amount of money made, γ also has a natural interpretation in terms of the interest rate (where a dollar today is worth more than a dollar tomorrow).
A policy is any function π : S → A mapping from the states to the actions. We say that we are executing some policy π if, whenever we are in state s, we take action a = π(s). We also define the value function for a policy π according to
V^π(s) = E[R(s_0) + γ R(s_1) + γ^2 R(s_2) + ··· | s_0 = s, π].
V^π(s) is simply the expected sum of discounted rewards upon starting in state s, and taking actions according to π.¹
Given a fixed policy π, its value function V^π satisfies the Bellman equations:

V^π(s) = R(s) + γ Σ_{s'∈S} P_{sπ(s)}(s') V^π(s').
This says that the expected sum of discounted rewards V^π(s) for starting in s consists of two terms: First, the immediate reward R(s) that we get right away simply for starting in state s, and second, the expected sum of future discounted rewards. Examining the second term in more detail, we see that the summation term above can be rewritten E_{s'∼P_{sπ(s)}}[V^π(s')]. This is the expected sum of discounted rewards for starting in state s', where s' is distributed according to P_{sπ(s)}, which is the distribution over where we will end up after taking the first action π(s) in the MDP from state s. Thus, the second term above gives the expected sum of discounted rewards obtained after the first step in the MDP.
Bellman’s equations can be used to efficiently solve for V^π. Specifically, in a finite-state MDP (|S| < ∞), we can write down one such equation for V^π(s) for every state s. This gives us a set of |S| linear equations in |S| variables (the unknown V^π(s)’s, one for each state), which can be efficiently solved for the V^π(s)’s.
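Since the Bellman equations are linear in the unknowns V^π(s), one way to sketch this in code is to stack them in matrix form, V^π = R + γ P^π V^π with P^π[s, s'] = P_{sπ(s)}(s'), and solve (I − γ P^π) V^π = R directly. The toy transition matrix and rewards below are assumptions:

```python
import numpy as np

gamma = 0.9
# P_pi[s, s'] = P_{s,pi(s)}(s'): the transition matrix under a fixed
# policy pi (a made-up 3-state example; state 2 is absorbing).
P_pi = np.array([[0.8, 0.2, 0.0],
                 [0.0, 0.2, 0.8],
                 [0.0, 0.0, 1.0]])
R = np.array([0.0, 0.0, 1.0])

# Bellman equations in matrix form: V = R + gamma * P_pi @ V,
# i.e. (I - gamma * P_pi) V = R -- a set of |S| linear equations
# in |S| variables, solved in one call.
V = np.linalg.solve(np.eye(3) - gamma * P_pi, R)
```

The system always has a unique solution because γ < 1 makes (I − γ P^π) invertible. For the absorbing reward state here, V(2) = 1/(1 − γ) = 10, matching the geometric series 1 + γ + γ^2 + ···.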
¹This notation in which we condition on π isn’t technically correct because π isn’t a random variable, but this is quite standard in the literature.
We also define the optimal value function according to

V∗(s) = max_π V^π(s).    (1)
In other words, this is the best possible expected sum of discounted rewards that can be attained using any policy. There is also a version of Bellman’s equations for the optimal value function:

V∗(s) = R(s) + max_{a∈A} γ Σ_{s'∈S} P_{sa}(s') V∗(s').    (2)

The first term above is the immediate reward as before. The second term is the maximum over all actions a of the expected future sum of discounted rewards we’ll get after taking action a. You should make sure you understand this equation and see why it makes sense.
We also define a policy π∗ : S → A as follows:

π∗(s) = arg max_{a∈A} Σ_{s'∈S} P_{sa}(s') V∗(s').    (3)

Note that π∗(s) gives the action a that attains the maximum in the “max” in Equation (2).
It is a fact that for every state s and every policy π, we have

V∗(s) = V^{π∗}(s) ≥ V^π(s).

The first equality says that V^{π∗}, the value function for π∗, is equal to the optimal value function V∗ for every state s. Further, the inequality above says that π∗’s value is at least as large as the value of any other policy. In other words, π∗ as defined in Equation (3) is the optimal policy.
Note that π∗ has the interesting property that it is the optimal policy for all states s. Specifically, it is not the case that if we were starting in some state s then there’d be some optimal policy for that state, and if we were starting in some other state s' then there’d be some other policy that’s optimal for s'. Rather, the same policy π∗ attains the maximum in Equation (1) for all states s. This means that we can use the same policy π∗ no matter what the initial state of our MDP is.

2 Value iteration and policy iteration
We now describe two efficient algorithms for solving finite-state MDPs. For now, we will consider only MDPs with finite state and action spaces (|S| < ∞, |A| < ∞).
The first algorithm, value iteration, is as follows:
1. For each state s, initialize V(s) := 0.

2. Repeat until convergence {

For every state, update V(s) := R(s) + max_{a∈A} γ Σ_{s'} P_{sa}(s') V(s').

}
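The two steps above can be sketched as follows, using synchronous updates on a made-up 3-state, 2-action MDP (the transition probabilities and rewards are illustrative assumptions):

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
# P[s, a, s'] = P_sa(s'): hypothetical transition probabilities.
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.8, 0.2, 0.0]; P[0, 1] = [0.1, 0.9, 0.0]
P[1, 0] = [0.5, 0.5, 0.0]; P[1, 1] = [0.0, 0.2, 0.8]
P[2, 0] = [0.0, 0.0, 1.0]; P[2, 1] = [0.0, 0.0, 1.0]
R = np.array([0.0, 0.0, 1.0])

V = np.zeros(n_states)                 # 1. initialize V(s) := 0
for _ in range(1000):                  # 2. repeat until convergence
    # Q[s, a] = R(s) + gamma * sum_{s'} P_sa(s') V(s')
    Q = R[:, None] + gamma * (P @ V)   # P @ V has shape (n_states, n_actions)
    V_new = Q.max(axis=1)              # synchronous Bellman backup over all states
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

# Having found (approximately) V*, recover the greedy policy of Equation (3).
pi_star = Q.argmax(axis=1)
```

The synchronous update computes all the new values from the old V before overwriting it; an asynchronous variant would instead update V(s) in place while looping over states.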
This algorithm can be thought of as repeatedly trying to update the estimated value function using the Bellman equations (2).
There are two possible ways of performing the updates in the inner loop of the algorithm. In the first, we can first compute the new values for V(s) for every state s, and then overwrite all the old values with the new values. This is called a synchronous update. In this case, the algorithm can be viewed as implementing a “Bellman backup operator” that takes a current estimate of the value function, and maps it to a new estimate. (See homework problem for details.) Alternatively, we can also perform asynchronous updates. Here, we would loop over the states (in some order), updating the values one at a time.
Under either synchronous or asynchronous updates, it can be shown that value iteration will cause V to converge to V∗. Having found V∗, we can then use Equation (3) to find the optimal policy.
Apart from value iteration, there is a second standard algorithm for finding an optimal policy for an MDP. The policy iteration algorithm proceeds as follows:
1. Initialize π randomly.

2. Repeat until convergence {

(a) Let V := V^π.

(b) For each state s, let π(s) := arg max_{a∈A} Σ_{s'} P_{sa}(s') V(s').

}
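The loop above can be sketched on the same style of toy MDP: step (a) solves the |S| linear Bellman equations directly, and step (b) takes the policy that is greedy with respect to V. All numbers below are illustrative assumptions:

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
# P[s, a, s'] = P_sa(s'): hypothetical transition probabilities.
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.8, 0.2, 0.0]; P[0, 1] = [0.1, 0.9, 0.0]
P[1, 0] = [0.5, 0.5, 0.0]; P[1, 1] = [0.0, 0.2, 0.8]
P[2, 0] = [0.0, 0.0, 1.0]; P[2, 1] = [0.0, 0.0, 1.0]
R = np.array([0.0, 0.0, 1.0])

pi = np.zeros(n_states, dtype=int)       # 1. initialize pi (here: all-zeros)
while True:                              # 2. repeat until convergence
    # (a) V := V^pi, by solving the |S| linear Bellman equations.
    P_pi = P[np.arange(n_states), pi]    # P_pi[s, s'] = P_{s,pi(s)}(s')
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
    # (b) greedy improvement: pi(s) := argmax_a sum_{s'} P_sa(s') V(s').
    # (R(s) is the same for every action, so it doesn't affect the argmax.)
    pi_new = (P @ V).argmax(axis=1)
    if np.array_equal(pi_new, pi):       # policy stopped changing: converged
        break
    pi = pi_new
```

Each iteration evaluates the current policy exactly and then improves it greedily; since there are only finitely many policies and each step can only improve the value, the loop terminates after finitely many iterations.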
Thus, the inner loop repeatedly computes the value function for the current policy, and then updates the policy using the current value function. (The policy π found in step (b) is also called the policy that is greedy with respect to V.) Note that step (a) can be done by solving Bellman’s equations as described earlier, which in the case of a fixed policy is just a set of |S| linear equations in |S| variables.
After at most a finite number of iterations of this algorithm, V will converge to V∗, and π will converge to π∗.