CS229 Lecture notes
Andrew Ng
Part XIII
Reinforcement Learning and Control
We now begin our study of reinforcement learning and adaptive control.
In supervised learning, we saw algorithms that tried to make their outputs mimic the labels y given in the training set. In that setting, the labels gave an unambiguous “right answer” for each of the inputs x. In contrast, for many sequential decision making and control problems, it is very difficult to provide this type of explicit supervision to a learning algorithm. For example, if we have just built a four-legged robot and are trying to program it to walk, then initially we have no idea what the “correct” actions to take are to make it walk, and so do not know how to provide explicit supervision for a learning algorithm to try to mimic.
In the reinforcement learning framework, we will instead provide our algorithms only a reward function, which indicates to the learning agent when it is doing well, and when it is doing poorly. In the four-legged walking example, the reward function might give the robot positive rewards for moving forwards, and negative rewards for either moving backwards or falling over. It will then be the learning algorithm’s job to figure out how to choose actions over time so as to obtain large rewards.
Reinforcement learning has been successful in applications as diverse as autonomous helicopter flight, robot legged locomotion, cell-phone network routing, marketing strategy selection, factory control, and efficient web-page indexing. Our study of reinforcement learning will begin with a definition of the Markov decision process (MDP), which provides the formalism in which RL problems are usually posed.
1 Markov decision process
A Markov decision process is a tuple (S, A, {P_sa}, γ, R), where:
• S is a set of states. (For example, in autonomous helicopter flight, S might be the set of all possible positions and orientations of the helicopter.)
• A is a set of actions. (For example, the set of all possible directions in which you can push the helicopter’s control sticks.)
• P_sa are the state transition probabilities. For each state s ∈ S and action a ∈ A, P_sa is a distribution over the state space. We’ll say more about this later, but briefly, P_sa gives the distribution over what states we will transition to if we take action a in state s.
• γ ∈ [0, 1) is called the discount factor.
• R : S × A → ℝ is the reward function. (Rewards are sometimes also written as a function of the state s only, in which case we would have R : S → ℝ.)
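As a concrete illustration, the tuple (S, A, {P_sa}, γ, R) for a small finite MDP can be stored as plain arrays. The 3-state, 2-action numbers below are made-up values for illustration, not anything from the notes:

```python
import numpy as np

# A minimal sketch of the MDP tuple (S, A, {P_sa}, gamma, R) for a
# hypothetical 3-state, 2-action problem.
n_states, n_actions = 3, 2

# P[s, a] is the distribution P_sa over successor states: P[s, a, s']
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.8, 0.2, 0.0]   # action 0 in state 0 mostly stays put
P[0, 1] = [0.1, 0.9, 0.0]
P[1, 0] = [0.5, 0.5, 0.0]
P[1, 1] = [0.0, 0.2, 0.8]
P[2, 0] = [0.0, 0.0, 1.0]   # state 2 is absorbing under both actions
P[2, 1] = [0.0, 0.0, 1.0]

gamma = 0.9                    # discount factor in [0, 1)
R = np.array([0.0, 0.0, 1.0])  # state-only rewards R(s)

# Each row of P[s, a] must be a valid probability distribution.
assert np.allclose(P.sum(axis=2), 1.0)
```

Storing P_sa as a dense (|S|, |A|, |S|) array is only reasonable for small MDPs; for large state spaces a sparse or sampled representation would be used instead.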
The dynamics of an MDP proceed as follows: We start in some state s_0, and get to choose some action a_0 ∈ A to take in the MDP. As a result of our choice, the state of the MDP randomly transitions to some successor state s_1, drawn according to s_1 ∼ P_{s_0 a_0}. Then, we get to pick another action a_1. As a result of this action, the state transitions again, now to some s_2 ∼ P_{s_1 a_1}. We then pick a_2, and so on. Pictorially, we can represent this process as follows:
s_0 --a_0--> s_1 --a_1--> s_2 --a_2--> s_3 --a_3--> ...
Upon visiting the sequence of states s_0, s_1, ... with actions a_0, a_1, ..., our total payoff is given by

R(s_0, a_0) + γ R(s_1, a_1) + γ^2 R(s_2, a_2) + ···.
Or, when we are writing rewards as a function of the states only, this becomes

R(s_0) + γ R(s_1) + γ^2 R(s_2) + ···.
For most of our development, we will use the simpler state-rewards R(s), though the generalization to state-action rewards R(s, a) offers no special difficulties.
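For instance, the discounted sum above is straightforward to compute for any visited sequence of states; the reward values below are hypothetical:

```python
# A small sketch of the total payoff R(s0) + γ R(s1) + γ^2 R(s2) + ...
# for a made-up visited state sequence, using state-only rewards.
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 1.0]   # R(s0), R(s1), R(s2), R(s3): illustrative values

# Reward at timestep t is weighted by gamma^t.
payoff = sum((gamma ** t) * r for t, r in enumerate(rewards))
# payoff = 0 + 0 + 0.81*1 + 0.729*1 = 1.539
```

Note how the same reward is worth less the later it arrives, which is exactly why the agent prefers to accrue positive rewards early.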
Our goal in reinforcement learning is to choose actions over time so as to maximize the expected value of the total payoff:
E[R(s_0) + γ R(s_1) + γ^2 R(s_2) + ···].
Note that the reward at timestep t is discounted by a factor of γ^t. Thus, to make this expectation large, we would like to accrue positive rewards as soon as possible (and postpone negative rewards as long as possible). In economic applications where R(·) is the amount of money made, γ also has a natural interpretation in terms of the interest rate (where a dollar today is worth more than a dollar tomorrow).
A policy is any function π : S → A mapping from the states to the actions. We say that we are executing some policy π if, whenever we are in state s, we take action a = π(s). We also define the value function for a policy π according to
V^π(s) = E[R(s_0) + γ R(s_1) + γ^2 R(s_2) + ··· | s_0 = s, π].
V^π(s) is simply the expected sum of discounted rewards upon starting in state s, and taking actions according to π.¹
Given a fixed policy π, its value function V^π satisfies the Bellman equations:

V^π(s) = R(s) + γ Σ_{s'∈S} P_{sπ(s)}(s') V^π(s').
This says that the expected sum of discounted rewards V^π(s) for starting in s consists of two terms: First, the immediate reward R(s) that we get right away simply for starting in state s, and second, the expected sum of future discounted rewards. Examining the second term in more detail, we see that the summation term above can be rewritten E_{s'∼P_{sπ(s)}}[V^π(s')]. This is the expected sum of discounted rewards for starting in state s', where s' is distributed according to P_{sπ(s)}, which is the distribution over where we will end up after taking the first action π(s) in the MDP from state s. Thus, the second term above gives the expected sum of discounted rewards obtained after the first step in the MDP.
Bellman’s equations can be used to efficiently solve for V^π. Specifically, in a finite-state MDP (|S| < ∞), we can write down one such equation for V^π(s) for every state s. This gives us a set of |S| linear equations in |S| variables (the unknown V^π(s)’s, one for each state), which can be efficiently solved for the V^π(s)’s.
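Since the Bellman equations are linear in the unknowns V^π(s), one way to sketch this in code is to stack them in matrix form, V^π = R + γ P^π V^π with P^π[s, s'] = P_{sπ(s)}(s'), and solve (I − γ P^π) V^π = R directly. The toy transition matrix and rewards below are assumptions:

```python
import numpy as np

gamma = 0.9
# P_pi[s, s'] = P_{s,pi(s)}(s'): the transition matrix under a fixed
# policy pi (a made-up 3-state example; state 2 is absorbing).
P_pi = np.array([[0.8, 0.2, 0.0],
                 [0.0, 0.2, 0.8],
                 [0.0, 0.0, 1.0]])
R = np.array([0.0, 0.0, 1.0])

# Bellman equations in matrix form: V = R + gamma * P_pi @ V,
# i.e. (I - gamma * P_pi) V = R -- a set of |S| linear equations
# in |S| variables, solved in one call.
V = np.linalg.solve(np.eye(3) - gamma * P_pi, R)
```

The system always has a unique solution because γ < 1 makes (I − γ P^π) invertible. For the absorbing reward state here, V(2) = 1/(1 − γ) = 10, matching the geometric series 1 + γ + γ^2 + ···.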
¹This notation in which we condition on π isn’t technically correct because π isn’t a random variable, but this is quite standard in the literature.
We also define the optimal value function according to

V∗(s) = max_π V^π(s).    (1)
In other words, this is the best possible expected sum of discounted rewards that can be attained using any policy. There is also a version of Bellman’s equations for the optimal value function:

V∗(s) = R(s) + max_{a∈A} γ Σ_{s'∈S} P_{sa}(s') V∗(s').    (2)

The first term above is the immediate reward as before. The second term is the maximum over all actions a of the expected future sum of discounted rewards we’ll get after taking action a. You should make sure you understand this equation and see why it makes sense.
We also define a policy π∗ : S → A as follows:

π∗(s) = arg max_{a∈A} Σ_{s'∈S} P_{sa}(s') V∗(s').    (3)

Note that π∗(s) gives the action a that attains the maximum in the “max” in Equation (2).
It is a fact that for every state s and every policy π, we have

V∗(s) = V^{π∗}(s) ≥ V^π(s).

The first equality says that V^{π∗}, the value function for π∗, is equal to the optimal value function V∗ for every state s. Further, the inequality above says that π∗’s value is at least as large as the value of any other policy. In other words, π∗ as defined in Equation (3) is the optimal policy.
Note that π∗ has the interesting property that it is the optimal policy for all states s. Specifically, it is not the case that if we were starting in some state s then there’d be some optimal policy for that state, and if we were starting in some other state s' then there’d be some other policy that’s optimal for s'. Rather, the same policy π∗ attains the maximum in Equation (1) for all states s. This means that we can use the same policy π∗ no matter what the initial state of our MDP is.

2 Value iteration and policy iteration
We now describe two efficient algorithms for solving finite-state MDPs. For now, we will consider only MDPs with finite state and action spaces (|S| < ∞, |A| < ∞).
The first algorithm, value iteration, is as follows:
1. For each state s, initialize V(s) := 0.

2. Repeat until convergence {

For every state, update V(s) := R(s) + max_{a∈A} γ Σ_{s'} P_{sa}(s') V(s').

}
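The two steps above can be sketched as follows, using synchronous updates on a made-up 3-state, 2-action MDP (the transition probabilities and rewards are illustrative assumptions):

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
# P[s, a, s'] = P_sa(s'): hypothetical transition probabilities.
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.8, 0.2, 0.0]; P[0, 1] = [0.1, 0.9, 0.0]
P[1, 0] = [0.5, 0.5, 0.0]; P[1, 1] = [0.0, 0.2, 0.8]
P[2, 0] = [0.0, 0.0, 1.0]; P[2, 1] = [0.0, 0.0, 1.0]
R = np.array([0.0, 0.0, 1.0])

V = np.zeros(n_states)                 # 1. initialize V(s) := 0
for _ in range(1000):                  # 2. repeat until convergence
    # Q[s, a] = R(s) + gamma * sum_{s'} P_sa(s') V(s')
    Q = R[:, None] + gamma * (P @ V)   # P @ V has shape (n_states, n_actions)
    V_new = Q.max(axis=1)              # synchronous Bellman backup over all states
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

# Having found (approximately) V*, recover the greedy policy of Equation (3).
pi_star = Q.argmax(axis=1)
```

The synchronous update computes all the new values from the old V before overwriting it; an asynchronous variant would instead update V(s) in place while looping over states.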
This algorithm can be thought of as repeatedly trying to update the estimated value function using the Bellman equations (2).
There are two possible ways of performing the updates in the inner loop of the algorithm. In the first, we can first compute the new values for V(s) for every state s, and then overwrite all the old values with the new values. This is called a synchronous update. In this case, the algorithm can be viewed as implementing a “Bellman backup operator” that takes a current estimate of the value function, and maps it to a new estimate. (See homework problem for details.) Alternatively, we can also perform asynchronous updates. Here, we would loop over the states (in some order), updating the values one at a time.
Under either synchronous or asynchronous updates, it can be shown that value iteration will cause V to converge to V∗. Having found V∗, we can then use Equation (3) to find the optimal policy.
Apart from value iteration, there is a second standard algorithm for finding an optimal policy for an MDP. The policy iteration algorithm proceeds as follows:
1. Initialize π randomly.

2. Repeat until convergence {

(a) Let V := V^π.

(b) For each state s, let π(s) := arg max_{a∈A} Σ_{s'} P_{sa}(s') V(s').

}
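The loop above can be sketched on the same style of toy MDP: step (a) solves the |S| linear Bellman equations directly, and step (b) takes the policy that is greedy with respect to V. All numbers below are illustrative assumptions:

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
# P[s, a, s'] = P_sa(s'): hypothetical transition probabilities.
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.8, 0.2, 0.0]; P[0, 1] = [0.1, 0.9, 0.0]
P[1, 0] = [0.5, 0.5, 0.0]; P[1, 1] = [0.0, 0.2, 0.8]
P[2, 0] = [0.0, 0.0, 1.0]; P[2, 1] = [0.0, 0.0, 1.0]
R = np.array([0.0, 0.0, 1.0])

pi = np.zeros(n_states, dtype=int)       # 1. initialize pi (here: all-zeros)
while True:                              # 2. repeat until convergence
    # (a) V := V^pi, by solving the |S| linear Bellman equations.
    P_pi = P[np.arange(n_states), pi]    # P_pi[s, s'] = P_{s,pi(s)}(s')
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
    # (b) greedy improvement: pi(s) := argmax_a sum_{s'} P_sa(s') V(s').
    # (R(s) is the same for every action, so it doesn't affect the argmax.)
    pi_new = (P @ V).argmax(axis=1)
    if np.array_equal(pi_new, pi):       # policy stopped changing: converged
        break
    pi = pi_new
```

Each iteration evaluates the current policy exactly and then improves it greedily; since there are only finitely many policies and each step can only improve the value, the loop terminates after finitely many iterations.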
Thus, the inner loop repeatedly computes the value function for the current policy, and then updates the policy using the current value function. (The policy π found in step (b) is also called the policy that is greedy with respect to V.) Note that step (a) can be done by solving Bellman’s equations as described earlier, which in the case of a fixed policy is just a set of |S| linear equations in |S| variables.
After at most a finite number of iterations of this algorithm, V will converge to V∗, and π will converge to π∗.