Bipedal Walking on Rough Terrain Using Manifold Control
Tom Erez and William D. Smart
Media and Machines Lab, Department of Computer Science and Engineering
Washington University in St. Louis, MO
{etom, wds}@cse.wustl.edu
Abstract—This paper presents an algorithm for adapting periodic behavior to gradual shifts in task parameters. Since learning optimal control in high-dimensional domains is subject to the 'curse of dimensionality', we parametrize the policy only along the limit cycle traversed by the gait, and thus focus the computational effort on a closed one-dimensional manifold, embedded in the high-dimensional state space. We take an initial gait as a departure point, and iterate between modifying the task slightly, and adapting the gait to this modification. This creates a sequence of gaits, each optimized for a different variant of the task. Since every two gaits in this sequence are very similar, the whole sequence spans a two-dimensional manifold, and combining all policies in this 2-manifold provides additional robustness to the system. We demonstrate our approach on two simulations of bipedal robots: the compass gait walker, which is a four-dimensional system, and RABBIT, which is ten-dimensional. The walkers' gaits are adapted to a sequence of changes in the ground slope, and when all policies in the sequence are combined, the walkers can safely traverse a rough terrain, where the incline changes at every step.
I. INTRODUCTION
This paper deals with the general task of augmenting the capacities of legged robots by using reinforcement learning¹. The standard paradigm in control theory, whereby an optimized reference trajectory is found first, and then a stabilizing controller is designed, can become laborious when a whole range of task variants is considered. Standard algorithms of reinforcement learning cannot yet offer compelling alternatives to the control theory paradigm, mostly because of the prohibitive effect of the curse of dimensionality. Legged robots often constitute a high-dimensional system, and standard reinforcement learning methods, with their focus on Markov Decision Process models, usually cannot overcome the exponential growth in state space volume. Most previous work in machine learning for gait domains required either an exhaustive study of the state space [1], or the use of non-specific optimization techniques, such as genetic algorithms [2]. In this paper, we wish to take a first step towards efficient reinforcement learning in high-dimensional domains by focusing on periodic tasks.

¹The interested reader is encouraged to follow the links mentioned in the footnotes to section IV to see movies of our simulations.
We make the observation that while legged robots have a high-dimensional state space, not every point in the state space represents a viable pose. By definition, a proper gait would always converge to a stable limit cycle, which traces a closed one-dimensional manifold embedded in the high-dimensional state space. This is true for any system performing a periodic task, regardless of the size of its state space (see also [3], section 3.1, and [4], figure 19, for a validation of this point in the model discussed below). This observation holds a great promise: a controller that can keep the system close to one particular limit cycle despite minor perturbations (i.e., one whose limit cycle has a non-trivial basin of attraction) is free to safely ignore the entire volume of the state space.
Finding such a stable controller is far from trivial, and amounts to creating a stable gait. However, for our purpose, such a controller can be suboptimal, and may be supplied by a human tele-operating the system, by leveraging the passive dynamic properties of the system (as in section IV-A), or by applying control theory tools (as in section IV-B). In all cases, the one-dimensional manifold traced by the gait of a stable controller can be identified in one cycle of the gait, simply by recording the state of the system at every time step. Furthermore, by querying the controller, we can identify the local policy on and around that manifold. With these two provided, we can create a local representation of the policy which generated the gait by approximating the policy only on and around that manifold, like a ring embedded in state space, and this holds true regardless of the dimensionality of the state space. By representing the original control function in a compact way we may focus our computational effort on the relevant manifold alone, and circumvent the curse of dimensionality, as such a parametrization does not scale exponentially with the number of dimensions. This opens a door for an array of reinforcement learning methods (such as policy gradient) which may be used to adapt the initial controller to different conditions, and thus augment the capacities of the system.
In this article we report two experiments. The first studies the compass-gait walker ([9], [10], [11]), a system known for its capacity to walk without actuation on a small range of downward slopes. The second experiment uses a simulation of the robot RABBIT [3], a biped robot with knees and a torso, but no feet, which has been studied before by the control theory community [5], [4], [6]. The first model has a four-dimensional state space, and the second model has 10 state dimensions and 4 action dimensions. By composing together several controllers, each adapted to a different incline, we are able to create a composite controller that can stably traverse a rough terrain going downhill. The same algorithm was successfully applied to the second system too, although the size of that problem would be prohibitive for most reinforcement learning algorithms.
In the following we first give a short review of previous work in machine learning, and then explain the technical aspects of constructing a manifold controller, as well as the learning algorithms used. We then demonstrate the effectiveness of Manifold Control by showing how it is used to augment the capacities of existing controllers in two different models of bipedal walking. We conclude by discussing the potential of our approach, and offer directions for future work.
II. PREVIOUS WORK
The general field of gait design has been at the focus of mechanical engineering for many years, and recent years saw an increase in the contributions from the domain of machine learning. For example, Stilman et al. [7] studied an eight-dimensional system of a biped robot with knees, similar to the one studied below. They showed that in their particular case the dimensionality can be reduced through some specific approximations during different phases. Then, they partitioned the entire volume of the reduced state space into a grid, and performed Q-learning using a simulation model of the system's dynamics. The result was a robot walker that can handle a range of slopes around the level horizontal plane.
In addition, there is a growing interest in recent years in gaits that can effectively take advantage of the passive dynamics (see the review by Collins et al. [8] for a thorough coverage). Tedrake [9] discusses several versions of the compass gait walker which were built and analyzed. Controllers for the compass gait based on an analytical treatment of the system equations were first suggested by Goswami et al. [10], who used both hip and ankle actuation. Further results were described by Spong and Bhatia [11], where the case of uneven terrain was also discussed. Ramamoorthy and Kuipers [12] suggested hybrid control of walking over irregular terrain by seeking inspiration from human walking.
III. MANIFOLD CONTROL
A. The Basic Control Scheme
The basic idea in manifold control is to focus the computational effort on the limit cycle. Therefore, the policy is approximated using locally activated processing elements (PEs), positioned along the manifold spanned by the limit cycle. Each PE defines a local policy, linear in the position of the state relative to that PE. When the policy is queried with a given state x, the local policy of each PE is calculated as:
µ_i(x) = [1  (x − c_i)^T M^T] G_i,   (1)

where c_i is the location of element i, M is a diagonal matrix which determines the scale of each dimension, and G_i is an (n+1)-by-m matrix, where m is the action dimension and n is the number of state space dimensions. G_i is made of m columns, one for each action dimension, and each column is an (n+1)-sized gain vector. The final policy u(x) is calculated by mixing the local policies of each PE according to a normalized Gaussian activation function, using σ as a scalar bandwidth term:

w_i = exp(−(x − c_i)^T M^T M (x − c_i) / σ),   (2)

u(x) = Σ_{i=1}^{n} w_i µ_i(x) / Σ_{i=1}^{n} w_i.   (3)

Fig. 1. Illustrations of the models used. On the left, the compass-gait walker: the system's state is defined by the two legs' angles from the vertical direction and the associated angular velocities, for a total of four dimensions. This figure also depicts the incline of the sloped ground. On the right, RABBIT: the system's state is defined by the angle of the torso from the vertical direction, the angles between the thighs and the torso, and the knee angles between the shank and the thigh. This model of RABBIT has ten state dimensions, where at every moment one leg is fixed to the ground, and the other leg is free to swing.
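The following is a minimal sketch of how a policy of this form can be evaluated; it is not the authors' implementation, and the class name, array layouts, and the small epsilon guard are illustrative choices.

```python
import numpy as np

class ManifoldController:
    """Policy that mixes locally linear controllers (eqs. 1-3).

    centers : (N, n) array   -- PE locations c_i along the limit cycle
    gains   : (N, n+1, m)    -- per-PE gain matrices G_i
    M       : (n,) array     -- diagonal of the scaling matrix M
    sigma   : float          -- scalar bandwidth of the activation kernel
    """

    def __init__(self, centers, gains, M, sigma):
        self.centers = np.asarray(centers, dtype=float)
        self.gains = np.asarray(gains, dtype=float)
        self.M = np.asarray(M, dtype=float)
        self.sigma = float(sigma)

    def __call__(self, x):
        diff = np.asarray(x, dtype=float) - self.centers           # (N, n)
        scaled = diff * self.M                                     # apply diagonal M
        w = np.exp(-np.sum(scaled * scaled, axis=1) / self.sigma)  # activations, eq. (2)
        w = w / (w.sum() + 1e-12)                                  # normalize
        feats = np.hstack([np.ones((len(diff), 1)), scaled])       # [1, (x - c_i)^T M^T]
        mu = np.einsum('ij,ijk->ik', feats, self.gains)            # local policies, eq. (1)
        return w @ mu                                              # mixed action, eq. (3)
```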
traverse a path of higher value (collect more rewards, or less cost) along its modified limit cycle.
1) Defining the Value Function: In the present work we consider a standard nondiscounted reinforcement learning formulation with a finite time horizon and no terminal costs. More accurately, we define the value V^π(x_0) of a given state x_0 under a fixed policy π(x) as:
V^π(x_0) = ∫_0^T r(x_t, π(x_t)) dt,   (4)

where r(x, u) is the reward determined by the current state and the selected action, T is the time horizon, and x_t is the solution of the time-invariant ordinary differential equation ẋ = f(x, π(x)) with the initial condition x = x_0, so that

x_t = x_0 + ∫_0^t f(x_τ, π(x_τ)) dτ.   (5)

2) Approximating the Policy Gradient: With C being the locations of the processing elements, and G being the set of their local gains, we make use of a method, due to [14], of piecewise estimation of the gradient of the value function at a given initial state x_0 with respect to the parameter set G. As Munos showed in [14], from (4) we can write
∂V/∂G = ∫_0^T (∂r/∂G) dt,   (6)

and for the general form r = r(x, u) we can decompose ∂r/∂G as

∂r/∂G = (∂r/∂u) (∂u/∂G + (∂u/∂x)(∂x/∂G)).
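As an illustration of equations (4)-(6), the sketch below estimates the finite-horizon value by forward-Euler integration and approximates ∂V/∂G by simple forward differences. This is a hedged stand-in for, not a reproduction of, the piecewise gradient estimation of Munos [14]; the helper signatures (f, policy, r, make_policy) are assumptions.

```python
import numpy as np

def rollout_value(f, policy, r, x0, T, dt=1e-3):
    """V(x0) = integral_0^T r(x_t, pi(x_t)) dt, approximated by forward Euler (eqs. 4-5)."""
    x, V = np.array(x0, dtype=float), 0.0
    for _ in range(int(T / dt)):
        u = policy(x)
        V += r(x, u) * dt
        x = x + np.asarray(f(x, u)) * dt
    return V

def value_gradient_fd(f, make_policy, r, x0, T, G, eps=1e-4):
    """Forward-difference approximation of dV/dG, one entry of G at a time."""
    base = rollout_value(f, make_policy(G), r, x0, T)
    grad = np.zeros_like(G)
    it = np.nditer(G, flags=['multi_index'])
    for _ in it:
        G_pert = G.copy()
        G_pert[it.multi_index] += eps
        grad[it.multi_index] = (rollout_value(f, make_policy(G_pert), r, x0, T) - base) / eps
    return grad
```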
Fig. 2. The path used to test the compass gait walker, and an overlay of the walker traversing this path. Note how the step length adapts to the changing incline.
the forward swing, but will undergo an inelastic collision with the floor during the backward swing. At this point it will become the stance leg, and the other leg will be set free to swing. The entire system is placed on a plane that is inclined at a variable angle φ from the horizon.
In the interests of brevity, we omit a complete description of the system dynamics here, referring the interested reader to the relevant literature [15], [10], [11]. Although previous work considered actuation both at the hip and the ankle, we chose to study the case of actuation at the hip only.
The learning phase in this system was done using simple stochastic gradient ascent, rather than the elaborate policy gradient estimation described in section III-B.2. The initial manifold was sampled at an incline of φ = −0.05 (the initial policy is the zero policy, so there were no approximation errors involved). One shaping iteration consisted of the following: first, G was modified to G_tent = G + ηδG, with η = 0.1 and δG drawn at random from a multinormal distribution with unit covariance. The new policy's performance was measured as the sum of all the rewards along 20 steps. If the value of this new policy was bigger than the present one, it was adopted, otherwise it was rejected. Then, a new δG was drawn, and the process repeated itself. After 3 successful adoptions, the shaping iteration step concluded with a resampling of the new controller, and the incline was decreased by 0.01.
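A minimal sketch of one such shaping iteration is shown below, assuming hypothetical simulator-side helpers evaluate_gait (sum of rewards over 20 steps) and resample_manifold; it mirrors the accept-if-better perturbation scheme just described.

```python
import numpy as np

def shaping_iteration(G, incline, evaluate_gait, resample_manifold,
                      eta=0.1, n_adoptions=3, rng=None):
    """One compass-gait shaping iteration: keep random perturbations of the gains
    only if the 20-step return improves; after 3 adoptions, resample the manifold
    and lower the incline by 0.01.  evaluate_gait and resample_manifold are
    hypothetical simulator-side helpers."""
    rng = np.random.default_rng() if rng is None else rng
    best_value = evaluate_gait(G, incline)
    adopted = 0
    while adopted < n_adoptions:
        delta = rng.standard_normal(G.shape)      # unit-covariance perturbation
        G_tent = G + eta * delta
        value = evaluate_gait(G_tent, incline)
        if value > best_value:                    # adopt only improvements
            G, best_value = G_tent, value
            adopted += 1
    G = resample_manifold(G, incline)             # re-place PEs on the new limit cycle
    return G, incline - 0.01                      # next task variant: slightly steeper downhill
```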
After 10 shaping iteration steps, we had controllers that could handle inclines up to φ = −0.14. After another 10 iteration steps with the incline increasing by 0.005, we had controllers that could handle inclines up to φ = 0.0025 (a slight uphill). This range is approximately double the limit of the passive walker [15].
Finally, we combined the various controllers into one composite controller. This new controller used 1500 charts to span a two-dimensional manifold embedded in the four-dimensional state space. The performance of the composite controller was tested on an uneven terrain where the incline was gradually changed from φ = 0 to φ = 0.15 radians, made of "tiles" of variable length whose inclines were 0.01 radians apart.
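The text does not spell out the composition mechanism beyond combining charts; one plausible realization, sketched below using the ManifoldController sketch from Section III, is simply to pool the processing elements of the per-incline rings into a single controller of the same form.

```python
import numpy as np

def compose_controllers(controllers):
    """Pool the processing elements of several ring controllers (one per incline)
    into a single composite controller of the same form.  M and sigma are taken
    from the first ring and assumed to be shared by all of them."""
    centers = np.vstack([c.centers for c in controllers])
    gains = np.concatenate([c.gains for c in controllers], axis=0)
    return ManifoldController(centers, gains, controllers[0].M, controllers[0].sigma)
```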
Figure 2 shows an overlay of the walker's downhill path. A movie of this march is available online.²
B. The Uphill-Walking RABBIT Robot
We applied manifold control also to simulations of the legged robot RABBIT, using code from Prof. Jessy Grizzle that is freely available online [16]. RABBIT is a biped robot with a torso, two knees and no feet (see figure 1b), and is actuated at four places: both hip joints (where the thighs are actuated against the torso), and both knees (where the shanks are actuated against the thighs). The simulation assumes a stance leg with no slippage, and a swing leg that is free to move in all directions until it collides inelastically with the floor, and becomes the stance leg, freeing the other leg to swing. This robot too is modeled as a nonlinear system with impulse effects. Again, we are forced to omit a complete reconstruction of the model's details, and refer the interested reader to [4], equation 8.
This model was studied extensively by the control theory community. In particular, an optimal desired signal was derived in [6], and a controller that successfully realizes this signal was presented in [4]. However, all these efforts were focused on RABBIT walking on even terrain. We sought a way to augment the capacities of the RABBIT model, and allow it to traverse a rough, uneven terrain. We found that the controller suggested by [4] can easily handle negative (downhill) inclines of 0.2 radians and more, but cannot handle positive (uphill) inclines.³
Learning started by approximating the policy from [4] as a manifold controller, using 400 processing elements with a mean distance of about 0.03 state space length units. The performance of the manifold controller was indistinguishable to the naked eye from the original controller, and performance, as measured by the performance criterion C3 in [6] (the same used by [4]), was only 1% worse, probably due to minor approximation errors.
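One way such an approximation could be constructed (a sketch under assumptions, not the authors' exact procedure) is to record states along the limit cycle, place processing elements on them, and fit each G_i by least squares to the original controller's actions in a small neighborhood of c_i:

```python
import numpy as np

def fit_manifold_controller(policy, cycle_states, M, sigma,
                            n_pe=400, n_samples=50, noise=0.01, rng=None):
    """Approximate an existing `policy` along its recorded limit cycle.

    cycle_states : (T, n) array of states from one gait cycle.
    Returns centers (n_pe, n) and gains (n_pe, n+1, m), each G_i fitted by least
    squares to the original policy's actions around the corresponding center."""
    rng = np.random.default_rng() if rng is None else rng
    cycle_states = np.asarray(cycle_states, dtype=float)
    idx = np.linspace(0, len(cycle_states) - 1, n_pe).astype(int)
    centers = cycle_states[idx]
    n = centers.shape[1]
    m = np.atleast_1d(policy(centers[0])).size
    gains = np.zeros((n_pe, n + 1, m))
    for i, c in enumerate(centers):
        X = c + noise * rng.standard_normal((n_samples, n))       # states near the ring
        U = np.array([np.atleast_1d(policy(x)) for x in X])       # original actions
        feats = np.hstack([np.ones((n_samples, 1)), (X - c) * np.asarray(M)])
        gains[i], *_ = np.linalg.lstsq(feats, U, rcond=None)      # local linear fit
    return centers, gains
```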
The policy gradient was estimated using (6), according to a simple reward model:

r(x, u) = 10 v^x_hip − (nonlinear action penalty),

where v^x_hip is the velocity of the hip joint (where the thigh and the torso meet) in the positive direction of the X-axis, and u_max is a scalar parameter (in our case, chosen to be 120) that tunes the nonlinear action penalty and promotes energetic efficiency.

Fig. 3. The rough terrain traversed by RABBIT. Since this model has knees, it can walk both uphill and downhill. Note how the step length adapts to the changing terrain. The movie of this parade can be seen at tt /2b8sdm, which is a shortcut to the YouTube website.
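Since the exact penalty term is not reproduced above, the snippet below is only a hypothetical stand-in that keeps the stated structure (forward hip velocity minus an action penalty tuned by u_max), using a quadratic penalty for illustration:

```python
import numpy as np

def rabbit_reward(v_hip_x, u, u_max=120.0):
    """Reward of the stated form: forward hip velocity minus an action penalty.
    The quadratic penalty used here is a hypothetical stand-in; the paper's exact
    nonlinear penalty (tuned by u_max) is not reproduced in this text."""
    u = np.asarray(u, dtype=float)
    return 10.0 * v_hip_x - float(np.sum((u / u_max) ** 2))
```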
After the initial manifold controller was created, the system followed a fully automated shaping protocol for 20 iterations: at every iteration, ∂V/∂G was estimated, and η was fixed to 0.1% of |G|. This small learning rate ensured that we do not modify the policy too much and lose stability. The modified policy, assumed to be slightly better, was then tested on a slightly bigger incline (the very first manifold controller was tried on an incline of 0 rad., and in every iteration we increased the incline by 0.003 rad.). This small modification to the model parameters ensured that the controller could still walk stably on the new incline. If stability was not lost (as was the case in all our iterations), we resampled u(·; C, G_new) so that the adjusted C overlapped the limit cycle of the modified system (with the new policy and new incline), and the whole process repeated. This procedure allowed a gradual increase in the system's maximal stable incline.
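A sketch of this automated shaping loop is given below, with hypothetical helpers estimate_dVdG, test_walk, and resample_manifold standing in for the simulator and gradient-estimation code; normalizing the gradient so that each step is about 0.1% of |G| is an assumption, not a detail from the text.

```python
import numpy as np

def automated_shaping(C, G, estimate_dVdG, test_walk, resample_manifold,
                      n_iter=20, incline0=0.0, d_incline=0.003):
    """Automated shaping protocol as described above.  estimate_dVdG, test_walk
    and resample_manifold are hypothetical simulator-side helpers."""
    incline = incline0
    for _ in range(n_iter):
        grad = estimate_dVdG(C, G, incline)               # policy gradient, eq. (6)
        eta = 0.001 * np.linalg.norm(G)                   # learning rate: 0.1% of |G|
        G = G + eta * grad / (np.linalg.norm(grad) + 1e-12)
        incline += d_incline                              # slightly steeper uphill
        if not test_walk(C, G, incline):                  # stability check
            break                                         # (stability was never lost in the reported runs)
        C, G = resample_manifold(C, G, incline)           # re-place PEs on the new limit cycle
    return C, G, incline
```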
Figure 4 depicts the evolution of the stability margins of every ring along the shaping iterations: for every iteration we present an upper (and lower) bound on the incline for which the controller can maintain stability. This was tested by setting a test incline, and allowing the system to run for 10 seconds. If no collapse happened by this time, the test incline was raised (lowered), until an incline was found for which the system could no longer maintain stability. As this picture shows, our automated shaping protocol does not maintain a tight control on the stability margins: for most iterations, a modest improvement is recorded. The system's nonlinearity is well illustrated by the curious case of iteration 9, where the same magnitude of δG causes a massive improvement, despite the fact that the control manifold itself did not change dramatically (see figure 5). The converse is also true for some iterations (such as 17 and 18): there is a decrease in the stability margins, but this does not harm the overall effectiveness, since these iterations use training data obtained at an incline that is very far from the stability margin. Finally, three iterations were composed together, and the resulting controller successfully traversed a rough terrain that included inclines from −0.05 to 0.15 radians. Figure 3 shows an overlay image of the rough path.

Fig. 4. This figure shows the inclines for which each iteration could maintain a stable gait on the RABBIT model. The diagonal line shows the incline for which each iteration was trained. Iteration 0 is the original controller. The initial manifold control approximation degrades most of the stability margin of the original control, but this is quickly regained through adaptation. Note that both the learning rate and the incline change rate were held constant through the entire process. The big jump in iteration 9 exemplifies the nonlinearity of the system, as small changes may have unpredictable results, in this case, for the best.
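The margin test described above can be sketched as a simple search over test inclines (shown here for the upper bound only); test_walk and the 0.005 search step are assumptions:

```python
def stability_margin(test_walk, incline0, step=0.005, max_incline=0.5, duration=10.0):
    """Largest test incline the controller survives for `duration` seconds.
    test_walk(incline, duration) is a hypothetical helper returning False on collapse;
    the 0.005 search step and the 0.5 cap are assumptions.  The lower bound is found
    the same way with a negative step."""
    incline = incline0
    while incline + step < max_incline and test_walk(incline + step, duration):
        incline += step              # raise the test incline until a collapse is found
    return incline
```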
V. CONCLUSION AND FUTURE WORK
In this paper we present a compact representation of the policy for periodic tasks, and apply a trajectory-based policy gradient algorithm to it. Most importantly, the methods we present do not scale exponentially with the number of dimensions, and hence allow us to circumvent the curse of dimensionality in the particular case of periodic tasks. By following a gradual shaping process, we are able to create robust controllers that augment the capacities of existing systems in a consistent way.
Fig. 5. A projection of the manifold at several stages of the shaping process for the RABBIT model. The top row shows the angle and angular velocity between the torso and the stance thigh, and the bottom row shows the angle and angular velocity of the knee of the swing leg. Every two consecutive iterations are only slightly different from each other. Throughout the entire shaping process, changes accumulate, and new solutions emerge.
Manifold control may also be used when the initial controller is profoundly suboptimal⁴. It is also important to note that the rough terrain was traversed without informing the walker of the current terrain. We may say that the walkers walked blindly on their rough path. This demonstrates how stable a composite manifold controller can be. However, in some practical applications it could be beneficial to represent this important piece of information explicitly, and select the most appropriate ring at every step.

⁴The interested reader is welcome to see other results of manifold learning on a 14-dimensional system: /2h3qny /2462j7.
We believe that the combination of local learning and careful shaping holds a great promise for many applications of periodic tasks, and hope to demonstrate it through future work on even higher-dimensional systems. Future research directions could include methods that allow second-order convergence, and learning a model of the plant.
REFERENCES
[1] M. Stilman, C. G. Atkeson, J. J. Kuffner, and G. Zeglin, "Dynamic programming in reduced dimensional spaces: Dynamic planning for robust biped locomotion," in Proceedings of the 2005 IEEE International Conference on Robotics and Automation (ICRA 2005), 2005, pp. 2399–2404.
[2] J. Buchli, F. Iida, and A. Ijspeert, "Finding resonance: Adaptive frequency oscillators for dynamic legged locomotion," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2006, pp. 3903–3909.
[3] C. Chevallereau and P. Sardain, "Design and actuation optimization of a 4-axes biped robot for walking and running," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2000.
[4] F. Plestan, J. W. Grizzle, E. Westervelt, and G. Abba, "Stable walking of a 7-dof biped robot," IEEE Trans. Robot. Automat., vol. 19, no. 4, pp. 653–668, Aug. 2003.
[5] C. Sabourin, O. Bruneau, and G. Buche, "Experimental validation of a robust control strategy for the robot RABBIT," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2005.
[6] C. Chevallereau and Y. Aoustin, "Optimal reference trajectories for walking and running of a biped robot," Robotica, vol. 19, no. 5, pp. 557–569, 2001.
[7] M. Stilman, C. G. Atkeson, J. J. Kuffner, and G. Zeglin, "Dynamic programming in reduced dimensional spaces: Dynamic planning for robust biped locomotion," in Proceedings of the 2005 IEEE International Conference on Robotics and Automation (ICRA 2005), 2005, pp. 2399–2404.
[8] S. H. Collins, A. Ruina, R. Tedrake, and M. Wisse, "Efficient bipedal robots based on passive-dynamic walkers," Science, vol. 307, pp. 1082–1085, February 2005.
[9] R. L. Tedrake, "Applied optimal control for dynamically stable legged locomotion," Ph.D. dissertation, Massachusetts Institute of Technology, August 2004.
[10] A. Goswami, B. Espiau, and A. Keramane, "Limit cycles in a passive compass gait biped and passivity-mimicking control laws," Autonomous Robots, vol. 4, no. 3, pp. 273–286, 1997.
[11] M. W. Spong and G. Bhatia, "Further results on the control of the compass gait biped," in Proceedings of the 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003), vol. 2, 2003, pp. 1933–1938.
[12] S. Ramamoorthy and B. Kuipers, "Qualitative hybrid control of dynamic bipedal walking," in Robotics: Science and Systems II, G. S. Sukhatme, S. Schaal, W. Burgard, and D. Fox, Eds. MIT Press, 2007.
[13] S. Schaal and C. Atkeson, "Constructive incremental learning from only local information," Neural Computation, no. 8, pp. 2047–2084, 1998.
[14] R. Munos, "Policy gradient in continuous time," Journal of Machine Learning Research, vol. 7, pp. 771–791, 2006.
[15] A. Goswami, B. Thuilot, and B. Espiau, "Compass-like biped robot part I: Stability and bifurcation of passive gaits," INRIA, Tech. Rep. 2996, October 1996.
[16] E. Westervelt, B. Morris, and J. Grizzle. (2003) Five link walker. IEEE-CDC Workshop: Feedback Control of Biped Walking Robots. [Online]. /2znlz2