Using Self-Organizing Distinctive State
Abstraction to Navigate a Maze World
Connie Li
Phil Katz
May 14, 2006
Abstract
This paper presents a developmental robotics experiment implementing self-organizing distinctive state abstraction and growing neural gas. The robot brain takes sensory input from eight sonars, a light sensor, and a stall sensor, and has motor outputs of translation and rotation. The robot is placed in a maze world with a goal signified by a light source. We test whether the robot can learn to traverse the maze from increasingly distant starting points, using reinforcement learning. Although the robot does not learn to traverse the maze, it seems likely that an adaptation of our algorithm could complete this task.
1 Introduction
One of the primary goals in the field of developmental robotics is to create a robot brain that stores its own representations of its sensorimotor states, that is, to develop state abstraction. It is easy to provide a robotic brain with pre-defined distinct states; however, this inserts anthropomorphic bias, as the manner in which a human experimenter divides sensorimotor space into states is most likely not the manner in which the robot would find it most useful. The challenge, of course, is to tell a robot brain how to generate states without specifying which states to generate. Robots are becoming proficient at dividing discrete sensorimotor spaces into discrete states; dividing a continuous sensorimotor space into discrete states proves to be even more of a challenge for robot brains. One way to develop discrete state abstraction of a continuous world is to use self-organizing distinctive state abstraction (Provost et al. 2006).
The idea of self-organizing distinctive state abstraction (SODA) is that each navigational state is composed of high-level action abstractions that allow it to approximate one sensory state. The sensory states are generated through growing neural gas (Fritzke 1995). Growing neural gas (GNG) was developed so that important topological features of a given environment could be distinguished. This algorithm is ideal for developmental robotics because it is free from anthropomorphic bias and requires few parameters in order to run. Other dimensionality reduction techniques, such as competitive Hebbian learning, require a predefined network size and other parameters. The GNG algorithm allows the network to set the parameters dynamically to appropriate values. The implementation of GNG that we used in this experiment was modified from code written by Dan Amato.
The SODA algorithm uses the GNG units to help determine its high-level actions. A high-level SODA action is composed of two steps, trajectory following and hill climbing. For trajectory following, the algorithm finds the GNG unit closest to the current sensory input and uses Q-learning to determine the best action. After trajectory following has led the robot to a new sensory GNG state, it executes hill climbing to fine-tune the movement executed by the trajectory following. The Q-learning algorithm uses reinforcement learning to back-propagate reward from successful trials. Because our world is continuous, we used the GNG units and the low-level actions to form the Q-table.
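To make the structure of this Q-table concrete, the following is a minimal sketch in Python of a table keyed by (GNG unit, low-level action) pairs with a standard one-step temporal-difference update. The learning rate, discount factor, exploration rate, and function names are our own assumptions for illustration, not values taken from the actual implementation.

import random

NUM_ACTIONS = 16          # the sixteen low-level actions (described in the next section)
ALPHA, GAMMA = 0.2, 0.9   # assumed learning rate and discount factor
EPSILON = 0.3             # explore 30 percent of the time, as in the loop in Section 3

# Q-table keyed by (gng_unit_index, action_index); missing entries default to 0.
q_table = {}

def q(unit, action):
    return q_table.get((unit, action), 0.0)

def choose_action(unit):
    # Pick the highest-scoring action 70 percent of the time, a random one otherwise.
    if random.random() < EPSILON:
        return random.randrange(NUM_ACTIONS)
    return max(range(NUM_ACTIONS), key=lambda a: q(unit, a))

def td_update(unit, action, reward, next_unit):
    # One-step temporal-difference update of the Q-table.
    best_next = max(q(next_unit, a) for a in range(NUM_ACTIONS))
    q_table[(unit, action)] = q(unit, action) + ALPHA * (reward + GAMMA * best_next - q(unit, action))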
2 Our Environment
Our experiment placed a single simulated robot, named Katie, in a simulated world. The simulator that we used, implementing a simulated Pioneer robot, was the award-winning Pyrobot software. Katie was equipped with eight sonar sensors, arranged across her front half, a stall sensor, and a light sensor. The light sensor value was actually the maximum of two light sensors, one on her front left and one on her front right. If the light value was greater than 0.2, it was discretized to 10, so that there was a clear contrast for Katie between seeing the light and not seeing the light. The ten values were combined to create her sensory state vector at each timestep. Pioneer robots (and therefore Katie) are equipped with two independently-controlled wheels that allow free movement through a continuous world. For the SODA algorithm, it is necessary to generate a list of low-level actions for the robot to use in navigation. We gave Katie a list of sixteen actions with the intention of allowing her complete freedom of movement; nine of the actions are generally forward-moving, four are turns, and three move Katie backwards.
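As a minimal sketch of how such a ten-element sensory vector might be assembled each timestep, the helper below takes raw readings as arguments; it is illustrative and does not show the actual Pyrobot calls, and the handling of light values at or below the threshold and the 0/1 stall encoding are our assumptions.

LIGHT_THRESHOLD = 0.2
LIGHT_ON_VALUE = 10.0   # discretized "sees the light" value

def make_state_vector(sonars, left_light, right_light, stalled):
    # Build the 10-element sensory state: 8 sonars, 1 stall flag, 1 light value.
    assert len(sonars) == 8
    light = max(left_light, right_light)
    # Discretize the light so that seeing the light contrasts sharply with not seeing it.
    light_value = LIGHT_ON_VALUE if light > LIGHT_THRESHOLD else light
    return list(sonars) + [1.0 if stalled else 0.0, light_value]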
Our world was designed as a relatively simple maze for Katie to navigate. There is a light source in the upper-left corner that serves as the goal. At the start of the experiment, Katie is placed at the point labelled '1' on the map, very close to the goal. Every six trials, she is moved to a starting point farther and farther away from the goal. We implemented this incremental increase in difficulty so that Katie would be able to learn sections of the maze at a time, and then build on that knowledge to learn larger and larger sections.
3 Our Algorithm
Figure 1: Katie's world. The numbers signify the sequence of starting locations that Katie uses over the course of the experiment.

The GNG variant we used works by generating a new GNG unit, comprised of a representative vector and a list of neighboring units, every set number of timesteps. At every timestep, the GNG is given the current sensory state of the robot, and finds the closest (in Euclidean distance) unit in the network. It then slightly adjusts that unit's representative vector, sets the age of that unit to zero, and ages each neighbor of the current unit. After units reach a certain age, they are removed from the network.
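A compact sketch of this per-timestep update is given below. The adjustment rate, maximum age, and data structures are our own assumptions for illustration (they are not taken from Dan Amato's code), and the periodic insertion of new units is assumed to happen elsewhere.

import math

class GNGUnit:
    def __init__(self, vector):
        self.vector = list(vector)      # representative vector
        self.age = 0
        self.neighbors = []             # neighboring GNG units

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def gng_step(units, sensory_state, adjust_rate=0.05, max_age=50):
    # One timestep of the GNG variant: move the winning unit, age its neighbors.
    winner = min(units, key=lambda u: distance(u.vector, sensory_state))
    # Nudge the winner's representative vector toward the current sensory state.
    winner.vector = [v + adjust_rate * (s - v)
                     for v, s in zip(winner.vector, sensory_state)]
    winner.age = 0
    for n in winner.neighbors:
        n.age += 1
    # Remove units that have grown too old.
    units[:] = [u for u in units if u.age <= max_age]
    return winner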
Our implementation of the SODA algorithm used the low-level actions described previously. The SODA algorithm goes through a simple loop (a code sketch of the full loop follows this list):
1. Find the closest GNG unit (as measured by the Euclidean distance of the representative vector).
2. Using the Q-table, choose a low-level movement and continue to execute that movement until the closest GNG unit is no longer the same.
   • 70 percent of the time, choose the action with the highest Q-score. The other 30 percent of the time, choose a random action.
   • The Q-scores are back-propagated through the Q-table when the goal is reached, using a temporal-differencing algorithm (implemented in Pyrobot by Robert Casey).

Figure 2: A set A of 20 GNG units. These were generated using the full (non-log) sensor values. The black semicircle represents the light sensor value.
   • The reward for the Q-score is based on what we called a "cookie score". Katie dropped "cookies" as she travelled through the world. The cookie score was calculated by summing the amount of time that Katie spent in close proximity to a cookie. Cookies had a small decay factor so that they disappeared after a certain amount of time. The reward at the end of a trial was based on the inverse of this cookie score and on the maximum number of timesteps, so that the less time she spent visiting places she had already been, the higher her reward.
   • As mentioned, the Q-table used GNG units as states. Each GNG unit that contained a light value of approximately 10 was classified as a goal state.
3. Using the same GNG unit that was used for trajectory following, commence hill climbing:

   • Execute and then reverse each low-level action to see if it would bring the current sensory vector closer to the representative vector of the GNG unit you are hill climbing toward.
   • Choose the action which will minimize the distance to the representative vector. If no action will decrease the distance, choose a new GNG unit and commence trajectory following.
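Putting these steps together, a sketch of one pass through the loop might look as follows. The helpers closest_unit, current_state, execute, and reverse_of are hypothetical stand-ins for the corresponding parts of our implementation, distance and choose_action are reused from the earlier sketches, and the greedy hill-climbing structure follows the description above rather than the actual code.

def trajectory_follow(units, robot):
    # Steps 1-2: pick the closest GNG unit and repeat one action until that unit changes.
    unit = closest_unit(units, current_state(robot))     # closest by Euclidean distance
    action = choose_action(unit)                          # 70/30 greedy/random, from the Q-table
    while closest_unit(units, current_state(robot)) is unit:
        execute(robot, action)
    return unit, action

def hill_climb(unit, robot, actions):
    # Step 3: greedily reduce the distance to the unit's representative vector.
    while True:
        before = distance(unit.vector, current_state(robot))
        best_action, best_gain = None, 0.0
        for action in actions:
            execute(robot, action)                        # try the action...
            after = distance(unit.vector, current_state(robot))
            execute(robot, reverse_of(action))            # ...then undo it
            if before - after > best_gain:
                best_action, best_gain = action, before - after
        if best_action is None:
            return                                        # no action helps: resume trajectory following
        execute(robot, best_action)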
4 Our Experiment
Our experiment was divided into two parts; in the first, Katie wandered randomly about the world, growing neural gas. In the second part, she ran a series of trials, during which she was rewarded for reaching the goal.
Figure 3: A subset of set B of GNG units. The full set contains 28 GNG units. These were generated by taking the log of the sensor values. The black semicircle represents the light sensor value. The images are magnified 200 percent.
Generating an appropriately sized set of GNG units required many trials and minor adjustments to the GNG algorithm. We did not want a set of GNG units with more than thirty units, because it would slow down the algorithm (particularly the hill climbing step). We also did not want a set of GNG units with fewer than ten units, because such a set could not encompass the variety of sensory states provided by our world. We generated two sets of GNG units that fit these parameters: one (set A) using the direct sensor data and one (set B) taking the log of the sensor values, to emphasize the differences between small values.
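For set B, the log transform can be applied elementwise before the readings are handed to the GNG. The small offset below, which keeps the transform defined at zero, is our own assumption rather than a detail from the experiment.

import math

def log_scale(sensor_values, offset=0.01):
    # Emphasize differences between small sensor readings via a log transform.
    return [math.log(v + offset) for v in sensor_values]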
For the second half of our experiment, Katie was placed in the world and given a certain number of timesteps (between 1000 and 3500, depending on the distance from the starting location to the goal) to attempt to find the goal. If she reached the goal (by reaching a GNG state with a light value of 10), the reward was back-propagated along her path. If she reached the timestep limit without having seen the light, she was reset to the starting position with no reward. She was placed at each of the six start positions six times before moving on to the next one.
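The trial schedule just described amounts to a short outer loop over start points. In the sketch below, place_at, run_trial, and backpropagate_reward are hypothetical wrappers around the parts of the system described earlier, and the per-start-point timestep limits are illustrative values within the stated 1000-3500 range rather than the exact limits used.

START_POINTS = [1, 2, 3, 4, 5, 6]
TRIALS_PER_POINT = 6
# Illustrative per-start-point timestep limits within the stated 1000-3500 range.
STEP_LIMITS = {1: 1000, 2: 1500, 3: 2000, 4: 2500, 5: 3000, 6: 3500}

def run_experiment(robot):
    for point in START_POINTS:
        for trial in range(TRIALS_PER_POINT):
            place_at(robot, point)                          # reset Katie to this start position
            reached = run_trial(robot, STEP_LIMITS[point])  # SODA loop until goal or limit
            if reached:
                backpropagate_reward(robot)                 # reward only successful trials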
5 Results
The results that we were hoping to see were that the cookie score and the number of timesteps needed to find the goal would go down during each phase of the experiment. Because we normalized the cookie score by dividing it by the maximum number of timesteps allowed, which increased with the distance between the start and the goal, we also expected to see a general decrease in the cookie score over the whole experiment. We also hoped to see, at worst, a consistent percentage of "failed" trials: trials in which Katie did not find the goal.
When we ran the experiment with set A of GNG units, we did not find a consistent percentage of "failed" trials; in fact, from starting point #4 onwards, Katie never found the goal.
Start Point    Average Steps
1              581
2              1246
3              1846
4              2500
5              3000
6              3500