Visual Odometry
David Nistér, Oleg Naroditsky, James Bergen
Sarnoff Corporation
CN5300, Princeton, NJ 08530, USA
Abstract
We present a system that estimates the motion of a stereo head or a single moving camera based on video input. The system operates in real-time with low delay and the motion estimates are used for navigational purposes. The front end of the system is a feature tracker. Point features are matched between pairs of frames and linked into image trajectories at video rate. Robust estimates of the camera motion are then produced from the feature tracks using a geometric hypothesize-and-test architecture. This generates what we call visual odometry, i.e. motion estimates from visual input alone. No prior knowledge of the scene nor the motion is necessary. The visual odometry can also be used in conjunction with information from other sources such as GPS, inertial sensors, wheel encoders, etc. The pose estimation method has been applied successfully to video from aerial, automotive and handheld platforms. We focus on results with an autonomous ground vehicle. We give examples of camera trajectories estimated purely from images over previously unseen distances and periods of time.
1. Introduction
An important application of computer vision is the autonomous navigation of vehicles and robots. Effective use of video sensors for obstacle detection and navigation has been a goal in ground vehicle robotics for many years. Stereo vision for obstacle detection and ego-motion estimation for platform localization are some of the key aspects of this endeavor. Closely related is what is known in the robotics community as simultaneous localization and mapping (SLAM). SLAM has most often been performed using sensors other than regular cameras. However, relatively recent performance improvements in both sensors and computing hardware have made real-time vision processing much more practical, and as computer vision algorithms mature, we expect to see more visually based navigation systems. Such real-time applications as stereo analysis for detection of obstructions and curb or lane marker tracking for on-road control have become tractable to implement within standard PC-based systems. However, more general visual estimation functions remain difficult to achieve within the speed and latency constraints required for in-the-loop vehicle control functions.
Figure 1: Left: Results with a single camera mounted obliquely on an aerial platform. The aeroplane flies at a low altitude and turns to make another sweep. The visual odometry shown was estimated in real-time with low delay. The result is based solely on visual input and no prior knowledge of the scene nor the motion is used. A textured triangulation in the frustum of the most recent camera position is also shown. Right: Results from a stereo pair mounted on a ground vehicle. The vehicle path is over 600 meters long and includes three tight loops.
When obstacle detection and mapping is performed using visual input, such as stereo data, it becomes even more natural to use the visual input also to estimate the motion of the platform. In this paper we describe a real-time method for deriving vehicle motion from monocular and stereo video sequences. All of the processing runs at video rates on a 1 GHz Pentium III class machine.
Some of the components of our system have been given in detail in [13] and [15] and we will not reproduce those here. Our system performs robust camera motion estimation based on feature tracks and is in that respect a development in the direction of [11, 5, 18] and the commercial software [1]. However, it operates in real-time and in a calibrated framework and is in this sense more closely related to [3, 2, 10].
The rest of the paper is organized as follows. Sections 2 and 3 present feature detection and matching, respectively. Section 4 discusses the robust estimation with one and two cameras. Section 5 gives results and Section 6 concludes.
2. Feature Detection
In each frame, we detect Harris corners [7]. This type of point feature has been found to give detections that are relatively stable under small to moderate image distortions [20]. Since we are using video input, we can rely on the distortions between consecutive frames being fairly small.
There are many details, such as the exact choice of filter taps, thresholding, or the order of computation and storage, that will affect the result quality, the cache performance and the processing time. We will therefore describe our implementation in detail. Cache performance is often hard to predict and we tried many different implementations before settling on this one.
The incoming image is represented with 8 bits per pixel. We first compute the strength s of the corner response. For every output line of corner response, temporary filter outputs are needed for a certain number of lines above and below the current output line. All filter outputs are computed only once, and stored in wrap-around buffers for optimal cache performance. The wrap-around buffers represent the temporary filter outputs in a rolling window. The rolling window contains the minimal number of lines necessary in order to avoid recomputing any filter outputs. Let I_x and I_y denote the horizontal and vertical derivatives of the image. The wrap-around buffers and the resulting corner response are updated line by line, using four sweeps per line. The first sweep updates the wrap-around buffers for I_x I_x, I_x I_y and I_y I_y. The buffers are 5 lines long and the typical sweep updates one line, positioned two lines ahead of the current output line of corner response. The derivatives I_x and I_y are computed with horizontal and vertical filters of the type [-1 0 1] and shifted down one bit before the multiplications, to keep the input down to 8 bits and the output down to 16 bits. The second sweep convolves all five lines in the wrap-around buffers vertically with the binomial filter [1 4 6 4 1] to produce the three single lines g_xx, g_xy, g_yy of 32-bit filter output. This is accomplished by shifts and additions to avoid expensive multiplications. The third sweep convolves horizontally with the same binomial filter to produce the 32-bit single lines G_xx, G_xy, G_yy, stored back in the same place, but shifted two pixels. The fourth sweep computes the determinant d = G_xx G_yy - G_xy G_xy, the trace t = G_xx + G_yy and the strength s = d - k t^2 of the corner response, where k = 0.06, all in floating point.
The filter sweeps are implemented in MMX in chunks of 128 pixels and interleaved manually to avoid stalls and make optimal use of both pipelines. For details on the rules of thumb for MMX coding, see [9].
After the corner response is computed, non-max suppression is used to define the actual feature points. A feature point is declared at each pixel where the response is stronger than at all other pixels in a 5×5 neighborhood. No subpixel precision detection is used. For this computation step, it turns out that the lazy (short-circuit) evaluation of AND statements in C makes plain C code faster than any attempt to use MMX for the non-max suppression. The reason is that, on average, a larger suppressing value is found long before the whole 5×5 neighborhood is exhausted.
In contrast to popular practice, we do not use any absolute or global thresholds on the strength of the corner response. We only use a local saturation that limits the number of detected features in a local region of the image. This saturation only sets in when the feature density becomes extremely excessive and threatens to hurt the processing time significantly. We typically allow up to 5000 feature points distributed in 10 by 10 buckets of the image, i.e. 100 features in each bucket. The survivor features in each bucket are found with the quickselect algorithm [19] based on the strength of the corner response. Note however that the number of allowed features is very generous. In fact, for low resolution images the limit cannot even be reached, since a lower density is already enforced by the non-max suppression. Here are the sweeps in pseudo-code:
Sweep 1: Let i, j indicate the beginning of the line

    for (c = 0; c < 128; c++) {
        /* derivative filters [-1 0 1], shifted down one bit to stay within 8/16 bits */
        Ix = (img[i][j+c-1] - img[i][j+c+1]) >> 1;
        Iy = (img[i-1][j+c] - img[i+1][j+c]) >> 1;
        dxx[c]     = Ix * Ix;   /* the three bands of the wrap-around buffer line */
        dxx[c+128] = Ix * Iy;
        dxx[c+256] = Iy * Iy;
    }

Sweep 2: Let d0-d4 be pointers to the five buffer lines

    for (c = 0; c < 128; c++) {
        /* vertical [1 4 6 4 1]; the factor 6 is built as (x<<2) + (x<<1) */
        dd = d2[c];
        g[c]     = d0[c] + (d1[c]<<2) + (dd<<2) + (dd<<1) + (d3[c]<<2) + d4[c];
        dd = d2[c+128];
        g[c+128] = d0[c+128] + (d1[c+128]<<2) + (dd<<2) + (dd<<1) + (d3[c+128]<<2) + d4[c+128];
        dd = d2[c+256];
        g[c+256] = d0[c+256] + (d1[c+256]<<2) + (dd<<2) + (dd<<1) + (d3[c+256]<<2) + d4[c+256];
    }

Sweep 3 (applied to each of the three bands, shown here for the first):

    for (c = 0; c < 124; c++)
        /* horizontal [1 4 6 4 1], output stored back in place, shifted two pixels */
        g[c] = g[c] + (g[c+1]<<2) + (g[c+2]<<2) + (g[c+2]<<1) + (g[c+3]<<2) + g[c+4];

Sweep 4:

    for (c = 0; c < 124; c++) {
        Gxx = gxx[c];             /* gxx, gxy, gyy are the three bands of g */
        Gxy = gxy[c];
        Gyy = gyy[c];
        d = Gxx*Gyy - Gxy*Gxy;    /* determinant */
        t = Gxx + Gyy;            /* trace */
        s[c] = d - k*t*t;         /* corner response, k = 0.06 */
    }

Non-max suppression:

    for (i = top; i <= bottom; i++)
        for (j = left; j <= right; j++) {
            v = s[i][j];
            if (v > s[i-2][j-2] && v > s[i-2][j-1] && v > s[i-2][j] && v > s[i-2][j+1] && v > s[i-2][j+2] &&
                v > s[i-1][j-2] && v > s[i-1][j-1] && v > s[i-1][j] && v > s[i-1][j+1] && v > s[i-1][j+2] &&
                v > s[i][j-2]   && v > s[i][j-1]   &&                  v > s[i][j+1]   && v > s[i][j+2]   &&
                v > s[i+1][j-2] && v > s[i+1][j-1] && v > s[i+1][j] && v > s[i+1][j+1] && v > s[i+1][j+2] &&
                v > s[i+2][j-2] && v > s[i+2][j-1] && v > s[i+2][j] && v > s[i+2][j+1] && v > s[i+2][j+2])
                Declare a feature at (i, j);
        }
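The per-bucket saturation mentioned above keeps only the strongest corners in each of the 10 by 10 buckets. As an illustration (not our MMX implementation), a plain-C quickselect that partially orders a bucket so that its k strongest features come first could look as follows; the Feature structure and function names are hypothetical.

    /* Hypothetical sketch of the per-bucket saturation: keep the k strongest
       features of a bucket by partially ordering them with quickselect. */
    #include <stddef.h>

    typedef struct { int x, y; float strength; } Feature;

    static void swap_feat(Feature *a, Feature *b) { Feature t = *a; *a = *b; *b = t; }

    /* Lomuto partition by descending strength; returns the pivot's final index. */
    static size_t partition_desc(Feature *f, size_t lo, size_t hi) {
        float pivot = f[hi].strength;
        size_t i = lo;
        for (size_t j = lo; j < hi; j++)
            if (f[j].strength > pivot) { swap_feat(&f[i], &f[j]); i++; }
        swap_feat(&f[i], &f[hi]);
        return i;
    }

    /* After the call, f[0..k-1] hold the k strongest features of f[0..n-1]
       (in no particular order), so the rest of the bucket can be discarded. */
    void keep_strongest(Feature *f, size_t n, size_t k) {
        if (k == 0 || k >= n) return;
        size_t lo = 0, hi = n - 1, target = k - 1;
        while (lo < hi) {
            size_t p = partition_desc(f, lo, hi);
            if (p == target) break;
            if (p < target) lo = p + 1; else hi = p - 1;
        }
    }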
3. Feature Matching
The feature points are matched between pairs of frames. In contrast to the KLT tracker [21], we detect features in all frames and only allow matches between features. A feature in one image is matched to every feature within a fixed distance from it in the next image. That is, we match all features that are within a certain disparity limit from each other. We typically use a disparity limit that is 10% of the image size, but depending on the speed requirements and the smoothness of the input, we sometimes use as low as 3% and as high as 30% of the image size. Normalized correlation over an 11×11 window is used to evaluate the potential matches. For speed considerations, uniform weighting is used across the whole window. The key to achieving fast matching is to minimize the amount of computation spent on each potential matching feature pair. In comparison, computation spent on features separately in each image is negligible, since such computation is done only once per feature point, while every feature point is involved in a large number of potential matches. Since we only allow feature-to-feature matches we can preprocess the image patches separately. Each 11×11 patch centred on a detected feature is copied from the image and laid out consecutively in memory as an n = 121 byte vector (in fact, we pad to 128 bytes for convenience). At the same time, the values
    A = Σ I                          (1)
    B = Σ I^2                        (2)
    C = 1 / √(nB − A^2)              (3)
are precomputed for each patch. For each potential match, all we have to do is compute the scalar product

    D = Σ I_1 I_2                    (4)
between the two patches. The normalized correlation is then

    (nD − A_1 A_2) C_1 C_2.          (5)

The scalar product between two 128 byte vectors is computed very efficiently with MMX instructions. In fact, the multiplications can be carried out just as fast as the values can be fetched from and stored back into memory. This brings us to an interesting point. It is a common belief that the sum of absolute differences (SAD) is more efficient than normalized correlation, which is of course true in some circumstances or when considering the amount of chip surface spent in dedicated hardware, but this is not true in our setting. Changing the multiplications to subtractions would achieve nothing, since memory speed is the bottleneck.
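For illustration, the following plain-C sketch (without the MMX vectorization) computes the quantities of equations (1)-(3) once per patch and then evaluates equations (4)-(5) per candidate pair; the structure layout and function names are ours, not part of the paper.

    /* Illustrative plain-C version of eqs. (1)-(5); not the MMX implementation. */
    #include <math.h>

    #define PATCH_N 121            /* 11x11 patch laid out as a flat vector    */

    typedef struct {
        unsigned char v[PATCH_N];  /* pixel values copied from the image       */
        float A;                   /* sum of intensities,          eq. (1)     */
        float B;                   /* sum of squared intensities,  eq. (2)     */
        float C;                   /* 1/sqrt(n*B - A*A),           eq. (3)     */
    } Patch;

    void patch_precompute(Patch *p) {
        float A = 0.0f, B = 0.0f;
        for (int i = 0; i < PATCH_N; i++) {
            float x = (float)p->v[i];
            A += x;
            B += x * x;
        }
        p->A = A;
        p->B = B;
        /* a perfectly uniform patch would make the denominator zero;
           real code would have to guard against that */
        p->C = 1.0f / sqrtf((float)PATCH_N * B - A * A);
    }

    float ncc(const Patch *p1, const Patch *p2) {
        float D = 0.0f;            /* scalar product of the two patches, eq. (4) */
        for (int i = 0; i < PATCH_N; i++)
            D += (float)p1->v[i] * (float)p2->v[i];
        return ((float)PATCH_N * D - p1->A * p2->A) * p1->C * p2->C;   /* eq. (5) */
    }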
To decide which matches to accept, we use an old but powerful trick, namely the mutual consistency check. Every feature is involved in a number of normalized correlations with features from the other image, decided by the maximum disparity limit. The feature from the other image that produces the highest normalized correlation is the preferred mate. The feature from the other image also in its turn has a preferred mate. Only pairs of features that 'want to get married', i.e. mutually have each other as the preferred mate, are accepted as a valid match. Note that with good bookkeeping, the mutual consistency check can be accomplished without computing the correlations more than once.
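With the best-scoring mate of every feature recorded during the correlation pass, the mutual consistency check itself reduces to a single lookup per feature. A minimal sketch, with hypothetical array names (best1[i] holds the index of feature i's preferred mate in the other image, or -1 if it has none):

    /* Accept feature i of image 1 only if its preferred mate j in image 2
       also prefers i; otherwise the pair is rejected. */
    int mutually_consistent(const int *best1, const int *best2, int i) {
        int j = best1[i];
        return j >= 0 && best2[j] == i;
    }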
We use the accepted matches both in stereo and in video to produce tracks. For tracking, we simply link the accepted matches between pairs of frames into tracks over time. Thus, we do not really distinguish between matching and tracking.
Figure 2: Input frame (left) and feature tracker output (right). Circles represent current feature locations and curves are feature tracks through the image.
4. Robust Estimation
The feature tracking operates without any geometric constraints. The resulting tracks are then fed forward to geometry estimation. For the geometry estimation part of the system we have several incarnations. One uses monocular video as input and another one uses stereo. We also have versions that perform mosaicing with much the same methodology.
4.1 The Monocular Scheme
The monocular version operates as follows.
1. Track features over a certain number of frames. Estimate the relative poses between three of the frames using the 5-point algorithm [13] and preemptive RANSAC [15] followed by iterative refinement.

2. Triangulate the observed feature tracks into 3D points using the first and last observation on each track and optimal triangulation according to directional error. This can be achieved in closed form [17]. If this is not the first time through the loop, estimate the scale factor between the present reconstruction and the previous camera trajectory with another preemptive RANSAC procedure. Put the present reconstruction in the coordinate system of the previous one.

3. Track for a certain additional number of frames. Compute the pose of the camera with respect to the known 3D points using the 3-point algorithm [6] and preemptive RANSAC followed by iterative refinement.

4. Re-triangulate the 3D points using the first and last observations on their image track. Repeat from Step 3 a certain number of times.

5. Repeat from Step 1 a certain number of times.

6. Insert a firewall and repeat from Step 1.
The meaning of "firewall" in this context is the following. From the system point of view, error accumulation and propagation is a serious concern. For example, if the pose is incorrectly estimated, this will lead to incorrectly positioned 3D points, which will in turn hurt subsequent pose estimates and the system will never recover. However, the above scheme opens up the possibility of building a firewall against error propagation by simply prescribing that triangulation of 3D points is never performed using observations beyond the most recent firewall. That is, for purposes of triangulation, the frame after the most recent firewall is considered the first frame. Our system then gets the desirable property that after a firewall, the relative poses will be estimated exactly as if the system was started afresh. The state of the system before the firewall can only affect the choice of coordinate system for subsequent poses, nothing else. The firewall helps protect both against propagation of gross errors and against slow error buildup that is not fully suppressed by the iterative refinements.
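To make the firewall rule concrete, the following small helper sketches how the observation window for triangulation could be chosen so that nothing before the most recent firewall frame is used; the track representation and names are hypothetical, not taken from the paper.

    /* One image observation of a feature track (hypothetical layout). */
    typedef struct { int frame; float x, y; } Obs;

    /* Pick the first and last observations usable for triangulation. Observations
       taken before the most recent firewall frame are ignored, so errors made
       before the firewall cannot leak into newly triangulated 3D points.
       Returns 0 if fewer than two usable observations remain. */
    int triangulation_window(const Obs *track, int n_obs, int firewall_frame,
                             const Obs **first, const Obs **last) {
        int lo = 0;
        while (lo < n_obs && track[lo].frame < firewall_frame) lo++;
        if (n_obs - lo < 2) return 0;
        *first = &track[lo];
        *last  = &track[n_obs - 1];
        return 1;
    }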
Our current real-time scheme is simplistic in the way it chooses which frames to use for relative orientation. We also have a version that uses model selection to choose the frames in the same spirit as [11]. This has given promising results in offline simulations. However, it has not yet proved fast enough to improve upon the real-time results. One of the advantages of using a stereo head is that these choices are largely avoided.

4.2 The Stereo Scheme

When a stereo rig is available, we can avoid the difficult relative orientation step and instead perform triangulation followed by pose estimation repeatedly. Moreover, the relative poses can be estimated in a known scale, since we know the size of the stereo baseline. This makes the estimates more useful and manageable. The stereo version of the system operates as follows.
1. Match feature points between the left and right images of the stereo pair. Triangulate the observed matches into 3D points.

2. Track features for a certain number of frames. Compute the pose of the stereo rig with preemptive RANSAC followed by iterative refinement. The 3-point algorithm (considering the left image) is used as the hypothesis generator. The scoring and iterative refinement are based on reprojection errors in both frames of the stereo pair.

3. Repeat from Step 2 a certain number of times.

4. Triangulate all new feature matches using the observations in the left and right images. Repeat from Step 2 a certain number of times.

5. Re-triangulate all 3D points to set up a firewall. Repeat from Step 2.
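Step 1 triangulates the left-right matches with the calibrated rig. Purely for illustration, and under the simplifying assumption of a rectified pair (which the paper does not require), the closed-form triangulation of one match could be sketched as:

    /* Illustration only: triangulation for a rectified stereo pair with focal
       length f (pixels), principal point (cx, cy), baseline b (meters) and
       disparity d = xL - xR. The paper uses a general calibrated rig instead. */
    typedef struct { float X, Y, Z; } Point3;

    int triangulate_rectified(float xL, float yL, float xR,
                              float f, float cx, float cy, float b, Point3 *out) {
        float d = xL - xR;
        if (d <= 0.0f) return 0;      /* match at or beyond infinity; reject */
        out->Z = f * b / d;           /* depth in the left camera frame      */
        out->X = (xL - cx) * out->Z / f;
        out->Y = (yL - cy) * out->Z / f;
        return 1;
    }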
Figure 3: Camera path output from the stereo scheme in 3D.
The reader might wonder why an absolute orientation between 3D points is not used in Step 2, since 3D points can be obtained with the stereo rig at each time instance. We have also tried such a scheme and found it to be greatly inferior. The reason is the way ambiguity impacts the estimates. The triangulations are much more uncertain in the depth direction. When the same points are triangulated in two different stereo frames, there are therefore large discrepancies along the depth direction. For alignment between small sets of 3D points, this has a devastating effect on the results. This might be avoided with larger sets of points, but that would compromise the robustness against outliers that we achieve with the RANSAC scheme.
Figure 4: A picture of the autonomous ground vehicle. Note the two stereo heads mounted in the front. For this platform, a specifically tailored version of the camera motion estimation system is used. The available stereo input is leveraged to obtain even more stable visual odometry. The feature tracking front end is essentially the same, but the known stereo geometry is used in the robust motion estimation. The somewhat more difficult relative orientation step used in the single camera case can thus be avoided. The system then essentially performs triangulation of scene points followed by estimation of pose from the structure that was previously determined. A significant advantage is that in this mode the system can perform well despite little or even no camera motion.
Instead, we use the 3-point algorithm for single camera pose. It is less affected by uncertainty in the depth of the 3D points. The reason is that it is based on image quantities and that the new camera pose is not very far from the pose from where the 3D point was originally triangulated. The uncertainty in depth is caused essentially by the fact that moving the 3D point in the depth direction does not cause much change in the reprojected image position. Turning this around, it means that changes in depth do not change the pose estimates much. In a nutshell, if we stick to using image based quantities, the uncertainty effects 'cancel' to a large extent.
However, for the benefit of using image based quantities, we pay the penalty of using only one of the images in the hypothesis generation. To mitigate the effect of this, it is important to score and optimize using both images. We have tested the scheme without this improvement. The triangulation is then carried out using both views, balancing the reprojection errors between the two, but the pose is only based on one view. Any imperfections in the geometry, such as small calibration errors, will bias the triangulated positions of the 3D points. A pose estimate that only considers one view will then accommodate for some of the bias by 'forgetting' about the requirements from the other view. When the 3D points are then re-triangulated, they are again biased. This leads to a constant drift or torsion that increases with the frequency of re-triangulation. This problem is very much alleviated by scoring and optimizing based on both views simultaneously. At least, the pose estimate will not drift when the rig is motionless. Since the balanced reprojection error is essentially the same score that was used in triangulation to place the 3D points at the best possible location relative to the stereo rig, the position estimates for the stereo rig will have no reason to move based on that score.
Also, to make the hypothesis generation more symmetric between the images, we have developed a generalized version of the 3-point pose algorithm that can use non-concurrent rays [16]. With this generalization, it will be possible to pick the minimal sets randomly with points from both images.
The choices between triangulation and re-triangulation in the above scheme were made to accommodate certain trade-offs. We wish to re-triangulate all features quite frequently to put up firewalls against error propagation. We also wish to triangulate new features as they begin to track, to avoid running out of 3D points. On the other hand, in order to suppress drift to the largest extent possible, we wish to use 3D points that were triangulated as far back in time as possible.
A significant advantage of the stereo scheme is that it can operate correctly even without any camera motion. This is also an indication of its greater stability, since many of the difficulties in monocular ego-motion estimation are caused by small motions.
4.3 Preemptive RANSAC and Scoring
For all the camera motion estimation, we use preemptive RANSAC as presented in [15]. Multiple (typically 500) minimal random samples of feature correspondences are taken. Each sample contains the smallest number of feature correspondences necessary to obtain a unique solution for the camera motion. All the motion hypotheses thus generated then compete in a preemptive scoring scheme that is designed to quickly find a motion hypothesis that enjoys a large support among all the feature correspondences. All estimations are finished off with an iterative refinement with the support as objective function. Support is measured by a robust reprojection error. We assume a Cauchy distribution for the reprojection errors and ignore constant factors of the likelihood whenever possible. This means that if the scaled squared magnitude of a reprojection error is u, it contributes a term -ln(1 + u) to the log-likelihood. Most robustification kernels require expensive transcendental functions, in this case the logarithm. The Cauchy distribution has the advantage that we can use the following trick to prevent the logarithm from becoming a major bottleneck. The reprojection errors are taken in groups of ten and the robust log-likelihood for each group is computed as

    -ln Π_{i=1}^{10} (1 + u_i).      (6)
Larger groups should be avoided to ensure that the calculation stays within the floating point range.
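A minimal sketch of this grouped evaluation, assuming the scaled squared reprojection errors have already been collected into an array (our naming, not the paper's):

    #include <math.h>

    /* Robust support score: each error u_i contributes -ln(1 + u_i) under the
       Cauchy kernel. The factors (1 + u_i) are multiplied in groups of ten and
       a single logarithm is taken per group, as in eq. (6), so the transcendental
       call no longer dominates the scoring. */
    double robust_score(const double *u, int n) {
        double score = 0.0;
        int i = 0;
        while (i < n) {
            double prod = 1.0;
            int end = (i + 10 < n) ? i + 10 : n;
            for (; i < end; i++)
                prod *= 1.0 + u[i];    /* ten factors keep prod well inside double range */
            score -= log(prod);
        }
        return score;
    }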
The reprojection errors for single poses and the stereo rig are straightforward to compute since we have the 3D points. For scoring the three-view relative pose estimates, we use a trifocal Sampson error, which is significantly faster than the previous state of the art [22]. The closed form trifocal Sampson error is described in [16]. The iterative refinement for three views is however carried out with a complete, hand-optimized bundle adjustment, since full bundle adjustment turns out to be faster than any attempts to eliminate the structure parameters.
5. Results
The visual odometry system was integrated into a mobile robotic platform equipped with a Differential Global Positioning System (DGPS) as well as a high precision Inertial Navigation System (INS). DGPS functioning in RT-2 mode allowed us to collect position data with up to 2 cm relative accuracy. We compare the visual odometry output to the integrated INS/DGPS navigation system, which we treat as ground truth. We show that the visual odometry pose estimates are accurate and reliable under a variety of maneuvers in realistic ground vehicle scenarios. We also demonstrate the usefulness of visual odometry for map building and obstacle avoidance.

No a priori knowledge of the motion was used to produce the visual odometry. A completely general 3D trajectory was estimated in all our experiments. In particular, we did not explicitly force the trajectory to stay upright or within a certain height of the ground plane. The fact that it did anyway is a strong verification of the robustness and accuracy of the result.
5.1 System Configuration
The vehicle was equipped with a pair of synchronized analog cameras. Each camera had a horizontal field of view of 50°, and image fields of 720×240 resolution were used for processing. The stereo pair was tilted toward the side of the vehicle by about 10° and had a baseline of 28 cm (see Figure 4). Due to a variety of other tasks running concurrently on the system, the visual odometry's frame processing rate was limited to around 13 Hz.
Run      Frames   DGPS (m)   VisOdo (m)   % error
Loops    1602     185.88     183.90       1.07
Meadow   2263     266.16     279.77       4.86
Woods    2944     365.96     372.02       1.63

Table 1: Metric accuracy of visual odometry position estimates. The number of frames processed is given in column 2. Total vehicle path lengths estimated by DGPS and visual odometry are given in columns 3 and 4, with the relative error in distance given in column 5.
During each run of the vehicle we collected time stamped pose data from visual odometry and the vehicle navigation system (VNS), which includes GPS and INS. To obtain quantitative comparisons, the coordinate systems of the visual odometry and the VNS were aligned by a least squares fit of the initial twenty poses. In the absence of VNS, visual odometry can be used directly for navigation relative to the vehicle's initial pose. Note that there is no need to estimate a scale factor between the coordinate systems since we can take advantage of the known stereo baseline to determine absolute scale.
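As an aside, the least-squares alignment of the two coordinate systems can be illustrated with a simplified planar (2D) variant that recovers only heading and horizontal translation in closed form; the paper aligns full 3D poses, so this is an illustration of the idea rather than the actual procedure, and all names are ours.

    #include <math.h>

    /* Planar least-squares alignment: find theta, (tx, ty) minimizing
       sum_i || R(theta) p_i + t - q_i ||^2, where p are visual odometry
       positions and q the corresponding VNS positions (e.g. the first
       twenty poses). Simplified 2D illustration of the alignment step. */
    void align_2d(const double *px, const double *py,
                  const double *qx, const double *qy, int n,
                  double *theta, double *tx, double *ty) {
        double pcx = 0, pcy = 0, qcx = 0, qcy = 0;
        for (int i = 0; i < n; i++) {
            pcx += px[i]; pcy += py[i];
            qcx += qx[i]; qcy += qy[i];
        }
        pcx /= n; pcy /= n; qcx /= n; qcy /= n;
        double dot = 0, cross = 0;
        for (int i = 0; i < n; i++) {
            double ax = px[i] - pcx, ay = py[i] - pcy;   /* centered odometry */
            double bx = qx[i] - qcx, by = qy[i] - qcy;   /* centered VNS      */
            dot   += ax * bx + ay * by;
            cross += ax * by - ay * bx;
        }
        *theta = atan2(cross, dot);                      /* optimal rotation    */
        double c = cos(*theta), s = sin(*theta);
        *tx = qcx - (c * pcx - s * pcy);                 /* optimal translation */
        *ty = qcy - (s * pcx + c * pcy);
    }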
The VNS was designed to select the highly accurate DGPS position estimate over the inertial estimate when both are available. The orientation of the vehicle, on the other hand, always came from the inertial sensors.
5.2 Visual Odometry vs. DGPS
Our experiments show that visual odometry is highly effective for estimating the position of the vehicle. Table 1 compares the estimates of total distance travelled by the vehicle in three outdoor runs conducted on a trail in a wooded area.
Figure 5 shows side-by-side plots of vehicle trajectories as recorded by DGPS and visual odometry. Figure 6 shows the visual odometry overlaid on the vehicle's position in DGPS North-East-Down coordinates. In each case the visual odometry is stable and performs correctly over thousands of frames of dead reckoning and hundreds of meters of uneven terrain. This figure also illustrates the result of using visual odometry as the sole means of navigation for the vehicle.
5.3 Visual Odometry vs. INS
We compare the vehicle yaw angle logged by the INS to the one from visual odometry. Unlike GPS, both the visual odometry and the INS direction sensor function incrementally by dead reckoning, and therefore accumulate error over time. Table 2 shows the mean frame to frame accumulation of discrepancy in yaw magnitude between the visual odometry and the INS for each of our three sequences.