Object Recognition from Local Scale-Invariant Features
David G. Lowe
Computer Science Department
University of British Columbia
Vancouver, B.C., V6T 1Z4, Canada
lowe@cs.ubc.ca
Abstract
An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest-neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low-residual least-squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially-occluded images with a computation time of under 2 seconds.
1. Introduction
Object recognition in cluttered real-world scenes requires local image features that are unaffected by nearby clutter or partial occlusion. The features must be at least partially invariant to illumination, 3D projective transforms, and common object variations. On the other hand, the features must also be sufficiently distinctive to identify specific objects among many alternatives. The difficulty of the object recognition problem is due in large part to the lack of success in finding such image features. However, recent research on the use of dense local features (e.g., Schmid & Mohr [19]) has shown that efficient recognition can often be achieved by using local image descriptors sampled at a large number of repeatable locations.
This paper presents a new method for image feature generation called the Scale Invariant Feature Transform (SIFT). This approach transforms an image into a large collection of local feature vectors, each of which is invariant to image translation, scaling, and rotation, and partially invariant to illumination changes and affine or 3D projection. Previous approaches to local feature generation lacked invariance to scale and were more sensitive to projective distortion and illumination change. The SIFT features share a number of properties in common with the responses of neurons in inferior temporal (IT) cortex in primate vision. This paper also describes improved approaches to indexing and model verification.
The scale-invariant features are efficiently identified by using a staged filtering approach. The first stage identifies key locations in scale space by looking for locations that are maxima or minima of a difference-of-Gaussian function. Each point is used to generate a feature vector that describes the local image region sampled relative to its scale-space coordinate frame. The features achieve partial invariance to local variations, such as affine or 3D projections, by blurring image gradient locations. This approach is based on a model of the behavior of complex cells in the cerebral cortex of mammalian vision. The resulting feature vectors are called SIFT keys. In the current implementation, each image generates on the order of 1000 SIFT keys, a process that requires less than 1 second of computation time.
The SIFT keys derived from an image are used in a nearest-neighbour approach to indexing to identify candidate object models. Collections of keys that agree on a potential model pose are first identified through a Hough transform hash table, and then through a least-squares fit to a final estimate of model parameters. When at least 3 keys agree on the model parameters with low residual, there is strong evidence for the presence of the object. Since there may be dozens of SIFT keys in the image of a typical object, it is possible to have substantial levels of occlusion in the image and yet retain high levels of reliability.
The current object models are represented as 2D locations of SIFT keys that can undergo affine projection. Sufficient variation in feature location is allowed to recognize perspective projection of planar shapes at up to a 60 degree rotation away from the camera or to allow up to a 20 degree rotation of a 3D object.
2. Related research
Object recognition is widely used in the machine vision industry for the purposes of inspection, registration, and manipulation. However, current commercial systems for object recognition depend almost exclusively on correlation-based template matching. While very effective for certain engineered environments, where object pose and illumination are tightly controlled, template matching becomes computationally infeasible when object rotation, scale, illumination, and 3D pose are allowed to vary, and even more so when dealing with partial visibility and large model databases.
An alternative to searching all image locations for matches is to extract features from the image that are at least partially invariant to the image formation process and matching only to those features. Many candidate feature types have been proposed and explored, including line segments [6], groupings of edges [11, 14], and regions [2], among many other proposals. While these features have worked well for certain object classes, they are often not detected frequently enough or with sufficient stability to form a basis for reliable recognition.
There has been recent work on developing much denser collections of image features. One approach has been to use a corner detector (more accurately, a detector of peaks in local image variation) to identify repeatable image locations, around which local image properties can be measured. Zhang et al. [23] used the Harris corner detector to identify feature locations for epipolar alignment of images taken from differing viewpoints. Rather than attempting to correlate regions from one image against all possible regions in a second image, large savings in computation time were achieved by only matching regions centered at corner points in each image.
For the object recognition problem, Schmid & Mohr [19] also used the Harris corner detector to identify interest points, and then created a local image descriptor at each interest point from an orientation-invariant vector of derivative-of-Gaussian image measurements. These image descriptors were used for robust object recognition by looking for multiple matching descriptors that satisfied object-based orientation and location constraints. This work was impressive both for the speed of recognition in a large database and the ability to handle cluttered images.
The corner detectors used in these previous approaches have a major failing, which is that they examine an image at only a single scale. As the change in scale becomes significant, these detectors respond to different image points. Also, since the detector does not provide an indication of the object scale, it is necessary to create image descriptors and attempt matching at a large number of scales. This paper describes an efficient method to identify stable key locations in scale space. This means that different scalings of an image will have no effect on the set of key locations selected. Furthermore, an explicit scale is determined for each point, which allows the image description vector for that point to be sampled at an equivalent scale in each image. A canonical orientation is determined at each location, so that matching can be performed relative to a consistent local 2D coordinate frame. This allows for the use of more distinctive image descriptors than the rotation-invariant ones used by Schmid and Mohr, and the descriptor is further modified to improve its stability to changes in affine projection and illumination.
Other approaches to appearance-based recognition include eigenspace matching [13], color histograms [20], and receptive field histograms [18]. These approaches have all been demonstrated successfully on isolated objects or pre-segmented images, but due to their more global features it has been difficult to extend them to cluttered and partially occluded images. Ohba & Ikeuchi [15] successfully apply the eigenspace approach to cluttered images by using many small local eigen-windows, but this then requires expensive search for each window in a new image, as with template matching.
3. Key localization
We wish to identify locations in image scale space that are invariant with respect to image translation, scaling, and rotation, and are minimally affected by noise and small distortions. Lindeberg [8] has shown that under some rather general assumptions on scale invariance, the Gaussian kernel and its derivatives are the only possible smoothing kernels for scale space analysis.
To achieve rotation invariance and a high level of efficiency, we have chosen to select key locations at maxima and minima of a difference-of-Gaussian function applied in scale space. This can be computed very efficiently by building an image pyramid with resampling between each level. Furthermore, it locates key points at regions and scales of high variation, making these locations particularly stable for characterizing the image. Crowley & Parker [4] and Lindeberg [9] have previously used the difference-of-Gaussian in scale space for other purposes. In the following, we describe a particularly efficient and stable method to detect and characterize the maxima and minima of this function.
As the 2D Gaussian function is separable, its convolution with the input image can be efficiently computed by applying two passes of the 1D Gaussian function in the horizontal and vertical directions:

    g(x) = (1 / (√(2π) σ)) exp(−x² / (2σ²))
For key localization, all smoothing operations are done using σ = √2, which can be approximated with sufficient accuracy using a 1D kernel with 7 sample points.
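As an illustration, the two-pass separable smoothing can be sketched in NumPy as follows; the reflected border handling is an implementation choice not specified in the text:

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius=3):
    """Normalized 1D Gaussian; radius 3 gives the 7 sample points that
    suffice for sigma = sqrt(2)."""
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-x * x / (2.0 * sigma * sigma))
    return k / k.sum()

def gaussian_blur_separable(img, sigma=np.sqrt(2)):
    """Two passes of the 1D kernel: along rows, then along columns."""
    k = gaussian_kernel_1d(sigma)
    r = len(k) // 2
    pad = np.pad(img, r, mode="reflect")  # border handling (assumed)
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, pad)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)
```

Separability reduces the cost per pixel from O(k²) multiplies for a 2D kernel to O(2k) for the two 1D passes.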
The input image is first convolved with the Gaussian function using σ = √2 to give an image A. This is then repeated a second time with a further incremental smoothing of σ = √2 to give a new image, B, which now has an effective smoothing of σ = 2. The difference-of-Gaussian function is obtained by subtracting image B from A, resulting in a ratio of 2/√2 = √2 between the two Gaussians.
To generate the next pyramid level, we resample the already smoothed image B using bilinear interpolation with a pixel spacing of 1.5 in each direction. While it may seem more natural to resample with a relative scale of √2, the only constraint is that sampling be frequent enough to detect peaks. The 1.5 spacing means that each new sample will be a constant linear combination of 4 adjacent pixels. This is efficient to compute and minimizes aliasing artifacts that would arise from changing the resampling coefficients.
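A minimal NumPy sketch of one pyramid level, under the assumptions above (σ = √2 smoothing twice, DoG by subtraction, and 1.5-pixel bilinear resampling of B); the helper name `dog_pyramid_level` and the reflected borders are choices made here, not from the paper:

```python
import numpy as np

def _blur(img, sigma):
    # separable Gaussian with a 7-tap kernel and reflected borders
    x = np.arange(-3, 4, dtype=float)
    k = np.exp(-x * x / (2 * sigma * sigma)); k /= k.sum()
    p = np.pad(img, 3, mode="reflect")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, p)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def dog_pyramid_level(img):
    """A = smooth(img, sqrt(2)); B = smooth(A, sqrt(2)), effective sigma 2.
    Returns the DoG image A - B and the next level: B resampled with a
    pixel spacing of 1.5 using bilinear interpolation."""
    A = _blur(img, np.sqrt(2))
    B = _blur(A, np.sqrt(2))
    dog = A - B
    ys = np.arange(0, B.shape[0] - 1, 1.5)
    xs = np.arange(0, B.shape[1] - 1, 1.5)
    y0 = ys.astype(int); x0 = xs.astype(int)
    fy = (ys - y0)[:, None]; fx = (xs - x0)[None, :]
    # each new sample is a fixed linear combination of 4 adjacent pixels
    nxt = ((1 - fy) * (1 - fx) * B[np.ix_(y0, x0)]
           + (1 - fy) * fx * B[np.ix_(y0, x0 + 1)]
           + fy * (1 - fx) * B[np.ix_(y0 + 1, x0)]
           + fy * fx * B[np.ix_(y0 + 1, x0 + 1)])
    return dog, nxt
```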
Maxima and minima of this scale-space function are determined by comparing each pixel in the pyramid to its neighbours. First, a pixel is compared to its 8 neighbours at the same level of the pyramid. If it is a maximum or minimum at this level, then the closest pixel location is calculated at the next lowest level of the pyramid, taking account of the 1.5 times resampling. If the pixel remains higher (or lower) than this closest pixel and its 8 neighbours, then the test is repeated for the level above. Since most pixels will be eliminated within a few comparisons, the cost of this detection is small and much lower than that of building the pyramid.
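The first, cheapest stage of this test — the 8-neighbour comparison within a single level — can be sketched as below; the full method would go on to check the closest pixels at the levels below and above, which is omitted here:

```python
import numpy as np

def local_extrema_8(dog):
    """Return (row, col) positions that are strict maxima or minima of
    their 8 neighbours at the same pyramid level (stage one of the
    staged comparison; the cross-level checks are not shown)."""
    c = dog[1:-1, 1:-1]
    neigh = np.stack([dog[1 + di:dog.shape[0] - 1 + di,
                          1 + dj:dog.shape[1] - 1 + dj]
                      for di in (-1, 0, 1) for dj in (-1, 0, 1)
                      if not (di == 0 and dj == 0)])
    is_max = (c > neigh).all(axis=0)
    is_min = (c < neigh).all(axis=0)
    r, cc = np.nonzero(is_max | is_min)
    return list(zip(r + 1, cc + 1))
```

Because most pixels fail within the first few comparisons, a production implementation would short-circuit per pixel rather than vectorize over the whole image, but the result is the same.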
If the first level of the pyramid is sampled at the same rate as the input image, the highest spatial frequencies will be ignored. This is due to the initial smoothing, which is needed to provide separation of peaks for robust detection. Therefore, we expand the input image by a factor of 2, using bilinear interpolation, prior to building the pyramid. This gives on the order of 1000 key points for a typical 512×512 pixel image, compared to only a quarter as many without the initial expansion.
3.1. SIFT key stability
To characterize the image at each key location, the smoothed image A at each level of the pyramid is processed to extract image gradients and orientations. At each pixel, A_ij, the image gradient magnitude, M_ij, and orientation, R_ij, are computed using pixel differences:

    M_ij = √((A_ij − A_{i+1,j})² + (A_ij − A_{i,j+1})²)
    R_ij = atan2(A_ij − A_{i+1,j}, A_{i,j+1} − A_ij)
The pixel differences are efficient to compute and provide sufficient accuracy due to the substantial level of previous smoothing. The effective half-pixel shift in position is compensated for when determining key location.
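These pixel-difference formulas translate directly to NumPy; the function name here is illustrative:

```python
import numpy as np

def grad_mag_ori(A):
    """Gradient magnitude M_ij and orientation R_ij from pixel
    differences; both outputs are one pixel smaller than A in each
    dimension, since forward differences are used."""
    dv = A[:-1, :-1] - A[1:, :-1]   # A_ij - A_{i+1,j}
    dh = A[:-1, 1:] - A[:-1, :-1]   # A_{i,j+1} - A_ij
    M = np.hypot(dv, dh)
    R = np.arctan2(dv, dh)
    return M, R
```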
Robustness to illumination change is enhanced by thresholding the gradient magnitudes at a value of 0.1 times the maximum possible gradient value. This reduces the effect of a change in illumination direction for a surface with 3D relief, as an illumination change may result in large changes to gradient magnitude but is likely to have less influence on gradient orientation.

Figure 1: The second image was generated from the first by rotation, scaling, stretching, change of brightness and contrast, and addition of pixel noise. In spite of these changes, 78% of the keys from the first image have a closely matching key in the second image. These examples show only a subset of the keys to reduce clutter.
Each key location is assigned a canonical orientation so that the image descriptors are invariant to rotation. In order to make this as stable as possible against lighting or contrast changes, the orientation is determined by the peak in a histogram of local image gradient orientations. The orientation histogram is created using a Gaussian-weighted window with σ of 3 times that of the current smoothing scale. These weights are multiplied by the thresholded gradient values and accumulated in the histogram at locations corresponding to the orientation, R_ij. The histogram has 36 bins covering the 360 degree range of rotations, and is smoothed prior to peak selection.
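A sketch of this orientation assignment, assuming thresholding has already been applied to `mag`; the simple three-tap circular smoothing and the return of the bin centre are choices made here for illustration:

```python
import numpy as np

def canonical_orientation(mag, ori, center, sigma_w, n_bins=36):
    """Peak of a Gaussian-weighted, smoothed 36-bin histogram of gradient
    orientations.  mag/ori: magnitude and orientation arrays (radians),
    center: (row, col) of the key, sigma_w: weighting sigma (3x the
    current smoothing scale in the text).  Returns radians."""
    h, w = mag.shape
    yy, xx = np.mgrid[0:h, 0:w]
    wgt = np.exp(-((yy - center[0]) ** 2 + (xx - center[1]) ** 2)
                 / (2.0 * sigma_w ** 2))
    bins = ((ori % (2 * np.pi)) / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins.ravel(), weights=(wgt * mag).ravel(),
                       minlength=n_bins)
    # smooth circularly before selecting the peak
    hist = (np.roll(hist, 1) + hist + np.roll(hist, -1)) / 3.0
    peak = int(np.argmax(hist))
    return (peak + 0.5) * 2 * np.pi / n_bins  # centre of the peak bin
```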
The stability of the resulting keys can be tested by subjecting natural images to affine projection, contrast and brightness changes, and addition of noise. The location of each key detected in the first image can be predicted in the transformed image from knowledge of the transform parameters. This framework was used to select the various sampling and smoothing parameters given above, so that maximum efficiency could be obtained while retaining stability to changes.

    Image transformation            Match%   Ori%
    A. Increase contrast by 1.2      89.0    86.6
    B. Decrease intensity by 0.2     88.5    85.9
    C. Rotate by 20 degrees          85.4    81.0
    D. Scale by 0.7                  85.1    80.3
    E. Stretch by 1.2                83.5    76.1
    F. Stretch by 1.5                77.7    65.0
    G. Add 10% pixel noise           90.3    88.4
    H. All of A, B, C, D, E, G       78.6    71.8

Figure 2: For various image transformations applied to a sample of 20 images, this table gives the percent of keys that are found at matching locations and scales (Match%) and that also match in orientation (Ori%).
Figure 1 shows a relatively small number of keys detected over a 2 octave range of only the larger scales (to avoid excessive clutter). Each key is shown as a square, with a line from the center to one side of the square indicating orientation. In the second half of this figure, the image is rotated by 15 degrees, scaled by a factor of 0.9, and stretched by a factor of 1.1 in the horizontal direction. The pixel intensities, in the range of 0 to 1, have 0.1 subtracted from their brightness values and the contrast reduced by multiplication by 0.9. Random pixel noise is then added to give less than 5 bits/pixel of signal. In spite of these transformations, 78% of the keys in the first image had closely matching keys in the second image at the predicted locations, scales, and orientations.
The overall stability of the keys to image transformations can be judged from Figure 2. Each entry in this table is generated from combining the results of 20 diverse test images and summarizes the matching of about 15,000 keys. Each line of the table shows a particular image transformation. The first figure gives the percent of keys that have a matching key in the transformed image within σ in location (relative to scale for that key) and a factor of 1.5 in scale. The second column gives the percent that match these criteria as well as having an orientation within 20 degrees of the prediction.
4. Local image description
Given a stable location, scale, and orientation for each key, it is now possible to describe the local image region in a manner invariant to these transformations. In addition, it is desirable to make this representation robust against small shifts in local geometry, such as arise from affine or 3D projection. One approach to this is suggested by the response properties of complex neurons in the visual cortex, in which a feature position is allowed to vary over a small region while orientation and spatial frequency specificity are maintained. Edelman, Intrator & Poggio [5] have performed experiments that simulated the responses of complex neurons to different 3D views of computer graphic models, and found that the complex cell outputs provided much better discrimination than simple correlation-based matching. This can be seen, for example, if an affine projection stretches an image in one direction relative to another, which changes the relative locations of gradient features while having a smaller effect on their orientations and spatial frequencies.
This robustness to local geometric distortion can be obtained by representing the local image region with multiple images representing each of a number of orientations (referred to as orientation planes). Each orientation plane contains only the gradients corresponding to that orientation, with linear interpolation used for intermediate orientations. Each orientation plane is blurred and resampled to allow for larger shifts in positions of the gradients.
This approach can be efficiently implemented by using the same precomputed gradients and orientations for each level of the pyramid that were used for orientation selection. For each keypoint, we use the pixel sampling from the pyramid level at which the key was detected. The pixels that fall in a circle of radius 8 pixels around the key location are inserted into the orientation planes. The orientation is measured relative to that of the key by subtracting the key's orientation. For our experiments we used 8 orientation planes, each sampled over a 4×4 grid of locations, with a sample spacing 4 times that of the pixel spacing used for gradient detection. The blurring is achieved by allocating the gradient of each pixel among its 8 closest neighbors in the sample grid, using linear interpolation in orientation and the two spatial dimensions. This implementation is much more efficient than performing explicit blurring and resampling, yet gives almost equivalent results.
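A simplified sketch of this descriptor for one pyramid level (8 planes × 4×4 grid = 128 values), assuming gradient arrays indexed in the key's pyramid level; rotation of the sampling grid by the key orientation is omitted to keep the sketch short, and the helper names are illustrative:

```python
import numpy as np

def _lin(f, n, wrap):
    """The two nearest bins of fractional position f with linear weights."""
    i0 = int(np.floor(f)); w1 = f - i0
    out = []
    for i, w in ((i0, 1.0 - w1), (i0 + 1, w1)):
        if wrap:
            out.append((i % n, w))
        elif 0 <= i < n:
            out.append((i, w))
    return out

def sift_key_vector(mag, ori, key_rc, key_ori, n_ori=8, grid=4, spacing=4):
    """Distribute gradients within a radius-8 circle of the key over
    n_ori orientation planes, each a grid x grid array of samples with
    the given spacing, using linear interpolation in orientation and in
    the two spatial dimensions (the 8 closest sample-grid neighbours)."""
    planes = np.zeros((n_ori, grid, grid))
    r0, c0 = key_rc
    for dr in range(-8, 9):
        for dc in range(-8, 9):
            if dr * dr + dc * dc > 64:
                continue  # outside the radius-8 circle
            r, c = r0 + dr, c0 + dc
            if not (0 <= r < mag.shape[0] and 0 <= c < mag.shape[1]):
                continue
            a = (ori[r, c] - key_ori) % (2 * np.pi)  # relative orientation
            gy = (dr + 8) / spacing - 0.5            # grid coordinates
            gx = (dc + 8) / spacing - 0.5
            go = a / (2 * np.pi) * n_ori             # orientation bin coord
            for y, wy in _lin(gy, grid, wrap=False):
                for x, wx in _lin(gx, grid, wrap=False):
                    for o, wo in _lin(go, n_ori, wrap=True):
                        planes[o, y, x] += wy * wx * wo * mag[r, c]
    return planes.ravel()
```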
In order to sample the image at a larger scale, the same process is repeated for a second level of the pyramid one octave higher. However, this time a 2×2 rather than a 4×4 sample region is used. This means that approximately the same image region will be examined at both scales, so that any nearby occlusions will not affect one scale more than the other. Therefore, the total number of samples in the SIFT key vector, from both scales, is 8×4×4 + 8×2×2, or 160 elements, giving enough measurements for high specificity.
5. Indexing and matching
For indexing, we need to store the SIFT keys for sample images and then identify matching keys from new images. The problem of identifying the most similar keys for high dimensional vectors is known to have high complexity if an exact solution is required. However, a modification of the k-d tree algorithm called the best-bin-first search method (Beis & Lowe [3]) can identify the nearest neighbors with high probability using only a limited amount of computation. To further improve the efficiency of the best-bin-first algorithm, the SIFT key samples generated at the larger scale are given twice the weight of those at the smaller scale. This means that the larger scale is in effect able to filter the most likely neighbours for checking at the smaller scale. This also improves recognition performance by giving more weight to the least-noisy scale. In our experiments, it is possible to have a cut-off for examining at most 200 neighbors in a probabilistic best-bin-first search of 30,000 key vectors with almost no loss of performance compared to finding an exact solution.
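The scale weighting can be illustrated as below; a linear scan stands in for the best-bin-first k-d tree search, and the layout of the 160-element vector (128 smaller-scale samples followed by the 32 larger-scale ones) is an assumption of this sketch:

```python
import numpy as np

def match_key(query, db_keys):
    """Nearest neighbour over 160-element SIFT key vectors, with the
    larger-scale samples (assumed to be the last 32 elements) given
    twice the weight of the 128 smaller-scale samples.  Returns the
    index of the best match and its weighted distance."""
    w = np.ones(160)
    w[128:] = 2.0  # double weight on the larger, less noisy scale
    d = np.linalg.norm((np.asarray(db_keys) - query) * w, axis=1)
    i = int(np.argmin(d))
    return i, float(d[i])
```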
An efficient way to cluster reliable model hypotheses is to use the Hough transform [1] to search for keys that agree upon a particular model pose. Each model key in the database contains a record of the key's parameters relative to the model coordinate system. Therefore, we can create an entry in a hash table predicting the model location, orientation, and scale from the match hypothesis. We use a bin size of 30 degrees for orientation, a factor of 2 for scale, and 0.25 times the maximum model dimension for location. These rather broad bin sizes allow for clustering even in the presence of substantial geometric distortion, such as due to a change in 3D viewpoint. To avoid the problem of boundary effects in hashing, each hypothesis is hashed into the 2 closest bins in each dimension, giving a total of 16 hash table entries for each hypothesis.
6. Solution for affine parameters
The hash table is searched to identify all clusters of at least 3 entries in a bin, and the bins are sorted into decreasing order of size. Each such cluster is then subject to a verification procedure in which a least-squares solution is performed for the affine projection parameters relating the model to the image.
The affine transformation of a model point [x y]^T to an image point [u v]^T can be written as

    [u]   [m1 m2] [x]   [tx]
    [v] = [m3 m4] [y] + [ty]

where the model translation is [tx ty]^T and the affine rotation, scale, and stretch are represented by the parameters m1, m2, m3 and m4.
Figure 3: Model images of planar objects are shown in the top row. Recognition results below show model outlines and image keys used for matching.

We wish to solve for the transformation parameters, so the equation above can be rewritten as

    [x  y  0  0  1  0] [m1]   [u]
    [0  0  x  y  0  1] [m2] = [v]
    [      ...       ] [m3]   [...]
                       [m4]
                       [tx]
                       [ty]

This equation shows a single match, but any number of further matches can be added, with each match contributing two more rows to the first and last matrix. At least 3 matches are needed to provide a solution.
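As a sketch, stacking two rows per match and solving the system in the least-squares sense looks like this; `solve_affine` is an illustrative name:

```python
import numpy as np

def solve_affine(model_pts, image_pts):
    """Least-squares solution for the affine parameters
    (m1, m2, m3, m4, tx, ty) mapping model points [x, y] to image
    points [u, v].  Each match contributes two rows; at least 3
    matches are required for the 6 unknowns."""
    A, b = [], []
    for (x, y), (u, v) in zip(model_pts, image_pts):
        A.append([x, y, 0, 0, 1, 0]); b.append(u)
        A.append([0, 0, x, y, 0, 1]); b.append(v)
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    params, _, _, _ = np.linalg.lstsq(A, b, rcond=None)
    return params  # m1, m2, m3, m4, tx, ty
```

A low residual from this fit is what the verification procedure checks before accepting a cluster as a detection.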
We can write this linear system as Ax = b.
The least-squares solution for the parameters x can be deter-