Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories
Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce
CVR-TR-2005-04
Abstract
This paper presents a method for recognizing scene categories based on approximate global geometric correspondence. This technique works by partitioning the image into increasingly fine sub-regions and computing histograms of local features found inside each sub-region. The resulting "spatial pyramid" is a simple and computationally efficient extension of an orderless bag-of-features image representation, and it shows significantly improved performance on challenging scene categorization tasks. Specifically, our proposed method exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories. The spatial pyramid framework also offers insights into the success of several recently proposed image descriptions, including Torralba's "gist" and Lowe's SIFT descriptors.
1. Introduction
In this paper, we consider the problem of recognizing the semantic category of an image. For example, we may want to classify a photograph as depicting a scene (forest, street, office, etc.) or as containing a certain object of interest. For such whole-image categorization tasks, bag-of-features methods, which represent an image as an orderless collection of local features, have recently demonstrated impressive levels of performance [5, 21, 22, 23]. However, because these methods disregard all information about the spatial layout of the features, they have severely limited descriptive ability. In particular, they are incapable of capturing shape or of segmenting an object from its background. Unfortunately, overcoming these limitations to build effective structural object descriptions has proven to be quite challenging, especially when the recognition system must be made to work in the presence of heavy clutter, occlusion, or large viewpoint changes. Approaches based on generative part models [10, 3] and geometric correspondence search [1, 9] achieve robustness at significant computational expense. A more efficient approach is to augment a basic bag-of-features representation with pairwise relations between neighboring local features, but existing implementations of this idea [9, 17] have yielded inconclusive results. One other strategy for increasing robustness to geometric deformations is to increase the level of invariance of local features (e.g., by using affine-invariant detectors), but a recent large-scale evaluation [23] suggests that this strategy usually does not pay off.
Though we remain sympathetic to the goal of developing robust and geometrically invariant structural object representations, we propose in this paper to revisit "global" non-invariant representations based on aggregating statistics of local features over fixed subregions. We introduce a kernel-based recognition method that works by computing rough geometric correspondence on a global scale using an efficient approximation technique adapted from the pyramid matching scheme of Grauman and Darrell [5]. Our method involves repeatedly subdividing the image and computing histograms of local features at increasingly fine resolutions. As shown by experiments in Section 5, this simple operation suffices to significantly improve performance over a basic bag-of-features representation, and even over methods based on detailed geometric correspondence.
Previous research has shown that statistical properties of the scene considered in a holistic fashion, without any analysis of its constituent objects, yield a rich set of cues to its semantic category [13]. Our own experiments confirm that global representations can be surprisingly effective not only for identifying the overall scene, but also for categorizing images as containing specific objects, even when these objects are embedded in heavy clutter and vary significantly in pose and appearance. This said, we do not advocate the direct use of a global method for object recognition (except for very restricted sorts of imagery). Instead, we envision a subordinate role for this method. It may be used to capture the "gist" of an image [20] and to inform the subsequent search for specific objects (e.g., if the image, based on its global description, is likely to be a highway, we have a high probability of finding a car, but not a toaster). In addition, the simplicity and efficiency of our proposed method, in combination with its tendency to yield unexpectedly high recognition rates on seemingly challenging data, could make it a good baseline for "calibrating" newly acquired datasets and for evaluating more sophisticated recognition approaches.
2. Previous Work
In computer vision, histograms have a long history as a method for image description (see, e.g., [16, 18]). Koenderink and Van Doorn [8] have generalized histograms to locally orderless images, or histogram-valued scale spaces (i.e., for each Gaussian aperture at a given location and scale, the locally orderless image returns the histogram of image features aggregated over that aperture). Our spatial pyramid approach can be thought of as an alternative formulation of a locally orderless image, where instead of a Gaussian scale space of apertures, we define a fixed hierarchy of rectangular windows. Koenderink and Van Doorn have argued persuasively that locally orderless images play an important role in visual perception. Our retrieval experiments (Fig. 4) confirm that spatial pyramids can capture perceptually salient features and suggest that "locally orderless matching" may be a powerful mechanism for estimating overall perceptual similarity between images.
It is important to contrast our proposed approach with multiresolution histograms [6], which involve repeatedly subsampling an image and computing a global histogram of pixel values at each new level. In other words, a multiresolution histogram varies the resolution at which the features (intensity values) are computed, but the histogram resolution (intensity scale) stays fixed. We take the opposite approach of fixing the resolution at which the features are computed, but varying the spatial resolution at which they are aggregated. This results in a higher-dimensional representation that preserves more information (e.g., an image consisting of thin black and white stripes would retain two modes at every level of a spatial pyramid, whereas it would become indistinguishable from a uniformly gray image at all but the finest levels of a multiresolution histogram). Finally, unlike a multiresolution histogram, a spatial pyramid, when equipped with an appropriate kernel, can be used for approximate geometric matching.
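The striped-image example can be verified in a few lines of numpy (our own illustration, not code from the paper):

    import numpy as np

    def hist2(x):
        """Normalized two-bin intensity histogram over [0, 1]."""
        h = np.histogram(x, bins=2, range=(0, 1))[0]
        return h / h.sum()

    stripes = np.tile([0.0, 1.0], (16, 8))   # 16x16 black/white columns
    gray = np.full((16, 16), 0.5)            # uniform gray image

    # Multiresolution histogram: subsample first, histogram globally.
    # Averaging adjacent stripe pixels gives 0.5 everywhere, so the
    # stripes become indistinguishable from the gray image.
    coarse = stripes.reshape(16, 8, 2).mean(axis=2)
    print(hist2(coarse), hist2(gray))        # identical histograms

    # Spatial pyramid: histogram full-resolution values inside each
    # cell of a 2x2 grid; every cell retains both intensity modes.
    for i in range(2):
        for j in range(2):
            print(hist2(stripes[i*8:(i+1)*8, j*8:(j+1)*8]))  # [0.5 0.5]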
The operation of "subdivide and disorder" (i.e., partition the image into subblocks and compute histograms, or histogram statistics such as means, of local features in these subblocks) has been practiced numerous times in computer vision, both for global image description [4, 19, 20] and for local description of interest regions [12]. Thus, though the operation itself seems fundamental, previous methods leave open the question of what is the right subdivision scheme (although a regular 4×4 grid seems to be the most popular implementation choice), and what is the right balance between "subdividing" and "disordering." The spatial pyramid framework suggests a possible way to address this issue: namely, the best results may be achieved when multiple resolutions are combined in a principled way. It also suggests that the reason for the empirical success of "subdivide and disorder" techniques is the fact that they actually perform approximate geometric matching.
3. Spatial Pyramid Matching
We first describe the original formulation of pyramid matching [5], and then introduce our application of this framework to create a spatial pyramid image representation.
3.1. Pyramid Match Kernels
Let X and Y be two sets of vectors in a d-dimensional feature space. Grauman and Darrell [5] propose pyramid matching to find an approximate correspondence between these two sets. Informally, pyramid matching works by placing a sequence of increasingly coarser grids over the feature space and taking a weighted sum of the number of matches that occur at each level of resolution. At any fixed resolution, two points are said to match if they fall into the same cell of the grid; matches found at finer resolutions are weighted more highly than matches found at coarser resolutions. More specifically, we construct a sequence of grids at resolutions 0, ..., L, such that the grid at level ℓ has 2^ℓ cells along each dimension, for a total of D = 2^{dℓ} cells. Let H_X^ℓ and H_Y^ℓ denote the histograms of X and Y at this resolution, so that H_X^ℓ(i) and H_Y^ℓ(i) are the numbers of points from X and Y that fall into the i-th cell of the grid. Then the number of matches at level ℓ is given by the histogram intersection function [18]:
    I(H_X^\ell, H_Y^\ell) = \sum_{i=1}^{D} \min\bigl( H_X^\ell(i), H_Y^\ell(i) \bigr).    (1)
In the following, we will abbreviate I(H_X^ℓ, H_Y^ℓ) to I^ℓ. Note that the number of matches found at level ℓ also includes all the matches found at the finer level ℓ + 1. Therefore, the number of new matches found at level ℓ is given by I^ℓ − I^{ℓ+1} for ℓ = 0, ..., L − 1. The weight associated with level ℓ is set to 1/2^{L−ℓ}, which is inversely proportional to cell width at that level. Intuitively, we want to penalize matches found in larger cells because they involve increasingly dissimilar features.
Putting all the pieces together, we get the following definition of a pyramid match kernel:
    \kappa^L(X, Y) = I^L + \sum_{\ell=0}^{L-1} \frac{1}{2^{L-\ell}} \bigl( I^\ell - I^{\ell+1} \bigr)    (2)

                   = \frac{1}{2^L} I^0 + \sum_{\ell=1}^{L} \frac{1}{2^{L-\ell+1}} I^\ell.    (3)
Both the histogram intersection and the pyramid match kernel are Mercer kernels [5].
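To make the construction concrete, here is a short Python sketch of ours (not code from the paper) implementing eqs. (1)-(3), under the assumption that each point set is an (n, d) array with coordinates scaled to [0, 1):

    import numpy as np

    def grid_histogram(points, level, d):
        """Histogram of points over a grid with 2**level cells per
        dimension; `points` is an (n, d) array with coords in [0, 1)."""
        bins = 2 ** level
        idx = np.clip((points * bins).astype(int), 0, bins - 1)
        flat = np.ravel_multi_index(tuple(idx.T), (bins,) * d)
        return np.bincount(flat, minlength=bins ** d)

    def histogram_intersection(hx, hy):
        """Eq. (1): the number of matches at a single resolution."""
        return np.minimum(hx, hy).sum()

    def pyramid_match_kernel(X, Y, L, d):
        """Eq. (3): weighted sum of intersections over levels 0..L."""
        I = [histogram_intersection(grid_histogram(X, l, d),
                                    grid_histogram(Y, l, d))
             for l in range(L + 1)]
        k = I[0] / 2 ** L                   # level 0 carries weight 1/2**L
        for l in range(1, L + 1):
            k += I[l] / 2 ** (L - l + 1)    # level l carries weight 1/2**(L-l+1)
        return k

For the spatial matching scheme of Section 3.2, d = 2 and X, Y hold the image coordinates of the features from a single channel.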
Figure 1. Toy example of constructing a three-level pyramid. The image has three feature types, indicated by circles, diamonds, and crosses. At the top, we subdivide the image at three different levels of resolution. Next, for each level of resolution and each channel, we count the features that fall in each spatial bin. Finally, we weight each spatial histogram according to eq. (3).
3.2. Spatial Matching Scheme
As introduced in [5], a pyramid match kernel works with an orderless image representation. It allows for precise matching of two collections of features in a high-dimensional appearance space, but discards all spatial information. This paper advocates an "orthogonal" approach: perform pyramid matching in the two-dimensional image space, and use traditional clustering techniques in feature space.¹ Specifically, we quantize all feature vectors into M discrete types, and make the simplifying assumption that only features of the same type can be matched to one another. Each channel m gives us two sets of two-dimensional vectors, X_m and Y_m, representing the coordinates of features of type m found in the respective images. The final kernel is then the sum of the separate channel kernels:
    K^L(X, Y) = \sum_{m=1}^{M} \kappa^L(X_m, Y_m).    (4)
This approach has the advantage of maintaining continuity with the popular "visual vocabulary" paradigm; in fact, it reduces to a standard bag of features when L = 0.
Because the pyramid match kernel (3) is simply a weighted sum of histogram intersections, and because c min(a, b) = min(ca, cb) for positive numbers, we can implement K^L as a single histogram intersection of "long" vectors formed by concatenating the appropriately weighted histograms of all channels at all resolutions (Fig. 1). For L levels and M channels, the resulting vector has dimensionality M \sum_{\ell=0}^{L} 4^\ell = M (4^{L+1} - 1)/3. Several experiments reported in Section 5 use the settings of M = 400 and L = 3, resulting in 34000-dimensional histogram intersections. However, these operations are efficient because
the histogram vectors are extremely sparse (in fact, just as in [5], the computational complexity of the kernel is linear in the number of features). It must also be noted that we did not observe any significant increase in performance beyond M = 200 and L = 2, where the concatenated histograms are only 4200-dimensional.

¹ In principle, it is possible to integrate geometric information directly into the original pyramid matching framework by treating image coordinates as two extra dimensions in the feature space.
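The long-vector trick can be sketched as follows (our own illustration; the function names and the normalization choice are ours). Each image becomes a single weighted histogram whose pairwise intersections equal K^L of eq. (4):

    import numpy as np

    def spatial_pyramid_vector(channels, xy, M, L):
        """Concatenated, weighted spatial histograms for one image.

        channels : (n,) int array of feature types in [0, M)
        xy       : (n, 2) array of feature coordinates scaled to [0, 1)
        """
        channels = np.asarray(channels, dtype=int)
        parts = []
        for l in range(L + 1):
            w = 1 / 2 ** L if l == 0 else 1 / 2 ** (L - l + 1)
            bins = 2 ** l
            cell = np.clip((xy * bins).astype(int), 0, bins - 1)
            flat = channels * bins * bins + cell[:, 0] * bins + cell[:, 1]
            parts.append(w * np.bincount(flat, minlength=M * bins * bins))
        vec = np.concatenate(parts)
        # Dimensionality check: M * (4**(L+1) - 1) / 3, e.g. 34000 for
        # M = 400, L = 3 and 4200 for M = 200, L = 2.
        assert len(vec) == M * (4 ** (L + 1) - 1) // 3
        # Normalize by the total number of features (see the
        # normalization discussion below).
        return vec / len(channels)

Intersecting two such vectors, np.minimum(u, v).sum(), then evaluates the kernel in a single pass over two sparse vectors.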
The final implementation issue is that of normalization. For maximum computational efficiency, we normalize all histograms by the total weight of all features in the image, in effect forcing the total number of features in all images to be the same. Because we use a dense feature representation (see Section 4), and thus do not need to worry about spurious feature detections resulting from clutter, this practice is sufficient to deal with the effects of variable image size.
4. Feature Extraction
This section briefly describes the two kinds of features used in the experiments of Section 5. First, we have so-called "weak features," which are oriented edge points, i.e., points whose gradient magnitude in a given direction exceeds a minimum threshold. We extract edge points at two scales and eight orientations, for a total of M = 16 channels. We designed these features to obtain a representation similar to the "gist" [20] or to a global SIFT descriptor [12] of the image.
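A minimal sketch of one plausible implementation of such weak features (the paper does not specify the filters or the threshold, so those choices below are placeholders):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def weak_features(img, scales=(1.0, 2.0), n_orient=8, thresh=0.1):
        """Oriented edge points: for each of 2 scales x 8 orientations,
        the (row, col) coordinates where the directional derivative
        magnitude exceeds a threshold, giving M = 16 channels.
        `img` is a grayscale float array in [0, 1]."""
        channels = []
        for s in scales:
            gy = gaussian_filter(img, s, order=(1, 0))  # d/dy
            gx = gaussian_filter(img, s, order=(0, 1))  # d/dx
            for k in range(n_orient):
                theta = np.pi * k / n_orient
                d = gx * np.cos(theta) + gy * np.sin(theta)
                channels.append(np.argwhere(np.abs(d) > thresh))
        return channels  # list of 16 coordinate arrays, one per channel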
For better discriminative power, we also utilize higher-dimensional "strong features," which are SIFT descriptors of 16×16 pixel patches computed over a grid with spacing of 8 pixels. Our decision to use a dense regular grid instead of interest points was based on the comparative evaluation of Li and Perona [11], who have shown that dense features work better for scene classification. Intuitively, a dense image description is necessary to capture uniform regions such as sky, calm water, or road surface (to deal with low-contrast regions, we skip the usual SIFT normalization procedure when the overall gradient magnitude of the patch is too weak). We perform k-means clustering of a random subset of patches from the training set to form a visual vocabulary. Typical vocabulary sizes for our experiments are M = 200 and M = 400.
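Vocabulary construction could look like the following sketch, using OpenCV's SIFT and scikit-learn's k-means as stand-in tooling (the keypoint size approximates the paper's 16×16 patches; the authors' exact implementation may differ):

    import cv2
    import numpy as np
    from sklearn.cluster import KMeans

    def dense_sift(gray, step=8, size=16):
        """SIFT descriptors on a regular grid (gray: uint8 image)."""
        sift = cv2.SIFT_create()
        h, w = gray.shape
        kps = [cv2.KeyPoint(float(x), float(y), float(size))
               for y in range(size // 2, h - size // 2, step)
               for x in range(size // 2, w - size // 2, step)]
        _, desc = sift.compute(gray, kps)
        return desc  # (n_patches, 128) float32

    def build_vocabulary(train_images, M=200, n_samples=100_000, seed=0):
        """k-means over a random subset of training patches -> M words."""
        descs = np.vstack([dense_sift(img) for img in train_images])
        rng = np.random.default_rng(seed)
        pick = rng.choice(len(descs), min(n_samples, len(descs)),
                          replace=False)
        return KMeans(n_clusters=M, n_init=3, random_state=seed).fit(descs[pick])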
5. Experiments
In this section, we report results on three diverse datasets: fifteen scene categories [11], Caltech-101 [10], and Graz [14]. We perform all processing in grayscale, even when color images are available. All experiments are repeated ten times with different randomly selected training and test images, and the average of per-class recognition rates² is recorded for each run. The final result is reported as the mean and standard deviation of the results from the individual runs.

² The alternative performance measure, the percentage of all test images classified correctly, can be biased if test set sizes for different classes vary significantly. This is especially true of the Caltech-101 dataset, where some of the "easiest" classes are disproportionately large.
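Concretely, the per-class measure can be computed as in this small sketch of ours:

    import numpy as np

    def per_class_average(y_true, y_pred):
        """Mean of per-class recognition rates. Unlike plain accuracy,
        this is not biased toward disproportionately large classes."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        rates = [np.mean(y_pred[y_true == c] == c)
                 for c in np.unique(y_true)]
        return float(np.mean(rates))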
Figure 2. Example images from the scene category database: office, kitchen, living room, bedroom, store, industrial, tall building*, inside city*, street*, highway*, coast*, open country*, mountain*, forest*, and suburb. The starred categories originate from Oliva and Torralba [13].
    L        | Weak features (M=16)     | Strong features (M=200)  | Strong features (M=400)
             | Single-level | Pyramid   | Single-level | Pyramid   | Single-level | Pyramid
    0 (1×1)  | 45.3 ±0.5    |           | 72.2 ±0.6    |           | 74.8 ±0.3    |
    1 (2×2)  | 53.6 ±0.3    | 56.2 ±0.6 | 77.9 ±0.6    | 79.0 ±0.5 | 78.8 ±0.4    | 80.1 ±0.5
    2 (4×4)  | 61.7 ±0.6    | 64.7 ±0.7 | 79.4 ±0.3    | 81.1 ±0.3 | 79.7 ±0.5    | 81.4 ±0.5
    3 (8×8)  | 63.3 ±0.8    | 66.8 ±0.6 | 77.2 ±0.4    | 80.7 ±0.3 | 77.2 ±0.5    | 81.1 ±0.6

Table 1. Classification results for the scene category database; the highest results for each kind of features are shown in bold.
Multi-class classification is done with a support vector machine (SVM) trained using the one-versus-all rule: a classifier is learned to separate each class from the rest, and a test image is assigned the label of the classifier with the highest response.
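With the pyramid kernel precomputed as a Gram matrix, a one-versus-all SVM can be set up as in the following sketch (scikit-learn is our assumed tooling, not necessarily what was originally used):

    import numpy as np
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    def intersection_gram(A, B):
        """Gram matrix of histogram intersections between the rows of A
        (e.g., test pyramid vectors) and the rows of B (training ones)."""
        return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

    def train_and_predict(train_vecs, y_train, test_vecs):
        """One-versus-all SVM on the precomputed spatial pyramid kernel."""
        clf = OneVsRestClassifier(SVC(kernel="precomputed"))
        clf.fit(intersection_gram(train_vecs, train_vecs), y_train)
        return clf.predict(intersection_gram(test_vecs, train_vecs))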
5.1. Scene Category Recognition
Our first dataset (Fig. 2) is composed of fifteen scene categories: thirteen were provided by Li and Perona [11] (eight of these were originally collected by Oliva and Torralba [13]), and two (industrial and store) were collected by ourselves. Each category has 200 to 400 images, and average image size is 300×250 pixels. The major sources of the pictures in the dataset include the COREL collection, personal photographs, and Google image search. As stated in [11], this is probably the most complete scene category dataset used in the literature thus far.
Table 1 shows detailed results of classification experiments using 100 images per class for training and the rest for testing (the same setup as [11]). First, let us examine the performance of strong features for L = 0 and M = 200, corresponding to a standard bag of features. Our classification rate is 72.2% (74.7% for the 13 classes inherited from Li and Perona), which is much higher than their best result of 65.2%, achieved with an orderless method and a feature set comparable to ours. We conjecture that Li and Perona's approach is disadvantaged by its reliance on latent Dirichlet allocation (LDA) [2], which is essentially an unsupervised dimensionality reduction technique and as such, is not necessarily conducive to achieving the highest classification accuracy. To verify this, we have experimented with probabilistic latent semantic analysis (pLSA) [7], which attempts to explain the distribution of features in the image as a mixture of a few "scene topics" or "aspects" and performs very similarly to LDA in practice [17]. Following the scheme of Quelhas et al. [15], we run pLSA in an unsupervised setting to learn a 60-aspect model of half the training images. Next, we apply this model to the other half to obtain probabilities of topics given each image (thus reducing the dimensionality of the feature space from 200 to 60). Finally, we train the SVM on these reduced features and use them to classify the test set. In this setup, our average classification rate drops to 63.3% from the original 72.2%.
Figure 3. Confusion table for the scene recognition experiment (L = 2, M = 200). Per-class rates along the diagonal: office 92.7, kitchen 68.5, living room 60.4, bedroom 68.3, store 76.2, industrial 65.4, tall building 91.1, inside city 80.5, street 90.2, highway 86.6, coast 82.4, open country 70.5, mountain 88.8, forest 94.7, suburb 99.4. Off-diagonal entries show the percentage of images from class i that were misidentified as class j.
For the 13 classes inherited from Li and Perona, it drops to 65.9% from 74.7%, which is now very similar to their results. Thus, we can see that latent factor analysis techniques can adversely affect classification performance, which is also consistent with the results of Quelhas et al. [15].
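For readers reproducing this comparison: scikit-learn offers no pLSA, but NMF with a Kullback-Leibler loss is a closely related stand-in (a swap we make purely for illustration, not the paper's exact method):

    from sklearn.decomposition import NMF

    def plsa_like_reduction(train_hists, test_hists, n_topics=60, seed=0):
        """Reduce M-dim bag-of-features histograms to topic mixtures.
        NMF with KL loss approximates unsupervised pLSA fitting; the
        topic model is learned on one half of the data and applied
        (folded in) on the other half."""
        nmf = NMF(n_components=n_topics, init="random", solver="mu",
                  beta_loss="kullback-leibler", max_iter=500,
                  random_state=seed)
        z_train = nmf.fit_transform(train_hists)  # e.g. 200-dim -> 60-dim
        z_test = nmf.transform(test_hists)
        return z_train, z_test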
Next, let us examine the behavior of spatial pyramid matching. For completeness, Table 1 lists the performance achieved using just the highest level of the pyramid (the "single-level" columns), as well as the performance of the complete matching scheme using multiple levels (the "pyramid" columns). For all three kinds of features, results improve dramatically as we go from L = 0 to a multi-level setup. Though matching at the highest pyramid level seems to account for most of the improvement, using all the levels together confers a statistically significant benefit. For strong features, single-level performance actually drops as we go from L = 2 to L = 3. This means that the highest level of the L = 3 pyramid is too finely subdivided, with individual bins yielding too few matches. Despite the diminished discriminative power of the highest level, the performance of the entire L = 3 pyramid remains essentially identical to that of the L = 2 pyramid. This, then, is the main advantage of the spatial pyramid representation: because it combines multiple resolutions in a principled fashion, it is robust to failures at individual levels.
It is also interesting to compare performance of different feature sets. As expected, weak features do not perform as well as strong features, though in combination with the spatial pyramid, they can also achieve acceptable levels of accuracy (note that because weak features have a much higher density and much smaller spatial extent than strong features, their performance continues to improve as we go from L = 2 to L = 3). Increasing the visual vocabulary size from M = 200 to M = 400 results in a small performance increase at L = 0, but this difference is all but eliminated at higher pyramid levels. Thus, we can conclude that the geometric cues provided by the pyramid have more discriminative power than an enlarged visual vocabulary. Of course, the optimal way to exploit structure both in the image and in the feature space may be to combine them in a unified multiresolution framework; this is subject for future research.

Fig. 3 shows a confusion table between the fifteen scene categories. Not surprisingly, confusion occurs between the indoor classes (kitchen, bedroom, living room), and also between some natural classes, such as coast and open country. Fig. 4 shows examples of image retrieval using the spatial pyramid kernel and strong features with M = 200. These examples give a qualitative sense of the kind of visual information captured by our approach. In particular, spatial pyramids seem successful at capturing the organization of major pictorial elements or "blobs," the directionality of dominant edges, and the perspective (amount of foreshortening, location of vanishing points). Because the spatial pyramid is based on features computed at the original image resolution, even high-frequency details can be preserved. For example, query image (b) shows white kitchen cabinet doors with dark borders. Three of the retrieved "kitchen" images contain cabinets of a similar design, the "office" image shows a wall plastered with white documents in dark frames, and the "inside city" image shows a white building with darker window frames.
5.2. Caltech-101
Our second set of experiments is on the Caltech-101 database [10] (Fig. 5). This database contains from 31 to 800 images per category. Most images are medium resolution, about 300×300 pixels. Caltech-101 is probably the most diverse object database available today, though it is not without shortcomings. Namely, most images feature relatively little clutter, and the objects are centered and occupy most of the image. In addition, a number of categories, such as minaret (see Fig. 5), are affected by "corner" artifacts resulting from artificial image rotation. Though these artifacts are semantically irrelevant, they can provide stable cues resulting in misleadingly high recognition rates.
We follow the experimental setup of Grauman and Darrell [5] and Zhang et al. [23], namely, we train on 30 images per class and test on the rest. For efficiency, we limit the number of test images to 50 per class. Note that, because some categories are very small, we may end up with just a single test image per class. Table 2 gives a breakdown of classification rates for different pyramid levels for weak features and strong features with M = 200.
Figure 4. Retrieval from the scene category database. The query images are on the left, and the eight images giving the highest values of the spatial pyramid kernel (for L = 2, M = 200) are on the right. The actual class of incorrectly retrieved images is listed below them.
Figure 5. Caltech-101 results. Top: some classes on which our method (L = 2, M = 200) achieved high performance: minaret (97.6%), windsor chair (94.6%), joshua tree (87.9%), okapi (87.8%). Bottom: some classes on which our method performed poorly: cougar body (27.6%), beaver (27.5%), crocodile (25.0%), ant (25.0%).
The results for M = 400 are not shown because, just as for the scene category database, they do not bring any significant improvement. For L = 0, strong features give 41.2%, which is slightly below the 43% reported by Grauman and Darrell. Our best result is 64.6%, achieved with strong features at L = 2. This exceeds the highest classification rate known to us so far, that of 53.9% reported by Zhang et al. [23].³ Berg et al. [1] report 48% accuracy using 15 training images per class. Our average recognition rate with this setup is 56.4%. The behavior of weak features on this database is

³ Zhang et al. list the percentage of all test images classified correctly, which is probably higher than their per-class average, as discussed earlier.