Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories
Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce
CVR-TR-2005-04
Abstract
This paper presents a method for recognizing scene categories based on approximate global geometric correspondence. This technique works by partitioning the image into increasingly fine sub-regions and computing histograms of local features found inside each sub-region. The resulting "spatial pyramid" is a simple and computationally efficient extension of an orderless bag-of-features image representation, and it shows significantly improved performance on challenging scene categorization tasks. Specifically, our proposed method exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories. The spatial pyramid framework also offers insights into the success of several recently proposed image descriptions, including Torralba's "gist" and Lowe's SIFT descriptors.
1. Introduction
In this paper, we consider the problem of recognizing the semantic category of an image. For example, we may want to classify a photograph as depicting a scene (forest, street, office, etc.) or as containing a certain object of interest. For such whole-image categorization tasks, bag-of-features methods, which represent an image as an orderless collection of local features, have recently demonstrated impressive levels of performance [5, 21, 22, 23]. However, because these methods disregard all information about the spatial layout of the features, they have severely limited descriptive ability. In particular, they are incapable of capturing shape or of segmenting an object from its background. Unfortunately, overcoming these limitations to build effective structural object descriptions has proven to be quite challenging, especially when the recognition system must be made to work in the presence of heavy clutter, occlusion, or large viewpoint changes. Approaches based on generative part models [10, 3] and geometric correspondence search [1, 9] achieve robustness at significant computational expense. A more efficient approach is to augment a basic bag-of-features representation with pairwise relations between neighboring local features, but existing implementations of this idea [9, 17] have yielded inconclusive results. One other strategy for increasing robustness to geometric deformations is to increase the level of invariance of local features (e.g., by using affine-invariant detectors), but a recent large-scale evaluation [23] suggests that this strategy usually does not pay off.
Though we remain sympathetic to the goal of developing robust and geometrically invariant structural object representations, we propose in this paper to revisit "global" non-invariant representations based on aggregating statistics of local features over fixed subregions. We introduce a kernel-based recognition method that works by computing rough geometric correspondence on a global scale using an efficient approximation technique adapted from the pyramid matching scheme of Grauman and Darrell [5]. Our method involves repeatedly subdividing the image and computing histograms of local features at increasingly fine resolutions. As shown by experiments in Section 5, this simple operation suffices to significantly improve performance over a basic bag-of-features representation, and even over methods based on detailed geometric correspondence.
Previous research has shown that statistical properties of the scene considered in a holistic fashion, without any analysis of its constituent objects, yield a rich set of cues to its semantic category [13]. Our own experiments confirm that global representations can be surprisingly effective not only for identifying the overall scene, but also for categorizing images as containing specific objects, even when these objects are embedded in heavy clutter and vary significantly in pose and appearance. This said, we do not advocate the direct use of a global method for object recognition (except for very restricted sorts of imagery). Instead, we envision a subordinate role for this method. It may be used to capture the "gist" of an image [20] and to inform the subsequent search for specific objects (e.g., if the image, based on its global description, is likely to be a highway, we have a high probability of finding a car, but not a toaster). In addition, the simplicity and efficiency of our proposed method, in combination with its tendency to yield unexpectedly high recognition rates on seemingly challenging data, could make it a good baseline for "calibrating" newly acquired datasets and for evaluating more sophisticated recognition approaches.
2. Previous Work
In computer vision, histograms have a long history as a method for image description (see, e.g., [16, 18]). Koenderink and Van Doorn [8] have generalized histograms to locally orderless images, or histogram-valued scale spaces (i.e., for each Gaussian aperture at a given location and scale, the locally orderless image returns the histogram of image features aggregated over that aperture). Our spatial pyramid approach can be thought of as an alternative formulation of a locally orderless image, where instead of a Gaussian scale space of apertures, we define a fixed hierarchy of rectangular windows. Koenderink and Van Doorn have argued persuasively that locally orderless images play an important role in visual perception. Our retrieval experiments (Fig. 4) confirm that spatial pyramids can capture perceptually salient features and suggest that "locally orderless matching" may be a powerful mechanism for estimating overall perceptual similarity between images.
It is important to contrast our proposed approach with multiresolution histograms [6], which involve repeatedly subsampling an image and computing a global histogram of pixel values at each new level. In other words, a multiresolution histogram varies the resolution at which the features (intensity values) are computed, but the histogram resolution (intensity scale) stays fixed. We take the opposite approach of fixing the resolution at which the features are computed, but varying the spatial resolution at which they are aggregated. This results in a higher-dimensional representation that preserves more information (e.g., an image consisting of thin black and white stripes would retain two modes at every level of a spatial pyramid, whereas it would become indistinguishable from a uniformly gray image at all but the finest levels of a multiresolution histogram). Finally, unlike a multiresolution histogram, a spatial pyramid, when equipped with an appropriate kernel, can be used for approximate geometric matching.
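The striped-image example can be verified in a few lines of numpy (our own illustration, not code from the paper):

    import numpy as np

    def hist2(x):
        """Normalized two-bin intensity histogram over [0, 1]."""
        h = np.histogram(x, bins=2, range=(0, 1))[0]
        return h / h.sum()

    stripes = np.tile([0.0, 1.0], (16, 8))   # 16x16 black/white columns
    gray = np.full((16, 16), 0.5)            # uniform gray image

    # Multiresolution histogram: subsample first, histogram globally.
    # Averaging adjacent stripe pixels gives 0.5 everywhere, so the
    # stripes become indistinguishable from the gray image.
    coarse = stripes.reshape(16, 8, 2).mean(axis=2)
    print(hist2(coarse), hist2(gray))        # identical histograms

    # Spatial pyramid: histogram full-resolution values inside each
    # cell of a 2x2 grid; every cell retains both intensity modes.
    for i in range(2):
        for j in range(2):
            print(hist2(stripes[i*8:(i+1)*8, j*8:(j+1)*8]))  # [0.5 0.5]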
The operation of "subdivide and disorder" (i.e., partition the image into subblocks and compute histograms, or histogram statistics such as means, of local features in these subblocks) has been practiced numerous times in computer vision, both for global image description [4, 19, 20] and for local description of interest regions [12]. Thus, though the operation itself seems fundamental, previous methods leave open the question of what is the right subdivision scheme (although a regular 4×4 grid seems to be the most popular implementation choice), and what is the right balance between "subdividing" and "disordering." The spatial pyramid framework suggests a possible way to address this issue: namely, the best results may be achieved when multiple resolutions are combined in a principled way. It also suggests that the reason for the empirical success of "subdivide and disorder" techniques is the fact that they actually perform approximate geometric matching.
3. Spatial Pyramid Matching
We first describe the original formulation of pyramid matching [5], and then introduce our application of this framework to create a spatial pyramid image representation.
3.1. Pyramid Match Kernels
Let X and Y be two sets of vectors in a d-dimensional feature space. Grauman and Darrell [5] propose pyramid matching to find an approximate correspondence between these two sets. Informally, pyramid matching works by placing a sequence of increasingly coarser grids over the feature space and taking a weighted sum of the number of matches that occur at each level of resolution. At any fixed resolution, two points are said to match if they fall into the same cell of the grid; matches found at finer resolutions are weighted more highly than matches found at coarser resolutions. More specifically, we construct a sequence of grids at resolutions 0, ..., L, such that the grid at level ℓ has 2^ℓ cells along each dimension, for a total of D = 2^{dℓ} cells. Let H_X^ℓ and H_Y^ℓ denote the histograms of X and Y at this resolution, so that H_X^ℓ(i) and H_Y^ℓ(i) are the numbers of points from X and Y that fall into the i-th cell of the grid. Then the number of matches at level ℓ is given by the histogram intersection function [18]:
    I(H_X^\ell, H_Y^\ell) = \sum_{i=1}^{D} \min\bigl( H_X^\ell(i), H_Y^\ell(i) \bigr).    (1)
In the following, we will abbreviate I(H_X^ℓ, H_Y^ℓ) to I^ℓ. Note that the number of matches found at level ℓ also includes all the matches found at the finer level ℓ + 1. Therefore, the number of new matches found at level ℓ is given by I^ℓ − I^{ℓ+1} for ℓ = 0, ..., L − 1. The weight associated with level ℓ is set to 1/2^{L−ℓ}, which is inversely proportional to cell width at that level. Intuitively, we want to penalize matches found in larger cells because they involve increasingly dissimilar features.
Putting all the pieces together, we get the following definition of a pyramid match kernel:
    \kappa^L(X, Y) = I^L + \sum_{\ell=0}^{L-1} \frac{1}{2^{L-\ell}} \bigl( I^\ell - I^{\ell+1} \bigr)    (2)

                   = \frac{1}{2^L} I^0 + \sum_{\ell=1}^{L} \frac{1}{2^{L-\ell+1}} I^\ell.    (3)
Both the histogram intersection and the pyramid match kernel are Mercer kernels [5].
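To make the construction concrete, here is a short Python sketch of ours (not code from the paper) implementing eqs. (1)-(3), under the assumption that each point set is an (n, d) array with coordinates scaled to [0, 1):

    import numpy as np

    def grid_histogram(points, level, d):
        """Histogram of points over a grid with 2**level cells per
        dimension; `points` is an (n, d) array with coords in [0, 1)."""
        bins = 2 ** level
        idx = np.clip((points * bins).astype(int), 0, bins - 1)
        flat = np.ravel_multi_index(tuple(idx.T), (bins,) * d)
        return np.bincount(flat, minlength=bins ** d)

    def histogram_intersection(hx, hy):
        """Eq. (1): the number of matches at a single resolution."""
        return np.minimum(hx, hy).sum()

    def pyramid_match_kernel(X, Y, L, d):
        """Eq. (3): weighted sum of intersections over levels 0..L."""
        I = [histogram_intersection(grid_histogram(X, l, d),
                                    grid_histogram(Y, l, d))
             for l in range(L + 1)]
        k = I[0] / 2 ** L                   # level 0 carries weight 1/2**L
        for l in range(1, L + 1):
            k += I[l] / 2 ** (L - l + 1)    # level l carries weight 1/2**(L-l+1)
        return k

For the spatial matching scheme of Section 3.2, d = 2 and X, Y hold the image coordinates of the features from a single channel.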
Figure 1. Toy example of constructing a three-level pyramid. The image has three feature types, indicated by circles, diamonds, and crosses. At the top, we subdivide the image at three different levels of resolution. Next, for each level of resolution and each channel, we count the features that fall in each spatial bin. Finally, we weight each spatial histogram according to eq. (3).
3.2. Spatial Matching Scheme
As introduced in [5], a pyramid match kernel works with an orderless image representation. It allows for precise matching of two collections of features in a high-dimensional appearance space, but discards all spatial information. This paper advocates an "orthogonal" approach: perform pyramid matching in the two-dimensional image space, and use traditional clustering techniques in feature space.¹ Specifically, we quantize all feature vectors into M discrete types, and make the simplifying assumption that only features of the same type can be matched to one another. Each channel m gives us two sets of two-dimensional vectors, X_m and Y_m, representing the coordinates of features of type m found in the respective images. The final kernel is then the sum of the separate channel kernels:
    K^L(X, Y) = \sum_{m=1}^{M} \kappa^L(X_m, Y_m).    (4)
This approach has the advantage of maintaining continuity with the popular "visual vocabulary" paradigm; in fact, it reduces to a standard bag of features when L = 0.
Because the pyramid match kernel (3) is simply a weighted sum of histogram intersections, and because c min(a, b) = min(ca, cb) for positive numbers, we can implement K^L as a single histogram intersection of "long" vectors formed by concatenating the appropriately weighted histograms of all channels at all resolutions (Fig. 1). For L levels and M channels, the resulting vector has dimensionality M \sum_{\ell=0}^{L} 4^\ell = M (4^{L+1} - 1)/3. Several experiments reported in Section 5 use the settings of M = 400 and L = 3, resulting in 34000-dimensional histogram intersections. However, these operations are efficient because
the histogram vectors are extremely sparse (in fact, just as in [5], the computational complexity of the kernel is linear in the number of features). It must also be noted that we did not observe any significant increase in performance beyond M = 200 and L = 2, where the concatenated histograms are only 4200-dimensional.

¹ In principle, it is possible to integrate geometric information directly into the original pyramid matching framework by treating image coordinates as two extra dimensions in the feature space.
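The long-vector trick can be sketched as follows (our own illustration; the function names and the normalization choice are ours). Each image becomes a single weighted histogram whose pairwise intersections equal K^L of eq. (4):

    import numpy as np

    def spatial_pyramid_vector(channels, xy, M, L):
        """Concatenated, weighted spatial histograms for one image.

        channels : (n,) int array of feature types in [0, M)
        xy       : (n, 2) array of feature coordinates scaled to [0, 1)
        """
        channels = np.asarray(channels, dtype=int)
        parts = []
        for l in range(L + 1):
            w = 1 / 2 ** L if l == 0 else 1 / 2 ** (L - l + 1)
            bins = 2 ** l
            cell = np.clip((xy * bins).astype(int), 0, bins - 1)
            flat = channels * bins * bins + cell[:, 0] * bins + cell[:, 1]
            parts.append(w * np.bincount(flat, minlength=M * bins * bins))
        vec = np.concatenate(parts)
        # Dimensionality check: M * (4**(L+1) - 1) / 3, e.g. 34000 for
        # M = 400, L = 3 and 4200 for M = 200, L = 2.
        assert len(vec) == M * (4 ** (L + 1) - 1) // 3
        # Normalize by the total number of features (see the
        # normalization discussion below).
        return vec / len(channels)

Intersecting two such vectors, np.minimum(u, v).sum(), then evaluates the kernel in a single pass over two sparse vectors.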
The final implementation issue is that of normalization. For maximum computational efficiency, we normalize all histograms by the total weight of all features in the image, in effect forcing the total number of features in all images to be the same. Because we use a dense feature representation (see Section 4), and thus do not need to worry about spurious feature detections resulting from clutter, this practice is sufficient to deal with the effects of variable image size.
4. Feature Extraction
This section briefly describes the two kinds of features used in the experiments of Section 5. First, we have so-called "weak features," which are oriented edge points, i.e., points whose gradient magnitude in a given direction exceeds a minimum threshold. We extract edge points at two scales and eight orientations, for a total of M = 16 channels. We designed these features to obtain a representation similar to the "gist" [20] or to a global SIFT descriptor [12] of the image.
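A minimal sketch of one plausible implementation of such weak features (the paper does not specify the filters or the threshold, so those choices below are placeholders):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def weak_features(img, scales=(1.0, 2.0), n_orient=8, thresh=0.1):
        """Oriented edge points: for each of 2 scales x 8 orientations,
        the (row, col) coordinates where the directional derivative
        magnitude exceeds a threshold, giving M = 16 channels.
        `img` is a grayscale float array in [0, 1]."""
        channels = []
        for s in scales:
            gy = gaussian_filter(img, s, order=(1, 0))  # d/dy
            gx = gaussian_filter(img, s, order=(0, 1))  # d/dx
            for k in range(n_orient):
                theta = np.pi * k / n_orient
                d = gx * np.cos(theta) + gy * np.sin(theta)
                channels.append(np.argwhere(np.abs(d) > thresh))
        return channels  # list of 16 coordinate arrays, one per channel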
For better discriminative power, we also utilize higher-dimensional "strong features," which are SIFT descriptors of 16×16 pixel patches computed over a grid with spacing of 8 pixels. Our decision to use a dense regular grid instead of interest points was based on the comparative evaluation of Li and Perona [11], who have shown that dense features work better for scene classification. Intuitively, a dense image description is necessary to capture uniform regions such as sky, calm water, or road surface (to deal with low-contrast regions, we skip the usual SIFT normalization procedure when the overall gradient magnitude of the patch is too weak). We perform k-means clustering of a random subset of patches from the training set to form a visual vocabulary. Typical vocabulary sizes for our experiments are M = 200 and M = 400.
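Vocabulary construction could look like the following sketch, using OpenCV's SIFT and scikit-learn's k-means as stand-in tooling (the keypoint size approximates the paper's 16×16 patches; the authors' exact implementation may differ):

    import cv2
    import numpy as np
    from sklearn.cluster import KMeans

    def dense_sift(gray, step=8, size=16):
        """SIFT descriptors on a regular grid (gray: uint8 image)."""
        sift = cv2.SIFT_create()
        h, w = gray.shape
        kps = [cv2.KeyPoint(float(x), float(y), float(size))
               for y in range(size // 2, h - size // 2, step)
               for x in range(size // 2, w - size // 2, step)]
        _, desc = sift.compute(gray, kps)
        return desc  # (n_patches, 128) float32

    def build_vocabulary(train_images, M=200, n_samples=100_000, seed=0):
        """k-means over a random subset of training patches -> M words."""
        descs = np.vstack([dense_sift(img) for img in train_images])
        rng = np.random.default_rng(seed)
        pick = rng.choice(len(descs), min(n_samples, len(descs)),
                          replace=False)
        return KMeans(n_clusters=M, n_init=3, random_state=seed).fit(descs[pick])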
5. Experiments
In this section, we report results on three diverse datasets: fifteen scene categories [11], Caltech-101 [10], and Graz [14]. We perform all processing in grayscale, even when color images are available. All experiments are repeated ten times with different randomly selected training and test images, and the average of per-class recognition rates² is recorded for each run. The final result is reported as the mean and standard deviation of the results from the individual runs.

² The alternative performance measure, the percentage of all test images classified correctly, can be biased if test set sizes for different classes vary significantly. This is especially true of the Caltech-101 dataset, where some of the "easiest" classes are disproportionately large.
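Concretely, the per-class measure can be computed as in this small sketch of ours:

    import numpy as np

    def per_class_average(y_true, y_pred):
        """Mean of per-class recognition rates. Unlike plain accuracy,
        this is not biased toward disproportionately large classes."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        rates = [np.mean(y_pred[y_true == c] == c)
                 for c in np.unique(y_true)]
        return float(np.mean(rates))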
Figure 2. Example images from the scene category database: office, kitchen, living room, bedroom, store, industrial, tall building*, inside city*, street*, highway*, coast*, open country*, mountain*, forest*, and suburb. The starred categories originate from Oliva and Torralba [13].
    L        | Weak features (M=16)     | Strong features (M=200)  | Strong features (M=400)
             | Single-level | Pyramid   | Single-level | Pyramid   | Single-level | Pyramid
    0 (1×1)  | 45.3 ±0.5    |           | 72.2 ±0.6    |           | 74.8 ±0.3    |
    1 (2×2)  | 53.6 ±0.3    | 56.2 ±0.6 | 77.9 ±0.6    | 79.0 ±0.5 | 78.8 ±0.4    | 80.1 ±0.5
    2 (4×4)  | 61.7 ±0.6    | 64.7 ±0.7 | 79.4 ±0.3    | 81.1 ±0.3 | 79.7 ±0.5    | 81.4 ±0.5
    3 (8×8)  | 63.3 ±0.8    | 66.8 ±0.6 | 77.2 ±0.4    | 80.7 ±0.3 | 77.2 ±0.5    | 81.1 ±0.6

Table 1. Classification results for the scene category database; the highest results for each kind of features are shown in bold.
Multi-class classification is done with a support vector machine (SVM) trained using the one-versus-all rule: a classifier is learned to separate each class from the rest, and a test image is assigned the label of the classifier with the highest response.
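With the pyramid kernel precomputed as a Gram matrix, a one-versus-all SVM can be set up as in the following sketch (scikit-learn is our assumed tooling, not necessarily what was originally used):

    import numpy as np
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    def intersection_gram(A, B):
        """Gram matrix of histogram intersections between the rows of A
        (e.g., test pyramid vectors) and the rows of B (training ones)."""
        return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

    def train_and_predict(train_vecs, y_train, test_vecs):
        """One-versus-all SVM on the precomputed spatial pyramid kernel."""
        clf = OneVsRestClassifier(SVC(kernel="precomputed"))
        clf.fit(intersection_gram(train_vecs, train_vecs), y_train)
        return clf.predict(intersection_gram(test_vecs, train_vecs))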
5.1. Scene Category Recognition
Our first dataset (Fig. 2) is composed of fifteen scene categories: thirteen were provided by Li and Perona [11] (eight of these were originally collected by Oliva and Torralba [13]), and two (industrial and store) were collected by ourselves. Each category has 200 to 400 images, and average image size is 300×250 pixels. The major sources of the pictures in the dataset include the COREL collection, personal photographs, and Google image search. As stated in [11], this is probably the most complete scene category dataset used in the literature thus far.
Table 1 shows detailed results of classification experiments using 100 images per class for training and the rest for testing (the same setup as [11]). First, let us examine the performance of strong features for L = 0 and M = 200, corresponding to a standard bag of features. Our classification rate is 72.2% (74.7% for the 13 classes inherited from Li and Perona), which is much higher than their best result of 65.2%, achieved with an orderless method and a feature set comparable to ours. We conjecture that Li and Perona's approach is disadvantaged by its reliance on latent Dirichlet allocation (LDA) [2], which is essentially an unsupervised dimensionality reduction technique and as such, is not necessarily conducive to achieving the highest classification accuracy. To verify this, we have experimented with probabilistic latent semantic analysis (pLSA) [7], which attempts to explain the distribution of features in the image as a mixture of a few "scene topics" or "aspects" and performs very similarly to LDA in practice [17]. Following the scheme of Quelhas et al. [15], we run pLSA in an unsupervised setting to learn a 60-aspect model of half the training images. Next, we apply this model to the other half to obtain probabilities of topics given each image (thus reducing the dimensionality of the feature space from 200 to 60). Finally, we train the SVM on these reduced features and use them to classify the test set. In this setup, our average classification rate drops to 63.3% from the original 72.2%.
Figure 3. Confusion table for the scene recognition experiment (L = 2, M = 200). Per-class rates along the diagonal: office 92.7, kitchen 68.5, living room 60.4, bedroom 68.3, store 76.2, industrial 65.4, tall building 91.1, inside city 80.5, street 90.2, highway 86.6, coast 82.4, open country 70.5, mountain 88.8, forest 94.7, suburb 99.4. Off-diagonal entries show the percentage of images from class i that were misidentified as class j.
For the 13 classes inherited from Li and Perona, it drops to 65.9% from 74.7%, which is now very similar to their results. Thus, we can see that latent factor analysis techniques can adversely affect classification performance, which is also consistent with the results of Quelhas et al. [15].
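For readers reproducing this comparison: scikit-learn offers no pLSA, but NMF with a Kullback-Leibler loss is a closely related stand-in (a swap we make purely for illustration, not the paper's exact method):

    from sklearn.decomposition import NMF

    def plsa_like_reduction(train_hists, test_hists, n_topics=60, seed=0):
        """Reduce M-dim bag-of-features histograms to topic mixtures.
        NMF with KL loss approximates unsupervised pLSA fitting; the
        topic model is learned on one half of the data and applied
        (folded in) on the other half."""
        nmf = NMF(n_components=n_topics, init="random", solver="mu",
                  beta_loss="kullback-leibler", max_iter=500,
                  random_state=seed)
        z_train = nmf.fit_transform(train_hists)  # e.g. 200-dim -> 60-dim
        z_test = nmf.transform(test_hists)
        return z_train, z_test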
Next, let us examine the behavior of spatial pyramid matching. For completeness, Table 1 lists the performance achieved using just the highest level of the pyramid (the "single-level" columns), as well as the performance of the complete matching scheme using multiple levels (the "pyramid" columns). For all three kinds of features, results improve dramatically as we go from L = 0 to a multi-level setup. Though matching at the highest pyramid level seems to account for most of the improvement, using all the levels together confers a statistically significant benefit. For strong features, single-level performance actually drops as we go from L = 2 to L = 3. This means that the highest level of the L = 3 pyramid is too finely subdivided, with individual bins yielding too few matches. Despite the diminished discriminative power of the highest level, the performance of the entire L = 3 pyramid remains essentially identical to that of the L = 2 pyramid. This, then, is the main advantage of the spatial pyramid representation: because it combines multiple resolutions in a principled fashion, it is robust to failures at individual levels.
It is also interesting to compare performance of different feature sets. As expected, weak features do not perform as well as strong features, though in combination with the spatial pyramid, they can also achieve acceptable levels of accuracy (note that because weak features have a much higher density and much smaller spatial extent than strong features, their performance continues to improve as we go from L = 2 to L = 3). Increasing the visual vocabulary size from M = 200 to M = 400 results in a small performance increase at L = 0, but this difference is all but eliminated at higher pyramid levels. Thus, we can conclude that the geometric cues provided by the pyramid have more discriminative power than an enlarged visual vocabulary. Of course, the optimal way to exploit structure both in the image and in the feature space may be to combine them in a unified multiresolution framework; this is subject for future research.

Fig. 3 shows a confusion table between the fifteen scene categories. Not surprisingly, confusion occurs between the indoor classes (kitchen, bedroom, living room), and also between some natural classes, such as coast and open country. Fig. 4 shows examples of image retrieval using the spatial pyramid kernel and strong features with M = 200. These examples give a qualitative sense of the kind of visual information captured by our approach. In particular, spatial pyramids seem successful at capturing the organization of major pictorial elements or "blobs," the directionality of dominant edges, and the perspective (amount of foreshortening, location of vanishing points). Because the spatial pyramid is based on features computed at the original image resolution, even high-frequency details can be preserved. For example, query image (b) shows white kitchen cabinet doors with dark borders. Three of the retrieved "kitchen" images contain cabinets of a similar design, the "office" image shows a wall plastered with white documents in dark frames, and the "inside city" image shows a white building with darker window frames.
5.2. Caltech-101
Our second set of experiments is on the Caltech-101 database [10] (Fig. 5). This database contains from 31 to 800 images per category. Most images are medium resolution, about 300×300 pixels. Caltech-101 is probably the most diverse object database available today, though it is not without shortcomings. Namely, most images feature relatively little clutter, and the objects are centered and occupy most of the image. In addition, a number of categories, such as minaret (see Fig. 5), are affected by "corner" artifacts resulting from artificial image rotation. Though these artifacts are semantically irrelevant, they can provide stable cues resulting in misleadingly high recognition rates.
We follow the experimental setup of Grauman and Darrell [5] and Zhang et al. [23], namely, we train on 30 images per class and test on the rest. For efficiency, we limit the number of test images to 50 per class. Note that, because some categories are very small, we may end up with just a single test image per class. Table 2 gives a breakdown of classification rates for different pyramid levels for weak features and strong features with M = 200.
Figure 4. Retrieval from the scene category database. The query images are on the left, and the eight images giving the highest values of the spatial pyramid kernel (for L = 2, M = 200) are on the right. The actual class of incorrectly retrieved images is listed below them.
Figure 5. Caltech-101 results. Top: some classes on which our method (L = 2, M = 200) achieved high performance: minaret (97.6%), windsor chair (94.6%), joshua tree (87.9%), okapi (87.8%). Bottom: some classes on which our method performed poorly: cougar body (27.6%), beaver (27.5%), crocodile (25.0%), ant (25.0%).
The results for M = 400 are not shown because, just as for the scene category database, they do not bring any significant improvement. For L = 0, strong features give 41.2%, which is slightly below the 43% reported by Grauman and Darrell. Our best result is 64.6%, achieved with strong features at L = 2. This exceeds the highest classification rate known to us so far, that of 53.9% reported by Zhang et al. [23].³ Berg et al. [1] report 48% accuracy using 15 training images per class. Our average recognition rate with this setup is 56.4%. The behavior of weak features on this database is

³ Zhang et al. list the percentage of all test images classified correctly, which is probably higher than their per-class average, as discussed earlier.