Learning To Detect Unseen Object Classes by Between-Class Attribute Transfer
Christoph H. Lampert, Hannes Nickisch, Stefan Harmeling
Max Planck Institute for Biological Cybernetics, Tübingen, Germany
{firstname.lastname}@tuebingen.mpg.de
Abstract
We study the problem of object classification when training and test classes are disjoint, i.e. no training examples of the target classes are available. This setup has hardly been studied in computer vision research, but it is the rule rather than the exception, because the world contains tens of thousands of different object classes, and for only a very few of them have image collections been formed and annotated with suitable class labels.
In this paper, we tackle the problem by introducing attribute-based classification. It performs object detection based on a human-specified high-level description of the target objects instead of training images. The description consists of arbitrary semantic attributes, like shape, color or even geographic information. Because such properties transcend the specific learning task at hand, they can be pre-learned, e.g. from image datasets unrelated to the current task. Afterwards, new classes can be detected based on their attribute representation, without the need for a new training phase. In order to evaluate our method and to facilitate research in this area, we have assembled a new large-scale dataset, “Animals with Attributes”, of over 30,000 animal images that match the 50 classes in Osherson’s classic table of how strongly humans associate 85 semantic attributes with animal classes. Our experiments show that by using an attribute layer it is indeed possible to build a learning object detection system that does not require any training images of the target classes.
1. Introduction
Learning-based methods for recognizing objects in natural images have made large progress over the last years. For specific object classes, in particular faces and vehicles, reliable and efficient detectors are available, based on the combination of powerful low-level features, e.g. SIFT or HoG, with modern machine learning techniques, e.g. boosting or support vector machines. However, in order to achieve good classification accuracy, these systems require a lot of manually labeled training data, typically hundreds or thousands of example images for each class to be learned.
It has been estimated that humans distinguish between at least 30,000 relevant object classes [3]. Training conventional object detectors for all these would require millions of well-labeled training images and is likely out of reach for years to come. Therefore, numerous techniques for reducing the number of necessary training images have been developed, some of which we will discuss in Section 3. However, all of these techniques still require at least some labeled training examples to detect future object instances.
[Figure 1: example classes with the values of exemplary attributes]
             black  white  brown  stripes  water  eats fish
otter        yes    no     yes    no       yes    yes
polar bear   no     yes    no     no       yes    yes
zebra        yes    yes    no     yes      no     no
Figure 1. A description by high-level attributes allows the transfer of knowledge between object categories: after learning the visual appearance of attributes from any classes with training examples, we can detect also object classes that do not have any training images, based on which attribute description a test image fits best.
Human learning is different: although humans can learn and abstract well from examples, they are also capable of detecting completely unseen classes when provided with a high-level description. E.g., from the phrase “eight-sided red traffic sign with white writing”, we will be able to detect stop signs, and when looking for “large gray animals with long trunks”, we will reliably identify elephants. We build on this paradigm and propose a system that is able to detect objects from a list of high-level attributes. The attributes serve as an intermediate layer in a classifier cascade and they enable the system to detect object classes for which it had not seen a single training example.
Clearly, a large number of possible attributes exist and collecting separate training material to learn an ordinary classifier for each of them would be as tedious as for all object classes. But, instead of creating a separate training set for each attribute, we can exploit the fact that meaningful high-level concepts transcend class boundaries. To learn such attributes, we can therefore make use of existing training data by merging images of several object classes. To learn, e.g., the attribute striped, we can use images of zebras, bees and tigers. For the attribute yellow, zebras would not be included, but bees and tigers would still prove useful, possibly together with canary birds. It is this possibility to obtain knowledge about attributes from different object classes and, vice versa, the fact that each attribute can be used for the detection of many object classes that makes our proposed learning method statistically efficient.
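The pooling of classes per attribute described above can be illustrated with a minimal sketch. The table below only encodes the striped/yellow examples from the text; the entry "canary: striped = False", the image identifiers, and the function name `attribute_labels` are assumptions made for illustration and are not part of the paper.

```python
# Hypothetical class-level attribute table built from the examples in the text.
# The "canary: striped = False" entry is an assumption made for illustration.
CLASS_ATTRIBUTES = {
    "zebra":  {"striped": True,  "yellow": False},
    "bee":    {"striped": True,  "yellow": True},
    "tiger":  {"striped": True,  "yellow": True},
    "canary": {"striped": False, "yellow": True},
}


def attribute_labels(samples, attribute):
    """Turn (image_id, class_name) pairs into (image_id, 0/1) pairs for one attribute.

    Every image inherits the attribute value of its class, so images of
    several classes are merged into one binary training set per attribute.
    """
    return [
        (image_id, int(CLASS_ATTRIBUTES[class_name][attribute]))
        for image_id, class_name in samples
    ]


# Placeholder image identifiers; no real data is implied.
samples = [("img0", "zebra"), ("img1", "bee"), ("img2", "tiger"), ("img3", "canary")]
striped = attribute_labels(samples, "striped")
yellow = attribute_labels(samples, "yellow")
```

In this sketch the same four images yield two different binary training sets, one per attribute, without any additional per-image annotation.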
2. Information Transfer by Attribute Sharing
We begin by formalizing the problem setting and our intuition from the previous section that the use of attributes allows us to transfer information between object classes. We first define the problem of our interest:
Learning with Disjoint Training and Test Classes:
Let $(x_1, l_1), \dots, (x_n, l_n) \subset X \times Y$ be training samples where $X$ is an arbitrary feature space and $Y = \{y_1, \dots, y_K\}$ consists of $K$ discrete classes. The task is to learn a classifier $f: X \to Z$ for a label set $Z = \{z_1, \dots, z_L\}$ that is disjoint from $Y$.¹
Clearly, this task cannot be solved by an ordinary multi-class classifier. Figure 2(a) provides a graphical illustration of the problem: typical classifiers learn one parameter vector (or other representation) $\alpha_k$ for each training class $y_1, \dots, y_K$. Because the classes $z_1, \dots, z_L$ were not present during the training step, no parameter vector can be derived for them, and it is impossible to make predictions about these classes for future samples.
In order to make predictions about classes for which no training data is available, we need to introduce a coupling between classes in $Y$ and $Z$. Since no training data for the unobserved classes is available, this coupling cannot be learned from samples, but has to be inserted into the system by human effort. This introduces two severe constraints on what kind of coupling mechanisms are feasible: 1) the amount of human effort to specify new classes should be minimal, because otherwise collecting and labeling training samples would be a simpler solution; 2) coupling data that requires only common knowledge is preferable over specialized expert knowledge, because the latter is often difficult and expensive to obtain.
2.1. Attribute-Based Classification
We achieve both goals by introducing a small set of high-level semantic per-class attributes. These can be, e.g., color and shape for arbitrary objects, or the natural habitat for animals. Humans are typically able to provide good prior knowledge about such attributes, and it is therefore possible to collect the necessary information without a lot of overhead. Because the attributes are assigned on a per-class basis instead of a per-image basis, the manual effort to add a new object class is kept minimal.
¹ The condition that $Y$ and $Z$ are disjoint is included only to clarify the later presentation. The problem described also occurs if just $Z \not\subseteq Y$.
For the situation where attribute data of this kind is available, we introduce attribute-based classification:
Attribute-Based Classification:
Given the situation of learning with disjoint training and test classes. If for each class $z \in Z$ and $y \in Y$ an attribute representation $a \in A$ is available, then we can learn a non-trivial classifier $\alpha: X \to Z$ by transferring information between $Y$ and $Z$ through $A$.
In the rest of this paper, we will demonstrate that attribute-based classification is indeed a solution to the problem of learning with disjoint training and test classes, and how it can be practically used for object classification. For this, we introduce and compare two generic methods to integrate attributes into multi-class classification:
Direct attribute prediction (DAP), illustrated by Figure 2(b), uses an in-between layer of attribute variables to decouple the images from the layer of labels. During training, the output class label of each sample induces a deterministic labeling of the attribute layer. Consequently, any supervised learning method can be used to learn per-attribute parameters $\beta_m$. At test time, these allow the prediction of attribute values for each test sample, from which the test class labels are inferred. Note that the classes during testing can differ from the classes used for training, as long as the coupling attribute layer is determined in a way that does not require a training phase.
Indirect attribute prediction (IAP), depicted in Figure 2(c), also uses the attributes to transfer knowledge between classes, but the attributes form a connecting layer between two layers of labels, one for classes that are known at training time and one for classes that are not. The training phase of IAP is ordinary multi-class classification. At test time, the predictions for all training classes induce a labeling of the attribute layer, from which a labeling over the test classes can be inferred.
The major difference between both approaches lies in the relationship between training classes and test classes. Directly learning the attributes results in a network where all classes are treated equally. When class labels are inferred at test time, the decisions for all classes are based only on the attribute layer. We can therefore expect it to also handle the situation where training and test classes are not disjoint. In contrast, when predicting the attribute values indirectly, the training classes occur also at test time as an intermediate feature layer. On the one hand, this can introduce a bias, if training classes are also potential output classes during testing. On the other hand, one can argue that deriving the attribute layer from the label layer instead of from the samples will act as a regularization step that creates only sensible attribute combinations and therefore makes the system more robust. In the following, we will develop implementations for both methods and benchmark their performance.
Figure 2. Graphical representation of the proposed across-class learning task: dark gray nodes are always observed, light gray nodes are observed only during training. White nodes are never observed but must be inferred. An ordinary, flat, multi-class classifier (left) learns one parameter $\alpha_k$ for each training class. It cannot generalize to classes $(z_l)_{l=1,\dots,L}$ that are not part of the training set. In an attribute-based classifier (middle) with fixed class–attribute relations (thick lines), training labels $(y_k)_{k=1,\dots,K}$ imply training values for the attributes $(a_m)_{m=1,\dots,M}$, from which parameters $\beta_m$ are learned. At test time, attribute values can directly be inferred, and these imply output class labels even for previously unseen classes. A multi-class based attribute classifier (right) combines both ideas: multi-class parameters $\alpha_k$ are learned for each training class. At test time, the posterior distribution of the training class labels induces a distribution over the labels of unseen classes by means of the class–attribute relationship.
2.2. Implementation
Both cascaded classification methods, DAP and IAP, can in principle be implemented by combining a supervised classifier or regressor for the image–attribute or image–class prediction with a parameter-free inference method to channel the information through the attribute layer. In the following, we use a probabilistic model that reflects the graphical structures of Figures 2(b) and 2(c). For simplicity, we assume that all attributes have binary values such that the attribute representations $a^y = (a^y_1, \dots, a^y_M)$ for any training class $y$ are fixed-length binary vectors. Continuous attributes can in principle be handled in the same way by using regression instead of classification.
For DAP, we start by learning probabilistic classifiers for each attribute $a_m$. We use all images from all training classes as training samples with their label determined by the entry of the attribute vector corresponding to the sample’s label, i.e. a sample of class $y$ is assigned the binary label $a^y_m$. The trained classifiers provide us with estimates of $p(a_m|x)$, from which we form a model for the complete image–attribute layer as $p(a|x) = \prod_{m=1}^{M} p(a_m|x)$. At test time, we assume that every class $z$ induces its attribute vector $a^z$ in a deterministic way, i.e. $p(a|z) = [\![a = a^z]\!]$, making use of Iverson’s bracket notation: $[\![P]\!] = 1$ if the condition $P$ is true and it is 0 otherwise [19]. Applying Bayes’ rule we obtain $p(z|a) = \frac{p(z)}{p(a^z)} [\![a = a^z]\!]$ as representation of the attribute–class layer. Combining both layers, we can calculate the posterior of a test class given an image:
$$p(z|x) = \sum_{a \in \{0,1\}^M} p(z|a)\, p(a|x) = \frac{p(z)}{p(a^z)} \prod_{m=1}^{M} p(a^z_m|x). \qquad (1)$$
In the absence of more specific knowledge, we assume identical class priors, which allows us to ignore the factor $p(z)$ in the following. For the factor $p(a)$ we assume a factorial distribution $p(a) = \prod_{m=1}^{M} p(a_m)$, using the empirical means $p(a_m) = \frac{1}{K} \sum_{k=1}^{K} a^{y_k}_m$ over the training classes as attribute priors.² As decision rule $f: X \to Z$ that assigns the best output class from all test classes $z_1, \dots, z_L$ to a test sample $x$, we use MAP prediction:
$$f(x) = \operatorname*{argmax}_{l=1,\dots,L} \prod_{m=1}^{M} \frac{p(a^{z_l}_m|x)}{p(a^{z_l}_m)}. \qquad (2)$$
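The DAP decision rule of Equation (2) can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the toy attribute vectors and the probabilities are hypothetical, the prior of an attribute being 0 is taken to be one minus the empirical mean, and the product is evaluated in log space with a small `eps` floor purely for numerical stability.

```python
import math


def attribute_priors(train_class_attrs):
    """Empirical mean of each binary attribute over the K training classes."""
    num_classes = len(train_class_attrs)
    num_attrs = len(train_class_attrs[0])
    return [
        sum(row[m] for row in train_class_attrs) / num_classes
        for m in range(num_attrs)
    ]


def dap_scores(p_attr_given_x, test_class_attrs, priors, eps=1e-12):
    """Log of prod_m p(a^{z_l}_m | x) / p(a^{z_l}_m) for every test class z_l."""
    scores = []
    for a_z in test_class_attrs:
        score = 0.0
        for m, bit in enumerate(a_z):
            # Probability that attribute m takes the value required by class z_l.
            posterior = p_attr_given_x[m] if bit else 1.0 - p_attr_given_x[m]
            prior = priors[m] if bit else 1.0 - priors[m]
            score += math.log(max(posterior, eps)) - math.log(max(prior, eps))
        scores.append(score)
    return scores


def dap_predict(p_attr_given_x, test_class_attrs, priors):
    """MAP decision: index of the test class with the highest score."""
    scores = dap_scores(p_attr_given_x, test_class_attrs, priors)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy numbers (hypothetical): 4 training classes, 3 attributes, 2 test classes.
train_attrs = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
test_attrs = [[1, 1, 0], [0, 1, 1]]
priors = attribute_priors(train_attrs)
p_attr_given_x = [0.9, 0.8, 0.1]  # stand-in for per-attribute classifier outputs
predicted = dap_predict(p_attr_given_x, test_attrs, priors)
```

Here `p_attr_given_x[m]` stands for the estimate of $p(a_m = 1|x)$ that a trained per-attribute classifier would provide. Because the logarithm is monotonic, taking the argmax over summed log-ratios selects the same class as the product in Equation (2).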
² In practice, the prior $p(a)$ is not crucial to the procedure and setting $p(a_m) = \frac{1}{2}$ yields comparable results.
In order to implement IAP, we only modify the image–attribute stage: as first step, we learn a probabilistic multi-class classifier estimating $p(y_k|x)$ for all training classes $y_1, \dots, y_K$. Again assuming a deterministic dependence between attributes and classes, we set $p(a_m|y) = [\![a_m = a^y_m]\!]$. The combination of both steps yields
$$p(a_m|x) = \sum_{k=1}^{K} p(a_m|y_k)\, p(y_k|x), \qquad (3)$$
so inferring the attribute posterior probabilities $p(a_m|x)$ requires only a matrix-vector multiplication. Afterwards, we continue in the same way as for DAP, classifying test samples using Equation (2).
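The matrix-vector multiplication behind Equation (3) can be sketched as follows. This is a minimal illustration, assuming binary attribute vectors so that the Iverson bracket for $a_m = 1$ reduces to the attribute entry itself; the function name and the toy numbers are hypothetical.

```python
def iap_attribute_posteriors(train_class_attrs, p_class_given_x):
    """Compute p(a_m = 1 | x) = sum_k [a^{y_k}_m = 1] * p(y_k | x) for all m.

    train_class_attrs is the K x M binary class-attribute matrix and
    p_class_given_x the length-K posterior of a multi-class classifier.
    """
    num_attrs = len(train_class_attrs[0])
    return [
        sum(a_y[m] * p_y for a_y, p_y in zip(train_class_attrs, p_class_given_x))
        for m in range(num_attrs)
    ]


# Toy numbers (hypothetical): 3 training classes, 2 attributes.
train_attrs = [[1, 0], [1, 1], [0, 1]]
p_class_given_x = [0.5, 0.25, 0.25]
p_attr_given_x = iap_attribute_posteriors(train_attrs, p_class_given_x)
```

The resulting vector plays the same role as the per-attribute classifier outputs in DAP, so the same MAP rule can be applied to it afterwards.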
3. Connections to Previous Work
Multi-layer or cascaded classifiers have a long tradition in pattern recognition and computer vision: multi-layer perceptrons [29], decision trees [5], mixtures of experts [17] and boosting [14] are prominent examples of classification systems built as feed-forward architectures with several stages. Multi-class classifiers are also often constructed as layers of binary decisions, from which the final output is inferred [7, 28]. These methods differ in their training methodologies, but they share the goal of decomposing a difficult classification problem into a collection of simpler ones. Because their emphasis lies on the classification performance in a fully supervised scenario, the methods are not capable of generalizing across class boundaries.
Especially in the area of computer vision, multi-layered classification systems have been constructed, in which intermediate layers have interpretable properties: artificial neural networks or deep belief networks have been shown to learn interpretable filters, but these are typically restricted to low-level properties like edge and corner detectors [27]. Popular local feature descriptors, such as SIFT [21] or HoG [6], can be seen as hand-crafted stages in a feed-forward architecture that transform an image from the pixel domain into a representation invariant to non-informative image variations. Similarly, image segmentation has been proposed as an unsupervised method to extract contours that are discriminative for object classes [37]. Such preprocessing steps are generic in the sense that they still allow the subsequent detection of arbitrary object classes. However, the basic elements, local image descriptors or segment shapes, alone are not reliable enough indicators of generic visual object classes, unless they are used as input to a subsequent statistical learning step.
On a higher level, pictorial structures [13], the constellation model [10] and recent discriminatively trained deformable part models [9] are examples of the many methods that recognize objects in images by detecting discriminative parts. In principle, humans can give descriptions of object classes in terms of such parts, e.g. arms or wheels. However, it is a difficult problem to build a system that learns to detect exactly the parts described. Instead, the identification of parts is integrated into the training of the model, which often reduces the parts to co-occurrence patterns of local feature points, not to units with a semantic meaning. In general, parts learned this way do generalize across class boundaries.
3.1. Sharing Information between Classes
The aspect of sharing information between classes has also been recognized as an interesting field before. A common idea is to construct multi-class classifiers in a cascaded way. By making similar classes share large parts of their decision paths, fewer classification functions need to be learned, thereby increasing the system’s performance [26]. Similarly, one can reduce the number of feature calculations by actively selecting low-level features that help discrimination for many classes simultaneously [33]. Combinations of both approaches are also possible [39].
In contrast, inter-class transfer does not aim at higher speed, but at better generalization performance, typically for object classes with only few available training instances. From known object classes, one infers prior distributions over the expected intra-class variance in terms of distortions [22] or shapes and appearances [20]. Alternatively, features that are known to be discriminative for some classes can be reused and adapted to support the detection of new classes [1]. To our knowledge, no previous approach allows the direct incorporation of human prior knowledge. Also, all methods require at least some training examples and cannot handle completely new object classes.
A notable exception is [8], which uses high-level attributes to learn descriptions of objects. Like our approach, this opens the possibility to generalize between categories.
3.2. Learning Semantic Attributes
A different line of relevant research occurring as one building block for attribute-based classification is the learning of high-level semantic attributes from images. Prior work in the area of computer vision has mainly studied elementary properties like colors and geometric patterns [11, 36, 38], achieving high accuracy by developing task-specific features and representations. In the field of multimedia retrieval, the annual TRECVID contest [32] contains a subtask of high-level feature extraction. It has stimulated a lot of research in the detection of semantic concepts, including the categorization of scenes, e.g. outdoor, urban, and high-level actions, e.g. sports. Typical systems in this area combine many feature representations and, because they were designed for retrieval scenarios, they aim at high precision for low recall levels [34, 40].
Our own task of attribute learning targets a similar problem, but our final goal is not the prediction of few individual attributes. Instead, we want to infer class labels by combining the predictions of many attributes. Therefore, we are relatively robust to prediction errors on the level of individual attributes, and we will rely on generic classifiers and standard image features instead of specialized setups.
In contrast to computer science, a lot of work in cognitive science has been dedicated to studying the relations between object recognition and attributes. Typical questions in the field are how human judgements are influenced by characteristic object attributes [23, 31]. A related line of research studies how the human performance in object detection tasks depends on the presence or absence of object properties and contextual cues [16]. Since one of our goals is to integrate human knowledge into a computer vision task, we would like to benefit from the prior work in this field, at least as a source of high quality data that, so far, cannot be obtained by an automatic process. In the following section, we describe a new dataset of animal images that allows us to make use of existing class-attribute association data, which was collected from cognitive science research.
[Figure 3: excerpt of the class–attribute matrix. Columns (attributes): black, white, blue, brown, gray, orange, red, yellow, patches, spots, stripes, furry, hairless, toughskin, big, small, bulbous, lean, flippers, hands, hooves, pads, paws, longleg, longneck, tail, chewteeth, meatteeth, buckteeth, strainteeth, horns, claws, tusks. Rows (classes): zebra, giant panda, deer, bobcat, pig, lion, mouse, polar bear, collie, walrus, raccoon, cow, dolphin.]
Figure 3. Class–attribute matrices from [24, 18]. The responses of persons were averaged to determine the real-valued association strength between attributes and classes. The darker the boxes, the less is the attribute associated with the class. Binary attributes are obtained by thresholding at the overall matrix mean.
4. The Animals with Attributes Dataset
For their studies on attribute-based object similarity, Osherson and Wilkie [24] collected judgements from human subjects on the “relative strength of association” between 85 attributes and 48 animal classes. Kemp et al. [18] made use of the same data in a machine learning context and added 2 more animal classes. Figure 3 illustrates an excerpt of the resulting 50×85 class-attribute matrix. However, so far this data was not usable in a computer vision context, because the animals and attributes are only specified by their abstract names, not by example images. To overcome this problem, we have collected the Animals with Attributes data.³
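The binarization mentioned for the class-attribute matrix (thresholding the real-valued association strengths at the overall matrix mean) can be sketched as below. The function name and the toy values are hypothetical, and treating values strictly above the mean as 1 is an assumption, since the text does not say how ties are handled.

```python
def binarize_by_global_mean(strengths):
    """Threshold a real-valued class-attribute matrix at its overall mean.

    strengths is a list of rows (one per class), each a list of association
    strengths (one per attribute). Values strictly above the mean become 1.
    """
    values = [value for row in strengths for value in row]
    threshold = sum(values) / len(values)
    return [[1 if value > threshold else 0 for value in row] for row in strengths]


# Toy 2 x 2 matrix with made-up association strengths (overall mean: 40).
toy_strengths = [[80.0, 10.0], [20.0, 50.0]]
binary_attrs = binarize_by_global_mean(toy_strengths)
```

Using a single global threshold, rather than one per attribute or per class, keeps the procedure parameter-free, which matches the description in the text.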
4.1. Image Collection
We have collected example images for all 50 Osherson/Kemp animal classes by querying four large internet search engines, Google, Microsoft, Yahoo and Flickr, using the animal names as keywords. The resulting over 180,000 images were manually processed to remove outliers and duplicates, and to ensure that the target animal is in a prominent view in all cases. The remaining collection consists of 30475 images with a minimum of 92 images for any class. Figure 1 shows examples of some classes with the values of exemplary attributes assigned to this class. Altogether, animals are uniquely characterized by their attribute vector. Consequently, the Animals with Attributes dataset, formed by combining the collected images with the semantic attribute table, can serve as a testbed for the task of incorporating human knowledge into an object detection system.
³ Available at attributes.kyb.tuebingen.mpg.de
4.2. Feature Representations
Feature extraction is known to have a big influence in computer vision tasks. For most image datasets, e.g. Caltech [15] and PASCAL VOC⁴, it has become difficult to judge the true performance of newly proposed classification methods, because results based on very different feature sets need to be compared. We have therefore decided to include a reference set of pre-extracted features into the Animals with Attributes dataset.
We have selected six different feature types: RGB color histograms, SIFT [21], rgSIFT [35], PHOG [4], SURF [2] and local self-similarity histograms [30]. The color histograms and PHOG feature vectors are extracted separately for all 21 cells of a 3-level spatial pyramid (1×1, 2×2, 4×4). For each cell, 128-dimensional color histograms are extracted and concatenated to form a 2688-dimensional feature vector. For PHOG, the same construction is used, but with 12-dimensional base histograms. The other feature vectors each are 2000-bin bag-of-visual-words histograms.
For the consistent evaluation of attribute-based object classification methods, we have selected 10 test classes: chimpanzee, giant panda, hippopotamus, humpback whale, leopard, pig, raccoon, rat, seal. The 6180 images of those classes act as test data, whereas the 24295 images of the remaining 40 classes can be used for training. Additionally, we also encourage the use of the dataset for regular large-scale multi-class or multi-label classification. For this we provide ordinary training/test splits with both parts containing images of all classes. In particular, we expect the Animals with Attributes dataset to be suitable to test hierarchical classification techniques, because the classes contain natural subgroups of similar appearance.
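The feature dimensions and split sizes quoted above can be checked with a short worked calculation. The helper name `pyramid_cells` is hypothetical, and the PHOG dimensionality is derived here from the stated construction rather than quoted from the text.

```python
def pyramid_cells(levels):
    """Number of cells in a spatial pyramid with grids 1x1, 2x2, 4x4, ..."""
    return sum((2 ** level) ** 2 for level in range(levels))


cells = pyramid_cells(3)          # 1 + 4 + 16 cells
color_dim = cells * 128           # 128-dimensional color histogram per cell
phog_dim = cells * 12             # 12-dimensional base histogram per cell (derived)
total_images = 6180 + 24295       # test images + training images
```

The numbers agree with the 21 cells and the 2688-dimensional color feature stated in the text, and the two parts of the split add up to the 30475 images of the full collection.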
5. Experimental Evaluation
In Section 2 we introduced DAP and IAP, two methods for attribute-based classification that allow the learning of object classification systems for classes for which no training samples are available. In the following, we evaluate both methods by applying them to the Animals with Attributes dataset. For DAP, we train a non-linear support vector machine (SVM) to predict each binary attribute $a_1, \dots, a_M$. All attribute SVMs are based on the same kernel, the sum of individual $\chi^2$-kernels for each feature, where the bandwidth parameters are fixed to five times the inverse of the median of the $\chi^2$-distances over the training samples. The SVM’s parameter $C$ is set to 10, which had been determined a priori by cross-validation on a subset of the training
⁴ www.pascal-network/challenges/VOC/