ImageNet: A Large-Scale Hierarchical Image Database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li and Li Fei-Fei
Dept. of Computer Science, Princeton University, USA
{jiadeng, wdong, rsocher, jiali, li, feifeili}@cs.princeton.edu
Abstract
The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called "ImageNet", a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
1. Introduction
The digital era has brought with it an enormous explosion of data. The latest estimations put a number of more than 3 billion photos on Flickr, a similar number of video clips on YouTube and an even larger number for images in the Google Image Search database. More sophisticated and robust models and algorithms can be proposed by exploiting these images, resulting in better applications for users to index, retrieve, organize and interact with these data. But exactly how such data can be utilized and organized is a problem yet to be solved. In this paper, we introduce a new image database called "ImageNet", a large-scale ontology of images. We believe that a large-scale ontology of images is a critical resource for developing advanced, large-scale content-based image search and image understanding algorithms, as well as for providing critical training and benchmarking data for such algorithms.
ImageNet uses the hierarchical structure of WordNet [9].
Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are around 80,000 noun synsets in WordNet. In ImageNet, we aim to provide on average 500-1000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated as described in Sec. 3.2. ImageNet, therefore, will offer tens of millions of cleanly sorted images. In this paper, we report the current version of ImageNet, consisting of 12 "subtrees": mammal, bird, fish, reptile, amphibian, vehicle, furniture, musical instrument, geological formation, tool, flower, fruit. These subtrees contain 5247 synsets and 3.2 million images. Fig. 1 shows a snapshot of two branches of the mammal and vehicle subtrees. The database is publicly available at http://www.image-net.org.
The rest of the paper is organized as follows: We first show that ImageNet is a large-scale, accurate and diverse image database (Section 2). In Section 4, we present a few simple application examples by exploiting the current ImageNet, mostly the mammal and vehicle subtrees. Our goal is to show that ImageNet can serve as a useful resource for visual recognition applications such as object recognition, image classification and object localization. In addition, the construction of such a large-scale and high-quality database can no longer rely on traditional data collection methods. Sec. 3 describes how ImageNet is constructed by leveraging Amazon Mechanical Turk.
2. Properties of ImageNet
ImageNet is built upon the hierarchical structure provided by WordNet. In its completion, ImageNet aims to contain in the order of 50 million cleanly labeled full resolution images (500-1000 per synset). At the time this paper is written, ImageNet consists of 12 subtrees. Most analysis will be based on the mammal and vehicle subtrees.
Figure 1: A snapshot of two root-to-leaf branches of ImageNet: the top row is from the mammal subtree (mammal → placental → carnivore → canine → dog → husky); the bottom row is from the vehicle subtree (vehicle → craft → watercraft → sailing vessel → sailboat → trimaran). For each synset, 9 randomly sampled images are presented.

Figure 2: Scale of ImageNet. Red curve: Histogram of number of images per synset. About 20% of the synsets have very few images. Over 50% of synsets have more than 500 images. Table: Summary of selected subtrees. For complete and up-to-date statistics visit http://www.image-net.org/about-stats.

Scale ImageNet aims to provide the most comprehensive and diverse coverage of the image world. The current 12 subtrees consist of a total of 3.2 million cleanly annotated
images spread over 5247 categories (Fig. 2). On average over 600 images are collected for each synset. Fig. 2 shows the distributions of the number of images per synset for the current ImageNet¹. To our knowledge this is already the largest clean image dataset available to the vision research community, in terms of the total number of images, number of images per category as well as the number of categories².

Hierarchy ImageNet organizes the different classes of images in a densely populated semantic hierarchy. The main asset of WordNet [9] lies in its semantic structure, i.e. its ontology of concepts. Similarly to WordNet, synsets of images in ImageNet are interlinked by several types of relations, the "IS-A" relation being the most comprehensive and useful.

¹About 20% of the synsets have very few images, because either there are very few web images available, e.g. "vespertilian bat", or the synset by definition is difficult to illustrate with images, e.g. "two-year-old horse".
²It is claimed that the ESP game [25] has labeled a very large number of images, but only a subset of 60K images are publicly available.
Figure 3: Comparison of the "cat" and "cattle" subtrees between ESP [25] and ImageNet. Within each tree, the size of a node is proportional to the number of images it contains. The number of images for the largest node is shown for each tree. Shared nodes between an ESP tree and an ImageNet tree are colored in red.
Although one can map any dataset with category labels into a semantic hierarchy by using WordNet, the density of ImageNet is unmatched by others. For example, to our knowledge no existing vision dataset offers images of 147 dog categories. Fig. 3 compares the "cat" and "cattle" subtrees of ImageNet and the ESP dataset [25]. We observe that ImageNet offers much denser and larger trees.

Accuracy We would like to offer a clean dataset at all levels of the WordNet hierarchy. Fig. 4 demonstrates the labeling precision on a total of 80 synsets randomly sampled at different tree depths. An average of 99.7% precision is achieved. Achieving a high precision for all depths of the ImageNet tree is challenging because the lower in the hierarchy a synset is, the harder it is to classify, e.g. Siamese cat versus Burmese cat.
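The density claim above is easy to probe programmatically. As a rough illustration (not part of the paper's own tooling), the sketch below uses NLTK's WordNet interface to enumerate the IS-A subtree rooted at a synset; the count it prints is WordNet's, not ImageNet's.

```python
# Sketch: enumerating an IS-A subtree with NLTK's WordNet corpus
# (requires: pip install nltk; then nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def isa_subtree(synset):
    """All synsets reachable from `synset` via hyponym (IS-A) links."""
    return set(synset.closure(lambda s: s.hyponyms()))

dog = wn.synset('dog.n.01')
print(len(isa_subtree(dog)))  # several hundred noun synsets under "dog"
```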
Diversity ImageNet is constructed with the goal that objects in images should have variable appearances, positions, view points, poses, as well as background clutter and occlusions. To quantify image diversity, we compute the average image of each synset and measure its lossless JPG file size: a diverse set of images yields a blurrier average image and hence a smaller file size. Fig. 5 compares the image diversity in four randomly sampled synsets in Caltech101 [8]³ and the mammal subtree of ImageNet.
2.1. ImageNet and Related Datasets
We compare ImageNet with other datasets and summarize the differences in Table 1⁴.
Small image datasets A number of well labeled small datasets (Caltech101/256 [8, 12], MSRC [22], PASCAL [7] etc.) have served as training and evaluation benchmarks for most of today's computer vision algorithms. As computer vision research advances, larger and more challenging datasets are needed for the next generation of algorithms. The current ImageNet offers 20× the number of categories, and 100× the number of total images than these datasets.

³We also compare with Caltech256 [12]. The result indicates the diversity of ImageNet is comparable, which is reassuring since Caltech256 was specifically designed to be more diverse.
⁴We focus our comparisons on datasets of generic objects. Special purpose datasets, such as FERET faces [19], Labeled Faces in the Wild [13] and the Mammal Benchmark by Fink and Ullman [11], are not included.

TinyImage TinyImage [24] is a dataset of 80 million 32×32 low resolution images, collected from
the Internet by sending all words in WordNet as queries to image search engines. Each synset in the TinyImage dataset contains an average of 1000 images, among which 10-25% are possibly clean images. Although the TinyImage dataset has had success with certain applications, the high level of noise and low resolution images make it less suitable for general purpose algorithm development, training, and evaluation. Compared to the TinyImage dataset, ImageNet contains high quality synsets (∼99% precision) and full resolution images with an average size of around 400×350.

ESP dataset The ESP dataset is acquired through an online game [25]. Two players independently propose labels to one image with the goal of matching as many words as possible in a certain time limit. Millions of images are labeled through this game, but its speeded nature also poses a major drawback. Rosch and Lloyd [20] have demonstrated that humans tend to label visual objects at an easily accessible semantic level termed as "basic level" (e.g. bird), as opposed to more specific level ("sub-ordinate level", e.g. sparrow), or more general level ("super-ordinate level", e.g. vertebrate). Labels collected from the ESP game largely concentrate at the "basic level" of the semantic hierarchy as illustrated by the color bars in Fig. 6. ImageNet, however, demonstrates a much more balanced distribution of images across the semantic hierarchy. Another critical difference between ESP and ImageNet is sense disambiguation. When human players input the word "bank", it is unclear whether it means "a river bank" or a "financial institution". At this large scale, disambiguation becomes a non-trivial task. Without it, the accuracy and usefulness of the ESP data could be affected. ImageNet, on the other hand, does not have this problem by construction. See Section 3.2 for more details. Lastly, most of the ESP dataset is not publicly available. Only 60K images and their labels can be accessed [1].
LabelMe and Lotus Hill datasets LabelMe [21] and the Lotus Hill dataset [27] provide 30k and 50k labeled and segmented images, respectively⁵. These two datasets provide complementary resources for the vision community compared to ImageNet. Both only have around 200 categories, but the outlines and locations of objects are provided. ImageNet in its current form does not provide detailed object outlines (see potential extensions in Sec. 5.1), but the number of categories and the number of images per category
Figure 5: ImageNet provides diversified images. (a) Comparison of the lossless JPG file sizes of average images for four different synsets in ImageNet (the mammal subtree) and Caltech101. Average images are downsampled to 32×32 and sizes are measured in bytes. A more diverse set of images results in a smaller lossless JPG file size. (b) Example images from ImageNet and average images for each synset indicated by (a). (c) Example images from Caltech101 and average images. For each category shown, the average image is computed using all images from Caltech101 and an equal number of randomly sampled images from ImageNet.
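For concreteness, a minimal sketch of this diversity measure follows. It is our reading of the caption, not the authors' released code: Pillow offers no lossless-JPEG codec, so PNG stands in as the lossless format, and each image is resized to 32×32 before averaging (the exact order of resizing and averaging is an assumption).

```python
# Sketch: diversity as the compressed size of a synset's average image.
# PNG replaces lossless JPG (an assumption; Pillow lacks lossless JPEG).
import io
import numpy as np
from PIL import Image

def diversity_score(image_paths):
    """Smaller returned size => blurrier average image => more diverse synset."""
    stack = np.stack([
        np.asarray(Image.open(p).convert('RGB').resize((32, 32)), dtype=np.float64)
        for p in image_paths
    ])
    avg = Image.fromarray(stack.mean(axis=0).astype(np.uint8))
    buf = io.BytesIO()
    avg.save(buf, format='PNG')  # lossless compression
    return buf.getbuffer().nbytes
```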
Figure 6: Comparison of the distribution of "mammal" labels over tree depth levels between ImageNet and ESP game. The y-axis indicates the percentage of the labels of the corresponding dataset. ImageNet demonstrates a much more balanced distribution, offering substantially more labels at deeper tree depth levels. The actual number of images corresponding to the highest bar is also given for each dataset.
already far exceed those of these two datasets. In addition, images in these two datasets are largely uploaded or provided by users or researchers of the dataset, whereas ImageNet contains images crawled from the entire Internet. The Lotus Hill dataset is only available through purchase.

⁵All statistics are from [21, 27]. In addition to the 50k images, the Lotus Hill dataset also includes 587k video frames.
3. Constructing ImageNet
ImageNet is an ambitious project. Thus far, we have constructed 12 subtrees containing 3.2 million images. Our goal is to complete the construction of around 50 million images in the next two years. We describe here the method we use to construct ImageNet, shedding light on how the properties of Sec. 2 can be ensured in this process.
3.1. Collecting Candidate Images
The first stage of the construction of ImageNet involves collecting candidate images for each synset. The average accuracy of image search results from the Internet is around 10% [24]. ImageNet aims to eventually offer 500-1000 clean images per synset. We therefore collect a large set of candidate images. After intra-synset duplicate removal, each synset has over 10K images on average.
We collect candidate images from the Internet by querying several image search engines. For each synset, the queries are the set of WordNet synonyms. Search engines typically limit the number of images retrievable (in the order of a few hundred to a thousand). To obtain as many images as possible, we expand the query set by appending the queries with the word from parent synsets, if the same word appears in the gloss of the target synset. For example, when querying "whippet", according to WordNet's gloss a "small slender dog of greyhound type developed in England", we also use "whippet dog" and "whippet greyhound".
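The expansion rule can be made concrete with a short sketch. We use NLTK's WordNet interface here as an assumption (the paper does not say which WordNet API was used), and we only walk direct hypernyms, whereas the authors may consider all ancestor synsets.

```python
# Sketch of the query-expansion rule: add "synonym + parent-word" queries
# when the parent word also appears in the target synset's gloss.
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def candidate_queries(synset):
    base = {name.replace('_', ' ') for name in synset.lemma_names()}
    queries = set(base)
    gloss = synset.definition().lower()
    for parent in synset.hypernyms():  # direct parents only (an assumption)
        for word in parent.lemma_names():
            word = word.replace('_', ' ').lower()
            if word in gloss:  # e.g. "dog" appears in the gloss of "whippet"
                queries.update(f'{q} {word}' for q in base)
    return queries

# e.g. candidate_queries(wn.synset('whippet.n.01')) -> the WordNet synonyms
# plus expansions in the spirit of "whippet dog" / "whippet greyhound"
```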
To further enlarge and diversify the candidate pool, we translate the queries into other languages [10], including Chinese, Spanish, Dutch and Italian. We obtain accurate translations by WordNets in those languages [3, 2, 4, 26].
3.2. Cleaning Candidate Images
To collect a highly accurate dataset, we rely on humans to verify each candidate image collected in the previous step for a given synset. This is achieved by using the service of Amazon Mechanical Turk (AMT), an online platform on which one can put up tasks for users to complete and to get paid. AMT has been used for labeling vision data [23]. With a global user base, AMT is particularly suitable for large scale labeling.
In each of our labeling tasks, we present the users with a set of candidate images and the definition of the target synset (including a link to Wikipedia). We then ask the users to verify whether each image contains objects of the synset. We encourage users to select images regardless of occlusions, number of objects and clutter in the scene to ensure diversity.

Figure 7: Left: Is there a Burmese cat in these images? Six randomly sampled users have different answers. Right: The confidence score table for "Cat" and "Burmese cat". More votes are needed to reach the same degree of confidence for "Burmese cat" images.
While users are instructed to make accurate judgments, we need to set up a quality control system to ensure this accuracy. There are two issues to consider. First, human users make mistakes and not all users follow the instructions. Second, users do not always agree with each other, especially for more subtle or confusing synsets, typically at the deeper levels of the tree. Fig. 7 (left) shows an example of how users' judgments differ for "Burmese cat".

The solution to these issues is to have multiple users independently label the same image. An image is considered positive only if it gets a convincing majority of the votes. We observe, however, that different categories require different levels of consensus among users. For example, while five users might be necessary for obtaining a good consensus on "Burmese cat" images, a much smaller number is needed for "cat" images. We develop a simple algorithm to dynamically determine the number of agreements needed for different categories of images. For each synset, we first randomly sample an initial subset of images. At least 10 users are asked to vote on each of these images. We then obtain a confidence score table, indicating the probability of an image being a good image given the user votes (Fig. 7 (right) shows examples for "Burmese cat" and "cat"). For each of the remaining candidate images in this synset, we proceed with the AMT user labeling until a pre-determined confidence score threshold is reached. It is worth noting that the confidence table gives a natural measure of the "semantic difficulty" of the synset. For some synsets, users fail to reach a majority vote for any image, indicating that the synset cannot be easily illustrated by images⁶. Fig. 4 shows that our algorithm successfully filters the candidate images, resulting in a high percentage of clean images per synset.
⁶An alternative explanation is that we did not obtain enough suitable candidate images. Given the extensiveness of our crawling scheme, this is a rare scenario.
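A hedged sketch of this dynamic-consensus procedure is given below. The paper does not specify how the confidence table is estimated; here we assume it is tabulated empirically from the ≥10-vote seed set, and all function and variable names are our own.

```python
# Sketch: dynamic consensus for AMT cleaning (our reading of the procedure;
# the table estimation and all names are assumptions, not the authors' code).
from collections import defaultdict

def build_confidence_table(seed_votes):
    """seed_votes: list of (votes, is_good) pairs, where `votes` is a list of
    >= 10 boolean user votes and `is_good` is the reference judgment.
    Returns {(n, j): P(image is good | j of the first n votes positive)}."""
    counts = defaultdict(lambda: [0, 0])  # (n, j) -> [num_good, num_total]
    for votes, is_good in seed_votes:
        for n in range(1, len(votes) + 1):
            j = sum(votes[:n])
            counts[(n, j)][1] += 1
            counts[(n, j)][0] += int(is_good)
    return {k: good / total for k, (good, total) in counts.items()}

def label_image(next_vote, table, threshold=0.95, max_votes=15):
    """Collect votes one at a time (next_vote() returns one user's boolean
    vote, e.g. from an AMT task) until the confidence threshold is reached."""
    votes = []
    for n in range(1, max_votes + 1):
        votes.append(next_vote())
        p_good = table.get((n, sum(votes)))
        if p_good is not None and p_good >= threshold:
            return True, n              # accept into the synset
        if p_good is not None and p_good <= 1 - threshold:
            return False, n             # reject
    return None, max_votes              # no consensus: semantically difficult
```

Under this scheme the per-synset table naturally demands more votes for "Burmese cat" than for "cat", matching Fig. 7 (right).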
4. ImageNet Applications
In this section, we show three applications of ImageNet. The first set of experiments underlines the advantages of having clean, full resolution images. The second experiment exploits the tree structure of ImageNet, whereas the last experiment outlines a possible extension and gives more insights into the data.
4.1. Non-parametric Object Recognition
Given an image containing an unknown object, we would like to recognize its object class by querying similar images in ImageNet. Torralba et al. [24] have demonstrated that, given a large number of images, simple nearest neighbor methods can achieve reasonable performance despite a high level of noise. We show that with a clean set of full resolution images, object recognition can be more accurate, especially by exploiting more feature level information.

We run four different object recognition experiments. In all experiments, we test on images from the 16 common categories⁷ between Caltech256 and the mammal subtree. We measure classification performance on each category in the form of an ROC curve. For each category, the negative set consists of all images from the other 15 categories. We now describe in detail our experiments and results (Fig. 8).

1. NN-voting + noisy ImageNet First we replicate one of the experiments described in [24], which we refer to as "NN-voting" hereafter. To imitate the TinyImage dataset (i.e. images collected from search engines without human cleaning), we use the original candidate images for each synset (Section 3.1) and downsample them to 32×32. Given a query image, we retrieve 100 of the nearest neighbor images by SSD pixel distance from the mammal subtree. Then we perform classification by aggregating votes (number of nearest neighbors) inside the tree of the target category.

2. NN-voting + clean ImageNet Next we run the same NN-voting experiment described above on the clean ImageNet dataset. This result shows that having more accurate data improves classification performance.

3. NBNN We also implement the Naive Bayesian Nearest Neighbor (NBNN) method proposed in [5] to underline the usefulness of full resolution images. NBNN employs a bag-of-features representation of images. SIFT [15] descriptors are used in this experiment. Given a query image Q with descriptors {d_i}, i = 1, ..., M, for each object class C, we compute the query-class distance D_C = \sum_{i=1}^{M} \| d_i - NN_C(d_i) \|^2, where NN_C(d_i) is the nearest neighbor of d_i among the descriptors of class C; Q is assigned to the class with the smallest D_C.
⁷The 16 categories are bat, bear, camel, chimp, dog, elk, giraffe, goat, gorilla, greyhound, horse, killer-whale, porcupine, raccoon, skunk, zebra. Duplicates (∼20 per category) with ImageNet are removed from the test set.
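To make the NBNN distance concrete, here is a minimal sketch assuming precomputed SIFT descriptors and using SciPy's k-d tree for the per-class nearest-neighbor lookup; it mirrors the formula above but is not the authors' implementation.

```python
# Sketch: NBNN classification given precomputed SIFT descriptors.
import numpy as np
from scipy.spatial import cKDTree

def nbnn_classify(query_desc, class_desc):
    """query_desc: (M, 128) SIFT descriptors of the query image Q.
    class_desc: {class C: (N_C, 128) descriptors pooled over C's images}.
    Returns argmin_C D_C with D_C = sum_i ||d_i - NN_C(d_i)||^2."""
    d_c = {}
    for c, descs in class_desc.items():
        nn_dist, _ = cKDTree(descs).query(query_desc, k=1)  # Euclidean NNs
        d_c[c] = float(np.sum(nn_dist ** 2))
    return min(d_c, key=d_c.get)
```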