Learning Visual Representations using Images with Captions
Ariadna Quattoni Michael Collins Trevor Darrell
MIT Computer Science and Artificial Intelligence Laboratory
Cambridge, MA 02139
{ariadna,mcollins,trevor}@csail.mit.edu
January 22, 2007
1 Overview
Current methods for learning visual categories work well when a large amount of labeled data is available, but can run into severe difficulties when the number of labeled examples is small. When labeled data is scarce it may be beneficial to use unlabeled data to learn an image representation that is low-dimensional, but nevertheless captures the information required to discriminate between image categories. We describe a method for learning representations from large quantities of unlabeled images which have associated captions; the aim is to learn a representation that aids learning in image classification problems. Experiments show that the method significantly outperforms a fully-supervised baseline model as well as a model that ignores the captions and learns a visual representation by performing PCA on the unlabeled images alone. Our current work concentrates on captions as the source of meta-data, but more generally other types of meta-data could be used (e.g., video sequences with accompanying speech).
2 Background
When few labeled examples are available, most current supervised learning methods [9, 3, 4, 7, 5] for image classification may work poorly, for example when a user defines a new category and provides only a few labeled examples. To reach human performance, it is clear that knowledge beyond the supervised training data needs to be leveraged.
There is a large literature on semi-supervised learning approaches, where unlabeled data is used in addition to labeled data. Our work is related to work in multi-task learning, where training data in related tasks is used to aid learning in the problem of interest. Multi-task learning has a relatively long history in machine learning [8, 2, 6, 1], but has only recently been addressed in machine vision. We build on the structure learning approach of Ando and Zhang [1], who describe an algorithm for transfer learning, and suggest the use of auxiliary problems on unlabeled data as a method for constructing related tasks. In some cases unlabeled data may contain useful meta-data that can be used to learn a low-dimensional representation that reflects the semantic content of an image. As one example, large quantities of images with associated natural language captions can be found on the web.
3 Approach
We propose to use the meta-data to induce a representation that reflects an underlying part structure in an existing, high-dimensional visual representation. The new representation groups together synonymous visual features, that is, features that consistently play a similar role across different image classification tasks. Our approach exploits learning from auxiliary problems which can be created from images with associated captions. Each auxiliary problem involves taking an image as input, and predicting whether or not a particular content word (e.g., man, official, or celebrates) is in the caption associated with that image. In structural learning, a separate linear classifier is trained for each of the auxiliary problems; manifold learning (e.g., SVD) is then applied to the resulting set of parameter vectors, in essence finding a low-dimensional space which is a good approximation to the space of possible parameter vectors. If features in the high-dimensional space correspond to the same semantic part, their associated classifier parameters (weights) across different auxiliary problems may be correlated in such a way that the basis functions learned by the SVD step collapse the features to a single feature in a new, low-dimensional feature-vector representation.
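
To make the procedure concrete, the following Python sketch (not the authors' implementation) illustrates the structural-learning step just described, under simplifying assumptions: each auxiliary caption-word problem is solved with an off-the-shelf logistic regression classifier, the resulting weight vectors are stacked into a matrix, and an SVD of that matrix gives a projection onto a low-dimensional representation. The function names (learn_projection, project), the dictionary aux_labels, and the choice of 50 dimensions are illustrative assumptions, not details taken from the paper.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def learn_projection(X_unlabeled, aux_labels, n_components=50):
        """Learn a low-dimensional projection from auxiliary caption-word problems.

        X_unlabeled: (n_images, n_visual_features) bag-of-words image vectors.
        aux_labels:  dict mapping a content word to a binary array indicating
                     whether that word appears in each image's caption.
        """
        weight_vectors = []
        for word, y in aux_labels.items():
            # Auxiliary problem: does `word` occur in the caption of each image?
            clf = LogisticRegression(max_iter=1000)
            clf.fit(X_unlabeled, y)
            weight_vectors.append(clf.coef_.ravel())
        W = np.vstack(weight_vectors)            # (n_aux_problems, n_visual_features)
        # SVD of the parameter matrix; the leading right-singular vectors span a
        # low-dimensional subspace approximating the space of parameter vectors.
        _, _, Vt = np.linalg.svd(W, full_matrices=False)
        return Vt[:n_components]                 # projection matrix theta

    def project(X, theta):
        # Map bag-of-words image vectors into the learned low-dimensional representation.
        return X @ theta.T

Features that receive correlated weights across the auxiliary problems are mapped close together by theta, which is the sense in which the SVD step collapses synonymous visual features.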
Topic: visual processing and pattern recognition.
Preference: oral/poster. (Ariadna Quattoni)
Figure 1: Equal error rates (averaged across topics, with standard deviations calculated over ten runs for each topic) as a function of the number of positive training examples (left). Example images from the Figure Skating, Ice Hockey, and Golden Globes topics (right).
4 Experiments
In a first set of experiments, we use synthetic data examples to illustrate how the method can uncover latent part structures.
A second set of experiments involves classification of news images into different topics. Images on the Reuters website are partitioned into stories which correspond to different topics in the news; each image has a topic label as well as associated caption meta-data. For both experiments we compare a baseline model that uses a bag-of-words SIFT representation of image data to our method, which replaces the SIFT representation with a new representation that is learned from images with associated captions. In addition, we compare our method to a baseline model that ignores the meta-data and learns a new visual representation by performing PCA on the unlabeled images. Note that our goal is to build classifiers that work on images alone (i.e., images which do not have captions), and our experimental set-up reflects this, in that training and test examples for the topic classification tasks include image data only. The experiments show that our method significantly outperforms both baseline models. See people.csail.mit.edu/ariadna/TransferLearning for further details on the method and the experiments.
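
The comparison can be outlined schematically as follows; this is an illustrative sketch, not the authors' experimental code. It assumes bag-of-words SIFT feature matrices X_tr/X_te for the labeled topic-classification data and X_unlabeled for the captioned pool, a linear SVM as the topic classifier, and the projection matrix theta from the sketch in Section 3; all of these names, and the choice of 50 PCA dimensions, are assumptions.

    from sklearn.decomposition import PCA
    from sklearn.svm import LinearSVC

    def train_and_score(X_train, y_train, X_test, y_test, transform=None):
        # Train a linear topic classifier on image features only and report test accuracy.
        if transform is not None:
            X_train, X_test = transform(X_train), transform(X_test)
        clf = LinearSVC()
        clf.fit(X_train, y_train)
        return clf.score(X_test, y_test)

    # Usage (placeholder data assumed to be loaded elsewhere):
    # acc_bow = train_and_score(X_tr, y_tr, X_te, y_te)                  # raw bag-of-words SIFT
    # pca = PCA(n_components=50).fit(X_unlabeled)                        # PCA baseline, ignores captions
    # acc_pca = train_and_score(X_tr, y_tr, X_te, y_te, transform=pca.transform)
    # acc_cap = train_and_score(X_tr, y_tr, X_te, y_te,                  # caption-derived representation
    #                           transform=lambda X: X @ theta.T)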
5 Summary
We have described a method for learning visual representations from large quantities of unlabeled images which have associated captions. The method makes use of auxiliary training sets corresponding to different words in the captions, and structural learning, which learns a manifold in parameter space. The induced representations significantly speed up learning of image classifiers applied to topic classification. Our results show that when meta-data labels are suitably related to a target (core) task, the structure learning method can discover feature groupings that speed learning of the target task. Future work includes exploration of automatic determination of relevance between target and auxiliary tasks, and experimental evaluation of the effectiveness of structure learning from more weakly related auxiliary domains.
References
[1] R. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853, 2005.
[2] J. Baxter. A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28:7–39, 1997.
[3] K. Grauman and T. Darrell. The pyramid match kernel: discriminative classification with sets of image features. In Proceedings of the International Conference on Computer Vision (ICCV), 2005.
[4] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of CVPR-2006, 2006.
[5] J. Mutch and D. G. Lowe. Multiclass object recognition with sparse, localized features. In Proceedings of CVPR-2006, 2006.
[6] R. Raina, A. Y. Ng, and D. Koller. Constructing informative priors using transfer learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 713–720, 2006.
[7] T. Serre, L. Wolf, and T. Poggio. Object recognition with features inspired by visual cortex. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), 2005.
[8] S. Thrun. Is learning the n-th thing any easier than learning the first? In Advances in Neural Information Processing Systems, 1996.
[9] H. Zhang, A. Berg, M. Maire, and J. Malik. SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. In Proceedings of CVPR-2006, 2006.