Visualizing and Understanding
Convolutional Networks

Matthew D. Zeiler and Rob Fergus

Dept. of Computer Science,
New York University, USA
{zeiler,fergus}@cs.nyu.edu
Abstract. Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark (Krizhevsky et al. [18]). However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we explore both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. Used in a diagnostic role, these visualizations allow us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We also perform an ablation study to discover the performance contribution from different model layers. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on the Caltech-101 and Caltech-256 datasets.
1 Introduction
Since their introduction by LeCun et al. [20] in the early 1990's, Convolutional Networks (convnets) have demonstrated excellent performance at tasks such as hand-written digit classification and face detection. In the last 18 months, several papers have shown that they can also deliver outstanding performance on more challenging visual classification tasks. Ciresan et al. [4] demonstrate state-of-the-art performance on the NORB and CIFAR-10 datasets. Most notably, Krizhevsky et al. [18] show record-beating performance on the ImageNet 2012 classification benchmark, with their convnet model achieving an error rate of 16.4%, compared to the 2nd place result of 26.1%. Following on from this work, Girshick et al. [10] have shown leading detection performance on the PASCAL VOC dataset. Several factors are responsible for this dramatic improvement in performance: (i) the availability of much larger training sets, with millions of labeled examples; (ii) powerful GPU implementations, making the training of very large models practical and (iii) better model regularization strategies, such as Dropout [14].
Despite this encouraging progress, there is still little insight into the internal operation and behavior of these complex models, or how they achieve such good performance. From a scientific standpoint, this is deeply unsatisfactory. Without clear understanding of how and why they work, the development of better models is reduced to trial-and-error. In this paper we introduce a visualization
technique that reveals the input stimuli that excite individual feature maps at any layer in the model. It also allows us to observe the evolution of features during training and to diagnose potential problems with the model. The visualization technique we propose uses a multi-layered Deconvolutional Network (deconvnet), as proposed by Zeiler et al. [29], to project the feature activations back to the input pixel space. We also perform a sensitivity analysis of the classifier output by occluding portions of the input image, revealing which parts of the scene are important for classification.
Using these tools, we start with the architecture of Krizhevsky et al. [18] and explore different architectures, discovering ones that outperform their results on ImageNet. We then explore the generalization ability of the model to other datasets, just retraining the softmax classifier on top. As such, this is a form of supervised pre-training, which contrasts with the unsupervised pre-training methods popularized by Hinton et al. [13] and others [1,26].
1.1 Related Work
Visualization: Visualizing features to gain intuition about the network is common practice, but mostly limited to the 1st layer where projections to pixel space are possible. In higher layers alternate methods must be used. [8] find the optimal stimulus for each unit by performing gradient descent in image space to maximize the unit's activation. This requires a careful initialization and does not give any information about the unit's invariances. Motivated by the latter's shortcoming, [19] (extending an idea by [2]) show how the Hessian of a given unit may be computed numerically around the optimal response, giving some insight into invariances. The problem is that for higher layers, the invariances are extremely complex, so are poorly captured by a simple quadratic approximation. Our approach, by contrast, provides a non-parametric view of invariance, showing which patterns from the training set activate the feature map. Our approach is similar to contemporary work by Simonyan et al. [23], who demonstrate how saliency maps can be obtained from a convnet by projecting back from the fully connected layers of the network, instead of the convolutional features that we use. Girshick et al. [10] show visualizations that identify patches within a dataset that are responsible for strong activations at higher layers in the model. Our visualizations differ in that they are not just crops of input images, but rather top-down projections that reveal structures within each patch that stimulate a particular feature map.
Feature Generalization: Our demonstration of the generalization ability of convnet features is also explored in concurrent work by Donahue et al. [7] and Girshick et al. [10]. They use the convnet features to obtain state-of-the-art performance on Caltech-101 and the Sun scenes dataset in the former case, and for object detection on the PASCAL VOC dataset in the latter.
2 Approach
We use standard fully supervised convnet models throughout the paper, as defined by LeCun et al. [20] and Krizhevsky et al. [18]. These models map a color
2D input image x_i, via a series of layers, to a probability vector ŷ_i over the C different classes. Each layer consists of (i) convolution of the previous layer output (or, in the case of the 1st layer, the input image) with a set of learned filters; (ii) passing the responses through a rectified linear function (relu(x) = max(x, 0)); (iii) [optionally] max pooling over local neighborhoods and (iv) [optionally] a local contrast operation that normalizes the responses across feature maps. For more details of these operations, see [18] and [16]. The top few layers of the network are conventional fully-connected networks and the final layer is a softmax classifier. Fig. 3 shows the model used in many of our experiments.
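For concreteness, the following is a minimal NumPy sketch of one such layer (convolution, relu, 2x2 max pooling). The stride-1 'valid' convolution, the pooling size and the omission of contrast normalization are illustrative simplifications, not the exact configuration of any layer in Fig. 3.

    import numpy as np

    def conv_layer(x, W, b):
        """One convnet layer: convolution + relu + 2x2 max pooling (a sketch).
        x: input feature maps, shape (C_in, H, W)
        W: filter bank, shape (C_out, C_in, k, k); b: biases, shape (C_out,)"""
        C_out, C_in, k, _ = W.shape
        H_out, W_out = x.shape[1] - k + 1, x.shape[2] - k + 1
        out = np.zeros((C_out, H_out, W_out))
        for f in range(C_out):
            for i in range(H_out):
                for j in range(W_out):
                    # (i) convolve the previous layer output with learned filters
                    out[f, i, j] = np.sum(x[:, i:i+k, j:j+k] * W[f]) + b[f]
        out = np.maximum(out, 0)                   # (ii) relu(x) = max(x, 0)
        # (iii) 2x2 max pooling over local neighborhoods
        C, H2, W2 = out.shape
        out = out[:, :H2 - H2 % 2, :W2 - W2 % 2]
        return out.reshape(C, out.shape[1] // 2, 2,
                           out.shape[2] // 2, 2).max(axis=(2, 4))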
We train these models using a large set of N labeled images {x, y}, where label y_i is a discrete variable indicating the true class. A cross-entropy loss function, suitable for image classification, is used to compare ŷ_i and y_i. The parameters of the network (filters in the convolutional layers, weight matrices in the fully-connected layers and biases) are trained by back-propagating the derivative of the loss with respect to the parameters throughout the network, and updating the parameters via stochastic gradient descent. Details of training are given in Section 3.
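As a minimal illustration of this objective, the sketch below evaluates the softmax cross-entropy loss for a single example and takes one plain gradient step; the momentum term used in Section 3 is omitted and the function names are hypothetical.

    import numpy as np

    def softmax_cross_entropy(logits, y):
        """Cross-entropy between the predicted distribution ŷ and true class y.
        logits: unnormalized class scores, shape (C,); y: integer class label."""
        z = logits - logits.max()                 # shift for numerical stability
        log_probs = z - np.log(np.exp(z).sum())   # log-softmax
        return -log_probs[y]

    def sgd_step(param, grad, lr=1e-2):
        """One stochastic gradient descent update (momentum omitted)."""
        return param - lr * grad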
2.1 Visualization with a Deconvnet
Understanding the operation of a convnet requires interpreting the feature activity in intermediate layers. We present a novel way to map these activities back to the input pixel space, showing what input pattern originally caused a given activation in the feature maps. We perform this mapping with a Deconvolutional Network (deconvnet) (Zeiler et al. [29]). A deconvnet can be thought of as a convnet model that uses the same components (filtering, pooling) but in reverse, so instead of mapping pixels to features it does the opposite. In Zeiler et al. [29], deconvnets were proposed as a way of performing unsupervised learning. Here, they are not used in any learning capacity, just as a probe of an already trained convnet.
To examine a convnet, a deconvnet is attached to each of its layers, as illustrated in Fig. 1 (top), providing a continuous path back to image pixels. To start, an input image is presented to the convnet and features computed throughout the layers. To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer. Then we successively (i) unpool, (ii) rectify and (iii) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is then repeated until input pixel space is reached.
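This top-down pass can be summarized by the schematic sketch below. The helpers unpool_with_switches and deconv_filter are sketched under the Unpooling and Filtering paragraphs that follow, the layers[l]['W'] layout is a hypothetical bookkeeping choice, and per-layer shape handling is elided for clarity.

    import numpy as np

    def deconvnet_project(feature_maps, layer_idx, target_map, layers, switches):
        """Schematic top-down pass: zero all but the examined feature map,
        then repeatedly unpool, rectify and filter down to pixel space.
        feature_maps: activations at layer `layer_idx`, shape (C, H, W)
        layers[l]['W']: learned filter bank of layer l (hypothetical layout)
        switches[l]: pooling switch locations recorded on the way up."""
        r = np.zeros_like(feature_maps)
        r[target_map] = feature_maps[target_map]     # keep one map, zero the rest
        for l in range(layer_idx, -1, -1):
            r = unpool_with_switches(r, switches[l]) # (i) unpool
            r = np.maximum(r, 0)                     # (ii) rectify
            r = deconv_filter(r, layers[l]['W'])     # (iii) transposed filtering
        return r                                     # reconstruction in pixel space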
Unpooling: In the convnet, the max pooling operation is non-invertible; however, we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables. In the deconvnet, the unpooling operation uses these switches to place the reconstructions from the layer above into appropriate locations, preserving the structure of the stimulus. See Fig. 1 (bottom) for an illustration of the procedure.
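A minimal single-map NumPy sketch of pooling with switch recording and the corresponding unpooling (a 2x2 pooling region is assumed):

    import numpy as np

    def maxpool_with_switches(x, p=2):
        """Forward max pooling that also records switch locations.
        x: feature map of shape (H, W), H and W divisible by p."""
        H, W = x.shape
        pooled = np.zeros((H // p, W // p))
        switches = np.zeros((H // p, W // p), dtype=int)
        for i in range(H // p):
            for j in range(W // p):
                region = x[i*p:(i+1)*p, j*p:(j+1)*p]
                switches[i, j] = region.argmax()   # location of the local max
                pooled[i, j] = region.max()
        return pooled, switches

    def unpool_with_switches(pooled, switches, p=2):
        """Approximate inverse of max pooling: place each pooled value back
        at its recorded switch location, zeros elsewhere."""
        Hp, Wp = pooled.shape
        out = np.zeros((Hp * p, Wp * p))
        for i in range(Hp):
            for j in range(Wp):
                di, dj = divmod(switches[i, j], p)
                out[i*p + di, j*p + dj] = pooled[i, j]
        return out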
Rectification: The convnet uses relu non-linearities, which rectify the feature maps, thus ensuring the feature maps are always positive. To obtain valid feature reconstructions at each layer (which also should be positive), we pass the reconstructed signal through a relu non-linearity [1].
Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To approximately invert this, the deconvnet uses transposed versions of the same filters (as in other autoencoder models, such as RBMs), but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally.
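A single-channel sketch of this step, assuming the forward pass computes a 'valid' correlation with filter F; correlating with the flipped filter is then the transposed operation:

    import numpy as np
    from scipy.signal import correlate2d

    def deconv_filter(r, F):
        """Approximately invert the convnet's filtering step.
        r: rectified reconstruction map, shape (H, W); F: learned filter (k, k).
        Flipping F vertically and horizontally gives the transposed filter."""
        F_flipped = F[::-1, ::-1]                  # flip vertically and horizontally
        return correlate2d(r, F_flipped, mode='full')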
Note that we do not use any contrast normalization operations in this reconstruction path. Projecting down from higher layers uses the switch settings generated by the max pooling in the convnet on the way up. As these switch settings are peculiar to a given input image, the reconstruction obtained from a single activation thus resembles a small piece of the original input image, with structures weighted according to their contribution toward the feature activation. Since the model is trained discriminatively, they implicitly show which parts of the input image are discriminative. Note that these projections are not samples from the model, since there is no generative process involved. The whole procedure is similar to backpropping a single strong activation (rather than the usual gradients), i.e. computing ∂h/∂X_n, where h is the element of the feature map with the strong activation and X_n is the input image. However, it differs in that (i) the relu is imposed independently and (ii) contrast normalization operations are not used. A general shortcoming of our approach is that it only visualizes a single activation, not the joint activity present in a layer. Nevertheless, as we show in Fig. 6, the visualizations are accurate representations of the input pattern that stimulates the given feature map in the model: when the parts of the original input image corresponding to the pattern are occluded, we see a distinct drop in activity within the feature map.
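This occlusion check can be sketched as a sliding square, below; model_forward is a hypothetical callable returning the activation of the feature map under study, and the patch size, stride and fill value are illustrative choices.

    import numpy as np

    def occlusion_sensitivity(image, model_forward, patch=50, stride=10, fill=0.0):
        """Slide an occluding square over the image and record how the target
        feature-map activation changes. image: (H, W, 3) array;
        model_forward: hypothetical callable mapping an image to a scalar
        activation of the feature map under study."""
        H, W = image.shape[:2]
        heatmap = np.zeros(((H - patch) // stride + 1,
                            (W - patch) // stride + 1))
        for i, y in enumerate(range(0, H - patch + 1, stride)):
            for j, x in enumerate(range(0, W - patch + 1, stride)):
                occluded = image.copy()
                occluded[y:y+patch, x:x+patch] = fill   # grey/zero square
                heatmap[i, j] = model_forward(occluded) # activation under occlusion
        return heatmap   # low values mark regions the feature depends on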
3 Training Details
We now describe the large convnet model that will be visualized in Section 4. The architecture, shown in Fig. 3, is similar to that used by Krizhevsky et al. [18] for ImageNet classification. One difference is that the sparse connections used in Krizhevsky's layers 3, 4, 5 (due to the model being split across 2 GPUs) are replaced with dense connections in our model. Other important differences relating to layers 1 and 2 were made following inspection of the visualizations in Fig. 5, as described in Section 4.1.
The model was trained on the ImageNet 2012 training set (1.3 million images, spread over 1000 different classes) [6]. Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256x256 region, subtracting the per-pixel mean (across all images) and then using 10 different sub-crops of size 224x224 (corners + center with(out) horizontal flips). Stochastic gradient descent with a mini-batch size of 128 was used to update the parameters, starting with a learning rate of 10^-2, in conjunction with a momentum term of 0.9.

[1] We also tried rectifying using the binary mask imposed by the feed-forward relu operation, but the resulting visualizations were significantly less clear.
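A sketch of this preprocessing, using nearest-neighbour resizing for brevity; the actual interpolation method and the per-pixel mean image (replaced by a zero placeholder here) come from the training pipeline.

    import numpy as np

    def ten_crops(img):
        """Resize the smallest side to 256, center-crop 256x256, subtract a
        per-pixel mean, then take ten 224x224 sub-crops (4 corners + center,
        each with its horizontal flip). img: (H, W, 3) uint8 array."""
        H, W, _ = img.shape
        s = 256.0 / min(H, W)
        ys = (np.arange(int(H * s)) / s).astype(int)   # nearest-neighbour
        xs = (np.arange(int(W * s)) / s).astype(int)   # resize indices
        img = img[ys][:, xs]
        H, W, _ = img.shape
        y0, x0 = (H - 256) // 2, (W - 256) // 2
        img = img[y0:y0+256, x0:x0+256].astype(np.float64)
        mean_image = np.zeros((256, 256, 3))   # placeholder for the per-pixel
        img -= mean_image                      # mean over the training set
        crops = []
        for y, x in [(0, 0), (0, 32), (32, 0), (32, 32), (16, 16)]:
            c = img[y:y+224, x:x+224]
            crops.extend([c, c[:, ::-1]])      # crop and its horizontal flip
        return crops                           # ten 224x224x3 arrays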
Fig. 1. Top: A deconvnet layer (left) attached to a convnet layer (right). The deconvnet will reconstruct an approximate version of the convnet features from the layer beneath. Bottom: An illustration of the unpooling operation in the deconvnet, using switches which record the location of the local max in each pooling region (colored zones) during pooling in the convnet. The black/white bars are negative/positive activations within the feature map.
We anneal the learning rate throughout training, manually, when the validation error plateaus. Dropout [14] is used in the fully connected layers (6 and 7) with a rate of 0.5. All weights are initialized to 10^-2 and biases are set to 0.
Visualization of the first layer filters during training reveals that a few of them dominate. To combat this, we renormalize each filter in the convolutional layers whose RMS value exceeds a fixed radius of 10^-1 to this fixed radius. This is crucial, especially in the first layer of the model, where the input images are roughly in the [-128, 128] range. As in Krizhevsky et al. [18], we produce multiple different crops and flips of each training example to boost training set size. We stopped training after 70 epochs, which took around 12 days on a single GTX580 GPU, using an implementation based on [18].
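This renormalization can be written directly from the description above; the (C_out, C_in, k, k) filter-bank layout is an assumption of the sketch.

    import numpy as np

    def renormalize_filters(W, radius=1e-1):
        """Renormalize each convolutional filter whose RMS value exceeds a
        fixed radius back to that radius. W: filter bank, (C_out, C_in, k, k)."""
        for f in range(W.shape[0]):
            rms = np.sqrt(np.mean(W[f] ** 2))
            if rms > radius:
                W[f] *= radius / rms   # scale the filter back to the radius
        return W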
4 Convnet Visualization
Using the model described in Section 3, we now use the deconvnet to visualize the feature activations on the ImageNet validation set.
Feature Visualization: Fig. 2 shows feature visualizations from our model once training is complete. For a given feature map, we show the top 9 activations, each projected separately down to pixel space, revealing the different