Visualizing and Understanding
Convolutional Networks

Matthew D. Zeiler and Rob Fergus

Dept. of Computer Science,
New York University, USA
{zeiler,fergus}@cs.nyu.edu
Abstract. Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark (Krizhevsky et al. [18]). However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we explore both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. Used in a diagnostic role, these visualizations allow us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We also perform an ablation study to discover the performance contribution from different model layers. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on the Caltech-101 and Caltech-256 datasets.
1 Introduction
Since their introduction by LeCun et al. [20] in the early 1990's, Convolutional Networks (convnets) have demonstrated excellent performance at tasks such as hand-written digit classification and face detection. In the last 18 months, several papers have shown that they can also deliver outstanding performance on more challenging visual classification tasks. Ciresan et al. [4] demonstrate state-of-the-art performance on the NORB and CIFAR-10 datasets. Most notably, Krizhevsky et al. [18] show record-beating performance on the ImageNet 2012 classification benchmark, with their convnet model achieving an error rate of 16.4%, compared to the 2nd place result of 26.1%. Following on from this work, Girshick et al. [10] have shown leading detection performance on the PASCAL VOC dataset. Several factors are responsible for this dramatic improvement in performance: (i) the availability of much larger training sets, with millions of labeled examples; (ii) powerful GPU implementations, making the training of very large models practical and (iii) better model regularization strategies, such as Dropout [14].
Despite this encouraging progress, there is still little insight into the internal operation and behavior of these complex models, or how they achieve such good performance. From a scientific standpoint, this is deeply unsatisfactory. Without clear understanding of how and why they work, the development of better models is reduced to trial-and-error. In this paper we introduce a visualization
technique that reveals the input stimuli that excite individual feature maps at any layer in the model. It also allows us to observe the evolution of features during training and to diagnose potential problems with the model. The visualization technique we propose uses a multi-layered Deconvolutional Network (deconvnet), as proposed by Zeiler et al. [29], to project the feature activations back to the input pixel space. We also perform a sensitivity analysis of the classifier output by occluding portions of the input image, revealing which parts of the scene are important for classification.
Using these tools, we start with the architecture of Krizhevsky et al. [18] and explore different architectures, discovering ones that outperform their results on ImageNet. We then explore the generalization ability of the model to other datasets, just retraining the softmax classifier on top. As such, this is a form of supervised pre-training, which contrasts with the unsupervised pre-training methods popularized by Hinton et al. [13] and others [1,26].
1.1 Related Work
Visualization: Visualizing features to gain intuition about the network is common practice, but mostly limited to the 1st layer where projections to pixel space are possible. In higher layers alternate methods must be used. [8] find the optimal stimulus for each unit by performing gradient descent in image space to maximize the unit's activation. This requires a careful initialization and does not give any information about the unit's invariances. Motivated by the latter's shortcoming, [19] (extending an idea by [2]) show how the Hessian of a given unit may be computed numerically around the optimal response, giving some insight into invariances. The problem is that for higher layers, the invariances are extremely complex, so are poorly captured by a simple quadratic approximation. Our approach, by contrast, provides a non-parametric view of invariance, showing which patterns from the training set activate the feature map. Our approach is similar to contemporary work by Simonyan et al. [23], who demonstrate how saliency maps can be obtained from a convnet by projecting back from the fully connected layers of the network, instead of the convolutional features that we use. Girshick et al. [10] show visualizations that identify patches within a dataset that are responsible for strong activations at higher layers in the model. Our visualizations differ in that they are not just crops of input images, but rather top-down projections that reveal structures within each patch that stimulate a particular feature map.
Feature Generalization: Our demonstration of the generalization ability of convnet features is also explored in concurrent work by Donahue et al. [7] and Girshick et al. [10]. They use the convnet features to obtain state-of-the-art performance on Caltech-101 and the Sun scenes dataset in the former case, and for object detection on the PASCAL VOC dataset in the latter.
2 Approach
We use standard fully supervised convnet models throughout the paper, as defined by LeCun et al. [20] and Krizhevsky et al. [18]. These models map a color
2D input image x_i, via a series of layers, to a probability vector ŷ_i over the C different classes. Each layer consists of (i) convolution of the previous layer output (or, in the case of the 1st layer, the input image) with a set of learned filters; (ii) passing the responses through a rectified linear function (relu(x) = max(x, 0)); (iii) [optionally] max pooling over local neighborhoods and (iv) [optionally] a local contrast operation that normalizes the responses across feature maps. For more details of these operations, see [18] and [16]. The top few layers of the network are conventional fully-connected networks and the final layer is a softmax classifier. Fig. 3 shows the model used in many of our experiments.
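For concreteness, the following is a minimal NumPy sketch of one such layer (convolution, relu, 2x2 max pooling). The stride-1 'valid' convolution, the pooling size and the omission of contrast normalization are illustrative simplifications, not the exact configuration of any layer in Fig. 3.

    import numpy as np

    def conv_layer(x, W, b):
        """One convnet layer: convolution + relu + 2x2 max pooling (a sketch).
        x: input feature maps, shape (C_in, H, W)
        W: filter bank, shape (C_out, C_in, k, k); b: biases, shape (C_out,)"""
        C_out, C_in, k, _ = W.shape
        H_out, W_out = x.shape[1] - k + 1, x.shape[2] - k + 1
        out = np.zeros((C_out, H_out, W_out))
        for f in range(C_out):
            for i in range(H_out):
                for j in range(W_out):
                    # (i) convolve the previous layer output with learned filters
                    out[f, i, j] = np.sum(x[:, i:i+k, j:j+k] * W[f]) + b[f]
        out = np.maximum(out, 0)                   # (ii) relu(x) = max(x, 0)
        # (iii) 2x2 max pooling over local neighborhoods
        C, H2, W2 = out.shape
        out = out[:, :H2 - H2 % 2, :W2 - W2 % 2]
        return out.reshape(C, out.shape[1] // 2, 2,
                           out.shape[2] // 2, 2).max(axis=(2, 4))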
We train these models using a large set of N labeled images {x, y}, where label y_i is a discrete variable indicating the true class. A cross-entropy loss function, suitable for image classification, is used to compare ŷ_i and y_i. The parameters of the network (filters in the convolutional layers, weight matrices in the fully-connected layers and biases) are trained by back-propagating the derivative of the loss with respect to the parameters throughout the network, and updating the parameters via stochastic gradient descent. Details of training are given in Section 3.
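As a minimal illustration of this objective, the sketch below evaluates the softmax cross-entropy loss for a single example and takes one plain gradient step; the momentum term used in Section 3 is omitted and the function names are hypothetical.

    import numpy as np

    def softmax_cross_entropy(logits, y):
        """Cross-entropy between the predicted distribution ŷ and true class y.
        logits: unnormalized class scores, shape (C,); y: integer class label."""
        z = logits - logits.max()                 # shift for numerical stability
        log_probs = z - np.log(np.exp(z).sum())   # log-softmax
        return -log_probs[y]

    def sgd_step(param, grad, lr=1e-2):
        """One stochastic gradient descent update (momentum omitted)."""
        return param - lr * grad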
2.1 Visualization with a Deconvnet
Understanding the operation of a convnet requires interpreting the feature activity in intermediate layers. We present a novel way to map these activities back to the input pixel space, showing what input pattern originally caused a given activation in the feature maps. We perform this mapping with a Deconvolutional Network (deconvnet) (Zeiler et al. [29]). A deconvnet can be thought of as a convnet model that uses the same components (filtering, pooling) but in reverse, so instead of mapping pixels to features it does the opposite. In Zeiler et al. [29], deconvnets were proposed as a way of performing unsupervised learning. Here, they are not used in any learning capacity, just as a probe of an already trained convnet.
To examine a convnet, a deconvnet is attached to each of its layers, as illustrated in Fig. 1 (top), providing a continuous path back to image pixels. To start, an input image is presented to the convnet and features computed throughout the layers. To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer. Then we successively (i) unpool, (ii) rectify and (iii) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is then repeated until input pixel space is reached.
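This top-down pass can be summarized by the schematic sketch below. The helpers unpool_with_switches and deconv_filter are sketched under the Unpooling and Filtering paragraphs that follow, the layers[l]['W'] layout is a hypothetical bookkeeping choice, and per-layer shape handling is elided for clarity.

    import numpy as np

    def deconvnet_project(feature_maps, layer_idx, target_map, layers, switches):
        """Schematic top-down pass: zero all but the examined feature map,
        then repeatedly unpool, rectify and filter down to pixel space.
        feature_maps: activations at layer `layer_idx`, shape (C, H, W)
        layers[l]['W']: learned filter bank of layer l (hypothetical layout)
        switches[l]: pooling switch locations recorded on the way up."""
        r = np.zeros_like(feature_maps)
        r[target_map] = feature_maps[target_map]     # keep one map, zero the rest
        for l in range(layer_idx, -1, -1):
            r = unpool_with_switches(r, switches[l]) # (i) unpool
            r = np.maximum(r, 0)                     # (ii) rectify
            r = deconv_filter(r, layers[l]['W'])     # (iii) transposed filtering
        return r                                     # reconstruction in pixel space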
Unpooling: In the convnet, the max pooling operation is non-invertible; however, we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables. In the deconvnet, the unpooling operation uses these switches to place the reconstructions from the layer above into appropriate locations, preserving the structure of the stimulus. See Fig. 1 (bottom) for an illustration of the procedure.
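A minimal single-map NumPy sketch of pooling with switch recording and the corresponding unpooling (a 2x2 pooling region is assumed):

    import numpy as np

    def maxpool_with_switches(x, p=2):
        """Forward max pooling that also records switch locations.
        x: feature map of shape (H, W), H and W divisible by p."""
        H, W = x.shape
        pooled = np.zeros((H // p, W // p))
        switches = np.zeros((H // p, W // p), dtype=int)
        for i in range(H // p):
            for j in range(W // p):
                region = x[i*p:(i+1)*p, j*p:(j+1)*p]
                switches[i, j] = region.argmax()   # location of the local max
                pooled[i, j] = region.max()
        return pooled, switches

    def unpool_with_switches(pooled, switches, p=2):
        """Approximate inverse of max pooling: place each pooled value back
        at its recorded switch location, zeros elsewhere."""
        Hp, Wp = pooled.shape
        out = np.zeros((Hp * p, Wp * p))
        for i in range(Hp):
            for j in range(Wp):
                di, dj = divmod(switches[i, j], p)
                out[i*p + di, j*p + dj] = pooled[i, j]
        return out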
Rectification: The convnet uses relu non-linearities, which rectify the feature maps, thus ensuring the feature maps are always positive. To obtain valid feature reconstructions at each layer (which also should be positive), we pass the reconstructed signal through a relu non-linearity [1].
Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To approximately invert this, the deconvnet uses transposed versions of the same filters (as in other autoencoder models, such as RBMs), but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally.
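A single-channel sketch of this step, assuming the forward pass computes a 'valid' correlation with filter F; correlating with the flipped filter is then the transposed operation:

    import numpy as np
    from scipy.signal import correlate2d

    def deconv_filter(r, F):
        """Approximately invert the convnet's filtering step.
        r: rectified reconstruction map, shape (H, W); F: learned filter (k, k).
        Flipping F vertically and horizontally gives the transposed filter."""
        F_flipped = F[::-1, ::-1]                  # flip vertically and horizontally
        return correlate2d(r, F_flipped, mode='full')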
Note that we do not use any contrast normalization operations in this reconstruction path. Projecting down from higher layers uses the switch settings generated by the max pooling in the convnet on the way up. As these switch settings are peculiar to a given input image, the reconstruction obtained from a single activation thus resembles a small piece of the original input image, with structures weighted according to their contribution toward the feature activation. Since the model is trained discriminatively, they implicitly show which parts of the input image are discriminative. Note that these projections are not samples from the model, since there is no generative process involved. The whole procedure is similar to backpropping a single strong activation (rather than the usual gradients), i.e. computing ∂h/∂X_n, where h is the element of the feature map with the strong activation and X_n is the input image. However, it differs in that (i) the relu is imposed independently and (ii) contrast normalization operations are not used. A general shortcoming of our approach is that it only visualizes a single activation, not the joint activity present in a layer. Nevertheless, as we show in Fig. 6, the visualizations are accurate representations of the input pattern that stimulates the given feature map in the model: when the parts of the original input image corresponding to the pattern are occluded, we see a distinct drop in activity within the feature map.
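This occlusion check can be sketched as a sliding square, below; model_forward is a hypothetical callable returning the activation of the feature map under study, and the patch size, stride and fill value are illustrative choices.

    import numpy as np

    def occlusion_sensitivity(image, model_forward, patch=50, stride=10, fill=0.0):
        """Slide an occluding square over the image and record how the target
        feature-map activation changes. image: (H, W, 3) array;
        model_forward: hypothetical callable mapping an image to a scalar
        activation of the feature map under study."""
        H, W = image.shape[:2]
        heatmap = np.zeros(((H - patch) // stride + 1,
                            (W - patch) // stride + 1))
        for i, y in enumerate(range(0, H - patch + 1, stride)):
            for j, x in enumerate(range(0, W - patch + 1, stride)):
                occluded = image.copy()
                occluded[y:y+patch, x:x+patch] = fill   # grey/zero square
                heatmap[i, j] = model_forward(occluded) # activation under occlusion
        return heatmap   # low values mark regions the feature depends on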
3 Training Details
We now describe the large convnet model that will be visualized in Section 4. The architecture, shown in Fig. 3, is similar to that used by Krizhevsky et al. [18] for ImageNet classification. One difference is that the sparse connections used in Krizhevsky's layers 3, 4, 5 (due to the model being split across 2 GPUs) are replaced with dense connections in our model. Other important differences relating to layers 1 and 2 were made following inspection of the visualizations in Fig. 5, as described in Section 4.1.
The model was trained on the ImageNet 2012 training set (1.3 million images, spread over 1000 different classes) [6]. Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256x256 region, subtracting the per-pixel mean (across all images) and then using 10 different sub-crops of size 224x224 (corners + center with(out) horizontal flips). Stochastic gradient descent with a mini-batch size of 128 was used to update the parameters, starting with a learning rate of 10^-2, in conjunction with a momentum term of 0.9.

[1] We also tried rectifying using the binary mask imposed by the feed-forward relu operation, but the resulting visualizations were significantly less clear.
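A sketch of this preprocessing, using nearest-neighbour resizing for brevity; the actual interpolation method and the per-pixel mean image (replaced by a zero placeholder here) come from the training pipeline.

    import numpy as np

    def ten_crops(img):
        """Resize the smallest side to 256, center-crop 256x256, subtract a
        per-pixel mean, then take ten 224x224 sub-crops (4 corners + center,
        each with its horizontal flip). img: (H, W, 3) uint8 array."""
        H, W, _ = img.shape
        s = 256.0 / min(H, W)
        ys = (np.arange(int(H * s)) / s).astype(int)   # nearest-neighbour
        xs = (np.arange(int(W * s)) / s).astype(int)   # resize indices
        img = img[ys][:, xs]
        H, W, _ = img.shape
        y0, x0 = (H - 256) // 2, (W - 256) // 2
        img = img[y0:y0+256, x0:x0+256].astype(np.float64)
        mean_image = np.zeros((256, 256, 3))   # placeholder for the per-pixel
        img -= mean_image                      # mean over the training set
        crops = []
        for y, x in [(0, 0), (0, 32), (32, 0), (32, 32), (16, 16)]:
            c = img[y:y+224, x:x+224]
            crops.extend([c, c[:, ::-1]])      # crop and its horizontal flip
        return crops                           # ten 224x224x3 arrays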
Fig. 1. Top: A deconvnet layer (left) attached to a convnet layer (right). The deconvnet will reconstruct an approximate version of the convnet features from the layer beneath. Bottom: An illustration of the unpooling operation in the deconvnet, using switches which record the location of the local max in each pooling region (colored zones) during pooling in the convnet. The black/white bars are negative/positive activations within the feature map.
We anneal the learning rate throughout training, manually, when the validation error plateaus. Dropout [14] is used in the fully connected layers (6 and 7) with a rate of 0.5. All weights are initialized to 10^-2 and biases are set to 0.
Visualization of the first layer filters during training reveals that a few of them dominate. To combat this, we renormalize each filter in the convolutional layers whose RMS value exceeds a fixed radius of 10^-1 to this fixed radius. This is crucial, especially in the first layer of the model, where the input images are roughly in the [-128, 128] range. As in Krizhevsky et al. [18], we produce multiple different crops and flips of each training example to boost training set size. We stopped training after 70 epochs, which took around 12 days on a single GTX580 GPU, using an implementation based on [18].
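This renormalization can be written directly from the description above; the (C_out, C_in, k, k) filter-bank layout is an assumption of the sketch.

    import numpy as np

    def renormalize_filters(W, radius=1e-1):
        """Renormalize each convolutional filter whose RMS value exceeds a
        fixed radius back to that radius. W: filter bank, (C_out, C_in, k, k)."""
        for f in range(W.shape[0]):
            rms = np.sqrt(np.mean(W[f] ** 2))
            if rms > radius:
                W[f] *= radius / rms   # scale the filter back to the radius
        return W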
4 Convnet Visualization
Using the model described in Section 3, we now use the deconvnet to visualize the feature activations on the ImageNet validation set.
Feature Visualization: Fig. 2 shows feature visualizations from our model once training is complete. For a given feature map, we show the top 9 activations, each projected separately down to pixel space, revealing the different