[论文阅读] ImageNet Classification with Deep Convolutional Neural Networks (AlexNet)
Abstract
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art.
The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
我们训练了一个大型的深度卷积神经网络，将ImageNet LSVRC-2010竞赛中的120万幅高分辨率图像分类为1000个不同的类别。在测试数据上，我们实现了37.5%的top-1错误率和17.0%的top-5错误率，与之前的最高水平相比有了很大提高。该神经网络有6000万个参数和65万个神经元，由5个卷积层（其中一些后面接了最大池化层）和3个全连接层（最后是1000路softmax）组成。为了使训练更快，我们使用了非饱和神经元和一个非常高效的GPU卷积运算实现。为了减少全连接层的过拟合，我们采用了一种最近发展起来的、被称为dropout的正则化方法，结果显示它非常有效。我们还将该模型的一个变体提交到ILSVRC-2012比赛中，并获得了15.3%的top-5测试错误率，而第二名的错误率为26.2%。
1 Introduction
Current approaches to object recognition make essential use of machine learning methods. To improve their performance, we can collect larger datasets, learn more powerful models, and use better techniques for preventing overfitting. Until recently, datasets of labeled images were relatively small — on the order of tens of thousands of images (e.g., NORB [16], Caltech-101/256 [8, 9], and CIFAR-10/100 [12]). Simple recognition tasks can be solved quite well with datasets of this size, especially if they are augmented with label-preserving transformations. For example, the current best error rate on the MNIST digit-recognition task (<0.3%) approaches human performance [4]. But objects in realistic settings exhibit considerable variability, so to learn to recognize them it is necessary to use much larger training sets. And indeed, the shortcomings of small image datasets have been widely recognized (e.g., Pinto et al. [21]), but it has only recently become possible to collect labeled datasets with millions of images. The new larger datasets include LabelMe [23], which consists of hundreds of thousands of fully-segmented images, and ImageNet [6], which consists of over 15 million labeled high-resolution images in over 22,000 categories.
当前的物体识别方法主要利用机器学习方法。为了提高它们的性能，我们可以收集更大的数据集，学习更强大的模型，并使用更好的技术来防止过拟合。直到最近，带标签的图像数据集都还相对较小（大约只有几万张图像，例如NORB [16]、Caltech-101/256 [8,9]、CIFAR-10/100 [12]）。使用这种大小的数据集可以很好地解决简单的识别任务，特别是在使用保持标签不变的变换对其进行扩充的情况下。例如，MNIST数字识别任务当前的最佳错误率（<0.3%）已经接近人类水平[4]。但是现实环境中的物体表现出相当大的可变性，所以为了学会识别它们，有必要使用大得多的训练集。的确，小图像数据集的缺点已经被广泛认识到（例如，Pinto等人[21]），但直到最近才有可能收集包含数百万张图像的带标签数据集。新的更大的数据集包括LabelMe [23]（由数十万张完全分割的图像组成）和ImageNet [6]（由超过22000个类别、超过1500万张带标签的高分辨率图像组成）。
To learn about thousands of objects from millions of images, we need a model with a large learning capacity. However, the immense complexity of the object recognition task means that this problem cannot be specified even by a dataset as large as ImageNet, so our model should also have lots of prior knowledge to compensate for all the data we don't have. Convolutional neural networks (CNNs) constitute one such class of models [16, 11, 13, 18, 15, 22, 26]. Their capacity can be controlled by varying their depth and breadth, and they also make strong and mostly correct assumptions about the nature of images (namely, stationarity of statistics and locality of pixel dependencies). Thus, compared to standard feedforward neural networks with similarly-sized layers, CNNs have much fewer connections and parameters and so they are easier to train, while their theoretically-best performance is likely to be only slightly worse.
要从数百万张图像中了解数千个物体,我们需要⼀个具有巨⼤学习能⼒的模型。
然而，对象识别任务的巨大复杂性意味着即使像ImageNet这样大的数据集也无法完全刻画这个问题，因此我们的模型还应该拥有大量的先验知识，来弥补我们所没有的数据。卷积神经网络（Convolutional neural networks, CNNs）就是这样一类模型[16,11,13,18,15,22,26]。它们的容量可以通过改变深度和宽度来控制，而且它们还对图像的性质（即统计的平稳性和像素依赖的局部性）做出了强有力且大多正确的假设。
因此,与具有相似⼤⼩层的标准前馈神经⽹络相⽐,CNNs具有更少的连接和参数,因此更容易训练,⽽其理论上最好的性能可能只会稍微差⼀些。
Despite the attractive qualities of CNNs, and despite the relative efficiency of their local architecture, they have still been prohibitively expensive to apply in large scale to high-resolution images. Luckily, current GPUs, paired with a highly-optimized implementation of 2D convolution, are powerful enough to facilitate the training of interestingly-large CNNs, and recent datasets such as ImageNet contain enough labeled examples to train such models without severe overfitting.
尽管CNN具有许多吸引人的特性，尽管其局部结构相对高效，但将其大规模应用于高分辨率图像的代价仍然高得令人望而却步。幸运的是，当前的GPU与高度优化的2D卷积实现相结合，已经足够强大，可以用来训练规模相当大的CNN；而且最近的数据集（如ImageNet）包含了足够多的带标签样本，可以在不严重过拟合的情况下训练此类模型。
The specific contributions of this paper are as follows: we trained one of the largest convolutional neural networks to date on the subsets of ImageNet used in the ILSVRC-2010 and ILSVRC-2012 competitions [2] and achieved by far the best results ever reported on these datasets. We wrote a highly-optimized GPU implementation of 2D convolution and all the other operations inherent in training convolutional neural networks, which we make available publicly. Our network contains a number of new and unusual features which improve its performance and reduce its training time, which are detailed in Section 3. The size of our network made overfitting a significant problem, even with 1.2 million labeled training examples, so we used several effective techniques for preventing overfitting, which are described in Section 4. Our final network contains five convolutional and three fully-connected layers, and this depth seems to be important: we found that removing any convolutional layer (each of which contains no more than 1% of the model's parameters) resulted in inferior performance.
本⽂的具体贡献如下:
我们在在 ILSVRC-2010 和 ILSVRC-2012 ⽐赛[2]中使⽤过的 ImageNet 的⼦集上训练了迄今为⽌最⼤的卷积神经⽹络之⼀,并取得了这些数据集上迄今为⽌最好的结果。
我们编写了⼀个关于 2D 卷积和所有其他的训练卷积神经⽹络时固有的操作的⾼度优化的 GPU 实现,并将其公开了。
我们的⽹络包含了许多新的和不寻常的特性,这些特性提⾼了它的性能并减少了它的训练时间,这些特性在第3节中详细介绍。
即使有120万个标记的训练样本,我们的⽹络规模(过⼤)使得过度拟合成为⼀个重要的问题。所以我们使⽤了⼀些有效的技术来防⽌过度拟合,如第4节所述。
我们最终的⽹络包含5个卷积层和3个全连接层,这个深度似乎很重要:我们发现去掉任何卷积层(每个卷积层只包含不到1%的模型参数)都会导致性能下降。
In the end, the network's size is limited mainly by the amount of memory available on current GPUs and by the amount of training time that we are willing to tolerate. Our network takes between five and six days to train on two GTX 580 3GB GPUs. All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.
最后，网络的大小主要受到当前GPU上可用显存大小以及我们愿意忍受的训练时间的限制。我们的网络在两块GTX 580 3GB GPU上需要训练5到6天。我们所有的实验都表明，只要等到更快的GPU和更大的数据集出现，我们的结果就还可以进一步改善。
2 The Dataset
ImageNet is a datat of over 15 million labeled high-resolution images belonging to roughly 22,000
categories. The images were collected from the web and labeled by human labelers using Amazon’s
Mechanical Turk crowd-sourcing tool. Starting in 2010, as part of the Pascal Visual Object Challenge, an annual competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has been held.
ILSVRC uses a subset of ImageNet with roughly 1000 images in each of 1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images.
ImageNet是一个包含超过1500万张带标签高分辨率图像的数据集，这些图像大约属于22000个类别。这些图片是从网上收集来的，并由人工标注者使用亚马逊的Mechanical Turk众包工具进行标注。从2010年开始，作为Pascal视觉对象挑战赛的一部分，每年都会举办一场名为ImageNet大规模视觉识别挑战赛（ILSVRC）的比赛。ILSVRC使用ImageNet的一个子集，1000个类别中每个类别大约有1000张图片。总共大约有120万张训练图像、5万张验证图像和15万张测试图像。
ILSVRC-2010 is the only version of ILSVRC for which the test set labels are available, so this is the version on which we performed most of our experiments. Since we also entered our model in the ILSVRC-2012 competition, in Section 6 we report our results on this version of the dataset as well, for which test set labels are unavailable. On ImageNet, it is customary to report two error rates: top-1 and top-5, where the top-5 error rate is the fraction of test images for which the correct label is not among the five labels considered most probable by the model.
ILSVRC-2010是唯一一个测试集标签可用的ILSVRC版本，因此我们的大多数实验都是在这个版本上进行的。由于我们也用我们的模型参加了ILSVRC-2012竞赛，在第6节中我们也报告了在这个版本数据集上的结果，该版本的测试集标签是不可用的。在ImageNet上，通常报告两个错误率：top-1和top-5，其中top-5错误率是指正确标签不在模型认为最可能的五个标签之中的测试图像所占的比例。
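The definition above translates directly into a small evaluation routine. Below is a minimal sketch (not from the paper) of how the top-1 and top-5 error rates could be computed from a matrix of class scores; the function name and array shapes are illustrative assumptions.

```python
import numpy as np

def top_k_error(scores, labels, k):
    """Fraction of images whose true label is not among the k classes
    the model considers most probable.
    scores: (num_images, num_classes) array of class scores.
    labels: (num_images,) array of ground-truth class indices."""
    # Indices of the k highest-scoring classes for each image.
    top_k = np.argsort(scores, axis=1)[:, -k:]
    correct = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - correct.mean()

# Toy example: 4 images, 10 classes, random scores.
rng = np.random.default_rng(0)
scores = rng.standard_normal((4, 10))
labels = np.array([3, 1, 7, 2])
print(top_k_error(scores, labels, k=1), top_k_error(scores, labels, k=5))
```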
ImageNet consists of variable-resolution images, while our system requires a constant input dimensionality.
Therefore, we down-sampled the images to a fixed resolution of 256 × 256. Given a rectangular image, we first rescaled the image such that the shorter side was of length 256, and then cropped out the central 256 × 256 patch from the resulting image. We did not pre-process the images in any other way, except for subtracting the mean activity over the training set from each pixel. So we trained our network on the
(centered) raw RGB values of the pixels.
ImageNet由可变分辨率的图像组成,⽽我们的系统需要⼀个恒定的输⼊维数。
因此，我们将图像下采样到256 × 256的固定分辨率。给定一个矩形图像，我们首先重新缩放图像，使其短边长度为256，然后从缩放后的图像中裁剪出中心的256 × 256图像块。除了从每个像素中减去训练集上的均值之外，我们没有以任何其他方式对图像进行预处理。因此，我们是在（中心化后的）像素原始RGB值上训练网络的。
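As an illustration of this preprocessing step, here is a minimal Python sketch (not the authors' code) that rescales the shorter side to 256, takes the central 256 × 256 crop, and subtracts a per-pixel mean; `mean_image` is assumed to be a (256, 256, 3) array precomputed over the training set.

```python
import numpy as np
from PIL import Image

def preprocess(path, mean_image):
    """Resize the shorter side to 256, center-crop 256x256, and subtract
    the per-pixel training-set mean (as described in Section 2)."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = 256 / min(w, h)                       # shorter side becomes 256
    img = img.resize((round(w * scale), round(h * scale)))
    w, h = img.size
    left, top = (w - 256) // 2, (h - 256) // 2    # central 256x256 patch
    img = img.crop((left, top, left + 256, top + 256))
    x = np.asarray(img, dtype=np.float32)         # (256, 256, 3) raw RGB
    return x - mean_image                         # centered pixel values
```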
3 The Architecture
The architecture of our network is summarized in Figure 2. It contains eight learned layers — five
convolutional and three fully-connected. Below, we describe some of the novel or unusual features of our network’s architecture. Sections 3.1-3.4 are sorted according to our estimation of their importance, with the most important first.
3.1 ReLU Nonlinearity
The standard way to model a neuron's output f as a function of its input x is with f(x) = tanh(x) or f(x) = (1 + e^{-x})^{-1}. In terms of training time with gradient descent, these saturating nonlinearities are much slower than the non-saturating nonlinearity f(x) = max(0, x). Following Nair and Hinton [20], we refer to neurons with this nonlinearity as Rectified Linear Units (ReLUs). Deep convolutional neural networks with ReLUs train several times faster than their equivalents with tanh units. This is demonstrated in Figure 1, which shows the number of iterations required to reach 25% training error on the CIFAR-10 dataset for a particular four-layer convolutional network. This plot shows that we would not have been able to experiment with such large neural networks for this work if we had used traditional saturating neuron models.
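To make the contrast concrete, here is a tiny sketch (not from the paper) of the saturating tanh nonlinearity versus the non-saturating ReLU, f(x) = max(0, x):

```python
import numpy as np

def tanh_act(x):
    # Saturating: output is squashed into (-1, 1), so gradients vanish for large |x|.
    return np.tanh(x)

def relu(x):
    # Non-saturating: f(x) = max(0, x); the gradient stays 1 for all positive inputs.
    return np.maximum(0.0, x)

x = np.array([-3.0, -0.5, 0.0, 2.0, 8.0])
print(relu(x))      # [0.  0.  0.  2.  8.]
print(tanh_act(x))  # values near -1 and +1 at the extremes
```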
We are not the first to consider alternatives to traditional neuron models in CNNs. For example, Jarrett et al. [11] claim that the nonlinearity f(x) = |tanh(x)| works particularly well with their type of contrast normalization followed by local average pooling on the Caltech-101 dataset. However, on this dataset the primary concern is preventing overfitting, so the effect they are observing is different from the accelerated ability to fit the training set which we report when using ReLUs. Faster learning has a great influence on the performance of large models trained on large datasets.
3.2 Training on Multiple GPUs
A single GTX 580 GPU has only 3GB of memory, which limits the maximum size of the networks that can be trained on it. It turns out that 1.2 million training examples are enough to train networks which are too big to fit on one GPU. Therefore we spread the net across two GPUs. Current GPUs are particularly well-suited to cross-GPU parallelization, as they are able to read from and write to one another's memory directly, without going through host machine memory. The parallelization scheme that we employ essentially puts half of the kernels (or neurons) on each GPU, with one additional trick: the GPUs communicate only in certain layers. This means that, for example, the kernels of layer 3 take input from all kernel maps in layer 2. However, kernels in layer 4 take input only from those kernel maps in layer 3 which reside on the same GPU. Choosing the pattern of connectivity is a problem for cross-validation, but this allows us to precisely tune the amount of communication until it is an acceptable fraction of the amount of computation.
The resultant architecture is somewhat similar to that of the “columnar” CNN employed by Cireşan et al. [5], except that our columns are not independent (see Figure 2). This scheme reduces our top-1 and top-5 error rates by 1.7% and 1.2%, respectively, as compared with a net with half as many kernels in each convolutional layer trained on one GPU. The two-GPU net takes slightly less time to train than the one-GPU net.
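The connectivity pattern can be sketched in code. The following is a hedged PyTorch-style illustration of the idea, not the authors' cuda-convnet implementation: each GPU holds half of the kernels, feature maps are exchanged only before layer 3, and layer 4 reads only from its own GPU. It assumes two CUDA devices are available; the class and method names are made up for the example.

```python
import torch
import torch.nn as nn

dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")

class HalfColumn(nn.Module):
    """One GPU's half of the kernels for layers 1-4 (channel counts follow
    the halved AlexNet numbers: 48, 128, 192, 192 kernels per GPU)."""
    def __init__(self, device):
        super().__init__()
        self.device = device
        self.conv1 = nn.Conv2d(3, 48, 11, stride=4, padding=2).to(device)
        self.conv2 = nn.Conv2d(48, 128, 5, padding=2).to(device)
        # Layer 3 takes input from ALL layer-2 kernel maps (both GPUs): 2*128 channels.
        self.conv3 = nn.Conv2d(256, 192, 3, padding=1).to(device)
        # Layer 4 takes input only from layer-3 maps residing on the same GPU.
        self.conv4 = nn.Conv2d(192, 192, 3, padding=1).to(device)
        self.pool = nn.MaxPool2d(3, stride=2)
        self.relu = nn.ReLU()

    def front(self, x):                 # layers 1-2, purely local to one GPU
        x = self.pool(self.relu(self.conv1(x.to(self.device))))
        return self.pool(self.relu(self.conv2(x)))

    def back(self, both_halves):        # layers 3-4
        x = self.relu(self.conv3(both_halves.to(self.device)))
        return self.relu(self.conv4(x))

col0, col1 = HalfColumn(dev0), HalfColumn(dev1)
x = torch.randn(8, 3, 224, 224)
f0, f1 = col0.front(x), col1.front(x)
# Cross-GPU communication happens only here, before layer 3.
both0 = torch.cat([f0, f1.to(dev0)], dim=1)
both1 = torch.cat([f0.to(dev1), f1], dim=1)
out0, out1 = col0.back(both0), col1.back(both1)
```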
3.3 Local Response Normalization
ReLUs have the desirable property that they do not require input normalization to prevent them from
saturating. If at least some training examples produce a positive input to a ReLU, learning will happen in that neuron. However, we still find that the following local normalization scheme aids generalization.
Denoting by $a^i_{x,y}$ the activity of a neuron computed by applying kernel i at position (x, y) and then applying the ReLU nonlinearity, the response-normalized activity $b^i_{x,y}$ is given by the expression

$$b^i_{x,y} = a^i_{x,y} \Big/ \Big( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \big(a^j_{x,y}\big)^2 \Big)^{\beta},$$

where the sum runs over n “adjacent” kernel maps at the same spatial position, and N is the total number of kernels in the layer. The ordering of the kernel maps is of course arbitrary and determined before training begins. This sort of response normalization implements a form of lateral inhibition inspired by the type found in real neurons, creating competition for big activities amongst neuron outputs computed using different kernels. The constants k, n, α, and β are hyper-parameters whose values are determined using a validation set; we used k = 2, n = 5, α = 10^{-4}, and β = 0.75. We applied this normalization after applying the ReLU nonlinearity in certain layers (see Section 3.5).
This scheme bears some resemblance to the local contrast normalization scheme of Jarrett et al. [11], but ours would be more correctly termed “brightness normalization”, since we do not subtract the mean activity. Response normalization reduces our top-1 and top-5 error rates by 1.4% and 1.2%, respectively. We also verified the effectiveness of this scheme on the CIFAR-10 dataset: a four-layer CNN achieved a 13% test error rate without normalization and 11% with normalization.
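The normalization follows directly from the formula. Below is a minimal NumPy sketch (an illustration, not the paper's GPU code) of the cross-map normalization with the hyper-parameter values reported above:

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Cross-channel response normalization from Section 3.3.
    a: ReLU activations with shape (N_kernels, H, W)."""
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        # Sum of squared activities over n "adjacent" kernel maps
        # at the same spatial position.
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

a = np.maximum(0, np.random.randn(96, 27, 27))  # e.g. 96 kernel maps after ReLU
b = local_response_norm(a)
print(b.shape)
```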
3.4 Overlapping Pooling
Pooling layers in CNNs summarize the outputs of neighboring groups of neurons in the same kernel map. Traditionally, the neighborhoods summarized by adjacent pooling units do not overlap (e.g., [17, 11, 4]). To be more precise, a pooling layer can be thought of as consisting of a grid of pooling units spaced s pixels apart, each summarizing a neighborhood of size z × z centered at the location of the pooling unit. If we set s = z, we obtain traditional local pooling as commonly employed in CNNs. If we set s < z, we obtain overlapping pooling. This is what we use throughout our network, with s = 2 and z = 3. This scheme reduces the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, as compared with the non-overlapping scheme s = 2, z = 2, which produces output of equivalent dimensions. We generally observe during training that models with overlapping pooling find it slightly more difficult to overfit.
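As a quick illustration of the two settings, here is a small NumPy sketch (not from the paper) of max pooling over a single kernel map; with a 13 × 13 input, the settings z = 2, s = 2 and z = 3, s = 2 indeed produce outputs of the same size:

```python
import numpy as np

def max_pool(x, z, s):
    """Max pooling over a 2-D map with window size z and stride s.
    s = z gives traditional non-overlapping pooling; s < z overlaps."""
    h, w = x.shape
    out_h, out_w = (h - z) // s + 1, (w - z) // s + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * s:i * s + z, j * s:j * s + z].max()
    return out

x = np.random.randn(13, 13)
print(max_pool(x, z=2, s=2).shape)  # (6, 6): non-overlapping
print(max_pool(x, z=3, s=2).shape)  # (6, 6): overlapping, same output size
```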
3.5 Overall Architecture