An Analysis of Single-Layer Networks
in Unsupervised Feature Learning

Adam Coates
Stanford University
Computer Science Dept.
353 Serra Mall
Stanford, CA 94305

Honglak Lee
University of Michigan
Computer Science and Engineering
2260 Hayward Street
Ann Arbor, MI 48109

Andrew Y. Ng
Stanford University
Computer Science Dept.
353 Serra Mall
Stanford, CA 94305
Abstract
A great deal of research has focused on algorithms for learning features from unlabeled data. Indeed, much progress has been made on benchmark datasets like NORB and CIFAR by employing increasingly complex unsupervised learning algorithms and deep models. In this paper, however, we show that several simple factors, such as the number of hidden nodes in the model, may be more important to achieving high performance than the learning algorithm or the depth of the model. Specifically, we will apply several off-the-shelf feature learning algorithms (sparse auto-encoders, sparse RBMs, K-means clustering, and Gaussian mixtures) to CIFAR, NORB, and STL datasets using only single-layer networks. We then present a detailed analysis of the effect of changes in the model setup: the receptive field size, number of hidden nodes (features), the step-size ("stride") between extracted features, and the effect of whitening. Our results show that large numbers of hidden nodes and dense feature extraction are critical to achieving high performance—so critical, in fact, that when these parameters are pushed to their limits, we achieve state-of-the-art performance on both CIFAR-10 and NORB using only a single layer of features. More surprisingly, our best performance is based on K-means clustering, which is extremely fast, has no hyper-parameters to tune beyond the model structure itself, and is very easy to implement. Despite the simplicity of our system, we achieve accuracy beyond all previously published results on the CIFAR-10 and NORB datasets (79.6% and 97.2% respectively).
Appearing in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, Fort Lauderdale, FL, USA. Volume 15 of JMLR: W&CP 15. Copyright 2011 by the authors.

1 Introduction
Much recent work in machine learning has focused on learning good feature representations from unlabeled input data for higher-level tasks such as classification. Current solutions typically learn multi-level representations by greedily "pre-training" several layers of features, one layer at a time, using an unsupervised learning algorithm [11, 8, 18]. For each of these layers a number of design parameters are chosen: the number of features to learn, the locations where the features will be computed, and how to encode the inputs and outputs of the system. In this paper we study the effect of these choices on single-layer networks trained by several feature learning methods. Our results demonstrate that several key ingredients, orthogonal to the learning algorithm itself, can have a large impact on performance: whitening, large numbers of features, and dense feature extraction can all be major advantages. Even with very simple algorithms and a single layer of features, it is possible to achieve state-of-the-art performance by focusing effort on these choices rather than on the learning system itself.

A major drawback of many feature learning systems is their complexity and expense. In addition, many algorithms require careful selection of multiple hyper-parameters like learning rates, momentum, sparsity penalties, weight decay, and so on that must be chosen through cross-validation, thus increasing running times dramatically. Though it is true that recently introduced algorithms have consistently shown improvements on benchmark datasets like NORB [16] and CIFAR-10 [13], there are several other factors that affect the final performance of a feature learning system. Specifically, there are many "meta-parameters" defining the network architecture, such as the receptive field size and number of hidden nodes (features). In practice, these parameters are often determined by computational constraints. For instance, we might use the largest number of features possible considering the running time of the algorithm. In this paper, however, we pursue an alternative strategy: we employ very simple learning algorithms and then more carefully
choose the network parameters in search of higher performance. If (as is often the case) larger representations perform better, then we can leverage the speed and simplicity of these learning algorithms to use larger representations.
To this end, we will begin in Section 3 by describing a simple feature learning framework that incorporates an unsupervised learning algorithm as a "black box" module within. For this "black box", we have implemented several off-the-shelf unsupervised learning algorithms: sparse auto-encoders, sparse RBMs, K-means clustering, and Gaussian mixture models. We then analyze the performance impact of several different elements in the feature learning framework, including: (i) whitening, which is a common pre-process in deep learning work, (ii) the number of features trained, (iii) the step-size (stride) between extracted features, and (iv) the receptive field size.

It will turn out that whitening, large numbers of features, and small stride lead to uniformly better performance regardless of the choice of unsupervised learning algorithm. On the one hand, these results are somewhat unsurprising. For instance, it is widely held that highly over-complete feature representations tend to give better performance than smaller-sized representations [32], and similarly with small strides between features [21]. However, the main contribution of our work is demonstrating that these considerations may, in fact, be critical to the success of feature learning algorithms—potentially more important even than the choice of unsupervised learning algorithm. Indeed, it will be shown that when we push these parameters to their limits we can achieve state-of-the-art performance, outperforming many other more complex algorithms on the same task. Quite surprisingly, our best results are achieved using K-means clustering, an algorithm that has been used extensively in computer vision, but that has not been widely adopted for "deep" feature learning. Specifically, we achieve test accuracies of 79.6% on CIFAR-10 and 97.2% on NORB—better than all previously published results.

We will start by reviewing related work on feature learning, then move on to describe a general feature learning framework that we will use for evaluation in Section 3. We then present experimental analysis and results on CIFAR-10 [13] as well as NORB [16] in Section 4.
2 Related work
Since the introduction of unsupervised pre-training [8], many new schemes for stacking layers of features to build "deep" representations have been proposed. Most have focused on creating new training algorithms to build single-layer models that are composed to build deeper structures. Among the algorithms considered in the literature are sparse coding [22, 17, 32], RBMs [8, 13], sparse RBMs [18], sparse auto-encoders [7, 25], denoising auto-encoders [30], "factored" [24] and mean-covariance [23] RBMs, as well as many others [19, 33]. Thus, amongst the many components of feature learning architectures, the unsupervised learning module appears to be the most heavily scrutinized.

Some work, however, has considered the impact of other choices in these feature learning systems, especially the choice of network architecture. Jarrett et al. [11], for instance, have considered the impact of changes to the "pooling" strategies frequently employed between layers of features, as well as different forms of normalization and rectification between layers. Similarly, Boureau et al. have considered the impact of coding strategies and different types of pooling, both in practice [3] and in theory [4]. Our work follows in this vein, but considers instead the structure of single-layer networks—before pooling, and orthogonal to the choice of algorithm or coding scheme.

Many common threads from the computer vision literature also relate to our work and to feature learning more broadly. For instance, we will use the K-means clustering algorithm as an alternative unsupervised learning module. K-means has been used less widely in "deep learning" work but has enjoyed wide adoption in computer vision for building codebooks of "visual words" [5, 6, 15, 31], which are used to define higher-level image features. This method has also been applied recursively to build multiple layers of features [1]. The effects of pooling and the choice of activation function or coding scheme have similarly been studied for these models [15, 28, 21]. Van Gemert et al., for instance, demonstrate that "soft" activation functions ("kernels") tend to work better than the hard assignment typically used with visual words models.

This paper will compare results along some of the same axes as these prior works (e.g., we will consider both 'hard' and 'soft' activation functions), but our conclusions differ somewhat: While we confirm that some feature-learning schemes are better than others, we also show that these differences can often be outweighed by other factors, such as the number of features. Thus, even though more complex learning schemes may improve performance slightly, these advantages can be overcome by fast, simple learning algorithms that are able to handle larger networks.
3 Unsupervised feature learning framework

In this section, we describe a common framework used for feature learning. For concreteness, we will focus on the application of these algorithms to learning features from images, though our approach is applicable to other forms of data as well.
The framework we use involves several stages and is similar to those employed in computer vision [5, 15, 31, 28, 1], as well as other feature learning work [16, 19, 3].
At a high level, our system performs the following steps to learn a feature representation:

1. Extract random patches from unlabeled training images.
2. Apply a pre-processing stage to the patches.
3. Learn a feature-mapping using an unsupervised learning algorithm.

Given the learned feature mapping and a set of labeled training images, we can then perform feature extraction and classification:

1. Extract features from equally spaced sub-patches covering the input image.
2. Pool features together over regions of the input image to reduce the number of feature values.
3. Train a linear classifier to predict the labels given the feature vectors.
We will now describe the components of this pipeline and its parameters in more detail.
3.1 Feature Learning
As mentioned above, the system begins by extracting random sub-patches from unlabeled input images. Each patch has dimension w-by-w and has d channels,1 with w referred to as the "receptive field size". Each w-by-w patch can be represented as a vector in R^N of pixel intensity values, with N = w · w · d. We then construct a dataset of m randomly sampled patches, X = {x(1), ..., x(m)}, where x(i) ∈ R^N. Given this dataset, we apply the pre-processing and unsupervised learning steps.

1 For example, if the input image is represented in (R,G,B) colors, then it has three channels.
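To make the patch-extraction step concrete, the following is a minimal NumPy sketch of sampling m random w-by-w patches from a set of unlabeled images; the array layout and the function name are illustrative assumptions rather than code from the paper.

```python
import numpy as np

def sample_random_patches(images, w, m, rng=None):
    """Step 1 of the pipeline: extract m random w-by-w patches.

    images: (num_images, n, n, d) array of unlabeled training images.
    Returns an (m, w*w*d) matrix X of flattened patches, matching the
    N = w*w*d vectorization described above.
    """
    rng = np.random.default_rng(rng)
    num_images, n, _, d = images.shape
    patches = np.empty((m, w * w * d))
    for t in range(m):
        idx = rng.integers(num_images)            # pick a random image
        i, j = rng.integers(0, n - w + 1, size=2) # pick a random location
        patches[t] = images[idx, i:i + w, j:j + w, :].ravel()
    return patches
```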
3.1.1 Pre-processing
It is common practice to perform several simple normalization steps before attempting to generate features from data. In this work, we assume that every patch x(i) is normalized by subtracting the mean and dividing by the standard deviation of its elements. For visual data, this corresponds to local brightness and contrast normalization.
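A minimal sketch of this per-patch normalization, assuming the patches are stored as rows of an (m, N) NumPy array; the small constant added to the denominator is our own safeguard against constant patches and is not specified in the paper.

```python
import numpy as np

def normalize_patches(X, eps=1e-8):
    """Per-patch brightness/contrast normalization.

    X is an (m, N) array of patches, N = w*w*d. Each patch has its mean
    subtracted and is divided by its standard deviation. The eps term is
    an assumption to avoid division by zero for constant patches.
    """
    X = X.astype(np.float64)
    X = X - X.mean(axis=1, keepdims=True)
    X = X / (X.std(axis=1, keepdims=True) + eps)
    return X
```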
After normalizing each input vector, the entire dataset X may optionally be whitened [10]. While this process is commonly used in deep learning (e.g., [24]), it is less frequently employed in computer vision. We will present experimental results obtained both with and without whitening to determine whether this component is generally necessary.
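Our experiments use zero-phase (ZCA) whitening (see footnote 6 in Section 4.2). The following is a minimal sketch of that transform; the eigenvalue regularizer eps is an assumed detail, not a value given in the paper.

```python
import numpy as np

def zca_whiten(X, eps=0.01):
    """Zero-phase (ZCA) whitening of an (m, N) patch matrix.

    Rotates into the PCA basis, rescales each component to unit variance,
    and rotates back, so whitened patches stay in the original pixel space.
    The regularizer eps added to the eigenvalues is an assumption.
    """
    mean = X.mean(axis=0, keepdims=True)
    Xc = X - mean
    cov = np.cov(Xc, rowvar=False)              # N x N covariance
    eigvals, eigvecs = np.linalg.eigh(cov)      # symmetric eigendecomposition
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W, mean, W                      # keep mean and W for test-time whitening
```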
3.1.2 Unsupervised learning
After pre-processing, an unsupervised learning algorithm is used to discover features from the unlabeled data. For our purposes, we will view an unsupervised learning algorithm as a "black box" that takes the dataset X and outputs a function f : R^N → R^K that maps an input vector x(i) to a new feature vector of K features, where K is a parameter of the algorithm. We denote the k-th feature as f_k. In this work, we will use several different unsupervised learning methods2 in this role: (i) sparse auto-encoders, (ii) sparse RBMs, (iii) K-means clustering, and (iv) Gaussian mixtures. We briefly summarize how these algorithms are employed in our system.

2 These algorithms were chosen since they can scale up straightforwardly to the problem sizes considered in our experiments.
1. Sparse auto-encoder: We train an auto-encoder with K hidden nodes using back-propagation to minimize squared reconstruction error with an additional penalty term that encourages the units to maintain a low average activation [18, 7]. The algorithm outputs weights W ∈ R^(K×N) and bias b ∈ R^K such that the feature mapping f is defined by:

f(x) = g(Wx + b),   (1)

where g(z) = 1/(1 + exp(−z)) is the logistic sigmoid function, applied component-wise to the vector z.

There are several hyper-parameters used by the training algorithm (e.g., weight decay and target activation). These parameters were chosen using cross-validation for each choice of the receptive field size, w.3

3 Ideally, we would perform this cross-validation for every choice of parameters, but the expense is prohibitive for the number of experiments we perform here. This is a major advantage of the K-means algorithm, which requires no such procedure.
2. Sparse restricted Boltzmann machine: The restricted Boltzmann machine (RBM) is an undirected graphical model with K binary hidden variables. Sparse RBMs can be trained using the contrastive divergence approximation [9] with the same type of sparsity penalty as the auto-encoders. The training also produces weights W and bias b, and we can use the same feature mapping as the auto-encoder (as in Equation (1))—thus, these algorithms differ primarily in their training method. Also as above, the necessary hyper-parameters are determined by cross-validation for each receptive field size.
3. K-means clustering: We apply K-means clustering to learn K centroids c(k) from the input data. Given the learned centroids c(k), we consider two choices for the feature mapping f. The first is the standard 1-of-K, hard-assignment coding scheme:

f_k(x) = 1 if k = arg min_j ||c(j) − x||_2^2, and 0 otherwise.   (2)

This is a (maximally) sparse representation that has been used frequently in computer vision [5]. It has been noted, however, that this may be too terse [28]. Thus our second choice of feature mapping is a non-linear mapping that attempts to be "softer" than the above encoding while also keeping some sparsity:

f_k(x) = max{0, µ(z) − z_k}   (3)

where z_k = ||x − c(k)||_2 and µ(z) is the mean of the elements of z. This activation function outputs 0 for any feature f_k where the distance to the centroid c(k) is "above average". In practice, this means that roughly half of the features will be set to 0. This can be thought of as a very simple form of "competition" between features.

We refer to these in our results as K-means (hard) and K-means (triangle) respectively.
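A minimal sketch of both K-means encodings, Equations (2) and (3), assuming the centroids have already been learned on pre-processed patches; the vectorized distance computation is an implementation choice, not from the paper.

```python
import numpy as np

def kmeans_features(X, centroids, encoding="triangle"):
    """K-means feature mappings from Equations (2) and (3).

    X: (m, N) pre-processed patches; centroids: (K, N) learned by K-means.
    "hard" gives the 1-of-K assignment of Equation (2); "triangle" gives
    f_k(x) = max(0, mu(z) - z_k) of Equation (3), where z_k is the
    distance from the patch to centroid k.
    """
    # Pairwise Euclidean distances, shape (m, K).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    if encoding == "hard":
        F = np.zeros_like(dists)
        F[np.arange(X.shape[0]), dists.argmin(axis=1)] = 1.0
        return F
    mu = dists.mean(axis=1, keepdims=True)   # per-patch mean distance
    return np.maximum(0.0, mu - dists)       # "triangle" activation
```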
4. Gaussian mixtures: Gaussian mixture models (GMMs) represent the density of the input data as a mixture of K Gaussian distributions and are widely used for clustering. GMMs can be trained using the Expectation-Maximization (EM) algorithm as in [1]. We run a single iteration of K-means to initialize the mixture model.4 The feature mapping f maps each input to the posterior membership probabilities:

f_k(x) = φ_k N(x; c(k), Σ_k) / Σ_j φ_j N(x; c(j), Σ_j),

where N(x; c(k), Σ_k) = (2π)^(−d/2) |Σ_k|^(−1/2) exp(−(1/2)(x − c(k))^T Σ_k^(−1) (x − c(k))), Σ_k is a diagonal covariance, and φ_k are the cluster prior probabilities learned by the EM algorithm.

4 When K-means is run to convergence we have found that the mixture model does not learn features substantially different from the K-means result.
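The sketch below computes these posterior membership probabilities for a diagonal-covariance GMM; the log-domain arithmetic used for numerical stability is our own implementation choice and is not described in the paper.

```python
import numpy as np

def gmm_posterior_features(X, means, diag_vars, priors):
    """Posterior membership probabilities under a diagonal-covariance GMM.

    X: (m, N) patches; means: (K, N); diag_vars: (K, N) diagonal covariances;
    priors: (K,) mixture weights phi_k learned by EM (training not shown).
    Computations are done in the log domain for numerical stability.
    """
    m, N = X.shape
    diff = X[:, None, :] - means[None, :, :]                      # (m, K, N)
    # Log-density of each patch under each diagonal Gaussian.
    log_dens = -0.5 * (np.sum(diff**2 / diag_vars[None, :, :], axis=2)
                       + np.sum(np.log(diag_vars), axis=1)[None, :]
                       + N * np.log(2 * np.pi))
    log_post = np.log(priors)[None, :] + log_dens                 # unnormalized log-posterior
    log_post -= log_post.max(axis=1, keepdims=True)               # stabilize before exp
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)                 # rows sum to 1
```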
3.2 Feature Extraction and Classification

The above steps, for a particular choice of unsupervised learning algorithm, yield a function f that transforms an input patch x ∈ R^N to a new representation y = f(x) ∈ R^K. Using this feature extractor, we now apply it to our (labeled) training images for classification.
3.2.1 Convolutional extraction
Using the learned feature extractor f : R^N → R^K, given any w-by-w image patch, we can now compute a representation y ∈ R^K for that patch. We can thus define a (single layer) representation of the entire image by applying the function f to many sub-patches. Specifically, given an image of n-by-n pixels (with d channels), we define a (n − w + 1)-by-(n − w + 1) representation (with K channels), by computing the representation y for each w-by-w "sub-patch" of the input image. More formally, we will let y(ij) be the K-dimensional representation extracted from location i, j of the input image. For computational efficiency, we may also "step" our w-by-w feature extractor across the image with some step-size (or "stride") s greater than 1. This is illustrated in Figure 1.
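A minimal sketch of this convolutional extraction with stride s, assuming a patch-level feature mapping f such as one of those above; the per-patch normalization and whitening applied at training time are omitted here for brevity.

```python
import numpy as np

def extract_convolutional_features(image, f, w, stride=1):
    """Apply a patch-level feature extractor f convolutionally over an image.

    image: (n, n, d) array; f maps an (m, w*w*d) batch of flattened patches
    to an (m, K) batch of features. Returns a (rows, cols, K) array of
    per-location features, where rows = cols = number of strided positions.
    """
    n = image.shape[0]
    positions = range(0, n - w + 1, stride)
    patches = np.array([image[i:i + w, j:j + w, :].ravel()
                        for i in positions for j in positions])
    features = f(patches)                       # (num_patches, K)
    rows = cols = len(positions)
    return features.reshape(rows, cols, -1)
```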
3.2.2 Classification
Before classification, it is standard practice to reduce the dimensionality of the image representation by pooling. For a stride of s = 1, our feature mapping produces a (n − w + 1)-by-(n − w + 1)-by-K representation. We can reduce this by summing up over local regions of the y(ij)'s extracted as above. This procedure is commonly used (in many variations) in computer vision [15] as well as deep feature learning [11].
In our system, we use a very simple form of pooling. Specifically, we split the y(ij)'s into four equal-sized quadrants, and compute the sum of the y(ij)'s in each. This yields a reduced (K-dimensional) representation of each quadrant, for a total of 4K features that we use for classification.
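A minimal sketch of this quadrant sum-pooling over the (n − w + 1)-by-(n − w + 1)-by-K feature map; how odd-sized maps are split is an assumption, since the paper does not discuss that detail.

```python
import numpy as np

def quadrant_pool(feature_map):
    """Sum-pool a (rows, cols, K) feature map over four equal quadrants.

    Returns a vector of length 4K, the pooled representation fed to the
    linear classifier. Odd rows/cols are split with integer halving.
    """
    rows, cols, K = feature_map.shape
    r, c = rows // 2, cols // 2
    quadrants = [feature_map[:r, :c], feature_map[:r, c:],
                 feature_map[r:, :c], feature_map[r:, c:]]
    return np.concatenate([q.sum(axis=(0, 1)) for q in quadrants])
```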
Given the pooled (4K-dimensional) feature vectors for each training image and a label, we apply standard linear classification algorithms. In our experiments we use (L2) SVM classification. The regularization parameter is determined by cross-validation.
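As an illustration only, the sketch below trains the linear classifier with scikit-learn's LinearSVC; the library and the grid of regularization values are assumptions, since the paper states only that an (L2) SVM is used with the regularization parameter chosen by cross-validation.

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

def train_linear_svm(features, labels):
    """Train a linear (L2) SVM on pooled 4K-dimensional feature vectors.

    scikit-learn's LinearSVC stands in for the paper's solver, and the grid
    of regularization values C is an assumed example; the regularization
    strength is selected by cross-validation as described in the text.
    """
    grid = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
    grid.fit(features, labels)
    return grid.best_estimator_
```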
4 Experiments and Analysis
The above framework includes a number of parameters that can be changed: (i) whether to use whitening or not, (ii) the number of features K, (iii) the stride s, and (iv) the receptive field size w. In this section, we present our experimental results on the impact of these parameters on performance. First, we will evaluate the effects of these parameters using cross-validation on the CIFAR-10 training set. We will then report the results achieved on both CIFAR-10 and NORB test sets using each unsupervised learning algorithm and the parameter settings that our analysis suggests are best overall (i.e., in our final results, we use the same settings for all algorithms).5
Our basic testing procedure is as follows. For each unsupervised learning algorithm in Section 3.1.2, we will train a single layer of features using either whitened data or raw data and a choice of the parameters K, s, and w. We then train a linear classifier as described in Section 3.2.2, and test the classifier on a holdout set (for our main analysis) or the test set (for our final results).

4.1 Visualization
Before we present classification results, we first show visualizations of the learned feature representations. The bases (or centroids) learned by sparse autoencoders, sparse RBMs, K-means, and Gaussian mixture models are shown in Figure 2 for 8 pixel receptive fields. It is well-known that autoencoders and RBMs yield localized filters that resemble Gabor filters and we can see this in our results both when using whitened data and, to a lesser extent, raw data. However, these visualizations also show that similar results can be achieved using clustering algorithms. In particular, while clustering raw data leads to centroids consistent with those in [6] and [29], we see that clustering whitened data yields sharply localized filters that are very similar to those learned by the other algorithms. Thus, it appears that such features are easy to learn with clustering methods (without any parameter tweaking) as a result of whitening.
5 To clarify: The parameters used in our final evaluation are those that achieved the best (average) cross-validation performance across all models: whitening, 1 pixel stride, 6 pixel receptive field, and 1600 features.
Figure 3: Effect of whitening and number of bases (or centroids). (Axes: number of features vs. cross-validation accuracy (%).)

4.2 Effect of whitening
We now move on to our characterization of performance on various axes of parameters, starting with the effect of whitening,6 which visibly changes the learned bases (or centroids) as seen in Figure 2. Figure 3 shows the performance for all of our algorithms as a function of the number of features (which we will discuss in the next section) both with and without whitening. These experiments used a stride of 1 pixel and a 6 pixel receptive field.

For sparse autoencoders and RBMs, the effect of whitening is somewhat ambiguous. When using only 100 features, there is a significant benefit of whitening for sparse RBMs, but this advantage disappears with larger numbers of features. For the clustering algorithms, however, we see that whitening is a crucial pre-process since the clustering algorithms cannot handle the correlations in the data.7
6 In our experiments, we use zero-phase whitening [2].
7 Our GMM implementation uses diagonal covariances.