An Analysis of Single-Layer Networks in Unsupervised Feature Learning


Adam Coates
Stanford University
Computer Science Dept.
353 Serra Mall
Stanford, CA 94305

Honglak Lee
University of Michigan
Computer Science and Engineering
2260 Hayward Street
Ann Arbor, MI 48109

Andrew Y. Ng
Stanford University
Computer Science Dept.
353 Serra Mall
Stanford, CA 94305
Abstract
A great deal of research has focused on algorithms for learning features from unlabeled data. Indeed, much progress has been made on benchmark datasets like NORB and CIFAR by employing increasingly complex unsupervised learning algorithms and deep models. In this paper, however, we show that several simple factors, such as the number of hidden nodes in the model, may be more important to achieving high performance than the learning algorithm or the depth of the model. Specifically, we will apply several off-the-shelf feature learning algorithms (sparse auto-encoders, sparse RBMs, K-means clustering, and Gaussian mixtures) to CIFAR, NORB, and STL datasets using only single-layer networks. We then present a detailed analysis of the effect of changes in the model setup: the receptive field size, number of hidden nodes (features), the step-size ("stride") between extracted features, and the effect of whitening. Our results show that large numbers of hidden nodes and dense feature extraction are critical to achieving high performance; so critical, in fact, that when these parameters are pushed to their limits, we achieve state-of-the-art performance on both CIFAR-10 and NORB using only a single layer of features. More surprisingly, our best performance is based on K-means clustering, which is extremely fast, has no hyper-parameters to tune beyond the model structure itself, and is very easy to implement. Despite the simplicity of our system, we achieve accuracy beyond all previously published results on the CIFAR-10 and NORB datasets (79.6% and 97.2% respectively).
Appearing in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, Fort Lauderdale, FL, USA. Volume 15 of JMLR: W&CP 15. Copyright 2011 by the authors.

1 Introduction
Much recent work in machine learning has focused on learning good feature representations from unlabeled input data for higher-level tasks such as classification. Current solutions typically learn multi-level representations by greedily "pre-training" several layers of features, one layer at a time, using an unsupervised learning algorithm [11, 8, 18]. For each of these layers a number of design parameters are chosen: the number of features to learn, the locations where these features will be computed, and how to encode the inputs and outputs of the system. In this paper we study the effect of these choices on single-layer networks trained by several feature learning methods. Our results demonstrate that several key ingredients, orthogonal to the learning algorithm itself, can have a large impact on performance: whitening, large numbers of features, and dense feature extraction can all be major advantages. Even with very simple algorithms and a single layer of features, it is possible to achieve state-of-the-art performance by focusing effort on these choices rather than on the learning system itself.
A major drawback of many feature learning systems is their complexity and expense. In addition, many algorithms require careful selection of multiple hyper-parameters like learning rates, momentum, sparsity penalties, weight decay, and so on that must be chosen through cross-validation, thus increasing running times dramatically. Though it is true that recently introduced algorithms have consistently shown improvements on benchmark datasets like NORB [16] and CIFAR-10 [13], there are several other factors that affect the final performance of a feature learning system. Specifically, there are many "meta-parameters" defining the network architecture, such as the receptive field size and number of hidden nodes (features). In practice, these parameters are often determined by computational constraints. For instance, we might use the largest number of features possible considering the running time of the algorithm. In this paper, however, we pursue an alternative strategy: we employ very simple learning algorithms and then more carefully choose the network parameters in search of higher performance. If (as is often the case) larger representations perform better, then we can leverage the speed and simplicity of these learning algorithms to use larger representations.
To this end, we will begin in Section 3 by describing a simple feature learning framework that incorporates an unsupervised learning algorithm as a "black box" module within it. For this "black box", we have implemented several off-the-shelf unsupervised learning algorithms: sparse auto-encoders, sparse RBMs, K-means clustering, and Gaussian mixture models. We then analyze the performance impact of several different elements in the feature learning framework, including: (i) whitening, which is a common pre-process in deep learning work, (ii) the number of features trained, (iii) the step-size (stride) between extracted features, and (iv) the receptive field size.
It will turn out that whitening, large numbers of features, and small stride lead to uniformly better performance regardless of the choice of unsupervised learning algorithm. On the one hand, these results are somewhat unsurprising. For instance, it is widely held that highly over-complete feature representations tend to give better performance than smaller-sized representations [32], and similarly with small strides between features [21]. However, the main contribution of our work is demonstrating that these considerations may, in fact, be critical to the success of feature learning algorithms, potentially more important even than the choice of unsupervised learning algorithm. Indeed, it will be shown that when we push these parameters to their limits, we can achieve state-of-the-art performance, outperforming many other more complex algorithms on the same task. Quite surprisingly, our best results are achieved using K-means clustering, an algorithm that has been used extensively in computer vision, but that has not been widely adopted for "deep" feature learning. Specifically, we achieve test accuracies of 79.6% on CIFAR-10 and 97.2% on NORB, better than all previously published results.
We will start by reviewing related work on feature learning, then move on to describe a general feature learning framework that we will use for evaluation in Section 3. We then present experimental analysis and results on CIFAR-10 [13] as well as NORB [16] in Section 4.
2 Related work
Since the introduction of unsupervised pre-training [8], many new schemes for stacking layers of features to build "deep" representations have been proposed. Most have focused on creating new training algorithms to build single-layer models that are composed to build deeper structures. Among the algorithms considered in the literature are sparse-coding [22, 17, 32], RBMs [8, 13], sparse RBMs [18], sparse auto-encoders [7, 25], denoising auto-encoders [30], "factored" [24] and mean-covariance [23] RBMs, as well as many others [19, 33]. Thus, amongst the many components of feature learning architectures, the unsupervised learning module appears to be the most heavily scrutinized.
Some work, however, has considered the impact of other choices in these feature learning systems, especially the choice of network architecture. Jarrett et al. [11], for instance, have considered the impact of changes to the "pooling" strategies frequently employed between layers of features, as well as different forms of normalization and rectification between layers. Similarly, Boureau et al. have considered the impact of coding strategies and different types of pooling, both in practice [3] and in theory [4]. Our work follows in this vein, but considers instead the structure of single-layer networks, before pooling, and orthogonal to the choice of algorithm or coding scheme.
Many common threads from the computer vision literature also relate to our work and to feature learning more broadly. For instance, we will use the K-means clustering algorithm as an alternative unsupervised learning module. K-means has been used less widely in "deep learning" work, but has enjoyed wide adoption in computer vision for building codebooks of "visual words" [5, 6, 15, 31], which are used to define higher-level image features. This method has also been applied recursively to build multiple layers of features [1]. The effects of pooling and choice of activation function or coding scheme have similarly been studied for these models [15, 28, 21]. Van Gemert et al., for instance, demonstrate that "soft" activation functions ("kernels") tend to work better than the hard assignment typically used with visual words models.
This paper will compare results along some of the same axes as this prior work (e.g., we will consider both 'hard' and 'soft' activation functions), but our conclusions differ somewhat: while we confirm that some feature-learning schemes are better than others, we also show that these differences can often be outweighed by other factors, such as the number of features. Thus, even though more complex learning schemes may improve performance slightly, these advantages can be overcome by fast, simple learning algorithms that are able to handle larger networks.
3 Unsupervised feature learning framework
In this section, we describe a common framework used for feature learning. For concreteness, we will focus on the application of these algorithms to learning features from images, though our approach is applicable to other forms of data as well. The framework we use involves several stages and is similar to those employed in computer vision [5, 15, 31, 28, 1], as well as other feature learning work [16, 19, 3].
At a high level, our system performs the following steps to learn a feature representation:
1. Extract random patches from unlabeled training images.
2. Apply a pre-processing stage to the patches.
3. Learn a feature-mapping using an unsupervised learning algorithm.
Given the learned feature mapping and a set of labeled training images, we can then perform feature extraction and classification:
1. Extract features from equally spaced sub-patches covering the input image.
2. Pool features together over regions of the input image to reduce the number of feature values.
3. Train a linear classifier to predict the labels given these feature vectors.
We will now describe the components of this pipeline and its parameters in more detail.
3.1 Feature Learning
As mentioned above, the system begins by extracting random sub-patches from unlabeled input images. Each patch has dimension w-by-w and has d channels (for example, if the input image is represented in (R, G, B) colors, then it has three channels), with w referred to as the "receptive field size". Each w-by-w patch can be represented as a vector in R^N of pixel intensity values, with N = w · w · d. We then construct a dataset of m randomly sampled patches, X = {x^(1), ..., x^(m)}, where x^(i) ∈ R^N. Given this dataset, we apply the pre-processing and unsupervised learning steps.
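As an illustration of this first step, the following minimal sketch (not the implementation used in our experiments) samples random w-by-w patches from a stack of unlabeled images and flattens each into the N = w · w · d vector described above; the (num_images, n, n, d) array layout and the function name are assumptions of the example.

```python
import numpy as np

def extract_random_patches(images, num_patches, w, rng=None):
    """Sample random w-by-w patches from unlabeled images.

    images: array of shape (num_images, n, n, d).
    Returns X of shape (num_patches, w * w * d): one flattened patch
    x^(i) per row, i.e. the dataset X = {x^(1), ..., x^(m)}.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_images, n, _, d = images.shape
    X = np.empty((num_patches, w * w * d))
    for i in range(num_patches):
        img = images[rng.integers(num_images)]
        r = rng.integers(n - w + 1)   # top-left row of the patch
        c = rng.integers(n - w + 1)   # top-left column of the patch
        X[i] = img[r:r + w, c:c + w, :].reshape(-1)
    return X
```

For CIFAR-10-sized images (n = 32, d = 3) and a 6 pixel receptive field, each row of X has N = 6 · 6 · 3 = 108 entries.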
3.1.1 Pre-processing
It is common practice to perform several simple normalization steps before attempting to generate features from data. In this work, we assume that every patch x^(i) is normalized by subtracting the mean and dividing by the standard deviation of its elements. For visual data, this corresponds to local brightness and contrast normalization.
After normalizing each input vector, the entire dataset X may optionally be whitened [10]. While this process is commonly used in deep learning work (e.g., [24]), it is less frequently employed in computer vision. We will present experimental results obtained both with and without whitening to determine whether this component is generally necessary.
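The following minimal sketch illustrates these two pre-processing steps on the patch matrix X (one patch per row). The whitening shown is zero-phase (ZCA) whitening (see Section 4.2); the small regularization constants are assumed for numerical stability and are not prescribed by the text.

```python
import numpy as np

def normalize_patches(X, eps=1e-8):
    """Local brightness/contrast normalization: subtract each patch's mean
    and divide by its standard deviation. eps (assumed) avoids division by
    zero for constant patches."""
    X = X - X.mean(axis=1, keepdims=True)
    return X / (X.std(axis=1, keepdims=True) + eps)

def zca_whiten(X, eps=0.01):
    """Zero-phase (ZCA) whitening of the normalized dataset X (rows are
    patches). Returns the whitened data together with the mean and the
    transform, so the same whitening can be applied to new patches later."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / Xc.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rotate into the eigenbasis, rescale each direction, rotate back.
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W, mean, W
```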
3.1.2 Unsupervised learning
After pre-processing, an unsupervised learning algorithm is used to discover features from the unlabeled data. For our purposes, we will view an unsupervised learning algorithm as a "black box" that takes the dataset X and outputs a function f : R^N → R^K that maps an input vector x^(i) to a new feature vector of K features, where K is a parameter of the algorithm. We denote the k-th feature as f_k. In this work, we will use several different unsupervised learning methods in this role (these algorithms were chosen because they can scale up straightforwardly to the problem sizes considered in our experiments): (i) sparse auto-encoders, (ii) sparse RBMs, (iii) K-means clustering, and (iv) Gaussian mixtures. We briefly summarize how these algorithms are employed in our system.
1. Sparse auto-encoder: We train an auto-encoder with K hidden nodes using back-propagation to minimize squared reconstruction error with an additional penalty term that encourages the units to maintain a low average activation [18, 7]. The algorithm outputs weights W ∈ R^(K×N) and bias b ∈ R^K such that the feature mapping f is defined by:

f(x) = g(Wx + b),    (1)

where g(z) = 1/(1 + exp(−z)) is the logistic sigmoid function, applied component-wise to the vector z. There are several hyper-parameters used by the training algorithm (e.g., weight decay and target activation). These parameters were chosen using cross-validation for each choice of the receptive field size, w. (Ideally, we would perform this cross-validation for every choice of parameters, but the expense is prohibitive for the number of experiments we perform here. This is a major advantage of the K-means algorithm, which requires no such procedure.)
2. Sparse restricted Boltzmann machine: The restricted Boltzmann machine (RBM) is an undirected graphical model with K binary hidden variables. Sparse RBMs can be trained using the contrastive divergence approximation [9] with the same type of sparsity penalty as the auto-encoders. The training also produces weights W and bias b, and we can use the same feature mapping as the auto-encoder (as in Equation (1)); thus, these algorithms differ primarily in their training method. Also as above, the necessary hyper-parameters are determined by cross-validation for each receptive field size.
3. K-means clustering: We apply K-means clustering to learn K centroids c^(k) from the input data. Given the learned centroids c^(k), we consider two choices for the feature mapping f. The first is the standard 1-of-K, hard-assignment coding scheme:

f_k(x) = 1 if k = arg min_j ||c^(j) − x||_2^2, and 0 otherwise.    (2)

This is a (maximally) sparse representation that has been used frequently in computer vision [5]. It has been noted, however, that this may be too terse [28]. Thus our second choice of feature mapping is a non-linear mapping that attempts to be "softer" than the above encoding while also keeping some sparsity:

f_k(x) = max{0, μ(z) − z_k},    (3)

where z_k = ||x − c^(k)||_2 and μ(z) is the mean of the elements of z. This activation function outputs 0 for any feature f_k where the distance to the centroid c^(k) is "above average". In practice, this means that roughly half of the features will be set to 0. This can be thought of as a very simple form of "competition" between features. We refer to these in our results as K-means (hard) and K-means (triangle) respectively; a code sketch of both encodings is given after this list.
4. Gaussian mixtures: Gaussian mixture models (GMMs) represent the density of input data as a mixture of K Gaussian distributions and are widely used for clustering. GMMs can be trained using the Expectation-Maximization (EM) algorithm as in [1]. We run a single iteration of K-means to initialize the mixture model. (When K-means is run to convergence, we have found that the mixture model does not learn features substantially different from the K-means result.) The feature mapping f maps each input to the posterior membership probabilities:

f_k(x) = [ φ_k · (2π)^(−d/2) |Σ_k|^(−1/2) exp( −(1/2) (x − c^(k))^T Σ_k^(−1) (x − c^(k)) ) ] / [ Σ_j φ_j · (2π)^(−d/2) |Σ_j|^(−1/2) exp( −(1/2) (x − c^(j))^T Σ_j^(−1) (x − c^(j)) ) ],

where Σ_k is a diagonal covariance and φ_k are the cluster prior probabilities learned by the EM algorithm.
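As referenced in the K-means item above, the following sketch spells out the feature mappings of Equations (1)-(3): the sigmoid encoding shared by the sparse auto-encoder and sparse RBM, and the hard and "triangle" K-means encodings. It is illustrative only, assuming rows of X are pre-processed patches and that W, b, and the centroids come from an already-trained model; it is not the code used in our experiments.

```python
import numpy as np

def encode_sigmoid(X, W, b):
    """Equation (1): f(x) = g(Wx + b), g the logistic sigmoid, applied
    row-wise. X: (m, N) patches, W: (K, N) weights, b: (K,) biases."""
    return 1.0 / (1.0 + np.exp(-(X @ W.T + b)))

def encode_kmeans(X, centroids, scheme="triangle"):
    """Equations (2)/(3): hard and 'triangle' K-means encodings.
    X: (m, N) patches, centroids: (K, N) learned by K-means."""
    # z[i, k] = ||x^(i) - c^(k)||_2: distance of each patch to each centroid.
    z = np.sqrt(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2))
    if scheme == "hard":
        # 1-of-K assignment: 1 for the nearest centroid, 0 elsewhere.
        f = np.zeros_like(z)
        f[np.arange(X.shape[0]), z.argmin(axis=1)] = 1.0
        return f
    # Triangle encoding: f_k(x) = max(0, mu(z) - z_k); features whose
    # centroid is farther away than the average distance are set to 0.
    mu = z.mean(axis=1, keepdims=True)
    return np.maximum(0.0, mu - z)
```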
3.2 Feature Extraction and Classification

The above steps, for a particular choice of unsupervised learning algorithm, yield a function f that transforms an input patch x ∈ R^N to a new representation y = f(x) ∈ R^K. Using this feature extractor, we now apply it to our (labeled) training images for classification.
3.2.1 Convolutional extraction
Using the learned feature extractor f : R^N → R^K, given any w-by-w image patch, we can now compute a representation y ∈ R^K for that patch. We can thus define a (single layer) representation of the entire image by applying the function f to many sub-patches. Specifically, given an image of n-by-n pixels (with d channels), we define a (n − w + 1)-by-(n − w + 1) representation (with K channels) by computing the representation y for each w-by-w "sub-patch" of the input image. More formally, we will let y^(ij) be the K-dimensional representation extracted from location i, j of the input image. For computational efficiency, we may also "step" our w-by-w feature extractor across the image with some step-size (or "stride") s greater than 1. This is illustrated in Figure 1.
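To make the indexing concrete, the following sketch steps a w-by-w window across an n-by-n-by-d image with a given stride and encodes every sub-patch, producing the grid of y^(ij) vectors described above. The encode argument stands in for any of the feature mappings sketched earlier; per-patch normalization (and whitening, if used) is indicated only by a comment, and names and shapes are assumptions of the example.

```python
import numpy as np

def extract_features(image, w, stride, encode):
    """Compute the grid of K-dimensional representations y^(ij) for one
    n-by-n-by-d image by encoding every w-by-w sub-patch at the given stride.
    `encode` maps a (num_patches, w*w*d) matrix to (num_patches, K)."""
    n = image.shape[0]
    positions = range(0, n - w + 1, stride)
    patches = np.stack([image[i:i + w, j:j + w, :].reshape(-1)
                        for i in positions for j in positions])
    # Apply the same per-patch normalization (and whitening, if used) here.
    Y = encode(patches)                 # shape: (len(positions) ** 2, K)
    return Y.reshape(len(positions), len(positions), -1)
```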
3.2.2 Classification
Before classification, it is standard practice to reduce the dimensionality of the image representation by pooling. For a stride of s = 1, our feature mapping produces a (n − w + 1)-by-(n − w + 1)-by-K representation. We can reduce this by summing up over local regions of the y^(ij)'s extracted as above. This procedure is commonly used (in many variations) in computer vision [15] as well as in deep feature learning [11].
In our system, we use a very simple form of pooling. Specifically, we split the y^(ij)'s into four equal-sized quadrants and compute the sum of the y^(ij)'s in each. This yields a reduced (K-dimensional) representation of each quadrant, for a total of 4K features that we use for classification.
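A minimal sketch of this quadrant pooling follows. When the grid has an odd number of rows or columns, the split point used below is an assumption; the text does not specify how the boundary is handled.

```python
import numpy as np

def pool_quadrants(Y):
    """Sum the y^(ij) vectors over four equal-sized quadrants of the grid.
    Y: (rows, cols, K) from the convolutional extraction step.
    Returns a single 4K-dimensional feature vector."""
    rows, cols, _ = Y.shape
    r_mid, c_mid = rows // 2, cols // 2      # assumed split for odd sizes
    quadrants = [Y[:r_mid, :c_mid], Y[:r_mid, c_mid:],
                 Y[r_mid:, :c_mid], Y[r_mid:, c_mid:]]
    return np.concatenate([q.sum(axis=(0, 1)) for q in quadrants])
```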
Given the pooled (4K-dimensional) feature vectors for each training image and a label, we apply standard linear classification algorithms. In our experiments we use (L2) SVM classification. The regularization parameter is determined by cross-validation.
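For completeness, a sketch of this final stage is shown below. The text specifies only a linear (L2) SVM with its regularization parameter chosen by cross-validation; the use of scikit-learn's LinearSVC and the particular grid of C values are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

def train_classifier(pooled_features, labels):
    """Train a linear (L2) SVM on the pooled 4K-dimensional vectors,
    choosing the regularization parameter C by cross-validation
    (the grid of C values below is assumed)."""
    search = GridSearchCV(LinearSVC(),             # squared-hinge (L2) loss
                          param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
                          cv=5)
    search.fit(np.asarray(pooled_features), np.asarray(labels))
    return search.best_estimator_
```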
4 Experiments and Analysis
The above framework includes a number of parameters that can be changed: (i) whether to use whitening or not, (ii) the number of features K, (iii) the stride s, and (iv) the receptive field size w. In this section, we present our experimental results on the impact of these parameters on performance. First, we will evaluate the effects of these parameters using cross-validation on the CIFAR-10 training set. We will then report the results achieved on both CIFAR-10 and NORB test sets using each unsupervised learning algorithm and the parameter settings that our analysis suggests are best overall (i.e., in our final results, we use the same settings for all algorithms). To clarify: the parameters used in our final evaluation are those that achieved the best (average) cross-validation performance across all models: whitening, 1 pixel stride, 6 pixel receptive field, and 1600 features.
Our basic testing procedure is as follows. For each unsupervised learning algorithm in Section 3.1.2, we will train a single layer of features using either whitened data or raw data and a choice of the parameters K, s, and w. We then train a linear classifier as described in Section 3.2.2, then test the classifier on a holdout set (for our main analysis) or the test set (for our final results).

4.1 Visualization
Before we present classification results, we first show visualizations of the learned feature representations. The bases (or centroids) learned by sparse autoencoders, sparse RBMs, K-means, and Gaussian mixture models are shown in Figure 2 for 8 pixel receptive fields. It is well-known that autoencoders and RBMs yield localized filters that resemble Gabor filters, and we can see this in our results both when using whitened data and, to a lesser extent, raw data. However, these visualizations also show that similar results can be achieved using clustering algorithms. In particular, while clustering raw data leads to centroids consistent with those in [6] and [29], we see that clustering whitened data yields sharply localized filters that are very similar to those learned by the other algorithms. Thus, it appears that such features are easy to learn with clustering methods (without any parameter tweaking) as a result of whitening.
Figure 3: Effect of whitening and number of bases (or centroids). (The figure plots cross-validation accuracy (%) against the number of features.)

4.2 Effect of whitening
We now move on to our characterization of performance on various axes of parameters, starting with the effect of whitening (in our experiments, we use zero-phase whitening [2]), which visibly changes the learned bases (or centroids) as seen in Figure 2. Figure 3 shows the performance for all of our algorithms as a function of the number of features (which we will discuss in the next section) both with and without whitening. These experiments used a stride of 1 pixel and a 6 pixel receptive field.

For sparse autoencoders and RBMs, the effect of whitening is somewhat ambiguous. When using only 100 features, there is a significant benefit of whitening for sparse RBMs, but this advantage disappears with larger numbers of features. For the clustering algorithms, however, we see that whitening is a crucial pre-process, since the clustering algorithms cannot handle the correlations in the data (our GMM implementation uses diagonal covariances).

本文发布于:2023-07-31 20:54:16,感谢您对本站的认可!

本文链接:https://www.wtabcd.cn/fanwen/fan/89/1103486.html

版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。

标签:男士   斜视   表白   抗衰老   幼儿   女生
相关文章
留言与评论(共有 0 条评论)
   
验证码:
推荐文章
排行榜
Copyright ©2019-2022 Comsenz Inc.Powered by © 专利检索| 网站地图