The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization

Adam Coates acoates@cs.stanford.edu
Andrew Y. Ng ang@cs.stanford.edu
Stanford University, 353 Serra Mall, Stanford, CA 94305
Abstract
While vector quantization (VQ) has been applied widely to generate features for visual recognition problems, much recent work has focused on more powerful methods. In particular, sparse coding has emerged as a strong alternative to traditional VQ approaches and has been shown to achieve consistently higher performance on benchmark datasets. Both approaches can be split into a training phase, where the system learns a dictionary of basis functions, and an encoding phase, where the dictionary is used to extract features from new inputs. In this work, we investigate the reasons for the success of sparse coding over VQ by decoupling these phases, allowing us to separate out the contributions of training and encoding in a controlled way. Through extensive experiments on the CIFAR, NORB, and Caltech 101 datasets, we compare several training and encoding schemes, including sparse coding and a form of VQ with a soft threshold activation function. Our results show not only that we can use fast VQ algorithms for training, but that we can just as well use randomly chosen exemplars from the training set. Rather than spend resources on training, we find it is more important to choose a good encoder, which can often be a simple feed-forward non-linearity. Our results include state-of-the-art performance on both CIFAR and NORB.
1. Introduction
A great deal of work in computer vision has used vector quantization (VQ) as a tool for constructing higher-level image representations. For instance, the K-means algorithm is often used in "visual word models" (Csurka et al., 2004; Lazebnik et al., 2006) to train a dictionary of exemplar low-level descriptors that are then used to define a mapping (an "encoding") of the descriptors into a new feature space. More recently, machine learning research has sought to employ more powerful algorithms and models for these problems that generate better features than those learned with VQ methods. One alternative to VQ that has served in this role is sparse coding, which has consistently yielded better results on benchmark recognition tasks (Yang et al., 2009; Boureau et al., 2010).
A natural question is whether this higher performance is the result of learning a better dictionary for representing the structure of the data, or whether sparse codes are simply better non-linear features. In either case, are there other training algorithms or encodings that might be simpler yet competitive with sparse coding? We attempt to answer these questions through a large array of experiments on the CIFAR, NORB, and Caltech 101 datasets, where we carefully separate out the contributions of the training and encoding methods.

More specifically, we note that feature learning algorithms are typically broken into two components: (i) a training algorithm that learns a set of basis functions, D (referred to variously as "weights", a "codebook", or a "dictionary"), and (ii) an encoding algorithm that, given D, defines a mapping from an input vector x to a feature vector f. Even though these two components are often closely connected, it is not strictly necessary to use an encoding algorithm that matches the training algorithm. For instance, while it is natural to pair K-means training with a hard-assignment encoding scheme, it has been shown that soft encodings (e.g., using Gaussian RBFs) yield better features even when hard assignment was used during training (van Gemert et al., 2008; Boureau et al., 2010; Agarwal & Triggs, 2006). In our experiments, we will exploit the ability to "mix and match" training and encoding algorithms in this way to analyze the contributions of each module in a controlled setting. In particular, we will analyze the benefits of sparse coding both as a training algorithm and as an encoding strategy in comparison to several other methods, including VQ.
The main contributions of our work emerge from our analysis of these experiments. We discover two surprising results:

1. When using sparse coding as the encoder, virtually any training algorithm can be used to create a suitable dictionary. We can use VQ, or even randomly chosen exemplars, to achieve very high performance.

2. Regardless of the choice of dictionary, a very simple encoder (a soft thresholding function) can often be competitive with sparse coding.
These results not only shed light on the reasons for sparse coding's success (namely, that it is a highly effective encoding scheme), but also suggest that we may be able to build and test large models very rapidly, with far simpler training and encoding methods than sparse coding itself.
We begin with an overview of some related work on sparse coding, visual word models, and feature encoding schemes in Section 2, followed by our experimental setup in Section 3. We then continue with our results in Section 4 and conclude with some discussion of our findings and their relationship to prior results in Section 5.
2. Related Work
Vector quantization has been used extensively in "visual words" models in computer vision. Specifically, the K-means clustering algorithm is used to learn a set of centroids that are then used to map inputs into a new feature space. For instance, in the "bag of words" and spatial pyramid models (Csurka et al., 2004; Lazebnik et al., 2006) it is typical to map an input x to a 1-of-K coded vector s, where the element s_i is 1 if the input vector x belongs to cluster i. This quantization makes K-means learning very fast, but results in crude features. Thus, other authors have used "soft" assignments (e.g., Gaussian activations) to improve performance of the encoding stage (van Gemert et al., 2008; Agarwal & Triggs, 2006).
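As a concrete illustration of this difference (our own sketch, not part of the original work; the names `centroids`, `beta`, and the functions themselves are hypothetical), hard 1-of-K assignment and a Gaussian soft assignment might look like this:

```python
import numpy as np

def hard_assignment(x, centroids):
    """1-of-K coding: a single 1 at the index of the nearest centroid (rows of `centroids`)."""
    dists = np.linalg.norm(centroids - x, axis=1)
    code = np.zeros(len(centroids))
    code[np.argmin(dists)] = 1.0
    return code

def soft_assignment(x, centroids, beta=1.0):
    """Gaussian (RBF-style) soft assignment: every centroid gets a graded activation."""
    dists = np.linalg.norm(centroids - x, axis=1)
    acts = np.exp(-beta * dists ** 2)
    return acts / acts.sum()
```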
In our work, we will also use soft assignments with vector quantization, but with some important differences. First, we use a different variant of vector quantization (known as "gain-shape" vector quantization) that learns normalized basis functions with dot-products as the similarity metric (rather than Euclidean distance, as with K-means). This makes the resulting algorithm more comparable to sparse coding. Second, we use a soft threshold function for our soft assignment, which we will show gives excellent performance and is loosely related to both sparse coding and to recent results using K-means.
The soft threshold function (namely, sign(z) max(0, |z| − α), where α is an adjustable threshold) has also been adopted in other recent work. It has been used in conjunction with the Predictive Sparse Decomposition algorithm (Kavukcuoglu et al., 2008), where a feed-forward network is trained explicitly to mimic sparse coding. It has also become popular as the non-linearity in various deep learning architectures (Kavukcuoglu et al., 2010; Nair & Hinton, 2010; Krizhevsky, 2010), and is often referred to as a "shrinkage" function for its role in regularization and sparse coding algorithms (Gregor & LeCun, 2010). Thus, we are by no means the first to observe the usefulness of this particular activation function. In our work, however, we will show that such a non-linearity on its own is consistently able to compete with sparse coding in our experiments, even without any form of training to "tune" the basis functions to work in conjunction with it.
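For concreteness, a minimal NumPy rendering of this shrinkage function (our own sketch; the function name is ours):

```python
import numpy as np

def soft_threshold(z, alpha):
    """Shrinkage: sign(z) * max(0, |z| - alpha), applied elementwise."""
    return np.sign(z) * np.maximum(0.0, np.abs(z) - alpha)
```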
Though vector quantization is extremely fast, sparse coding has been shown to work consistently better (Boureau et al., 2010; Kavukcuoglu et al., 2010; Yang et al., 2009). Thus, fast algorithms and approximations have been devised to make its use more practical on large problems (Gregor & LeCun, 2010; Wu & Lange, 2008). Other authors, however, have chosen instead to dissect sparse coding in search of its strengths and weaknesses, with an eye toward developing new encodings. In particular, Yu et al. (2009) have argued for "locality preserving" encodings based on the idea that such encodings allow higher-level systems to learn functions across the data manifold more easily. Their system used a locality-aware variant of sparse coding, but the feed-forward encoder we use may also have similar properties while being much simpler.

Finally, we note that recent results in the literature also indicate that the choice of basis functions may not be as critical as one might imagine. Jarrett et al. (2009) showed that architectures with random weights could perform surprisingly well in recognition tasks, even though the performance was not as good as trained weights. Meanwhile, Wang et al. (2010) showed that K-means could be used to learn approximate but similarly performing dictionaries for use with their own encoding. Our results corroborate and extend these findings. In particular, we will provide results using dictionaries created from random noise, randomly sampled exemplars, and vector quantization, and show that the last two yield perfectly usable dictionaries in every case.
3. Learning framework
Given an unsupervised learning algorithm, we learn a new feature representation from unlabeled data by employing a common framework. Our feature learning framework is like the patch-based system presented in (Coates et al., 2011), with a few modifications. It shares most of its key components with prior work on visual word models (Csurka et al., 2004; Lazebnik et al., 2006; Agarwal & Triggs, 2006).
In order to generate a set of features, our system first accumulates a batch of small image patches or image descriptors harvested from unlabeled data. When learning from raw pixels, we extract 6-by-6 pixel patches, yielding a bank of vectors that are then normalized (we subtract the mean and divide by the standard deviation of the pixel values) and ZCA whitened (Hyvarinen & Oja, 2000), retaining full variance. If we are learning from SIFT descriptors, we simply take single descriptors to form a bank of 128-dimensional vectors.
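As an illustration of this preprocessing (a sketch under our own assumptions; in particular, the small `eps` regularizers are our addition and are not specified in the text):

```python
import numpy as np

def normalize_patches(X, eps=1e-8):
    """Per-patch normalization: subtract the mean and divide by the std (rows are patches)."""
    X = X - X.mean(axis=1, keepdims=True)
    return X / (X.std(axis=1, keepdims=True) + eps)

def zca_whiten(X, eps=1e-8):
    """ZCA whitening of a batch of patches, retaining full variance (no dimension reduction)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / Xc.shape[0]
    U, S, _ = np.linalg.svd(cov)
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T  # ZCA transform
    return Xc @ W, mean, W
```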
Given the batch of input vectors, x^(i) ∈ R^n, an unsupervised learning algorithm is then applied to learn a dictionary of d elements, D ∈ R^(n×d), where each column D^(j) is one element. In order to make all of our algorithms consistent (so that we may freely change the choice of encoder), we will make certain that each of the algorithms we use produces normalized dictionary elements: ||D^(j)||_2^2 = 1.
In this work, we will use the following unsupervised learning algorithms for training the dictionary D:

1. Sparse coding (SC): We train the dictionary using the L1-penalized sparse coding formulation. That is, we optimize

   min_{D, s^(i)}  Σ_i ||D s^(i) − x^(i)||_2^2 + λ ||s^(i)||_1        (1)
   subject to  ||D^(j)||_2^2 = 1,  ∀j

   using alternating minimization over the sparse codes, s^(i), and the dictionary, D. We use the coordinate descent algorithm to solve for the sparse codes (Wu & Lange, 2008).
2. Orthogonal matching pursuit (OMP-k): Similar to sparse coding, the dictionary is trained using an alternating minimization of

   min_{D, s^(i)}  Σ_i ||D s^(i) − x^(i)||_2^2        (2)
   subject to  ||D^(j)||_2^2 = 1,  ∀j
   and  ||s^(i)||_0 ≤ k,  ∀i

   where ||s^(i)||_0 is the number of non-zero elements in s^(i). In this case, the codes s^(i) are computed (approximately) using Orthogonal Matching Pursuit (Pati et al., 1993; Blumensath & Davies, 2007) to compute codes with at most k non-zeros (which we refer to as "OMP-k"). For a single input x^(i), OMP-k begins with s^(i) = 0 and at each iteration greedily selects an element of s^(i) to be made non-zero to minimize the residual reconstruction error. After each selection, s^(i) is updated to minimize ||D s^(i) − x^(i)||_2^2 over s^(i), allowing only the selected elements to be non-zero.

   Importantly, OMP-1 is a form of "gain-shape vector quantization" (and is similar to K-means when the data x^(i) and the dictionary elements D^(j) all have unit length). Specifically, it chooses k = arg max_j |D^(j)⊤ x^(i)|, then sets s^(i)_k = D^(k)⊤ x^(i) and all other elements of s^(i) to 0. Holding these "one hot" codes fixed, it is then easy to solve for the optimal D in (2). (A minimal sketch of this OMP-1 training loop appears after this list.)
3. Sparse RBMs and sparse auto-encoders (RBM, SAE): In some of our experiments, we train sparse RBMs (Hinton et al., 2006) and sparse auto-encoders (Ranzato et al., 2007; Bengio et al., 2006), both using a logistic sigmoid non-linearity g(Wx + b). These algorithms yield a set of weights W and biases b. To obtain the dictionary, D, we simply discard the biases and take D = W⊤, then normalize the columns of D.
4. Randomly sampled patches (RP): We also use a heuristic method for populating the dictionary: we fill the columns of D with normalized vectors sampled randomly from among the x^(i).
5. Random weights (R): It has also been shown that completely random weights can perform surprisingly well in some tasks (Jarrett et al., 2009; Saxe et al., 2010). Thus, we have also tried filling the columns of D with vectors sampled from a unit normal distribution (subsequently normalized to unit length).
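As a concrete illustration of item 2 above, here is a minimal sketch of the OMP-1 (gain-shape VQ) training loop (our own code; the iteration count, the random-exemplar initialization, and the handling of unused dictionary elements are assumptions not specified in the text). It expects the preprocessed inputs as rows of X and returns a dictionary with unit-norm columns:

```python
import numpy as np

def train_omp1(X, d, iters=10, seed=0):
    """Gain-shape VQ (OMP-1) dictionary learning.

    X: (m, n) array of preprocessed input vectors (rows).
    Returns D: (n, d) dictionary with unit-norm columns.
    """
    rng = np.random.default_rng(seed)
    # Initialize with randomly sampled, normalized exemplars.
    D = X[rng.choice(len(X), d, replace=False)].T
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    for _ in range(iters):
        proj = X @ D                           # (m, d) dot-product similarities
        k = np.abs(proj).argmax(axis=1)        # best-matching element per input
        s = proj[np.arange(len(X)), k]         # signed "gain" coefficients
        # Re-estimate each dictionary element from the inputs assigned to it.
        D_new = np.zeros_like(D)
        np.add.at(D_new.T, k, s[:, None] * X)  # accumulate s^(i) * x^(i) per element
        norms = np.linalg.norm(D_new, axis=0)
        used = norms > 1e-10                   # leave unused elements unchanged
        D[:, used] = D_new[:, used] / norms[used]
    return D
```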
After running any of the above training procedures, we have a dictionary D. We must then define an "encoder" that, given D, maps a new input vector x to a vector of features, f. We will use the following encoders (a minimal sketch of encoders 1-3 appears after Table 1):
1. Sparse coding (SC): Given a dictionary D, which may or may not have been trained using sparse coding, we solve for the sparse code s for x by minimizing (1) with D fixed. Note that the choice of λ in this case may be different from that used during training. We then take:

   f_j = max(0, s_j)
   f_{j+d} = max(0, −s_j)

   That is, we split the positive and negative components of the sparse code s into separate features. This allows the higher-level parts of the system (i.e., the classifier) to weight positive and negative responses differently if necessary.[2]
2. Orthogonal matching pursuit (OMP-k): As above, we compute s given x and D using OMP-k to yield at most k non-zeros. When k = 1, s will have just one non-zero element (equal to D^(j)⊤x, for one choice of j). Given s, the features f are defined as for sparse coding above.
3. Soft threshold (T): We use a simple feed-forward non-linearity with a fixed threshold α:

   f_j = max(0, D^(j)⊤x − α)
   f_{j+d} = max(0, −D^(j)⊤x − α)
4. "Natural" encoding: Finally, we will also define the "natural" encoding for a dictionary D as whichever encoding is normally used in conjunction with the training algorithm that generated D. So, for sparse coding with penalty λ, we would use sparse coding with the same penalty, and for OMP we would again use the OMP encoding as above with the same number of non-zeros. For RBMs and auto-encoders we use:

   f_j = g(W^(j)x + b)        (3)
   f_{j+d} = g(−W^(j)x + b)   (4)

   where W^(j) is the j'th row of W, and g is the logistic sigmoid function.[3] For random patches and random weights we use the soft threshold with α = 0 (which corresponds to random linear projections with the positive and negative polarities split into separate features).

[2] This polarity splitting has always improved performance in our experiments, and can be thought of as a more flexible form of the absolute value rectification in (Jarrett et al., 2009), or as non-negative sparse coding with the dictionary [−D D].

[3] This "two-sided" encoder ensures that we still get 2d features, and do not put the RBM and auto-encoder at a disadvantage relative to the other encodings.

Table 1. Cross-validation results for combinations of learning algorithms and encoders on CIFAR-10. All numbers are percent accuracy. The reported accuracies are from 5-fold cross-validation on the CIFAR training set, maximizing over the choice of hyper-parameters for both the training and encoding steps; i.e., these are the best results we can obtain using the given combination of training and encoding algorithms if we choose the hyper-parameters to maximize the CV accuracy.

Train \ Encoder   Natural   SC     OMP-1   OMP-10   T
R                 70.5      74.0   65.8    68.6     73.2
RP                76.0      76.6   70.1    71.6     78.1
RBM               74.1      76.7   69.5    72.9     78.3
SAE               74.8      76.5   68.8    71.5     76.7
SC                77.9      78.5   70.8    75.3     78.5
OMP-1             71.4      78.7   71.4    76.0     78.9
OMP-2             73.8      78.5   71.0    75.8     79.0
OMP-5             75.4      78.8   71.0    76.1     79.1
OMP-10            75.3      79.0   70.7    75.3     79.4
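To make encoders 1-3 concrete, here is a minimal sketch (our own code, not from the paper; the function names, the fixed iteration count, and the lack of a convergence check in the coordinate-descent solver are our choices). It assumes D is an n×d array with unit-norm columns and returns the 2d-dimensional polarity-split feature vector described above:

```python
import numpy as np

def polarity_split(s):
    """f = [max(0, s); max(0, -s)], yielding 2d features from a d-dimensional code."""
    return np.concatenate([np.maximum(0.0, s), np.maximum(0.0, -s)])

def encode_soft_threshold(x, D, alpha):
    """Soft-threshold encoder: f_j = max(0, D_j.x - alpha), f_{j+d} = max(0, -D_j.x - alpha)."""
    z = D.T @ x
    return np.concatenate([np.maximum(0.0, z - alpha), np.maximum(0.0, -z - alpha)])

def encode_omp1(x, D):
    """OMP-1 encoder: one-hot code s_k = D_k.x at k = argmax_j |D_j.x|, then polarity split."""
    z = D.T @ x
    s = np.zeros_like(z)
    k = np.abs(z).argmax()
    s[k] = z[k]
    return polarity_split(s)

def encode_sparse_coding(x, D, lam, iters=50):
    """Coordinate descent for min_s ||D s - x||_2^2 + lam * ||s||_1
    (valid because the columns of D have unit norm), then polarity split."""
    s = np.zeros(D.shape[1])
    residual = x.copy()                      # residual = x - D s
    for _ in range(iters):
        for j in range(D.shape[1]):
            rho = D[:, j] @ residual + s[j]  # correlation with the partial residual
            s_new = np.sign(rho) * max(0.0, abs(rho) - lam / 2.0)
            residual += D[:, j] * (s[j] - s_new)
            s[j] = s_new
    return polarity_split(s)
```

For instance, `encode_soft_threshold(x, D, alpha=0.25)` would correspond to the fixed threshold reported for the random-patch system on CIFAR in Section 4.1.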
Now, given D and a choice of encoder, we have the ability to extract features from a patch or image descriptor x representing a small sub-window of an image. We can then take this feature extractor and sweep it over the entire image to extract a set of feature values to represent the whole image. In our experiments, we use a step (stride) of 1 pixel for CIFAR and NORB (pixel inputs), and a step of 8 pixels for Caltech 101 (SIFT input). The feature values obtained from this extraction process are then pooled (Jarrett et al., 2009; Yang et al., 2009) to form a final feature vector for classification. Depending on the dataset, we use different types of pooling, which we will specify later.

Given a set of labels, we then standardize the data (we subtract the mean and divide by the standard deviation of each feature in the training set) and train a linear classifier (L2-SVM).
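One possible rendering of this stage (our own sketch; it assumes the quadrant average pooling used for CIFAR in Section 4.1 and a simple global standardization, with `eps` added for numerical safety):

```python
import numpy as np

def quadrant_average_pool(feature_map):
    """Average-pool a (rows, cols, 2d) map of encoded patches over the 4 image quadrants."""
    r, c, _ = feature_map.shape
    row_halves = [(0, r // 2), (r // 2, r)]
    col_halves = [(0, c // 2), (c // 2, c)]
    pooled = [feature_map[r0:r1, c0:c1].mean(axis=(0, 1))
              for (r0, r1) in row_halves for (c0, c1) in col_halves]
    return np.concatenate(pooled)            # 4 * 2d features per image

def standardize(train_features, test_features, eps=1e-8):
    """Standardize both sets using statistics computed on the training set only."""
    mu = train_features.mean(axis=0)
    sigma = train_features.std(axis=0) + eps
    return (train_features - mu) / sigma, (test_features - mu) / sigma
```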
4. Experimental Results
4.1. Comparison on CIFAR-10
Our first and most expansive set of experiments is conducted on the CIFAR-10 dataset (Krizhevsky, 2009). Here, we perform a full comparison of all of the learning and encoding algorithms described in Section 3. In particular, we train the dictionary with 1600 entries from whitened, 6-by-6 pixel color image patches (108-dimensional vectors), using sparse coding (SC), orthogonal matching pursuit (OMP) with k = 1, 2, 5, 10, sparse RBMs (RBM), sparse auto-encoders (SAE), randomly sampled image patches (RP), and random weights (R).

Table 2. Test results for some of the best systems of Table 1 on CIFAR-10. All numbers are percent accuracy.

Train/Encoder                          Test Acc.
RP / T                                 79.1%
SC / SC                                78.8%
SC / T                                 78.9%
OMP-1 / SC                             78.8%
OMP-1 / T                              79.4%
OMP-10 / T                             80.1%
OMP-1 / T (d = 6000)                   81.5%
(Coates et al., 2011), 1600 features   77.9%
(Coates et al., 2011), 4000 features   79.6%
Improved LCC (Yu & Zhang, 2010)        74.5%
Conv. DBN (Krizhevsky, 2010)           78.9%
Deep NN (Ciresan et al., 2011)         80.49%
For each dictionary learned with the algorithms above, we then extract features not only using the "natural" encoding associated with the learning algorithm, but also with a host of other encodings from among those described in Section 3. Specifically, we use sparse coding with λ ∈ {0.5, 0.75, 1.0, 1.25, 1.5}, OMP with k = 1, 10, and soft thresholding (T) with α ∈ {0.1, 0.25, 0.5, 1.0}. After computing the features for a combination of dictionary and encoding method, we construct a final feature vector by average pooling over the 4 image quadrants, yielding 4 × 2 × 1600 = 12800 features. We report the best 5-fold cross-validation results, maximizing over the choice of hyper-parameters,[5] for each combination of training algorithm and encoding algorithm in Table 1.
From these numbers, a handful of trends are readily apparent. First, we note that the first column (which pairs each learning algorithm with its standard encoder) shows that sparse coding is superior to all of the other methods by a fairly significant margin, with 77.9% accuracy. OMP-1 (which is just slightly more powerful than hard-assignment K-means) is far worse (71.4%). However, we do get surprisingly close with OMP-10 and random patches. If we look at the results in the remaining columns, it becomes clear that this is not due to the learned basis functions: when using sparse coding as the activation (column 2), all of the dictionaries, except the random one, perform competitively. This suggests that the strength of sparse coding on CIFAR comes not from the learned basis functions, but primarily from the encoding mechanism.

Another striking result of these experiments is the success of the soft threshold activation function. Despite using only a feed-forward non-linearity with a fixed threshold, this encoding also performs uniformly well across dictionaries, and as well or even better than sparse coding.

[5] Note that for sparse coding this means that the number reported for the "natural" encoding is for the best choice of λ when using the same penalty for both training and encoding. The number in the "sparse coding" column is the best performance possible when choosing different λ for training and encoding.

Table 3. Test accuracies for the NORB jittered-cluttered dataset. All numbers are percent accuracy.

Train \ Encoder   Natural   SC (λ = 1)   T (α = 0.5)
R                 91.9      93.8         93.1
RP                92.8      95.0         93.6
SC (λ = 1)        94.1      94.1         93.5
OMP-1             90.9      94.2         92.6
Conv. Net (Scherer et al., 2010)         94.4%
SVM-Conv. Net (Huang & LeCun, 2006)      94.1%
ReLU RBM (Nair & Hinton, 2010)           84.8%
We next take several of the best-performing systems according to the cross-validation results in Table 1, retrain the classifiers on the full CIFAR training set, and then test them on the standard test set. The final test results are reported in Table 2. We note several key numbers. First, using a dictionary of random patches and a soft threshold, we obtain 79.1% accuracy. This is very surprising since this algorithm requires no training beyond the choice of the threshold (α = 0.25). All of the other results are similar, with just over 1% separating them. The best overall system identified by cross-validation was OMP-10 with the soft threshold, achieving 80.1% accuracy.
In addition, we note that it is often possible to achieve better performance simply by using much larger dictionaries (van Gemert et al., 2008). This is easily done with inexpensive training and encoding algorithms like OMP-1 and the soft threshold. If we use a dictionary with d = 6000 basis vectors, we can achieve 81.5% accuracy, the best known result on CIFAR.
4.2. Experiments on NORB
We also perform experiments on the NORB (jittered-cluttered) dataset (LeCun et al., 2004). Each 108x108 image includes 2 gray stereo channels. We resize the images to 96x96 pixels and average-pool over a 5x5 grid. We train on the first 2 folds of training data (58320 examples), and test on both folds of test data (58320 examples). Based on our experience with CIFAR, we chose fixed values for the hyper-parameters in these experiments. For sparse coding, we have used λ = 1.0, and for the soft threshold α = 0.5, though the test results are mostly insensitive to these choices. We report the test accuracies achieved with the natural encoder for each dictionary, as well as with sparse coding and the soft threshold, in Table 3. Again we see that the soft threshold, even when coupled with randomly sampled patches, performs nearly as well as sparse coding. Though its performance is slightly lower, we note that its best showing (93.6%) is achieved with far less labor: the sparse coding system requires over 7 hours to run on 40 2.26 GHz cores, while the soft threshold scheme requires just 1 hour. In addition, we also see that sparse coding performs comparably regardless of which training algorithm we use. Surprisingly, when using random patches we achieve 95.0% accuracy, better than previously published results for this dataset. For comparison, a recent convolutional neural network system (using max pooling) (Scherer et al., 2010) achieved 94.4% on this dataset.
4.3. Experiments on Caltech 101
Finally, we also performed experiments on the Caltech 101 dataset. For these experiments, we adopted the system of Yang et al. (2009). This system uses SIFT descriptors as the input to feature learning instead of raw pixels. In particular, SIFT descriptors are extracted from each image over a grid. This yields a representation of the image as a set of 128-dimensional vectors, with one descriptor representing each patch of the grid. These vectors become the inputs x ∈ R^128 to the training and encoding algorithms and play the same role as the patches of pixels in our previous experiments.
Given the inputs x^(i), a dictionary is constructed as before using random noise (R), randomly sampled descriptors (RP), sparse coding (SC), or vector quantization (OMP-1). After performing the encoding, the features are pooled using max-pooling in a 3-level spatial pyramid (Lazebnik et al., 2006) (i.e., we pool over 4x4, 2x2, and 1x1 grids). We use 30 training examples per class in our experiments, and report the average accuracy over 5 samplings of the training and test sets in Table 4. We use λ = 0.15 (the same used in (Yang et al., 2009)), and again α = 0.5 for the soft threshold.

Table 4. Test results for the Caltech 101 dataset. Numbers are percent accuracy (and standard deviation) with 30 training images per class.

Train                        SC (λ = 0.15)    T (α = 0.5)
R                            67.2% (0.8%)     66.6% (0.2%)
RP                           72.6% (0.9%)     64.3% (1.2%)
SC                           72.6% (0.9%)     67.7% (0.3%)
OMP-1                        71.9% (0.9%)     63.2% (1.4%)
SC-SPM (Yang et al., 2009)   73.2% (0.54%)
(Boureau et al., 2010)       75.7% (1.1%)
(Jarrett et al., 2009)       65.5%

As can be seen in Table 4, the results are similar, though not identical, to those on CIFAR and NORB. First, these results confirm that the choice of dictionary is not especially critical: when using sparse coding as the encoder, we can use randomly sampled descriptors and achieve high performance. However, it appears that the soft threshold works less well for this dataset. It turns out that one shortcoming of the soft threshold activation is the use of a constant threshold. If we instead use a variable threshold (and a dictionary trained with sparse coding), setting α dynamically to yield exactly 20 non-zeros for each example, we achieve 70.1% accuracy (±0.9%). Still, a gap remains, which appears to be a result of having few training examples; we will discuss this further in the next section.
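The variable-threshold variant is described only briefly in the text; one way it could be implemented (our own sketch, assuming ties are absent so that exactly `n_nonzero` responses survive the shrinkage) is:

```python
import numpy as np

def encode_variable_threshold(x, D, n_nonzero=20):
    """Soft threshold with alpha chosen per example so that n_nonzero responses
    survive, followed by the usual polarity split into 2d features."""
    z = D.T @ x
    mags = np.sort(np.abs(z))
    # alpha is the (n_nonzero+1)-th largest magnitude, so that (barring ties)
    # exactly n_nonzero entries exceed it after shrinkage.
    alpha = mags[-(n_nonzero + 1)] if len(z) > n_nonzero else 0.0
    shrunk = np.sign(z) * np.maximum(0.0, np.abs(z) - alpha)
    return np.concatenate([np.maximum(0.0, shrunk), np.maximum(0.0, -shrunk)])
```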
5. Discussion
5.1. Sparse coding and small datasets
A primary distinction between the Caltech 101 dataset and the CIFAR and NORB datasets is the number of available labeled training examples (just 30 per class for Caltech 101). In this situation, regularization and prior knowledge become much more important since we have very few labels for supervised training. It turns out that sparse coding excels in this scenario: it yields a feature vector that works well even when we have very few labels, and even when we use simple algorithms to populate the dictionary.

We have verified this phenomenon on the CIFAR and STL-10[6] (Coates et al., 2011) datasets. We began with a dictionary learned using OMP-1. We then tested the performance of the sparse-coding and soft-threshold encoders when the SVM training procedure is limited to a small number of labeled examples. For CIFAR, the average test performance over 5 folds of labeled data, for various numbers of labeled examples, is plotted in Figure 1. There it can be seen that the performance of sparse coding and the soft threshold are essentially identical when we use large labeled training sets, but that sparse coding performs much better for smaller numbers of examples. The STL-10 dataset similarly emphasizes smaller labeled datasets (100 examples per fold), while providing additional unlabeled data. On this dataset, the same phenomenon

[6] cs.stanford.edu/~acoates/stl10/