Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle
Université de Montréal
Montréal, Québec
{bengioy,lamblinp,popovicd,larocheh}@iro.umontreal.ca
Abstract
Complexity theory of circuits strongly suggests that deep architectures can be much more efficient (sometimes exponentially) than shallow architectures, in terms of computational elements required to represent some functions. Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization appears to often get stuck in poor solutions. Hinton et al. recently introduced a greedy layer-wise unsupervised learning algorithm for Deep Belief Networks (DBN), a generative model with many layers of hidden causal variables. In the context of the above optimization problem, we study this algorithm empirically and explore variants to better understand its success and extend it to cases where the inputs are continuous or where the structure of the input distribution is not revealing enough about the variable to be predicted in a supervised task. Our experiments also confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.
1 Introduction
Recent analyses (Bengio, Delalleau, & Le Roux, 2006; Bengio & Le Cun, 2007) of modern non-parametric machine learning algorithms that are kernel machines, such as Support Vector Machines (SVMs), graph-based manifold and semi-supervised learning algorithms, suggest fundamental limitations of some learning algorithms. The problem is clear in kernel-based approaches when the kernel is “local” (e.g., the Gaussian kernel), i.e., $K(x, y)$ converges to a constant when $\|x - y\|$ increases. The analyses point to the difficulty of learning “highly-varying functions”, i.e., functions that have a large number of “variations” in the domain of interest, e.g., they would require a large number of pieces to be well represented by a piecewise-linear approximation. Since the number of pieces can be made to grow exponentially with the number of factors of variations in the input, this is connected with the well-known curse of dimensionality for classical non-parametric learning algorithms (for regression, classification and density estimation). If the shapes of all the pieces are unrelated, one needs enough examples for each piece in order to generalize properly. However, if the shapes are related and can be predicted from each other, “non-local” learning algorithms have the potential to generalize to pieces not covered by the training set. Such ability would seem necessary for learning in complex domains such as Artificial Intelligence tasks (e.g., related to vision, language, speech, robotics).
Kernel machines (not only those with a local kernel) have a shallow architecture, i.e., only two levels of data-dependent computational elements. This is also true of feedforward neural networks with a single hidden layer (which can become SVMs when the number of hidden units becomes large (Bengio, Le Roux, Vincent, Delalleau, & Marcotte, 2006)). A serious problem with shallow architectures is that they can be very inefficient in terms of the number of computational elements (e.g., bases, hidden units), and thus in terms of required examples (Bengio & Le Cun, 2007). One way to represent a highly-varying function compactly (with few parameters) is through the composition of many non-linearities, i.e., with a deep architecture. For example, the parity function with $d$ inputs requires $O(2^d)$ examples and parameters to be represented by a Gaussian SVM (Bengio et al., 2006), $O(d^2)$ parameters for a one-hidden-layer neural network, $O(d)$ parameters and units for a multi-layer network with $O(\log_2 d)$ layers, and $O(1)$ parameters with a recurrent neural network. More generally, boolean functions (such as the function that computes the multiplication of two numbers from their $d$-bit representation) expressible by $O(\log d)$ layers of combinatorial logic with $O(d)$ elements in each layer may require $O(2^d)$ elements when expressed with only 2 layers (Utgoff & Stracuzzi, 2002; Bengio & Le Cun, 2007). When the representation of a concept requires an exponential number of elements, e.g., with a shallow circuit, the number of training examples required to learn the concept may also be impractical. Formal analyses of the computational complexity of shallow circuits can be found in (Hastad, 1987) or (Allender, 1996). They point in the same direction: shallow circuits are much less expressive than deep ones.
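The depth versus size trade-off for parity can be made concrete with a small sketch. The code below is an illustration only, not taken from the paper: the function names `parity_tree` and `count_odd_parity_inputs` are hypothetical. It evaluates $d$-bit parity with a balanced tree of two-input XOR gates, counting gates and levels, and contrasts this with the number of AND terms a flat two-level OR-of-ANDs form of parity needs (one per input with odd parity).

```python
from itertools import product


def parity_tree(bits):
    """Return (parity, gate_count, depth) using a balanced tree of 2-input XORs."""
    level = list(bits)
    gates = 0
    depth = 0
    while len(level) > 1:
        nxt = []
        # Pair up neighbours; an unpaired last element is passed through unchanged.
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] ^ level[i + 1])
            gates += 1
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], gates, depth


def count_odd_parity_inputs(d):
    """Count d-bit inputs with parity 1 (one AND term each in a flat OR-of-ANDs form)."""
    return sum(1 for x in product((0, 1), repeat=d) if sum(x) % 2 == 1)


d = 8
value, gates, depth = parity_tree([1, 0, 1, 1, 0, 0, 1, 0])
print(value, gates, depth)          # parity, d - 1 gates, ceil(log2 d) levels
print(count_odd_parity_inputs(d))   # 2 ** (d - 1) terms for the flat form
```

The deep form grows linearly in $d$ while the flat form grows exponentially, which is the qualitative point of the paragraph above; the sketch says nothing about learnability.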
However, until recently, it was believed too difficult to train deep multi-layer neural networks. Empirically, deep networks were generally found to be not better, and often worse, than neural networks with one or two hidden layers (Tesauro, 1992). As this is a negative result, it has not been much reported in the machine learning literature. A reasonable explanation is that gradient-based optimization starting from random initialization may get stuck near poor solutions. An approach that has been explored with some success in the past is based on constructively adding layers. This was previously done using a supervised criterion at each stage (Fahlman & Lebiere, 1990; Lengellé & Denoeux, 1996). Hinton, Osindero, and Teh (2006) recently introduced a greedy layer-wise unsupervised learning algorithm for Deep Belief Networks (DBN), a generative model with many layers of hidden causal variables. The training strategy for such networks may hold great promise as a principle to help address the problem of training deep networks. Upper layers of a DBN are supposed to represent more “abstract” concepts that explain the input observation $x$, whereas lower layers extract “low-level features” from $x$. They learn simpler concepts first, and build on them to learn more abstract concepts. This strategy, studied in detail here, has not yet been much exploited in machine learning. We hypothesize that three aspects of this strategy are particularly important: first, pre-training one layer at a time in a greedy way; second, using unsupervised learning at each layer in order to preserve information from the input; and finally, fine-tuning the whole network with respect to the ultimate criterion of interest.
We first extend DBNs and their component layers, Restricted Boltzmann Machines (RBM), so that they can more naturally handle continuous values in input. Second, we perform experiments to better understand the advantage brought by the greedy layer-wise unsupervised learning. The basic question to answer is whether or not this approach helps to solve a difficult optimization problem. In DBNs, RBMs are used as building blocks, but applying this same strategy using auto-encoders yielded similar results. Finally, we discuss a problem that occurs with the layer-wise greedy unsupervised procedure when the input distribution is not revealing enough of the conditional distribution of the target variable given the input variable. We evaluate a simple and successful solution to this problem.
2 Deep Belief Nets
Let $x$ be the input, and $g^i$ the hidden variables at layer $i$, with joint distribution
$$P(x, g^1, g^2, \ldots, g^\ell) = P(x|g^1)\,P(g^1|g^2)\cdots P(g^{\ell-2}|g^{\ell-1})\,P(g^{\ell-1}, g^\ell),$$
where all the conditional layers $P(g^i|g^{i+1})$ are factorized conditional distributions for which computation of probability and sampling are easy. In Hinton et al. (2006) one considers the hidden layer $g^i$ a binary random vector with $n^i$ elements $g^i_j$:
$$P(g^i|g^{i+1}) = \prod_{j=1}^{n^i} P(g^i_j|g^{i+1}) \quad\text{with}\quad P(g^i_j = 1|g^{i+1}) = \mathrm{sigm}\Big(b^i_j + \sum_{k=1}^{n^{i+1}} W^i_{kj}\, g^{i+1}_k\Big) \qquad (1)$$
where $\mathrm{sigm}(t) = 1/(1+e^{-t})$, the $b^i_j$ are biases for unit $j$ of layer $i$, and $W^i$ is the weight matrix for layer $i$. If we denote $g^0 = x$, the generative model for the first layer $P(x|g^1)$ also follows (1).
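As a minimal sketch of how the factorized conditional (1) is evaluated and sampled, the following pure-Python code computes the per-unit probabilities and draws each binary unit independently. The function names and the toy weights are illustrative assumptions, not from the paper; `W[k][j]` is taken to connect unit `k` of layer $i+1$ to unit `j` of layer $i$, matching the index order in (1).

```python
import math
import random


def sigm(t):
    """Logistic sigmoid, sigm(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def conditional_probs(g_above, W, b):
    """P(g^i_j = 1 | g^{i+1}) for every unit j, following Eq. (1).

    W[k][j] connects unit k of layer i+1 to unit j of layer i; b[j] is the bias of unit j.
    """
    return [
        sigm(b[j] + sum(W[k][j] * g_above[k] for k in range(len(g_above))))
        for j in range(len(b))
    ]


def sample_layer(g_above, W, b, rng):
    """Draw each binary unit independently, as the factorized conditional allows."""
    return [1 if rng.random() < p else 0 for p in conditional_probs(g_above, W, b)]


rng = random.Random(0)
W = [[0.5, -1.0, 2.0], [1.5, 0.0, -2.0]]   # toy weights: 2 units above, 3 units below
b = [0.0, 0.1, -0.1]                        # toy biases
print(conditional_probs([1, 0], W, b))
print(sample_layer([1, 0], W, b, rng))
```

Because the conditional factorizes over units, sampling a whole layer costs one pass over the weight matrix, which is what makes these layers convenient building blocks.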
2.1 Restricted Boltzmann machines
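The top-level prior $P(g^{\ell-1}, g^\ell)$ is a Restricted Boltzmann Machine (RBM) between layer $\ell-1$ and layer $\ell$. To lighten notation, consider a generic RBM with input layer activations $v$ (for visible units) and hidden layer activations $h$ (for hidden units). It has the following joint distribution:
$$P(v, h) = \frac{1}{Z}\, e^{-\mathrm{energy}(v,h)}, \qquad \mathrm{energy}(v,h) = -h'Wv - b'v - c'h, \qquad (2)$$
where $Z$ is the normalization constant. The layer-to-layer conditionals associated with the RBM factorize like in (1) and give rise to $P(v_k = 1|h) = \mathrm{sigm}(b_k + \sum_j W_{jk} h_j)$ and $Q(h_j = 1|v) = \mathrm{sigm}(c_j + \sum_k W_{jk} v_k)$.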
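The two RBM conditionals can be written directly as code. The sketch below is illustrative (function names and toy parameters are assumptions, not from the paper); it stores the weights as `W[j][k]` with `j` indexing hidden units and `k` indexing visible units, as in the conditionals above, and performs alternating sampling of $h$ given $v$ and $v$ given $h$.

```python
import math
import random


def sigm(t):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-t))


def hidden_probs(v, W, c):
    """Q(h_j = 1 | v) = sigm(c_j + sum_k W[j][k] * v[k])."""
    return [sigm(c[j] + sum(W[j][k] * v[k] for k in range(len(v)))) for j in range(len(c))]


def visible_probs(h, W, b):
    """P(v_k = 1 | h) = sigm(b_k + sum_j W[j][k] * h[j])."""
    return [sigm(b[k] + sum(W[j][k] * h[j] for j in range(len(h)))) for k in range(len(b))]


def bernoulli(probs, rng):
    """Sample independent binary units with the given probabilities of being 1."""
    return [1 if rng.random() < p else 0 for p in probs]


def gibbs_step(v, W, b, c, rng):
    """One alternating step: sample h given v, then a new v given h."""
    h = bernoulli(hidden_probs(v, W, c), rng)
    v_next = bernoulli(visible_probs(h, W, b), rng)
    return h, v_next


rng = random.Random(0)
W = [[0.2, -0.4, 0.1], [0.0, 0.3, -0.2]]   # toy weights: 2 hidden units, 3 visible units
b = [0.0, 0.0, 0.0]                         # visible biases
c = [0.0, 0.0]                              # hidden biases
v = [1, 0, 1]                               # starting visible configuration
for _ in range(5):
    h, v = gibbs_step(v, W, b, c, rng)
print(h, v)
```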
2.2 Gibbs Markov chain and log-likelihood gradient in an RBM
To obtain an estimator of the gradient on the log-likelihood of an RBM, we consider a Gibbs Markov chain on the (visible units, hidden units) pair of variables. Gibbs sampling from an RBM proceeds by sampling $h$ given $v$, then $v$ given $h$, etc. Denote by $v_t$ the $t$-th $v$ sample from that chain, starting at $t = 0$ with $v_0$, the “input observation” for the RBM. Therefore, $(v_k, h_k)$ for $k \to \infty$ is a sample from the joint $P(v, h)$. The log-likelihood of a value $v_0$ under the model of the RBM is
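$$\log P(v_0) = \log \sum_h P(v_0, h) = \log \sum_h e^{-\mathrm{energy}(v_0,h)} - \log \sum_{v,h} e^{-\mathrm{energy}(v,h)}$$
and its gradient with respect to $\theta = (W, b, c)$ is
$$\frac{\partial \log P(v_0)}{\partial\theta} = -\sum_{h_0} Q(h_0|v_0)\,\frac{\partial\,\mathrm{energy}(v_0,h_0)}{\partial\theta} + \sum_{v_k,h_k} P(v_k,h_k)\,\frac{\partial\,\mathrm{energy}(v_k,h_k)}{\partial\theta}$$
for $k \to \infty$. An unbiased sample of this gradient is
$$-\frac{\partial\,\mathrm{energy}(v_0,h_0)}{\partial\theta} + E_{h_k}\!\left[\frac{\partial\,\mathrm{energy}(v_k,h_k)}{\partial\theta}\,\Big|\,v_k\right],$$
where $h_0$ is a sample from $Q(h_0|v_0)$ and $(v_k, h_k)$ is a sample of the Markov chain.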
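A common way to use this gradient in practice is to truncate the Gibbs chain after a single step ($k = 1$) rather than letting $k \to \infty$, which gives the biased but cheap contrastive divergence estimator referred to later for the unsupervised updates. The sketch below is an illustration under stated assumptions, not the paper's appendix pseudo-code: it assumes binary units and the bilinear energy $\mathrm{energy}(v,h) = -h'Wv - b'v - c'h$, so that $\partial\,\mathrm{energy}/\partial W_{jk} = -h_j v_k$; the function name `cd1_update` and the learning rate are hypothetical.

```python
import math
import random


def sigm(t):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-t))


def cd1_update(v0, W, b, c, lr, rng):
    """One CD-1 step on a binary RBM, updating W, b, c in place; returns v1.

    Assumes energy(v, h) = -h'Wv - b'v - c'h with W[j][k] (hidden j, visible k).
    """
    n_h, n_v = len(c), len(b)
    # Positive phase: sample h0 from Q(h | v0).
    q0 = [sigm(c[j] + sum(W[j][k] * v0[k] for k in range(n_v))) for j in range(n_h)]
    h0 = [1 if rng.random() < q else 0 for q in q0]
    # Negative phase: sample v1 from P(v | h0), then compute Q(h = 1 | v1).
    p1 = [sigm(b[k] + sum(W[j][k] * h0[j] for j in range(n_h))) for k in range(n_v)]
    v1 = [1 if rng.random() < p else 0 for p in p1]
    q1 = [sigm(c[j] + sum(W[j][k] * v1[k] for k in range(n_v))) for j in range(n_h)]
    # Stochastic ascent step: minus the energy gradient at (v0, h0),
    # plus its expectation over h given v1.
    for j in range(n_h):
        for k in range(n_v):
            W[j][k] += lr * (h0[j] * v0[k] - q1[j] * v1[k])
        c[j] += lr * (h0[j] - q1[j])
    for k in range(n_v):
        b[k] += lr * (v0[k] - v1[k])
    return v1


rng = random.Random(0)
W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # 2 hidden units, 3 visible units
b = [0.0, 0.0, 0.0]
c = [0.0, 0.0]
for _ in range(200):
    cd1_update([1, 0, 1], W, b, c, lr=0.1, rng=rng)
print(b)   # visible biases drift toward the single training pattern
```

Running more Gibbs steps before the negative phase reduces the bias of the estimator at a proportional cost in computation.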
[...] (fitting $p^{\ell}$) will yield improvement on the training criterion for the previous layer (likelihood with respect to $p^{\ell-1}$). The greedy layer-wise training algorithm for DBNs is quite simple, as illustrated by the pseudo-code in Algorithm TrainUnsupervisedDBN of the Appendix.
2.4 Supervised fine-tuning
As a last training stage, it is possible to fine-tune the parameters of all the layers together. For example, Hinton et al. (2006) propose to use the wake-sleep algorithm (Hinton, Dayan, Frey, & Neal, 1995) to continue unsupervised training. Hinton et al. (2006) also propose to optionally use a mean-field approximation of the posteriors $P(g^i|g^0)$, by replacing the samples $g^{i-1}_j$ at level $i-1$ by their bit-wise mean-field expected value $\mu^{i-1}_j$, with $\mu^i = \mathrm{sigm}(b^i + W^i \mu^{i-1})$. According to the propagation rules, the whole network now deterministically computes internal representations as functions of the network input $g^0 = x$. After unsupervised pre-training of the layers of a DBN following Algorithm TrainUnsupervisedDBN (see Appendix), the whole network can be further optimized by gradient descent with respect to any deterministically computable training criterion that depends on the representations. For example, this can be used (Hinton & Salakhutdinov, 2006) to fine-tune a very deep auto-encoder, minimizing a reconstruction error. It is also possible to use this as initialization of all except the last layer of a traditional multi-layer neural network, using gradient descent to fine-tune the whole network with respect to a supervised training criterion.
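The mean-field propagation rule $\mu^i = \mathrm{sigm}(b^i + W^i \mu^{i-1})$ amounts to an ordinary deterministic forward pass. The following sketch is illustrative only: the function name `mean_field_forward`, the layer layout (`W[j][k]` from unit `k` of layer $i-1$ to unit `j` of layer $i$) and the toy parameters are assumptions rather than the paper's notation for stored weights.

```python
import math


def sigm(t):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-t))


def mean_field_forward(x, layers):
    """Propagate mu^0 = x through the stack; returns [mu^0, mu^1, ..., mu^L].

    layers is a list of (W, b) pairs with W[j][k] linking unit k of the
    layer below to unit j of the current layer, and b[j] the bias of unit j.
    """
    mu = list(x)
    activations = [mu]
    for W, b in layers:
        mu = [
            sigm(b[j] + sum(W[j][k] * mu[k] for k in range(len(mu))))
            for j in range(len(b))
        ]
        activations.append(mu)
    return activations


# Toy 3-2-1 stack with made-up parameters.
layers = [
    ([[0.5, -0.5, 0.0], [1.0, 1.0, -1.0]], [0.0, 0.1]),
    ([[2.0, -2.0]], [0.0]),
]
for mu in mean_field_forward([1.0, 0.0, 0.5], layers):
    print(mu)
```

Since no sampling is involved, the output is a differentiable function of the parameters, which is what allows gradient-based fine-tuning of the whole stack.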
Algorithm DBNSupervisedFineTuning in the appendix contains pseudo-code for supervised fine-tuning, as part of the global supervised learning algorithm TrainSupervisedDBN. Note that better results were obtained when using a 20-fold larger learning rate with the supervised criterion (here, squared error or cross-entropy) updates than in the contrastive divergence updates.
3 Extension to continuous-valued inputs
With the binary units introduced for RBMs and DBNs in Hinton et al. (2006) one can “cheat” and handle continuous-valued inputs by scaling them to the (0,1) interval and considering each input continuous value as the probability for a binary random variable to take the value 1. This has worked well for pixel gray levels, but it may be inappropriate for other kinds of input variables. Previous work on continuous-valued input in RBMs includes (Chen & Murray, 2003), in which noise is added to sigmoidal units, and the RBM forms a special form of Diffusion Network (Movellan, Mineiro, & Williams, 2002). We concentrate here on simple extensions of the RBM framework in which only the energy function and the allowed range of values are changed.
Linear energy: exponential or truncated exponential
Consider a unit with value $y$ of an RBM, connected to units $z$ of the other layer. $p(y|z)$ can be obtained from the terms in the exponential that contain $y$, which can be grouped in $y\,a(z)$ for linear energy functions as in (2), where $a(z) = b + w'z$ with $b$ the bias of unit $y$, and $w$ the vector of weights connecting unit $y$ to units $z$. If we allow $y$ to take any value in interval $I$, the conditional density of $y$ becomes
$$p(y|z) = \frac{\exp(y\,a(z))\,1_{y\in I}}{\int_v \exp(v\,a(z))\,1_{v\in I}\,dv}.$$
When $I = [0,1]$, this is a truncated exponential whose normalizing integral is $(\exp(a(z)) - 1)/a(z)$. The conditional expectation of $y$ given $z$ is interesting because it has a sigmoidal-like saturating and monotone non-linearity: $E[y|z] = \frac{1}{1 - \exp(-a(z))} - \frac{1}{a(z)}$. A sampling from the truncated exponential is easily obtained from a uniform sample $U$, using the inverse cumulative $F^{-1}$ of the conditional density $y|z$: $F^{-1}(U) = \frac{\log(1 - U \times (1 - \exp(a(z))))}{a(z)}$.
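The inverse-cumulative sampling rule can be checked numerically. The sketch below assumes the interval $I = [0,1]$ and a non-zero activation $a = a(z)$; under that assumption the density is proportional to $\exp(a y)$ on $[0,1]$, its mean is $1/(1 - e^{-a}) - 1/a$, and the inverse cumulative is $\log(1 - U(1 - e^{a}))/a$. The function names are illustrative, not from the paper.

```python
import math
import random


def trunc_exp_mean(a):
    """E[y | z] for a density proportional to exp(a * y) on [0, 1], with a != 0."""
    return 1.0 / (1.0 - math.exp(-a)) - 1.0 / a


def trunc_exp_sample(a, u):
    """Inverse-CDF sample: F^{-1}(u) = log(1 - u * (1 - exp(a))) / a, with a != 0."""
    return math.log(1.0 - u * (1.0 - math.exp(a))) / a


rng = random.Random(0)
a = 2.0
draws = [trunc_exp_sample(a, rng.random()) for _ in range(20000)]
# The empirical mean of the draws should be close to the closed-form expectation.
print(sum(draws) / len(draws), trunc_exp_mean(a))
```

The same two functions work for negative $a$, where the density decreases on $[0,1]$; the case $a = 0$ (uniform density) has to be handled separately because both formulas divide by $a$.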
Figure 1: Training classification error vs training iteration, on the Cotton price task, for deep network without pre-training, for DBN with unsupervised pre-training, and DBN with partially supervised pre-training. Illustrates optimization difficulty of deep networks and advantage of partially supervised training. (Vertical axis: classification error on training set.)

                                                      Abalone                     Cotton
                                                train.  valid.  test      train.  valid.  test
2. Logistic regression                            ·       ·       ·       44.0%   42.6%   45.0%
4. DBN, binomial inputs, partially supervised   4.39    4.45    4.28      43.3%   41.1%   43.7%
6. DBN, Gaussian inputs, partially supervised   4.23    4.43    4.18      27.5%   28.4%   31.4%

Table 1: Mean squared prediction error on Abalone task and classification error on Cotton task, showing improvement with Gaussian units.
[...] this case the variance is unconditional, whereas the mean depends on the inputs of the unit: for a unit $y$ with inputs $z$ and inverse variance $d^2$, $E[y|z] = a(z)$
Training each layer as an auto-encoder
We want to verify that the layer-wise greedy unsupervised pre-training principle can be applied when using an auto-encoder instead of the RBM as a layer building block. Let $x$ be the input vector with $x_i \in (0,1)$. For a layer with weights matrix $W$, hidden biases column vector $b$ and input biases column vector $c$, the reconstruction probability for bit $i$ is $p_i(x)$, with the vector of probabilities $p(x) = \mathrm{sigm}(c + W\,\mathrm{sigm}(b + W'x))$. The training criterion for the layer is the average of negative log-likelihoods for predicting $x$ from $p(x)$. For example, if $x$ is interpreted either as a sequence of bits or a sequence of bit probabilities, we minimize the reconstruction cross-entropy: $R = -\sum_i \left[ x_i \log p_i(x) + (1 - x_i)\log(1 - p_i(x)) \right]$. We report several experimental results using this training criterion for each layer, in comparison to the contrastive divergence algorithm for an RBM. Pseudo-code for a deep network obtained by training each layer as an auto-encoder is given in Appendix (Algorithm TrainGreedyAutoEncodingDeepNet).
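The reconstruction and its cross-entropy can be sketched in a few lines. This is an illustration under assumptions, not the appendix algorithm: it uses tied weights, with `W[i][j]` linking input `i` to hidden unit `j` so that the encoder applies the transpose of `W` and the decoder applies `W`; the function names and toy values are hypothetical.

```python
import math


def sigm(t):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-t))


def reconstruct(x, W, b, c):
    """p(x) = sigm(c + W sigm(b + W'x)) with W[i][j]: input i, hidden unit j."""
    n_in, n_hid = len(c), len(b)
    # Encoder: hidden code from the input, using the transpose of W.
    h = [sigm(b[j] + sum(W[i][j] * x[i] for i in range(n_in))) for j in range(n_hid)]
    # Decoder: reconstruction probabilities from the hidden code, using W.
    return [sigm(c[i] + sum(W[i][j] * h[j] for j in range(n_hid))) for i in range(n_in)]


def reconstruction_cross_entropy(x, p):
    """R = -sum_i [x_i log p_i + (1 - x_i) log(1 - p_i)]."""
    return -sum(
        xi * math.log(pi) + (1.0 - xi) * math.log(1.0 - pi) for xi, pi in zip(x, p)
    )


x = [0.9, 0.1, 0.8]
W = [[0.4, -0.3], [0.1, 0.2], [-0.5, 0.6]]   # 3 inputs, 2 hidden units (toy values)
b = [0.0, 0.0]                                # hidden biases
c = [0.0, 0.0, 0.0]                           # input (reconstruction) biases
p = reconstruct(x, W, b, c)
print(p, reconstruction_cross_entropy(x, p))
```

Training a layer then consists of lowering this criterion by gradient descent on `W`, `b` and `c`, averaged over the training examples.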
One question that arises with auto-encoders in comparison with RBMs is whether the auto-encoders will fail to learn a useful representation when the number of units is not strictly decreasing from one layer to the next (since the networks could theoretically just learn to be the identity and perfectly minimize the reconstruction error). However, our experiments suggest that networks with non-decreasing layer sizes generalize well. This might be due to weight decay and stochastic gradient descent, preventing large weights: optimization falls in a local minimum which corresponds to a good transformation of the input (that provides a good initialization for supervised training of the whole net).
Greedy layer-wise supervised training
A reasonable question to ask is whether the fact that each layer is trained in an unsupervised way is critical or not. An alternative algorithm is supervised, greedy and layer-wise: train each new hidden layer as the hidden layer of a one-hidden-layer supervised neural network NN (taking as input the output of the last of previously trained layers), and then throw away the output layer of NN and use the parameters of the hidden layer of NN as pre-training initialization of the new top layer of the deep net, to map the output of the previous layers to a hopefully better representation. Pseudo-code for a deep network obtained by training each layer as the hidden layer of a supervised one-hidden-layer neural network is given in Appendix (Algorithm TrainGreedySupervisedDeepNet).
Experiment 2.
We compared the performance on the MNIST digit classification task obtained with five algorithms: (a) DBN, (b) deep network whose layers are initialized as auto-encoders, (c) above described supervised greedy layer-wise algorithm to pre-train each layer, (d) deep network with no pre-training (random initialization), (e) shallow network (1 hidden layer) with no pre-training.
The final fine-tuning is done by adding a logistic regression layer on top of the network and training the whole network by stochastic gradient descent on the cross-entropy with respect to the target classification. The networks have the following architecture: 784 inputs, 10 outputs, 3 hidden layers with variable number of hidden units, selected by validation set performance (typically selected layer sizes are between 500 and 1000). The shallow network has a single hidden layer. An L2 weight decay hyper-parameter is also optimized. The DBN was slower to train and fewer experiments were performed, so that longer training and more appropriately chosen sizes of layers and learning rates could yield better results (Hinton 2006, unpublished, reports 1.15% error on the MNIST test set).
                                             Experiment 2                Experiment 3
                                         train.  valid.  test       train.  valid.  test
DBN, unsupervised pre-training           0%      1.2%    1.2%       0%      1.5%    1.5%
Deep net, auto-associator pre-training   0%      1.4%    1.4%       0%      1.4%    1.6%
Deep net, supervised pre-training        0%      1.7%    2.0%       0%      1.8%    1.9%
Deep net, no pre-training                0.004%  2.1%    2.4%       0.59%   2.1%    2.2%
Shallow net, no pre-training             0.004%  1.8%    1.9%       3.6%    4.7%    5.0%