Learning multiple layers of representation
Geoffrey E. Hinton
Department of Computer Science, University of Toronto, 10 King’s College Road, Toronto, M5S 3G4, Canada
To achieve its impressive performance in tasks such as speech perception or object recognition, the brain extracts multiple levels of representation from the sensory input. Backpropagation was the first computationally efficient model of how neural networks could learn multiple layers of representation, but it required labeled training data and it did not work well in deep networks. The limitations of backpropagation learning can now be overcome by using multilayer neural networks that contain top-down connections and training them to generate sensory data rather than to classify it. Learning multilayer generative models might seem difficult, but a recent discovery makes it easy to learn nonlinear distributed representations one layer at a time.
Learning feature detectors
To enable the perceptual system to make the fine distinctions that are required to control behavior, sensory cortex needs an efficient way of adapting the synaptic weights of multiple layers of feature-detecting neurons. The backpropagation learning procedure [1] iteratively adjusts all of the weights to optimize some measure of the classification performance of the network, but this requires labeled training data. To learn multiple layers of feature detectors when labeled data are scarce or non-existent, some objective other than classification is required. In a neural network that contains both bottom-up ‘recognition’ connections and top-down ‘generative’ connections it is possible to recognize data using a bottom-up pass and to generate data using a top-down pass. If the neurons are stochastic, repeated top-down passes will generate a whole distribution of data-vectors. This suggests a sensible objective for learning: adjust the weights on the top-down connections to maximize the probability that the network would generate the training data. The neural network’s model of the training data then resides in its top-down connections. The role of the bottom-up connections is to enable the network to determine activations for the features in each layer that constitute a plausible explanation of how the network could have generated an observed sensory data-vector. The hope is that the active features in the higher layers will be a much better guide to appropriate actions than the raw sensory data or the lower-level features. As we shall see, this is not just wishful thinking – if three layers of feature detectors are trained on unlabeled images of handwritten digits, the complicated nonlinear features in the top layer enable excellent recognition of poorly written digits like those in Figure 1b [2].
There are several reasons for believing that our visual systems contain multilayer generative models in which top-down connections can be used to generate low-level features of images from high-level representations, and bottom-up connections can be used to infer the high-level representations that would have generated an observed set of low-level features. Single cell recordings [3] and the reciprocal connectivity between cortical areas [4] both suggest a hierarchy of progressively more complex features in which each layer can influence the layer below. Vivid visual imagery, dreaming, and the disambiguating effect of context on the interpretation of local image regions [5] also suggest that the visual system can perform top-down generation.
The aim of this review is to complement the neural and psychological evidence for generative models by reviewing recent computational advances that make it easier to learn generative models than their feed-forward counterparts. The advances are illustrated in the domain of handwritten digits where a learned generative model outperforms discriminative learning methods at classification.
Inference in generative models
The crucial computational step in fitting a generative model to data is determining how the model, with its current generative parameters, might have used its hidden variables to generate an observed data-vector. Stochastic generative models generally have many different ways of generating any particular data-vector, so the best we can hope for is to infer a probability distribution over the various possible settings of the hidden variables. Consider, for example, a mixture of gaussians model in which each data-vector is assumed to come from exactly one of the multivariate gaussian distributions in the mixture. Inference then consists of computing the posterior probability that a particular data-vector came from each of the gaussians. This is easy because the posterior probability assigned to each gaussian in the mixture is simply proportional to the probability density of the data-vector under that gaussian times the prior probability of using that gaussian when generating data.
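The posterior computation just described can be sketched in a few lines. This is a minimal illustration rather than anything from the review: it assumes one-dimensional gaussians instead of multivariate ones, and the function names and the numbers in the example mixture are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Probability density of a one-dimensional gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior_over_components(x, priors, means, stds):
    """Posterior probability that x was generated by each gaussian in the mixture."""
    # Unnormalized posterior: prior of the component times the density of x under it.
    joint = [p * gaussian_pdf(x, m, s) for p, m, s in zip(priors, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical two-component mixture (illustrative numbers only).
priors = [0.5, 0.5]
means = [0.0, 4.0]
stds = [1.0, 1.0]

post_midpoint = posterior_over_components(2.0, priors, means, stds)
post_near_first = posterior_over_components(0.0, priors, means, stds)
```

A point midway between two equally weighted, equally wide gaussians is assigned to each with equal probability, whereas a point at the mean of the first gaussian is assigned almost entirely to it.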
The generative models that are most familiar in statistics and machine learning are the ones for which the posterior distribution can be inferred efficiently and exactly because the model has been strongly constrained. These generative models include:
TRENDS in Cognitive Sciences Vol. 11 No. 10
Corresponding author: Hinton, G.E. (o.edu).
1364-6613/$ – see front matter © 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.tics.2007.09.004
Factor analysis – in which there is a single layer of gaussian hidden variables that have linear effects on the visible variables (see Figure 2). In addition, independent gaussian noise is added to each visible variable [6–8]. Given a visible vector, it is impossible to infer the exact state of the factors that generated it, but it is easy to infer the mean and covariance of the gaussian posterior distribution over the factors, and this is sufficient to enable the parameters of the model to be improved.
Independent components analysis – which generalizes factor analysis by allowing non-gaussian hidden variables, but maintains tractable inference by eliminating the observation noise in the visible variables and using the same number of hidden and visible variables. These restrictions ensure that the posterior distribution collapses to a single point because there is only one setting of the hidden variables that can generate each visible vector exactly [9–11].
Mixture models – in which each data-vector is assumed to be generated by one of the component distributions in the mixture and it is easy to compute the density under each of the component distributions.
If factor analysis is generalized to allow non-gaussian hidden variables, it can model the development of low-level visual receptive fields [12]. However, if the extra constraints used in independent components analysis are not imposed, it is no longer easy to infer, or even to represent, the posterior distribution over the hidden variables. This is because of a phenomenon known as explaining away [13] (see Figure 3b).
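Explaining away can be checked numerically by enumerating the four settings of two hidden causes in a tiny logistic belief net like the one in Figure 3b. The sketch below is illustrative only: the bias of −10 on each cause comes from the figure caption, while the jump bias of −20 and the weights of +20 are assumptions chosen to match the caption's statements that one active cause gives the jump unit a total input of 0 and that no active cause gives odds of e^−20. The function and constant names are hypothetical.

```python
import math
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bernoulli(p_on, state):
    """Probability of a binary state when the unit turns on with probability p_on."""
    return p_on if state == 1 else 1.0 - p_on

CAUSE_BIAS = -10.0    # bias on the earthquake and truck units (from the caption)
JUMP_BIAS = -20.0     # assumption: gives odds of e^-20 when no cause is active
CAUSE_WEIGHT = 20.0   # assumption: one active cause gives a total input of 0

def posterior_given_jump():
    """Posterior over (earthquake, truck) given that the house jumped."""
    joint = {}
    for quake, truck in product([0, 1], repeat=2):
        prior = (bernoulli(sigmoid(CAUSE_BIAS), quake)
                 * bernoulli(sigmoid(CAUSE_BIAS), truck))
        p_jump = sigmoid(JUMP_BIAS + CAUSE_WEIGHT * (quake + truck))
        joint[(quake, truck)] = prior * p_jump
    total = sum(joint.values())
    return {state: p / total for state, p in joint.items()}

posterior = posterior_given_jump()
```

Under these assumptions almost all of the posterior mass falls on the two configurations with exactly one active cause, so the causes are independent in the prior but strongly anticorrelated once the jump is observed.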
Multilayer generative models
Generative models with only one hidden layer are much too simple for modeling the high-dimensional and richly structured sensory data that arrive at the cortex, but they have been pressed into service because, until recently, it was too difficult to perform inference in the more complicated, multilayer, nonlinear models that are clearly required. There have been many attempts to develop multilayer, nonlinear models [14–18]. In Bayes nets (also called belief nets), which have been studied intensively in artificial intelligence and statistics, the hidden variables typically have discrete values. Exact inference is possible if every variable only has a few parents. This can occur in Bayes nets that are used to formalize expert knowledge in limited domains [19], but for more densely connected Bayes nets, exact inference is generally intractable.
Figure 1. (a) The generative model used to learn the joint distribution of digit images and digit labels. (b) Some test images that the network classifies correctly even though it has never seen them before.
Figure 2. The generative model used in factor analysis. Each real-valued hidden factor is chosen independently from a gaussian distribution, $N(0,1)$, with zero mean and unit variance. The factors are then linearly combined using weights ($W_{jk}$) and gaussian observation noise with mean ($\mu_i$) and standard deviation ($\sigma_i$) is added independently to each real-valued variable ($i$).
It is important to realize that if some way can be found to infer the posterior distribution over the hidden variables for each data-vector, learning a multilayer generative model is relatively straightforward. Learning is also straightforward if we can get unbiased samples from the posterior distribution. In this case, we simply adjust the parameters so as to increase the probability that the sampled states of the hidden variables in each layer would generate the sampled states of the hidden or visible variables in the layer below. In the case of the logistic belief net shown in Figure 3a, which will be a major focus of this review, the learning rule for each training case is a version of the delta rule [20]. The inferred state, $h_i$, of the ‘postsynaptic’ unit, $i$, acts as a target value and the probability, $\hat{h}_i$, of activating $i$ given the inferred states, $h_j$, of all the ‘presynaptic’ units, $j$, in the layer above acts as a prediction:
$$\Delta w_{ji} \propto h_j (h_i - \hat{h}_i)$$ (Equation 1)
where $\Delta w_{ji}$ is the change in the weight on the connection from $j$ to $i$.
If $i$ is a visible unit, $h_i$ is replaced by the actual state of $i$ in the training example. If training vectors are selected with equal probability from the training set and the hidden states are sampled from their posterior distribution given the training vector, the learning rule in Equation 1 has a positive expected effect on the probability that the generative model would produce exactly the $N$ training vectors if it was run $N$ times.
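A minimal sketch of one application of Equation 1 to the top-down weights between two adjacent layers follows. It assumes that binary states for both layers have already been obtained (by inference or from the training example), that biases are ignored, and that the proportionality constant is a learning rate `lr`; the function name, the learning rate and the example states are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def delta_rule_update(w, h_above, h_below, lr=0.1):
    """Apply Equation 1 once; w[j][i] is the generative weight from unit j to unit i."""
    new_w = [row[:] for row in w]
    for i in range(len(h_below)):
        # Prediction: probability that the layer above would turn unit i on.
        total_input = sum(h_above[j] * w[j][i] for j in range(len(h_above)))
        h_hat = sigmoid(total_input)
        for j in range(len(h_above)):
            # Target minus prediction, gated by the presynaptic state.
            new_w[j][i] += lr * h_above[j] * (h_below[i] - h_hat)
    return new_w

# Hypothetical states for two units above and two units below.
w = [[0.0, 0.0], [0.0, 0.0]]
h_above = [1, 0]
h_below = [1, 0]
w_new = delta_rule_update(w, h_above, h_below)
```

With zero initial weights every prediction is 0.5, so the weights leaving the active unit above move towards the observed states below, and the weights leaving the inactive unit do not change.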
Approximate inference for multilayer generative models
The generative model in Figure 3a is defined by the weights on its top-down, generative connections, but it also has bottom-up, recognition connections that can be used to perform approximate inference in a single, bottom-up pass. The inferred probability that $h_j = 1$ is $\sigma(\sum_i h_i r_{ij})$. This inference procedure is fast and simple, but it is incorrect because it ignores explaining away. Surprisingly, learning is still possible with incorrect inference because there is a more general objective function that the learning rule in Equation 1 is guaranteed to improve [21,22].
Instead of just considering the log probability of generating each training case, we can also take the accuracy of the inference procedure into account. Other things being equal, we would like our approximate inference method to be as accurate as possible, and we might prefer a model that is slightly less likely to generate the data if it enables more accurate inference of the hidden representations. So it makes sense to use the inaccuracy of inference on each training case as a penalty term when maximizing the log probability of the observed data. This leads to a new objective function that is easy to maximize and is a ‘variational’ lower-bound on the log probability of generating the training data [23]. Learning by optimizing a variational bound is now a standard way of dealing with the intractability of inference in complex generative models [24–27]. An approximate version of this type of learning has been proposed as a model of learning in sensory cortex (Box 1), but it is slow in deep networks if the weights are initialized randomly.
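For a single data-vector the bound can be written out explicitly. The notation here is an assumption, not the review's: $v$ is the data-vector, $h$ ranges over settings of the hidden variables, $Q(h \mid v)$ is the distribution produced by the approximate inference procedure, and $p$ is the generative model. The penalty for inaccurate inference is taken to be the Kullback-Leibler divergence between $Q(h \mid v)$ and the true posterior, which is the standard form of such a bound.

```latex
\log p(v) \;\ge\; \sum_{h} Q(h \mid v) \log p(v, h) \;-\; \sum_{h} Q(h \mid v) \log Q(h \mid v)
          \;=\; \log p(v) \;-\; \mathrm{KL}\big( Q(h \mid v) \,\|\, p(h \mid v) \big)
```

Because the divergence is never negative, the right-hand side never exceeds $\log p(v)$, and it equals $\log p(v)$ only when the approximate inference is exact.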
A nonlinear module with fast exact inference
We now turn to a different type of model called a ‘restricted Boltzmann machine’ (RBM) [28] (Figure 4a). Despite its undirected, symmetric connections, the RBM is the key to finding an efficient learning procedure for deep, directed, generative models.
Images composed of binary pixels can be modeled by using the hidden layer of an RBM to model the higher-order correlations between pixels [29]. To learn a good set of feature detectors from a set of training images, we start with zero weights on the symmetric connections between each pixel $i$ and each feature detector $j$. Then we repeatedly update each weight, $w_{ij}$, using the difference between two measured, pairwise correlations
$$\Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{recon} \right)$$ (Equation 2)
where $\epsilon$ is a learning rate, $\langle v_i h_j \rangle_{data}$ is the frequency with which pixel $i$ and feature detector $j$ are on together when the feature detectors are being driven by images from the training set, and $\langle v_i h_j \rangle_{recon}$ is the corresponding frequency when the feature detectors are being driven by reconstructed images. A similar learning rule can be used for the biases.
Figure 3. (a) A multilayer belief net composed of logistic binary units. To generate fantasies from the model, we start by picking a random binary state of 1 or 0 for each top-level unit. Then we perform a stochastic downwards pass in which the probability, $\hat{h}_i$, of turning on each unit, $i$, is determined by applying the logistic function $\sigma(x) = 1/(1 + \exp(-x))$ to the total input $\sum_j h_j w_{ji}$ that $i$ receives from the units, $j$, in the layer above, where $h_j$ is the binary state that has already been chosen for unit $j$. It is easy to give each unit an additional bias, but to simplify this review biases will usually be ignored. $r_{ij}$ is a recognition weight. (b) An illustration of ‘explaining away’ in a simple logistic belief net containing two independent, rare, hidden causes that become highly anticorrelated when we observe the house jumping. The bias of $-10$ on the earthquake unit means that, in the absence of any observation, this unit is $e^{10}$ times more likely to be off than on. If the earthquake unit is on and the truck unit is off, the jump unit has a total input of 0, which means that it has an even chance of being on. This is a much better explanation of the observation that the house jumped than the odds of $e^{-20}$, which apply if neither of the hidden causes is active. But it is wasteful to turn on both hidden causes to explain the observation because the probability of them both happening is approximately $e^{-20}$.
Given a training image, we set the binary state, $h_j$, of each feature detector to be 1 with probability
$$p(h_j = 1) = \sigma\Big(b_j + \sum_i v_i w_{ij}\Big)$$ (Equation 3)
where $\sigma(\cdot)$ is the logistic function, $b_j$ is the bias of $j$ and $v_i$ is the binary state of pixel $i$. Once binary states have been chosen for the hidden units we produce a ‘reconstruction’ of the training image by setting the state of each pixel to be 1 with probability
$$p(v_i = 1) = \sigma\Big(b_i + \sum_j h_j w_{ij}\Big)$$ (Equation 4)
The learned weights and biases directly determine the conditional distributions $p(h|v)$ and $p(v|h)$ using Equations 3 and 4. Indirectly, the weights and biases define the joint and marginal distributions $p(v,h)$, $p(v)$ and $p(h)$. Sampling from the joint distribution is difficult, but it can be done by using ‘alternating Gibbs sampling’. This starts with a random image and then alternates between updating all of the features in parallel using Equation 3 and updating all of the pixels in parallel using Equation 4. After Gibbs sampling for sufficiently long, the network reaches ‘thermal equilibrium’. The states of pixels and feature detectors still change, but the probability of finding the system in any particular binary configuration does not. By observing the fantasies on the visible units at thermal equilibrium, we can see the distribution over visible vectors that the model believes in.
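The update in Equation 2, the sampling steps in Equations 3 and 4, and alternating Gibbs sampling can be sketched together in plain Python. Several choices below are assumptions rather than details from the review: the pairwise correlations are estimated from a single sampled state per training image instead of being averaged over the training set, the bias updates use the analogous difference between data and reconstruction, and the toy images, layer sizes, learning rate and all function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_hidden(v, w, b_hid, rng):
    """Equation 3: turn on feature detector j with probability sigma(b_j + sum_i v_i w_ij)."""
    return [1 if rng.random() < sigmoid(b_hid[j] + sum(v[i] * w[i][j] for i in range(len(v)))) else 0
            for j in range(len(b_hid))]

def sample_visible(h, w, b_vis, rng):
    """Equation 4: turn on pixel i with probability sigma(b_i + sum_j h_j w_ij)."""
    return [1 if rng.random() < sigmoid(b_vis[i] + sum(h[j] * w[i][j] for j in range(len(h)))) else 0
            for i in range(len(b_vis))]

def update_weights(v_data, w, b_vis, b_hid, lr, rng):
    """Equation 2 with single-sample estimates of the two correlations (an assumption)."""
    h_data = sample_hidden(v_data, w, b_hid, rng)
    v_recon = sample_visible(h_data, w, b_vis, rng)
    h_recon = sample_hidden(v_recon, w, b_hid, rng)
    for i in range(len(v_data)):
        for j in range(len(h_data)):
            w[i][j] += lr * (v_data[i] * h_data[j] - v_recon[i] * h_recon[j])
        b_vis[i] += lr * (v_data[i] - v_recon[i])   # assumed form of the bias rule
    for j in range(len(h_data)):
        b_hid[j] += lr * (h_data[j] - h_recon[j])   # assumed form of the bias rule

def gibbs_fantasy(w, b_vis, b_hid, n_steps, rng):
    """Alternating Gibbs sampling started from a random image."""
    v = [rng.randint(0, 1) for _ in b_vis]
    for _ in range(n_steps):
        h = sample_hidden(v, w, b_hid, rng)
        v = sample_visible(h, w, b_vis, rng)
    return v

rng = random.Random(0)
images = [[1, 1, 0, 0], [0, 0, 1, 1]]      # hypothetical 4-pixel training images
w = [[0.0] * 2 for _ in range(4)]          # zero initial weights, as in the text
b_vis = [0.0] * 4
b_hid = [0.0] * 2
for _ in range(200):
    for image in images:
        update_weights(image, w, b_vis, b_hid, 0.1, rng)
fantasy = gibbs_fantasy(w, b_vis, b_hid, 50, rng)
```

The number of Gibbs steps needed to approach thermal equilibrium depends on the model; the value of 50 used here is arbitrary.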
The RBM has two major advantages over directed models with one hidden layer. First, inference is easy because there is no explaining away: given a visible vector, the posterior distribution over hidden vectors factorizes into a product of independent distributions for each hidden unit. So to get a sample from the posterior we simply turn on each hidden unit with a probability given by Equation 3.
Box 1. The wake-sleep algorithm
For the logistic belief net shown in Figure 3a, it is easy to improve the generative weights if the network already has a good set of recognition weights. For each data-vector in the training set, the recognition weights are used in a bottom-up pass that stochastically picks a binary state for each hidden unit. Applying the learning rule in Equation 1 will then follow the gradient of a variational bound on how well the network generates the training data [22].
It is not so easy to compute the derivatives of the bound with respect to the recognition weights, but there is a simple, approximate learning rule that works well in practice. If we generate fantasies from the model by using the generative weights in a top-down pass, we know the true causes of the activities in each layer, so we can compare the true causes with the predictions made by the approximate inference procedure and adjust the recognition weights, $r_{ij}$, to maximize the probability that the predictions are correct:
$$\Delta r_{ij} \propto h_i \Big( h_j - \sigma\Big(\sum_i h_i r_{ij}\Big) \Big)$$ (Equation 5)
The combination of approximate inference for learning the generative weights, and fantasies for learning the recognition weights is known as the ‘wake-sleep’ algorithm [22].
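The fantasy-driven half of the procedure in Box 1 can be sketched for a single pair of layers as follows. This is an illustrative sketch under stated assumptions: biases are ignored, the proportionality constant in Equation 5 is a learning rate `lr`, and the generative weights, layer sizes and function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def top_down_fantasy(h_above, w, rng):
    """Sample the layer below from the layer above; w[j][i] is a generative weight."""
    states = []
    for i in range(len(w[0])):
        p_on = sigmoid(sum(h_above[j] * w[j][i] for j in range(len(h_above))))
        states.append(1 if rng.random() < p_on else 0)
    return states

def recognition_update(r, h_below, h_above, lr=0.1):
    """Equation 5; r[i][j] is the recognition weight from unit i (below) to unit j (above)."""
    new_r = [row[:] for row in r]
    for j in range(len(h_above)):
        # Prediction of the true cause h_j made by the bottom-up recognition pass.
        predicted = sigmoid(sum(h_below[i] * r[i][j] for i in range(len(h_below))))
        for i in range(len(h_below)):
            new_r[i][j] += lr * h_below[i] * (h_above[j] - predicted)
    return new_r

rng = random.Random(0)
w = [[2.0, -2.0, 0.5], [-1.0, 1.5, 0.0]]         # hypothetical generative weights
r = [[0.0, 0.0] for _ in range(3)]               # recognition weights, 3 below, 2 above
h_above = [rng.randint(0, 1) for _ in range(2)]  # true causes chosen at random
h_below = top_down_fantasy(h_above, w, rng)      # fantasy generated top-down
r = recognition_update(r, h_below, h_above)
```

Because the fantasy is generated by the model itself, the states in the layer above are known exactly and can serve as targets for the recognition weights.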
Figure 4. (a) Two separate restricted Boltzmann machines (RBMs). The stochastic, binary variables in the hidden layer of each RBM are symmetrically connected to the stochastic, binary variables in the visible layer. There are no connections within a layer. The higher-level RBM is trained by using the hidden activities of the lower RBM as data. (b) The composite generative model produced by composing the two RBMs. Note that the connections in the lower layer of the composite generative model are directed. The hidden states are still inferred by using bottom-up recognition connections, but these are no longer part of the generative model.
Second, as we shall see, it is easy to learn deep directed networks one layer at a time by stacking RBMs. Layer-by-layer learning does not work nearly as well when the individual modules are directed, because each directed module bites off more than it can chew: it tries to learn hidden causes that are marginally independent. This is generally beyond its abilities so it settles for a generative model in which independent causes generate a poor approximation to the data distribution.
Learning many layers of features by composing RBMs
After an RBM has been learned, the activities of its hidden units (when they are being driven by data) can be used as the ‘data’ for learning a higher-level RBM. To understand why this is a good idea, it is helpful to consider decomposing the problem of modeling the data distribution, $P_0$, into two subproblems by picking a distribution, $P_1$, that is easier to model than $P_0$. The first subproblem is to model $P_1$ and the second subproblem is to model the transformation from $P_1$ to $P_0$. $P_1$ is the distribution obtained by applying $p(h|v)$ to the data distribution to get the hidden activities for every data-vector in the training set. $P_1$ is easier for an RBM to model than $P_0$ because it is obtained from $P_0$ by allowing an RBM to settle towards a distribution that it can model perfectly: its equilibrium distribution. The RBM’s model of $P_1$ is $p(h)$, the distribution over hidden vectors when the RBM is sampling from its equilibrium distribution. The RBM’s model of the transformation from $P_1$ to $P_0$ is $p(v|h)$.
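The idea of using the data-driven hidden activities of one RBM as the training data for the next can be sketched as a short greedy loop. This is a sketch under several assumptions: each RBM is trained with single-sample estimates of the pairwise correlations, sampled binary hidden states (rather than probabilities) are passed upwards, and the layer sizes, toy data and function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def up(v, w, b_hid, rng):
    """Sample hidden states from p(h|v)."""
    return [1 if rng.random() < sigmoid(b_hid[j] + sum(v[i] * w[i][j] for i in range(len(v)))) else 0
            for j in range(len(b_hid))]

def down(h, w, b_vis, rng):
    """Sample visible states from p(v|h)."""
    return [1 if rng.random() < sigmoid(b_vis[i] + sum(h[j] * w[i][j] for j in range(len(h)))) else 0
            for i in range(len(b_vis))]

def train_rbm(data, n_hidden, epochs, lr, rng):
    """Train one RBM from zero weights using reconstructions of each data-vector."""
    n_visible = len(data[0])
    w = [[0.0] * n_hidden for _ in range(n_visible)]
    b_vis = [0.0] * n_visible
    b_hid = [0.0] * n_hidden
    for _ in range(epochs):
        for v in data:
            h = up(v, w, b_hid, rng)
            v_recon = down(h, w, b_vis, rng)
            h_recon = up(v_recon, w, b_hid, rng)
            for i in range(n_visible):
                for j in range(n_hidden):
                    w[i][j] += lr * (v[i] * h[j] - v_recon[i] * h_recon[j])
                b_vis[i] += lr * (v[i] - v_recon[i])
            for j in range(n_hidden):
                b_hid[j] += lr * (h[j] - h_recon[j])
    return w, b_vis, b_hid

def train_stack(data, layer_sizes, epochs, lr, rng):
    """Greedy learning: each RBM is trained on the hidden activities of the one below."""
    layers = []
    layer_data = data
    for n_hidden in layer_sizes:
        w, b_vis, b_hid = train_rbm(layer_data, n_hidden, epochs, lr, rng)
        layers.append((w, b_vis, b_hid))
        # Hidden activities driven by the current data become the next layer's 'data'.
        layer_data = [up(v, w, b_hid, rng) for v in layer_data]
    return layers

rng = random.Random(0)
data = [[1, 1, 0, 0], [0, 0, 1, 1]] * 5   # hypothetical 4-pixel training set
layers = train_stack(data, layer_sizes=[3, 2], epochs=20, lr=0.1, rng=rng)
```

Each call to `train_rbm` sees only the activities produced by the layer beneath it, so the layers are learned one at a time and no layer is revisited during this greedy phase.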
After the first RBM has been learned, we keep $p(v|h)$ as part of the generative model and we keep $p(h|v)$ as a quick way of performing inference, but we throw away our model of $P_1$ and replace it by a better model that is obtained, recursively, by treating $P_1$ as the training data for the second-level RBM. This leads to the composite generative model shown in Figure 4b. To generate from this model we need to get an equilibrium sample from the top-level RBM, but then we simply perform a single downwards pass through the bottom layer of weights. So the composite model is a curious hybrid whose top two layers form an undirected associative memory and whose lower layers form a directed generative model. It is shown in reference [30] that if the second RBM is initialized appropriately, the gain from building a better model of $P_1$ always outweighs the loss that comes from the fact that $p(h|v)$ is no longer the correct way to perform inference in the composite generative model shown in Figure 4b. Adding another hidden layer always improves a variational bound on the log probability of the training data unless the top-level RBM is already a perfect model of the data it is trained on.
Modeling images of handwritten digits
Figure 1a shows a network that was used to model the joint distribution of digit images and their labels. It was learned one layer at a time and the top-level RBM was trained using ‘data’-vectors that were constructed by concatenating the states of ten winner-take-all label units with 500 binary features inferred from the image. After greedily learning one layer of weights at a time, all the weights were fine-tuned using a variant of the wake-sleep algorithm (see reference [30] for details). The fine-tuning significantly improves the ability of the model to generate images that resemble the data, but without the initial layer-by-layer learning, the fine-tuning alone is hopelessly slow.
The model was trained to generate both a label and an image, but it can be used to classify new images. First, the recognition weights are used to infer binary states for the 500 feature units in the second hidden layer, then alternating Gibbs sampling is applied to the top two layers with the 500 features held fixed. The probability of each label is then represented by the frequency with which it turns on. Using an efficient version of this method, the network significantly outperforms both backpropagation and support vector machines [31] when trained on the same data [30]. A demonstration of the model generating and recognizing digit images is at my homepage (o. edu/$hinton).
Instead of fine-tuning the model to be better at generating the data, backpropagation can be used to fine-tune it to be better at discrimination. This works extremely well [2,20]. The initial layer-by-layer learning finds features that enable good generation and then the discriminative fine-tuning slightly modifies the features to adjust the boundaries between classes. This has the great advantage that the limited amount of information in the labels is used only for perturbing features, not for creating them. If the ultimate aim is discrimination it is possible to use autoencoders with a single hidden layer instead of restricted Boltzmann machines for the unsupervised, layer-by-layer learning [32]. This produces the best results ever achieved on the most commonly used benchmark for handwritten digit recognition [33].
Modeling sequential data
This review has focused on static images, but restricted Boltzmann machines can also be applied to high-dimensional sequential data such as video sequences [34] or the joint angles of a walking person [35]. The visible and hidden units are given additional, conditioning inputs that come from previous visible frames. The conditioning inputs have the effect of dynamically setting the biases of the visible and hidden units. These conditional restricted Boltzmann machines can be composed by using the sequence of hidden activities of one as the training data for the next. This creates multilayer distributed representations of sequences that are far more powerful than the representations learned by standard methods such as hidden Markov models or linear dynamical systems [34].
Concluding remarks
A combination of three ideas leads to a novel and effective way of learning multiple layers of representation. The first idea is to learn a model that generates sensory data rather than classifying it. This eliminates the need for large amounts of labeled data. The second idea is to learn one layer of representation at a time using restricted Boltzmann machines. This decomposes the overall learning task into multiple simpler tasks and eliminates the inference problems that arise in directed generative models. The third idea is to use a separate fine-tuning stage to improve the generative or discriminative abilities of the composite model.