FUNDAMENTAL TECHNOLOGIES IN MODERN SPEECH RECOGNITION

[Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury]
[The shared views of four research groups]

Digital Object Identifier 10.1109/MSP.2012.2205597
Date of publication: 15 October 2012

Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent success in using DNNs for acoustic modeling in speech recognition.

INTRODUCTION
New machine learning algorithms can lead to significant advances in automatic speech recognition (ASR). The biggest
single advance occurred nearly four decades ago with the introduction of the expectation-maximization (EM) algorithm for training HMMs (see [1] and [2] for informative historical reviews of the introduction of HMMs). With the EM algorithm, it became possible to develop speech recognition systems for real-world tasks using the richness of GMMs [3] to represent the relationship between HMM states and the acoustic input. In these systems the acoustic input is typically represented by concatenating Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) [4] computed from the raw waveform and their first- and second-order temporal differences [5]. This nonadaptive but highly engineered preprocessing of the waveform is designed to discard the large amount of information in waveforms that is considered to be irrelevant for discrimination and to express the remaining information in a form that facilitates discrimination with GMM-HMMs.
GMMs have a number of advantages that make them suitable for modeling the probability distributions over vectors of input features that are associated with each state of an HMM. With enough components, they can model probability distributions to any required level of accuracy, and they are fairly easy to fit to data using the EM algorithm. A huge amount of research has gone into finding ways of constraining GMMs to increase their evaluation speed and to optimize the tradeoff between their flexibility and the amount of training data required to avoid serious overfitting [6].
The recognition accuracy of a GMM-HMM system can be further improved if it is discriminatively fine-tuned after it has been generatively trained to maximize its probability of generating the observed data, especially if the discriminative objective function used for training is closely related to the error rate on phones, words, or sentences [7]. The accuracy can also be improved by augmenting (or concatenating) the input features (e.g., MFCCs) with “tandem” or bottleneck features generated using neural networks [8], [69]. GMMs are so successful that it is difficult for any new method to outperform them for acoustic modeling.
Despite all their advantages, GMMs have a serious shortcoming: they are statistically inefficient for modeling data that lie on or near a nonlinear manifold in the data space. For example, modeling the set of points that lie very close to the surface of a sphere only requires a few parameters using an appropriate model class, but it requires a very large number of diagonal Gaussians or a fairly large number of full-covariance Gaussians. Speech is produced by modulating a relatively small number of parameters of a dynamical system [10], [11], and this implies that its true underlying structure is much lower-dimensional than is immediately apparent in a window that contains hundreds of coefficients. We believe, therefore, that other types of model may work better than GMMs for acoustic modeling if they can more effectively exploit information embedded in a large window of frames.
Artificial neural networks trained by backpropagating error derivatives have the potential to learn much better models of data that lie on or near a nonlinear manifold. In fact, two decades ago, researchers achieved some success using artificial neural networks with a single layer of nonlinear hidden units to predict HMM states from windows of acoustic coefficients [9]. At that time, however, neither the hardware nor the learning algorithms were adequate for training neural networks with many hidden layers on large amounts of data, and the performance benefits of using neural networks with a single hidden layer were not sufficiently large to seriously challenge GMMs. As a result, the main practical contribution of neural networks at that time was to provide extra features in tandem or bottleneck systems.
Over the last few years, advances in both machine learning algorithms and computer hardware have led to more efficient methods for training DNNs that contain many layers of nonlinear hidden units and a very large output layer. The large output layer is required to accommodate the large number of HMM states that arise when each phone is modeled by a number of different “triphone” HMMs that take into account the phones on either side. Even when many of the states of the triphone HMMs are tied together, there can be thousands of tied states. Using the new learning methods, several different research groups have shown that DNNs can outperform GMMs at acoustic modeling for speech recognition on a variety of data sets including large data sets with large vocabularies.
This review article aims to represent the shared views of research groups at the University of Toronto, Microsoft Research (MSR), Google, and IBM Research, who have all had recent success in using DNNs for acoustic modeling. The article starts by describing the two-stage training procedure that is used for fitting the DNNs. In the first stage, layers of feature detectors are initialized, one layer at a time, by fitting a stack of generative models, each of which has one layer of latent variables. These generative models are trained without using any information about the HMM states that the acoustic model will need to discriminate. In the second stage, each generative model in the stack is used to initialize one layer of hidden units in a DNN and the whole network is then discriminatively fine-tuned to predict the target HMM states. These targets are obtained by using a baseline GMM-HMM system to produce a forced alignment.
In this article, we review exploratory experiments on the TIMIT database [12], [13] that were used to demonstrate the power of this two-stage training procedure for acoustic modeling. The DNNs that worked well on TIMIT were then applied to five different large-vocabulary continuous speech recognition (LVCSR) tasks by three different research groups whose results we also summarize. The DNNs worked well on all of these tasks when compared with highly tuned GMM-HMM systems, and on some of the tasks they outperformed the state of the art by a large margin. We also describe some other uses of DNNs for acoustic modeling and some variations on the training procedure.

TRAINING DEEP NEURAL NETWORKS
A DNN is a feed-forward, artificial neural network that has more than one layer of hidden units between its inputs and its outputs. Each hidden unit, $j$, typically uses the logistic function (the closely related hyperbolic tangent is also often used and any function with a well-behaved derivative can be used) to map its total input from the layer below, $x_j$, to the scalar state, $y_j$, that it sends to the layer above:

$$y_j = \operatorname{logistic}(x_j) = \frac{1}{1 + e^{-x_j}}, \qquad x_j = b_j + \sum_i y_i w_{ij}, \tag{1}$$

where $b_j$ is the bias of unit $j$, $i$ is an index over units in the layer below, and $w_{ij}$ is the weight on a connection to unit $j$ from unit $i$ in the layer below. For multiclass classification, output unit $j$ converts its total input, $x_j$, into a class probability, $p_j$, by using the “softmax” nonlinearity

$$p_j = \frac{\exp(x_j)}{\sum_k \exp(x_k)}, \tag{2}$$

where $k$ is an index over all classes.
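As an illustration of (1) and (2), the following minimal sketch computes the states of one hidden layer and a softmax output in plain Python. The function names and the list-of-lists layout `weights[i][j]` for $w_{ij}$ are assumptions made for this example, and subtracting the maximum input inside the softmax is a standard numerical safeguard rather than part of (2).

```python
import math

def logistic(x):
    """Logistic function used in (1)."""
    return 1.0 / (1.0 + math.exp(-x))

def hidden_layer(y_below, weights, biases):
    """Apply (1) to every unit j; weights[i][j] is w_ij from unit i below to unit j."""
    states = []
    for j, b_j in enumerate(biases):
        x_j = b_j + sum(y_i * weights[i][j] for i, y_i in enumerate(y_below))
        states.append(logistic(x_j))
    return states

def softmax(x):
    """Apply (2); subtracting max(x) does not change the result but avoids overflow."""
    m = max(x)
    exps = [math.exp(x_k - m) for x_k in x]
    total = sum(exps)
    return [e / total for e in exps]
```

Stacking several calls to `hidden_layer` and finishing with `softmax` gives the forward pass of a DNN classifier.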
DNNs can be discriminatively trained (DT) by backpropagating derivatives of a cost function that measures the discrepancy between the target outputs and the actual outputs produced for each training case [14]. When using the softmax output function, the natural cost function $C$ is the cross entropy between the target probabilities $d$ and the outputs of the softmax, $p$:

$$C = -\sum_j d_j \log p_j, \tag{3}$$

where the target probabilities, typically taking values of one or zero, are the supervised information provided to train the DNN classifier.
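A small sketch of the cost in (3) may help. It also includes the standard result that, for a softmax output and targets that sum to one, the derivative of $C$ with respect to the total input $x_j$ is $p_j - d_j$; this result is not stated in the text above and is added here as background. The function names are illustrative.

```python
import math

def softmax(x):
    """Softmax of (2), shifted by max(x) for numerical safety."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(d, p):
    """C = -sum_j d_j log p_j as in (3); terms with d_j == 0 contribute nothing."""
    return -sum(d_j * math.log(p_j) for d_j, p_j in zip(d, p) if d_j > 0.0)

def grad_wrt_inputs(d, x):
    """dC/dx_j = p_j - d_j, valid when the targets d sum to one."""
    return [p_j - d_j for p_j, d_j in zip(softmax(x), d)]
```

These derivatives at the output layer are the quantities that backpropagation passes down through the hidden layers.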
For large training sets, it is typically more efficient to compute the derivatives on a small, random “minibatch” of training cases, rather than the whole training set, before updating the weights in proportion to the gradient. This stochastic gradient descent method can be further improved by using a “momentum” coefficient, $0 < \alpha < 1$, that smooths the gradient computed for minibatch $t$, thereby damping oscillations across ravines and speeding progress down ravines:

$$\Delta w_{ij}(t) = \alpha\, \Delta w_{ij}(t-1) - \epsilon \frac{\partial C}{\partial w_{ij}(t)}. \tag{4}$$

The update rule for biases can be derived by treating them as weights on connections coming from units that always have a state of one.
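The momentum update in (4) can be sketched as follows for a flat list of weights. The default values `learning_rate=0.1` and `momentum=0.9` are illustrative assumptions, not values taken from the article, and the gradient is assumed to have been computed already on the current minibatch.

```python
def momentum_step(weights, velocity, grad, learning_rate=0.1, momentum=0.9):
    """One application of (4).

    velocity holds the previous update, Delta w(t-1); grad holds dC/dw for
    the current minibatch. Returns the new weights and the new update.
    """
    new_velocity = [momentum * v - learning_rate * g
                    for v, g in zip(velocity, grad)]
    new_weights = [w + dv for w, dv in zip(weights, new_velocity)]
    return new_weights, new_velocity
```

Starting from a zero velocity, the first step is an ordinary gradient step; later steps add a decaying sum of earlier updates.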
To reduce overfitting, large weights can be penalized in proportion to their squared magnitude, or the learning can simply be terminated at the point at which performance on a held-out validation set starts getting worse [9]. In DNNs with full connectivity between adjacent layers, the initial weights are given small random values to prevent all of the hidden units in a layer from getting exactly the same gradient.

DNNs with many hidden layers are hard to optimize. Gradient descent from a random starting point near the origin is not the best way to find a good set of weights, and unless the initial scales of the weights are carefully chosen [15], the backpropagated gradients will have very different magnitudes in different layers. In addition to the optimization issues, DNNs may generalize poorly to held-out test data. DNNs with many hidden layers and many units per layer are very flexible models with a very large number of parameters. This makes them capable of modeling very complex and highly nonlinear relationships between inputs and outputs. This ability is important for high-quality acoustic modeling, but it also allows them to model spurious regularities that are an accidental property of the particular examples in the training set, which can lead to severe overfitting. Weight penalties or early stopping can reduce the overfitting but only by removing much of the modeling power.
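The two regularizers mentioned above, a squared-magnitude weight penalty and early stopping, can be sketched in a few lines. The function names, the `weight_cost` parameter, and the simple "stop at the first increase" criterion are assumptions chosen for illustration; practical systems often use a more tolerant stopping rule.

```python
def penalized_gradient(grad, weights, weight_cost=1e-4):
    """Add the gradient of (weight_cost / 2) * sum(w ** 2) to the cost gradient."""
    return [g + weight_cost * w for g, w in zip(grad, weights)]

def early_stopping_epoch(validation_errors):
    """Index of the last epoch before the validation error first gets worse."""
    for t in range(1, len(validation_errors)):
        if validation_errors[t] > validation_errors[t - 1]:
            return t - 1
    return len(validation_errors) - 1
```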
Very large training sets [16] can reduce overfitting while preserving modeling power, but only by making training very computationally expensive. What we need is a better method of using the information in the training set to build multiple layers of nonlinear feature detectors.

GENERATIVE PRETRAINING
Instead of designing feature detectors to be good for discriminating between classes, we can start by designing them to be good at modeling the structure in the input data. The idea is to learn one layer of feature detectors at a time with the states of the feature detectors in one layer acting as the data for training the next layer. After this generative “pretraining,” the multiple layers of feature detectors can be used as a much better starting point for a discriminative “fine-tuning” phase during which backpropagation through the DNN slightly adjusts the weights found in pretraining [17]. Some of the high-level features created by the generative pretraining will be of little use for discrimination, but others will be far more useful than the raw inputs. The generative pretraining finds a region of the weight-space that allows the discriminative fine-tuning to make rapid progress, and it also significantly reduces overfitting [18].

A single layer of feature detectors can be learned by fitting a
generative model with one layer of latent variables to the input data. There are two broad classes of generative model to choose from. A directed model generates data by first choosing the states of the latent variables from a prior distribution and then choosing the states of the observable variables from their conditional distributions given the latent states. Examples of directed models with one layer of latent variables are factor analysis, in which the latent variables are drawn from an isotropic Gaussian, and GMMs, in which they are drawn from a discrete distribution. An undirected model has a very different way of generating data. Instead of using one set of parameters to define a prior distribution over the latent variables and a separate set of parameters to define the conditional distributions of the observable variables given the values of the latent variables, an undirected model uses a single set of parameters, $W$, to define the joint probability of a vector of values of the observable variables, $\mathbf{v}$, and a vector of values of the latent variables, $\mathbf{h}$, via an energy function, $E$:

$$p(\mathbf{v}, \mathbf{h}; W) = \frac{1}{Z} e^{-E(\mathbf{v}, \mathbf{h}; W)}, \qquad Z = \sum_{\mathbf{v}', \mathbf{h}'} e^{-E(\mathbf{v}', \mathbf{h}'; W)}, \tag{5}$$

where $Z$ is called the partition function.
If many different latent variables interact nonlinearly to generate each data vector, it is difficult to infer the states of the latent variables from the observed data in a directed model because of a phenomenon known as “explaining away” [19]. In undirected models, however, inference is easy provided the latent variables do not have edges linking them. Such a restricted class of undirected models is ideal for layerwise pretraining because each layer will have an easy inference procedure.
We start by describing an approximate learning algorithm for a restricted Boltzmann machine (RBM), which consists of a layer of stochastic binary “visible” units that represent binary input data connected to a layer of stochastic binary hidden units that learn to model significant nonindependencies between the visible units [20]. There are undirected connections between visible and hidden units but no visible-visible or hidden-hidden connections. An RBM is a type of Markov random field (MRF) but differs from most MRFs in several ways: it has a bipartite connectivity graph, it does not usually share weights between different units, and a subset of the variables are unobserved, even during training.

AN EFFICIENT LEARNING PROCEDURE FOR RBMs
A joint configuration, $(\mathbf{v}, \mathbf{h})$, of the visible and hidden units of an RBM has an energy given by

$$E(\mathbf{v}, \mathbf{h}) = -\sum_{i \in \text{visible}} a_i v_i - \sum_{j \in \text{hidden}} b_j h_j - \sum_{i,j} v_i h_j w_{ij}, \tag{6}$$
where $v_i, h_j$ are the binary states of visible unit $i$ and hidden unit $j$, $a_i, b_j$ are their biases, and $w_{ij}$ is the weight between them. The network assigns a probability to every possible pair of a visible and a hidden vector via this energy function as in (5), and the probability that the network assigns to a visible vector, $\mathbf{v}$, is given by summing over all possible hidden vectors:

$$p(\mathbf{v}) = \frac{1}{Z} \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}. \tag{7}$$

The derivative of the log probability of a training set with respect to a weight is surprisingly simple:

$$\frac{1}{N} \sum_{n=1}^{N} \frac{\partial \log p(\mathbf{v}^n)}{\partial w_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}, \tag{8}$$

where $N$ is the size of the training set and the angle brackets are used to denote expectations under the distribution specified by the subscript that follows. The simple derivative in (8) leads to a very simple learning rule for performing stochastic steepest ascent in the log probability of the training data:

$$\Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}} \right), \tag{9}$$

where $\epsilon$ is a learning rate.
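For a very small RBM, the quantities in (5) to (7) can be computed exactly by enumerating every binary configuration, which makes the definitions concrete. The sketch below does this in plain Python; the function names and the layout `w[i][j]` are assumptions for illustration, and the enumeration is only feasible for a handful of units because the number of configurations grows exponentially.

```python
import math
from itertools import product

def energy(v, h, a, b, w):
    """RBM energy (6); w[i][j] links visible unit i and hidden unit j."""
    e = -sum(a_i * v_i for a_i, v_i in zip(a, v))
    e -= sum(b_j * h_j for b_j, h_j in zip(b, h))
    e -= sum(v[i] * h[j] * w[i][j]
             for i in range(len(v)) for j in range(len(h)))
    return e

def partition_function(a, b, w):
    """Z from (5): sum of exp(-E) over every joint binary configuration."""
    return sum(math.exp(-energy(v, h, a, b, w))
               for v in product((0, 1), repeat=len(a))
               for h in product((0, 1), repeat=len(b)))

def prob_visible(v, a, b, w):
    """p(v) from (7), summing over all hidden vectors."""
    unnormalized = sum(math.exp(-energy(v, h, a, b, w))
                       for h in product((0, 1), repeat=len(b)))
    return unnormalized / partition_function(a, b, w)
```

The exponential cost of `partition_function` is the reason the model expectation in (8) has to be approximated for RBMs of realistic size.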
The absence of direct connections between hidden units in an RBM makes it very easy to get an unbiased sample of $\langle v_i h_j \rangle_{\text{data}}$. Given a randomly selected training case, $\mathbf{v}$, the binary state, $h_j$, of each hidden unit, $j$, is set to one with probability

$$p(h_j = 1 \mid \mathbf{v}) = \operatorname{logistic}\Big(b_j + \sum_i v_i w_{ij}\Big), \tag{10}$$

and $v_i h_j$ is then an unbiased sample. The absence of direct connections between visible units in an RBM makes it very easy to get an unbiased sample of the state of a visible unit, given a hidden vector:

$$p(v_i = 1 \mid \mathbf{h}) = \operatorname{logistic}\Big(a_i + \sum_j h_j w_{ij}\Big). \tag{11}$$
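The two conditionals (10) and (11) translate directly into code. In this sketch the function names are illustrative, and `rng` is assumed to be a `random.Random` instance supplied by the caller so that sampling is reproducible.

```python
import math
import random

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, b, w):
    """Equation (10): p(h_j = 1 | v) for every hidden unit j."""
    return [logistic(b[j] + sum(v[i] * w[i][j] for i in range(len(v))))
            for j in range(len(b))]

def visible_probs(h, a, w):
    """Equation (11): p(v_i = 1 | h) for every visible unit i."""
    return [logistic(a[i] + sum(h[j] * w[i][j] for j in range(len(h))))
            for i in range(len(a))]

def sample_binary(probs, rng):
    """Set each unit to one with its given probability, independently."""
    return [1 if rng.random() < p else 0 for p in probs]
```

Because all hidden units are conditionally independent given the visible vector, and vice versa, each layer can be sampled in one parallel step.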
Getting an unbiased sample of $\langle v_i h_j \rangle_{\text{model}}$, however, is much more difficult. It can be done by starting at any random state of the visible units and performing alternating Gibbs sampling for a very long time. Alternating Gibbs sampling consists of updating all of the hidden units in parallel using (10) followed by updating all of the visible units in parallel using (11).

A much faster learning procedure called contrastive divergence (CD) was proposed in [20]. This starts by setting the states of the visible units to a training vector. Then the binary states of the hidden units are all computed in parallel using (10). Once binary states have been chosen for the hidden units, a “reconstruction” is produced by setting each $v_i$ to one with a probability given by (11). Finally, the states of the hidden units are updated again. The change in a weight is then given by

$$\Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \right). \tag{12}$$
A simplified version of the same learning rule that uses the states of individual units instead of pairwise products is used for the biases.
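The contrastive divergence procedure described above can be sketched for a single training vector as follows. This is a minimal illustration, not the authors' implementation: it samples binary states at every step, processes one case at a time rather than a minibatch, and uses illustrative function names; `rng` is assumed to be a `random.Random` instance.

```python
import math
import random

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample(probs, rng):
    """Draw independent binary states, each equal to one with its probability."""
    return [1 if rng.random() < p else 0 for p in probs]

def hidden_probs(v, b, w):
    """Equation (10) for every hidden unit j."""
    return [logistic(b[j] + sum(v[i] * w[i][j] for i in range(len(v))))
            for j in range(len(b))]

def visible_probs(h, a, w):
    """Equation (11) for every visible unit i."""
    return [logistic(a[i] + sum(h[j] * w[i][j] for j in range(len(h))))
            for i in range(len(a))]

def cd_update(v0, a, b, w, learning_rate, rng):
    """One CD update (12) from training vector v0; returns new (a, b, w)."""
    h0 = sample(hidden_probs(v0, b, w), rng)   # hidden states driven by the data
    v1 = sample(visible_probs(h0, a, w), rng)  # the "reconstruction"
    h1 = sample(hidden_probs(v1, b, w), rng)   # hidden states updated again
    new_w = [[w[i][j] + learning_rate * (v0[i] * h0[j] - v1[i] * h1[j])
              for j in range(len(b))] for i in range(len(a))]
    # Bias rule: same form, with single-unit states instead of pairwise products.
    new_a = [a[i] + learning_rate * (v0[i] - v1[i]) for i in range(len(a))]
    new_b = [b[j] + learning_rate * (h0[j] - h1[j]) for j in range(len(b))]
    return new_a, new_b, new_w
```

When the reconstruction and the second hidden vector match the data-driven states exactly, the two terms of (12) cancel and the parameters are left unchanged.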
CD works well even though it is only crudely approximating the gradient of the log probability of the training data [20]. RBMs learn better generative models if more steps of alternating Gibbs sampling are used before collecting the statistics for the second term in the learning rule, but for the purposes of pretraining feature detectors, more alternations are generally of little value and all the results reviewed here were obtained using CD$_1$, which does a single full step of alternating Gibbs sampling after the initial update of the hidden units. To suppress noise in the learning, the real-valued probabilities rather than binary samples are generally used for the reconstructions and the subsequent states of the hidden units, but it is important to use sampled binary values for the first computation of the hidden states because the sampling noise acts as a very effective regularizer that prevents overfitting [21].

MODELING REAL-VALUED DATA
Real-valued data, such as MFCCs, are more naturally modeled by linear variables with Gaussian noise and the RBM energy function can be modified to accommodate such variables, giving a Gaussian–Bernoulli RBM (GRBM):

$$E(\mathbf{v}, \mathbf{h}) = \sum_{i \in \text{vis}} \frac{(v_i - a_i)^2}{2\sigma_i^2} - \sum_{j \in \text{hid}} b_j h_j - \sum_{i,j} \frac{v_i}{\sigma_i} h_j w_{ij}, \tag{13}$$

where $\sigma_i$ is the standard deviation of the Gaussian noise for visible unit $i$.
The two conditional distributions required for CD$_1$ learning are

$$p(h_j \mid \mathbf{v}) = \operatorname{logistic}\Big(b_j + \sum_i \frac{v_i}{\sigma_i} w_{ij}\Big), \tag{14}$$

$$p(v_i \mid \mathbf{h}) = \mathcal{N}\Big(a_i + \sigma_i \sum_j h_j w_{ij},\; \sigma_i^2\Big), \tag{15}$$

where $\mathcal{N}(\mu, \sigma^2)$ is a Gaussian. Learning the standard deviations of a GRBM is problematic for reasons described in [21], so for pretraining using CD$_1$, the data are normalized so that each coefficient has zero mean and unit variance, the standard deviations are set to one when computing $p(\mathbf{v} \mid \mathbf{h})$, and no noise is added to the reconstructions. This avoids the issue of deciding the right noise level.
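The GRBM conditionals (14) and (15), together with the normalization step described above, can be sketched as follows. The function names are illustrative, `standardize` uses the population variance, and it assumes no coefficient is constant across the data (otherwise the division by the standard deviation fails).

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def standardize(data):
    """Rescale each coefficient (column) to zero mean and unit variance."""
    n = len(data)
    columns = list(zip(*data))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((x - m) ** 2 for x in col) / n)
            for col, m in zip(columns, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in data]

def grbm_hidden_probs(v, b, w, sigma):
    """Equation (14): probability that each hidden unit is on, given v."""
    return [logistic(b[j] + sum(v[i] / sigma[i] * w[i][j] for i in range(len(v))))
            for j in range(len(b))]

def grbm_visible_mean(h, a, w, sigma):
    """Mean of the Gaussian in (15), used as a noise-free reconstruction."""
    return [a[i] + sigma[i] * sum(h[j] * w[i][j] for j in range(len(h)))
            for i in range(len(a))]
```

With standardized data, passing `sigma = [1.0] * n_visible` reproduces the setting described in the text, where no noise is added to the reconstruction.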
STACKING RBMs TO MAKE A DEEP BELIEF NETWORK
After training an RBM on the data, the inferred states of the hidden units can be used as data for training another RBM that learns to model the significant dependencies between the hidden units of the first RBM. This can be repeated as many times as desired to produce many layers of nonlinear feature detectors that represent progressively more complex statistical structure in the data. The RBMs in a stack can be combined in a surprising way to produce [22] a single, multilayer generative model called a deep belief net (DBN) (not to be confused with a dynamic Bayesian net, which is a type of directed model of temporal data that unfortunately has the same acronym). Even though each RBM is an undirected model, the DBN formed by the whole stack is a hybrid generative model whose top two layers are undirected (they are the final RBM in the stack) but whose lower layers have top-down, directed connections (see Figure 1).

To understand how RBMs are composed into a DBN, it is helpful to rewrite (7) and to make explicit the dependence on $W$:

$$p(\mathbf{v}; W) = \sum_{\mathbf{h}} p(\mathbf{h}; W)\, p(\mathbf{v} \mid \mathbf{h}; W), \tag{16}$$
where $p(\mathbf{h}; W)$ is defined as in (7) but with the roles of the visible and hidden units reversed. Now it is clear that the model can be improved by holding $p(\mathbf{v} \mid \mathbf{h}; W)$ fixed after training the RBM, but replacing the prior over hidden vectors $p(\mathbf{h}; W)$ by a better prior, i.e., a prior that is closer to the aggregated posterior over hidden vectors that can be sampled by first picking a training case and then inferring a hidden vector using (14). This aggregated posterior is exactly what the next RBM in the stack is trained to model.

As shown in [22], there is a series of variational bounds on the log probability of the training data, and furthermore, each time a new RBM is added to the stack, the variational bound on the new and deeper DBN is better than the previous variational bound, provided the new RBM is initialized and learned in the right way. While the existence of a bound that keeps improving is mathematically reassuring, it does not answer the practical issue, addressed in this article, of whether the learned feature detectors are useful for discrimination on a task that is unknown while training the DBN. Nor does it guarantee that anything improves when we use efficient short-cuts such as CD$_1$ training of the RBMs.
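The greedy stacking procedure described in this section amounts to a short loop in which the hidden activities of one RBM become the training data of the next. In the sketch below, `train_rbm` and `infer_hidden` are hypothetical caller-supplied routines (for example, CD$_1$ training and the inference step in (10) or (14)); they are not defined by the article.

```python
def train_stack(data, hidden_sizes, train_rbm, infer_hidden):
    """Greedily train one RBM per entry of hidden_sizes.

    train_rbm(data, n_hidden) -> params and infer_hidden(params, v) -> hidden
    vector are hypothetical routines supplied by the caller.
    """
    stack = []
    layer_input = data
    for n_hidden in hidden_sizes:
        params = train_rbm(layer_input, n_hidden)
        stack.append(params)
        # Inferred hidden states become the training data for the next RBM.
        layer_input = [infer_hidden(params, v) for v in layer_input]
    return stack
```

Each RBM is trained without any label information, so the same stack can later be fine-tuned for whatever discriminative task is chosen.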
One very nice property of a DBN that distinguishes it from other multilayer, directed, nonlinear generative models is that it is possible to infer the states of the layers of hidden units in a single forward pass. This inference, which is used in deriving the variational bound, is not exactly correct but is fairly accurate. So after learning a DBN by training a stack of RBMs, we can jettison the whole probabilistic framework and simply use the generative weights in the reverse direction as a way of initializing all the feature detecting layers of a deterministic feed-forward DNN. We then just add a final softmax layer and train the whole DNN discriminatively. Unfortunately, a DNN that is pretrained generatively as a DBN is often still called a DBN in the literature. For clarity, we call it a DBN-DNN.
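A minimal sketch of how a DBN-DNN might be assembled from a trained stack is shown below. It assumes each RBM is stored as a pair `(w, b)` of weights `w[i][j]` and hidden biases, and it initializes the added softmax layer with small Gaussian weights of standard deviation `scale`; that initialization and all names here are assumptions for illustration, since the text above does not specify them.

```python
import math
import random

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_dbn_dnn(rbm_stack, n_classes, rng, scale=0.01):
    """Reuse each RBM's weights and hidden biases as a feed-forward layer,
    then append a softmax layer with small random weights (an assumption)."""
    layers = [(w, b) for (w, b) in rbm_stack]
    n_top = len(rbm_stack[-1][1])
    softmax_w = [[rng.gauss(0.0, scale) for _ in range(n_classes)]
                 for _ in range(n_top)]
    softmax_b = [0.0] * n_classes
    return layers, (softmax_w, softmax_b)

def forward(v, layers, softmax_layer):
    """Deterministic forward pass: logistic hidden layers, then softmax."""
    y = v
    for w, b in layers:
        y = [logistic(b[j] + sum(y[i] * w[i][j] for i in range(len(y))))
             for j in range(len(b))]
    sw, sb = softmax_layer
    x = [sb[k] + sum(y[i] * sw[i][k] for i in range(len(y)))
         for k in range(len(sb))]
    m = max(x)
    exps = [math.exp(x_k - m) for x_k in x]
    total = sum(exps)
    return [e / total for e in exps]
```

Discriminative fine-tuning would then adjust every layer, including the pretrained ones, by backpropagating the cross-entropy cost in (3).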