Recurrent Neural Network Based Language Modeling in Meeting Recognition
Stefan Kombrink, Tomáš Mikolov, Martin Karafiát, Lukáš Burget
Speech@FIT, Brno University of Technology, Brno, Czech Republic
{kombrink,imikolov,karafiat,burget }@
Abstract
We use recurrent neural network (RNN) based language models to improve the BUT English meeting recognizer. On the baseline setup using the original language models, we decrease word error rate (WER) by more than 1% absolute through n-best list rescoring and language model adaptation. When n-gram language models are trained on the same moderately sized data set as the RNN models, improvements are higher, yielding a system which performs comparably to the baseline. A noticeable improvement was observed with unsupervised adaptation of RNN models. Furthermore, we examine the influence of word history on WER and show how to speed up rescoring by caching common prefix strings.
Index Terms: automatic speech recognition, language modeling, recurrent neural networks, rescoring, adaptation
1. Introduction
Neural network (NN) based language models as proposed in [12] have been continuously reported to perform well amongst other language modeling techniques. The best results on some smaller tasks were obtained by using recurrent NN-based language models [10], [11]. In RNNs, the feedback between hidden and input layer allows the hidden neurons to remember the history of previously processed words.
Neural networks in language modeling offer several advantages. In contrast to commonly used n-gram language models, smoothing is applied in an implicit way, and due to the projection of the entire vocabulary into a small hidden layer, semantically similar words get clustered. This explains why n-gram counts of data sampled from the distribution defined by NN-based models could lead to better estimates for n-grams which may never have been seen during training: words get substituted by other words which the NN learned to be related. While no such relation could be learned by a standard n-gram model using the original sparse training data, we already showed in [1] how we can incorporate some of the improvements gained by RNN language models into systems using just standard n-gram language models: by generating a large amount of additional training data from the RNN distribution.
The purpose of this paper is to show to what extent the current RNN language model is suitable for mass application in common LVCSR systems. We will show that the promising results of previously conducted experiments on smaller setups [10], [1] generalize to our state-of-the-art meeting recognizer and can in fact be applied in any other ASR system without too much effort. While RNN models effectively complement standard n-grams, they can also be used efficiently, even in systems where speed or memory consumption is an issue.
In the following, we briefly introduce the utilized class-based RNN architecture for language modeling. A system description and details about the language models used follow. Finally, we present our experiments in detail and conclude with a summary of our findings.

[Figure 1: Architecture of the class-based recurrent NN. Inputs: previous word $w_{i-1}$ and previous state $s_{i-1}$; hidden state $s_i$; outputs: class distribution $P(\cdot \mid s_i)$ and within-class word distribution $P_{c_i}(\cdot \mid s_i)$, with $P(w_i \mid w_{i-1}, s_{i-1}) = P(c_i \mid s_i)\, P_{c_i}(w_i \mid s_i)$.]

This work was partly supported by Technology Agency of the Czech Republic grant No. TA01011328, Czech Ministry of Education project No. MSM0021630528, Grant Agency of Czech Republic projects Nos. GP102/09/P635 and 102/08/0707, and by BUT FIT grant No. FIT-11-S-2.
2. Class-based RNN language model
The RNN language model operates as a predictive model for the next word given the previous ones. As in n-gram models, the joint probability of a given sequence of words is factorized into the product of probability estimates of all words $w_i$ conditioned on their history $h_i = w_1 \dots w_{i-1}$:

$$P(w_1 \dots w_n) = \prod_{i=1}^{n} P(w_i \mid h_i) \qquad (1)$$
The utilized RNN architecture is shown in Figure 1. The previous word $w_{i-1}$ is fed to the input of the net using 1-of-n encoding¹ together with the information encoded in the state vector $s_{i-1}$ from processing the previous words. By propagating the input layer we obtain the updated state vector $s_i$, so that we can write:

$$P(w_i \mid h_i) = P_{rnn}(w_i \mid w_{i-1}, s_{i-1}) = P_{rnn}(w_i \mid s_i) \qquad (2)$$
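Equations (1) and (2) can be illustrated with a minimal sketch. The class `TinyRNNLM`, its random untrained weights, the sigmoid hidden activation, and the use of index 0 as a sentence-start symbol are all assumptions made for illustration; this is not the authors' implementation.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


class TinyRNNLM:
    """Toy recurrent LM with random, untrained weights (illustration only)."""

    def __init__(self, vocab_size, hidden_size, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.U = mat(hidden_size, vocab_size)   # input word -> hidden
        self.W = mat(hidden_size, hidden_size)  # previous state -> hidden
        self.V = mat(vocab_size, hidden_size)   # hidden -> output scores
        self.hidden_size = hidden_size

    def initial_state(self):
        return [0.0] * self.hidden_size

    def step(self, prev_word, prev_state):
        """Map (w_{i-1}, s_{i-1}) to the new state s_i and P(. | s_i)."""
        state = []
        for j in range(self.hidden_size):
            # With 1-of-n input, only column prev_word of U contributes.
            a = self.U[j][prev_word] + sum(w * s for w, s in zip(self.W[j], prev_state))
            state.append(1.0 / (1.0 + math.exp(-a)))
        dist = softmax([sum(v * s for v, s in zip(row, state)) for row in self.V])
        return state, dist


def sentence_logprob(model, words, bos=0):
    """Chain rule of Eq. (1): sum of log P(w_i | h_i), with h_i carried in the state."""
    state = model.initial_state()
    prev = bos
    total = 0.0
    for w in words:
        state, dist = model.step(prev, state)
        total += math.log(dist[w])
        prev = w
    return total
```

Because the whole history is summarized in the state vector, scoring a sentence needs only one `step` call per word.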
Usually, the posterior probability of the predicted word is estimated by using a softmax activation function on the output layer, which has the size of the vocabulary. The posterior probability for any given $w_i$ can be read immediately from the corresponding output. Although often just the posterior probability of a particular $w_i$ is required, the entire distribution has to be computed because of the softmax. By assuming that words can be mapped to classes surjectively, we can add a part for estimating the posterior probability of classes to the output layer, and hence estimate the probability for the predicted word as the product of two independent probability distributions, one over classes and the other over words within a class:

$$P_{rnn}(w_i \mid s_i) = P_{rnn}(c_i \mid s_i)\, P^{c_i}_{rnn}(w_i \mid s_i) \qquad (3)$$

This leads to a speed-up both in training and testing because only the distribution over classes and then the distribution over words belonging to the class $c_i$ of the predicted word have to be computed [11].

¹ The input vector has the same dimensionality as the vocabulary size. All inputs are set to zero except the one corresponding to the word, which is set to one.
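The factorization in Eq. (3) can be sketched as follows. The function name, the dictionary-based word-to-class map, and the raw output scores are hypothetical and chosen only to keep the example small; the point is that only two small softmaxes are evaluated and the result is still a proper distribution over the vocabulary.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def factored_word_prob(word, word_class, class_scores, word_scores):
    """P(w | s) = P(c | s) * P_c(w | s), where c is the class of w.

    word_class:   word -> class index (every word has exactly one class)
    class_scores: output-layer scores, one per class
    word_scores:  word -> output-layer score
    """
    c = word_class[word]
    p_class = softmax(class_scores)[c]
    # Only the words belonging to class c enter the second softmax.
    members = [w for w in word_scores if word_class[w] == c]
    p_in_class = softmax([word_scores[w] for w in members])[members.index(word)]
    return p_class * p_in_class


word_class = {"monday": 0, "tuesday": 0, "meeting": 1}
class_scores = [0.3, -0.2]
word_scores = {"monday": 1.0, "tuesday": 0.5, "meeting": 0.0}
total = sum(factored_word_prob(w, word_class, class_scores, word_scores)
            for w in word_class)
# total is 1.0 up to rounding: the factored model is normalized over all words
```

As a rough cost estimate, if the classes were equally sized (an assumption; the class sizes are not stated here), 65k words in 1000 classes would need about 1000 + 65 output evaluations per word instead of 65k.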
3. Setup
3.1. System description
Our state-of-the-art baseline speech recognition system uses acoustic and language models from the AMIDA Rich Transcription 2009 system [9]. Standard speaker adaptation techniques (VTLN and per-speaker CMLLR), fMPE MPE trained acoustic models and NN-bottleneck features [4] with CVN/CMN and HLDA are used. The output of two complementary branches (one based on PLP and the other based on posterior features) served for cross-adapting the system. In both branches, lattices are generated using a 2-gram language model and subsequently expanded up to 4-gram order. The estimated adaptation transformations are used in a lattice rescoring stage, whose lattices finally serve as input to RNN rescoring as performed later in the experiments.
Corpus        Words   RT09   RT11   RNN
Web data      931M    ✓      -      -
Hub4          152M    ✓      33M    -
Fisher 1/2    21M     ✓      ✓      ✓
Swbd/CHE      3.4M    ✓      ✓      ✓
Meetings      2.1M    ✓      ✓      ✓
Total         1.1G    1.1G   60M    26.5M
Table 1: Language models utilized in the LVCSR system
3.2. Language Models
In Table 1 we show the corpora² used for training the baseline language models. RT09 and RT11 were 4-gram models using modified Kneser-Ney smoothing, and shared the same vocabulary of 50k words. RNN was a class-based recurrent network model trained online with 13 iterations of backpropagation through time (BPTT, [3]) and a learning rate of 0.1. It used 500 hidden neurons, 1000 classes and the full vocabulary (without cut-offs, 65k words). Using only a moderately sized subset³ of 26.5M words, one iteration took approximately three days on a single CPU. The rt06seval data set (30k words) served as validation data in model training and combination. In our experiments we report speech recognition results in WER on the NIST rt05seval and rt07seval sets.
² The web data actually consists of four separate data sets described more thoroughly in [8].
³ AMI meetings + Fisher 1/2 + CallHome English + Switchboard
4. Experiments
In our first experiment we kept the existing LVCSR setup and just replaced the old n-gram models by models that additionally used artificial RNN-sampled data. Hence, no RNN language model is required in this system.
4.1. Adding RNN-generated data
Model      PPL    Data                     #n-grams
RT11       82.5   see Table 1              14.4M
VA         81.7   300M words from RNN      35.5M
RT11+VA    76.6   interpolated RT11+VA     46.5M
RT09       72.2   see Table 1              51.2M
RT09+VA    69.2   interpolated RT09+VA     78.6M
Table 2: Interpolated language model perplexities (4-grams)
We sampled 300M additional words from the RNN language model and used this data to create improved n-gram language models. In Table 2 we show an overview of all n-gram model combinations in decreasing order of perplexity (PPL). It can be seen that the LM trained on the RNN data (VA) already performs comparably to the RT11 model. Both models still seem to be complementary: the RT11+VA model is an equally weighted mixture of the RT11 and VA models and shows decreased PPL. Its model size is almost comparable to that of the RT09 model, which uses much more data. When the RNN data is used in combination with the RT09 model (RT09+VA), PPL decreases just slightly whereas the growth in the number of n-grams (to 78.6M) turns out to be huge.
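Why an equally weighted mixture of two complementary models can lower perplexity is easy to see on toy numbers. The sketch below is illustrative only: the per-word probabilities are invented, and `perplexity` and `mix` are hypothetical helper names rather than part of any toolkit used in the paper.

```python
import math


def perplexity(word_probs):
    """PPL = exp(-(1/N) * sum_i log p_i) over the N scored words."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))


def mix(probs_a, probs_b, weight=0.5):
    """Per-word linear interpolation of two models' probabilities."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(probs_a, probs_b)]


# Two toy models that are each confident on a different word.
model_a = [0.4, 0.1]
model_b = [0.1, 0.4]
ppl_a = perplexity(model_a)                  # about 5.0
ppl_b = perplexity(model_b)                  # about 5.0
ppl_mix = perplexity(mix(model_a, model_b))  # about 4.0
```

Each model alone is penalized heavily on the word it models badly, while the mixture never assigns a very low probability, which is the behaviour the RT11+VA row of Table 2 suggests.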
As shown in Table 3, by just replacing the original n-gram model by an improved n-gram model using sampled RNN data, we can keep the original LVCSR setup and yet achieve some improvement. RNN data sampling decreases WER in the case of the smaller RT11 model, but does not work for the RT09 model, which already uses plenty of training data. The RT09+VA model showed no improvement over the RT09 model, which is why we did not use it in the following experiments at all.
Test set     RT11   RT11+VA   RT09   RT09+VA
rt07seval    22.2   21.5      20.3   20.4
rt05seval    19.0   18.5      17.7   17.7
Table 3: WER reduction due to the use of RNN sampled data
4.2. RNN rescoring
Further improvements are obtained by running an RNN rescoring stage. In n-best list rescoring, the RNN model re-estimates a log-likelihood score for each n-best hypothesis $s$:

$$\log L(s) = n \cdot wp + \sum_{i=1}^{n} asc_i + lms \sum_{i=1}^{n} \log P_x(w_i \mid h_i) \qquad (4)$$

where $n$ is the number of words, $wp$ is the word insertion penalty, $asc_i$ is the acoustic score for word $w_i$, $h_i$ is the history $w_1 \dots w_{i-1}$, and $lms$ is the language model scale applied in the generation of the input lattices. $P_x$ is the combined probability estimate of the standard 4-gram and RNN models, which was obtained by linear interpolation:

$$P_x(w_i \mid h_i) = \lambda P_{rnn}(w_i \mid h_i) + (1 - \lambda) P_{ng}(w_i \mid h_i) \qquad (5)$$
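Equations (4) and (5) translate almost directly into code. The following is a minimal sketch, assuming each hypothesis already carries its per-word acoustic scores and per-word RNN and n-gram probabilities; the dictionary layout and the numeric values of `wp`, `lms` and `lam` are invented for illustration.

```python
import math


def interpolate(p_rnn, p_ng, lam):
    """Eq. (5): linear interpolation of RNN and n-gram probabilities."""
    return lam * p_rnn + (1.0 - lam) * p_ng


def hypothesis_score(acoustic_scores, rnn_probs, ngram_probs, wp, lms, lam):
    """Eq. (4): n * wp + sum of acoustic scores + lms * sum of log P_x."""
    n = len(acoustic_scores)
    lm_logprob = sum(math.log(interpolate(pr, pn, lam))
                     for pr, pn in zip(rnn_probs, ngram_probs))
    return n * wp + sum(acoustic_scores) + lms * lm_logprob


def rescore(nbest, wp, lms, lam):
    """Return the hypothesis with the highest re-estimated score."""
    return max(nbest, key=lambda h: hypothesis_score(
        h["asc"], h["p_rnn"], h["p_ng"], wp, lms, lam))


nbest = [
    {"words": ["a", "x"], "asc": [-10.0, -10.0], "p_rnn": [0.1, 0.1], "p_ng": [0.1, 0.1]},
    {"words": ["a", "y"], "asc": [-10.0, -10.0], "p_rnn": [0.5, 0.5], "p_ng": [0.3, 0.3]},
]
best = rescore(nbest, wp=-0.5, lms=12.0, lam=0.5)
# With equal acoustic scores, the hypothesis preferred by the language models wins.
```

Interpolating probabilities rather than log-probabilities, as Eq. (5) prescribes, keeps the combined estimate a valid distribution for any $\lambda$ between 0 and 1.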
rt07seval - 2.25 hours - 4527 utterances
n-gram model   baseline   RNN    Adapt
RT09           20.3       19.6   19.4
RT11+VA        21.5       20.5   20.2
RT11           22.2       20.7   20.4

rt05seval - 2.00 hours - 3130 utterances
n-gram model   baseline   RNN    Adapt
RT09           17.7       16.9   16.6
RT11+VA        18.5       17.4   17.1
RT11           19.0       17.4   17.2
Table 4: Word error rates (WERs) on the rt05seval and rt07seval test sets using RNN rescoring and adaptation
Table 4 shows the n-gram models used in the system and the performance gained by RNN rescoring. The 4-gram lattices constituted the baseline, which was used to extract n-best lists.
The improvement gained in our best system (RT09) is 0.7-0.8% absolute, in the system enhanced by RNN data sampling (RT11+VA) 1.0-1.1%, and in the light-weight RT11 system up to 1.6%.
4.3. LM Adaptation
rt07seval - 8 meetings - 19 speakers
WER    Adaptation
19.6   RNN rescoring, no adaptation
19.7   on entire 1-best using one model
19.4   on 1-best per meeting using 8 models

rt05seval - 10 meetings - 50 speakers
WER    Adaptation
16.9   RNN rescoring, no adaptation
16.8   on entire 1-best using one model
16.6   on 1-best per meeting using 10 models
Table 5: Influence of data ordering in adaptation (RT09 system)

Earlier experiments have already shown potential improvements in case the language models get adapted. The adaptation process for the RNN model is a one-iteration retraining on the 1-best output from the rescored n-best lists. Subsequently, rescoring is performed a second time using the adapted RNN model, and an improved recognition output is obtained. Two criteria were tried to determine the learning rate for RNN adaptation: best PPL on the 1-best output and best PPL on the validation data. The optimal learning rate⁴ in terms of WER was often found in between the estimates using both criteria.
In Table 5 we compare two ways of adaptation using our best system. Adapting only one RNN model on the entire recognition output did not work reliably. But when we adapted one model per meeting and applied it in a second RNN rescoring pass to the respective n-best lists only, we obtained considerable improvements. As shown in Table 4, we decreased WER by a further 0.2-0.3% on all system variants. When the RT11 and RT11+VA models are used, the resulting system performs comparably to the original baseline (RT09) without any RNN post-processing. These models require approximately 18 times less training data than the large RT09 model.
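The per-meeting procedure can be outlined as a small driver loop. This is only a sketch of the control flow: `adapt` and `rescore` are hypothetical callables standing in for one-iteration RNN retraining and n-best rescoring, and the toy versions below use word counts instead of a neural network.

```python
import copy
from collections import Counter


def adapt_per_meeting(base_model, one_best, nbest, adapt, rescore):
    """Adapt one model copy per meeting and rescore only that meeting.

    one_best: meeting id -> list of 1-best word lists (first-pass output)
    nbest:    meeting id -> list of n-best lists, one per utterance
    """
    output = {}
    for meeting, transcript in one_best.items():
        model = copy.deepcopy(base_model)  # every meeting starts from the unadapted model
        adapt(model, transcript)           # single pass over this meeting's 1-best output
        output[meeting] = [rescore(model, hyps) for hyps in nbest[meeting]]
    return output


# Toy stand-ins (assumptions): the "model" is a Counter of word counts.
def toy_adapt(model, transcript):
    for utterance in transcript:
        model.update(utterance)


def toy_rescore(model, hyps):
    return max(hyps, key=lambda hyp: sum(model[w] for w in hyp))


base = Counter()
one_best = {"m1": [["rnn", "model"]], "m2": [["meeting", "room"]]}
nbest = {
    "m1": [[["hour", "muddle"], ["our", "model"]]],
    "m2": [[["meeting", "rum"], ["meeting", "room"]]],
}
second_pass = adapt_per_meeting(base, one_best, nbest, toy_adapt, toy_rescore)
```

Copying the base model for each meeting keeps the adaptations independent, which mirrors the "one model per meeting" rows of Table 5.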
⁴ Since this value is close to the initial learning rate used for RNN training, we suggest using it as a guideline.
4.4. Influence of History
In previous experiments in [10], cache models still complemented the RNN model. This suggested that the simple RNN architecture we are currently using still does not allow learning very long contexts. Hence, we tested the influence of the history length used in RNN rescoring. In general, the state vector conditions the posterior probability estimate of the predicted word on the preceding words. But the history can effectively be "forgotten" by initializing this vector to some random default value. Table 6 shows three different ways of using history in RNN rescoring and their influence on WER:
Full history is used if the entire data set is processed sequentially, so that the state vector can potentially represent the entire history. The state vector is initialized just once at the beginning. For every utterance, all hypotheses are processed by initializing the RNN with the state from the winning hypothesis of the previously processed utterance. The drawback is that the data set cannot easily be processed in parallel.
Binned history is used if the entire data set is split into equally sized bins which are processed independently of each other. The state vector for each bin gets initialized at the beginning of the processing. In our experiments we ran RNN rescoring in parallel and used bins containing as few as 10-20 utterances without noticeable degradation.
Hypothesis history can be seen as binned history with bins containing only one utterance. Although the considered history already comes close to what is used in (high order) n-gram models, it increased WER by only 0.1%. We conclude that the probability estimate of words is almost independent of words in previous sentences.
Test set     Full   Binned   Hypothesis
rt07seval    19.6   19.6     19.7
rt05seval    16.9   16.9     17.0
Table 6: Influence of history on WER (RT09 system)
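The three history modes differ only in when the state vector is re-initialized. A minimal sketch of that schedule follows; the function name and the mode strings are hypothetical labels, not options of any existing tool.

```python
def reset_schedule(num_utterances, mode, bin_size=10):
    """For each utterance, say whether the RNN state is re-initialized before it.

    "full":       reset once, before the first utterance
    "binned":     reset at the start of every bin of bin_size utterances
    "hypothesis": reset before every utterance
    """
    if mode == "full":
        return [i == 0 for i in range(num_utterances)]
    if mode == "binned":
        return [i % bin_size == 0 for i in range(num_utterances)]
    if mode == "hypothesis":
        return [True] * num_utterances
    raise ValueError("unknown history mode: " + mode)


full = reset_schedule(5, "full")                   # reset only before utterance 0
binned = reset_schedule(5, "binned", bin_size=2)   # reset before utterances 0, 2, 4
single = reset_schedule(5, "hypothesis")           # reset before every utterance
```

Every reset marks a point where work can be split across processes, which is why binned history permits parallel rescoring while full history does not.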
4.5. Speeding up rescoring
A well-known technique to speed up NN/RNN training and evaluation is the use of shortlists as done in [5]. Whereas shortlists are known to degrade results, we use the factorization into classes [11], which leads to faster processing and results without large degradation. Another proposed speed-up is the block operation: several words get propagated through the net in a single matrix×matrix operation, which can be performed faster than a sequence of matrix×vector operations. While that method can be applied even to recurrent neural networks, the speed-up is smaller when class factorization is used, because the location of the second softmax output layer is different for every word. Hence, its softmax computation has to be done individually.
Locally caching hypotheses of single utterances
n-best size        10     100    1000
speed-up           2.1    2.6    3.3
avg cache size     55     354    2575
max cache size     438    3336   31392
Table 7: Trade-off between speed-up and cache size when state vectors of common prefix strings on the rt07seval set are cached

Nevertheless, we can still take advantage of the fact that n-best lists contain many hypotheses sharing common prefix strings. We can precompute the set of prefix strings that occur at least twice and cache their corresponding state vectors on-the-fly. By doing so, we obtain a speed-up by a factor of 2-3.
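The prefix caching idea can be sketched without an actual RNN. In the code below, `advance(state, word)` is a hypothetical callable that consumes one word and returns the new state together with that word's log-probability; the toy version used in the example simply records the words seen so far.

```python
from collections import Counter


def shared_prefixes(nbest):
    """Prefixes (as tuples) that occur in at least two hypotheses."""
    counts = Counter()
    for hyp in nbest:
        for k in range(1, len(hyp) + 1):
            counts[tuple(hyp[:k])] += 1
    return {prefix for prefix, c in counts.items() if c >= 2}


def score_nbest(nbest, advance, initial_state):
    """Score all hypotheses of one utterance, reusing cached prefix states."""
    cacheable = shared_prefixes(nbest)
    cache = {}  # prefix -> (state after the prefix, accumulated log-probability)
    calls = 0
    scores = []
    for hyp in nbest:
        # Resume from the longest prefix that is already cached.
        k = len(hyp)
        while k > 0 and tuple(hyp[:k]) not in cache:
            k -= 1
        state, total = cache[tuple(hyp[:k])] if k > 0 else (initial_state, 0.0)
        for i in range(k, len(hyp)):
            state, logprob = advance(state, hyp[i])
            calls += 1
            total += logprob
            prefix = tuple(hyp[:i + 1])
            if prefix in cacheable:
                cache[prefix] = (state, total)
        scores.append(total)
    return scores, calls


# Toy model (assumption): the state is the tuple of words consumed so far.
toy_advance = lambda state, word: (state + (word,), -float(len(state) + 1))
hyps = [["a", "b", "c"], ["a", "b", "d"], ["a", "e"]]
scores, calls = score_nbest(hyps, toy_advance, ())
# 5 network evaluations are needed here instead of 8 without the cache
```

Restricting the cache to prefixes that occur at least twice avoids storing states that would never be reused.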
The cache accumulates state vectors and the posterior probabilities of the sequence of words that has been cached. The size of the cache should be kept small, because a medium-sized RNN model (1250 hidden neurons) already requires 5 KB per cached state. Therefore, it is appropriate to apply the caching just within the hypotheses of each single utterance. That way, the precomputation of the prefix strings only needs to process the hypotheses of a single utterance, which allows running online decoding and online rescoring. Cache size will always be limited by the number of hypotheses and their length. The last history state vector can easily be passed across utterances to optimally preserve word history and obtain full accuracy in rescoring.
Another speed-up by a factor of 2-3 was observed in rescoring when floats were used instead of doubles. In that case, the RNN model also consumed just half the memory. By now, 1000-best rescoring can be done in 0.5×RT even for larger models (hidden layer size of 500, factored by 1000 classes) on a 3 GHz single core using 512 MB memory without loss in accuracy and without the mentioned prefix caching.
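The memory figures can be checked with simple arithmetic. The sketch below assumes that a cached state is just the hidden-layer vector and that floats take 4 bytes and doubles 8 bytes; under those assumptions 1250 hidden neurons give the quoted 5 KB per state. The function names are illustrative.

```python
def state_bytes(hidden_size, bytes_per_value):
    """Memory for one cached state vector."""
    return hidden_size * bytes_per_value


def cache_megabytes(num_states, hidden_size, bytes_per_value=4):
    """Memory for a cache holding num_states state vectors, in MB (10^6 bytes)."""
    return num_states * state_bytes(hidden_size, bytes_per_value) / 1e6


float_state = state_bytes(1250, 4)    # 5000 bytes, i.e. about 5 KB
double_state = state_bytes(1250, 8)   # 10000 bytes, twice as much
```

The factor of two between the last two values matches the observation that the float model consumed half the memory of the double model.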
Consequently, RNN rescoring of 10- or 100-best lists could be used even in light-weight ASR setups, which due to memory and CPU limitations usually work based on n-grams. Our software for training RNN models and rescoring n-best lists can be downloaded from www.fi/~imikolov/rnnlm. An example package to repeat parts of the reported experiment is also available under the given link.
5. Conclusions
We recommend the use of RNN language models as an easy means to improve an existing LVCSR system, either by improving n-gram models using data sampled from an RNN or by performing the proposed rescoring and adaptation postprocessing steps.
Previous experiments in [10] and [1] already showed the advantage of RNN language models on simple ASR systems using limited training data. While RNN training times are still a bottleneck, we showed that improvements can be obtained even in a state-of-the-art ASR system using n-gram language models trained on much more data than the RNN model. If the system uses about 18 times less than the original language modeling data, it still reaches a performance similar to the baseline.
Thus, RNN models are also interesting in cases of low-resource ASR. RNN data sampling is an easier way to increase the amount of training data than retrieving domain-relevant web data; in the case of low-resource languages it may be the only way. The existing system can be improved without the need to change its structure at all; just the n-gram LMs need to be replaced. Improvements may however vanish if RNN rescoring is applied.
It was already pointed out in [2] that continuous space models should adapt better on little data than n-gram models. Unsupervised adaptation of the RNN model on a meeting level provided noticeable improvements in addition to RNN rescoring. We think there might still be room for improvement by using dynamic adaptation or the adaptation of just a subset of all RNN weights.
Still, the current RNN architecture can hardly exploit context longer than a sentence. Further improvement could possibly be obtained by using long short term memory (LSTM) RNNs [7] or temporal kernel [6].
Their fast rescoring process makes class-based RNNs interesting for light-weight and real-time ASR systems. We proposed the caching of common prefix strings as an easy way to get a speed-up by a factor of 2-3. Rescoring could be made even faster by combining block operation and prefix caching or parallelization. Prefix caching without block operation seems suitable for very [...] 10-best) online systems.
6. References
[1] A. Deoras, T. Mikolov, S. Kombrink, M. Karafiát and S. Khudanpur. Variational Approximation of Long-Span Language Models in LVCSR. In IEEE Intl. Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, CZ, May 2011.
[2] M. Afify, O. Siohan, and R. Sarikaya. Gaussian mixture language models for speech recognition. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 4, pages IV–29. IEEE, 2007.
[3] M. Bodén. A guide to recurrent neural networks and backpropagation. In THE DALLAS PROJECT, SICS TECHNICAL REPORT T2002:03, SICS, 2002.
[4] F. Grezl, M. Karafiát and L. Burget. Investigation into bottle-neck features for meeting speech recognition. In Proc. Interspeech 2009, number 9, pages 2947–2950, Brighton, GB, 2009. International Speech Communication Association.
[5] H. Schwenk and J. L. Gauvain. Building continuous space language models for transcribing european languages. pages 737–740, Lisbon, Portugal, 2005.
[6] M. C. Mozer. A focused backpropagation algorithm for temporal pattern recognition, pages 137–169. L. Erlbaum Associates Inc., Hillsdale, NJ, USA, 1995.
[7] S. Hochreiter and J. Schmidhuber. LSTM Can Solve Hard Long Time Lag Problems. In Advances in Neural Information Processing Systems 9, pages 473–479. MIT Press, 1997.
[8] T. Hain, L. Burget, J. Dines, G. Garau, M. Karafiát, D. v. Leeuwen, M. Lincoln and V. Wan. The 2007 AMI(DA) system for meeting transcription. In Proc. Rich Transcription 2007 Spring Meeting Recognition Evaluation Workshop, Baltimore, Maryland USA, May 2007.
[9] T. Hain, L. Burget, J. Dines, N. P. Garner, A. H. El, M. Huijbregts, M. Karafiát, M. Lincoln and V. Wan. The AMIDA 2009 Meeting Transcription System. In Proc. of INTERSPEECH 2010, volume 2010, pages 358–361, Makuhari, Chiba, JP, 2010. International Speech Communication Association.
[10] T. Mikolov, M. Karafiát, L. Burget, J. Černocký and S. Khudanpur. Recurrent neural network based language model. In Proc. of INTERSPEECH 2010, number 9, pages 1045–1048, Makuhari, Chiba, JP, 2010. International Speech Communication Association.
[11] T. Mikolov, S. Kombrink, L. Burget, J. Černocký and S. Khudanpur. Extensions of Recurrent Neural Network Language Models. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, CZ, May 2011.
[12] Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin, T. Hofmann, T. Poggio and J. Shawe-taylor. A neural probabilistic language model. In Journal of Machine Learning Research, volume 3, pages 1137–1155, 2003.