
Dense Prediction on Sequences with Time-Dilated Convolutions for Speech Recognition
Tom Sercu
Multimodal Algorithms and Engines Group, IBM T.J. Watson Research Center, USA
Tom.

Vaibhava Goel
Multimodal Algorithms and Engines Group, IBM T.J. Watson Research Center, USA
vgoel@
Abstract
In computer vision, pixelwise dense prediction is the task of predicting a label for each pixel in the image. Convolutional neural networks achieve good performance on this task, while being computationally efficient. In this paper we carry these ideas over to the problem of assigning a sequence of labels to a set of speech frames, a task commonly known as framewise classification. We show that the dense prediction view of framewise classification offers several advantages and insights, including computational efficiency and the ability to apply batch normalization. When doing dense prediction we pay specific attention to strided pooling in time and introduce an asymmetric dilated convolution, called time-dilated convolution, that allows for an efficient and elegant implementation of pooling in time. We show that by using time-dilated convolutions with a very deep VGG-style CNN with batch normalization, we achieve the best published single-model accuracy result on the Switchboard-2000 benchmark dataset.
1 Introduction
Deep convolutional networks [1] have seen tremendous success both in computer vision [2, 3, 4] and speech recognition [5, 6, 7] over the last years. Many computer vision problems fall into one of two problem types: the first is classification, where a single label is produced per image; the second is dense pixelwise prediction, where a label is produced for each pixel in the image. Examples of dense prediction are semantic segmentation, depth map prediction, optical flow, surface normal prediction, etc. Efficient convolutional architectures make it possible to produce a full image-sized output, rather than predicting the value for each pixel separately from a small patch centered around the pixel. In this paper we argue that we should look at acoustic modeling in speech as a dense prediction task on sequences. This is in contrast to the usual viewpoint of "framewise classification", which reflects the cross-entropy training stage, where a context window is used as input and the network predicts only for the center frame. However, during all other stages, we want the acoustic model to be applied to a sequence and to produce a sequence of predictions. This is the case during sequence training, at test time, or in an end-to-end training setting. Similar to convolutional architectures for dense prediction in computer vision, we focus our efforts on convolutional architectures that process an utterance at once and produce a sequence of labels as output, rather than "splicing" up the utterance and labeling each frame independently from a small window around it.
There are four main advantages to convolutional architectures that allow efficient evaluation of full utterances (without the need for splicing) in this dense prediction viewpoint:
• Computational efficiency: processing a spliced utterance requires window_size times more floating point operations (see the numpy sketch after this list).
• Batch normalization can easily be adopted during sequence training (or end-to-end training), which we will show gives strong improvements (as outlined in [8]).
• Strided pooling in time becomes possible. In the next two sections, we will adopt a recent technique from dense prediction, named dilated convolutions, for CNN acoustic models to enable strided pooling in time. Experiments and results for this new model are in section 4.
• We will show a unifying viewpoint with stacked bottleneck networks, and discuss the relevance for end-to-end models with convolutional layers in section 5.
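To make the computational-efficiency point concrete, here is a minimal numpy sketch (ours, not the paper's code) of a toy one-layer convolutional model evaluated both ways: splicing recomputes, at every window position, activations that the neighboring windows already computed, while the dense pass computes each activation exactly once.

```python
import numpy as np

rng = np.random.default_rng(0)
utt_len, ctx = 100, 11             # output frames; context window per frame
kernel = rng.standard_normal(3)    # one conv layer, kernel size 3, no padding
x = rng.standard_normal(utt_len + ctx - 1)   # padded input utterance

def conv1d_valid(sig, k):
    """Plain 'valid' 1-D convolution (cross-correlation, CNN convention)."""
    return np.array([sig[i:i + len(k)] @ k for i in range(len(sig) - len(k) + 1)])

# Spliced evaluation: one forward pass per window, utt_len passes in total.
spliced = np.stack([conv1d_valid(x[t:t + ctx], kernel) for t in range(utt_len)])

# Dense evaluation: a single forward pass over the whole utterance.
dense = conv1d_valid(x, kernel)

# The windows only recompute shifted copies of the dense activations:
assert np.allclose(spliced[:, 0], dense[:utt_len])
```

For this toy layer, splicing costs utt_len × (ctx − 2) kernel applications against utt_len + ctx − 3 for the dense pass, i.e. roughly the window_size factor claimed above.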
2 Related work: Pooling in CNNs for dense prediction on images
Pooling with stride is an essential ingredient of any classification CNN, allowing access to more context on higher feature maps, while reducing the spatial resolution before it is absorbed into the fully connected layers. However, for dense pixelwise prediction tasks, it is less straightforward how to deal with downsampling: on the one hand, downsampling allows for a "global view" by having large receptive fields at low resolution; on the other hand, we also need detail on a small scale, i.e. we need the high-resolution information.
To incorporate both global and local information, downsampling pooling has been incorporated in dense prediction networks in several ways. Firstly, many methods involve upsampling lower-resolution feature maps, usually combined with some higher-resolution feature maps. In [9], an image is processed at three different scales with three different CNNs, after which the output feature maps are merged. The Fully Convolutional Networks (FCNs) from [10] use a VGG classification network as basis, introducing skip connections to merge hi-res lower layers with upsampled low-res layers from deeper in the network. SegNet [11] uses an encoder-decoder structure, in which upsampling is done with max-unpooling [12], i.e. by remembering the max locations of the encoder's pooling layers. A second way of using CNNs with strided pooling for dense prediction was proposed in [3]: at every pooling layer with stride s×s, the input is duplicated s×s times, but shifted with offset (∆x, ∆y) ∈ [0 . . . s−1] × [0 . . . s−1]. After the convolutional stages, the output is then interleaved to recover the full resolution. A third way (which we will use) is called spatial dilated convolutions, which keeps the feature maps in their original resolution. The idea is to replace the pooling with stride s by pooling with stride 1, then dilate all convolutions with a factor s, meaning that (s−1)/s of the values get skipped. This was called filter rarefaction in [10], introduced as "d-regularly sparse kernels" in [13], and dubbed spatial dilated convolutions in [14]. It was noted [3, 10] that this method is equivalent to shift-and-interleave, though more intuitive. The recent WaveNet work [15] uses dilated convolutions for a generative model of audio.
3 Time-dilated convolutions
Previous work on CNNs for acoustic modeling [5, 6] eliminated the possibility of strided pooling in time because of the downsampling effect. Recent work [7, 8] shows a significant performance boost from pooling in time during cross-entropy training; however, sequence training is then prohibitively expensive, since an utterance has to be spliced into uttLen independent windows. By adapting the notion of dense prediction, we propose to allow pooling in time while maintaining efficient full-utterance processing, by using an asymmetric version of spatial dilated convolution with dilation in the time direction but not in the frequency direction, which we appropriately call time-dilated convolution.
Figure 1: Example of a simple CNN (1 conv, 1 pool, 1 conv layer). (a) Original CNN (XE). (b) Sequence: problem. (c) Solution. Pooling with stride 2 is replaced by pooling with stride 1, while consecutive convolutions are dilated with a factor 2.
The problem with strided pooling in time is that the length of the output sequence is shorter than the length of the input sequence by a factor 2^p, assuming p pooling layers with stride 2. For recurrent end-to-end networks, typically a factor 4 size reduction is accepted [16, 17], which limits the number of pooling layers to 2, while in the hybrid NN/HMM framework, pooling is not acceptable. Essentially we need a way to do strided pooling in time, while keeping the resolution. We tackle this problem with a 1-D version of sparse kernels [13], or equivalently spatial dilated convolutions [14]. Consider the simple toy CNN (conv3, pool2-s2, conv3) in Figure 1(a), which takes in a context window of 8 frames and produces a single output. Let's consider applying this CNN to a full utterance of length 10 (padded to length 16), as in Figure 1(b). The top row of blue outputs is downsampled with factor 2 because of the strided pooling, so the output sequence length does not match the number of targets (i.e. the input size). The solution to this problem is visualized in Figure 1(c). First, we pool without stride, which preserves the resolution after pooling. However, now our consecutive convolutional layer needs to be modified; specifically, the kernel has to skip every other value, in order to ignore the new (dark blue) values which came in between the values. This is dilation (or sparsification) of the kernel with a factor 2 in the time direction. Formally, a 1-D discrete convolution ∗_l with dilation l, which convolves signal F with a kernel k of size r, is defined as

(F ∗_l k)(p) = Σ_{s + lt = p} F(s) k(t),   t ∈ [−r, r].
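The definition above translates directly into code. The following numpy sketch (ours, with an arbitrary toy signal and kernel) transcribes the formula and shows that dilation 1 recovers the ordinary convolution, while dilation 2 makes the kernel skip every other value, as in Figure 1(c):

```python
import numpy as np

def dilated_conv1d(F, k, l):
    """1-D convolution of signal F with kernel k indexed by t in [-r, r],
    dilation l: out(p) = sum_{s + l*t = p} F(s) k(t), i.e. F[p - l*t] * k[t].

    Output positions are restricted so every tap falls inside the signal.
    """
    r = (len(k) - 1) // 2
    n = len(F)
    return np.array([
        sum(F[p - l * t] * k[t + r] for t in range(-r, r + 1))
        for p in range(l * r, n - l * r)
    ])

F = np.arange(16, dtype=float)
k = np.array([1.0, 0.0, -1.0])       # r = 1

print(dilated_conv1d(F, k, 1))       # taps at p-1, p, p+1: ordinary conv
print(dilated_conv1d(F, k, 2))       # taps at p-2, p, p+2: skips every other value
```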
In general, the procedure to change a CNN with time-pooling from the cross-entropy training (classification) stage to the dense prediction stage for sequence training and testing is as follows. Change pooling layers from stride s to stride 1, and multiply the dilation factor of all following convolutions by s. After this, any convolution coming after p pooling layers with original stride s will have dilation factor s^p. Fully connected layers are equivalent to, and can be trivially replaced by, convolutional layers with 1×1 kernels (except the first one, whose kernel size matches the output of the conv stack before being flattened for the fully connected layers). This dilating procedure is how a VGG classification network is adapted for semantic segmentation [13, 14].
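As an illustration of this procedure (a sketch under our own assumptions, not the paper's code), the following PyTorch function applies exactly these two changes; it assumes the model is an nn.Sequential over (batch, fmaps, freq, time) tensors whose module order matches the forward pass, and that the fully connected layers have already been rewritten as 1×1 convolutions:

```python
import torch.nn as nn

def to_dense_prediction(model):
    """Convert an XE-trained CNN for dense full-utterance prediction:
    set every pooling layer's time-stride to 1, and give every following
    convolution the product of the time-strides seen so far as dilation."""
    dilation_t = 1                   # product of original time-strides so far
    for m in model.modules():
        if isinstance(m, nn.MaxPool2d):
            s = m.stride if isinstance(m.stride, tuple) else (m.stride,) * 2
            m.stride = (s[0], 1)     # stride 1 in time: keep the resolution
            dilation_t *= s[1]       # ... and dilate later convs instead
        elif isinstance(m, nn.Conv2d):
            m.dilation = (m.dilation[0], dilation_t)  # after p pools: s**p
    return model
```

Since pooling strides and convolution dilations are plain attributes read at forward time, the trained weights are untouched; only the way they are applied to the utterance changes.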
Using time-dilated convolutions, the feature maps and output can keep the full resolution of the input, while pooling with stride. With pooling, the receptive field in time of the CNN can be larger than that of the same network without pooling. This allows combining the performance gains of pooling [7] with the computational efficiency and the ability to apply batch normalization [8].
4 Experiments and results
We trained a VGG-style CNN [4] in the hybrid NN/HMM setting on the 2000h Switchboard+Fisher dataset. The architecture and training method are similar to our earlier papers [7, 8], and are based on the setup described in [21]. Our input features are VTLN-warped logmel with ∆, ∆∆; the outputs are 32k tied CD states from forced alignment. Table 1 fully specifies the CNN when training on windows and predicting the center frame.

| Layer        | Output: fmaps × f × T |
|--------------|-----------------------|
| Input window | 3 × 64 × 48           |
| conv 7×7     | 64 × 64 × 42          |
| pool 2×1     | 64 × 32 × 42          |
| conv 3×3     | 64 × 32 × 40          |
| conv 3×3     | 64 × 32 × 38          |
| conv 3×3     | 64 × 32 × 36          |
| pool 2×1     | 64 × 16 × 36          |
| conv 3×3     | 128 × 16 × 34         |
| conv 3×3     | 128 × 16 × 32         |
| conv 3×3     | 128 × 16 × 30         |
| pool 2×1     | 128 × 8 × 30          |
| conv 3×3     | 256 × 8 × 28          |
| conv 3×3     | 256 × 8 × 26          |
| conv 3×3     | 256 × 8 × 24          |
| pool 2×2     | 256 × 4 × 12          |
| conv 3×3     | 512 × 4 × 10          |
| conv 3×3     | 512 × 4 × 8           |
| conv 3×3     | 512 × 4 × 6           |
| pool 2×2     | 512 × 2 × 3           |
| 3× FC        | 2048                  |
| FC           | 1024                  |
| FC           | 32000                 |

Table 1: CNN architecture.

| Model                                  | SWB XE | SWB ST | CH XE | CH ST |
|----------------------------------------|--------|--------|-------|-------|
| Classic 512 CNN [18]                   | 12.6   | 10.4   |       |       |
| IBM 2016 RNN+VGG+LSTM [19]             |        | 8.6†   |       | 14.4† |
| MSR 2016 ResNet* [20]                  |        | 8.9    |       |       |
| MSR 2016 LACE* [20]                    |        | 8.6    |       |       |
| MSR 2016 BLSTM* [20]                   |        | 8.7    |       |       |
| VGG (pool, inefficient) [19]           | 10.2   | 9.4    | 16.3  | 16.0  |
| VGG (no pool) [8]                      | 10.8   | 9.7    | 17.1  | 16.7  |
| VGG-10 + BN (no pool) [8]              | 10.8   | 9.5    | 17.0  | 16.3  |
| VGG-13 + BN (no pool)                  | 10.3   | 9.0    | 16.5  | 16.4  |
| VGG-13 + BN + pool                     | 9.5    | 8.5    | 15.1  | 15.4  |
| VGG-13 + BN + pool (uncouple CH acwt)  |        |        | 14.8  | 15.2  |

Table 2: Results with small LM (4M n-grams).

| Model                                | SWB  | CH    |
|--------------------------------------|------|-------|
| IBM 2015 DNN+RNN+CNN [21]            | 8.8† | 15.3† |
| IBM 2016 RNN+VGG+LSTM [19]           | 7.6† | 13.7† |
| MSR 2016 ResNet [20]                 | 8.6  | 14.8  |
| MSR 2016 LACE [20]                   | 8.3  | 14.8  |
| MSR 2016 BLSTM [20]                  | 8.7  | 16.2  |
| VGG + BN + pool                      | 7.7  | 14.5  |
| VGG + BN + pool (uncouple CH acwt)   |      | 14.4  |

Table 3: Results with big LM (36M n-grams).
Corresponding to the observations in [8], we do not pad in time, though we do pad in the frequency direction. Training followed the standard two-stage scheme: first 1600M frames of cross-entropy training (XE), followed by 310M frames of sequence training (ST). XE training was done with SGD with Nesterov acceleration, with the learning rate decaying from 0.03 to 9e−4 over 600M frames. We use the data balancing from [7] with exponent γ = 0.8. We report results on Hub5'00 (SWB and CH parts) after decoding with the standard small 4M n-gram language model and a 30.5k word vocabulary. We saw a slight improvement in results when decoding with an exponent on the prior γ lower than the one used during training. As mentioned in section 3, we use batch normalization in our network, where the mean and variance statistics are accumulated over both the feature maps and the frequency direction. The selection of models, decoding prior and acoustic weight was done by decoding on rt02 as held-out set.
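For concreteness, here is a PyTorch sketch of the convolutional stack as we read Table 1 (hypothetical code, not IBM's implementation; the three FC 2048 layers and the FC 1024 and FC 32000 layers are omitted): 3×3 convolutions padded in frequency but not in time, with batch normalization, and pooling 2×1 three times followed by 2×2 twice.

```python
import torch
import torch.nn as nn

def conv(c_in, c_out, k=3):
    # pad (k-1)//2 in frequency, 0 in time: time shrinks by k-1 per conv
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=((k - 1) // 2, 0)),
        nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

vgg13 = nn.Sequential(
    conv(3, 64, k=7), nn.MaxPool2d((2, 1)),
    conv(64, 64), conv(64, 64), conv(64, 64), nn.MaxPool2d((2, 1)),
    conv(64, 128), conv(128, 128), conv(128, 128), nn.MaxPool2d((2, 1)),
    conv(128, 256), conv(256, 256), conv(256, 256), nn.MaxPool2d((2, 2)),
    conv(256, 512), conv(512, 512), conv(512, 512), nn.MaxPool2d((2, 2)),
)

x = torch.randn(1, 3, 64, 48)   # logmel + deltas, 64 mel bins, 48 frames
print(vgg13(x).shape)           # torch.Size([1, 512, 2, 3]), as in Table 1
```

Feeding a 3 × 64 × 48 window through this stack reproduces the output shapes of every row of Table 1, ending at 512 × 2 × 3 before the fully connected layers.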
The results after XE and ST are presented in Tables 2 and 3. Baselines marked with * are from personal communication with the authors. Baselines marked with † are system combinations. Note that the baselines from [20] use slightly smaller LMs: 3.4M n-grams for the small LM (Table 2) and 16M n-grams for the big LM (Table 3). With n-gram decoding, this result is to our knowledge the best published single-model result.
5 Relation to other models
Stacked bottleneck networks (SBN) [22, 23, 24], or hierarchical bottleneck networks [25], are an influential acoustic model in hybrid NN/HMM speech recognition. SBNs are typically seen as two consecutive DNNs, each stage separately trained discriminatively with a bottleneck (small hidden layer). The first DNN sees the input features, while the second DNN gets the bottleneck features from the first DNN as input. Typically, the second DNN gets 5 bottleneck features from positions {−10, −5, 0, 5, 10} relative to the center [24]. In [23], it was pointed out that this SBN is convolutional and that one can backpropagate through both stages together.
In fact, this multi-stage SBN architecture is a special case of a CNN with time-dilated convolution. Specifically, the first DNN is equivalent to a CNN with a large first kernel followed by all 1×1 kernels. The second DNN is exactly equivalent to a CNN whose first kernel has size 5 and dilation factor 5 in the time direction. The layers after the bottleneck in the first DNN form an auxiliary classifier. This realization prompts a number of directions in which SBNs can be extended. Firstly, by avoiding the large kernel in the first convolutional layer, it is possible to keep time and frequency structure in the internal representations of later layers, enabling increased depth. Secondly, rather than increasing the time-dilation factor to 5 at once, it seems more natural to gradually increase the time-dilation factor throughout the depth of the network.
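The equivalence is easy to verify numerically. In this PyTorch sketch (our illustration; bn_dim and hid are made-up sizes), the first layer of the second-stage DNN, which consumes 5 concatenated bottleneck vectors from offsets {−10, −5, 0, 5, 10}, is copied into a Conv1d with kernel size 5 and dilation 5, and both produce the same output for the same center frame:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn_dim, hid = 80, 1024                 # made-up sizes for illustration
second_dnn_layer1 = nn.Linear(5 * bn_dim, hid)

# The same weights as a kernel-5, dilation-5 convolution over time:
as_conv = nn.Conv1d(bn_dim, hid, kernel_size=5, dilation=5)
with torch.no_grad():
    # Linear weight (hid, 5*bn_dim) -> Conv1d weight (hid, bn_dim, 5)
    as_conv.weight.copy_(
        second_dnn_layer1.weight.view(hid, 5, bn_dim).permute(0, 2, 1))
    as_conv.bias.copy_(second_dnn_layer1.bias)

feats = torch.randn(1, bn_dim, 100)    # first-stage bottleneck sequence
center = 50
window = torch.cat([feats[0, :, center + off] for off in (-10, -5, 0, 5, 10)])

dnn_out = second_dnn_layer1(window)            # DNN on the spliced window
conv_out = as_conv(feats)[0, :, center - 10]   # 'valid' conv shifts the index by 10
print(torch.allclose(dnn_out, conv_out, atol=1e-5))   # True
```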
Convolutional networks are also used in end-to-end models for speech recognition. Both the CLDNN architecture [17] and Deep Speech 2 (DS2) [16] combine a convolutional network as a first stage with LSTM and fully connected (DNN) output layers. In Wav2Letter [26], a competitive end-to-end model is presented which is fully convolutional. Both DS2 and Wav2Letter do a certain amount of downsampling through pooling or striding, which can be accepted when training with a CTC (or AutoSeg [26]) criterion, since these do not require the output to be the same length as the input. However, DS2 does report a degradation on English, which they work around using grapheme bigram targets. The time-dilated convolutions we introduced could improve these end-to-end models in two ways. Either one could allow the same amount of pooling while keeping a higher output resolution, which could eliminate the need for the bigram targets. Alternatively, one could keep the same resolution but expand the receptive field by adding more time-dilated convolution layers, which gives access to a broader context in the CNN layers. In conclusion, this work is relevant both to end-to-end models and to hybrid HMM/NN models.
6 Conclusion
We drew the parallel between dense prediction in computer vision and framewise sequence labeling, both in the HMM/NN and end-to-end settings. This provided us with the necessary tool (time-dilated convolutions) to adopt pooling in time in CNN acoustic models, while maintaining efficient processing and batch normalization on full utterances. We showed results on Hub5'00, where we brought the WER down from 9.4% in previous work to 8.5%, a 10% relative improvement. With a big (36M n-gram) language model, we achieve 7.7% WER, the best single-model performance reported so far.
References
[1] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[3] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun, "OverFeat: Integrated recognition, localization and detection using convolutional networks," arXiv:1312.6229, 2013.
[4] Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv:1409.1556, 2014.
[5] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, and Gerald Penn, "Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition," in Proc. ICASSP, 2012.
[6] Tara N. Sainath, Abdel-rahman Mohamed, Brian Kingsbury, and Bhuvana Ramabhadran, "Deep convolutional neural networks for LVCSR," in Proc. ICASSP, 2013.
[7] Tom Sercu, Christian Puhrsch, Brian Kingsbury, and Yann LeCun, "Very deep multilingual convolutional neural networks for LVCSR," Proc. ICASSP, 2016.
[8] Tom Sercu and Vaibhava Goel, "Advances in very deep convolutional neural networks for LVCSR," Proc. Interspeech, 2016.
[9] Clément Farabet, Camille Couprie, Laurent Najman, and Yann LeCun, "Scene parsing with multiscale feature learning, purity trees, and optimal covers," Proc. ICML, 2012.
[10] Jonathan Long, Evan Shelhamer, and Trevor Darrell, "Fully convolutional networks for semantic segmentation," CVPR, 2015.
[11] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," arXiv:1511.00561, 2015.
[12] Matthew D. Zeiler, Graham W. Taylor, and Rob Fergus, "Adaptive deconvolutional networks for mid and high level feature learning," in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 2018–2025.
[13] Hongsheng Li, Rui Zhao, and Xiaogang Wang, "Highly efficient forward and backward propagation of convolutional neural networks for pixelwise classification," arXiv:1412.4526, 2014.
[14] Fisher Yu and Vladlen Koltun, "Multi-scale context aggregation by dilated convolutions," Proc. ICLR, 2016.
[15] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv:1609.03499, 2016.
[16] Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," arXiv:1512.02595, 2015.
[17] Tara N. Sainath, Oriol Vinyals, Andrew Senior, and Haşim Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in Proc. ICASSP, 2015.
[18] Hagen Soltau, George Saon, and Tara N. Sainath, "Joint training of convolutional and non-convolutional neural networks," in Proc. ICASSP, 2014.
[19] George Saon, Tom Sercu, Steven Rennie, and Hong-Kwang J. Kuo, "The IBM 2016 English conversational telephone speech recognition system," Proc. Interspeech, 2016.
[20] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, "The Microsoft 2016 conversational speech recognition system," arXiv:1609.03528, 2016.
[21] George Saon, Hong-Kwang J. Kuo, Steven Rennie, and Michael Picheny, "The IBM 2015 English conversational telephone speech recognition system," Proc. Interspeech, 2015.
[22] František Grézl, Martin Karafiát, and Lukáš Burget, "Investigation into bottle-neck features for meeting speech recognition," in Proc. Interspeech, 2009.
[23] Karel Veselý, Martin Karafiát, and František Grézl, "Convolutive bottleneck network features for LVCSR," in ASRU, 2011.
[24] František Grézl, Martin Karafiát, and Karel Veselý, "Adaptation of multilingual stacked bottle-neck neural network structure for new language," in Proc. ICASSP, 2014.
[25] Christian Plahl, Ralf Schlüter, and Hermann Ney, "Hierarchical bottle neck features for LVCSR," in Interspeech, 2010, pp. 1197–1200.
[26] Ronan Collobert, Christian Puhrsch, and Gabriel Synnaeve, "Wav2Letter: an end-to-end ConvNet-based speech recognition system," arXiv:1609.03193, 2016.
