
Dense Prediction on Sequences with Time-Dilated Convolutions for Speech Recognition
Tom Sercu
Multimodal Algorithms and Engines Group, IBM T.J. Watson Research Center, USA
Tom.

Vaibhava Goel
Multimodal Algorithms and Engines Group, IBM T.J. Watson Research Center, USA
vgoel@
Abstract
In computer vision, pixelwise dense prediction is the task of predicting a label for each pixel in the image. Convolutional neural networks achieve good performance on this task, while being computationally efficient. In this paper we carry these ideas over to the problem of assigning a sequence of labels to a set of speech frames, a task commonly known as framewise classification. We show that the dense prediction view of framewise classification offers several advantages and insights, including computational efficiency and the ability to apply batch normalization. When doing dense prediction we pay specific attention to strided pooling in time and introduce an asymmetric dilated convolution, called time-dilated convolution, that allows for an efficient and elegant implementation of pooling in time. We show that by using time-dilated convolutions with a very deep VGG-style CNN with batch normalization, we achieve the best published single-model accuracy result on the Switchboard-2000 benchmark dataset.
1 Introduction
Deep convolutional networks [1] have seen tremendous success both in computer vision [2, 3, 4] and speech recognition [5, 6, 7] over the last years. Many computer vision problems fall into one of two problem types: the first is classification, where a single label is produced per image; the second is dense pixelwise prediction, where a label is produced for each pixel in the image. Examples of dense prediction are semantic segmentation, depth map prediction, optical flow, surface normal prediction, etc. Efficient convolutional architectures make it possible to produce a full image-sized output, rather than predicting the value for each pixel separately from a small patch centered around the pixel. In this paper we argue that we should look at acoustic modeling in speech as a dense prediction task on sequences. This is in contrast to the usual viewpoint of "framewise classification", which reflects the cross-entropy training stage, where a context window is used as input and the network predicts only for the center frame. However, during all other stages, we want the acoustic model to be applied to a sequence and to produce a sequence of predictions. This is the case during sequence training, at test time, or in an end-to-end training setting. Similar to convolutional architectures for dense prediction in computer vision, we focus our efforts on convolutional architectures that process an utterance at once and produce a sequence of labels as output, rather than "splicing" up the utterance and labeling each frame independently from a small window around it.
There are four main advantages to convolutional architectures that allow efficient evaluation of full utterances (without the need for splicing) in this dense prediction viewpoint:
• Computational efficiency: processing a spliced utterance requires window_size times more floating point operations (see the numpy sketch after this list).
• Batch normalization can easily be adopted during sequence training (or end-to-end training), which we will show gives strong improvements (as outlined in [8]).
• Strided pooling in time becomes possible. In the next two sections, we will adopt a recent technique from dense prediction, named dilated convolutions, for CNN acoustic models to enable strided pooling in time. Experiments and results for this new model are in section 4.
• We will show a unifying viewpoint with stacked bottleneck networks, and discuss the relevance for end-to-end models with convolutional layers in section 5.
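To make the computational-efficiency point concrete, here is a minimal numpy sketch (ours, not the paper's code) of a toy one-layer convolutional model evaluated both ways: splicing recomputes, at every window position, activations that the neighboring windows already computed, while the dense pass computes each activation exactly once.

```python
import numpy as np

rng = np.random.default_rng(0)
utt_len, ctx = 100, 11             # output frames; context window per frame
kernel = rng.standard_normal(3)    # one conv layer, kernel size 3, no padding
x = rng.standard_normal(utt_len + ctx - 1)   # padded input utterance

def conv1d_valid(sig, k):
    """Plain 'valid' 1-D convolution (cross-correlation, CNN convention)."""
    return np.array([sig[i:i + len(k)] @ k for i in range(len(sig) - len(k) + 1)])

# Spliced evaluation: one forward pass per window, utt_len passes in total.
spliced = np.stack([conv1d_valid(x[t:t + ctx], kernel) for t in range(utt_len)])

# Dense evaluation: a single forward pass over the whole utterance.
dense = conv1d_valid(x, kernel)

# The windows only recompute shifted copies of the dense activations:
assert np.allclose(spliced[:, 0], dense[:utt_len])
```

For this toy layer, splicing costs utt_len × (ctx − 2) kernel applications against utt_len + ctx − 3 for the dense pass, i.e. roughly the window_size factor claimed above.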
2 Related work: Pooling in CNNs for dense prediction on images
Pooling with stride is an essential ingredient of any classification CNN, allowing access to more context on higher feature maps, while reducing the spatial resolution before it is absorbed into the fully connected layers. However, for dense pixelwise prediction tasks, it is less straightforward how to deal with downsampling: on the one hand, downsampling allows for a "global view" by having large receptive fields at low resolution; on the other hand, we also need detail on a small scale, i.e. we need the high-resolution information.
To incorporate both global and local information, downsampling pooling has been incorporated in dense prediction networks in several ways. Firstly, many methods involve upsampling lower-resolution feature maps, usually combined with some higher-resolution feature maps. In [9], an image is processed at three different scales with three different CNNs, after which the output feature maps are merged. The Fully Convolutional Networks (FCNs) from [10] use a VGG classification network as basis, introducing skip connections to merge hi-res lower layers with upsampled low-res layers from deeper in the network. SegNet [11] uses an encoder-decoder structure, in which upsampling is done with max-unpooling [12], i.e. by remembering the max locations of the encoder's pooling layers. A second way of using CNNs with strided pooling for dense prediction was proposed in [3]: at every pooling layer with stride s×s, the input is duplicated s×s times, but shifted with offset (∆x, ∆y) ∈ [0 . . . s−1] × [0 . . . s−1]. After the convolutional stages, the output is then interleaved to recover the full resolution. A third way (which we will use) is called spatial dilated convolutions, which keeps the feature maps in their original resolution. The idea is to replace the pooling with stride s by pooling with stride 1, then dilate all convolutions with a factor s, meaning that (s−1)/s of the values get skipped. This was called filter rarefaction in [10], introduced as "d-regularly sparse kernels" in [13], and dubbed spatial dilated convolutions in [14]. It was noted [3, 10] that this method is equivalent to shift-and-interleave, though more intuitive. The recent WaveNet work [15] uses dilated convolutions for a generative model of audio.
3 Time-dilated convolutions
Previous work on CNNs for acoustic modeling [5, 6] eliminated the possibility of strided pooling in time because of the downsampling effect. Recent work [7, 8] shows a significant performance boost from pooling in time during cross-entropy training; however, sequence training is then prohibitively expensive, since an utterance has to be spliced into uttLen independent windows. By adapting the notion of dense prediction, we propose to allow pooling in time while maintaining efficient full-utterance processing, by using an asymmetric version of spatial dilated convolution with dilation in the time direction but not in the frequency direction, which we appropriately call time-dilated convolution.
Figure 1: Example of a simple CNN (1 conv, 1 pool, 1 conv layer). (a) Original CNN (XE). (b) Sequence: problem. (c) Solution. Pooling with stride 2 is replaced by pooling with stride 1, while consecutive convolutions are dilated with a factor 2.
The problem with strided pooling in time is that the length of the output sequence is shorter than the length of the input sequence by a factor 2^p, assuming p pooling layers with stride 2. For recurrent end-to-end networks, typically a factor 4 size reduction is accepted [16, 17], which limits the number of pooling layers to 2, while in the hybrid NN/HMM framework, pooling is not acceptable. Essentially we need a way to do strided pooling in time, while keeping the resolution. We tackle this problem with a 1-D version of sparse kernels [13], or equivalently spatial dilated convolutions [14]. Consider the simple toy CNN (conv3, pool2-s2, conv3) in Figure 1(a), which takes in a context window of 8 frames and produces a single output. Let's consider applying this CNN to a full utterance of length 10 (padded to length 16), as in Figure 1(b). The top row of blue outputs is downsampled with factor 2 because of the strided pooling, so the output sequence length does not match the number of targets (i.e. the input size). The solution to this problem is visualized in Figure 1(c). First, we pool without stride, which preserves the resolution after pooling. However, now our consecutive convolutional layer needs to be modified; specifically, the kernel has to skip every other value, in order to ignore the new (dark blue) values which came in between the values. This is dilation (or sparsification) of the kernel with a factor 2 in the time direction. Formally, a 1-D discrete convolution ∗_l with dilation l, which convolves signal F with a kernel k of size r, is defined as

(F ∗_l k)(p) = Σ_{s + lt = p} F(s) k(t),   t ∈ [−r, r].
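The definition above translates directly into code. The following numpy sketch (ours, with an arbitrary toy signal and kernel) transcribes the formula and shows that dilation 1 recovers the ordinary convolution, while dilation 2 makes the kernel skip every other value, as in Figure 1(c):

```python
import numpy as np

def dilated_conv1d(F, k, l):
    """1-D convolution of signal F with kernel k indexed by t in [-r, r],
    dilation l: out(p) = sum_{s + l*t = p} F(s) k(t), i.e. F[p - l*t] * k[t].

    Output positions are restricted so every tap falls inside the signal.
    """
    r = (len(k) - 1) // 2
    n = len(F)
    return np.array([
        sum(F[p - l * t] * k[t + r] for t in range(-r, r + 1))
        for p in range(l * r, n - l * r)
    ])

F = np.arange(16, dtype=float)
k = np.array([1.0, 0.0, -1.0])       # r = 1

print(dilated_conv1d(F, k, 1))       # taps at p-1, p, p+1: ordinary conv
print(dilated_conv1d(F, k, 2))       # taps at p-2, p, p+2: skips every other value
```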
In general, the procedure to change a CNN with time-pooling from the cross-entropy training (classification) stage to the dense prediction stage for sequence training and testing is as follows. Change pooling layers from stride s to stride 1, and multiply the dilation factor of all following convolutions by s. After this, any convolution coming after p pooling layers with original stride s will have dilation factor s^p. Fully connected layers are equivalent to, and can be trivially replaced by, convolutional layers with 1×1 kernels (except the first one, whose kernel size matches the output of the conv stack before being flattened for the fully connected layers). This dilating procedure is how a VGG classification network is adapted for semantic segmentation [13, 14].
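As an illustration of this procedure (a sketch under our own assumptions, not the paper's code), the following PyTorch function applies exactly these two changes; it assumes the model is an nn.Sequential over (batch, fmaps, freq, time) tensors whose module order matches the forward pass, and that the fully connected layers have already been rewritten as 1×1 convolutions:

```python
import torch.nn as nn

def to_dense_prediction(model):
    """Convert an XE-trained CNN for dense full-utterance prediction:
    set every pooling layer's time-stride to 1, and give every following
    convolution the product of the time-strides seen so far as dilation."""
    dilation_t = 1                   # product of original time-strides so far
    for m in model.modules():
        if isinstance(m, nn.MaxPool2d):
            s = m.stride if isinstance(m.stride, tuple) else (m.stride,) * 2
            m.stride = (s[0], 1)     # stride 1 in time: keep the resolution
            dilation_t *= s[1]       # ... and dilate later convs instead
        elif isinstance(m, nn.Conv2d):
            m.dilation = (m.dilation[0], dilation_t)  # after p pools: s**p
    return model
```

Since pooling strides and convolution dilations are plain attributes read at forward time, the trained weights are untouched; only the way they are applied to the utterance changes.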
Using time-dilated convolutions, the feature maps and output can keep the full resolution of the input, while pooling with stride. With pooling, the receptive field in time of the CNN can be larger than that of the same network without pooling. This allows combining the performance gains of pooling [7] with the computational efficiency and the ability to apply batch normalization [8].
4 Experiments and results
We trained a VGG-style CNN [4] in the hybrid NN/HMM setting on the 2000h Switchboard+Fisher dataset. The architecture and training method are similar to our earlier papers [7, 8], and are based on the setup described in [21]. Our input features are VTLN-warped logmel with ∆, ∆∆; the outputs are 32k tied CD states from forced alignment. Table 1 fully specifies the CNN when training on windows and predicting the center frame.

| Layer        | Output: fmaps × f × T |
|--------------|-----------------------|
| Input window | 3 × 64 × 48           |
| conv 7×7     | 64 × 64 × 42          |
| pool 2×1     | 64 × 32 × 42          |
| conv 3×3     | 64 × 32 × 40          |
| conv 3×3     | 64 × 32 × 38          |
| conv 3×3     | 64 × 32 × 36          |
| pool 2×1     | 64 × 16 × 36          |
| conv 3×3     | 128 × 16 × 34         |
| conv 3×3     | 128 × 16 × 32         |
| conv 3×3     | 128 × 16 × 30         |
| pool 2×1     | 128 × 8 × 30          |
| conv 3×3     | 256 × 8 × 28          |
| conv 3×3     | 256 × 8 × 26          |
| conv 3×3     | 256 × 8 × 24          |
| pool 2×2     | 256 × 4 × 12          |
| conv 3×3     | 512 × 4 × 10          |
| conv 3×3     | 512 × 4 × 8           |
| conv 3×3     | 512 × 4 × 6           |
| pool 2×2     | 512 × 2 × 3           |
| 3× FC        | 2048                  |
| FC           | 1024                  |
| FC           | 32000                 |

Table 1: CNN architecture.

| Model                                  | SWB XE | SWB ST | CH XE | CH ST |
|----------------------------------------|--------|--------|-------|-------|
| Classic 512 CNN [18]                   | 12.6   | 10.4   |       |       |
| IBM 2016 RNN+VGG+LSTM [19]             |        | 8.6†   |       | 14.4† |
| MSR 2016 ResNet* [20]                  |        | 8.9    |       |       |
| MSR 2016 LACE* [20]                    |        | 8.6    |       |       |
| MSR 2016 BLSTM* [20]                   |        | 8.7    |       |       |
| VGG (pool, inefficient) [19]           | 10.2   | 9.4    | 16.3  | 16.0  |
| VGG (no pool) [8]                      | 10.8   | 9.7    | 17.1  | 16.7  |
| VGG-10 + BN (no pool) [8]              | 10.8   | 9.5    | 17.0  | 16.3  |
| VGG-13 + BN (no pool)                  | 10.3   | 9.0    | 16.5  | 16.4  |
| VGG-13 + BN + pool                     | 9.5    | 8.5    | 15.1  | 15.4  |
| VGG-13 + BN + pool (uncouple CH acwt)  |        |        | 14.8  | 15.2  |

Table 2: Results with small LM (4M n-grams).

| Model                                | SWB  | CH    |
|--------------------------------------|------|-------|
| IBM 2015 DNN+RNN+CNN [21]            | 8.8† | 15.3† |
| IBM 2016 RNN+VGG+LSTM [19]           | 7.6† | 13.7† |
| MSR 2016 ResNet [20]                 | 8.6  | 14.8  |
| MSR 2016 LACE [20]                   | 8.3  | 14.8  |
| MSR 2016 BLSTM [20]                  | 8.7  | 16.2  |
| VGG + BN + pool                      | 7.7  | 14.5  |
| VGG + BN + pool (uncouple CH acwt)   |      | 14.4  |

Table 3: Results with big LM (36M n-grams).
Corresponding to the observations in [8], we do not pad in time, though we do pad in the frequency direction. Training followed the standard two-stage scheme: first 1600M frames of cross-entropy training (XE), followed by 310M frames of sequence training (ST). XE training was done with SGD with Nesterov acceleration, with the learning rate decaying from 0.03 to 9e−4 over 600M frames. We use the data balancing from [7] with exponent γ = 0.8. We report results on Hub5'00 (SWB and CH parts) after decoding with the standard small 4M n-gram language model and a 30.5k word vocabulary. We saw a slight improvement in results when decoding with an exponent on the prior γ lower than the one used during training. As mentioned in section 3, we use batch normalization in our network, where the mean and variance statistics are accumulated over both the feature maps and the frequency direction. The selection of models, decoding prior and acoustic weight was done by decoding on rt02 as held-out set.
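For concreteness, here is a PyTorch sketch of the convolutional stack as we read Table 1 (hypothetical code, not IBM's implementation; the three FC 2048 layers and the FC 1024 and FC 32000 layers are omitted): 3×3 convolutions padded in frequency but not in time, with batch normalization, and pooling 2×1 three times followed by 2×2 twice.

```python
import torch
import torch.nn as nn

def conv(c_in, c_out, k=3):
    # pad (k-1)//2 in frequency, 0 in time: time shrinks by k-1 per conv
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=((k - 1) // 2, 0)),
        nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

vgg13 = nn.Sequential(
    conv(3, 64, k=7), nn.MaxPool2d((2, 1)),
    conv(64, 64), conv(64, 64), conv(64, 64), nn.MaxPool2d((2, 1)),
    conv(64, 128), conv(128, 128), conv(128, 128), nn.MaxPool2d((2, 1)),
    conv(128, 256), conv(256, 256), conv(256, 256), nn.MaxPool2d((2, 2)),
    conv(256, 512), conv(512, 512), conv(512, 512), nn.MaxPool2d((2, 2)),
)

x = torch.randn(1, 3, 64, 48)   # logmel + deltas, 64 mel bins, 48 frames
print(vgg13(x).shape)           # torch.Size([1, 512, 2, 3]), as in Table 1
```

Feeding a 3 × 64 × 48 window through this stack reproduces the output shapes of every row of Table 1, ending at 512 × 2 × 3 before the fully connected layers.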
The results after XE and ST are presented in Tables 2 and 3. Baselines marked with * are from personal communication with the authors. Baselines marked with † are system combinations. Note that the baselines from [20] use slightly smaller LMs: 3.4M n-grams for the small LM (Table 2) and 16M n-grams for the big LM (Table 3). With n-gram decoding, this result is to our knowledge the best published single-model result.
5 Relation to other models
Stacked bottleneck networks (SBN) [22, 23, 24], or hierarchical bottleneck networks [25], are an influential acoustic model in hybrid NN/HMM speech recognition. SBNs are typically seen as two consecutive DNNs, each stage separately trained discriminatively with a bottleneck (small hidden layer). The first DNN sees the input features, while the second DNN gets the bottleneck features from the first DNN as input. Typically, the second DNN gets 5 bottleneck features from positions {−10, −5, 0, 5, 10} relative to the center [24]. In [23], it was pointed out that this SBN is convolutional and that one can backpropagate through both stages together.
In fact, this multi-stage SBN architecture is a special case of a CNN with time-dilated convolution. Specifically, the first DNN is equivalent to a CNN with a large first kernel followed by all 1×1 kernels. The second DNN is exactly equivalent to a CNN whose first kernel has size 5 and dilation factor 5 in the time direction. The layers after the bottleneck in the first DNN form an auxiliary classifier. This realization prompts a number of directions in which SBNs can be extended. Firstly, by avoiding the large kernel in the first convolutional layer, it is possible to keep time and frequency structure in the internal representations of later layers, enabling increased depth. Secondly, rather than increasing the time-dilation factor to 5 at once, it seems more natural to gradually increase the time-dilation factor throughout the depth of the network.
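The equivalence is easy to verify numerically. In this PyTorch sketch (our illustration; bn_dim and hid are made-up sizes), the first layer of the second-stage DNN, which consumes 5 concatenated bottleneck vectors from offsets {−10, −5, 0, 5, 10}, is copied into a Conv1d with kernel size 5 and dilation 5, and both produce the same output for the same center frame:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn_dim, hid = 80, 1024                 # made-up sizes for illustration
second_dnn_layer1 = nn.Linear(5 * bn_dim, hid)

# The same weights as a kernel-5, dilation-5 convolution over time:
as_conv = nn.Conv1d(bn_dim, hid, kernel_size=5, dilation=5)
with torch.no_grad():
    # Linear weight (hid, 5*bn_dim) -> Conv1d weight (hid, bn_dim, 5)
    as_conv.weight.copy_(
        second_dnn_layer1.weight.view(hid, 5, bn_dim).permute(0, 2, 1))
    as_conv.bias.copy_(second_dnn_layer1.bias)

feats = torch.randn(1, bn_dim, 100)    # first-stage bottleneck sequence
center = 50
window = torch.cat([feats[0, :, center + off] for off in (-10, -5, 0, 5, 10)])

dnn_out = second_dnn_layer1(window)            # DNN on the spliced window
conv_out = as_conv(feats)[0, :, center - 10]   # 'valid' conv shifts the index by 10
print(torch.allclose(dnn_out, conv_out, atol=1e-5))   # True
```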
Convolutional networks are also used in end-to-end models for speech recognition. Both the CLDNN architecture [17] and Deep Speech 2 (DS2) [16] combine a convolutional network as a first stage with LSTM and fully connected (DNN) output layers. In Wav2Letter [26], a competitive end-to-end model is presented which is fully convolutional. Both DS2 and Wav2Letter do a certain amount of downsampling through pooling or striding, which can be accepted when training with a CTC (or AutoSeg [26]) criterion, since these do not require the output to be the same length as the input. However, DS2 does report a degradation on English, which they work around using grapheme bigram targets. The time-dilated convolutions we introduced could improve these end-to-end models in two ways. Either one could allow the same amount of pooling while keeping a higher output resolution, which could eliminate the need for the bigram targets. Alternatively, one could keep the same resolution but expand the receptive field by adding more time-dilated convolution layers, which gives access to a broader context in the CNN layers. In conclusion, this work is relevant both to end-to-end models and to hybrid HMM/NN models.
6 Conclusion
We drew the parallel between dense prediction in computer vision and framewise sequence labeling, both in the HMM/NN and end-to-end settings. This provided us with the necessary tool (time-dilated convolutions) to adopt pooling in time in CNN acoustic models, while maintaining efficient processing and batch normalization on full utterances. We showed results on Hub5'00, where we brought the WER down from 9.4% in previous work to 8.5%, a 10% relative improvement. With a big (36M n-gram) language model, we achieve 7.7% WER, the best single-model performance reported so far.
References
[1] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[3] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun, "OverFeat: Integrated recognition, localization and detection using convolutional networks," arXiv:1312.6229, 2013.
[4] Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv:1409.1556, 2014.
[5] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, and Gerald Penn, "Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition," in Proc. ICASSP, 2012.
[6] Tara N. Sainath, Abdel-rahman Mohamed, Brian Kingsbury, and Bhuvana Ramabhadran, "Deep convolutional neural networks for LVCSR," in Proc. ICASSP, 2013.
[7] Tom Sercu, Christian Puhrsch, Brian Kingsbury, and Yann LeCun, "Very deep multilingual convolutional neural networks for LVCSR," Proc. ICASSP, 2016.
[8] Tom Sercu and Vaibhava Goel, "Advances in very deep convolutional neural networks for LVCSR," Proc. Interspeech, 2016.
[9] Clément Farabet, Camille Couprie, Laurent Najman, and Yann LeCun, "Scene parsing with multiscale feature learning, purity trees, and optimal covers," Proc. ICML, 2012.
[10] Jonathan Long, Evan Shelhamer, and Trevor Darrell, "Fully convolutional networks for semantic segmentation," CVPR, 2015.
[11] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," arXiv:1511.00561, 2015.
[12] Matthew D. Zeiler, Graham W. Taylor, and Rob Fergus, "Adaptive deconvolutional networks for mid and high level feature learning," in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 2018–2025.
[13] Hongsheng Li, Rui Zhao, and Xiaogang Wang, "Highly efficient forward and backward propagation of convolutional neural networks for pixelwise classification," arXiv:1412.4526, 2014.
[14] Fisher Yu and Vladlen Koltun, "Multi-scale context aggregation by dilated convolutions," Proc. ICLR, 2016.
[15] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv:1609.03499, 2016.
[16] Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," arXiv:1512.02595, 2015.
[17] Tara N. Sainath, Oriol Vinyals, Andrew Senior, and Haşim Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in Proc. ICASSP, 2015.
[18] Hagen Soltau, George Saon, and Tara N. Sainath, "Joint training of convolutional and non-convolutional neural networks," in Proc. ICASSP, 2014.
[19] George Saon, Tom Sercu, Steven Rennie, and Hong-Kwang J. Kuo, "The IBM 2016 English conversational telephone speech recognition system," Proc. Interspeech, 2016.
[20] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, "The Microsoft 2016 conversational speech recognition system," arXiv:1609.03528, 2016.
[21] George Saon, Hong-Kwang J. Kuo, Steven Rennie, and Michael Picheny, "The IBM 2015 English conversational telephone speech recognition system," Proc. Interspeech, 2015.
[22] František Grézl, Martin Karafiát, and Lukáš Burget, "Investigation into bottle-neck features for meeting speech recognition," in Proc. Interspeech, 2009.
[23] Karel Veselý, Martin Karafiát, and František Grézl, "Convolutive bottleneck network features for LVCSR," in ASRU, 2011.
[24] František Grézl, Martin Karafiát, and Karel Veselý, "Adaptation of multilingual stacked bottle-neck neural network structure for new language," in Proc. ICASSP, 2014.
[25] Christian Plahl, Ralf Schlüter, and Hermann Ney, "Hierarchical bottle neck features for LVCSR," in Interspeech, 2010, pp. 1197–1200.
[26] Ronan Collobert, Christian Puhrsch, and Gabriel Synnaeve, "Wav2Letter: an end-to-end ConvNet-based speech recognition system," arXiv:1609.03193, 2016.
