TONGUES: RAPID DEVELOPMENT OF A SPEECH-TO-SPEECH TRANSLATION SYSTEM
Alan W Black, Ralf D. Brown, Robert Frederking, Rita Singh, John Moody, Eric Steinbrecher
Language Technologies Institute, Carnegie Mellon University,
School of Computer Science, Carnegie Mellon University,
Lockheed Martin Systems Integration, Owego, NY
ABSTRACT
We carried out a one-year project to build a portable speech-to-speech translation system in a new language that could run on a small portable computer. Croatian was chosen as the target language. The resulting system was tested with real users on a trip to Croatia in the spring of 2001. We describe its basic components, the methods we used to build them, initial evaluation results, and related significant observations. This work was done in conjunction with the US Army Chaplain School; chaplains are often the only personnel in a position to communicate with local people over non-military issues such as medical supplies, refugees, etc. This paper thus reports on a realistic instance of rapidly deploying and field-testing a speech-to-speech translator using current technology.
INTRODUCTION
With speech recognition, synthesis and translation beginning to work well enough for small tasks, this paper describes a short project to build a portable speech-to-speech translation system in a new language. We describe its basic components, the methods we used to build them, and related significant observations. The end system was tested with real users on a trip to Croatia in the spring of 2001.
This work was done in conjunction with the US Army Chaplain School. Army chaplains are often among the advance party of a troop deployment. In many cases, the chaplains are the only personnel in a position to communicate with local people over non-military issues such as medical supplies, refugees, etc. Often the chaplain has no knowledge of the local language, and due to immediacy requirements, no human translator is available. Thus the chaplain must communicate as best as possible, perhaps without even a bilingual dictionary.
Given this domain, our task was to build a speech-to-speech translation system, running on a small portable computer, that would aid conversations between a chaplain and a native speaker. Such a task requires:
- speech recognition systems for English and the target language,
- speech synthesizers for English and the target language,
- a translation system from English to the target language,
- the reverse translation system, and
- an interface that allows the components to be used effectively in communication.
The entire project, from start to finish, was allowed to take only one calendar year, including initial contractual arrangements, hiring language experts, etc. All of the systems had to run on a single small sub-notebook computer, in a reasonable time; this added further interesting constraints on the project.
For topical reasons, Croatian was chosen as the target language. Although spoken by around 5 million people, it does not command enough economic weight that the commercial speech and language community has produced recognizers, synthesizers and translation engines for it. Thus it is a realistic type of language to use as an example. From a testing standpoint, although Croatia is still of interest to the US military, there are no current hostilities there, thus enabling a realistic field test under safe conditions.
PREPARATION
As we were to build the system in a short period and on a small budget, data-driven approaches were the only reasonable method. Such techniques must be used for each of the three core components: machine translation, speech recognition engines, and speech synthesis engines.
Thus at the very start we arranged to record a number of chaplains in role-playing conversations of the type we expected the device to encounter. Fortunately, the chaplains were familiar with role-playing exercises, and all had relevant field experiences to re-enact. Both sides of the conversations were in English. These were digitally recorded with head-mounted microphones at 16 kHz in stereo (one speaker on each channel), as this was closest to the intended audio channel characteristics of the eventual system. In all, we recorded 46 conversations, ranging from a few minutes to 20 minutes in length. In total there were 4.25 hours of actual speech. The conversations were then hand-transcribed at the word level, identifying false starts, filled pauses and the complete words. Next, the transcriptions of the English-English conversations were translated into Croatian by hand, by native Croatian speakers. This data provided the basic information from which we could bootstrap the rest of the speech-to-speech translation system.
COMPONENTS
Recognition
For speech recognition, we used the CMU Sphinx II system [8], a relatively light-weight recognizer that works in real time even on machines with relatively small memory and modest-speed processors. For Automatic Speech Recognition (ASR) to work, we need to build two basic types of models: acoustic models, which model the acoustic-phonetic space for the given language, and language models, which model the probability of word sequences. In addition to these models, we also need two lexicons, one for English and one for Croatian, that map words to their pronunciations.
For the English acoustic models, we could have used existing acoustic models trained from similar wide-band speech, but as there were no readily available conversational wide-band speech databases in the intended domain, it was felt better to train on the chaplain dialogs directly rather than use existing models and some form of adaptation. Although such adaptation techniques may have been beneficial and feasible for English, we knew that for Croatian no such data was available, and part of this exercise was to develop speech-to-speech translation systems for languages that did not already have speech resources constructed for them. Thus for English we took only the 4.25 hours of chaplain speech and directly trained semi-continuous HMM models for Sphinx II.
For the English language model we required a larger collection of in-domain text. We used the dialog transcriptions themselves, but also augmented them with text from chaplain handbooks that were made available to us. Although we knew we could obtain better recognition accuracy by using more resources, we were interested in limiting what resources were necessary for this work, and (see below) we found the models trained from this data adequate for the task.
Building Croatian models was harder. As we were aware that our resource of Croatian speakers was limited, and they had less skill in carrying out full word transcription of conversational speech, we wished to find a simpler, less resource-intensive method to build Croatian acoustic models. From the translated chaplain transcripts, we wished to select example utterances that, when recorded, would give sufficient acoustic coverage to allow reasonable acoustic models to be trained. To do this, we used a technique originally developed for selecting text to record for speech synthesis [2]. By using the initially developed Croatian speech synthesizer, we could find the phonemes that would be used to say each utterance. We then ran a greedy selection algorithm that selects utterances that best cover the acoustic space [2]. From a list of several thousand utterances, we selected groups of 250 utterances that were phonetically rich. These sets were then read by a number of Croatian speakers. Using read speech avoided the process of hand-transcribing the speech, though it does make it less like the intended conversational speech. Due to the relative scarcity of native Croatian speakers, we recorded only 15 different speakers, of which 13 were female and 2 were male. This resulted in a gender imbalance, which was not, however, observed to affect the system's performance greatly. In all, a total of 4.0 hours of Croatian speech was collected. This data alone was then used to train new acoustic models for Croatian.
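The greedy coverage selection described above can be sketched as follows. This is a simplified illustration that scores coverage over single phonemes, whereas the actual algorithm in [2] uses richer acoustic units; all function and variable names are ours, not the system's.

```python
def greedy_select(utterances, k):
    """Pick up to k utterances that best cover the phoneme inventory.

    utterances: list of (text, phoneme_list) pairs.
    Returns the selected texts in pick order.
    """
    covered = set()
    remaining = list(utterances)
    selected = []
    while remaining and len(selected) < k:
        # Score each candidate by how many not-yet-covered phonemes it adds.
        best = max(remaining, key=lambda u: len(set(u[1]) - covered))
        if not set(u_new := set(best[1]) - covered):
            break  # no remaining candidate adds new coverage
        selected.append(best[0])
        covered |= set(best[1])
        remaining.remove(best)
    return selected
```

Run on the full candidate list repeatedly (removing already-picked utterances) to build the 250-utterance recording sets.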
For both the English and Croatian recognition systems, semi-continuous 5-state triphone HMMs were trained. The number of tied states used in each case was commensurate with the amount of training data available. Although the English models did have explicit modeling of filled pauses (non-linguistic verbalized sounds such as "um", "uh", etc.), none were trained for Croatian. This was partially because the recorded speech was read, and had minimal spontaneous speech phenomena such as filled pauses. Language models in both cases were word trigrams built with absolute discounting. The language-model vocabularies consisted of 2900 words for English and 3900 words for Croatian. In pilot experiments with held-out test sets, the word error rates were found to be below 15% for English and below 20% for Croatian.
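As an illustration of absolute discounting, here is a minimal bigram model (rather than a trigram, for brevity) with unigram backoff. The discount value and code structure are our own illustrative choices, not the system's implementation.

```python
from collections import Counter

def abs_discount_bigram(tokens, d=0.5):
    """Bigram LM with absolute discounting and interpolated unigram backoff:

        P(w|h) = max(c(h,w) - d, 0)/c(h) + lambda(h) * P_uni(w)

    where lambda(h) = d * N1+(h,*) / c(h) redistributes the discounted mass
    over the unigram distribution.
    """
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    total = sum(uni.values())

    def prob(w, h):
        ch = uni[h]
        if ch == 0:
            return uni[w] / total  # unseen history: pure unigram backoff
        chw = bi[(h, w)]
        distinct = sum(1 for (a, _) in bi if a == h)  # N1+(h,*)
        lam = d * distinct / ch
        return max(chw - d, 0) / ch + lam * uni[w] / total

    return prob
```

For a history whose count equals the sum of its bigram continuations, the probabilities sum to exactly one over the vocabulary, which is the property that makes the discounting scheme a proper distribution.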
We note that as the utterances used in training were not spontaneous, the system was more easily confused by hesitations and filled pauses. However, in the actual user tests of the system this proved to be less of a problem than we expected. As turns in a conversation through a speech-to-speech translation system are slower and less spontaneous compared to single-language conversations, speakers were more careful in their delivery than they might be in full conversations.
Synthesis
For English synthesis, we used an existing English voice in the Festival Speech Synthesis System [3]. Although there may have been a slight advantage in building a targeted synthesizer for conversations, it would not have been significantly different in quality. A few lexicon additions were made for the particular domain, but the existing English voice was essentially used unchanged.
For Croatian, it was necessary to build a complete new speech synthesis voice. To do this, we used the tools available in the CMU FestVox project [1], which is designed to provide the necessary support for building new synthetic voices in new languages. Synthetic voices require: text processing, lexicons, a method for waveform synthesis, and prosodic models.
In this case, the text processing was minimal, as the type of language being given to the synthesizer was fairly regular, since it would be generated by the translation system (or the Croatian recognizer).
Luckily, orthographic-to-phoneme rules for Croatian are relatively easy and could be written by hand, so building a lexicon was much easier than it might be for some other languages. (The same lexicon and letter-to-sound rules were used by the recognition engine.)
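Because Croatian spelling is close to phonemic, hand-written letter-to-sound rules reduce largely to an ordered substitution table in which digraphs must match before single letters. The sketch below illustrates the idea; the phoneme symbols are placeholders of our own, not the inventory actually used in the system.

```python
# Ordered (grapheme, phoneme) rules, longest match first, so that the
# digraphs "dž", "lj", "nj" are not split into their component letters.
RULES = [
    ("dž", "dZ"), ("lj", "L"), ("nj", "J"),
    ("č", "tS"), ("ć", "tC"), ("š", "S"), ("ž", "Z"), ("đ", "dC"),
    ("c", "ts"), ("j", "j"),
]

def to_phonemes(word):
    """Map a lower-case Croatian word to a phoneme list by greedy
    longest-match over the rule table; unmatched letters map to themselves."""
    out, i = [], 0
    while i < len(word):
        for g, p in RULES:
            if word.startswith(g, i):
                out.append(p)
                i += len(g)
                break
        else:
            out.append(word[i])  # regular letters are their own phoneme
            i += 1
    return out
```

The same table can drive both the synthesizer's lexicon construction and the recognizer's pronunciation dictionary, which is how the two components share pronunciations in the system.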
The waveform synthesis was done using a constrained version of general unit selection techniques. From the translated utterances from the chaplain dialogs and other Croatian text, we selected 1000 utterances that best covered the phonetic space (using the technique more fully described in [2]). These were spoken by a native male Croatian speaker and automatically labelled by a simple dynamic time warping technique using cross-linguistic prompts (as described in [1]). These labels were then hand-corrected.
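The prompt-alignment idea can be sketched with plain dynamic time warping: the synthetic prompt's phone boundaries are known, so warping it onto the recording lets those boundaries be copied across. Real systems align MFCC frame sequences; the scalar features below are only for illustration.

```python
import math

def dtw_path(a, b, dist=lambda x, y: abs(x - y)):
    """Return the DTW alignment path between feature sequences a and b
    as a list of (index_in_a, index_in_b) pairs."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = dist(a[i - 1], b[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # Backtrace from the end, always stepping to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((D[i - 1][j - 1], i - 1, j - 1),
                      (D[i - 1][j], i - 1, j),
                      (D[i][j - 1], i, j - 1))
    return path[::-1]
```

A phone boundary at prompt frame i is then placed at the recording frame(s) paired with i in the path, after which the labels are hand-corrected as described above.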
The final required piece was a set of prosodic models for Croatian; we found a very simple rule-based method of phrasing adequate for this domain (mostly shorter sentences). We trained duration models from the recorded Croatian speech, which worked well. However, the intonation model was harder. We found that a model trained from the relatively small amount of speech in the Croatian database did not produce good intonation. Thus we fell back on a different technique: we simply used our English intonation model, modified to the range of our Croatian speaker. In listening tests, native Croatians preferred this over the natively-trained model. For other languages such shortcuts may not be so acceptable.
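One simple way to retarget an intonation model to a new speaker's range is a linear mean/variance mapping of the predicted F0 contour; the paper does not specify the exact mapping used, so the sketch below is an assumption.

```python
def map_f0(contour, src_mean, src_std, tgt_mean, tgt_std):
    """Z-normalize an F0 contour predicted under the source (English) model's
    statistics and re-expand it to the target speaker's mean/std (all in Hz)."""
    return [tgt_mean + (f - src_mean) / src_std * tgt_std for f in contour]
```

For example, a contour predicted around a 110 Hz English male mean can be shifted and stretched to sit in a Croatian speaker's 140 Hz range while preserving its shape.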
The resulting quality, although not always fluent, was understandable almost all the time, and much better than a standard diphone synthesizer. It also readily captured the voice quality of the original Croatian speaker.
Machine Translation
Again, due to the requirement of rapid development, data-driven approaches were preferred. Thus we used a Multi-Engine MT (MEMT) system [7], whose primary engines were an Example-Based MT (EBMT) engine [4] and a bilingual dictionary/glossary. Carnegie Mellon's EBMT system uses a "shallower" approach than many other EBMT systems; examples to be used are selected based on string matching and inflectional and other heuristics, with no deep structural analysis. The MEMT architecture uses a trigram language model of the output language to select among competing partial translations produced by several engines. It is used in this system primarily to select among competing (and possibly overlapping) EBMT translation hypotheses.
[Figure: EBMT architecture]
For translation into Croatian, we incorporated a finite-state word reordering mechanism, applied during the language-model-driven selection of partial translations, to place clitics in a cluster in the appropriate location. (Croatian syntax requires a very specific ordering of clitics in a cluster in a specific position in the sentence.) The training corpus for the EBMT engine consisted of the translated chaplain dialogs plus pre-existing parallel text from the DIPLOMAT project [6] and newly-acquired parallel text from the web. The dictionary/glossary engine used both statistically-extracted translations and manually-created entries. The English trigram model already existed, and had been generated from newswire and broadcast news transcripts. Finally, the Croatian trigram model was built from the Croatian half of the EBMT corpus, some Croatian text found on the web, and the full text of some sixty novels and other Croatian literary works (in total, approximately six million words).
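The MEMT selection step can be sketched as a dynamic program over source positions: each engine proposes scored partial translations for spans of the input, and the language model picks the best non-overlapping sequence that covers the whole sentence. The unigram scorer in the usage example stands in for the paper's trigram model, and all names are illustrative.

```python
def memt_select(n, fragments, lm_score):
    """Choose the best full cover of source positions 0..n.

    fragments: list of (start, end, words) partial translations over
    half-open source spans [start, end).
    lm_score: scores a word sequence (higher is better).
    Returns the output words of the highest-scoring cover, or None.
    """
    best = {0: (0.0, [])}  # source position -> (score, output words so far)
    for pos in range(n):
        if pos not in best:
            continue
        score, words = best[pos]
        for s, e, frag in fragments:
            if s == pos:  # fragment extends the cover from this position
                cand = (score + lm_score(frag), words + list(frag))
                if e not in best or cand[0] > best[e][0]:
                    best[e] = cand
    return best.get(n, (float("-inf"), None))[1]
```

For instance, given competing covers of a two-word input, the single-span hypothesis loses to two better-scored adjacent fragments. A clitic-reordering transducer would be applied to each candidate sequence before scoring.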
Integration and Interface
Simply stringing together a recognizer, translator, and synthesizer does not make a very useful speech-to-speech translation system. A good interface is necessary to make the parts work together in such a way that a user can actually derive benefit from it. Using our experience from the earlier DIPLOMAT system, we designed the interface to be asymmetric, with the Croatian side being as simple as possible, and any necessary complexity handled on the English side, since the chaplain would be trained and practiced in using the system.
We included back-translation, to allow a user with no knowledge of the target language to better assess the quality of the translation. We also included several user-requested features, such as built-in pre-recorded instructions and explanations for the Croatian speaker (since the Croatian speaker is completely naive regarding the device and the chaplain's intentions), emergency key phrases (such as "Don't move!"), and enhancements such as being able to modify the translation lexicon, so that the system could be tuned to more specific tasks.
The final system ran on a Windows-based Toshiba Libretto, running at 200 MHz with 192 MB of memory. At the time of the project (2000) this was the best combination of speed and size that was readily available. The system was equipped with a custom touchscreen, so that the Croatian speaker would not need to type or use a mouse at all. Aware that the system might be used in situations where the non-English participant might be unfamiliar with the technology, we included a microphone/speaker handset that looks like a conventional telephone handset. This has the advantages of providing a close-talking microphone, thus making speech recognition easier, and of coming in a format that is familiar to most people.
EVALUATION
In April 2001, a group organized by the US Army Chaplain School took two versions of the device to Zagreb, where it was tested with non-English-speaking Croatians. A number of scenarios were prepared in English and Croatian, and were given to each participant to act out using the translation device. The scenarios were in the intended domain, involving refugees, medical supplies and getting general directions.
In all, 21 dialogs took place, between different Croatian speakers and one of 5 chaplains. After the test, the Croatian participants were given a questionnaire to fill out. Their overall impression was as follows:
[Table: questionnaire responses rating the system (Good/OK/Bad) overall and for grammar/case, loudspeakers, translation, recognition, synthesis, and speed]
Participants could correct misrecognized input by hand, which they often did. However, the speakers also learned to speak more fluently and less conversationally as they used the system, improving recognition accuracy. Similarly, we asked what they found easy:
[Table: responses to "What works?"]
It was quickly discovered by most participants that the system did not translate long, rambling sentences well. Short, direct sentences were much more likely to produce good translations. This was not surprising, given the limitations of the platform and the deliberate limiting of development time to see if such limitations still allowed a useful translation device. We were actually pleased to see that the system provided adequate coverage for successful translation of unrehearsed, naive dialogues.
Other specific observations we noted were that the users could not easily identify where the problems lay in the system. (For example, if speech recognition produced and displayed a correct transcript, and then translation produced an unacceptable result, they would usually respeak the same utterance using the same words.) Thus even if we provided separate user methods to add words to the recognizer, language model and translation engine, it is clear that the user would not be able to identify which part (or parts) needed to be updated. As we feel that such systems need to provide methods of adaptation in the field, it is clear that the interface presented to the user to offer that adaptation needs more work.
Although many users commented on problems with the volume of the output through the small built-in speakers on the device, mistakes in the synthesizer were often erroneously attributed to the translator (and vice versa).
A second observation was that the participants continued to use speech rather than the alternative typing interface (although they were clearly aware of it), resorting to typing only as a last resort. This may have been due to the fact that the participants were told to use the speech-to-speech translation device, rather than having the more abstract goal of successful communication by the best means. The very small keyboard on the (necessarily) small device may also have been a factor.
[Figure: Use of the system in Croatia]
Further details of the evaluation are described in [5].
CONCLUSION
As one of the goals of this work was to rapidly develop a speech-to-speech translation system, we also wished to account for the effort spent in building it. Although the work took place over one calendar year, not everyone was working full time on the system during that period. In total there were 6 technical staff involved (the authors of this paper), each bringing their particular expertise. We estimate that in total there was about 2 person-years of effort from the senior staff. In addition to this, there were also part-time Croatian informants, chaplains and some student helpers. We should also note that some of the translation data used to train the system was collected for a previous project.
Most of the basic systems that were used in the development of this system already existed, and this was basically a test of how well they perform on new data. However, some problems with the tools were found and some new development was carried out. Interestingly, the organization of data collection, and the scheduling of translators and labellers, was actually one of the most time-consuming parts. If this technique were to be applied to some new language, we believe fewer resources would be required, though we do not want to claim that each new language would be the same as the previous one, and hence different, possibly non-trivial, problems may appear when moving these techniques to new languages.
This project shows how a relatively simple speech-to-speech translation system can be rapidly and successfully constructed using today's tools. The system was indeed constructed in less than one year. The results of the 2001 evaluation in Croatia indicated that, while the system was not ready for actual field use, it was actually impressively close to that level of performance, and worthy of further development to achieve that capability.
REFERENCES
[1] A. Black and K. Lenzo. Building voices in the Festival speech synthesis system. 2000.
[2] A. Black and K. Lenzo. Optimal data selection for unit selection synthesis. In 4th ESCA Workshop on Speech Synthesis, Scotland, 2001.
[3] A. Black, P. Taylor, and R. Caley. The Festival speech synthesis system. /festival, 1998.
[4] R. Brown. Example-based machine translation in the Pangloss system. In Proceedings of COLING-96, pages 169-174, Copenhagen, Denmark, 1996.
[5] R. Frederking, A. Black, R. Brown, J. Moody, and E. Steinbrecher. Field testing the TONGUES speech-to-speech machine translation system. In LREC 2002, 2002.
[6] R. Frederking, A. Rudnicky, C. Hogan, and K. Lenzo. Interactive speech translation in the DIPLOMAT project. Machine Translation Journal, special issue on spoken language translation, 2000.
[7] R. E. Frederking and R. D. Brown. The Pangloss-Lite machine translation system. In Proceedings of the Second Conference of the Association for Machine Translation in the Americas (AMTA), pages 268-272, Montreal, Quebec, Canada, October 1996.
[8] X. Huang, F. Alleva, H.-W. Hon, K. Hwang, M.-Y. Lee, and R. Rosenfeld. The SPHINX-II speech recognition system: an overview. Computer Speech and Language, 7(2):137-148, 1992.