Speech Recognition
Victor Zue, Ron Cole, & Wayne Ward
MIT Laboratory for Computer Science, Cambridge, Massachusetts, USA
Oregon Graduate Institute of Science & Technology, Portland, Oregon, USA
Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
1 Defining the Problem
Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words. The recognized words can be the final results, as for applications such as command and control, data entry, and document preparation. They can also serve as the input to further linguistic processing in order to achieve speech understanding, a subject covered in a later section.
Speech recognition systems can be characterized by many parameters, some of the more important of which are shown in the figure. An isolated-word speech recognition system requires that the speaker pause briefly between words, whereas a continuous speech recognition system does not. Spontaneous, or extemporaneously generated, speech contains disfluencies, and is much more difficult to recognize than speech read from script. Some systems require speaker enrollment---a user must provide samples of his or her speech before using the system, whereas other systems are said to be speaker-independent, in that no enrollment is necessary. Some of the other parameters depend on the specific task. Recognition is generally more difficult when vocabularies are large or have many similar-sounding words. When speech is produced in a sequence of words, language models or artificial grammars are used to restrict the combination of words.
The simplest language model can be specified as a finite-state network, where the permissible words following each word are given explicitly. More general language models approximating natural language are specified in terms of a context-sensitive grammar.
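As a minimal sketch of such a finite-state network (added here for illustration; the vocabulary and transitions below are invented, not taken from the text), each word can simply be mapped to the set of words allowed to follow it:

```python
# A toy finite-state language model: each word maps to the set of words
# that may legally follow it. Vocabulary and transitions are invented
# purely for illustration.
FSN = {
    "<s>":     {"show", "list"},
    "show":    {"me", "flights"},
    "list":    {"flights"},
    "me":      {"flights"},
    "flights": {"to", "</s>"},
    "to":      {"boston", "denver"},
    "boston":  {"</s>"},
    "denver":  {"</s>"},
}

def is_legal(sentence):
    """Check whether a word sequence is accepted by the network."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    return all(nxt in FSN.get(cur, set()) for cur, nxt in zip(words, words[1:]))

print(is_legal("show me flights to boston"))   # True
print(is_legal("show to flights"))             # False
```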
One popular measure of the difficulty of the task, combining the vocabulary size and the language model, is perplexity, loosely defined as the geometric mean of the number of words that can follow a word after the language model has been applied (see the section on language modeling for a discussion of language modeling in general and perplexity in particular). Finally, there are some external parameters that can affect speech recognition system performance, including the characteristics of the environmental noise and the type and the placement of the microphone.
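As a hedged formalization of that loose definition (the formula is added here; the survey itself gives only the verbal description), test-set perplexity for a word sequence w_1 ... w_N under a language model P is conventionally written as

\[
\mathrm{PP} \;=\; P(w_1, w_2, \ldots, w_N)^{-1/N}
\;=\; \Bigl[\,\prod_{i=1}^{N} P(w_i \mid w_1, \ldots, w_{i-1})\Bigr]^{-1/N}
\]

so that a uniform choice among k equally likely successor words gives PP = k; eleven equally likely digit words, for example, would give PP = 11, consistent with the digit-recognition task discussed later.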
Speech recognition is a difficult problem, largely because of the many sources of variability associated with the signal. First, the acoustic realizations of phonemes, the smallest sound units of which words are composed, are highly dependent on the context in which they appear. These phonetic variabilities are exemplified by the acoustic differences of the same phoneme in different contexts. At word boundaries, contextual variations can be quite dramatic---making gas shortage sound like gash shortage in American English, and devo andare sound like devandare in Italian.
Second, acoustic variabilities can result from changes in the environment as well as in the position and characteristics of the transducer. Third, within-speaker variabilities can result from changes in the speaker's physical and emotional state, speaking rate, or voice quality. Finally, differences in sociolinguistic background, dialect, and vocal tract size and shape can contribute to across-speaker variabilities.
The figure shows the major components of a typical speech recognition system. The digitized speech signal is first transformed into a set of useful measurements or features at a fixed rate, typically once every 10--20 msec (see the section on signal representation and section 11.3 on digital signal processing). These measurements are then used to search for the most likely word candidate, making use of constraints imposed by the acoustic, lexical, and language models. Throughout this process, training data are used to determine the values of the model parameters.
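To make the fixed-rate analysis concrete, here is a minimal sketch (not the survey's own front end) of slicing a digitized signal into overlapping frames every 10 msec and computing a simple log-magnitude spectrum per frame; a real system would typically use mel-frequency cepstral or similar features, and the window and step sizes below are assumptions.

```python
import numpy as np

def frame_features(signal, sample_rate=16000, frame_ms=25, step_ms=10):
    """Cut a 1-D signal into overlapping frames (one every `step_ms`)
    and return a log-magnitude spectrum per frame. This is a toy
    front end standing in for MFCC/PLP-style features."""
    frame_len = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    window = np.hamming(frame_len)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, step):
        frame = signal[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))
        feats.append(np.log(spectrum + 1e-10))   # avoid log(0)
    return np.array(feats)           # shape: (num_frames, frame_len // 2 + 1)

# Example: one second of synthetic audio -> roughly 100 frames of features
audio = np.random.randn(16000)
print(frame_features(audio).shape)
```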
Speech recognition systems attempt to model the sources of variability described above in several ways. At the level of signal representation, researchers have developed representations that emphasize perceptually important speaker-independent features of the signal, and de-emphasize speaker-dependent characteristics. At the acoustic phonetic level, speaker variability is typically modeled using statistical techniques applied to large amounts of data. Speaker adaptation algorithms have also been developed that adapt speaker-independent acoustic models to those of the current speaker during system use (see the section on speaker adaptation). Effects of linguistic context at the acoustic phonetic level are typically handled by training separate models for phonemes in different contexts; this is called context-dependent acoustic modeling.
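As a small illustration of context-dependent modeling (added here; the left-phone minus center-phone plus right-phone notation and the example phones are assumptions, not quoted from the text), each phone occurrence can be relabeled with its immediate neighbors so that a separate model is trained for each context:

```python
def to_triphones(phones):
    """Relabel a phone sequence as context-dependent units
    using the common left-center+right notation."""
    padded = ["sil"] + phones + ["sil"]
    return [f"{l}-{c}+{r}" for l, c, r in zip(padded, padded[1:], padded[2:])]

# One illustrative pronunciation of "data" (phones are assumptions)
print(to_triphones(["d", "ey", "t", "ax"]))
# ['sil-d+ey', 'd-ey+t', 'ey-t+ax', 't-ax+sil']
```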
Word level variability can be handled by allowing alternate pronunciations of words in representations known as pronunciation networks. Common alternate pronunciations of words, as well as effects of dialect and accent, are handled by allowing search algorithms to find alternate paths of phonemes through these networks. Statistical language models, based on estimates of the frequency of occurrence of word sequences, are often used to guide the search through the most probable sequence of words.
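As an illustrative sketch only (the words, pronunciations, and training sentences below are invented), alternate pronunciations can be stored as lists of phone sequences per word, and a bigram language model can be estimated from relative frequencies of word pairs and used to score candidate word sequences:

```python
from collections import Counter

# Alternate pronunciations per word, written as phone sequences
# (entries are invented for illustration).
PRONUNCIATIONS = {
    "either": [["iy", "dh", "er"], ["ay", "dh", "er"]],
    "data":   [["d", "ey", "t", "ax"], ["d", "ae", "t", "ax"]],
}

def train_bigram(corpus):
    """Estimate P(w2 | w1) by relative frequency from a list of sentences."""
    pair_counts, word_counts = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.split()
        word_counts.update(words[:-1])
        pair_counts.update(zip(words, words[1:]))
    return {pair: c / word_counts[pair[0]] for pair, c in pair_counts.items()}

def sequence_probability(words, bigram):
    """Score a word sequence under the bigram model (0 if an unseen pair occurs)."""
    words = ["<s>"] + words
    prob = 1.0
    for pair in zip(words, words[1:]):
        prob *= bigram.get(pair, 0.0)
    return prob

bigram = train_bigram(["show the data", "show either plot", "plot the data"])
print(sequence_probability("show the data".split(), bigram))   # ~0.33
```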
The dominant recognition paradigm of the past fifteen years is known as hidden Markov models (HMM). An HMM is a doubly stochastic model, in which the generation of the underlying phoneme string and the frame-by-frame surface acoustic realizations are both represented probabilistically as Markov processes, as discussed elsewhere in this chapter and in section 11.2. Neural networks have also been used to estimate the frame-based scores; these scores are then integrated into HMM-based system architectures, in what has come to be known as hybrid systems, as described in section 11.5.
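To make the doubly stochastic idea concrete, here is a minimal Viterbi decoder for a discrete-observation HMM; this is a sketch added for illustration (the states, symbols, and probability values are invented), whereas real recognizers use continuous acoustic observations and far larger models.

```python
import numpy as np

def viterbi(observations, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for a discrete HMM.
    observations: list of symbol indices
    start_p[s], trans_p[s, s'], emit_p[s, symbol]: model probabilities."""
    n_states = len(start_p)
    T = len(observations)
    logprob = np.full((T, n_states), -np.inf)
    backptr = np.zeros((T, n_states), dtype=int)

    logprob[0] = np.log(start_p) + np.log(emit_p[:, observations[0]])
    for t in range(1, T):
        for s in range(n_states):
            scores = logprob[t - 1] + np.log(trans_p[:, s])
            backptr[t, s] = np.argmax(scores)
            logprob[t, s] = scores[backptr[t, s]] + np.log(emit_p[s, observations[t]])

    # Trace back the best path from the most probable final state.
    path = [int(np.argmax(logprob[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Toy 2-state model with 3 observation symbols (values are invented).
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2], start, trans, emit))   # [0, 0, 1]
```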
An interesting feature of frame-based HMM systems is that speech segments are identified during the search process, rather than explicitly. An alternate approach is to first identify speech segments, then classify the segments and use the segment scores to recognize words. This approach has produced competitive recognition performance in several tasks.
2 State of the Art
Comments about the state of the art need to be made in the context of specific applications which reflect the constraints on the task. Moreover, different technologies are sometimes appropriate for different tasks. For example, when the vocabulary is small, the entire word can be modeled as a single unit. Such an approach is not practical for large vocabularies, where word models must be built up from subword units.
The past decade has witnessed significant progress in speech recognition technology. Word error rates continue to drop by a factor of 2 every two years. Substantial progress has been made in the basic technology, leading to the lowering of barriers to speaker independence, continuous speech, and large vocabularies. There are several factors that have contributed to this rapid progress. First, there is the coming of age of the HMM. The HMM is powerful in that, with the availability of training data, the parameters of the model can be trained automatically to give optimal performance.
Second, much effort has gone into the development of large speech corpora for system development, training, and testing. Some of these corpora are designed for acoustic phonetic research, while others are highly task specific. Nowadays, it is not uncommon to have tens of thousands of sentences available for system training and testing. These corpora permit researchers to quantify the acoustic cues important for phonetic contrasts and to determine parameters of the recognizers in a statistically meaningful way. While many of these corpora (e.g., TIMIT, RM, ATIS, and WSJ; see section 12.3) were originally collected under the sponsorship of the U.S. Defense Advanced Research Projects Agency (ARPA) to spur human language technology development among its contractors, they have nevertheless gained world-wide acceptance (e.g., in Canada, France, Germany, Japan, and the U.K.) as standards on which to evaluate speech recognition.
Third, progress has been brought about by the establishment of standards for performance evaluation. Only a decade ago, researchers trained and tested their systems using locally collected data, and had not been very careful in delineating training and testing sets. As a result, it was very difficult to compare performance across systems, and a system's performance typically degraded when it was presented with previously unseen data. The recent availability of a large body of data in the public domain, coupled with the specification of evaluation standards, has resulted in uniform documentation of test results, thus contributing to greater reliability in monitoring progress (corpus development activities and evaluation methodologies are summarized in chapters 12 and 13 respectively).
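The word error rates quoted throughout this article are conventionally computed by aligning the recognized word string against a reference transcription with a minimum-edit-distance alignment and dividing the total number of substitution, deletion, and insertion errors by the number of reference words. The sketch below is added here for illustration and is not part of the survey.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a standard Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j]: edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                      # all deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j                      # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + sub,   # match / substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    return dist[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("show me flights to boston", "show flights to austin"))  # 0.4
```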
Finally, advances in computer technology have also indirectly influenced our progress. The availability of fast computers with inexpensive mass storage capabilities has enabled researchers to run many large scale experiments in a short amount of time. This means that the elapsed time between an idea and its implementation and evaluation is greatly reduced. In fact, speech recognition systems with reasonable performance can now run in real time using high-end workstations without additional hardware---a feat unimaginable only a few years ago.
One of the most popular, and potentially most useful, tasks with low perplexity (PP=11) is the recognition of digits. For American English, speaker-independent recognition of digit strings spoken continuously and restricted to telephone bandwidth can achieve an error rate of 0.3% when the string length is known.
One of the best known moderate-perplexity tasks is the 1,000-word so-called Resource Management (RM) task, in which inquiries can be made concerning various naval vessels in the Pacific Ocean. The best speaker-independent performance on the RM task is less than 4%, using a word-pair language model that constrains the possible words following a given word (PP=60). More recently, researchers have begun to address the issue of recognizing spontaneously generated speech. For example, in the Air Travel Information Service (ATIS) domain, word error rates of less than 3% have been reported for a vocabulary of nearly 2,000 words and a bigram language model with a perplexity of around 15.
High perplexity tasks with a vocabulary of thousands of words are intended primarily for the dictation application. After working on isolated-word, speaker-dependent systems for many years, the community has since 1992 moved towards very-large-vocabulary (20,000 words and more), high-perplexity (PP≈200), speaker-independent, continuous speech recognition. The best system in 1994 achieved an error rate of 7.2% on read sentences drawn from North American business news.
