AI-HO-IntroNLP (printed 19/7/08) © P. Coxhead 2001
An Introduction to Natural Language Processing (NLP)
Peter Coxhead
Definition
A ‘natural language’ (NL) is any of the languages naturally used by humans, i.e. not an artificial or man-made language such as a programming language. ‘Natural language processing’ (NLP) is a convenient description for all attempts to use computers to process natural language.1 NLP includes:
• Speech synthesis: although this may not at first sight appear very ‘intelligent’, the synthesis of natural-sounding speech is technically complex and almost certainly requires some ‘understanding’ of what is being spoken to ensure, for example, correct intonation.
• Speech recognition: basically the reduction of continuous sound waves to discrete words.
• Natural language understanding: here treated as moving from isolated words (either written or determined via speech recognition) to ‘meaning’. This may involve complete model systems or ‘front-ends’, driving other programs by NL commands.
• Natural language generation: generating appropriate NL responses to unpredictable inputs.
• Machine translation (MT): translating one NL into another.
Origins
• The idea of using digital computers in NLP is ‘old’, possibly because one of the first uses of computers was in breaking military codes in the second world war. Some computer scientists seem to have thought that Russian (for example) is just English in a different code. In which case, since codes can be broken, so can Russian. This idea assumes there is a common ‘meaning base’ to all natural languages, regardless of their surface differences. The overwhelming consensus among linguists is that this is simply not true.
• ‘Artificial Language Processing’, in the form of compilers and interpreters for programming languages, was a key component in the success of digital computers from their earliest days. This success undoubtedly encouraged research into NLP (and also encouraged an optimistic approach).
There have been cycles of optimism and pessimism in the field of NLP (we are possibly in a more optimistic phase at present); although some very real advances have been made, the target of a general NLP system remains elusive. Historically, computer scientists have often been far too over-optimistic about NLP, probably for some of the reasons noted above. It is thus important to be clear from the outset exactly why the task is difficult.
It is also important to note that there are differences between natural languages. More work has probably been done on English than on any other language, largely because of the importance of American researchers, although there are very active workers in Europe and Japan. However, English is in some ways an untypical language, as it uses few inflections and relies heavily on word order. Textbooks and other introductory sources written in English rarely contain adequate discussions of NLP for languages with markedly different grammatical structures.
1 NLP is often used in a way which excludes speech; SNLP is then needed as a term to include both speech and other aspects of natural language processing.
We can distinguish at least three distinct ‘levels’ in processing NL:
•Sounds
•Grammar
•Meaning
Each can be divided into two or more sublevels, which need not concern us here. What I want to do in this brief introduction is to illustrate some of the problems in processing each level.
Speech
Consider the three words, spoken by a native English speaker from the south of England: input, intake, income. It’s clear that all three words contain the element in with the same meaning. To input is to put something in; the intake of a water pump is the place where water is taken in; your income is the money that you earn, i.e. that comes in.
Is the element in pronounced the same in all three words (by the specified speaker)? Careful listening will show that it is not. The word input is pronounced as if spelt imput, whereas intake is pronounced as spelt. If we let N stand for the sound usually spelt ng in English (e.g. in words like sing or singer), then income is pronounced iNcome.
I specified native English speakers from the south of England because many speakers of Scots English do NOT behave in this way; instead they consistently pronounce the first element of all three words as it is spelt, i.e. as in (as may all English speakers when speaking slowly and emphatically).
Interestingly, English speakers are generally quite unaware of these differences, both in their own speech and the speech of others. This is not because they cannot distinguish between the three sounds m, n and N. The three words rum, run and rung differ ONLY in these three sounds and are quite distinct to all native English speakers.
Another example of the same kind of phenomenon occurs in the plurals of English nouns. Consider the words cat, cats, dog and dogs. A native English speaker explaining how plurals are formed is likely to say something like “you add an s sound to the end.” Careful listening will show that cats is indeed pronounced with an s sound, but dogs is not: it ends with a z sound. Yet as with my previous example, English speakers don’t normally notice this difference. Again it isn’t because they can’t distinguish between s and z since Sue and zoo or hiss and his differ only in their s and z sounds.
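The plural pattern can be sketched as a simple voicing rule. The following is only a toy illustration, not a full account of English phonology: the function name and the one-letter-per-sound ‘phonemic’ spelling are invented, and the rule ignores words like horses which insert an extra vowel.

```python
# A toy sketch of the English plural rule, assuming an invented phonemic
# spelling in which each letter stands for one sound. Real English also
# inserts a vowel after s/z/sh-like sounds (horse -> horses), which this
# simplified rule ignores.
VOICELESS_FINALS = set("ptkf")  # voiceless final consonants (simplified)

def plural_sound(word):
    """Return the sound added to form the regular plural."""
    # After a voiceless consonant the plural is the s sound (cat -> cats);
    # after a voiced sound it is the z sound (dogs ends in a z sound).
    return "s" if word[-1] in VOICELESS_FINALS else "z"

print(plural_sound("cat"))  # s
print(plural_sound("dog"))  # z
```

The point is that the choice between s and z is entirely rule-governed, which is why native speakers never need to notice it.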
The conclusion is that the sounds that native speakers ‘hear’ are not the sounds they make. This is important in both speech synthesis and speech recognition. When generating speech from written text, a synthesizer must not turn every n or s into an n or s sound, but must use more complex rules which mimic the native speaker. Similarly a speech recognition system must recognize the sounds in rum and run as distinct, but must NOT decide that imput and input are different words.
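The input/intake/income pattern can be captured by a place-of-assimilation rule. This is a minimal sketch under the text’s conventions (N stands for the ng sound); the function name and the spellings are invented for illustration.

```python
# A toy sketch of nasal place assimilation for the prefix "in", using the
# text's convention that N stands for the ng sound. Spellings are invented
# approximations of southern English pronunciation, not real phonetics.
def pronounce_in_prefix(stem):
    """Return a rough pronounced form of in + stem."""
    first = stem[0]
    if first in "pb":      # lip sounds: in + put -> "imput"
        return "im" + stem
    if first in "kgc":     # back-of-mouth sounds: in + come -> "iNcome"
        return "iN" + stem
    return "in" + stem     # otherwise unchanged: in + take -> "intake"

print(pronounce_in_prefix("put"))   # imput
print(pronounce_in_prefix("take"))  # intake
print(pronounce_in_prefix("come"))  # iNcome
```

A synthesizer needs rules of roughly this shape; a recognizer needs the inverse mapping, collapsing imput, input and iNput back to the same word.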
Even more awkward is the fact that this behaviour operates across word boundaries. Try saying these sentences quickly and in an ‘informal’ style:
When playing football, watch the referee.
When talking about other people, watch who’s listening.
When catching a hard ball, wear gloves.
You’ll find (at least if you speak the same dialect of English as me) that you say whem, when and wheN respectively. This means that
• a speech synthesis system will not sound right if it simply uses pre-recorded words, since how these are pronounced changes depending on neighbouring words
• a speech recognition system must treat the three pronunciations of when as the same while distinguishing between the same sounds in rum, run and rung.
Even this example doesn’t capture all the complexities. A native speaker of American English pronounces the t and d in the words write, writer, ride, rider differently from a native speaker of English English. If we ignore the very slightly different pronunciation of the d, these words will be pronounced roughly as write, wrider, ride and rider. That is, write and ride are clearly distinguishable whereas writer and rider are pronounced EXACTLY the same. Yet a native speaker of this dialect doesn’t ‘hear’ this. The actual sounds used in the sentence I’m a writer and I write books will be something like I’m a rider and I write books but will cause no confusion whatsoever; a listener who is used to this dialect will hear writer not rider. Conclusion: TO RECOGNIZE WORDS CORRECTLY REQUIRES SIMULTANEOUS PROCESSING OF SOUND, GRAMMAR AND MEANING.
Grammar
Grammar in this context refers to both the structure of words (morphology) and the structure of ntences (syntax). I’m only going to consider some syntax examples here.
Compilers process the syntax of a programming language without needing to understand it. Can we write similar programs to process the syntax of a NL without processing the semantics, i.e. without trying to understand what it means?
Consider:
’Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.
Without knowing what Lewis Carroll2 meant by some of the words, i.e. without being able to perform semantic processing, considerable syntactic processing is possible. Thus you can answer questions such as:
What were the toves doing? – They were gyring and gimbling.
Where were they doing it? – In the wabe.
(Notice that the answers may involve morphological changes to words in the original, e.g. gyre becomes gyring.)
Chomsky’s famous
Colourless green ideas sleep furiously.
is another example of a syntactically correct sentence with erroneous or incomplete semantics. Again we can answer questions:
What were the ideas doing? – They were sleeping furiously.
Since people can answer such questions without needing to understand them, it seems at first sight plausible that a program could be written to do the same.
However, consider the second sentence of the fifth stanza of Carroll’s poem:
One, two! And through and through
The vorpal blade went snicker-snack!
2 Lewis Carroll was the pen-name of the mathematician and logician Charles L. Dodgson (1832–98).
Syntactically the sentence is ambiguous. It could be equivalent to
Snicker-snack went through and through the vorpal blade.
or
The vorpal blade went through and through (something) snicker-snack.
depending on whether snicker-snack is taken to be a (plural?) noun or an adverb. (Consider And through and through the wabe went borogoves versus And through and through the wabe went mimsily.) However, there is actually no doubt as to the correct reading – things don’t go through blades, and snicker-snack is onomatopoeic enough to be interpreted: the vorpal blade went through and through the Jabberwock making a snicker-snack noise.
More serious examples make the same point: in NLs, syntax CANNOT be processed independently from semantics. Consider this sequence:
The girl eats the apple with a smile.
How does the girl eat the apple? – With a smile
Suppose we tried to write a computer program which could engage in conversations like this. The simplest approach seems to be to use pattern-matching. We might look for a sentence with the pattern:
The <noun1> <verb>s the <noun2> with a <noun3>.
Then if we had a question about this sentence which had the pattern:
How does the <noun1> <verb> the <noun2>?
the program could answer:
With a <noun3>.
Thus given:
The man closes the door with a bang.
the program could cope with the sequence:
How does the man close the door? – With a bang.
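Such a pattern-matcher is easy to sketch. In the following toy program (all names are invented; regular expressions stand in for the <noun> and <verb> slots), the matcher handles exactly the frames above and nothing more:

```python
import re

# Toy pattern-matching question answerer for the single sentence frame
# discussed in the text. The patterns play the role of
#   The <noun1> <verb>s the <noun2> with a <noun3>.
#   How does the <noun1> <verb> the <noun2>?
SENTENCE = re.compile(r"The (\w+) (\w+)s the (\w+) with a (\w+)\.")
QUESTION = re.compile(r"How does the (\w+) (\w+) the (\w+)\?")

def answer(sentence, question):
    """Answer the question if both texts fit the frames, else return None."""
    s = SENTENCE.match(sentence)
    q = QUESTION.match(question)
    if s and q and s.groups()[:3] == q.groups():
        return f"With a {s.group(4)}."
    return None

print(answer("The man closes the door with a bang.",
             "How does the man close the door?"))  # With a bang.
```

Note that the matcher looks only at surface form: it has no idea what any of the words mean, which is precisely what goes wrong in the next example.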
But now suppose we try the sentence:
The girl eats the apple with a bruise.
How does the girl eat the apple? – With a bruise. WRONG!!
The problem is that although on the surface the two sentences the girl eats the apple with a smile and the girl eats the apple with a bruise are similar, their underlying syntax is quite different: with a smile qualifies eats in the first sentence whereas with a bruise qualifies the apple in the second sentence. We can try to show this by using parentheses:
[The girl] [eats (the apple) (with a smile)]
[The girl] [eats (the apple (with a bruise))]
Unfortunately, to work this out our program will need to know that, for example, apples can have bruises but not smiles.
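One way to see what such knowledge involves is to give the program a tiny ‘world model’. This sketch (the function name and the absurdly small knowledge base are invented) chooses between the two bracketings by checking whether the object noun can plausibly ‘have’ the prepositional noun:

```python
# A toy attachment disambiguator: attach "with a <noun>" to the object if
# the object can plausibly have it, otherwise to the verb. The knowledge
# base is invented and far too small for anything but this example.
CAN_HAVE = {
    "apple": {"bruise", "stalk"},
    "door": set(),
}

def attach_pp(verb, obj, pp_noun):
    """Return a nested-list parse showing where 'with a <pp_noun>' attaches."""
    if pp_noun in CAN_HAVE.get(obj, set()):
        # [eats (the apple (with a bruise))] -- qualifies the noun
        return [verb, [obj, ["with a", pp_noun]]]
    # [eats (the apple) (with a smile)] -- qualifies the verb
    return [verb, [obj], ["with a", pp_noun]]

print(attach_pp("eats", "apple", "smile"))   # attaches to the verb
print(attach_pp("eats", "apple", "bruise"))  # attaches to the noun
```

The syntactic decision is made by consulting world knowledge, which is exactly the dependence the text is arguing for.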
Now consider:
The girl eats the hamburger with relish.
This is ambiguous: it may mean that the hamburger has relish on it or it may mean that the girl is eating the hamburger with enthusiasm. Programming languages are deliberately constructed so that their syntax is unambiguous, but NLs are not. Thus PROCESSING THE GRAMMAR OF NATURAL LANGUAGES DEPENDS ON SIMULTANEOUS PROCESSING OF MEANING.
Meaning
Meaning can be subdivided in many ways. A simple division is between semantics, referring to the meaning of words and thus the meaning of sentences formed by those words,3 and pragmatics, referring to the meanings intended by the originator of the words. Note the meaning of sentences depends not only on the meaning of the words, but also on rules about the meaning of word combinations, word orders, etc. For example, in the English sentence the boy sees the girl, we know that the boy does the seeing and the girl is seen because in English the normal word order is subject – verb – object.
A striking feature of NLs is that many words and sentences have more than one meaning (i.e. are semantically ambiguous), and which meaning is correct depends on the context. This problem arises at several levels.
There are problems at the level of individual words. Consider:
The man went over to the bank.
What kind of ‘bank’? A river bank or a source of money? Here we have two distinct English words with the same spelling/pronunciation.
However, there are more subtle problems at the word level. Consider:
Mary loves Bill.
Mary loves chips.
The word loves in the two sentences does not have precisely the same meaning (and might need translating by different words in other languages). This is especially clear if you try changing Bill to chips in:
Mary loved Bill – that’s why she killed herself.
Metaphorical use of words also causes problems. Consider:
Water runs down the hill.
The river runs down the hill.
The road runs down the hill.
In the first sentence, runs implies movement; in the third it does not. What about the second? Other languages may not allow the same word to be used in all three senses. The literal/metaphorical boundary tends to shift with time and usage.
There are also problems at the sentence level. The meaning of a sentence is not just the meaning of the words of course; it also involves knowledge of the rules governing the meaning of word combinations, orderings, etc. in the NL. However, even given this knowledge, idioms cause problems:
He really put his foot in it that time.
The classic (but I suspect fictional) example from MT is the system that translated:
The spirit is willing but the flesh is weak.
into Russian, and then translated the Russian back into:
The vodka is good but the meat is rotten.
At the passage level endless problems arise. ‘Anaphora’ (e.g. substituting less meaningful words such as pronouns for nouns) is a common feature of NLs. Suppose we read the sentence:
London had snow yesterday.
3 In some languages, some words may have a purely syntactic function and should be excluded from this analysis; e.g. to in I want to eat.
and then read ONE of the following sentences, all of which are sensible continuations. What is the it?
It also had fog. (It = London)
It fell to a depth of 1 metre. (It = the snow)
It will continue cold today. (It = ?the weather)
Handling pronouns is a difficult task, and seems to require considerable world knowledge and inference. (Note that even in MT where ‘understanding’ the text is apparently not an issue, inferring the referent of pronouns can be important, because it may be necessary to reflect gender; e.g. if translating the above into French, it may become either il or elle.) Conclusion: THE MEANING OF NATURAL LANGUAGE CANNOT BE ESTABLISHED FROM THE MEANING OF THE WORDS PLUS RULES FOR DETERMINING MEANING FROM WORD COMBINATIONS AND ORDERINGS; INFERENCE AND WORLD KNOWLEDGE ARE NEEDED.
If all this weren’t enough, we need to consider the (inferred) intention of the producer of the NL being analysed. It may be thought that this is covered by semantics, but a few examples soon show otherwise. Consider the question:
Can you tell me the time?
A syntactically and semantically perfectly correct response is Yes! Pragmatically, this is wrong (assuming the question was asked in a ‘normal’ context by a native speaker of English), since the intention behind the question is that the respondent should TELL the questioner the time. How can a program be written which deals correctly with the difference between
Can you swim?
and
Can you pass me the life belt?
The difference between
Pass me the salt.
Pass me the salt, please.
Can you pass me the salt?
Could you pass me the salt?
is essentially in levels of politeness, rather than in any other aspect. This issue is particularly important in MT, where literal (i.e. syntactically and semantically correct) translations may not be pragmatically so. For example, these Greek-English pairs are reasonably good translations as regards syntax and word/sentence semantics:
thélo mía bíra = I want a beer
tha íthela mía bíra = I would like a beer
but the first Greek sentence is more polite than the first English sentence, so that a better translation may be the second English sentence.
Conclusion: THE SURFACE MEANING OF A SENTENCE IS NOT NECESSARILY THE MEANING INTENDED BY THE PRODUCER.
Conclusion
NLs are MASSIVELY LOCALLY AMBIGUOUS at every level (speech, grammar, meaning). Yet in normal use, NL is effective and rarely ambiguous. To resolve local ambiguity, humans employ not only a detailed knowledge of the language itself – its sounds, rules about sound combinations, its grammar and lexicon together with word meanings and meanings derived from word combinations and orderings – but also: