Empirical Methods for Compound Splitting
Philipp Koehn, Information Sciences Institute, Department of Computer Science, University of Southern California, koehn@isi.edu
Kevin Knight, Information Sciences Institute, Department of Computer Science, University of Southern California, knight@isi.edu
Abstract
Compounded words are a challenge for NLP applications such as machine translation (MT). We introduce methods to learn splitting rules from monolingual and parallel corpora. We evaluate them against a gold standard and measure their impact on the performance of statistical MT systems. Results show accuracy of 99.1% and performance gains for MT of 0.039 BLEU on a German-English noun phrase translation task.
1 Introduction
Compounding of words is common in a number of languages (German, Dutch, Finnish, Greek, etc.). Since words may be joined freely, this vastly increases the vocabulary size, leading to sparse data problems. This poses challenges for a number of NLP applications such as machine translation, speech recognition, text classification, information extraction, or information retrieval.
For machine translation, the splitting of an unknown compound into its parts enables the translation of the compound by the translation of its parts. Take the word Aktionsplan in German (see Figure 1), which was created by joining the words Aktion and Plan. Breaking up this compound would assist the translation into English as action plan.

Figure 1: Splitting options for the German word Aktionsplan (with the English glosses action plan and act ion plan)

Compound splitting is a well-defined computational linguistics task. One way to define the goal of compound splitting is to break up foreign words, so that a one-to-one correspondence to English can be established. Note that we are looking for a one-to-one correspondence to English content words: Say, the preferred translation of Aktionsplan is plan for action. The lack of correspondence for the English word for does not detract from the definition of the task: We would still like to break up the German compound into the two parts Aktion and Plan. The insertion of function words is not our concern.

Ultimately, the purpose of this work is to improve the quality of machine translation systems. For instance, phrase-based translation systems [Marcu and Wong, 2002] may recover more easily from splitting regimes that do not create a one-to-one translation correspondence. One splitting method may mistakenly break up the word Aktionsplan into the three words Akt, Ion, and Plan. But if we consistently break up the word Aktion into Akt and Ion in our training data, such a system will likely learn the translation of the word pair Akt Ion into the single English word action.

These considerations lead us to three different objectives and therefore three different evaluation metrics for the task of compound splitting:
One-to-one correspondence

Translation quality with a word-based translation system

Translation quality with a phrase-based translation system

For the first objective, we compare the output of our methods to a manually created gold standard. For the second and third, we provide differently prepared training corpora to statistical machine translation systems.
2 Related Work
While the linguistic properties of compounds are widely studied [Langer, 1998], there has been only limited work on empirical methods to split up compounds for specific applications.
Brown [2002] proposes an approach guided by a parallel corpus. It is limited to breaking compounds into cognates and words found in a translation lexicon. This lexicon may also be acquired by training a statistical machine translation system. The method leads to improved text coverage of an example-based machine translation system, but no results on translation performance are reported. Monz and de Rijke [2001] and Hedlund et al. [2001] successfully use lexicon-based approaches to compound splitting for information retrieval. Compounds are broken into either the smallest or the biggest words that can be found in a given lexicon.
Larson et al. [2000] propose a data-driven method that combines compound splitting and word recombination for speech recognition. While it reduces the number of out-of-vocabulary words, it does not improve speech recognition accuracy. Morphological analyzers such as Morphix [Finkler and Neumann, 1998] usually provide a variety of splitting options and leave it to the subsequent application to pick the best choice.

3 Splitting Options

Compounds are created by joining existing words together. Thus, to enumerate all possible splittings of a compound, we consider all splits into known words. Known words are words that exist in a training corpus, in our case the European parliament proceedings consisting of 20 million words of German [Koehn, 2002].
When joining words, filler letters may be inserted at the joint. These are called Fugenelemente in German. Recall the example of Aktionsplan, where the letter s was inserted between Aktion and Plan. Since there are no simple rules for when such letters may be inserted, we allow them between any two words. As fillers we allow s and es when splitting German words, which covers almost all cases. Other transformations at joints include the dropping of letters, such as when Schweigen and Minute are joined into Schweigeminute, dropping an n. An extensive study of such transformations is carried out by Langer [1998] for German.

To summarize: We try to cover the entire length of the compound with known words and fillers between words. An algorithm to break up words in such a manner could be implemented using dynamic programming, but since computational complexity is not a problem, we employ an exhaustive recursive search. To speed up word matching, we store the known words in a hash based on the first three letters. Also, we restrict known words to words of at least length three. For the word Aktionsplan, we find the following splitting options:
aktionsplan
aktion–plan
aktions–plan
akt–ion–plan
We arrive at these splitting options, since all the parts (aktionsplan, aktions, aktion, akt, ion, and plan) have been observed as whole words in the training corpus.
These splitting options are the basis of our work; a small code sketch of their enumeration follows. In the subsequent sections, we discuss methods that pick one of them as the correct splitting of the compound.
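The following sketch (ours, not the authors' implementation) illustrates the exhaustive recursive search described above, under the stated assumptions: the fillers s and es, a minimum part length of three letters, and a toy `known_words` set standing in for the training-corpus vocabulary.

```python
FILLERS = ("", "s", "es")   # allowed Fugenelemente at a joint
MIN_PART_LEN = 3            # known words must have at least three letters

def splitting_options(word, known_words):
    """Return all ways to cover `word` with known words and fillers."""
    if word == "":
        return [[]]          # nothing left to cover: one empty split
    options = []
    for end in range(MIN_PART_LEN, len(word) + 1):
        part = word[:end]
        if part not in known_words:
            continue
        for filler in FILLERS:
            rest = word[end:]
            if rest.startswith(filler):
                for tail in splitting_options(rest[len(filler):], known_words):
                    options.append([part] + tail)
    return options

known_words = {"aktionsplan", "aktions", "aktion", "akt", "ion", "plan"}
for option in splitting_options("aktionsplan", known_words):
    print("-".join(option))
# prints, in some order: aktionsplan, aktion-plan, aktions-plan, akt-ion-plan
```

A hash on the first three letters of each known word, as described in the text, would replace the simple set membership test here to speed up matching on a full vocabulary.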
4 Frequency Based Metric
The more frequently a word occurs in a training corpus, the bigger the statistical basis to estimate translation probabilities, and the more likely the correct translation probability distribution is learned [Koehn and Knight, 2001]. This insight leads us to define a splitting metric based on word frequency.
Given the count of words in the corpus, we pick the split $S$ with the highest geometric mean of word frequencies of its parts $p_i$ ($n$ being the number of parts):

$$\operatorname{argmax}_S \left( \prod_{p_i \in S} \operatorname{count}(p_i) \right)^{\frac{1}{n}}$$
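As an illustration, here is a minimal sketch of this metric. The `word_counts` values are made up for the example and the `options` list would come from the enumeration in Section 3.

```python
import math

def best_split(options, word_counts):
    """Pick the split with the highest geometric mean of part frequencies."""
    def geometric_mean(parts):
        product = math.prod(word_counts.get(p, 0) for p in parts)
        return product ** (1.0 / len(parts))
    return max(options, key=geometric_mean)

word_counts = {"aktionsplan": 10, "aktion": 960, "plan": 710,
               "akt": 224, "ion": 1}   # hypothetical corpus counts
options = [["aktionsplan"], ["aktion", "plan"], ["akt", "ion", "plan"]]
print(best_split(options, word_counts))
# -> ['aktion', 'plan']: sqrt(960 * 710) beats 10 and (224 * 1 * 710) ** (1/3)
```

The geometric mean rewards splits whose parts are all frequent; a single rare part (such as ion above) pulls the score down sharply.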
5 Guided by Parallel Corpus

Figure 2: Acquisition of splitting knowledge from a parallel corpus: The split Aktion–plan is preferred since it has the most coverage with the English (two words overlap). The figure shows correspondences being found in the English translation, with help from a translation lexicon.
Take the German word Grundrechte (English: basic rights): here we are looking for a translation into the adjective basic or fundamental. Such a translation only occurs when Grund is used as the first part of a compound.
To account for this, we build a second translation lexicon as follows: First, we break up German words in the parallel corpus with the frequency method. Then, we train a translation lexicon using Giza from the parallel corpus, with split German and unchanged English.
Since in this corpus Grund is often broken off from a compound, we learn the translation table entry Grund–basic. By joining the two translation lexicons, we can apply the same method, but this time we correctly split Grundrechte.
By splitting all the words on the German side of the parallel corpus, we acquire a vast amount of splitting knowledge (for our data, this covers 75,055 different words). This knowledge contains, for instance, that Grundrechte was split up 213 times and kept together 17 times.
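A sketch (ours) of how such splitting knowledge could be tallied and then applied with the decision rule described in the following paragraph. The `split_corpus` pairing and the backoff lambda are illustrative assumptions; the 213/17 counts for Grundrechte mirror the figures quoted above.

```python
from collections import Counter

# Tally which split was chosen for each word on the split German side.
knowledge = Counter()
split_corpus = ([("grundrechte", ("grund", "rechte"))] * 213
                + [("grundrechte", ("grundrechte",))] * 17)
for word, split in split_corpus:
    knowledge[word, split] += 1

def split_word(word, knowledge, frequency_backoff):
    """Follow the most frequent observed split; back off for unseen words."""
    seen = [(count, split) for (w, split), count in knowledge.items()
            if w == word]
    if seen:
        return max(seen)[1]          # most frequently chosen splitting option
    return frequency_backoff(word)   # e.g. the Section 4 frequency metric

print(split_word("grundrechte", knowledge, lambda w: (w,)))
# -> ('grund', 'rechte'), preferred 213 times against 17 in the training data
```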
When making splitting decisions for new texts, we follow the most frequent option based on the splitting knowledge. If the word has not been seen before, we use the frequency method as a back-off.

6 Limitation on Part-Of-Speech
A typical error of the methods presented so far is that prefixes and suffixes are often split off. For instance, the word folgenden (English: following) is broken up into folgen (English: consequences) and den (English: the). While this is nonsensical, it is easy to explain: The word the is commonly found in English sentences, and is therefore taken as evidence for the existence of a translation for den. Another example for this is the word Voraussetzung (English: condition), which is split into vor and aussetzung. The word vor translates to many different prepositions, which frequently occur in English.
To exclude these mistakes, we use information about the parts-of-speech of words. We do not want to break up a compound into parts that are prepositions or determiners, but only into content words: nouns, adverbs, adjectives, and verbs.
To accomplish this, we tag the German corpus with POS tags using the TnT tagger [Brants, 2000]. We then obtain statistics on the parts-of-speech of words in the corpus. This allows us to exclude words based on their POS as possible parts of compounds. We limit possible parts of compounds to words that occur most of the time as one of the following POS: ADJA, ADJD, ADV, NN, NE, PTKNEG, VVFIN, VVIMP, VVINF, VVIZU, VVPP, VAFIN, VAIMP, VAINF, VAPP, VMFIN, VMINF, VMPP.
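A sketch of how this restriction might be applied. The `pos_counts` tallies, the example words, and the majority-tag reading of "most of the time" are our assumptions for illustration, not the authors' implementation.

```python
ALLOWED_POS = {
    "ADJA", "ADJD", "ADV", "NN", "NE", "PTKNEG",
    "VVFIN", "VVIMP", "VVINF", "VVIZU", "VVPP",
    "VAFIN", "VAIMP", "VAINF", "VAPP", "VMFIN", "VMINF", "VMPP",
}

def is_allowed_part(word, pos_counts):
    """Admit a part only if its most frequent POS tag is a content-word tag."""
    tags = pos_counts.get(word)
    if not tags:
        return False
    majority_tag = max(tags, key=tags.get)
    return majority_tag in ALLOWED_POS

# Hypothetical tallies: "folgen" is mostly a noun, "den" mostly a determiner.
pos_counts = {"folgen": {"NN": 140, "VVINF": 60},
              "den": {"ART": 9000, "NE": 3}}
print(is_allowed_part("folgen", pos_counts))  # True
print(is_allowed_part("den", pos_counts))     # False: determiners excluded
```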
7 Evaluation
The training set for the experiments is a corpus of 650,000 noun phrases and prepositional phrases (NP/PP). For each German NP/PP, we have an English translation. This data was extracted from the Europarl corpus [Koehn, 2002], with the help of a German and an English statistical parser.
Method                   correct split  correct non  wrong not  wrong faulty  wrong split  precision  recall  accuracy
raw                      0              3296                                  0            -          0.0%    94.2%
eager                    148                         3                        397                     73.3%
frequency based                         3176                                  8            57.4%              95.7%
using parallel           180                         13                       27                      89.1%
using parallel and POS                  3287                                  2            93.8%              99.1%

Table 1: Evaluation of the methods compared against a manually annotated gold standard of splits: Using knowledge from the parallel corpus and part-of-speech information gives the best accuracy (99.1%). (Cells left blank did not survive the extraction of the source.)
This limitation is purely for computational reasons, since we expect most compounds to be nouns. An evaluation of full sentences is expected to show similar results.
We evaluate the performance of the described methods on a blind test set of 1,000 NP/PPs, which contain 3,498 words. Following good engineering practice, the methods have been developed with a different development test set. This restrains us from over-fitting to a specific test set.
7.1 One-to-One Correspondence
Recall that our first objective is to break up German words into parts that have a one-to-one translation correspondence to English words. To judge this, we manually annotated the test set with correct splits. Given this gold standard, we can evaluate the splits proposed by the methods.
The results of this evaluation are given in Table 1. The columns in this table mean:
correct split: words that should be split and were split correctly

correct non: words that should not be split and were not

wrong not: words that should be split but were not

wrong faulty split: words that should be split, were split, but wrongly (either too much or too little)

wrong split: words that should not be split, but were

precision: (correct split) / (correct split + wrong faulty split + wrong superfluous split)

recall: (correct split) / (correct split + wrong faulty split + wrong not split)

accuracy: (correct) / (correct + wrong)
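For concreteness, these definitions translate directly into code. The counts in the example call are invented for illustration and are not values from Table 1.

```python
def metrics(correct_split, correct_non, wrong_not, wrong_faulty, wrong_split):
    """Compute precision, recall, and accuracy from the five outcome counts."""
    precision = correct_split / (correct_split + wrong_faulty + wrong_split)
    recall = correct_split / (correct_split + wrong_faulty + wrong_not)
    correct = correct_split + correct_non
    wrong = wrong_not + wrong_faulty + wrong_split
    accuracy = correct / (correct + wrong)
    return precision, recall, accuracy

precision, recall, accuracy = metrics(correct_split=180, correct_non=3280,
                                      wrong_not=20, wrong_faulty=10,
                                      wrong_split=8)
print(f"{precision:.1%} {recall:.1%} {accuracy:.1%}")
```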
To briefly review the methods:
raw: unprocessed data with no splits

eager: biggest split, i.e., the split into as many parts as possible. If multiple biggest splits are possible, the one with the highest frequency score is taken.

frequency based: split into most frequent words, as described in Section 4

using parallel: split guided by splitting knowledge from a parallel corpus, as described in Section 5

using parallel and POS: as previous, with an additional restriction on the POS of split parts, as described in Section 6
Since we developed our methods to improve on this metric, it comes as no surprise that the most sophisticated method, which employs splitting knowledge from a parallel corpus and information about POS tags, proves to be superior, with 99.1% accuracy. Its main remaining source of error is the lack of training data. For instance, it fails on more obscure words such as Passagier–aufkommen (English: passenger volume), where even some of the parts have not been seen in the training corpus.

7.2 Translation Quality with Word-Based Machine Translation
The immediate purpose of our work is to improve the performance of statistical machine translation systems.
7.3 Translation Quality with Phrase-Based Machine Translation

Method                   BLEU
raw                      0.305
eager                    0.344
frequency based          0.342
using parallel           0.330
using parallel and POS   0.326

Table 3: Evaluation of the methods with a phrase-based statistical machine translation system. The ability to group split words into phrases overcomes the many mistakes of maximal (eager) splitting of words and outperforms the more accurate methods.
We trained a phrase-based translation system with the different flavors of our training data, and evaluated the performance as before. Table 3 shows the results.
Here, the eager splitting method that performed so poorly with the word-based SMT system comes out ahead. The task of deciding the granularity of good splits is deferred to the phrase-based SMT system, which uses a statistical method to group phrases and rejoin split words. This turns out to be even slightly better than the frequency based method.
8 Conclusion
We introduced various methods to split compound words into parts. Our experimental results demonstrate that what constitutes the optimal splitting depends on the intended application. While one of our methods reached 99.1% accuracy compared against a gold standard of one-to-one correspondences to English, other methods show superior results in the context of statistical machine translation. For this application, we could dramatically improve the translation quality by up to 0.039 points as measured by the BLEU score.
The words resulting from compound splitting could also be marked as such, and not just treated as regular words, as they are now. Future machine translation models that are sensitive to such linguistic clues might benefit even more.

References
Al-Onaizan, Y., Curin, J., Jahr, M., Knight, K., Lafferty, J., Melamed, D., Och, F.-J.,
