Japanese Named Entity Extraction with Redundant Morphological Analysis


Masayuki Asahara and Yuji Matsumoto
Graduate School of Information Science, Nara Institute of Science and Technology, Japan
masayu-a,matsu@is.aist-nara.ac.jp
Abstract
Named Entity (NE) extraction is an important subtask of document processing such as information extraction and question answering. A typical method used for NE extraction of Japanese texts is a cascade of morphological analysis, POS tagging and chunking. However, there are some cases where the segmentation granularity of the morphological analysis results contradicts the building units of NEs, so that extraction of some NEs is inherently impossible in this setting. To cope with the unit problem, we propose a character-based chunking method. Firstly, the input sentence is analyzed redundantly by a statistical morphological analyzer to produce multiple (n-best) answers. Then, each character is annotated with its character type and the possible POS tags of the top n-best answers. Finally, a support vector machine-based chunker picks up some portions of the input sentence as NEs. This method introduces richer information to the chunker than previous methods based on a single morphological analysis result. We apply our method to the IREX NE extraction task. The cross-validation result of an F-measure of 87.2 shows the superiority and effectiveness of the method.
1 Introduction
Named Entity (NE) extraction aims at identifying proper nouns and numerical expressions in a text, such as persons, locations, organizations, dates, and so on. This is an important subtask of document processing like information extraction and question answering.
A common standard data set for Japanese NE extraction is provided by the IREX workshop (IREX Committee, editor, 1999). Generally, Japanese NE extraction is done in the following steps: Firstly, a Japanese text is segmented into words and annotated with POS tags by a morphological analyzer. Then, a chunker brings together the words into NE chunks based on contextual information. However, such a straightforward method cannot extract NEs whose segmentation boundary contradicts that of the morphological analysis outputs. For example, the sentence in Figure 1 is segmented into the word units "Koizumi Jun'ichiro / Prime-Minister / particle / September / particle / visiting-North-Korea" by a morphological analyzer. "Koizumi Jun'ichiro" (family and first names) as a person name and "September" as a date can be extracted by combining word units. On the other hand, the abbreviation of North Korea cannot be extracted as a location name because it is contained within the single word unit "visiting North Korea". Figure 1 illustrates the example with an English translation.

Some previous works try to cope with the word unit problem: Uchimoto (Uchimoto et al., 2000) introduces transformation rules to modify the word units given by a morphological analyzer. Isozaki (Isozaki and Kazawa, 2002) controls the parameters of a statistical morphological analyzer so as to produce more fine-grained output. These methods are used as a preprocessing step for chunking. By contrast, we propose a more straightforward method in which we perform the chunking process based on character units. Each character receives annotations with its character type and multiple POS information of the words found by a morphological analyzer. We make use of the redundant outputs of the morphological analysis as the base features for the chunker, introducing more information-rich features. We use a support vector machine (SVM)-based chunker, yamcha (Kudo and Matsumoto, 2001), for the chunking process. Our method achieves a better score than all systems previously reported for the IREX NE extraction task.
Section 2 presents the IREX NE extraction task. Section 3 describes our method in detail. In Section 4, we show the results of experiments, and finally we give conclusions in Section 5.
2 IREX NE extraction task
The task of NE extraction in the IREX workshop is to recognize eight NE types as shown in Table 1 (IREX Committee, editor, 1999). In their definitions, "ARTIFACT" contains book titles, laws, brand names and so on. The task can be defined as a chunking problem to identify the word sequences which compose NEs.

Example Sentence:
Koizumi Jun'ichiro / Prime-Minister / particle / September / particle / visiting-North-Korea
(Prime Minister Koizumi Jun'ichiro will visit North Korea in September.)

Named Entities in the Sentence:
"Koizumi Jun'ichiro" / PERSON
"September" / DATE
"North Korea" / LOCATION

Figure 1: Example of the word unit problem

IOB1: B-PERSON I-PERSON O O O B-LOCATION B-LOCATION O
IOE1: I-PERSON E-PERSON O O O E-LOCATION E-LOCATION O
SE: (tags over the sentence "Prime Minister Koizumi ... between Japan and North Korea")

Figure 2: Examples of NE tag sets

Table 1: Examples of NEs in IREX

The chunking problem is solved by annotating chunk tags to tokens. Five chunk tag sets, IOB1, IOB2, IOE1, IOE2 (Ramshaw and Marcus, 1995) and SE (Uchimoto et al., 2000), are commonly used. In the IOB1 and IOB2 models, three tags I, O and B are used, meaning inside, outside and beginning of a chunk. In IOB1, B is used only at the beginning of a chunk that immediately follows another chunk, while in IOB2, B is always used at the beginning of a chunk. IOE1 and IOE2 use an E tag instead of B and are almost the same as IOB1 and IOB2 except that the end points of chunks are tagged with E. In the SE model, S is tagged only to one-symbol chunks, and B, I and E denote exactly the beginning, intermediate and end points of a chunk. Generally, the words given by the single output of a morphological analyzer are used as the units for chunking. By contrast, we take characters as the units and annotate a tag on each character.
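As a concrete sketch, the hypothetical helpers below (not from the paper) encode character-level NE spans in the IOB2 and SE schemes; the other tag sets differ only in when B or E is used:

```python
def encode_iob2(length, spans):
    """Encode NE spans as character-level IOB2 tags.

    `length` is the number of characters in the sentence; `spans` is a
    list of (start, end, ne_type) with `end` exclusive.  Hypothetical
    data layout, for illustration only.
    """
    tags = ["O"] * length
    for start, end, ne_type in spans:
        tags[start] = "B-" + ne_type          # B always opens a chunk in IOB2
        for i in range(start + 1, end):
            tags[i] = "I-" + ne_type
    return tags

def encode_se(length, spans):
    """Encode the same spans in the SE scheme: S for one-character
    chunks, otherwise B ... I ... E."""
    tags = ["O"] * length
    for start, end, ne_type in spans:
        if end - start == 1:
            tags[start] = "S-" + ne_type
        else:
            tags[start] = "B-" + ne_type
            for i in range(start + 1, end - 1):
                tags[i] = "I-" + ne_type
            tags[end - 1] = "E-" + ne_type
    return tags

# 8-character sentence: a 2-char PERSON at 0-2 and a 2-char LOCATION at 5-7
print(encode_iob2(8, [(0, 2, "PERSON"), (5, 7, "LOCATION")]))
# → ['B-PERSON', 'I-PERSON', 'O', 'O', 'O', 'B-LOCATION', 'I-LOCATION', 'O']
```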
Figure 2 shows examples of character-based NE annotations according to the five tag sets. The person name and the two location names in the sentence are annotated as NEs. While the detailed explanation of the tags will be given later, note that an NE tag is a pair of an NE type and a chunk tag.
3 Method
In this section, we describe our method for Japanese NE extraction. The method is based on the following three steps:

1. A statistical morphological/POS analyzer is applied to the input sentence and produces the POS tags of the n-best answers.
2. Each character in the sentence is annotated with its character type and multiple POS tag information according to the n-best answers.
3. Using the annotated features, NEs are extracted by an SVM-based chunker.

Now, we illustrate each of the three steps in more detail.
3.1 Japanese Morphological Analysis
Our Japanese morphological/POS analysis is based on a Markov model. Morphological/POS analysis can be defined as the determination of the POS tag sequence T = t_1 ... t_n once a segmentation into a word sequence W = w_1 ... w_n is given. The goal is to find the POS tag and word sequences T and W that maximize the following probability:

    (T, W) = argmax_{T, W} P(T, W)

Bayes' rule allows P(T, W) to be decomposed as the product of tag and word probabilities:

    P(T, W) = P(W | T) P(T)

We introduce the approximations that the word probability is conditioned only on the tag of the word, and that the tag probability is determined only by the immediately preceding tag:

    P(W | T) ≈ Π_i p(w_i | t_i),    P(T) ≈ Π_i p(t_i | t_{i-1})

The probabilities are estimated from the frequencies in tagged corpora using Maximum Likelihood Estimation. Using these parameters, the most probable tag and word sequences are determined by the Viterbi algorithm.
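The maximization above can be sketched with a toy Viterbi decoder. All tags, romanized words and probabilities below are invented for illustration; the real analyzer works on a word lattice with corpus-estimated parameters:

```python
import math

def viterbi(words, tags, trans, emit, init):
    """Find the tag sequence maximizing prod p(t_i | t_{i-1}) p(w_i | t_i),
    working in negative-log (cost) space as described in the text.
    trans[(a, b)], emit[(t, w)] and init[t] are toy probabilities."""
    cost = lambda p: -math.log(p)
    # Initialization: cost of the first word under each tag.
    best = {t: (cost(init[t]) + cost(emit.get((t, words[0]), 1e-8)), [t])
            for t in tags}
    for w in words[1:]:
        nxt = {}
        for t in tags:
            # Best predecessor for tag t at this word.
            c, path = min(
                (best[p][0] + cost(trans.get((p, t), 1e-8))
                 + cost(emit.get((t, w), 1e-8)), best[p][1])
                for p in tags)
            nxt[t] = (c, path + [t])
        best = nxt
    return min(best.values())[1]

tags = ["Noun", "Particle", "Verb"]
init = {"Noun": 0.8, "Particle": 0.1, "Verb": 0.1}
trans = {("Noun", "Particle"): 0.6, ("Particle", "Verb"): 0.7,
         ("Noun", "Verb"): 0.2, ("Noun", "Noun"): 0.2}
emit = {("Noun", "shushou"): 0.3, ("Particle", "wa"): 0.5,
        ("Verb", "houmon-suru"): 0.4}
print(viterbi(["shushou", "wa", "houmon-suru"], tags, trans, emit, init))
# → ['Noun', 'Particle', 'Verb']
```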
In practice, we use log likelihood as cost; maximizing probabilities means minimizing costs. In our method, the redundant analysis output means the top n-best answers within a certain cost width. The n-best answers are picked up for each character in the order of the accumulated cost from the beginning of the sentence. Note that if the difference between the costs of the best answer and the n-th best answer exceeds a predefined cost width, we abandon the n-th best answer. The cost width is defined based on the lowest probability among all events occurring in the training data.
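The cost-width pruning rule can be sketched as follows; `prune_nbest` and its data layout are hypothetical, not the analyzer's actual interface:

```python
def prune_nbest(answers, n, cost_width):
    """Keep at most the top-n answers whose cost is within `cost_width`
    of the best answer.  `answers` is a list of (cost, analysis) pairs;
    lower cost (= higher probability) is better."""
    answers = sorted(answers)          # ascending cost
    best_cost = answers[0][0]
    return [(c, a) for c, a in answers[:n] if c - best_cost <= cost_width]

cands = [(10.0, "A"), (11.5, "B"), (13.9, "C"), (25.0, "D")]
print(prune_nbest(cands, 3, 5.0))   # D falls outside both the top-3 and the width
# → [(10.0, 'A'), (11.5, 'B'), (13.9, 'C')]
```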
3.2 Feature Extraction for Chunking
From the output of the redundant analysis, each character receives a number of features. POS tag information is subcategorized so as to encode the relative positions of characters within a word. For encoding the position we employ the SE tag model. A character is then tagged with the pair of its POS tag and its position tag within a word as one feature. For example, the characters at the initial, intermediate and final positions of a common noun (Noun-General) are represented as "Noun-General-B", "Noun-General-I" and "Noun-General-E", respectively. The list of tags for positions in a word is shown in Table 2. Note that an O tag is not necessary, since every character is part of a certain word.
Character types are also used as features. We define seven character types as listed in Table 3.
Figure 3 shows an example of the features used in the chunking process.
Table 2: Tags for positions in a word

Tag  Description
S    one-character word
B    first character in a multi-character word
E    last character in a multi-character word
I    intermediate character in a multi-character word (only for words longer than 2 chars)
Table 3: Character types

Tag     Description
ZSPACE  space
ZDIGIT  digit
ZLLET   lowercase alphabetical letter
ZULET   uppercase alphabetical letter
HIRAG   hiragana
KATAK   katakana
OTHER   other characters
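A rough sketch of the two feature extractors, assuming Python's `unicodedata` naming for kana; the tag names ZDIGIT, ZULET and KATAK are inferred from the naming pattern of Table 3, and the exact type boundaries are assumptions:

```python
import unicodedata

def char_type(ch):
    """Map a character to one of the seven character types of Table 3
    (approximate sketch; boundary cases are assumptions)."""
    if ch.isspace():
        return "ZSPACE"
    if ch.isdigit():
        return "ZDIGIT"
    if "a" <= ch <= "z":
        return "ZLLET"
    if "A" <= ch <= "Z":
        return "ZULET"
    name = unicodedata.name(ch, "")
    if name.startswith("HIRAGANA"):
        return "HIRAG"
    if name.startswith("KATAKANA"):
        return "KATAK"
    return "OTHER"

def position_tags(word_len, pos):
    """Subcategorize a POS tag with the SE position scheme of Table 2,
    one tag per character of a word of length `word_len`."""
    if word_len == 1:
        return [pos + "-S"]
    return [pos + "-B"] + [pos + "-I"] * (word_len - 2) + [pos + "-E"]

print(char_type("9"), char_type("か"), char_type("K"))   # → ZDIGIT HIRAG ZULET
print(position_tags(3, "Noun-General"))
# → ['Noun-General-B', 'Noun-General-I', 'Noun-General-E']
```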
3.3 Support Vector Machine-based Chunking
We used the chunker yamcha (Kudo and Matsumoto, 2001), which is based on support vector machines (Vapnik, 1998). Below we present support vector machine-based chunking briefly.

Suppose we have a set of training data for a binary classification problem: (x_1, y_1), ..., (x_N, y_N), where x_i is the feature vector of the i-th sample in the training data and y_i ∈ {+1, -1} is the label of the sample. The goal is to find a decision function which accurately predicts y for an unseen x. A support vector machine classifier gives the decision function f(x) = sign(g(x)) for an input vector x, where

    g(x) = Σ_{z_i ∈ SV} α_i y_i K(x, z_i) + b

f(x) = +1 means that x is a positive member, and f(x) = -1 means that x is a negative member. The vectors z_i are called support vectors. The support vectors and the other constants α_i and b are determined by solving a quadratic programming problem. K(x, z) is a kernel function which maps vectors into a higher dimensional space. We use the polynomial kernel of degree 2 given by K(x, z) = (x · z + 1)^2.

To facilitate chunking tasks with SVMs, we have to extend the binary classifiers to n-class classifiers. There are two well-known methods used for the extension, the "One-vs-Rest method" and the "Pairwise method". In the One-vs-Rest method, we prepare n binary classifiers, one between each class and the rest of the classes. In the Pairwise method, we prepare binary classifiers between all pairs of classes.
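The decision function can be sketched in pure Python. The support vectors, alphas and bias below are hand-picked toy values rather than the result of quadratic-programming training:

```python
def poly2_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (x . z + 1)^2."""
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** 2

def svm_decide(x, support_vectors, alphas, labels, b):
    """Decision function f(x) = sign(sum_i alpha_i y_i K(x, z_i) + b).
    In a trained model, alphas and b come from solving the QP problem;
    here they are toy numbers for illustration."""
    g = sum(a * y * poly2_kernel(x, z)
            for a, y, z in zip(alphas, labels, support_vectors)) + b
    return 1 if g >= 0 else -1

svs = [[1.0, 0.0], [0.0, 1.0]]          # two toy support vectors
print(svm_decide([1.0, 0.0], svs, [0.5, 0.5], [1, -1], 0.0))   # → 1
```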
Position  Char. Type  POS (Best)                  POS (2nd)                    POS (3rd)              NE tag
1         OTHER       Noun-Proper-Name-Surname-B  Prefix-Nominal-S             Noun-General-S         B-PERSON
2         OTHER       Noun-Proper-Name-Surname-E  Noun-Proper-Place-General-E  Noun-Proper-General-E  I-PERSON
3         OTHER       Noun-General-E              Noun-Suffix-General-S        *                      O
4         HIRAG       Particle-Case-General-S     *                            *                      O

Figure 3: An example of features for chunking
Chunking is done deterministically either from the beginning or the end of the sentence. Figure 3 illustrates a snapshot of the chunking procedure. Two-character contexts on both sides are referred to. Information on the two preceding NE tags is also used, since the chunker has already determined them and they are available. In the example, to infer the NE tag ("O") at the current position, the chunker uses the features appearing within the solid box.
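A sketch of assembling the features inside the "solid box"; the function name and data layout are hypothetical, and only the best-answer POS column is included for brevity:

```python
def features_at(i, chars, char_types, pos_best, prev_ne_tags, window=2):
    """Collect the features the chunker sees at position i: character,
    character type and POS features in a two-character window on both
    sides, plus the two already-decided preceding NE tags."""
    feats = []
    for d in range(-window, window + 1):
        j = i + d
        if 0 <= j < len(chars):
            feats.append(f"char[{d}]={chars[j]}")
            feats.append(f"type[{d}]={char_types[j]}")
            feats.append(f"pos[{d}]={pos_best[j]}")
    # The two NE tags already decided by the chunker.
    for k, tag in enumerate(prev_ne_tags[-2:], start=-2):
        feats.append(f"ne[{k}]={tag}")
    return feats

feats = features_at(1, list("abc"), ["ZLLET"] * 3,
                    ["N-B", "N-I", "N-E"], ["O", "B-PERSON"])
print(feats[:3])   # → ['char[-1]=a', 'type[-1]=ZLLET', 'pos[-1]=N-B']
```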
3.4 The effect of n-best answers
The model copes with the problem of word segmentation by character-based chunking. Furthermore, we introduce n-best answers as features for chunking to capture the following behavior of the morphological analysis. Ambiguity of word segmentation occurs in compound words. When both longer and shorter unit words are included in the lexicon, the longer unit words are more likely to be output by the morphological analyzer, and the shorter units tend to be hidden behind the longer unit words. However, introducing the shorter unit words is necessary for named entity extraction to generalize the model, because the shorter units are shared by many compound words. Figure 4 shows an example in which the shorter units are effective for NE extraction: "Japan" is extracted as a location by the second best answer, namely "Noun-Proper-Place-Country".

The unknown word problem is also alleviated by the n-best answers. Contextual information in a Markov model is lost at the position where an unknown word occurs, so the words preceding or succeeding an unknown word tend to be mistagged. However, the correct POS tags occurring in the n-best answers may help to extract the named entity. Figure 5 shows such an example: the beginning of the person name is captured by the best answer at position 1, and the end of the person name is captured by the second best answer at position 5.
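The way shorter units surface in lower-ranked answers can be sketched by spreading each ranked segmentation onto characters, as in step 2 of the method. The segmentations and POS names below are invented, mimicking the situation of Figure 4 where a country name hides inside a longer unit:

```python
def annotate_nbest_pos(length, analyses):
    """Annotate each character with one POS-with-position feature per
    ranked analysis ('*' when no word covers it in that answer).
    `analyses` is the n-best list; each analysis is a list of
    (start, end, pos) word entries with `end` exclusive."""
    columns = []
    for analysis in analyses:
        col = ["*"] * length
        for start, end, pos in analysis:
            if end - start == 1:
                col[start] = pos + "-S"
            else:
                col[start] = pos + "-B"
                for i in range(start + 1, end - 1):
                    col[i] = pos + "-I"
                col[end - 1] = pos + "-E"
        columns.append(col)
    return list(zip(*columns))          # one row of features per character

best = [(0, 2, "Noun-Proper-Place-Region"), (2, 4, "Verb-General")]
second = [(0, 1, "Prefix-Nominal"), (1, 2, "Noun-Proper-Place-Country"),
          (2, 4, "Verb-General")]
# Character 1 carries the shorter country-name unit only in the 2nd answer.
print(annotate_nbest_pos(4, [best, second])[1])
# → ('Noun-Proper-Place-Region-E', 'Noun-Proper-Place-Country-S')
```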
4 Evaluation
4.1 Data
We use the CRL NE data (IREX Committee, editor, 1999) for the evaluation of our method. The CRL NE data includes 1,174 newspaper articles and 19,262 NEs. We perform five-fold cross-validation on several settings to investigate the length of the contextual features, the size of the redundant morphological analysis, feature selection, and the degree of the polynomial kernel functions. For the chunk tag scheme we use the IOB2 model, since it gave the best result in a pilot study. F-measure is used for evaluation.
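The evaluation metric combines precision and recall; a minimal helper, assuming the standard beta = 1 definition, is:

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-measure from true positives, false positives and false
    negatives; beta = 1 gives the harmonic mean of precision and
    recall, the usual setting for NE evaluation."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# 80 correct NEs, 10 spurious, 20 missed.
print(round(f_measure(80, 10, 20), 4))   # → 0.8421
```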
4.2 The length of contextual features
Firstly, we compare the extraction accuracies of the models obtained by changing the length of the contextual features and the direction of chunking. Table 4 shows the result in accuracy for each of the NE types as well as the total accuracy over all NEs. For example, "L2R2" denotes the model that uses the features of two preceding and two succeeding characters. "For" and "Back" denote the chunking direction: "For" specifies chunking from left to right, and "Back" from right to left.

For all NE types except "TIME", the "Back" direction gives better accuracy than the "For" direction. This is because suffixes are a crucial feature for NE extraction. The "For" direction gives better accuracy for "TIME", since "TIME" expressions often contain prefixes such as "a.m." and "p.m.".

"L2R2" gives the best accuracy for most of the NE types. For "ORGANIZATION", the model needs a longer contextual window. The reason will be that the key prefixes and suffixes of this NE type, such as "company limited" and "research institute", are longer.
4.3 The depth of redundant morphological analysis

Table 5 shows the results when we change the depth (the value n of the n-best answers) of the redundant morphological analysis.
Redundant outputs of the morphological analysis slightly improve the accuracy of NE extraction, except for numeral expressions. The best answer alone seems to be enough to extract numeral expressions other than "MONEY", because numeral expressions do not cause many errors in morphological analysis. To extract "MONEY", the model needs more redundant output of the morphological analysis. A typical error occurs with "Canadian dollars" (MONEY), which is not included in the training data and is analyzed as "Canada" (LOCATION). A similar error occurs with "Hong Kong dollars" and so on.
4.4 Feature selection
We use POS tags, characters, character types and NE tags as features for chunking. To evaluate how effective they are, we compare models trained with different feature combinations.
Figure 5: An example of the unknown word problem. Each of positions 1-5 is annotated with its POS (Best), POS (2nd) and NE tag features; the POS tags in the answers include Noun-General, Noun-Suffix-General and Adjective, and the PERSON tag is recovered from a mixture of the best and second best answers.
Table 4: The length of contextual features and the extraction accuracy (F-measure)

Direction:             For                    Back
Context:       L1R1   L2R2   L3R3     L1R1   L2R2   L3R3
ARTIFACT      29.74  42.17  43.90    45.59  49.58  47.82
DATE          84.98  91.16  92.47    90.22  93.97  93.41
LOCATION      80.16  84.07  85.75    86.62  87.75  87.61
MONEY         43.46  59.88  72.53    93.30  93.85  93.60
ORGANIZATION  66.06  72.63  75.55    74.80  78.33  79.95
PERCENT       67.66  83.77  85.26    95.96  96.06  94.16
PERSON        83.44  85.35  86.31    84.98  87.19  87.65
TIME          88.21  89.82  89.54    87.54  88.33  88.08
ALL           83.72  86.19  86.02    76.65  82.12  84.16
Thesaurus experiment (direction "For", F-measure):

              without thesaurus  with thesaurus
ARTIFACT               50.06          49.15
DATE                   91.19          91.78
LOCATION               87.61          88.59
MONEY                  61.62          64.58
ORGANIZATION           79.27          80.37
PERCENT                86.23          86.64
PERSON                 87.40          87.73
TIME                   90.54          90.19
ALL                    82.58          83.58
Settings: "L2R2" contextual features, 2-best answers of redundant morphological analysis, One-vs-Rest method with features: POS, characters, character types and NE tags.
In the experiments above, we follow the features used in the preceding work (Yamada et al., 2002). Isozaki (Isozaki and Kazawa, 2002) introduces a thesaurus – NTT Goi Taikei (Ikehara et al., 1999) – to augment the features.
Table 5: The depth of redundant analysis and the extraction accuracy (direction "Back", F-measure)

Pairwise Method
              best   2-best  3-best  4-best
ARTIFACT      44.37  43.57   42.17   42.10
DATE          93.81  94.23   94.14   93.71
LOCATION      84.35  84.20   84.07   83.92
MONEY         93.89  94.28   95.82   95.96
ORGANIZATION  73.83  73.71   72.63   72.46
PERCENT       97.20  96.76   96.31   96.81
PERSON        86.23  85.65   85.35   85.22
TIME          88.22  87.72   87.47   87.77
ALL           86.25  86.30   86.19   86.08

One-vs-Rest Method
              best   2-best  3-best  4-best
ARTIFACT      43.11  41.12   39.84   38.65
DATE          94.18  94.18   93.97   93.83
LOCATION      84.72  84.67   84.31   84.15
MONEY         93.79  93.67   93.85   95.47
ORGANIZATION  74.37  73.70   72.74   72.73
PERCENT       97.09  96.02   96.06   96.28
PERSON        85.92  86.03   85.51   85.41
TIME          89.04  88.07   88.33   88.32
ALL           86.40  86.35   86.11   86.07
Table 9: F-measure of the best combined model per NE type (ARTIFACT 50.16, LOCATION 88.57, ORGANIZATION 80.44, PERSON 87.81, ALL 87.21)
While we must have a fixed feature set across all NE types in the Pairwise method, it is possible to select different feature sets and models when applying the One-vs-Rest method. The best combined model achieves an F-measure of 87.21 (Table 9). The model uses the One-vs-Rest method with the best model for each type shown in Tables 4-8. Table 10 shows the comparison with related works. Our method attains the best result among the previously reported systems.
Previous works report that POS information in a preceding and succeeding two-word window is the most effective for Japanese NE extraction. Our current work disproves this widespread belief about the contextual feature: in our experiments, a preceding and succeeding two- or three-character window is the most effective.
Our method employs exactly the same chunker as the work by Yamada et al. (2002). To see the influence of boundary contradictions between morphological analysis and NEs, they experimented with an ideal setting in which the morphological analysis provides perfect results for the NE chunker. Their result shows an F-measure of 85.1 on the same data set as ours. These results show that our method solves more than the word unit problem compared with their results.
