Japanese Named Entity Extraction with Redundant Morphological Analysis


Masayuki Asahara and Yuji Matsumoto
Graduate School of Information Science, Nara Institute of Science and Technology, Japan
masayu-a,matsu@is.aist-nara.ac.jp
Abstract
Named Entity (NE) extraction is an important subtask of document processing such as information extraction and question answering. A typical method used for NE extraction of Japanese texts is a cascade of morphological analysis, POS tagging and chunking. However, there are some cases where the segmentation granularity of the morphological analysis results contradicts the building units of NEs, so that extraction of some NEs is inherently impossible in this setting. To cope with the unit problem, we propose a character-based chunking method. Firstly, the input sentence is analyzed redundantly by a statistical morphological analyzer to produce multiple (n-best) answers. Then, each character is annotated with its character type and the possible POS tags of the top n-best answers. Finally, a support vector machine-based chunker picks up some portions of the input sentence as NEs. This method introduces richer information to the chunker than previous methods based on a single morphological analysis result. We apply our method to the IREX NE extraction task. The cross-validation result of an F-measure of 87.2 shows the superiority and effectiveness of the method.
1 Introduction
Named Entity (NE) extraction aims at identifying proper nouns and numerical expressions in a text, such as persons, locations, organizations, dates, and so on. This is an important subtask of document processing like information extraction and question answering.
A common standard data set for Japanese NE extraction is provided by the IREX workshop (IREX Committee, editor, 1999). Generally, Japanese NE extraction is done in the following steps: Firstly, a Japanese text is segmented into words and annotated with POS tags by a morphological analyzer. Then, a chunker brings together the words into NE chunks based on contextual information. However, such a straightforward method cannot extract NEs whose segmentation boundary contradicts that of the morphological analysis outputs. For example, the sentence in Figure 1 is segmented into the word units "Koizumi Jun'ichiro / Prime-Minister / particle / September / particle / visiting-North-Korea" by a morphological analyzer. "Koizumi Jun'ichiro" (family and first names) as a person name and "September" as a date can be extracted by combining word units. On the other hand, the abbreviation of North Korea cannot be extracted as a location name because it is contained within the single word unit "visiting North Korea". Figure 1 illustrates the example with an English translation.

Some previous works try to cope with the word unit problem: Uchimoto (Uchimoto et al., 2000) introduces transformation rules to modify the word units given by a morphological analyzer. Isozaki (Isozaki and Kazawa, 2002) controls the parameters of a statistical morphological analyzer so as to produce more fine-grained output. These methods are used as a preprocessing step for chunking. By contrast, we propose a more straightforward method in which we perform the chunking process based on character units. Each character receives annotations with its character type and multiple POS information of the words found by a morphological analyzer. We make use of the redundant outputs of the morphological analysis as the base features for the chunker, introducing more information-rich features. We use a support vector machine (SVM)-based chunker, yamcha (Kudo and Matsumoto, 2001), for the chunking process. Our method achieves a better score than all systems previously reported for the IREX NE extraction task.
Section 2 presents the IREX NE extraction task. Section 3 describes our method in detail. In Section 4, we show the results of experiments, and finally we give conclusions in Section 5.
2 IREX NE extraction task
The task of NE extraction in the IREX workshop is to recognize eight NE types as shown in Table 1 (IREX Committee, editor, 1999). In their definitions, "ARTIFACT" contains book titles, laws, brand names and so on. The task can be defined as a chunking problem to identify the word sequences which compose NEs.

Example Sentence:
Koizumi Jun'ichiro / Prime-Minister / particle / September / particle / visiting-North-Korea
(Prime Minister Koizumi Jun'ichiro will visit North Korea in September.)

Named Entities in the Sentence:
"Koizumi Jun'ichiro" / PERSON
"September" / DATE
"North Korea" / LOCATION

Figure 1: Example of the word unit problem

IOB1: B-PERSON I-PERSON O O O B-LOCATION B-LOCATION O
IOE1: I-PERSON E-PERSON O O O E-LOCATION E-LOCATION O
SE: (tags over the sentence "Prime Minister Koizumi ... between Japan and North Korea")

Figure 2: Examples of NE tag sets

Table 1: Examples of NEs in IREX

The chunking problem is solved by annotating chunk tags to tokens. Five chunk tag sets, IOB1, IOB2, IOE1, IOE2 (Ramshaw and Marcus, 1995) and SE (Uchimoto et al., 2000), are commonly used. In the IOB1 and IOB2 models, three tags I, O and B are used, meaning inside, outside and beginning of a chunk. In IOB1, B is used only at the beginning of a chunk that immediately follows another chunk, while in IOB2, B is always used at the beginning of a chunk. IOE1 and IOE2 use an E tag instead of B and are almost the same as IOB1 and IOB2 except that the end points of chunks are tagged with E. In the SE model, S is tagged only to one-symbol chunks, and B, I and E denote exactly the beginning, intermediate and end points of a chunk. Generally, the words given by the single output of a morphological analyzer are used as the units for chunking. By contrast, we take characters as the units and annotate a tag on each character.
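As a concrete sketch, the hypothetical helpers below (not from the paper) encode character-level NE spans in the IOB2 and SE schemes; the other tag sets differ only in when B or E is used:

```python
def encode_iob2(length, spans):
    """Encode NE spans as character-level IOB2 tags.

    `length` is the number of characters in the sentence; `spans` is a
    list of (start, end, ne_type) with `end` exclusive.  Hypothetical
    data layout, for illustration only.
    """
    tags = ["O"] * length
    for start, end, ne_type in spans:
        tags[start] = "B-" + ne_type          # B always opens a chunk in IOB2
        for i in range(start + 1, end):
            tags[i] = "I-" + ne_type
    return tags

def encode_se(length, spans):
    """Encode the same spans in the SE scheme: S for one-character
    chunks, otherwise B ... I ... E."""
    tags = ["O"] * length
    for start, end, ne_type in spans:
        if end - start == 1:
            tags[start] = "S-" + ne_type
        else:
            tags[start] = "B-" + ne_type
            for i in range(start + 1, end - 1):
                tags[i] = "I-" + ne_type
            tags[end - 1] = "E-" + ne_type
    return tags

# 8-character sentence: a 2-char PERSON at 0-2 and a 2-char LOCATION at 5-7
print(encode_iob2(8, [(0, 2, "PERSON"), (5, 7, "LOCATION")]))
# → ['B-PERSON', 'I-PERSON', 'O', 'O', 'O', 'B-LOCATION', 'I-LOCATION', 'O']
```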
Figure 2 shows examples of character-based NE annotations according to the five tag sets. The person name and the two location names in the sentence are annotated as NEs. While the detailed explanation of the tags will be given later, note that an NE tag is a pair of an NE type and a chunk tag.
3 Method
In this section, we describe our method for Japanese NE extraction. The method is based on the following three steps:

1. A statistical morphological/POS analyzer is applied to the input sentence and produces the POS tags of the n-best answers.
2. Each character in the sentence is annotated with its character type and multiple POS tag information according to the n-best answers.
3. Using the annotated features, NEs are extracted by an SVM-based chunker.

Now, we illustrate each of the three steps in more detail.
3.1 Japanese Morphological Analysis
Our Japanese morphological/POS analysis is based on a Markov model. Morphological/POS analysis can be defined as the determination of the POS tag sequence T = t_1 ... t_n once a segmentation into a word sequence W = w_1 ... w_n is given. The goal is to find the POS tag and word sequences T and W that maximize the following probability:

    (T, W) = argmax_{T, W} P(T, W)

Bayes' rule allows P(T, W) to be decomposed as the product of tag and word probabilities:

    P(T, W) = P(W | T) P(T)

We introduce the approximations that the word probability is conditioned only on the tag of the word, and that the tag probability is determined only by the immediately preceding tag:

    P(W | T) ≈ Π_i p(w_i | t_i),    P(T) ≈ Π_i p(t_i | t_{i-1})

The probabilities are estimated from the frequencies in tagged corpora using Maximum Likelihood Estimation. Using these parameters, the most probable tag and word sequences are determined by the Viterbi algorithm.
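The maximization above can be sketched with a toy Viterbi decoder. All tags, romanized words and probabilities below are invented for illustration; the real analyzer works on a word lattice with corpus-estimated parameters:

```python
import math

def viterbi(words, tags, trans, emit, init):
    """Find the tag sequence maximizing prod p(t_i | t_{i-1}) p(w_i | t_i),
    working in negative-log (cost) space as described in the text.
    trans[(a, b)], emit[(t, w)] and init[t] are toy probabilities."""
    cost = lambda p: -math.log(p)
    # Initialization: cost of the first word under each tag.
    best = {t: (cost(init[t]) + cost(emit.get((t, words[0]), 1e-8)), [t])
            for t in tags}
    for w in words[1:]:
        nxt = {}
        for t in tags:
            # Best predecessor for tag t at this word.
            c, path = min(
                (best[p][0] + cost(trans.get((p, t), 1e-8))
                 + cost(emit.get((t, w), 1e-8)), best[p][1])
                for p in tags)
            nxt[t] = (c, path + [t])
        best = nxt
    return min(best.values())[1]

tags = ["Noun", "Particle", "Verb"]
init = {"Noun": 0.8, "Particle": 0.1, "Verb": 0.1}
trans = {("Noun", "Particle"): 0.6, ("Particle", "Verb"): 0.7,
         ("Noun", "Verb"): 0.2, ("Noun", "Noun"): 0.2}
emit = {("Noun", "shushou"): 0.3, ("Particle", "wa"): 0.5,
        ("Verb", "houmon-suru"): 0.4}
print(viterbi(["shushou", "wa", "houmon-suru"], tags, trans, emit, init))
# → ['Noun', 'Particle', 'Verb']
```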
In practice, we use log likelihood as cost; maximizing probabilities means minimizing costs. In our method, the redundant analysis output means the top n-best answers within a certain cost width. The n-best answers are picked up for each character in the order of the accumulated cost from the beginning of the sentence. Note that if the difference between the costs of the best answer and the n-th best answer exceeds a predefined cost width, we abandon the n-th best answer. The cost width is defined based on the lowest probability among all events occurring in the training data.
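The cost-width pruning rule can be sketched as follows; `prune_nbest` and its data layout are hypothetical, not the analyzer's actual interface:

```python
def prune_nbest(answers, n, cost_width):
    """Keep at most the top-n answers whose cost is within `cost_width`
    of the best answer.  `answers` is a list of (cost, analysis) pairs;
    lower cost (= higher probability) is better."""
    answers = sorted(answers)          # ascending cost
    best_cost = answers[0][0]
    return [(c, a) for c, a in answers[:n] if c - best_cost <= cost_width]

cands = [(10.0, "A"), (11.5, "B"), (13.9, "C"), (25.0, "D")]
print(prune_nbest(cands, 3, 5.0))   # D falls outside both the top-3 and the width
# → [(10.0, 'A'), (11.5, 'B'), (13.9, 'C')]
```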
3.2 Feature Extraction for Chunking
From the output of the redundant analysis, each character receives a number of features. POS tag information is subcategorized so as to encode the relative positions of characters within a word. For encoding the position we employ the SE tag model. A character is then tagged with the pair of its POS tag and its position tag within a word as one feature. For example, the characters at the initial, intermediate and final positions of a common noun (Noun-General) are represented as "Noun-General-B", "Noun-General-I" and "Noun-General-E", respectively. The list of tags for positions in a word is shown in Table 2. Note that an O tag is not necessary, since every character is part of a certain word.
Character types are also used as features. We define seven character types as listed in Table 3.
Figure 3 shows an example of the features used in the chunking process.
Table 2: Tags for positions in a word

Tag  Description
S    one-character word
B    first character in a multi-character word
E    last character in a multi-character word
I    intermediate character in a multi-character word (only for words longer than 2 chars)
Table 3: Character types

Tag     Description
ZSPACE  space
ZDIGIT  digit
ZLLET   lowercase alphabetical letter
ZULET   uppercase alphabetical letter
HIRAG   hiragana
KATAK   katakana
OTHER   other characters
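A rough sketch of the two feature extractors, assuming Python's `unicodedata` naming for kana; the tag names ZDIGIT, ZULET and KATAK are inferred from the naming pattern of Table 3, and the exact type boundaries are assumptions:

```python
import unicodedata

def char_type(ch):
    """Map a character to one of the seven character types of Table 3
    (approximate sketch; boundary cases are assumptions)."""
    if ch.isspace():
        return "ZSPACE"
    if ch.isdigit():
        return "ZDIGIT"
    if "a" <= ch <= "z":
        return "ZLLET"
    if "A" <= ch <= "Z":
        return "ZULET"
    name = unicodedata.name(ch, "")
    if name.startswith("HIRAGANA"):
        return "HIRAG"
    if name.startswith("KATAKANA"):
        return "KATAK"
    return "OTHER"

def position_tags(word_len, pos):
    """Subcategorize a POS tag with the SE position scheme of Table 2,
    one tag per character of a word of length `word_len`."""
    if word_len == 1:
        return [pos + "-S"]
    return [pos + "-B"] + [pos + "-I"] * (word_len - 2) + [pos + "-E"]

print(char_type("9"), char_type("か"), char_type("K"))   # → ZDIGIT HIRAG ZULET
print(position_tags(3, "Noun-General"))
# → ['Noun-General-B', 'Noun-General-I', 'Noun-General-E']
```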
3.3 Support Vector Machine-based Chunking
We used the chunker yamcha (Kudo and Matsumoto, 2001), which is based on support vector machines (Vapnik, 1998). Below we present support vector machine-based chunking briefly.

Suppose we have a set of training data for a binary classification problem: (x_1, y_1), ..., (x_N, y_N), where x_i is the feature vector of the i-th sample in the training data and y_i ∈ {+1, -1} is the label of the sample. The goal is to find a decision function which accurately predicts y for an unseen x. A support vector machine classifier gives the decision function f(x) = sign(g(x)) for an input vector x, where

    g(x) = Σ_{z_i ∈ SV} α_i y_i K(x, z_i) + b

f(x) = +1 means that x is a positive member, and f(x) = -1 means that x is a negative member. The vectors z_i are called support vectors. The support vectors and the other constants α_i and b are determined by solving a quadratic programming problem. K(x, z) is a kernel function which maps vectors into a higher dimensional space. We use the polynomial kernel of degree 2 given by K(x, z) = (x · z + 1)^2.

To facilitate chunking tasks with SVMs, we have to extend the binary classifiers to n-class classifiers. There are two well-known methods used for the extension, the "One-vs-Rest method" and the "Pairwise method". In the One-vs-Rest method, we prepare n binary classifiers, one between each class and the rest of the classes. In the Pairwise method, we prepare binary classifiers between all pairs of classes.
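The decision function can be sketched in pure Python. The support vectors, alphas and bias below are hand-picked toy values rather than the result of quadratic-programming training:

```python
def poly2_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (x . z + 1)^2."""
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** 2

def svm_decide(x, support_vectors, alphas, labels, b):
    """Decision function f(x) = sign(sum_i alpha_i y_i K(x, z_i) + b).
    In a trained model, alphas and b come from solving the QP problem;
    here they are toy numbers for illustration."""
    g = sum(a * y * poly2_kernel(x, z)
            for a, y, z in zip(alphas, labels, support_vectors)) + b
    return 1 if g >= 0 else -1

svs = [[1.0, 0.0], [0.0, 1.0]]          # two toy support vectors
print(svm_decide([1.0, 0.0], svs, [0.5, 0.5], [1, -1], 0.0))   # → 1
```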
Position  Char. Type  POS (Best)                  POS (2nd)                    POS (3rd)              NE tag
1         OTHER       Noun-Proper-Name-Surname-B  Prefix-Nominal-S             Noun-General-S         B-PERSON
2         OTHER       Noun-Proper-Name-Surname-E  Noun-Proper-Place-General-E  Noun-Proper-General-E  I-PERSON
3         OTHER       Noun-General-E              Noun-Suffix-General-S        *                      O
4         HIRAG       Particle-Case-General-S     *                            *                      O

Figure 3: An example of features for chunking
Chunking is done deterministically either from the beginning or the end of the sentence. Figure 3 illustrates a snapshot of the chunking procedure. Two-character contexts on both sides are referred to. Information on the two preceding NE tags is also used, since the chunker has already determined them and they are available. In the example, to infer the NE tag ("O") at the current position, the chunker uses the features appearing within the solid box.
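A sketch of assembling the features inside the "solid box"; the function name and data layout are hypothetical, and only the best-answer POS column is included for brevity:

```python
def features_at(i, chars, char_types, pos_best, prev_ne_tags, window=2):
    """Collect the features the chunker sees at position i: character,
    character type and POS features in a two-character window on both
    sides, plus the two already-decided preceding NE tags."""
    feats = []
    for d in range(-window, window + 1):
        j = i + d
        if 0 <= j < len(chars):
            feats.append(f"char[{d}]={chars[j]}")
            feats.append(f"type[{d}]={char_types[j]}")
            feats.append(f"pos[{d}]={pos_best[j]}")
    # The two NE tags already decided by the chunker.
    for k, tag in enumerate(prev_ne_tags[-2:], start=-2):
        feats.append(f"ne[{k}]={tag}")
    return feats

feats = features_at(1, list("abc"), ["ZLLET"] * 3,
                    ["N-B", "N-I", "N-E"], ["O", "B-PERSON"])
print(feats[:3])   # → ['char[-1]=a', 'type[-1]=ZLLET', 'pos[-1]=N-B']
```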
3.4 The effect of n-best answers
The model copes with the problem of word segmentation by character-based chunking. Furthermore, we introduce n-best answers as features for chunking to capture the following behavior of the morphological analysis. Ambiguity of word segmentation occurs in compound words. When both longer and shorter unit words are included in the lexicon, the longer unit words are more likely to be output by the morphological analyzer, and the shorter units tend to be hidden behind the longer unit words. However, introducing the shorter unit words is necessary for named entity extraction to generalize the model, because the shorter units are shared by many compound words. Figure 4 shows an example in which the shorter units are effective for NE extraction: "Japan" is extracted as a location by the second best answer, namely "Noun-Proper-Place-Country".

The unknown word problem is also alleviated by the n-best answers. Contextual information in a Markov model is lost at the position where an unknown word occurs, so the words preceding or succeeding an unknown word tend to be mistagged. However, the correct POS tags occurring in the n-best answers may help to extract the named entity. Figure 5 shows such an example: the beginning of the person name is captured by the best answer at position 1, and the end of the person name is captured by the second best answer at position 5.
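The way shorter units surface in lower-ranked answers can be sketched by spreading each ranked segmentation onto characters, as in step 2 of the method. The segmentations and POS names below are invented, mimicking the situation of Figure 4 where a country name hides inside a longer unit:

```python
def annotate_nbest_pos(length, analyses):
    """Annotate each character with one POS-with-position feature per
    ranked analysis ('*' when no word covers it in that answer).
    `analyses` is the n-best list; each analysis is a list of
    (start, end, pos) word entries with `end` exclusive."""
    columns = []
    for analysis in analyses:
        col = ["*"] * length
        for start, end, pos in analysis:
            if end - start == 1:
                col[start] = pos + "-S"
            else:
                col[start] = pos + "-B"
                for i in range(start + 1, end - 1):
                    col[i] = pos + "-I"
                col[end - 1] = pos + "-E"
        columns.append(col)
    return list(zip(*columns))          # one row of features per character

best = [(0, 2, "Noun-Proper-Place-Region"), (2, 4, "Verb-General")]
second = [(0, 1, "Prefix-Nominal"), (1, 2, "Noun-Proper-Place-Country"),
          (2, 4, "Verb-General")]
# Character 1 carries the shorter country-name unit only in the 2nd answer.
print(annotate_nbest_pos(4, [best, second])[1])
# → ('Noun-Proper-Place-Region-E', 'Noun-Proper-Place-Country-S')
```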
4 Evaluation
4.1 Data
We use the CRL NE data (IREX Committee, editor, 1999) for the evaluation of our method. The CRL NE data includes 1,174 newspaper articles and 19,262 NEs. We perform five-fold cross-validation on several settings to investigate the length of the contextual features, the size of the redundant morphological analysis, feature selection, and the degree of the polynomial kernel functions. For the chunk tag scheme we use the IOB2 model, since it gave the best result in a pilot study. F-measure is used for evaluation.
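The evaluation metric combines precision and recall; a minimal helper, assuming the standard beta = 1 definition, is:

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-measure from true positives, false positives and false
    negatives; beta = 1 gives the harmonic mean of precision and
    recall, the usual setting for NE evaluation."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# 80 correct NEs, 10 spurious, 20 missed.
print(round(f_measure(80, 10, 20), 4))   # → 0.8421
```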
4.2 The length of contextual features
Firstly, we compare the extraction accuracies of the models obtained by changing the length of the contextual features and the direction of chunking. Table 4 shows the result in accuracy for each of the NE types as well as the total accuracy over all NEs. For example, "L2R2" denotes the model that uses the features of two preceding and two succeeding characters. "For" and "Back" denote the chunking direction: "For" specifies chunking from left to right, and "Back" from right to left.

For all NE types except "TIME", the "Back" direction gives better accuracy than the "For" direction. This is because suffixes are a crucial feature for NE extraction. The "For" direction gives better accuracy for "TIME", since "TIME" expressions often contain prefixes such as "a.m." and "p.m.".

"L2R2" gives the best accuracy for most of the NE types. For "ORGANIZATION", the model needs a longer contextual window. The reason will be that the key prefixes and suffixes of this NE type, such as "company limited" and "research institute", are longer.
4.3 The depth of redundant morphological analysis

Table 5 shows the results when we change the depth (the value n of the n-best answers) of the redundant morphological analysis.
Redundant outputs of the morphological analysis slightly improve the accuracy of NE extraction, except for numeral expressions. The best answer alone seems to be enough to extract numeral expressions other than "MONEY", because numeral expressions do not cause many errors in morphological analysis. To extract "MONEY", the model needs more redundant output of the morphological analysis. A typical error occurs with "Canadian dollars" (MONEY), which is not included in the training data and is analyzed as "Canada" (LOCATION). A similar error occurs with "Hong Kong dollars" and so on.
4.4 Feature selection
We use POS tags, characters, character types and NE tags as features for chunking. To evaluate how effective they are, we compare models trained with different feature combinations.
Figure 5: An example of the unknown word problem. Each of positions 1-5 is annotated with its POS (Best), POS (2nd) and NE tag features; the POS tags in the answers include Noun-General, Noun-Suffix-General and Adjective, and the PERSON tag is recovered from a mixture of the best and second best answers.
Table 4: The length of contextual features and the extraction accuracy (F-measure)

Direction:             For                    Back
Context:       L1R1   L2R2   L3R3     L1R1   L2R2   L3R3
ARTIFACT      29.74  42.17  43.90    45.59  49.58  47.82
DATE          84.98  91.16  92.47    90.22  93.97  93.41
LOCATION      80.16  84.07  85.75    86.62  87.75  87.61
MONEY         43.46  59.88  72.53    93.30  93.85  93.60
ORGANIZATION  66.06  72.63  75.55    74.80  78.33  79.95
PERCENT       67.66  83.77  85.26    95.96  96.06  94.16
PERSON        83.44  85.35  86.31    84.98  87.19  87.65
TIME          88.21  89.82  89.54    87.54  88.33  88.08
ALL           83.72  86.19  86.02    76.65  82.12  84.16
Thesaurus experiment (direction "For", F-measure):

              without thesaurus  with thesaurus
ARTIFACT               50.06          49.15
DATE                   91.19          91.78
LOCATION               87.61          88.59
MONEY                  61.62          64.58
ORGANIZATION           79.27          80.37
PERCENT                86.23          86.64
PERSON                 87.40          87.73
TIME                   90.54          90.19
ALL                    82.58          83.58
Settings: "L2R2" contextual features, 2-best answers of redundant morphological analysis, One-vs-Rest method with features: POS, characters, character types and NE tags.
In the experiments above, we follow the features used in the preceding work (Yamada et al., 2002). Isozaki (Isozaki and Kazawa, 2002) introduces a thesaurus – NTT Goi Taikei (Ikehara et al., 1999) – to augment the features.
Table 5: The depth of redundant analysis and the extraction accuracy (direction "Back", F-measure)

Pairwise Method
              best   2-best  3-best  4-best
ARTIFACT      44.37  43.57   42.17   42.10
DATE          93.81  94.23   94.14   93.71
LOCATION      84.35  84.20   84.07   83.92
MONEY         93.89  94.28   95.82   95.96
ORGANIZATION  73.83  73.71   72.63   72.46
PERCENT       97.20  96.76   96.31   96.81
PERSON        86.23  85.65   85.35   85.22
TIME          88.22  87.72   87.47   87.77
ALL           86.25  86.30   86.19   86.08

One-vs-Rest Method
              best   2-best  3-best  4-best
ARTIFACT      43.11  41.12   39.84   38.65
DATE          94.18  94.18   93.97   93.83
LOCATION      84.72  84.67   84.31   84.15
MONEY         93.79  93.67   93.85   95.47
ORGANIZATION  74.37  73.70   72.74   72.73
PERCENT       97.09  96.02   96.06   96.28
PERSON        85.92  86.03   85.51   85.41
TIME          89.04  88.07   88.33   88.32
ALL           86.40  86.35   86.11   86.07
Table 9: F-measure of the best combined model per NE type (ARTIFACT 50.16, LOCATION 88.57, ORGANIZATION 80.44, PERSON 87.81, ALL 87.21)
While we must have a fixed feature set across all NE types in the Pairwise method, it is possible to select different feature sets and models when applying the One-vs-Rest method. The best combined model achieves an F-measure of 87.21 (Table 9). The model uses the One-vs-Rest method with the best model for each type shown in Tables 4-8. Table 10 shows the comparison with related works. Our method attains the best result among the previously reported systems.
Previous works report that POS information in a preceding and succeeding two-word window is the most effective for Japanese NE extraction. Our current work disproves this widespread belief about the contextual feature: in our experiments, a preceding and succeeding two- or three-character window is the most effective.
Our method employs exactly the same chunker as the work by Yamada et al. (2002). To see the influence of boundary contradictions between morphological analysis and NEs, they experimented with an ideal setting in which the morphological analysis provides perfect results for the NE chunker. Their result shows an F-measure of 85.1 on the same data set as ours. These results show that our method solves more than the word unit problem compared with their results.
