Named Entity Recognition with Long Short-Term Memory
James Hammerton
Alfa-Informatica, University of Groningen
Groningen, The Netherlands
james@let.rug.nl
Abstract
In this approach to named entity recognition, a recurrent neural network, known as Long Short-Term Memory, is applied. The network is trained to perform 2 passes on each sentence, outputting its decisions on the second pass. The first pass is used to acquire information for disambiguation during the second pass. SARDNET, a self-organising map for sequences, is used to generate representations for the lexical items presented to the LSTM network, whilst orthogonal representations are used to represent the part of speech and chunk tags.
1 Introduction
In this paper, Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) is applied to named entity recognition, using data from the Reuters Corpus, English Language, Volume 1, and the European Corpus Initiative Multilingual Corpus 1. LSTM is an architecture and training algorithm for recurrent neural networks (RNNs), capable of remembering information over long time periods during the processing of a sequence.
LSTM was applied to an earlier CoNLL shared task, namely clause identification (Hammerton, 2001), although the performance was significantly below that of the other systems. LSTM achieved an fscore of 50.42 on the test data, where other systems' fscores ranged from 62.77 to 80.44. However, not all training data was used in training the LSTM networks. Better performance has since been obtained where the complete training set was used (Hammerton, unpublished), yielding an fscore of 64.66 on the test data.
2 Representing lexical items
An efficient method of representing lexical items is needed. Hammerton (2001; unpublished) employed lexical space (Zavrel and Veenstra, 1996) representations of the words, which are derived from their co-occurrence statistics. Here, however, a different approach is used.
A SARDNET (James and Miikkulainen, 1995), a self-organising map (SOM) for sequences, is trained to form representations of the words, and the resulting representations reflect the morphology of the words.
James and Miikkulainen (1995) provide a detailed description of how SARDNET operates. Briefly, the SARDNET operates in a similar manner to the standard SOM. It consists of a set of inputs and a set of map units. Each map unit contains a set of weights equal in size to the number of inputs. When an input is presented, the map unit with the closest weights to the input vector is chosen as the winner. When processing a sequence, this winning unit is taken out of the competition for subsequent inputs. The activation of a winning unit is set at 1 when it is first chosen and then multiplied by a decay factor (here set at 0.9) for subsequent inputs in the sequence. At the beginning of a new sequence all map units are made available again for the first input. Thus, once a sequence of inputs has been presented, the map units activated as winners indicate which inputs were presented and the activation levels of those units indicate the order of presentation. An advantage of SARDNET is that it can generalise naturally to novel words.
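To make the mechanism concrete, the following is a minimal sketch of SARDNET's sequence-encoding step, assuming a map whose weights have already been trained in the usual SOM fashion. The function and variable names (sardnet_encode, decay, map_weights) are illustrative, not taken from the original implementation.

    import numpy as np

    def sardnet_encode(sequence, map_weights, decay=0.9):
        """Encode a sequence of input vectors (e.g. the characters of a word)
        as a SARDNET activation pattern over the map units.

        sequence    : iterable of input vectors of shape (input_dim,)
        map_weights : array of shape (n_units, input_dim), the trained SOM weights
        decay       : per-step decay applied to previously chosen winners (0.9 here)
        """
        n_units = map_weights.shape[0]
        activation = np.zeros(n_units)       # final representation of the sequence
        available = np.ones(n_units, bool)   # units still in the competition

        for x in sequence:
            # distance from every map unit's weight vector to the current input
            dists = np.linalg.norm(map_weights - x, axis=1)
            dists[~available] = np.inf       # earlier winners are out of the competition
            winner = int(np.argmin(dists))

            activation *= decay              # decay earlier winners: this encodes order
            activation[winner] = 1.0         # the newest winner is set to 1
            available[winner] = False        # remove it for the rest of the sequence

        return activation

A word would be presented as a sequence of character vectors, and the resulting activation vector over the map units is what represents that word to the LSTM network.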
The resulting representations are real-valued vectors of size n, reflecting the size of the map layer in the SARDNET (enough to represent words of up to length n, where n is the size of the map). A SARDNET was trained over a single presentation of all the distinct words that appear in the training and development data for English, and a separate SARDNET was trained on all the distinct words appearing in the training data for German. The generalisation of the map to novel words was just as good with the German map as with the English map, suggesting that training the map only on the English training data words would make little difference to performance. Initially the neighbourhood was set to cover the whole SARDNET and the learning rate was set at 0.4. As each word was presented, the neighbourhood and learning rate were reduced in linear increments, so that at the end of training the learning rate was zero and the neighbourhood was 1. Both the English and German experiments used a SARDNET with 64 units.
Figure 1: A single-celled memory block.
3 Long Short-Term Memory (LSTM)
An LSTM network consists of 3 layers: an input layer, a recurrent hidden layer and an output layer. The hidden layer in LSTM constitutes the main innovation. It consists of one or more memory blocks, each with one or more memory cells. Normally the inputs are connected to all of the cells and gates. The cells are connected to the outputs, and the gates are connected to other cells and gates in the hidden layer.
A single-celled memory block is illustrated in Figure 1. The block consists of an input gate, the memory cell and an output gate. The memory cell is a linear unit with a self-connection with a weight of value 1. When not receiving any input, the cell maintains its current activation over time. The input to the memory cell is passed through a squashing function and gated (multiplied) by the activation of the input gate. The input gate thus controls the flow of activation into the cell.

The memory cell's output passes through a squashing function before being gated by the output gate activation. Thus the output gate controls the activation flow from cells to outputs. During training the gates learn to open and close in order to let new information into the cells and let the cells influence the outputs. The cells otherwise hold onto information unless new information is accepted by the input gate. Training of LSTM networks proceeds by a fusion of back-propagation through time and real-time recurrent learning, details of which can be found in (Hochreiter and Schmidhuber, 1997).
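To make the data flow concrete, here is a minimal sketch of the forward pass of a single-celled memory block as described above (without a forget gate). It is an illustrative reimplementation under our own naming; the choice of squashing functions and weight names are assumptions, and this is not the code used in the experiments.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    class MemoryBlock:
        """One memory block with a single cell, an input gate and an output gate."""

        def __init__(self, n_inputs, rng=np.random.default_rng(0)):
            self.w_cell = rng.normal(0, 0.1, n_inputs)      # cell input weights
            self.w_in_gate = rng.normal(0, 0.1, n_inputs)   # input gate weights
            self.w_out_gate = rng.normal(0, 0.1, n_inputs)  # output gate weights
            self.state = 0.0   # the cell's state, kept by a self-connection of weight 1

        def step(self, x):
            in_gate = sigmoid(self.w_in_gate @ x)    # how much new input to accept
            out_gate = sigmoid(self.w_out_gate @ x)  # how much of the cell to expose
            cell_in = np.tanh(self.w_cell @ x)       # squashed cell input

            # linear unit with a self-connection of weight 1: the old state is
            # retained unchanged and the gated new input is added to it
            self.state = self.state + in_gate * cell_in

            # squash the state and gate it before it reaches the output layer
            return out_gate * np.tanh(self.state)

A full network has several such blocks whose outputs feed the output layer; the forget gates referred to later (the FG option in Table 2) add a third gate that multiplies the stored state.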
六年级上册计算题In artificial tasks LSTM is capable of remembering in-formation for up-to 1000time-steps.It thus tackles one of the most rious problems affect the performance of re-current networks on temporal quence processing tasks.
4 Experiments

Tag      Representation
B-LOC    0010100
B-ORG    0010001
I-LOC    1000100
I-ORG    1000001
O        -

Table 1: Output representations of the NE tags.
Net     Opts               Hidden   Wts
-       -                  8x6      13543
Net2    int                8x6      18087
Net4    int,list           8x6      18442
Net6    int2,lex           8x5      15270
Net8    int2,list,lex,FG   -        -

Table 2: The networks used in the experiments.
For each word, in addition to the SARDNET representation described in Section 2, the network is presented with the following inputs (a sketch of how these might be assembled is given after this list):

–A lexical space (Zavrel and Veenstra, 1996) vector for the current word, used by some networks. This involves computing, for each word, the frequencies with which the most frequent 250 words appear either immediately before or immediately after that word in the training set. The resulting 500 element vectors (250 elements each for the left and right context) are normalised and then mapped onto their top 25 principal components.
–An orthogonal representation of the current part of speech (POS) tag. However, for some networks the input units to which the POS tag is presented perform a form of time integration, as follows. The units are updated at each step from p, the pattern representing the current POS tag, scaled according to s, the length of the current sequence of inputs (twice the length of the current sentence due to the 2-pass processing). By doing this the network receives a representation of the sequence of POS tags presented thus far, integrating the inputs over time.
–An orthogonal representation of the current chunk tag, though with some networks time integration is performed as described above.
–One input indicates which pass through the sentence is in progress.
–Some networks used a list of named entities (NEs) as follows. Some units are set aside corresponding to the categories of NE, 1 unit per category. If the current word occurs in an NE, the unit for that NE's category is activated. If the word occurs in more than one NE, the units for all those NEs' categories are activated. In the case of the English data there were 5 categories of NE (though one category "MO" seems to arise from an error in the data).
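The sketch below illustrates how such an input vector could be assembled for one word. The dimensionalities (n_pos, n_chunk) and the integration scheme (a running average of the tag patterns over the current sequence) are assumptions made for illustration, not details taken from the original setup.

    import numpy as np

    def one_hot(index, size):
        v = np.zeros(size)
        v[index] = 1.0
        return v

    def build_input(word_vec, pos_idx, chunk_idx, pass_no, ne_category_hits,
                    pos_accum, chunk_accum, seq_len,
                    n_pos=45, n_chunk=18, n_ne=5, integrate=True):
        """Assemble the network input for one word at one time step.

        word_vec          : SARDNET (and optionally lexical space) vector for the word
        pos_idx/chunk_idx : indices of the current POS and chunk tags
        pass_no           : 1 or 2 (which pass over the sentence is in progress)
        ne_category_hits  : indices of NE-list categories the word occurs in
        pos_accum/chunk_accum : running representations of the tag sequence so far
        seq_len           : length of the current input sequence (2 x sentence length)
        """
        pos_vec = one_hot(pos_idx, n_pos)
        chunk_vec = one_hot(chunk_idx, n_chunk)

        if integrate:
            # assumed integration scheme: accumulate each tag pattern scaled by the
            # sequence length, giving a summary of the tags presented so far
            pos_accum += pos_vec / seq_len
            chunk_accum += chunk_vec / seq_len
            pos_vec, chunk_vec = pos_accum, chunk_accum

        pass_unit = np.array([1.0 if pass_no == 2 else 0.0])

        ne_units = np.zeros(n_ne)                 # one unit per NE-list category
        ne_units[list(ne_category_hits)] = 1.0

        return np.concatenate([word_vec, pos_vec, chunk_vec, pass_unit, ne_units])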
The networks were trained with a learning rate of 0.3, no momentum and direct connections from the input to the output layers, for 100 iterations. Weight updating occurred after the second pass of each sentence was presented. The best set of weights during training was saved and used for evaluation with the development data. The results reported for each network are averaged over 5 runs from different randomised initial weight settings.
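A rough sketch of the 2-pass presentation and update schedule described above might look as follows; network, forward, accumulate_error and update_weights are placeholder names, and the error-propagation details of the LSTM training algorithm are omitted.

    def train(network, sentences, n_iterations=100, learning_rate=0.3):
        """Each sentence is presented twice; tagging decisions are taken from the
        second pass, and the weights are updated only after the second pass of a
        sentence has been presented."""
        for _ in range(n_iterations):
            for sentence in sentences:        # sentence: list of per-word inputs/targets
                network.reset_state()         # clear the memory cells between sequences
                for pass_no in (1, 2):
                    for word in sentence:
                        output = network.forward(word.inputs(pass_no))
                        if pass_no == 2:
                            network.accumulate_error(output, word.target)
                network.update_weights(learning_rate)   # after the second pass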
Table 2 lists the various networks used in the experiments. The "Net" column lists the names of the networks used. The "Opts" column indicates whether word lists are used (list), whether a 1 word lookahead is used (look), whether lexical space vectors are used (lex), whether the units for the POS tags use time integration as described above (int), and whether time integration is performed on both the units for the POS tags and the units for the chunk tags (int2). Additionally, it indicates whether forget gates were used (FG). The "Hidden" column gives the size of the hidden layer of the networks (e.g. 8x6 means 8 blocks of 6 cells). The "Wts" column gives the number of weights used.
Net        Recall    Range
-          61.42%    52.98
-          62.42%    55.30
-          62.80%    54.41
-          75.27%    69.53
-          75.03%    69.73
-          67.92%    62.08
-          68.04%    62.95
-          76.37%    70.96
Baseline   65.23%    n/a

Table 3: Results of named entity recognition on English development data for networks trained on the English training data. Results are averaged over 5 runs using different initial weights. * indicates use of the list of NEs. Italics indicate best result reported on first submission, whilst bold indicates best result achieved overall.
Table 3 gives the results for extracting named entities from the English development data for these networks. The "Precision", "Recall" and "Fscore" columns show the average scores across 5 runs from different random weight settings. The "Range" column shows the range of fscores produced across the 5 runs used for each network. The Precision gives the percentage of named entities found that were correct, whilst the Recall is the percentage of named entities defined in the data that were found. The Fscore is (2 * Precision * Recall) / (Precision + Recall).

Most options boosted performance. The biggest boosts came from the lexical space vectors and the word lists. The use of forget gates improved performance despite leading to fewer weights being used. Lookahead seems to make no significant difference overall. Only Net8 gets above baseline performance (best fscore = 72.88), but the average performance is lower than the baseline.
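For reference, the Fscore defined above (the balanced F, i.e. the harmonic mean of precision and recall) can be computed with a small helper like this:

    def fscore(precision, recall):
        """Balanced F-score from precision and recall percentages, as defined above."""
        if precision + recall == 0:
            return 0.0
        return (2 * precision * recall) / (precision + recall)

    # e.g. fscore(50.0, 100.0) is roughly 66.67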
Table 4 gives the results for the best network broken down by the type of NE for both the English development and testing data. This is from the best performing run for Net8. Table 4 also depicts the best result from 5 runs of a network configured similarly to Net7 above, using the German data. This did not employ a list of NEs, and the lemmas in the data were ignored. The fscore of 43.50 is almost 13 points higher than the baseline of 30.65. With the German test set the fscore is 47.74, 17 points higher than the baseline of 30.30.
5 Conclusion
An LSTM network was trained on named entity recognition, yielding an fscore just above the baseline performance on English and significantly above the baseline for German. Whilst the just-above-baseline performance for English is disappointing, it is hoped that further work will improve on the results. A number of ways of boosting performance will be looked at, including:
–Increasing the size of the hidden layers will increase the power of the networks, at the risk of overfitting.

–Increasing training times may also increase performance, again at the risk of overfitting.

–Increasing the informativeness of the lexical representations. Given that the number of elements used here is less than the number of characters in the character sets, there should be some scope for boosting performance by increasing the size of the SARDNETs. The representations of different words will then become more distinct from each other.

–The lexical space vectors were derived from a context of +/- 1 word, whereas in earlier work on clause splitting a context of +/- 2 words was used. Using the larger context and/or using more than 25 of the top principal components may again boost performance by incorporating more information into the vectors.

–Further exploitation of the word lists. Whilst the networks are made aware of which categories of named entity the current word can belong to, they are not made aware of how many named entities it belongs to or of what positions in those named entities it could occupy.
Acknowledgements
The LSTM code used here is a modified version of code provided by Fred Cummins. The training of the SARDNETs was done using the PDP++ neural network simulator (u.edu/Resources/PDP++/PDP++.html).
This work is supported by the Connectionist Language Learning Project of the High Performance Computing/Visualisation centre of the University of Groningen.
References
F. A. Gers and J. Schmidhuber. 2000. Long Short-Term Memory Learns Context-Free and Context-Sensitive Languages. Technical Report IDSIA-03-00, IDSIA, Switzerland.
Table 4: Performance of best network from Table 3 on English development and test data by type of NE, and performance of the best run of a network configured similarly to Net7 on German development and test data. (Precision, Recall and F for LOC, MISC, ORG, PER and Overall on the English development, English test, German development and German test sets.)
J. A. Hammerton. 2001. Clause identification with Long Short-Term Memory. In W. Daelemans and R. Zajac, editors, Proceedings of the Workshop on Computational Natural Language Learning (CoNLL 2001), ACL 2001, Toulouse, France.

J. A. Hammerton. Unpublished. Towards scaling up connectionist language learning: Connectionist Shallow Parsing. Unpublished manuscript.

S. Hochreiter and J. Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735-1780.

D. L. James and R. Miikkulainen. 1995. SARDNET: A Self-Organizing Feature Map for Sequences, pages 577-584. MIT Press, Cambridge, MA.

J. Zavrel and J. Veenstra. 1996. The language environment and syntactic word class acquisition. In C. Koster and F. Wijnen, editors, Proceedings of the Groningen Assembly on Language Acquisition (GALA '95).