Named Entity Recognition with Long Short-Term Memory
James Hammerton
Alfa-Informatica, University of Groningen
Groningen, The Netherlands
james@let.rug.nl
Abstract
In this approach to named entity recognition, a recurrent neural network, known as Long Short-Term Memory, is applied. The network is trained to perform 2 passes on each sentence, outputting its decisions on the second pass. The first pass is used to acquire information for disambiguation during the second pass. SARDNET, a self-organising map for sequences, is used to generate representations for the lexical items presented to the LSTM network, whilst orthogonal representations are used to represent the part of speech and chunk tags.
1 Introduction
In this paper, Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) is applied to named entity recognition, using data from the Reuters Corpus, English Language, Volume 1, and the European Corpus Initiative Multilingual Corpus 1. LSTM is an architecture and training algorithm for recurrent neural networks (RNNs), capable of remembering information over long time periods during the processing of a sequence.
LSTM was applied to an earlier CoNLL shared task, namely clause identification (Hammerton, 2001), although the performance was significantly below that of the other systems. LSTM achieved an fscore of 50.42 on the test data, where other systems' fscores ranged from 62.77 to 80.44. However, not all training data was used in training the LSTM networks. Better performance has since been obtained where the complete training set was used (Hammerton, unpublished), yielding an fscore of 64.66 on the test data.
2 Representing lexical items
An efficient method of representing lexical items is needed. Hammerton (2001; unpublished) employed lexical space (Zavrel and Veenstra, 1996) representations of the words, which are derived from their co-occurrence statistics. Here, however, a different approach is used.
A SARDNET (James and Miikkulainen, 1995), a self-organising map (SOM) for sequences, is trained to form representations of the words, and the resulting representations reflect the morphology of the words.
James and Miikkulainen (1995) provide a detailed description of how SARDNET operates. Briefly, the SARDNET operates in a similar manner to the standard SOM. It consists of a set of inputs and a set of map units. Each map unit contains a set of weights equal in size to the number of inputs. When an input is presented, the map unit with the closest weights to the input vector is chosen as the winner. When processing a sequence, this winning unit is taken out of the competition for subsequent inputs. The activation of a winning unit is set at 1 when it is first chosen and then multiplied by a decay factor (here set at 0.9) for subsequent inputs in the sequence. At the beginning of a new sequence all map units are made available again for the first input. Thus, once a sequence of inputs has been presented, the map units activated as winners indicate which inputs were presented and the activation levels of those units indicate the order of presentation. An advantage of SARDNET is that it can generalise naturally to novel words.
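To make the mechanism concrete, the following is a minimal sketch of SARDNET's sequence-encoding step, assuming a map whose weights have already been trained in the usual SOM fashion. The function and variable names (sardnet_encode, decay, map_weights) are illustrative, not taken from the original implementation.

    import numpy as np

    def sardnet_encode(sequence, map_weights, decay=0.9):
        """Encode a sequence of input vectors (e.g. the characters of a word)
        as a SARDNET activation pattern over the map units.

        sequence    : iterable of input vectors of shape (input_dim,)
        map_weights : array of shape (n_units, input_dim), the trained SOM weights
        decay       : per-step decay applied to previously chosen winners (0.9 here)
        """
        n_units = map_weights.shape[0]
        activation = np.zeros(n_units)       # final representation of the sequence
        available = np.ones(n_units, bool)   # units still in the competition

        for x in sequence:
            # distance from every map unit's weight vector to the current input
            dists = np.linalg.norm(map_weights - x, axis=1)
            dists[~available] = np.inf       # earlier winners are out of the competition
            winner = int(np.argmin(dists))

            activation *= decay              # decay earlier winners: this encodes order
            activation[winner] = 1.0         # the newest winner is set to 1
            available[winner] = False        # remove it for the rest of the sequence

        return activation

A word would be presented as a sequence of character vectors, and the resulting activation vector over the map units is what represents that word to the LSTM network.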
The resulting representations are real-valued vectors of size n, reflecting the size of the map layer in the SARDNET (enough to represent words of up to length n, where n is the size of the map). A SARDNET was trained over a single presentation of all the distinct words that appear in the training and development data for English, and a separate SARDNET was trained on all the distinct words appearing in the training data for German. The generalisation of the map to novel words was just as good with the German map as with the English map, suggesting that training the map only on the English training data words would make little difference to performance. Initially the neighbourhood was set to cover the whole SARDNET and the learning rate was set at 0.4. As each word was presented, the neighbourhood and learning rate were reduced in linear increments, so that at the end of training the learning rate was zero and the neighbourhood was 1. Both the English and German experiments used a SARDNET with 64 units.
Figure 1: A single-celled memory block.
3 Long Short-Term Memory (LSTM)
An LSTM network consists of 3 layers: an input layer, a recurrent hidden layer and an output layer. The hidden layer in LSTM constitutes the main innovation. It consists of one or more memory blocks, each with one or more memory cells. Normally the inputs are connected to all of the cells and gates. The cells are connected to the outputs, and the gates are connected to other cells and gates in the hidden layer.
A single-celled memory block is illustrated in Figure 1. The block consists of an input gate, the memory cell and an output gate. The memory cell is a linear unit with a self-connection with a weight of value 1. When not receiving any input, the cell maintains its current activation over time. The input to the memory cell is passed through a squashing function and gated (multiplied) by the activation of the input gate. The input gate thus controls the flow of activation into the cell.

The memory cell's output passes through a squashing function before being gated by the output gate activation. Thus the output gate controls the activation flow from cells to outputs. During training the gates learn to open and close in order to let new information into the cells and let the cells influence the outputs. The cells otherwise hold onto information unless new information is accepted by the input gate. Training of LSTM networks proceeds by a fusion of back-propagation through time and real-time recurrent learning, details of which can be found in (Hochreiter and Schmidhuber, 1997).
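To make the data flow concrete, here is a minimal sketch of the forward pass of a single-celled memory block as described above (without a forget gate). It is an illustrative reimplementation under our own naming; the choice of squashing functions and weight names are assumptions, and this is not the code used in the experiments.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    class MemoryBlock:
        """One memory block with a single cell, an input gate and an output gate."""

        def __init__(self, n_inputs, rng=np.random.default_rng(0)):
            self.w_cell = rng.normal(0, 0.1, n_inputs)      # cell input weights
            self.w_in_gate = rng.normal(0, 0.1, n_inputs)   # input gate weights
            self.w_out_gate = rng.normal(0, 0.1, n_inputs)  # output gate weights
            self.state = 0.0   # the cell's state, kept by a self-connection of weight 1

        def step(self, x):
            in_gate = sigmoid(self.w_in_gate @ x)    # how much new input to accept
            out_gate = sigmoid(self.w_out_gate @ x)  # how much of the cell to expose
            cell_in = np.tanh(self.w_cell @ x)       # squashed cell input

            # linear unit with a self-connection of weight 1: the old state is
            # retained unchanged and the gated new input is added to it
            self.state = self.state + in_gate * cell_in

            # squash the state and gate it before it reaches the output layer
            return out_gate * np.tanh(self.state)

A full network has several such blocks whose outputs feed the output layer; the forget gates referred to later (the FG option in Table 2) add a third gate that multiplies the stored state.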
六年级上册计算题In artificial tasks LSTM is capable of remembering in-formation for up-to 1000time-steps.It thus tackles one of the most rious problems affect the performance of re-current networks on temporal quence processing tasks.
4 Experiments

Tag      Representation
B-LOC    0010100
B-ORG    0010001
I-LOC    1000100
I-ORG    1000001
O        -

Table 1: Output representations of the NE tags.
Net     Opts               Hidden   Wts
-       -                  8x6      13543
Net2    int                8x6      18087
Net4    int,list           8x6      18442
Net6    int2,lex           8x5      15270
Net8    int2,list,lex,FG   -        -

Table 2: The networks used in the experiments.
For each word, in addition to the SARDNET representation described in Section 2, the network is presented with the following inputs (a sketch of how these might be assembled is given after this list):

–A lexical space (Zavrel and Veenstra, 1996) vector for the current word, used by some networks. This involves computing, for each word, the frequencies with which the most frequent 250 words appear either immediately before or immediately after that word in the training set. The resulting 500 element vectors (250 elements each for the left and right context) are normalised and then mapped onto their top 25 principal components.
–An orthogonal representation of the current part of speech (POS) tag. However, for some networks the input units to which the POS tag is presented perform a form of time integration, as follows. The units are updated at each step from p, the pattern representing the current POS tag, scaled according to s, the length of the current sequence of inputs (twice the length of the current sentence due to the 2-pass processing). By doing this the network receives a representation of the sequence of POS tags presented thus far, integrating the inputs over time.
–An orthogonal representation of the current chunk tag, though with some networks time integration is performed as described above.
–One input indicates which pass through the sentence is in progress.
–Some networks used a list of named entities (NEs) as follows. Some units are set aside corresponding to the categories of NE, 1 unit per category. If the current word occurs in an NE, the unit for that NE's category is activated. If the word occurs in more than one NE, the units for all those NEs' categories are activated. In the case of the English data there were 5 categories of NE (though one category "MO" seems to arise from an error in the data).
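The sketch below illustrates how such an input vector could be assembled for one word. The dimensionalities (n_pos, n_chunk) and the integration scheme (a running average of the tag patterns over the current sequence) are assumptions made for illustration, not details taken from the original setup.

    import numpy as np

    def one_hot(index, size):
        v = np.zeros(size)
        v[index] = 1.0
        return v

    def build_input(word_vec, pos_idx, chunk_idx, pass_no, ne_category_hits,
                    pos_accum, chunk_accum, seq_len,
                    n_pos=45, n_chunk=18, n_ne=5, integrate=True):
        """Assemble the network input for one word at one time step.

        word_vec          : SARDNET (and optionally lexical space) vector for the word
        pos_idx/chunk_idx : indices of the current POS and chunk tags
        pass_no           : 1 or 2 (which pass over the sentence is in progress)
        ne_category_hits  : indices of NE-list categories the word occurs in
        pos_accum/chunk_accum : running representations of the tag sequence so far
        seq_len           : length of the current input sequence (2 x sentence length)
        """
        pos_vec = one_hot(pos_idx, n_pos)
        chunk_vec = one_hot(chunk_idx, n_chunk)

        if integrate:
            # assumed integration scheme: accumulate each tag pattern scaled by the
            # sequence length, giving a summary of the tags presented so far
            pos_accum += pos_vec / seq_len
            chunk_accum += chunk_vec / seq_len
            pos_vec, chunk_vec = pos_accum, chunk_accum

        pass_unit = np.array([1.0 if pass_no == 2 else 0.0])

        ne_units = np.zeros(n_ne)                 # one unit per NE-list category
        ne_units[list(ne_category_hits)] = 1.0

        return np.concatenate([word_vec, pos_vec, chunk_vec, pass_unit, ne_units])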
The networks were trained with a learning rate of 0.3, no momentum and direct connections from the input to the output layers, for 100 iterations. Weight updating occurred after the second pass of each sentence was presented. The best set of weights during training was saved and used for evaluation with the development data. The results reported for each network are averaged over 5 runs from different randomised initial weight settings.
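A rough sketch of the 2-pass presentation and update schedule described above might look as follows; network, forward, accumulate_error and update_weights are placeholder names, and the error-propagation details of the LSTM training algorithm are omitted.

    def train(network, sentences, n_iterations=100, learning_rate=0.3):
        """Each sentence is presented twice; tagging decisions are taken from the
        second pass, and the weights are updated only after the second pass of a
        sentence has been presented."""
        for _ in range(n_iterations):
            for sentence in sentences:        # sentence: list of per-word inputs/targets
                network.reset_state()         # clear the memory cells between sequences
                for pass_no in (1, 2):
                    for word in sentence:
                        output = network.forward(word.inputs(pass_no))
                        if pass_no == 2:
                            network.accumulate_error(output, word.target)
                network.update_weights(learning_rate)   # after the second pass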
Table 2 lists the various networks used in the experiments. The "Net" column lists the names of the networks used. The "Opts" column indicates whether word lists are used (list), whether a 1 word lookahead is used (look), whether lexical space vectors are used (lex), whether the units for the POS tags use time integration as described above (int), and whether time integration is performed on both the units for the POS tags and the units for the chunk tags (int2). Additionally, it indicates whether forget gates were used (FG). The "Hidden" column gives the size of the hidden layer of the networks (e.g. 8x6 means 8 blocks of 6 cells). The "Wts" column gives the number of weights used.
Net        Recall    Range
-          61.42%    52.98
-          62.42%    55.30
-          62.80%    54.41
-          75.27%    69.53
-          75.03%    69.73
-          67.92%    62.08
-          68.04%    62.95
-          76.37%    70.96
Baseline   65.23%    n/a

Table 3: Results of named entity recognition on English development data for networks trained on the English training data. Results are averaged over 5 runs using different initial weights. * indicates use of the list of NEs. Italics indicate best result reported on first submission, whilst bold indicates best result achieved overall.
Table 3 gives the results for extracting named entities from the English development data for these networks. The "Precision", "Recall" and "Fscore" columns show the average scores across 5 runs from different random weight settings. The "Range" column shows the range of fscores produced across the 5 runs used for each network. The Precision gives the percentage of named entities found that were correct, whilst the Recall is the percentage of named entities defined in the data that were found. The Fscore is (2 * Precision * Recall) / (Precision + Recall).

Most options boosted performance. The biggest boosts came from the lexical space vectors and the word lists. The use of forget gates improved performance despite leading to fewer weights being used. Lookahead seems to make no significant difference overall. Only Net8 gets above baseline performance (best fscore = 72.88), but the average performance is lower than the baseline.
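For reference, the Fscore defined above (the balanced F, i.e. the harmonic mean of precision and recall) can be computed with a small helper like this:

    def fscore(precision, recall):
        """Balanced F-score from precision and recall percentages, as defined above."""
        if precision + recall == 0:
            return 0.0
        return (2 * precision * recall) / (precision + recall)

    # e.g. fscore(50.0, 100.0) is roughly 66.67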
Table 4 gives the results for the best network broken down by the type of NE for both the English development and testing data. This is from the best performing run for Net8. Table 4 also depicts the best result from 5 runs of a network configured similarly to Net7 above, using the German data. This did not employ a list of NEs, and the lemmas in the data were ignored. The fscore of 43.50 is almost 13 points higher than the baseline of 30.65. With the German test set the fscore is 47.74, 17 points higher than the baseline of 30.30.
5 Conclusion
An LSTM network was trained on named entity recognition, yielding an fscore just above the baseline performance on English and significantly above the baseline for German. Whilst the just-above-baseline performance for English is disappointing, it is hoped that further work will improve on the results. A number of ways of boosting performance will be looked at, including:
–Increasing the size of the hidden layers will increase the power of the networks, at the risk of overfitting.

–Increasing training times may also increase performance, again at the risk of overfitting.

–Increasing the informativeness of the lexical representations. Given that the number of elements used here is less than the number of characters in the character sets, there should be some scope for boosting performance by increasing the size of the SARDNETs. The representations of different words will then become more distinct from each other.

–The lexical space vectors were derived from a context of +/- 1 word, whereas in earlier work on clause splitting a context of +/- 2 words was used. Using the larger context and/or using more than 25 of the top principal components may again boost performance by incorporating more information into the vectors.

–Further exploitation of the word lists. Whilst the networks are made aware of which categories of named entity the current word can belong to, they are not made aware of how many named entities it belongs to or of what positions in those named entities it could occupy.
Acknowledgements
The LSTM code used here is a modified version of code provided by Fred Cummins. The training of the SARDNETs was done using the PDP++ neural network simulator (u.edu/Resources/PDP++/PDP++.html).
This work is supported by the Connectionist Language Learning Project of the High Performance Computing/Visualisation centre of the University of Groningen.
References
F. A. Gers and J. Schmidhuber. 2000. Long Short-Term Memory Learns Context-Free and Context-Sensitive Languages. Technical Report IDSIA-03-00, IDSIA, Switzerland.
Table 4: Performance of best network from Table 3 on English development and test data by type of NE, and performance of the best run of a network configured similarly to Net7 on German development and test data. (Precision, Recall and F for LOC, MISC, ORG, PER and Overall on the English development, English test, German development and German test sets.)
J. A. Hammerton. 2001. Clause identification with Long Short-Term Memory. In W. Daelemans and R. Zajac, editors, Proceedings of the Workshop on Computational Natural Language Learning (CoNLL 2001), ACL 2001, Toulouse, France.

J. A. Hammerton. Unpublished. Towards scaling up connectionist language learning: Connectionist Shallow Parsing. Unpublished manuscript.

S. Hochreiter and J. Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735-1780.

D. L. James and R. Miikkulainen. 1995. SARDNET: A Self-Organizing Feature Map for Sequences, pages 577-584. MIT Press, Cambridge, MA.

J. Zavrel and J. Veenstra. 1996. The language environment and syntactic word class acquisition. In C. Koster and F. Wijnen, editors, Proceedings of the Groningen Assembly on Language Acquisition (GALA '95).