Automatic Retrieval and Clustering of Similar Words
Dekang Lin
Department of Computer Science
University of Manitoba
Winnipeg, Manitoba, Canada R3T 2N2
lindek@cs.umanitoba.ca
Abstract
Bootstrapping semantics from text is one of the greatest challenges in natural language learning. We first define a word similarity measure based on the distributional pattern of words. The similarity measure allows us to construct a thesaurus using a parsed corpus. We then present a new evaluation methodology for the automatically constructed thesaurus. The evaluation results show that the thesaurus is significantly closer to WordNet than the Roget Thesaurus is.
1 Introduction
The meaning of an unknown word can often be inferred from its context. Consider the following (slightly modified) example in (Nida, 1975, p. 167):

(1) A bottle of tezgüino is on the table.
    Everyone likes tezgüino.
    Tezgüino makes you drunk.
    We make tezgüino out of corn.

The contexts in which the word tezgüino is used suggest that tezgüino may be a kind of alcoholic beverage made from corn mash.
Bootstrapping semantics from text is one of the greatest challenges in natural language learning. It has been argued that similarity plays an important role in word acquisition (Gentner, 1982). Identifying similar words is an initial step in learning the definition of a word. This paper presents a method for making this first step. For example, given a corpus that includes the sentences in (1), our goal is to be able to infer that tezgüino is similar to "beer", "wine", "vodka", etc.
In addition to the long-term goal of bootstrapping semantics from text, automatic identification of similar words has many immediate applications. The most obvious one is thesaurus construction. An automatically created thesaurus offers many advantages over manually constructed thesauri. Firstly, the terms can be corpus- or genre-specific. Manually constructed general-purpose dictionaries and thesauri include many usages that are very infrequent in a particular corpus or genre of documents. Secondly, certain word usages may be particular to a period of time, which are unlikely to be captured by manually compiled lexicons. For example, among 274 occurrences of the word "westerner" in a 45-million-word San Jose Mercury corpus, 55% of them refer to hostages. If one needs to search hostage-related articles, "westerner" may well be a good search term.
Another application of automatically extracted similar words is to help solve the problem of data sparseness in statistical natural language processing (Dagan et al., 1994; Essen and Steinbiss, 1992). When the frequency of a word does not warrant reliable maximum likelihood estimation, its probability can be computed as a weighted sum of the probabilities of words that are similar to it. It was shown in (Dagan et al., 1997) that a similarity-based smoothing method achieved much better results than back-off smoothing methods in word sense disambiguation.
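To make the weighted-sum idea concrete, here is a minimal Python sketch. It illustrates the general scheme only, not the exact formulation of Dagan et al.; the helper functions sim_fn, p_mle and neighbors are assumed to be given.

def smoothed_prob(w1, w2, sim_fn, p_mle, neighbors):
    # Estimate P(w2 | w1) as a similarity-weighted average of
    # P(w2 | w) over words w similar to w1 (hypothetical helpers:
    # sim_fn(a, b) -> similarity score, p_mle(x, y) -> P(x | y),
    # neighbors(w) -> words similar to w).
    total = sum(sim_fn(w1, w) for w in neighbors(w1))
    return sum(sim_fn(w1, w) * p_mle(w2, w) for w in neighbors(w1)) / total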
The remainder of the paper is organized as follows. The next section is concerned with similarities between words based on their distributional patterns. The similarity measure can then be used to create a thesaurus. In Section 3, we evaluate the constructed thesauri by computing the similarity between their entries and entries in manually created thesauri. Section 4 briefly discusses future work in clustering similar words. Finally, Section 5 reviews related work and summarizes our contributions.
2 Word Similarity
Our similarity measure is based on a proposal in (Lin, 1997), where the similarity between two objects is defined to be the amount of information contained in the commonality between the objects divided by the amount of information in the descriptions of the objects.
We use a broad-coverage parser (Lin, 1993; Lin, 1994) to extract dependency triples from the text corpus. A dependency triple consists of two words and the grammatical relationship between them in the input sentence. For example, the triples extracted from the sentence "I have a brown dog" are:

(2) (have subj I), (I subj-of have), (dog obj-of have), (dog adj-mod brown), (brown adj-mod-of dog), (dog det a), (a det-of dog)

We use the notation ||w, r, w'|| to denote the frequency count of the dependency triple (w, r, w') in the parsed corpus. When w, r, or w' is the wild card (*), the frequency counts of all the dependency triples that match the rest of the pattern are summed up. For example, ||cook, obj, *|| is the total occurrences of cook-object relationships in the parsed corpus, and ||*, *, *|| is the total number of dependency triples extracted from the parsed corpus.
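The counting scheme can be sketched in a few lines of Python. The sample triples below are illustrative; in the paper they come from the parser output rather than a hand-written list.

from collections import Counter

# Tabulate ||w, r, w'|| and the wildcard sums described above.
triples = [("have", "subj", "I"), ("I", "subj-of", "have"),
           ("dog", "obj-of", "have"), ("dog", "adj-mod", "brown")]

count_wrw = Counter(triples)                          # ||w, r, w'||
count_wr = Counter((w, r) for w, r, _ in triples)     # ||w, r, *||
count_rw = Counter((r, w2) for _, r, w2 in triples)   # ||*, r, w'||
count_r = Counter(r for _, r, _ in triples)           # ||*, r, *||
total = len(triples)                                  # ||*, *, *||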
The description of a word w consists of the frequency counts of all the dependency triples that match the pattern (w, *, *). The commonality between two words consists of the dependency triples that appear in the descriptions of both words. For example, (3) is the description of the word "cell".
(3) cell, subj-of, absorb = 1
    cell, subj-of, adapt = 1
    cell, subj-of, behave = 1
    ...
    cell, pobj-of, in = 159
    cell, pobj-of, inside = 16
    cell, pobj-of, into = 30
    ...
    cell, nmod-of, abnormality = 3
    cell, nmod-of, anemia = 8
    cell, nmod-of, architecture = 1
    ...
    cell, obj-of, attack = 6
    cell, obj-of, bludgeon = 1
    cell, obj-of, call = 11
    cell, obj-of, come from = 3
    cell, obj-of, contain = 4
    cell, obj-of, decorate = 2
    ...
    cell, nmod, bacteria = 3
    cell, nmod, blood vessel = 1
    cell, nmod, body = 2
    cell, nmod, bone marrow = 2
    cell, nmod, burial = 1
    cell, nmod, chameleon = 1
    ...
Assuming that the frequency counts of the dependency triples are independent of each other, the information contained in the description of a word is the sum of the information contained in each individual frequency count.
To measure the information contained in the statement ||w, r, w'|| = c, we first measure the amount of information in the statement that a randomly selected dependency triple is (w, r, w') when we do not know the value of ||w, r, w'||. We then measure the amount of information in the same statement when we do know the value of ||w, r, w'||. The difference between the two amounts is taken to be the information contained in ||w, r, w'|| = c.
An occurrence of a dependency triple (w, r, w') can be regarded as the co-occurrence of three events:

A: a randomly selected word is w;
B: a randomly selected dependency type is r;
C: a randomly selected word is w'.
When the value of ||w, r, w'|| is unknown, we assume that A and C are conditionally independent given B. The probability of A, B and C co-occurring is estimated by

    P_MLE(B) x P_MLE(A|B) x P_MLE(C|B)

where P_MLE is the maximum likelihood estimation of a probability distribution and

    P_MLE(B) = ||*, r, *|| / ||*, *, *||
    P_MLE(A|B) = ||w, r, *|| / ||*, r, *||
    P_MLE(C|B) = ||*, r, w'|| / ||*, r, *||

When the value of ||w, r, w'|| is known, the probability of A, B and C co-occurring can be obtained directly:

    P_MLE(A, B, C) = ||w, r, w'|| / ||*, *, *||

Let I(w, r, w') denote the amount of information contained in ||w, r, w'|| = c. Its value is the difference between the information in the two estimates:

    I(w, r, w') = log( (||w, r, w'|| x ||*, r, *||) / (||w, r, *|| x ||*, r, w'||) )
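A direct transcription of I(w, r, w') into Python, using counts tabulated as in the earlier sketch (ours, not the paper's code):

import math

def information(w, r, w2, count_wrw, count_wr, count_rw, count_r):
    # I(w, r, w') = log( ||w,r,w'|| * ||*,r,*|| / (||w,r,*|| * ||*,r,w'||) )
    return math.log((count_wrw[(w, r, w2)] * count_r[r]) /
                    (count_wr[(w, r)] * count_rw[(r, w2)]))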
It is worth noting that I(w, r, w') is equal to the mutual information between w and w' (Hindle, 1990).
Let T(w) be the set of pairs (r, w') such that I(w, r, w') is positive. We define the similarity sim(w1, w2) between two words w1 and w2 as follows:

    sim(w1, w2) = [ Σ_{(r,w) in T(w1) ∩ T(w2)} (I(w1, r, w) + I(w2, r, w)) ] / [ Σ_{(r,w) in T(w1)} I(w1, r, w) + Σ_{(r,w) in T(w2)} I(w2, r, w) ]

For comparison, we also implemented several other similarity measures (Figure 1):

    sim_Hindle(w1, w2) = Σ_{(r,w) in T(w1) ∩ T(w2), r in {subj-of, obj-of}} min(I(w1, r, w), I(w2, r, w))
    sim_Hindle_r(w1, w2) = Σ_{(r,w) in T(w1) ∩ T(w2)} min(I(w1, r, w), I(w2, r, w))
    sim_cosine(w1, w2) = |T(w1) ∩ T(w2)| / sqrt(|T(w1)| x |T(w2)|)
    sim_dice(w1, w2) = 2 x |T(w1) ∩ T(w2)| / (|T(w1)| + |T(w2)|)
    sim_Jacard(w1, w2) = |T(w1) ∩ T(w2)| / (|T(w1)| + |T(w2)| - |T(w1) ∩ T(w2)|)

Figure 1: Other Similarity Measures
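The following Python sketch (ours, not the paper's implementation) covers sim and the measures in Figure 1. Each word is represented by a dict mapping its features, i.e. (relation, word) pairs, to their information values I(w, r, w').

import math

def T(info):
    # T(w): the features of w with positive information value.
    return {f for f, i in info.items() if i > 0}

def sim(info1, info2):
    # Information in the common features, divided by the total
    # information in both words' descriptions.
    t1, t2 = T(info1), T(info2)
    common = sum(info1[f] + info2[f] for f in t1 & t2)
    return common / (sum(info1[f] for f in t1) + sum(info2[f] for f in t2))

def sim_hindle(info1, info2):
    # Hindle's original measure: subject and object relations only.
    shared = {f for f in T(info1) & T(info2) if f[0] in ("subj-of", "obj-of")}
    return sum(min(info1[f], info2[f]) for f in shared)

def sim_hindle_r(info1, info2):
    return sum(min(info1[f], info2[f]) for f in T(info1) & T(info2))

def sim_cosine(info1, info2):
    t1, t2 = T(info1), T(info2)
    return len(t1 & t2) / math.sqrt(len(t1) * len(t2))

def sim_dice(info1, info2):
    t1, t2 = T(info1), T(info2)
    return 2 * len(t1 & t2) / (len(t1) + len(t2))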
We parsed a 64-million-word corpus consisting of the Wall Street Journal (24 million words), San Jose Mercury (21 million words) and AP Newswire (19 million words). From the parsed corpus, we extracted 56.5 million dependency triples (8.7 million unique). In the parsed corpus, there are 5469 nouns, 2173 verbs, and 2632 adjectives/adverbs that occurred at least 100 times. We computed the pairwise similarity between all the nouns, all the verbs and all the adjectives/adverbs, using the above similarity measure. For each word, we created a thesaurus entry which contains the top-N words that are most similar to it. The thesaurus entry for word w has the following format:

    w (pos): w1, s1, w2, s2, ..., wN, sN

where pos is a part of speech, wi is a word, si = sim(w, wi), and the si's are ordered in descending order. For example, the top-10 words in the noun entry for the word "brief" are shown below:

brief (noun): affidavit 0.13, petition 0.05, memorandum 0.05, motion 0.05, lawsuit 0.05, deposition 0.05, slight 0.05, prospectus 0.04, document 0.04, paper 0.04, ...
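Producing such an entry amounts to ranking the vocabulary by similarity to the headword and keeping the top N. A minimal sketch; the names and the default N are illustrative.

def thesaurus_entry(word, vocabulary, sim_fn, n=10):
    # Rank all other words (of the same part of speech) by similarity.
    scored = ((w, sim_fn(word, w)) for w in vocabulary if w != word)
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]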
3 Evaluation

To evaluate the constructed thesauri, we defined two word similarity measures that are based on the structures of WordNet and Roget (Figure 2). The similarity measure sim_WN is based on the proposal in (Lin, 1997). The similarity measure sim_Roget treats all the words in Roget as features. A word w possesses the feature f if f and w belong to a same Roget category. The similarity between two words is then defined as the cosine coefficient of the two feature vectors.

    sim_WN(w1, w2) = max_{c1 in S(w1), c2 in S(w2)} max_{c in C(c1) ∩ C(c2)} [ 2 x log P(c) / (log P(c1) + log P(c2)) ]
    sim_Roget(w1, w2) = |R(w1) ∩ R(w2)| / sqrt(|R(w1)| x |R(w2)|)

where S(w) is the set of senses of w in WordNet, C(c) is the set of (possibly indirect) superclasses of concept c in WordNet, R(w) is the set of words that belong to a same Roget category as w, and P(c) is the probability that a randomly selected word belongs to class c.

Figure 2: Word similarity measures based on WordNet and Roget
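The Roget-based measure reduces to a set computation once R(w) is available. A sketch, with R(w) given as a Python set:

import math

def sim_roget(R1, R2):
    # Cosine coefficient of two binary feature vectors, where the
    # features of a word w are the members of R(w).
    return len(R1 & R2) / math.sqrt(len(R1) * len(R2))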
With sim_WN and sim_Roget, we transform WordNet and Roget into the same format as the automatically constructed thesauri in the previous section. We now discuss how to measure the similarity between two thesaurus entries. Suppose two thesaurus entries for the same word are as follows:

    w (pos): w1, s1, w2, s2, ..., wN, sN
    w (pos): w'1, s'1, w'2, s'2, ..., w'N, s'N

Their similarity is defined as:

(4) Σ_{wi = w'j} si x s'j / sqrt(Σ_i si^2 x Σ_j s'j^2)
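In Python, (4) is the cosine of the two score vectors, with entries matched by word. A sketch, assuming each entry is a list of (word, score) pairs:

import math

def entry_similarity(entry1, entry2):
    # Cosine of the two score vectors; only words appearing in both
    # entries contribute to the dot product.
    s1, s2 = dict(entry1), dict(entry2)
    dot = sum(s1[w] * s2[w] for w in s1.keys() & s2.keys())
    return dot / math.sqrt(sum(v * v for v in s1.values()) *
                           sum(v * v for v in s2.values()))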
For example, (5) is the entry for "brief (noun)" in our automatically generated thesaurus, and (6) and (7) are the corresponding entries in the WordNet thesaurus and the Roget thesaurus.

(5) brief (noun): affidavit 0.13, petition 0.05, memorandum 0.05, motion 0.05, lawsuit 0.05, deposition 0.05, slight 0.05, prospectus 0.04, document 0.04, paper 0.04.

(6) brief (noun): outline 0.96, instrument 0.84, summary 0.84, deposition 0.80, law 0.77, survey 0.74, sketch 0.74, resume 0.74, argument 0.74.

(7) brief (noun): recital 0.77, saga 0.77, autobiography 0.77, anecdote 0.77, novel 0.77, novelist 0.77, tradition 0.70, historian 0.70, tale 0.64.
According to (4), the similarity between (5) and (6) is 0.297, whereas the similarities between (5) and (7) and between (6) and (7) are 0.
Our evaluation was conducted with 4294 nouns that occurred at least 100 times in the parsed corpus and are found in both WordNet 1.5 and the Roget Thesaurus. Table 1 shows the average similarity between corresponding entries in different thesauri and the standard deviation of the average, which is the standard deviation of the data items divided by the square root of the number of data items. Since the differences among sim_cosine, sim_dice and sim_Jacard are very small, we only included the results for sim_cosine in Table 1 for the sake of brevity.
Table 1: Evaluation with WordNet and Roget

    similarity to WordNet    σ̄
    Roget                    0.001636
    sim                      0.001484
    Hindle_r                 0.001424
    Hindle                   0.001200
    cosine                   0.001352

    similarity to Roget      σ̄
    WordNet                  0.001636
    sim                      0.001429
    Hindle_r                 0.001383
    Hindle                   0.001140
    cosine                   0.001275
It can be seen that sim, Hindle_r and cosine are significantly more similar to WordNet than Roget is, but are significantly less similar to Roget than WordNet is. The differences between Hindle and Hindle_r clearly demonstrate that the use of other types of dependencies in addition to subject and object relationships is very beneficial.

The performance of sim, Hindle_r and cosine is quite close. To determine whether or not the differences are statistically significant, we computed their differences in similarities to the WordNet and Roget thesauri for each individual entry. Table 2 shows the average and standard deviation of the average difference. Since the 95% confidence intervals of all the differences in Table 2 are on the positive side, one can draw the statistical conclusion that sim is better than Hindle_r, which is better than cosine.
Table 2: Distribution of Differences

    differences (WordNet)    σ̄
    sim - Hindle_r           0.000428
    sim - cosine             0.000386
    Hindle_r - cosine        0.000561

    differences (Roget)      σ̄
    sim - Hindle_r           0.000401
    sim - cosine             0.000375
    Hindle_r - cosine        0.000509
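The significance test can be reproduced with a few lines of Python: for each entry, take the difference between two measures' similarities to the manual thesaurus, then check whether the 95% confidence interval of the mean difference lies entirely above zero. A sketch; the normal approximation with 1.96 standard errors is an assumption on our part.

import math

def confidence_interval_95(diffs):
    # diffs: per-entry differences between two similarity measures.
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    stderr = math.sqrt(var / n)   # standard deviation of the average
    return (mean - 1.96 * stderr, mean + 1.96 * stderr)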
4 Future Work
Reliable extraction of similar words from a text corpus opens up many avenues for future work. For example, one can go a step further by constructing a tree structure among the most similar words so that different senses of a given word can be identified with different subtrees. Let (w1, w2, ..., wn) be a list of words in descending order of their similarity to a given word w. The similarity tree for w is created as follows:

Initialize the similarity tree to consist of a single node w.

For i = 1, 2, ..., n, insert wi as a child of wj such that wj is the most similar one to wi among w, w1, ..., wi-1.
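A minimal Python sketch of this construction (ours; sim_fn is assumed to be given):

def similarity_tree(root, ranked_words, sim_fn):
    # ranked_words: the most similar words, in descending similarity
    # to root. Each word is attached as a child of the already-inserted
    # node it is most similar to; the result maps each node to its
    # list of children.
    children = {root: []}
    for w in ranked_words:
        parent = max(children, key=lambda node: sim_fn(w, node))
        children[parent].append(w)
        children[w] = []
    return children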
For example,Figure3shows the similarity tree for the top-40most similar words to duty.Thefirst number behind a word is the similarity of the word to its parent.The cond number is the similarity of the word to the root node of the tree.
Inspection of sample outputs shows that this algorithm works well. However, formal evaluation of its accuracy remains as future work.

duty
|___responsibility 0.21 0.21
||___role 0.12 0.11
|||___action 0.11 0.10
|||___change 0.24 0.08
||||___rule 0.16 0.08
||||___restriction 0.27 0.08
|||||___ban 0.30 0.08
|||||___sanction 0.19 0.08
||||___schedule 0.11 0.07
||||___regulation 0.37 0.07
|||___challenge 0.13 0.07
||||___issue 0.13 0.07
||||___reason 0.14 0.07
||||___matter 0.28 0.07
|||___measure 0.22 0.07
||___obligation 0.12 0.10
||___power 0.17 0.08
|||___jurisdiction 0.13 0.08
|||___right 0.12 0.07
|||___control 0.20 0.07
|||___ground 0.08 0.07
||___accountability 0.14 0.08
||___experience 0.12 0.07
|___post 0.14 0.14
||___job 0.17 0.10
|||___work 0.17 0.10
|||___training 0.11 0.07
||___position 0.25 0.10
|___task 0.10 0.10
||___chore 0.11 0.07
|___operation 0.10 0.10
||___function 0.10 0.08
||___mission 0.12 0.07
|||___patrol 0.07 0.07
||___staff 0.10 0.07
|___penalty 0.09 0.09
||___fee 0.17 0.08
||___tariff 0.13 0.08
||___tax 0.19 0.07
|___reservist 0.07 0.07

Figure 3: Similarity tree for "duty"
5 Related Work and Conclusion
There have been many approaches to the automatic detection of similar words from text corpora. Ours is similar to (Grefenstette, 1994; Hindle, 1990; Ruge, 1992) in the use of dependency relationships as the word features, based on which word similarities are computed.
Evaluation of automatically generated lexical resources is a difficult problem. In (Hindle, 1990), a small set of sample results is presented. In (Smadja, 1993), automatically extracted collocations are judged by a lexicographer. In (Dagan et al., 1993) and (Pereira et al., 1993), clusters of similar words are evaluated by how well they are able to recover data items that are removed from the input corpus one at a time. In (Alshawi and Carter, 1994), the collocations and their associated scores were evaluated indirectly by their use in parse tree selection. The merits of different measures for association strength are judged by the differences they make in the precision and the recall of the parser outputs.
The main contribution of this paper is a new evaluation methodology for automatically constructed thesauri. While previous methods rely on indirect tasks or subjective judgments, our method allows direct and objective comparison between automatically and manually constructed thesauri. The results show that our automatically created thesaurus is significantly closer to WordNet than the Roget Thesaurus is. Our experiments also surpass previous experiments on automatic thesaurus construction in scale and (possibly) accuracy.
Acknowledgement

This research has also been partially supported by NSERC Research Grant OGP121338 and by the Institute for Robotics and Intelligent Systems.

References
Hiyan Alshawi and David Carter. 1994. Training and scaling preference functions for disambiguation. Computational Linguistics, 20(4):635–648, December.

Ido Dagan, Shaul Marcus, and Shaul Markovitch. 1993. Contextual word similarity and estimation from sparse data. In Proceedings of ACL-93, pages 164–171, Columbus, Ohio, June.

Ido Dagan, Fernando Pereira, and Lillian Lee. 1994. Similarity-based estimation of word cooccurrence probabilities. In Proceedings of the 32nd Annual Meeting of the ACL, pages 272–278, Las Cruces, NM.

Ido Dagan, Lillian Lee, and Fernando Pereira. 1997. Similarity-based method for word sense disambiguation. In Proceedings of the 35th Annual Meeting of the ACL, pages 56–63, Madrid, Spain.

Ute Essen and Volker Steinbiss. 1992. Cooccurrence smoothing for stochastic language modeling. In Proceedings of ICASSP, volume 1, pages 161–164.

W. B. Frakes and R. Baeza-Yates, editors. 1992. Information Retrieval: Data Structures and Algorithms. Prentice Hall.

D. Gentner. 1982. Why nouns are learned before verbs: Linguistic relativity versus natural partitioning. In S. A. Kuczaj, editor, Language Development: Vol. 2. Language, Thought, and Culture, pages 301–334. Erlbaum, Hillsdale, NJ.

Gregory Grefenstette. 1994. Explorations in Automatic Thesaurus Discovery. Kluwer Academic Press, Boston, MA.

Donald Hindle. 1990. Noun classification from predicate-argument structures. In Proceedings of ACL-90, pages 268–275, Pittsburgh, Pennsylvania, June.

Dekang Lin. 1993. Principle-based parsing without overgeneration. In Proceedings of ACL-93, pages 112–120, Columbus, Ohio.

Dekang Lin. 1994. Principar: an efficient, broad-coverage, principle-based parser. In Proceedings of COLING-94, pages 482–488, Kyoto, Japan.

Dekang Lin. 1997. Using syntactic dependency as local context to resolve word sense ambiguity. In Proceedings of ACL/EACL-97, pages 64–71, Madrid, Spain, July.

George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. 1990. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4):235–244.

George A. Miller. 1990. WordNet: An on-line lexical database. International Journal of Lexicography, 3(4):235–312.

Eugene A. Nida. 1975. Componential Analysis of Meaning. Mouton, The Hague.

F. Pereira, N. Tishby, and L. Lee. 1993. Distributional clustering of English words. In Proceedings of ACL-93, pages 183–190, Ohio State University, Columbus, Ohio.

Gerda Ruge. 1992. Experiments on linguistically based term associations. Information Processing & Management, 28(3):317–332.

Frank Smadja. 1993. Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–178.
