A Comparison of String Distance Metrics for Name-Matching Tasks
William W. Cohen          Pradeep Ravikumar          Stephen E. Fienberg

William W. Cohen and Pradeep Ravikumar: Center for Automated Learning and Discovery, School of Computer Science, Carnegie Mellon University.
Stephen E. Fienberg: Department of Statistics, Center for Computer & Communications Security, & Center for Automated Learning & Discovery, Carnegie Mellon University.
Abstract
Using an open-source, Java toolkit of name-matching
methods, we experimentally compare string distance
metrics on the task of matching entity names. We investigate
a number of different metrics proposed by different
communities, including edit-distance metrics, fast
heuristic string comparators, token-based distance metrics,
and hybrid methods. Overall, the best-performing
method is a hybrid scheme combining a TFIDF weighting
scheme, which is widely used in information retrieval,
with the Jaro-Winkler string-distance scheme,
which was developed in the probabilistic record linkage
community.
Introduction
The task of matching entity names has been explored by a number of communities, including statistics, databases, and artificial intelligence. Each community has formulated the problem differently, and different techniques have been proposed.
In statistics, a long line of research has been conducted in probabilistic record linkage, largely based on the seminal paper by Fellegi and Sunter (1969). This paper formulates entity matching as a classification problem, where the basic goal is to classify entity pairs as matching or non-matching. Fellegi and Sunter propose using largely unsupervised methods for this task, based on a feature-based representation of pairs which is manually designed and to some extent problem-specific. These proposals have been, by and large, adopted by subsequent researchers, although often with elaborations of the underlying statistical model (Jaro 1989; 1995; Winkler 1999; Larsen 1999; Belin & Rubin 1997). These methods have been used to match individuals and/or families between samples and censuses, e.g., in evaluation of the coverage of the U.S. decennial census; or between administrative records and survey data bases, e.g., in the creation of an anonymized research data base combining tax information from the Internal Revenue Service and data from the Current Population Survey.

In the database community, some work on record matching has been based on knowledge-intensive approaches (Hernandez & Stolfo 1995; Galhardas et al. 2000; Raman & Hellerstein 2001). However, the use of string-edit distances as a general-purpose record matching scheme was proposed by Monge and Elkan (Monge & Elkan 1997; 1996), and in previous work, we proposed use of the TFIDF distance metric for the same purpose (Cohen 2000). In the AI community, supervised learning has been used for learning the parameters of string-edit distance metrics (Ristad & Yianilos 1998; Bilenko & Mooney 2002) and for combining the results of different distance functions (Tejada, Knoblock, & Minton 2001; Cohen & Richman 2002; Bilenko & Mooney 2002). More recently, probabilistic object identification methods have been adapted to matching tasks (Pasula et al. 2002). In these communities there has been more emphasis on developing autonomous matching techniques which require little or no configuration for a new task, and less emphasis on developing "toolkits" of methods that can be applied to new tasks by experts.

Recently, we have begun implementing an open-source, Java toolkit of name-matching methods (Cohen & Ravikumar 2003) that includes a variety of different techniques. In this paper we use this toolkit to conduct a comparison of several string distances on the tasks of matching and clustering lists of entity names. We also introduce and evaluate a number of novel string-distance methods. One of these novel distance metrics performs better, on average, than any previous string-distance metric on our benchmark problems. This new distance metric extends cosine similarity by using the Jaro-Winkler method (Winkler 1999) to exploit nearly-matching tokens.

Copyright (c) 2003, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
This experimental use of string distance metrics, while similar to previous experiments in the database and AI communities, is a substantial departure from their usual use in statistics. In statistics, databases tend to have more structure and specification, by design. Thus the statistical literature on probabilistic record linkage represents pairs of entities not by pairs of strings, but by vectors of "match features" such as names and categories for variables in survey databases. By developing appropriate match features, and appropriate statistical models of matching and non-matching pairs, this approach can achieve better matching performance (at least potentially).

The use of string distances considered here is most useful for matching problems with little prior knowledge, or ill-structured data. Better string distance metrics might also be useful in the generation of "match features" in more structured database situations.
Methods
Edit-distance-like functions
Distance functions map a pair of strings s and t to a real number r, where a smaller value of r indicates greater similarity between s and t. Similarity functions are analogous, except that larger values indicate greater similarity; at some risk of confusion to the reader, we will use these terms interchangeably, depending on which interpretation is most natural.

One important class of distance functions are edit distances, in which distance is the cost of the best sequence of edit operations that converts s to t. Typical edit operations are character insertion, deletion, and substitution, and each operation must be assigned a cost.
We will consider two edit-distance functions. The simple Levenstein distance assigns a unit cost to all edit operations. As an example of a more complex well-tuned distance function, we also consider the Monge-Elkan distance function (Monge & Elkan 1996), which is an affine variant of the Smith-Waterman distance function (Durbin et al. 1998) with particular cost parameters, and scaled to the interval [0,1]. (Affine edit-distance functions assign a relatively lower cost to a sequence of insertions or deletions.)
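The toolkit described in this paper is written in Java; as an illustration only, a minimal Python sketch of the unit-cost Levenstein distance is:

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance: insertions, deletions, and substitutions
    all cost 1. Uses the standard dynamic program, one row at a time."""
    prev = list(range(len(t) + 1))      # distances from s[:0] to every prefix of t
    for i, cs in enumerate(s, 1):
        cur = [i]                       # distance from s[:i] to the empty prefix
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # delete cs
                           cur[j - 1] + 1,               # insert ct
                           prev[j - 1] + (cs != ct)))    # substitute (free on match)
        prev = cur
    return prev[-1]
```

Keeping only the previous row makes the memory cost O(|t|) rather than O(|s|·|t|), without changing the result.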
A broadly similar metric, which is not based on an edit-distance model, is the Jaro metric (Jaro 1995; 1989; Winkler 1999). In the record-linkage literature, good results have been obtained using variants of this method, which is based on the number and order of the common characters between two strings. Given strings s = a1...aK and t = b1...bL, define a character ai in s to be common with t if there is a bj = ai in t such that i - H <= j <= i + H, where H = min(|s|,|t|)/2. Let s' = a'1...a'K' be the characters in s which are common with t (in the same order they appear in s) and let t' = b'1...b'L' be analogous; now define a transposition for s', t' to be a position i such that a'i != b'i. Let T_{s',t'} be half the number of transpositions for s' and t'. The Jaro similarity metric for s and t is

    Jaro(s,t) = (1/3) · ( |s'|/|s| + |t'|/|t| + (|s'| - T_{s',t'}) / |s'| )

A variant of this due to Winkler (1999) also uses the length P of the longest common prefix of s and t. Letting P' = min(P, 4), we define

    Jaro-Winkler(s,t) = Jaro(s,t) + (P'/10) · (1 - Jaro(s,t))
The Jaro and Jaro-Winkler metrics seem to be intended primarily for short strings (e.g., personal first or last names).
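A Python sketch of the two metrics, following the definitions above. The definition leaves the pairing of common characters open; here each character of t is matched at most once, greedily, which is how implementations of Jaro typically behave:

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity with window H = min(|s|,|t|)/2 as defined above.
    Characters of t are consumed greedily, at most once each."""
    if not s or not t:
        return 0.0
    H = min(len(s), len(t)) // 2
    used = [False] * len(t)
    s_common = []
    for i, ch in enumerate(s):
        for j in range(max(0, i - H), min(len(t), i + H + 1)):
            if not used[j] and t[j] == ch:
                used[j] = True
                s_common.append(ch)
                break
    if not s_common:
        return 0.0
    t_common = [t[j] for j in range(len(t)) if used[j]]
    transpositions = sum(a != b for a, b in zip(s_common, t_common)) / 2
    c = len(s_common)
    return (c / len(s) + c / len(t) + (c - transpositions) / c) / 3

def jaro_winkler(s: str, t: str) -> float:
    """Winkler's variant: boost the Jaro score by the common-prefix
    length P' = min(P, 4)."""
    p = 0
    while p < min(len(s), len(t), 4) and s[p] == t[p]:
        p += 1
    j = jaro(s, t)
    return j + (p / 10.0) * (1.0 - j)
```

For the classic pair ("martha", "marhta") all six characters are common with one transposition, giving Jaro = 17/18 and, with prefix "mar", Jaro-Winkler ≈ 0.961.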
Token-based distance functions
Two strings s and t can also be considered as multisets (or bags) of words (or tokens). We also considered several token-based distance metrics. The Jaccard similarity between the word sets S and T is simply |S ∩ T| / |S ∪ T|. TFIDF or cosine similarity, which is widely used in the information retrieval community, can be defined as

    TFIDF(S,T) = Σ_{w ∈ S ∩ T} V(w,S) · V(w,T)

where TF_{w,S} is the frequency of word w in S, N is the size of the "corpus", IDF_w is the inverse of the fraction of names in the corpus that contain w,

    V'(w,S) = log(TF_{w,S} + 1) · log(IDF_w)

and V(w,S) = V'(w,S) / sqrt( Σ_{w'} V'(w',S)² ). Our implementation measures all document frequency statistics from the complete corpus of names to be matched.
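The weighting above can be sketched in a few lines of Python (an illustration, not the toolkit's Java code; tokenization by lowercasing and whitespace-splitting is an assumption):

```python
import math
from collections import Counter

def tfidf_vectors(names):
    """Build unit-normalized TFIDF vectors, measuring document frequency
    from the complete corpus of names to be matched, as described above."""
    docs = [name.lower().split() for name in names]
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))     # names containing w
    vecs = []
    for d in docs:
        tf = Counter(d)
        v = {w: math.log(tf[w] + 1) * math.log(n / df[w]) for w in tf}
        norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
        vecs.append({w: x / norm for w, x in v.items()})
    return vecs

def cosine(u, v):
    """TFIDF similarity: dot product over the shared tokens S ∩ T."""
    return sum(u[w] * v[w] for w in u.keys() & v.keys())
```

Note that a token occurring in every name gets IDF weight log(1) = 0, so ubiquitous tokens contribute nothing to the score.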
Following Dagan et al. (1999), a token set S can be viewed as a sample from an unknown distribution P_S of tokens, and a distance between S and T can be computed based on these distributions. Inspired by Dagan et al. we considered the Jensen-Shannon distance between P_S and P_T. Letting KL(P||Q) be the Kullback-Leibler divergence and letting Q(w) = (1/2)(P_S(w) + P_T(w)), this is simply

    Jensen-Shannon(S,T) = (1/2) ( KL(P_S||Q) + KL(P_T||Q) )

Distributions P_S were estimated using maximum likelihood, a Dirichlet prior, and a Jelinek-Mercer mixture model (Lafferty & Zhai 2001). The latter two cases require parameters; for Dirichlet, we used α = 1, and for Jelinek-Mercer, we used the mixture coefficient λ = 0.5.
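A sketch of the Jensen-Shannon distance using maximum-likelihood estimates (the simplest of the three estimation options above; the Dirichlet and Jelinek-Mercer variants would smooth the two distributions before the same computation):

```python
import math
from collections import Counter

def jensen_shannon(s_tokens, t_tokens):
    """JS distance between maximum-likelihood token distributions
    P_S and P_T, with Q the equal mixture of the two."""
    ps, pt = Counter(s_tokens), Counter(t_tokens)
    ns, nt = sum(ps.values()), sum(pt.values())

    def kl(p, n, vocab):
        # KL(P||Q); terms with P(w) = 0 contribute nothing.
        total = 0.0
        for w in vocab:
            pw = p[w] / n
            if pw > 0:
                qw = 0.5 * (ps[w] / ns + pt[w] / nt)
                total += pw * math.log(pw / qw)
        return total

    vocab = set(ps) | set(pt)
    return 0.5 * (kl(ps, ns, vocab) + kl(pt, nt, vocab))
```

The distance is 0 for identical token distributions and reaches its maximum, log 2 in nats, when the two token sets are disjoint.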
From the record-linkage literature, a method proposed by Fellegi and Sunter (1969) can be easily extended to a token distance. As notation, let A and B be two sets of records to match, let C = A ∩ B, let D = A ∪ B, and for X = A, B, C, D let P_X(w) be the empirical probability of a name containing the word w in the set X. Also let e_X be the empirical probability of an error in a name in set X; e_{X,0} be the probability of a missing name in set X; e_T be the probability of two correct but differing names in A and B; and let e = e_A + e_B + e_T + e_{A,0} + e_{B,0}.

Fellegi and Sunter propose ranking pairs s, t by the odds ratio log( Pr(M|s,t) / Pr(U|s,t) ), where M is the class of matched pairs and U is the class of unmatched pairs. Letting AGREE(s,t,w) denote the event "s and t both agree in containing word w", Fellegi and Sunter note that under certain assumptions

    Pr(M|AGREE(s,t,w)) ≈ P_C(w)(1 - e)
    Pr(U|AGREE(s,t,w)) ≈ P_A(w) · P_B(w)(1 - e)

If we make two additional assumptions suggested by Fellegi and Sunter, and assume that (a) matches on a word w are independent, and (b) that P_A(w) ≈ P_B(w) ≈ P_C(w) ≈ P_D(w), we find that the incremental score for the odds ratio associated with AGREE(s,t,w) is simply -log P_D(w). In information retrieval terms, this is a simple unnormalized IDF weight.

Unfortunately, updating the log-odds score of a pair on discovering a disagreeing token w is not as simple. Estimates are provided by Fellegi and Sunter for Pr(M|¬AGREE(s,t,w)) and Pr(U|¬AGREE(s,t,w)), but the error parameters e_A, e_B, ... do not cancel out; instead one is left with a constant penalty term, independent of w. Departing slightly from this (and following the intuition that disagreement on frequent terms is less important) we introduce a variable penalty term of k·log P_D(w), where k is a parameter to be set by the user. In the experiments below, we used k = 0.5, and call this method SFS distance (for Simplified Fellegi-Sunter).
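The SFS score can be sketched as follows. This is an illustrative reading of the scheme above, under the assumption that P_D(w) is estimated as the fraction of corpus names containing w (so every scored token must occur somewhere in the corpus):

```python
import math
from collections import Counter

def sfs_score(s_tokens, t_tokens, corpus, k=0.5):
    """Simplified Fellegi-Sunter: agreement on token w contributes
    -log P_D(w) (an unnormalized IDF weight); disagreement contributes
    the softer variable penalty k * log P_D(w), which shrinks toward 0
    for frequent tokens. `corpus` is a list of token lists."""
    n = len(corpus)
    contains = Counter(w for name in corpus for w in set(name))
    s, t = set(s_tokens), set(t_tokens)
    score = 0.0
    for w in s | t:
        pd = contains[w] / n          # assumed estimate of P_D(w)
        if w in s and w in t:
            score += -math.log(pd)    # agreement: IDF reward
        else:
            score += k * math.log(pd) # disagreement: mild, frequency-aware penalty
    return score
```

Since log P_D(w) is near 0 for very common tokens, disagreeing on a frequent word costs little, while disagreeing on a rare word is penalized more heavily, matching the intuition stated above.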
Hybrid distance functions
Monge and Elkan propose the following recursive matching scheme for comparing two long strings s and t. First, s and t are broken into substrings s = A1...AK and t = B1...BL. Then, similarity is defined as

    sim(s,t) = (1/K) Σ_{i=1}^{K} max_{j=1}^{L} sim'(A_i, B_j)

where sim' is some secondary distance function. We considered an implementation of this scheme in which the substrings are tokens; following Monge and Elkan, we call this a level two distance function. We experimented with level two similarity functions which used Monge-Elkan, Jaro, and Jaro-Winkler as their base function sim'.
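The recursive scheme above, with tokens as the substrings, is a one-line combinator once the secondary similarity sim' is passed in as a function (a Python sketch; the function name is illustrative):

```python
def level_two(s: str, t: str, sim) -> float:
    """Monge-Elkan level two scheme: average, over the tokens of s,
    of each token's best match in t under the secondary similarity sim'."""
    s_toks, t_toks = s.split(), t.split()
    if not s_toks or not t_toks:
        return 0.0
    return sum(max(sim(a, b) for b in t_toks) for a in s_toks) / len(s_toks)
```

For example, with an exact-match secondary similarity, `level_two("william cohen", "cohen william w", lambda a, b: float(a == b))` is 1.0, since every token of s has a perfect match in t; note the scheme is asymmetric in s and t.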
We also considered a "soft" version of TFIDF, in which similar tokens are considered as well as tokens in S ∩ T. Again let sim' be a secondary similarity function. Let CLOSE(θ,S,T) be the set of words w ∈ S such that there is some v ∈ T such that sim'(w,v) > θ, and for w ∈ CLOSE(θ,S,T), let D(w,T) = max_{v∈T} sim'(w,v). We define

    SoftTFIDF(S,T) = Σ_{w ∈ CLOSE(θ,S,T)} V(w,S) · V(w,T) · D(w,T)

In the experiments, we used Jaro-Winkler as a secondary distance and θ = 0.9.
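A sketch of the SoftTFIDF combination, under the assumption that the normalized TFIDF weights V(w,·) have already been computed (as dicts mapping tokens to weights) and that the matching token's weight stands in for V(w,T):

```python
def soft_tfidf(s_vec, t_vec, sim, theta=0.9):
    """SoftTFIDF sketch: s_vec and t_vec map tokens to precomputed,
    unit-normalized TFIDF weights; sim is the secondary similarity sim'.
    Tokens of s count if their best match in t exceeds theta."""
    score = 0.0
    for w, vw in s_vec.items():
        best, d = None, 0.0
        for v in t_vec:                 # D(w,T): best secondary match in t
            if sim(w, v) > d:
                best, d = v, sim(w, v)
        if best is not None and d > theta:
            score += vw * t_vec[best] * d
    return score
```

With an exact-match secondary similarity (sim' returning only 0 or 1), every surviving term has D(w,T) = 1 and the sum reduces to ordinary TFIDF over S ∩ T; replacing sim' with Jaro-Winkler is what lets near-miss tokens such as misspellings contribute.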
"Blocking" or pruning methods
In matching or clustering large lists of names, it is not computationally practical to measure distances between all pairs of strings. In statistical record linkage, it is common to group records by some variable which is known a priori to be usually the same for matching pairs. In census work, this grouping variable often names a small geographic region, and perhaps for this reason the technique is usually called "blocking".

Since this paper focuses on the behavior of string matching tools when little prior knowledge is available, we will use here knowledge-free approaches to reducing the set of string pairs to compare. In a matching task, there are two sets A and B. We consider as candidates all pairs of strings (s,t) ∈ A × B such that s and t share some substring v
Name         Src   M/C   #Strings   #Tokens
animal       1     M     5709       30,006
bird1        1     M     377        1,977
bird2        1     M     982        4,905
bird3        1     M     38         188
bird4        1     M     719        4,618
business     1     M     2139       10,526
game         1     M     911        5,060
park         1     M     654        3,425
fodorZagrat  2     M     863        10,846
ucdFolks     3     M     90         454
census       4     M     841        5,765
UVA          3     C     116        6,845
coraATDV     5     C     1916       47,512

Table 1: Datasets used in experiments. Column 2 indicates the source of the data, and column 3 indicates if it is a matching (M) or clustering (C) problem. Original sources are 1. (Cohen 2000) 2. (Tejada, Knoblock, & Minton 2001) 3. (Monge & Elkan 1996) 4. William Winkler (personal communication) 5. (McCallum, Nigam, & Ungar 2000)
which appears in at most a fraction f of all names. In a clustering task, there is one set C, and we consider all candidates (s,t) ∈ C × C such that s ≠ t, and again s and t share some not-too-frequent substring v. For purely token-based methods, the substring v must be a token, and otherwise, it must be a character 4-gram. Using inverted indices this set of pairs can be enumerated quickly.
For the moderate-size test sets considered here, we used f = 1. On the matching datasets above, the token blocker finds between 93.3% and 100.0% of the correct pairs, with an average of 98.9%. The 4-gram blocker also finds between 93.3% and 100.0% of the correct pairs, with an average of 99.0%.
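The token form of the blocker can be sketched with an inverted index (the 4-gram variant would index character 4-grams instead of tokens; the function name and whitespace tokenization are illustrative assumptions):

```python
from collections import defaultdict

def candidate_pairs(a_names, b_names, max_frac=1.0):
    """Token blocker for a matching task: propose (s, t) in A x B sharing
    at least one token that occurs in at most max_frac of all names."""
    names = a_names + b_names
    index = defaultdict(set)              # token -> indices of names containing it
    for i, name in enumerate(names):
        for tok in set(name.split()):
            index[tok].add(i)
    limit = max_frac * len(names)
    pairs = set()
    for ids in index.values():
        if len(ids) <= limit:             # skip tokens that are too frequent
            for i in ids:
                for j in ids:
                    if i < len(a_names) <= j:   # one name from A, one from B
                        pairs.add((names[i], names[j]))
    return pairs
```

Because each token's posting list is enumerated directly, the cost is driven by the sizes of the posting lists rather than by |A|·|B|.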
Experiments
Data and methodology
The data used to evaluate these methods is shown in Table 1. Most have been described elsewhere in the literature. The "coraATDV" dataset includes the fields author, title, date, and venue in a single string. The "census" dataset is a synthetic, census-like dataset, from which only textual fields were used (last name, first name, middle initial, house number, and street).
To evaluate a method on a dataset, we ranked by distance all candidate pairs from the appropriate grouping algorithm. We computed the non-interpolated average precision of this ranking, the maximum F1 score of the ranking, and also interpolated precision at the eleven recall levels 0.0, 0.1, ..., 0.9, 1.0. The non-interpolated average precision of a ranking containing N pairs for a task with m correct matches is

    (1/m) Σ_{i=1}^{N} c(i)·δ(i)/i

where c(i) is the number of correct pairs ranked before position i, and δ(i) = 1 if the pair at rank i is correct and 0 otherwise. Interpolated precision at recall r is max_i c(i)/i, where the max is taken over all ranks i such that c(i)/m ≥ r. Maximum F1 is max_{i>0} F1(i), where F1(i) is the harmonic mean of recall at rank i (i.e., c(i)/m) and precision at rank i (i.e., c(i)/i).

[Figure 1: Relative performance of token-based measures. Left, max F1 of methods on matching problems, with points above the line y = x indicating better performance of TFIDF. Middle, same for non-interpolated average precision. Right, precision-recall curves averaged over all matching problems. Smoothed versions of Jensen-Shannon (not shown) are comparable in performance to the unsmoothed version.]

[Figure 2: Relative performance of edit-distance measures. Left and middle, points above (below) the line y = x indicating better (worse) performance for Monge-Elkan, the system performing best on average.]
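The two summary scores defined above can be computed in a single pass over the ranking (a sketch; `ranked_correct` marks which ranked pair is a true match, with c(i) counted inclusively of rank i):

```python
def ranking_scores(ranked_correct, m):
    """Given ranked_correct[i] = True iff the pair at rank i+1 is a correct
    match, and m total correct matches, return (non-interpolated average
    precision, maximum F1) for the ranking."""
    c, ap, max_f1 = 0, 0.0, 0.0
    for i, correct in enumerate(ranked_correct, 1):
        if correct:
            c += 1
            ap += c / i                      # c(i) * delta(i) / i
        recall, precision = c / m, c / i
        if recall + precision > 0:
            max_f1 = max(max_f1, 2 * recall * precision / (recall + precision))
    return ap / m, max_f1
```

For instance, a ranking whose 1st and 3rd pairs are the two true matches has average precision (1/2)(1/1 + 2/3) = 5/6 and maximum F1 = 0.8, attained at rank 3.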
Results for matching
We will first consider the matching datasets. As shown in Figure 1, TFIDF seems generally the best among the token-based distance metrics. It does slightly better on average than the others, and is seldom much worse than any other method with respect to interpolated average precision or maximum F1.

As shown in Figure 2, the situation is less clear for the edit-distance based methods. The Monge-Elkan method does best on average, but the Jaro and Jaro-Winkler methods are close in average performance, and do noticeably better than Monge-Elkan on several problems. The Jaro variants are also substantially more efficient (at least in our implementation), taking about 1/10 the time of the Monge-Elkan method. (The token-based methods are faster still, averaging less than 1/10 the time of the Jaro variants.)

As shown in Figure 3, SoftTFIDF is generally the best among the hybrid methods we considered. In general, the time for the hybrid methods is comparable to using the underlying string edit distance. (For instance, the average matching time for SoftTFIDF is about 13 seconds on these problems, versus about 20 seconds for the Jaro-Winkler method, and 1.2 seconds for pure token-based TFIDF.)

Finally, Figure 4 compares the three best performing edit-distance-like methods, the two best token-based methods, and the two best hybrid methods, using a similar methodology. Generally speaking, SoftTFIDF is the best overall distance measure for these datasets.
Results for clustering
It should be noted that the test suite of matching problems is dominated by problems from one source: eight of the eleven test cases are associated with the WHIRL project, and a different distribution of problems might well lead to quite different results. For instance, while the token-based methods perform well on average, they perform poorly on the census-like dataset, which contains many misspellings. As an additional experiment, we evaluated the four best-performing distance metrics (SoftTFIDF, TFIDF, SFS, and Level 2 Jaro-Winkler) on the two clustering problems, which are taken from sources other than the WHIRL project. Table 2 shows MaxF1 and non-interpolated average precision for each method on each problem. SoftTFIDF again slightly outperforms the other methods on both of the tasks.

              UVA                 CoraATDV
Method        MaxF1    AvgPrec    MaxF1    AvgPrec
SoftTFIDF     0.89     0.91       0.85     0.914
TFIDF         0.79     0.84       0.84     0.907
SFS           0.71     0.75       0.82     0.864
Level2 J-W    0.73     0.69       0.76     0.804

Table 2: Results for selected distance metrics on two clustering problems.

[Figure 3: Relative performance of hybrid distance measures on matching problems, relative to the SoftTFIDF metric, which performs best on average.]

[Figure 4: Relative performance of best distance measures of each type on matching problems, relative to the SoftTFIDF metric.]

Learning to combine distance metrics

Another type of hybrid distance function can be obtained by combining other distance metrics. Following previous researchers (Tejada, Knoblock, & Minton 2001; Cohen & Richman 2002; Bilenko & Mooney 2002) we used a learning scheme to combine several of the distance functions above. Specifically, we represented pairs as feature vectors, using as features the numeric scores of Monge-Elkan, Jaro-Winkler, TFIDF, SFS, and SoftTFIDF. We then trained a binary SVM classifier (using SVMLight (Joachims 2002)) using these features, and used its confidence in the "match" class as a score.
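The feature-vector construction can be sketched as follows. The paper trains an SVM with SVMLight; to keep this illustration self-contained, a tiny logistic-regression learner stands in for the SVM (an assumption, not the paper's setup), with each row of X holding the distance-metric scores for one candidate pair:

```python
import math

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Learn weights (plus bias, stored last) over distance-metric scores
    by stochastic gradient descent on the logistic loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[-1] + sum(a * b for a, b in zip(w, xi))
            g = 1 / (1 + math.exp(-z)) - yi        # prediction error
            for j in range(len(xi)):
                w[j] -= lr * g * xi[j]
            w[-1] -= lr * g
    return w

def match_score(w, x):
    """Confidence in the 'match' class, used as the combined distance score."""
    z = w[-1] + sum(a * b for a, b in zip(w, x))
    return 1 / (1 + math.exp(-z))
```

As in the paper's setup, the learned combiner needs labeled candidate pairs to fit its weights, unlike the individual metrics it combines.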
Figure 5 shows the results of a three-fold cross-validation on nine of the matching problems. The learned combination generally slightly outperforms the individual metrics, including SoftTFIDF, particularly at extreme recall levels. Note, however, that the learned metric uses much more information: in particular, in most cases it has been trained on several thousand labeled candidate pairs, while the other metrics considered here require no training data.
Concluding remarks
Recently, we have begun implementing an open-source, Java toolkit of name-matching methods. This toolkit includes a variety of different techniques, as well as the infrastructure to combine techniques readily, and evaluate them systematically on test data. Using this toolkit, we conducted a comparison of several string distances on the tasks of matching and clustering lists of entity names. Many of these were techniques previously proposed in the literature, and some are novel hybrids of previous methods.

We compared the accuracy of these methods for use in an automatic matching scheme, in which pairs of names are proposed by a simple grouping method, and then ranked according to distance. Used in this way, we saw that the TFIDF ranking performed best among several token-based distance metrics, and that a tuned affine-gap edit-distance metric proposed by Monge and Elkan performed best among several string edit-distance metrics. A surprisingly good distance metric is a fast heuristic scheme, proposed by Jaro (Jaro 1995; 1989) and later extended by Winkler (Winkler 1999). This works almost as well as the Monge-Elkan scheme, but is an order of magnitude faster.

One simple way of combining the TFIDF method and the Jaro-Winkler method is to replace the exact token matches used in TFIDF with approximate token matches based on the Jaro-Winkler scheme. This combination performs slightly better than either Jaro-Winkler or TFIDF on average, and occasionally performs much better. It is also close in performance to a learned combination of several of the best metrics considered in this paper.

[Figure 5: Relative performance of best distance measures of each type on matching problems, relative to a learned combination of the same measures.]
Acknowledgements
The preparation of this paper was supported in part by National Science Foundation Grant No. EIA-0131884 to the National Institute of Statistical Sciences and by a contract from the Army Research Office to the Center for Computer and Communications Security with Carnegie Mellon University.
References
Belin, T. R., and Rubin, D. B. 1997. A method for calibrating false-match rates in record linkage. In Record Linkage 1997: Proceedings of an International Workshop and Exposition, 81-94. U.S. Office of Management and Budget (Washington).
Bilenko, M., and Mooney, R. 2002. Learning to combine trained distance metrics for duplicate detection in databases. Technical Report AI 02-296, Artificial Intelligence Lab, University of Texas at Austin. Available from www.cs.utexas.edu/users/ml/papers/marlin-tr-02.pdf.
Cohen, W. W., and Ravikumar, P. 2003. SecondString: An open-source Java toolkit of approximate string-matching techniques. Project web page, secondstring.sourceforge.net.
Cohen, W. W., and Richman, J. 2002. Learning to match and cluster large high-dimensional data sets for data integration. In Proceedings of The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002).

Cohen, W. W. 2000. Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems 18(3):288-321.

Dagan, I.; Lee, L.; and Pereira, F. 1999. Similarity-based models of word cooccurrence probabilities. Machine Learning 34(1-3).

Durbin, R.; Eddy, S. R.; Krogh, A.; and Mitchison, G. 1998. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge: Cambridge University Press.

Fellegi, I. P., and Sunter, A. B. 1969. A theory for record linkage. Journal of the American Statistical Association 64:1183-1210.
Galhardas, H.; Florescu, D.; Shasha, D.; and Simon, E. 2000. An extensible framework for data cleaning. In ICDE, 312.

Hernandez, M., and Stolfo, S. 1995. The merge/purge problem for large databases. In Proceedings of the 1995 ACM SIGMOD.

Jaro, M. A. 1989. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association 84:414-420.

Jaro, M. A. 1995. Probabilistic linkage of large public health data files (disc: P687-689). Statistics in Medicine 14:491-498.

Joachims, T. 2002. Learning to Classify Text Using Support Vector Machines. Kluwer.

Lafferty, J., and Zhai, C. 2001. A study of smoothing methods for language models applied to ad hoc information retrieval. In 2001 ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR).
Larsen, M. 1999. Multiple imputation analysis of records linked using mixture models. In Statistical Society of Canada Proceedings of the Survey Methods Section, 65-71. Statistical Society of Canada (McGill University, Montreal).
McCallum, A.; Nigam, K.; and Ungar, L. H. 2000. Efficient clustering of high-dimensional data sets with application to reference matching. In Knowledge Discovery and Data Mining, 169-178.

Monge, A., and Elkan, C. 1996. The field-matching problem: algorithm and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining.

Monge, A., and Elkan, C. 1997. An efficient domain-independent algorithm for detecting approximately duplicate database records. In The proceedings of the SIGMOD 1997 workshop on data mining and knowledge discovery.
Pasula, H.; Marthi, B.; Milch, B.; Russell, S.; and Shpitser, I. 2002. Identity uncertainty and citation matching. In Advances in Neural Information Processing Systems 15. Vancouver, British Columbia: MIT Press.
Raman, V., and Hellerstein, J. 2001. Potter's wheel: An interactive data cleaning system. In The VLDB Journal, 381-390.

Ristad, E. S., and Yianilos, P. N. 1998. Learning string edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(5):522-532.

Tejada, S.; Knoblock, C. A.; and Minton, S. 2001. Learning object identification rules for information integration. Information Systems 26(8):607-633.
Winkler, W. E. 1999. The state of record linkage and current research problems. Statistics of Income Division, Internal Revenue Service Publication R99/04. Available from www.census.gov/srd/www/byname.html.