A Comparison of String Distance Metrics for Name-Matching Tasks
William W. Cohen          Pradeep Ravikumar          Stephen E. Fienberg

William W. Cohen and Pradeep Ravikumar: Center for Automated Learning and Discovery, School of Computer Science, Carnegie Mellon University.
Stephen E. Fienberg: Department of Statistics, Center for Computer & Communications Security, & Center for Automated Learning & Discovery, Carnegie Mellon University.
Abstract
Using an open-source, Java toolkit of name-matching
methods, we experimentally compare string distance
metrics on the task of matching entity names. We investigate
a number of different metrics proposed by different
communities, including edit-distance metrics, fast
heuristic string comparators, token-based distance metrics,
and hybrid methods. Overall, the best-performing
method is a hybrid scheme combining a TFIDF weighting
scheme, which is widely used in information retrieval,
with the Jaro-Winkler string-distance scheme,
which was developed in the probabilistic record linkage
community.
Introduction
The task of matching entity names has been explored by a number of communities, including statistics, databases, and artificial intelligence. Each community has formulated the problem differently, and different techniques have been proposed.
In statistics, a long line of research has been conducted in probabilistic record linkage, largely based on the seminal paper by Fellegi and Sunter (1969). This paper formulates entity matching as a classification problem, where the basic goal is to classify entity pairs as matching or non-matching. Fellegi and Sunter propose using largely unsupervised methods for this task, based on a feature-based representation of pairs which is manually designed and to some extent problem-specific. These proposals have been, by and large, adopted by subsequent researchers, although often with elaborations of the underlying statistical model (Jaro 1989; 1995; Winkler 1999; Larsen 1999; Belin & Rubin 1997). These methods have been used to match individuals and/or families between samples and censuses, e.g., in evaluation of the coverage of the U.S. decennial census; or between administrative records and survey data bases, e.g., in the creation of an anonymized research data base combining tax information from the Internal Revenue Service and data from the Current Population Survey.

In the database community, some work on record matching has been based on knowledge-intensive approaches (Hernandez & Stolfo 1995; Galhardas et al. 2000; Raman & Hellerstein 2001). However, the use of string-edit distances as a general-purpose record matching scheme was proposed by Monge and Elkan (Monge & Elkan 1997; 1996), and in previous work, we proposed use of the TFIDF distance metric for the same purpose (Cohen 2000). In the AI community, supervised learning has been used for learning the parameters of string-edit distance metrics (Ristad & Yianilos 1998; Bilenko & Mooney 2002) and for combining the results of different distance functions (Tejada, Knoblock, & Minton 2001; Cohen & Richman 2002; Bilenko & Mooney 2002). More recently, probabilistic object identification methods have been adapted to matching tasks (Pasula et al. 2002). In these communities there has been more emphasis on developing autonomous matching techniques which require little or no configuration for a new task, and less emphasis on developing "toolkits" of methods that can be applied to new tasks by experts.

Recently, we have begun implementing an open-source, Java toolkit of name-matching methods (Cohen & Ravikumar 2003) that includes a variety of different techniques. In this paper we use this toolkit to conduct a comparison of several string distances on the tasks of matching and clustering lists of entity names. We also introduce and evaluate a number of novel string-distance methods. One of these novel distance metrics performs better, on average, than any previous string-distance metric on our benchmark problems. This new distance metric extends cosine similarity by using the Jaro-Winkler method (Winkler 1999) to exploit nearly-matching tokens.

Copyright (c) 2003, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
This experimental use of string distance metrics, while similar to previous experiments in the database and AI communities, is a substantial departure from their usual use in statistics. In statistics, databases tend to have more structure and specification, by design. Thus the statistical literature on probabilistic record linkage represents pairs of entities not by pairs of strings, but by vectors of "match features" such as names and categories for variables in survey databases. By developing appropriate match features, and appropriate statistical models of matching and non-matching pairs, this approach can achieve better matching performance (at least potentially).

The use of string distances considered here is most useful for matching problems with little prior knowledge, or ill-structured data. Better string distance metrics might also be useful in the generation of "match features" in more structured database situations.
Methods
Edit-distance-like functions
Distance functions map a pair of strings s and t to a real number r, where a smaller value of r indicates greater similarity between s and t. Similarity functions are analogous, except that larger values indicate greater similarity; at some risk of confusion to the reader, we will use these terms interchangeably, depending on which interpretation is most natural.

One important class of distance functions are edit distances, in which distance is the cost of the best sequence of edit operations that converts s to t. Typical edit operations are character insertion, deletion, and substitution, and each operation must be assigned a cost.
We will consider two edit-distance functions. The simple Levenstein distance assigns a unit cost to all edit operations. As an example of a more complex well-tuned distance function, we also consider the Monge-Elkan distance function (Monge & Elkan 1996), which is an affine variant of the Smith-Waterman distance function (Durbin et al. 1998) with particular cost parameters, and scaled to the interval [0,1]. (Affine edit-distance functions assign a relatively lower cost to a sequence of insertions or deletions.)
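The toolkit described in this paper is written in Java; as an illustration only, a minimal Python sketch of the unit-cost Levenstein distance is:

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance: insertions, deletions, and substitutions
    all cost 1. Uses the standard dynamic program, one row at a time."""
    prev = list(range(len(t) + 1))      # distances from s[:0] to every prefix of t
    for i, cs in enumerate(s, 1):
        cur = [i]                       # distance from s[:i] to the empty prefix
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # delete cs
                           cur[j - 1] + 1,               # insert ct
                           prev[j - 1] + (cs != ct)))    # substitute (free on match)
        prev = cur
    return prev[-1]
```

Keeping only the previous row makes the memory cost O(|t|) rather than O(|s|·|t|), without changing the result.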
A broadly similar metric, which is not based on an edit-distance model, is the Jaro metric (Jaro 1995; 1989; Winkler 1999). In the record-linkage literature, good results have been obtained using variants of this method, which is based on the number and order of the common characters between two strings. Given strings s = a1...aK and t = b1...bL, define a character ai in s to be common with t if there is a bj = ai in t such that i - H <= j <= i + H, where H = min(|s|,|t|)/2. Let s' = a'1...a'K' be the characters in s which are common with t (in the same order they appear in s) and let t' = b'1...b'L' be analogous; now define a transposition for s', t' to be a position i such that a'i != b'i. Let T_{s',t'} be half the number of transpositions for s' and t'. The Jaro similarity metric for s and t is

    Jaro(s,t) = (1/3) · ( |s'|/|s| + |t'|/|t| + (|s'| - T_{s',t'}) / |s'| )

A variant of this due to Winkler (1999) also uses the length P of the longest common prefix of s and t. Letting P' = min(P, 4), we define

    Jaro-Winkler(s,t) = Jaro(s,t) + (P'/10) · (1 - Jaro(s,t))
The Jaro and Jaro-Winkler metrics seem to be intended primarily for short strings (e.g., personal first or last names).
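A Python sketch of the two metrics, following the definitions above. The definition leaves the pairing of common characters open; here each character of t is matched at most once, greedily, which is how implementations of Jaro typically behave:

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity with window H = min(|s|,|t|)/2 as defined above.
    Characters of t are consumed greedily, at most once each."""
    if not s or not t:
        return 0.0
    H = min(len(s), len(t)) // 2
    used = [False] * len(t)
    s_common = []
    for i, ch in enumerate(s):
        for j in range(max(0, i - H), min(len(t), i + H + 1)):
            if not used[j] and t[j] == ch:
                used[j] = True
                s_common.append(ch)
                break
    if not s_common:
        return 0.0
    t_common = [t[j] for j in range(len(t)) if used[j]]
    transpositions = sum(a != b for a, b in zip(s_common, t_common)) / 2
    c = len(s_common)
    return (c / len(s) + c / len(t) + (c - transpositions) / c) / 3

def jaro_winkler(s: str, t: str) -> float:
    """Winkler's variant: boost the Jaro score by the common-prefix
    length P' = min(P, 4)."""
    p = 0
    while p < min(len(s), len(t), 4) and s[p] == t[p]:
        p += 1
    j = jaro(s, t)
    return j + (p / 10.0) * (1.0 - j)
```

For the classic pair ("martha", "marhta") all six characters are common with one transposition, giving Jaro = 17/18 and, with prefix "mar", Jaro-Winkler ≈ 0.961.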
Token-based distance functions
Two strings s and t can also be considered as multisets (or bags) of words (or tokens). We also considered several token-based distance metrics. The Jaccard similarity between the word sets S and T is simply |S ∩ T| / |S ∪ T|. TFIDF or cosine similarity, which is widely used in the information retrieval community, can be defined as

    TFIDF(S,T) = Σ_{w ∈ S ∩ T} V(w,S) · V(w,T)

where TF_{w,S} is the frequency of word w in S, N is the size of the "corpus", IDF_w is the inverse of the fraction of names in the corpus that contain w,

    V'(w,S) = log(TF_{w,S} + 1) · log(IDF_w)

and V(w,S) = V'(w,S) / sqrt( Σ_{w'} V'(w',S)² ). Our implementation measures all document frequency statistics from the complete corpus of names to be matched.
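The weighting above can be sketched in a few lines of Python (an illustration, not the toolkit's Java code; tokenization by lowercasing and whitespace-splitting is an assumption):

```python
import math
from collections import Counter

def tfidf_vectors(names):
    """Build unit-normalized TFIDF vectors, measuring document frequency
    from the complete corpus of names to be matched, as described above."""
    docs = [name.lower().split() for name in names]
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))     # names containing w
    vecs = []
    for d in docs:
        tf = Counter(d)
        v = {w: math.log(tf[w] + 1) * math.log(n / df[w]) for w in tf}
        norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
        vecs.append({w: x / norm for w, x in v.items()})
    return vecs

def cosine(u, v):
    """TFIDF similarity: dot product over the shared tokens S ∩ T."""
    return sum(u[w] * v[w] for w in u.keys() & v.keys())
```

Note that a token occurring in every name gets IDF weight log(1) = 0, so ubiquitous tokens contribute nothing to the score.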
Following Dagan et al. (1999), a token set S can be viewed as a sample from an unknown distribution P_S of tokens, and a distance between S and T can be computed based on these distributions. Inspired by Dagan et al. we considered the Jensen-Shannon distance between P_S and P_T. Letting KL(P||Q) be the Kullback-Leibler divergence and letting Q(w) = (1/2)(P_S(w) + P_T(w)), this is simply

    Jensen-Shannon(S,T) = (1/2) ( KL(P_S||Q) + KL(P_T||Q) )

Distributions P_S were estimated using maximum likelihood, a Dirichlet prior, and a Jelinek-Mercer mixture model (Lafferty & Zhai 2001). The latter two cases require parameters; for Dirichlet, we used α = 1, and for Jelinek-Mercer, we used the mixture coefficient λ = 0.5.
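A sketch of the Jensen-Shannon distance using maximum-likelihood estimates (the simplest of the three estimation options above; the Dirichlet and Jelinek-Mercer variants would smooth the two distributions before the same computation):

```python
import math
from collections import Counter

def jensen_shannon(s_tokens, t_tokens):
    """JS distance between maximum-likelihood token distributions
    P_S and P_T, with Q the equal mixture of the two."""
    ps, pt = Counter(s_tokens), Counter(t_tokens)
    ns, nt = sum(ps.values()), sum(pt.values())

    def kl(p, n, vocab):
        # KL(P||Q); terms with P(w) = 0 contribute nothing.
        total = 0.0
        for w in vocab:
            pw = p[w] / n
            if pw > 0:
                qw = 0.5 * (ps[w] / ns + pt[w] / nt)
                total += pw * math.log(pw / qw)
        return total

    vocab = set(ps) | set(pt)
    return 0.5 * (kl(ps, ns, vocab) + kl(pt, nt, vocab))
```

The distance is 0 for identical token distributions and reaches its maximum, log 2 in nats, when the two token sets are disjoint.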
From the record-linkage literature, a method proposed by Fellegi and Sunter (1969) can be easily extended to a token distance. As notation, let A and B be two sets of records to match, let C = A ∩ B, let D = A ∪ B, and for X = A, B, C, D let P_X(w) be the empirical probability of a name containing the word w in the set X. Also let e_X be the empirical probability of an error in a name in set X; e_{X,0} be the probability of a missing name in set X; e_T be the probability of two correct but differing names in A and B; and let e = e_A + e_B + e_T + e_{A,0} + e_{B,0}.

Fellegi and Sunter propose ranking pairs s, t by the odds ratio log( Pr(M|s,t) / Pr(U|s,t) ), where M is the class of matched pairs and U is the class of unmatched pairs. Letting AGREE(s,t,w) denote the event "s and t both agree in containing word w", Fellegi and Sunter note that under certain assumptions

    Pr(M|AGREE(s,t,w)) ≈ P_C(w)(1 - e)
    Pr(U|AGREE(s,t,w)) ≈ P_A(w) · P_B(w)(1 - e)

If we make two additional assumptions suggested by Fellegi and Sunter, and assume that (a) matches on a word w are independent, and (b) that P_A(w) ≈ P_B(w) ≈ P_C(w) ≈ P_D(w), we find that the incremental score for the odds ratio associated with AGREE(s,t,w) is simply -log P_D(w). In information retrieval terms, this is a simple unnormalized IDF weight.

Unfortunately, updating the log-odds score of a pair on discovering a disagreeing token w is not as simple. Estimates are provided by Fellegi and Sunter for Pr(M|¬AGREE(s,t,w)) and Pr(U|¬AGREE(s,t,w)), but the error parameters e_A, e_B, ... do not cancel out; instead one is left with a constant penalty term, independent of w. Departing slightly from this (and following the intuition that disagreement on frequent terms is less important) we introduce a variable penalty term of k·log P_D(w), where k is a parameter to be set by the user. In the experiments below, we used k = 0.5, and call this method SFS distance (for Simplified Fellegi-Sunter).
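The SFS score can be sketched as follows. This is an illustrative reading of the scheme above, under the assumption that P_D(w) is estimated as the fraction of corpus names containing w (so every scored token must occur somewhere in the corpus):

```python
import math
from collections import Counter

def sfs_score(s_tokens, t_tokens, corpus, k=0.5):
    """Simplified Fellegi-Sunter: agreement on token w contributes
    -log P_D(w) (an unnormalized IDF weight); disagreement contributes
    the softer variable penalty k * log P_D(w), which shrinks toward 0
    for frequent tokens. `corpus` is a list of token lists."""
    n = len(corpus)
    contains = Counter(w for name in corpus for w in set(name))
    s, t = set(s_tokens), set(t_tokens)
    score = 0.0
    for w in s | t:
        pd = contains[w] / n          # assumed estimate of P_D(w)
        if w in s and w in t:
            score += -math.log(pd)    # agreement: IDF reward
        else:
            score += k * math.log(pd) # disagreement: mild, frequency-aware penalty
    return score
```

Since log P_D(w) is near 0 for very common tokens, disagreeing on a frequent word costs little, while disagreeing on a rare word is penalized more heavily, matching the intuition stated above.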
Hybrid distance functions
Monge and Elkan propose the following recursive matching scheme for comparing two long strings s and t. First, s and t are broken into substrings s = A1...AK and t = B1...BL. Then, similarity is defined as

    sim(s,t) = (1/K) Σ_{i=1}^{K} max_{j=1}^{L} sim'(A_i, B_j)

where sim' is some secondary distance function. We considered an implementation of this scheme in which the substrings are tokens; following Monge and Elkan, we call this a level two distance function. We experimented with level two similarity functions which used Monge-Elkan, Jaro, and Jaro-Winkler as their base function sim'.
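The recursive scheme above, with tokens as the substrings, is a one-line combinator once the secondary similarity sim' is passed in as a function (a Python sketch; the function name is illustrative):

```python
def level_two(s: str, t: str, sim) -> float:
    """Monge-Elkan level two scheme: average, over the tokens of s,
    of each token's best match in t under the secondary similarity sim'."""
    s_toks, t_toks = s.split(), t.split()
    if not s_toks or not t_toks:
        return 0.0
    return sum(max(sim(a, b) for b in t_toks) for a in s_toks) / len(s_toks)
```

For example, with an exact-match secondary similarity, `level_two("william cohen", "cohen william w", lambda a, b: float(a == b))` is 1.0, since every token of s has a perfect match in t; note the scheme is asymmetric in s and t.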
We also considered a "soft" version of TFIDF, in which similar tokens are considered as well as tokens in S ∩ T. Again let sim' be a secondary similarity function. Let CLOSE(θ,S,T) be the set of words w ∈ S such that there is some v ∈ T such that sim'(w,v) > θ, and for w ∈ CLOSE(θ,S,T), let D(w,T) = max_{v∈T} sim'(w,v). We define

    SoftTFIDF(S,T) = Σ_{w ∈ CLOSE(θ,S,T)} V(w,S) · V(w,T) · D(w,T)

In the experiments, we used Jaro-Winkler as a secondary distance and θ = 0.9.
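A sketch of the SoftTFIDF combination, under the assumption that the normalized TFIDF weights V(w,·) have already been computed (as dicts mapping tokens to weights) and that the matching token's weight stands in for V(w,T):

```python
def soft_tfidf(s_vec, t_vec, sim, theta=0.9):
    """SoftTFIDF sketch: s_vec and t_vec map tokens to precomputed,
    unit-normalized TFIDF weights; sim is the secondary similarity sim'.
    Tokens of s count if their best match in t exceeds theta."""
    score = 0.0
    for w, vw in s_vec.items():
        best, d = None, 0.0
        for v in t_vec:                 # D(w,T): best secondary match in t
            if sim(w, v) > d:
                best, d = v, sim(w, v)
        if best is not None and d > theta:
            score += vw * t_vec[best] * d
    return score
```

With an exact-match secondary similarity (sim' returning only 0 or 1), every surviving term has D(w,T) = 1 and the sum reduces to ordinary TFIDF over S ∩ T; replacing sim' with Jaro-Winkler is what lets near-miss tokens such as misspellings contribute.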
"Blocking" or pruning methods
In matching or clustering large lists of names, it is not computationally practical to measure distances between all pairs of strings. In statistical record linkage, it is common to group records by some variable which is known a priori to be usually the same for matching pairs. In census work, this grouping variable often names a small geographic region, and perhaps for this reason the technique is usually called "blocking".

Since this paper focuses on the behavior of string matching tools when little prior knowledge is available, we will use here knowledge-free approaches to reducing the set of string pairs to compare. In a matching task, there are two sets A and B. We consider as candidates all pairs of strings (s,t) ∈ A × B such that s and t share some substring v
Name         Src   M/C   #Strings   #Tokens
animal       1     M     5709       30,006
bird1        1     M     377        1,977
bird2        1     M     982        4,905
bird3        1     M     38         188
bird4        1     M     719        4,618
business     1     M     2139       10,526
game         1     M     911        5,060
park         1     M     654        3,425
fodorZagrat  2     M     863        10,846
ucdFolks     3     M     90         454
census       4     M     841        5,765
UVA          3     C     116        6,845
coraATDV     5     C     1916       47,512

Table 1: Datasets used in experiments. Column 2 indicates the source of the data, and column 3 indicates if it is a matching (M) or clustering (C) problem. Original sources are 1. (Cohen 2000) 2. (Tejada, Knoblock, & Minton 2001) 3. (Monge & Elkan 1996) 4. William Winkler (personal communication) 5. (McCallum, Nigam, & Ungar 2000)
which appears in at most a fraction f of all names. In a clustering task, there is one set C, and we consider all candidates (s,t) ∈ C × C such that s ≠ t, and again s and t share some not-too-frequent substring v. For purely token-based methods, the substring v must be a token, and otherwise, it must be a character 4-gram. Using inverted indices this set of pairs can be enumerated quickly.
For the moderate-size test sets considered here, we used f = 1. On the matching datasets above, the token blocker finds between 93.3% and 100.0% of the correct pairs, with an average of 98.9%. The 4-gram blocker also finds between 93.3% and 100.0% of the correct pairs, with an average of 99.0%.
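The token form of the blocker can be sketched with an inverted index (the 4-gram variant would index character 4-grams instead of tokens; the function name and whitespace tokenization are illustrative assumptions):

```python
from collections import defaultdict

def candidate_pairs(a_names, b_names, max_frac=1.0):
    """Token blocker for a matching task: propose (s, t) in A x B sharing
    at least one token that occurs in at most max_frac of all names."""
    names = a_names + b_names
    index = defaultdict(set)              # token -> indices of names containing it
    for i, name in enumerate(names):
        for tok in set(name.split()):
            index[tok].add(i)
    limit = max_frac * len(names)
    pairs = set()
    for ids in index.values():
        if len(ids) <= limit:             # skip tokens that are too frequent
            for i in ids:
                for j in ids:
                    if i < len(a_names) <= j:   # one name from A, one from B
                        pairs.add((names[i], names[j]))
    return pairs
```

Because each token's posting list is enumerated directly, the cost is driven by the sizes of the posting lists rather than by |A|·|B|.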
Experiments
Data and methodology
The data used to evaluate these methods is shown in Table 1. Most have been described elsewhere in the literature. The "coraATDV" dataset includes the fields author, title, date, and venue in a single string. The "census" dataset is a synthetic, census-like dataset, from which only textual fields were used (last name, first name, middle initial, house number, and street).
To evaluate a method on a dataset, we ranked by distance all candidate pairs from the appropriate grouping algorithm. We computed the non-interpolated average precision of this ranking, the maximum F1 score of the ranking, and also interpolated precision at the eleven recall levels 0.0, 0.1, ..., 0.9, 1.0. The non-interpolated average precision of a ranking containing N pairs for a task with m correct matches is

    (1/m) Σ_{i=1}^{N} c(i)·δ(i)/i

where c(i) is the number of correct pairs ranked before position i, and δ(i) = 1 if the pair at rank i is correct and 0 otherwise. Interpolated precision at recall r is max_i c(i)/i, where the max is taken over all ranks i such that c(i)/m ≥ r. Maximum F1 is max_{i>0} F1(i), where F1(i) is the harmonic mean of recall at rank i (i.e., c(i)/m) and precision at rank i (i.e., c(i)/i).

[Figure 1: Relative performance of token-based measures. Left, max F1 of methods on matching problems, with points above the line y = x indicating better performance of TFIDF. Middle, same for non-interpolated average precision. Right, precision-recall curves averaged over all matching problems. Smoothed versions of Jensen-Shannon (not shown) are comparable in performance to the unsmoothed version.]

[Figure 2: Relative performance of edit-distance measures. Left and middle, points above (below) the line y = x indicating better (worse) performance for Monge-Elkan, the system performing best on average.]
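The two summary scores defined above can be computed in a single pass over the ranking (a sketch; `ranked_correct` marks which ranked pair is a true match, with c(i) counted inclusively of rank i):

```python
def ranking_scores(ranked_correct, m):
    """Given ranked_correct[i] = True iff the pair at rank i+1 is a correct
    match, and m total correct matches, return (non-interpolated average
    precision, maximum F1) for the ranking."""
    c, ap, max_f1 = 0, 0.0, 0.0
    for i, correct in enumerate(ranked_correct, 1):
        if correct:
            c += 1
            ap += c / i                      # c(i) * delta(i) / i
        recall, precision = c / m, c / i
        if recall + precision > 0:
            max_f1 = max(max_f1, 2 * recall * precision / (recall + precision))
    return ap / m, max_f1
```

For instance, a ranking whose 1st and 3rd pairs are the two true matches has average precision (1/2)(1/1 + 2/3) = 5/6 and maximum F1 = 0.8, attained at rank 3.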
Results for matching
We will first consider the matching datasets. As shown in Figure 1, TFIDF seems generally the best among the token-based distance metrics. It does slightly better on average than the others, and is seldom much worse than any other method with respect to interpolated average precision or maximum F1.

As shown in Figure 2, the situation is less clear for the edit-distance based methods. The Monge-Elkan method does best on average, but the Jaro and Jaro-Winkler methods are close in average performance, and do noticeably better than Monge-Elkan on several problems. The Jaro variants are also substantially more efficient (at least in our implementation), taking about 1/10 the time of the Monge-Elkan method. (The token-based methods are faster still, averaging less than 1/10 the time of the Jaro variants.)

As shown in Figure 3, SoftTFIDF is generally the best among the hybrid methods we considered. In general, the time for the hybrid methods is comparable to using the underlying string edit distance. (For instance, the average matching time for SoftTFIDF is about 13 seconds on these problems, versus about 20 seconds for the Jaro-Winkler method, and 1.2 seconds for pure token-based TFIDF.)

Finally, Figure 4 compares the three best performing edit-distance-like methods, the two best token-based methods, and the two best hybrid methods, using a similar methodology. Generally speaking, SoftTFIDF is the best overall distance measure for these datasets.
Results for clustering
It should be noted that the test suite of matching problems is dominated by problems from one source: eight of the eleven test cases are associated with the WHIRL project, and a different distribution of problems might well lead to quite different results. For instance, while the token-based methods perform well on average, they perform poorly on the census-like dataset, which contains many misspellings. As an additional experiment, we evaluated the four best-performing distance metrics (SoftTFIDF, TFIDF, SFS, and Level 2 Jaro-Winkler) on the two clustering problems, which are taken from sources other than the WHIRL project. Table 2 shows MaxF1 and non-interpolated average precision for each method on each problem. SoftTFIDF again slightly outperforms the other methods on both of the tasks.

              UVA                 CoraATDV
Method        MaxF1    AvgPrec    MaxF1    AvgPrec
SoftTFIDF     0.89     0.91       0.85     0.914
TFIDF         0.79     0.84       0.84     0.907
SFS           0.71     0.75       0.82     0.864
Level2 J-W    0.73     0.69       0.76     0.804

Table 2: Results for selected distance metrics on two clustering problems.

[Figure 3: Relative performance of hybrid distance measures on matching problems, relative to the SoftTFIDF metric, which performs best on average.]

[Figure 4: Relative performance of best distance measures of each type on matching problems, relative to the SoftTFIDF metric.]

Learning to combine distance metrics

Another type of hybrid distance function can be obtained by combining other distance metrics. Following previous researchers (Tejada, Knoblock, & Minton 2001; Cohen & Richman 2002; Bilenko & Mooney 2002) we used a learning scheme to combine several of the distance functions above. Specifically, we represented pairs as feature vectors, using as features the numeric scores of Monge-Elkan, Jaro-Winkler, TFIDF, SFS, and SoftTFIDF. We then trained a binary SVM classifier (using SVMLight (Joachims 2002)) using these features, and used its confidence in the "match" class as a score.
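The feature-vector construction can be sketched as follows. The paper trains an SVM with SVMLight; to keep this illustration self-contained, a tiny logistic-regression learner stands in for the SVM (an assumption, not the paper's setup), with each row of X holding the distance-metric scores for one candidate pair:

```python
import math

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Learn weights (plus bias, stored last) over distance-metric scores
    by stochastic gradient descent on the logistic loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[-1] + sum(a * b for a, b in zip(w, xi))
            g = 1 / (1 + math.exp(-z)) - yi        # prediction error
            for j in range(len(xi)):
                w[j] -= lr * g * xi[j]
            w[-1] -= lr * g
    return w

def match_score(w, x):
    """Confidence in the 'match' class, used as the combined distance score."""
    z = w[-1] + sum(a * b for a, b in zip(w, x))
    return 1 / (1 + math.exp(-z))
```

As in the paper's setup, the learned combiner needs labeled candidate pairs to fit its weights, unlike the individual metrics it combines.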
Figure 5 shows the results of a three-fold cross-validation on nine of the matching problems. The learned combination generally slightly outperforms the individual metrics, including SoftTFIDF, particularly at extreme recall levels. Note, however, that the learned metric uses much more information: in particular, in most cases it has been trained on several thousand labeled candidate pairs, while the other metrics considered here require no training data.
Concluding remarks
Recently, we have begun implementing an open-source, Java toolkit of name-matching methods. This toolkit includes a variety of different techniques, as well as the infrastructure to combine techniques readily, and evaluate them systematically on test data. Using this toolkit, we conducted a comparison of several string distances on the tasks of matching and clustering lists of entity names. Many of these were techniques previously proposed in the literature, and some are novel hybrids of previous methods.

We compared the accuracy of these methods for use in an automatic matching scheme, in which pairs of names are proposed by a simple grouping method, and then ranked according to distance. Used in this way, we saw that the TFIDF ranking performed best among several token-based distance metrics, and that a tuned affine-gap edit-distance metric proposed by Monge and Elkan performed best among several string edit-distance metrics. A surprisingly good distance metric is a fast heuristic scheme, proposed by Jaro (Jaro 1995; 1989) and later extended by Winkler (Winkler 1999). This works almost as well as the Monge-Elkan scheme, but is an order of magnitude faster.

One simple way of combining the TFIDF method and the Jaro-Winkler method is to replace the exact token matches used in TFIDF with approximate token matches based on the Jaro-Winkler scheme. This combination performs slightly better than either Jaro-Winkler or TFIDF on average, and occasionally performs much better. It is also close in performance to a learned combination of several of the best metrics considered in this paper.

[Figure 5: Relative performance of best distance measures of each type on matching problems, relative to a learned combination of the same measures.]
Acknowledgements
The preparation of this paper was supported in part by National Science Foundation Grant No. EIA-0131884 to the National Institute of Statistical Sciences and by a contract from the Army Research Office to the Center for Computer and Communications Security with Carnegie Mellon University.
References
Belin, T. R., and Rubin, D. B. 1997. A method for calibrating false-match rates in record linkage. In Record Linkage 1997: Proceedings of an International Workshop and Exposition, 81-94. U.S. Office of Management and Budget (Washington).
Bilenko, M., and Mooney, R. 2002. Learning to combine trained distance metrics for duplicate detection in databases. Technical Report AI 02-296, Artificial Intelligence Lab, University of Texas at Austin. Available from www.cs.utexas.edu/users/ml/papers/marlin-tr-02.pdf.
Cohen, W. W., and Ravikumar, P. 2003. SecondString: An open-source Java toolkit of approximate string-matching techniques. Project web page, secondstring.sourceforge.net.
Cohen, W. W., and Richman, J. 2002. Learning to match and cluster large high-dimensional data sets for data integration. In Proceedings of The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002).

Cohen, W. W. 2000. Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems 18(3):288-321.

Dagan, I.; Lee, L.; and Pereira, F. 1999. Similarity-based models of word cooccurrence probabilities. Machine Learning 34(1-3).

Durbin, R.; Eddy, S. R.; Krogh, A.; and Mitchison, G. 1998. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge: Cambridge University Press.

Fellegi, I. P., and Sunter, A. B. 1969. A theory for record linkage. Journal of the American Statistical Association 64:1183-1210.
Galhardas, H.; Florescu, D.; Shasha, D.; and Simon, E. 2000. An extensible framework for data cleaning. In ICDE, 312.

Hernandez, M., and Stolfo, S. 1995. The merge/purge problem for large databases. In Proceedings of the 1995 ACM SIGMOD.

Jaro, M. A. 1989. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association 84:414-420.

Jaro, M. A. 1995. Probabilistic linkage of large public health data files (disc: P687-689). Statistics in Medicine 14:491-498.

Joachims, T. 2002. Learning to Classify Text Using Support Vector Machines. Kluwer.

Lafferty, J., and Zhai, C. 2001. A study of smoothing methods for language models applied to ad hoc information retrieval. In 2001 ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR).
Larsen, M. 1999. Multiple imputation analysis of records linked using mixture models. In Statistical Society of Canada Proceedings of the Survey Methods Section, 65-71. Statistical Society of Canada (McGill University, Montreal).
McCallum, A.; Nigam, K.; and Ungar, L. H. 2000. Efficient clustering of high-dimensional data sets with application to reference matching. In Knowledge Discovery and Data Mining, 169-178.

Monge, A., and Elkan, C. 1996. The field-matching problem: algorithm and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining.

Monge, A., and Elkan, C. 1997. An efficient domain-independent algorithm for detecting approximately duplicate database records. In The proceedings of the SIGMOD 1997 workshop on data mining and knowledge discovery.
Pasula, H.; Marthi, B.; Milch, B.; Russell, S.; and Shpitser, I. 2002. Identity uncertainty and citation matching. In Advances in Neural Information Processing Systems 15. Vancouver, British Columbia: MIT Press.
Raman, V., and Hellerstein, J. 2001. Potter's wheel: An interactive data cleaning system. In The VLDB Journal, 381-390.

Ristad, E. S., and Yianilos, P. N. 1998. Learning string edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(5):522-532.

Tejada, S.; Knoblock, C. A.; and Minton, S. 2001. Learning object identification rules for information integration. Information Systems 26(8):607-633.
Winkler, W. E. 1999. The state of record linkage and current research problems. Statistics of Income Division, Internal Revenue Service Publication R99/04. Available from www.census.gov/srd/www/byname.html.