Similarity of Chinese strings
How to determine the similarity of Chinese strings? (ZT) (2008-06-14 09:19:27)
Reprint. Tags: Chinese string similarity, component weight, IT
Abstract
In data mining research, we often need to decide whether articles are similar, or to classify similar articles or short sentences. This raises the question of how to determine the degree of similarity between two strings.
Based on the author's practical work experience and data mining theory, this paper introduces a relatively complete set of methods for solving this problem, taking the characteristics of Chinese strings into account.
Analysis
Solving the simplest case
A string is made up of a set of words with different meanings. Unlike a numerical variable, it has no specific value that fixes its size or position, so how to describe the distance between two strings becomes a problem worth discussing.
The types of data typically used in analysis are: interval-scaled variables, binary variables, nominal variables, ordinal variables, ratio-scaled variables, mixed-type variables, and so on.
Of these types, this paper argues that a string is best treated as a set of binary variables: we segment the string into words and treat each word as one binary attribute. Let R be the attribute set containing all words, so that the words of both string 1 and string 2 are in R. Then q is the number of words present in both string 1 and string 2, s is the number of words present in string 1 but not in string 2, r is the number of words present in string 2 but not in string 1, and t is the number of words present in neither string. We call q, r, s and t the 4 state components of a string comparison, as shown in Figure 1.
Because words that appear in neither string have no effect on the comparison, t is ignored. We therefore use the noninvariant similarity measure (the Jaccard coefficient) to describe the two strings. The dissimilarity is expressed by the formula
Dissimilarity = (r+s) / (q+r+s). From this it is easy to infer the corresponding similarity formula:
Similarity = q / (q+r+s)    (Formula 1)
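The step from dissimilarity to similarity is short, since the two are complements of each other. Written out with the same q, r and s (this is only the algebra behind Formula 1, not an additional result):

```latex
\[
\text{Similarity} = 1 - \text{Dissimilarity}
= 1 - \frac{r+s}{q+r+s}
= \frac{q}{q+r+s}
\]
```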
Figure 1: Relationship between the two strings
For example, take the following two strings:
String 1: asymmetric variables
String 2: asymmetric space
Here "asymmetric" is segmented into two words, "non" and "symmetric", so each string has three words and their binary attribute table is:
String / attribute      non    symmetric    variables    space
asymmetric variables    Y      Y            Y            N
asymmetric space        Y      Y            N            Y
Y indicates that the word attribute is present in the string, and N indicates that it is not.
Correspondingly:
q = 2; r = 1; s = 1
The similarity of the two strings is 2 / (2+1+1) = 50%.
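The calculation above can be sketched in a few lines of Python. This is a minimal illustration rather than the author's implementation: the function name `jaccard_similarity` is hypothetical, the word lists are an assumed segmentation in which "asymmetric" splits into "non" and "symmetric", and returning 0.0 for two empty strings is a choice made here, not something the article specifies.

```python
def jaccard_similarity(words1, words2):
    """Formula 1: q / (q + r + s), computed over the distinct words of two strings."""
    set1, set2 = set(words1), set(words2)
    q = len(set1 & set2)  # words present in both strings
    s = len(set1 - set2)  # words present only in string 1
    r = len(set2 - set1)  # words present only in string 2
    total = q + r + s
    # Assumption: two empty strings share no attributes, so report 0.0.
    return q / total if total else 0.0


# Assumed segmentation of the example strings; a real system would get
# these lists from a Chinese word segmenter.
string1 = ["non", "symmetric", "variables"]
string2 = ["non", "symmetric", "space"]
print(jaccard_similarity(string1, string2))  # 0.5
```

Because the function works on sets, a word that occurs several times within one string is counted only once.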
Solving the repeated-word problem
The discussion above covers the simplest string comparison problem, in which no word is repeated within a single string. If repeated words do appear in a string, however, applying the formula from the previous section gives unsatisfactory results. For example:
String 1: move forward
String 2: move forward
Applying Formula 1, Similarity = q / (q+r+s), we get
q = 1 and r = s = 0, so the similarity is 100%, even though the two strings are not in fact identical: one of them repeats a word that the other contains only once. To solve this problem, we treat the same word appearing at different positions as different words, and distinguish them by their order of appearance in the string. The binary attribute table then becomes:
String / attribute    forward 1    forward 2
move forward          Y            Y