Similarity of Chinese Strings

Updated: 2023-05-08 18:49:42

How to determine the similarity of Chinese strings (ZT, repost) (2008-06-14 09:19:27)

Repost. Tags: Chinese string similarity, component weight, IT

Abstract

In data mining research, we often need to determine whether articles are similar, or to classify similar articles or short sentences. This raises the question of how to measure the degree of similarity between two strings.

Based on the author's practical work experience and data mining theory, this paper introduces a relatively complete set of methods for solving this problem, taking the characteristics of Chinese strings into account.

Analysis

Solving the simplest problem

A string is made up of a set of words with different meanings. Unlike a numerical variable, its size or position cannot be pinned down by a specific value, so how to describe the distance between two strings becomes a problem worth discussing.

Typically, the types of data used for analysis are: interval-scaled variables, binary variables, nominal variables, ordinal variables, ratio-scaled variables, and mixed-type variables.

Among these types, this paper argues that a string is best classified as a set of binary variables: we can segment the string into words and treat each word as one binary attribute. Let R be the set of all words, treated as binary variables, so that the words of both string 1 and string 2 are contained in R. Let q be the number of words that appear in both string 1 and string 2; s the number of words that appear in string 1 but not in string 2; r the number of words that appear in string 2 but not in string 1; and t the number of words in R that appear in neither string. We call q, r, s, and t the four state components of the string comparison, as shown in Figure 1:

Because words that appear in neither string have no effect on the comparison, t is ignored. We therefore use the non-invariant similarity coefficient (the Jaccard coefficient) to describe the two strings. The dissimilarity is given by

Dissimilarity = (r+s) / (q+r+s)

from which it is not difficult to infer the corresponding similarity formula:

Similarity = q / (q+r+s)    (Formula 1)
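The step from dissimilarity to similarity can be written out explicitly. This assumes the usual convention, which the article does not state, that similarity is one minus dissimilarity:

```latex
\[
\text{Similarity}
  = 1 - \text{Dissimilarity}
  = 1 - \frac{r+s}{q+r+s}
  = \frac{q}{q+r+s}
\]
```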

Figure 1: Relationship between the two strings

For example, consider the following two strings:

String 1: asymmetric variables

String 2: asymmetric space

Their binary attribute table is:

String / attribute      a-    symmetric    variable    space

Asymmetric variables    Y     Y            Y           N

Asymmetric space        Y     Y            N           Y

Y indicates that the word attribute is present in the string, and N indicates that it is not.

Correspondingly,

q = 2; r = 1; s = 1

so the similarity of the two strings is 2 / (2+1+1) = 50%.
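The calculation above can be sketched in Python. This is a minimal sketch, assuming the strings have already been segmented into words. The function names `state_components` and `jaccard_similarity`, the English word lists standing in for the segmented strings, and the choice to return 0.0 for two empty strings are illustrative assumptions, not part of the original article.

```python
def state_components(words1, words2):
    """Return (q, r, s) for two word lists, treating each list as a set of words."""
    set1, set2 = set(words1), set(words2)
    q = len(set1 & set2)  # words present in both strings
    r = len(set2 - set1)  # words present in string 2 but not in string 1
    s = len(set1 - set2)  # words present in string 1 but not in string 2
    return q, r, s


def jaccard_similarity(words1, words2):
    """Formula 1: similarity = q / (q + r + s)."""
    q, r, s = state_components(words1, words2)
    total = q + r + s
    # Two empty strings share no words; returning 0.0 here is an assumed convention.
    return q / total if total else 0.0


# Hypothetical segmentation of the two example strings into word attributes.
string1 = ["a-", "symmetric", "variable"]
string2 = ["a-", "symmetric", "space"]

print(state_components(string1, string2))    # (2, 1, 1)
print(jaccard_similarity(string1, string2))  # 0.5
```

Because the words are placed in sets, a word that occurs several times in one string is counted only once.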

Solving the word repetition problem

The discussion above covers the simplest string comparison problem, in which no word is repeated within a single string. If a word is repeated within a string, however, the formula from the previous section gives unsatisfactory results. For example:

String 1: move forward

String 2: move forward

Calculating with Formula 1, Similarity = q / (q+r+s), we have

q = 1 and r = s = 0, which gives a similarity of 100%, yet the two strings are in fact not identical. To solve this problem, we must treat the same word appearing at different positions as different words, distinguished by the order in which they appear in the string, so that the binary attribute table becomes:

String / attribute      forward 1    forward 2

Move forward            Y            Y
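The idea of treating repeated words as distinct attributes can be sketched as follows. This is a minimal sketch, not the author's implementation: the helper names `tag_occurrences` and `similarity_with_repeats` and the two sample word lists are hypothetical, chosen only to illustrate tagging each word with its occurrence number before applying Formula 1.

```python
from collections import Counter


def tag_occurrences(words):
    """Pair each word with its occurrence number, e.g. ('forward', 2)."""
    seen = Counter()
    tagged = []
    for word in words:
        seen[word] += 1
        tagged.append((word, seen[word]))
    return tagged


def similarity_with_repeats(words1, words2):
    """Apply Formula 1, q / (q + r + s), to occurrence-tagged words."""
    set1 = set(tag_occurrences(words1))
    set2 = set(tag_occurrences(words2))
    q = len(set1 & set2)  # tagged words present in both strings
    r = len(set2 - set1)  # tagged words present in string 2 only
    s = len(set1 - set2)  # tagged words present in string 1 only
    total = q + r + s
    # Returning 0.0 for two empty strings is an assumed convention.
    return q / total if total else 0.0


# Hypothetical example: one string has the word once, the other has it twice.
once = ["forward"]
twice = ["forward", "forward"]

print(tag_occurrences(twice))                # [('forward', 1), ('forward', 2)]
print(similarity_with_repeats(once, twice))  # 0.5
```

With the tagging in place, the second occurrence counts as a word missing from the shorter string, so the similarity drops below 100%.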
