Similarity of Chinese Strings

Updated: 2023-05-08 18:49:42

How to determine the similarity of Chinese strings (ZT) (2008-06-14 09:19:27)
Reprint. Tags: Chinese string similarity, component weight, IT
Abstract
In data mining research we often need to determine whether articles are similar, or to classify similar articles or short sentences. This raises the problem of how to determine the degree of similarity between two strings.
Based on the author's practical work experience and data mining theory, this paper introduces a relatively complete set of methods for solving this problem, taking the characteristics of Chinese strings into account.
Analysis
Solving the simplest problem
A string is composed of a set of words with different meanings. Unlike a numerical variable, its size or position cannot be determined by a specific value, so how to describe the distance between two strings becomes a problem worth discussing.
Typically, the types of data used for analysis are as follows: interval-scaled variables, binary variables, nominal variables, ordinal variables, ratio-scaled variables, mixed-type variables, and so on.
Among these variable types, this paper argues that a string is best classified as a set of binary variables: we segment the string into a number of words and treat each word as one binary attribute. Let R be the set of all words, taken as binary variables, so that the words of string 1 and string 2 are both contained in R. q is the number of words that occur in both string 1 and string 2; s is the number of words that occur in string 1 but not in string 2; r is the number of words that occur in string 2 but not in string 1; t is the number of words in R that occur in neither string. We call q, r, s, and t the four state components of a string comparison, as shown in Figure 1.
Because words that occur in neither string have no effect on the comparison, t is ignored. We therefore use the non-invariant similarity coefficient (the Jaccard coefficient) to describe the two strings; the dissimilarity is given by the formula
Dissimilarity = (r+s) / (q+r+s), from which it is not difficult to infer that the similarity formula is
Similarity = q / (q+r+s)    (Formula 1)
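The step from dissimilarity to similarity can be written out explicitly. The short derivation below assumes, as is usual for the Jaccard coefficient, that similarity is one minus dissimilarity; the text itself only states that the result is not difficult to infer.

```latex
\[
\text{Similarity}
  = 1 - \text{Dissimilarity}
  = 1 - \frac{r + s}{q + r + s}
  = \frac{q}{q + r + s}
\]
```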
Figure 1: Description of the relation between the two strings
For example, the following two strings of characters:
String 1: asymmetric variables
String 2: asymmetric space
Their binary attribute table is as follows (the word "asymmetric" counts as two words, shown here as "non-" and "symmetric"):
String / attribute      non-   symmetric   variables   space
Asymmetric variables    Y      Y           Y           N
Asymmetric space        Y      Y           N           Y
Y indicates that the word attribute is present in the string, and N indicates that it is not present.
Accordingly,
s = 1; q = 2; r = 1
The similarity of the two strings is 2 / (1+2+1) = 50%
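The calculation above can be sketched in Python. This is only a minimal illustration, not the author's implementation: it assumes each string has already been segmented into a list of words (segmentation itself is not shown), it assumes "asymmetric" counts as the two words "non" and "symmetric" as the four Y/N columns suggest, and the function names `state_components` and `similarity` are hypothetical.

```python
def state_components(words1, words2):
    """Return (q, r, s) for two strings given as lists of words."""
    set1, set2 = set(words1), set(words2)
    q = len(set1 & set2)  # words present in both strings
    s = len(set1 - set2)  # words present in string 1 only
    r = len(set2 - set1)  # words present in string 2 only
    return q, r, s


def similarity(words1, words2):
    """Formula 1: q / (q + r + s)."""
    q, r, s = state_components(words1, words2)
    total = q + r + s
    # Assumption: two empty strings are treated as identical.
    return q / total if total else 1.0


# Hypothetical segmentation of the two example strings.
string1 = ["non", "symmetric", "variables"]
string2 = ["non", "symmetric", "space"]

print(state_components(string1, string2))  # (2, 1, 1)
print(similarity(string1, string2))        # 0.5
```

Because the words are collected into sets, a word repeated inside one string is counted only once, which is exactly the limitation that repeated words expose.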
Solving the word repetition problem
The discussion above covers the simplest string comparison problem, in which no word is repeated within a single string. If repeated words do appear in a string, however, the formula from the previous section no longer gives ideal results. For example:
String 1: move forward
String 2: move forward
Calculating with Formula 1, similarity = q / (q+r+s), we get
q = 1 and r = s = 0, so the similarity is 100%, although in fact the two strings are not exactly the same. To solve this problem, we must treat the same word appearing at different positions as different words, distinguished by the order in which they appear in the string, so that the binary attribute table is as follows:
String / attribute    forward 1    forward 2
Move forward          Y            Y
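The idea of treating repeated words as distinct attributes can be sketched as follows. This is a hedged illustration rather than the author's code: the helper names `tag_occurrences` and `positional_similarity` are hypothetical, the input is assumed to be already segmented into words, and the sample input (one "forward" versus two) is an assumed reading of the example, not data taken from the text.

```python
from collections import Counter


def tag_occurrences(words):
    """Pair each word with its occurrence number: 1st, 2nd, ... time it appears."""
    seen = Counter()
    tagged = []
    for word in words:
        seen[word] += 1
        tagged.append((word, seen[word]))
    return tagged


def positional_similarity(words1, words2):
    """Formula 1 applied to occurrence-tagged words instead of bare words."""
    set1 = set(tag_occurrences(words1))
    set2 = set(tag_occurrences(words2))
    q = len(set1 & set2)  # tagged words present in both strings
    s = len(set1 - set2)  # tagged words present in string 1 only
    r = len(set2 - set1)  # tagged words present in string 2 only
    total = q + r + s
    # Assumption: two empty strings are treated as identical.
    return q / total if total else 1.0


# Assumed example: the second string repeats the word once.
print(tag_occurrences(["forward", "forward"]))
# [('forward', 1), ('forward', 2)]
print(positional_similarity(["forward"], ["forward", "forward"]))
# 0.5
```

With the occurrence tags, the second "forward" becomes an attribute that only one string has, so the similarity under these assumptions drops from 100% to 50%.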
