Text Similarity Computation (English and Chinese): A Hands-On Walkthrough

This post uses the TF-IDF model to compute text similarity for both English and Chinese text.
1. English text similarity
1. The test documents

```python
from gensim import corpora, models, similarities
import numpy as np

documents = [
    "Is there anything good playing?",
    "let's meet at the movie theater entrance tonight. Don't be late.",
    "Are you going to the movie theater with me tonight?",
    "I get a lump in my throat whenever I see a tragic movie.",
    "you're just too emotional.",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
]
```

2. Remove stop words; unlike Chinese, English text needs no word-segmentation step
```python
# Remove the stop words
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
print(texts)
```
Output (truncated):

```
[['is', 'there', 'anything', 'good', 'playing?'], ["let's", 'meet', 'at', 'movie', 'theater', 'entrance', 'tonight.', "don't", 'be', 'late.'], ['are', 'you', 'going', 'movie', ...
```
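Note that the naive `.lower().split()` keeps punctuation attached to tokens ('playing?', 'late.'), so the same word can end up with different IDs. The sketch below is my variation of a stricter tokenizer; it is not used in the rest of the walkthrough, which keeps the naive split so the outputs match.

```python
import re

def clean_tokens(doc, stoplist):
    # Strip non-word characters from each token, then drop empties and stop words
    words = [re.sub(r"[^\w']", "", w) for w in doc.lower().split()]
    return [w for w in words if w and w not in stoplist]

texts_clean = [clean_tokens(doc, stoplist) for doc in documents]
```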
3. Build the dictionary, i.e., assign every word a unique integer ID

4. Based on the dictionary, use the bag-of-words model to convert the token lists into a set of sparse vectors, i.e., the corpus

Like word2vec, doc2bow represents words as features; in the corpus output below, a pair such as (0, 1) means that the word with ID 0 occurred once in that document.

```python
# Build the dictionary from the token lists
dictionary = corpora.Dictionary(texts)
print(dictionary)
print(dictionary.token2id)  # the unique ID of every word

# Bag-of-words model: based on the dictionary, convert the token lists
# into a set of sparse vectors, i.e., the corpus
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)
```
Output (truncated):

```
Dictionary(44 unique tokens: ['anything', 'good', 'is', 'playing?', 'there']...)
{'anything': 0, 'good': 1, 'is': 2, 'playing?': 3, 'there': 4, 'at': 5, 'be': 6, "don't": 7, 'entrance': 8, 'late.': 9, "let's": 10, 'meet': 11, 'movie': 12, 'theater': 13, ...
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)], [(12, 1), (13, 1), (15, 1), (16, 1), (17, ...
```
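The ID pairs are hard to read; a small helper (my addition, not in the original post) maps each (token_id, count) pair back to its word:

```python
# Make a bag-of-words vector readable: gensim's Dictionary supports
# dictionary[token_id] -> token string
def explain_bow(dictionary, bow):
    return [(dictionary[token_id], count) for token_id, count in bow]

print(explain_bow(dictionary, corpus[0]))
# [('anything', 1), ('good', 1), ('is', 1), ('playing?', 1), ('there', 1)]
```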
5. Train the TF-IDF model on the corpus

```python
# Train the TF-IDF model on the corpus
tfidf = models.TfidfModel(corpus)
# Run the texts to be searched (the corpus itself) through the trained TF-IDF model
corpus_tfidf = tfidf[corpus]
```
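To see what the trained model produces, one can inspect a single document's weights (a quick check I've added):

```python
# Each pair in a TF-IDF vector is (token_id, weight); words that appear
# in fewer documents get higher weights than common ones
for token_id, weight in tfidf[corpus[0]]:
    print(dictionary[token_id], round(weight, 3))
```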
6. Test with a query sentence

```python
# The sentence to search for
query = 'I like movie'
# Use doc2bow to turn the split query into a sparse vector according to the dictionary
vec_bow = dictionary.doc2bow(query.split())
print(vec_bow)    # [(12, 1)]: 'movie' has ID 12 and occurs once; 'i' and 'like' do not appear in the corpus
# Then compute its TF-IDF representation
vec_tfidf = tfidf[vec_bow]
print(vec_tfidf)  # [(12, 1.0)]
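```

Note what happens when no query word is known: doc2bow silently drops out-of-vocabulary words, so such a query produces an empty vector and all-zero similarities (a small demonstration I've added):

```python
# A fully out-of-vocabulary query yields an empty bag-of-words vector
print(dictionary.doc2bow('completely unseen words'.split()))  # []
```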
7. Compute the similarity between this query and every sentence in the corpus

```python
# Similarity retrieval: build the index over the TF-IDF corpus
similarity = similarities.MatrixSimilarity(corpus_tfidf)
# Similarity scores between the query and every sentence in the corpus
sims = similarity[vec_tfidf]
print(sims)
```
```
[0. 0.21666655 0.24635962 0. 0. 0. 0.]
```

As you can see, the query is similar to sentences 1 and 2 (indices start at 0).
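MatrixSimilarity scores are cosine similarities between the TF-IDF vectors. As a sanity check (my addition; it reuses the variables defined above), the score for document 2 can be reproduced by hand:

```python
# Reproduce sims[2]: cosine similarity between the query's TF-IDF vector
# and document 2's TF-IDF vector, both densified over the dictionary
def to_dense(vec, size):
    dense = np.zeros(size)
    for token_id, weight in vec:
        dense[token_id] = weight
    return dense

size = len(dictionary)
q = to_dense(vec_tfidf, size)
d = to_dense(tfidf[corpus[2]], size)
print(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))  # ~0.246
```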
8. Extract the largest similarity value

```python
# Index of the most similar text
max_loc = np.argmax(sims)
print(max_loc)  # 2
# The largest similarity value
max_sim = sims[max_loc]
print(max_sim)  # 0.24635962
```
```
2
0.24635962
```

Now that we have the index, we can map back into the original document list to retrieve the matching sentence, and the task is complete.
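Since the similarity array and the document list share the same indices, that mapping is a one-line lookup:

```python
print(documents[max_loc])  # "Are you going to the movie theater with me tonight?"
```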
Complete code:

```python
from gensim import corpora, models, similarities
import numpy as np

# Build the document set
documents = [
    "Is there anything good playing?",
    "let's meet at the movie theater entrance tonight. Don't be late.",
    "Are you going to the movie theater with me tonight?",
    "I get a lump in my throat whenever I see a tragic movie.",
    "you're just too emotional.",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
]

# Remove the stop words
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
print(texts)

# Build the dictionary
dictionary = corpora.Dictionary(texts)
print(dictionary)
print(dictionary.token2id)  # the unique ID of every word

# Bag-of-words model: convert the token lists into the sparse-vector corpus
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)

# Train the TF-IDF model on the corpus
tfidf = models.TfidfModel(corpus)
# Run the corpus through the trained TF-IDF model
corpus_tfidf = tfidf[corpus]

# The sentence to search for
query = 'I like movie'
# Turn the split query into a sparse vector according to the dictionary
vec_bow = dictionary.doc2bow(query.split())
# Then compute its TF-IDF representation
vec_tfidf = tfidf[vec_bow]

# Similarity retrieval: build the index over the TF-IDF corpus
similarity = similarities.MatrixSimilarity(corpus_tfidf)
# Similarity scores between the query and every sentence in the corpus
sims = similarity[vec_tfidf]

# Index of the most similar text
max_loc = np.argmax(sims)
# The largest similarity value
max_sim = sims[max_loc]

# Save and reload the model
# tfidf.save("data.tfidf")
# tfidf = models.TfidfModel.load("data.tfidf")
# print(tfidf.dfs)
```

2. Chinese text similarity

The detailed steps were covered above, so here is the code directly. Chinese only adds a word-segmentation step; everything else is the same. The full listing follows below.
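First, for readers unfamiliar with jieba, a short demo (my addition) of what the extra segmentation step does; the exact token split may vary with the jieba version:

```python
import jieba

# Precise-mode segmentation of one of the test sentences
print(jieba.lcut("最近有新上映的电影", cut_all=False))
# e.g. ['最近', '有', '新', '上映', '的', '电影']
```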
```python
import jieba
import numpy as np
from gensim import corpora, models, similarities


def _get_stop_words():
    """Read the stop-word file."""
    # NOTE: the file name was lost in the original post; substitute your own stop-word file
    with open('stopwords.txt', encoding='utf-8') as f:
        stopwords = f.read()
    return set(stopwords.split()) | {'shi'}


def delete_stopwords(documents):
    """Segment the documents and remove stop words."""
    # Stop-word set
    stopwords_list = _get_stop_words()
    # Precise-mode segmentation
    cut_words_list = [jieba.lcut(i, cut_all=False, HMM=True) for i in documents]
    # Remove the stop words
    del_stop_words_list = []
    for word_list in cut_words_list:
        del_stop_words_list.append([word for word in word_list if word not in stopwords_list])
    return del_stop_words_list


# Build the document set
documents = [
    "今天去打篮球吗?",
    "明天晚上八点半的电影,准时到shi",
    "最近有新上映的电影,挺好看的,改天去看吗?",
    "明天天气要下雨.",
    "今天太热了.",
    "工作好累啊,不想努力了",
    "跟着自己的内心走",
]

# Remove the stop words
texts = delete_stopwords(documents)
print(texts)

# Build the dictionary
dictionary = corpora.Dictionary(texts)
# print(dictionary.token2id)  # the unique ID of every word

# Bag-of-words model: convert the token lists into the sparse-vector corpus
corpus = [dictionary.doc2bow(text) for text in texts]

# Train the TF-IDF model on the corpus
tfidf = models.TfidfModel(corpus)
# Run the corpus through the trained TF-IDF model
corpus_tfidf = tfidf[corpus]

# The sentence to search for
query = '好想 去 电影院 看 电影'
# Turn the split query into a sparse vector according to the dictionary
vec_bow = dictionary.doc2bow(query.split())
# Then compute its TF-IDF representation
vec_tfidf = tfidf[vec_bow]

# Similarity retrieval: build the index over the TF-IDF corpus
similarity = similarities.MatrixSimilarity(corpus_tfidf)
# Similarity scores between the query and every sentence in the corpus
sims = similarity[vec_tfidf]

# Index of the most similar text
max_loc = np.argmax(sims)
print(max_loc)
# The largest similarity value
max_sim = sims[max_loc]
```
Output (truncated):

```
[['今天', '去', '打篮球'], ['明天', '八点半', '电影', '准时'], ['新', '上映', '电影', '挺', '好看', '改天', '去', '看'], ['明天', '天气', '下雨'], ['今天', '太热'], ['好累', '不想'], ['跟着', ...
2
```

This shows that the query '好想 去 电影院 看 电影' is most similar to "最近有新上映的电影,挺好看的,改天去看吗?" in the document set.

Summary

In practice, you can set a similarity threshold to filter out texts whose similarity to the query is too low.
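A minimal sketch of such filtering (my addition; the 0.2 cutoff is an arbitrary assumption to tune per dataset):

```python
THRESHOLD = 0.2  # arbitrary cutoff; tune for your data
# Keep only the (index, score) pairs that clear the threshold
hits = [(i, float(s)) for i, s in enumerate(sims) if s >= THRESHOLD]
print(hits)
```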