Text Similarity Computation (English and Chinese): A Hands-On Walkthrough

This post uses the TF-IDF model to compute text similarity for both English and Chinese text.
1. English text similarity
1. The test documents

```python
from gensim import corpora, models, similarities
import numpy as np

documents = [
    "Is there anything good playing?",
    "let's meet at the movie theater entrance tonight. Don't be late.",
    "Are you going to the movie theater with me tonight?",
    "I get a lump in my throat whenever I see a tragic movie.",
    "you're just too emotional.",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
]
```

2. Remove stop words; unlike Chinese, English text needs no word-segmentation step
```python
# Remove the stop words
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
print(texts)
```
Output (truncated):

```
[['is', 'there', 'anything', 'good', 'playing?'], ["let's", 'meet', 'at', 'movie', 'theater', 'entrance', 'tonight.', "don't", 'be', 'late.'], ['are', 'you', 'going', 'movie', ...
```
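Note that the naive `.lower().split()` keeps punctuation attached to tokens ('playing?', 'late.'), so the same word can end up with different IDs. The sketch below is my variation of a stricter tokenizer; it is not used in the rest of the walkthrough, which keeps the naive split so the outputs match.

```python
import re

def clean_tokens(doc, stoplist):
    # Strip non-word characters from each token, then drop empties and stop words
    words = [re.sub(r"[^\w']", "", w) for w in doc.lower().split()]
    return [w for w in words if w and w not in stoplist]

texts_clean = [clean_tokens(doc, stoplist) for doc in documents]
```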
3. Build the dictionary, i.e., assign every word a unique integer ID

4. Based on the dictionary, use the bag-of-words model to convert the token lists into a set of sparse vectors, i.e., the corpus

Like word2vec, doc2bow represents words as features; in the corpus output below, a pair such as (0, 1) means that the word with ID 0 occurred once in that document.

```python
# Build the dictionary from the token lists
dictionary = corpora.Dictionary(texts)
print(dictionary)
print(dictionary.token2id)  # the unique ID of every word

# Bag-of-words model: based on the dictionary, convert the token lists
# into a set of sparse vectors, i.e., the corpus
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)
```
Output (truncated):

```
Dictionary(44 unique tokens: ['anything', 'good', 'is', 'playing?', 'there']...)
{'anything': 0, 'good': 1, 'is': 2, 'playing?': 3, 'there': 4, 'at': 5, 'be': 6, "don't": 7, 'entrance': 8, 'late.': 9, "let's": 10, 'meet': 11, 'movie': 12, 'theater': 13, ...
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)], [(12, 1), (13, 1), (15, 1), (16, 1), (17, ...
```
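The ID pairs are hard to read; a small helper (my addition, not in the original post) maps each (token_id, count) pair back to its word:

```python
# Make a bag-of-words vector readable: gensim's Dictionary supports
# dictionary[token_id] -> token string
def explain_bow(dictionary, bow):
    return [(dictionary[token_id], count) for token_id, count in bow]

print(explain_bow(dictionary, corpus[0]))
# [('anything', 1), ('good', 1), ('is', 1), ('playing?', 1), ('there', 1)]
```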
5. Train the TF-IDF model on the corpus

```python
# Train the TF-IDF model on the corpus
tfidf = models.TfidfModel(corpus)
# Run the texts to be searched (the corpus itself) through the trained TF-IDF model
corpus_tfidf = tfidf[corpus]
```
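To see what the trained model produces, one can inspect a single document's weights (a quick check I've added):

```python
# Each pair in a TF-IDF vector is (token_id, weight); words that appear
# in fewer documents get higher weights than common ones
for token_id, weight in tfidf[corpus[0]]:
    print(dictionary[token_id], round(weight, 3))
```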
6. Test with a query sentence

```python
# The sentence to search for
query = 'I like movie'
# Use doc2bow to turn the split query into a sparse vector according to the dictionary
vec_bow = dictionary.doc2bow(query.split())
print(vec_bow)    # [(12, 1)]: 'movie' has ID 12 and occurs once; 'i' and 'like' do not appear in the corpus
# Then compute its TF-IDF representation
vec_tfidf = tfidf[vec_bow]
print(vec_tfidf)  # [(12, 1.0)]
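```

Note what happens when no query word is known: doc2bow silently drops out-of-vocabulary words, so such a query produces an empty vector and all-zero similarities (a small demonstration I've added):

```python
# A fully out-of-vocabulary query yields an empty bag-of-words vector
print(dictionary.doc2bow('completely unseen words'.split()))  # []
```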
7. Compute the similarity between this query and every sentence in the corpus

```python
# Similarity retrieval: build the index over the TF-IDF corpus
similarity = similarities.MatrixSimilarity(corpus_tfidf)
# Similarity scores between the query and every sentence in the corpus
sims = similarity[vec_tfidf]
print(sims)
```
```
[0. 0.21666655 0.24635962 0. 0. 0. 0.]
```

As you can see, the query is similar to sentences 1 and 2 (indices start at 0).
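MatrixSimilarity scores are cosine similarities between the TF-IDF vectors. As a sanity check (my addition; it reuses the variables defined above), the score for document 2 can be reproduced by hand:

```python
# Reproduce sims[2]: cosine similarity between the query's TF-IDF vector
# and document 2's TF-IDF vector, both densified over the dictionary
def to_dense(vec, size):
    dense = np.zeros(size)
    for token_id, weight in vec:
        dense[token_id] = weight
    return dense

size = len(dictionary)
q = to_dense(vec_tfidf, size)
d = to_dense(tfidf[corpus[2]], size)
print(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))  # ~0.246
```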
8. Extract the largest similarity value

```python
# Index of the most similar text
max_loc = np.argmax(sims)
print(max_loc)  # 2
# The largest similarity value
max_sim = sims[max_loc]
print(max_sim)  # 0.24635962
```
```
2
0.24635962
```

Now that we have the index, we can map back into the original document list to retrieve the matching sentence, and the task is complete.
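Since the similarity array and the document list share the same indices, that mapping is a one-line lookup:

```python
print(documents[max_loc])  # "Are you going to the movie theater with me tonight?"
```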
Complete code:

```python
from gensim import corpora, models, similarities
import numpy as np

# Build the document set
documents = [
    "Is there anything good playing?",
    "let's meet at the movie theater entrance tonight. Don't be late.",
    "Are you going to the movie theater with me tonight?",
    "I get a lump in my throat whenever I see a tragic movie.",
    "you're just too emotional.",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
]

# Remove the stop words
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
print(texts)

# Build the dictionary
dictionary = corpora.Dictionary(texts)
print(dictionary)
print(dictionary.token2id)  # the unique ID of every word

# Bag-of-words model: convert the token lists into the sparse-vector corpus
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)

# Train the TF-IDF model on the corpus
tfidf = models.TfidfModel(corpus)
# Run the corpus through the trained TF-IDF model
corpus_tfidf = tfidf[corpus]

# The sentence to search for
query = 'I like movie'
# Turn the split query into a sparse vector according to the dictionary
vec_bow = dictionary.doc2bow(query.split())
# Then compute its TF-IDF representation
vec_tfidf = tfidf[vec_bow]

# Similarity retrieval: build the index over the TF-IDF corpus
similarity = similarities.MatrixSimilarity(corpus_tfidf)
# Similarity scores between the query and every sentence in the corpus
sims = similarity[vec_tfidf]

# Index of the most similar text
max_loc = np.argmax(sims)
# The largest similarity value
max_sim = sims[max_loc]

# Save and reload the model
# tfidf.save("data.tfidf")
# tfidf = models.TfidfModel.load("data.tfidf")
# print(tfidf.dfs)
```

2. Chinese text similarity

The detailed steps were covered above, so here is the code directly. Chinese only adds a word-segmentation step; everything else is the same. The full listing follows below.
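First, for readers unfamiliar with jieba, a short demo (my addition) of what the extra segmentation step does; the exact token split may vary with the jieba version:

```python
import jieba

# Precise-mode segmentation of one of the test sentences
print(jieba.lcut("最近有新上映的电影", cut_all=False))
# e.g. ['最近', '有', '新', '上映', '的', '电影']
```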
```python
import jieba
import numpy as np
from gensim import corpora, models, similarities


def _get_stop_words():
    """Read the stop-word file."""
    # NOTE: the file name was lost in the original post; substitute your own stop-word file
    with open('stopwords.txt', encoding='utf-8') as f:
        stopwords = f.read()
    return set(stopwords.split()) | {'shi'}


def delete_stopwords(documents):
    """Segment the documents and remove stop words."""
    # Stop-word set
    stopwords_list = _get_stop_words()
    # Precise-mode segmentation
    cut_words_list = [jieba.lcut(i, cut_all=False, HMM=True) for i in documents]
    # Remove the stop words
    del_stop_words_list = []
    for word_list in cut_words_list:
        del_stop_words_list.append([word for word in word_list if word not in stopwords_list])
    return del_stop_words_list


# Build the document set
documents = [
    "今天去打篮球吗?",
    "明天晚上八点半的电影,准时到shi",
    "最近有新上映的电影,挺好看的,改天去看吗?",
    "明天天气要下雨.",
    "今天太热了.",
    "工作好累啊,不想努力了",
    "跟着自己的内心走",
]

# Remove the stop words
texts = delete_stopwords(documents)
print(texts)

# Build the dictionary
dictionary = corpora.Dictionary(texts)
# print(dictionary.token2id)  # the unique ID of every word

# Bag-of-words model: convert the token lists into the sparse-vector corpus
corpus = [dictionary.doc2bow(text) for text in texts]

# Train the TF-IDF model on the corpus
tfidf = models.TfidfModel(corpus)
# Run the corpus through the trained TF-IDF model
corpus_tfidf = tfidf[corpus]

# The sentence to search for
query = '好想 去 电影院 看 电影'
# Turn the split query into a sparse vector according to the dictionary
vec_bow = dictionary.doc2bow(query.split())
# Then compute its TF-IDF representation
vec_tfidf = tfidf[vec_bow]

# Similarity retrieval: build the index over the TF-IDF corpus
similarity = similarities.MatrixSimilarity(corpus_tfidf)
# Similarity scores between the query and every sentence in the corpus
sims = similarity[vec_tfidf]

# Index of the most similar text
max_loc = np.argmax(sims)
print(max_loc)
# The largest similarity value
max_sim = sims[max_loc]
```
Output (truncated):

```
[['今天', '去', '打篮球'], ['明天', '八点半', '电影', '准时'], ['新', '上映', '电影', '挺', '好看', '改天', '去', '看'], ['明天', '天气', '下雨'], ['今天', '太热'], ['好累', '不想'], ['跟着', ...
2
```

This shows that the query '好想 去 电影院 看 电影' is most similar to "最近有新上映的电影,挺好看的,改天去看吗?" in the document set.

Summary

In practice, you can set a similarity threshold to filter out texts whose similarity to the query is too low.
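A minimal sketch of such filtering (my addition; the 0.2 cutoff is an arbitrary assumption to tune per dataset):

```python
THRESHOLD = 0.2  # arbitrary cutoff; tune for your data
# Keep only the (index, score) pairs that clear the threshold
hits = [(i, float(s)) for i, s in enumerate(sims) if s >= THRESHOLD]
print(hits)
```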