Computing LDA Topic Model Perplexity with gensim
Perplexity is an information-theoretic measure: the perplexity of b is defined as 2 raised to the entropy of b (where b can be a probability distribution or a probability model), and it is commonly used to compare probabilistic models.
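For intuition (a toy example of my own, not taken from gensim), the perplexity of a small discrete distribution can be computed directly from its entropy:

import numpy as np

# A toy distribution over four outcomes (hypothetical values).
p = np.array([0.5, 0.25, 0.125, 0.125])

# Entropy in bits: H(p) = -sum(p * log2(p)).
entropy = -np.sum(p * np.log2(p))   # 1.75

# Perplexity is 2 raised to the entropy.
perplexity = 2 ** entropy           # about 3.36
print(entropy, perplexity)

A uniform distribution over k outcomes has entropy log2(k) and therefore perplexity k, which is why perplexity is often read as an "effective number of choices".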
The resources I could find all compute perplexity with hand-written code, but gensim actually ships with a perplexity calculation of its own; a small modification is enough to make it return the model's perplexity.
My understanding of perplexity is still quite limited, so this post may be updated later.
Import gensim
from gensim.models import LdaModel
First, import the LdaModel class from gensim.models.
Then open the source file gensim\models\ldamodel.py and search for perplexity.
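To locate that file on your own machine (a small helper of mine, not part of the original post), you can print the module's path:

import gensim.models.ldamodel
# Prints the full path of the installed ldamodel.py.
print(gensim.models.ldamodel.__file__)

The method in question looks like this: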
def log_perplexity(self, chunk, total_docs=None):
    """Calculate and return per-word likelihood bound, using a chunk of documents as evaluation corpus.

    Also output the calculated statistics, including the perplexity=2^(-bound), to log at INFO level.

    Parameters
    ----------
    chunk : {list of list of (int, float), scipy.sparse.csc}
        The corpus chunk on which the inference step will be performed.
    total_docs : int, optional
        Number of docs used for evaluation of the perplexity.

    Returns
    -------
    numpy.ndarray
        The variational bound score calculated for each word.

    """
    if total_docs is None:
        total_docs = len(chunk)
    corpus_words = sum(cnt for document in chunk for _, cnt in document)
    subsample_ratio = 1.0 * total_docs / len(chunk)
    perwordbound = self.bound(chunk, subsample_ratio=subsample_ratio) / (subsample_ratio * corpus_words)
    logger.info(
        "%.3f per-word bound, %.1f perplexity estimate based on a held-out corpus of %i documents with %i words",
        perwordbound, np.exp2(-perwordbound), len(chunk), corpus_words
    )
    return perwordbound
As you can see, the perplexity is actually computed inside this method (and written to the log); it just is not returned.
Modify the end of the method so that, instead of returning only the per-word bound, it also returns the perplexity:

# add a perplexity variable so the perplexity is returned as well
perplexity = np.exp2(-perwordbound)
return perwordbound, perplexity

That is all it takes for log_perplexity to return the perplexity.
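For reference, with those two lines in place the end of log_perplexity would look roughly like this (a sketch of the patched tail only, not a standalone snippet):

    logger.info(
        "%.3f per-word bound, %.1f perplexity estimate based on a held-out corpus of %i documents with %i words",
        perwordbound, np.exp2(-perwordbound), len(chunk), corpus_words
    )
    # Added: convert the per-word bound to a perplexity and return both values.
    perplexity = np.exp2(-perwordbound)
    return perwordbound, perplexity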
Computing the perplexity
lda = LdaModel(common_corpus, num_topics=num_topic, id2word=dic, alpha='auto', chunksize=len(texts_all), iterations=20000)
_, perplexity = lda.log_perplexity(common_corpus)
The returned perplexity value is the perplexity of the LDA model.
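If you prefer not to edit the installed library, the same number can be obtained from the unmodified method, because log_perplexity returns the per-word bound and, as its docstring says, perplexity = 2^(-bound). Below is a self-contained sketch using gensim's bundled toy corpus (common_corpus and common_dictionary from gensim.test.utils stand in for the corpus, dictionary and texts used above; the training parameters are illustrative):

import numpy as np
from gensim.models import LdaModel
from gensim.test.utils import common_corpus, common_dictionary

# Train a small LDA model on gensim's toy corpus
# (stand-in for the post's common_corpus / dic / texts_all).
lda = LdaModel(
    common_corpus,
    num_topics=5,
    id2word=common_dictionary,
    alpha='auto',
    chunksize=len(common_corpus),
    iterations=100,
)

# The unmodified log_perplexity returns the per-word variational bound;
# converting it with 2 ** (-bound) gives the perplexity estimate.
perwordbound = lda.log_perplexity(common_corpus)
perplexity = np.exp2(-perwordbound)
print(perwordbound, perplexity)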