
Posted November 24, 2022 (author: 拙荆)

Text Classification for Sentiment Analysis – Stopwords and Collocations

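Improving feature extraction can often have a significant positive impact on classifier accuracy (and on precision and recall). In this article, I'll evaluate two modifications of the word_feats feature extraction method: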

1. Filter out stopwords
2. Include bigram collocations

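To do this effectively, we'll modify the earlier code so that it accepts an arbitrary feature extraction function, one that takes the words in a file and returns a feature dictionary. As before, we'll use these features to train a Naive Bayes classifier.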

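import collections
import nltk.classify.util
import nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def evaluate_classifier(featx):
    # Build (features, label) pairs for every negative and positive review.
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')
    negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    # Use the first 3/4 of each class for training and the rest for testing.
    negcutoff = len(negfeats) * 3 // 4
    poscutoff = len(posfeats) * 3 // 4
    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

    classifier = NaiveBayesClassifier.train(trainfeats)

    # Collect test-document indices by true label and by predicted label.
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)
    for i, (feats, label) in enumerate(testfeats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)

    print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
    print('pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos']))
    print('pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos']))
    print('neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg']))
    print('neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg']))
    classifier.show_most_informative_features()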
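The evaluation above reports precision and recall from two sets of test-document indices per label: the documents that truly carry the label and the documents the classifier assigned to it. As a rough sketch of what those set-based metrics compute, here is a plain-Python version. The function bodies are my own illustration of the standard definitions, not NLTK's source, and the toy index sets are hypothetical.

```python
def precision(reference, test):
    """Share of predicted items that are correct: |reference & test| / |test|."""
    if not test:
        return None  # undefined when nothing was predicted for this label
    return len(reference & test) / len(test)


def recall(reference, test):
    """Share of true items that were found: |reference & test| / |reference|."""
    if not reference:
        return None  # undefined when no item truly has this label
    return len(reference & test) / len(reference)


# Hypothetical toy data: indices of test documents.
refset_pos = {0, 1, 2, 3}       # documents whose true label is 'pos'
testset_pos = {0, 1, 2, 4, 5}   # documents the classifier labeled 'pos'

pos_precision = precision(refset_pos, testset_pos)  # 3 of 5 predictions correct
pos_recall = recall(refset_pos, testset_pos)        # 3 of 4 true items found
print('pos precision:', pos_precision)
print('pos recall:', pos_recall)
```

A classifier that labels almost everything 'pos' ends up with a large test set for that label, which pushes pos recall up and pos precision down. That is the pattern the baseline results show.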

Baseline Bag of Words Feature Extraction

Here is the baseline feature extractor for bag-of-words feature selection.

def word_feats(words):
    # Every word that occurs in the file becomes a feature with the value True.
    return dict([(word, True) for word in words])

evaluate_classifier(word_feats)
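To make the shape of the feature dictionary concrete, here is a small sketch that runs the same extractor on a hypothetical word list instead of a movie review file.

```python
def word_feats(words):
    # Every word that occurs becomes a feature with the value True.
    return dict([(word, True) for word in words])


# Hypothetical sample input; in the article the words come from a review file.
sample_words = ["a", "great", "great", "movie"]
features = word_feats(sample_words)
print(features)
```

Repeated words collapse into a single key, so the features record only whether a word is present, not how often it occurs.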

The results are the same as in the previous articles, but I've included them here for reference:

accuracy: 0.728
pos precision: 0.651595744681
pos recall: 0.98
neg precision: 0.959677419355
neg recall: 0.476
Most Informative Features
    magnificent = True    pos : neg = 15.0 : 1.0
    outstanding = True    pos : neg = 13.6 : 1.0
      insulting = True    neg : pos = 13.0 : 1.0
     vulnerable = True    pos : neg = 12.3 : 1.0
      ludicrous = True    neg : pos = 11.8 : 1.0
         avoids = True    pos : neg = 11.7 : 1.0
    uninvolving = True    neg : pos = 11.7 : 1.0
     astounding = True    pos : neg = 10.3 : 1.0
    fascination = True    pos : neg = 10.3 : 1.0
        idiotic = True    neg : pos =  9.8 : 1.0

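Stopword Filtering

Stopwords are words that are generally considered useless. Most search engines ignore them because they are so common that including them would greatly increase the size of the index without improving precision or recall. NLTK comes with a stopwords corpus that, at the time of writing, includes a list of 128 English stopwords. Let's see what happens when we filter out these words.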

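from nltk.corpus import stopwords

# A set gives fast membership tests while filtering.
stopset = set(stopwords.words('english'))

def stopword_filtered_word_feats(words):
    return dict([(word, True) for word in words if word not in stopset])

evaluate_classifier(stopword_filtered_word_feats)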
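The effect of the filter is easy to see on a toy input. The sketch below uses a small hypothetical stopword set (`toy_stopset`) in place of the NLTK list so that it runs without downloading any corpus data. The extra `stopset` parameter is my own addition for illustration.

```python
# Hypothetical stand-in for the NLTK English stopword list.
toy_stopset = {"the", "is", "not", "a"}


def stopword_filtered_word_feats(words, stopset=toy_stopset):
    # Keep only the words that are not in the stopword set.
    return {word: True for word in words if word not in stopset}


words = ["the", "plot", "is", "not", "great"]
filtered = stopword_filtered_word_feats(words)
print(filtered)
```

With this toy set the negation "not" is filtered out along with the other common words, which hints at how a stopword filter can discard cues that matter for sentiment.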

The results for a stopword-filtered bag of words are:

accuracy: 0.726
pos precision: 0.649867374005
pos recall: 0.98
neg precision: 0.959349593496
neg recall: 0.472

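Accuracy went down by 0.2 percentage points (from 0.728 to 0.726), and pos precision and neg recall dropped as well! Apparently stopwords add information to sentiment analysis classification. I did not include the most informative features because they did not change.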

Bigram Collocations

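As mentioned at the end of the article on precision and recall, including bigrams may improve classification accuracy. The hypothesis is that people say things like "not great", a negative expression that the bag-of-words model interprets as positive because it sees "great" as a separate word.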

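To find significant bigrams, we can use BigramCollocationFinder together with BigramAssocMeasures. The BigramCollocationFinder maintains 2 internal FreqDists, one for individual word frequencies and another for bigram frequencies. Once it has these frequency distributions, it can score individual bigrams using a scoring function provided by BigramAssocMeasures, such as chi-square. These scoring functions measure the collocation strength of 2 words, basically whether the bigram occurs about as often as each of its individual words.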
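To give a feel for what a chi-square score measures, here is a from-scratch sketch that builds a 2x2 contingency table for each bigram and applies the textbook chi-square formula. It is an illustration only and not NLTK's implementation: the helper name `chi_sq_bigram_scores` is hypothetical, and the marginals are counted over bigram positions, which may differ from how NLTK derives them.

```python
from collections import Counter


def chi_sq_bigram_scores(words):
    """Score every adjacent word pair with a 2x2 chi-square statistic."""
    bigrams = list(zip(words, words[1:]))
    n = len(bigrams)
    bigram_counts = Counter(bigrams)
    first_counts = Counter(w1 for w1, _ in bigrams)    # w1 in first position
    second_counts = Counter(w2 for _, w2 in bigrams)   # w2 in second position

    scores = {}
    for (w1, w2), n_ii in bigram_counts.items():
        n_io = first_counts[w1] - n_ii    # w1 followed by some other word
        n_oi = second_counts[w2] - n_ii   # some other word followed by w2
        n_oo = n - n_ii - n_io - n_oi     # neither w1 first nor w2 second
        denom = (n_ii + n_io) * (n_ii + n_oi) * (n_io + n_oo) * (n_oi + n_oo)
        if denom == 0:
            scores[(w1, w2)] = 0.0        # degenerate table, no evidence
        else:
            scores[(w1, w2)] = n * (n_ii * n_oo - n_io * n_oi) ** 2 / denom
    return scores


# Hypothetical toy text.
words = "matt damon is great and matt damon is funny".split()
scores = chi_sq_bigram_scores(words)
for bigram, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(bigram, round(score, 2))
```

In this toy text "matt" is always followed by "damon", so that pair scores higher than a pair such as ("is", "great"), where "is" also appears before another word.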

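import itertools
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    # Score all bigrams in the file and keep the n best.
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    # Features are the single words plus the selected bigrams.
    return dict([(ngram, True) for ngram in itertools.chain(words, bigrams)])

evaluate_classifier(bigram_word_feats)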

After some experimentation, I found that using the 200 best bigrams from each file produced great results:

accuracy: 0.816
pos precision: 0.753205128205
pos recall: 0.94
neg precision: 0.92
neg recall: 0.692
Most Informative Features
           magnificent = True    pos : neg = 15.0 : 1.0
           outstanding = True    pos : neg = 13.6 : 1.0
             insulting = True    neg : pos = 13.0 : 1.0
            vulnerable = True    pos : neg = 12.3 : 1.0
     ('matt', 'damon') = True    pos : neg = 12.3 : 1.0
        ('give', 'us') = True    neg : pos = 12.3 : 1.0
             ludicrous = True    neg : pos = 11.8 : 1.0
           uninvolving = True    neg : pos = 11.7 : 1.0
                avoids = True    pos : neg = 11.7 : 1.0
  ('absolutely', 'no') = True    neg : pos = 10.6 : 1.0

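Yes, you read that right: Matt Damon is apparently one of the best indicators of positive sentiment in movie reviews. Odd as that is, the results are worth it:

Accuracy is up by almost 9 percentage points.

Pos precision has increased by over 10 points, with only a 4-point drop in recall.

Neg recall has increased by over 21 points, with just under a 4-point drop in precision.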
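The changes quoted above can be checked directly from the two result listings. This sketch only does the arithmetic: the numbers are copied from the baseline and bigram outputs in this article, and the dictionary names are my own.

```python
# Metric values copied from the baseline and bigram result listings.
baseline = {
    "accuracy": 0.728,
    "pos precision": 0.651595744681,
    "pos recall": 0.98,
    "neg precision": 0.959677419355,
    "neg recall": 0.476,
}
with_bigrams = {
    "accuracy": 0.816,
    "pos precision": 0.753205128205,
    "pos recall": 0.94,
    "neg precision": 0.92,
    "neg recall": 0.692,
}

# Change in percentage points, rounded to one decimal place.
deltas = {
    name: round((with_bigrams[name] - baseline[name]) * 100, 1)
    for name in baseline
}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f} points")
```

The differences are in percentage points (absolute change), not relative change.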

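So it appears that the bigram hypothesis is correct: including significant bigrams can improve classifier effectiveness. Note that it is the significant bigrams that make the difference. I tried using nltk.util.bigrams to include all bigrams, and the results were only a few points above the baseline. This supports the idea that including only significant features improves accuracy compared with using all features. In a future article, I'll try trimming down the single-word features so that only significant words are included.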
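For comparison, an "all bigrams" feature extractor might look like the sketch below. It pairs adjacent words with plain `zip` rather than an NLTK helper, and the name `all_bigram_word_feats` is hypothetical, so treat it as one possible way to run that experiment rather than the code behind the numbers above.

```python
import itertools


def all_bigram_word_feats(words):
    # Materialize the words so they can be traversed more than once.
    words = list(words)
    # Every adjacent pair becomes a bigram feature, with no scoring or pruning.
    all_bigrams = zip(words, words[1:])
    return {ngram: True for ngram in itertools.chain(words, all_bigrams)}


# Hypothetical toy input.
feats = all_bigram_word_feats(["not", "great", "at", "all"])
print(sorted(feats, key=str))
```

Because nothing is pruned, every file contributes roughly as many bigram features as it has words, and most of them are rare pairs that carry little signal.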
