A Comparison of Event Models for Naive Bayes Text Classification
Andrew McCallum‡†  Kamal Nigam†
‡Just Research, 4616 Henry Street, Pittsburgh, PA 15213    †School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
Abstract
Recent approaches to text classification have used two different first-order probabilistic models for classification, both of which make the naive Bayes assumption. Some use a multi-variate Bernoulli model, that is, a Bayesian network with no dependencies between words and binary word features (e.g. Larkey and Croft 1996; Koller and Sahami 1997). Others use a multinomial model, that is, a uni-gram language model with integer word counts (e.g. Lewis and Gale 1994; Mitchell 1997). This paper aims to clarify the confusion by describing the differences and details of the two models, and by empirically comparing their classification performance on five text corpora. We find that the multi-variate Bernoulli performs well with small vocabulary sizes, but that the multinomial usually performs even better at larger vocabulary sizes—providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size.
Introduction
Simple Bayesian classifiers have been gaining popularity lately, and have been found to perform surprisingly well (Friedman 1997; Friedman et al. 1997; Sahami 1996; Langley et al. 1992). These probabilistic approaches make strong assumptions about how the data is generated, and posit a probabilistic model that embodies these assumptions; then they use a collection of labeled training examples to estimate the parameters of the generative model. Classification on new examples is performed with Bayes' rule by selecting the class that is most likely to have generated the example.

The naive Bayes classifier is the simplest of these models, in that it assumes that all attributes of the examples are independent of each other given the context of the class. This is the so-called "naive Bayes assumption." While this assumption is clearly false in most real-world tasks, naive Bayes often performs classification very well. This paradox is explained by the fact that classification estimation is only a function of the sign (in binary cases) of the function estimation; the function approximation can still be poor while classification accuracy remains high (Friedman 1997; Domingos and Pazzani 1997). Because of the independence assumption, the parameters for each attribute can be learned separately, and this greatly simplifies learning, especially when the number of attributes is large.
Document classification is just such a domain with a large number of attributes. The attributes of the examples to be classified are words, and the number of different words can be quite large indeed. While some simple document classification tasks can be accurately performed with vocabulary sizes less than one hundred, many complex tasks on real-world data from the Web, UseNet and newswire articles do best with vocabulary sizes in the thousands. Naive Bayes has been successfully applied to document classification in many research efforts (see references below).
Despite its popularity, there has been some confusion in the document classification community about the "naive Bayes" classifier because there are two different generative models in common use, both of which make the "naive Bayes assumption." Both are called "naive Bayes" by their practitioners.
One model specifies that a document is represented by a vector of binary attributes indicating which words occur and do not occur in the document. The number of times a word occurs in a document is not captured. When calculating the probability of a document, one multiplies the probability of all the attribute values, including the probability of non-occurrence for words that do not occur in the document. Here we can understand the document to be the "event," and the absence or presence of words to be attributes of the event. This describes a distribution based on a multi-variate Bernoulli event model. This approach is more traditional in the field of Bayesian networks, and is appropriate for tasks that have a fixed number of attributes. The approach has been used for text classification by numerous people (Robertson and Sparck-Jones 1976; Lewis 1992; Kalt and Croft 1996; Larkey and Croft 1996; Koller and Sahami 1997; Sahami 1996).
The second model specifies that a document is represented by the set of word occurrences from the document. As above, the order of the words is lost; however, the number of occurrences of each word in the document is captured. When calculating the probability of a document, one multiplies the probability of the words that occur. Here we can understand the individual word occurrences to be the "events" and the document to be the collection of word events. We call this the multinomial event model. This approach is more traditional in statistical language modeling for speech recognition, where it would be called a "uni-gram language model." This approach has also been used for text classification by numerous people (Lewis and Gale 1994; Kalt and Croft 1996; Joachims 1997; Guthrie and Walker 1994; Li and Yamanishi 1997; Mitchell 1997; Nigam et al. 1998; McCallum et al. 1998).
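To make the contrast concrete, here is a small Python sketch (not from the paper; the toy vocabulary and document are invented) that builds both representations for the same document: a binary presence vector for the multi-variate Bernoulli event model and an integer count vector for the multinomial event model.

```python
from collections import Counter

# Toy vocabulary and document, purely for illustration.
vocabulary = ["ball", "game", "stock", "market", "team"]
document = "the team won the game the game was close".split()

counts = Counter(document)

# Multi-variate Bernoulli: one binary attribute per vocabulary word
# (does the word occur at least once in the document?).
bernoulli_vector = [1 if counts[w] > 0 else 0 for w in vocabulary]

# Multinomial: one integer count per vocabulary word
# (how many times does the word occur?).
multinomial_vector = [counts[w] for w in vocabulary]

print(bernoulli_vector)    # [0, 1, 0, 0, 1]
print(multinomial_vector)  # [0, 2, 0, 0, 1]
```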
This paper aims to clarify the confusion between these two approaches by explaining both models in detail. We present an extensive empirical comparison on five corpora, including Web pages, UseNet articles and Reuters newswire articles. Our results indicate that the multi-variate Bernoulli model sometimes performs better than the multinomial at small vocabulary sizes. However, the multinomial usually outperforms the multi-variate Bernoulli at large vocabulary sizes, and almost always beats the multi-variate Bernoulli when vocabulary size is chosen optimally for both. While sometimes the difference in performance is not great, on average across data sets, the multinomial provides a 27% reduction in error over the multi-variate Bernoulli.
Probabilistic Framework of Naive Bayes

This section presents the generative model for both cases of the naive Bayes classifier. First we explain the mechanisms they have in common; then, where the event models diverge, the assumptions and formulations of each are presented.
Consider the task of text classification in a Bayesian learning framework. This approach assumes that the text data was generated by a parametric model, and uses training data to calculate Bayes-optimal estimates of the model parameters. Then, equipped with these estimates, it classifies new test documents using Bayes' rule to turn the generative model around and calculate the posterior probability that a class would have generated the test document in question. Classification then becomes a simple matter of selecting the most probable class.
Both scenarios assume that text documents are generated by a mixture model parameterized by θ. The mixture model consists of mixture components c_j ∈ C = {c_1, ..., c_|C|}. Each component is parameterized by a disjoint subset of θ. Thus a document, d_i, is created by (1) selecting a component according to the priors, P(c_j|θ), then (2) having the mixture component generate a document according to its own parameters, with distribution P(d_i|c_j; θ). We can characterize the likelihood of a document with a sum of total probability over all mixture components:

P(d_i|\theta) = \sum_{j=1}^{|C|} P(c_j|\theta) \, P(d_i|c_j;\theta).    (1)
Each document has a class label. We assume that there is a one-to-one correspondence between classes and mixture model components, and thus use c_j to indicate both the j-th mixture component and the j-th class.¹ In this setting (supervised learning from labeled training examples), the typically "hidden" indicator variables for a mixture model are provided as the class labels.
Multi-variate Bernoulli Model
In the multi-variate Bernoulli event model, a document is a binary vector over the space of words. Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation, we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document. Then, the probability of a document given its class from Equation 1 is simply the product of the probability of the attribute values over all word attributes:
P(d_i|c_j;\theta) = \prod_{t=1}^{|V|} \left( B_{it} P(w_t|c_j;\theta) + (1 - B_{it})(1 - P(w_t|c_j;\theta)) \right).    (2)

Thus, given a generating component, a document can be seen as a collection of multiple independent Bernoulli experiments, one for each word in the vocabulary, with the probabilities for each of these word events defined by each component, P(w_t|c_j; θ). This is equivalent to viewing the distribution over documents as being described by a Bayesian network, where the absence or presence of each word is dependent only on the class of the document.
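As an illustration of Equation 2, the following sketch computes log P(d_i|c_j; θ) for a single class from a binary document vector and a vector of class-conditional word probabilities. It is a minimal sketch, assuming the parameters have already been estimated; the function name and the numbers are invented.

```python
import math

def bernoulli_log_likelihood(doc_bits, word_probs):
    """log P(d|c) under the multi-variate Bernoulli event model (Equation 2).

    doc_bits[t]   -- 1 if vocabulary word t occurs in the document, else 0
    word_probs[t] -- P(w_t | c; theta) for the class under consideration
    """
    log_prob = 0.0
    for b, p in zip(doc_bits, word_probs):
        # Every vocabulary word contributes, whether or not it occurs.
        log_prob += math.log(p if b else 1.0 - p)
    return log_prob

# Hypothetical parameters for a 5-word vocabulary.
doc_bits = [0, 1, 0, 0, 1]
word_probs = [0.10, 0.60, 0.05, 0.05, 0.40]
print(bernoulli_log_likelihood(doc_bits, word_probs))
```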
Given a set of labeled training documents, D = {d_1, ..., d_|D|}, learning the parameters of a probabilistic classification model corresponds to estimating each of these class-conditional word probabilities. The parameters of a mixture component are written θ_{w_t|c_j} = P(w_t|c_j; θ), where 0 ≤ θ_{w_t|c_j} ≤ 1. We can calculate Bayes-optimal estimates for these probabilities by straightforward counting of events, supplemented by a prior (Vapnik 1982). We use the Laplacean prior, priming each word's count with a count of one to avoid probabilities of zero or one. Define P(c_j|d_i) ∈ {0,1} as given by the document's class label. Then, we estimate the probability of word w_t in class c_j with:

\hat{\theta}_{w_t|c_j} = P(w_t|c_j;\hat{\theta}) = \frac{1 + \sum_{i=1}^{|D|} B_{it} P(c_j|d_i)}{2 + \sum_{i=1}^{|D|} P(c_j|d_i)}.    (3)
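A minimal sketch of this estimate (Equation 3), assuming the class labels are hard assignments so that P(c_j|d_i) ∈ {0,1}; the function and variable names are illustrative choices, not the paper's.

```python
def estimate_bernoulli_params(doc_bits, labels, num_classes):
    """Laplace-smoothed estimates of P(w_t | c_j) for the Bernoulli model.

    doc_bits -- list of binary vectors, one per training document
    labels   -- list of class indices, one per training document
    Returns theta[j][t] = (1 + # docs in class j containing word t)
                          / (2 + # docs in class j).
    """
    vocab_size = len(doc_bits[0])
    doc_count = [0] * num_classes
    word_count = [[0] * vocab_size for _ in range(num_classes)]
    for bits, j in zip(doc_bits, labels):
        doc_count[j] += 1
        for t, b in enumerate(bits):
            word_count[j][t] += b
    return [[(1 + word_count[j][t]) / (2 + doc_count[j])
             for t in range(vocab_size)]
            for j in range(num_classes)]

# Tiny invented training set: 3 documents, 2 classes, 3-word vocabulary.
theta = estimate_bernoulli_params([[1, 0, 1], [1, 1, 0], [0, 0, 1]],
                                  [0, 0, 1], num_classes=2)
print(theta)
```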
¹In a more general setting, this one-to-one correspondence can be avoided (Li and Yamanishi 1997; Nigam et al. 1998).
The class prior parameters, θ_{c_j}, are set by the maximum likelihood estimate:

\hat{\theta}_{c_j} = P(c_j|\hat{\theta}) = \frac{\sum_{i=1}^{|D|} P(c_j|d_i)}{|D|}.    (4)

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. A document is an ordered sequence of word events drawn from the same vocabulary V, and the lengths of documents are assumed to be independent of class. We again make a naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document. Thus each document d_i is drawn from a multinomial distribution of words, with as many independent trials as the length of d_i. Define N_it to be the count of the number of times word w_t occurs in document d_i. Then, the probability of a document given its class is:

P(d_i|c_j;\theta) = P(|d_i|) \, |d_i|! \, \prod_{t=1}^{|V|} \frac{P(w_t|c_j;\theta)^{N_{it}}}{N_{it}!}.    (5)
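The sketch below evaluates the logarithm of the document probability in Equation 5 from the word counts N_it and the class-conditional word probabilities. Since document length is assumed independent of class, the length term P(|d_i|) is omitted here, as it does not affect which class is most probable; the function name and the numbers are invented.

```python
import math

def multinomial_log_likelihood(word_counts, word_probs):
    """log P(d|c) under the multinomial event model (Equation 5),
    omitting the length term P(|d|), which does not depend on the class
    when document length is assumed independent of class.

    word_counts[t] -- N_it, occurrences of vocabulary word t in the document
    word_probs[t]  -- P(w_t | c; theta) for the class under consideration
    """
    n = sum(word_counts)
    log_prob = math.lgamma(n + 1)  # log |d|!
    for count, p in zip(word_counts, word_probs):
        # log ( P(w_t|c)^N_it / N_it! ) for each vocabulary word.
        log_prob += count * math.log(p) - math.lgamma(count + 1)
    return log_prob

# Hypothetical counts and parameters for a 3-word vocabulary.
print(multinomial_log_likelihood([2, 0, 1], [0.5, 0.2, 0.3]))
```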
The parameters of the generative component for each class are the probabilities for each word, written θ_{w_t|c_j} = P(w_t|c_j; θ), where 0 ≤ θ_{w_t|c_j} ≤ 1 and \sum_t θ_{w_t|c_j} = 1.
Again, we can calculate Bayes-optimal estimates for these parameters from a set of labeled training data. Here, the estimate of the probability of word w_t in class c_j is:

\hat{\theta}_{w_t|c_j} = P(w_t|c_j;\hat{\theta}) = \frac{1 + \sum_{i=1}^{|D|} N_{it} P(c_j|d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N_{is} P(c_j|d_i)}.    (6)

The class prior parameters are calculated as before according to Equation 4.
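A minimal sketch of the smoothed estimate in Equation 6, again assuming hard class labels; the training counts are invented.

```python
def estimate_multinomial_params(doc_counts, labels, num_classes):
    """Laplace-smoothed estimates of P(w_t | c_j) for the multinomial model.

    doc_counts -- list of integer count vectors (N_it), one per document
    labels     -- list of class indices, one per document
    Returns theta[j][t] = (1 + total count of word t in class j)
                          / (|V| + total count of all words in class j).
    """
    vocab_size = len(doc_counts[0])
    word_total = [[0] * vocab_size for _ in range(num_classes)]
    for counts, j in zip(doc_counts, labels):
        for t, n in enumerate(counts):
            word_total[j][t] += n
    return [[(1 + word_total[j][t]) / (vocab_size + sum(word_total[j]))
             for t in range(vocab_size)]
            for j in range(num_classes)]

# Tiny invented training set: 3 documents, 2 classes, 3-word vocabulary.
theta = estimate_multinomial_params([[2, 0, 1], [0, 3, 0], [1, 1, 1]],
                                    [0, 1, 1], num_classes=2)
print(theta)
```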
Classification
Given estimates of these parameters calculated from the training documents, classification can be performed on test documents by calculating the posterior probability of each class given the evidence of the test document, and selecting the class with the highest probability. We formulate this by applying Bayes' rule:

P(c_j|d_i;\hat{\theta}) = \frac{P(c_j|\hat{\theta}) \, P(d_i|c_j;\hat{\theta}_j)}{P(d_i|\hat{\theta})}.    (7)
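As an illustration of this classification rule, the sketch below combines the class priors with either event model's log-likelihood (for example, one of the functions sketched earlier) and selects the most probable class. Working in log space to avoid floating-point underflow is a practical choice of this sketch, not something discussed in the paper; the toy model in the usage example is made up.

```python
import math

def classify(doc, class_priors, log_likelihood, params):
    """Return the index of the class maximizing P(c_j|theta) P(d|c_j;theta).

    log_likelihood(doc, params[j]) should return log P(d|c_j; theta).
    The denominator P(d|theta) in Equation 7 is constant across classes,
    so it can be ignored when taking the argmax.
    """
    scores = [math.log(prior) + log_likelihood(doc, params[j])
              for j, prior in enumerate(class_priors)]
    return scores.index(max(scores))

def toy_loglik(doc, theta):
    # Made-up likelihood: every token has the same probability theta["p"].
    return sum(math.log(theta["p"]) for _ in doc)

priors = [0.7, 0.3]
params = [{"p": 0.8}, {"p": 0.2}]
print(classify(["w1", "w2"], priors, toy_loglik, params))  # prints 0
```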
Feature Selection

Vocabulary size is reduced by selecting the words that have the highest average mutual information with the class variable. For the multi-variate Bernoulli event model, this means calculating the mutual information between (1) the class of the document, and (2) the absence or presence of a word in the document. Here the documents are the events:

I(C; F_t) = \sum_{c \in C} \sum_{f_t \in \{0,1\}} P(c, f_t) \log \frac{P(c, f_t)}{P(c) \, P(f_t)},    (8)

where P(c), P(f_t) and P(c, f_t) are calculated by sums over all documents—that is, P(c) is the number of documents with class label c divided by the total number of documents; P(f_t) is the number of documents containing one or more occurrences of word w_t divided by the total number of documents; and P(c, f_t) is the number of documents with class label c that also contain word w_t, divided by the total number of documents.
We experimented with this method, as well as an event model that corresponds to the multinomial: calculating the mutual information between (1) the class of the document from which a word occurrence is drawn, and (2) a random variable over all word occurrences. Here the word occurrences are the events. This calculation also uses Equation 8, but calculates the values of the terms by sums over word occurrences instead of over documents—that is, P(c) is the number of word occurrences appearing in documents with class label c divided by the total number of word occurrences; P(f_t) is the number of occurrences of word w_t divided by the total number of word occurrences; and P(c, f_t) is the number of occurrences of word w_t that also appear in documents with class label c, divided by the total number of word occurrences.
Our preliminary experiments comparing these two feature selection methods on the Newsgroups data set with the multinomial event model showed little difference in classification accuracy. The results reported in this paper use the feature selection event model that corresponds to the event model used for classification.
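A compact sketch of the document-event version of this mutual information score (Equation 8), assuming the same binary document vectors and hard labels as in the earlier sketches; the function and data are illustrative only.

```python
import math

def mutual_information(doc_bits, labels, num_classes, word_index):
    """Average mutual information I(C; F_t) between the class label and the
    absence/presence of one vocabulary word, with probabilities computed by
    sums over documents (Equation 8). Zero cells are skipped, since they
    contribute nothing to the sum.
    """
    n_docs = len(doc_bits)
    mi = 0.0
    for c in range(num_classes):
        p_c = sum(1 for lab in labels if lab == c) / n_docs
        for f in (0, 1):
            p_f = sum(1 for bits in doc_bits
                      if bits[word_index] == f) / n_docs
            p_cf = sum(1 for bits, lab in zip(doc_bits, labels)
                       if lab == c and bits[word_index] == f) / n_docs
            if p_cf > 0:
                mi += p_cf * math.log(p_cf / (p_c * p_f))
    return mi

# Rank a toy 3-word vocabulary by mutual information with the class.
docs = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
labels = [0, 0, 1, 1]
print([mutual_information(docs, labels, 2, t) for t in range(3)])
# Word 0 separates the two classes perfectly, so it scores highest.
```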
Experimental Results
This section provides empirical evidence that the multinomial event model usually performs better than the multi-variate Bernoulli. The results are based on five different data sets.³
Data Sets and Protocol
The web pages pointed to by the Yahoo! 'Science' hierarchy were gathered in July 1997. The web pages are divided into 95 disjoint classes containing 13589 pages as the result of coalescing classes of hierarchy-depth greater than two, and removing those classes with fewer than 40 documents. After tokenizing as above and removing stopwords and words that occur only once, the corpus has a vocabulary size of 44383 (McCallum et al. 1998).
The Industry Sector hierarchy, made available by Market Guide Inc., consists of company web pages classified in a hierarchy of industry sectors (McCallum et al. 1998). There are 6440 web pages partitioned into the 71 classes that are two levels deep in the hierarchy. In tokenizing the data we do not stem. After removing tokens that occur only once or
Recall = \frac{\text{\# of correct positive predictions}}{\text{\# of positive examples}}    (9)

Precision = \frac{\text{\# of correct positive predictions}}{\text{\# of predicted positive examples}}    (10)
Figure 1: A comparison of event models for different vocabulary sizes on the Yahoo data set (classification accuracy versus vocabulary size). Note that the multi-variate Bernoulli performs best with a small vocabulary and that the multinomial performs best with a larger vocabulary. The multinomial achieves higher accuracy overall.
Figure 2: A comparison of event models for different vocabulary sizes on the Industry Sector data set (classification accuracy versus vocabulary size). Note the same trends as seen in the previous figure.

accuracy at a vocabulary size of 1000 words. The multi-variate Bernoulli event model reaches a maximum of 41% accuracy with only 200 words. Note that the multi-variate Bernoulli shows its best results at a smaller vocabulary than the multinomial, and that the multinomial has best performance at a larger vocabulary size. The same pattern is seen in the Industry Sector data set, displayed in Figure 2. Here, the multinomial has the highest accuracy of 74% at 20000 words, and the multi-variate Bernoulli is best with 46% accuracy at 1000 words.⁴ Figure 3 shows results for the Newsgroups data set. Here, both event models do best at the maximum vocabulary sizes. The multinomial achieves 85% accuracy and
