Cost-Sensitive Learning by Cost-Proportionate Example Weighting


Bianca Zadrozny, John Langford, Naoki Abe
Mathematical Sciences Department
IBM T. J. Watson Research Center
Yorktown Heights, NY 10598
Abstract
We propose and evaluate a family of methods for converting classifier learning algorithms and classification theory into cost-sensitive algorithms and theory. The proposed conversion is based on cost-proportionate weighting of the training examples, which can be realized either by feeding the weights to the classification algorithm (as often done in boosting), or by careful subsampling. We give some theoretical performance guarantees on the proposed methods, as well as empirical evidence that they are practical alternatives to existing approaches. In particular, we propose costing, a method based on cost-proportionate rejection sampling and ensemble aggregation, which achieves excellent predictive performance on two publicly available datasets, while drastically reducing the computation required by other methods.
1 Introduction

Highly non-uniform misclassification costs are very common in a variety of challenging real-world data mining problems, such as fraud detection, medical diagnosis and various problems in business decision-making. In many cases, one class is rare but the cost of not recognizing some of the examples belonging to this class is high. In these domains, classifier learning methods that do not take misclassification costs into account do not perform well. In extreme cases, ignoring costs may produce a model that is useless because it classifies every example as belonging to the most frequent class, even though misclassifications of the least frequent class result in a very large cost.
Recently a body of work has attempted to address this issue, with techniques known as cost-sensitive learning in the machine learning and data mining communities. Current cost-sensitive learning research falls into three categories. The first is concerned with making particular classifier learners cost-sensitive [3, 7]. The second uses Bayes risk theory to assign each example to its lowest risk class.
The third converts arbitrary classifier learning algorithms into cost-sensitive ones; the methods proposed here fall into this category. In particular, we show that learning on a subsample produced by cost-proportionate rejection sampling achieves the same order of approximate cost minimization as applying the base classifier learning algorithm on the entire sample. This is a remarkable property for a subsampling scheme: in general, we expect any technique using only a subset of the examples to compromise predictive performance.
The runtime savings made possible by this sampling technique enable us to run the classification algorithm on multiple draws of subsamples and average over the resulting classifiers. This last method is what we call costing (cost-proportionate rejection sampling with aggregation). Costing allows us to use an arbitrary cost-insensitive learning algorithm as a black box in order to accomplish cost-sensitive learning, achieves excellent predictive performance, and can achieve drastic savings of computational resources.
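For concreteness, here is a minimal sketch of costing (not the authors' code): cost-proportionate rejection sampling is applied several times, a base classifier is trained on each subsample, and predictions are combined by voting. The decision-tree base learner, toy data and importance values are assumptions for illustration; any cost-insensitive learner could be plugged in.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def rejection_sample(X, y, c, rng):
        """Keep each example independently with probability c/Z, where Z >= max importance."""
        Z = c.max()
        keep = rng.random(len(c)) < c / Z
        return X[keep], y[keep]

    def costing(X, y, c, n_members=10, seed=0):
        """Train one base classifier per rejection sample; predict by majority vote."""
        rng = np.random.default_rng(seed)
        members = []
        for _ in range(n_members):
            Xs, ys = rejection_sample(X, y, c, rng)
            members.append(DecisionTreeClassifier().fit(Xs, ys))
        def predict(Xnew):
            votes = np.mean([m.predict(Xnew) for m in members], axis=0)
            return (votes >= 0.5).astype(int)
        return predict

    # toy usage: rare positive class carrying high importance
    rng = np.random.default_rng(1)
    X = rng.normal(size=(2000, 5))
    y = (X[:, 0] + rng.normal(scale=0.5, size=2000) > 1.5).astype(int)
    c = np.where(y == 1, 10.0, 1.0)          # per-example importances
    predict = costing(X, y, c)
    print(predict(X[:10]))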
2 Motivating Theory and Methods

2.1 A Folk Theorem

We assume that examples are drawn independently from a distribution D with domain X × Y × C, where X is the input space to a classifier, Y is a (binary) output space and C = [0, ∞) is the importance (extra cost) associated with mislabeling that example. The goal is to learn a classifier h: X → Y which minimizes the expected cost,

    E_{(x,y,c)\sim D}[c\, I(h(x) \neq y)],

given training data of the form (x, y, c), where I(·) is the indicator function that has value 1 in case its argument is true and 0 otherwise. This model does not explicitly allow using cost information at prediction time, although X might include a cost feature if that is available.
This formulation of cost-sensitive learning in terms of one number per example is more general than "cost matrix" formulations which are more typical in cost-sensitive learning [6, 2], when the output space is binary.¹ In the cost matrix formulation, costs are associated with false negative, false positive, true negative, and true positive predictions. Given the cost matrix and an example, only two entries (false positive, true negative) or (false negative, true positive) are relevant for that example. These two numbers can be further reduced to one: (false positive − true negative) or (false negative − true positive), because it is the difference in cost between classifying an example correctly or incorrectly which controls the importance of correct classification. This difference is the importance c we use here. This setting is more general in the sense that the importance may vary on an example-by-example basis.
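As a hypothetical numerical illustration (the cost values are invented, not from the paper), a cost matrix with false-positive cost 1, true-negative cost 0, false-negative cost 10 and true-positive cost 0 reduces to per-example importances as follows:

    c_{\mathrm{FP}} = 1,\quad c_{\mathrm{TN}} = 0,\quad c_{\mathrm{FN}} = 10,\quad c_{\mathrm{TP}} = 0
    \;\Longrightarrow\;
    c = \begin{cases}
        c_{\mathrm{FN}} - c_{\mathrm{TP}} = 10 & \text{for each positive example,}\\
        c_{\mathrm{FP}} - c_{\mathrm{TN}} = 1 & \text{for each negative example.}
    \end{cases}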
A basic folk theorem² states that if we have examples drawn from the distribution

    \hat{D}(x, y, c) \equiv \frac{c}{E_{(x,y,c)\sim D}[c]}\, D(x, y, c),

then optimal error rate classifiers for D̂ are optimal cost minimizers for data drawn from D.

¹ How to formulate the problem in this way when the output space is not binary is nontrivial and beyond the scope of this paper.

² We say "folk theorem" here because the result appears to be known by some and it is straightforward to derive it from results in decision theory, although we have not found it published.
Theorem 2.1. (Translation Theorem) For all distributions D, there exists a constant N = E_{(x,y,c)\sim D}[c] such that, for all classifiers h,

    E_{(x,y,c)\sim \hat{D}}[I(h(x) \neq y)] = \frac{1}{N}\, E_{(x,y,c)\sim D}[c\, I(h(x) \neq y)].
Despite its simplicity, this theorem is useful to us because the right-hand side expresses the expectation we want to control (via the choice of h) and the left-hand side is the probability that h errs under another distribution. Choosing h to minimize the rate of errors under D̂ is equivalent to choosing h to minimize the expected cost under D. Similarly, ε-approximate error minimization under D̂ is equivalent to Nε-approximate cost minimization under D.
The prescription for coping with cost-sensitive problems is straightforward: re-weight the distribution in your training set according to the importances so that the training set is effectively drawn from D̂. Doing this in a correct and general manner is more challenging than it may seem, and it is the topic of the rest of the paper.
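The identity behind the theorem can be checked numerically. The sketch below (toy labels, an arbitrary "classifier", made-up importances) verifies that N times the importance-weighted error rate equals the empirical expected cost.

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=1000)               # true labels
    pred = rng.integers(0, 2, size=1000)            # some classifier's predictions
    c = rng.uniform(0.5, 20.0, size=1000)           # per-example importances

    mistakes = (pred != y).astype(float)
    expected_cost = np.mean(c * mistakes)              # empirical E_D[c * I(h(x) != y)]
    N = np.mean(c)                                     # normalizer E_D[c]
    weighted_error = np.sum(c * mistakes) / np.sum(c)  # error rate under D-hat
    print(np.isclose(expected_cost, N * weighted_error))   # True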
2.2 Transparent Box: Using Weights Directly

2.2.1 General conversion

Here we examine how importance weights can be used within different learning algorithms to accomplish cost-sensitive classification. We call this the transparent box approach because it requires knowledge of the particular learning algorithm (as opposed to the black box approach that we develop later).
The mechanisms for realizing the transparent box approach have been described elsewhere for a number of weak learners used in boosting, but we describe them here for completeness. The classifier learning algorithm must use the weights so that it effectively learns from data drawn according to D̂. This requirement is easy to apply for all learning algorithms which fit the statistical query model [13].
As shown in Figure 1, many learning algorithms can be divided into two components: a portion which calculates the (approximate) expected value of some function (or query) f, and a portion which forms the queries and uses their output to construct a classifier. For example, neural networks, decision trees, and Naive Bayes classifiers can be constructed in this manner. Support vector machines are not easily constructible in this way, because the individual classifier is explicitly dependent upon individual examples rather than on statistics derived from the entire sample.

Figure 1. The statistical query model.
With finite data we cannot precisely calculate the expectation E_{(x,y)\sim D}[f(x, y)]. With high probability, however, we can approximate the expectation given a set of examples drawn independently from the underlying distribution D.
Whenever we have a learning algorithm that can be decomposed as in Figure 1, there is a simple recipe for using the weights directly. Instead of simulating the expectation with

    \frac{1}{|S|} \sum_{(x,y,c)\in S} f(x, y),

we use

    \frac{1}{\sum_{(x,y,c)\in S} c} \sum_{(x,y,c)\in S} c\, f(x, y).

This method is equivalent to importance sampling for D̂ using the distribution D, and so the modified expectation is an unbiased Monte Carlo estimate of the expectation under D̂.
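A minimal sketch of this weighted query estimate (the query function, data and weights below are placeholders):

    import numpy as np

    def weighted_query(f, X, y, c):
        """Estimate E_{D-hat}[f(x, y)] from a sample of D using importance weights c."""
        vals = np.array([f(xi, yi) for xi, yi in zip(X, y)])
        return np.sum(c * vals) / np.sum(c)

    # example query: the fraction of positive labels under D-hat
    X = np.array([[0.1], [0.7], [1.3]])
    y = np.array([0, 1, 1])
    c = np.array([1.0, 4.0, 5.0])
    print(weighted_query(lambda xi, yi: float(yi == 1), X, y, c))   # (4 + 5) / 10 = 0.9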
Even when a learning algorithm does not fit this model, it may be possible to incorporate importance weights directly. We now discuss how to incorporate importance weights into some specific learning algorithms.
2.2.2 Naive Bayes and boosting

Naive Bayes learns by calculating empirical probabilities for each output y using Bayes' rule and assuming that each feature is independent given the output:

    P(y \mid x) = \frac{P(x \mid y)\, P(y)}{\prod_i P(x_i)}.

Each probability estimate in the above expression can be thought of as a function of empirical expectations according to D, and thus it can be formulated in the statistical query model. For example, P(x_i | y) is just the expectation of I(X_i = x_i) I(Y = y) divided by the expectation of I(Y = y). More
specifically, to compute the empirical estimate of P(x_i | y) with respect to D, we need to count the number of training examples that have y as output, and those having x_i as the i-th input dimension among those. When we compute the empirical estimates with respect to D̂, we simply sum the weight of each example instead of counting the examples. (This property is used in the implementation of boosted Naive Bayes [5].)
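A minimal sketch of such a weight-summing estimate for a single binary feature (the smoothing and toy numbers are assumptions, not from the paper):

    import numpy as np

    def weighted_conditional(xi_col, y, c, xi_val, y_val, smoothing=1.0):
        """Estimate P(x_i = xi_val | y = y_val) under D-hat by summing importances
        instead of counting examples (Laplace smoothing added for safety)."""
        mask_y = (y == y_val)
        num = np.sum(c[mask_y & (xi_col == xi_val)]) + smoothing
        den = np.sum(c[mask_y]) + 2.0 * smoothing     # binary feature assumed
        return num / den

    # toy binary feature: the weights, not the counts, drive the estimate
    xi = np.array([1, 1, 0, 0, 1])
    y  = np.array([1, 1, 1, 0, 0])
    c  = np.array([9.0, 1.0, 2.0, 1.0, 1.0])
    print(weighted_conditional(xi, y, c, xi_val=1, y_val=1))   # (9 + 1 + 1) / (12 + 2) = 0.7857...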
To incorporate importance weights into AdaBoost [8], we give the importance weights to the weak learner in the first iteration, thus effectively drawing examples from D̂. In the subsequent iterations, we use the standard AdaBoost rule to update the weights. Therefore, the weights are adjusted according to the accuracy on D̂, which corresponds to the expected cost on D.
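As a rough stand-in (an assumption, not the authors' setup, which used boosted Naive Bayes), scikit-learn's AdaBoostClassifier accepts sample weights in fit, which initialize the boosting distribution; the toy data and the default stump weak learner below are placeholders.

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 4))
    y = (X[:, 0] > 1.0).astype(int)                 # rare positive class
    c = np.where(y == 1, 10.0, 1.0)                 # importances

    # Passing the importances as sample weights initializes the boosting
    # distribution proportionally to c; subsequent AdaBoost updates proceed as usual.
    model = AdaBoostClassifier(n_estimators=100)
    model.fit(X, y, sample_weight=c / c.sum())
    print(model.predict(X[:5]))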
2.2.3 C4.5

C4.5 [16] is a widely used decision tree learner. There is a standard way of incorporating example weights into it, which in the original algorithm was intended to handle missing attributes (examples with missing attributes were divided into fractional examples, each with a smaller weight, during the growth of the tree). This same facility was later used by Quinlan in the implementation of boosted C4.5 [15].
2.2.4 Support Vector Machine

The SVM algorithm [11] learns the parameters a and b describing a linear decision rule h(x) = sign(a · x + b), so that the smallest distance between each training example and the decision boundary (the margin) is maximized. With importance weights incorporated into the slack penalties, it works by solving the following optimization problem:

    \text{minimize:}\quad V(a, b, \xi) = \frac{1}{2}\, a \cdot a + C \sum_{i=1}^{n} c_i\, \xi_i
    \text{subject to:}\quad \forall i:\; y_i (a \cdot x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0.

Now C controls model complexity versus total cost.
The SVMLight package [10] allows users to input weights c_i and works with the modified V(a, b, ξ) as above, although this feature has not yet been documented.
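The paper uses SVMLight's weight inputs. As an analogous sketch under the assumption that a per-example weight should rescale each slack penalty, the example below uses scikit-learn's SVC with its sample_weight argument; the data and parameter values are made up.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    y = (X[:, 0] + 0.3 * rng.normal(size=500) > 1.0).astype(int)
    c = np.where(y == 1, 10.0, 1.0)                  # per-example importances

    # sample_weight rescales the penalty on each slack variable, so the
    # effective objective is (1/2)||a||^2 + C * sum_i c_i * xi_i.
    model = SVC(kernel="linear", C=0.001)
    model.fit(X, y, sample_weight=c)
    print(model.predict(X[:5]))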
2.3 Black Box: Sampling methods

Suppose we do not have transparent box access to the learner. In this case, sampling is the obvious method to convert from one distribution of examples to another to obtain a cost-sensitive learner using the translation theorem (Theorem 2.1). As it turns out, straightforward sampling does not work well in this case, motivating us to propose an alternative method based on rejection sampling.
2.3.1 Sampling-with-replacement

Sampling-with-replacement is a sampling scheme where each example (x, y, c) is drawn independently according to the distribution

    p(x, y, c) = \frac{c}{\sum_{(x,y,c)\in S} c}.

In sampling-without-replacement, by contrast, after an example (x, y, c) is drawn, the next example is drawn from the set S \ {(x, y, c)}. This process is repeated, drawing from a smaller and smaller set according to the weights of the examples remaining in the set.

To see how the latter method fails, note that sampling-without-replacement m times from a set of size m results in the original set, which (by assumption) is drawn from the distribution D, and not D̂ as desired.
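For reference, a minimal sketch of cost-proportionate sampling-with-replacement (toy data; numpy's weighted choice stands in for the sampler). The duplicate draws it tends to produce are the source of the overfitting reported in Section 3.2.2.

    import numpy as np

    def sample_with_replacement(X, y, c, size, seed=0):
        """Draw `size` examples i.i.d. with probability proportional to importance c."""
        rng = np.random.default_rng(seed)
        p = c / c.sum()
        idx = rng.choice(len(c), size=size, replace=True, p=p)
        return X[idx], y[idx]

    X = np.arange(10).reshape(-1, 1).astype(float)
    y = (np.arange(10) >= 8).astype(int)
    c = np.where(y == 1, 10.0, 1.0)
    Xs, ys = sample_with_replacement(X, y, c, size=10)
    print(np.bincount(ys, minlength=2))   # the two high-importance examples are drawn repeatedly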
2.3.2 Cost-proportionate rejection sampling

There is another sampling scheme, called rejection sampling [18], which allows us to draw examples independently from the distribution D̂, given examples drawn independently from D. In rejection sampling, examples from D̂ are obtained by first drawing examples from D, and then keeping (or accepting) each sample with probability proportional to D̂/D. Here we have D̂/D ∝ c, so we accept an example with probability c/Z, where Z is some constant chosen so that max_{(x,y,c)∈S} c ≤ Z,³ leading to the name cost-proportionate rejection sampling. Rejection sampling results in a set S′ which is generally smaller than S. Furthermore, because inclusion of an example in S′ is independent of the other examples, and the examples in S are drawn independently, we know that the examples in S′ are distributed independently according to D̂.
Using cost-proportionate rejection sampling to create a set S′ and then applying a learning algorithm A to S′ is guaranteed to produce an approximately cost-minimizing classifier, as long as the learning algorithm A achieves approximate minimization of classification error.
Theorem 2.2. (Correctness) For all cost-sensitive sample sets S, if cost-proportionate rejection sampling produces a sample set S′ and A(S′) achieves ε classification error,

    E_{(x,y,c)\sim \hat{D}}[I(h(x) \neq y)] \le \varepsilon,

then, because by Theorem 2.1

    \frac{1}{N}\, E_{(x,y,c)\sim D}[c\, I(h(x) \neq y)] = E_{(x,y,c)\sim \hat{D}}[I(h(x) \neq y)],

we have

    E_{(x,y,c)\sim D}[c\, I(h(x) \neq y)] \le \varepsilon N.
2.3.3 Sample complexity of cost-proportionate rejection sampling

The accuracy of a learned classifier generally improves monotonically with the number of examples in the training set. Since cost-proportionate rejection sampling produces a smaller training set (by a factor of about N/Z), one would expect worse performance than using the entire training set.
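A quick numerical check of the N/Z reduction factor (the importance values are made up): the fraction of examples accepted by rejection sampling concentrates around E[c]/Z = N/Z.

    import numpy as np

    rng = np.random.default_rng(0)
    c = np.where(rng.random(100_000) < 0.05, 10.0, 1.0)   # importances: 5% costly examples
    Z = c.max()
    accepted = rng.random(len(c)) < c / Z
    print(round(accepted.mean(), 3), round(c.mean() / Z, 3))   # both approximately N/Z (~0.145)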
This turns out not to be the case in the agnostic PAC-learning model [17, 12], which formalizes the notion of probably approximately optimal learning from arbitrary distributions D.
Definition 2.1. A learning algorithm A is said to be an agnostic PAC-learner for hypothesis class H, with sample complexity m(1/ε, 1/δ), if for all ε > 0 and δ > 0, m = m(1/ε, 1/δ) is the least sample size such that for all distributions D (over X × Y), the classification error rate of its output h is at most ε more than the best achievable by any member of H, with probability at least 1 − δ, whenever the sample size exceeds m.
By analogy, we can formalize the notion of cost-sensitive agnostic PAC-learning.
Definition 2.2. A learning algorithm A is said to be a cost-sensitive agnostic PAC-learner for hypothesis class H, with cost-sensitive sample complexity m(1/ε, 1/δ), if for all ε > 0 and δ > 0, m = m(1/ε, 1/δ) is the least sample size such that for all distributions D (over X × Y × C), the expected cost of its output h is at most ε more than the best achievable by any member of H, with probability at least 1 − δ, whenever the sample size exceeds m.
We will now use this formalization to compare the cost-sensitive PAC-learning sample complexity of two methods: applying a given base classifier learning algorithm to a sample obtained through cost-proportionate rejection sampling, and applying the same algorithm to the original training set. We show that the cost-sensitive sample complexity of the latter method is lower-bounded by that of the former.
Theorem 2.3. (Sample Complexity Comparison) Fix an arbitrary base classifier learning algorithm A, and suppose that m_orig(1/ε, 1/δ) and m_rej(1/ε, 1/δ) are, respectively, the cost-sensitive sample complexity of applying A on the original training set, and that of applying A with cost-proportionate rejection sampling. Then we have

    m_{\mathrm{orig}}(1/\varepsilon, 1/\delta) = \Omega\!\left(m_{\mathrm{rej}}(1/\varepsilon, 1/\delta)\right).
Proof. Let m(1/ε, 1/δ) be the (cost-insensitive) sample complexity of the base classifier learning algorithm A. (If no such function exists, then neither m_orig(1/ε, 1/δ) nor m_rej(1/ε, 1/δ) exists, and the theorem holds vacuously.) Since Z is an upper bound on the cost of misclassifying an example, the cost-sensitive sample complexity of using the original training set satisfies

    m_{\mathrm{orig}}(1/\varepsilon, 1/\delta) = \Theta\!\left(m(Z/\varepsilon, 1/\delta)\right).

This is because, given a distribution that forces ε more classification error than optimal, another distribution can be constructed that forces εZ more cost than optimal, by assigning cost Z to all examples on which A errs.
Now, from Theorem 2.2, and noting that the central limit theorem implies that cost-proportionate rejection sampling reduces the sample size by a factor of Θ(N/Z), the cost-sensitive sample complexity for rejection sampling is

    m_{\mathrm{rej}}(1/\varepsilon, 1/\delta) = \Theta\!\left(\frac{Z}{N}\, m(N/\varepsilon, 1/\delta)\right).
3.1.1 KDD-98 dataset

The dataset is divided in a fixed way into a training set and a test set. Each set consists of approximately 96,000 records for which it is known whether or not the person made a donation and how much the person donated, if a donation was made. The overall percentage of donors is about 5%. Mailing a solicitation to an individual costs the charity $0.68. The donation amount for persons who respond varies from $1 to $200. The profit obtained by soliciting every individual in the test set is $10,560, while the profit attained by the winner of the KDD-98 competition was $14,712.
The importance of each example is the absolute difference in profit between mailing and not mailing an individual. Mailing results in the donation amount minus the cost of mailing. Not mailing results in zero profit. Thus, for positive examples (respondents), the importance varies from $0.32 to $199.32. For negative examples (non-respondents), it is fixed at $0.68.
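A sketch of how these importances could be computed from raw donation amounts (the helper and its input format are hypothetical; only the $0.68 mailing cost and the $1–$200 donation range come from the text):

    import numpy as np

    MAILING_COST = 0.68

    def kdd98_importance(donation):
        """Importance = |profit(mail) - profit(no mail)|.
        Respondents: donation - 0.68; non-respondents: 0.68."""
        donation = np.asarray(donation, dtype=float)
        return np.where(donation > 0, donation - MAILING_COST, MAILING_COST)

    print(kdd98_importance([0.0, 1.0, 200.0]))   # [0.68, 0.32, 199.32]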
3.1.2 DMEF-2 dataset

This dataset can be obtained from the DMEF dataset library [1] for a nominal fee. It contains customer buying history for 96,551 customers of a nationally known catalog company. The decision-making task is to choose which customers should receive a new catalog so as to maximize the total profit on the catalog mailing campaign. Information on the cost of mailing a catalog is not available, so we fixed it at $2.
The overall percentage of respondents is about 2.5%. The purchase amount for customers who respond varies from $3 to $6,247. As is the case for the KDD-98 dataset, the importance of each example is the absolute difference in profit between mailing and not mailing a customer. Therefore, for positive examples (respondents), the importance varies from $1 to $6,245. For negative examples (non-respondents), it is fixed at $2.
We divided the dataset in half to create a training set and a test set. As a baseline for comparison, the profit obtained by mailing a catalog to every individual is $26,474 on the training set and $27,584 on the test set.
3.2 Experimental results

3.2.1 Transparent box results

Table 1 (top) shows the results for Naive Bayes, boosted Naive Bayes (100 iterations), C4.5 and SVMLight on the KDD-98 and DMEF-2 datasets, with and without the importance weights. Without the importance weights, the classifiers label very few of the examples as positive, resulting in small (and even negative) profits. With the costs given as weights to the learners, the results improve significantly for all learners except C4.5. Cost-sensitive boosted Naive Bayes gives results comparable to the best obtained so far on this dataset [19] using more complicated methods.
We optimized the parameters of the SVM by cross-validation on the training set. Without weights, no setting of the parameters prevented the algorithm from labeling all examples as negative. With weights, the best parameters were a polynomial kernel with degree 3 and C = 5·10^-5 for KDD-98, and a linear kernel with C = 0.0005 for DMEF-2.
Table 1. Test set profits with the transparent box approach, for Naive Bayes, boosted Naive Bayes, C4.5 and SVMLight on KDD-98 (top) and DMEF-2 (bottom), with and without importance weights.
However, even with this parameter setting, the results are not so impressive. This may be a hard problem for margin-based classifiers because the data is very noisy. Note also that running SVMLight on this dataset takes about three orders of magnitude longer than AdaBoost with 100 iterations.
The failure of C4.5 to achieve good profits with importance weights is probably related to the fact that the facility for incorporating weights provided in the algorithm is heuristic. So far, it has been used only in situations where the weights are fairly uniform (such as is the case for fractional instances due to missing data). The results indicate that it might not be suitable for situations with highly non-uniform costs. The fact that it is non-trivial to incorporate costs directly into existing learning algorithms is the motivation for the black box approaches that we present here.
3.2.2 Black box results

Table 2 shows the results of applying the same learning algorithms to the KDD-98 and DMEF-2 data using training sets of different sizes obtained by sampling-with-replacement. For each size, we repeat the experiments 10 times with different sampled sets to obtain the mean and standard error (in parentheses). The training set profits are measured on the original training set from which we draw the sampled sets.
The results confirm that applying sampling-with-replacement to implement the black box approach can result in very poor performance due to overfitting. When there are large differences in the magnitude of the importance weights, it is typical for an example to be picked twice (or more). In Table 2, we see that as we increase the sampled training set size and, as a consequence, the number of duplicate examples in the training set, the training profit becomes larger while the test profit becomes smaller for C4.5.
Examples which appear multiple times in the training set of a learning algorithm can defeat the complexity control mechanisms built into learning algorithms. For example, suppose that we have a decision tree algorithm which divides the training data into a "growing set" (used to construct a tree)
