Cost-Sensitive Learning by Cost-Proportionate Example Weighting
Bianca Zadrozny,John Langford,Naoki Abe
Mathematical Sciences Department
IBM T.J.Watson Rearch Center
Yorktown Heights,NY10598
Abstract
We propo and evaluate a family of methods for con-
verting classifier learning algorithms and classification the-
ory into cost-nsitive algorithms and theory.The propod
conversion is bad on cost-proportionate weighting of the
training examples,which can be realized either by feeding
the weights to the classification algorithm(as often done in
boosting),or by careful subsampling.We give some theo-
retical performance guarantees on the propod methods,
as well as empirical evidence that they are practical al-
ternatives to existing approaches.In particular,we pro-
po costing,a method bad on cost-proportionate rejec-
tion sampling and enmble aggregation,which achieves
excellent predictive performance on two publicly available
datats,while drastically reducing the computation re-
quired by other methods.
1Introduction
Highly non-uniform misclassification costs are very
common in a variety of challenging real-world data min-
ing problems,such as fraud detection,medical diagnosis
and various problems in business decision-making.In many
cas,one class is rare but the cost of not recognizing some
of the examples belonging to this class is high.In the
domains,classifier learning methods that do not take mis-
classification costs into account do not perform well.In
extreme cas,ignoring costs may produce a model that is
uless becau it classifies every example as belonging to
the most frequent class even though misclassifications of
the least frequent class result in a very large cost.
Recently a body of work has attempted to address this
issue,with techniques known as cost-nsitive learning in
the machine learning and data mining communities.Cur-
rent cost-nsitive learning rearch falls into three cat-
egories.Thefirst is concerned with making particular
classifier learners cost-nsitive[3,7].The cond us
Bayes risk theory to assign each example to its lowest risk
approximate cost minimization as applying the ba classi-fier learning algorithm on the entire sample.This is a re-markable property for a subsampling scheme:in general, we expect any technique using only a subt of the exam-ples to compromi predictive performance.
The runtime savings made possible by this sampling technique enable us to run the classification alg
orithm on multiple draws of subsamples and average over the resulting classifiers.This last method is what we call costing(cost-proportionate rejection sampling with aggregation).Cost-ing allows us to u an arbitrary cost-innsitive learning al-gorithm as a black box in order to accomplish cost-nsitive learning,achieves excellent predictive performance and can achieve drastic savings of computational resources.
2Motivating Theory and Methods
2.1A Folk Theorem
We assume that examples are drawn independently from a distribution D with domain X Y C where X is the in-put space to a classifier,Y is a(binary)output space and C0∞is the importance(extra cost)associated with mislabeling that example.The goal is to learn a classifier h:X Y which minimizes the expected cost,
E x y c D cI h x y
given training data of the form:x y c,where I is the indicator function that has value1in ca its argument is true and0otherwi.This model does not explicitly allow using cost information at prediction time although X might include a cost feature if that is available.
This formulation of cost-nsitive learning in terms of one number per example is more general than“cost matrix”formulations which are more typical in cost-nsitive learn-ing[6,2],when the output space is binary.1In the cost matrix formulation,costs are associated with fal negative, fal positive,true negative,and true positive predictions. Given the cost matrix and an example,only two entries (fal positive,true negative)or(fal negative,true posi-tive)are relevant for that example.The two numbers can be further reduced to one:(fal positive-true negative)or (fal negative-true positive),becau it is the difference in cost between classifying an example correctly or incorrectly which controls the importance of correct classification.This difference is the importance c we u here.This tting is more general in the n that the importance may vary on a example-by-example basis.
A basic folk theorem2states that if we have examples drawn from the distribution:
ˆD x y c c
1How to formulate the problem in this way when the output space is not binary is nontrivial and is beyond the scope of this paper.
2We say“folk theorem”here becau the result appears to be known by some and it is straightforwar
d to derive it from results in decision theory, although we have not found it published.then optimal error rate classifiers forˆD are optimal cost minimizers for data drawn from D.
Theorem2.1.(Translation Theorem)For all distributions, D,there exists a constant N E x y c D c such that for all classifiers,h:
E x y cˆD I h x y
1
N
D x y c
Despite its simplicity,this theorem is uful to us be-cau the right-hand side express the expectation we want to control(via the choice of h)and the left-hand side is the probability that h errs under another distribution.Choos-ing h to minimize the rate of errors underˆD is equivalent to choosing h to minimize the expected cost under D.Simi-larly,ε-approximate error minimization underˆD is equiva-lent to Nε-approximate cost minimization under D.
The prescription for coping with cost-nsitive problems is straightforward:re-weight the distribution in your train-ing t according to the importances so that the training t is effectively drawn fromˆD.Doing this in a correct and general manner is more challenging than it may em and is the topic of the rest of the paper.
2.2Transparent Box:Using Weights Directly 2.2.1General conversion
Here we examine how importance weights can be ud within different learning algorithms to accomplish cost-nsitive classification.We call this the transparent box approach becau it requires knowledge of the particular learning algorithm(as oppod to the black box approach that we develop later).
The mechanisms for realizing the transparent box ap-proach have been described elwhere for a number of weak learners ud in boosting,but we will describe them here for completeness.The classifier learning algorithm must u the weights so that it effectively learns from data drawn ac-cording toˆD.This requirement is easy to apply for all learn-ing algorithms whichfit the statistical query model[13].
As shown infigure1,many learning algorithms can be divided into two components:a portion which ca
lculates the (approximate)expected value of some function(or query) f and a portion which forms the queries and us their output to construct a classifier.For example,neural net-works,decision trees,and Naive Bayes classifiers can be
Figure1.The statistical query model. constructed in this manner.Support vector machines are not easily constructible in this way,becau the individual classifier is explicitly dependent upon individual examples rather than on statistics derived from the entire sample.
Withfinite data we cannot precily calculate the expec-tation E x y D f x y.With high probability,however,we can approximate the expectation given a t of examples drawn independently from the underlying distribution D.
Whenever we have a learning algorithm that can be de-compod as infigure1,there is a simple recipe for using the weights directly.Instead of simulating the expectation with1
∑x y c S c
∑x y c S c f x y. This method is equivalent to importance sampling forˆD us-ing the distribution D,and so the modified expectation is an unbiad Monte Carlo estimate of the ˆD.
Even when a learning algorithm does notfit this model,it may be possible to incorporate importance weights directly. We now discuss how to incorporate importance weights into some specific learning algorithms.
2.2.2Naive Bayes and boosting
Naive Bayes learns by calculating empirical probabilities for each output y using Bayes’rule and assuming that each feature is independent given the output:
P y x P x y P y
∏i P x i
Each probability estimate in the above expression can be thought of as a function of empirical expectations according to D,and thus it can be formulated in the statistical query model.For example,P x i y is just the expectation of I x i x i I y y divided by the expectation of I y y.More
specifically,to compute the empirical estimate of P x i y with respect to D,we need to count the number of training examples that have y as output,and tho having x i as the i-th input dimension among tho.When we compute the empirical estimates with respect toˆD,we simply have to sum the weight of each example,instead of counting the examples.(This property is ud in the implementation of boosted Naive Bayes[5].)
To incorporate importance weights into AdaBoost[8], we give the importance weights to the weak learner in the first iteration,thus effectively drawing examples fromˆD. In the subquent iterations,we u the standard AdaBoost rule to update the weights.Therefore,the weights are ad-justed according to the accuracy onˆD,which corresponds to the expected cost on D.
2.2.3C4.5
C4.5[16]is a widely ud decision tree learner.There is a standard way of incorporating example weights to it,which in the original algorithm was intended to handle missing at-tributes(examples with missing attributes were divided into fractional examples,each with a smaller weight,during the growth of the tree).This same facility was later ud by Quinlan in the implementation of boosted C4.5[15].
2.2.4Support Vector Machine
The SVM algorithm[11]learns the parameters a and b de-scribing a linear decision rule h x sign a x b,so that the smallest distance between each training example and the decision boundary(the margin)is maximized.It works by solving the following optimization problem:
minimize:V a bξ1
2
a a C∑n i1c iξi
Now C controls model complexity versus total cost.
The SVMLight package[10]allows urs to input weights c i and works with the modified V a bξas above, although this feature has not yet been documented.
2.3Black Box:Sampling methods
Suppo we do not have transparent box access to the learner.In this ca,sampling is the obvious method to con-vert from one distribution of examples to another to obtain a cost-nsitive learner using the translation theorem(The-orem2.1).As it turns out,straightforward sampling does not work well in this ca,motivating us to propo an al-ternative method bad on rejection sampling.
2.3.1Sampling-with-replacement
Sampling-with-replacement is a sampling scheme where each example x y c is drawn according to the distribution p x y c c
∑x y c S c
and the next example is drawn from the t S x y c.This process is repeated,drawing from a smaller and smaller t according to the weights of the examples remaining in the t.
To e how this method fails,note that sampling-without-replacement m times from a t of size m results in the original t,which(by assumption)is drawn from the distribution D,and notˆD as desired.
2.3.2Cost-proportionate rejection sampling
There is another sampling scheme called rejection sampling [18]which allows us to draw examples independently from the distributionˆD,given examples drawn independently from D.In rejection sampling,examples fromˆD are ob-tained byfirst drawing examples from D,and then keep-ing(or accepting)the sample with probability proportional toˆD D.Here,we haveˆD D∝c,so we accept an exam-ple with probability c Z,where Z is some constant cho-n so that max x y c S c Z,3leading to t
he name cost-proportionate rejection sampling.Rejection sampling re-sults in a t S which is generally smaller than S.Further-more,becau inclusion of an example in S is independent of other examples,and the examples in S are drawn inde-pendently,we know that the examples in S are distributed independently according toˆD.
Using cost-proportionate rejection sampling to create a t S and then using a learning algorithm A S is guaran-teed to produce an approximately cost-minimizing classi-fier,as long as the learning algorithm A achieves approxi-mate minimization of classification error.
Theorem2.2.(Correctness)For all cost-nsitive sample ts S,if cost-proportionate rejection sampling produces a sample t S and A S achievesεclassification error:
E x y cˆD I h x yε
N
E x y c D cI h x y
Thus,E
x y c D cI h x yεN
2.3.3Sample complexity of cost-proportionate rejec-
tion sampling
The accuracy of a learned classifier generally improves monotonically with the number of examples in the training t.Since cost-proportionate rejection sampling produces a smaller training t(by a factor of about N Z),one would expect wor performance than using the entire training t.
This turns out to not be the ca,in the agnostic PAC-learning model[17,12],which formalizes the notion of probably approximately optimal learning from arbitrary dis-tributions D.
Definition2.1.A learning algorithm A is said to be an agnostic PAC-learner for hypothesis class H,with sample complexity m1ε1δif for allε0andδ0,m m1ε1δis the least sample size such that for all dis-tributions D(over X Y),the classification error rate of its output h is at mostεmore than the best achievable by any member of H with probability at least1δ,whenever the sample size exceeds m.
By analogy,we can formalize the notion of cost-nsitive agnostic PAC-learning.
Definition2.2.A learning algorithm A is said to be a cost-nsitive agnostic PAC-learner for hypothesis class H, with cost-nsitive sample complexity m1ε1δ,if for all ε0andδ0,m m1ε1δis the least sample
size such that for all distributions D(over X Y C),the ex-pected cost of its output h is at mostεmore than the best achievable by any member of H with probability at least 1δ,whenever the sample size exceeds m.
We will now u this formalization to compare the cost-nsitive PAC-learning sample complexity of two methods: applying a given ba classifier learning algorithm to a sam-ple obtained through cost-proportionate rejection sampling, and applying the same algorithm on the original training t.We show that the cost-nsitive sample complexity of the latter method is lower-bounded by that of the former.
Theorem2.3.(Sample Complexity Comparison)Fix an
arbitrary ba classifier learning algorithm A,and sup-po that m orig1ε1δand m rej1ε1δ,respectively, are cost-nsitive sample complexity of applying A on the
original training t,and that of applying A with cost-proportionate rejection sampling.Then,we have
m orig1ε1δΩm rej1ε1δ
Proof.Let m1ε1δbe the(cost-innsitive)sample complexity of the ba classifier learning algorithm A.
(If no such function exists,then neither m orig1ε1δnor m rej1ε1δexists,and the theorem holds vacuously.) Since Z is an upper bound on the cost of misclassifying an example,we have that the cost-nsitive sample complexity of using the original training t satisfies
m orig1ε1δΘm Zε1δ
This is becau given a distribution that forcesεmore clas-
sification error than optimal,another distribution can be constructed,that forcesεZ more cost than optimal,by as-signing cost Z to all examples on which A errs.
Now from Theorem2.2and noting that the central limit theorem implies that cost-proportionate rejection sampling reduces the sample size by a factor ofΘN Z,the cost-nsitive sample complexity for rejection sampling is:
Z
m rej1ε1δΘ
N
N
The datat is divided in afixed way into a training t
and a test t.Each t consists of approximately96000 records for which it is known whether or not the person
made a donation and how much the person donated,if a do-
nation was made.The overall percentage of donors is about 5%.Mailing a solicitation to an individual costs the charity
$068.The donation amount for persons who respond varies from$1to$200.The profit obtained by soliciting every in-dividual in the test t is$10560,while the profit attained
by the winner of the KDD-98competition was$14712.
The importance of each example is the absolute dif-
ference in profit between mailing and not mailing an in-
dividual.Mailing results in the donation amount minus the cost of mailing.Not mailing results in zero profit. Thus,for positive examples(respondents),the importance varies from$032to$19932.For negative examples(non-respondents),it isfixed at$068.
3.1.2DMEF-2datat
This datat can be obtained from the DMEF datat library [1]for a nominal fee.It contains customer buying history for96551customers of a nationally known catalog.The decision-making task is to choo which customers should receive a new catalog so as to maximize the total profit on the catalog mailing campaign.Information on the cost of mailing a catalog is not available,so wefixed it at$2.
The overall percentage of respondents is about2.5%. The purcha amount for customers who respond varies from$3to$6247.As is the ca for the KDD-98datat, the importance of each example is the absolute difference in profit between mailing and not mailing a customer.There-fore,for positive examples(respondents),the importance varies from$1to$6245.For negative examples(non-respondents),it isfixed at$2.
We divided the datat in half to create a training t and a test t.As a baline for comparison,the
profit obtained by mailing a catalog to every individual on the training t is$26474and on the test t is$27584.
3.2Experimental results
3.2.1Transparent box results
Table1(top)shows the results for Naive Bayes,boosted Naive Bayes(100iterations)C4.5and SVMLight on the KDD-98and DMEF-2datats,with and without the im-portance weights.Without the importance weights,the clas-sifiers label very few of the examples positive,resulting in small(and even negative)profits.With the costs given as weights to the learners,the results improve significantly for all learners,except C4.5.Cost-nsitive boosted Naive Bayes gives results comparable to the best so far with this datat[19]using more complicated methods.
We optimized the parameters of the SVM by cross-validation on the training t.Without weights,no tting of the parameters prevented the algorithm of labeling all exam-ples as negatives.With weights,the best parameters were
KDD-98:
Method With Weights
0.24
-1.36
Without Weights
Naive Bayes32608
Boosted NB36381
C4.5478
SVMLight36443
Table1.Test t profits with transparent box.
a polynomial kernel with degree3and C5105for KDD-98and a linear kernel with C00005for DMEF-2. However,even with this parameter tting,the results are not so impressive.This may be a hard problem for margin-bad classifiers becau the data is very noisy.Note also that running SVMLight
on this datat takes about3orders of magnitude longer than AdaBoost with100iterations.
The failure of C4.5to achieve good profits with impor-tance weights is probably related to the fact that the facil-ity for incorporating weights provided in the algorithm is heuristic.So far,it has been ud only in situations where the weights are fairly uniform(such as is the ca for frac-tional instances due to missing data).The results indicate that it might not be suitable for situations with highly non-uniform costs.The fact that it is non-trivial to incorporate costs directly into existing learning algorithms is the moti-vation for the black box approaches that we prent here.
3.2.2Black box results
Table2shows the results of applying the same learn-ing algorithms to the KDD-98and DMEF-2data using training ts of different sizes obtained by sampling-with-replacement.For each size,we repeat the experiments10 times with different sampled ts to get mean and standard error(in parenthes).The training t profits are on the original training t from which we draw the sampled ts.
The results confirm that application of sampling-with-replacement to implement the black box approach can result in very poor performance due to overfitting.When there are large differences in t
he magnitude of importance weights,it is typical for an example to be picked twice(or more).In table2,we e that as we increa the sampled training t size and,as a conquence,the number of duplicate exam-ples in the training t,the training profit becomes larger while the test profit becomes smaller for C4.5.
Examples which appear multiple times in the training t of a learning algorithm can defeat complexity control mech-anisms built into learning algorithms For example,suppo that we have a decision tree algorithm which divides the training data into a“growing t”(ud to construct a tree)