An introduction to ROC analysis
Tom Fawcett
Institute for the Study of Learning and Expertise, 2164 Staunton Court, Palo Alto, CA 94306, USA
Available online 19 December 2005
Abstract
Receiver operating characteristics (ROC) graphs are useful for organizing classifiers and visualizing their performance. ROC graphs are commonly used in medical decision making, and in recent years have been used increasingly in machine learning and data mining research. Although ROC graphs are apparently simple, there are some common misconceptions and pitfalls when using them in practice. The purpose of this article is to serve as an introduction to ROC graphs and as a guide for using them in research. © 2005 Elsevier B.V. All rights reserved.
Keywords: ROC analysis; Classifier evaluation; Evaluation metrics
1. Introduction
A receiver operating characteristics (ROC) graph is a technique for visualizing, organizing and selecting classifiers based on their performance. ROC graphs have long been used in signal detection theory to depict the tradeoff between hit rates and false alarm rates of classifiers (Egan, 1975; Swets et al., 2000). ROC analysis has been extended for use in visualizing and analyzing the behavior of diagnostic systems (Swets, 1988). The medical decision making community has an extensive literature on the use of ROC graphs for diagnostic testing (Zou, 2002). Swets et al. (2000) brought ROC curves to the attention of the wider public with their Scientific American article.
One of the earliest adopters of ROC graphs in machine learning was Spackman (1989), who demonstrated the value of ROC curves in evaluating and comparing algorithms. Recent years have seen an increase in the use of ROC graphs in the machine learning community, due in part to the realization that simple classification accuracy is often a poor metric for measuring performance (Provost and Fawcett, 1997; Provost et al., 1998). In addition to being a generally useful performance graphing method, they have properties that make them especially useful for domains with skewed class distributions and unequal classification error costs. These characteristics have become increasingly important as research continues into the areas of cost-sensitive learning and learning in the presence of unbalanced classes.
ROC graphs are conceptually simple, but there are some non-obvious complexities that arise when they are used in research. There are also common misconceptions and pitfalls when using them in practice. This article attempts to serve as a basic introduction to ROC graphs and as a guide for using them in research. The goal of this article is to advance general knowledge about ROC graphs so as to promote better evaluation practices in the field.

2. Classifier performance
We begin by considering classification problems using only two classes. Formally, each instance I is mapped to one element of the set {p, n} of positive and negative class labels. A classification model (or classifier) is a mapping from instances to predicted classes. Some classification models produce a continuous output (e.g., an estimate of an instance's class membership probability) to which different thresholds may be applied to predict class membership. Other models produce a discrete class label indicating only the predicted class of the instance. To distinguish between
the actual class and the predicted class we use the labels {Y, N} for the class predictions produced by a model.
Given a classifier and an instance, there are four possible outcomes. If the instance is positive and it is classified as positive, it is counted as a true positive; if it is classified as negative, it is counted as a false negative. If the instance is negative and it is classified as negative, it is counted as a true negative; if it is classified as positive, it is counted as a false positive. Given a classifier and a set of instances (the test set), a two-by-two confusion matrix (also called a contingency table) can be constructed representing the dispositions of the set of instances. This matrix forms the basis for many common metrics.
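To make the four outcomes concrete, here is a minimal sketch (an illustration of the definitions, not code from the paper) that tallies the two-by-two confusion matrix from actual classes in {p, n} and predicted classes in {Y, N}:

```python
def confusion_matrix(actual, predicted):
    """Count the four outcomes for paired actual/predicted class labels."""
    tp = fp = tn = fn = 0
    for a, y in zip(actual, predicted):
        if a == 'p':
            if y == 'Y':
                tp += 1  # positive classified as positive: true positive
            else:
                fn += 1  # positive classified as negative: false negative
        else:
            if y == 'Y':
                fp += 1  # negative classified as positive: false positive
            else:
                tn += 1  # negative classified as negative: true negative
    return tp, fp, tn, fn

print(confusion_matrix(['p', 'p', 'n', 'n'], ['Y', 'N', 'Y', 'N']))  # (1, 1, 1, 1)
```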
Fig. 1 shows a confusion matrix and equations of several common metrics that can be calculated from it. The numbers along the major diagonal represent the correct decisions made, and the numbers off this diagonal represent the errors (the confusion) between the various classes. The true positive rate¹ (also called hit rate and recall) of a classifier is estimated as

\[
\text{tp rate} \approx \frac{\text{Positives correctly classified}}{\text{Total positives}}
\]

The false positive rate (also called false alarm rate) of the classifier is

\[
\text{fp rate} \approx \frac{\text{Negatives incorrectly classified}}{\text{Total negatives}}
\]

Additional terms associated with ROC curves are

\[
\text{sensitivity} = \text{recall}
\]
\[
\text{specificity} = \frac{\text{True negatives}}{\text{False positives} + \text{True negatives}} = 1 - \text{fp rate}
\]
\[
\text{positive predictive value} = \text{precision}
\]

¹ For clarity, counts such as TP and FP will be denoted with upper-case letters and rates such as tp rate will be denoted with lower-case.

3. ROC space
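As a concrete companion to these definitions, the following sketch (my own illustration, not code from the paper) computes the metrics above from the four confusion-matrix counts; it assumes the test set contains at least one positive and one negative instance so the main denominators are nonzero.

```python
def rates(tp, fp, tn, fn):
    """Compute the Fig. 1 metrics from confusion-matrix counts.

    Assumes tp + fn > 0 and fp + tn > 0 (at least one instance of each class).
    """
    tp_rate = tp / (tp + fn)            # hit rate, recall, sensitivity
    fp_rate = fp / (fp + tn)            # false alarm rate
    specificity = tn / (fp + tn)        # equal to 1 - fp_rate
    precision = tp / (tp + fp) if tp + fp else 0.0  # positive predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return tp_rate, fp_rate, specificity, precision, accuracy
```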
ROC graphs are two-dimensional graphs in which tp rate is plotted on the Y axis and fp rate is plotted on the X axis. An ROC graph depicts relative tradeoffs between benefits (true positives) and costs (false positives). Fig. 2 shows an ROC graph with five classifiers labeled A through E.
A discrete classifier is one that outputs only a class label. Each discrete classifier produces an (fp rate, tp rate) pair corresponding to a single point in ROC space. The classifiers in Fig. 2 are all discrete classifiers.
Several points in ROC space are important to note. The lower left point (0, 0) represents the strategy of never issuing a positive classification; such a classifier commits no false positive errors but also gains no true positives. The opposite strategy, of unconditionally issuing positive classifications, is represented by the upper right point (1, 1). The point (0, 1) represents perfect classification. D's performance is perfect as shown.
Informally, one point in ROC space is better than another if it is to the northwest (tp rate is higher, fp rate is lower, or both) of the first. Classifiers appearing on the left-hand side of an ROC graph, near the X axis, may be thought of as "conservative": they make positive classifications only with strong evidence, so they make few false positive errors, but they often have low true positive rates as well. Classifiers on the upper right-hand side of an ROC graph may be thought of as "liberal": they make positive classifications with weak evidence, so they classify nearly all positives correctly, but they often have high false positive rates. In Fig. 2, A is more conservative than B. Many real-world domains are dominated by large numbers of negative instances, so performance in the far left-hand side of the ROC graph becomes more interesting.
3.1. Random performance
The diagonal line y = x represents the strategy of randomly guessing a class. For example, if a classifier randomly guesses the positive class half the time, it can be expected to get half the positives and half the negatives correct; this yields the point (0.5, 0.5) in ROC space. If it guesses the positive class 90% of the time, it can be expected to get 90% of the positives correct, but its false positive rate will increase to 90% as well, yielding (0.9, 0.9) in ROC space. Thus a random classifier will produce an ROC point that "slides" back and forth on the diagonal based on the frequency with which it guesses the positive class. In order to get away from this diagonal into the upper triangular region, the classifier must exploit some information in the data. In Fig. 2, C's performance is virtually random. At (0.7, 0.7), C may be said to be guessing the positive class 70% of the time.
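A quick simulation (my own sketch, with arbitrary sample sizes) shows why such a guesser lands on the diagonal: guessing Y with probability p yields a tp rate and an fp rate both near p.

```python
import random

def random_guess_point(p_guess, n_pos=10000, n_neg=10000, seed=0):
    """ROC point of a classifier that answers Y with probability p_guess."""
    rng = random.Random(seed)
    tp = sum(rng.random() < p_guess for _ in range(n_pos))
    fp = sum(rng.random() < p_guess for _ in range(n_neg))
    return fp / n_neg, tp / n_pos  # (fp rate, tp rate)

print(random_guess_point(0.5))  # approximately (0.5, 0.5)
print(random_guess_point(0.9))  # approximately (0.9, 0.9)
```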
Any classifier that appears in the lower right triangle performs worse than random guessing. This triangle is therefore usually empty in ROC graphs. If we negate a classifier (that is, reverse its classification decisions on every instance), its true positive classifications become false negative mistakes, and its false positives become true negatives. Therefore, any classifier that produces a point in the lower right triangle can be negated to produce a point in the upper left triangle. In Fig. 2, E performs much worse than random, and is in fact the negation of B. Any classifier on the diagonal may be said to have no information about the class. A classifier below the diagonal may be said to have useful information, but it is applying the information incorrectly (Flach and Wu, 2003).
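The geometry of negation is easy to verify: swapping every decision turns TP counts into FN counts and FP counts into TN counts, which complements both rates. A two-line sketch (illustration only):

```python
def negate(fp_rate, tp_rate):
    """ROC point of the negated classifier: both rates are complemented."""
    return 1.0 - fp_rate, 1.0 - tp_rate

print(negate(0.8, 0.3))  # a below-diagonal point (0.8, 0.3) maps to (0.2, 0.7)
```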
Given an ROC graph in which a classifier's performance appears to be slightly better than random, it is natural to ask: "is this classifier's performance truly significant or is it only better than random by chance?" There is no conclusive test for this, but Forman (2002) has shown a methodology that addresses this question with ROC curves.
4. Curves in ROC space
Many classifiers, such as decision trees or rule sets, are designed to produce only a class decision, i.e., a Y or N on each instance. When such a discrete classifier is applied to a test set, it yields a single confusion matrix, which in turn corresponds to one ROC point. Thus, a discrete classifier produces only a single point in ROC space.
Some classifiers, such as a Naive Bayes classifier or a neural network, naturally yield an instance probability or score, a numeric value that represents the degree to which an instance is a member of a class. These values can be strict probabilities, in which case they adhere to standard theorems of probability; or they can be general, uncalibrated scores, in which case the only property that holds is that a higher score indicates a higher probability. We shall call both a probabilistic classifier, in spite of the fact that the output may not be a proper probability.²

² Techniques exist for converting an uncalibrated score into a proper probability, but this conversion is unnecessary for ROC curves.

Such a ranking or scoring classifier can be used with a threshold to produce a discrete (binary) classifier: if the classifier output is above the threshold, the classifier produces a Y, else an N. Each threshold value produces a different point in ROC space. Conceptually, we may imagine varying a threshold from −∞ to +∞ and tracing a curve through ROC space. Computationally, this is a poor way of generating an ROC curve, and the next section describes a more efficient and careful method.
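The conceptual sweep can be written directly, at the cost of rescanning the test set for every threshold; this deliberately naive sketch (names are my own) is exactly what the more efficient method improves upon.

```python
def roc_points(scores, labels, thresholds):
    """One ROC point per threshold; predict Y when score >= threshold.

    Assumes labels contain at least one 'p' and one 'n'.
    """
    pos = sum(1 for l in labels if l == 'p')
    neg = len(labels) - pos
    points = []
    for t in sorted(thresholds, reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 'p')
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 'n')
        points.append((fp / neg, tp / pos))
    return points  # traces the step function from near (0, 0) toward (1, 1)
```

Using each distinct classifier score as a threshold (plus one value above the maximum) recovers every point of the step function; the repeated rescan makes this quadratic in the size of the test set, which is why it is a poor way to generate the curve.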
Fig. 3 shows an example of an ROC "curve" on a test set of 20 instances. The instances, 10 positive and 10 negative, are shown in the table beside the graph. Any ROC curve generated from a finite set of instances is actually a step function, which approaches a true curve as the number of instances approaches infinity. The step function in Fig. 3 is taken from a very small instance set so that each point's derivation can be understood. In the table of Fig. 3, the instances are sorted by their scores, and each point in the ROC graph is labeled by the score threshold that produces it. A threshold of +∞ produces the point (0, 0). As we lower the threshold to 0.9 the first positive instance is classified positive, yielding (0, 0.1). As the threshold is further reduced, the curve climbs up and to the right, ending up at (1, 1) with a threshold of 0.1. Note that lowering this threshold corresponds to moving from the "conservative" to the "liberal" areas of the graph.
Although the test set is very small, we can make some tentative observations about the classifier. It appears to perform better in the more conservative region of the graph; the ROC point at (0.1, 0.5) produces its highest accuracy (70%). This is equivalent to saying that the classifier is better at identifying likely positives than at identifying likely negatives. Note also that the classifier's best accuracy occurs at a threshold of ≥0.54, rather than at ≥0.5 as we might expect with a balanced distribution. The next section discusses this phenomenon.
4.1. Relative versus absolute scores
An important point about ROC graphs is that they measure the ability of a classifier to produce good relative instance scores. A classifier need not produce accurate, calibrated probability estimates; it need only produce relatively accurate scores that serve to discriminate positive and negative instances.
Consider the simple instance scores shown in Fig. 4, which came from a Naive Bayes classifier. Comparing the hypothesized class (which is Y if score > 0.5, else N) against the true class, we can see that the classifier gets instances 7 and 8 wrong, yielding 80% accuracy. However, consider the ROC curve on the left side of the figure. The curve rises vertically from (0, 0) to (0, 1), then horizontally to (1, 1). This indicates perfect classification performance on this test set. Why is there a discrepancy?
The explanation lies in what each is measuring. The ROC curve shows the ability of the classifier to rank the positive instances relative to the negative instances, and it is indeed perfect in this ability. The accuracy metric imposes a threshold (score > 0.5) and measures the resulting classifications with respect to the scores. The accuracy measure would be appropriate if the scores were proper probabilities, but they are not. Another way of saying this is that the scores are not properly calibrated, as true probabilities are. In ROC space, the imposition of a 0.5 threshold results in the performance designated by the circled "accuracy point" in Fig. 4. This operating point is suboptimal. We could use the training set to estimate a prior for p(p) = 6/10 = 0.6 and use this as a threshold, but it would still produce suboptimal performance (90% accuracy).
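The discrepancy is easy to reproduce. The sketch below uses hypothetical scores (the actual Fig. 4 values are not reproduced in this text): six positives ranked strictly above four negatives, so the ranking is perfect, yet the 0.5 threshold and the prior-based 0.6 threshold both misclassify instances.

```python
# Hypothetical, perfectly ranked but poorly calibrated scores (not Fig. 4's values).
scores = [0.99, 0.90, 0.80, 0.70, 0.65, 0.55, 0.52, 0.51, 0.40, 0.30]
labels = ['p',  'p',  'p',  'p',  'p',  'p',  'n',  'n',  'n',  'n']

def accuracy_at(t):
    """Accuracy when the hypothesized class is Y for score > t, else N."""
    return sum((s > t) == (l == 'p') for s, l in zip(scores, labels)) / len(labels)

print(accuracy_at(0.50))  # 0.8: two negatives score above the 0.5 threshold
print(accuracy_at(0.60))  # 0.9: the prior-based threshold is still suboptimal
print(accuracy_at(0.53))  # 1.0: a threshold between the two classes is perfect
```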
One way to eliminate this phenomenon is to calibrate the classifier scores. There are some methods for doing this (Zadrozny and Elkan, 2001). Another approach is to use an ROC method that chooses operating points based on their relative performance, and there are methods for doing this as well (Provost and Fawcett, 1998, 2001). These latter methods are discussed briefly in Section 6.
A consequence of relative scoring is that classifier scores should not be compared across model classes. One model class may be designed to produce scores in the range [0, 1] while another produces scores in [−1, +1] or [1, 100]. Comparing model performance at a common threshold will be meaningless.
4.2. Class skew
ROC curves have an attractive property: they are insensitive to changes in class distribution. If the proportion of positive to negative instances changes in a test set, the ROC curves will not change. To see why this is so, consider the confusion matrix in Fig. 1. Note that the class distribution (the proportion of positive to negative instances) is the relationship of the left (+) column to the right (−) column. Any performance metric that uses values from both columns will be inherently sensitive to class skews. Metrics such as accuracy, precision, lift and F score use values from both columns of the confusion matrix. As a class distribution changes, these measures will change as well, even if the fundamental classifier performance does not. ROC graphs are based upon tp rate and fp rate, in which each dimension is a strict columnar ratio, so they do not depend on class distributions.
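A small numeric check (my own illustration, with made-up counts) makes the columnar argument concrete: duplicating every negative instance ten times leaves tp rate and fp rate unchanged, while precision and accuracy, which mix the two columns, shift substantially.

```python
def metrics(tp, fp, tn, fn):
    return {
        'tp_rate':   tp / (tp + fn),                   # positive column only
        'fp_rate':   fp / (fp + tn),                   # negative column only
        'precision': tp / (tp + fp),                   # mixes both columns
        'accuracy':  (tp + tn) / (tp + fp + tn + fn),  # mixes both columns
    }

print(metrics(tp=80, fp=10,  tn=90,  fn=20))   # original class balance
print(metrics(tp=80, fp=100, tn=900, fn=20))   # every negative duplicated 10x:
# tp_rate and fp_rate are identical; precision and accuracy change markedly.
```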
To some researchers, large class skews and large changes in class distributions may seem contrived and unrealistic. However, class skews of 10^1 and 10^2 are very common in real world domains, and skews up to 10^6 have been observed in some domains (Clearwater and Stern, 1991; Fawcett and Provost, 1996; Kubat et al., 1998; Saitta and Neri, 1998). Substantial changes in class distributions are not unrealistic either. For example, in medical decision making epidemics may cause the incidence of a disease to increase over time. In fraud detection, proportions of fraud varied significantly from month to month and place to place (Fawcett and Provost, 1997). Changes in a manufacturing practice may cause the proportion of defective units
produced by a manufacturing line to increase or decrease. In each of these examples the prevalence of a class may change drastically without altering the fundamental characteristic of the class, i.e., the target concept.
Precision and recall are common in information retrieval for evaluating retrieval (classification) performance (Lewis, 1990, 1991). Precision-recall graphs are commonly used where static document sets can sometimes be assumed; however, they are also used in dynamic environments such as web page retrieval, where the number of pages irrelevant to a query (N) is many orders of magnitude greater than P and probably increases steadily over time as web pages are created.
To see the effect of class skew, consider the curves in Fig. 5, which show two classifiers evaluated using ROC curves and precision-recall curves. In Fig. 5a and b, the test