Journal of Machine Learning Rearch9(2008)2677-2694Submitted10/07;Revid4/08;Published12/08 An Extension on“Statistical Comparisons of Classifiers over Multiple Data Sets”for all Pairwi Comparisons
Salvador Garc´ıa SALVAGL@DECSAI.UGR.ES Francisco Herrera HERRERA@DECSAI.UGR.ES Department of Computer Science and Artificial Intelligence
University of Granada
机械类英语Granada,18071,Spain
Editor:John Shawe-Taylor
Abstract
In a recently published paper in JMLR,Demˇs ar(2006)recommends a t of non-parametric sta-tistical tests and procedures which can be safely ud for comparing the performance of classifiers over multiple data ts.After studying the paper,we realize that the paper correctly introduces the basic procedures and some of the most advanced ones when comparing a control method.How-ever,it does not deal with some advanced topics in depth.Regarding the topics,we focus on more powerful propo
sals of statistical procedures for comparing n×n classifiers.Moreover,we illustrate an easy way of obtaining adjusted and comparable p-values in multiple comparison procedures.
Keywords:statistical methods,non-parametric test,multiple comparisons tests,adjusted p-values, logically related hypothes
1.Introduction
receptionistIn the Machine Learning(ML)scientific community there is a need for rigorous and correct statisti-cal analysis of published results,due to the fact that the development or modifications of algorithms is a relatively easy task.The main inconvenient related to this necessity is to understand and study the statistics and to know the exact techniques which can or cannot be applied depending on the situation,that is,type of results obtained.In a recently published paper in JMLR by Demˇs ar(2006), a group of uful guidelines are given in order to perform a correct analysis when we compare a t of classifiers over multiple data ts.Demˇs ar recommends a t of non-parametric statistical techniques(Zar,1999;Sheskin,2003)for comparing classifiers under the circumstances,given that the sample of results obtained by them does not fulfill the required conditions and it is not large enough for making a parametric statistical analysis.He analyzed the behavior of the pro-pod statist
ics on classification tasks and he checked that they are more convenient than parametric techniques.
Recent studies apply the guidelines given by Demˇs ar in the analysis of performance of classifiers (Esmeir and Markovitch,2007;Marrocco et al.,2008).In them,a new proposal or methodology is offered and it is compared with other methods by means of pairwi comparisons.Another type of studies assume an empirical comparison or review of already propod methods.In the cas,no proposal is offered and a statistical comparison could be very uful in determining the differences among the methods.In the specialized literature,many papers provide reviews on a specific topic and they also u statistical methodology to perform comparisons.For example,in a review of
G ARC´I A AND H ERRERA
enmbles of decision trees,non-parametric tests are also applied in the analysis of performance (Banfield et al.,2007).However,only the rankings computed by Friedman’s method(Friedman, 1937)are stipulated and authors establish comparisons bad on them,without taking into account significance levels.Demˇs ar focud his work in the analysis of new proposals,and he introduced the Nemenyi test for making all pairwi comparisons(Nemenyi,1963).Nevertheless,the Nemenyi tes
already什么意思t is very conrvative and it may notfind any difference in most of the experimentations.In recent papers,the authors have ud the Nemenyi test in multiple comparisons.Due to the fact that this test poss low power,authors have to employ many data ts(Yang et al.,2007b)or most of the differences found are not significant(Yang et al.,2007a;N´u˜n ez et al.,2007).Although the employment of many data ts could em beneficial in order to improve the generalization of results,in some specific domains,that is,imbalanced classification(Owen,2007)or multi-instance classification(Murray et al.,2005),data ts are difficult tofind.
Procedures with more power than Nemenyi’s one can be found in specialized literature.We have bad on the necessity to apply more powerful procedures in empirical studies in which no new method is propod and the benefit consists of obtaining more statistical differences among the classifiers compared.Thus,in this paper we describe the procedures and we analyze their behavior by means of the analysis of multiple repetitions of experiments with randomly lected data ts.
On the other hand,we can e other works in which the p-value associated to a comparison between two classifiers is reported(Garc´ıa-Pedrajas and Fyfe,2007).Classical non-parametric tests, such as Wilcoxon and Friedman(Sheskin,2003),may be incorporated in most of the statistical packages(SPSS,SAS,R,etc.)and the computation of thefinal p-value is usually implemented. Howev
er,advanced procedures such as Holm(1979),Hochberg(1988),Hommel(1988)and the ones described in this paper are usually not incorporated in statistical packages.The computation of the correct p-value,or Adjusted P-Value(APV)(Westfall and Young,2004),in a comparison using any of the procedures is not very difficult and,in this paper,we show how to include it with an illustrative example.
The paper is t up as follows.Section2prents more powerful procedures for comparing all the classifiers among them in a n×n comparison of multiple classifiers and a ca study.In Section3 we describe the procedures for obtaining the APV by considering the post-hoc procedures explained by Demˇs ar and the ones explained in this paper.In Section4,we perform an experimental study of the behavior of the statistical procedures and we discuss the results obtained.Finally,Section5 concludes the paper.
2.Comparison of Multiple Classifiers:Performing All Pairwi Comparisons
In the paper Demˇs ar(2006),referring to carrying out comparisons of more than two classifiers,a t of uful guidelines were given for detecting significant differences among the results obtained and post-hoc procedures for identifying the differences.Friedman’s test is an omnibus test which can b
e ud to carry out the types of comparison.It allows to detect differences considering the global t of classifiers.Once Friedman’s test rejects the null hypothesis,we can proceed with a post-hoc test in order tofind the concrete pairwi comparisons which produce differences.Demˇs ar described the u of the Nemenyi test ud when all classifiers are compared with each other.Then,he focud on procedures that control the family-wi error when comparing with a control classifier,arguing that the objective of a study is to test whether a newly propod method is better than the existing
A N E XTENSION ON“S TATISTICAL C OMPARISONS OF C LASSIFIERS OVER M ULTIPLE D ATA S ETS”ones.For this reason,he described and studied in depth more powerful and sophisticated procedures derived from Bonferroni-Dunn such as Holm’s,Hochberg’s and Hommel’s methods.
Nevertheless,we think that performing all pairwi comparisons in an experimental analysis may be uful and interesting in different cas when proposing a new method.For example,it would be interesting to conduct a statistical analysis over multiple classifiers in review works in which no method is propod.In this ca,the repetition of comparisons choosing different control classifiers may lo the control of the family-wi error.
finance and economics
Our intention in this ction is to give a detailed description of more powerful and advanced procedures derived from the Nemenyi test and to show a ca study that us the procedures.
2.1Advanced Procedures for Performing All Pairwi Comparisons
A t of pairwi comparisons can be associated with a t or family of hypothes.Any of the post-hoc tests which can be applied to non-parametric tests(that is,tho derived from the Bonferroni correction or similar procedures)work over a family of hypothes.As Demˇs ar explained,the test statistics for comparing the i-th and j-th classifier is
z=(R i−R j) k(k+1)6N,
where R i is the average rank computed through the Friedman test for the i-th classifier,k is the number of classifiers to be compared and N is the number of data ts ud in the comparison.
The z value is ud tofind the corresponding probability(p-value)from the table of normal dis-tribution,which is then compared with an appropriate level of significanceα(Table A1in Sheskin, 2003).Two basic procedures are:
•Nemenyi(1963)procedure:it adjusts the value ofαin a single step by dividing the value of αby the nu
mber of comparisons performed,m=k(k−1)/2.This procedure is the simplest but it also has little power.
一对一辅导机构比较
•Holm(1979)procedure:it was also described in Demˇs ar(2006)but it was ud for compar-isons of multiple classifiers involving a control method.It adjusts the value ofαin a step down method.Let p1,...,p m be the ordered p-values(smallest to largest)and H1,...,H m be the corresponding hypothes.Holm’s procedure rejects H1to H(i−1)if i is the smallest in-teger such that p i>α/(m−i+1).Other alternatives were developed by Hochberg(1988), Hommel(1988)and Rom(1990).They are easy to perform,but they often have a similar power to Holm’s procedure(they have more power than Holm’s procedure,but the difference between them is not very notable)when considering all pairwi comparisons.
The hypothes being tested belonging to a family of all pairwi comparisons are logically interrelated so that not all combinations of true and fal hypothes are possible.As a simple example of such a situation suppo that we want to test the three hypothes of pairwi equality associated with the pairwi comparisons of three classifiers C i,i=1,2,3.It is easily en from the relations among the hypothes that if any one of them is fal,at least one other must be fal.For example,if C1is better/wor than C2,then it is not possible that C1has the same performance as C3
and C2has the same performance as C3.C3must be better/wor than C1or C2or the two classifiers at the same time.Thus,there cannot be one fal and two true hypothes among the three.即将到来
G ARC´I A AND H ERRERA
moroccoBad on this argument,Shaffer propod two procedures which make u of the logical relation among the family of hypothes for adjusting the value ofα(Shaffer,1986).
•Shaffer’s static procedure:following Holm’s step down method,at stage j,instead of rejecting
造价工程师报考资格H i if p i≤α/(m−i+1),reject H i if p i≤α/t i,where t i is the maximum number of hypothes六级什么时候出成绩
which can be true given that any(i−1)hypothes are fal.It is a static procedure,that is, t1,...,t m are fully determined for the given hypothes H1,...,H m,independent of the obrved p-values.The possible numbers of true hypothes,and thus the values of t i can be obtained from the recursive formula
k[
S(k)=
{ j2 +x:x∈S(k−j)},
j=1
where S(k)is the t of possible numbers of true hypothes with k classifiers being compared, k≥2,and S(0)=S(1)={0}.
•Shaffer’s dynamic procedure:it increas the power of thefirst by substitutingα/t i at stage i by the valueα/t∗i,where t∗i is the maximum number of hypothes that could be true,given that the previous hypothes are fal.It is a dynamic procedure since t∗i depends not only on the logical structure of the hypothes,but also on the hypothes already rejected at step
i.Obviously,this procedure has more power than thefirst one.In this paper,we have not
ud this cond procedure,given that it is included in an advanced procedure which we will describe in the following.
In Bergmann and Hommel(1988)was propod a procedure bad on the idea offinding all elementary hypothes which cannot be rejected.In order to formulate Bergmann-Hommel’s pro-cedure,we need the following definition.
Definition1An index t of hypothes I⊆{1,...,m}is called exhaustive if exactly all H j,j∈I, could be true.
In order to exemplify the previous definition,we will consider the following ca:We have three classifiers,and we will compare them in a n×n comparison.We will obtain three hypothes:•H1=C1es equal in behavior than C2.
•H2=C1es equal in behavior than C3.
•H3=C2es equal in behavior than C3.
and eight possible ts S i:
•S1:All H j are true.
•S2:H1and H2are true and H3is fal.
•S3:H1and H3are true and H2is fal.
A N E XTENSION ON“S TATISTICAL C OMPARISONS OF C LASSIFIERS OVER M ULTIPLE D ATA S ETS”
•S4:H2and H3are true and H1is fal.
•S5:H1is true and H2and H3are fal.
•S6:H2is true and H1and H3are fal.
•S7:H3is true and H1and H2are fal.
•S8:All H j are fal.
Sets S1,S5,S6,S7and S8can be possible,becau their hypothes can be true at the same time, so they are exhaustive ts.Set S2,basing on logically related hypothes principles,is not possible becau the performance of C1cannot be equal to C2and C3,whereas C2has different performance than C3.The same consideration can be done to S3and S4,which are not exhaustive ts.
Under this definition,it works as follows.
•Bergmann and Hommel(1988)procedure:Reject all H j with j/∈A,where the acceptance t
A=[{I:I exhaustive,min{P i:i∈I}>α/|I|}
is the index t of null hypothes which are retained.
ctvFor this procedure,one has to check for each subt I of{1,...,m}if I is exhaustive,which leads to intensive computation.Due to this fact,we will obtain a t,named E,which will contain all the possible exhaustive ts of hypothes for a certain comparison.A rapid algo-rithm which was described in Hommel and Bernhard(1994)allows a substantial reduction in computing time.Once the E t is obtained,the hypothes that do not belong to the A t are rejected.
Figure1shows a valid algorithm for obtaining all the exhaustive ts of hypothes,using as input a list of classifiers C.E is a t of families of hypothes;likewi,a family of hypothes is a t of hypothes.The most important step in the algorithm is the number
6.It performs a division of the classifiers into two subts,in which the last classifier k al-
ways is inrted in the cond subt and thefirst subt cannot be empty.In this way,we ensure that a subt yielded in a division is never empty and no repetitions are produced.For example,suppo a t C with three classifiers C={1,2,3}.All possible divisions without taking into account the previous assumptions are:D1={C1={},C2={1,2,3}},D2={C1= {1},C2={2,3}},D3={C1={2},C2={1,3}},D4={C1={1,2},C2={3}},D5={C1= {3},C2={1,2}},D6={C1={1,3},C2={2}},D7={C1={2,3},C2={1}},D8={C1= {1,2,3},
C2={}}.Divisions D1and D8,D2and D7,D3and D6,D4and D5are equivalent, respectively.Furthermore,divisions D1and D8are not interesting.Using the assumptions in step6of the algorithm,the possible divisions are:D1={C1={1},C2={2,3}},D2={C1= {2},C2={1,3}},D3={C1={1,2},C2={3}}.In this ca,all the divisions are interesting and no repetitions are yielded.The computational complexity of the algorithm for obtaining exhaustive ts is O(2n2).However,the computation requirements may be reduced by means of using storage capabilities.Relative exhaustive ts for k−i,1≤i≤(k−2)classifiers can be stored in memory and there is no necessity of invoking the obtainingExhaustive func-tion recursively.The computational complexity using storage capabilities is O(2n),so the algorithm still requires intensive computation.