Journal of Machine Learning Research 9 (2008) 2677-2694. Submitted 10/07; Revised 4/08; Published 12/08

An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons
Salvador García (SALVAGL@DECSAI.UGR.ES)
Francisco Herrera (HERRERA@DECSAI.UGR.ES)
Department of Computer Science and Artificial Intelligence
University of Granada
Granada, 18071, Spain

Editor: John Shawe-Taylor
Abstract
In a recently published paper in JMLR, Demšar (2006) recommends a set of non-parametric statistical tests and procedures which can be safely used for comparing the performance of classifiers over multiple data sets. After studying the paper, we realize that it correctly introduces the basic procedures and some of the most advanced ones for comparing with a control method. However, it does not deal with some advanced topics in depth. Among these topics, we focus on more powerful proposals of statistical procedures for comparing n×n classifiers. Moreover, we illustrate an easy way of obtaining adjusted and comparable p-values in multiple comparison procedures.
Keywords: statistical methods, non-parametric tests, multiple comparisons tests, adjusted p-values, logically related hypotheses
1. Introduction
In the Machine Learning (ML) scientific community there is a need for rigorous and correct statistical analysis of published results, given that developing or modifying algorithms is a relatively easy task. The main difficulty related to this necessity is understanding the statistics well enough to know which techniques can or cannot be applied in each situation, that is, for each type of result obtained. In a recently published paper in JMLR, Demšar (2006) gives a group of useful guidelines for performing a correct analysis when comparing a set of classifiers over multiple data sets. Demšar recommends a set of non-parametric statistical techniques (Zar, 1999; Sheskin, 2003) for comparing classifiers under these circumstances, given that the sample of results obtained by them does not fulfill the required conditions and is not large enough for a parametric statistical analysis. He analyzed the behavior of the proposed statistics on classification tasks and verified that they are more suitable than parametric techniques.
Recent studies apply the guidelines given by Demšar in the analysis of the performance of classifiers (Esmeir and Markovitch, 2007; Marrocco et al., 2008). In them, a new proposal or methodology is offered and compared with other methods by means of pairwise comparisons. Another type of study carries out an empirical comparison or review of already proposed methods. In these cases, no new proposal is offered and a statistical comparison can be very useful in determining the differences among the methods. In the specialized literature, many papers provide reviews on a specific topic and also use statistical methodology to perform comparisons. For example, in a review of ensembles of decision trees, non-parametric tests are applied in the analysis of performance (Banfield et al., 2007). However, only the rankings computed by Friedman's method (Friedman, 1937) are reported, and the authors establish comparisons based on them without taking significance levels into account. Demšar focused his work on the analysis of new proposals, and he introduced the Nemenyi test for making all pairwise comparisons (Nemenyi, 1963). Nevertheless, the Nemenyi test is very conservative and may not find any differences in most experimental studies. In recent papers, authors have used the Nemenyi test in multiple comparisons. Because this test possesses low power, authors either have to employ many data sets (Yang et al., 2007b) or find that most of the differences are not significant (Yang et al., 2007a; Núñez et al., 2007). Although employing many data sets may seem beneficial for improving the generality of the results, in some specific domains, such as imbalanced classification (Owen, 2007) or multi-instance classification (Murray et al., 2005), data sets are difficult to find.
Procedures with more power than Nemenyi's can be found in the specialized literature. Our work is based on the need to apply more powerful procedures in empirical studies in which no new method is proposed and the benefit consists of obtaining more statistical differences among the classifiers compared. Thus, in this paper we describe these procedures and we analyze their behavior by means of multiple repetitions of experiments with randomly selected data sets.
On the other hand, we can see other works in which the p-value associated with a comparison between two classifiers is reported (García-Pedrajas and Fyfe, 2007). Classical non-parametric tests, such as Wilcoxon's and Friedman's (Sheskin, 2003), are incorporated in most statistical packages (SPSS, SAS, R, etc.), and the computation of the final p-value is usually implemented. However, advanced procedures such as those of Holm (1979), Hochberg (1988), Hommel (1988) and the ones described in this paper are usually not incorporated in statistical packages. The computation of the correct p-value, or Adjusted P-Value (APV) (Westfall and Young, 2004), in a comparison using any of these procedures is not very difficult and, in this paper, we show how to obtain it with an illustrative example.
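To give a flavor of such a computation before the detailed treatment, the following is a minimal Python sketch of Holm's adjusted p-values. This is our own illustration, and the raw p-values in the example are hypothetical:

```python
# Holm's step-down adjusted p-values (APVs): sort the raw p-values in
# ascending order and set APV_i = max over j <= i of (m - j + 1) * p_j,
# capped at 1. The input p-values below are made up for illustration.

def holm_apv(p_values):
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    apv = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):          # rank 0 = smallest p-value
        candidate = (m - rank) * p_values[idx]  # (m - i + 1) * p_i, 1-based
        running_max = max(running_max, candidate)
        apv[idx] = min(1.0, running_max)
    return apv

print(holm_apv([0.01, 0.04, 0.03]))
```

An APV can be compared directly with the nominal α, which is what makes adjusted p-values convenient to report.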
The paper is set up as follows. Section 2 presents more powerful procedures for comparing all the classifiers among themselves in an n×n comparison of multiple classifiers, together with a case study. In Section 3 we describe the procedures for obtaining the APV, considering both the post-hoc procedures explained by Demšar and the ones explained in this paper. In Section 4, we perform an experimental study of the behavior of the statistical procedures and we discuss the results obtained. Finally, Section 5 concludes the paper.
2. Comparison of Multiple Classifiers: Performing All Pairwise Comparisons
In Demšar (2006), referring to comparisons of more than two classifiers, a set of useful guidelines was given for detecting significant differences among the results obtained, together with post-hoc procedures for identifying these differences. Friedman's test is an omnibus test which can be used to carry out this type of comparison. It allows us to detect differences considering the global set of classifiers. Once Friedman's test rejects the null hypothesis, we can proceed with a post-hoc test in order to find the concrete pairwise comparisons which produce differences. Demšar described the use of the Nemenyi test for the case in which all classifiers are compared with each other. Then, he focused on procedures that control the family-wise error when comparing with a control classifier, arguing that the objective of a study is to test whether a newly proposed method is better than the existing ones. For this reason, he described and studied in depth more powerful and sophisticated procedures derived from the Bonferroni-Dunn test, such as Holm's, Hochberg's and Hommel's methods.
Nevertheless, we think that performing all pairwise comparisons in an experimental analysis may be useful and interesting in cases different from proposing a new method. For example, it would be interesting to conduct a statistical analysis over multiple classifiers in review works in which no new method is proposed. In this case, repeating the comparisons while choosing different control classifiers may lose control of the family-wise error.
Our intention in this section is to give a detailed description of more powerful and advanced procedures derived from the Nemenyi test and to show a case study that uses these procedures.
2.1 Advanced Procedures for Performing All Pairwise Comparisons
A set of pairwise comparisons can be associated with a set or family of hypotheses. Any of the post-hoc tests which can be applied to non-parametric tests (that is, those derived from the Bonferroni correction or similar procedures) works over a family of hypotheses. As Demšar explained, the test statistic for comparing the i-th and j-th classifiers is
z = \frac{R_i - R_j}{\sqrt{\frac{k(k+1)}{6N}}},
where R_i is the average rank, computed through the Friedman test, of the i-th classifier, k is the number of classifiers to be compared and N is the number of data sets used in the comparison.
The z value is used to find the corresponding probability (p-value) from the table of the normal distribution, which is then compared with an appropriate level of significance α (Table A1 in Sheskin, 2003). Two basic procedures are:
• Nemenyi (1963) procedure: it adjusts the value of α in a single step by dividing it by the number of comparisons performed, m = k(k−1)/2. This procedure is the simplest, but it also has little power.
• Holm (1979) procedure: it was also described in Demšar (2006), but there it was used for comparisons of multiple classifiers involving a control method. It adjusts the value of α in a step-down manner. Let p_1, ..., p_m be the ordered p-values (smallest to largest) and H_1, ..., H_m the corresponding hypotheses. Holm's procedure rejects H_1 to H_{i−1} if i is the smallest integer such that p_i > α/(m−i+1). Other alternatives were developed by Hochberg (1988), Hommel (1988) and Rom (1990). They are easy to perform, but when considering all pairwise comparisons they often have a power similar to Holm's procedure (they have more power than Holm's procedure, but the difference between them is not very notable).
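The statistic and these two corrections can be sketched in Python as follows. This is our own illustration: the average ranks, k and N are hypothetical values, and the normal tail probability is obtained through `math.erfc`:

```python
import math

def z_statistic(r_i, r_j, k, n_datasets):
    """z statistic comparing two classifiers from their Friedman average ranks."""
    return (r_i - r_j) / math.sqrt(k * (k + 1) / (6.0 * n_datasets))

def two_sided_p(z):
    """Two-sided p-value of z under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical example: k = 4 classifiers over N = 10 data sets.
k, n = 4, 10
m = k * (k - 1) // 2                  # number of pairwise comparisons: 6
z = z_statistic(3.2, 1.9, k, n)
p = two_sided_p(z)

alpha = 0.05
# Nemenyi: compare each p-value against alpha / m in a single step.
print("significant" if p <= alpha / m else "not significant")
# Holm instead compares the i-th smallest p-value against alpha / (m - i + 1).
```

Note how the single-step Nemenyi threshold α/m is the most demanding one; Holm's step-down thresholds relax as i grows, which is the source of its extra power.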
The hypotheses being tested, belonging to a family of all pairwise comparisons, are logically interrelated, so that not all combinations of true and false hypotheses are possible. As a simple example of such a situation, suppose that we want to test the three hypotheses of pairwise equality associated with the pairwise comparisons of three classifiers C_i, i = 1, 2, 3. It is easily seen from the relations among the hypotheses that if any one of them is false, at least one other must be false. For example, if C_1 is better/worse than C_2, then it is not possible that C_1 has the same performance as C_3 and that C_2 has the same performance as C_3: C_3 must be better/worse than C_1, or than C_2, or than both at the same time. Thus, there cannot be one false and two true hypotheses among the three.
Based on this argument, Shaffer proposed two procedures which make use of the logical relations among the family of hypotheses for adjusting the value of α (Shaffer, 1986).
• Shaffer's static procedure: following Holm's step-down method, at stage i, instead of rejecting H_i if p_i ≤ α/(m−i+1), reject H_i if p_i ≤ α/t_i, where t_i is the maximum number of hypotheses which can be true given that any (i−1) hypotheses are false. It is a static procedure; that is, t_1, ..., t_m are fully determined for the given hypotheses H_1, ..., H_m, independently of the observed p-values. The possible numbers of true hypotheses, and thus the values of t_i, can be obtained from the recursive formula
S(k) = \bigcup_{j=1}^{k} \left\{ \binom{j}{2} + x : x \in S(k-j) \right\},

where S(k) is the set of possible numbers of true hypotheses when k classifiers are being compared, k ≥ 2, and S(0) = S(1) = {0}.
• Shaffer's dynamic procedure: it increases the power of the first procedure by substituting α/t_i at stage i with the value α/t*_i, where t*_i is the maximum number of hypotheses that could be true given that the previous hypotheses are false. It is a dynamic procedure, since t*_i depends not only on the logical structure of the hypotheses, but also on the hypotheses already rejected at step i. Obviously, this procedure has more power than the first one. In this paper, we have not used this second procedure, given that it is subsumed by a more advanced procedure which we describe in the following.
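The recursion for S(k) can be implemented directly. The following Python sketch is ours, not from the original sources; for three classifiers it yields S(3) = {0, 1, 3}, in agreement with the earlier observation that one false and two true hypotheses are impossible:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def possible_true_counts(k):
    """S(k): possible numbers of true pairwise-equality hypotheses among k classifiers."""
    if k <= 1:
        return frozenset({0})
    result = set()
    for j in range(1, k + 1):
        # A group of j classifiers with equal performance contributes C(j, 2)
        # true hypotheses; the remaining k - j classifiers contribute S(k - j).
        for x in possible_true_counts(k - j):
            result.add(comb(j, 2) + x)
    return frozenset(result)

print(sorted(possible_true_counts(3)))  # → [0, 1, 3]
print(sorted(possible_true_counts(4)))
```

The value t_i used at stage i is then the largest element of S(k) that is compatible with having (i−1) false hypotheses.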
In Bergmann and Hommel (1988), a procedure was proposed based on the idea of finding all elementary hypotheses which cannot be rejected. In order to formulate Bergmann-Hommel's procedure, we need the following definition.
Definition 1 An index set of hypotheses I ⊆ {1, ..., m} is called exhaustive if exactly all H_j, j ∈ I, could be true.
In order to exemplify the previous definition, we consider the following case: we have three classifiers and we compare them in an n×n comparison. We obtain three hypotheses:

• H_1: C_1 behaves equally to C_2.
• H_2: C_1 behaves equally to C_3.
• H_3: C_2 behaves equally to C_3.
and eight possible sets S_i:
• S_1: all H_j are true.
• S_2: H_1 and H_2 are true and H_3 is false.
• S_3: H_1 and H_3 are true and H_2 is false.
• S_4: H_2 and H_3 are true and H_1 is false.
• S_5: H_1 is true and H_2 and H_3 are false.
• S_6: H_2 is true and H_1 and H_3 are false.
• S_7: H_3 is true and H_1 and H_2 are false.
• S_8: all H_j are false.
Sets S_1, S_5, S_6, S_7 and S_8 are possible, because their hypotheses can hold simultaneously, so they are exhaustive sets. Set S_2, by the principle of logically related hypotheses, is not possible, because the performance of C_1 cannot be equal to that of both C_2 and C_3 while C_2 performs differently from C_3. The same consideration applies to S_3 and S_4, which are therefore not exhaustive sets.
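This enumeration can be cross-checked by brute force. The sketch below is ours: every grouping of the classifiers into equal-performance groups induces exactly one set of true pairwise-equality hypotheses, so collecting the induced sets over all groupings recovers S_1, S_5, S_6, S_7 and S_8:

```python
from itertools import combinations

def partitions(items):
    """Yield all partitions of a list into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        for i in range(len(smaller)):
            yield smaller[:i] + [smaller[i] + [first]] + smaller[i + 1:]
        yield [[first]] + smaller

def exhaustive_sets(classifiers):
    """All exhaustive index sets of pairwise-equality hypotheses."""
    pairs = list(combinations(sorted(classifiers), 2))
    result = set()
    for part in partitions(list(classifiers)):
        groups = [set(g) for g in part]
        # Hypothesis (a, b) is true iff a and b fall in the same group.
        true_set = frozenset(p for p in pairs
                             if any(p[0] in g and p[1] in g for g in groups))
        result.add(true_set)
    return result

for s in sorted(exhaustive_sets([1, 2, 3]), key=len):
    print(sorted(s))
```

For three classifiers this prints five sets: the empty set, the three singletons, and the full family, matching the discussion above.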
Under this definition, Bergmann-Hommel's procedure works as follows.
• Bergmann and Hommel (1988) procedure: reject all H_j with j ∉ A, where the acceptance set

A = \bigcup \left\{ I : I \text{ exhaustive}, \min\{p_i : i \in I\} > \alpha/|I| \right\}

is the index set of null hypotheses which are retained.
For this procedure, one has to check for each subset I of {1, ..., m} whether I is exhaustive, which leads to intensive computation. For this reason, we will obtain a set, named E, which will contain all the possible exhaustive sets of hypotheses for a certain comparison. A fast algorithm described in Hommel and Bernhard (1994) allows a substantial reduction in computing time. Once the set E is obtained, the hypotheses that do not belong to the set A are rejected.
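A minimal sketch of the rejection step, assuming the family E of exhaustive sets has already been computed; the hypothesis indices and p-values in the example are hypothetical:

```python
def bergmann_hommel_reject(p_values, exhaustive_families, alpha=0.05):
    """Return the indices of rejected hypotheses.

    p_values: dict mapping hypothesis index -> raw p-value.
    exhaustive_families: iterable of sets of hypothesis indices (the family E).
    """
    acceptance = set()
    for I in exhaustive_families:
        if I and min(p_values[i] for i in I) > alpha / len(I):
            acceptance |= I          # A is the union of all such exhaustive sets
    return set(p_values) - acceptance

# Hypothetical example with three hypotheses (indices 1, 2, 3):
E = [{1, 2, 3}, {1}, {2}, {3}, set()]
p = {1: 0.001, 2: 0.20, 3: 0.048}
print(sorted(bergmann_hommel_reject(p, E)))  # → [1, 3]
```

Here only the singleton {2} satisfies min{p_i} > α/|I|, so H_2 is retained and H_1 and H_3 are rejected.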
Figure 1 shows a valid algorithm for obtaining all the exhaustive sets of hypotheses, using as input a list of classifiers C. E is a set of families of hypotheses; likewise, a family of hypotheses is a set of hypotheses. The most important step in the algorithm is number 6. It performs a division of the classifiers into two subsets, in which the last classifier k is always inserted into the second subset and the first subset cannot be empty. In this way, we ensure that a subset yielded by a division is never empty and no repetitions are produced. For example, suppose a set C with three classifiers, C = {1, 2, 3}. All possible divisions, without taking into account the previous assumptions, are: D1 = {C1 = {}, C2 = {1,2,3}}, D2 = {C1 = {1}, C2 = {2,3}}, D3 = {C1 = {2}, C2 = {1,3}}, D4 = {C1 = {1,2}, C2 = {3}}, D5 = {C1 = {3}, C2 = {1,2}}, D6 = {C1 = {1,3}, C2 = {2}}, D7 = {C1 = {2,3}, C2 = {1}}, D8 = {C1 = {1,2,3}, C2 = {}}. Divisions D1 and D8, D2 and D7, D3 and D6, and D4 and D5 are equivalent, respectively. Furthermore, divisions D1 and D8 are not interesting. Using the assumptions in step 6 of the algorithm, the possible divisions are: D1 = {C1 = {1}, C2 = {2,3}}, D2 = {C1 = {2}, C2 = {1,3}}, D3 = {C1 = {1,2}, C2 = {3}}. In this case, all the divisions are interesting and no repetitions are yielded. The computational complexity of the algorithm for obtaining the exhaustive sets is O(2^{n^2}). However, the computation requirements may be reduced by means of storage capabilities: the exhaustive sets for k−i classifiers, 1 ≤ i ≤ (k−2), can be stored in memory, removing the need to invoke the obtainingExhaustive function recursively. The computational complexity using storage capabilities is O(2^n), so the algorithm still requires intensive computation.
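Since Figure 1 itself is not reproduced here, the following Python sketch is our reconstruction of the division scheme just described: the candidate first subsets range over the non-empty subsets of all classifiers except the last, each side contributes its own exhaustive families, and the family in which all classifiers are equal is added at each level:

```python
from itertools import combinations

def obtaining_exhaustive(classifiers):
    """All exhaustive families of pairwise-equality hypotheses over `classifiers`.

    A hypothesis is represented as an ordered pair of classifier labels.
    """
    cs = sorted(classifiers)
    if len(cs) <= 1:
        return {frozenset()}                      # no hypotheses at all
    families = {frozenset(combinations(cs, 2))}   # all classifiers equal
    head = cs[:-1]
    # Step 6: the first subset is any non-empty subset of the classifiers
    # excluding the last one, which always stays in the second subset.
    for r in range(1, len(head) + 1):
        for c1 in combinations(head, r):
            c2 = [c for c in cs if c not in c1]
            for e1 in obtaining_exhaustive(c1):
                for e2 in obtaining_exhaustive(c2):
                    families.add(e1 | e2)         # no hypothesis links C1 to C2
    return families

print(len(obtaining_exhaustive([1, 2, 3])))     # → 5
print(len(obtaining_exhaustive([1, 2, 3, 4])))  # → 15
```

Caching the families already computed for smaller subsets, as the storage-based variant above suggests, avoids the repeated recursive calls.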