Feature Selection for High-Dimensional Data:
A Fast Correlation-Based Filter Solution
Lei Yu (leiyu@asu.edu), Huan Liu (hliu@asu.edu)
Department of Computer Science & Engineering, Arizona State University, Tempe, AZ 85287-5406, USA
Abstract
Feature selection, as a preprocessing step to machine learning, has been effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving comprehensibility. However, the recent increase of dimensionality of data poses a severe challenge to many existing feature selection methods with respect to efficiency and effectiveness. In this work, we introduce a novel concept, predominant correlation, and propose a fast filter method which can identify relevant features as well as redundancy among relevant features without pairwise correlation analysis. The efficiency and effectiveness of our method are demonstrated through extensive comparisons with other methods using real-world data of high dimensionality.
1. Introduction
Feature selection is frequently used as a preprocessing step to machine learning. It is a process of choosing a subset of original features so that the feature space is optimally reduced according to a certain evaluation criterion. Feature selection has been a fertile field of research and development since the 1970's and has proven very effective in removing irrelevant and redundant features, increasing efficiency in learning tasks, improving learning performance like predictive accuracy, and enhancing comprehensibility of learned results (Blum & Langley, 1997; Dash & Liu, 1997; Kohavi & John, 1997). In recent years, data has become increasingly larger in both size (i.e., number of instances) and dimensionality (i.e., number of features) in many applications such as genome projects (Xing et al., 2001), text categorization (Yang & Pederson, 1997), image retrieval (Rui et al., 1999), and customer relationship management (Ng & Liu, 2000). This enormity may cause serious problems for many machine learning algorithms with respect to scalability and learning performance. For example, high dimensional data (i.e., data sets with hundreds or thousands of features) can contain a high degree of irrelevant and redundant information which may greatly degrade the performance of learning algorithms. Therefore, feature selection becomes very necessary for machine learning tasks when facing high dimensional data nowadays. However, this trend of enormity in both size and dimensionality also poses severe challenges to feature selection algorithms. Some of the recent research efforts in feature selection have been focused on these challenges, from handling a huge number of instances (Liu et al., 2002b) to dealing with high dimensional data (Das, 2001; Xing et al., 2001). This work is concerned with feature selection for high dimensional data. In the following, we first review models of feature selection and explain why a filter solution is suitable for high dimensional data, and then review some recent efforts in feature selection for high dimensional data.
Feature selection algorithms can broadly fall into the filter model or the wrapper model (Das, 2001; Kohavi & John, 1997). The filter model relies on general characteristics of the training data to select some features without involving any learning algorithm; therefore it does not inherit any bias of a learning algorithm. The wrapper model requires one predetermined learning algorithm in feature selection and uses its performance to evaluate and determine which features are selected. For each new subset of features, the wrapper model needs to learn a hypothesis (or a classifier). It tends to give superior performance as it finds features better suited to the predetermined learning algorithm, but it also tends to be more computationally expensive (Langley, 1994). When the number of features becomes very large, the filter model is usually chosen due to its computational efficiency.
To combine the advantages of both models, algorithms in a hybrid model have recently been proposed to deal with high dimensional data (Das, 2001; Ng, 1998; Xing et al., 2001). In these algorithms, first, a goodness measure of feature subsets based on data characteristics is used to choose the best subsets for a given cardinality, and then cross validation is exploited to decide a final best subset across different cardinalities. These algorithms mainly focus on combining filter and wrapper algorithms to achieve the best possible performance with a particular learning algorithm at the same time complexity of filter algorithms. In this work, we focus on the filter model and aim to develop a new feature selection algorithm which can effectively remove both irrelevant and redundant features and is less costly in computation than the currently available algorithms.
In section 2, we review current algorithms within the filter model and point out their problems in the context of high dimensionality. In section 3, we describe correlation measures which form the basis of our method in evaluating feature relevance and redundancy. In section 4, we first propose our method which selects good features for classification based on a novel concept, predominant correlation, and then present a fast algorithm with less than quadratic time complexity. In section 5, we evaluate the efficiency and effectiveness of this algorithm via extensive experiments on various real-world data sets, comparing with other representative feature selection algorithms, and discuss the implications of the findings. In section 6, we conclude our work with some possible extensions.
2. Related Work
Within the filter model, different feature selection algorithms can be further categorized into two groups, namely feature weighting algorithms and subset search algorithms, based on whether they evaluate the goodness of features individually or through feature subsets. Below, we discuss the advantages and shortcomings of representative algorithms in each group.

Feature weighting algorithms assign weights to features individually and rank them based on their relevance to the target concept. There are a number of different definitions of feature relevance in the machine learning literature (Blum & Langley, 1997; Kohavi & John, 1997). A feature is good and thus will be selected if its weight of relevance is greater than a threshold value. A well known algorithm that relies on relevance evaluation is Relief (Kira & Rendell, 1992). The key idea of Relief is to estimate the relevance of features according to how well their values distinguish between instances of the same and different classes that are near each other. Relief randomly samples a number (m) of instances from the training set and updates the relevance estimation of each feature based on the difference between the selected instance and the two nearest instances of the same and opposite classes. The time complexity of Relief for a data set with M instances and N features is O(mMN). With m being a constant, the time complexity becomes O(MN), which makes it very scalable to data sets with both a huge number of instances and a very high dimensionality. However, Relief does not help with removing redundant features. As long as features are deemed relevant to the class concept, they will all be selected even though many of them are highly correlated with each other (Kira & Rendell, 1992). Many other algorithms in this group have similar problems as Relief does. They can only capture the relevance of features to the target concept, but cannot discover redundancy among features. However, empirical evidence from the feature selection literature shows that, along with irrelevant features, redundant features also affect the speed and accuracy of learning algorithms and thus should be eliminated as well (Hall, 2000; Kohavi & John, 1997). Therefore, in the context of feature selection for high dimensional data where there may exist many redundant features, pure relevance-based feature weighting algorithms do not meet the need of feature selection very well.
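As a rough illustration of the Relief weighting scheme described above, the following is a simplified two-class Python sketch under our own naming and with our own simplifications (Manhattan nearest neighbors, no feature-range normalization); it is not the published Relief or ReliefF implementation.

import numpy as np

def relief_weights(X, y, m=30, rng=None):
    # Simplified two-class Relief: for m sampled instances, reward features
    # that differ on the nearest miss and penalize those that differ on the
    # nearest hit. X: (M, N) numeric array, y: (M,) class labels.
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    rng = np.random.default_rng(rng)
    M, N = X.shape
    w = np.zeros(N)
    for _ in range(m):
        i = rng.integers(M)
        dist = np.abs(X - X[i]).sum(axis=1)
        dist[i] = np.inf                      # exclude the instance itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dist, np.inf))   # nearest same-class instance
        miss = np.argmin(np.where(~same, dist, np.inf)) # nearest other-class instance
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / m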
Subset search algorithms search through candidate feature subsets guided by a certain evaluation measure (Liu & Motoda, 1998) which captures the goodness of each subset. An optimal (or near optimal) subset is selected when the search stops. Some existing evaluation measures that have been shown effective in removing both irrelevant and redundant features include the consistency measure (Dash et al., 2000) and the correlation measure (Hall, 1999; Hall, 2000). The consistency measure attempts to find a minimum number of features that separate classes as consistently as the full set of features can. An inconsistency is defined as two instances having the same feature values but different class labels. In Dash et al. (2000), different search strategies, namely exhaustive, heuristic, and random search, are combined with this evaluation measure to form different algorithms. The time complexity is exponential in terms of data dimensionality for exhaustive search and quadratic for heuristic search. The complexity can be linear in the number of iterations for random search, but experiments show that in order to find the best feature subset, the number of iterations required is mostly at least quadratic in the number of features (Dash et al., 2000). In Hall (2000), a correlation measure is applied to evaluate the goodness of feature subsets based on the hypothesis that a good feature subset is one that contains features highly correlated with the class, yet uncorrelated with each other. The underlying algorithm, named CFS, also exploits heuristic search. Therefore, with quadratic or higher time complexity in terms of dimensionality, existing subset search algorithms do not have strong scalability to deal with high dimensional data.
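For intuition, one common way to operationalize the inconsistency definition above (following the inconsistency-count formulation of Dash et al.) is sketched below in Python; the function name and argument layout are ours.

from collections import Counter, defaultdict

def inconsistency_count(X, y, subset):
    # For each group of instances sharing identical values on `subset`,
    # count the instances that do not belong to the group's majority class.
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[tuple(row[j] for j in subset)].append(label)
    return sum(len(labels) - max(Counter(labels).values())
               for labels in groups.values())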
To overcome the problems of algorithms in both groups and meet the demand for feature selection for high dimensional data, we develop a novel algorithm which can effectively identify both irrelevant and redundant features with less time complexity than subset search algorithms.
3. Correlation-Based Measures
In this section, we discuss how to evaluate the goodness of features for classification. In general, a feature is good if it is relevant to the class concept but is not redundant to any of the other relevant features. If we adopt the correlation between two variables as a goodness measure, the above definition becomes: a feature is good if it is highly correlated with the class but not highly correlated with any of the other features. In other words, if the correlation between a feature and the class is high enough to make it relevant to (or predictive of) the class, and the correlation between it and any other relevant feature does not reach a level at which it can be predicted by any of the other relevant features, it will be regarded as a good feature for the classification task. In this sense, the problem of feature selection boils down to finding a suitable measure of correlation between features and a sound procedure to select features based on this measure.
There exist broadly two approaches to measure the correlation between two random variables. One is based on classical linear correlation and the other is based on information theory. Under the first approach, the most well known measure is the linear correlation coefficient. For a pair of variables (X, Y), the linear correlation coefficient r is given by the formula

r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}    (1)

where \bar{x} is the mean of X and \bar{y} is the mean of Y. The value of r lies between -1 and 1, inclusive. If X and Y are completely correlated, r takes the value of 1 or -1; if X and Y are totally independent, r is zero. It is a symmetrical measure for two variables. Other measures in this category are basically variations of the above formula, such as least square regression error and maximal information compression index (Mitra et al., 2002). There are several benefits of choosing linear correlation as a feature goodness measure for classification. First, it helps remove features with near zero linear correlation to the class. Second, it helps to reduce redundancy among selected features. It is known that if data is linearly separable in the original representation, it is still linearly separable if all but one of a group of linearly dependent features are removed (Das, 1971). However, it is not safe to always assume linear correlation between features in the real world. Linear correlation measures may not be able to capture correlations that are not linear in nature. Another limitation is that the calculation requires all features to contain numerical values.
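Equation (1) translates directly into code; the short Python sketch below (function name ours) mirrors the formula, assuming two numeric sequences of equal length with non-zero variance.

import numpy as np

def linear_correlation(x, y):
    # Pearson correlation coefficient r from equation (1)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# e.g., linear_correlation([1, 2, 3, 4], [2, 4, 6, 8]) returns 1.0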
To overcome these shortcomings, in our solution we adopt the other approach and choose a correlation measure based on the information-theoretical concept of entropy, a measure of the uncertainty of a random variable. The entropy of a variable X is defined as

H(X) = -\sum_i P(x_i) \log_2 P(x_i),    (2)

and the entropy of X after observing values of another variable Y is defined as

H(X|Y) = -\sum_j P(y_j) \sum_i P(x_i|y_j) \log_2 P(x_i|y_j),    (3)

where P(x_i) is the prior probability for all values of X, and P(x_i|y_j) is the posterior probability of X given the values of Y. The amount by which the entropy of X decreases reflects additional information about X provided by Y and is called information gain (Quinlan, 1993), given by

IG(X|Y) = H(X) - H(X|Y).    (4)
According to this measure, a feature Y is regarded as more correlated to feature X than to feature Z if IG(X|Y) > IG(Z|Y). About the information gain measure, we have the following theorem.

Theorem. Information gain is symmetrical for two random variables X and Y.

Proof Sketch: To prove IG(X|Y) = IG(Y|X), we need to prove H(X) - H(X|Y) = H(Y) - H(Y|X). This can be easily derived from H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y).
Symmetry is a desired property for a measure of correlations between features. However, information gain is biased in favor of features with more values. Furthermore, the values have to be normalized to ensure they are comparable and have the same effect. Therefore, we choose symmetrical uncertainty (Press et al., 1988), defined as follows.

SU(X, Y) = 2 \left[ \frac{IG(X|Y)}{H(X) + H(Y)} \right]    (5)

It compensates for information gain's bias toward features with more values and normalizes its values to the range [0, 1], with the value 1 indicating that knowledge of the value of either one completely predicts the value of the other and the value 0 indicating that X and Y are independent. In addition, it still treats a pair of features symmetrically. The entropy-based measures require nominal features, but they can be applied to measure correlations between continuous features as well, if the values are discretized properly in advance (Fayyad & Irani, 1993; Liu et al., 2002a). Therefore, we use symmetrical uncertainty in this work.
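To make equations (2)-(5) concrete, the following small Python sketch estimates the measures from data, assuming nominal (discretized) values; the helper names are our own.

import numpy as np

def entropy(x):
    # H(X) in bits, equation (2), from empirical value frequencies
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(x, y):
    # H(X|Y), equation (3)
    x, y = np.asarray(x), np.asarray(y)
    return sum((y == v).mean() * entropy(x[y == v]) for v in np.unique(y))

def information_gain(x, y):
    # IG(X|Y) = H(X) - H(X|Y), equation (4)
    return entropy(x) - conditional_entropy(x, y)

def symmetrical_uncertainty(x, y):
    # SU(X, Y), equation (5); defined as 0 when both variables are constant
    denom = entropy(x) + entropy(y)
    return 2.0 * information_gain(x, y) / denom if denom > 0 else 0.0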
4. A Correlation-Based Filter Approach

4.1. Methodology
Using symmetrical uncertainty (SU) as the goodness measure, we are now ready to develop a procedure to select good features for classification based on correlation analysis of features (including the class). This involves two aspects: (1) how to decide whether a feature is relevant to the class or not; and (2) how to decide whether such a relevant feature is redundant or not when considering it with other relevant features.

The answer to the first question can be using a threshold SU value decided by the user, as is done by many other feature weighting algorithms (e.g., Relief). More specifically, suppose a data set S contains N features and a class C. Let SU_{i,c} denote the SU value that measures the correlation between a feature f_i and the class C (named c-correlation); then a subset S' of relevant features can be decided by a threshold SU value δ, such that ∀ f_i ∈ S', 1 ≤ i ≤ N, SU_{i,c} ≥ δ.

The answer to the second question is more complicated because it may involve analysis of pairwise correlations between all features (named f-correlation), which results in a time complexity of O(N^2) in the number of features N for most existing algorithms. To solve this problem, we propose our method below.

Since f-correlations are also captured by SU values, in order to decide whether a relevant feature is redundant or not, we need to find a reasonable way to decide the threshold level for f-correlations as well. In other words, we need to decide whether the level of correlation between two features in S' is high enough to cause redundancy so that one of them may be removed from S'. For a feature f_i in S', the value of SU_{i,c} quantifies the extent to which f_i is correlated to (or predictive of) the class C. If we examine the value of SU_{j,i} for ∀ f_j ∈ S' (j ≠ i), we also obtain quantified estimations of the extent to which f_i is correlated to (or predicted by) the rest of the relevant features in S'. Therefore, it is possible to identify features highly correlated with f_i in the same straightforward manner as we decide S', using a threshold SU value equal or similar to δ. We can do this for all features in S'. However, this method only sounds reasonable when we try to determine features highly correlated with one concept while not considering another concept. In the context of a set of relevant features S' already identified for the class concept, when we try to determine the highly correlated features for a given feature f_i within S', it is more reasonable to use the c-correlation level between f_i and the class concept, SU_{i,c}, as a reference. The reason lies in a common phenomenon: a feature that is correlated with one concept (e.g., the class) at a certain level may also be correlated with some other concepts (features) at the same or an even higher level. Therefore, even if the correlation between this feature and the class concept is larger than some threshold δ, thereby making this feature relevant to the class concept, this correlation is by no means predominant. To be more precise, we define the concept of predominant correlation as follows.
Definition 1 (Predominant correlation). The correlation between a feature f_i (f_i ∈ S) and the class C is predominant iff SU_{i,c} ≥ δ and, ∀ f_j ∈ S' (j ≠ i), there exists no f_j such that SU_{j,i} ≥ SU_{i,c}.

If there exists such an f_j for a feature f_i, we call it a redundant peer to f_i and use S_{P_i} to denote the set of all redundant peers for f_i. Given f_i ∈ S' and S_{P_i} (S_{P_i} ≠ ∅), we divide S_{P_i} into two parts, S_{P_i}^+ and S_{P_i}^-, where S_{P_i}^+ = {f_j | f_j ∈ S_{P_i}, SU_{j,c} > SU_{i,c}} and S_{P_i}^- = {f_j | f_j ∈ S_{P_i}, SU_{j,c} ≤ SU_{i,c}}.
Definition 2 (Predominant feature). A feature is predominant to the class iff its correlation to the class is predominant or can become predominant after removing its redundant peers.
According to the above definitions, a feature is good if it is predominant in predicting the class concept, and feature selection for classification is a process that identifies all predominant features for the class concept and removes the rest. We now propose three heuristics that together can effectively identify predominant features and remove redundant ones among all relevant features, without having to identify all the redundant peers for every feature in S', and thus avoid pairwise analysis of f-correlations between all relevant features.

Our assumption in developing the heuristics is that if two features are found to be redundant to each other and one of them needs to be removed, removing the one that is less relevant to the class concept keeps more information for predicting the class while reducing redundancy in the data.
Heuristic 1 (if S_{P_i}^+ = ∅). Treat f_i as a predominant feature, remove all features in S_{P_i}^-, and skip identifying redundant peers for them.

Heuristic 2 (if S_{P_i}^+ ≠ ∅). Process all features in S_{P_i}^+ before deciding whether or not to remove f_i. If none of them becomes predominant, follow Heuristic 1; otherwise only remove f_i and decide whether or not to remove features in S_{P_i}^- based on other features in S'.

Heuristic 3 (starting point). The feature with the largest SU_{i,c} value is always a predominant feature and can be a starting point to remove other features.
4.2. Algorithm and Analysis
Based on the methodology presented before, we develop an algorithm, named FCBF (Fast Correlation-Based Filter). As in Figure 1, given a data set with N features and a class C, the algorithm finds a set of predominant features S_best for the class concept.

input:  S(f_1, f_2, ..., f_N, C)  // a training data set
        δ                          // a predefined threshold
output: S_best                     // an optimal subset

 1 begin
 2   for i = 1 to N do begin
 3     calculate SU_{i,c} for f_i;
 4     if (SU_{i,c} ≥ δ)
 5       append f_i to S'_list;
 6   end;
 7   order S'_list in descending SU_{i,c} value;
 8   f_p = getFirstElement(S'_list);
 9   do begin
10     f_q = getNextElement(S'_list, f_p);
11     if (f_q <> NULL)
12       do begin
13         f'_q = f_q;
14         if (SU_{p,q} ≥ SU_{q,c})
15           remove f_q from S'_list;
16           f_q = getNextElement(S'_list, f'_q);
17         else f_q = getNextElement(S'_list, f_q);
18       end until (f_q == NULL);
19     f_p = getNextElement(S'_list, f_p);
20   end until (f_p == NULL);
21   S_best = S'_list;
22 end;

Figure 1. FCBF Algorithm

The algorithm consists of two major parts. In the first part (lines 2-7), it calculates the SU value for each feature, selects relevant features into S'_list based on the predefined threshold δ, and orders them in descending order according to their SU values. In the second part (lines 8-20), it further processes the ordered list S'_list to remove redundant features and keeps only predominant ones among all the selected relevant features. According to Heuristic 1, a feature f_p that has already been determined to be a predominant feature can always be used to filter out other features that are ranked lower than f_p and have f_p as one of their redundant peers. The iteration starts from the first element (Heuristic 3) in S'_list (line 8) and continues as follows. For all the remaining features (from the one right next to f_p to the last one in S'_list), if f_p happens to be a redundant peer to a feature f_q, f_q will be removed from S'_list (Heuristic 2). After one round of filtering features based on f_p, the algorithm takes the currently remaining feature right next to f_p as the new reference (line 19) to repeat the filtering process. The algorithm stops when there are no more features to be removed from S'_list.

The first part of the above algorithm has a linear time complexity in terms of the number of features N. As to the second part, in each iteration, using the predominant feature f_p identified in the previous round, FCBF can remove a large number of features that are redundant peers to f_p in the current iteration. The best case could be that all of the remaining features following f_p in the ranked list will be removed; the worst case could be that none of them is. On average, we can assume that half of the remaining features will be removed in each iteration. Therefore, the time complexity of the second part is O(N log N) in terms of N. Since the calculation of SU for a pair of features is linear in the number of instances M in a data set, the overall complexity of FCBF is O(MN log N).
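For concreteness, here is a minimal Python sketch of FCBF, assuming that all features and the class are already discretized into nominal codes; the function names (entropy, symmetrical_uncertainty, fcbf) and the array-based interface are ours, and implementation details of the paper's Weka version are omitted.

import numpy as np

def entropy(x):
    # empirical entropy in bits
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetrical_uncertainty(x, y):
    # SU(X, Y) = 2 * [H(X) - H(X|Y)] / [H(X) + H(Y)], as in equation (5)
    x, y = np.asarray(x), np.asarray(y)
    h_x, h_y = entropy(x), entropy(y)
    h_x_given_y = sum((y == v).mean() * entropy(x[y == v]) for v in np.unique(y))
    denom = h_x + h_y
    return 2.0 * (h_x - h_x_given_y) / denom if denom > 0 else 0.0

def fcbf(X, y, delta=0.0):
    # X: (M, N) array of discretized features, y: (M,) class labels
    # Part 1 (lines 2-7 of Figure 1): keep and rank features with SU_{i,c} >= delta
    n = X.shape[1]
    su_c = np.array([symmetrical_uncertainty(X[:, i], y) for i in range(n)])
    s_list = [i for i in np.argsort(-su_c) if su_c[i] >= delta]
    # Part 2 (lines 8-20): each surviving feature f_p removes lower-ranked
    # features f_q for which SU_{p,q} >= SU_{q,c} (redundant peers of f_p)
    pos = 0
    while pos < len(s_list):
        p = s_list[pos]
        k = pos + 1
        while k < len(s_list):
            q = s_list[k]
            if symmetrical_uncertainty(X[:, p], X[:, q]) >= su_c[q]:
                del s_list[k]      # f_q is a redundant peer of f_p
            else:
                k += 1
        pos += 1
    return s_list                   # indices of predominant features (S_best)

Each pass over the ranked list mirrors the heuristics: the top-ranked remaining feature is treated as predominant (Heuristics 1 and 3) and filters out its lower-ranked redundant peers (Heuristic 2).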
5. Empirical Study
The objective of this section is to evaluate our proposed algorithm in terms of speed, number of selected features, and learning accuracy on selected features.
5.1. Experiment Setup
In our experiments, we choose three representative feature selection algorithms for comparison with FCBF. One is a feature weighting algorithm, ReliefF (an extension to Relief) which searches for several nearest neighbors to be robust to noise and handles multiple classes (Kononenko, 1994); the other two are subset search algorithms which exploit sequential forward search and utilize the correlation measure or the consistency measure to guide the search, denoted as CorrSF and ConsSF respectively. CorrSF is a variation of the CFS algorithm mentioned in section 2. The reason why we prefer CorrSF to CFS is that both the experiments in Hall (1999) and our initial experiments show that CFS only produces slightly better results than CorrSF, but CorrSF based on sequential forward search runs faster than CFS based on best first search with 5 nodes expansion and therefore is more suitable for high dimensional data. In addition to feature selection algorithms, we also select two different learning algorithms, C4.5 (Quinlan, 1993) and NBC (Witten & Frank, 2000), to evaluate the accuracy on selected features for each feature selection algorithm.
Table 1. Summary of benchmark data sets.

Title            Features   Instances   Classes
Lung-cancer            57          32         3
Promoters              59         106         2
Splice                 62        3190         3
USCensus90             68        9338         3
CoIL2000               86        5822         2
Chemical              151         936         3
Musk2                 169        6598         2
Arrhythmia            280         452        16
Isolet                618        1560        26
Multi-features        650        2000        10
The experiments are conducted using Weka's implementation of all these algorithms, and FCBF is also implemented in the Weka environment (Witten & Frank, 2000). Altogether, 10 data sets are selected from the UCI Machine Learning Repository (Blake & Merz, 1998) and the UCI KDD Archive (Bay, 1999). A summary of the data sets is presented in Table 1.

For each data set, we run all four feature selection algorithms, FCBF, ReliefF, CorrSF, and ConsSF, respectively, and record the running time and the number of selected features for each algorithm. We then apply C4.5 and NBC to the original data set as well as to each newly obtained data set containing only the features selected by each algorithm, and record the overall accuracy obtained by 10-fold cross-validation.
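The paper's evaluation uses Weka's C4.5 and NBC; a rough modern analogue of the same protocol, with scikit-learn's DecisionTreeClassifier and GaussianNB as stand-ins rather than the original learners, might look like the sketch below (function name ours).

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

def evaluate_selection(X, y, selected):
    # 10-fold CV accuracy of two learners on the full and reduced feature sets
    results = {}
    for name, clf in [("tree", DecisionTreeClassifier()), ("nb", GaussianNB())]:
        results[name, "full"] = cross_val_score(clf, X, y, cv=10).mean()
        results[name, "selected"] = cross_val_score(clf, X[:, selected], y, cv=10).mean()
    return results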
5.2. Results and Discussions
Table 2 records the running time and the number of selected features for each feature selection algorithm. For ReliefF, the parameter k is set to 5 (neighbors) and m is set to 30 (instances) throughout the experiments. From Table 2, we can observe that for each algorithm the running times over different data sets are consistent with our previous time complexity analysis. From the averaged values in the last row of Table 2, it is clear that FCBF runs significantly faster (by a wide margin) than the other three algorithms, which verifies FCBF's superior computational efficiency. What is interesting is that ReliefF is unexpectedly slow even though its time complexity becomes O(MN) with a fixed sample size m. The reason is that searching for nearest neighbors involves distance calculations, which are more time consuming than the calculation of symmetrical uncertainty values.

From Table 2, it is also clear that FCBF achieves the highest level of dimensionality reduction by selecting the fewest features (with only one exception, on USCensus90), which is consistent with our theoretical analysis of FCBF's ability to identify redundant features.
Tables 3 and 4 show the learning accuracy of C4.5 and NBC, respectively, on the different feature sets. From the accuracy averaged over all data sets, we observe that, in general, (1) FCBF improves the accuracy of both C4.5 and NBC; and (2) of the other three algorithms, only CorrSF can enhance the accuracy of C4.5 to the same level as FCBF does. From the individual accuracy values, we also observe that for most of the data sets, FCBF can maintain or even increase the accuracy. These experimental results suggest that FCBF is practical for feature selection for classification of high dimensional data. It can efficiently achieve a high degree of dimensionality reduction and enhance classification accuracy with predominant features.
6. Conclusions
In this paper, we propose a novel concept of predominant correlation, introduce an efficient way of analyzing feature redundancy, and design a fast correlation-based filter approach. A new feature selection algorithm, FCBF, is implemented and evaluated through extensive experiments comparing it with related feature selection algorithms. The feature selection results are further verified by applying two different classification algorithms to data with and without feature selection. Our approach demonstrates its efficiency and effectiveness in dealing with high dimensional data for classification. Our further work will extend FCBF to data with even higher dimensionality (thousands of features). We will study redundant features and their role in classification in more detail, and combine FCBF with feature discretization algorithms to smoothly handle data of different feature types.