Unsupervised Feature Selection for Multi-Cluster Data
Deng Cai, Chiyuan Zhang, Xiaofei He
State Key Lab of CAD&CG, College of Computer Science
Zhejiang University, China
{dengcai,xiaofeihe}@cad.zju.edu
ABSTRACT
In many data analysis tasks, one is often confronted with very high dimensional data. Feature selection techniques are designed to find the relevant feature subset of the original features which can facilitate clustering, classification and retrieval. In this paper, we consider the feature selection problem in the unsupervised learning scenario, which is particularly difficult due to the absence of class labels that would guide the search for relevant information. The feature selection problem is essentially a combinatorial optimization problem which is computationally expensive. Traditional unsupervised feature selection methods address this issue by selecting the top ranked features based on certain scores computed independently for each feature. These approaches neglect the possible correlation between different features and thus cannot produce an optimal feature subset. Inspired by the recent developments on manifold learning and L1-regularized models for subset selection, we propose in this paper a new approach, called Multi-Cluster Feature Selection (MCFS), for unsupervised feature selection. Specifically, we select those features such that the multi-cluster structure of the data can be best preserved. The corresponding optimization problem can be efficiently solved since it only involves a sparse eigen-problem and an L1-regularized least squares problem. Extensive experimental results over various real-life data sets have demonstrated the superiority of the proposed algorithm.
Categories and Subject Descriptors
I.5.2 [Pattern Recognition]: Design Methodology—Feature evaluation and selection
General Terms
Algorithms, Theory
Keywords
Feature selection, Unsupervised, Clustering
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
KDD'10, July 25–28, 2010, Washington, DC, USA.
Copyright 2010 ACM 978-1-4503-0055-1/$10.00.

1. INTRODUCTION
In many applications in computer vision, pattern recognition and data mining, one is often confronted with very high dimensional data. High dimensionality significantly increases the time and space requirements for processing the data. Moreover, various data mining and machine learning tasks, such as classification and clustering, that are analytically or computationally manageable in low dimensional spaces may become completely intractable in spaces of several hundred or thousand dimensions [12]. To overcome this problem, feature selection techniques [3, 4, 17, 21, 29, 30] are designed to reduce the dimensionality by finding a relevant feature subset. Once a small number of relevant features are selected, conventional data analysis techniques can then be applied.
Based on whether the label information is available, feature selection methods can be classified into supervised and unsupervised methods. Supervised feature selection methods usually evaluate the importance of features by the correlation between features and class labels. Typical supervised feature selection methods include Pearson correlation coefficients [23], Fisher score [12], and Information gain [11]. However, in practice, there is usually no shortage of unlabeled data, but labels are expensive. Hence, it is of great significance to develop unsupervised feature selection algorithms which can make use of all the data points. In this paper, we consider the problem of selecting features in unsupervised learning scenarios, which is a much harder problem due to the absence of class labels that would guide the search for relevant information.
Feature selection aims at selecting the most relevant feature subset based on certain evaluation criteria. This problem is essentially a combinatorial optimization problem which is computationally expensive. Traditional feature selection methods address this issue by selecting the top ranked features based on some scores computed independently for each feature. These scores are usually defined to reflect the power of each feature in differentiating different classes/clusters. This approach may work well on binary class/cluster problems. However, it is very likely to fail in multi-class/multi-cluster cases. Fig. 1 shows an intuitive example. There are three Gaussians in a three dimensional space. Without the label information, some popular unsupervised feature selection methods (e.g., Maximum Variance and LaplacianScore [17]) rank the features as a > b > c. If one is asked to select two features, these methods will select features a and b, which is obviously sub-optimal.
Figure 1: A failed example for binary clusters/classes feature selection methods. Panels (a) plane a⊗b, (b) plane a⊗c, and (c) plane b⊗c show the projections of the data on the plane spanned by each pair of features, respectively. Without the label information, both Maximum Variance and LaplacianScore [17] rank the features as a > b > c. If one is asked to select two features, both methods will select features a and b, which is obviously sub-optimal.
When dealing with multi class/cluster data, different features have different powers on differentiating different classes/clusters (e.g., cluster 1 vs. cluster 2 and cluster 1 vs. cluster 3). There are some studies on supervised feature selection [2] trying to solve this issue. However, without label information, it is unclear how to apply similar ideas to unsupervised feature selection methods.
Inspired by the recent developments on spectral analysis of the data (manifold learning) [1, 22] and L1-regularized models for subset selection [14, 16], we propose in this paper a new approach, called Multi-Cluster Feature Selection (MCFS), for unsupervised feature selection. Specifically, we select those features such that the multi-cluster structure of the data can be well preserved. By using spectral analysis techniques, MCFS suggests a principled way to measure the correlations between different features without label information. Thus, MCFS can well handle data with a multiple cluster structure. The corresponding optimization problem only involves a sparse eigen-problem and an L1-regularized least squares problem, and thus can be solved efficiently. It is important to note that our method essentially follows our previous work on spectral regression [5] and sparse subspace learning [6, 7].
The rest of the paper is organized as follows: in Section 2, we provide a brief review of the related work. Our multi-cluster feature selection algorithm is introduced in Section 3. The experimental results are presented in Section 4. Finally, we provide the concluding remarks in Section 5.
2. RELATED WORK
Feature selection methods can be classified into "wrapper" methods and "filter" methods [19, 21]. The wrapper model techniques evaluate the features using the mining algorithm that will ultimately be employed. Thus, they "wrap" the selection process around the mining algorithm. Algorithms based on the filter model examine intrinsic properties of the data to evaluate the features prior to the mining tasks.
For unsupervised "wrapper" methods, clustering is a commonly used mining algorithm [10, 13, 20, 24]. These algorithms consider feature selection and clustering simultaneously and search for features better suited to clustering, aiming to improve clustering performance. However, the "wrapper" methods are usually computationally expensive [19] and may not be applicable to large scale data mining problems. In this paper, we are particularly interested in the filter methods, which are much more efficient.
Most of the existing filter methods are supervised. Maximum variance might be the simplest yet most effective unsupervised evaluation criterion for selecting features. This criterion essentially projects the data points along the dimensions of maximum variance. Note that the Principal Component Analysis (PCA) algorithm shares the same principle of maximizing variance, but it involves feature transformation and obtains a set of transformed features rather than a subset of the original features.
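For concreteness, a minimal sketch of such a maximum-variance filter is given below (our own code, not from the paper; the matrix X is assumed to be arranged with N samples as rows and M features as columns):

```python
import numpy as np

def max_variance_selection(X, d):
    """Rank features by variance and keep the top d (hypothetical helper).

    X : (N, M) array with N samples and M features.
    Returns the indices of the d features with the largest variance.
    """
    variances = np.var(X, axis=0)            # one variance per feature
    return np.argsort(variances)[::-1][:d]   # indices of the top-d variances
```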
Although the maximum variance criterion finds features that are useful for representing the data, there is no reason to assume that these features must be useful for discriminating between data in different classes. Recently, the LaplacianScore algorithm [17] and its extensions [30] have been proposed to select those features which can best reflect the underlying manifold structure. LaplacianScore uses a nearest neighbor graph to model the local geometric structure of the data and selects those features which are smoothest on the graph. It has been proven [17] that with label information LaplacianScore becomes the Fisher criterion score. The latter is a supervised feature selection method (filter method) which seeks features that are efficient for discrimination [12]. The Fisher criterion score assigns the highest score to the feature on which the data points of different classes are far from each other while requiring data points of the same class to be close to each other.
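The paper describes LaplacianScore only informally; the sketch below assumes the standard formulation from [17], in which a feature's score is its graph smoothness, f̃ᵀLf̃, normalized by its degree-weighted variance, f̃ᵀDf̃ (smaller scores are better). The function name and the dense-matrix implementation are our own:

```python
import numpy as np

def laplacian_score(X, W):
    """Laplacian Score of each feature (smaller is better), after [17].

    X : (N, M) data matrix (samples as rows); W : (N, N) symmetric affinity
    matrix built as in Section 3.1. Dense formulation, for illustration only.
    """
    N, M = X.shape
    D = np.diag(W.sum(axis=1))        # degree matrix
    L = D - W                         # graph Laplacian
    ones = np.ones(N)
    scores = np.empty(M)
    for r in range(M):
        f = X[:, r]
        # subtract the degree-weighted mean so the score is shift-invariant
        f_tilde = f - (f @ D @ ones) / (ones @ D @ ones) * ones
        # small epsilon guards against constant features
        scores[r] = (f_tilde @ L @ f_tilde) / (f_tilde @ D @ f_tilde + 1e-12)
    return scores
```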
Wolf et al. proposed a feature selection algorithm called Q-α [29]. The algorithm optimizes over a least-squares criterion function which measures the clusterability of the input data points projected onto the selected coordinates. The optimal coordinates are those for which the cluster coherence, measured by the spectral gap of the corresponding affinity matrix, is maximized [29]. A remarkable property of the algorithm is that it always yields sparse solutions.
3. MULTI-CLUSTER FEATURE SELECTION
The generic problem of unsupervised feature selection is the following. Given a set of points X = [x_1, x_2, ..., x_N], x_i ∈ R^M, find a feature subset of size d which contains the most informative features. In other words, the points {x'_1, x'_2, ..., x'_N} represented in the d-dimensional space R^d can well preserve the geometric structure of the data as represented in the original M-dimensional space.
Since naturally occurring data usually have a multiple clusters structure, a good feature selection algorithm should consider the following two aspects:

• The selected features can best preserve the cluster structure of the data. Previous studies on unsupervised feature selection [13, 20, 24] usually use Gaussian shaped clusters. However, recent studies have shown that human generated data are probably sampled from a submanifold of the ambient Euclidean space [1, 25, 28]. The intrinsic manifold structure should be considered while measuring the goodness of the clusters [22].

• The selected features can "cover" all the possible clusters in the data. Since different features have different powers on differentiating different clusters, it is certainly undesirable that all the selected features can well differentiate cluster 1 and cluster 2 but fail to differentiate cluster 1 and cluster 3.
In the remaining part of this section, we will introduce our Multi-Cluster Feature Selection (MCFS) algorithm, which considers the above two aspects. We begin with a discussion on spectral embedding for cluster analysis with arbitrary cluster shapes.
3.1 Spectral Embedding for Cluster Analysis

To detect the cluster structure (with arbitrary shapes) of the data, spectral clustering techniques [8, 22, 26] have received significant interest recently. Spectral clustering usually clusters the data points using the top eigenvectors of the graph Laplacian [9], which is defined on the affinity matrix of the data points. From the graph partitioning perspective, spectral clustering tries to find the best cut of the graph so that a predefined criterion function can be optimized. Many criterion functions, such as ratio cut [8], average association [26], and normalized cut [26], have been proposed along with the corresponding eigen-problems for finding their optimal solutions. Spectral clustering has a close connection with the studies on manifold learning [1, 25, 28], which consider the case when the data are drawn from sampling a probability distribution that has support on or near a submanifold of the ambient space. In order to detect the underlying manifold structure, many manifold learning algorithms have been proposed [1, 25, 28]. These algorithms construct a nearest neighbor graph to model the local geometric structure and perform spectral analysis on the graph weight matrix. This way, the manifold learning algorithms can "unfold" the data manifold and provide a "flat" embedding for the data points. Spectral clustering can be thought of as a two-step approach [1]. The first step is "unfolding" the data manifold using manifold learning algorithms and the second step is performing traditional clustering (typically k-means) on the "flat" embedding of the data points [22].
Consider a graph with N vertices where each vertex corresponds to a data point. For each data point x_i, we find its p nearest neighbors and put an edge between x_i and its neighbors. There are many choices to define the weight matrix W on the graph. Three of the most commonly used are as follows:
1. 0-1 weighting. W_ij = 1 if and only if nodes i and j are connected by an edge. This is the simplest weighting method and is very easy to compute.
2. Heat kernel weighting. If nodes i and j are connected, put

   W_ij = e^(−||x_i − x_j||^2 / σ)

The heat kernel has an intrinsic connection to the Laplace-Beltrami operator on differentiable functions on a manifold [1].
3. Dot-product weighting. If nodes i and j are connected, put

   W_ij = x_i^T x_j

Note that if x is normalized to have unit norm, the dot product of two vectors is equivalent to the cosine similarity of the two vectors.
If the heat kernel or dot-product weighting is used, some researchers [22] use a complete graph (i.e., put an edge between any two points) instead of the p-nearest neighbors graph.
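As an illustration (our own naming and code, with the data arranged as an N × M matrix of samples rather than the paper's M × N convention), the three weighting schemes on a p-nearest-neighbor graph could be sketched as follows:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def build_affinity(X, p=5, weighting="binary", sigma=1.0):
    """Symmetric p-nearest-neighbor affinity matrix W (N x N).

    weighting: "binary" (0-1), "heat" (heat kernel), or "dot" (dot-product).
    """
    # adjacency of the p-NN graph (1 where an edge exists, 0 elsewhere)
    A = kneighbors_graph(X, n_neighbors=p, mode="connectivity").toarray()
    A = np.maximum(A, A.T)                      # symmetrize the graph

    if weighting == "binary":
        W = A
    elif weighting == "heat":
        sq_dists = np.square(X[:, None, :] - X[None, :, :]).sum(axis=2)
        W = A * np.exp(-sq_dists / sigma)       # W_ij = exp(-||x_i - x_j||^2 / sigma)
    elif weighting == "dot":
        W = A * (X @ X.T)                       # W_ij = x_i^T x_j
    else:
        raise ValueError("unknown weighting scheme")
    return W
```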
Define a diagonal matrix D whose entries are the column (or row, since W is symmetric) sums of W, D_ii = Σ_j W_ij. We can then compute the graph Laplacian L = D − W [9]. The "flat" embedding for the data points which "unfolds" the data manifold can be found by solving the following generalized eigen-problem [1]:

   Ly = λDy    (1)

Let Y = [y_1, ..., y_K], where the y_k's are the eigenvectors of the above generalized eigen-problem with respect to the smallest eigenvalues. Each row of Y is the "flat" embedding of a data point. K is the intrinsic dimensionality of the data and each y_k reflects the data distribution along the corresponding dimension (topic, concept, etc.) [1]. When one tries to perform cluster analysis of the data, each y_k can reflect the data distribution on the corresponding cluster. Thus, if the cluster number of the data is known, K is usually set to be equal to the number of clusters [22].
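A minimal sketch of this step is given below, using a dense generalized eigen-solver for clarity (for large N one would instead use a sparse Lanczos-type solver, as noted in Section 3.4); the function name is our own:

```python
import numpy as np
from scipy.linalg import eigh

def spectral_embedding(W, K):
    """Return Y = [y_1, ..., y_K]: eigenvectors of L y = lambda D y (Eq. (1))
    with the K smallest eigenvalues; one row of Y per data point."""
    D = np.diag(W.sum(axis=1))   # degree matrix
    L = D - W                    # graph Laplacian
    # generalized symmetric eigen-problem; eigenvalues are returned in
    # ascending order, so the first K columns give the "flat" embedding
    _, eigvecs = eigh(L, D)
    return eigvecs[:, :K]
```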
3.2 Learning Sparse Coefficient Vectors
After we obtain the "flat" embedding Y of the data points, we can measure the importance of each feature along each intrinsic dimension (each column of Y), correspondingly, the contribution of each feature for differentiating each cluster.
Given y_k, a column of Y, we can find a relevant subset of features by minimizing the fitting error as follows:

   min_{a_k} ||y_k − X^T a_k||^2 + β|a_k|    (2)

where a_k is an M-dimensional vector and |a_k| = Σ_{j=1}^{M} |a_{k,j}| denotes the L1-norm of a_k. a_k essentially contains the combination coefficients for the different features in approximating y_k. Due to the nature of the L1-norm penalty, some coefficients will be shrunk to exactly zero if β is large enough. In this case, we can select a subset containing the most relevant features (corresponding to the non-zero coefficients in a_k) with respect to y_k. Eq. (2) is essentially a regression problem. In statistics, this L1-regularized regression problem is called LASSO [16].
The regression problem in Eq. (2) has the following equivalent formulation:

   min_{a_k} ||y_k − X^T a_k||^2,  s.t. |a_k| ≤ γ    (3)
The Least Angle Regression (LARs) algorithm [14] can be used to solve the optimization problem in Eq. (3). Instead of setting the parameter γ, LARs provides another choice to control the sparseness of a_k by specifying the cardinality (the number of non-zero entries) of a_k, which is particularly convenient for feature selection.
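As a sketch of this step (our own code), scikit-learn's Lars estimator exposes the cardinality through its n_nonzero_coefs parameter; whether this exactly matches the authors' LARs implementation is an assumption on our part:

```python
import numpy as np
from sklearn.linear_model import Lars

def sparse_coefficients(X, Y, d):
    """One LARs regression per embedding dimension (Eq. (3)).

    X : (N, M) samples-by-features matrix; Y : (N, K) spectral embedding.
    Returns a (K, M) array whose k-th row is a_k with at most d non-zeros.
    """
    K, M = Y.shape[1], X.shape[1]
    A = np.zeros((K, M))
    for k in range(K):
        # n_nonzero_coefs plays the role of the cardinality constraint card(a_k) <= d
        A[k] = Lars(n_nonzero_coefs=d).fit(X, Y[:, k]).coef_
    return A
```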
It is very possible that some features are correlated, and the combination of several "weak" features (features that are not very informative in differentiating clusters when evaluated independently) can better differentiate different clusters. Several supervised feature selection algorithms [2] have been designed to address this issue. Thus, the advantage of using an L1-regularized regression model to find the subset of features, instead of evaluating the contribution of each feature independently, is clear.
3.3 Feature Selection on Sparse Coefficient Vectors
We consider selecting d features from the M feature candidates. For a data set containing K clusters, we can use the method discussed in the previous subsections to compute K sparse coefficient vectors {a_k}_{k=1}^{K}, a_k ∈ R^M. The cardinality of each a_k is d and each entry in a_k corresponds to a feature. If we select all the features that have at least one non-zero coefficient in the K vectors {a_k}, it is very possible that we will obtain more than d features. In reality, we can use the following simple yet effective method for selecting exactly d features from the K sparse coefficient vectors.
For every feature j, we define the MCFS score for the feature as

   MCFS(j) = max_k |a_{k,j}|    (4)

where a_{k,j} is the j-th element of vector a_k. We then sort all the features according to their MCFS scores in descending order and select the top d features.
We summarize the complete MCFS algorithm for feature selection in Table 1.

Table 1: MCFS for Feature Selection
Input:  N data points with M features;
        the number of clusters K;
        the number of selected features d;
        the number of nearest neighbors p;
        the weighting scheme (and the parameter σ if choosing to use the heat kernel weighting).
Output: d selected features.
1: Construct a p-nearest neighbor graph as discussed in Section 3.1.
2: Solve the generalized eigen-problem in Eq. (1). Let Y = [y_1, ..., y_K] contain the top K eigenvectors with respect to the smallest eigenvalues.
3: Solve the K L1-regularized regression problems in Eq. (3) using the LARs algorithm with the cardinality constraint set to d. We get K sparse coefficient vectors {a_k}_{k=1}^{K}, a_k ∈ R^M.
4: Compute the MCFS score for each feature according to Eq. (4).
5: Return the top d features according to their MCFS scores.
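A compact, self-contained sketch of the five steps (our own code, not the authors' implementation; binary weighting, samples stored as rows, dense solvers for clarity) might look as follows:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.linear_model import Lars
from sklearn.neighbors import kneighbors_graph

def mcfs(X, K, d, p=5):
    """Sketch of the five steps in Table 1 (binary weighting, dense solvers).

    X : (N, M) data with samples as rows; K : number of clusters;
    d : number of features to select; p : number of nearest neighbors.
    Returns the indices of the d selected features.
    """
    # Step 1: p-nearest-neighbor graph with 0-1 weights
    W = kneighbors_graph(X, n_neighbors=p, mode="connectivity").toarray()
    W = np.maximum(W, W.T)                       # symmetrize

    # Step 2: K eigenvectors of L y = lambda D y with smallest eigenvalues
    D = np.diag(W.sum(axis=1))
    _, vecs = eigh(D - W, D)
    Y = vecs[:, :K]

    # Step 3: K L1-regularized regressions solved by LARs, card(a_k) = d
    A = np.zeros((K, X.shape[1]))
    for k in range(K):
        A[k] = Lars(n_nonzero_coefs=d).fit(X, Y[:, k]).coef_

    # Step 4: MCFS(j) = max_k |a_{k,j}|  (Eq. (4))
    scores = np.abs(A).max(axis=0)

    # Step 5: top d features by MCFS score
    return np.argsort(scores)[::-1][:d]
```

For example, selected = mcfs(X, K=20, d=50) returns column indices that can be used to restrict X before clustering.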
3.4 Computational Complexity Analysis
Our MCFS algorithm consists of five steps, as shown in Table 1. The computational cost of each step can be computed as follows:
• The p-nearest neighbor graph construction step needs O(N^2 M) to compute the pairwise distances and O(N^2 p) to find the p nearest neighbors of each data point.
• For a p-nearest neighbor graph, each row of the weight matrix W contains approximately p non-zero values. We can use the Lanczos algorithm to compute the top K eigenvectors of the eigen-problem in Eq. (1) within O(KNp) time [27].
• The LARs algorithm can solve the L1-regularized regression problem in Eq. (3) with the cardinality constraint (card(a_k) = d) in O(d^3 + Nd^2) [14]. Thus, we need O(Kd^3 + NKd^2) to solve the K regression problems in total.
• The MCFS scores for all the features can be computed within O(KM).
• The top d features can be found within O(M log M) (if d is very small, this cost can be reduced to O(dM)).

Considering that K ≪ N and p is usually fixed as a constant (5), the total cost of our MCFS algorithm is:
   O(N^2 M + Kd^3 + NKd^2 + M log M).    (5)

4. EXPERIMENTS
In this section, several experiments were performed to show the effectiveness of our proposed MCFS for unsupervised feature selection. The experiments include clustering and nearest neighbor classification. The following four unsupervised feature selection algorithms (filter methods) are compared:
• Our proposed MCFS algorithm. The number of nearest neighbors (p) is set to 5 and we use the binary weighting for its simplicity.
• The Q-α algorithm [29], which aims to maximize the cluster coherence.

• LaplacianScore [17], which selects those features that can best preserve the local manifold structure.

• Feature selection based on maximum variance (MaxVariance), which selects those features of maximum variance in order to obtain the best expressive power.

After selecting the features, the clustering and classification are then performed by only using the selected features.
Figure 2: Clustering performance vs. the number of selected features on the ORL data set ((a) 10 clusters, (b) 20 clusters, (c) 30 clusters, (d) 40 clusters).
Table 3: Clustering performance (%) by using 50 features on the ORL data set. The last row shows the performance by using all the 1024 features.

                 10 Clusters   20 Clusters   30 Clusters   40 Clusters   Average
MCFS             79.5±6.7      74.7±2.4      75.0±1.7      74.7          76.0
Q-α              65.6±10.1     62.9±2.6      64.8±1.9      65.2          64.6
LaplacianScore   70.7±8.4      68.8±4.4      67.5±2.7      68.6          68.9
MaxVariance      65.2±7.9      63.9±2.9      63.9±1.5      66.6          64.9
All Features     76.4±7.2      74.0±2.9      73.3±2.2      75.9          74.9
Table 2: Statistics of the four data sets

data set   size   # of features   # of classes
ORL        400    1024            40
USPS       9298   256             10
COIL20     1440   1024            20
Isolet     1560   617             26
4.1 Data Sets
Four real world data sets were used in our experiments. The important statistics of these data sets are summarized below (see also Table 2):
• The first one is the ORL face database, which consists of a total of 400 face images of 40 subjects (10 samples per subject). The images were captured at different times and have different variations including expressions (open or closed eyes, smiling or non-smiling) and facial details (glasses or no glasses). The images were taken with a tolerance for some tilting and rotation of the face up to 20 degrees. The original images were normalized (in scale and orientation) such that the two eyes were aligned at the same position. Then, the facial areas were cropped into the final images for matching. The size of each cropped image is 32×32 pixels, with 256 grey levels per pixel. Thus, each face image can be represented by a 1024-dimensional vector.
• The second one is the USPS handwritten digit database [18]. A popular subset (available at www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#usps), containing 9298 16×16 handwritten digit images in total, is used in this experiment.
• The third one is the COIL20 image library from Columbia, which contains 20 objects. The images of each object were taken 5 degrees apart as the object was rotated on a turntable, and each object has 72 images. The size of each image is 32×32 pixels, with 256 grey levels per pixel.
• The fourth one is the Isolet spoken letter recognition data (www.ics.uci.edu/~mlearn/MLSummary.html). This data set was first used in [15]. It contains 150 subjects who spoke the name of each letter of the alphabet twice. The speakers are grouped into sets of 30 speakers each, referred to as isolet1 through isolet5. In our experiment, we use isolet1, which consists of 1560 examples with 617 features.
4.2 Clustering
Clustering is a common technique for exploratory data analysis. In this experiment, we perform k-means clustering using the selected features and compare the results of the different algorithms.
4.2.1 Evaluation Metrics
The clustering result is evaluated by comparing the label of each data point obtained by the clustering algorithm with the label provided by the data set. We use the normalized mutual information (NMI) metric [17] to measure the performance. Let C denote the set of clusters obtained from the ground truth and C′ the set of clusters obtained from the clustering algorithm.
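As an illustration of this protocol (not the authors' exact code), one could run k-means on the selected feature columns and score the result with scikit-learn's normalized_mutual_info_score; note that its normalization may differ slightly from the NMI definition used in [17]:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def clustering_nmi(X, selected, true_labels, n_clusters, n_runs=10, seed=0):
    """k-means on the selected feature columns, evaluated by NMI.

    X : (N, M) data; selected : indices of the chosen features;
    true_labels : ground-truth class labels provided by the data set.
    Returns the mean NMI over n_runs random k-means restarts.
    """
    rng = np.random.RandomState(seed)
    scores = []
    for _ in range(n_runs):
        km = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=rng.randint(1 << 30))
        pred = km.fit_predict(X[:, selected])
        scores.append(normalized_mutual_info_score(true_labels, pred))
    return float(np.mean(scores))
```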