Unsupervised Feature Selection for Multi-Cluster Data
Deng Cai, Chiyuan Zhang, Xiaofei He
State Key Lab of CAD&CG, College of Computer Science
Zhejiang University, China
{dengcai,xiaofeihe}@cad.zju.edu
ABSTRACT
In many data analysis tasks, one is often confronted with very high dimensional data. Feature selection techniques are designed to find the relevant feature subset of the original features which can facilitate clustering, classification and retrieval. In this paper, we consider the feature selection problem in the unsupervised learning scenario, which is particularly difficult due to the absence of class labels that would guide the search for relevant information. The feature selection problem is essentially a combinatorial optimization problem which is computationally expensive. Traditional unsupervised feature selection methods address this issue by selecting the top ranked features based on certain scores computed independently for each feature. These approaches neglect the possible correlation between different features and thus cannot produce an optimal feature subset. Inspired by the recent developments on manifold learning and L1-regularized models for subset selection, we propose in this paper a new approach, called Multi-Cluster Feature Selection (MCFS), for unsupervised feature selection. Specifically, we select those features such that the multi-cluster structure of the data can be best preserved. The corresponding optimization problem can be efficiently solved since it only involves a sparse eigen-problem and an L1-regularized least squares problem. Extensive experimental results over various real-life data sets have demonstrated the superiority of the proposed algorithm.
Categories and Subject Descriptors
I.5.2 [Pattern Recognition]: Design Methodology—Feature evaluation and selection
General Terms
Algorithms, Theory
Keywords
Feature selection, Unsupervised, Clustering
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
KDD'10, July 25–28, 2010, Washington, DC, USA.
Copyright 2010 ACM 978-1-4503-0055-1/$10.00.

1. INTRODUCTION
In many applications in computer vision, pattern recognition and data mining, one is often confronted with very high dimensional data. High dimensionality significantly increases the time and space requirements for processing the data. Moreover, various data mining and machine learning tasks, such as classification and clustering, that are analytically or computationally manageable in low dimensional spaces may become completely intractable in spaces of several hundred or thousand dimensions [12]. To overcome this problem, feature selection techniques [3, 4, 17, 21, 29, 30] are designed to reduce the dimensionality by finding a relevant feature subset. Once a small number of relevant features are selected, conventional data analysis techniques can then be applied.
Based on whether the label information is available, feature selection methods can be classified into supervised and unsupervised methods. Supervised feature selection methods usually evaluate the importance of features by the correlation between features and class labels. Typical supervised feature selection methods include Pearson correlation coefficients [23], Fisher score [12], and Information gain [11]. However, in practice, there is usually no shortage of unlabeled data, but labels are expensive. Hence, it is of great significance to develop unsupervised feature selection algorithms which can make use of all the data points. In this paper, we consider the problem of selecting features in unsupervised learning scenarios, which is a much harder problem due to the absence of class labels that would guide the search for relevant information.
Feature selection aims at selecting the most relevant feature subset based on certain evaluation criteria. This problem is essentially a combinatorial optimization problem which is computationally expensive. Traditional feature selection methods address this issue by selecting the top ranked features based on some scores computed independently for each feature. These scores are usually defined to reflect the power of each feature in differentiating different classes/clusters. This approach may work well on binary class/cluster problems. However, it is very likely to fail in multi-class/multi-cluster cases. Fig. 1 shows an intuitive example. There are three Gaussians in a three dimensional space. Without the label information, some popular unsupervised feature selection methods (e.g., Maximum Variance and LaplacianScore [17]) rank the features as a > b > c. If one is asked to select two features, these methods will select features a and b, which is obviously sub-optimal.
Figure 1: A failed example for binary clusters/classes feature selection methods. Panels (a) plane a⊗b, (b) plane a⊗c, and (c) plane b⊗c show the projections of the data on the plane spanned by each pair of features, respectively. Without the label information, both Maximum Variance and LaplacianScore [17] rank the features as a > b > c. If one is asked to select two features, both methods will select features a and b, which is obviously sub-optimal.
When dealing with multi class/cluster data, different features have different powers on differentiating different classes/clusters (e.g., cluster 1 vs. cluster 2 and cluster 1 vs. cluster 3). There are some studies on supervised feature selection [2] trying to solve this issue. However, without label information, it is unclear how to apply similar ideas to unsupervised feature selection methods.
Inspired by the recent developments on spectral analysis of the data (manifold learning) [1, 22] and L1-regularized models for subset selection [14, 16], we propose in this paper a new approach, called Multi-Cluster Feature Selection (MCFS), for unsupervised feature selection. Specifically, we select those features such that the multi-cluster structure of the data can be well preserved. By using spectral analysis techniques, MCFS suggests a principled way to measure the correlations between different features without label information. Thus, MCFS can well handle data with a multiple cluster structure. The corresponding optimization problem only involves a sparse eigen-problem and an L1-regularized least squares problem, and thus can be solved efficiently. It is important to note that our method essentially follows our previous work on spectral regression [5] and sparse subspace learning [6, 7].
The rest of the paper is organized as follows: in Section 2, we provide a brief review of the related work. Our multi-cluster feature selection algorithm is introduced in Section 3. The experimental results are presented in Section 4. Finally, we provide the concluding remarks in Section 5.
2. RELATED WORK
Feature selection methods can be classified into "wrapper" methods and "filter" methods [19, 21]. The wrapper model techniques evaluate the features using the mining algorithm that will ultimately be employed. Thus, they "wrap" the selection process around the mining algorithm. Algorithms based on the filter model examine intrinsic properties of the data to evaluate the features prior to the mining tasks.
For unsupervised "wrapper" methods, clustering is a commonly used mining algorithm [10, 13, 20, 24]. These algorithms consider feature selection and clustering simultaneously and search for features better suited to clustering, aiming to improve clustering performance. However, the "wrapper" methods are usually computationally expensive [19] and may not be applicable to large scale data mining problems. In this paper, we are particularly interested in the filter methods, which are much more efficient.
Most of the existing filter methods are supervised. Maximum variance might be the simplest yet most effective unsupervised evaluation criterion for selecting features. This criterion essentially projects the data points along the dimensions of maximum variance. Note that the Principal Component Analysis (PCA) algorithm shares the same principle of maximizing variance, but it involves feature transformation and obtains a set of transformed features rather than a subset of the original features.
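For concreteness, a minimal sketch of such a maximum-variance filter is given below (our own code, not from the paper; the matrix X is assumed to be arranged with N samples as rows and M features as columns):

```python
import numpy as np

def max_variance_selection(X, d):
    """Rank features by variance and keep the top d (hypothetical helper).

    X : (N, M) array with N samples and M features.
    Returns the indices of the d features with the largest variance.
    """
    variances = np.var(X, axis=0)            # one variance per feature
    return np.argsort(variances)[::-1][:d]   # indices of the top-d variances
```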
Although the maximum variance criterion finds features that are useful for representing the data, there is no reason to assume that these features must be useful for discriminating between data in different classes. Recently, the LaplacianScore algorithm [17] and its extensions [30] have been proposed to select those features which can best reflect the underlying manifold structure. LaplacianScore uses a nearest neighbor graph to model the local geometric structure of the data and selects those features which are smoothest on the graph. It has been proven [17] that with label information LaplacianScore becomes the Fisher criterion score. The latter is a supervised feature selection method (filter method) which seeks features that are efficient for discrimination [12]. The Fisher criterion score assigns the highest score to the feature on which the data points of different classes are far from each other while requiring data points of the same class to be close to each other.
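The paper describes LaplacianScore only informally; the sketch below assumes the standard formulation from [17], in which a feature's score is its graph smoothness, f̃ᵀLf̃, normalized by its degree-weighted variance, f̃ᵀDf̃ (smaller scores are better). The function name and the dense-matrix implementation are our own:

```python
import numpy as np

def laplacian_score(X, W):
    """Laplacian Score of each feature (smaller is better), after [17].

    X : (N, M) data matrix (samples as rows); W : (N, N) symmetric affinity
    matrix built as in Section 3.1. Dense formulation, for illustration only.
    """
    N, M = X.shape
    D = np.diag(W.sum(axis=1))        # degree matrix
    L = D - W                         # graph Laplacian
    ones = np.ones(N)
    scores = np.empty(M)
    for r in range(M):
        f = X[:, r]
        # subtract the degree-weighted mean so the score is shift-invariant
        f_tilde = f - (f @ D @ ones) / (ones @ D @ ones) * ones
        # small epsilon guards against constant features
        scores[r] = (f_tilde @ L @ f_tilde) / (f_tilde @ D @ f_tilde + 1e-12)
    return scores
```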
Wolf et al. proposed a feature selection algorithm called Q-α [29]. The algorithm optimizes over a least-squares criterion function which measures the clusterability of the input data points projected onto the selected coordinates. The optimal coordinates are those for which the cluster coherence, measured by the spectral gap of the corresponding affinity matrix, is maximized [29]. A remarkable property of the algorithm is that it always yields sparse solutions.
3. MULTI-CLUSTER FEATURE SELECTION
The generic problem of unsupervised feature selection is the following. Given a set of points X = [x_1, x_2, ..., x_N], x_i ∈ R^M, find a feature subset of size d which contains the most informative features. In other words, the points {x'_1, x'_2, ..., x'_N} represented in the d-dimensional space R^d can well preserve the geometric structure of the data as represented in the original M-dimensional space.
Since naturally occurring data usually have a multiple clusters structure, a good feature selection algorithm should consider the following two aspects:

• The selected features can best preserve the cluster structure of the data. Previous studies on unsupervised feature selection [13, 20, 24] usually use Gaussian shaped clusters. However, recent studies have shown that human generated data are probably sampled from a submanifold of the ambient Euclidean space [1, 25, 28]. The intrinsic manifold structure should be considered while measuring the goodness of the clusters [22].

• The selected features can "cover" all the possible clusters in the data. Since different features have different powers on differentiating different clusters, it is certainly undesirable that all the selected features can well differentiate cluster 1 and cluster 2 but fail to differentiate cluster 1 and cluster 3.
In the remaining part of this section, we will introduce our Multi-Cluster Feature Selection (MCFS) algorithm, which considers the above two aspects. We begin with a discussion on spectral embedding for cluster analysis with arbitrary cluster shapes.
3.1 Spectral Embedding for Cluster Analysis

To detect the cluster structure (with arbitrary shapes) of the data, spectral clustering techniques [8, 22, 26] have received significant interest recently. Spectral clustering usually clusters the data points using the top eigenvectors of the graph Laplacian [9], which is defined on the affinity matrix of the data points. From the graph partitioning perspective, spectral clustering tries to find the best cut of the graph so that a predefined criterion function can be optimized. Many criterion functions, such as ratio cut [8], average association [26], and normalized cut [26], have been proposed along with the corresponding eigen-problems for finding their optimal solutions. Spectral clustering has a close connection with the studies on manifold learning [1, 25, 28], which consider the case when the data are drawn from sampling a probability distribution that has support on or near a submanifold of the ambient space. In order to detect the underlying manifold structure, many manifold learning algorithms have been proposed [1, 25, 28]. These algorithms construct a nearest neighbor graph to model the local geometric structure and perform spectral analysis on the graph weight matrix. This way, the manifold learning algorithms can "unfold" the data manifold and provide a "flat" embedding for the data points. Spectral clustering can be thought of as a two-step approach [1]. The first step is "unfolding" the data manifold using manifold learning algorithms and the second step is performing traditional clustering (typically k-means) on the "flat" embedding of the data points [22].
Consider a graph with N vertices where each vertex corresponds to a data point. For each data point x_i, we find its p nearest neighbors and put an edge between x_i and its neighbors. There are many choices to define the weight matrix W on the graph. Three of the most commonly used are as follows:
1. 0-1 weighting. W_ij = 1 if and only if nodes i and j are connected by an edge. This is the simplest weighting method and is very easy to compute.
2. Heat kernel weighting. If nodes i and j are connected, put

   W_ij = e^(−||x_i − x_j||^2 / σ)

The heat kernel has an intrinsic connection to the Laplace-Beltrami operator on differentiable functions on a manifold [1].
3. Dot-product weighting. If nodes i and j are connected, put

   W_ij = x_i^T x_j

Note that if x is normalized to have unit norm, the dot product of two vectors is equivalent to the cosine similarity of the two vectors.
If the heat kernel or dot-product weighting is used, some researchers [22] use a complete graph (i.e., put an edge between any two points) instead of the p-nearest neighbors graph.
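As an illustration (our own naming and code, with the data arranged as an N × M matrix of samples rather than the paper's M × N convention), the three weighting schemes on a p-nearest-neighbor graph could be sketched as follows:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def build_affinity(X, p=5, weighting="binary", sigma=1.0):
    """Symmetric p-nearest-neighbor affinity matrix W (N x N).

    weighting: "binary" (0-1), "heat" (heat kernel), or "dot" (dot-product).
    """
    # adjacency of the p-NN graph (1 where an edge exists, 0 elsewhere)
    A = kneighbors_graph(X, n_neighbors=p, mode="connectivity").toarray()
    A = np.maximum(A, A.T)                      # symmetrize the graph

    if weighting == "binary":
        W = A
    elif weighting == "heat":
        sq_dists = np.square(X[:, None, :] - X[None, :, :]).sum(axis=2)
        W = A * np.exp(-sq_dists / sigma)       # W_ij = exp(-||x_i - x_j||^2 / sigma)
    elif weighting == "dot":
        W = A * (X @ X.T)                       # W_ij = x_i^T x_j
    else:
        raise ValueError("unknown weighting scheme")
    return W
```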
Define a diagonal matrix D whose entries are the column (or row, since W is symmetric) sums of W, D_ii = Σ_j W_ij. We can then compute the graph Laplacian L = D − W [9]. The "flat" embedding for the data points which "unfolds" the data manifold can be found by solving the following generalized eigen-problem [1]:

   Ly = λDy    (1)

Let Y = [y_1, ..., y_K], where the y_k's are the eigenvectors of the above generalized eigen-problem with respect to the smallest eigenvalues. Each row of Y is the "flat" embedding of a data point. K is the intrinsic dimensionality of the data and each y_k reflects the data distribution along the corresponding dimension (topic, concept, etc.) [1]. When one tries to perform cluster analysis of the data, each y_k can reflect the data distribution on the corresponding cluster. Thus, if the cluster number of the data is known, K is usually set to be equal to the number of clusters [22].
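A minimal sketch of this step is given below, using a dense generalized eigen-solver for clarity (for large N one would instead use a sparse Lanczos-type solver, as noted in Section 3.4); the function name is our own:

```python
import numpy as np
from scipy.linalg import eigh

def spectral_embedding(W, K):
    """Return Y = [y_1, ..., y_K]: eigenvectors of L y = lambda D y (Eq. (1))
    with the K smallest eigenvalues; one row of Y per data point."""
    D = np.diag(W.sum(axis=1))   # degree matrix
    L = D - W                    # graph Laplacian
    # generalized symmetric eigen-problem; eigenvalues are returned in
    # ascending order, so the first K columns give the "flat" embedding
    _, eigvecs = eigh(L, D)
    return eigvecs[:, :K]
```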
3.2 Learning Sparse Coefficient Vectors
After we obtain the "flat" embedding Y of the data points, we can measure the importance of each feature along each intrinsic dimension (each column of Y), correspondingly, the contribution of each feature for differentiating each cluster.
Given y_k, a column of Y, we can find a relevant subset of features by minimizing the fitting error as follows:

   min_{a_k} ||y_k − X^T a_k||^2 + β|a_k|    (2)

where a_k is an M-dimensional vector and |a_k| = Σ_{j=1}^{M} |a_{k,j}| denotes the L1-norm of a_k. a_k essentially contains the combination coefficients for the different features in approximating y_k. Due to the nature of the L1-norm penalty, some coefficients will be shrunk to exactly zero if β is large enough. In this case, we can select a subset containing the most relevant features (corresponding to the non-zero coefficients in a_k) with respect to y_k. Eq. (2) is essentially a regression problem. In statistics, this L1-regularized regression problem is called LASSO [16].
The regression problem in Eq. (2) has the following equivalent formulation:

   min_{a_k} ||y_k − X^T a_k||^2,  s.t. |a_k| ≤ γ    (3)
The Least Angle Regression (LARs) algorithm [14] can be used to solve the optimization problem in Eq. (3). Instead of setting the parameter γ, LARs provides another choice to control the sparseness of a_k by specifying the cardinality (the number of non-zero entries) of a_k, which is particularly convenient for feature selection.
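As a sketch of this step (our own code), scikit-learn's Lars estimator exposes the cardinality through its n_nonzero_coefs parameter; whether this exactly matches the authors' LARs implementation is an assumption on our part:

```python
import numpy as np
from sklearn.linear_model import Lars

def sparse_coefficients(X, Y, d):
    """One LARs regression per embedding dimension (Eq. (3)).

    X : (N, M) samples-by-features matrix; Y : (N, K) spectral embedding.
    Returns a (K, M) array whose k-th row is a_k with at most d non-zeros.
    """
    K, M = Y.shape[1], X.shape[1]
    A = np.zeros((K, M))
    for k in range(K):
        # n_nonzero_coefs plays the role of the cardinality constraint card(a_k) <= d
        A[k] = Lars(n_nonzero_coefs=d).fit(X, Y[:, k]).coef_
    return A
```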
It is very possible that some features are correlated, and the combination of several "weak" features (features that are not very informative in differentiating clusters when evaluated independently) can better differentiate different clusters. Several supervised feature selection algorithms [2] have been designed to address this issue. Thus, the advantage of using an L1-regularized regression model to find the subset of features, instead of evaluating the contribution of each feature independently, is clear.
3.3 Feature Selection on Sparse Coefficient Vectors
We consider selecting d features from the M feature candidates. For a data set containing K clusters, we can use the method discussed in the previous subsections to compute K sparse coefficient vectors {a_k}_{k=1}^{K}, a_k ∈ R^M. The cardinality of each a_k is d and each entry in a_k corresponds to a feature. If we select all the features that have at least one non-zero coefficient in the K vectors {a_k}, it is very possible that we will obtain more than d features. In reality, we can use the following simple yet effective method for selecting exactly d features from the K sparse coefficient vectors.
For every feature j, we define the MCFS score for the feature as

   MCFS(j) = max_k |a_{k,j}|    (4)

where a_{k,j} is the j-th element of vector a_k. We then sort all the features according to their MCFS scores in descending order and select the top d features.
We summarize the complete MCFS algorithm for feature selection in Table 1.

Table 1: MCFS for Feature Selection
Input:  N data points with M features;
        the number of clusters K;
        the number of selected features d;
        the number of nearest neighbors p;
        the weighting scheme (and the parameter σ if choosing to use the heat kernel weighting).
Output: d selected features.
1: Construct a p-nearest neighbor graph as discussed in Section 3.1.
2: Solve the generalized eigen-problem in Eq. (1). Let Y = [y_1, ..., y_K] contain the top K eigenvectors with respect to the smallest eigenvalues.
3: Solve the K L1-regularized regression problems in Eq. (3) using the LARs algorithm with the cardinality constraint set to d. We get K sparse coefficient vectors {a_k}_{k=1}^{K}, a_k ∈ R^M.
4: Compute the MCFS score for each feature according to Eq. (4).
5: Return the top d features according to their MCFS scores.
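A compact, self-contained sketch of the five steps (our own code, not the authors' implementation; binary weighting, samples stored as rows, dense solvers for clarity) might look as follows:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.linear_model import Lars
from sklearn.neighbors import kneighbors_graph

def mcfs(X, K, d, p=5):
    """Sketch of the five steps in Table 1 (binary weighting, dense solvers).

    X : (N, M) data with samples as rows; K : number of clusters;
    d : number of features to select; p : number of nearest neighbors.
    Returns the indices of the d selected features.
    """
    # Step 1: p-nearest-neighbor graph with 0-1 weights
    W = kneighbors_graph(X, n_neighbors=p, mode="connectivity").toarray()
    W = np.maximum(W, W.T)                       # symmetrize

    # Step 2: K eigenvectors of L y = lambda D y with smallest eigenvalues
    D = np.diag(W.sum(axis=1))
    _, vecs = eigh(D - W, D)
    Y = vecs[:, :K]

    # Step 3: K L1-regularized regressions solved by LARs, card(a_k) = d
    A = np.zeros((K, X.shape[1]))
    for k in range(K):
        A[k] = Lars(n_nonzero_coefs=d).fit(X, Y[:, k]).coef_

    # Step 4: MCFS(j) = max_k |a_{k,j}|  (Eq. (4))
    scores = np.abs(A).max(axis=0)

    # Step 5: top d features by MCFS score
    return np.argsort(scores)[::-1][:d]
```

For example, selected = mcfs(X, K=20, d=50) returns column indices that can be used to restrict X before clustering.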
3.4 Computational Complexity Analysis
Our MCFS algorithm consists of five steps, as shown in Table 1. The computational cost of each step can be computed as follows:
• The p-nearest neighbor graph construction step needs O(N^2 M) to compute the pairwise distances and O(N^2 p) to find the p nearest neighbors of each data point.
• For a p-nearest neighbor graph, each row of the weight matrix W contains approximately p non-zero values. We can use the Lanczos algorithm to compute the top K eigenvectors of the eigen-problem in Eq. (1) within O(KNp) time [27].
• The LARs algorithm can solve the L1-regularized regression problem in Eq. (3) with the cardinality constraint (card(a_k) = d) in O(d^3 + Nd^2) [14]. Thus, we need O(Kd^3 + NKd^2) to solve the K regression problems in total.
• The MCFS scores for all the features can be computed within O(KM).
• The top d features can be found within O(M log M) (if d is very small, this cost can be reduced to O(dM)).

Considering that K ≪ N and p is usually fixed as a constant (5), the total cost of our MCFS algorithm is:
   O(N^2 M + Kd^3 + NKd^2 + M log M).    (5)

4. EXPERIMENTS
In this section, several experiments were performed to show the effectiveness of our proposed MCFS for unsupervised feature selection. The experiments include clustering and nearest neighbor classification. The following four unsupervised feature selection algorithms (filter methods) are compared:
• Our proposed MCFS algorithm. The number of nearest neighbors (p) is set to 5 and we use the binary weighting for its simplicity.
• The Q-α algorithm [29], which aims to maximize the cluster coherence.

• LaplacianScore [17], which selects those features that can best preserve the local manifold structure.

• Feature selection based on maximum variance (MaxVariance), which selects those features of maximum variance in order to obtain the best expressive power.

After selecting the features, the clustering and classification are then performed by only using the selected features.
Figure 2: Clustering performance vs. the number of selected features on the ORL data set ((a) 10 clusters, (b) 20 clusters, (c) 30 clusters, (d) 40 clusters).
Table 3: Clustering performance (%) by using 50 features on the ORL data set. The last row shows the performance by using all the 1024 features.

                 10 Clusters   20 Clusters   30 Clusters   40 Clusters   Average
MCFS             79.5±6.7      74.7±2.4      75.0±1.7      74.7          76.0
Q-α              65.6±10.1     62.9±2.6      64.8±1.9      65.2          64.6
LaplacianScore   70.7±8.4      68.8±4.4      67.5±2.7      68.6          68.9
MaxVariance      65.2±7.9      63.9±2.9      63.9±1.5      66.6          64.9
All Features     76.4±7.2      74.0±2.9      73.3±2.2      75.9          74.9
Table 2: Statistics of the four data sets

data set   size   # of features   # of classes
ORL        400    1024            40
USPS       9298   256             10
COIL20     1440   1024            20
Isolet     1560   617             26
4.1 Data Sets
Four real world data sets were used in our experiments. The important statistics of these data sets are summarized below (see also Table 2):
• The first one is the ORL face database, which consists of a total of 400 face images of 40 subjects (10 samples per subject). The images were captured at different times and have different variations including expressions (open or closed eyes, smiling or non-smiling) and facial details (glasses or no glasses). The images were taken with a tolerance for some tilting and rotation of the face up to 20 degrees. The original images were normalized (in scale and orientation) such that the two eyes were aligned at the same position. Then, the facial areas were cropped into the final images for matching. The size of each cropped image is 32×32 pixels, with 256 grey levels per pixel. Thus, each face image can be represented by a 1024-dimensional vector.
• The second one is the USPS handwritten digit database [18]. A popular subset (available at www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#usps), containing 9298 16×16 handwritten digit images in total, is used in this experiment.
• The third one is the COIL20 image library from Columbia, which contains 20 objects. The images of each object were taken 5 degrees apart as the object was rotated on a turntable, and each object has 72 images. The size of each image is 32×32 pixels, with 256 grey levels per pixel.
• The fourth one is the Isolet spoken letter recognition data (www.ics.uci.edu/~mlearn/MLSummary.html). This data set was first used in [15]. It contains 150 subjects who spoke the name of each letter of the alphabet twice. The speakers are grouped into sets of 30 speakers each, referred to as isolet1 through isolet5. In our experiment, we use isolet1, which consists of 1560 examples with 617 features.
4.2 Clustering
Clustering is a common technique for exploratory data analysis. In this experiment, we perform k-means clustering using the selected features and compare the results of the different algorithms.
4.2.1 Evaluation Metrics
The clustering result is evaluated by comparing the label of each data point obtained by the clustering algorithm with the label provided by the data set. We use the normalized mutual information (NMI) metric [17] to measure the performance. Let C denote the set of clusters obtained from the ground truth and C′ the set of clusters obtained from the clustering algorithm.
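As an illustration of this protocol (not the authors' exact code), one could run k-means on the selected feature columns and score the result with scikit-learn's normalized_mutual_info_score; note that its normalization may differ slightly from the NMI definition used in [17]:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def clustering_nmi(X, selected, true_labels, n_clusters, n_runs=10, seed=0):
    """k-means on the selected feature columns, evaluated by NMI.

    X : (N, M) data; selected : indices of the chosen features;
    true_labels : ground-truth class labels provided by the data set.
    Returns the mean NMI over n_runs random k-means restarts.
    """
    rng = np.random.RandomState(seed)
    scores = []
    for _ in range(n_runs):
        km = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=rng.randint(1 << 30))
        pred = km.fit_predict(X[:, selected])
        scores.append(normalized_mutual_info_score(true_labels, pred))
    return float(np.mean(scores))
```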