LOCAL LINEAR PROJECTION (LLP)
Xiaoming Huo, Jihong Chen
School of Industrial & Systems Engineering, Georgia Institute of Technology
Atlanta, GA 30332-0205
ABSTRACT
Dimensionality reduction has important applications in exploratory data analysis. A method based on Local Linear Projection (LLP) is proposed. The advantage of this method is that it is robust against uncertainty. Statistical analysis is applied to estimate parameters. Simulation results on synthetic data are promising. Some preliminary experiments in applying this method to microarray data are reported. The results show that LLP can identify significant patterns. We propose some future tasks to refine this method.
1. INTRODUCTION

Dimensionality reduction plays a significant role in exploratory data analysis. In many real applications, although the data may have very high dimension, they are typically embedded in manifolds (or subspaces) of substantially lower dimension. Identifying these manifolds (or subspaces) is critical to understanding the data. It is also important in applications such as data visualization and modeling. In the statistics, machine learning, and artificial intelligence communities, a substantial number of techniques has been developed. In the following, we give a quick review of the work most directly related to ours.
When the embedded structures are linear subspaces, linear techniques such as Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) can be used to identify them. PCA considers the second-order statistics (variances and covariances) of the data and finds the directions in which the variance is maximized. SVD works on the data themselves; it finds the linear subspace that best preserves the information in the data. For both PCA and SVD, the embedded structure must be globally linear; in many applications, this condition is too restrictive. Multi-Dimensional Scaling (especially metric MDS) is close to PCA and SVD. PCA and SVD find the most significant linear subspaces; in metric MDS, one tries to map the data into a low-dimensional space while preserving the inter-point distances [13]. Although the philosophical starting points are seemingly different, the underlying linear algebra is very similar.

(This work is partially supported by a seed grant from the Center for Graphics, Visualization and Usability at the Georgia Institute of Technology and a DARPA-Lockheed-Martin-Stanford University contract.)
When the global linearity condition is abandoned, methods that focus on finding local embedded structures have been proposed, among them, for example, principal curves [7, 2]. Recently, attention has turned to methods dedicated to identifying local hidden manifolds, for example ISOMAP [11] and Locally Linear Embedding (LLE) [8]. In ISOMAP, instead of considering the direct distance between two data points, one considers the geodesic distance, which is the length of the shortest path residing on the embedded manifold; in implementations, this idea is realized by considering the k nearest neighbors. Later, in order to achieve better numerical performance, researchers proposed Curvilinear Distance Analysis (CDA) [4]. In LLE, each data point is represented as a convex combination of its k nearest neighbors; the data are then mapped into a low-dimensional space so that these convex combinations (called the embedding) are preserved as well as possible. In [4, 11, 8], good examples, such as Swiss rolls, open boxes, and cylinders, are shown to illustrate these ideas. We found them very instructive.
Fig. 1. Hurricane: a 3-D dataset with a 1-D embedded structure.
In order to help our readers visualize the type of problem that we are trying to solve, we provide an example dataset in Figure 1. The data are in 3-D but have an apparent 1-D embedded structure.
Due to the maturation of the Human Genome Project and the availability of microarray technology, microarray data pose a new challenge to data analysts. Microarray technology allows workers to measure the expression levels of tens of thousands of genes simultaneously. The dimensionality of microarray data is thus very high, and it is urgent to develop efficient dimension reduction tools. As a matter of fact, many of the previously mentioned tools have been applied to microarray data: researchers have used SVD to interpolate missing values in microarray data [12], ISOMAP has been used to understand the structure of a microarray dataset [10], and PCA has been used to summarize microarray experiments [6]. Many more examples can be found in the references of [5].
As evidence of the importance of dimension reduction for microarray data, let us consider the clustering of genes. Clustering genes means grouping together genes that might be associated with identical functionalities. A nice survey of clustering methods for microarray datasets is given in [5], and an associated software package is described in [9]. Many studies have been reported [1]; due to space limitations, we cannot enumerate all of them here. Dimension reduction can help improve clustering results: one first projects the data points onto an embedded low-dimensional manifold and then computes the inter-point distances between the projections. These distances should be more "faithful" than the inter-point distances computed directly from the data. Hence a dimension reduction tool can be used as a preprocessing step for a clustering algorithm.
A dimension reduction tool can also help visualize the data. Visualization requires reducing the global dimensionality of the data, which is slightly different from reducing its local dimensionality; but with an appended post-processing step, LLP can be used for visualization as well. For example, we can look at the local structure of the data. In our simulation study on synthetic data, we give a demonstration of this idea.
In the works that we have seen so far, we observed the following shortcomings.
1. In many methods (e.g., ISOMAP, CDA, LLE, and other k-nearest-neighbor-based methods), no statistical model has been assumed. Hence it becomes difficult to quantitatively measure the success (or failure) of each method. It is also difficult to describe the domain in which these methods work.

2. Even though in most of the existing methods the algorithms are clear and well described, implementing them always involves several parameters: for example, the number of nearest neighbors and the dimension of the embedded manifold. No analysis of how to choose them has been fully reported.
We believe the answers to the above problems can be found through statistical analysis, more specifically, the ANalysis Of VAriance (ANOVA).

In this paper, a statistical model is introduced to describe the phenomenon of a locally embedded manifold in noisy data. Based on this model, we propose a Local Linear Projection (LLP) method to identify the embedded manifold. Some preliminary computational and statistical analysis is carried out to determine how to choose the values of the parameters in the model. We found that this method works well on synthetic data (as expected). We also provide some preliminary results for microarray data.

The rest of the paper is organized as follows. In Section 2, the statistical model for embedded manifolds is described. In Section 3, we describe the idea and the algorithm for LLP. In Section 4, some parameter estimation strategies are presented. In Section 5, we report simulation findings for both a synthetic dataset and a microarray dataset. In Section 6, questions that will be further analyzed are listed, and some final remarks are made.
2. MODEL

We assume an additive noise model. Suppose there are N observations, denoted by y_1, y_2, ..., y_N. Let p denote the dimension of each observation, so y_i ∈ R^p for all 1 ≤ i ≤ N. We assume that there is an underlying (piecewise smooth) function f(·) such that

y_i = f(x_i) + ε_i,   i = 1, 2, ..., N,

where the variable x_i ∈ R^{p_0} comes from a much lower dimensional space (p_0 << p), and the noise terms ε_i follow a multivariate normal distribution, ε_i ~ N(0, σ²I_p), with σ unknown.
In the above model, if the underlying function f is locally regular, or more specifically, can locally be approximated by a linear function

f(x) ≈ β_0 + β_1^T x,

where β_0 ∈ R^p and β_1^T ∈ R^{p×p_0}, then local linear projection can be applied to extract this information.
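As a small, hedged illustration of this model (our own sketch; the specific curve f, the sample size, and the noise level are arbitrary choices, not from the paper), the following Python/NumPy snippet generates noisy observations around a 1-D curve embedded in R^3:

import numpy as np

# Sketch: N noisy observations y_i = f(x_i) + eps_i around a smooth
# 1-D curve f : R -> R^3, so p_0 = 1 and p = 3 here.
rng = np.random.default_rng(0)
N, sigma = 500, 0.1
x = rng.uniform(0.0, 1.0, N)                  # latent low-dimensional variable
f = np.stack([np.cos(2 * np.pi * x),          # an arbitrary smooth curve
              np.sin(2 * np.pi * x),
              x], axis=1)
Y = f + sigma * rng.standard_normal((N, 3))   # additive noise, N(0, sigma^2 I_p)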
3. LOCAL LINEAR PROJECTION

LLP can be applied to extract the local low-dimensional structure. In the first step, neighboring observations are identified. In the second step, SVD or PCA is used to estimate the local linear subspace. Finally, each observation is projected onto this subspace.
ALGORITHM: LLP

For each observation y_i, i = 1, 2, 3, ..., N:

1. Find the K nearest neighbors of y_i. Denote the neighboring points by ỹ_1, ỹ_2, ..., ỹ_K.

2. Use PCA or SVD to identify the linear subspace that contains most of the information in the vectors ỹ_1, ỹ_2, ..., ỹ_K. Call this subspace A_i, and let P_{A_i}(x) denote the projection of a vector x onto it. If k_0 denotes the assumed dimension of the embedded manifold, then A_i can be viewed as the linear subspace spanned by the vectors associated with the first k_0 singular values.

3. Project y_i onto the linear subspace A_i, and let ŷ_i denote this projection: ŷ_i = P_{A_i}(y_i).

end.
The output of LLP, ŷ_i, i = 1, 2, ..., N, is more "faithful" to the underlying structure (if it exists) than the original observations are.
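To make the procedure concrete, the following is a minimal Python/NumPy sketch of the three steps above (our own illustration, not the authors' code; the function name llp and the brute-force neighbor search are implementation choices):

import numpy as np

def llp(Y, K=40, k0=1):
    """Local Linear Projection (sketch).

    Y : (N, p) array, one observation per row.
    K : number of nearest neighbors.
    k0: assumed dimension of the embedded manifold.
    Returns an (N, p) array of projected observations.
    """
    N, p = Y.shape
    # Pairwise squared distances; O(p N^2), matching the complexity analysis.
    D = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    Y_hat = np.empty_like(Y)
    for i in range(N):
        # Step 1: the K nearest neighbors of y_i (index 0 is y_i itself).
        nbrs = np.argsort(D[i])[1:K + 1]
        Yk = Y[nbrs]
        # Step 2: SVD of the centered neighborhood (i.e., local PCA); the
        # top-k0 right singular vectors span the estimated subspace A_i.
        mu = Yk.mean(axis=0)
        _, _, Vt = np.linalg.svd(Yk - mu, full_matrices=False)
        V = Vt[:k0].T                          # (p, k0) orthonormal basis
        # Step 3: project y_i onto the affine subspace mu + span(V).
        Y_hat[i] = mu + V @ (V.T @ (Y[i] - mu))
    return Y_hat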
A justification of Step 2 is that, based on the previous model, for ỹ_1, ỹ_2, ..., ỹ_K we have

ỹ_1 = β_0 + β_1^T x̃_1 + ε̃_1,
ỹ_2 = β_0 + β_1^T x̃_2 + ε̃_2,
...
ỹ_K = β_0 + β_1^T x̃_K + ε̃_K,

where ε̃_i ~ N(0, σ²I_p) and x̃_i ∈ R^{p_0}, i = 1, 2, ..., K. Hence the ỹ_i's can be viewed as random vectors whose mean vectors lie in a low-dimensional (affine) subspace. This subspace can be extracted via the SVD of the vectors ỹ_1, ỹ_2, ..., ỹ_K. The dimension of this linear subspace can be estimated by analyzing the variances.
The computational complexity of LLP is roughly C(p, k_0, K)N², where the constant C(p, k_0, K) is a function of the dimension of the data (p), the dimension of the embedded linear subspace (k_0), and the number of nearest neighbors (K). The reasoning is as follows. First, to identify the nearest neighbors, the distance matrix of the N observations needs to be computed, which costs O(pN²) operations. Then each row (or column) of the distance matrix needs to be sorted, which costs O(N log N) (the complexity of quicksort) multiplied by N (the number of rows); in other words, the sorting takes O(N² log N) operations. Suppose that each SVD step takes a constant number of operations C_2(p, k_0, K), as does each projection step. Overall, the order of complexity for LLP is roughly C(p, k_0, K)N².
4. ESTIMATING MODEL PARAMETERS

There are two key parameters in LLP: the number of nearest neighbors (K) and the dimension of the local underlying subspace (k_0). The ideal value of K is one for which the linearity assumption holds. For the parameter k_0, it is ideal to have k_0 = p_0.
4.1. Number of Nearest Neighbors

Following the notation in Section 3, for a fixed data point y_i and its K nearest neighbors ỹ_j, j = 1, 2, ..., K, if the linearization model holds, the squared distances

d_{i,j} = ||y_i − ỹ_j||_2^2,   j = 1, 2, ..., K,

should approximately follow a 2σ²·χ²_p distribution. These distances can be ordered:

d_{i,(1)} < d_{i,(2)} < ··· < d_{i,(K)}.

If we calculate the differences

d_{i,(j+1)} − d_{i,(j)},   j = 1, 2, ..., K − 1,

we will observe a few big ones at the beginning, after which they decrease to small ones. This is because, for χ²-distributed random variables, the sequence of differences of the order statistics exhibits exactly this pattern. The decreasing pattern of the differences can help identify an appropriate number of nearest neighbors. Due to space limitations, we postpone the detailed statistical analysis to future publications.
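As a hedged sketch of this heuristic (our own illustration; the paper defers the precise rule to future work), one can compute the ordered squared distances and their successive differences and pick K just before the first conspicuously large gap beyond the initial few:

import numpy as np

def distance_gaps(Y, i, K_max=60):
    """Ordered squared distances from y_i and their successive differences.

    Under the local model, the d_{i,(j)} behave like order statistics of a
    2*sigma^2 chi^2_p distribution; a departure from the expected pattern
    of the gaps suggests the linearity assumption no longer holds.
    """
    d = ((Y - Y[i]) ** 2).sum(axis=1)     # d_{i,j} = ||y_i - y_j||_2^2
    d_sorted = np.sort(d)[1:K_max + 1]    # drop the zero self-distance
    gaps = np.diff(d_sorted)              # d_{i,(j+1)} - d_{i,(j)}
    return d_sorted, gaps

A plot of the gaps against j can then be inspected (or thresholded) to choose K.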
4.2. Dimension of the Linear Subspace

Still following the notation in Section 3, if a fatter version of the matrix β_1 is fixed (in our case, it is computed from the SVD), then the analysis of the appropriate dimension of the linear subspace (p_0) falls into the domain of Analysis of Variance (ANOVA). In our case the analysis is more complicated, since the model matrix is computed from the data as well. Intuitively, as the dimension of the subspace increases, one would expect a quick drop in the variance at the beginning, followed by a relatively steady decrease. Again, we postpone the detailed analysis to future publications.
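A minimal sketch of this scree-type inspection (our own illustration; the ANOVA-based treatment is deferred above): compute the singular values of a centered neighborhood and look for the elbow where the residual variance stops dropping quickly.

import numpy as np

def residual_variances(Yk):
    """Fraction of variance left after keeping k local singular directions,
    for k = 0, 1, ..., min(K, p).

    Yk : (K, p) array holding the K nearest neighbors of some observation.
    A sharp initial drop followed by a slow, steady decrease suggests the
    elbow as an estimate of the local dimension k_0.
    """
    s = np.linalg.svd(Yk - Yk.mean(axis=0), compute_uv=False)
    total = (s ** 2).sum()
    cum = np.cumsum(np.concatenate(([0.0], s ** 2)))
    return 1.0 - cum / total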
5. SIMULATIONS

Two experiments are reported. The first is on a synthetic signal; the second is on a microarray dataset that is also used in [3].
5.1. Synthetic Data: a 12-D Hurricane

A twelve-dimensional signal is produced. The underlying function is

f(t) = (√t sin(4.5πt), √t cos(4.5πt), t − 0.5, ...).

Dimensions (4, 5, 6), (7, 8, 9), and (10, 11, 12) repeat the pattern of the first three dimensions, so the signal is intrinsically 1-D. An illustration of a noisy dataset generated from this function, restricted to its first three dimensions, is given in Figure 1. In creating the noisy data, the standard deviation of the noise is chosen to be σ = 0.10. Based on the analysis of the differences between squared distances, we choose the number of nearest neighbors K = 40. The results of applying LLP are shown in Figure 2.
Fig. 2. Result of applying LLP to the 12-D synthetic data, restricted to the first three dimensions.
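For concreteness, the following sketch generates this dataset and applies the llp function sketched in Section 3 (the tiling of the three-coordinate pattern into four blocks is our reading of the description above, and the sample size is our assumption):

import numpy as np

def hurricane_12d(N=600, sigma=0.10, seed=0):
    """Noisy 12-D hurricane: the 3-D pattern repeated in four blocks."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, 1.0, N)
    block = np.stack([np.sqrt(t) * np.sin(4.5 * np.pi * t),
                      np.sqrt(t) * np.cos(4.5 * np.pi * t),
                      t - 0.5], axis=1)
    f = np.tile(block, (1, 4))                   # dimensions 1-12
    return f + sigma * rng.standard_normal((N, 12))

Y = hurricane_12d()
Y_hat = llp(Y, K=40, k0=1)   # K from the Section 4.1 analysis; k0 = 1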
5.2. Microarray Data

LLP is applied to a microarray dataset that was also used in [3]. (The dataset is downloadable on the web.) We found that the number of nearest neighbors should be K = 30. The result of applying LLP is shown in Figure 3.
Fig. 3. Result of applying LLP to a microarray dataset (panel: "After LLP"). The figure must be viewed in color.
We postpone a detailed discussion of this result to future publications.
6. FUTURE WORK AND CONCLUSION

LLP has been useful in identifying locally low-dimensional embedded subspaces. It is an effective dimension reduction tool and can be used as a preprocessing step for other data analysis techniques.

We plan to carry out a detailed analysis of the two approaches described in Sections 4.1 and 4.2.
7. REFERENCES

[1] Eisen, M.B., Spellman, P.T., Brown, P.O. and Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns, Proc. Natl Acad. Sci. USA, vol. 95: 14863-14868.

[2] Hastie, T. and Stuetzle, W. (1989) Principal curves, Journal of the American Statistical Association, 84(406): 502-516, June.

[3] Hastie, T., Tibshirani, R. and Friedman, J.H. (2001) The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics, New York.

[4] Lee, J.A., Lendasse, A. and Verleysen, M. (2000) Curvilinear distance analysis versus isomap, submitted to ESANN'02, Bruges.

[5] Quackenbush, J. (2001) Computational analysis of microarray data, Nat. Rev. Genet., 6: 418-427.

[6] Raychaudhuri, S., Stuart, J.M. and Altman, R.B. (2000) Principal components analysis to summarize microarray experiments: application to sporulation time series, Pac. Symp. Biocomput. 2000, 455-466.

[7] Stanford, D.C. and Raftery, A.E. (2000) Finding curvilinear features in spatial point patterns: principal curve clustering with noise, IEEE Trans. PAMI, 22(6): 601-609, June.

[8] Roweis, S.T. and Saul, L.K. (2000) Nonlinear dimensionality reduction by locally linear embedding, Science, vol. 290: 2323-2326.

[9] Sturn, A., Quackenbush, J. and Trajanoski, Z. (2002) Genesis: cluster analysis of microarray data, Bioinformatics, 207-208.

[10] Tamayo, P., et al. (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application, Proc. Natl Acad. Sci. USA, vol. 96: 2907-2912.

[11] Tenenbaum, J.B., de Silva, V. and Langford, J.C. (2000) A global geometric framework for nonlinear dimensionality reduction, Science, vol. 290: 2319-2323.

[12] Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D. and Altman, R.B. (2001) Missing value estimation methods for DNA microarrays, Bioinformatics, vol. 17(6): 520-525.

[13] Young, F. (1981) Introduction to Multidimensional Scaling: Theory, Methods, and Applications, Academic Press, New York.