Computational tractability of machine learning algorithms for tall fat data
VIKAS CHANDRAKANT RAYKAR,University of Maryland,CollegePark
vikas@cs.umd.edu
Huge data ts containing millions of training examples(tall data)with a large number of at-tributes(fat data)are relatively easy to gather.However one of the bottlenecks for successful inference of uful information from the data is the computational complexity of machine learning algorithms.Most state-of-the-art nonparametric machine learning algorithms have a computa-tional complexity of either O(N2)or O(N3),where N is the number of training examples.In my thesis I propo to explore the scalability of machine learning algorithms both with respect to the number of training examples and the number of attributes per training example and propo lin-ear O(N)time -exact approximation algorithms.The area lies at the interface of computational statistics,machine learning,approximation theory,and computational geometry.
[Disrtation proposal]:May4,2005
衣服折纸1.THE EVILS OF LEARNING WITH MASSIVE DATA SETS
垃圾分类宣传单
During the past few decades is has become relatively easy to gather huge amount of data,apprehensively called-massive data ts.A few such examples include genome quencing,astronomical databas,internet databas,experimental data from particle physics,medical databas,financial records,weather reports,audio and video data.We would like to build systems which can automatically extract uful information from the raw data.Learning is a principled method for distilling predictive and therefore scientific theories from the data[Poggio and Smale2003]. In a supervid learning scenario we are given a collection of say N training examples along with the desired output bad on which we learn a concept/model to draw inferences for examples not in the training t.A few canonical supervid learning tasks include regression,classification,and density estimation.
The parametric approach to learning assumes a functional form for the model to be learnt,and then estimates the unknown parameters.Once the model has been trained the training examples can be discarded.The esnce of the training examples have been captured in the model parameters,using which we can draw further inferences.However,unless the form of the function is known a priori, assuming a certain form very often leads to erroneous inference.The nonparametric methods do not make any assumptions on the form of the underlying function.This is sometimes referred to as’letting
the data speak for themlves’[Wand and Jones 1995].All the available data has to be retained while making the inference.It should be noted that nonparametric does not mean the lack of parameters,but rather that the underlying function/model of a learning problem cannot be indexed with afinite number of parameters.The number of parameters usually grows with the number of training data.
One of the major bottlenecks for successful inference using nonparametric meth-ods is their computational complexity.Most of the current stat-of-the-art nonpara-
The art of getting good enough solutions as fast as possible.
2·Vikas Chandrakant Raykar
metric machine learning algorithms have the computational complexity of either O(N2)or O(N3).This has riously restricted the u of massive data ts.Cur-rent implementations can handle only a few thousands of training examples.During the last decade,data t size has outgrown processor speed.This thesis investigates computationally efficient ways to exploit such large data ts using nonparametric machine learning algorithms.We u ideas from scientific computing and compu-tational geometry literature to design novel algorithms that scale as O(N).
1.1Tall fat data
Consider a data t D of N training examples where each training instance has d attributes(for example D={x i∈R d,y i}N i=1).One of the other concerns of this thesis is that our algorithms should also scale well with d.We refer to D as tall data if N>>d and fat data if d>>N.While most of the available data ts are tall,it is not uncommon to have tall fat data.A typical example for fat data would be gene expression data.
2.WEIGHTED SUPERPOSITION OF KERNELS
In most kernel bad machine learning algorithms[Shawe-Taylor and Cristianini 2004],Gaussian process[Rasmusn and Williams2006],and non-parametric statistics[Izenman1991]the key computationally intensive task is to compute a linear combination of local kernel functions centered on the training ,
f(x)=
N教育漫话
i=1
q i k(x,x i),(1)
where
—{x i∈R d,i=1,...,N}are the N training data points,
—{q i∈R,i=1,...,N}are the weights,
—k:R d×R d→R is the local kernel function,
—and x∈R d is the test point at which f(.)is to be computed.
The computational complexity to evaluate Equation1at a given test point is O(N). For kernel ularized least squares[Poggio and Smale2003],sup-port vector machines[Cristianini and Shawe-Taylor2000],kernel regression[Wand and Jones1995])f is the regression/classification function.This is a conquence of the well known classical reprenter theorem[Wabha1990]which states that the so-lutions of certain risk minimization problems involving an empirical risk term and a quadratic regularizer can be written as expansions in terms of the kernels centered on the training examples.In ca of Gaussian process regression[Williams and Rasmusn1996]f is the mean prediction.For non-parametric density estimation it is the kernel densi
ty estimate[Wand and Jones1995].
Training the models scales as O(N3)since most involve solving the linear sys-tem of equation
(K+λI)ξ=y.(2) K is the N×N Gram matrix where[K]ij=k(x i,x j).Also many kernel methods in unsupervid learning like kernel principal component analysis[Smola et al.1996], Disrtation proposal
The art of getting good enough solutions as fast as possible .·3
spectral clustering [Chung 1997],and Laplacian eigenmaps involve computing the eigen values of the Gram matrix.Solutions to such problems can be obtained using iterative methods,where the dominant computation is evaluation of f (x ).
Recently,such nonparametric problems have been collectively referred to as N -body problems in learning [Gray and Moore 2001],in analogy with the coulombic,magnetostatic,and gravitational N -body potential problems occurring in compu-tational physics [Greengard 1994].The problems require the calculation of all pairwi interactions in a large enmble of particles.
3. -EXACT APPROXIMATION
In general we need to evaluate Equation 1at M points {y j ∈R d ,j =1,...,M },i.e.,
f (y j )=N
i =1q i k (y j ,x i )j =1,...,M,(3)
leading to the quadratic complexity of O (MN ).We will develop fast -exact al-gorithms that compute the sum 3approximately in linear O (M +N )time.The algorithm is -exact in the n made preci below.
Definition 3.1.For any given >0, f is an −exact approximation to f if the maximum absolute error relative to the total weight Q = N i =1|q i |is upper
bounded by ,i.e.,max y j |ˆf (y j )−f (y j )|Q
≤ .(4)The constant in O (M +N ),depends on the desired accuracy ,which however can be arbitrary .In fact for machine precision accuracy there is no difference between the direct and the fast methods.
3.1Matrix vector multiplication
采纳营销
The sum in equation 3can be thought of as a matrix-vector multiplication非典怎么消失的
f =K q,(5)
where K is a M ×N matrix the entries of which are of the form [K ]ij =k (y j ,x i )and q =[q 1,...,q N ]T is a N ×1column vector.
Definition 3.2.A den matrix of order M ×N is called a structured matrix if its entries depend only on O (M +N )parameters.
The reason we will be able to get O (M +N )algorithms to compute the matrix-vector multiplication is that the matrix K is a structured matrix,since all the entries
of the matrix are determined by the t of M +N points {x i }N i =1and {y j }M i =1.If
the entries of the of the matrix K are completely random than we cannot do any better than O (MN )1.
1This statement should be considered in a deterministic n.It may be possible to derive -exact approximation algorithms in a probabilistic n
Disrtation proposal
4·Vikas Chandrakant Raykar
rrb4004.GAUSSIAN KERNEL
The most commonly ud kernel function is the Gaussian kernel
k (x,y )=e − x −y 2/h 2,(6)
where h is called the bandwidth of the kernel.The bandwidth h controls the degree of smoothing,of noi tolerance,and of generalization.The Gaussian kernel is a local kernel in the n that lim x −y →∞k (x,y )=0.
The sum of multivariate Gaussian kernels is known as the discrete Gauss trans-form in the scientific computing literature.More formally,for each target point {y j ∈R d }j =1,...,M the discrete Gauss transform is defined as,
G (y j )=N
i =1q i e − y j −x i 2/h 2,(7)
where {q i ∈R }i =1,...,N are the source weights ,{x i ∈R d }i =1,...,N are the source points ,i.e.,the center of the Gaussians,and h ∈R +is the source scale or bandwidth .In other words G (y j )is the total contribution at y j of N Gaussians centered at x i each with bandwidth h .Each Gaussian is weighted by the term q i .
The computational complexity to evaluate the discrete Gauss transform (Equa-tion 7)at M target points is O (MN ).This makes the computation for large scale problems prohibitively expensive.In many machine learning tasks data-ts containing more than 104points are already common and larger problems are of interest.
The Fast Gauss Transform (FGT)is an −exact approximation algorithm that reduces the computational complexity to O (M +N ),at the expen of reduced precision,which however can be arbitrary.The constant depends on the desired precision,dimensionality of the problem,and the bandwidth.Given any >0,it computes an approximation ˆG (y j )to G (y j )such that the maximum absolute error relative to the total weight Q = N i =1|q i |is upper bounded by ,i.e.,max y j |ˆG (y j )−G (y j )|Q
≤ .(8)5.PROBLEMS SUCCESSFULLY ADDRESSED
As of now the following milestones have been achieved.Brief summaries of the problems are given below.The detailed reports are attached along with this pro-posal.
5.1Fast computation of sums of Gaussians
赭山[Raykar et al.2005][Raykar and Duraiswami 2005a ]
For the FGT the constant factor in O (M +N )grows exponentially with increasing dimensionality d ,which makes the algorithm impractical for dimensions greater than three.The FGT was first propod by [Greengard and Strain 1991]and applied successfully to a few lower dimensional applications in mathematics and physics.However the algorithm has not been widely ud much in statistics,pat-tern recognition,and machine learning applications where higher dimensions occur commonly.An important reason for the lack of u of the algorithm in the areas is Disrtation proposal
The art of getting good enough solutions as fast as possible.·5 that the performance of the propod FGT degrades exponentially with increasing dimensionality,which makes it impractical for the statistics,machine learning,and pattern recognition applications.
We propo a linear time −exact approximation algorithm for the Gaussian kernel which scales well with dimensionality.The constant factor is reduced to asymptotically polynomial order.The reduction is bad on a new multivariate Taylor’s ries expansion(which can act both as a local as well as a farfield ex-pansion)scheme combined with the efficient space subdivision using the k-center algorithm.The propod method differs from the original fast Gauss transform in terms of a different factorization,efficient space subdivision,and the u of point-wi error bounds.Algorithm details,error bounds,procedure to choo the parameters and numerical experiments are prented.
5.2Multivariate kernel density estimation
奶茶咋做
[Raykar et al.2005]
As an example we show how the IFGT can be ud for very fast -exact multivariate kernel density estimation.We also show the effect of -exact approximation on the performance of the estimator.
5.3Optimal bandwidth estimation for KDE
[Raykar and Duraiswami2005b][Raykar and Duraiswami2006]
Efficient u of KDE requires the optimal lection of the smoothing parameter called the bandwidth
h of the kernel.For a practical implementation of KDE the choice of the bandwidth h is very important.Small h leads to an estimator with small bias and large variance.Large h leads to a small variance at the expen of increa in bias.The bandwidth h has to be chon optimally.
Most state-of-the-art automatic bandwidth lection procedures require estima-tion of quantities involving the density derivatives.The computational complexity of evaluating the density derivative at M evaluation points given N sample points from the density scales as O(MN).We propo a computationally efficient −exact approximation algorithm for the univariate Gaussian kernel bad density deriva-tive estimation that reduces the computational complexity from O(MN)to linear O(M+N).
We apply the density derivative evaluation procedure to estimate the optimal bandwidth for kernel density estimation,a process that is often intractable for large data ts.We demonstrate the speedup achieved on the bandwidth lection using the solve-the-equation plug-in method.We also demonstrate that the propod procedure can be extremely uful for speeding up exploratory projection pursuit techniques.
6.WORK IN PROGRESS
6.1Fast large scale Gaussian process regression
Gaussian process allow treatment of non-linear non-parametric regression prob-lems in a Bayesian framework.However the computational cost of training such a model with N examples scales as O(N3).Iterative methods can bring this cost down to O(N2),which is still prohibitive for large datats.In this paper we u
Disrtation proposal
6·Vikas Chandrakant Raykar
an -exact approximation technique,the improved fast Gauss transform to reduce the computational complexity to O(N),for the squared exponential covariance function.Using the theory of inexact Krylov subspace methods we show how to adaptively increa as the iteration progress.For prediction at M points the computational complexity is reduced from O(NM)to O(M+N).We demonstrate the speedup achieved on large synthetic and real datats.We also show how the hyperparameters can be chon in O(N)time.Unlike methods which rely on -lecting a subt of the data t we u all the available data and still get substantial speedups.
6.2Implicit surfacefitting via Gaussian process
We u Gaussian process regression,an extremely powerful technique from machine learning,for surface reconstruction from point cloud data.The surface is defined implicity as the zero level t of a suitable function.This function is learnt from the given point cloud data.Rather than estimating the value of a function at a point, this method gives the entire predictive distribution in a Bayesian framework.As a result we obtain surface reprentations together with valid estimates of uncertainty. The computed uncertainty not only reflects the noi in the point cloud data but also the effect of irregular sampling of the surface during the scanning process.This fact is shown to be very uful for spar reprentation of surface,mesh saliency reprentation,mesh simplification,and real-time re-scanning the surface bad on the computed uncertainties.One of the major bottlenecks for most implicit surface methods is that their prohibitive computational complexity.This is no different for Gaussian process regression,the computational complexity of which scales as O(N3).In this paper we u two different -exact approximation techniques,the improved fast Gauss transform and the dual-tree methods both of which reduce the computational complexity to O(N).Thus we are able to handle data-ts containing millions of points.
7.FUTURE WORK
The following problems are among tho that I wish to formulate well and solve in the cour of this t
hesis,and in the future.
(1)Dual-tree methods Another class of methods recently propod for the same
problems are the dual-tree methods[Gray and Moore2003].The methods rely only on data structures like kd-trees and ball tress and not on ries expansions.
The dual-tree methods give good speedup for small bandwidths while the ries bad methods such as IFGT give good speedup for large bandwidths.I would like to explore whether the two different techniques can be combined together to give good speedups both for large and small bandwidths.
(2)Inexact eigenvalue methods Many kernel methods in unsupervid learning
like kernel principal component analysis[Smola et al.1996],spectral cluster-ing[Chung1997],and Laplacian eigenmaps involve computing the eigen values of the Gram matrix.Solutions to such problems can be obtained using iterative methods[Saad1992].The dominant computation is the matrix-vector multipli-cation for which we u the IFGT.We need to explore how the approximations affect the convergence of the iterative methods.
Disrtation proposal