Computational tractability of machine learning algorithms for tall fat data
VIKAS CHANDRAKANT RAYKAR, University of Maryland, College Park
vikas@cs.umd.edu
Huge data sets containing millions of training examples (tall data) with a large number of attributes (fat data) are relatively easy to gather. However, one of the bottlenecks for successful inference of useful information from the data is the computational complexity of machine learning algorithms. Most state-of-the-art nonparametric machine learning algorithms have a computational complexity of either O(N^2) or O(N^3), where N is the number of training examples. In my thesis I propose to explore the scalability of machine learning algorithms both with respect to the number of training examples and the number of attributes per training example, and propose linear O(N) time ε-exact approximation algorithms. The area lies at the interface of computational statistics, machine learning, approximation theory, and computational geometry.
[Dissertation proposal]: May 4, 2005
1. THE EVILS OF LEARNING WITH MASSIVE DATA SETS
During the past few decades it has become relatively easy to gather huge amounts of data, apprehensively called massive data sets. A few such examples include genome sequencing, astronomical databases, internet databases, experimental data from particle physics, medical databases, financial records, weather reports, and audio and video data. We would like to build systems which can automatically extract useful information from the raw data. Learning is a principled method for distilling predictive and therefore scientific theories from the data [Poggio and Smale 2003]. In a supervised learning scenario we are given a collection of, say, N training examples along with the desired output, based on which we learn a concept/model to draw inferences for examples not in the training set. A few canonical supervised learning tasks include regression, classification, and density estimation.
The parametric approach to learning assumes a functional form for the model to be learnt, and then estimates the unknown parameters. Once the model has been trained the training examples can be discarded. The essence of the training examples has been captured in the model parameters, using which we can draw further inferences. However, unless the form of the function is known a priori, assuming a certain form very often leads to erroneous inference. The nonparametric methods do not make any assumptions on the form of the underlying function. This is sometimes referred to as 'letting the data speak for themselves' [Wand and Jones 1995]. All the available data has to be retained while making the inference. It should be noted that nonparametric does not mean the lack of parameters, but rather that the underlying function/model of a learning problem cannot be indexed with a finite number of parameters. The number of parameters usually grows with the number of training examples.
One of the major bottlenecks for successful inference using nonparametric methods is their computational complexity. Most of the current state-of-the-art nonparametric machine learning algorithms have a computational complexity of either O(N^2) or O(N^3). This has seriously restricted their use on massive data sets. Current implementations can handle only a few thousand training examples. During the last decade, data set size has outgrown processor speed. This thesis investigates computationally efficient ways to exploit such large data sets using nonparametric machine learning algorithms. We use ideas from the scientific computing and computational geometry literature to design novel algorithms that scale as O(N).
1.1 Tall fat data
Consider a data set D of N training examples where each training instance has d attributes (for example, D = {x_i ∈ R^d, y_i}_{i=1}^N). One of the other concerns of this thesis is that our algorithms should also scale well with d. We refer to D as tall data if N ≫ d and fat data if d ≫ N. While most of the available data sets are tall, it is not uncommon to have tall fat data. A typical example of fat data is gene expression data.
2. WEIGHTED SUPERPOSITION OF KERNELS
In most kernel based machine learning algorithms [Shawe-Taylor and Cristianini 2004], Gaussian processes [Rasmussen and Williams 2006], and non-parametric statistics [Izenman 1991] the key computationally intensive task is to compute a linear combination of local kernel functions centered on the training set, i.e.,

f(x) = \sum_{i=1}^{N} q_i k(x, x_i),    (1)
where
— {x_i ∈ R^d, i = 1, ..., N} are the N training data points,
— {q_i ∈ R, i = 1, ..., N} are the weights,
— k : R^d × R^d → R is the local kernel function,
— and x ∈ R^d is the test point at which f(·) is to be computed.
The computational complexity to evaluate Equation 1 at a given test point is O(N). For kernel machines (regularized least squares [Poggio and Smale 2003], support vector machines [Cristianini and Shawe-Taylor 2000], kernel regression [Wand and Jones 1995]) f is the regression/classification function. This is a consequence of the well-known classical representer theorem [Wahba 1990], which states that the solutions of certain risk minimization problems involving an empirical risk term and a quadratic regularizer can be written as expansions in terms of the kernels centered on the training examples. In the case of Gaussian process regression [Williams and Rasmussen 1996] f is the mean prediction. For non-parametric density estimation it is the kernel density estimate [Wand and Jones 1995].
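As a concrete reference point, the following is a minimal NumPy sketch (illustrative only, not code from this proposal; the kernel choice, variable names, and sizes are assumptions) of the direct, exact evaluation of the weighted kernel sum in Equation 1 at a set of test points. Its cost is O(N) kernel evaluations per test point, which is precisely the cost the ε-exact algorithms discussed below aim to reduce.

import numpy as np

def direct_kernel_sum(X, q, Y, kernel):
    """Directly evaluate f(y_j) = sum_i q_i k(y_j, x_i) at every test point y_j.

    X      : (N, d) array of training points x_i
    q      : (N,)   array of weights q_i
    Y      : (M, d) array of test points y_j
    kernel : callable k(y, X) returning the length-N vector of kernel values
    Cost is O(MN) kernel evaluations, the bottleneck this thesis targets.
    """
    return np.array([np.dot(kernel(y, X), q) for y in Y])

# Illustrative usage with a Gaussian kernel of bandwidth h.
h = 0.5
gauss = lambda y, X: np.exp(-np.sum((X - y) ** 2, axis=1) / h ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))   # N = 1000 training points in d = 3 dimensions
q = rng.normal(size=1000)        # weights
Y = rng.normal(size=(5, 3))      # M = 5 test points
print(direct_kernel_sum(X, q, Y, gauss))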
Training the models scales as O(N^3) since most involve solving the linear system of equations

(K + \lambda I)\xi = y.    (2)

K is the N × N Gram matrix where [K]_ij = k(x_i, x_j). Also, many kernel methods in unsupervised learning, like kernel principal component analysis [Smola et al. 1996], spectral clustering [Chung 1997], and Laplacian eigenmaps, involve computing the eigenvalues of the Gram matrix. Solutions to such problems can be obtained using iterative methods, where the dominant computation is the evaluation of f(x).
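To make the link between training and Equation 1 concrete, here is a hedged sketch (an illustration, not the implementation proposed in this thesis) of solving the regularized system of Equation 2 with conjugate gradients in SciPy. Each iteration needs only the matrix-vector product Kv, i.e., N weighted kernel sums, so replacing the exact O(N^2) product below with an O(N) ε-exact one makes each iteration linear in N.

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def gaussian_gram_matvec(X, v, h):
    """Exact O(N^2) product K v with [K]_ij = exp(-||x_i - x_j||^2 / h^2).
    This is the step a fast eps-exact method (e.g., the improved fast Gauss
    transform) would replace with an O(N) approximation."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / h ** 2) @ v

def train_regularized_least_squares(X, y, h=0.5, lam=1e-2):
    """Solve (K + lam I) xi = y with conjugate gradients; only matvecs are needed."""
    N = X.shape[0]
    A = LinearOperator((N, N),
                       matvec=lambda v: gaussian_gram_matvec(X, v, h) + lam * v,
                       dtype=float)
    xi, info = cg(A, y)
    assert info == 0, "CG did not converge"
    return xi

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)
xi = train_regularized_least_squares(X, y)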
Recently, such nonparametric problems have been collectively referred to as N-body problems in learning [Gray and Moore 2001], in analogy with the coulombic, magnetostatic, and gravitational N-body potential problems occurring in computational physics [Greengard 1994]. These problems require the calculation of all pairwise interactions in a large ensemble of particles.
3. ε-EXACT APPROXIMATION
In general we need to evaluate Equation 1 at M points {y_j ∈ R^d, j = 1, ..., M}, i.e.,

f(y_j) = \sum_{i=1}^{N} q_i k(y_j, x_i),  j = 1, ..., M,    (3)

leading to a quadratic complexity of O(MN). We will develop fast ε-exact algorithms that compute the sum in Equation 3 approximately in linear O(M + N) time. The algorithm is ε-exact in the sense made precise below.
Definition 3.1. For any given ε > 0, \hat{f} is an ε-exact approximation to f if the maximum absolute error relative to the total weight Q = \sum_{i=1}^{N} |q_i| is upper bounded by ε, i.e.,

\max_{y_j} \frac{|\hat{f}(y_j) - f(y_j)|}{Q} \le \epsilon.    (4)

The constant in O(M + N) depends on the desired accuracy ε, which however can be arbitrary. In fact, for machine precision accuracy there is no difference between the direct and the fast methods.
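The definition translates directly into a test that can be run against any candidate fast method. The short sketch below is illustrative; f_exact stands for the direct O(MN) evaluation and f_approx for the output of whatever ε-exact routine is being checked, both assumed to be NumPy arrays over the same M target points.

import numpy as np

def is_eps_exact(f_exact, f_approx, q, eps):
    """Check Definition 3.1: max_j |f_hat(y_j) - f(y_j)| / Q <= eps, with Q = sum_i |q_i|."""
    Q = np.sum(np.abs(q))
    max_rel_err = np.max(np.abs(f_approx - f_exact)) / Q
    return max_rel_err <= eps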
3.1 Matrix-vector multiplication
The sum in Equation 3 can be thought of as a matrix-vector multiplication
f = Kq,    (5)

where K is an M × N matrix whose entries are of the form [K]_ij = k(y_j, x_i) and q = [q_1, ..., q_N]^T is an N × 1 column vector.
Definition 3.2. A dense matrix of order M × N is called a structured matrix if its entries depend only on O(M + N) parameters.
The reason we will be able to get O(M + N) algorithms to compute the matrix-vector multiplication is that the matrix K is a structured matrix, since all the entries of the matrix are determined by the set of M + N points {x_i}_{i=1}^N and {y_j}_{j=1}^M. If the entries of the matrix K are completely random then we cannot do any better than O(MN)¹.

¹This statement should be considered in a deterministic sense. It may be possible to derive ε-exact approximation algorithms in a probabilistic sense.
4. GAUSSIAN KERNEL
The most commonly used kernel function is the Gaussian kernel

k(x, y) = e^{-\|x - y\|^2 / h^2},    (6)

where h is called the bandwidth of the kernel. The bandwidth h controls the degree of smoothing, of noise tolerance, and of generalization. The Gaussian kernel is a local kernel in the sense that \lim_{\|x - y\| \to \infty} k(x, y) = 0.
The sum of multivariate Gaussian kernels is known as the discrete Gauss transform in the scientific computing literature. More formally, for each target point {y_j ∈ R^d}, j = 1, ..., M, the discrete Gauss transform is defined as

G(y_j) = \sum_{i=1}^{N} q_i e^{-\|y_j - x_i\|^2 / h^2},    (7)

where {q_i ∈ R}, i = 1, ..., N, are the source weights, {x_i ∈ R^d}, i = 1, ..., N, are the source points, i.e., the centers of the Gaussians, and h ∈ R+ is the source scale or bandwidth. In other words, G(y_j) is the total contribution at y_j of N Gaussians centered at the x_i, each with bandwidth h. Each Gaussian is weighted by the term q_i.
The computational complexity to evaluate the discrete Gauss transform (Equation 7) at M target points is O(MN). This makes the computation for large scale problems prohibitively expensive. In many machine learning tasks data sets containing more than 10^4 points are already common and larger problems are of interest.
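For reference, the direct (exact) discrete Gauss transform takes only a few lines of NumPy; the sketch below (illustrative, not the FGT or IFGT code) makes the O(MN) cost explicit, since it forms all M x N pairwise squared distances.

import numpy as np
from scipy.spatial.distance import cdist

def direct_gauss_transform(X, q, Y, h):
    """Exact discrete Gauss transform G(y_j) = sum_i q_i exp(-||y_j - x_i||^2 / h^2).

    X : (N, d) source points, q : (N,) source weights,
    Y : (M, d) target points, h : bandwidth.
    Forms the full M x N kernel matrix, hence O(MN) time and memory.
    """
    sq_dists = cdist(Y, X, metric="sqeuclidean")   # (M, N) pairwise squared distances
    return np.exp(-sq_dists / h ** 2) @ q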
The Fast Gauss Transform (FGT) is an ε-exact approximation algorithm that reduces the computational complexity to O(M + N), at the expense of reduced precision, which however can be arbitrary. The constant depends on the desired precision, the dimensionality of the problem, and the bandwidth. Given any ε > 0, it computes an approximation \hat{G}(y_j) to G(y_j) such that the maximum absolute error relative to the total weight Q = \sum_{i=1}^{N} |q_i| is upper bounded by ε, i.e.,

\max_{y_j} \frac{|\hat{G}(y_j) - G(y_j)|}{Q} \le \epsilon.    (8)

5. PROBLEMS SUCCESSFULLY ADDRESSED
As of now the following milestones have been achieved. Brief summaries of the problems are given below. The detailed reports are attached along with this proposal.
5.1 Fast computation of sums of Gaussians
[Raykar et al. 2005] [Raykar and Duraiswami 2005a]
For the FGT the constant factor in O(M + N) grows exponentially with increasing dimensionality d, which makes the algorithm impractical for dimensions greater than three. The FGT was first proposed by [Greengard and Strain 1991] and applied successfully to a few lower dimensional applications in mathematics and physics. However, the algorithm has not been widely used in statistics, pattern recognition, and machine learning applications where higher dimensions occur commonly. An important reason for the lack of use of the algorithm in these areas is that the performance of the proposed FGT degrades exponentially with increasing dimensionality, which makes it impractical for statistics, machine learning, and pattern recognition applications.
We propose a linear time ε-exact approximation algorithm for the Gaussian kernel which scales well with dimensionality. The constant factor is reduced to asymptotically polynomial order. The reduction is based on a new multivariate Taylor series expansion scheme (which can act both as a local as well as a far-field expansion) combined with efficient space subdivision using the k-center algorithm. The proposed method differs from the original fast Gauss transform in terms of a different factorization, efficient space subdivision, and the use of point-wise error bounds. Algorithm details, error bounds, the procedure to choose the parameters, and numerical experiments are presented.
5.2 Multivariate kernel density estimation
[Raykar et al. 2005]
As an example we show how the IFGT can be used for very fast ε-exact multivariate kernel density estimation. We also show the effect of the ε-exact approximation on the performance of the estimator.
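Kernel density estimation is the discrete Gauss transform with uniform weights and a normalization constant, which is why a fast Gauss transform accelerates it directly. Below is a minimal sketch of the exact estimator (illustrative only; the normalization assumes the kernel convention of Equation 6, and an IFGT routine would replace the explicit kernel matrix to obtain the O(M + N) version).

import numpy as np
from scipy.spatial.distance import cdist

def gaussian_kde_direct(X, Y, h):
    """Exact multivariate Gaussian KDE at target points Y given samples X.

    p_hat(y_j) = 1 / (N (pi h^2)^(d/2)) * sum_i exp(-||y_j - x_i||^2 / h^2),
    matching the kernel of Equation 6 (other texts put h^2/2 in the exponent).
    The sum is the discrete Gauss transform of Equation 7 with uniform weights.
    """
    N, d = X.shape
    norm = N * (np.pi * h ** 2) ** (d / 2)
    gauss_transform = np.exp(-cdist(Y, X, "sqeuclidean") / h ** 2).sum(axis=1)
    return gauss_transform / norm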
5.3 Optimal bandwidth estimation for KDE

[Raykar and Duraiswami 2005b] [Raykar and Duraiswami 2006]
Efficient use of KDE requires the optimal selection of the smoothing parameter, called the bandwidth h of the kernel. For a practical implementation of KDE the choice of the bandwidth h is very important. A small h leads to an estimator with small bias and large variance. A large h leads to a small variance at the expense of an increase in bias. The bandwidth h has to be chosen optimally.
Most state-of-the-art automatic bandwidth selection procedures require estimation of quantities involving the density derivatives. The computational complexity of evaluating the density derivative at M evaluation points given N sample points from the density scales as O(MN). We propose a computationally efficient ε-exact approximation algorithm for univariate Gaussian kernel based density derivative estimation that reduces the computational complexity from O(MN) to linear O(M + N).

We apply the density derivative evaluation procedure to estimate the optimal bandwidth for kernel density estimation, a process that is often intractable for large data sets. We demonstrate the speedup achieved on bandwidth selection using the solve-the-equation plug-in method. We also demonstrate that the proposed procedure can be extremely useful for speeding up exploratory projection pursuit techniques.
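For the univariate Gaussian kernel, the density derivative estimates that such plug-in selectors need are themselves weighted Gaussian-like sums, which is why they fit the same fast summation framework. The sketch below is illustrative and only covers the first derivative computed directly in O(MN); the proposed algorithm handles sums of this general form, including higher derivatives, in O(M + N).

import numpy as np

def kde_first_derivative_direct(x, y, h):
    """Direct O(MN) estimate of the first derivative of a univariate Gaussian KDE.

    With p_hat(y) = 1 / (N h sqrt(pi)) * sum_i exp(-(y - x_i)^2 / h^2)
    (same kernel convention as Equation 6), the derivative is
    p_hat'(y) = -2 / (N h^3 sqrt(pi)) * sum_i (y - x_i) exp(-(y - x_i)^2 / h^2).

    x : (N,) sample points, y : (M,) evaluation points, h : bandwidth.
    """
    N = x.size
    diff = y[:, None] - x[None, :]                  # (M, N) matrix of y_j - x_i
    weighted = diff * np.exp(-diff ** 2 / h ** 2)   # Gaussian-times-polynomial sum
    return -2.0 / (N * h ** 3 * np.sqrt(np.pi)) * weighted.sum(axis=1)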
6. WORK IN PROGRESS

6.1 Fast large scale Gaussian process regression
Gaussian processes allow the treatment of non-linear non-parametric regression problems in a Bayesian framework. However, the computational cost of training such a model with N examples scales as O(N^3). Iterative methods can bring this cost down to O(N^2), which is still prohibitive for large data sets. In this paper we use an ε-exact approximation technique, the improved fast Gauss transform, to reduce the computational complexity to O(N) for the squared exponential covariance function. Using the theory of inexact Krylov subspace methods we show how to adaptively increase ε as the iterations progress. For prediction at M points the computational complexity is reduced from O(NM) to O(M + N). We demonstrate the speedup achieved on large synthetic and real data sets. We also show how the hyperparameters can be chosen in O(N) time. Unlike methods which rely on selecting a subset of the data set, we use all the available data and still get substantial speedups.
6.2 Implicit surface fitting via Gaussian processes
We use Gaussian process regression, an extremely powerful technique from machine learning, for surface reconstruction from point cloud data. The surface is defined implicitly as the zero level set of a suitable function. This function is learnt from the given point cloud data. Rather than estimating the value of a function at a point, this method gives the entire predictive distribution in a Bayesian framework. As a result we obtain surface representations together with valid estimates of uncertainty. The computed uncertainty not only reflects the noise in the point cloud data but also the effect of irregular sampling of the surface during the scanning process. This fact is shown to be very useful for sparse representation of surfaces, mesh saliency representation, mesh simplification, and real-time re-scanning of the surface based on the computed uncertainties. One of the major bottlenecks for most implicit surface methods is their prohibitive computational complexity. This is no different for Gaussian process regression, the computational complexity of which scales as O(N^3). In this paper we use two different ε-exact approximation techniques, the improved fast Gauss transform and the dual-tree methods, both of which reduce the computational complexity to O(N). Thus we are able to handle data sets containing millions of points.
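As a point of reference for this section, the following is a hedged sketch of exact O(N^3) Gaussian process prediction with the Gaussian kernel (illustrative only; the kernel form, the noise level, and the use of signed targets that vanish on the surface are assumptions, and the contribution described above is replacing the dense kernel algebra with ε-exact IFGT or dual-tree summations). The reconstructed surface would be the zero level set of the returned mean, with the returned variance giving the per-point uncertainty.

import numpy as np
from scipy.linalg import solve_triangular
from scipy.spatial.distance import cdist

def gp_predict(X, y, Y, h=0.5, noise=1e-2):
    """Exact O(N^3) Gaussian process regression with the Gaussian kernel
    k(x, x') = exp(-||x - x'||^2 / h^2).

    X : (N, d) training inputs, y : (N,) targets, Y : (M, d) query points.
    Returns the posterior mean and variance at the query points.
    """
    K = np.exp(-cdist(X, X, "sqeuclidean") / h ** 2) + noise * np.eye(len(X))
    Ks = np.exp(-cdist(Y, X, "sqeuclidean") / h ** 2)        # (M, N) cross-kernel
    L = np.linalg.cholesky(K)                                # K = L L^T
    alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True), lower=False)
    mean = Ks @ alpha
    v = solve_triangular(L, Ks.T, lower=True)                # L^{-1} Ks^T, shape (N, M)
    var = 1.0 - np.sum(v ** 2, axis=0)                       # k(y, y) = 1 for this kernel
    return mean, var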
7. FUTURE WORK

The following problems are among those that I wish to formulate well and solve in the course of this thesis, and in the future.
(1) Dual-tree methods. Another class of methods recently proposed for the same problems are the dual-tree methods [Gray and Moore 2003]. These methods rely only on data structures like kd-trees and ball trees and not on series expansions. The dual-tree methods give good speedup for small bandwidths while the series based methods such as the IFGT give good speedup for large bandwidths. I would like to explore whether the two different techniques can be combined to give good speedups both for large and small bandwidths.
(2) Inexact eigenvalue methods. Many kernel methods in unsupervised learning, like kernel principal component analysis [Smola et al. 1996], spectral clustering [Chung 1997], and Laplacian eigenmaps, involve computing the eigenvalues of the Gram matrix. Solutions to such problems can be obtained using iterative methods [Saad 1992]. The dominant computation is the matrix-vector multiplication, for which we use the IFGT. We need to explore how the approximations affect the convergence of the iterative methods, as sketched below.
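A minimal sketch of the setup (illustrative only; exact_gram_matvec is the plain O(N^2) product, and an IFGT-based ε-exact matvec would be passed in its place to study how the approximation affects convergence):

import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh
from scipy.spatial.distance import cdist

def exact_gram_matvec(X, v, h):
    """Exact O(N^2) product K v with the Gaussian Gram matrix."""
    return np.exp(-cdist(X, X, "sqeuclidean") / h ** 2) @ v

def top_eigenpairs(X, h=0.5, k=5, matvec=exact_gram_matvec):
    """Largest k eigenpairs of the Gram matrix via a Lanczos-type iteration
    that only ever calls matvec(X, v, h); swapping in an approximate matvec
    turns this into an inexact eigenvalue method."""
    N = X.shape[0]
    A = LinearOperator((N, N), matvec=lambda v: matvec(X, v, h), dtype=float)
    return eigsh(A, k=k, which="LM")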