Stat Comput
DOI 10.1007/s11222-008-9111-x
Joint covariate selection and joint subspace selection for multiple classification problems
Guillaume Obozinski · Ben Taskar · Michael I. Jordan
Received: 9 October 2007 / Accepted: 1 December 2008
© The Author(s) 2008. This article is published with open access
Abstract We address the problem of recovering a common set of covariates that are relevant simultaneously to several classification problems. By penalizing the sum of $\ell_2$ norms of the blocks of coefficients associated with each covariate across different classification problems, similar sparsity patterns in all models are encouraged. To take computational advantage of the sparsity of solutions at high regularization levels, we propose a blockwise path-following scheme that approximately traces the regularization path. As the regularization coefficient decreases, the algorithm maintains and updates concurrently a growing set of covariates that are simultaneously active for all problems. We also show how to use random projections to extend this approach to the problem of joint subspace selection, where multiple predictors are found in a common low-dimensional subspace. We present theoretical results showing that this random projection approach converges to the solution yielded by trace-norm regularization. Finally, we present a variety of experimental results exploring joint covariate selection and joint subspace selection, comparing the path-following approach to competing algorithms in terms of prediction accuracy and running time.

G. Obozinski, Department of Statistics, University of California at Berkeley, 367 Evans Hall, Berkeley, CA 94720-3860, USA. e-mail: gobo@stat.berkeley.edu

B. Taskar, Department of Computer and Information Science, University of Pennsylvania, 3330 Walnut Street, Philadelphia, PA 19104-6389, USA. e-mail: taskar@cis.upenn.edu

M. I. Jordan, Department of Statistics and Department of Electrical Engineering and Computer Science, University of California at Berkeley, 367 Evans Hall, Berkeley, CA 94720-3860, USA. e-mail: jordan@stat.berkeley.edu
Keywords Variable selection · Subspace selection · Lasso · Group Lasso · Regularization path · Supervised dimensionality reduction · Multitask learning · Block norm · Trace norm · Random projections
1 Introduction
The problem of covariate selection for regression and classification has been the focus of a substantial literature. As with many model selection problems, the problem is rendered difficult by the disparity between the large number of models to be considered and the comparatively small amount of data available to evaluate these models. One approach to the problem focuses on procedures that search within the exponentially large set of all subsets of components of the covariate vector, using various heuristics such as forward or backward selection to limit the search (Draper and Smith 1998). Another approach treats the problem as a parameter estimation problem in which the shrinkage induced by a constraint on the $\ell_1$ norm of the parameter vector yields estimates in which certain components are equal to zero (Tibshirani 1996; Fu and Knight 2000; Donoho 2004). A virtue of the former approach is that it focuses on the qualitative decision as to whether a covariate is relevant to the problem at hand, a decision which is conceptually distinct from parameter estimation. A virtue of the latter approach is its computational tractability.
In this paper, we focus on a problem setting in which these virtues appear to be better aligned than they are in general regression and classification problems. In particular, we focus on situations involving multiple, related data sets in which the same set of covariates are present in each data set but where the responses differ. In this multi-response setting it is natural to associate a notion of “relevance” to a covariate that is conceptually distinct from the numerical value of a parameter. For example, a particular covariate may appear with a positive coefficient in predicting one response variable and with a negative coefficient in predicting a different response. We would clearly want to judge such a covariate as being “relevant” to the overall class of prediction problems without making a commitment to a specific value of a parameter. In general we wish to “borrow strength” across multiple estimation problems in order to support a decision that a covariate is to be selected.
Our focus in this paper is the classification or discrimination problem. Consider, for example, the following pattern recognition problem that we consider later in Sect. 6. We assume that we are given a data set consisting of pixel-level or stroke-level representations of handwritten characters and we wish to classify a given character into one of a fixed set of classes. In this optical character recognition (OCR) problem, there are several thousand covariates, most of which are irrelevant to the classification decision of character identity. To support the choice of relevant covariates in this high-dimensional problem, we consider an extended version of the problem in which we assume that multiple data sets are available, one for each individual in a set of writers. We expect that even though the styles of individual writers may vary, there should be a common subset of image features (pixels, strokes) that form a shared set of useful covariates across writers.
As another example of our general setting, also discussed in Sect. 6, consider a DNA microarray analysis problem in which the covariates are levels of gene expression and the responses are phenotypes or cellular processes (Khan et al. 2001). Given the high-dimensional nature of microarray data sets, covariate selection is often essential both for scientific understanding and for effective prediction. Our proposal is to approach the covariate selection problem by considering multiple related phenotypes—e.g., related sets of cancers—and seeking to find covariates that are useful in predicting the multiple response variables.
Our approach to the simultaneous covariate selection problem is an adaptation of $\ell_1$ shrinkage methods such as LASSO. Briefly, for each data set $\{(x_i^k, y_i^k) : i = 1, \ldots, N_k\}$, where $k \in \{1, \ldots, K\}$ indexes data sets, we fit a model involving a parameter vector $w^k \in \mathbb{R}^p$. View these vectors as rows of a $K \times p$ matrix $W$, and consider the $j$th column vector, $w_j$, of $W$. This vector consists of the set of parameters associated to the $j$th covariate across all classification problems. We now define a regularization term that is an $\ell_1$ sum of the $\ell_2$ norms of the covariate-specific parameter vectors $w_j$. Each of the $\ell_2$ norms can be viewed as assessing the overall relevance of a particular covariate. The $\ell_1$ sum then enforces a selection among covariates based on these norms.
This approach is a particular case of a general methodology in which block norms are used to define groupings of variables in regression and classification problems (Bach et al. 2004; Yuan and Lin 2006; Park and Hastie 2006; Meier et al. 2008; Kim et al. 2006; Zhao et al. 2008). However, the focus in this literature differs from ours in that it is concerned with grouping variables within a single regression or classification problem. For example, in a polynomial regression we may wish to group the linear, quadratic and cubic terms corresponding to a specific covariate and select these terms jointly. Similarly, in an ANOVA model we may wish to group the indicator variables corresponding to a specific factor. The block-norm approach to these problems is based on defining block norms involving hybrids of $\ell_1$, $\ell_2$ and $\ell_\infty$ norms as regularization terms.
Argyriou et al. (2008) have independently proposed the use of a block $\ell_1/\ell_2$ norm for covariate selection in the multiple-response setting. Moreover, they consider a more general framework in which the variables that are selected are linear combinations of the original covariates. We refer to this problem as joint subspace selection. Joint covariate selection is a special case in which the subspaces are restricted to be axis-parallel. Argyriou et al. show that the general subspace selection problem can be formulated as an optimization problem involving the trace norm.
Our contribution relative to Argyriou et al. is as follows. First, we note that the trace norm is difficult to optimize computationally (it yields a non-differentiable functional that is generally evaluated by the computation of a singular value decomposition at each step of a nonlinear optimization procedure; Srebro et al. 2005b), and we thus focus on the special case of covariate selection, where it is not necessary to use the trace norm. For the case of covariate selection we show that it is possible to develop a simple homotopy-based approach that evaluates an entire regularization path efficiently (cf. Efron et al. 2004; Osborne et al. 2000). We present a theoretical result establishing the convergence of this homotopy-based method. Moreover, for the general case of joint subspace selection we show how random projections can be used to reduce the problem to covariate selection. Applying our homotopy method for joint covariate selection to the random projections, we obtain a computationally efficient procedure for joint subspace selection. We also present a theoretical result showing that this approach approximates the solution obtained from the trace norm. Finally, we present several experiments on large-scale datasets that compare and contrast various methods for joint covariate selection and joint subspace selection.
The general problem of jointly estimating models from multiple, related data sets is often referred to as “transfer learning” or “multi-task learning” in the machine learning literature (Maurer 2006; Ben-David and Schuller-Borbely 2008; Argyriou et al. 2008; Jebara 2004; Evgeniou and Pontil 2004; Torralba et al. 2004; Ando and Zhang 2005). We adopt the following terminology from this literature: a task is defined to be a pairing of a set of covariate vectors and a specific component of a multiple response vector. We wish to find covariates and subspaces that are useful across multiple tasks.
The paper is organized as follows. In Sect. 2, we introduce the $\ell_1/\ell_2$ regularization scheme and the corresponding optimization problem. In Sect. 3 we discuss homotopy-based methods, and in Sect. 4 we propose a general scheme for following a piecewise smooth, nonlinear regularization path. We extend our algorithm to subspace selection in Sect. 5 and prove convergence to trace-norm regularization. In Sect. 6 we present an empirical evaluation of our joint feature selection algorithm, comparing to several competing block-norm optimizers. We also present an empirical evaluation and comparison of our extension to subspace selection. We conclude with a discussion in Sect. 7.
2 Joint regularization
We assume a group of $K$ classification problems or “tasks” and a set of data samples $\{(x_i^k, y_i^k) \in \mathcal{X} \times \mathcal{Y},\ i = 1, \ldots, N_k,\ k = 1, \ldots, K\}$, where the superscript $k$ indexes tasks and the subscript $i$ indexes the i.i.d. observations for each task. We assume that the common covariate space $\mathcal{X}$ is $\mathbb{R}^p$ and the outcome space $\mathcal{Y}$ is $\{0, 1\}$.
Let $w^k \in \mathbb{R}^p$ parameterize a linear discriminant function for task $k$, and let $J^k(w^k \cdot x^k, y^k)$ be a loss function on example $(x^k, y^k)$ for task $k$. Typical smooth loss functions for linear classification models include logistic and exponential loss. A standard approach to obtaining sparse estimates of the parameters $w^k$ is to solve an $\ell_1$-regularized empirical risk minimization problem:

$$\min_{w^k} \sum_{i=1}^{N_k} J^k(w^k \cdot x_i^k, y_i^k) + \lambda \|w^k\|_1,$$
where $\lambda$ is a regularization coefficient. Solving an independent $\ell_1$-regularized objective for each of these problems is equivalent to solving the global problem obtained by summing the objectives:

$$\min_{W} \sum_{k=1}^{K} \sum_{i=1}^{N_k} J^k(w^k \cdot x_i^k, y_i^k) + \lambda \sum_{k=1}^{K} \|w^k\|_1, \qquad (1)$$

where $W = (w_j^k)_{k,j}$ is the matrix whose rows are the vectors $w^k$ and whose columns are the vectors $w_j$ of the coefficients associated with covariate $j$ across classification tasks. Note that we have assumed that the regularization coefficient $\lambda$ is the same across tasks. We refer to the regularization scheme in (1) as $\ell_1/\ell_1$-regularization. Solving this optimization problem would lead to individual sparsity patterns for each $w^k$.
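To make the decomposition concrete, here is a minimal sketch (ours, not the authors' code) that evaluates the objective in (1) for the logistic loss on synthetic data; the data-generating choices and variable names are illustrative assumptions. Because both the loss and the penalty are sums over tasks, minimizing this objective jointly amounts to fitting each task separately.

```python
import numpy as np

rng = np.random.default_rng(0)
K, p = 3, 20                                   # number of tasks and covariates
N = [50, 60, 40]                               # per-task sample sizes N_k

# Synthetic data: X[k] is N_k x p, labels y[k] in {0, 1}
X = [rng.standard_normal((N[k], p)) for k in range(K)]
y = [rng.integers(0, 2, N[k]) for k in range(K)]

def logistic_loss(margin, label):
    """J^k(w^k . x, y) = log(1 + exp(-s * margin)) with s = 2y - 1 for y in {0, 1}."""
    s = 2 * label - 1
    return np.logaddexp(0.0, -s * margin)

def objective_l1_l1(W, lam):
    """Objective (1): summed per-task logistic losses plus lam * sum_k ||w^k||_1.
    W has shape (K, p); row k is the parameter vector w^k of task k."""
    loss = sum(logistic_loss(X[k] @ W[k], y[k]).sum() for k in range(K))
    return loss + lam * np.abs(W).sum()

W = rng.standard_normal((K, p))
print(objective_l1_l1(W, lam=1.0))
```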
We focus instead on a regularization scheme that selects covariates jointly across tasks. We achieve this by encouraging several $w_j$ to be zero. We thus propose to solve the problem

$$\min_{W} \sum_{k=1}^{K} \sum_{i=1}^{N_k} J^k(w^k \cdot x_i^k, y_i^k) + \lambda \sum_{j=1}^{p} \|w_j\|_2, \qquad (2)$$

in which we penalize the $\ell_1$ norm of the vector of $\ell_2$ norms of the covariate-specific coefficient vectors. Note that this $\ell_1/\ell_2$-regularization scheme reduces to $\ell_1$-regularization if the group is reduced to one task, and can thus be seen as an extension of $\ell_1$-regularization where instead of summing the absolute values of coefficients associated with covariates we sum the Euclidean norms of coefficient blocks.
The $\ell_2$ norm is used here as a measure of magnitude, and one could also generalize to $\ell_1/\ell_p$ norms by considering $\ell_p$ norms for $1 \le p \le \infty$. The choice of $p$ should depend on how much covariate sharing we wish to impose among classification problems, from none ($p = 1$) to full sharing ($p = \infty$). Indeed, increasing $p$ corresponds to allowing better “group discounts” for sharing the same covariate, from $p = 1$, where the cost grows linearly with the number of classification problems that use a covariate, to $p = \infty$, where only the most demanding classification matters.
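The “group discount” can be seen numerically. The following sketch (our illustration, with made-up coefficient matrices) evaluates the $\ell_1/\ell_p$ penalty for $p \in \{1, 2, \infty\}$ on a matrix whose tasks all reuse the same two covariates and on one whose tasks use disjoint covariates, with the same total coefficient mass; for $p > 1$ the shared support is cheaper, and the discount is largest at $p = \infty$.

```python
import numpy as np

def block_penalty(W, p_norm):
    """l1/lp penalty: sum over covariates j of the lp norm of column w_j.
    W has shape (K, p); column j collects covariate j's coefficients across tasks."""
    return np.linalg.norm(W, ord=p_norm, axis=0).sum()

K, p = 3, 6
W_shared = np.zeros((K, p))
W_shared[:, :2] = 1.0                            # every task uses covariates 0 and 1
W_scattered = np.zeros((K, p))
for k in range(K):
    W_scattered[k, 2 * k: 2 * k + 2] = 1.0       # each task uses its own pair of covariates

for p_norm in (1, 2, np.inf):
    print(f"p = {p_norm}: shared = {block_penalty(W_shared, p_norm):.2f}, "
          f"scattered = {block_penalty(W_scattered, p_norm):.2f}")
# p = 1 charges both matrices equally; p = 2 and p = inf make the shared support cheaper.
```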
The shape of the unit “ball” of the $\ell_1/\ell_2$ norm is difficult to visualize. It clearly has corners that, in a manner analogous to the $\ell_1$ norm, tend to produce sparse solutions. As shown in Fig. 1, one way to appreciate the effect of the $\ell_1/\ell_2$ norm is to consider a problem with two covariates and two tasks and to observe the ball of the norm induced on $w^2$ when $w^1$ varies under the constraint that $\|w^1\|_1 = 1$ in an $\ell_1/\ell_2$ ball of size 2 (which is the largest value of the $\ell_1/\ell_2$ norm if $\|w^1\|_1 = \|w^2\|_1 = 1$). If a covariate $j$ has a non-zero coefficient in $w^1$ then the induced norm on $w^2$ is smooth around $w_j^2 = 0$. Otherwise, it has sharp corners, which encourages $w_j^2$ to be set to zero.

Fig. 1 (Color online) (Left) Norm ball induced on the coefficients $(w_1^2, w_2^2)$ for task 2 as covariate coefficients for task 1 vary: thin red contour for $(w_1^1, w_2^1) = (0, 1)$ and thick green contour for $(w_1^1, w_2^1) = (0.5, 0.5)$
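This effect is easy to check numerically. The short sketch below (our own illustration) evaluates the penalty induced on $w^2$, i.e. $f(w^2) = \sum_j \sqrt{(w_j^1)^2 + (w_j^2)^2}$, for the two choices of $w^1$ used in Fig. 1, and estimates the one-sided slopes at $w_1^2 = 0$: with $w^1 = (0, 1)$ the penalty has a kink at zero, while with $w^1 = (0.5, 0.5)$ it is smooth.

```python
import numpy as np

def induced_penalty(w2, w1):
    """l1/l2 norm of the 2 x 2 coefficient matrix with rows w1 (fixed) and w2."""
    return np.sum(np.sqrt(np.asarray(w1) ** 2 + np.asarray(w2) ** 2))

h = 1e-6
for w1 in [(0.0, 1.0), (0.5, 0.5)]:
    # One-sided slopes of the induced penalty in the w2_1 direction at w2 = (0, 0).
    right = (induced_penalty((h, 0.0), w1) - induced_penalty((0.0, 0.0), w1)) / h
    left = (induced_penalty((0.0, 0.0), w1) - induced_penalty((-h, 0.0), w1)) / h
    print(f"w1 = {w1}: slopes {left:+.3f} / {right:+.3f}",
          "(kink at 0)" if abs(right - left) > 1e-3 else "(smooth at 0)")
```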
3 A path-following algorithm for joint covariate selection
In this section we present an algorithm for solving the $\ell_1/\ell_2$-regularized optimization problem presented in (2). One approach to solving such regularization problems is to repeatedly solve them on a grid of values of the regularization coefficient $\lambda$, if possible using “warm starts” to initialize the procedure for a given value of $\lambda$ using the solution for a nearby value of $\lambda$. An alternative framework, which can be more efficient computationally and can provide insight into the space of solutions, is to attempt to follow the “regularization path” (the set of solutions for all values of $\lambda$). There are problems—including $\ell_1$-regularized least-squares regression and the $\ell_1$- and $\ell_2$-regularized support vector machines—for which this path is piecewise linear and for which it is possible to follow the path exactly (Efron et al. 2004; Rosset and Zhu 2007). More generally, we can avail ourselves of path-following algorithms. Classical path-following algorithms involve a combination of prediction steps (along the tangent to the path) and correction steps (which correct for errors due to the first-order approximation of the prediction steps). These algorithms generally require the computation of the Hessian of the combined objective and thus are onerous computationally. However, in the case of $\ell_1$ regularization it has been shown that the solution path can be approximated by computationally efficient variations of boosting and stagewise forward selection (Hastie et al. 2001; Zhao and Yu 2007).
Note that the amount of sparsity is controlled by the regularization coefficient $\lambda$. As $\lambda$ ranges from 0 to $\infty$, the sparsity of solutions typically progresses through several levels (although this is not guaranteed in general). The approach that we present here exploits the high degree of sparsity for large values of $\lambda$.
Our approach is inspired by the stagewise Lasso algorithm of Zhao and Yu (2007). In their algorithm, the optimization is performed on a grid with step size $\epsilon$ and essentially reduces to a discrete problem that can be viewed as a simplex problem, where “forward” and “backward” steps are alternated. Our approach extends this methodology to the setting of blockwise norms by essentially combining stagewise Lasso with a classical correction step. We take advantage of sparsity so that this step can be implemented cheaply.

4 Active set and parameter updates
We begin our description of the path-following algorithm with a simple lemma that uses a subgradient calculation (equivalently, the Karush-Kuhn-Tucker (KKT) conditions) to show how the sparsity of the solution can lead to an efficient construction of the path. Let us denote the joint loss by

$$J(W) = \sum_{k=1}^{K} \sum_{i=1}^{N_k} J^k(w^k \cdot x_i^k, y_i^k).$$
Lemma 1 If $J$ is everywhere differentiable, then any solution $W^*$ of the optimization problem in (2) is characterized by the following conditions:

either $w_j^* = 0$, $\|\nabla_{w_j} J(W^*)\|_2 \le \lambda$,
or $w_j^* \propto -\nabla_{w_j} J(W^*)$, $\|\nabla_{w_j} J(W^*)\|_2 = \lambda$,

where the $\nabla_{w_j} J(W)$ are partial gradients in each of the subspaces corresponding to covariate-specific parameter vectors.
Proof At an optimum, a subgradient of the objective function equals zero. This implies—given that the $\ell_1/\ell_2$-regularization term is separable for the column vectors $w_j$ of $W$—that for all $j$, $\nabla_{w_j} J(W^*) + \lambda z_j^* = 0$ for $z_j^* \in \partial_{w_j}\|w_j\|_2$, where the latter denotes the subgradient of the Euclidean norm. Moreover, the subgradient of the Euclidean norm satisfies

$$\partial_{w_j}\|w_j\|_2 = \frac{w_j}{\|w_j\|_2} \ \text{ if } w_j \neq 0, \qquad \partial_{w_j}\|w_j\|_2 = \{z \in \mathbb{R}^K \mid \|z\|_2 \le 1\} \ \text{ otherwise}, \qquad (3)$$

which proves the lemma. These subgradient equations can also be obtained by conic duality, in which case they result directly from the KKT conditions.
In particular, only the “active” covariates—those for which the norm of the gradient vector is not strictly less than $\lambda$—participate in the solution. For the active covariates, $\lambda \frac{w_j^*}{\|w_j^*\|_2} = -\nabla_{w_j} J(W^*)$. (Note that if $\lambda \ge \lambda_0 = \max_j \|\nabla_{w_j} J(0)\|_2$ then the zero vector is a solution to our problem.)
These conditions suggest an algorithm which gradually decreases the regularization coefficient from $\lambda_0$ and populates an active set with inactive covariates as they start to violate the subgradient conditions. In particular, we consider approximate subgradient conditions of the form:

either $w_j = 0$, $\|\nabla_{w_j} J(W)\| < \lambda + \xi_0$,
or $\left\|\nabla_{w_j} J(W) + (\lambda - \xi)\frac{w_j}{\|w_j\|}\right\| \le \xi$, \qquad (4)
where $\xi$ and $\xi_0$ are slack parameters. These conditions are obtained by relaxing the constraint that there must exist a subgradient equal to zero, and asking instead that

for $w_j = 0$: $\|\nabla_{w_j} J(W) + \lambda z_j\| \le \xi_0$ for some $z_j \in \partial_{w_j}\|w_j\|_2$,
for $w_j \neq 0$: $\|\nabla_{w_j} J(W) + (\lambda - \xi) z_j\| \le \xi$ for some $z_j \in \partial_{w_j}\|w_j\|_2$.

The latter constraint ensures that, for any active covariate $j$, we have $\|\nabla_{w_j} J(W)\| \le \lambda$ and that the partial subgradient of the objective with respect to $w_j$ is of norm at most $2\xi$. Note that, on the other hand, if $\xi_0 > 0$, the previous inequality does not hold a priori for inactive covariates, so that a solution to (4) does not necessarily have the exact same active set as one satisfying conditions (3).

Algorithm 1 Approximate block-Lasso path
while $\lambda_t > \lambda_{\min}$ do
  Set $j^* = \arg\max_j \|\nabla_{w_j} J(W_t)\|$
  Update $w_{j^*}^{(t+1)} = w_{j^*}^{(t)} - \epsilon\, u_t$ with $u_t = \nabla_{w_{j^*}} J / \|\nabla_{w_{j^*}} J\|$
  $\lambda_{t+1} = \min\bigl(\lambda_t, (J(W_t) - J(W_{t+1}))/\epsilon\bigr)$
  Add $j^*$ to the active set
  Enforce (4) for covariates in the active set with $\xi_0 = \xi$
end while
To obtain a path of solutions that satisfy the approximate subgradient conditions, consider Algorithm 1.
Algorithm 1 enforces explicitly the subgradient condition (4), with $\xi_0 = \xi$, on its active set. If $J$ is twice continuously differentiable, and if the largest eigenvalue of its Hessian is bounded above by $\mu_{\max}$, Algorithm 1 actually also enforces (4) implicitly for the other variables with $\xi_0 = \frac{1}{2}\epsilon\mu_{\max}$. This crucial property is proved in Appendix A together with the next proposition, which shows that Algorithm 1 approximates the regularization path for the $\ell_1/\ell_2$ norm:
Proposition 1 Let $\lambda_t$ denote the value of the regularization parameter at the $t$th iteration, with initial value $\lambda_0 \ge \|\nabla_{w_{j^*}} J(0)\|$. Assuming $J$ to be twice differentiable and strictly convex, for all $\eta$ there exist $\epsilon > 0$ and $\xi > 0$ such that the iterates $W_t$ of Algorithm 1 obey $J(W_t) - J(W(\lambda_t)) \le \eta$ for every time step $t$ such that $\lambda_{t+1} < \lambda_t$, where $W(\lambda_t)$ is the unique solution to (2). Moreover, the algorithm terminates (provided the active set is not pruned) in a finite number of iterations to a regularization coefficient no greater than any prespecified $\lambda_{\min} > 0$.
It is also worth noting that it is possible to set $\xi_0 = 0$ and develop a stricter version of the algorithm that identifies the correct active set for each $\lambda$. We present this variant in Appendix B.
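The sketch below is our reading of Algorithm 1, not the authors' implementation: the logistic loss, the synthetic data, and in particular the correction step are stand-ins. We approximate the correction step (enforcing (4) on the active set with $\xi_0 = \xi$) by a few blockwise proximal-gradient passes over the active columns, which is only one of several reasonable choices.

```python
import numpy as np

rng = np.random.default_rng(2)
K, p = 3, 30
N = [60, 60, 60]
X = [rng.standard_normal((N[k], p)) for k in range(K)]
y = [rng.integers(0, 2, N[k]) for k in range(K)]

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss(W):
    """Joint logistic loss J(W) summed over tasks and observations."""
    return sum(np.logaddexp(0.0, -(2 * y[k] - 1) * (X[k] @ W[k])).sum() for k in range(K))

def grad(W):
    G = np.zeros_like(W)
    for k in range(K):
        s = 2 * y[k] - 1
        G[k] = X[k].T @ (-s * sigmoid(-s * (X[k] @ W[k])))
    return G

# Conservative step size for the correction passes (crude upper bound on the curvature).
PROX_STEP = 1.0 / max(np.linalg.norm(X[k], 2) ** 2 for k in range(K))

def correction_step(W, active, lam, n_passes=20):
    """Stand-in correction step: blockwise proximal-gradient passes restricted to the
    active columns, i.e. a gradient step followed by group soft-thresholding of each
    active column (the group-wise prox of lam * ||.||_2)."""
    for _ in range(n_passes):
        G = grad(W)
        for j in active:
            v = W[:, j] - PROX_STEP * G[:, j]
            shrink = max(0.0, 1.0 - PROX_STEP * lam / max(np.linalg.norm(v), 1e-12))
            W[:, j] = shrink * v
    return W

def block_lasso_path(eps=0.05, lam_min=1.0):
    """Sketch of Algorithm 1 (approximate block-Lasso path)."""
    W = np.zeros((K, p))
    active = set()
    lam = np.linalg.norm(grad(W), axis=0).max()          # start at lambda_0
    path = []
    while lam > lam_min:
        G = grad(W)
        j_star = int(np.argmax(np.linalg.norm(G, axis=0)))
        g = G[:, j_star]
        W_new = W.copy()
        W_new[:, j_star] -= eps * g / np.linalg.norm(g)  # prediction step of l1/l2 norm eps
        lam = min(lam, (loss(W) - loss(W_new)) / eps)    # update the regularization level
        active.add(j_star)
        W = correction_step(W_new, active, lam)          # enforce (4) on the active set
        path.append((lam, W.copy()))
    return path

path = block_lasso_path()
final_W = path[-1][1]
print("active covariates:", np.flatnonzero(np.linalg.norm(final_W, axis=0) > 1e-8))
```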
Since our algorithm does not appeal to global second-order information, it is quite scalable compared to standard homotopy algorithms such as LARS. This is particularly useful in the multi-task setting, where problems can be relatively large and where algorithms such as LARS become slow. Our algorithm samples the path regularly, on a scale that is determined automatically by the algorithm through the update rule for $\lambda_t$, and allows for several new covariates to enter the active set simultaneously. (Empirically we find that this scale is logarithmic.) The algorithm is obviously less efficient than LARS-type algorithms in long pieces of the path that are smooth, but we indicate in the following section how variants of the algorithm could address this. Finally, our algorithm applies to contexts in which LARS-type algorithms do not apply directly, and where the use of classical homotopy methods is precluded by non-differentiability.
In the following two subsections we further describe Algorithm 1, providing further details on the prediction step (the choice of $u_t$) and the correction step (the enforcement of (4) for covariates in the active set).

4.1 Prediction steps

The choice $u_t = \nabla_{w_{j^*}} J / \|\nabla_{w_{j^*}} J\|$ that we have specified for the prediction step is one possible option. It is also possible to take a global gradient descent step or, more generally, a step along a gradient-related descent direction (a direction such that $\liminf_t \frac{-u_t \cdot \nabla J(W_t)}{\|\nabla J(W_t)\|} > \delta > 0$) with an update rule for the regularization coefficient of the form $\lambda_{t+1} = \min\bigl(\lambda_t, \frac{J(W_t) - J(W_{t+1})}{\|W_t - W_{t+1}\|_{\ell_1/\ell_2}}\bigr)$. Indeed, the proof of Appendix A could easily be generalized to the case of steps of $\ell_1/\ell_2$ norm $\epsilon$ taken along a general descent direction. Note that only the iterates that conclude with a decrease of the regularization coefficient are guaranteed to be close to the path.
For simplicity, we have presented the algorithm as using a fixed step size $\epsilon$, but in practice we recommend using an adaptive step size determined by a line search limited to the segment $(0, \epsilon]$. This allows us to explore the end of the path where the regularization coefficient becomes exponentially small. Lemma 3 in Appendix A considers this case.
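One simple way to realize such an adaptive step (again our own stand-in, not the authors' code) is a backtracking search over the segment $(0, \epsilon]$ that keeps the largest step achieving a sufficient decrease of the loss along the chosen block direction; the constants and the acceptance test below are illustrative choices.

```python
import numpy as np

def backtracking_step(loss, W, j_star, g, eps, shrink=0.5, c=0.5, min_step=1e-8):
    """Adaptive prediction step on the segment (0, eps]: starting from eps, halve the
    step until an Armijo-style sufficient-decrease test holds for the move of column
    j_star along -g/||g||, where g is the gradient block for covariate j_star."""
    base = loss(W)
    g_norm = np.linalg.norm(g)
    step = eps
    while step > min_step:
        W_try = W.copy()
        W_try[:, j_star] -= step * g / g_norm
        if base - loss(W_try) >= c * step * g_norm:   # sufficient decrease along the block
            return step, W_try
        step *= shrink
    return min_step, W
```

In the Algorithm 1 sketch above, the fixed-step update of column $j^*$ would be replaced by a call to this function, and the $\lambda$ update would divide the observed loss decrease by the accepted step size.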
If we understand the “active set” as the set of covariates with non-zero coefficients, it is possible for a covariate to enter and later exit the set, which, a priori, would require pruning. The analysis of pruning is delicate and we do not consider it here. In practice, the case of parameters returning to zero appears to be rare—in our experiments typically at most two components return to zero per path. Thus, implementing a pruning step would not yield a significant speed-up of the algorithm.
