Stat Comput
DOI 10.1007/s11222-008-9111-x
Joint covariate selection and joint subspace selection for multiple classification problems
Guillaume Obozinski · Ben Taskar · Michael I. Jordan
Received: 9 October 2007 / Accepted: 1 December 2008
© The Author(s) 2008. This article is published with open access
Abstract We address the problem of recovering a common set of covariates that are relevant simultaneously to several classification problems. By penalizing the sum of ℓ2 norms of the blocks of coefficients associated with each covariate across the different classification problems, similar sparsity patterns in all models are encouraged. To take computational advantage of the sparsity of solutions at high regularization levels, we propose a blockwise path-following scheme that approximately traces the regularization path. As the regularization coefficient decreases, the algorithm maintains and updates concurrently a growing set of covariates that are simultaneously active for all problems. We also show how to use random projections to extend this approach to the problem of joint subspace selection, where multiple predictors are found in a common low-dimensional subspace. We present theoretical results showing that this random projection approach converges to the solution yielded by trace-norm regularization. Finally, we present a variety of experimental results exploring joint covariate selection and joint subspace selection, comparing the path-following approach to competing algorithms in terms of prediction accuracy and running time.

G. Obozinski (✉)
Department of Statistics, University of California at Berkeley, 367 Evans Hall, Berkeley, CA 94720-3860, USA
e-mail: gobo@stat.berkeley.edu

B. Taskar
Department of Computer and Information Science, University of Pennsylvania, 3330 Walnut Street, Philadelphia, PA 19104-6389, USA
e-mail: taskar@cis.upenn.edu

M. I. Jordan
Department of Statistics and Department of Electrical Engineering and Computer Science, University of California at Berkeley, 367 Evans Hall, Berkeley, CA 94720-3860, USA
e-mail: jordan@stat.berkeley.edu
Keywords Variable selection · Subspace selection · Lasso · Group Lasso · Regularization path · Supervised dimensionality reduction · Multitask learning · Block norm · Trace norm · Random projections
1 Introduction
The problem of covariate selection for regression and classification has been the focus of a substantial literature. As with many model selection problems, the problem is rendered difficult by the disparity between the large number of models to be considered and the comparatively small amount of data available to evaluate the models. One approach to the problem focuses on procedures that search within the exponentially-large set of all subsets of components of the covariate vector, using various heuristics such as forward or backward selection to limit the search (Draper and Smith 1998). Another approach treats the problem as a parameter estimation problem in which the shrinkage induced by a constraint on the ℓ1 norm of the parameter vector yields estimates in which certain components are equal to zero (Tibshirani 1996; Fu and Knight 2000; Donoho 2004). A virtue of the former approach is that it focuses on the qualitative decision as to whether a covariate is relevant to the problem at hand, a decision which is conceptually distinct from parameter estimation. A virtue of the latter approach is its computational tractability.
In this paper, we focus on a problem setting in which these virtues appear to be better aligned than they are in general regression and classification problems. In particular, we focus on situations involving multiple, related data sets in which the same set of covariates are present in each data set but where the responses differ. In this multi-response setting it is natural to associate a notion of "relevance" to a covariate that is conceptually distinct from the numerical value of a parameter. For example, a particular covariate may appear with a positive coefficient in predicting one response variable and with a negative coefficient in predicting a different response. We would clearly want to judge such a covariate as being "relevant" to the overall class of prediction problems without making a commitment to a specific value of a parameter. In general we wish to "borrow strength" across multiple estimation problems in order to support a decision that a covariate is to be selected.
Our focus in this paper is the classification or discrimination problem. Consider, for example, the following pattern recognition problem that we consider later in Sect. 6. We assume that we are given a data set consisting of pixel-level or stroke-level representations of handwritten characters and we wish to classify a given character into one of a fixed set of classes. In this optical character recognition (OCR) problem, there are several thousand covariates, most of which are irrelevant to the classification decision of character identity. To support the choice of relevant covariates in this high-dimensional problem, we consider an extended version of the problem in which we assume that multiple data sets are available, one for each individual in a set of writers. We expect that even though the styles of individual writers may vary, there should be a common subset of image features (pixels, strokes) that form a shared set of useful covariates across writers.
As another example of our general setting, also discussed in Sect. 6, consider a DNA microarray analysis problem in which the covariates are levels of gene expression and the responses are phenotypes or cellular processes (Khan et al. 2001). Given the high-dimensional nature of microarray data sets, covariate selection is often essential both for scientific understanding and for effective prediction. Our proposal is to approach the covariate selection problem by considering multiple related phenotypes (e.g., related sets of cancers) and seeking to find covariates that are useful in predicting the multiple response variables.
Our approach to the simultaneous covariate selection problem is an adaptation of ℓ1 shrinkage methods such as the LASSO. Briefly, for each data set {(x_i^k, y_i^k) : i = 1, ..., N_k}, where k ∈ {1, ..., K} indexes data sets, we fit a model involving a parameter vector w^k ∈ R^p. View these vectors as the rows of a K × p matrix W, and consider the j-th column vector, w_j, of W. This vector consists of the set of parameters associated with the j-th covariate across all classification problems. We now define a regularization term that is an ℓ1 sum of the ℓ2 norms of the covariate-specific parameter vectors w_j. Each of the ℓ2 norms can be viewed as assessing the overall relevance of a particular covariate; the ℓ1 sum then enforces a selection among covariates based on these norms.
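As a small numerical illustration (not from the original paper), the following Python sketch evaluates this penalty on a made-up 2 × 3 coefficient matrix and shows that a covariate whose column is identically zero contributes nothing to it:

```python
import numpy as np

# Hypothetical coefficient matrix W with K = 2 tasks (rows) and p = 3 covariates (columns).
# The third covariate has a zero coefficient in both tasks, i.e. it is jointly unselected.
W = np.array([[0.5, -1.0, 0.0],
              [2.0,  0.0, 0.0]])

# l1 sum of the l2 norms of the columns w_j.
penalty = np.sum(np.linalg.norm(W, axis=0))
print(penalty)  # sqrt(0.25 + 4.0) + 1.0 + 0.0, roughly 3.06
```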
This approach is a particular case of a general methodology in which block norms are used to define groupings of variables in regression and classification problems (Bach et al. 2004; Yuan and Lin 2006; Park and Hastie 2006; Meier et al. 2008; Kim et al. 2006; Zhao et al. 2008). However, the focus in this literature differs from ours in that it is concerned with grouping variables within a single regression or classification problem. For example, in a polynomial regression we may wish to group the linear, quadratic and cubic terms corresponding to a specific covariate and select these terms jointly. Similarly, in an ANOVA model we may wish to group the indicator variables corresponding to a specific factor. The block-norm approach to these problems is based on defining block norms involving hybrids of ℓ1, ℓ2 and ℓ∞ norms as regularization terms.
Argyriou et al. (2008) have independently proposed the use of a block ℓ1/ℓ2 norm for covariate selection in the multiple-response setting. Moreover, they consider a more general framework in which the variables that are selected are linear combinations of the original covariates. We refer to this problem as joint subspace selection. Joint covariate selection is a special case in which the subspaces are restricted to be axis-parallel. Argyriou et al. show that the general subspace selection problem can be formulated as an optimization problem involving the trace norm.
Our contribution relative to Argyriou et al. is as follows. First, we note that the trace norm is difficult to optimize computationally (it yields a non-differentiable functional that is generally evaluated by computing a singular value decomposition at each step of a nonlinear optimization procedure; Srebro et al. 2005b), and we thus focus on the special case of covariate selection, where it is not necessary to use the trace norm. For the case of covariate selection we show that it is possible to develop a simple homotopy-based approach that evaluates an entire regularization path efficiently (cf. Efron et al. 2004; Osborne et al. 2000). We present a theoretical result establishing the convergence of this homotopy-based method. Moreover, for the general case of joint subspace selection we show how random projections can be used to reduce the problem to covariate selection. Applying our homotopy method for joint covariate selection to the random projections, we obtain a computationally efficient procedure for joint subspace selection. We also present a theoretical result showing that this approach approximates the solution obtained from the trace norm. Finally, we present several experiments on large-scale datasets that compare and contrast various methods for joint covariate selection and joint subspace selection.
The general problem of jointly estimating models from multiple, related data sets is often referred to as "transfer learning" or "multi-task learning" in the machine learning literature (Maurer 2006; Ben-David and Schuller-Borbely 2008; Argyriou et al. 2008; Jebara 2004; Evgeniou and Pontil 2004; Torralba et al. 2004; Ando and Zhang 2005). We adopt the following terminology from this literature: a task is defined to be a pairing of a set of covariate vectors and a specific component of a multiple response vector. We wish to find covariates and subspaces that are useful across multiple tasks.
The paper is organized as follows. In Sect. 2, we introduce the ℓ1/ℓ2 regularization scheme and the corresponding optimization problem. In Sect. 3 we discuss homotopy-based methods, and in Sect. 4 we propose a general scheme for following a piecewise smooth, nonlinear regularization path. We extend our algorithm to subspace selection in Sect. 5 and prove convergence to trace-norm regularization. In Sect. 6 we present an empirical evaluation of our joint feature selection algorithm, comparing to several competing block-norm optimizers. We also present an empirical evaluation and comparison of our extension to subspace selection. We conclude with a discussion in Sect. 7.
2 Joint regularization
We assume a group of K classification problems or "tasks" and a set of data samples {(x_i^k, y_i^k) ∈ X × Y, i = 1, ..., N_k, k = 1, ..., K}, where the superscript k indexes tasks and the subscript i indexes the i.i.d. observations for each task. We assume that the common covariate space X is R^p and that the outcome space Y is {0, 1}.
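A minimal sketch of this data layout (synthetic data, illustrative dimensions, and names such as X_list and y_list that are not from the paper) is:

```python
import numpy as np

rng = np.random.default_rng(0)
K, p = 3, 10                      # number of tasks and dimension of the shared covariate space
N = [50, 80, 40]                  # N_k: number of observations for each task

# One design matrix and one binary response vector per task; all tasks share
# the covariate space R^p but have their own i.i.d. observations.
X_list = [rng.normal(size=(N[k], p)) for k in range(K)]
y_list = [rng.integers(0, 2, size=N[k]) for k in range(K)]
```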
Let w^k ∈ R^p parameterize a linear discriminant function for task k, and let J_k(w^k · x^k, y^k) be a loss function on example (x^k, y^k) for task k. Typical smooth loss functions for linear classification models include the logistic and exponential losses. A standard approach to obtaining sparse estimates of the parameters w^k is to solve an ℓ1-regularized empirical risk minimization problem:

\[
\min_{w^k} \; \sum_{i=1}^{N_k} J_k(w^k \cdot x_i^k, y_i^k) \;+\; \lambda \|w^k\|_1,
\]
where λ is a regularization coefficient. Solving an independent ℓ1-regularized objective for each of these problems is equivalent to solving the global problem obtained by summing the objectives:

\[
\min_{W} \; \sum_{k=1}^{K} \sum_{i=1}^{N_k} J_k(w^k \cdot x_i^k, y_i^k) \;+\; \lambda \sum_{k=1}^{K} \|w^k\|_1, \tag{1}
\]

where W = (w_j^k)_{k,j} is the matrix whose rows are the vectors w^k and whose columns are the vectors w_j of the coefficients associated with covariate j across classification tasks. Note that we have assumed that the regularization coefficient λ is the same across tasks. We refer to the regularization scheme in (1) as ℓ1/ℓ1-regularization. Solving this optimization problem would lead to individual sparsity patterns for each w^k.
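As a sketch of the objective in (1), assuming a logistic loss and labels recoded to ±1 for convenience (the helper names below are illustrative, not from the paper), note that both the loss and the penalty decompose over the rows of W, so minimizing it amounts to solving K independent ℓ1-regularized problems:

```python
import numpy as np

def logistic_loss(w, X, y):
    """Summed logistic loss for one task; y is assumed to take values in {-1, +1}."""
    margins = y * (X @ w)
    return np.sum(np.logaddexp(0.0, -margins))

def l1_l1_objective(W, X_list, y_list, lam):
    """Objective (1): sum of per-task losses plus lambda * sum_k ||w^k||_1,
    where row k of W is the parameter vector w^k of task k."""
    loss = sum(logistic_loss(W[k], X_list[k], y_list[k]) for k in range(len(X_list)))
    return loss + lam * np.sum(np.abs(W))
```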
We focus instead on a regularization scheme that selects covariates jointly across tasks. We achieve this by encouraging several of the w_j to be zero. We thus propose to solve the problem

\[
\min_{W} \; \sum_{k=1}^{K} \sum_{i=1}^{N_k} J_k(w^k \cdot x_i^k, y_i^k) \;+\; \lambda \sum_{j=1}^{p} \|w_j\|_2, \tag{2}
\]
in which we penalize the ℓ1 norm of the vector of ℓ2 norms of the covariate-specific coefficient vectors. Note that this ℓ1/ℓ2-regularization scheme reduces to ℓ1-regularization if the group is reduced to one task, and it can thus be seen as an extension of ℓ1-regularization in which, instead of summing the absolute values of the coefficients associated with covariates, we sum the Euclidean norms of coefficient blocks.
The ℓ2 norm is used here as a measure of magnitude, and one could also generalize to ℓ1/ℓp norms by considering ℓp norms for 1 ≤ p ≤ ∞. The choice of p should depend on how much covariate sharing we wish to impose among classification problems, from none (p = 1) to full sharing (p = ∞). Indeed, increasing p corresponds to allowing better "group discounts" for sharing the same covariate: for p = 1 the cost grows linearly with the number of classification problems that use a covariate, while for p = ∞ only the most demanding classification problem matters.
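A minimal sketch of this family of block penalties (illustrative code, using the same K × p layout of W as above) and of the resulting "group discount" for a covariate shared identically by all tasks:

```python
import numpy as np

def block_l1_lp_penalty(W, p_norm=2.0):
    """l1/lp penalty: sum over covariates j of the lp norm of column w_j.
    p_norm = 1 recovers the separable l1/l1 penalty; p_norm = np.inf gives l1/linf."""
    return np.sum(np.linalg.norm(W, ord=p_norm, axis=0))

# One covariate used with coefficient 1 by each of K = 4 tasks.
W_shared = np.ones((4, 1))
print(block_l1_lp_penalty(W_shared, 1.0))     # 4.0: cost grows linearly with the number of tasks
print(block_l1_lp_penalty(W_shared, 2.0))     # 2.0: intermediate sharing
print(block_l1_lp_penalty(W_shared, np.inf))  # 1.0: only the largest coefficient matters
```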
The shape of the unit "ball" of the ℓ1/ℓ2 norm is difficult to visualize. It clearly has corners that, in a manner analogous to the ℓ1 norm, tend to produce sparse solutions. As shown in Fig. 1, one way to appreciate the effect of the ℓ1/ℓ2 norm is to consider a problem with two covariates and two tasks, and to observe the ball of the norm induced on w^2 when w^1 varies under the constraint that ‖w^1‖_1 = 1, within an ℓ1/ℓ2 ball of size 2 (which is the largest value of the ℓ1/ℓ2 norm if ‖w^1‖_1 = ‖w^2‖_1 = 1). If a covariate j has a non-zero coefficient in w^1, then the induced norm on w^2 is smooth around w_j^2 = 0. Otherwise, it has sharp corners, which encourages w_j^2 to be set to zero.

Fig. 1 (Color online) (Left) Norm ball induced on the coefficients (w_1^2, w_2^2) for task 2 as the covariate coefficients for task 1 vary: thin red contour for (w_1^1, w_2^1) = (0, 1) and thick green contour for (w_1^1, w_2^1) = (0.5, 0.5)
3 A path-following algorithm for joint covariate selection
In this section we present an algorithm for solving the ℓ1/ℓ2-regularized optimization problem presented in (2). One approach to solving such regularization problems is to repeatedly solve them on a grid of values of the regularization coefficient λ, if possible using "warm starts" to initialize the procedure for a given value of λ using the solution for a nearby value of λ. An alternative framework, which can be more efficient computationally and can provide insight into the space of solutions, is to attempt to follow the "regularization path" (the set of solutions for all values of λ). There are problems, including ℓ1-regularized least-squares regression and the ℓ1- and ℓ2-regularized support vector machines, for which this path is piecewise linear and for which it is possible to follow the path exactly (Efron et al. 2004; Rosset and Zhu 2007). More generally, we can avail ourselves of path-following algorithms. Classical path-following algorithms involve a combination of prediction steps (along the tangent to the path) and correction steps (which correct for errors due to the first-order approximation of the prediction steps). These algorithms generally require the computation of the Hessian of the combined objective and are thus computationally onerous. However, in the case of ℓ1 regularization it has been shown that the solution path can be approximated by computationally efficient variations of boosting and stagewise forward selection (Hastie et al. 2001; Zhao and Yu 2007).
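For concreteness, here is a minimal sketch of the grid/warm-start baseline for problem (2). This is not the path-following algorithm developed below; the solver (proximal gradient steps with block soft-thresholding, the proximal operator of the ℓ1/ℓ2 penalty), the step size, the grid, and the helper names are all illustrative assumptions.

```python
import numpy as np

def grad_logistic(w, X, y):
    """Gradient of the summed logistic loss for one task (labels in {-1, +1})."""
    margins = y * (X @ w)
    return -(X.T @ (y / (1.0 + np.exp(margins))))

def block_soft_threshold(W, tau):
    """Proximal operator of tau * sum_j ||w_j||_2, applied column-wise."""
    norms = np.linalg.norm(W, axis=0)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return W * scale  # broadcasting rescales each column w_j

def grid_with_warm_starts(X_list, y_list, lambdas, step=1e-3, n_iter=500):
    """Grid/warm-start baseline: proximal gradient descent on (2),
    re-using the previous solution as the starting point for the next lambda."""
    K, p = len(X_list), X_list[0].shape[1]
    W = np.zeros((K, p))
    path = []
    for lam in sorted(lambdas, reverse=True):     # from strong to weak regularization
        for _ in range(n_iter):
            G = np.vstack([grad_logistic(W[k], X_list[k], y_list[k])
                           for k in range(K)])
            W = block_soft_threshold(W - step * G, step * lam)
        path.append((lam, W.copy()))
    return path
```

The fixed step size used here is for illustration only; in practice it would be set below the reciprocal of the Lipschitz constant of the loss gradient or chosen by backtracking.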
Note that the amount of sparsity is controlled by the regularization coefficient λ. As λ ranges from 0 to ∞, the sparsity of the solutions typically progresses through several levels (although this is not guaranteed in general). The approach that we present here exploits the high degree of sparsity obtained for large values of λ.
Our approach is inspired by the stagewise Lasso algorithm of Zhao and Yu (2007). In their algorithm, the optimization is performed on a grid with step size ε and essentially reduces to a discrete problem that can be viewed as a simplex problem, where "forward" and "backward" steps are alternated. Our approach extends this methodology to the setting of blockwise norms by essentially combining stagewise Lasso with a classical correction step. We take advantage of sparsity so that this step can be implemented cheaply.

4 Active set and parameter updates
We begin our description of the path-following algorithm with a simple lemma that uses a subgradient calculation (equivalently, the Karush-Kuhn-Tucker (KKT) conditions) to show how the sparsity of the solution can lead to an efficient construction of the path. Let us denote the joint loss by

\[
J(W) = \sum_{k=1}^{K} \sum_{i=1}^{N_k} J_k(w^k \cdot x_i^k, y_i^k).
\]
Lemma 1 If J is everywhere differentiable, then any solution W* of the optimization problem in (2) is characterized by the following conditions:

\[
\text{either } w_j^* = 0 \ \text{ and } \ \|\nabla_{w_j} J(W^*)\|_2 \le \lambda,
\qquad
\text{or } \ w_j^* \propto -\nabla_{w_j} J(W^*) \ \text{ and } \ \|\nabla_{w_j} J(W^*)\|_2 = \lambda,
\]

where the ∇_{w_j} J(W) are the partial gradients in each of the subspaces corresponding to the covariate-specific parameter vectors.
Proof At an optimum, a subgradient of the objective function equals zero. This implies, given that the ℓ1/ℓ2-regularization term is separable over the column vectors w_j of W, that for all j, ∇_{w_j} J(W*) + λ z_j* = 0 for some z_j* ∈ ∂_{w_j} ‖w_j‖_2, where the latter denotes the subdifferential of the Euclidean norm. Moreover, the subdifferential of the Euclidean norm satisfies

\[
\partial_{w_j} \|w_j\|_2 =
\begin{cases}
\left\{ \dfrac{w_j}{\|w_j\|_2} \right\} & \text{if } w_j \neq 0, \\[6pt]
\{\, z \in \mathbb{R}^K : \|z\|_2 \le 1 \,\} & \text{otherwise},
\end{cases}
\tag{3}
\]

which proves the lemma. The subgradient equations can also be obtained by conic duality, in which case they result directly from the KKT conditions.
In particular, only the "active" covariates (those for which the norm of the gradient vector is not strictly less than λ) participate in the solution. For the active covariates, λ w_j^* / ‖w_j^*‖ = −∇_{w_j} J(W^*). (Note that if λ ≥ λ_0 = max_j ‖∇_{w_j} J(0)‖_2, then the zero vector is a solution to our problem.)
These conditions suggest an algorithm which gradually decreases the regularization coefficient from λ_0 and populates an active set with inactive covariates as they start to violate the subgradient conditions. In particular, we consider approximate subgradient conditions of the form:

\[
\text{either } w_j = 0 \ \text{ and } \ \|\nabla_{w_j} J(W)\| < \lambda + \xi_0,
\qquad
\text{or } \ \left\| \nabla_{w_j} J(W) + (\lambda - \xi)\,\frac{w_j}{\|w_j\|} \right\| \le \xi,
\tag{4}
\]
where ξ and ξ_0 are slack parameters. These conditions are obtained by relaxing the constraint that there must exist a subgradient equal to zero, and asking instead that

\[
\begin{cases}
\text{for } w_j = 0: & \|\nabla_{w_j} J(W) + \lambda z_j\| \le \xi_0 \ \text{ for some } z_j \in \partial_{w_j}\|w_j\|_2, \\[4pt]
\text{for } w_j \neq 0: & \|\nabla_{w_j} J(W) + (\lambda - \xi)\, z_j\| \le \xi \ \text{ for some } z_j \in \partial_{w_j}\|w_j\|_2.
\end{cases}
\]

The latter constraint ensures that, for any active covariate j, we have ‖∇_{w_j} J(W)‖ ≤ λ and that the partial subgradient of the objective with respect to w_j has norm at most 2ξ. Note that, on the other hand, if ξ_0 > 0, the previous inequality does not hold a priori for inactive covariates, so that a solution to (4) does not necessarily have exactly the same active set as one satisfying conditions (3).

To obtain a path of solutions that satisfy the approximate subgradient conditions, consider Algorithm 1.

Algorithm 1 Approximate block-Lasso path
  while λ_t > λ_min do
    Set j* = argmax_j ‖∇_{w_j} J(W_t)‖
    Update w_{j*}^{(t+1)} = w_{j*}^{(t)} − ε u_t, with u_t = ∇_{w_{j*}} J / ‖∇_{w_{j*}} J‖
    λ_{t+1} = min( λ_t, (J(W_t) − J(W_{t+1})) / ε )
    Add j* to the active set
    Enforce (4) for covariates in the active set, with ξ_0 = ξ
  end while
Algorithm 1 enforces explicitly the subgradient condition (4), with ξ_0 = ξ, on its active set. If J is twice continuously differentiable, and if the largest eigenvalue of its Hessian is bounded above by μ_max, Algorithm 1 actually also enforces (4) implicitly for the other variables, with ξ_0 = (1/2) ε μ_max. This crucial property is proved in Appendix A together with the next proposition, which shows that Algorithm 1 approximates the regularization path for the ℓ1/ℓ2 norm:
Proposition 1 Let λ_t denote the value of the regularization parameter at the t-th iteration, with initial value λ_0 ≥ ‖∇_{w_{j*}} J(0)‖. Assuming J to be twice differentiable and strictly convex, for all η there exist ε > 0 and ξ > 0 such that the iterates W_t of Algorithm 1 obey J(W_t) − J(W(λ_t)) ≤ η for every time step t such that λ_{t+1} < λ_t, where W(λ_t) is the unique solution to (2). Moreover, the algorithm terminates (provided the active set is not pruned) in a finite number of iterations, at a regularization coefficient no greater than any prespecified λ_min > 0.
It is also worth noting that it is possible to set ξ_0 = 0 and develop a stricter version of the algorithm that identifies the correct active set for each λ. We present this variant in Appendix B.
Since our algorithm does not appeal to global second-order information, it is quite scalable compared to standard homotopy algorithms such as LARS. This is particularly useful in the multi-task setting, where problems can be relatively large and where algorithms such as LARS become slow. Our algorithm samples the path regularly, on a scale that is determined automatically by the algorithm through the update rule for λ_t, and allows for several new covariates to enter the active set simultaneously. (Empirically we find that this scale is logarithmic.) The algorithm is obviously less efficient than LARS-type algorithms on long pieces of the path that are smooth, but we indicate in the following section how variants of the algorithm could address this. Finally, our algorithm applies to contexts in which LARS-type algorithms do not apply directly, and where the use of classical homotopy methods is precluded by non-differentiability.
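A compact sketch of Algorithm 1 is given below (illustrative Python; loss_J and grad_J stand for J and ∇J, and the final correction step, a few proximal-gradient passes restricted to the active columns, is only one simple stand-in for enforcing (4) on the active set, not the paper's own correction procedure):

```python
import numpy as np

def approx_block_lasso_path(loss_J, grad_J, K, p,
                            eps=1e-2, lam_min=1e-3, correction_steps=20):
    """Sketch of Algorithm 1 (approximate block-Lasso path).

    loss_J(W) -> scalar joint loss J(W); grad_J(W) -> (K, p) gradient matrix.
    """
    W = np.zeros((K, p))
    active = set()
    lam = np.linalg.norm(grad_J(W), axis=0).max()      # lambda_0
    path = [(lam, W.copy())]

    while lam > lam_min:
        G = grad_J(W)
        col_norms = np.linalg.norm(G, axis=0)
        j_star = int(np.argmax(col_norms))             # most violating covariate

        # Prediction step: move w_{j*} by eps along its normalized negative gradient.
        W_new = W.copy()
        W_new[:, j_star] -= eps * G[:, j_star] / max(col_norms[j_star], 1e-12)

        # Update the regularization coefficient as in the listing above.
        lam = min(lam, (loss_J(W) - loss_J(W_new)) / eps)
        W = W_new
        active.add(j_star)

        # Correction step (stand-in): proximal gradient restricted to active columns.
        idx = sorted(active)
        for _ in range(correction_steps):
            G = grad_J(W)
            W[:, idx] -= eps * G[:, idx]
            norms = np.linalg.norm(W[:, idx], axis=0)
            shrink = np.maximum(1.0 - eps * lam / np.maximum(norms, 1e-12), 0.0)
            W[:, idx] *= shrink

        path.append((lam, W.copy()))
    return path
```

In practice, as discussed in Sect. 4.1, the fixed step size ε here would be replaced by an adaptive step size chosen by a line search over (0, ε].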
In the following two subsections we further describe Algorithm 1, providing further details on the prediction step (the choice of u_t) and the correction step (the enforcement of (4) for covariates in the active set).

4.1 Prediction steps
The choice u_t = ∇_{w_{j*}} J / ‖∇_{w_{j*}} J‖ that we have specified for the prediction step is one possible option. It is also possible to take a global gradient descent step or, more generally, a step along a gradient-related descent direction (a direction u_t such that lim inf_t −u_t · ∇J(W_t) / ‖∇J(W_t)‖ > δ > 0), with an update rule for the regularization coefficient of the form

\[
\lambda_{t+1} = \min\!\left( \lambda_t,\; \frac{J(W_t) - J(W_{t+1})}{\|W_t - W_{t+1}\|_{\ell_1/\ell_2}} \right).
\]

Indeed, the proof of Appendix A could easily be generalized to the case of steps of ℓ1/ℓ2 norm ε taken along a general descent direction. Note that only the iterates that conclude with a decrease of the regularization coefficient are guaranteed to be close to the path.
For simplicity, we have presented the algorithm as using a fixed step size ε, but in practice we recommend using an adaptive step size determined by a line search limited to the segment (0, ε]. This allows us to explore the end of the path, where the regularization coefficient becomes exponentially small. Lemma 3 in Appendix A considers this case.
If we understand the "active set" as the set of covariates with non-zero coefficients, it is possible for a covariate to enter and later exit this set, which, a priori, would require pruning. The analysis of pruning is delicate and we do not consider it here. In practice, the case of parameters returning to zero appears to be rare: in our experiments, typically at most two components return to zero per path. Thus, implementing a pruning step would not yield a significant speed-up of the algorithm.