Maximum-Margin Matrix Factorization
Nathan Srebro
Dept. of Computer Science, University of Toronto, Toronto, ON, Canada

Jason D. M. Rennie, Tommi S. Jaakkola
Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA, USA
jrennie,tommi@csail.mit.edu

Abstract
We present a novel approach to collaborative prediction, using low-norm instead of low-rank factorizations. The approach is inspired by, and has strong connections to, large-margin linear discrimination. We show how to learn low-norm factorizations by solving a semi-definite program, and discuss generalization error bounds for them.
1 Introduction
Fitting a target matrix Y with a low-rank matrix X by minimizing the sum-squared error is a common approach to modeling tabulated data, and can be done explicitly in terms of the singular value decomposition of Y. It is often desirable, though, to minimize a different loss function: loss corresponding to a specific probabilistic model (where X are the mean parameters, as in pLSA [1], or the natural parameters [2]); or loss functions such as hinge loss appropriate for binary or discrete ordinal data. Loss functions other than squared-error yield non-convex optimization problems with multiple local minima. Even with a squared-error loss, when only some of the entries in Y are observed, as is the case for collaborative filtering, local minima arise and SVD techniques are no longer applicable [3].
Low-rank approximations constrain the dimensionality of the factorization X = UV'. Other constraints, such as sparsity and non-negativity [4], have also been suggested for better capturing the structure in Y, and also lead to non-convex optimization problems.
In this paper we suggest regularizing the factorization by constraining the norm of U and V—constraints that arise naturally when matrix factorizations are viewed as feature learning for large-margin linear prediction (Section 2). Unlike low-rank factorizations, such constraints lead to convex optimization problems that can be formulated as semi-definite programs (Section 4). Throughout the paper, we focus on using low-norm factorizations for “collaborative prediction”: predicting unobserved entries of a target matrix Y, based on a subset S of observed entries Y_S. In Section 5, we present generalization error bounds for collaborative prediction using low-norm factorizations.
2 Matrix Factorization as Feature Learning
Using a low-rank model for collaborative prediction [5, 6, 3] is straightforward: A low-rank matrix X is sought that minimizes a loss versus the observed entries Y_S. Unobserved entries in Y are predicted according to X. Matrices of rank at most k are those that can be factored into X = UV', U ∈ R^{n×k}, V ∈ R^{m×k}, and so seeking a low-rank matrix is equivalent to seeking a low-dimensional factorization.
If one of the matrices, say U, is fixed, and only the other matrix V needs to be learned, then fitting each column of the target matrix Y is a separate linear prediction problem. Each row of U functions as a “feature vector”, and each column of V is a linear predictor, predicting the entries in the corresponding column of Y based on the “features” in U.
In collaborative prediction, both U and V are unknown and need to be estimated. This can be thought of as learning feature vectors (rows in U) for each of the rows of Y, enabling good linear prediction across all of the prediction problems (columns of Y) concurrently, each with a different linear predictor (columns of V). The features are learned without any external information or constraints, which is impossible for a single prediction task (we would use the labels as features). The underlying assumption that enables us to do this in a collaborative filtering situation is that the prediction tasks (columns of Y) are related, in that the same features can be used for all of them, though possibly in different ways.
Low-rank collaborative prediction corresponds to regularizing by limiting the dimensionality of the feature space—each column is a linear prediction problem in a low-dimensional space. Instead, we suggest allowing an unbounded dimensionality for the feature space, and regularizing by requiring a low-norm factorization, while predicting with large margin. Consider adding to the loss a penalty term which is the sum of squares of entries in U and V, i.e. $\|U\|_{\text{Fro}}^2 + \|V\|_{\text{Fro}}^2$ ($\|\cdot\|_{\text{Fro}}$ denotes the Frobenius norm). Each “conditional” problem (fitting U given V and vice versa) again decomposes into a collection of standard, this time regularized, linear prediction problems. With an appropriate loss function, or constraints on the observed entries, these correspond to large-margin linear discrimination problems. For example, if we learn a binary observation matrix by minimizing a hinge loss plus such a regularization term, each conditional problem decomposes into a collection of SVMs, as sketched below.
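To make the decomposition concrete, the following minimal sketch (assuming Python with NumPy and scikit-learn; the helper name `fit_V_given_U` is illustrative) fits V for a fixed U: each column of V becomes an independent L2-regularized hinge-loss (SVM) problem over the observed ±1 entries of the corresponding column of Y. It only illustrates the conditional sub-problem; the paper itself optimizes U and V jointly via the semi-definite program of Section 4.

```python
import numpy as np
from sklearn.svm import LinearSVC  # L2-regularized hinge-loss classifier

def fit_V_given_U(U, Y, observed, c=1.0):
    """One "conditional" step: with U fixed, each column a of V is an
    independent SVM over the rows observed in column a of Y (+/-1 labels),
    using the rows of U as feature vectors.  (Illustrative sketch only.)"""
    n, k = U.shape
    m = Y.shape[1]
    V = np.zeros((m, k))
    for a in range(m):
        rows = np.where(observed[:, a])[0]
        if len(rows) == 0 or len(np.unique(Y[rows, a])) < 2:
            continue  # need both classes present to fit an SVM for this column
        svm = LinearSVC(C=c, fit_intercept=False)
        svm.fit(U[rows], Y[rows, a])
        V[a] = svm.coef_.ravel()  # the linear predictor for column a
    return V
```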
3 Maximum-Margin Matrix Factorizations
Matrices with a factorization X = UV', where U and V have low Frobenius norm (recall that the dimensionality of U and V is no longer bounded!), can be characterized in several equivalent ways, and are known as low trace norm matrices:

Definition 1. The trace norm¹ $\|X\|_\Sigma$ is the sum of the singular values of X.

Lemma 1. $\|X\|_\Sigma = \min_{X=UV'} \|U\|_{\text{Fro}} \|V\|_{\text{Fro}} = \min_{X=UV'} \tfrac{1}{2}\left(\|U\|_{\text{Fro}}^2 + \|V\|_{\text{Fro}}^2\right)$
The characterization in terms of the singular value decomposition allows us to characterize low trace norm matrices as the convex hull of bounded-norm rank-one matrices:

Lemma 2. $\{X \mid \|X\|_\Sigma \le B\} = \operatorname{conv}\left\{\, uv' \mid u \in \mathbb{R}^n,\ v \in \mathbb{R}^m,\ |u|^2 = |v|^2 = B \,\right\}$
In particular, the trace norm is a convex function, and the set of bounded trace norm matrices is a convex set. For convex loss functions, seeking a bounded trace norm matrix minimizing the loss versus some target matrix is a convex optimization problem.
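As a quick numerical illustration of Lemma 1 (a sketch, assuming NumPy), the SVD of X yields a balanced factorization, obtained by scaling the singular vectors by the square roots of the singular values, for which the two expressions in the lemma equal the trace norm; the lemma asserts that this value is also minimal over all factorizations.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))

# Trace norm: the sum of the singular values of X.
U_s, s, Vt_s = np.linalg.svd(X, full_matrices=False)
trace_norm = s.sum()

# Balanced factorization from the SVD: X = U V' with U = U_s * sqrt(s), V = V_s * sqrt(s).
U = U_s * np.sqrt(s)
V = Vt_s.T * np.sqrt(s)
assert np.allclose(U @ V.T, X)

# For this factorization the two expressions of Lemma 1 coincide with the trace norm.
print(trace_norm)
print(np.linalg.norm(U) * np.linalg.norm(V))
print(0.5 * (np.linalg.norm(U) ** 2 + np.linalg.norm(V) ** 2))
```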
This contrasts sharply with minimizing loss over low-rank matrices—a non-convex problem. Although the sum-squared error versus a fully observed target matrix can be minimized efficiently using the SVD (despite the optimization problem being non-convex!), minimizing other loss functions, or even minimizing a squared loss versus a partially observed matrix, is a difficult optimization problem with multiple local minima [3].
¹ Also known as the nuclear norm and the Ky-Fan n-norm.
In fact, the trace norm has been suggested as a convex surrogate to the rank for various rank-minimization problems [7]. Here, we justify the trace norm directly, both as a natural extension of large-margin methods and by providing generalization error bounds.
To simplify presentation, we focus on binary labels, $Y \in \{\pm 1\}^{n \times m}$. We consider hard-margin matrix factorization, where we seek a minimum trace norm matrix X that matches the observed labels with a margin of one: $Y_{ia} X_{ia} \ge 1$ for all $ia \in S$. We also consider soft-margin learning, where we minimize a trade-off between the trace norm of X and its hinge loss relative to $Y_S$:
$$\text{minimize}\quad \|X\|_\Sigma + c \sum_{ia \in S} \max(0,\, 1 - Y_{ia} X_{ia}). \qquad (1)$$
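Numerically, (1) is a convex program over X with a nuclear-norm (trace-norm) penalty plus hinge loss on the observed entries. The sketch below assumes CVXPY and an illustrative helper name `soft_margin_mmmf`; Section 4 gives an explicit semi-definite formulation of the same problem.

```python
import cvxpy as cp
import numpy as np

def soft_margin_mmmf(Y, observed, c=1.0):
    """Soft-margin objective (1): trace norm of X plus c times the hinge
    loss over the observed +/-1 entries of Y.  Sketch, not the authors' code."""
    n, m = Y.shape
    X = cp.Variable((n, m))
    hinge = [cp.pos(1 - Y[i, a] * X[i, a])
             for i in range(n) for a in range(m) if observed[i, a]]
    objective = cp.Minimize(cp.normNuc(X) + c * cp.sum(cp.hstack(hinge)))
    cp.Problem(objective).solve()
    return X.value
```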
As in maximum-margin linear discrimination, there is an inverse dependence between the norm and the margin. Fixing the margin and minimizing the trace norm is equivalent to fixing the trace norm and maximizing the margin. As in large-margin discrimination with certain infinite dimensional (e.g. radial) kernels, the data is always separable with sufficiently high trace norm (a trace norm of $\sqrt{n|S|}$ is sufficient to attain a margin of one).

The max-norm variant  Instead of constraining the norms of rows in U and V on average, we can constrain all rows of U and V to have small L2 norm, replacing the trace norm with $\|X\|_{\max} = \min_{X=UV'} (\max_i |U_i|)(\max_a |V_a|)$ where $U_i, V_a$ are rows of U, V. Low-max-norm discrimination has a clean geometric interpretation. First, note that predicting the target matrix with the signs of a rank-k matrix corresponds to mapping the “items” (columns) to points in $\mathbb{R}^k$, and the “users” (rows) to homogeneous hyperplanes, such that each user's hyperplane separates his positive items from his negative items. Hard-margin low-max-norm prediction corresponds to mapping the users and items to points and hyperplanes in a high-dimensional unit sphere such that each user's hyperplane separates his positive and negative items with a large margin (the margin being the inverse of the max-norm).
4 Learning Maximum-Margin Matrix Factorizations
In this section we investigate the optimization problem of learning a low-norm factorization UV', given a binary target matrix. Bounding the trace norm of UV' by $\tfrac{1}{2}(\|U\|_{\text{Fro}}^2 + \|V\|_{\text{Fro}}^2)$, we can characterize the trace norm in terms of the trace of a positive semi-definite matrix:

Lemma 3 ([7, Lemma 1]). For any $X \in \mathbb{R}^{n \times m}$ and $t \in \mathbb{R}$: $\|X\|_\Sigma \le t$ iff there exist $A \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{m \times m}$ such that² $\begin{pmatrix} A & X \\ X' & B \end{pmatrix} \succeq 0$ and $\operatorname{tr} A + \operatorname{tr} B \le 2t$.

Proof. Note that for any matrix W, $\|W\|_{\text{Fro}}^2 = \operatorname{tr} WW'$. If $\begin{pmatrix} A & X \\ X' & B \end{pmatrix} \succeq 0$, we can write it as a product $\left[\begin{smallmatrix} U \\ V \end{smallmatrix}\right]\left[\begin{smallmatrix} U \\ V \end{smallmatrix}\right]'$. We have $X = UV'$ and $\tfrac{1}{2}(\|U\|_{\text{Fro}}^2 + \|V\|_{\text{Fro}}^2) = \tfrac{1}{2}(\operatorname{tr} A + \operatorname{tr} B) \le t$, establishing $\|X\|_\Sigma \le t$. Conversely, if $\|X\|_\Sigma \le t$ we can write $X = UV'$ with $\operatorname{tr} UU' + \operatorname{tr} VV' \le 2t$ and consider the p.s.d. matrix $\begin{pmatrix} UU' & X \\ X' & VV' \end{pmatrix}$.
Lemma 3 can be used in order to formulate minimizing the trace norm as a semi-definite optimization problem (SDP). Soft-margin matrix factorization (1) can be written as:

$$\min\ \tfrac{1}{2}(\operatorname{tr} A + \operatorname{tr} B) + c \sum_{ia \in S} \xi_{ia} \quad \text{s.t.} \quad \begin{pmatrix} A & X \\ X' & B \end{pmatrix} \succeq 0, \quad Y_{ia} X_{ia} \ge 1 - \xi_{ia},\ \xi_{ia} \ge 0 \ \forall ia \in S \qquad (2)$$

² $A \succeq 0$ denotes that A is positive semi-definite.
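Using Lemma 3, program (2) can be coded with a single PSD block variable Z = [[A, X], [X', B]]. The following sketch (assuming CVXPY with an SDP-capable solver; the function name is illustrative) mirrors (2) directly:

```python
import cvxpy as cp
import numpy as np

def mmmf_sdp(Y, observed, c=1.0):
    """Soft-margin MMMF as the SDP (2): minimize (1/2)(tr A + tr B) + c*sum(xi)
    subject to [[A, X], [X', B]] being PSD and the margin constraints on S."""
    n, m = Y.shape
    Z = cp.Variable((n + m, n + m), PSD=True)   # the block matrix is constrained PSD
    A, B, X = Z[:n, :n], Z[n:, n:], Z[:n, n:]
    obs = [(i, a) for i in range(n) for a in range(m) if observed[i, a]]
    xi = cp.Variable(len(obs), nonneg=True)     # slack variables xi_ia
    margin = [Y[i, a] * X[i, a] >= 1 - xi[k] for k, (i, a) in enumerate(obs)]
    objective = cp.Minimize(0.5 * (cp.trace(A) + cp.trace(B)) + c * cp.sum(xi))
    cp.Problem(objective, margin).solve()
    return X.value
```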
Associating a dual variable $Q_{ia}$ with each constraint on $X_{ia}$, the dual of (2) is [8, Section 5.4.2]:

$$\max \sum_{ia \in S} Q_{ia} \quad \text{s.t.} \quad \begin{pmatrix} I & -Q \otimes Y \\ (-Q \otimes Y)' & I \end{pmatrix} \succeq 0, \quad 0 \le Q_{ia} \le c \qquad (3)$$
where $Q \otimes Y$ denotes the sparse matrix $(Q \otimes Y)_{ia} = Q_{ia} Y_{ia}$ for $ia \in S$ and zeros elsewhere. The problem is strictly feasible, and there is no duality gap. The p.s.d. constraint in the dual (3) is equivalent to bounding the spectral norm of $Q \otimes Y$, and the dual can also be written as an optimization problem subject to a bound on the spectral norm, i.e. a bound on the singular values of $Q \otimes Y$:
$$\max \sum_{ia \in S} Q_{ia} \quad \text{s.t.} \quad \|Q \otimes Y\|_2 \le 1, \quad 0 \le Q_{ia} \le c \ \forall ia \in S \qquad (4)$$
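The dual (4) can be sketched in the same way, with the spectral-norm bound expressed through the maximum singular value of Q ⊗ Y. This is again a CVXPY sketch with an illustrative function name; here Q is a dense variable forced to zero outside S, rather than the sparse representation a large-scale solver would use.

```python
import cvxpy as cp
import numpy as np

def mmmf_dual(Y, observed, c=1.0):
    """Dual (4): maximize the sum of Q over S, subject to a bound of one on
    the spectral norm of Q (x) Y and box constraints 0 <= Q_ia <= c."""
    n, m = Y.shape
    mask = observed.astype(float)
    Q = cp.Variable((n, m))
    constraints = [
        cp.multiply(1 - mask, Q) == 0,         # Q is supported on the observed set S
        Q >= 0, Q <= c,                        # box constraints
        cp.sigma_max(cp.multiply(Q, Y)) <= 1,  # largest singular value of Q (x) Y
    ]
    cp.Problem(cp.Maximize(cp.sum(cp.multiply(mask, Q))), constraints).solve()
    return Q.value
```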
In typical collaborative prediction problems, we observe only a small fraction of the entries in a large target matrix. Such a situation translates to a sparse dual semi-definite program, with the number of variables equal to the number of observed entries. Large-scale SDP solvers can take advantage of such sparsity.
The prediction matrix X* minimizing (1) is part of the primal optimal solution of (2), and can be extracted from it directly. Nevertheless, it is interesting to study how the optimal prediction matrix X* can be directly recovered from a dual optimal solution Q* alone. Although unnecessary when relying on the interior point methods used by most SDP solvers (as these return a primal/dual optimal pair), this can enable us to use specialized optimization methods, taking advantage of the simple structure of the dual.
Recovering X* from Q*  As for linear programming, recovering a primal optimal solution directly from a dual optimal solution is not always possible for SDPs. However, at least for the hard-margin problem (no slack) this is possible, and we describe below how an optimal prediction matrix X* can be recovered from a dual optimal solution Q* by calculating a singular value decomposition and solving linear equations.

Given a dual optimal Q*, consider its singular value decomposition $Q^* \otimes Y = U \Lambda V'$. Recall that all singular values of $Q^* \otimes Y$ are bounded by one, and consider only the columns $\tilde{U} \in \mathbb{R}^{n \times p}$ of U and $\tilde{V} \in \mathbb{R}^{m \times p}$ of V with singular value one. It is possible to show [8, Section 5.4.3], using complementary slackness, that for some matrix $R \in \mathbb{R}^{p \times p}$, $X^* = \tilde{U} R R' \tilde{V}'$ is an optimal solution to the maximum-margin matrix factorization problem (1).

Furthermore, $\frac{p(p+1)}{2}$ is bounded above by the number of non-zero $Q^*_{ia}$. When $Q^*_{ia} > 0$, and assuming hard-margin (no box constraints in the dual), complementary slackness dictates that $X^*_{ia} = \tilde{U}_i R R' \tilde{V}_a' = Y_{ia}$, providing us with a linear equation on the $\frac{p(p+1)}{2}$ entries of the symmetric $RR'$. For hard-margin matrix factorization, we can therefore recover the entries of $RR'$ by solving a system of linear equations, with a number of variables bounded by the number of observed entries.
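The recovery procedure can be sketched numerically as follows (assuming NumPy; a least-squares solve stands in for the exact linear system on the symmetric RR', and the tolerance used to detect "singular value equal to one" is an assumption of the sketch):

```python
import numpy as np

def recover_X_from_dual(Q, Y, tol=1e-6):
    """Hard-margin recovery sketch: take the singular vectors of Q (x) Y with
    singular value one, then solve the linear equations U_i M V_a' = Y_ia
    (for observed ia with Q_ia > 0) for the symmetric p x p matrix M = R R',
    and return X = U_tilde M V_tilde'.  Q is the dual optimum, zero off S."""
    QY = Q * Y                                   # the sparse matrix Q (x) Y
    U, s, Vt = np.linalg.svd(QY, full_matrices=False)
    keep = np.abs(s - 1.0) < tol                 # singular values (numerically) equal to one
    p = int(keep.sum())
    Ut, Vtld = U[:, keep], Vt[keep, :].T         # n x p and m x p
    rows, cols = np.nonzero(Q > 0)               # entries giving linear equations
    design = np.stack([np.outer(Ut[i], Vtld[a]).ravel() for i, a in zip(rows, cols)])
    rhs = Y[rows, cols].astype(float)
    M_vec, *_ = np.linalg.lstsq(design, rhs, rcond=None)
    M = M_vec.reshape(p, p)
    M = 0.5 * (M + M.T)                          # symmetrize; M plays the role of R R'
    return Ut @ M @ Vtld.T
```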
Recovering specific entries  The approach described above requires solving a large system of linear equations (with as many variables as observations). Furthermore, especially when the observations are very sparse (only a small fraction of the entries in the target matrix are observed), the dual solution is much more compact than the prediction matrix: the dual involves a single number for each observed entry. It might be desirable to avoid storing the prediction matrix X* explicitly, and to calculate a desired entry $X^*_{i_0 a_0}$, or at least its sign, directly from the dual optimal solution Q*.

Consider adding the constraint $X_{i_0 a_0} > 0$ to the primal SDP (2). If there exists an optimal solution X* to the original SDP with $X^*_{i_0 a_0} > 0$, then this is also an optimal solution to the modified SDP, with the same objective value. Otherwise, the optimal solution of the modified SDP is not optimal for the original SDP, and the optimal value of the modified SDP is higher (worse) than the optimal value of the original SDP.

Introducing the constraint $X_{i_0 a_0} > 0$ to the primal SDP (2) corresponds to introducing a new variable $Q_{i_0 a_0}$ to the dual SDP (3), appearing in $Q \otimes Y$ (with $Y_{i_0 a_0} = 1$) but not in the objective. In this modified dual, the optimal solution Q* of the original dual would always be feasible. But, if $X^*_{i_0 a_0} < 0$ in all primal optimal solutions, then the modified primal SDP has a higher value, and so does the dual, and Q* is no longer optimal for the new dual. By checking the optimality of Q* for the modified dual, e.g. by attempting to re-optimize it, we can recover the sign of $X^*_{i_0 a_0}$.

We can repeat this test once with $Y_{i_0 a_0} = 1$ and once with $Y_{i_0 a_0} = -1$, corresponding to $X_{i_0 a_0} > 0$ and $X_{i_0 a_0} < 0$. If $Y_{i_0 a_0} X^*_{i_0 a_0} < 0$ (in all optimal solutions), then the dual solution can be improved by introducing $Q_{i_0 a_0}$ with a sign of $Y_{i_0 a_0}$.
Predictions for new users  So far, we assumed that learning is done on the known entries in all rows. It is commonly desirable to predict entries in a new partially observed row of Y (a new user), not included in the original training set. This essentially requires solving a “conditional” problem, where V is already known, and a new row of U is learned (the predictor for the new user) based on a new partially observed row of X. Using maximum-margin matrix factorization, this is a standard SVM problem; a sketch is given below.
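A minimal sketch of this conditional step, assuming the item factors V have been extracted from a learned factorization and using a standard linear SVM (scikit-learn; function and variable names are illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVC

def predict_new_user(V, y_partial, observed, c=1.0):
    """Learn a new row of U for a new user from their partially observed
    +/-1 ratings, treating the rows of V as feature vectors, then predict
    signs for all items.  (Standard SVM problem, as described above.)"""
    idx = np.where(observed)[0]
    svm = LinearSVC(C=c, fit_intercept=False)
    svm.fit(V[idx], y_partial[idx])
    u_new = svm.coef_.ravel()          # the new user's feature vector
    return np.sign(V @ u_new)          # predicted labels for every item
```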
Max-norm MMMF as a SDP  The max-norm variant can also be written as a SDP, with the primal and dual taking the forms:

$$\min\ t + c \sum_{ia \in S} \xi_{ia} \quad \text{s.t.} \quad \begin{pmatrix} A & X \\ X' & B \end{pmatrix} \succeq 0, \quad A_{ii}, B_{aa} \le t \ \forall i,a, \quad Y_{ia} X_{ia} \ge 1 - \xi_{ia},\ \xi_{ia} \ge 0 \ \forall ia \in S \qquad (5)$$

$$\max \sum_{ia \in S} Q_{ia} \quad \text{s.t.} \quad \begin{pmatrix} \Gamma & -Q \otimes Y \\ (-Q \otimes Y)' & \Delta \end{pmatrix} \succeq 0, \quad \Gamma, \Delta \text{ diagonal}, \quad \operatorname{tr} \Gamma + \operatorname{tr} \Delta = 1, \quad 0 \le Q_{ia} \le c \ \forall ia \in S \qquad (6)$$
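The primal (5) differs from (2) only in the objective and the per-row bounds on the diagonals of A and B; a CVXPY sketch with illustrative names, assuming an SDP-capable solver:

```python
import cvxpy as cp
import numpy as np

def maxnorm_mmmf_sdp(Y, observed, c=1.0):
    """Max-norm MMMF as the SDP (5): minimize t + c*sum(xi) with
    diag(A) <= t, diag(B) <= t and the usual margin constraints on S."""
    n, m = Y.shape
    Z = cp.Variable((n + m, n + m), PSD=True)
    A, B, X = Z[:n, :n], Z[n:, n:], Z[:n, n:]
    t = cp.Variable()
    obs = [(i, a) for i in range(n) for a in range(m) if observed[i, a]]
    xi = cp.Variable(len(obs), nonneg=True)
    constraints = [cp.diag(A) <= t, cp.diag(B) <= t]
    constraints += [Y[i, a] * X[i, a] >= 1 - xi[k] for k, (i, a) in enumerate(obs)]
    cp.Problem(cp.Minimize(t + c * cp.sum(xi)), constraints).solve()
    return X.value
```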
5 Generalization Error Bounds for Low Norm Matrix Factorizations

Similarly to standard feature-based prediction approaches, collaborative prediction methods can also be analyzed in terms of their generalization ability: How confidently can we predict entries of Y based on our error on the observed entries $Y_S$? We present here generalization error bounds that hold for any target matrix Y, and for a random subset of observations S, and bound the average error across all entries in terms of the observed margin error³. The central assumption, paralleling the i.i.d. source assumption for standard feature-based prediction, is that the observed subset S is picked uniformly at random.

³ The bounds presented here are special cases of bounds for general loss functions that we present and prove elsewhere [8, Section 6.2]. To prove the bounds we bound the Rademacher complexity of bounded trace norm and bounded max-norm matrices (i.e. balls w.r.t. these norms). The unit trace norm ball is the convex hull of outer products of unit norm vectors. It is therefore enough to bound the Rademacher complexity of such outer products, which boils down to analyzing the spectral norm of random matrices. As a consequence of Grothendieck's inequality, the unit max-norm ball is within a factor of two of the convex hull of outer products of sign vectors. The Rademacher complexity of such outer products can be bounded by considering their cardinality.

Theorem 4. For all target matrices $Y \in \{\pm 1\}^{n \times m}$ and sample sizes $|S| > n \log n$, and for a uniformly selected sample S of |S| entries in Y, with probability at least $1 - \delta$ over the sample selection, the following holds for all matrices $X \in \mathbb{R}^{n \times m}$ and all $\gamma > 0$:

$$\frac{1}{nm}\bigl|\{ia \mid X_{ia} Y_{ia} \le 0\}\bigr| < \frac{1}{|S|}\bigl|\{ia \in S \mid X_{ia} Y_{ia} \le \gamma\}\bigr| + K\,\frac{\|X\|_\Sigma}{\gamma\sqrt{nm}}\,\sqrt[4]{\ln m}\,\sqrt{\frac{(n+m)\ln n}{|S|}} + \sqrt{\frac{\ln(1 + |\log \|X\|_\Sigma/\gamma|)}{|S|}} + \sqrt{\frac{\ln(4/\delta)}{2|S|}} \qquad (7)$$

and

$$\frac{1}{nm}\bigl|\{ia \mid X_{ia} Y_{ia} \le 0\}\bigr| < \frac{1}{|S|}\bigl|\{ia \in S \mid X_{ia} Y_{ia} \le \gamma\}\bigr| + 12\,\frac{\|X\|_{\max}}{\gamma}\,\sqrt{\frac{n+m}{|S|}} + \sqrt{\frac{\ln(1 + |\log \|X\|_\Sigma/\gamma|)}{|S|}} + \sqrt{\frac{\ln(4/\delta)}{2|S|}} \qquad (8)$$
where K is a universal constant that does not depend on Y, n, m, γ or any other quantity.

To understand the scaling of these bounds, consider n×m matrices X = UV' where the norms of the rows of U and V are bounded by r, i.e. matrices with $\|X\|_{\max} \le r^2$. The trace norm of such matrices is bounded by $r^2\sqrt{nm}$, and so the two bounds agree up to log-factors—the cost of allowing the norm to be low on average but not uniformly. Recall that the conditional problem, where V is fixed and only U is learned, is a collection of low-norm (large-margin) linear prediction problems. When the norms of rows in U and V are bounded by r, a similar generalization error bound on the conditional problem would include the term $\frac{r^2}{\gamma}\sqrt{\frac{n}{|S|}}$, matching the bounds of Theorem 4 up to log-factors—learning both U and V does not introduce significantly more error than learning just one of them.

Also of interest is the comparison with bounds for low-rank matrices, for which $\|X\|_\Sigma \le \sqrt{\operatorname{rank} X}\,\|X\|_{\text{Fro}}$. In particular, for an n×m rank-k matrix X with entries bounded by B, $\|X\|_\Sigma \le \sqrt{knm}\,B$, and the second term on the right-hand side of (7) becomes:
$$K\,\frac{B}{\gamma}\,\sqrt[4]{\ln m}\,\sqrt{\frac{k(n+m)\ln n}{|S|}} \qquad (9)$$

Although this is the best (up to log factors) that can be expected from scale-sensitive bounds⁴, taking a combinatorial approach, the dependence on the magnitude of the entries in X (and the margin) can be avoided [9].
6 Implementation and Experiments
Ratings  In many collaborative prediction tasks, the labels are not binary, but rather are discrete “ratings” in several ordered levels (e.g. one star through five stars). Separating R levels by thresholds $-\infty = \theta_0 < \theta_1 < \cdots < \theta_R = \infty$, and generalizing the hard-margin constraints for binary labels, one can require $\theta_{Y_{ia}} + 1 \le X_{ia} \le \theta_{Y_{ia}+1} - 1$. A soft-margin version of these constraints, with slack variables for the two constraints on each observed rating, corresponds to a generalization of the hinge loss which is a convex bound on the zero/one level-agreement error (ZOE) [10]. To obtain a loss which is a convex bound on the mean-absolute-error (MAE—the difference, in levels, between the predicted level and the true level), we introduce R−1 slack variables for each observed rating—one for each threshold.
⁴ For general loss functions, bounds as in Theorem 4 depend only on the Lipschitz constant of the loss, and (9) is the best (up to log factors) that can be achieved without explicitly bounding the magnitude of the loss function.
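A minimal NumPy sketch of the resulting per-rating penalty (the "all-threshold" hinge described above, with one term per threshold; the thresholds $\theta_1, \dots, \theta_{R-1}$ are taken as given in this sketch):

```python
import numpy as np

def all_threshold_hinge(x, rating, thresholds):
    """Convex upper bound on the absolute error (in levels) for a single
    observed rating: one hinge term (one slack variable in the SDP) per
    threshold theta_1 < ... < theta_{R-1}.  `rating` is a level in 1..R and
    `x` is the real-valued prediction X_ia.  (Sketch; thresholds given.)"""
    loss = 0.0
    for r, theta in enumerate(thresholds, start=1):
        if rating > r:                 # x should exceed theta_r by a margin of one
            loss += max(0.0, 1.0 - (x - theta))
        else:                          # x should lie below theta_r by a margin of one
            loss += max(0.0, 1.0 + (x - theta))
    return loss
```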
