Maximum-Margin Matrix Factorization
Nathan Srebro
Dept. of Computer Science, University of Toronto, Toronto, ON, Canada

Jason D. M. Rennie, Tommi S. Jaakkola
Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA, USA
jrennie,tommi@csail.mit.edu

Abstract
We present a novel approach to collaborative prediction, using low-norm instead of low-rank factorizations. The approach is inspired by, and has strong connections to, large-margin linear discrimination. We show how to learn low-norm factorizations by solving a semi-definite program, and discuss generalization error bounds for them.
1 Introduction
Fitting a target matrix Y with a low-rank matrix X by minimizing the sum-squared error is a common approach to modeling tabulated data, and can be done explicitly in terms of the singular value decomposition of Y. It is often desirable, though, to minimize a different loss function: loss corresponding to a specific probabilistic model (where X are the mean parameters, as in pLSA [1], or the natural parameters [2]); or loss functions such as hinge loss appropriate for binary or discrete ordinal data. Loss functions other than squared-error yield non-convex optimization problems with multiple local minima. Even with a squared-error loss, when only some of the entries in Y are observed, as is the case for collaborative filtering, local minima arise and SVD techniques are no longer applicable [3].
Low-rank approximations constrain the dimensionality of the factorization X = UV'. Other constraints, such as sparsity and non-negativity [4], have also been suggested for better capturing the structure in Y, and also lead to non-convex optimization problems.
In this paper we suggest regularizing the factorization by constraining the norm of U and V—constraints that arise naturally when matrix factorizations are viewed as feature learning for large-margin linear prediction (Section 2). Unlike low-rank factorizations, such constraints lead to convex optimization problems that can be formulated as semi-definite programs (Section 4). Throughout the paper, we focus on using low-norm factorizations for “collaborative prediction”: predicting unobserved entries of a target matrix Y, based on a subset S of observed entries Y_S. In Section 5, we present generalization error bounds for collaborative prediction using low-norm factorizations.
2 Matrix Factorization as Feature Learning
Using a low-rank model for collaborative prediction [5, 6, 3] is straightforward: A low-rank matrix X is sought that minimizes a loss versus the observed entries Y_S. Unobserved entries in Y are predicted according to X. Matrices of rank at most k are those that can be factored into X = UV', U ∈ R^{n×k}, V ∈ R^{m×k}, and so seeking a low-rank matrix is equivalent to seeking a low-dimensional factorization.
If one of the matrices, say U, is fixed, and only the other matrix V needs to be learned, then fitting each column of the target matrix Y is a separate linear prediction problem. Each row of U functions as a “feature vector”, and each column of V is a linear predictor, predicting the entries in the corresponding column of Y based on the “features” in U.
In collaborative prediction, both U and V are unknown and need to be estimated. This can be thought of as learning feature vectors (rows in U) for each of the rows of Y, enabling good linear prediction across all of the prediction problems (columns of Y) concurrently, each with a different linear predictor (columns of V). The features are learned without any external information or constraints, which is impossible for a single prediction task (we would use the labels as features). The underlying assumption that enables us to do this in a collaborative filtering situation is that the prediction tasks (columns of Y) are related, in that the same features can be used for all of them, though possibly in different ways.
Low-rank collaborative prediction corresponds to regularizing by limiting the dimensionality of the feature space—each column is a linear prediction problem in a low-dimensional space. Instead, we suggest allowing an unbounded dimensionality for the feature space, and regularizing by requiring a low-norm factorization, while predicting with large margin. Consider adding to the loss a penalty term which is the sum of squares of entries in U and V, i.e. $\|U\|_{\text{Fro}}^2 + \|V\|_{\text{Fro}}^2$ ($\|\cdot\|_{\text{Fro}}$ denotes the Frobenius norm). Each “conditional” problem (fitting U given V and vice versa) again decomposes into a collection of standard, this time regularized, linear prediction problems. With an appropriate loss function, or constraints on the observed entries, these correspond to large-margin linear discrimination problems. For example, if we learn a binary observation matrix by minimizing a hinge loss plus such a regularization term, each conditional problem decomposes into a collection of SVMs, as sketched below.
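To make the decomposition concrete, the following minimal sketch (assuming Python with NumPy and scikit-learn; the helper name `fit_V_given_U` is illustrative) fits V for a fixed U: each column of V becomes an independent L2-regularized hinge-loss (SVM) problem over the observed ±1 entries of the corresponding column of Y. It only illustrates the conditional sub-problem; the paper itself optimizes U and V jointly via the semi-definite program of Section 4.

```python
import numpy as np
from sklearn.svm import LinearSVC  # L2-regularized hinge-loss classifier

def fit_V_given_U(U, Y, observed, c=1.0):
    """One "conditional" step: with U fixed, each column a of V is an
    independent SVM over the rows observed in column a of Y (+/-1 labels),
    using the rows of U as feature vectors.  (Illustrative sketch only.)"""
    n, k = U.shape
    m = Y.shape[1]
    V = np.zeros((m, k))
    for a in range(m):
        rows = np.where(observed[:, a])[0]
        if len(rows) == 0 or len(np.unique(Y[rows, a])) < 2:
            continue  # need both classes present to fit an SVM for this column
        svm = LinearSVC(C=c, fit_intercept=False)
        svm.fit(U[rows], Y[rows, a])
        V[a] = svm.coef_.ravel()  # the linear predictor for column a
    return V
```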
3 Maximum-Margin Matrix Factorizations
Matrices with a factorization X = UV', where U and V have low Frobenius norm (recall that the dimensionality of U and V is no longer bounded!), can be characterized in several equivalent ways, and are known as low trace norm matrices:

Definition 1. The trace norm¹ $\|X\|_\Sigma$ is the sum of the singular values of X.

Lemma 1. $\|X\|_\Sigma = \min_{X=UV'} \|U\|_{\text{Fro}} \|V\|_{\text{Fro}} = \min_{X=UV'} \tfrac{1}{2}\left(\|U\|_{\text{Fro}}^2 + \|V\|_{\text{Fro}}^2\right)$
The characterization in terms of the singular value decomposition allows us to characterize low trace norm matrices as the convex hull of bounded-norm rank-one matrices:

Lemma 2. $\{X \mid \|X\|_\Sigma \le B\} = \operatorname{conv}\left\{\, uv' \mid u \in \mathbb{R}^n,\ v \in \mathbb{R}^m,\ |u|^2 = |v|^2 = B \,\right\}$
In particular, the trace norm is a convex function, and the set of bounded trace norm matrices is a convex set. For convex loss functions, seeking a bounded trace norm matrix minimizing the loss versus some target matrix is a convex optimization problem.
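As a quick numerical illustration of Lemma 1 (a sketch, assuming NumPy), the SVD of X yields a balanced factorization, obtained by scaling the singular vectors by the square roots of the singular values, for which the two expressions in the lemma equal the trace norm; the lemma asserts that this value is also minimal over all factorizations.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))

# Trace norm: the sum of the singular values of X.
U_s, s, Vt_s = np.linalg.svd(X, full_matrices=False)
trace_norm = s.sum()

# Balanced factorization from the SVD: X = U V' with U = U_s * sqrt(s), V = V_s * sqrt(s).
U = U_s * np.sqrt(s)
V = Vt_s.T * np.sqrt(s)
assert np.allclose(U @ V.T, X)

# For this factorization the two expressions of Lemma 1 coincide with the trace norm.
print(trace_norm)
print(np.linalg.norm(U) * np.linalg.norm(V))
print(0.5 * (np.linalg.norm(U) ** 2 + np.linalg.norm(V) ** 2))
```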
This contrasts sharply with minimizing loss over low-rank matrices—a non-convex problem. Although the sum-squared error versus a fully observed target matrix can be minimized efficiently using the SVD (despite the optimization problem being non-convex!), minimizing other loss functions, or even minimizing a squared loss versus a partially observed matrix, is a difficult optimization problem with multiple local minima [3].
¹ Also known as the nuclear norm and the Ky-Fan n-norm.
In fact, the trace norm has been suggested as a convex surrogate to the rank for various rank-minimization problems [7]. Here, we justify the trace norm directly, both as a natural extension of large-margin methods and by providing generalization error bounds.
To simplify presentation, we focus on binary labels, $Y \in \{\pm 1\}^{n \times m}$. We consider hard-margin matrix factorization, where we seek a minimum trace norm matrix X that matches the observed labels with a margin of one: $Y_{ia} X_{ia} \ge 1$ for all $ia \in S$. We also consider soft-margin learning, where we minimize a trade-off between the trace norm of X and its hinge loss relative to $Y_S$:
$$\text{minimize}\quad \|X\|_\Sigma + c \sum_{ia \in S} \max(0,\, 1 - Y_{ia} X_{ia}). \qquad (1)$$
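Numerically, (1) is a convex program over X with a nuclear-norm (trace-norm) penalty plus hinge loss on the observed entries. The sketch below assumes CVXPY and an illustrative helper name `soft_margin_mmmf`; Section 4 gives an explicit semi-definite formulation of the same problem.

```python
import cvxpy as cp
import numpy as np

def soft_margin_mmmf(Y, observed, c=1.0):
    """Soft-margin objective (1): trace norm of X plus c times the hinge
    loss over the observed +/-1 entries of Y.  Sketch, not the authors' code."""
    n, m = Y.shape
    X = cp.Variable((n, m))
    hinge = [cp.pos(1 - Y[i, a] * X[i, a])
             for i in range(n) for a in range(m) if observed[i, a]]
    objective = cp.Minimize(cp.normNuc(X) + c * cp.sum(cp.hstack(hinge)))
    cp.Problem(objective).solve()
    return X.value
```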
As in maximum-margin linear discrimination, there is an inverse dependence between the norm and the margin. Fixing the margin and minimizing the trace norm is equivalent to fixing the trace norm and maximizing the margin. As in large-margin discrimination with certain infinite dimensional (e.g. radial) kernels, the data is always separable with sufficiently high trace norm (a trace norm of $\sqrt{n|S|}$ is sufficient to attain a margin of one).

The max-norm variant  Instead of constraining the norms of rows in U and V on average, we can constrain all rows of U and V to have small L2 norm, replacing the trace norm with $\|X\|_{\max} = \min_{X=UV'} (\max_i |U_i|)(\max_a |V_a|)$ where $U_i, V_a$ are rows of U, V. Low-max-norm discrimination has a clean geometric interpretation. First, note that predicting the target matrix with the signs of a rank-k matrix corresponds to mapping the “items” (columns) to points in $\mathbb{R}^k$, and the “users” (rows) to homogeneous hyperplanes, such that each user's hyperplane separates his positive items from his negative items. Hard-margin low-max-norm prediction corresponds to mapping the users and items to points and hyperplanes in a high-dimensional unit sphere such that each user's hyperplane separates his positive and negative items with a large margin (the margin being the inverse of the max-norm).
4 Learning Maximum-Margin Matrix Factorizations
In this section we investigate the optimization problem of learning a low-norm factorization UV', given a binary target matrix. Bounding the trace norm of UV' by $\tfrac{1}{2}(\|U\|_{\text{Fro}}^2 + \|V\|_{\text{Fro}}^2)$, we can characterize the trace norm in terms of the trace of a positive semi-definite matrix:

Lemma 3 ([7, Lemma 1]). For any $X \in \mathbb{R}^{n \times m}$ and $t \in \mathbb{R}$: $\|X\|_\Sigma \le t$ iff there exist $A \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{m \times m}$ such that² $\begin{pmatrix} A & X \\ X' & B \end{pmatrix} \succeq 0$ and $\operatorname{tr} A + \operatorname{tr} B \le 2t$.

Proof. Note that for any matrix W, $\|W\|_{\text{Fro}}^2 = \operatorname{tr} WW'$. If $\begin{pmatrix} A & X \\ X' & B \end{pmatrix} \succeq 0$, we can write it as a product $\left[\begin{smallmatrix} U \\ V \end{smallmatrix}\right]\left[\begin{smallmatrix} U \\ V \end{smallmatrix}\right]'$. We have $X = UV'$ and $\tfrac{1}{2}(\|U\|_{\text{Fro}}^2 + \|V\|_{\text{Fro}}^2) = \tfrac{1}{2}(\operatorname{tr} A + \operatorname{tr} B) \le t$, establishing $\|X\|_\Sigma \le t$. Conversely, if $\|X\|_\Sigma \le t$ we can write $X = UV'$ with $\operatorname{tr} UU' + \operatorname{tr} VV' \le 2t$ and consider the p.s.d. matrix $\begin{pmatrix} UU' & X \\ X' & VV' \end{pmatrix}$.
Lemma 3 can be used in order to formulate minimizing the trace norm as a semi-definite optimization problem (SDP). Soft-margin matrix factorization (1) can be written as:

$$\min\ \tfrac{1}{2}(\operatorname{tr} A + \operatorname{tr} B) + c \sum_{ia \in S} \xi_{ia} \quad \text{s.t.} \quad \begin{pmatrix} A & X \\ X' & B \end{pmatrix} \succeq 0, \quad Y_{ia} X_{ia} \ge 1 - \xi_{ia},\ \xi_{ia} \ge 0 \ \forall ia \in S \qquad (2)$$

² $A \succeq 0$ denotes that A is positive semi-definite.
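Using Lemma 3, program (2) can be coded with a single PSD block variable Z = [[A, X], [X', B]]. The following sketch (assuming CVXPY with an SDP-capable solver; the function name is illustrative) mirrors (2) directly:

```python
import cvxpy as cp
import numpy as np

def mmmf_sdp(Y, observed, c=1.0):
    """Soft-margin MMMF as the SDP (2): minimize (1/2)(tr A + tr B) + c*sum(xi)
    subject to [[A, X], [X', B]] being PSD and the margin constraints on S."""
    n, m = Y.shape
    Z = cp.Variable((n + m, n + m), PSD=True)   # the block matrix is constrained PSD
    A, B, X = Z[:n, :n], Z[n:, n:], Z[:n, n:]
    obs = [(i, a) for i in range(n) for a in range(m) if observed[i, a]]
    xi = cp.Variable(len(obs), nonneg=True)     # slack variables xi_ia
    margin = [Y[i, a] * X[i, a] >= 1 - xi[k] for k, (i, a) in enumerate(obs)]
    objective = cp.Minimize(0.5 * (cp.trace(A) + cp.trace(B)) + c * cp.sum(xi))
    cp.Problem(objective, margin).solve()
    return X.value
```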
Associating a dual variable $Q_{ia}$ with each constraint on $X_{ia}$, the dual of (2) is [8, Section 5.4.2]:

$$\max \sum_{ia \in S} Q_{ia} \quad \text{s.t.} \quad \begin{pmatrix} I & -Q \otimes Y \\ (-Q \otimes Y)' & I \end{pmatrix} \succeq 0, \quad 0 \le Q_{ia} \le c \qquad (3)$$
where $Q \otimes Y$ denotes the sparse matrix $(Q \otimes Y)_{ia} = Q_{ia} Y_{ia}$ for $ia \in S$ and zeros elsewhere. The problem is strictly feasible, and there is no duality gap. The p.s.d. constraint in the dual (3) is equivalent to bounding the spectral norm of $Q \otimes Y$, and the dual can also be written as an optimization problem subject to a bound on the spectral norm, i.e. a bound on the singular values of $Q \otimes Y$:
$$\max \sum_{ia \in S} Q_{ia} \quad \text{s.t.} \quad \|Q \otimes Y\|_2 \le 1, \quad 0 \le Q_{ia} \le c \ \forall ia \in S \qquad (4)$$
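The dual (4) can be sketched in the same way, with the spectral-norm bound expressed through the maximum singular value of Q ⊗ Y. This is again a CVXPY sketch with an illustrative function name; here Q is a dense variable forced to zero outside S, rather than the sparse representation a large-scale solver would use.

```python
import cvxpy as cp
import numpy as np

def mmmf_dual(Y, observed, c=1.0):
    """Dual (4): maximize the sum of Q over S, subject to a bound of one on
    the spectral norm of Q (x) Y and box constraints 0 <= Q_ia <= c."""
    n, m = Y.shape
    mask = observed.astype(float)
    Q = cp.Variable((n, m))
    constraints = [
        cp.multiply(1 - mask, Q) == 0,         # Q is supported on the observed set S
        Q >= 0, Q <= c,                        # box constraints
        cp.sigma_max(cp.multiply(Q, Y)) <= 1,  # largest singular value of Q (x) Y
    ]
    cp.Problem(cp.Maximize(cp.sum(cp.multiply(mask, Q))), constraints).solve()
    return Q.value
```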
In typical collaborative prediction problems, we observe only a small fraction of the entries in a large target matrix. Such a situation translates to a sparse dual semi-definite program, with the number of variables equal to the number of observed entries. Large-scale SDP solvers can take advantage of such sparsity.
The prediction matrix X* minimizing (1) is part of the primal optimal solution of (2), and can be extracted from it directly. Nevertheless, it is interesting to study how the optimal prediction matrix X* can be directly recovered from a dual optimal solution Q* alone. Although unnecessary when relying on the interior point methods used by most SDP solvers (as these return a primal/dual optimal pair), this can enable us to use specialized optimization methods, taking advantage of the simple structure of the dual.
Recovering X* from Q*  As for linear programming, recovering a primal optimal solution directly from a dual optimal solution is not always possible for SDPs. However, at least for the hard-margin problem (no slack) this is possible, and we describe below how an optimal prediction matrix X* can be recovered from a dual optimal solution Q* by calculating a singular value decomposition and solving linear equations.

Given a dual optimal Q*, consider its singular value decomposition $Q^* \otimes Y = U \Lambda V'$. Recall that all singular values of $Q^* \otimes Y$ are bounded by one, and consider only the columns $\tilde{U} \in \mathbb{R}^{n \times p}$ of U and $\tilde{V} \in \mathbb{R}^{m \times p}$ of V with singular value one. It is possible to show [8, Section 5.4.3], using complementary slackness, that for some matrix $R \in \mathbb{R}^{p \times p}$, $X^* = \tilde{U} R R' \tilde{V}'$ is an optimal solution to the maximum-margin matrix factorization problem (1).

Furthermore, $\frac{p(p+1)}{2}$ is bounded above by the number of non-zero $Q^*_{ia}$. When $Q^*_{ia} > 0$, and assuming hard-margin (no box constraints in the dual), complementary slackness dictates that $X^*_{ia} = \tilde{U}_i R R' \tilde{V}_a' = Y_{ia}$, providing us with a linear equation on the $\frac{p(p+1)}{2}$ entries of the symmetric $RR'$. For hard-margin matrix factorization, we can therefore recover the entries of $RR'$ by solving a system of linear equations, with a number of variables bounded by the number of observed entries.
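The recovery procedure can be sketched numerically as follows (assuming NumPy; a least-squares solve stands in for the exact linear system on the symmetric RR', and the tolerance used to detect "singular value equal to one" is an assumption of the sketch):

```python
import numpy as np

def recover_X_from_dual(Q, Y, tol=1e-6):
    """Hard-margin recovery sketch: take the singular vectors of Q (x) Y with
    singular value one, then solve the linear equations U_i M V_a' = Y_ia
    (for observed ia with Q_ia > 0) for the symmetric p x p matrix M = R R',
    and return X = U_tilde M V_tilde'.  Q is the dual optimum, zero off S."""
    QY = Q * Y                                   # the sparse matrix Q (x) Y
    U, s, Vt = np.linalg.svd(QY, full_matrices=False)
    keep = np.abs(s - 1.0) < tol                 # singular values (numerically) equal to one
    p = int(keep.sum())
    Ut, Vtld = U[:, keep], Vt[keep, :].T         # n x p and m x p
    rows, cols = np.nonzero(Q > 0)               # entries giving linear equations
    design = np.stack([np.outer(Ut[i], Vtld[a]).ravel() for i, a in zip(rows, cols)])
    rhs = Y[rows, cols].astype(float)
    M_vec, *_ = np.linalg.lstsq(design, rhs, rcond=None)
    M = M_vec.reshape(p, p)
    M = 0.5 * (M + M.T)                          # symmetrize; M plays the role of R R'
    return Ut @ M @ Vtld.T
```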
Recovering specific entries  The approach described above requires solving a large system of linear equations (with as many variables as observations). Furthermore, especially when the observations are very sparse (only a small fraction of the entries in the target matrix are observed), the dual solution is much more compact than the prediction matrix: the dual involves a single number for each observed entry. It might be desirable to avoid storing the prediction matrix X* explicitly, and to calculate a desired entry $X^*_{i_0 a_0}$, or at least its sign, directly from the dual optimal solution Q*.

Consider adding the constraint $X_{i_0 a_0} > 0$ to the primal SDP (2). If there exists an optimal solution X* to the original SDP with $X^*_{i_0 a_0} > 0$, then this is also an optimal solution to the modified SDP, with the same objective value. Otherwise, the optimal solution of the modified SDP is not optimal for the original SDP, and the optimal value of the modified SDP is higher (worse) than the optimal value of the original SDP.

Introducing the constraint $X_{i_0 a_0} > 0$ to the primal SDP (2) corresponds to introducing a new variable $Q_{i_0 a_0}$ to the dual SDP (3), appearing in $Q \otimes Y$ (with $Y_{i_0 a_0} = 1$) but not in the objective. In this modified dual, the optimal solution Q* of the original dual would always be feasible. But, if $X^*_{i_0 a_0} < 0$ in all primal optimal solutions, then the modified primal SDP has a higher value, and so does the dual, and Q* is no longer optimal for the new dual. By checking the optimality of Q* for the modified dual, e.g. by attempting to re-optimize it, we can recover the sign of $X^*_{i_0 a_0}$.

We can repeat this test once with $Y_{i_0 a_0} = 1$ and once with $Y_{i_0 a_0} = -1$, corresponding to $X_{i_0 a_0} > 0$ and $X_{i_0 a_0} < 0$. If $Y_{i_0 a_0} X^*_{i_0 a_0} < 0$ (in all optimal solutions), then the dual solution can be improved by introducing $Q_{i_0 a_0}$ with a sign of $Y_{i_0 a_0}$.
Predictions for new users  So far, we assumed that learning is done on the known entries in all rows. It is commonly desirable to predict entries in a new partially observed row of Y (a new user), not included in the original training set. This essentially requires solving a “conditional” problem, where V is already known, and a new row of U is learned (the predictor for the new user) based on a new partially observed row of X. Using maximum-margin matrix factorization, this is a standard SVM problem; a sketch is given below.
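A minimal sketch of this conditional step, assuming the item factors V have been extracted from a learned factorization and using a standard linear SVM (scikit-learn; function and variable names are illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVC

def predict_new_user(V, y_partial, observed, c=1.0):
    """Learn a new row of U for a new user from their partially observed
    +/-1 ratings, treating the rows of V as feature vectors, then predict
    signs for all items.  (Standard SVM problem, as described above.)"""
    idx = np.where(observed)[0]
    svm = LinearSVC(C=c, fit_intercept=False)
    svm.fit(V[idx], y_partial[idx])
    u_new = svm.coef_.ravel()          # the new user's feature vector
    return np.sign(V @ u_new)          # predicted labels for every item
```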
Max-norm MMMF as a SDP  The max-norm variant can also be written as a SDP, with the primal and dual taking the forms:

$$\min\ t + c \sum_{ia \in S} \xi_{ia} \quad \text{s.t.} \quad \begin{pmatrix} A & X \\ X' & B \end{pmatrix} \succeq 0, \quad A_{ii}, B_{aa} \le t \ \forall i,a, \quad Y_{ia} X_{ia} \ge 1 - \xi_{ia},\ \xi_{ia} \ge 0 \ \forall ia \in S \qquad (5)$$

$$\max \sum_{ia \in S} Q_{ia} \quad \text{s.t.} \quad \begin{pmatrix} \Gamma & -Q \otimes Y \\ (-Q \otimes Y)' & \Delta \end{pmatrix} \succeq 0, \quad \Gamma, \Delta \text{ diagonal}, \quad \operatorname{tr} \Gamma + \operatorname{tr} \Delta = 1, \quad 0 \le Q_{ia} \le c \ \forall ia \in S \qquad (6)$$
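The primal (5) differs from (2) only in the objective and the per-row bounds on the diagonals of A and B; a CVXPY sketch with illustrative names, assuming an SDP-capable solver:

```python
import cvxpy as cp
import numpy as np

def maxnorm_mmmf_sdp(Y, observed, c=1.0):
    """Max-norm MMMF as the SDP (5): minimize t + c*sum(xi) with
    diag(A) <= t, diag(B) <= t and the usual margin constraints on S."""
    n, m = Y.shape
    Z = cp.Variable((n + m, n + m), PSD=True)
    A, B, X = Z[:n, :n], Z[n:, n:], Z[:n, n:]
    t = cp.Variable()
    obs = [(i, a) for i in range(n) for a in range(m) if observed[i, a]]
    xi = cp.Variable(len(obs), nonneg=True)
    constraints = [cp.diag(A) <= t, cp.diag(B) <= t]
    constraints += [Y[i, a] * X[i, a] >= 1 - xi[k] for k, (i, a) in enumerate(obs)]
    cp.Problem(cp.Minimize(t + c * cp.sum(xi)), constraints).solve()
    return X.value
```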
5 Generalization Error Bounds for Low Norm Matrix Factorizations

Similarly to standard feature-based prediction approaches, collaborative prediction methods can also be analyzed in terms of their generalization ability: How confidently can we predict entries of Y based on our error on the observed entries $Y_S$? We present here generalization error bounds that hold for any target matrix Y, and for a random subset of observations S, and bound the average error across all entries in terms of the observed margin error³. The central assumption, paralleling the i.i.d. source assumption for standard feature-based prediction, is that the observed subset S is picked uniformly at random.

³ The bounds presented here are special cases of bounds for general loss functions that we present and prove elsewhere [8, Section 6.2]. To prove the bounds we bound the Rademacher complexity of bounded trace norm and bounded max-norm matrices (i.e. balls w.r.t. these norms). The unit trace norm ball is the convex hull of outer products of unit norm vectors. It is therefore enough to bound the Rademacher complexity of such outer products, which boils down to analyzing the spectral norm of random matrices. As a consequence of Grothendieck's inequality, the unit max-norm ball is within a factor of two of the convex hull of outer products of sign vectors. The Rademacher complexity of such outer products can be bounded by considering their cardinality.

Theorem 4. For all target matrices $Y \in \{\pm 1\}^{n \times m}$ and sample sizes $|S| > n \log n$, and for a uniformly selected sample S of |S| entries in Y, with probability at least $1 - \delta$ over the sample selection, the following holds for all matrices $X \in \mathbb{R}^{n \times m}$ and all $\gamma > 0$:

$$\frac{1}{nm}\bigl|\{ia \mid X_{ia} Y_{ia} \le 0\}\bigr| < \frac{1}{|S|}\bigl|\{ia \in S \mid X_{ia} Y_{ia} \le \gamma\}\bigr| + K\,\frac{\|X\|_\Sigma}{\gamma\sqrt{nm}}\,\sqrt[4]{\ln m}\,\sqrt{\frac{(n+m)\ln n}{|S|}} + \sqrt{\frac{\ln(1 + |\log \|X\|_\Sigma/\gamma|)}{|S|}} + \sqrt{\frac{\ln(4/\delta)}{2|S|}} \qquad (7)$$

and

$$\frac{1}{nm}\bigl|\{ia \mid X_{ia} Y_{ia} \le 0\}\bigr| < \frac{1}{|S|}\bigl|\{ia \in S \mid X_{ia} Y_{ia} \le \gamma\}\bigr| + 12\,\frac{\|X\|_{\max}}{\gamma}\,\sqrt{\frac{n+m}{|S|}} + \sqrt{\frac{\ln(1 + |\log \|X\|_\Sigma/\gamma|)}{|S|}} + \sqrt{\frac{\ln(4/\delta)}{2|S|}} \qquad (8)$$
where K is a universal constant that does not depend on Y, n, m, γ or any other quantity.

To understand the scaling of these bounds, consider n×m matrices X = UV' where the norms of the rows of U and V are bounded by r, i.e. matrices with $\|X\|_{\max} \le r^2$. The trace norm of such matrices is bounded by $r^2\sqrt{nm}$, and so the two bounds agree up to log-factors—the cost of allowing the norm to be low on average but not uniformly. Recall that the conditional problem, where V is fixed and only U is learned, is a collection of low-norm (large-margin) linear prediction problems. When the norms of rows in U and V are bounded by r, a similar generalization error bound on the conditional problem would include the term $\frac{r^2}{\gamma}\sqrt{\frac{n}{|S|}}$, matching the bounds of Theorem 4 up to log-factors—learning both U and V does not introduce significantly more error than learning just one of them.

Also of interest is the comparison with bounds for low-rank matrices, for which $\|X\|_\Sigma \le \sqrt{\operatorname{rank} X}\,\|X\|_{\text{Fro}}$. In particular, for an n×m rank-k matrix X with entries bounded by B, $\|X\|_\Sigma \le \sqrt{knm}\,B$, and the second term on the right-hand side of (7) becomes:
$$K\,\frac{B}{\gamma}\,\sqrt[4]{\ln m}\,\sqrt{\frac{k(n+m)\ln n}{|S|}} \qquad (9)$$

Although this is the best (up to log factors) that can be expected from scale-sensitive bounds⁴, taking a combinatorial approach, the dependence on the magnitude of the entries in X (and the margin) can be avoided [9].
6 Implementation and Experiments
Ratings  In many collaborative prediction tasks, the labels are not binary, but rather are discrete “ratings” in several ordered levels (e.g. one star through five stars). Separating R levels by thresholds $-\infty = \theta_0 < \theta_1 < \cdots < \theta_R = \infty$, and generalizing the hard-margin constraints for binary labels, one can require $\theta_{Y_{ia}} + 1 \le X_{ia} \le \theta_{Y_{ia}+1} - 1$. A soft-margin version of these constraints, with slack variables for the two constraints on each observed rating, corresponds to a generalization of the hinge loss which is a convex bound on the zero/one level-agreement error (ZOE) [10]. To obtain a loss which is a convex bound on the mean-absolute-error (MAE—the difference, in levels, between the predicted level and the true level), we introduce R−1 slack variables for each observed rating—one for each threshold.
⁴ For general loss functions, bounds as in Theorem 4 depend only on the Lipschitz constant of the loss, and (9) is the best (up to log factors) that can be achieved without explicitly bounding the magnitude of the loss function.
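A minimal NumPy sketch of the resulting per-rating penalty (the "all-threshold" hinge described above, with one term per threshold; the thresholds $\theta_1, \dots, \theta_{R-1}$ are taken as given in this sketch):

```python
import numpy as np

def all_threshold_hinge(x, rating, thresholds):
    """Convex upper bound on the absolute error (in levels) for a single
    observed rating: one hinge term (one slack variable in the SDP) per
    threshold theta_1 < ... < theta_{R-1}.  `rating` is a level in 1..R and
    `x` is the real-valued prediction X_ia.  (Sketch; thresholds given.)"""
    loss = 0.0
    for r, theta in enumerate(thresholds, start=1):
        if rating > r:                 # x should exceed theta_r by a margin of one
            loss += max(0.0, 1.0 - (x - theta))
        else:                          # x should lie below theta_r by a margin of one
            loss += max(0.0, 1.0 + (x - theta))
    return loss
```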
