
Matrix Completion and Large-scale SVD Computations

Trevor Hastie, Stanford Statistics

joint with Rahul Mazumder and Rob Tibshirani

May, 2012

Mazumder, Hastie, Tibshirani Matrix Completion 1/ 42

Outline of Talk

- Convex matrix completion, collaborative filtering (Mazumder, Hastie, Tibshirani 2010, JMLR)

- Recent algorithmic advances and large-scale SVD (Mazumder, Hastie 2011 (preprint) and work in progress)


Motivation: the Netflix Prize


The Netflix Data Set

          movie I   movie II   movie III   movie IV   ···
User A       1         ?           5           4      ···
User B       ?         2           3           ?      ···
User C       4         1           2           ?      ···
User D       ?         5           1           3      ···
User E       1         2           ?           ?      ···
  ⋮          ⋮         ⋮           ⋮           ⋮      ⋱

- Training Data: 480K users, 18K movies, 100M ratings, ratings 1-5 (99% ratings missing)

- Goal: $1M prize for 10% reduction in RMSE over Cinematch

- BellKor's Pragmatic Chaos declared winners on 9/21/2009; used ensemble of models, an important ingredient being low-rank factorization


Motivation: Expression Arrays

- The rows are genes (variables)

- The columns are observations (samples, DNA arrays).

- Typical numbers are 6-10K genes, 50-150 samples.

- Refer to eigengenes and eigenarrays.

- Often 10–15% NAs.


Example: NCI Cancer Data

[Figure: scatter plot of Principal Component 2 against Principal Component 1, with points labeled by class number (1 to 9); gene loadings for PC-1 and PC-2 plotted against gene index (0 to 8000).]

First two eigengenes (left) and eigenarrays (right). Points are colored according to NCI cancer classes.


Matrix Completion / Collaborative Filtering: Problem Definition

[Figure: ratings matrix with axes labeled Users and Movie Ratings.]

- Large matrices: # rows | # columns ≈ $10^5$, $10^6$

- Very under-determined (often only 1–2% observed)

- Exploit matrix structure, row | column interactions

- Task: "fill in" missing entries

- Applications: recommender systems, image processing, imputation of NAs for genomic data, rank estimation for SVD.


Model Assumption: Low Rank + Noise

- Under-determined: assume low rank

- Meaningful?

Interpretation: user & item factors induce collaboration

Empirical: Netflix successes

Theoretical: "reconstruction" possible under low-rank & regularity conditions

Srebro et al. (2005); Candes and Recht (2008); Candes and Tao (2009); Keshavan et al. (2009); Negahban and Wainwright (2012)


Optimization problem

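Find $Z_{n \times m}$ of (small) rank $r$ such that training error is small.

$$\underset{Z}{\text{minimize}} \; \sum_{\text{Observed}(i,j)} (X_{ij} - Z_{ij})^2 \quad \text{subject to } \operatorname{rank}(Z) = r$$

Impute missing $X_{ij}$ with $Z_{ij}$.

[Figure panels: True X, Observed X, Fitted Z, Imputed X]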


Our Approach: Nuclear Norm Relaxation

- The $\operatorname{rank}(Z)$ constraint makes the problem non-convex: combinatorially very hard (although good algorithms exist).

- $\|Z\|_* = \sum_j \lambda_j(Z)$, the sum of singular values of $Z$, is convex in $Z$. Called the "nuclear norm" of $Z$.

- $\|Z\|_*$ is the tightest convex relaxation of $\operatorname{rank}(Z)$ (Fazel, Boyd, 2002)

We solve instead

$$\underset{Z}{\text{minimize}} \; \sum_{\text{Observed}(i,j)} (X_{ij} - Z_{ij})^2 \quad \text{subject to } \|Z\|_* \le \tau$$

which is convex in $Z$.
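As a small numerical illustration (a sketch, not part of the talk), the nuclear norm can be computed from the singular values and its convexity checked along a segment between two random matrices. The helper name `nuclear_norm` and the test matrices are assumptions made for this example.

```python
import numpy as np

def nuclear_norm(Z):
    """Sum of the singular values of Z."""
    return np.linalg.svd(Z, compute_uv=False).sum()

rng = np.random.default_rng(0)
Z1 = rng.standard_normal((6, 4))
Z2 = rng.standard_normal((6, 4))

# Agrees with NumPy's built-in nuclear norm.
assert np.isclose(nuclear_norm(Z1), np.linalg.norm(Z1, "nuc"))

# Convexity along a segment:
# ||t Z1 + (1 - t) Z2||_* <= t ||Z1||_* + (1 - t) ||Z2||_*
for t in np.linspace(0.0, 1.0, 11):
    lhs = nuclear_norm(t * Z1 + (1 - t) * Z2)
    rhs = t * nuclear_norm(Z1) + (1 - t) * nuclear_norm(Z2)
    assert lhs <= rhs + 1e-12
```

The rank function fails the same check (the average of two rank-1 matrices generally has rank 2), which is why the relaxation is needed.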


Notation

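Following Cai et al. (2010), define $P_\Omega(X)$ ($n \times m$): projection onto the observed entries

$$P_\Omega(X)_{i,j} = \begin{cases} X_{i,j} & \text{if } (i,j) \text{ is observed} \\ 0 & \text{if } (i,j) \text{ is missing} \end{cases}$$

Criterion rewritten as:

$$\sum_{\text{Observed}(i,j)} (X_{ij} - Z_{ij})^2 = \|P_\Omega(X) - P_\Omega(Z)\|_F^2$$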
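The projection notation maps directly onto a boolean mask. The sketch below assumes the observed set Ω is stored as a boolean array `mask` (an illustrative choice, as are the function names) and checks the identity above numerically.

```python
import numpy as np

def P_Omega(M, mask):
    """Keep the observed entries (mask == True), zero out the rest."""
    return np.where(mask, M, 0.0)

def P_Omega_perp(M, mask):
    """Complementary projection: keep only the missing entries."""
    return np.where(mask, 0.0, M)

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 4))
Z = rng.standard_normal((5, 4))
mask = rng.random((5, 4)) < 0.6  # True where an entry is observed

sum_over_observed = ((X - Z)[mask] ** 2).sum()
frobenius_form = np.linalg.norm(P_Omega(X, mask) - P_Omega(Z, mask), "fro") ** 2
assert np.isclose(sum_over_observed, frobenius_form)

# The two projections split any matrix into observed and missing parts.
assert np.allclose(P_Omega(Z, mask) + P_Omega_perp(Z, mask), Z)
```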


Inspiration

- SVT algorithm of Cai et al. (2010) solves:

$$\text{minimize } \|Z\|_* \quad \text{subject to } P_\Omega(Z) = P_\Omega(X)$$

- Rephrasing our criterion:

$$\text{minimize } \|Z\|_* \quad \text{subject to } \|P_\Omega(X) - P_\Omega(Z)\|_F \le \delta$$

- First-order algorithm, scalable to large matrices via sparse SVD

- No-noise reconstruction model seems too rigid.

- In real life there is noise; fitting the training data exactly incurs added variance

- Introduce bias to decrease variance

- Computation requires more than a sparse SVD


Bias-Variance Trade-Off

[Figure: two panels, Training Error and Test Error, each plotting relative mean-squared error against nuclear norm (0 to 1500) for Soft-Impute, SVT and MMMF.]

50% missing entries with SNR=1, true rank 6, 50 simulations


Soft SVD

Let (fully observed) $X_{n \times m}$ have SVD

$$X = U \cdot \operatorname{diag}[\sigma_1, \ldots, \sigma_m] \cdot V'$$

Consider the convex optimization problem

$$\underset{Z}{\text{minimize}} \; \tfrac{1}{2}\|X - Z\|_F^2 + \lambda \|Z\|_*$$

Solution is soft-thresholded SVD

$$S_\lambda(X) := U \cdot \operatorname{diag}[(\sigma_1 - \lambda)_+, \ldots, (\sigma_m - \lambda)_+] \cdot V'$$

Like lasso for SVD: singular values are shrunk toward zero, with many set to zero. Smooth version of best-rank approximation.
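A minimal dense sketch of the soft-thresholded SVD, using a full `numpy.linalg.svd` for clarity rather than the low-rank SVD the talk relies on at scale. The names `soft_svd` and `objective` are illustrative. The checks confirm that the singular values are shrunk by λ and that random perturbations of the solution never lower the objective.

```python
import numpy as np

def soft_svd(X, lam):
    """S_lambda(X): shrink every singular value by lam, truncating at zero."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)
    return (U * s_shrunk) @ Vt

def objective(X, Z, lam):
    """0.5 * ||X - Z||_F^2 + lam * ||Z||_*"""
    return 0.5 * np.linalg.norm(X - Z, "fro") ** 2 + lam * np.linalg.norm(Z, "nuc")

rng = np.random.default_rng(2)
X = rng.standard_normal((8, 5))
lam = 1.0
Z_hat = soft_svd(X, lam)

# Singular values of the solution are the shrunken singular values of X.
s = np.linalg.svd(X, compute_uv=False)
assert np.allclose(np.linalg.svd(Z_hat, compute_uv=False), np.maximum(s - lam, 0.0))

# Z_hat is the minimizer: perturbing it never decreases the objective.
best = objective(X, Z_hat, lam)
for _ in range(200):
    Z_try = Z_hat + 0.1 * rng.standard_normal(X.shape)
    assert objective(X, Z_try, lam) >= best
```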


Convex Optimization Problem

Back to missing data problem, in Lagrange form:

$$\underset{Z}{\text{minimize}} \; \tfrac{1}{2}\|P_\Omega(X) - P_\Omega(Z)\|_F^2 + \lambda \|Z\|_*$$

- This is a semi-definite program (SDP), convex in $Z$.

- Complexity of existing off-the-shelf solvers:

– interior-point methods: $O(n^4)$ . . . $O(n^6)$ . . .
– (black box) first-order methods complexity: $O(n^3)$

- We solve using an iterative soft SVD (next slide), with cost per soft SVD $O[(m + n) \cdot r + |\Omega|]$ where $r$ is rank of solution.


Soft-Impute: Path Algorithm

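1 Initialize $Z^{\text{old}} = 0$ and create a decreasing grid $\Lambda$ of values $\lambda_0 > \lambda_1 > \ldots > \lambda_K > 0$, with $\lambda_0 = \lambda_{\max}(P_\Omega(X))$

2 For each $\lambda = \lambda_1, \lambda_2, \ldots \in \Lambda$ iterate 2a-2b till convergence:

(2a) Compute $Z^{\text{new}} \leftarrow S_\lambda(P_\Omega(X) + P_\Omega^\perp(Z^{\text{old}}))$
(2b) Assign $Z^{\text{old}} \leftarrow Z^{\text{new}}$ and go to step (2a)
(2c) Assign $\hat Z_\lambda \leftarrow Z^{\text{new}}$ and go to 2

3 Output the sequence of solutions $\hat Z_{\lambda_1}, \ldots, \hat Z_{\lambda_K}$.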
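The path algorithm can be sketched in a few lines of dense NumPy. This is an illustration under stated assumptions: a full SVD replaces the low-rank SVD, the stopping rule (relative squared change below `tol`) and the particular grid of λ values are choices made here, and all names are hypothetical.

```python
import numpy as np

def soft_svd(M, lam):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

def soft_impute_path(X, mask, lambdas, tol=1e-6, max_iter=500):
    """Run Soft-Impute over a decreasing grid of lambdas with warm starts."""
    PX = np.where(mask, X, 0.0)          # P_Omega(X)
    Z_old = np.zeros_like(PX)            # step 1
    path = []
    for lam in lambdas:                  # step 2; Z_old carries over as warm start
        for _ in range(max_iter):
            # (2a): fill the missing entries with the current guess, then soft SVD
            Z_new = soft_svd(PX + np.where(mask, 0.0, Z_old), lam)
            change = np.linalg.norm(Z_new - Z_old) ** 2
            scale = max(np.linalg.norm(Z_old) ** 2, 1e-12)
            Z_old = Z_new                # (2b)
            if change / scale < tol:
                break
        path.append(Z_old.copy())        # (2c)
    return path                          # step 3

# Synthetic example: rank-3 signal plus noise, about half the entries observed.
rng = np.random.default_rng(3)
n, m, r = 40, 30, 3
truth = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
X = truth + 0.1 * rng.standard_normal((n, m))
mask = rng.random((n, m)) < 0.5

lam0 = np.linalg.svd(np.where(mask, X, 0.0), compute_uv=False)[0]
lambdas = lam0 * np.array([0.5, 0.25, 0.1, 0.05])
path = soft_impute_path(X, mask, lambdas)

for lam, Z in zip(lambdas, path):
    held_out = np.sum((truth - Z)[~mask] ** 2) / np.sum(truth[~mask] ** 2)
    print(f"lambda={lam:7.3f}  rank={np.linalg.matrix_rank(Z, tol=1e-8)}  "
          f"relative held-out error={held_out:.3f}")
```

Any λ at or above the largest singular value of $P_\Omega(X)$ returns the zero matrix, which is why the grid starts just below $\lambda_0$.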


Soft-Impute: Convergence Analysis

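Theorem (Mazumder et al., 2010)

Take $\lambda > 0$. The sequence of estimates $\{Z_k\}_k$ given by:

$$Z_{k+1} = \arg\min_Z \; \tfrac{1}{2}\|P_\Omega(X) + P_\Omega^\perp(Z_k) - Z\|_F^2 + \lambda\|Z\|_*$$

converges to $Z_\infty$, a fixed point of

$$Z = S_\lambda(P_\Omega(X) + P_\Omega^\perp(Z))$$

Hence, $Z_\infty$ minimizes

$$f_\lambda(Z) := \tfrac{1}{2}\|P_\Omega(X) - P_\Omega(Z)\|_F^2 + \lambda\|Z\|_*$$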


Soft-Impute: Algorithm Properties

- Objective values decrease at every iteration:

$$f_\lambda(Z_{k+1}) \le f_\lambda(Z_k)$$

- Successive iterates move closer to the set of optimal solutions:

$$\|Z_{k+1} - Z^*\| \le \|Z_k - Z^*\|$$

for any $Z^* \in \arg\min_Z f_\lambda(Z)$

- Theorem (Mazumder et al., 2010)

Worst rate of convergence is $O(1/k)$:

$$f_\lambda(Z_k) - f_\lambda(Z_\infty) \le \frac{2}{k+1}\|Z_0 - Z_\infty\|_F^2$$

(Rate can be tightened to linear, i.e. $O(\gamma^k)$, with warm starts and large $\lambda$)
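These properties are easy to observe numerically. The sketch below runs the iteration for one fixed λ on synthetic data, records $f_\lambda(Z_k)$, and checks the monotone decrease and the $O(1/k)$ bound for $k \ge 1$. One assumption to flag: the last iterate after 300 steps stands in for $Z_\infty$, so the bound check is approximate rather than a proof.

```python
import numpy as np

def soft_svd(M, lam):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

def f_lambda(X, Z, mask, lam):
    resid = np.where(mask, X - Z, 0.0)
    return 0.5 * np.sum(resid ** 2) + lam * np.linalg.norm(Z, "nuc")

rng = np.random.default_rng(4)
n, m = 30, 20
X = rng.standard_normal((n, 2)) @ rng.standard_normal((2, m))
X += 0.2 * rng.standard_normal((n, m))
mask = rng.random((n, m)) < 0.6
lam = 1.0

Z = np.zeros((n, m))                     # Z_0
iterates = [Z]
for _ in range(300):
    # np.where(mask, X, Z) equals P_Omega(X) + P_Omega_perp(Z)
    Z = soft_svd(np.where(mask, X, Z), lam)
    iterates.append(Z)

values = np.array([f_lambda(X, Zk, mask, lam) for Zk in iterates])

# Objective values never increase (up to floating-point slack).
assert np.all(np.diff(values) <= 1e-10)

# O(1/k) bound, with the final iterate used as a proxy for Z_infinity.
Z_inf = iterates[-1]
k = np.arange(len(values))
gap = values - values[-1]
bound = 2.0 / (k + 1) * np.linalg.norm(iterates[0] - Z_inf, "fro") ** 2
assert np.all(gap[1:] <= bound[1:] + 1e-10)
```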


Soft-Impute : Computational Bottleneck

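Obtain the sequence $\{Z_k\}$, where $Z_k$ is current guess...

$$Z_{k+1} = \arg\min_Z \; \tfrac{1}{2}\|P_\Omega(X) + P_\Omega^\perp(Z_k) - Z\|_F^2 + \lambda\|Z\|_*$$

Computational bottleneck: soft SVD requires (low-rank) SVD of completed matrix after $k$ iterations:

$$\hat X_k = P_\Omega(X) + P_\Omega^\perp(Z_k)$$

Trick:

$$P_\Omega(X) + P_\Omega^\perp(Z_k) = \underbrace{\{P_\Omega(X) - P_\Omega(Z_k)\}}_{\text{Sparse}} + \underbrace{Z_k}_{\text{Low Rank}}$$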
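The payoff of the trick is that products with $\hat X_k$ never require forming the dense matrix. A sketch under illustrative assumptions: the observed entries are held as (row, column, value) triplets, $Z_k$ is held through factors `U`, `d`, `V`, and a dense copy is built only to verify the result.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, r = 50, 40, 3

# Low-rank part Z_k = U diag(d) V', stored only through its factors.
U = np.linalg.qr(rng.standard_normal((n, r)))[0]
V = np.linalg.qr(rng.standard_normal((m, r)))[0]
d = np.array([5.0, 3.0, 1.0])

# Observed entries of X as (row, col, value) triplets.
num_obs = 200
flat = rng.choice(n * m, size=num_obs, replace=False)
rows, cols = np.unravel_index(flat, (n, m))
x_obs = rng.standard_normal(num_obs)

# Sparse part: residuals X_ij - Z_ij on the observed entries only.
z_obs = np.einsum("ij,j,ij->i", U[rows], d, V[cols])
resid = x_obs - z_obs

def right_multiply(W):
    """X_hat @ W for an m x q matrix W, without forming X_hat."""
    sparse_part = np.zeros((n, W.shape[1]))
    np.add.at(sparse_part, rows, resid[:, None] * W[cols])
    low_rank_part = U @ (d[:, None] * (V.T @ W))
    return sparse_part + low_rank_part

def left_multiply(W):
    """X_hat.T @ W for an n x q matrix W, without forming X_hat."""
    sparse_part = np.zeros((m, W.shape[1]))
    np.add.at(sparse_part, cols, resid[:, None] * W[rows])
    low_rank_part = V @ (d[:, None] * (U.T @ W))
    return sparse_part + low_rank_part

# Dense reference, built here only to check the two products.
X_hat = (U * d) @ V.T
X_hat[rows, cols] = x_obs                # observed entries come from X
W_right = rng.standard_normal((m, 4))
W_left = rng.standard_normal((n, 4))
assert np.allclose(right_multiply(W_right), X_hat @ W_right)
assert np.allclose(left_multiply(W_left), X_hat.T @ W_left)
```

Each product costs on the order of $(m + n) \cdot r + |\Omega|$ operations per column of `W`, compared with $nm$ for the dense matrix.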


Computational tricks in Soft-Impute

- Anticipate rank of $\hat Z_{\lambda_{j+1}}$ based on rank of $\hat Z_{\lambda_j}$, erring on the generous side.

- Compute low-rank SVD of $\hat X_k$ using orthogonal QR iterations with Ritz acceleration (Stewart, 1969; Mazumder & Hastie 2012 [preprint]).

- Iterations require left and right multiplications $U'\hat X_k$ and $\hat X_k V$. Ideal for Sparse + Low-Rank structure.

- Warm starts: $S_\lambda(\hat X_k)$ provides excellent warm starts ($U$ and $V$) for $S_\lambda(\hat X_{k+1})$. Likewise $\hat Z_{\lambda_j}$ for $\hat Z_{\lambda_{j+1}}$.

- Total cost per iteration $O[(m + n) \cdot r + |\Omega|]$.
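A sketch of a low-rank SVD by orthogonal (QR) iteration that touches the matrix only through left and right multiplications. This is plain orthogonal iteration with a single Rayleigh-Ritz step at the end, not the accelerated variant cited above; the function name, the iteration count and the optional warm start `V0` are assumptions for illustration.

```python
import numpy as np

def orthogonal_iteration_svd(matvec, rmatvec, m, r, n_iter=30, V0=None, seed=0):
    """Rank-r SVD of X using only the products X @ V and X.T @ U.

    matvec(V) must return X @ V and rmatvec(U) must return X.T @ U.
    V0 (m x r) is an optional warm start for the right subspace.
    """
    if V0 is None:
        V0 = np.random.default_rng(seed).standard_normal((m, r))
    V, _ = np.linalg.qr(V0)
    for _ in range(n_iter):
        U, _ = np.linalg.qr(matvec(V))    # right multiply, then re-orthogonalize
        V, _ = np.linalg.qr(rmatvec(U))   # left multiply, then re-orthogonalize
    # Rayleigh-Ritz step: small SVD of X restricted to the subspace span(V).
    Ub, s, Wt = np.linalg.svd(matvec(V), full_matrices=False)
    return Ub, s, V @ Wt.T

rng = np.random.default_rng(6)
n, m, r = 60, 40, 3
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
X += 0.01 * rng.standard_normal((n, m))

U, s, V = orthogonal_iteration_svd(lambda W: X @ W, lambda W: X.T @ W, m, r)
s_ref = np.linalg.svd(X, compute_uv=False)[:r]
assert np.allclose(s, s_ref, rtol=1e-6)
```

Passing the previous `V` as `V0` is one way to realize the warm starts mentioned above, since successive completed matrices change little between iterations.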


Soft-Impute on Netflix problem

rank           time (hrs)   RMSE     % Improvement

42             1.36         0.9622   -1.1
66             2.21         0.9572   -0.6
81             2.83         0.9543   -0.3

Cinematch                   0.9514    0

95             3.27         0.9497    1.8
120            4.40         0.9213    3.2
···
Winning Goal                0.8563   10

state-of-the-art convex solvers do not scale to this size


Hard-Impute

$$\underset{\operatorname{rank}(Z) = r}{\text{minimize}} \; \|P_\Omega(X) - P_\Omega(Z)\|_F$$

This is not convex in $Z$, but by analogy with Soft-Impute, an iterative algorithm gives good solutions.

Replace step:

(2a) Compute $Z^{\text{new}} \leftarrow S_\lambda(P_\Omega(X) + P_\Omega^\perp(Z^{\text{old}}))$

with

(2a') Compute $Z^{\text{new}} \leftarrow H_r(P_\Omega(X) + P_\Omega^\perp(Z^{\text{old}}))$

Here $H_r(X^*)$ is the best rank-$r$ approximation to $X^*$, i.e. the rank-$r$ truncated SVD approximation.
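A dense sketch of Hard-Impute for a single rank, with the soft-thresholding step swapped for a truncated SVD. The fixed iteration count and the synthetic data are assumptions. The recorded training error is non-increasing because each step minimizes a majorizer of the training criterion over rank-$r$ matrices.

```python
import numpy as np

def best_rank_r(M, r):
    """H_r(M): rank-r truncated SVD approximation of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def hard_impute(X, mask, r, n_iter=200):
    Z = np.zeros_like(X, dtype=float)
    train_err = []
    for _ in range(n_iter):
        # (2a'): fill missing entries with the current guess, then truncate
        Z = best_rank_r(np.where(mask, X, Z), r)
        train_err.append(np.linalg.norm(np.where(mask, X - Z, 0.0), "fro"))
    return Z, np.array(train_err)

rng = np.random.default_rng(7)
n, m, r = 40, 30, 3
truth = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
X = truth + 0.1 * rng.standard_normal((n, m))
mask = rng.random((n, m)) < 0.7

Z, train_err = hard_impute(X, mask, r)
assert np.linalg.matrix_rank(Z, tol=1e-8) <= r
assert np.all(np.diff(train_err) <= 1e-10)
```

Unlike Soft-Impute, the result can depend on the starting point, since the rank constraint makes the problem non-convex.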


Example: choosing a good rank for SVD

[Figure: "10-fold CV Rank Determination". Root mean square error plotted against rank (20 to 80), with curves for 10-fold CV and Train.]

Truth is a 200 × 100 rank-50 matrix plus noise (SNR 3). Randomly omit 10% of entries, and then predict using solutions from Hard-Impute.


Maximum Margin Matrix Factorization

Consider rank-$r$ approximation $Z = A_{m \times r} B'_{n \times r}$, and solve

$$\underset{A,B}{\text{minimize}} \; \|P_\Omega(X) - P_\Omega(AB')\|_F^2 + \lambda(\|A\|_F^2 + \|B\|_F^2)$$

[Diagram: X ≈ A B′]

- Regularized SVD (Srebro et al. 2003, Simon Funk)

- Not convex, but bi-convex: alternating ridge regression

Lemma (Mazumder et al. 2010)
For any matrix $W$, the following holds:

$$\|W\|_* = \min_{A,B:\; W = AB^T} \tfrac{1}{2}\left(\|A\|_F^2 + \|B\|_F^2\right).$$

If $\operatorname{rank}(W) = k \le \min\{m, n\}$, then the minimum above is attained at a factor decomposition $W = A_{m \times k} B^T_{n \times k}$.
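The lemma can be checked numerically. In the sketch below the balanced factors $A = U D^{1/2}$, $B = V D^{1/2}$ built from the SVD of $W$ attain the nuclear norm, while other factorizations of the same $W$ (obtained with a random invertible matrix `G`, an illustrative device) never do better.

```python
import numpy as np

rng = np.random.default_rng(8)
m, n, k = 10, 7, 3
W = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))  # rank k

U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :k] * np.sqrt(s[:k])        # A = U D^{1/2}
B = Vt[:k].T * np.sqrt(s[:k])        # B = V D^{1/2}

def half_sum_sq(A, B):
    return 0.5 * (np.linalg.norm(A, "fro") ** 2 + np.linalg.norm(B, "fro") ** 2)

nuc = np.linalg.norm(W, "nuc")
assert np.allclose(A @ B.T, W)
assert np.isclose(half_sum_sq(A, B), nuc)

# Any other factorization W = (A G)(B G^{-T})' gives a value at least as large.
for _ in range(100):
    G = rng.standard_normal((k, k))
    A2 = A @ G
    B2 = B @ np.linalg.inv(G).T
    assert np.allclose(A2 @ B2.T, W)
    assert half_sum_sq(A2, B2) >= nuc - 1e-9
```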


Connections between MMMF and soft-impute

$$\text{MMMF}: \quad \underset{A_{n \times r},\, B_{m \times r}}{\text{minimize}} \; \tfrac{1}{2}\|P_\Omega(X) - P_\Omega(AB')\|_F^2 + \tfrac{\lambda}{2}(\|A\|_F^2 + \|B\|_F^2)$$

$$\text{soft-impute}: \quad \underset{Z}{\text{minimize}} \; \tfrac{1}{2}\|P_\Omega(X) - P_\Omega(Z)\|_F^2 + \lambda\|Z\|_*$$

- Solution-space of MMMF contains solutions of soft-impute.

- For large rank $r$: MMMF ≡ soft-impute.

[Figure: rank (0 to 100) plotted against log λ (−4 to 6).]

(Related work: Srebro et al. '05; Fazel '02)

Synthesis and New Approach

- MMMF is slower than soft-impute, by a factor of 10.

- MMMF requires guesswork for rank, and does not return a definitive low-rank solution.

- soft-impute requires a low-rank SVD at each iteration. Typically iterative QR methods are used, exploiting problem structure and warm starts.

Idea: combine soft-impute and MMMF

- Leads to an algorithm more efficient than soft-impute

- Scales naturally to larger problems using parallel/multicore programming

- Suggests an efficient algorithm for low-rank SVD for complete matrices


New nuclear-norm and MMMF results (H & M, 2012)

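Consider fully observed $X_{n \times m}$.

$$\text{Nuclear}: \quad \underset{\operatorname{rank}(Z) \le r}{\text{minimize}} \; \tfrac{1}{2}\|X - Z\|_F^2 + \lambda\|Z\|_*$$

$$\text{MMMF}: \quad \underset{A_{n \times r},\, B_{m \times r}}{\text{minimize}} \; \tfrac{1}{2}\|X - AB'\|_F^2 + \tfrac{\lambda}{2}(\|A\|_F^2 + \|B\|_F^2)$$

The solution to Nuclear is

$$Z = U_r D_* V_r',$$

where $U_r$ and $V_r$ are first $r$ left and right singular vectors of $X$, and

$$D_* = \operatorname{diag}[(\sigma_1 - \lambda)_+, \ldots, (\sigma_r - \lambda)_+]$$

A solution to MMMF is

$$A = U_r D_*^{1/2} \quad \text{and} \quad B = V_r D_*^{1/2}$$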
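One way to see the connection numerically: the stated MMMF solution is a fixed point of the alternating ridge regressions. The sketch below builds $A$ and $B$ from the SVD of a random fully observed matrix and checks that a ridge update of either factor, with the other held fixed, returns the same factor. The helper `ridge_update` and the test matrix are assumptions for illustration.

```python
import numpy as np

def ridge_update(X, A, lam):
    """argmin over B of 0.5 * ||X - A B'||_F^2 + (lam / 2) * ||B||_F^2, A fixed.

    Setting the gradient to zero gives B' = (A'A + lam I)^{-1} A' X.
    """
    r = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(r), A.T @ X).T

rng = np.random.default_rng(9)
n, m, r, lam = 12, 9, 4, 1.0
X = rng.standard_normal((n, m))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
d_star = np.maximum(s[:r] - lam, 0.0)            # diagonal of D_*
A = U[:, :r] * np.sqrt(d_star)                   # A = U_r D_*^{1/2}
B = Vt[:r].T * np.sqrt(d_star)                   # B = V_r D_*^{1/2}

# A B' reproduces the rank-restricted soft-thresholded SVD.
assert np.allclose(A @ B.T, (U[:, :r] * d_star) @ Vt[:r])

# Fixed point of the alternating ridge regressions.
assert np.allclose(ridge_update(X, A, lam), B)
assert np.allclose(ridge_update(X.T, B, lam), A)
```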


Consequences of new nuclear-norm / MMMF connections

For SVD of fully observed matrix:

- Can solve reduced-rank SVD by alternating ridge regressions.

- At each iteration, re-orthogonalization as in usual QR iterations (for reduced-rank SVD) means ridge regression is a simple matrix multiply, followed by column scaling.

- Ridging speeds up convergence, and focuses accuracy on leading dimensions.

- Solution delivers a reduced-rank SVD.

For matrix completion:

- Combine SVD calculation and imputation in Soft-Impute.

- Leads to a faster algorithm that can be distributed to multiple cores for storage and computation efficiency.


Soft-Impute//

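Back to matrix imputation.

1. Initialize $U_{n \times r}$, $V_{m \times r}$ orthogonal, $D_{r \times r} > 0$ diagonal, and $A = UD$, $B = VD$.

2. Given $U$ and $D$ and hence $A = UD$, update $B$:

2.a Compute current imputation:

$$X^* = P_\Omega(X) + P_\Omega^\perp(AB') = [P_\Omega(X) - P_\Omega(AB')] + UD^2V'$$

2.b Ridge regression of $X^*$ on $A$:

$$B' \leftarrow (D^2 + \lambda I)^{-1} D U' X^* = D_1 U'[P_\Omega(X) - P_\Omega(AB')] + D_2 V'$$

where, substituting the expression for $X^*$, $D_1 = (D^2 + \lambda I)^{-1} D$ and $D_2 = (D^2 + \lambda I)^{-1} D^3$ are diagonal.

2.c Reorthogonalize and update $V$, $D$ and $U$ via SVD of $BD$.

3. Given $V$ and $D$ and $B = VD$, update $A$ in similar fashion.

4. At convergence, $U$ and $V$ provide the SVD of $X^*$, and hence $S_\lambda(X^*)$, which cleans up the rank of the solution.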
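The steps above can be sketched in dense NumPy. This is an illustration under several assumptions, not the authors' implementation: the filled-in matrix $X^*$ is stored densely (the talk keeps it as sparse plus low-rank), the initialization (random orthonormal $U$, $D = I$, $V = 0$) and the fixed iteration count are choices made here, and `d2` holds the diagonal of $D^2$.

```python
import numpy as np

def f_lambda(X, Z, mask, lam):
    resid = np.where(mask, X - Z, 0.0)
    return 0.5 * np.sum(resid ** 2) + lam * np.linalg.norm(Z, "nuc")

def soft_impute_als(X, mask, r, lam, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = np.linalg.qr(rng.standard_normal((n, r)))[0]   # assumed initialization
    V = np.zeros((m, r))
    d2 = np.ones(r)                                    # diagonal of D^2
    X_fill = np.where(mask, X, 0.0)                    # X* for the initial Z = 0
    history = []
    for _ in range(n_iter):
        # Step 2: ridge regression for B with A = U D fixed.
        # B' = (D^2 + lam I)^{-1} D U' X*, hence B D = X*' U D^2 (D^2 + lam I)^{-1}.
        BD = (X_fill.T @ U) * (d2 / (d2 + lam))
        V, d2, Rt = np.linalg.svd(BD, full_matrices=False)   # 2.c
        U = U @ Rt.T
        X_fill = np.where(mask, X, (U * d2) @ V.T)
        # Step 3: the same update for A with B = V D fixed.
        AD = (X_fill @ V) * (d2 / (d2 + lam))
        U, d2, Rt = np.linalg.svd(AD, full_matrices=False)
        V = V @ Rt.T
        Z = (U * d2) @ V.T
        X_fill = np.where(mask, X, Z)
        history.append(f_lambda(X, Z, mask, lam))
    # Step 4: final SVD of X* on span(V), followed by soft-thresholding.
    U, s, Rt = np.linalg.svd(X_fill @ V, full_matrices=False)
    V = V @ Rt.T
    return U, np.maximum(s - lam, 0.0), V, np.array(history)

rng = np.random.default_rng(10)
n, m = 40, 30
X = rng.standard_normal((n, 3)) @ rng.standard_normal((3, m))
X += 0.1 * rng.standard_normal((n, m))
mask = rng.random((n, m)) < 0.5

U, d, V, history = soft_impute_als(X, mask, r=6, lam=1.0)
Z_hat = (U * d) @ V.T

# The nuclear-norm objective does not increase from one sweep to the next.
assert np.all(np.diff(history) <= 1e-8)
```

Each half-step needs only products of the filled-in matrix with $U$ or $V$, followed by column scaling and a small SVD of an $m \times r$ or $n \times r$ matrix, which is what makes the scheme cheap per sweep.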


Distributed Computation

[Diagram: the matrix is partitioned into blocks, each shown with its piece of the current SVD.]

X11: U1 D V′1    X12: U1 D V′2    X13: U1 D V′3
X21: U2 D V′1    X22: U2 D V′2    X23: U2 D V′3

Large matrix is chunked into submatrices, along with current SVD. Each core computes residuals, does left or right multiplies, then sends results to central core for synthesis and re-orthogonalization. Subvectors $V_j$, $U_i$ and $D$ are sent back to relevant cores.
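The block arithmetic behind this scheme can be simulated serially. In the sketch below each (i, j) "core" is just a dictionary entry holding its block's partial product; the 2 × 3 partition mirrors the diagram, and no actual parallel execution or communication is modeled.

```python
import numpy as np

rng = np.random.default_rng(11)
n, m, r = 6, 9, 2
X_hat = rng.standard_normal((n, m))          # stands in for the filled-in matrix
V = np.linalg.qr(rng.standard_normal((m, r)))[0]

row_blocks = np.array_split(np.arange(n), 2)
col_blocks = np.array_split(np.arange(m), 3)

# Each core (i, j) holds X_hat[rows_i, cols_j] and the matching rows of V,
# and computes its partial right multiply locally.
partials = {
    (i, j): X_hat[np.ix_(rb, cb)] @ V[cb]
    for i, rb in enumerate(row_blocks)
    for j, cb in enumerate(col_blocks)
}

# Central core: sum the partial products across column blocks, stack row blocks.
XV = np.vstack([
    sum(partials[(i, j)] for j in range(len(col_blocks)))
    for i in range(len(row_blocks))
])
assert np.allclose(XV, X_hat @ V)

# Re-orthogonalization happens centrally on the small n x r result.
U_new, _ = np.linalg.qr(XV)
```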


Application: Image Inpainting

[Figure: True Image]


Application: Image Inpainting

[Figure: 50% Masked/Degraded Noisy Training Image]


Application: Image Inpainting

[Figure panels: Training, Oracle, Soft-Impute, Soft-Impute+]


Generalizations

- The low-rank modeling framework can be limiting in terms of the scope of applications

- However, simple generalizations can accommodate a far richer class of problems.

- The initial formulations are combinatorially hard(er)

- But the tasks can be accomplished by retaining:
– Convexity (via convex relaxations)
– Large scale algorithmic scalability

- We give some examples ...


Video Surveillance & Background Modeling

[Figure panels: Clean Sequence, Masked & Noisy Sequence]


Shadow Removal & Image Cleaning

[Figure panels: Original Images, Masked & Noisy Images]


Modeling Assumption: Robust SVD

- The matrix $X$ is approximated by:

$$X \approx \underbrace{L}_{\text{Low-Rank}} + \underbrace{S}_{\text{Sparse}}$$

- Under some regularity conditions both $S$ and $L$ can be recovered (if $X$ is fully observed)

- We extend the framework to noisy matrix completion problems (Mazumder, Hastie '11)

- Other applications: outliers in microarrays, robust collaborative filtering.

Related work: Candes et al. '10; Zhou et al. '10


Convex Robust Matrix Completion

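Criterion:

$$\underset{L,S}{\text{minimize}} \; \underbrace{\tfrac{1}{2}\|P_\Omega(X - L - S)\|_F^2}_{\text{Training Error}} + \underbrace{\lambda\|L\|_*}_{\text{Low-Rank Regularization}} + \underbrace{\gamma\|S\|_1}_{\text{Sparsity Regularization}} \qquad (*)$$

$\|L\|_*$: nuclear norm surrogate for rank of $L$

$\|S\|_1$: $\ell_1$ surrogate for # non-zeros in $S$

(a) We show equivalences with traditional robust estimation statistics (Huber's M-estimation)

$$\underset{L}{\text{minimize}} \; \tfrac{1}{2}\sum_{ij \in \Omega} h_\gamma(X_{ij} - L_{ij}) + \lambda\|L\|_*$$

[Figure: $h_\gamma$]

(b) Using (a) along with ideas used in soft-impute we develop efficient algorithms for solving $(*)$

Mazumder, Hastie '11 (preprint)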
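The link to Huber's function can be seen one entry at a time: minimizing $\tfrac{1}{2}(r - s)^2 + \gamma|s|$ over $s$ is solved by soft-thresholding $r$, and the minimized value is a Huber function of $r$. The sketch below checks this numerically. The normalization of `huber` used here is an assumption and may differ from the slide's $h_\gamma$ by a constant factor.

```python
import numpy as np

def soft_threshold(r, gamma):
    return np.sign(r) * np.maximum(np.abs(r) - gamma, 0.0)

def huber(r, gamma):
    """Quadratic for |r| <= gamma, linear beyond (assumed normalization)."""
    a = np.abs(r)
    return np.where(a <= gamma, 0.5 * r ** 2, gamma * a - 0.5 * gamma ** 2)

gamma = 0.7
r = np.linspace(-3.0, 3.0, 121)          # residuals X_ij - L_ij

# Profile out s: plug the soft-thresholded minimizer back into the criterion.
s_star = soft_threshold(r, gamma)
profiled = 0.5 * (r - s_star) ** 2 + gamma * np.abs(s_star)
assert np.allclose(profiled, huber(r, gamma))

# Brute-force check over a fine grid of candidate s values.
grid = np.linspace(-4.0, 4.0, 8001)
brute = np.min(
    0.5 * (r[:, None] - grid[None, :]) ** 2 + gamma * np.abs(grid)[None, :],
    axis=1,
)
assert np.all(brute >= profiled - 1e-12)
assert np.allclose(brute, profiled, atol=1e-6)
```

Large residuals are thus penalized linearly rather than quadratically, which is the sense in which the sparse component $S$ absorbs outliers.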


Algorithmic Improvements

[Figure: distance from optimal solution (0 down to −6) plotted against number of SVD calls (10 to 100); legend: AGD, GD, MGD, MAGD.]


Results: Video Surveillance & Background Modeling

Video in a Hall



Results: Shadow & Light Removal

[Figure panels, repeated for each individual: Training ≈ Low-Rank + Sparse]

Faces of two individuals. Images from the Extended Yale Database.


Conclusions

- Convex nuclear norm relaxation is a disciplined and effective strategy for low-rank matrix modeling:

– Soft-Impute: natural and computationally efficient algorithm [special structure, warm starts], scales to collaborative filtering, image applications, among others

– allows for simple generalizations to more flexible modeling frameworks (e.g. robust SVD)

- New connections between (reduced-rank) nuclear norm and ridge minimization:

– new algorithms for computing reduced-rank SVD
– new version of Soft-Impute which does imputation and iterative SVD calculations at the same time.
– simple nature of updates allows for natural parallel/distributed computation.

- THE END! Thank you for your attention.

