Introduction Related Work Proposed Algorithm Experiments
Accelerated Inexact Soft-Impute for Fast Large-Scale Matrix Completion

Quanming Yao
Department of Computer Science and Engineering
Hong Kong University of Science and Technology, Hong Kong

Joint work with James Kwok
Quanming Yao AIS-Impute for Matrix Completion
Outline
1 Introduction
2 Related Work
3 Proposed Algorithm
4 Experiments
Motivating Applications
Recommender systems: predict rating by user i on item j
Motivating Applications
Similarity among users and items: low-rank assumption
Motivating Applications
Image inpainting: fill in missing pixels
A natural image can be well approximated by a low-rank matrix
Matrix Completion
min_X (1/2)‖P_Ω(X − O)‖²_F + λ‖X‖_*

X ∈ R^{m×n}: low-rank matrix to be recovered (m ≤ n)
O ∈ R^{m×n}: observed elements
[P_Ω(A)]_ij = A_ij if Ω_ij = 1, and 0 otherwise
‖X‖_*: nuclear norm (sum of X's singular values, non-smooth): ‖X‖_* = ∑_{i=1}^m σ_i(X)

Goal: find X which is low-rank and consistent with the observations
Proximal Gradient Descent
min_x f(x) + λg(x)

f(·): convex and smooth
g(·): convex, can be non-smooth

x_{t+1} = arg min_x f(x_t) + ⟨x − x_t, ∇f(x_t)⟩ + (1/2)‖x − x_t‖² + λg(x)
        = arg min_x (1/2)‖x − z_t‖² + λg(x)   (proximal step, where z_t = x_t − ∇f(x_t))

the proximal step often has a simple closed-form solution
convergence rate: O(1/T), where T is the number of iterations
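As a concrete illustration (not part of the original slides), the update above can be sketched in a few lines of NumPy, with g taken to be the ℓ1 norm, whose proximal step is soft-thresholding:

```python
import numpy as np

def soft_threshold(z, lam):
    """Proximal operator of lam*||x||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def proximal_gradient(grad_f, prox_g, x0, step=1.0, iters=50):
    """Generic proximal gradient: x <- prox_g(x - step * grad_f(x))."""
    x = x0
    for _ in range(iters):
        z = x - step * grad_f(x)   # gradient step on the smooth part f
        x = prox_g(z, step)        # proximal step on the non-smooth part g
    return x

# toy problem: min_x 0.5*||x - b||^2 + ||x||_1
b = np.array([3.0, -0.5, 0.2])
x = proximal_gradient(lambda v: v - b,
                      lambda z, s: soft_threshold(z, s),
                      np.zeros(3))
# converges to soft_threshold(b, 1.0) = [2., 0., 0.]
```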
Proximal Gradient Descent - Acceleration
min_x f(x) + λg(x)

can be accelerated to O(1/T²) [Nesterov, 2013]

y_t = (1 + θ_t)x_t − θ_t x_{t−1}
z_t = y_t − ∇f(y_t)
x_{t+1} = arg min_x (1/2)‖x − z_t‖² + λg(x)

e.g., θ_t = (t − 1)/(t + 2)
can be seen as a momentum method with a specified weight
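A minimal sketch of the accelerated update (illustrative, not from the slides), again using the ℓ1 proximal step:

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def accelerated_prox_grad(grad_f, prox_g, x0, step=1.0, iters=50):
    """Proximal gradient with Nesterov momentum, theta_t = (t-1)/(t+2)."""
    x_prev = x0.copy()
    x = x0.copy()
    for t in range(1, iters + 1):
        theta = (t - 1.0) / (t + 2.0)
        y = x + theta * (x - x_prev)    # y_t = (1+theta)x_t - theta*x_{t-1}
        z = y - step * grad_f(y)        # gradient step at the extrapolated point
        x_prev, x = x, prox_g(z, step)  # proximal step
    return x

# same toy problem: min_x 0.5*||x - b||^2 + ||x||_1
b = np.array([3.0, -0.5, 0.2])
x = accelerated_prox_grad(lambda v: v - b,
                          lambda z, s: soft_threshold(z, s),
                          np.zeros(3))
```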
Proximal Gradient Descent for Matrix Completion
min_X (1/2)‖P_Ω(X − O)‖²_F + λ‖X‖_*   (here f(X) is the first term, g(X) = ‖X‖_*)

Let the SVD of matrix Z be UΣV⊤.

Proximal Step for Matrix Completion
arg min_X (1/2)‖X − Z‖²_F + λ‖X‖_* = U(Σ − λI)₊V⊤ ≡ SVT_λ(Z)

[(A)₊]_ij = max(A_ij, 0)
singular value thresholding (SVT): singular values no bigger than λ are shrunk to 0
Acceleration can be used [Ji and Ye, 2009; Toh and Yun, 2010].
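The closed-form proximal step above can be sketched directly (a dense SVD, for illustration only; the later slides show how to avoid it):

```python
import numpy as np

def svt(Z, lam):
    """Singular value thresholding: U (Sigma - lam*I)_+ V^T, the
    proximal step of the nuclear norm at Z."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s_thr = np.maximum(s - lam, 0.0)   # shrink each singular value by lam
    keep = s_thr > 0                   # drop directions thresholded to zero
    return U[:, keep] @ np.diag(s_thr[keep]) @ Vt[keep, :]
```

Note that the output is low-rank even when the input is not: for Z = diag(3, 1, 0.5) and λ = 1, the result is the rank-1 matrix diag(2, 0, 0).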
Soft-Impute [Mazumder et al., 2010]
Z_t = P_Ω(O) + P⊥_Ω(X_t),    X_{t+1} = SVT_λ(Z_t).

[P⊥_Ω(A)]_ij = A_ij if Ω_ij = 0, and 0 otherwise (complement of P_Ω(A))

To compute the SVD, the basic operations are matrix multiplications of the form Z_t u and Z_t⊤v

Key observation: Z_t is sparse + low-rank

Let X_t = U_tΣ_tV_t⊤. For any u ∈ R^n,

Z_t u = P_Ω(O − X_t)u  [sparse: O(‖Ω‖₁)]  + U_tΣ_t(V_t⊤u)  [low-rank: O((m+n)k)]

Rank-k SVD takes O(‖Ω‖₁k + (m + n)k²) time, instead of O(mnk) (similarly for Z_t⊤v)
k is much smaller than m and n; ‖Ω‖₁ is much smaller than mn
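A sketch of this "sparse plus low-rank" matrix-vector product (names are illustrative), storing the sparse part in coordinate (COO) form:

```python
import numpy as np

def zt_matvec(rows, cols, vals, U, S, Vt, u):
    """Compute Z_t @ u for Z_t = sparse part + U S Vt, without ever
    forming Z_t densely.  Cost: O(nnz) for the sparse part plus
    O((m+n)k) for the rank-k part."""
    out = np.zeros(U.shape[0])
    np.add.at(out, rows, vals * u[cols])  # sparse part: O(nnz)
    out += U @ (S @ (Vt @ u))             # low-rank part: three thin products
    return out
```

The parenthesization U @ (S @ (Vt @ u)) is what keeps the cost at O((m+n)k): the product is evaluated as matrix-vector operations, never as an m x n matrix.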
Soft-Impute is Proximal Gradient
Z_t = X_t − ∇f(X_t)   (proximal gradient)
    = X_t − P_Ω(X_t − O) = P⊥_Ω(X_t) + P_Ω(O)   (Soft-Impute)

Soft-Impute = proximal gradient

Possible to use acceleration and obtain the O(1/T²) rate
Previous work suggested that this is not useful:
the "sparse + low-rank" structure no longer exists
the increase in iteration complexity outweighs the gain in convergence rate
Main Contributions
Acceleration is useful!
1 The "sparse + low-rank" structure can still be used
  maintains low iteration complexity
  improves the convergence rate to O(1/T²)
2 Speed up SVT using the power method
  further reduces iteration complexity
  the use of approximation still yields the O(1/T²) convergence rate
“Sparse + Low-Rank” Structure
With acceleration,

Z_t = P_Ω(O − Y_t) + Y_t = P_Ω(O − Y_t)  [sparse]  + (1 + θ_t)X_t − θ_t X_{t−1}  [sum of two low-rank matrices]

For any u,

Z_t u = P_Ω(O − Y_t)u  [O(‖Ω‖₁)]  + (1 + θ_t)U_tΣ_tV_t⊤u  [O((m+n)k)]  − θ_t U_{t−1}Σ_{t−1}V_{t−1}⊤u  [O((m+n)k)]

rank-k SVD takes O(‖Ω‖₁k + (m + n)k²) time (same as Soft-Impute)
but the rate is improved to O(1/T²) (because of acceleration)
Approximate SVT - Motivation
The iterative procedure becomes

Y_t = (1 + θ_t)X_t − θ_t X_{t−1}
Z_t = P_Ω(O − Y_t) + Y_t
X_{t+1} = SVT_λ(Z_t)

Motivations
in SVT, only the singular vectors with singular values ≥ λ are needed, yet the partial SVD still has to be solved exactly
due to the iterative nature of proximal gradient descent, warm-starting can be helpful
→ approximate the subspace spanned by those singular vectors using the power method
Power Method
Let the rank-k SVD of Z̃ be U_kΣ_kV_k⊤. The power method is

simple but efficient for approximating the subspace spanned by U_k
an iterative algorithm that can be warm-started (using R)

PowerMethod(Z̃, R, ε̃) [Halko et al., 2011]
Require: Z̃ ∈ R^{m×n}, initial R ∈ R^{n×k} for warm-start, tolerance ε̃;
1: initialize Q_0 = QR(Z̃R);
2: for j = 0, 1, . . . do
3:   Q_{j+1} = QR(Z̃(Z̃⊤Q_j));  // QR decomposition of a matrix
4:   Δ_{j+1} = ‖Q_{j+1}Q_{j+1}⊤ − Q_jQ_j⊤‖_F;
5:   if Δ_{j+1} ≤ ε̃ then break;
6: end for
7: return Q_{j+1};
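The listing above corresponds to a few lines of NumPy (a hypothetical sketch; the stopping test follows the pseudocode):

```python
import numpy as np

def power_method(Z, R, tol=1e-6, max_iter=1000):
    """Approximate an orthonormal basis Q for the span of the top-k
    left singular vectors of Z, warm-started from R (n x k)."""
    Q = np.linalg.qr(Z @ R)[0]
    for _ in range(max_iter):
        Q_new = np.linalg.qr(Z @ (Z.T @ Q))[0]  # one power iteration + QR
        if np.linalg.norm(Q_new @ Q_new.T - Q @ Q.T) <= tol:
            return Q_new                         # subspace has stabilized
        Q = Q_new
    return Q
```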
Power Method - Case with k = 1
PowerMethod(Z̃, r)
1: initialize q_0 = Z̃r;
2: for j = 0, 1, . . . do
3:   q_j = q_j/‖q_j‖;  // QR becomes normalization of a vector
4:   q_{j+1} = Z̃(Z̃⊤q_j);
5: end for

Let Z̃ = UΣV⊤. The recursion can be seen as (up to scaling)

q_j = (Z̃Z̃⊤)^j Z̃r = U diag(1, (σ_2/σ_1)^{2j}, . . . , (σ_m/σ_1)^{2j}) U⊤Z̃r

For i = 2, · · · , m, lim_{j→∞} (σ_i/σ_1)^{2j} = 0, so the power method captures the
span of u_1 (first column of U)
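A tiny numerical check of this limit (toy matrix constructed so that the leading left singular vector is e_1):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = np.diag([4.0, 1.0, 0.5])   # singular values 4 > 1 > 0.5, so u1 = e1
q = rng.standard_normal(3)     # random start vector (plays the role of r)
for _ in range(30):
    q = Z @ (Z.T @ q)          # multiply by Z Z^T
    q /= np.linalg.norm(q)     # normalization plays the role of QR
# q is now aligned with e1, since (sigma_i/sigma_1)^(2j) -> 0 for i >= 2
```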
Obtain SVT(Z̃t) from a much smaller SVT
With the obtained Q, an approximate SVT can be constructed as

X̂_t = Q SVT_λ(Q⊤Z̃_t).

Q⊤Z̃_t ∈ R^{k×n}, and is thus much smaller than Z̃_t ∈ R^{m×n}

Approx-SVT(Z̃_t, R, λ, ε̃)
Require: Z̃_t ∈ R^{m×n}, R ∈ R^{n×k}, thresholds λ and ε̃;
1: Q = PowerMethod(Z̃_t, R, ε̃);
2: [U, Σ, V] = SVD(Q⊤Z̃_t);
3: U = {u_i | σ_i > λ}, V = {v_i | σ_i > λ}, Σ = (Σ − λI)₊;
4: return QU, Σ and V.

still O(‖Ω‖₁k + (m + n)k²), but cheaper than an exact SVD
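Combining the two ideas gives the following illustrative sketch (a fixed number of power iterations replaces the tolerance test):

```python
import numpy as np

def approx_svt(Z, R, lam, n_power=20):
    """Approximate SVT_lam(Z): power iterations build a basis Q for the
    dominant left singular subspace, and the exact SVD is applied only
    to the small k x n matrix Q^T Z instead of Z itself."""
    Q = np.linalg.qr(Z @ R)[0]
    for _ in range(n_power):
        Q = np.linalg.qr(Z @ (Z.T @ Q))[0]
    U, s, Vt = np.linalg.svd(Q.T @ Z, full_matrices=False)
    keep = s > lam                 # keep singular values above the threshold
    return Q @ U[:, keep], s[keep] - lam, Vt[keep, :]
```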
Complete Algorithm
Accelerated Inexact Soft-Impute (AIS-Impute)
Require: partially observed matrix O, parameter λ, decay parameter ν ∈ (0, 1), threshold ε;
1: [U_0, λ_0, V_0] = rank-1 SVD(P_Ω(O));
2: initialize c = 1, ε̃_0 = ‖P_Ω(O)‖_F, X_0 = X_1 = λ_0U_0V_0⊤;
3: for t = 1, 2, . . . do
4:   λ_t = ν^t(λ_0 − λ) + λ;
5:   θ_t = (c − 1)/(c + 2);
6:   Y_t = X_t + θ_t(X_t − X_{t−1});
7:   Z̃_t = Y_t + P_Ω(O − Y_t);
8:   ε̃_t = ν^t ε̃_0;
9:   V_{t−1} = V_{t−1} − V_t(V_t⊤V_{t−1}), remove zero columns;
10:  R_t = QR([V_t, V_{t−1}]);
11:  [U_{t+1}, Σ_{t+1}, V_{t+1}] = Approx-SVT(Z̃_t, R_t, λ_t, ε̃_t);
12:  if F(U_{t+1}Σ_{t+1}V_{t+1}⊤) > F(U_tΣ_tV_t⊤) then c = 1 else c = c + 1;
13: end for
14: return X_{t+1} = U_{t+1}Σ_{t+1}V_{t+1}⊤.
core steps: 5–7 (acceleration)
core steps: 8–11 (approximate SVT)
the right singular vectors of the last two iterations (V_t and V_{t−1}) are used to warm-start the power method
the tolerance ε̃_t of the approximate SVT is decreased linearly (ε̃_t = ν^t ε̃_0)
step 12 (adaptive restart): the momentum is reset (c = 1) if F(X) starts to increase
step 4 (continuation strategy): λ_t is initialized to a large value and then gradually decreased to λ; this allows further speedup
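Putting the pieces together, the outer loop can be sketched as follows (an illustrative dense version with exact SVT; the actual algorithm uses Approx-SVT and the sparse-plus-low-rank product, and never forms Z_t densely):

```python
import numpy as np

def ais_impute_sketch(O, mask, lam, nu=0.5, iters=100):
    """Illustrative AIS-Impute outer loop. mask[i,j] = 1 where O is
    observed; exact SVT is used here purely for readability."""
    def svt(Z, l):                       # proximal step of l*||.||_*
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        return U @ np.diag(np.maximum(s - l, 0.0)) @ Vt

    def F(X):                            # objective being minimized
        return (0.5 * np.sum((mask * (X - O)) ** 2)
                + lam * np.linalg.svd(X, compute_uv=False).sum())

    lam0 = np.linalg.norm(mask * O)      # large initial threshold
    X_prev = X = np.zeros_like(O)
    c = 1
    for t in range(1, iters + 1):
        lam_t = nu ** t * (lam0 - lam) + lam   # step 4: continuation
        theta = (c - 1.0) / (c + 2.0)          # step 5
        Y = X + theta * (X - X_prev)           # step 6: acceleration
        Z = Y + mask * (O - Y)                 # step 7: sparse + low-rank
        X_prev, X = X, svt(Z, lam_t)
        c = 1 if F(X) > F(X_prev) else c + 1   # step 12: adaptive restart
    return X
```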
Error in Approximate SVT
Let h_{λg}(X; Z_t) ≡ (1/2)‖X − Z_t‖²_F + λg(X). If the power method exits after j iterations, and assuming k ≥ k̂, η_t < 1 and ε̃ ≥ α_t η_t^j √(1 + η_t²), then

h_{λ‖·‖_*}(X̂_t; Z̃_t) ≤ h_{λ‖·‖_*}(SVT_λ(Z̃_t); Z̃_t) + (η_t/(1 − η_t)) β_t γ_t ε̃,

where X̂_t is the approximate solution; the extra term is controlled by ε̃.

α_t, β_t, γ_t and η_t are constants that depend on Z̃_t
k̂ is the number of singular values > λ; k is the input rank for Approx-SVT
ε̃ is the tolerance for the power method

The approximation error in Approx-SVT can thus be controlled by ε̃_t
Convergence of AIS-Impute
Theorem
With a controlled approximation error on SVT, the proposed algorithm (AIS-Impute) converges to the optimal solution at a rate of O(1/T²).

Since the approximation error ε̃_t of the proximal step (Approx-SVT) decreases to 0 faster than O(1/T²), the convergence rate is the same as with exact SVT
Synthetic Data
m × m data matrix O = UV + G
U ∈ R^{m×5}, V ∈ R^{5×m}: entries sampled i.i.d. from N(0, 1)
G: noise sampled from N(0, 0.05)

‖Ω‖₁ = 15m log(m) random elements in O are observed
half for training, half for parameter tuning
testing is on the unobserved (missing) elements

Performance criteria:
NMSE = ‖P⊥_Ω(X − X̃)‖_F / ‖P⊥_Ω(X̃)‖_F
rank obtained
time
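The NMSE criterion can be computed directly from the observation mask (a hypothetical helper, not from the slides):

```python
import numpy as np

def nmse(X_hat, X_true, mask):
    """NMSE evaluated on the unobserved entries (where mask == 0)."""
    miss = mask == 0
    return np.linalg.norm((X_hat - X_true)[miss]) / np.linalg.norm(X_true[miss])
```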
Synthetic Data - Compared Methods
Compare the proposed AIS-Impute with
accelerated proximal gradient algorithm ("APG") [Ji and Ye, 2009; Toh and Yun, 2010];
Soft-Impute [Mazumder et al., 2010]
Algorithm     Iteration Complexity    Rate      SVT
APG           O(mnk)                  O(1/T²)   exact
Soft-Impute   O(k‖Ω‖₁ + k²(m + n))    O(1/T)    exact
AIS-Impute    O(k‖Ω‖₁ + k²(m + n))    O(1/T²)   approximate

Code can be downloaded from https://github.com/quanmingyao/AIS-impute
Results
m = 500 (sparsity=18.64%) m = 1000 (10.36%)
NMSE rank time (sec) NMSE rank time (sec)
APG 0.0183 5 5.1 0.0223 5 45.5
Soft-Impute 0.0183 5 1.3 0.0223 5 4.4
AIS-Impute 0.0183 5 0.3 0.0223 5 1.1
m = 1500 (7.31%) m = 2000 (5.70%)
NMSE rank time (sec) NMSE rank time (sec)
APG 0.0251 5 172.7 0.0273 5 483.9
Soft-Impute 0.0251 5 13.3 0.0273 5 18.7
AIS-Impute 0.0251 5 2.0 0.0273 5 2.9
All algorithms are equally good on recovery, while AIS-Impute isthe fastest
Convergence Speeds
(a) objective vs #iterations. (b) objective vs time.
W.r.t. #iterations
APG and AIS-Impute are much faster than Soft-Impute
AIS-Impute has a slightly higher objective than APG
W.r.t. time
APG is the slowest (does not use "sparse plus low-rank")
AIS-Impute is the fastest
Recommendation - MovieLens Data
Task: Recommend movies based on users’ historical ratings
#users #movies #ratings
MovieLens-100K 943 1,682 100,000
MovieLens-1M 6,040 3,449 999,714
MovieLens-10M 69,878 10,677 10,000,054
ratings (from 1 to 5) of different users on movies
50% of the observed ratings for training
25% for validation and the rest for testing
MovieLens Data - Compared Methods
Besides proximal algorithms, we also compare with
active subspace selection (“active”) [Hsieh and Olsen, 2014]
Frank-Wolfe algorithm (“boost”) [Zhang et al., 2012]
variant of Soft-Impute (“ALT-Impute”) [Hastie et al., 2014]
second-order trust-region algorithm (“TR”) [Mishra et al.,2013]
Objective w.r.t. Time
AIS-Impute is in black
(a) MovieLens-100K. (b) MovieLens-10M.
On MovieLens-10M, TR and APG are very slow, and thus not shown
Testing RMSE w.r.t. Time
AIS-Impute is in black
(a) MovieLens-100K. (b) MovieLens-10M.
Results
MovieLens-100K MovieLens-1M MovieLens-10M
RMSE rank time RMSE rank time RMSE rank time
active 1.037 70 59.5 0.925 180 1431.4 0.918 217 29681.4
boost 1.038 71 19.5 0.925 178 616.3 0.917 216 13873.9
ALT-Impute 1.037 70 29.1 0.925 179 797.1 0.919 215 17337.3
TR 1.037 71 1911.4 — — > 106 — — > 106
APG 1.037 70 83.4 0.925 180 2060.3 — — > 106
Soft-Impute 1.037 70 337.6 0.925 180 8821.0 — — > 106
AIS-Impute 1.037 70 5.8 0.925 179 129.7 0.916 215 2817.5
All algorithms are equally good at recovering the missing matrix elements
TR is the slowest
ALT-Impute has the same convergence rate as Soft-Impute, but is faster
AIS-Impute is the fastest
Conclusion
AIS-Impute
accelerates proximal gradient descent without losing the "sparse plus low-rank" structure
the power method efficiently produces a good approximation to SVT
fast convergence rate + low iteration complexity
empirically, much faster than the state of the art