
  • Spectral k-Support Norm Regularization

    Andrew McDonald

    Department of Computer Science, UCL

    (Joint work with Massimiliano Pontil and Dimitris Stamos)

    25 March, 2015


  • Problem: Matrix Completion

    Goal: Recover a matrix from a subset of measurements.

    e.g. random sample of 10% of entries


  • Applications

    - Collaborative filtering: predict the interests of a user from the preferences of many users (e.g. the Netflix problem)

    - Triangulation of distances from an incomplete network (e.g. a wireless network)

    - Multitask learning: leverage commonalities between multiple learning tasks

    - . . .


  • Problem Statement

    Given a subset Ω of observations of a matrix X, estimate the missing entries.

    (i) Ill-posed problem → assume X is low rank, use a regularizer to encourage low-rank structure.

    (ii) Regularization with the rank operator is NP-hard → use a convex approximation, e.g. the trace norm (sum of singular values):

      min_W ‖Ω(W) − Ω(X)‖_F² + λ‖W‖_tr
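
    As a concrete illustration (my own sketch, not from the talk): this objective is typically minimized by proximal gradient descent, where the proximity operator of the trace norm soft-thresholds the singular values. The function names, the boolean mask omega, and the step size are assumptions for the example.

      import numpy as np

      def svt(W, tau):
          # Proximity operator of tau * ||.||_tr: soft-threshold the singular values.
          U, s, Vt = np.linalg.svd(W, full_matrices=False)
          return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

      def complete_trace_norm(X, omega, lam, eta=0.5, n_iter=500):
          # Proximal gradient for min_W ||Omega(W) - Omega(X)||_F^2 + lam * ||W||_tr.
          # X     : matrix with the observed entries filled in (other entries are ignored)
          # omega : boolean mask, True on observed entries
          W = np.zeros_like(X, dtype=float)
          for _ in range(n_iter):
              grad = 2.0 * omega * (W - X)          # gradient of the data-fit term
              W = svt(W - eta * grad, eta * lam)    # singular value thresholding step
          return W

    With a step size eta no larger than 1/2 (the data-fit gradient is 2-Lipschitz), the standard proximal gradient convergence guarantees apply.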


  • Trace Norm Regularization

    - The trace norm is the tightest convex relaxation of the rank operator on the spectral norm unit ball. [Fazel, Hindi & Boyd 2001]

    - Optimization can be solved efficiently using proximal gradient methods.

    - Can be shown that this method finds the true underlying matrix with high probability.

    Goal: can we improve on the performance using other regularizers?


  • The Vector k-Support Norm

    - The k-support norm is a regularizer used in sparse vector estimation problems. [Argyriou, Foygel & Srebro 2012]

    - For k ∈ {1, . . . , d}, the unit ball is:

      co{w : card(w) ≤ k, ‖w‖_2 ≤ 1}.

    - Includes ‖·‖_1 (k = 1) and ‖·‖_2 (k = d).

    - The dual norm is the ℓ_2-norm of the largest k components of a vector.
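
    For concreteness, the dual norm in the last bullet is easy to compute; a minimal numpy sketch (my own illustration, not from the slides; assumes 1 <= k <= d):

      import numpy as np

      def k_support_dual(w, k):
          # Dual of the k-support norm: the l2 norm of the k largest components of w in magnitude.
          return np.linalg.norm(np.sort(np.abs(w))[-k:])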


  • Vector k-Support Unit Balls

    - Unit balls in R^3 (k = 1, 2, 3)

    - Convex hull interpretation


  • The Spectral k-Support Norm

    Extend the k-support norm to matrices.

    - The k-support norm is a symmetric gauge function: it induces the spectral k-support norm [von Neumann 1937]

      ‖W‖_(k) = ‖(σ_1(W), . . . , σ_d(W))‖_(k).

    - The unit ball is given by

      co{W : rank(W) ≤ k, ‖W‖_F ≤ 1}.

    - Includes ‖·‖_tr (k = 1) and ‖·‖_F (k = d).

    [McDonald, Pontil & Stamos 2014]


  • Optimization

    - The k-support norm can be written as

      ‖w‖_(k) = inf_{θ ∈ Θ} √( ∑_{i=1}^d w_i² / θ_i ),   Θ = {θ : 0 < θ_i ≤ 1, ∑_i θ_i ≤ k}.

    - Coordinate-wise separable using Lagrange multipliers.

    - The norm can be computed in O(d log d) time as

      ‖w‖²_(k) = ‖w↓_[1:r]‖_2² + (1/(k − r)) ‖w↓_(r:d]‖_1²

      (a code sketch follows this slide).

    - Similar computation for the proximity operator of the squared norm: can use proximal gradient methods to solve the optimization.

    - The matrix case follows using the SVD.
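
    The sketch referenced above: a direct implementation of the O(d log d) formula, where r is found by the characterization from Argyriou, Foygel & Srebro (2012). The function name and the linear scan over r are my own choices; assumes 1 <= k <= d.

      import numpy as np

      def k_support_norm(w, k):
          # ||w||_(k) via the sorted-magnitude formula on this slide.
          # r is the unique integer in {0, ..., k-1} with
          #   |w_(r)| > (1/(k-r)) * sum_{i>r} |w_(i)| >= |w_(r+1)|,
          # where |w_(1)| >= |w_(2)| >= ... are the sorted magnitudes and |w_(0)| = +inf.
          z = np.sort(np.abs(w))[::-1]                               # magnitudes, decreasing
          tails = np.concatenate((np.cumsum(z[::-1])[::-1], [0.0]))  # tails[i] = sum_{j >= i} z[j]
          for r in range(k):
              head = z[r - 1] if r > 0 else np.inf                   # |w_(r)|
              avg_tail = tails[r] / (k - r)
              if head > avg_tail >= z[r]:                            # z[r] is |w_(r+1)|
                  return np.sqrt(np.sum(z[:r] ** 2) + tails[r] ** 2 / (k - r))
          return np.sqrt(np.sum(z[:k] ** 2))                         # unreachable for valid input

    The spectral (matrix) case applies the same function to the singular values, e.g. k_support_norm(np.linalg.svd(W, compute_uv=False), k).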


  • Extension: The (k, p)-Support Norm

    Fit the curvature of the underlying model.

    - For p ∈ [1, ∞] define the vector (k, p)-support norm by its unit ball

      co{w : card(w) ≤ k, ‖w‖_p ≤ 1}.

    - The dual norm is the ℓ_q-norm of the k largest components (1/p + 1/q = 1).

    (Work in progress.)


  • The Spectral (k, p)-Support Norm

    Fit the curvature of the underlying spectrum.

    - For p ∈ [1, ∞] the spectral (k, p)-support unit ball is defined in terms of the Schatten p-norm:

      co{W : rank(W) ≤ k, ‖W‖_p ≤ 1}.

    - Von Neumann again: ‖W‖_(k,p) = ‖σ(W)‖_(k,p).


  • Optimization

    - For p ∈ (1, ∞) the (k, p)-support norm can be computed as

      ‖w‖_(k,p)^p = ‖w↓_[1:r]‖_p^p + (1/(k − r)^(p/q)) ‖w↓_(r:d]‖_1^p.

    - For p = 1 we recover the ℓ_1 norm for all k, and for p = ∞ we have

      ‖w‖_(k,∞) = max( ‖w‖_∞, (1/k) ‖w‖_1 ).

    - For p ∈ (1, ∞), we solve the constrained problem

      argmin_s { ⟨s, ∇ℓ(w)⟩ : ‖s‖_(k,p) ≤ α }

      (a code sketch of this linear subproblem follows this slide).

    - For p = ∞ we can compute the projection onto the unit ball: can use proximal gradient methods.
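
    If the constrained problem above is tackled with a conditional gradient (Frank-Wolfe) scheme, the linear subproblem has a closed-form solution, because the ball is the convex hull of k-sparse vectors with ℓ_p norm at most α. A sketch of that step for the vector case (the function name and its use inside a Frank-Wolfe loop are my own assumptions, not from the talk):

      import numpy as np

      def kp_support_lmo(grad, k, p, alpha):
          # argmin_s <s, grad>  subject to  ||s||_(k,p) <= alpha, for 1 < p < inf.
          # The minimizer is an atom of the ball: supported on the k largest-magnitude
          # entries of grad, aligned against grad with Hoelder equality for the l_q norm
          # (1/p + 1/q = 1). Assumes grad is nonzero on that support.
          q = p / (p - 1.0)
          support = np.argsort(np.abs(grad))[-k:]
          g = grad[support]
          s = np.zeros_like(grad, dtype=float)
          s[support] = -alpha * np.sign(g) * np.abs(g) ** (q - 1.0) / np.linalg.norm(g, q) ** (q - 1.0)
          return s

    A Frank-Wolfe iteration would then move the current iterate toward kp_support_lmo(∇ℓ(w), k, p, α), e.g. with step size 2/(t + 2).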


  • Experiments: Matrix Completion

    Benchmark datasets: MovieLens (movies), Jester (jokes)

    dataset          norm             test error   k      p
    MovieLens 100k   trace            0.2017       -      -
                     k-support        0.1990       1.87   -
                     (k,p)-support    0.1988       2.00   1.16
    Jester 1         trace            0.1752       -      -
                     k-support        0.1739       6.38   -
                     (k,p)-support    0.1731       2.00   6.50
    Jester 3         trace            0.1959       -      -
                     k-support        0.1942       2.13   -
                     (k,p)-support    0.1932       3.00   1.14

    Note: k = 1 is trace norm, p = 2 is spectral k-support norm.


  • Role of p in the (k, p)-Support Norm

    Spectral (k, p)-support norm:

    - Intuition: for large p, the ℓ_p norm of a vector is increasingly dominated by the largest components.

    - Regularization with larger values of p encourages matrices with a flatter spectrum.

    Spectrum of synthetic rank 5 matrix with different regularizers:

    [Plot: spectrum (singular values) of the estimated matrix under each regularizer, for p = 1, 5, 50, ∞.]

  • Extension: Connection to the Cluster Norm

    - Using the infimum formulation

      ‖w‖_box = inf_{θ ∈ Θ} √( ∑_{i=1}^d w_i² / θ_i ),   Θ = {θ : a < θ_i ≤ b, ∑_i θ_i ≤ c}.

    - The box norm is a perturbation of the k-support norm (with k = (c − da)/(b − a)):

      ‖w‖²_box = min_{u + v = w} { (1/a) ‖u‖_2² + (1/(b − a)) ‖v‖²_(k) }

      (a code sketch of the infimum formulation follows this slide).

    - Matrix case: we recover the cluster norm [Jacob, Bach & Vert] used in multitask learning.
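
    A purely illustrative way to evaluate the infimum formulation above with a generic solver (sketched below; this is not the closed-form computation from the paper, and it assumes scipy is available and that d·a ≤ c ≤ d·b so the constraint set is nonempty):

      import numpy as np
      from scipy.optimize import minimize

      def box_norm(w, a, b, c):
          # ||w||_box via  min_theta sum_i w_i^2 / theta_i  s.t.  a <= theta_i <= b, sum_i theta_i <= c.
          w = np.asarray(w, dtype=float)
          d = len(w)
          res = minimize(
              lambda theta: np.sum(w ** 2 / theta),
              x0=np.full(d, min(b, c / d)),                # a feasible starting point
              bounds=[(a, b)] * d,
              constraints=[{"type": "ineq", "fun": lambda theta: c - np.sum(theta)}],
              method="SLSQP",
          )
          return np.sqrt(res.fun)

    With a small, b = 1 and c = k, this should reproduce the k-support norm value (up to solver tolerance), consistent with the perturbation view on this slide.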


  • Role of a, c in Box Norm

    Simulated datasets:

    [Plot: learned value of a versus SNR.] Low signal/high noise: high a.

    [Plot: learned value of k versus true rank.] High rank of underlying matrix: high k.

  • Further Work

    - Statistical bounds on the performance of the norms: various results known. [Chatterjee, Chen & Banerjee 2014; Maurer & Pontil 2012; Richard, Obozinski & Vert 2014]

    - Infimum formulation of the (k, p)-support norm: known for p ∈ [1, 2], unclear for p ∈ (2, ∞].

    - Study the family of norms for a general choice of the parameter set Θ. [Micchelli, Morales & Pontil 2013]


  • Conclusion

    - Spectral k-support norm as a regularizer for low-rank matrix learning

    - Spectral (k, p)-support norm allows us to learn the curvature of the spectrum

    - Box norm as a perturbation of the k-support norm

    - Connection to the cluster norm from multitask learning


  • References

    [1] A. Argyriou, R. Foygel, and N. Srebro. Sparse prediction with the k-support norm. In Advances in Neural Information Processing Systems 25, pages 1466–1474, 2012.

    [2] S. Chatterjee, S. Chen, and A. Banerjee. Generalized Dantzig selector: application to the k-support norm. In Advances in Neural Information Processing Systems (NIPS), 2014.

    [3] M. Fazel, H. Hindi, and S. P. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the American Control Conference, 2001.

    [4] L. Jacob, F. Bach, and J.-P. Vert. Clustered multi-task learning: a convex formulation. In Advances in Neural Information Processing Systems (NIPS 21), 2009.

    [5] A. Maurer and M. Pontil. Structured sparsity and generalization. Journal of Machine Learning Research, 13:671–690, 2012.

    [6] A. M. McDonald, M. Pontil, and D. Stamos. Spectral k-support norm regularization. In Advances in Neural Information Processing Systems 27, 2014.

    [7] C. A. Micchelli, J. M. Morales, and M. Pontil. Regularizers for structured sparsity. Advances in Computational Mathematics, 38:455–489, 2013.

    [8] E. Richard, G. Obozinski, and J.-P. Vert. Tight convex relaxations for sparse matrix factorization. In Advances in Neural Information Processing Systems (NIPS), 2014.

    [9] J. von Neumann. Some matrix-inequalities and metrization of matric-space. Tomsk. Univ. Rev., Vol. I, 1937.
