Spectral k-Support Norm Regularization
Andrew McDonald
Department of Computer Science, UCL
(Joint work with Massimiliano Pontil and Dimitris Stamos)
25 March, 2015
Problem: Matrix Completion
Goal: Recover a matrix from a subset of measurements.
e.g. a random sample of 10% of the entries
Applications
- Collaborative filtering: predict the interests of a user from the preferences of many users (e.g. the Netflix problem)
- Triangulation of distances from an incomplete network (e.g. a wireless network)
- Multitask learning: leverage commonalities between multiple learning tasks
- ...
Problem Statement
Given a subset Ω of observed entries of a matrix X, estimate the missing entries.
(i) The problem is ill-posed → assume X is low rank and use a regularizer to encourage low-rank structure.
(ii) Regularization with the rank operator is NP-hard → use a convex approximation, e.g. the trace norm (sum of singular values):

min_W ‖Ω(W) − Ω(X)‖²_F + λ‖W‖_tr
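As a concrete illustration of this objective, here is a minimal numpy sketch of proximal gradient descent, where the proximal map of the trace norm soft-thresholds the singular values (singular value thresholding). Function names, step size, and λ are illustrative assumptions, not values from the talk.

```python
import numpy as np

def svt(W, tau):
    """Proximal map of tau * trace norm: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ (np.maximum(s - tau, 0.0)[:, None] * Vt)

def complete(X_obs, mask, lam, step=0.25, iters=500):
    """Proximal gradient for min_W ||mask*(W - X_obs)||_F^2 + lam*||W||_tr."""
    W = np.zeros_like(X_obs)
    for _ in range(iters):
        grad = 2.0 * mask * (W - X_obs)       # gradient of the data-fit term
        W = svt(W - step * grad, step * lam)  # gradient step, then prox step
    return W

# Illustrative usage: rank-2 ground truth, ~30% of entries observed.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 50))
mask = (rng.random(X.shape) < 0.3).astype(float)
W_hat = complete(mask * X, mask, lam=5.0)
```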
Trace Norm Regularization
- The trace norm is the tightest convex relaxation of the rank operator on the spectral norm unit ball. [Fazel, Hindi & Boyd 2001]
- The optimization problem can be solved efficiently using proximal gradient methods.
- It can be shown that this method recovers the true underlying matrix with high probability.

Goal: can we improve on this performance using other regularizers?
The Vector k-Support Norm
- The k-support norm is a regularizer used in sparse vector estimation problems. [Argyriou, Foygel & Srebro 2012]
- For k ∈ {1, …, d}, the unit ball is:

  co{w : card(w) ≤ k, ‖w‖₂ ≤ 1}.

- Includes ‖·‖₁ (k = 1) and ‖·‖₂ (k = d).
- The dual norm is the ℓ₂-norm of the k largest components of a vector.
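As a quick illustration (the function name is my own), the dual norm is easy to evaluate directly:

```python
import numpy as np

def ksupport_dual(w, k):
    """Dual of the k-support norm: l2-norm of the k largest-magnitude entries."""
    return np.linalg.norm(np.sort(np.abs(w))[::-1][:k])

# Consistent with the special cases above: k = 1 gives the max-norm
# (dual of l1), and k = d gives the l2-norm (which is self-dual).
print(ksupport_dual(np.array([3.0, -4.0, 1.0]), 1))  # 4.0
```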
Vector k-Support Unit Balls
- Unit balls in ℝ³ (k = 1, 2, 3)
- Convex hull interpretation
The Spectral k-Support Norm
Extend the k-support norm to matrices.
- The k-support norm is a symmetric gauge function, so it induces the spectral k-support norm [von Neumann 1937]:

  ‖W‖_(k) = ‖(σ₁(W), …, σ_d(W))‖_(k)

- The unit ball is given by

  co{W : rank(W) ≤ k, ‖W‖_F ≤ 1}.

- Includes ‖·‖_tr (k = 1) and ‖·‖_F (k = d).
[McDonald, Pontil & Stamos 2014]
Optimization
- The k-support norm can be written as

  ‖w‖_(k) = inf_{θ∈Θ} √(Σ_{i=1}^d w_i²/θ_i),  Θ = {θ : 0 < θ_i ≤ 1, Σ_i θ_i ≤ k}.

- The infimum is coordinate-wise separable and can be solved using Lagrange multipliers.
- The norm can be computed in O(d log d) time as

  ‖w‖²_(k) = ‖w↓_[1:r]‖₂² + (1/(k−r)) ‖w↓_(r:d]‖₁²,

  where w↓ denotes w with components sorted by decreasing magnitude and r ∈ {0, …, k−1} is determined by the sorted components.
- A similar computation gives the proximity operator of the squared norm, so proximal gradient methods can be used to solve the optimization problem.
- The matrix case follows using the SVD.
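A minimal numpy sketch of this computation follows. The search for the split point r implements the characterization from Argyriou, Foygel & Srebro (2012): r is the unique integer in {0, …, k−1} for which the scaled tail sum falls between the r-th and (r+1)-th largest magnitudes; the exact boundary handling here is my assumption rather than code from the talk.

```python
import numpy as np

def ksupport_norm(w, k):
    """k-support norm via the sorted-components formula (O(d log d))."""
    z = np.sort(np.abs(w))[::-1]                      # decreasing magnitudes
    tails = np.append(np.cumsum(z[::-1])[::-1], 0.0)  # tails[i] = sum of z[i:]
    for r in range(k):
        t = tails[r] / (k - r)                        # scaled tail sum
        if (np.inf if r == 0 else z[r - 1]) > t >= z[r]:
            return np.sqrt(np.sum(z[:r] ** 2) + tails[r] ** 2 / (k - r))
    return np.sqrt(np.sum(z[:k - 1] ** 2) + tails[k - 1] ** 2)  # boundary case

def spectral_ksupport_norm(W, k):
    """Matrix case: apply the vector norm to the singular values."""
    return ksupport_norm(np.linalg.svd(W, compute_uv=False), k)

# Sanity checks: k = 1 gives the l1-norm, k = d gives the l2-norm.
w = np.array([3.0, -2.0, 1.0])
print(ksupport_norm(w, 1), np.abs(w).sum())    # both 6.0
print(ksupport_norm(w, 3), np.linalg.norm(w))  # both sqrt(14)
```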
Extension: The (k, p)-Support Norm
Fit the curvature of the underlying model.
- For p ∈ [1, ∞], define the vector (k, p)-support norm by its unit ball

  co{w : card(w) ≤ k, ‖w‖_p ≤ 1}.

- The dual norm is the ℓ_q-norm of the k largest components (1/p + 1/q = 1).
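The dual is again a one-liner, generalizing the earlier k-support dual sketch (the name and the handling of q are my own):

```python
import numpy as np

def kp_support_dual(w, k, q):
    """Dual of the (k, p)-support norm: l_q-norm of the k largest-magnitude
    entries, with 1/p + 1/q = 1 (pass q = np.inf for p = 1)."""
    return np.linalg.norm(np.sort(np.abs(w))[::-1][:k], ord=q)
```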
(Work in progress.)
The Spectral (k, p)-Support Norm
Fit the curvature of the underlying spectrum.
- For p ∈ [1, ∞], the spectral (k, p)-support unit ball is defined in terms of the Schatten p-norm:

  co{W : rank(W) ≤ k, ‖W‖_p ≤ 1}.

- Von Neumann again: ‖W‖_(k,p) = ‖σ(W)‖_(k,p).
Optimization
- For p ∈ (1, ∞) the (k, p)-support norm can be computed as

  ‖w‖^p_(k,p) = ‖w↓_[1:r]‖_p^p + (1/(k−r)^{p/q}) ‖w↓_(r:d]‖₁^p.

- For p = 1 we recover the ℓ₁ norm for all k, and for p = ∞ we have

  ‖w‖_(k,∞) = max(‖w‖_∞, (1/k)‖w‖₁).

- For p ∈ (1, ∞), we solve the constrained problem

  argmin_s {⟨s, ∇ℓ(w)⟩ : ‖s‖_(k,p) ≤ α}.

- For p = ∞ we can compute the projection onto the unit ball, so proximal gradient methods can be used.
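The p = ∞ member of the family is especially cheap to evaluate; a minimal sketch of the max formula above (the function name is mine):

```python
import numpy as np

def kinf_support_norm(w, k):
    """(k, infinity)-support norm: max of the sup-norm and the l1-norm / k."""
    a = np.abs(w)
    return max(a.max(), a.sum() / k)

# k = 1 recovers the l1-norm; k = d recovers the sup-norm.
```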
Experiments: Matrix Completion
Benchmark datasets: MovieLens (movies), Jester (jokes)
dataset          norm            test error   k      p
MovieLens 100k   trace           0.2017       -      -
                 k-support       0.1990       1.87   -
                 (k,p)-support   0.1988       2.00   1.16
Jester 1         trace           0.1752       -      -
                 k-support       0.1739       6.38   -
                 (k,p)-support   0.1731       2.00   6.50
Jester 3         trace           0.1959       -      -
                 k-support       0.1942       2.13   -
                 (k,p)-support   0.1932       3.00   1.14
Note: k = 1 recovers the trace norm, and p = 2 recovers the spectral k-support norm.
Role of p in the (k, p)-Support Norm
Spectral (k , p)-support norm:
- Intuition: for large p, the ℓ_p norm of a vector is increasingly dominated by its largest components.
- Regularization with larger values of p encourages matrices with a flatter spectrum, as the quick check below illustrates.
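A quick numeric check of this intuition (the example vector is illustrative):

```python
import numpy as np

w = np.array([10.0, 1.0, 1.0, 1.0])
for p in [1, 2, 5, 50]:
    print(p, np.linalg.norm(w, ord=p))
# 13.0, 10.149..., 10.00006..., ~10.0: the largest entry dominates as p grows.
```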
Spectrum of a synthetic rank-5 matrix under different regularizers:

[Figure: recovered spectra for p = 1, 5, 50, ∞; larger p yields a flatter spectrum.]
Extension: Connection to the Cluster Norm
- Using the infimum formulation, define the box norm

  ‖w‖_box = inf_{θ∈Θ} √(Σ_{i=1}^d w_i²/θ_i),  Θ = {θ : a < θ_i ≤ b, Σ_i θ_i ≤ c}.

- The box norm is a perturbation of the k-support norm, with k = (c − da)/(b − a):

  ‖w‖²_box = min_{u+v=w} { (1/a)‖u‖₂² + (1/(b−a))‖v‖²_(k) }.

- Matrix case: we recover the cluster norm [Jacob, Bach & Vert 2009] used in multitask learning.
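As a numerical sanity check of the infimum formulation, one can hand the θ-problem to a generic constrained solver. This is a rough sketch under my own assumptions (a > 0, the strict bound a < θ_i relaxed to a ≤ θ_i, and d·a ≤ c so the problem is feasible), not the method used in practice.

```python
import numpy as np
from scipy.optimize import minimize

def box_norm(w, a, b, c):
    """Numerically evaluate inf over {a <= theta_i <= b, sum(theta) <= c}
    of sqrt(sum(w_i^2 / theta_i))."""
    d = len(w)
    res = minimize(lambda th: np.sum(w ** 2 / th),
                   x0=np.full(d, min(b, c / d)),  # feasible starting point
                   bounds=[(a, b)] * d,
                   constraints=[{"type": "ineq",
                                 "fun": lambda th: c - th.sum()}])
    return np.sqrt(res.fun)

# As a -> 0 with b = 1 and c = k, this approaches the k-support norm,
# matching the correspondence k = (c - da)/(b - a) above.
```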
Role of a, c in Box Norm
Simulated datasets:
[Figure: learned value of a vs. SNR. Low signal/high noise: high a.]

[Figure: learned value of k vs. true rank. High rank of underlying matrix: high k.]
Further Work
- Statistical bounds on the performance of the norms: various results are known. [Chatterjee, Chen & Banerjee 2014, Maurer & Pontil 2012, Richard, Obozinski & Vert 2014]
- Infimum formulation of the (k, p)-support norm: known for p ∈ [1, 2], unclear for p ∈ (2, ∞].
- Study the family of norms for a general choice of the parameter set Θ. [Micchelli, Morales & Pontil 2013]
Conclusion
- The spectral k-support norm as a regularizer for low-rank matrix learning
- The spectral (k, p)-support norm allows us to learn the curvature of the spectrum
- The box norm as a perturbation of the k-support norm
- Connection to the cluster norm used in multitask learning
References

[1] A. Argyriou, R. Foygel, and N. Srebro. Sparse prediction with the k-support norm. In Advances in Neural Information Processing Systems 25, pages 1466–1474, 2012.
[2] S. Chatterjee, S. Chen, and A. Banerjee. Generalized Dantzig selector: application to the k-support norm. In Advances in Neural Information Processing Systems 27, 2014.
[3] M. Fazel, H. Hindi, and S. P. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the American Control Conference, 2001.
[4] L. Jacob, F. Bach, and J.-P. Vert. Clustered multi-task learning: a convex formulation. In Advances in Neural Information Processing Systems 21, 2009.
[5] A. Maurer and M. Pontil. Structured sparsity and generalization. Journal of Machine Learning Research, 13:671–690, 2012.
[6] A. M. McDonald, M. Pontil, and D. Stamos. Spectral k-support norm regularization. In Advances in Neural Information Processing Systems 27, 2014.
[7] C. A. Micchelli, J. M. Morales, and M. Pontil. Regularizers for structured sparsity. Advances in Computational Mathematics, 38:455–489, 2013.
[8] E. Richard, G. Obozinski, and J.-P. Vert. Tight convex relaxations for sparse matrix factorization. In Advances in Neural Information Processing Systems 27, 2014.
[9] J. von Neumann. Some matrix-inequalities and metrization of matric-space. Tomsk University Review, Vol. I, 1937.