
Page 1

Peter Richtárik

Coordinate Descent Methods with Arbitrary Sampling

Optimization and Statistical Learning – Les Houches – France – January 11-16, 2015

Page 2

Page 3

Papers & Coauthors

Zheng Qu

Zheng Qu, P.R. and Tong Zhang. Randomized dual coordinate ascent with arbitrary sampling. arXiv:1411.5873, 2014.

P.R. and Martin Takáč. On optimal probabilities in stochastic coordinate descent methods. In NIPS Workshop on Optimization for Machine Learning, 2013 (arXiv:1310.3438).

Zheng Qu and P.R. Coordinate descent with arbitrary sampling I: algorithms and complexity. arXiv:1412.8060, 2014.

Zheng Qu and P.R. Coordinate descent with arbitrary sampling II: expected separable overapproximation. arXiv:1412.8063, 2014.

Martin Takáč Tong Zhang

Page 4

Warmup

Page 5

Part A: NSync

P.R. and Martin Takáč. On optimal probabilities in stochastic coordinate descent methods. In NIPS Workshop on Optimization for Machine Learning, 2013 (arXiv:1310.3438).

Page 6

Problem

Minimize \(f(x)\) over \(x \in \mathbb{R}^n\), where \(f\) is smooth and strongly convex.

Page 7

NSync

i.i.d. with arbitrary distribution
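
The slide names the method but its update rule is not reproduced here. Below is a minimal sketch of an NSync-style iteration under the rule I recall from arXiv:1310.3438 (each sampled coordinate \(i\) is moved by \(-\nabla_i f(x)/v_i\), with \(v_i\) the parameters from the Key Assumption on the next slide); the callables grad and sample_S are hypothetical placeholders.

import numpy as np

def nsync(grad, sample_S, v, x0, iters=1000):
    """Minimal NSync-style sketch (my reading of arXiv:1310.3438).

    grad(x)     -- full gradient oracle; only the sampled entries are used
    sample_S(t) -- returns a random subset of {0, ..., n-1}, i.i.d. across t
    v           -- ESO parameters; coordinate i uses stepsize 1/v[i]
    """
    x = np.asarray(x0, dtype=float).copy()
    for t in range(iters):
        S = sample_S(t)              # arbitrary distribution, i.i.d. in t
        g = grad(x)                  # in practice only g[i], i in S, is needed
        for i in S:
            x[i] -= g[i] / v[i]      # independent per-coordinate steps
    return x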

Page 8

Key Assumption

The inequality must hold for all \(x\) and \(h\).
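
The displayed inequality is not reproduced here; the assumption I recall from the NSync paper is an expected separable overapproximation (ESO) of the following form, where \(S \sim \hat S\), \(p_i = \mathbf{P}(i \in S)\), and \(h_{[S]}\) agrees with \(h\) on \(S\) and is zero elsewhere:

\[\mathbf{E}\Big[f\big(x + h_{[S]}\big)\Big] \;\le\; f(x) \;+\; \sum_{i=1}^n p_i \,\nabla_i f(x)\, h_i \;+\; \frac{1}{2}\sum_{i=1}^n p_i v_i h_i^2 .\]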

Page 9

Complexity Theorem

\(\lambda\): strong convexity constant of \(f\)
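
The statement itself is not reproduced here; a bound of the following form is how I recall the NSync result (a sketch, not the paper's exact wording): if \(f\) is \(\lambda\)-strongly convex and the Key Assumption holds, then

\[k \;\ge\; \max_{i}\,\frac{v_i}{p_i\,\lambda}\;\log\!\left(\frac{f(x^0)-f^\star}{\epsilon}\right)
\quad\Longrightarrow\quad
\mathbf{E}\big[f(x^k)\big]-f^\star \;\le\; \epsilon .\]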

Page 10

Proof

Copy-paste from the paper

Page 11

Uniform vs Optimal Sampling
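
To make the comparison concrete under the hedged bound above, restricted to serial samplings (\(|S| = 1\), so the \(p_i\) sum to 1), the leading factor becomes

\[p_i = \tfrac{1}{n}: \quad \frac{n\,\max_i v_i}{\lambda},
\qquad\qquad
p_i = \frac{v_i}{\sum_j v_j}: \quad \frac{\sum_i v_i}{\lambda},\]

so the optimal (importance) sampling replaces \(n \max_i v_i\) by \(\sum_i v_i\), a gain of up to a factor of \(n\) when the \(v_i\) are very nonuniform.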

Page 12

Two-level sampling

Definition of a parametric family of random subsets of {1, 2, …, n} of fixed cardinality (an illustrative sketch follows the steps below):

STEP 0:

STEP 1:

STEP 2:
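
The bodies of STEP 0-2 are not reproduced above, so the code below is only an illustration of the general idea of a two-level sampling with fixed cardinality \(\tau\), not a reconstruction of the slide's construction: a seed coordinate is drawn from an arbitrary distribution q (level 1) and the set is completed uniformly at random (level 2), so the marginal probabilities can be tuned while \(|S| = \tau\) stays fixed. Everything in it (q, tau, the function name) is hypothetical.

import numpy as np

def two_level_sampling(q, tau, rng=None):
    """Illustrative two-level sampling with fixed cardinality tau.

    Level 1: pick a 'seed' coordinate i with (arbitrary) probability q[i].
    Level 2: complete the set with tau - 1 further coordinates drawn
             uniformly without replacement from the remaining ones.
    This is NOT claimed to be the construction on the slide; it only shows
    how a parametric family with |S| = tau can be built in two stages.
    """
    rng = rng or np.random.default_rng()
    n = len(q)
    seed = int(rng.choice(n, p=q))
    rest = np.delete(np.arange(n), seed)
    fill = rng.choice(rest, size=tau - 1, replace=False)
    return {seed} | set(fill.tolist())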

Page 13

Part B: ALPHA

Zheng Qu and P.R. Coordinate descent with arbitrary sampling I: algorithms and complexity. arXiv:1412.8060.

Page 14

Problem

Smooth & convex term + convex term (composite objective)

Page 15

ALPHA (for smooth minimization)

STEP 0:

STEP 1:

STEP 2:

\[z^{t+1}_{{\color{blue}i}} \leftarrow z^t_{\color{blue}i} - \frac{\color{blue}p_i}{{\color{red}v_i}\theta_t} \nabla_{\color{blue}i} f(y^t)\]

STEP 3:

i.i.d. random subsets of coordinates (any distribution allowed)

Same as in NSync
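
Only the STEP 2 formula survives above. The sketch below fills in STEP 0, 1 and 3 and the \(\theta_t\) update with the APPROX-style recursions that I believe ALPHA generalizes (arXiv:1412.8060); those filled-in parts, and the callables grad and sample_S, are assumptions for illustration, while the z-update is the one displayed above.

import numpy as np

def alpha_smooth(grad, sample_S, p, v, x0, theta0=0.5, iters=1000):
    """Hedged sketch of ALPHA for smooth minimization (arXiv:1412.8060).

    The z-update (STEP 2) is the one displayed on the slide; the y-, x- and
    theta-updates are my APPROX-style guesses and may differ from the paper.
    grad, sample_S, p (marginal probabilities) and v (ESO parameters) are
    user-supplied placeholders.
    """
    x = np.asarray(x0, dtype=float).copy()
    z = x.copy()
    theta = theta0
    for t in range(iters):
        y = (1.0 - theta) * x + theta * z            # STEP 1 (assumed)
        S = sample_S(t)                              # i.i.d., arbitrary law
        g = grad(y)                                  # only g[i], i in S, needed
        x = y.copy()
        for i in S:
            dz = -(p[i] / (v[i] * theta)) * g[i]     # STEP 2 (from the slide)
            z[i] += dz
            x[i] = y[i] + (theta / p[i]) * dz        # STEP 3 (assumed)
        theta = (np.sqrt(theta**4 + 4 * theta**2) - theta**2) / 2   # assumed recursion
    return x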

Page 16

Complexity Theorem

Arbitrary point

Same as in NSync

Page 17

Part C

PRIMAL-DUAL FRAMEWORK

Zheng Qu, P.R. and Tong Zhang. Randomized dual coordinate ascent with arbitrary sampling. arXiv:1411.5873.

Page 18

Primal Problem

d = # features (parameters)

n = # samples

\(g\): 1-strongly convex function (regularizer)

\(\phi_i\): smooth & convex loss functions

\(\lambda\): regularization parameter
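
The annotations above label the ingredients of the primal problem, which (as given in the Quartz paper, arXiv:1411.5873) is the regularized empirical risk minimization problem

\[\min_{w \in \mathbb{R}^d} \; P(w) \;:=\; \frac{1}{n}\sum_{i=1}^{n} \phi_i\!\big(A_i^\top w\big) \;+\; \lambda\, g(w),\]

with data columns \(A_1, \dots, A_n \in \mathbb{R}^d\).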

Page 19

Assumption 1

Loss functions have Lipschitz gradient

Lipschitz constant \(1/\gamma\)
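
Written out (with the constant named \(1/\gamma\), which is how I recall the paper's convention):

\[\big|\nabla \phi_i(a) - \nabla \phi_i(b)\big| \;\le\; \tfrac{1}{\gamma}\,|a-b|
\qquad \text{for all } a,b \in \mathbb{R},\; i=1,\dots,n.\]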

Page 20

Assumption 2

Regularizer is 1-strongly convex

subgradient
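
Written out (the subgradient annotation indicates that \(g\) need not be differentiable):

\[g(w') \;\ge\; g(w) + \langle s,\, w'-w\rangle + \tfrac{1}{2}\,\|w'-w\|^2
\qquad \text{for all } w, w' \in \mathbb{R}^d \text{ and } s \in \partial g(w).\]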

Page 21

Dual Problem

\(\phi_i^*\): \(\gamma\)-strongly convex

\(g^*\): 1-smooth & convex
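
As I recall it from the Quartz paper, the dual problem is

\[\max_{\alpha \in \mathbb{R}^{n}} \; D(\alpha) \;:=\; -\frac{1}{n}\sum_{i=1}^{n} \phi_i^{*}(-\alpha_i) \;-\; \lambda\, g^{*}\!\left(\frac{1}{\lambda n}\sum_{i=1}^{n} A_i \alpha_i\right),\]

where \(\phi_i^*\) and \(g^*\) are convex (Fenchel) conjugates; Assumptions 1 and 2 are what make \(\phi_i^*\) \(\gamma\)-strongly convex and \(g^*\) 1-smooth.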

Page 22

C.1 ALGORITHM

Page 23

Quartz

Page 24

Fenchel Duality

Weak duality

Optimality conditions
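
The two bullets refer to the following standard relations, stated here as I recall them for this primal-dual pair:

\[P(w) \;\ge\; D(\alpha) \quad \text{for all } w, \alpha \qquad \text{(weak duality)},\]

\[w^\star = \nabla g^{*}\!\left(\frac{1}{\lambda n}\sum_{i=1}^{n} A_i \alpha_i^\star\right),
\qquad
\alpha_i^\star = -\nabla \phi_i\!\big(A_i^\top w^\star\big) \qquad \text{(optimality conditions)}.\]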

Page 25

The Algorithm

Page 26

Quartz: Bird’s Eye View

STEP 1: PRIMAL UPDATE

STEP 2: DUAL UPDATE

Page 27

The Algorithm

STEP 1

STEP 2

Convex combination constant
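
A compact sketch of the two steps, based on my reading of the Quartz paper: the primal update is a convex combination (with the constant \(\theta\)) of the current \(w\) and \(\nabla g^*\big(\tfrac{1}{\lambda n} A\alpha\big)\), and the dual update moves each sampled \(\alpha_i\) a \(\theta/p_i\) fraction of the way towards the maximizer of its dual term. The callables phi_grad, g_conj_grad and sample_S are placeholders, and the exact form of the dual step is my recollection rather than a verbatim transcription of the slide.

import numpy as np

def quartz(A, phi_grad, g_conj_grad, sample_S, p, lam, theta, iters=1000):
    """Hedged sketch of Quartz (arXiv:1411.5873), scalar losses (m = 1).

    A                -- d x n data matrix, column A[:, i] = sample i
    phi_grad(i, a)   -- derivative of the loss phi_i at the scalar a
    g_conj_grad(u)   -- gradient of g*, the conjugate of the regularizer
    sample_S(t)      -- random subset of {0, ..., n-1} (arbitrary sampling)
    p[i]             -- P(i in S); theta is the convex combination constant
    """
    d, n = A.shape
    alpha = np.zeros(n)
    abar = A @ alpha / (lam * n)                 # keeps (1/(lam*n)) * A @ alpha
    w = g_conj_grad(abar)
    for t in range(iters):
        # STEP 1: primal update, a convex combination with constant theta
        w = (1.0 - theta) * w + theta * g_conj_grad(abar)
        # STEP 2: dual update on the sampled coordinates only
        for i in sample_S(t):
            target = -phi_grad(i, A[:, i] @ w)   # maximizer of the i-th dual term
            step = (theta / p[i]) * (target - alpha[i])
            alpha[i] += step
            abar += A[:, i] * (step / (lam * n))
    return w, alpha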

Page 28

Randomized Primal-Dual Methods

SDCA: S. Shalev-Shwartz & T. Zhang, 09/2012
mSDCA: M. Takáč, A. Bijral, P.R. & N. Srebro, 03/2013
ASDCA: S. Shalev-Shwartz & T. Zhang, 05/2013
AccProx-SDCA: S. Shalev-Shwartz & T. Zhang, 10/2013
DisDCA: T. Yang, 2013
Iprox-SDCA: P. Zhao & T. Zhang, 01/2014
APCG: Q. Lin, Z. Lu & L. Xiao, 07/2014
SPDC: Y. Zhang & L. Xiao, 09/2014
Quartz: Z. Qu, P.R. & T. Zhang, 11/2014

Page 29

C.2 MAIN RESULT

Page 30

Assumption 3 (Expected Separable Overapproximation)

The inequality must hold for all \(h\).
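
Spelled out (scalar case \(m = 1\), with \(S \sim \hat S\) and \(p_i = \mathbf{P}(i \in S)\)), the assumption I recall is

\[\mathbf{E}\left[\Big\|\sum_{i \in S} A_i h_i\Big\|^2\right] \;\le\; \sum_{i=1}^{n} p_i\, v_i\, h_i^2
\qquad \text{for all } h \in \mathbb{R}^{n}.\]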

Page 31

Complexity Theorem (QRZ’14)
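
As a hedged reconstruction (the rate I recall, not a verbatim copy of the slide): under Assumptions 1-3, Quartz run with a suitable \(\theta\) satisfies

\[t \;\ge\; \max_{i}\left(\frac{1}{p_i} + \frac{v_i}{p_i\,\lambda\,\gamma\, n}\right)
\log\!\left(\frac{P(w^0)-D(\alpha^0)}{\epsilon}\right)
\quad\Longrightarrow\quad
\mathbf{E}\big[P(w^t)-D(\alpha^t)\big] \;\le\; \epsilon .\]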

Page 32

C.3 UPDATING ONE DUAL VARIABLE AT A TIME

Page 33

Complexity of Quartz specialized to serial sampling
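
Specializing the hedged bound above to serial samplings (\(|S| = 1\)): uniform probabilities give the familiar SDCA-type quantity, while the probabilities minimizing the factor are data-dependent,

\[p_i = \tfrac{1}{n}: \quad n + \frac{\max_i v_i}{\lambda\gamma},
\qquad\qquad
p_i \propto 1 + \frac{v_i}{\lambda\gamma n}: \quad n + \frac{\tfrac{1}{n}\sum_i v_i}{\lambda\gamma},\]

so the optimal serial sampling replaces the maximum of the \(v_i\) by their average.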

Page 34

Data

Page 35

Standard primal update

Experiment: Quartz vs SDCA, uniform vs optimal sampling

“Aggressive” primal update

Page 36

C.4 TAU-NICE SAMPLING

(STANDARD MINIBATCHING)

Page 37

Data sparsity

A normalized measure of average sparsity of the data

“Fully sparse data” “Fully dense data”

Page 38

Complexity of Quartz

Page 39

Speedup

Assume the data is normalized:

Then:

Linear speedup up to a certain data-independent minibatch size:

Further data-dependent speedup, up to the extreme case:
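
Making the bullets above concrete under the hedged Quartz bound, with the \(\tau\)-nice sampling (\(p_i = \tau/n\)) and normalized data (\(\|A_i\| = 1\)), the leading factor takes, up to the exact definition of \(\tilde\omega\) (the sparsity measure from the earlier slide), the form

\[\frac{n}{\tau} \;+\; \frac{1 + \frac{(\tilde\omega - 1)(\tau-1)}{n-1}}{\tau\,\lambda\,\gamma}.\]

Since the second term never exceeds \(1/(\lambda\gamma)\), the factor shrinks linearly in \(\tau\) as long as \(n/\tau\) dominates, i.e. up to \(\tau \approx \lambda\gamma n\), independently of the data; for sparse data (\(\tilde\omega\) close to 1) the second term itself also shrinks like \(1/\tau\), giving further speedup, and in the extreme case \(\tilde\omega = 1\) the speedup is linear for every \(\tau \le n\).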

Page 40

Speedup: sparse data

Page 41

Speedup: denser data

Page 42

Speedup: fully dense data

Page 43

astro_ph: n = 29,882, density = 0.08%

Page 44

CCAT: n = 781,265, density = 0.16%

Page 45

Primal-dual methods with tau-nice sampling

S. Shalev-Shwartz & T. Zhang '13

S. Shalev-Shwartz & T. Zhang '13

Y. Zhang & L. Xiao '14

Page 46

For sufficiently sparse data, Quartz wins even when compared against accelerated methods


Page 47

GOTTA END HERE

Page 48

C.5 DISTRIBUTED QUARTZ

Page 49

Distributed Quartz: Perform the Dual Updates in a Distributed Manner

Quartz STEP 2: DUAL UPDATE

Data required to compute the update

Page 50

Distribution of Data

n = # dual variables

Data matrix

Page 51

Distributed sampling

Page 52

Distributed sampling

Random set of dual variables
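
What I believe is meant here (by analogy with the Hydra-type schemes in the papers cited on the next slide) is a (c, τ)-distributed sampling: the n dual variables are partitioned across c nodes and each node independently samples τ of its own variables per iteration, the random set being the union. The sketch below is an illustration under that assumption.

import numpy as np

def distributed_sampling(partition, tau, rng=None):
    """(c, tau)-distributed sampling sketch (assumed Hydra-type scheme).

    partition -- list of c index arrays; node k owns the dual variables in partition[k]
    tau       -- number of its own variables each node samples per iteration
    Every node samples uniformly without replacement from its own block,
    independently of the others; the overall set is the union, |S| = c * tau.
    """
    rng = rng or np.random.default_rng()
    S = []
    for block in partition:
        S.extend(rng.choice(block, size=tau, replace=False).tolist())
    return set(S)

# Example: n = 12 dual variables split across c = 3 nodes, tau = 2 per node.
parts = np.array_split(np.arange(12), 3)
print(distributed_sampling(parts, tau=2))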

Page 53

Distributed sampling & distributed coordinate descent

P.R. and Martin Takáč. Distributed coordinate descent method for learning with big data. arXiv:1310.2059, 2013.

Previously studied (not in the primal-dual setup):

Olivier Fercoq, Zheng Qu, P.R. and Martin Takáč. Fast distributed coordinate descent for minimizing non-strongly convex losses. IEEE Int. Workshop on Machine Learning for Signal Processing, May 2014.

Jakub Mareček, P.R. and Martin Takáč. Fast distributed coordinate descent for minimizing partially separable functions. arXiv:1406.0238, June 2014.


strongly convex & smooth

convex & smooth

Page 54

Complexity of distributed Quartz

Page 55

Reallocating load: theoretical speedup

n = 1,000,000, density = 100%

n = 1,000,000, density = 0.01%

Page 56

Extra material (in the zero-probability event that I will have time for it)

Page 57

Part D: ESO

Zheng Qu and P.R. Coordinate descent with arbitrary sampling II: expected separable overapproximation. arXiv:1412.8063, 2014.

Page 58

Computation of ESO parameters

Lemma (QR'14b). For simplicity, assume that m = 1.

Page 59

ESO

Theorem (QR'14b): for any sampling, ESO holds with explicitly computable parameters \(v_1, \dots, v_n\).
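
As a hedged recollection of the kind of formula this theorem provides (not the slide's exact statement): the ESO in question is the inequality from Assumption 3, and for the \(\tau\)-nice sampling the admissible parameters are, I believe,

\[v_i \;=\; \sum_{j} \left(1 + \frac{(\omega_j - 1)(\tau - 1)}{n - 1}\right) A_{ji}^2,\]

where \(\omega_j\) is the number of nonzeros in the \(j\)-th row of \(A\); the theorem's general statement covers an arbitrary sampling, as far as I recall through the matrix of pairwise inclusion probabilities \(\mathbf{P}(i \in S,\, i' \in S)\).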

Page 60

ESO

Page 61

Experiment

Machine: 128 nodes of the HECToR supercomputer (4096 cores)

Problem: LASSO, n = 1 billion, d = 0.5 billion, 3 TB

Algorithm: Hydra with c = 512

P.R. and Martin Takáč, Distributed coordinate descent method for learning with big data, arXiv:1310.2059, 2013

Page 62

LASSO: 3TB data + 128 nodes

Page 63

Experiment

Machine: 128 nodes of the ARCHER supercomputer

Problem: LASSO, n = 5 million, d = 50 billion, 5 TB (60,000 nnz per row of A)

Algorithm: Hydra2 with c = 256

Olivier Fercoq, Zheng Qu, P.R. and Martin Takáč, Fast distributed coordinate descent for minimizing non-strongly convex losses, IEEE Int Workshop on Machine Learning for Signal Processing, 2014

Page 64

LASSO: 5TB data (d = 50b) + 128 nodes

Page 65

END