Peter Richtárik: Randomized Dual Coordinate Ascent with Arbitrary Sampling. 1st UCL Workshop on the Theory of Big Data, London, January 2015

Peter Richtárik

Randomized Dual Coordinate Ascent with Arbitrary Sampling

1st UCL Workshop on the Theory of Big Data, London, January 2015

Coauthors

Zheng Qu (Edinburgh, Mathematics)

Tong Zhang (Rutgers, Statistics; Baidu, Big Data Lab)

Zheng Qu, P.R. and Tong Zhang, Randomized dual coordinate ascent with arbitrary sampling, arXiv:1411.5873, 2014

Part 1: MACHINE LEARNING

Statistical nature of data

Data (e.g., image, text, measurements, …)

Label

Prediction of labels from data

Find a linear predictor such that, when a (data, label) pair is drawn from the distribution, the predicted label is close to the true label.

Measure of Success

We want the expected loss (=risk) to be small:

data label

Finding a Linear Predictor via Empirical Risk Minimization

Draw i.i.d. data (samples) from the distribution

Output predictor which minimizes the empirical risk:
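
A minimal sketch of empirical risk minimization for a linear predictor. It assumes squared loss (the slide does not fix a loss) and synthetic data; the names `empirical_risk`, `X`, `y` and `w_true` are illustrative and not from the talk.

```python
import numpy as np

def empirical_risk(w, X, y):
    """Average squared loss of the linear predictor x -> x @ w on the sample."""
    residuals = X @ w - y
    return 0.5 * np.mean(residuals ** 2)

rng = np.random.default_rng(0)
n, d = 200, 5                                  # n samples, d features
w_true = rng.normal(size=d)                    # hypothetical ground-truth predictor
X = rng.normal(size=(n, d))                    # rows are i.i.d. data points
y = X @ w_true + 0.1 * rng.normal(size=n)      # noisy labels

# For squared loss the empirical risk minimizer is the least-squares solution.
w_erm, *_ = np.linalg.lstsq(X, y, rcond=None)

risk_erm = empirical_risk(w_erm, X, y)
risk_zero = empirical_risk(np.zeros(d), X, y)
print(f"empirical risk at w_erm: {risk_erm:.4f}, at w = 0: {risk_zero:.4f}")
```

With other smooth losses (for example logistic loss) no closed form exists, which is where iterative methods such as the ones in this talk come in.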

Part 2: OPTIMIZATION

Primal Problem

d = # features (parameters)

n = # samples

Regularizer: 1-strongly convex function

Loss functions: smooth & convex

Regularization parameter
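
A hedged reconstruction of the problem these annotations describe, in the notation of the accompanying paper (arXiv:1411.5873). The symbols w, A_i, φ_i, g and λ are assumptions, since the slide's formula did not survive in this transcript.

```latex
\min_{w \in \mathbb{R}^d} \; P(w) := \frac{1}{n} \sum_{i=1}^{n} \phi_i(A_i^\top w) + \lambda\, g(w)
```

Here the φ_i would be the smooth convex loss functions, g the 1-strongly convex regularizer, and λ > 0 the regularization parameter.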

Assumption 1

Loss functions have Lipschitz gradient

Lipschitz constant

Assumption 2

Regularizer is 1-strongly convex

subgradient

Dual Problem

Conjugates of the loss functions: strongly convex

Conjugate of the regularizer: 1-smooth & convex

Part 3: ALGORITHM

Quartz

Fenchel Duality

Weak duality

Optimality conditions
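
Weak duality and the optimality conditions can be checked numerically on a small instance. The sketch below assumes ridge regression (squared loss, regularizer ½‖w‖²) and the Fenchel dual form standard in the SDCA literature; the function names and the synthetic data are illustrative, not taken from the slides.

```python
import numpy as np

def primal(w, X, y, lam):
    # P(w) = (1/n) sum_i 0.5*(x_i^T w - y_i)^2 + lam * 0.5*||w||^2
    return 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * (w @ w)

def dual(alpha, X, y, lam):
    # D(alpha) = -lam * g*(alpha_bar) - (1/n) sum_i phi_i*(-alpha_i), where
    # alpha_bar = X^T alpha / (lam * n), g*(u) = 0.5*||u||^2 and, for the
    # squared loss, phi_i*(-a) = 0.5*a^2 - a*y_i.
    n = X.shape[0]
    alpha_bar = X.T @ alpha / (lam * n)
    return -0.5 * lam * (alpha_bar @ alpha_bar) - np.mean(0.5 * alpha ** 2 - alpha * y)

rng = np.random.default_rng(1)
n, d, lam = 30, 4, 0.5
X = rng.normal(size=(n, d))                    # row i is data point x_i
y = rng.normal(size=n)

# Weak duality: P(w) >= D(alpha) for arbitrary w and alpha.
gaps = [primal(rng.normal(size=d), X, y, lam) - dual(rng.normal(size=n), X, y, lam)
        for _ in range(100)]

# Optimality conditions for this instance:
# w* = alpha_bar* and alpha_i* = -(x_i^T w* - y_i).
w_star = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
alpha_star = y - X @ w_star
gap_at_optimum = primal(w_star, X, y, lam) - dual(alpha_star, X, y, lam)
print(min(gaps), gap_at_optimum)
```

The duality gap is positive at arbitrary points and vanishes (up to rounding) at a pair satisfying the optimality conditions, which is why it serves as a computable measure of progress.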

The Algorithm

Quartz: Bird’s Eye View

STEP 1: PRIMAL UPDATE

STEP 2: DUAL UPDATE

The Algorithm

STEP 1

STEP 2

Convex combination constant
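
The two steps can be sketched for ridge regression with serial uniform sampling. This is an illustration under stated assumptions, not the authors' implementation: the update formulas and the constant θ follow the Quartz paper (arXiv:1411.5873) as understood here, the ESO parameters are taken to be v_i = ‖x_i‖², and all function names and data are hypothetical.

```python
import numpy as np

def duality_gap(w, alpha, X, y, lam):
    n = X.shape[0]
    alpha_bar = X.T @ alpha / (lam * n)
    primal = 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * (w @ w)
    dual = -0.5 * lam * (alpha_bar @ alpha_bar) - np.mean(0.5 * alpha ** 2 - alpha * y)
    return primal - dual

def quartz_ridge(X, y, lam, iters, seed=0):
    """Quartz-style primal-dual iteration, one dual variable per step (sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    p = 1.0 / n                                # uniform serial sampling
    v = np.sum(X ** 2, axis=1)                 # assumed ESO parameters v_i = ||x_i||^2
    gamma = 1.0                                # squared loss has 1-Lipschitz gradient
    theta = np.min(p * lam * gamma * n / (v + lam * gamma * n))
    w = np.zeros(d)
    alpha = np.zeros(n)
    alpha_bar = np.zeros(d)                    # maintains X^T alpha / (lam * n)
    for _ in range(iters):
        # STEP 1 (primal update): convex combination of w and grad g*(alpha_bar),
        # which equals alpha_bar for g(w) = 0.5*||w||^2.
        w = (1 - theta) * w + theta * alpha_bar
        # STEP 2 (dual update): move one random dual variable towards
        # -phi_i'(x_i^T w) = y_i - x_i^T w.
        i = rng.integers(n)
        new_alpha_i = (1 - theta / p) * alpha[i] - (theta / p) * (X[i] @ w - y[i])
        alpha_bar += X[i] * (new_alpha_i - alpha[i]) / (lam * n)
        alpha[i] = new_alpha_i
    return w, alpha

rng = np.random.default_rng(42)
n, d, lam = 50, 5, 0.1
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # normalize each data point
y = rng.normal(size=n)

gap_start = duality_gap(np.zeros(d), np.zeros(n), X, y, lam)
w, alpha = quartz_ridge(X, y, lam, iters=5000)
gap_end = duality_gap(w, alpha, X, y, lam)
print(f"duality gap: {gap_start:.3e} -> {gap_end:.3e}")
```

Maintaining `alpha_bar` incrementally keeps the cost of each iteration proportional to the number of features, independent of n.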

Randomized Primal-Dual Methods

SDCA: SS Shwartz & T Zhang, 09/2012
mSDCA: M Takáč, A Bijral, P R & N Srebro, 03/2013
ASDCA: SS Shwartz & T Zhang, 05/2013
AccProx-SDCA: SS Shwartz & T Zhang, 10/2013
DisDCA: T Yang, 2013
Iprox-SDCA: P Zhao & T Zhang, 01/2014
APCG: Q Lin, Z Lu & L Xiao, 07/2014
SPDC: Y Zhang & L Xiao, 09/2014
Quartz: Z Qu, P R & T Zhang, 11/2014

Part 4: MAIN RESULT

Assumption 3 (Expected Separable Overapproximation)

inequality must hold for all

Complexity Theorem (QRZ’14)

Part 5a: UPDATING ONE DUAL VARIABLE AT A TIME

Complexity of Quartz specialized to serial sampling
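
The effect of the sampling probabilities on the bound can be illustrated numerically. The sketch assumes the leading complexity term has the form max_i (1/p_i + v_i/(p_i λ γ n)), as stated in the accompanying paper (arXiv:1411.5873); the formula did not survive in this transcript, and the values of v_i below are synthetic.

```python
import numpy as np

def leading_term(p, v, lam, gamma):
    """max_i (1/p_i + v_i / (p_i * lam * gamma * n)), the factor assumed to
    multiply log(1/epsilon) in the iteration bound."""
    n = len(p)
    return np.max(1.0 / p + v / (p * lam * gamma * n))

rng = np.random.default_rng(0)
n, lam, gamma = 1000, 1e-3, 1.0
v = rng.lognormal(mean=0.0, sigma=1.5, size=n)   # synthetic, heterogeneous v_i

p_uniform = np.full(n, 1.0 / n)
# Probabilities that equalize the terms inside the max (importance sampling).
p_importance = (v + lam * gamma * n) / np.sum(v + lam * gamma * n)

t_uniform = leading_term(p_uniform, v, lam, gamma)        # n + max_i v_i / (lam * gamma)
t_importance = leading_term(p_importance, v, lam, gamma)  # n + mean(v) / (lam * gamma)
print(f"uniform: {t_uniform:.0f}, importance: {t_importance:.0f}")
```

Under this assumed bound, uniform sampling pays for the largest v_i while the equalizing probabilities pay only for the average, so the gain grows with the spread of the v_i.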

Data

Experiment: Quartz vs SDCA, uniform vs optimal sampling

Panels: standard primal update; “aggressive” primal update

Part 5b: TAU-NICE SAMPLING (STANDARD MINIBATCHING)

Data sparsity

A normalized measure of average sparsity of the data

“Fully sparse data” “Fully dense data”

Complexity of Quartz

Speedup

Assume the data is normalized:

Then:

Linear speedup up to a certain data-independent minibatch size:

Further data-dependent speedup, up to the extreme case:
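
The two regimes can be illustrated on the extreme cases of fully sparse and fully dense data. The sketch rests on two assumptions that are not recoverable from this transcript: tau-nice sampling has p_i = τ/n with ESO parameters v_i = Σ_j (1 + (ω_j − 1)(τ − 1)/(n − 1)) A_ji² (ω_j is the number of nonzeros in feature j, m = 1), and the leading complexity term is n/τ + max_i v_i/(λ γ τ). All names are illustrative.

```python
import numpy as np

def v_tau_nice(A, tau):
    """Assumed ESO parameters for tau-nice sampling with m = 1.
    A has shape (d, n): column i is data point A_i, row j is feature j."""
    d, n = A.shape
    omega = np.count_nonzero(A, axis=1)          # nonzeros per feature (row)
    scale = 1.0 + (omega - 1) * (tau - 1) / (n - 1)
    return (scale[:, None] * A ** 2).sum(axis=0)

def leading_term(A, tau, lam, gamma=1.0):
    """n/tau + max_i v_i / (lam * gamma * tau): the assumed bound with p_i = tau/n."""
    n = A.shape[1]
    return n / tau + np.max(v_tau_nice(A, tau)) / (lam * gamma * tau)

n, lam = 64, 0.01
A_sparse = np.eye(n)                             # each feature touches one data point
A_dense = np.ones((n, n)) / np.sqrt(n)           # every feature touches every data point
# Both matrices have unit-norm columns, i.e. the data is normalized.

taus = [1, 2, 4, 8, 16, 32, 64]
speedup_sparse = [leading_term(A_sparse, 1, lam) / leading_term(A_sparse, t, lam) for t in taus]
speedup_dense = [leading_term(A_dense, 1, lam) / leading_term(A_dense, t, lam) for t in taus]
for t, s, dn in zip(taus, speedup_sparse, speedup_dense):
    print(f"tau = {t:2d}: sparse speedup {s:5.2f}, dense speedup {dn:5.2f}")
```

Under these assumptions the fully sparse case gives a speedup equal to τ, while in the fully dense case only the n/τ part of the bound shrinks, so the speedup saturates.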

Speedup: sparse data

Speedup: denser data

Speedup: fully dense data

astro_ph: n = 29,882 density = 0.08%

CCAT: n = 781,265 density = 0.16%

Primal-dual methods with tau-nice sampling

SS-Shwartz & T Zhang ‘13

SS-Shwartz & T Zhang ‘13

Y Zhang & L Xiao ‘14

For sufficiently sparse data, Quartz wins even when compared against accelerated methods

Accelerated

Part 6: ESO

Zheng Qu and P.R., Coordinate Descent with Arbitrary Sampling II: Expected Separable Overapproximation, arXiv:1412.8063, 2014

Computation of ESO parameters

Lemma (QR’14b) {For simplicity, assume that m = 1}
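
For small problems an ESO inequality can be verified exactly by enumerating the sampling. The sketch assumes that, for m = 1, the inequality reads E‖Σ_{i∈S} A_i h_i‖² ≤ Σ_i p_i v_i h_i², and that for tau-nice sampling one may take v_i = Σ_j (1 + (ω_j − 1)(τ − 1)/(n − 1)) A_ji², with ω_j the number of nonzeros in feature j. Neither formula survives in this transcript, so both are labeled assumptions; the data is synthetic.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
d, n, tau = 8, 6, 3
# Sparse synthetic data matrix; column i is data point A_i (m = 1).
A = rng.normal(size=(d, n)) * (rng.random((d, n)) < 0.4)

omega = np.count_nonzero(A, axis=1)              # nonzeros per feature (row)
# Assumed ESO parameters for tau-nice sampling.
v = ((1.0 + (omega - 1) * (tau - 1) / (n - 1))[:, None] * A ** 2).sum(axis=0)
p = np.full(n, tau / n)                          # P(i in S) under tau-nice sampling

# tau-nice sampling: every subset of size tau is equally likely.
subsets = [list(S) for S in itertools.combinations(range(n), tau)]

def eso_sides(h):
    """Return (E||sum_{i in S} A_i h_i||^2, sum_i p_i v_i h_i^2), with the
    expectation computed exactly over all subsets of size tau."""
    lhs = np.mean([np.sum((A[:, S] @ h[S]) ** 2) for S in subsets])
    rhs = np.sum(p * v * h ** 2)
    return lhs, rhs

checks = [eso_sides(rng.normal(size=n)) for _ in range(200)]
print(max(lhs - rhs for lhs, rhs in checks))     # nonpositive if the inequality holds
```

Enumeration is only feasible for tiny n; closed-form parameters like the assumed v_i are what make the inequality usable at scale.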

ESO

Theorem (QR’14b)

For any sampling, ESO holds with

where

ESO

Part 7: DISTRIBUTED QUARTZ

Distributed Quartz: Perform the Dual Updates in a Distributed Manner

Quartz STEP 2: DUAL UPDATE

Data required to compute the update

Distribution of Data

n = # dual variables

Data matrix

Distributed sampling

Distributed sampling

Random set of dual variables
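
One way such a random set could be generated is sketched below. It assumes the dual variables are split evenly across c nodes and each node independently draws τ of its own variables uniformly at random, as in the distributed coordinate descent work cited in this part; the slide's own definition did not survive, and the function name and parameters are hypothetical.

```python
import numpy as np

def distributed_sampling(n, c, tau, rng):
    """Split dual variables 0..n-1 into c equal blocks (one per node) and let each
    node draw tau of its own variables uniformly at random, without replacement."""
    assert n % c == 0, "this sketch assumes n is divisible by c"
    blocks = np.arange(n).reshape(c, n // c)
    return [rng.choice(block, size=tau, replace=False) for block in blocks]

rng = np.random.default_rng(0)
samples = distributed_sampling(n=12, c=3, tau=2, rng=rng)
chosen = np.concatenate(samples)                 # random set of c * tau dual variables
print(samples)
```

Each node only ever touches its own block, so drawing the sample requires no communication between nodes.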

Distributed sampling & distributed coordinate descent

Previously studied (not in the primal-dual setup):

P.R. and Martin Takáč, Distributed coordinate descent method for learning with big data, arXiv:1310.2059, 2013

Olivier Fercoq, Zheng Qu, P.R. and Martin Takáč, Fast distributed coordinate descent for minimizing non-strongly convex losses, 2014 IEEE Int Workshop on Machine Learning for Signal Processing, May 2014

Jakub Marecek, P.R. and Martin Takáč, Fast distributed coordinate descent for minimizing partially separable functions, arXiv:1406.0238, June 2014

strongly convex & smooth

convex & smooth

Complexity of distributed Quartz

Reallocating load: theoretical speedup

n = 1,000,000, density = 100%

n = 1,000,000, density = 0.01%

Reallocating load: experiment

Data: webspam, n = 350,000, density = 33.51%

Experiment

Machine: 128 nodes of Hector Supercomputer (4096 cores)

Problem: LASSO, n = 1 billion, d = 0.5 billion, 3 TB

Algorithm: with c = 512

P.R. and Martin Takáč, Distributed coordinate descent method for learning with big data, arXiv:1310.2059, 2013

LASSO: 3TB data + 128 nodes

Experiment

Machine: 128 nodes of Archer Supercomputer

Problem: LASSO, n = 5 million, d = 50 billion, 5 TB (60,000 nnz per row of A)

Algorithm: Hydra2 with c = 256

Olivier Fercoq, Zheng Qu, P.R. and Martin Takáč, Fast distributed coordinate descent for minimizing non-strongly convex losses, IEEE Int Workshop on Machine Learning for Signal Processing, 2014

LASSO: 5TB data (d = 50b) + 128 nodes

Related Work

Zheng Qu and P.R., Coordinate descent with arbitrary sampling II: expected separable overapproximation, arXiv:1412.8063, 2014

Zheng Qu and P.R., Coordinate descent with arbitrary sampling I: algorithms and complexity, arXiv:1412.8060, 2014

P.R. and Martin Takáč, On optimal probabilities in stochastic coordinate descent methods, arXiv:1310.3438, 2013

END