Peter Richtárik
Randomized Dual Coordinate Ascent with Arbitrary Sampling
1st UCL Workshop on the Theory of Big Data, London, January 2015
Coauthors
Zheng Qu (Edinburgh, Mathematics)
Tong Zhang (Rutgers, Statistics & Baidu, Big Data Lab)
Zheng Qu, P.R. and Tong Zhang, Randomized dual coordinate ascent with arbitrary sampling, arXiv:1411.5873, 2014
Part 1: MACHINE LEARNING
Statistical nature of data
Data (e.g., image, text, measurements, …)
Label
Prediction of labels from data
Find a predictor such that, when a (data, label) pair is drawn from the distribution, the predicted label matches the true label.
Linear predictor
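In symbols (the formulas on these slides were images; a minimal reconstruction, writing x for the data, y for the label, and w for the parameters):

```latex
h_w(x) = \langle w, x\rangle = w^\top x,
\qquad \text{goal: } h_w(x) \approx y .
```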
Measure of Success
We want the expected loss (= risk) over (data, label) pairs drawn from the distribution to be small:
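A reconstruction of the risk formula (the slide's equation was an image; ℓ denotes the loss and 𝒟 the data distribution):

```latex
L(w) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\,\ell\big(w^\top x,\; y\big)\,\right]
```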
Finding a Linear Predictor via Empirical Risk Minimization
Draw i.i.d. data (samples) from the distribution
Output predictor which minimizes the empirical risk:
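A reconstruction of the empirical risk minimization problem in the same notation, with samples (x_i, y_i), i = 1, …, n:

```latex
\hat w \;=\; \arg\min_{w\in\mathbb{R}^d}\; \frac{1}{n}\sum_{i=1}^n \ell\big(w^\top x_i,\; y_i\big)
```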
Part 2: OPTIMIZATION
Primal Problem
d = # features (parameters); n = # samples
g: 1-strongly convex function (regularizer)
φ_i: smooth & convex loss functions
λ: regularization parameter
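The primal problem of arXiv:1411.5873, reconstructed here because the slide's formula was an image (A_i is the data of the i-th sample; in the simplest case m = 1 it is the i-th column of the data matrix A):

```latex
\min_{w\in\mathbb{R}^d}\; P(w) \;:=\; \frac{1}{n}\sum_{i=1}^n \phi_i\big(A_i^\top w\big) \;+\; \lambda\, g(w)
```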
Assumption 1
The loss functions φ_i have Lipschitz gradient, with Lipschitz constant 1/γ
Assumption 2
The regularizer g is 1-strongly convex (the defining inequality holds with any subgradient)
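In symbols (a reconstruction consistent with arXiv:1411.5873):

```latex
% Assumption 1: each loss \phi_i has (1/\gamma)-Lipschitz gradient:
\|\nabla\phi_i(a) - \nabla\phi_i(b)\| \;\le\; \tfrac{1}{\gamma}\,\|a-b\| \qquad \forall\, a, b;
% Assumption 2: g is 1-strongly convex: for all w, w'
% and any subgradient g'(w) \in \partial g(w),
g(w') \;\ge\; g(w) + \langle g'(w),\, w'-w\rangle + \tfrac{1}{2}\|w'-w\|^2 .
```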
Dual Problem
φ_i*: γ-strongly convex; g*: 1-smooth & convex
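The dual problem as stated in arXiv:1411.5873 (φ_i* and g* are convex conjugates; conjugacy swaps smoothness and strong convexity, which is why φ_i* is γ-strongly convex and g* is 1-smooth):

```latex
\max_{\alpha\in\mathbb{R}^n}\; D(\alpha) \;:=\;
-\frac{1}{n}\sum_{i=1}^n \phi_i^*(-\alpha_i)
\;-\; \lambda\, g^*\!\left(\frac{1}{\lambda n}\sum_{i=1}^n A_i\,\alpha_i\right)
```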
Part 3: ALGORITHM (Quartz)
Fenchel Duality
Weak duality
Optimality conditions
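A reconstruction of the two facts used by the method, writing ᾱ = (1/(λn)) Σ_i A_i α_i:

```latex
% Weak duality:
P(w) \;\ge\; D(\alpha) \qquad \forall\, w, \alpha;
% Optimality conditions:
w^* = \nabla g^*(\bar\alpha^*),
\qquad
\alpha_i^* = -\nabla\phi_i\big(A_i^\top w^*\big) \quad \forall\, i .
```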
The Algorithm
Quartz: Bird’s Eye View
STEP 1: PRIMAL UPDATE
STEP 2: DUAL UPDATE
The Algorithm
STEP 1
STEP 2
θ = convex combination constant
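A minimal Python sketch of these two steps, under several simplifying assumptions: serial uniform sampling (one dual variable per iteration), m = 1, and the regularizer g(w) = ‖w‖²/2, so that ∇g*(s) = s. The callback grad_phi is a hypothetical stand-in for the loss derivatives; the update rule follows my reading of arXiv:1411.5873.

```python
import numpy as np

def quartz_serial(A, grad_phi, lam, theta, n_iters, seed=0):
    """Sketch of Quartz with serial uniform sampling (m = 1) and
    g(w) = ||w||^2 / 2, so that grad g*(s) = s.

    A        : d x n data matrix; column A[:, i] is the data of sample i
    grad_phi : grad_phi(i, z) -> derivative of the i-th loss phi_i at z
    lam      : regularization parameter lambda
    theta    : convex combination constant, theta in (0, min_i p_i];
               the theory suggests theta = min_i p_i*lam*gamma*n / (v_i + lam*gamma*n)
    """
    rng = np.random.default_rng(seed)
    d, n = A.shape
    p = 1.0 / n                          # serial uniform sampling: p_i = 1/n
    alpha = np.zeros(n)                  # dual variables
    alpha_bar = A @ alpha / (lam * n)    # alpha_bar = (1/(lam n)) sum_i A_i alpha_i
    w = alpha_bar.copy()                 # w = grad g*(alpha_bar) = alpha_bar here
    for _ in range(n_iters):
        # STEP 1 (primal update): convex combination with constant theta
        w = (1 - theta) * w + theta * alpha_bar
        # STEP 2 (dual update): move one random dual variable toward
        # the value dictated by the optimality conditions
        i = rng.integers(n)
        u_i = -grad_phi(i, A[:, i] @ w)
        delta = (theta / p) * (u_i - alpha[i])
        alpha[i] += delta
        alpha_bar += (delta / (lam * n)) * A[:, i]
    return w, alpha

# Example: ridge regression, phi_i(z) = (z - y_i)^2 / 2 (so gamma = 1)
d, n = 20, 200
rng = np.random.default_rng(1)
A = rng.standard_normal((d, n)) / np.sqrt(d)
y = rng.standard_normal(n)
w, _ = quartz_serial(A, lambda i, z: z - y[i], lam=0.1, theta=1e-3, n_iters=50_000)
```

Maintaining alpha_bar incrementally keeps the cost of one iteration proportional to the number of nonzeros of the sampled column.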
Randomized Primal-Dual Methods
SDCA: S. Shalev-Shwartz & T. Zhang, 09/2012
mSDCA: M. Takáč, A. Bijral, P.R. & N. Srebro, 03/2013
ASDCA: S. Shalev-Shwartz & T. Zhang, 05/2013
AccProx-SDCA: S. Shalev-Shwartz & T. Zhang, 10/2013
DisDCA: T. Yang, 2013
Iprox-SDCA: P. Zhao & T. Zhang, 01/2014
APCG: Q. Lin, Z. Lu & L. Xiao, 07/2014
SPDC: Y. Zhang & L. Xiao, 09/2014
Quartz: Z. Qu, P.R. & T. Zhang, 11/2014
Part 4: MAIN RESULT
Assumption 3 (Expected Separable Overapproximation)
The following inequality must hold for all vectors h:
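A reconstruction of the ESO inequality for m = 1 (Ŝ is the sampling, p_i = Prob(i ∈ Ŝ), and v ∈ R^n is the vector of ESO parameters):

```latex
\mathbb{E}\left\|\sum_{i\in\hat S} A_i\, h_i\right\|^2 \;\le\; \sum_{i=1}^n p_i\, v_i\, h_i^2
\qquad \forall\, h\in\mathbb{R}^n .
```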
Complexity Theorem (QRZ’14)
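The bound as I recall it from arXiv:1411.5873 (the slide's formula was an image; check the constants against the paper): with θ = min_i p_i λγn/(v_i + λγn), the expected duality gap satisfies E[P(w^t) − D(α^t)] ≤ ε as soon as

```latex
t \;\ge\; \max_i\left(\frac{1}{p_i} + \frac{v_i}{p_i\,\lambda\gamma n}\right)
\log\!\left(\frac{P(w^0) - D(\alpha^0)}{\epsilon}\right).
```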
Part 5a: UPDATING ONE DUAL VARIABLE AT A TIME
Complexity of Quartz specialized to serial sampling
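For any serial sampling (|Ŝ| = 1), ESO holds with v_i = ‖A_i‖², and the general bound specializes as follows (both rates up to the log(1/ε) factor; the optimal-sampling rate is my recollection from the paper):

```latex
% Uniform serial sampling, p_i = 1/n:
n + \frac{\max_i \|A_i\|^2}{\lambda\gamma} \qquad \text{(the SDCA rate)};
% Optimal serial sampling, p_i \propto \lambda\gamma n + \|A_i\|^2:
n + \frac{\frac{1}{n}\sum_{i=1}^n \|A_i\|^2}{\lambda\gamma} .
```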
Experiment: Quartz vs SDCA, uniform vs optimal sampling
[Plots on real data: standard primal update vs the “aggressive” primal update]
Part 5b: τ-NICE SAMPLING (STANDARD MINIBATCHING)
Data sparsity
ω̃: a normalized measure of average sparsity of the data
Extremes: “fully sparse data” (ω̃ = 1) and “fully dense data” (ω̃ = n)
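For context: τ-nice sampling picks a subset of {1, …, n} of size τ uniformly at random, so p_i = τ/n. Writing ω_j for the number of nonzeros in row j of A, the ESO parameters below, and the reading of ω̃ as a normalized average of the ω_j, follow my recollection of arXiv:1411.5873:

```latex
v_i \;=\; \sum_{j=1}^d \left(1 + \frac{(\omega_j - 1)(\tau - 1)}{n-1}\right) A_{ji}^2 ,
\qquad 1 \le \tilde\omega \le n .
```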
Complexity of Quartz
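Plugging p_i = τ/n into the complexity theorem gives, up to the log(1/ε) factor:

```latex
\frac{n}{\tau} \;+\; \frac{\max_i v_i}{\tau\,\lambda\gamma} .
```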
Speedup
Assume the data is normalized (‖A_i‖ = 1 for all i). Then:
Linear speedup up to a certain data-independent minibatch size:
Further data-dependent speedup, up to the extreme case:
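A sketch of the speedup computation under this normalization, with ω̃ standing in for the appropriate weighted average of the ω_j (an assumption; the exact definition is in the paper):

```latex
% Complexity with \tau-nice sampling and normalized data:
\frac{n}{\tau} + \left(1 + \frac{(\tilde\omega - 1)(\tau - 1)}{n-1}\right)\frac{1}{\lambda\gamma\,\tau} .
% Fully dense data (\tilde\omega = n): n/\tau + 1/(\lambda\gamma),
%   so the speedup is linear up to \tau \approx \lambda\gamma n (data-independent).
% Fully sparse data (\tilde\omega = 1): (n + 1/(\lambda\gamma))/\tau,
%   so the speedup is linear for every \tau, up to \tau = n.
```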
[Plots: speedup vs minibatch size τ, for sparse, denser, and fully dense data]
Datasets: astro_ph (n = 29,882, density = 0.08%); CCAT (n = 781,265, density = 0.16%)
Primal-dual methods with τ-nice sampling
[Table comparing complexity bounds: two methods of S. Shalev-Shwartz & T. Zhang ’13 and the method of Y. Zhang & L. Xiao ’14, all accelerated, against Quartz]
For sufficiently sparse data, Quartz wins even when compared against accelerated methods.
Part 6: ESO
Zheng Qu and P.R., Coordinate descent with arbitrary sampling II: expected separable overapproximation, arXiv:1412.8063, 2014
Computation of ESO parameters
Lemma (QR’14b) [for simplicity, assume that m = 1]: for any sampling Ŝ, ESO holds with an explicitly computable parameter vector v.
Theorem (QR’14b): a characterization of the parameter vectors v for which ESO holds for a given sampling.
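For m = 1 one can compute E‖Σ_{i∈Ŝ} A_i h_i‖² = h^⊤ (P ∘ A^⊤A) h, where P is the n × n probability matrix of the sampling, P_{ij} = Prob({i, j} ⊆ Ŝ), and ∘ is the Hadamard product. This yields a matrix characterization of ESO, which I believe is the substance of the theorem (stated here as my reconstruction):

```latex
\text{ESO holds with } v
\;\Longleftrightarrow\;
P \circ (A^\top A) \;\preceq\; \mathrm{Diag}(p\circ v),
\qquad p = \mathrm{diag}(P).
```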
Part 7: DISTRIBUTED QUARTZ
Distributed Quartz: Perform the Dual Updates in a Distributed Manner
Quartz STEP 2: DUAL UPDATE
Data required to compute the update
Distribution of Data
n = # dual variables; the data matrix A is partitioned across the nodes
Distributed sampling
A random set of dual variables, generated in a distributed manner across the nodes
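One natural formalization, following the Hydra line of work (arXiv:1310.2059), stated here as an assumption: partition the n dual variables into c groups of size n/c, one group per node; each node independently selects a τ-nice subset of its own group, and Ŝ is the union, so that

```latex
p_i \;=\; \mathrm{Prob}\big(i\in\hat S\big) \;=\; \frac{\tau}{n/c} \;=\; \frac{c\,\tau}{n}
\qquad \forall\, i .
```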
Distributed sampling & distributed coordinate descent
P.R. and Martin Takáč, Distributed coordinate descent for learning with big data, arXiv:1310.2059, 2013
Previously studied (not in the primal-dual setup):
Olivier Fercoq, Zheng Qu, P.R. and Martin Takáč, Fast distributed coordinate descent for minimizing non-strongly convex losses, 2014 IEEE Int. Workshop on Machine Learning for Signal Processing, May 2014
Jakub Mareček, P.R. and Martin Takáč, Fast distributed coordinate descent for minimizing partially separable functions, arXiv:1406.0238, June 2014
(Hydra, arXiv:1310.2059: strongly convex & smooth objectives; Hydra², MLSP 2014: convex & smooth objectives)
Complexity of distributed Quartz
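Plugging p_i = cτ/n into the complexity theorem gives, up to the log(1/ε) factor (with v the ESO parameters of the distributed sampling, computed in QR’14b):

```latex
\frac{n}{c\,\tau} \;+\; \frac{\max_i v_i}{c\,\tau\,\lambda\gamma} .
```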
Reallocating load: theoretical speedup
[Plots: n = 1,000,000 with density = 100%, and n = 1,000,000 with density = 0.01%]
Reallocating load: experiment
[Plot on the webspam dataset: n = 350,000, density = 33.51%]
Experiment
Machine: 128 nodes of the Hector supercomputer (4,096 cores)
Problem: LASSO, n = 1 billion, d = 0.5 billion, 3 TB of data
Algorithm: Hydra with c = 512
P.R. and Martin Takáč, Distributed coordinate descent method for learning with big data, arXiv:1310.2059, 2013
[Plot: LASSO, 3 TB data, 128 nodes]
Experiment
Machine: 128 nodes of the Archer supercomputer
Problem: LASSO, n = 5 million, d = 50 billion, 5 TB of data (60,000 nnz per row of A)
Algorithm: Hydra² with c = 256
Olivier Fercoq, Zheng Qu, P.R. and Martin Takáč, Fast distributed coordinate descent for minimizing non-strongly convex losses, IEEE Int. Workshop on Machine Learning for Signal Processing, 2014
[Plot: LASSO, 5 TB data (d = 50 billion), 128 nodes]
Related Work
Zheng Qu and P.R., Coordinate descent with arbitrary sampling II: expected separable overapproximation, arXiv:1412.8063, 2014
Zheng Qu and P.R., Coordinate descent with arbitrary sampling I: algorithms and complexity, arXiv:1412.8060, 2014
P.R. and Martin Takáč, On optimal probabilities in stochastic coordinate descent methods, arXiv:1310.3438, 2013
END