Randomized dual coordinate ascent with arbitrary sampling
Zheng Qu, University of Edinburgh
Joint work with Peter Richtárik (Edinburgh) & Tong Zhang (Rutgers & Baidu)
Optimization & Big Data Workshop, Edinburgh, 6th to 8th May 2015
Supervised Statistical Learning
Data → Algorithm → Predictor
The predictor maps an input to a predicted label, which is compared against the true label.
Empirical Risk Minimization
Data (input–label pairs) → Algorithm → Predictor
ERM problem: minimize the sum of the empirical risk (the average loss over the n samples, with n big!) and a regularization term.
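Written out explicitly, the ERM problem takes the standard form used in the accompanying paper (arXiv:1411.5873), with A_i denoting the i-th data vector, each loss φ_i smooth, g a strongly convex regularizer and λ > 0:

```latex
\min_{w \in \mathbb{R}^d} \; P(w) \;:=\; \frac{1}{n}\sum_{i=1}^{n} \phi_i(A_i^\top w) \;+\; \lambda\, g(w)
```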
Algorithm: QUARTZ
Z. Qu, P. Richtárik (UoE) and T. Zhang (Rutgers & Baidu Big Data Lab, Beijing). Randomized dual coordinate ascent with arbitrary sampling. arXiv:1411.5873, 2014.
Primal-Dual Formulation
Fenchel conjugates:
ERM problem
Dual problem
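In the standard SDCA/QUARTZ setup, the Fenchel conjugate and the resulting dual problem read as follows (one dual variable α_i per training example):

```latex
% Fenchel conjugate of a function f:
f^*(u) \;:=\; \sup_{a}\, \big\{ \langle a, u \rangle - f(a) \big\}

% Dual of the ERM problem:
\max_{\alpha \in \mathbb{R}^n} \; D(\alpha) \;:=\;
-\frac{1}{n}\sum_{i=1}^{n} \phi_i^*(-\alpha_i)
\;-\; \lambda\, g^*\!\Big( \frac{1}{\lambda n}\sum_{i=1}^{n} \alpha_i A_i \Big)
```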
Intuition behind QUARTZ
Fenchel’s inequality
weak duality
Optimality conditions
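These three ingredients can be written down in one chain. Applying Fenchel's inequality to each loss and to the regularizer, then summing, gives weak duality; optimality is the case where every inequality is tight:

```latex
% Fenchel's inequality, for each loss and for the regularizer:
\phi_i(A_i^\top w) + \phi_i^*(-\alpha_i) \;\ge\; -\alpha_i A_i^\top w,
\qquad
g(w) + g^*(\bar\alpha) \;\ge\; \langle w, \bar\alpha \rangle,
\quad \text{where } \bar\alpha := \tfrac{1}{\lambda n}\textstyle\sum_i \alpha_i A_i.

% Summing yields weak duality:
P(w) - D(\alpha) \;\ge\; 0 \qquad \forall\, w, \alpha,

% with equality (optimality conditions) iff:
w = \nabla g^*(\bar\alpha), \qquad \alpha_i = -\nabla \phi_i(A_i^\top w) \;\; \forall i.
```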
The Primal-Dual Update
STEP 1: Primal update
STEP 2: Dual update
Both steps are driven by the optimality conditions: the primal step amounts to just maintaining the primal variable consistent with the current dual iterate, and the dual step moves sampled dual coordinates towards their optimal values.
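As a concrete illustration of the two steps, here is a minimal Python sketch of QUARTZ-style serial updates for ridge regression (quadratic loss, g(w) = ½‖w‖², so ∇g* is the identity). The step size θ follows the serial-sampling formula; treat this as an illustrative sketch under these assumptions, not the authors' reference implementation.

```python
import numpy as np

def quartz_ridge(A, y, lam, n_iter, seed=0):
    """QUARTZ-style serial primal-dual updates for ridge regression (a sketch).

    Primal:  P(w) = (1/(2n)) * sum_i (a_i^T w - y_i)^2 + (lam/2) * ||w||^2,
    so phi_i(t) = (t - y_i)^2 / 2 (1-smooth, hence gamma = 1) and
    g(w) = ||w||^2 / 2 (hence grad g* is the identity).
    Serial ESO parameters: v_i = ||a_i||^2.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    v = np.sum(A ** 2, axis=1)                 # v_i = ||a_i||^2
    theta = np.min(lam / (lam * n + v))        # step size for p_i = 1/n, gamma = 1
    alpha = np.zeros(n)                        # dual variables
    abar = np.zeros(d)                         # abar = (1/(lam*n)) * A^T alpha
    w = np.zeros(d)
    for _ in range(n_iter):
        # STEP 1 (primal): convex combination with grad g*(abar) = abar
        w = (1 - theta) * w + theta * abar
        # STEP 2 (dual): move one sampled coordinate towards -phi_i'(a_i^T w)
        i = rng.integers(n)
        u = y[i] - A[i] @ w                    # = -phi_i'(a_i^T w)
        new_alpha_i = (1 - theta * n) * alpha[i] + theta * n * u
        abar += (new_alpha_i - alpha[i]) * A[i] / (lam * n)
        alpha[i] = new_alpha_i
    return w
```

At the fixed point, α_i = y_i − a_iᵀw and w = ᾱ, which recovers the ridge normal equations (AᵀA/n + λI) w = Aᵀy/n.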
Randomized Primal-Dual Methods
- SDCA: S. Shalev-Shwartz & T. Zhang, 09/2012
- mSDCA: M. Takáč, A. Bijral, P. Richtárik & N. Srebro, 03/2013
- ASDCA: S. Shalev-Shwartz & T. Zhang, 05/2013
- AccProx-SDCA: S. Shalev-Shwartz & T. Zhang, 10/2013
- DisDCA: T. Yang, 2013
- Iprox-SDCA: P. Zhao & T. Zhang, 01/2014
- APCG: Q. Lin, Z. Lu & L. Xiao, 07/2014
- SPDC: Y. Zhang & L. Xiao, 09/2014
- QUARTZ: Z. Qu, P. Richtárik & T. Zhang, 11/2014
Convergence Theorem
ESO Assumption: Expected Separable Overapproximation
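Stated here in the simple Euclidean case with scalar dual variables (the paper allows general weighted norms): a sampling Ŝ with marginal probabilities p_i = P(i ∈ Ŝ) admits ESO parameters v_1, …, v_n if

```latex
\mathbb{E}\,\Big\| \sum_{i \in \hat S} h_i A_i \Big\|^2
\;\le\; \sum_{i=1}^{n} p_i\, v_i\, h_i^2
\qquad \forall\, h \in \mathbb{R}^n.
```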
Convex combination constant
Iteration Complexity Result
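The general iteration complexity result of QUARTZ (arXiv:1411.5873), where each φ_i is (1/γ)-smooth:

```latex
k \;\ge\; \max_i \Big( \frac{1}{p_i} + \frac{v_i}{p_i\, \lambda \gamma n} \Big)
\log\!\Big( \frac{P(w^0) - D(\alpha^0)}{\epsilon} \Big)
\quad \Longrightarrow \quad
\mathbb{E}\big[ P(w^k) - D(\alpha^k) \big] \;\le\; \epsilon.
```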
Complexity Results for Serial Sampling
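Specializing the general bound to serial samplings (up to the log(1/ε) factor):

```latex
% Uniform serial sampling, p_i = 1/n:
n + \max_i \frac{v_i}{\lambda\gamma}

% Optimal (importance) serial sampling, p_i \propto \lambda\gamma n + v_i:
n + \frac{\sum_{i=1}^{n} v_i}{n\, \lambda\gamma}
```

The optimal sampling replaces the maximum of the v_i by their average, which can be much smaller when example norms are heterogeneous.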
Experiment: QUARTZ vs SDCA, uniform vs optimal sampling
QUARTZ with Standard Mini-Batching
Data Sparsity
A normalized measure of average sparsity of the data
“Fully sparse data” “Fully dense data”
Iteration Complexity Results
Theoretical Speedup Factor
Linear speedup up to a certain data-independent mini-batch size:
Further data-dependent speedup:
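The shape of this speedup curve can be computed directly. The snippet below is a hypothetical illustration: it assumes normalized data (serial ESO parameter v(1) = 1), models the τ-nice ESO parameter as v(τ) = 1 + (σ − 1)(τ − 1)/(n − 1) for a sparsity parameter σ ∈ [1, n], and writes κ for 1/(λγn).

```python
def speedup_factor(tau, n, sigma, kappa):
    """Theoretical speedup of tau-nice (mini-batch) sampling over serial.

    Hypothetical illustration. Assumes normalized data so the serial ESO
    parameter is v(1) = 1; models the tau-nice ESO parameter as
        v(tau) = 1 + (sigma - 1) * (tau - 1) / (n - 1),
    where sigma in [1, n] measures data density (1 = fully sparse,
    n = fully dense), and kappa stands for 1 / (lam * gamma * n).
    Complexity(tau) = (n / tau) * (1 + kappa * v(tau));
    speedup(tau)    = Complexity(1) / Complexity(tau).
    """
    v_tau = 1.0 + (sigma - 1.0) * (tau - 1.0) / (n - 1.0)
    c_serial = n * (1.0 + kappa)
    c_tau = (n / tau) * (1.0 + kappa * v_tau)
    return c_serial / c_tau
```

For fully sparse data (σ = 1) this gives speedup(τ) = τ, i.e. linear speedup for any mini-batch size; denser data bends the curve over once τ exceeds a data-dependent threshold.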
Plots of Theoretical Speedup Factor
Theoretical vs Practical Speedup
Datasets: astro_ph (sparsity 0.08%, n = 29,882); cov1 (sparsity 22.22%, n = 522,911)
Comparison with Accelerated Mini-Batch P-D Methods
Distribution of Data
n = # dual variables; the data matrix is partitioned across machines.
Distributed Sampling
Random set of dual variables
Distributed Sampling & Distributed Coordinate Descent
Peter Richtárik and Martin Takáč. Distributed coordinate descent for learning with big data. arXiv:1310.2059, 2013.
Previously studied (not in the primal-dual setup):
Olivier Fercoq, Z. Qu, Peter Richtárik and Martin Takáč. Fast distributed coordinate descent for minimizing non-strongly convex losses. 2014 IEEE Int. Workshop on Machine Learning for Signal Processing, 2014.
Jakub Marecek, Peter Richtárik and Martin Takáč. Fast distributed coordinate descent for minimizing partially separable functions. arXiv:1406.0238, 2014.
strongly convex & smooth
convex & smooth
Complexity of Distributed QUARTZ
Reallocating Load: Theoretical Speedup
Theoretical vs Practical Speedup
More on ESO
ESO: some global second order / curvature information is lost, but local second order / curvature information is what we get.
Computation of ESO Parameters
Lemma (QR’14b)
Sampling Data
Conclusion
- QUARTZ (randomized coordinate ascent method with arbitrary sampling)
  - Direct primal-dual analysis (for arbitrary sampling)
  - Optimal serial sampling, tau-nice sampling (mini-batch), distributed sampling
- Theoretical speedup factor, which is a very good predictor of the practical speedup factor:
  - depends on both the sparsity and the condition number
  - shows a weak dependence on how the data is distributed
- Future directions: Accelerated QUARTZ? Randomized fixed point algorithm with relaxation? …?