Semi-Stochastic Gradient Descent Methods
Jakub Konečný, University of Edinburgh
Lehigh University, December 3, 2014




Page 1

Semi-Stochastic Gradient Descent Methods

Jakub Konečný, University of Edinburgh

Lehigh University, December 3, 2014

Page 2

Based on

Basic method (S2GD): Konečný and Richtárik. Semi-Stochastic Gradient Descent Methods, December 2013

Mini-batching & proximal setting (mS2GD): Konečný, Liu, Richtárik and Takáč. mS2GD: Minibatch semi-stochastic gradient descent in the proximal setting, October 2014

Coordinate descent variant (S2CD): Konečný, Qu and Richtárik. S2CD: Semi-Stochastic Coordinate Descent, October 2014

Page 3

Introduction

Page 4

Large scale problem setting

Problems are often structured, and such structured problems arise frequently in machine learning.

Structure: the objective is a sum of functions, and the number of functions is BIG.
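For reference, this is the standard finite-sum objective considered throughout the S2GD line of work:

```latex
\min_{x \in \mathbb{R}^d} \; F(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad n \text{ is BIG.}
```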

Page 5

Examples

Linear regression (least squares)

Logistic regression (classification)
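Written in the finite-sum form above with data pairs (a_i, b_i), these are the usual textbook losses; the exact (possibly regularized) variants used in the experiments may differ:

```latex
\text{Least squares:}\quad f_i(x) = \tfrac{1}{2}\big(a_i^\top x - b_i\big)^2,
\qquad
\text{Logistic loss:}\quad f_i(x) = \log\!\big(1 + \exp(-b_i\, a_i^\top x)\big), \; b_i \in \{-1, +1\}.
```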

Page 6

Assumptions

Lipschitz continuity of the gradient of each f_i

Strong convexity of F
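Spelled out with the constants L and μ used in the S2GD analysis (standard definitions):

```latex
\|\nabla f_i(x) - \nabla f_i(y)\| \le L\,\|x - y\| \quad \text{for all } x, y \text{ and } i = 1,\dots,n,
\qquad
F(y) \ge F(x) + \nabla F(x)^\top (y - x) + \tfrac{\mu}{2}\,\|y - x\|^2 \quad \text{for all } x, y.
```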

Page 7

Gradient Descent (GD)

Update rule: a full gradient step from the current iterate.

Fast (linear) convergence rate.

Alternatively, for accuracy ε we need on the order of (L/μ) log(1/ε) iterations.

Complexity of a single iteration: n (measured in gradient evaluations).
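The standard GD facts behind these bullets, for an L-smooth and μ-strongly convex F:

```latex
x_{k+1} = x_k - h\,\nabla F(x_k), \qquad
F(x_k) - F(x^*) \le \Big(1 - \tfrac{\mu}{L}\Big)^{k}\big(F(x_0) - F(x^*)\big) \quad \text{for } h = \tfrac{1}{L},
```

so reaching accuracy ε takes O((L/μ) log(1/ε)) iterations, each costing n gradient evaluations.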

Page 8

Stochastic Gradient Descent (SGD)

Update rule: step along the gradient of a single randomly chosen f_i, scaled by a step-size parameter.

Why it works: the stochastic gradient is an unbiased estimate of the full gradient.

Slow (sublinear) convergence.

Complexity of a single iteration: 1 (measured in gradient evaluations).
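The corresponding standard SGD facts, with h_k denoting the step-size parameter mentioned above:

```latex
x_{k+1} = x_k - h_k\,\nabla f_i(x_k), \quad i \in \{1,\dots,n\} \text{ chosen uniformly at random},
\qquad \mathbb{E}\big[\nabla f_i(x)\big] = \nabla F(x),
```

and the expected suboptimality decays only as O(1/k), i.e., roughly O(1/ε) iterations are needed, each touching a single gradient.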

Page 9

Goal

GD: fast convergence, but n gradient evaluations in each iteration.
SGD: slow convergence, but the complexity of an iteration is independent of n.

Combine the two in a single algorithm.

Page 10

Semi-Stochastic Gradient Descent

S2GD

Page 11

Intuition

The gradient does not change drastically between nearby points, so we could reuse the information from an "old" gradient.

Page 12

Modifying the "old" gradient

Imagine someone gives us a "good" point x and the full gradient ∇F(x).

The gradient at a point y, near x, can be expressed as the already computed gradient ∇F(x) plus the gradient change, which we can try to estimate with a single cheap stochastic term. This gives an approximation of the gradient at y.
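Written out, this is the variance-reduced estimate used by S2GD (and SVRG):

```latex
\nabla F(y)
= \underbrace{\nabla F(x)}_{\text{already computed}}
+ \underbrace{\big(\nabla F(y) - \nabla F(x)\big)}_{\text{gradient change}}
\;\approx\; \nabla F(x) + \nabla f_i(y) - \nabla f_i(x), \quad i \text{ sampled uniformly at random}.
```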

Page 13

The S2GD Algorithm

Simplification: the size of the inner loop is random, following a geometric rule.
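Below is a minimal Python sketch of the outer/inner loop structure of S2GD as described in the paper. The helper grad_i and the parameters h (stepsize), m (maximum inner-loop size) and nu (a lower bound on the strong convexity constant, possibly zero) are placeholders supplied by the caller.

```python
import numpy as np

def s2gd(grad_i, n, x0, h, m, num_epochs, nu=0.0, rng=None):
    """Sketch of S2GD for F(x) = (1/n) * sum_i f_i(x).

    grad_i(i, x) should return the gradient of the i-th function f_i at x.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    # Geometric rule over inner-loop lengths t = 1..m:
    # P(t) proportional to (1 - nu*h)**(m - t); uniform when nu = 0.
    probs = np.array([(1.0 - nu * h) ** (m - t) for t in range(1, m + 1)])
    probs /= probs.sum()
    for _ in range(num_epochs):
        # Outer loop: one full gradient evaluation at the reference point x.
        g = sum(grad_i(i, x) for i in range(n)) / n
        y = x.copy()
        t_j = rng.choice(np.arange(1, m + 1), p=probs)
        for _ in range(t_j):
            i = rng.integers(n)  # sample one function uniformly
            # Variance-reduced direction: old full gradient + cheap correction.
            y -= h * (g + grad_i(i, y) - grad_i(i, x))
        x = y
    return x
```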

Page 14

Theorem

Page 15

Convergence rate

How should the parameters m (inner-loop size) and h (stepsize) be set?

One term of the rate can be made arbitrarily small by decreasing the stepsize h.

For any fixed h, the other term can be made arbitrarily small by increasing the inner-loop size m.

Page 16

Setting the parameters

Fix a target accuracy ε.

The accuracy is achieved by setting the stepsize h, the number of inner iterations m, and the number of epochs appropriately.

Total complexity (in gradient evaluations): (# of epochs) × (one full gradient evaluation + m cheap inner iterations).

Page 17

Complexity

S2GD complexity, compared with GD complexity: (# of iterations) × (complexity of a single iteration) = total.
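In the notation above, with condition number κ = L/μ, these work out to the headline rates from the S2GD analysis and the standard GD analysis:

```latex
\text{S2GD:}\quad \mathcal{O}\big((n + \kappa)\log(1/\varepsilon)\big)
\qquad\qquad
\text{GD:}\quad \mathcal{O}\big(n\,\kappa\,\log(1/\varepsilon)\big)
```

gradient evaluations in total.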

Page 18

Related Methods

SAG – Stochastic Average Gradient (Mark Schmidt, Nicolas Le Roux, Francis Bach, 2013)
Refreshes a single stochastic gradient in each iteration. Needs to store the n gradients. Similar convergence rate. Cumbersome analysis.

SAGA (Aaron Defazio, Francis Bach, Simon Lacoste-Julien, 2014)
Refined analysis.

MISO – Minimization by Incremental Surrogate Optimization (Julien Mairal, 2014)
Similar to SAG, slightly worse performance. Elegant analysis.

Page 19

Related Methods

SVRG – Stochastic Variance Reduced Gradient (Rie Johnson, Tong Zhang, 2013)
Arises as a special case of S2GD.

Prox-SVRG (Lin Xiao, Tong Zhang, 2014)
Extended to the proximal setting.

EMGD – Epoch Mixed Gradient Descent (Lijun Zhang, Mehrdad Mahdavi, Rong Jin, 2013)
Handles simple constraints; worse convergence rate.

Page 20

Experiment (logistic regression on: ijcnn, rcv, real-sim, url)

Page 21

Extensions

Page 22

Extensions summary

S2GD: efficient handling of sparse data; pre-processing with SGD (S2GD+); non-strongly convex losses; high-probability result.

Mini-batching (mS2GD): Konečný, Liu, Richtárik and Takáč. mS2GD: Minibatch semi-stochastic gradient descent in the proximal setting, October 2014

Coordinate descent variant (S2CD): Konečný, Qu and Richtárik. S2CD: Semi-Stochastic Coordinate Descent, October 2014

Page 23

Sparse data

For linear/logistic regression, the stochastic gradient of f_i copies the sparsity pattern of example i. But the update direction is fully dense: the "old" full gradient part is dense, while the stochastic correction is sparse.

Can we do something about it?

Page 24

Sparse data

Yes we can! To compute the stochastic gradient at the current inner iterate, we only need the coordinates corresponding to the nonzero elements of the sampled example.

For each coordinate, remember when it was last updated. Before computing the stochastic gradient in an inner iteration, bring just the required coordinates up to date: each one missed a number of iterations in which it only moved along the "old" full gradient, so all of those skipped steps can be applied as a single update. Then compute the direction and make a single update.

Page 25

Sparse data implementation
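A minimal Python sketch of this "lazy update" idea for one inner loop, under the assumptions of the previous slide. The helpers grad_i_sparse and support, and the arrays g (full gradient at the reference point x_ref), y (inner iterate) and last, are placeholders introduced here for illustration.

```python
import numpy as np

def lazy_inner_loop(grad_i_sparse, support, g, x_ref, y, h, t_j, n, rng=None):
    """Lazy updates for sparse data in the S2GD inner loop.

    grad_i_sparse(i, v) returns the gradient of f_i at v restricted to the
    support (nonzero features) of example i; support(i) returns those indices.
    """
    rng = np.random.default_rng() if rng is None else rng
    last = np.zeros(y.shape[0], dtype=int)  # iteration each coordinate last saw
    for t in range(t_j):
        i = rng.integers(n)
        idx = support(i)                     # nonzero coordinates of example i
        # Catch up: outside its support, example i's gradient is zero, so the
        # skipped iterations only moved these coordinates by -h * g.
        y[idx] -= h * (t - last[idx]) * g[idx]
        # Now y[idx] is up to date; take the variance-reduced step, which is
        # sparse except for the dense g part handled lazily above.
        corr = grad_i_sparse(i, y) - grad_i_sparse(i, x_ref)
        y[idx] -= h * (g[idx] + corr)
        last[idx] = t + 1
    # Final catch-up so every coordinate reflects all t_j iterations.
    y -= h * (t_j - last) * g
    return y
```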

Page 26

S2GD+

Observing that SGD can make reasonable progress while S2GD is still computing its first full gradient (in case we are starting from an arbitrary point), we can formulate the following algorithm (S2GD+): run a pass of SGD first, then switch to S2GD.

Page 27

S2GD+ Experiment

Page 28

High Probability Result

The convergence result holds only in expectation. Can we say anything about the concentration of the result in practice?

Yes: for any failure probability, a corresponding bound holds with high probability, paying just the logarithm of that probability, independently of the other parameters.

Page 29

Code

An efficient implementation for logistic regression is available at MLOSS:

Page 30

mS2GD (mini-batch S2GD)

How does mini-batching influence the algorithm? Replace the single-example stochastic correction by a mini-batch average.

This provides a two-fold speedup: provably fewer gradient evaluations are needed (up to a certain mini-batch size), and there is an easy possibility of parallelism.
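Concretely, if A denotes a random mini-batch of size b (notation introduced here), the single-example correction in the inner step is replaced by its mini-batch average:

```latex
\nabla F(x) + \nabla f_i(y) - \nabla f_i(x)
\quad\longrightarrow\quad
\nabla F(x) + \frac{1}{b}\sum_{i \in A}\big(\nabla f_i(y) - \nabla f_i(x)\big).
```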

Page 31

S2CD (Semi-Stochastic Coordinate Descent)

SGD-type methods: sampling rows (training examples) of the data matrix.
Coordinate descent-type methods: sampling columns (features) of the data matrix.

Question: Can we do both? Sample both columns and rows?

Page 32

S2CD (Semi-Stochastic Coordinate Descent)

Complexity: of the same (n + condition number) · log(1/ε) form as S2GD, but with a condition number adapted to coordinate-wise sampling.

Page 33

S2GD as a Learning Algorithm

Page 34

The problem with "us" optimizers

Optimizers care about optimization; statisticians care about statistics; each in isolation. The practical need to control both statistical predictive power and the effort spent on optimization is not well understood.

Optimizers should be aware of the following framework, which is mostly due to Bottou and Bousquet, 2007.

Page 35

Machine Learning Setting

A space of input-output pairs. An unknown distribution over this space, describing the relationship between inputs and outputs. A loss function measuring the discrepancy between the predicted and the real output.

Define the Expected Risk.
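In symbols, following the Bottou–Bousquet notation, with pairs (x, y) drawn from the unknown distribution P and loss ℓ, the expected risk of a prediction function f is

```latex
E(f) = \int \ell\big(f(x), y\big)\, dP(x, y) = \mathbb{E}\big[\ell(f(x), y)\big].
```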

Page 36

Machine Learning Setting

Ideal goal: find the prediction function minimizing the Expected Risk.

But you cannot even evaluate the Expected Risk, since the distribution is unknown.

Page 37

Machine Learning Setting

We at least have n i.i.d. samples from the distribution.

Define the Empirical Risk.
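With samples (x_1, y_1), …, (x_n, y_n), the empirical risk is

```latex
E_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i), y_i\big).
```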

Page 38

Machine Learning Setting

First learning principle: fix a family F of candidate prediction functions.

Find the Empirical Minimizer, the function in F that minimizes the Empirical Risk.

Page 39

Machine Learning Setting

Since the optimal function is unlikely to belong to the family F, we also define the best function within F, the one minimizing the Expected Risk over F.

Page 40

Machine Learning Setting

Finding the Empirical Minimizer by minimizing the Empirical Risk exactly is often computationally expensive.

Instead, run an optimization algorithm that returns an approximate solution whose empirical risk is within a prescribed optimization tolerance of the minimum.

Page 41

Recapitulation

Four functions to keep apart: the ideal optimum, the "best" function from our family, the Empirical Minimizer, and the function returned by approximate optimization.
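In symbols, again following Bottou and Bousquet, with ρ denoting the optimization tolerance:

```latex
f^* = \arg\min_{f} E(f), \qquad
f^*_{\mathcal{F}} = \arg\min_{f \in \mathcal{F}} E(f), \qquad
f_n = \arg\min_{f \in \mathcal{F}} E_n(f), \qquad
E_n(\tilde{f}_n) \le E_n(f_n) + \rho.
```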

Page 42

Machine Learning Goal

The big goal is to minimize the Excess Risk, which decomposes into an approximation error, an estimation error, and an optimization error.
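The standard decomposition, in the notation above:

```latex
\mathbb{E}\big[E(\tilde{f}_n) - E(f^*)\big]
= \underbrace{\mathbb{E}\big[E(f^*_{\mathcal{F}}) - E(f^*)\big]}_{\text{approximation error}}
+ \underbrace{\mathbb{E}\big[E(f_n) - E(f^*_{\mathcal{F}})\big]}_{\text{estimation error}}
+ \underbrace{\mathbb{E}\big[E(\tilde{f}_n) - E(f_n)\big]}_{\text{optimization error}}.
```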

Page 43

Generic Machine Learning Problem

All this leads to a complicated compromise.

Three variables: the family of functions, the number of examples, and the optimization accuracy.

Two constraints: the maximal number of examples, and the maximal computational time available.

Page 44

Generic Machine Learning Problem

Small scale learning problem: the first constraint (on the number of examples) is the tight one. The optimization error can be reduced to insignificant levels, and we recover the well-studied approximation-estimation tradeoff.

Large scale learning problem: the second constraint (on computational time) is the tight one. A more complicated compromise.

Page 45

Solving the Large Scale ML Problem

Several simplifications are needed.

We do not carefully balance the three terms; instead we only ensure that they are comparable asymptotically.

We consider a fixed family of functions, linearly parameterized by a vector, effectively setting the approximation error to be a constant.

This simplifies the problem to an Estimation–Optimization tradeoff.

Page 46

Estimation–Optimization tradeoff

Using uniform convergence bounds, one can bound the estimation error in terms of the capacity of the family and the number of examples. Such bounds are often considered weak.

Page 47

Estimation–Optimization tradeoff

Using Localized Bounds (Bousquet, PhD thesis, 2004) or Isomorphic Coordinate Projections (Bartlett and Mendelson, 2006), we get a sharper bound, provided we can establish a suitable variance condition.

The condition often holds with a favorable exponent, for example under strong convexity, or by making assumptions on the data distribution.

Page 48

Estimation–Optimization tradeoff

Using the previous bounds yields a combined bound on the estimation plus optimization error, whose leading factor is an absolute constant. We want to push this bound below the target accuracy. Choosing the number of examples and the optimization tolerance accordingly, and using the complexity results above, we get the following table.