

Optimization Algorithms for Analyzing Large Datasets

Stephen Wright

University of Wisconsin-Madison

Winedale, October, 2012


1 Learning from Data: Problems and Optimization Formulations

2 First-Order Methods

3 Stochastic Gradient Methods

4 Nonconvex Stochastic Gradient

5 Sparse / Regularized Optimization

6 Decomposition / Coordinate Relaxation

7 Lagrangian Methods

8 Higher-Order Methods


Outline

Give some examples of data analysis and learning problems and their formulations as optimization problems.

Highlight common features in these formulations.

Survey a number of optimization techniques that are being applied to these problems.

Just the basics: give the flavor of each method and state a key theoretical result or two.

All of these approaches are being investigated at present; rapid developments continue.


Data Analysis: Learning

Typical ingredients of optimization formulations of data analysis problems:

Collection of data, from which we want to learn to make inferences about future / missing data, or do regression to extract key information.

A model of how the data relates to the meaning we are trying to extract. Model parameters are to be determined by inspecting the data and using prior knowledge.

Objective that captures prediction errors on the data and deviation from prior knowledge or desirable structure.

Other typical properties of learning problems are a huge underlying data set, and a requirement for solutions with only low-medium accuracy.

In some cases, the optimization formulation is well settled (e.g. Least Squares, Robust Regression, Support Vector Machines, Logistic Regression, Recommender Systems).

In other areas, the formulation is a matter of ongoing debate!

LASSO

Least Squares: Given a set of feature vectors $a_i \in \mathbb{R}^n$ and outcomes $b_i$, $i = 1, 2, \dots, m$, find weights $x$ on the features that predict the outcome accurately: $a_i^T x \approx b_i$.

Under certain assumptions on measurement error, we can find a suitable $x$ by solving a least squares problem
$$\min_x \; \frac{1}{2}\|Ax - b\|_2^2,$$
where the rows of $A$ are $a_i^T$, $i = 1, 2, \dots, m$.

Suppose we want to identify the most important features of each $a_i$ -- those features most important in predicting the outcome.

In other words, we seek an approximate least-squares solution $x$ that is sparse -- an $x$ with just a few nonzero components. Obtain

LASSO:
$$\min_x \; \frac{1}{2}\|Ax - b\|_2^2 \quad \text{such that } \|x\|_1 \le T,$$
for parameter $T > 0$. (Smaller $T$ implies fewer nonzeros in $x$.)

LASSO and Compressed Sensing

LASSO is equivalent to an "$\ell_2$-$\ell_1$" formulation:
$$\min_x \; \frac{1}{2}\|Ax - b\|_2^2 + \tau \|x\|_1, \quad \text{for some } \tau > 0.$$

This same formulation is common in compressed sensing, but the motivation is slightly different. Here $x$ represents some signal that is known to be (nearly) sparse, and the rows of $A$ are probing $x$ to give measured outcomes $b$. The problem above is solved to reconstruct $x$.
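To make the setting concrete, here is a minimal NumPy sketch of a synthetic compressed-sensing instance (my own illustration; the sizes and the weight tau are assumed values, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 400, 100, 10                 # signal length, measurements, nonzeros (assumed)

# Sparse signal x with k nonzero entries.
x_true = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x_true[support] = rng.normal(size=k)

# Rows of A "probe" x; b holds noisy measurements.
A = rng.normal(size=(m, n)) / np.sqrt(m)
b = A @ x_true + 0.01 * rng.normal(size=m)

tau = 0.1                              # assumed regularization weight

def l2l1_objective(x):
    """Evaluate (1/2)||Ax - b||_2^2 + tau * ||x||_1."""
    r = A @ x - b
    return 0.5 * r @ r + tau * np.abs(x).sum()
```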


Group LASSO

Partition $x$ into groups of variables that have some relationship -- turn them "on" or "off" as a group, not as individuals.
$$\min_x \; \frac{1}{2}\|Ax - b\|_2^2 + \tau \sum_{g \in \mathcal{G}} \|x_{[g]}\|_2,$$
with each $[g] \subset \{1, 2, \dots, n\}$.

Easy to handle when the groups $[g]$ are disjoint.

Still easy when $\|\cdot\|_2$ is replaced by $\|\cdot\|_\infty$ (Turlach, Venables, Wright, 2005).

When the groups form a hierarchy, the problem is slightly harder, but similar algorithms still work.

For general overlapping groups, algorithms are more complex (see Bach, Mairal, et al.)


Least Squares with Nonconvex Regularizers

Nonconvex element-wise penalties have become popular for variable selection in statistics.

SCAD (smoothly clipped absolute deviation) (Fan and Li, 2001)

MCP (Zhang, 2010).

Properties: unbiased and sparse estimates, solution path continuous in the regularization parameter $\tau$.

SparseNet (Mazumder, Friedman, Hastie, 2011): coordinate descent.

Support Vector Classification

Given many data vectors $x_i \in \mathbb{R}^n$, for $i = 1, 2, \dots, m$, together with a label $y_i = \pm 1$ to indicate the class (one of two) to which $x_i$ belongs.

Find $z$ such that (usually) we have
$$x_i^T z \ge 1 \text{ when } y_i = +1; \qquad x_i^T z \le -1 \text{ when } y_i = -1.$$

SVM with hinge loss:
$$f(z) = C \sum_{i=1}^{m} \max(1 - y_i (z^T x_i), 0) + \frac{1}{2}\|z\|^2,$$
where $C > 0$ is a parameter. The dual formulation is
$$\min_\alpha \; \frac{1}{2}\alpha^T K \alpha - \mathbf{1}^T \alpha \quad \text{subject to } 0 \le \alpha \le C\mathbf{1}, \; y^T\alpha = 0,$$
where $K_{ij} = y_i y_j x_i^T x_j$. Subvectors of the gradient $K\alpha - \mathbf{1}$ can be computed economically.

(Many extensions and variants.)
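A direct NumPy transcription of the primal hinge-loss objective and one of its subgradients (a sketch for illustration; the function names are mine, and this is not the dual solver used by the decomposition methods later):

```python
import numpy as np

def svm_primal(z, X, y, C):
    """f(z) = C * sum_i max(1 - y_i z^T x_i, 0) + 0.5 ||z||^2.
    X has rows x_i^T; y holds labels in {+1, -1}."""
    margins = 1.0 - y * (X @ z)
    return C * np.maximum(margins, 0.0).sum() + 0.5 * z @ z

def svm_subgradient(z, X, y, C):
    """One subgradient: z - C * sum over margin-violating i of y_i x_i."""
    active = (1.0 - y * (X @ z)) > 0.0
    return z - C * (y[active, None] * X[active]).sum(axis=0)
```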

(Regularized) Logistic Regression

Seek an odds function parametrized by $z \in \mathbb{R}^n$:
$$p_+(x; z) := (1 + e^{z^T x})^{-1}, \qquad p_-(x; z) := 1 - p_+(x; z),$$
choosing $z$ so that $p_+(x_i; z) \approx 1$ when $y_i = +1$ and $p_-(x_i; z) \approx 1$ when $y_i = -1$. The scaled, negative log likelihood function $L(z)$ is
$$L(z) = -\frac{1}{m}\left[\sum_{y_i = -1} \log p_-(x_i; z) + \sum_{y_i = 1} \log p_+(x_i; z)\right].$$

To get a sparse $z$ (i.e. classify on the basis of a few features) solve:
$$\min_z \; L(z) + \lambda\|z\|_1.$$
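A minimal sketch of $L(z) + \lambda\|z\|_1$ under the slide's sign convention (no numerical safeguards against overflow; names are mine):

```python
import numpy as np

def regularized_logistic(z, X, y, lam):
    """Scaled negative log likelihood plus l1 penalty, with
    p_+(x; z) = 1 / (1 + exp(z^T x)). X has rows x_i^T; y in {+1, -1}."""
    t = X @ z
    p_plus = 1.0 / (1.0 + np.exp(t))
    # Sum log p_+ over y_i = +1 and log p_- = log(1 - p_+) over y_i = -1.
    log_lik = np.where(y == 1, np.log(p_plus), np.log1p(-p_plus)).sum()
    return -log_lik / len(y) + lam * np.abs(z).sum()
```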


Multiclass Logistic Regression

$M$ classes: $y_{ij} = 1$ if data point $i$ is in class $j$; $y_{ij} = 0$ otherwise. $z_{[j]}$ is the subvector of $z$ for class $j$.
$$f(z) = -\frac{1}{N}\sum_{i=1}^{N}\left[\sum_{j=1}^{M} y_{ij}(z_{[j]}^T x_i) - \log\left(\sum_{j=1}^{M} \exp(z_{[j]}^T x_i)\right)\right] + \lambda\sum_{j=1}^{M}\|z_{[j]}\|_2^2.$$

Useful in speech recognition, to classify phonemes.


Matrix Completion

Seek a matrix $X \in \mathbb{R}^{m \times n}$ with low rank that matches certain observations, possibly noisy:
$$\min_X \; \frac{1}{2}\|\mathcal{A}(X) - b\|_2^2 + \tau\psi(X),$$
where $\mathcal{A}(X)$ is a linear mapping of the components of $X$ (e.g. element-wise observations).

Can take $\psi$ to be the nuclear norm, the sum of singular values. This tends to promote low rank (in the same way as $\|x\|_1$ tends to promote sparsity of a vector $x$).

Alternatively: $X$ is the sum of a sparse matrix and a low-rank matrix. The element-wise 1-norm $\|X\|_1$ is useful in inducing sparsity.

Useful in recommender systems, e.g. Netflix, Amazon.



Image Processing: TV Regularization

Natural images are not random! They tend to have large areas of near-constant intensity or color, separated by sharp edges.

Denoising: Given an image in which the pixels contain noise, find a "nearby natural image."

Total-Variation regularization applies an $\ell_1$ penalty to gradients in the image. With
$$u : \Omega \to \mathbb{R}, \quad \Omega := [0,1] \times [0,1],$$
to denoise an image $f : \Omega \to \mathbb{R}$, solve
$$\min_u \; \int_\Omega (u(x) - f(x))^2\,dx + \lambda\int_\Omega \|\nabla u(x)\|_2\,dx.$$
(Rudin, Osher, Fatemi, 1992)


[Figure: (a) Cameraman: Clean; (b) Cameraman: Noisy]

[Figure: (c) Cameraman: Denoised; (d) Cameraman: Noisy]

Optimization: Regularization Induces Structure

Formulations may include regularization functions to induce structure:
$$\min_x \; f(x) + \tau\psi(x),$$
where $\psi$ induces the desired structure in $x$. Often $\psi$ is nonsmooth.

$\|x\|_1$ to induce sparsity in the vector $x$ (variable selection / LASSO);

SCAD and MCP: nonconvex regularizers to induce sparsity, without biasing the solution;

Group regularization / group selection;

Nuclear norm $\|X\|_*$ (sum of singular values) to induce low rank in a matrix $X$;

low "total variation" in image processing;

generalizability (Vapnik: "...tradeoff between the quality of the approximation of the given data and the complexity of the approximating function").


Optimization: Properties of f

Objective $f$ can be derived from Bayesian statistics plus a maximum likelihood criterion. It can incorporate prior knowledge.

$f$ has distinctive properties in several applications:

Partially Separable: Typically
$$f(x) = \frac{1}{m}\sum_{i=1}^{m} f_i(x),$$
where each term $f_i$ corresponds to a single item of data, and possibly depends on just a few components of $x$.

Cheap Partial Gradients: subvectors of the gradient $\nabla f$ may be available at proportionately lower cost than the full gradient.

These two properties are often combined.


Batch vs Incremental

Considering the partially separable form
$$f(x) = \frac{1}{m}\sum_{i=1}^{m} f_i(x),$$
the size $m$ of the training set can be very large. There's a fundamental divide in algorithmic strategy.

Incremental: Select a single $i \in \{1, 2, \dots, m\}$ at random, evaluate $\nabla f_i(x)$, and take a step in this direction. (Note that $E[\nabla f_i(x)] = \nabla f(x)$.) Stochastic Approximation (SA), a.k.a. Stochastic Gradient (SG).

Batch: Select a subset of data $\mathcal{X} \subset \{1, 2, \dots, m\}$, and minimize the function $f(x) = \sum_{i \in \mathcal{X}} f_i(x)$. Sample-Average Approximation (SAA).

Minibatch is a kind of compromise: aggregate the component functions $f_i$ into small groups, consisting of 10 or 100 individual terms, and apply incremental algorithms to the redefined summation. (Gives lower-variance gradient estimates.)


First-Order Methods

$\min f(x)$, with smooth convex $f$. First-order methods calculate $\nabla f(x_k)$ at each iteration, and do something with it.

Usually assume
$$\mu I \preceq \nabla^2 f(x) \preceq LI \quad \text{for all } x,$$
with $0 \le \mu \le L$. ($L$ is thus a Lipschitz constant for the gradient $\nabla f$.) The function is strongly convex if $\mu > 0$.

But some of these methods extend to more general cases, e.g.

nonsmooth-regularized $f$;

simple constraints $x \in \Omega$;

availability of just an approximation to $\nabla f$.


Steepest Descent

$$x_{k+1} = x_k - \alpha_k\nabla f(x_k), \quad \text{for some } \alpha_k > 0.$$

Choose the step length $\alpha_k$ by doing a line search, or more simply by using knowledge of $L$ and $\mu$.

$\alpha_k \equiv 1/L$ yields sublinear convergence:
$$f(x_k) - f(x^*) \le \frac{2L\|x_0 - x^*\|^2}{k}.$$

In the strongly convex case, $\alpha_k \equiv 2/(\mu + L)$ yields linear convergence, with constant dependent on the problem conditioning $\kappa := L/\mu$:
$$f(x_k) - f(x^*) \le \frac{L}{2}\left(1 - \frac{2}{\kappa + 1}\right)^{2k}\|x_0 - x^*\|^2.$$

We can't improve much on these rates by using more sophisticated choices of $\alpha_k$ -- they're a fundamental limitation of searching along $-\nabla f(x_k)$.
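A minimal sketch of the short-step scheme $\alpha_k \equiv 1/L$, applied to a small quadratic with $\kappa = 100$ (my own illustration):

```python
import numpy as np

def steepest_descent(grad, x0, L, iters=1000):
    """Fixed-step gradient descent with alpha_k = 1/L."""
    x = x0.copy()
    for _ in range(iters):
        x = x - grad(x) / L
    return x

# Example on a strongly convex quadratic f(x) = 0.5 x^T Q x (kappa = 100):
Q = np.diag([1.0, 10.0, 100.0])
x = steepest_descent(lambda v: Q @ v, np.ones(3), L=100.0)
```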


[Figure: steepest descent with exact line search]


Momentum

First-order methods can be improved dramatically using momentum:
$$x_{k+1} = x_k - \alpha_k\nabla f(x_k) + \beta_k(x_k - x_{k-1}).$$

The search direction is a combination of the previous search direction $x_k - x_{k-1}$ and the latest gradient $\nabla f(x_k)$. Methods in this class include:

Heavy-ball;

Conjugate gradient;

Accelerated gradient.

Heavy-ball sets
$$\alpha_k \equiv \frac{4}{L}\frac{1}{(1 + 1/\sqrt{\kappa})^2}, \qquad \beta_k \equiv \left(1 - \frac{2}{\sqrt{\kappa} + 1}\right)^2$$
to get a linear convergence rate with constant approximately $1 - 2/\sqrt{\kappa}$.

Thus it requires about $\sqrt{\kappa}\log(1/\epsilon)$ iterations to achieve precision $\epsilon$, vs. about $\kappa\log(1/\epsilon)$ for steepest descent.

Conjugate Gradient

The basic step is
$$x_{k+1} = x_k + \alpha_k p_k, \qquad p_k = -\nabla f(x_k) + \gamma_k p_{k-1}.$$

Put it in the "momentum" framework by setting $\beta_k = \alpha_k\gamma_k/\alpha_{k-1}$. However, CG can be implemented in a way that doesn't require knowledge (or estimation) of $L$ and $\mu$:

Choose $\alpha_k$ to (approximately) minimize $f$ along $p_k$;

Choose $\gamma_k$ by a variety of formulae (Fletcher-Reeves, Polak-Ribiere, etc.), all of which are equivalent if $f$ is a convex quadratic, e.g.
$$\gamma_k = \|\nabla f(x_k)\|^2 / \|\nabla f(x_{k-1})\|^2.$$

Similar convergence rate to heavy-ball: linear with rate $1 - 2/\sqrt{\kappa}$.

CG has other properties, particularly when $f$ is a convex quadratic.


[Figure: first-order method with momentum]


Accelerated Gradient Methods

Accelerate the rate to $1/k^2$ for the weakly convex case, while retaining the linear rate (based on $\sqrt{\kappa}$) for the strongly convex case.

Nesterov (1983, 2004) describes a method that requires knowledge of $\kappa$:

0: Choose $x_0$, $\alpha_0 \in (0, 1)$; set $y_0 \leftarrow x_0$.

$k$: $x_{k+1} \leftarrow y_k - \frac{1}{L}\nabla f(y_k)$; (short-step gradient)

solve for $\alpha_{k+1} \in (0, 1)$: $\alpha_{k+1}^2 = (1 - \alpha_{k+1})\alpha_k^2 + \alpha_{k+1}/\kappa$;

set $\beta_k = \alpha_k(1 - \alpha_k)/(\alpha_k^2 + \alpha_{k+1})$;

set $y_{k+1} \leftarrow x_{k+1} + \beta_k(x_{k+1} - x_k)$. (update with momentum)

Separates the "steepest descent" step from the "momentum" step, producing two sequences $\{x_k\}$ and $\{y_k\}$.

Still works for the weakly convex case ($\kappa = \infty$).

FISTA is similar.
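A sketch of this recursion in NumPy, assuming $\mu > 0$ so $\kappa = L/\mu$ is finite ($\alpha_0 = 0.5$ is an arbitrary choice in $(0,1)$):

```python
import numpy as np

def nesterov(grad, x0, L, mu, iters=500):
    """Accelerated gradient with the constant-kappa recursion above."""
    kappa = L / mu
    x, y = x0.copy(), x0.copy()
    alpha = 0.5                          # any alpha_0 in (0, 1)
    for _ in range(iters):
        x_new = y - grad(y) / L          # short-step gradient from y_k
        # Positive root of a^2 = (1 - a) alpha^2 + a / kappa:
        c = alpha**2 - 1.0 / kappa
        alpha_new = 0.5 * (-c + np.sqrt(c * c + 4.0 * alpha**2))
        beta = alpha * (1.0 - alpha) / (alpha**2 + alpha_new)
        y = x_new + beta * (x_new - x)   # momentum step
        x, alpha = x_new, alpha_new
    return x
```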


[Figure: geometry of the accelerated gradient method, showing the interleaved sequences $x_k, x_{k+1}, x_{k+2}$ and $y_k, y_{k+1}, y_{k+2}$]

Extension to Sparse / Regularized Optimization

Accelerated gradient can be applied with minimal changes to the regularized problem
$$\min_x \; f(x) + \tau\psi(x),$$
where $f$ is convex and smooth, $\psi$ convex and "simple" but usually nonsmooth, and $\tau$ is a positive parameter.

Simply replace the gradient step by
$$x_k = \arg\min_x \; \frac{L}{2}\left\|x - \left[y_k - \frac{1}{L}\nabla f(y_k)\right]\right\|^2 + \tau\psi(x).$$

(This is the shrinkage step. Often cheap to compute. More later...)


Stochastic Gradient Methods

Still deal with (weakly or strongly) convex $f$. But change the rules:

Allow $f$ nonsmooth.

Don't calculate function values $f(x)$.

Can evaluate cheaply an unbiased estimate of a vector from the subgradient $\partial f$.

A common setting is
$$f(x) = E_\xi F(x, \xi),$$
where $\xi$ is a random vector with distribution $P$ over a set $\Xi$. A useful special case is the partially separable form
$$f(x) = \frac{1}{m}\sum_{i=1}^{m} f_i(x),$$
where each $f_i$ is convex.


"Classical" Stochastic Gradient

For the sum
$$f(x) = \frac{1}{m}\sum_{i=1}^{m} f_i(x),$$
choose an index $i_k \in \{1, 2, \dots, m\}$ uniformly at random at iteration $k$, and set
$$x_{k+1} = x_k - \alpha_k\nabla f_{i_k}(x_k),$$
for some steplength $\alpha_k > 0$.

When $f$ is strongly convex, the analysis of convergence of $E(\|x_k - x^*\|^2)$ is fairly elementary; see Nemirovski et al (2009).

Convergence depends on careful choice of $\alpha_k$, naturally. Line search for $\alpha_k$ is not really an option, since we can't (or won't) evaluate $f$!
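A minimal sketch of the scheme, using the $\alpha_k = 1/(k\mu)$ schedule analyzed on the next slide (grads[i] is an assumed interface returning $\nabla f_i$):

```python
import numpy as np

def classical_sg(grads, x0, mu, iters, seed=0):
    """Classical stochastic gradient with alpha_k = 1/(k * mu)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for k in range(1, iters + 1):
        i = rng.integers(len(grads))      # i_k uniform on {0, ..., m-1}
        x = x - grads[i](x) / (k * mu)
    return x
```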


Convergence of Classical SG

Suppose $f$ is strongly convex with modulus $\mu$, and there is a bound $M$ on the size of the gradient estimates:
$$\frac{1}{m}\sum_{i=1}^{m}\|\nabla f_i(x)\|^2 \le M^2$$
for all $x$ of interest. Convergence is obtained for the expected square error
$$a_k := \frac{1}{2}E(\|x_k - x^*\|^2).$$

An elementary argument shows a recurrence:
$$a_{k+1} \le (1 - 2\mu\alpha_k)a_k + \frac{1}{2}\alpha_k^2 M^2.$$

When we set $\alpha_k = 1/(k\mu)$, a neat inductive argument reveals a $1/k$ rate:
$$a_k \le \frac{Q}{2k}, \quad \text{for } Q := \max\left(\|x_1 - x^*\|^2, \frac{M^2}{\mu^2}\right).$$


But... What if we don't know $\mu$? Or if $\mu = 0$?

The choice $\alpha_k = 1/(k\mu)$ requires strong convexity, with knowledge of $\mu$. Underestimating $\mu$ can greatly degrade performance!

The Robust Stochastic Approximation approach uses primal averaging to recover a rate of $1/\sqrt{k}$ in $f$, works even for weakly convex nonsmooth functions, and is not sensitive to the choice of parameters in the step length.

At iteration $k$:

set $x_{k+1} = x_k - \alpha_k\nabla f_{i_k}(x_k)$ as before;

set
$$\bar{x}_k = \frac{\sum_{i=1}^{k}\alpha_i x_i}{\sum_{i=1}^{k}\alpha_i}.$$

For any $\theta > 0$ (not critical), choose step lengths to be $\alpha_k = \theta/(M\sqrt{k})$. Then
$$E[f(\bar{x}_k) - f(x^*)] \sim \frac{\log k}{\sqrt{k}}.$$
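A sketch of the averaging scheme (my own illustration; the update inside the loop is an equivalent incremental form of the weighted average $\bar{x}_k$ above):

```python
import numpy as np

def robust_sa(grads, x0, M, theta, iters, seed=0):
    """Robust SA: steps alpha_k = theta / (M sqrt(k)), returning the
    weighted primal average of the iterates."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    xbar = np.zeros_like(x0)
    wsum = 0.0
    for k in range(1, iters + 1):
        alpha = theta / (M * np.sqrt(k))
        i = rng.integers(len(grads))
        x = x - alpha * grads[i](x)
        wsum += alpha
        xbar += (alpha / wsum) * (x - xbar)   # running weighted average
    return xbar
```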


Constant Step Size: $\alpha_k \equiv \alpha$

We can also get rates of approximately $1/k$ for the strongly convex case, without performing iterate averaging and without requiring an accurate estimate of $\mu$. We need to specify the desired threshold for $a_k$ in advance.

Setting $\alpha_k \equiv \alpha$, we have
$$a_{k+1} \le (1 - 2\mu\alpha)a_k + \frac{1}{2}\alpha^2 M^2.$$

We can show by some manipulation that
$$a_k \le (1 - 2\mu\alpha)^k a_0 + \frac{\alpha M^2}{4\mu}.$$

Given a threshold $\epsilon > 0$, we aim to find $\alpha$ and $K$ such that $a_k \le \epsilon$ for all $k \ge K$. Choose them so that both terms in the bound are less than $\epsilon/2$:
$$\alpha := \frac{2\epsilon\mu}{M^2}, \qquad K := \frac{M^2}{4\epsilon\mu^2}\log\left(\frac{2a_0}{\epsilon}\right).$$

A kind of $1/k$ rate!
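A small worked example of these choices, with assumed values for $\epsilon$, $\mu$, $M$, and $a_0$:

```python
import numpy as np

# Assumed illustrative values, not from the slides.
eps, mu, M, a0 = 1e-3, 0.1, 10.0, 1.0
alpha = 2 * eps * mu / M**2                       # makes alpha M^2 / (4 mu) = eps/2
K = (M**2 / (4 * eps * mu**2)) * np.log(2 * a0 / eps)
print(f"alpha = {alpha:.2e}, K = {int(np.ceil(K))}")
```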

Parallel Stochastic Gradient

Several approaches have been tried for parallel stochastic approximation.

Dual Averaging: Average gradient estimates evaluated in parallel on different cores. Requires message passing / synchronization (Dekel et al, 2011; Duchi et al, 2010).

Round-Robin: Cores evaluate $\nabla f_i$ in parallel and update the centrally stored $x$ in round-robin fashion. Requires synchronization (Langford et al, 2009).

Asynchronous: Hogwild!: Each core grabs the centrally stored $x$ and evaluates $\nabla f_e(x_e)$ for some random $e$, then writes the updates back into $x$ (Niu et al, 2011). Downpour SGD: similar idea for a cluster (Dean et al, 2012).

Hogwild!: Each processor runs independently:

1. Sample $i_k$ from $\{1, 2, \dots, m\}$;
2. Read the current state of $x$ from central memory, evaluate $g := \nabla f_{i_k}(x)$;
3. For nonzero components $g_v$, do $x_v \leftarrow x_v - \alpha g_v$.


Hogwild! Convergence

Updates can be old by the time they are applied, but we assume a bound $\tau$ on their age.

Processors can overwrite each other's work, but sparsity of $\nabla f_e$ helps -- updates do not interfere too much.

The analysis of Niu et al (2011) is simplified / generalized by Richtarik (2012).

Rates depend on $\tau$, $L$, $\mu$, the initial error, and other quantities that define the amount of overlap between the nonzero components of $\nabla f_i$ and $\nabla f_j$, for $i \ne j$.

For a constant-step scheme (with $\alpha$ chosen as a function of the quantities above), we essentially recover the $1/k$ behavior of basic SG. (The rate also depends linearly on $\tau$ and the average sparsity.)


Hogwild! Performance

Hogwild! compared with averaged gradient (AIG) and round-robin (RR). Experiments run on a 12-core machine (10 cores used for gradient evaluations, 2 cores for data shuffling).



Stochastic Gradient for Nonconvex Objectives

Derivation and analysis of SG methods depends crucially on convexity of $f$, even strong convexity. However, there are data-intensive problems of current interest (e.g. in deep belief networks) with separable nonconvex $f$.

For nonconvex problems, standard convergence results for full-gradient methods are that "accumulation points are stationary," that is, $\nabla f(x) = 0$.

Are there stochastic-gradient methods that achieve similar properties?

Yes! See Ghadimi and Lan (2012).


Deep Belief Networks

Example of a deep belief network for autoencoding (Hinton, 2007). Output (at top) depends on input (at bottom) of an image with $28 \times 28$ pixels. Transformations are parametrized by $W_1$, $W_2$, $W_3$, $W_4$; the output is a highly nonlinear function of these parameters.

Learning problems based on structures like this give rise to separable, nonconvex objectives.


Nonconvex SG

Describe for the problem
$$f(x) = \frac{1}{m}\sum_{i=1}^{m} f_i(x),$$
with each $f_i$ smooth, possibly nonconvex. $L$ is the Lipschitz constant for $\nabla f$; $M$ is defined as before.

Approach of Ghadimi and Lan (2012): SG with a random number of steps.

Pick an iteration limit $N$; pick a stepsize sequence $\alpha_k$, $k = 1, 2, \dots, N$;

Pick $R$ randomly and uniformly from $\{1, 2, \dots, N\}$;

Do $R$ steps of SG: for $k = 1, 2, \dots, R - 1$:
$$x_{k+1} = x_k - \alpha_k\nabla f_{i_k}(x_k),$$
with $i_k$ chosen randomly and uniformly from $\{1, 2, \dots, m\}$;

Output $x_R$.
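A sketch of the randomized-stopping scheme (my own illustration; grads and the stepsize argument are assumed interfaces):

```python
import numpy as np

def nonconvex_sg(grads, x0, alpha, N, seed=0):
    """SG with a random iteration count: draw R uniformly from {1,...,N},
    run the SG loop, and return x_R. alpha is a constant or a sequence."""
    rng = np.random.default_rng(seed)
    R = int(rng.integers(1, N + 1))
    x = x0.copy()
    for k in range(1, R):
        i = rng.integers(len(grads))
        step = alpha if np.isscalar(alpha) else alpha[k - 1]
        x = x - step * grads[i](x)
    return x
```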


Convergence for Nonconvex SG

Constant-step variant:
$$\alpha_k \equiv \min\left\{\frac{1}{L}, \frac{D}{M\sqrt{N}}\right\},$$
for some user-defined $D > 0$.

Define $D_f$ so that $D_f^2 = 2(f(x_1) - f^*)/L$ (where $f^*$ is the optimal value of $f$). Then
$$\frac{1}{L}E[\|\nabla f(x_R)\|^2] \le \frac{LD_f^2}{N} + \left(D + \frac{D_f^2}{D}\right)\frac{M}{\sqrt{N}}.$$

In the case of convex $f$, we get a similar bound:
$$E[f(x_R) - f^*] \le \frac{LD_X^2}{N} + \left(D + \frac{D_X^2}{D}\right)\frac{M}{\sqrt{N}},$$
where $D_X = \|x_1 - x^*\|$.

Two-Phase Nonconvex SG

Ghadimi and Lan extend this method to achieve complexity like $1/k$, by running multiple processes in parallel and picking the solution with the best sample gradient. They prove a large-deviation convergence bound: the user specifies a required threshold $\epsilon$ for $\|\nabla f(x)\|^2$ and a required likelihood $1 - \Lambda$ (for small positive $\Lambda$).

Choose a trial set $\{j_1, j_2, \dots, j_T\}$ of indices from $\{1, 2, \dots, m\}$. Set $S = \log(2/\Lambda)$;

For $s = 1, 2, \dots, S$, run SG with a random number of iterates, to produce $x_s$;

For $s = 1, 2, \dots, S$, calculate
$$g_s = \frac{1}{T}\sum_{k=1}^{T}\nabla f_{j_k}(x_s).$$

Output the $x_s$ for which $\|g_s\|$ is minimized.


Sparse Optimization: Motivation

Many applications need structured, approximate solutions of optimization formulations, rather than exact solutions.

More Useful, More Credible

Structured solutions are easier to understand. They correspond better to prior knowledge about the solution. They may be easier to use and actuate. They extract just the essential meaning from the data set, not the less important effects.

Less Data Needed

A structured solution lies in a lower-dimensional space, so we need to gather / sample less data to capture it.

The structural requirements have deep implications for how we formulate and solve these problems.


Regularized Formulations: Shrinking Algorithms

Consider the formulation
$$\min_x \; f(x) + \tau\psi(x).$$

The regularizer $\psi$ is often nonsmooth but "simple." Often the problem is easy to solve when $f$ is replaced by a quadratic with diagonal Hessian:
$$\min_z \; g^T(z - x) + \frac{1}{2\alpha}\|z - x\|_2^2 + \tau\psi(z).$$
Equivalently,
$$\min_z \; \frac{1}{2\alpha}\|z - (x - \alpha g)\|_2^2 + \tau\psi(z).$$

Define the shrink operator as the arg min:
$$S_\tau(y, \alpha) := \arg\min_z \; \frac{1}{2\alpha}\|z - y\|_2^2 + \tau\psi(z).$$


Interesting Regularizers and their Shrinks

Cases for which the subproblem is simple:

$\psi(z) = \|z\|_1$. Then $S_\tau(y, \alpha) = \operatorname{sign}(y)\max(|y| - \alpha\tau, 0)$. When $y$ is complex, we have
$$S_\tau(y, \alpha) = \frac{\max(|y| - \tau\alpha, 0)}{\max(|y| - \tau\alpha, 0) + \tau\alpha}\,y.$$

$\psi(z) = \sum_{g \in \mathcal{G}}\|z_{[g]}\|_2$, where $z_{[g]}$, $g \in \mathcal{G}$, are non-overlapping subvectors of $z$. Here
$$S_\tau(y, \alpha)_{[g]} = \frac{\max(\|y_{[g]}\|_2 - \tau\alpha, 0)}{\max(\|y_{[g]}\|_2 - \tau\alpha, 0) + \tau\alpha}\,y_{[g]}.$$

$\psi(x) = I_\Omega(x)$: the indicator function for a closed convex set $\Omega$. Then $S_\tau(y, \alpha)$ is the projection of $y$ onto $\Omega$.
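Sketches of the first two shrink operators in NumPy (the third case is just a projection onto $\Omega$; the function names are mine):

```python
import numpy as np

def shrink_l1(y, alpha, tau):
    """S_tau(y, alpha) for psi = ||.||_1: componentwise soft-thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - alpha * tau, 0.0)

def shrink_group(y, alpha, tau, groups):
    """Group shrink for psi = sum_g ||y_[g]||_2, with disjoint groups
    given as index arrays: scale each subvector toward zero."""
    z = np.zeros_like(y)
    for g in groups:
        nrm = np.linalg.norm(y[g])
        if nrm > tau * alpha:
            z[g] = (1.0 - tau * alpha / nrm) * y[g]
    return z
```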


Basic Prox-Linear Algorithm

(Fukushima and Mine, 1981) for solving $\min_x f(x) + \tau\psi(x)$.

0: Choose $x_0$.

$k$: Choose $\alpha_k > 0$ and set
$$x_{k+1} = S_\tau(x_k - \alpha_k\nabla f(x_k); \alpha_k) = \arg\min_z \; \nabla f(x_k)^T(z - x_k) + \frac{1}{2\alpha_k}\|z - x_k\|_2^2 + \tau\psi(z).$$

This approach goes by many names, including forward-backward splitting and shrinking / thresholding.

Straightforward, but can be fast when the regularization is strong (i.e. the solution is "highly constrained").

Can show convergence for steps $\alpha_k \in (0, 2/L)$, where $L$ is the bound on $\nabla^2 f$. (Like a short-step gradient method.)
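A sketch of the iteration with the safe constant step $\alpha_k \equiv 1/L$, reusing the shrink operators sketched above:

```python
def prox_linear(grad_f, shrink, x0, L, tau, iters=500):
    """Basic prox-linear (forward-backward) loop: gradient step on f,
    then the shrink operator for tau * psi."""
    x = x0.copy()
    for _ in range(iters):
        x = shrink(x - grad_f(x) / L, 1.0 / L, tau)
    return x
```

For the $\ell_2$-$\ell_1$ problem, for example, grad_f is $x \mapsto A^T(Ax - b)$, shrink is shrink_l1 from the previous sketch, and $L$ can be taken as the largest eigenvalue of $A^T A$.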


Enhancements

Alternatively, since $\alpha_k$ plays the role of a steplength, we can adjust it to get better performance and guaranteed convergence:

Backtracking: decrease $\alpha_k$ until a sufficient decrease condition holds.

Use Barzilai-Borwein strategies to get nonmonotone methods. By enforcing sufficient decrease every 10 iterations (say), we still get global convergence.

Embed accelerated first-order methods (e.g. FISTA) by redefining the linear term $g$ appropriately.

Do continuation on the parameter $\tau$: solve first for large $\tau$ (an easier problem), then decrease $\tau$ in steps toward a target, using the previous solution as a warm start for the next value.


Decomposition / Coordinate Relaxation

For $\min f(x)$: at iteration $k$, choose a subset $\mathcal{G}_k \subset \{1, 2, \dots, n\}$ and take a step $d_k$ only in these components, i.e. fix $[d_k]_i = 0$ for $i \notin \mathcal{G}_k$.

Fewer variables means the subproblem is smaller and cheaper to deal with than the full problem. Can:

take a reduced gradient step in the $\mathcal{G}_k$ components;

take multiple "inner iterations";

actually solve the reduced subproblem in the space defined by $\mathcal{G}_k$.


Example: Decomposition in SVM

Decomposition has long been popular for solving the dual (QP) formulation of support vector machines (SVM), since the number of variables (= the number of training examples) may be very large.

SMO, LIBSVM: Each $\mathcal{G}_k$ has two components, with different heuristics for choosing $\mathcal{G}_k$.

LASVM: Again $|\mathcal{G}_k| = 2$, with focus on the online setting.

SVM-light: Small $|\mathcal{G}_k|$ (default 10).

GPDT: Larger $|\mathcal{G}_k|$ (default 400), with a gradient projection solver as the inner loop.


Deterministic Results

A generalized Gauss-Seidel condition requires only that each coordinate is "touched" at least once every $T$ iterations:
$$\mathcal{G}_k \cup \mathcal{G}_{k+1} \cup \dots \cup \mathcal{G}_{k+T-1} = \{1, 2, \dots, n\}.$$

Can show global convergence (e.g. Tseng and Yun, 2009; Wright, 2012).

There are also results on

global linear convergence rates;

optimal manifold identification (for problems with a nonsmooth regularizer);

fast local convergence for an algorithm that takes reduced steps on the estimated optimal manifold.


Stochastic Coordinate Descent

Randomized coordinate descent (RCDC) (Richtarik and Takac, 2012; see also Nesterov, 2012), for $\min f(x)$:

Partitions the components of $x$ into subvectors;

Makes a random selection of one partition to update at each iteration;

Exploits knowledge of the partial Lipschitz constant for each partition in choosing the step;

Allows parallel implementation.

At each iteration, pick a partition at random, and take a short-step steepest descent step on the variables in that partition.


RCDC Details

Partition the components $\{1, 2, \dots, n\}$ into $m$ blocks, with the columns of the $n \times n$ identity matrix corresponding to block $[i]$ denoted by $U_i$.

Denote by $L_i$ the partial Lipschitz constant on partition $[i]$:
$$\|\nabla_{[i]}f(x + U_it) - \nabla_{[i]}f(x)\| \le L_i\|t\|.$$

Fix probabilities of choosing each partition: $p_i$, $i = 1, 2, \dots, m$.

Iteration $k$:

Choose partition $i_k \in \{1, 2, \dots, m\}$ with probability $p_{i_k}$;

Set $d_k = -(1/L_{i_k})\nabla_{[i_k]}f(x_k)$;

Set $x_{k+1} = x_k + U_{i_k}d_k$.

For convex $f$ and uniform weights $p_i = 1/m$, can prove high-probability convergence of $f(x_k)$ to within a specified threshold of $f(x^*)$ at a $1/k$ rate.
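A sketch of the update loop (my own illustration; grad_block, blocks, and L_blocks are assumed interfaces, and uniform selection is used for simplicity):

```python
import numpy as np

def rcdc(grad_block, x0, blocks, L_blocks, iters, seed=0):
    """Randomized block coordinate descent: pick a block uniformly at
    random and take a step -(1/L_i) * partial gradient on that block.
    grad_block(x, i) returns the partial gradient for block i;
    blocks[i] is its index set, L_blocks[i] its partial Lipschitz constant."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(iters):
        i = rng.integers(len(blocks))
        x[blocks[i]] -= grad_block(x, i) / L_blocks[i]
    return x
```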


RCDC Result

Define the weighted norm
$$\|x\|_L := \left(\sum_{i=1}^{m} L_i\|x_{[i]}\|^2\right)^{1/2}$$
and a weighted measure of level set size:
$$\mathcal{R}_L(x) := \max_y \max_{x^* \in X^*}\{\|y - x^*\|_L : f(y) \le f(x)\}.$$

Assuming that the desired precision $\epsilon$ satisfies
$$\epsilon < \min(\mathcal{R}_L^2(x_0), f(x_0) - f^*)$$
and defining
$$K := \frac{2n\mathcal{R}_L^2(x_0)}{\epsilon}\log\left(\frac{f(x_0) - f^*}{\epsilon\rho}\right),$$
we have for $k \ge K$ that
$$P(f(x_k) - f^* \le \epsilon) \ge 1 - \rho.$$


Augmented Lagrangian Methods and Splitting

Consider the linearly constrained problem:
$$\min f(x) \quad \text{s.t.} \quad Ax = b.$$
The augmented Lagrangian is
$$\mathcal{L}(x, \lambda; \rho) := f(x) + \lambda^T(Ax - b) + \frac{\rho}{2}\|Ax - b\|_2^2,$$
where $\rho > 0$. The basic augmented Lagrangian / method of multipliers is

$x_k = \arg\min_x \; \mathcal{L}(x, \lambda_{k-1}; \rho_k)$;

$\lambda_k = \lambda_{k-1} + \rho_k(Ax_k - b)$;

(choose $\rho_{k+1}$).

Extends to inequality constraints, nonlinear constraints.
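A sketch of the method of multipliers with a general-purpose inner solver standing in for the subproblem minimization (a fixed $\rho$ for simplicity; real implementations update $\rho_k$ and solve the subproblem inexactly):

```python
import numpy as np
from scipy.optimize import minimize

def method_of_multipliers(f, A, b, x0, rho=10.0, outer=20):
    """Basic augmented Lagrangian loop for min f(x) s.t. Ax = b."""
    x, lam = x0.copy(), np.zeros(A.shape[0])
    for _ in range(outer):
        def aug_lag(x):
            r = A @ x - b
            return f(x) + lam @ r + 0.5 * rho * (r @ r)
        x = minimize(aug_lag, x).x           # x_k = argmin L(x, lam; rho)
        lam = lam + rho * (A @ x - b)        # multiplier update
    return x, lam
```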


Features of Augmented Lagrangian

If we set $\lambda = 0$, we recover the quadratic penalty method.

If $\lambda = \lambda^*$, the minimizer of $\mathcal{L}(x, \lambda^*; \rho)$ is the true solution $x^*$, for any $\rho \ge 0$.

AL provides successive estimates $\lambda_k$ of $\lambda^*$, and thus can achieve solutions with good accuracy without driving $\rho_k$ to $\infty$.


History of Augmented Lagrangian

Dates from 1969: Hestenes, Powell.

Developments in 1970s, early 1980s by Rockafellar, Bertsekas,....

Lancelot code for nonlinear programming: Conn, Gould, Toint, around 1990.

Largely lost favor as an approach for general nonlinear programming during the next 15 years.

Recent revival in the context of sparse optimization and associated domain areas, in conjunction with splitting / coordinate descent techniques.


Separable Objectives: ADMM

The Alternating Directions Method of Multipliers (ADMM) arises when the objective in the basic linearly constrained problem is separable:
$$\min_{(x,z)} \; f(x) + h(z) \quad \text{subject to } Ax + Bz = c,$$
for which
$$\mathcal{L}(x, z, \lambda; \rho) := f(x) + h(z) + \lambda^T(Ax + Bz - c) + \frac{\rho}{2}\|Ax + Bz - c\|_2^2.$$

A standard augmented Lagrangian method would minimize $\mathcal{L}(x, z, \lambda; \rho)$ over $(x, z)$ jointly -- but these are coupled through the quadratic term, so the advantage of separability is lost.

Instead, minimize over $x$ and $z$ separately and sequentially:

$x_k = \arg\min_x \; \mathcal{L}(x, z_{k-1}, \lambda_{k-1}; \rho_k)$;

$z_k = \arg\min_z \; \mathcal{L}(x_k, z, \lambda_{k-1}; \rho_k)$;

$\lambda_k = \lambda_{k-1} + \rho_k(Ax_k + Bz_k - c)$.
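A generic sketch of the loop, with the two subproblem solvers passed in as callables (the interfaces are mine; in applications each argmin typically has a closed form or a cheap specialized solver):

```python
def admm(argmin_x, argmin_z, A, B, c, z0, lam0, rho, iters=100):
    """ADMM for min f(x) + h(z) s.t. Ax + Bz = c.
    argmin_x(z, lam, rho) solves the x-subproblem above;
    argmin_z(x, lam, rho) solves the z-subproblem."""
    z, lam = z0.copy(), lam0.copy()
    for _ in range(iters):
        x = argmin_x(z, lam, rho)            # minimize L over x
        z = argmin_z(x, lam, rho)            # then over z
        lam = lam + rho * (A @ x + B @ z - c)  # multiplier update
    return x, z, lam
```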


ADMM

Each iteration does a round of block-coordinate descent in $(x, z)$.

The minimizations over $x$ and $z$ add only a quadratic term to $f$ and $h$, respectively. This does not alter the cost much.

Can perform these minimizations inexactly.

Convergence is often slow, but sufficient for many applications.

Many recent applications to compressed sensing, image processing, matrix completion, sparse PCA.

ADMM has a rich collection of antecedents. For an excellent recent survey, including a diverse collection of machine learning applications, see Boyd et al (2011).


ADMM for Consensus Optimization

Given
$$\min_x \; \sum_{i=1}^{m} f_i(x),$$
form $m$ copies of $x$, with the original $x$ as a "master" variable:
$$\min_{x, x^1, x^2, \dots, x^m} \; \sum_{i=1}^{m} f_i(x^i) \quad \text{subject to } x^i = x, \; i = 1, 2, \dots, m.$$

Applying ADMM, with $z = (x^1, x^2, \dots, x^m)$, we get
$$x_k^i = \arg\min_{x^i} \; f_i(x^i) + (\lambda_{k-1}^i)^T(x^i - x_{k-1}) + \frac{\rho_k}{2}\|x^i - x_{k-1}\|_2^2, \quad \forall i,$$
$$x_k = \frac{1}{m}\sum_{i=1}^{m}\left(x_k^i + \frac{1}{\rho_k}\lambda_{k-1}^i\right),$$
$$\lambda_k^i = \lambda_{k-1}^i + \rho_k(x_k^i - x_k), \quad \forall i.$$

Can do the $x^i$ updates in parallel; synchronize for the $x$ update.

ADMM for Awkward Intersections

The feasible set is sometimes an intersection of two or more convex sets that are easy to handle separately (e.g. projections are easily computable), but whose intersection is more difficult to work with.

Example: Optimization over the cone of doubly nonnegative matrices:
$$\min_X \; f(X) \quad \text{s.t.} \quad X \succeq 0, \; X \ge 0.$$

General form:
$$\min f(x) \quad \text{s.t.} \quad x \in \Omega_i, \; i = 1, 2, \dots, m.$$
This is just consensus optimization, with indicator functions for the sets:
$$x_k = \arg\min_x \; f(x) + \sum_{i=1}^{m}\left[(\lambda_{k-1}^i)^T(x - x_{k-1}^i) + \frac{\rho_k}{2}\|x - x_{k-1}^i\|_2^2\right],$$
$$x_k^i = \arg\min_{x^i \in \Omega_i} \; (\lambda_{k-1}^i)^T(x_k - x^i) + \frac{\rho_k}{2}\|x_k - x^i\|_2^2, \quad \forall i,$$
$$\lambda_k^i = \lambda_{k-1}^i + \rho_k(x_k - x_k^i), \quad \forall i.$$


Splitting+Lagrangian+Hogwild!

Recall Hogwild! for minimizing
$$f(x) = \sum_{i=1}^{m} f_i(x)$$
using parallel SG, where each core evaluates a random partial gradient $\nabla f_i(x)$ and updates a centrally stored $x$. Results were given showing good speedups on a 12-core machine.

The machine has two sockets, so it needs frequent cross-socket cache fetches.

Re et al (2012) use the same 12-core, 2-socket platform, but:

each socket has its own copy of $x$, each updated by half the cores;

the copies "synchronize" periodically by consensus, or by updating a common multiplier $\lambda$.


Details

Replace $\min f(x)$ by the split problem
$$\min_{x,z} \; f(x) + f(z) \quad \text{s.t.} \quad x = z.$$
The min-max formulation is
$$\min_{x,z}\max_\lambda \; f(x) + f(z) + \lambda^T(x - z).$$

Store $x$ on one socket and $z$ on the other. Algorithm step $k$:

Apply Hogwild! independently on the two sockets, for a certain amount of time, to
$$\min_x \; f(x) + \lambda_k^T x, \qquad \min_z \; f(z) - \lambda_k^T z,$$
to obtain $x_k$ and $z_k$.

Swap the values $x_k$ and $z_k$ between sockets and update
$$\lambda_{k+1} := \lambda_k + \gamma(x_k - z_k), \quad \text{for some } \gamma > 0.$$


Over 100 epochs for the SVM problem RCV1, without permutation, the speedup factor is 4.6 (0.77s vs 0.17s).


Partially Separable: Higher-Order Methods

Consider again
$$f(x) = \frac{1}{m}\sum_{i=1}^{m} f_i(x).$$
Newton's method obtains its step by solving
$$\nabla^2 f(x_k)d_k = -\nabla f(x_k)$$
and setting $x_{k+1} = x_k + d_k$. We have
$$\nabla^2 f(x) = \frac{1}{m}\sum_{i=1}^{m}\nabla^2 f_i(x), \qquad \nabla f(x) = \frac{1}{m}\sum_{i=1}^{m}\nabla f_i(x).$$

The "obvious" implementation requires a scan through the complete data set, and formation and solution of an $n \times n$ linear system. This is usually impractical for large $m$, $n$.


Sampling

(Martens, 2010; Byrd et al, 2011)

Approximate the derivatives by sampling over a subset of the $m$ terms.

Choose $\mathcal{X} \subset \{1, 2, \dots, m\}$ and $\mathcal{S} \subset \mathcal{X}$, and set
$$H_\mathcal{S}^k = \frac{1}{|\mathcal{S}|}\sum_{i \in \mathcal{S}}\nabla^2 f_i(x_k), \qquad g_\mathcal{X}^k = \frac{1}{|\mathcal{X}|}\sum_{i \in \mathcal{X}}\nabla f_i(x_k).$$
The approximate Newton step satisfies
$$H_\mathcal{S}^k d_k = -g_\mathcal{X}^k.$$

Typically, $\mathcal{S}$ contains just 1% to 10% of the terms.


Conjugate Gradient and ā€œHessian-Freeā€ Implementation

We can use the conjugate gradient (CG) method to find an approximate solution. Each iteration of CG requires one matrix-vector multiplication with $H_\mathcal{S}^k$, plus some vector operations ($O(n)$ cost).

The Hessian-vector product is
$$H_\mathcal{S}^k v = \frac{1}{|\mathcal{S}|}\sum_{i \in \mathcal{S}}\nabla^2 f_i(x_k)v.$$
Since $\nabla^2 f_i$ is often sparse, each product $\nabla^2 f_i(x_k)v$ is cheap.

We can even avoid second-derivative calculations altogether, by using a finite-difference approximation to the matrix-vector product:
$$\nabla^2 f_i(x_k)v \approx [\nabla f_i(x_k + \epsilon v) - \nabla f_i(x_k)]/\epsilon.$$
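A sketch of the sampled, finite-difference Hessian-vector product (grad_fi is an assumed interface returning $\nabla f_i(x)$):

```python
import numpy as np

def hessian_vec_fd(grad_fi, x, v, S, eps=1e-6):
    """Approximate H_S v = (1/|S|) sum_{i in S} Hess f_i(x) v using
    [grad f_i(x + eps v) - grad f_i(x)] / eps for each sampled i."""
    acc = np.zeros_like(x)
    for i in S:
        acc += (grad_fi(i, x + eps * v) - grad_fi(i, x)) / eps
    return acc / len(S)
```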

Sampling and coordinate descent can be combined: see Wright (2012) for an application of this combination to regularized logistic regression.


Conclusions

I've given an incomplete survey of optimization techniques that may be useful in the analysis of large datasets, particularly for learning problems.

Some important topics were omitted, e.g.

quasi-Newton methods (L-BFGS),

nonconvex regularizers,

manifold identification / variable selection.

For further information, see the slides, videos, and reading lists for tutorials that I have presented recently on some of these topics. Or mail me.

FIN
