Optimization Algorithms for Analyzing Large Datasets
Stephen Wright
University of Wisconsin-Madison
Winedale, October, 2012
1 Learning from Data: Problems and Optimization Formulations
2 First-Order Methods
3 Stochastic Gradient Methods
4 Nonconvex Stochastic Gradient
5 Sparse / Regularized Optimization
6 Decomposition / Coordinate Relaxation
7 Lagrangian Methods
8 Higher-Order Methods
Outline
Give some examples of data analysis and learning problems and their formulations as optimization problems.
Highlight common features in these formulations.
Survey a number of optimization techniques that are being applied to these problems.
Just the basics - give the flavor of each method and state a key theoretical result or two.
All of these approaches are being investigated at present; rapid developments continue.
Data Analysis: Learning
Typical ingredients of optimization formulations of data analysis problems:
Collection of data, from which we want to learn to make inferences about future / missing data, or do regression to extract key information.
A model of how the data relates to the meaning we are trying to extract. Model parameters to be determined by inspecting the data and using prior knowledge.
Objective that captures prediction errors on the data and deviation from prior knowledge or desirable structure.
Other typical properties of learning problems are a huge underlying data set, and a requirement for solutions with only low-medium accuracy.
In some cases, the optimization formulation is well settled. (e.g. Least Squares, Robust Regression, Support Vector Machines, Logistic Regression, Recommender Systems.)
In other areas, formulation is a matter of ongoing debate!
LASSO
Least Squares: Given a set of feature vectors $a_i \in \mathbb{R}^n$ and outcomes $b_i$, $i = 1, 2, \dots, m$, find weights $x$ on the features that predict the outcome accurately: $a_i^T x \approx b_i$.
Under certain assumptions on measurement error, can find a suitable $x$ by solving a least squares problem
\[ \min_x \; \tfrac{1}{2}\|Ax - b\|_2^2, \]
where the rows of $A$ are $a_i^T$, $i = 1, 2, \dots, m$.
Suppose we want to identify the most important features of each $a_i$: those features most important in predicting the outcome.
In other words, we seek an approximate least-squares solution $x$ that is sparse, i.e. an $x$ with just a few nonzero components. Obtain
LASSO:
\[ \min_x \; \tfrac{1}{2}\|Ax - b\|_2^2 \quad \text{such that} \quad \|x\|_1 \le T, \]
for parameter $T > 0$. (Smaller $T$ implies fewer nonzeros in $x$.)
LASSO and Compressed Sensing
LASSO is equivalent to an "$\ell_2$-$\ell_1$" formulation:
\[ \min_x \; \tfrac{1}{2}\|Ax - b\|_2^2 + \tau\|x\|_1, \quad \text{for some } \tau > 0. \]
This same formulation is common in compressed sensing, but the motivation is slightly different. Here $x$ represents some signal that is known to be (nearly) sparse, and the rows of $A$ are probing $x$ to give measured outcomes $b$. The problem above is solved to reconstruct $x$.
Group LASSO
Partition $x$ into groups of variables that have some relationship; turn them "on" or "off" as a group, not as individuals.
\[ \min_x \; \tfrac{1}{2}\|Ax - b\|_2^2 + \tau \sum_{g \in \mathcal{G}} \|x_{[g]}\|_2, \]
with each $[g] \subset \{1, 2, \dots, n\}$.
Easy to handle when groups $[g]$ are disjoint.
Still easy when $\|\cdot\|_2$ is replaced by $\|\cdot\|_\infty$. (Turlach, Venables, Wright, 2005)
When the groups form a hierarchy, the problem is slightly harder but similar algorithms still work.
For general overlapping groups, algorithms are more complex (see Bach, Mairal, et al.)
Least Squares with Nonconvex Regularizers
Nonconvex element-wise penalties have become popular for variable selection in statistics.
SCAD (smoothly clipped absolute deviation) (Fan and Li, 2001)
MCP (Zhang, 2010).
Properties: Unbiased and sparse estimates, solution path continuous in regularization parameter $\tau$.
SparseNet (Mazumder, Friedman, Hastie, 2011): coordinate descent.
Support Vector Classification
Given many data vectors $x_i \in \mathbb{R}^n$, for $i = 1, 2, \dots, m$, together with a label $y_i = \pm 1$ to indicate the class (one of two) to which $x_i$ belongs.
Find $z$ such that (usually) we have
$x_i^T z \ge 1$ when $y_i = +1$;
$x_i^T z \le -1$ when $y_i = -1$.
SVM with hinge loss:
\[ f(z) = C \sum_{i=1}^{N} \max(1 - y_i(z^T x_i), 0) + \tfrac{1}{2}\|z\|^2, \]
where $C > 0$ is a parameter. Dual formulation is
\[ \min_\alpha \; \tfrac{1}{2}\alpha^T K \alpha - \mathbf{1}^T \alpha \quad \text{subject to } 0 \le \alpha \le C\mathbf{1}, \; y^T \alpha = 0, \]
where $K_{ij} = y_i y_j x_i^T x_j$. Subvectors of the gradient $K\alpha - \mathbf{1}$ can be computed economically.
(Many extensions and variants.)
(Regularized) Logistic Regression
Seek odds function parametrized by $z \in \mathbb{R}^n$:
\[ p_+(x; z) := (1 + e^{z^T x})^{-1}, \qquad p_-(x; z) := 1 - p_+(x; z), \]
choosing $z$ so that $p_+(x_i; z) \approx 1$ when $y_i = +1$ and $p_-(x_i; z) \approx 1$ when $y_i = -1$. Scaled, negative log likelihood function $L(z)$ is
\[ L(z) = -\frac{1}{m}\left[ \sum_{y_i = -1} \log p_-(x_i; z) + \sum_{y_i = 1} \log p_+(x_i; z) \right]. \]
To get a sparse $z$ (i.e. classify on the basis of a few features) solve:
\[ \min_z \; L(z) + \lambda\|z\|_1. \]
Multiclass Logistic Regression
$M$ classes: $y_{ij} = 1$ if data point $i$ is in class $j$; $y_{ij} = 0$ otherwise. $z_{[j]}$ is the subvector of $z$ for class $j$.
\[ f(z) = -\frac{1}{N}\sum_{i=1}^{N}\left[ \sum_{j=1}^{M} y_{ij}(z_{[j]}^T x_i) - \log\left( \sum_{j=1}^{M} \exp(z_{[j]}^T x_i) \right) \right] + \lambda \sum_{j=1}^{M} \|z_{[j]}\|_2^2. \]
Useful in speech recognition, to classify phonemes.
Matrix Completion
Seek a matrix $X \in \mathbb{R}^{m \times n}$ with low rank that matches certain observations, possibly noisy.
\[ \min_X \; \tfrac{1}{2}\|\mathcal{A}(X) - b\|_2^2 + \tau\psi(X), \]
where $\mathcal{A}(X)$ is a linear mapping of the components of $X$ (e.g. element-wise observations).
Can have $\psi$ as the nuclear norm = sum of singular values. Tends to promote low rank (in the same way as $\|x\|_1$ tends to promote sparsity of a vector $x$).
Alternatively: $X$ is the sum of a sparse matrix and a low-rank matrix. The element-wise 1-norm $\|X\|_1$ is useful in inducing sparsity.
Useful in recommender systems, e.g. Netflix, Amazon.
Image Processing: TV Regularization
Natural images are not random! They tend to have large areas of near-constant intensity or color, separated by sharp edges.
Denoising: Given an image in which the pixels contain noise, find a "nearby natural image."
Total-Variation Regularization applies an $\ell_1$ penalty to gradients in the image.
$u : \Omega \to \mathbb{R}$, $\Omega := [0,1] \times [0,1]$.
To denoise an image $f : \Omega \to \mathbb{R}$, solve
\[ \min_u \; \int_\Omega (u(x) - f(x))^2 \, dx + \lambda \int_\Omega \|\nabla u(x)\|_2 \, dx. \]
(Rudin, Osher, Fatemi, 1992)
[Figures: Cameraman image, (a) clean and (b) noisy.]
[Figures: Cameraman image, (c) denoised and (d) noisy.]
Optimization: Regularization Induces Structure
Formulations may include regularization functions to induce structure:
\[ \min_x \; f(x) + \tau\psi(x), \]
where $\psi$ induces the desired structure in $x$. Often $\psi$ is nonsmooth.
$\|x\|_1$ to induce sparsity in the vector $x$ (variable selection / LASSO);
SCAD and MCP: nonconvex regularizers to induce sparsity, without biasing the solution;
Group regularization / group selection;
Nuclear norm $\|X\|_*$ (sum of singular values) to induce low rank in matrix $X$;
low "total-variation" in image processing;
generalizability (Vapnik: "...tradeoff between the quality of the approximation of the given data and the complexity of the approximating function").
Optimization: Properties of f .
Objective $f$ can be derived from Bayesian statistics + maximum likelihood criterion. Can incorporate prior knowledge.
$f$ has distinctive properties in several applications:
Partially Separable: Typically
\[ f(x) = \frac{1}{m}\sum_{i=1}^{m} f_i(x), \]
where each term $f_i$ corresponds to a single item of data, and possibly depends on just a few components of $x$.
Cheap Partial Gradients: subvectors of the gradient $\nabla f$ may be available at proportionately lower cost than the full gradient.
These two properties are often combined.
Batch vs Incremental
Considering the partially separable form
\[ f(x) = \frac{1}{m}\sum_{i=1}^{m} f_i(x), \]
the size $m$ of the training set can be very large. There's a fundamental divide in algorithmic strategy.
Incremental: Select a single $i \in \{1, 2, \dots, m\}$ at random, evaluate $\nabla f_i(x)$, and take a step in this direction. (Note that $E[\nabla f_i(x)] = \nabla f(x)$.) Stochastic Approximation (SA) a.k.a. Stochastic Gradient (SG).
Batch: Select a subset of data $\mathcal{X} \subset \{1, 2, \dots, m\}$, and minimize the function $f(x) = \sum_{i \in \mathcal{X}} f_i(x)$. Sample-Average Approximation (SAA).
Minibatch is a kind of compromise: Aggregate the component functions $f_i$ into small groups, consisting of 10 or 100 individual terms, and apply incremental algorithms to the redefined summation. (Gives lower-variance gradient estimates.)
First-Order Methods
$\min f(x)$, with smooth convex $f$. First-order methods calculate $\nabla f(x_k)$ at each iteration, and do something with it.
Usually assume
\[ \mu I \preceq \nabla^2 f(x) \preceq L I \quad \text{for all } x, \]
with $0 \le \mu \le L$. ($L$ is thus a Lipschitz constant on the gradient $\nabla f$.) Function is strongly convex if $\mu > 0$.
But some of these methods are extendible to more general cases, e.g.
nonsmooth-regularized $f$;
simple constraints $x \in \Omega$;
availability of just an approximation to $\nabla f$.
Steepest Descent
\[ x_{k+1} = x_k - \alpha_k \nabla f(x_k), \quad \text{for some } \alpha_k > 0. \]
Choose step length $\alpha_k$ by doing line search, or more simply by using knowledge of $L$ and $\mu$.
$\alpha_k \equiv 1/L$ yields sublinear convergence
\[ f(x_k) - f(x^*) \le \frac{2L\|x_0 - x^*\|^2}{k}. \]
In strongly convex case, $\alpha_k \equiv 2/(\mu + L)$ yields linear convergence, with constant dependent on problem conditioning $\kappa := L/\mu$:
\[ f(x_k) - f(x^*) \le \frac{L}{2}\left(1 - \frac{2}{\kappa + 1}\right)^{2k} \|x_0 - x^*\|^2. \]
We can't improve much on these rates by using more sophisticated choices of $\alpha_k$; they are a fundamental limitation of searching along $-\nabla f(x_k)$.
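A minimal sketch of the constant-step scheme above (not from the slides; the least-squares instance, the function name steepest_descent, and the inputs f_grad and x0 are illustrative assumptions):

    import numpy as np

    def steepest_descent(f_grad, x0, L, iters=1000):
        # Constant step length 1/L, the setting behind the sublinear rate above.
        x = x0.copy()
        for _ in range(iters):
            x = x - (1.0 / L) * f_grad(x)
        return x

    # Example on least squares f(x) = 0.5*||Ax - b||^2, so grad f(x) = A^T (Ax - b)
    A = np.random.randn(20, 5); b = np.random.randn(20)
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x_hat = steepest_descent(lambda x: A.T @ (A @ x - b), np.zeros(5), L)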
[Figure: steepest descent, exact line search.]
Momentum
First-order methods can be improved dramatically using momentum:
\[ x_{k+1} = x_k - \alpha_k \nabla f(x_k) + \beta_k(x_k - x_{k-1}). \]
Search direction is a combination of the previous search direction $x_k - x_{k-1}$ and the latest gradient $\nabla f(x_k)$. Methods in this class include:
Heavy-ball;
Conjugate gradient;
Accelerated gradient.
Heavy-ball sets
\[ \alpha_k \equiv \frac{4}{L}\,\frac{1}{(1 + 1/\sqrt{\kappa})^2}, \qquad \beta_k \equiv \left(1 - \frac{2}{\sqrt{\kappa} + 1}\right)^2, \]
to get a linear convergence rate with constant approximately $1 - 2/\sqrt{\kappa}$.
Thus requires about $\sqrt{\kappa}\log\epsilon$ to achieve precision of $\epsilon$, vs. about $\kappa\log\epsilon$ for steepest descent.
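A sketch of the heavy-ball iteration with the fixed $\alpha$, $\beta$ above (assumes estimates of $L$ and $\mu$ are available; f_grad and x0 are placeholder inputs):

    import numpy as np

    def heavy_ball(f_grad, x0, L, mu, iters=1000):
        # alpha and beta fixed from kappa = L/mu, as on the slide above.
        kappa = L / mu
        alpha = (4.0 / L) / (1.0 + 1.0 / np.sqrt(kappa)) ** 2
        beta = (1.0 - 2.0 / (np.sqrt(kappa) + 1.0)) ** 2
        x_prev, x = x0.copy(), x0.copy()
        for _ in range(iters):
            x, x_prev = x - alpha * f_grad(x) + beta * (x - x_prev), x
        return x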
Conjugate Gradient
Basic step is
\[ x_{k+1} = x_k + \alpha_k p_k, \qquad p_k = -\nabla f(x_k) + \gamma_k p_{k-1}. \]
Put it in the "momentum" framework by setting $\beta_k = \alpha_k \gamma_k / \alpha_{k-1}$. However, CG can be implemented in a way that doesn't require knowledge (or estimation) of $L$ and $\mu$.
Choose $\alpha_k$ to (approximately) minimize $f$ along $p_k$;
Choose $\gamma_k$ by a variety of formulae (Fletcher-Reeves, Polak-Ribiere, etc), all of which are equivalent if $f$ is convex quadratic. e.g.
\[ \gamma_k = \|\nabla f(x_k)\|^2 / \|\nabla f(x_{k-1})\|^2. \]
Similar convergence rate to heavy-ball: linear with rate $1 - 2/\sqrt{\kappa}$.
CG has other properties, particularly when f is convex quadratic.
[Figure: first-order method with momentum.]
Accelerated Gradient Methods
Accelerate the rate to $1/k^2$ for weakly convex, while retaining the linear rate (based on $\sqrt{\kappa}$) for strongly convex case.
Nesterov (1983, 2004) describes a method that requires $\kappa$.
0: Choose $x_0$, $\alpha_0 \in (0, 1)$; set $y_0 \leftarrow x_0$.
k: $x_{k+1} \leftarrow y_k - \frac{1}{L}\nabla f(y_k)$; (short-step gradient)
solve for $\alpha_{k+1} \in (0, 1)$: $\alpha_{k+1}^2 = (1 - \alpha_{k+1})\alpha_k^2 + \alpha_{k+1}/\kappa$;
set $\beta_k = \alpha_k(1 - \alpha_k)/(\alpha_k^2 + \alpha_{k+1})$;
set $y_{k+1} \leftarrow x_{k+1} + \beta_k(x_{k+1} - x_k)$. (update with momentum)
Separates the "steepest descent" step from the "momentum" step, producing two sequences $\{x_k\}$ and $\{y_k\}$. Still works for weakly convex ($\kappa = \infty$).
FISTA is similar.
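A sketch of the scheme listed above (the function name and the inputs f_grad, x0, alpha0 are illustrative; mu = 0 gives the weakly convex case, i.e. kappa = infinity):

    import numpy as np

    def accelerated_gradient(f_grad, x0, L, mu=0.0, alpha0=0.5, iters=1000):
        x, y, alpha = x0.copy(), x0.copy(), alpha0
        for _ in range(iters):
            x_next = y - (1.0 / L) * f_grad(y)              # short-step gradient
            # solve alpha_new^2 = (1 - alpha_new)*alpha^2 + alpha_new*(mu/L)
            b, c = alpha ** 2 - mu / L, -alpha ** 2
            alpha_new = (-b + np.sqrt(b * b - 4.0 * c)) / 2.0
            beta = alpha * (1.0 - alpha) / (alpha ** 2 + alpha_new)
            y = x_next + beta * (x_next - x)                # update with momentum
            x, alpha = x_next, alpha_new
        return x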
Extension to Sparse / Regularized Optimization
Accelerated gradient can be applied with minimal changes to the regularized problem
\[ \min_x \; f(x) + \tau\psi(x), \]
where $f$ is convex and smooth, $\psi$ convex and "simple" but usually nonsmooth, and $\tau$ is a positive parameter.
Simply replace the gradient step by
\[ x_k = \arg\min_x \; \frac{L}{2}\left\| x - \left[ y_k - \frac{1}{L}\nabla f(y_k) \right] \right\|^2 + \tau\psi(x). \]
(This is the shrinkage step. Often cheap to compute. More later...)
Stochastic Gradient Methods
Still deal with (weakly or strongly) convex f . But change the rules:
Allow f nonsmooth.
Don't calculate function values $f(x)$.
Can evaluate cheaply an unbiased estimate of a vector from the subgradient $\partial f$.
Common settings are:
\[ f(x) = E_\xi F(x, \xi), \]
where $\xi$ is a random vector with distribution $P$ over a set $\Xi$. A useful special case is the partially separable form
\[ f(x) = \frac{1}{m}\sum_{i=1}^{m} f_i(x), \]
where each $f_i$ is convex.
"Classical" Stochastic Gradient
For the sum:
\[ f(x) = \frac{1}{m}\sum_{i=1}^{m} f_i(x). \]
Choose index $i_k \in \{1, 2, \dots, m\}$ uniformly at random at iteration $k$, set
\[ x_{k+1} = x_k - \alpha_k \nabla f_{i_k}(x_k), \]
for some steplength $\alpha_k > 0$.
When $f$ is strongly convex, the analysis of convergence of $E(\|x_k - x^*\|^2)$ is fairly elementary - see Nemirovski et al (2009).
Convergence depends on careful choice of $\alpha_k$, naturally. Line search for $\alpha_k$ is not really an option, since we can't (or won't) evaluate $f$!
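A minimal sketch of this iteration (illustrative only; grad_fi(x, i) stands in for $\nabla f_i(x)$, and the $\alpha_k = 1/(k\mu)$ schedule from the next slide is used as the step length):

    import numpy as np

    def classical_sg(grad_fi, m, x0, mu, iters=10000, seed=0):
        # x_{k+1} = x_k - alpha_k * grad f_{i_k}(x_k), i_k uniform on {0, ..., m-1}
        rng = np.random.default_rng(seed)
        x = x0.copy()
        for k in range(1, iters + 1):
            i = rng.integers(m)
            x = x - (1.0 / (k * mu)) * grad_fi(x, i)
        return x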
Convergence of Classical SG
Suppose $f$ is strongly convex with modulus $\mu$, and there is a bound $M$ on the size of the gradient estimates:
\[ \frac{1}{m}\sum_{i=1}^{m} \|\nabla f_i(x)\|^2 \le M^2 \]
for all $x$ of interest. Convergence obtained for the expected square error:
\[ a_k := \tfrac{1}{2} E(\|x_k - x^*\|^2). \]
Elementary argument shows a recurrence:
\[ a_{k+1} \le (1 - 2\mu\alpha_k)a_k + \tfrac{1}{2}\alpha_k^2 M^2. \]
When we set $\alpha_k = 1/(k\mu)$, a neat inductive argument reveals a $1/k$ rate:
\[ a_k \le \frac{Q}{2k}, \quad \text{for } Q := \max\left( \|x_1 - x^*\|^2, \frac{M^2}{\mu^2} \right). \]
But... What if we don't know $\mu$? Or if $\mu = 0$?
The choice $\alpha_k = 1/(k\mu)$ requires strong convexity, with knowledge of $\mu$. Underestimating $\mu$ can greatly degrade performance!
Robust Stochastic Approximation approach uses primal averaging to recover a rate of $1/\sqrt{k}$ in $f$, works even for weakly convex nonsmooth functions, and is not sensitive to choice of parameters in the step length.
At iteration $k$:
set $x_{k+1} = x_k - \alpha_k \nabla f_{i_k}(x_k)$ as before;
set
\[ \bar{x}_k = \frac{\sum_{i=1}^{k} \alpha_i x_i}{\sum_{i=1}^{k} \alpha_i}. \]
For any $\theta > 0$ (not critical), choose step lengths to be $\alpha_k = \theta/(M\sqrt{k})$.
Then
\[ E[f(\bar{x}_k) - f(x^*)] \sim \frac{\log k}{\sqrt{k}}. \]
Constant Step Size: $\alpha_k \equiv \alpha$
We can also get rates of approximately $1/k$ for the strongly convex case, without performing iterate averaging and without requiring an accurate estimate of $\mu$. Need to specify the desired threshold for $a_k$ in advance.
Setting $\alpha_k \equiv \alpha$, we have
\[ a_{k+1} \le (1 - 2\mu\alpha)a_k + \tfrac{1}{2}\alpha^2 M^2. \]
We can show by some manipulation that
\[ a_k \le (1 - 2\mu\alpha)^k a_0 + \frac{\alpha M^2}{4\mu}. \]
Given threshold $\epsilon > 0$, we aim to find $\alpha$ and $K$ such that $a_k \le \epsilon$ for all $k \ge K$. Choose so that both terms in the bound are less than $\epsilon/2$:
\[ \alpha := \frac{2\epsilon\mu}{M^2}, \qquad K := \frac{M^2}{4\epsilon\mu^2}\log\left(\frac{a_0}{2\epsilon}\right). \]
A kind of $1/k$ rate!
Parallel Stochastic Gradient
Several approaches tried for parallel stochastic approximation.
Dual Averaging: Average gradient estimates evaluated in parallel on different cores. Requires message passing / synchronization (Dekel et al, 2011; Duchi et al, 2010).
Round-Robin: Cores evaluate $\nabla f_i$ in parallel and update centrally stored $x$ in round-robin fashion. Requires synchronization (Langford et al, 2009).
Asynchronous: Hogwild!: Each core grabs the centrally-stored $x$ and evaluates $\nabla f_e(x_e)$ for some random $e$, then writes the updates back into $x$ (Niu et al, 2011). Downpour SGD: Similar idea for a cluster (Dean et al, 2012).
Hogwild!: Each processor runs independently:
1. Sample $i_k$ from $\{1, 2, \dots, m\}$;
2. Read current state of $x$ from central memory, evaluate $g := \nabla f_{i_k}(x)$;
3. For nonzero components $g_v$, do $x_v \leftarrow x_v - \alpha g_v$.
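An illustrative sketch of that update pattern (not the authors' code; in CPython the GIL serializes these threads, so this shows only the lock-free access pattern, not real speedup; grad_fi and the other names are assumptions):

    import numpy as np, threading

    def hogwild(grad_fi, m, x, alpha, steps_per_thread=1000, n_threads=4):
        # Workers sample a term, read the shared x, and write componentwise
        # updates back with no locking; writes may overlap and overwrite.
        def worker(seed):
            rng = np.random.default_rng(seed)
            for _ in range(steps_per_thread):
                i = rng.integers(m)
                g = grad_fi(x, i)               # uses a possibly stale copy of x
                for v in np.nonzero(g)[0]:      # touch only nonzero components
                    x[v] -= alpha * g[v]
        threads = [threading.Thread(target=worker, args=(s,)) for s in range(n_threads)]
        for t in threads: t.start()
        for t in threads: t.join()
        return x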
Hogwild! Convergence
Updates can be old by the time they are applied, but we assume a bound $\tau$ on their age.
Processors can overwrite each other's work, but sparsity of $\nabla f_e$ helps: updates do not interfere too much.
Analysis of Niu et al (2011) simplified / generalized by Richtarik (2012).
Rates depend on $\tau$, $L$, $\mu$, initial error, and other quantities that define the amount of overlap between nonzero components of $\nabla f_i$ and $\nabla f_j$, for $i \ne j$.
For a constant-step scheme (with $\alpha$ chosen as a function of the quantities above), essentially recover the $1/k$ behavior of basic SG. (Also depends linearly on $\tau$ and the average sparsity.)
Hogwild! Performance
Hogwild! compared with averaged gradient (AIG) and round-robin (RR). Experiments run on a 12-core machine. (10 cores used for gradient evaluations, 2 cores for data shuffling.)
Stochastic Gradient for Nonconvex Objectives
Derivation and analysis of SG methods depends crucially on convexity of $f$, even strong convexity. However, there are data-intensive problems of current interest (e.g. in deep belief networks) with separable nonconvex $f$.
For nonconvex problems, standard convergence results for full-gradient methods are that "accumulation points are stationary," that is, $\nabla f(x) = 0$.
Are there stochastic-gradient methods that achieve similar properties?
Yes! See Ghadimi and Lan (2012).
Deep Belief Networks
Example of a deep belief network for autoencoding (Hinton, 2007). Output (at top) depends on input (at bottom) of an image with $28 \times 28$ pixels. Transformations parametrized by $W_1$, $W_2$, $W_3$, $W_4$; output is a highly nonlinear function of these parameters.
Learning problems based on structures like this give rise to separable, nonconvex objectives.
Nonconvex SG
Describe for the problem
\[ f(x) = \frac{1}{m}\sum_{i=1}^{m} f_i(x), \]
with $f_i$ smooth, possibly nonconvex. $L$ is Lipschitz constant for $\nabla f$; $M$ defined as before.
Approach of Ghadimi and Lan (2012): SG with a random number of steps.
Pick an iteration limit $N$; pick stepsize sequence $\alpha_k$, $k = 1, 2, \dots, N$;
Pick $R$ randomly and uniformly from $\{1, 2, \dots, N\}$;
Do $R$ steps of SG: For $k = 1, 2, \dots, R - 1$:
\[ x_{k+1} = x_k - \alpha_k \nabla f_{i_k}(x_k) \]
with $i_k$ chosen randomly and uniformly from $\{1, 2, \dots, m\}$;
Output $x_R$.
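A sketch of this randomized-stopping scheme (illustrative names; the constant step from the next slide can be passed in as alpha):

    import numpy as np

    def nonconvex_sg(grad_fi, m, x0, alpha, N, seed=0):
        # Run SG for R steps, with R drawn uniformly from {1, ..., N}; return x_R.
        rng = np.random.default_rng(seed)
        R = int(rng.integers(1, N + 1))
        x = x0.copy()
        for _ in range(R - 1):
            i = rng.integers(m)
            x = x - alpha * grad_fi(x, i)
        return x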
Convergence for Nonconvex SG
Constant-step variant:
\[ \alpha_k \equiv \min\left\{ \frac{1}{L}, \frac{D}{M\sqrt{N}} \right\}, \]
for some user-defined $D > 0$.
Define $D_f$ so that $D_f^2 = 2(f(x_1) - f^*)/L$ (where $f^*$ is the optimal value of $f$). Then
\[ \frac{1}{L} E[\|\nabla f(x_R)\|^2] \le \frac{L D_f^2}{N} + \left(D + \frac{D_f^2}{D}\right)\frac{M}{\sqrt{N}}. \]
In the case of convex $f$, get a similar bound:
\[ E[f(x_R) - f^*] \le \frac{L D_X^2}{N} + \left(D + \frac{D_X^2}{D}\right)\frac{M}{\sqrt{N}}, \]
where $D_X = \|x_1 - x^*\|$.
Two-Phase Nonconvex SG
Ghadimi and Lan extend this method to achieve complexity like $1/k$, by running multiple processes in parallel, and picking the solution with the best sample gradient. Prove a large-deviation convergence bound: user specifies required threshold $\epsilon$ for $\|\nabla f(x)\|^2$, and required likelihood $1 - \Lambda$ (for small positive $\Lambda$).
Choose trial set $\{j_1, j_2, \dots, j_T\}$ of indices from $\{1, 2, \dots, m\}$. Set $S = \log(2/\Lambda)$;
For $s = 1, 2, \dots, S$, run SG with a random number of iterates, to produce $x_s$;
For $s = 1, 2, \dots, S$, calculate
\[ g_s = \frac{1}{T}\sum_{k=1}^{T} \nabla f_{j_k}(x_s). \]
Output the $x_s$ for which $\|g_s\|$ is minimized.
Sparse Optimization: Motivation
Many applications need structured, approximate solutions of optimization formulations, rather than exact solutions.
More Useful, More Credible
Structured solutions are easier to understand.
They correspond better to prior knowledge about the solution.
They may be easier to use and actuate.
Extract just the essential meaning from the data set, not the less important effects.
Less Data Needed
Structured solution lies in lower-dimensional spaces - need to gather / sample less data to capture it.
The structural requirements have deep implications for how we formulate and solve these problems.
Regularized Formulations: Shrinking Algorithms
Consider the formulation
\[ \min_x \; f(x) + \tau\psi(x). \]
Regularizer $\psi$ is often nonsmooth but "simple." Often the problem is easy to solve when $f$ is replaced by a quadratic with diagonal Hessian:
\[ \min_z \; g^T(z - x) + \frac{1}{2\alpha}\|z - x\|_2^2 + \tau\psi(z). \]
Equivalently,
\[ \min_z \; \frac{1}{2\alpha}\|z - (x - \alpha g)\|_2^2 + \tau\psi(z). \]
Define the shrink operator as the arg min:
\[ S_\tau(y, \alpha) := \arg\min_z \; \frac{1}{2\alpha}\|z - y\|_2^2 + \tau\psi(z). \]
Interesting Regularizers and their Shrinks
Cases for which the subproblem is simple (a code sketch follows this list):
$\psi(z) = \|z\|_1$. Thus $S_\tau(y, \alpha) = \operatorname{sign}(y)\max(|y| - \alpha\tau, 0)$. When $y$ is complex, have
\[ S_\tau(y, \alpha) = \frac{\max(|y| - \tau\alpha, 0)}{\max(|y| - \tau\alpha, 0) + \tau\alpha}\, y. \]
$\psi(z) = \sum_{g \in \mathcal{G}} \|z_{[g]}\|_2$, where the $z_{[g]}$, $g \in \mathcal{G}$, are non-overlapping subvectors of $z$. Here
\[ S_\tau(y, \alpha)_{[g]} = \frac{\max(\|y_{[g]}\| - \tau\alpha, 0)}{\max(\|y_{[g]}\| - \tau\alpha, 0) + \tau\alpha}\, y_{[g]}. \]
$\psi(x) = I_\Omega(x)$: indicator function for a closed convex set $\Omega$. Then $S_\tau(y, \alpha)$ is the projection of $y$ onto $\Omega$.
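A sketch of these shrink operators in code (hypothetical helper names; groups is a list of index arrays for disjoint groups, and project is a user-supplied projection onto $\Omega$):

    import numpy as np

    def shrink_l1(y, alpha, tau):
        # psi(z) = ||z||_1: componentwise soft-thresholding.
        return np.sign(y) * np.maximum(np.abs(y) - alpha * tau, 0.0)

    def shrink_group(y, alpha, tau, groups):
        # psi(z) = sum_g ||z_[g]||_2 over disjoint groups: scale or zero each block.
        z = np.zeros_like(y)
        for g in groups:
            norm_g = np.linalg.norm(y[g])
            if norm_g > alpha * tau:
                z[g] = (1.0 - alpha * tau / norm_g) * y[g]
        return z

    def shrink_indicator(y, project):
        # psi = indicator of a closed convex set: the shrink is the projection.
        return project(y)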
Basic Prox-Linear Algorithm
(Fukushima and Mine, 1981) for solving $\min_x f(x) + \tau\psi(x)$.
0: Choose $x_0$.
k: Choose $\alpha_k > 0$ and set
\[ x_{k+1} = S_\tau(x_k - \alpha_k \nabla f(x_k); \alpha_k) = \arg\min_z \; \nabla f(x_k)^T(z - x_k) + \frac{1}{2\alpha_k}\|z - x_k\|_2^2 + \tau\psi(z). \]
This approach goes by many names, including forward-backward splitting, shrinking / thresholding.
Straightforward, but can be fast when the regularization is strong (i.e. solution is "highly constrained").
Can show convergence for steps $\alpha_k \in (0, 2/L)$, where $L$ is the bound on $\nabla^2 f$. (Like a short-step gradient method.)
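For the LASSO instance $f(x) = \tfrac{1}{2}\|Ax - b\|_2^2$, $\psi = \|\cdot\|_1$, the prox-linear iteration reduces to the iterative soft-thresholding sketch below (the function name and the fixed step $\alpha = 1/L$ are illustrative choices):

    import numpy as np

    def prox_linear_lasso(A, b, tau, x0, iters=500):
        # x_{k+1} = S_tau(x_k - alpha*grad f(x_k); alpha), alpha = 1/L, L = ||A||_2^2
        alpha = 1.0 / np.linalg.norm(A, 2) ** 2
        x = x0.copy()
        for _ in range(iters):
            y = x - alpha * (A.T @ (A @ x - b))                        # gradient step
            x = np.sign(y) * np.maximum(np.abs(y) - alpha * tau, 0.0)  # shrink step
        return x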
Enhancements
Alternatively, since $\alpha_k$ plays the role of a steplength, can adjust it to get better performance and guaranteed convergence.
Backtracking: decrease $\alpha_k$ until a sufficient decrease condition holds.
Use Barzilai-Borwein strategies to get nonmonotonic methods. By enforcing sufficient decrease every 10 iterations (say), still get global convergence.
Embed accelerated first-order methods (e.g. FISTA) by redefining the linear term $g$ appropriately.
Do continuation on the parameter $\tau$. Solve first for large $\tau$ (easier problem) then decrease $\tau$ in steps toward a target, using the previous solution as a warm start for the next value.
Decomposition / Coordinate Relaxation
For $\min f(x)$, at iteration $k$, choose a subset $\mathcal{G}_k \subset \{1, 2, \dots, n\}$ and take a step $d_k$ only in these components, i.e. fix $[d_k]_i = 0$ for $i \notin \mathcal{G}_k$.
Fewer variables: smaller and cheaper to deal with than the full problem. Can
take a reduced gradient step in the $\mathcal{G}_k$ components;
take multiple "inner iterations";
actually solve the reduced subproblem in the space defined by $\mathcal{G}_k$.
Example: Decomposition in SVM
Decomposition has long been popular for solving the dual (QP) formulation of support vector machines (SVM), since the number of variables (= number of training examples) may be very large.
SMO, LIBSVM: Each $\mathcal{G}_k$ has two components; different heuristics for choosing $\mathcal{G}_k$.
LASVM: Again $|\mathcal{G}_k| = 2$, with focus on online setting.
SVM-light: Small $|\mathcal{G}_k|$ (default 10).
GPDT: Larger $|\mathcal{G}_k|$ (default 400) with gradient projection solver as inner loop.
Deterministic Results
Generalized Gauss-Seidel condition requires only that each coordinate is "touched" at least once every $T$ iterations:
\[ \mathcal{G}_k \cup \mathcal{G}_{k+1} \cup \dots \cup \mathcal{G}_{k+T-1} = \{1, 2, \dots, n\}. \]
Can show global convergence (e.g. Tseng and Yun, 2009; Wright, 2012).
There are also results on
global linear convergence rates;
optimal manifold identification (for problems with nonsmooth regularizer);
fast local convergence for an algorithm that takes reduced steps on the estimated optimal manifold.
Stochastic Coordinate Descent
Randomized coordinate descent (RCDC)
(Richtarik and Takac, 2012; see also Nesterov, 2012)
For min f (x).
Partitions the components of x into subvectors;
Makes a random selection of one partition to update at each iteration;
Exploits knowledge of the partial Lipschitz constant for each partition in choosing the step.
Allows parallel implementation.
At each iteration, pick a partition at random, and take a short-step steepest descent step on the variables in that component.
RCDC Details
Partition components $\{1, 2, \dots, n\}$ into $m$ blocks; the columns of the $n \times n$ identity matrix corresponding to block $[i]$ are denoted by $U_i$.
Denote by $L_i$ the partial Lipschitz constant on partition $[i]$:
\[ \|\nabla_{[i]} f(x + U_i t) - \nabla_{[i]} f(x)\| \le L_i \|t\|. \]
Fix probabilities of choosing each partition: $p_i$, $i = 1, 2, \dots, m$.
Iteration $k$:
Choose partition $i_k \in \{1, 2, \dots, m\}$ with probability $p_i$;
Set $d_{k,i} = -(1/L_{i_k}) \nabla_{[i_k]} f(x_k)$;
Set $x_{k+1} = x_k + U_{i_k} d_{k,i}$.
For convex $f$ and uniform weights $p_i = 1/m$, can prove high-probability convergence of $f$ to within a specified threshold of $f(x^*)$ at a $1/k$ rate.
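A concrete sketch for the quadratic $f(x) = \tfrac{1}{2}\|Ax - b\|_2^2$ with single-coordinate blocks, where $L_i = \|A_{:,i}\|^2$ and the partial gradient is $A_{:,i}^T(Ax - b)$ (function name and uniform sampling are illustrative):

    import numpy as np

    def rcdc_least_squares(A, b, x0, iters=5000, seed=0):
        rng = np.random.default_rng(seed)
        L = np.sum(A * A, axis=0)          # per-coordinate Lipschitz constants L_i
        x = x0.copy()
        r = A @ x - b                      # keep the residual Ax - b up to date
        for _ in range(iters):
            i = rng.integers(A.shape[1])   # uniform probabilities p_i = 1/n
            d = -(A[:, i] @ r) / L[i]      # short step on coordinate i
            x[i] += d
            r += d * A[:, i]               # cheap residual update
        return x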
RCDC Result
\[ \|x\|_L := \left( \sum_{i=1}^{m} L_i \|x_{[i]}\|^2 \right)^{1/2} \]
Weighted measure of level set size:
\[ \mathcal{R}_L(x) := \max_y \max_{x^* \in X^*} \{ \|y - x^*\|_L : f(y) \le f(x) \}. \]
Assuming that desired precision $\epsilon$ has
\[ \epsilon < \min(\mathcal{R}_L^2(x_0), f(x_0) - f^*) \]
and defining
\[ K := \frac{2n\mathcal{R}_L^2(x_0)}{\epsilon} \log\left( \frac{f(x_0) - f^*}{\epsilon\rho} \right), \]
we have for $k \ge K$ that
\[ P(f(x_k) - f^* \le \epsilon) \ge 1 - \rho. \]
Augmented Lagrangian Methods and Splitting
Consider linearly constrained problem:
\[ \min f(x) \quad \text{s.t.} \quad Ax = b. \]
Augmented Lagrangian is
\[ \mathcal{L}(x, \lambda; \rho) := f(x) + \lambda^T(Ax - b) + \frac{\rho}{2}\|Ax - b\|_2^2, \]
where $\rho > 0$. Basic augmented Lagrangian / method of multipliers is
\[ x_k = \arg\min_x \mathcal{L}(x, \lambda_{k-1}; \rho_k); \qquad \lambda_k = \lambda_{k-1} + \rho_k(Ax_k - b); \]
(choose $\rho_{k+1}$).
Extends to inequality constraints, nonlinear constraints.
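A small sketch of the method-of-multipliers loop (illustrative assumptions: the inner minimization is handed to scipy's BFGS, which presumes a smooth $f$, and $\rho$ is held fixed rather than updated):

    import numpy as np
    from scipy.optimize import minimize

    def method_of_multipliers(f, A, b, x0, rho=10.0, outer=20):
        lam = np.zeros(A.shape[0])
        x = x0.copy()
        for _ in range(outer):
            aug = lambda v: f(v) + lam @ (A @ v - b) + 0.5 * rho * np.sum((A @ v - b) ** 2)
            x = minimize(aug, x, method="BFGS").x      # x_k = argmin_x L(x, lam; rho)
            lam = lam + rho * (A @ x - b)              # multiplier update
        return x, lam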
Features of Augmented Lagrangian
If we set $\lambda = 0$, recover the quadratic penalty method.
If $\lambda = \lambda^*$, the minimizer of $\mathcal{L}(x, \lambda^*; \rho)$ is the true solution $x^*$, for any $\rho \ge 0$.
AL provides successive estimates $\lambda_k$ of $\lambda^*$, thus can achieve solutions with good accuracy without driving $\rho_k$ to $\infty$.
History of Augmented Lagrangian
Dates from 1969: Hestenes, Powell.
Developments in 1970s, early 1980s by Rockafellar, Bertsekas,....
Lancelot code for nonlinear programming: Conn, Gould, Toint, around 1990.
Largely lost favor as an approach for general nonlinear programming during the next 15 years.
Recent revival in the context of sparse optimization and associated domain areas, in conjunction with splitting / coordinate descent techniques.
Separable Objectives: ADMM
Alternating Directions Method of Multipliers (ADMM) arises when the objective in the basic linearly constrained problem is separable:
\[ \min_{(x,z)} \; f(x) + h(z) \quad \text{subject to} \quad Ax + Bz = c, \]
for which
\[ \mathcal{L}(x, z, \lambda; \rho) := f(x) + h(z) + \lambda^T(Ax + Bz - c) + \frac{\rho}{2}\|Ax + Bz - c\|_2^2. \]
Standard augmented Lagrangian would minimize $\mathcal{L}(x, z, \lambda; \rho)$ over $(x, z)$ jointly, but these are coupled through the quadratic term, so the advantage of separability is lost.
Instead, minimize over $x$ and $z$ separately and sequentially:
\[ x_k = \arg\min_x \mathcal{L}(x, z_{k-1}, \lambda_{k-1}; \rho_k); \]
\[ z_k = \arg\min_z \mathcal{L}(x_k, z, \lambda_{k-1}; \rho_k); \]
\[ \lambda_k = \lambda_{k-1} + \rho_k(Ax_k + Bz_k - c). \]
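A sketch of ADMM on one concrete instance, the LASSO split $f(x) = \tfrac{1}{2}\|Dx - d\|^2$, $h(z) = \tau\|z\|_1$, constraint $x - z = 0$ (so $A = I$, $B = -I$, $c = 0$ in the notation above); the names and the fixed $\rho$ are illustrative:

    import numpy as np

    def admm_lasso(D, d, tau, rho=1.0, iters=200):
        n = D.shape[1]
        x, z, lam = np.zeros(n), np.zeros(n), np.zeros(n)
        M = D.T @ D + rho * np.eye(n)          # fixed matrix for the x-subproblem
        for _ in range(iters):
            x = np.linalg.solve(M, D.T @ d + rho * z - lam)          # min over x
            y = x + lam / rho
            z = np.sign(y) * np.maximum(np.abs(y) - tau / rho, 0.0)  # min over z (shrink)
            lam = lam + rho * (x - z)                                # multiplier update
        return x, z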
ADMM
Each iteration does a round of block-coordinate descent in (x , z).
The minimizations over $x$ and $z$ add only a quadratic term to $f$ and $h$, respectively. This does not alter the cost much.
Can perform these minimizations inexactly.
Convergence is often slow, but sufficient for many applications.
Many recent applications to compressed sensing, image processing, matrix completion, sparse PCA.
ADMM has a rich collection of antecedents. For an excellent recent survey, including a diverse collection of machine learning applications, see (Boyd et al, 2011).
ADMM for Consensus Optimization
Given
\[ \min \sum_{i=1}^{m} f_i(x), \]
form $m$ copies of $x$, with the original $x$ as a "master" variable:
\[ \min_{x, x^1, x^2, \dots, x^m} \; \sum_{i=1}^{m} f_i(x^i) \quad \text{subject to} \quad x^i = x, \; i = 1, 2, \dots, m. \]
Apply ADMM, with $z = (x^1, x^2, \dots, x^m)$, to get
\[ x^i_k = \arg\min_{x^i} \; f_i(x^i) + (\lambda^i_{k-1})^T(x^i - x_{k-1}) + \frac{\rho_k}{2}\|x^i - x_{k-1}\|_2^2, \quad \forall i, \]
\[ x_k = \frac{1}{m}\sum_{i=1}^{m} \left( x^i_k + \frac{1}{\rho_k}\lambda^i_{k-1} \right), \]
\[ \lambda^i_k = \lambda^i_{k-1} + \rho_k(x^i_k - x_k), \quad \forall i. \]
Can do the $x^i$ updates in parallel. Synchronize for the $x$ update.
ADMM for Awkward Intersections
The feasible set is sometimes an intersection of two or more convex sets that are easy to handle separately (e.g. projections are easily computable), but whose intersection is more difficult to work with.
Example: Optimization over the cone of doubly nonnegative matrices:
\[ \min_X \; f(X) \quad \text{s.t.} \quad X \succeq 0, \; X \ge 0. \]
General form:
\[ \min f(x) \quad \text{s.t.} \quad x \in \Omega_i, \; i = 1, 2, \dots, m. \]
Just consensus optimization, with indicator functions for the sets.
\[ x_k = \arg\min_x \; f(x) + \sum_{i=1}^{m}\left[ (\lambda^i_{k-1})^T(x - x^i_{k-1}) + \frac{\rho_k}{2}\|x - x^i_{k-1}\|_2^2 \right], \]
\[ x^i_k = \arg\min_{x^i \in \Omega_i} \; (\lambda^i_{k-1})^T(x_k - x^i) + \frac{\rho_k}{2}\|x_k - x^i\|_2^2, \quad \forall i, \]
\[ \lambda^i_k = \lambda^i_{k-1} + \rho_k(x_k - x^i_k), \quad \forall i. \]
Splitting+Lagrangian+Hogwild!
Recalling Hogwild! for minimizing
\[ f(x) = \sum_{i=1}^{m} f_i(x) \]
using parallel SG, where each core evaluates a random partial gradient $\nabla f_i(x)$ and updates a centrally stored $x$. Results were given showing good speedups on a 12-core machine.
The machine has two sockets, so need frequent cross-socket cache fetches.
Re et al (2012) use the same 12-core, 2-socket platform, but
each socket has its own copy of $x$, each updated by half the cores;
the copies "synchronize" periodically by consensus, or by updating a common multiplier $\lambda$.
Details
Replace $\min f(x)$ by the split problem
\[ \min_{x,z} \; f(x) + f(z) \quad \text{s.t.} \quad x = z. \]
Min-max formulation is
\[ \min_{x,z} \max_\lambda \; f(x) + f(z) + \lambda^T(x - z). \]
Store $x$ on one socket and $z$ on the other. Algorithm step $k$:
Apply Hogwild! independently on the two sockets, for a certain amount of time, to
\[ \min_x \; f(x) + \lambda_k^T x, \qquad \min_z \; f(z) - \lambda_k^T z, \]
to obtain $x_k$ and $z_k$.
Swap values $x_k$ and $z_k$ between sockets and update
\[ \lambda_{k+1} := \lambda_k + \gamma(x_k - z_k), \quad \text{for some } \gamma > 0. \]
Over 100 epochs for the SVM problem RCV1, without permutation, the speedup factor is 4.6 (.77s vs .17s).
Partially Separable: Higher-Order Methods
Consider again
\[ f(x) = \frac{1}{m}\sum_{i=1}^{m} f_i(x). \]
Newton's method obtains a step by solving
\[ \nabla^2 f(x_k) d_k = -\nabla f(x_k), \]
and setting $x_{k+1} = x_k + d_k$. We have
\[ \nabla^2 f(x) = \frac{1}{m}\sum_{i=1}^{m} \nabla^2 f_i(x), \qquad \nabla f(x) = \frac{1}{m}\sum_{i=1}^{m} \nabla f_i(x). \]
The "obvious" implementation requires a scan through the complete data set, and formation and solution of an $n \times n$ linear system. This is usually impractical for large $m$, $n$.
Sampling
(Martens, 2010; Byrd et al, 2011)
Approximate the derivatives by sampling over a subset of the m terms.
Choose $\mathcal{X} \subset \{1, 2, \dots, m\}$ and $\mathcal{S} \subset \mathcal{X}$, and set
\[ H^k_{\mathcal{S}} = \frac{1}{|\mathcal{S}|}\sum_{i \in \mathcal{S}} \nabla^2 f_i(x_k), \qquad g^k_{\mathcal{X}} = \frac{1}{|\mathcal{X}|}\sum_{i \in \mathcal{X}} \nabla f_i(x_k). \]
Approximate Newton step satisfies
\[ H^k_{\mathcal{S}} d_k = g^k_{\mathcal{X}}. \]
Typically, $\mathcal{S}$ is just 1% to 10% of the terms.
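One way to sketch a single sampled step (illustrative names and sample sizes; grad_fi and hess_fi return $\nabla f_i$ and $\nabla^2 f_i$, and the usual Newton sign convention $d = -H^{-1}g$ is used):

    import numpy as np

    def sampled_newton_step(x, grad_fi, hess_fi, m, seed=0):
        rng = np.random.default_rng(seed)
        X = rng.choice(m, size=max(1, m // 2), replace=False)   # gradient sample
        S = rng.choice(X, size=max(1, m // 20), replace=False)  # smaller Hessian subsample
        g = np.mean([grad_fi(x, i) for i in X], axis=0)
        H = np.mean([hess_fi(x, i) for i in S], axis=0)
        return np.linalg.solve(H, -g)                           # step d_k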
Conjugate Gradient and "Hessian-Free" Implementation
We can use conjugate gradient (CG) to find an approximate solution. Each iteration of CG requires one matrix-vector multiplication with $H^k_{\mathcal{S}}$, plus some vector operations ($O(n)$ cost).
Hessian-vector product
\[ H^k_{\mathcal{S}} v = \frac{1}{|\mathcal{S}|}\sum_{i \in \mathcal{S}} \nabla^2 f_i(x_k) v. \]
Since $\nabla^2 f_i$ often is sparse, each product $\nabla^2 f_i(x_k) v$ is cheap.
We can even avoid second-derivative calculations altogether, by using a finite-difference approximation to the matrix-vector product:
\[ \nabla^2 f_i(x_k) v \approx [\nabla f_i(x_k + \epsilon v) - \nabla f_i(x_k)]/\epsilon. \]
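A minimal sketch of that finite-difference product (the function name and eps are illustrative); plugging a routine like this into a standard CG loop gives a "Hessian-free" inexact Newton step:

    import numpy as np

    def hess_vec_fd(grad_f, x, v, eps=1e-6):
        # grad^2 f(x) v ~ [grad f(x + eps*v) - grad f(x)] / eps
        return (grad_f(x + eps * v) - grad_f(x)) / eps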
Sampling and coordinate descent can be combined: see Wright (2012) for application of this combination to regularized logistic regression.
Conclusions
I've given an incomplete survey of optimization techniques that may be useful in analysis of large datasets, particularly for learning problems.
Some important topics omitted, e.g.
quasi-Newton methods (L-BFGS),
nonconvex regularizers,
manifold identification / variable selection.
For further information, see slides, videos, and reading lists for tutorials that I have presented recently, on some of these topics. Or mail me.
FIN