Optimization and Machine Learning Training Algorithms for Fitting Numerical Physics Models
Raghu Bollapragada† Matt Menickelly†
Witold Nazarewicz ‡,§ Jared O’Neal †
Paul-Gerhard Reinhard ∗ Stefan M. Wild†
†Argonne National Laboratory, ‡Michigan State University, §University of Warsaw, ∗Universität Erlangen-Nürnberg
June 9, 2020
Optimizing Parameters of Fayans’ Functional
• Use the Fayans functional form proposed in [S.A. Fayans, JETP Lett. 68, 169 (1998)]
• Computer model FaNDF0 (m(ν;x)) [P.-G. Reinhard & W. Nazarewicz, PhysRevC (2017)]
• 13 free computer model parameters (x) to be fitted: ASP, ASYMM, EOVERA, COMPR, RHO NM, DASYM, HGRADP, C0NABJ, C1NABJ, H2VM, FXI, HGRADXI, H1XI
• Use the “even” dataset (d1, ..., d198) from that study, corresponding to 198 observables (ν1, ..., ν198) derived from 72 nucleus configurations:
Class                          Number of Observables
Binding Energy                 63
RMS Charge Radius              52
Diffraction Radius             28
Surface Thickness              26
Neutron Single-Level Energy     4
Proton Single-Level Energy      5
Isotopic Shifts                 3
Neutron Pairing Gap             5
Proton Pairing Gap             11
Other                           1
σi-weighted least squares optimization problem:

\[
\min_{x \in \mathbb{R}^{13}} f(x), \quad \text{where} \quad f(x) = \frac{1}{198} \sum_{i=1}^{198} \left( \frac{m(\nu_i; x) - d_i}{\sigma_i} \right)^2
\]
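As a concrete reference, here is a minimal sketch of this objective in code; the callable `model` and all names are illustrative stand-ins (in the study each evaluation of m(νi;x) runs the FaNDF0 computer model on many cores):

```python
import numpy as np

def chi2_objective(x, nu, d, sigma, model):
    """sigma-weighted least-squares objective f(x).

    x     : the 13 Fayans parameters
    nu    : specifications of the 198 observables nu_i
    d     : corresponding data values d_i
    sigma : adopted errors sigma_i
    model : callable model(nu_i, x) returning m(nu_i; x),
            e.g., a wrapper around the FaNDF0 computer model
    """
    residuals = np.array([(model(nu_i, x) - d_i) / s_i
                          for nu_i, d_i, s_i in zip(nu, d, sigma)])
    return np.mean(residuals ** 2)
```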
Why optimizers would like this problem
σi-weighted least squares optimization problem:

\[
\min_{x \in \mathbb{R}^{13}} f(x), \quad \text{where} \quad f(x) = \frac{1}{198} \sum_{i=1}^{198} \left( \frac{m(\nu_i; x) - d_i}{\sigma_i} \right)^2
\]
Computing
• We can compute all observables m(νi;x) in a couple of seconds using 192 cores
• Inexpensive: benchmarking optimization algorithms is not prohibitive
This makes for a good case study!
An eye on general nuclear data-fitting problems of the form
\[
\min_{x \in \mathbb{R}^{n}} f(x), \quad \text{where} \quad f(x) = \frac{1}{n_d} \sum_{i=1}^{n_d} \left( \frac{m(\nu_i; x) - d_i}{\sigma_i} \right)^2,
\]
where
• nd might be (substantially) larger than 198 ...
• but, like the Fayans computer model, m(νi;x) doesn’t admit analytic gradients.
Supervised Machine Learning (in one slide!)
The steps
1. Given a dataset of feature-data pairs (νi, di), i = 1, ..., nd.
2. Let x parameterize a model m(νi; x).
3. Define a loss function ℓ(m(ν;x), d) that (in some sense) penalizes discrepancies between the model prediction m(ν;x) and data d.
4. An optimization problem involving the empirical average results:

\[
\min_{x} \; \frac{1}{n_d} \sum_{i=1}^{n_d} \ell(m(\nu_i; x), d_i).
\]

5. Solve the problem with stochastic gradient (SG) methods (a minimal sketch follows).
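Below is a minimal sketch of step 5 as a plain mini-batch SG loop; `grad_loss` and the other names are illustrative assumptions, and the loop presumes the loss gradient is actually available (which, for the physics problem here, it is not):

```python
import numpy as np

def sgd(x0, nu, d, grad_loss, batch_size=32, eta=1e-3, epochs=100, seed=0):
    """Plain mini-batch stochastic gradient descent on the empirical average loss.

    grad_loss(x, nu_i, d_i) must return the gradient of ell(m(nu_i; x), d_i)
    with respect to x.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    n = len(d)
    for _ in range(epochs):
        # One pass over a random permutation of the data, split into mini-batches.
        for batch in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            g = np.mean([grad_loss(x, nu[i], d[i]) for i in batch], axis=0)
            x -= eta * g
    return x
```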
Silly toy example
1. Height and weight (νi) - dog vs. cat (di)
2. x are the weights of a CNN m(νi;x)
3. Loss function ℓ(m(ν;x), d) is classification cross entropy
4. Assemble the optimization problem (PyTorch/TensorFlow?)
5. Optimal x∗ ⟹ classifier m(ν;x∗)
Optimization and Machine Learning
\[
\min_{x \in \mathbb{R}^{n}} f(x), \quad \text{where} \quad f(x) = \frac{1}{n_d} \sum_{i=1}^{n_d} \left( \frac{m(\nu_i; x) - d_i}{\sigma_i} \right)^2
\]
One can interpret f(x) as an empirical square loss function for regression - it looks just like supervised training of ML models!
In that case, why don’t we just use stochastic gradient (SG) methods¹?
• No gradients of m(νi;x) ... need derivative-free techniques.
• But even with gradients, no great reason to believe SG methods should outperform “more traditional” optimization methods.

We performed a comparison of derivative-free solvers and approximate SG methods on this case study problem.
Solvers
• POUNDERS
• Nelder-Mead
• Kiefer-Wolfowitz Iteration
• Two-point Bandit
• Derivative-free ADAQN
¹ or one of the many, many, many variants of SG methods, e.g., Momentum, Polyak Averaging, Adam, RMSProp, AdaGrad, AdaDelta, AdaMax, AMSGrad, NAG, Nadam, SVRG, SAG, SARAH, SPIDER, ...
The Solvers - POUNDERS
• Previous use in calibrating nuclear models, e.g., [S.M. Wild, J. Sarich & N. Schunck, JPG: NPP (2015)]
• Model-based derivative-free optimization
• Builds (quadratic) interpolation models of the objective on evaluated pairs (x, f(x)).
• Exploits known square loss function structure of f(x) for better Hessian approximation (see the sketch below).
• Treats the objective as deterministic (all nd = 198 observables are computed for each function value).
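To make the structure-exploitation point concrete, here is a minimal sketch (not the POUNDERS implementation itself) of the structured Hessian it targets: writing r_i(x) = (m(νi;x) − di)/σi, the Hessian of f is (2/nd)(JᵀJ + Σi ri ∇²ri), so models of the individual residuals yield better curvature information than a single model of f. All names below are illustrative.

```python
import numpy as np

def structured_hessian(residuals, jacobian, residual_hessians):
    """Hessian approximation for f(x) = (1/nd) * sum_i r_i(x)**2, assembled from
    (model-based) estimates of the residuals r_i, their Jacobian J
    (rows = grad r_i), and per-residual Hessian estimates."""
    nd = len(residuals)
    gauss_newton = jacobian.T @ jacobian                                   # J^T J term
    curvature = sum(r_i * H_i for r_i, H_i in zip(residuals, residual_hessians))
    return (2.0 / nd) * (gauss_newton + curvature)
```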
The Solvers - Nelder-Mead simplex algorithm
• A very popular (direct search) method of derivative-free optimization [J.A. Nelder & R. Mead, The Computer Journal (1965)]
• Maintains a simplex (n+1 points in R^n) of evaluated pairs (x, f(x)).
• Relative values of f(x) determine which simplex vertices to delete and which new vertices to add and evaluate.
• Treats the objective as deterministic (all nd = 198 observables are computed for each function value).
[Figure: schematic of Nelder-Mead simplex updates, showing the current vertices x(1), ..., x(n+1) and candidate new vertices xnew]
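For reference, running Nelder-Mead on a 13-dimensional problem takes only a few lines with SciPy; the quadratic below is a stand-in objective, not the actual FaNDF0-based f:

```python
import numpy as np
from scipy.optimize import minimize

# Stand-in objective; in the study, f(x) evaluates all 198 observables.
def f(x):
    return np.sum((x - 1.0) ** 2)

x0 = np.zeros(13)
result = minimize(f, x0, method="Nelder-Mead",
                  options={"maxfev": 5000, "xatol": 1e-8, "fatol": 1e-8})
print(result.x, result.fun)
```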
The Solvers - Kiefer-Wolfowitz
• Just replace the gradients in the stochastic gradient method with finite differences!
• Seminal paper [J. Kiefer & J. Wolfowitz, Annals of Math. Stats. (1952)] was published one year after Robbins and Monro’s classic paper on SG [H. Robbins & S. Monro, Annals of Math. Stats. (1951)].
Let B ⊂ {1, ..., nd} be a random batch of observables drawn without replacement. Define the function

\[
f_B(x) = \frac{1}{|B|} \sum_{i \in B} \left( \frac{m(\nu_i; x) - d_i}{\sigma_i} \right)^2.
\]
\[
\text{SG: } x_{k+1} \leftarrow x_k - \eta_k \nabla f_{B_k}(x_k)
\qquad
\text{KW: } x_{k+1} \leftarrow x_k - \eta_k g_k
\]

gk is a finite difference approximation of ∇fBk(xk) computed from values of fBk.
Batchsize (|Bk|) and stepsizes {ηk} are typically treated as hyperparameters. (So is the finite difference parameter h.)
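A minimal sketch of one KW iteration follows; the callable `f_batch(x, batch)` standing in for an evaluation of f_B is an assumption, and forward differences along coordinate directions are used:

```python
import numpy as np

def kw_step(x, f_batch, batch, eta, h):
    """One Kiefer-Wolfowitz update: an SG step with a forward-difference
    approximation g of the mini-batch gradient (n + 1 evaluations of f_B)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    f0 = f_batch(x, batch)
    g = np.empty(n)
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        g[j] = (f_batch(x + e, batch) - f0) / h   # forward difference in coordinate j
    return x - eta * g
```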
The Solvers - Two-Point Bandit Methods
• Employs a two-point approximation of the gradient.
• (Compare to the n+1 or 2n points needed for finite difference gradients in KW.)
• Sample a random unit direction u and evaluate fB(x) and fB(x + hu).
• Let

\[
d = \frac{f_B(x + hu) - f_B(x)}{h}\, u.
\]

\[
\text{SG: } x_{k+1} \leftarrow x_k - \eta_k \nabla f_{B_k}(x_k)
\qquad
\text{KW: } x_{k+1} \leftarrow x_k - \eta_k g_k
\qquad
\text{Bandit: } x_{k+1} \leftarrow x_k - \eta_k d_k
\]
Once again, batchsize, stepsize, and finite difference parameter h are hyperparameters.
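A corresponding sketch of one two-point bandit step (again, `f_batch` is an assumed stand-in for evaluating f_B):

```python
import numpy as np

def bandit_step(x, f_batch, batch, eta, h, rng):
    """Two-point (random unit direction) gradient estimate and SG-style step;
    only 2 evaluations of f_B per iteration."""
    x = np.asarray(x, dtype=float)
    u = rng.standard_normal(x.size)
    u /= np.linalg.norm(u)                                    # random unit direction
    d = (f_batch(x + h * u, batch) - f_batch(x, batch)) / h * u
    return x - eta * d
```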
The Solvers - Derivative-Free ADAQN
• An L-BFGS method (a popular quasi-Newton method) ...
• but gradients are approximated by finite differencing.
• Specifically designed for empirical loss functions of the form f(x).
• Requests variable batchsizes |Bk|.
• |Bk| depends on algorithmically determined estimates of the variance of the approximate gradients (sketched below).
• Key citations: derivative-based: [R. Bollapragada, J. Nocedal, D. Mudigere, H. Shi & P. Tang, ICML (2018)]; derivative-free: [R. Bollapragada & S.M. Wild, ICML Workshop (2019)]
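A rough sketch of the variance-driven batchsize idea, using a simplified norm-test-style rule purely for illustration (this is not the exact ADAQN criterion):

```python
import numpy as np

def suggest_batch_size(per_sample_grads, theta=0.9):
    """Grow the batch when the estimated variance of the mini-batch gradient is
    large relative to its norm.  per_sample_grads has shape (|B|, n) and holds
    (finite-difference) gradient estimates for each sampled observable."""
    G = np.asarray(per_sample_grads, dtype=float)
    b = G.shape[0]
    g_bar = G.mean(axis=0)
    sample_var = np.sum((G - g_bar) ** 2) / max(b - 1, 1)     # total sample variance
    # Variance of the batch-mean gradient vs. theta^2 * ||g_bar||^2:
    if sample_var / b > theta ** 2 * np.dot(g_bar, g_bar):
        return 2 * b          # increase the batchsize
    return b                  # keep the current batchsize
```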
Overall Comparison of Function Value Trajectories
Let’s focus on POUNDERS, KW, and ADAQN ...
The Case (For?) POUNDERS
Only one hyperparameter to tune (the initial trust-region (TR) radius), and performance is largely insensitive to it!
The Case (Against?) KW
Figure: Summary results for KW, fixing 3 different batchsizes and comparing across stepsizes.

• Lots of hyperparameter tuning (batchsizes and stepsizes shown here) - very expensive in core hours.
• Much variability across hyperparameters.
• Smaller batch sizes tend to result in computational failure from the computer model!
The Case (For?) ADAQN
Only tuned one hyperparameter - the initial batchsize.
Reliable performance, which seems to improve with smaller initial batchsizes!
Towards larger problems
Figure: Resource utilization plots for the final solvers
A study of parallel resource exploitation.
On the x-axis: the number of observables m(νi; x) one could compute simultaneously.
On the y-axis: the (median) number of rounds of full utilization of parallel resources of the size on the x-axis needed to reduce the optimality gap to a fraction τ.
Future Directions
• From the physics: More models! Bigger models?!
• From the mathematics: Consider randomized (sampled) variants of POUNDERS for a “best of both worlds” in terms of reliability and speed to solution?
Thank you!