Optimization and Machine Learning Training Algorithms for Fitting Numerical Physics Models
Raghu Bollapragada† Matt Menickelly†
Witold Nazarewicz ‡,§ Jared O’Neal †
Paul-Gerhard Reinhard ∗ Stefan M. Wild†
†Argonne National Laboratory, ‡Michigan State University, §University of Warsaw, ∗Universität Erlangen-Nürnberg
June 9, 2020
Optimizing Parameters of Fayans’ Functional
• Use the Fayans functional form proposed in [S.A. Fayans, JETP Lett. 68, 169 (1998)]
• Computer model FaNDF0 (m(ν;x)) [P.-G. Reinhard & W. Nazarewicz, PhysRevC (2017)]
• 13 free computer model parameters (x) to be fitted: ASP, ASYMM, EOVERA, COMPR, RHO NM, DASYM, HGRADP, C0NABJ, C1NABJ, H2VM, FXI, HGRADXI, H1XI
• Use the “even” dataset (d1, ..., d198) from that study, corresponding to 198 observables (ν1, ..., ν198) derived from 72 nucleus configurations:
Class                          Number of Observables
Binding Energy                 63
RMS Charge Radius              52
Diffraction Radius             28
Surface Thickness              26
Neutron Single-Level Energy     4
Proton Single-Level Energy      5
Isotopic Shifts                 3
Neutron Pairing Gap             5
Proton Pairing Gap             11
Other                           1
σi-weighted least squares optimization problem:

\[
\min_{x \in \mathbb{R}^{13}} f(x), \quad \text{where} \quad f(x) = \frac{1}{198} \sum_{i=1}^{198} \left( \frac{m(\nu_i; x) - d_i}{\sigma_i} \right)^2
\]
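As a concrete reference, here is a minimal sketch of this objective in code; the callable `model` and all names are illustrative stand-ins (in the study each evaluation of m(νi;x) runs the FaNDF0 computer model on many cores):

```python
import numpy as np

def chi2_objective(x, nu, d, sigma, model):
    """sigma-weighted least-squares objective f(x).

    x     : the 13 Fayans parameters
    nu    : specifications of the 198 observables nu_i
    d     : corresponding data values d_i
    sigma : adopted errors sigma_i
    model : callable model(nu_i, x) returning m(nu_i; x),
            e.g., a wrapper around the FaNDF0 computer model
    """
    residuals = np.array([(model(nu_i, x) - d_i) / s_i
                          for nu_i, d_i, s_i in zip(nu, d, sigma)])
    return np.mean(residuals ** 2)
```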
Why optimizers would like this problem
σi-weighted least squares optimization problem:

\[
\min_{x \in \mathbb{R}^{13}} f(x), \quad \text{where} \quad f(x) = \frac{1}{198} \sum_{i=1}^{198} \left( \frac{m(\nu_i; x) - d_i}{\sigma_i} \right)^2
\]
Computing
• We can compute all observables m(νi;x) in a couple of seconds using 192 cores
• Inexpensive: benchmarking optimization algorithms is not prohibitive
This makes for a good case study!
An eye on general nuclear data-fitting problems of the form
\[
\min_{x \in \mathbb{R}^{n}} f(x), \quad \text{where} \quad f(x) = \frac{1}{n_d} \sum_{i=1}^{n_d} \left( \frac{m(\nu_i; x) - d_i}{\sigma_i} \right)^2,
\]
where
• nd might be (substantially) larger than 198 ...
• but, like the Fayans computer model, m(νi;x) doesn’t admit analytic gradients.
Supervised Machine Learning (in one slide!)
The steps
1. Given a dataset of feature-data pairs (νi, di), i = 1, ..., nd.
2. Let x parameterize a model m(νi; x).
3. Define a loss function ℓ(m(ν;x), d) that (in some sense) penalizes discrepancies between the model prediction m(ν;x) and data d.
4. An optimization problem involving the empirical average results:

\[
\min_{x} \; \frac{1}{n_d} \sum_{i=1}^{n_d} \ell(m(\nu_i; x), d_i).
\]

5. Solve the problem with stochastic gradient (SG) methods (a minimal sketch follows).
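Below is a minimal sketch of step 5 as a plain mini-batch SG loop; `grad_loss` and the other names are illustrative assumptions, and the loop presumes the loss gradient is actually available (which, for the physics problem here, it is not):

```python
import numpy as np

def sgd(x0, nu, d, grad_loss, batch_size=32, eta=1e-3, epochs=100, seed=0):
    """Plain mini-batch stochastic gradient descent on the empirical average loss.

    grad_loss(x, nu_i, d_i) must return the gradient of ell(m(nu_i; x), d_i)
    with respect to x.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    n = len(d)
    for _ in range(epochs):
        # One pass over a random permutation of the data, split into mini-batches.
        for batch in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            g = np.mean([grad_loss(x, nu[i], d[i]) for i in batch], axis=0)
            x -= eta * g
    return x
```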
Silly toy example
1. Height and weight (νi) - dog vs. cat (di)
2. x are the weights of a CNN m(νi;x)
3. Loss function ℓ(m(ν;x), d) is classification cross entropy
4. Assemble the optimization problem (PyTorch/TensorFlow?)
5. Optimal x∗ ⟹ classifier m(ν;x∗)
Optimization and Machine Learning
\[
\min_{x \in \mathbb{R}^{n}} f(x), \quad \text{where} \quad f(x) = \frac{1}{n_d} \sum_{i=1}^{n_d} \left( \frac{m(\nu_i; x) - d_i}{\sigma_i} \right)^2
\]
One can interpret f(x) as an empirical square loss function for regression - it looks just like supervised training of ML models!
In that case, why don’t we just use stochastic gradient (SG) methods¹?
• No gradients of m(νi;x) ... need derivative-free techniques.
• But even with gradients, no great reason to believe SG methods should outperform “more traditional” optimization methods.

We performed a comparison of derivative-free solvers and approximate SG methods on this case study problem.
Solvers
• POUNDERS
• Nelder-Mead
• Kiefer-Wolfowitz Iteration
• Two-point Bandit
• Derivative-free ADAQN
¹ or one of the many, many, many variants of SG methods, e.g., Momentum, Polyak Averaging, Adam, RMSProp, AdaGrad, AdaDelta, AdaMax, AMSGrad, NAG, Nadam, SVRG, SAG, SARAH, SPIDER, ...
The Solvers - POUNDERS
• Previous use in calibrating nuclear models, e.g., [S.M. Wild, J. Sarich & N. Schunck, JPG: NPP (2015)]
• Model-based derivative-free optimization
• Builds (quadratic) interpolation models of the objective on evaluated pairs (x, f(x)).
• Exploits known square loss function structure of f(x) for better Hessian approximation (see the sketch below).
• Treats the objective as deterministic (all nd = 198 observables are computed for each function value).
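To make the structure-exploitation point concrete, here is a minimal sketch (not the POUNDERS implementation itself) of the structured Hessian it targets: writing r_i(x) = (m(νi;x) − di)/σi, the Hessian of f is (2/nd)(JᵀJ + Σi ri ∇²ri), so models of the individual residuals yield better curvature information than a single model of f. All names below are illustrative.

```python
import numpy as np

def structured_hessian(residuals, jacobian, residual_hessians):
    """Hessian approximation for f(x) = (1/nd) * sum_i r_i(x)**2, assembled from
    (model-based) estimates of the residuals r_i, their Jacobian J
    (rows = grad r_i), and per-residual Hessian estimates."""
    nd = len(residuals)
    gauss_newton = jacobian.T @ jacobian                                   # J^T J term
    curvature = sum(r_i * H_i for r_i, H_i in zip(residuals, residual_hessians))
    return (2.0 / nd) * (gauss_newton + curvature)
```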
The Solvers - Nelder-Mead simplex algorithm
• A very popular (direct search) method of derivative-free optimization [J.A. Nelder & R. Mead, The Computer Journal (1965)]
• Maintains a simplex (n+1 points in R^n) of evaluated pairs (x, f(x)).
• Relative values of f(x) determine which simplex vertices to delete and which new vertices to add and evaluate.
• Treats the objective as deterministic (all nd = 198 observables are computed for each function value).
[Figure: schematic of Nelder-Mead simplex updates, showing the current vertices x(1), ..., x(n+1) and candidate new vertices xnew]
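For reference, running Nelder-Mead on a 13-dimensional problem takes only a few lines with SciPy; the quadratic below is a stand-in objective, not the actual FaNDF0-based f:

```python
import numpy as np
from scipy.optimize import minimize

# Stand-in objective; in the study, f(x) evaluates all 198 observables.
def f(x):
    return np.sum((x - 1.0) ** 2)

x0 = np.zeros(13)
result = minimize(f, x0, method="Nelder-Mead",
                  options={"maxfev": 5000, "xatol": 1e-8, "fatol": 1e-8})
print(result.x, result.fun)
```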
The Solvers - Kiefer-Wolfowitz
• Just replace the gradients in the stochastic gradient method with finite differences!
• Seminal paper [J. Kiefer & J. Wolfowitz, Annals of Math. Stats. (1952)] was published one year after Robbins and Monro’s classic paper on SG [H. Robbins & S. Monro, Annals of Math. Stats. (1951)].
Let B ⊂ {1, ..., nd} be a random batch of observables drawn without replacement. Define the function

\[
f_B(x) = \frac{1}{|B|} \sum_{i \in B} \left( \frac{m(\nu_i; x) - d_i}{\sigma_i} \right)^2.
\]
\[
\text{SG: } x_{k+1} \leftarrow x_k - \eta_k \nabla f_{B_k}(x_k)
\qquad
\text{KW: } x_{k+1} \leftarrow x_k - \eta_k g_k
\]

gk is a finite difference approximation of ∇fBk(xk) computed from values of fBk.
Batchsize (|Bk|) and stepsizes {ηk} are typically treated as hyperparameters. (So is the finite difference parameter h.)
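A minimal sketch of one KW iteration follows; the callable `f_batch(x, batch)` standing in for an evaluation of f_B is an assumption, and forward differences along coordinate directions are used:

```python
import numpy as np

def kw_step(x, f_batch, batch, eta, h):
    """One Kiefer-Wolfowitz update: an SG step with a forward-difference
    approximation g of the mini-batch gradient (n + 1 evaluations of f_B)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    f0 = f_batch(x, batch)
    g = np.empty(n)
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        g[j] = (f_batch(x + e, batch) - f0) / h   # forward difference in coordinate j
    return x - eta * g
```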
The Solvers - Two-Point Bandit Methods
• Employs a two-point approximation of the gradient.
• (Compare to the n+1 or 2n points needed for finite difference gradients in KW.)
• Sample a random unit direction u and evaluate fB(x) and fB(x + hu).
• Let

\[
d = \frac{f_B(x + hu) - f_B(x)}{h}\, u.
\]

\[
\text{SG: } x_{k+1} \leftarrow x_k - \eta_k \nabla f_{B_k}(x_k)
\qquad
\text{KW: } x_{k+1} \leftarrow x_k - \eta_k g_k
\qquad
\text{Bandit: } x_{k+1} \leftarrow x_k - \eta_k d_k
\]
Once again, batchsize, stepsize, and finite difference parameter h are hyperparameters.
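A corresponding sketch of one two-point bandit step (again, `f_batch` is an assumed stand-in for evaluating f_B):

```python
import numpy as np

def bandit_step(x, f_batch, batch, eta, h, rng):
    """Two-point (random unit direction) gradient estimate and SG-style step;
    only 2 evaluations of f_B per iteration."""
    x = np.asarray(x, dtype=float)
    u = rng.standard_normal(x.size)
    u /= np.linalg.norm(u)                                    # random unit direction
    d = (f_batch(x + h * u, batch) - f_batch(x, batch)) / h * u
    return x - eta * d
```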
The Solvers - Derivative-Free ADAQN
• An L-BFGS method (a popular quasi-Newton method) ...
• but gradients are approximated by finite differencing.
• Specifically designed for empirical loss functions of the form f(x).
• Requests variable batchsizes |Bk|.
• |Bk| depends on algorithmically determined estimates of the variance of the approximate gradients (sketched below).
• Key citations: derivative-based: [R. Bollapragada, J. Nocedal, D. Mudigere, H. Shi & P. Tang, ICML (2018)]; derivative-free: [R. Bollapragada & S.M. Wild, ICML Workshop (2019)]
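A rough sketch of the variance-driven batchsize idea, using a simplified norm-test-style rule purely for illustration (this is not the exact ADAQN criterion):

```python
import numpy as np

def suggest_batch_size(per_sample_grads, theta=0.9):
    """Grow the batch when the estimated variance of the mini-batch gradient is
    large relative to its norm.  per_sample_grads has shape (|B|, n) and holds
    (finite-difference) gradient estimates for each sampled observable."""
    G = np.asarray(per_sample_grads, dtype=float)
    b = G.shape[0]
    g_bar = G.mean(axis=0)
    sample_var = np.sum((G - g_bar) ** 2) / max(b - 1, 1)     # total sample variance
    # Variance of the batch-mean gradient vs. theta^2 * ||g_bar||^2:
    if sample_var / b > theta ** 2 * np.dot(g_bar, g_bar):
        return 2 * b          # increase the batchsize
    return b                  # keep the current batchsize
```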
Overall Comparison of Function Value Trajectories
Let’s focus on POUNDERS, KW, and ADAQN ...
The Case (For?) POUNDERS
Only one hyperparameter to tune (the initial trust-region (TR) radius), and performance is largely insensitive to it!
The Case (Against?) KW
Figure: Summary results for KW, fixing 3 different batchsizes and comparing across stepsizes.

• Lots of hyperparameter tuning (batchsizes and stepsizes shown here) - very expensive in core hours.
• Much variability across hyperparameters.
• Smaller batch sizes tend to result in computational failure from the computer model!
The Case (For?) ADAQN
Only tuned one hyperparameter - the initial batchsize.
Reliable performance, which seems to improve with smaller initial batchsizes!
Towards larger problems
Figure: Resource utilization plots for the final solvers
A study of parallel resource exploitation.
On the x-axis: the number of observables m(νi; x) one could compute simultaneously.
On the y-axis: the (median) number of rounds of full utilization of parallel resources of the size on the x-axis needed to reduce the optimality gap to a fraction τ.
Future Directions
• From the physics: More models! Bigger models?!
• From the mathematics: Consider randomized (sampled) variants of POUNDERS for a “best of both worlds” in terms of reliability and speed to solution?
Thank you!