
“Sharp Minima Can Generalize For Deep Nets,” L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio

Mar. 2017.

At a loss: Landscaping loss, and how it affects generalization - Adam

Flat Minima Generalize Well

N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima,” arXiv:1609.04836 [cs, math], Sep. 2016.

“We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions—and as is well known, sharp minima lead to poorer generalization.”

Flat Minima Can Generalize Well
When using stochastic gradient descent:

● batches of size 256 generalize better and reach flatter minima than batches of 10% of the training set

Flat Minima Generalize Well?
Is it really flat minima that generalize well?

Z. Yao, A. Gholami, Q. Lei, K. Keutzer, and M. W. Mahoney, “Hessian-based Analysis of Large Batch Training and Robustness to Adversaries,” arXiv:1802.08241 [cs, stat], Feb. 2018.

Which of the following are true:

1. Flat minima are the reason for good generalization.
2. Gradient descent methods gravitate to flat minima more easily.

a. Thus models that generalize get chosen over models that do not, causing the model selection process to be biased towards models that perform well in flat minima for the data regimes they have been selected for.

3. Something else

Why do flat minima generalize well?
Presuming that flat minima mark good generalization…

Do flat minima mark good generalization?

N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima,” arXiv:1609.04836 [cs, math], Sep. 2016.

Coarse Model for Minima

“Sharp Minima Can Generalize For Deep Nets,” L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio

Mar. 2017.

Broad Strokes
Models can be tweaked after training to adjust sharpness according to common metrics.

Common Flatness Metrics
1. Volume flatness
2. ϵ-sharpness
3. Hessian based

Flatness 1
Volume flatness: the volume of the connected region around a minimum in which the loss stays within ϵ of its value at the minimum.

For rectified neural networks this volume is infinite, so volume flatness cannot distinguish minima.
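A hedged sketch of why, assuming a bias-free single-hidden-layer network and using the non-negative homogeneity of the ReLU (ReLU(αz) = α·ReLU(z) for α > 0):

\[
\theta_2\,\mathrm{ReLU}(\theta_1 x) \;=\; (\alpha^{-1}\theta_2)\,\mathrm{ReLU}(\alpha\theta_1 x) \qquad \text{for all } \alpha > 0,
\]

so every minimum sits on an unbounded family of parameter settings {(αθ1, α⁻¹θ2)} with identical loss, and the region of near-minimal loss containing it has infinite volume.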

“Sharp Minima Can Generalize For Deep Nets,” L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio

Flatness 2
ϵ-sharpness (Keskar et al. flatness):

“For rectified neural network every minimum is observationally equivalent to a minimum that generalizes as well but with high ϵ-sharpness. This also applies when using full-space ϵ-sharpness used by Keskar et al. (2017).”

Full-space ϵ-sharpness is related to the spectral norm of the Hessian.

What about other eigenvalues?

“We have not been able to show a similar problem with random subspace ϵ-sharpness used by Keskar et al. (2017)”

“Sharp Minima Can Generalize For Deep Nets,” L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio
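A rough sketch of what ϵ-sharpness measures, following the box-constrained definition in Keskar et al. as I understand it: the worst-case relative increase of the loss inside an ϵ-box around the minimum. The inner maximization is approximated here by crude random search rather than the inexact optimizer they use, and loss_fn is a stand-in for the training loss, so treat this as an approximation of the metric, not their exact protocol.

```python
import numpy as np

def epsilon_sharpness(loss_fn, w, eps=1e-3, n_samples=1000, rng=None):
    """Approximate Keskar-style eps-sharpness of loss_fn at parameters w.

    Sharpness ~ 100 * (max_{|d_i| <= eps*(|w_i|+1)} loss(w + d) - loss(w))
                      / (1 + loss(w)),
    with the inner maximization approximated by random search.
    """
    rng = np.random.default_rng(rng)
    w = np.asarray(w, dtype=float)
    base = loss_fn(w)
    box = eps * (np.abs(w) + 1.0)          # per-coordinate box half-widths
    worst = base
    for _ in range(n_samples):
        d = rng.uniform(-box, box)         # random perturbation inside the box
        worst = max(worst, loss_fn(w + d))
    return 100.0 * (worst - base) / (1.0 + base)

# Toy usage: a quadratic "loss" with one sharp and one flat direction.
H = np.diag([100.0, 0.01])
quad = lambda w: 0.5 * w @ H @ w
print(epsilon_sharpness(quad, np.zeros(2), eps=0.1))
```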

Hessian Flatness
Looks at the curvature of the network's loss with respect to the parameters.

Hessian Flatness Quantification
Values used:

● Spectral Norm
● Trace (lower bounded by the spectral norm)
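A small numerical illustration (my own toy example, not from either paper): at a minimum the Hessian is positive semi-definite, so the trace, being the sum of all eigenvalues, is at least as large as the spectral norm, the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50))
H = A @ A.T                              # a random positive semi-definite "Hessian"

eigvals = np.linalg.eigvalsh(H)          # eigenvalues of the symmetric matrix H
spectral_norm = eigvals.max()            # curvature of the sharpest direction
trace = np.trace(H)                      # total curvature, summed over directions

print(spectral_norm, trace)              # trace >= spectral_norm in the PSD case
```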

Alpha-Scale Transformation
T_α(θ1, θ2) = (αθ1, α⁻¹θ2), for α > 0, leaves a rectified network’s function unchanged.

2-layer version

Spectral norm can be arbitrarily scaled

Multi-layer version
Eigenvalues of n−1 layers can be scaled arbitrarily large, where n is the number of layers.
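A minimal numerical check of the 2-layer claim (my own sketch with an arbitrary tiny ReLU network and squared loss, not the construction from the paper): scaling the first layer by α and the second by 1/α leaves every prediction, and hence the loss, unchanged, while a finite-difference estimate of the curvature along a second-layer weight grows like α².

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))                 # toy inputs
y = rng.normal(size=(32,))                   # toy targets
W1 = rng.normal(size=(4, 3))                 # first-layer weights
w2 = rng.normal(size=(4,))                   # second-layer weights

def predict(W1, w2, X):
    return np.maximum(X @ W1.T, 0.0) @ w2    # ReLU network, no biases

def loss(W1, w2):
    return 0.5 * np.mean((predict(W1, w2, X) - y) ** 2)

def curvature_along_w2(W1, w2, i=0, h=1e-4):
    """Finite-difference second derivative of the loss w.r.t. w2[i]."""
    e = np.zeros_like(w2)
    e[i] = h
    return (loss(W1, w2 + e) - 2 * loss(W1, w2) + loss(W1, w2 - e)) / h**2

for alpha in [1.0, 10.0, 100.0]:
    W1a, w2a = alpha * W1, w2 / alpha        # the alpha-scale transformation
    same = np.allclose(predict(W1, w2, X), predict(W1a, w2a, X))
    print(alpha, same, curvature_along_w2(W1a, w2a))
    # Predictions are identical for every alpha, but the curvature along
    # the second-layer direction grows roughly like alpha**2.
```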

Hessian Measures that can avoid this
Maybe a product of Hessian eigenvalues… can still be scaled to be sharper, but relative sharpness will be maintained.

Model re-parameterization & input space
The weights can be re-parameterized (as a function of new parameters) to change the loss landscape without changing the function being computed.
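A toy illustration of the same point (my own one-dimensional example, not the paper's construction): a bijective change of parameters keeps the loss values at corresponding points identical but rescales the curvature, so "sharp" and "flat" depend on the parameterization.

```python
import numpy as np

loss = lambda w: 0.5 * (w - 3.0) ** 2          # "flat" in w: curvature 1
loss_u = lambda u: loss(10.0 * u)              # reparameterize w = 10 * u

def second_derivative(f, x, h=1e-4):
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

w_star, u_star = 3.0, 0.3                      # the same minimum in two coordinates
print(loss(w_star), loss_u(u_star))            # identical loss values: 0.0 0.0
print(second_derivative(loss, w_star))         # ~1.0 in w-coordinates
print(second_derivative(loss_u, u_star))       # ~100.0 in u-coordinates
```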

Conclusion
Flat minima don’t necessarily generalize better than sharp ones.

Exploiting non-identifiability allows changing the loss surface without affecting the function.

More care is needed to define flatness.

Bayesian method
S. L. Smith and Q. V. Le, “A Bayesian Perspective on Generalization and Stochastic Gradient Descent,” arXiv:1710.06451 [cs, stat], Oct. 2017.

Likely models combine depth (a low value of the loss at the minimum) and breadth (a broad minimum around it).

Occam Factor
Regularization constant divided by the product of the eigenvalues of the Hessian.
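As a hedged reconstruction of the MacKay-style evidence approximation that Smith & Le build on (the exact form and constants should be checked against the paper), with λ the regularization coefficient, λ_i the Hessian eigenvalues at the minimum w₀, and C the regularized loss:

\[
P(y \mid x; M) \;\approx\; e^{-C(w_0)} \prod_i \left(\frac{\lambda}{\lambda_i}\right)^{1/2},
\]

so the Occam factor shrinks as the eigenvalues grow: sharper minima are penalized and broader minima receive more evidence.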

SGD Noise Scale
S. L. Smith and Q. V. Le, “A Bayesian Perspective on Generalization and Stochastic Gradient Descent,” arXiv:1710.06451 [cs, stat], Oct. 2017.

N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima,” arXiv:1609.04836 [cs, math], Sep. 2016.
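Roughly (quoting the scaling as I recall it from Smith & Le; check the paper for the exact prefactor), the scale of the gradient noise that SGD injects is

\[
g \;\approx\; \frac{\epsilon N}{B},
\]

where ϵ is the learning rate, N the training-set size, and B the batch size. Larger batches mean less noise, and keeping g fixed while growing B is the idea behind the paper in the addendum below.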

Addendum
Warm-starting models... Check out: S. L. Smith, P.-J. Kindermans, C. Ying, and Q. V. Le, “Don’t Decay the Learning Rate, Increase the Batch Size,” arXiv:1711.00489 [cs, stat], Nov. 2017.

Is flatness a goose-chase?
Why focus on flatness rather than on training paradigms that directly affect generalization?

Can we optimize over the model itself to improve flatness as well? E.g., skip-connections and other online transformations.
