
Failures of Gradient-Based Deep Learning

Ohad Shamir (Weizmann Institute)

Joint work with Shai Shalev-Shwartz & Shaked Shammah

(Hebrew University & Mobileye)

ICRI-CI Workshop, May 2017

Neural Networks (a.k.a. Deep Learning)


The Fizz Buzz Job Interview Question

Interviewer: OK, so I need you to print the numbers from 1 to 100, except that if the number is divisible by 3 print “fizz”, if it's divisible by 5 print “buzz”, and if it's divisible by 15 print “fizzbuzz”

Interviewee: ... let's talk models. I'm thinking a simple multi-layer perceptron with one hidden layer...

http://joelgrus.com/2016/05/23/fizz-buzz-in-tensorflow/


This Talk

Simple problems where standard deep learning either

Does not work at all

Even for “nice” distributions and realizability

Even with over-parameterization (a.k.a. improper learning)

Does not work well

Requires prior knowledge for better architectural/algorithmic choices

Mix of theory and experiments. Code available!

Take-home Message

Even deep learning has limitations. To overcome them, prior knowledge and domain expertise can still be important

Outline

1 Piecewise Linear Curves

2 Linear-Periodic Functions

3 End-to-End vs. Decomposition

4 Flat Activations

Piecewise Linear Curves: Motivation

Piecewise Linear Curves

Problem: Train a piecewise linear curve detector

Input: f = (f(0), f(1), . . . , f(n − 1)) where

f(x) = ∑_{r=1}^{k} a_r [x − θ_r]_+ ,   θ_r ∈ {0, . . . , n − 1}

Output: Curve parameters {(a_r, θ_r)}_{r=1}^{k}
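To make the setup concrete, here is a minimal NumPy sketch (not the authors' code; the choices of n, k, and the sampling of a_r, θ_r are assumptions) of how curves f and their parameter vectors p could be generated:

```python
import numpy as np

def sample_curve(n=300, k=3, seed=0):
    """Sample f(x) = sum_r a_r [x - theta_r]_+ on the grid x = 0, ..., n-1."""
    rng = np.random.default_rng(seed)
    theta = rng.choice(n, size=k, replace=False)   # breakpoints theta_r in {0, ..., n-1}
    a = rng.standard_normal(k)                     # slope changes a_r
    x = np.arange(n)
    f = (a * np.maximum(x[:, None] - theta, 0.0)).sum(axis=1)
    p = np.zeros(n)                                # p_j = sum_r a_r * 1{theta_r = j}
    p[theta] = a
    return f, p

f, p = sample_curve()
```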

Piecewise Linear Curves

Approach 1: Deep autoencoder

min_{v_1,v_2} E[(M_{v_2}(N_{v_1}(f)) − f)^2]

(3 ReLU layers + linear output; sizes (100, 100, n) and (500, 100, 2k))

(after 500; 10000; 50000 iterations)
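A minimal PyTorch sketch of this approach. The split of the quoted layer sizes between the encoder N_{v_1} and decoder M_{v_2}, and the optimizer, learning rate, and batch size, are assumptions, not details taken from the slide:

```python
import numpy as np
import torch
import torch.nn as nn

n, k = 300, 3
encoder = nn.Sequential(                          # N_v1: curve f -> 2k curve parameters
    nn.Linear(n, 500), nn.ReLU(),
    nn.Linear(500, 100), nn.ReLU(),
    nn.Linear(100, 2 * k))
decoder = nn.Sequential(                          # M_v2: parameters -> reconstructed curve
    nn.Linear(2 * k, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, n))

# small training set of curves produced by the sample_curve sketch above
train = torch.as_tensor(
    np.stack([sample_curve(n, k, seed=s)[0] for s in range(256)]), dtype=torch.float32)

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for _ in range(500):
    f = train[torch.randint(0, len(train), (64,))]
    loss = ((decoder(encoder(f)) - f) ** 2).mean()   # E[(M_v2(N_v1(f)) - f)^2]
    opt.zero_grad(); loss.backward(); opt.step()
```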

Piecewise Linear Curves

Input: f = (f(0), f(1), . . . , f(n − 1)) where f(x) = ∑_{r=1}^{k} a_r [x − θ_r]_+

Output: {(a_r, θ_r)}_{r=1}^{k}

Approach 2: Linear Regression

Observation: f = Wp, where

W_{i,j} = [i − j + 1]_+ ,   p_j = ∑_{r=1}^{k} a_r · 1{θ_r = j}

Can extract the parameter vector p from f by p = W^{−1}f

Learning approach: Train a one-layer fully connected network on (f, p) examples:

min_U E[(Uf − p)^2] = E[(Uf − W^{−1}f)^2]
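A small NumPy sketch of the observation above (the indexing convention is an assumption): build W, generate f = Wp for a sparse p, and note that recovering p is just solving a triangular linear system, which is exactly the map U = W^{−1} that the one-layer network is asked to learn by SGD.

```python
import numpy as np

n, k = 300, 3
rng = np.random.default_rng(0)

i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
W = np.maximum(i - j + 1, 0).astype(float)        # W_{i,j} = [i - j + 1]_+ (lower triangular)

p = np.zeros(n)                                   # sparse parameter vector
p[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
f = W @ p                                         # the observed curve

p_hat = np.linalg.solve(W, f)                     # p = W^{-1} f, without forming W^{-1}
assert np.allclose(p_hat, p)
```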


Piecewise Linear Curves

min_U E[(Uf − p)^2] = E[(Uf − W^{−1}f)^2]

Convex; Realizable

Also, doesn’t work well

(n = 300; after 500, 10000, 50000 iterations)


Piecewise Linear Curves

Explanation: W has a very large condition number

Theorem

λ_max(W^⊤W) / λ_min(W^⊤W) = Ω(n^{3.5})

⇒ SGD requires Ω(n^{3.5}) iterations to reach U s.t. ‖E[U] − W^{−1}‖ < 1/2
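A quick numerical illustration of the conditioning claim (a sanity check of the growth, not the theorem's proof):

```python
import numpy as np

for n in (50, 100, 200, 400):
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    W = np.maximum(i - j + 1, 0).astype(float)
    s = np.linalg.svd(W, compute_uv=False)              # singular values of W
    print(f"n = {n:4d}   cond(W^T W) = {(s[0] / s[-1]) ** 2:.2e}")
```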

Piecewise Linear Curves

Approach 3: Convolutional Networks

p = W^{−1}f

Observation:

W^{−1} =
  [  1   0   0   0  · · ·
    −2   1   0   0  · · ·
     1  −2   1   0  · · ·
     0   1  −2   1  · · ·
     0   0   1  −2  · · ·
                  ...  ]

W^{−1}f is a 1D convolution of f with the “line-break” filter (1, −2, 1)

Can train a one-layer convnet to learn the filter (a problem in R^3!)
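A minimal PyTorch sketch of the convolutional approach. The padding, optimizer, and the use of whitened stand-in inputs are assumptions; a single 3-tap Conv1d can represent the map f ↦ W^{−1}f (up to boundary handling), so the learning problem lives in R^3:

```python
import torch
import torch.nn as nn

n = 300
conv = nn.Conv1d(1, 1, kernel_size=3, padding=1, bias=False)    # the learnable 3-tap filter

# fixed "teacher" holding the line-break filter (1, -2, 1); used only to produce targets
teacher = nn.Conv1d(1, 1, kernel_size=3, padding=1, bias=False)
with torch.no_grad():
    teacher.weight.copy_(torch.tensor([[[1.0, -2.0, 1.0]]]))

opt = torch.optim.SGD(conv.parameters(), lr=1e-2)
for _ in range(1000):
    f = torch.randn(64, 1, n)            # whitened stand-in inputs; real curves f would be used
    loss = ((conv(f) - teacher(f)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(conv.weight.detach().flatten())    # approaches (1, -2, 1) on these whitened inputs
```

With actual curves f as inputs, the ill-conditioning discussed on the next slide shows up in the training dynamics.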

Piecewise Linear Curves

(after 500; 10000; 50000 iterations)

Theorem: Condition number reduced to Θ(n^3). Convolutions aid geometry!

But: Θ(n^3) iterations is still very disappointing for a problem in R^3...

Piecewise Linear Curves

Approach 4: Preconditioning

Convolutions reduce the problem to R^3. In such a low dimension, we can easily estimate correlations in f and use them to precondition

(after 500; 10000; 50000 iterations)
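A minimal sketch of one way to implement this idea (the authors' exact preconditioning scheme is not given on the slide, so the details below are assumptions): estimate the 3×3 correlation matrix of the length-3 input windows and multiply the least-squares gradient by its inverse.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
F = rng.standard_normal((64, 300))                       # stand-in batch of input curves f
X = sliding_window_view(F, 3, axis=-1).reshape(-1, 3)    # all length-3 windows of the inputs

C = X.T @ X / len(X)                                     # estimated 3x3 correlation matrix
C_inv = np.linalg.inv(C)

w_star = np.array([1.0, -2.0, 1.0])                      # unknown target filter (for the sketch)
y = X @ w_star
w = np.zeros(3)
for _ in range(100):
    grad = X.T @ (X @ w - y) / len(X)                    # least-squares gradient in R^3
    w -= C_inv @ grad                                    # preconditioned gradient step
print(w)                                                 # essentially (1, -2, 1) after one step
```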

Outline

1 Piecewise Linear Curves

2 Linear-Periodic Functions

3 End-to-End vs. Decomposition

4 Flat Activations

Linear-Periodic Functions

x ↦ ψ(w^⊤x), ψ periodic

Closely related to generalized linear models

Implementable with 2-layer networks on any bounded domain

Statistically learnable from data

Computationally learnable from data, at least in some cases

Informal Result

Not learnable with gradient-based methods in polynomial time, for any smooth distribution on R^d

Even with over-parameterization / arbitrarily complex network

Even if ψ and distribution are known


Case Study

Target function: x ↦ cos(w*^⊤x)

x has a standard Gaussian distribution in R^d

With enough training data, equivalent to

min_w E_{x∼N(0,I)}[(cos(w^⊤x) − cos(w*^⊤x))^2]

Case Study

In 2 dimensions, w* = (2, 2):

No local minima/saddle points

However, extremely flat unless very close to optimum ⇒ difficult for gradient methods


Analysis

Similar issues even for

Arbitrary smooth distributions

Any periodic ψ (not just cosine)

Arbitrary networks

min_{v∈V} F_{w*}(v) = E_{x∼ϕ^2}[(f(v, x) − ψ(w*^⊤x))^2]

Theorem

Under mild assumptions, if w* is a random norm-r vector in R^d, then at any v,

Var_{w*}(∇F_{w*}(v)) ≤ exp(−Ω(min{d, r^2}))

Can be shown to imply that any gradient-based method would require exp(Ω(min{d, r^2})) iterations to succeed.


Experiment

[Plot: accuracy vs. training iterations (up to 5·10^4), for d = 5, 10, 30]

2-layer ReLU network, width 10d
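A minimal PyTorch sketch of this experiment. The optimizer, learning rate, batch size, and the norm of w* are not stated on the slide and are assumptions here:

```python
import torch
import torch.nn as nn

d = 30
torch.manual_seed(0)
w_star = torch.randn(d)
w_star *= d ** 0.5 / w_star.norm()                 # a norm-sqrt(d) target vector (assumption)

net = nn.Sequential(nn.Linear(d, 10 * d), nn.ReLU(), nn.Linear(10 * d, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(50_000):
    x = torch.randn(256, d)                        # x ~ N(0, I)
    y = torch.cos(x @ w_star).unsqueeze(1)         # target psi(w*^T x) with psi = cos
    loss = ((net(x) - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 10_000 == 0:
        print(step, loss.item())
# as in the plot above, training makes progress for small d but stalls as d grows
```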

Outline

1 Piecewise Linear Curves

2 Linear-Periodic Functions

3 End-to-End vs. Decomposition

4 Flat Activations

End-to-End vs. Decomposition

Input x: k-tuple of images of random lines

f1(x): For each image, whether the line slopes up or down

f2: Given the bit vector, return its parity

Important: Will focus on small k (where parity is easy)

Goal: Learn f2(f1(x))
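A hedged sketch of this data (image size, line rendering, and label conventions are assumptions; the authors' actual data pipeline may differ):

```python
import numpy as np

def random_line_image(rng, size=16):
    """A size x size image containing one random line; returns (image, slope-up bit)."""
    up = int(rng.integers(0, 2))                           # 1 if the line slopes upward
    slope = rng.uniform(0.2, 1.0) * (1 if up else -1)
    offset = rng.uniform(0.25, 0.75) * size
    img = np.zeros((size, size))
    for col in range(size):
        row = int(np.clip(offset + slope * (col - size / 2), 0, size - 1))
        img[row, col] = 1.0
    return img, up

def sample_example(rng, k=3):
    imgs, bits = zip(*(random_line_image(rng) for _ in range(k)))
    x = np.stack(imgs)                                     # the k-tuple of images
    bits = np.array(bits)                                  # f1(x): per-image slope bits
    parity = int(bits.sum() % 2)                           # f2: parity of the bits
    return x, bits, parity

rng = np.random.default_rng(0)
x, bits, parity = sample_example(rng)
```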

End-to-End vs. Decomposition

End-to-end approach: Train the overall network on the primary objective

Decomposition approach: Augment the objective with a loss specific to the first net, using the per-image labels (both objectives are sketched below)

[Plots for k = 1, 2, 3, 4: y-axis from 0.3 to 1, over 20000 training iterations]
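A minimal PyTorch sketch of the two objectives. The architectures, the loss weighting, and the sigmoid between the two nets are assumptions; only the idea (parity loss alone vs. parity loss plus per-image supervision) is taken from the slide:

```python
import torch
import torch.nn as nn

k, d = 3, 16 * 16
first_net = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))    # per-image slope score
second_net = nn.Sequential(nn.Linear(k, 32), nn.ReLU(), nn.Linear(32, 1))   # bits -> parity
bce = nn.BCEWithLogitsLoss()

def losses(images, bits, parity):
    """images: (B, k, d); bits: (B, k) and parity: (B,) are float 0/1 labels."""
    scores = first_net(images).squeeze(-1)                 # (B, k) per-image logits
    parity_logit = second_net(torch.sigmoid(scores)).squeeze(-1)
    end_to_end = bce(parity_logit, parity)                 # primary objective only
    decomposition = end_to_end + bce(scores, bits)         # plus per-image supervision
    return end_to_end, decomposition

# usage with random stand-in data (the line-image sketch above would supply real examples)
images = torch.rand(8, k, d)
bits = torch.randint(0, 2, (8, k)).float()
parity = bits.sum(dim=1) % 2
print(losses(images, bits, parity))
```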


Analysis

Through gradient variance w.r.t. target function

Under some simplifying assumptions,

Var_{w*}(∇F_{w*}(v)) ≤ O(√(k/d))^k, where d = number of pixels

Extremely concentrated already for very small values of k

Through gradient signal-to-noise ratio (SNR)

Ratio of the bias and variance of y(x) · g(x) w.r.t. a random input x, where y is the target and g is the gradient at the initialization point

[Plot: log(SNR) as a function of k = 1, . . . , 4; y-axis from −15 to −7]
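One way such an SNR could be estimated empirically. The precise definition used in the paper may differ; here "signal" is the squared norm of E_x[y(x)g(x)] and "noise" is the total variance of y(x)g(x), and the toy data at the end is a stand-in:

```python
import torch
import torch.nn as nn

def gradient_snr(net, sample_example, num_samples=500):
    """Estimate ||E_x[y(x) g(x)]||^2 / total variance of y(x) g(x), where g(x) is the
    gradient of the network output w.r.t. its parameters at initialization."""
    vecs = []
    for _ in range(num_samples):
        x, y = sample_example()                 # one input and its target label y(x)
        net.zero_grad()
        net(x).squeeze().backward()             # populates p.grad with g(x)
        g = torch.cat([p.grad.flatten().clone() for p in net.parameters()])
        vecs.append(y * g)
    V = torch.stack(vecs)
    signal = V.mean(dim=0).norm() ** 2
    noise = V.var(dim=0).sum()
    return (signal / noise).item()

# toy usage with stand-in data; the real experiment would use the k-tuple-of-images data
d = 3 * 16 * 16
net = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))
sample = lambda: (torch.rand(1, d), 1.0 if torch.rand(1).item() > 0.5 else -1.0)
print(gradient_snr(net, sample))
```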


Outline

1 Piecewise Linear Curves

2 Linear-Periodic Functions

3 End-to-End vs. Decomposition

4 Flat Activations

Flat Activations

Vanishing gradients due to saturating activations (e.g. in RNNs)

Flat Activations

Problem: Learning x ↦ u(w*^⊤x), where u is a fixed step function

Optimization problem:

min_w E_x[(u(N_w(x)) − u(w*^⊤x))^2]

u′(z) = 0 almost everywhere → can't apply gradient-based methods

Standard workarounds (smooth approximations; end-to-end; multiclass) don't work too well either – see the paper

Flat Activations

Different approach (Kalai & Sastry 2009; Kakade, Kalai, Kanade, S. 2011): Gradient descent, but replace the gradient with something else

min_w E_x[ (1/2) (u(w^⊤x) − u(w*^⊤x))^2 ]

True gradient: ∇ = E_x[(u(w^⊤x) − u(w*^⊤x)) · u′(w^⊤x) · x]

Replace with: ∇̃ = E_x[(u(w^⊤x) − u(w*^⊤x)) · x]

Interpretation: “Forward only” backpropagation
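A minimal NumPy sketch of this update in the linear case (N_w(x) = w^⊤x). The step size, batch size, and the particular step activation u are assumptions; the point is only that the u′(w^⊤x) factor, which is zero almost everywhere, is dropped:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
w_star = rng.standard_normal(d)
u = lambda z: (z > 0).astype(float)            # a fixed step activation (example choice)

def err(w, m=100_000):                         # Monte Carlo estimate of E[(u(w.x) - u(w*.x))^2]
    x = rng.standard_normal((m, d))
    return np.mean((u(x @ w) - u(x @ w_star)) ** 2)

w = np.zeros(d)
print("before:", err(w))                       # ~0.5
for _ in range(5000):
    x = rng.standard_normal((256, d))
    delta = u(x @ w) - u(x @ w_star)           # u(w^T x) - u(w*^T x)
    w -= 0.1 * (delta @ x) / len(x)            # the surrogate step: the u'(w^T x) factor is dropped
print("after:", err(w))                        # much smaller: w now points roughly along w*
```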

Flat Activations

(linear; 5000 iterations)

Best results, and smallest train+test time

Analysis (KS09, KKKS11): Needs O(L^2/ε^2) iterations if u is L-Lipschitz

Summary

Simple problems where standard gradient-based deep learning doesn't work well (or at all), even under favorable conditions

Not due to local minima/saddle points!

Prior knowledge and domain expertise can still be important

For more details:

“Distribution-Specific Hardness of Learning Neural Networks”: arXiv 1609.01037

“Failures of Gradient-Based Deep Learning”: arXiv 1703.07950

github.com/shakedshammah/failures_of_DL
