
Failures of Gradient-Based Deep Learning

Ohad Shamir (Weizmann Institute)

Joint work with Shai Shalev-Shwartz & Shaked Shammah

(Hebrew University & Mobileye)

ICRI-CI Workshop, May 2017

Neural Networks (a.k.a. Deep Learning)


The Fizz Buzz Job Interview Question

Interviewer: OK, so I need you to print the numbers from 1 to 100, except that if the number is divisible by 3 print “fizz”, if it's divisible by 5 print “buzz”, and if it's divisible by 15 print “fizzbuzz”

Interviewee: ... let's talk models. I'm thinking a simple multi-layer perceptron with one hidden layer...

http://joelgrus.com/2016/05/23/fizz-buzz-in-tensorflow/


This Talk

Simple problems where standard deep learning either

Does not work at all

Even for “nice” distributions and realizability

Even with over-parameterization (a.k.a. improper learning)

Does not work well

Requires prior knowledge for better architectural/algorithmic choices

Mix of theory and experiments. Code available!

Take-home Message

Even deep learning has limitations. To overcome them, prior knowledge and domain expertise can still be important

Outline

1 Piecewise Linear Curves

2 Linear-Periodic Functions

3 End-to-End vs. Decomposition

4 Flat Activations

Piecewise Linear Curves: Motivation

Piecewise Linear Curves

Problem: Train a piecewise linear curve detector

Input: f = (f(0), f(1), . . . , f(n − 1)) where

f(x) = ∑_{r=1}^{k} a_r [x − θ_r]_+ ,   θ_r ∈ {0, . . . , n − 1}

Output: Curve parameters {(a_r, θ_r)}_{r=1}^{k}
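To make the setup concrete, here is a minimal NumPy sketch (not the authors' code; the choices of n, k, and the sampling of a_r, θ_r are assumptions) of how curves f and their parameter vectors p could be generated:

```python
import numpy as np

def sample_curve(n=300, k=3, seed=0):
    """Sample f(x) = sum_r a_r [x - theta_r]_+ on the grid x = 0, ..., n-1."""
    rng = np.random.default_rng(seed)
    theta = rng.choice(n, size=k, replace=False)   # breakpoints theta_r in {0, ..., n-1}
    a = rng.standard_normal(k)                     # slope changes a_r
    x = np.arange(n)
    f = (a * np.maximum(x[:, None] - theta, 0.0)).sum(axis=1)
    p = np.zeros(n)                                # p_j = sum_r a_r * 1{theta_r = j}
    p[theta] = a
    return f, p

f, p = sample_curve()
```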

Piecewise Linear Curves

Approach 1: Deep autoencoder

min_{v_1,v_2} E[(M_{v_2}(N_{v_1}(f)) − f)^2]

(3 ReLU layers + linear output; sizes (100, 100, n) and (500, 100, 2k))

(after 500; 10000; 50000 iterations)
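A minimal PyTorch sketch of this approach. The split of the quoted layer sizes between the encoder N_{v_1} and decoder M_{v_2}, and the optimizer, learning rate, and batch size, are assumptions, not details taken from the slide:

```python
import numpy as np
import torch
import torch.nn as nn

n, k = 300, 3
encoder = nn.Sequential(                          # N_v1: curve f -> 2k curve parameters
    nn.Linear(n, 500), nn.ReLU(),
    nn.Linear(500, 100), nn.ReLU(),
    nn.Linear(100, 2 * k))
decoder = nn.Sequential(                          # M_v2: parameters -> reconstructed curve
    nn.Linear(2 * k, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, n))

# small training set of curves produced by the sample_curve sketch above
train = torch.as_tensor(
    np.stack([sample_curve(n, k, seed=s)[0] for s in range(256)]), dtype=torch.float32)

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for _ in range(500):
    f = train[torch.randint(0, len(train), (64,))]
    loss = ((decoder(encoder(f)) - f) ** 2).mean()   # E[(M_v2(N_v1(f)) - f)^2]
    opt.zero_grad(); loss.backward(); opt.step()
```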

Piecewise Linear Curves

Input: f = (f(0), f(1), . . . , f(n − 1)) where f(x) = ∑_{r=1}^{k} a_r [x − θ_r]_+

Output: {(a_r, θ_r)}_{r=1}^{k}

Approach 2: Linear Regression

Observation: f = Wp, where

W_{i,j} = [i − j + 1]_+ ,   p_j = ∑_{r=1}^{k} a_r · 1{θ_r = j}

Can extract the parameter vector p from f by p = W^{−1}f

Learning approach: Train a one-layer fully connected network on (f, p) examples:

min_U E[(Uf − p)^2] = E[(Uf − W^{−1}f)^2]
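A small NumPy sketch of the observation above (the indexing convention is an assumption): build W, generate f = Wp for a sparse p, and note that recovering p is just solving a triangular linear system, which is exactly the map U = W^{−1} that the one-layer network is asked to learn by SGD.

```python
import numpy as np

n, k = 300, 3
rng = np.random.default_rng(0)

i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
W = np.maximum(i - j + 1, 0).astype(float)        # W_{i,j} = [i - j + 1]_+ (lower triangular)

p = np.zeros(n)                                   # sparse parameter vector
p[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
f = W @ p                                         # the observed curve

p_hat = np.linalg.solve(W, f)                     # p = W^{-1} f, without forming W^{-1}
assert np.allclose(p_hat, p)
```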


Piecewise Linear Curves

min_U E[(Uf − p)^2] = E[(Uf − W^{−1}f)^2]

Convex; Realizable

Also, doesn’t work well

(n = 300; after 500, 10000, 50000 iterations)


Piecewise Linear Curves

Explanation: W has a very large condition number

Theorem

λ_max(W^⊤W) / λ_min(W^⊤W) = Ω(n^{3.5})

⇒ SGD requires Ω(n^{3.5}) iterations to reach U s.t. ‖E[U] − W^{−1}‖ < 1/2
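A quick numerical illustration of the conditioning claim (a sanity check of the growth, not the theorem's proof):

```python
import numpy as np

for n in (50, 100, 200, 400):
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    W = np.maximum(i - j + 1, 0).astype(float)
    s = np.linalg.svd(W, compute_uv=False)              # singular values of W
    print(f"n = {n:4d}   cond(W^T W) = {(s[0] / s[-1]) ** 2:.2e}")
```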

Piecewise Linear Curves

Approach 3: Convolutional Networks

p = W^{−1}f

Observation:

W^{−1} =
  [  1   0   0   0  · · ·
    −2   1   0   0  · · ·
     1  −2   1   0  · · ·
     0   1  −2   1  · · ·
     0   0   1  −2  · · ·
                  ...  ]

W^{−1}f is a 1D convolution of f with the “line-break” filter (1, −2, 1)

Can train a one-layer convnet to learn the filter (a problem in R^3!)
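A minimal PyTorch sketch of the convolutional approach. The padding, optimizer, and the use of whitened stand-in inputs are assumptions; a single 3-tap Conv1d can represent the map f ↦ W^{−1}f (up to boundary handling), so the learning problem lives in R^3:

```python
import torch
import torch.nn as nn

n = 300
conv = nn.Conv1d(1, 1, kernel_size=3, padding=1, bias=False)    # the learnable 3-tap filter

# fixed "teacher" holding the line-break filter (1, -2, 1); used only to produce targets
teacher = nn.Conv1d(1, 1, kernel_size=3, padding=1, bias=False)
with torch.no_grad():
    teacher.weight.copy_(torch.tensor([[[1.0, -2.0, 1.0]]]))

opt = torch.optim.SGD(conv.parameters(), lr=1e-2)
for _ in range(1000):
    f = torch.randn(64, 1, n)            # whitened stand-in inputs; real curves f would be used
    loss = ((conv(f) - teacher(f)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(conv.weight.detach().flatten())    # approaches (1, -2, 1) on these whitened inputs
```

With actual curves f as inputs, the ill-conditioning discussed on the next slide shows up in the training dynamics.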

Piecewise Linear Curves

(after 500; 10000; 50000 iterations)

Theorem: Condition number reduced to Θ(n^3). Convolutions aid geometry!

But: Θ(n^3) iterations is still very disappointing for a problem in R^3...

Piecewise Linear Curves

Approach 4: Preconditioning

Convolutions reduce the problem to R^3. In such a low dimension, we can easily estimate correlations in f and use them to precondition

(after 500; 10000; 50000 iterations)
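A minimal sketch of one way to implement this idea (the authors' exact preconditioning scheme is not given on the slide, so the details below are assumptions): estimate the 3×3 correlation matrix of the length-3 input windows and multiply the least-squares gradient by its inverse.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
F = rng.standard_normal((64, 300))                       # stand-in batch of input curves f
X = sliding_window_view(F, 3, axis=-1).reshape(-1, 3)    # all length-3 windows of the inputs

C = X.T @ X / len(X)                                     # estimated 3x3 correlation matrix
C_inv = np.linalg.inv(C)

w_star = np.array([1.0, -2.0, 1.0])                      # unknown target filter (for the sketch)
y = X @ w_star
w = np.zeros(3)
for _ in range(100):
    grad = X.T @ (X @ w - y) / len(X)                    # least-squares gradient in R^3
    w -= C_inv @ grad                                    # preconditioned gradient step
print(w)                                                 # essentially (1, -2, 1) after one step
```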

Outline

1 Piecewise Linear Curves

2 Linear-Periodic Functions

3 End-to-End vs. Decomposition

4 Flat Activations

Linear-Periodic Functions

x ↦ ψ(w^⊤x), ψ periodic

Closely related to generalized linear models

Implementable with 2-layer networks on any bounded domain

Statistically learnable from data

Computationally learnable from data, at least in some cases

Informal Result

Not learnable with gradient-based methods in polynomial time, for any smooth distribution on R^d

Even with over-parameterization / arbitrarily complex network

Even if ψ and distribution are known


Case Study

Target function: x ↦ cos(w*^⊤x)

x has a standard Gaussian distribution in R^d

With enough training data, equivalent to

min_w E_{x∼N(0,I)}[(cos(w^⊤x) − cos(w*^⊤x))^2]

Case Study

In 2 dimensions, w* = (2, 2):

No local minima/saddle points

However, extremely flat unless very close to optimum ⇒ difficult for gradient methods


Analysis

Similar issues even for

Arbitrary smooth distributions

Any periodic ψ (not just cosine)

Arbitrary networks

min_{v∈V} F_{w*}(v) = E_{x∼ϕ^2}[(f(v, x) − ψ(w*^⊤x))^2]

Theorem

Under mild assumptions, if w* is a random norm-r vector in R^d, then at any v,

Var_{w*}(∇F_{w*}(v)) ≤ exp(−Ω(min{d, r^2}))

Can be shown to imply that any gradient-based method would require exp(Ω(min{d, r^2})) iterations to succeed.


Experiment

[Plot: accuracy vs. training iterations (up to 5·10^4), for d = 5, 10, 30]

2-layer ReLU network, width 10d
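A minimal PyTorch sketch of this experiment. The optimizer, learning rate, batch size, and the norm of w* are not stated on the slide and are assumptions here:

```python
import torch
import torch.nn as nn

d = 30
torch.manual_seed(0)
w_star = torch.randn(d)
w_star *= d ** 0.5 / w_star.norm()                 # a norm-sqrt(d) target vector (assumption)

net = nn.Sequential(nn.Linear(d, 10 * d), nn.ReLU(), nn.Linear(10 * d, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(50_000):
    x = torch.randn(256, d)                        # x ~ N(0, I)
    y = torch.cos(x @ w_star).unsqueeze(1)         # target psi(w*^T x) with psi = cos
    loss = ((net(x) - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 10_000 == 0:
        print(step, loss.item())
# as in the plot above, training makes progress for small d but stalls as d grows
```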

Outline

1 Piecewise Linear Curves

2 Linear-Periodic Functions

3 End-to-End vs. Decomposition

4 Flat Activations

End-to-End vs. Decomposition

Input x: k-tuple of images of random lines

f1(x): For each image, whether the line slopes up or down

f2: Given the bit vector, return its parity

Important: Will focus on small k (where parity is easy)

Goal: Learn f2(f1(x))
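A hedged sketch of this data (image size, line rendering, and label conventions are assumptions; the authors' actual data pipeline may differ):

```python
import numpy as np

def random_line_image(rng, size=16):
    """A size x size image containing one random line; returns (image, slope-up bit)."""
    up = int(rng.integers(0, 2))                           # 1 if the line slopes upward
    slope = rng.uniform(0.2, 1.0) * (1 if up else -1)
    offset = rng.uniform(0.25, 0.75) * size
    img = np.zeros((size, size))
    for col in range(size):
        row = int(np.clip(offset + slope * (col - size / 2), 0, size - 1))
        img[row, col] = 1.0
    return img, up

def sample_example(rng, k=3):
    imgs, bits = zip(*(random_line_image(rng) for _ in range(k)))
    x = np.stack(imgs)                                     # the k-tuple of images
    bits = np.array(bits)                                  # f1(x): per-image slope bits
    parity = int(bits.sum() % 2)                           # f2: parity of the bits
    return x, bits, parity

rng = np.random.default_rng(0)
x, bits, parity = sample_example(rng)
```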

End-to-End vs. Decomposition

End-to-end approach: Train the overall network on the primary objective

Decomposition approach: Augment the objective with a loss specific to the first net, using the per-image labels (both objectives are sketched below)

[Plots for k = 1, 2, 3, 4: y-axis from 0.3 to 1, over 20000 training iterations]
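A minimal PyTorch sketch of the two objectives. The architectures, the loss weighting, and the sigmoid between the two nets are assumptions; only the idea (parity loss alone vs. parity loss plus per-image supervision) is taken from the slide:

```python
import torch
import torch.nn as nn

k, d = 3, 16 * 16
first_net = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))    # per-image slope score
second_net = nn.Sequential(nn.Linear(k, 32), nn.ReLU(), nn.Linear(32, 1))   # bits -> parity
bce = nn.BCEWithLogitsLoss()

def losses(images, bits, parity):
    """images: (B, k, d); bits: (B, k) and parity: (B,) are float 0/1 labels."""
    scores = first_net(images).squeeze(-1)                 # (B, k) per-image logits
    parity_logit = second_net(torch.sigmoid(scores)).squeeze(-1)
    end_to_end = bce(parity_logit, parity)                 # primary objective only
    decomposition = end_to_end + bce(scores, bits)         # plus per-image supervision
    return end_to_end, decomposition

# usage with random stand-in data (the line-image sketch above would supply real examples)
images = torch.rand(8, k, d)
bits = torch.randint(0, 2, (8, k)).float()
parity = bits.sum(dim=1) % 2
print(losses(images, bits, parity))
```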


Analysis

Through gradient variance w.r.t. target function

Under some simplifying assumptions,

Var_{w*}(∇F_{w*}(v)) ≤ O(√(k/d))^k, where d = number of pixels

Extremely concentrated already for very small values of k

Through gradient signal-to-noise ratio (SNR)

Ratio of the bias and variance of y(x) · g(x) w.r.t. a random input x, where y is the target and g is the gradient at the initialization point

[Plot: log(SNR) as a function of k = 1, . . . , 4; y-axis from −15 to −7]
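One way such an SNR could be estimated empirically. The precise definition used in the paper may differ; here "signal" is the squared norm of E_x[y(x)g(x)] and "noise" is the total variance of y(x)g(x), and the toy data at the end is a stand-in:

```python
import torch
import torch.nn as nn

def gradient_snr(net, sample_example, num_samples=500):
    """Estimate ||E_x[y(x) g(x)]||^2 / total variance of y(x) g(x), where g(x) is the
    gradient of the network output w.r.t. its parameters at initialization."""
    vecs = []
    for _ in range(num_samples):
        x, y = sample_example()                 # one input and its target label y(x)
        net.zero_grad()
        net(x).squeeze().backward()             # populates p.grad with g(x)
        g = torch.cat([p.grad.flatten().clone() for p in net.parameters()])
        vecs.append(y * g)
    V = torch.stack(vecs)
    signal = V.mean(dim=0).norm() ** 2
    noise = V.var(dim=0).sum()
    return (signal / noise).item()

# toy usage with stand-in data; the real experiment would use the k-tuple-of-images data
d = 3 * 16 * 16
net = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))
sample = lambda: (torch.rand(1, d), 1.0 if torch.rand(1).item() > 0.5 else -1.0)
print(gradient_snr(net, sample))
```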


Outline

1 Piecewise Linear Curves

2 Linear-Periodic Functions

3 End-to-End vs. Decomposition

4 Flat Activations

Flat Activations

Vanishing gradients due to saturating activations (e.g. in RNNs)

Flat Activations

Problem: Learning x ↦ u(w*^⊤x), where u is a fixed step function

Optimization problem:

min_w E_x[(u(N_w(x)) − u(w*^⊤x))^2]

u′(z) = 0 almost everywhere → can't apply gradient-based methods

Standard workarounds (smooth approximations; end-to-end; multiclass) don't work too well either – see the paper

Flat Activations

Different approach (Kalai & Sastry 2009; Kakade, Kalai, Kanade, S. 2011): Gradient descent, but replace the gradient with something else

min_w E_x[ (1/2) (u(w^⊤x) − u(w*^⊤x))^2 ]

True gradient: ∇ = E_x[(u(w^⊤x) − u(w*^⊤x)) · u′(w^⊤x) · x]

Replace with: ∇̃ = E_x[(u(w^⊤x) − u(w*^⊤x)) · x]

Interpretation: “Forward only” backpropagation
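A minimal NumPy sketch of this update in the linear case (N_w(x) = w^⊤x). The step size, batch size, and the particular step activation u are assumptions; the point is only that the u′(w^⊤x) factor, which is zero almost everywhere, is dropped:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
w_star = rng.standard_normal(d)
u = lambda z: (z > 0).astype(float)            # a fixed step activation (example choice)

def err(w, m=100_000):                         # Monte Carlo estimate of E[(u(w.x) - u(w*.x))^2]
    x = rng.standard_normal((m, d))
    return np.mean((u(x @ w) - u(x @ w_star)) ** 2)

w = np.zeros(d)
print("before:", err(w))                       # ~0.5
for _ in range(5000):
    x = rng.standard_normal((256, d))
    delta = u(x @ w) - u(x @ w_star)           # u(w^T x) - u(w*^T x)
    w -= 0.1 * (delta @ x) / len(x)            # the surrogate step: the u'(w^T x) factor is dropped
print("after:", err(w))                        # much smaller: w now points roughly along w*
```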

Flat Activations

(linear; 5000 iterations)

Best results, and smallest train+test time

Analysis (KS09, KKKS11): Needs O(L^2/ε^2) iterations if u is L-Lipschitz

Summary

Simple problems where standard gradient-based deep learning doesn't work well (or at all), even under favorable conditions

Not due to local minima/saddle points!

Prior knowledge and domain expertise can still be important

For more details:

“Distribution-Specific Hardness of Learning Neural Networks”: arXiv 1609.01037

“Failures of Gradient-Based Deep Learning”: arXiv 1703.07950

github.com/shakedshammah/failures_of_DL
