Learning Deep Learning
M1 Sonse Shimaoka

Page 1: Learning Deep Learning

Learning Deep Learning

M1 Sonse Shimaoka

Page 2: Learning Deep Learning

Neural Network

(From the lecture slides of Nando de Freitas)

Page 3: Learning Deep Learning

Machine Learning

Page 4: Learning Deep Learning

Supervised Learning

Training data:

Input                     Output
[0,0,1,0,1,1,0,0,0,1,1]   1
[1,1,1,0,1,1,1,0,0,1,1]   0
[1,1,1,0,1,1,0,0,0,1,1]   0
[0,0,0,0,1,1,1,0,0,0,0]   1
[1,0,1,0,1,1,0,0,0,0,0]   1
[1,0,1,0,0,0,0,0,0,1,1]   0
[0,0,0,0,1,1,0,1,0,1,1]   1

Test data:

Input                     Output
[1,0,1,0,1,1,0,0,0,1,0]   ?
[1,1,1,1,1,1,1,0,0,1,1]   ?
[1,0,1,0,1,1,0,1,0,1,1]   ?

Generalization

Page 5: Learning Deep Learning

Perceptron

[Diagram: inputs x1, x2, x3 with weights w1, w2, w3 and bias b feed a weighted sum ∑, followed by the sign function, producing the output y]

y = sign( ∑_{j=1}^{3} w_j x_j + b )

Page 6: Learning Deep Learning

Perceptron

[Diagram: inputs (1, 3, −2) with weights (2, 1, 1.5) and bias 0.5 feed the weighted sum ∑, followed by sign, producing the output 1]

y = sign( ∑_{j=1}^{3} w_j x_j + b )

1·2 + 3·1 − 2·1.5 + 0.5 = 2.5

Page 7: Learning Deep Learning

Perceptron

(x1, x2, x3) = (1, 3, −2), (w1, w2, w3) = (2, 1, 1.5), b = 0.5

y = sign( ∑_{i=1}^{3} w_i x_i + b ) = sign( w1 x1 + w2 x2 + w3 x3 + b )
  = sign( 1·2 + 3·1 − 2·1.5 + 0.5 ) = sign(2.5) = 1
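The computation on this slide can be sketched in Python (a minimal illustration with the slide's numbers; the function name is ours):

```python
import numpy as np

def perceptron(x, w, b):
    """Output sign(sum_j w_j * x_j + b)."""
    s = np.dot(w, x) + b
    return 1 if s >= 0 else -1

x = np.array([1.0, 3.0, -2.0])   # inputs from the slide
w = np.array([2.0, 1.0, 1.5])    # weights from the slide
b = 0.5
print(perceptron(x, w, b))       # 1*2 + 3*1 - 2*1.5 + 0.5 = 2.5, so the output is 1
```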

Page 8: Learning Deep Learning

Perceptron

[Diagram: points in the (x1, x2) plane separated by the decision boundary w1 x1 + w2 x2 + b = 0]

Page 9: Learning Deep Learning

Problem with Perceptron

[Diagram: the (x1, x2) plane with the decision boundary w1 x1 + w2 x2 + b = 0 and a query point near it]

What is the probability that this point belongs to the positive class?

The perceptron can't answer this!

Page 10: Learning Deep Learning

Problem with Perceptron

[Diagram: points in the (x1, x2) plane that no straight line can separate]

Impossible to separate linearly!!

Page 11: Learning Deep Learning

Logistic Regression

[Diagram: inputs x1, x2, x3 with weights w1, w2, w3 and bias b feed a weighted sum ∑, followed by the sigmoid function, producing the output y]

y = sigmoid( ∑_{j=1}^{3} w_j x_j + b )

Page 12: Learning Deep Learning

Logistic Regression

sigmoid(x) = 1 / (1 + exp(−x))

Page 13: Learning Deep Learning

Logistic Regression

y = sigmoid( ∑_{j=1}^{3} w_j x_j + b )

[Diagram: inputs (1, 3, −2) with weights (2, 1, 1.5) and bias 0.5 feed the weighted sum ∑, followed by sigmoid]

1·2 + 3·1 − 2·1.5 + 0.5 = 2.5

sigmoid(2.5) ≈ 0.924

A probability!!
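The same weighted sum squashed to a probability, as a minimal Python sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

s = 1*2 + 3*1 - 2*1.5 + 0.5   # the same weighted sum as before: 2.5
p = sigmoid(s)
print(round(p, 3))             # 0.924: a probability of belonging to the positive class
```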

Page 14: Learning Deep Learning

Feature Transformation

[Diagram: the original space (x1, x2) is mapped by a non-linear transformation Φ to a new space (φ1(x1, x2), φ2(x1, x2)) in which the classes become linearly separable]

But we must still design the transformation...

Page 15: Learning Deep Learning

Feed Forward Neural Network

[Diagram: a neuron computes a weighted sum ∑ followed by an activation function f]

Page 16: Learning Deep Learning

Feed Forward Neural Network

[Diagram: inputs x1, x2, x3 feed three neurons h1, h2, h3 (each a weighted sum ∑ plus activation f), which feed one neuron y1 (a weighted sum ∑ plus activation g)]

Page 17: Learning Deep Learning

Feed Forward Neural Network

[Diagram: the same network, with x1, x2, x3 labeled as the input layer, h1, h2, h3 as the hidden layer, and y1 as the output layer]

Page 18: Learning Deep Learning

Abstraction by Layer

x → Linear (W) → Wx → f → h → Linear (V) → Vh → g → y

i.e. h = f(Wx), y = g(Vh)
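A minimal NumPy sketch of this two-layer abstraction (shapes and activations are illustrative choices of ours; biases are omitted for brevity):

```python
import numpy as np

def forward(x, W, V, f, g):
    """Two-layer feedforward pass: h = f(Wx), y = g(Vh)."""
    h = f(W @ x)      # hidden layer
    return g(V @ h)   # output layer

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3))   # input -> hidden weights
V = rng.standard_normal((1, 3))   # hidden -> output weights
x = np.array([1.0, 3.0, -2.0])
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
y = forward(x, W, V, np.tanh, sigmoid)
print(y.shape)   # (1,)
```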

Page 19: Learning Deep Learning

FFN can learn representations!!

Page 20: Learning Deep Learning

FFN can learn representations!!

Page 21: Learning Deep Learning

Activation Functions

sigmoid(x) = 1 / (1 + exp(−x)) = exp(x) / (exp(x) + 1)

Page 22: Learning Deep Learning

Activation Functions

tanh(x) = (exp(x) − exp(−x)) / (exp(x) + exp(−x))

Page 23: Learning Deep Learning

Activation Functions

rectifier(x) = max(0, x)

Page 24: Learning Deep Learning

Activation Functions

softmax(x1, ..., xm)_c = exp(x_c) / ∑_{k=1}^{m} exp(x_k)
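The four activation functions above, sketched in NumPy (tanh is already provided as np.tanh; the max-shift in softmax is a standard numerical-stability trick, not from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rectifier(x):
    return np.maximum(0.0, x)   # max(0, x), also known as ReLU

def softmax(x):
    e = np.exp(x - np.max(x))   # subtracting the max leaves the result unchanged
    return e / e.sum()

print(sigmoid(0.0))                              # 0.5
print(rectifier(np.array([-1.0, 2.0])))          # [0. 2.]
print(softmax(np.array([1.0, 2.0, 3.0])).sum())  # 1.0 (a probability distribution)
```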

Page 25: Learning Deep Learning

Loss Functions

•  When you want a model to learn to do something, you give it feedback on how well it is doing.

•  The function that computes an objective measure of the model's performance is called a loss function.

•  A typical loss function takes the model's output and the ground truth and computes a value that quantifies the model's performance.

•  The model then corrects itself to achieve a smaller loss.

Page 26: Learning Deep Learning

L2 Norm

Task: Regression
Output: (y1, ..., yn)
Target: (t1, ..., tn)
Loss: L = (1/n) ∑_{i=1}^{n} ‖t_i − y_i‖²₂

Page 27: Learning Deep Learning

Cross Entropy

Task: Binary Classification
Output: (y1, ..., yn)
Target: (t1, ..., tn)
Loss: L = (1/n) ∑_{i=1}^{n} [ −t_i log y_i − (1 − t_i) log(1 − y_i) ]

Page 28: Learning Deep Learning

Class Negative Log Likelihood

Task: Multi-Class Classification
Output: (y1, ..., yn)
Target: (t1, ..., tn)
Loss: L = −(1/n) ∑_{i=1}^{n} ∑_{k=1}^{m} t_{i,k} log y_{i,k}
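The three losses above, sketched in NumPy (function names and toy data are ours):

```python
import numpy as np

def l2_loss(t, y):
    """Regression: (1/n) * sum of squared L2 distances."""
    return np.mean(np.sum((t - y) ** 2, axis=-1))

def cross_entropy(t, y):
    """Binary classification: t in {0,1}, y in (0,1)."""
    return np.mean(-t * np.log(y) - (1 - t) * np.log(1 - y))

def class_nll(t, y):
    """Multi-class: rows of t are one-hot, rows of y are class probabilities."""
    return -np.mean(np.sum(t * np.log(y), axis=1))

t = np.array([1.0, 0.0, 1.0])
y = np.array([0.9, 0.1, 0.8])
print(cross_entropy(t, y))   # small, since the predictions match the targets
```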

Page 29: Learning Deep Learning

Output Activation Functions and Loss Functions

Task                        Output activation   Loss function
Regression                  Linear              L2 norm
Binary Classification       Sigmoid             Cross Entropy
Multi-Class Classification  Softmax             Class NLL

Page 30: Learning Deep Learning

Probabilistic Perspective

•  We can view an NN as computing conditional probabilities.

[Diagram: the feedforward network, whose output is interpreted as p(t1 | x1, x2, x3)]

Page 31: Learning Deep Learning

Probabilistic Perspective

•  When p(t | x) = (1/(√(2π) σ)) exp( −(t − y)² / (2σ²) ):

NLL = −log ∏_{i=1}^{n} p(t_i | x_i)
    = −∑_{i=1}^{n} log [ (1/(√(2π) σ)) exp( −(t_i − y_i)² / (2σ²) ) ]
    = (1/(2σ²)) ∑_{i=1}^{n} (t_i − y_i)² + n log(√(2π) σ)

→ the L2 norm (up to constants)

Page 32: Learning Deep Learning

Probabilistic Perspective

•  When p(t | x) = y^t (1 − y)^(1−t):

NLL = −log ∏_{i=1}^{n} p(t_i | x_i)
    = −log ∏_{i=1}^{n} y_i^{t_i} (1 − y_i)^{1−t_i}
    = ∑_{i=1}^{n} [ −t_i log y_i − (1 − t_i) log(1 − y_i) ]

→ Cross Entropy

Page 33: Learning Deep Learning

Probabilistic Perspective

•  When p(t | x) = ∏_{k=1}^{m} y_k^{t_k}:

NLL = −log ∏_{i=1}^{n} p(t_i | x_i)
    = −log ∏_{i=1}^{n} ∏_{k=1}^{m} y_{i,k}^{t_{i,k}}
    = −∑_{i=1}^{n} ∑_{k=1}^{m} t_{i,k} log y_{i,k}

→ Class Negative Log Likelihood

Page 34: Learning Deep Learning

Gradient Descent

•  Gradient

•  Gradient Descent

Page 35: Learning Deep Learning

Gradient Descent

Function to be minimized: L(w)
Initial point: w_init
Learning rate: α
Update rule: w_new ← w_old − α ∂L/∂w |_{w = w_old}
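The update rule above, sketched as a Python loop (the quadratic example function is ours):

```python
def gradient_descent(grad, w_init, alpha, steps):
    """Repeatedly apply w <- w - alpha * dL/dw."""
    w = w_init
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

# Minimize L(w) = (w - 3)^2; its gradient is 2(w - 3).
w = gradient_descent(lambda w: 2.0 * (w - 3.0), w_init=0.0, alpha=0.1, steps=100)
print(round(w, 4))   # approaches 3, the minimizer
```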

Page 36: Learning Deep Learning

Gradient Descent

[Diagram: two loss curves, one with a big learning rate and one with a small learning rate]

Page 37: Learning Deep Learning

Loss Function for Logistic Regression

The log-likelihood of the data (to be maximized; its negative is the loss):

L(w,b;D) = log ∏_{i=1}^{n} y_i^{t_i} (1 − y_i)^{1−t_i}
         = ∑_{i=1}^{n} [ t_i log y_i + (1 − t_i) log(1 − y_i) ]

where y_i = 1 / (1 + exp(−wᵀx_i − b))

Page 38: Learning Deep Learning

Gradient with respect to w

∂L(w,b;D)/∂w
= ∂/∂w ∑_{i=1}^{n} [ t_i log y_i + (1 − t_i) log(1 − y_i) ]
= ∑_{i=1}^{n} (∂y_i/∂w) ∂/∂y_i [ t_i log y_i + (1 − t_i) log(1 − y_i) ]
= ∑_{i=1}^{n} (∂y_i/∂w) [ t_i/y_i − (1 − t_i)/(1 − y_i) ]
= ∑_{i=1}^{n} (∂y_i/∂w) (t_i − y_i) / (y_i(1 − y_i))
= ∑_{i=1}^{n} x_i y_i(1 − y_i) · (t_i − y_i) / (y_i(1 − y_i))
= ∑_{i=1}^{n} x_i (t_i − y_i)

∵ ∂y_i/∂w = ∂/∂w [ 1 / (1 + exp(−wᵀx_i − b)) ]
          = −∂/∂w (1 + exp(−wᵀx_i − b)) / (1 + exp(−wᵀx_i − b))²
          = x_i exp(−wᵀx_i − b) / (1 + exp(−wᵀx_i − b))²
          = x_i y_i (1 − y_i)

Page 39: Learning Deep Learning

Gradient with respect to b

∂L(w,b;D)/∂b
= ∂/∂b ∑_{i=1}^{n} [ t_i log y_i + (1 − t_i) log(1 − y_i) ]
= ∑_{i=1}^{n} (∂y_i/∂b) ∂/∂y_i [ t_i log y_i + (1 − t_i) log(1 − y_i) ]
= ∑_{i=1}^{n} (∂y_i/∂b) [ t_i/y_i − (1 − t_i)/(1 − y_i) ]
= ∑_{i=1}^{n} (∂y_i/∂b) (t_i − y_i) / (y_i(1 − y_i))
= ∑_{i=1}^{n} y_i(1 − y_i) · (t_i − y_i) / (y_i(1 − y_i))
= ∑_{i=1}^{n} (t_i − y_i)

∵ ∂y_i/∂b = ∂/∂b [ 1 / (1 + exp(−wᵀx_i − b)) ]
          = −∂/∂b (1 + exp(−wᵀx_i − b)) / (1 + exp(−wᵀx_i − b))²
          = exp(−wᵀx_i − b) / (1 + exp(−wᵀx_i − b))²
          = y_i (1 − y_i)

Page 40: Learning Deep Learning

Gradient Descent for Logistic Regression

Function to be minimized: the negative log-likelihood
−L(w,b;D) = −∑_{i=1}^{n} [ t_i log y_i + (1 − t_i) log(1 − y_i) ]

Update rules (descending on −L flips the sign of the gradients derived above):

w_new ← w_old + α ∑_{i=1}^{n} x_i (t_i − y_i)
b_new ← b_old + α ∑_{i=1}^{n} (t_i − y_i)
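These updates can be sketched as a NumPy training loop (the toy data is ours; the `+=` implements ascent on the log-likelihood, i.e. descent on its negative):

```python
import numpy as np

def train_logistic(X, t, alpha=0.1, steps=500):
    """Batch gradient updates for logistic regression."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        y = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predictions
        w += alpha * (X.T @ (t - y))             # sum_i x_i (t_i - y_i)
        b += alpha * np.sum(t - y)               # sum_i (t_i - y_i)
    return w, b

# Toy 1-D linearly separable data (hypothetical).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
t = np.array([0.0, 0.0, 1.0, 1.0])
w, b = train_logistic(X, t)
y = 1.0 / (1.0 + np.exp(-(X @ w + b)))
print((y > 0.5).astype(int))   # [0 0 1 1]
```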

Page 41: Learning Deep Learning

Exercise: Gradient Descent for Linear Regression

L(w,b;D) = ∑_{i=1}^{n} (t_i − y_i)²,  where y_i = wᵀx_i + b

Page 42: Learning Deep Learning

Answer

Function to be minimized:

L(w,b;D) = ∑_{i=1}^{n} (t_i − y_i)²

Since ∂L/∂w = −2 ∑_{i=1}^{n} x_i (t_i − y_i) and ∂L/∂b = −2 ∑_{i=1}^{n} (t_i − y_i), the update rules are (absorbing the factor 2 into α):

w_new ← w_old + α ∑_{i=1}^{n} x_i (t_i − y_i)
b_new ← b_old + α ∑_{i=1}^{n} (t_i − y_i)
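A sketch of the corresponding training loop (toy data generated from y = 2x + 1; the factor 2 in the gradient is folded into α):

```python
import numpy as np

def train_linear(X, t, alpha=0.01, steps=2000):
    """Gradient descent on the squared error sum_i (t_i - y_i)^2."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        y = X @ w + b
        w += alpha * (X.T @ (t - y))   # minus the gradient, with 2 folded into alpha
        b += alpha * np.sum(t - y)
    return w, b

X = np.array([[0.0], [1.0], [2.0], [3.0]])
t = 2 * X[:, 0] + 1                    # targets from y = 2x + 1
w, b = train_linear(X, t)
print(round(w[0], 2), round(b, 2))     # recovers slope 2.0 and intercept 1.0
```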

Page 43: Learning Deep Learning

Backpropagation

How do we compute ∂L/∂W and ∂L/∂V?

[Diagram: inputs x1, x2, x3; pre-activations u1, u2, u3; hidden units h1, h2, h3 (activation f); pre-activations l1, l2; outputs y1, y2 (activation g); loss L; the weights W2,3 and V3,1 are highlighted]

Page 44: Learning Deep Learning

Backpropagation

Use the Chain Rule!!!

∂/∂x q(s(x)) = (∂s(x)/∂x) · (∂q(s(x))/∂s(x))

[Diagram: x → s → s(x) → q → q(s(x)); the gradient ∂q(s(x))/∂s(x) flows backward through s, becoming (∂s(x)/∂x) · (∂q(s(x))/∂s(x))]

Page 45: Learning Deep Learning

Backpropagation

Start from the output layer: ∂L/∂y1

[Diagram: the network from the previous slide; the backward pass begins with ∂L/∂y1 at the output]

Page 46: Learning Deep Learning

Backpropagation

Apply the Chain Rule:

∂L/∂l1 = (∂y1/∂l1) ∂L/∂y1 = g′(l1) ∂L/∂y1

[Diagram: ∂L/∂y1 flows back through g to give ∂L/∂l1]

Page 47: Learning Deep Learning

Backpropagation

Apply the Chain Rule:

∂L/∂V3,1 = (∂l1/∂V3,1) ∂L/∂l1 = h3 ∂L/∂l1

[Diagram: ∂L/∂l1 gives the gradient for the weight V3,1]

Page 48: Learning Deep Learning

Backpropagation

Apply the Chain Rule (h3 feeds both l1 and l2, so the two paths are summed):

∂L/∂h3 = (∂l1/∂h3) ∂L/∂l1 + (∂l2/∂h3) ∂L/∂l2 = V3,1 ∂L/∂l1 + V3,2 ∂L/∂l2

[Diagram: the gradients arriving from l1 and l2 combine at h3]

Page 49: Learning Deep Learning

Backpropagation

Apply the Chain Rule:

∂L/∂u3 = (∂h3/∂u3) ∂L/∂h3 = f′(u3) ∂L/∂h3

[Diagram: ∂L/∂h3 flows back through f to give ∂L/∂u3]

Page 50: Learning Deep Learning

Backpropagation

Apply the Chain Rule:

∂L/∂W2,3 = (∂u3/∂W2,3) ∂L/∂u3 = x2 ∂L/∂u3

[Diagram: ∂L/∂u3 gives the gradient for the weight W2,3]

Page 51: Learning Deep Learning

Abstraction by Layer

[Diagram: forward pass x → Linear (W) → Wx → f → h → Linear (V) → Vh → g → y → L (with target t); the backward pass carries ∂L/∂y, ∂L/∂(Vh), ∂L/∂h, ∂L/∂(Wx), and ∂L/∂x back through the layers, producing the parameter gradients ∂L/∂V and ∂L/∂W along the way]

Page 52: Learning Deep Learning

Abstraction by Layer

[Diagram: a Layer maps input → output in the forward direction, and ∂loss/∂output → ∂loss/∂input in the backward direction]

Page 53: Learning Deep Learning

Abstraction by Layer

Forward Computation:

output = Layer.forward(input)

Page 54: Learning Deep Learning

Abstraction by Layer

Backward Computation:

∂loss/∂input = Layer.backward(input, ∂loss/∂output)
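The forward/backward interface above can be sketched as Python classes (a minimal sketch; the method names mirror the slides, not any particular library's API):

```python
import numpy as np

class Tanh:
    """A layer exposing the forward/backward interface from the slides."""
    def forward(self, x):
        return np.tanh(x)
    def backward(self, x, dloss_doutput):
        h = np.tanh(x)
        return (1.0 - h ** 2) * dloss_doutput   # tanh'(x) = 1 - tanh(x)^2

class Linear:
    def __init__(self, W):
        self.W = W
    def forward(self, x):
        return self.W @ x
    def backward(self, x, dloss_doutput):
        self.dW = np.outer(dloss_doutput, x)    # gradient w.r.t. the weights
        return self.W.T @ dloss_doutput         # gradient w.r.t. the input

layer = Tanh()
x = np.array([0.5, -1.0])
out = layer.forward(x)
grad_in = layer.backward(x, np.ones_like(x))
```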

Page 55: Learning Deep Learning

Backpropagation

①  Execute the forward computation

[Diagram: x → Linear (W) → Wx → f → h → Linear (V) → Vh → g → y → L (with target t)]

Page 56: Learning Deep Learning

Backpropagation

②  Compute the derivative of the loss function with respect to the output: ∂L/∂y

[Diagram: the same pipeline, with ∂L/∂y at the output]

Page 57: Learning Deep Learning

Backpropagation

③  Starting from the final layer, backpropagate derivatives through the layers

[Diagram: ∂L/∂y flows back through g to ∂L/∂(Vh), and onward through each layer]

Page 58: Learning Deep Learning

Classifying Digits

32×32 = 1024 pixels
Classes: 10 digits (0–9)
Training: 60000 examples
Testing: 60000 examples

Page 59: Learning Deep Learning

Classifying Digits

x ∈ R^1024

t = (0, 0, 0, 0, 1, 0, 0, 0, 0, 0)ᵀ   (a one-hot vector)

Page 60: Learning Deep Learning

Classifying Digits

[Pipeline: x → Linear (W, b) → Tanh → Linear (V, c) → Softmax → Class NLL (target t)]

Page 61: Learning Deep Learning

Classifying Digits

[Pipeline: x → Linear (W, b) → Tanh → Linear (V, c) → Softmax → Class NLL (target t)]

u = Wx + b

Page 62: Learning Deep Learning

Classifying Digits

[Pipeline: x → Linear (W, b) → Tanh → Linear (V, c) → Softmax → Class NLL (target t)]

h = Tanh(u)

Page 63: Learning Deep Learning

Classifying Digits

[Pipeline: x → Linear (W, b) → Tanh → Linear (V, c) → Softmax → Class NLL (target t)]

l = Vh + c

Page 64: Learning Deep Learning

Classifying Digits

[Pipeline: x → Linear (W, b) → Tanh → Linear (V, c) → Softmax → Class NLL (target t)]

y = softmax(l)

Page 65: Learning Deep Learning

Classifying Digits

[Pipeline: x → Linear (W, b) → Tanh → Linear (V, c) → Softmax → Class NLL (target t)]

L = −∑_{k=1}^{10} t_k log y_k = −tᵀ log y
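The forward equations of this pipeline, assembled in NumPy (layer sizes and data here are hypothetical stand-ins; the deck itself uses Torch7):

```python
import numpy as np

def forward(x, W, b, V, c, t):
    """Linear -> Tanh -> Linear -> Softmax -> Class NLL."""
    u = W @ x + b
    h = np.tanh(u)
    l = V @ h + c
    e = np.exp(l - l.max())
    y = e / e.sum()            # softmax
    L = -t @ np.log(y)         # class negative log likelihood
    return u, h, l, y, L

rng = np.random.default_rng(0)
W, b = 0.01 * rng.standard_normal((50, 1024)), np.zeros(50)
V, c = 0.01 * rng.standard_normal((10, 50)), np.zeros(10)
x = rng.random(1024)
t = np.zeros(10); t[4] = 1.0   # one-hot target for the digit 4
u, h, l, y, L = forward(x, W, b, V, c, t)
print(y.sum())                 # 1.0: a distribution over the 10 digits
```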

Page 66: Learning Deep Learning

Classifying Digits

[Pipeline: x → Linear (W, b) → Tanh → Linear (V, c) → Softmax → Class NLL (target t)]

∂L/∂y = ∂/∂y (−tᵀ log y) = −( t1/y1 , ... , t10/y10 )ᵀ

Page 67: Learning Deep Learning

Classifying Digits

[Pipeline: x → Linear (W, b) → Tanh → Linear (V, c) → Softmax → Class NLL (target t)]

Using the softmax Jacobian ∂y_k/∂l_c = y_k (δ_{k,c} − y_c) and ∂L/∂y_k = −t_k/y_k:

∂L/∂l_c = ∑_k (∂y_k/∂l_c) ∂L/∂y_k = −∑_k y_k (δ_{k,c} − y_c) t_k/y_k = −t_c + y_c ∑_k t_k = y_c − t_c

so ∂L/∂l = y − t.

Page 68: Learning Deep Learning

Classifying Digits

[Pipeline: x → Linear (W, b) → Tanh → Linear (V, c) → Softmax → Class NLL (target t)]

∂L/∂h = Vᵀ ∂L/∂l
∂L/∂V = (∂L/∂l) hᵀ
∂L/∂c = ∂L/∂l

Page 69: Learning Deep Learning

Classifying Digits

[Pipeline: x → Linear (W, b) → Tanh → Linear (V, c) → Softmax → Class NLL (target t)]

∂L/∂u = (1 + h) ⊙ (1 − h) ⊙ ∂L/∂h    (since tanh′(u) = 1 − h²)

Page 70: Learning Deep Learning

Classifying Digits

[Pipeline: x → Linear (W, b) → Tanh → Linear (V, c) → Softmax → Class NLL (target t)]

∂L/∂W = (∂L/∂u) xᵀ
∂L/∂b = ∂L/∂u
∂L/∂x = Wᵀ ∂L/∂u

Page 71: Learning Deep Learning

Classifying Digits

W_new ← W − α ∂L/∂W
b_new ← b − α ∂L/∂b
V_new ← V − α ∂L/∂V
c_new ← c − α ∂L/∂c
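The backward pass and parameter updates from the preceding slides, assembled into one NumPy training step (sizes and data are toy stand-ins, not the actual digit dataset):

```python
import numpy as np

def train_step(x, t, W, b, V, c, alpha=0.1):
    """One gradient descent step on the class NLL of the two-layer network."""
    # forward
    u = W @ x + b
    h = np.tanh(u)
    l = V @ h + c
    e = np.exp(l - l.max())
    y = e / e.sum()
    # backward (softmax + class NLL gives dL/dl = y - t)
    dl = y - t
    dV = np.outer(dl, h)
    dc = dl
    dh = V.T @ dl
    du = (1.0 - h ** 2) * dh       # tanh'(u) = 1 - h^2
    dW = np.outer(du, x)
    db = du
    # parameter updates (in place)
    W -= alpha * dW; b -= alpha * db
    V -= alpha * dV; c -= alpha * dc
    return -t @ np.log(y)          # loss before the update

rng = np.random.default_rng(0)
W, b = 0.1 * rng.standard_normal((5, 4)), np.zeros(5)
V, c = 0.1 * rng.standard_normal((3, 5)), np.zeros(3)
x = rng.random(4)
t = np.array([0.0, 1.0, 0.0])
losses = [train_step(x, t, W, b, V, c) for _ in range(50)]
print(losses[0] > losses[-1])      # True: the loss decreases over the steps
```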

Page 72: Learning Deep Learning

Torch7  

Page 73: Learning Deep Learning

Torch7  

Page 74: Learning Deep Learning

Torch7