
Backpropagation in Convolutional Neural Network


Page 1: Backpropagation in Convolutional Neural Network

Memo: Backpropagation in Convolutional Neural Network

Hiroshi Kuwajima, 13-03-2014


Page 2: Backpropagation in Convolutional Neural Network

Note

■ Purpose

The purpose of this memo is to understand and review the backpropagation algorithm in Convolutional Neural Networks, based on a discussion with Prof. Masayuki Tanaka.

■ Table of Contents
In this memo, backpropagation algorithms in different neural networks are explained in the following order.
□ Single neuron (page 3)
□ Multi-layer neural network (page 5)
□ General cases (page 7)
□ Convolution layer (page 9)
□ Pooling layer (page 11)
□ Convolutional Neural Network (page 13)

■ Notation
This memo follows the notation in the UFLDL tutorial (http://ufldl.stanford.edu/tutorial).

Page 3: Backpropagation in Convolutional Neural Network

Neural Network as a Composite Function
A neural network is decomposed into a composite function in which each function element corresponds to a differentiable operation.

■ Single neuron (the simplest neural network) example

A single neuron is decomposed into a composite function of an affine function element, parameterized by W and b, and an activation function element f, which we choose to be the sigmoid function.

Derivatives of both the affine and sigmoid function elements w.r.t. both inputs and parameters are known. Note that the sigmoid function has neither parameters nor derivatives w.r.t. parameters.

The sigmoid function is applied element-wise. '•' denotes the Hadamard product, i.e. the element-wise product.

h_{W,b}(x) = f(W^T x + b) = sigmoid(affine_{W,b}(x)) = (sigmoid ∘ affine_{W,b})(x)

∂a/∂z = a • (1 − a),  where a = h_{W,b}(x) = sigmoid(z) = 1 / (1 + exp(−z))

∂z/∂x = W,  ∂z/∂W = x,  ∂z/∂b = I,  where z = affine_{W,b}(x) = W^T x + b and I is the identity matrix

[Figure: decomposition of a neuron. Standard network representation: inputs x1, x2, x3 and the bias unit +1 feed a single neuron producing h_{W,b}(x). Composite function representation: the same inputs pass through an affine element (output z) and an activation element, e.g. sigmoid, whose output a is h_{W,b}(x).]
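As a concrete illustration of this decomposition, here is a minimal NumPy sketch (not from the memo) of the two function elements and their local derivatives; the function names follow the memo, while the shapes and example values are illustrative assumptions.

import numpy as np

def affine(W, b, x):
    # z = W^T x + b; local derivatives: dz/dx = W, dz/dW = x, dz/db = I
    return W.T @ x + b

def sigmoid(z):
    # a = 1 / (1 + exp(-z)); local derivative: da/dz = a • (1 - a), element-wise
    return 1.0 / (1.0 + np.exp(-z))

# Single neuron with 3 inputs (illustrative values)
W = np.array([[0.1], [-0.2], [0.3]])   # shape (3, 1)
b = np.array([0.05])                   # shape (1,)
x = np.array([1.0, 2.0, -1.0])         # one training example

z = affine(W, b, x)        # pre-activation
a = sigmoid(z)             # h_{W,b}(x)
da_dz = a * (1 - a)        # Hadamard product, as in the formula above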

Page 4: Backpropagation in Convolutional Neural Network

Chain Rule of Error Signals and Gradients
Error signals are defined as the derivatives of any cost function J, which we choose to be the square error. Error signals are computed (propagated backward) by the chain rule of derivatives and are useful for computing the gradient of the cost function.

■ Single neuron example

Suppose we have m labeled training examples {(x^(1), y^(1)), …, (x^(m), y^(m))}. The square error cost function for each example is as follows; the overall cost function is the sum of the cost functions over all examples.

J(W,b; x, y) = (1/2) ||y − h_{W,b}(x)||^2

Error signals of the square error cost function for each example are propagated using the derivatives of the function elements w.r.t. their inputs.

δ^(a) = ∂J(W,b; x, y)/∂a = −(y − a)

δ^(z) = ∂J(W,b; x, y)/∂z = (∂J/∂a)(∂a/∂z) = δ^(a) • a • (1 − a)

The gradient of the cost function w.r.t. the parameters for each example is computed using the error signals and the derivatives of the function elements w.r.t. the parameters. Summing the gradients over all examples gives the overall gradient.

∇_W J(W,b; x, y) = ∂J/∂W = (∂J/∂z)(∂z/∂W) = δ^(z) x^T

∇_b J(W,b; x, y) = ∂J/∂b = (∂J/∂z)(∂z/∂b) = δ^(z)
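A short NumPy sketch of these formulas for one example follows; the target y and the parameter values are illustrative assumptions, and the gradient is arranged to match W's shape rather than written literally as δ^(z) x^T.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = np.array([[0.1], [-0.2], [0.3]])   # illustrative parameters
b = np.array([0.05])
x = np.array([1.0, 2.0, -1.0])         # one training example
y = np.array([1.0])                    # its label

# Forward pass
z = W.T @ x + b
a = sigmoid(z)
J = 0.5 * np.sum((y - a) ** 2)         # square error for this example

# Backward pass: error signals
delta_a = -(y - a)                     # delta^(a) = dJ/da
delta_z = delta_a * a * (1 - a)        # delta^(z) = delta^(a) • a • (1 - a)

# Gradients for this example
grad_W = np.outer(x, delta_z)          # dJ/dW = delta^(z) x^T, arranged here to match W's (3, 1) shape
grad_b = delta_z                       # dJ/db = delta^(z)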

Page 5: Backpropagation in Convolutional Neural Network

Decomposition of Multi-Layer Neural Network

■ Composite function representation of a multi-layer neural network

h_{W,b}(x) = (sigmoid ∘ affine_{W^(2),b^(2)} ∘ sigmoid ∘ affine_{W^(1),b^(1)})(x),  with a^(1) = x and a^(l_max) = h_{W,b}(x)

■ Derivatives of function elements w.r.t. inputs and parameters

∂a^(l+1)/∂z^(l+1) = a^(l+1) • (1 − a^(l+1)),  where a^(l+1) = sigmoid(z^(l+1)) = 1 / (1 + exp(−z^(l+1)))

∂z^(l+1)/∂a^(l) = W^(l),  ∂z^(l+1)/∂W^(l) = a^(l),  ∂z^(l+1)/∂b^(l) = I,  where z^(l+1) = (W^(l))^T a^(l) + b^(l)

[Figure: decomposition of a two-layer network. Standard network representation: inputs x1, x2, x3 and a bias unit +1 feed Layer 1, whose activations a^(2) (plus a bias unit) feed Layer 2, producing h_{W,b}(x). Composite function representation: x (also written a^(1)) passes through Affine 1 (outputs z^(2)), Sigmoid 1 (outputs a^(2)), Affine 2 (outputs z^(3)), and Sigmoid 2 (outputs a^(3) = h_{W,b}(x)).]
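The composite-function view can be sketched directly as a composition of function elements; the 3-3-1 layer sizes and random parameters below are illustrative assumptions, not part of the memo.

import numpy as np
from functools import reduce

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def affine(W, b):
    # Returns the function element a -> W^T a + b, parameterized by W and b
    return lambda a: W.T @ a + b

# Illustrative two-layer network: 3 inputs -> 3 hidden units -> 1 output
W1, b1 = np.random.randn(3, 3), np.zeros(3)
W2, b2 = np.random.randn(3, 1), np.zeros(1)

# h_{W,b} = sigmoid o affine_{W2,b2} o sigmoid o affine_{W1,b1}
elements = [affine(W1, b1), sigmoid, affine(W2, b2), sigmoid]
h = lambda x: reduce(lambda acc, f: f(acc), elements, x)

x = np.array([1.0, 2.0, -1.0])
print(h(x))        # h_{W,b}(x), i.e. a^(l_max)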

Page 6: Backpropagation in Convolutional Neural Network

Error Signals and Gradients in Multi-Layer NN

■ Error signals of the square error cost function for each example

δ^(a(l)) = ∂J(W,b; x, y)/∂a^(l) = −(y − a^(l)) for l = l_max,
           and (∂J/∂z^(l+1))(∂z^(l+1)/∂a^(l)) = (W^(l))^T δ^(z(l+1)) otherwise

δ^(z(l)) = ∂J(W,b; x, y)/∂z^(l) = (∂J/∂a^(l))(∂a^(l)/∂z^(l)) = δ^(a(l)) • a^(l) • (1 − a^(l))

■ Gradient of the cost function w.r.t. parameters for each example

∇_{W^(l)} J(W,b; x, y) = (∂J/∂z^(l+1))(∂z^(l+1)/∂W^(l)) = δ^(z(l+1)) (a^(l))^T

∇_{b^(l)} J(W,b; x, y) = (∂J/∂z^(l+1))(∂z^(l+1)/∂b^(l)) = δ^(z(l+1))
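A minimal sketch of these error signals and gradients for one example, assuming the same illustrative 3-3-1 network as above. Note that with each W^(l) stored as an (inputs × outputs) matrix and z^(l+1) = (W^(l))^T a^(l), the product written here as W2 @ delta_z3 plays the role of the memo's (W^(l))^T δ^(z(l+1)).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative 3-3-1 network and one training example
W1, b1 = np.random.randn(3, 3), np.zeros(3)
W2, b2 = np.random.randn(3, 1), np.zeros(1)
x, y = np.array([1.0, 2.0, -1.0]), np.array([1.0])

# Forward pass, keeping every a^(l)
a1 = x
z2 = W1.T @ a1 + b1; a2 = sigmoid(z2)
z3 = W2.T @ a2 + b2; a3 = sigmoid(z3)           # a^(l_max) = h_{W,b}(x)

# Error signals, propagated backward
delta_a3 = -(y - a3)                            # l = l_max case
delta_z3 = delta_a3 * a3 * (1 - a3)
delta_a2 = W2 @ delta_z3                        # propagates delta^(z(3)) back through Affine 2
delta_z2 = delta_a2 * a2 * (1 - a2)

# Gradients for this example, arranged to match each W^(l)'s shape
grad_W2, grad_b2 = np.outer(a2, delta_z3), delta_z3
grad_W1, grad_b1 = np.outer(a1, delta_z2), delta_z2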

Page 7: Backpropagation in Convolutional Neural Network

Backpropagation in General Cases
1. Decompose the operations in the layers of a neural network into function elements whose derivatives w.r.t. inputs are known by symbolic computation.

2. Backpropagate error signals corresponding to a differentiable cost function by numerical computation (starting from the cost function, plug in error signals backward).

3. Use the backpropagated error signals to compute gradients w.r.t. parameters, only for the function elements with parameters, whose derivatives w.r.t. parameters are known by symbolic computation.

4. Sum the gradients over all examples to get the overall gradient. (A minimal sketch of this recipe follows the equations below.)

h_θ(x) = (f^(l_max) ∘ … ∘ f^(l)_{θ^(l)} ∘ … ∘ f^(2)_{θ^(2)} ∘ f^(1))(x),  where f^(1) = x, f^(l_max) = h_θ(x), and ∂f^(l+1)/∂f^(l) is known for all l

δ^(l) = ∂J(θ; x, y)/∂f^(l) = (∂J/∂f^(l+1))(∂f^(l+1)/∂f^(l)) = δ^(l+1) ∂f^(l+1)/∂f^(l),  where ∂J/∂f^(l_max) is known

∇_{θ^(l)} J(θ; x, y) = ∂J(θ; x, y)/∂θ^(l) = (∂J/∂f^(l))(∂f^(l)_{θ^(l)}/∂θ^(l)) = δ^(l) ∂f^(l)_{θ^(l)}/∂θ^(l),  where ∂f^(l)_{θ^(l)}/∂θ^(l) is known

∇_{θ^(l)} J(θ) = Σ_{i=1}^{m} ∇_{θ^(l)} J(θ; x^(i), y^(i))
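The four-step recipe can be sketched as a small framework in which each function element knows its own local derivatives; all class and method names below are hypothetical, and the square error cost is assumed.

import numpy as np

class Affine:
    """Function element f(a) = W^T a + b with known local derivatives."""
    def __init__(self, n_in, n_out):
        self.W, self.b = 0.1 * np.random.randn(n_in, n_out), np.zeros(n_out)
    def forward(self, a):
        self.a = a                       # cache the input for the parameter gradient
        return self.W.T @ a + self.b
    def backward(self, delta):
        # Step 3: gradients w.r.t. parameters, using the backpropagated error signal
        self.grad_W = np.outer(self.a, delta)
        self.grad_b = delta
        # Step 2: error signal for the previous function element
        return self.W @ delta

class Sigmoid:
    """Function element with no parameters."""
    def forward(self, z):
        self.out = 1.0 / (1.0 + np.exp(-z))
        return self.out
    def backward(self, delta):
        return delta * self.out * (1 - self.out)

def backprop(elements, x, y):
    # Step 1: the network is already decomposed into function elements
    out = x
    for f in elements:
        out = f.forward(out)
    delta = -(y - out)                   # dJ/df^(l_max) for the square error cost
    for f in reversed(elements):
        delta = f.backward(delta)        # steps 2-3: error signals and parameter gradients

net = [Affine(3, 3), Sigmoid(), Affine(3, 1), Sigmoid()]
backprop(net, np.array([1.0, 2.0, -1.0]), np.array([1.0]))
# Step 4: summing net[l].grad_W over all examples gives the overall gradient.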

Page 8: Backpropagation in Convolutional Neural Network

Convolutional Neural Network
A convolution-pooling layer in a Convolutional Neural Network is a composite function decomposed into the function elements f^(conv) and f^(pool).

Let  a  be  the  output  from  the  previous  layer.    

(f^(pool) ∘ f^(conv)_W)(a)

[Figure: a → Convolution (f^(conv)) → Pooling (f^(pool)).]

Page 9: Backpropagation in Convolutional Neural Network

Derivatives of Convolution

■ Discrete convolution parameterized by a feature W, and its derivatives

Let z = f^(conv) for simplicity.

z = a ∗ W = [z_n],  z_n = (a ∗ W)[n] = Σ_{i=1}^{|W|} a_{n+i−1} W_i = W^T a_{n:n+|W|−1}

∂z_n/∂a_{n:n+|W|−1} = W,  ∂z_n/∂a_{n+i−1} = W_i,  ∂z_{n−i+1}/∂a_n = W_i for 1 ≤ i ≤ |W|,  and  ∂z_n/∂W = a_{n:n+|W|−1}

[Figure: forward convolution. z_n has incoming connections from a_{n:n+|W|−1} with weights W_1, W_2, …; a_n has outgoing connections to z_{n−|W|+1:n}, i.e., all of z_{n−|W|+1:n} have derivatives w.r.t. a_n.]
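A NumPy sketch of this definition, with illustrative values for a and W; note that the sum as written here is what NumPy calls correlation (np.correlate), since np.convolve would flip W first.

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # output of the previous layer (illustrative)
W = np.array([0.5, -1.0, 0.25])             # feature / kernel, |W| = 3

# z_n = sum_i a_{n+i-1} W_i  (0-indexed below): a sliding inner product over a
z = np.array([W @ a[n:n + W.size] for n in range(a.size - W.size + 1)])

assert np.allclose(z, np.correlate(a, W, mode='valid'))   # the same "valid" operation

# Local derivatives: dz_n/da_{n:n+|W|-1} = W and dz_n/dW = a_{n:n+|W|-1}
dz0_dW = a[0:W.size]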

Page 10: Backpropagation in Convolutional Neural Network

Backpropagation in Convolution
Error signals and the gradient for each example are computed by convolution, using its commutativity property and the multivariable chain rule of derivatives.

Let the error signals at the convolution layer be δ^(z) = δ^(conv) for consistency.

δ_n^(a) = ∂J(θ; x, y)/∂a_n = (∂J/∂z)(∂z/∂a_n) = Σ_{i=1}^{|W|} (∂J/∂z_{n−i+1})(∂z_{n−i+1}/∂a_n) = Σ_{i=1}^{|W|} δ_{n−i+1}^(z) W_i = (δ^(z) ∗ W)[n],   δ^(a) = [δ_n^(a)] = δ^(z) ∗ W

∇_W J(θ; x, y) = ∂J/∂W = (∂J/∂z)(∂z/∂W) = Σ_n (∂J/∂z_n)(∂z_n/∂W) = Σ_n δ_n^(z) a_{n:n+|W|−1} = δ^(z) ∗ a = a ∗ δ^(conv)

[Figure: forward propagation (valid convolution): a ∗ W = z, each z_n reading a_{n:n+|W|−1}. Backward propagation of error signals (full convolution): δ^(a) = W ∗ δ^(z), each δ^(a)_n collecting δ^(z)_{n−|W|+1:n} weighted by W. Gradient (valid convolution): a ∗ δ^(z) = ∂J/∂W.]
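Under the same illustrative a and W, and with made-up incoming error signals δ^(z), the two results on this page reduce to one full convolution and one valid correlation in NumPy:

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
W = np.array([0.5, -1.0, 0.25])
z = np.correlate(a, W, mode='valid')         # forward pass, as on page 9

# Suppose the error signals delta^(z) have already been propagated to this layer
delta_z = np.array([0.1, -0.2, 0.05])        # same length as z (illustrative values)

# delta^(a)_n = sum_i delta^(z)_{n-i+1} W_i  ->  full convolution of delta^(z) with W
delta_a = np.convolve(delta_z, W, mode='full')          # length |a|

# grad_W_i = sum_n delta^(z)_n a_{n+i-1}     ->  valid "convolution" of a with delta^(z)
grad_W = np.correlate(a, delta_z, mode='valid')         # length |W|

# Check against the definition written as an explicit sum
assert np.allclose(grad_W, sum(delta_z[n] * a[n:n + W.size] for n in range(z.size)))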

Page 11: Backpropagation in Convolutional Neural Network

Derivatives of Pooling
A pooling layer subsamples statistics to obtain summary statistics with an aggregate function (or filter) g whose input is a vector and whose output is a scalar. Subsampling is an operation like convolution; however, g is applied to disjoint regions.

■ Definition: subsample

Let m be the size of the pooling region, and let a = f^(conv) and z = f^(pool) for simplicity.

[Figure: pooling — z_n aggregates the disjoint region a_{(n−1)m+1:nm}.]

z = subsample(a, g) = [z_n],  z_n = subsample(a, g)[n] = g(a_{(n−1)m+1:nm})

g(x) = (Σ_{k=1}^{m} x_k) / m,  ∂g/∂x = 1/m  (mean pooling)

g(x) = max(x),  ∂g/∂x_i = 1 if x_i = max(x), 0 otherwise  (max pooling)

g(x) = ||x||_p = (Σ_{k=1}^{m} x_k^p)^{1/p},  ∂g/∂x_i = (Σ_{k=1}^{m} x_k^p)^{1/p−1} x_i^{p−1}  (Lp pooling)
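A NumPy sketch of subsample for the three choices of g, using an illustrative signal and pooling size m = 3 (the Lp case assumes p = 2):

import numpy as np

a = np.array([1.0, 3.0, 2.0, 6.0, 5.0, 4.0])   # conv-layer output (illustrative)
m = 3                                           # pooling region size

regions = a.reshape(-1, m)                      # disjoint regions a_{(n-1)m+1:nm}

z_mean = regions.mean(axis=1)                   # g = mean,  dg/dx   = 1/m
z_max  = regions.max(axis=1)                    # g = max,   dg/dx_i = 1 iff x_i is the max
p = 2
z_lp   = (regions ** p).sum(axis=1) ** (1 / p)  # g = Lp norm (here p = 2)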

Page 12: Backpropagation in Convolutional Neural Network

Backpropagation in Pooling
Error signals for each example are computed by upsampling. Upsampling is an operation that backpropagates (distributes) the error signals over the aggregate function g using its derivatives g'_n = ∂g/∂a_{(n−1)m+1:nm}. Note that g'_n can change depending on the pooling region n.

□ In max pooling, the unit that was the max at forward propagation receives all of the error at backward propagation, and that unit differs from region to region.

■ Definition: upsample

δ_{(n−1)m+1:nm}^(a) = upsample(δ^(z), g')[n] = δ_n^(z) g'_n = δ_n^(z) ∂g/∂a_{(n−1)m+1:nm} = (∂J/∂z_n)(∂z_n/∂a_{(n−1)m+1:nm}) = ∂J(θ; x, y)/∂a_{(n−1)m+1:nm}

δ^(a) = upsample(δ^(z), g') = [δ_{(n−1)m+1:nm}^(a)]

[Figure: forward propagation (subsampling): subsample(a, g) = z, each z_n summarizing a_{(n−1)m+1:nm}; backward propagation (upsampling): δ^(a) = upsample(δ^(z), g'), each δ^(z)_n distributed back over δ^(a)_{(n−1)m+1:nm}.]
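A sketch of upsample for max pooling (and, for comparison, mean pooling), with illustrative values; the mask playing the role of g'_n is recomputed from the forward pass:

import numpy as np

a = np.array([1.0, 3.0, 2.0, 6.0, 5.0, 4.0])    # conv-layer output (illustrative)
m = 3
regions = a.reshape(-1, m)
z = regions.max(axis=1)                          # forward: max pooling, z = subsample(a, g)

delta_z = np.array([0.4, -0.1])                  # error signals arriving at the pooling output

# upsample: distribute delta^(z)_n over region n using g'_n = dg/da_{(n-1)m+1:nm}
g_prime = (regions == z[:, None]).astype(float)  # 1 at the max of each region, 0 elsewhere
delta_a = (delta_z[:, None] * g_prime).ravel()   # delta^(a), same length as a

# For mean pooling, g'_n is instead a constant 1/m in every position of the region:
delta_a_mean = (delta_z[:, None] * np.full((z.size, m), 1.0 / m)).ravel()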

Page 13: Backpropagation in Convolutional Neural Network

Backpropagation in Convolutional Neural Network

1. Propagate error signals δ^(pool) through the pooling layer:  δ^(conv) = upsample(δ^(pool), g')

2. Propagate error signals δ^(conv) through the convolution layer (plug in δ^(conv)):  δ^(a) = δ^(conv) ∗ W  (full convolution)

3. Compute the gradient (plug in δ^(conv)):  ∇_W J(θ; x, y) = a ∗ δ^(conv)  (valid convolution)

[Figure: the three steps drawn over the conv-pooling layer of pages 10 and 12, with δ^(pool) entering the pooling units, δ^(conv) at the convolution outputs, and δ^(a) and ∂J/∂W_n computed by the full and valid convolutions respectively.]
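Putting the three steps together for a single 1-D conv-pooling layer, with illustrative sizes (|W| = 3, pooling size m = 2, max pooling) and made-up incoming error signals δ^(pool):

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])   # previous-layer output
W = np.array([0.5, -1.0, 0.25])
m = 2                                                     # pooling region size

# Forward pass: f^(pool) o f^(conv)
conv = np.correlate(a, W, mode='valid')                   # f^(conv), length 6 here
regions = conv.reshape(-1, m)                             # 3 disjoint pooling regions
pool = regions.max(axis=1)                                # f^(pool)

# Error signals delta^(pool) arriving from the layer above (illustrative values)
delta_pool = np.array([0.3, -0.2, 0.1])

# 1. delta^(conv) = upsample(delta^(pool), g')
g_prime = (regions == pool[:, None]).astype(float)        # max-pooling mask
delta_conv = (delta_pool[:, None] * g_prime).ravel()

# 2. delta^(a) = delta^(conv) * W  (full convolution)
delta_a = np.convolve(delta_conv, W, mode='full')         # length |a|

# 3. grad_W = a * delta^(conv)  (valid convolution)
grad_W = np.correlate(a, delta_conv, mode='valid')        # length |W|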

Page 14: Backpropagation in Convolutional Neural Network

Remarks

■ References
□ UFLDL Tutorial, http://ufldl.stanford.edu/tutorial
□ Chain Rule of Neural Network is Error Back Propagation, http://like.silk.to/studymemo/ChainRuleNeuralNetwork.pdf

■ Acknowledgement
This memo was written thanks to a good discussion with Prof. Masayuki Tanaka.