The back-propagation algorithm · The back propagation algorithm Notation I x‘ j: Input to node j of layer ‘ I W‘ ij: Weight from layer ‘ 1 node i to layer ‘node j I ˙(x)

The back-propagation algorithm

January 8, 2012

Ryan

The neuron

I The sigmoid equation is what is typically used as a transferfunction between neurons. It is similar to the step function,but is continuous and differentiable.

I

σ(x) =1

1 + e−x(1)

x

y

-5 -4 -3 -2 -1 0 1 2 3 4 5

1

Figure: The Sigmoid Function

I One useful property of this transfer function is the simplicityof computing it’s derivative. Let’s do that now...

The neuron


I

σ(x) =1

1 + e−x(1)

x

y

-5 -4 -3 -2 -1 0 1 2 3 4 5

1



The neuron


I

σ(x) =1

1 + e−x(1)

x

y

-5 -4 -3 -2 -1 0 1 2 3 4 5

1



The derivative of the sigmoid transfer function

d

dxσ(x) =

d

dx

(1

1 + e−x

)

=e−x

(1 + e−x)2

=

(

1 + e−x

)

− 1

(1 + e−x)2

=1 + e−x

(1 + e−x)2−

(

1

(

1 + e−x

) 2

)2

= σ(x)− σ(x)2

σ′ = σ(1− σ)


d

dxσ(x) =

d

dx

(1

1 + e−x

)=

e−x

(1 + e−x)2

=

(

1 + e−x

)

− 1

(1 + e−x)2

=1 + e−x

(1 + e−x)2−

(

1

(

1 + e−x

) 2

)2

= σ(x)− σ(x)2

σ′ = σ(1− σ)


d

dxσ(x) =

d

dx

(1

1 + e−x

)=

e−x

(1 + e−x)2

=

(

1 + e−x

)

− 1

(1 + e−x)2

=1 + e−x

(1 + e−x)2−

(

1

(

1 + e−x

) 2

)2

= σ(x)− σ(x)2

σ′ = σ(1− σ)


d

dxσ(x) =

d

dx

(1

1 + e−x

)=

e−x

(1 + e−x)2

=(1 + e−x)− 1

(1 + e−x)2

=1 + e−x

(1 + e−x)2−

(

1

(

1 + e−x

) 2

)2

= σ(x)− σ(x)2

σ′ = σ(1− σ)


d

dxσ(x) =

d

dx

(1

1 + e−x

)=

e−x

(1 + e−x)2

=(1 + e−x)− 1

(1 + e−x)2

=1 + e−x

(1 + e−x)2−

(

1

(1 + e−x) 2

)2

= σ(x)− σ(x)2

σ′ = σ(1− σ)


d

dxσ(x) =

d

dx

(1

1 + e−x

)=

e−x

(1 + e−x)2

=(1 + e−x)− 1

(1 + e−x)2

=1 + e−x

(1 + e−x)2−(

1

(

1 + e−x

) 2

)2

= σ(x)− σ(x)2

σ′ = σ(1− σ)


d

dxσ(x) =

d

dx

(1

1 + e−x

)=

e−x

(1 + e−x)2

=(1 + e−x)− 1

(1 + e−x)2

=1 + e−x

(1 + e−x)2−(

1

(

1 + e−x

) 2

)2

= σ(x)− σ(x)2

σ′ = σ(1− σ)


d

dxσ(x) =

d

dx

(1

1 + e−x

)=

e−x

(1 + e−x)2

=(1 + e−x)− 1

(1 + e−x)2

=1 + e−x

(1 + e−x)2−(

1

(

1 + e−x

) 2

)2

= σ(x)− σ(x)2

σ′ = σ(1− σ)

Single input neuron

σξω

θ

O

Figure: A Single-Input Neuron

In the above figure (2) you can see a diagram representing a singleneuron with only a single input. The equation defining the figure is:

O = σ(ξω)

Single input neuron

σξω

θ

O

Figure: A Single-Input Neuron

In the above figure (2) you can see a diagram representing a singleneuron with only a single input. The equation defining the figure is:

O = σ(ξω + θ)

Multiple input neuron

ξ1ω1

ξ2ω2

ξ3ω3

θ

O∑

σ

Figure: A Multiple Input Neuron

Figure 3 is the diagram representing the following equation:

O = σ(ω1ξ1 + ω2ξ2 + ω3ξ3 + θ)

A neural network

I J K

Figure: A layer

A neural network

A neural network

I J K

Figure: A neural network

A neural network

I J K

Figure: A neural network

The back propagation algorithm

Notation

I x`j : Input to node j of layer `

I W ìj : Weight from layer `− 1 node i to layer ` node j

I σ(x) = 11+e−x : Sigmoid Transfer Function

I θ`j : Bias of node j of layer `

I O`j : Output of node j in layer `

I tj : Target value of node j of the output layer


Notation








Notation








Notation








Notation








Notation







The error calculation

Given a set of training data points tk and output layer output Ok

we can write the error as

E =1

2

∑k∈K

(Ok − tk)2

We let the error of the network for a single training iteration bedenoted by E . We want to calculate ∂E

∂W `jk

, the rate of change of

the error with respect to the given connective weight, so we canminimize it.Now we consider two cases: The node is an output node, or it is ina hidden layer...

Output layer node

∂E

∂Wjk=

For notation purposes I will define δk to be the expression(Ok − tk)Ok(1−Ok), so we can rewrite the equation above as

∂E

∂Wjk= Ojδk

whereδk = Ok(1−Ok)(Ok − tk)

Output layer node

∂E

∂Wjk=

∂

∂Wjk

1

2

∑k∈K

(Ok − tk)2


∂E

∂Wjk= Ojδk


Output layer node

∂E

∂Wjk= (Ok − tk)

∂

∂WjkOk


∂E

∂Wjk= Ojδk


Output layer node

∂E

∂Wjk= (Ok − tk)

∂

∂Wjkσ(xk)


∂E

∂Wjk= Ojδk


Output layer node

∂E

∂Wjk= (Ok − tk)σ(xk)(1− σ(xk))

∂

∂Wjkxk


∂E

∂Wjk= Ojδk


Output layer node

∂E

∂Wjk= (Ok − tk)Ok(1−Ok)Oj


∂E

∂Wjk= Ojδk


Output layer node

∂E

∂Wjk= (Ok − tk)Ok(1−Ok)Oj


∂E

∂Wjk= Ojδk


Hidden layer node

∂E

∂Wij=

But, recalling our definition of δk we can write this as

∂E

∂Wij= OiOj(1−Oj)

∑k∈K

δkWjk

Similar to before we will now define all terms besides the Oi to beδj , so we have

∂E

∂Wij= Oiδj

Hidden layer node

∂E

∂Wij=

∂

∂Wij

1

2

∑k∈K

(Ok − tk)2


∂E


∑k∈K

δkWjk


∂E

∂Wij= Oiδj

Hidden layer node

∂E

∂Wij=∑k∈K

(Ok − tk)∂

∂WijOk


∂E


∑k∈K

δkWjk


∂E

∂Wij= Oiδj

Hidden layer node

∂E

∂Wij=∑k∈K

(Ok − tk)∂

∂Wijσ(xk)


∂E


∑k∈K

δkWjk


∂E

∂Wij= Oiδj

Hidden layer node

∂E

∂Wij=∑k∈K

(Ok − tk)σ(xk)(1− σ(xk))∂xk∂Wij


∂E


∑k∈K

δkWjk


∂E

∂Wij= Oiδj

Hidden layer node

∂E

∂Wij=∑k∈K

(Ok − tk)Ok(1−Ok)∂xk∂Oj

·∂Oj

∂Wij


∂E


∑k∈K

δkWjk


∂E

∂Wij= Oiδj

Hidden layer node

∂E

∂Wij=∑k∈K

(Ok − tk)Ok(1−Ok)Wjk∂Oj

∂Wij


∂E


∑k∈K

δkWjk


∂E

∂Wij= Oiδj

Hidden layer node

∂E

∂Wij=

∂Oj

∂Wij

∑k∈K

(Ok − tk)Ok(1−Ok)Wjk


∂E


∑k∈K

δkWjk


∂E

∂Wij= Oiδj

Hidden layer node

∂E

∂Wij= Oj(1−Oj)

∂xj∂Wij

∑k∈K



∂E


∑k∈K

δkWjk


∂E

∂Wij= Oiδj

Hidden layer node

∂E

∂Wij= Oj(1−Oj)Oi

∑k∈K



∂E


∑k∈K

δkWjk


∂E

∂Wij= Oiδj

Hidden layer node

∂E


∑k∈K



∂E


∑k∈K

δkWjk


∂E

∂Wij= Oiδj

Hidden layer node

∂E


∑k∈K



∂E


∑k∈K

δkWjk


∂E

∂Wij= Oiδj

How weights affect errors

For an output layer node k ∈ K

∂E

∂Wjk= Ojδk


For a hidden layer node j ∈ J

∂E

∂Wij= Oiδj

whereδj = Oj(1−Oj)

∑k∈K

δkWjk

What about the bias?

If we incorporate the bias term θ into the equation you will findthat

∂O∂θ

= O(1−O)∂θ

∂θ

and because ∂θ/∂θ = 1 we view the bias term as output from anode which is always one.

This holds for any layer ` we are concerned with, a substitutioninto the previous equations gives us that

∂E

∂θ= δ`

(because the O` is replacing the output from the “previous layer”)

What about the bias?

If we incorporate the bias term θ into the equation you will findthat

∂O∂θ

= O(1−O)∂θ

∂θ

and because ∂θ/∂θ = 1 we view the bias term as output from anode which is always one.This holds for any layer ` we are concerned with, a substitutioninto the previous equations gives us that

∂E

∂θ= δ`

(because the O` is replacing the output from the “previous layer”)


1. Run the network forward with your input data to get thenetwork output

2. For each output node compute

δk = Ok(1−Ok)(Ok − tk)

3. For each hidden node calulate

δj = Oj(1−Oj)∑k∈K

δkWjk

4. Update the weights and biases as followsGiven

∆W = −ηδÒ`−1

∆θ = −ηδàpply

W + ∆W →W

θ + ∆θ → θ

Documents

The back-propagation algorithm · The back propagation algorithm Notation I x‘ j: Input to node j of layer ‘ I W‘ ij: Weight from layer ‘ 1 node i to layer ‘node j I ˙(x)