Upload
others
View
7
Download
0
Embed Size (px)
Citation preview
The back-propagation algorithm
January 8, 2012
Ryan
The neuron
I The sigmoid equation is what is typically used as a transferfunction between neurons. It is similar to the step function,but is continuous and differentiable.
I
σ(x) =1
1 + e−x(1)
x
y
-5 -4 -3 -2 -1 0 1 2 3 4 5
1
Figure: The Sigmoid Function
I One useful property of this transfer function is the simplicityof computing it’s derivative. Let’s do that now...
The neuron
I The sigmoid equation is what is typically used as a transferfunction between neurons. It is similar to the step function,but is continuous and differentiable.
I
σ(x) =1
1 + e−x(1)
x
y
-5 -4 -3 -2 -1 0 1 2 3 4 5
1
Figure: The Sigmoid Function
I One useful property of this transfer function is the simplicityof computing it’s derivative. Let’s do that now...
The neuron
I The sigmoid equation is what is typically used as a transferfunction between neurons. It is similar to the step function,but is continuous and differentiable.
I
σ(x) =1
1 + e−x(1)
x
y
-5 -4 -3 -2 -1 0 1 2 3 4 5
1
Figure: The Sigmoid Function
I One useful property of this transfer function is the simplicityof computing it’s derivative. Let’s do that now...
The derivative of the sigmoid transfer function
d
dxσ(x) =
d
dx
(1
1 + e−x
)
=e−x
(1 + e−x)2
=
(
1 + e−x
)
− 1
(1 + e−x)2
=1 + e−x
(1 + e−x)2−
(
1
(
1 + e−x
) 2
)2
= σ(x)− σ(x)2
σ′ = σ(1− σ)
The derivative of the sigmoid transfer function
d
dxσ(x) =
d
dx
(1
1 + e−x
)=
e−x
(1 + e−x)2
=
(
1 + e−x
)
− 1
(1 + e−x)2
=1 + e−x
(1 + e−x)2−
(
1
(
1 + e−x
) 2
)2
= σ(x)− σ(x)2
σ′ = σ(1− σ)
The derivative of the sigmoid transfer function
d
dxσ(x) =
d
dx
(1
1 + e−x
)=
e−x
(1 + e−x)2
=
(
1 + e−x
)
− 1
(1 + e−x)2
=1 + e−x
(1 + e−x)2−
(
1
(
1 + e−x
) 2
)2
= σ(x)− σ(x)2
σ′ = σ(1− σ)
The derivative of the sigmoid transfer function
d
dxσ(x) =
d
dx
(1
1 + e−x
)=
e−x
(1 + e−x)2
=(1 + e−x)− 1
(1 + e−x)2
=1 + e−x
(1 + e−x)2−
(
1
(
1 + e−x
) 2
)2
= σ(x)− σ(x)2
σ′ = σ(1− σ)
The derivative of the sigmoid transfer function
d
dxσ(x) =
d
dx
(1
1 + e−x
)=
e−x
(1 + e−x)2
=(1 + e−x)− 1
(1 + e−x)2
=1 + e−x
(1 + e−x)2−
(
1
(1 + e−x) 2
)2
= σ(x)− σ(x)2
σ′ = σ(1− σ)
The derivative of the sigmoid transfer function
d
dxσ(x) =
d
dx
(1
1 + e−x
)=
e−x
(1 + e−x)2
=(1 + e−x)− 1
(1 + e−x)2
=1 + e−x
(1 + e−x)2−(
1
(
1 + e−x
) 2
)2
= σ(x)− σ(x)2
σ′ = σ(1− σ)
The derivative of the sigmoid transfer function
d
dxσ(x) =
d
dx
(1
1 + e−x
)=
e−x
(1 + e−x)2
=(1 + e−x)− 1
(1 + e−x)2
=1 + e−x
(1 + e−x)2−(
1
(
1 + e−x
) 2
)2
= σ(x)− σ(x)2
σ′ = σ(1− σ)
The derivative of the sigmoid transfer function
d
dxσ(x) =
d
dx
(1
1 + e−x
)=
e−x
(1 + e−x)2
=(1 + e−x)− 1
(1 + e−x)2
=1 + e−x
(1 + e−x)2−(
1
(
1 + e−x
) 2
)2
= σ(x)− σ(x)2
σ′ = σ(1− σ)
Single input neuron
σξω
θ
O
Figure: A Single-Input Neuron
In the above figure (2) you can see a diagram representing a singleneuron with only a single input. The equation defining the figure is:
O = σ(ξω)
Single input neuron
σξω
θ
O
Figure: A Single-Input Neuron
In the above figure (2) you can see a diagram representing a singleneuron with only a single input. The equation defining the figure is:
O = σ(ξω + θ)
Multiple input neuron
ξ1ω1
ξ2ω2
ξ3ω3
θ
O∑
σ
Figure: A Multiple Input Neuron
Figure 3 is the diagram representing the following equation:
O = σ(ω1ξ1 + ω2ξ2 + ω3ξ3 + θ)
A neural network
I J K
Figure: A layer
A neural network
A neural network
I J K
Figure: A neural network
A neural network
I J K
Figure: A neural network
The back propagation algorithm
Notation
I x`j : Input to node j of layer `
I W `ij : Weight from layer `− 1 node i to layer ` node j
I σ(x) = 11+e−x : Sigmoid Transfer Function
I θ`j : Bias of node j of layer `
I O`j : Output of node j in layer `
I tj : Target value of node j of the output layer
The back propagation algorithm
Notation
I x`j : Input to node j of layer `
I W `ij : Weight from layer `− 1 node i to layer ` node j
I σ(x) = 11+e−x : Sigmoid Transfer Function
I θ`j : Bias of node j of layer `
I O`j : Output of node j in layer `
I tj : Target value of node j of the output layer
The back propagation algorithm
Notation
I x`j : Input to node j of layer `
I W `ij : Weight from layer `− 1 node i to layer ` node j
I σ(x) = 11+e−x : Sigmoid Transfer Function
I θ`j : Bias of node j of layer `
I O`j : Output of node j in layer `
I tj : Target value of node j of the output layer
The back propagation algorithm
Notation
I x`j : Input to node j of layer `
I W `ij : Weight from layer `− 1 node i to layer ` node j
I σ(x) = 11+e−x : Sigmoid Transfer Function
I θ`j : Bias of node j of layer `
I O`j : Output of node j in layer `
I tj : Target value of node j of the output layer
The back propagation algorithm
Notation
I x`j : Input to node j of layer `
I W `ij : Weight from layer `− 1 node i to layer ` node j
I σ(x) = 11+e−x : Sigmoid Transfer Function
I θ`j : Bias of node j of layer `
I O`j : Output of node j in layer `
I tj : Target value of node j of the output layer
The back propagation algorithm
Notation
I x`j : Input to node j of layer `
I W `ij : Weight from layer `− 1 node i to layer ` node j
I σ(x) = 11+e−x : Sigmoid Transfer Function
I θ`j : Bias of node j of layer `
I O`j : Output of node j in layer `
I tj : Target value of node j of the output layer
The error calculation
Given a set of training data points tk and output layer output Ok
we can write the error as
E =1
2
∑k∈K
(Ok − tk)2
We let the error of the network for a single training iteration bedenoted by E . We want to calculate ∂E
∂W `jk
, the rate of change of
the error with respect to the given connective weight, so we canminimize it.Now we consider two cases: The node is an output node, or it is ina hidden layer...
Output layer node
∂E
∂Wjk=
For notation purposes I will define δk to be the expression(Ok − tk)Ok(1−Ok), so we can rewrite the equation above as
∂E
∂Wjk= Ojδk
whereδk = Ok(1−Ok)(Ok − tk)
Output layer node
∂E
∂Wjk=
∂
∂Wjk
1
2
∑k∈K
(Ok − tk)2
For notation purposes I will define δk to be the expression(Ok − tk)Ok(1−Ok), so we can rewrite the equation above as
∂E
∂Wjk= Ojδk
whereδk = Ok(1−Ok)(Ok − tk)
Output layer node
∂E
∂Wjk= (Ok − tk)
∂
∂WjkOk
For notation purposes I will define δk to be the expression(Ok − tk)Ok(1−Ok), so we can rewrite the equation above as
∂E
∂Wjk= Ojδk
whereδk = Ok(1−Ok)(Ok − tk)
Output layer node
∂E
∂Wjk= (Ok − tk)
∂
∂Wjkσ(xk)
For notation purposes I will define δk to be the expression(Ok − tk)Ok(1−Ok), so we can rewrite the equation above as
∂E
∂Wjk= Ojδk
whereδk = Ok(1−Ok)(Ok − tk)
Output layer node
∂E
∂Wjk= (Ok − tk)σ(xk)(1− σ(xk))
∂
∂Wjkxk
For notation purposes I will define δk to be the expression(Ok − tk)Ok(1−Ok), so we can rewrite the equation above as
∂E
∂Wjk= Ojδk
whereδk = Ok(1−Ok)(Ok − tk)
Output layer node
∂E
∂Wjk= (Ok − tk)Ok(1−Ok)Oj
For notation purposes I will define δk to be the expression(Ok − tk)Ok(1−Ok), so we can rewrite the equation above as
∂E
∂Wjk= Ojδk
whereδk = Ok(1−Ok)(Ok − tk)
Output layer node
∂E
∂Wjk= (Ok − tk)Ok(1−Ok)Oj
For notation purposes I will define δk to be the expression(Ok − tk)Ok(1−Ok), so we can rewrite the equation above as
∂E
∂Wjk= Ojδk
whereδk = Ok(1−Ok)(Ok − tk)
Hidden layer node
∂E
∂Wij=
But, recalling our definition of δk we can write this as
∂E
∂Wij= OiOj(1−Oj)
∑k∈K
δkWjk
Similar to before we will now define all terms besides the Oi to beδj , so we have
∂E
∂Wij= Oiδj
Hidden layer node
∂E
∂Wij=
∂
∂Wij
1
2
∑k∈K
(Ok − tk)2
But, recalling our definition of δk we can write this as
∂E
∂Wij= OiOj(1−Oj)
∑k∈K
δkWjk
Similar to before we will now define all terms besides the Oi to beδj , so we have
∂E
∂Wij= Oiδj
Hidden layer node
∂E
∂Wij=∑k∈K
(Ok − tk)∂
∂WijOk
But, recalling our definition of δk we can write this as
∂E
∂Wij= OiOj(1−Oj)
∑k∈K
δkWjk
Similar to before we will now define all terms besides the Oi to beδj , so we have
∂E
∂Wij= Oiδj
Hidden layer node
∂E
∂Wij=∑k∈K
(Ok − tk)∂
∂Wijσ(xk)
But, recalling our definition of δk we can write this as
∂E
∂Wij= OiOj(1−Oj)
∑k∈K
δkWjk
Similar to before we will now define all terms besides the Oi to beδj , so we have
∂E
∂Wij= Oiδj
Hidden layer node
∂E
∂Wij=∑k∈K
(Ok − tk)σ(xk)(1− σ(xk))∂xk∂Wij
But, recalling our definition of δk we can write this as
∂E
∂Wij= OiOj(1−Oj)
∑k∈K
δkWjk
Similar to before we will now define all terms besides the Oi to beδj , so we have
∂E
∂Wij= Oiδj
Hidden layer node
∂E
∂Wij=∑k∈K
(Ok − tk)Ok(1−Ok)∂xk∂Oj
·∂Oj
∂Wij
But, recalling our definition of δk we can write this as
∂E
∂Wij= OiOj(1−Oj)
∑k∈K
δkWjk
Similar to before we will now define all terms besides the Oi to beδj , so we have
∂E
∂Wij= Oiδj
Hidden layer node
∂E
∂Wij=∑k∈K
(Ok − tk)Ok(1−Ok)Wjk∂Oj
∂Wij
But, recalling our definition of δk we can write this as
∂E
∂Wij= OiOj(1−Oj)
∑k∈K
δkWjk
Similar to before we will now define all terms besides the Oi to beδj , so we have
∂E
∂Wij= Oiδj
Hidden layer node
∂E
∂Wij=
∂Oj
∂Wij
∑k∈K
(Ok − tk)Ok(1−Ok)Wjk
But, recalling our definition of δk we can write this as
∂E
∂Wij= OiOj(1−Oj)
∑k∈K
δkWjk
Similar to before we will now define all terms besides the Oi to beδj , so we have
∂E
∂Wij= Oiδj
Hidden layer node
∂E
∂Wij= Oj(1−Oj)
∂xj∂Wij
∑k∈K
(Ok − tk)Ok(1−Ok)Wjk
But, recalling our definition of δk we can write this as
∂E
∂Wij= OiOj(1−Oj)
∑k∈K
δkWjk
Similar to before we will now define all terms besides the Oi to beδj , so we have
∂E
∂Wij= Oiδj
Hidden layer node
∂E
∂Wij= Oj(1−Oj)Oi
∑k∈K
(Ok − tk)Ok(1−Ok)Wjk
But, recalling our definition of δk we can write this as
∂E
∂Wij= OiOj(1−Oj)
∑k∈K
δkWjk
Similar to before we will now define all terms besides the Oi to beδj , so we have
∂E
∂Wij= Oiδj
Hidden layer node
∂E
∂Wij= Oj(1−Oj)Oi
∑k∈K
(Ok − tk)Ok(1−Ok)Wjk
But, recalling our definition of δk we can write this as
∂E
∂Wij= OiOj(1−Oj)
∑k∈K
δkWjk
Similar to before we will now define all terms besides the Oi to beδj , so we have
∂E
∂Wij= Oiδj
Hidden layer node
∂E
∂Wij= Oj(1−Oj)Oi
∑k∈K
(Ok − tk)Ok(1−Ok)Wjk
But, recalling our definition of δk we can write this as
∂E
∂Wij= OiOj(1−Oj)
∑k∈K
δkWjk
Similar to before we will now define all terms besides the Oi to beδj , so we have
∂E
∂Wij= Oiδj
How weights affect errors
For an output layer node k ∈ K
∂E
∂Wjk= Ojδk
whereδk = Ok(1−Ok)(Ok − tk)
For a hidden layer node j ∈ J
∂E
∂Wij= Oiδj
whereδj = Oj(1−Oj)
∑k∈K
δkWjk
What about the bias?
If we incorporate the bias term θ into the equation you will findthat
∂O∂θ
= O(1−O)∂θ
∂θ
and because ∂θ/∂θ = 1 we view the bias term as output from anode which is always one.
This holds for any layer ` we are concerned with, a substitutioninto the previous equations gives us that
∂E
∂θ= δ`
(because the O` is replacing the output from the “previous layer”)
What about the bias?
If we incorporate the bias term θ into the equation you will findthat
∂O∂θ
= O(1−O)∂θ
∂θ
and because ∂θ/∂θ = 1 we view the bias term as output from anode which is always one.This holds for any layer ` we are concerned with, a substitutioninto the previous equations gives us that
∂E
∂θ= δ`
(because the O` is replacing the output from the “previous layer”)
The back propagation algorithm
1. Run the network forward with your input data to get thenetwork output
2. For each output node compute
δk = Ok(1−Ok)(Ok − tk)
3. For each hidden node calulate
δj = Oj(1−Oj)∑k∈K
δkWjk
4. Update the weights and biases as followsGiven
∆W = −ηδ`O`−1
∆θ = −ηδ`apply
W + ∆W →W
θ + ∆θ → θ