Data Mining - Part 3: Neural Networks
Christof Monz
Informatics Institute, University of Amsterdam
Overview
- Perceptrons
- Gradient descent search
- Multi-layer neural networks
- The backpropagation algorithm
Neural Networks
- Analogy to biological neural systems, the most robust learning systems we know
- Attempt to understand natural biological systems through computational modeling
- Massive parallelism allows for computational efficiency
- Helps understand the ‘distributed’ nature of neural representations
- Intelligent behavior arises as an ‘emergent’ property of a large number of simple units, rather than from explicitly encoded symbolic rules and algorithms
Neural Network Learning
- Learning approach based on modeling adaptation in biological neural systems
- Perceptron: initial algorithm for learning simple neural networks (single layer), developed in the 1950s
- Backpropagation: more complex algorithm for learning multi-layer neural networks, developed in the 1980s
Real Neurons
[figure]

Human Neural Network
[figure]

Modeling Neural Networks
[figure]

Perceptrons
[figure]
- A perceptron is a single-layer neural network with one output unit.
- The output of a perceptron is computed as follows:

  o(x_1, ..., x_n) = 1 if w_0 + w_1·x_1 + ... + w_n·x_n > 0, −1 otherwise

- Assuming a ‘dummy’ input x_0 = 1, we can write:

  o(x_1, ..., x_n) = 1 if ∑_{i=0}^{n} w_i·x_i > 0, −1 otherwise
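A minimal sketch of this computation in Python (the function name is illustrative, not from the slides):

```python
def perceptron_output(weights, inputs):
    """Perceptron output for inputs x_1..x_n.

    weights[0] is w_0; the 'dummy' input x_0 = 1 is prepended here,
    so the weighted sum is sum_{i=0}^{n} w_i * x_i.
    """
    total = sum(w * x for w, x in zip(weights, [1.0] + list(inputs)))
    return 1 if total > 0 else -1
```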
- Learning a perceptron involves choosing the ‘right’ values for the weights w_0, ..., w_n.
- The set of candidate hypotheses is H = { ~w | ~w ∈ ℝ^(n+1) }.
Representational Power
- A single perceptron can represent many boolean functions, e.g. AND, OR, NAND (¬AND), ..., but not all: XOR, for example, is not linearly separable.
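For instance, reusing the perceptron_output sketch above with hand-picked (purely illustrative) weights w_0 = −0.8, w_1 = w_2 = 0.5 yields boolean AND:

```python
# The weighted sum -0.8 + 0.5*x1 + 0.5*x2 exceeds 0 only when x1 = x2 = 1.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron_output([-0.8, 0.5, 0.5], [x1, x2]))
# -1, -1, -1, 1: reading -1 as false and 1 as true, this is AND.
```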
Perceptron Training Rule
- The perceptron training rule is defined for each weight as:

  w_i ← w_i + ∆w_i, where ∆w_i = η(t − o)·x_i

  Here t is the target output, o is the output of the perceptron, and η is the learning rate.
- This scenario assumes that we know what the target outputs are supposed to be.
Perceptron Training Rule Example
- If t = o then η(t − o)·x_i = 0 and ∆w_i = 0, i.e. the weight w_i remains unchanged, regardless of the learning rate and the input value x_i.
- Let’s assume a learning rate of η = 0.1 and an input value of x_i = 0.8:
  • If t = +1 and o = −1, then ∆w_i = 0.1·(1 − (−1))·0.8 = 0.16
  • If t = −1 and o = +1, then ∆w_i = 0.1·(−1 − 1)·0.8 = −0.16
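The same numbers, as a one-line sketch (the function name is illustrative):

```python
def delta_w(eta, t, o, x_i):
    """Perceptron training rule increment: eta * (t - o) * x_i."""
    return eta * (t - o) * x_i

print(delta_w(0.1, +1, -1, 0.8))  #  0.16 (up to floating-point rounding)
print(delta_w(0.1, -1, +1, 0.8))  # -0.16
```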
Perceptron Training Rule
- The perceptron training rule converges after a finite number of iterations, but only if the training examples are linearly separable.
- The stopping criterion holds if the amount of change falls below a pre-defined threshold θ, e.g., if |∆~w|_L1 < θ (a minimal check is sketched below).
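A sketch of that check, assuming the weight changes are collected in a list (the name and threshold value are illustrative):

```python
def converged(delta_w, theta=1e-4):
    """True when the L1 norm of the weight changes falls below theta."""
    return sum(abs(d) for d in delta_w) < theta
```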
The Delta Rule
- The delta rule overcomes the shortcoming of the perceptron training rule, which is not guaranteed to converge if the examples are not linearly separable.
- The delta rule is based on gradient descent search.
- Let’s assume we have an unthresholded perceptron: o(~x) = ~w · ~x
- We can define the training error as:

  E(~w) = 1/2 ∑_{d∈D} (t_d − o_d)²

  where D is the set of training examples.
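A direct transcription of this error into Python, as a sketch assuming the unthresholded output ~w · ~x (names are illustrative):

```python
def training_error(weights, examples):
    """E(w) = 1/2 * sum over (x, t) in D of (t - w.x)^2.

    Each example is (inputs, target); inputs are assumed to already
    include the dummy input x_0 = 1.
    """
    return 0.5 * sum(
        (t - sum(w * x for w, x in zip(weights, xs))) ** 2
        for xs, t in examples
    )
```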
Error Surface
[figure]
Gradient Descent
- The gradient of E is the vector pointing in the direction of the steepest increase at any point on the error surface:

  ∇E(~w) = [ ∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n ]

- Since we are interested in minimizing the error, we consider the negative gradient: −∇E(~w)
- The training rule for gradient descent is:

  ~w ← ~w + ∆~w, where ∆~w = −η·∇E(~w)
- The training rule for individual weights is defined as:

  w_i ← w_i + ∆w_i, where ∆w_i = −η·∂E/∂w_i

- Instantiating E for the error function we use gives:

  ∂E/∂w_i = ∂/∂w_i [ 1/2 ∑_{d∈D} (t_d − o_d)² ]

- How do we use partial derivatives to actually compute updates to the weights at each step?
∂E/∂w_i = ∂/∂w_i [ 1/2 ∑_{d∈D} (t_d − o_d)² ]

        = 1/2 ∑_{d∈D} ∂/∂w_i (t_d − o_d)²

        = 1/2 ∑_{d∈D} 2·(t_d − o_d) · ∂/∂w_i (t_d − o_d)

        = ∑_{d∈D} (t_d − o_d) · ∂/∂w_i (t_d − o_d)

Since o_d = ~w · ~x_d, we have ∂/∂w_i (t_d − o_d) = −x_id, where x_id is the i-th input value of example d, so:

∂E/∂w_i = ∑_{d∈D} (t_d − o_d) · (−x_id)
- The delta rule for individual weights can now be written as:

  w_i ← w_i + ∆w_i, where ∆w_i = η ∑_{d∈D} (t_d − o_d)·x_id

- The gradient descent algorithm
  • picks initial random weights,
  • computes the outputs,
  • updates each weight by adding ∆w_i,
  • repeats until convergence.
The Gradient Descent Algorithm
Each training example is a pair <~x, t>.

1. Initialize each w_i to some small random value.
2. Until the termination condition is met, do:
   2.1 Initialize each ∆w_i to 0.
   2.2 For each <~x, t> ∈ D, do:
       2.2.1 Compute o(~x).
       2.2.2 For each weight w_i, do: ∆w_i ← ∆w_i + η(t − o)·x_i
   2.3 For each weight w_i, do: w_i ← w_i + ∆w_i
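Put together, a runnable Python sketch of this algorithm for the unthresholded perceptron; the function name and the termination condition (a fixed number of epochs) are illustrative assumptions:

```python
import random

def gradient_descent(examples, eta=0.05, epochs=100):
    """Batch gradient descent for a linear unit o(x) = w . x.

    examples: list of (inputs, target) pairs; each inputs vector is
    assumed to already include the dummy input x_0 = 1.
    """
    n = len(examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]  # step 1
    for _ in range(epochs):                              # step 2
        delta = [0.0] * n                                # step 2.1
        for xs, t in examples:                           # step 2.2
            o = sum(wi * xi for wi, xi in zip(w, xs))    # step 2.2.1
            for i, xi in enumerate(xs):                  # step 2.2.2
                delta[i] += eta * (t - o) * xi
        w = [wi + di for wi, di in zip(w, delta)]        # step 2.3
    return w
```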
- The gradient descent algorithm will find the global minimum, provided that the learning rate is small enough.
- If the learning rate is too large, the algorithm runs the risk of overstepping the global minimum.
- It is a common strategy to gradually decrease the learning rate.
- This algorithm also works when the training examples are not linearly separable.
Shortcomings of Gradient Descent
- Converging to a minimum can be quite slow (i.e. it can take thousands of steps); increasing the learning rate, on the other hand, can lead to overstepping minima.
- If there are multiple local minima in the error surface, gradient descent can get stuck in one of them and fail to find the global minimum.
- Stochastic gradient descent alleviates these difficulties.
Stochastic Gradient Descent
- Gradient descent updates the weights after summing over all training examples.
- Stochastic (or incremental) gradient descent updates the weights incrementally, after calculating the error for each individual training example.
- To this end, step 2.3 is deleted and step 2.2.2 is modified, as shown below.
Each training example is a pair <~x, t>.

1. Initialize each w_i to some small random value.
2. Until the termination condition is met, do:
   2.1 Initialize each ∆w_i to 0.
   2.2 For each <~x, t> ∈ D, do:
       2.2.1 Compute o(~x).
       2.2.2 For each weight w_i, do: w_i ← w_i + η(t − o)·x_i
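Relative to the batch sketch above, the only change is that the weights are updated immediately inside the inner loop; again an illustrative sketch:

```python
import random

def stochastic_gradient_descent(examples, eta=0.05, epochs=100):
    """Stochastic gradient descent for a linear unit o(x) = w . x."""
    n = len(examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]
    for _ in range(epochs):
        for xs, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, xs))
            # modified step 2.2.2: update w right after each example
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, xs)]
    return w
```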
Comparison
- In standard gradient descent, summing over multiple examples requires more computation per weight-update step.
- As a consequence, standard gradient descent often uses larger learning rates than stochastic gradient descent.
- Stochastic gradient descent can avoid falling into local minima because it uses the individual ∇E_d(~w) rather than the overall ∇E(~w) to guide its search.
Multi-Layer Neural Networks
- Perceptrons have only two layers: the input layer and the output layer.
- Perceptrons have only one output unit.
- Perceptrons are limited in their expressiveness.
- Multi-layer neural networks consist of an input layer, a hidden layer, and an output layer.
- Multi-layer neural networks can have several output units.
[figure]
- The units of the hidden layer function as input units to the next layer.
- However, multiple layers of linear units still produce only linear functions.
- The step function used in perceptrons is another choice, but it is not differentiable, and therefore not suitable for gradient descent search.
- Solution: the sigmoid function, a non-linear, differentiable threshold function.
Sigmoid Unit
[figure]
The Sigmoid Function
- The output is computed as o = σ(~w · ~x), where σ(y) = 1 / (1 + e^(−y)),
  i.e. o = σ(~w · ~x) = 1 / (1 + e^(−~w·~x))
- Another nice property of the sigmoid function is that its derivative is easily expressed:

  dσ(y)/dy = σ(y) · (1 − σ(y))
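Both the function and its derivative in a few lines of Python (a sketch; names are illustrative):

```python
import math

def sigmoid(y):
    """sigma(y) = 1 / (1 + e^(-y))"""
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_derivative(y):
    """d sigma(y) / dy = sigma(y) * (1 - sigma(y))"""
    s = sigmoid(y)
    return s * (1.0 - s)
```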
Learning with Multiple Layers
- Gradient descent search can be used to train multi-layer neural networks, but the algorithm has to be adapted.
- Firstly, there can be multiple output units, and therefore the error function has to be generalized:

  E(~w) = 1/2 ∑_{d∈D} ∑_{k∈outputs} (t_kd − o_kd)²

- Secondly, the error ‘feedback’ has to be fed back through multiple layers.
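A direct transcription of the generalized error, as a sketch (the data layout is an illustrative assumption):

```python
def network_error(per_example):
    """E(w) = 1/2 * sum over d of sum over outputs k of (t_kd - o_kd)^2.

    per_example: list of (targets, outputs) pairs, one per example d,
    where targets and outputs each hold one value per output unit k.
    """
    return 0.5 * sum(
        sum((t - o) ** 2 for t, o in zip(targets, outputs))
        for targets, outputs in per_example
    )
```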
Backpropagation Algorithm
For each training example <~x, ~t> do:

1. Input ~x to the network and compute the output o_u for every unit u in the network.
2. For each output unit k, calculate its error δ_k:
   δ_k ← o_k·(1 − o_k)·(t_k − o_k)
3. For each hidden unit h, calculate its error δ_h:
   δ_h ← o_h·(1 − o_h) ∑_{k∈outputs} w_kh·δ_k
4. Update each network weight w_ji:
   w_ji ← w_ji + ∆w_ji, where ∆w_ji = η·δ_j·x_ji

Note: x_ji is the input value from unit i to unit j, and w_ji is the weight connecting unit i to unit j.
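A compact sketch of one such training pass for a network with a single hidden layer of sigmoid units; the data layout (weight lists, inputs including the dummy 1.0) and all names are illustrative assumptions:

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def backprop_step(w_hidden, w_output, xs, targets, eta=0.1):
    """One backpropagation update; weights are modified in place.

    w_hidden[h]: weights into hidden unit h (one per input value)
    w_output[k]: weights into output unit k (one per hidden unit)
    xs:          input vector, assumed to include the dummy input 1.0
    targets:     target values t_k, one per output unit
    """
    # Step 1: propagate the input forward
    hidden = [sigmoid(sum(w * x for w, x in zip(wh, xs))) for wh in w_hidden]
    outputs = [sigmoid(sum(w * h for w, h in zip(wk, hidden))) for wk in w_output]

    # Step 2: output-unit errors, delta_k = o_k (1 - o_k) (t_k - o_k)
    delta_out = [o * (1 - o) * (t - o) for o, t in zip(outputs, targets)]

    # Step 3: hidden-unit errors, delta_h = o_h (1 - o_h) sum_k w_kh delta_k
    delta_hid = [
        h * (1 - h) * sum(wk[i] * dk for wk, dk in zip(w_output, delta_out))
        for i, h in enumerate(hidden)
    ]

    # Step 4: weight updates, w_ji <- w_ji + eta * delta_j * x_ji
    for wk, dk in zip(w_output, delta_out):
        for i, h in enumerate(hidden):
            wk[i] += eta * dk * h
    for wh, dh in zip(w_hidden, delta_hid):
        for i, x in enumerate(xs):
            wh[i] += eta * dh * x
```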
- Step 1 propagates the input forward through the network.
- Steps 2–4 propagate the errors backward through the network.
- Step 2 is similar to the delta rule in gradient descent (step 2.3).
- Step 3 sums over the errors of all output units influenced by a given hidden unit (this is because the training data only provides direct feedback for the output units).
Applications of Neural Networks
- Text to speech
- Fraud detection
- Automated vehicles
- Game playing
- Handwriting recognition
Summary
- Perceptrons, simple one-layer neural networks
- Perceptron training rule
- Gradient descent search
- Multi-layer neural networks
- Backpropagation algorithm