

Christof Monz
Informatics Institute, University of Amsterdam

Data Mining - Part 3: Neural Networks

Overview


• Perceptrons
• Gradient descent search
• Multi-layer neural networks
• The backpropagation algorithm


Neural Networks


• Analogy to biological neural systems, the most robust learning systems we know
• Attempt to understand natural biological systems through computational modeling
• Massive parallelism allows for computational efficiency
• Help understand the ‘distributed’ nature of neural representations
• Intelligent behavior as an ‘emergent’ property of a large number of simple units rather than of explicitly encoded symbolic rules and algorithms

Neural Network Learning


• Learning approach based on modeling adaptation in biological neural systems
• Perceptron: initial algorithm for learning simple neural networks (single layer), developed in the 1950s
• Backpropagation: more complex algorithm for learning multi-layer neural networks, developed in the 1980s


Real Neurons


Human Neural Network


Modeling Neural Networks


Perceptrons


• A perceptron is a single-layer neural network with one output unit
• The output of a perceptron is computed as follows:

  o(x_1, \dots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \dots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}

• Assuming a ‘dummy’ input x_0 = 1, we can write:

  o(x_1, \dots, x_n) = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ -1 & \text{otherwise} \end{cases}
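As a concrete illustration, below is a minimal Python sketch of this thresholded output computation; the function name `perceptron_output`, the use of NumPy, and the AND weights in the usage example are illustrative choices, not part of the handout.

```python
import numpy as np

def perceptron_output(w, x):
    """Thresholded perceptron output: +1 if sum_i w_i * x_i > 0, else -1.

    w -- weight vector of length n+1, with w[0] acting as the bias weight w_0
    x -- input vector of length n; the dummy input x_0 = 1 is prepended here
    """
    x = np.concatenate(([1.0], np.asarray(x, dtype=float)))
    return 1 if np.dot(w, x) > 0 else -1

# Usage: weights that realize logical AND of two binary inputs
w_and = np.array([-1.5, 1.0, 1.0])
print(perceptron_output(w_and, [1, 1]))  # +1
print(perceptron_output(w_and, [1, 0]))  # -1
```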

Perceptrons


• Learning a perceptron involves choosing the ‘right’ values for the weights w_0, \dots, w_n
• The set of candidate hypotheses is H = \{\vec{w} \mid \vec{w} \in \mathbb{R}^{n+1}\}


Representational Power


• A single perceptron can represent many boolean functions, e.g. AND, OR, NAND (¬AND), ..., but not all (e.g. XOR)

Perceptron Training Rule


• The perceptron training rule can be defined for each weight as:

  w_i \leftarrow w_i + \Delta w_i, \quad \text{where } \Delta w_i = \eta (t - o) x_i

  where t is the target output, o is the output of the perceptron, and \eta is the learning rate
• This scenario assumes that we know what the target outputs are supposed to be
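A small Python sketch of one application of this rule; the helper name `perceptron_training_step` and the NumPy representation are assumptions for illustration, and the numeric check at the end reproduces the worked example on the next slide.

```python
import numpy as np

def perceptron_training_step(w, x, t, eta=0.1):
    """Apply w_i <- w_i + eta * (t - o) * x_i to every weight, including w_0."""
    x = np.concatenate(([1.0], np.asarray(x, dtype=float)))  # dummy input x_0 = 1
    o = 1 if np.dot(w, x) > 0 else -1                        # current perceptron output
    return w + eta * (t - o) * x

# The weight change for a single input value, as in the worked example:
# eta = 0.1, x_i = 0.8, t = +1, o = -1  =>  Delta w_i = 0.1 * (1 - (-1)) * 0.8
print(0.1 * (1 - (-1)) * 0.8)  # 0.16
```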


Perceptron Training Rule Example


• If t = o, then \eta (t - o) x_i = 0 and \Delta w_i = 0, i.e. the weight w_i remains unchanged, regardless of the learning rate and the input value x_i
• Let’s assume a learning rate of \eta = 0.1 and an input value of x_i = 0.8
  - If t = +1 and o = -1, then \Delta w_i = 0.1 (1 - (-1)) \cdot 0.8 = 0.16
  - If t = -1 and o = +1, then \Delta w_i = 0.1 (-1 - 1) \cdot 0.8 = -0.16

Perceptron Training Rule


• The perceptron training rule converges after a finite number of iterations
• The stopping criterion holds if the amount of change falls below a pre-defined threshold \theta, e.g. if |\Delta\vec{w}|_{L_1} < \theta
• But convergence is guaranteed only if the training examples are linearly separable


The Delta Rule


• The delta rule overcomes the shortcoming of the perceptron training rule, which is not guaranteed to converge if the examples are not linearly separable
• The delta rule is based on gradient descent search
• Let’s assume we have an unthresholded perceptron: o(\vec{x}) = \vec{w} \cdot \vec{x}
• We can define the training error as:

  E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

  where D is the set of training examples
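For concreteness, a short sketch of this error function in Python, assuming the training inputs are stacked in a matrix `X` (one row per example, dummy input included) and the targets in a vector `T`; these names are illustrative.

```python
import numpy as np

def training_error(w, X, T):
    """E(w) = 1/2 * sum over d in D of (t_d - o_d)^2, with o(x) = w . x."""
    O = X @ w                         # unthresholded outputs o_d for all examples
    return 0.5 * np.sum((T - O) ** 2)
```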

Error Surface


Gradient Descent


• The gradient of E is the vector pointing in the direction of the steepest increase for any point on the error surface:

  \nabla E(\vec{w}) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \dots, \frac{\partial E}{\partial w_n} \right]

• Since we are interested in minimizing the error, we consider negative gradients: -\nabla E(\vec{w})
• The training rule for gradient descent is:

  \vec{w} \leftarrow \vec{w} + \Delta\vec{w}, \quad \text{where } \Delta\vec{w} = -\eta \nabla E(\vec{w})

Gradient Descent


• The training rule for individual weights is defined as:

  w_i \leftarrow w_i + \Delta w_i, \quad \text{where } \Delta w_i = -\eta \frac{\partial E}{\partial w_i}

• Instantiating E for the error function we use gives:

  \frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

• How do we use partial derivatives to actually compute updates to weights at each step?


Gradient Descent


\frac{\partial E}{\partial w_i}
  = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2
  = \frac{1}{2} \sum_{d \in D} \frac{\partial}{\partial w_i} (t_d - o_d)^2
  = \frac{1}{2} \sum_{d \in D} 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)
  = \sum_{d \in D} (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \vec{w} \cdot \vec{x}_d)

\frac{\partial E}{\partial w_i} = \sum_{d \in D} (t_d - o_d) \cdot (-x_{id})
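One way to make the derived gradient concrete is to compare it, written as a single matrix product, against a numerical finite-difference estimate; the sketch below does exactly that (all names and the random test data are illustrative).

```python
import numpy as np

def error(w, X, T):
    # E(w) = 1/2 * sum_d (t_d - o_d)^2, with o_d = w . x_d
    return 0.5 * np.sum((T - X @ w) ** 2)

def analytic_gradient(w, X, T):
    # dE/dw_i = sum_d (t_d - o_d) * (-x_id), computed for all i at once
    return -X.T @ (T - X @ w)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 examples, 3 inputs (dummy x_0 included)
T = rng.normal(size=5)
w = rng.normal(size=3)

eps = 1e-6
numeric = np.array([(error(w + eps * e, X, T) - error(w - eps * e, X, T)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(numeric, analytic_gradient(w, X, T), atol=1e-5))  # True
```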

Gradient Descent


• The delta rule for individual weights can now be written as:

  w_i \leftarrow w_i + \Delta w_i, \quad \text{where } \Delta w_i = \eta \sum_{d \in D} (t_d - o_d) x_{id}

• The gradient descent algorithm
  - picks initial random weights
  - computes the outputs
  - updates each weight by adding \Delta w_i
  - repeats until convergence


The Gradient Descent Algorithm


Each training example is a pair \langle \vec{x}, t \rangle.

1. Initialize each w_i to some small random value
2. Until the termination condition is met, do:
   2.1 Initialize each \Delta w_i to 0
   2.2 For each \langle \vec{x}, t \rangle \in D do:
       2.2.1 Compute o(\vec{x})
       2.2.2 For each weight w_i do: \Delta w_i \leftarrow \Delta w_i + \eta (t - o) x_i
   2.3 For each weight w_i do: w_i \leftarrow w_i + \Delta w_i
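A Python sketch of this algorithm for an unthresholded perceptron o(x) = w . x; the fixed iteration count as termination condition, the weight initialization range, and the data layout (rows of `X` with the dummy input already prepended, targets in `T`) are simplifying assumptions.

```python
import numpy as np

def gradient_descent(X, T, eta=0.01, n_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.uniform(-0.05, 0.05, size=X.shape[1])   # 1. small random weights
    for _ in range(n_iters):                        # 2. until termination
        delta_w = np.zeros_like(w)                  # 2.1 initialize each Delta w_i to 0
        for x, t in zip(X, T):                      # 2.2 for each training example <x, t>
            o = np.dot(w, x)                        # 2.2.1 compute o(x)
            delta_w += eta * (t - o) * x            # 2.2.2 accumulate eta * (t - o) * x_i
        w = w + delta_w                             # 2.3 one weight update per pass over D
    return w
```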

The Gradient Descent Algorithm


• The gradient descent algorithm will find the global minimum, provided that the learning rate is small enough
• If the learning rate is too large, the algorithm runs the risk of overstepping the global minimum
• It is a common strategy to gradually decrease the learning rate
• This algorithm also works when the training examples are not linearly separable


Shortcomings of Gradient Descent


• Converging to a minimum can be quite slow (i.e. it can take thousands of steps); increasing the learning rate, on the other hand, can lead to overstepping minima
• If there are multiple local minima in the error surface, gradient descent can get stuck in one of them and not find the global minimum
• Stochastic gradient descent alleviates these difficulties

Stochastic Gradient Descent


• Gradient descent updates the weights after summing over all training examples
• Stochastic (or incremental) gradient descent updates the weights incrementally, after calculating the error for each individual training example
• To this end, step 2.3 is deleted and step 2.2.2 is modified


Stochastic Gradient Descent


Each training example is a pair \langle \vec{x}, t \rangle.

1. Initialize each w_i to some small random value
2. Until the termination condition is met, do:
   2.1 Initialize each \Delta w_i to 0
   2.2 For each \langle \vec{x}, t \rangle \in D do:
       2.2.1 Compute o(\vec{x})
       2.2.2 For each weight w_i do: w_i \leftarrow w_i + \eta (t - o) x_i
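Compared with the batch sketch above, the only change is that the weight update moves inside the loop over training examples (again a sketch under the same assumptions about `X` and `T`).

```python
import numpy as np

def stochastic_gradient_descent(X, T, eta=0.01, n_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.uniform(-0.05, 0.05, size=X.shape[1])
    for _ in range(n_iters):
        for x, t in zip(X, T):
            o = np.dot(w, x)
            w = w + eta * (t - o) * x   # step 2.2.2: update immediately, no step 2.3
    return w
```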

Comparison


• In standard gradient descent, summing over multiple examples requires more computation per weight update step
• As a consequence, standard gradient descent often uses larger learning rates than stochastic gradient descent
• Stochastic gradient descent can avoid falling into local minima because it uses the per-example gradients \nabla E_d(\vec{w}) rather than the overall \nabla E(\vec{w}) to guide its search


Multi-Layer Neural Networks


• Perceptrons only have two layers: the input layer and the output layer
• Perceptrons only have one output unit
• Perceptrons are limited in their expressiveness
• Multi-layer neural networks consist of an input layer, a hidden layer, and an output layer
• Multi-layer neural networks can have several output units


Multi-Layer Neural Networks


• The units of the hidden layer function as input units to the next layer
• However, multiple layers of linear units still produce only linear functions
• The step function in perceptrons is another choice, but it is not differentiable, and therefore not suitable for gradient descent search
• Solution: the sigmoid function, a non-linear, differentiable threshold function

Sigmoid Unit


The Sigmoid Function


• The output is computed as o = \sigma(\vec{w} \cdot \vec{x}), where \sigma(y) = \frac{1}{1 + e^{-y}}, i.e.

  o = \sigma(\vec{w} \cdot \vec{x}) = \frac{1}{1 + e^{-\vec{w} \cdot \vec{x}}}

• Another nice property of the sigmoid function is that its derivative is easily expressed:

  \frac{d\sigma(y)}{dy} = \sigma(y) \cdot (1 - \sigma(y))
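Both the function and its derivative are one-liners in Python; a small illustrative sketch follows.

```python
import numpy as np

def sigmoid(y):
    """sigma(y) = 1 / (1 + e^(-y))"""
    return 1.0 / (1.0 + np.exp(-y))

def sigmoid_derivative(y):
    """d sigma(y) / dy = sigma(y) * (1 - sigma(y))"""
    s = sigmoid(y)
    return s * (1.0 - s)

# A sigmoid unit computes o = sigmoid(np.dot(w, x)); the derivative peaks at y = 0:
print(sigmoid(0.0), sigmoid_derivative(0.0))  # 0.5 0.25
```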

Learning with Multiple Layers


• Gradient descent search can be used to train multi-layer neural networks, but the algorithm has to be adapted
• Firstly, there can be multiple output units, and therefore the error function has to be generalized:

  E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in \text{outputs}} (t_{kd} - o_{kd})^2

• Secondly, the error ‘feedback’ has to be fed through multiple layers


Backpropagation Algorithm


For each training example \langle \vec{x}, t \rangle do:

1. Input \vec{x} to the network and compute the output o_u for every unit u in the network
2. For each output unit k, calculate its error \delta_k:
   \delta_k \leftarrow o_k (1 - o_k) (t_k - o_k)
3. For each hidden unit h, calculate its error \delta_h:
   \delta_h \leftarrow o_h (1 - o_h) \sum_{k \in \text{outputs}} w_{kh} \delta_k
4. Update each network weight w_{ji}:
   w_{ji} \leftarrow w_{ji} + \Delta w_{ji}, \quad \text{where } \Delta w_{ji} = \eta \delta_j x_{ji}

Note: x_{ji} is the input value from unit i to unit j, and w_{ji} is the weight of the connection from unit i to unit j.
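A compact Python sketch of one such training step for a network with a single hidden layer of sigmoid units, following steps 1-4 above; the matrix shapes, the omission of bias weights, and all names are simplifying assumptions rather than part of the handout.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def backprop_step(x, t, W_hidden, W_output, eta=0.05):
    """One training example <x, t>.

    W_hidden -- weights from input units to hidden units, shape (n_hidden, n_inputs)
    W_output -- weights from hidden units to output units, shape (n_outputs, n_hidden)
    """
    # 1. Propagate the input forward and compute o_u for every unit
    o_hidden = sigmoid(W_hidden @ x)             # hidden unit outputs o_h
    o_output = sigmoid(W_output @ o_hidden)      # output unit outputs o_k

    # 2. delta_k = o_k * (1 - o_k) * (t_k - o_k) for each output unit k
    delta_output = o_output * (1 - o_output) * (t - o_output)

    # 3. delta_h = o_h * (1 - o_h) * sum_k w_kh * delta_k for each hidden unit h
    delta_hidden = o_hidden * (1 - o_hidden) * (W_output.T @ delta_output)

    # 4. w_ji <- w_ji + eta * delta_j * x_ji
    W_output = W_output + eta * np.outer(delta_output, o_hidden)
    W_hidden = W_hidden + eta * np.outer(delta_hidden, x)
    return W_hidden, W_output
```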

Backpropagation Algorithm


• Step 1 propagates the input forward through the network
• Steps 2–4 propagate the errors backward through the network
• Step 2 is similar to the delta rule in gradient descent (step 2.3)
• Step 3 sums over the errors of all output units influenced by a given hidden unit (this is because the training data only provides direct feedback for the output units)


Applications of Neural Networks


• Text to speech
• Fraud detection
• Automated vehicles
• Game playing
• Handwriting recognition

Summary


• Perceptrons, simple one-layer neural networks
• Perceptron training rule
• Gradient descent search
• Multi-layer neural networks
• Backpropagation algorithm