Upload
roozbeh
View
223
Download
0
Embed Size (px)
Citation preview
8/11/2019 ml-lecture3-2013 (1).pdf
1/60
Stavros Petridis Machine Learn ing (course 395)
Course 395: Mach ine Learning - Lectures
Lecture 1-2: Concept Learning (M. Pantic)
Lecture 3-4: Decision Trees & CBC Intro (M. Pantic)
Lecture 5-6: Artificial Neural Networks (S.Petridis)
Lecture 7-8: Instance Based Learning (M. Pantic)
Lecture 9-10: Genetic Algorithms (M. Pantic)
Lecture 11-12: Evaluating Hypotheses (THs)
Lecture 13-14: Invited Lecture
Lecture 15-16: Inductive Logic Programming (S. Muggleton)
Lecture 17-18: Inductive Logic Programming (S. Muggleton)
8/11/2019 ml-lecture3-2013 (1).pdf
2/60
Stavros Petridis Machine Learn ing (course 395)
Neural Networks
Reading:Machine Learning (Tom Mitchel) Chapter 4
Pattern Classification (Duda, Hart, Stork) Chapter 6
(strongly to advised to read 6.1, 6.2, 6.3, 6.8)
Further Reading:
Pattern Recognition and Machine Learning by Christopher M.
Bishop, Springer, 2006 (chapter: 5)Neural Networks (Haykin)
Neural Networks for Pattern Recognition (Chris M. Bishop)
8/11/2019 ml-lecture3-2013 (1).pdf
3/60
Stavros Petridis Machine Learn ing (course 395)
What are Neural Networks?
The real thing! 10 billion neurons
Local computations on interconnected elements (neurons)
Parallel computation neuron switch time 0.001sec recognition tasks performed in 0.1 sec.
8/11/2019 ml-lecture3-2013 (1).pdf
4/60
Stavros Petridis Machine Learn ing (course 395)
Biological Neural Networks
A network of interconnected biological neurons.
Connections per neuron 104 - 105
8/11/2019 ml-lecture3-2013 (1).pdf
5/60
Stavros Petridis Machine Learn ing (course 395)
B iolog ical vs Art i f ic ial Neural Networks
8/11/2019 ml-lecture3-2013 (1).pdf
6/60
Stavros Petridis Machine Learn ing (course 395)
Archi tecture
How the neurons are connected
The Neuron
How information is processed in each unit. output = f(input)
Learn ing algor i thm s
How a Neural Network modifies its weights in order to solve a
particular learning task in a set of training examples
The goal is to have a Neural Network that generalizes well, that is, that
it generates a correct output on a set of test/new examples/inputs.
Artificial Neural Networks: the dimensions
8/11/2019 ml-lecture3-2013 (1).pdf
7/60
Stavros Petridis Machine Learn ing (course 395)
The Neuron
I
n
p
u
t
s
Output
Bias
o=(net)
Neuron
Weights
Main building block of any neural network
Net Activation
Transfer/ActivationFunction
8/11/2019 ml-lecture3-2013 (1).pdf
8/60
Stavros Petridis Machine Learn ing (course 395)
Activation functions
)(,1 0
netoYwxwnetX n
i ii
8/11/2019 ml-lecture3-2013 (1).pdf
9/60
Stavros Petridis Machine Learn ing (course 395)
Perceptron
I
n
p
ut
s
o=(net)
= sign/step/function
Perceptron = a neuron that its input is the dot product of W and X
and uses a step function as a transfer function
8/11/2019 ml-lecture3-2013 (1).pdf
10/60
Stavros Petridis Machine Learn ing (course 395)
Generalization to single layer perceptrons with more neurons is
easy because:
The output units are mutually independent
Each weight only affects one of the outputs
Perceptron : Arch i tecture
8/11/2019 ml-lecture3-2013 (1).pdf
11/60
Stavros Petridis Machine Learn ing (course 395)
Perceptron was invented by Rosenblatt The Perceptron--a perceiving
and recognizing automaton, 1957
Perceptron
8/11/2019 ml-lecture3-2013 (1).pdf
12/60
Stavros Petridis Machine Learn ing (course 395)
Perceptron: Example 1 - AND
x1 = 1, x2 = 1 net = 20+20-30=10o = (10) = 1
x1 = 0, x2 = 1 net = 0+20-30 =-10o = (-10) = 0 x1 = 1, x2 = 0 net = 20+0-30 =-10o = (-10) = 0
x1 = 0, x2 = 0 net = 0+0-30 =-30o = (-30) = 0
o=(net)
8/11/2019 ml-lecture3-2013 (1).pdf
13/60
Stavros Petridis Machine Learn ing (course 395)
Perceptron: Example 2 - OR
x1 = 1, x2 = 1 net = 20+20-10=30o = (30) = 1
x1 = 0, x2 = 1 net = 0+20-10 =10o = (10) = 1
x1 = 1, x2 = 0 net = 20+0-10 =10o = (10) = 1
x1 = 0, x2 = 0 net = 0+0-10 =-10o = (-10) = 0
o=(net)
8/11/2019 ml-lecture3-2013 (1).pdf
14/60
Stavros Petridis Machine Learn ing (course 395)
Perceptron: Example 3 - NAND
x1 = 1, x2 = 1 net = -20-20+30=-10 o = (-10) = 0
x1 = 0, x2 = 1 net = 0-20+30 =10 o = (10) = 1
x1 = 1, x2 = 0 net = -20+0+30 =10 o = (10) = 1
x1 = 0, x2 = 0 net = 0+0+30 =30 o = (30) = 1
o=(net)
8/11/2019 ml-lecture3-2013 (1).pdf
15/60
Stavros Petridis Machine Learn ing (course 395)
Given training examples of classes , train the perceptron
in such a way that it classifies correctly the training examples:
I f the output of the perceptron is 1 then the input is
assigned to class (i .e. if )
I f the output is 0 then the input is assigned to class
Geometrically, we try to find a hyper-plane that separates the
examples of the two classes. The hyper-plane is defined by thelinear function
Perceptron for classification
8/11/2019 ml-lecture3-2013 (1).pdf
16/60
Stavros Petridis Machine Learn ing (course 395)
Perceptron : Geometr ic view
(Note that q -w0)
20
10
02211
02211
AClassthenwxwxwif
AClassthenwxwxwif
definitionourondepends
AorAClassthenwxwxwif 21002211
8/11/2019 ml-lecture3-2013 (1).pdf
17/60
Stavros Petridis Machine Learn ing (course 395)
Perceptron : The l im itat ions of perceptron
Perceptron can only classify examples that are linearly separable
The XOR is not linearly separable.
This was a terrible blow to the field
8/11/2019 ml-lecture3-2013 (1).pdf
18/60
Stavros Petridis Machine Learn ing (course 395)
Marvin Minsky Seymour Papert
A famous book was published in 1969: Perceptrons
Caused a significant decline in interest and funding ofneural network research
Perceptron
8/11/2019 ml-lecture3-2013 (1).pdf
19/60
Stavros Petridis Machine Learn ing (course 395)
Perceptron XOR Solution
XOR can be expressed in terms of AND, OR, NAND
8/11/2019 ml-lecture3-2013 (1).pdf
20/60
Stavros Petridis Machine Learn ing (course 395)
Perceptron XOR Solution
XOR can be expressed in terms of AND, OR, NAND XOR = NAND (AND) OR
AND
1 1 1
0 1 0
1 0 0
0 0 0
OR
1 1 1
0 1 1
1 0 1
0 0 0
NAND
1 1 0
0 1 1
1 0 1
0 0 1
OR
20
20
-10
NAND
-20
-20
30
20AND
20
-30
y1
y2
o
x1=1, x2 =1 y1=1 AND y2=0 o = 0 x1=1, x2 =0 y1=1 AND y2=1 o = 1
x1=0, x2 =1 y1=1 AND y2=1 o = 1
x1=0, x2 =0 y1=0 AND y2=1 o = 0
8/11/2019 ml-lecture3-2013 (1).pdf
21/60
Stavros Petridis Machine Learn ing (course 395)
XOR
1
0 1
x2
x1
OR
NAND
1
1
0
020x1 + 20x2 = 10
20x1 + 20x2 = 30
8/11/2019 ml-lecture3-2013 (1).pdf
22/60
Stavros Petridis Machine Learn ing (course 395)
Input
layer
Output
layer
Hidden Layer
We consider a more general network architecture: between the input and outputlayers there are hidden layers, as illustrated below.
Hidden nodes do not directly receive inputs nor send outputs to the external
environment.
Mult i layer Feed Forward Neural Network
8/11/2019 ml-lecture3-2013 (1).pdf
23/60
Stavros Petridis Machine Learn ing (course 395)
NNs: Architecture
Input
Layer
Hidden
Layer 1
Hidden
Layer 2
Output
Layer
Hidden
Layer 1
Hidden
Layer 2
Hidden
Layer 3
Output
Layer
Input
Layer
Feedback
Connection
3-layer feed-forward network 4-layer feed-forward network
4-layer recurrent network
The input layer doesnot count as a layer
8/11/2019 ml-lecture3-2013 (1).pdf
24/60
Stavros Petridis Machine Learn ing (course 395)
Mult i layer Feed Forward Neural Network
jiw kjwix
ji
w = weight associated with ith inputto hidden unitj
kjw = weight associated withjth inputto output unit k
jy
ko
= output ofjth hidden unit
= output of kth output unit
n
i jiij wxy
0
nH
j kjjk wyo
1
n = number of inputs
nH = number of hidden neurons
nH
j kj
n
i jiik wwxo
1 0
8/11/2019 ml-lecture3-2013 (1).pdf
25/60
Stavros Petridis Machine Learn ing (course 395)
Representational Power of
Feedforward Neural Networks
Boolean functions: Every boolean function can be represented
exactlyby some network with two layers
Continuous functions: Every bounded continuous function can
be approximated with arbitrarily small error by a network with
2 layers
Arbitrary functions: Any function can be approximated to
arbitrary accuracy by a network with 3 layers
Catch: We do not know 1) what the appropriate number ofhidden neurons is, 2) the proper weight values
nH
j kj
n
i jiik wwxo
1 0
8/11/2019 ml-lecture3-2013 (1).pdf
26/60
Stavros Petridis Machine Learn ing (course 395)
Classification /Regression with NNs
You should think of neural networks as functionapproximators
nH
j kj
n
i jiik wwxo
1 0
Classification-Decision boundary approximation-Discrete output
Regression- Continuous output
8/11/2019 ml-lecture3-2013 (1).pdf
27/60
Stavros Petridis Machine Learn ing (course 395)
022110 xwxww
022110 xwxwwx1
1
x2 w2
w1
w0
Convex
region
L1L2
L3L4 -3.5
Networkwith a single
node
One-hidden layer network that
realizes the convex region: each
hidden node realizes one of the
lines bounding the convex region
P1
P2
P3
1
1
1
1
1
x1
x2
1
1.5
two-hidden layer network thatrealizes the union of three convex
regions: each box represents a one
hidden layer network realizing
one convex region
1
1
1
1
x1
x2
1
Decis ion boundar ies
8/11/2019 ml-lecture3-2013 (1).pdf
28/60
8/11/2019 ml-lecture3-2013 (1).pdf
29/60
Stavros Petridis Machine Learn ing (course 395)
Multiclass Classification
Target Values: vector (length=no. Classes)
0
0
0
1
0
0
1
0
0
1
0
0
1
0
0
0
Class1 Class3Class2 Class4
8/11/2019 ml-lecture3-2013 (1).pdf
30/60
Stavros Petridis Machine Learn ing (course 395)
Training
We have assumed so far that we know the weight
values
We are given a training set consisting of inputs and
targets (X, T) Training: Tuning of the weights (w) so that for each
input pattern (x) the output (o) of the network is close
to the target values (t).
nH
j kj
n
i jii wwxo
1 0
o t
8/11/2019 ml-lecture3-2013 (1).pdf
31/60
Stavros Petridis Machine Learn ing (course 395)
Perceptron -Training
n
i ii wxo 0
ix
iw
- Training Set: A set of input vectorso
Begin with random weights
Update the weights so
nixi 1, with thecorresponding targets
it
o t
8/11/2019 ml-lecture3-2013 (1).pdf
32/60
Stavros Petridis Machine Learn ing (course 395)
TrainingGradient Descent
Gradient Descent: A general, effective way for estimatingparameters (e.g. w) that minimise error functions
We need to define an error function E(w)
Update the weights in each iteration in a direction that reduces
the error the order in order to minimize E
iii www
ii w
E
w
-
8/11/2019 ml-lecture3-2013 (1).pdf
33/60
Stavros Petridis Machine Learn ing (course 395)
Gradient descent method: take a step in the direction
that decreases the error E. This direction is the opposite
of the derivative of E.
Gradient directionE
Grad ient Descent
- derivative: direction of steepest increase
- learning rate: determines the step size in
the direction of steepest decrease
8/11/2019 ml-lecture3-2013 (1).pdf
34/60
Stavros Petridis Machine Learn ing (course 395)
We assume linear activation transfer (i.e. its derivative = 1)
E depends on the weights because
We wish to find the weight values that minimise E, i.e. thedesired target tis very close to the actual output o.
Grad ient Descent
We define our error function as follows (D = number oftraining examples):
( - D
d dd otE
1
2
2
1
n
i i
d
id wxo 0
8/11/2019 ml-lecture3-2013 (1).pdf
35/60
Stavros Petridis Machine Learn ing (course 395)
The partial derivative can be computed as follows:
Therefore w is ( idD
d ddi xotw - 1
Gradient Descent
( - D
d dd otE 12
21
8/11/2019 ml-lecture3-2013 (1).pdf
36/60
Stavros Petridis Machine Learn ing (course 395)
1. Initialise weights randomly2. Compute the output o for all the training examples
3. Compute the weight update for each weight
4. Update the weights
( idD
d ddi xotw - 1
Gradient Descent -Perceptron - Summary
iii www
Repeat steps 2-4 until a termination condition is met
The algorithm will converge to a weight vector with
minimum error, given that the learning rate is sufficiently small
8/11/2019 ml-lecture3-2013 (1).pdf
37/60
Stavros Petridis Machine Learn ing (course 395)
Gradient Descent Learn ing Rate
8/11/2019 ml-lecture3-2013 (1).pdf
38/60
Stavros Petridis Machine Learn ing (course 395)
The Backprop algorithm searches for weight values thatminimize the total squared error of the network (K outputs)
over the set of D training examples (training set).
Based on gradient descent algorithm
Input
layer
Output
layer
Learning: The backp ropagat ion algo r i thm
iii www i
iw
Ew
-
( - K
k
D
d kdkd otE
1 1
2
2
1
8/11/2019 ml-lecture3-2013 (1).pdf
39/60
Stavros Petridis Machine Learn ing (course 395)
Backpropagation consists of the repeated application of thefollowing two passes:
Forward pass: in this step the network is activated on
one example and the error of (each neuron of) the output
layer is computed. Backward pass: in this step the network error is used for
updating the weights (credit assignment problem). This
process is complex because hidden nodes are linked to
the error not directly but by means of the nodes of thenext layer. Therefore, starting at the output layer, the
error is propagated backwards through the network, layer
by layer.
Learning: The backp ropagat ion algo r i thm
8/11/2019 ml-lecture3-2013 (1).pdf
40/60
Stavros Petridis Machine Learn ing (course 395)
Consider the error of one pattern d: ( - K
k kkd otE
12
21
(
jid wE
j
j
jijjji
net
netxotw
-
-
/
)(
nH
i jijij wxo
0
jix
jiw
output unitj subscriptji: ith input to unitj,
i.e., input from hidden neuron ito output unitj
( idd
i ddi xotw - 1
Using exactly the same approach as in the perceptron
perceptron w
The only difference is the partial derivative of the sigmoid
activation function (we assumed linear activation derivative =1)
Learning: Output Layer Weigh ts
8/11/2019 ml-lecture3-2013 (1).pdf
41/60
Stavros Petridis Machine Learn ing (course 395)
( k
kkjkkkj
net
netxotw
-
)(
( kk
kkk net
net
ot
-
)(
We define the error term for the output k:Reminder:
1
2
3
4
j
j
edTojonsConnectoutputNeurk
kjkjnet
netw
)(
Note thatjk, ij, i.e.
j: hidden unit
k: output unit
The error term for hidden
unitj is:
Learning: Hidden Layer Weigh ts
8/11/2019 ml-lecture3-2013 (1).pdf
42/60
Stavros Petridis Machine Learn ing (course 395)
( kjkk
kkjkkkj x
net
netxotw
-
)(( k
k
kkknet
net
ot
-
)(
We define the error term for the output k: Reminder:
j
j
edTojonsConnectoutputNeurk kjkj net
netw
)(
x is the input to output unit k from hidden unitj
Similarly the update rule for weights in the input layer is
jijji xw
x is the input to hidden unitj from input unit i
Learning: Hidden Layer Weigh ts
8/11/2019 ml-lecture3-2013 (1).pdf
43/60
Stavros Petridis Machine Learn ing (course 395)
Finally for all training examples D
This is called batch training because theweights are updated after all training examples
have been presented to the network (=epoch)
Incremental training: weights are updated after
each training example is presented
D
d i
examplesallfor
i ww 1
Learning : Backp ropagation A lgor i thm
8/11/2019 ml-lecture3-2013 (1).pdf
44/60
Stavros Petridis Machine Learn ing (course 395)
When the gradient magnitude is small, i.e. the error
does not change much between two epochs
When the maximum number of epochs has beenreached
When the error on the validation set increases for nconsecutive times (this implies that we monitor the
error on the validation set)
Backpropagation Stopping Criter ia
8/11/2019 ml-lecture3-2013 (1).pdf
45/60
8/11/2019 ml-lecture3-2013 (1).pdf
46/60
Stavros Petridis Machine Learn ing (course 395)
6. Update weights
7. Repeat steps 2-6 until the stopping criterion is met
Backp ropagation Summary
iii www
8/11/2019 ml-lecture3-2013 (1).pdf
47/60
Stavros Petridis Machine Learn ing (course 395)
Backpropagation : Var iants
Backpropagation with momentum
The weight update depends also on the previous update
Useful for flat regions where the update would be small
Backpropagation with adaptive learning rate
Learning rate changes during training
- If error is decreased after weight update then lr is increased
- If error is increased after weight update then lr is decreased
10),1()( - nwxnw jijijji
8/11/2019 ml-lecture3-2013 (1).pdf
48/60
Stavros Petridis Machine Learn ing (course 395)
Backp ropagation: Convergence
Converges to a local minimum of the error function can be retrained a number of times
Minimises the error over the training examples
will it generalise well over unknown examples?
Training requires thousands of iterations (slow)
but once trained it can rapidly evaluate output
8/11/2019 ml-lecture3-2013 (1).pdf
49/60
Stavros Petridis Machine Learn ing (course 395)
Backpropagation : Error Surface
8/11/2019 ml-lecture3-2013 (1).pdf
50/60
Stavros Petridis Machine Learn ing (course 395)
Backpropagation : Overfi t t ing
Error might decrease in the training set but increase in the
validation set (overfitting!) Stop when the error in the validation set increases (but
not too soon!). This is called early stopping
8/11/2019 ml-lecture3-2013 (1).pdf
51/60
Stavros Petridis Machine Learn ing (course 395)
Backpropagation : Overfi t t ing
Another approach to avoid overfitting is regularisation
Add a term in the error function to penalise large weight
values
Having smaller weights and biases makes the network
less likely to overfit
Check matlab documentation for other approaches
8/11/2019 ml-lecture3-2013 (1).pdf
52/60
Stavros Petridis Machine Learn ing (course 395)
Parameters / Weigh ts
Parameters are what the user specifies, e.g. number of
hidden neurons, learning rate, number of epochs etc
They need to be optimised
Weights are the weights of the network
They are also parameters but they are optimisedautomatically via gradient descent
8/11/2019 ml-lecture3-2013 (1).pdf
53/60
Stavros Petridis Machine Learn ing (course 395)
Practical Suggestions: Activation Functions
Continuous, smooth (essential for backpropagation)
Nonlinear
Saturates, i.e. has a min and max value
Monotonic (if not then additional local minima can beintroduced)
Sigmoids are good candidate (log-sig, tan-sig)
8/11/2019 ml-lecture3-2013 (1).pdf
54/60
Stavros Petridis Machine Learn ing (course 395)
Practical Suggestions: Activation Functions
In case of regression then output layer should have
linear activation functions
8/11/2019 ml-lecture3-2013 (1).pdf
55/60
P i l S i T V l
8/11/2019 ml-lecture3-2013 (1).pdf
56/60
Stavros Petridis Machine Learn ing (course 395)
Practical Suggestions: Target Values
Binary Classification
- Target Values : -1/0 (negative) and 1 (positive)
- 0 for log-sigmoid, -1 for tan-sigmoid
Multiclass Classification
- [0,0,1,0] or [-1, -1, 1, -1]
- 0 for log-sigmoid, -1 for tan-sigmoid
Regression
Target values: continuous values [-inf, +inf]
Practical Suggestions:
8/11/2019 ml-lecture3-2013 (1).pdf
57/60
Stavros Petridis Machine Learn ing (course 395)
Practical Suggestions:
Number of Hidden Layers
Networks with many hidden layers are prone to
overfitting and they are also harder to train
For most problems one hidden layer should beenough
2 hidden layers can sometimes lead to improvement
l b l
8/11/2019 ml-lecture3-2013 (1).pdf
58/60
Stavros Petridis Machine Learn ing (course 395)
Matlab Examples
Create feedforward network
- net = feedforwardnet(hiddenSizes,trainFcn)- hiddenSizes = [10, 10, 10]3 layer network with 10 hidden
neurons in each layer
- trainFcn = trainlm, traingdm, trainrp etc
Configure (set number of input/output neurons)
- net = configure(net,x,t)
- x: input data, noFeatures x noExamples
- t: target data, noClasses x noEx
8/11/2019 ml-lecture3-2013 (1).pdf
59/60
M l b E l
8/11/2019 ml-lecture3-2013 (1).pdf
60/60
Matlab Examples
Batch training: Train
Stochastic/Incremental Training: Adapt
Use Train for the CBC