MALIS: Neural Networks
Maria A. Zuluaga
Data Science Department
Recap: Classification
Source: A. Zisserman
The input space is divided into decision regions whose boundaries are called decision boundaries or decision surfaces.
Separating hyperplanes
Figure 4.14 from The Elements of Statistical Learning
Separating hyperplane classifiers are linear classifiers that try to explicitly separate the data as well as possible.
The least squares solution, obtained by regressing the $-1/1$ response $y$ on $x$, $\hat{\beta} = (X^T X)^{-1} X^T y$, leads to a line given by $\{x : \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 = 0\}$.
Two (of infinitely many) possible separating hyperplanes.
The Perceptron
The Perceptron
• Assumptions:
• Data is linearly separable
• Binary classification using labels $y \in \{-1, +1\}$
• Goal: Find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary.
Formulation
• $y = f(w^T x + b)$, with
$$f(a) = \begin{cases} +1, & a \ge 0 \\ -1, & a < 0 \end{cases}$$
Source: Machine Learning for Intelligent Systems, Cornell University
As before, we will "absorb" $b$ by adding a "dummy" variable to $x$:
$$y = f(w^T x), \quad \text{where } x = (1, x_1, \dots, x_D)^T \text{ and } w = (b, w_1, \dots, w_D)^T$$
Error function: The perceptron criterion
• Conditions:
• Patterns $x_n \in C_1$ will have $w^T x_n > 0$
• Patterns $x_n \in C_2$ will have $w^T x_n < 0$
• Since $y_n \in \{-1, +1\}$, this means we want all patterns to satisfy $w^T x_n\, y_n > 0$
Perceptron criterion:
$$E_P(w) = -\sum_{n \in \mathcal{M}} w^T x_n\, y_n$$
where $\mathcal{M}$ denotes the set of misclassified patterns
Interpretation
• The output is the characteristic function of a half-space, bounded by the hyperplane
[Figure: the $(x_1, x_2)$ plane divided into the half-spaces $y = +1$ and $y = -1$ by the hyperplane $w^T x + b = 0$]
Representation
[Diagram: perceptron unit with inputs $x_i$ weighted by $w_i$, a bias weight $w_0$ on a constant input $+1$, and an activation producing the output]
Linear Separability
• Given two sets of points, is there a perceptron which classifies them?
[Two examples: a linearly separable set (YES) and a non-separable set (NO)]
• True only if sets are linearly separable
Finding the weights
• Obtain an expression for the gradient of the perceptron criterion
• Use stochastic gradient descent to minimize the error function
• A change in the weight vector is given by:
$$w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_P(w) = w^{(\tau)} + \eta\, x_n\, y_n$$
Stochastic gradient descent vs. gradient descent: rather than computing the sum of the gradient contributions of every observation followed by a step in the negative gradient direction, a step is taken after each single observation is visited.
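In symbols, for an error of the form $E(w) = \sum_n E_n(w)$:
$$\text{batch: } w^{(\tau+1)} = w^{(\tau)} - \eta \sum_{n=1}^{N} \nabla E_n(w^{(\tau)}) \qquad \text{stochastic: } w^{(\tau+1)} = w^{(\tau)} - \eta\, \nabla E_n(w^{(\tau)})$$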
Perceptron training algorithm
initialize w
while TRUE do:
    m = 0
    foreach (xi, yi) do:
        if w^T xi yi ≤ 0:
            w = w + η yi xi
            m = m + 1
    if m == 0:
        break
Illustration adapted from Fig 4.7, PRML – C. Bishop
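For reference, a minimal NumPy sketch of this training loop, assuming labels in {-1, +1}; the max_epochs cap is my addition so that non-separable data does not loop forever (the companion notebook 03_perceptron.ipynb may differ):

import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=1000):
    # X: (n, d) inputs; y: labels in {-1, +1}
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # dummy x0 = 1 absorbs the bias b
    w = np.zeros(Xb.shape[1])                      # initialize w
    for _ in range(max_epochs):
        m = 0                                      # misclassification counter
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:                 # misclassified (or on the boundary)
                w += eta * yi * xi                 # w = w + eta * yi * xi
                m += 1
        if m == 0:                                 # a full pass with no mistakes: done
            break
    return w

# Example: the OR function with labels in {-1, +1}
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, 1, 1, 1])
print(train_perceptron(X, y))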
Hands on example: The OR function
• 03_perceptron.ipynb
[Plot: the four OR inputs in the $(x_1, x_2)$ plane; $(0,0)$ is labeled $-1$ and the other three points $+1$]
Hands on example: The OR function
Trace table with columns X0, X1, X2, b, W1, W2, y, activation, m (filled in step by step in class; initial entries shown: 1, 0.5, 0, 1)
Solution
Perceptron convergence theorem
• If there exists an exact solution (i.e. the data is linearly separable), the perceptron algorithm is guaranteed to converge to it in a finite number of steps.
• However:
• The number of steps to convergence might be large
• Until convergence is achieved, it is not possible to distinguish a non-separable problem from a slowly converging one
What if we use a different initialization value?
Question: These are specific examples. What is the general set of inequalities that must be satisfied for an OR perceptron?
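For reference, one way to write them: with a unit $Y = h(w_0 + w_1 X_1 + w_2 X_2)$ and the convention that the output is 1 when the activation is strictly positive, the four rows of the OR truth table give
$$w_0 \le 0, \qquad w_0 + w_2 > 0, \qquad w_0 + w_1 > 0, \qquad w_0 + w_1 + w_2 > 0$$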
Other logic functions
Exercises to complete in the notebook
Perceptron Limitations
• Perceptrons cannot represent XOR:
[Plot: the four XOR points in the $(x_1, x_2)$ plane; no single line separates the two classes]
x1 x2 XOR
0 0 0
1 0 1
0 1 1
1 1 0
[Minsky 1969] -> The AI winter
Perceptron Limitations
• The algorithm does not converge when the data is not separable
• When the data is separable, there are many solutions, and which one is found depends on the starting values
• The "finite" number of steps to convergence can be very large
Some history
Source: C. Bishop - PRML
Recap
• We introduced the perceptron algorithm, a linear classifier
• We saw that it guarantees convergence to a solution when the data is separable
• But we also saw that it has numerous limitations
Neural Networks
Why neural networks? A note on history
• The term neural network has its origins in attempts to find mathematical representations of information processing in biological systems
• From Bishop: the term has been used very broadly to cover a wide range of different models, many of which have been the subject of exaggerated claims regarding their biological plausibility
• Neural networks are efficient nonlinear models for statistical pattern recognition
Motivation
• Recall the first lecture on linear models
• We saw that adding features could give a better fit of the model: $\phi(x) = (1, x, x^2, \dots, x^n)$
• $\phi(x)$: basis function
• Model:
$$y(x, w) = f\left(\sum_{j=1}^{M} w_j\, \phi_j(x)\right)$$
Motivation
• We also saw that choosing the right set of features was challenging
• Goal: make the basis functions $\phi_j(x)$ depend on parameters, and allow these parameters to be adjusted along with the coefficients $\{w_j\}$ during training
• How? Neural networks
Revisit: 01_linear_models.ipynb
Which is the right value for n?
Feed forward networks, a.k.a. the multilayer perceptron (MLP)
• Basic neural network model: a series of functional transformations
• Step 1: Construct $M$ linear combinations of the input variables $x \in \mathbb{R}^D$:
$$a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}, \qquad j = 1, \dots, M$$
• Terminology: the $a_j$ are called activations; the $w_{ji}^{(1)}$ are weights and the $w_{j0}^{(1)}$ biases; the superscript $(1)$ is the layer index (parameters of the first layer of the network); $i = 1, \dots, D$ indexes the dimensions of the input $x$
• Step 2: Transform each activation using a differentiable, nonlinear activation function $h(\cdot)$: $z_j = h(a_j)$
• $h(\cdot)$ is generally chosen to be a sigmoidal function
• The $z_j$ correspond to the outputs of the basis functions of our model. Recall: $y(x, w) = f\left(\sum_{j=1}^{M} w_j\, \phi_j(x)\right)$
• In the context of neural networks, they are called hidden units
• Step 3: The $z_j$ are again linearly combined to give output unit activations:
$$a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)}, \qquad k = 1, \dots, K$$
• $K$ is the number of outputs; the $w_{kj}^{(2)}$ are weights and the $w_{k0}^{(2)}$ biases; the superscript $(2)$ is the layer index (parameters of the second layer of the network)
• Step 4: The $a_k$ are transformed using an activation function to give a set of network outputs $y_k$
• The choice of activation function follows the same considerations as for linear models
• For regression: the identity function
• Common activation functions for classification:
  • Sigmoid: $\sigma(a) = 1/(1 + e^{-a})$
  • Tanh: $\tanh(a) = \dfrac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$
  • Hinge or ReLU: $h(a) = \max(a, 0)$
  • Softmax (multiclass): $h(a_k) = \dfrac{e^{a_k}}{\sum_{j} e^{a_j}}$
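A small NumPy sketch of these four activations (the max-shift inside softmax is a standard numerical-stability detail, not from the slides):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))   # 1 / (1 + e^-a)

def tanh(a):
    return np.tanh(a)                 # (e^a - e^-a) / (e^a + e^-a)

def relu(a):
    return np.maximum(a, 0.0)         # max(a, 0)

def softmax(a):
    e = np.exp(a - np.max(a))         # shift by max(a) for numerical stability
    return e / e.sum()                # e^{a_k} / sum_j e^{a_j}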
Feed forward networks a.k.a. MLP: final expression
• Combining all of the above, and using a sigmoidal output unit activation function:
$$y_k(x, w) = \sigma\left(\sum_{j=1}^{M} w_{kj}^{(2)}\, h\left(\sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}\right) + w_{k0}^{(2)}\right)$$
Feed forward networks / MLP. Interpretation: network diagram
• Forward propagation of information through the network
[Fig 5.1, PRML – C. Bishop: two-layer network]
Simplifying notation: absorbing the biases
• As with linear models, the bias parameters can be absorbed into the set of weight parameters by adding a dummy variable $x_0 = 1$:
$$a_j = \sum_{i=0}^{D} w_{ji}^{(1)} x_i$$
• In the second layer, likewise with $z_0 = 1$:
$$a_k = \sum_{j=0}^{M} w_{kj}^{(2)} z_j$$
Feed forward networks a.k.a. MLP: simplified final expression
• The overall network function now becomes:
$$y_k(x, w) = \sigma\left(\sum_{j=0}^{M} w_{kj}^{(2)}\, h\left(\sum_{i=0}^{D} w_{ji}^{(1)} x_i\right)\right)$$
• Compare with the linear model: $y(x, w) = f\left(\sum_{j=1}^{M} w_j\, \phi_j(x)\right)$
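A minimal NumPy sketch of this forward pass, under the absorbed-bias convention above (variable names are mine):

import numpy as np

def mlp_forward(x, W1, W2, h=np.tanh):
    # W1: (M, D+1) first-layer weights, column 0 = biases
    # W2: (K, M+1) second-layer weights, column 0 = biases
    x = np.concatenate(([1.0], x))             # dummy x0 = 1
    a = W1 @ x                                 # a_j = sum_i w_ji^(1) x_i
    z = np.concatenate(([1.0], h(a)))          # z_j = h(a_j), dummy z0 = 1
    return 1.0 / (1.0 + np.exp(-(W2 @ z)))     # y_k = sigma(sum_j w_kj^(2) z_j)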
Multilayer perceptron: interpretation
• Two stages of processing, each of which resembles the perceptron:
$$y_k(x, w) = \sigma\left(\sum_{j=0}^{M} w_{kj}^{(2)}\, h\left(\sum_{i=0}^{D} w_{ji}^{(1)} x_i\right)\right)$$
Adapted from Fig 5.1, PRML – C. Bishop
Back to features
• A neuron can be seen as a feature map of the form
$$\phi_j(x) = h\left(\sum_{i=0}^{D} w_{ji} x_i\right)$$
• Therefore, each node in the network can be interpreted as a feature variable
• By optimizing the weights $\{w\}$ we are doing feature selection
• Pre-trained networks: the features resulting from the optimization are useful for many problems
• They need a lot of data
Network training: Backpropagation
• We cannot use the training algorithm from the perceptron because we don't know the "correct" outputs of the hidden units
• Strategy: apply the chain rule to differentiate composite functions
• Refresher:
$$F(x) = f(g(x)) \Rightarrow F'(x) = f'(g(x))\, g'(x)$$
$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx} \quad \text{(Leibniz's notation)}$$
Deriving gradient descent for MLP
• See the board notes for a simpler derivation of the backpropagation algorithm
Deriving gradient descent for MLP
• Error function:
$$E(w) = \sum_{n=1}^{N} E_n(w)$$
• We will estimate $\nabla E_n(w)$
• Let us consider a simple linear model with outputs $y_k$:
$$y_k = \sum_i w_{ki}\, x_i$$
• The error function for a particular input sample $n$:
$$E_n = \frac{1}{2} \sum_k (y_{nk} - t_{nk})^2$$
Deriving gradient descent for MLP
• The gradient of this error function w.r.t. a weight $w_{ji}$:
$$\frac{\partial E_n}{\partial w_{ji}} = (y_{nj} - t_{nj})\, x_{ni}$$
• Interpretation: the product of an error signal $(y_{nj} - t_{nj})$ associated with the output end of the link $w_{ji}$, and the variable $x_{ni}$ associated with its input end
• Similar to the expression obtained for logistic regression when using the sigmoid function. Refresher (exercise proposed on slide 27, annotated):
$$\nabla_w E(w) = \frac{\partial E(w)}{\partial w} = \sum_{n=1}^{N} (y_n - t_n)\, x_n$$
Refresher: forward propagation
• Let the activation of each unit in the network be denoted
$$a_j = \sum_i w_{ji}\, z_i$$
with $z_i$ the activation (or input) of a unit connecting to unit $j$, $w_{ji}$ the weight associated with that connection, and
$$z_j = h(a_j)$$
• These are composite functions, so let us use the chain rule to estimate the derivative of the error
[Diagram: $z_i \xrightarrow{w_{ji}} a_j \rightarrow h(a_j)$]
Deriving gradient descent for MLP
• Applying the chain rule:
$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j}\, \frac{\partial a_j}{\partial w_{ji}} = \delta_j\, \frac{\partial a_j}{\partial w_{ji}}$$
• Since $a_j = \sum_i w_{ji}\, z_i$, then $\dfrac{\partial a_j}{\partial w_{ji}} = z_i$
• All together:
$$\frac{\partial E_n}{\partial w_{ji}} = \delta_j\, z_i$$
Same form as $\frac{\partial E_n}{\partial w_{ji}} = (y_{nj} - t_{nj})\, x_{ni}$
How to estimate δ?
• For the output units, we did it already: $\delta_k = y_k - t_k$
• For the hidden units, we resort to the chain rule again:
$$\delta_j = \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k}\, \frac{\partial a_k}{\partial a_j}$$
• Can we obtain an expression for it?
How to estimate δ for hidden units?
$$\delta_j = \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k}\, \frac{\partial a_k}{\partial a_j}$$
Cheat sheet: $a_j = \sum_i w_{ji}\, z_i$, $\quad z_j = h(a_j)$, $\quad \dfrac{dz}{dx} = \dfrac{dz}{dy} \cdot \dfrac{dy}{dx}$
Result:
$$\delta_j = h'(a_j) \sum_k w_{kj}\, \delta_k$$
The solution is on the next slide, but try to do it on your own first.
First, note that
$$\delta_k = \frac{\partial E_n}{\partial a_k}, \quad \text{so} \quad \delta_j = \frac{\partial E_n}{\partial a_j} = \sum_k \delta_k\, \frac{\partial a_k}{\partial a_j}$$
Now let us find an expression for $a_k$:
$$a_k = \sum_j w_{kj}\, z_j$$
and, directly from the cheat sheet, $z_j = h(a_j)$. The derivative amounts to applying the chain rule:
$$\frac{\partial a_k}{\partial a_j} = \frac{\partial a_k}{\partial z_j}\, \frac{\partial z_j}{\partial a_j} = w_{kj}\, h'(a_j)$$
Plugging into the original expression:
$$\delta_j = h'(a_j) \sum_k w_{kj}\, \delta_k$$
Backpropagation formula
$$\delta_j = h'(a_j) \sum_k w_{kj}\, \delta_k$$
Figure 5.7 – Bishop, PRML. Forward propagation computes $z_k = h\left(\sum_j w_{kj}\, z_j\right)$; the $\delta$'s are then propagated backwards.
Backpropagation algorithm
1. For an input vector $x_n$ to the network, do a forward pass using
$$a_j = \sum_i w_{ji}\, z_i, \qquad z_j = h(a_j)$$
to find the activations of all hidden and output units.
2. Evaluate $\delta_k = y_k - t_k$ for the output units.
3. Backward-pass the $\delta$'s to obtain the $\delta_j$ for the hidden units using $\delta_j = h'(a_j) \sum_k w_{kj}\, \delta_k$.
4. Obtain the required derivatives using
$$\frac{\partial E_n}{\partial w_{ji}} = \delta_j\, z_i$$
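A minimal NumPy sketch of these four steps for the two-layer network above, assuming tanh hidden units and linear output units with the sum-of-squares error (variable names are mine):

import numpy as np

def backprop(x, t, W1, W2):
    # W1: (M, D+1), W2: (K, M+1); column 0 of each holds the biases.
    # 1. Forward pass: a_j = sum_i w_ji x_i, z_j = h(a_j)
    x = np.concatenate(([1.0], x))
    a = W1 @ x
    z = np.concatenate(([1.0], np.tanh(a)))
    y = W2 @ z                                   # linear output units
    # 2. Output deltas: delta_k = y_k - t_k
    delta_k = y - t
    # 3. Hidden deltas: delta_j = h'(a_j) sum_k w_kj delta_k, with h'(a) = 1 - tanh(a)^2
    #    (skip the bias column of W2: no delta flows back through z0 = 1)
    delta_j = (1.0 - np.tanh(a) ** 2) * (W2[:, 1:].T @ delta_k)
    # 4. Required derivatives: dEn/dw_ji = delta_j x_i and dEn/dw_kj = delta_k z_j
    return np.outer(delta_j, x), np.outer(delta_k, z)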
Backpropagation algorithm: DIY
• Read Section 5.3.2 from Bishop for a concrete example
• We have derived a general form that covers any error function, activation function and network topology
• Obtain expression for the backpropagation algorithm when using cross-entropy error function (exercise 11.3 from ESL)
Properties: Universality
• MLPs are Universal Boolean functions
• They can compute any Boolean function
• MLPs are Universal Classification functions
• MLPs are Universal approximators
• Can actually compose arbitrary functions in any number of dimensions
MLPs are Universal Boolean Functions
• The perceptron could not solve the XOR.
• If the MLP is a universal Boolean function, it should be able to implement an XOR.
• How?
X1 X2 Y
0 0 0
0 1 1
1 0 1
1 1 0
A truth table shows all input combinations for which the output is 1.
We express the function in disjunctive normal form:
$$Y = \bar{X}_1 X_2 + X_1 \bar{X}_2$$
XOR function: $Y = \bar{X}_1 X_2 + X_1 \bar{X}_2$
[Network diagram: inputs $X_1$, $X_2$ feed two AND units (one per term of the disjunctive normal form), whose outputs feed an OR unit]
Any truth table can be expressed in this manner.
* Bias being omitted
Exercise: Find weights for the XOR
[Network diagram: inputs $X_1$, $X_2$ and bias inputs $+1$ feed hidden units $Y_1$ and $Y_2$ through weights $w_{10}, w_{11}, w_{12}$ and $w_{20}, w_{21}, w_{22}$; $Y_1$, $Y_2$ and a bias $+1$ feed the output $Y$ through weights $w_0, w_1, w_2$]
Step 1: Write down the truth tables (outputs to be filled in)

X1 X2 X̄1 Y1
0  0
0  1
1  0
1  1

X1 X2 X̄2 Y2
0  0
0  1
1  0
1  1

Y1 Y2 Y
0  0
0  1
1  0
1  1
Step 2: Write general expressions (output unit Y)

Y1 Y2 Y
0  0  0 (-1)
0  1  1
1  0  1
1  1  1

With $Y = h(w_0 + w_1 Y_1 + w_2 Y_2)$, the rows require:
$w_0 \le 0$
$w_0 + w_2 > 0$
$w_0 + w_1 > 0$
$w_0 + w_1 + w_2 > 0$
A solution: $w_0 = -3$, $w_1 = 4$, $w_2 = 4$
** Layer index is being omitted
Step 2: Write general expressions (hidden unit Y2)

X1 X2 X̄2 Y2
0  0  1  0*
0  1  0  0*
1  0  1  1
1  1  0  0*

With $Y_2 = h(w_{20} + w_{21} X_1 + w_{22} X_2)$, the rows require:
$w_{20} \le 0$
$w_{20} + w_{22} \le 0$
$w_{20} + w_{21} > 0$
$w_{20} + w_{21} + w_{22} \le 0$
A solution: $w_{20} = -3$, $w_{21} = 4$, $w_{22} = -5$
** Layer index is being omitted
Step 2: Write general expressions (hidden unit Y1)

X1 X2 X̄1 Y1
0  0  1  0
0  1  1  1
1  0  0  0
1  1  0  0

With $Y_1 = h(w_{10} + w_{11} X_1 + w_{12} X_2)$, the rows require:
$w_{10} \le 0$
$w_{10} + w_{12} > 0$
$w_{10} + w_{11} \le 0$
$w_{10} + w_{11} + w_{12} \le 0$
A solution: $w_{10} = -3$, $w_{11} = -5$, $w_{12} = 4$
** Layer index is being omitted
Result: Weights for the XOR
[Network diagram with the weights found above]
$Y_1 = h(-3 - 5 X_1 + 4 X_2)$
$Y_2 = h(-3 + 4 X_1 - 5 X_2)$
$Y = h(-3 + 4 Y_1 + 4 Y_2)$
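A quick numerical check of these weights (a sketch, following the exercise's convention that a unit outputs 1 when its activation is strictly positive):

def h(a):
    return 1 if a > 0 else 0  # step activation used in the inequalities above

for x1 in (0, 1):
    for x2 in (0, 1):
        y1 = h(-3 - 5 * x1 + 4 * x2)
        y2 = h(-3 + 4 * x1 - 5 * x2)
        y = h(-3 + 4 * y1 + 4 * y2)
        print(x1, x2, "->", y)  # prints the XOR truth table: 0, 1, 1, 0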
What to do for more complex functions?
• Karnaugh maps
[Karnaugh map example with its reduced disjunctive-normal-form expression]
Drawback: an MLP can represent a given function this way only if it is sufficiently wide
MLPs as universal function approximators
• A feed-forward network with at least one hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of $\mathbb{R}^d$, under mild assumptions on the activation function (proof by G. Cybenko, 1989)
• However, the key problem is how to find suitable parameter values given a set of training data
MLP intuitive potential
• No hidden layer: half-space
• One hidden layer: convex sets (intersections of half-spaces)
• Two hidden layers: concave and non-connected sets (unions of intersections of half-spaces)
Summary on MLP
• Advantages
• Very general: can be applied in many situations
• Powerful in theory
• Efficient in practice
• Drawbacks
• Training is often slow
• Choosing the optimal number of layers and neurons is difficult
• Little insight into the resulting model
Deep Learning
LeNet-5: Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition (1998)
Recap
• We introduced feedforward networks, aka the multilayer perceptron
• We introduced the backpropagation algorithm which is the mechanism to train feedforward networks
• We saw the strengths but also the limitations of MLPs
• Deep Learning course (spring term) if you want to learn about more powerful neural network architectures
What I have not covered yet
• Some other limitations: problems associated with training
Further reading and useful material
Source Chapters
The Elements of Statistical Learning Sec 4.5, Ch 11
Pattern Recognition and Machine Learning Sec. 4.1.7, Ch 5
Rosenblatt’s original article - The Perceptron --
Warning: Notation might vary among the different sources
From the first lecture
Deep learning
Project definition: What I expect from you
• Able to identify a problem that can be solved using ML tools
• Frame it correctly: supervised, unsupervised, regression, classification, density estimation…
• Able to establish reasonable objectives
  • Not too easy
  • Not so difficult that it cannot be completed in the given time frame
• Able to follow instructions
  • Submit via Moodle
  • Work in pairs, or talk to me to agree on exceptions
• Able to produce a readable document