MALIS: Neural Networks
Maria A. Zuluaga
Data Science Department
Recap: Classification
Source: A. Zisserman
The input space is divided into decision regions whose boundaries are called decision boundaries or decision surfaces.
Separating hyperplanes
Figure 4.14 from The Elements of Statistical Learning
Separating hyperplane classifiers are linear classifiers that try to explicitly separate the data as well as possible.
The least squares solution, obtained by regressing the $-1/1$ response $y$ on $x$, $\hat{\beta} = (X^T X)^{-1} X^T y$, leads to a line given by $\{x : \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 = 0\}$.
Two (of infinitely many) possible separating hyperplanes.
The Perceptron
The Perceptron
• Assumptions:
• Data is linearly separable
• Binary classification using labels $y \in \{-1, +1\}$
• Goal: Find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary.
Formulation
• $y = f(w^T x + b)$, with
$$f(a) = \begin{cases} +1, & a \ge 0 \\ -1, & a < 0 \end{cases}$$
Source: Machine Learning for Intelligent Systems, Cornell University
As before, we will "absorb" $b$ by adding a "dummy" variable to $x$:
$$y = f(w^T x), \quad \text{where } x = (1, x_1, \dots, x_D)^T \text{ and } w = (b, w_1, \dots, w_D)^T$$
Error function: The perceptron criterion
• Conditions:
• Patterns $x_n \in C_1$ will have $w^T x_n > 0$
• Patterns $x_n \in C_2$ will have $w^T x_n < 0$
• Since $y_n \in \{-1, +1\}$, this means we want all patterns to satisfy $w^T x_n\, y_n > 0$
Perceptron criterion:
$$E_P(w) = -\sum_{n \in \mathcal{M}} w^T x_n\, y_n$$
where $\mathcal{M}$ denotes the set of misclassified patterns
Interpretation
• The output is the characteristic function of a half-space, bounded by the hyperplane
[Figure: the $(x_1, x_2)$ plane divided into the half-spaces $y = +1$ and $y = -1$ by the hyperplane $w^T x + b = 0$]
Representation
[Diagram: perceptron unit with inputs $x_i$ weighted by $w_i$, a bias weight $w_0$ on a constant input $+1$, and an activation producing the output]
Linear Separability
• Given two sets of points, is there a perceptron which classifies them?
[Two examples: a linearly separable set (YES) and a non-separable set (NO)]
• True only if sets are linearly separable
Finding the weights
• Obtain an expression for the gradient of the perceptron criterion
• Use stochastic gradient descent to minimize the error function
• A change in the weight vector is given by:
$$w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_P(w) = w^{(\tau)} + \eta\, x_n\, y_n$$
Stochastic gradient descent vs. gradient descent: rather than computing the sum of the gradient contributions of every observation followed by a step in the negative gradient direction, a step is taken after each single observation is visited.
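In symbols, for an error of the form $E(w) = \sum_n E_n(w)$:
$$\text{batch: } w^{(\tau+1)} = w^{(\tau)} - \eta \sum_{n=1}^{N} \nabla E_n(w^{(\tau)}) \qquad \text{stochastic: } w^{(\tau+1)} = w^{(\tau)} - \eta\, \nabla E_n(w^{(\tau)})$$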
Perceptron training algorithm
initialize w
while TRUE do:
    m = 0
    foreach (xi, yi) do:
        if w^T xi yi ≤ 0:
            w = w + η yi xi
            m = m + 1
    if m == 0:
        break
Illustration adapted from Fig 4.7, PRML – C. Bishop
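For reference, a minimal NumPy sketch of this training loop, assuming labels in {-1, +1}; the max_epochs cap is my addition so that non-separable data does not loop forever (the companion notebook 03_perceptron.ipynb may differ):

import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=1000):
    # X: (n, d) inputs; y: labels in {-1, +1}
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # dummy x0 = 1 absorbs the bias b
    w = np.zeros(Xb.shape[1])                      # initialize w
    for _ in range(max_epochs):
        m = 0                                      # misclassification counter
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:                 # misclassified (or on the boundary)
                w += eta * yi * xi                 # w = w + eta * yi * xi
                m += 1
        if m == 0:                                 # a full pass with no mistakes: done
            break
    return w

# Example: the OR function with labels in {-1, +1}
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, 1, 1, 1])
print(train_perceptron(X, y))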
Hands on example: The OR function
• 03_perceptron.ipynb
[Plot: the four OR inputs in the $(x_1, x_2)$ plane; $(0,0)$ is labeled $-1$ and the other three points $+1$]
Hands on example: The OR function
Trace table with columns X0, X1, X2, b, W1, W2, y, activation, m (filled in step by step in class; initial entries shown: 1, 0.5, 0, 1)
Solution
Perceptron convergence theorem
• If there exists an exact solution (i.e. the data is linearly separable), the perceptron algorithm is guaranteed to converge to it in a finite number of steps.
• However:
• The number of steps to convergence might be large
• Until convergence is achieved, it is not possible to distinguish a non-separable problem from a slowly converging one
What if we use a different initialization value?
Question: These are specific examples. What is the general set of inequalities that must be satisfied for an OR perceptron?
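For reference, one way to write them: with a unit $Y = h(w_0 + w_1 X_1 + w_2 X_2)$ and the convention that the output is 1 when the activation is strictly positive, the four rows of the OR truth table give
$$w_0 \le 0, \qquad w_0 + w_2 > 0, \qquad w_0 + w_1 > 0, \qquad w_0 + w_1 + w_2 > 0$$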
Other logic functions
Exercises to complete in the notebook
Perceptron Limitations
• Perceptrons cannot represent XOR:
[Plot: the four XOR points in the $(x_1, x_2)$ plane; no single line separates the two classes]
x1 x2 XOR
0 0 0
1 0 1
0 1 1
1 1 0
[Minsky 1969] -> The AI winter
Perceptron Limitations
• The algorithm does not converge when the data is not separable
• When the data is separable, there are many solutions, and which one is found depends on the starting values
• The "finite" number of steps to convergence can be very large
Some history
Source: C. Bishop - PRML
Recap
• We introduced the perceptron algorithm, a linear classifier
• We saw that it guarantees convergence to a solution when the data is separable
• But we also saw that it has numerous limitations
Neural Networks
Why neural networks? A note on history
• The term neural network has its origins in attempts to find mathematical representations of information processing in biological systems
• From Bishop: the term has been used very broadly to cover a wide range of different models, many of which have been the subject of exaggerated claims regarding their biological plausibility
• Neural networks are efficient nonlinear models for statistical pattern recognition
Motivation
• Recall the first lecture on linear models
• We saw that adding features could give a better fit of the model: $\phi(x) = (1, x, x^2, \dots, x^n)$
• $\phi(x)$: basis function
• Model:
$$y(x, w) = f\left(\sum_{j=1}^{M} w_j\, \phi_j(x)\right)$$
Motivation
• We also saw that choosing the right set of features was challenging
• Goal: make the basis functions $\phi_j(x)$ depend on parameters, and allow these parameters to be adjusted along with the coefficients $\{w_j\}$ during training
• How? Neural networks
Revisit: 01_linear_models.ipynb
Which is the right value for n?
Feed forward networks, a.k.a. the multilayer perceptron (MLP)
• Basic neural network model: a series of functional transformations
• Step 1: Construct $M$ linear combinations of the input variables $x \in \mathbb{R}^D$:
$$a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}, \qquad j = 1, \dots, M$$
• Terminology: the $a_j$ are called activations; the $w_{ji}^{(1)}$ are weights and the $w_{j0}^{(1)}$ biases; the superscript $(1)$ is the layer index (parameters of the first layer of the network); $i = 1, \dots, D$ indexes the dimensions of the input $x$
• Step 2: Transform each activation using a differentiable, nonlinear activation function $h(\cdot)$: $z_j = h(a_j)$
• $h(\cdot)$ is generally chosen to be a sigmoidal function
• The $z_j$ correspond to the outputs of the basis functions of our model. Recall: $y(x, w) = f\left(\sum_{j=1}^{M} w_j\, \phi_j(x)\right)$
• In the context of neural networks, they are called hidden units
• Step 3: The $z_j$ are again linearly combined to give output unit activations:
$$a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)}, \qquad k = 1, \dots, K$$
• $K$ is the number of outputs; the $w_{kj}^{(2)}$ are weights and the $w_{k0}^{(2)}$ biases; the superscript $(2)$ is the layer index (parameters of the second layer of the network)
• Step 4: The $a_k$ are transformed using an activation function to give a set of network outputs $y_k$
• The choice of activation function follows the same considerations as for linear models
• For regression: the identity function
• Common activation functions for classification:
  • Sigmoid: $\sigma(a) = 1/(1 + e^{-a})$
  • Tanh: $\tanh(a) = \dfrac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$
  • Hinge or ReLU: $h(a) = \max(a, 0)$
  • Softmax (multiclass): $h(a_k) = \dfrac{e^{a_k}}{\sum_{j} e^{a_j}}$
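A small NumPy sketch of these four activations (the max-shift inside softmax is a standard numerical-stability detail, not from the slides):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))   # 1 / (1 + e^-a)

def tanh(a):
    return np.tanh(a)                 # (e^a - e^-a) / (e^a + e^-a)

def relu(a):
    return np.maximum(a, 0.0)         # max(a, 0)

def softmax(a):
    e = np.exp(a - np.max(a))         # shift by max(a) for numerical stability
    return e / e.sum()                # e^{a_k} / sum_j e^{a_j}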
Feed forward networks a.k.a. MLP: final expression
• Combining all of the above, and using a sigmoidal output unit activation function:
$$y_k(x, w) = \sigma\left(\sum_{j=1}^{M} w_{kj}^{(2)}\, h\left(\sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}\right) + w_{k0}^{(2)}\right)$$
Feed forward networks / MLP. Interpretation: network diagram
• Forward propagation of information through the network
[Fig 5.1, PRML – C. Bishop: two-layer network]
Simplifying notation: absorbing the biases
• As with linear models, the bias parameters can be absorbed into the set of weight parameters by adding a dummy variable $x_0 = 1$:
$$a_j = \sum_{i=0}^{D} w_{ji}^{(1)} x_i$$
• In the second layer, likewise with $z_0 = 1$:
$$a_k = \sum_{j=0}^{M} w_{kj}^{(2)} z_j$$
Feed forward networks a.k.a. MLP: simplified final expression
• The overall network function now becomes:
$$y_k(x, w) = \sigma\left(\sum_{j=0}^{M} w_{kj}^{(2)}\, h\left(\sum_{i=0}^{D} w_{ji}^{(1)} x_i\right)\right)$$
• Compare with the linear model: $y(x, w) = f\left(\sum_{j=1}^{M} w_j\, \phi_j(x)\right)$
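A minimal NumPy sketch of this forward pass, under the absorbed-bias convention above (variable names are mine):

import numpy as np

def mlp_forward(x, W1, W2, h=np.tanh):
    # W1: (M, D+1) first-layer weights, column 0 = biases
    # W2: (K, M+1) second-layer weights, column 0 = biases
    x = np.concatenate(([1.0], x))             # dummy x0 = 1
    a = W1 @ x                                 # a_j = sum_i w_ji^(1) x_i
    z = np.concatenate(([1.0], h(a)))          # z_j = h(a_j), dummy z0 = 1
    return 1.0 / (1.0 + np.exp(-(W2 @ z)))     # y_k = sigma(sum_j w_kj^(2) z_j)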
Multilayer perceptron: interpretation
• Two stages of processing, each of which resembles the perceptron:
$$y_k(x, w) = \sigma\left(\sum_{j=0}^{M} w_{kj}^{(2)}\, h\left(\sum_{i=0}^{D} w_{ji}^{(1)} x_i\right)\right)$$
Adapted from Fig 5.1, PRML – C. Bishop
Back to features
• A neuron can be seen as a feature map of the form
$$\phi_j(x) = h\left(\sum_{i=0}^{D} w_{ji} x_i\right)$$
• Therefore, each node in the network can be interpreted as a feature variable
• By optimizing the weights $\{w\}$ we are doing feature selection
• Pre-trained networks: the features resulting from the optimization are useful for many problems
• They need a lot of data
Network training: Backpropagation
• We cannot use the training algorithm from the perceptron because we don't know the "correct" outputs of the hidden units
• Strategy: apply the chain rule to differentiate composite functions
• Refresher:
$$F(x) = f(g(x)) \Rightarrow F'(x) = f'(g(x))\, g'(x)$$
$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx} \quad \text{(Leibniz's notation)}$$
Deriving gradient descent for MLP
• See the board notes for a simpler derivation of the backpropagation algorithm
Deriving gradient descent for MLP
• Error function:
$$E(w) = \sum_{n=1}^{N} E_n(w)$$
• We will estimate $\nabla E_n(w)$
• Let us consider a simple linear model with outputs $y_k$:
$$y_k = \sum_i w_{ki}\, x_i$$
• The error function for a particular input sample $n$:
$$E_n = \frac{1}{2} \sum_k (y_{nk} - t_{nk})^2$$
Deriving gradient descent for MLP
• The gradient of this error function w.r.t. a weight $w_{ji}$:
$$\frac{\partial E_n}{\partial w_{ji}} = (y_{nj} - t_{nj})\, x_{ni}$$
• Interpretation: the product of an error signal $(y_{nj} - t_{nj})$ associated with the output end of the link $w_{ji}$, and the variable $x_{ni}$ associated with its input end
• Similar to the expression obtained for logistic regression when using the sigmoid function. Refresher (exercise proposed on slide 27, annotated):
$$\nabla_w E(w) = \frac{\partial E(w)}{\partial w} = \sum_{n=1}^{N} (y_n - t_n)\, x_n$$
Refresher: forward propagation
• Let the activation of each unit in the network be denoted
$$a_j = \sum_i w_{ji}\, z_i$$
with $z_i$ the activation (or input) of a unit connecting to unit $j$, $w_{ji}$ the weight associated with that connection, and
$$z_j = h(a_j)$$
• These are composite functions, so let us use the chain rule to estimate the derivative of the error
[Diagram: $z_i \xrightarrow{w_{ji}} a_j \rightarrow h(a_j)$]
Deriving gradient descent for MLP
• Applying the chain rule:
$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j}\, \frac{\partial a_j}{\partial w_{ji}} = \delta_j\, \frac{\partial a_j}{\partial w_{ji}}$$
• Since $a_j = \sum_i w_{ji}\, z_i$, then $\dfrac{\partial a_j}{\partial w_{ji}} = z_i$
• All together:
$$\frac{\partial E_n}{\partial w_{ji}} = \delta_j\, z_i$$
Same form as $\frac{\partial E_n}{\partial w_{ji}} = (y_{nj} - t_{nj})\, x_{ni}$
How to estimate δ?
• For the output units, we did it already: $\delta_k = y_k - t_k$
• For the hidden units, we resort to the chain rule again:
$$\delta_j = \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k}\, \frac{\partial a_k}{\partial a_j}$$
• Can we obtain an expression for it?
How to estimate δ for hidden units?
$$\delta_j = \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k}\, \frac{\partial a_k}{\partial a_j}$$
Cheat sheet: $a_j = \sum_i w_{ji}\, z_i$, $\quad z_j = h(a_j)$, $\quad \dfrac{dz}{dx} = \dfrac{dz}{dy} \cdot \dfrac{dy}{dx}$
Result:
$$\delta_j = h'(a_j) \sum_k w_{kj}\, \delta_k$$
The solution is on the next slide, but try to do it on your own first.
First, note that
$$\delta_k = \frac{\partial E_n}{\partial a_k}, \quad \text{so} \quad \delta_j = \frac{\partial E_n}{\partial a_j} = \sum_k \delta_k\, \frac{\partial a_k}{\partial a_j}$$
Now let us find an expression for $a_k$:
$$a_k = \sum_j w_{kj}\, z_j$$
and, directly from the cheat sheet, $z_j = h(a_j)$. The derivative amounts to applying the chain rule:
$$\frac{\partial a_k}{\partial a_j} = \frac{\partial a_k}{\partial z_j}\, \frac{\partial z_j}{\partial a_j} = w_{kj}\, h'(a_j)$$
Plugging into the original expression:
$$\delta_j = h'(a_j) \sum_k w_{kj}\, \delta_k$$
Backpropagation formula
$$\delta_j = h'(a_j) \sum_k w_{kj}\, \delta_k$$
Figure 5.7 – Bishop, PRML. Forward propagation computes $z_k = h\left(\sum_j w_{kj}\, z_j\right)$; the $\delta$'s are then propagated backwards.
Backpropagation algorithm
1. For an input vector $x_n$ to the network, do a forward pass using
$$a_j = \sum_i w_{ji}\, z_i, \qquad z_j = h(a_j)$$
to find the activations of all hidden and output units.
2. Evaluate $\delta_k = y_k - t_k$ for the output units.
3. Backward-pass the $\delta$'s to obtain the $\delta_j$ for the hidden units using $\delta_j = h'(a_j) \sum_k w_{kj}\, \delta_k$.
4. Obtain the required derivatives using
$$\frac{\partial E_n}{\partial w_{ji}} = \delta_j\, z_i$$
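A minimal NumPy sketch of these four steps for the two-layer network above, assuming tanh hidden units and linear output units with the sum-of-squares error (variable names are mine):

import numpy as np

def backprop(x, t, W1, W2):
    # W1: (M, D+1), W2: (K, M+1); column 0 of each holds the biases.
    # 1. Forward pass: a_j = sum_i w_ji x_i, z_j = h(a_j)
    x = np.concatenate(([1.0], x))
    a = W1 @ x
    z = np.concatenate(([1.0], np.tanh(a)))
    y = W2 @ z                                   # linear output units
    # 2. Output deltas: delta_k = y_k - t_k
    delta_k = y - t
    # 3. Hidden deltas: delta_j = h'(a_j) sum_k w_kj delta_k, with h'(a) = 1 - tanh(a)^2
    #    (skip the bias column of W2: no delta flows back through z0 = 1)
    delta_j = (1.0 - np.tanh(a) ** 2) * (W2[:, 1:].T @ delta_k)
    # 4. Required derivatives: dEn/dw_ji = delta_j x_i and dEn/dw_kj = delta_k z_j
    return np.outer(delta_j, x), np.outer(delta_k, z)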
Backpropagation algorithm: DIY
• Read Section 5.3.2 from Bishop for a concrete example
• We have derived a general form that covers any error function, activation function and network topology
• Obtain expression for the backpropagation algorithm when using cross-entropy error function (exercise 11.3 from ESL)
Properties: Universality
• MLPs are Universal Boolean functions
• They can compute any Boolean function
• MLPs are Universal Classification functions
• MLPs are Universal approximators
• Can actually compose arbitrary functions in any number of dimensions
MLPs are Universal Boolean Functions
• The perceptron could not solve the XOR.
• If the MLP is a universal Boolean function, it should be able to implement an XOR.
• How?
X1 X2 Y
0 0 0
0 1 1
1 0 1
1 1 0
A truth table shows all input combinations for which the output is 1.
We express the function in disjunctive normal form:
$$Y = \bar{X}_1 X_2 + X_1 \bar{X}_2$$
XOR function: $Y = \bar{X}_1 X_2 + X_1 \bar{X}_2$
[Network diagram: inputs $X_1$, $X_2$ feed two AND units (one per term of the disjunctive normal form), whose outputs feed an OR unit]
Any truth table can be expressed in this manner.
* Bias being omitted
Exercise: Find weights for the XOR
[Network diagram: inputs $X_1$, $X_2$ and bias inputs $+1$ feed hidden units $Y_1$ and $Y_2$ through weights $w_{10}, w_{11}, w_{12}$ and $w_{20}, w_{21}, w_{22}$; $Y_1$, $Y_2$ and a bias $+1$ feed the output $Y$ through weights $w_0, w_1, w_2$]
Step 1: Write down the truth tables (outputs to be filled in)

X1 X2 X̄1 Y1
0  0
0  1
1  0
1  1

X1 X2 X̄2 Y2
0  0
0  1
1  0
1  1

Y1 Y2 Y
0  0
0  1
1  0
1  1
Step 2: Write general expressions (output unit Y)

Y1 Y2 Y
0  0  0 (-1)
0  1  1
1  0  1
1  1  1

With $Y = h(w_0 + w_1 Y_1 + w_2 Y_2)$, the rows require:
$w_0 \le 0$
$w_0 + w_2 > 0$
$w_0 + w_1 > 0$
$w_0 + w_1 + w_2 > 0$
A solution: $w_0 = -3$, $w_1 = 4$, $w_2 = 4$
** Layer index is being omitted
Step 2: Write general expressions (hidden unit Y2)

X1 X2 X̄2 Y2
0  0  1  0*
0  1  0  0*
1  0  1  1
1  1  0  0*

With $Y_2 = h(w_{20} + w_{21} X_1 + w_{22} X_2)$, the rows require:
$w_{20} \le 0$
$w_{20} + w_{22} \le 0$
$w_{20} + w_{21} > 0$
$w_{20} + w_{21} + w_{22} \le 0$
A solution: $w_{20} = -3$, $w_{21} = 4$, $w_{22} = -5$
** Layer index is being omitted
Step 2: Write general expressions (hidden unit Y1)

X1 X2 X̄1 Y1
0  0  1  0
0  1  1  1
1  0  0  0
1  1  0  0

With $Y_1 = h(w_{10} + w_{11} X_1 + w_{12} X_2)$, the rows require:
$w_{10} \le 0$
$w_{10} + w_{12} > 0$
$w_{10} + w_{11} \le 0$
$w_{10} + w_{11} + w_{12} \le 0$
A solution: $w_{10} = -3$, $w_{11} = -5$, $w_{12} = 4$
** Layer index is being omitted
Result: Weights for the XOR
[Network diagram with the weights found above]
$Y_1 = h(-3 - 5 X_1 + 4 X_2)$
$Y_2 = h(-3 + 4 X_1 - 5 X_2)$
$Y = h(-3 + 4 Y_1 + 4 Y_2)$
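A quick numerical check of these weights (a sketch, following the exercise's convention that a unit outputs 1 when its activation is strictly positive):

def h(a):
    return 1 if a > 0 else 0  # step activation used in the inequalities above

for x1 in (0, 1):
    for x2 in (0, 1):
        y1 = h(-3 - 5 * x1 + 4 * x2)
        y2 = h(-3 + 4 * x1 - 5 * x2)
        y = h(-3 + 4 * y1 + 4 * y2)
        print(x1, x2, "->", y)  # prints the XOR truth table: 0, 1, 1, 0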
What to do for more complex functions?
• Karnaugh maps
[Karnaugh map example with its reduced disjunctive-normal-form expression]
Drawback: an MLP can represent a given function this way only if it is sufficiently wide
MLPs as universal function approximators
• A feed-forward network with at least one hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of $\mathbb{R}^d$, under mild assumptions on the activation function (proof by G. Cybenko, 1989)
• However, the key problem is how to find suitable parameter values given a set of training data
MLP intuitive potential
• No hidden layer: half-space
• One hidden layer: convex sets (intersections of half-spaces)
• Two hidden layers: concave and non-connected sets (unions of intersections of half-spaces)
Summary on MLP
• Advantages
• Very general: can be applied in many situations
• Powerful in theory
• Efficient in practice
• Drawbacks
• Training is often slow
• Choosing the optimal number of layers and neurons is difficult
• Little insight into the resulting model
Deep Learning
LeNet-5: Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition (1998)
Recap
• We introduced feedforward networks, aka the multilayer perceptron
• We introduced the backpropagation algorithm which is the mechanism to train feedforward networks
• We saw the strengths but also the limitations of MLPs
• Deep Learning course (spring term) if you want to learn about more powerful neural network architectures
What I have not covered yet
• Some other limitations: problems associated with training
Further reading and useful material
Source Chapters
The Elements of Statistical Learning Sec 4.5, Ch 11
Pattern Recognition and Machine Learning Sec. 4.1.7, Ch 5
Rosenblatt’s original article - The Perceptron --
Warning: Notation might vary among the different sources
From the first lecture
Deep learning
Project definition: What I expect from you
• Able to identify a problem that can be solved using ML tools
• Frame it correctly: supervised, unsupervised, regression, classification, density estimation…
• Able to establish reasonable objectives
  • Not too easy
  • Not so difficult that it cannot be completed in the given time frame
• Able to follow instructions
  • Submit via Moodle
  • Work in pairs, or talk to me to agree on exceptions
• Able to produce a readable document