
Supervised Learning

Artificial Neural Networks and Single Layer Perceptron

Supervised Learning
• Learning from correct answers

[Diagram: a supervised learning system maps inputs to outputs; the training info is the desired (target) outputs]

Supervised Learning Methods
• Artificial neural networks
• Decision trees
• Gaussian process regression
• Naive Bayes classifier
• Nearest neighbor algorithm
• Support vector machines
• Random forests
• Ensembles of classifiers
• …

Applications of Supervised Learning
• Handwriting recognition
• Object recognition
• Optical character recognition
• Spam detection
• Pattern recognition
• Speech recognition
• Prediction models
• …

Inspiration: The Human Brain
• The brain doesn't seem to have a CPU
• Instead, it has lots of simple, parallel, asynchronous units called neurons
• There are about 10^11 (100 billion) neurons of about 20 types
• For comparison, the NVIDIA Tesla K80 GPU has 4,992 cores

Neurons
• A neuron is a single cell that has a number of relatively short fibers, called dendrites, and one long fiber, called an axon

Synapses
• A synapse is a structure that permits a neuron (or nerve cell) to pass an electrical signal to another cell

Signal Generation: Action Potential
• Neurotransmitters move across the synapse and change the electrical potential of the cell body

From Real to Artificial Neurons

[Diagram: an artificial neuron with inputs u_1, u_2, …, u_n, weights w_1, w_2, …, w_n, activation a = Σ_{i=1}^{n} w_i u_i, threshold θ, an activation function, and output v]

Artificial Neurons
• Threshold Logic Unit (TLU), proposed by Warren McCulloch and Walter Pitts in 1943
  – Initially with binary inputs and outputs
  – Heaviside function as threshold
• Perceptron, developed by Frank Rosenblatt in 1957
  – Arbitrary inputs and outputs
  – Linear transfer function

Activation Functions

[Plots of four activation functions, output v against activation a: threshold, linear, piece-wise linear, and sigmoid]

Threshold Logic Unit (TLU)

[Diagram: inputs u_1, …, u_n with weights w_1, …, w_n, activation a = Σ_{i=1}^{n} w_i u_i, threshold θ, output v]

v = 1 if a ≥ θ, and v = 0 if a < θ

Classification
• Identifying to which of a set of categories a new observation belongs, on the basis of a set of data containing observations whose category is known (Wikipedia)

[Scatter plot: Variable 1 vs. Variable 2 with three labeled clusters (Class 1, Class 2, Class 3) and a new observation marked "? (1, 2 or 3)"]

Decision Surface of a TLU

[Plot in the (u_1, u_2) plane: the decision line w_1 u_1 + w_2 u_2 = θ separates the points classified as 1 (activation > θ) from the points classified as 0 (activation < θ)]

Scalar Products and Projections

[Diagrams: the sign of w•u depends on the angle ϕ between w and u: w•u > 0, w•u = 0, or w•u < 0]

w•u = |w||u| cos ϕ
|u_w| = |u| cos ϕ
w•u = |w||u_w|

where u_w is the projection of u onto w.

Geometric Interpretation

[Plot in the (u_1, u_2) plane: the decision line w_1 u_1 + w_2 u_2 = θ, i.e. w•u = θ, is perpendicular to the weight vector w; v = 1 on one side and v = 0 on the other]

On the decision line: w•u = |w||u| cos ϕ = |w||u_w| = θ, so |u_w| = θ/|w|
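A small numeric sketch of the projection view (plain NumPy, with values chosen purely for illustration): the TLU outputs 1 exactly when the projection of u onto w is at least θ/|w|.

```python
import numpy as np

w = np.array([2.0, 1.0])      # weight vector (illustrative values)
u = np.array([1.5, 0.5])      # input vector (illustrative values)
theta = 2.0                   # threshold

a = np.dot(w, u)                         # activation w.u
proj = a / np.linalg.norm(w)             # |u_w| = |u| cos(phi) = w.u / |w|
boundary = theta / np.linalg.norm(w)     # distance of the decision line from the origin

print(a, proj, boundary)
print("TLU output:", int(a >= theta))    # equivalently: proj >= boundary
```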


Decision Surface: Summary
• In n dimensions the relation w•u = θ defines an (n−1)-dimensional hyperplane, which is perpendicular to the weight vector w
• On one side of the hyperplane (w•u > θ) all patterns are classified by the TLU as "1", while those that get classified as "0" lie on the other side

Example: Logical AND
• Find w_1, w_2 and θ such that the TLU outputs v = 1 only for u = (1, 1):

u1  u2  a  v
0   0   0  0
0   1   1  0
1   0   1  0
1   1   2  1

• A solution: w_1 = 1, w_2 = 1, θ = 1.5

[Plot: the decision line separates the point (1,1), labeled 1, from the points (0,0), (0,1), (1,0), labeled 0]
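As a quick check, here is a minimal TLU in Python (a sketch, not part of the slides) that reproduces the AND table with w_1 = w_2 = 1 and θ = 1.5:

```python
def tlu(u, w, theta):
    """Threshold logic unit: output 1 if the weighted sum reaches the threshold."""
    a = sum(wi * ui for wi, ui in zip(w, u))
    return 1 if a >= theta else 0

w, theta = (1, 1), 1.5
for u in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(u, "->", tlu(u, w, theta))   # prints 0, 0, 0, 1: logical AND
```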

Example: Logical OR

u1  u2  a  v
0   0   0  0
0   1   1  1
1   0   1  1
1   1   2  1

• A solution: w_1 = 1, w_2 = 1, θ = 0.5

Example: Logical NOT

u1  a   v
0   0   1
1   -1  0

• A solution: w_1 = -1, θ = -0.5

Threshold as Weight

[Diagram: the threshold θ is treated as an extra weight w_{n+1} = θ on a constant extra input u_{n+1} = -1]

a = Σ_{i=1}^{n+1} w_i u_i

v = 1 if a ≥ 0, and v = 0 if a < 0
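A short sketch of this reformulation (values reused from the AND example above): append a constant input u_{n+1} = -1, store θ as the extra weight w_{n+1}, and compare the augmented sum against 0.

```python
import numpy as np

u = np.array([1.0, 1.0])          # original inputs
w = np.array([1.0, 1.0])          # original weights
theta = 1.5                       # threshold

# Fold the threshold into the weights: extra input -1 with weight theta.
u_aug = np.append(u, -1.0)        # (u1, ..., un, -1)
w_aug = np.append(w, theta)       # (w1, ..., wn, theta)

a = np.dot(w_aug, u_aug)          # = w.u - theta
v = 1 if a >= 0 else 0            # same output as comparing w.u against theta
print(a, v)
```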

Geometric Interpretation

[Plot in the (u_1, u_2) plane: with the threshold folded into the weights, the decision line is w•u = 0 and passes through the origin; v = 1 on one side, v = 0 on the other]

Training ANNs
• Training set S of examples {u, vt}
  – u is an input vector and
  – vt the desired target output
  – Example (logical AND): S = {((0,0), 0), ((0,1), 0), ((1,0), 0), ((1,1), 1)}
• Iterative process
  – present a training example u
  – compute network output v
  – compare output v with target vt
  – adjust weights w and threshold θ
• Learning rule
  – Specifies how to change the weights w and threshold θ of the network as a function of the inputs u, output v and target vt

Presentation of Training Samples
• Presenting all training examples once to the ANN is called an epoch
• Training examples can be presented in
  – Fixed order (1, 2, 3, …, M) [on/off-line]
  – Randomly permuted order (5, 2, 7, …, 3) [off-line]
  – Completely random order (4, 1, 7, 1, 5, 4, …) [off-line]

Perceptron Learning Rule
• w′ = w + µ (vt − v) u
  – w′_i = w_i + ∆w_i = w_i + µ (vt − v) u_i (i = 1..n+1), with w_{n+1} = θ and u_{n+1} = -1
  – The parameter µ is called the learning rate. It determines the magnitude of the weight updates ∆w_i
  – If the output is correct (vt = v), the weights are not changed (∆w_i = 0)
  – If the output is incorrect (vt ≠ v), the weights are changed so that the new weight vector w′ moves toward the input u when the output was too small and away from u when it was too large

Adjusting the Weight Vector

[Diagrams of the two error cases:]
• Target vt = 1, output v = 0 (angle ϕ between w and u is greater than 90°): w′ = w + µu, i.e. move w in the direction of u
• Target vt = 0, output v = 1 (angle ϕ between w and u is less than 90°): w′ = w − µu, i.e. move w away from the direction of u

Perceptron Learning Rule (cont.)
• If we do not include the threshold as an input, we use the following learning rule:

w′ = w + µ (vt − v) u
and
θ′ = θ − µ (vt − v)

Procedure of Perceptron Training
• While v ≠ vt for some training vector pair
  – For each training vector pair (u, vt)
    • calculate the output v (v = 1 if Σ_i w_i u_i ≥ θ, else 0)
    • if v ≠ vt
      update weight vector: w′ = w + µ (vt − v) u
      update threshold: θ′ = θ − µ (vt − v) (if not included in the weights)
    • else do nothing
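A runnable sketch of this procedure (with the threshold folded in as an extra weight on a constant −1 input; learning rate, initial weights and the AND data set are illustrative choices):

```python
import numpy as np

def train_perceptron(X, t, mu=0.1, max_epochs=100):
    """Perceptron training; threshold stored as the last weight on a constant -1 input."""
    X = np.hstack([X, -np.ones((len(X), 1))])   # append u_{n+1} = -1 to every example
    w = np.zeros(X.shape[1])                    # start with all weights (and threshold) at 0
    for _ in range(max_epochs):
        errors = 0
        for u, target in zip(X, t):
            v = 1 if np.dot(w, u) >= 0 else 0   # TLU output
            if v != target:
                w += mu * (target - v) * u      # w' = w + mu (vt - v) u
                errors += 1
        if errors == 0:                         # stop once every example is classified correctly
            break
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 0, 0, 1])                      # logical AND targets
w = train_perceptron(X, t)
print(w)                                        # last component is the learned threshold
```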

Demo: Perceptron with TLU

[Animation: two scatter plots (axes from -5 to 5) showing the decision line of the TLU during perceptron training on a 2-D data set]

TLU Convergence
• The algorithm converges to a correct classification
  – if the training data is linearly separable
  – given a sufficiently small learning rate µ
• The solution w is not unique: if w•u = 0 defines a separating hyperplane, so does w′ = k w for any k > 0

Multiple TLUs

[Diagram: a single layer of TLUs; inputs u_1, u_2, u_3, …, u_n are fully connected to outputs v_1, v_2, …, v_n; weight w_ji connects u_i with v_j]

• Learning rule (applied to each output unit independently):
w′_ji = w_ji + µ (vt_j − v_j) u_i

Example: Character Recognition
• 26 classes: A, B, C, …, Z
• Target output is a vector
  – e.g., vt_A = [1 0 0 … 0], vt_B = [0 1 0 … 0], …

[Diagram: inputs u_1, u_2, u_3, …, u_n connected to 26 output units v_1, v_2, …, v_26, one per letter A, B, …, Z]
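A sketch of one update step in this multi-output setting (weights held as a matrix W with one row per class, thresholds taken as zero for brevity, and random numbers standing in for real character features): each output unit j is trained with w′_ji = w_ji + µ (vt_j − v_j) u_i, and the target for class k is the one-hot vector with a 1 in position k.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_classes = 20, 26          # 26 classes A..Z (toy input size)
u = rng.random(n_inputs)              # one made-up input pattern
target_class = 2                      # pretend this pattern is a 'C'

vt = np.zeros(n_classes)
vt[target_class] = 1.0                # one-hot target, e.g. vt_C = [0 0 1 0 ... 0]

W = np.zeros((n_classes, n_inputs))   # W[j, i] connects input u_i to output v_j
mu = 0.1

v = (W @ u >= 0).astype(float)        # all 26 TLU outputs at once (zero thresholds)
W += mu * np.outer(vt - v, u)         # per-unit rule: w_ji += mu (vt_j - v_j) u_i
```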

Activation Functions

[Plots of four activation functions, output v against activation a: threshold, linear, piece-wise linear, and sigmoid]

Perceptron with Linear Function

[Diagram: inputs u_1, u_2, …, u_n with weights w_1, w_2, …, w_n; the activation is passed through a linear (identity) function]

v = a = Σ_{i=1}^{n} w_i u_i

Note: We will use the notation t instead of vt

Gradient Descent Learning Rule
• Consider a linear unit without threshold and with continuous output v (not just 1/0 or 1/−1):
v = w_1 u_1 + … + w_n u_n
• Train the w_i such that they minimize the squared error
E[w_1, …, w_n] = 1/2 Σ_{d∈D} (t_d − v_d)^2
  – where D is the set of training examples and
  – t the target outputs

Gradient Descent Learning Rule (cont.)

Gradient: ∇E[w] = [∂E/∂w_1, …, ∂E/∂w_n]
Update: ∆w = −µ ∇E[w]
Per-example error: E_d[w] = 1/2 (t_d − v_d)^2

[Plot: the error surface E[w_1, w_2] with a gradient step from (w_1, w_2) to (w_1 + ∆w_1, w_2 + ∆w_2)]

−1/µ ∆w_i = ∂E_d/∂w_i
          = ∂/∂w_i 1/2 (t_d − v_d)^2
          = ∂/∂w_i 1/2 (t_d − Σ_i w_i u_i)^2
          = (t_d − v_d) (−u_i)

We get: ∆w_i = µ (t_d − v_d) u_i, also known as the Delta rule

Incremental vs. Batch Gradient Descent
• Incremental (stochastic) mode: w′ = w − µ ∇E_d[w] over individual training examples d
E_d[w] = 1/2 (t_d − v_d)^2
• Batch mode: w′ = w − µ ∇E_D[w] over the entire data D
E_D[w] = 1/2 Σ_d (t_d − v_d)^2
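A sketch contrasting the two modes of the delta rule for a linear unit, on a tiny noise-free data set made up for illustration: in batch mode the gradient is summed over all of D before each update, in incremental mode every single example updates w immediately.

```python
import numpy as np

# Toy data generated as t = 2*u1 - u2 (no noise), purely for illustration.
U = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
T = U @ np.array([2.0, -1.0])
mu = 0.05

# Batch mode: one update per epoch using the gradient summed over D.
w = np.zeros(2)
for _ in range(200):
    V = U @ w                          # linear outputs v_d for all examples
    w += mu * (T - V) @ U              # dw_i = mu * sum_d (t_d - v_d) u_{d,i}

# Incremental (stochastic) mode: update after every single example.
w_inc = np.zeros(2)
for _ in range(200):
    for u, t in zip(U, T):
        v = w_inc @ u
        w_inc += mu * (t - v) * u      # dw_i = mu (t_d - v_d) u_i

print(w, w_inc)                        # both approach the true weights [2, -1]
```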

Perceptron vs. Gradient Descent Rule
• Perceptron rule
w′_i = w_i + µ (vt − v) u_i
  – derived from manipulation of the decision surface
• Gradient descent rule
w′_i = w_i + µ (t_d − v_d) u_i
  – derived from minimization of the error function E[w_1, …, w_n] = 1/2 Σ_d (t_d − v_d)^2 by means of gradient descent

Demo: Gradient Descent Learning Rule

[Animation: two scatter plots (axes from -5 to 5) showing the decision line learned by gradient descent on a 2-D data set]

Linear Regression
• Common statistical method
• Describes the relationship between variables
• Predicts one variable knowing the others
• We assume there is a linear relationship
  – between an outcome (dependent variable, response variable) and a predictor (independent variable, explanatory variable, feature), or
  – between one variable and several other variables

Describing Relationships

[Scatter plots of the Iris data: sepal width vs. sepal length shows a very weak relationship / very low correlation (correlation coefficient = -0.1094); petal width vs. sepal length shows a strong relationship / high correlation (correlation coefficient = 0.8180)]

Making Predictions

[Two scatter plots with unknown values marked "?": profit (Euro) over days and son's height vs. father's height, illustrating predicting variables and predicting future outcomes]

Demo: Linear Regression

[Plots: a line f(x) fitted to data points for x between 0 and 1]
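A minimal sketch of fitting such a line by ordinary least squares with NumPy (the synthetic data below merely stands in for the values shown in the demo plots):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 30)                      # explanatory variable
y = -2.0 + 0.5 * x + 0.05 * rng.normal(size=30)    # response with a little noise

# Least-squares fit of f(x) = b0 + b1*x via the design matrix [1, x].
A = np.column_stack([np.ones_like(x), x])
(b0, b1), *_ = np.linalg.lstsq(A, y, rcond=None)

print(b0, b1)             # close to the true intercept -2.0 and slope 0.5
print(b0 + b1 * 0.7)      # predict f(x) for a new x
```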

Perceptron vs. Gradient Descent
• Perceptron (TLU)
  – Guaranteed to succeed if
    • training examples are linearly separable
    • given a sufficiently small learning rate µ
  – Robust against outliers
• Gradient descent (linear transfer function)
  – Guaranteed to converge to the hypothesis with minimum squared error
    • even when the training data are not separable by a hyperplane
    • given a sufficiently small learning rate µ
  – Very sensitive to outliers

Activation Functions

[Plots of four activation functions, output v against activation a: threshold, linear, piece-wise linear, and sigmoid]

Perceptron with Sigmoid Function

[Diagram: inputs u_1, u_2, …, u_n with weights w_1, w_2, …, w_n; the activation is passed through a sigmoid function]

a = Σ_{i=1}^{n} w_i u_i
v = σ(a) = 1/(1 + e^{-a})

Gradient Descent Rule for Sigmoid Activation Function

Gradient: ∇E[w] = [∂E/∂w_1, …, ∂E/∂w_n]
Update: ∆w = −µ ∇E[w]
Per-example error: E_d[w] = 1/2 (t_d − v_d)^2

[Plots: the sigmoid σ(a) and its derivative σ′(a)]

−1/µ ∆w_i = ∂E_d/∂w_i
          = ∂/∂w_i 1/2 (t_d − v_d)^2
          = ∂/∂w_i 1/2 (t_d − σ(Σ_i w_i u_i))^2
          = (t_d − v_d) σ′(Σ_i w_i u_i) (−u_i)

with v_d = σ(a) = 1/(1 + e^{-a}) and σ′(a) = e^{-a}/(1 + e^{-a})^2 = σ(a)(1 − σ(a))

We get: ∆w_i = µ v_d (1 − v_d)(t_d − v_d) u_i
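A sketch of a single gradient step for the sigmoid unit, using the derivative σ′(a) = σ(a)(1 − σ(a)) from above (input, weights, target and learning rate are made-up values):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

u = np.array([0.5, -1.0, 2.0])     # one input pattern (arbitrary)
w = np.array([0.1, 0.4, -0.2])     # current weights (arbitrary)
t = 1.0                            # target output
mu = 0.5                           # learning rate

a = np.dot(w, u)                   # activation
v = sigmoid(a)                     # output v = sigma(a)

# Delta rule for the sigmoid unit: dw_i = mu * v(1-v)(t - v) u_i
w_new = w + mu * v * (1.0 - v) * (t - v) * u
print(v, w_new)
```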

Demo: Gradient Descent with Sigmoid Activation Function

[Animation: two scatter plots (axes from -5 to 5) showing the decision boundary learned with the sigmoid activation function on a 2-D data set]

Summary
• Threshold Logic Unit
  – works only for linearly separable patterns
  – robust to outliers
• Linear Neuron
  – converges to minimal squared error even when patterns are not linearly separable
  – very sensitive to outliers
• Sigmoid Neuron
  – combination of TLU and linear neuron
  – converges to minimal squared error even when patterns are not linearly separable
  – robust to outliers
