
Supervised Learning

Artificial Neural Networks and Single Layer Perceptron

Supervised Learning
• Learning from correct answers

[Diagram: a supervised learning system maps inputs to outputs; the training info is the desired (target) outputs]

Supervised Learning Methods
• Artificial neural networks
• Decision trees
• Gaussian process regression
• Naive Bayes classifier
• Nearest neighbor algorithm
• Support vector machines
• Random forests
• Ensembles of classifiers
• …

Applications of Supervised Learning
• Handwriting recognition
• Object recognition
• Optical character recognition
• Spam detection
• Pattern recognition
• Speech recognition
• Prediction models
• …

Inspiration: The Human Brain
• The brain doesn't seem to have a CPU
• Instead, it has lots of simple, parallel, asynchronous units called neurons
• There are about 10^11 (100 billion) neurons of about 20 types
• For comparison, the NVIDIA Tesla K80 GPU has 4,992 cores

Neurons
• A neuron is a single cell that has a number of relatively short fibers, called dendrites, and one long fiber, called an axon

Synapses
• A synapse is a structure that permits a neuron (or nerve cell) to pass an electrical signal to another cell

Signal Generation: Action Potential
• Neurotransmitters move across the synapse and change the electrical potential of the cell body

From Real to Artificial Neurons

[Diagram: an artificial neuron with inputs u_1, u_2, …, u_n, weights w_1, w_2, …, w_n, activation a = Σ_{i=1}^{n} w_i u_i, threshold θ, an activation function, and output v]

Artificial Neurons
• Threshold Logic Unit (TLU), proposed by Warren McCulloch and Walter Pitts in 1943
  – Initially with binary inputs and outputs
  – Heaviside function as threshold
• Perceptron, developed by Frank Rosenblatt in 1957
  – Arbitrary inputs and outputs
  – Linear transfer function

Activation Functions

[Plots of four activation functions, output v against activation a: threshold, linear, piece-wise linear, and sigmoid]

Threshold Logic Unit (TLU)

[Diagram: inputs u_1, …, u_n with weights w_1, …, w_n, activation a = Σ_{i=1}^{n} w_i u_i, threshold θ, output v]

v = 1 if a ≥ θ, and v = 0 if a < θ

Classification
• Identifying to which of a set of categories a new observation belongs, on the basis of a set of data containing observations whose category is known (Wikipedia)

[Scatter plot: Variable 1 vs. Variable 2 with three labeled clusters (Class 1, Class 2, Class 3) and a new observation marked "? (1, 2 or 3)"]

Decision Surface of a TLU

[Plot in the (u_1, u_2) plane: the decision line w_1 u_1 + w_2 u_2 = θ separates the points classified as 1 (activation > θ) from the points classified as 0 (activation < θ)]

Scalar Products and Projections

[Diagrams: the sign of w•u depends on the angle ϕ between w and u: w•u > 0, w•u = 0, or w•u < 0]

w•u = |w||u| cos ϕ
|u_w| = |u| cos ϕ
w•u = |w||u_w|

where u_w is the projection of u onto w.

Geometric Interpretation

[Plot in the (u_1, u_2) plane: the decision line w_1 u_1 + w_2 u_2 = θ, i.e. w•u = θ, is perpendicular to the weight vector w; v = 1 on one side and v = 0 on the other]

On the decision line: w•u = |w||u| cos ϕ = |w||u_w| = θ, so |u_w| = θ/|w|
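A small numeric sketch of the projection view (plain NumPy, with values chosen purely for illustration): the TLU outputs 1 exactly when the projection of u onto w is at least θ/|w|.

```python
import numpy as np

w = np.array([2.0, 1.0])      # weight vector (illustrative values)
u = np.array([1.5, 0.5])      # input vector (illustrative values)
theta = 2.0                   # threshold

a = np.dot(w, u)                         # activation w.u
proj = a / np.linalg.norm(w)             # |u_w| = |u| cos(phi) = w.u / |w|
boundary = theta / np.linalg.norm(w)     # distance of the decision line from the origin

print(a, proj, boundary)
print("TLU output:", int(a >= theta))    # equivalently: proj >= boundary
```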


Decision Surface: Summary
• In n dimensions the relation w•u = θ defines an (n−1)-dimensional hyperplane, which is perpendicular to the weight vector w
• On one side of the hyperplane (w•u > θ) all patterns are classified by the TLU as "1", while those that get classified as "0" lie on the other side

Example: Logical AND
• Find w_1, w_2 and θ such that the TLU outputs v = 1 only for u = (1, 1):

u1  u2  a  v
0   0   0  0
0   1   1  0
1   0   1  0
1   1   2  1

• A solution: w_1 = 1, w_2 = 1, θ = 1.5

[Plot: the decision line separates the point (1,1), labeled 1, from the points (0,0), (0,1), (1,0), labeled 0]
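As a quick check, here is a minimal TLU in Python (a sketch, not part of the slides) that reproduces the AND table with w_1 = w_2 = 1 and θ = 1.5:

```python
def tlu(u, w, theta):
    """Threshold logic unit: output 1 if the weighted sum reaches the threshold."""
    a = sum(wi * ui for wi, ui in zip(w, u))
    return 1 if a >= theta else 0

w, theta = (1, 1), 1.5
for u in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(u, "->", tlu(u, w, theta))   # prints 0, 0, 0, 1: logical AND
```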

Example: Logical OR

u1  u2  a  v
0   0   0  0
0   1   1  1
1   0   1  1
1   1   2  1

• A solution: w_1 = 1, w_2 = 1, θ = 0.5

Example: Logical NOT

u1  a   v
0   0   1
1   -1  0

• A solution: w_1 = -1, θ = -0.5

Threshold as Weight

[Diagram: the threshold θ is treated as an extra weight w_{n+1} = θ on a constant extra input u_{n+1} = -1]

a = Σ_{i=1}^{n+1} w_i u_i

v = 1 if a ≥ 0, and v = 0 if a < 0
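A short sketch of this reformulation (values reused from the AND example above): append a constant input u_{n+1} = -1, store θ as the extra weight w_{n+1}, and compare the augmented sum against 0.

```python
import numpy as np

u = np.array([1.0, 1.0])          # original inputs
w = np.array([1.0, 1.0])          # original weights
theta = 1.5                       # threshold

# Fold the threshold into the weights: extra input -1 with weight theta.
u_aug = np.append(u, -1.0)        # (u1, ..., un, -1)
w_aug = np.append(w, theta)       # (w1, ..., wn, theta)

a = np.dot(w_aug, u_aug)          # = w.u - theta
v = 1 if a >= 0 else 0            # same output as comparing w.u against theta
print(a, v)
```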

Geometric Interpretation

[Plot in the (u_1, u_2) plane: with the threshold folded into the weights, the decision line is w•u = 0 and passes through the origin; v = 1 on one side, v = 0 on the other]

Training ANNs
• Training set S of examples {u, vt}
  – u is an input vector and
  – vt the desired target output
  – Example (logical AND): S = {((0,0), 0), ((0,1), 0), ((1,0), 0), ((1,1), 1)}
• Iterative process
  – present a training example u
  – compute network output v
  – compare output v with target vt
  – adjust weights w and threshold θ
• Learning rule
  – Specifies how to change the weights w and threshold θ of the network as a function of the inputs u, output v and target vt

Presentation of Training Samples
• Presenting all training examples once to the ANN is called an epoch
• Training examples can be presented in
  – Fixed order (1, 2, 3, …, M) [on/off-line]
  – Randomly permuted order (5, 2, 7, …, 3) [off-line]
  – Completely random order (4, 1, 7, 1, 5, 4, …) [off-line]

Perceptron Learning Rule
• w′ = w + µ (vt − v) u
  – w′_i = w_i + ∆w_i = w_i + µ (vt − v) u_i (i = 1..n+1), with w_{n+1} = θ and u_{n+1} = -1
  – The parameter µ is called the learning rate. It determines the magnitude of the weight updates ∆w_i
  – If the output is correct (vt = v), the weights are not changed (∆w_i = 0)
  – If the output is incorrect (vt ≠ v), the weights are changed so that the new weight vector w′ moves toward the input u when the output was too small and away from u when it was too large

Adjusting the Weight Vector

[Diagrams of the two error cases:]
• Target vt = 1, output v = 0 (angle ϕ between w and u is greater than 90°): w′ = w + µu, i.e. move w in the direction of u
• Target vt = 0, output v = 1 (angle ϕ between w and u is less than 90°): w′ = w − µu, i.e. move w away from the direction of u

Perceptron Learning Rule (cont.)
• If we do not include the threshold as an input, we use the following learning rule:

w′ = w + µ (vt − v) u
and
θ′ = θ − µ (vt − v)

Procedure of Perceptron Training
• While v ≠ vt for some training vector pair
  – For each training vector pair (u, vt)
    • calculate the output v (v = 1 if Σ_i w_i u_i ≥ θ, else 0)
    • if v ≠ vt
      update weight vector: w′ = w + µ (vt − v) u
      update threshold: θ′ = θ − µ (vt − v) (if not included in the weights)
    • else do nothing
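A runnable sketch of this procedure (with the threshold folded in as an extra weight on a constant −1 input; learning rate, initial weights and the AND data set are illustrative choices):

```python
import numpy as np

def train_perceptron(X, t, mu=0.1, max_epochs=100):
    """Perceptron training; threshold stored as the last weight on a constant -1 input."""
    X = np.hstack([X, -np.ones((len(X), 1))])   # append u_{n+1} = -1 to every example
    w = np.zeros(X.shape[1])                    # start with all weights (and threshold) at 0
    for _ in range(max_epochs):
        errors = 0
        for u, target in zip(X, t):
            v = 1 if np.dot(w, u) >= 0 else 0   # TLU output
            if v != target:
                w += mu * (target - v) * u      # w' = w + mu (vt - v) u
                errors += 1
        if errors == 0:                         # stop once every example is classified correctly
            break
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 0, 0, 1])                      # logical AND targets
w = train_perceptron(X, t)
print(w)                                        # last component is the learned threshold
```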

Demo: Perceptron with TLU

[Animation: two scatter plots (axes from -5 to 5) showing the decision line of the TLU during perceptron training on a 2-D data set]

TLU Convergence
• The algorithm converges to a correct classification
  – if the training data is linearly separable
  – given a sufficiently small learning rate µ
• The solution w is not unique: if w•u = 0 defines a separating hyperplane, so does w′ = k w for any k > 0

Multiple TLUs

[Diagram: a single layer of TLUs; inputs u_1, u_2, u_3, …, u_n are fully connected to outputs v_1, v_2, …, v_n; weight w_ji connects u_i with v_j]

• Learning rule (applied to each output unit independently):
w′_ji = w_ji + µ (vt_j − v_j) u_i

Example: Character Recognition
• 26 classes: A, B, C, …, Z
• Target output is a vector
  – e.g., vt_A = [1 0 0 … 0], vt_B = [0 1 0 … 0], …

[Diagram: inputs u_1, u_2, u_3, …, u_n connected to 26 output units v_1, v_2, …, v_26, one per letter A, B, …, Z]
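A sketch of one update step in this multi-output setting (weights held as a matrix W with one row per class, thresholds taken as zero for brevity, and random numbers standing in for real character features): each output unit j is trained with w′_ji = w_ji + µ (vt_j − v_j) u_i, and the target for class k is the one-hot vector with a 1 in position k.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_classes = 20, 26          # 26 classes A..Z (toy input size)
u = rng.random(n_inputs)              # one made-up input pattern
target_class = 2                      # pretend this pattern is a 'C'

vt = np.zeros(n_classes)
vt[target_class] = 1.0                # one-hot target, e.g. vt_C = [0 0 1 0 ... 0]

W = np.zeros((n_classes, n_inputs))   # W[j, i] connects input u_i to output v_j
mu = 0.1

v = (W @ u >= 0).astype(float)        # all 26 TLU outputs at once (zero thresholds)
W += mu * np.outer(vt - v, u)         # per-unit rule: w_ji += mu (vt_j - v_j) u_i
```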

Activation Functions

[Plots of four activation functions, output v against activation a: threshold, linear, piece-wise linear, and sigmoid]

Perceptron with Linear Function

[Diagram: inputs u_1, u_2, …, u_n with weights w_1, w_2, …, w_n; the activation is passed through a linear (identity) function]

v = a = Σ_{i=1}^{n} w_i u_i

Note: We will use the notation t instead of vt

Gradient Descent Learning Rule
• Consider a linear unit without threshold and with continuous output v (not just 1/0 or 1/−1):
v = w_1 u_1 + … + w_n u_n
• Train the w_i such that they minimize the squared error
E[w_1, …, w_n] = 1/2 Σ_{d∈D} (t_d − v_d)^2
  – where D is the set of training examples and
  – t the target outputs

Gradient Descent Learning Rule (cont.)

Gradient: ∇E[w] = [∂E/∂w_1, …, ∂E/∂w_n]
Update: ∆w = −µ ∇E[w]
Per-example error: E_d[w] = 1/2 (t_d − v_d)^2

[Plot: the error surface E[w_1, w_2] with a gradient step from (w_1, w_2) to (w_1 + ∆w_1, w_2 + ∆w_2)]

−1/µ ∆w_i = ∂E_d/∂w_i
          = ∂/∂w_i 1/2 (t_d − v_d)^2
          = ∂/∂w_i 1/2 (t_d − Σ_i w_i u_i)^2
          = (t_d − v_d) (−u_i)

We get: ∆w_i = µ (t_d − v_d) u_i, also known as the Delta rule

Incremental vs. Batch Gradient Descent
• Incremental (stochastic) mode: w′ = w − µ ∇E_d[w] over individual training examples d
E_d[w] = 1/2 (t_d − v_d)^2
• Batch mode: w′ = w − µ ∇E_D[w] over the entire data D
E_D[w] = 1/2 Σ_d (t_d − v_d)^2
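A sketch contrasting the two modes of the delta rule for a linear unit, on a tiny noise-free data set made up for illustration: in batch mode the gradient is summed over all of D before each update, in incremental mode every single example updates w immediately.

```python
import numpy as np

# Toy data generated as t = 2*u1 - u2 (no noise), purely for illustration.
U = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
T = U @ np.array([2.0, -1.0])
mu = 0.05

# Batch mode: one update per epoch using the gradient summed over D.
w = np.zeros(2)
for _ in range(200):
    V = U @ w                          # linear outputs v_d for all examples
    w += mu * (T - V) @ U              # dw_i = mu * sum_d (t_d - v_d) u_{d,i}

# Incremental (stochastic) mode: update after every single example.
w_inc = np.zeros(2)
for _ in range(200):
    for u, t in zip(U, T):
        v = w_inc @ u
        w_inc += mu * (t - v) * u      # dw_i = mu (t_d - v_d) u_i

print(w, w_inc)                        # both approach the true weights [2, -1]
```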

Perceptron vs. Gradient Descent Rule
• Perceptron rule
w′_i = w_i + µ (vt − v) u_i
  – derived from manipulation of the decision surface
• Gradient descent rule
w′_i = w_i + µ (t_d − v_d) u_i
  – derived from minimization of the error function E[w_1, …, w_n] = 1/2 Σ_d (t_d − v_d)^2 by means of gradient descent

Demo: Gradient Descent Learning Rule

[Animation: two scatter plots (axes from -5 to 5) showing the decision line learned by gradient descent on a 2-D data set]

Linear Regression
• Common statistical method
• Describes the relationship between variables
• Predicts one variable knowing the others
• We assume there is a linear relationship
  – between an outcome (dependent variable, response variable) and a predictor (independent variable, explanatory variable, feature), or
  – between one variable and several other variables

Describing Relationships

[Scatter plots of the Iris data: sepal width vs. sepal length shows a very weak relationship / very low correlation (correlation coefficient = -0.1094); petal width vs. sepal length shows a strong relationship / high correlation (correlation coefficient = 0.8180)]

Making Predictions

[Two scatter plots with unknown values marked "?": profit (Euro) over days and son's height vs. father's height, illustrating predicting variables and predicting future outcomes]

Demo: Linear Regression

[Plots: a line f(x) fitted to data points for x between 0 and 1]
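A minimal sketch of fitting such a line by ordinary least squares with NumPy (the synthetic data below merely stands in for the values shown in the demo plots):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 30)                      # explanatory variable
y = -2.0 + 0.5 * x + 0.05 * rng.normal(size=30)    # response with a little noise

# Least-squares fit of f(x) = b0 + b1*x via the design matrix [1, x].
A = np.column_stack([np.ones_like(x), x])
(b0, b1), *_ = np.linalg.lstsq(A, y, rcond=None)

print(b0, b1)             # close to the true intercept -2.0 and slope 0.5
print(b0 + b1 * 0.7)      # predict f(x) for a new x
```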

Perceptron vs. Gradient Descent
• Perceptron (TLU)
  – Guaranteed to succeed if
    • training examples are linearly separable
    • given a sufficiently small learning rate µ
  – Robust against outliers
• Gradient descent (linear transfer function)
  – Guaranteed to converge to the hypothesis with minimum squared error
    • even when the training data are not separable by a hyperplane
    • given a sufficiently small learning rate µ
  – Very sensitive to outliers

Activation Functions

[Plots of four activation functions, output v against activation a: threshold, linear, piece-wise linear, and sigmoid]

Perceptron with Sigmoid Function

[Diagram: inputs u_1, u_2, …, u_n with weights w_1, w_2, …, w_n; the activation is passed through a sigmoid function]

a = Σ_{i=1}^{n} w_i u_i
v = σ(a) = 1/(1 + e^{-a})

Gradient Descent Rule for Sigmoid Activation Function

Gradient: ∇E[w] = [∂E/∂w_1, …, ∂E/∂w_n]
Update: ∆w = −µ ∇E[w]
Per-example error: E_d[w] = 1/2 (t_d − v_d)^2

[Plots: the sigmoid σ(a) and its derivative σ′(a)]

−1/µ ∆w_i = ∂E_d/∂w_i
          = ∂/∂w_i 1/2 (t_d − v_d)^2
          = ∂/∂w_i 1/2 (t_d − σ(Σ_i w_i u_i))^2
          = (t_d − v_d) σ′(Σ_i w_i u_i) (−u_i)

with v_d = σ(a) = 1/(1 + e^{-a}) and σ′(a) = e^{-a}/(1 + e^{-a})^2 = σ(a)(1 − σ(a))

We get: ∆w_i = µ v_d (1 − v_d)(t_d − v_d) u_i
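A sketch of a single gradient step for the sigmoid unit, using the derivative σ′(a) = σ(a)(1 − σ(a)) from above (input, weights, target and learning rate are made-up values):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

u = np.array([0.5, -1.0, 2.0])     # one input pattern (arbitrary)
w = np.array([0.1, 0.4, -0.2])     # current weights (arbitrary)
t = 1.0                            # target output
mu = 0.5                           # learning rate

a = np.dot(w, u)                   # activation
v = sigmoid(a)                     # output v = sigma(a)

# Delta rule for the sigmoid unit: dw_i = mu * v(1-v)(t - v) u_i
w_new = w + mu * v * (1.0 - v) * (t - v) * u
print(v, w_new)
```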

Demo: Gradient Descent with Sigmoid Activation Function

[Animation: two scatter plots (axes from -5 to 5) showing the decision boundary learned with the sigmoid activation function on a 2-D data set]

Summary
• Threshold Logic Unit
  – works only for linearly separable patterns
  – robust to outliers
• Linear Neuron
  – converges to minimal squared error even when patterns are not linearly separable
  – very sensitive to outliers
• Sigmoid Neuron
  – combination of TLU and linear neuron
  – converges to minimal squared error even when patterns are not linearly separable
  – robust to outliers
