Arquitecturas Basicas Slides



10 - Connectionist Methods #1

• Introduction
• Simple Threshold Unit
• Perceptron
• Linear Threshold Unit
• Problems
• Backpropagation

Introduction

• Human brain
  • Complex fabric of multiply networked cells (neurons) which exchange signals.
  • First recognised by Golgi & Cajal.
• Neocortex
  • Site of intelligent capabilities of brain.
  • Approx. 0.2 m², approx. 2-3 mm thick.
  • Approx. 100,000 interconnected nerve cells lie under every square mm.

Neuron

• Three main structures make up a typical neuron:
  • Dendritic tree
  • Cell body (soma)
  • Axon

[Figure: schematic of a neuron showing dendrites, cell body (soma), axon and synapses]

Neuron

• Dendritic tree
  • Branched structure of thin cell extensions.
  • Sums output signals of surrounding neurons in form of an electric potential.
• Cell body
  • If input potential exceeds a certain threshold value, cell body produces a short electrical spike.
  • Spike is conducted along axon.
• Axon
  • Branches out & conducts pulse to several thousand target neurons.
  • Contacts of axon are either located on dendritic tree or directly on cell body of target neuron.
  • Connections - known as synapses.


Simple Threshold Unit (TU)

• McCulloch & Pitts - 1943
  • Developed neuron-like threshold units.
  • Could be used to represent logical expressions.
  • Demonstrated how networks of such units might effectively carry out computations.
• Two important weaknesses in this model:
  • Does not explain how interconnections between neurons could be formed, i.e. how this might occur through learning.
  • Such networks depend on error-free functioning of all their components (cf. error tolerance of biological neural networks).

Simple Threshold Unit (TU)

• TU has N input channels & 1 output channel.
• Each input channel is either active (input = 1) or silent (input = 0).
• Activity states of all channels encode the input information as a binary sequence of N bits.
• State of TU:
  • Given by linear summation of all input signals & comparison of this sum with a threshold value, s.
  • If sum exceeds threshold, "neuron" is excited (output = 1); otherwise it is quiescent (output = 0). (See the sketch below.)
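A minimal sketch of such a threshold unit in Python; the threshold value s and the example inputs are illustrative assumptions, not values from the slides:

```python
def threshold_unit(inputs, s):
    """McCulloch-Pitts style TU: binary inputs, binary output."""
    total = sum(inputs)               # linear summation of all input signals
    return 1 if total > s else 0      # excited if the sum exceeds the threshold s

# A 3-input unit with threshold s = 2 is excited only when all three inputs are active.
print(threshold_unit([1, 1, 0], s=2))  # 0 (quiescent)
print(threshold_unit([1, 1, 1], s=2))  # 1 (excited)
```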

Perceptron

• Rosenblatt - 1958
  • Pioneered LTUs as basic unit in neural networks.
• Mark I Perceptron:
  • 20x20 array of photocells to act as a retina.
  • Layer of 512 association units (AUs).
    • Each AU took input from randomly selected subset of photocells & formed simple logical combination of them.
  • Output of AUs connected to 8 response units (RUs).
  • Strength of connections between AUs & RUs set by motor-driven potentiometers.
  • RUs could mutually interact - eventually agreeing on a response.

Perceptron

• Basic structure:

[Figure: digitized image feeding the layer of AUs, connected by weighted links to the RUs]


Linear Threshold Unit (LTU)

• Many similarities with simple TU.
• LTU:
  • Can accept real (not just binary) inputs.
  • Has real-valued weights associated with its input connections.
• TU - activation status is determined by summing inputs.
• LTU - activation status is determined by summing the products of the inputs & the weights attached to the relevant connections and then testing this sum against unit's threshold value.
  • If sum > threshold, unit's activation is set to 1; otherwise set to 0.

LTU

• Simple LTU network (see figure below):
  • Weights on input connections form unit's weight vector: [ 0.2 -0.3 0.9 ]
  • Set of input activation levels form input vector: [ 0.7 0.3 0.1 ]
  • Sum of products of multiplying 2 vectors together - inner product of vectors: (0.14). (A sketch of this calculation follows the figure.)

[Figure: three input units with activations 0.7, 0.3, 0.1, connected by weights 0.2, -0.3, 0.9 to a single output unit (LTU)]
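A minimal sketch of this calculation in Python, using the weight and input vectors above; the threshold value of 0.5 is an illustrative assumption:

```python
def ltu(inputs, weights, threshold):
    """LTU: inner product of inputs and weights, tested against a threshold."""
    inner = sum(i * w for i, w in zip(inputs, weights))
    return inner, (1 if inner > threshold else 0)

inner, output = ltu([0.7, 0.3, 0.1], [0.2, -0.3, 0.9], threshold=0.5)
print(inner)   # 0.14 (up to floating-point rounding), as on the slide
print(output)  # 0 - the inner product does not exceed the assumed threshold
```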

LTU

• Any input vector or weight vector with n components specifies a point in n-dimensional space.
  • Components of vector - coordinates of point.
• Advantage of viewing vectors as points or rays in an n-dimensional space - can understand behaviour of LTU in terms of way in which it divides input space into two:
  • Region containing all input vectors which turn LTU on.
  • Region containing all input vectors which turn LTU off.

LTU

• Consider two input units:
  • Activation levels range between 0 & 1.
  • Weights range between -1 & +1.
  • Example: input vector I = (0.7 0.7), weight vector W = (0.11 0.6).

[Figure: the input point (0.7 0.7) and the weight point (0.11 0.6) plotted in two-dimensional space]


Learning AND

• We can train an LTU to discriminate between two different classes of input.
  • e.g. to compute a simple logic function - AND.
• To construct an LTU to do this - must ensure that it returns right output for each input.
• NB Only get 1 as output if inner product > threshold value.

  (1 0) (0)
  (0 0) (0)
  (0 1) (0)
  (1 1) (1)

Learning AND

• Threshold = 0.5

  Input    Output   Constraint on (w1, w2)
  (1 1)    1        w1 + w2 ≥ 0.5
  (1 0)    0        w1 < 0.5
  (0 1)    0        w2 < 0.5
  (0 0)    0        0 < 0.5

Learning AND

• Weight vectors must lie in the filled region - i.e. satisfy w1 + w2 ≥ 0.5, w1 < 0.5 and w2 < 0.5 - to give satisfactory output. (See the check below.)

[Figure: weight space with the lines w1 + w2 = 0.5, w1 = 0.5 and w2 = 0.5, each labelled with the input - (1 1), (1 0), (0 1) - that gives rise to it; the filled region between them contains the acceptable weight vectors]
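A quick sketch checking a candidate weight vector against all four AND cases; the particular weights (0.3, 0.3) are an illustrative assumption that lies inside the acceptable region, and the ≥ comparison follows the constraint table above:

```python
def ltu_and(inputs, weights, threshold=0.5):
    """Returns 1 when the inner product reaches the threshold, 0 otherwise."""
    return 1 if sum(i * w for i, w in zip(inputs, weights)) >= threshold else 0

weights = (0.3, 0.3)   # satisfies w1 + w2 >= 0.5, w1 < 0.5, w2 < 0.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", ltu_and(x, weights))   # only (1, 1) gives 1, i.e. AND
```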

Perceptron

• Basis of algorithm:
  • Iterative reduction of LTU error.
  • Take each input/output pair in turn & present input vector to LTU.
  • Calculate degree to which activation of unit differs from desired activation & adjust weight vector so as to reduce this difference by a small, fixed amount.


Perceptron Algorithm

• Initialise:
  • Randomly initialise weights in network.
• Cycle through training set applying the following three rules to the weights on the connections to the output unit (see the sketch below):
  • If activation level of output unit is 1 when it should be 0 - reduce weight on link to the ith input unit by r x Ii, where Ii is activation level of ith input unit & r is a fixed weight step.
  • If activation level of output unit is 0 when it should be 1 - increase weight on link to ith input unit by r x Ii.
  • If activation level of output unit is at desired level - do nothing.
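A minimal sketch of this training loop in Python for the AND problem from the earlier slides. The threshold of 0.5 and the weight step r = 0.1 are illustrative assumptions, and the weights start at zero (rather than random values) purely to keep the run deterministic:

```python
def ltu_output(inputs, weights, threshold=0.5):
    return 1 if sum(i * w for i, w in zip(inputs, weights)) >= threshold else 0

# AND training set: (input vector, target output)
training_set = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

r = 0.1                  # fixed weight step
weights = [0.0, 0.0]     # the slides call for random initialisation

for _ in range(20):      # cycle through the training set
    for inputs, target in training_set:
        actual = ltu_output(inputs, weights)
        if actual == 1 and target == 0:    # too active: reduce weights by r x Ii
            weights = [w - r * i for w, i in zip(weights, inputs)]
        elif actual == 0 and target == 1:  # not active enough: increase by r x Ii
            weights = [w + r * i for w, i in zip(weights, inputs)]
        # if actual == target: do nothing

print(weights)   # roughly (0.3, 0.3): only the (1, 1) input now turns the unit on
```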

Perceptron

• Minsky & Papert - late 1960s
  • Developed mathematical analysis of perceptron & related architectures.
• Central result:
  • Since basic element of perceptron is LTU - it can only discriminate between linearly separable classes.
  • Large proportion of interesting classes are NOT linearly separable.
  • Therefore, perceptron approach is restricted.
• Minsky & Papert dampened enthusiasm for neural networks.

Linear Separability

• LTU discriminates between classes by separating them with a line (more generally, a hyperplane) in the input space.
• A great many classes cannot be separated in this way - i.e. many classes are NOT linearly separable.
• For example, XOR:

  (0 0) (0)
  (0 1) (1)
  (1 0) (1)
  (1 1) (0)

Backpropagation

• Rumelhart & McClelland - 1986
• To achieve more powerful performance - need to move to more complex networks.
  • Networks containing input & output units + hidden units.
• Terminology:
  • 2-layer network: layer of input units, layer of hidden units & layer of output units.
  • Feed-forward network: activation flows from input units, through hidden units, to output units.
  • Completely connected network: every unit in every layer receives input from every unit in layer below.
  • Strictly layered network: only has connections between units in adjacent layers.


Backpropagation

• Simple two-layer network.
• If introduction of hidden units is to achieve anything - it is essential for units to have non-linear activation functions.
• Most common approach is to use the function (often known as the logistic function), sketched below:

  1 / (1 + e^-x)     x - total input to unit.
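A minimal sketch of the logistic activation in Python; nothing beyond the formula above is assumed:

```python
import math

def logistic(x):
    """Logistic activation: maps the total input x to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(logistic(0.0))    # 0.5 - zero net input gives the midpoint activation
print(logistic(0.37))   # about 0.59, as in the worked example at the end of these slides
```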

Bias

• Activation level of a unit is calculated by applying logistic function to the inputs of the unit.
• But: to obtain satisfactory performance of Backpropagation it is necessary to allow units to have some level of activation independent of inputs.
  • This activation = bias.
  • Implemented by connecting units to a dummy unit which always has activation = 1 (see the sketch below).
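A minimal sketch of the dummy-unit trick; the weight attached to the dummy input plays the role of the bias, and all the numbers here are illustrative:

```python
import math

def unit_activation(inputs, weights, bias_weight):
    """Append a dummy input fixed at 1; its weight acts as the unit's bias."""
    inputs = list(inputs) + [1.0]            # dummy unit, always active
    weights = list(weights) + [bias_weight]
    total = sum(i * w for i, w in zip(inputs, weights))
    return 1.0 / (1.0 + math.exp(-total))    # logistic activation

print(unit_activation([0.0, 1.0], [0.39, 0.37], bias_weight=-0.2))  # activation shifted down by the bias
```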

Errors

• Computing output unit error - straightforward.
• Compare actual & target activation level:

  Eo = So d(Ao)

  Eo - error on output unit.
  So - difference between actual & target activation.
  d(Ao) - first derivative of logistic function.

• First derivative is easily computed:
  • If Ai is current activation of unit i, first derivative is just: Ai (1 - Ai)
  • (A sketch of this calculation follows below.)
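A minimal sketch of the output-error calculation in Python, taking So as target minus actual (the convention the worked example at the end of these slides appears to use):

```python
def logistic_derivative(activation):
    """First derivative of the logistic function, written in terms of the activation."""
    return activation * (1.0 - activation)

def output_error(actual, target):
    """Eo = So d(Ao): the activation difference scaled by the derivative."""
    return (target - actual) * logistic_derivative(actual)

print(output_error(actual=0.6, target=1.0))   # about 0.1, matching the worked example
```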

Hidden Unit Error

• How to compute error for hidden units?
• Although we cannot assign errors to hidden units directly - can deduce level of error by computing errors of output units & propagating this error backwards through the network.
• Thus, for a 2-layer feed-forward network:
  • Contribution that a hidden unit makes to error of an output unit to which it is connected is simply degree to which the hidden unit was responsible for giving the output unit the wrong level of activation.
  • Size of contribution depends on two factors:
    • Weight on link which connects the two units.
    • Activation of hidden unit.
• Can arrive at an estimated error value for any hidden unit by summing all the "contributions" which it makes to errors of the units to which it is connected.


Hidden Unit Error

• Error of a hidden unit, i:
  • Si - sum of the error contributions i makes to the units j it feeds:

  Si = ∑j Ej Wij     (Ej - error value of jth unit)

  • Must take account of activation of hidden unit: Si is multiplied by derivative of activation.
  • If Ai is activation of unit i, final error of i is:

  Ei = Si d(Ai) = Si Ai (1 - Ai)

• (A sketch of this calculation follows below.)
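A minimal sketch in Python; the indexing convention (the jth entries of the two lists belong to the same downstream unit) is an assumption for illustration:

```python
def hidden_error(activation_i, downstream_errors, weights_to_downstream):
    """Ei = Ai (1 - Ai) * sum_j(Ej Wij): back-propagated error of hidden unit i."""
    s_i = sum(e_j * w_ij for e_j, w_ij in zip(downstream_errors, weights_to_downstream))
    return s_i * activation_i * (1.0 - activation_i)

# Hidden unit with activation 0.59 feeding one output unit (error 0.1) through a
# weight of 0.5 - the values used in the worked example at the end of these slides:
print(hidden_error(0.59, [0.1], [0.5]))   # about 0.01
```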

Weights

• Once error values have been determined for all units in the network - weights are then updated.
• Weight on connection which feeds activation from unit i to unit j is updated by an amount proportional to the product of Ej & Ai (see the sketch below):

  ∆Wij = Ej Ai r     (r - learning rate)

• Known as generalised delta rule.
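A minimal sketch of the generalised delta rule for a single connection; the learning rate value is illustrative (the worked example at the end of these slides appears to use r = 1):

```python
def delta_weight(error_j, activation_i, learning_rate=1.0):
    """Delta Wij = Ej Ai r: weight change on the link from unit i to unit j."""
    return error_j * activation_i * learning_rate

# Output error 0.1, sending-unit activation 0.59:
print(delta_weight(0.1, 0.59))   # 0.059, so a weight of 0.5 becomes 0.559
```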

Sloshing

• Backpropagation performs gradient descent in squared error.

• Tries to find a global minimum of the error surface - by modifying weights.

• This can lead to problem behaviour: sloshing.
• Example:
  • A weight configuration corresponds to a point high up on one side of a long, thin valley in the error surface.
  • Changes to weights may result in a move to a position high up on the other side of the valley.
  • Next iteration will jump back, and so on ...

Momentum

• Sloshing can substantially slow down learning.

• To solve this problem: introduce momentum term.

• Weight updating process modified to consider last change applied to the weight.

• Effect is to smooth out weight changes & thus prevent oscillations.

∆W = (∆Wprevious x momentum) + (∆Wcurrent x (1 - momentum))
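A minimal sketch of the momentum-smoothed update for a single weight; the momentum value of 0.9 is an illustrative assumption:

```python
def smoothed_delta(previous_delta, current_delta, momentum=0.9):
    """Blend the previous weight change with the current one to damp oscillations."""
    return previous_delta * momentum + current_delta * (1.0 - momentum)

# Raw updates that slosh between +0.5 and -0.5 are smoothed into small steps:
delta = 0.0
for raw in [0.5, -0.5, 0.5, -0.5]:
    delta = smoothed_delta(delta, raw)
    print(round(delta, 3))   # each smoothed step stays far smaller than 0.5
```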


Backpropagation - Example

• Learning XOR
• 2-2-1 network
• Randomly initialised weights:
  • Input-to-hidden weights: 0.39, -0.06, -0.07, 0.37
  • Hidden-to-output weights: 0.5, 0.22

[Figure: 2-2-1 network with the initial weights above attached to its connections]

Backpropagation - Example

• Present first training pair: (0 1) (1)
• Set activation of left input unit to 0.0 & activation of right input unit to 1.0.
• Propagate activation forward through network using logistic function & compute new activations:
  • Hidden unit activations: 0.59 & 0.48
  • Output unit activation: 0.6

[Figure: the network with input activations 0.0 and 1.0, hidden activations 0.59 and 0.48, and output activation 0.6]

Backpropagation - Example


• Calculate error on output unit & propagate error back through network.
  • Output unit error = (target output (1.0) - actual output (0.6)) x derivative of activation (0.6 (1 - 0.6)) = 0.1
  • Hidden unit errors:
    • Error = (0.1 x 0.22) x 0.48 (1 - 0.48) = 0.01
    • Error = (0.1 x 0.5) x 0.59 (1 - 0.59) = 0.01

Backpropagation - Example

• Update weights.
• Weights altered by amount proportional to the product of the error value of the receiving unit & the activation of the sending unit (a full sketch of the example follows below):
  • 0.5 + (0.1 x 0.59) = 0.559
  • 0.22 + (0.1 x 0.48) = 0.268
  • 0.39 + (0.01 x 0.0) = 0.39
  • -0.06 + (0.01 x 1.0) = -0.05
  • -0.07 + (0.01 x 0.0) = -0.07
  • 0.37 + (0.01 x 1.0) = 0.38
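A minimal end-to-end sketch of this worked example in Python. The wiring of the 2-2-1 network (which input feeds which hidden unit through which weight) is inferred from the arithmetic on these slides, the learning rate is taken as 1, and no bias units are used; small differences from the slide values arise because the slides round the errors to 0.1 and 0.01 before updating:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Training pair (0 1) -> (1)
left_in, right_in, target = 0.0, 1.0, 1.0

# Initial weights (wiring inferred from the worked arithmetic): hidden unit A sees
# 0.39 (left) & 0.37 (right); hidden unit B sees -0.07 (left) & -0.06 (right);
# A feeds the output unit through 0.5, B through 0.22.
w_a = {"left": 0.39, "right": 0.37}
w_b = {"left": -0.07, "right": -0.06}
w_out = {"a": 0.5, "b": 0.22}

# Forward pass
a_act = logistic(w_a["left"] * left_in + w_a["right"] * right_in)   # about 0.59
b_act = logistic(w_b["left"] * left_in + w_b["right"] * right_in)   # about 0.48
out_act = logistic(w_out["a"] * a_act + w_out["b"] * b_act)         # about 0.6

# Backward pass: output error, then hidden errors
e_out = (target - out_act) * out_act * (1 - out_act)                # about 0.1
e_a = (e_out * w_out["a"]) * a_act * (1 - a_act)                    # about 0.01
e_b = (e_out * w_out["b"]) * b_act * (1 - b_act)                    # about 0.005, rounded to 0.01 on the slide

# Weight updates (generalised delta rule, r = 1)
w_out["a"] += e_out * a_act      # 0.5  -> roughly 0.557 (slide, using rounded errors: 0.559)
w_out["b"] += e_out * b_act      # 0.22 -> roughly 0.267 (slide: 0.268)
w_a["left"] += e_a * left_in     # 0.39 -> 0.39 (sending unit inactive)
w_a["right"] += e_a * right_in   # 0.37 -> about 0.38
w_b["left"] += e_b * left_in     # -0.07 -> -0.07 (sending unit inactive)
w_b["right"] += e_b * right_in   # -0.06 -> about -0.055 (slide: -0.05)

print(round(out_act, 2), round(e_out, 2))   # 0.6 0.1
```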