
November 20, 2014 Computer Vision Lecture 19: Object Recognition III


Linear Separability

So by varying the weights and the threshold, we can realize any linear separation of the input space into a region that yields output 1, and another region that yields output 0.

As we have seen, a two-dimensional input space can be divided by any straight line.

A three-dimensional input space can be divided by any two-dimensional plane.

In general, an n-dimensional input space can be divided by an (n-1)-dimensional plane or hyperplane.

Of course, for n > 3 this is hard to visualize.
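For example, a single threshold neuron can be sketched in a few lines of Python; the weight vector and threshold below are an arbitrary illustrative choice that defines one particular dividing line in a two-dimensional input space, a minimal sketch rather than anything prescribed by the lecture:

```python
import numpy as np

def threshold_neuron(x, w, theta):
    """Output 1 if the weighted sum reaches the threshold, else 0."""
    return 1 if np.dot(w, x) >= theta else 0

# The line 2*x1 + 1*x2 = 1 separates the plane: points on one side yield 1,
# points on the other side yield 0.
w, theta = np.array([2.0, 1.0]), 1.0
print(threshold_neuron(np.array([1.0, 1.0]), w, theta))  # 1  (2 + 1 >= 1)
print(threshold_neuron(np.array([0.0, 0.0]), w, theta))  # 0  (0 < 1)
```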


Capabilities of Threshold Neurons

What do we do if we need a more complex function?

We can combine multiple artificial neurons to form networks with increased capabilities.

For example, we can build a two-layer network with any number of neurons in the first layer giving input to a single neuron in the second layer.

The neuron in the second layer could, for example, implement an AND function.


Capabilities of Threshold Neurons

What kind of function can such a network realize?

[Figure: two-layer network diagram; the inputs x1, x2 feed each of the first-layer threshold neurons, whose outputs all feed a single second-layer neuron.]


Capabilities of Threshold Neurons

Assume that the dotted lines in the diagram represent the input-dividing lines implemented by the neurons in the first layer:

[Figure: input space (1st vs. 2nd component) divided by the first-layer neurons' lines, which together bound a polygonal region.]

Then, for example, the second-layer neuron could output 1 if the input is within a polygon, and 0 otherwise.
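A small Python sketch of this idea, assuming three illustrative half-planes whose intersection is a triangle; the second-layer neuron acts as an AND by using unit weights and a threshold equal to the number of first-layer neurons (all numbers here are assumptions chosen for the example):

```python
import numpy as np

def threshold_neuron(x, w, theta):
    return 1 if np.dot(w, x) >= theta else 0

# First layer: three half-planes whose intersection is the triangle
# with corners (0,0), (1,0), (0,1) (illustrative choice).
half_planes = [                      # (w, theta) pairs; fire when w.x >= theta
    (np.array([ 1.0,  0.0]),  0.0),  # x1 >= 0
    (np.array([ 0.0,  1.0]),  0.0),  # x2 >= 0
    (np.array([-1.0, -1.0]), -1.0),  # x1 + x2 <= 1
]

def inside_polygon(x):
    # Second layer: AND of the first-layer outputs
    # (all weights 1, threshold = number of first-layer neurons).
    first = np.array([threshold_neuron(x, w, t) for w, t in half_planes])
    return threshold_neuron(first, np.ones(len(first)), len(first))

print(inside_polygon(np.array([0.2, 0.2])))  # 1, inside the triangle
print(inside_polygon(np.array([1.0, 1.0])))  # 0, outside
```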


Capabilities of Threshold Neurons

However, we still may want to implement functions that are more complex than that.

An obvious idea is to extend our network even further.

Let us build a network that has three layers, with arbitrary numbers of neurons in the first and second layers and one neuron in the third layer.

The first and second layers are completely connected, that is, each neuron in the first layer sends its output to every neuron in the second layer.


Capabilities of Threshold Neurons

What type of function can a three-layer network realize?

[Figure: three-layer network diagram; the inputs x1, x2 feed the first-layer neurons, whose outputs feed the second-layer neurons, which in turn feed a single third-layer neuron producing the output oi.]


Capabilities of Threshold Neurons

Assume that the polygons in the diagram indicate the input regions for which each of the second-layer neurons yields output 1:

[Figure: input space (1st vs. 2nd component) containing several polygons, each marking the region for which one second-layer neuron outputs 1.]

Then, for example, the third-layer neuron could output 1 if the input is within any of the polygons, and 0 otherwise.


Capabilities of Threshold Neurons

The more neurons there are in the first layer, the more vertices the polygons can have.

With a sufficient number of first-layer neurons, the polygons can approximate any given shape.

The more neurons there are in the second layer, the more of these polygons can be combined to form the output function of the network.

With a sufficient number of neurons and appropriate weight vectors w_i, a three-layer network of threshold neurons can realize any (!) function f: R^n → {0, 1}.


Terminology

Usually, we draw neural networks in such a way that the input enters at the bottom and the output is generated at the top.

Arrows indicate the direction of data flow.

The first layer, termed input layer, just contains the input vector and does not perform any computations.

The second layer, termed hidden layer, receives input from the input layer and sends its output to the output layer.

After applying their activation function, the neurons in the output layer contain the output vector.


Terminology

Example: Network function f: R^3 → {0, 1}^2

[Figure: three-layer network with the input vector entering the input layer at the bottom, a hidden layer in the middle, and the output layer at the top producing the output vector.]


Sigmoidal Neurons

Sigmoidal neurons accept any vectors of real numbers as input, and they output a real number between 0 and 1.

Sigmoidal neurons are the most common type of artificial neuron, especially in learning networks.

A network of sigmoidal units with m input neurons and n output neurons realizes a network function f: R^m → (0, 1)^n.


Sigmoidal Neurons

In backpropagation networks, we typically choose τ = 1 and θ = 0.

$f_i(net_i(t)) = \frac{1}{1 + e^{-(net_i(t) - \theta)/\tau}}$

[Figure: plot of f_i(net_i(t)) as a function of net_i(t) for τ = 1 and τ = 0.1; the output rises from 0 to 1 around the threshold, and the smaller τ is, the steeper the transition.]


Sigmoidal Neurons

This leads to a simplified form of the sigmoid function:

$S(net) = \frac{1}{1 + e^{-net}}$

We do not need a modifiable threshold θ, because we will use "dummy" (offset) inputs.

The choice τ = 1 works well in most situations and results in a very simple derivative of S(net).


Sigmoidal Neurons

$S(x) = \frac{1}{1 + e^{-x}}$

$S'(x) = \frac{dS(x)}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}}\left(1 - \frac{1}{1 + e^{-x}}\right) = S(x)\,(1 - S(x))$

This result will be very useful when we develop the backpropagation algorithm.
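A short numerical sketch of the sigmoid and its derivative, checking the closed form S'(x) = S(x)(1 − S(x)) against a finite-difference approximation (the test point and step size are illustrative values):

```python
import numpy as np

def S(x):
    """Sigmoid function S(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def S_prime(x):
    """Derivative of the sigmoid, using S'(x) = S(x) * (1 - S(x))."""
    s = S(x)
    return s * (1.0 - s)

# Numerical check of the closed form against a central finite difference.
x, h = 0.7, 1e-6
numeric = (S(x + h) - S(x - h)) / (2 * h)
print(S_prime(x), numeric)  # both approximately 0.2218
```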


Feedback-Based Weight Adaptation

• Feedback from environment (possibly teacher) is used to improve the system's performance
• Synaptic weights are modified to reduce the system's error in computing a desired function
• For example, if increasing a specific weight increases error, then the weight is decreased
• Small adaptation steps are needed to find optimal set of weights
• Learning rate can vary during learning process
• Typical for supervised learning


Evaluation of Networks

• Basic idea: define error function and measure error for untrained data (testing set)
• Typical:

$E = \sum_i (d_i - o_i)^2$

where d is the desired output, and o is the actual output.

• For classification: E = number of misclassified samples / total number of samples
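Both error measures can be sketched in a few lines of Python; the sample vectors passed in at the end are arbitrary illustrative values:

```python
import numpy as np

def sum_squared_error(d, o):
    """E = sum_i (d_i - o_i)^2 for desired outputs d and actual outputs o."""
    return np.sum((np.asarray(d) - np.asarray(o)) ** 2)

def classification_error(d_labels, o_labels):
    """Fraction of misclassified samples."""
    d_labels, o_labels = np.asarray(d_labels), np.asarray(o_labels)
    return np.mean(d_labels != o_labels)

print(sum_squared_error([1.0, 0.0], [0.8, 0.3]))         # 0.13
print(classification_error([1, 0, 1, 1], [1, 1, 1, 0]))  # 0.5
```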


Gradient Descent

Gradient descent is a very common technique for finding a minimum of a function.

It is especially useful for high-dimensional functions.

We will use it to iteratively minimize the network's (or neuron's) error by finding the gradient of the error surface in weight-space and adjusting the weights in the opposite direction.


Gradient Descent

Gradient-descent example: Finding the absolute minimum of a one-dimensional error function f(x):

[Figure: curve of the error function f(x) over x, with the slope f'(x_0) drawn at the starting point x_0.]

$x_1 = x_0 - \eta f'(x_0)$

Repeat this iteratively until for some xi, f’(xi) is sufficiently close to 0.
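A minimal sketch of this iteration for an assumed example function f(x) = (x − 3)², whose derivative is f'(x) = 2(x − 3); the learning rate η and tolerance are illustrative choices:

```python
def gradient_descent_1d(f_prime, x0, eta=0.1, tol=1e-6, max_steps=10_000):
    """Iterate x <- x - eta * f'(x) until the slope is close to 0."""
    x = x0
    for _ in range(max_steps):
        slope = f_prime(x)
        if abs(slope) < tol:
            break
        x = x - eta * slope
    return x

# f(x) = (x - 3)^2 has derivative f'(x) = 2(x - 3); starting from x0 = 0,
# the iteration converges to the minimum at x = 3.
print(gradient_descent_1d(lambda x: 2 * (x - 3), x0=0.0))  # approx. 3.0
```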


Gradient Descent

Gradients of two-dimensional functions:

The two-dimensional function in the left diagram is represented by contour lines in the right diagram, where arrows indicate the gradient of the function at different locations. Obviously, the gradient is always pointing in the direction of the steepest increase of the function. In order to find the function’s minimum, we should always move against the gradient.


Multilayer Networks

The backpropagation algorithm was popularized by Rumelhart, Hinton, and Williams (1986).

This algorithm solved the “credit assignment” problem, i.e., crediting or blaming individual neurons across layers for particular outputs.

The error at the output layer is propagated backwards to units at lower layers, so that the weights of all neurons can be adapted appropriately.


Terminology

Example: Network function f: R^3 → R^2

[Figure: three-layer network with input vector (x1, x2, x3) in the input layer, four hidden-layer neurons, and output vector (o1, o2) in the output layer; the weights between input and hidden layer are labeled w^(1,0)_{1,1}, …, w^(1,0)_{4,3}, and those between hidden and output layer w^(2,1)_{1,1}, …, w^(2,1)_{2,4}.]


Backpropagation Learning

The goal of the backpropagation learning algorithm is to modify the network’s weights so that its output vector

o_p = (o_p,1, o_p,2, …, o_p,K)

is as close as possible to the desired output vector

d_p = (d_p,1, d_p,2, …, d_p,K)

for K output neurons and input patterns p = 1, …, P.

The set of input-output pairs (exemplars) {(x_p, d_p) | p = 1, …, P} constitutes the training set.


Backpropagation Learning

We need a cumulative error function that is to be minimized:

$Error = \sum_{p=1}^{P} Err(o_p, d_p)$

We can choose the mean square error (MSE), where the 1/P factor does not matter for minimizing error:

$MSE = \frac{1}{P} \sum_{p=1}^{P} \sum_{j=1}^{K} (l_{p,j})^2$,  where  $l_{p,j} = o_{p,j} - d_{p,j}$


Backpropagation Learning

For input pattern p, the i-th input layer node holds x_{p,i}.

Net input to the j-th node in the hidden layer:

$net_j^{(1)} = \sum_{i=0}^{n} w_{j,i}^{(1,0)} x_{p,i}$

Output of the j-th node in the hidden layer:

$x_{p,j}^{(1)} = S\left(\sum_{i=0}^{n} w_{j,i}^{(1,0)} x_{p,i}\right)$

Net input to the k-th node in the output layer:

$net_k^{(2)} = \sum_{j} w_{k,j}^{(2,1)} x_{p,j}^{(1)}$

Output of the k-th node in the output layer:

$o_{p,k} = S\left(\sum_{j} w_{k,j}^{(2,1)} x_{p,j}^{(1)}\right)$

Network error for pattern p:

$E_p = \sum_{k=1}^{K} (l_{p,k})^2 = \sum_{k=1}^{K} (o_{p,k} - d_{p,k})^2$
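These feedforward equations can be sketched directly in Python. The layer sizes, random weights, and input values below are illustrative assumptions, and column 0 of the hidden weight matrix plays the role of the dummy offset input x_0 = 1:

```python
import numpy as np

def S(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_pass(x_p, W10, W21):
    """Forward pass of a two-weight-layer sigmoidal network.
    W10: hidden weights, shape (J, n+1); W21: output weights, shape (K, J).
    Column 0 of W10 multiplies the dummy offset input x_0 = 1."""
    x = np.concatenate(([1.0], x_p))   # x_0 = 1 (dummy / offset input)
    net1 = W10 @ x                     # net_j^(1) = sum_i w_{j,i}^(1,0) x_i
    x1 = S(net1)                       # x_j^(1) = S(net_j^(1))
    net2 = W21 @ x1                    # net_k^(2) = sum_j w_{k,j}^(2,1) x_j^(1)
    o = S(net2)                        # o_k = S(net_k^(2))
    return net1, x1, net2, o

# Toy sizes: 3 inputs, 4 hidden nodes, 2 outputs (illustrative only).
rng = np.random.default_rng(0)
W10 = rng.normal(scale=0.5, size=(4, 4))   # (J, n+1)
W21 = rng.normal(scale=0.5, size=(2, 4))   # (K, J)
net1, x1, net2, o = forward_pass(np.array([0.2, -0.4, 0.7]), W10, W21)
print(o)   # two values in (0, 1)
```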


Backpropagation Learning

As E is a function of the network weights, we can use gradient descent to find those weights that result in minimal error.

For individual weights in the hidden and output layers, we should move against the error gradient (omitting the index p):

$\Delta w_{k,j}^{(2,1)} = -\eta \frac{\partial E}{\partial w_{k,j}^{(2,1)}}$   (output layer: derivative easy to calculate)

$\Delta w_{j,i}^{(1,0)} = -\eta \frac{\partial E}{\partial w_{j,i}^{(1,0)}}$   (hidden layer: derivative difficult to calculate)


Backpropagation Learning

When computing the derivative with regard to w_{k,j}^{(2,1)}, we can disregard any output units except o_k:

$E = \sum_i l_i^2 = \sum_i (o_i - d_i)^2 \quad\Rightarrow\quad \frac{\partial E}{\partial o_k} = 2(o_k - d_k)$

Remember that o_k is obtained by applying the sigmoid function S to net_k^{(2)}, which is computed by:

$net_k^{(2)} = \sum_j w_{k,j}^{(2,1)} x_j^{(1)}$

Therefore, we need to apply the chain rule twice.


Backpropagation Learning

$\frac{\partial E}{\partial w_{k,j}^{(2,1)}} = \frac{\partial E}{\partial o_k} \cdot \frac{\partial o_k}{\partial net_k^{(2)}} \cdot \frac{\partial net_k^{(2)}}{\partial w_{k,j}^{(2,1)}}$

Since $net_k^{(2)} = \sum_j w_{k,j}^{(2,1)} x_j^{(1)}$, we have:

$\frac{\partial net_k^{(2)}}{\partial w_{k,j}^{(2,1)}} = x_j^{(1)}$

We know that:

$\frac{\partial o_k}{\partial net_k^{(2)}} = S'(net_k^{(2)})$

Which gives us:

$\frac{\partial E}{\partial w_{k,j}^{(2,1)}} = 2(o_k - d_k)\, S'(net_k^{(2)})\, x_j^{(1)}$


Backpropagation Learning

For the derivative with regard to w_{j,i}^{(1,0)}, notice that E depends on it through net_j^{(1)}, which influences each o_k with k = 1, …, K. Using the chain rule of derivatives again:

$\frac{\partial E}{\partial w_{j,i}^{(1,0)}} = \sum_{k=1}^{K} \frac{\partial E}{\partial o_k} \cdot \frac{\partial o_k}{\partial net_k^{(2)}} \cdot \frac{\partial net_k^{(2)}}{\partial x_j^{(1)}} \cdot \frac{\partial x_j^{(1)}}{\partial net_j^{(1)}} \cdot \frac{\partial net_j^{(1)}}{\partial w_{j,i}^{(1,0)}}$

With $o_k = S(net_k^{(2)})$, $x_j^{(1)} = S(net_j^{(1)})$, and $net_j^{(1)} = \sum_i w_{j,i}^{(1,0)} x_i$, this yields:

$\frac{\partial E}{\partial w_{j,i}^{(1,0)}} = \sum_{k=1}^{K} 2(o_k - d_k)\, S'(net_k^{(2)})\, w_{k,j}^{(2,1)}\, S'(net_j^{(1)})\, x_i$


Backpropagation Learning

This gives us the following weight changes at the output layer:

$\Delta w_{k,j}^{(2,1)} = \eta\, \delta_k\, x_j^{(1)}$  with  $\delta_k = (d_k - o_k)\, S'(net_k^{(2)})$

… and at the inner layer:

$\Delta w_{j,i}^{(1,0)} = \eta\, \delta_j\, x_i$  with  $\delta_j = S'(net_j^{(1)}) \sum_{k=1}^{K} \delta_k\, w_{k,j}^{(2,1)}$


Backpropagation Learning

As you surely remember from a few minutes ago:

$S'(x) = S(x)\,(1 - S(x))$

Then we can simplify the generalized error terms:

$\delta_k = (d_k - o_k)\, S'(net_k^{(2)}) = (d_k - o_k)\, o_k\, (1 - o_k)$

And:

$\delta_j = S'(net_j^{(1)}) \sum_{k=1}^{K} \delta_k\, w_{k,j}^{(2,1)} = x_j^{(1)}(1 - x_j^{(1)}) \sum_{k=1}^{K} \delta_k\, w_{k,j}^{(2,1)}$


Backpropagation Learning

The simplified error terms δ_k and δ_j use variables that are calculated in the feedforward phase of the network and can thus be calculated very efficiently.

Now let us state the final equations again and reintroduce the subscript p for the p-th pattern:

$\Delta w_{k,j}^{(2,1)} = \eta\, \delta_{p,k}\, x_{p,j}^{(1)}$  with  $\delta_{p,k} = (d_{p,k} - o_{p,k})\, o_{p,k}\, (1 - o_{p,k})$

$\Delta w_{j,i}^{(1,0)} = \eta\, \delta_{p,j}\, x_{p,i}$  with  $\delta_{p,j} = x_{p,j}^{(1)}(1 - x_{p,j}^{(1)}) \sum_{k=1}^{K} \delta_{p,k}\, w_{k,j}^{(2,1)}$
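A compact sketch of these update equations for a single pattern p, assuming a two-weight-layer sigmoidal network with a dummy offset input on the input layer; the learning rate η = 0.5 and the weight-matrix shapes are arbitrary illustrative choices:

```python
import numpy as np

def S(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x_p, d_p, W10, W21, eta=0.5):
    """One weight update for pattern p, using
    delta_k = (d_k - o_k) o_k (1 - o_k)  and
    delta_j = x_j^(1) (1 - x_j^(1)) sum_k delta_k w_{k,j}^(2,1).
    W10 has shape (J, n+1) with column 0 for the dummy input; W21 has shape (K, J)."""
    # Feedforward phase (these values are reused by the error terms).
    x = np.concatenate(([1.0], x_p))      # dummy offset input x_0 = 1
    x1 = S(W10 @ x)                       # hidden outputs x_j^(1)
    o = S(W21 @ x1)                       # network outputs o_k

    # Generalized error terms.
    delta_k = (d_p - o) * o * (1.0 - o)
    delta_j = x1 * (1.0 - x1) * (W21.T @ delta_k)

    # Weight changes: Delta w^(2,1)_{k,j} = eta * delta_k * x_j^(1),
    #                 Delta w^(1,0)_{j,i} = eta * delta_j * x_i.
    W21 += eta * np.outer(delta_k, x1)
    W10 += eta * np.outer(delta_j, x)
    return W10, W21
```

The constant factor 2 from the squared-error derivative is absorbed into the learning rate here, matching the definition of the δ terms above.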


Backpropagation Learning

Algorithm Backpropagation;
  Start with randomly chosen weights;
  while MSE is above desired threshold and computational bounds are not exceeded, do
    for each input pattern x_p, 1 ≤ p ≤ P,
      Compute hidden node inputs;
      Compute hidden node outputs;
      Compute inputs to the output nodes;
      Compute the network outputs;
      Compute the error between output and desired output;
      Modify the weights between hidden and output nodes;
      Modify the weights between input and hidden nodes;
    end-for
  end-while.
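A self-contained sketch of this loop in Python, using XOR as an illustrative training set; the task, layer sizes, learning rate, and stopping bounds are assumptions chosen for the example, and convergence is typical but not guaranteed for every random initialization:

```python
import numpy as np

def S(x):
    return 1.0 / (1.0 + np.exp(-x))

# Training set: XOR, a classic non-linearly-separable toy problem (illustrative).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(1)
W10 = rng.uniform(-1, 1, size=(4, 3))   # 2 inputs + dummy -> 4 hidden nodes
W21 = rng.uniform(-1, 1, size=(1, 4))   # 4 hidden nodes -> 1 output node
eta, target_mse = 0.5, 0.01

for epoch in range(20_000):             # computational bound
    sq_errors = []
    for x_p, d_p in zip(X, D):          # for each input pattern
        x = np.concatenate(([1.0], x_p))
        x1 = S(W10 @ x)                 # hidden node outputs
        o = S(W21 @ x1)                 # network outputs
        sq_errors.append(np.sum((d_p - o) ** 2))
        delta_k = (d_p - o) * o * (1 - o)
        delta_j = x1 * (1 - x1) * (W21.T @ delta_k)
        W21 += eta * np.outer(delta_k, x1)   # modify hidden -> output weights
        W10 += eta * np.outer(delta_j, x)    # modify input -> hidden weights
    if np.mean(sq_errors) < target_mse:      # MSE below desired threshold
        break

outputs = [S(W21 @ S(W10 @ np.concatenate(([1.0], x))))[0] for x in X]
print(epoch, np.round(outputs, 2))  # on a successful run, close to [0, 1, 1, 0]
```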


Supervised Function Approximation

There is a tradeoff between a network’s ability to precisely learn the given exemplars and its ability to generalize (i.e., inter- and extrapolate).

This problem is similar to fitting a function to a given set of data points.

Let us assume that you want to find a fitting function f: R → R for a set of three data points.

You try to do this with polynomials of degree one (a straight line), two, and nine.
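A quick sketch of this comparison using NumPy's polynomial fitting; the toy data set (ten noisy samples of a quadratic rather than just three points) is an illustrative assumption so that the degree-9 fit is well defined:

```python
import numpy as np

# Toy data (illustrative): noisy samples of an underlying quadratic.
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 10)
y = 1.0 + 0.5 * x - x**2 + rng.normal(scale=0.2, size=x.size)

x_dense = np.linspace(-2, 2, 200)
for degree in (1, 2, 9):
    coeffs = np.polyfit(x, y, degree)          # least-squares polynomial fit
    fit = np.polyval(coeffs, x_dense)
    train_err = np.sum((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree}: training error {train_err:.4f}, "
          f"fit range on [-2, 2]: [{fit.min():.1f}, {fit.max():.1f}]")
# The degree-9 fit drives the training error toward zero but oscillates
# between the samples -- the polynomial analogue of an oversized network.
```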


Supervised Function Approximation

Obviously, the polynomial of degree 2 provides the most plausible fit.

[Figure: the data points with the fitted polynomials of degree 1, 2, and 9 plotted as curves of f(x) over x.]


Supervised Function Approximation

The same principle applies to ANNs:
• If an ANN has too few neurons, it may not have enough degrees of freedom to precisely approximate the desired function.
• If an ANN has too many neurons, it will learn the exemplars perfectly, but its additional degrees of freedom may cause it to show implausible behavior for untrained inputs; it then generalizes poorly.

Unfortunately, there are no known equations that could tell you the optimal size of your network for a given application; there are only heuristics.


Creating Data Representations

The problem with some data representations is that the meaning of the output of one neuron depends on the output of other neurons.

This means that each neuron does not represent (detect) a certain feature, but groups of neurons do.

In general, such functions are much more difficult to learn.

Such networks usually need more hidden neurons and longer training, and their ability to generalize is weaker than for the one-neuron-per-feature-value networks.


Creating Data Representations

On the other hand, sets of orthogonal vectors (such as 100, 010, 001) representing individual features can be processed by the network more easily.

This becomes clear when we consider that a neuron’s net input signal is computed as the inner product of the input and weight vectors.

The geometric interpretation of these vectors shows that orthogonal vectors are especially easy to discriminate for a single neuron.


Creating Data Representations

Another way of representing n-ary data in a neural network is using one neuron per feature, but scaling the (analog) value to indicate the degree to which a feature is present.

Good examples:
• the brightness of a pixel in an input image
• the distance between a robot and an obstacle

Poor examples:
• the letter (1 – 26) of a word
• the type (1 – 6) of a chess piece
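A small sketch contrasting these representations; the dictionaries and helper functions are illustrative choices, not a prescribed encoding:

```python
import numpy as np

# Poor representation: a single analog value 1..6 for the chess piece type
# suggests an ordering and magnitude that the categories do not have.
piece_as_scalar = {"pawn": 1, "knight": 2, "bishop": 3,
                   "rook": 4, "queen": 5, "king": 6}

# Better: one orthogonal (one-hot) vector per category, so each input
# neuron represents exactly one feature value.
pieces = ["pawn", "knight", "bishop", "rook", "queen", "king"]
def one_hot(piece):
    v = np.zeros(len(pieces))
    v[pieces.index(piece)] = 1.0
    return v

# Good analog use: scale a genuinely continuous quantity into [0, 1],
# e.g. pixel brightness 0..255.
def scaled_brightness(value):
    return value / 255.0

print(one_hot("bishop"))       # [0. 0. 1. 0. 0. 0.]
print(scaled_brightness(128))  # approx. 0.502
```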