Beyond Linear Separability

Page 1

Beyond Linear Separability

Page 2

Limitations of Perceptron

• Only linear separations
• Only converges for linearly separable data

One Solution (SVMs)

Map data into a feature space where they are linearly separable (see the example below).
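For instance (an illustration added here, not taken from the slides), the XOR problem becomes linearly separable once each point is mapped into a feature space that includes the product of its coordinates:

\[
\phi(x_1, x_2) = (x_1,\; x_2,\; x_1 x_2)
\]

In this space the hyperplane \(x_1 + x_2 - 2\,x_1 x_2 = \tfrac{1}{2}\) puts the two XOR classes on opposite sides.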

Another solution is to use multiple interconnected perceptrons.

Page 3

Artificial Neural Networks

• Interconnected networks of simple units (let's call them "artificial neurons") in which each connection has a weight.
  – Weight w_ij is the weight of the i-th input into unit j.
  – NNs have some inputs where the feature values are placed, and they compute one or more output values.
  – Depending on the output value we determine the class.
  – If there is more than one output unit, we choose the one with the greatest value.
• Learning takes place by adjusting the weights in the network so that the desired output is produced whenever a sample from the input data set is presented.

Page 4

Single Perceptron Unit

• We start by looking at a simpler kind of "neural-like" unit called a perceptron.

\[
h(\mathbf{x}) = \mathrm{sign}(\mathbf{w}\cdot\mathbf{x}),
\qquad
\mathbf{w}\cdot\mathbf{x} = \sum_{j=0}^{n} w_j x_j
\]

Let x_0 = 1 and w_0 = -b; then the hyperplane equation \(\mathbf{w}\cdot\mathbf{x} = 0\) is equivalent to \(\sum_{j=1}^{n} w_j x_j - b = 0\).

Depending on the value of h(x) it outputs one class or the other.
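A minimal code sketch (not from the slides; the weights are chosen to implement logical AND) of what such a unit computes:

```python
import numpy as np

def perceptron_output(w, b, x):
    """Single perceptron unit: output 1 if w.x exceeds the threshold b, else 0."""
    return 1 if np.dot(w, x) - b > 0 else 0

# Illustrative weights implementing logical AND of two binary inputs.
w, b = np.array([1.0, 1.0]), 1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_output(w, b, np.array(x)))
```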

Page 5

Beyond Linear Separability

• Since a single perceptron unit can only define a single linear boundary, it is limited to solving linearly separable problems.

• A problem like that illustrated by the values of the XOR boolean function cannot be solved by a single perceptron unit.
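To see why (a short argument added here, not spelled out on the slide): if the unit outputs 1 whenever \(w_1 x_1 + w_2 x_2 + w_0 > 0\), the four XOR cases require

\[
w_0 \le 0,
\qquad
w_1 + w_0 > 0,
\qquad
w_2 + w_0 > 0,
\qquad
w_1 + w_2 + w_0 \le 0 .
\]

Adding the two middle inequalities gives \(w_1 + w_2 + 2 w_0 > 0\), while the first and last give \(w_1 + w_2 + 2 w_0 \le 0\), so no choice of weights can satisfy all four.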

Page 6

Multi-Layer Perceptron

• Solution: Combine multiple linear separators.
• The introduction of "hidden" units into NNs makes them much more powerful:
  – they are no longer limited to linearly separable problems.
• Earlier layers transform the problem into more tractable problems for the later layers.

Page 7

Example: XOR problem

Output: “class 0” or “class 1”

Page 8

Example: XOR problem

Page 9

Example: XOR problem

The output unit separates the hidden-unit outputs \(o_1, o_2\) with the hyperplane

\[
w_{23}\, o_2 + w_{13}\, o_1 + w_{03} = 0,
\qquad
w_{03} = -\tfrac{1}{2},\; w_{13} = -1,\; w_{23} = 1,
\]

i.e. \(o_2 - o_1 - \tfrac{1}{2} = 0\).
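A small sketch of the complete solution in code. Only the output-unit weights (w_03 = -1/2, w_13 = -1, w_23 = 1) come from the slide; the hidden-unit weights are assumed, chosen so that unit 1 computes AND and unit 2 computes OR:

```python
def step(z):
    """Hard threshold unit."""
    return 1.0 if z > 0 else 0.0

def xor_net(x1, x2):
    o1 = step(x1 + x2 - 1.5)      # hidden unit 1: AND(x1, x2)  (assumed weights)
    o2 = step(x1 + x2 - 0.5)      # hidden unit 2: OR(x1, x2)   (assumed weights)
    return step(o2 - o1 - 0.5)    # output unit: o2 - o1 - 1/2 > 0 (weights from the slide)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", int(xor_net(x1, x2)))   # reproduces the XOR truth table
```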

Page 10

Multi-Layer Perceptron

• Any set of training points can be separated by a three-layer perceptron network.
• “Almost any” set of points is separable by a two-layer perceptron network.

Page 11

Autonomous Land Vehicle In a Neural Network (ALVINN)

• ALVINN is an automatic steering system for a car based on input from a camera mounted on the vehicle.
  – Successfully demonstrated in a cross-country trip.

Page 12

ALVINN

• The ALVINN neural network is shown here. It has
  – 960 inputs (a 30x32 array derived from the pixels of an image),
  – 4 hidden units, and
  – 30 output units (each representing a steering command).

Page 13

Optional material

Page 14

Multi-Layer Perceptron Learning

• However, the presence of the discontinuous threshold in the perceptron's operation means that there is no simple local search for a good set of weights;
  – one is forced into trying possibilities in a combinatorial way.

• The limitations of the single-layer perceptron and the lack of a good learning algorithm for multi-layer perceptrons essentially killed the field for quite a few years.

Page 15

Soft Threshold

• A natural question to ask is whether we could use gradient ascent/descent to train a multi-layer perceptron.
• The answer is that we can't as long as the output is discontinuous with respect to changes in the inputs and the weights.
  – In a perceptron unit it doesn't matter how far a point is from the decision boundary; we will still get a 0 or a 1.

• We need a smooth output (as a function of changes in the network weights) if we're to do gradient descent.

Page 16

Sigmoid Unit

• The classic "soft threshold" that is used in neural nets is referred to as a "sigmoid" (meaning S-like) and is shown here.

– The variable z is the "total input" or "activation" of a neuron, that is, the weighted sum of all of its inputs.

– Note that when the input (z) is 0, the sigmoid's value is 1/2.

– The sigmoid is applied to the weighted inputs (including the threshold value as before).

• There are actually many different types of sigmoids that can be (and are) used in neural networks.

– The sigmoid shown here is actually called the logistic function.
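For reference (the figure itself is not reproduced here), the logistic sigmoid referred to above is

\[
s(z) = \frac{1}{1 + e^{-z}},
\qquad
s(0) = \tfrac{1}{2}.
\]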

Page 17

Training

• The key property of the sigmoid is that it is differentiable.
  – This means that we can use gradient-based methods of minimization for training.

• The output of a multi-layer net of sigmoid units is a function of two vectors, the inputs (x) and the weights (w).

• The output of this function (y) varies smoothly with changes in the input and, importantly, with changes in the weights.

Page 18

Training

Page 19

Training

• Given a set of training points, each of which specifies the net inputs and the desired outputs, we can write an expression for the training error, usually defined as the sum of the squared differences between the actual output (given the weights) and the desired output.
• The goal of training is to find a weight vector that minimizes the training error.
• We could also use the mean squared error (MSE), which divides the sum of the squared errors by the number of training points instead of just by 2. Since the number of training points is a constant, the minimizing weights are not affected.
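Written out, with \(y(\mathbf{x}^m, \mathbf{w})\) the network output on the m-th training point and \(y^m\) the desired output (the notation used on the following slides):

\[
E(\mathbf{w}) = \frac{1}{2}\sum_{m}\bigl(y(\mathbf{x}^m,\mathbf{w}) - y^m\bigr)^2,
\qquad
\mathrm{MSE}(\mathbf{w}) = \frac{1}{M}\sum_{m=1}^{M}\bigl(y(\mathbf{x}^m,\mathbf{w}) - y^m\bigr)^2 .
\]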

Page 20

Training

Page 21

Gradient Descent

\[
E(\mathbf{w}) = \frac{1}{2}\sum_{m}\bigl(y(\mathbf{x}^m,\mathbf{w}) - y^m\bigr)^2
\]

\[
\nabla_{\mathbf{w}} E
= \sum_{m}\bigl(y(\mathbf{x}^m,\mathbf{w}) - y^m\bigr)\,\nabla_{\mathbf{w}}\, y(\mathbf{x}^m,\mathbf{w}),
\qquad
\text{where }\nabla_{\mathbf{w}} = \Bigl(\frac{\partial}{\partial w_1},\ldots,\frac{\partial}{\partial w_n}\Bigr)
\]

We change the weights as follows:

\[
\mathbf{w} \leftarrow \mathbf{w} - \eta\,\nabla_{\mathbf{w}} E
\]

We've seen that the simplest method for minimizing a differentiable function is gradient descent (or ascent if we're maximizing).

Recall that we are trying to find the weights that lead to a minimum value of the training error.

Here we see the gradient of the training error as a function of the weights.

The descent rule is basically to change the weights by taking a small step (determined by the learning rate η) in the direction opposite this gradient.

Online version: at each step we consider the error for only one data item.

Page 22

Gradient Descent – Single Unit

For a single sigmoid unit with inputs \(x_1,\ldots,x_n\):

\[
z = \sum_{i=1}^{n} w_i x_i = \mathbf{w}\cdot\mathbf{x},
\qquad
y = s(z) = \frac{1}{1 + e^{-z}}
\]

\[
\frac{\partial y}{\partial w_i}
= \frac{\partial y}{\partial z}\,\frac{\partial z}{\partial w_i}
= \frac{ds(z)}{dz}\, x_i,
\qquad
\nabla_{\mathbf{w}}\, y
= \Bigl(\frac{\partial y}{\partial w_1},\ldots,\frac{\partial y}{\partial w_n}\Bigr)
= \frac{ds(z)}{dz}\,\mathbf{x}
\]

Substituting in the equation of the previous slide we get, for the arbitrary i-th element, the delta rule:

\[
w_i \leftarrow w_i - \eta\,(y - y^m)\,\frac{ds(z)}{dz}\, x_i,
\qquad
\delta \equiv \frac{\partial E}{\partial z} = (y - y^m)\,\frac{ds(z)}{dz}
\]
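A minimal sketch (toy data and learning rate are made up for illustration) of online gradient descent with the delta rule for a single sigmoid unit:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy training set: (input vector, desired output); x[0] = 1 acts as the bias input.
data = [(np.array([1.0, 0.0, 0.0]), 0.0),
        (np.array([1.0, 1.0, 1.0]), 1.0)]

w = np.zeros(3)   # weights, including w0 for the bias input
eta = 0.5         # learning rate (assumed value)

for epoch in range(1000):
    for x, y_m in data:                       # online version: one data item at a time
        z = np.dot(w, x)                      # total input
        y = sigmoid(z)                        # unit output
        delta = (y - y_m) * y * (1.0 - y)     # dE/dz, using s'(z) = s(z)(1 - s(z))
        w -= eta * delta * x                  # delta rule update

# Outputs move toward the desired 0 and 1 as training proceeds.
print(w, [round(float(sigmoid(np.dot(w, x))), 2) for x, _ in data])
```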

Page 23

Derivative of the sigmoid

\[
s(z) = \frac{1}{1 + e^{-z}}
\]

\[
\frac{ds(z)}{dz}
= \frac{d}{dz}\,\bigl(1 + e^{-z}\bigr)^{-1}
= \bigl[-(1 + e^{-z})^{-2}\bigr]\bigl[-e^{-z}\bigr]
= \frac{1}{1 + e^{-z}}\cdot\frac{e^{-z}}{1 + e^{-z}}
= s(z)\,\bigl(1 - s(z)\bigr)
\]
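A quick numerical sanity check of this identity against a central-difference approximation (purely illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, h = 0.7, 1e-6
analytic = sigmoid(z) * (1.0 - sigmoid(z))               # s(z)(1 - s(z))
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)     # finite difference
print(analytic, numeric)   # the two values agree to high precision
```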

Page 24

Generalized Delta Rule

Consider a hidden unit 4 with inputs \(y_1, y_2\), whose output \(y_4\) feeds units 5 and 6. The updates for the weights into unit 4 have the same form as before:

\[
w_{14} \leftarrow w_{14} - \eta\,\delta_4\, y_1,
\qquad
w_{24} \leftarrow w_{24} - \eta\,\delta_4\, y_2
\]

Now, let's compute \(\delta_4\). \(z_4\) will influence E only indirectly, through \(z_5\) and \(z_6\):

\[
\delta_4 = \frac{\partial E}{\partial z_4}
= \frac{\partial E}{\partial z_5}\,\frac{\partial z_5}{\partial z_4}
+ \frac{\partial E}{\partial z_6}\,\frac{\partial z_6}{\partial z_4}
= \delta_5\,\frac{\partial z_5}{\partial z_4}
+ \delta_6\,\frac{\partial z_6}{\partial z_4}
\]

Page 25

Expanding each factor with the chain rule and the derivative of the sigmoid:

\[
\frac{\partial z_5}{\partial z_4}
= \frac{\partial z_5}{\partial y_4}\,\frac{\partial y_4}{\partial z_4}
= w_{45}\,\frac{ds(z_4)}{dz_4}
= w_{45}\, s(z_4)\bigl(1 - s(z_4)\bigr)
= w_{45}\, y_4\,(1 - y_4),
\]

and similarly \(\partial z_6/\partial z_4 = w_{46}\, y_4\,(1 - y_4)\). Therefore

\[
\delta_4
= \delta_5\, w_{45}\, y_4(1 - y_4) + \delta_6\, w_{46}\, y_4(1 - y_4)
= y_4\,(1 - y_4)\,\bigl(w_{45}\,\delta_5 + w_{46}\,\delta_6\bigr)
\]

Page 26

Generalized Delta Rule

For the example unit 4:

\[
w_{14} \leftarrow w_{14} - \eta\,\delta_4\, y_1,
\qquad
w_{24} \leftarrow w_{24} - \eta\,\delta_4\, y_2,
\qquad
\delta_4 = y_4(1 - y_4)\bigl(w_{45}\,\delta_5 + w_{46}\,\delta_6\bigr)
\]

In general, for a hidden unit j we have

\[
\delta_j = y_j\,(1 - y_j)\sum_{k \in \mathrm{downstream}(j)} w_{jk}\,\delta_k,
\qquad
w_{ij} \leftarrow w_{ij} - \eta\,\delta_j\, y_i
\]

Page 27

Generalized Delta Rule

For an output unit n we have

\[
\delta_n = \frac{\partial E}{\partial z_n}
= \frac{\partial E}{\partial y_n}\,\frac{\partial y_n}{\partial z_n}
= \frac{\partial E}{\partial y_n}\,\frac{ds(z_n)}{dz_n}
= \frac{\partial E}{\partial y_n}\, y_n\,(1 - y_n)
\]

With the squared-error objective \(E = \tfrac{1}{2}\sum_{n}(y_n - y_n^m)^2\), we get \(\partial E/\partial y_n = y_n - y_n^m\), so

\[
\delta_n = y_n\,(1 - y_n)\,(y_n - y_n^m),
\qquad
w_{in} \leftarrow w_{in} - \eta\,\delta_n\, y_i
\]

Page 28

Backpropagation Algorithm

1. Initialize weights to small random values
2. Choose a random sample training item, say (x^m, y^m)
3. Compute total input z_j and output y_j for each unit (forward propagation)
4. Compute δ_n for the output layer: δ_n = y_n(1 - y_n)(y_n - y_n^m)
5. Compute δ_j for all preceding layers by the backprop rule
6. Compute the weight changes by the descent rule (repeat for all weights)

• Note that each expression involves only data local to a particular unit; we don't have to look around summing things over the whole network.
• It is for this reason (simplicity, locality and, therefore, efficiency) that backpropagation has become the dominant paradigm for training neural nets.
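A compact sketch of these six steps for a small fully connected sigmoid network (the 2-2-1 layer sizes, toy XOR data, learning rate, and initialization are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 2-2-1 network; each weight matrix has an extra column for a bias fed by a constant 1.
W1 = rng.uniform(-0.5, 0.5, size=(2, 3))   # step 1: small random hidden-layer weights
W2 = rng.uniform(-0.5, 0.5, size=(1, 3))   # step 1: small random output-layer weights
eta = 0.5

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # toy data: XOR
Y = np.array([0.0, 1.0, 1.0, 0.0])

for it in range(20000):
    i = rng.integers(len(X))                           # step 2: random training item
    x, y_m = np.append(X[i], 1.0), Y[i]                # append the bias input x0 = 1
    h = sigmoid(W1 @ x)                                # step 3: forward propagation
    h_b = np.append(h, 1.0)
    y = sigmoid(W2 @ h_b)[0]
    delta_out = y * (1 - y) * (y - y_m)                # step 4: output-layer delta
    delta_hid = h * (1 - h) * (W2[0, :2] * delta_out)  # step 5: backprop rule for hidden layer
    W2 -= eta * delta_out * h_b                        # step 6: descent rule
    W1 -= eta * np.outer(delta_hid, x)

for x in X:                                                  # with successful training the
    h_b = np.append(sigmoid(W1 @ np.append(x, 1.0)), 1.0)    # outputs approach 0, 1, 1, 0
    print(x, round(float(sigmoid(W2 @ h_b)[0]), 2))
```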

Page 29

Training Neural Nets

• Now that we have looked at the basic mathematical techniques for minimizing the training error of a neural net, we should step back and look at the whole approach to training a neural net, keeping in mind the potential problem of overfitting.

• Here we look at a methodology that attempts to minimize that danger.

Page 30

Training Neural Nets

Given: a data set, desired outputs, and a neural net with m weights.
Find: a setting for the weights that will give good predictive performance on new data.

1. Split the data set into three subsets:
   i. Training set – used for adjusting the weights
   ii. Validation set – used to stop training
   iii. Test set – used to evaluate performance
2. Pick random, small weights as initial values
3. Perform iterative minimization of the error over the training set (backprop)
4. Stop when the error on the validation set reaches a minimum (to avoid overfitting)
5. Repeat training (from step 2) several times (to avoid local minima)
6. Use the best weights to compute the error on the test set.
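A schematic of steps 2-5 in code. The helpers init_weights, backprop_epoch, and error are assumed to be supplied by the caller (for example, built around the backpropagation sketch above); everything else is a placeholder:

```python
def train_with_validation(init_weights, backprop_epoch, error,
                          train_set, val_set, restarts=5, max_epochs=1000):
    """Random restarts, iterative error minimization, and early stopping
    when the validation error stops improving; returns the best weights."""
    best_w, best_val = None, float("inf")
    for _ in range(restarts):                 # step 5: repeat to avoid local minima
        w = init_weights()                    # step 2: small random initial weights
        prev_val = float("inf")
        for _ in range(max_epochs):
            w = backprop_epoch(w, train_set)  # step 3: minimize error on the training set
            val = error(w, val_set)
            if val >= prev_val:               # step 4: validation error has stopped
                break                         #         decreasing, stop to avoid overfitting
            prev_val = val
        if prev_val < best_val:               # keep the best of the repeated runs
            best_w, best_val = w, prev_val
    return best_w                             # step 6: evaluate these weights on the test set
```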

Page 31

Backpropagation Example

[Figure: a two-input, two-hidden-unit, one-output network. Inputs x_1, x_2 and a constant -1 feed hidden units 1 and 2 (total inputs z_1, z_2, outputs y_1, y_2); the hidden outputs and a constant -1 feed output unit 3 (total input z_3, output y_3). w_ij is the weight from input/unit i into unit j, with w_0j the weight on the constant -1 input.]

First do forward propagation: compute the z_i's and y_i's.

Then compute the deltas, starting from the output unit:

\[
\delta_3 = y_3\,(1 - y_3)\,(y_3 - y^m),
\qquad
\delta_1 = y_1\,(1 - y_1)\, w_{13}\,\delta_3,
\qquad
\delta_2 = y_2\,(1 - y_2)\, w_{23}\,\delta_3
\]

Finally, update the weights by the descent rule:

\[
\begin{aligned}
w_{03} &\leftarrow w_{03} - \eta\,\delta_3\,(-1), &
w_{13} &\leftarrow w_{13} - \eta\,\delta_3\, y_1, &
w_{23} &\leftarrow w_{23} - \eta\,\delta_3\, y_2,\\
w_{02} &\leftarrow w_{02} - \eta\,\delta_2\,(-1), &
w_{12} &\leftarrow w_{12} - \eta\,\delta_2\, x_1, &
w_{22} &\leftarrow w_{22} - \eta\,\delta_2\, x_2,\\
w_{01} &\leftarrow w_{01} - \eta\,\delta_1\,(-1), &
w_{11} &\leftarrow w_{11} - \eta\,\delta_1\, x_1, &
w_{21} &\leftarrow w_{21} - \eta\,\delta_1\, x_2.
\end{aligned}
\]