
Lecture 9 Perceptron


DESCRIPTION

Feature representation, Perceptron, Margin and Separability, Main Theorem.


Page 1: Lecture 9 Perceptron

Machine Learning for Language Technology, Lecture 9: Perceptron

Marina Santini, Department of Linguistics and Philology, Uppsala University, Uppsala, Sweden

Autumn 2014

Acknowledgement: Thanks to Prof. Joakim Nivre for course design and materials

Page 2: Lecture 9 Perceptron

Inputs and Outputs

Page 3: Lecture 9 Perceptron

Feature Representation

Page 4: Lecture 9 Perceptron

Features and Classes

Page 5: Lecture 9 Perceptron

Examples (i)

Page 6: Lecture 9 Perceptron

Examples (ii)

Page 7: Lecture 9 Perceptron

Block Feature Vectors

Page 8: Lecture 9 Perceptron

Representation


Pages 9-14: Lecture 9 Perceptron
Page 15: Lecture 9 Perceptron

Linear classifiers (atomic classes)


• Assumption: the data must be linearly separable

Page 16: Lecture 9 Perceptron

Perceptron  

Page 17: Lecture 9 Perceptron

Perceptron (i)

Page 18: Lecture 9 Perceptron

Perceptron Learning Algorithm

Page 19: Lecture 9 Perceptron

Separability and Margin (i)

Page 20: Lecture 9 Perceptron

Separability and Margin (ii)


• Given a training instance, let Ȳ_t ("Y bar t") be the set of all incorrect labels for that instance, i.e. the full set of labels minus the correct label.

• Then we say that a training set is separable with margin gamma (γ) if there exists a weight vector w with norm 1 such that:

the score that we get for the correct label when we use this vector w, minus the score of every incorrect label, is at least γ (as formalized below).
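In symbols, a standard formalization of this condition, assuming a joint feature function f(x, y) and score w · f(x, y) (the slides' exact notation may differ):

\|\mathbf{w}\| = 1 \quad \text{and} \quad \mathbf{w} \cdot \mathbf{f}(x_t, y_t) - \mathbf{w} \cdot \mathbf{f}(x_t, y') \;\geq\; \gamma \quad \text{for all } t \text{ and all } y' \in \bar{Y}_t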

Page 21: Lecture 9 Perceptron

Separability and Margin (iii)

• IMPORTANT: for every training instance, the score that we get for the correct label when we use the weight vector w, minus the score of every incorrect label, is at least a certain margin gamma (γ). That is, the margin γ is the smallest difference between the score of the correct class and the best score among the incorrect classes.

 

The larger the weights, the greater the norm, and we want the norm to be 1 (normalization).

There are different ways of measuring the length/magnitude of a vector, and they are known as norms. The Euclidean norm (or L2 norm) says: take all the values of the weight vector, square them, sum them up, and then take the square root.
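Written out for a weight vector w = (w_1, ..., w_n):

\|\mathbf{w}\|_2 = \sqrt{w_1^2 + w_2^2 + \cdots + w_n^2}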

Page 22: Lecture 9 Perceptron

Perceptron  


Page 23: Lecture 9 Perceptron

Perceptron Learning Algorithm


Page 24: Lecture 9 Perceptron

Main Theorem

Page 25: Lecture 9 Perceptron


Perceptron Theorem

• For any training set that is separable with some margin, we can prove that the number of mistakes made during training, if we keep iterating over the training set, is bounded by a quantity that depends on the size of the margin (see the proofs in the Appendix and the slides of Lecture 3).

• R depends on the norm of the largest difference you can have between feature vectors. The larger R, the more spread out the data and the more errors we can potentially make. Similarly, if gamma is larger, we will make fewer mistakes. The bound is written out below.
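In its standard form (symbols follow the formalization given above and are not copied from the slides), the bound is:

\text{number of mistakes} \;\leq\; \frac{R^2}{\gamma^2}, \qquad \text{where } R = \max_{t,\; y' \in \bar{Y}_t} \big\| \mathbf{f}(x_t, y_t) - \mathbf{f}(x_t, y') \big\|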

Page 26: Lecture 9 Perceptron

Summary  

Page 27: Lecture 9 Perceptron

Basically…  


.... if it is possible to find such a weight vector for some positive margin gamma, then the training set is separable.

So... if the training set is separable, Perceptron will eventually find a weight vector that separates the data. The time it takes depends on the properties of the data, but after a finite number of iterations the number of training errors will drop to 0.

However... although we find a weight vector that perfectly separates the training data, it might be the case that the classifier does not generalize well (do you remember the difference between empirical error and generalization error?).

So, with Perceptron, we have a fixed norm (= 1) and a variable margin (> 0).
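A minimal sketch of this training loop in Python, using the block feature vectors described earlier; it is an illustration under assumed names (feats, train_perceptron), not the slides' own pseudocode:

import numpy as np

def feats(x, y, n_classes):
    # Block feature vector: copy the input features into the block of class y,
    # leaving all other blocks zero (cf. the "Block Feature Vectors" slide).
    n_feats = len(x)
    phi = np.zeros(n_classes * n_feats)
    phi[y * n_feats:(y + 1) * n_feats] = x
    return phi

def train_perceptron(data, n_classes, max_epochs=100):
    # data: list of (x, y) pairs, x a NumPy feature vector, y an integer class label.
    n_feats = len(data[0][0])
    w = np.zeros(n_classes * n_feats)
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            # Predict the highest-scoring class under the current weights.
            scores = [w @ feats(x, k, n_classes) for k in range(n_classes)]
            y_hat = int(np.argmax(scores))
            if y_hat != y:
                # Perceptron update: promote the correct class, demote the predicted one.
                w += feats(x, y, n_classes) - feats(x, y_hat, n_classes)
                mistakes += 1
        if mistakes == 0:
            # A full pass with no errors: the training set is separated, so stop.
            break
    return w

For example, train_perceptron([(np.array([1.0, 0.0]), 0), (np.array([0.0, 1.0]), 1)], n_classes=2) converges on this toy two-class set after a couple of passes. If the data are not separable, the sketch simply stops after max_epochs passes; it says nothing about generalization, only about the training-set behaviour discussed above.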

Page 28: Lecture 9 Perceptron

Appendix: Proofs and Derivations

Pages 29-37: Lecture 9 Perceptron