The Perceptron Algorithm
Tirgul (Recitation) 3
November 2016
Agnostic PAC Learnability

A hypothesis class ℋ is agnostic PAC learnable if there exists a function m_ℋ : (0,1)² → ℕ and a learning algorithm with the following property: for every 𝜖, 𝛿 ∈ (0,1), and for every distribution 𝒟 over 𝒳 × 𝒴, when running the learning algorithm on m ≥ m_ℋ(𝜖, 𝛿) i.i.d. examples generated by 𝒟, the algorithm returns a hypothesis h such that, with probability of at least 1 − 𝛿 (over the choice of the m training examples),

L𝒟(h) ≤ min_{h′∈ℋ} L𝒟(h′) + 𝜖
Goal: find h = argmin_{h′∈ℋ} L𝒟(h′).
Agnostic PAC learnability relaxes this to: L𝒟(h) ≤ min_{h′∈ℋ} L𝒟(h′) + 𝜖.
When Life Gives You Lemons, Make Lemonade
• We do have our sample set S,
• and we hope it represents the distribution pretty well (the i.i.d. assumption).
• So why can't we just minimize the error over the training set?
• In other words: Empirical Risk Minimization.
Empirical Risk Minimization

L_S(h) = |{i ∈ [m] : h(xᵢ) ≠ yᵢ}| / m

• Examples: the Consistent algorithm, the Halving algorithm.
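As a quick illustration (a minimal sketch, not from the slides; the names are my own), the empirical risk above is just the fraction of misclassified training examples:

```python
def empirical_risk(h, xs, ys):
    """L_S(h): fraction of the m training examples that h misclassifies."""
    m = len(xs)
    return sum(1 for x, y in zip(xs, ys) if h(x) != y) / m

# Toy check: a constant predictor on a small labeled sample.
h = lambda x: 1
xs = [(0, 0), (0, 1), (1, 0), (1, 1)]
ys = [-1, 1, 1, 1]
print(empirical_risk(h, xs, ys))  # 0.25: one of four examples is wrong
```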
Linear Predictors: ERM Approach
Introduction
• Linear predictors are:
• Efficient
• Intuitive
• Able to fit the data reasonably well in many natural learning problems
• Several hypothesis classes: linear regression, logistic regression, the Perceptron.
Example
Linear Predictors
• The different hypothesis classes of linear predictors are compositions of a function 𝜙 : ℝ → 𝒴 on ℋ:
• Binary classification: 𝜙 is the sign function, sgn(x).
• Regression: 𝜙 is the identity function, f(x) = x.
Halfspaces
• Designed for binary classification problems: 𝒳 = ℝᵈ, 𝒴 = {±1}
• ℋ_halfspaces = {x ↦ sign(⟨w, x⟩ + b) : w ∈ ℝᵈ, b ∈ ℝ}
• Geometric illustration (d = 2): each hypothesis forms a hyperplane that is perpendicular to the vector w.
• Instances "above" the hyperplane are labeled positively;
• instances "below" the hyperplane are labeled negatively.
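A halfspace hypothesis is easy to state in code (a sketch; the function name is my own, and ties are labeled negatively to match the mistake condition used later):

```python
def halfspace_predict(w, b, x):
    """Label x in {-1, +1} by the sign of <w, x> + b."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if activation > 0 else -1

print(halfspace_predict([2.0, -1.0], 0.5, [1.0, 1.0]))  # 2 - 1 + 0.5 = 1.5 > 0, so 1
```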
Adding a Bias
• Add b (a bias) into w as an extra coordinate: w′ = (b, w₁, w₂, …, w_d) ∈ ℝᵈ⁺¹
• …and add a value of 1 to every x ∈ 𝒳: x′ = (1, x₁, x₂, …, x_d) ∈ ℝᵈ⁺¹
• Thus, each affine function in ℝᵈ can be rewritten as a homogeneous linear function in ℝᵈ⁺¹, since ⟨w′, x′⟩ = ⟨w, x⟩ + b.
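The homogeneous trick in code (my own sketch; `augment` is a hypothetical helper name):

```python
def augment(w, b, x):
    """Fold the bias b into the weight vector and prepend a constant 1 to x,
    so that <w', x'> equals <w, x> + b."""
    w_prime = [b] + list(w)
    x_prime = [1.0] + list(x)
    return w_prime, x_prime

w, b, x = [2.0, -1.0], 0.5, [1.0, 1.0]
w2, x2 = augment(w, b, x)
affine = sum(wi * xi for wi, xi in zip(w, x)) + b      # <w, x> + b
homogeneous = sum(wi * xi for wi, xi in zip(w2, x2))   # <w', x'>
print(affine == homogeneous)  # True
```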
The Dot Product
• Algebraic definition: w · x = Σᵢ₌₁ⁿ wᵢxᵢ = w₁x₁ + w₂x₂ + … + w_nx_n
• Notation: ⟨w, x⟩ = wᵀx
• Example: a = (0, 3), b = (4, 0)
• a · b = 0 · 4 + 3 · 0 = 0
The Dot Product
• Geometric definition: a · b = ‖a‖ ‖b‖ cos 𝜃
• Where:
• ‖x‖ is the magnitude (length) of the vector x,
• 𝜃 is the angle between a and b.
• If 𝜃 = 90°: a · b = 0
• If 𝜃 = 0°: a · b = ‖a‖ ‖b‖
• This implies that the dot product of a vector with itself is a · a = ‖a‖²,
• which gives ‖a‖ = √(a · a):
• the formula for the Euclidean length of the vector.
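Both definitions can be checked numerically on the example from the slides (a quick sketch):

```python
import math

def dot(a, b):
    """Algebraic definition: sum of coordinate-wise products."""
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(a):
    """Euclidean length via ||a|| = sqrt(a . a)."""
    return math.sqrt(dot(a, a))

a, b = (0.0, 3.0), (4.0, 0.0)
print(dot(a, b))         # 0.0: the vectors are perpendicular (theta = 90 degrees)
print(norm(a), norm(b))  # 3.0 4.0

# Geometric definition: a . b = ||a|| ||b|| cos(theta); here theta = pi/2, cos = 0.
theta = math.pi / 2
print(math.isclose(norm(a) * norm(b) * math.cos(theta), dot(a, b), abs_tol=1e-9))  # True
```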
The Decision Boundary
• The Perceptron tries to find a straight line that separates the positive examples from the negative ones:
• a line in 2D, a plane in 3D, a hyperplane in higher dimensions.
• This is called a decision boundary.
The Linearly Separable Case
• The linearly separable case: a perfect decision boundary exists
• (the "realizable" case).
• "Separable" means it is possible to separate all the positive examples from all the negative ones with a hyperplane.
Finding an ERM Halfspace
• In the separable case:
• Linear programming
• The Perceptron algorithm (Rosenblatt, 1957)
• In the non-separable case:
• Learn a halfspace that minimizes a different loss function,
• e.g. logistic regression.
Perceptron
• xᵢ: inputs; wᵢ: weights
• The inputs xᵢ are multiplied by the weights wᵢ, and the neuron sums their values.
• If the sum is greater than the threshold 𝜃, the neuron fires (outputs 1);
• otherwise, it does not.
Finding 𝜃
• We now need to learn both w and 𝜃.
• Reminder: we added a bias,
• so we already have an adjustable threshold:
• fix a constant input x₀ = 1 and let its weight w₀ absorb the threshold 𝜃,
• so we don't need to learn another parameter.
• 𝜃 is equivalent to the parameter b we mentioned previously.
Perceptron for Halfspaces
• Our goal is to have yᵢ⟨w, xᵢ⟩ > 0 for all i.
• After an update on a mistaken example (xᵢ, yᵢ):
• yᵢ⟨w^(t+1), xᵢ⟩ = yᵢ⟨w^(t) + yᵢxᵢ, xᵢ⟩ = yᵢ⟨w^(t), xᵢ⟩ + ‖xᵢ‖² (using yᵢ² = 1)
• The update rule therefore makes the Perceptron "more correct" on the i-th example.
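The update rule above is essentially the whole algorithm. A minimal sketch (my own code, with the bias folded into w via the homogeneous trick, and the dataset a made-up separable example):

```python
def perceptron(xs, ys, max_epochs=100):
    """Perceptron: cycle through the data, updating w <- w + y*x on every
    mistake (y * <w, x> <= 0), until a full pass is mistake-free.
    Each x is assumed to already include the constant bias coordinate 1."""
    w = [0.0] * len(xs[0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(xs, ys):
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # separable case: w now classifies everything correctly
            return w
    return w  # non-separable (or not yet converged): return the last w

# Usage on a tiny separable set: points with x2 > x1 are positive.
xs = [(1.0, 0.0, 1.0), (1.0, 1.0, 2.0), (1.0, 2.0, 1.0), (1.0, 2.0, 0.0)]
ys = [1, 1, -1, -1]
w = perceptron(xs, ys)
print(all(y * sum(wi * xi for wi, xi in zip(w, x)) > 0 for x, y in zip(xs, ys)))  # True
```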
The Learning Rate 𝜂
• The update rule: w^(t+1) = w^(t) + yᵢxᵢ
• We could add a parameter 𝜂: w^(t+1) = w^(t) + 𝜂yᵢxᵢ
• 𝜂 controls how much the weights change on each update.
• In the separable case 𝜂 has no effect (proof: homework).
• If 𝜂 = 1: the weights change a lot whenever there is a wrong answer, giving an unstable network that never "settles down".
• If 𝜂 is very small: the weights need to see inputs more often before they change significantly, so the network takes longer to learn.
• Typically we choose 0.1 ≤ 𝜂 ≤ 0.4.
Example: Logic Function OR
• Data of the OR logic function and a plot of the data points:

x₁  x₂  y
0   0   −1
0   1    1
1   0    1
1   1    1
A Feasibility Problem
• Suppose the algorithm found a weight vector that classifies all of the examples correctly.
• There are many different weight vectors that give correct outputs!
• We are interested in finding any set of weights that works:
• a feasibility (or satisfiability) problem is the problem of finding any feasible solution, without regard to an objective value.
Example: Logic Function OR
• The Perceptron network (figure: two inputs and a constant bias input of 1).
Example: Logic Function OR
• We need to find the three weights. Initially: w^(1) = (0, 0, 0)
• First input: x₁ = (0, 0), y₁ = −1. Including the bias: (1, 0, 0)
• Value of the neuron: w^(1) · x₁ = 0 × 1 + 0 × 0 + 0 × 0 = 0
• → y₁(w^(1) · x₁) = 0 ≤ 0, so we update:
• w^(2) = w^(1) + (−1)(1, 0, 0) = (−1, 0, 0)
Example: Logic Function OR
• w^(2) = (−1, 0, 0)
• Second input: x₂ = (0, 1), y₂ = 1. Including the bias: (1, 0, 1)
• Value of the neuron: w^(2) · x₂ = −1 × 1 + 0 × 0 + 0 × 1 = −1
• → y₂(w^(2) · x₂) = −1 ≤ 0, so we update:
• w^(3) = w^(2) + (1)(1, 0, 1) = (0, 0, 1)
Example: Logic Function OR
• w^(3) = (0, 0, 1)
• Third input: x₃ = (1, 0), y₃ = 1. Including the bias: (1, 1, 0)
• Value of the neuron: w^(3) · x₃ = 0 × 1 + 0 × 1 + 1 × 0 = 0
• → y₃(w^(3) · x₃) = 0 ≤ 0, so we update:
• w^(4) = w^(3) + (1)(1, 1, 0) = (1, 1, 1)
Example: Logic Function OR
• w^(4) = (1, 1, 1)
• Fourth input: x₄ = (1, 1), y₄ = 1. Including the bias: (1, 1, 1)
• Value of the neuron: w^(4) · x₄ = 1 × 1 + 1 × 1 + 1 × 1 = 3
• → y₄(w^(4) · x₄) = 3 > 0
• No update needed.
Example: Logic Function OR
• Not done yet! w^(4) = (1, 1, 1)
• First input again: x₁ = (0, 0), y₁ = −1. Including the bias: (1, 0, 0)
• Value of the neuron: w^(4) · x₁ = 1 × 1 + 1 × 0 + 1 × 0 = 1
• → y₁(w^(4) · x₁) = −1 ≤ 0
• We need to update again…
Example: Logic Function OR
• We've been through all the inputs once, but that doesn't mean we're finished!
• We need to go through the inputs again,
• until the weights settle down and stop changing.
• When the data is inseparable, the weights may never stop changing…
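The hand computation above is easy to replay in code (my own sketch, continuing past the point where the slides stop):

```python
def perceptron_trace(xs, ys, max_epochs=20):
    """Run the Perceptron on bias-augmented inputs, recording w after each update."""
    w = [0, 0, 0]
    trace = [tuple(w)]
    for _ in range(max_epochs):
        changed = False
        for x, y in zip(xs, ys):
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                trace.append(tuple(w))
                changed = True
        if not changed:  # a full mistake-free pass: the weights have settled
            return w, trace
    return w, trace

# OR data, each x prefixed with the constant bias input 1.
xs = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
ys = [-1, 1, 1, 1]
w, trace = perceptron_trace(xs, ys)
print(trace[:4])  # [(0, 0, 0), (-1, 0, 0), (0, 0, 1), (1, 1, 1)]: matches the slides
print(all(y * sum(wi * xi for wi, xi in zip(w, x)) > 0 for x, y in zip(xs, ys)))  # True
```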
When to stop?
• The algorithm runs over the dataset many times…
• How do we decide when to stop learning (in general)?
Validation Set
• Training set: used to train the algorithm, i.e. to adjust the weights.
• Validation set: used to keep track of how well it is doing, i.e. to verify that any increase in accuracy over the training data yields an increase in accuracy over data the network wasn't trained on.
• Test set: used to produce the final results, i.e. to test the final solution and confirm the actual predictive power of the algorithm.
Validation Set
• The proportions of the train/validation/test sets are typically 60:20:20 (after the dataset has been shuffled!).
• Alternatively: K-fold cross-validation:
• The dataset is randomly partitioned into K subsets.
• One subset is used for validation; the algorithm is trained on all the others.
• Then a different subset is left out, and a new model is trained…
• Repeat the process for all K subsets.
• Finally, the model that produced the lowest validation error is used.
• Leave-one-out: the algorithm is validated on one piece of data and trained on all the rest, N times (where N is the length of the dataset).
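A K-fold split can be sketched in a few lines (my own illustration; the function name and the interleaved fold construction are assumptions, not from the slides):

```python
import random

def k_fold_splits(n, k, seed=0):
    """Yield (train_indices, val_indices) pairs for K-fold cross-validation
    over a dataset of n examples, shuffled once up front."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal subsets
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val

splits = list(k_fold_splits(10, 5))
print(len(splits))  # 5 folds
print(sorted(splits[0][0] + splits[0][1]))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```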
Overfitting
• Rather than finding a general function (left), our network matches the inputs perfectly, including the noise in them (right).
• This reduces its generalization capabilities.
Back to: When to stop?
• If we plot the error during training:
• it typically decreases fairly quickly during the first few training iterations,
• then the reduction slows down as the learning algorithm performs small changes to find the exact local minimum.
• Note: this graph is general and does not necessarily describe the behavior of the error rate while training a Perceptron, because the Perceptron does not guarantee that there will be fewer mistakes on the next iterations.
When to stop?
• We don't want to stop training until the local minimum has been found; but keeping on training too long leads to overfitting.
• This is where the validation set comes in useful.
When to stop?
• We train the network for some predetermined amount of time, then use the validation set to estimate how well the network is generalizing.
• We then carry on training for a few more iterations, and repeat the whole process.
When to stop?
• At some stage the error on the validation set will start increasing again,
• because the network has stopped learning about the function that generated the data, and has started to learn about the noise in the data itself.
• At this stage we stop the training. This technique is called early stopping.
When to stop?
• Thus, the validation set is used to prevent overfitting and to monitor the generalization ability of the network:
• if the accuracy over the training data set increases,
• but the accuracy over the validation data set stays the same or decreases,
• then we have caused overfitting, and should stop training.
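The early-stopping rule described above can be sketched as a small decision function (entirely my own toy illustration; the per-epoch validation errors are given as a precomputed list rather than produced by an actual training loop):

```python
def early_stopping(val_errors, patience=1):
    """Return the epoch with the lowest validation error, scanning until
    validation error has failed to improve `patience` + 1 times in a row."""
    best_err, best_epoch, bad = float("inf"), 0, 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_epoch, bad = err, epoch, 0
        else:
            bad += 1
            if bad > patience:  # validation error keeps rising: overfitting, stop
                break
    return best_epoch

# Validation error falls, then turns up at epoch 3: stop and keep epoch 2.
val = [0.42, 0.30, 0.22, 0.27, 0.35]
print(early_stopping(val))  # 2: the epoch with the lowest validation error
```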
Perceptron Variant
• The pocket algorithm:
• keeps the best solution seen so far "in its pocket",
• and returns the solution in the pocket, rather than the last solution.
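A sketch of the pocket idea (my own code, not from the slides): run the usual Perceptron updates, but remember the weight vector with the fewest training mistakes seen so far.

```python
def mistakes(w, xs, ys):
    """Number of training examples misclassified by w (ties count as mistakes)."""
    return sum(1 for x, y in zip(xs, ys)
               if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0)

def pocket_perceptron(xs, ys, epochs=50):
    """Perceptron updates as usual, but keep the best w seen so far 'in the pocket'
    and return it, rather than the last w."""
    w = [0.0] * len(xs[0])
    pocket, pocket_err = list(w), mistakes(w, xs, ys)
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                err = mistakes(w, xs, ys)
                if err < pocket_err:  # better than anything seen: pocket it
                    pocket, pocket_err = list(w), err
    return pocket, pocket_err

# A non-separable toy set (XOR, with a bias coordinate): no w is perfect,
# but the pocket still returns the best w encountered along the way.
xs = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
ys = [-1, 1, 1, -1]
w, err = pocket_perceptron(xs, ys)
print(err)  # 2: with this visiting order, the pocketed w misclassifies 2 of 4 points
```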
Perceptron Bound Theorem
Note: 𝛾 is called a “margin”.
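The theorem itself appears in the slides only as a figure; presumably it is the classical mistake bound (Novikoff, 1962), whose standard statement is:

```latex
% Classical Perceptron mistake bound (Novikoff, 1962); gamma is the margin
% mentioned in the note above.
\textbf{Theorem.} Let $(x_1, y_1), \dots, (x_m, y_m)$ be a sequence of examples
with $\|x_i\| \le R$ for all $i$. Suppose there exists $w^{*}$ with
$\|w^{*}\| = 1$ and $\gamma > 0$ such that $y_i \langle w^{*}, x_i \rangle \ge \gamma$
for all $i$. Then the Perceptron makes at most $(R/\gamma)^2$ updates on this sequence.
```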