The Perceptron Algorithm
Tirgul (Recitation) 3
November 2016
Agnostic PAC Learnability

A hypothesis class ℋ is agnostic PAC learnable if there exists a function m_ℋ : (0,1)² → ℕ and a learning algorithm with the following property: for every 𝜖, 𝛿 ∈ (0,1), and for every distribution 𝒟 over 𝒳 × 𝒴, when running the learning algorithm on m ≥ m_ℋ(𝜖, 𝛿) i.i.d. examples generated by 𝒟, the algorithm returns a hypothesis h such that, with probability of at least 1 − 𝛿 (over the choice of the m training examples),

L𝒟(h) ≤ min_{h′∈ℋ} L𝒟(h′) + 𝜖
Goal: find h = argmin_{h′∈ℋ} L𝒟(h′).
Agnostic PAC learnability relaxes this to: L𝒟(h) ≤ min_{h′∈ℋ} L𝒟(h′) + 𝜖.
When Life Gives You Lemons, Make Lemonade
• We do have our sample set S,
• and we hope it represents the distribution pretty well (the i.i.d. assumption).
• So why can't we just minimize the error over the training set?
• In other words: Empirical Risk Minimization.
Empirical Risk Minimization

L_S(h) = |{i ∈ [m] : h(xᵢ) ≠ yᵢ}| / m

• Examples: the Consistent algorithm, the Halving algorithm.
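As a quick illustration (a minimal sketch, not from the slides; the names are my own), the empirical risk above is just the fraction of misclassified training examples:

```python
def empirical_risk(h, xs, ys):
    """L_S(h): fraction of the m training examples that h misclassifies."""
    m = len(xs)
    return sum(1 for x, y in zip(xs, ys) if h(x) != y) / m

# Toy check: a constant predictor on a small labeled sample.
h = lambda x: 1
xs = [(0, 0), (0, 1), (1, 0), (1, 1)]
ys = [-1, 1, 1, 1]
print(empirical_risk(h, xs, ys))  # 0.25: one of four examples is wrong
```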
Linear Predictors: ERM Approach
Introduction
• Linear predictors are:
• Efficient
• Intuitive
• Able to fit the data reasonably well in many natural learning problems
• Several hypothesis classes: linear regression, logistic regression, the Perceptron.
Example
Linear Predictors
• The different hypothesis classes of linear predictors are compositions of a function 𝜙 : ℝ → 𝒴 on ℋ:
• Binary classification: 𝜙 is the sign function, sgn(x).
• Regression: 𝜙 is the identity function, f(x) = x.
Halfspaces
• Designed for binary classification problems: 𝒳 = ℝᵈ, 𝒴 = {±1}
• ℋ_halfspaces = {x ↦ sign(⟨w, x⟩ + b) : w ∈ ℝᵈ, b ∈ ℝ}
• Geometric illustration (d = 2): each hypothesis forms a hyperplane that is perpendicular to the vector w.
• Instances "above" the hyperplane are labeled positively;
• instances "below" the hyperplane are labeled negatively.
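A halfspace hypothesis is easy to state in code (a sketch; the function name is my own, and ties are labeled negatively to match the mistake condition used later):

```python
def halfspace_predict(w, b, x):
    """Label x in {-1, +1} by the sign of <w, x> + b."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if activation > 0 else -1

print(halfspace_predict([2.0, -1.0], 0.5, [1.0, 1.0]))  # 2 - 1 + 0.5 = 1.5 > 0, so 1
```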
Adding a Bias
• Add b (a bias) into w as an extra coordinate: w′ = (b, w₁, w₂, …, w_d) ∈ ℝᵈ⁺¹
• …and add a value of 1 to every x ∈ 𝒳: x′ = (1, x₁, x₂, …, x_d) ∈ ℝᵈ⁺¹
• Thus, each affine function in ℝᵈ can be rewritten as a homogeneous linear function in ℝᵈ⁺¹, since ⟨w′, x′⟩ = ⟨w, x⟩ + b.
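The homogeneous trick in code (my own sketch; `augment` is a hypothetical helper name):

```python
def augment(w, b, x):
    """Fold the bias b into the weight vector and prepend a constant 1 to x,
    so that <w', x'> equals <w, x> + b."""
    w_prime = [b] + list(w)
    x_prime = [1.0] + list(x)
    return w_prime, x_prime

w, b, x = [2.0, -1.0], 0.5, [1.0, 1.0]
w2, x2 = augment(w, b, x)
affine = sum(wi * xi for wi, xi in zip(w, x)) + b      # <w, x> + b
homogeneous = sum(wi * xi for wi, xi in zip(w2, x2))   # <w', x'>
print(affine == homogeneous)  # True
```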
The Dot Product
• Algebraic definition: w · x = Σᵢ₌₁ⁿ wᵢxᵢ = w₁x₁ + w₂x₂ + … + w_nx_n
• Notation: ⟨w, x⟩ = wᵀx
• Example: a = (0, 3), b = (4, 0)
• a · b = 0 · 4 + 3 · 0 = 0
The Dot Product
• Geometric definition: a · b = ‖a‖ ‖b‖ cos 𝜃
• Where:
• ‖x‖ is the magnitude (length) of the vector x,
• 𝜃 is the angle between a and b.
• If 𝜃 = 90°: a · b = 0
• If 𝜃 = 0°: a · b = ‖a‖ ‖b‖
• This implies that the dot product of a vector with itself is a · a = ‖a‖²,
• which gives ‖a‖ = √(a · a):
• the formula for the Euclidean length of the vector.
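Both definitions can be checked numerically on the example from the slides (a quick sketch):

```python
import math

def dot(a, b):
    """Algebraic definition: sum of coordinate-wise products."""
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(a):
    """Euclidean length via ||a|| = sqrt(a . a)."""
    return math.sqrt(dot(a, a))

a, b = (0.0, 3.0), (4.0, 0.0)
print(dot(a, b))         # 0.0: the vectors are perpendicular (theta = 90 degrees)
print(norm(a), norm(b))  # 3.0 4.0

# Geometric definition: a . b = ||a|| ||b|| cos(theta); here theta = pi/2, cos = 0.
theta = math.pi / 2
print(math.isclose(norm(a) * norm(b) * math.cos(theta), dot(a, b), abs_tol=1e-9))  # True
```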
The Decision Boundary
• The Perceptron tries to find a straight line that separates the positive examples from the negative ones:
• a line in 2D, a plane in 3D, a hyperplane in higher dimensions.
• This is called a decision boundary.
The Linearly Separable Case
• The linearly separable case: a perfect decision boundary exists
• (the "realizable" case).
• "Separable" means it is possible to separate all the positive examples from all the negative ones with a hyperplane.
Finding an ERM Halfspace
• In the separable case:
• Linear programming
• The Perceptron algorithm (Rosenblatt, 1957)
• In the non-separable case:
• Learn a halfspace that minimizes a different loss function,
• e.g. logistic regression.
Perceptron
• xᵢ: inputs; wᵢ: weights
• The inputs xᵢ are multiplied by the weights wᵢ, and the neuron sums their values.
• If the sum is greater than the threshold 𝜃, the neuron fires (outputs 1);
• otherwise, it does not.
Finding 𝜃
• We now need to learn both w and 𝜃.
• Reminder: we added a bias,
• so we already have an adjustable threshold:
• fix a constant input x₀ = 1 and let its weight w₀ absorb the threshold 𝜃,
• so we don't need to learn another parameter.
• 𝜃 is equivalent to the parameter b we mentioned previously.
Perceptron for Halfspaces
• Our goal is to have yᵢ⟨w, xᵢ⟩ > 0 for all i.
• After an update on a mistaken example (xᵢ, yᵢ):
• yᵢ⟨w^(t+1), xᵢ⟩ = yᵢ⟨w^(t) + yᵢxᵢ, xᵢ⟩ = yᵢ⟨w^(t), xᵢ⟩ + ‖xᵢ‖² (using yᵢ² = 1)
• The update rule therefore makes the Perceptron "more correct" on the i-th example.
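The update rule above is essentially the whole algorithm. A minimal sketch (my own code, with the bias folded into w via the homogeneous trick, and the dataset a made-up separable example):

```python
def perceptron(xs, ys, max_epochs=100):
    """Perceptron: cycle through the data, updating w <- w + y*x on every
    mistake (y * <w, x> <= 0), until a full pass is mistake-free.
    Each x is assumed to already include the constant bias coordinate 1."""
    w = [0.0] * len(xs[0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(xs, ys):
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # separable case: w now classifies everything correctly
            return w
    return w  # non-separable (or not yet converged): return the last w

# Usage on a tiny separable set: points with x2 > x1 are positive.
xs = [(1.0, 0.0, 1.0), (1.0, 1.0, 2.0), (1.0, 2.0, 1.0), (1.0, 2.0, 0.0)]
ys = [1, 1, -1, -1]
w = perceptron(xs, ys)
print(all(y * sum(wi * xi for wi, xi in zip(w, x)) > 0 for x, y in zip(xs, ys)))  # True
```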
The Learning Rate 𝜂
• The update rule: w^(t+1) = w^(t) + yᵢxᵢ
• We could add a parameter 𝜂: w^(t+1) = w^(t) + 𝜂yᵢxᵢ
• 𝜂 controls how much the weights change on each update.
• In the separable case 𝜂 has no effect (proof: homework).
• If 𝜂 = 1: the weights change a lot whenever there is a wrong answer, giving an unstable network that never "settles down".
• If 𝜂 is very small: the weights need to see inputs more often before they change significantly, so the network takes longer to learn.
• Typically we choose 0.1 ≤ 𝜂 ≤ 0.4.
Example: Logic Function OR
• Data of the OR logic function and a plot of the data points:

x₁  x₂  y
0   0   −1
0   1    1
1   0    1
1   1    1
A Feasibility Problem
• Suppose the algorithm found a weight vector that classifies all of the examples correctly.
• There are many different weight vectors that give correct outputs!
• We are interested in finding any set of weights that works:
• a feasibility (or satisfiability) problem is the problem of finding any feasible solution, without regard to an objective value.
Example: Logic Function OR
• The Perceptron network (figure: two inputs and a constant bias input of 1).
Example: Logic Function OR
• We need to find the three weights. Initially: w^(1) = (0, 0, 0)
• First input: x₁ = (0, 0), y₁ = −1. Including the bias: (1, 0, 0)
• Value of the neuron: w^(1) · x₁ = 0 × 1 + 0 × 0 + 0 × 0 = 0
• → y₁(w^(1) · x₁) = 0 ≤ 0, so we update:
• w^(2) = w^(1) + (−1)(1, 0, 0) = (−1, 0, 0)
Example: Logic Function OR
• w^(2) = (−1, 0, 0)
• Second input: x₂ = (0, 1), y₂ = 1. Including the bias: (1, 0, 1)
• Value of the neuron: w^(2) · x₂ = −1 × 1 + 0 × 0 + 0 × 1 = −1
• → y₂(w^(2) · x₂) = −1 ≤ 0, so we update:
• w^(3) = w^(2) + (1)(1, 0, 1) = (0, 0, 1)
Example: Logic Function OR
• w^(3) = (0, 0, 1)
• Third input: x₃ = (1, 0), y₃ = 1. Including the bias: (1, 1, 0)
• Value of the neuron: w^(3) · x₃ = 0 × 1 + 0 × 1 + 1 × 0 = 0
• → y₃(w^(3) · x₃) = 0 ≤ 0, so we update:
• w^(4) = w^(3) + (1)(1, 1, 0) = (1, 1, 1)
Example: Logic Function OR
• w^(4) = (1, 1, 1)
• Fourth input: x₄ = (1, 1), y₄ = 1. Including the bias: (1, 1, 1)
• Value of the neuron: w^(4) · x₄ = 1 × 1 + 1 × 1 + 1 × 1 = 3
• → y₄(w^(4) · x₄) = 3 > 0
• No update needed.
Example: Logic Function OR
• Not done yet! w^(4) = (1, 1, 1)
• First input again: x₁ = (0, 0), y₁ = −1. Including the bias: (1, 0, 0)
• Value of the neuron: w^(4) · x₁ = 1 × 1 + 1 × 0 + 1 × 0 = 1
• → y₁(w^(4) · x₁) = −1 ≤ 0
• We need to update again…
Example: Logic Function OR
• We've been through all the inputs once, but that doesn't mean we're finished!
• We need to go through the inputs again,
• until the weights settle down and stop changing.
• When the data is inseparable, the weights may never stop changing…
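The hand computation above is easy to replay in code (my own sketch, continuing past the point where the slides stop):

```python
def perceptron_trace(xs, ys, max_epochs=20):
    """Run the Perceptron on bias-augmented inputs, recording w after each update."""
    w = [0, 0, 0]
    trace = [tuple(w)]
    for _ in range(max_epochs):
        changed = False
        for x, y in zip(xs, ys):
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                trace.append(tuple(w))
                changed = True
        if not changed:  # a full mistake-free pass: the weights have settled
            return w, trace
    return w, trace

# OR data, each x prefixed with the constant bias input 1.
xs = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
ys = [-1, 1, 1, 1]
w, trace = perceptron_trace(xs, ys)
print(trace[:4])  # [(0, 0, 0), (-1, 0, 0), (0, 0, 1), (1, 1, 1)]: matches the slides
print(all(y * sum(wi * xi for wi, xi in zip(w, x)) > 0 for x, y in zip(xs, ys)))  # True
```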
When to stop?
• The algorithm runs over the dataset many times…
• How do we decide when to stop learning (in general)?
Validation Set
• Training set: used to train the algorithm, i.e. to adjust the weights.
• Validation set: used to keep track of how well it is doing, i.e. to verify that any increase in accuracy over the training data yields an increase in accuracy over data the network wasn't trained on.
• Test set: used to produce the final results, i.e. to test the final solution and confirm the actual predictive power of the algorithm.
Validation Set
• The proportions of the train/validation/test sets are typically 60:20:20 (after the dataset has been shuffled!).
• Alternatively: K-fold cross-validation:
• The dataset is randomly partitioned into K subsets.
• One subset is used for validation; the algorithm is trained on all the others.
• Then a different subset is left out, and a new model is trained…
• Repeat the process for all K subsets.
• Finally, the model that produced the lowest validation error is used.
• Leave-one-out: the algorithm is validated on one piece of data and trained on all the rest, N times (where N is the length of the dataset).
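A K-fold split can be sketched in a few lines (my own illustration; the function name and the interleaved fold construction are assumptions, not from the slides):

```python
import random

def k_fold_splits(n, k, seed=0):
    """Yield (train_indices, val_indices) pairs for K-fold cross-validation
    over a dataset of n examples, shuffled once up front."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal subsets
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val

splits = list(k_fold_splits(10, 5))
print(len(splits))  # 5 folds
print(sorted(splits[0][0] + splits[0][1]))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```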
Overfitting
• Rather than finding a general function (left), our network matches the inputs perfectly, including the noise in them (right).
• This reduces its generalization capabilities.
Back to: When to stop?
• If we plot the error during training:
• it typically decreases fairly quickly during the first few training iterations,
• then the reduction slows down as the learning algorithm performs small changes to find the exact local minimum.
• Note: this graph is general and does not necessarily describe the behavior of the error rate while training a Perceptron, because the Perceptron does not guarantee that there will be fewer mistakes on the next iterations.
When to stop?
• We don't want to stop training until the local minimum has been found; but keeping on training too long leads to overfitting.
• This is where the validation set comes in useful.
When to stop?
• We train the network for some predetermined amount of time, then use the validation set to estimate how well the network is generalizing.
• We then carry on training for a few more iterations, and repeat the whole process.
When to stop?
• At some stage the error on the validation set will start increasing again,
• because the network has stopped learning about the function that generated the data, and has started to learn about the noise in the data itself.
• At this stage we stop the training. This technique is called early stopping.
When to stop?
• Thus, the validation set is used to prevent overfitting and to monitor the generalization ability of the network:
• if the accuracy over the training data set increases,
• but the accuracy over the validation data set stays the same or decreases,
• then we have caused overfitting, and should stop training.
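The early-stopping rule described above can be sketched as a small decision function (entirely my own toy illustration; the per-epoch validation errors are given as a precomputed list rather than produced by an actual training loop):

```python
def early_stopping(val_errors, patience=1):
    """Return the epoch with the lowest validation error, scanning until
    validation error has failed to improve `patience` + 1 times in a row."""
    best_err, best_epoch, bad = float("inf"), 0, 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_epoch, bad = err, epoch, 0
        else:
            bad += 1
            if bad > patience:  # validation error keeps rising: overfitting, stop
                break
    return best_epoch

# Validation error falls, then turns up at epoch 3: stop and keep epoch 2.
val = [0.42, 0.30, 0.22, 0.27, 0.35]
print(early_stopping(val))  # 2: the epoch with the lowest validation error
```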
Perceptron Variant
• The pocket algorithm:
• keeps the best solution seen so far "in its pocket",
• and returns the solution in the pocket, rather than the last solution.
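A sketch of the pocket idea (my own code, not from the slides): run the usual Perceptron updates, but remember the weight vector with the fewest training mistakes seen so far.

```python
def mistakes(w, xs, ys):
    """Number of training examples misclassified by w (ties count as mistakes)."""
    return sum(1 for x, y in zip(xs, ys)
               if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0)

def pocket_perceptron(xs, ys, epochs=50):
    """Perceptron updates as usual, but keep the best w seen so far 'in the pocket'
    and return it, rather than the last w."""
    w = [0.0] * len(xs[0])
    pocket, pocket_err = list(w), mistakes(w, xs, ys)
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                err = mistakes(w, xs, ys)
                if err < pocket_err:  # better than anything seen: pocket it
                    pocket, pocket_err = list(w), err
    return pocket, pocket_err

# A non-separable toy set (XOR, with a bias coordinate): no w is perfect,
# but the pocket still returns the best w encountered along the way.
xs = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
ys = [-1, 1, 1, -1]
w, err = pocket_perceptron(xs, ys)
print(err)  # 2: with this visiting order, the pocketed w misclassifies 2 of 4 points
```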
Perceptron Bound Theorem
Note: 𝛾 is called a “margin”.
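The theorem itself appears in the slides only as a figure; presumably it is the classical mistake bound (Novikoff, 1962), whose standard statement is:

```latex
% Classical Perceptron mistake bound (Novikoff, 1962); gamma is the margin
% mentioned in the note above.
\textbf{Theorem.} Let $(x_1, y_1), \dots, (x_m, y_m)$ be a sequence of examples
with $\|x_i\| \le R$ for all $i$. Suppose there exists $w^{*}$ with
$\|w^{*}\| = 1$ and $\gamma > 0$ such that $y_i \langle w^{*}, x_i \rangle \ge \gamma$
for all $i$. Then the Perceptron makes at most $(R/\gamma)^2$ updates on this sequence.
```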