
Machine Learning Basics Lecture 3: Perceptron

Princeton University COS 495

Instructor: Yingyu Liang

Perceptron

Overview

• Previous lectures: (Principle for loss function) MLE to derive loss
  • Example: linear regression; some linear classification models

• This lecture: (Principle for optimization) local improvement
  • Example: Perceptron; SGD

Task

[Figure: linearly separable data. The hyperplane $(w^*)^T x = 0$ with normal vector $w^*$ separates Class +1, where $(w^*)^T x > 0$, from Class -1, where $(w^*)^T x < 0$.]

Attempt

• Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$

• Hypothesis: $f_w(x) = w^T x$
  • $y = +1$ if $w^T x > 0$
  • $y = -1$ if $w^T x < 0$

• Prediction: $y = \text{sign}(f_w(x)) = \text{sign}(w^T x)$

• Goal: minimize classification error

Perceptron Algorithm

• Assume for simplicity: all $x_i$ have length 1

[Figure: the Perceptron algorithm; from the lecture notes of Nina Balcan]

Intuition: correct the current mistake

• If mistake on a positive example:
$$w_{t+1}^T x = (w_t + x)^T x = w_t^T x + x^T x = w_t^T x + 1$$

• If mistake on a negative example:
$$w_{t+1}^T x = (w_t - x)^T x = w_t^T x - x^T x = w_t^T x - 1$$

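The mistake-driven update above is the whole algorithm. Below is a minimal Python sketch of the training loop (the function and variable names, such as `perceptron`, `X`, `y`, and `n_epochs`, are illustrative, not from the lecture):

```python
import numpy as np

def perceptron(X, y, n_epochs=100):
    """Mistake-driven Perceptron.

    X: (n, d) array of examples (unit length assumed, as on the slide).
    y: length-n array of labels in {-1, +1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        mistakes = 0
        for i in range(n):
            # Mistake: sign(w^T x_i) disagrees with y_i (w^T x_i = 0 also counts).
            if y[i] * (X[i] @ w) <= 0:
                w += y[i] * X[i]   # add x on a positive mistake, subtract on a negative one
                mistakes += 1
        if mistakes == 0:          # an epoch with no mistakes: training data separated
            break
    return w
```

Cycling until an epoch with no mistakes connects to the analysis that follows: the theorem below bounds how many updates this loop can ever make.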
The Perceptron Theorem

• Suppose there exists $w^*$ that correctly classifies all $(x_i, y_i)$

• W.l.o.g., all $x_i$ and $w^*$ have length 1, so the minimum distance of any example to the decision boundary is
$$\gamma = \min_i |(w^*)^T x_i|$$

• Then Perceptron makes at most $\frac{1}{\gamma^2}$ mistakes

• Note: the examples need not be i.i.d.!

• Note: the bound does not depend on $n$, the length of the data sequence!

Analysis

• First look at the quantity $w_t^T w^*$

• Claim 1: $w_{t+1}^T w^* \ge w_t^T w^* + \gamma$

• Proof: If mistake on a positive example $x$ (so $x^T w^* \ge \gamma$):
$$w_{t+1}^T w^* = (w_t + x)^T w^* = w_t^T w^* + x^T w^* \ge w_t^T w^* + \gamma$$

• If mistake on a negative example (so $x^T w^* \le -\gamma$):
$$w_{t+1}^T w^* = (w_t - x)^T w^* = w_t^T w^* - x^T w^* \ge w_t^T w^* + \gamma$$

Analysis

• Next look at the quantity $\|w_t\|$

• Claim 2: $\|w_{t+1}\|^2 \le \|w_t\|^2 + 1$

• Proof: If mistake on a positive example $x$:
$$\|w_{t+1}\|^2 = \|w_t + x\|^2 = \|w_t\|^2 + \|x\|^2 + 2 w_t^T x$$

• The term $2 w_t^T x$ is non-positive since we made a mistake on $x$, and $\|x\|^2 = 1$, so the claim follows (the case of a mistake on a negative example is symmetric)

Analysis: putting things together

• Claim 1: $w_{t+1}^T w^* \ge w_t^T w^* + \gamma$

• Claim 2: $\|w_{t+1}\|^2 \le \|w_t\|^2 + 1$

After $M$ mistakes:

• $w_{M+1}^T w^* \ge \gamma M$

• $\|w_{M+1}\| \le \sqrt{M}$

• $w_{M+1}^T w^* \le \|w_{M+1}\|$ (by Cauchy-Schwarz, since $\|w^*\| = 1$)

So $\gamma M \le \sqrt{M}$, and thus $M \le \frac{1}{\gamma^2}$

Intuition

• Claim 1: $w_{t+1}^T w^* \ge w_t^T w^* + \gamma$, i.e., the correlation with $w^*$ gets larger. This could be because:
  1. $w_{t+1}$ gets closer to $w^*$
  2. $w_{t+1}$ gets much longer

• Claim 2: $\|w_{t+1}\|^2 \le \|w_t\|^2 + 1$ rules out the bad case "2. $w_{t+1}$ gets much longer"

Some side notes on Perceptron

History

Figure from Pattern Recognition and Machine Learning, Bishop

Note: connectionism vs symbolism

• Symbolism: AI can be achieved by representing concepts as symbols
  • Example: rule-based expert system, formal grammar

• Connectionism: explain intellectual abilities using connections between neurons (i.e., artificial neural networks)
  • Example: perceptron, larger scale neural networks

Symbolism example: Credit Risk Analysis

Example from Machine learning lecture notes by Tom Mitchell

Connectionism example

Figure from Pattern Recognition and Machine Learning, Bishop

Neuron/perceptron

Note: connectionism vs symbolism

β€’ Formal theories of logical reasoning, grammar, and other higher mental faculties compel us to think of the mind as a machine for rule-based manipulation of highly structured arrays of symbols. What we know of the brain compels us to think of human information processing in terms of manipulation of a large unstructured set of numbers, the activity levels of interconnected neurons.

---- The Central Paradox of Cognition (Smolensky et al., 1992)

Note: online vs batch

• Batch: Given training data $\{(x_i, y_i) : 1 \le i \le n\}$, typically i.i.d.

• Online: data points arrive one by one (a sketch follows this list)
  1. The algorithm receives an unlabeled example $x_i$
  2. The algorithm predicts a classification of this example
  3. The algorithm is then told the correct answer $y_i$, and updates its model

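A minimal Python sketch of this online protocol, using the Perceptron update from earlier as the learner (the function name and the `stream` argument are illustrative assumptions, not from the lecture):

```python
import numpy as np

def online_perceptron(stream):
    """Online protocol with the Perceptron as the learner.

    `stream` yields (x, y) pairs one at a time, with y in {-1, +1}.
    """
    w = None
    mistakes = 0
    for x, y in stream:
        x = np.asarray(x, dtype=float)
        if w is None:
            w = np.zeros_like(x)            # initialize on the first example
        y_hat = 1 if w @ x > 0 else -1      # 1. receive x, 2. predict sign(w^T x)
        if y_hat != y:                      # 3. observe the true label y ...
            w += y * x                      #    ... and update only on a mistake
            mistakes += 1
    return w, mistakes
```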
Stochastic gradient descent (SGD)

Gradient descent

• Minimize loss $L(\theta)$, where the hypothesis is parametrized by $\theta$

• Gradient descent:
  • Initialize $\theta_0$
  • $\theta_{t+1} = \theta_t - \eta_t \nabla L(\theta_t)$

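A minimal Python sketch of this update rule, assuming a user-supplied gradient function `grad_L` for the full loss (a hypothetical helper, not from the lecture), with a constant step size for simplicity:

```python
import numpy as np

def gradient_descent(grad_L, theta0, eta=0.1, n_steps=100):
    """Gradient descent: theta_{t+1} = theta_t - eta * grad L(theta_t)."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(n_steps):
        theta -= eta * grad_L(theta)   # step against the gradient of the full loss
    return theta
```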
Stochastic gradient descent (SGD)

• Suppose data points arrive one by one

• $L(\theta) = \frac{1}{n} \sum_{t=1}^{n} l(\theta, x_t, y_t)$, but we only know $l(\theta, x_t, y_t)$ at time $t$

• Idea: simply do what you can based on local information
  • Initialize $\theta_0$
  • $\theta_{t+1} = \theta_t - \eta_t \nabla l(\theta_t, x_t, y_t)$

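A minimal Python sketch of the SGD update, assuming a user-supplied per-example gradient `grad_l(theta, x, y)` (again a hypothetical helper) and a constant step size:

```python
import numpy as np

def sgd(grad_l, theta0, data, eta=0.1):
    """Stochastic gradient descent: one gradient step per arriving example.

    grad_l(theta, x, y) returns the gradient of the per-example loss l.
    `data` is an iterable of (x_t, y_t) pairs.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    for x_t, y_t in data:
        theta -= eta * grad_l(theta, x_t, y_t)   # theta_{t+1} = theta_t - eta * grad l
    return theta
```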
Example 1: linear regression

• Find $f_w(x) = w^T x$ that minimizes $L(f_w) = \frac{1}{n} \sum_{t=1}^{n} (w^T x_t - y_t)^2$

• $l(w, x_t, y_t) = \frac{1}{n} (w^T x_t - y_t)^2$

• $w_{t+1} = w_t - \eta_t \nabla l(w_t, x_t, y_t) = w_t - \frac{2\eta_t}{n} (w_t^T x_t - y_t) x_t$

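A minimal Python sketch of this update for the squared loss; the $1/n$ factor is kept exactly as on the slide, though in practice it is often folded into the learning rate:

```python
import numpy as np

def sgd_linear_regression(X, y, eta=0.01, n_epochs=5):
    """SGD on the squared loss l(w, x_t, y_t) = (1/n) (w^T x_t - y_t)^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        for t in range(n):
            residual = X[t] @ w - y[t]
            # w <- w - (2*eta/n) * (w^T x_t - y_t) * x_t
            w -= eta * (2.0 / n) * residual * X[t]
    return w
```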
Example 2: logistic regression

• Find $w$ that minimizes
$$L(w) = -\frac{1}{n} \sum_{t:\, y_t = 1} \log \sigma(w^T x_t) - \frac{1}{n} \sum_{t:\, y_t = -1} \log\left[1 - \sigma(w^T x_t)\right]$$

• Equivalently, with labels $y_t \in \{-1, +1\}$,
$$L(w) = -\frac{1}{n} \sum_t \log \sigma(y_t w^T x_t), \qquad l(w, x_t, y_t) = -\frac{1}{n} \log \sigma(y_t w^T x_t)$$

Example 2: logistic regression

• Find $w$ that minimizes
$$l(w, x_t, y_t) = -\frac{1}{n} \log \sigma(y_t w^T x_t)$$

$$w_{t+1} = w_t - \eta_t \nabla l(w_t, x_t, y_t) = w_t + \frac{\eta_t}{n} \frac{\sigma(a)\left(1 - \sigma(a)\right)}{\sigma(a)} y_t x_t$$
where $a = y_t w_t^T x_t$

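A minimal Python sketch of this update. It uses the simplification $\frac{\sigma(a)(1-\sigma(a))}{\sigma(a)} = 1 - \sigma(a)$, and again keeps the $1/n$ factor from the slide:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sgd_logistic_regression(X, y, eta=0.1, n_epochs=5):
    """SGD on l(w, x_t, y_t) = -(1/n) log sigma(y_t w^T x_t), with y in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        for t in range(n):
            a = y[t] * (X[t] @ w)
            # w <- w + (eta/n) * (1 - sigma(a)) * y_t * x_t
            w += (eta / n) * (1.0 - sigmoid(a)) * y[t] * X[t]
    return w
```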
Example 3: Perceptron

• Hypothesis: $y = \text{sign}(w^T x)$

• Define hinge loss
$$l(w, x_t, y_t) = -y_t w^T x_t \, \mathbb{I}[\text{mistake on } x_t]$$
$$L(w) = -\sum_t y_t w^T x_t \, \mathbb{I}[\text{mistake on } x_t]$$
$$w_{t+1} = w_t - \eta_t \nabla l(w_t, x_t, y_t) = w_t + \eta_t y_t x_t \, \mathbb{I}[\text{mistake on } x_t]$$

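Treating the indicator as a constant with respect to $w$, the gradient on a mistaken example follows in one line (a check not spelled out on the slide), and plugging it into the SGD step recovers the update above:

```latex
\nabla_w\, l(w, x_t, y_t) = \nabla_w\!\left(-y_t w^T x_t\right) = -y_t x_t
\quad\Longrightarrow\quad
w_{t+1} = w_t - \eta_t(-y_t x_t) = w_t + \eta_t y_t x_t .
```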
Example 3: Perceptron

• Hypothesis: $y = \text{sign}(w^T x)$
$$w_{t+1} = w_t - \eta_t \nabla l(w_t, x_t, y_t) = w_t + \eta_t y_t x_t \, \mathbb{I}[\text{mistake on } x_t]$$

• Set $\eta_t = 1$. If mistake on a positive example:
$$w_{t+1} = w_t + y_t x_t = w_t + x$$

• If mistake on a negative example:
$$w_{t+1} = w_t + y_t x_t = w_t - x$$

Pros & Cons

Pros:

• Widely applicable

• Easy to implement in most cases

• Guarantees for many losses

• Good performance: error/running time/memory etc.

Cons:

• No guarantees for non-convex optimization (e.g., the objectives in deep learning)

• Hyper-parameters: initialization, learning rate

Mini-batch

• Instead of one data point, work with a small batch of $b$ points
$$(x_{tb+1}, y_{tb+1}), \ldots, (x_{tb+b}, y_{tb+b})$$

• Update rule:
$$\theta_{t+1} = \theta_t - \eta_t \nabla \left[ \frac{1}{b} \sum_{1 \le i \le b} l(\theta_t, x_{tb+i}, y_{tb+i}) \right]$$

• Other variants: variance reduction etc.

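A minimal Python sketch of the mini-batch update, reusing the hypothetical per-example gradient `grad_l` from the SGD sketch above:

```python
import numpy as np

def minibatch_sgd(grad_l, theta0, X, y, b=32, eta=0.1, n_epochs=5):
    """Mini-batch SGD: average the per-example gradients over a batch of b points."""
    n = X.shape[0]
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(n_epochs):
        for start in range(0, n, b):
            batch = range(start, min(start + b, n))
            # Average gradient over the batch, then take one step.
            g = np.mean([grad_l(theta, X[i], y[i]) for i in batch], axis=0)
            theta -= eta * g
    return theta
```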
Homework

Homework 1

• Assignment online
  • Course website: http://www.cs.princeton.edu/courses/archive/spring16/cos495/
  • Piazza: https://piazza.com/princeton/spring2016/cos495

• Due date: Feb 17th (one week)

• Submission
  • Math part: hand-written/print; submit to TA (Office: EE, C319B)
  • Coding part: in Matlab/Python; submit the .m/.py file on Piazza

Homework 1

• Grading policy: every late day reduces the attainable credit for the exercise by 10%.

• Collaboration:
  • Discussion on the problem sets is allowed
  • Students are expected to finish the homework by themselves
  • The people you discussed with on assignments should be clearly detailed: before the solution to each question, list all people that you discussed with on that particular question.