Overview
• Previous lectures (principle for loss function): MLE to derive loss
  • Example: linear regression; some linear classification models
• This lecture (principle for optimization): local improvement
  • Example: Perceptron; SGD
Attempt
• Given training data $(x_i, y_i), 1 \le i \le n$, i.i.d. from distribution $D$
• Hypothesis: $f_w(x) = w^T x$
  • $y = +1$ if $w^T x > 0$
  • $y = -1$ if $w^T x < 0$
• Prediction: $y = \text{sign}(f_w(x)) = \text{sign}(w^T x)$
• Goal: minimize classification error
Perceptron Algorithm
• Assume for simplicity: all $x_i$ have length 1

[Figure: the Perceptron algorithm, from the lecture notes of Nina Balcan]

Intuition: correct the current mistake (see the sketch below)
• If mistake on a positive example:
  $w_{t+1}^T x = (w_t + x)^T x = w_t^T x + x^T x = w_t^T x + 1$
• If mistake on a negative example:
  $w_{t+1}^T x = (w_t - x)^T x = w_t^T x - x^T x = w_t^T x - 1$
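A minimal sketch of the full mistake-driven loop in Python (NumPy); the toy data, the number of passes, and treating $w^T x = 0$ as a mistake are illustrative assumptions, not from the slides.

```python
import numpy as np

def perceptron(X, y, n_passes=10):
    """Mistake-driven Perceptron: update w only on misclassified examples."""
    w = np.zeros(X.shape[1])
    for _ in range(n_passes):
        mistakes = 0
        for x_t, y_t in zip(X, y):
            if y_t * (w @ x_t) <= 0:   # mistake (0 counted as a mistake)
                w += y_t * x_t         # +x on a positive, -x on a negative
                mistakes += 1
        if mistakes == 0:              # a full pass with no mistakes: done
            break
    return w

# toy usage on two unit-length points
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1, -1])
w = perceptron(X, y)                   # predict with sign(w @ x)
```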
The Perceptron Theorem
• Suppose there exists $w^*$ that correctly classifies $(x_i, y_i)$
• W.l.o.g., all $x_i$ and $w^*$ have length 1, so the minimum distance of any example to the decision boundary is
  $\gamma = \min_i |(w^*)^T x_i|$
• Then Perceptron makes at most $1/\gamma^2$ mistakes

Two remarks:
• The examples need not be i.i.d.!
• The bound does not depend on $n$, the length of the data sequence!
Analysis
• First look at the quantity $w_t^T w^*$
• Claim 1: $w_{t+1}^T w^* \ge w_t^T w^* + \gamma$
• Proof: if mistake on a positive example $x$,
  $w_{t+1}^T w^* = (w_t + x)^T w^* = w_t^T w^* + x^T w^* \ge w_t^T w^* + \gamma$
• If mistake on a negative example,
  $w_{t+1}^T w^* = (w_t - x)^T w^* = w_t^T w^* - x^T w^* \ge w_t^T w^* + \gamma$
Analysis
• Next look at the quantity $\|w_t\|$
• Claim 2: $\|w_{t+1}\|^2 \le \|w_t\|^2 + 1$
• Proof: if mistake on a positive example $x$,
  $\|w_{t+1}\|^2 = \|w_t + x\|^2 = \|w_t\|^2 + \|x\|^2 + 2 w_t^T x$
  where $2 w_t^T x$ is negative since we made a mistake on $x$, and $\|x\|^2 = 1$
• The negative case is symmetric: $\|w_t - x\|^2 = \|w_t\|^2 + \|x\|^2 - 2 w_t^T x$, with $w_t^T x > 0$ on a mistaken negative example
Analysis: putting things together
• Claim 1: $w_{t+1}^T w^* \ge w_t^T w^* + \gamma$
• Claim 2: $\|w_{t+1}\|^2 \le \|w_t\|^2 + 1$

After $M$ mistakes (starting from $w_1 = 0$):
• $w_{M+1}^T w^* \ge \gamma M$
• $\|w_{M+1}\| \le \sqrt{M}$
• $w_{M+1}^T w^* \le \|w_{M+1}\|$ (Cauchy-Schwarz, using $\|w^*\| = 1$)
So $\gamma M \le \sqrt{M}$, and thus $M \le 1/\gamma^2$
Intuition
• Claim 1: $w_{t+1}^T w^* \ge w_t^T w^* + \gamma$
• Claim 2: $\|w_{t+1}\|^2 \le \|w_t\|^2 + 1$

Claim 1 says the correlation with $w^*$ gets larger. That could happen in two ways:
1. $w_{t+1}$ gets closer to $w^*$
2. $w_{t+1}$ gets much longer
Claim 2 rules out the bad case "2. $w_{t+1}$ gets much longer"
Note: connectionism vs. symbolism
• Symbolism: AI can be achieved by representing concepts as symbols
  • Example: rule-based expert systems, formal grammar
• Connectionism: explain intellectual abilities using connections between neurons (i.e., artificial neural networks)
  • Example: Perceptron, larger-scale neural networks
Note: connectionism vs. symbolism
• "Formal theories of logical reasoning, grammar, and other higher mental faculties compel us to think of the mind as a machine for rule-based manipulation of highly structured arrays of symbols. What we know of the brain compels us to think of human information processing in terms of manipulation of a large unstructured set of numbers, the activity levels of interconnected neurons."
  ---- The Central Paradox of Cognition (Smolensky et al., 1992)
Note: online vs. batch
• Batch: given training data $(x_i, y_i), 1 \le i \le n$, typically i.i.d.
• Online: data points arrive one by one
  • 1. The algorithm receives an unlabeled example $x_i$
  • 2. The algorithm predicts a classification of this example
  • 3. The algorithm is then told the correct answer $y_i$, and updates its model
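The three-step protocol above, as a minimal Python skeleton; `model.predict` and `model.update` are hypothetical method names standing in for any online learner (e.g., the Perceptron).

```python
def run_online(stream, model):
    """Online protocol: receive x, predict, see the true y, update."""
    mistakes = 0
    for x_t, y_t in stream:            # data points arrive one by one
        y_hat = model.predict(x_t)     # 1-2: receive x_t and predict
        if y_hat != y_t:
            mistakes += 1
        model.update(x_t, y_t)         # 3: told the answer, update the model
    return mistakes
```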
Gradient descent
• Minimize loss $L(\theta)$, where the hypothesis is parametrized by $\theta$
• Gradient descent:
  • Initialize $\theta_0$
  • $\theta_{t+1} = \theta_t - \eta_t \nabla L(\theta_t)$
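A minimal sketch of this update in Python; the quadratic loss, step size, and iteration count in the usage line are assumptions for illustration.

```python
import numpy as np

def gradient_descent(grad_L, theta0, eta=0.1, n_steps=100):
    """Repeat theta <- theta - eta * grad L(theta)."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta - eta * grad_L(theta)
    return theta

# example: L(theta) = ||theta||^2, so grad L(theta) = 2 * theta; minimum at 0
theta = gradient_descent(lambda th: 2 * th, np.array([1.0, -2.0]))
```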
Stochastic gradient descent (SGD)
• Suppose data points arrive one by one
• $L(\theta) = \frac{1}{n} \sum_{t=1}^{n} \ell(\theta, x_t, y_t)$, but we only know $\ell(\theta, x_t, y_t)$ at time $t$
• Idea: simply do what you can based on local information
  • Initialize $\theta_0$
  • $\theta_{t+1} = \theta_t - \eta_t \nabla \ell(\theta_t, x_t, y_t)$
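The same idea as a Python sketch; `grad_ell(theta, x, y)` is a placeholder for the per-example gradient of whatever loss $\ell$ is in use (concrete instances follow in Examples 1-3).

```python
def sgd(grad_ell, theta0, stream, eta=0.01):
    """SGD: step along the gradient of the current example's loss only."""
    theta = theta0
    for x_t, y_t in stream:
        theta = theta - eta * grad_ell(theta, x_t, y_t)
    return theta
```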
Example 1: linear regression
• Find $f_w(x) = w^T x$ that minimizes $L(f_w) = \frac{1}{n} \sum_{t=1}^{n} (w^T x_t - y_t)^2$
• Per-example loss: $\ell(w, x_t, y_t) = \frac{1}{n} (w^T x_t - y_t)^2$
• $w_{t+1} = w_t - \eta_t \nabla \ell(w_t, x_t, y_t) = w_t - \frac{2\eta_t}{n} (w_t^T x_t - y_t) x_t$
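A sketch of this update in Python; folding the constant $1/n$ factor into the learning rate is an assumption of this sketch.

```python
import numpy as np

def sgd_linear_regression(X, Y, eta=0.01, n_passes=10):
    """SGD on the squared loss; gradient of (w^T x - y)^2 is 2(w^T x - y)x."""
    w = np.zeros(X.shape[1])
    for _ in range(n_passes):
        for x_t, y_t in zip(X, Y):
            w -= eta * 2.0 * (w @ x_t - y_t) * x_t
    return w
```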
Example 2: logistic regression
• Find $w$ that minimizes
  $L(w) = -\frac{1}{n} \sum_{t: y_t = 1} \log \sigma(w^T x_t) - \frac{1}{n} \sum_{t: y_t = -1} \log[1 - \sigma(w^T x_t)]$
• Equivalently, since $1 - \sigma(z) = \sigma(-z)$:
  $L(w) = -\frac{1}{n} \sum_{t} \log \sigma(y_t w^T x_t)$
• Per-example loss: $\ell(w, x_t, y_t) = -\frac{1}{n} \log \sigma(y_t w^T x_t)$
• SGD update:
  $w_{t+1} = w_t - \eta_t \nabla \ell(w_t, x_t, y_t) = w_t + \frac{\eta_t}{n} \cdot \frac{\sigma(a)(1 - \sigma(a))}{\sigma(a)} y_t x_t$
  where $a = y_t w_t^T x_t$
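In code, the factor $\sigma(a)(1 - \sigma(a))/\sigma(a)$ simplifies to $1 - \sigma(a)$. A sketch in Python, again folding the $1/n$ factor into the learning rate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_regression(X, Y, eta=0.1, n_passes=10):
    """SGD on -log sigma(y w^T x): step of size (1 - sigma(a)) toward y*x."""
    w = np.zeros(X.shape[1])
    for _ in range(n_passes):
        for x_t, y_t in zip(X, Y):
            a = y_t * (w @ x_t)
            w += eta * (1.0 - sigmoid(a)) * y_t * x_t
    return w
```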
Example 3: Perceptron
• Hypothesis: $y = \text{sign}(w^T x)$
• Define the hinge loss:
  $\ell(w, x_t, y_t) = -y_t w^T x_t \cdot \mathbb{I}[\text{mistake on } x_t]$
  $L(w) = -\sum_t y_t w^T x_t \cdot \mathbb{I}[\text{mistake on } x_t]$
• SGD update:
  $w_{t+1} = w_t - \eta_t \nabla \ell(w_t, x_t, y_t) = w_t + \eta_t y_t x_t \cdot \mathbb{I}[\text{mistake on } x_t]$
• Set $\eta_t = 1$. If mistake on a positive example:
  $w_{t+1} = w_t + y_t x_t = w_t + x$
• If mistake on a negative example:
  $w_{t+1} = w_t + y_t x_t = w_t - x$
• So the Perceptron algorithm is exactly SGD on this loss with $\eta_t = 1$
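A quick check (illustrative, not from the slides) that one SGD step with $\eta_t = 1$ reproduces the classic mistake-driven update:

```python
import numpy as np

def perceptron_sgd_step(w, x_t, y_t, eta=1.0):
    """One SGD step on the Perceptron loss: move only if x_t is misclassified."""
    mistake = (y_t * (w @ x_t) <= 0)
    return w + eta * y_t * x_t * mistake   # boolean mistake acts as the 0/1 indicator

w = np.zeros(2)
w = perceptron_sgd_step(w, np.array([1.0, 0.0]), +1)  # mistake: w becomes +x
w = perceptron_sgd_step(w, np.array([1.0, 0.0]), +1)  # now correct: w unchanged
```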
Pros & Cons
Pros:
• Widely applicable
• Easy to implement in most cases
• Guarantees for many losses
• Good performance in terms of error, running time, memory, etc.
Cons:
• No guarantees for non-convex optimization (e.g., the objectives in deep learning)
• Sensitive to hyper-parameters: initialization, learning rate
Mini-batch
• Instead of one data point, work with a small batch of $b$ points:
  $(x_{tb+1}, y_{tb+1}), \ldots, (x_{tb+b}, y_{tb+b})$
• Update rule:
  $\theta_{t+1} = \theta_t - \eta_t \nabla \frac{1}{b} \sum_{1 \le j \le b} \ell(\theta_t, x_{tb+j}, y_{tb+j})$
• Other variants: variance reduction etc.
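A sketch of the mini-batch update in Python; `grad_ell` is the same per-example gradient placeholder as in the SGD sketch, and the batch size and step size are illustrative.

```python
import numpy as np

def minibatch_sgd(grad_ell, theta0, X, Y, b=32, eta=0.01):
    """Average per-example gradients over each batch of b points, then step."""
    theta = theta0
    for start in range(0, len(X), b):
        xb, yb = X[start:start + b], Y[start:start + b]
        grad = np.mean([grad_ell(theta, x, y) for x, y in zip(xb, yb)], axis=0)
        theta = theta - eta * grad
    return theta
```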
Homework 1
• Assignment online
  • Course website: http://www.cs.princeton.edu/courses/archive/spring16/cos495/
  • Piazza: https://piazza.com/princeton/spring2016/cos495
• Due date: Feb 17th (one week)
• Submission
  • Math part: hand-written/printed; submit to TA (Office: EE, C319B)
  • Coding part: in Matlab/Python; submit the .m/.py file on Piazza
Homework 1
• Grading policy: every late day reduces the attainable credit for the exercise by 10%.
• Collaboration:
  • Discussion on the problem sets is allowed
  • Students are expected to finish the homework by themselves
  • The people you discussed with on assignments should be clearly listed: before the solution to each question, list all people that you discussed that particular question with.