Introduction to Machine Learning
Machine Learning: Jordan Boyd-Graber, University of Maryland
Logistic Regression from Text
Slides adapted from Emily Fox
Machine Learning: Jordan Boyd-Graber | UMD Introduction to Machine Learning | 1 / 18
Reminder: Logistic Regression
P(Y = 0 \mid X) = \frac{1}{1 + \exp\left(\beta_0 + \sum_i \beta_i X_i\right)} \quad (1)

P(Y = 1 \mid X) = \frac{\exp\left(\beta_0 + \sum_i \beta_i X_i\right)}{1 + \exp\left(\beta_0 + \sum_i \beta_i X_i\right)} \quad (2)
- Discriminative prediction: p(y | x)
- Classification uses: ad placement, spam detection
- What we didn't talk about is how to learn β from data
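The two class probabilities above can be computed directly; a minimal Python sketch, where the weights and feature values are made up for illustration:

```python
import math

def p_y1(beta0, beta, x):
    """P(Y = 1 | X), equation (2): logistic function of the linear score."""
    score = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return math.exp(score) / (1.0 + math.exp(score))

# Hypothetical weights and features
beta0, beta = -1.0, [0.5, 2.0]
x = [1.0, 0.5]
p1 = p_y1(beta0, beta, x)
p0 = 1.0 - p1   # P(Y = 0 | X), equation (1)
print(round(p1, 3), round(p0, 3))  # 0.622 0.378
```

Note the two probabilities sum to one, as they must for a binary outcome.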
Logistic Regression: Objective Function
\mathcal{L} \equiv \ln p(Y \mid X, \beta) = \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) \quad (3)

= \sum_j \left[ y^{(j)} \left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) - \ln\left( 1 + \exp\left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) \right) \right] \quad (4)
Training data (y, x) are fixed. The objective function is a function of β: what values of β give a good value?
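Equation (4) can be evaluated for any candidate β; a small sketch using a made-up dataset (labels and features chosen purely for illustration):

```python
import math

def log_likelihood(beta0, beta, data):
    """Equation (4): sum over examples of y*score - log(1 + exp(score))."""
    total = 0.0
    for y, x in data:
        score = beta0 + sum(b * xi for b, xi in zip(beta, x))
        total += y * score - math.log(1.0 + math.exp(score))
    return total

# Tiny hypothetical dataset: (label, features)
data = [(1, [2.0, 0.0]), (0, [0.0, 1.0]), (1, [1.0, 1.0])]
ll = log_likelihood(0.0, [1.0, -1.0], data)
print(ll)
```

Different β values give different log likelihoods; learning means searching for the β that makes this number as large as possible.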
Convexity

- Convex function
- Doesn't matter where you start, if you "walk up" the objective
- Gradient!
Gradient Descent (non-convex)

Goal: Optimize log likelihood with respect to variables β

[Figure: objective plotted against parameter; starting from an initial point 0, successive gradient steps 1, 2, 3 climb toward a local optimum, while a better optimum (the "undiscovered country") lies in an unexplored region]
Gradient Descent (non-convex)

Goal: Optimize log likelihood with respect to variables β

\beta_i^{(l+1)} = \beta_i^{(l)} + \alpha \frac{\partial J}{\partial \beta_i}

Luckily, (vanilla) logistic regression is convex
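The update rule above can be sketched as a simple loop. Here it runs on a toy concave objective J(β) = -(β - 3)² whose maximizer is known to be β = 3; the objective, step size α, and iteration count are illustrative choices, not from the slides:

```python
def grad_ascent(grad, beta_init, alpha=0.1, iters=100):
    """Repeated update: beta^(l+1) = beta^(l) + alpha * dJ/dbeta."""
    beta = beta_init
    for _ in range(iters):
        beta = beta + alpha * grad(beta)
    return beta

# Toy concave objective J(beta) = -(beta - 3)**2, with gradient -2*(beta - 3)
beta_star = grad_ascent(lambda b: -2.0 * (b - 3.0), beta_init=0.0)
print(beta_star)  # approaches the maximizer beta = 3
```

Because each step moves in the direction of increasing J, the iterate walks uphill; for a convex (here, concave in the maximization sense) objective, it cannot get stuck in a bad local optimum.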
Gradient for Logistic Regression

To ease notation, let's define

\pi_i = \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \quad (5)

Our objective function is

\mathcal{L} = \sum_i \log p(y_i \mid x_i) = \sum_i \mathcal{L}_i = \sum_i \begin{cases} \log \pi_i & \text{if } y_i = 1 \\ \log(1 - \pi_i) & \text{if } y_i = 0 \end{cases} \quad (6)
Taking the Derivative

Apply the chain rule:

\frac{\partial \mathcal{L}}{\partial \beta_j} = \sum_i \frac{\partial \mathcal{L}_i(\vec{\beta})}{\partial \beta_j} = \sum_i \begin{cases} \frac{1}{\pi_i} \frac{\partial \pi_i}{\partial \beta_j} & \text{if } y_i = 1 \\ \frac{1}{1 - \pi_i} \left( -\frac{\partial \pi_i}{\partial \beta_j} \right) & \text{if } y_i = 0 \end{cases} \quad (7)

If we plug in the derivative

\frac{\partial \pi_i}{\partial \beta_j} = \pi_i (1 - \pi_i) x_j, \quad (8)

we can merge these two cases:

\frac{\partial \mathcal{L}_i}{\partial \beta_j} = (y_i - \pi_i) x_j. \quad (9)
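Equation (9) can be sanity-checked numerically: the analytic derivative should agree with a central finite difference. A small sketch, where the β, x, and y values are arbitrary:

```python
import math

def pi(beta, x):
    """Equation (5)."""
    s = sum(b * xi for b, xi in zip(beta, x))
    return math.exp(s) / (1.0 + math.exp(s))

def log_lik_i(beta, x, y):
    """Per-example log likelihood, equation (6)."""
    p = pi(beta, x)
    return math.log(p) if y == 1 else math.log(1.0 - p)

beta, x, y, j = [0.3, -0.7], [1.0, 2.0], 1, 0

# Analytic gradient from equation (9): (y_i - pi_i) * x_j
analytic = (y - pi(beta, x)) * x[j]

# Numerical gradient: central difference in coordinate j
eps = 1e-6
bp = beta[:]; bp[j] += eps
bm = beta[:]; bm[j] -= eps
numeric = (log_lik_i(bp, x, y) - log_lik_i(bm, x, y)) / (2 * eps)

print(analytic, numeric)  # the two should agree to several decimal places
```

This kind of gradient check is a standard way to catch sign and indexing errors before running the full optimizer.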
Gradient for Logistic Regression

Gradient

\nabla_\beta \mathcal{L}(\vec{\beta}) = \left[ \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_0}, \ldots, \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_n} \right] \quad (10)

Update

\Delta\beta \equiv \eta \nabla_\beta \mathcal{L}(\vec{\beta}) \quad (11)

\beta_i' \leftarrow \beta_i + \eta \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_i} \quad (12)
- Why are we adding? What would we do if we wanted to do descent?
- η: step size, must be greater than zero
- NB: Conjugate gradient is usually better, but harder to implement
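Putting equations (9) through (12) together gives batch gradient ascent for logistic regression. A minimal sketch, where the toy dataset, step size η, and iteration count are made-up choices:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def gradient_ascent(data, n_features, eta=0.5, iters=500):
    """Batch update (12): beta_j <- beta_j + eta * dL/dbeta_j,
    with dL/dbeta_j = sum_i (y_i - pi_i) * x_ij from equation (9)."""
    beta = [0.0] * n_features
    for _ in range(iters):
        grad = [0.0] * n_features
        for y, x in data:
            p = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
            for j, xj in enumerate(x):
                grad[j] += (y - p) * xj
        beta = [b + eta * g for b, g in zip(beta, grad)]
    return beta

# Toy separable data; the first feature is a constant 1 acting as beta_0
data = [(0, [1.0, 0.1]), (0, [1.0, 0.4]), (1, [1.0, 0.8]), (1, [1.0, 0.9])]
beta = gradient_ascent(data, n_features=2)
probs = [sigmoid(sum(b * xi for b, xi in zip(beta, x))) for _, x in data]
```

After training, the predicted probabilities should sit on the correct side of 0.5 for every training example. Note the plus sign in the update: we add the gradient because we are maximizing the log likelihood (descent on a loss would subtract it).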
Choosing Step Size

[Figure: objective plotted against parameter, shown over several frames to illustrate the effect of different step-size choices]
Remaining issues

- When to stop?
- What if β keeps getting bigger?
Regularized Conditional Log Likelihood

Unregularized

\beta^* = \arg\max_\beta \; \ln p(y^{(j)} \mid x^{(j)}, \beta) \quad (13)

Regularized

\beta^* = \arg\max_\beta \; \ln p(y^{(j)} \mid x^{(j)}, \beta) - \mu \sum_i \beta_i^2 \quad (14)

μ is a "regularization" parameter that trades off between likelihood and having small parameters
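The trade-off in equation (14) can be seen numerically: larger weights can raise the raw likelihood, yet the quadratic penalty can still make them lose overall. A quick sketch where the data, weight vectors, and μ are arbitrary:

```python
import math

def log_lik(beta, data):
    """Unregularized log likelihood, as in equation (4) with no intercept."""
    total = 0.0
    for y, x in data:
        s = sum(b * xi for b, xi in zip(beta, x))
        total += y * s - math.log(1.0 + math.exp(s))
    return total

def regularized(beta, data, mu):
    """Equation (14): log likelihood minus mu times the sum of squared weights."""
    return log_lik(beta, data) - mu * sum(b * b for b in beta)

data = [(1, [1.0, 2.0]), (0, [1.0, -1.0])]
small, big = [0.5, 0.5], [5.0, 5.0]
mu = 1.0
print(regularized(small, data, mu), regularized(big, data, mu))
```

Here the big weights achieve a higher unregularized likelihood, but the penalty term dominates, so the regularized objective prefers the small weights.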
Approximating the Gradient

- Our datasets are big (too big to fit into memory)
- . . . or data are changing / streaming
- Hard to compute the true gradient

\nabla \mathcal{L}(\beta) \equiv \mathbb{E}_x\left[ \nabla \mathcal{L}(\beta, x) \right] \quad (15)

- Average over all observations
- What if we compute an update just from one observation?
Getting to Union Station
Pretend it’s a pre-smartphone world and you want to get to Union Station
Stochastic Gradient for Logistic Regression

Given a single observation x_i chosen at random from the dataset,

\beta_j \leftarrow \beta_j' + \eta \left( -\mu \beta_j' + x_{ij} \left[ y_i - \pi_i \right] \right) \quad (16)

Examples in class.
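Equation (16) as code, applied to a single hypothetical observation (the η and μ values are illustrative):

```python
import math

def sgd_update(beta, x, y, eta, mu):
    """Equation (16): beta_j <- beta_j + eta * (-mu*beta_j + x_j*(y - pi))."""
    s = sum(b * xi for b, xi in zip(beta, x))
    pi = math.exp(s) / (1.0 + math.exp(s))
    return [b + eta * (-mu * b + xj * (y - pi)) for b, xj in zip(beta, x)]

# One made-up observation, starting from beta = 0 (so pi = 0.5)
beta = sgd_update([0.0, 0.0], x=[1.0, 2.0], y=1, eta=0.1, mu=0.01)
print(beta)
```

Starting from zero weights the prediction is π = 0.5, so each coordinate moves by η · x_j · (1 − 0.5), giving [0.05, 0.1]; the regularization term only matters once the weights are nonzero.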
Stochastic Gradient for Regularized Regression

\mathcal{L} = \log p(y \mid x; \beta) - \mu \sum_j \beta_j^2 \quad (17)

Taking the derivative (with respect to example x_i):

\frac{\partial \mathcal{L}}{\partial \beta_j} = (y_i - \pi_i) x_j - 2\mu \beta_j \quad (18)
Algorithm

1. Initialize a vector β to be all zeros
2. For t = 1, . . . , T
   - For each example (x_i, y_i) and feature j:
     - Compute π_i ≡ Pr(y_i = 1 | x_i)
     - Set β[j] ← β[j] + λ(y_i − π_i) x_{ij}
3. Output the parameters β_1, . . . , β_d.
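The algorithm above, sketched in Python. The shuffling, toy dataset, step size λ, and number of passes T are illustrative choices; the slide's version (without the regularization term) is what is implemented here:

```python
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train_sgd(data, n_features, lam=0.1, T=200, seed=0):
    """Start beta at zero; for each pass, visit examples and apply
    the per-feature update beta[j] += lam * (y_i - pi_i) * x_ij."""
    rng = random.Random(seed)
    beta = [0.0] * n_features
    for _ in range(T):
        examples = list(data)
        rng.shuffle(examples)  # visit examples in random order
        for y, x in examples:
            p = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
            for j, xj in enumerate(x):
                beta[j] += lam * (y - p) * xj
    return beta

# Toy separable data; the first feature is a constant 1 acting as beta_0
data = [(0, [1.0, 0.1]), (0, [1.0, 0.3]), (1, [1.0, 0.7]), (1, [1.0, 0.9])]
beta = train_sgd(data, n_features=2)
preds = [sigmoid(sum(b * xi for b, xi in zip(beta, x))) for _, x in data]
```

Unlike the batch update, each step here uses only one observation, so the parameters move after every example rather than after a full pass over the data.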
Proofs about Stochastic Gradient

- Depend on the convexity of the objective and how close ε you want to get to the actual answer
- Best bounds depend on changing η over time and per dimension (not all features created equal)
In class

- Your questions!
- Working through a simple example
- Preparing for the logistic regression homework