Machine Learning
10-701/15-781, Fall 2008
Logistic Regression -- generative versus discriminative classifiers
Eric Xing
Lecture 5, September 22, 2008
Reading: Chap. 4.3 CB
Generative vs. Discriminative Classifiers
Goal: Wish to learn f: X → Y, e.g., P(Y|X)
Generative classifiers (e.g., Naïve Bayes):
- Assume some functional form for P(X|Y), P(Y)
- This is a 'generative' model of the data!
- Estimate parameters of P(X|Y), P(Y) directly from training data
- Use Bayes rule to calculate P(Y|X = x)

Discriminative classifiers:
- Directly assume some functional form for P(Y|X)
- This is a 'discriminative' model of the data!
- Estimate parameters of P(Y|X) directly from training data
[Figure: graphical models with nodes Y_i and X_i for the generative and discriminative settings]
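As a worked reminder (not part of the original slide), the Bayes-rule step used by the generative approach is:

P(Y = y \mid X = x) = \frac{P(X = x \mid Y = y)\, P(Y = y)}{\sum_{y'} P(X = x \mid Y = y')\, P(Y = y')}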
Discussion: Generative and discriminative classifiers
Generative: modeling the joint distribution of all the data
Discriminative: modeling only the points at the boundary
How? Regression!
Linear regression

The data: \{(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_N, y_N)\}

Both nodes are observed:
- X is an input vector
- Y is a response vector (we first consider y as a generic continuous response vector, then we consider the special case of classification where y is a discrete indicator)

A regression scheme can be used to model p(y|x) directly, rather than p(x,y)

[Figure: plate model with observed nodes Y_i and X_i, replicated N times]
Classification and logistic regression
The logistic function
Logistic regression (sigmoid classifier)
The conditional distribution: a Bernoulli
where µ is a logistic function
We can use the brute-force gradient method as in LR
But we can also apply generic laws by observing that p(y|x) is an exponential family function, more specifically, a generalized linear model (see future lectures …)
p(y \mid x) = \mu(x)^y \, \big(1 - \mu(x)\big)^{1-y}

\mu(x) = \frac{1}{1 + e^{-\theta^T x}}
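A minimal Python sketch of this classifier (an illustration under assumed names, not code from the lecture):

import numpy as np

def sigmoid(z):
    # logistic function mu(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, X):
    # p(y = 1 | x) = mu(theta^T x) for each row x of X
    return sigmoid(X @ theta)

def predict(theta, X, threshold=0.5):
    # Bernoulli conditional: label 1 when mu(theta^T x) exceeds the threshold
    return (predict_proba(theta, X) >= threshold).astype(int)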
Training Logistic Regression: MCLE
Estimate parameters θ=<θ0, θ1, ... θm> to maximize the conditional likelihood of training data
Training data
Data likelihood =
Data conditional likelihood =
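The formulas on this slide were images; filled in here, as an assumption, with the standard forms consistent with the Bernoulli model above:

Training data: D = \{(x_1, y_1), \ldots, (x_N, y_N)\}

Data likelihood: \prod_{n=1}^{N} p(x_n, y_n \mid \theta)

Data conditional likelihood: \prod_{n=1}^{N} p(y_n \mid x_n, \theta)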
Expressing Conditional Log Likelihood
Recall the logistic function:
and conditional likelihood:
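The equations here were also images; a standard reconstruction (an assumption, consistent with the model above) is:

\mu(x) = \frac{1}{1 + e^{-\theta^T x}}, \qquad p(y \mid x, \theta) = \mu(x)^y \, \big(1 - \mu(x)\big)^{1-y}

l(\theta) = \sum_n \Big[ y_n \log \mu(x_n) + (1 - y_n) \log\big(1 - \mu(x_n)\big) \Big] = \sum_n \Big[ y_n\,\theta^T x_n - \log\big(1 + e^{\theta^T x_n}\big) \Big]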
Maximizing Conditional Log Likelihood
The objective:
Good news: l(θ) is a concave function of θ
Bad news: no closed-form solution to maximize l(θ)
Gradient Ascent
Property of sigmoid function:
The gradient:
The gradient ascent algorithm: iterate until the change is < ε
For all i, repeat the update (reconstructed below):
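The missing sigmoid property, gradient, and update are the standard ones (an assumption about exactly what the slide showed):

\frac{d}{dz}\sigma(z) = \sigma(z)\,\big(1 - \sigma(z)\big), \qquad \frac{\partial l(\theta)}{\partial \theta_i} = \sum_n x_n^i \big( y_n - \mu(x_n) \big), \qquad \theta_i \leftarrow \theta_i + \eta \sum_n x_n^i \big( y_n - \mu(x_n) \big)

A small Python sketch of this loop (illustrative, with assumed names; not the lecture's code):

import numpy as np

def fit_logistic_gradient_ascent(X, y, lr=0.1, tol=1e-6, max_iter=10000):
    # Maximize the conditional log-likelihood l(theta) by batch gradient ascent.
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ theta))      # mu(x_n) for every example
        grad = X.T @ (y - mu)                      # dl/dtheta = sum_n x_n (y_n - mu_n)
        theta_new = theta + lr * grad
        if np.max(np.abs(theta_new - theta)) < tol:  # stop when the change is < epsilon
            return theta_new
        theta = theta_new
    return theta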
Overfitting …
How about MAP?
It is very common to use regularized maximum likelihood.
One common approach is to define a prior on θ: a Normal distribution with zero mean and identity covariance. This helps avoid very large weights and overfitting.
MAP estimate
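The estimate itself was an image; with the zero-mean, identity-covariance Gaussian prior just described, it takes the standard form (reconstructed here as an assumption):

\theta_{MAP} = \arg\max_\theta \Big[ \sum_n \log p(y_n \mid x_n, \theta) + \log p(\theta) \Big] = \arg\max_\theta \Big[ \sum_n \log p(y_n \mid x_n, \theta) - \tfrac{\lambda}{2} \|\theta\|^2 \Big]

with λ = 1 for the N(0, I) prior (a general precision λ corresponds to covariance λ^{-1} I).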
MLE vs MAP
Maximum conditional likelihood estimate
Maximum a posteriori estimate
Newton's method
Finding a zero of a function
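The update (drawn as a figure on the slide) for finding a zero of a scalar function f is the standard Newton step:

\theta^{(t+1)} = \theta^{(t)} - \frac{f(\theta^{(t)})}{f'(\theta^{(t)})}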
Newton's method (cont'd)
To maximize the conditional likelihood l(θ):
since l is concave, we need to find θ* where l'(θ*) = 0!
So we can perform the following iteration:
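Applying the zero-finding step to l' gives the iteration (reconstructed as the standard form):

\theta^{(t+1)} = \theta^{(t)} - \frac{l'(\theta^{(t)})}{l''(\theta^{(t)})}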
The Newton-Raphson method
In LR, θ is vector-valued, so we need the following generalization:
∇ is the gradient operator over the function
H is known as the Hessian of the function
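The generalized update, written out (standard form, reconstructed):

\theta^{(t+1)} = \theta^{(t)} - H^{-1}\,\nabla_\theta\, l(\theta^{(t)}), \qquad H = \frac{\partial^2 l(\theta)}{\partial \theta\, \partial \theta^T}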
Iterative reweighted least squares (IRLS)

Recall that the least-squares estimate in linear regression (reconstructed below) can also be derived from Newton-Raphson.
Now for logistic regression:
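Both equations referred to above were images on the slide; standard reconstructions (an assumption about the exact notation) are:

Least squares for linear regression: \theta = (X^T X)^{-1} X^T \mathbf{y}

For logistic regression, the Newton-Raphson step becomes a weighted least-squares problem:

\theta^{(t+1)} = (X^T W X)^{-1} X^T W z, \qquad W = \mathrm{diag}\big(\mu_n (1 - \mu_n)\big), \quad z = X\theta^{(t)} + W^{-1}(\mathbf{y} - \boldsymbol{\mu})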
Logistic regression: practical issues
NR (IRLS) takes O(Nd^2 + d^3) per iteration, where N = number of training cases and d = dimension of input x, but converges in fewer iterations.
Quasi-Newton methods, which approximate the Hessian, work faster.
Conjugate gradient takes O(Nd) per iteration, and usually works best in practice.
Stochastic gradient descent can also be used if N is large (cf. the perceptron rule):
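The per-example update referred to here, reconstructed as the standard stochastic gradient step:

\theta \leftarrow \theta + \eta\, x_n \big( y_n - \mu(x_n) \big)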
Case study
Dataset: 20 News Groups (20 classes)
Download: http://people.csail.mit.edu/jrennie/20Newsgroups/
61,118 words, 18,774 documents
[Table: descriptions of the 20 class labels]
Experimental setup
Parameters:
- Binary features
- (Stochastic) gradient descent vs. IRLS to estimate parameters

Training/Test sets:
- Subset of 20ng (binary classes)
- Use SVD to do dimension reduction (to 100)
- Randomly select 50% for training
- 10 runs, report the average result

Convergence criterion: RMSE on the training set

(A code sketch of this setup follows below.)
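A minimal sketch of this pipeline in Python using scikit-learn (the library choice, the function names, and the particular pair of newsgroups are assumptions for illustration, not the authors' code):

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Binary (two-class) subset of 20 Newsgroups
cats = ['alt.atheism', 'comp.graphics']
data = fetch_20newsgroups(subset='all', categories=cats)

# Binary word features, then SVD down to 100 dimensions
X = CountVectorizer(binary=True).fit_transform(data.data)
X = TruncatedSVD(n_components=100).fit_transform(X)
y = data.target

# 50/50 train/test split, averaged over 10 random runs
errs = []
for seed in range(10):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    errs.append(np.mean(clf.predict(Xte) != yte))
print('mean test error:', np.mean(errs))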
Convergence curves
[Figures: convergence curves for alt.atheism vs. comp.graphics, rec.autos vs. rec.sport.baseball, and comp.windows.x vs. rec.motorcycles]
Legend: X-axis: iteration #; Y-axis: error. In each figure, red is IRLS and blue is gradient descent.
Generative vs. Discriminative Classifiers
Goal: Wish to learn f: X → Y, e.g., P(Y|X)
Generative classifiers (e.g., Naïve Bayes):
- Assume some functional form for P(X|Y), P(Y)
- This is a 'generative' model of the data!
- Estimate parameters of P(X|Y), P(Y) directly from training data
- Use Bayes rule to calculate P(Y|X = x)

Discriminative classifiers:
- Directly assume some functional form for P(Y|X)
- This is a 'discriminative' model of the data!
- Estimate parameters of P(Y|X) directly from training data

[Figure: graphical models with nodes Y_i and X_i for the generative and discriminative settings]
Gaussian Discriminant Analysis

Learning f: X → Y, where
- X is a vector of real-valued features, <X_1 ... X_m>
- Y is boolean
What does that imply about the form of P(Y|X)?

The joint probability of a datum and its label is:

p(x_n, y_n^k = 1 \mid \mu, \sigma) = p(y_n^k = 1) \, p(x_n \mid y_n^k = 1, \mu_k, \sigma) = \pi_k \, \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\Big\{ -\tfrac{1}{2\sigma^2} (x_n - \mu_k)^2 \Big\}

Given a datum x_n, we predict its label using the conditional probability of the label given the datum:

p(y_n^k = 1 \mid x_n, \mu, \sigma) = \frac{\pi_k \, (2\pi\sigma^2)^{-1/2} \exp\big\{ -\tfrac{1}{2\sigma^2} (x_n - \mu_k)^2 \big\}}{\sum_{k'} \pi_{k'} \, (2\pi\sigma^2)^{-1/2} \exp\big\{ -\tfrac{1}{2\sigma^2} (x_n - \mu_{k'})^2 \big\}}

[Figure: graphical model with nodes Y_i and X_i]
Naïve Bayes Classifier

When X is a multivariate Gaussian vector, the joint probability of a datum and its label is:

p(\vec x_n, y_n^k = 1 \mid \vec\mu, \Sigma) = p(y_n^k = 1) \, p(\vec x_n \mid y_n^k = 1, \vec\mu_k, \Sigma) = \pi_k \, \frac{1}{(2\pi)^{m/2} |\Sigma|^{1/2}} \exp\Big\{ -\tfrac{1}{2} (\vec x_n - \vec\mu_k)^T \Sigma^{-1} (\vec x_n - \vec\mu_k) \Big\}

The naïve Bayes simplification:

p(x_n, y_n^k = 1 \mid \mu, \sigma) = p(y_n^k = 1) \prod_j p(x_n^j \mid y_n^k = 1, \mu_{k,j}, \sigma_{k,j}) = \pi_k \prod_j \frac{1}{(2\pi\sigma_{k,j}^2)^{1/2}} \exp\Big\{ -\tfrac{1}{2\sigma_{k,j}^2} (x_n^j - \mu_{k,j})^2 \Big\}

More generally:

p(x_n, y_n \mid \eta, \pi) = p(y_n \mid \pi) \prod_{j=1}^{m} p(x_n^j \mid y_n, \eta)

where p(· | ·) is an arbitrary conditional (discrete or continuous) 1-D density.

[Figure: naïve Bayes graphical model, Y_i with children X_i^1, X_i^2, ..., X_i^m]
The predictive distribution
Understanding the predictive distribution:

p(y_n^k = 1 \mid x_n, \mu, \Sigma, \pi) = \frac{p(y_n^k = 1, x_n \mid \mu, \Sigma, \pi)}{p(x_n \mid \mu, \Sigma, \pi)} = \frac{\pi_k \, N(x_n : \mu_k, \Sigma)}{\sum_{k'} \pi_{k'} \, N(x_n : \mu_{k'}, \Sigma)}    (*)

Under the naïve Bayes assumption:

p(y_n^k = 1 \mid x_n, \mu, \sigma, \pi) = \frac{\exp\big\{ -\sum_j \frac{1}{2\sigma_j^2} (x_n^j - \mu_{k,j})^2 + \log \pi_k + C \big\}}{\sum_{k'} \exp\big\{ -\sum_j \frac{1}{2\sigma_j^2} (x_n^j - \mu_{k',j})^2 + \log \pi_{k'} + C \big\}}    (**)

For two classes (i.e., K = 2), when the two classes have the same variance, (**) turns out to be the logistic function:

p(y_n^1 = 1 \mid x_n) = \frac{1}{1 + \exp\Big\{ \sum_j \Big[ \frac{\mu_j^2 - \mu_j^1}{\sigma_j^2}\, x_n^j + \frac{(\mu_j^1)^2 - (\mu_j^2)^2}{2\sigma_j^2} \Big] + \log\frac{\pi_2}{\pi_1} \Big\}} = \frac{1}{1 + e^{-\theta^T x_n}}
The decision boundary
The predictive distribution:

p(y_n^1 = 1 \mid x_n) = \frac{1}{1 + \exp\big\{ -\theta_0 - \sum_{j=1}^{M} \theta_j x_n^j \big\}} = \frac{1}{1 + e^{-\theta^T x_n}}

The Bayes decision rule:

\ln \frac{p(y_n^1 = 1 \mid x_n)}{p(y_n^2 = 1 \mid x_n)} = \ln \left( \frac{\frac{1}{1 + e^{-\theta^T x_n}}}{\frac{e^{-\theta^T x_n}}{1 + e^{-\theta^T x_n}}} \right) = \theta^T x_n

For multiple classes (i.e., K > 2), (*) corresponds to a softmax function:

p(y_n^k = 1 \mid x_n) = \frac{e^{\theta_k^T x_n}}{\sum_j e^{\theta_j^T x_n}}
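A small NumPy illustration of the softmax posterior above (an illustrative sketch, not from the slides):

import numpy as np

def softmax_posterior(Theta, x):
    # Theta: (K, d) array whose k-th row is theta_k; x: length-d feature vector
    scores = Theta @ x          # theta_k^T x for every class k
    scores -= scores.max()      # shift for numerical stability (does not change the result)
    exps = np.exp(scores)
    return exps / exps.sum()    # p(y^k = 1 | x) for k = 1..K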
Naïve Bayes vs Logistic Regression
Consider Y boolean, X continuous, X = <X_1 ... X_m>. Number of parameters to estimate:
NB:
LR:
Estimation method: NB parameter estimates are uncoupled; LR parameter estimates are coupled.
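The counts themselves did not extract; for this Gaussian setup the standard figures (stated as an assumption about the slide's exact notation) are:

NB: 2m class-conditional means + 2m class-conditional variances + 1 class prior = 4m + 1 parameters
LR: m + 1 weights (\theta_0, \theta_1, \ldots, \theta_m)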
Naïve Bayes vs Logistic Regression
Asymptotic comparison (# training examples → infinity)
When the model assumptions are correct: NB and LR produce identical classifiers.

When the model assumptions are incorrect: LR is less biased (it does not assume conditional independence) and is therefore expected to outperform NB.
Naïve Bayes vs Logistic Regression
Non-asymptotic analysis (see [Ng & Jordan, 2002])
convergence rate of parameter estimates – how many training examples needed to assure good estimates?
NB: order log m (where m = # of attributes in X); LR: order m
NB converges more quickly to its (perhaps less helpful) asymptotic estimates
Rate of convergence: logistic regression
Let h_{Dis,m} be logistic regression trained on n examples in m dimensions. Then, with high probability:

Implication: if we want the error to be within some small constant ε_0 of the asymptotic error, it suffices to pick on the order of m examples.

So logistic regression converges to its asymptotic classifier in order m examples.

The result follows from Vapnik's structural risk bound, plus the fact that the VC dimension of m-dimensional linear separators is m.
Rate of convergence: naïve Bayes parameters
Let any ε1, δ>0, and any n ≥ 0 be fixed. Assume that for some fixed ρ0 > 0, we have that
Let
Then with probability at least 1-δ, after n examples:
1. For discrete input, for all i and b
2. For continuous inputs, for all i and b
Some experiments from UCI data sets
Back to our 20 NG Case study
Dataset: 20 News Groups (20 classes)
61,118 words, 18,774 documents

Experiment:
- Solve only a two-class subset: 1 vs 2
- 1768 instances, 61188 features
- Use dimensionality reduction on the data (SVD)
- Use 90% as training set, 10% as test set
- Test prediction error used as accuracy measure
Versus training size
Generalization error (1)
• 30 features
• A fixed test set
• Training set varied from 10% to 100% of the training set
Versus model size
Generalization error (2)
• Number of dimensions of the data varied from 5 to 50 in steps of 5
• The features were chosen in decreasing order of their singular values
• 90% versus 10% split on training and test
Summary

Logistic regression
– Functional form follows from Naïve Bayes assumptions: for Gaussian Naïve Bayes assuming the variance does not depend on the class, and for discrete-valued Naïve Bayes too
– But the training procedure picks parameters without the conditional-independence assumption
– MLE training: pick θ to maximize P(Y | X; θ)
– MAP training: pick θ to maximize P(θ | X, Y); 'regularization' helps reduce overfitting

Gradient ascent/descent
– General approach when closed-form solutions are unavailable

Generative vs. Discriminative classifiers
– Bias vs. variance tradeoff