

2/9/08 CS 461, Winter 2008 1

CS 461: Machine Learning, Lecture 6

Dr. Kiri Wagstaff, kiri.wagstaff@calstatela.edu


2/9/08 CS 461, Winter 2008 2

Plan for Today

Note: Room change for 2/16: E&T A129
Solution to Midterm
Solution to Homework 3

Parametric methods
  Data comes from a distribution
  Bernoulli, Gaussian, and their parameters
  How good is a parameter estimate? (bias, variance)

Bayes estimation
  ML: use the data
  MAP: use the prior and the data
  Bayes estimator: integrated estimate (weighted)

Parametric classification
  Maximize the posterior probability


2/9/08 CS 461, Winter 2008 3

Review from Lecture 5

Probability Axioms

Bayesian Learning
Classification
Bayes's Rule
Bayesian Networks
Naïve Bayes Classifier
Association Rules


2/9/08 CS 461, Winter 2008 4

Parametric Methods

Chapter 4


2/9/08 CS 461, Winter 2008 5

Parametric Learning

Assume: data x comes from a distribution p(x)

Model this distribution by selecting parameters θ, e.g., N(μ, σ²) where θ = {μ, σ²}


2/9/08 CS 461, Winter 2008 6

Maximum Likelihood Model

(Last time: Max Likelihood Classification)

Likelihood of θ given the sample X:
ℓ(θ|X) = p(X|θ) = ∏_t p(x^t|θ)

Log likelihood:
L(θ|X) = log ℓ(θ|X) = ∑_t log p(x^t|θ)

Maximum likelihood estimator (MLE):
θ* = argmax_θ L(θ|X)

[Alpaydin 2004 The MIT Press]
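To make the MLE definition concrete, here is a minimal Python sketch that evaluates L(θ|X) = ∑_t log p(x^t|θ) on a grid of candidate values and takes the argmax; the Gaussian-mean model, the grid range, and the sample are illustrative assumptions, not from the slides.

```python
import numpy as np

def log_likelihood(theta, X, log_p):
    """L(theta|X) = sum over t of log p(x^t | theta)."""
    return np.sum(log_p(X, theta))

def mle_grid_search(X, log_p, candidates):
    """Return the candidate theta that maximizes the log likelihood."""
    scores = [log_likelihood(th, X, log_p) for th in candidates]
    return candidates[int(np.argmax(scores))]

# Illustrative model: Gaussian with unknown mean and known sigma = 1.
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.0, size=100)
log_gauss = lambda x, mu: -0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2
theta_star = mle_grid_search(X, log_gauss, np.linspace(-5, 5, 1001))
print(theta_star, X.mean())   # the grid argmax should be close to the sample mean
```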


2/9/08 CS 461, Winter 2008 7

Example: do you wear glasses?

Bernoulli: two states, x ∈ {0, 1}
P(x) = p₀^x (1 − p₀)^(1−x)

Log likelihood:
L(p₀|X) = log ∏_t p₀^(x^t) (1 − p₀)^(1−x^t)

MLE: p̂₀ = ∑_t x^t / N

[Alpaydin 2004 The MIT Press]
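A minimal Python sketch of this Bernoulli MLE, using simulated yes/no answers (the data and sample size are assumed for illustration); it checks that the closed form ∑_t x^t / N matches a grid search over the log likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.binomial(1, 0.7, size=50)          # simulated yes/no "glasses" answers

p0_mle = X.mean()                          # closed form: sum_t x^t / N

# Check against L(p0|X) = sum_t [x^t log p0 + (1 - x^t) log(1 - p0)]
grid = np.linspace(0.01, 0.99, 99)
loglik = np.array([np.sum(X * np.log(p) + (1 - X) * np.log(1 - p)) for p in grid])
print(p0_mle, grid[loglik.argmax()])       # the two estimates should agree closely
```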


2/9/08 CS 461, Winter 2008 8

Gaussian (Normal) Distribution

p(x) = N ( μ, σ2)

MLE for μ and σ2:

p(x) = [1 / (√(2π) σ)] exp[ −(x − μ)² / (2σ²) ]

m = ∑_t x^t / N

s² = ∑_t (x^t − m)² / N

[Alpaydin 2004 The MIT Press]
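A short sketch of these Gaussian ML estimates in Python; the true μ, σ, and sample size are assumed for illustration (note the variance estimate divides by N, not N − 1).

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(loc=5.0, scale=2.0, size=200)   # sample from N(mu=5, sigma=2)

m = X.sum() / len(X)                  # MLE of the mean
s2 = np.sum((X - m) ** 2) / len(X)    # MLE of the variance (divides by N, not N-1)

print(m, s2)                          # should be near 5 and 4
```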


2/9/08 CS 461, Winter 2008 9

How good is that estimate?

Let d be the estimate of θ
  It is also a random variable; technically it's d(X), since it depends on the sample X
  E[d] is the expected value of d (averaged over possible samples X)

Bias: b_θ(d) = E[d] − θ
  How far off from the correct value is it?

Variance: Var(d) = E[(d − E[d])²]
  How much does it change with different X?

Mean square error: r(d, θ) = E[(d − θ)²]
  = (E[d] − θ)² + E[(d − E[d])²] = Bias² + Variance

[Alpaydin 2004 The MIT Press]
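A Monte Carlo sketch of these definitions, assuming a Gaussian with a known true variance: it repeatedly draws samples, computes the ML variance estimate s² (which is biased), and measures its bias, variance, and mean square error.

```python
import numpy as np

rng = np.random.default_rng(3)
true_sigma2, N, trials = 4.0, 10, 20000

estimates = np.empty(trials)
for i in range(trials):
    X = rng.normal(0.0, np.sqrt(true_sigma2), size=N)
    estimates[i] = np.mean((X - X.mean()) ** 2)    # MLE s^2 (divides by N)

bias = estimates.mean() - true_sigma2              # E[d] - theta (about -sigma^2/N here)
variance = estimates.var()                         # E[(d - E[d])^2]
mse = np.mean((estimates - true_sigma2) ** 2)      # equals bias^2 + variance, up to noise
print(bias, variance, bias**2 + variance, mse)
```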


2/9/08 CS 461, Winter 2008 10

Bayes Estimator: Using what we already know

Prior information: p(θ)

Bayes's rule (get the posterior): p(θ|X) = p(X|θ) p(θ) / p(X)

Maximum a Posteriori (MAP): θ_MAP = argmax_θ p(θ|X)

Maximum Likelihood (ML): θ_ML = argmax_θ p(X|θ)

Bayes estimator: θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ
  In a sense, it is the "weighted mean" for θ

For our purposes, θ_MAP = θ_Bayes

[Alpaydin 2004 The MIT Press]


2/9/08 CS 461, Winter 2008 11

Bayes Estimator: Continuous Example

Assume x^t ~ N(θ, σ₀²) and prior θ ~ N(μ, σ²)

θ_ML = m (the sample mean)

θ_MAP = θ_Bayes = E[θ|X] = [ (N/σ₀²) / (N/σ₀² + 1/σ²) ] m + [ (1/σ²) / (N/σ₀² + 1/σ²) ] μ

Estimated mean = weighted average of the sample mean m and the prior mean μ
  The weights indicate how much you trust the sample

[Alpaydin 2004 The MIT Press]
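A small sketch of this weighted-average estimator in Python; the data, prior mean, and variances are illustrative choices.

```python
import numpy as np

def bayes_posterior_mean(X, sigma0, mu_prior, sigma_prior):
    """E[theta|X] for x^t ~ N(theta, sigma0^2) with prior theta ~ N(mu, sigma^2)."""
    N, m = len(X), np.mean(X)
    w_data = N / sigma0**2          # weight on the sample mean
    w_prior = 1.0 / sigma_prior**2  # weight on the prior mean
    return (w_data * m + w_prior * mu_prior) / (w_data + w_prior)

rng = np.random.default_rng(4)
X = rng.normal(3.0, 1.0, size=5)    # few samples, so the prior still matters
print(np.mean(X), bayes_posterior_mean(X, sigma0=1.0, mu_prior=0.0, sigma_prior=1.0))
```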


2/9/08 CS 461, Winter 2008 12

Example: Coin flipping


0 flips total

[Copyright Terran Lane]


2/9/08 CS 461, Winter 2008 13

Example: Coin flipping

1 flip total

[Copyright Terran Lane]


2/9/08 CS 461, Winter 2008 14

Example: Coin flipping

5 flips total

[Copyright Terran Lane]


2/9/08 CS 461, Winter 2008 15

Example: Coin flipping

10 flips total

[Copyright Terran Lane]


2/9/08 CS 461, Winter 2008 16

Example: Coin flipping

20 flips total

[Copyright Terran Lane]


2/9/08 CS 461, Winter 2008 17

Example: Coin flipping

50 flips total

[Copyright Terran Lane]


2/9/08 CS 461, Winter 2008 18

Example: Coin flipping

100 flips total

[Copyright Terran Lane]
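The coin-flipping figures are not reproduced here. One standard way to generate such a sequence of posteriors is a Beta-Bernoulli model; the sketch below assumes a uniform Beta(1,1) prior and a simulated coin with P(heads) = 0.7 (both illustrative choices) and prints the posterior mean and the MLE at the flip counts shown on the slides.

```python
import numpy as np

rng = np.random.default_rng(5)
flips = rng.binomial(1, 0.7, size=100)          # simulated coin with P(heads) = 0.7
a, b = 1.0, 1.0                                 # Beta(1,1) = uniform prior over P(heads)

for n in [0, 1, 5, 10, 20, 50, 100]:            # checkpoints matching the slides
    heads = flips[:n].sum()
    tails = n - heads
    post_mean = (a + heads) / (a + heads + b + tails)   # mean of Beta(a+heads, b+tails)
    mle = heads / n if n > 0 else float('nan')
    print(n, post_mean, mle)                    # posterior mean shrinks toward 0.5 early on
```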


2/9/08 CS 461, Winter 2008 19

Parametric Classification

Remember Naïve Bayes?

Maximum likelihood estimator (MLE) classifier:
C_predict = argmax_c P(X₁ = u₁ ∧ … ∧ X_m = u_m | C = c)

Maximum a posteriori (MAP) classifier:
C_predict = argmax_c P(C = c | X₁ = u₁ ∧ … ∧ X_m = u_m)

[Copyright Andrew Moore]


2/9/08 CS 461, Winter 2008 20

Parametric Classification

Discriminant (take the max over i):

g_i(x) = p(x|C_i) P(C_i)

or equivalently

g_i(x) = log p(x|C_i) + log P(C_i)

If we assume the p(x|C_i) are Gaussian:

p(x|C_i) = [1 / (√(2π) σ_i)] exp[ −(x − μ_i)² / (2σ_i²) ]

g_i(x) = −(1/2) log 2π − log σ_i − (x − μ_i)² / (2σ_i²) + log P(C_i)

[Alpaydin 2004 The MIT Press]
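A minimal sketch of evaluating this Gaussian discriminant for two hypothetical classes with assumed means, standard deviations, and priors.

```python
import numpy as np

def g(x, mu_i, sigma_i, prior_i):
    """Gaussian class discriminant g_i(x) = log p(x|C_i) + log P(C_i)."""
    return (-0.5 * np.log(2 * np.pi) - np.log(sigma_i)
            - (x - mu_i) ** 2 / (2 * sigma_i ** 2) + np.log(prior_i))

# Two hypothetical classes; predict the one with the larger discriminant.
x = 1.2
scores = [g(x, 0.0, 1.0, 0.5), g(x, 3.0, 1.0, 0.5)]
print(int(np.argmax(scores)))   # 0 here, since x is closer to the class-0 mean
```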


2/9/08 CS 461, Winter 2008 21

Given the sample X = {x^t, r^t}, t = 1, …, N, with 1-D data x^t ∈ ℝ and class indicators

r_i^t = 1 if x^t ∈ C_i, and r_i^t = 0 if x^t ∈ C_j, j ≠ i

ML estimates are:

P̂(C_i) = ∑_t r_i^t / N   (observed frequency of class i)

m_i = ∑_t x^t r_i^t / ∑_t r_i^t   (observed mean)

s_i² = ∑_t (x^t − m_i)² r_i^t / ∑_t r_i^t   (observed variance)

Discriminant becomes:

g_i(x) = −(1/2) log 2π − log s_i − (x − m_i)² / (2 s_i²) + log P̂(C_i)

[Alpaydin 2004 The MIT Press]
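A short Python sketch that computes these ML estimates from labeled 1-D data and then classifies a new point with the discriminant; the two simulated classes are assumed for illustration.

```python
import numpy as np

def fit_and_classify(x_train, labels, x_new, classes=(0, 1)):
    """Fit P(C_i), m_i, s_i^2 by ML from labeled 1-D data, then take argmax_i g_i(x_new)."""
    N = len(x_train)
    scores = []
    for c in classes:
        xc = x_train[labels == c]          # samples with r_i^t = 1
        prior = len(xc) / N                # observed frequency
        m = xc.mean()                      # observed mean
        s2 = np.mean((xc - m) ** 2)        # observed variance
        scores.append(-0.5 * np.log(2 * np.pi) - 0.5 * np.log(s2)
                      - (x_new - m) ** 2 / (2 * s2) + np.log(prior))
    return classes[int(np.argmax(scores))]

rng = np.random.default_rng(6)
x0 = rng.normal(0.0, 1.0, 60)              # hypothetical class-0 sample
x1 = rng.normal(3.0, 1.5, 40)              # hypothetical class-1 sample
x_train = np.concatenate([x0, x1])
labels = np.array([0] * 60 + [1] * 40)
print(fit_and_classify(x_train, labels, x_new=2.0))
```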


2/9/08 CS 461, Winter 2008 22

Example: 2 classes (same variance)

[Alpaydin 2004 The MIT Press]

Posterior = discriminant g(x) normalized by P(x)

Likelihood = Gaussian with mean, std dev

Assume both classes have the same prior


2/9/08 CS 461, Winter 2008 23

Example: 2 classes (diff variance)

[Alpaydin 2004 The MIT Press]

Posterior = discriminant g(x) normalized by P(x)

Likelihood = Gaussian with mean, std dev

Assume both classes have the same prior


2/9/08 CS 461, Winter 2008 24

Example: Predicting student’s major

From HW 3: use the "glasses" feature (yes or no), a discrete distribution

Likelihood (of data)


2/9/08 CS 461, Winter 2008 25

Example: Predicting student’s major

Posterior (of class)

Priors: P(CS) = 0.4, P(Physics) = 0.3, P(EE) = 0.3

Probability of data: P(yes) = 0.6, P(no) = 0.4

Bayes: P(major|glasses) = P(glasses|major) P(major) / P(glasses)

MAP prediction shown for glasses = yes and for glasses = no
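The per-major likelihoods P(glasses | major) come from HW 3 and are not in the slides, so the values below are placeholders (chosen so that P(yes) works out to the slide's 0.6); the priors follow the slide. A sketch of the Bayes/MAP computation:

```python
priors = {'CS': 0.4, 'Physics': 0.3, 'EE': 0.3}            # from the slide
likelihood_yes = {'CS': 0.75, 'Physics': 0.6, 'EE': 0.4}   # placeholder values; HW 3 has the real ones

def posterior(wears_glasses):
    """P(major | glasses) = P(glasses | major) P(major) / P(glasses)."""
    lik = {m: (p if wears_glasses else 1 - p) for m, p in likelihood_yes.items()}
    evidence = sum(lik[m] * priors[m] for m in priors)      # P(glasses): 0.6 for yes, 0.4 for no
    return {m: lik[m] * priors[m] / evidence for m in priors}

for obs in (True, False):
    post = posterior(obs)
    print(obs, post, max(post, key=post.get))               # MAP major for each observation
```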


2/9/08 CS 461, Winter 2008 26

Summary: Key Points for Today

Parametric methods
  Data comes from a distribution
  Bernoulli, Gaussian, and their parameters
  How good is a parameter estimate? (bias, variance)

Bayes estimation
  ML: use the data
  MAP: use the prior and the data
  Bayes estimator: integrated estimate (weighted)

Parametric classification
  Maximize the posterior probability


2/9/08 CS 461, Winter 2008 27

Next Time

Clustering! (read Ch. 7.1-7.4; 7.8 optional)

No reading questions