Introduction to Bayesian methods, Lecture 10. David Sontag, New York University. Slides adapted from Luke Zettlemoyer, Carlos Guestrin, Dan Klein, and Vibhav Gogate.


Page 1

Introduction to Bayesian methods, Lecture 10

David Sontag, New York University

Slides adapted from Luke Zettlemoyer, Carlos Guestrin, Dan Klein, and Vibhav Gogate

Page 2

Bayesian  learning  

•  Bayesian learning uses probability to model data and quantify uncertainty of predictions
  –  Facilitates incorporation of prior knowledge
  –  Gives optimal predictions

•  Allows for decision-theoretic reasoning

Page 3

Your first consulting job
•  A billionaire from the suburbs of Manhattan asks you a question:
  –  He says: I have a thumbtack; if I flip it, what's the probability it will fall with the nail up?

  –  You say: Please flip it a few times:

–  You  say:  The  probability  is:  •  P(heads)  =  3/5  

– He  says:  Why???  –  You  say:  Because…  

Page 4

Outline  of  lecture  

•  Review of probability
•  Maximum likelihood estimation

2 examples of Bayesian classifiers:

•  Naïve Bayes
•  Logistic regression

Page 5

Random Variables

•  A random variable is some aspect of the world about which we (may) have uncertainty –  R = Is it raining? –  D = How long will it take to drive to work? –  L = Where am I?

•  We denote random variables with capital letters

•  Random variables have domains –  R in {true, false} (sometimes write as {+r, ¬r}) –  D in [0, ∞) –  L in possible locations, maybe {(0,0), (0,1), …}

Page 6

Probability Distributions •  Discrete random variables have distributions

•  A discrete distribution is a TABLE of probabilities of values

•  The probability of a state (lower case) is a single number

•  Must have:  P(x) ≥ 0 for every value x, and Σ_x P(x) = 1

T     P
warm  0.5
cold  0.5

W       P
sun     0.6
rain    0.1
fog     0.3
meteor  0.0

Page 7

Joint Distributions
•  A joint distribution over a set of random variables specifies a real number for each assignment:

–  How many assignments if n variables with domain sizes d?

–  Must obey:  P(x1, …, xn) ≥ 0 and Σ over all assignments of P(x1, …, xn) = 1

•  For all but the smallest distributions, impractical to write out or estimate –  Instead, we make additional assumptions about the distribution

T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3

Page 8

Marginal Distributions

•  Marginal distributions are sub-tables which eliminate variables

•  Marginalization (summing out): Combine collapsed rows by adding

T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3

T     P
hot   0.5
cold  0.5

W     P
sun   0.6
rain  0.4

P(t) = Σ_w P(t, w)        P(w) = Σ_t P(t, w)

Page 9

Conditional Probabilities
•  A simple relation between joint and conditional probabilities:  P(a | b) = P(a, b) / P(b)

  –  In fact, this is taken as the definition of a conditional probability

T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3

Page 10

Conditional Distributions
•  Conditional distributions are probability distributions over some variables given fixed values of others

Joint Distribution:
T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3

Conditional Distributions:
P(W | T = hot):
W     P
sun   0.8
rain  0.2

P(W | T = cold):
W     P
sun   0.4
rain  0.6

Page 11

The Product Rule

•  Sometimes have conditional distributions but want the joint:  P(x, y) = P(x | y) P(y)

•  Example:

P(W):
W     P
sun   0.8
rain  0.2

P(D | W):
D     W     P
wet   sun   0.1
dry   sun   0.9
wet   rain  0.7
dry   rain  0.3

P(D, W) = P(D | W) P(W):
D     W     P
wet   sun   0.08
dry   sun   0.72
wet   rain  0.14
dry   rain  0.06

Page 12

Bayes’ Rule

•  Two ways to factor a joint distribution over two variables:  P(x, y) = P(x | y) P(y) = P(y | x) P(x)

•  Dividing, we get:  P(x | y) = P(y | x) P(x) / P(y)

•  Why is this at all helpful?
  –  Lets us build one conditional from its reverse
  –  Often one conditional is tricky but the other one is simple
  –  Foundation of many practical systems (e.g. ASR, MT)

•  In the running for most important ML equation!

Page 13

Returning  to  thumbtack  example…  

•  P(Heads) = θ,  P(Tails) = 1 − θ

•  Flips are i.i.d.:
  –  Independent events
  –  Identically distributed according to Bernoulli distribution

•  Sequence D of αH Heads and αT Tails

D = {xi | i = 1…n},   P(D | θ) = ∏i P(xi | θ)

Called the “likelihood” of the data under the model

Page 14

Maximum Likelihood Estimation
•  Data: Observed set D of αH Heads and αT Tails
•  Hypothesis: Bernoulli distribution
•  Learning: finding θ is an optimization problem
  –  What's the objective function?

•  MLE: Choose θ to maximize probability of D

Page 15

Your  first  parameter  learning  algorithm  

•  Set derivative to zero, and solve!

θ̂ = argmax_θ ln P(D | θ)

d/dθ ln P(D | θ) = d/dθ ln [ θ^αH (1 − θ)^αT ]
                 = d/dθ [ αH ln θ + αT ln(1 − θ) ]
                 = αH d/dθ ln θ + αT d/dθ ln(1 − θ)
                 = αH/θ − αT/(1 − θ) = 0

Solving for θ gives  θ̂_MLE = αH / (αH + αT)
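The same estimator in a minimal Python sketch (not from the original slides; flips are assumed to be encoded as 1 = heads, 0 = tails):

```python
# Sketch: maximum likelihood estimate of a Bernoulli parameter,
# theta_MLE = alpha_H / (alpha_H + alpha_T).

def bernoulli_mle(flips):
    alpha_h = sum(flips)              # number of heads
    alpha_t = len(flips) - alpha_h    # number of tails
    return alpha_h / (alpha_h + alpha_t)

print(bernoulli_mle([1, 1, 1, 0, 0]))  # 3 heads, 2 tails -> 0.6, the 3/5 from the thumbtack example
```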

Page 16

Data

L(θ; D) = ln P(D | θ)

Page 17

What if I have prior beliefs?
•  Billionaire says: Wait, I know that the thumbtack is “close” to 50-50. What can you do for me now?

•  You say: I can learn it the Bayesian way…
•  Rather than estimating a single θ, we obtain a distribution over possible values of θ

Observe flips e.g.: {tails, tails}

In the beginning

Pr(θ)

After observations

Pr(θ | D)

Page 18

Bayesian  Learning  

•  Use Bayes' rule:

P(θ | D) = P(D | θ) P(θ) / P(D)
(Posterior = Data Likelihood × Prior / Normalization)

•  Or equivalently:  P(θ | D) ∝ P(D | θ) P(θ)

•  For uniform priors, P(θ) ∝ 1, this reduces to maximum likelihood estimation:  P(θ | D) ∝ P(D | θ)

Page 19

Bayesian Learning for Thumbtacks

Likelihood:  P(D | θ) = θ^αH (1 − θ)^αT

•  What should the prior be?
  –  Represent expert knowledge
  –  Simple posterior form

•  For binary variables, commonly used prior is the Beta distribution:

P(θ) = Beta(βH, βT) ∝ θ^(βH − 1) (1 − θ)^(βT − 1)

Page 20

Beta prior distribution – P(θ)

•  Since the Beta distribution is conjugate to the Bernoulli distribution, the posterior distribution has a particularly simple form:

P(θ | D) ∝ θ^αH (1 − θ)^αT · θ^(βH − 1) (1 − θ)^(βT − 1)
         = θ^(αH + βH − 1) (1 − θ)^(αT + βT − 1)
         = Beta(αH + βH, αT + βT)
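As a quick illustration, a small Python sketch of this conjugate update (the pseudo-count arguments beta_h, beta_t and the example numbers are hypothetical):

```python
# Sketch: Beta-Bernoulli conjugate update.
# Prior:     Beta(beta_h, beta_t)
# Posterior: Beta(alpha_h + beta_h, alpha_t + beta_t)

def beta_posterior(alpha_h, alpha_t, beta_h, beta_t):
    return alpha_h + beta_h, alpha_t + beta_t

def beta_mode(a, b):
    # MAP estimate of theta: mode of Beta(a, b), defined for a, b > 1
    return (a - 1) / (a + b - 2)

# 3 heads, 2 tails, with a prior that says the thumbtack is "close" to 50-50:
a, b = beta_posterior(3, 2, 10, 10)
print(a, b, beta_mode(a, b))  # 13 12, MAP ~ 0.52 (pulled toward 0.5 by the prior)
```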

Page 21

Using Bayesian inference for prediction

•  We now have a distribution over parameters
•  For any specific f, a function of interest, compute the expected value of f:

E[f] = ∫ f(θ) P(θ | D) dθ

•  Integral is often hard to compute
•  As more data is observed, the posterior becomes more concentrated

•  MAP (Maximum a posteriori approximation): use the most likely parameter to approximate the expectation

Page 22

What about continuous variables?

•  Billionaire says: If I am measuring a continuous variable, what can you do for me?

•  You say: Let me tell you about Gaussians…

Page 23

Some properties of Gaussians

•  Affine transformations (multiplying by a scalar and adding a constant) of a Gaussian are Gaussian
  –  X ~ N(µ, σ²)
  –  Y = aX + b  ⇒  Y ~ N(aµ + b, a²σ²)

•  Sum of (independent) Gaussians is Gaussian
  –  X ~ N(µX, σ²X)
  –  Y ~ N(µY, σ²Y)
  –  Z = X + Y  ⇒  Z ~ N(µX + µY, σ²X + σ²Y)

•  Easy to differentiate, as we will see soon!

Page 24

Learning a Gaussian

•  Collect a bunch of data
  –  Hopefully, i.i.d. samples
  –  e.g., exam scores

•  Learn parameters
  –  µ (“mean”)
  –  σ² (“variance”)

i     Exam Score (xi)
0     85
1     95
2     100
3     12
…     …
99    89

Page 25

MLE for Gaussian:

•  Prob. of i.i.d. samples D = {x1, …, xN}:

P(D | µ, σ) = ∏_{i=1}^N (1 / (σ√(2π))) exp(−(xi − µ)² / (2σ²))

•  Log-likelihood of data:

ln P(D | µ, σ) = −N ln(σ√(2π)) − Σ_{i=1}^N (xi − µ)² / (2σ²)

µMLE, σMLE = argmax_{µ,σ} P(D | µ, σ)

Page 26

Your  second  learning  algorithm:  MLE  for  mean  of  a  Gaussian  

•  What’s  MLE  for  mean?  

d/dµ ln P(D | µ, σ) = Σ_{i=1}^N (xi − µ) / σ² = 0
⇒ −Σ_{i=1}^N xi + Nµ = 0
⇒ µMLE = (1/N) Σ_{i=1}^N xi

Page 27

MLE for variance
•  Again, set derivative to zero:

d/dσ ln P(D | µ, σ) = −N/σ + Σ_{i=1}^N (xi − µ)² / σ³ = 0
⇒ σ²MLE = (1/N) Σ_{i=1}^N (xi − µMLE)²

Page 28

Learning Gaussian parameters

•  MLE:

µMLE = (1/N) Σ_{i=1}^N xi        σ²MLE = (1/N) Σ_{i=1}^N (xi − µMLE)²

•  MLE for the variance of a Gaussian is biased
  –  Expected result of estimation is not the true parameter!
  –  Unbiased variance estimator:  σ²unbiased = (1/(N − 1)) Σ_{i=1}^N (xi − µMLE)²
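A short Python sketch of these estimators, run on the first few exam scores from the table a few slides back:

```python
# Sketch: MLE for the mean and variance of a Gaussian, plus the unbiased
# variance estimator (divide by N - 1 instead of N).

def gaussian_estimates(xs):
    n = len(xs)
    mu = sum(xs) / n                                       # MLE mean
    var_mle = sum((x - mu) ** 2 for x in xs) / n           # biased MLE variance
    var_unbiased = sum((x - mu) ** 2 for x in xs) / (n - 1)
    return mu, var_mle, var_unbiased

print(gaussian_estimates([85, 95, 100, 12, 89]))
```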

Page 29

Bayesian learning of Gaussian parameters

•  Conjugate priors
  –  Mean: Gaussian prior
  –  Variance: Wishart Distribution

•  Prior for mean:

Page 30

Outline  of  lecture  

•  Review of probability
•  Maximum likelihood estimation

2 examples of Bayesian classifiers:

•  Naïve Bayes
•  Logistic regression

Page 31

Bayesian Classification

•  Problem statement:
  –  Given features X1, X2, …, Xn
  –  Predict a label Y

[Next several slides adapted from: Vibhav Gogate, Jonathan Huang, Luke Zettlemoyer, Carlos Guestrin, and Dan Weld]

Page 32

Example Application

•  Digit Recognition

•  X1, …, Xn ∈ {0,1} (Black vs. White pixels)

•  Y ∈ {0,1,2,3,4,5,6,7,8,9}

Classifier 5

Page 33

The  Bayes  Classifier  

•  If we had the joint distribution on X1, …, Xn and Y, we could predict using:  argmax_y P(Y = y | X1, …, Xn)

  –  (for example: what is the probability that the image represents a 5 given its pixels?)

•  So  …  How  do  we  compute  that?  

Page 34

The  Bayes  Classifier  

•  Use Bayes Rule!

P(Y | X1, …, Xn) = P(X1, …, Xn | Y) P(Y) / P(X1, …, Xn)
(Likelihood × Prior / Normalization Constant)

•  Why did this help?  Well, we think that we might be able to specify how features are “generated” by the class label

Page 35

The Bayes Classifier

•  Let's expand this for our digit recognition task:

•  To classify, we'll simply compute these probabilities, one per class, and predict based on which one is largest

Page 36

Model Parameters

•  How many parameters are required to specify the likelihood, P(X1, …, Xn | Y)?
  –  (Supposing that each image is 30x30 pixels)

•  The problem with explicitly modeling P(X1, …, Xn | Y) is that there are usually way too many parameters:
  –  We'll run out of space
  –  We'll run out of time
  –  And we'll need tons of training data (which is usually not available)

Page 37

Naïve Bayes
•  Naïve Bayes assumption:
  –  Features are independent given class:  P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)
  –  More generally:  P(X1, …, Xn | Y) = ∏_{i=1}^n P(Xi | Y)

•  How many parameters now?
•  Suppose X is composed of n binary features

Page 38

The Naïve Bayes Classifier
•  Given:
  –  Prior P(Y)
  –  n conditionally independent features X given the class Y
  –  For each Xi, we have likelihood P(Xi | Y)

•  Decision rule:  y* = argmax_y P(y) ∏_{i=1}^n P(xi | y)

If certain assumptions hold, NB is the optimal classifier! (they typically don't)

[Graphical model: Y with children X1, X2, …, Xn]

Page 39

A Digit Recognizer

•  Input: pixel grids

•  Output: a digit 0-9

Are the naïve Bayes assumptions realistic here?

Page 40

What has to be learned?

P(Y):
Y:  1    2    3    4    5    6    7    8    9    0
P:  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1

P(Xi = on | Y) for one example pixel feature:
Y:  1     2     3     4     5     6     7     8     9     0
P:  0.01  0.05  0.05  0.30  0.80  0.90  0.05  0.60  0.50  0.80

P(Xi = on | Y) for another example pixel feature:
Y:  1     2     3     4     5     6     7     8     9     0
P:  0.05  0.01  0.90  0.80  0.90  0.90  0.25  0.85  0.60  0.80

Page 41

MLE for the parameters of NB
•  Given dataset
  –  Count(A=a, B=b) ≡ number of examples where A=a and B=b

•  MLE for discrete NB, simply:
  –  Prior:
  –  Observation distribution:

P(Y = y) = Count(Y = y) / Σ_y' Count(Y = y')

P(Xi = x | Y = y) = Count(Xi = x, Y = y) / Σ_x' Count(Xi = x', Y = y)
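A compact Python sketch of these count-based estimates together with the naive Bayes decision rule (the toy dataset is made up; log probabilities are used to avoid underflow):

```python
# Sketch: MLE training and prediction for a discrete naive Bayes classifier.
# Training: P(Y=y) and P(X_i=x | Y=y) from counts.
# Prediction: argmax_y P(y) * prod_i P(x_i | y), computed in log space.
import math
from collections import Counter, defaultdict

def train_nb(X, y):
    class_counts = Counter(y)
    feat_counts = defaultdict(Counter)            # (feature index, label) -> value counts
    for xs, label in zip(X, y):
        for i, v in enumerate(xs):
            feat_counts[(i, label)][v] += 1
    prior = {c: class_counts[c] / len(y) for c in class_counts}
    def cond(i, v, c):
        return feat_counts[(i, c)][v] / class_counts[c]
    return prior, cond

def predict_nb(xs, prior, cond):
    def log_score(c):
        return math.log(prior[c]) + sum(math.log(cond(i, v, c)) for i, v in enumerate(xs))
    return max(prior, key=log_score)

# Toy data: two binary features, two classes (no zero counts here, so no smoothing needed yet).
X = [(1, 1), (1, 0), (0, 1), (0, 0), (1, 1), (0, 0)]
y = [1, 1, 1, 0, 0, 0]
prior, cond = train_nb(X, y)
print(predict_nb((1, 1), prior, cond))   # -> 1
```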

Page 42

MLE  for  the  parameters  of  NB  

•  Training  amounts  to,  for  each  of  the  classes,  averaging  all  of  the  examples  together:  

Page 43

MAP estimation for NB
•  Given dataset
  –  Count(A=a, B=b) ≡ number of examples where A=a and B=b

•  MAP estimation for discrete NB, simply:
  –  Prior:
  –  Observation distribution:

•  Called “smoothing”. Corresponds to a Dirichlet prior!

P(Y = y) = Count(Y = y) / Σ_y' Count(Y = y')

P(Xi = x | Y = y) = (Count(Xi = x, Y = y) + a) / (Σ_x' Count(Xi = x', Y = y) + |Xi|·a)
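A one-function Python sketch of the smoothed observation estimate (a is the pseudo-count from the Dirichlet prior; a = 1 gives Laplace smoothing):

```python
# Sketch: MAP / smoothed estimate of P(X_i = x | Y = y).
# count_xy = Count(X_i = x, Y = y);  count_y = sum over x' of Count(X_i = x', Y = y)
# n_values = |X_i|, the number of values feature X_i can take;  a = pseudo-count

def smoothed_conditional(count_xy, count_y, n_values, a=1.0):
    return (count_xy + a) / (count_y + n_values * a)

# A binary pixel never seen "on" for class y in 100 training examples:
print(smoothed_conditional(0, 100, 2))  # ~0.0098, not 0
```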

Page 44

What if there is missing data?
•  One of the key strengths of Bayesian approaches is that they can naturally handle missing data

Missing data

•  Suppose we don't have a value for some attribute Xi
  •  applicant's credit history unknown
  •  some medical test not performed on patient
  •  how to compute P(X1=x1 … Xj=? … Xd=xd | y)

•  Easy with Naïve Bayes
  •  ignore the attribute in an instance where its value is missing
  •  compute likelihood based on observed attributes
  •  no need to “fill in” or explicitly model missing values
  •  based on conditional independence between attributes

Copyright © Victor Lavrenko, 2013

Missing data (2)

•  Ex: three coin tosses: Event = {X1=H, X2=?, X3=T}
  •  event = head, unknown (either head or tail), tail
  •  event = {H,H,T} + {H,T,T}
  •  P(event) = P(H,H,T) + P(H,T,T)

•  General case: Xj has missing value:  P(x1, …, ?, …, xd | y) = Σ_{xj} P(x1, …, xj, …, xd | y)

Copyright © Victor Lavrenko, 2013

Summary
•  Naïve Bayes classifier
  -  explicitly handles class priors
  -  “normalizes” across observations: outliers comparable
  -  assumption: all dependence is “explained” by class label

•  Continuous example
  -  unable to handle correlated data

•  Discrete example
  -  fooled by repetitions
  -  must deal with zero-frequency problem

•  Pros:
  -  handles missing data
  -  good computational complexity
  -  incremental updates

Copyright © Victor Lavrenko, 2013

Computational complexity

•  One of the fastest learning methods •  O(nd+cd) training time complexity

•  c … number of classes •  n … number of instances •  d … number of dimensions (attributes) •  both learning and prediction •  no hidden constants (number of iterations, etc.) •  testing: O(ndc)

•  O(dc) space complexity •  only decision trees are more compact

Copyright © Victor Lavrenko, 2013

[Slide from Victor Lavrenko and Nigel Goddard]
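To make the “ignore the missing attribute” point concrete, here is a small Python sketch (the probability table and the use of None to mark a missing value are hypothetical; skipping the factor is valid because Σ_x P(Xj = x | y) = 1):

```python
# Sketch: naive Bayes prediction when some attribute values are missing.
# A value of None marks a missing attribute; its factor is simply skipped,
# which is the same as summing it out of the product.
import math

def predict_with_missing(xs, prior, cond):
    # prior: {label: P(y)};  cond: {(i, value, label): P(X_i = value | y)}
    def log_score(c):
        return math.log(prior[c]) + sum(
            math.log(cond[(i, v, c)]) for i, v in enumerate(xs) if v is not None)
    return max(prior, key=log_score)

prior = {"+": 0.5, "-": 0.5}
cond = {(0, 1, "+"): 0.8, (0, 0, "+"): 0.2, (1, 1, "+"): 0.7, (1, 0, "+"): 0.3,
        (0, 1, "-"): 0.3, (0, 0, "-"): 0.7, (1, 1, "-"): 0.4, (1, 0, "-"): 0.6}
print(predict_with_missing((1, None), prior, cond))  # second attribute missing -> "+"
```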

Page 45

Mehryar Mohri - Introduction to Machine Learning

Naive Bayes = Linear Classifier

Theorem: assume that xi ∈ {0, 1} for all i ∈ [1, N]. Then, the Naive Bayes classifier is defined by

x ↦ sgn(w · x + b),  where
wi = log (Pr[xi = 1 | +1] / Pr[xi = 1 | −1]) − log (Pr[xi = 0 | +1] / Pr[xi = 0 | −1])
b = log (Pr[+1] / Pr[−1]) + Σ_{i=1}^N log (Pr[xi = 0 | +1] / Pr[xi = 0 | −1])

Proof: observe that for any i ∈ [1, N],

log (Pr[xi | +1] / Pr[xi | −1]) = [ log (Pr[xi = 1 | +1] / Pr[xi = 1 | −1]) − log (Pr[xi = 0 | +1] / Pr[xi = 0 | −1]) ] xi + log (Pr[xi = 0 | +1] / Pr[xi = 0 | −1])

[Slide from Mehyrar Mohri]
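A Python sketch that builds w and b from the per-feature conditionals exactly as in the theorem above and evaluates sgn(w · x + b); the probability values are hypothetical:

```python
# Sketch: turning a binary naive Bayes model into the linear classifier
# x -> sgn(w . x + b), with w_i and b taken from the theorem above.
import math

def nb_to_linear(p_pos, p_neg, p1_pos, p1_neg):
    # p1_pos[i] = Pr[x_i = 1 | +1], p1_neg[i] = Pr[x_i = 1 | -1]
    w = []
    b = math.log(p_pos / p_neg)
    for ppos, pneg in zip(p1_pos, p1_neg):
        w.append(math.log(ppos / pneg) - math.log((1 - ppos) / (1 - pneg)))
        b += math.log((1 - ppos) / (1 - pneg))
    return w, b

w, b = nb_to_linear(0.5, 0.5, p1_pos=[0.9, 0.2], p1_neg=[0.4, 0.6])
x = [1, 0]
score = sum(wi * xi for wi, xi in zip(w, x)) + b
print(1 if score > 0 else -1)   # same sign as the naive Bayes log-odds for x
```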

Page 46

Outline of lecture

•  Review of probability
•  Maximum likelihood estimation

2 examples of Bayesian classifiers:

•  Naïve Bayes
•  Logistic regression

[Next several slides adapted from: Vibhav Gogate, Luke Zettlemoyer, Carlos Guestrin, and Dan Weld]

Page 47

Logistic Regression

•  Learn P(Y|X) directly!
•  Assume a particular functional form

✬  Linear classifier? On one side we say P(Y=1|X)=1, and on the other P(Y=1|X)=0

✬  But, this is not differentiable (hard to learn)… doesn't allow for label noise...

P(Y=1)=0

P(Y=1)=1

Page 48

Logistic Regression
Logistic function (Sigmoid):

•  Learn P(Y|X) directly!
•  Assume a particular functional form

•  Sigmoid applied to a linear function of the data:

Features can be discrete or continuous!

Sigmoid:  σ(z) = 1 / (1 + e^(−z))

The parametric model assumed by Logistic Regression when Y is boolean (from the Tom M. Mitchell (2010) excerpt shown on the slide):

P(Y = 1 | X) = 1 / (1 + exp(w0 + Σ_{i=1}^n wi Xi))        (16)

P(Y = 0 | X) = exp(w0 + Σ_{i=1}^n wi Xi) / (1 + exp(w0 + Σ_{i=1}^n wi Xi))        (17)

Equation (17) follows directly from equation (16), because the two probabilities must sum to 1.
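A tiny Python sketch of the sigmoid applied to a linear function of the features. Note a sign-convention detail: the Mitchell excerpt above puts +(w0 + Σ wi Xi) inside exp for P(Y = 1 | X), while the gradient slides later in the lecture use P(Y = 1 | X) = σ(w0 + Σ wi Xi); the two differ only in the sign of the learned weights, and this sketch follows the latter:

```python
# Sketch: P(Y = 1 | x) = sigmoid(w0 + sum_i w_i x_i).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_y1_given_x(x, w, w0):
    return sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, x)))

print(p_y1_given_x([2.0, -1.0], w=[0.5, 1.5], w0=0.1))  # hypothetical weights -> ~0.40
```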

Page 49

Logistic Function in n Dimensions

[3D plot: sigmoid of a linear function of x1 and x2]

Sigmoid applied to a linear function of the data:

Features can be discrete or continuous!

Page 50

Logistic Regression: decision boundary

A Linear Classifier!

•  Prediction: Output the Y with highest P(Y|X) –  For binary Y, output Y=0 if

1 < P(Y = 0 | X) / P(Y = 1 | X)

Substituting from equations (16) and (17), this becomes

1 < exp(w0 + Σ_{i=1}^n wi Xi)

and taking the natural log of both sides gives a linear classification rule: assign Y = 0 if

0 < w0 + Σ_{i=1}^n wi Xi        (18)

and assign Y = 1 otherwise. The decision boundary is the hyperplane w·X + w0 = 0.

(From the Mitchell (2010) excerpt: the parametric form of P(Y|X) used by Logistic Regression is precisely the form implied by the assumptions of a Gaussian Naive Bayes classifier, so Logistic Regression can be viewed as a closely related alternative to GNB, though the two can produce different results in many cases.)

Page 51

Likelihood vs. Conditional Likelihood

Generative (Naïve Bayes) maximizes Data likelihood:  argmax_w ∏_j P(x^j, y^j | w)

Discriminative (Logistic Regr.) maximizes Conditional Data Likelihood:  argmax_w ∏_j P(y^j | x^j, w)

Focuses only on learning P(Y|X) - all that matters for classification

Page 52

Maximizing Conditional Log Likelihood

l(w) = ln ∏_j P(y^j | x^j, w) = Σ_j ln P(y^j | x^j, w)      (each y^j is 0 or 1!)

Bad news: no closed-form solution to maximize l(w)

Good news: l(w) is a concave function of w!

No local maxima; concave functions are easy to optimize

Page 53

Optimizing a concave function – Gradient ascent

•  Conditional likelihood for Logistic Regression is concave!

Gradient:  ∇_w l(w) = [∂l(w)/∂w0, …, ∂l(w)/∂wn]

Update rule:  w^(t+1) = w^(t) + η ∇_w l(w^(t)),   learning rate η > 0

Page 54

Maximize Conditional Log Likelihood: Gradient ascent

l(w) = Σ_j [ y^j ln ( e^(w0 + Σ_i wi xi^j) / (1 + e^(w0 + Σ_i wi xi^j)) ) + (1 − y^j) ln ( 1 / (1 + e^(w0 + Σ_i wi xi^j)) ) ]

∂l(w)/∂wi = Σ_j [ ∂/∂wi y^j (w0 + Σ_i wi xi^j) − ∂/∂wi ln(1 + exp(w0 + Σ_i wi xi^j)) ]

          = Σ_j [ y^j xi^j − xi^j exp(w0 + Σ_i wi xi^j) / (1 + exp(w0 + Σ_i wi xi^j)) ]

          = Σ_j xi^j [ y^j − P(Y^j = 1 | x^j, w) ]

Page 55

Gradient Ascent for LR

Gradient ascent algorithm: (learning rate η > 0)

do:
  w0^(t+1) ← w0^(t) + η Σ_j [ y^j − P(Y^j = 1 | x^j, w^(t)) ]
  For i = 1 to n: (iterate over features)
    wi^(t+1) ← wi^(t) + η Σ_j xi^j [ y^j − P(Y^j = 1 | x^j, w^(t)) ]
until “change” < ε

Loop over training examples!
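A compact Python sketch of this training loop (batch gradient ascent on the conditional log likelihood, using the gradient derived on the previous slide; the toy data and learning rate are made up):

```python
# Sketch: batch gradient ascent for logistic regression,
#   w_i <- w_i + eta * sum_j x_i^j * (y^j - P(Y=1 | x^j, w)),
# with P(Y=1 | x, w) = sigmoid(w0 + sum_i w_i x_i).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_lr(X, y, eta=0.1, eps=1e-4, max_iters=20000):
    n = len(X[0])
    w0, w = 0.0, [0.0] * n
    for _ in range(max_iters):
        p = [sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, xj))) for xj in X]
        g0 = sum(yj - pj for yj, pj in zip(y, p))
        g = [sum(xj[i] * (yj - pj) for xj, yj, pj in zip(X, y, p)) for i in range(n)]
        w0 += eta * g0
        for i in range(n):
            w[i] += eta * g[i]
        if max(abs(v) for v in g + [g0]) < eps:   # "change" < epsilon
            return w0, w
    return w0, w

# Hypothetical 1-D data: the label tends to be 1 when the feature is large.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 1, 0, 1, 1]
w0, w = train_lr(X, y)
print(w0, w, sigmoid(w0 + w[0] * 4.5))  # predicted P(Y=1) for a new point x = 4.5
```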

Page 56

Naïve Bayes vs. Logistic Regression

Learning: h: X → Y      X – features      Y – target classes

Generative (Naïve Bayes)
•  Assume functional form for
  –  P(X|Y)  (assume conditional independence)
  –  P(Y)
  –  Estimate params from train data
•  Gaussian NB for cont. features
•  Bayes rule to calc. P(Y | X = x):
  –  P(Y | X) ∝ P(X | Y) P(Y)
•  Indirect computation
  –  Can generate a sample of the data
  –  Can easily handle missing data

Discriminative (Logistic Regr.)
•  Assume functional form for
  –  P(Y|X)  (no assumptions)
  –  Estimate params from training data
•  Handles discrete & cont. features
•  Directly calculate P(Y | X = x)
  –  Can't generate data sample

Page 57

Naïve Bayes vs. Logistic Regression

•  Generative vs. Discriminative classifiers
•  Asymptotic comparison (# training examples → infinity)
  –  when model correct
    •  NB, Linear Discriminant Analysis (with class independent variances), and Logistic Regression produce identical classifiers
  –  when model incorrect
    •  LR is less biased – does not assume conditional independence
    •  therefore LR expected to outperform NB

[Ng & Jordan, 2002]

Page 58

Naïve Bayes vs. Logistic Regression

•  Generative vs. Discriminative classifiers
•  Non-asymptotic analysis
  –  convergence rate of parameter estimates (n = # of attributes in X)
    •  Size of training data to get close to infinite data solution
    •  Naïve Bayes needs O(log n) samples
    •  Logistic Regression needs O(n) samples
  –  Naïve Bayes converges more quickly to its (perhaps less helpful) asymptotic estimates

[Ng & Jordan, 2002]

Page 59

Some experiments from UCI data sets    ©Carlos Guestrin 2005-2009

[Plots comparing Naïve Bayes and Logistic Regression on UCI data sets]