
Introduction to Bayesian methods (Lecture 10)

David  Sontag  New  York  University  

Slides adapted from Luke Zettlemoyer, Carlos Guestrin, Dan Klein, and Vibhav Gogate

Bayesian  learning  

•  Bayesian learning uses probability to model data and quantify uncertainty of predictions
   –  Facilitates incorporation of prior knowledge
   –  Gives optimal predictions

•  Allows for decision-theoretic reasoning

Your first consulting job
•  A billionaire from the suburbs of Manhattan asks you a question:
   –  He says: I have a thumbtack; if I flip it, what's the probability it will fall with the nail up?

–  You say: Please flip it a few times:

–  You say: The probability is:
   •  P(heads) = 3/5

–  He says: Why???
–  You say: Because…

Outline of lecture

•  Review of probability
•  Maximum likelihood estimation

2 examples of Bayesian classifiers:

•  Naïve Bayes
•  Logistic regression

Random Variables

•  A random variable is some aspect of the world about which we (may) have uncertainty –  R = Is it raining? –  D = How long will it take to drive to work? –  L = Where am I?

•  We denote random variables with capital letters

•  Random variables have domains –  R in {true, false} (sometimes write as {+r, ¬r}) –  D in [0, ∞) –  L in possible locations, maybe {(0,0), (0,1), …}

Probability Distributions •  Discrete random variables have distributions

•  A discrete distribution is a TABLE of probabilities of values

•  The probability of a state (lower case) is a single number

•  Must have: P(x) ≥ 0 for every value x, and Σx P(x) = 1

  T     P
  warm  0.5
  cold  0.5

  W       P
  sun     0.6
  rain    0.1
  fog     0.3
  meteor  0.0

Joint Distributions
•  A joint distribution over a set of random variables X1, …, Xn specifies a real number for each assignment:  P(X1 = x1, …, Xn = xn)

–  How many assignments if n variables with domain sizes d?  (d^n)

–  Must obey:  P(x1, …, xn) ≥ 0, and the probabilities of all assignments sum to 1

•  For all but the smallest distributions, impractical to write out or estimate
   –  Instead, we make additional assumptions about the distribution

  T     W     P
  hot   sun   0.4
  hot   rain  0.1
  cold  sun   0.2
  cold  rain  0.3

Marginal Distributions

•  Marginal distributions are sub-tables which eliminate variables

•  Marginalization (summing out): Combine collapsed rows by adding

  T     W     P
  hot   sun   0.4
  hot   rain  0.1
  cold  sun   0.2
  cold  rain  0.3

  T     P
  hot   0.5
  cold  0.5

  W     P
  sun   0.6
  rain  0.4

       P(t) = Σ_w P(t, w)

       P(w) = Σ_t P(t, w)
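To make marginalization concrete, here is a minimal sketch (not from the slides) that stores the joint P(T, W) above as a Python dict and sums out each variable in turn; the numbers are the slide's.

```python
# Joint distribution P(T, W) from the slide, stored as a dict.
joint = {
    ("hot", "sun"): 0.4,
    ("hot", "rain"): 0.1,
    ("cold", "sun"): 0.2,
    ("cold", "rain"): 0.3,
}

def marginal(joint, axis):
    """Sum out all variables except `axis` (0 for T, 1 for W)."""
    result = {}
    for assignment, p in joint.items():
        key = assignment[axis]
        result[key] = result.get(key, 0.0) + p
    return result

print(marginal(joint, 0))  # {'hot': 0.5, 'cold': 0.5}
print(marginal(joint, 1))  # {'sun': 0.6..., 'rain': 0.4}
```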

Conditional Probabilities
•  A simple relation between joint and conditional probabilities:

       P(a | b) = P(a, b) / P(b)

–  In fact, this is taken as the definition of a conditional probability

  T     W     P
  hot   sun   0.4
  hot   rain  0.1
  cold  sun   0.2
  cold  rain  0.3

Conditional Distributions
•  Conditional distributions are probability distributions over some variables given fixed values of others

  Joint P(T, W):
  T     W     P
  hot   sun   0.4
  hot   rain  0.1
  cold  sun   0.2
  cold  rain  0.3

  P(W | T = hot):
  W     P
  sun   0.8
  rain  0.2

  P(W | T = cold):
  W     P
  sun   0.4
  rain  0.6


The Product Rule

•  Sometimes have conditional distributions but want the joint:

       P(x, y) = P(x | y) P(y)

•  Example:

  P(W):
  W     P
  sun   0.8
  rain  0.2

  P(D | W):
  D     W     P
  wet   sun   0.1
  dry   sun   0.9
  wet   rain  0.7
  dry   rain  0.3

  P(D, W):
  D     W     P
  wet   sun   0.08
  dry   sun   0.72
  wet   rain  0.14
  dry   rain  0.06
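A small sketch (mine, not the slides') of the product rule in code: multiplying the P(W) and P(D | W) tables above reproduces the joint P(D, W), up to floating-point rounding.

```python
# P(W) and P(D | W) as given on the slide.
p_w = {"sun": 0.8, "rain": 0.2}
p_d_given_w = {
    ("wet", "sun"): 0.1, ("dry", "sun"): 0.9,
    ("wet", "rain"): 0.7, ("dry", "rain"): 0.3,
}

# Product rule: P(D = d, W = w) = P(D = d | W = w) * P(W = w)
p_dw = {(d, w): p_d_given_w[(d, w)] * p_w[w] for (d, w) in p_d_given_w}

print(p_dw)  # ('wet','sun'): 0.08, ('dry','sun'): 0.72, ('wet','rain'): 0.14, ('dry','rain'): 0.06
```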

Bayes' Rule

•  Two ways to factor a joint distribution over two variables:

       P(x, y) = P(x | y) P(y) = P(y | x) P(x)

•  Dividing, we get:

       P(x | y) = P(y | x) P(x) / P(y)

•  Why is this at all helpful?
   –  Lets us build one conditional from its reverse
   –  Often one conditional is tricky but the other one is simple
   –  Foundation of many practical systems (e.g. ASR, MT)

•  In the running for most important ML equation!
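As a quick illustration of building one conditional from its reverse, the following sketch (my choice of query, reusing the sun/rain tables above) computes P(W | D = wet) from P(D = wet | W) and P(W).

```python
p_w = {"sun": 0.8, "rain": 0.2}            # prior P(W)
p_wet_given_w = {"sun": 0.1, "rain": 0.7}  # likelihood P(D=wet | W)

# Bayes' rule: P(W=w | D=wet) = P(D=wet | W=w) P(W=w) / P(D=wet)
unnormalized = {w: p_wet_given_w[w] * p_w[w] for w in p_w}
p_wet = sum(unnormalized.values())         # normalization constant P(D=wet)
posterior = {w: v / p_wet for w, v in unnormalized.items()}

print(posterior)  # {'sun': ~0.36, 'rain': ~0.64}
```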

Returning to thumbtack example…

•  P(Heads) = θ,   P(Tails) = 1 − θ

•  Flips are i.i.d.:
   –  Independent events
   –  Identically distributed according to Bernoulli distribution

•  Sequence D of αH Heads and αT Tails

       D = {xi | i = 1…n},     P(D | θ) = Πi P(xi | θ)

Called the "likelihood" of the data under the model

Maximum Likelihood Estimation
•  Data: Observed set D of αH Heads and αT Tails
•  Hypothesis: Bernoulli distribution
•  Learning: finding θ is an optimization problem

   –  What's the objective function?

•  MLE: Choose θ to maximize probability of D

Your first parameter learning algorithm

•  Set derivative to zero, and solve!

       L(θ; D) = ln P(D | θ) = ln [ θ^αH (1 − θ)^αT ]

       θ̂ = argmax_θ ln P(D | θ)

       d/dθ ln P(D | θ) = d/dθ [ ln θ^αH (1 − θ)^αT ]
                        = d/dθ [ αH ln θ + αT ln(1 − θ) ]
                        = αH d/dθ ln θ + αT d/dθ ln(1 − θ)
                        = αH / θ − αT / (1 − θ) = 0

Solving for θ gives the MLE:   θ̂_MLE = αH / (αH + αT)
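The closed-form answer θ̂_MLE = αH / (αH + αT) is easy to check in code; this small sketch (not part of the slides) applies it to the billionaire's five flips.

```python
def bernoulli_mle(flips):
    """MLE for P(heads): theta_hat = alpha_H / (alpha_H + alpha_T)."""
    alpha_h = sum(1 for f in flips if f == "H")
    alpha_t = len(flips) - alpha_h
    return alpha_h / (alpha_h + alpha_t)

# Three heads and two tails, as in the consulting example: theta_hat = 3/5.
print(bernoulli_mle(["H", "H", "T", "H", "T"]))  # 0.6
```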

What if I have prior beliefs?
•  Billionaire says: Wait, I know that the thumbtack is "close" to 50-50. What can you do for me now?

•  You say: I can learn it the Bayesian way…
•  Rather than estimating a single θ, we obtain a distribution over possible values of θ

Observe flips, e.g.: {tails, tails}

[Figure: in the beginning, the prior Pr(θ); after observations, the posterior Pr(θ | D)]

Bayesian Learning

•  Use Bayes' rule:

       P(θ | D) = P(D | θ) P(θ) / P(D)

       posterior = data likelihood × prior / normalization

•  Or equivalently:   P(θ | D) ∝ P(D | θ) P(θ)

•  For uniform priors, P(θ) ∝ 1, this reduces to maximum likelihood estimation!

       P(θ | D) ∝ P(D | θ)

Bayesian Learning for Thumbtacks

Likelihood:   P(D | θ) = θ^αH (1 − θ)^αT

•  What should the prior be?
   –  Represent expert knowledge
   –  Simple posterior form

•  For binary variables, a commonly used prior is the Beta distribution:

Beta prior distribution – P(θ)

       P(θ) ∝ θ^(βH − 1) (1 − θ)^(βT − 1)   ~   Beta(βH, βT)

•  Since the Beta distribution is conjugate to the Bernoulli distribution, the posterior distribution has a particularly simple form:

       P(θ | D) ∝ P(D | θ) P(θ)
                ∝ θ^αH (1 − θ)^αT · θ^(βH − 1) (1 − θ)^(βT − 1)
                = θ^(αH + βH − 1) (1 − θ)^(αT + βT − 1)

       i.e. the posterior is  Beta(αH + βH,  αT + βT)
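A short sketch (mine, with made-up prior pseudo-counts) of the conjugate update just derived: the posterior is again a Beta, so its mean and mode (the MAP estimate) are one-line formulas.

```python
# Conjugate Beta-Bernoulli update: prior Beta(bH, bT) plus data (aH heads, aT tails)
# gives posterior Beta(aH + bH, aT + bT), as derived above.
def beta_posterior(a_h, a_t, b_h, b_t):
    return a_h + b_h, a_t + b_t

b_h, b_t = 5.0, 5.0          # hypothetical prior pseudo-counts ("close to 50-50")
a_h, a_t = 3, 2              # observed heads / tails

post_h, post_t = beta_posterior(a_h, a_t, b_h, b_t)
print("posterior mean:", post_h / (post_h + post_t))            # E[theta | D]
print("MAP estimate:  ", (post_h - 1) / (post_h + post_t - 2))  # mode of the Beta posterior
print("MLE:           ", a_h / (a_h + a_t))                     # ignores the prior
```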

•  We now have a distribution over parameters
•  For any specific f, a function of interest, compute the expected value of f:

       E[f(θ)] = ∫ f(θ) P(θ | D) dθ

•  The integral is often hard to compute
•  As more data is observed, the posterior becomes more concentrated (the prior matters less)

•  MAP (Maximum a posteriori approximation): use the most likely parameter to approximate the expectation:

       θ̂ = argmax_θ P(θ | D),        E[f(θ)] ≈ f(θ̂)

Using Bayesian inference for prediction

What about continuous variables?

•  Billionaire says: If I am measuring a continuous variable, what can you do for me?

•  You say: Let me tell you about Gaussians…

Some properties of Gaussians

•  Affine transformations (multiplying by a scalar and adding a constant) of a Gaussian are Gaussian
   –  X ~ N(μ, σ²)
   –  Y = aX + b   =>   Y ~ N(aμ + b, a²σ²)

•  Sum of (independent) Gaussians is Gaussian
   –  X ~ N(μX, σ²X),   Y ~ N(μY, σ²Y)
   –  Z = X + Y   =>   Z ~ N(μX + μY, σ²X + σ²Y)

•  Easy to differentiate, as we will see soon!

Learning a Gaussian

•  Collect a bunch of data
   –  Hopefully, i.i.d. samples
   –  e.g., exam scores

•  Learn parameters
   –  μ ("mean")
   –  σ² ("variance")

  i     Exam score xi
  0     85
  1     95
  2     100
  3     12
  …     …
  99    89

MLE for Gaussian:

•  Prob. of i.i.d. samples D = {x1, …, xN}:

       P(D | μ, σ) = ( 1 / (σ √(2π)) )^N  Π_{i=1}^{N} e^{−(xi − μ)² / (2σ²)}

•  Log-likelihood of data:

       ln P(D | μ, σ) = −N ln(σ √(2π)) − Σ_{i=1}^{N} (xi − μ)² / (2σ²)

       μ_MLE, σ_MLE = argmax_{μ,σ} P(D | μ, σ)

Your second learning algorithm: MLE for mean of a Gaussian

•  What's the MLE for the mean?

       d/dμ ln P(D | μ, σ) = (1/σ²) Σ_{i=1}^{N} (xi − μ) = 0

       =>   Σ_{i=1}^{N} xi − Nμ = 0

       =>   μ_MLE = (1/N) Σ_{i=1}^{N} xi

MLE for variance
•  Again, set derivative to zero:

       μ_MLE, σ_MLE = argmax_{μ,σ} P(D | μ, σ)

       d/dσ ln P(D | μ, σ) = −N/σ + Σ_{i=1}^{N} (xi − μ)² / σ³ = 0

       =>   σ²_MLE = (1/N) Σ_{i=1}^{N} (xi − μ_MLE)²

Learning Gaussian parameters

•  MLE:
       μ_MLE = (1/N) Σ_{i=1}^{N} xi,        σ²_MLE = (1/N) Σ_{i=1}^{N} (xi − μ_MLE)²

•  MLE for the variance of a Gaussian is biased
   –  Expected result of estimation is not the true parameter!
   –  Unbiased variance estimator:
       σ²_unbiased = ( 1/(N − 1) ) Σ_{i=1}^{N} (xi − μ_MLE)²
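A compact sketch (not from the slides, with made-up exam scores) computing the MLE mean, the biased MLE variance, and the unbiased (N−1) estimator.

```python
# MLE for a Gaussian: mean, biased (MLE) variance, and unbiased variance.
scores = [85, 95, 100, 12, 89]   # made-up exam scores for illustration

n = len(scores)
mu_mle = sum(scores) / n
var_mle = sum((x - mu_mle) ** 2 for x in scores) / n              # biased: divides by N
var_unbiased = sum((x - mu_mle) ** 2 for x in scores) / (n - 1)   # divides by N - 1

print(mu_mle, var_mle, var_unbiased)
```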

Bayesian learning of Gaussian parameters

•  Conjugate priors
   –  Mean: Gaussian prior
   –  Variance: Wishart distribution

•  Prior for the mean: a Gaussian, μ ~ N(μ0, σ0²)

Outline of lecture

•  Review of probability
•  Maximum likelihood estimation

2 examples of Bayesian classifiers:

•  Naïve Bayes
•  Logistic regression

Bayesian Classification

•  Problem statement:
   –  Given features X1, X2, …, Xn
   –  Predict a label Y

[Next several slides adapted from: Vibhav Gogate, Jonathan Huang, Luke Zettlemoyer, Carlos Guestrin, and Dan Weld]

Example Application

•  Digit Recognition

[Image: a handwritten digit fed to a classifier, which outputs "5"]

•  X1, …, Xn ∈ {0,1} (black vs. white pixels)

•  Y ∈ {0,1,2,3,4,5,6,7,8,9}

The Bayes Classifier

•  If we had the joint distribution on X1, …, Xn and Y, could predict using:

       argmax_y P(Y = y | X1, …, Xn)

   –  (for example: what is the probability that the image represents a 5 given its pixels?)

•  So … how do we compute that?

The Bayes Classifier

•  Use Bayes' Rule!

       P(Y | X1, …, Xn) = P(X1, …, Xn | Y) P(Y) / P(X1, …, Xn)

       (likelihood × prior / normalization constant)

•  Why did this help? Well, we think that we might be able to specify how features are "generated" by the class label

The Bayes Classifier

•  Let's expand this for our digit recognition task:

       P(Y = y | X1, …, Xn) ∝ P(X1, …, Xn | Y = y) P(Y = y),   one term for each class y ∈ {0, …, 9}

•  To classify, we'll simply compute these probabilities, one per class, and predict based on which one is largest

Model Parameters

•  How many parameters are required to specify the likelihood, P(X1, …, Xn | Y)?

   –  (Supposing that each image is 30x30 pixels: 2^900 − 1 parameters per class)

•  The problem with explicitly modeling P(X1, …, Xn | Y) is that there are usually way too many parameters:

   –  We'll run out of space
   –  We'll run out of time
   –  And we'll need tons of training data (which is usually not available)

Naïve Bayes
•  Naïve Bayes assumption:

   –  Features are independent given class:

       P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)

   –  More generally:

       P(X1, …, Xn | Y) = Π_{i=1}^{n} P(Xi | Y)

•  How many parameters now?
•  Suppose X is composed of n binary features: one parameter per feature per class, i.e. only O(n) per class

The Naïve Bayes Classifier
•  Given:

   –  Prior P(Y)
   –  n conditionally independent features X given the class Y
   –  For each Xi, we have likelihood P(Xi | Y)

•  Decision rule:

       y* = argmax_y P(Y = y) Π_{i=1}^{n} P(Xi = xi | Y = y)

If the conditional-independence assumption holds, NB is the optimal classifier! (it typically doesn't hold)

[Graphical model: class Y with children X1, X2, …, Xn]

A Digit Recognizer

•  Input: pixel grids

•  Output: a digit 0-9

Are the naïve Bayes assumptions realistic here?

What has to be learned?

  Prior P(Y):
  1 0.1   2 0.1   3 0.1   4 0.1   5 0.1
  6 0.1   7 0.1   8 0.1   9 0.1   0 0.1

  P(pixel on | Y), for one pixel feature:
  1 0.01   2 0.05   3 0.05   4 0.30   5 0.80
  6 0.90   7 0.05   8 0.60   9 0.50   0 0.80

  P(pixel on | Y), for a second pixel feature:
  1 0.05   2 0.01   3 0.90   4 0.80   5 0.90
  6 0.90   7 0.25   8 0.85   9 0.60   0 0.80

MLE for the parameters of NB
•  Given dataset
   –  Count(A=a, B=b): number of examples where A=a and B=b

•  MLE for discrete NB, simply:
   –  Prior:
   –  Observation distribution:

       P(Y = y) = Count(Y = y) / Σ_{y'} Count(Y = y')

       P(Xi = x | Y = y) = Count(Xi = x, Y = y) / Σ_{x'} Count(Xi = x', Y = y)

MLE  for  the  parameters  of  NB  

•  Training  amounts  to,  for  each  of  the  classes,  averaging  all  of  the  examples  together:  

MAP estimation for NB
•  Given dataset
   –  Count(A=a, B=b): number of examples where A=a and B=b

•  MAP estimation for discrete NB, simply:
   –  Prior:
   –  Observation distribution:

       P(Y = y) = Count(Y = y) / Σ_{y'} Count(Y = y')

       P(Xi = x | Y = y) = ( Count(Xi = x, Y = y) + a ) / ( Σ_{x'} Count(Xi = x', Y = y) + |Xi| · a )

•  Called "smoothing". Corresponds to a Dirichlet prior!
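The count-based MAP estimates above translate directly into code. The following is a self-contained sketch (mine, with a tiny synthetic dataset and a generic pseudo-count a), not the course's reference implementation, of training a discrete Naïve Bayes classifier with smoothing and predicting via the argmax decision rule.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples, a=1.0):
    """MAP estimates for a discrete Naive Bayes classifier.

    examples: list of (features_dict, label); a: pseudo-count (Dirichlet prior).
    Assumes every feature appears at least once in the training data."""
    class_counts = Counter(y for _, y in examples)
    feat_counts = defaultdict(Counter)   # (feature, label) -> Counter over values
    feat_values = defaultdict(set)       # feature -> set of observed values (|X_i|)
    for x, y in examples:
        for f, v in x.items():
            feat_counts[(f, y)][v] += 1
            feat_values[f].add(v)

    def p_y(y):
        return class_counts[y] / sum(class_counts.values())

    def p_x_given_y(f, v, y):
        num = feat_counts[(f, y)][v] + a
        den = sum(feat_counts[(f, y)].values()) + a * len(feat_values[f])
        return num / den

    def predict(x):
        def score(y):  # log P(y) + sum_i log P(x_i | y)
            return math.log(p_y(y)) + sum(math.log(p_x_given_y(f, v, y)) for f, v in x.items())
        return max(class_counts, key=score)

    return predict

# Tiny synthetic example: two binary pixel-like features.
data = [({"x1": 1, "x2": 0}, "5"), ({"x1": 1, "x2": 1}, "5"), ({"x1": 0, "x2": 0}, "3")]
predict = train_nb(data, a=1.0)
print(predict({"x1": 1, "x2": 0}))   # "5"
```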

What about if there is missing data?
•  One of the key strengths of Bayesian approaches is that they can naturally handle missing data

Missing data

•  Suppose we don't have a value for some attribute Xi
   –  applicant's credit history unknown
   –  some medical test not performed on the patient
   –  how to compute P(X1=x1 … Xj=? … Xd=xd | y)?

•  Easy with Naïve Bayes
   –  ignore the attribute in instances where its value is missing
   –  compute the likelihood based on observed attributes
   –  no need to "fill in" or explicitly model missing values
   –  based on conditional independence between attributes

Copyright © Victor Lavrenko, 2013

Missing data (2)

•  Ex: three coin tosses: Event = {X1=H, X2=?, X3=T}
   •  event = head, unknown (either head or tail), tail
   •  event = {H,H,T} + {H,T,T}
   •  P(event) = P(H,H,T) + P(H,T,T)

•  General case: Xj has a missing value

       P(x1, …, xj = ?, …, xd | y) = Σ_{xj} P(x1, …, xj, …, xd | y)

Copyright © Victor Lavrenko, 2013
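A brief sketch (mine, with a hypothetical θ) of the coin-toss example: summing out the unknown toss gives exactly the likelihood of the observed tosses alone, which is why Naïve Bayes can simply ignore missing attributes.

```python
theta = 0.6   # hypothetical P(heads) for each independent toss

def p_seq(seq, theta):
    """Probability of a fully observed H/T sequence under i.i.d. tosses."""
    p = 1.0
    for s in seq:
        p *= theta if s == "H" else (1 - theta)
    return p

# Event {X1=H, X2=?, X3=T}: sum out the missing toss...
p_marginalized = p_seq("HHT", theta) + p_seq("HTT", theta)
# ...which equals the likelihood of the observed tosses only.
p_ignore_missing = p_seq("HT", theta)
print(p_marginalized, p_ignore_missing)   # both 0.24
```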

Summary: Naïve Bayes classifier

   –  explicitly handles class priors
   –  "normalizes" across observations: outliers comparable
   –  assumption: all dependence is "explained" by the class label

•  Continuous example
   –  unable to handle correlated data

•  Discrete example
   –  fooled by repetitions
   –  must deal with the zero-frequency problem

•  Pros:
   –  handles missing data
   –  good computational complexity
   –  incremental updates

Copyright © Victor Lavrenko, 2013

Computational complexity

•  One of the fastest learning methods •  O(nd+cd) training time complexity

•  c … number of classes •  n … number of instances •  d … number of dimensions (attributes) •  both learning and prediction •  no hidden constants (number of iterations, etc.) •  testing: O(ndc)

•  O(dc) space complexity •  only decision trees are more compact

Copyright © Victor Lavrenko, 2013

[Slide from Victor Lavrenko and Nigel Goddard]

Mehryar Mohri - Introduction to Machine Learning

Naive Bayes = Linear Classifier

Theorem: assume that xi ∈ {0, 1} for all i ∈ [1, N]. Then, the Naive Bayes classifier is defined by

       x ↦ sgn(w · x + b),

where

       wi = log ( Pr[xi = 1 | +1] / Pr[xi = 1 | −1] ) − log ( Pr[xi = 0 | +1] / Pr[xi = 0 | −1] )

and

       b = log ( Pr[+1] / Pr[−1] ) + Σ_{i=1}^{N} log ( Pr[xi = 0 | +1] / Pr[xi = 0 | −1] )

Proof: observe that for any i ∈ [1, N],

       log ( Pr[xi | +1] / Pr[xi | −1] )
         = [ log ( Pr[xi = 1 | +1] / Pr[xi = 1 | −1] ) − log ( Pr[xi = 0 | +1] / Pr[xi = 0 | −1] ) ] xi
           + log ( Pr[xi = 0 | +1] / Pr[xi = 0 | −1] )

[Slide from Mehryar Mohri]
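To see the theorem in action, this sketch (my made-up Bernoulli NB parameters) assembles w and b exactly as defined above and checks that w·x + b equals the Naïve Bayes log-odds.

```python
import math

# Hypothetical Bernoulli NB parameters for classes +1 / -1 over N = 2 binary features.
prior = {+1: 0.6, -1: 0.4}
p1 = {+1: [0.8, 0.3], -1: [0.2, 0.5]}   # p1[c][i] = Pr[x_i = 1 | c]
N = 2

# Weights and bias from the theorem.
w = [math.log(p1[+1][i] / p1[-1][i]) - math.log((1 - p1[+1][i]) / (1 - p1[-1][i]))
     for i in range(N)]
b = math.log(prior[+1] / prior[-1]) + sum(math.log((1 - p1[+1][i]) / (1 - p1[-1][i]))
                                          for i in range(N))

def nb_log_odds(x):
    """log Pr[+1 | x] - log Pr[-1 | x], computed directly from the NB model."""
    s = math.log(prior[+1] / prior[-1])
    for i in range(N):
        s += math.log(p1[+1][i] if x[i] else 1 - p1[+1][i])
        s -= math.log(p1[-1][i] if x[i] else 1 - p1[-1][i])
    return s

x = [1, 0]
print(nb_log_odds(x), sum(w[i] * x[i] for i in range(N)) + b)   # the two values match
```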

Outline of lecture

•  Review of probability
•  Maximum likelihood estimation

2 examples of Bayesian classifiers:

•  Naïve Bayes
•  Logistic regression

[Next several slides adapted from: Vibhav Gogate, Luke Zettlemoyer, Carlos Guestrin, and Dan Weld]

Logistic Regression

•  Learn P(Y|X) directly!
•  Assume a particular functional form

   –  Linear classifier? On one side we say P(Y=1|X)=1, and on the other P(Y=1|X)=0

   –  But this is not differentiable (hard to learn)… and doesn't allow for label noise…

[Figure: a linear decision boundary separating a region where P(Y=1)=1 from a region where P(Y=1)=0]

Logistic Regression

Logistic function (sigmoid):   σ(z) = 1 / (1 + e^{−z})

•  Learn P(Y|X) directly!
•  Assume a particular functional form

•  Sigmoid applied to a linear function of the data, w0 + Σ_i wi Xi

Features can be discrete or continuous!

Copyright © 2010, Tom M. Mitchell.

where the superscript j refers to the jth training example, and where δ(Y = yk) is 1 if Y = yk and 0 otherwise. Note the role of δ here is to select only those training examples for which Y = yk.

The maximum likelihood estimator for σ²_ik is

       σ̂²_ik = ( 1 / Σ_j δ(Y^j = yk) ) Σ_j (X_i^j − μ̂_ik)² δ(Y^j = yk)     (14)

This maximum likelihood estimator is biased, so the minimum variance unbiased estimator (MVUE) is sometimes used instead. It is

       σ̂²_ik = ( 1 / (Σ_j δ(Y^j = yk) − 1) ) Σ_j (X_i^j − μ̂_ik)² δ(Y^j = yk)     (15)

3  Logistic Regression

Logistic Regression is an approach to learning functions of the form f : X → Y, or P(Y|X) in the case where Y is discrete-valued, and X = ⟨X1 … Xn⟩ is any vector containing discrete or continuous variables. In this section we will primarily consider the case where Y is a boolean variable, in order to simplify notation. In the final subsection we extend our treatment to the case where Y takes on any finite number of discrete values.

Logistic Regression assumes a parametric form for the distribution P(Y|X), then directly estimates its parameters from the training data. The parametric model assumed by Logistic Regression in the case where Y is boolean is:

       P(Y = 1|X) = 1 / ( 1 + exp(w0 + Σ_{i=1}^{n} wi Xi) )     (16)

and

       P(Y = 0|X) = exp(w0 + Σ_{i=1}^{n} wi Xi) / ( 1 + exp(w0 + Σ_{i=1}^{n} wi Xi) )     (17)

Notice that equation (17) follows directly from equation (16), because the sum of these two probabilities must equal 1.

One highly convenient property of this form for P(Y|X) is that it leads to a simple linear expression for classification. To classify any given X we generally want to assign the value yk that maximizes P(Y = yk|X). Put another way, we assign the label Y = 0 if the following condition holds:

       1 < P(Y = 0|X) / P(Y = 1|X)

Substituting from equations (16) and (17), this becomes

       1 < exp(w0 + Σ_{i=1}^{n} wi Xi)

       σ(z) = 1 / (1 + e^{−z})

Logistic Function in n Dimensions

[Figure: sigmoid of a linear function of (x1, x2)]

Sigmoid applied to a linear function of the data:

Features can be discrete or continuous!

Logistic Regression: decision boundary

A Linear Classifier!

•  Prediction: Output the Y with highest P(Y|X)
   –  For binary Y, output Y=0 if P(Y=0|X) > P(Y=1|X); with the parametric form of equations (16)-(17) above, this becomes the linear rule of equation (18) below, 0 < w0 + Σ_i wi Xi, so the decision boundary is w·X + w0 = 0


Copyright © 2010, Tom M. Mitchell.

[Figure 1: Form of the logistic function, Y = 1/(1 + exp(−X)). In Logistic Regression, P(Y|X) is assumed to follow this form.]

and taking the natural log of both sides we have a linear classification rule that assigns label Y = 0 if X satisfies

       0 < w0 + Σ_{i=1}^{n} wi Xi     (18)

and assigns Y = 1 otherwise.

Interestingly, the parametric form of P(Y|X) used by Logistic Regression is precisely the form implied by the assumptions of a Gaussian Naive Bayes classifier. Therefore, we can view Logistic Regression as a closely related alternative to GNB, though the two can produce different results in many cases.

3.1  Form of P(Y|X) for Gaussian Naive Bayes Classifier

Here we derive the form of P(Y|X) entailed by the assumptions of a Gaussian Naive Bayes (GNB) classifier, showing that it is precisely the form used by Logistic Regression and summarized in equations (16) and (17). In particular, consider a GNB based on the following modeling assumptions:

•  Y is boolean, governed by a Bernoulli distribution, with parameter π = P(Y = 1)

•  X = ⟨X1 … Xn⟩, where each Xi is a continuous random variable

Decision boundary:   w · X + w0 = 0


Likelihood vs. Conditional Likelihood

Generative (Naïve Bayes) maximizes the data likelihood:

       argmax_w Π_j P(x^j, y^j | w)

Discriminative (Logistic Regression) maximizes the conditional data likelihood:

       argmax_w Π_j P(y^j | x^j, w)

Focuses only on learning P(Y|X), which is all that matters for classification.

Maximizing Conditional Log Likelihood

       l(w) = ln Π_j P(y^j | x^j, w) = Σ_j [ y^j ln P(Y=1 | x^j, w) + (1 − y^j) ln P(Y=0 | x^j, w) ]      (y^j is 0 or 1!)

Bad news: no closed-form solution to maximize l(w)

Good news: l(w) is a concave function of w!

       No local maxima; concave functions are easy to optimize

Optimizing a concave function – Gradient ascent

•  Conditional likelihood for Logistic Regression is concave!

Gradient:   ∇_w l(w) = [ ∂l(w)/∂w0, …, ∂l(w)/∂wn ]

Update rule:   wi ← wi + η ∂l(w)/∂wi      (learning rate η > 0)

Maximize Conditional Log Likelihood: Gradient ascent

       l(w) = Σ_j [ y^j ln ( e^{w0 + Σ_i wi Xi^j} / (1 + e^{w0 + Σ_i wi Xi^j}) ) + (1 − y^j) ln ( 1 / (1 + e^{w0 + Σ_i wi Xi^j}) ) ]

       ∂l(w)/∂wi = Σ_j [ ∂/∂wi y^j (w0 + Σ_i wi xi^j) − ∂/∂wi ln( 1 + exp(w0 + Σ_i wi xi^j) ) ]

                 = Σ_j [ y^j xi^j − xi^j exp(w0 + Σ_i wi xi^j) / (1 + exp(w0 + Σ_i wi xi^j)) ]

                 = Σ_j xi^j [ y^j − exp(w0 + Σ_i wi xi^j) / (1 + exp(w0 + Σ_i wi xi^j)) ]

                 = Σ_j xi^j [ y^j − P(Y^j = 1 | x^j, w) ]

Gradient Ascent for LR

Gradient ascent algorithm: (learning rate η > 0)

do:
       w0 ← w0 + η Σ_j [ y^j − P(Y^j = 1 | x^j, w) ]

       For i = 1 to n: (iterate over features)
              wi ← wi + η Σ_j xi^j [ y^j − P(Y^j = 1 | x^j, w) ]

until "change" < ε      (the sums loop over the training examples; see the sketch below)
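A minimal sketch (not the course code) of this batch gradient-ascent loop, using the slides' form P(Y = 1 | x, w) = exp(w0 + w·x) / (1 + exp(w0 + w·x)); the toy data, learning rate, and fixed iteration count are arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, eta=0.1, iters=1000):
    """Batch gradient ascent on the conditional log-likelihood, with intercept w0."""
    n = len(X[0])
    w0, w = 0.0, [0.0] * n
    for _ in range(iters):
        # P(Y = 1 | x^j, w) for every training example
        p = [sigmoid(w0 + sum(w[i] * xj[i] for i in range(n))) for xj in X]
        # Gradient-ascent update: w_i <- w_i + eta * sum_j x_i^j (y^j - p^j)
        w0 += eta * sum(y[j] - p[j] for j in range(len(X)))
        for i in range(n):
            w[i] += eta * sum(X[j][i] * (y[j] - p[j]) for j in range(len(X)))
    return w0, w

# Toy 1-D data: the label tends to be 1 when the feature is large.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w0, w = train_logistic(X, y)
print(w0, w, sigmoid(w0 + w[0] * 2.5))   # predicted P(Y=1 | x=2.5) is close to 1
```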

Naïve Bayes   vs.   Logistic Regression

Learning:  h: X → Y        X – features,   Y – target classes

Generative (Naïve Bayes)
•  Assume functional form for
   –  P(X|Y)   (assume conditional independence)
   –  P(Y)
   –  Est. params from training data
•  Gaussian NB for continuous features
•  Bayes rule to calc. P(Y | X = x):
   –  P(Y | X) ∝ P(X | Y) P(Y)
•  Indirect computation
   –  Can generate a sample of the data
   –  Can easily handle missing data

Discriminative (Logistic Regression)
•  Assume functional form for
   –  P(Y|X)   (no assumptions on P(X))
   –  Est. params from training data
•  Handles discrete & continuous features
•  Directly calculate P(Y | X = x)
   –  Can't generate a data sample

Naïve Bayes vs. Logistic Regression

•  Generative vs. Discriminative classifiers
•  Asymptotic comparison (# training examples → infinity)

   –  when the model is correct
      •  NB, Linear Discriminant Analysis (with class-independent variances), and Logistic Regression produce identical classifiers

   –  when the model is incorrect
      •  LR is less biased – does not assume conditional independence
      –  therefore LR is expected to outperform NB

[Ng & Jordan, 2002]

Naïve Bayes vs. Logistic Regression

•  Generative vs. Discriminative classifiers
•  Non-asymptotic analysis

   –  convergence rate of parameter estimates (n = # of attributes in X)
      •  Size of training data needed to get close to the infinite-data solution
      •  Naïve Bayes needs O(log n) samples
      •  Logistic Regression needs O(n) samples

   –  Naïve Bayes converges more quickly to its (perhaps less helpful) asymptotic estimates

[Ng & Jordan, 2002]

Some experiments from UCI data sets

[Figure: learning curves comparing Naïve Bayes and Logistic Regression on several UCI data sets]

©Carlos Guestrin 2005-2009
