Probability Estimation
Machine Learning 10-601B, Seyoung Kim
Source: 10601b/slides/mle_map.pdf (2015-09-03)


Page 1:

Probability Estimation

Machine Learning 10-601B

Seyoung Kim

Many of these slides are derived from Tom Mitchell, William Cohen, Eric Xing. Thanks!

Page 2:

Overview

• Joint probability distribution
  – A functional mapping f: X -> Y via a probability distribution

• Probability estimation
  – Maximum likelihood estimation
  – Maximum a posteriori estimation

Page 3:

What does all this have to do with function approximation for f: X -> Y?

Page 4:

Joint Probability Distribution

Once you have the joint distribution, you can ask for the probability of any logical expression involving your attributes

Page 5:

Using the Joint Distribution

P(Poor, Male) = 0.4654

P(A) = P(A ^ B) + P(A ^ ~B)
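The marginalization identity above can be checked with a small sketch. The Poor entries match the slide's numbers; the Rich entries are hypothetical values chosen only so the table sums to 1.

```python
# Marginalization sketch: P(A) = P(A ^ B) + P(A ^ ~B).
joint = {
    ("poor", "male"):   0.4654,  # from the slide
    ("poor", "female"): 0.2950,  # 0.7604 - 0.4654
    ("rich", "male"):   0.0896,  # hypothetical
    ("rich", "female"): 0.1500,  # hypothetical
}

def marginal(joint, wealth):
    """Sum the joint over the variable we don't care about."""
    return sum(p for (w, g), p in joint.items() if w == wealth)

print(round(marginal(joint, "poor"), 4))  # 0.7604, matching the next slide
```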

Page 6:

Using the Joint Distribution

P(Poor) = 0.7604

Page 7:

Inference with the Joint

Page 8:

Inference with the Joint

P(Male | Poor) = 0.4654 / 0.7604 = 0.612
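A one-line check of this conditional-probability computation, using the slide's numbers:

```python
# Conditioning via the joint: P(Male | Poor) = P(Poor, Male) / P(Poor).
p_poor_male = 0.4654
p_poor = 0.7604

p_male_given_poor = p_poor_male / p_poor
print(round(p_male_given_poor, 3))  # 0.612, as on the slide
```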

Page 9:

Learning and the Joint Distribution

Suppose we want to learn the function f: <G, H> → W

Equivalently, P(W | G, H)

Solution: learn joint distribution from data, calculate P(W | G, H)

e.g., P(W = rich | G = female, H = 40.5−) =

[A.  Moore]    
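A minimal sketch of this recipe, assuming a tiny hypothetical set of records (the slides' census data is not reproduced here): counts estimate the joint, and the conditional follows by division.

```python
from collections import Counter

# Hypothetical toy records (G = gender, H = hours bucket, W = wealth).
records = [
    ("female", "40.5-", "rich"), ("female", "40.5-", "poor"),
    ("female", "40.5-", "poor"), ("male", "40.5+", "rich"),
    ("male", "40.5+", "poor"),
]

joint = Counter(records)  # counts estimate the joint P(G, H, W)
n = len(records)

def p_w_given_gh(w, g, h):
    # P(W | G, H) = P(G, H, W) / P(G, H), both estimated by counting
    num = joint[(g, h, w)] / n
    den = sum(c for (gg, hh, _), c in joint.items() if (gg, hh) == (g, h)) / n
    return num / den

print(p_w_given_gh("rich", "female", "40.5-"))  # 1/3 under these toy records
```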

Page 10:

Copyright © Andrew W. Moore

Density Estimation

• Our Joint Distribution learner is our first example of something called Density Estimation

• A Density Estimator learns a mapping from a set of attribute values to a probability

Density Estimator: Input Attributes → Probability

Page 11:


Density Estimation

• Compare it against the two other major kinds of models:

Regressor: Input Attributes → Prediction of real-valued output

Density Estimator: Input Attributes → Probability

Classifier: Input Attributes → Prediction of categorical output or class (one of a few discrete values)

Page 12:


Density Estimation → Classification

Classifier: Input Attributes x → Prediction of categorical output, one of y1, …, yk

To classify x:
1. Use your estimator to compute P̂(x, y1), …, P̂(x, yk)
2. Return the class y* with the highest predicted probability

Ideally, this is correct with P̂(y* | x) = P̂(x, y*) / (P̂(x, y1) + … + P̂(x, yk))

Density Estimator: Input Attributes x, Class y → P̂(x, y)

Binary case: predict POS if P̂(x, ypos) > 0.5
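The classification rule above can be sketched as follows; the joint table P̂(x, y) is hypothetical.

```python
# Sketch: turn a density estimator into a classifier by returning the
# class y* that maximizes the estimated joint P̂(x, y).
joint_hat = {
    ("x1", "pos"): 0.30, ("x1", "neg"): 0.10,  # hypothetical estimates
    ("x2", "pos"): 0.15, ("x2", "neg"): 0.45,
}

def classify(x, classes=("pos", "neg")):
    return max(classes, key=lambda y: joint_hat[(x, y)])

def posterior(x, y, classes=("pos", "neg")):
    # P̂(y | x) = P̂(x, y) / sum over y' of P̂(x, y')
    return joint_hat[(x, y)] / sum(joint_hat[(x, c)] for c in classes)

print(classify("x1"), round(posterior("x1", "pos"), 2))  # pos 0.75
```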

Page 13:

Classification vs Density Estimation

Classification | Density Estimation

Page 14:

Classification vs density estimation

Page 15:

Modeling Uncertainty with Probabilities

• Y is a Boolean-valued random variable if
  – Y denotes an event,
  – there is uncertainty as to whether Y occurs.

• More examples
  – Y = You wake up tomorrow with a headache
  – Y = The US president in 2023 will be male
  – Y = there is intelligent life elsewhere in our galaxy
  – Y = the 1,000,000,000,000th digit of π is 7
  – Y = I woke up today with a headache

• Define P(Y | X) as "the fraction of possible worlds in which Y is true, given X"

Page 16:

sounds like the solution to learning F: X → Y,

or P(Y | X).

Are we done?

Page 17:

[C.  Guestrin]    

Page 18:

[C.  Guestrin]    

D = {D1, D2, D3, D4, D5}

P(D | θ) = (αH + αT choose αH) · θ^αH · (1 − θ)^αT,  where αH, αT are the counts of heads and tails
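A numeric sanity check of this likelihood, assuming a small example with αH = 3 heads and αT = 2 tails: a grid search confirms the maximizer is αH / (αH + αT).

```python
from math import comb

# Binomial likelihood of a_h heads and a_t tails as a function of theta.
def likelihood(theta, a_h, a_t):
    return comb(a_h + a_t, a_h) * theta**a_h * (1 - theta)**a_t

a_h, a_t = 3, 2  # assumed example, e.g. D = {H, H, T, H, T}
grid = [i / 1000 for i in range(1, 1000)]
theta_best = max(grid, key=lambda t: likelihood(t, a_h, a_t))
print(theta_best)  # 0.6 = a_h / (a_h + a_t)
```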

Page 19:

[C.  Guestrin]    

P(D | θ) = (αH + αT choose αH) · θ^αH · (1 − θ)^αT

Page 20:

Maximum Likelihood Estimate for θ:  θ̂_MLE = αH / (αH + αT)

[C.  Guestrin]    

Page 21:

[C.  Guestrin]    

Page 22:

[C.  Guestrin]    

Page 23:

Issues with MLE estimate

I bought a loaded 20-faced die (d20) on eBay… but it didn't come with any specs. How can I find out how it behaves?

1. Collect some data (20 rolls)
2. Estimate P(i) = CountOf(rolls of i) / CountOf(any roll)
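A sketch of this MLE recipe. The 20 rolls below are hypothetical, but chosen to be consistent with the estimates on the next slide (P(4) = 0.1, P(19) = 0.25, P(20) = 0.2).

```python
from collections import Counter

# MLE for the d20: P̂(i) = CountOf(rolls of i) / CountOf(any roll).
rolls = [4, 19, 19, 20, 20, 20, 20, 19, 19, 19,
         4, 5, 6, 7, 8, 9, 10, 11, 12, 13]  # hypothetical data
counts = Counter(rolls)
n = len(rolls)

p_hat = {i: counts[i] / n for i in range(1, 21)}
print(p_hat[1], p_hat[20])  # 0.0 0.2 -- faces never rolled get probability zero
```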

Page 24:

Issues with MLE estimate

I bought a loaded d20 on eBay… but it didn't come with any specs. How can I find out how it behaves?

P(1) = 0
P(2) = 0
P(3) = 0
P(4) = 0.1
…
P(19) = 0.25
P(20) = 0.2

But: Do I really think it's impossible to roll a 1, 2, or 3?

Page 25:

A better solution

I bought a loaded d20 on eBay… but it didn't come with any specs. How can I find out how it behaves?

0. Imagine some data (20 rolls, each i shows up 1x)
1. Collect some data (20 rolls)
2. Estimate P(i)

Page 26:

A better solution?

P(1) = 1/40
P(2) = 1/40
P(3) = 1/40
P(4) = (2+1)/40
…
P(19) = (5+1)/40
P(20) = (4+1)/40 = 1/8

P̂(i) = (CountOf(i) + 1) / (CountOf(ANY) + CountOf(IMAGINED))

0.2 vs. 0.125 – really different! Maybe I should "imagine" less data?

MAP = maximum a posteriori estimate
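The add-one estimate above can be sketched directly; the rolls are the same hypothetical data as before, plus 20 imagined rolls (one per face).

```python
from collections import Counter

# Add-one ("imagined data") estimate:
# P̂(i) = (CountOf(i) + 1) / (CountOf(ANY) + CountOf(IMAGINED)).
rolls = [4, 19, 19, 20, 20, 20, 20, 19, 19, 19,
         4, 5, 6, 7, 8, 9, 10, 11, 12, 13]  # hypothetical data
counts = Counter(rolls)
n, k = len(rolls), 20  # 20 real rolls + 20 imagined rolls (one per face)

p_hat = {i: (counts[i] + 1) / (n + k) for i in range(1, 21)}
print(p_hat[1], p_hat[20])  # 1/40 = 0.025 and (4+1)/40 = 0.125
```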

Page 27:

[C.  Guestrin]    

Page 28:

Beta prior distribution – P(θ)

[C.  Guestrin]    


Page 29:

Beta prior distribution – P(θ)

[C.  Guestrin]    


Page 30:

[C.  Guestrin]    

Page 31:

[C.  Guestrin]    €

θ̂_MAP = (αH + βH − 1) / (αH + βH − 1 + αT + βT − 1)
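A sketch comparing the MAP formula above with the MLE, for an assumed small sample and illustrative Beta hyperparameters:

```python
# MAP estimate for a Beta(beta_h, beta_t) prior after observing a_h heads
# and a_t tails (the mode of the Beta posterior).
def theta_map(a_h, a_t, beta_h, beta_t):
    return (a_h + beta_h - 1) / (a_h + beta_h - 1 + a_t + beta_t - 1)

def theta_mle(a_h, a_t):
    return a_h / (a_h + a_t)

# With few flips the prior pulls the estimate toward 0.5; the sample and
# hyperparameters are illustrative.
print(theta_mle(3, 2), round(theta_map(3, 2, 5, 5), 3))  # 0.6 0.538
```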

Page 32:

Conjugate priors

[A.  Singh]    

Page 33:

Dirichlet distribution

• number of heads in N flips of a two-sided coin
  – follows a binomial distribution
  – Beta is a good prior (conjugate prior for the binomial)

• what if it's not two-sided, but k-sided?
  – follows a multinomial distribution
  – the Dirichlet distribution is the conjugate prior
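A minimal sketch of the Dirichlet-multinomial update for a k-sided die; the counts and the symmetric Dirichlet(2) prior are illustrative, and the MAP formula is the standard posterior-mode expression generalizing the Beta case above.

```python
# Dirichlet-multinomial conjugacy: posterior pseudo-counts are prior
# pseudo-counts plus observed counts, and the MAP estimate for face i is
# (count_i + alpha_i - 1) / (n + sum(alpha) - k).
def dirichlet_map(counts, alphas):
    n, k = sum(counts), len(counts)
    denom = n + sum(alphas) - k
    return [(c + a - 1) / denom for c, a in zip(counts, alphas)]

counts = [0, 2, 3]   # hypothetical rolls of a 3-sided die
alphas = [2, 2, 2]   # symmetric Dirichlet(2) prior = add-one smoothing
print(dirichlet_map(counts, alphas))  # [0.125, 0.375, 0.5]
```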

Page 34:

Conjugate priors

[A.  Singh]    

Page 35:

Estimating Parameters

• Maximum Likelihood Estimate (MLE): choose θ that maximizes the probability of the observed data

• Maximum a Posteriori (MAP) estimate: choose θ that is most probable given the prior probability and the data

Page 36:

Expected values

Given a discrete random variable X, the expected value of X, written E[X], is

E[X] = Σ_x x · P(X = x)

We can also talk about the expected value of functions of X:

E[f(X)] = Σ_x f(x) · P(X = x)
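These definitions are easy to sketch for a hypothetical pmf over {1, 2, 3}:

```python
# E[X] = sum over x of x * P(X = x), and E[f(X)] = sum over x of f(x) * P(X = x).
pmf = {1: 0.2, 2: 0.5, 3: 0.3}  # hypothetical distribution

def expect(f, pmf):
    return sum(f(x) * p for x, p in pmf.items())

e_x = expect(lambda x: x, pmf)       # E[X]
e_x2 = expect(lambda x: x**2, pmf)   # E[X^2]
print(e_x, e_x2)  # 2.1 and 4.9
```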

Page 37:

Covariance

Given two discrete r.v.'s X and Y, we define the covariance of X and Y as

Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]

e.g., X = gender, Y = playsFootball, or X = gender, Y = leftHanded

Remember:  
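The covariance definition can be checked numerically from samples; the 0/1 codings below are hypothetical.

```python
# Cov(X, Y) = E[XY] - E[X] E[Y], estimated from hypothetical paired samples
# (e.g., X = gender coded 0/1, Y = playsFootball coded 0/1).
xs = [0, 0, 1, 1, 1]
ys = [0, 1, 1, 1, 0]

def mean(v):
    return sum(v) / len(v)

cov = mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)
print(round(cov, 2))  # 0.04
```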

Page 38:

You should know

• Density estimation and its relation to classification

• Estimating parameters from data
  – maximum likelihood estimates
  – maximum a posteriori estimates
  – distributions – binomial, Beta, Dirichlet, …
  – conjugate priors