
Page 1:

Machine Learning & Data Mining CS/CNS/EE 155

Lecture 10: Generative Models, Naïve Bayes

1  

Page 2:

Announcements  

• Kaggle submissions end 1:59pm Thursday
• HW5 will be released by the end of this week
  – Last year people found it tough, so start early and write good code

2  

Page 3:

Where We Are

• Part 1: Discriminative Models
  – Basics: Bias/Variance, Over/Underfitting
  – Linear models: perceptron, SVM, logistic regression
  – Regularization
  – Decision Trees
  – Model Aggregation: bagging and boosting
  – Deep Learning

• Part 2: Generative and Unsupervised Models  <- We Are Here
  – Probabilistic generative models
  – Deep generative models
  – Dimensionality Reduction
  – Matrix Factorization, Embeddings, etc.

3  

Note: “Generative” and “Unsupervised” are not synonyms, but generative models are often unsupervised and vice versa.

Page 4:

What’s a Generative Model?

Consider a dataset S = {(x_i, y_i)}_{i=1}^N.

• Discriminative models learn P(y | x)
  – Well-defined task, but intelligence can’t be built on discrimination alone

• Generative models learn P(x, y)
  – Less well-defined task, but models can do much more than just discriminate
  – Currently the focus of a significant fraction of high-profile ML research

4

Page 5:

What can Generative Models Do?

5

Example model (our model specifies these; don’t worry yet about the model architecture or how we trained it):

y      x          P(x,y)
SPAM   Help!      0.15
NOT    Help!      0.1
SPAM   Homework   0.05
NOT    Homework   0.45
SPAM   Winner!    0.2
NOT    Winner!    0.05

This is the full joint distribution; it sums to 1.

Page 6:

What can Generative Models Do?

Discriminate: we can compute

  P(y | x) = P(x, y) / P(x),   where   P(x) = Σ_y P(y, x)

So we know, e.g., P(y=SPAM | x=Help!):

  P(y=SPAM, x=Help!) = 0.15
  P(x=Help!) = 0.15 + 0.1 = 0.25
  =>  P(y=SPAM | x=Help!) = 0.15 / 0.25 = 0.6

6

Example model:

y      x          P(x,y)
SPAM   Help!      0.15
NOT    Help!      0.1
SPAM   Homework   0.05
NOT    Homework   0.45
SPAM   Winner!    0.2
NOT    Winner!    0.05
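To make the arithmetic above concrete, here is a minimal sketch in Python, assuming the example joint distribution is stored as a plain dict (the variable and function names are ours, not from the lecture):

```python
# The example joint distribution P(x, y), stored as a dict (names are ours).
joint = {
    ("SPAM", "Help!"): 0.15, ("NOT", "Help!"): 0.1,
    ("SPAM", "Homework"): 0.05, ("NOT", "Homework"): 0.45,
    ("SPAM", "Winner!"): 0.2, ("NOT", "Winner!"): 0.05,
}

def p_x(x):
    """Marginal P(x) = sum over y of P(x, y)."""
    return sum(p for (_, word), p in joint.items() if word == x)

def p_y_given_x(y, x):
    """Conditional P(y | x) = P(x, y) / P(x)."""
    return joint[(y, x)] / p_x(x)

print(p_y_given_x("SPAM", "Help!"))  # ~0.6
```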

Page 7:

What can Generative Models Do?

7

Example model:

y      x          P(x,y)
SPAM   Help!      0.15
NOT    Help!      0.1
SPAM   Homework   0.05
NOT    Homework   0.45
SPAM   Winner!    0.2
NOT    Winner!    0.05

Summarize and Predict

E.g., the marginal distribution of y is

  P(y) = Σ_x P(x, y)

So we know P(y=SPAM) = 0.4 before we’ve even seen x. You can’t get that from a discriminative model.
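The marginal over y is the same kind of one-line sum; a small sketch under the same dict representation as above:

```python
joint = {("SPAM", "Help!"): 0.15, ("NOT", "Help!"): 0.1, ("SPAM", "Homework"): 0.05,
         ("NOT", "Homework"): 0.45, ("SPAM", "Winner!"): 0.2, ("NOT", "Winner!"): 0.05}

def p_y(y):
    """Marginal P(y) = sum over x of P(x, y)."""
    return sum(p for (label, _), p in joint.items() if label == y)

print(p_y("SPAM"))  # ~0.4 = 0.15 + 0.05 + 0.2
```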

 

Page 8:

E.g. Variational Autoencoders

8

Can discover structure in, e.g., the MNIST dataset using just the unlabeled images. This is a “walk” through the structure this generative model discovered.

http://blog.fastforwardlabs.com/2016/08/22/under-the-hood-of-the-variational-autoencoder-in.html

Page 9:

E.g. Latent Dirichlet Allocation (Disclaimer: this was the top result of a Google search for “LDA example”)

Can identify topics from reading an unlabeled text corpus, e.g. Presidential Campaign, Wildlife:

9
http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/

Page 10:

What can Generative Models Do?

10

Example model:

y      x          P(x,y)
SPAM   Help!      0.15
NOT    Help!      0.1
SPAM   Homework   0.05
NOT    Homework   0.45
SPAM   Winner!    0.2
NOT    Winner!    0.05

Generate

We can generate examples straight from the data distribution:

  >>> random.uniform(0, 1)
  0.404…
  >>> return (NOT, Homework)

Or we can generate conditionally, since going from label to data is as easy as from data to label:

  P(x | y) = P(x, y) / P(y),   where   P(y) = Σ_x P(x, y)

So we know, e.g., P(x=Help!|y=SPAM), P(x=Homework|y=SPAM), and P(x=Winner!|y=SPAM), and can sample from that conditional distribution (a sampling sketch follows below).
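A hedged sketch of both kinds of sampling from the table, using inverse-CDF sampling over the dict entries (the representation and names are our own):

```python
import random

joint = {("SPAM", "Help!"): 0.15, ("NOT", "Help!"): 0.1, ("SPAM", "Homework"): 0.05,
         ("NOT", "Homework"): 0.45, ("SPAM", "Winner!"): 0.2, ("NOT", "Winner!"): 0.05}

def sample_from(dist):
    """Sample a key from a dict of outcome -> probability (inverse CDF over the entries)."""
    r, cumulative = random.uniform(0, 1), 0.0
    for outcome, p in dist.items():
        cumulative += p
        if r < cumulative:
            return outcome
    return outcome  # guard against floating-point round-off at the top end

# Unconditional sample (y, x) from P(x, y).
print(sample_from(joint))

# Conditional sampling of x given y=SPAM via P(x | y) = P(x, y) / P(y).
p_spam = sum(p for (y, _), p in joint.items() if y == "SPAM")
cond = {x: p / p_spam for (y, x), p in joint.items() if y == "SPAM"}
print(sample_from(cond))
```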

Page 11:

E.g. Generative Adversarial Networks

11

Learn from labeled images, condition on “Volcano”, and generate new images, because why not?

Learn from unlabeled telescope imagery and generate new images; now astronomers have more data.

http://www.nature.com/news/astronomers-explore-uses-for-ai-generated-images-1.21398?WT.mc_id=FBK_NatureNews

Page 12:

E.g. Style Transfer

12

Condition on a style image and a content image, generate new images:

Figure 2: Images that combine the content of a photograph with the style of several well-known artworks. The images were created by finding an image that simultaneously matches the content representation of the photograph and the style representation of the artwork (see Methods). The original photograph depicting the Neckarfront in Tübingen, Germany, is shown in A (Photo: Andreas Praefcke). The painting that provided the style for the respective generated image is shown in the bottom left corner of each panel. B The Shipwreck of the Minotaur by J.M.W. Turner, 1805. C The Starry Night by Vincent van Gogh, 1889. D Der Schrei by Edvard Munch, 1893. E Femme nue assise by Pablo Picasso, 1910. F Composition VII by Wassily Kandinsky, 1913.

http://gitxiv.com/posts/jG46ukGod8R7Rdtud/a-neural-algorithm-of-artistic-style

Page 13:

What can Generative Models Do? Handle Missing Values and Model Uncertainty

(The previous example was too simple to illustrate this.) By conditioning on known values, generative models can handle missing values probabilistically. Further, many generative models are structured to model uncertainty about their missing-value estimates.

13
https://openai.com/blog/generative-models/

[Figure panels: naïve exploration vs. exploration with a generative uncertainty model]

Page 14:

What’s  the  Catch?  

• Discriminative models make (much) better predictions
  – Discriminative models are directly optimized to predict
  – Generative models make predictions by combining multiple estimated values:

    P(y | x) = P(x, y) / P(x),   where   P(x) = Σ_y P(y, x)

14

Page 15:

What’s  the  Catch?  

• Generative modeling is an underspecified task
  – The goal of discriminative models is clear: improve accuracy
  – The goal of generative models is less clear: the quality of a model is task-dependent and somewhat subjective.

15  

Page 16:

What’s  the  Catch?  

• Generative models are more complicated
  – In our example, the generative model had 6 parameters; a discriminative model would have needed 3 to do its job
  – Imagine what happens when the feature space gets large, e.g., 784 continuous-valued pixels!
  – Requires very clever model design and assumptions about the data

    P(y | x) = P(x, y) / P(x)        What if there are so many different x that P(x) underflows?

16

Page 17:

Example  

Suppose we have a binary y label and D binary x features.

What’s wrong with the Probability Table approach (arguably the most naïve generative model)?

17

y      x1=Winner!   x2=Homework   P(x,y)
SPAM   1            1             0.01
NOT    1            1             0.01
SPAM   0            1             0.03
NOT    0            1             0.35
SPAM   1            0             0.25
NOT    1            0             0.05
SPAM   0            0             0.2
NOT    0            0             0.1

Page 18:

Example  

Suppose we have a binary y label and D binary x features.

What’s wrong with the Probability Table approach (arguably the most naïve generative model)?

18

y      x1=Winner!   x2=Homework   P(x,y)
SPAM   1            1             0.01
NOT    1            1             0.01
SPAM   0            1             0.03
NOT    0            1             0.35
SPAM   1            0             0.25
NOT    1            0             0.05
SPAM   0            0             0.2
NOT    0            0             0.1

Model Complexity is Exponential w.r.t. the length of x!

We need a better model… (a quick parameter count follows below)
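To make the scaling concrete, here is a quick count, written as a sketch; the 1 + 2D figure anticipates the Naïve Bayes model introduced in the next slides:

```python
def joint_table_entries(D):
    """Entries in a full joint probability table over a binary label y and D binary features."""
    return 2 ** (D + 1)

def naive_bayes_parameters(D):
    """One number for P(y=SPAM), plus P(x_d = 1 | y) for each feature and each label."""
    return 1 + 2 * D

for D in (2, 20, 784):
    print(D, joint_table_entries(D), naive_bayes_parameters(D))
# With D = 784 binary features (one per MNIST pixel), the full table has
# 2**785 ~ 10**236 entries, while Naive Bayes needs only 1,569 numbers.
```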

Page 19:

Naïve  Bayes  

19  

Page 20:

Reminder: Conditional Independence

x1 and x2 are conditionally independent given y iff

  P(x1, x2 | y) = P(x1 | y) P(x2 | y)

or, equivalently,

  P(x1 | x2, y) = P(x1 | y)   and   P(x2 | x1, y) = P(x2 | y)

20

Page 21:

Reminder: Conditional Independence

If D random variables x_d are all conditionally independent given y, then

  P(x_1, ..., x_D, y) = P(y) P(x_1, ..., x_D | y) = P(y) ∏_{d=1}^D P(x_d | y)

We can express this graphically:

21

[Diagram: a node Y with an arrow to each of X1, X2, …, XD]

Roughly speaking, an edge from A to B means B depends on A. Absence of an edge means conditional independence.

This is a “Bayesian Network”, a type of Graphical Model diagram. But graphical models are a world unto themselves, and we won’t get into them too deeply…

Page 22:

The Naïve Bayes Model

• A generative model, but used in the context of prediction.
• For a label y and D features x_d, Naïve Bayes posits that all the features are conditionally independent given y:

  P(x_1, ..., x_D, y) = P(y) P(x_1, ..., x_D | y) = P(y) ∏_{d=1}^D P(x_d | y)

• This is a strong (perhaps naïve) assumption about our data.
• But we only have to keep track of P(y) and P(x_d | y)!

22

[Diagram: a node Y with an arrow to each of X1, X2, …, XD]

Page 23:

Why  is  Naïve  Bayes  Convenient?  

• Compact representation

• Easy to compute any quantity
  – P(y|x), P(x_d|y), …

• Easy to estimate model components
  – P(y), P(x_d|y)

• Easy to sample

• Easy to deal with missing values

23  

Page 24:

Example  Model  (Discrete)  

• Each x_d is binary (-1 or +1)
  – E.g., presence (+1) or absence (-1) of a word

24

P(x|y):
         x1=Homework       x2=Winner!
y=SPAM   P(x1=1|y)=0.2     P(x2=1|y)=0.5
y=NOT    P(x1=1|y)=0.6     P(x2=1|y)=0.1

P(y):
y=SPAM   0.7
y=NOT    0.3

Model Complexity is Linear w.r.t. the length of x! (These numbers are reused in the sketches below.)
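As a concrete sketch, these parameters fit in two small Python structures (the names are ours, with 1/0 standing in for the slide's +1/−1 encoding); later sketches restate this representation:

```python
# Naive Bayes parameters for the toy spam model above (names are ours).
labels = ["SPAM", "NOT"]
p_y = {"SPAM": 0.7, "NOT": 0.3}                        # P(y)
# p_x_given_y[y][d] = P(x_d = 1 | y), with d = 0 -> Homework, d = 1 -> Winner!
p_x_given_y = {"SPAM": [0.2, 0.5], "NOT": [0.6, 0.1]}
# Storage grows as 1 + 2*D numbers, not 2**(D+1) table entries.
```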

Page 25:

Making Predictions

25

  P(y | x) = P(x, y) / P(x)
           = P(x | y) P(y) / P(x)              (Bayes’ Rule, hence the name)
           = P(y) ∏_d P(x_d | y) / P(x)        (model components we keep track of)
           ∝ P(y) ∏_d P(x_d | y)

P(x|y):
         x1=Homework       x2=Winner!
y=SPAM   P(x1=1|y)=0.2     P(x2=1|y)=0.5
y=NOT    P(x1=1|y)=0.6     P(x2=1|y)=0.1

P(y):
y=SPAM   0.7
y=NOT    0.3

Page 26:

Example Prediction

26

P(x|y):
         x1=Homework       x2=Winner!
y=SPAM   P(x1=1|y)=0.2     P(x2=1|y)=0.5
y=NOT    P(x1=1|y)=0.6     P(x2=1|y)=0.1

P(y):
y=SPAM   0.7
y=NOT    0.3

  P(y=NOT | x1=1, x2=-1)
    = P(y=NOT) P(x1=1 | y=NOT) P(x2=-1 | y=NOT) / P(x1=1, x2=-1)
    = P(y=NOT) P(x1=1 | y=NOT) P(x2=-1 | y=NOT) / Σ_y P(y) P(x1=1 | y) P(x2=-1 | y)
    = 0.3 * 0.6 * (1 - 0.1) / [ 0.3 * 0.6 * (1 - 0.1) + 0.7 * 0.2 * (1 - 0.5) ]
    = 0.698…
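A sketch that reproduces this number, restating the toy parameters so the snippet stands alone (x is encoded with 1 for a present word and 0 for an absent one; names are ours):

```python
labels = ["SPAM", "NOT"]
p_y = {"SPAM": 0.7, "NOT": 0.3}
p_x_given_y = {"SPAM": [0.2, 0.5], "NOT": [0.6, 0.1]}  # P(x_d = 1 | y)

def joint_score(y, x):
    """P(y) * prod_d P(x_d | y), the unnormalized Naive Bayes posterior."""
    score = p_y[y]
    for d, x_d in enumerate(x):
        p1 = p_x_given_y[y][d]
        score *= p1 if x_d == 1 else 1.0 - p1
    return score

def predict_proba(x):
    """P(y | x) by Bayes' rule: normalize the joint scores over the labels."""
    scores = {y: joint_score(y, x) for y in labels}
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

print(predict_proba([1, 0])["NOT"])  # ~0.698, matching the slide
```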

Page 27:

One  Empirical  Comparison  

• Columns denote how frequently each model places 1st, 2nd, 3rd, etc.
• The only generative model (Naïve Bayes) is in last place

27

Table 4. Bootstrap Analysis of Overall Rank by Mean Performance Across Problems and Metrics

model      1st    2nd    3rd    4th    5th    6th    7th    8th    9th    10th
bst-dt     0.580  0.228  0.160  0.023  0.009  0.000  0.000  0.000  0.000  0.000
rf         0.390  0.525  0.084  0.001  0.000  0.000  0.000  0.000  0.000  0.000
bag-dt     0.030  0.232  0.571  0.150  0.017  0.000  0.000  0.000  0.000  0.000
svm        0.000  0.008  0.148  0.574  0.240  0.029  0.001  0.000  0.000  0.000
ann        0.000  0.007  0.035  0.230  0.606  0.122  0.000  0.000  0.000  0.000
knn        0.000  0.000  0.000  0.009  0.114  0.592  0.245  0.038  0.002  0.000
bst-stmp   0.000  0.000  0.002  0.013  0.014  0.257  0.710  0.004  0.000  0.000
dt         0.000  0.000  0.000  0.000  0.000  0.000  0.004  0.616  0.291  0.089
logreg     0.000  0.000  0.000  0.000  0.000  0.000  0.040  0.312  0.423  0.225
nb         0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.030  0.284  0.686

From the paper: "…overall, and only a 4.2% chance of seeing them rank lower than 3rd place. Random forests would come in 1st place 39% of the time, 2nd place 53% of the time, with little chance (0.1%) of ranking below third place.

There is less than a 20% chance that a method other than boosted trees, random forests, and bagged trees would rank in the top three, and no chance (0.0%) that another method would rank 1st; it appears to be a clean sweep for ensembles of trees. SVMs probably would rank 4th, and neural nets probably would rank 5th, but there is a 1 in 3 chance that SVMs would rank after neural nets. The bootstrap analysis clearly shows that MBL, boosted 1-level stumps, plain decision trees, logistic regression, and naive bayes are not competitive on average with the top five models on these problems and metrics when trained on 5k samples."


“An Empirical Comparison of Supervised Learning Algorithms”, Caruana & Niculescu-Mizil, ICML 2006

Page 28:

Example Prediction #2

• What if we want to compute P(x_1 | x_{2:D}, y)?
• It’s an explicitly defined model component:

  P(x_1 | x_{2:D}, y) = P(x_1 | y)

28

Page 29:

Example Prediction #3

• What if we want to compute P(x_1 | x_{2:D})?

29

  P(x_1 | x_{2:D}) = P(x_{1:D}) / P(x_{2:D})
                   = Σ_y P(y) P(x_{1:D} | y) / Σ_y P(y) P(x_{2:D} | y)               (“marginalizing out the y”)
                   = Σ_y P(y) ∏_{d=1}^D P(x_d | y) / Σ_y P(y) ∏_{d=2}^D P(x_d | y)

Why is the numerator smaller than the denominator?

Page 30:

Marginalization in Matrix Form

• Compute P(x_d = 1):

  P(x_d = 1) = Σ_y P(x_d = 1 | y) P(y)

30

O = P(x|y):
         x1=Homework       x2=Winner!
y=SPAM   P(x1=1|y)=0.2     P(x2=1|y)=0.5
y=NOT    P(x1=1|y)=0.6     P(x2=1|y)=0.1

P = P(y):
y=SPAM   0.7
y=NOT    0.3

  P(x_d = 1) = [Oᵀ P]_d        (d-th row)

For heaven’s sake, don’t use a for loop (see the sketch below).
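A vectorized NumPy sketch of the same computation (the array names follow the slide’s O and P; the shapes are our assumption):

```python
import numpy as np

# O[y, d] = P(x_d = 1 | y), rows ordered (SPAM, NOT); columns (Homework, Winner!).
O = np.array([[0.2, 0.5],
              [0.6, 0.1]])
p = np.array([0.7, 0.3])   # P(y), in the same row order

marginals = O.T @ p        # entry d is P(x_d = 1) = sum_y P(x_d = 1 | y) P(y)
print(marginals)           # ~[0.32 0.38]
```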

Page 31:

Prediction with Missing Values

• What if we don’t observe x_{2:D}?
• Predict P(y=SPAM | x_1)

31

We can marginalize out the missing values!

  P(y | x_1) = Σ_{x_{2:D}} P(y, x_{2:D} | x_1) = Σ_{x_{2:D}} P(x_{1:D}, y) / P(x_1)

How do we efficiently sum over multiple missing values?

Page 32:

Conditional Independence to the Rescue!

32

From the previous slide:

  P(y | x_1) = Σ_{x_{2:D}} P(x_{1:D}, y) / P(x_1)

Definition of Naïve Bayes:

  P(x_{1:D}, y) = P(y) ∏_{d=1}^D P(x_d | y)

So:

  Σ_{x_{2:D}} P(x_{1:D}, y) = Σ_{x_{2:D}} P(y) ∏_{d=1}^D P(x_d | y)
                            = P(y) P(x_1 | y) ∏_{d=2}^D [ Σ_{x_d} P(x_d | y) ]     (swap product & sum due to independence!)
                            = P(y) P(x_1 | y)                                      (each bracket marginalizes to 1!)
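In code, the punchline is simply that unobserved features drop out of the product. A sketch, restating the toy parameters (the dict-of-observed-indices convention and all names are ours):

```python
labels = ["SPAM", "NOT"]
p_y = {"SPAM": 0.7, "NOT": 0.3}
p_x_given_y = {"SPAM": [0.2, 0.5], "NOT": [0.6, 0.1]}  # P(x_d = 1 | y)

def predict_proba_partial(observed):
    """P(y | observed features); e.g. observed = {0: 1} means x1 = 1 and x2 is missing."""
    scores = {}
    for y in labels:
        score = p_y[y]
        for d, x_d in observed.items():
            p1 = p_x_given_y[y][d]
            score *= p1 if x_d == 1 else 1.0 - p1
        scores[y] = score
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

print(predict_proba_partial({0: 1}))  # P(y | x1=1): SPAM ~0.44, NOT ~0.56
```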

Page 33:

Training

• Maximum Likelihood of the Training Set  S = {(x_i, y_i)}_{i=1}^N :

  argmax_P P(S) = argmax_P ∏_i P(x_i, y_i) = argmin_P − Σ_i log P(x_i, y_i)

where the argmax/argmin is over all possible Naïve Bayes models.

33

Page 34:

Training  

34  

• Subject to the Naïve Bayes assumption on the structure of P(x,y), we only need to estimate P(y) and each P(x_d | y)!
• This is just counting (a counting sketch follows below):

  P(y = SPAM) = N_{y=SPAM} / N
      (frequency of SPAM documents in the training set)

  P(x_1 = 1 | y = SPAM) = N_{y=SPAM, x_1=1} / N_{y=SPAM}
      (frequency of word x_1 appearing in SPAM documents in the training set)
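A minimal counting sketch of these estimates; the training-set format (a list of (x, y) pairs with 0/1 feature vectors) and all names are assumptions of ours:

```python
from collections import Counter

def fit_naive_bayes(data, D):
    """Estimate P(y) and P(x_d = 1 | y) by counting over data = [(x, y), ...]."""
    label_counts = Counter(y for _, y in data)
    feature_counts = {y: [0] * D for y in label_counts}
    for x, y in data:
        for d in range(D):
            feature_counts[y][d] += int(x[d] == 1)
    N = len(data)
    p_y_hat = {y: c / N for y, c in label_counts.items()}
    p_x_given_y_hat = {y: [feature_counts[y][d] / label_counts[y] for d in range(D)]
                       for y in label_counts}
    return p_y_hat, p_x_given_y_hat

# Tiny made-up training set: x = [Homework present?, Winner! present?]
toy = [([1, 0], "NOT"), ([0, 1], "SPAM"), ([1, 1], "SPAM"), ([0, 0], "NOT")]
print(fit_naive_bayes(toy, D=2))
```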

 

 

Page 35:

Training Derivation

35

• Define (just a re-parameterization):

  P(x, y) = w_{x,y} / Σ_{x',y'} w_{x',y'}

• Then

  argmin_w − Σ_i log P(x_i, y_i) = argmin_w Σ_i [ − log w_{x_i, y_i} + log Σ_{x',y'} w_{x',y'} ]

• Setting the gradient to zero, with N_{x,y} = # training examples equal to (x, y):

  ∂_{w_{x,y}} = − N_{x,y} / w_{x,y} + N / Σ_{x',y'} w_{x',y'} = 0
  =>  w_{x,y} / Σ_{x',y'} w_{x',y'} = N_{x,y} / N
  =>  P(x, y) = N_{x,y} / N        (the frequency of (x, y) in the training set!)

Page 36:

Regularization

• Add “pseudo counts” – i.e., hallucinate some data:

  P(y = SPAM) = (N_{y=SPAM} + λ P_{y=SPAM}) / (N + λ)

  P(x_1 = 1 | y = SPAM) = (N_{y=SPAM ∧ x_1=1} + λ P_{y=SPAM ∧ x_1=1}) / (N_{y=SPAM} + λ)

Often just set the pseudo counts to a uniform distribution.

36
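A sketch of the smoothed estimates with uniform pseudo counts, written as a drop-in change to the counting sketch above (lam plays the role of λ; the names are ours):

```python
def smoothed_estimates(label_counts, feature_counts, N, D, lam=1.0):
    """Pseudo-count-regularized P(y) and P(x_d = 1 | y) with uniform pseudo counts."""
    K = len(label_counts)  # number of labels, so the uniform label prior is 1/K
    p_y_hat = {y: (c + lam / K) / (N + lam) for y, c in label_counts.items()}
    p_x_given_y_hat = {
        y: [(feature_counts[y][d] + lam / 2) / (label_counts[y] + lam) for d in range(D)]
        for y in label_counts
    }
    return p_y_hat, p_x_given_y_hat
```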

Page 37:

Sampling  

• Can sample from the distribution P(x,y) (see the sketch below):
  – First sample y:
    • Draw a uniform random variable R
    • Set y=SPAM if R < P(y=SPAM), and y=NOT otherwise
  – Then sample each x_d:
    • Draw a uniform random variable R
    • Set x_d=1 if R < P(x_d=1|y), and x_d=0 otherwise

37  
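A sketch of this ancestral sampling procedure, restating the toy parameters (sample y first, then each x_d given y; names are ours):

```python
import random

p_y = {"SPAM": 0.7, "NOT": 0.3}
p_x_given_y = {"SPAM": [0.2, 0.5], "NOT": [0.6, 0.1]}  # P(x_d = 1 | y)

def sample_example():
    """Sample (y, x) from the Naive Bayes joint: first y, then each x_d given y."""
    y = "SPAM" if random.uniform(0, 1) < p_y["SPAM"] else "NOT"
    x = [1 if random.uniform(0, 1) < p1 else 0 for p1 in p_x_given_y[y]]
    return y, x

print(sample_example())  # e.g. ('SPAM', [1, 0])
```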

Page 38:

Sampling  Example  

• Sample P(y)
  – R = 0.5, so set y = SPAM
• Sample P(x1 | y=SPAM)
  – R = 0.1, so set x1 = 1
• Sample P(x2 | y=SPAM)
  – R = 0.9, so set x2 = 0

Can be done in either order.

38

P(x|y):
         x1=Homework       x2=Winner!
y=SPAM   P(x1=1|y)=0.2     P(x2=1|y)=0.5
y=NOT    P(x1=1|y)=0.6     P(x2=1|y)=0.1

P(y):
y=SPAM   0.7
y=NOT    0.3

Page 39:

Sampling  Example  #2  

• Sample P(y)
  – R = 0.9, so set y = NOT
• Sample P(x1 | y=NOT)
  – R = 0.5, so set x1 = 1
• Sample P(x2 | y=NOT)
  – R = 0.05, so set x2 = 1

39

P(x|y):
         x1=Homework       x2=Winner!
y=SPAM   P(x1=1|y)=0.2     P(x2=1|y)=0.5
y=NOT    P(x1=1|y)=0.6     P(x2=1|y)=0.1

P(y):
y=SPAM   0.7
y=NOT    0.3

Page 40:

Recap: Generative Models and Naïve Bayes

• Generative models (attempt to) model the whole data distribution
  – Can generate new data, tolerate missing values, etc.
  – Not as good at prediction as discriminative models

40  

Page 41:

Recap: Generative Models and Naïve Bayes

• The Naïve Bayes model assumes all features are conditionally independent given the label
  – Greatly simplified model structure, but still generative
  – Compact representation
  – Easy to train
  – Easy to compute various probabilities
  – Not the most accurate for standard prediction

41  

Page 42:

Next  Lecture  

• Hidden Markov Models in depth
  – Sequence Modeling
  – Requires Dynamic Programming
  – Implement aspects of HMMs in homework

• Recitation Tuesday:
  – Recap of Dynamic Programming (for HMMs)

42