
Page 1

Ensemble Methods

Adapted from slides by Todd Holloway
http://abeautifulwww.com/2007/11/23/ensemble-machine-learning-tutorial/

Page 2

Outline

• The Netflix Prize
  – Success of ensemble methods in the Netflix Prize
• Why Ensemble Methods Work
• Algorithms
  – Bagging
  – Random forests
  – AdaBoost

Page 3

Definition

Ensemble Classification

Aggregation of the predictions of multiple classifiers with the goal of improving accuracy.

Page 4

Teaser: How good are ensemble methods?

Let's look at the Netflix Prize Competition…

Page 5

• Supervised learning task
  – Training data is a set of users and the ratings (1, 2, 3, 4, or 5 stars) those users have given to movies.
  – Construct a classifier that, given a user and an unrated movie, correctly classifies that movie as either 1, 2, 3, 4, or 5 stars.
• $1 million prize for a 10% improvement over Netflix's current movie recommender/classifier (RMSE = 0.9514)
• Began October 2006

Page 6

Just three weeks after it began, at least 40 teams had bested the Netflix classifier.

Top teams showed about 5% improvement.

Page 7

from http://www.research.att.com/~volinsky/netflix/

However, improvement slowed…

Page 8

Today, the top team has posted an 8.5% improvement.

Ensemble methods are the best performers…

Page 9

"Thanks to Paul Harrison's collaboration, a simple mix of our solutions improved our result from 6.31 to 6.75"

Rookies

Page 10

"My approach is to combine the results of many methods (also two-way interactions between them) using linear regression on the test set. The best method in my ensemble is regularized SVD with biases, post-processed with kernel ridge regression"

Arek Paterek

http://rainbow.mimuw.edu.pl/~ap/ap_kdd.pdf

Page 11

"When the predictions of multiple RBM models and multiple SVD models are linearly combined, we achieve an error rate that is well over 6% better than the score of Netflix's own system."

U of Toronto

http://www.cs.toronto.edu/~rsalakhu/papers/rbmcf.pdf

Page 12

Gravity  

home.mit.bme.hu/~gtakacs/download/gravity.pdf    

Page 13

When Gravity and Dinosaurs Unite

"Our common team blends the result of team Gravity and team Dinosaur Planet."

Might have guessed from the name…

Page 14

And, yes, the top team, which is from AT&T…

"Our final solution (RMSE = 0.8712) consists of blending 107 individual results."

BellKor / KorBell

Page 15

The winner was an ensemble of ensembles (including BellKor).

Page 16

Some Intuitions on Why Ensemble Methods Work…

Page 17

Intuitions

• Utility of combining diverse, independent opinions in human decision-making
  – Protective mechanism (e.g., stock portfolio diversity)
• Violation of Occam's Razor
  – Identifying the best model requires identifying the proper "model complexity"

See Domingos, P. Occam's two razors: the sharp and the blunt. KDD, 1998.

Page 18

Intuitions

• Majority vote: suppose we have 5 completely independent classifiers…
  – If accuracy is 70% for each, the probability that the majority vote is correct is
    10(0.7^3)(0.3^2) + 5(0.7^4)(0.3) + (0.7^5) ≈ 0.837, i.e., 83.7% majority vote accuracy
  – With 101 such classifiers: 99.9% majority vote accuracy
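The binomial arithmetic above generalizes to any odd number of independent classifiers. A minimal Python sketch (an illustration added here, not part of the original slides) that reproduces the 83.7% and 99.9% figures:

from math import comb

def majority_vote_accuracy(n, p):
    """Probability that a majority of n independent classifiers,
    each correct with probability p, votes for the right class."""
    k_min = n // 2 + 1  # smallest winning majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

print(majority_vote_accuracy(5, 0.7))    # ~0.837
print(majority_vote_accuracy(101, 0.7))  # ~0.999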

Page 19

Strategies

• Bagging
  – Use different samples of observations and/or predictors (features) of the examples to generate diverse classifiers
  – Aggregate the classifiers: average in regression, majority vote in classification
• Boosting
  – Make examples currently misclassified more important (or less, in some cases)

Page 20

Bagging (Constructing for Diversity)

1. Use random samples of the examples to construct the classifiers
   • Bagging: Bootstrap Aggregation
2. Use random feature sets to construct the classifiers
   • Random Decision Forests

Leo Breiman

Page 21

• Bootstrap: consider the following situation:
  – A random sample $\mathbf{x} = (x_1, \ldots, x_N)$ from an unknown probability distribution $F$
  – We wish to estimate a parameter $\theta = t(F)$
  – We build an estimate $\hat{\theta} = s(\mathbf{x})$
  – What is the s.d. of $\hat{\theta}$?

Examples: 1) estimate the mean and s.d. of expected prediction error; 2) estimate point-wise confidence bands in smoothing

Page 22

• Bootstrap:
  – It is completely automatic
  – Requires no theoretical calculations
  – Not based on asymptotic results
  – Available regardless of how complicated the estimator $\hat{\theta}$ is

Page 23

• Bootstrap algorithm:
  1. Select B independent bootstrap samples $\mathbf{x}^{*1}, \ldots, \mathbf{x}^{*B}$, each consisting of N data values drawn with replacement from $\mathbf{x}$
  2. Evaluate the bootstrap replication corresponding to each bootstrap sample: $\hat{\theta}^{*}(b) = s(\mathbf{x}^{*b}), \; b = 1, \ldots, B$
  3. Estimate the standard error of $\hat{\theta}$ using the sample standard error of the B estimates
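A minimal NumPy sketch of the three steps above (illustrative only; the data and statistic are placeholders), here estimating the standard error of a sample mean:

import numpy as np

def bootstrap_se(x, statistic, B=1000, seed=0):
    """Estimate the standard error of statistic(x) from B bootstrap replications."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Steps 1-2: draw B samples with replacement and evaluate the replication on each
    reps = np.array([statistic(rng.choice(x, size=n, replace=True)) for _ in range(B)])
    # Step 3: sample standard error of the B replications
    return reps.std(ddof=1)

x = np.random.default_rng(1).normal(size=100)
print(bootstrap_se(x, np.mean))  # close to the usual estimate x.std(ddof=1) / 10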

Page 24

• Bagging: use the bootstrap to improve predictions
  1. Create bootstrap samples, estimate a model from each bootstrap sample
  2. Aggregate the predictions (average if regression, majority vote if classification)
• This works best when perturbing the training set can cause significant changes in the estimated model
• For instance, for least squares one can show that variance is decreased while bias is unchanged
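As a concrete illustration of the two steps above (not from the slides; it assumes non-negative integer class labels and uses scikit-learn decision trees as the base model), a bare-bones bagging sketch; scikit-learn's BaggingClassifier packages the same idea:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_bagged_trees(X, y, B=50, seed=0):
    """Fit B decision trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))  # rows sampled with replacement
        trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return trees

def predict_bagged(trees, X):
    """Majority vote over the individual trees (assumes non-negative integer labels)."""
    votes = np.array([t.predict(X) for t in trees]).astype(int)  # shape (B, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])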

Page 25

Random forests

• At every node, choose a random subset of the variables (predictors, not examples) and choose the best split among those attributes

Page 26

Random forests

• Let the number of training points be M, and the number of variables in the classifier be N.

For each tree,
1. Choose a training set by sampling M times with replacement from all M available training cases.
2. For each node, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these.

Page 27

Random forests

• Grow each tree as deep as possible (no pruning!)
• Out-of-bag data can be used to estimate cross-validation error
  – For each training point, get a prediction by averaging the trees whose bootstrap samples do not include that point
• Variable importance measures are easy to calculate
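For illustration only (not part of the slides), scikit-learn's RandomForestClassifier exposes both quantities mentioned above, the out-of-bag error estimate and the variable importances, directly:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Trees are grown deep (unpruned) by default; max_features controls m, the number
# of variables tried at each node; oob_score=True requests the out-of-bag estimate.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            oob_score=True, random_state=0).fit(X, y)

print(rf.oob_score_)            # out-of-bag accuracy, a cross-validation-like estimate
print(rf.feature_importances_)  # one importance score per variable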

Page 28

Page 29

Boosting

1. Create a sequence of classifiers, giving higher influence to more accurate classifiers
2. At each iteration, make examples currently misclassified more important (they get larger weight in the construction of the next classifier)
3. Then combine the classifiers by weighted vote (weight given by classifier accuracy)

Page 30

AdaBoost Algorithm

1. Initialize weights: each case gets the same weight, $w_i = 1/N, \; i = 1, \ldots, N$
2. Construct a classifier $g_m$ using the current weights. Compute its weighted error:
   $\varepsilon_m = \frac{\sum_i w_i \, I\{y_i \neq g_m(x_i)\}}{\sum_i w_i}$
3. Get the classifier influence, and update the example weights:
   $\alpha_m = \log\left(\frac{1 - \varepsilon_m}{\varepsilon_m}\right), \qquad w_i \leftarrow w_i \times \exp\{\alpha_m \, I\{y_i \neq g_m(x_i)\}\}$
4. Go to step 2…

The final prediction is a weighted vote, with weights $\alpha_m$.
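A compact sketch following the update rules above (an added illustration, with some assumptions: labels coded as -1/+1, scikit-learn decision stumps standing in for the weak learner $g_m$, and no guard against a zero error rate):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    """AdaBoost with decision stumps; y must be coded as -1/+1."""
    N = len(y)
    w = np.full(N, 1.0 / N)                      # step 1: uniform weights
    stumps, alphas = [], []
    for _ in range(M):
        g = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = (g.predict(X) != y)               # I{y_i != g_m(x_i)}
        eps = np.sum(w * miss) / np.sum(w)       # step 2: weighted error
        alpha = np.log((1 - eps) / eps)          # step 3: classifier influence
        w = w * np.exp(alpha * miss)             # up-weight misclassified examples
        stumps.append(g)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Weighted vote: sign of the alpha-weighted sum of the stumps' -1/+1 outputs."""
    return np.sign(sum(a * g.predict(X) for g, a in zip(stumps, alphas)))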

Page 31

[Figures: classifications (colors) and weights (size) after 1, 3, and 20 iterations of AdaBoost]

from Elder, John. From Trees to Forests and Rule Sets - A Unified Overview of Ensemble Methods. 2007.

Page 32

AdaBoost

• Advantages
  – Very little code
  – Reduces variance
• Disadvantages
  – Sensitive to noise and outliers. Why?

Page 33

How was the Netflix Prize won?

Gradient boosted decision trees…

Details: http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf
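For reference, an illustrative sketch (not the winning Netflix system) of gradient boosted decision trees via scikit-learn, fit to synthetic regression data:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each new shallow tree is fit to the residual errors of the ensemble built so far.
gbdt = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05,
                                 max_depth=3, random_state=0).fit(X_tr, y_tr)

rmse = mean_squared_error(y_te, gbdt.predict(X_te)) ** 0.5
print(rmse)  # test RMSE on the held-out synthetic data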

Page 34

Sources

• David Mease. Statistical Aspects of Data Mining. Lecture. http://video.google.com/videoplay?docid=-4669216290304603251&q=stats+202+engEDU&total=13&start=0&num=10&so=0&type=search&plindex=8
• Dietterich, T. G. Ensemble Learning. In The Handbook of Brain Theory and Neural Networks, Second edition (M.A. Arbib, Ed.), Cambridge, MA: The MIT Press, 2002. http://www.cs.orst.edu/~tgd/publications/hbtnn-ensemble-learning.ps.gz
• Elder, John and Seni, Giovanni. From Trees to Forests and Rule Sets - A Unified Overview of Ensemble Methods. KDD 2007 Tutorial. http://videolectures.net/kdd07_elder_afr/
• Netflix Prize. http://www.netflixprize.com/
• Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press. 1995.