Ensemble Methods. Adapted from slides by Todd Holloway, http://abeautifulwww.com/2007/11/23/ensemble-machine-learning-tutorial/

Source: users.umiacs.umd.edu/~hcorrada/PML/pdf/lectures/EnsembleMethods.pdf


Page 1: Ensemble Methods

Ensemble Methods

Adapted from slides by Todd Holloway
http://abeautifulwww.com/2007/11/23/ensemble-machine-learning-tutorial/

Page 2: Ensemble Methods

Outline

• The Netflix Prize
  – Success of ensemble methods in the Netflix Prize
• Why Ensemble Methods Work
• Algorithms
  – Bagging
  – Random forests
  – AdaBoost

Page 3: Ensemble Methods

Definition

Ensemble Classification: aggregation of the predictions of multiple classifiers, with the goal of improving accuracy.

Page 4: Ensemble Methods

Teaser: How good are ensemble methods?

Let's look at the Netflix Prize competition…

Page 5: Ensemble Methods

• Supervised learning task
  – Training data is a set of users and the ratings (1, 2, 3, 4, or 5 stars) those users have given to movies.
  – Construct a classifier that, given a user and an unrated movie, correctly classifies that movie as either 1, 2, 3, 4, or 5 stars.

• $1 million prize for a 10% improvement over Netflix's current movie recommender/classifier (RMSE = 0.9514)

Began October 2006

Page 6: Ensemble Methods

Just three weeks after it began, at least 40 teams had bested the Netflix classifier.

Top teams showed about 5% improvement.

Page 7: Ensemble Methods

from http://www.research.att.com/~volinsky/netflix/

However, improvement slowed…

Page 8: Ensemble Methods

Today, the top team has posted an 8.5% improvement. Ensemble methods are the best performers…

Page 9: Ensemble Methods

"Thanks to Paul Harrison's collaboration, a simple mix of our solutions improved our result from 6.31 to 6.75"

Rookies

Page 10: Ensemble Methods

"My approach is to combine the results of many methods (also two-way interactions between them) using linear regression on the test set. The best method in my ensemble is regularized SVD with biases, post-processed with kernel ridge regression"

Arek Paterek
http://rainbow.mimuw.edu.pl/~ap/ap_kdd.pdf

Page 11: Ensemble Methods

"When the predictions of multiple RBM models and multiple SVD models are linearly combined, we achieve an error rate that is well over 6% better than the score of Netflix's own system."

U of Toronto
http://www.cs.toronto.edu/~rsalakhu/papers/rbmcf.pdf

Page 12: Ensemble Methods

Gravity  

home.mit.bme.hu/~gtakacs/download/gravity.pdf    

Page 13: Ensemble Methods

"Our common team blends the result of team Gravity and team Dinosaur Planet." Might have guessed from the name…

When Gravity and Dinosaurs Unite

Page 14: Ensemble Methods

And, yes, the top team, which is from AT&T… "Our final solution (RMSE = 0.8712) consists of blending 107 individual results."

BellKor / KorBell

Page 15: Ensemble Methods

The winner was an ensemble of ensembles (including BellKor).

Page 16: Ensemble Methods

Some Intuitions on Why Ensemble Methods Work…

Page 17: Ensemble Methods

Intuitions

• Utility of combining diverse, independent opinions in human decision-making
  – Protective mechanism (e.g. stock portfolio diversity)

• Violation of Ockham's Razor
  – Identifying the best model requires identifying the proper "model complexity"

See Domingos, P. Occam's two razors: the sharp and the blunt. KDD. 1998.

Page 18: Ensemble Methods

Intuitions

Majority vote: suppose we have 5 completely independent classifiers…
– If accuracy is 70% for each:
  • P(majority correct) = 10(.7^3)(.3^2) + 5(.7^4)(.3) + (.7^5)
  • 83.7% majority-vote accuracy
– With 101 such classifiers:
  • 99.9% majority-vote accuracy
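The arithmetic on this slide is easy to check: a majority of n independent classifiers is correct exactly when more than half of them are, which is a binomial tail sum. A minimal sketch in plain Python (the function name `majority_vote_accuracy` is introduced here for illustration):

```python
from math import comb

def majority_vote_accuracy(n_classifiers: int, p: float) -> float:
    """Probability that a majority of n independent classifiers,
    each correct with probability p, gets the right answer."""
    k_min = n_classifiers // 2 + 1  # smallest winning majority
    return sum(
        comb(n_classifiers, k) * p**k * (1 - p) ** (n_classifiers - k)
        for k in range(k_min, n_classifiers + 1)
    )

print(round(majority_vote_accuracy(5, 0.7), 3))  # ≈ 0.837, matching the slide
print(majority_vote_accuracy(101, 0.7) > 0.999)  # the 101-classifier claim
```

Note the assumption of complete independence: real classifiers trained on the same data are correlated, so the actual gain is smaller.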

Page 19: Ensemble Methods

Strategies

Bagging
– Use different samples of observations and/or predictors (features) of the examples to generate diverse classifiers
– Aggregate the classifiers: average in regression, majority vote in classification

Boosting
– Make examples currently misclassified more important (or less, in some cases)

Page 20: Ensemble Methods

Bagging (Constructing for Diversity)

1. Use random samples of the examples to construct the classifiers
2. Use random feature sets to construct the classifiers
   • Random Decision Forests

• Bagging: Bootstrap Aggregation

Leo Breiman

Page 21: Ensemble Methods

• Bootstrap: consider the following situation:
  – A random sample x = (x_1, ..., x_N) from an unknown probability distribution F
  – We wish to estimate a parameter θ = t(F)
  – We build the estimate θ̂ = s(x)
  – What is the s.d. of θ̂?

Examples: 1) estimate the mean and s.d. of expected prediction error; 2) estimate point-wise confidence bands in smoothing

Page 22: Ensemble Methods

• Bootstrap:
  – It is completely automatic
  – Requires no theoretical calculations
  – Not based on asymptotic results
  – Available regardless of how complicated the estimator θ̂ is

Page 23: Ensemble Methods

• Bootstrap algorithm:
  1. Select B independent bootstrap samples x*^1, ..., x*^B, each consisting of N data values drawn with replacement from x
  2. Evaluate the bootstrap replication corresponding to each bootstrap sample: θ̂*(b) = s(x*^b), b = 1, ..., B
  3. Estimate the standard error of θ̂ using the sample standard error of the B estimates
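The three steps above fit in a few lines of plain Python. This is a sketch under stated assumptions: the helper name `bootstrap_se` and the Gaussian toy data are illustrative, not from the slides.

```python
import random
import statistics

def bootstrap_se(x, s, B=1000, seed=0):
    """Estimate the standard error of the statistic s(x):
    draw B bootstrap samples (with replacement), evaluate s on each,
    and return the sample standard deviation of the B replications."""
    rng = random.Random(seed)
    n = len(x)
    reps = [s([rng.choice(x) for _ in range(n)]) for _ in range(B)]
    return statistics.stdev(reps)

# Toy check: the s.e. of the mean of 50 standard-normal draws
# should come out near the theoretical sigma/sqrt(n) = 1/sqrt(50) ≈ 0.14.
rng = random.Random(1)
data = [rng.gauss(0, 1) for _ in range(50)]
print(bootstrap_se(data, statistics.mean))
```

Replacing `statistics.mean` with any other statistic (median, a model coefficient, a prediction error) is what makes the method "completely automatic", as the previous slide notes.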

Page 24: Ensemble Methods

• Bagging: use the bootstrap to improve predictions
  1. Create bootstrap samples; estimate a model from each bootstrap sample
  2. Aggregate the predictions (average if regression, majority vote if classification)

• This works best when perturbing the training set can cause significant changes in the estimated model

• For instance, for least squares one can show that variance is decreased while bias is unchanged
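The two bagging steps can be sketched directly. Everything here is a toy: the base learner is a one-dimensional threshold "stump" (a stand-in for a real tree), and the data are two Gaussian clusters, both invented for illustration.

```python
import random
import statistics

def fit_stump(points):
    """Toy base learner: threshold at the midpoint between the class means.
    Assumes 1-D inputs with labels -1/+1 (an illustration, not a real tree)."""
    pos = [x for x, y in points if y == +1]
    neg = [x for x, y in points if y == -1]
    t = (statistics.mean(pos) + statistics.mean(neg)) / 2
    return lambda x: +1 if x > t else -1

def bag(points, fit, B=25, seed=0):
    """Step 1: fit B base learners, each on a bootstrap sample.
    Step 2: aggregate predictions by majority vote."""
    rng = random.Random(seed)
    n = len(points)
    models = [fit([rng.choice(points) for _ in range(n)]) for _ in range(B)]
    return lambda x: +1 if sum(m(x) for m in models) > 0 else -1

rng = random.Random(2)
train = ([(rng.gauss(+1, 1), +1) for _ in range(100)]
         + [(rng.gauss(-1, 1), -1) for _ in range(100)])
clf = bag(train, fit_stump)
print(clf(2.0), clf(-2.0))  # points far from the boundary: expect +1 -1
```

A stump this stable barely benefits from bagging; as the slide says, the gain is largest for high-variance learners such as deep, unpruned trees.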

Page 25: Ensemble Methods

Random forests

• At every level, choose a random subset of the variables (predictors, not examples) and choose the best split among those attributes

Breiman, L. (2001). Random forests. Machine learning, 45(1), 5-32.

Page 26: Ensemble Methods

Random forests

• Let the number of training points be N, and the number of variables in the classifier be M.

For each tree,
1. Choose a training set by sampling N times with replacement from all N available training cases.
2. For each node, randomly choose m of the M variables on which to base the decision at that node.

Calculate the best split based on these.

Page 27: Ensemble Methods

Random forests

• Grow each tree as deep as possible: no pruning!

• Out-of-bag data can be used to estimate the cross-validation error
  – For each training point, get a prediction by averaging the trees whose bootstrap sample does not include that point

• Variable importance measures are easy to calculate

Page 28: Ensemble Methods

Page 29: Ensemble Methods

Boosting
1. Create a sequence of classifiers, giving higher influence to more accurate classifiers
2. At each iteration, make the examples currently misclassified more important (they get larger weight in the construction of the next classifier)
3. Then combine the classifiers by weighted vote (weight given by classifier accuracy)

Page 30: Ensemble Methods

AdaBoost Algorithm
1. Initialize weights: each case gets the same weight, w_i = 1/N, i = 1, ..., N
2. Construct a classifier g_m using the current weights. Compute its error:
     ε_m = Σ_i w_i · I{y_i ≠ g_m(x_i)} / Σ_i w_i
3. Get the classifier influence α_m = log((1 − ε_m) / ε_m), and update the example weights:
     w_i ← w_i · exp{α_m · I{y_i ≠ g_m(x_i)}}
4. Go to step 2…

The final prediction is a weighted vote, with weight α_m.
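One round of the update above can be traced numerically. A minimal sketch using exactly the slide's formulas (the 10-example setup and the name `adaboost_round` are invented for illustration):

```python
import math

def adaboost_round(weights, misclassified):
    """One AdaBoost weight update, following the slide:
    eps_m = weighted error rate, alpha_m = log((1 - eps)/eps),
    and each misclassified example's weight is multiplied by exp(alpha)."""
    total = sum(weights)
    eps = sum(w for w, m in zip(weights, misclassified) if m) / total
    alpha = math.log((1 - eps) / eps)
    new_w = [w * math.exp(alpha) if m else w
             for w, m in zip(weights, misclassified)]
    return eps, alpha, new_w

# 10 examples with uniform weights; the current classifier gets 2 wrong.
w = [0.1] * 10
miss = [True, True] + [False] * 8
eps, alpha, w = adaboost_round(w, miss)
print(eps, alpha)  # eps = 0.2, alpha = log(4) ≈ 1.386
# Each misclassified example's weight grows from 0.1 to 0.1*4 = 0.4, so the
# two mistakes now carry 0.8 of 1.6 total weight: exactly half after renormalizing.
```

That "mistakes end up with half the total weight" property is what forces the next classifier to focus on the hard examples.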

Page 31: Ensemble Methods

AdaBoost Algorithm

(Algorithm and discussion below are from A Course in Machine Learning, p. 152.)

Algorithm 31 AdaBoost(W, D, K)
1: d(0) ← ⟨1/N, 1/N, ..., 1/N⟩ // Initialize uniform importance to each example
2: for k = 1 ... K do
3:   f(k) ← W(D, d(k-1)) // Train kth classifier on weighted data
4:   ŷ_n ← f(k)(x_n), ∀n // Make predictions on training data
5:   ε(k) ← Σ_n d(k-1)_n [y_n ≠ ŷ_n] // Compute weighted training error
6:   α(k) ← (1/2) log((1 − ε(k)) / ε(k)) // Compute "adaptive" parameter
7:   d(k)_n ← (1/Z) d(k-1)_n exp[−α(k) y_n ŷ_n], ∀n // Re-weight examples and normalize
8: end for
9: return f(x) = sgn(Σ_k α(k) f(k)(x)) // Return (weighted) voted classifier

questions that you got right, you pay less attention to. Those that yougot wrong, you study more. Then you take the exam again and repeatthis process. You continually down-weight the importance of questionsyou routinely answer correctly and up-weight the importance of ques-tions you routinely answer incorrectly. After going over the exammultiple times, you hope to have mastered everything.

The precise AdaBoost training algorithm is shown in Algorithm 11.2. The basic functioning of the algorithm is to maintain a weight distribution d over data points. A weak learner, f(k), is trained on this weighted data. (Note that we implicitly assume that our weak learner can accept weighted training data, a relatively mild assumption that is nearly always true.) The (weighted) error rate of f(k) is used to determine the adaptive parameter α, which controls how "important" f(k) is. As long as the weak learner does, indeed, achieve < 50% error, then α will be greater than zero. As the error drops to zero, α grows without bound.

(Aside: what happens if the weak learning assumption is violated and ε is equal to 50%? What if it is worse than 50%? What does this mean, in practice?)

After the adaptive parameter is computed, the weight distribution is updated for the next iteration. As desired, examples that are correctly classified (for which y_n ŷ_n = +1) have their weight decreased multiplicatively. Examples that are incorrectly classified (y_n ŷ_n = −1) have their weight increased multiplicatively. The Z term is a normalization constant to ensure that the sum of d is one (i.e., d can be interpreted as a distribution). The final classifier returned by AdaBoost is a weighted vote of the individual classifiers, with weights given by the adaptive parameters.

To better understand why α is defined as it is, suppose that our weak learner simply returns a constant function that returns the (weighted) majority class. So if the total weight of positive examples exceeds that of negative examples, f(x) = +1 for all x; otherwise f(x) = −1 for all x. To make the problem moderately interesting, suppose that in the original training set there are 80 positive examples and 20 negative examples. In this case, f(1)(x) = +1. Its weighted error rate will be ε(1) = 0.2 because it gets every negative example wrong.
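The 80/20 example in this passage can be verified numerically with the textbook's formulas (α = ½ log((1−ε)/ε), weights scaled by exp(∓α)); the variable names below are illustrative:

```python
import math

# Worked example from the text: 80 positive and 20 negative examples,
# and the weak learner f(1) predicts +1 everywhere.
N, pos, neg = 100, 80, 20
d = 1 / N                                  # uniform initial importance
eps = neg / N                              # weighted error = 0.2
alpha = 0.5 * math.log((1 - eps) / eps)    # 0.5 * log(4) ≈ 0.693

# Correct examples shrink by exp(-alpha) = 0.5; wrong ones grow by exp(alpha) = 2.
w_pos = pos * d * math.exp(-alpha)   # total weight on the (correct) positives
w_neg = neg * d * math.exp(+alpha)   # total weight on the (wrong) negatives
Z = w_pos + w_neg
print(w_pos / Z, w_neg / Z)  # both classes now carry essentially equal weight
```

So after one round the constant majority-class learner is reduced to chance on the reweighted data, which is exactly why α is defined this way.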

Page 32: Ensemble Methods

[Figure: classifications (colors) and weights (size) after 1 iteration, 3 iterations, and 20 iterations of AdaBoost]

from Elder, John. From Trees to Forests and Rule Sets - A Unified Overview of Ensemble Methods. 2007.

Page 33: Ensemble Methods

AdaBoost

• Advantages
  – Very little code
  – Reduces variance

• Disadvantages
  – Sensitive to noise and outliers. Why?

Page 34: Ensemble Methods

How was the Netflix prize won?

Gradient boosted decision trees… Details: http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf

Page 35: Ensemble Methods

Sources
• David Mease. Statistical Aspects of Data Mining. Lecture. http://video.google.com/videoplay?docid=-4669216290304603251&q=stats+202+engEDU&total=13&start=0&num=10&so=0&type=search&plindex=8
• Dietterich, T. G. Ensemble Learning. In The Handbook of Brain Theory and Neural Networks, Second edition (M. A. Arbib, Ed.), Cambridge, MA: The MIT Press, 2002. http://www.cs.orst.edu/~tgd/publications/hbtnn-ensemble-learning.ps.gz
• Elder, John and Giovanni Seni. From Trees to Forests and Rule Sets - A Unified Overview of Ensemble Methods. KDD 2007 Tutorial. http://videolectures.net/kdd07_elder_ftfr/
• Netflix Prize. http://www.netflixprize.com/
• Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press. 1995.