
Lecture 8: Machine Learning in Practice (1)



Page 1: Lecture 8: Machine Learning in Practice (1)

Machine Learning for Language Technology 2015
http://stp.lingfil.uu.se/~santinim/ml/2015/ml4lt_2015.htm

Machine Learning in Practice (1)

Marina Santini

[email protected]    

Department of Linguistics and Philology, Uppsala University, Uppsala, Sweden

 Autumn  2015  

   

Page 2: Lecture 8: Machine Learning in Practice (1)

Acknowledgements  

• Weka's slides
• Witten et al. (2011): Ch. 5, pp. 156–180
• Daumé III (2015): Ch. 4, pp. 65–67


Page 3: Lecture 8: Machine Learning in Practice (1)

Outline  

• Comparing schemes: the t-test
• Predicting probabilities
• Cost-sensitive measures
• Occam's razor


Page 4: Lecture 8: Machine Learning in Practice (1)


Comparing  data  mining  schemes  

• Frequent question: which of two learning schemes performs better?
• Note: this is domain dependent!
• Obvious way: compare 10-fold CV estimates
• Generally sufficient in applications (we don't lose much if the chosen method is not truly better)
• However, what about machine learning research?
  ♦ Need to show convincingly that a particular method works better

Page 5: Lecture 8: Machine Learning in Practice (1)


Comparing schemes II
• Want to show that scheme A is better than scheme B in a particular domain
  ♦ For a given amount of training data
  ♦ On average, across all possible training sets
• Let's assume we have an infinite amount of data from the domain:
  ♦ Sample infinitely many datasets of the specified size
  ♦ Obtain a cross-validation estimate on each dataset for each scheme
  ♦ Check whether the mean accuracy for scheme A is better than the mean accuracy for scheme B

Page 6: Lecture 8: Machine Learning in Practice (1)


Paired t-test
• In practice we have limited data and a limited number of estimates for computing the mean
• Student's t-test tells us whether the means of two samples are significantly different
• In our case the samples are cross-validation estimates for different datasets from the domain
• Use a paired t-test because the individual samples are paired
  ♦ The same CV folds are applied to both schemes

William Gosset
Born: 1876 in Canterbury; Died: 1937 in Beaconsfield, England
Obtained a post as a chemist in the Guinness brewery in Dublin in 1899. Invented the t-test to handle small samples for quality control in brewing. Wrote under the name "Student".

Page 7: Lecture 8: Machine Learning in Practice (1)


Distribution of the means
• x1, x2, …, xk and y1, y2, …, yk are the two sets of cross-validation estimates
• mx and my are the means
• With enough samples, the mean of a set of independent samples is normally distributed
• The estimated variances of the means are σx²/k and σy²/k
• If µx and µy are the true means, then

  (mx – µx) / √(σx²/k)   and   (my – µy) / √(σy²/k)

  are approximately normally distributed with mean 0 and variance 1

Page 8: Lecture 8: Machine Learning in Practice (1)


Student's distribution

• With small samples (k < 100) the mean follows Student's distribution with k–1 degrees of freedom
• Confidence limits (assuming we have 10 estimates, i.e. 9 degrees of freedom):

  Pr[X ≥ z]   z (9 degrees of freedom)   z (normal distribution)
  0.1%        4.30                       3.09
  0.5%        3.25                       2.58
  1%          2.82                       2.33
  5%          1.83                       1.65
  10%         1.38                       1.28
  20%         0.88                       0.84
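As a quick check, these z values can be reproduced with standard statistical software. Below is a minimal sketch assuming SciPy is available; the library choice is an assumption, not something from the slides.

```python
# Reproduce the confidence-limit table: the value z such that Pr[X >= z] = p,
# for Student's t with 9 degrees of freedom and for the normal distribution.
from scipy.stats import norm, t

for p in [0.001, 0.005, 0.01, 0.05, 0.10, 0.20]:
    z_t = t.isf(p, df=9)   # inverse survival function of Student's t
    z_n = norm.isf(p)      # inverse survival function of the standard normal
    print(f"Pr[X >= z] = {p:6.1%}   t (9 d.o.f.): {z_t:.2f}   normal: {z_n:.2f}")
```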

Page 9: Lecture 8: Machine Learning in Practice (1)


Distribution of the differences
• Let md = mx – my
• The difference of the means (md) also has a Student's distribution with k–1 degrees of freedom
• The standardized version of md is called the t-statistic:

  t = md / √(σd²/k)

• We use t to perform the t-test
• σd² is the variance of the difference samples

Page 10: Lecture 8: Machine Learning in Practice (1)


Performing the test

• Fix a significance level
  • If a difference is significant at the α% level, there is a (100–α)% chance that the true means differ
• Divide the significance level by two because the test is two-tailed
  • i.e. the true difference can be positive or negative
• Look up the value for z that corresponds to α/2
• If t ≤ –z or t ≥ z then the difference is significant
  • i.e. the null hypothesis (that the difference is zero) can be rejected
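To make the procedure concrete, here is a minimal sketch with made-up per-fold accuracies; the use of scipy.stats.ttest_rel is an assumption (the slides describe the test itself, not a particular toolkit).

```python
# Paired t-test on the per-fold accuracies of two schemes evaluated with
# the same 10-fold cross-validation (the accuracies below are made up).
import numpy as np
from scipy.stats import ttest_rel

acc_a = np.array([0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.81])
acc_b = np.array([0.78, 0.77, 0.80, 0.79, 0.81, 0.75, 0.79, 0.82, 0.78, 0.79])

# Manual computation following the slides: t = m_d / sqrt(sigma_d^2 / k)
d = acc_a - acc_b
k = len(d)
t_manual = d.mean() / np.sqrt(d.var(ddof=1) / k)

# The same test via SciPy (two-tailed p-value)
t_scipy, p_value = ttest_rel(acc_a, acc_b)

print(t_manual, t_scipy, p_value)
# If p_value is below the chosen significance level (e.g. 0.05),
# reject the null hypothesis that the two means are equal.
```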

Page 11: Lecture 8: Machine Learning in Practice (1)


Unpaired observations
• If the CV estimates are from different datasets, they are no longer paired (or maybe we have k estimates for one scheme and j estimates for the other)
• Then we have to use an unpaired t-test with min(k, j) – 1 degrees of freedom
• The estimate of the variance of the difference of the means becomes:

  σx²/k + σy²/j

Page 12: Lecture 8: Machine Learning in Practice (1)


Predicting probabilities
• Performance measure so far: success rate
• Also called the 0–1 loss function:

  Σi { 0 if prediction i is correct, 1 if prediction i is incorrect }

• Most classifiers produce class probabilities
• Depending on the application, we might want to check the accuracy of the probability estimates
• 0–1 loss is not the right thing to use in those cases

Page 13: Lecture 8: Machine Learning in Practice (1)


Quadratic loss function
• p1 … pk are probability estimates for an instance
• c is the index of the instance's actual class
• a1 … ak = 0, except for ac, which is 1
• The quadratic loss is:

  Σj (pj – aj)² = 1 – 2pc + Σj pj²

• We want to minimize this quantity

Page 14: Lecture 8: Machine Learning in Practice (1)


Informational loss function
• The informational loss function is –log(pc), where c is the index of the instance's actual class
• Let p1*, …, pk* be the true class probabilities
• Then the expected value of the loss function is:

  –p1* log(p1) – p2* log(p2) – … – pk* log(pk)
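A minimal sketch of both loss functions for a single instance; the probability vector is made up, and a base-2 logarithm is assumed for the informational loss.

```python
# Quadratic and informational loss for a single instance of a 3-class problem.
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # predicted class probabilities (made up)
c = 0                           # index of the instance's actual class
a = np.zeros_like(p)
a[c] = 1.0                      # a_j = 0 except for a_c = 1

quadratic_loss = np.sum((p - a) ** 2)   # = 1 - 2*p[c] + sum(p**2)
informational_loss = -np.log2(p[c])     # depends only on the actual class's estimate

print(quadratic_loss, informational_loss)
```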

Page 15: Lecture 8: Machine Learning in Practice (1)


Discussion
• Which loss function should we choose?
  ♦ The quadratic loss function takes into account all class probability estimates for an instance
  ♦ The informational loss focuses only on the probability estimate for the actual class

Page 16: Lecture 8: Machine Learning in Practice (1)


The kappa statistic
• Two confusion matrices for a 3-class problem: actual predictions (left) vs. random predictions (right)
• Number of successes: sum of the entries on the diagonal (D)
• Kappa statistic: measures the relative improvement over random predictions

  κ = (D_observed – D_random) / (D_perfect – D_random)

Page 17: Lecture 8: Machine Learning in Practice (1)

K statistic: Calculations
• Proportion of class "a" = 0.5 (i.e. 100 instances out of 200 → 50% → 0.5)
• Proportion of class "b" = 0.3 (i.e. 60 instances out of 200 → 30% → 0.3)
• Proportion of class "c" = 0.2 (i.e. 40 instances out of 200 → 20% → 0.2)
Both classifiers (see below) return 120 a's, 60 b's and 20 c's, but one classifier is random. By how much does the actual classifier improve on the random classifier?
A classifier guessing at random would return the predictions in the table on the RHS: 0.5 × 120 = 60; 0.3 × 60 = 18; 0.2 × 20 = 4 → 60 + 18 + 4 = 82 correct by chance.
The actual classifier returns the predictions in the table on the LHS, with 140 correct predictions (see the diagonal), i.e. a 70% success rate. However:

  k statistic = (140 – 82) / (200 – 82) = 58 / 118 = 0.49 = 49%

• So the actual success rate of 70% represents an improvement of 49% on random guessing!


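The same calculation as a small sketch, with the numbers taken directly from the slide:

```python
# Kappa statistic with the numbers from this slide.
n_total = 200
d_observed = 140                              # correct predictions of the actual classifier
d_random = 0.5 * 120 + 0.3 * 60 + 0.2 * 20    # = 82, correct predictions expected by chance
d_perfect = n_total                           # a perfect classifier gets all 200 right

kappa = (d_observed - d_random) / (d_perfect - d_random)
print(kappa)   # ~0.49
```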

Page 18: Lecture 8: Machine Learning in Practice (1)

In summary

• A kappa statistic of 100% (or 1) implies a perfect classifier.
• A kappa statistic of 0 implies that the classifier provides no information and behaves as if it were guessing randomly.
• The kappa statistic is used to measure the agreement between predicted and observed categorizations of a dataset, correcting for the agreement that occurs by chance.
• Weka reports the kappa statistic to assess the success rate beyond chance.


Page 19: Lecture 8: Machine Learning in Practice (1)

Quiz 1: k statistic
Our classifier predicts Red 41 times, Green 29 times and Blue 30 times. The actual numbers for the sample are: 40 Red, 30 Green and 30 Blue. Overall, our classifier is right 70% of the time.
Suppose these predictions had been random guesses. Our classifier would have been right by chance: 0.4 × 41 + 0.3 × 29 + 0.3 × 30 = 34.1 (random guessing).
So the actual success rate of 70% represents an improvement of 35.9% on random guessing.
What is the k statistic for our classifier?
1. 0.54
2. 0.60
3. 0.70

Page 20: Lecture 8: Machine Learning in Practice (1)


Counting the cost
• In practice, different types of classification errors often incur different costs
• Examples:
  ♦ Promotional mailing
  ♦ Terrorist profiling
    • "Not a terrorist" is correct 99.99% of the time, but if you miss the 0.01% the cost will be very high
  ♦ Loan decisions
  ♦ etc.
• There are many other types of cost!
  • E.g.: the cost of collecting training data

Page 21: Lecture 8: Machine Learning in Practice (1)


Counting the cost

• The confusion matrix:

                          Predicted class
                          No                Yes
  Actual class    No      True negative     False positive
                  Yes     False negative    True positive

Page 22: Lecture 8: Machine Learning in Practice (1)


Classification with costs
• Two cost matrices:
• Success rate is replaced by the average cost per prediction
  ♦ The cost is given by the appropriate entry in the cost matrix

Page 23: Lecture 8: Machine Learning in Practice (1)


Cost-sensitive classification
• We can take costs into account when making predictions
  ♦ Basic idea: only predict the high-cost class when very confident about the prediction
• Given: predicted class probabilities
  ♦ Normally we just predict the most likely class
  ♦ Here, we should make the prediction that minimizes the expected cost
• Expected cost: dot product of the vector of class probabilities and the appropriate column in the cost matrix
• Choose the column (class) that minimizes the expected cost (see the sketch below)
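A minimal sketch of the expected-cost rule; the cost matrix and the probability vector are made-up numbers, not taken from the slides.

```python
# Choosing the class that minimizes the expected cost.
import numpy as np

# cost[i, j] = cost of predicting class j when the true class is i
cost = np.array([[0.0, 1.0, 1.0],
                 [5.0, 0.0, 1.0],
                 [5.0, 1.0, 0.0]])

p = np.array([0.5, 0.3, 0.2])   # predicted class probabilities for one instance

expected_cost = p @ cost        # dot product of p with each column of the cost matrix
prediction = int(np.argmin(expected_cost))

print(expected_cost, prediction)
# Here class 0 is the most likely, but class 1 minimizes the expected cost.
```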

Page 24: Lecture 8: Machine Learning in Practice (1)


Cost-sensitive learning

• So far we haven't taken costs into account at training time
• Most learning schemes do not perform cost-sensitive learning
  • They generate the same classifier no matter what costs are assigned to the different classes
  • Example: the standard decision tree learner
• Simple methods for cost-sensitive learning (a sketch of the second follows below):
  • Resampling of instances according to costs
  • Weighting of instances according to costs
• Some schemes can take costs into account by varying a parameter, e.g. naïve Bayes
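A minimal sketch of the weighting idea, assuming a learner that accepts per-instance weights; the labels and costs are made up.

```python
# Weighting training instances by the cost of misclassifying their class.
import numpy as np

y = np.array([0, 0, 0, 0, 1, 1])      # class labels of the training instances
misclass_cost = {0: 1.0, 1: 5.0}      # assumed cost of misclassifying each class

weights = np.array([misclass_cost[int(label)] for label in y])
print(weights)
# Pass these as per-instance weights to a learner that supports them, or
# resample instances with probability proportional to the weights.
```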

Page 25: Lecture 8: Machine Learning in Practice (1)


Lift charts

• In practice, costs are rarely known
• Decisions are usually made by comparing possible scenarios
• Example: promotional mailout to 1,000,000 households
  • Mail to all; 0.1% respond (1000)
  • A data mining tool identifies a subset of the 100,000 most promising households, of which 0.4% respond (400)
    40% of the responses for 10% of the cost may pay off
  • Identify a subset of the 400,000 most promising households, of which 0.2% respond (800)
• A lift chart allows a visual comparison

Page 26: Lecture 8: Machine Learning in Practice (1)

Data for a lift chart


Page 27: Lecture 8: Machine Learning in Practice (1)


Generating a lift chart

• Sort instances according to their predicted probability of being positive:

  Rank   Actual class   Predicted probability
  1      Yes            0.95
  2      Yes            0.93
  3      No             0.93
  4      Yes            0.88
  …      …              …

• x axis is the sample size; y axis is the number of true positives
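A minimal sketch of how the points of a lift chart can be computed from such a ranking; the probability and label arrays are made up.

```python
# Lift-chart points: sort by predicted probability of the positive class and
# accumulate the number of true positives.
import numpy as np

proba = np.array([0.95, 0.93, 0.93, 0.88, 0.40, 0.10])   # predicted P(positive)
actual = np.array([1, 1, 0, 1, 1, 0])                    # 1 = Yes, 0 = No

order = np.argsort(-proba)                   # most confident first
cum_true_pos = np.cumsum(actual[order])      # y axis: number of true positives
sample_size = np.arange(1, len(actual) + 1)  # x axis: sample size

for x, y in zip(sample_size, cum_true_pos):
    print(x, y)
```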

Page 28: Lecture 8: Machine Learning in Practice (1)


A hypothetical lift chart

• 40% of the responses for 10% of the cost
• 80% of the responses for 40% of the cost

Page 29: Lecture 8: Machine Learning in Practice (1)


ROC curves

• ROC curves are similar to lift charts
  ♦ ROC stands for "receiver operating characteristic"
  ♦ Used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel
• Differences from a lift chart:
  ♦ The y axis shows the percentage of true positives in the sample rather than the absolute number
  ♦ The x axis shows the percentage of false positives in the sample rather than the sample size

Page 30: Lecture 8: Machine Learning in Practice (1)


A sample ROC curve

• Jagged curve: one set of test data
• Smooth curve: use cross-validation

Page 31: Lecture 8: Machine Learning in Practice (1)


Cross-validation and ROC curves

• Simple method of getting a ROC curve using cross-validation:
  ♦ Collect the probabilities for the instances in the test folds
  ♦ Sort the instances according to these probabilities
• This method is implemented in WEKA (a sketch of the idea follows below)
• However, this is just one possibility
  ♦ Another possibility is to generate an ROC curve for each fold and average them
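A minimal sketch of the pooled-probabilities method, with made-up values standing in for the probabilities collected from the test folds:

```python
# ROC points from the probabilities pooled across all cross-validation test
# folds, sorted from most to least confident.
import numpy as np

proba = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20])
actual = np.array([1, 1, 0, 1, 0, 1, 0, 0])   # pooled true labels

order = np.argsort(-proba)
labels = actual[order]

tpr = np.cumsum(labels) / labels.sum()                       # y: % of true positives
fpr = np.cumsum(1 - labels) / (len(labels) - labels.sum())   # x: % of false positives

for x, y in zip(fpr, tpr):
    print(round(float(x), 2), round(float(y), 2))
```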

Page 32: Lecture 8: Machine Learning in Practice (1)


ROC curves for two schemes

• For a small, focused sample, use method A
• For a larger one, use method B
• In between, choose between A and B with appropriate probabilities

Page 33: Lecture 8: Machine Learning in Practice (1)


Recall-Precision curves

• Percentage of retrieved documents that are relevant: precision = TP / (TP + FP)
• Percentage of relevant documents that are returned: recall = TP / (TP + FN)
• Precision/recall curves have a hyperbolic shape
• Summary measures: average precision at 20%, 50% and 80% recall (three-point average recall)
• F-measure = (2 × recall × precision) / (recall + precision)
• sensitivity × specificity = (TP / (TP + FN)) × (TN / (FP + TN))
• Area under the ROC curve (AUC): the probability that a randomly chosen positive instance is ranked above a randomly chosen negative one
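A minimal sketch computing these measures from made-up confusion-matrix counts:

```python
# Precision, recall, F-measure and sensitivity x specificity.
tp, fp, fn, tn = 40, 10, 20, 130

precision = tp / (tp + fp)
recall = tp / (tp + fn)          # = sensitivity
specificity = tn / (fp + tn)
f_measure = (2 * recall * precision) / (recall + precision)

print(precision, recall, f_measure, recall * specificity)
```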

Page 34: Lecture 8: Machine Learning in Practice (1)


Model selection criteria
• Model selection criteria attempt to find a good compromise between:
  • the complexity of a model
  • its prediction accuracy on the training data
• Reasoning: a good model is a simple model that achieves high accuracy on the given data
• Also known as Occam's Razor: the best theory is the smallest one that describes all the facts

William of Ockham, born in the village of Ockham in Surrey (England) about 1285, was the most influential philosopher of the 14th century and a controversial theologian.

Page 35: Lecture 8: Machine Learning in Practice (1)


Elegance vs. errors

• Model 1: a very simple, elegant model that accounts for the data almost perfectly
• Model 2: a significantly more complex model that reproduces the data without mistakes
• Model 1 is probably preferable.

Page 36: Lecture 8: Machine Learning in Practice (1)

The  End  
