
Page 1: Evaluating What’s Been Learned

Evaluating What’s Been Learned

Page 2: Evaluating What’s Been Learned

Cross-Validation

• The foundation is a simple idea – "holdout" – hold out a certain amount of data for testing and use the rest for training
• The separation should NOT be done by "convenience"
  – It should at least be random
  – Better – "stratified" random – the division preserves the relative proportion of classes in both the training and test data
• Enhancement: repeated holdout – enables using more data for training while still getting a good test
• 10-fold cross-validation has become the standard
• This is improved if the folds are chosen in a "stratified" random way (see the sketch below)
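
Not from the slides, but a minimal sketch of stratified 10-fold cross-validation in Python with scikit-learn; the GaussianNB classifier and the breast-cancer dataset are placeholders chosen only for illustration:

```python
# Stratified 10-fold cross-validation sketch (scikit-learn assumed available).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)   # placeholder dataset

# StratifiedKFold preserves the class proportions in every fold.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(GaussianNB(), X, y, cv=skf)

print("per-fold accuracy:", scores)
print("mean accuracy:     %.3f" % scores.mean())
```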

Page 3: Evaluating What’s Been Learned

For Small Datasets

• Leave One Out

• Bootstrapping

• To be discussed in turn

Page 4: Evaluating What’s Been Learned

Leave One Out

• Train on all but one instance, test on that one (the percent correct on each test is always 100% or 0%)
• Repeat until every instance has been tested on, then average the results
• Really equivalent to N-fold cross-validation where N = number of instances available (see the sketch after this list)
• Pluses:
  – Always trains on the maximum possible training data (without cheating)
  – Efficient in that no repeated runs are needed – since fold contents are not randomized, the result is deterministic
  – No stratification, no random sampling necessary
• Minuses:
  – Guarantees a non-stratified sample – the correct class will always be at least a little bit under-represented in the training data
  – Statistical tests are not appropriate
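
A minimal leave-one-out sketch, again with scikit-learn and a placeholder dataset; note that every fold scores exactly 1.0 or 0.0:

```python
# Leave-one-out cross-validation sketch: N folds, one held-out instance per fold.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)            # placeholder dataset

loo = LeaveOneOut()                          # equivalent to N-fold CV with N = len(X)
scores = cross_val_score(GaussianNB(), X, y, cv=loo)

# Each fold's score is 1.0 or 0.0; the mean is the leave-one-out accuracy.
print("number of folds:", len(scores))
print("LOO accuracy:    %.3f" % scores.mean())
```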

Page 5: Evaluating What’s Been Learned

Bootstrapping

• Sampling is done with replacement to form a training dataset
• Particular approach – the 0.632 bootstrap (see the sketch after this list)
  – A dataset of n instances is sampled n times, with replacement
  – Some instances will be included multiple times
  – Those never picked are used as the test data
  – On a large enough dataset, about 0.632 of the distinct instances will end up in the training dataset; the rest will be in the test set
• This gives a somewhat pessimistic estimate of performance, since only about 63% of the data is used for training (vs. 90% in 10-fold cross-validation)
• May try to balance this by also weighting in the performance on the training data (p. 129) <but this doesn't seem fair>
• The procedure can be repeated any number of times, allowing statistical tests
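
A minimal sketch of one round of the 0.632 bootstrap, with a placeholder dataset and classifier; the 0.632/0.368 weighting in the last lines is the usual form of the p. 129 correction that balances the pessimistic test error with the optimistic training error:

```python
# One round of the 0.632 bootstrap: sample n instances with replacement for
# training, test on the instances that were never sampled.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)   # placeholder dataset
n = len(X)

rng = np.random.default_rng(42)
train_idx = rng.integers(0, n, size=n)                  # sampled with replacement
test_idx = np.setdiff1d(np.arange(n), train_idx)        # never-picked instances

clf = GaussianNB().fit(X[train_idx], y[train_idx])
e_test = 1 - clf.score(X[test_idx], y[test_idx])        # pessimistic error estimate
e_train = 1 - clf.score(X[train_idx], y[train_idx])     # optimistic (resubstitution) error

# 0.632 correction: weight in the (optimistic) training error to offset the
# pessimism of testing on only ~37% of the instances.
err_632 = 0.632 * e_test + 0.368 * e_train
print("share of distinct instances in training: %.3f" % (len(np.unique(train_idx)) / n))
print("0.632 bootstrap error estimate:           %.3f" % err_632)
```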

Page 6: Evaluating What’s Been Learned

Counting the Cost

• Some mistakes are more costly to make than others
• Giving a loan to a defaulter is more costly than denying somebody who would be a good customer
• Sending a mail solicitation to somebody who won't buy is less costly than missing somebody who would buy (opportunity cost)
• Looking at a confusion matrix, each position could have an associated cost (or a benefit for the correct positions)
• The measurement could be average profit/loss per prediction (see the sketch below)
• To be fair, a cost-benefit analysis should also factor in the cost of collecting and preparing the data, building the model, …
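
A minimal sketch of the average profit/loss per prediction idea: attach a cost or benefit to each cell of a confusion matrix and take the weighted average. All counts and dollar figures below are hypothetical, not from the slides:

```python
# Average profit/loss per prediction from a confusion matrix plus a
# cost/benefit matrix (all numbers below are hypothetical).
import numpy as np

# Rows = actual class, columns = predicted class: [good customer, defaulter]
confusion = np.array([[700,  50],    # good customers: approved / denied
                      [ 30, 220]])   # defaulters:     approved / denied

# Benefit (+) or cost (-) of landing in each cell, in dollars.
payoff = np.array([[ 100,  -20],     # profit from a good loan, lost opportunity
                   [-500,    0]])    # loss from a defaulted loan, no cost

total_payoff = (confusion * payoff).sum()
avg_per_prediction = total_payoff / confusion.sum()
print("average profit per prediction: $%.2f" % avg_per_prediction)
```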

Page 7: Evaluating What’s Been Learned

Lift Charts

• In practice, costs are frequently not known
• Decisions may be made by comparing possible scenarios
• Book example – promotional mailing
  – Situation 1 – previous experience predicts that 0.1% of all 1,000,000 households will respond
  – Situation 2 – classifier predicts that 0.4% of the 100,000 most promising households will respond
  – Situation 3 – classifier predicts that 0.2% of the 400,000 most promising households will respond
  – The increase in response rate is the lift (0.4 / 0.1 = 4 in situation 2, compared to sending to all; see the sketch below)
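
A minimal sketch of the lift arithmetic for the three situations above; the response rates and household counts are taken from the slide:

```python
# Lift of each mailing scenario relative to the baseline response rate.
baseline_rate = 0.001            # 0.1% of all 1,000,000 households respond

scenarios = {
    "send to all 1,000,000":      (1_000_000, 0.001),
    "top 100,000 (classifier)":   (  100_000, 0.004),
    "top 400,000 (classifier)":   (  400_000, 0.002),
}

for name, (n_mailed, rate) in scenarios.items():
    lift = rate / baseline_rate
    responders = n_mailed * rate
    print(f"{name}: lift = {lift:.1f}, expected responders = {responders:,.0f}")
```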

Page 8: Evaluating What’s Been Learned

Information Retrieval (IR) Measures

• E.g., given a WWW search, a search engine produces a list of hits that are supposedly relevant
• Which is better?
  – Retrieving 100, of which 40 are actually relevant
  – Retrieving 400, of which 80 are actually relevant
  – It really depends on the costs

Page 9: Evaluating What’s Been Learned

Information Retrieval (IR) Measures

• The IR community has developed 3 measures (a worked sketch follows the list):

  – Recall = (number of documents retrieved that are relevant) / (total number of documents that are relevant)

  – Precision = (number of documents retrieved that are relevant) / (total number of documents that are retrieved)

  – F-measure = (2 * recall * precision) / (recall + precision)
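
A minimal sketch applying the three measures to the retrieval example from the previous slide. The total of 200 relevant documents is an assumed figure (the slides do not give it), added only so that recall can be computed:

```python
# Recall, precision, and F-measure for the two retrieval outcomes above.
# ASSUMPTION: 200 relevant documents exist in total (not stated in the slides).
TOTAL_RELEVANT = 200

def ir_measures(retrieved, relevant_retrieved, total_relevant=TOTAL_RELEVANT):
    recall = relevant_retrieved / total_relevant
    precision = relevant_retrieved / retrieved
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure

for retrieved, relevant in [(100, 40), (400, 80)]:
    r, p, f = ir_measures(retrieved, relevant)
    print(f"retrieved {retrieved:3d}: recall={r:.2f} precision={p:.2f} F={f:.2f}")
```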

Page 10: Evaluating What’s Been Learned

WEKA

• Part of the results provided by WEKA (that we've ignored so far)
• Let's look at an example (Naïve Bayes on my-weather-nominal)

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure Class

0.667 0.125 0.8 0.667 0.727 yes

0.875 0.333 0.778 0.875 0.824 no

=== Confusion Matrix ===

a b <-- classified as

4 2 | a = yes

1 7 | b = no

• TP rate and recall are the same = TP / (TP + FN)
  – For yes = 4 / (4 + 2); for no = 7 / (7 + 1)
• FP rate = FP / (FP + TN)
  – For yes = 1 / (1 + 7); for no = 2 / (2 + 4)
• Precision = TP / (TP + FP)
  – For yes = 4 / (4 + 1); for no = 7 / (7 + 2)
• F-measure = 2TP / (2TP + FP + FN)
  – For yes = 2*4 / (2*4 + 1 + 2) = 8/11
  – For no = 2*7 / (2*7 + 2 + 1) = 14/17
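
A minimal sketch that recomputes the per-class figures above directly from the confusion matrix (plain Python arithmetic, not a WEKA API call):

```python
# Per-class TP rate (recall), FP rate, precision, and F-measure recomputed
# from the 2x2 confusion matrix above (rows = actual, columns = predicted).
conf = [[4, 2],   # actual yes: 4 predicted yes, 2 predicted no
        [1, 7]]   # actual no:  1 predicted yes, 7 predicted no
classes = ["yes", "no"]

for i, cls in enumerate(classes):
    tp = conf[i][i]
    fn = sum(conf[i]) - tp                                # actual cls, predicted otherwise
    fp = sum(conf[j][i] for j in range(len(conf))) - tp   # other classes predicted as cls
    tn = sum(map(sum, conf)) - tp - fn - fp
    recall = tp / (tp + fn)                               # same as TP rate
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * tp / (2 * tp + fp + fn)
    print(f"{cls}: TP rate={recall:.3f} FP rate={fp_rate:.3f} "
          f"precision={precision:.3f} F-measure={f_measure:.3f}")
```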

Page 11: Evaluating What’s Been Learned

In Terms of True Positives, etc.

• True positives = TP; false positives = FP
• True negatives = TN; false negatives = FN
• Recall = TP / (TP + FN) // true positives / actually positive
• Precision = TP / (TP + FP) // true positives / predicted positive
• F-measure = 2TP / (2TP + FP + FN)
  – This is derived by algebra from the previous formula (see the derivation below)
  – It is easier to understand in this form – correct predictions are counted twice, once for the recall idea and once for the precision idea, and the denominator includes both the corrects and the incorrects from either view (relevant but not retrieved, or retrieved but not relevant)
• There is no mathematics that says recall and precision must be combined this way – it is ad hoc – but it does balance the two
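
One way to write out the algebra referred to above, substituting the TP/FP/FN forms of precision and recall into the F-measure formula:

```latex
\[
F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
  = \frac{2 \cdot \frac{TP}{TP+FP} \cdot \frac{TP}{TP+FN}}
         {\frac{TP}{TP+FP} + \frac{TP}{TP+FN}}
  = \frac{2\,TP^2}{TP(TP+FN) + TP(TP+FP)}
  = \frac{2\,TP}{2\,TP + FP + FN}
\]
```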

Page 12: Evaluating What’s Been Learned

WEKA

• For many occasions, this borders on “too much information”, but it’s all there

• We can decide: are we more interested in Yes or in No?

• Are we more interested in recall or precision?

Page 13: Evaluating What’s Been Learned

WEKA – with more than two classes

• Contact Lenses with Naïve Bayes

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure Class

0.8 0.053 0.8 0.8 0.8 soft

0.25 0.1 0.333 0.25 0.286 hard

0.8 0.444 0.75 0.8 0.774 none

=== Confusion Matrix ===

a b c <-- classified as

4 0 1 | a = soft

0 1 3 | b = hard

1 2 12 | c = none

• Class exercise – show how to calculate recall, precision, and F-measure for each class (a sketch for checking the answers follows below)
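
For checking the exercise afterwards, a minimal sketch that recomputes recall, precision, and F-measure for each class from the 3x3 confusion matrix above (the same arithmetic as in the two-class case):

```python
# Recompute per-class recall, precision, and F-measure from the 3x3
# confusion matrix above (rows = actual class, columns = predicted class).
conf = [[4, 0,  1],   # actual soft
        [0, 1,  3],   # actual hard
        [1, 2, 12]]   # actual none
classes = ["soft", "hard", "none"]

for i, cls in enumerate(classes):
    tp = conf[i][i]
    fn = sum(conf[i]) - tp                 # actual cls, predicted otherwise
    fp = sum(row[i] for row in conf) - tp  # other classes predicted as cls
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * tp / (2 * tp + fp + fn)
    print(f"{cls}: recall={recall:.3f} precision={precision:.3f} F-measure={f_measure:.3f}")
```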

Page 14: Evaluating What’s Been Learned

Applying Action Rules to Change Detractor to Passive /Accuracy – Precision, Coverage – Recall/

• Let's assume that we built action rules from the classifiers for Promoter & Detractor. The goal is to change Detractors -> Promoters
• The confidence of the action rule is 0.993 * 0.849 = 0.84
• Our action rule can target only 4.2 (out of 10.2) detractors
• So, we can expect 4.2 * 0.84 = 3.52 detractors moving to promoter status
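
A minimal sketch of the arithmetic above; all figures are taken from the slide:

```python
# Expected number of detractors converted to promoters by the action rule
# (rule confidence and detractor counts are the slide's figures).
rule_confidence = 0.993 * 0.849          # confidence of the composed action rule, ~0.84
targetable = 4.2                         # detractors (out of 10.2) the rule can target

expected = targetable * rule_confidence  # ~3.5 detractors expected to become promoters
print(f"rule confidence ~ {rule_confidence:.2f}, expected conversions ~ {expected:.1f}")
```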