Classification Evaluation

Page 1:

Classification Evaluation

Page 2:

Estimating Future Accuracy

• Given available data, how can we reliably predict accuracy on future, unseen data?

• Three basic approaches
  – Training set
  – Hold-out set (2 variants)
  – Cross-validation

Page 3:

Estimating with Training Set

• Simplest approach
  – Build the model from the training set
  – Compute accuracy on the same training set

• Pros and cons
  – Easy
  – Likely to overestimate (think overfitting)

Page 4:

Estimating with Hold-out Set (1)

• Method 1
  – Two distinct data sets are made available a priori
  – One is used to build the model
  – The other is used to test the model

• Pros and cons
  – No “bias”
  – Not always feasible

Page 5:

Estimating with Hold-out Set (2)

• Method 2
  – Randomly partition the data into a training set and a test set
  – Training set used to train/build the model
  – Test set used to evaluate the model

• Pros and cons
  – Easy
  – Less likely to overfit
  – Reduces the amount of training data

Page 6:

Holding out data

• The holdout method reserves a certain amount of the data for testing and uses the remainder for training
  – Usually: one third for testing, the rest for training

• For “unbalanced” datasets, random samples might not be representative
  – Few or no instances of some classes

• Stratified sample:
  – Make sure that each class is represented with approximately equal proportions in both subsets
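As a concrete illustration, here is a minimal sketch of a stratified holdout split. It assumes Python with scikit-learn and a synthetic, imbalanced dataset; none of these appear in the original slides.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Synthetic, imbalanced two-class problem (roughly 90% / 10%).
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

    # Hold out one third for testing; stratify=y keeps the class proportions
    # approximately equal in both subsets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=0)

    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))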

Page 7:

Repeated holdout method

• The holdout estimate can be made more reliable by repeating the process with different subsamples
  – In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
  – The error rates on the different iterations are averaged to yield an overall error rate

• This is called the repeated holdout method
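A sketch of repeated (stratified) holdout, averaging the error rate over ten random splits; it reuses the synthetic X, y and classifier from the previous sketch, which are illustrative assumptions, not part of the slides.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    errors = []
    for seed in range(10):                      # ten different random subsamples
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=1/3, stratify=y, random_state=seed)
        model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
        errors.append(1 - accuracy_score(y_te, model.predict(X_te)))

    print("repeated-holdout error estimate:", np.mean(errors))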

Page 8:

Cross-validation

• The most popular and effective type of repeated holdout is cross-validation

• Cross-validation avoids overlapping test sets
  – First step: the data is split into k subsets of equal size
  – Second step: each subset in turn is used for testing and the remainder for training

• This is called k-fold cross-validation

• Often the subsets are stratified before the cross-validation is performed
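A minimal sketch of the k-fold mechanics, assuming scikit-learn's StratifiedKFold and the same synthetic X, y as above:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    fold_acc = []
    for train_idx, test_idx in skf.split(X, y):     # each fold is the test set exactly once
        model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
        fold_acc.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

    print("10-fold CV accuracy: %.3f +/- %.3f" % (np.mean(fold_acc), np.std(fold_acc)))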

Page 9:

Cross-validation example:

[Figure: the data set partitioned into k folds; in each round one fold is held out for testing and the remaining k-1 folds are used for training.]

Page 10:

More on cross-validation

• Standard data-mining method for evaluation: stratified ten-fold cross-validation

• Why ten?
  – A good choice to get an accurate estimate

• Stratification reduces the estimate’s variance

• Even better: repeated stratified cross-validation
  – E.g., ten-fold cross-validation is repeated ten times and the results are averaged (reduces the sampling variance)

• The error estimate is the mean across all repetitions
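If scikit-learn is available, repeated stratified ten-fold cross-validation is essentially a one-liner; a sketch, again reusing the X, y defined earlier:

    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                             cv=cv, scoring="accuracy")
    print("error estimate:", 1 - scores.mean())     # mean over all 100 train/test runs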

Page 11:

Leave-One-Out cross-validation

• Leave-One-Out: a particular form of cross-validation
  – Set the number of folds to the number of training instances
  – I.e., for n training instances, build the classifier n times

• Makes the best use of the data
• Involves no random subsampling
• Computationally expensive, but good performance

Page 12:

Leave-One-Out-CV and stratification

• Disadvantage of Leave-One-Out-CV: stratification is not possible
  – It guarantees a non-stratified sample, because there is only one instance in the test set!

• Extreme example: a completely random dataset split equally into two classes
  – The best model predicts the majority class
  – 50% accuracy on fresh data
  – The Leave-One-Out-CV estimate is 100% error! (the held-out instance always belongs to the minority class of the remaining n-1 instances, so the majority-class prediction is always wrong)
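The pathology can be reproduced numerically; a sketch assuming scikit-learn, with random labels and a majority-class learner standing in for the slide’s “best model”:

    import numpy as np
    from sklearn.dummy import DummyClassifier            # always predicts the majority class
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    rng = np.random.default_rng(0)
    X_rand = rng.normal(size=(100, 5))                    # features carry no signal
    y_rand = np.array([0, 1] * 50)                        # exactly balanced classes

    clf = DummyClassifier(strategy="most_frequent")
    loo_acc = cross_val_score(clf, X_rand, y_rand, cv=LeaveOneOut()).mean()
    print("LOO accuracy estimate:", loo_acc)              # 0.0, i.e. 100% error
    # On genuinely fresh data the best achievable accuracy here is only ~50%.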

Page 13:

Three-way Data Splits

• One problem with CV: when the same data are used both to fit/tune the model and to estimate its error, the error estimate can be biased downward.

• If the goal is a realistic estimate of error (as opposed to deciding which model is best), you may want a three-way split (see the sketch below):
  – Training set: examples used for learning
  – Validation set: used to tune parameters
  – Test set: never used in the model-fitting process; used at the end for an unbiased estimate of the hold-out error
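A minimal sketch of such a three-way split via two chained random partitions (scikit-learn assumed; the 60/20/20 proportions and the depth grid are illustrative choices, and X, y are the synthetic data from the earlier sketches):

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Carve off the test set first, then split the rest into train / validation
    # (60/20/20 overall).
    X_trval, X_test, y_trval, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(
        X_trval, y_trval, test_size=0.25, stratify=y_trval, random_state=0)

    # Tune a parameter using the validation set only.
    best_depth, best_acc = None, -1.0
    for depth in (1, 3, 5, 10):
        m = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
        acc = accuracy_score(y_val, m.predict(X_val))
        if acc > best_acc:
            best_depth, best_acc = depth, acc

    # The test set is touched exactly once, at the very end.
    final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_trval, y_trval)
    print("test accuracy (unbiased estimate):", accuracy_score(y_test, final.predict(X_test)))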

(A related approach is nested cross-validation, in which the parameter tuning is itself cross-validated inside each training fold.)

Page 14:

Issues with Accuracy

• Measuring accuracy
  – Is 99% accuracy good? Is 20% accuracy bad?
  – Either can be excellent, good, mediocre, poor, or terrible

• Why?
  – Depends on problem complexity
  – Depends on the base accuracy (i.e., the majority learner); e.g., if 99% of the instances belong to one class, always predicting that class already achieves 99% accuracy
  – Depends on the cost of an error (e.g., ICU monitoring, etc.)

• Problem: accuracy assumes equal cost for all errors

Page 15:

Confusion Matrix

                               Predicted Output
                               1                                    0
  True Output     1    True Positive (TP): Hits             False Negative (FN): Misses
  (Target)        0    False Positive (FP): False Alarms    True Negative (TN): Correct Rejections

Accuracy = (TP+TN)/(TP+TN+FP+FN)

Single number: loses information
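A short sketch computing the four counts and accuracy from a vector of predictions; it assumes scikit-learn's confusion_matrix, whose rows/columns are ordered by label value (0 then 1), and a made-up pair of label vectors:

    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
    y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 1])

    # With labels=[0, 1] the matrix is [[TN, FP], [FN, TP]].
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    print(tp, fp, fn, tn, accuracy)     # 3 1 1 3 0.75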

Page 16:

Precision

(Same confusion matrix as on Page 15.)

Precision = TP/(TP+FP)

The percentage of predicted positives that are actual (target) positives

(of those I predict true, how many are actually true)

Page 17:

Recall

(Same confusion matrix as on Page 15.)

Recall = TP/(TP+FN)

The percentage of actual (target) positives that were predicted as positive

(of those that are actually true, how many do I predict as true?)
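A quick check of both formulas on the toy y_true / y_pred from the confusion-matrix sketch above (scikit-learn assumed):

    from sklearn.metrics import precision_score, recall_score

    # Same toy vectors as before: TP=3, FP=1, FN=1, TN=3.
    print("precision:", precision_score(y_true, y_pred))   # 3 / (3 + 1) = 0.75
    print("recall:   ", recall_score(y_true, y_pred))      # 3 / (3 + 1) = 0.75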

Page 18:

P/R Trade-off (I)

• ICU monitoring:
  – Is precision the goal? Not so much; we would rather not miss anything
  – Recall is the goal: don’t want to miss any event, and would rather err towards accepting some false positives (checking a patient when it is unnecessary) while minimizing false negatives (not checking on a needy patient)

• Google search:
  – Is recall the goal? Not really, because we never get to the millionth page; we would rather get a few very good results early
  – Precision is the goal: don’t want to see irrelevant documents (false positives); can tolerate missing some relevant ones (false negatives), since there are plenty of sites anyway and we don’t need them all

• Trade-off:
  – Easy to maximize precision: only classify the one or few most confident candidates as true
  – Easy to maximize recall: classify everything as true
  – Neither extreme is particularly useful!

Page 19:

P/R Trade-off (II)

[Figure: the complete P/R curve; the breakeven point is defined by P = R.]

Alternatively, the F-measure combines precision and recall into a single number:

F = 2 * Precision * Recall / (Precision + Recall)

Page 20:

Other Measures

• Sensitivity (Recall): TP / (TP + FN)

• Specificity: TN / (TN + FP)

• Positive Predictive Value (Precision): TP / (TP + FP)

• Negative Predictive Value: TN / (TN + FN)
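A sketch computing all four measures directly from the TP/FP/FN/TN counts (plain Python; the counts are the hypothetical ones from the earlier toy example):

    tp, fp, fn, tn = 3, 1, 1, 3        # counts from the earlier toy example

    sensitivity = tp / (tp + fn)       # recall / true positive rate
    specificity = tn / (tn + fp)       # true negative rate
    ppv = tp / (tp + fp)               # precision
    npv = tn / (tn + fn)
    print(sensitivity, specificity, ppv, npv)   # 0.75 0.75 0.75 0.75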

Page 21:

ROC Curves

• Receiver Operating Characteristic curve
  – Developed in WWII to statistically model the false positive and false negative detections of radar operators

• Standard measure in medicine and biology

• Graphs the true positive rate (sensitivity) vs. the false positive rate (1 - specificity)

• Goal: maximize TPR and minimize FPR
  – Max TPR: classify everything as positive
  – Min FPR: classify everything as negative
  – Neither is acceptable, of course!

Page 22:

Several Points in ROC Space

• The lower left point (0, 0) represents the strategy of never issuing a positive classification
  – No false positives, but also no true positives

• The upper right corner (1, 1) represents the opposite strategy of unconditionally issuing positive classifications

• The point (0, 1) represents perfect classification
  – (In the original figure, classifier D sits at this point, so its performance is perfect)

• Informally, one point in ROC space is better than another if it is to the northwest of the first
  – TP rate is higher, FP rate is lower, or both

Page 23:

ROC Curves and AUC (II)

Each point on the ROC curve represents a different tradeoff (cost ratio) between TPR and FPR

AUC is area under the curve: represents performance averaged over all possible cost ratios

Single summary number

Perfect model has AUC = 1.0
Random model has AUC = 0.5
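A sketch of computing the ROC curve and its AUC from predicted scores (scikit-learn assumed; the five scores are the Ppos example that appears on Page 30 below):

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    y_true = np.array([1, 1, 0, 1, 0])                  # pos, pos, neg, pos, neg
    p_pos  = np.array([0.99, 0.98, 0.70, 0.60, 0.43])   # predicted P(pos)

    fpr, tpr, thresholds = roc_curve(y_true, p_pos)
    print("ROC points:", list(zip(fpr, tpr)))
    # AUC = fraction of (pos, neg) pairs ranked correctly = 5/6, about 0.83
    print("AUC:", roc_auc_score(y_true, p_pos))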

Page 24:

Specific Example

[Figure: two overlapping distributions over the test result (x-axis), one for patients with the disease and one for patients without the disease.]

Page 25:

[Figure: a threshold on the test result; patients below it are called “negative”, patients above it are called “positive”.]

Page 26:

Some definitions ...

[Figure: patients with the disease whose test result falls above the threshold are the True Positives.]

Page 27:

[Figure: patients without the disease whose test result falls above the threshold are the False Positives.]

Page 28:

[Figure: patients without the disease whose test result falls below the threshold are the True Negatives.]

Page 29:

[Figure: patients with the disease whose test result falls below the threshold are the False Negatives.]

Page 30:

How to Construct ROC Curve for one Classifier

• Sort the instances according to their Ppos (the predicted probability of the positive class)
• Move a threshold along the sorted instances
• Each threshold defines a classifier with its own confusion matrix
• Plot the TPR and FPR of each such classifier

  Ppos    True Class
  0.99    pos
  0.98    pos
  0.7     neg
  0.6     pos
  0.43    neg

For example, the threshold between 0.6 and 0.7 (classify Ppos >= 0.7 as pos) gives:

              Predicted
              pos    neg
  True  pos     2      1
        neg     1      1
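The same sweep in plain Python (numpy only), using the five instances above; each distinct Ppos value is used in turn as the threshold and the resulting (TPR, FPR) point is printed:

    import numpy as np

    p_pos  = np.array([0.99, 0.98, 0.70, 0.60, 0.43])   # already sorted by Ppos
    y_true = np.array([1, 1, 0, 1, 0])                   # pos = 1, neg = 0

    P, N = y_true.sum(), (1 - y_true).sum()              # total positives and negatives
    for t in p_pos:                                      # one threshold per distinct score
        y_pred = (p_pos >= t).astype(int)
        tp = ((y_pred == 1) & (y_true == 1)).sum()
        fp = ((y_pred == 1) & (y_true == 0)).sum()
        print("threshold %.2f: TPR = %.2f, FPR = %.2f" % (t, tp / P, fp / N))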

Page 31:

ROC Properties

• AUC properties
  – 1.0: perfect prediction
  – 0.9: excellent
  – 0.7: mediocre
  – 0.5: random

• ROC curve properties
  – If two ROC curves do not intersect, then one method dominates the other
  – If they do intersect, then one method is better for some cost ratios and worse for others
    • (In the original figure: the blue algorithm is better for precision, the yellow one for recall, the red one for neither)

• Can choose the method and the balance based on the goals of the task

Page 32:

Lift (I)

• In some situations, we are not interested in accuracy over the entire data set
  – Accurate predictions for 5%, 10%, or 20% of the data
  – Don’t care about the rest

• Prototypical application: direct marketing
  – Baseline: random targeting of the population
  – Can we do better?

• We want to know how much better a targeted offer performs on a fraction of the population

Page 33:

Lift (II)

(Same confusion matrix as on Page 15.)

Lift = [TP / (TP+FP)] / [(TP+FN) / (TP+TN+FP+FN)]

(the positive rate among the instances the model selects, divided by the overall positive rate)

How much better a model is than random predictions

Page 34:

Lift (III)

Lift(t) = CR(t) / t, where CR(t) is the cumulative percentage of all responders captured within the top t% of the model-ranked list

E.g., Lift(25%) = CR(25) / 25 = 62 / 25 ≈ 2.5

If we select 25% of prospects using our model, they are 2.5 times more likely to respond than if we selected them randomly

Can vary t to make decisions (e.g., cost/benefit analysis)
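A sketch of computing Lift(25%) from model scores with numpy; the scores and responses are made-up illustration data, not the 62%-capture example on the slide:

    import numpy as np

    rng = np.random.default_rng(0)
    scores    = rng.random(1000)                                      # model's P(respond) per prospect
    responded = (rng.random(1000) < 0.05 + 0.3 * scores).astype(int)  # higher scores respond more often

    t = 0.25
    top = np.argsort(-scores)[: int(t * len(scores))]    # top 25% of prospects by model score
    cr = responded[top].sum() / responded.sum()          # CR(t): share of all responders captured
    print("Lift(25%%) = %.2f" % (cr / t))                # > 1 means better than random targeting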

Page 35:

Summary

• Several measures
  – Single value vs. a range of thresholds

• Most are restricted to binary classification
  – Could always cast the problem as a set of two-class problems, but that can be inconvenient
  – Accuracy handles multi-class outputs directly

• Key point:
  – The measure you optimize makes a difference
  – The measure you report makes a difference
  – Measure what you want to optimize/report (i.e., use a measure appropriate to the task/domain)