Classification Evaluation

Page 1:

Classification Evaluation

Page 2:

Estimating Future Accuracy

• Given available data, how can we reliably predict accuracy on future, unseen data?

• Three basic approaches
  – Training set
  – Hold-out set (2 variants)
  – Cross-validation

Page 3:

Estimating with Training Set

• Simplest approach
  – Build the model from the training set
  – Compute accuracy on the same training set

• Pros and cons
  – Easy
  – Likely to overestimate (think overfitting)

Page 4:

Estimating with Hold-out Set (1)

• Method 1
  – Two distinct data sets are made available a priori
  – One is used to build the model
  – The other is used to test the model

• Pros and cons
  – No “bias”
  – Not always feasible

Page 5:

Estimating with Hold-out Set (2)

• Method 2
  – Randomly partition the data into a training set and a test set
  – Training set used to train/build the model
  – Test set used to evaluate the model

• Pros and cons
  – Easy
  – Less likely to overfit
  – Reduces the amount of training data

Page 6:

Holding out data

• The holdout method reserves a certain amount of the data for testing and uses the remainder for training
  – Usually: one third for testing, the rest for training

• For “unbalanced” datasets, random samples might not be representative
  – Few or no instances of some classes

• Stratified sample:
  – Make sure that each class is represented with approximately equal proportions in both subsets
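As a concrete illustration, here is a minimal sketch of a stratified holdout split. It assumes Python with scikit-learn and a synthetic, imbalanced dataset; none of these appear in the original slides.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Synthetic, imbalanced two-class problem (roughly 90% / 10%).
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

    # Hold out one third for testing; stratify=y keeps the class proportions
    # approximately equal in both subsets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=0)

    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))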

Page 7:

Repeated holdout method

• The holdout estimate can be made more reliable by repeating the process with different subsamples
  – In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
  – The error rates on the different iterations are averaged to yield an overall error rate

• This is called the repeated holdout method
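A sketch of repeated (stratified) holdout, averaging the error rate over ten random splits; it reuses the synthetic X, y and classifier from the previous sketch, which are illustrative assumptions, not part of the slides.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    errors = []
    for seed in range(10):                      # ten different random subsamples
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=1/3, stratify=y, random_state=seed)
        model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
        errors.append(1 - accuracy_score(y_te, model.predict(X_te)))

    print("repeated-holdout error estimate:", np.mean(errors))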

Page 8:

Cross-validation

• The most popular and effective type of repeated holdout is cross-validation

• Cross-validation avoids overlapping test sets
  – First step: the data is split into k subsets of equal size
  – Second step: each subset in turn is used for testing and the remainder for training

• This is called k-fold cross-validation

• Often the subsets are stratified before the cross-validation is performed
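A minimal sketch of the k-fold mechanics, assuming scikit-learn's StratifiedKFold and the same synthetic X, y as above:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    fold_acc = []
    for train_idx, test_idx in skf.split(X, y):     # each fold is the test set exactly once
        model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
        fold_acc.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

    print("10-fold CV accuracy: %.3f +/- %.3f" % (np.mean(fold_acc), np.std(fold_acc)))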

Page 9:

Cross-validation example:

[Figure: the data set partitioned into k folds; in each round one fold is held out for testing and the remaining k-1 folds are used for training.]

Page 10:

More on cross-validation

• Standard data-mining method for evaluation: stratified ten-fold cross-validation

• Why ten?
  – A good choice to get an accurate estimate

• Stratification reduces the estimate’s variance

• Even better: repeated stratified cross-validation
  – E.g., ten-fold cross-validation is repeated ten times and the results are averaged (reduces the sampling variance)

• The error estimate is the mean across all repetitions
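If scikit-learn is available, repeated stratified ten-fold cross-validation is essentially a one-liner; a sketch, again reusing the X, y defined earlier:

    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                             cv=cv, scoring="accuracy")
    print("error estimate:", 1 - scores.mean())     # mean over all 100 train/test runs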

Page 11:

Leave-One-Out cross-validation

• Leave-One-Out: a particular form of cross-validation
  – Set the number of folds to the number of training instances
  – I.e., for n training instances, build the classifier n times

• Makes the best use of the data
• Involves no random subsampling
• Computationally expensive, but good performance

Page 12:

Leave-One-Out-CV and stratification

• Disadvantage of Leave-One-Out-CV: stratification is not possible
  – It guarantees a non-stratified sample, because there is only one instance in the test set!

• Extreme example: a completely random dataset split equally into two classes
  – The best model predicts the majority class
  – 50% accuracy on fresh data
  – The Leave-One-Out-CV estimate is 100% error! (the held-out instance always belongs to the minority class of the remaining n-1 instances, so the majority-class prediction is always wrong)
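The pathology can be reproduced numerically; a sketch assuming scikit-learn, with random labels and a majority-class learner standing in for the slide’s “best model”:

    import numpy as np
    from sklearn.dummy import DummyClassifier            # always predicts the majority class
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    rng = np.random.default_rng(0)
    X_rand = rng.normal(size=(100, 5))                    # features carry no signal
    y_rand = np.array([0, 1] * 50)                        # exactly balanced classes

    clf = DummyClassifier(strategy="most_frequent")
    loo_acc = cross_val_score(clf, X_rand, y_rand, cv=LeaveOneOut()).mean()
    print("LOO accuracy estimate:", loo_acc)              # 0.0, i.e. 100% error
    # On genuinely fresh data the best achievable accuracy here is only ~50%.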

Page 13:

Three-way Data Splits

• One problem with CV: when the same data are used both to fit/tune the model and to estimate its error, the error estimate can be biased downward.

• If the goal is a realistic estimate of error (as opposed to deciding which model is best), you may want a three-way split (see the sketch below):
  – Training set: examples used for learning
  – Validation set: used to tune parameters
  – Test set: never used in the model-fitting process; used at the end for an unbiased estimate of the hold-out error
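A minimal sketch of such a three-way split via two chained random partitions (scikit-learn assumed; the 60/20/20 proportions and the depth grid are illustrative choices, and X, y are the synthetic data from the earlier sketches):

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Carve off the test set first, then split the rest into train / validation
    # (60/20/20 overall).
    X_trval, X_test, y_trval, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(
        X_trval, y_trval, test_size=0.25, stratify=y_trval, random_state=0)

    # Tune a parameter using the validation set only.
    best_depth, best_acc = None, -1.0
    for depth in (1, 3, 5, 10):
        m = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
        acc = accuracy_score(y_val, m.predict(X_val))
        if acc > best_acc:
            best_depth, best_acc = depth, acc

    # The test set is touched exactly once, at the very end.
    final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_trval, y_trval)
    print("test accuracy (unbiased estimate):", accuracy_score(y_test, final.predict(X_test)))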

(A related approach is nested cross-validation, in which the parameter tuning is itself cross-validated inside each training fold.)

Page 14:

Issues with Accuracy

• Measuring accuracy
  – Is 99% accuracy good? Is 20% accuracy bad?
  – Either can be excellent, good, mediocre, poor, or terrible

• Why?
  – Depends on problem complexity
  – Depends on the base accuracy (i.e., the majority learner); e.g., if 99% of the instances belong to one class, always predicting that class already achieves 99% accuracy
  – Depends on the cost of an error (e.g., ICU monitoring, etc.)

• Problem: accuracy assumes equal cost for all errors

Page 15:

Confusion Matrix

                               Predicted Output
                               1                                    0
  True Output     1    True Positive (TP): Hits             False Negative (FN): Misses
  (Target)        0    False Positive (FP): False Alarms    True Negative (TN): Correct Rejections

Accuracy = (TP+TN)/(TP+TN+FP+FN)

Single number: loses information
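A short sketch computing the four counts and accuracy from a vector of predictions; it assumes scikit-learn's confusion_matrix, whose rows/columns are ordered by label value (0 then 1), and a made-up pair of label vectors:

    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
    y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 1])

    # With labels=[0, 1] the matrix is [[TN, FP], [FN, TP]].
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    print(tp, fp, fn, tn, accuracy)     # 3 1 1 3 0.75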

Page 16:

Precision

(Same confusion matrix as on Page 15.)

Precision = TP/(TP+FP)

The percentage of predicted positives that are actual (target) positives

(of those I predict true, how many are actually true)

Page 17:

Recall

(Same confusion matrix as on Page 15.)

Recall = TP/(TP+FN)

The percentage of actual (target) positives that were predicted as positive

(of those that are actually true, how many do I predict as true?)
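A quick check of both formulas on the toy y_true / y_pred from the confusion-matrix sketch above (scikit-learn assumed):

    from sklearn.metrics import precision_score, recall_score

    # Same toy vectors as before: TP=3, FP=1, FN=1, TN=3.
    print("precision:", precision_score(y_true, y_pred))   # 3 / (3 + 1) = 0.75
    print("recall:   ", recall_score(y_true, y_pred))      # 3 / (3 + 1) = 0.75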

Page 18:

P/R Trade-off (I)

• ICU monitoring:
  – Is precision the goal? Not so much; we would rather not miss anything
  – Recall is the goal: don’t want to miss any event, and would rather err towards accepting some false positives (checking a patient when it is unnecessary) while minimizing false negatives (not checking on a needy patient)

• Google search:
  – Is recall the goal? Not really, because we never get to the millionth page; we would rather get a few very good results early
  – Precision is the goal: don’t want to see irrelevant documents (false positives); can tolerate missing some relevant ones (false negatives), since there are plenty of sites anyway and we don’t need them all

• Trade-off:
  – Easy to maximize precision: only classify the one or few most confident candidates as true
  – Easy to maximize recall: classify everything as true
  – Neither extreme is particularly useful!

Page 19:

P/R Trade-off (II)

[Figure: the complete P/R curve; the breakeven point is defined by P = R.]

Alternatively, the F-measure combines precision and recall into a single number:

F = 2 * Precision * Recall / (Precision + Recall)

Page 20:

Other Measures

• Sensitivity (Recall): TP / (TP + FN)

• Specificity: TN / (TN + FP)

• Positive Predictive Value (Precision): TP / (TP + FP)

• Negative Predictive Value: TN / (TN + FN)
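A sketch computing all four measures directly from the TP/FP/FN/TN counts (plain Python; the counts are the hypothetical ones from the earlier toy example):

    tp, fp, fn, tn = 3, 1, 1, 3        # counts from the earlier toy example

    sensitivity = tp / (tp + fn)       # recall / true positive rate
    specificity = tn / (tn + fp)       # true negative rate
    ppv = tp / (tp + fp)               # precision
    npv = tn / (tn + fn)
    print(sensitivity, specificity, ppv, npv)   # 0.75 0.75 0.75 0.75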

Page 21:

ROC Curves

• Receiver Operating Characteristic curve
  – Developed in WWII to statistically model the false positive and false negative detections of radar operators

• Standard measure in medicine and biology

• Graphs the true positive rate (sensitivity) vs. the false positive rate (1 - specificity)

• Goal: maximize TPR and minimize FPR
  – Max TPR: classify everything as positive
  – Min FPR: classify everything as negative
  – Neither is acceptable, of course!

Page 22:

Several Points in ROC Space

• The lower left point (0, 0) represents the strategy of never issuing a positive classification
  – No false positives, but also no true positives

• The upper right corner (1, 1) represents the opposite strategy of unconditionally issuing positive classifications

• The point (0, 1) represents perfect classification
  – (In the original figure, classifier D sits at this point, so its performance is perfect)

• Informally, one point in ROC space is better than another if it is to the northwest of the first
  – TP rate is higher, FP rate is lower, or both

Page 23:

ROC Curves and AUC (II)

Each point on the ROC curve represents a different tradeoff (cost ratio) between TPR and FPR

AUC is area under the curve: represents performance averaged over all possible cost ratios

Single summary number

Perfect model has AUC = 1.0
Random model has AUC = 0.5
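A sketch of computing the ROC curve and its AUC from predicted scores (scikit-learn assumed; the five scores are the Ppos example that appears on Page 30 below):

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    y_true = np.array([1, 1, 0, 1, 0])                  # pos, pos, neg, pos, neg
    p_pos  = np.array([0.99, 0.98, 0.70, 0.60, 0.43])   # predicted P(pos)

    fpr, tpr, thresholds = roc_curve(y_true, p_pos)
    print("ROC points:", list(zip(fpr, tpr)))
    # AUC = fraction of (pos, neg) pairs ranked correctly = 5/6, about 0.83
    print("AUC:", roc_auc_score(y_true, p_pos))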

Page 24:

Specific Example

[Figure: two overlapping distributions over the test result (x-axis), one for patients with the disease and one for patients without the disease.]

Page 25:

[Figure: a threshold on the test result; patients below it are called “negative”, patients above it are called “positive”.]

Page 26:

Some definitions ...

[Figure: patients with the disease whose test result falls above the threshold are the True Positives.]

Page 27:

[Figure: patients without the disease whose test result falls above the threshold are the False Positives.]

Page 28:

[Figure: patients without the disease whose test result falls below the threshold are the True Negatives.]

Page 29:

[Figure: patients with the disease whose test result falls below the threshold are the False Negatives.]

Page 30:

How to Construct ROC Curve for one Classifier

• Sort the instances according to their Ppos (the predicted probability of the positive class)
• Move a threshold along the sorted instances
• Each threshold defines a classifier with its own confusion matrix
• Plot the TPR and FPR of each such classifier

  Ppos    True Class
  0.99    pos
  0.98    pos
  0.7     neg
  0.6     pos
  0.43    neg

For example, the threshold between 0.6 and 0.7 (classify Ppos >= 0.7 as pos) gives:

              Predicted
              pos    neg
  True  pos     2      1
        neg     1      1
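The same sweep in plain Python (numpy only), using the five instances above; each distinct Ppos value is used in turn as the threshold and the resulting (TPR, FPR) point is printed:

    import numpy as np

    p_pos  = np.array([0.99, 0.98, 0.70, 0.60, 0.43])   # already sorted by Ppos
    y_true = np.array([1, 1, 0, 1, 0])                   # pos = 1, neg = 0

    P, N = y_true.sum(), (1 - y_true).sum()              # total positives and negatives
    for t in p_pos:                                      # one threshold per distinct score
        y_pred = (p_pos >= t).astype(int)
        tp = ((y_pred == 1) & (y_true == 1)).sum()
        fp = ((y_pred == 1) & (y_true == 0)).sum()
        print("threshold %.2f: TPR = %.2f, FPR = %.2f" % (t, tp / P, fp / N))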

Page 31:

ROC Properties

• AUC properties
  – 1.0: perfect prediction
  – 0.9: excellent
  – 0.7: mediocre
  – 0.5: random

• ROC curve properties
  – If two ROC curves do not intersect, then one method dominates the other
  – If they do intersect, then one method is better for some cost ratios and worse for others
    • (In the original figure: the blue algorithm is better for precision, the yellow one for recall, the red one for neither)

• Can choose the method and the balance based on the goals of the task

Page 32:

Lift (I)

• In some situations, we are not interested in accuracy over the entire data set
  – Accurate predictions for 5%, 10%, or 20% of the data
  – Don’t care about the rest

• Prototypical application: direct marketing
  – Baseline: random targeting of the population
  – Can we do better?

• We want to know how much better a targeted offer performs on a fraction of the population

Page 33:

Lift (II)

(Same confusion matrix as on Page 15.)

Lift = [TP / (TP+FP)] / [(TP+FN) / (TP+TN+FP+FN)]

(the positive rate among the instances the model selects, divided by the overall positive rate)

How much better a model is than random predictions

Page 34:

Lift (III)

Lift(t) = CR(t) / t, where CR(t) is the cumulative percentage of all responders captured within the top t% of the model-ranked list

E.g., Lift(25%) = CR(25) / 25 = 62 / 25 ≈ 2.5

If we select 25% of prospects using our model, they are 2.5 times more likely to respond than if we selected them randomly

Can vary t to make decisions (e.g., cost/benefit analysis)
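A sketch of computing Lift(25%) from model scores with numpy; the scores and responses are made-up illustration data, not the 62%-capture example on the slide:

    import numpy as np

    rng = np.random.default_rng(0)
    scores    = rng.random(1000)                                      # model's P(respond) per prospect
    responded = (rng.random(1000) < 0.05 + 0.3 * scores).astype(int)  # higher scores respond more often

    t = 0.25
    top = np.argsort(-scores)[: int(t * len(scores))]    # top 25% of prospects by model score
    cr = responded[top].sum() / responded.sum()          # CR(t): share of all responders captured
    print("Lift(25%%) = %.2f" % (cr / t))                # > 1 means better than random targeting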

Page 35:

Summary

• Several measures
  – Single value vs. a range of thresholds

• Most are restricted to binary classification
  – Could always cast the problem as a set of two-class problems, but that can be inconvenient
  – Accuracy handles multi-class outputs directly

• Key point:
  – The measure you optimize makes a difference
  – The measure you report makes a difference
  – Measure what you want to optimize/report (i.e., use a measure appropriate to the task/domain)