Lucila Ohno-Machado An introduction to calibration and discrimination methods HST951 Medical Decision Support Harvard Medical School Massachusetts Institute of Technology Evaluation of Medical Systems

Evaluation of Medical Systems


Page 1: Evaluation of Medical Systems

Lucila Ohno-Machado

An introduction to calibration and discrimination methods

HST951 Medical Decision Support

Harvard Medical School

Massachusetts Institute of Technology

Evaluation of Medical Systems

Page 2: Evaluation of Medical Systems

Main Concepts

• Example of System

• Discrimination

– Sensitivity, specificity, PPV, NPV, accuracy, ROC curves

• Calibration

– Calibration curves

– Hosmer and Lemeshow goodness of fit

Page 3: Evaluation of Medical Systems

Example

Page 4: Evaluation of Medical Systems

Comparison of Practical Prediction Models for

Ambulation Following Spinal Cord Injury

Todd Rowland, M.D.

Decision Systems Group

Brigham and Women’s Hospital

Harvard Medical School

Page 5: Evaluation of Medical Systems

Study Rationale

• Patient’s most common question: “Will I walk again?”

• The study compared logistic regression, neural network, and rough sets models that predict ambulation at discharge from information available at admission for individuals with acute spinal cord injury.

• Create simple models with good performance

• 762 cases in the training set

• 376 cases in the test set

• Univariate statistics compared (means)

Page 6: Evaluation of Medical Systems

SCI Ambulation NN Design: Inputs & Output

Admission Info (9 items)

– system days
– injury days
– age
– gender
– racial/ethnic group
– level of neurologic function
– ASIA impairment index
– UEMS
– LEMS

Ambulation (1 item)

– yes / no

Page 7: Evaluation of Medical Systems

Evaluation Indices

Page 8: Evaluation of Medical Systems

General indices

• Brier score (a.k.a. mean squared error)

$$\text{Brier} = \frac{1}{n} \sum_{i=1}^{n} (e_i - o_i)^2$$

e = estimate (e.g., 0.2)

o = observation (0 or 1)

n = number of cases
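The definition above can be sketched in a few lines of Python. This is a minimal illustration, not code from the study; the function name `brier_score` and the three example cases are assumptions made up for the sketch.

```python
def brier_score(estimates, outcomes):
    """Mean squared difference between estimates e_i and 0/1 observations o_i."""
    if len(estimates) != len(outcomes) or not estimates:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(e - o) ** 2 for e, o in zip(estimates, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical cases (not from the study): estimates and observed outcomes.
example_estimates = [0.2, 0.9, 0.6]
example_outcomes = [0, 1, 1]
example_brier = brier_score(example_estimates, example_outcomes)

# A system that always answers 0.5 scores 0.25 whatever the outcomes are.
coin_flip_brier = brier_score([0.5, 0.5, 0.5, 0.5], [0, 1, 1, 0])
```

Lower is better: 0 means every estimate matched its outcome exactly, and 1 means every estimate was confidently wrong.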

Page 9: Evaluation of Medical Systems

Decompositions of Brier score

• Murphy

Brier = d(1-d) + CI - DI

where d = prior probability

CI = calibration index

DI = discrimination index
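A small numeric check can make the decomposition concrete. The sketch below assumes the usual reading of Murphy's terms: CI is the reliability term (0 when every group of identical estimates matches its observed frequency, so smaller is better) and DI is the resolution term (larger is better), which gives Brier = d(1-d) + CI - DI. Cases are grouped by identical estimate values; the function name and the five example cases are hypothetical.

```python
from collections import defaultdict


def murphy_decomposition(estimates, outcomes):
    """Return (d(1-d), CI, DI), grouping cases that share the same estimate."""
    n = len(estimates)
    d = sum(outcomes) / n  # prior probability (base rate)
    groups = defaultdict(list)
    for e, o in zip(estimates, outcomes):
        groups[e].append(o)
    # CI: weighted squared gap between each estimate and its observed frequency.
    ci = sum(len(obs) * (e - sum(obs) / len(obs)) ** 2
             for e, obs in groups.items()) / n
    # DI: weighted squared gap between each observed frequency and the base rate.
    di = sum(len(obs) * (sum(obs) / len(obs) - d) ** 2
             for obs in groups.values()) / n
    return d * (1 - d), ci, di


# Hypothetical cases, not taken from the study.
estimates = [0.2, 0.2, 0.8, 0.8, 0.8]
outcomes = [0, 1, 1, 1, 0]
uncertainty, ci, di = murphy_decomposition(estimates, outcomes)
brier = sum((e - o) ** 2 for e, o in zip(estimates, outcomes)) / len(estimates)
```

With this grouping the identity holds exactly; when estimates are binned into ranges instead, it holds only approximately.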

Page 10: Evaluation of Medical Systems

Calibration and Discrimination

• Calibration measures how close the estimates are to a “real” probability

• Discrimination measures how much the system can discriminate between cases with gold standard ‘1’ and gold standard ‘0’

• “If the system is good in discrimination, calibration may be fixed”

Page 11: Evaluation of Medical Systems

Discrimination Indices

Page 12: Evaluation of Medical Systems

Discrimination

• The system can “somehow” differentiate between cases in different categories

• Special cases:

– diagnosis (differentiate sick and healthy individuals)

– prognosis (differentiate poor and good outcomes)

Page 13: Evaluation of Medical Systems

Discrimination of Binary Outcomes

• Real outcome is 0 or 1, estimated outcome is usually a number between 0 and 1 (e.g., 0.34) or a rank

• Based on Thresholded Results (e.g., if output > 0.5 then consider “positive”)

• Discrimination is based on:

– sensitivity

– specificity

– PPV, NPV

– accuracy

– area under ROC curve

Page 14: Evaluation of Medical Systems

Notation

D means disease present

“D” means system says disease is present

nl means disease absent

“nl” means system says disease is absent

TP means true positive

FP means false positive

TN means true negative

FN means false negative

Page 15: Evaluation of Medical Systems

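[Figure: overlapping distributions of a test value for normal and diseased cases on an axis from 1.0 to 3.0, with a threshold at 1.7. The regions on either side of the threshold are labeled True Negative (TN), FN, FP, and True Positive (TP).]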

Page 16: Evaluation of Medical Systems

Prevalence = P(D) = (FN + TP) / all

                 nl      D
System “nl”      TN      FN
System “D”       FP      TP

Page 17: Evaluation of Medical Systems

• Specificity

Spec = TN / (TN + FP)

Page 18: Evaluation of Medical Systems

• Positive Predictive Value

PPV = TP / (TP + FP)

Page 19: Evaluation of Medical Systems

• Negative Predictive Value

NPV = TN / (TN + FN)

Page 20: Evaluation of Medical Systems

• Accuracy

Accuracy = (TN + TP) / all

Page 21: Evaluation of Medical Systems

• Sensitivity: Sens = TP / (TP + FN)

• Specificity: Spec = TN / (TN + FP)

• Positive Predictive Value: PPV = TP / (TP + FP)

• Negative Predictive Value: NPV = TN / (TN + FN)

• Accuracy = (TN + TP) / all

                 nl      D
System “nl”      TN      FN
System “D”       FP      TP

Page 22: Evaluation of Medical Systems

Sens = TP / (TP + FN) = 40/50 = .8

Spec = TN / (TN + FP) = 30/50 = .6

PPV = TP / (TP + FP) = 40/60 = .67

NPV = TN / (TN + FN) = 30/40 = .75

Accuracy = (TN + TP) / all = 70/100 = .7

                 nl      D
System “nl”      30      10
System “D”       20      40
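The five indices can be computed directly from the four cell counts. The following is a minimal sketch using the counts from this slide (TP = 40, FP = 20, TN = 30, FN = 10); the function name `two_by_two_indices` is an assumption made for the illustration.

```python
def two_by_two_indices(tp, fp, tn, fn):
    """Thresholded discrimination indices from the cells of a 2x2 table."""
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn),   # among cases with disease
        "specificity": tn / (tn + fp),   # among cases without disease
        "ppv": tp / (tp + fp),           # among cases the system calls "D"
        "npv": tn / (tn + fn),           # among cases the system calls "nl"
        "accuracy": (tp + tn) / total,
        "prevalence": (tp + fn) / total,
    }


indices = two_by_two_indices(tp=40, fp=20, tn=30, fn=10)
for name, value in indices.items():
    print(f"{name}: {value:.2f}")
```

Sensitivity and specificity are computed within the columns of the table, while PPV and NPV are computed within the rows, so the latter two change when prevalence changes.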

Page 23: Evaluation of Medical Systems

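[Figure: distributions of the test value for nl and disease cases on an axis from 1.0 to 3.0, with the threshold at 1.4; regions labeled TN, FP, and TP.]

                 nl      D      Total
System “nl”      40      0      40
System “D”       10      50     60
Total            50      50

Sensitivity = 50/50 = 1

Specificity = 40/50 = 0.8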

Page 24: Evaluation of Medical Systems

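[Figure: distributions of the test value for nl and disease cases on an axis from 1.0 to 3.0, with the threshold at 1.7; regions labeled FN, TN, FP, and TP.]

                 nl      D      Total
System “nl”      40      10     50
System “D”       10      40     50
Total            50      50

Sensitivity = 40/50 = .8

Specificity = 40/50 = .8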

Page 25: Evaluation of Medical Systems

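[Figure: distributions of the test value for nl and disease cases on an axis from 1.0 to 3.0, with the threshold at 2.0; regions labeled FN, TN, and TP.]

                 nl      D      Total
System “nl”      50      20     70
System “D”       0       30     30
Total            50      50

Sensitivity = 30/50 = .6

Specificity = 50/50 = 1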

Page 26: Evaluation of Medical Systems

ROC curve

[Figure: ROC curve plotting sensitivity (y-axis, 0 to 1) against 1 - specificity (x-axis, 0 to 1), with one point for each of the three thresholds.]

Threshold    Sensitivity    1 - Specificity
1.4          1.0            0.2
1.7          0.8            0.2
2.0          0.6            0.0

Page 27: Evaluation of Medical Systems

ROC curve

[Figure: ROC curve of sensitivity (y-axis, 0 to 1) against 1 - specificity (x-axis, 0 to 1), traced using all thresholds.]

Page 28: Evaluation of Medical Systems

[Figure: ROC plot of sensitivity (0 to 1) against 1 - specificity (0 to 1). The 45-degree line corresponds to no discrimination.]

Page 29: Evaluation of Medical Systems

[Figure: ROC plot of sensitivity (0 to 1) against 1 - specificity (0 to 1). The 45-degree line corresponds to no discrimination.]

Area under ROC: 0.5

Page 30: Evaluation of Medical Systems

[Figure: ROC plot of sensitivity (0 to 1) against 1 - specificity (0 to 1), showing the curve for perfect discrimination.]

Page 31: Evaluation of Medical Systems

[Figure: ROC plot of sensitivity (0 to 1) against 1 - specificity (0 to 1), showing the curve for perfect discrimination.]

Area under ROC: 1

Page 32: Evaluation of Medical Systems

ROC curve

[Figure: ROC curve of sensitivity (y-axis, 0 to 1) against 1 - specificity (x-axis, 0 to 1).]

Area = 0.86

Page 33: Evaluation of Medical Systems

What is the area under the ROC?

• An estimate of the discriminatory performance of the system

– the real outcome is binary, and systems’ estimates are continuous (0 to 1)

– all thresholds are considered

• NOT an estimate of how many times the system will give the “right” answer

Page 34: Evaluation of Medical Systems

Interpretation of the Area

Systems’ estimates for

• Healthy (real outcome 0)

0.3

0.2

0.5

0.1

0.7

• Sick (real outcome 1)

0.8

0.2

0.5

0.7

0.9
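To see how "all thresholds are considered", the ten estimates listed above can be swept through every distinct threshold and the resulting ROC points integrated with the trapezoid rule. This is a sketch under one stated assumption: a case is called positive when its estimate is greater than or equal to the threshold. The function names are hypothetical.

```python
def roc_points(healthy, sick):
    """(1 - specificity, sensitivity) points, one per distinct threshold."""
    thresholds = sorted(set(healthy) | set(sick), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every estimate: nothing is positive
    for t in thresholds:
        sensitivity = sum(s >= t for s in sick) / len(sick)
        false_positive_rate = sum(h >= t for h in healthy) / len(healthy)
        points.append((false_positive_rate, sensitivity))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


healthy = [0.3, 0.2, 0.5, 0.1, 0.7]  # real outcome 0
sick = [0.8, 0.2, 0.5, 0.7, 0.9]     # real outcome 1
points = roc_points(healthy, sick)
area = trapezoid_area(points)
print(round(area, 2))
```

For these ten estimates the area comes out at 0.78. No single threshold is privileged, which is why the area says nothing about how often a particular thresholded answer is right.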

Page 35: Evaluation of Medical Systems

All possible pairs 0-1

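Systems’ estimates for

• Healthy: 0.3, 0.2, 0.5, 0.1, 0.7

• Sick: 0.8, 0.2, 0.5, 0.7, 0.9

Pairing the healthy estimate 0.3 with each sick estimate:

Healthy    Sick    Pair
0.3        0.8     concordant (0.3 < 0.8)
0.3        0.2     discordant
0.3        0.5     concordant
0.3        0.7     concordant
0.3        0.9     concordant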

Page 36: Evaluation of Medical Systems

All possible pairs 0-1

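Systems’ estimates for

• Healthy: 0.3, 0.2, 0.5, 0.1, 0.7

• Sick: 0.8, 0.2, 0.5, 0.7, 0.9

Pairing the healthy estimate 0.2 with each sick estimate:

Healthy    Sick    Pair
0.2        0.8     concordant
0.2        0.2     tie
0.2        0.5     concordant
0.2        0.7     concordant
0.2        0.9     concordant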

Page 37: Evaluation of Medical Systems

C - index

• Concordant: 18

• Discordant: 4

• Ties: 3

$$\text{C-index} = \frac{\text{Concordant} + \frac{1}{2}\,\text{Ties}}{\text{All pairs}} = \frac{18 + 1.5}{25} = 0.78$$
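The pair counts can be reproduced by enumerating all 25 healthy-sick pairs. The sketch below uses the ten estimates from the preceding slides; the function name `c_index` is an assumption made for the illustration.

```python
def c_index(healthy, sick):
    """Count concordant, discordant, and tied 0-1 pairs and return the c-index."""
    concordant = discordant = ties = 0
    for h in healthy:
        for s in sick:
            if s > h:
                concordant += 1   # sick case received the higher estimate
            elif s < h:
                discordant += 1   # healthy case received the higher estimate
            else:
                ties += 1
    all_pairs = len(healthy) * len(sick)
    return concordant, discordant, ties, (concordant + 0.5 * ties) / all_pairs


healthy = [0.3, 0.2, 0.5, 0.1, 0.7]
sick = [0.8, 0.2, 0.5, 0.7, 0.9]
concordant, discordant, ties, c = c_index(healthy, sick)
print(concordant, discordant, ties, round(c, 2))
```

The double loop is quadratic in the number of cases, which is fine for an illustration; rank-based formulas are the usual choice for large data sets.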

Page 38: Evaluation of Medical Systems

All possible pairs 0-1

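Systems’ estimates for

• Healthy: 0.3, 0.2, 0.5, 0.1, 0.7

• Sick: 0.8, 0.2, 0.5, 0.7, 0.9

Pairing the healthy estimate 0.3 with each sick estimate:

Healthy    Sick    Pair
0.3        0.8     concordant (0.3 < 0.8)
0.3        0.2     discordant
0.3        0.5     concordant
0.3        0.7     concordant
0.3        0.9     concordant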

Page 39: Evaluation of Medical Systems

Related statistics

• Mann-Whitney U

$$p(x < y) = \frac{U(x < y)}{mn}$$

where m is the number of real ‘0’s (e.g., deaths) and n is the number of real ‘1’s (e.g., survivals)

• Wilcoxon rank test
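The link to the rank statistics can be checked numerically: compute U from the rank sum of the real ‘1’s, using average ranks for ties, and divide by mn. This sketch reuses the ten estimates from the c-index slides; the helper names are hypothetical, and average ranks for ties are an assumption chosen so that a tie counts as half a pair.

```python
def midranks(values):
    """Ranks 1..N in ascending order; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def mann_whitney_u(zeros, ones):
    """U statistic for the real '1's from their rank sum in the pooled sample."""
    ranks = midranks(list(zeros) + list(ones))
    rank_sum_ones = sum(ranks[len(zeros):])
    n = len(ones)
    return rank_sum_ones - n * (n + 1) / 2


zeros = [0.3, 0.2, 0.5, 0.1, 0.7]  # estimates for real '0's
ones = [0.8, 0.2, 0.5, 0.7, 0.9]   # estimates for real '1's
u = mann_whitney_u(zeros, ones)
p = u / (len(zeros) * len(ones))
```

Here U is 19.5, the same numerator as concordant plus half the ties, so p equals the c-index of 0.78.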

Page 40: Evaluation of Medical Systems

Calibration Indices

Page 41: Evaluation of Medical Systems

Calibration

• System can reliably estimate probability of

– a diagnosis

– a prognosis

• Probability is close to the “real” probability for similar cases

Page 42: Evaluation of Medical Systems

What is the problem?

• Binary events are YES/NO (0/1) i.e., probabilities are 0 or 1 for a given individual

• Some models produce continuous (or quasi-continuous) estimates for the binary events

• Example:

– Database of patients with spinal cord injury, and a model that predicts whether a patient will ambulate or not at hospital discharge

– Event is 0: doesn’t walk or 1: walks

– Models produce a probability that patient will walk: 0.05, 0.10, ...

Page 43: Evaluation of Medical Systems

How close are the estimates to the “true” probability for a patient?

• “True” probability can be interpreted as probability within a set of similar patients

• What are similar patients?

– Patients who get similar scores from models

– How to define boundaries for similarity?

• Examples out there in the real world...

– APACHE III for ICU outcomes

– ACI-TIPI for diagnosis of MI

Page 44: Evaluation of Medical Systems

Calibration

Sort systems’ estimates, group, sum

Estimate    Outcome
0.1         0
0.2         0
0.2         1        sum of estimates = 0.5, sum of outcomes = 1

0.3         0
0.5         0
0.5         1        sum of estimates = 1.3, sum of outcomes = 1

0.7         0
0.7         1
0.8         1
0.9         1        sum of estimates = 3.1, sum of outcomes = 3
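The sort, group, and sum steps can be sketched as follows, using the ten cases from this slide. The group sizes (3, 3, 4) are taken from the slide; the function name `grouped_sums` and the choice to pass group sizes explicitly are assumptions for the illustration.

```python
def grouped_sums(estimates, outcomes, group_sizes):
    """Sort cases by estimate, split into consecutive groups, sum each group."""
    cases = sorted(zip(estimates, outcomes))  # ascending by estimate
    if sum(group_sizes) != len(cases):
        raise ValueError("group sizes must add up to the number of cases")
    sums, start = [], 0
    for size in group_sizes:
        chunk = cases[start:start + size]
        expected = sum(e for e, _ in chunk)   # sum of the system's estimates
        observed = sum(o for _, o in chunk)   # sum of the real outcomes
        sums.append((expected, observed))
        start += size
    return sums


estimates = [0.7, 0.1, 0.5, 0.2, 0.9, 0.3, 0.2, 0.8, 0.5, 0.7]
outcomes = [1, 0, 0, 1, 1, 0, 0, 1, 1, 0]
sums = grouped_sums(estimates, outcomes, group_sizes=(3, 3, 4))
for expected, observed in sums:
    print(f"sum of estimates = {expected:.1f}, sum of outcomes = {observed}")
```

Each (sum of estimates, sum of outcomes) pair is one point on a calibration curve; for a well-calibrated system the points lie close to the 45-degree line.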

Page 45: Evaluation of Medical Systems

Calibration Curves

[Figure: sum of system’s estimates (y-axis, 0 to 1) plotted against sum of real outcomes (x-axis, 0 to 1); one region of the plot is labeled overestimation.]

Page 46: Evaluation of Medical Systems

Linear Regression and 45° line

[Figure: sum of system’s estimates (y-axis, 0 to 1) plotted against sum of real outcomes (x-axis, 0 to 1), showing the regression line together with the 45° line.]

Page 47: Evaluation of Medical Systems

Goodness-of-fit

Sort systems’ estimates, group, sum, chi-square

Expected    Observed
0.1         0
0.2         0
0.2         1        sum of expected = 0.5, sum of observed = 1

0.3         0
0.5         0
0.5         1        sum of expected = 1.3, sum of observed = 1

0.7         0
0.7         1
0.8         1
0.9         1        sum of expected = 3.1, sum of observed = 3

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \frac{(1 - 0.5)^2}{0.5} + \frac{(1 - 1.3)^2}{1.3} + \frac{(3 - 3.1)^2}{3.1}$$

Page 48: Evaluation of Medical Systems

Hosmer-Lemeshow C-hat

Groups based on n-iles (e.g., deciles, terciles), n-2 d.f.

Measured groups             “Mirror groups”
Expected     Observed       Expected     Observed
0.1          0              0.9          1
0.2          0              0.8          1
0.2          1              0.8          0
sum = 0.5    sum = 1        sum = 2.5    sum = 2

0.3          0              0.7          1
0.5          0              0.5          1
0.5          1              0.5          0
sum = 1.3    sum = 1        sum = 1.7    sum = 2

0.7          0              0.3          1
0.7          1              0.3          0
0.8          1              0.2          0
0.9          1              0.1          0
sum = 3.1    sum = 3        sum = 0.9    sum = 1

$$\hat{C} = \frac{(1 - 0.5)^2}{0.5} + \frac{(1 - 1.3)^2}{1.3} + \frac{(3 - 3.1)^2}{3.1} + \frac{(2 - 2.5)^2}{2.5} + \frac{(2 - 1.7)^2}{1.7} + \frac{(1 - 0.9)^2}{0.9}$$
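The statistic can be evaluated from the group sums alone, since each mirror group is just the group size minus the measured values. The sketch below takes the three groups from this slide as (expected events, observed events, group size); the function name and tuple layout are assumptions for the illustration, and no p-value is computed.

```python
def hosmer_lemeshow(groups):
    """Chi-square statistic over measured groups and their mirror groups.

    groups: iterable of (expected_events, observed_events, group_size).
    """
    statistic = 0.0
    for expected, observed, size in groups:
        mirror_expected = size - expected   # expected non-events
        mirror_observed = size - observed   # observed non-events
        statistic += (observed - expected) ** 2 / expected
        statistic += (mirror_observed - mirror_expected) ** 2 / mirror_expected
    return statistic


# Group sums from the slide: sizes 3, 3, and 4.
groups = [(0.5, 1, 3), (1.3, 1, 3), (3.1, 3, 4)]
c_hat = hosmer_lemeshow(groups)
print(round(c_hat, 3))
```

For these three groups the statistic is about 0.737. With the slide's rule of n-2 degrees of freedom for n groups, it would be compared against a chi-square distribution with 1 degree of freedom.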

Page 49: Evaluation of Medical Systems

Hosmer-Lemeshow H-hat

Groups based on n fixed thresholds (e.g., 0.3, 0.6, 0.9), n-2 d.f.

Measured groups             “Mirror groups”
Expected     Observed       Expected     Observed
0.1          0              0.9          1
0.2          0              0.8          1
0.2          1              0.8          0
0.3          0              0.7          1
sum = 0.8    sum = 1        sum = 3.2    sum = 3

0.5          0              0.5          1
0.5          1              0.5          0
sum = 1.0    sum = 1        sum = 1.0    sum = 1

0.7          0              0.3          1
0.7          1              0.3          0
0.8          1              0.2          0
0.9          1              0.1          0
sum = 3.1    sum = 3        sum = 0.9    sum = 1

$$\hat{H} = \frac{(1 - 0.8)^2}{0.8} + \frac{(1 - 1.0)^2}{1.0} + \frac{(3 - 3.1)^2}{3.1} + \frac{(3 - 3.2)^2}{3.2} + \frac{(1 - 1.0)^2}{1.0} + \frac{(1 - 0.9)^2}{0.9}$$
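Grouping by fixed thresholds differs from grouping by n-iles only in how cases are assigned to groups. The sketch below assigns each case to the first cutpoint that is greater than or equal to its estimate (an assumption about how boundary values such as 0.3 are handled) and computes the mirror counts from the data as group size minus observed events, which for the first group is 4 - 1 = 3. The function name `h_hat` is hypothetical.

```python
from bisect import bisect_left


def h_hat(estimates, outcomes, cutpoints):
    """Hosmer-Lemeshow statistic with groups defined by ascending cutpoints."""
    k = len(cutpoints)
    expected, observed, size = [0.0] * k, [0] * k, [0] * k
    for e, o in zip(estimates, outcomes):
        g = bisect_left(cutpoints, e)  # first group whose cutpoint is >= e
        if g == k:
            raise ValueError(f"estimate {e} exceeds the last cutpoint")
        expected[g] += e
        observed[g] += o
        size[g] += 1
    statistic = 0.0
    for exp, obs, n in zip(expected, observed, size):
        if n == 0:
            continue  # empty groups contribute nothing
        statistic += (obs - exp) ** 2 / exp                    # measured group
        statistic += ((n - obs) - (n - exp)) ** 2 / (n - exp)  # mirror group
    return statistic


estimates = [0.1, 0.2, 0.2, 0.3, 0.5, 0.5, 0.7, 0.7, 0.8, 0.9]
outcomes = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]
statistic = h_hat(estimates, outcomes, cutpoints=[0.3, 0.6, 0.9])
print(round(statistic, 4))
```

The sketch does not guard against groups whose expected count is 0 or equal to the group size, where the statistic is undefined.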

Page 50: Evaluation of Medical Systems

Covariance decomposition

• Arkes et al, 1995

$$\text{Brier} = d(1-d) + \text{bias}^2 + d(1-d)\,\text{slope}\,(\text{slope} - 2) + \text{scatter}$$

• where d = prior

• bias is a calibration index

• slope is a discrimination index

• scatter is a variance index
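The decomposition can be verified numerically. The sketch below assumes the usual definitions of the terms: bias is the mean estimate minus the prior, slope is the mean estimate among real ‘1’s minus the mean estimate among real ‘0’s, and scatter is the within-class variance of the estimates pooled over all cases. The function name and the reuse of the ten calibration cases are choices made for the illustration.

```python
def covariance_decomposition(estimates, outcomes):
    """Terms of Brier = d(1-d) + bias^2 + d(1-d)*slope*(slope-2) + scatter."""
    n = len(estimates)
    d = sum(outcomes) / n                       # prior (base rate)
    ones = [e for e, o in zip(estimates, outcomes) if o == 1]
    zeros = [e for e, o in zip(estimates, outcomes) if o == 0]
    mean_ones = sum(ones) / len(ones)           # needs at least one real '1'
    mean_zeros = sum(zeros) / len(zeros)        # needs at least one real '0'
    bias = sum(estimates) / n - d
    slope = mean_ones - mean_zeros
    scatter = (sum((e - mean_ones) ** 2 for e in ones)
               + sum((e - mean_zeros) ** 2 for e in zeros)) / n
    brier = d * (1 - d) + bias ** 2 + d * (1 - d) * slope * (slope - 2) + scatter
    return {"d": d, "bias": bias, "slope": slope,
            "scatter": scatter, "brier": brier}


estimates = [0.1, 0.2, 0.2, 0.3, 0.5, 0.5, 0.7, 0.7, 0.8, 0.9]
outcomes = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]
parts = covariance_decomposition(estimates, outcomes)
direct_brier = sum((e - o) ** 2 for e, o in zip(estimates, outcomes)) / len(estimates)
```

Under these definitions the recombined value matches the directly computed Brier score (0.191 for these cases), with bias = -0.01 and slope = 0.26.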

Page 51: Evaluation of Medical Systems

Covariance Graph

[Figure: covariance graph of estimated probability (e, y-axis, 0 to 1) against outcome index (o, x-axis, 0 to 1), with the slope marked.]

ô = .7, ê = .6, ê1 = .7, ê0 = .4

PS = .2, bias = -0.1, slope = .3, scatter = .1

Page 52: Evaluation of Medical Systems

Back to Example

Page 53: Evaluation of Medical Systems

Thresholded Results

Model    Sens     Spec     NPV      PPV      Accuracy
LR       0.875    0.853    0.971    0.549    0.856
NN       0.844    0.878    0.965    0.587    0.872
RS       0.875    0.862    0.971    0.566    0.864

Page 54: Evaluation of Medical Systems

Brier Scores

Model    Brier     Spiegelhalter’s Z
LR       0.0804    0.01
NN       0.0811    1.34
RS       0.0883    -1.0

Page 55: Evaluation of Medical Systems

ROC Curves

[Figure: ROC curves for the LR, NN, and RS models; axes labeled Sensitivity and 1-Specificity, each running from 0 to 1.]

Page 56: Evaluation of Medical Systems

Areas under ROC Curves

Model                  ROC Curve Area    Standard Error
Logistic Regression    0.925             0.016
Neural Network         0.923             0.015
Rough Set              0.914             0.016

Page 57: Evaluation of Medical Systems

[Figure: three calibration plots, one each for the RS Model, LR Model, and NN Model. In each panel one axis is labeled Observed; the axis ticks run from 0 to 1 and from 0 to 0.8.]

Page 58: Evaluation of Medical Systems

Results: Goodness-of-fit

• Logistic Regression: H-L p = 0.50

• Neural Network: H-L p = 0.21

• Rough Sets: H-L p < 0.01

• p > 0.05 indicates reasonable fit

Page 59: Evaluation of Medical Systems

Summary

• Always test performance in previously unseen cases

• General indices are not very informative

• Combinations of indices describing discrimination and calibration are more informative

• It is hard to conclude that one system is better than another in terms of classification performance alone: explanation, variable selection, and acceptability by clinicians are key