Lucila Ohno-Machado
An introduction to calibration and discrimination methods
HST951 Medical Decision Support
Harvard Medical SchoolMassachusetts Institute of Technology
Evaluation of Medical Systems
Main Concepts
• Example of a system
• Discrimination: sensitivity, specificity, PPV, NPV, accuracy, ROC curves
• Calibration: calibration curves, Hosmer-Lemeshow goodness of fit
Example
Comparison of Practical Prediction Models for
Ambulation Following Spinal Cord Injury
Todd Rowland, M.D.
Decision Systems Group
Brigham and Women's Hospital
Harvard Medical School
Study Rationale
• Patient's most common question: "Will I walk again?"
• Study was conducted to compare logistic regression, neural network, and rough sets models, which predict ambulation at discharge based upon information available at admission for individuals with acute spinal cord injury
• Create simple models with good performance
• 762 cases in the training set, 376 in the test set
• Univariate statistics compared (means)
SCI Ambulation NN Design: Inputs & Output

Admission Info (9 items):
• system days
• injury days
• age
• gender
• racial/ethnic group
• level of neurologic function
• ASIA impairment index
• UEMS
• LEMS

Ambulation (1 item): yes / no
Evaluation Indices
General indices
• Brier score (a.k.a. mean squared error)

Brier = (1/n) Σ (e_i - o_i)^2

e = estimate (e.g., 0.2)
o = observation (0 or 1)
n = number of cases
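As a sketch, the Brier score above can be computed in a few lines of Python (the estimates and outcomes below are hypothetical, chosen only to illustrate the formula):

```python
def brier_score(estimates, outcomes):
    """Mean squared error between probability estimates and 0/1 outcomes."""
    n = len(estimates)
    return sum((e - o) ** 2 for e, o in zip(estimates, outcomes)) / n

# Hypothetical estimates and gold-standard outcomes
estimates = [0.2, 0.9, 0.4, 0.7]
outcomes = [0, 1, 0, 1]
print(brier_score(estimates, outcomes))  # (0.04 + 0.01 + 0.16 + 0.09) / 4 ≈ 0.075
```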
Decompositions of Brier score
• Murphy
Brier = d(1-d) + CI - DI

where d = prior probability
CI = calibration index (reliability)
DI = discrimination index (resolution)
Calibration and Discrimination
• Calibration measures how close the estimates are to a “real” probability
• Discrimination measures how much the system can discriminate between cases with gold standard ‘1’ and gold standard ‘0’
• “If the system is good in discrimination, calibration may be fixed”
Discrimination Indices
Discrimination
• The system can “somehow” differentiate between cases in different categories
• Special cases:
– diagnosis (differentiate sick and healthy individuals)
– prognosis (differentiate poor and good outcomes)
Discrimination of Binary Outcomes
• Real outcome is 0 or 1, estimated outcome is usually a number between 0 and 1 (e.g., 0.34) or a rank
• Based on Thresholded Results (e.g., if output > 0.5 then consider “positive”)
• Discrimination is based on:
– sensitivity
– specificity
– PPV, NPV
– accuracy
– area under ROC curve
Notation
D means disease present
"D" means system says disease is present
nl means disease absent
"nl" means system says disease is absent
TP means true positive
FP means false positive
TN means true negative
FN means false negative
[Figure: overlapping score distributions for normal and disease cases on a 1.0-3.0 scale; the threshold at 1.7 divides true negatives (TN) and false negatives (FN) below it from false positives (FP) and true positives (TP) above it.]

Prevalence = P(D) = (FN + TP)/all

(columns: gold standard; rows: system output)
            nl      D
"D"         FP      TP
"nl"        TN      FN
• Specificity
Spec = TN/(TN+FP)
• Positive Predictive Value
PPV = TP/(TP+FP)
• Negative Predictive Value
NPV = TN/(TN+FN)
• Accuracy = (TN+TP)/all
• Sensitivity: Sens = TP/(TP+FN)
• Specificity: Spec = TN/(TN+FP)
• Positive Predictive Value: PPV = TP/(TP+FP)
• Negative Predictive Value: NPV = TN/(TN+FN)
• Accuracy = (TN+TP)/all
            nl      D
"D"         20      40      60
"nl"        30      10      40
            50      50      100

Sens = TP/(TP+FN) = 40/50 = .8
Spec = TN/(TN+FP) = 30/50 = .6
PPV = TP/(TP+FP) = 40/60 = .67
NPV = TN/(TN+FN) = 30/40 = .75
Accuracy = (TN+TP)/all = 70/100 = .7
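The counts from this worked example (TP = 40, FP = 20, TN = 30, FN = 10) can be plugged into a small helper; a minimal Python sketch:

```python
def metrics(tp, fp, tn, fn):
    """Discrimination indices from a thresholded 2x2 table."""
    return {
        "sens": tp / (tp + fn),
        "spec": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "acc": (tp + tn) / (tp + fp + tn + fn),
    }

# Counts from the worked example: TP=40, FP=20, TN=30, FN=10
print(metrics(40, 20, 30, 10))
# sens=0.8, spec=0.6, ppv≈0.67, npv=0.75, acc=0.7
```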
[Figure: the normal and disease score distributions (1.0 to 3.0) with the threshold at 1.4; regions TN, FP, and TP (no false negatives).]

            nl      D
"D"         10      50      60
"nl"        40      0       40
            50      50      100

Sensitivity = 50/50 = 1
Specificity = 40/50 = 0.8
[Figure: the same distributions with the threshold at 1.7; regions FN, TN, FP, and TP.]

            nl      D
"D"         10      40      50
"nl"        40      10      50
            50      50      100

Sensitivity = 40/50 = .8
Specificity = 40/50 = .8
[Figure: the same distributions with the threshold at 2.0; regions FN, TN, and TP (no false positives).]

            nl      D
"D"         0       30      30
"nl"        50      20      70
            50      50      100

Sensitivity = 30/50 = .6
Specificity = 50/50 = 1
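Each threshold yields one (1 - specificity, sensitivity) operating point, as in the three slides above. A sketch with hypothetical scores on the same 1.0-3.0 scale (the score and label values are invented for illustration):

```python
def roc_point(scores, labels, threshold):
    """One ROC point: scores >= threshold are called 'D' (positive)."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    tn = sum(s < threshold and y == 0 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    return fp / (fp + tn), tp / (tp + fn)  # (1 - specificity, sensitivity)

# Hypothetical scores and gold-standard labels
scores = [1.2, 1.5, 1.9, 2.1, 2.6]
labels = [0, 0, 1, 1, 1]
for t in (1.4, 1.7, 2.0):
    print(t, roc_point(scores, labels, t))
```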
ROC curve

[The three 2x2 tables for thresholds 2.0, 1.7, and 1.4 supply the operating points that are plotted to form the ROC curve.]
[Figure: ROC plot of sensitivity (y-axis) vs. 1 - specificity (x-axis), marking the points for thresholds 1.4, 1.7, and 2.0.]
ROC curve

[Figure: the full ROC curve of sensitivity vs. 1 - specificity, traced by sweeping all thresholds.]
[Figure: the 45-degree line corresponds to no discrimination; area under ROC: 0.5.]
[Figure: perfect discrimination; the ROC curve passes through the upper-left corner; area under ROC: 1.]
ROC curve

[Figure: example ROC curve with area = 0.86.]
What is the area under the ROC?

• An estimate of the discriminatory performance of the system
– the real outcome is binary, but the system's estimates are continuous (0 to 1)
– all thresholds are considered
• NOT an estimate of how many times the system will give the "right" answer
Interpretation of the Area

System's estimates for:
• Healthy (real outcome 0): 0.3, 0.2, 0.5, 0.1, 0.7
• Sick (real outcome 1): 0.8, 0.2, 0.5, 0.7, 0.9
All possible pairs 0-1

Compare each healthy estimate against every sick estimate; e.g., healthy 0.3 vs. sick 0.8, 0.2, 0.5, 0.7, 0.9:
0.3 < 0.8  concordant
0.3 > 0.2  discordant
0.3 < 0.5  concordant
0.3 < 0.7  concordant
0.3 < 0.9  concordant
All possible pairs 0-1

Healthy 0.2 vs. sick 0.8, 0.2, 0.5, 0.7, 0.9:
0.2 < 0.8  concordant
0.2 = 0.2  tie
0.2 < 0.5  concordant
0.2 < 0.7  concordant
0.2 < 0.9  concordant
C-index

• Concordant: 18
• Discordant: 4
• Ties: 3

C-index = (Concordant + 1/2 Ties) / All pairs = (18 + 1.5) / 25 = 0.78
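The pairwise counting above can be sketched directly in Python, using the estimates from the slides:

```python
def c_index(healthy, sick):
    """Fraction of (healthy, sick) pairs ranked correctly; ties count as 1/2."""
    concordant = ties = 0
    for h in healthy:
        for s in sick:
            if h < s:
                concordant += 1
            elif h == s:
                ties += 1
    return (concordant + 0.5 * ties) / (len(healthy) * len(sick))

# Estimates from the slides
healthy = [0.3, 0.2, 0.5, 0.1, 0.7]  # real outcome 0
sick = [0.8, 0.2, 0.5, 0.7, 0.9]     # real outcome 1
print(c_index(healthy, sick))  # (18 + 1.5) / 25 = 0.78
```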
Related statistics

• Mann-Whitney U statistic: P(x < y) = U / (mn)
where m is the number of real '0's (e.g., deaths) and n is the number of real '1's (e.g., survivors)
• Wilcoxon rank-sum test (equivalent to the Mann-Whitney U test)
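A quick check of this relation in pure Python, with the same estimates as the C-index slides: U counts the 0-1 pairs ranked correctly (ties at 1/2), so U/(mn) reproduces the C-index / ROC area.

```python
def mann_whitney_u(zeros, ones):
    """U = count of (0,1) pairs with estimate_0 < estimate_1; ties count 1/2."""
    return sum(1.0 if x < y else 0.5 if x == y else 0.0
               for x in zeros for y in ones)

zeros = [0.3, 0.2, 0.5, 0.1, 0.7]  # m = 5 real '0's
ones = [0.8, 0.2, 0.5, 0.7, 0.9]   # n = 5 real '1's
print(mann_whitney_u(zeros, ones) / (len(zeros) * len(ones)))  # 19.5 / 25 = 0.78
```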
Calibration Indices
Calibration
• System can reliably estimate the probability of:
– a diagnosis
– a prognosis
• Probability is close to the "real" probability for similar cases
What is the problem?
• Binary events are YES/NO (0/1), i.e., probabilities are 0 or 1 for a given individual
• Some models produce continuous (or quasi-continuous) estimates for the binary events
• Example:
– Database of patients with spinal cord injury, and a model that predicts whether a patient will ambulate at hospital discharge
– Event is 0: doesn't walk, or 1: walks
– Models produce a probability that the patient will walk: 0.05, 0.10, ...
How close are the estimates to the “true” probability for a patient?
• “True” probability can be interpreted as probability within a set of similar patients
• What are similar patients?
– Patients who get similar scores from models
– How to define boundaries for similarity?
• Examples out there in the real world:
– APACHE III for ICU outcomes
– ACI-TIPI for diagnosis of MI
Calibration

Sort the system's estimates, group them, and sum:

estimate  outcome
0.1       0
0.2       0
0.2       1    group: sum of estimates = 0.5, sum of outcomes = 1
0.3       0
0.5       0
0.5       1    group: sum of estimates = 1.3, sum of outcomes = 1
0.7       0
0.7       1
0.8       1
0.9       1    group: sum of estimates = 3.1, sum of outcomes = 3
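The sort-group-sum step can be sketched as follows (tercile-style grouping where the last group absorbs the remainder, matching the slide's 3/3/4 split):

```python
def group_sums(estimates, outcomes, n_groups=3):
    """Sort cases by estimate, split into groups, sum estimates and outcomes."""
    pairs = sorted(zip(estimates, outcomes))
    size = len(pairs) // n_groups
    bounds = [i * size for i in range(n_groups)] + [len(pairs)]
    return [(sum(e for e, _ in pairs[a:b]), sum(o for _, o in pairs[a:b]))
            for a, b in zip(bounds, bounds[1:])]

# The ten cases from the slide
estimates = [0.1, 0.2, 0.2, 0.3, 0.5, 0.5, 0.7, 0.7, 0.8, 0.9]
outcomes = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]
print(group_sums(estimates, outcomes))  # group sums ~ (0.5, 1), (1.3, 1), (3.1, 3)
```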
[Figure: plot of the sum of the system's estimates vs. the sum of real outcomes per group; points above the 45-degree line indicate overestimation.]
Calibration Curves

[Figure: calibration curve plotting the sum of the system's estimates vs. the sum of real outcomes, with a linear regression line and the 45-degree line.]
Goodness-of-fit

Sort the system's estimates, group, sum, and compute a chi-square statistic:

Expected            Observed
0.1                 0
0.2                 0
0.2  sum = 0.5      1  sum = 1
0.3                 0
0.5                 0
0.5  sum = 1.3      1  sum = 1
0.7                 0
0.7                 1
0.8                 1
0.9  sum = 3.1      1  sum = 3

Chi-square = Σ [(observed - expected)^2 / expected]
= (1 - 0.5)^2 / 0.5 + (1 - 1.3)^2 / 1.3 + (3 - 3.1)^2 / 3.1
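The statistic can be computed directly from the (expected, observed) group sums; a minimal sketch:

```python
def gof_chi2(groups):
    """Chi-square: sum of (observed - expected)^2 / expected over groups."""
    return sum((o - e) ** 2 / e for e, o in groups)

# (expected = sum of estimates, observed = sum of outcomes) from the slide
groups = [(0.5, 1), (1.3, 1), (3.1, 3)]
print(gof_chi2(groups))  # 0.5 + 0.069 + 0.003 ~ 0.57
```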
Hosmer-Lemeshow C-hat

Groups based on n-iles (e.g., deciles, terciles); n - 2 d.f.

Measured groups                  "Mirror" groups
Expected        Observed         Expected        Observed
0.1             0                0.9             1
0.2             0                0.8             1
0.2  sum = 0.5  1  sum = 1       0.8  sum = 2.5  0  sum = 2
0.3             0                0.7             1
0.5             0                0.5             1
0.5  sum = 1.3  1  sum = 1       0.5  sum = 1.7  0  sum = 2
0.7             0                0.3             1
0.7             1                0.3             0
0.8             1                0.2             0
0.9  sum = 3.1  1  sum = 3       0.1  sum = 0.9  0  sum = 1

C-hat = (0.5-1)^2/0.5 + (1.3-1)^2/1.3 + (3.1-3)^2/3.1 + (2.5-2)^2/2.5 + (1.7-2)^2/1.7 + (0.9-1)^2/0.9
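Because a mirror group of n cases has expected value n minus the measured group's expected value (and likewise for the observed count), C-hat can be sketched as:

```python
def hl_c_hat(groups):
    """Hosmer-Lemeshow C-hat over measured groups and their mirror groups.

    Each group is (expected, observed, n_cases); the mirror group has
    expected = n - expected and observed = n - observed.
    """
    stat = 0.0
    for e, o, n in groups:
        stat += (o - e) ** 2 / e                    # measured group
        stat += ((n - o) - (n - e)) ** 2 / (n - e)  # mirror group
    return stat

# Tercile groups from the slide: (expected, observed, group size)
groups = [(0.5, 1, 3), (1.3, 1, 3), (3.1, 3, 4)]
print(hl_c_hat(groups))  # ~ 0.74
```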
Hosmer-Lemeshow H-hat

Groups based on n fixed thresholds (e.g., 0.3, 0.6, 0.9); n - 2 d.f.

Measured groups                  "Mirror" groups
Expected        Observed         Expected        Observed
0.1             0                0.9             1
0.2             0                0.8             1
0.2             1                0.8             0
0.3  sum = 0.8  0  sum = 1       0.7  sum = 3.2  1  sum = 3
0.5             0                0.5             1
0.5  sum = 1.0  1  sum = 1       0.5  sum = 1.0  0  sum = 1
0.7             0                0.3             1
0.7             1                0.3             0
0.8             1                0.2             0
0.9  sum = 3.1  1  sum = 3       0.1  sum = 0.9  0  sum = 1

H-hat = (0.8-1)^2/0.8 + (1.0-1)^2/1.0 + (3.1-3)^2/3.1 + (3.2-3)^2/3.2 + (1.0-1)^2/1.0 + (0.9-1)^2/0.9
Covariance decomposition

• Arkes et al., 1995

Brier = d(1-d) + bias^2 + d(1-d) · slope · (slope - 2) + scatter

• where d = prior
• bias is a calibration index
• slope is a discrimination index
• scatter is a variance index
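A sketch of the decomposition's terms with hypothetical data, assuming the usual definitions: bias = mean(e) - mean(o), slope = mean(e | o=1) - mean(e | o=0), and scatter = the outcome-weighted within-group variance of the estimates, under which the identity holds exactly:

```python
def covariance_decomposition(estimates, outcomes):
    """bias, slope, and scatter terms of the covariance decomposition."""
    n = len(estimates)
    d = sum(outcomes) / n                  # prior (mean outcome)
    e1 = [e for e, o in zip(estimates, outcomes) if o == 1]
    e0 = [e for e, o in zip(estimates, outcomes) if o == 0]
    mean1, mean0 = sum(e1) / len(e1), sum(e0) / len(e0)
    bias = sum(estimates) / n - d
    slope = mean1 - mean0
    var = lambda xs, m: sum((x - m) ** 2 for x in xs) / len(xs)
    scatter = (len(e1) * var(e1, mean1) + len(e0) * var(e0, mean0)) / n
    return bias, slope, scatter

# Hypothetical estimates and outcomes
est = [0.3, 0.2, 0.5, 0.1, 0.7, 0.8, 0.2, 0.5, 0.7, 0.9]
out = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
bias, slope, scatter = covariance_decomposition(est, out)
d = sum(out) / len(out)
brier = sum((e - o) ** 2 for e, o in zip(est, out)) / len(est)
# The decomposition reassembles the Brier score exactly:
assert abs(brier - (d*(1-d) + bias**2 + d*(1-d)*slope*(slope-2) + scatter)) < 1e-9
```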
Covariance Graph

[Figure: estimated probability (e) plotted against the outcome index (o = 0 or 1); group means ê0 = .4 and ê1 = .7 define the slope; overall means ê = .6 and ô = .7.]

PS = .2, bias = -0.1, slope = .3, scatter = .1
Back to Example

Thresholded Results

        Sens    Spec    NPV     PPV     Accuracy
LR      0.875   0.853   0.971   0.549   0.856
NN      0.844   0.878   0.965   0.587   0.872
RS      0.875   0.862   0.971   0.566   0.864
Brier Scores

        Brier   Spiegelhalter's Z
LR      0.0804  0.01
NN      0.0811  1.34
RS      0.0883  -1.0
ROC Curves

[Figure: ROC curves (sensitivity vs. 1 - specificity) for the LR, NN, and RS models.]
Areas under ROC Curves

Model                 ROC Curve Area    Standard Error
Logistic Regression   0.925             0.016
Neural Network        0.923             0.015
Rough Set             0.914             0.016
[Figures: calibration plots (predicted vs. observed) for the RS, LR, and NN models.]
Results: Goodness-of-fit
• Logistic Regression: H-L p = 0.50
• Neural Network: H-L p = 0.21
• Rough Sets: H-L p <.01
• p > 0.05 indicates reasonable fit
Summary

• Always test performance on previously unseen cases
• General indices are not very informative
• A combination of indices describing discrimination and calibration is more informative
• It is hard to conclude that one system is better than another in terms of classification performance alone: explanation, variable selection, and acceptability by clinicians are key