Diagnostic Testing & Predictive Models
John Kwagyan, PhD
Howard University College of Medicine
Design, Biostatistics & Population Studies
GHUCCTS
"Physicians must be content to end not in certainties, but rather in statistical probabilities. The physician thus has a right to feel certain, within statistical constraints, but never cocksure. Absolute certainty remains for some theologians - and like-minded physicians."
Am J Cardiol 1975;36:592-62
Objective
To understand the usage of diagnostic measures and screening tools
Outline
• Examples
• Why/What Diagnostic Testing
• Measures of Diagnostic accuracy
• ROC Curves
• Adaptation of Diagnostic/Screening Tools
• Predictive Models
EXAMPLES
4P's Plus Screening Instrument - Substance Abuse in Pregnant Women
What is a positive assessment??
J. Perinatology 2005
Index to Predict Relapse in Asthma
Factor Score 0 Score 1
Pulse <120 >= 120
Respiration <30 >=30
Pulsus Paradoxus <18 >=18
Peak Flow Rate >120 <=120
Dyspnea Absent or mild Moderate or Severe
Accessory muscle use Absent or mild Moderate or severe
Wheezing Absent or mild Moderate or severe
Fischl et al, NEJM 1981
Positive Test => Score of 4 or more
Study Design: A total of 228 pregnant women underwent screening. Analyses of reliability, sensitivity, specificity, and positive and negative predictive validity were conducted.
Result: Overall reliability for the five-item measure was 0.62. Seventy-four (32.5%) of the women had a positive screen. Sensitivity and specificity were very good, at 87% and 76%, respectively. Positive predictive validity was low (36%); negative predictive validity was quite high (97%).
Conclusion: The 4P's Plus reliably and effectively screens pregnant women for risk of substance use, including those women typically missed by other perinatal screening methodologies.
Validation of the 4P's Plus© screen for substance use in pregnancy validation of the 4P's Plus- J. Perinatology (2007)
Maternal Biochemical Serum Screening for Down
Syndrome in Pregnancy With HIV Infection
To estimate the influence of HIV infection and antiretroviral therapy on maternal serum marker levels and the false-positive rate with biochemical maternal serum screening for Down syndrome.
Obstetrics & Gynecology
Inability to Predict Relapse in Acute Asthma - NEJM 1984;310(9)
• Fischl & Co. developed an index to predict relapse in patients with acute asthma.
• Based on data from ER patients in Miami, FL
• Reported 95% sensitivity and 97% specificity
• Dramatic drop in accuracy when externally validated on patients in Richmond, VA?????
Other Examples
• SSAGA for alcohol dependence
• Genetic Screening for hereditary disease
• etc.
Why Diagnostic Testing
• Accurate screening/diagnosis of a health condition is often a first step towards its prevention or control
• Need for fast, inexpensive and RELIABLE tools
Purpose of Diagnostic Testing
• A (binary) diagnostic test is designed to determine whether a target condition is present or absent in a subject from the intended use population.
• the target condition can refer to
- a particular disease
- a disease stage
- health status or condition
that should prompt clinical action, such as the initiation, modification or termination of treatment, counseling, etc
Test Scale
• Binary - presence or absence of disease (the underlying measure is often continuous)
• Continuous (quantitative)
  - biomarkers for cancer, e.g., PSA measured as serum concentration
  - creatinine for kidney malfunction
  - blood sugar for diabetes
  - cholesterol for dyslipidemia
• Ordinal
  - clinical symptoms: moderate, severe, highly severe
  - index score: 0, 1, 2, 3, 4, 5
Other Test Scales
• Likert-type rating - highly disagree, disagree, neutral, agree, highly agree
• Nominal ** - genotype groups - ApoE Genotypes: E2/E2, E2/E3, E2/E4, E3/E3, E3/E4, E4/E4
What is Diagnostic Testing?
• Evaluation of a (new) test to determine whether a target condition is present by comparison with a benchmark!
• Evaluation of the ability of a test to classify subjects as diseased or disease-free
• For non-binary scales, a classification rule is set by a threshold
  - PSA > 4.0 ng/ml
  - Blood glucose > 126 mg/dL
  - BP > 140/90 mmHg (used to be 160/95 mmHg; contemplating 130/85 mmHg???)
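A minimal sketch of such a threshold rule (the function name is illustrative; the cutoffs follow the slide, e.g., PSA > 4.0 ng/ml and glucose > 126 mg/dL):

```python
def classify(value, cutoff, positive_above=True):
    """Dichotomize a continuous test result at a threshold."""
    return value > cutoff if positive_above else value < cutoff

# Cutoffs from the slide (units: ng/ml and mg/dL)
print(classify(5.1, 4.0))    # PSA of 5.1 ng/ml -> True (positive)
print(classify(110, 126))    # fasting glucose of 110 mg/dL -> False (negative)
```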
"That we are in the midst of crisis is now well understood. Our nation is at war… Our economy is badly weakened… Homes have been lost; jobs shed; businesses shuttered. Our health care is too costly; our schools fail too many…
These are the indicators of crisis, subject to data and statistics."
Pres. Barack Obama (Inaugural speech)
Benchmarks for Comparison
1. comparison to a reference (Gold) standard
- considered to be the best available method for establishing the presence or absence of the target condition
- can be a single method, or a combination of methods, including clinical follow-up.
2. comparison to a non-reference standard
- method other than a reference (Gold) standard.
Note!!!: The choice of comparative method will determine which performance measures may be reported
Some Conventional Tests
• Bacterial cultures
  - strep throat, urinary tract infection, meningitis, etc.
• Imaging technology
  - X-ray for bone fracture
  - CT scans for brain injury
  - MRI for brain injury
• Biochemical markers
  - serum creatinine for kidney dysfunction
  - serum bilirubin for liver dysfunction
  - blood glucose for diabetes
  - blood test for HIV
Other Conventional Tests
• Expert judgment
presence or absence of a heart murmur
• Response to Questionnaire !!!!
substance abuse
• Expert Interview or Observation
schizophrenia, bipolar disorder, major depression
• Radiologists score of mammograms
no cancer, benign cancer, possible malignancy, malignancy
Measures of Accuracy
Validation
Validation
• It is the evaluation of the accuracy of the test
• can only be established by comparing with the Gold Standard
• validity is measured by sensitivity and specificity
The extent to which a test measures what it is supposed to measure!!!
Measures for Accuracy
• Sensitivity => True Positive Rate (TPR)
• Specificity => True Negative Rate (TNR)
• False Negative Rate (FNR) = 1 - sensitivity
• False Positive Rate (FPR) = 1 - specificity
• Predictive values
  - positive predictive value
  - negative predictive value
• Diagnostic Likelihood Ratios
  LR+ = TPR/FPR = SenS/(1-SpeC)
  LR- = FNR/TNR = (1-SenS)/SpeC
• ROC Curves
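These measures all follow directly from the four cells of a 2 x 2 table; a minimal sketch (the function name is illustrative), evaluated here on the Coronary Artery Surgery Study counts that appear later in the deck:

```python
def accuracy_measures(tp, fp, fn, tn):
    """Sensitivity, specificity, and likelihood ratios from 2x2 counts."""
    sens = tp / (tp + fn)              # true positive rate
    spec = tn / (tn + fp)              # true negative rate
    lr_pos = sens / (1 - spec)         # LR+ = TPR/FPR
    lr_neg = (1 - sens) / spec         # LR- = FNR/TNR
    return sens, spec, lr_pos, lr_neg

# Coronary Artery Surgery Study: 815 TP, 115 FP, 208 FN, 327 TN
sens, spec, lr_pos, lr_neg = accuracy_measures(815, 115, 208, 327)
print(round(sens, 2), round(spec, 2))  # 0.8 0.74
```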
Sensitivity & Specificity
• Sensitivity is the ability of a test to correctly classify an individual as ‘diseased’.
Estimated as the proportion of subjects with the target condition in whom the test is positive
• Specificity is the ability of a test to correctly classify an individual as ‘disease-free’.
Estimated as the proportion of subjects without the target condition in whom the test is negative
Best illustrated using a 2 x 2 table!!!!
Diagnostic Testing
True Disease Status
Test Result    Diseased              Disease-free
Positive       No Error              Error I
Negative       Error II              No Error
               N1 = total diseased   N2 = total disease-free
Sensitivity & Specificity
True Disease Status
Test Result    Diseased          Disease-free
Positive       True-positive     False-positive
Negative       False-negative    True-negative

• Sensitivity ~ ability of a test to detect disease when present => True-positive fraction
• Specificity ~ ability to indicate disease-free when absent => True-negative fraction
Consequence of Diagnostic Errors
• False negative errors, i.e., missing disease that is present
  - can result in people foregoing needed treatment for the disease
  - the consequence can be as serious as death
• False positive errors, i.e., falsely indicating disease
  - disease-free subjects are subjected to unnecessary work-up procedures or even treatment
  - negative impacts include personal inconvenience and/or unnecessary stress, anxiety, etc.
Estimating Sensitivity & Specificity
True Disease Status
Test        Diseased   Disease-free   Total
Positive    TP         FP             T+
Negative    FN         TN             T-
Total       T_D        T_Df           T_N

(Estimate) SenS = TP/T_D
(Estimate) SpeC = TN/T_Df
False Negative Rate = FN/T_D = 1 - SenS
False Positive Rate = FP/T_Df = 1 - SpeC
Example: Coronary Artery Surgery Study
CAD Gold Standard: Arteriography

EST         Diseased   Disease-free   Total
Positive    815        115            930
Negative    208        327            535
Total       1023       442            1465

SenS = 815/1023 = 0.80, CI = [0.77, 0.82]
FNF = 208/1023 = 0.20, CI = [0.18, 0.23]
SpeC = 327/442 = 0.74, CI = [0.70, 0.78]
FPF = 115/442 = 0.26, CI = [0.22, 0.30]
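The intervals reported for this example can be reproduced with a normal-approximation (Wald) confidence interval for a proportion; a sketch (for small samples or extreme proportions, a Wilson or exact interval is preferable):

```python
from math import sqrt

def prop_ci(successes, total, z=1.96):
    """Normal-approximation (Wald) 95% confidence interval for a proportion."""
    p = successes / total
    half = z * sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half), min(1.0, p + half)

# Coronary Artery Surgery Study: 815 true positives among 1023 diseased
sens, lo, hi = prop_ci(815, 1023)
print(round(lo, 2), round(hi, 2))  # 0.77 0.82
```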
Absolute certainty remains for some theologians - and like-minded physicians!!!
If Ever There is a Perfect Test! (Ideal Case)

True Disease Status
Test        Diseased     Disease-free   Total
Positive    TP           FP = 0         T+
Negative    FN = 0       TN             T-
Total       T_D (=TP)    T_Df (=TN)     T_N

SenS = TP/T_D = 100%
SpeC = TN/T_Df = 100%
FNF = FN/T_D = 1 - SenS = 0%
FPF = FP/T_Df = 1 - SpeC = 0%
Uninformative (Useless) Tests
• Test is uninformative if the test result is unrelated to disease status
• the probability distributions of the measure are the same in the diseased and disease-free populations
• for uninformative tests, SenS = 1 - SpeC, i.e., TPF = FPF
Ex: Exercise Stress Test to determine Diabetes, HIV, etc
• Test is informative if: SenS + SpeC > 1
Clinical Application
Detection of Primary Angle Closure Glaucoma
Gold Standard = Gonioscopy

Test                   SenS (%)   SpeC (%)
Intraocular Pressure   47         92
Torch Light Test       80         70
van Herick Test        61.9       89.3
Indian J Ophthalmology 2008;56:45-50
Which test should we use to screen for PACG??
Sensitivity vs. Specificity: Rule Out & Rule In
• Tests with a high degree of sensitivity have a low FNR
  - they ensure that not many true cases of the disease are missed
• A screening test used to "rule out" a diagnosis should have a high degree of sensitivity
• Tests with a high degree of specificity have a low FPR
  - they ensure that not many disease-free patients are misdiagnosed
• A confirmatory test used to "rule in" a diagnosis should have a high degree of specificity
Clinical Application
Detection of Primary Angle Closure Glaucoma
Gold Standard = Gonioscopy

Test                   SenS (%)   SpeC (%)
Intraocular Pressure   47         92
Torch Light Test       80         70
van Herick Test        61.9       89.3
Which test should we use to screen for PACG?? How about Combining the tests !!!!
Combining Tests!
• 2 ways of performing combinations: in parallel, in series
• 2 rules for combination: "the OR rule" and "the AND rule"
Combining 2 Tests
• "OR rule": Test is positive if either test is positive; negative if both are negative
  SenS (Combo Test) = SenS1 + SenS2 - SenS1*SenS2
  SpeC (Combo Test) = SpeC1*SpeC2
• "AND rule": Test is positive only if both A and B are positive; negative if either or both are negative
  SenS (Combo Test) = SenS1*SenS2
  SpeC (Combo Test) = SpeC1 + SpeC2 - SpeC1*SpeC2
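A sketch of the two combination rules, applied to the Torch Light and van Herick tests; note that these formulas assume the two tests err independently given true disease status, which real test pairs may not satisfy:

```python
def combine_or(sens1, spec1, sens2, spec2):
    """Parallel testing, OR rule: positive if either test is positive."""
    return sens1 + sens2 - sens1 * sens2, spec1 * spec2

def combine_and(sens1, spec1, sens2, spec2):
    """Serial testing, AND rule: positive only if both tests are positive."""
    return sens1 * sens2, spec1 + spec2 - spec1 * spec2

# Torch Light Test (SenS 0.80, SpeC 0.70) + van Herick Test (0.62, 0.893)
sens_or, spec_or = combine_or(0.80, 0.70, 0.62, 0.893)
print(round(sens_or, 3), round(spec_or, 3))  # 0.924 0.625
```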
Clinical Application: Combining Tests
Detection of Primary Angle Closure Glaucoma
Gold Standard = Gonioscopy

Test                       SenS (%)   SpeC (%)
Torch Light Test           80         70
van Herick Test            62         89.3
Combined Test (OR rule)    92.4       62.3
Combined Test (AND rule)   49.6       97
Combining Tests
• "the OR rule" increases SenS ~ useful for screening tests to “rule out” diagnosis
• “the AND rule” increases SpeC ~ useful for confirmatory tests to “rule in” diagnosis
Predictive Values
Daring Clinical Questions
How likely is the disease, given the test result??
- what is the likelihood of disease when test is positive? - what is likelihood of non-disease when test is negative?
Answers to these questions are known as the predictive values
Predictive Values
• +PV - fraction of test positives who are diseased
  PPV = Probability(diseased | positive test)
• -PV - fraction of test negatives who are disease-free
  NPV = Probability(disease-free | negative test)
Predictive Values
True Disease Status
Test        Diseased   Disease-free   Total
Positive    TP         FP             T+
Negative    FN         TN             T-
Total       T_D        T_Df           T_N

Positive Predictive Value = TP/T+
Negative Predictive Value = TN/T-
Predictive Values: Example

True Disease Status
Test        Diseased   Disease-free   Total
Positive    815        115            930
Negative    208        327            535
Total       1023       442            1465

PPV = 815/930 = 87%
NPV = 327/535 = 61%
Predictive Values
• A perfect test will predict perfectly, i.e., PPV = 1, NPV=1
• Predictive values depend on the prevalence of the disease
  - PPV decreases with decreasing prevalence; a low PPV may simply be a result of low prevalence
  - NPV decreases with increasing prevalence
• Useless Test if: PPV = prevalence and NPV = 1 - prevalence
• PVs are not used to quantify the inherent accuracy of the test
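The prevalence dependence follows from Bayes' theorem; a sketch borrowing the 4P's Plus sensitivity and specificity (87%/76%) at two illustrative prevalences (the function name and the prevalence values are assumptions for the example):

```python
def ppv_npv(sens, spec, prev):
    """Predictive values from sensitivity, specificity, and prevalence (Bayes)."""
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

# Same test characteristics, two very different prevalences
for prev in (0.325, 0.01):
    ppv, npv = ppv_npv(0.87, 0.76, prev)
    print(prev, round(ppv, 2), round(npv, 2))
```

At 1% prevalence the same test's PPV collapses to a few percent while NPV approaches 1, which is exactly the point the slide makes.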
Attributes of Measures

                                  Classification Probabilities                                Predictive Values
Perfect Test                      SenS = 1, SpeC = 1                                          PPV = 1, NPV = 1
Useless Test                      SenS = 1 - SpeC                                             PPV = ρ, NPV = 1 - ρ
Context                           Accuracy                                                    Clinical prediction
Question addressed                To what degree does the test reflect the true disease state?   How likely is the disease given the test result?
Affected by disease prevalence?   No                                                          Yes
Study Design: A total of 228 pregnant women underwent screening.
Result: Seventy-four (32.5%) had a positive screen = prevalence!!!!
Sensitivity = 87% => missed 13% of diseased
Specificity = 76% => incorrectly classified 24% of the disease-free as diseased
Positive predictive validity = 36% => fraction of +ve tests with disease
Negative predictive validity = 97% => fraction of -ve tests without disease
Conclusion: The 4P's Plus reliably and effectively screens pregnant women for risk of substance use, including those women typically missed by other perinatal screening methodologies.
Validation of the 4P's Plus© screen for substance use in pregnancy - J. Perinatology (2007)
ROC Curves
Non-Binary Scales
ROC Curves
• ~ for evaluating tests that yield results on a non-binary scale, with classification set by a threshold
  BP > 140/90 mmHg for Hypertension ~ used to be 160/95 mmHg ~ contemplating 130/85 mmHg
• BP values can fluctuate in any individual, healthy or diseased; there will be some overlap of values between the diseased and disease-free populations
• The choice of a threshold depends on the trade-off that is acceptable between failing to detect disease and falsely identifying disease
• The ROC curve is a device that describes the range of trade-offs that can be achieved
ROC Curves
• Plot of SenS against 1-SpeC for all possible thresholds
• It is a visual representation of the global performance of the test
• ROC plot shows the trade-off of sensitivity and specificity
• Test is useless (uninformative) if, for every threshold c, SenS = 1 - SpeC
Construction of ROC Curves
Calculate SenS and 1-SpeC for all possible cutoff points

Cutoff   SenS   SpeC   1-SpeC   Comments
0        100    0      100      Ideal case
…
110      98     20     80
120      95     40     60
130      92     60     40
140      78     80     20
150      55     90     10
160      40     92     8
…
500      0      100    0
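From such a threshold table, the ROC curve's area (AUC) can be approximated by the trapezoidal rule; a sketch using the (1-SpeC, SenS) pairs above rescaled to fractions, with the (0,0) and (1,1) endpoints added (the function name is illustrative):

```python
def roc_auc(points):
    """Trapezoidal area under an ROC curve given (FPR, TPR) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# (1-SpeC, SenS) pairs from the threshold table, as fractions
points = [(0.0, 0.0), (0.08, 0.40), (0.10, 0.55), (0.20, 0.78),
          (0.40, 0.92), (0.60, 0.95), (0.80, 0.98), (1.0, 1.0)]
print(round(roc_auc(points), 2))  # 0.84
```

An uninformative test traces the diagonal and has AUC 0.5; a perfect test has AUC 1.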
Attributes of ROC curves
• Provides complete description of potential performance of a test
• Facilitates comparing and combining information across studies of the same test
• Guides the choice of threshold in application
• Provides mechanism for relevant comparison between different tests
Reporting of Estimates: Variability
• Point estimates of sensitivity, specificity, predictive values, are not sufficient.
• Confidence intervals reflect the uncertainty of the estimates, and should be reported.
• Focus is on confidence intervals to characterize the performance of the test and not on hypothesis testing
Sources of Bias
Diagnostic tests are subject to an array of biases:
• Verification bias
  - the study selectively includes subjects for verification of disease status
• Imperfect reference standard
  - error in the reference standard!
• Spectrum bias
  - study subjects may not be fully representative of the population, i.e., important subgroups are missing
Measures of Agreement
Reliability
Reliability vs. Validity
• Sometimes the goal is to estimate the validity (accuracy) of ratings in the absence of a "gold standard."
• Other times one merely wants to know the consistency of ratings made by different raters.
• In some cases, the issue of accuracy may even have no meaning - for example, ratings may concern opinions, attitudes, or values.
Measures of Agreement
• Reliability Coefficients
  - positive percent agreement
  - negative percent agreement
• Overall agreement
  - overall % agreement
  - Kappa - how much agreement is due to chance???
Other Measures of Agreement
• B-Statistics
• McNemar Test
• Latent class models
• Bayesian methods
Agreement
Test 2
Test 1 Positive Negative
Positive AGREEMENT DISAGREEMENT
Negative DISAGREEMENT AGREEMENT
Raw Agreement

             Test 1
Test 2       Positive   Negative   Total
Positive     40         19         59
Negative     1          512        513
Total        41         531        572

Positive % agreement = 40/41 = 97.6%
Negative % agreement = 512/531 = 96.4%
Overall % agreement = (40+512)/572 = 96.5%
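Kappa, listed above among the agreement measures, corrects the overall agreement for the agreement expected by chance alone; a sketch with purely illustrative counts (not the slide's data):

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 agreement table:
    a = both positive, b = test2+/test1-, c = test2-/test1+, d = both negative."""
    n = a + b + c + d
    po = (a + d) / n                                       # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2    # chance agreement
    return (po - pe) / (1 - pe)

# Illustrative counts: 45 both positive, 5, 15, 35 both negative
print(round(cohens_kappa(45, 5, 15, 35), 2))  # 0.6
```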
Limitation of Agreement Measures
• Reliability is not proof of validity
  - two tests can report the same readings, but both be wrong
• It does not tell how the disagreements occurred: - whether the positive and negative results are evenly distributed
• Does not tell the extent to which the agreement occurs by chance
Generalization
Cultural Adaptation
Generalization
• Ultimate question(s)!!!!
- Does the test perform well for new, unseen patients?
- Does the test perform well in other populations?
Inability to Predict Relapse in Acute Asthma - NEJM 1984;310(9)
• Fischl & Co. developed an index to predict relapse in patients with acute asthma.
• Based on data from ER patients in Miami, FL
• Reported 95% sensitivity and 97% specificity
• Dramatic drop in accuracy when externally validated on patients in Richmond, VA?????
Inability to Predict Relapse in Acute Asthma
• Index to predict relapse in patients with acute asthma
• Based on data from 205 ER patients in Miami, FL
  95% sensitivity
  97% specificity
• Externally validated on 114 ER patients seen in Richmond, VA
  18.1% sensitivity??? 82.4% specificity
Centor RM et al. NEJM 1984;310(9):577-580.
(Figure: FL vs. VA results - Centor RM et al. NEJM 1984;310(9):577-580.)
Generalization-Validation
• Internal validation
  - restricted to a single data set
  - data splitting (or cross-validation)
• Temporal validation
  - evaluation on a second data set from the same population
• External validation
  - evaluation on data from other populations, perhaps by different investigators
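Internal validation by data splitting can be sketched as a simple random hold-out (the function name and split fraction are illustrative; in practice repeated splits or cross-validation are preferred):

```python
import random

def train_test_split(data, test_frac=0.3, seed=0):
    """Random split for internal validation: fit on train, evaluate on test."""
    rng = random.Random(seed)
    shuffled = data[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

subjects = list(range(100))
train, test = train_test_split(subjects)
print(len(train), len(test))  # 70 30
```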
Predictive Models
• A predictive model is a model for making predictions about future events
• It is usually built from a number of predictors and a response (or outcome) variable
When Does a Diagnostic Test Work??
Does the diagnostic test add anything to what is already known?
• Example: A diagnostic test for macular degeneration would need to show that it is better than just using a person’s age.
Covariate Modeling: Age in Home Macular Perimeter
Problem: If you sample from subjects with MD and those without, there is likely to be an age difference that could confound the assessment of HMP. This is a bias!!!!
• Question: Is HMP just a surrogate for age?
• Solution: Build a predictive model - a logistic model - using age and HMP (visual field functional defects) to predict risk of MD, and see if HMP adds anything.
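A sketch of the idea with simulated (hypothetical) data: here the outcome risk is driven by age alone, so a logistic model fitted by stochastic gradient ascent should assign the uninformative second marker a coefficient near zero (all names, data, and coefficients are illustrative, not from the HMP study):

```python
import math
import random

def fit_logistic(xs, ys, lr=0.1, epochs=300):
    """Logistic regression via stochastic gradient ascent.
    xs: list of feature lists; ys: 0/1 outcomes. Returns [intercept, w1, w2, ...]."""
    w = [0.0] * (len(xs[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            p = 1.0 / (1.0 + math.exp(-z))
            w[0] += lr * (y - p)
            for j, xj in enumerate(x):
                w[j + 1] += lr * (y - p) * xj
    return w

# Hypothetical data: risk depends on (scaled) age; 'marker' is pure noise
rng = random.Random(1)
xs, ys = [], []
for _ in range(200):
    age = rng.uniform(-1, 1)
    marker = rng.uniform(-1, 1)     # independent of outcome by construction
    p = 1.0 / (1.0 + math.exp(-3 * age))
    xs.append([age, marker])
    ys.append(1 if rng.random() < p else 0)

w = fit_logistic(xs, ys)
print(abs(w[1]) > abs(w[2]))        # age coefficient should dominate
```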
Summary: Medical Applications
• Screening
  - triage => prioritization
• Diagnosis
  - triage
  - management and decision making
  - test selection
• Prognosis => Prediction
  - management and decision making
  - informing patients and their families
  - risk adjustment
  - eligibility in clinical trials
Thank You
????????
References
1. Weinstein S, et al. Clinical Evaluation of Diagnostic Tests. AJR 2005.
2. Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. New York: Oxford University Press; 2003.