Diagnostic Testing & Predictive Models
John Kwagyan, PhD
Howard University College of Medicine
Design, Biostatistics & Population Studies
GHUCCTS
"Physicians must be content to end not in certainties, but rather in statistical probabilities. The physician thus has a right to feel certain, within statistical constraints, but never cocksure. Absolute certainty remains for some theologians - and like-minded physicians."
Am J Cardiol 1975;36:592-62
Objective
To understand the usage of diagnostic measures and screening tools
Outline
• Examples
• Why/What Diagnostic Testing
• Measures of Diagnostic accuracy
• ROC Curves
• Adaptation of Diagnostic/Screening Tools
• Predictive Models
EXAMPLES
4P's Plus Screening Instrument - Substance Abuse in Pregnant Women
What is a positive assessment??
J. Perinatology 2005
Index to Predict Relapse in Asthma
Factor Score 0 Score 1
Pulse <120 >= 120
Respiration <30 >=30
Pulsus Paradoxus <18 >=18
Peak Flow Rate >120 <=120
Dyspnea Absent or mild Moderate or Severe
Accessory muscle use Absent or mild Moderate or severe
Wheezing Absent or mild Moderate or severe
Fischl et al, NEJM 1981
Positive Test => Score of 4 or more
Study Design: A total of 228 pregnant women underwent screening. Analyses of reliability, sensitivity, specificity, and positive and negative predictive validity were conducted.
Result: Overall reliability for the five-item measure was 0.62. Seventy-four (32.5%) of the women had a positive screen. Sensitivity and specificity were very good, at 87% and 76%, respectively. Positive predictive validity was low (36%); negative predictive validity was quite high (97%).
Conclusion: The 4P's Plus reliably and effectively screens pregnant women for risk of substance use, including those women typically missed by other perinatal screening methodologies.
Validation of the 4P's Plus© screen for substance use in pregnancy validation of the 4P's Plus- J. Perinatology (2007)
Maternal Biochemical Serum Screening for Down
Syndrome in Pregnancy With HIV Infection
To estimate the influence of HIV infection and antiretroviral therapy on maternal serum marker levels and the false-positive rate with biochemical maternal serum screening for Down syndrome.
Obstetrics & Gynecology
Inability to Predict Relapse in Acute Asthma - NEJM 1984;310(9)
• Fischl & Co. developed an index to predict relapse in patients with acute asthma.
• Based on data from ER patients in Miami, FL
• Reported 95% sensitivity and 97% specificity
• Dramatic drop in accuracy when externally validated on patients in Richmond, VA?????
Other Examples
• SSAGA for alcohol dependence
• Genetic Screening for hereditary disease
• etc.
Why Diagnostic Testing
• Accurate screening/diagnosis of a health condition is often a first step towards its prevention or control
• Need for fast, inexpensive and RELIABLE tools
Purpose of Diagnostic Testing
• A (binary) diagnostic test is designed to determine whether a target condition is present or absent in a subject from the intended use population.
• the target condition can refer to
- a particular disease
- a disease stage
- health status or condition
that should prompt clinical action, such as the initiation, modification or termination of treatment, counseling, etc
Test Scale
• Binary - presence or absence of disease (the underlying measure is often continuous)
• Continuous (quantitative)
  - biomarkers for cancer, e.g., PSA measured as serum concentration
  - creatinine for kidney malfunction
  - blood sugar for diabetes
  - cholesterol for dyslipidemia
• Ordinal
  - clinical symptoms: moderate, severe, highly severe
  - index score: 0, 1, 2, 3, 4, 5
Other Test Scales
• Likert-type rating - highly disagree, disagree, neutral, agree, highly agree
• Nominal ** - genotype groups - ApoE Genotypes: E2/E2, E2/E3, E2/E4, E3/E3, E3/E4, E4/E4
What is Diagnostic Testing?
• Evaluation of a (new) test to determine whether a target condition is present by comparison with a benchmark!
• Evaluation of the ability of a test to classify subjects as diseased or disease-free
• For non-binary scales, a classification rule is set by a threshold
  - PSA > 4.0 ng/ml
  - Blood glucose > 126 mg/dL
  - BP > 140/90 mmHg (used to be 160/95 mmHg; contemplating 130/85 mmHg???)
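A minimal sketch of such a threshold rule (the function name is illustrative; the cutoffs follow the slide, e.g., PSA > 4.0 ng/ml and glucose > 126 mg/dL):

```python
def classify(value, cutoff, positive_above=True):
    """Dichotomize a continuous test result at a threshold."""
    return value > cutoff if positive_above else value < cutoff

# Cutoffs from the slide (units: ng/ml and mg/dL)
print(classify(5.1, 4.0))    # PSA of 5.1 ng/ml -> True (positive)
print(classify(110, 126))    # fasting glucose of 110 mg/dL -> False (negative)
```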
"That we are in the midst of crisis is now well understood. Our nation is at war… Our economy is badly weakened… Homes have been lost; jobs shed; businesses shuttered. Our health care is too costly; our schools fail too many…
These are the indicators of crisis, subject to data and statistics."
Pres. Barack Obama (Inaugural speech)
Benchmarks for Comparison
1. comparison to a reference (Gold) standard
- considered to be the best available method for establishing the presence or absence of the target condition
- can be a single method, or a combination of methods, including clinical follow-up.
2. comparison to a non-reference standard
- method other than a reference (Gold) standard.
Note!!!: The choice of comparative method will determine which performance measures may be reported
Some Conventional Tests
• Bacterial cultures
  - strep throat, urinary tract infection, meningitis, etc.
• Imaging technology
  - X-ray for bone fracture
  - CT scans for brain injury
  - MRI for brain injury
• Biochemical markers
  - serum creatinine for kidney dysfunction
  - serum bilirubin for liver dysfunction
  - blood glucose for diabetes
  - blood test for HIV
Other Conventional Tests
• Expert judgment
presence or absence of a heart murmur
• Response to Questionnaire !!!!
substance abuse
• Expert Interview or Observation
schizophrenia, bipolar disorder, major depression
• Radiologists score of mammograms
no cancer, benign cancer, possible malignancy, malignancy
Measures of Accuracy
Validation
Validation
• It is the evaluation of the accuracy of the test
• can only be established by comparing with the Gold Standard
• validity is measured by sensitivity and specificity
The extent to which a test measures what it is supposed to measure!!!
Measures for Accuracy
• Sensitivity => True Positive Rate (TPR)
• Specificity => True Negative Rate (TNR)
• False Negative Rate (FNR) = 1 - sensitivity
• False Positive Rate (FPR) = 1 - specificity
• Predictive values
  - positive predictive value
  - negative predictive value
• Diagnostic Likelihood Ratios
  LR+ = TPR/FPR = SenS/(1-SpeC)
  LR- = FNR/TNR = (1-SenS)/SpeC
• ROC Curves
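These measures all follow directly from the four cells of a 2 x 2 table; a minimal sketch (the function name is illustrative), evaluated here on the Coronary Artery Surgery Study counts that appear later in the deck:

```python
def accuracy_measures(tp, fp, fn, tn):
    """Sensitivity, specificity, and likelihood ratios from 2x2 counts."""
    sens = tp / (tp + fn)              # true positive rate
    spec = tn / (tn + fp)              # true negative rate
    lr_pos = sens / (1 - spec)         # LR+ = TPR/FPR
    lr_neg = (1 - sens) / spec         # LR- = FNR/TNR
    return sens, spec, lr_pos, lr_neg

# Coronary Artery Surgery Study: 815 TP, 115 FP, 208 FN, 327 TN
sens, spec, lr_pos, lr_neg = accuracy_measures(815, 115, 208, 327)
print(round(sens, 2), round(spec, 2))  # 0.8 0.74
```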
Sensitivity & Specificity
• Sensitivity is the ability of a test to correctly classify an individual as ‘diseased’.
Estimated as the proportion of subjects with the target condition in whom the test is positive
• Specificity is the ability of a test to correctly classify an individual as ‘disease-free’.
Estimated as the proportion of subjects without the target condition in whom the test is negative
Best illustrated using a 2 x 2 table!!!!
Diagnostic Testing
True Disease Status
Test Result    Diseased              Disease-free
Positive       No Error              Error I
Negative       Error II              No Error
               N1 = total diseased   N2 = total disease-free
Sensitivity & Specificity
True Disease Status
Test Result    Diseased          Disease-free
Positive       True-positive     False-positive
Negative       False-negative    True-negative

• Sensitivity ~ ability of a test to detect disease when present => True-positive fraction
• Specificity ~ ability to indicate disease-free when absent => True-negative fraction
Consequence of Diagnostic Errors
• False negative errors, i.e., missing disease that is present
  - can result in people foregoing needed treatment for the disease
  - the consequence can be as serious as death
• False positive errors, i.e., falsely indicating disease
  - disease-free subjects are subjected to unnecessary work-up procedures or even treatment
  - negative impacts include personal inconvenience and/or unnecessary stress, anxiety, etc.
Estimating Sensitivity & Specificity
True Disease Status
Test        Diseased   Disease-free   Total
Positive    TP         FP             T+
Negative    FN         TN             T-
Total       T_D        T_Df           T_N

(Estimate) SenS = TP/T_D
(Estimate) SpeC = TN/T_Df
False Negative Rate = FN/T_D = 1 - SenS
False Positive Rate = FP/T_Df = 1 - SpeC
Example: Coronary Artery Surgery Study
CAD Gold Standard: Arteriography

EST         Diseased   Disease-free   Total
Positive    815        115            930
Negative    208        327            535
Total       1023       442            1465

SenS = 815/1023 = 0.80, CI = [0.77, 0.82]
FNF = 208/1023 = 0.20, CI = [0.18, 0.23]
SpeC = 327/442 = 0.74, CI = [0.70, 0.78]
FPF = 115/442 = 0.26, CI = [0.22, 0.30]
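The intervals reported for this example can be reproduced with a normal-approximation (Wald) confidence interval for a proportion; a sketch (for small samples or extreme proportions, a Wilson or exact interval is preferable):

```python
from math import sqrt

def prop_ci(successes, total, z=1.96):
    """Normal-approximation (Wald) 95% confidence interval for a proportion."""
    p = successes / total
    half = z * sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half), min(1.0, p + half)

# Coronary Artery Surgery Study: 815 true positives among 1023 diseased
sens, lo, hi = prop_ci(815, 1023)
print(round(lo, 2), round(hi, 2))  # 0.77 0.82
```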
Absolute certainty remains for some theologians - and like-minded physicians!!!
If Ever There is a Perfect Test! (Ideal Case)

True Disease Status
Test        Diseased     Disease-free   Total
Positive    TP           FP = 0         T+
Negative    FN = 0       TN             T-
Total       T_D (=TP)    T_Df (=TN)     T_N

SenS = TP/T_D = 100%
SpeC = TN/T_Df = 100%
FNF = FN/T_D = 1 - SenS = 0%
FPF = FP/T_Df = 1 - SpeC = 0%
Uninformative (Useless) Tests
• Test is uninformative if the test result is unrelated to disease status
• the probability distributions of the measure are the same in the diseased and disease-free populations
• for uninformative tests, SenS = 1 - SpeC, i.e., TPF = FPF
Ex: Exercise Stress Test to determine Diabetes, HIV, etc
• Test is informative if: SenS + SpeC > 1
Clinical Application
Detection of Primary Angle Closure Glaucoma
Gold Standard = Gonioscopy

Test                   SenS (%)   SpeC (%)
Intraocular Pressure   47         92
Torch Light Test       80         70
van Herick Test        61.9       89.3
Indian J Ophthalmology 2008;56:45-50
Which test should we use to screen for PACG??
Sensitivity vs. Specificity: Rule Out & Rule In
• Tests with a high degree of sensitivity have a low FNR
  - they ensure that not many true cases of the disease are missed
• A screening test used to "rule out" a diagnosis should have a high degree of sensitivity
• Tests with a high degree of specificity have a low FPR
  - they ensure that not many disease-free patients are misdiagnosed
• A confirmatory test used to "rule in" a diagnosis should have a high degree of specificity
Clinical Application
Detection of Primary Angle Closure Glaucoma
Gold Standard = Gonioscopy

Test                   SenS (%)   SpeC (%)
Intraocular Pressure   47         92
Torch Light Test       80         70
van Herick Test        61.9       89.3
Which test should we use to screen for PACG?? How about Combining the tests !!!!
Combining Tests!
• 2 ways of performing combinations: in parallel, in series
• 2 rules for combination: "the OR rule" and "the AND rule"
Combining 2 Tests
• "OR rule": Test is positive if either test is positive; negative if both are negative
  SenS (Combo Test) = SenS1 + SenS2 - SenS1*SenS2
  SpeC (Combo Test) = SpeC1*SpeC2
• "AND rule": Test is positive only if both A and B are positive; negative if either or both are negative
  SenS (Combo Test) = SenS1*SenS2
  SpeC (Combo Test) = SpeC1 + SpeC2 - SpeC1*SpeC2
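A sketch of the two combination rules, applied to the Torch Light and van Herick tests; note that these formulas assume the two tests err independently given true disease status, which real test pairs may not satisfy:

```python
def combine_or(sens1, spec1, sens2, spec2):
    """Parallel testing, OR rule: positive if either test is positive."""
    return sens1 + sens2 - sens1 * sens2, spec1 * spec2

def combine_and(sens1, spec1, sens2, spec2):
    """Serial testing, AND rule: positive only if both tests are positive."""
    return sens1 * sens2, spec1 + spec2 - spec1 * spec2

# Torch Light Test (SenS 0.80, SpeC 0.70) + van Herick Test (0.62, 0.893)
sens_or, spec_or = combine_or(0.80, 0.70, 0.62, 0.893)
print(round(sens_or, 3), round(spec_or, 3))  # 0.924 0.625
```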
Clinical Application: Combining Tests
Detection of Primary Angle Closure Glaucoma
Gold Standard = Gonioscopy

Test                       SenS (%)   SpeC (%)
Torch Light Test           80         70
van Herick Test            62         89.3
Combined Test (OR rule)    92.4       62.3
Combined Test (AND rule)   49.6       97
Combining Tests
• "the OR rule" increases SenS ~ useful for screening tests to “rule out” diagnosis
• “the AND rule” increases SpeC ~ useful for confirmatory tests to “rule in” diagnosis
Predictive Values
Daring Clinical Questions
How likely is the disease, given the test result??
- what is the likelihood of disease when test is positive? - what is likelihood of non-disease when test is negative?
Answers to these questions are known as the predictive values
Predictive Values
• +PV - fraction of test positives who are diseased
  PPV = Probability(diseased | positive test)
• -PV - fraction of test negatives who are disease-free
  NPV = Probability(disease-free | negative test)
Predictive Values
True Disease Status
Test        Diseased   Disease-free   Total
Positive    TP         FP             T+
Negative    FN         TN             T-
Total       T_D        T_Df           T_N

Positive Predictive Value = TP/T+
Negative Predictive Value = TN/T-
Predictive Values: Example

True Disease Status
Test        Diseased   Disease-free   Total
Positive    815        115            930
Negative    208        327            535
Total       1023       442            1465

PPV = 815/930 = 87%
NPV = 327/535 = 61%
Predictive Values
• A perfect test will predict perfectly, i.e., PPV = 1, NPV=1
• Predictive values depend on the prevalence of the disease
  - PPV decreases with decreasing prevalence; a low PPV may simply be a result of low prevalence
  - NPV decreases with increasing prevalence
• Useless Test if: PPV = prevalence and NPV = 1 - prevalence
• PVs are not used to quantify the inherent accuracy of the test
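The prevalence dependence follows from Bayes' theorem; a sketch borrowing the 4P's Plus sensitivity and specificity (87%/76%) at two illustrative prevalences (the function name and the prevalence values are assumptions for the example):

```python
def ppv_npv(sens, spec, prev):
    """Predictive values from sensitivity, specificity, and prevalence (Bayes)."""
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

# Same test characteristics, two very different prevalences
for prev in (0.325, 0.01):
    ppv, npv = ppv_npv(0.87, 0.76, prev)
    print(prev, round(ppv, 2), round(npv, 2))
```

At 1% prevalence the same test's PPV collapses to a few percent while NPV approaches 1, which is exactly the point the slide makes.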
Attributes of Measures

                                  Classification Probabilities                                Predictive Values
Perfect Test                      SenS = 1, SpeC = 1                                          PPV = 1, NPV = 1
Useless Test                      SenS = 1 - SpeC                                             PPV = ρ, NPV = 1 - ρ
Context                           Accuracy                                                    Clinical prediction
Question addressed                To what degree does the test reflect the true disease state?   How likely is the disease given the test result?
Affected by disease prevalence?   No                                                          Yes
Study Design: A total of 228 pregnant women underwent screening.
Result: Seventy-four (32.5%) had a positive screen = prevalence!!!!
Sensitivity = 87% => missed 13% of diseased
Specificity = 76% => incorrectly classified 24% of the disease-free as diseased
Positive predictive validity = 36% => fraction of +ve tests with disease
Negative predictive validity = 97% => fraction of -ve tests without disease
Conclusion: The 4P's Plus reliably and effectively screens pregnant women for risk of substance use, including those women typically missed by other perinatal screening methodologies.
Validation of the 4P's Plus© screen for substance use in pregnancy - J. Perinatology (2007)
ROC Curves
Non-Binary Scales
ROC Curves
• ~ for evaluating tests that yield results on a non-binary scale, with classification set by a threshold
  BP > 140/90 mmHg for Hypertension ~ used to be 160/95 mmHg ~ contemplating 130/85 mmHg
• BP values can fluctuate in any individual, healthy or diseased; there will be some overlap of values between the diseased and disease-free populations
• The choice of a threshold depends on the trade-off that is acceptable between failing to detect disease and falsely identifying disease
• The ROC curve is a device that describes the range of trade-offs that can be achieved
ROC Curves
• Plot of SenS against 1-SpeC for all possible thresholds
• It is a visual representation of the global performance of the test
• ROC plot shows the trade-off of sensitivity and specificity
• Test is useless (uninformative) if, for every threshold c, SenS = 1 - SpeC
Construction of ROC Curves
Calculate SenS and 1-SpeC for all possible cutoff points

Cutoff   SenS   SpeC   1-SpeC   Comments
0        100    0      100      Ideal case
…
110      98     20     80
120      95     40     60
130      92     60     40
140      78     80     20
150      55     90     10
160      40     92     8
…
500      0      100    0
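From such a threshold table, the ROC curve's area (AUC) can be approximated by the trapezoidal rule; a sketch using the (1-SpeC, SenS) pairs above rescaled to fractions, with the (0,0) and (1,1) endpoints added (the function name is illustrative):

```python
def roc_auc(points):
    """Trapezoidal area under an ROC curve given (FPR, TPR) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# (1-SpeC, SenS) pairs from the threshold table, as fractions
points = [(0.0, 0.0), (0.08, 0.40), (0.10, 0.55), (0.20, 0.78),
          (0.40, 0.92), (0.60, 0.95), (0.80, 0.98), (1.0, 1.0)]
print(round(roc_auc(points), 2))  # 0.84
```

An uninformative test traces the diagonal and has AUC 0.5; a perfect test has AUC 1.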
Attributes of ROC curves
• Provides complete description of potential performance of a test
• Facilitates comparing and combining information across studies of the same test
• Guides the choice of threshold in application
• Provides mechanism for relevant comparison between different tests
Reporting of Estimates: Variability
• Point estimates of sensitivity, specificity, predictive values, are not sufficient.
• Confidence intervals reflect the uncertainty of the estimates, and should be reported.
• Focus is on confidence intervals to characterize the performance of the test and not on hypothesis testing
Sources of Bias
Diagnostic tests are subject to an array of biases:
• Verification bias
  - the study selectively includes subjects for verification of disease status
• Imperfect reference standard
  - error in the reference standard!
• Spectrum bias
  - study subjects may not be fully representative of the population, i.e., important subgroups are missing
Measures of Agreement
Reliability
Reliability vs. Validity
• Sometimes the goal is to estimate the validity (accuracy) of ratings in the absence of a "gold standard."
• Other times one merely wants to know the consistency of ratings made by different raters.
• In some cases, the issue of accuracy may even have no meaning - for example, ratings may concern opinions, attitudes, or values.
Measures of Agreement
• Reliability Coefficients
  - positive percent agreement
  - negative percent agreement
• Overall agreement
  - overall % agreement
  - Kappa - how much agreement is due to chance???
Other Measures of Agreement
• B-Statistics
• McNemar Test
• Latent class models
• Bayesian methods
Agreement
Test 2
Test 1 Positive Negative
Positive AGREEMENT DISAGREEMENT
Negative DISAGREEMENT AGREEMENT
Raw Agreement

             Test 1
Test 2       Positive   Negative   Total
Positive     40         19         59
Negative     1          512        513
Total        41         531        572

Positive % agreement = 40/41 = 97.6%
Negative % agreement = 512/531 = 96.4%
Overall % agreement = (40+512)/572 = 96.5%
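Kappa, listed above among the agreement measures, corrects the overall agreement for the agreement expected by chance alone; a sketch with purely illustrative counts (not the slide's data):

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 agreement table:
    a = both positive, b = test2+/test1-, c = test2-/test1+, d = both negative."""
    n = a + b + c + d
    po = (a + d) / n                                       # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2    # chance agreement
    return (po - pe) / (1 - pe)

# Illustrative counts: 45 both positive, 5, 15, 35 both negative
print(round(cohens_kappa(45, 5, 15, 35), 2))  # 0.6
```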
Limitation of Agreement Measures
• Reliability is not proof of validity
  - two tests can report the same readings, but both be wrong
• It does not tell how the disagreements occurred: - whether the positive and negative results are evenly distributed
• Does not tell the extent to which the agreement occurs by chance
Generalization
Cultural Adaptation
Generalization
• Ultimate question(s)!!!!
- Does the test perform well for new, unseen patients?
- Does the test perform well in other populations?
Inability to Predict Relapse in Acute Asthma - NEJM 1984;310(9)
• Fischl & Co. developed an index to predict relapse in patients with acute asthma.
• Based on data from ER patients in Miami, FL
• Reported 95% sensitivity and 97% specificity
• Dramatic drop in accuracy when externally validated on patients in Richmond, VA?????
Inability to Predict Relapse in Acute Asthma
• Index to predict relapse in patients with acute asthma
• Based on data from 205 ER patients in Miami, FL
  95% sensitivity
  97% specificity
• Externally validated on 114 ER patients seen in Richmond, VA
  18.1% sensitivity??? 82.4% specificity
Centor RM et al. NEJM 1984;310(9):577-580.
(Figure: FL vs. VA results - Centor RM et al. NEJM 1984;310(9):577-580.)
Generalization-Validation
• Internal validation
  - restricted to a single data set
  - data splitting (or cross-validation)
• Temporal validation
  - evaluation on a second data set from the same population
• External validation
  - evaluation on data from other populations, perhaps by different investigators
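Internal validation by data splitting can be sketched as a simple random hold-out (the function name and split fraction are illustrative; in practice repeated splits or cross-validation are preferred):

```python
import random

def train_test_split(data, test_frac=0.3, seed=0):
    """Random split for internal validation: fit on train, evaluate on test."""
    rng = random.Random(seed)
    shuffled = data[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

subjects = list(range(100))
train, test = train_test_split(subjects)
print(len(train), len(test))  # 70 30
```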
Predictive Models
• A predictive model is a model for making predictions about future events
• It is usually built from a number of predictors and a response (or outcome) variable
When Does a Diagnostic Test Work??
Does the diagnostic test add anything to what is already known?
• Example: A diagnostic test for macular degeneration would need to show that it is better than just using a person’s age.
Covariate Modeling: Age in Home Macular Perimeter
Problem: If you sample from subjects with MD and those without, there is likely to be an age difference that could confound the assessment of HMP. This is a bias!!!!
• Question: Is HMP just a surrogate for age?
• Solution: Build a predictive model - a logistic model - using age and HMP (visual field functional defects) to predict risk of MD, and see if HMP adds anything.
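A sketch of the idea with simulated (hypothetical) data: here the outcome risk is driven by age alone, so a logistic model fitted by stochastic gradient ascent should assign the uninformative second marker a coefficient near zero (all names, data, and coefficients are illustrative, not from the HMP study):

```python
import math
import random

def fit_logistic(xs, ys, lr=0.1, epochs=300):
    """Logistic regression via stochastic gradient ascent.
    xs: list of feature lists; ys: 0/1 outcomes. Returns [intercept, w1, w2, ...]."""
    w = [0.0] * (len(xs[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            p = 1.0 / (1.0 + math.exp(-z))
            w[0] += lr * (y - p)
            for j, xj in enumerate(x):
                w[j + 1] += lr * (y - p) * xj
    return w

# Hypothetical data: risk depends on (scaled) age; 'marker' is pure noise
rng = random.Random(1)
xs, ys = [], []
for _ in range(200):
    age = rng.uniform(-1, 1)
    marker = rng.uniform(-1, 1)     # independent of outcome by construction
    p = 1.0 / (1.0 + math.exp(-3 * age))
    xs.append([age, marker])
    ys.append(1 if rng.random() < p else 0)

w = fit_logistic(xs, ys)
print(abs(w[1]) > abs(w[2]))        # age coefficient should dominate
```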
Summary: Medical Applications
• Screening
  - triage => prioritization
• Diagnosis
  - triage
  - management and decision making
  - test selection
• Prognosis => Prediction
  - management and decision making
  - informing patients and their families
  - risk adjustment
  - eligibility in clinical trials
Thank You
????????
References
1. Weinstein S, et al. Clinical Evaluation of Diagnostic Tests. AJR 2005.
2. Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. New York: Oxford University Press; 2003.