Diagnostic test accuracy reviews. Advanced Meta-analysis: dealing with heterogeneity and test comparisons.

The University of Sydney

School of Public Health

Petra Macaskill

Screening and Test Evaluation Program

School of Public Health

University of Sydney

Co-convenor, Cochrane Screening and Diagnostic Tests Methods Group

Outline

• Background

• Descriptive Analyses (available in RevMan)
  – Graphical displays
  – Summary ROC
  – Exploring heterogeneity

• Hierarchical Models (not available in RevMan; requires statistical expertise)
  – Rationale for using hierarchical models
  – Choice of model:
    • Bivariate
    • HSROC (Rutter and Gatsonis model)
  – Investigating heterogeneity
  – Index test comparisons

Systematic Review of Diagnostic Test Performance

Major steps covered in the Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy (http://srdta.cochrane.org/handbook-dta-reviews):

• Objective of the review (e.g. performance of a single test, exploring heterogeneity in test performance, test comparisons)
• Locating and selecting studies
• Assessing study quality (QUADAS2 updates in preparation)
• Extracting data (to be updated)
• Meta-analysis
• Interpretation of the results (in preparation)

Chapter 10: Analysing and Presenting Results. Petra Macaskill, Constantine Gatsonis, Jonathan Deeks, Roger Harbord, Yemisi Takwoingi.

Single index test:

• Remains a common form of systematic review
• Heterogeneity in test performance between studies is likely to be present, and reasons for it should be explored.

Test comparisons:

• Increasing in importance and relevance
• Methods for investigating heterogeneity can be applied
• Ideally, test comparisons should focus on studies that directly compare the tests of interest

Underlying Concepts

• Reference test (binary): “true” disease status, i.e. target condition
• Index test (continuous, ordinal or binary)
• Test threshold
• Sensitivity and specificity
• Likelihood ratios
• ROC curve

Test threshold: Individual Study Level

[Figure, a single study: the diseased and non-diseased groups separated by a threshold, with TP and FP marked. Moving the threshold in one direction increases both TP and FP; moving it in the other direction decreases both.]

A plot of sensitivity against 1-specificity across the range of thresholds results in a receiver operating characteristic (ROC) curve.
ROC curves: Individual Study Level

[Figure: four panels. Recoverable axis labels are test measurement (0 to 120), sensitivity (0.0 to 1.0) and specificity (0.0 to 1.0).]

Data extraction

Most studies report test sensitivity and specificity at a threshold(s), or provide sufficient information to construct the following 2 x 2 table at the threshold(s):

                     “true” disease status
                        +        -
test result     +       TP       FP
                -       FN       TN

From this table we can compute

True positive rate (tpr):

$$\text{sensitivity} = \frac{TP}{TP + FN} = \frac{TP}{n_D}$$

False positive rate (fpr):

$$1 - \text{specificity} = \frac{FP}{FP + TN} = \frac{FP}{n_{\bar{D}}}$$
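The quantities listed under Underlying Concepts all follow from the four cells of the 2 x 2 table. A minimal sketch in Python, using the counts for Aho 1999 from the rheumatoid factor example in these slides; the function name and dictionary keys are illustrative assumptions, not part of RevMan or any package.

```python
def accuracy_measures(tp, fp, fn, tn):
    """Accuracy measures for one study's 2 x 2 table.

    Assumes fp > 0 and tn > 0; a table with an empty cell would need
    separate handling before the ratios below are computed.
    """
    sens = tp / (tp + fn)          # true positive rate (tpr)
    spec = tn / (tn + fp)
    fpr = fp / (fp + tn)           # false positive rate = 1 - specificity
    return {
        "sensitivity": sens,
        "specificity": spec,
        "fpr": fpr,
        "lr_pos": sens / fpr,          # positive likelihood ratio
        "lr_neg": (1 - sens) / spec,   # negative likelihood ratio
    }


# Aho 1999 (rheumatoid factor example): TP=64, FP=16, FN=27, TN=153
aho = accuracy_measures(tp=64, fp=16, fn=27, tn=153)
print({name: round(value, 2) for name, value in aho.items()})
```

The (fpr, sensitivity) pair is the point that the study contributes to a plot in ROC space.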

Reasons for variability in test accuracy between studies

• Random sampling error

For each study, the estimated sensitivity and specificity are subject to sampling error. The larger the sample size, the smaller the sampling error, as shown by the confidence intervals in a forest plot.

Because sensitivity and specificity are both proportions, the within-study sampling error is straightforward to estimate using the binomial distribution.
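To illustrate how sampling error shrinks with sample size, the sketch below computes a binomial confidence interval for a sensitivity. It uses the Wilson score interval, chosen here only because it needs nothing beyond the standard library; it is an assumption of this example and not necessarily the interval used for the forest plots in these slides. The second call uses a hypothetical study ten times larger with the same proportion.

```python
import math


def wilson_interval(events, n, z=1.96):
    """Wilson score interval for a binomial proportion (approx. 95% when z=1.96)."""
    p = events / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Sensitivity in Aho 1999 (rheumatoid factor example): 64 of 91 diseased test positive
small = wilson_interval(64, 91)
# Hypothetical study with the same sensitivity but ten times as many diseased
large = wilson_interval(640, 910)

print("n=91:  %.2f to %.2f" % small)
print("n=910: %.2f to %.2f" % large)
```

The interval for the larger study is markedly narrower, which is the pattern the forest plot makes visible.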


• True underlying differences between studies
  – In diagnostic reviews, sampling error is unlikely to account for all of the variability (scatter) between studies.
  – Additional heterogeneity in test performance between studies is likely to occur for other reasons, including differences in:
    • Cut-point chosen to define a positive test (threshold effect)
    • Spectrum of disease
    • Clinical setting
    • Study design
    • etc.

Even if all studies use the same cut-point, sensitivity and specificity are expected to vary between studies.

Graphical Displays

Descriptive plots should include:

– Forest plot showing sensitivity and specificity for each study and the numbers on which these estimates are based for each study
– Scatter plot showing (1-specificity, sensitivity) pair for each study in ROC space. The size of each marker should ideally reflect the numbers in both the diseased and non-diseased groups.

RevMan provides facilities for:

• graphical displays (improvements made in version 5.2)
• summary ROC curve estimation based on the Moses-Littenberg method
• descriptive exploration of heterogeneity using subgroup analyses

Example: Rheumatoid Factor as a marker for Rheumatoid Arthritis

50 studies taken from the review conducted by Nishimura (2007) of rheumatoid factor (RF) as a marker for rheumatoid arthritis (RA).

The cut-point for test positivity for RF varied between studies, ranging from 3 to 100 U/ml (not all studies reported the cut-point).

The reference standard was based on the 1987 revised American College of Rheumatology (ACR) criteria or clinical diagnosis.

Note: RF contributes to the ACR criteria so there is some risk of bias in this analysis.

Study                      TP    FP    FN    TN   Method                Sensitivity         Specificity
Aho 1999                   64    16    27   153   LA                    0.70 [0.60, 0.79]   0.91 [0.85, 0.94]
Anuradha 2005             482     2    82   153   LA                    0.85 [0.82, 0.88]   0.99 [0.95, 1.00]
Banchuin 1992              36     6    41   313   ELISA                 0.47 [0.35, 0.58]   0.98 [0.96, 0.99]
Bas 2003                  143    43    53   196   ELISA                 0.73 [0.66, 0.79]   0.82 [0.77, 0.87]
Berthelot 1995             80    50    39    45   LA                    0.67 [0.58, 0.76]   0.47 [0.37, 0.58]
Bizzaro 2001               61    36    37   196   Nephelometry          0.62 [0.52, 0.72]   0.84 [0.79, 0.89]
Bombardieri 2004           27     6     3    33   Nephelometry          0.90 [0.73, 0.98]   0.85 [0.69, 0.94]
Carpenter 1989             60     8    20   119   ELISA                 0.75 [0.64, 0.84]   0.94 [0.88, 0.97]
Choi 2005                 261    54    63   197   LA                    0.81 [0.76, 0.85]   0.78 [0.73, 0.83]
Cordonnier 1996            20     2    29    18   LA                    0.41 [0.27, 0.56]   0.90 [0.68, 0.99]
Das 2004                   42    46    14   127   Nephelometry          0.75 [0.62, 0.86]   0.73 [0.66, 0.80]
Davis 1989                 18     3    31    25   ELISA                 0.37 [0.23, 0.52]   0.89 [0.72, 0.98]
de Bois 1996                8     8     0    31   ELISA                 1.00 [0.63, 1.00]   0.79 [0.64, 0.91]
De Rycke 2004              93    28    25   118   LA                    0.79 [0.70, 0.86]   0.81 [0.73, 0.87]
Despres 1994              143    39    63   130   LA                    0.69 [0.63, 0.76]   0.77 [0.70, 0.83]
Dubucquoi 2004             84    41    56    90   ELISA                 0.60 [0.51, 0.68]   0.69 [0.60, 0.77]
Fernandez-Suarez 2005      30     2    23    73   Nephelometry          0.57 [0.42, 0.70]   0.97 [0.91, 1.00]
Girelli 2004               32    29     3    13   Nephelometry          0.91 [0.77, 0.98]   0.31 [0.18, 0.47]
Goldbach-Mansky 2000       70    39    36    93   Nephelometry          0.66 [0.56, 0.75]   0.70 [0.62, 0.78]
Gomes-Daudrix 1994         48     1    40    99   ELISA                 0.55 [0.44, 0.65]   0.99 [0.95, 1.00]
Greiner 2005               75    42    12   191   Nephelometry          0.86 [0.77, 0.93]   0.82 [0.76, 0.87]
Grootenboer-Mignot 2004    64    18    29    73   Nephelometry          0.69 [0.58, 0.78]   0.80 [0.71, 0.88]
Hitchon 2004               32    10     9    13   Nephelometry          0.78 [0.62, 0.89]   0.57 [0.34, 0.77]
Jansen 2003               130     8   128   113   Nephelometry          0.50 [0.44, 0.57]   0.93 [0.87, 0.97]
Jonsson 1998               50    14    20   191   ELISA                 0.71 [0.59, 0.82]   0.93 [0.89, 0.96]
Kamali 2005                20    32    26    25   LA                    0.43 [0.29, 0.59]   0.44 [0.31, 0.58]
Kwok 2005                  77    16    52    52   Nephelometry          0.60 [0.51, 0.68]   0.76 [0.65, 0.86]
Lee 2003                   73    22    29    90   LA                    0.72 [0.62, 0.80]   0.80 [0.72, 0.87]
Lopez-Hoyos 2004           36     3     5    70   Nephelometry          0.88 [0.74, 0.96]   0.96 [0.88, 0.99]
Nell 2005                  56    11    46    87   Not reported          0.55 [0.45, 0.65]   0.89 [0.81, 0.94]
Quinn 2006                115    53    67    63   Not reported          0.63 [0.56, 0.70]   0.54 [0.45, 0.64]
Rantapaa-Dahlqvist 2003    49    23    28   359   ELISA                 0.64 [0.52, 0.74]   0.94 [0.91, 0.96]
Raza 2005                  22     2    20    80   LA                    0.52 [0.36, 0.68]   0.98 [0.91, 1.00]
Saraux 1995                 8     8    31    91   LA                    0.21 [0.09, 0.36]   0.92 [0.85, 0.96]
Saraux 2003                35     8    51   149   ELISA                 0.41 [0.30, 0.52]   0.95 [0.90, 0.98]
Sauerland 2005            161    89     7   360   Nephelometry          0.96 [0.92, 0.98]   0.80 [0.76, 0.84]
Schellekens 2000           80    28    69   284   ELISA                 0.54 [0.45, 0.62]   0.91 [0.87, 0.94]
Soderlin 2004               5     4    11    49   LA                    0.31 [0.11, 0.59]   0.92 [0.82, 0.98]
Spiritus 2004              57     9    33    93   Nephelometry          0.63 [0.53, 0.73]   0.91 [0.84, 0.96]
Suzuki 2003               383    38   166   170   Nephelometry          0.70 [0.66, 0.74]   0.82 [0.76, 0.87]
Swedler                    89     3     9    39   Nephelometry          0.91 [0.83, 0.96]   0.93 [0.81, 0.99]
Thammanichanond 2005       57    25     6   111   LA                    0.90 [0.80, 0.96]   0.82 [0.74, 0.88]
Vallbracht 2004           196    75    99   345   ELISA                 0.66 [0.61, 0.72]   0.82 [0.78, 0.86]
van Leeuwen 1988          163    10    28   140   ELISA                 0.85 [0.80, 0.90]   0.93 [0.88, 0.97]
Vasiliauskiene 2001        75    21    21   106   ELISA                 0.78 [0.69, 0.86]   0.83 [0.76, 0.89]
Visser 1996               157   287    78  1466   ELISA                 0.67 [0.60, 0.73]   0.84 [0.82, 0.85]
Vittecoq 2001              26     1    32    29   LA                    0.45 [0.32, 0.58]   0.97 [0.83, 1.00]
Vittecoq 2004              62    11   114   127   ELISA                 0.35 [0.28, 0.43]   0.92 [0.86, 0.96]
Winkles 1989              113    19    29   481   LA                    0.80 [0.72, 0.86]   0.96 [0.94, 0.98]
Young 1991                 25     1    14    20   RA hemagglutination   0.64 [0.47, 0.79]   0.95 [0.76, 1.00]

cutoff (reported for 39 of the 50 studies; the extracted column cannot be aligned to individual studies): 8.0, 100.0, 15.0, 87.0, 9.0, 40.0, 16.3, 3.0, 3.125, 100.0, 20.0, 50.0, 20.0, 20.0, 20.0, 20.0, 30.0, 40.0, 20.0, 15.0, 80.0, 22.0, 40.0, 20.0, 30.0, 40.0, 20.0, 3.0, 20.0, 20.0, 15.0, 20.0, 20.0, 15.0, 17.0, 9.0, 80.0, 16.0, 40.0

[Forest plot: sensitivity and specificity with 95% confidence intervals for each study, each on an axis from 0 to 1.]

Forest plot – sorted by specificity



Moses-Littenberg SROC regression

Moses LE, Shapiro D, Littenberg B. Stat Med 1993; 12:1293-1316.

For each study i, compute accuracy (log diagnostic odds ratio, lnDOR):

$$D_i = \operatorname{logit}(tpr_i) - \operatorname{logit}(fpr_i)$$

and a proxy for threshold (based on overall positivity rate):

$$S_i = \operatorname{logit}(tpr_i) + \operatorname{logit}(fpr_i)$$

SROC regression: model specification

$$D = a + bS$$

The relationship between test accuracy and test threshold is modelled to estimate a summary ROC curve.

This fixed effect model is generally fitted using linear regression (unweighted or weighted by inverse variance of lnDOR).

b ≠ 0: Accuracy depends on threshold, resulting in an asymmetric SROC

b = 0: Accuracy is independent of threshold, resulting in a symmetric SROC

The SROC is produced by using the estimates of a and b to compute the expected sensitivity (tpr) across a range of values for 1-specificity (fpr).
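The regression steps can be sketched in a few lines of Python. This is an unweighted fit on only the first five studies of the rheumatoid factor example, so the resulting a and b are purely illustrative and are not the review's estimates. Solving D = a + bS for logit(tpr) gives logit(tpr) = (a + (1 + b) logit(fpr)) / (1 - b), which is what `sroc_tpr` evaluates; the function and variable names are assumptions of this sketch.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def expit(x):
    return 1 / (1 + math.exp(-x))


# (TP, FP, FN, TN) for the first five studies of the rheumatoid factor example
studies = [
    (64, 16, 27, 153),    # Aho 1999
    (482, 2, 82, 153),    # Anuradha 2005
    (36, 6, 41, 313),     # Banchuin 1992
    (143, 43, 53, 196),   # Bas 2003
    (80, 50, 39, 45),     # Berthelot 1995
]

D, S = [], []
for tp, fp, fn, tn in studies:
    tpr, fpr = tp / (tp + fn), fp / (fp + tn)
    D.append(logit(tpr) - logit(fpr))   # accuracy: lnDOR
    S.append(logit(tpr) + logit(fpr))   # proxy for threshold

# Unweighted least squares for D = a + b*S
n = len(D)
s_bar, d_bar = sum(S) / n, sum(D) / n
sxy = sum((s - s_bar) * (d - d_bar) for s, d in zip(S, D))
sxx = sum((s - s_bar) ** 2 for s in S)
b = sxy / sxx
a = d_bar - b * s_bar


def sroc_tpr(fpr):
    """Expected tpr on the fitted SROC (undefined if b equals 1)."""
    return expit((a + (1 + b) * logit(fpr)) / (1 - b))


for fpr in (0.05, 0.10, 0.20):
    print(fpr, round(sroc_tpr(fpr), 3))
```

A zero cell in any 2 x 2 table would make a logit undefined; the sketch does not handle that case.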

SROC regression:

properties and summary measures



Moses-Littenberg SROC regression: comments

Historically, this has been the most commonly used method:
• easy to implement
• uses standard regression methods / software
• can use regression diagnostics to identify influential studies

but
• does not take proper account of within and between study variability
• confidence intervals and P-values are likely to be inaccurate
• should be regarded as a descriptive/exploratory analysis

Hence:

RevMan 5 will provide only exploratory analyses based on SROC regression. Statistical inference will require more complex analyses using multilevel (hierarchical) models in other software.


Multilevel (hierarchical) models have the advantage that they take proper account of both:

(i) within study variability (sampling error)

(ii) between study variability (heterogeneity) not accounted for by (i), through the inclusion of random effects

Hierarchical models provide a more rigorous method that allows statistical inferences to be made.

Hierarchical (Mixed) models

Two hierarchical models are most commonly used for the meta-analysis of studies of diagnostic accuracy:

Bivariate model: the primary objective is to obtain a summary estimate of sensitivity and specificity

and

HSROC model: the primary objective is to fit a summary ROC

The two models are mathematically equivalent when no covariates are included in the model.

Which method to use?

Estimating a summary operating point:

• This is appropriate if there is a common cut-point or criterion for test positivity between studies
• If studies use different criteria for test positivity the summary operating point will be difficult to interpret.

Estimating a summary curve:

• This is appropriate if there is variation in the cut-point or criterion for test positivity between studies
• If studies use the same criterion for test positivity, there will be very limited information to inform the shape of the curve.

If no covariates are included in the model, the Bivariate and HSROC methods are mathematically equivalent:

• The parameter estimates from the HSROC model can be used to derive the summary point and corresponding confidence region
• The parameter estimates from the Bivariate model can be used to obtain the HSROC

If covariates are included in the model to explore reasons for heterogeneity in test performance, the choice will be guided jointly by:

• The research question: whether we want to make inferences about (i) the summary curve or (ii) the summary point
• Whether or not there is a common criterion for test positivity.

Multilevel (hierarchical) models

Bivariate model:

Models the relationship between sensitivity and specificity directly (after logit transformation), including random effects for both and allowing for correlation between them.

The focus is on estimating the expected sensitivity and specificity (i.e. expected operating point).

An underlying SROC can be derived from the estimated model parameters (the HSROC is one of the possible SROC curves).

HSROC (Rutter and Gatsonis) model:

Includes random effects for test accuracy and for the proxy for test threshold.

The focus is on estimating a summary ROC.

The expected sensitivity for a given specificity, expected operating point, etc. can be derived from the estimated model parameters.

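Bivariate model

LEVEL 1

For each study (i), the number testing positive is assumed to follow a Binomial distribution

$$y_{ij} \sim B(n_{ij}, \pi_{ij})$$

where j=1 represents the diseased group, j=2 represents the non-diseased group, $n_{ij}$ represents the number in group j, and $\pi_{ij}$ represents the probability of a positive test result in group j.

LEVEL 2

$$\begin{pmatrix} \mu_{Ai} \\ \mu_{Bi} \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix}, \begin{pmatrix} \sigma_A^2 & \sigma_{AB} \\ \sigma_{AB} & \sigma_B^2 \end{pmatrix} \right)$$

where $\mu_{Ai} = \operatorname{logit}(\pi_{i1})$ is the logit sensitivity (sens) and $\mu_{Bi} = \operatorname{logit}(1 - \pi_{i2})$ is the logit specificity (spec) in study i.

Model can be fitted using random effects logistic regression (e.g. SAS, Stata, R, ...)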

Bivariate Model Example: Anti-CCP for the diagnosis of rheumatoid arthritis.

• anti-cyclic citrullinated peptide antibody (anti-CCP).

• the anti-CCP test is deemed positive if any anti-CCP antibody is detected. Hence, detection may be considered a common threshold

• the reference standard was based on the 1987 revised American

• if we can assume a common threshold (cut-point or criterion for test positivity) across studies, it is appropriate to focus on summary estimate(s) for sensitivity and specificity.

Example: Anti-CCP for the diagnosis of rheumatoid arthritis.

Study                      TP    FP    FN    TN   Generation   Sensitivity         Specificity
Bas 2003                  110    24    86   215   CCP1         0.56 [0.49, 0.63]   0.90 [0.85, 0.93]
Bizzaro 2001               40     5    58   227   CCP1         0.41 [0.31, 0.51]   0.98 [0.95, 0.99]
Goldbach-Mansky 2000       43     1    63   120   CCP1         0.41 [0.31, 0.51]   0.99 [0.95, 1.00]
Jansen 2003               110     3   148   118   CCP1         0.43 [0.37, 0.49]   0.98 [0.93, 0.99]
Saraux 2003                40    11    46   146   CCP1         0.47 [0.36, 0.58]   0.93 [0.88, 0.96]
Schellekens 2000           72    14    77   298   CCP1         0.48 [0.40, 0.57]   0.96 [0.93, 0.98]
Vincent 2002              139     7   101   464   CCP1         0.58 [0.51, 0.64]   0.99 [0.97, 0.99]
Zeng 2003                  90     7   101   313   CCP1         0.47 [0.40, 0.54]   0.98 [0.96, 0.99]
Aotsuka 2005              115    17    16    73   CCP2         0.88 [0.81, 0.93]   0.81 [0.71, 0.89]
Bombardieri 2004           23     0     7    39   CCP2         0.77 [0.58, 0.90]   1.00 [0.91, 1.00]
Choi 2005                 236    20    88   231   CCP2         0.73 [0.68, 0.78]   0.92 [0.88, 0.95]
Correa 2004                74    11     8   130   CCP2         0.90 [0.82, 0.96]   0.92 [0.86, 0.96]
De Rycke 2004              89     4    29   142   CCP2         0.75 [0.67, 0.83]   0.97 [0.93, 0.99]
Dubucquoi 2004             90     2    50   129   CCP2         0.64 [0.56, 0.72]   0.98 [0.95, 1.00]
Fernandez-Suarez 2005      31     0    22    75   CCP2         0.58 [0.44, 0.72]   1.00 [0.95, 1.00]
Garcia-Berrocal 2005       69     8    18    38   CCP2         0.79 [0.69, 0.87]   0.83 [0.69, 0.92]
Girelli 2004               25     2    10    40   CCP2         0.71 [0.54, 0.85]   0.95 [0.84, 0.99]
Greiner 2005               70     5    17   228   CCP2         0.80 [0.71, 0.88]   0.98 [0.95, 0.99]
Grootenboer-Mignot 2004   167     8    98    88   CCP2         0.63 [0.57, 0.69]   0.92 [0.84, 0.96]
Hitchon 2004               26     8    15    15   CCP2         0.63 [0.47, 0.78]   0.65 [0.43, 0.84]
Kamali 2005                26     1    20    56   CCP2         0.57 [0.41, 0.71]   0.98 [0.91, 1.00]
Kumagai 2004               64    14    15   293   CCP2         0.81 [0.71, 0.89]   0.95 [0.92, 0.97]
Kwok 2005                  71     2    58    66   CCP2         0.55 [0.46, 0.64]   0.97 [0.90, 1.00]
Lee 2003                   68    14    35   132   CCP2         0.66 [0.56, 0.75]   0.90 [0.84, 0.95]
Lopez-Hoyos 2004           38     3     0    73   CCP2         1.00 [0.91, 1.00]   0.96 [0.89, 0.99]
Nell 2005                  42     2    60    96   CCP2         0.41 [0.32, 0.51]   0.98 [0.93, 1.00]
Nielen 2005               149     7   109   114   CCP2         0.58 [0.51, 0.64]   0.94 [0.88, 0.98]
Quinn 2006                147    10    35   106   CCP2         0.81 [0.74, 0.86]   0.91 [0.85, 0.96]
Rantapaa-Dahlqvist 2003    47     7    20   375   CCP2         0.70 [0.58, 0.81]   0.98 [0.96, 0.99]
Raza 2005                  24     3    18    79   CCP2         0.57 [0.41, 0.72]   0.96 [0.90, 0.99]
Sauerland 2005            171    26    60   443   CCP2         0.74 [0.68, 0.80]   0.94 [0.92, 0.96]
Soderlin 2004               7     2     9    51   CCP2         0.44 [0.20, 0.70]   0.96 [0.87, 1.00]
Suzuki 2003               481    23    68   185   CCP2         0.88 [0.85, 0.90]   0.89 [0.84, 0.93]
Vallbracht 2004           190    12   105   408   CCP2         0.64 [0.59, 0.70]   0.97 [0.95, 0.99]
van Gaalen 2005            82    13    71   301   CCP2         0.54 [0.45, 0.62]   0.96 [0.93, 0.98]
van Venrooij 2004         865    79   252  2218   CCP2         0.77 [0.75, 0.80]   0.97 [0.96, 0.97]
Vittecoq 2004              69     5   107   133   CCP2         0.39 [0.32, 0.47]   0.96 [0.92, 0.99]

[Forest plot: sensitivity and specificity with 95% confidence intervals for each study, each on an axis from 0 to 1.]

Proc NLMIXED for Bivariate Model

data accp (keep=study_id sens spec true n);
input study_id $ generation tp fp fn tn;
/* first record: numerator and denominator for sensitivity */
sens=1; spec=0; true=tp; n=tp+fn; output;
/* second record: numerator and denominator for specificity */
sens=0; spec=1; true=tn; n=tn+fp; output;
cards;
Bas 1 110 24 86 215
Bizzaro 1 40 5 58 227
Goldbach-Mansky 1 43 1 63 120
Jansen 1 110 3 148 118
Saraux 1 40 11 46 146
Schellekens 1 72 14 77 298
Vincent 1 139 7 101 464
Zeng 1 90 7 101 313
Aotsuka 2 115 17 16 73
Bombardieri 2 23 0 7 39
.
.
;

The resulting SAS dataset accp will have two records per study: the first contains the numerator and denominator for sensitivity, the second contains the numerator and denominator for specificity.


proc nlmixed data=accp cov ecov;
/* msens, mspec: summary estimates of logit(sensitivity) and logit(specificity) */
parms msens=2 mspec=2 s2usens=0.5 s2uspec=0.5 covsesp=0;
/* usens, uspec: random effects */
logitp = (msens + usens)*sens + (mspec + uspec)*spec;
p = exp(logitp)/(1+exp(logitp));
model true ~ binomial(n,p);
/* distribution of the random effects */
random usens uspec ~ normal([0 , 0],[s2usens,covsesp,s2uspec])
subject=study_id out=randeffs;
run;

Fit Statistics

-2 Log Likelihood 545.6

AIC (smaller is better) 555.6

AICC (smaller is better) 556.4

BIC (smaller is better) 563.6

Parameter Estimates

Standard

Parameter Estimate Error DF t Value Pr > |t| Alpha Lower Upper Gradient

msens 0.6534 0.1275 35 5.13 <.0001 0.05 0.3946 0.9122 0.000013

mspec 3.1090 0.1459 35 21.31 <.0001 0.05 2.8128 3.4051 -0.00015

s2usens 0.5426 0.1463 35 3.71 0.0007 0.05 0.2455 0.8397 0.000222

s2uspec 0.5717 0.1873 35 3.05 0.0043 0.05 0.1914 0.9520 0.000039

covsesp -0.2704 0.1199 35 -2.26 0.0304 0.05 -0.5137 -0.02710 0.000036

Covariance Matrix of Parameter Estimates

Row Parameter msens mspec s2usens s2uspec covsesp

1 msens 0.01625 -0.00741 0.000890 -0.00004 -0.00004

2 mspec -0.00741 0.02128 -0.00006 0.004286 -0.00116

3 s2usens 0.000890 -0.00006 0.02142 0.003997 -0.00874

4 s2uspec -0.00004 0.004286 0.003997 0.03509 -0.01184

5 covsesp -0.00004 -0.00116 -0.00874 -0.01184 0.01436


Input of Model Results to RevMan

Anti-CCP for the diagnosis of rheumatoid arthritis.

The specificities appear to be relatively homogeneous but there is considerable variation in the sensitivities. (This is evident in the size of the prediction region on the SROC plot.)

The summary estimate of sensitivity and specificity is shown by the solid black dot. (The sensitivity and specificity at this point can be computed by inverse transformation of the logit estimates to give 0.66 and 0.96 respectively.)
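The inverse transformation can be checked directly from the msens and mspec rows of the Parameter Estimates output. The sketch below also back-transforms the Lower and Upper limits; reporting the interval that way is an assumption of this example rather than something stated on the slide.

```python
import math


def expit(x):
    """Inverse of the logit transformation."""
    return 1 / (1 + math.exp(-x))


# (estimate, lower, upper) on the logit scale, from the NLMIXED output
logit_estimates = {
    "sensitivity": (0.6534, 0.3946, 0.9122),   # msens
    "specificity": (3.1090, 2.8128, 3.4051),   # mspec
}

summary = {
    name: tuple(expit(value) for value in values)
    for name, values in logit_estimates.items()
}

for name, (estimate, lower, upper) in summary.items():
    print(f"{name}: {estimate:.2f} (95% CI {lower:.2f}, {upper:.2f})")
```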

Rutter and Gatsonis HSROC model

LEVEL 1

$$y_{ij} \sim B(n_{ij}, \pi_{ij})$$

The model takes the form:

$$\operatorname{logit}(\pi_{ij}) = (\theta_i + \alpha_i\, dis_{ij}) \exp(-\beta\, dis_{ij})$$

where $dis_{ij}$ represents the “true” disease status (coded as -0.5 for the non-diseased and 0.5 for the diseased), and

$\theta_i$: threshold (random effect)

$\alpha_i$: accuracy (random effect)

$\beta$: dependence of accuracy on threshold (fixed effect)

The model is based on the ordinal logistic regression proposed by McCullagh.

When $\beta = 0$, the model reduces to a logistic regression model and

$\theta_i$ is estimated by $(\operatorname{logit}(tpr_i) + \operatorname{logit}(fpr_i))/2$ ( = $S_i/2$)

$\alpha_i$ is estimated by $\operatorname{logit}(tpr_i) - \operatorname{logit}(fpr_i)$ ( = $lnDOR_i$)

LEVEL 2

The random effects are assumed to be independent and normally distributed:

$$\theta_i \sim N(\Theta, \sigma_\theta^2) \qquad \alpha_i \sim N(\Lambda, \sigma_\alpha^2)$$

The SROC curve is computed using

$$E(tpr) = \frac{1}{1 + \exp\left(-\left(\Lambda e^{-0.5\beta} + \operatorname{logit}(fpr)\, e^{-\beta}\right)\right)}$$

for chosen values of fpr.

When $\beta = 0$, $\Lambda$ provides a global estimate of the expected test accuracy (lnDOR) and the resulting SROC is symmetric.

The expected tpr and fpr are given by

$$\frac{1}{1 + \exp\left(-(\Theta + 0.5\Lambda)\, e^{-0.5\beta}\right)} \quad \text{and} \quad \frac{1}{1 + \exp\left(-(\Theta - 0.5\Lambda)\, e^{0.5\beta}\right)}$$

respectively.

Fitting the HSROC model

The Rutter and Gatsonis HSROC model is a generalised non-linear random effects model and hence requires more specialised software to fit it.

It is often fitted using SAS Proc NLMIXED, or using Bayesian (MCMC) methods.

Notes:

Metandi (macro available for Stata) exploits the relationship between the Bivariate model and the HSROC model to fit the summary curve. This software cannot accommodate covariates.

The METADAS macro for SAS creates code for Proc NLMIXED and provides output suitable for input to RevMan.










Proc NLMIXED for HSROC Model

data rf (keep=study_id dis pos n);
input study_id $ tp fp fn tn method $;
/* first record: numerator and denominator for sensitivity */
dis=0.5; pos=tp; n=tp+fn; output;
/* second record: numerator and denominator for 1-specificity */
dis=-0.5; pos=fp; n=tn+fp; output;
cards;
Bizzaro 61 36 37 196 N
Bombardieri 27 6 3 33 N
Das 42 46 14 127 N
Suzuki 383 38 166 170 N
Swedler 89 3 9 39 N
Aho 64 16 27 153 LA
Berthelot 80 50 39 45 LA
Choi 261 54 63 197 LA
Cordonnier 20 2 29 18 LA
DeRycke 93 28 25 118 LA
.
.
;

The resulting SAS dataset rf will have two records per study: the first contains the numerator and denominator for sensitivity, the second contains the numerator and denominator for 1-specificity.

proc nlmixed data=rf ecov cov;
/* theta: summary estimate for "threshold"; alpha: summary estimate for "accuracy"; beta: shape parameter estimate */
parms alpha=2 theta=0 beta=0 s2ua=0 s2ut=0 ;
/* ut, ua: random effects for threshold and accuracy */
logitp = (theta + ut + (alpha + ua)*dis) * exp(-(beta)*dis);
p = exp(logitp)/(1+exp(logitp));
model pos ~ binomial(n,p);
/* distribution of the random effects (assumed independent, so the covariance is 0) */
random ut ua ~ normal([0,0],[s2ut,0,s2ua])
subject=study_id;
run;

Parameter Estimates

Standard

Parameter Estimate Error DF t Value Pr > |t| Alpha Lower Upper Gradient

alpha 2.6016 0.1862 48 13.97 <.0001 0.05 2.2273 2.9759 2.227E-6

theta -0.4370 0.1469 48 -2.98 0.0046 0.05 -0.7323 -0.1417 4.573E-6

beta 0.2267 0.1624 48 1.40 0.1691 0.05 -0.09978 0.5532 -1.16E-6

s2ua 1.3014 0.3046 48 4.27 <.0001 0.05 0.6890 1.9137 -6.42E-7

s2ut 0.5423 0.1237 48 4.39 <.0001 0.05 0.2937 0.7909 -6.99E-6
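The alpha, theta and beta estimates are enough to trace the summary curve. A minimal Python sketch, assuming the Level 2 expressions of the HSROC model: expected logit(tpr) equals alpha·exp(-0.5·beta) + logit(fpr)·exp(-beta), and the expected operating point comes from (theta ± 0.5·alpha) scaled by exp(∓0.5·beta). Function names and the fpr grid are illustrative choices.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def expit(x):
    return 1 / (1 + math.exp(-x))


# Estimates from the NLMIXED output for the RF example
alpha, theta, beta = 2.6016, -0.4370, 0.2267


def hsroc_tpr(fpr):
    """Expected sensitivity on the summary curve for a chosen fpr."""
    return expit(alpha * math.exp(-0.5 * beta) + logit(fpr) * math.exp(-beta))


# Expected operating point implied by the mean threshold and mean accuracy
expected_tpr = expit((theta + 0.5 * alpha) * math.exp(-0.5 * beta))
expected_fpr = expit((theta - 0.5 * alpha) * math.exp(0.5 * beta))

curve = [(fpr, hsroc_tpr(fpr)) for fpr in (0.05, 0.10, 0.20, 0.30)]
for fpr, tpr in curve:
    print(f"fpr={fpr:.2f}  expected tpr={tpr:.3f}")
print(f"expected operating point: fpr={expected_fpr:.3f}, tpr={expected_tpr:.3f}")
```

The expected operating point lies on the curve by construction, which gives a quick consistency check on the two formulas.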


Input of Model Results to RevMan

Example: RF for the diagnosis of rheumatoid arthritis.

The summary curve shows the expected trade-off between sensitivity and specificity as threshold varies.

Notes:

Since RF constitutes part of the ACR criteria, diagnostic accuracy may be overestimated.

The impact of potentially influential studies should be investigated.



Covariates can be included in both the Bivariate and

HSROC models to investigate factors that may be

associated with heterogeneity.


Anti-CCP for the diagnosis of rheumatoid arthritis: generation of CCP.

• anti-cyclic citrullinated peptide antibody (anti-CCP).

• the anti-CCP test is deemed positive if any anti-CCP antibody is detected. Hence, detection may be considered a common threshold

• the reference standard was based on the 1987 revised American

• two generations of CCP are included in the analysis, CCP1 and CCP2

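Bivariate model with a covariate

LEVEL 1

$$y_{ij} \sim B(n_{ij}, \pi_{ij})$$

LEVEL 2

Assuming a study level covariate Z (assumed to have a fixed effect)

$$\begin{pmatrix} \mu_{Ai} \\ \mu_{Bi} \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_A + \nu_A Z_i \\ \mu_B + \nu_B Z_i \end{pmatrix}, \begin{pmatrix} \sigma_A^2 & \sigma_{AB} \\ \sigma_{AB} & \sigma_B^2 \end{pmatrix} \right)$$

where $\mu_{Ai}$ is the logit sensitivity (sens) and $\mu_{Bi}$ is the logit specificity (spec) in study i.

Model can be fitted using random effects logistic regression (e.g. SAS, Stata, R, ...)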

data accp (keep=study_id sens spec true n ccpg);
input study_id $ generation tp fp fn tn;
/* CCP1 is the referent category */
if generation eq 1 then ccpg=0;
if generation eq 2 then ccpg=1;
sens=1; spec=0; true=tp; n=tp+fn; output;
sens=0; spec=1; true=tn; n=tn+fp; output;
cards;
Bas 1 110 24 86 215
Bizzaro 1 40 5 58 227
Goldbach-Mansky 1 43 1 63 120
Jansen 1 110 3 148 118
Saraux 1 40 11 46 146
Schellekens 1 72 14 77 298
Vincent 1 139 7 101 464
Zeng 1 90 7 101 313
Aotsuka 2 115 17 16 73
Bombardieri 2 23 0 7 39
.
.
;


proc nlmixed data=accp cov ecov;
parms msens=2 mspec=2 s2usens=0.5 s2uspec=0.5 covsesp=0 se1=0 sp1=0;
logitp = (msens + se1*ccpg + usens)*sens + (mspec + sp1*ccpg + uspec)*spec;
p = exp(logitp)/(1+exp(logitp));
model true ~ binomial(n,p);
/* random effects estimates common to both CCP1 and CCP2 */
random usens uspec ~ normal([0 , 0],[s2usens,covsesp,s2uspec]) subject=study_id out=randeffs;
/* Estimate logit(sensitivity) and logit(specificity) for CCP2 */
estimate 'logitsens CCP2' msens + se1;
estimate 'logitspec CCP2' mspec + sp1;
run;

Notes:

The variances of the random effects for CCP1 and CCP2 are assumed to be the same.

Fit Statistics





Parameter Estimates

Standard

Parameter Estimate Error DF t Value Pr > |t| Alpha Lower Upper Gradient

msens -0.09653 0.2203 35 -0.44 0.6640 0.05 -0.5438 0.3507 -0.00024

mspec 3.4467 0.2982 35 11.56 <.0001 0.05 2.8412 4.0522 -0.00002

s2usens 0.3598 0.1022 35 3.52 0.0012 0.05 0.1524 0.5673 0.000479

s2uspec 0.5399 0.1802 35 3.00 0.0050 0.05 0.1742 0.9057 -0.00002

covsesp -0.1968 0.09836 35 -2.00 0.0532 0.05 -0.3965 0.002825 0.000213

se1 0.9626 0.2513 35 3.83 0.0005 0.05 0.4523 1.4728 -0.00025

sp1 -0.4302 0.3377 35 -1.27 0.2111 0.05 -1.1158 0.2554 0.000046

Covariance Matrix of Parameter Estimates

Row Parameter msens mspec s2usens s2uspec covsesp se1 sp1

1 msens 0.04854 -0.02464 -0.00012 -0.00001 -0.00003 -0.04855 0.02465

2 mspec -0.02464 0.08895 -0.00002 0.004771 -0.00065 0.02463 -0.08834

3 s2usens -0.00012 -0.00002 0.01044 0.002118 -0.00440 0.000693 -0.00005

4 s2uspec -0.00001 0.004771 0.002118 0.03246 -0.00860 -0.00007 -0.00039

5 covsesp -0.00003 -0.00065 -0.00440 -0.00860 0.009674 0.000100 -0.00091

6 se1 -0.04855 0.02463 0.000693 -0.00007 0.000100 0.06317 -0.03160

7 sp1 0.02465 -0.08834 -0.00005 -0.00039 -0.00091 -0.03160 0.1140



msens and mspec are the estimates for CCP1 (the referent category); the ESTIMATE statement is used to get the corresponding values for CCP2.

Additional Estimates

Standard

Label Estimate Error DF t Value Pr > |t| Alpha Lower Upper

logitsens CCP2 0.8660 0.1209 35 7.16 <.0001 0.05 0.6206 1.1114

logitspec CCP2 3.0165 0.1622 35 18.59 <.0001 0.05 2.6871 3.3459

Covariance Matrix of Additional Estimates

Row Label Cov1 Cov2

1 logitsens CCP2 0.01461 -0.00697

2 logitspec CCP2 -0.00697 0.02632
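What the ESTIMATE statement reports for 'logitsens CCP2' can be reproduced by hand from the parameter estimates and their covariance matrix: the estimate is msens + se1, and its variance is var(msens) + var(se1) + 2·cov(msens, se1). The Python sketch below is only a check on the printed output, using the rounded values shown there, so small rounding differences are expected.

```python
import math

# Values copied from the Parameter Estimates and covariance matrix output
msens, se1 = -0.09653, 0.9626
var_msens = 0.04854
var_se1 = 0.06317
cov_msens_se1 = -0.04855

# Linear combination requested by: estimate 'logitsens CCP2' msens + se1;
logit_sens_ccp2 = msens + se1
var_sum = var_msens + var_se1 + 2 * cov_msens_se1
se_sum = math.sqrt(var_sum)

print(f"logitsens CCP2 = {logit_sens_ccp2:.4f}, variance = {var_sum:.5f}, SE = {se_sum:.4f}")
```

The strong negative covariance between msens and se1 is why the CCP2 estimate has a much smaller standard error than either term on its own.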

The change in -2logLikelihood when the two covariates were added to the model was 12.2 (a chi-squared statistic with 2 df, P=0.002).

Hence, there is strong statistical evidence that sensitivity and/or specificity vary by generation.

The confidence regions show that sensitivity varies by generation, but not specificity.

Further models may be fitted to formally test the effect of removing the covariate for specificity from the model.
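The P-value for the likelihood ratio test can be verified without statistical software, because the upper tail of a chi-squared distribution with exactly 2 df has the closed form exp(-x/2). The sketch below relies on that special case only; other degrees of freedom would need a statistics library.

```python
import math


def chi2_upper_tail_2df(x):
    """P(X > x) for a chi-squared variable with 2 degrees of freedom."""
    return math.exp(-x / 2)


# Change in -2 log likelihood when se1 and sp1 were added to the model
p_value = chi2_upper_tail_2df(12.2)
print(f"P = {p_value:.3f}")
```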



Summary estimates for specificity: 0.97 (95% CI 0.94, 0.98) for CCP1 and 0.95 (95% CI 0.94, 0.97) for CCP2.

Summary estimates for sensitivity: 0.48 (95% CI 0.37, 0.59) for CCP1 and 0.70 (95% CI 0.65, 0.75) for CCP2.

These results indicate an improvement in sensitivity, without loss of specificity, for CCP2 compared with CCP1.







Example: Rheumatoid Factor as a marker for Rheumatoid Arthritis: Method of measurement of RF

Method of measurement of RF: 15 studies used nephelometry (N), 16 used latex agglutination (LA), 16 used ELISA (E) (3 studies excluded: 2 method not specified, 1 used RA hemagglutination).




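Rutter and Gatsonis HSROC model with a covariate

LEVEL 1

$$y_{ij} \sim B(n_{ij}, \pi_{ij})$$

Assuming a study level covariate Z (assumed to have a fixed effect)

$$\operatorname{logit}(\pi_{ij}) = \left(\theta_i + \gamma Z_i + (\alpha_i + \lambda Z_i)\, dis_{ij}\right) \exp\left(-(\beta + \delta Z_i)\, dis_{ij}\right)$$

where $dis_{ij}$ represents the “true” disease status (coded as -0.5 for the non-diseased and 0.5 for the diseased), and $\gamma$, $\lambda$ and $\delta$ are the coefficients of Z for threshold, accuracy and shape respectively.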

data rf (keep=study_id dis pos n rfm1 rfm2);
input study_id $ tp fp fn tn method $;
/* LA is the referent category, matching the ESTIMATE labels (a1: ELISA, a2: Nephelometry) */
rfm1=0; if method eq 'E' then rfm1=1;
rfm2=0; if method eq 'N' then rfm2=1;
dis=0.5; pos=tp; n=tp+fn; output;
dis=-0.5; pos=fp; n=tn+fp; output;
cards;
Bizzaro 61 36 37 196 N
Bombardieri 27 6 3 33 N
Das 42 46 14 127 N
Suzuki 383 38 166 170 N
Swedler 89 3 9 39 N
Aho 64 16 27 153 LA
Berthelot 80 50 39 45 LA
Choi 261 54 63 197 LA
Cordonnier 20 2 29 18 LA
DeRycke 93 28 25 118 LA
.
.
;


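proc nlmixed data=rf;

parms alpha=2 theta=0 beta=0 s2ua=0 s2ut=0

a1=0 a2=0 t1=0 t2=0 b1=0 b2=0 ;

logitp = (theta + t1*rfm1 + t2*rfm2 + ut +

(alpha + a1*rfm1 + a2*rfm2 + ua)*dis)

* exp(-(beta + b1*rfm1 + b2*rfm2)*dis);

/* Assumed completion: the PROC line and the three statements below are not shown in the source; they follow the usual NLMIXED form for this model */

p = exp(logitp)/(1+exp(logitp));

model pos ~ binomial(n,p);

random ut ua ~ normal([0,0],[s2ut,0,s2ua]) subject=study_id;

run;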


This model allows the SROC curves to differ in shape.

Removing b1*rfm1 + b2*rfm2 from the model changed the -2logLikelihood by only 0.2 (a chi-squared statistic with 2 df, P=0.9). Hence, there is no statistical evidence that the curves differ in shape.


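proc nlmixed data=rf;

parms alpha=2 theta=0 beta=0 s2ua=0 s2ut=0

a1=0 a2=0 t1=0 t2=0;

logitp = (theta + t1*rfm1 + t2*rfm2 + ut +

(alpha + a1*rfm1 + a2*rfm2 + ua)*dis)

* exp(-(beta)*dis);

/* Assumed completion: the PROC line and the three statements below are not shown in the source; they follow the usual NLMIXED form for this model */

p = exp(logitp)/(1+exp(logitp));

model pos ~ binomial(n,p);

random ut ua ~ normal([0,0],[s2ut,0,s2ua]) subject=study_id;

/* parameter estimates for the methods of RF measurement */

estimate 'alpha ELISA' alpha + a1;

estimate 'theta ELISA' theta + t1;

estimate 'alpha Nephelometry' alpha + a2;

estimate 'theta Nephelometry' theta + t2;

run;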


This model assumes the SROC curves all have the same asymmetric shape





Parameter Estimates

Parameter   Estimate   Standard Error   DF   t Value   Pr > |t|   Alpha   Lower      Upper     Gradient
alpha        2.4552    0.3245           45    7.57     <.0001     0.05     1.8017    3.1087   -0.0004
theta       -0.5490    0.2137           45   -2.57     0.0136     0.05    -0.9794   -0.1186    0.000139
beta         0.1995    0.1702           45    1.17     0.2472     0.05    -0.1432    0.5423   -0.00018
s2ua         1.2865    0.3109           45    4.14     0.0002     0.05     0.6603    1.9128   -0.00038
s2ut         0.4786    0.1139           45    4.20     0.0001     0.05     0.2492    0.7080    0.00062
a1           0.2483    0.4408           45    0.56     0.5760     0.05    -0.6395    1.1361   -0.00038
a2           0.3328    0.4439           45    0.75     0.4573     0.05    -0.5612    1.2269    0.000093
t1          -0.1962    0.2614           45   -0.75     0.4568     0.05    -0.7227    0.3303   -0.00017
t2           0.4960    0.2627           45    1.89     0.0654     0.05    -0.03301   1.0250    0.000366

Additional Estimates

Label                Estimate   Standard Error   DF   t Value   Pr > |t|   Alpha   Lower     Upper
alpha ELISA           2.7035    0.3278           45    8.25     <.0001     0.05     2.0433    3.3637
theta ELISA          -0.7452    0.2103           45   -3.54     0.0009     0.05    -1.1687   -0.3217
alpha Nephelometry    2.7880    0.3067           45    9.09     <.0001     0.05     2.1704    3.4057
theta Nephelometry   -0.05297   0.2125           45   -0.25     0.8043     0.05    -0.4810    0.3750


The shape parameter common to all 3 curves is given by beta
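As an illustration of how the reported estimates combine, the sketch below rebuilds the accuracy and threshold parameters for each curve and converts them to an expected sensitivity and specificity at the average random effects (ut = ua = 0), using the common beta. It is written in Python rather than SAS. The curve labels follow the ESTIMATE statements, the referent curve is the one given by alpha and theta alone, and the helper names are assumptions of this example.

```python
import math


def expit(x):
    """Inverse of the logit transformation."""
    return 1.0 / (1.0 + math.exp(-x))


def operating_point(alpha, theta, beta):
    """Expected (sensitivity, specificity) with both random effects set to zero."""
    sensitivity = expit((theta + 0.5 * alpha) * math.exp(-0.5 * beta))
    false_positive_rate = expit((theta - 0.5 * alpha) * math.exp(0.5 * beta))
    return sensitivity, 1.0 - false_positive_rate


# Estimates copied from the Parameter Estimates output.
est = {"alpha": 2.4552, "theta": -0.5490, "beta": 0.1995,
       "a1": 0.2483, "a2": 0.3328, "t1": -0.1962, "t2": 0.4960}

# (alpha, theta) for each curve; labels as in the ESTIMATE statements.
curves = {
    "Referent": (est["alpha"], est["theta"]),
    "ELISA": (est["alpha"] + est["a1"], est["theta"] + est["t1"]),
    "Nephelometry": (est["alpha"] + est["a2"], est["theta"] + est["t2"]),
}

points = {
    label: operating_point(alpha, theta, est["beta"])
    for label, (alpha, theta) in curves.items()
}

for label, (sens, spec) in points.items():
    print(f"{label}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

The combined alpha and theta values reproduce the Additional Estimates (for example 2.4552 + 0.2483 = 2.7035). The operating points are a descriptive summary only, because thresholds vary between studies.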




LA appears to be less accurate than N and E, whose curves show very similar accuracy.

Removing a1*rfm1 +a2*rfm2

from the model gave a chi-squared

statistic of 0.6, 2df, P=0.74. Hence,

there is no statistical evidence

that the method of measurement

of RF is associated with

accuracy.

The effect of potentially influential

studies should be investigated.

Index Test Comparisons

Comparison based on all studies that evaluate one or both tests:

Methods of analysis follow the same approach as already outlined for investigation of heterogeneity

It may be necessary to allow variances of random effects to vary by test.

Such comparisons may be biased by confounding arising from heterogeneity among studies in design, study quality, setting, etc.

Adjusting for potential confounders is often not feasible because the required information is typically missing or poorly reported.

Index Test Comparisons

Comparison restricted to studies that evaluate both tests:

Restricting the analysis to studies that evaluated both tests in the same patients (truly “paired” studies), or that randomized patients to receive each test, removes the need to adjust for confounders.

Methods of analysis for investigation of heterogeneity are extended to model sensitivity and specificity for both tests within each study (i.e. 2 records for sensitivity and 2 records for specificity per study, with a covariate for test type)

– all studies are analysed as if they are randomised

– this approach is generally conservative

– methods for dealing with pairing of test results within studies are under development

The cross-classification of test results within disease groups for truly paired studies is generally not reported
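The data layout for a comparative study (two records for sensitivity and two for specificity, with a covariate for test type) might be sketched as below. This is an illustration in Python, not the SAS code used for the analysis. The study label and all counts are hypothetical, and the field names simply mirror the dis/pos/n convention used elsewhere in these slides.

```python
def comparative_records(study_id, counts_by_test):
    """Build four records for one study that evaluated two index tests.

    counts_by_test maps a test name to its (tp, fp, fn, tn) counts.
    """
    records = []
    for test, (tp, fp, fn, tn) in counts_by_test.items():
        # Diseased group: this record informs sensitivity.
        records.append({"study_id": study_id, "test": test,
                        "dis": 0.5, "pos": tp, "n": tp + fn})
        # Non-diseased group: this record informs specificity.
        records.append({"study_id": study_id, "test": test,
                        "dis": -0.5, "pos": fp, "n": fp + tn})
    return records


# Hypothetical counts for a single comparative study.
example = comparative_records(
    "StudyA", {"CT": (45, 5, 5, 95), "US": (40, 10, 10, 90)}
)
for record in example:
    print(record)
```

Because the cross-classification of the two tests within each disease group is usually unavailable, this layout treats the two tests' records as if they came from separate groups of patients.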

Example: Comparison of Computed Tomography (CT)

and Ultrasonography (US) for the diagnosis of

appendicitis.

22 studies were included in the review by Terasawa (2004)

12 studies evaluated CT

14 studies evaluated US

4 studies evaluated both CT and US.




Analysis based on all studies:

Strong statistical evidence of a difference

in sensitivity and specificity between the

tests (P<0.001)

CT has higher sensitivity and specificity

than US.




Analysis based on comparative studies:

CT consistently shows higher sensitivity

than US

Specificity for CT is equal to or greater

than for US

Only 4 studies available for this model.

Convergence is an issue, and simplifying

assumptions may be necessary.

Analyses in RevMan are designed to be descriptive and exploratory.

Hierarchical models provide a more rigorous approach. The Bivariate model

and Rutter and Gatsonis HSROC models are the most commonly used.

The choice of model must be informed by the research question and whether a

common threshold for test positivity is used across studies.

Covariates can be included in hierarchical models to investigate heterogeneity.

The results can be input to RevMan for graphical display.

Modelling of test comparisons follows the approach used for investigation of heterogeneity.

Ideally, comparative meta-analysis should focus on studies that compare tests

directly.

A comprehensive list of references is provided in Chapter 10 of the Handbook

for DTA Reviews.

Concluding Remarks

Small number of studies

Convergence issues

Model checking

Data reported at multiple thresholds per study:

• choosing a cutpoint for each study

• methods for analysing multiple 2x2 tables per study

Hamza TH, Arends LR, van Houwelingen HC, Stijnen T. Multivariate random effects meta-analysis of diagnostic tests with multiple thresholds. BMC Medical Research Methodology 2009; 9:73. DOI: 10.1186/1471-2288-9-73. Published 10 November 2009.

Other?

Discussion Points (Methods continue to be extended and refined!)
