
Neurosurg Focus 33(1):E5, 2012

Meta-analysis of diagnostic test accuracy in neurosurgical practice

Julie Dubourg, M.D.,1 Moncef Berhouma, M.D.,2 Michael Cotton, M.D., F.R.C.S.,3 and Mahmoud Messerer, M.D.4

1Centre d’Investigation Clinique 201, Epidemiologie Pharmacologie Investigation Clinique Information Medicale Mere Enfant, Université Claude Bernard Lyon 1, Groupement Hospitalier Est, Hospices Civils de Lyon; 2Service de Neurochirurgie, Université Claude Bernard Lyon 1, Hôpital Pierre Wertheimer, Hospices Civils de Lyon, France; 3Service des urgences, Centre Hospitalier Universitaire Vaudois, Lausanne; and 4Service de Neurochirurgie, Université de Lausanne UNIL, Département des Neurosciences Cliniques, Centre Hospitalier Universitaire Vaudois, Lausanne, Switzerland

Comparative effectiveness research (CER) allows evidence to be evaluated on the effectiveness, benefits, and detriments of management options, diagnostic tests, or ways to deliver health care. This process can be achieved in different ways, such as with well-designed randomized controlled trials or by meta-analyses. Several medical subspecialties are increasingly using CER, but CER remains underused by the neurosurgical community. Meta-analysis is a highly accurate method that permits results from multiple well-designed research studies to be quantitatively compared. Meta-analysis can be performed in many settings, such as the evaluation of treatment or of a diagnostic test or prognostic factor. Meta-analyses of randomized controlled treatment trials are well known, but there is a paucity of papers describing the ways to perform a meta-analysis of a diagnostic test. The aim of this paper is to improve neurosurgeons’ familiarity with the meta-analysis of diagnostic test accuracy by describing and detailing each stage leading to publication. (http://thejns.org/doi/abs/10.3171/2012.5.FOCUS1295)

Key Words • meta-analysis • diagnostic test accuracy • evidence-based medicine • comparative effectiveness research

Abbreviations used in this paper: CER = comparative effectiveness research; ICP = intracranial pressure; ONSD = optic nerve sheath diameter; QUADAS = Quality Assessment of Diagnostic Accuracy Studies; SROC = summary receiver operating characteristic.

Comparative effectiveness research is defined by the Institute of Medicine as “the generation and synthesis of evidence that compares the benefits and harms of alternative methods to prevent, diagnose, treat and monitor a clinical condition or to improve delivery of care.” The purpose of CER is “to assist consumers, clinicians, purchasers, and policy makers to make informed decisions that will improve health care at both the individual and population levels.”29 This CER paradigm is not a novel concept as such; it began to take form in the 1850s through the writings of Claude Bernard,7 a French physiologist. Comparative effectiveness research will no doubt occupy an increasingly important place in clinical practice and health care policy in the future.39 Consequently, the neurosurgical community should be familiar with CER.

The number of diagnostic options has grown at an increasing rate over the past 2 decades in medicine, and likewise in neurosurgery. Currently, physicians have a large variety of diagnostic tests at their disposal for the same condition. Furthermore, they are faced with patients who require more and more guarantees, as well as the best diagnostic test associated with the fewest complications or side effects. Thus, it is a fundamental task for clinicians to gather data and to use those data to formulate an optimal care plan for their patients.

Meta-analysis represents a high level of evidence-based medicine by summarizing the results of well-designed studies on the same topic, thereby achieving the best estimate of the performance of a diagnostic device. Meta-analysis makes it possible to obtain more precise estimates when only small studies are available and can also determine the covariates that may influence results.

The process leading to the publication of a meta-analysis should be transparent and reproducible. Compared with meta-analyses of randomized controlled treatment trials, the methodology of a meta-analysis of diagnostic test accuracy is less well known and understood. Despite more than 10 years of accumulating methodological work,11,14,16,35,59 many uncertainties remain, and there is no consensus indicating the best statistical method to synthesize results from studies of diagnostic tests.

The aim of this paper is to improve neurosurgeons’ familiarity with the meta-analysis of diagnostic test accuracy by describing and detailing each stage leading to the publication of such a meta-analysis.

Overview of Meta-Analysis of Diagnostic Test Accuracy

The major steps for conducting a systematic review are as follows (Fig. 1):32,33 1) identification of the objectives of the meta-analysis; 2) search strategy and study selection; 3) assessment of study quality; 4) data extraction; 5) statistical analysis; and 6) interpretation of results.

Identification of the Objectives of the Meta-Analysis

The diagnostic accuracy of a diagnostic test is its ability to identify or exclude a given disease or target condition. An “index test,” the diagnostic test under evaluation, is measured against a “gold standard,” which is the best currently available test to classify a patient’s disease status. For example, to evaluate the diagnostic accuracy of percutaneous biopsy of the cavernous sinus region through the foramen ovale (index test) in detecting benign or malignant lesions, the reference standard will be the histopathological analysis of the lesion obtained by open craniotomy. This method will correctly and definitively classify patients as having benign or malignant status.40 According to the gold standard, results of the index test will be classified as true positives, false positives, false negatives, and true negatives, from which the calculation of accuracy indices is possible (Table 1).

Before issuing the study protocol, investigators must ask certain questions about the possible role of the diagnostic test. Three types of questions arise. The first question is: is this test reliable? In other words, is this test accurate? That is, if we say the sodium level is 140 mmol/L, is that really what it is? Also, is the test result precise? That is, if one measures the same patient’s blood 10 times, will the sodium level continue to be measured as 140 mmol/L? This question describes the test’s analytical validity.

The second question is: is this test significant? This question describes the test’s clinical validity. And the final question is: is this test helpful? In other words, does performing the test modify patient outcomes? This question describes the test’s clinical utility.

Most physicians are probably satisfied with a test that has only clinical validity. However, we believe it is fundamental that the test also have clinical utility. When defining clinical utility, 3 other questions are raised to define the objectives of the diagnostic test under evaluation.10

    Fig. 1. Overview of the complete process of performing a meta-analysis of diagnostic test accuracy.



The first question is known as the replacement test (Fig. 2A): is the index test intended to replace the gold standard test? This is an important question, for it will be impossible to demonstrate the superiority of the index test regarding its diagnostic accuracy, because the target condition is defined in terms of the reference standard. Therefore, the index test will be, at best, equal in diagnostic accuracy. The aim will be to demonstrate that the index test is as good as the reference test and to highlight the benefits of the index test in terms of cost, lack of complications, ease of implementation and interpretation, and availability.

The second question is known as the triage test (Fig. 2B): is the index test intended to serve as a triage test? Such a test should have 100% sensitivity or 100% specificity, depending on whether it is meant to exclude or confirm a diagnosis compared with the reference test.

And the last question is known as the add-on test (Fig. 2C): is the index test intended as an additional test? This would be the case if the reference test has high specificity but poor sensitivity, producing many false negatives. In that case, it would be useful to have a test with high sensitivity; the 2 tests will therefore be complementary. Figure 2 shows examples of diagnostic tests in these 3 situations in neurosurgical settings.

TABLE 1: The 2 × 2 contingency table comparing a reference standard with an index test

Index Test    Reference Standard Positive    Reference Standard Negative
positive      true positive                  false positive
negative      false negative                 true negative

Fig. 2. Examples of diagnostic tests addressing the 3 questions (replacement, triage, and add-on tests) that are raised to define the objectives of the diagnostic test under evaluation. A: Replacement test. In this neurosurgical example, we present MRI as an index test evaluated for the diagnosis of spinal cord compression in comparison with the previous gold standard (myelography). As the target condition (spinal cord compression) is classified according to the gold standard (myelography), the index test (MRI) can be, at most, as accurate as myelography from a methodological point of view. The choice of this test as a replacement for the gold standard is justified by its advantages: as the chart shows, MRI is noninvasive, easily interpretable, and engenders no complications. We can note that in current clinical practice, MRI is now the gold standard. B: Triage test. In this neurosurgical example, we present the index test (percutaneous biopsy of lesions in the cavernous sinus region through the foramen ovale) for the diagnosis of the histological nature of lesions in comparison with the gold standard (open craniotomy). We studied a series40 of 50 patients, which showed a 100% specificity of the index test (percutaneous biopsy) for distinguishing between benign and malignant tumors classified by the gold standard test (open craniotomy). This means that when biopsy indicates malignancy, the diagnosis is certain. Furthermore, this index test is less invasive and involves less morbidity than the gold standard. Therefore, we can propose this index test as a triage test, potentially sparing patients an open craniotomy when percutaneous biopsy already provides the diagnosis. C: Add-on test. In this neurosurgical example, we present the index test (ONSD ultrasonography) for the diagnosis of the target condition (raised ICP) in comparison with the gold standard (ICP monitoring). We performed a meta-analysis on this subject and concluded that ONSD ultrasonography presents good diagnostic accuracy.19 However, ONSD ultrasonography is not intended to replace ICP monitoring, notably because of a 10% false-negative rate with this test. As the chart shows, ONSD ultrasonography is noninvasive, widely available, and low cost. Therefore, in the meta-analysis we proposed this index test as an add-on test to ICP monitoring. Use of ONSD ultrasonography can help decision making in cases in which there are no recommendations, or can initiate management when ICP monitoring is not immediately available.


Thus, a meta-analysis should evaluate an index test (for replacement, triage, or add-on) for a target condition and compare it with a reference test in a more or less specific population. Once the clinical problem and the clinical role of the index test are identified, a protocol must be written. It should define the context of the study (index test, pathology, reference test), the intended purpose of the index test (replacement, triage, or add-on), the search strategy, study selection, data extraction, validation of the methodological quality of selected studies, analysis and presentation of results, and the expected final interpretation.

    Search Strategy and Study Selection

Search Strategy. The search strategy must be determined in the study protocol. Identification of eligible studies is a key step. If it is improperly performed, potentially eligible articles will not be considered, leading to publication bias and therefore potentially erroneous results. It is important to note that a search for articles on diagnostic test accuracy is more difficult than a search for randomized controlled trials.25

The primary sources of data are the electronic databases. There are general medical databases (MEDLINE [via PubMed], Embase, BIOSIS, the Cochrane Library), specific topic databases (Neuroscience Information Framework), and national and regional databases (PASCAL [Europe], IndMED [India], African Index Medicus [Africa], Australasian Medical Index [Australia], and LILACS [Latin America and the Caribbean]). These databases may have free or paid access. The ideal strategy is to search in as many databases as possible.

The search strategy is essential. It must be as sensitive and unrestrictive as possible; thus, it will use both free-text and Medical Subject Heading (MeSH) terms.34 Many search filters have been developed,2,26 but these filters involve an increased risk of not finding all eligible studies, and therefore they are not recommended.36 The search strategy must therefore have no restrictions on language, publication date, or publication type (such as review of the literature, reported cases, and others). The development of key words is an important step; they should cover the “index test,” “target condition,” “gold standard,” and “target population.” The search structure should not be too specific because of the risk of missing potentially eligible studies, as noted above. The MeSH term “sensitivity and specificity” may appear suitable but is inconsistently applied in most databases. Table 2 shows examples of search strategies. For example, a search strategy for articles about the diagnostic accuracy of ONSD ultrasonography in detecting raised ICP may include the terms “brain injuries,” “ultrasonography,” “intracranial hypertension,” “intracranial pressure,” “optic nerve,” “raised intracranial pressure,” “myelin sheath,” “intraventricular catheter,” “cerebrospinal fluid pressure,” and “diagnosis,” “sensitivity,” “specificity,” “predictive value,” “likelihood ratio,” “false positive,” or “true positive.”19

TABLE 2: Examples of search strategies

« index test » AND « target condition »
« gold standard » AND « target condition »
« index test » AND « target condition » AND « specific population »
« gold standard » AND « target condition » AND « specific population »
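As an illustration, the following is a hypothetical boolean query (ours, not from the paper) combining the term groups of Table 2 for the ONSD example above; field tags, MeSH expansion, and exact syntax would have to be adapted to each database:

```
("optic nerve" OR "optic nerve sheath") AND (ultrasonography OR ultrasound)
AND ("intracranial pressure" OR "intracranial hypertension")
AND (sensitivity OR specificity OR "predictive value" OR "likelihood ratio")
```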

These different search strategies should be used in the same database and then repeated in all the databases used. It is probably best to import the retrieved references into bibliographic management software (such as Reference Manager, ProCite, EndNote, RefWorks, or others). These programs save time, reduce transcription errors, and eliminate duplicate references with a single click; they can also sort references into subgroups (such as included and excluded).

The search strategy must be explained in the protocol and in the publication, to further the goal of transparency and reproducibility. A major step in the publication of the meta-analysis is also to indicate the number of references retrieved by the search strategy. For example, the search strategy for an article on ONSD ultrasonography retrieved 699 articles.19 This number constitutes the first element of the flow chart that must be reported in the meta-analysis.

However, this search strategy is not sufficient, especially for studies of diagnostic tests. If it is performed alone, there is a high risk of publication bias. It has been shown that studies with statistically significant results are more likely to be published than those that show no difference.50 Consequently, all studies on the same topic must be included to better estimate the true diagnostic accuracy. Other search methods include checking the references of retrieved articles, hand-searching clinical meeting abstracts, contacting the specialists involved by email, and searching clinical trial registries such as www.clinicaltrials.gov.

    A search is complex and needs to be clearly explained in the methodology section of the article, as do the results of the search. This search will be more effective if there is close contact between clinicians and methodologists and/or librarians.

Study Selection. The eligibility criteria of studies in the meta-analysis should be clearly defined in the protocol. These criteria will vary depending on the clinical problem and may concern the type of design, the population, the index test, the reference test, and the target pathology.

Many designs exist for diagnostic test studies, such as case reports, cross-sectional studies, case-control studies, randomized trials, and cohort studies. In general, any study in which the index test results are compared with gold standard test results will be eligible. However, randomized trials are rarely used to assess diagnostic accuracy, because each patient receives either the reference test or the index test; thus, it is impossible to relate the index test results to the gold standard, and only the impact on patient outcomes can be assessed. For this reason, cross-sectional studies are widely included.


In the published article, this selection process must be detailed in the flow chart by indicating the number of studies included and/or excluded and the reasons why. A minimum of 2 authors should independently review the complete list of references found by the search strategy and decide on their eligibility. Once this process is finished, they should meet to check the compatibility of their results and to resolve differences by consensus, or with the help of a third reviewer.

Assessment of Study Quality

Meta-analysis of diagnostic accuracy studies is often characterized by large heterogeneity due to differences in study design. Therefore, it is essential to carefully assess the quality of the included studies. Guidelines such as the Standards for Reporting of Diagnostic Accuracy (STARD) initiative are endorsed by several journals to promote transparency of study design and thereby improve the quality of diagnostic studies. Many tools are available, but the most efficient and most recommended is the Quality Assessment of Diagnostic Accuracy Studies (QUADAS) tool, developed by Whiting et al. in 2003.58 Initially this tool included 14 items for the assessment of risk of bias and sources of variation. Each item is rated “yes,” “no,” or “unclear.” Some authors use an overall quality score regrouping all QUADAS items, but this way of reporting quality is not recommended, because some items may be more important than others.57 Depending on the subject of the literature review, variables differ in importance. It is therefore up to each author to define the main criteria for his or her study and when to answer “yes,” “no,” or “unclear.” Close collaboration with a clinician expert in the specialty is thus required.

As mentioned above, it is rarely justified to present the results in the form of an overall score. The most representative method is to select several or all of the variables and represent them individually using graphics (Fig. 3).56 Recently, the QUADAS-2 tool60 was developed to offer additional and improved features, including distinguishing between bias and applicability, and help with assessing the risk of bias.

Fig. 3. Bar graph of quality assessment displaying stacked bars for each quality item, including modified QUADAS criteria. Each bar represents the results of quality assessment. For example, 80% of the included studies performed the index test without knowledge of the results of the gold-standard test (blinded interpretation).

The 3 main biases of which one must be aware in meta-analyses of diagnostic tests are the spectrum of patients, verification of the “condition” of patients, and the blind comparison.59

It is essential that the included studies comprise a spectrum of patients representative of the target condition in clinical practice. For example, case-control studies are poor candidates for inclusion, because the test appears very sensitive in the clearly diseased “case” population and very specific in the healthy “control” population. Moreover, this spectrum will never be representative of the spectrum of patients encountered in clinical practice.

For verification of the “condition” of patients, the choice of the gold standard is very important. If the gold standard used is not optimal, it may incorrectly classify patients and therefore underestimate or overestimate the diagnostic accuracy of the index test. On the other hand, the timing of performing the index test can also result in bias. It is important to avoid using an index test that is a part of the gold standard. The lead time between the 2 tests must also be as short as possible to avoid any change in patient condition status. This verification bias can also be observed if the reference test is not performed in the entire population studied.

The test results must always be interpreted in a blinded fashion; otherwise, knowledge of one result might influence the clinician’s interpretation of the other.

The major problems with this part of the meta-analysis (assessment of quality) mainly concern the number of authors validating the methodological quality of studies and their backgrounds, the subjectivity of the exercise, and how to resolve disagreements.

Data Extraction

The data extraction stage should also involve at least 2 authors. They will independently extract data from each selected study using a data extraction sheet that has already been defined in the protocol. This extraction will allow them to build a 2 × 2 contingency table (Table 1). They will calculate the following indices of diagnostic accuracy: sensitivity, specificity, positive and negative predictive values, positive and negative likelihood ratios, and diagnostic odds ratio (Appendix). Again, the authors will meet and reach consensus on any differences in the results observed.
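As an illustration, here is a minimal Python sketch (ours, not the authors’; the function name and example counts are hypothetical) of how the indices defined in the Appendix follow from a single study’s 2 × 2 table:

```python
# Minimal sketch: diagnostic accuracy indices from one 2 x 2 contingency table.
# A 0.5 continuity correction is commonly added to zero cells before computing
# ratios; it is omitted here for clarity.

def accuracy_indices(tp, fp, fn, tn):
    """Return the accuracy indices defined in the Appendix."""
    sens = tp / (tp + fn)              # sensitivity (true-positive rate)
    spec = tn / (tn + fp)              # specificity (true-negative rate)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": tp / (tp + fp),         # positive predictive value
        "npv": tn / (tn + fn),         # negative predictive value
        "lr_pos": sens / (1 - spec),   # positive likelihood ratio
        "lr_neg": (1 - sens) / spec,   # negative likelihood ratio
        "dor": (tp * tn) / (fp * fn),  # diagnostic odds ratio
    }

# Hypothetical study: 90 TP, 5 FP, 10 FN, 95 TN
print(accuracy_indices(90, 5, 10, 95))
```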

    Statistical Analysis

Exploring the Publication Bias. In published studies, favorable findings are more likely to be presented than inconclusive results, which leads to reporting bias.18,20,50 The validity of a meta-analysis of diagnostic test accuracy depends on minimizing bias in the identification of studies.15,54 If the included studies present results that differ from relevant studies that were not included, the conclusions of the meta-analysis will be compromised by publication bias.5 The presence of publication bias can be investigated by constructing a funnel plot21 and considering its symmetry.31,51 The classical methods to achieve this are the Begg method6 and the Egger test.21 These methods investigate publication bias by plotting a measure of effect size against a measure of study precision, which appears symmetric if no bias is present. However, assessing this symmetry visually is very subjective, so nonparametric and parametric linear regression methods have been developed to test for plot asymmetry reflecting publication bias.15,43,47 The funnel plot, often used in meta-analyses of randomized clinical trials,17,30 has demonstrated many flaws when used in diagnostic studies, and alternatives for assessing the presence of publication bias have not been very effective.15 Investigating publication bias in diagnostic accuracy studies is a matter of debate, because the determinants of publication are not the same as for randomized controlled trials, and diagnostic accuracy studies do not routinely report p values, instead using estimates of sensitivity and specificity with 95% CIs. Therefore, publication bias will not be associated with statistical nonsignificance. Finally, the best way to limit publication bias is to perform an exhaustive search strategy.
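To make the mechanics concrete, the following is a minimal sketch of the generic Egger regression test applied to hypothetical study-level log diagnostic odds ratios; as noted above, tests adapted specifically to diagnostic reviews behave differently,15 so this is only an illustration of the idea:

```python
# Minimal sketch of Egger's regression test for funnel-plot asymmetry.
# The study effects and standard errors below are hypothetical.
import numpy as np
import statsmodels.api as sm

log_dor = np.array([2.1, 2.8, 1.9, 3.2, 2.5, 3.0])   # log diagnostic odds ratio per study
se = np.array([0.40, 0.55, 0.30, 0.70, 0.45, 0.60])  # its standard error per study

snd = log_dor / se        # standardized effect (standard normal deviate)
precision = 1.0 / se      # inverse standard error

# Regress the standardized effect on precision; an intercept far from zero
# suggests funnel-plot asymmetry (possible publication bias).
fit = sm.OLS(snd, sm.add_constant(precision)).fit()
print(f"Egger intercept = {fit.params[0]:.2f} (p = {fit.pvalues[0]:.3f})")
```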

Exploring for Heterogeneity. In general, there is more heterogeneity between studies of diagnostic tests than between randomized controlled trials. Heterogeneity can be sought with several tests, but the most powerful is the calculation of the I2 statistic (see Appendix).27 This statistic measures the percentage of variability between summary indices that is due to heterogeneity rather than chance. Generally, a meta-analysis with an I2 > 50% is considered to have substantial heterogeneity. Another widely used test is the Cochran Q test,13 which indicates heterogeneity when the p value is < 0.05 (Fig. 4). However, detecting heterogeneity is not enough; it is important to consider its causes.53 Part of the heterogeneity may be due to chance, because the number of included studies is usually low.3 There are other common sources of heterogeneity between studies,37,48,59 such as the use of different thresholds of positivity, differences in methodological quality, various techniques for performing (or subjectivity in interpreting) the index test or the gold standard, and variable sources of recruitment involving varied prevalence of the target condition. If heterogeneity is found, its causes may be explored by relating study-level covariables to an accuracy measure (such as the diagnostic odds ratio) using meta-regression techniques,4 such as a maximum-likelihood estimation method.
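The Q and I2 formulas given in the Appendix are straightforward to compute; here is a minimal sketch on hypothetical study-level log likelihood ratios and weights:

```python
# Minimal sketch of the Cochran Q statistic and I^2 measure from the Appendix.
import numpy as np

q_i = np.array([1.8, 2.3, 1.5, 2.9, 2.1])     # log LR estimate per study (hypothetical)
w_i = np.array([12.0, 8.0, 15.0, 5.0, 10.0])  # weight of each study (e.g., inverse variance)

q_bar = np.sum(w_i * q_i) / np.sum(w_i)       # weighted mean log LR
Q = np.sum(w_i * (q_i - q_bar) ** 2)          # Cochran Q statistic
df = len(q_i) - 1                             # degrees of freedom
I2 = max(0.0, (Q - df) / Q) * 100             # I^2, in percent

print(f"Q = {Q:.2f}, I^2 = {I2:.0f}%")  # I^2 > 50% suggests substantial heterogeneity
```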

Fig. 4. Forest plot showing the pooled sensitivity and specificity of an index test compared with a reference test. Each point indicates the sensitivity and specificity estimates for an individual study. Horizontal lines indicate the 95% CI. The graph also shows the combined sensitivity and specificity with 95% CI at the bottom. In these contrived data, the pooled sensitivity of the index test compared with the reference standard is 96% (95% CI 92%–98%) and the pooled specificity is 98% (95% CI 95%–99%). The higher the sensitivity and specificity, the more accurate the index test. The graph also shows how much of the total variation across studies is attributable to heterogeneity rather than to chance. For sensitivity, I2 is < 50% and the p value of the Cochran Q test is > 0.05, so there is no substantial heterogeneity. The reverse is true for specificity: I2 is > 50% and the p value is < 0.05, indicating heterogeneity. When readers observe heterogeneity, they should interpret the results more carefully.


Performance Indices. Sensitivity, also known as the true-positive rate, is the ability of the index test to identify patients with the target condition (Appendix). Conversely, specificity, also known as the true-negative rate, is the ability of the index test to identify healthy patients. When a test has a high sensitivity, a negative result rules out the diagnosis, and when a test has a high specificity, a positive result rules in the diagnosis. It is important to note that there is a negative correlation between sensitivity and specificity. The choice of the positivity threshold of the index test determines the values of sensitivity and specificity.
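This threshold dependence is easy to demonstrate on contrived data; in the sketch below, sweeping the positivity cutoff of a hypothetical continuous index test trades sensitivity against specificity:

```python
# Illustration (contrived data) of the negative correlation between sensitivity
# and specificity as the positivity threshold of an index test is varied.
import numpy as np

rng = np.random.default_rng(0)
diseased = rng.normal(6.0, 1.0, 1000)  # hypothetical test values, condition present
healthy = rng.normal(4.0, 1.0, 1000)   # hypothetical test values, condition absent

for threshold in [4.0, 5.0, 6.0]:
    sens = np.mean(diseased > threshold)   # true-positive rate at this cutoff
    spec = np.mean(healthy <= threshold)   # true-negative rate at this cutoff
    print(f"threshold {threshold}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```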

The negative or positive predictive value of the index test is the probability of a negative or positive result being correct.

The clinical usefulness of the index test can also be evaluated by calculating positive and negative likelihood ratios. With a positive likelihood ratio > 1, the probability of the target condition being present increases. When this ratio is < 1, the probability of it being present decreases. When the ratio is 1, the probability is unchanged.

The diagnostic odds ratio is an indicator of the overall diagnostic accuracy of the index test (Appendix). It is the ratio of the odds of a positive index test result in patients with the target condition to the odds of a positive index test result in those without the target condition.

Statistical Pooling. In any meta-analysis, the first step is to graph the results of the individual selected studies. The SROC curve1,41,46 represents the paired results for sensitivity and specificity (Fig. 5). In receiver operating characteristic space, the specificity is displayed on the x-axis and the sensitivity on the y-axis. In this graph, a diagonal represents the values of sensitivity and specificity of a “noninformative” index test. The optimal index test will be positioned in the corner where sensitivity and specificity both equal 1 (the top left corner in Fig. 5, where the specificity axis is reversed). The area under the curve serves as a global measure of performance. The following guidelines52 have been suggested for interpreting area under the curve values: low accuracy for values between 0.5 and 0.7, moderate accuracy for values between 0.7 and 0.9, and high accuracy for values greater than 0.9.

Fig. 5. Representation of an SROC curve for a contrived data set showing pairs of sensitivity (SENS) and specificity (SPEC) values for each contrived study (circles), with 95% confidence and prediction regions around the summary operating point. Each circle represents a study, and the SROC curve (black line) is drawn from all of these individual studies. Overall diagnostic accuracy is indicated by the closeness of the summary line (SROC curve) to the top left corner, which represents high sensitivity and specificity. However, this graph must be used sparingly, because in clinical contexts that vary across pathologies, clinicians may favor either sensitivity or specificity. The dashed line around the summary operating point estimate represents the 95% CI, and the dotted line represents the 95% prediction region. The area under the curve is the average of the true-positive rate over the entire range of false-positive rate values. In this graph, the area under the curve (AUC) is 0.99 (95% CI 0.98–1), which demonstrates excellent diagnostic accuracy. Readers can use the area under the curve as a performance index providing the overall accuracy of a diagnostic tool, and observe the combined sensitivity and specificity, keeping in mind that in some settings this point might not be the most relevant. For example, a clinician may favor a point with 100% sensitivity and lower specificity; in this case, the chosen threshold would be different.


The forest plot (Fig. 4) gives an overall estimate of the sensitivity and specificity of the index test and enables sensitivity and specificity to be displayed separately with CIs. However, the forest plot does not display the covariation between sensitivity and specificity. Other possibilities for representing the results graphically are the creation of a Fagan nomogram (Fig. 6) or a conditional probability plot (Fig. 7).

The choice of statistical method is between a fixed effect model and a random effects model.8,9,23 In a fixed effect model, it is assumed that there is only 1 overall performance of the index test; the result is therefore an estimate of this single performance. In a random effects model, it is assumed that there are several possible performances, and the result is the average of the estimates of performance. The random effects model can thus better manage heterogeneity, but it is less efficient at estimating performance when results are consistent. In practice, it is customary to check for the presence of heterogeneity and to choose a fixed effect model in the absence of variation. However, this question is rarely decisive for meta-analyses of diagnostic test accuracy: in most cases a random effects model is used, because these studies are often heterogeneous. The random effects model also offers several options: traditional methods, such as the method of Littenberg and Moses,38,42 or more sophisticated statistical methods, such as the bivariate random approach1,44,45 or the hierarchical approach.24,49,55
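As an illustration of the distinction, here is a minimal sketch contrasting an inverse-variance fixed effect pool with a random effects pool; the between-study variance is estimated with the DerSimonian-Laird method (a common choice, though not one specifically named in the text), and the logit-sensitivities and variances are hypothetical:

```python
# Minimal sketch: fixed effect vs. DerSimonian-Laird random effects pooling.
import numpy as np

y = np.array([2.2, 2.9, 1.8, 3.1, 2.4])       # logit(sensitivity) per study (hypothetical)
v = np.array([0.10, 0.25, 0.08, 0.30, 0.15])  # within-study variance per study

w = 1.0 / v
fixed = np.sum(w * y) / np.sum(w)          # fixed effect: one true performance assumed

Q = np.sum(w * (y - fixed) ** 2)           # Cochran Q around the fixed effect pool
C = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - (len(y) - 1)) / C)    # between-study variance estimate

w_star = 1.0 / (v + tau2)                  # random effects weights
random_pool = np.sum(w_star * y) / np.sum(w_star)

print(f"fixed: {fixed:.2f}, random: {random_pool:.2f}, tau^2: {tau2:.2f}")
```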

Studies of diagnostic tests have many indicators of diagnostic performance. Most studies report estimates of sensitivity and specificity, either alone or combined with other performance indicators.28 Pooling pairwise sensitivity and specificity is not easy because of the frequent negative correlation of these 2 indicators across studies; this negative correlation may be due, for example, to the use of different thresholds of positivity between studies.

The principle of the method of Littenberg and Moses38 is to construct an SROC curve. This approach converts each pair of sensitivity/specificity values into a single performance measure, the diagnostic odds ratio.22 The method is widely used in published meta-analyses,24 but it has major drawbacks: it does not distinguish between the capability to detect disease (sensitivity) and to rule out disease (specificity), and it ignores any heterogeneity between studies. These drawbacks result in inaccurate estimation of CIs.
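A minimal sketch of the Littenberg-Moses construction on contrived data: each study contributes D = logit(TPR) − logit(FPR) and S = logit(TPR) + logit(FPR), a line D = a + bS is fitted by ordinary least squares, and the fitted line is solved for TPR over a grid of false-positive rates to trace the SROC curve:

```python
# Minimal sketch of the Littenberg-Moses SROC approach (contrived data).
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def expit(x):
    return 1 / (1 + np.exp(-x))

tpr = np.array([0.95, 0.90, 0.85, 0.92, 0.88])  # per-study sensitivity (hypothetical)
fpr = np.array([0.05, 0.10, 0.04, 0.08, 0.06])  # per-study 1 - specificity

D = logit(tpr) - logit(fpr)   # log diagnostic odds ratio per study
S = logit(tpr) + logit(fpr)   # proxy for the positivity threshold

b, a = np.polyfit(S, D, 1)    # ordinary least squares fit of D = a + b*S

# Trace the SROC curve: solve the fitted line for TPR at each FPR value.
fpr_grid = np.linspace(0.01, 0.99, 99)
tpr_curve = expit((a + (1 + b) * logit(fpr_grid)) / (1 - b))
print(tpr_curve[:5])  # sensitivity at the lowest false-positive rates
```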

Two approaches can circumvent the limitations of the method of Littenberg and Moses: the hierarchical SROC model and the bivariate random model. The bivariate random model44 can directly pool the sensitivities and specificities in the same model using a random effects approach, taking into account the negative correlation between sensitivity and specificity across studies. This model can also examine the effect of covariates. The hierarchical SROC model focuses on estimating an average of the diagnostic performance and the average positivity threshold of the test, which are used to build an asymmetrical curve. Both models provide valid estimates of the SROC curve and its CIs, and both can be fitted with statistical software as “mixed models.”12 We recommend and encourage the use of these 2 flexible and powerful models; however, we draw attention to the fact that they require a fairly large number of studies to be accurate, and they also require the assistance of an experienced statistician.

Fig. 6. Example of a Fagan nomogram, a very useful tool that allows clinicians without strong statistical knowledge to determine the diagnostic accuracy of the test under evaluation. The clinical utility of the index test is evaluated by using the likelihood ratios (LRs) to calculate the posttest probability based on the Bayes theorem. For instance, a clinician believes that a patient has a 20% risk (Pre-test Probability, “Prior Prob”) of having the target condition based on the clinical and radiographic features. By drawing a line from the pretest probability (left vertical axis) through the likelihood ratio (center vertical axis) and extending it to the posttest probability (right vertical axis), one obtains the posttest probability and can determine whether performing the test adds substantial information. In this example, with positive and negative likelihood ratios of 50 and 0.04, respectively, the probability of having the target condition increases from 20% to 93% if the index test is positive (Post_Prob_Pos; red line) and decreases from 20% to 1% (dashed line) if the index test is negative (Post_Prob_Neg). In this case the index test is extremely accurate.

Fig. 7. Example of a conditional probability plot. The probability of a target condition given a positive or negative result of the index test (the positive or negative predictive value; see Appendix) is critically important to clinical practice. This plot represents predictive values across different prevalences or pretest probabilities, and shows the same data as the Fagan nomogram. To obtain the posttest probability, draw a vertical line from the chosen pretest probability to the positive likelihood ratio (LR) line (red) if the test is positive, or to the negative likelihood ratio line (dashed) if the test is negative.
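The arithmetic behind both Figs. 6 and 7 is the Bayes theorem applied on the odds scale; the sketch below reproduces the numbers quoted in the Fig. 6 caption:

```python
# Minimal sketch of the Bayes updating behind the Fagan nomogram, using the
# figures from the Fig. 6 caption (pretest probability 20%, LR+ = 50, LR- = 0.04).
def posttest_probability(pretest_prob, likelihood_ratio):
    pretest_odds = pretest_prob / (1 - pretest_prob)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)

print(posttest_probability(0.20, 50))    # ~0.93: a positive result nearly rules in
print(posttest_probability(0.20, 0.04))  # ~0.01: a negative result nearly rules out
```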


All of these statistical analyses, and their graphical representations, can be performed using a wide variety of free software (such as R and Meta-DiSc61). The more sophisticated models, such as hierarchical or bivariate models, require the use of more sophisticated software (such as Stata, SAS, or SPSS).

Interpretation of Results

Interpretation of results is the most important step of the meta-analysis, but it is often omitted. Nonetheless, this step is crucial in helping clinicians incorporate the results into their medical decision making. It is a difficult step because it requires a link to be established with clinical practice. The interpretation of results should be structured as follows: 1) a reminder of the main results (the main issue, the number of studies and patients included, the characteristics and methodological quality of the included studies, and the pooled results of the diagnostic performance indices); 2) the strengths and weaknesses of the meta-analysis (such as a lack of research literature or low methodological quality of the included studies); 3) the applicability of the results in clinical practice; and 4) conclusions (implications for clinicians and public health policy, and implications for future research).

Conclusions

Comparative effectiveness research is becoming an important public policy issue. Many efforts have been made by the US government to create a national CER program to improve clinical practice. Meta-analysis of diagnostic test accuracy is a tool that can be used in the CER paradigm. Neurosurgeons need to be familiar with this tool to improve research in their specialty. The ability to assess the validity of results and apply them appropriately to neurosurgical practice is of increasingly paramount importance. The principles of interpretation given in this paper will help neurosurgeons in this difficult task.

Appendix

Sensitivity = proportion of patients with the target condition who have a positive index test result:
  True Positives/(True Positives + False Negatives)

Specificity = proportion of patients without the target condition who have a negative index test result:
  True Negatives/(True Negatives + False Positives)

Positive Predictive Value = probability of the target condition among patients with a positive index test result:
  True Positives/(True Positives + False Positives)

Negative Predictive Value = probability of not having the target condition among patients with a negative index test result:
  True Negatives/(True Negatives + False Negatives)

Positive Likelihood Ratio (LR) = indicates how much the odds of the target condition increase when the index test is positive:
  Sensitivity/(1 − Specificity)

Negative LR = indicates how much the odds of the target condition decrease when the index test is negative:
  (1 − Sensitivity)/Specificity

Diagnostic Odds Ratio = overall indicator of diagnostic accuracy:
  (True Positives × True Negatives)/(False Positives × False Negatives)

I² Measure for Heterogeneity:
  ((Q − df)/Q) × 100%, where df = degrees of freedom (number of studies − 1) and Q = Cochran Q statistic

Cochran Q Statistic for Heterogeneity:
  Q = Σ wi(qi − q)², where wi = weight of each study, q = weighted mean log LR, and qi = estimate of the log LR for each study

Disclosure

The authors report no conflict of interest concerning the materials or methods used in this study or the findings specified in this paper.

Author contributions to the study and manuscript preparation include the following. Conception and design: Dubourg, Messerer. Drafting the article: Dubourg, Messerer. Critically revising the article: all authors. Approved the final version of the manuscript on behalf of all authors: Dubourg. Study supervision: Dubourg, Messerer.

    References

    1. Arends LR, Hamza TH, van Houwelingen JC, Heijenbrok-Kal MH, Hunink MG, Stijnen T: Bivariate random effects meta-analysis of ROC curves. Med Decis Making 28:621–638, 2008



    2. Bachmann LM, Coray R, Estermann P, Ter Riet G: Identifying diagnostic studies in MEDLINE: reducing the number needed to read. J Am Med Inform Assoc 9:653–658, 2002

3. Bachmann LM, Puhan MA, ter Riet G, Bossuyt PM: Sample sizes of studies on diagnostic accuracy: literature survey. BMJ 332:1127–1129, 2006

4. Baker WL, White CM, Cappelleri JC, Kluger J, Coleman CI: Understanding heterogeneity in meta-analysis: the role of meta-regression. Int J Clin Pract 63:1426–1434, 2009

5. Begg CB, Berlin JA: Publication bias: a problem in interpreting medical data. J R Stat Soc A 151:419–463, 1988

    6. Begg CB, Mazumdar M: Operating characteristics of a rank correlation test for publication bias. Biometrics 50:1088–1101, 1994

7. Bernard C: Introduction à l’Etude de la Médecine Expérimentale. Paris: J. B. Baillière et fils, 1865

8. Borenstein M, Hedges LV, Higgins JP, Rothstein HR: A basic introduction to fixed effect and random effects models for meta-analysis. Research Synthesis Methods 1:97–111, 2010

9. Borenstein M, Hedges LV, Higgins JP, Rothstein HR: Introduction to Meta-Analysis. West Sussex, UK: John Wiley & Sons, 2009

10. Bossuyt PM, Irwig L, Craig J, Glasziou P: Comparative accuracy: assessing new tests against existing diagnostic pathways. BMJ 332:1089–1092, 2006

    11. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig LM, et al: Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Ann Intern Med 138:40–44, 2003

    12. Chu H, Cole SR: Bivariate meta-analysis of sensitivity and specificity with sparse data: a generalized linear mixed model approach. J Clin Epidemiol 59:1331–1333, 2006 (Letter)

    13. Cochran WG: The combination of estimates from different experiments. Biometrics 10:101–129, 1954

14. Deeks JJ: Systematic reviews in health care: systematic reviews of evaluations of diagnostic and screening tests. BMJ 323:157–162, 2001

15. Deeks JJ, Macaskill P, Irwig L: The performance of tests of publication bias and other sample size effects in systematic reviews of diagnostic test accuracy was assessed. J Clin Epidemiol 58:882–893, 2005

16. Devillé WL, Buntinx F, Bouter LM, Montori VM, de Vet HC, van der Windt DA, et al: Conducting systematic reviews of diagnostic studies: didactic guidelines. BMC Med Res Methodol 2:9, 2002

    17. Dickersin K: The existence of publication bias and risk factors for its occurrence. JAMA 263:1385–1389, 1990

18. Dickersin K, Min YI, Meinert CL: Factors influencing publication of research results. Follow-up of applications submitted to two institutional review boards. JAMA 267:374–378, 1992

19. Dubourg J, Javouhey E, Geeraerts T, Messerer M, Kassai B: Ultrasonography of optic nerve sheath diameter for detection of raised intracranial pressure: a systematic review and meta-analysis. Intensive Care Med 37:1059–1068, 2011

20. Easterbrook PJ, Berlin JA, Gopalan R, Matthews DR: Publication bias in clinical research. Lancet 337:867–872, 1991

21. Egger M, Davey Smith G, Schneider M, Minder C: Bias in meta-analysis detected by a simple, graphical test. BMJ 315:629–634, 1997

    22. Glas AS, Lijmer JG, Prins MH, Bonsel GJ, Bossuyt PM: The diagnostic odds ratio: a single indicator of test performance. J Clin Epidemiol 56:1129–1135, 2003

23. Hamza TH, Arends LR, van Houwelingen HC, Stijnen T: Multivariate random effects meta-analysis of diagnostic tests with multiple thresholds. BMC Med Res Methodol 9:73, 2009

24. Harbord RM, Whiting P, Sterne JA, Egger M, Deeks JJ, Shang A, et al: An empirical comparison of methods for meta-analysis of diagnostic accuracy showed hierarchical models are necessary. J Clin Epidemiol 61:1095–1103, 2008

25. Haynes RB, Wilczynski N, McKibbon KA, Walker CJ, Sinclair JC: Developing optimal search strategies for detecting clinically sound studies in MEDLINE. J Am Med Inform Assoc 1:447–458, 1994

26. Haynes RB, Wilczynski NL: Optimal search strategies for retrieving scientifically strong studies of diagnosis from Medline: analytical survey. BMJ 328:1040, 2004

    27. Higgins JP, Thompson SG: Quantifying heterogeneity in a meta-analysis. Stat Med 21:1539–1558, 2002

28. Honest H, Khan KS: Reporting of measures of accuracy in systematic reviews of diagnostic literature. BMC Health Serv Res 2:4, 2002

29. Institute of Medicine: Initial National Priorities for Comparative Effectiveness Research: Report Brief. Washington, DC: National Academies Press, 2009 (http://www.iom.edu/~/media/Files/Report%20Files/2009/ComparativeEffectivenessResearchPriorities/CER%20report%20brief%2008-13-09.pdf) [Accessed May 15, 2012]

    30. Ioannidis JP: Effect of the statistical significance of results on the time to completion and publication of randomized efficacy trials. JAMA 279:281–286, 1998

31. Ioannidis JP, Trikalinos TA: The appropriateness of asymmetry tests for publication bias in meta-analyses: a large survey. CMAJ 176:1091–1096, 2007

32. Irwig L, Macaskill P, Glasziou P, Fahey M: Meta-analytic methods for diagnostic test accuracy. J Clin Epidemiol 48:119–132, 1995

    33. Irwig L, Tosteson AN, Gatsonis C, Lau J, Colditz G, Chalmers TC, et al: Guidelines for meta-analyses evaluating diagnostic tests. Ann Intern Med 120:667–676, 1994

34. Kassaï B, Sonié S, Shah NR, Boissel JP: Literature search parameters marginally improved the pooled estimate accuracy for ultrasound in detecting deep venous thrombosis. J Clin Epidemiol 59:710–714, 2006

35. Knottnerus JA: The Evidence Base of Clinical Diagnosis. London: BMJ Books, 2002

36. Leeflang MM, Scholten RJ, Rutjes AW, Reitsma JB, Bossuyt PM: Use of methodological search filters to identify diagnostic accuracy studies can lead to the omission of relevant studies. J Clin Epidemiol 59:234–240, 2006

    37. Lijmer JG, Mol BW, Heisterkamp S, Bonsel GJ, Prins MH, van der Meulen JH, et al: Empirical evidence of design-related bias in studies of diagnostic tests. JAMA 282:1061–1066, 1999

    38. Littenberg B, Moses LE: Estimating diagnostic accuracy from multiple conflicting reports: a new meta-analytic method. Med Decis Making 13:313–321, 1993

39. Marko NF, Weil RJ: An introduction to comparative effectiveness research. Neurosurgery 70:425–434, 2012

40. Messerer M, Dubourg J, Saint-Pierre G, Jouanneau E, Sindou M: Percutaneous biopsy of lesions in the cavernous sinus region through the foramen ovale: diagnostic accuracy and limits in 50 patients. Clinical article. J Neurosurg 116:390–398, 2012

41. Miller SW, Sinha D, Slate EH, Garrow D, Romagnuolo J: Bayesian adaptation of the summary ROC curve method for meta-analysis of diagnostic test performance. J Data Sci 7:349–364, 2009

42. Mitchell MD: Validation of the summary ROC for diagnostic test meta-analysis: a Monte Carlo simulation. Acad Radiol 10:25–31, 2003

43. Peters JL, Sutton AJ, Jones DR, Abrams KR, Rushton L: Comparison of two methods to detect publication bias in meta-analysis. JAMA 295:676–680, 2006

44. Reitsma JB, Glas AS, Rutjes AW, Scholten RJ, Bossuyt PM, Zwinderman AH: Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol 58:982–990, 2005

45. Riley RD, Abrams KR, Sutton AJ, Lambert PC, Thompson JR: Bivariate random-effects meta-analysis and the estimation of between-study correlation. BMC Med Res Methodol 7:3, 2007

46. Rosman AS, Korsten MA: Application of summary receiver operating characteristics (sROC) analysis to diagnostic clinical testing. Adv Med Sci 52:76–82, 2007

47. Rücker G, Schwarzer G, Carpenter J: Arcsine test for publication bias in meta-analyses with binary outcomes. Stat Med 27:746–763, 2008

48. Rutjes AW, Reitsma JB, Di Nisio M, Smidt N, van Rijn JC, Bossuyt PM: Evidence of bias and variation in diagnostic accuracy studies. CMAJ 174:469–476, 2006

    49. Rutter CM, Gatsonis CA: A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat Med 20:2865–2884, 2001

50. Song F, Eastwood AJ, Gilbody S, Duley L, Sutton AJ: Publication and related biases. Health Technol Assess 4:1–115, 2000

51. Song F, Khan KS, Dinnes J, Sutton AJ: Asymmetric funnel plots and publication bias in meta-analyses of diagnostic accuracy. Int J Epidemiol 31:88–95, 2002

52. Swets JA: Measuring the accuracy of diagnostic systems. Science 240:1285–1293, 1988

    53. Thompson SG: Why sources of heterogeneity in meta-analysis should be investigated. BMJ 309:1351–1355, 1994

54. Turner EH, Matthews AM, Linardatos E, Tell RA, Rosenthal R: Selective publication of antidepressant trials and its influence on apparent efficacy. N Engl J Med 358:252–260, 2008

55. Wang F, Gatsonis CA: Hierarchical models for ROC curve summary measures: design and analysis of multi-reader, multi-modality studies of medical tests. Stat Med 27:243–256, 2008

56. Westwood ME, Whiting PF, Kleijnen J: How does study quality affect the results of a diagnostic meta-analysis? BMC Med Res Methodol 5:20, 2005

    57. Whiting P, Harbord R, Kleijnen J: No role for quality scores in systematic reviews of diagnostic accuracy studies. BMC Med Res Methodol 5:19, 2005

58. Whiting P, Rutjes AW, Reitsma JB, Bossuyt PM, Kleijnen J: The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Med Res Methodol 3:25, 2003

59. Whiting P, Rutjes AW, Reitsma JB, Glas AS, Bossuyt PM, Kleijnen J: Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med 140:189–202, 2004

    60. Whiting PF, Rutjes AW, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, et al: QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med 155:529–536, 2011

    61. Zamora J, Abraira V, Muriel A, Khan K, Coomarasamy A: Meta-DiSc: a software for meta-analysis of test accuracy data. BMC Med Res Methodol 6:31, 2006

Manuscript submitted March 15, 2012. Accepted May 8, 2012.

Please include this information when citing this paper: DOI: 10.3171/2012.5.FOCUS1295.

Address correspondence to: Julie Dubourg, M.D., Université de Lyon 1, Lyon, Centre d’Investigation Clinique CIC 201, EPICIME, Hôpital Louis Pradel, Hospices Civils de Lyon, 59 Boulevard Pinel, F-69677 Bron, France. email: [email protected].
