Title: Biostatistics and Evaluating Published Studies

Authors: Ron W. Reeder1, 2, PhD; Russell Banks1, MS; Richard Holubkov1, PhD

1 Department of Pediatrics, University of Utah, 295 Chipeta Way, Salt Lake City, UT 84108, USA

2 Corresponding author

Email addresses of authors: Richard Holubkov <[email protected]>; Russell Banks <[email protected]>; Ron Reeder <[email protected]>

Key words: Biostatistics, linear regression, logistic regression, Cox proportional hazards, limitations, study design

1 ABSTRACT

The purpose of this chapter is to provide guidance for interpreting the statistical results of published studies. This chapter gives explicit and practical guidance for assessing the strength of study designs and interpreting p-values, confidence intervals, and several common statistical models, such as logistic regression.

Randomized controlled trials (RCTs) provide the strongest level of evidence for the effectiveness of a therapy, but if care is taken to account for potential confounding, then other designs, such as observational studies, may also provide strong evidence.

Hypothesis testing and p-values are fundamental tools for evaluating relationships between outcomes and potential predictors. Regression analysis provides a way to assess the relationship between two or more variables. Logistic regression, Cox regression, and ordinary linear regression are common types of regression modeling in clinical research. These regression models allow for the control of potentially confounding variables. Understanding how to appropriately interpret results of these models (odds ratios, hazard ratios, and effect sizes) allows the reader to understand how the predictors relate to the outcome being modeled. Correct interpretation is especially important for evaluating the potential clinical importance of the variables being modeled.

In addition to correctly interpreting statistical results, it is important to understand the generalizability of the reported results in order to appropriately contextualize the findings. A careful review of study limitations, including those that may not be reported, is essential for understanding generalizability.

Keywords: Study design, confidence interval, p-value, hypothesis test, power, significance, regression, generalizability, interpreting results

1.1 INTRODUCTION

The purpose of this chapter is to provide guidance for interpreting the statistical results of published studies. This chapter gives explicit and practical guidance for assessing the strength of study designs and interpreting p-values, confidence intervals, and several common statistical models, such as logistic regression. Additionally, this chapter will orient the reader to important concepts for understanding the importance, generalizability, and limitations of reported studies. We begin with a discussion of how study design impacts the strength and generalizability of research.

1.2 STUDY DESIGN

Randomized controlled trials (RCTs) provide the strongest level of evidence for the effectiveness of a therapy, but if care is taken to account for potential confounding, then other designs, such as observational studies, may also provide strong evidence.

1.3 INTERPRETING RESULTS

Hypothesis testing and p-values are fundamental tools for evaluating relationships between outcomes and potential predictors. Regression analysis provides a way to assess the relationship between two or more variables. Logistic regression, Cox regression, and ordinary linear regression are common types of regression modeling in clinical research. These regression models allow for the control of potentially confounding variables. Understanding how to appropriately interpret results of these models (odds ratios, hazard ratios, and effect sizes) allows the reader to understand how the predictors relate to the outcome being modeled. Correct interpretation is especially important for evaluating the potential clinical importance of variables being modeled.

1.4 UNDERSTANDING LIMITATIONS

In addition to correctly interpreting statistical results, it is important to understand the generalizability of the reported results in order to appropriately contextualize the findings. A careful review of study limitations, including those that may not be reported, is essential for understanding generalizability.

2 LEARNING OBJECTIVES

• Explain how study design impacts the strength of research evidence
• Describe general statistical principles, including p-values and confidence intervals
• Interpret results from logistic and other regression models
• Describe the impact of limitations on the generalizability of study conclusions

3 STUDY DESIGN

This section discusses the overall level of evidence one might obtain in support of the effectiveness of a therapy based on different types of study designs. Table 51-1 summarizes, in a general fashion, the types of studies that fall into each of three categories—Strong, Moderate, and Suggestive. These categories may be used as an initial gauge of the strength of evidence for a particular publication that is being reviewed. However, there may be limitations which affect this interpretation; these limitations are discussed in detail in Understanding Limitations.

Table 51-1. Categories of Evidence used by the Agency for Healthcare Research and Quality (AHRQ) Innovations Exchange to assess the strength of the link between an innovation and the observed results


Strength of evidence: Strong
Description of research: The evidence is based on one or more evaluations using experimental designs based on random allocation of individuals or groups of individuals (e.g., medical practices or hospital units) to comparison groups. The results of the evaluation(s) show consistent direct evidence of the effectiveness of the innovation in improving the targeted health care outcomes and/or processes, or structures in the case of health care policy innovations.
Examples of study designs: Randomized controlled trial

Strength of evidence: Moderate
Description of research: While there are no randomized, controlled experiments, the evidence includes at least one systematic evaluation of the impact of the innovation using a quasi-experimental design, which could include the non-random assignment of individuals to comparison groups, before-and-after comparisons in one group, and/or comparisons with a historical baseline or control. The results of the evaluation(s) show consistent direct or indirect evidence of the effectiveness of the innovation in improving targeted health care outcomes and/or processes, or structures in the case of health care policy innovations. However, the strength of the evidence is limited by the size, quality, or generalizability of the evaluations, and thus alternative explanations cannot be ruled out.
Examples of study designs: Controlled trial without randomization; prospective cohort study; case-control study

Strength of evidence: Suggestive
Description of research: While there are no systematic experimental or quasi-experimental evaluations, the evidence includes non-experimental or qualitative support for an association between the innovation and targeted health care outcomes or processes, or structures in the case of health care policy innovations. This evidence may include non-comparative case studies, correlation analysis, or anecdotal reports. As with the category above, alternative explanations for the results achieved cannot be ruled out.
Examples of study designs: Case series; case reports

Strength of evidence and description of research columns, but not examples of study designs, are used with permission from AHRQ.

Randomized controlled trials (RCTs) are studies in which participants are randomly allocated to one of two or more treatments. RCTs provide the strongest level of evidence for the comparative or relative effectiveness of a therapy since, when the randomization is delivered in a valid fashion, comparisons between treatment arms are statistically valid, and observed differences in outcomes are usually not attributable to bias or to unobserved differences in patient characteristics between the treatment arms. This property of randomized trials is why the evidence is referred to as “direct” evidence of an effect.

The Moderate level of evidence in Table 51-1 refers to “quasi-experimental” designs, where a randomized trial is not carried out, but an effort is made to compare nonrandomized groups, or to ascertain effectiveness in a single group over time, in a valid fashion. The strongest type of nonrandomized study is a cohort study, in which two or more groups of participants are followed over time and compared with respect to an outcome. Such studies are often carried out when treatments cannot be assigned in a random fashion due to ethical or logistical realities. Ethically, for example, one cannot randomize tobacco use among consenting adult patients. Similarly, logistical and ethical concerns deny researchers the ability to randomize all patients with coronary artery disease within a hospital system who are eligible for both bypass graft surgery and catheter-based revascularization to one of the two approaches. While a sufficiently inclusive and detailed administrative database may allow researchers to prospectively compare health resource utilization between smokers and nonsmokers, or to compare outcomes, including survival, between patients revascularized with surgery versus those treated with catheter-based approaches, the level of evidence for such comparisons is admittedly weaker than in a randomized trial. Outside the randomization setting, there may be other systematic differences between the groups predominantly responsible for the observed association with the outcomes of interest. Smokers are typically older than nonsmokers, for example, and more severe coronary artery disease is more likely to be treated with bypass surgery than less extensive disease. Thus, when evaluating the validity of cohort study results, reviewers must determine whether appropriate adjustment was made for factors that could have confounded the observed relationship between treatment and outcome. Adjustment is discussed further in Controlling for other variables.

A case control study is also classified as Moderate evidence. In a case control study, cohorts of patients with and without an outcome (e.g., a rare cancer) are compared for past exposure to agents or conditions for which an association with the outcome is being investigated. Although not considered to provide the strength of evidence of an RCT, in certain settings where an RCT may not be ethical or practical, case control studies may prove most valuable. For example, the landmark associations of lung carcinoma with cigarette smoking and vaginal adenocarcinoma with maternal stilbestrol use were established via case control studies. However, case control studies must be carefully evaluated for their relevance, as retrospective evaluation of exposures is susceptible to potential flaws, such as recall and ascertainment bias, that do not occur in the prospective study setting, as well as estimation bias due to suboptimal matching of cases (with the outcome) to controls (without the outcome).

The Suggestive level of evidence includes studies such as case reports and case series, which are anecdotal reports where results are reported without comparison to a relevant control group. Therapies that show promise are quite often initially reported to the clinical community in such settings. The level of evidence is only considered suggestive since, “alternative explanations cannot be ruled out” for observed absolute or relative efficacy when the setting is not at least quasi-experimental. Small case reports, for example, may involve evaluations of patients who are ideal candidates for a particular therapy and so may also have performed relatively well on other therapies, and the level of rigor in evaluation of outcomes may not be as high as in larger, multicenter, prospective evaluations.

These general observations about levels of evidence should be interpreted with caution and in the context of the specific studies being considered. For example, a large, rigorously controlled, nonrandomized prospective cohort study may be viewed as superior in quality to a small randomized trial with substantial missing outcome data. Potential limitations are discussed in detail in Understanding Limitations.

Margin notes: Randomized controlled trials (RCTs) provide the strongest level of evidence for the comparative or relative effectiveness of a therapy.


4 INTERPRETING RESULTS

Clinicians should develop the skills to interpret the statistical results of a clinical research study to appropriately apply the findings to patient care. While clinicians often rely on statistician collaborators to perform the statistical analyses and to aid in the interpretation of results, a clinician collaborator has the responsibility to develop a basic understanding of the principles of interpreting statistical results. In this section, the general principles at the core of most statistical analyses are described, and those principles are applied to some of the most common methodologies.

4.1 GENERAL PRINCIPLES

The basic methods for drawing conclusions based on clinical studies are confidence intervals and statistical tests (i.e., hypothesis tests).

4.1.0 Confidence intervals

A confidence interval (CI) is a range of plausible values for a parameter of interest. The percentage (e.g., 95% CI) is the level of confidence. For example, there is 95% confidence that the parameter of interest is somewhere inside the 95% CI. Restated, the 95% confidence interval can be expected to contain the true value of the parameter of interest for approximately 95 out of every 100 studies in which it is used. Confidence intervals are usually reported along with a best estimate of the parameter. An interesting example comes from the Therapeutic Hypothermia after Pediatric Cardiac Arrest Out-of-Hospital (THAPCA-OH) trial.

In the THAPCA-OH trial, 295 children remaining unconscious at 38 hospitals after an out-of-hospital cardiac arrest were randomly assigned to treatment with targeted temperature management of either 33°C (hypothermia) or 36.8°C (normothermia) within 6 hours after the return of circulation. Among the 260 evaluable subjects who had satisfactory neurobehavioral status prior to their cardiac arrest, subjects treated with hypothermia were estimated to have a 7.3 percentage point increase (95% CI, -1.5 to 16.1) in the probability of a good neurobehavioral outcome at 12 months compared to subjects treated with normothermia. Thus, the best estimate is that hypothermia increases the probability of a good outcome by 7.3 percentage points. However, since the confidence interval contains zero as a plausible effect, the possibility that therapeutic hypothermia has no effect at all on neurobehavioral outcome cannot be ruled out. At the same time, the possibility that therapeutic hypothermia improves the probability of good outcome by more than 16.1 percentage points can be ruled out with high confidence. Similarly, the possibility that therapeutic hypothermia worsens the probability of good outcome by more than 1.5 percentage points can also be excluded.
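A risk-difference confidence interval of this kind can be computed directly from group counts. A minimal sketch using the common Wald (normal-approximation) interval; the event counts and group sizes below are illustrative, not the actual THAPCA-OH data:

```python
# Wald 95% CI for a difference in two proportions (risk difference).
# All counts below are hypothetical, for illustration only.
import math
from statistics import NormalDist

def risk_difference_ci(events1, n1, events2, n2, conf=0.95):
    """Return (estimate, lower, upper) for p1 - p2."""
    p1, p2 = events1 / n1, events2 / n2
    diff = p1 - p2                                   # estimated risk difference
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)     # about 1.96 for 95%
    return diff, diff - z * se, diff + z * se

diff, lo, hi = risk_difference_ci(events1=35, n1=130, events2=26, n2=130)
print(f"risk difference {diff:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```

If the resulting interval contains zero, the possibility of no effect cannot be ruled out, exactly as described for the THAPCA-OH interval above.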

Another way of evaluating the relationship between an intervention and an outcome is through hypothesis testing.

Margin notes: A confidence interval (CI) is a range of plausible values. If the confidence interval contains zero as a plausible effect, the possibility that the intervention has no effect at all on the outcome of interest cannot be excluded.


4.1.1 Hypothesis testing

A statistical test is an evaluation of two competing hypotheses. The alternative hypothesis is typically that a certain relationship exists, while the null hypothesis is that the relationship does not exist. For example, in the THAPCA-OH trial, the null hypothesis was that therapeutic hypothermia has no effect on neurobehavioral outcome at 12 months for children experiencing an out-of-hospital cardiac arrest. The alternative hypothesis was that therapeutic hypothermia either increases or decreases the probability of a good 12-month neurobehavioral outcome, compared to normothermia. Clinical studies are typically designed with the aim of ruling out or rejecting the null hypothesis that there is no effect or relationship, in favor of the alternative hypothesis.

The p-value of a statistical test is a measure of the plausibility of the null hypothesis based on the observed data. Large p-values indicate a plausible null hypothesis, whereas small p-values suggest that the observed data would be unlikely if the null hypothesis were true. The p-value can be interpreted as the chance that a treatment effect at least as large as the one actually observed would have been found by chance, if there were truly no treatment effect at all. For example, suppose a hypothetical trial found 12% higher survival with Drug B compared to Drug A, and reported a p-value of 0.03. This value of 0.03 indicates that if investigators were to carry out exactly this same trial repeatedly with the same number of participants, but survival truly was the same with both drugs, a difference of 12% or more would be found by chance only about 3 out of 100 times. In summary, the p-value is intended to summarize in a single number the concept of “are the observed results convincing?” If a reported p-value is sufficiently small, the researcher rejects the null hypothesis as implausible and concludes that the alternative is true. The threshold below which a p-value is considered sufficiently small (i.e., statistically significant) may vary, but p-values less than 0.05 are generally considered sufficient evidence in favor of the alternative hypothesis. Using a threshold of p < 0.05 ensures that the false positive rate will be no more than 5%. In other words, if the null hypothesis is true, there is no more than a 5% chance that a study would conclude otherwise using this threshold. The threshold chosen is referred to as alpha (α), and the false positive rate is referred to as the Type I Error Rate.
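The claim that a p < 0.05 threshold caps the false positive rate at about 5% can be checked by simulation. A small sketch in which both arms are drawn from the same distribution, so the null hypothesis is true by construction and every "significant" result is a false positive:

```python
# Simulate many two-arm trials under a true null hypothesis and count
# how often a two-sample t-test reports p < 0.05 (the false positive rate).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_trials, n_per_arm = 2000, 30
false_positives = 0
for _ in range(n_trials):
    # Both arms come from the same normal distribution: no true effect.
    a = rng.normal(loc=0.0, scale=1.0, size=n_per_arm)
    b = rng.normal(loc=0.0, scale=1.0, size=n_per_arm)
    _, p = ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

rate = false_positives / n_trials
print(f"observed false positive rate: {rate:.3f}")  # should be near 0.05
```

The observed rate hovers near 5%, matching the stated Type I Error Rate for α = 0.05.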

As an example, in the THAPCA-OH trial, the researchers would have concluded that therapeutic hypothermia affects the mortality rate if the p-value had been less than 0.05. However, since the reported p-value was 0.14, the researchers concluded that the null hypothesis of no effect was sufficiently plausible and reported that ‘in comatose children who survived out-of-hospital cardiac arrest, therapeutic hypothermia, as compared with therapeutic normothermia, did not confer a significant benefit in survival with a good functional outcome at 1 year.’

An estimate and confidence interval are often reported along with a p-value. A p-value provides a quick way to determine whether sufficient evidence of a relationship exists, and the estimate and confidence interval demonstrate how large the effect might be. The results of the THAPCA-OH trial might be reported in the following way: There was no significant difference in the probability of good neurocognitive outcome at 12 months in the hypothermia group compared to the normothermia group (estimated risk difference, 7.3%; 95% CI, -1.5 to 16.1%; p = 0.14). Assuming that the same statistical approach is used to generate the p-value and confidence interval, a 95% CI will not include the value corresponding to the null hypothesis (for example, a treatment difference of 0) when the p-value is < 0.05.


Margin notes: Hypothesis testing is a formal evaluation of two competing hypotheses. The p-value indicates the probability that a treatment effect at least as large as the one actually observed would have been found by chance, if there were truly no treatment effect at all.

4.1.1.1 Common tests

Countless statistical tests are available. However, knowing a handful of the most commonly used tests provides an advantage when interpreting the results of many clinical studies. Table 51-2 provides salient details of some of the most frequently used tests for comparing two groups.

Table 51-2. Frequently used tests for comparing two groups

Test name: T-test for two independent samples
What is being assessed: Is the average of a specified numeric variable the same in two different groups? For example, one might assess whether the average of each subject's daily red cell transfusion volume while receiving extracorporeal membrane oxygenation is the same in survivors vs. non-survivors.
When it may be appropriate: The results from one group are independent from the results of the other group [1]; AND the sample size is large OR the specified numeric variable does not have values that are much larger or smaller than typical values (i.e., no outliers). For very small sample sizes, the variable's distribution in each sample should be approximately normal, i.e., bell-shaped.
What a low p-value indicates: The average is larger in one group compared to the other.

Test name: Wilcoxon rank-sum test, also known as the Mann-Whitney U test
What is being assessed: Is the distribution of a specified numeric variable the same in two different groups? This is similar to, but not identical to, the above t-test. For example, one might assess whether the distribution of daily red cell transfusion volume while receiving extracorporeal membrane oxygenation is the same in survivors vs. non-survivors.
When it may be appropriate: The results from one group are independent from the results of the other group [1]. Note that this test is not excessively influenced by outliers and does not rely on large sample sizes.
What a low p-value indicates: The specified numeric variable tends to be larger in one group compared to the other.

Test name: Chi-square test
What is being assessed: Are two specified categorical variables independent? For example, one might assess whether clinician responsiveness (yes, no) to improved oxygenation following initiation of inhaled nitric oxide for respiratory failure is independent of survival to hospital discharge (yes, no). Although this example uses binary variables, the variables may have additional categories.
When it may be appropriate: Two categorical variables are assessed on each subject; AND a rather technical requirement is that most combinations (e.g., ≥ 75%) of the two variables would occur in at least a handful of subjects (e.g., ≥ 5) if the variables were independent. In practice, this condition is likely to be met unless one of the variables has a category with very few subjects (as might be seen when comparing rates of rare events).
What a low p-value indicates: There is a relationship between the two specified categorical variables.

Test name: Fisher's exact test
What is being assessed: Same as the Chi-square test.
When it may be appropriate: Two categorical variables are assessed on each subject. Note that the somewhat technical assumption needed for the Chi-square test regarding combinations of the two variables is NOT required for Fisher's exact test. For this reason, it is preferred when some table values are small (e.g., < 5), such as for comparing rates of rare events.
What a low p-value indicates: Same as the Chi-square test.

1 Results from one group are considered 'independent' from the results of another group if the results of the second group are not connected to or dependent on the results of the first group. For example, if a clinical trial has a group of subjects who receive an active treatment while a second group of subjects receives a placebo, these two groups are independent. In contrast, if a pharmacokinetic trial comparing two formulations of a drug evaluates both formulations in the same cohort, then the two groups (Formulation A and Formulation B) are connected, or dependent, because they use the same subjects. In particular, a subject's characteristics may lead to similarly high (or low) concentrations of the drug under both Formulation A and Formulation B, so the two sets of results are correlated.
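All four tests in Table 51-2 are available in standard statistical software. A brief sketch using SciPy; every dataset below is made up purely for illustration:

```python
# Running the four tests from Table 51-2 on small hypothetical datasets.
import numpy as np
from scipy import stats

# Hypothetical daily transfusion volumes (mL/kg) in two independent groups.
survivors     = np.array([12.1, 9.8, 14.3, 11.0, 10.5, 13.2, 9.1, 12.7])
non_survivors = np.array([15.4, 18.2, 13.9, 16.8, 17.1, 14.6, 19.0, 16.2])

t_stat, p_t = stats.ttest_ind(survivors, non_survivors)     # t-test
u_stat, p_u = stats.mannwhitneyu(survivors, non_survivors)  # Wilcoxon rank-sum

# Hypothetical 2x2 table: clinician responsiveness (rows) vs. survival (cols).
table = np.array([[30, 10],
                  [15, 25]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)  # Chi-square
odds_ratio, p_fisher = stats.fisher_exact(table)             # Fisher's exact

print(p_t, p_u, p_chi2, p_fisher)
```

Note that for the 2x2 table the Chi-square and Fisher's exact p-values are typically close when expected cell counts are adequate; Fisher's exact test is the safer choice when some counts are small.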

4.1.2 The Concept of Statistical Power

If a study concludes that there is not a significant effect or relationship, one must consider whether the study had the ability to discern a relationship of a clinically important magnitude. Statistical power is the probability that a study will detect a statistically significant effect, given that the true effect is of a given magnitude and considering the number of study participants. A related concept is Type II Error, which occurs when a study erroneously fails to detect a relationship, i.e., when a false null hypothesis is not rejected. The Type II error rate, i.e., the probability of a Type II Error, is referred to as beta (β). Power and β are directly related (power = 1 - β), so that if the power is 0.90 (90%) then β is 0.10 (10%).

Power is calculated using a variety of formulas that depend on the study design and the type of outcome being analyzed (e.g., binary or continuous). Details involving power calculations are beyond the scope of this chapter. Rather, the primary objective of this section is to introduce the reader to the concept of statistical power. For a more detailed discussion of statistical power calculations, the authors suggest ‘The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results’ (See Suggested Readings).

The concept of statistical power is paramount for the design of clinical trials. The following general relationships between power, sample size, and treatment effect magnitude are fundamental:

• Larger studies have more power to detect a treatment effect of any given magnitude than smaller studies; and
• Detecting a small treatment effect with adequate power requires a larger number of subjects than detecting a larger treatment effect.

The concept of statistical power may also be useful for the interpretation of a published trial. During the design phase, a clinical trial should pre-specify the assumed treatment effect and other necessary technical assumptions about the outcomes; for example, when comparing continuous outcomes between two arms, estimated standard deviations are required for each study arm as well as expected mean outcomes. The significance level, i.e., the false positive rate, is also pre-specified, usually but not always at 0.05. The number of participants enrolled in the trial should then be sufficiently high to detect a significant difference with substantial power, typically 80% or greater. Any high-quality published clinical trial should report its assumed treatment effect, power, and significance criterion.

When a published trial is negative, e.g., p > 0.05, some care is required in the interpretation. One must not be too quick to declare that the studied treatment is not effective. Instead, the reader should first examine the magnitude of treatment effect that the trial was powered to detect. If the trial was only powered to detect a change in survival from 40% to 50% or more, then the reader should not necessarily rule out the possibility that the treatment improves survival from, say, 40% to 45%. Refer back to Hypothesis testing to review how confidence intervals demonstrate what magnitude of treatment effect is plausible, and conversely, what can be ruled out.
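The 40%-to-50% example can be made concrete with the standard normal-approximation sample-size formula for comparing two proportions. This is only a sketch of the arithmetic; real trial designs usually rely on validated software and adjust for dropout and other practicalities:

```python
# Approximate per-arm sample size to detect a difference between two
# proportions p1 and p2 with a two-sided z-test (normal approximation).
import math
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # critical value, two-sided
    z_b = NormalDist().inv_cdf(power)           # quantile for desired power
    p_bar = (p1 + p2) / 2                       # pooled proportion under H0
    numerator = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

print(n_per_arm(0.40, 0.50))  # detecting 40% vs. 50% survival
print(n_per_arm(0.40, 0.45))  # the smaller 40% vs. 45% effect needs far more
```

The second call illustrates the fundamental relationship above: halving the treatment effect roughly quadruples the required sample size.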

For non-randomized studies, the number of participants may not be modifiable, as the researcher may be working with an existing cohort or database with a fixed number of patients. However, even in this case, negative study findings should address whether the study had the statistical power to detect effects of important magnitude. In these cases, confidence intervals alone are typically utilized to determine what magnitude of treatment effect can be ruled out (see Hypothesis testing).

Margin notes: Statistical power is the probability that a study will detect a true effect of a specified magnitude as statistically significant.

4.1.3 Clinical vs. statistical significance When the number of research subjects is very large, effects of a small magnitude may be identified as statistically significant. For example, the GUSTO trial enrolled over 41,000 patients suffering a myocardial infarction, comparing a then-new and expensive thrombolytic therapy (tissue-plasminogen activator (t-PA)) to the commonly used therapy (streptokinase). Rates of death or serious stroke were 6.9% with t-PA versus 7.8% with streptokinase; the difference of less than 1% was highly statistically significant. While many clinicians believe that this is a clinically important finding, as lives are being saved with the new therapy, concerns were raised about the practical importance of this small difference given the added cost of the therapy, roughly $2,000 per patient.

The GUSTO trial was designed with an extremely large sample size precisely because a very small absolute benefit on reducing rates of death and stroke was believed to be critically important by the participating investigators. Nevertheless, the GUSTO example is presented here to point out that there may be instances where a treatment effect or association is reported as statistically significant, but the magnitude of the effect may not be uniformly interpretable as being of clinical or practical importance. A very large study showing, for example, a correlation between two factors that is modest in magnitude, but statistically significant, should be reviewed critically.

Margin notes: Both clinical and statistical significance should be considered when evaluating the potential impact of study findings.

4.1.4 Number needed to treat Another way of gauging the potential clinical significance of reported results is the number needed to treat. If an intervention is found to improve outcomes (e.g., p-value < 0.05), the clinical significance of the finding may be conveyed as the number needed to treat (NNT). The NNT is the average number of subjects that need to be treated in order to prevent a single poor or undesired outcome. To derive this measure, we first calculate the absolute benefit of the better treatment, i.e., the absolute decrease in the rate of the poor outcome among those receiving the better treatment compared to those receiving the inferior treatment. The NNT is then simply 1/(absolute benefit).

Revisiting the GUSTO example above, the number of subjects needed to be treated with t-PA, compared to streptokinase, to prevent one additional death or serious stroke can be estimated to be:

(1/(7.8%-6.9%)) = (1/0.9%) = 1/0.009 ≈ 111.

Assuming that the extra cost of treatment with t-PA at the time of the trial was $2,000, this NNT implies that the “cost per patient saved from dying or having a serious stroke” was over $200,000 (i.e., 111 patients x $2,000 per patient = $222,000); this quantification led to controversy about the interpretation of GUSTO, as discussed above.
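The NNT and cost arithmetic from the GUSTO example can be reproduced in a few lines:

```python
def number_needed_to_treat(rate_inferior, rate_better):
    """NNT = 1 / absolute benefit of the better treatment."""
    absolute_benefit = rate_inferior - rate_better
    return 1 / absolute_benefit

# GUSTO: death or serious stroke in 7.8% (streptokinase) vs 6.9% (t-PA).
nnt = number_needed_to_treat(0.078, 0.069)
print(round(nnt))          # about 111 patients treated per event prevented
print(round(nnt) * 2000)   # extra cost per event prevented: $222,000
```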

When significant harm results from a treatment, a similar calculation can be made to determine the number needed to harm (NNH), which represents the number of subjects that need to be treated with the agent to result in one additional patient who is harmed. In addition to an estimate of NNT or NNH, the associated confidence interval can be valuable by showing the range of plausible values for NNT or NNH.

Margin notes: The number needed to treat (NNT) is the average number of subjects that need to be treated in order to prevent a single poor outcome. Numerically, it is the reciprocal of the absolute decrease in the rate of the poor outcome in those receiving the better treatment compared to those receiving the inferior treatment.


4.2 INTERPRETATION OF STATISTICAL MODELS Now that some general principles have been introduced, a description of how to interpret the results of common regression analyses can be provided. Regression analysis is a general methodology in which the relationships between two or more variables can be estimated. One variable will be considered the outcome (i.e., the dependent variable), while other variables will be considered as predictors of that outcome (i.e., independent variables).

Margin notes: Regression analysis is a methodology for assessing relationships between two or more variables.

4.2.0 Logistic regression Logistic regression is a model in which the odds of a certain outcome are associated with one or more variables. For a typical logistic regression, the outcome must have exactly two possible values (e.g., mortality vs. survival to hospital discharge). The results from a logistic regression model are expressed in terms of estimated odds ratios, confidence intervals, and p-values. The following example illustrates these concepts.

In a multicenter study of 484 pediatric subjects receiving extracorporeal membrane oxygenation (ECMO), Cashen reported the association between prematurity and mortality with an odds ratio (and 95% CI) of 2.97 (1.42, 6.21), along with a p-value of 0.004. The p-value suggests that the null hypothesis of no relationship is not plausible (since p < 0.05), and it is concluded that there is a relationship between prematurity and mortality. The estimated odds ratio describes the magnitude of this relationship. The odds ratio is defined as the estimated odds of mortality for preterm neonates divided by the estimated odds of mortality for the other children studied. The reported odds ratio of 2.97, with a lower limit of the 95% CI larger than 1, suggests that the odds of mortality are greater for preterm neonates. It can be further deduced from the odds ratio of 2.97 that the odds of mortality are estimated to be 2.97 times higher for preterm neonates, or in other words, the odds of mortality are 197% higher for preterm neonates compared to the other children studied.

In the example above, the predictor, preterm neonate, was binary. However, this need not be the case. A categorical predictor may have three or more levels, or the predictor can be continuous. These configurations are important to consider, but they are not specific to logistic regression, so they are discussed separately in Categorical predictors with more than two levels and Continuous predictor variables, respectively.

Margin notes: Logistic regression models are commonly used to assess the relationship of one or more variables with a binary outcome.

4.2.0.1 A careful look at odds ratios In order to prevent accidental misinterpretation of logistic regression results, it is essential to understand how odds are defined. The odds of mortality is defined as the probability of mortality divided by the probability of survival. Thus, if the probability of mortality is 50%, then the odds are 0.5/0.5 = 1. If the probability of mortality is 10%, then the odds of mortality are 0.1/0.9 = 1/9. In a study of children receiving ECMO, the probability of mortality was 48% with venoarterial support versus 30% with a venovenous mode. Thus, the odds of mortality were 0.48/0.52 = 0.92 with venoarterial mode and 0.30/0.70 = 0.43 with venovenous mode. The odds ratio is then estimated to be 0.92/0.43 = 2.14. We interpret this to mean that the odds of mortality are estimated to be 2.14 times higher with venoarterial mode compared to venovenous mode. It is tempting to erroneously report that mortality is 2.14 times more likely with venoarterial mode. However, when the relative risk is calculated, i.e., the ratio of probabilities, we see that mortality is only 0.48/0.30 = 1.60 times more likely with the venoarterial mode. Odds ratios are not relative risks, and care should be taken to interpret them correctly. In general, the odds ratio and the relative risk tend to be similar only when the probability of the event is close to 0. In the example above, the probabilities were 0.48 and 0.30, which is why the odds ratio and relative risk were so different. When the event being modeled is rare, odds ratios and relative risks are so similar that one can reasonably interpret an odds ratio as a relative risk.
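A few lines of Python, using the probabilities from this example, make the distinction between odds ratios and relative risks concrete. Note that computing from unrounded odds gives 2.15 rather than the 2.14 obtained above from odds rounded to two decimals.

```python
def odds(p):
    """Odds = probability of the event / probability of no event."""
    return p / (1 - p)

p_va, p_vv = 0.48, 0.30   # mortality with venoarterial vs. venovenous mode

odds_ratio = odds(p_va) / odds(p_vv)
relative_risk = p_va / p_vv

print(round(odds_ratio, 2))     # 2.15 (2.14 in the text, which rounds the odds first)
print(round(relative_risk, 2))  # 1.6
```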

4.2.0.2 Receiver operating characteristic (ROC) curves One tool to assess how well a logistic regression model is able to predict its outcome is a receiver operating characteristic (ROC) curve. Since a logistic regression model fits a probability of having the outcome for each subject in the study, this probability can be used to classify a subject as having the outcome (‘positive’) or not (‘negative’), using any chosen cutpoint. For example, in a hypothetical study of mortality, we could call all subjects with a predicted probability of death of at least 10% positives, and others negatives. It is quite possible that this model could misclassify some subjects; for example, some subjects with a model-predicted probability of over 10% could have survived. The combinations of model-predicted and actual classifications are provided in Table 51-3.

Table 51-3. Combinations of predicted versus actual classifications

                   Predicted: positive   Predicted: negative
Actual: positive   True positives        False negatives
Actual: negative   False positives       True negatives

An ROC curve illustrates the true positive rate (sensitivity) compared with the false positive rate (100% - specificity) as the cutpoint for classification varies from 0% to 100%. Since the variable predicted is binary, one level is arbitrarily considered ‘positive.’ For example, if the outcome variable is mortality, one might refer to mortality as positive and survival as negative. The true positive rate is the proportion of ‘positives’ that are correctly identified, e.g., the proportion of subjects who died for whom the model was able to correctly predict an outcome of mortality. The false positive rate is the proportion of ‘negatives’ that are erroneously classified as ‘positives,’ e.g., the proportion of subjects who survived that the model erroneously predicted would die. Formulas for the true positive rate, the false positive rate, and other common measures of prediction accuracy are provided in Table 51-4.

Table 51-4. Terminology and formulas for measures of prediction accuracy

Sensitivity (true positive rate): proportion of actual positives that are correctly classified. Formula: true positives/(true positives + false negatives).

Specificity (1 – false positive rate): proportion of actual negatives that are correctly classified. Formula: true negatives/(true negatives + false positives).

False positive rate (1 – specificity): proportion of actual negatives that are classified incorrectly. Formula: false positives/(true negatives + false positives).

Positive predictive value: proportion of predicted positives that are correctly classified. Formula: true positives/(true positives + false positives).

Negative predictive value: proportion of predicted negatives that are correctly classified. Formula: true negatives/(true negatives + false negatives).
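The formulas in Table 51-4 can be computed directly from the four cells of Table 51-3. The sketch below uses hypothetical counts, not data from the chapter.

```python
def prediction_measures(tp, fp, fn, tn):
    """Measures from Table 51-4, computed from the cells of Table 51-3."""
    return {
        "sensitivity": tp / (tp + fn),            # true positive rate
        "specificity": tn / (tn + fp),
        "false positive rate": fp / (tn + fp),    # 1 - specificity
        "positive predictive value": tp / (tp + fp),
        "negative predictive value": tn / (tn + fn),
    }

# Hypothetical counts: 40 true positives, 20 false positives,
# 10 false negatives, 130 true negatives.
m = prediction_measures(tp=40, fp=20, fn=10, tn=130)
print(m["sensitivity"])           # 0.8: 80% of deaths correctly predicted
print(m["false positive rate"])   # about 0.13
```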

An ideal situation is to have a model that has a high true positive rate while maintaining a low false positive rate. If the receiver operating characteristic (ROC) curve (Figure 51-1) gets close to the upper left corner (true positive rate of 100% and false positive rate of 0%), then the model is able to predict the outcome very accurately. One way of quantifying this is to report the area under the ROC curve (c-statistic). A c-statistic can range from 50% to 100%, with higher values indicating a better predictive model. An example of an ROC curve with poor ability to predict mortality in a study of ECMO is found in Figure 51-1. Figure 51-2 illustrates an ROC curve from a model with moderate predictive ability. Both ROC curves are based on models of mortality in the same cohort. The model used for Figure 51-2 used additional predictor variables that allowed more precise predictions.
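The c-statistic has a convenient rank interpretation that can be sketched in a few lines: it equals the probability that a randomly chosen subject with the outcome receives a higher model-predicted probability than a randomly chosen subject without it (ties counted as one half). The predicted probabilities below are invented for illustration.

```python
def c_statistic(scores_positive, scores_negative):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive subject outscores a randomly chosen negative subject
    (ties count one half)."""
    pairs = [(p > n) + 0.5 * (p == n)
             for p in scores_positive for n in scores_negative]
    return sum(pairs) / len(pairs)

# Hypothetical model-predicted probabilities of death:
died     = [0.9, 0.8, 0.6, 0.4]
survived = [0.7, 0.5, 0.3, 0.2, 0.1]
print(c_statistic(died, survived))   # 0.85
```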

Figure 51-1. Receiver Operating Characteristic (ROC) curve with a c-statistic of 55%

This figure demonstrates the ROC curve from a logistic regression model predicting mortality based only on the mode (venoarterial versus venovenous) of extracorporeal membrane oxygenation in pediatric subjects. The area under the curve is only 55% indicating a poor ability to predict mortality. (This curve is very close to a straight line, which has an area under the curve of 50%, and indicates the model has no predictive ability, since no matter what probability cutpoint is used, subjects with the outcome are not classified any more accurately than those without!)


Figure 51-2. Receiver Operating Characteristic (ROC) curve with a c-statistic of 79%

This figure illustrates the ROC curve from a logistic regression model predicting mortality of pediatric subjects supported with extracorporeal membrane oxygenation. The model uses several variables including mode (venoarterial versus venovenous), age category, diagnosis of meconium aspiration syndrome, diagnosis of congenital diaphragmatic hernia, documented blood infection, indication for ECMO (eCPR, cardiac, respiratory), and the last measurement of arterial pH, partial thromboplastin time, and international normalized ratio prior to initiating extracorporeal membrane oxygenation. The area under the curve is 79% indicating a moderate ability to predict mortality.

4.2.1 Ordinary linear regression Ordinary linear regression is a model in which a continuous outcome is related to one or more predictor variables, which may be categorical or continuous. Continuous predictor variables are discussed in Continuous predictor variables. The results from an ordinary linear regression model are expressed in terms of estimated effect sizes, confidence intervals, and p-values, as illustrated in the following example.

In a multicenter study of 216 children receiving ECMO, researchers reported that the use of continuous renal replacement therapy was associated with a mean increase of 16.3 mg/dL (95% CI: 3.2, 29.4) in the daily plasma free hemoglobin concentration (p = 0.01). The low p-value suggests that the null hypothesis of no relationship between the use of renal replacement therapy and daily plasma free hemoglobin concentration is not plausible, and it is concluded that renal replacement therapy is associated with daily plasma free hemoglobin concentrations. The estimated effect can be interpreted as follows: subjects with continuous renal replacement therapy had, on average, daily plasma free hemoglobin levels that were 16.3 mg/dL higher than subjects without continuous renal replacement therapy.
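A useful way to see what this effect estimate means: with a single binary predictor, the ordinary-linear-regression effect is simply the difference in group means. The values below are invented for illustration, not data from the study.

```python
from statistics import mean

# Hypothetical daily plasma free hemoglobin values (mg/dL):
with_crrt    = [30.1, 45.8, 22.4, 51.0]   # continuous renal replacement therapy
without_crrt = [18.2, 25.5, 12.9, 20.3]

# With a single binary predictor, the ordinary-linear-regression effect
# estimate is the difference in group means.
effect = mean(with_crrt) - mean(without_crrt)
print(round(effect, 1))   # mg/dL higher, on average, with CRRT
```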

Margin notes: Ordinary linear regression models are commonly used to assess the relationship of one or more variables with a continuous outcome.

4.2.2 Cox proportional hazards regression When duration of follow-up varies between subjects, Cox proportional hazards regression can be used to assess the relationship of one or more variables with the occurrence of an event for which a cohort is at risk. Duration of follow-up and timing of events are incorporated into the analysis. This type of analysis is often referred to as survival analysis or time-to-event analysis. As with other types of regression, variables associated with the outcome may be categorical or continuous (see Continuous predictor variables). The results from a Cox regression model are expressed in terms of hazard ratios, confidence intervals, and p-values. The following example illustrates these concepts.

In a multicenter study of 118 infants treated with endoscopic third ventriculostomy and choroid plexus cauterization (ETV+CPC) for hydrocephalus, Kulkarni identified variables associated with the risk of ETV+CPC failure, i.e., failure of the ETV+CPC to effectively divert cerebrospinal fluid. Subjects were considered at risk of failure from the date of the ETV+CPC until failure (re-operation for the treatment of hydrocephalus) or until loss to follow-up (the end of the study follow-up period, or when a specific subject could no longer be contacted). Follow-up was conducted during the study enrollment period and for six months thereafter. Therefore, duration of follow-up was generally longer for the first subjects enrolled than for the last. Follow-up durations varied between subjects due to the timing of enrollment and also due to subject-specific factors that precluded continued follow-up. Cox regression can, under certain conditions, appropriately handle variation in follow-up durations by excluding subjects from the ‘at risk’ cohort after the last time of contact. From this study, researchers reported that the risk of ETV+CPC failure had a hazard ratio (and 95% CI) of 0.33 (0.10, 1.10) associated with the presence of an intraoperative clear view of the basilar artery, with a p-value of 0.07. The p-value suggests that there may be a relationship between whether the surgeon has a clear view of the basilar artery and the hazard of ETV+CPC failure. The estimated hazard ratio may be interpreted as follows: the hazard of ETV+CPC failure when there is a clear view of the basilar artery is estimated to be 0.33 times the hazard of failure when there is not a clear view. In other words, the researchers estimated that the hazard of ETV+CPC failure was 67% lower when there was a clear view of the basilar artery.
However, the 95% CI indicates that the hazard of ETV+CPC failure may reasonably be as much as 90% lower or 10% higher with a clear view of the basilar artery versus without a clear view, consistent with the non-significant p-value.

Margin notes: When duration of follow-up varies between subjects, Cox proportional hazards regression is commonly used to assess the relationship of one or more variables with the occurrence of an event for which a cohort is at risk.

4.2.3 Additional concepts in statistical modeling Now that some of the most common statistical modeling techniques in clinical research have been described and the manner in which they can be used to interpret study results delineated, issues that are common to all of the models previously discussed can be reviewed.


4.2.3.1 Categorical predictors with more than two levels In the previous sections, the examples were restricted to situations in which the predictor variable had only two possible values. Each of the modeling techniques previously discussed also allows predictor variables with more than two categories. One of the categories will be chosen as the reference. Confidence intervals are presented for each category in comparison to the reference, and the width of each confidence interval is affected both by the number of subjects in the category and by the number of subjects in the reference category. In the case where there are only two categories, there is often a natural reference. For example, if the variable is whether a subject received an active study drug vs. placebo, the placebo is a natural reference. If the variable is sex (male vs. female), an arbitrary choice of reference may be needed. When more than two categories exist, the choice of reference becomes more important. If there are a similar number of subjects in each category, a ‘natural’ reference might be chosen. For example, with age categories, the youngest age category may be a natural reference. When the number of subjects in each category varies greatly, the category with the most subjects is often chosen as the reference. The category with the most subjects may be considered a natural reference, but more importantly, this approach reduces the widths of the confidence intervals that compare this category to all others.

An example from a study of children supported with ECMO illustrates the results of ordinary linear regression modeling with a 5-level categorical variable (Table 51-5). All age categories are compared to the reference group, in this case full-term neonates. The low p-value indicates that there is a relationship between age group and daily plasma free hemoglobin levels. However, this does not indicate that a specific group has different plasma free hemoglobin levels; rather, it indicates only that there are differences between groups. The confidence intervals illustrate more clearly where the differences might be. In this example, the child age group appears to be the most different from the full-term neonate age group, and we estimate that subjects in the child age group have, on average, a daily plasma free hemoglobin concentration that is approximately 36 mg/dL lower than subjects in the full-term neonate group.
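The comparison-to-reference arithmetic can be sketched with invented data: each reported effect is the difference between a category's mean outcome and the reference category's mean, which is exactly what the fitted model estimates when the categorical variable is the only predictor.

```python
from statistics import mean

# Hypothetical plasma free hemoglobin (mg/dL) by age category:
groups = {
    "full-term neonate": [40.0, 55.0, 47.5],   # reference category
    "child":             [10.0, 12.5, 11.4],
    "adolescent":        [15.0, 18.0, 21.0],
}

reference_mean = mean(groups["full-term neonate"])
effects = {name: round(mean(values) - reference_mean, 2)
           for name, values in groups.items() if name != "full-term neonate"}
print(effects)   # each effect: category mean minus reference mean
```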

Table 51-5. Ordinary Linear Regression Model of Daily Plasma Free Hemoglobin

Outcome: plasma free hemoglobin (mg/dL)

Predictor            Effect (95% CI)            P-value
Age                                             <.001
  Pre-term neonate   -3.57 (-22.97, 15.82)
  Full-term neonate  Reference
  Infant             -2.14 (-21.26, 16.97)
  Child              -35.89 (-50.73, -21.05)
  Adolescent         -29.82 (-44.32, -15.32)

Margin notes: When interpreting regression models with a categorical predictor variable, all categories are compared to a reference category.


4.2.3.2 Continuous predictor variables Continuous variables are measurements that can take any value within an interval, rather than predefined numbers or categories. Examples include height, weight, and drug dose. Continuous variables can be included as predictor variables in any of the modeling approaches previously discussed. The interpretation of the results is slightly different for continuous predictor variables. Odds ratios, effect sizes, or hazard ratios for a continuous predictor variable show the magnitude of the association for each increase of one unit in the predictor variable.

For example, in an ordinary linear regression model of plasma free hemoglobin concentration (mg/dL) predicted by weight (kg), if the estimated effect is -0.5, then plasma free hemoglobin concentration decreases, on average, by 0.5 mg/dL for each 1 kg increase in subject weight. Importantly, a reader should not necessarily interpret an estimate of -0.5 to be of little clinical importance. In order to appropriately assess the importance of weight in predicting plasma free hemoglobin, one should consider the range of typical values of weight. For instance, a 10 kg increase in weight would correspond to an expected 5 mg/dL decrease in plasma free hemoglobin concentration. If subjects in the cohort have weights that typically vary by 10 kg or more, it is more appropriate to ask whether a 5 mg/dL difference in plasma free hemoglobin level is clinically relevant.

The interpretation is similar, but not identical, for continuous predictor variables in logistic or Cox regression. For example, researchers found that the highest lactate concentration (mmol/L) in the first 48 hours after initiation of ECMO is associated with risk of in-hospital mortality. In particular, they reported an odds ratio (and 95% CI) of 1.13 (1.09, 1.17) and a p-value of < 0.001. This is interpreted to mean that for each increase of one mmol/L in lactate, the odds of mortality are multiplied by an estimated 1.13, or in other words, are increased by 13%. Suppose lactate is typically around 5 mmol/L in this population and 25% of the population had a lactate of 10 mmol/L or higher. To fully appreciate the clinical importance of this relationship between lactate levels and mortality, we can calculate how much the odds of mortality increase when lactate is increased by 5 mmol/L. The odds ratio for an increase of 5 mmol/L is (1.13)^5 ≈ 1.84. Thus, the odds of mortality increase by 84% when the lactate level increases by 5 mmol/L. Hazard ratios for continuous variables in Cox regression are interpreted analogously by replacing ‘odds’ with ‘hazard.’
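The rescaling step can be verified directly: the odds ratio for a k-unit increase is the one-unit odds ratio raised to the k-th power.

```python
def odds_ratio_for_increase(or_per_unit, units):
    """Odds ratio for a multi-unit increase in a continuous predictor:
    the per-unit odds ratio raised to the number of units."""
    return or_per_unit ** units

# Reported OR of 1.13 per 1 mmol/L increase in lactate:
print(round(odds_ratio_for_increase(1.13, 5), 2))   # 1.84 for a 5 mmol/L rise
```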

Using continuous predictors in regression modeling requires caution. If a continuous predictor is included in the model (in the standard way), then increasing the value of the predictor will either 1) always increase the expected outcome or 2) always decrease the expected outcome. In particular, a regression analysis cannot appropriately capture the relationship when both low and high values of the predictor are associated with a higher outcome, while moderate values of the predictor are associated with a lower outcome. Thus, assuming the standard, linear relationship between predictors and outcomes may not be sufficient. For example, researchers demonstrated that lower values of arterial oxygen (PaO2) in the first 48 hours of ECMO were associated with lower risk of in-hospital mortality, until the PaO2 reached levels below 60 Torr. PaO2 values < 60 Torr were associated with a higher risk of mortality, as were PaO2 levels > 300 Torr. A relationship such as this is often referred to as U-shaped because when the probability of mortality is plotted against PaO2, the curve looks somewhat like a U (Figure 51-3). In particular, risk of mortality begins high, decreases as PaO2 increases, and then increases again as PaO2 increases further. Appropriate approaches to modeling this U-shaped relationship may include 1) partitioning PaO2 into nominal intervals to create a categorical variable (e.g., < 60, 60-200, 201-400, > 400 Torr), 2) excluding subjects with PaO2 < 60 Torr, or 3) including both PaO2 and (PaO2)^2 as predictors in the model to capture the U-shape.

Figure 51-3. U-shaped relationship between highest PaO2 in the first 48 hours of extracorporeal membrane oxygenation and in-hospital mortality rate. Copied with permission from: Cashen K, Reeder R, Dalton HJ, et al; Eunice Kennedy Shriver National Institute of Child Health and Human Development Collaborative Pediatric Critical Care Research Network (CPCCRN). Hyperoxia and hypocapnia during pediatric extracorporeal membrane oxygenation: associations with complications, mortality, and functional status among survivors. Pediatr Crit Care Med. 2018;19(3):245-253.

Margin notes: When interpreting regression models with a continuous predictor variable, the reported odds ratio, effect, or hazard ratio is for each one unit increase in the predictor variable.

4.2.3.3 Controlling for other variables Each type of model previously discussed can account for one or more explanatory variables. A model with a single predictor variable is referred to as a univariable model, whereas a model with two or more predictor variables is called a multivariable model. The motivation for including additional predictor variables is that there are often multiple variables associated with the outcome, and those variables may be related to each other. For example, pump type (centrifuge versus roller head) for children receiving ECMO may be associated with the risk of mortality. Clinical site in a multicenter study may also be associated with the risk of mortality for various reasons. Consider the scenario in which pump type is the only predictor in a logistic regression model of in-hospital mortality. In this scenario, we might estimate that the odds of mortality are 37% lower (odds ratio 0.63; 95% CI: 0.44, 0.91) with the centrifuge pump compared to the roller head pump (p = 0.01). However, there are certainly additional variables that influence the risk of mortality, including the clinical site. If clinical site is included in addition to pump type in the model, it is estimated that the odds of mortality are only 31% lower with the centrifuge pump, controlling for clinical site. Controlling or adjusting for other variables means that those other variables are included as predictors in the regression model. The relationship between pump type and mortality can now be interpreted under the assumption that clinical site is unchanged. In other words, if two subjects are at the same clinical site, but one has a roller head pump and the other has a centrifuge pump, then the odds of mortality are estimated to be 31% (not 37%) lower for the subject with the centrifuge pump. However, if the new p-value for the relationship between pump type and mortality controlling for clinical site (p = 0.13) is considered, the conclusion that pump type is independently associated with mortality can no longer be made. It may be that the only link between pump type and mortality is that centrifuge pumps happen to be more frequently used at hospitals which, for other reasons, have lower mortality rates. In this situation, clinical site is considered to be confounding the relationship between pump type and mortality.
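The confounding scenario can be illustrated numerically. With invented counts in which one site has low mortality and favors centrifuge pumps, the crude odds ratio suggests a pump effect that disappears after stratifying by site. The sketch uses the Mantel-Haenszel estimator, one standard way to adjust across strata; the chapter's example adjusted via logistic regression instead.

```python
def odds_ratio(a, b, c, d):
    """2x2 table: a, b = deaths/survivors with centrifuge pump;
    c, d = deaths/survivors with roller head pump."""
    return (a / b) / (c / d)

def mantel_haenszel_or(strata):
    """Stratum-adjusted odds ratio across 2x2 tables (one per clinical site)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Hypothetical data: site A has low mortality and uses mostly centrifuge pumps;
# site B has high mortality and uses mostly roller head pumps.
site_a = (9, 81, 1, 9)    # 90 centrifuge and 10 roller head patients
site_b = (4, 6, 36, 54)   # 10 centrifuge and 90 roller head patients

pooled = tuple(x + y for x, y in zip(site_a, site_b))
print(round(odds_ratio(*pooled), 2))        # crude OR ≈ 0.25: pumps look protective
print(mantel_haenszel_or([site_a, site_b]))  # adjusted OR = 1.0: no within-site effect
```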

Margin notes: Controlling or adjusting for variables means that those other variables are included as predictors in the regression model. This can prevent these variables from confounding the effect of other variables on the outcome.

4.2.3.4 Explanatory variable selection Sometimes researchers attempt to identify explanatory variables associated with a specific clinical outcome, but they do not know in advance which variables might be important. In this case, a variable selection technique may be used as a data-driven approach to the identification of relevant variables. For example, in a study of thrombosis in children receiving ECMO, researchers considered many variables as potential predictors of daily thrombosis, including patient characteristics, support configuration, and severity of illness prior to cannulation. A variable selection technique called stepwise selection was used to identify which of the considered variables are useful in a model predicting occurrence of thrombosis on a daily basis. In stepwise selection, variables are added to or removed from the model one at a time until a final model is reached. At each step, the decision of which variable to add or remove is often based on statistical significance. The model is declared final when a specified criterion is met, e.g., when no additional variable significantly contributes to the model. Although many variables were considered in the example provided, only three were included in the final model.
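A greatly simplified sketch of the forward version of stepwise selection follows, using an improvement-in-fit criterion in place of formal significance tests; the variable names and data are invented, and real stepwise procedures refit the full multivariable model at each step rather than working with residuals.

```python
from statistics import mean

def slope_intercept(x, y):
    """Least-squares fit of y = a + b*x."""
    mx, my = mean(x), mean(y)
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

def rss(x, y):
    """Residual sum of squares after regressing y on x."""
    a, b = slope_intercept(x, y)
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

def forward_select(candidates, y, min_improvement=0.05):
    """Greedy forward selection: repeatedly add the candidate that most
    reduces the residual sum of squares, refitting against the residuals of
    the current model; stop when no candidate improves the fit by at least
    `min_improvement` (a stand-in for a stepwise significance rule)."""
    residual = list(y)
    chosen = []
    while True:
        m = mean(residual)
        base = sum((r - m) ** 2 for r in residual)  # fit of the current model
        scored = [(rss(x, residual), name, x)
                  for name, x in candidates.items() if name not in chosen]
        if not scored:
            break
        best_rss, name, x = min(scored)
        if base - best_rss < min_improvement * base:
            break  # no remaining variable improves the fit enough
        a, b = slope_intercept(x, residual)
        residual = [r - (a + b * xi) for r, xi in zip(residual, x)]
        chosen.append(name)
    return chosen

# Hypothetical data: the outcome tracks lactate closely, while the
# unrelated 'weight' candidate should not survive selection.
candidates = {
    "lactate": [1, 2, 3, 4, 5, 6],
    "weight":  [3, 1, 4, 1, 5, 9],
}
y = [2.1, 4.0, 6.2, 7.9, 10.1, 12.0]
print(forward_select(candidates, y, min_improvement=0.2))
```

With the stricter threshold, only the genuinely predictive variable is retained, mirroring how a stepwise procedure prunes a long candidate list down to a few variables.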

In contrast, a researcher may completely specify all variables that will be included in the model prior to examining the data. There are advantages and disadvantages to both approaches. A data-driven approach to variable selection provides some assurance that an important variable is not omitted. However, the confidence intervals and p-values are less rigorous when data-driven approaches are used. Stepwise selection assesses several models before identifying the final reported model. The pitfalls of testing multiple models, i.e., multiple comparisons, are described in detail in Prespecified Hypotheses. In contrast, when variables are completely specified in advance, an important variable may inadvertently be omitted, but the confidence intervals and p-values are more rigorous. Completely pre-specifying which variables will be included in a model is an opportunity to confirm a clinical hypothesis before becoming biased by current data trends that have yet to be vetted. In contrast, extemporaneous modeling decisions can subtly lead researchers to spurious conclusions. In this way, completely specifying the variables to be included in the model in advance can provide stronger evidence due to the integrity of the confidence intervals and p-values. However, when reviewing pre-specified models, readers should scrutinize the variables chosen. In particular, a reader should ask, ‘Is there a variable omitted that might be driving the result?’ For example, a researcher may report that pump type is associated with mortality in pediatric ECMO but fail to consider the important relationship with clinical site. In particular, each clinical site may favor a particular type of pump and also have other unique characteristics that impact mortality. Thus, while pump type is associated with mortality, it may not be independently associated with mortality. A variable is considered to be independently associated with the outcome if it is associated with the outcome after controlling for other relevant variables. Researchers appropriately reported no evidence that pump type impacts mortality after controlling for clinical site.

Since either approach alone may be criticized, a common strategy is to use both. To do so, researchers first completely specify the variables to be used for the primary analysis and then, as an exploratory analysis, use a data-driven approach to build an additional multivariable model from a larger pool of candidate variables. When the exploratory analysis confirms the findings of the primary analysis, a reader may correctly view the findings with more confidence. An important caveat is that neither approach can account for variables that were not measured as part of the study. Authors should identify potentially important variables that were not collected in the limitations section, but it is also the responsibility of the reader to consider whether important variables were omitted.

Margin notes: Model selection techniques such as stepwise selection may provide some assurance that an important variable was not omitted, but the resulting p-values and confidence intervals are less rigorous.

5 UNDERSTANDING LIMITATIONS After reading conclusions made by research authors, readers should ask themselves if there are any plausible alternative explanations for the results other than the conclusion drawn by the author. Considering other plausible explanations tends to moderate the conclusions ultimately drawn by the reader. This puts the current research in context and elucidates both what is known and what is not. This analytic questioning is so important to the understanding of reported research that a paragraph about specific limitations of the reported research is expected in the discussion section of manuscripts reporting clinical studies. The limitations section of a manuscript informs the reader what limitations the author considered. The reader should carefully review the limitations and be sure that the conclusions the author has drawn are still reasonable in spite of the limitations. Finally, the reader should consider what other limitations might exist. Some specific limitations that should be considered when reading the published results of a clinical study will now be reviewed.

5.1 TREATMENT ASSIGNMENT Randomized trials, if correctly implemented, assign treatments to participants without bias. The flow diagram in a trial report should show whether all consented patients were randomized and whether all randomized patients received their assigned treatments. More than very occasional deviations from the assigned randomization may be a cause for concern.

In sufficiently large trials, randomization achieves approximate balance between treatment arms with respect to both observed and unobserved factors. Reports of randomized trials will typically show and compare distributions of key baseline factors between treatment arms. Substantial imbalance with respect to an important prognostic factor may be cause for concern, but this can be mitigated by explicitly incorporating the factor into the analysis. As an example, Willson conducted an RCT of calfactant versus placebo among 153 children with respiratory failure from acute lung injury and found an overall increased mortality in those receiving placebo (odds ratio 2.32; 95% CI, 1.15–4.85; p = 0.03). However, analysis of the baseline characteristics of the two groups demonstrated that more immunocompromised patients, including those who had undergone bone marrow transplantation, were randomized to the placebo group. Given the established higher mortality among the immunocompromised, the authors conducted an analysis controlling for immunocompromised state, which revealed a statistically non-significant effect of calfactant on mortality (odds ratio 2.11; 95% CI, 0.93–4.79; p = 0.07).

Additionally, imbalances between treatment groups in studies comparing therapies, but not involving randomization, often occur with respect to key factors. Distributions of such factors should be reported, and approaches to control for such factors should be clearly described.

Margin notes: Imbalance between treatment groups with respect to an important prognostic factor may confound a relationship, especially if treatment assignment is not randomized.

5.2 UNCONTROLLED CONFOUNDERS Except in very small studies, it is typically possible to control for known confounders to some extent in the ascertainment of a treatment effect. It is also possible that other unmeasured confounding factors may affect the magnitude of effects that are observed. In some settings, analyses can be conducted to assess the sensitivity or robustness of the result to possible uncontrolled confounding. For example, smoking clearly confounds the relationship between occupational exposure to toxins and lung cancer. Yet, the relationship between occupational exposure to potential carcinogens and lung cancer can be evaluated in a population with unknown smoking status, by appropriately comparing relative rates of chronic obstructive pulmonary disease (COPD) among patients in the same population who did, versus did not, develop lung cancer. Since COPD is associated with smoking, but not with occupational exposure to carcinogens, these relative rates provide estimates of the confounding effect of smoking in the observed relationships between occupational carcinogen exposure and lung cancer. Another strategy is to estimate how strong the effect of an unmeasured confounder would have to be to eliminate an observed relationship of an exposure to an outcome. Whether such adjustment or robustness analysis is done or not, the possibility of confounding should always be considered in the review of any study comparing effects of therapies outside the randomized setting.
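One widely used version of the second strategy, estimating how strong an unmeasured confounder would have to be to explain away an observed association, is the E-value of VanderWeele and Ding (the example risk ratio below is hypothetical). A minimal sketch of the point-estimate formula:

```python
import math

def e_value(rr: float) -> float:
    """E-value for a risk ratio point estimate: the minimum strength of
    association (on the risk-ratio scale) that an unmeasured confounder
    would need with both the exposure and the outcome to fully explain
    away the observed association."""
    if rr < 1:
        rr = 1 / rr  # the formula is applied symmetrically to protective effects
    return rr + math.sqrt(rr * (rr - 1))

# A hypothetical observed risk ratio of 2.0 would require a confounder
# associated with both exposure and outcome by a risk ratio of about 3.41
# to be fully explained away.
print(round(e_value(2.0), 2))
```

A large E-value suggests the finding is robust to plausible uncontrolled confounding; a small one suggests modest confounding could erase it.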

Margin notes: When controlling for potential confounders is not possible, analyses to evaluate the sensitivity or robustness of the results may be possible.


5.3 GENERALIZABILITY In order to contextualize research findings, it is important to identify the population to whom the results generalize. Eligibility criteria for the study should be carefully scrutinized, as should the source and location of the participants enrolled. Studies generated from a large administrative database may capture a more wide-ranging population than randomized clinical trials with extensive inclusion and exclusion criteria. In addition, studies requiring informed consent may lose generalizability, as subjects likely to consent may differ from those who refuse consent.

The first Bypass Angioplasty Revascularization Investigation (BARI) is an interesting case study of randomized trial generalizability. Among 353 patients with treated diabetes and multiple-vessel coronary artery disease randomized to revascularization with either angioplasty or bypass surgery, a strong and highly significant advantage in five-year survival was noted in those undergoing bypass surgery, leading the National Institutes of Health to issue an alert to clinicians recommending treatment with bypass surgery. However, in an additional 339 patients who met all eligibility criteria for BARI but refused randomization, instead receiving treatment selected by themselves and/or their medical caregiver, virtually no survival advantage was observed for surgical treatment. Patients not randomized tended to be more educated and reported better quality of life versus randomized patients, which may suggest better care of their diabetic condition. Moreover, within the nonrandomized patients, those with more extensive coronary disease were more likely to receive bypass surgery, which tended to provide more extensive revascularization in the setting of numerous, complex blockages. Therefore, the strong findings of this rigorous, multicenter trial did not necessarily even extend to all other patients with exactly the same condition treated at the same hospitals over the same time interval. While bypass surgery was likely the preferred approach for some types of diabetic patients with extensive heart disease, it would not confer a survival benefit for all such patients. In the current era of personalized medicine, where certain treatments may benefit only patients with particular genetic or biomarker-based profiles, generalizability should always be considered, and assertions that a particular treatment is optimal for an entire broad population viewed with appropriate criticism.

Margin notes: An important part of understanding published research is to consider to which population the reported results may generalize.

5.4 PRESPECIFIED HYPOTHESES Relatively modern advances in data collection technology have increased both the quantity and scope of clinical research data on an unprecedented scale. While vast amounts of data now provide researchers with the ability to investigate an essentially unlimited number of clinical hypotheses, care must be taken not to inadvertently report a “finding” or an association when truly there is none.

The Hypothesis Testing section provides a discussion of p-values and their interpretation. The reader will note that a p-value is a statement about the probability of error. Recall that in a single study with no actual treatment effect, it is common to accept a 5% chance of a “false positive” (p < 0.05). Put another way, there is a 95% chance we will NOT see a significant result when there is no treatment effect. Now, if a study looks at 10 different outcomes and there is still truly no association, the chance of seeing NO false positive findings can be as low as 95% multiplied by itself 10 times, or about 60%. If no steps are taken to control the false positive rate, evaluating 10 outcomes in a setting with truly no treatment effect can result in up to a 40% chance of seeing at least one false positive finding.
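The arithmetic behind this familywise false positive rate is easy to reproduce; a quick sketch:

```python
alpha = 0.05   # false positive rate accepted for a single test
n_tests = 10   # number of outcomes examined

# If all 10 null hypotheses are true and the tests are independent
# (the worst case the chapter describes):
p_no_false_positives = (1 - alpha) ** n_tests  # 0.95^10, about 0.60
p_at_least_one = 1 - p_no_false_positives      # about 0.40

print(f"P(no false positives) = {p_no_false_positives:.2f}")
print(f"P(>=1 false positive) = {p_at_least_one:.2f}")
```

The independence assumption is why the text says the chance "can possibly be as low as" 60%; correlated outcomes give a familywise error rate somewhere between 5% and 40%.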


This multiple comparisons issue is a primary reason that clinical trials typically have one, or a small number of, prespecified primary hypotheses. More exploratory studies that examine more variables should be described as such and should convey the scope of relationships that were examined. Making many comparisons is not necessarily “bad”; for example, genomic studies may examine associations of thousands of single-nucleotide polymorphisms with an outcome, but these should incorporate principles appropriate for this setting in order to limit false findings.

Unless it has been specifically stated that hypotheses were specified before looking at the data, the reader of a report should assume that this is not the case. A report may include the results of many statistical tests, e.g., the correlation of 5 biomarkers with 10 different outcomes, without any method to control the false positive rate. If no guidance is presented in the report, a conservative and easily implemented approach is a Bonferroni correction, where the statistical criterion that would have been used for a single statistical test of association (typically 0.05) is divided by the total number of tests performed (in this example, 0.05/50 = 0.001). Then, if the reader considers only those associations with p < 0.001 to be significant, the probability that one or more of them are really due to chance is no more than 5%, just like the criterion used for a single comparison. The Bonferroni correction is common when the number of tests is moderate. However, it is conservative: while the probability of one or more false discoveries is guaranteed to be no more than 5%, in practice it can be much lower, making it more difficult to identify statistically significant findings. In particular, this approach would be too restrictive for the example above in which thousands of comparisons are made. For an overview of additional, less conservative approaches, the reader can refer to the paper by Chen, ‘The false discovery rate: a key concept in large-scale genetic studies’, listed in the Suggested Readings.
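The Bonferroni arithmetic for the 5-biomarker, 10-outcome example can be sketched as follows; the observed p-values below are hypothetical:

```python
alpha = 0.05
n_tests = 5 * 10                 # 5 biomarkers x 10 outcomes = 50 tests
threshold = alpha / n_tests      # Bonferroni criterion: 0.05 / 50 = 0.001

# Hypothetical p-values from such a report; only those below the
# corrected threshold should be called significant.
observed_p = [0.0004, 0.0100, 0.0008, 0.0300]
significant = [p for p in observed_p if p < threshold]

print(threshold)    # 0.001
print(significant)  # [0.0004, 0.0008]
```

Note that 0.0100 and 0.0300 would have passed the naive 0.05 criterion but fail the corrected one, which is precisely the protection Bonferroni provides.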

Margin notes: If many statistical tests are performed, an approach to limit false positives is needed. The Bonferroni correction provides an easy-to-implement but very conservative method that can be employed by the reader.

5.5 OUTCOME The clinical importance of the outcome(s) being evaluated in the report should also be considered. For example, hospital length of stay may be important, but mortality is much more so. The outcome studied may simply be a surrogate for an important longer-term outcome. The realities of study feasibility often require use of surrogate outcomes. For example, new functional morbidity that develops between baseline and hospital discharge might be used as a surrogate for longer-term quality of life. For surrogate outcomes, the reader should consider how strongly the surrogate and longer-term outcome are related, recognizing that an improvement in the surrogate may not always translate to an improvement in the clinically important, longer-term outcome.

Outcomes should be evaluated in an objective fashion that minimizes sources of bias. As an example, in the THAPCA-OH trial, neurobehavioral status among children surviving at one year was assessed by a telephone-based parental interview using the Vineland Adaptive Behavior Scales, with the instrument administered from a single central location by an interviewer blinded to which treatment the child received.


6 SUMMARY In this chapter, we have discussed important statistical and study design concepts that should be considered in the review of a published study. Some study designs, such as randomized clinical trials, are usually considered to provide stronger evidence of an effect than others. Any type of published study must be critically reviewed to ensure that appropriate outcome measures have been used, that the data have been analyzed using an appropriate approach, and that the interpretation of the results is consistent with the analytic approach. Moreover, statistical significance, while highly important, must not be the only criterion used to gauge implications of a study. Statistical significance should be rigorously determined, with multiple comparisons limited or otherwise appropriately handled, and statistically significant effects should be justified as being of a clinically relevant magnitude. Negative reports should demonstrate whether there was sufficient power to rule out effects of a clinically important magnitude by presentation and discussion of confidence intervals for effect estimates.


7 REVIEW QUESTIONS 1. You are conducting a small pilot trial of a new care bundle aimed at reducing pediatric intensive care unit length of stay. It has been implemented in 15 patients to date. You are encouraged that it may be working, but it is labor intensive. Thus, you want to analyze its effectiveness by comparing the length of stay of these 15 patients with that of age-, gender-, and diagnosis-matched controls. The data are as follows:

Care Bundle Patients, lengths of stay (days): 2, 2, 3, 4, 4, 4, 5, 5, 7, 7, 8, 8, 10, 11, 109
Control Patients, lengths of stay (days): 2, 4, 4, 5, 6, 7, 8, 8, 9, 10, 10, 12, 14, 17, 18

Of the following, the most appropriate statistical test to use to conduct this analysis would be:

a. Fisher’s exact test
b. Paired t-test
c. Unpaired t-test
d. Wilcoxon rank-sum test

2. The following graph represents a receiver operating characteristic (ROC) curve analysis.

Which of the following best identifies the letters on the chart?

a. A = line of no prediction, X = Specificity, Y = 1 - Sensitivity
b. A = line of no prediction, X = 1 - Specificity, Y = Sensitivity
c. A = line of optimal prediction, X = 1 - Sensitivity, Y = Specificity
d. A = line of optimal prediction, X = 1 - Specificity, Y = Sensitivity

[Figure: ROC curve plot with line A, horizontal axis X, and vertical axis Y, each scaled 0 to 100]


3. In the THAPCA-OH trial, among the 260 evaluable subjects who had satisfactory neurobehavioral status prior to their cardiac arrest, subjects treated with hypothermia were estimated to have a 7.3 percentage point increase (95% CI, -1.5% to 16.1%) in the probability of a good neurobehavioral outcome at 12 months compared to subjects treated with normothermia. Given those findings, which of the following statements is FALSE?

a. The best estimate is that hypothermia increases the probability of a good outcome by 7.3%, but that improvement is not statistically significant because the 95% confidence interval includes one.

b. The 95% confidence interval may be narrowed by enrolling more subjects, but the benefits would need to be balanced against the added costs of enrolling more patients.

c. Because the confidence interval contains zero as a plausible effect, we cannot rule out the possibility that therapeutic hypothermia has no effect at all on neurobehavioral outcome.

d. The possibility that therapeutic hypothermia improves the probability of good outcome by 16.1 percentage points or more can be ruled out with a high degree of confidence.

4. You are reviewing a manuscript that assessed the association of ten novel biomarkers with mortality among children admitted to the pediatric intensive care unit with septic shock. The authors utilized a standard p-value criterion of < 0.05 to indicate statistical significance and report that three of the biomarkers are associated with mortality in this patient population, citing p-values of 0.042, 0.030, and 0.025, respectively. You are concerned that the multiple comparisons increased the chance of a false positive finding to approximately 40% (i.e., (0.95)^10 ≈ 0.60, leaving a 40% chance of at least one false positive finding). Thus, you suggested that a Bonferroni correction be made to establish a more appropriate statistical criterion to test the association of each of the biomarkers with mortality. If the Bonferroni correction approach is utilized, the p-value that should be used to establish statistical significance is which one of the following:

a. 0.02
b. 0.001
c. 0.002
d. 0.005

5. In a hypothetical randomized, placebo-controlled trial of the use of inhaled nitric oxide (iNO) among pediatric acute respiratory distress syndrome (ARDS) patients, the mortality rate was 10% among those children who received nitric oxide and 30% among those who received placebo. Which of the following values most accurately represents the odds ratio of mortality for children with ARDS treated with iNO compared to those treated with placebo?

a. 0.11
b. 0.26
c. 0.33
d. 0.43

Answers:

1. D
2. B
3. A
4. D
5. B
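As a check on the arithmetic behind answers 4 and 5, a short sketch (question 4 is the Bonferroni division; question 5 converts the two mortality rates to odds before taking their ratio):

```python
# Question 4: Bonferroni-corrected significance threshold for 10 tests.
alpha, n_tests = 0.05, 10
threshold = alpha / n_tests
print(threshold)              # 0.005 -> answer (d)

# Question 5: odds ratio of mortality, iNO (10%) vs. placebo (30%).
odds_ino = 0.10 / 0.90        # odds = p / (1 - p)
odds_placebo = 0.30 / 0.70
odds_ratio = odds_ino / odds_placebo
print(round(odds_ratio, 2))   # 0.26 -> answer (b)
```

Note that the odds ratio (0.26) differs from the risk ratio (0.10/0.30 ≈ 0.33, option c), a distinction the question is designed to test.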


8 SUGGESTED READING Friedman LM, Furberg CD, DeMets DL, Reboussin DM, Granger CB. (2015). Fundamentals of clinical trials. Fifth Edition. Switzerland: Springer International.

Lambert J. Statistics in brief: how to assess bias in clinical studies? Clin Orthop Relat Res. 2011;469(6):1794-1796.

Moler FW, Silverstein FS, Holubkov R, et al; THAPCA Trial Investigators. Therapeutic hypothermia after out-of-hospital cardiac arrest in children. N Engl J Med. 2015;372(20):1898-1908.

Ellis PD. (2010). The essential guide to effect sizes: statistical power, meta-analysis, and the interpretation of research results. Cambridge; New York: Cambridge University Press.

Chelluri L, Sirio CA, Angus DC. Thrombolytic therapy for acute myocardial infarction: GUSTO criticized. N Engl J Med. 1994;330(7):505.

Friedman HS. Thrombolytic therapy for acute myocardial infarction: GUSTO criticized. N Engl J Med. 1994;330(7):504.

Cashen K, Reeder R, Dalton HJ, et al; Eunice Kennedy Shriver National Institute of Child Health and Human Development Collaborative Pediatric Critical Care Research Network (CPCCRN). Hyperoxia and hypocapnia during pediatric extracorporeal membrane oxygenation: associations with complications, mortality, and functional status among survivors. Pediatr Crit Care Med. 2018;19(3):245-253.

Dalton HJ, Reeder R, Garcia-Filion P, et al; Eunice Kennedy Shriver National Institute of Child Health and Human Development Collaborative Pediatric Critical Care Research Network. Factors associated with bleeding and thrombosis in children receiving extracorporeal membrane oxygenation. Am J Respir Crit Care Med. 2017;196(6):762-771.

Dalton HJ, Cashen K, Reeder RW, et al; Eunice Kennedy Shriver National Institute of Child Health and Human Development Collaborative Pediatric Critical Care Research Network (CPCCRN). Hemolysis during pediatric extracorporeal membrane oxygenation: associations with circuitry, complications, and mortality. Pediatr Crit Care Med. 2018;19(11):1067-1076.

Kulkarni AV, Riva-Cambrin J, Rozzelle CJ, et al. Endoscopic third ventriculostomy and choroid plexus cauterization in infant hydrocephalus: a prospective study by the Hydrocephalus Clinical Research Network. J Neurosurg Pediatr. 2018;21(3):214-223.

Richardson DB, Laurier D, Schubauer-Berigan MK, Tchetgen Tchetgen E, Cole SR. Assessment and indirect adjustment for confounding by smoking in cohort studies using relative hazards models. Am J Epidemiol. 2014;180(9):933-940.

Groenwold RH, Nelson DB, Nichol KL, Hoes AW, Hak E. Sensitivity analyses to estimate the potential impact of unmeasured confounding in causal research. Int J Epidemiol. 2010;39(1):107-117.

Bypass Angioplasty Revascularization Investigation (BARI) Investigators. Comparison of coronary bypass surgery with angioplasty in patients with multivessel disease. N Engl J Med. 1996;335(4):217-225.


Willson DF, Thomas NJ, Markovitz BP, et al; Pediatric Acute Lung Injury and Sepsis Investigators. Effect of exogenous surfactant (calfactant) in pediatric acute lung injury: a randomized controlled trial. JAMA. 2005;293(4):470-476.

Detre KM, Guo P, Holubkov R, et al. Coronary revascularization in diabetic patients: a comparison of the randomized and observational components of the Bypass Angioplasty Revascularization Investigation (BARI). Circulation. 1999;99(5):633-640.

Chen JJ, Roberson PK, Schell MJ. The false discovery rate: a key concept in large-scale genetic studies. Cancer Control. 2010;17(1):58-62.