The Campbell Collaboration www.campbellcollaboration.org
Applied topics:
Interpreting the Practical Significance of Meta-Analysis Findings
Mark Lipsey
Co-Chair, The Campbell Collaboration
Co-Editor-in-Chief, Campbell Systematic Reviews
Director, Peabody Research Institute, Vanderbilt University, USA
The problem
• The effect size statistics that constitute the direct findings of a meta-analysis often provide little insight into the nature, magnitude, or practical significance of the effects they represent.
• Practitioners, policymakers, and even researchers have difficulty knowing whether the effects are meaningful in an applied context.
• Example: The mean standardized mean difference effect size (Cohen’s d or Hedges’ g) for the effects of educational interventions with middle school students on standardized reading tests is about .15 and statistically significant.
– Seems small: Is .15 large enough to have practical significance for improving the reading skills of middle school students?
• Most important to recognize: There is no necessary relationship between the numerical magnitude of an effect size and the practical significance of the effect it represents!
A widely used but inappropriate and misleading characterization of effect sizes
• Statistical effect sizes assessed by Cohen’s small (.20), medium (.50) and large (.80) categories
– Impressionistic norms across a wide range of outcomes in social and behavioral research
– Almost never are these the appropriate norms for the particular outcomes of a particular intervention
• Comparing an obtained mean effect size with norms can be informative, but those norms must be appropriate to the context, intervention, nature of the outcomes, etc. [more on this later]
Two approaches to review here
1. Descriptive representations of intervention effect sizes:
   – Translations of effect sizes into forms that are more readily interpreted.
   – Supports better intuitions about the practical significance of the effect size.
2. Direct assessment of practical significance:
   – Assessing statistical effect sizes in relationship to criteria that have recognized practical value in the context of application.
   – Requires that appropriate criteria be used; different criteria may yield different conclusions.
Useful Descriptive Representations of Intervention Effect Sizes
Back translation to an original metric
• Useful when the original metric is readily interpretable; not so useful when it is in arbitrary units.
• Example: Mean Phi coefficient for effects of intervention on the reoffense rates of juvenile offenders < .20, allegedly trivial.
• Computation of the Phi coefficient as an effect size:

                 Reoffend (failure)   Don’t Reoffend (success)
  Tx             a = p                b = 1-p                    a+b = 1
  Ct             c = q                d = 1-q                    c+d = 1
                 a+c = p+q            b+d = (1-p)+(1-q)

  Phi = (ad - bc) / SQRT((a+b)(c+d)(a+c)(b+d))
Back translation to original metric: Phi coefficient example
• Mean reoffense rate for the control groups in the studies was .50.
• Some algebra (or trial and error in a spreadsheet) yields the reoffense rate of the average treatment group required to produce Phi = .20.
• [Note: A similar procedure would work for an odds ratio ES as well]

                 Reoffend (failure)   Don’t Reoffend (success)
  Tx             .30                  .70                        1.00
  Ct             .50                  .50                        1.00
                 .80                  1.20

  Phi = .20

Phi = .20 thus means a .20 (20 percentage point) reduction in the reoffense rate from a .50 average baseline value; that is, a 40% decrease in the reoffense rate. Hardly trivial!
Back translation to original metric: Standardized test example
• Suppose the mean standardized mean difference effect size for intervention effects on vocabulary tests is .30
• The most frequently used measure of vocabulary in the contributing studies was the Peabody Picture Vocabulary Test (PPVT)
• The PPVT has a normed standard score of 100 with a standard deviation of 15. Differences in standard scores are readily understood by researchers and practitioners familiar with standardized tests
• The control groups in the studies using the PPVT had a mean standard score of 87.
• How much improvement in the PPVT standard score is represented by an effect size of .30?
Back translation to original metric: PPVT
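The arithmetic behind this back-translation is simple multiplication. A sketch using the values from the preceding slide (ES = .30, PPVT standard-score SD = 15, control-group mean = 87):

```python
# Values from the slides: mean ES, PPVT standard-score SD, control mean
es, sd, control_mean = 0.30, 15, 87

gain = es * sd                        # improvement in standard-score points
treatment_mean = control_mean + gain  # implied average treatment-group score

print(gain, treatment_mean)
```

An effect size of .30 thus corresponds to a gain of about 4.5 standard-score points, moving the average group from 87 to roughly 91.5 on the PPVT scale.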
Intervention effect sizes represented as percentiles on the normal distribution
[Figure: percentile values on the control distribution corresponding to intervention effects in standard deviation units]
Translating effect sizes into percentiles from a table of areas under the normal curve
The percentage of the treatment group that is above the control group mean is Cohen’s U3 index
Effect Size   Proportion above the Control Mean   Additional proportion above original mean
 .10          .54                                 .04
 .20          .58                                 .08
 .30          .62                                 .12
 .40          .66                                 .16
 .50          .69                                 .19
 .60          .73                                 .23
 .70          .76                                 .26
 .80          .79                                 .29
 .90          .82                                 .32
1.00          .84                                 .34
1.10          .86                                 .36
1.20          .88                                 .38
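The U3 values in the table come directly from the standard normal CDF evaluated at the effect size. A minimal sketch using only Python's standard library, assuming normal distributions with equal SDs in both groups:

```python
from statistics import NormalDist

def u3(d):
    """Cohen's U3: proportion of a normally distributed treatment group
    falling above the control-group mean, given effect size d."""
    return NormalDist().cdf(d)

for d in (0.10, 0.50, 1.00):
    print(f"ES = {d:.2f}: U3 = {u3(d):.2f}, "
          f"additional above mean = {u3(d) - 0.5:.2f}")
```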
Rosenthal and Rubin Binomial Effect Size Display (BESD)
[Figure: BESD example for d = .80]
BESD representations of SMD and correlation ESs
Effect Size   r     Proportion of control / intervention cases above the grand median   BESD (difference between the proportions)
 .10          .05   .47 / .52                                                           .05
 .20          .10   .45 / .55                                                           .10
 .30          .15   .42 / .57                                                           .15
 .40          .20   .40 / .60                                                           .20
 .50          .24   .38 / .62                                                           .24
 .60          .29   .35 / .64                                                           .29
 .70          .33   .33 / .66                                                           .33
 .80          .37   .31 / .68                                                           .37
 .90          .41   .29 / .70                                                           .41
1.00          .45   .27 / .72                                                           .45
1.10          .48   .26 / .74                                                           .48
1.20          .51   .24 / .75                                                           .51
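The BESD columns follow from the standard d-to-r conversion, r = d / sqrt(d² + 4), and the display proportions .50 ± r/2. A sketch (the `besd` helper is an illustrative name, not from the slides):

```python
from math import sqrt

def besd(d):
    """Rosenthal & Rubin's BESD: convert a standardized mean difference d
    to r, then express the effect as the proportions of control and
    intervention cases above the grand median (.50 - r/2 and .50 + r/2)."""
    r = d / sqrt(d * d + 4)
    return r, 0.5 - r / 2, 0.5 + r / 2

r, control, intervention = besd(0.80)
# r comes out near .37, with proportions near .31 / .69
# (the table above truncates some values rather than rounding)
```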
Even better, use an inherently meaningful threshold
• Suppose we have a mean standardized mean difference effect size of .23 for the effects of treatment for depression on outcome measures of depression.
• For many measures of depression, a threshold score has been determined for the range that constitutes clinical levels of depression.
• Suppose, then, that we can determine from at least a subset of representative studies that the average proportion of the control groups whose scores are in the clinical range is 64%.
• Assuming that depression scores are normally distributed, we can then use this proportion and the effect size to determine the average proportion in the clinical range for the treatment groups.
• From that we find the proportion of clinically depressed patients moved out of the clinical range by the treatment.
Proportions of T and C samples above and below a meaningful reference value
[Figure: T and C distributions divided at a success threshold into proportions above and below]
Using a table of areas under the normal curve
• 64% of the area of the normal curve is below Z = .36
• Subtracting ES = .23 SD from Z = .36 gives Z = .13, with 55% of the area of the normal curve below
The mean effect size of .23 indicates that, on average, the intervention reduced the proportion of clinically depressed patients from 64% to 55%, a 9 percentage point differential.

[Figure: control distribution with 64% below (36% above) the clinical threshold; treatment distribution with 55% below (45% above)]
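The same threshold calculation can be reproduced with the standard normal distribution, using Python's `statistics.NormalDist` in place of a printed Z table:

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal

control_clinical = 0.64  # proportion of control groups in the clinical range
es = 0.23                # mean standardized mean difference effect size

# Locate the clinical threshold on the control distribution, then shift
# by the effect size to get the treatment-group proportion
z_threshold = nd.inv_cdf(control_clinical)     # near .36
treatment_clinical = nd.cdf(z_threshold - es)  # near .55

# Reduction in the proportion in the clinical range: about 9 points
reduction = control_clinical - treatment_clinical
```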
The more general point
• With some understanding of the nature of the effect size index you are working with …
• and some understanding of the context of the intervention and what might be an interpretable representation of the magnitude of the intervention effect on the outcomes of interest,
• it will almost always be possible to translate any effect size or mean effect size into a form that facilitates interpretation of its practical significance.
Direct Assessments of Practical Significance
Assessing the practical significance of effect sizes requires a criterion from the context of application
• Neither the numerical value of an effect size nor its statistical significance is a valid indicator of the practical significance of the effect.
• Translating the numerical value into terms easier to understand facilitates an intuitive assessment of practical significance, but is inherently subjective.
• A more direct assessment of practical significance can often be made by comparing the effect size with an appropriate criterion drawn from the context of application and, therefore, meaningful in that context.
• The clinical and normative thresholds used as examples in the previous section are a step in that direction, but more can be learned from a more fully developed criterion framework.
Examples of some criterion frameworks that can be used to assess the practical significance of intervention effect sizes
E.g., compare the mean effect size found with:
• Established normative expectations for change
• Effects others have found on similar measures with similar interventions
• Policy-relevant performance gaps
• Intervention costs (not discussed here)

Some examples from education follow (this happens to be where we have done a lot of work recently).
Benchmarking against normative expectations for change from test norming samples
Data compiled from national norms for standardized achievement tests:
• Up to seven tests were used for reading, math, science, and social science
• The mean and standard deviation of the scores for each grade were obtained from the test manuals
• The standardized mean difference effect size across succeeding grades was computed
Annual achievement gain: Mean effect sizes across 7 nationally-normed tests
Grade Transition   Reading   Math   Science   Social Studies
K – 1              1.52      1.14   --        --
1 – 2               .97      1.03   .58       .63
2 – 3               .60       .89   .48       .51
3 – 4               .36       .52   .37       .33
4 – 5               .40       .56   .40       .35
5 – 6               .32       .41   .27       .32
6 – 7               .23       .30   .28       .27
7 – 8               .26       .32   .26       .25
8 – 9               .24       .22   .22       .18
9 – 10              .19       .25   .19       .19
10 – 11             .19       .14   .15       .15
11 – 12             .06       .01   .04       .04
Adapted from Bloom, Hill, Black, and Lipsey (2008). Spring-to-spring differences. The means shown are the simple (unweighted) means of the effect sizes from all or a subset of seven tests: CAT5, SAT9, Terra Nova-CTBS, Gates-MacGinitie, MAT8, Terra Nova-CAT, and SAT10.
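One way to use these benchmarks is to divide an intervention effect size by the normative annual gain for the matching grade and subject, giving a rough "equivalent years of growth" reading. A hedged sketch using the deck's middle school reading example (ES = .15) against the grade 6–7 reading gain (.23):

```python
intervention_es = 0.15  # middle school reading example from the opening slide
annual_gain_es = 0.23   # grade 6-7 reading gain from the benchmark table

# Fraction of a normal year's reading growth the intervention represents
years_of_growth = intervention_es / annual_gain_es
print(f"about {years_of_growth:.0%} of a year's normative growth")
```

On that criterion, the "small-looking" .15 amounts to roughly two-thirds of a year's normal reading growth, which illustrates the deck's caution against judging effect sizes by Cohen's generic norms.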
Mean effect size relative to the effect size for achievement gain from pretest baseline

[Figure: gain from the beginning to end of pre-K on a summary achievement measure for children who participated in pre-K compared to children who did not participate]
• ES for the mean control group (nonparticipant) pre-post gain: .82 SD
• Mean intervention ES: .31 SD, a 38% increase over the nonparticipants’ gain
Benchmarking against effect sizes for achievement from random assignment studies of education interventions
Data in our current compilation:
• 124 random assignment studies
• 181 independent subject samples
• 829 effect size estimates
Achievement effect sizes by grade level and type of achievement test
Grade Level & Achievement Measure     N of ES Estimates   Mean   SD
Elementary School                     693                 .28    .46
  Standardized test (broad)           89                  .08    .27
  Standardized test (narrow)          374                 .25    .42
  Specialized topic/test              230                 .40    .55
Middle School                         70                  .33    .38
  Standardized test (broad)           13                  .15    .33
  Standardized test (narrow)          30                  .32    .26
  Specialized topic/test              27                  .43    .48
High School                           66                  .23    .34
  Standardized test (broad)           --                  --     --
  Standardized test (narrow)          22                  .03    .07
  Specialized topic/test              43                  .34    .38
Achievement effect sizes by target recipients
Target Recipients                   Number of ES Estimates   Mean ES   SD
Individual students (one-on-one)    252                      .40       .53
Small groups (not classrooms)       322                      .26       .40
Classroom of students               176                      .18       .41
Whole school                        35                       .10       .30
Mixed                               44                       .30       .33
Benchmarking against policy-relevant demographic performance gaps
• Effectiveness of interventions can be judged relative to the sizes of existing gaps across demographic groups
• Effect size gaps for groups may vary across grades, years, tests, and districts
Demographic performance gaps on SAT 9 scores in a large urban school district as effect sizes
Subject & Grade   Black-White   Hispanic-White   Eligible-Ineligible for FRPL
Reading
  Grade 4         1.09          1.03              .86
  Grade 8         1.02          1.14              .68
  Grade 12        1.11          1.16              .58
Math
  Grade 4          .95           .71              .68
  Grade 8         1.11          1.07              .58
  Grade 12        1.20          1.12              .51
Adapted from Bloom, Hill, Black, and Lipsey (2008). District local outcomes are based on SAT-9 scaled scores for tests administered in spring 2000, 2001, and 2002. SAT 9: Stanford Achievement Tests, 9th Edition (Harcourt Educational Measurement, 1996).
Benchmarking against performance gaps between “average” and “weak” schools
Main idea:
• What is the performance gap (in effect size) for the same types of students in different schools?
Approach:
• Estimate a regression model that controls for student characteristics: race/ethnicity, prior achievement, gender, overage for grade, and free lunch status.
• Infer the performance gap (in effect size) between schools at different percentiles of the performance distribution.
Performance gaps between average (50th percentile) and weak (10th percentile) schools in 4 districts as effect sizes

                  School District
Subject & Grade   A     B     C     D
Reading
  Grade 3        .31   .18   .16   .43
  Grade 5        .41   .18   .35   .31
  Grade 7        .25   .11   .30   NA
  Grade 10       .07   .11   NA    NA
Math
  Grade 3        .29   .25   .19   .41
  Grade 5        .27   .23   .36   .26
  Grade 7        .20   .15   .23   NA
  Grade 10       .14   .17   NA    NA
Adapted from Bloom, Hill, Black, and Lipsey (2008). “NA” indicates that a value is not available due to missing test score data. Means are regression-adjusted for test scores in prior grade and students’ demographic characteristics. The tests are the ITBS for District A, SAT9 for District B, MAT for District C, and SAT8 for District D.
Cost effectiveness as a framework for practical significance: Example for juvenile offender programs
Excerpted from Aos, Phipps, Barnoski, & Lieb, 2001
In conclusion …
• The numerical values of statistical effect size indices for intervention effects provide little understanding of the practical magnitude of those effects.
• Translating effect sizes into a more descriptive and intuitive form makes them easier to understand and assess for practitioners, policymakers, and researchers.
• There are a number of easily applied translations that could be routinely used in reporting intervention effect sizes.
• Directly assessing the practical significance of those effects, however, requires that they be benchmarked against some criterion that is meaningful in the intervention context.
• Assessing practical significance directly is more difficult, but there are approaches that may be appropriate depending on the intervention and outcome construct.
References
Aos, S., Phipps, P., Barnoski, R., & Lieb, R. (2001). The comparative costs and benefits of programs to reduce crime (Version 4.0). Washington State Institute for Public Policy.
Bloom, H. S., Hill, C. J., Black, A. B., & Lipsey, M. W. (2008). Performance trajectories and performance gaps as achievement effect-size benchmarks for educational interventions. Journal of Research on Educational Effectiveness, 1(4), 289-328.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Redfield, D. L., & Rousseau, E. W. (1981). A meta-analysis of experimental research on teacher questioning behavior. Review of Educational Research, 51, 237-245.
Rosenthal, R., & Rubin, D. B. (1982). A simple, general purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74(2), 166-169.
Campbell Collaboration P.O. Box 7004 St. Olavs plass
0130 Oslo, Norway
E-mail: [email protected] http://www.campbellcollaboration.org
Contact Information [email protected]