The Campbell Collaboration www.campbellcollaboration.org
Applied topics:
Interpreting the Practical Significance of Meta-Analysis Findings
Mark Lipsey
Co-Chair, The Campbell Collaboration
Co-Editor-in-Chief, Campbell Systematic Reviews
Director, Peabody Research Institute, Vanderbilt University, USA
The problem
• The effect size statistics that constitute the direct findings of a meta-analysis often provide little insight into the nature, magnitude, or practical significance of the effects they represent.
• Practitioners, policymakers, and even researchers have difficulty knowing whether the effects are meaningful in an applied context.
• Example: The mean standardized mean difference effect size (Cohen’s d or Hedges’ g) for the effects of educational interventions with middle school students on standardized reading tests is about .15 and statistically significant.
– Seems small: Is .15 large enough to have practical significance for improving the reading skills of middle school students?
• Most important to recognize: There is no necessary relationship between the numerical magnitude of an effect size and the practical significance of the effect it represents!
A widely used but inappropriate and misleading characterization of effect sizes
• Statistical effect sizes assessed by Cohen’s small (.20), medium (.50) and large (.80) categories
– Impressionistic norms across a wide range of outcomes in social and behavioral research
– Almost never are these the appropriate norms for the particular outcomes of a particular intervention
• Comparing an obtained mean effect size with norms can be informative, but those norms must be appropriate to the context, intervention, nature of the outcomes, etc. [more on this later]
Two approaches to review here
1. Descriptive representations of intervention effect sizes:
   – Translations of effect sizes into forms that are more readily interpreted.
   – Supports better intuitions about the practical significance of the effect size.
2. Direct assessment of practical significance:
   – Assessing statistical effect sizes in relationship to criteria that have recognized practical value in the context of application.
   – Requires that appropriate criteria be used; different criteria may yield different conclusions.
Useful Descriptive Representations of Intervention Effect Sizes
Back translation to an original metric
• Useful when the original metric is readily interpretable; not so useful when it is in arbitrary units.
• Example: Mean Phi coefficient for effects of intervention on the reoffense rates of juvenile offenders < .20, allegedly trivial.
• Computation of the Phi coefficient as an effect size:

                 Reoffend (failure)   Don’t Reoffend (success)
  Tx             a = p                b = 1-p                    a+b = 1
  Ct             c = q                d = 1-q                    c+d = 1
                 a+c = p+q            b+d = (1-p)+(1-q)

  Phi = (ad - bc) / SQRT((a+b)(c+d)(a+c)(b+d))
Back translation to original metric: Phi coefficient example
• Mean reoffense rate for the control groups in the studies was .50.
• Some algebra (or trial and error in a spreadsheet) yields the reoffense rate of the average treatment group required to produce Phi = .20.
• [Note: A similar procedure would work for an odds ratio ES as well]

                 Reoffend (failure)   Don’t Reoffend (success)
  Tx             .30                  .70                        1.00
  Ct             .50                  .50                        1.00
                 .80                  1.20

  Phi = .20

Phi = .20 thus means a .20 (20 percentage point) reduction in the reoffense rate from a .50 average baseline value; that is, a 40% decrease in the reoffense rate. Hardly trivial!
Back translation to original metric: Standardized test example
• Suppose the mean standardized mean difference effect size for intervention effects on vocabulary tests is .30
• The most frequently used measure of vocabulary in the contributing studies was the Peabody Picture Vocabulary Test (PPVT)
• The PPVT has a normed standard score of 100 with a standard deviation of 15. Differences in standard scores are readily understood by researchers and practitioners familiar with standardized tests
• The control groups in the studies using the PPVT had a mean standard score of 87.
• How much improvement in the PPVT standard score is represented by an effect size of .30?
Back translation to original metric: PPVT
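The arithmetic behind this back-translation is simple multiplication. A sketch using the values from the preceding slide (ES = .30, PPVT standard-score SD = 15, control-group mean = 87):

```python
# Values from the slides: mean ES, PPVT standard-score SD, control mean
es, sd, control_mean = 0.30, 15, 87

gain = es * sd                        # improvement in standard-score points
treatment_mean = control_mean + gain  # implied average treatment-group score

print(gain, treatment_mean)
```

An effect size of .30 thus corresponds to a gain of about 4.5 standard-score points, moving the average group from 87 to roughly 91.5 on the PPVT scale.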
Intervention effect sizes represented as percentiles on the normal distribution
[Figure: percentile values on the control distribution corresponding to intervention effects in standard deviation units]
Translating effect sizes into percentiles from a table of areas under the normal curve
The percentage of the treatment group that is above the control group mean is Cohen’s U3 index
Effect Size   Proportion above the Control Mean   Additional proportion above original mean
 .10          .54                                 .04
 .20          .58                                 .08
 .30          .62                                 .12
 .40          .66                                 .16
 .50          .69                                 .19
 .60          .73                                 .23
 .70          .76                                 .26
 .80          .79                                 .29
 .90          .82                                 .32
1.00          .84                                 .34
1.10          .86                                 .36
1.20          .88                                 .38
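The U3 values in the table come directly from the standard normal CDF evaluated at the effect size. A minimal sketch using only Python's standard library, assuming normal distributions with equal SDs in both groups:

```python
from statistics import NormalDist

def u3(d):
    """Cohen's U3: proportion of a normally distributed treatment group
    falling above the control-group mean, given effect size d."""
    return NormalDist().cdf(d)

for d in (0.10, 0.50, 1.00):
    print(f"ES = {d:.2f}: U3 = {u3(d):.2f}, "
          f"additional above mean = {u3(d) - 0.5:.2f}")
```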
Rosenthal and Rubin Binomial Effect Size Display (BESD)
[Figure: BESD example for d = .80]
BESD representations of SMD and correlation ESs
Effect Size   r     Proportion of control / intervention cases above the grand median   BESD (difference between the proportions)
 .10          .05   .47 / .52                                                           .05
 .20          .10   .45 / .55                                                           .10
 .30          .15   .42 / .57                                                           .15
 .40          .20   .40 / .60                                                           .20
 .50          .24   .38 / .62                                                           .24
 .60          .29   .35 / .64                                                           .29
 .70          .33   .33 / .66                                                           .33
 .80          .37   .31 / .68                                                           .37
 .90          .41   .29 / .70                                                           .41
1.00          .45   .27 / .72                                                           .45
1.10          .48   .26 / .74                                                           .48
1.20          .51   .24 / .75                                                           .51
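The BESD columns follow from the standard d-to-r conversion, r = d / sqrt(d² + 4), and the display proportions .50 ± r/2. A sketch (the `besd` helper is an illustrative name, not from the slides):

```python
from math import sqrt

def besd(d):
    """Rosenthal & Rubin's BESD: convert a standardized mean difference d
    to r, then express the effect as the proportions of control and
    intervention cases above the grand median (.50 - r/2 and .50 + r/2)."""
    r = d / sqrt(d * d + 4)
    return r, 0.5 - r / 2, 0.5 + r / 2

r, control, intervention = besd(0.80)
# r comes out near .37, with proportions near .31 / .69
# (the table above truncates some values rather than rounding)
```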
Even better, use an inherently meaningful threshold
• Suppose we have a mean standardized mean difference effect size of .23 for the effects of treatment for depression on outcome measures of depression.
• For many measures of depression, a threshold score has been determined for the range that constitutes clinical levels of depression.
• Suppose, then, that we can determine from at least a subset of representative studies that the average proportion of the control groups whose scores are in the clinical range is 64%.
• Assuming that depression scores are normally distributed, we can then use this proportion and the effect size to determine the average proportion in the clinical range for the treatment groups.
• From that we find the proportion of clinically depressed patients moved out of the clinical range by the treatment.
Proportions of T and C samples above and below a meaningful reference value
[Figure: T and C distributions divided at a success threshold into proportions above and below]
Using a table of areas under the normal curve
• 64% of the area of the normal curve is below Z = .36
• Subtracting ES = .23 SD from Z = .36 gives Z = .13, with 55% of the area of the normal curve below
The mean effect size of .23 indicates that, on average, the intervention reduced the proportion of clinically depressed patients from 64% to 55%, a 9 percentage point differential.

[Figure: control distribution with 64% below (36% above) the clinical threshold; treatment distribution with 55% below (45% above)]
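The same threshold calculation can be reproduced with the standard normal distribution, using Python's `statistics.NormalDist` in place of a printed Z table:

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal

control_clinical = 0.64  # proportion of control groups in the clinical range
es = 0.23                # mean standardized mean difference effect size

# Locate the clinical threshold on the control distribution, then shift
# by the effect size to get the treatment-group proportion
z_threshold = nd.inv_cdf(control_clinical)     # near .36
treatment_clinical = nd.cdf(z_threshold - es)  # near .55

# Reduction in the proportion in the clinical range: about 9 points
reduction = control_clinical - treatment_clinical
```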
The more general point
• With some understanding of the nature of the effect size index you are working with …
• and some understanding of the context of the intervention and what might be an interpretable representation of the magnitude of the intervention effect on the outcomes of interest,
• it will almost always be possible to translate any effect size or mean effect size into a form that facilitates interpretation of its practical significance.
Direct Assessments of Practical Significance
Assessing the practical significance of effect sizes requires a criterion from the context of application
• Neither the numerical value of an effect size nor its statistical significance is a valid indicator of the practical significance of the effect.
• Translating the numerical value into terms easier to understand facilitates an intuitive assessment of practical significance, but is inherently subjective.
• A more direct assessment of practical significance can often be made by comparing the effect size with an appropriate criterion drawn from the context of application and, therefore, meaningful in that context.
• The clinical and normative thresholds used as examples in the previous section are a step in that direction, but more can be learned from a more fully developed criterion framework.
Examples of some criterion frameworks that can be used to assess the practical significance of intervention effect sizes
E.g., compare the mean effect size found with:
• Established normative expectations for change
• Effects others have found on similar measures with similar interventions
• Policy-relevant performance gaps
• Intervention costs (not discussed here)

Some examples from education follow (this happens to be where we have done a lot of work recently).
Benchmarking against normative expectations for change from test norming samples
Data compiled from national norms for standardized achievement tests:
• Up to seven tests were used for reading, math, science, and social science
• The mean and standard deviation of the scores for each grade were obtained from the test manuals
• The standardized mean difference effect size across succeeding grades was computed
Annual achievement gain: Mean effect sizes across 7 nationally-normed tests
Grade Transition   Reading   Math   Science   Social Studies
K – 1              1.52      1.14   --        --
1 – 2               .97      1.03   .58       .63
2 – 3               .60       .89   .48       .51
3 – 4               .36       .52   .37       .33
4 – 5               .40       .56   .40       .35
5 – 6               .32       .41   .27       .32
6 – 7               .23       .30   .28       .27
7 – 8               .26       .32   .26       .25
8 – 9               .24       .22   .22       .18
9 – 10              .19       .25   .19       .19
10 – 11             .19       .14   .15       .15
11 – 12             .06       .01   .04       .04
Adapted from Bloom, Hill, Black, and Lipsey (2008). Spring-to-spring differences. The means shown are the simple (unweighted) means of the effect sizes from all or a subset of seven tests: CAT5, SAT9, Terra Nova-CTBS, Gates-MacGinitie, MAT8, Terra Nova-CAT, and SAT10.
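One way to use these benchmarks is to divide an intervention effect size by the normative annual gain for the matching grade and subject, giving a rough "equivalent years of growth" reading. A hedged sketch using the deck's middle school reading example (ES = .15) against the grade 6–7 reading gain (.23):

```python
intervention_es = 0.15  # middle school reading example from the opening slide
annual_gain_es = 0.23   # grade 6-7 reading gain from the benchmark table

# Fraction of a normal year's reading growth the intervention represents
years_of_growth = intervention_es / annual_gain_es
print(f"about {years_of_growth:.0%} of a year's normative growth")
```

On that criterion, the "small-looking" .15 amounts to roughly two-thirds of a year's normal reading growth, which illustrates the deck's caution against judging effect sizes by Cohen's generic norms.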
Mean effect size relative to the effect size for achievement gain from pretest baseline

[Figure: gain from the beginning to end of pre-K on a summary achievement measure for children who participated in pre-K compared to children who did not participate]
• ES for the mean control group (nonparticipant) pre-post gain: .82 SD
• Mean intervention ES: .31 SD, a 38% increase over the nonparticipants’ gain
Benchmarking against effect sizes for achievement from random assignment studies of education interventions
Data in our current compilation:
• 124 random assignment studies
• 181 independent subject samples
• 829 effect size estimates
Achievement effect sizes by grade level and type of achievement test
Grade Level & Achievement Measure     N of ES Estimates   Mean   SD
Elementary School                     693                 .28    .46
  Standardized test (broad)           89                  .08    .27
  Standardized test (narrow)          374                 .25    .42
  Specialized topic/test              230                 .40    .55
Middle School                         70                  .33    .38
  Standardized test (broad)           13                  .15    .33
  Standardized test (narrow)          30                  .32    .26
  Specialized topic/test              27                  .43    .48
High School                           66                  .23    .34
  Standardized test (broad)           --                  --     --
  Standardized test (narrow)          22                  .03    .07
  Specialized topic/test              43                  .34    .38
Achievement effect sizes by target recipients
Target Recipients                   Number of ES Estimates   Mean ES   SD
Individual students (one-on-one)    252                      .40       .53
Small groups (not classrooms)       322                      .26       .40
Classroom of students               176                      .18       .41
Whole school                        35                       .10       .30
Mixed                               44                       .30       .33
Benchmarking against policy-relevant demographic performance gaps
• Effectiveness of interventions can be judged relative to the sizes of existing gaps across demographic groups
• Effect size gaps for groups may vary across grades, years, tests, and districts
Demographic performance gaps on SAT 9 scores in a large urban school district as effect sizes
Subject & Grade   Black-White   Hispanic-White   Eligible-Ineligible for FRPL
Reading
  Grade 4         1.09          1.03              .86
  Grade 8         1.02          1.14              .68
  Grade 12        1.11          1.16              .58
Math
  Grade 4          .95           .71              .68
  Grade 8         1.11          1.07              .58
  Grade 12        1.20          1.12              .51
Adapted from Bloom, Hill, Black, and Lipsey (2008). District local outcomes are based on SAT-9 scaled scores for tests administered in spring 2000, 2001, and 2002. SAT 9: Stanford Achievement Tests, 9th Edition (Harcourt Educational Measurement, 1996).
Benchmarking against performance gaps between “average” and “weak” schools
Main idea:
• What is the performance gap (in effect size) for the same types of students in different schools?
Approach:
• Estimate a regression model that controls for student characteristics: race/ethnicity, prior achievement, gender, overage for grade, and free lunch status.
• Infer the performance gap (in effect size) between schools at different percentiles of the performance distribution.
Performance gaps between average (50th percentile) and weak (10th percentile) schools in 4 districts as effect sizes

                  School District
Subject & Grade   A     B     C     D
Reading
  Grade 3        .31   .18   .16   .43
  Grade 5        .41   .18   .35   .31
  Grade 7        .25   .11   .30   NA
  Grade 10       .07   .11   NA    NA
Math
  Grade 3        .29   .25   .19   .41
  Grade 5        .27   .23   .36   .26
  Grade 7        .20   .15   .23   NA
  Grade 10       .14   .17   NA    NA
Adapted from Bloom, Hill, Black, and Lipsey (2008). “NA” indicates that a value is not available due to missing test score data. Means are regression-adjusted for test scores in prior grade and students’ demographic characteristics. The tests are the ITBS for District A, SAT9 for District B, MAT for District C, and SAT8 for District D.
Cost effectiveness as a framework for practical significance: Example for juvenile offender programs
Excerpted from Aos, Phipps, Barnoski, & Lieb, 2001
In conclusion …
• The numerical values of statistical effect size indices for intervention effects provide little understanding of the practical magnitude of those effects.
• Translating effect sizes into a more descriptive and intuitive form makes them easier to understand and assess for practitioners, policymakers, and researchers.
• There are a number of easily applied translations that could be routinely used in reporting intervention effect sizes.
• Directly assessing the practical significance of those effects, however, requires that they be benchmarked against some criterion that is meaningful in the intervention context.
• Assessing practical significance directly is more difficult, but there are approaches that may be appropriate depending on the intervention and outcome construct.
References
Aos, S., Phipps, P., Barnoski, R., & Lieb, R. (2001). The comparative costs and benefits of programs to reduce crime (Version 4.0). Washington State Institute for Public Policy.
Bloom, H. S., Hill, C. J., Black, A. B., & Lipsey, M. W. (2008). Performance trajectories and performance gaps as achievement effect-size benchmarks for educational interventions. Journal of Research on Educational Effectiveness, 1(4), 289-328.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Redfield, D. L., & Rousseau, E. W. (1981). A meta-analysis of experimental research on teacher questioning behavior. Review of Educational Research, 51, 237-245.
Rosenthal, R., & Rubin, D. B. (1982). A simple, general purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74(2), 166-169.
Campbell Collaboration P.O. Box 7004 St. Olavs plass
0130 Oslo, Norway
E-mail: [email protected] http://www.campbellcollaboration.org
Contact Information [email protected]