Research without Tears
What are we looking at, and how big is it?
Mark A. Stoove a, Mark B. Andersen b
a School of Health Sciences, Deakin University, Melbourne, Australia
b School of Human Movement, Recreation and Performance, Victoria University, Melbourne, Australia
Abstract
Some of the most important outcomes of physical therapy treatment have to do with behaviour and quality of life. This article involves
examining what it is we are measuring in physical therapy research and what those measurements mean. In looking at differences between groups (e.g. placebo-control) or the strength of association between variables (e.g. correlation, regression), the practitioner/researcher must consider what magnitudes of effects are meaningful. Depending on the variable that one measures, a medium effect size (e.g. Cohen's d = 0.50) may, in the real world, be insignificant, or, in the case of elite athletic performance, such an effect size might be gigantic. A major
problem in the sports sciences is the confusion of p values and significance testing with the results of interest, the magnitudes of effects. Also,
the prevalence of possible Type II errors in the sports sciences and medicine may be quite high in light of the small sample sizes and the
paucity of power analyses for non-significant results. We make an appeal for determining a priori minimal meaningful differences (or
associations) to use as the primary metrics in discussing results.
Crown Copyright © 2003 Published by Elsevier Science Ltd. All rights reserved.
Keywords: Statistical inference; Meaningful difference; Effect size; Power
1. What are we looking at, and how big is it?
In Kaplan’s (1994) article on outcomes in health
research, he used the comic strip Ziggy to present a
fundamental principle regarding the careful choice of
dependent variables in intervention outcome research. In
the comic strip, Ziggy climbs a mountain to ask a Guru,
‘What is the meaning of life?’ The Guru responds with, ‘Ah
yes… the meaning of life, my boy, is doin’ stuff!!’ Ziggy
questions the Guru, ‘Life is doin’ stuff?… that’s it?’ The
Guru responds, ‘As opposed to death, which is NOT doin’
stuff!’ Ziggy walks back down the mountain musing, ‘It’s a
more elementary theory than I had expected, but one you
can’t argue with’. The point Kaplan was making was that
the important and meaningful issue in health research is
doing stuff. Is the patient or client, after some intervention,
medication or surgery, able to do more or do something
more efficiently or function better than before the interven-
tion? Improved scores on paper and pencil tests (e.g.
measures of anxiety or depression) or physiological
parameters (e.g. serum cortisol) are of interest only if they
are related to behavioural, functional or quality of life
issues. That is, ‘doin’ stuff’.
1.1. What’s in it for the client?—choosing your dependent
variables wisely
What Kaplan (1994) proposed applies to both qualitative
and quantitative questions and methods. In this article,
however, we will be looking at the choice of quantitative
dependent variables and the issues of interpreting measure-
ments, statistical results and intervention outcomes in terms
of what is meaningful to the clients or patients. For example,
in the case of lateral epicondylitis one may measure the
amount of pain on resisted wrist extension using any one of
several pain scales. After a course of cryotherapy, muscle
stretching and soft tissue massage, does the lateral epicondylitis treatment group report lower scores on the
pain scale compared to a control group who received no
treatment? What is meaningful to the client is whether they
experience less pain when a treatment is provided,
potentially giving the client better functionality and
allowing them to do more. This research design of pre-
and post-testing with an experimental and control group is
common in medicine and health research. And the data from
such experiments are usually analysed incorrectly, but that
is a whole other story (see Huck & McClean 1975).
The dependent variable in the study described above was a
score on a paper and pencil test of pain perception. Such
scores mean very little by themselves because they are not
calibrated against the variables of interest, and those have to
do with function, behaviour and quality of life. For example,
what reduction in scores on a pain scale is needed for clients
to perceive an improvement in their functional ability
during day-to-day activities? So in choosing dependent
variables, one needs to choose one (or more) that has
intimate connections to real life variables.
In a recent article, in Physical Therapy in Sport, Bennell
et al. (2000) discussed this issue when assessing the
psychometric properties of self-report questionnaires for
patellofemoral pain syndrome. The authors reported that a
disadvantage of these questionnaires is that one does not
know the change scores that are indicative of a clinically
significant difference in pain. Although the sensitivity of
instruments to detect statistically significant changes over
time has been investigated for some of the instruments, the
authors go on to recommend that, ‘…the size of change
needed for a clinically significant result should also be
investigated’ (p. 40). We propose that clinical significance
needs to include what are meaningful behavioural changes
for the patient. For example, with patients or clients who are
elderly and have undergone joint replacement surgery (e.g.
hip arthroplasty), the dependent variables of interest in
medicine and physiotherapy may be things like range of
motion, flexibility, joint stability and strength. The hands-on
physical therapy and home exercises chosen to affect these
dependent variables, however, might be time consuming
and painful. Is the object of physical therapy the rehabilita-
tion of that surgically replaced joint and the muscles and
tendons surrounding it, or is the object of research and
practice the health, welfare and quality of life of the owner
of that joint? Is some sacrifice of range of motion or strength
an acceptable outcome if it means that the patient can forego
a painful and possibly lengthy rehabilitation? What degree
of functionality does the patient require for maintaining or
improving quality of life? How the practitioner and
researcher answer these questions determines the dependent
variables of choice. A holistic research picture of the
practice of physical therapy will likely include psychologi-
cal, physical and quality of life dependent measures.
That a condition can be treated does not necessarily mean it should be treated. Kaplan (1994) uses an example of
the effect of three treatment options for prostate cancer on
quality-adjusted life expectancy (QALE). He draws upon
the work of Fleming et al. (1993) to compare the efficacy of
the traditional approach to cancer treatment, targeting the
tumour with radiation therapy or removing the prostate
gland, with a watchful waiting option. The results showed
that QALE scores for the three treatment options were
equivalent for groups of men aged between 60 and 75 years.
The fundamental difference between these treatment
alternatives, however, is that the traditional approaches
carry high risks of complications that could reduce quality
of life (e.g. impotence, incontinence), whereas watchful
waiting simply involves evaluation and supervision by a
physician. Another difference to consider is that the
traditional approaches focus on treating the cancer and the
other focuses on treating the patient. The choice of what to
measure and what is a meaningful change following an
intervention is a value driven decision that cannot be made
without input from the client and should ultimately be a
matter of patient preference.
Once one has chosen the dependent variable (or
variables), the researcher needs to take things one step
further and establish what would be the minimal meaningful
benefit (MMB) and the minimal meaningful harm (MMH;
Hopkins 2001, 2002). Determination of the MMB and
MMH comes from knowledge of the field, knowledge of the
variables and knowledge of the potential changes possible
with the specific population one is studying.
To illustrate how MMB and MMH can be helpful in
interpreting the data of the dependent variable, let us take an
intervention that we believe will be meaningful only if there
is at least a 2% change from pre to post, and the MMH
would be anything less than 0% change (i.e. people get
worse). For example, let us say we deliver some treatment to
a group, and for the change scores pre to post we end up with
a mean of 2% with a standard deviation of 1%. As illustrated
in Fig. 1, given the outcome, there is a 50% probability that
the intervention will meet or exceed the MMB. There is a
48% chance that the intervention will produce a trivial difference (0 to −2.0 SD), and a 2% chance that the intervention will be harmful (< −2.0 SD). Although this
result means that it is a coin toss as to whether the
intervention will benefit the client, the results clearly show
that the intervention is highly unlikely to do harm.
Reflecting on the variable and the population in question,
it is ultimately up to the practitioner/researcher to determine
whether the treatment is worthwhile. For a review of the
concept of MMB and MMH and more discussion on clinical
versus statistical significance refer to Hopkins (2001, 2002).
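To make these figures concrete, the following is a minimal sketch, in Python, of how such probabilities can be computed, assuming change scores are approximately normally distributed (scipy's normal distribution supplies the cumulative probabilities; the thresholds and summary statistics are those of the example above):

from scipy.stats import norm

# Summary of observed change scores: mean 2% improvement, SD 1%
mean_change, sd_change = 2.0, 1.0

# A priori thresholds: MMB at +2% change, MMH at 0% change
mmb, mmh = 2.0, 0.0

# Probability the effect meets or exceeds the MMB
p_benefit = 1 - norm.cdf(mmb, loc=mean_change, scale=sd_change)   # ~0.50
# Probability the effect falls below the MMH (harm)
p_harm = norm.cdf(mmh, loc=mean_change, scale=sd_change)          # ~0.02
# Whatever remains is a trivial change (between MMH and MMB)
p_trivial = 1 - p_benefit - p_harm                                # ~0.48

print(f"benefit {p_benefit:.2f}, trivial {p_trivial:.2f}, harm {p_harm:.2f}")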
Fig. 1. Minimum meaningful harm, trivial change and minimal meaningful
benefit.
1.2. Size does matter (but only in context)
Cohen (1990) wrote that the meaningfulness of
statistical results has little to do with whether p is less
than 0.05. He stated:
The primary product of research is one or more measures
of effect size, not p values …Effect size measures
include mean differences (raw or standardized), corre-
lations and squared correlations of all kinds, odds ratios,
kappas—whatever conveys the magnitude of the
phenomenon of interest appropriate to the research
context (p. 1310).
In a true sense, the null hypothesis is always false; given a large enough N, it will always be rejected. The major confusion that
exists in probability statements is that they are often
confused with the magnitude of the effect. If the result is
p < 0.05, that is good; if it is p < 0.01, that is even better; and if it is p < 0.001, that is really wonderful. The
mistake people make is to think that an effect that is ‘more
significant’ is, therefore, more meaningful. But p values are
easy to manipulate (e.g. get more participants, use more
homogeneous groups to lower variability). Getting a p < 0.05 for a study tells us nothing about how large the
difference between groups was (e.g. in t tests and ANOVA
designs) or how strong the measure of association was (e.g.
in correlation and regression designs).
Different effect sizes tell us different things. Cohen’s d
for independent means tells the difference between the
means of two groups in Z (standardized) terms. If, for
example the experimental group had higher scores on some
variable and the Cohen’s d was 1.0 (equal to one standard
deviation), then one could say that the mean of the
distribution of the experimental group moved one standard
deviation above the mean of the control group, representing
an increase of 34 percentile (not percentage!) points. This
change represents quite a large difference. By convention, in
the behavioural sciences, a Cohen’s d of 0.20 represents a
small effect; 0.50 represents a medium effect, and 0.80 or
above represents a large effect. For another example, the effect size η² (an equivalent of R²) represents the amount of variance in the dependent variable accounted for by group membership, and is used with ANOVA designs. The conventions in the behavioural sciences for η² are 0.01 for a small effect, 0.06 for a medium effect, and 0.14 for a large effect.
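As an illustration of these two indices, here is a short sketch with made-up data (the group means, SDs and ns are hypothetical, not taken from any study discussed here):

import numpy as np

rng = np.random.default_rng(1)
control = rng.normal(50, 10, 30)     # hypothetical pain scores, control group
treatment = rng.normal(42, 10, 30)   # hypothetical pain scores, treated group

# Cohen's d: difference between means in pooled-SD (standardized) units
n1, n2 = len(control), len(treatment)
pooled_sd = np.sqrt(((n1 - 1) * control.var(ddof=1) +
                     (n2 - 1) * treatment.var(ddof=1)) / (n1 + n2 - 2))
d = (control.mean() - treatment.mean()) / pooled_sd

# eta-squared: between-groups sum of squares over total sum of squares
scores = np.concatenate([control, treatment])
grand = scores.mean()
ss_between = (n1 * (control.mean() - grand) ** 2 +
              n2 * (treatment.mean() - grand) ** 2)
eta_sq = ss_between / ((scores - grand) ** 2).sum()

print(f"Cohen's d = {d:.2f}, eta-squared = {eta_sq:.2f}")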
Whether an effect is small, medium, or large is all
ultimately related to the variable one is examining. For
example, in elite sport, changes in times of half a percent
(probably a small effect) may translate into more podium
visit outcomes at competitions. As we emphasised earlier,
meaningful effects need to be determined by the researcher
prior to conducting the investigation, and should relate to
the population being investigated.
The American Psychological Association (APA 2001), in its publication manual, admonished authors of quantitative studies that the reporting of effect sizes is essentially
required for almost all quantitative designs. The manual
contains the following:
For the reader to fully understand the importance of your
findings, it is almost always necessary to include some
index of effect size or strength of relationship in your
Results section. You can estimate the magnitude of the
effect or the strength of the relationship with a number of
common effect size estimates, including (but not limited
to) r², η², ω², R², φ², Cramér's V, Kendall's W, Cohen's d and κ, Goodman–Kruskal's λ and γ, Jacobson and Truax's (1991) and Kendall's proposed measures of clinical significance, and the multivariate Roy's Θ and the Pillai–Bartlett V (pp. 25–26).
Similar publication policies exist for other journals
including Educational and Psychological Measurement
(Thompson 1994), Journal of Applied Psychology (Murphy
1997), Journal of Experimental Education (Heldref Foun-
dation 1997) and Measurement and Evaluation in Counsel-
ing and Development (1992). The instructions to
contributors in all these journals state that effect sizes are
required or strongly encouraged, in addition to p values.
Reporting effect sizes is only the first step in a complete
disclosure, and ultimately, an interpretation of statistical
results. Speed and Andersen (2000) discovered, after
examining two years' worth of the articles in the Journal
of Science and Medicine in Sport, that: (a) effect sizes were
rarely reported, and (b) when they were reported they were
not interpreted in any meaningful way. Similar results have
been obtained from a selection of APA journals published in
1995 (Kirk 1996), from two volumes of the Journal of
Experimental Education (Thompson & Snyder 1997) and
from 68 articles published between 1990 and 1996 in
Measurement and Evaluation in Counseling and Develop-
ment (Vacha-Haase & Nilsson 1998). The reporting of effect
size, like statistical significance, is relatively meaningless
without an interpretation of what this magnitude of change
or association means in a real world metric. For example,
Gage (1978) pointed out that the association between
smoking and lung cancer involves an η² of 0.02, meaning
that 2% of variance in lung cancer is accounted for by
smoking. From a statistical perspective this appears to be a
small effect. When the variables of note, however, are
projected onto a large population, small effects such as this
have meaningful and profound effects on public health.
An example of a relatively large effect in health and pharmacology comes from the clinical trials of gemfibrozil (a drug that lowers serum cholesterol) and its ability to prevent death from ischemic heart disease (IHD; see Kaplan 1990). In this
drug trial, participants received either the drug gemfibrozil
or a placebo over a six-year period. The number of patients
who died of IHD in the placebo and drug groups was
compared. Significantly more people in the placebo group
died of IHD with a relatively large effect size (26%
reduction in deaths from IHD). The conclusion that
gemfibrozil was effective in reducing deaths from IHD
seemed experimentally supported. The problem with this
research, though, is that when the ultimate dependent variable, mortality from all causes of death, was examined, it was revealed that the mortality rates of both groups
were essentially the same (three more people died in the
drug group than in the placebo group). The people taking
gemfibrozil died at approximately the same rates as the
placebo group; they just did not die of IHD as often. Here
we have a case of a treatment with a large effect that is
essentially ineffective in the most meaningful of dependent
variables, mortality. The people who took gemfibrozil were
no less likely to die, and a drug was given to them with side
effects (e.g. cholecystitis, diarrhoea), possibly decreasing
quality of life. Gemfibrozil is still on the market.
1.3. A powerless state of affairs
Effect sizes are intimately connected with the question of
power (the ability to reject the null hypothesis) and
statistical significance. Sufficient sample sizes are needed
to detect statistically significant changes in dependent
variables, even when the changes represent large and
meaningful effects. Researchers who have investigated the
statistical power of studies published in the biomedical and
exercise sciences have suggested that research in these fields
may contain substantial Type II errors because the research
designs lack the power to detect medium or small effects
(Christensen & Christensen 1977; Jones & Brewer 1972;
Reed & Slaichert 1981; Speed & Andersen 2000). Studies
reporting similar results have been conducted in the
psychology (Cohen 1962; Sedlmeier & Gigerenzer 1989),
speech pathology (Kroll & Chase 1975), education (Brewer
1972) and sport psychology (Speed & Andersen 1997)
disciplines. The APA (2001) publication manual also
contains a strong statement about demonstrating power.
…you should routinely provide evidence that your study
has sufficient power to detect effects of substantive
interest (e.g. see Cohen 1988). You should be similarly
aware of the role played by sample sizes and cases in
which not rejecting the null hypothesis is desirable (i.e.
when you wish to argue there are no differences), when
testing various assumptions underlying the statistical
model adopted (e.g. normality, homogeneity of variance,
homogeneity of regression), and in model fitting…(p. 24)
Researchers should routinely refer to previous research
to estimate the sample size required for an 80% chance of
detecting a small, medium or large (depending on the
variable) effect at the desired α-level (a power of 0.80). If
there is insufficient research published in a particular area,
Cohen (1988) provides power tables for a variety of
experimental designs and statistical analyses that estimate
the sample sizes required for a given power and a
hypothesised effect size. This emphasis on power and
power analyses, however, is a direct result of the ubiquity of
using p < 0.05 in evaluating the results of research.
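A minimal sketch of such an a priori power analysis, using the statsmodels library as one possible tool (Cohen's 1988 tables give the same answer), solving for the per-group sample size needed to detect a medium effect (d = 0.50) with 80% power at α = 0.05:

from statsmodels.stats.power import TTestIndPower

# Solve for the per-group n of an independent-groups t test
n_per_group = TTestIndPower().solve_power(effect_size=0.50,
                                          alpha=0.05, power=0.80)
print(f"n per group: {n_per_group:.0f}")   # roughly 64 per group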
Powerless research designs have a number of ramifica-
tions. First, it is usual in between-groups experimental
designs (e.g. experimental versus control group designs
commonly found in the medical and exercise sciences) to
test statistically whether the two groups are equivalent on
some measures prior to conducting an intervention.
Equivalence is often determined by examining whether
two groups are significantly different on the variables in
question. With small Ns, however, the magnitude of
difference between the two groups, in order to be
significantly different, would have to be large, or the
variability within these groups small. The upshot of this
common situation is that saying two groups are not
‘significantly different’ does not mean that they are the
same. With small sample sizes, it is probable that
researchers who claim to have ‘equal’ groups have nothing
of the sort. Another concern arising from low power, based
on the same logic, is the likelihood of Type II errors when
reporting the results of research (e.g. the treatment is not
effective when it actually is). Saying there are ‘no
differences’ following an intervention (because p was
greater than 0.05), when there are differences, leads to
questionable conclusions that might undervalue the utility
of a worthwhile treatment. Finally, some studies may never
make it to publication because of low power and a lack of
statistically significant results. A related issue is that a whole
generation of journal editors and reviewers is biased against accepting papers that report non-significant
results. Not publishing studies that report non-significant
results is potentially doing a disservice in many research
fields. Cohen (1962) pointed this issue out 40 years ago, but
the message remains largely unheard. ‘A generation of
researchers could be suitably employed in repeating
interesting studies which originally used inadequate sample
sizes. Unfortunately, the ones most deserving such rep-
etition are least likely to have appeared in print’ (p. 153).
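A small Monte Carlo sketch (hypothetical parameters) makes the point about Type II errors directly, estimating how often a real medium effect (d = 0.50) fails to reach p < 0.05 when only 15 participants per group are tested:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, d, trials = 15, 0.5, 10_000
misses = 0
for _ in range(trials):
    a = rng.normal(0.0, 1.0, n)   # control group, standardized scores
    b = rng.normal(d, 1.0, n)     # treatment group with a true effect of d
    if stats.ttest_ind(a, b).pvalue >= 0.05:
        misses += 1               # failed to reject despite a real effect

print(f"Type II rate at n = {n}: {misses / trials:.0%}")   # roughly 75%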
2. Conclusion
Rather than defenestrate significance testing, as some
would suggest (e.g. Cohen 1994; Gigerenzer 1993), we
argue that researchers need to report and interpret effect
sizes and p values (cf. Andersen & Stoove 1998; Thomas
et al. 1991). Of perhaps more significance in research where outcomes relate to the health status or functionality of clients or patients is choosing variables that are important
in determining well-being and quality of life, and making a
priori decisions about what constitutes meaningful changes
in these variables. For this purpose, the determination of MMB
and MMH prior to conducting research may be useful when
interpreting results.
Acknowledgements
The authors would like to thank Will Hopkins for
enlightening this paper and the sports sciences in general.
The authors would also like to thank Herbert Badgery and
Eddie Wysbraum for their valued suggestions and inspired
commentary.
References
Andersen, M.B., Stoove, M.A., 1998. The sanctity of p < 0.05 obfuscates good stuff: a comment on Kerr and Goss. Journal of Applied Sport
Psychology 10, 168–173.
American Psychological Association, 2001. Publication Manual, 5th edn.,
American Psychological Association, Washington, DC.
Bennell, K., Bartam, S., Crossley, K., Green, S., 2000. Outcome measures
in patellofemoral pain syndrome: test–retest reliability and inter-
relationships. Physical Therapy in Sport 1, 32–41.
Brewer, J.K., 1972. On the power of statistical tests in the American
Educational Research Journal. American Educational Research Journal
9, 391–401.
Christensen, J.E., Christensen, C.E., 1977. Statistical power analysis of
health, physical education, and recreation research. Research Quarterly
48, 204–208.
Cohen, J., 1962. The statistical power of abnormal-social psychology
research. Journal of Abnormal and Social Psychology 65, 145–153.
Cohen, J., 1988. Statistical Power Analysis for the Behavioural Sciences,
2nd edn., Erlbaum, Hillsdale, NJ.
Cohen, J., 1990. Things I have learned so far. American Psychologist 45
(12), 1304–1312.
Cohen, J., 1994. The earth is round (p < 0.05). American Psychologist 49,
997–1003.
Fleming, C., Wasson, J.H., Albertson, P.C., Barry, M.J., Wennberg, J.E.,
1993. A decision analysis of alternative treatment strategies for
clinically localized prostate cancer. Journal of the American Medical
Association 269, 2650–2658.
Gage, N.L., 1978. The Scientific Basis of the Art of Teaching, Teachers
College Press, New York.
Gigerenzer, G., 1993. The superego, the ego, and the id in statistical
reasoning. In: Keren, G., Lewis, C. (Eds.), A Handbook for Data
Analysis in the Behavioural Sciences: Methodological Issues, Erlbaum,
Hillsdale, NJ, pp. 311–339.
Heldref Foundation, 1997. Guidelines for contributors. Journal of
Experimental Education 65, 95–96.
Hopkins, W.G., 2001. Clinical vs statistical significance. Sportscience 5,
1–2.
Hopkins, W.G., 2002. Probabilities of clinical and practical significance.
Sportscience 6, 1–2.
Huck, S.W., McClean, R.A., 1975. Using a repeated measures ANOVA to
analyze the data from a pretest–posttest design: a potentially confusing
task. Psychological Bulletin 82, 511–518.
Jones, B.J., Brewer, J.K., 1972. An analysis of the power of statistical
tests reported in The Research Quarterly. The Research Quarterly 43,
23–30.
Kaplan, R.M., 1990. Behavior as the central outcome in health care.
American Psychologist 45, 1211–1220.
Kaplan, R.M., 1994. The Ziggy theorem: toward an outcome-focused
health psychology. Health Psychology 13, 451–460.
Kirk, R.E., 1996. Practical significance: a concept whose time has come.
Educational and Psychological Measurement 56, 746–759.
Kroll, R.M., Chase, L.J., 1975. Communication disorders: a power analytic
assessment of recent research. Journal of Communication Disorders 8,
237–247.
Measurement and Evaluation in Counseling and Development, 1992.
Measurement and Evaluation in Counseling and Development 25, 143.
Murphy, K.R., 1997. Editorial. Journal of Applied Psychology 82, 3–5.
Reed, J.F. III, Slaichert, W., 1981. Statistical proof in inconclusive negative
trials. Archives of Internal Medicine 141, 1307–1310.
Sedlmeier, P., Gigerenzer, G., 1989. Do studies of statistical power have an
effect on the power of studies? Psychological Bulletin 105, 309–316.
Speed, H.D., Andersen, M.B., 1997. Powerless in the face of small effects:
power in sport psychology research (Abstract). Journal of Applied Sport
Psychology 9 (Suppl.), S158.
Speed, H.D., Andersen, M.B., 2000. What exercise and sport scientists
don’t understand. Journal of Science and Medicine in Sport 3 (1),
84–92.
Thomas, J.R., Salazar, W., Landers, D.M., 1991. What is missing in
p < 0.05? Effect size. Research Quarterly for Exercise and Sport 62,
344–351.
Thompson, B., 1994. Guidelines for authors. Educational and Psychologi-
cal Measurement 54, 837–847.
Thompson, B., Snyder, P.A., 1997. Statistical significance testing practices
in The Journal of Experimental Education. The Journal of Experimental
Education 66, 75–83.
Vacha-Haase, T., Nilsson, J.E., 1998. Statistical significance reporting:
current trends and uses in MECD. Measurement and Evaluation in
Counseling and Development 31, 46–57.