Research without Tears
What are we looking at, and how big is it?
Mark A. Stoove a, Mark B. Andersen b
a School of Health Sciences, Deakin University, Melbourne, Australia
b School of Human Movement, Recreation and Performance, Victoria University, Melbourne, Australia
Abstract
Some of the most important outcomes of physical therapy treatment have to do with behaviour and quality of life. This article involves
examining what it is we are measuring in physical therapy research and what those measurements mean. In looking at differences between groups (e.g. placebo-control) or the strength of association between variables (e.g. correlation, regression), the practitioner/researcher must consider what magnitudes of effects are meaningful. Depending on the variable that one measures, a medium effect size (e.g. Cohen's d = 0.50) may, in the real world, be insignificant, or, in the case of elite athletic performance, such an effect size might be gigantic. A major
problem in the sports sciences is the confusion of p values and significance testing with the results of interest, the magnitudes of effects. Also,
the prevalence of possible Type II errors in the sports sciences and medicine may be quite high in light of the small sample sizes and the
paucity of power analyses for non-significant results. We make an appeal for determining a priori minimal meaningful differences (or
associations) to use as the primary metrics in discussing results.
Crown Copyright © 2003 Published by Elsevier Science Ltd. All rights reserved.
Keywords: Statistical inference; Meaningful difference; Effect size; Power
1. What are we looking at, and how big is it?
In Kaplan’s (1994) article on outcomes in health
research, he used the comic strip Ziggy to present a
fundamental principle regarding the careful choice of
dependent variables in intervention outcome research. In
the comic strip, Ziggy climbs a mountain to ask a Guru,
‘What is the meaning of life?’ The Guru responds with, ‘Ah
yes… the meaning of life, my boy, is doin’ stuff!!’ Ziggy
questions the Guru, ‘Life is doin’ stuff?… that’s it?’ The
Guru responds, ‘As opposed to death, which is NOT doin’
stuff!’ Ziggy walks back down the mountain musing, ‘It’s a
more elementary theory than I had expected, but one you
can’t argue with’. The point Kaplan was making was that
the important and meaningful issue in health research is
doing stuff. Is the patient or client, after some intervention,
medication or surgery, able to do more or do something
more efficiently or function better than before the interven-
tion? Improved scores on paper and pencil tests (e.g.
measures of anxiety or depression) or physiological
parameters (e.g. serum cortisol) are of interest only if they
are related to behavioural, functional or quality of life
issues. That is, ‘doin’ stuff’.
1.1. What’s in it for the client?—choosing your dependent
variables wisely
What Kaplan (1994) proposed applies to both qualitative
and quantitative questions and methods. In this article,
however, we will be looking at the choice of quantitative
dependent variables and the issues of interpreting measure-
ments, statistical results and intervention outcomes in terms
of what is meaningful to the clients or patients. For example,
in the case of lateral epicondylitis one may measure the
amount of pain on resisted wrist extension using any one of
several pain scales. After a course of cryotherapy, muscle
stretching and soft tissue massage, does the lateral epicondylitis treatment group report lower scores on the
pain scale compared to a control group who received no
treatment? What is meaningful to the client is whether they
experience less pain when a treatment is provided,
potentially giving the client better functionality and
allowing them to do more. This research design of pre-
and post-testing with an experimental and control group is
common in medicine and health research. And the data from
such experiments are usually analysed incorrectly, but that
is a whole other story (see Huck & McClean 1975).
The dependent variable in the study described above was a
score on a paper and pencil test of pain perception. Such
scores mean very little by themselves because they are not
calibrated against the variables of interest, and those have to
do with function, behaviour and quality of life. For example,
what reduction in scores on a pain scale is needed for clients
to perceive an improvement in their functional ability
during day-to-day activities? So in choosing dependent
variables, one needs to choose one (or more) that has
intimate connections to real life variables.
In a recent article, in Physical Therapy in Sport, Bennell
et al. (2000) discussed this issue when assessing the
psychometric properties of self-report questionnaires for
patellofemoral pain syndrome. The authors reported that a
disadvantage of these questionnaires is that one does not
know the change scores that are indicative of a clinically
significant difference in pain. Although the sensitivity of
instruments to detect statistically significant changes over
time has been investigated for some of the instruments, the
authors go on to recommend that, ‘…the size of change
needed for a clinically significant result should also be
investigated’ (p. 40). We propose that clinical significance
needs to include what are meaningful behavioural changes
for the patient. For example, with patients or clients who are
elderly and have undergone joint replacement surgery (e.g.
hip arthroplasty), the dependent variables of interest in
medicine and physiotherapy may be things like range of
motion, flexibility, joint stability and strength. The hands-on
physical therapy and home exercises chosen to affect these
dependent variables, however, might be time consuming
and painful. Is the object of physical therapy the rehabilita-
tion of that surgically replaced joint and the muscles and
tendons surrounding it, or is the object of research and
practice the health, welfare and quality of life of the owner
of that joint? Is some sacrifice of range of motion or strength
an acceptable outcome if it means that the patient can forego
a painful and possibly lengthy rehabilitation? What degree
of functionality does the patient require for maintaining or
improving quality of life? How the practitioner and
researcher answer these questions determines the dependent
variables of choice. A holistic research picture of the
practice of physical therapy will likely include psychologi-
cal, physical and quality of life dependent measures.
That a condition can be treated does not necessarily mean it should be treated. Kaplan (1994) uses an example of
the effect of three treatment options for prostate cancer on
quality-adjusted life expectancy (QALE). He draws upon
the work of Fleming et al. (1993) to compare the efficacy of
the traditional approach to cancer treatment, targeting the
tumour with radiation therapy or removing the prostate
gland, with a watchful waiting option. The results showed
that QALE scores for the three treatment options were
equivalent for groups of men aged between 60 and 75 years.
The fundamental difference between these treatment
alternatives, however, is that the traditional approaches
carry high risks of complications that could reduce quality
of life (e.g. impotence, incontinence), whereas watchful
waiting simply involves evaluation and supervision by a
physician. Another difference to consider is that the
traditional approaches focus on treating the cancer and the
other focuses on treating the patient. The choice of what to
measure and what is a meaningful change following an
intervention is a value driven decision that cannot be made
without input from the client and should ultimately be a
matter of patient preference.
Once one has chosen the dependent variable (or
variables), the researcher needs to take things one step
further and establish what would be the minimal meaningful
benefit (MMB) and the minimal meaningful harm (MMH;
Hopkins 2001, 2002). Determination of the MMB and
MMH comes from knowledge of the field, knowledge of the
variables and knowledge of the potential changes possible
with the specific population one is studying.
To illustrate how MMB and MMH can be helpful in
interpreting the data of the dependent variable, let us take an
intervention that we believe will be meaningful only if there
is at least a 2% change from pre to post, and the MMH
would be anything less than 0% change (i.e. people get
worse). For example, let us say we deliver some treatment to
a group, and for the change scores pre to post we end up with
a mean of 2% with a standard deviation of 1%. As illustrated
in Fig. 1, given the outcome, there is a 50% probability that
the intervention will meet or exceed the MMB. There is a
48% chance that the intervention will produce a trivial difference (0 to −2.0 SD), and a 2% chance that the intervention will be harmful (< −2.0 SD). Although this
result means that it is a coin toss as to whether the
intervention will benefit the client, the results clearly show
that the intervention is highly unlikely to do harm.
Reflecting on the variable and the population in question,
it is ultimately up to the practitioner/researcher to determine
whether the treatment is worthwhile. For a review of the
concept of MMB and MMH and more discussion on clinical
versus statistical significance refer to Hopkins (2001, 2002).
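To make these figures concrete, the following is a minimal sketch, in Python, of how such probabilities can be computed, assuming change scores are approximately normally distributed (scipy's normal distribution supplies the cumulative probabilities; the thresholds and summary statistics are those of the example above):

from scipy.stats import norm

# Summary of observed change scores: mean 2% improvement, SD 1%
mean_change, sd_change = 2.0, 1.0

# A priori thresholds: MMB at +2% change, MMH at 0% change
mmb, mmh = 2.0, 0.0

# Probability the effect meets or exceeds the MMB
p_benefit = 1 - norm.cdf(mmb, loc=mean_change, scale=sd_change)   # ~0.50
# Probability the effect falls below the MMH (harm)
p_harm = norm.cdf(mmh, loc=mean_change, scale=sd_change)          # ~0.02
# Whatever remains is a trivial change (between MMH and MMB)
p_trivial = 1 - p_benefit - p_harm                                # ~0.48

print(f"benefit {p_benefit:.2f}, trivial {p_trivial:.2f}, harm {p_harm:.2f}")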
Fig. 1. Minimum meaningful harm, trivial change and minimal meaningful
benefit.
1.2. Size does matter (but only in context)
Cohen (1990) wrote that the meaningfulness of
statistical results has little to do with whether p is less
than 0.05. He stated:
The primary product of research is one or more measures
of effect size, not p values …Effect size measures
include mean differences (raw or standardized), corre-
lations and squared correlations of all kinds, odds ratios,
kappas—whatever conveys the magnitude of the
phenomenon of interest appropriate to the research
context (p. 1310).
In a true sense, the null hypothesis is always false; given a large enough N, it will always be rejected. The major confusion that
exists in probability statements is that they are often
confused with the magnitude of the effect. If the result is
p < 0.05, that is good; if it is p < 0.01, that is even better; and if it is p < 0.001, that is really wonderful. The
mistake people make is to think that an effect that is ‘more
significant’ is, therefore, more meaningful. But p values are
easy to manipulate (e.g. get more participants, use more
homogeneous groups to lower variability). Getting a p < 0.05 for a study tells us nothing about how large the
difference between groups was (e.g. in t tests and ANOVA
designs) or how strong the measure of association was (e.g.
in correlation and regression designs).
Different effect sizes tell us different things. Cohen’s d
for independent means tells the difference between the
means of two groups in Z (standardized) terms. If, for
example the experimental group had higher scores on some
variable and the Cohen’s d was 1.0 (equal to one standard
deviation), then one could say that the mean of the
distribution of the experimental group moved one standard
deviation above the mean of the control group, representing
an increase of 34 percentile (not percentage!) points. This
change represents quite a large difference. By convention, in
the behavioural sciences, a Cohen’s d of 0.20 represents a
small effect; 0.50 represents a medium effect, and 0.80 or
above represents a large effect. For another example, the effect size η² (an equivalent of R²) represents the amount of variance in the dependent variable accounted for by group membership, and is used with ANOVA designs. The conventions in the behavioural sciences for η² are 0.01 for a small effect, 0.06 for a medium effect, and 0.14 for a large effect.
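As an illustration of these two indices, here is a short sketch with made-up data (the group means, SDs and ns are hypothetical, not taken from any study discussed here):

import numpy as np

rng = np.random.default_rng(1)
control = rng.normal(50, 10, 30)     # hypothetical pain scores, control group
treatment = rng.normal(42, 10, 30)   # hypothetical pain scores, treated group

# Cohen's d: difference between means in pooled-SD (standardized) units
n1, n2 = len(control), len(treatment)
pooled_sd = np.sqrt(((n1 - 1) * control.var(ddof=1) +
                     (n2 - 1) * treatment.var(ddof=1)) / (n1 + n2 - 2))
d = (control.mean() - treatment.mean()) / pooled_sd

# eta-squared: between-groups sum of squares over total sum of squares
scores = np.concatenate([control, treatment])
grand = scores.mean()
ss_between = (n1 * (control.mean() - grand) ** 2 +
              n2 * (treatment.mean() - grand) ** 2)
eta_sq = ss_between / ((scores - grand) ** 2).sum()

print(f"Cohen's d = {d:.2f}, eta-squared = {eta_sq:.2f}")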
Whether an effect is small, medium, or large is all
ultimately related to the variable one is examining. For
example, in elite sport, changes in times of half a percent
(probably a small effect) may translate into more podium
visit outcomes at competitions. As we emphasised earlier,
meaningful effects need to be determined by the researcher
prior to conducting the investigation, and should relate to
the population being investigated.
The American Psychological Association (APA 2001), in its publication manual, admonished authors of quantitative studies that the reporting of effect sizes is essentially
required for almost all quantitative designs. The manual
contains the following:
For the reader to fully understand the importance of your
findings, it is almost always necessary to include some
index of effect size or strength of relationship in your
Results section. You can estimate the magnitude of the
effect or the strength of the relationship with a number of
common effect size estimates, including (but not limited
to) r², η², ω², R², φ², Cramér's V, Kendall's W, Cohen's d and κ, Goodman–Kruskal's λ and γ, Jacobson and Truax's (1991) and Kendall's proposed measures of clinical significance, and the multivariate Roy's Θ and the Pillai–Bartlett V (pp. 25–26).
Similar publication policies exist for other journals
including Educational and Psychological Measurement
(Thompson 1994), Journal of Applied Psychology (Murphy
1997), Journal of Experimental Education (Heldref Foun-
dation 1997) and Measurement and Evaluation in Counsel-
ing and Development (1992). The instructions to
contributors in all these journals state that effect sizes are
required or strongly encouraged, in addition to p values.
Reporting effect sizes is only the first step in a complete
disclosure, and ultimately, an interpretation of statistical
results. Speed and Andersen (2000) discovered, after
examining two years' worth of the articles in the Journal
of Science and Medicine in Sport, that: (a) effect sizes were
rarely reported, and (b) when they were reported they were
not interpreted in any meaningful way. Similar results have
been obtained from a selection of APA journals published in
1995 (Kirk 1996), from two volumes of the Journal of
Experimental Education (Thompson & Snyder 1997) and
from 68 articles published between 1990 and 1996 in
Measurement and Evaluation in Counseling and Develop-
ment (Vacha-Haase & Nilsson 1998). The reporting of effect
size, like statistical significance, is relatively meaningless
without an interpretation of what this magnitude of change
or association means in a real world metric. For example,
Gage (1978) pointed out that the association between
smoking and lung cancer involves an η² of 0.02, meaning
that 2% of variance in lung cancer is accounted for by
smoking. From a statistical perspective this appears to be a
small effect. When the variables of note, however, are
projected onto a large population, small effects such as this
have meaningful and profound effects on public health.
An example of a relatively large effect in health and pharmacology comes from the clinical trials of gemfibrozil (a drug that lowers serum cholesterol) and its ability to prevent death from ischemic heart disease (IHD; see Kaplan 1990). In this
drug trial, participants received either the drug gemfibrozil
or a placebo over a six-year period. The number of patients
who died of IHD in the placebo and drug groups was
compared. Significantly more people in the placebo group
died of IHD with a relatively large effect size (26%
reduction in deaths from IHD). The conclusion that
gemfibrozil was effective in reducing deaths from IHD
seemed experimentally supported. The problem with this
research, though, is that when the ultimate dependent variable, mortality from all causes of death, was examined, it was revealed that the mortality rates of both groups
were essentially the same (three more people died in the
drug group than in the placebo group). The people taking
gemfibrozil died at approximately the same rates as the
placebo group; they just did not die of IHD as often. Here
we have a case of a treatment with a large effect that is
essentially ineffective in the most meaningful of dependent
variables, mortality. The people who took gemfibrozil were
no less likely to die, and a drug was given to them with side
effects (e.g. cholecystitis, diarrhoea), possibly decreasing
quality of life. Gemfibrozil is still on the market.
1.3. A powerless state of affairs
Effect sizes are intimately connected with the question of
power (the ability to reject the null hypothesis) and
statistical significance. Sufficient sample sizes are needed
to detect statistically significant changes in dependent
variables, even when the changes represent large and
meaningful effects. Researchers who have investigated the
statistical power of studies published in the biomedical and
exercise sciences have suggested that research in these fields
may contain substantial Type II errors because the research
designs lack the power to detect medium or small effects
(Christensen & Christensen 1977; Jones & Brewer 1972;
Reed & Slaichert 1981; Speed & Andersen 2000). Studies
reporting similar results have been conducted in the
psychology (Cohen 1962; Sedlmeier & Gigerenzer 1989),
speech pathology (Kroll & Chase 1975), education (Brewer
1972) and sport psychology (Speed & Andersen 1997)
disciplines. The APA (2001) publication manual also
contains a strong statement about demonstrating power.
…you should routinely provide evidence that your study
has sufficient power to detect effects of substantive
interest (e.g. see Cohen 1988). You should be similarly
aware of the role played by sample sizes and cases in
which not rejecting the null hypothesis is desirable (i.e.
when you wish to argue there are no differences), when
testing various assumptions underlying the statistical
model adopted (e.g. normality, homogeneity of variance,
homogeneity of regression), and in model fitting…(p. 24)
Researchers should routinely refer to previous research
to estimate the sample size required for an 80% chance of
detecting a small, medium or large (depending on the
variable) effect at the desired α-level (a power of 0.80). If
there is insufficient research published in a particular area,
Cohen (1988) provides power tables for a variety of
experimental designs and statistical analyses that estimate
the sample sizes required for a given power and a
hypothesised effect size. This emphasis on power and
power analyses, however, is a direct result of the ubiquity of
using p < 0.05 in evaluating the results of research.
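A minimal sketch of such an a priori power analysis, using the statsmodels library as one possible tool (Cohen's 1988 tables give the same answer), solving for the per-group sample size needed to detect a medium effect (d = 0.50) with 80% power at α = 0.05:

from statsmodels.stats.power import TTestIndPower

# Solve for the per-group n of an independent-groups t test
n_per_group = TTestIndPower().solve_power(effect_size=0.50,
                                          alpha=0.05, power=0.80)
print(f"n per group: {n_per_group:.0f}")   # roughly 64 per group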
Powerless research designs have a number of ramifica-
tions. First, it is usual in between-groups experimental
designs (e.g. experimental versus control group designs
commonly found in the medical and exercise sciences) to
test statistically whether the two groups are equivalent on
some measures prior to conducting an intervention.
Equivalence is often determined by examining whether
two groups are significantly different on the variables in
question. With small Ns, however, the magnitude of
difference between the two groups, in order to be
significantly different, would have to be large, or the
variability within these groups small. The upshot of this
common situation is that saying two groups are not
‘significantly different’ does not mean that they are the
same. With small sample sizes, it is probable that
researchers who claim to have ‘equal’ groups have nothing
of the sort. Another concern arising from low power, based
on the same logic, is the likelihood of Type II errors when
reporting the results of research (e.g. the treatment is not
effective when it actually is). Saying there are ‘no
differences’ following an intervention (because p was
greater than 0.05), when there are differences, leads to
questionable conclusions that might undervalue the utility
of a worthwhile treatment. Finally, some studies may never
make it to publication because of low power and a lack of
statistically significant results. A related issue is that a whole
generation of journal editors and reviewers is biased against accepting papers that report non-significant
results. Not publishing studies that report non-significant
results is potentially doing a disservice in many research
fields. Cohen (1962) pointed this issue out 40 years ago, but
the message remains largely unheard. ‘A generation of
researchers could be suitably employed in repeating
interesting studies which originally used inadequate sample
sizes. Unfortunately, the ones most deserving such rep-
etition are least likely to have appeared in print’ (p. 153).
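A small Monte Carlo sketch (hypothetical parameters) makes the point about Type II errors directly, estimating how often a real medium effect (d = 0.50) fails to reach p < 0.05 when only 15 participants per group are tested:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, d, trials = 15, 0.5, 10_000
misses = 0
for _ in range(trials):
    a = rng.normal(0.0, 1.0, n)   # control group, standardized scores
    b = rng.normal(d, 1.0, n)     # treatment group with a true effect of d
    if stats.ttest_ind(a, b).pvalue >= 0.05:
        misses += 1               # failed to reject despite a real effect

print(f"Type II rate at n = {n}: {misses / trials:.0%}")   # roughly 75%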
2. Conclusion
Rather than defenestrate significance testing, as some
would suggest (e.g. Cohen 1994; Gigerenzer 1993), we
argue that researchers need to report and interpret effect
sizes and p values (cf. Andersen & Stoove 1998; Thomas
et al. 1991). Of perhaps more significance in research where outcomes relate to the health status or functionality of clients or patients is choosing variables that are important
in determining well-being and quality of life, and making a
priori decisions about what constitutes meaningful changes
in these variables. For this purpose, the determination of MMB
and MMH prior to conducting research may be useful when
interpreting results.
Acknowledgements
The authors would like to thank Will Hopkins for
enlightening this paper and the sports sciences in general.
The authors would also like to thank Herbert Badgery and
Eddie Wysbraum for their valued suggestions and inspired
commentary.
References
Andersen, M.B., Stoove, M.A., 1998. The sanctity of p < 0.05 obfuscates good stuff: a comment on Kerr and Goss. Journal of Applied Sport
Psychology 10, 168–173.
American Psychological Association, 2001. Publication Manual, 5th edn.,
American Psychological Association, Washington, DC.
Bennell, K., Bartam, S., Crossley, K., Green, S., 2000. Outcome measures
in patellofemoral pain syndrome: test–retest reliability and inter-
relationships. Physical Therapy in Sport 1, 32–41.
Brewer, J.K., 1972. On the power of statistical tests in the American
Educational Research Journal. American Educational Research Journal
9, 391–401.
Christensen, J.E., Christensen, C.E., 1977. Statistical power analysis of
health, physical education, and recreation research. Research Quarterly
48, 204–208.
Cohen, J., 1962. The statistical power of abnormal-social psychology
research. Journal of Abnormal and Social Psychology 65, 145–153.
Cohen, J., 1988. Statistical Power Analysis for the Behavioural Sciences,
2nd edn., Erlbaum, Hillsdale, NJ.
Cohen, J., 1990. Things I have learned so far. American Psychologist 45
(12), 1304–1312.
Cohen, J., 1994. The earth is round (p < 0.05). American Psychologist 49,
997–1003.
Fleming, C., Wasson, J.H., Albertson, P.C., Barry, M.J., Wennberg, J.E.,
1993. A decision analysis of alternative treatment strategies for
clinically localized prostate cancer. Journal of the American Medical
Association 269, 2650–2658.
Gage, N.L., 1978. The Scientific Basis of the Art of Teaching, Teachers
College Press, New York.
Gigerenzer, G., 1993. The superego, the ego, and the id in statistical
reasoning. In: Keren, G., Lewis, C. (Eds.), A Handbook for Data
Analysis in the Behavioural Sciences: Methodological Issues, Erlbaum,
Hillsdale, NJ, pp. 311–339.
Heldref Foundation, 1997. Guidelines for contributors. Journal of
Experimental Education 65, 95–96.
Hopkins, W.G., 2001. Clinical vs statistical significance. Sportscience 5,
1–2.
Hopkins, W.G., 2002. Probabilities of clinical and practical significance.
Sportscience 6, 1–2.
Huck, S.W., McClean, R.A., 1975. Using a repeated measures ANOVA to
analyze the data from a pretest–posttest design: a potentially confusing
task. Psychological Bulletin 82, 511–518.
Jones, B.J., Brewer, J.K., 1972. An analysis of the power of statistical
tests reported in The Research Quarterly. The Research Quarterly 43,
23–30.
Kaplan, R.M., 1990. Behavior as the central outcome in health care.
American Psychologist 45, 1211–1220.
Kaplan, R.M., 1994. The Ziggy theorem: toward an outcome-focused
health psychology. Health Psychology 13, 451–460.
Kirk, R.E., 1996. Practical significance: a concept whose time has come.
Educational and Psychological Measurement 56, 746–759.
Kroll, R.M., Chase, L.J., 1975. Communication disorders: a power analytic
assessment of recent research. Journal of Communication Disorders 8,
237–247.
Measurement and Evaluation in Counseling and Development, 1992.
Measurement and Evaluation in Counseling and Development 25, 143.
Murphy, K.R., 1997. Editorial. Journal of Applied Psychology 82, 3–5.
Reed, J.F. III, Slaichert, W., 1981. Statistical proof in inconclusive negative
trials. Archives of Internal Medicine 141, 1307–1310.
Sedlmeier, P., Gigerenzer, G., 1989. Do studies of statistical power have an
effect on the power of studies? Psychological Bulletin 105, 309–316.
Speed, H.D., Andersen, M.B., 1997. Powerless in the face of small effects:
power in sport psychology research (Abstract). Journal of Applied Sport
Psychology 9 (Suppl.), S158.
Speed, H.D., Andersen, M.B., 2000. What exercise and sport scientists
don’t understand. Journal of Science and Medicine in Sport 3 (1),
84–92.
Thomas, J.R., Salazar, W., Landers, D.M., 1991. What is missing in
p < 0.05? Effect size. Research Quarterly for Exercise and Sport 62,
344–351.
Thompson, B., 1994. Guidelines for authors. Educational and Psychologi-
cal Measurement 54, 837–847.
Thompson, B., Snyder, P.A., 1997. Statistical significance testing practices
in The Journal of Experimental Education. The Journal of Experimental
Education 66, 75–83.
Vacha-Haase, T., Nilsson, J.E., 1998. Statistical significance reporting:
current trends and uses in MECD. Measurement and Evaluation in
Counseling and Development 31, 46–57.