Do we know what we test and do we test what we want to know?
Irene Klugkist, Floryt van Wesel and Jessie Bullens
International Journal of Behavioral Development, 2011, 35: 550
DOI: 10.1177/0165025411425873
The online version of this article can be found at: http://jbd.sagepub.com/content/35/6/550
Journal home page: http://jbd.sagepub.com/
Published by:
http://www.sagepublications.com
On behalf of:
International Society for the Study of Behavioral Development
Version of Record: Nov 23, 2011
Downloaded from jbd.sagepub.com at Aston University - FAST on May 1, 2014
Methods and measures
Do we know what we test and do we test what we want to know?
Irene Klugkist,1 Floryt van Wesel,2 and Jessie Bullens3
Abstract
Null hypothesis testing (NHT) is the most commonly used tool in empirical psychological research even though it has several known limitations. It is argued that since the hypotheses evaluated with NHT do not reflect the research question or theory of the researchers, conclusions from NHT must be formulated with great modesty, that is, they cannot be stated in a confirmative way. Since confirmation or theory evaluation is, however, what researchers often aim for, we present an alternative approach that is based on the specification of explicit, informative statistical hypotheses. The statistical approach for the evaluation of these hypotheses is a Bayesian model-selection procedure. A non-technical explanation of the Bayesian approach is provided and it will be shown that results obtained with this method give more direct answers to the questions asked and are easier to interpret. An additional advantage of the offered possibility to formulate and evaluate informative hypotheses is that it stimulates researchers to more carefully think through and specify their expectations.
Introduction
There is a long tradition of conducting empirical studies in the field
of psychology in general, and in related sub-disciplines (e.g.,
developmental psychology) more specifically. In contrast to the early
philosophical approach in which research was conducted in a
rationalistic way, empirical research is evidence-based and consists
of a number of subsequent stages. These stages roughly involve a)
establishing a research topic by collecting and organizing existing
empirical facts, b) formulating expectations, hypotheses and/or
research questions, c) designing the research and collecting data,
d) analyzing the data, and e) evaluating the results with respect to
the formulated hypotheses and expectations.
In this paper we argue that a non-trivial part of psychological (and
other) research is rather explicit in the first three stages (i.e., thorough
literature search, clearly formulated expectations, and a well thought-
through research design), but then a procedure follows in the fourth
stage that is not well-suited for such explicit designs and hypotheses.
Consequently, the last stage, where the results are linked to the
hypotheses stated at the start, is not straightforward. We will explain
and illustrate this with a study by Lee (2009) about aggression in
young children. Note that it is not at all our intention to criticize this
particular article; the research described in the paper just serves as a
typical example of the, in our opinion, almost standard approach of
data analysis and subsequent drawing of conclusions.
As is typical in empirical research, Lee starts with an overview
of the literature and previous findings on the topic of interest in the
introduction of the article. Although Lee’s research is more elabo-
rate, here we will focus only on the topic where the relation between
gender, sociometric status (preferred, rejected, neglected, and
controversial) and bullying in fifth-grade children is discussed. The
existing literature is used to guide the reader towards the
expectations and hypotheses stated at the end of the introduction. One
of these expectations is, for instance, that the amount of bullying is
especially high in boys with reject status while for girls the highest
amount of bullying is expected for the controversial status group
(Lee, 2009, p. 324).
To investigate this hypothesis (and others that are not discussed
here), perceived bullying and peer acceptance and rejection among
fifth-grade boys and girls were assessed. The peer acceptance and
rejection scores were used to assign children to one of the four
sociometric status groups. After the data were collected, analyses
were performed. Among other things, a 2 (gender) by 4 (socio-
metric status) analysis of variance was conducted with bullying
as the outcome variable. The results showed a significant main
effect of both gender and sociometric status as well as a significant
interaction effect of these two factors. Subsequently, post hoc pair-
wise comparisons were performed for the effect of sociometric sta-
tus on bullying for boys and girls separately. Results showed that,
for boys, the four sociometric status groups did not differ signifi-
cantly with respect to bullying and that, for girls, only the contro-
versial group differed from the other three groups. With respect
to bullying, Lee (2009) concluded that: “. . . the present research
found no evidence that boys bullied more than girls” (p. 328), and:
“. . . controversial sociometric status, especially in girls, showed
the highest scores in peer nomination as bullies” (p. 329). Although
more is said about the findings on bullying in the discussion, we did
not find an explicit reference to the expectations as formulated in
the introduction of the paper.
1 Department of Methodology and Statistics, Utrecht University, the Netherlands
2 Department of Methodology, VU University Amsterdam, the Netherlands
3 Helmholtz Research Institute, Experimental Psychology, Utrecht University, the Netherlands

Corresponding author:
Irene Klugkist, Department of Methodology and Statistics, Utrecht University, PO Box 80140, 3508 TC Utrecht, the Netherlands.
Email: [email protected]

International Journal of Behavioral Development
35(6) 550–560
© The Author(s) 2011
Reprints and permissions: sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/0165025411425873
ijbd.sagepub.com

All tests and thus the subsequent conclusions were based on null
hypothesis testing (NHT). In the example, the three null hypotheses
tested are “no main effect of gender”, “no main effect of
sociometric status”, and “no interaction effect of gender and
sociometric status”. These hypotheses, however, do not reflect the
carefully formulated expectation of Lee (2009). In our opinion, NHT is
not well-suited for theory evaluation because there is a big gap
between the researcher’s (specific) hypothesis and the (not so spe-
cific) hypotheses that are central in NHT. On the other hand, the
Bayesian method presented in this paper is designed to deal with
hypotheses that are closely related to the substantive hypotheses.
Although both methods (NHT and Bayesian) will deliver valid test
results, the connection between the results and the original expecta-
tions will be more straightforward with the Bayesian method than
with NHT. This may also explain why, in Lee’s (2009) paper, the
originally stated expectation is not explicitly discussed alongside
the test results and subsequent conclusions.
Limitations of NHT have been pointed out by many authors
before us (see, for instance, a review by Nickerson, 2000). Although
we will repeat some of the arguments, the main goal of this paper is
to introduce the Bayesian alternative approach. This method is devel-
oped for the direct evaluation of specific (clearly and explicitly for-
mulated) expectations. Observing that most research in psychology is
based on NHT and that NHT does not test the explicit hypotheses
often stated at the beginning of the research, we address the question:
Do we know what we test and do we test what we want to know?
In the remainder of the paper we will first elaborate on the argu-
ments stated above. This includes a summary of some of the mis-
conceptions that exist around NHT and p-values. In the following
section we will present the alternative approach that does not
demand a large change in the current research practice, and can
be useful when theory evaluation or (competing) theory comparison
is the main goal. Since Lee (2009) provided a specific hypothesis on
the effects of gender and sociometric status on bullying, we com-
pare the results presented in Lee’s paper with the results from the
proposed Bayesian approach for this hypothesis. The concluding
section includes several references for further reading and available
software for the novel approach.
Limitations of null hypothesis testing
Null hypothesis testing (NHT) is the most commonly used tool in
empirical psychological research even though there are numerous
publications describing its limitations. An interesting discussion
is provided in a book titled What If There Were No Significance
Tests? by Harlow, Mulaik, and Steiger (1997; but see also a critical
review article by Krantz, 1999). For a more recent and extensive
overview of the literature on (limitations of) NHT, we further refer
to Nickerson (2000). Although we too summarize some of the criti-
cisms in the next subsections, we do not aim for yet another paper
criticizing the use and misuse of NHT and subsequent p-values. The
main goal of this paper is to inform the reader about an alternative
that can be used when researchers have specific theories or expec-
tations about the outcomes of their research. Before introducing the
novel approach, we will however highlight three aspects of NHT
that are topics of ongoing debate: the nature of the hypotheses in
NHT, the (mis)interpretation of p-values, and (lack of) replicability.
Hypotheses in null hypothesis testing
As the name already suggests, NHT always requires a null hypothesis,
stating that a parameter of interest (e.g., a mean µ or a
correlation ρ) has a certain value (often zero), or that a set of
parameters is exactly equal (e.g., µ1 = µ2 = µ3). Stated differently,
the null hypothesis (H0) almost invariably states that “nothing is
going on” in the population of interest. Several authors have argued
that absolutely nothing is highly unlikely since some small differ-
ences between parameters of interest will almost always be present
(e.g., Cohen, 1990, 1994; Krueger, 2001; Lykken, 1991). Meehl
(1967) also agrees that in some disciplines (like psychology) the
null hypothesis is, using his expression, [quasi-] always false and
explains this as follows:
Any dependent variable of interest . . . depends mainly upon a finite
number of “strong” variables characteristic of the organism studied
. . . plus the influences manipulated by the experimenter. . . . In
order for two groups which differ in some identified properties
. . . to differ not at all in the “output” variable of interest, it would
be necessary that all determiners of the output variable have pre-
cisely the same average values in both groups, or else that their val-
ues should differ by a pattern of amounts of difference which
precisely counterbalance one another to yield a net difference of
zero. (p. 108)
He then concludes that this is obviously:
. . . so extremely unlikely that no psychologist or statistician would
assign more than a negligibly small probability to such a state of
affairs. (p. 108)
Based on the fact that the null hypothesis seems unlikely, Cohen
(1990) concludes:
So if the null hypothesis is always false, what’s the big deal about
rejecting it? (p. 1308)
Royall (1997), in a monograph that addresses the problem of
interpreting statistical data as evidence, likewise wonders if there
are “statistical ‘null’ hypotheses that are scientifically important?”
and states:
The answer to the question ‘Is the null hypothesis correct?’ is always
the same – no! . . . If the purpose of experiments were to answer
such questions, there would be no point in doing experiments, since
we already know the answers. (pp. 79–80)
Royall continues by explaining that the focus should not be on the
question whether observations are evidence against the null hypoth-
esis. Instead, the question to ask is whether there are scientifically
meaningful alternative hypotheses that are better supported (Royall,
1997, p. 81).
However, the alternative hypothesis in NHT approaches most
often states nothing more than “not H0” and is therefore not at all
specific or scientifically meaningful. Consider, for instance, a two-
way analysis of variance. The null hypotheses in such an analysis
state that there are no main effects and no interaction effect of the
two factors. For each of the three null hypotheses, the usual alterna-
tive hypothesis is that the effect at hand (main or interaction) is
present. Nevertheless, a hypothesis stating “an interaction effect
is present” is not explicit, and thus not scientifically meaningful:
it does not specify the theoretical expectation of the researcher.
Note that not all alternative hypotheses are uninformative. As a
counterexample, consider the testing of planned contrasts, where the
researcher translates explicit expectations into statistical contrasts
(see, for instance, Rosenthal, Rosnow, & Rubin, 2000). The evaluation
is, however, still based on an (unrealistic) null hypothesis.
In conclusion, NHT is based on a highly unrealistic null
hypothesis and almost invariably an uninformative alternative
one. This has the drawback that only limited conclusions can be
drawn from NHT approaches. This will be further discussed in the
next subsection.
Misinterpretations of p-values
Besides the formulation of the statistical hypotheses, another draw-
back of NHT is that the interpretation of its main result, the p-value,
is not at all straightforward. Researchers seek an answer to the
question “To what extent do the data support my hypothesis?”, but a
p-value does not provide that information. Cohen (1994) summarizes
the difference in focus as follows: a p-value is the probability
of obtaining the observed (or more extreme) data if the null
hypothesis is true, Pr(data|H0), whereas the researcher is more inter-
ested in the probability of H0 given the observed data, Pr(H0|data).
Misinterpretations occur either out of ignorance, or, as stated by
Cohen (1994), despite knowledge about the correct interpretation:
... it [the p-value] does not tell us what we want to know, and we so
much want to know what we want to know that, out of desperation,
we nevertheless believe that it does! (p. 997)
In the frequentist (NHT) framework, we neither get information
about the probability that the null hypothesis is true nor about
the probability that the alternative or research hypothesis is true:
In the classical or frequentist framework it is not possible to assign
probabilities to hypotheses as a consequence of the definition of
probability as long-run frequency.
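For reference, the two conditional probabilities contrasted by Cohen are linked by Bayes' theorem. The display below is the standard identity, not a formula taken from this article; Pr(H0) denotes a prior probability for the null hypothesis, which is exactly the quantity that the frequentist framework does not define, and Pr(data | H0) is used loosely here, since a p-value is a tail probability rather than the probability of the exact data.

```latex
\Pr(H_0 \mid \text{data})
  = \frac{\Pr(\text{data} \mid H_0)\,\Pr(H_0)}{\Pr(\text{data})},
\qquad
\Pr(\text{data})
  = \Pr(\text{data} \mid H_0)\,\Pr(H_0)
  + \Pr(\text{data} \mid \text{not } H_0)\,\Pr(\text{not } H_0).
```

Moving from Pr(data | H0) to Pr(H0 | data) therefore requires both a prior probability and a fully specified alternative, neither of which a p-value supplies.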
Replicability
Another limitation of p-values and what they can (not) tell us about
our hypotheses becomes clear by examining sampling variability
and the related issue whether specific results are replicable.
Although psychological researchers are trained in the concept of
sampling variability and should be aware of the uncertainty in the
results, this is often ignored or underestimated when results are
described and conclusions are formulated. Cumming (2008) inves-
tigated sampling variability of the p-value by asking the following
question: Given an observed (two-tailed) p-value of .05, what are
reasonable expected values if the same experiment is replicated?
Irrespective of sample size, the 80% replication interval for the
p-value in this scenario turns out to be (.00008, .44), implying
there is a 10% chance that – in a replication – p < .00008, a 10% chance
that p > .44, and an 80% chance that p falls within this interval.
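The interval quoted above can be approximated with a few lines of code. The sketch below is not taken from Cumming's paper or from this article; it rests on two stated assumptions: the replication z-score is normally distributed around the observed z-score with standard deviation sqrt(2) (both studies contribute sampling error), and the replication p-value is one-tailed in the direction of the original effect. The function name is hypothetical.

```python
from statistics import NormalDist


def replication_p_interval(p_two_tailed: float, coverage: float = 0.80):
    """Interval for the one-tailed p-value of an exact replication.

    Assumption: z_rep ~ Normal(z_obs, sqrt(2)), where z_obs is the z-score
    that corresponds to the observed two-tailed p-value.
    """
    std_normal = NormalDist()
    z_obs = std_normal.inv_cdf(1 - p_two_tailed / 2)
    z_rep = NormalDist(mu=z_obs, sigma=2 ** 0.5)
    tail = (1 - coverage) / 2
    z_high = z_rep.inv_cdf(1 - tail)  # unusually strong replication
    z_low = z_rep.inv_cdf(tail)       # unusually weak replication
    p_lower = 1 - std_normal.cdf(z_high)
    p_upper = 1 - std_normal.cdf(z_low)
    return p_lower, p_upper


lower, upper = replication_p_interval(0.05)
print(f"80% interval for the replication p-value: ({lower:.5f}, {upper:.2f})")
```

Under these assumptions the sample size does not enter the calculation, which is consistent with the statement that the interval holds irrespective of sample size.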
The need for replication studies has been emphasized by several
authors (see, for instance, Howard, Maxwell and Fleming, 2000;
Schmidt, 1996). More recently, Killeen (2005a) proposed to report
the statistic p_rep, an estimate of the probability of replicating an
effect according to the following definition of replication: “an
effect of the same sign as that found in the original experiment”
(p. 346). Killeen’s proposal led to a series of (either supporting,
elaborating, or criticizing) articles on the topic (Cumming,
2005, 2010; Doros & Geier, 2005; Iverson, Wagenmakers, & Lee,
2010; Killeen, 2005b, 2010; Lecoutre, Lecoutre, & Poitevineau,
2010; MacDonald, 2005; Maraun & Gabriel, 2010; Serlin,
2010), showing that the question of replicability is considered
important by many researchers and that no “best practice” has yet
been established or agreed upon. Although we too believe that
replication is a key element of science and the accumulation of
knowledge, a further discussion of p_rep is beyond the scope of
this paper.
To summarize, despite all efforts for improvements, NHT does
not evaluate scientifically meaningful hypotheses and still puts too
much weight on often misinterpreted and rather uncertain p-values.
However, it is still the dominant tool for statistical analyses in
psychological research.
Suggestions for solutions
A common answer to some limitations of NHT and p-values is that
additional results like effect sizes and confidence intervals must be
reported and that conclusions must be formulated more carefully,
that is, with specific attention to the uncertainty in the results. For
instance, in 1996, the Task Force on Statistical Inference (TFSI)
was convened by the APA Board of Scientific Affairs to elucidate
some of the controversial issues surrounding significance testing.
This led to guidelines and explanations on how to improve the
interpretation and communication of research results (Wilkinson & the
TFSI, 1999). Their recommendations indeed included standard
reporting of effect sizes and confidence intervals. Unfortunately,
compelling researchers to routinely report confidence intervals
does not lead to an actual interpretation of them (Fidler,
Thomason, Cumming, Finch, & Leeman, 2004). Confidence intervals are
often ignored in the discussion of results or only used to determine
if the hypothesized value is within the interval or not. The latter use
of confidence intervals has all the same drawbacks as interpreting a
p-value from the NHT procedure (Schmidt & Hunter, 1997).
Compared to NHT, the evaluation of effect sizes (preferably
with confidence intervals) is more informative and therefore closer
to the theory-based approach we propose. However, the decision
whether an effect is meaningful is made after the results are
obtained (i.e., based on the observed effect size) and is rather
subjective (who decides if an effect of 0.10 is meaningful, and what
will a researcher finding an effect of 0.09 conclude?). In our opinion, the
formulation of specific expectations before data are collected leads
to more objective research and conclusions. In relatively simple
setups such expectations could be captured in more sharply formulated
null hypotheses as, for instance, H0: µ = 2 versus HA: µ > 2. The
Bayesian alternative to NHT proposed in this paper is based on a
similar principle but the type of specific hypotheses is different
(they are introduced in the next section), and more extensive
(multiple-parameter) hypotheses can be evaluated.
Bayesian evaluation of informative hypotheses
In this section a non-technical explanation of the Bayesian
approach for the evaluation of informative hypotheses is provided.
First, we will elaborate on the types of hypotheses that can be for-
mulated by the researcher, and subsequently, the analysis of these
hypotheses using Bayesian model selection is presented.
Informative hypotheses
Researchers aim to design their research in such a way that it is
possible to evaluate explicit theories, which lead to specific expec-
tations about the outcomes of the data analysis. In the research pre-
sented by Lee (2009), for example, it was expected that the amount
of bullying is especially high in boys with reject status, while for
girls the highest amount of bullying is expected for the controver-
sial status group. Hypotheses such as these can be translated into
informative hypotheses, that is, hypotheses that give direction, in
terms of inequality constraints (i.e., “larger than”, “better than”,
etc.) to the researcher’s expectations. With µ denoting the mean
level of bullying, subscripts B and G denoting boys and girls, and
subscripts P, R, N and C for preferred, rejected, neglected, and
controversial status, respectively, the hypothesis in our example can
be translated into:
HLee: µBR > (µBP, µBN, µBC) and µGC > (µGP, µGR, µGN).
To extend the illustration of Bayesian model selection we added
two competing informative hypotheses to the research hypothesis
stated by Lee:
Hphysical: µBR > µGC > µBC > µGR > (µBN, µGN, µBP, µGP),
Hrelational: µGC > µBR > µGR > µBC > (µBN, µGN, µBP, µGP).
The first hypothesis is based on the expectation that physical
aggression has more impact on being nominated as a bully and that
boys show more physical aggression than girls. Combined with the
information in HLee concerning higher bullying in the reject and
controversial groups, this leads to Hphysical. In a similar vein, a com-
peting expectation could state that especially relational aggression
(which is assumed to be higher in girls) is related to bullying which
could lead to Hrelational. Both hypotheses are loosely based on infor-
mation provided by Lee in the introduction, but bear in mind that
we are no experts in this area and the hypotheses serve to illustrate
the statistical method and are not to be interpreted from a substan-
tive point of view.
With the Bayesian model selection procedure, informative
hypotheses (Hi) formulated using inequalities or a mix of
inequalities and equalities can be evaluated. The researcher decides
on a set of hypotheses to confront with the data, and the method
returns the relative support provided by the data for each
hypothesis in the form of posterior model (hypothesis)
probabilities. The approach therefore provides answers in terms of
(Bayesian) probabilities assigned to hypotheses, enabling easier
interpretation of test results and preventing dichotomous decisions
(i.e., “significant” or “not significant”).
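As a rough illustration of how posterior model probabilities arise, the sketch below normalises a set of marginal likelihoods (the quantity the article introduces in a later subsection). It assumes equal prior probabilities for the hypotheses unless others are supplied; the function name and the numerical values are hypothetical and serve only to show the arithmetic.

```python
from typing import Dict, Optional


def posterior_model_probabilities(
    marginal_likelihoods: Dict[str, float],
    prior_probs: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Normalise (prior probability x marginal likelihood) over hypotheses."""
    names = list(marginal_likelihoods)
    if prior_probs is None:
        # Assumption: every hypothesis is equally probable a priori.
        prior_probs = {name: 1 / len(names) for name in names}
    weighted = {n: marginal_likelihoods[n] * prior_probs[n] for n in names}
    total = sum(weighted.values())
    return {n: w / total for n, w in weighted.items()}


# Invented marginal likelihoods, not results from any data set.
example = {"HLee": 0.012, "Hphysical": 0.003, "Hrelational": 0.030}
pmp = posterior_model_probabilities(example)
for name, prob in pmp.items():
    print(f"{name}: {prob:.3f}")
```

Because the probabilities sum to one over the chosen set, they express support relative to the other hypotheses in the set, not absolute support.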
An additional advantage of the proposed method is that several
informative hypotheses can be mutually compared. Note that the
traditional null and alternative hypotheses can, but do not
necessarily have to, be included as hypotheses of interest. In case
the researcher decides to include (one of) them, they do not play a
central role in the analysis: every (informative) hypothesis is
evaluated against all others.
The analysis of informative hypotheses
To be able to support the explanation of the analysis with graphical
representations, here we will limit ourselves to a 2-parameter
example. Generalizations and technical details are provided in the
Appendix. The ingredients required for the analysis will be
elaborated for the hypotheses Hi: µ1 > µ2, H0: µ1 = µ2, and HA: µ1, µ2
(the comma in HA denotes that no constraints are imposed), where
the µ’s represent independent group means for data that are
assumed to be normally distributed and the residual variance is
assumed to be known.
Prior distributions for model parameters
Specification of a prior distribution for the model parameters is an
essential part of the Bayesian analysis and makes it fundamentally
different from a classical, frequentist approach. Within the frequen-
tist framework, parameters are assumed fixed (although unknown),
whereas Bayesians treat parameters as random and use probability
distributions to reflect knowledge or uncertainty about them.
Knowledge about parameters before data collection is reflected in
the prior distribution.
We assume vague, independent and identically distributed
priors for µ1 and µ2 so that the results are dominated by the data
rather than by the prior. For instance, for an outcome of interest
that is measured on a 0–7 scale, a prior distribution for the mean
could be a uniform distribution on the scale 0–7, that is, U(0,7).
Such a prior states that before seeing the data each value of µ is
equally likely and can therefore be considered objective (no
subjective information about µ is incorporated). For parameters that
are constrained in one or more hypotheses, equal prior distributions
are used (again to avoid subjective input in the analysis).
Usually, when several models are compared, for each model a prior
distribution for the model-specific parameters must be defined
(Gelman et al., 2004, pp. 184–186). However, the hypotheses (leading
to different models in the model selection procedure) discussed in this
paper have a special feature: All informative hypotheses, Hi, are nested
in an unconstrained hypothesis (i.e., HA). This is illustrated in Figure 1,
where the white square on the right represents a prior distribution for
the statistical model consisting of two parameters, µ1 and µ2, without
any constraints, that is, the prior for both µ1 and µ2 is U(0,7). The
height of the prior density (not visible in the two-dimensional plot)
is equal for all combinations of µ1 and µ2 within the square. Note that
here we choose (bounded) uniform priors to keep the explanation and
plots simple, but in the software used for Bayesian model selection,
other prior distributions are used. Prior specification is briefly
discussed later in the paper, while more details are provided in the
appendix (see also Van Wesel, Klugkist, & Hoijtink, in press).
An important consequence of the nesting of constrained hypotheses
in HA can be seen in the other two plots of Figure 1. For instance,
in the left-hand plot, the upper-left triangle is greyed out since
in that area µ1 is smaller than µ2, which is in disagreement with the
hypothesis. In the analysis the same initial prior distribution is used
for the constrained hypotheses; however, parts that are not in
agreement with the constraints are truncated (i.e., the prior density
in those areas is set to zero).
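The truncation can be mimicked by drawing from the unconstrained prior and discarding draws that violate the constraint. The sketch below does this for Hi: µ1 > µ2 with independent U(0,7) priors. It assumes that the truncated prior is renormalised so that it integrates to one over the admissible region; the function name, seed and number of draws are arbitrary illustrative choices.

```python
import random
from typing import Callable, List, Tuple


def sample_truncated_prior(
    constraint: Callable[[float, float], bool],
    n_draws: int = 100_000,
    low: float = 0.0,
    high: float = 7.0,
    seed: int = 1,
) -> Tuple[List[Tuple[float, float]], float]:
    """Draw from U(low, high) x U(low, high) and keep admissible draws.

    Returns the kept draws and the fraction of draws that were kept,
    i.e. the share of the unconstrained prior that agrees with the
    constraint.
    """
    rng = random.Random(seed)
    kept = []
    for _ in range(n_draws):
        mu1, mu2 = rng.uniform(low, high), rng.uniform(low, high)
        if constraint(mu1, mu2):
            kept.append((mu1, mu2))
    return kept, len(kept) / n_draws


draws, kept_fraction = sample_truncated_prior(lambda mu1, mu2: mu1 > mu2)

unconstrained_density = 1 / 49  # height of the U(0,7) x U(0,7) prior
# Assumption: renormalising raises the height inside the admissible region.
truncated_density = unconstrained_density / kept_fraction

print(f"share of prior in agreement with Hi: {kept_fraction:.3f}")
print(f"prior height under HA: {unconstrained_density:.4f}")
print(f"prior height under Hi (admissible region): {truncated_density:.4f}")
```

The kept fraction is close to one half, matching the half of the square that remains white in the left-hand plot of Figure 1.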
Some special attention is required for the null hypothesis plotted
in the middle. Strictly, the whole square must be greyed out,
because only the diagonal line (µ1 = µ2) is in agreement with the
hypothesis and thus receives non-zero prior density.
Computationally, equality constraints are much more difficult to
handle than inequality constraints (for more details, see Van Wesel,
Hoijtink, & Klugkist, in press). Conceptually, however, the Bayesian
procedure can be understood by considering an approximation of the
hypothesis stating that µ1 and µ2 are about equal. This is plotted
in Figure 1 by the small white area around the line µ1 = µ2.
The likelihood function to describe the information in the data
The second ingredient in a Bayesian analysis is the data. The sample
contains additional information about the parameters of the
model (i.e., µ1 and µ2) and this information is reflected in the
likelihood function. An example is plotted in Figure 2. Given the
observed data, some combinations of values of µ1 and µ2 are more
likely (e.g., µ1 = µ2 = 4) and others are highly unlikely (e.g.,
µ1 = µ2 = 0, or µ1 = 7 and µ2 = 0). The top of the likelihood function
shows which values of µ1 and µ2 are the most likely given the data
at hand and is also known as the maximum likelihood estimate (the
most common estimator in frequentist statistics).
Using the likelihood function to describe the information the
data provides about the parameters is standard in both frequentist
and Bayesian methods. However, typical in the Bayesian approach
is that this information is combined with the information in the prior
distribution. This brings us to the so-called marginal likelihood.
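The likelihood function for the two-group example can be written down directly. The sketch below evaluates the log-likelihood of a pair (µ1, µ2) for normally distributed data with known residual standard deviation, as assumed in this section. The scores and the value sigma = 1 are hypothetical and chosen so that both sample means equal 4.

```python
import math
from typing import Sequence


def log_likelihood(mu1: float, mu2: float,
                   group1: Sequence[float], group2: Sequence[float],
                   sigma: float = 1.0) -> float:
    """Log-likelihood of two group means for normal data with known sigma."""
    def group_term(values: Sequence[float], mu: float) -> float:
        n = len(values)
        return (-0.5 * n * math.log(2 * math.pi * sigma ** 2)
                - sum((y - mu) ** 2 for y in values) / (2 * sigma ** 2))
    return group_term(group1, mu1) + group_term(group2, mu2)


# Invented scores on a 0-7 scale; both sample means are 4.0.
group1 = [3.5, 4.2, 4.0, 4.6, 3.7]
group2 = [4.1, 3.8, 4.4, 3.6, 4.1]

for mu1, mu2 in [(4.0, 4.0), (0.0, 0.0), (7.0, 0.0)]:
    value = log_likelihood(mu1, mu2, group1, group2)
    print(f"log-likelihood at (mu1={mu1}, mu2={mu2}): {value:.1f}")
```

The largest value is obtained at the sample means, which is the maximum of the likelihood surface.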
Marginal likelihood of a hypothesis conditional on data
In Bayesian model selection, the marginal likelihood is computed for
each model or hypothesis under consideration; it represents
the likelihood of the hypothesis conditional on the data. Loosely
stated, it is the average data density (height in Figure 2) weighted
with the prior density (height of the prior, see also Figure 1). For
our application with constrained parameters the marginal likelihood
is illustrated in Figure 3, where priors and likelihood functions
are put together for several possible outcomes. The square
denotes the prior distribution for the unconstrained HA (equal
height for the entire square); the circles are the two-dimensional
representation of the likelihood function, with smaller circles
denoting points with higher likelihoods (and the maximum in the
middle of the smallest circle).
The three plots in the top row show the results for a data set with
a larger sample mean for group 1 than for group 2 (e.g., 4 and 2.5,
respectively). This is, therefore, an example of a data set that is in
agreement with Hi: µ1 > µ2. The average data density in the left-hand
plot (Hi) will be higher than the average data density in the
right-hand plot (HA). For Hi, due to the constraint the area with rel-
atively low density (upper left triangle) is excluded (prior density is
zero) and, consequently, the average density of the remaining area
(bottom right triangle) is higher than when the entire square is eval-
uated (as is done for HA). The plot in the centre (H0) excludes the
area with the highest values of the likelihood function and thus
takes the average of relatively low values of the data density. The
marginal likelihood will be highest for Hi reflecting the support
in the data for this hypothesis. Compared to HA, a smaller marginal
likelihood, and thus no support, will be obtained for H0.
The second row presents the likelihood function of data with
equal sample means. In a similar vein, the marginal likelihood in
the centre plot (H0) will now be higher than in the right-hand plot
(HA), correctly showing that the data do support H0. The marginal
likelihood in the left-hand plot will be about the same as in the
right-hand plot, because the area that is truncated (upper-left triangle)
includes both high and low data densities and is similar to the area
that remains (lower-right triangle). The data show no support for Hi,
but also no clear refutation.
This is different in the last row, where the sample means are the
opposite of what was expected under Hi (i.e., first mean is smaller
than second). Both in Hi and H0 the large data densities fall in the
greyed out area and do not contribute to the marginal likelihood
since the prior density in that area is zero. The marginal likelihood
of HA will be the largest, implying that both constrained hypotheses
are not supported by the data.
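The three scenarios described above can be reproduced numerically. The sketch below approximates each marginal likelihood as the average likelihood over a grid of (µ1, µ2) values that agree with the hypothesis, using independent U(0,7) priors. Several choices are assumptions made here for illustration and do not come from the article or its software: a group size of 20, a known sigma of 1, a grid step of 0.04, and an "about equal" band of half-width 0.25 for H0.

```python
import math
from typing import Dict


def marginal_likelihoods(mean1: float, mean2: float, n: int = 20,
                         sigma: float = 1.0, step: float = 0.04,
                         delta: float = 0.25) -> Dict[str, float]:
    """Approximate marginal likelihoods of Hi, H0 and HA on a grid.

    The likelihood is used up to a constant shared by all hypotheses,
    so only ratios of the returned values are meaningful.
    """
    grid = [i * step for i in range(int(round(7 / step)) + 1)]
    hypotheses = {
        "Hi": lambda m1, m2: m1 > m2,
        "H0": lambda m1, m2: abs(m1 - m2) < delta,  # 'about equal' band
        "HA": lambda m1, m2: True,
    }
    totals = {name: 0.0 for name in hypotheses}
    counts = {name: 0 for name in hypotheses}
    scale = n / (2 * sigma ** 2)
    for m1 in grid:
        for m2 in grid:
            lik = math.exp(-scale * ((mean1 - m1) ** 2 + (mean2 - m2) ** 2))
            for name, allowed in hypotheses.items():
                if allowed(m1, m2):
                    totals[name] += lik
                    counts[name] += 1
    # Average likelihood over the admissible part of the prior.
    return {name: totals[name] / counts[name] for name in hypotheses}


for label, means in [("top row", (4.0, 2.5)),
                     ("middle row", (4.0, 4.0)),
                     ("bottom row", (2.5, 4.0))]:
    ml = marginal_likelihoods(*means)
    ratios = {name: ml[name] / ml["HA"] for name in ("Hi", "H0")}
    print(label, {name: round(value, 4) for name, value in ratios.items()})
```

Ratios above one indicate more support than for the unconstrained hypothesis and ratios below one indicate less; the ordering in each scenario follows the verbal description of the three rows.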
Bayes factor as a model selection criterion
In model selection, two elements are combined: “fit” (how well
does the model describe the data), and “complexity” or also
“parsimoniousness” (the size or number of parameters of the model). A
complexity correction is important because the model with the best
fit is not necessarily the model with the best predictive quality (see,
for instance, Myung, 2000). The unconstrained hypothesis, for
example, always has a good fit, that is, there is no combination
of values of µ1 and µ2 that is not in accordance with HA. However,
that does not imply that HA is a useful hypothesis. In fact, it predicts
that anything goes, which is scientifically not very interesting. In
terms of Myung (2000), the predictive quality of HA is very small
because it is not parsimonious.
Figure 1. Prior parameter distributions for three hypotheses. Panels: Hi: µ1 > µ2; H0: µ1 = µ2 (= µ); HA: µ1, µ2. Axes: µ1 and µ2.
Figure 2. Likelihood function.
In non-Bayesian model selection criteria like, for instance,
Akaike’s Information Criterion (AIC; Akaike, 1974), model
complexity is incorporated as an explicit penalty term, which is a
function of the number of parameters. A model with inequality con-
straints among the parameters, however, is more parsimonious than
its unconstrained counterpart without reducing the number of
parameters. A criterion is needed that does take the complexity
(size) of the model into account without (solely) basing this on the
number of parameters involved in the model. The marginal likeli-
hood does exactly that. The fit of HA can never be worse than the
fit of a hypothesis nested in HA (e.g., Hi or H0), but, nevertheless,
the marginal likelihood of Hi on the top row of Figure 3 is larger
than the marginal likelihood of HA on the top row. In the marginal
likelihood, the correction for model complexity is implicit and
automatic (an instance of Ockham’s razor) and does not require the
number of parameters as explicitly stated input in the procedure.
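The implicit complexity correction can be illustrated numerically. The sketch below is a simplified setting of our own choosing, not the procedure used in this paper: two group means with a known standard error, a uniform prior on a square for HA, and the same prior truncated to μ1 > μ2 for Hi. The marginal likelihood is approximated as the average likelihood over a grid of prior values. All function names, the prior bounds and the data values are illustrative assumptions.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def likelihood(mu1, mu2, ybar1, ybar2, se):
    """Likelihood of two observed sample means, assuming a known standard error."""
    return normal_pdf(ybar1, mu1, se) * normal_pdf(ybar2, mu2, se)

def marginal_likelihood(ybar1, ybar2, se, constraint, lo=-5.0, hi=5.0, steps=200):
    """Average the likelihood over a uniform prior on [lo, hi]^2, restricted to
    the grid points that satisfy the constraint (grid approximation)."""
    width = (hi - lo) / steps
    grid = [lo + (k + 0.5) * width for k in range(steps)]
    total, count = 0.0, 0
    for mu1 in grid:
        for mu2 in grid:
            if constraint(mu1, mu2):
                total += likelihood(mu1, mu2, ybar1, ybar2, se)
                count += 1
    return total / count

def unconstrained(mu1, mu2):
    return True          # HA: any combination of means is allowed

def first_larger(mu1, mu2):
    return mu1 > mu2     # Hi: mu1 > mu2

# Hypothetical data in agreement with Hi (first sample mean clearly larger).
m_A = marginal_likelihood(1.0, 0.0, 0.3, unconstrained)
m_i = marginal_likelihood(1.0, 0.0, 0.3, first_larger)
bf_iA = m_i / m_A

# Hypothetical data that contradict Hi (first sample mean clearly smaller).
bf_reversed = (marginal_likelihood(0.0, 1.0, 0.3, first_larger)
               / marginal_likelihood(0.0, 1.0, 0.3, unconstrained))

print(f"BF(Hi vs HA), data agree with Hi:   {bf_iA:.2f}")
print(f"BF(Hi vs HA), data contradict Hi:   {bf_reversed:.3f}")
```

Both hypotheses have two parameters, yet Hi obtains the larger marginal likelihood when the data agree with it, because its prior excludes the half of the square where the likelihood is low.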
Figure 3. Priors and likelihood functions combined to obtain the marginal likelihood. Panels in each row: Hi: μ1 > μ2 (left), H0: μ1 = μ2 (= μ) (centre), HA: μ1, μ2 (right). Axes: μ1 and μ2. Top row: data in agreement with Hi. Second row: data in agreement
with H0. Bottom row: data not in agreement with either H0 or Hi.
The Bayesian model selection criterion is the Bayes factor
(BF; Kass & Raftery, 1995), which is the ratio of two marginal
likelihoods. To evaluate informative hypotheses, first, each is com-
pared with the unconstrained HA, that is, BFi,A is computed for each
Hi. BFi,A will be larger than 1 if the data support the constraints of
the informative hypotheses, and smaller than 1 otherwise. From this
set of BFi,A values (one for each hypothesis of interest), the relative
support for each of the informative hypotheses can be obtained
when mutually compared. Using a numerical example, let’s say
BF1,A (i.e., the Bayes factor comparing H1 with HA) = 10, BF2,A
= 2, and BF3,A = 0.5. These numbers show that the data support
H1 and H2 (Bayes factor > 1), but not H3 (Bayes factor < 1).
Furthermore, compared to HA, the support for H1 is 10 times stronger,
and support for H2 is 2 times stronger. A direct comparison of
H1 against H2 is obtained via BF1,2 = BF1,A/BF2,A, that is, BF1,2
= 10/2 = 5, implying that the support for H1 is 5 times stronger than
for H2. Similarly, we can mutually compare H1 with H3 (BF1,3 = 10/0.5 = 20)
and H2 with H3 (BF2,3 = 2/0.5 = 4).
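The arithmetic of the numerical example can be written as a short sketch. The function name and the dictionary layout are our own illustrative choices; only the three Bayes factors come from the example above.

```python
from itertools import combinations

def pairwise_bayes_factors(bf_vs_unconstrained):
    """Mutual Bayes factors BF(a, b) = BF(a, A) / BF(b, A) for every pair of hypotheses."""
    return {
        (a, b): bf_vs_unconstrained[a] / bf_vs_unconstrained[b]
        for a, b in combinations(bf_vs_unconstrained, 2)
    }

# Bayes factors of each informative hypothesis against the unconstrained HA.
bf_A = {"H1": 10.0, "H2": 2.0, "H3": 0.5}
pairwise = pairwise_bayes_factors(bf_A)

for (a, b), value in pairwise.items():
    print(f"BF({a} vs {b}) = {value}")
```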
The (Bayesian) probability of a hypothesis
When more than two hypotheses are under investigation, interpre-
tation of results is easier after translation of the Bayes factors into
posterior model probabilities (PMP). A posterior model prob-
ability represents the relative support for a hypothesis within a
certain set of hypotheses. Posterior model probabilities are num-
bers between 0 and 1, and all posterior model probabilities in a
set add up to 1. In order to calculate posterior model probabil-
ities, first the prior model probabilities must be specified. They
represent the relative support given for each hypothesis before
data collection. Note the distinction between priors for hypotheses
(needed to obtain posterior model probabilities) and the prior
distributions of the model parameters that were specified and
discussed earlier.
An objective choice is to specify the same prior probability for
each hypothesis under consideration. For T hypotheses we use 1/T
as the prior probability for each hypothesis. With this specification
it is easy to compute posterior model probabilities from a set of
Bayes factors by the computation of the ratio of BFi,A and the sum
of all BFi,A. For the numerical illustration, the resulting posterior model
probabilities are 10/(10 + 2 + 0.5) = 0.8 for H1, 2/(10 + 2 + 0.5)
= 0.16 for H2, and 0.5/(10 + 2 + 0.5) = 0.04 for H3.
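A minimal sketch of this computation follows. The function name is hypothetical, and the optional argument for unequal prior model probabilities is an extension we add for illustration; with its default, the function reproduces the equal-prior (1/T) case described above.

```python
def posterior_model_probabilities(bf_vs_unconstrained, prior_probs=None):
    """Posterior model probabilities from Bayes factors against HA.

    With equal prior model probabilities (the default), each PMP is
    BF(t, A) divided by the sum of all BF(t, A).
    """
    names = list(bf_vs_unconstrained)
    if prior_probs is None:
        prior_probs = {name: 1.0 / len(names) for name in names}
    weights = {name: bf_vs_unconstrained[name] * prior_probs[name] for name in names}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

# Bayes factors from the numerical illustration in the text.
pmp = posterior_model_probabilities({"H1": 10.0, "H2": 2.0, "H3": 0.5})

for name, value in pmp.items():
    print(f"PMP({name}) = {value:.2f}")
```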
The posterior model probabilities show that, in a Bayesian
framework (in contrast to the frequentist point of view), one can
speak of the probability that a hypothesis is true, because within this
framework ‘‘probability’’ is defined as a degree of belief. Compu-
tation of the Bayesian probability of a specific H0 given a certain
(frequentist) p-value shows that the difference between the two is
substantial. For instance, for data yielding a p-value of .05, the
lower bound estimate of Pr(H0|data), under some general assump-
tions (e.g., default vague prior information), is about .30, implying
only weak evidence against the null hypothesis (Berger & Sellke,
1987). Therefore, the formulation of confirmative conclusions (e.g.,
“support was found for . . .” or “results show the effect . . .”)
based on a rejected null hypothesis (i.e., p < .05) is both
fundamentally wrong and an overstatement of the amount of evidence in the
data. A posterior model probability does provide the amount of
support for a specific model (hypothesis) but, in the interpretation of
the results, one should keep in mind that the definition of ‘‘probabil-
ity’’ is different for Bayesians than for frequentists.
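To give a feel for the size of the gap between a p-value and Pr(H0|data), the sketch below uses a different and simpler calibration than the one in Berger and Sellke (1987): the bound −e · p · ln(p) on the Bayes factor in favour of H0, proposed by Sellke, Bayarri, and Berger (2001) and valid for p < 1/e. This is a technique we substitute for illustration only; the function names and the default prior probability of .5 for H0 are assumptions.

```python
import math

def bayes_factor_lower_bound(p_value):
    """Lower bound -e * p * ln(p) on the Bayes factor in favour of H0 (needs p < 1/e)."""
    if not 0.0 < p_value < 1.0 / math.e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    return -math.e * p_value * math.log(p_value)

def posterior_null_lower_bound(p_value, prior_null=0.5):
    """Lower bound on Pr(H0 | data) implied by the Bayes factor bound."""
    prior_odds = prior_null / (1.0 - prior_null)
    posterior_odds = bayes_factor_lower_bound(p_value) * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

for p in (0.05, 0.01, 0.001):
    print(f"p = {p}: Pr(H0 | data) is at least {posterior_null_lower_bound(p):.3f}")
```

For p = .05 this calibration gives a lower bound of roughly .29, in line with the value of about .30 quoted above.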
Subjectivity of results?
An important aspect of a Bayesian analysis is whether the results are
sensitive to the (subjective) specification of the prior distributions
for the model parameters. For most estimation purposes it is
straightforward to specify uninformative priors, and results will not be
sensitive to changes in these priors. For Bayesian model selection,
however, two different priors that are both uninformative in
estimation can give very different marginal likelihoods (and thus Bayes
factors and posterior model probabilities). It is therefore important to
choose prior distributions carefully and to investigate robustness to
changes in the prior (called a prior sensitivity analysis).
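A small prior sensitivity analysis can be sketched for a deliberately simple case that is not taken from this paper: a single mean with known standard error, H0: μ = 0 against an unconstrained HA with a normal prior N(0, τ²). In this conjugate setting both marginal likelihoods are available in closed form. The function names and the data values are illustrative assumptions.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def bf_null_vs_unconstrained(ybar, se, prior_sd):
    """Bayes factor of H0: mu = 0 against HA: mu ~ N(0, prior_sd^2)."""
    m_null = normal_pdf(ybar, 0.0, se ** 2)                  # marginal likelihood of H0
    m_alt = normal_pdf(ybar, 0.0, se ** 2 + prior_sd ** 2)   # marginal likelihood of HA
    return m_null / m_alt

def posterior_mean(ybar, se, prior_sd):
    """Posterior mean of mu under HA (conjugate normal update)."""
    return ybar * prior_sd ** 2 / (prior_sd ** 2 + se ** 2)

ybar, se = 0.5, 0.25   # hypothetical sample mean and standard error

# Two priors that are both vague for estimation purposes.
bf_sd10 = bf_null_vs_unconstrained(ybar, se, prior_sd=10.0)
bf_sd100 = bf_null_vs_unconstrained(ybar, se, prior_sd=100.0)
est_sd10 = posterior_mean(ybar, se, prior_sd=10.0)
est_sd100 = posterior_mean(ybar, se, prior_sd=100.0)

print(f"posterior means: {est_sd10:.4f} vs {est_sd100:.4f}")
print(f"Bayes factors:   {bf_sd10:.1f} vs {bf_sd100:.1f}")
```

The two priors give practically the same estimate of μ, whereas the Bayes factor for the equality-constrained hypothesis changes by about a factor of ten.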
For informative hypotheses, previous work has shown that results for
hypotheses specified using inequality constraints between parameters
are not sensitive to the prior, whereas results for
hypotheses that include equalities are (Klugkist & Hoijtink,
2007). The software developed for the evaluation of informative
hypotheses (further discussed at the end of the discussion) automat-
ically derives priors with good properties for the data at hand. The
specification is based on a method where priors are specified using
a small part of the observed data (for training data methods, see
Berger & Pericchi, 1996, 2004; Perez & Berger, 2002; Van Wesel,
Hoijtink, & Klugkist, in press). Users of the software, therefore, do
not need to specify (and worry about) the prior distributions. More
details about the prior specification are provided in the appendix.
Results for the example
As we had no access to the original data, we used simulated data
based on the descriptive statistics given in Lee (2009). Sample
sizes for the four status groups were reported by Lee but not
how boys and girls were distributed over these groups. Based
on the remark that children were ‘‘fairly evenly distributed across
groups’’ (p. 325; this conclusion was supported with a
non-significant chi-square test), we generated data with the reported
sample size for each sociometric status and an equal number of
boys and girls within all status groups.
Classical analysis
Indeed, our results were similar to Lee’s results in the 2 (gender) by 4
(sociometric status) ANOVA (see Table 1). The main effects for
gender and sociometric status were statistically significant as was the
interaction effect. Post hoc tests revealed no significant differences
between the sociometric status groups for the boys, whereas post hoc
tests for the girls showed that controversial girls scored significantly
higher on bullying than preferred, rejected and neglected girls.
In terms of the hypotheses as stated by us (HLee, Hphysical,
Hrelational), these results provide no clear or direct information
about whether the hypotheses are supported or not.
Bayesian analysis
Based on the same data, we performed the Bayesian model selec-
tion procedure described above. In this analysis we included the
informative hypotheses HLee, Hphysical and Hrelational that were
previously formulated. For each hypothesis, the Bayes factor com-
paring the hypothesis with the unconstrained hypothesis, HA, shows
if there is support in the data for the constraints (if BFi,A > 1), or not
(BFi,A < 1). The results are presented in Table 2 and show support
for the expectation of Lee (BFLee,A = 10.5). However, Hrelational is
an even better hypothesis with BFrelational,A = 38.3. From these two
numbers we can also compute that the support for Hrelational is about
3.6 (i.e., 38.3/10.5) times stronger than for HLee. The last column of
Table 2 shows the posterior model probabilities for the mutual com-
parison of the three hypotheses.
These results provide direct information on each of the informa-
tive hypotheses specified. We can conclude that support was found
for the theory hypothesized by Lee, and more for an alternative
hypothesis (stated by us, loosely based on information discussed
by Lee in the introduction of the paper). We would like to stress again that
the hypotheses were formulated for didactical purposes, that is, to
illustrate the Bayesian approach, and therefore substantive
conclusions cannot be drawn from these results.
Discussion
Many papers demonstrating the limitations and risks of (solely
relying on) NHT methods have been published in the last decades (several
papers are referred to throughout this paper; more can be found at, for
instance, www.indiana.edu/~stigtsts/refs.html). Although we
criticize the use and misuse of NHT and subsequent p-values, the main
goal of this paper is to inform the reader about Bayesian evaluation
of informative hypotheses as a promising alternative method.
With informative hypotheses and the Bayesian approach to eval-
uate them, researchers with explicit theories are provided with tools
to directly evaluate their hypotheses. In discussing the advantages
of this new approach, so far, we have mainly focused on giving
researchers the answers they would like to have, i.e., probabilities
of explicitly stated hypotheses. In addition, the new approach has
other advantages. Formulating informative hypotheses requires a
well thought-through research plan: translating explicit
ideas into informative hypotheses in terms of (in)equality constraints
is a difficult task, so researchers are encouraged to think carefully
through their expectations. In doing so, they are explicitly
motivated to explore and formulate alternative theories or predicted
outcomes. This will stimulate researchers to discuss their “explicit”
ideas with colleagues, and to search in the literature for previous stud-
ies presenting results that might be in conflict with their own hypoth-
eses. Hence, specifying (multiple) informative hypotheses requires
clearly stated theoretical predictions, thereby clarifying the goal of
the study. It will motivate the search for competing theories, and it
will help to communicate research results in a straightforward way.
A remaining question is whether researchers are willing to adopt
such a novel approach to hypothesis evaluation. Van Wesel, Boeije,
and Hoijtink (in press) interviewed several psychological research-
ers with different levels of seniority (i.e., from PhD students to full
professors) about the role hypotheses play in their research projects,
how they formulate their hypotheses, and how they evaluate them.
From this study it became clear that many researchers are hesitant
to use new and unknown (e.g., Bayesian) methods. Responses that
were reported include statements like: ‘‘NHT is what everybody
uses, why should I use something else’’, ‘‘I am no expert in statistics
and will probably not understand this new method’’, and ‘‘Editors
and reviewers will probably not be familiar with these methods and
therefore I may not get my work published’’.
With the current paper, and the many included citations and
references, we argue that there is a need for methods beyond NHT. We
feel that the formulation and evaluation of informative hypotheses
can form an important step towards more consistent and unbiased lit-
erature. Since the first paper on Bayesian evaluation of informative
hypotheses was published in 2001 (Hoijtink, 2001), much work has
been done. This has led to several papers explaining the statistics at
a more technical level (e.g., Klugkist & Hoijtink, 2007; Laudy &
Hoijtink, 2007; Laudy, Boom, & Hoijtink, 2004; Mulder, Hoijtink,
& Klugkist, 2010; Van Wesel, Hoijtink, & Klugkist, in press), but
also to a non-technical book written for psychological researchers
(Hoijtink, Klugkist, & Boelen, 2008) and some educational papers
(Klugkist, Laudy, & Hoijtink, 2005, 2010; Van de Schoot et al.,
2011). Also, for (in)equality constrained hypotheses in several statis-
tical models (ANOVA, ANCOVA, repeated measurements ANOVA
and other multivariate normal linear models, latent class analysis and
contingency tables analysis), free software is made available through
www.fss.uu.nl/ms/informativehypotheses. The package called
BIEMS can handle any type of constrained hypothesis for the
multivariate normal linear model, and is the one used in this paper.

Table 1. Sample size (N), mean (M), and standard deviation (SD) for bullying in the four sociometric status groups and the results of a classical analysis
(ANOVA and pairwise comparisons)

                     Boys (B)                 Girls (G)
                     N     M        SD        N     M        SD
Preferred (P)        25    −0.11a   0.56      25    0.18a    1.45
Rejected (R)         26    0.24a    0.83      26    0.23a    1.58
Neglected (N)        27    −0.15a   0.54      27    −0.35a   0.30
Controversial (C)    6     −0.14a   0.22      6     2.88b    0.17

ANOVA results
Gender               F(1,160) = 18.56, p < .001
Status               F(3,160) = 9.85, p < .001
Gender*Status        F(3,160) = 9.68, p < .001

Note: Follow-up comparisons were performed separately by gender, using a Bonferroni correction. Group means sharing the same subscript are not significantly different.

Table 2. Bayes factors for each hypothesis with HA (BFi,A) and posterior
model probabilities (PMP)

Hypothesis                                                    BFi,A    PMP
HLee: μBR > (μBP, μBN, μBC) and μGC > (μGP, μGR, μGN)         10.5     .22
Hphysical: μBR > μGC > μBC > μGR > (μBN, μGN, μBP, μGP)       0.0      .00
Hrelational: μGC > μBR > μGR > μBC > (μBN, μGN, μBP, μGP)     38.3     .78

The recently added Windows interface to the BIEMS program, in
combination with an elaborate tutorial, makes the use of this software very
easy. Also, the other packages (e.g., ContingencyTable for the anal-
ysis of hypotheses including constraints on cell probabilities or odds
ratios in contingency tables, and ConfirmatoryANOVA for Bayesian
and other approaches for (in)equality-constrained analysis of var-
iance models), come with tutorials and examples, enabling research-
ers to apply these methods without the need to fully understand all
technical details. Finally, the hesitation referring to the review and
publication process is understandable given the ‘‘publish or perish’’
culture that we have to deal with. However, science is and should
be innovative by nature, and this must also include additions to the
ever growing statistical toolbox. Reluctance to submit or accept
manuscripts reporting results obtained using novel methods would
slow down the scientific progress. It is reassuring that our experi-
ences so far are mainly positive. Journals that published psychologi-
cal applications which used the Bayesian approach for informative
hypotheses include Behavior Modification (Van Well et al., 2008),
Child Development (Meeus et al., 2010), Developmental Psychology
(Bullens, Klugkist, & Postma, in press), European Journal of Devel-
opmental Psychology (Laudy et al., 2005), Experimental Brain
Research (Kammers et al., 2009), Self & Identity (Van de Schoot
& Wong, in press), and we expect more to come.
In all, we have shown that there is an alternative to NHT that
could be beneficial for psychological researchers. Our question to
these researchers is: Do you test what you want to know?
Acknowledgment
The authors are grateful to Harald Kunst for his helpful comments
on an earlier version of this article.
References
Akaike, H. (1974). A new look at the statistical model identification.
IEEE Transactions on Automatic Control, 19, 716–723.
Berger, J.O., & Pericchi, L.R. (1996). The intrinsic Bayes factor for
model selection and prediction. Journal of the American Statistical
Association, 91, 109–122.
Berger, J.O., & Pericchi, L.R. (2004). Training samples in objec-
tive Bayesian model selection. The Annals of Statistics, 32,
841–869.
Berger, J.O., & Sellke, T. (1987). Testing a point null hypothesis: The
irreconcilability of p values and evidence. Journal of the American
Statistical Association, 82, 112–122.
Bullens, J., Klugkist, I., & Postma, A. (in press). The role of local and
distal landmarks in the development of object location memory.
Developmental Psychology.
Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of
the American Statistical Association, 90, 1313–1321.
Cohen, J. (1990). Things I have learned (so far). American Psycholo-
gist, 45, 1304–1312.
Cohen, J. (1994). The earth is round (p<.05). American Psychologist,
49, 997–1003.
Cumming, G. (2005). Understanding the average probability of repli-
cation. Comment on Killeen (2005). Psychological Science, 16,
1002–1004.
Cumming, G. (2008). Replication and p Intervals. p values predict
the future only vaguely, but confidence intervals do much better.
Perspectives on Psychological Science, 3, 286–300.
Cumming, G. (2010). Replication, prep, and confidence intervals:
Comment prompted by Iverson, Wagenmakers, and Lee
(2010); Lecoutre, Lecoutre, and Poitevineau (2010); and
Maraun and Gabriel (2010). Psychological Methods, 15,
192–198.
Doros, G., & Geier, A.B. (2005). Probability of replication revisited.
Comment on ‘‘An alternative to null-hypothesis significance tests’’.
Psychological Science, 16, 1005–1006.
Fidler, F., Thomason, N., Cumming, G., Finch, S., & Leeman, J. (2004).
Editors can lead researchers to confidence intervals, but can’t make
them think. Statistical reform lessons from medicine. Psychological
Science, 15, 119–126.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian
Data Analysis (2nd ed.). London: Chapman & Hall.
Harlow, L.L., Mulaik, S.A., & Steiger, J.H. (Eds.). (1997). What If
There Were No Significance Tests? Mahwah, NJ: Lawrence
Erlbaum Associates.
Hoijtink, H. (2001). Confirmatory latent class analysis: Model selection
using Bayes factors and (pseudo) likelihood ratio statistics. Multi-
variate Behavioral Research, 36, 563–588.
Hoijtink, H., Klugkist, I., & Boelen, P.A. (Eds.). (2008). Bayesian
evaluation of informative hypotheses. New York: Springer.
Howard, G. S., Maxwell, S. E., & Fleming, K.J. (2000). The proof of
the pudding: An illustration of the relative strengths of null hypoth-
esis, meta-analysis, and Bayesian analysis. Psychological Methods,
5, 315–332.
Iverson, G.J., Wagenmakers, E.-J., & Lee, M.D. (2010). A model-
averaging approach to replication: The case of prep. Psychological
Methods, 15, 172–181.
Kammers, M., Mulder, J., de Vignemont, F., & Dijkerman, H. C.
(2009). The weight of representing the body. Addressing the poten-
tially indefinite number of body representations in healthy individ-
uals. Experimental Brain Research, 204, 333–342.
Kass, R.E., & Raftery, A.E. (1995). Bayes Factors. Journal of the
American Statistical Association, 90, 773–795.
Killeen, P.R. (2005a). An alternative to null-hypothesis significance
tests. Psychological Science, 16, 345–353.
Killeen, P.R. (2005b). Replicability, Confidence intervals, and priors.
Psychological Science, 16, 1009–1012.
Killeen, P.R. (2010). prep replicates: Comment prompted by Iverson,
Wagenmakers, and Lee (2010); Lecoutre, Lecoutre, and Poitevi-
neau (2010); and Maraun and Gabriel (2010). Psychological Meth-
ods, 15, 199–202.
Klugkist, I., & Hoijtink, H. (2007). The Bayes factor for inequality and
about equality constrained models. Computational Statistics and
Data Analysis, 51, 6367–6379.
Klugkist, I., Laudy, O., & Hoijtink, H. (2005). Inequality constrained
analysis of variance: A Bayesian approach. Psychological Methods,
10, 477–493.
Klugkist, I., Laudy, O., & Hoijtink, H. (2010). Bayesian evaluation of
inequality and equality constrained hypotheses for contingency
tables. Psychological Methods, 15, 281–299.
Krantz, D.H. (1999). The null hypothesis testing controversy in
psychology. Journal of the American Statistical Association, 94,
1372–1381.
Krueger, J. (2001). Null hypothesis significance testing: On the survival
of a flawed method. American Psychologist, 56, 16–26.
Laudy, O., & Hoijtink, H. (2007). Bayesian methods for the analysis of
inequality constrained contingency tables. Statistical Methods in
Medical Research, 16, 123–138.
Laudy, O., Boom, J., & Hoijtink, H. (2004). Bayesian computational
methods for inequality constrained latent class analysis. In A. van
der Ark, M. Croon & K. Sijtsma (Eds.), New developments in
categorical data analysis for the social and behavioral sciences
(pp. 63–83). Mahwah, NJ: Erlbaum.
Laudy, O., Zoccolillo, M., Baillargeon, R., Boom, J., Tremblay, R., &
Hoijtink, H. (2005). Applications of confirmatory latent class anal-
ysis in developmental psychology. European Journal of Develop-
mental Psychology, 2, 1–15.
Lecoutre, B., Lecoutre, M.-P., & Poitevineau, J. (2010). Killeen’s
probability of replication and predictive probabilities: How to
compute, use, and interpret them. Psychological Methods, 15,
158–171.
Lee, E. (2009). The relationship between aggression and bullying to
social preference: Differences in gender and types of aggression.
International Journal of Behavioral Development, 33, 323–330.
Lykken, D.T. (1991). What’s wrong with psychology anyway? In D.
Cicchetti & W.M. Grove (Eds.), Thinking clearly about psychology:
Vol. 1. Matters of public interest: Essays in honor of Paul Everett
Meehl (pp. 3–39). Minneapolis, MN: University of Minnesota Press.
MacDonald, R.R. (2005). Why replication probabilities depend on prior
probability distributions. A rejoinder to Killeen (2005). Psychologi-
cal Science, 16, 1007–1008.
Maraun, M., & Gabriel, S. (2010). Killeen’s (2005) prep coefficient:
Logical and mathematical problems. Psychological Methods, 15,
182–191.
Meehl, P.E. (1967). Theory-testing in psychology and physics: A meth-
odological paradox. Philosophy of Science, 34, 103–115.
Meeus, W., Van de Schoot, R., Keijsers, L., Schwartz, S.J., & Branje, S.
(2010). On the progression and stability of adolescent identity for-
mation. A five-wave longitudinal study in early-to-middle and
middle-to-late adolescence. Child Development, 81, 1565–1581.
Mulder, J., Hoijtink, H., & Klugkist, I. (2010). Equality and inequality
constrained multivariate linear models: Objective model selection
using constrained posterior priors. Journal of Statistical Planning
and Inference, 140, 887–906.
Myung, J. (2000). The Importance of complexity in model selection.
Journal of Mathematical Psychology, 44, 190–204.
Nickerson, R.S. (2000). Null hypothesis significance testing: A review
of an old and continuing controversy. Psychological Methods, 5,
241–301.
Perez, J.M., & Berger, J.O. (2002). Expected-posterior prior distribu-
tions for model selection. Biometrika, 89, 491–511.
Rosenthal, R., Rosnow, R.L., & Rubin, D.B. (2000). Contrasts and
effect sizes in behavioral research: A correlational approach.
Cambridge University Press.
Royall, R.M. (1997). Statistical evidence. A likelihood paradigm.
New York, NY: Chapman & Hall.
Schmidt, F.L. (1996). Statistical significance testing and cumulative
knowledge in psychology: Implications for training of researchers.
Psychological Methods, 1, 115–129.
Schmidt, F.L., & Hunter, J.E. (1997). Eight common but false objec-
tions to the discontinuation of significance testing in the analysis
of research data. In: What If There Were No Significance Tests?
(pp. 37–64) Mahwah, NJ: Lawrence Erlbaum Associates.
Serlin, R.C. (2010). Regarding prep: Comment prompted by Iverson,
Wagenmakers, and Lee (2010); Lecoutre, Lecoutre, and Poitevi-
neau (2010); and Maraun and Gabriel (2010). Psychological Meth-
ods, 15, 203–208.
Van de Schoot, R., Hoijtink, H., Mulder, J., Van Aken, M.A.G., Orobio
de Castro, B., Meeus, W. & Romeijn, J.-W. (2011). Evaluating
expectations about negative emotional states of aggressive boys
using Bayesian model selection. Developmental Psychology, 47,
203–212.
Van de Schoot, R., & Wong, T. (in press). Do antisocial young adults
have a high or a low level of self-concept? Self & Identity.
Van Well, S., Kolk, A.M., & Klugkist, I.G. (2008). Effects of sex,
gender role identification, and gender relevance of two types of
stressors on cardiovascular and subjective responses: Sex and gen-
der match/mismatch effects. Behavior Modification, 32, 427–449.
Van Wesel, F., Boeije, H., & Hoijtink, H. (in press). Use of hypotheses
for analysis of variance models: Challenging the current practice.
Quality and Quantity.
Van Wesel, F., Hoijtink, H., & Klugkist, I. (in press). Choosing priors
for inequality constrained analysis of variance: Methods based on
training data. Scandinavian Journal of Statistics.
Wilkinson, L., & Task Force on Statistical Inference (1999). Statistical
methods in psychology journals: Guidelines and explanations.
American Psychologist, 54, 594–604.
Appendix
Here, we briefly outline the Bayesian approach applied in this
paper for the analysis of variance (ANOVA) model:
\[
y_i = \sum_{j=1}^{J} \mu_j d_{ji} + \varepsilon_i ,
\]
where yi is the outcome of person i (i = 1, . . . , n), dji denotes group
membership for j = 1, . . . , J groups (where dji = 1 if person i is a
member of group j, and zero otherwise), and μj is the mean of
group j. The residuals εi are assumed to be independent and normally
distributed with mean zero and variance σ².
The ANOVA model has the following likelihood:
\[
f(y \mid D, \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
\exp\left( -\frac{1}{2\sigma^2} \left( y_i - \left[ \sum_{j=1}^{J} \mu_j d_{ji} \right] \right)^2 \right),
\]
where y = {y1, . . . , yn}, D = {d1, . . . , dJ}, dj = {dj1, . . . , djn}, and μ = {μ1, . . . , μJ}.
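The likelihood can be evaluated on the log scale with a few lines of code. This is a minimal sketch with a hypothetical function name; instead of the dummy variables dji it uses a group index per person, which selects the same group mean.

```python
import math

def anova_log_likelihood(y, group, mu, sigma2):
    """Log-likelihood of the ANOVA model.

    y      : outcomes y_i
    group  : group index (0, ..., J-1) of each person, replacing the dummies d_ji
    mu     : list of the J group means
    sigma2 : residual variance
    """
    n = len(y)
    sum_sq = sum((yi - mu[g]) ** 2 for yi, g in zip(y, group))
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sum_sq / (2 * sigma2)

# Hypothetical data: four persons in two groups.
y = [0.2, -0.1, 1.1, 0.9]
group = [0, 0, 1, 1]

ll_close = anova_log_likelihood(y, group, mu=[0.0, 1.0], sigma2=1.0)
ll_far = anova_log_likelihood(y, group, mu=[1.0, 0.0], sigma2=1.0)
print(f"means close to the data:  {ll_close:.3f}")
print(f"means far from the data:  {ll_far:.3f}")
```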
Prior based on training data
The method and software used for the Bayesian analysis is based on
Van Wesel, Hoijtink and Klugkist (in press). In this paper, a thor-
ough investigation of different priors that can be used for the anal-
ysis of informative hypotheses (Hi) in the context of ANOVA is
presented. The method is based on the use of an encompassing
prior, that is, a (low informative) prior is specified for the uncon-
strained hypothesis HA and the prior distributions for the con-
strained hypotheses can be derived by truncation of the prior
parameter space, using:
\[
g(\mu, \sigma^2 \mid H_i) = \frac{g(\mu, \sigma^2 \mid H_A)\, I_{H_i}}
{\int g(\mu, \sigma^2 \mid H_A)\, I_{H_i}\, d\mu\, d\sigma^2},
\]
where IHi is an indicator function with value one if the means are in
agreement with Hi, and zero otherwise.
The specification of the unconstrained prior g(μ, σ² | HA) is based
on training data (Berger & Pericchi, 1996, 2004; Perez & Berger,
2002). A training sample is a small part of the data that can be used
to update the reference prior for the ANOVA model, 1/σ²
(Bernardo, 1979), such that the resulting posterior is proper but also low
informative and objective (i.e., no subjective information is used).
In the approaches described in the references above, multiple train-
ing samples are used and the results are combined in different
ways. Van Wesel et al. (in press) proposed a prior that is based
on the same principles but tailored for constrained hypotheses and
less computer intensive (i.e., faster). This prior is called the aver-
age constrained posterior prior (ACPP). For a detailed explanation
and elaborate motivation for this prior we refer to the original
paper.
The general form of the ACPP is:
\[
g_{ACPP}(\mu, \sigma^2 \mid H_A) = N(\mu \mid m, S) \times \text{Inv-}\chi^2(\sigma^2 \mid \nu, \kappa^2),
\]
where N(.) denotes the multivariate normal distribution with a
mean parameter and covariance matrix, and Inv-χ²(.) denotes the
scaled inverse chi-square distribution with the degrees of freedom
and a scale parameter.
Posterior
The posterior distribution based on the ACPP is:
\[
h_{ACPP}(\mu, \sigma^2 \mid y, D, H_A) \propto f(y \mid D, \mu, \sigma^2) \times N(\mu \mid m, S) \times \text{Inv-}\chi^2(\sigma^2 \mid \nu, \kappa^2).
\]
Bayes factors
The Bayes factor comparing two hypotheses is the ratio of two mar-
ginal likelihoods. A marginal likelihood, for instance m(y | HA), is
the density of the data averaged over the prior distribution of HA.
Chib (1995) noted that for the estimation of the marginal likelihood
it can be useful to use the expression (imputing our choice of prior
and subsequent posterior):
\[
m(y \mid H_A) = \frac{f(y \mid D, \mu, \sigma^2)\, g_{ACPP}(\mu, \sigma^2 \mid H_A)}
{h_{ACPP}(\mu, \sigma^2 \mid y, D, H_A)}.
\]
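The identity holds for any value of the parameters, which can be checked numerically in a simple case. The sketch below does not use the ACPP; it assumes a single mean θ with known residual variance and a conjugate normal prior, so that the posterior is available in closed form. Function names, prior settings and data are illustrative assumptions.

```python
import math

def log_normal_pdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def log_marginal_chib(y, theta, prior_mean, prior_var, sigma2):
    """Log marginal likelihood as log-likelihood + log-prior - log-posterior at theta."""
    n = len(y)
    post_var = 1.0 / (1.0 / prior_var + n / sigma2)               # conjugate update
    post_mean = post_var * (prior_mean / prior_var + sum(y) / sigma2)
    log_lik = sum(log_normal_pdf(yi, theta, sigma2) for yi in y)
    log_prior = log_normal_pdf(theta, prior_mean, prior_var)
    log_post = log_normal_pdf(theta, post_mean, post_var)
    return log_lik + log_prior - log_post

y = [0.3, -0.1, 0.8, 0.4]   # hypothetical data

# The result should not depend on the point at which the identity is evaluated.
at_zero = log_marginal_chib(y, theta=0.0, prior_mean=0.0, prior_var=4.0, sigma2=1.0)
at_one = log_marginal_chib(y, theta=1.0, prior_mean=0.0, prior_var=4.0, sigma2=1.0)
print(at_zero, at_one)
```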
Subsequently, Klugkist and Hoijtink (2007) derived that in the
context of encompassing priors (i.e., the constrained model is
nested in the unconstrained), the Bayes factor comparing an infor-
mative hypothesis Hi with the unconstrained HA (BFi,A) reduces to
the ratio of two proportions: the proportion of the unconstrained
posterior distribution in agreement with the constraints of Hi, and
the proportion of the unconstrained prior distribution in agreement
with the constraints of Hi. These proportions are estimated using
(MCMC) sampling methods.
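The ratio of proportions can be sketched for a simplified case. The code below assumes two means with a known standard error and independent normal priors centred at zero under HA, so that the posterior can be sampled directly rather than by MCMC. This is not the ACPP-based procedure implemented in the software; all names and numerical settings are illustrative assumptions.

```python
import math
import random

def proportion_in_agreement(draws, constraint):
    """Proportion of parameter draws that satisfy the constraints of Hi."""
    return sum(1 for draw in draws if constraint(*draw)) / len(draws)

def encompassing_bayes_factor(ybars, se, prior_sd, constraint, n_draws=20000, seed=1):
    """Estimate BF(i, A) as the posterior proportion divided by the prior proportion."""
    rng = random.Random(seed)
    # Draws from the unconstrained prior: independent N(0, prior_sd^2) per mean.
    prior_draws = [tuple(rng.gauss(0.0, prior_sd) for _ in ybars) for _ in range(n_draws)]
    # Unconstrained posterior per mean (conjugate normal update, known se).
    post_var = 1.0 / (1.0 / prior_sd ** 2 + 1.0 / se ** 2)
    post_sd = math.sqrt(post_var)
    post_means = [post_var * ybar / se ** 2 for ybar in ybars]
    post_draws = [tuple(rng.gauss(mean, post_sd) for mean in post_means)
                  for _ in range(n_draws)]
    prior_prop = proportion_in_agreement(prior_draws, constraint)
    post_prop = proportion_in_agreement(post_draws, constraint)
    return post_prop / prior_prop

def first_larger(mu1, mu2):
    return mu1 > mu2   # Hi: mu1 > mu2

bf_supported = encompassing_bayes_factor((1.0, 0.0), 0.3, 10.0, first_larger)
bf_wider_prior = encompassing_bayes_factor((1.0, 0.0), 0.3, 100.0, first_larger)
bf_reversed = encompassing_bayes_factor((0.0, 1.0), 0.3, 10.0, first_larger)

print(f"data agree with Hi, prior sd 10:   {bf_supported:.2f}")
print(f"data agree with Hi, prior sd 100:  {bf_wider_prior:.2f}")
print(f"data contradict Hi, prior sd 10:   {bf_reversed:.3f}")
```

In this setting the prior proportion for μ1 > μ2 is one half whatever the prior variance, which is why the Bayes factor for the inequality-constrained hypothesis changes very little when the prior is made much wider.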
Note that in this approach each Hi is evaluated against HA but
that mutual comparison of, for instance Hi1 with Hi2 is also possi-
ble, using:
\[
BF_{i_1,i_2} = \frac{BF_{i_1,A}}{BF_{i_2,A}}.
\]
Posterior model probabilities
Using a uniform prior on the model space, the posterior model
probabilities for t (t = 1, . . . , T) hypotheses are computed using:
\[
PMP(H_t) = \frac{BF_{t,A}}{\sum_{t'=1}^{T} BF_{t',A}}.
\]