

Methods and measures

Do we know what we test and do we test what we want to know?

Irene Klugkist,1 Floryt van Wesel,2 and Jessie Bullens3

1 Department of Methodology and Statistics, Utrecht University, the Netherlands
2 Department of Methodology, VU University Amsterdam, the Netherlands
3 Helmholtz Research Institute, Experimental Psychology, Utrecht University, the Netherlands

Corresponding author: Irene Klugkist, Department of Methodology and Statistics, Utrecht University, PO Box 80140, 3508 TC Utrecht, the Netherlands. Email: [email protected]

International Journal of Behavioral Development, 35(6), 550–560. © The Author(s) 2011. DOI: 10.1177/0165025411425873

Abstract
Null hypothesis testing (NHT) is the most commonly used tool in empirical psychological research even though it has several known limitations. It is argued that since the hypotheses evaluated with NHT do not reflect the research question or theory of the researchers, conclusions from NHT must be formulated with great modesty, that is, they cannot be stated in a confirmative way. Since confirmation or theory evaluation is, however, what researchers often aim for, we present an alternative approach that is based on the specification of explicit, informative statistical hypotheses. The statistical approach for the evaluation of these hypotheses is a Bayesian model-selection procedure. A non-technical explanation of the Bayesian approach is provided and it will be shown that results obtained with this method give more direct answers to the questions asked and are easier to interpret. An additional advantage of the offered possibility to formulate and evaluate informative hypotheses is that it stimulates researchers to more carefully think through and specify their expectations.

Introduction

There is a long tradition of conducting empirical studies in the field of psychology in general, and in related sub-disciplines (e.g., developmental psychology) more specifically. In contrast to the early philosophical approach in which research was conducted in a rationalistic way, empirical research is evidence-based and consists of a number of subsequent stages. These stages roughly involve (a) establishing a research topic by collecting and organizing existing empirical facts, (b) formulating expectations, hypotheses, and/or research questions, (c) designing the research and collecting data, (d) analyzing the data, and (e) evaluating the results with respect to the formulated hypotheses and expectations.

In this paper we argue that a non-trivial part of psychological (and other) research is rather explicit in the first three stages (i.e., thorough literature search, clearly formulated expectations, and a well thought-through research design), but then a procedure follows in the fourth stage that is not well suited for such explicit designs and hypotheses. Consequently, the last stage, where the results are linked to the hypotheses stated at the start, is not straightforward. We will explain and illustrate this with a study by Lee (2009) about aggression in young children. Note that it is not at all our intention to criticize this particular article; the research described in the paper just serves as a typical example of what is, in our opinion, the almost standard approach of data analysis and subsequent drawing of conclusions.

As is typical in empirical research, Lee starts with an overview of the literature and previous findings on the topic of interest in the introduction of the article. Although Lee's research is more elaborate, here we will focus only on the part where the relation between gender, sociometric status (preferred, rejected, neglected, and controversial), and bullying in fifth-grade children is discussed. The existing literature is used to guide the reader towards the expectations and hypotheses stated at the end of the introduction. One of these expectations is, for instance, that the amount of bullying is especially high in boys with rejected status, while for girls the highest amount of bullying is expected for the controversial status group (Lee, 2009, p. 324).

To investigate this hypothesis (and others that are not discussed here), perceived bullying and peer acceptance and rejection among fifth-grade boys and girls were assessed. The peer acceptance and rejection scores were used to assign children to one of the four sociometric status groups. After the data were collected, analyses were performed. Among other things, a 2 (gender) by 4 (sociometric status) analysis of variance was conducted with bullying as the outcome variable. The results showed a significant main effect of both gender and sociometric status as well as a significant interaction effect of these two factors. Subsequently, post hoc pairwise comparisons were performed for the effect of sociometric status on bullying for boys and girls separately. Results showed that, for boys, the four sociometric status groups did not differ significantly with respect to bullying and that, for girls, only the controversial group differed from the other three groups. With respect to bullying, Lee (2009) concluded that ". . . the present research found no evidence that boys bullied more than girls" (p. 328), and that ". . . controversial sociometric status, especially in girls, showed the highest scores in peer nomination as bullies" (p. 329). Although more is said about the findings on bullying in the discussion, we did not find an explicit reference to the expectations as formulated in the introduction of the paper.

All tests, and thus the subsequent conclusions, were based on null hypothesis testing (NHT). In the example, the three null hypotheses tested are "no main effect of gender", "no main effect of sociometric status", and "no interaction effect of gender and sociometric status". These hypotheses, however, do not reflect the carefully formulated expectation of Lee (2009). In our opinion, NHT is not well suited for theory evaluation because there is a big gap between the researcher's (specific) hypothesis and the (not so specific) hypotheses that are central in NHT. The Bayesian method presented in this paper, on the other hand, is designed to deal with hypotheses that are closely related to the substantive hypotheses. Although both methods (NHT and Bayesian) deliver valid test results, the connection between the results and the original expectations is more straightforward with the Bayesian method than with NHT. For Lee's (2009) paper, too, this may explain why we did not find an explicit discussion of the originally stated expectation alongside the test results and subsequent conclusions.

Limitations of NHT have been pointed out by many authors before us (see, for instance, a review by Nickerson, 2000). Although we will repeat some of the arguments, the main goal of this paper is to introduce the Bayesian alternative approach. This method is developed for the direct evaluation of specific (clearly and explicitly formulated) expectations. Observing that most research in psychology is based on NHT and that NHT does not test the explicit hypotheses often stated at the beginning of the research, we address the question: Do we know what we test and do we test what we want to know?

In the remainder of the paper we will first elaborate on the arguments stated above. This includes a summary of some of the misconceptions that exist around NHT and p-values. In the following section we will present the alternative approach, which does not demand a large change in current research practice and can be useful when theory evaluation or (competing) theory comparison is the main goal. Since Lee (2009) provided a specific hypothesis on the effects of gender and sociometric status on bullying, we compare the results presented in Lee's paper with the results from the proposed Bayesian approach for this hypothesis. The concluding section includes several references for further reading and available software for the novel approach.

Limitations of null hypothesis testing

Null hypothesis testing (NHT) is the most commonly used tool in empirical psychological research even though there are numerous publications describing its limitations. An interesting discussion is provided in a book titled What If There Were No Significance Tests? by Harlow, Mulaik, and Steiger (1997; but see also a critical review article by Krantz, 1999). For a more recent and extensive overview of the literature on (limitations of) NHT, we further refer to Nickerson (2000). Although we too summarize some of the criticisms in the next subsections, we do not aim for yet another paper criticizing the use and misuse of NHT and subsequent p-values. The main goal of this paper is to inform the reader about an alternative that can be used when researchers have specific theories or expectations about the outcomes of their research. Before introducing the novel approach, we will, however, highlight three aspects of NHT that are topics of ongoing debate: the nature of the hypotheses in NHT, the (mis)interpretation of p-values, and (lack of) replicability.

Hypotheses in null hypothesis testing

As the name already suggests, NHT always requires a null hypothesis, stating that a parameter of interest (e.g., a mean μ or a correlation ρ) has a certain value (often zero), or that a set of parameters is exactly equal (e.g., μ1 = μ2 = μ3). Stated differently, the null hypothesis (H0) almost invariably states that "nothing is going on" in the population of interest. Several authors have argued that "absolutely nothing" is highly unlikely, since some small differences between parameters of interest will almost always be present (e.g., Cohen, 1990, 1994; Krueger, 2001; Lykken, 1991). Meehl (1967) also agrees that in some disciplines (like psychology) the null hypothesis is, using his expression, [quasi-] always false, and explains this as follows:

Any dependent variable of interest . . . depends mainly upon a finite number of "strong" variables characteristic of the organism studied . . . plus the influences manipulated by the experimenter. . . . In order for two groups which differ in some identified properties . . . to differ not at all in the "output" variable of interest, it would be necessary that all determiners of the output variable have precisely the same average values in both groups, or else that their values should differ by a pattern of amounts of difference which precisely counterbalance one another to yield a net difference of zero. (p. 108)

He then concludes that this is obviously:

. . . so extremely unlikely that no psychologist or statistician would assign more than a negligibly small probability to such a state of affairs. (p. 108)

Based on the fact that the null hypothesis seems unlikely, Cohen (1990) concludes:

So if the null hypothesis is always false, what's the big deal about rejecting it? (p. 1308)

Also Royall, in a monograph that addresses the problem of interpreting statistical data as evidence (Royall, 1997), wonders if there are "statistical 'null' hypotheses that are scientifically important?" and states:

The answer to the question 'Is the null hypothesis correct?' is always the same – no! . . . If the purpose of experiments were to answer such questions, there would be no point in doing experiments, since we already know the answers. (pp. 79–80)

Royall continues by explaining that the focus should not be on the question whether observations are evidence against the null hypothesis. Instead, the question to ask is whether there are scientifically meaningful alternative hypotheses that are better supported (Royall, 1997, p. 81).

However, the alternative hypothesis in NHT approaches most often states nothing more than "not H0" and is therefore not at all specific or scientifically meaningful. Consider, for instance, a two-way analysis of variance. The null hypotheses in such an analysis state that there are no main effects and no interaction effect of the two factors. For each of the three null hypotheses, the usual alternative hypothesis is that the effect at hand (main or interaction) is present. Nevertheless, a hypothesis stating "an interaction effect is present" is not explicit, and thus not scientifically meaningful: it does not specify the theoretical expectation of the researcher.

Note that not all alternative hypotheses are uninformative. As a counterexample, consider the testing of planned contrasts, where the researcher translates explicit expectations into statistical contrasts (see, for instance, Rosenthal, Rosnow, & Rubin, 2000). The evaluation is, however, still based on an (unrealistic) null hypothesis.


In conclusion, NHT is based on a highly unrealistic null hypothesis and almost invariably an uninformative alternative one. This has the drawback that only limited conclusions can be drawn from NHT approaches. This will be further discussed in the next subsection.

Misinterpretations of p-values

Besides the formulation of the statistical hypotheses, another drawback of NHT is that the interpretation of its main result, the p-value, is not at all straightforward. Researchers seek an answer to the question: To what extent do the data support my hypothesis? But a p-value does not provide that information. Cohen (1994) summarizes the difference in focus as follows: a p-value is the probability of obtaining the observed (or more extreme) data if the null hypothesis is true, Pr(data|H0), whereas the researcher is more interested in the probability of H0 given the observed data, Pr(H0|data). Misinterpretations occur either out of ignorance, or, as stated by Cohen (1994), despite knowledge about the correct interpretation:

. . . it [the p-value] does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does! (p. 997)

In the frequentist (NHT) framework, we get information neither about the probability that the null hypothesis is true nor about the probability that the alternative or research hypothesis is true: in the classical or frequentist framework it is not possible to assign probabilities to hypotheses, as a consequence of the definition of probability as a long-run frequency.

Replicability

Another limitation of p-values, and of what they can (and cannot) tell us about our hypotheses, becomes clear by examining sampling variability and the related issue of whether specific results are replicable. Although psychological researchers are trained in the concept of sampling variability and should be aware of the uncertainty in the results, this is often ignored or underestimated when results are described and conclusions are formulated. Cumming (2008) investigated sampling variability of the p-value by asking the following question: Given an observed (two-tailed) p-value of .05, what are reasonable expected values if the same experiment is replicated? Irrespective of sample size, the 80% replication interval for the p-value in this scenario turns out to be (.00008, .44), implying there is a 10% chance that – in a replication – p < .00008, a 10% chance that p > .44, and an 80% chance that p falls within this interval.
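This instability can be reproduced with a small simulation. The sketch below is our own reconstruction of the argument, not Cumming's code: it assumes a z-test setting, an initial two-tailed p of .05 (z = 1.96), and a flat prior on the true effect, so that the predictive distribution of a replication z-value is normal with mean 1.96 and variance 2 (one unit of variance for the uncertainty about the true effect, one for the replication's own sampling error).

```python
# Sketch of Cumming's (2008) replication interval for p (our
# reconstruction, not Cumming's code). Assumes an initial two-tailed
# p = .05 and a flat prior on the true effect size.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
z_obs = norm.isf(0.05 / 2)                        # z = 1.96 for two-tailed p = .05
z_rep = rng.normal(z_obs, np.sqrt(2), 1_000_000)  # predictive replication z-values
p_rep = norm.sf(z_rep)                            # replication p, original direction

lo, hi = np.percentile(p_rep, [10, 90])
print(f"80% replication interval for p: ({lo:.5f}, {hi:.2f})")
# prints approximately (.00008, .44), the interval quoted above
```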

The need for replication studies has been emphasized by several authors (see, for instance, Howard, Maxwell, & Fleming, 2000; Schmidt, 1996). More recently, Killeen (2005a) proposed to report the statistic prep, an estimate of the probability of replicating an effect according to the following definition of replication: "an effect of the same sign as that found in the original experiment" (p. 346). Killeen's proposal led to a series of (either supporting, elaborating, or criticizing) articles on the topic (Cumming, 2005, 2010; Doros & Geier, 2005; Iverson, Wagenmakers, & Lee, 2010; Killeen, 2005b, 2010; Lecoutre, Lecoutre, & Poitevineau, 2010; MacDonald, 2005; Maraun & Gabriel, 2010; Serlin, 2010), showing that the question of replicability is considered important by many researchers and that some kind of "best practice" has not yet been established or agreed upon. Although we too believe that replication is a key element of science and of the accumulation of knowledge, a further discussion of prep is beyond the message of this paper.

To summarize, despite all efforts for improvement, NHT does not evaluate scientifically meaningful hypotheses and still puts too much weight on often misinterpreted and rather uncertain p-values. Nevertheless, it is still the dominant tool for statistical analyses in psychological research.

Suggestions for solutions

A common answer to some limitations of NHT and p-values is that additional results like effect sizes and confidence intervals must be reported, and that conclusions must be formulated more carefully, that is, with specific attention to the uncertainty in the results. For instance, in 1996, the Task Force on Statistical Inference (TFSI) was convened by the APA Board of Scientific Affairs to elucidate some of the controversial issues surrounding significance testing. This led to guidelines and explanations on how to improve the interpretation and communication of research results (Wilkinson & the TFSI, 1999). Their recommendations indeed included standard reporting of effect sizes and confidence intervals. Unfortunately, compelling researchers to routinely report confidence intervals does not lead to an actual interpretation of them (Fidler, Thomason, Cumming, Finch, & Leeman, 2004). Confidence intervals are often ignored in the discussion of results, or only used to determine whether the hypothesized value is within the interval or not. The latter use of confidence intervals has all the same drawbacks as interpreting a p-value from the NHT procedure (Schmidt & Hunter, 1997).

Compared to NHT, the evaluation of effect sizes (preferably with confidence intervals) is more informative and therefore closer to the theory-based approach we propose. However, the decision whether an effect is meaningful is made after the results are obtained (i.e., based on the observed effect size) and is rather subjective (who decides if an effect of 0.10 is meaningful, and what will a researcher finding an effect of 0.09 conclude?). In our opinion, the formulation of specific expectations before data are collected leads to more objective research and conclusions. In relatively simple set-ups such expectations could be captured in more sharply formulated null hypotheses, for instance, H0: μ = 2 versus HA: μ > 2. The Bayesian alternative to NHT proposed in this paper is based on a similar principle, but the type of specific hypotheses is different (they are introduced in the next section), and more extensive – multiple parameter – hypotheses can be evaluated.

Bayesian evaluation of informative hypotheses

In this section a non-technical explanation of the Bayesian approach for the evaluation of informative hypotheses is provided. First, we will elaborate on the types of hypotheses that can be formulated by the researcher; subsequently, the analysis of these hypotheses using Bayesian model selection is presented.

Informative hypotheses

Researchers aim to design their research in such a way that it is possible to evaluate explicit theories, which lead to specific expectations about the outcomes of the data analysis. In the research presented by Lee (2009), for example, it was expected that the amount of bullying is especially high in boys with rejected status, while for girls the highest amount of bullying is expected for the controversial status group. Hypotheses such as these can be translated into informative hypotheses, that is, hypotheses that give direction, in terms of inequality constraints (i.e., "larger than", "better than", etc.), to the researcher's expectations. With μ denoting the mean level of bullying, subscripts B and G denoting boys and girls, and subscripts P, R, N, and C denoting preferred, rejected, neglected, and controversial status, respectively, the hypothesis in our example can be translated into:

HLee: μBR > (μBP, μBN, μBC) and μGC > (μGP, μGR, μGN).

To extend the illustration of Bayesian model selection, we added two competing informative hypotheses to the research hypothesis stated by Lee:

Hphysical: μBR > μGC > μBC > μGR > (μBN, μGN, μBP, μGP),
Hrelational: μGC > μBR > μGR > μBC > (μBN, μGN, μBP, μGP).

The first hypothesis is based on the expectation that physical aggression has more impact on being appointed as a bully and that boys show more physical aggression than girls. Combined with the information in HLee concerning higher bullying in the rejected and controversial groups, this leads to Hphysical. In a similar vein, a competing expectation could state that especially relational aggression (which is assumed to be higher in girls) is related to bullying, which could lead to Hrelational. Both hypotheses are loosely based on information provided by Lee in the introduction, but bear in mind that we are no experts in this area: the hypotheses serve to illustrate the statistical method and are not to be interpreted from a substantive point of view.

With the Bayesian model selection procedure, informative hypotheses (Hi) formulated using inequalities or a mix of inequalities and equalities can be evaluated. The researcher decides on a set of hypotheses that he/she wants to confront with the data, and the method returns the relative support provided by the data for each hypothesis in the form of posterior model (hypothesis) probabilities. The approach therefore provides answers in terms of (Bayesian) probabilities assigned to hypotheses, enabling easier interpretation of test results and preventing dichotomous decisions (i.e., "significant" or "not significant").

An additional advantage of the proposed method is that several informative hypotheses can be mutually compared. Note that the traditional null and alternative hypotheses can, but do not necessarily have to, be included as hypotheses of interest. In case the researcher decides to include (one of) them, they do not play a central role in the analysis: every (informative) hypothesis is evaluated against all others.

The analysis of informative hypotheses

To be able to support the explanation of the analysis with graphical representations, here we will limit ourselves to a two-parameter example. Generalizations and technical details are provided in the Appendix. The ingredients required for the analysis will be elaborated for the hypotheses Hi: μ1 > μ2, H0: μ1 = μ2, and HA: μ1, μ2 (the comma in HA denotes that no constraints are imposed), where the μ's represent independent group means for data that are assumed to be normally distributed and the residual variance is assumed to be known.

Prior distributions for model parameters

Specification of a prior distribution for the model parameters is an essential part of a Bayesian analysis and makes it fundamentally different from a classical, frequentist approach. Within the frequentist framework, parameters are assumed fixed (although unknown), whereas Bayesians treat parameters as random and use probability distributions to reflect knowledge or uncertainty about them. Knowledge about the parameters before data collection is reflected in the prior distribution.

We assume vague, independent and identically distributed priors for μ1 and μ2, so that the prior distribution is dominated by the data. For instance, for an outcome of interest that is measured on a 0–7 scale, a prior distribution for the mean could be a uniform distribution on the scale 0–7, that is, U(0,7). Such a prior states that before seeing the data each value of μ is equally likely and can therefore be considered objective (no subjective information about μ is incorporated). For parameters that are constrained in one or more hypotheses, equal prior distributions are used (again to avoid subjective input in the analysis).

Usually, when several models are compared, a prior distribution for the model-specific parameters must be defined for each model (Gelman et al., 2004, pp. 184–186). However, the hypotheses (leading to different models in the model selection procedure) discussed in this paper have a special feature: all informative hypotheses, Hi, are nested in an unconstrained hypothesis (i.e., HA). This is illustrated in Figure 1, where the white square on the right represents a prior distribution for the statistical model consisting of two parameters, μ1 and μ2, without any constraints, that is, the prior for both μ1 and μ2 is U(0,7). The height of the prior density (not visible in the two-dimensional plot) is equal for all combinations of μ1 and μ2 within the square. Note that here we choose (bounded) uniform priors to keep the explanation and plots simple, but in the software used for Bayesian model selection, other prior distributions are used. Prior specification is discussed briefly later in the paper, while more details are provided in the Appendix (or see also Van Wesel, Klugkist, & Hoijtink, in press).

[Figure 1. Prior parameter distributions for three hypotheses. Panels, left to right: Hi: μ1 > μ2; H0: μ1 = μ2 (= μ); HA: μ1, μ2.]

An important consequence of the nesting of constrained hypotheses in HA can be seen in the other two plots of Figure 1. For instance, in the left-hand plot, the area above the diagonal is greyed out, since in that area μ1 is smaller than μ2, which is in disagreement with the hypothesis. In the analysis, the same initial prior distribution is used for the constrained hypotheses; however, parts that are not in agreement with the constraints are truncated (i.e., the prior density in those areas is set to zero).

Some special attention is required for the null hypothesis, plotted in the middle. Strictly, the whole square should be greyed out, because only the diagonal line (μ1 = μ2) is in agreement with the hypothesis and thus receives non-zero prior density. Computationally, equality constraints are much more difficult to handle than inequality constraints (for more details, see Van Wesel, Hoijtink, & Klugkist, in press). Conceptually, however, the Bayesian procedure can be understood by considering an approximation of the hypothesis, stating that μ1 and μ2 are about equal. This is plotted in Figure 1 as the small white area around the line μ1 = μ2.
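To make the truncation concrete, the following sketch (ours, assuming the U(0,7) priors of this section) samples from the unconstrained prior and measures which fraction of the square survives for each hypothesis; this fraction is the white area in Figure 1.

```python
# Proportion of the unconstrained U(0,7) x U(0,7) prior that satisfies
# each hypothesis' constraints (our sketch of Figure 1; the width 0.2
# of the "about equal" band for H0 is an arbitrary choice).
import numpy as np

rng = np.random.default_rng(1)
mu1, mu2 = rng.uniform(0, 7, size=(2, 1_000_000))

print(np.mean(mu1 > mu2))                # Hi: mu1 > mu2 -> ~0.50 (half the square)
print(np.mean(np.abs(mu1 - mu2) < 0.2))  # H0 as "about equal" -> ~0.06 (narrow band)
```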

The likelihood function to describe the information in the data

The second ingredient in a Bayesian analysis is the data. The sample contains additional information about the parameters of the model (i.e., μ1 and μ2), and this information is reflected in the likelihood function. An example is plotted in Figure 2. Given the observed data, some combinations of values of μ1 and μ2 are more likely (e.g., μ1 = μ2 = 4) and others are highly unlikely (e.g., μ1 = μ2 = 0, or μ1 = 7 and μ2 = 0). The top of the likelihood function shows which values of μ1 and μ2 are the most likely given the data at hand and is also known as the maximum likelihood estimate (the most common estimator in frequentist statistics).

[Figure 2. Likelihood function.]

Using the likelihood function to describe the information the data provide about the parameters is standard in both frequentist and Bayesian methods. Typical of the Bayesian approach, however, is that this information is combined with the information in the prior distribution. This brings us to the so-called marginal likelihood.

Marginal likelihood of a hypothesis conditional on data

In Bayesian model selection, the marginal likelihood is computed for each model or hypothesis under consideration and represents the likelihood of the hypothesis conditional on the data. Loosely stated, it is the average data density (height in Figure 2) weighted with the prior density (height of the prior; see also Figure 1). For our application with constrained parameters, the marginal likelihood is illustrated in Figure 3, where priors and likelihood functions are put together for several possible outcomes. The square denotes the prior distribution for the unconstrained HA (equal height for the entire square); the circles are the two-dimensional representation of the likelihood function, with smaller circles denoting points with higher likelihoods (and the maximum in the middle of the smallest circle).
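In symbols, with θ = (μ1, μ2), likelihood f(y | θ), and prior π_i(θ) under hypothesis H_i, the marginal likelihood is exactly this prior-weighted average of the data density:

\[
m(y \mid H_i) \;=\; \int f(y \mid \theta)\, \pi_i(\theta)\, \mathrm{d}\theta.
\]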

The three plots in the top row show the results for a data set with a larger sample mean for group 1 than for group 2 (e.g., 4 and 2.5, respectively). This is, therefore, an example of a data set that is in agreement with Hi: μ1 > μ2. The average data density in the left-hand plot (Hi) will be higher than the average data density in the right-hand plot (HA). For Hi, due to the constraint, the area with relatively low density (upper-left triangle) is excluded (the prior density is zero) and, consequently, the average density of the remaining area (lower-right triangle) is higher than when the entire square is evaluated (as is done for HA). The plot in the centre (H0) excludes the area with the highest values of the likelihood function and thus takes the average of relatively low values of the data density. The marginal likelihood will be highest for Hi, reflecting the support in the data for this hypothesis. Compared to HA, a smaller marginal likelihood, and thus no support, will be obtained for H0.

The second row presents the likelihood function of data with equal sample means. In a similar vein, the marginal likelihood in the centre plot (H0) will now be higher than in the right-hand plot (HA), correctly showing that the data do support H0. The marginal likelihood in the left-hand plot will be about equal to that in the right-hand plot, because the area that is truncated (upper-left triangle) includes both high and low data densities and is similar to the area that remains (lower-right triangle). The data show no support for Hi, but also no clear refutation.

This is different in the last row, where the sample means are the opposite of what was expected under Hi (i.e., the first mean is smaller than the second). For both Hi and H0 the large data densities fall in the greyed-out area and do not contribute to the marginal likelihood, since the prior density in that area is zero. The marginal likelihood of HA will be the largest, implying that both constrained hypotheses are not supported by the data.

Bayes factor as a model selection criterion

In model selection, two elements are combined: "fit" (how well does the model describe the data) and "complexity" or "parsimony" (the size or number of parameters of the model). A complexity correction is important because the model with the best fit is not necessarily the model with the best predictive quality (see, for instance, Myung, 2000). The unconstrained hypothesis, for example, always has a good fit; that is, there is no combination of values of μ1 and μ2 that is not in accordance with HA. However, that does not imply that HA is a useful hypothesis. In fact, it predicts that anything goes, which is scientifically not very interesting. In terms of Myung (2000), the predictive quality of HA is very small because it is not parsimonious.

In non-Bayesian model selection criteria such as, for instance, Akaike's Information Criterion (AIC; Akaike, 1974), model complexity is incorporated as an explicit penalty term, which is a function of the number of parameters. A model with inequality constraints among the parameters, however, is more parsimonious than its unconstrained counterpart without reducing the number of parameters. A criterion is therefore needed that takes the complexity (size) of the model into account without (solely) basing this on the number of parameters involved in the model. The marginal likelihood does exactly that. The fit of HA can never be worse than the fit of a hypothesis nested in HA (e.g., Hi or H0); nevertheless, the marginal likelihood of Hi in the top row of Figure 3 is larger than the marginal likelihood of HA in the top row. In the marginal likelihood, the correction for model complexity is implicit and automatic (also known as Ockham's razor) and does not require the number of parameters as explicitly stated input to the procedure.

[Figure 3. Priors and likelihood functions combined to obtain the marginal likelihood. Top row: data in agreement with Hi; second row: data in agreement with H0; bottom row: data not in agreement with either H0 or Hi.]

The Bayesian model selection criterion is the Bayes factor (BF; Kass & Raftery, 1995), which is the ratio of two marginal likelihoods. To evaluate informative hypotheses, each is first compared with the unconstrained HA; that is, BFi,A is computed for each Hi. BFi,A will be larger than 1 if the data support the constraints of the informative hypothesis, and smaller than 1 otherwise. From this set of BFi,A values (one for each hypothesis of interest), the relative support for each of the informative hypotheses can be obtained when they are mutually compared. Using a numerical example, let's say BF1,A (i.e., the Bayes factor comparing H1 with HA) = 10, BF2,A = 2, and BF3,A = 0.5. These numbers show that the data support H1 and H2 (Bayes factor > 1), but not H3 (Bayes factor < 1). Furthermore, compared to HA, the support for H1 is 10 times stronger, and the support for H2 is 2 times stronger. A direct comparison of H1 against H2 is obtained via BF1,2 = BF1,A/BF2,A, that is, BF1,2 = 10/2 = 5, implying that the support for H1 is 5 times stronger than for H2. Similarly, we can mutually compare H1 with H3 (BF1,3 = 10/0.5 = 20) and H2 with H3 (BF2,3 = 2/0.5 = 4).

The (Bayesian) probability of a hypothesis

When more than two hypotheses are under investigation, interpretation of results is easier after translation of the Bayes factors into posterior model probabilities (PMPs). A posterior model probability represents the relative support for a hypothesis within a certain set of hypotheses. Posterior model probabilities are numbers between 0 and 1, and all posterior model probabilities in a set add up to 1. In order to calculate posterior model probabilities, first the prior model probabilities must be specified. They represent the relative support given to each hypothesis before data collection. Note the distinction between priors for hypotheses (needed to obtain posterior model probabilities) and the prior distributions of the model parameters that were specified and discussed earlier.

An objective choice is to specify the same prior probability for each hypothesis under consideration. For T hypotheses we use 1/T as the prior probability of each hypothesis. With this specification it is easy to compute posterior model probabilities from a set of Bayes factors, namely as the ratio of BFi,A to the sum of all BFi,A. For the numerical illustration, the resulting posterior model probabilities are 10/(10 + 2 + 0.5) = 0.80 for H1, 2/(10 + 2 + 0.5) = 0.16 for H2, and 0.5/(10 + 2 + 0.5) = 0.04 for H3.
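The step from Bayes factors to posterior model probabilities is simple bookkeeping; a minimal sketch reproducing the numerical illustration (equal prior model probabilities assumed):

```python
import numpy as np

bf = np.array([10.0, 2.0, 0.5])  # BF(1,A), BF(2,A), BF(3,A) from the example
pmp = bf / bf.sum()              # equal prior model probabilities of 1/T
print(pmp)                       # -> [0.8  0.16 0.04]
```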

The posterior model probabilities show that, in a Bayesian framework (in contrast to the frequentist point of view), one can speak of the probability that a hypothesis is true, because within this framework "probability" is defined as a degree of belief. Computation of the Bayesian probability of a specific H0 given a certain (frequentist) p-value shows that the difference between the two is substantial. For instance, for data yielding a p-value of .05, the lower-bound estimate of Pr(H0|data), under some general assumptions (e.g., default vague prior information), is about .30, implying only weak evidence against the null hypothesis (Berger & Sellke, 1987). Therefore, the formulation of confirmative conclusions (e.g., "support was found for . . ." or "results show the effect . . .") based on a rejected null hypothesis (i.e., p < .05) is both fundamentally wrong and an overstatement of the amount of evidence in the data. A posterior model probability does provide the amount of support for a specific model (hypothesis), but in the interpretation of the results one should keep in mind that the definition of "probability" is different for Bayesians than for frequentists.

Subjectivity of results?

An important aspect of a Bayesian analysis is whether the results are sensitive to the (subjective) specification of the prior distributions for the model parameters. For most estimation purposes it is straightforward to specify uninformative priors, and results will not be sensitive to changes in these priors. For Bayesian model selection, however, two different priors that are both uninformative in estimation can give very different marginal likelihoods (and thus Bayes factors and posterior model probabilities). It is therefore important to choose prior distributions carefully and to investigate robustness against changes in the prior (called a prior sensitivity analysis).

In the application to informative hypotheses, previous work has shown that hypotheses specified using only inequality constraints between parameters are not sensitive to the prior, but hypotheses that include equalities are (Klugkist & Hoijtink, 2007). The software developed for the evaluation of informative hypotheses (discussed further at the end of the Discussion) automatically derives priors with good properties for the data at hand. The specification is based on a method where priors are specified using a small part of the observed data (for training data methods, see Berger & Pericchi, 1996, 2004; Perez & Berger, 2002; Van Wesel, Hoijtink, & Klugkist, in press). Users of the software, therefore, do not need to specify (and worry about) the prior distributions. More details about the prior specification are provided in the Appendix.

Results for the example

As we had no access to the original data, we used simulated data based on the descriptive statistics given in Lee (2009). Sample sizes for the four status groups were reported by Lee, but not how boys and girls were distributed over these groups. Based on the remark that children were "fairly evenly distributed across groups" (p. 325; this conclusion was supported with a non-significant chi-square test), we generated data with the reported sample size for each sociometric status and an equal number of boys and girls within all status groups.
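For readers who want to retrace this step, the sketch below shows one way to generate such a data set from the reported descriptives (the cell values are those of Table 1; this is our reconstruction, not the authors' original simulation code):

```python
# Simulate bullying scores per gender-by-status cell from the reported
# group sizes, means, and SDs (values taken from Table 1). Our
# reconstruction of the procedure described above.
import numpy as np

rng = np.random.default_rng(1)
cells = {  # (gender, status): (n, mean, sd)
    ("B", "P"): (25, -0.11, 0.56), ("G", "P"): (25, 0.18, 1.45),
    ("B", "R"): (26,  0.24, 0.83), ("G", "R"): (26, 0.23, 1.58),
    ("B", "N"): (27, -0.15, 0.54), ("G", "N"): (27, -0.35, 0.30),
    ("B", "C"): (6,  -0.14, 0.22), ("G", "C"): (6,   2.88, 0.17),
}
data = {cell: rng.normal(m, sd, size=n) for cell, (n, m, sd) in cells.items()}
```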

Classical analysis

Indeed, our results were similar to Lee's in the 2 (gender) by 4 (sociometric status) ANOVA (see Table 1). The main effects of gender and sociometric status were statistically significant, as was the interaction effect. Post hoc tests revealed no significant differences between the sociometric status groups for the boys, whereas post hoc tests for the girls showed that controversial girls scored significantly higher on bullying than preferred, rejected, and neglected girls.

In terms of the hypotheses as stated by us (HLee, Hphysical, Hrelational), these results provide no clear or direct information about whether the hypotheses are supported or not.

Bayesian analysis

Based on the same data, we performed the Bayesian model selection procedure described above. In this analysis we included the informative hypotheses HLee, Hphysical, and Hrelational that were formulated previously. For each hypothesis, the Bayes factor comparing the hypothesis with the unconstrained hypothesis, HA, shows whether there is support in the data for the constraints (BFi,A > 1) or not (BFi,A < 1). The results are presented in Table 2 and show support for the expectation of Lee (BFLee,A = 10.5). However, Hrelational is an even better hypothesis, with BFrelational,A = 38.3. From these two numbers we can also compute that the support for Hrelational is about 3.5 (i.e., 38.3/10.5) times stronger than for HLee. The last column of Table 2 shows the posterior model probabilities for the mutual comparison of the three hypotheses.

These results provide direct information on each of the informative hypotheses specified. We can conclude that support was found for the theory hypothesized by Lee, and even more for an alternative hypothesis (stated by us, loosely based on information discussed by Lee in the introduction of the paper). We would like to stress again that the hypotheses were formulated for didactical purposes, that is, to illustrate the Bayesian approach, and therefore substantive conclusions cannot be drawn from these results.

Discussion

Many papers demonstrating the limitations and risks of (solely relying on) NHT methods have been published in recent decades (several are referred to throughout this paper; more can be found at, for instance, www.indiana.edu/~stigtsts/refs.html). Although we criticize the use and misuse of NHT and subsequent p-values, the main goal of this paper is to inform the reader about Bayesian evaluation of informative hypotheses as a promising alternative method.

With informative hypotheses and the Bayesian approach to evaluate them, researchers with explicit theories are provided with tools to evaluate their hypotheses directly. In discussing the advantages of this new approach, we have so far mainly focused on giving researchers the answers they would like to have, that is, probabilities of explicitly stated hypotheses. The approach, however, also has other advantages. Formulating informative hypotheses requires a well thought-through research plan: translating explicit ideas into informative hypotheses in terms of (in)equality constraints is a difficult task, and researchers are therefore encouraged to think through their expectations carefully. In doing so, they are explicitly motivated to explore and formulate alternative theories or predicted outcomes. This will stimulate researchers to discuss their "explicit" ideas with colleagues, and to search the literature for previous studies presenting results that might be in conflict with their own hypotheses. Hence, specifying (multiple) informative hypotheses requires clearly stated theoretical predictions, thereby clarifying the goal of the study. It will motivate the search for competing theories, and it will help to communicate research results in a straightforward way.

A remaining question is whether researchers are willing to adopt such a novel approach to hypothesis evaluation. Van Wesel, Boeije, and Hoijtink (in press) interviewed several psychological researchers with different levels of seniority (i.e., from PhD students to full professors) about the role hypotheses play in their research projects, how they formulate their hypotheses, and how they evaluate them. From this study it became clear that many researchers are hesitant to use new and unknown (e.g., Bayesian) methods. Responses that were reported include statements like: "NHT is what everybody uses, why should I use something else", "I am no expert in statistics and will probably not understand this new method", and "Editors and reviewers will probably not be familiar with these methods and therefore I may not get my work published".

With the current paper, and the many citations and references included, we argue that there is a need for methods beyond NHT. We feel that the formulation and evaluation of informative hypotheses can form an important step towards a more consistent and unbiased literature. Since the first paper on Bayesian evaluation of informative hypotheses was published in 2001 (Hoijtink, 2001), much work has been done. This has led to several papers explaining the statistics at a more technical level (e.g., Klugkist & Hoijtink, 2007; Laudy & Hoijtink, 2007; Laudy, Boom, & Hoijtink, 2004; Mulder, Hoijtink, & Klugkist, 2010; Van Wesel, Hoijtink, & Klugkist, in press), but also to a non-technical book written for psychological researchers (Hoijtink, Klugkist, & Boelen, 2008) and some educational papers (Klugkist, Laudy, & Hoijtink, 2005, 2010; Van de Schoot et al., 2011). Also, for (in)equality constrained hypotheses in several statistical models (ANOVA, ANCOVA, repeated measurements ANOVA and other multivariate normal linear models, latent class analysis, and contingency table analysis), free software is made available through www.fss.uu.nl/ms/informativehypotheses. The package called BIEMS can handle any type of constrained hypothesis for the multivariate normal linear model, and is the one used in this paper.

Table 1. Sample size (N), mean (M), and standard deviation (SD) for bullying in the four sociometric status groups, and the results of a classical analysis (ANOVA and pairwise comparisons)

                     Boys (B)                 Girls (G)
                     N    M       SD          N    M       SD
Preferred (P)        25   −0.11a  0.56        25   0.18a   1.45
Rejected (R)         26   0.24a   0.83        26   0.23a   1.58
Neglected (N)        27   −0.15a  0.54        27   −0.35a  0.30
Controversial (C)    6    −0.14a  0.22        6    2.88b   0.17

ANOVA results:
Gender           F(1,160) = 18.56, p < .001
Status           F(3,160) = 9.85, p < .001
Gender*Status    F(3,160) = 9.68, p < .001

Note: Follow-up comparisons were performed separately by gender, using a Bonferroni correction. Group means sharing the same subscript are not significantly different.

Table 2. Bayes factors for each hypothesis against HA (BFi,A) and posterior model probabilities (PMP)

                                                                 BFi,A   PMP
HLee:        μBR > (μBP, μBN, μBC) and μGC > (μGP, μGR, μGN)     10.5    .22
Hphysical:   μBR > μGC > μBC > μGR > (μBN, μGN, μBP, μGP)        0.0     .00
Hrelational: μGC > μBR > μGR > μBC > (μBN, μGN, μBP, μGP)        38.3    .78


The recently added Windows interface to the BIEMS program, in combination with an elaborate tutorial, makes the use of this software very easy. The other packages (e.g., ContingencyTable for the analysis of hypotheses including constraints on cell probabilities or odds ratios in contingency tables, and ConfirmatoryANOVA for Bayesian and other approaches to (in)equality-constrained analysis of variance models) also come with tutorials and examples, enabling researchers to apply these methods without the need to fully understand all technical details. Finally, the hesitation referring to the review and publication process is understandable given the "publish or perish" culture that we have to deal with. However, science is and should be innovative by nature, and this must also include additions to the ever-growing statistical toolbox. Reluctance to submit or accept manuscripts reporting results obtained using novel methods would slow down scientific progress. It is reassuring that our experiences so far are mainly positive. Journals that have published psychological applications using the Bayesian approach for informative hypotheses include Behavior Modification (Van Well et al., 2008), Child Development (Meeus et al., 2010), Developmental Psychology (Bullens, Klugkist, & Postma, in press), European Journal of Developmental Psychology (Laudy et al., 2005), Experimental Brain Research (Kammers et al., 2009), and Self & Identity (Van de Schoot & Wong, in press), and we expect more to come.

In all, we have shown that there is an alternative to NHT that could be beneficial for psychological researchers. Our question to these researchers is: Do you test what you want to know?

Acknowledgment

The authors are grateful to Harald Kunst for his helpful comments on an earlier version of this article.

References

Akaike, H. (1974). A new look at the statistical model identification.

IEEE Transactions on Automatic Control, 19, 716–723.

Berger, J.O., & Pericchi, L.R. (1996). The intrinsic Bayes factor for

model selection and prediction. Journal of the American Statistical

Association, 91, 109–122.

Berger, J.O., & Pericchi, L.R. (2004). Training samples in objec-

tive Bayesian model selection. The Annals of Statistics, 32,

841–869.

Berger, J.O., & Sellke, T. (1987). Testing a point null hypothesis: The

irreconcilability of p values and evidence. Journal of the American

Statistical Association, 82, 112–122.

Bullens, J., Klugkist, I., & Postma, A. (in press). The role of local and

distal landmarks in the development of object location memory.

Developmental Psychology.

Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of

the American Statistical Association, 90, 1313–1321.

Cohen, J. (1990). Things I have learned (so far). American Psycholo-

gist, 45, 1304–1312.

Cohen, J. (1994). The earth is round (p<.05). American Psychologist,

49, 997–1003.

Cumming, G. (2005). Understanding the average probability of repli-

cation. Comment on Killeen (2005). Psychological Science, 16,

1002–1004.

Cumming, G. (2008). Replication and p Intervals. p values predict

the future only vaguely, but confidence intervals do much better.

Perspectives on Psychological Science, 3, 286–300.

Cumming, G. (2010). Replication, prep, and confidence intervals:

Comment prompted by Iverson, Wagenmakers, and Lee

(2010); Lecoutre, Lecoutre, and Poitevineau (2010); and

Maraun and Gabriel (2010). Psychological Methods, 15,

192–198.

Doros, G., & Geier, A.B. (2005). Probability of replication revisited.

Comment on ‘‘An alternative to null-hypothesis significance tests’’.

Psychological Science, 16, 1005–1006.

Fidler, F., Thomason, N., Cumming, G., Finch, S., & Leeman, J. (2004).

Editors can lead researchers to confidence intervals, but can’t make

them think. Statistical reform lessons from medicine. Psychological

Science, 15, 119–126.

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian

Data Analysis (2nd ed.). London: Chapman & Hall.

Harlow, L.L., Mulaik, S.A., & Steiger, J.H. (Eds.). (1997). What If

There Were No Significance Tests? Mahwah, NJ: Lawrence

Erlbaum Associates.

Hoijtink, H. (2001). Confirmatory latent class analysis: Model selection

using Bayes factors and (pseudo) likelihood ratio statistics. Multi-

variate Behavioral Research, 36, 563–588.

Hoijtink, H., Klugkist, I., & Boelen, P.A. (Eds.). (2008). Bayesian

evaluation of informative hypotheses. New York: Springer.

Howard, G. S., Maxwell, S. E., & Fleming, K.J. (2000). The proof of

the pudding: An illustration of the relative strengths of null hypoth-

esis, meta-analysis, and Bayesian analysis. Psychological Methods,

5, 315–332.

Iverson, G.J., Wagenmakers, E.-J., & Lee, M.D. (2010). A model-

averaging approach to replication: The case of prep. Psychological

Methods, 15, 172–181.

Kammers, M., Mulder, J., de Vignemont, F., & Dijkerman, H. C.

(2009). The weight of representing the body. Addressing the poten-

tially indefinite number of body representations in healthy individ-

uals. Experimental Brain Research, 204, 333–342.

Kass, R.E., & Raftery, A.E. (1995). Bayes Factors. Journal of the

American Statistical Association, 90, 773–795.

Killeen, P.R. (2005a). An alternative to null-hypothesis significance

tests. Psychological Science, 16, 345–353.

Killeen, P.R. (2005b). Replicability, Confidence intervals, and priors.

Psychological Science, 16, 1009–1012.

Killeen, P.R. (2010). prep replicates: Comment prompted by Iverson,

Wagenmakers, and Lee (2010); Lecoutre, Lecoutre, and Poitevi-

neau (2010); and Maraun and Gabriel (2010). Psychological Meth-

ods, 15, 199–202.

Klugkist, I., & Hoijtink, H. (2007). The Bayes factor for inequality and

about equality constrained models. Computational Statistics and

Data Analysis, 51, 6367–6379.

Klugkist, I., Laudy, O., & Hoijtink, H. (2005). Inequality constrained

analysis of variance: A Bayesian approach. Psychological Methods,

10, 477–493.

Klugkist, I., Laudy, O., & Hoijtink, H. (2010). Bayesian evaluation of

inequality and equality constrained hypotheses for contingency

tables. Psychological Methods, 15, 281–299.

Krantz, D.H. (1999). The null hypothesis testing controversy in

psychology. Journal of the Amercian Statistical Association, 44,

1372–1381.

Krueger, J. (2001). Null hypothesis significance testing: On the survival

of a flawed method. American Psychologist, 56, 16–26.

Laudy, O., & Hoijtink, H. (2007). Bayesian methods for the analysis of

inequality constrained contingency tables. Statistical Methods in

Medical Research, 16, 123–138.

Laudy, O., Boom, J., & Hoijtink, H. (2004). Bayesian computational

methods for inequality constrained latent class analysis. In A. van

der Ark, M. Croon & K. Sijtsma (Eds.), New developments in

558 International Journal of Behavioral Development

at Aston University - FAST on May 1, 2014jbd.sagepub.comDownloaded from

categorical data analysis for the social and behavioral sciences

(pp. 63–83). Mahwah, NJ: Erlbaum.

Laudy, O., Zoccolillo, M., Baillargeon, R., Boom, J., Tremblay, R., & Hoijtink, H. (2005). Applications of confirmatory latent class analysis in developmental psychology. European Journal of Developmental Psychology, 2, 1–15.
Lecoutre, B., Lecoutre, M.-P., & Poitevineau, J. (2010). Killeen’s probability of replication and predictive probabilities: How to compute, use, and interpret them. Psychological Methods, 15, 158–171.
Lee, E. (2009). The relationship between aggression and bullying to social preference: Differences in gender and types of aggression. International Journal of Behavioral Development, 33, 323–330.
Lykken, D. T. (1991). What’s wrong with psychology anyway? In D. Cicchetti & W. M. Grove (Eds.), Thinking clearly about psychology: Vol. 1. Matters of public interest: Essays in honor of Paul Everett Meehl (pp. 3–39). Minneapolis, MN: University of Minnesota Press.
MacDonald, R. R. (2005). Why replication probabilities depend on prior probability distributions: A rejoinder to Killeen (2005). Psychological Science, 16, 1007–1008.
Maraun, M., & Gabriel, S. (2010). Killeen’s (2005) prep coefficient: Logical and mathematical problems. Psychological Methods, 15, 182–191.
Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103–115.
Meeus, W., Van de Schoot, R., Keijsers, L., Schwartz, S. J., & Branje, S. (2010). On the progression and stability of adolescent identity formation: A five-wave longitudinal study in early-to-middle and middle-to-late adolescence. Child Development, 81, 1565–1581.
Mulder, J., Hoijtink, H., & Klugkist, I. (2010). Equality and inequality constrained multivariate linear models: Objective model selection using constrained posterior priors. Journal of Statistical Planning and Inference, 140, 887–906.
Myung, J. (2000). The importance of complexity in model selection. Journal of Mathematical Psychology, 44, 190–204.
Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241–301.
Perez, J. M., & Berger, J. O. (2002). Expected-posterior prior distributions for model selection. Biometrika, 89, 491–511.

Rosenthal, R., Rosnow, R. L., & Rubin, D. B. (2000). Contrasts and effect sizes in behavioral research: A correlational approach. Cambridge University Press.

Royall, R. M. (1997). Statistical evidence: A likelihood paradigm. New York, NY: Chapman & Hall.
Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1, 115–129.

Schmidt, F. L., & Hunter, J. E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 37–64). Mahwah, NJ: Lawrence Erlbaum Associates.

Serlin, R. C. (2010). Regarding prep: Comment prompted by Iverson, Wagenmakers, and Lee (2010); Lecoutre, Lecoutre, and Poitevineau (2010); and Maraun and Gabriel (2010). Psychological Methods, 15, 203–208.
Van de Schoot, R., Hoijtink, H., Mulder, J., Van Aken, M. A. G., Orobio de Castro, B., Meeus, W., & Romeijn, J.-W. (2011). Evaluating expectations about negative emotional states of aggressive boys using Bayesian model selection. Developmental Psychology, 47, 203–212.
Van de Schoot, R., & Wong, T. (in press). Do antisocial young adults have a high or a low level of self-concept? Self & Identity.
Van Well, S., Kolk, A. M., & Klugkist, I. G. (2008). Effects of sex, gender role identification, and gender relevance of two types of stressors on cardiovascular and subjective responses: Sex and gender match/mismatch effects. Behavior Modification, 32, 427–449.
Van Wesel, F., Boeije, H., & Hoijtink, H. (in press). Use of hypotheses for analysis of variance models: Challenging the current practice. Quality and Quantity.
Van Wesel, F., Hoijtink, H., & Klugkist, I. (in press). Choosing priors for inequality constrained analysis of variance: Methods based on training data. Scandinavian Journal of Statistics.
Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.

Appendix

Here, we briefly outline the Bayesian approach applied in this paper for the analysis of variance (ANOVA) model:

$$y_i = \sum_{j=1}^{J} \mu_j d_{ji} + \epsilon_i,$$

where $y_i$ is the outcome of person $i$ ($i = 1, \ldots, n$), $d_{ji}$ denotes group membership for $j = 1, \ldots, J$ groups (where $d_{ji} = 1$ if person $i$ is a member of group $j$, and zero otherwise), and $\mu_j$ is the mean of group $j$. The residuals $\epsilon_i$ are assumed to be independent and normally distributed with mean zero and variance $\sigma^2$.

The ANOVA model has the following likelihood:

$$f(y \mid D, \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} \left[ y_i - \sum_{j=1}^{J} \mu_j d_{ji} \right]^2 \right),$$

where $y = \{y_1, \ldots, y_n\}$, $D = \{d_1, \ldots, d_J\}$, $d_j = \{d_{j1}, \ldots, d_{jn}\}$, and $\mu = \{\mu_1, \ldots, \mu_J\}$.
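To make the likelihood concrete, the following minimal Python sketch evaluates the ANOVA log-likelihood. The function name, data, and parameter values are illustrative assumptions of ours, not part of the original method or software.

import numpy as np
from scipy.stats import norm

def anova_loglik(y, groups, mu, sigma2):
    # Pick the group mean mu_j for each person via group labels 0..J-1
    means = np.asarray(mu)[np.asarray(groups)]
    return norm.logpdf(y, loc=means, scale=np.sqrt(sigma2)).sum()

rng = np.random.default_rng(0)
groups = np.repeat([0, 1, 2], 20)                       # 20 persons per group
y = rng.normal(np.array([1.0, 2.0, 3.0])[groups], 1.0)  # hypothetical outcomes
print(anova_loglik(y, groups, mu=[1.0, 2.0, 3.0], sigma2=1.0))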

Prior based on training data

The method and software used for the Bayesian analysis are based on Van Wesel, Hoijtink, and Klugkist (in press), which presents a thorough investigation of different priors that can be used for the analysis of informative hypotheses ($H_i$) in the context of ANOVA. The method is based on an encompassing prior: a weakly informative prior is specified for the unconstrained hypothesis $H_A$, and the prior distributions for the constrained hypotheses are derived by truncation of the prior parameter space, using:

$$g(\mu, \sigma^2 \mid H_i) = \frac{g(\mu, \sigma^2 \mid H_A) \, I_{H_i}}{\int g(\mu, \sigma^2 \mid H_A) \, I_{H_i} \, d\mu \, d\sigma^2},$$

where $I_{H_i}$ is an indicator function with value one if the means are in agreement with the constraints of $H_i$, and zero otherwise.
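For example, for an ordered-means hypothesis such as $H_i: \mu_1 < \mu_2 < \mu_3$ (an illustrative hypothesis of our choosing, not one from the paper), the indicator can be coded as:

import numpy as np

def in_Hi(mu):
    # I_{H_i} for the illustrative hypothesis mu_1 < mu_2 < mu_3:
    # one if the means are strictly increasing, zero otherwise
    return float(np.all(np.diff(mu) > 0))

print(in_Hi([1.0, 2.0, 3.0]))  # 1.0
print(in_Hi([2.0, 1.0, 3.0]))  # 0.0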

The specification of the unconstrained prior $g(\mu, \sigma^2 \mid H_A)$ is based on training data (Berger & Pericchi, 1996, 2004; Perez & Berger, 2002). A training sample is a small part of the data that can be used


to update the reference prior for the ANOVA model, $1/\sigma^2$ (Bernardo, 1979), such that the resulting posterior is proper but still weakly informative and objective (i.e., no subjective information is used). In the approaches described in the references above, multiple training samples are used and the results are combined in different ways. Van Wesel et al. (in press) proposed a prior based on the same principles but tailored to constrained hypotheses and less computer-intensive (i.e., faster): the average constrained posterior prior (ACPP). For a detailed explanation of and motivation for this prior, we refer to the original paper.

The general form of the ACPP is:

$$g_{ACPP}(\mu, \sigma^2 \mid H_A) = N(\mu \mid m, S) \times \text{Inv-}\chi^2(\sigma^2 \mid \nu, k^2),$$

where $N(\cdot)$ denotes the multivariate normal distribution with mean vector $m$ and covariance matrix $S$, and $\text{Inv-}\chi^2(\cdot)$ denotes the scaled inverse chi-square distribution with degrees of freedom $\nu$ and scale parameter $k^2$.

Posterior

The posterior distribution based on the ACPP is:

$$h_{ACPP}(\mu, \sigma^2 \mid y, D, H_A) \propto f(y \mid D, \mu, \sigma^2) \times N(\mu \mid m, S) \times \text{Inv-}\chi^2(\sigma^2 \mid \nu, k^2).$$
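Sampling from this posterior is straightforward because both full conditionals are conjugate. The sketch below is ours, not the authors' software; it assumes the ACPP hyperparameters $m$, $S$, $\nu$, and $k^2$ are already available (their construction from training samples is not shown) and uses a 0/1 group-membership matrix X corresponding to $D$.

import numpy as np

def gibbs_acpp(y, X, m, S, nu, k2, ndraw=5000, seed=0):
    """Gibbs sampler for the unconstrained posterior h_ACPP(mu, sigma2 | y, D, H_A)."""
    rng = np.random.default_rng(seed)
    n, J = X.shape
    Sinv = np.linalg.inv(S)
    s2 = float(np.var(y))  # starting value for sigma^2
    mu_draws, s2_draws = [], []
    for _ in range(ndraw):
        # mu | sigma^2, y : conjugate multivariate normal update
        V = np.linalg.inv(Sinv + X.T @ X / s2)
        mean = V @ (Sinv @ m + X.T @ y / s2)
        mu = rng.multivariate_normal(mean, V)
        # sigma^2 | mu, y : scaled inverse chi-square update
        ssr = np.sum((y - X @ mu) ** 2)
        s2 = (nu * k2 + ssr) / rng.chisquare(nu + n)
        mu_draws.append(mu)
        s2_draws.append(s2)
    return np.array(mu_draws), np.array(s2_draws)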

Bayes factors

The Bayes factor comparing two hypotheses is the ratio of two marginal likelihoods. A marginal likelihood, for instance $m(y \mid H_A)$, is the density of the data averaged over the prior distribution under $H_A$. Chib (1995) noted that the marginal likelihood can usefully be estimated via the following identity, which holds for any value of $(\mu, \sigma^2)$ (written here with our choice of prior and the resulting posterior):

$$m(y \mid H_A) = \frac{f(y \mid D, \mu, \sigma^2) \, g_{ACPP}(\mu, \sigma^2 \mid H_A)}{h_{ACPP}(\mu, \sigma^2 \mid y, D, H_A)}.$$

Subsequently, Klugkist and Hoijtink (2007) derived that in the context of encompassing priors (i.e., when the constrained model is nested in the unconstrained one), the Bayes factor comparing an informative hypothesis $H_i$ with the unconstrained $H_A$ ($BF_{i,A}$) reduces to the ratio of two proportions: the proportion of the unconstrained posterior distribution in agreement with the constraints of $H_i$, divided by the proportion of the unconstrained prior distribution in agreement with the constraints of $H_i$. Both proportions are estimated using Markov chain Monte Carlo (MCMC) sampling.
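Given draws of $\mu$ from the unconstrained prior and posterior (e.g., produced by samplers like the sketches above; all names here are our own), the estimate is just a ratio of two sample proportions:

import numpy as np

def bayes_factor_iA(posterior_mu, prior_mu, in_Hi):
    # Fit: proportion of unconstrained posterior draws satisfying H_i;
    # complexity: proportion of unconstrained prior draws satisfying H_i
    fit = np.mean([in_Hi(mu) for mu in posterior_mu])
    complexity = np.mean([in_Hi(mu) for mu in prior_mu])
    return fit / complexity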

Note that in this approach each $H_i$ is evaluated against $H_A$, but mutual comparison of, for instance, $H_{i_1}$ with $H_{i_2}$ is also possible, using:

$$BF_{i_1, i_2} = \frac{BF_{i_1, A}}{BF_{i_2, A}}.$$

Posterior model probabilities

Using a uniform prior on the model space, the posterior model probabilities for $t$ ($t = 1, \ldots, T$) hypotheses are computed using:

$$PMP(H_t) = \frac{BF_{t,A}}{\sum_{t=1}^{T} BF_{t,A}}.$$
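Since each $BF_{t,A}$ shares the same denominator, the conversion to posterior model probabilities is a simple normalization; a sketch with hypothetical Bayes factor values:

import numpy as np

def pmp(bf_vs_unconstrained):
    # Normalize the BF_{t,A} over the T hypotheses
    # (uniform prior on the model space)
    bf = np.asarray(bf_vs_unconstrained, dtype=float)
    return bf / bf.sum()

print(pmp([4.2, 0.7]))  # hypothetical BF_{1,A}, BF_{2,A} -> [0.857..., 0.142...]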
