This article was downloaded by: [University Of Pittsburgh]
On: 10 November 2014, At: 09:22
Publisher: Routledge
Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Educational Psychology: An International Journal of Experimental Educational Psychology
Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/cedp20

The effect of clustering on statistical tests: an illustration using classroom environment data
Jeffrey Paul Dorman, School of Education, Australian Catholic University, Virginia, Australia
Published online: 14 Jul 2008.

To cite this article: Jeffrey Paul Dorman (2008) The effect of clustering on statistical tests: an illustration using classroom environment data, Educational Psychology: An International Journal of Experimental Educational Psychology, 28:5, 583-595, DOI: 10.1080/01443410801954201

To link to this article: http://dx.doi.org/10.1080/01443410801954201

Taylor & Francis makes every effort to ensure the accuracy of all the information (the “Content”) contained in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Any opinions and views expressed in this publication are the opinions and views of the authors, and are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and should be independently verified with primary sources of information. Taylor and Francis shall not be liable for any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly in connection with, in relation to or arising out of the use of the Content.

This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. Terms & Conditions of access and use can be found at http://www.tandfonline.com/page/terms-and-conditions




Educational Psychology, Vol. 28, No. 5, August 2008, 583–595

ISSN 0144-3410 print/ISSN 1469-5820 online
© 2008 Taylor & Francis
DOI: 10.1080/01443410801954201
http://www.informaworld.com

The effect of clustering on statistical tests: an illustration using classroom environment data

Jeffrey Paul Dorman*

School of Education, Australian Catholic University, Virginia, Australia

(Received 25 October 2007; final version received 30 January 2008)

This paper discusses the effect of clustering on statistical tests and illustrates this effect using classroom environment data. Most classroom environment studies involve the collection of data from students nested within classrooms, and the hierarchical nature of these data cannot be ignored. In particular, this paper studies the influence of intraclass correlations on tests of statistical significance conducted with the individual as the unit of analysis. Theory that adjusts t-test scores for nested data in two-group comparisons is presented and applied to classroom environment data. This paper demonstrates that Type I error rates inflate greatly as the intraclass correlation increases. Data analysis techniques that recognise the clustering of students in classrooms in classroom environment studies are essential, and it is recommended that either multilevel analysis or adjustments to statistical parameters be undertaken in studies involving nested data.

Keywords: classroom environment; instrument; secondary

The unit of analysis problem has been an on-going issue in classroom environment research. Various reviews of learning environment research over the past 20 years have recognised this issue as important, because inappropriate analysis of data can invalidate conclusions (see Dorman, 2002; Fraser, 1986, 1998). For classroom environment researchers, how to analyse data is of particular concern because classroom environment research relies heavily on the collection of perceptual data from students in classes within schools. The data are clearly hierarchical, and multilevel analysis textbooks consistently use school settings as exemplars of data hierarchy (e.g., Goldstein, 2004; Hox, 2002). While much learning environment literature has focused on choosing the right unit of analysis, Cronbach (1976), Burstein (1980), and Raudenbush and Willms (1991) have noted that the key issue is not one of choosing one unit of analysis but of recognising variation in scores at different levels.

The purposes of this paper are threefold. First, it explores the effect of clustering on the results of statistical testing. The degree of clustering of individuals at one level within another level (e.g., students within classes) is assessed by the intraclass correlation (or variance partition coefficient). Thus, the first purpose is to investigate how the intraclass correlation influences statistical testing. As demonstrated below, this effect is achieved primarily through the inflation of Type I error rates. Second, the paper demonstrates a simple approach that corrects statistical inference parameters for inflated Type I error rates. Third, the paper applies the above theory to classroom environment research. A data set from an Australian study conducted by Dorman in 1994 is used to demonstrate the effective use of the theory in a particular classroom environment study.

*Email: [email protected]


Before addressing these three aims, this paper provides background information on classroom environment research and, in particular, its dominant methodology – the use of perceptual measures.

Classroom environment research

Reviews of classroom environment research by Fraser (1998) and Dorman (2002) and edited books by Khine and Fisher (2003) and Fisher and Khine (2006) have reported research into the assessment, determinants, and consequences of classroom environments. These studies have investigated associations between classroom environment and cognitive and affective outcomes, evaluated educational innovations, researched differences between students’ and teachers’ perceptions of classrooms, compared actual and preferred environments, studied the effect of determinants or antecedents (e.g., student gender, year level, subject, school type) on the classroom environment, and investigated the transition from primary to secondary school, school psychology, student meta-cognition, teacher education, educational productivity research, and the facilitation of changes in school life.

As noted above, the conceptualisation and assessment of classroom environments has been a major field of interest. Since the mid-1960s, a suite of classroom environment measurement instruments has been developed, validated, and used in a host of school settings. Pioneering research work in the 1960s in the USA resulted in the Classroom Environment Scale (CES; Moos & Trickett, 1987) and the Learning Environment Inventory (LEI; Fraser, Anderson, & Walberg). In Australia, Fraser (1990) developed the Individualised Classroom Environment Questionnaire, and Fraser, Giddings, and McRobbie (1995) reported the use of the Science Laboratory Environment Inventory (SLEI). In the Netherlands, the Questionnaire on Teacher Interaction (QTI; Wubbels & Levy, 1993) has been used in many studies. More recently, Fraser reported the development and validation of the What is Happening in this Class questionnaire (WIHIC; Fraser, 1998), which has been used in many research studies internationally over the past decade.

Methodological advances in the development, validation, and use of classroom environment instruments over the past decade were demonstrated in a recent special issue of Learning Environments Research (see Shavelson & Seidel, 2006). Papers in this issue addressed the use of Rasch modelling to develop true measures of learning environments (Cavanagh & Romanoski, 2006), multilevel modelling in classroom environment research (den Brok, Brekelmans, & Wubbels, 2006), latent class analysis to study micro teaching learning environments (Seidel, 2006), and construct and criterion validity of student and teacher ratings of classrooms (Kunter & Baumert, 2006). In addition, LISREL has been used to establish the factorial invariance of the WIHIC (Dorman, 2003) and to perform multitrait–multimethod analysis of actual and preferred versions of the WIHIC (Dorman, in press b).

Recent classroom environment research has studied associations between web-based chemistry learning environments and attitudes and achievement (Frailich, Kesner, & Hofstein, 2007), assessed the classroom environment in universities in Spain (Villar & Alegre, 2007), used classroom environment assessments to ascertain whether computer classes in New Zealand secondary schools should be single-sex (Logan, 2007), and investigated links among classroom social environment, motivational beliefs, and engagement (Patrick, Ryan, & Kaplan, 2007).

Other studies have investigated parents’ and students’ perceptions of classroom environments (Allen & Fraser, 2007), classroom participation by deaf or hard of hearing students in government schools in the USA (Antia, Sabers, & Stinson, 2006), primary students’ perceptions of teacher interpersonal behaviours (Waldrip, Reene, Fisher, & Dorman, 2008), students’ perceptions of science teachers’ interpersonal behaviours in Turkish secondary schools (Telli, den Brok, & Cakiroglu, 2007), and the effects of extended instructional time on learning environment, achievement, and attitudes in middle school algebra classes (Azimioara & Fraser, 2007). Studies in the United States by Veltri, Banning, and Davies (2006) and in Slovenia by Schmidt and Čagran (2006) illustrate the effective use of qualitative data collection methods in classroom environment research. Fisher and Khine’s (2006) recently edited volume of 25 chapters demonstrates the depth and breadth of international learning environment research today.

Use of perceptual measures in classroom environment research

Murray’s (1938) need-press theory is one of the foundation stones of classroom environment research. According to this theory, individuals have internal needs, and the external environment or press either supports or frustrates those needs. Murray (1938, p. 122) distinguished between ‘alpha press’, ‘which is the press that actually exists as far as scientific discovery can determine it’, and ‘beta press’, ‘which is the subject’s own interpretation of the phenomena’. That is, alpha press is the environment as observed by an outsider, and beta press refers to the perceptions of the environment’s inhabitants (i.e., students and teachers).

Since the late 1960s, an overwhelming feature of classroom environment research and associated instrumentation has been the collection and analysis of high-inference student perceptual data, usually in intact classes. In contrast to low-inference measures, which focus on specific, denotable, relatively objective behaviours, high-inference measures require respondents to make summary judgments based on a sustained period of time in the classroom using specific constructs (e.g., teacher support; see Rosenshine, 1970). The issue of inference is fundamentally about the degree of processing that occurs between the classroom data as they occurred and the representation of those data in codified form. Studies which focus on the meaning of school and classroom events have tended to utilise high-inference measures.

According to Walberg (1976), students are quite capable of making summary molar judgments about their classrooms rather than reporting on a myriad of molecular events. Allied with this view is the realisation that, compared with external observers, students stand at a superior vantage point for making summary judgments. They are immersed in the complexity of the classroom and its dynamics. While trained observers can provide important information about classrooms, they do not have access to the quantity and quality of data that students accumulate. Fiedler’s (1975) study of classroom interaction showed that students’ perceptions of the classroom are better at predicting outcomes than data collected from observers.

While all data collection methods have inherent weaknesses, it is instructive to consider three weaknesses of observational data when assessing the affective characteristics of classes (see Anderson, 1981). First, inferring affective characteristics from observable behaviours is problematic because the context of the classroom governs behaviours. An observer in a classroom for 30 minutes only experiences a limited contextual range. Consistency of behaviours across a range of classroom contexts over time is needed if the link between behaviours and affective characteristics is to be established. Second, observers could fail to observe relevant characteristics of the classroom, and thus render the data collection process invalid. Third, observations of student and teacher behaviours can be misinterpreted. This is a particular problem if only one person observes the classroom and investigator triangulation is not possible. Given the problematic nature and costs associated with observational data collection, self-report methods through the use of questionnaires have gained overwhelming support in classroom environment research. Classrooms are extremely complex entities and, as Walberg rightly asserted over 30 years ago, ‘There is one instrument that can help us to deal with such complexity – the human mind’ (Walberg, 1976, p. 160). It is the minds of the students in the classroom that need to be tapped to gain summary judgments of the classroom environment.


Implications of the use of perceptual measures with intact classes

As discussed above, the dominant methodology in classroom environment research is the collection of perceptual data from students in particular classes. These classes are nested in particular schools. While the hierarchical/nested nature of these data is inescapable, this essential data characteristic has often been ignored. Analyses have used either the individual as the unit of analysis, and ignored class membership, or used the class as the unit of analysis with aggregated data and thus ignored the individual student. In response to criticisms, some researchers have reported parallel but essentially independent sets of analyses conducted with both the individual student and the class as units of analysis in the one study (e.g., Goh & Fraser, 1998). As classroom environment research focuses on the assessment of classrooms, school membership has almost always been ignored in classroom environment research. Proponents of multilevel modelling have argued that the existence of grouping hierarchies in data is neither accidental nor ignorable (see Rowe, 2007), and that data with a clear hierarchy should not be analysed as if they are all on the same level, as this can lead to statistical and interpretational errors (Tabachnick & Fidell, 2007). As Hedges notes:

A common mistake in analysis of cluster-randomised trials is made when the data are analysed as if the data were a simple random sample and assignment was carried out at the level of individuals. This typically leads to an overstatement of the precision of results and consequently to anticonservative conclusions about precision and statistical significance of treatment effects. (Hedges, 2007, p. 152)

The fundamental issue concerning group effects is that even if individuals are assigned to groups on a random basis, as a group they will become differentiated. For students in classes, this means that students influence and are influenced by other students in the class (Goldstein, 2003). There is a class effect. It is also true that schools can create class effects by directing students to classes on a clearly biased basis (e.g., timetabling, specialist teacher availability, subject choice). If all students within classes are idiosyncratic and class and school membership do not contribute to student scores, then there will be no variance at the class or school levels. On the other hand, if all students within the same class have the same scores, then there will be no student-level variation (see Wheldall, Beaman, & Mok, 1999). In essence, variance in students’ scores can be partitioned at the student, class, and school levels. The intra-unit correlation (often referred to as the intraclass correlation, ρ) or variance partition coefficient (VPC) is the proportion of variance accounted for by higher level units and can be thought of as the ‘extent of clustering’ (Goldstein, Browne, & Rasbash, 2002). Qualitatively, the VPC can be taken as a measure of the importance of a particular level. So, the computed value of the VPC for classes provides an indication of how important class membership is to scores on the variable under consideration. According to Lee (2000), a variance proportion above 10% at any level is non-trivial and needs to be taken into account in any analysis.
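As a concrete sketch (the function name and toy data are mine, not from the paper), the class-level VPC can be estimated from class-rostered scores with the usual one-way random-effects ANOVA estimator, ICC = (MSB − MSW) / (MSB + (n − 1)·MSW), for k equal-sized classes of n students:

```python
def icc_anova(groups):
    """One-way random-effects (ANOVA) estimate of the intraclass correlation
    for k equal-sized clusters: ICC = (MSB - MSW) / (MSB + (n - 1) * MSW)."""
    k, n = len(groups), len(groups[0])            # clusters, members per cluster
    grand = sum(sum(g) for g in groups) / (k * n)
    means = [sum(g) / n for g in groups]
    msb = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    msw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g) / (k * (n - 1))
    return (msb - msw) / (msb + (n - 1) * msw)

# All variation between classes gives ICC = 1; no between-class variation
# gives ICC <= 0 (this ANOVA estimator can go negative).
assert icc_anova([[1, 1, 1], [2, 2, 2]]) == 1.0
```

Negative estimates, which arise when classes vary less than chance would predict, are usually truncated to zero in practice.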

Clearly, multilevel modelling is the optimal approach to handling hierarchical data, with powerful programs like MLwiN (Rasbash, Steele, Browne, & Prosser, 2005) and HLM (Raudenbush & Bryk, 2002) readily available. However, sometimes the raw data from a research report may not be available for re-analysis. In this case, multilevel analysis cannot be employed and a post-hoc procedure based on the results of statistical testing and the sample characteristics is needed. The present paper demonstrates such a post-hoc approach with a two-group comparison of classroom environment scores.

Aggregating data to the class level usually involves computing class means for each scale and using these means as scores for analysis. In essence, student-level variation is ignored and variance is compressed. Information is lost from the analysis; statistical power is lost (Hox, 2002). Additionally, aggregation of data to higher levels raises the issue of aggregation bias and ecological fallacies, in which a relationship identified statistically at a higher level is used to make assertions about lower-level variables (see Alker, 1969; Freedman, 1999). According to Aitkin and Longford (1986, p. 42), employing aggregated data ‘is dangerous at best and disastrous at worst’.

From a statistical perspective, if data are to be analysed using the student as the unit of analysis, an effective sample size which takes into account the design effect of having students nested in classes can be employed (see Snijders & Bosker, 1999). The higher the VPC, the higher the design effect and the greater the adjustment in the sample size if analysis is to be conducted at the individual level only. Another way of dealing with this issue is to conduct analyses with the student as the unit of analysis and the existing sample size, and then adjust post-hoc the values of the statistical parameters used to make statistical inferences. The following section describes a statistical theory by Hedges (2007) that takes this latter approach. A later section of this paper demonstrates this useful approach with a classroom environment data set.

A theoretical perspective

A recent paper by Hedges (2007) provides a theoretical approach that addresses the three purposes of the present paper. This theory focuses on two-group comparisons and the influence of the nesting of data. The intraclass correlation ρ is defined as:

\rho = \frac{\sigma_B^2}{\sigma_B^2 + \sigma_W^2}

where \sigma_B^2 is the common between-cluster variance and \sigma_W^2 is the common within-cluster variance. It is the proportion of total variance attributed to between-cluster variation. As such, the intraclass correlation (or variance partition coefficient) is a measure of clustering. Higher values of ρ indicate higher clustering, with ρ = 1 indicating no within-cluster variability. Similarly, if ρ = 0 there is no clustering effect and all cases can be treated as statistically independent.

If ρ ≠ 0 then clustering should be taken into account in any statistical testing. While multilevel modelling would be the optimal approach, it is sometimes useful to proceed by adjusting the existing parameters of statistical tests. The normal approach to comparing population means using samples from two groups is a t-test. If clustering is ignored, the test is:

t(N - 2, \alpha)

– that is, a t score with N − 2 degrees of freedom. According to Hedges (2007), the appropriate test if clustering is taken into account and cluster sizes are equal is:

t_A = ct

where the adjusted t value, tA, has a t distribution with h degrees of freedom, and:

c = \sqrt{\frac{(N - 2) - 2(n - 1)\rho}{(N - 2)\left[1 + (n - 1)\rho\right]}}

and:

h = \frac{\left[(N - 2) - 2(n - 1)\rho\right]^2}{(N - 2)(1 - \rho)^2 + n(N - 2n)\rho^2 + 2(N - 2n)\rho(1 - \rho)}
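These formulas translate directly into code. A minimal sketch (the function names are mine, not from Hedges's paper):

```python
import math

def hedges_c(N, n, rho):
    """Shrinkage factor c applied to the naive t score when clusters of
    equal size n are nested within a total sample of N individuals."""
    return math.sqrt(((N - 2) - 2 * (n - 1) * rho) /
                     ((N - 2) * (1 + (n - 1) * rho)))

def hedges_h(N, n, rho):
    """Adjusted degrees of freedom h for the adjusted statistic t_A = c * t."""
    num = ((N - 2) - 2 * (n - 1) * rho) ** 2
    den = ((N - 2) * (1 - rho) ** 2
           + n * (N - 2 * n) * rho ** 2
           + 2 * (N - 2 * n) * rho * (1 - rho))
    return num / den

# Sanity check: with rho = 0 there is no clustering adjustment at all.
assert hedges_c(800, 20, 0.0) == 1.0
assert hedges_h(800, 20, 0.0) == 798.0   # h reduces to N - 2
```

With 20 classes of 20 students in each of two groups (N = 800) and ρ = .20, c is about .45, so a naive t score is more than halved before being referred to roughly 459 degrees of freedom.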


where N is the total sample size, n is the number of students in each cluster, tA is the adjusted t score with h degrees of freedom, and ρ is the intraclass correlation.

Apart from the nominal significance level, three variables, N, n, and ρ, influence the adjusted t value. To illustrate these effects within a school context, a series of computations was performed with students nested in classes. Table 1 shows the effect of the number of classes (and therefore N) in each of the two groups in a two-group comparison. The number of students is kept at 20 per class and ρ is set at .20. As shown in Table 1, the actual significance level is inflated but appears to be invariant to the number of classes per group. For example, at the nominal α of .05, the actual significance level is around .375 irrespective of how many classes are in the sample. This inflation of the Type I error rate – over seven times higher than that nominated – is clearly unacceptable and could result in a rejection of the null hypothesis when, in fact, it is true.

Table 2 shows the effect of varying the number of students in each class while keeping the number of classes per group constant. Clearly, the effect on the actual significance level is dramatic. A two-group comparison of 20 classes per group with 30 students per class (N = 1200) has an actual significance level of .454 – over nine times the nominal value. The effect is even more pronounced when the nominal α is .001, with the inflationary effect being over 200.
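The actual significance levels in Tables 1 and 2 can be reproduced analytically: the naive test rejects when |t| exceeds the critical value for N − 2 degrees of freedom, but under clustering the correct reference distribution is that of tA = ct, a t distribution with h degrees of freedom. A sketch of this calculation (assuming SciPy is available; the function names are mine):

```python
import math
from scipy import stats

def hedges_c(N, n, rho):
    # Shrinkage factor c from Hedges (2007)
    return math.sqrt(((N - 2) - 2 * (n - 1) * rho) /
                     ((N - 2) * (1 + (n - 1) * rho)))

def hedges_h(N, n, rho):
    # Adjusted degrees of freedom h from Hedges (2007)
    num = ((N - 2) - 2 * (n - 1) * rho) ** 2
    den = ((N - 2) * (1 - rho) ** 2
           + n * (N - 2 * n) * rho ** 2
           + 2 * (N - 2 * n) * rho * (1 - rho))
    return num / den

def actual_alpha(classes_per_group, n, rho, nominal_alpha):
    """Two-sided Type I error actually incurred by the naive t-test:
    the naive critical value is referred to the correct reference
    distribution of t_A = c * t, which is t with h degrees of freedom."""
    N = 2 * classes_per_group * n                      # total sample size
    t_crit = stats.t.ppf(1 - nominal_alpha / 2, N - 2)
    return 2 * stats.t.sf(hedges_c(N, n, rho) * t_crit,
                          hedges_h(N, n, rho))

# 20 classes of 20 students per group with rho = .20 at nominal alpha = .05
# lands near the .373 entry common to Tables 1 and 2.
```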

Table 1. Actual significance levels for different numbers of classes in a two-group comparison for three nominal t-test significance levels (ρ = .20).

Students per class   Classes per group   α = .05   α = .01   α = .001
        20                    5           .379      .247      .137
        20                   10           .375      .243      .135
        20                   15           .374      .242      .134
        20                   20           .373      .241      .134
        20                   25           .372      .241      .134
        20                   30           .372      .241      .134

Table 2. Actual significance levels for different numbers of students per class in a two-group comparison for three nominal t-test significance levels (ρ = .20).

Students per class   Classes per group   α = .05   α = .01   α = .001
         5                   20           .145      .055      .014
        10                   20           .243      .124      .049
        15                   20           .316      .188      .092
        20                   20           .373      .241      .134
        25                   20           .418      .286      .173
        30                   20           .454      .325      .208

To illustrate the effect of the intraclass correlation ρ on actual significance levels, three graphs for nominal α values of .05, .01, and .001 have been drawn for a two-group comparison of 20 students per class and 25 classes in each group (N = 1000) (see Figure 1). The effect is pronounced; Figure 1 shows that even a relatively small increase in the value of the intraclass correlation can create sizeable changes in the actual Type I error rate. For example, with nominal α set at .05, an intraclass correlation of .10 yields an actual significance level of .250. One noteworthy feature of these graphs is their linearity, with Pearson’s correlation coefficients of .970 for α = .05, .989 for α = .01, and .998 for α = .001. These graphs illustrate the importance of nesting to statistical test results.

Figure 1. Actual significance levels for different intraclass correlations for a two-group comparison (20 students in each class, 25 classes in each group, and nominal α of .05, .01, and .001).

If t values are known for analyses conducted with the individual student as the unit of analysis, the simple formula

t_A = ct

where tA has h degrees of freedom, can be used to adjust for the nesting of data. The adjusted t score with h degrees of freedom can then be compared with the critical value for the nominal α to ascertain whether statistical significance remains. If ρ > 0, then 0 < c < 1, so tA < t for any nominal α and any ρ. Whether or not the computed value of tA falls in the critical region depends on the extent to which c adjusts t downwards and on h. Furthermore, if tA still falls within the critical region, the value of ρ needed for tA to move outside the critical region can be computed using the critical values of the t distribution and the formula for c above.

An application of the theory in classroom environment research

To illustrate the theory described above, a data set from a study conducted in Australian Catholic schools has been analysed (Dorman, 1994). In this study, a sample of 1719 students from 80 classes in 20 Catholic secondary schools responded to the Catholic School Classroom Environment Questionnaire (CSCEQ). The group comparison studied in the present paper is based on subject type, with 40 religious education and 40 science classes in the sample. To investigate differences
in classroom environment in religious education and science classes, a series of t-tests with the student as the unit of analysis was conducted, with clustering of students in the 80 classes ignored. Additionally, the theory explained above was used to compute the inflated actual Type I error rates that arise if clustering is ignored. Adjusted t-test results that take into account the clustering of students in classes are also presented.

The CSCEQ consists of 66 items assigned to seven underlying scales (student affiliation, interactions, cooperation, task orientation, order and organisation, individualisation, and teacher control). Each item employs a five-point Likert response format (1 = strongly disagree; 2 = disagree; 3 = neither; 4 = agree; 5 = strongly agree), with item scores aggregated to form scale scores for each respondent. Table 3 provides descriptions and the number of items for each CSCEQ scale. Different versions of the CSCEQ have been used in previous research. For example, Dorman, McRobbie, and Foster (2002) used a personalised form of the CSCEQ to study associations between classroom environment in religion classes and attitudes to Christianity. The instrument has been shown to provide a valid assessment of classroom environment, especially in Australian Catholic schools. Reliability coefficients (Cronbach coefficient α) were computed for each scale (see Table 3). These results show that, apart from individualisation, all scales had at least sound internal consistency. Indices ranged from .54 for individualisation to .90 for interactions. These values compare favourably with those reported in previous learning environment research (see Fraser, 1998). Table 3 also shows means and standard deviations for each scale.

Use of student as unit of analysis in t-tests with clustering ignored

Seven t-tests with the student as the unit of analysis compared classroom environment according to subject (i.e., religious education and science classes). Class membership (i.e., class clustering) was ignored. That is, these analyses assumed that all students were statistically independent. As the present analysis involved seven independent tests, the use of the Bonferroni inequality

Table 3. Descriptive information for the CSCEQ.

Student affiliation: Extent to which students know, help, and are friendly towards each other. (9 items; α = .69; M = 3.51; SD = 0.53)

Interactions: Extent to which teacher–student interactions emphasise a concern for the personal welfare and social growth of the student. (10 items; α = .90; M = 3.71; SD = 0.74)

Cooperation: Extent to which students cooperate rather than compete with each other. (10 items; α = .71; M = 3.23; SD = 0.52)

Task orientation: Extent to which it is important to complete activities planned and to stay on the subject matter. (9 items; α = .76; M = 3.17; SD = 0.65)

Order and organisation: Emphasis on students behaving in an orderly, quiet, and polite manner, and on the overall organisation of classroom activities. (10 items; α = .84; M = 2.71; SD = 0.69)

Individualisation: Extent to which students are allowed to make decisions and are treated differently according to ability, interest, and rate of working. (9 items; α = .54; M = 2.66; SD = 0.47)

Teacher control: The number of rules, how strictly rules are enforced, and how severely infractions are punished. (9 items; α = .75; M = 3.47; SD = 0.61)

Note: Means and standard deviations were based on total scale score divided by the number of items in the scale.



resulted in the nominal Type I error rate of .05 being adjusted downwards. It was replaced by the more stringent benchmark of .01 for all tests. Statistically significant differences in scale scores for religious education and science classes were found for three CSCEQ scales: cooperation [t(1717) = 4.23, p < .01], task orientation [t(1717) = 7.68, p < .01], and individualisation [t(1717) = 2.77, p < .01], with respective effect size magnitudes of .20, .36, and .13. Compared to religious education classes, science classes had lower cooperation and individualisation but higher task orientation.

However, as indicated by the theory presented earlier in this paper, actual p values can be inflated substantially if there are appreciable intraclass correlations (ρ) (i.e., a proportion of variance is explained by class membership). Multilevel analysis with MLwiN (Rasbash et al., 2005), with the student as the first-level variable and class as the second-level variable, was used to compute intraclass correlations for cooperation (.160), task orientation (.237), and individualisation (.203). For these three scales, the actual p values computed using the above theory were .21, .29, and .26 respectively – up to 29 times the nominal level of .01. If a nominal level of .05 had been adopted, then the respective actual p values would be .34, .42, and .39. These are clearly unacceptably high Type I error rates. That is to say, the probability of concluding that differences exist between religious education and science classes, when in fact there are no differences, is much too high. Accordingly, the potential for spurious results and conclusions that can follow if clustering is ignored is very real.
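These inflated rates can be reproduced, at least approximately, from the familiar design effect 1 + (n − 1)ρ together with a large-sample normal approximation. The short function below is an illustrative sketch of that calculation, not the paper's exact derivation; the function name is ours.

```python
from statistics import NormalDist

def actual_alpha(nominal_alpha: float, icc: float, cluster_size: float) -> float:
    """Approximate actual Type I error rate when clustering is ignored.

    Uses the design effect 1 + (n - 1)*rho and a large-sample normal
    approximation to the t distribution.
    """
    nd = NormalDist()
    deff = 1 + (cluster_size - 1) * icc          # design effect
    z_crit = nd.inv_cdf(1 - nominal_alpha / 2)   # two-tailed critical value
    # Ignoring clustering understates the standard error by sqrt(deff), so
    # the naive statistic exceeds the nominal critical value with probability:
    return 2 * (1 - nd.cdf(z_crit / deff ** 0.5))

# Mean class size 21.48; intraclass correlations from the multilevel analysis
for rho in (0.160, 0.237, 0.203):
    print(round(actual_alpha(0.01, rho, 21.48), 2))  # prints: 0.21, 0.29, 0.26
```

The same call with a nominal level of .05 recovers the .34, .42, and .39 figures quoted above.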

Adjusting t scores for clustering

The above t scores can be adjusted to take clustering into account by using the formula

tA = c × t

where tA is the adjusted t score with h degrees of freedom, c is as defined earlier in this paper in terms of class size, number of classes, and the intraclass correlation, and t is the t score with clustering ignored. A mean class size of 21.48 was employed in all analyses adjusting t scores for clustering. Using the theory described above, adjusted t scores (tA) and p values were computed: cooperation [tA(1132) = 2.04, p = .04], task orientation [tA(803) = 3.16, p = .002], and individualisation [tA(936) = 1.22, p = .22]. These new t values shed new light on these comparisons, with only task orientation now showing a statistically significant difference between religious education and science classes, at p < .01.
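The adjusted values above can be reproduced using the correction factor and degrees-of-freedom expression from Hedges (2007), taking c = √{[(N − 2) − 2(n − 1)ρ] / [(N − 2)(1 + (n − 1)ρ)]}. The sketch below (function name ours) recovers the cooperation result under those assumptions.

```python
import math

def hedges_adjusted_t(t: float, n_total: int, cluster_size: float, icc: float):
    """Adjust a naive t statistic for clustering, after Hedges (2007).

    n_total is the total number of students, cluster_size the (mean)
    class size, icc the intraclass correlation.
    Returns (adjusted t, adjusted degrees of freedom).
    """
    N, n, rho = n_total, cluster_size, icc
    a = (N - 2) - 2 * (n - 1) * rho
    # Correction factor applied to the naive t score
    c = math.sqrt(a / ((N - 2) * (1 + (n - 1) * rho)))
    # Satterthwaite-style degrees of freedom for the adjusted statistic
    h = a ** 2 / ((N - 2) * (1 - rho) ** 2
                  + n * (N - 2 * n) * rho ** 2
                  + 2 * (N - 2 * n) * rho * (1 - rho))
    return c * t, h

# Cooperation: t(1717) = 4.23, 1719 students, mean class size 21.48, rho = .160
t_adj, df = hedges_adjusted_t(4.23, 1719, 21.48, 0.160)
print(round(t_adj, 2), round(df))  # prints: 2.04 1132
```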

It is also possible to compute the intraclass correlation that would have to be exceeded for the adjusted t score for task orientation to fall below the critical value at p = .01. In the present study, this value was computed to be .381. Similarly, if the p value was set at .05, then the intraclass correlation could be as high as .689 before the t score fell below the critical value.

Analogously, how low the intraclass correlation would need to be for the cooperation and individualisation t scores to be above the critical value can be calculated. At p = .01, the intraclass correlations for cooperation and individualisation would need to be below .082 and .008 respectively – well below the observed values of .160 and .203. At p = .05, intraclass correlations could be as high as .177 for cooperation and .048 for individualisation before the t scores fell below the critical values. Collectively, these results demonstrate the dramatic effect that clustering has on the results of statistical testing when the individual student is taken as the unit of analysis.
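These break-even intraclass correlations can be found by inverting the correction factor: because c² is linear in (n − 1)ρ, the threshold has a closed form. The sketch below (our function name) uses the normal critical value as a large-df approximation, so the last digit can differ slightly from the paper's figures for the smaller t values.

```python
def icc_threshold(t: float, z_crit: float, n_total: int, cluster_size: float) -> float:
    """Intraclass correlation at which the clustering-adjusted t score
    equals the critical value (normal approximation for large df).

    Solves c(rho)**2 * t**2 = z_crit**2 with
    c**2 = ((N-2) - 2*(n-1)*rho) / ((N-2) * (1 + (n-1)*rho)).
    """
    N, n = n_total, cluster_size
    k = (z_crit / t) ** 2 * (N - 2)  # required value of c**2, scaled by (N-2)
    # (N-2) - 2x = k*(1 + x) with x = (n-1)*rho; solve the linear equation
    x = ((N - 2) - k) / (k + 2)
    return x / (n - 1)

# Task orientation: t = 7.68, 1719 students, mean class size 21.48
print(round(icc_threshold(7.68, 2.576, 1719, 21.48), 3))  # prints: 0.381
print(round(icc_threshold(7.68, 1.960, 1719, 21.48), 3))  # prints: 0.689
```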

Discussion

There are three important implications of the theory and application presented in this paper.




First, this paper has demonstrated that it is improper to analyse clustered data as if all respondents are statistically independent. This is because the intraclass correlation inherent in clustered data inflates the Type I error rate. Of interest is the magnitude by which the intraclass correlation inflates the nominal Type I error rate. This paper has shown that, even for modest absolute values of intraclass correlation (i.e., .05 < ρ < .10), the effect on actual p values is quite dramatic. Figure 1 demonstrated that with nominal α set at .05, intraclass correlations of .05 and .10 yield actual significance levels of .160 and .250 respectively. With regard to classroom environment research, the potential for committing Type I errors is very real, mainly because intraclass correlations are potentially much higher. Recent research by Dorman (in press a) reported intraclass correlations for scores on 22 classroom environment scales ranging from .07 to .31 (M = .19, SD = .07). Using a large data set of 44,415 students in 1913 classes in the Netherlands, den Brok et al. (2006) revealed intraclass correlations for the eight Questionnaire on Teacher Interaction scales ranging from .23 to .51 (M = .42, SD = .08) for the personal version and from .29 to .50 (M = .43, SD = .07) for the class version. Accordingly, massive adjustments to the nominal Type I error rate will be evident even if the class sizes are relatively small.

Second, this paper has demonstrated that, provided a good estimate of the intraclass correlation is available and basic sample parameters of cluster size and total sample size are known, it is possible to compute adjusted parameter values without access to the raw data. As illustrated by the application of the theory to a classroom environment data set, there can be substantial changes in parameter estimates which can lead to a reversal of conclusions concerning statistical significance. That is, parameter estimates without intraclass correlation adjustments can indicate that the null hypothesis should be rejected and that there are significant differences between the groups. In contrast, parameter estimates with intraclass correlation adjustments can indicate an acceptance of the null hypothesis. Thus one can easily commit a Type I error if clustering is ignored, falsely reject the null hypothesis, and erroneously conclude that there are statistically significant differences between groups.

One limitation to these computations is that the intraclass correlation has to be estimated if it is not provided in the research report. While the η² statistic (the proportion of variance explained by class membership in a one-way ANOVA) can be used, it may not be provided in the research report. Hedges (2007) notes this problem and describes how various authors are establishing reference tables of intraclass correlations. For example, in the United States, Hedges and Hedberg (2007) developed a compendium of several hundred intraclass correlations for reading and mathematics academic achievement. It would be useful if a record of intraclass correlations for classroom environment scales used over the past decade were available to researchers. Such a record could be cross-tabulated according to particular subsamples (e.g., country, location, year level).

A third implication is that classroom environment researchers need to pay more attention to research design, and in particular the selection of the sample. Historically, most classroom environment researchers have surveyed intact classes. This is efficient – it allows for the collection of substantial numbers of questionnaires from one location in a short period of time. However, Table 2 shows the dramatic effect of cluster size on the inflation of the Type I error rate. While the effect of intraclass correlation on nominal Type I error cannot be attenuated, the effect of clustering can be reduced by reducing cluster size. Accordingly, classroom environment researchers should consider reducing the number of students per class surveyed and increasing proportionally the number of classes in the sample. For example, with a total sample set at 600, instead of surveying 20 classes with 30 students per class, it would be better to survey 60 classes and randomly select 10 students from each class to respond to the questionnaire. Analogously, studies involving teachers responding to school environment instruments need to involve more schools and fewer teachers per school.
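One way to see the force of this design advice is through the effective sample size, N / (1 + (n − 1)ρ): the number of independent observations a clustered sample is worth. A quick sketch, using an intraclass correlation of .19 (the mean value for classroom environment scales noted earlier) as an illustrative assumption:

```python
def effective_sample_size(n_total: int, cluster_size: int, icc: float) -> float:
    """Number of independent observations a clustered sample is worth."""
    deff = 1 + (cluster_size - 1) * icc   # design effect
    return n_total / deff

# 600 students as 20 classes of 30 versus 60 classes of 10, with rho = .19
print(round(effective_sample_size(600, 30, 0.19)))  # prints: 92
print(round(effective_sample_size(600, 10, 0.19)))  # prints: 221
```

Under these assumptions the same 600 questionnaires buy more than twice the effective information when spread across 60 classes rather than 20.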



Conclusion

The purpose of this paper was to bring to the attention of researchers the effect of clustering on the results of statistical testing, and the potential problems that arise if nested data are treated as statistically independent. As much educational data are collected from students clustered in classes within schools, this issue cannot be ignored. Cluster size and the intraclass correlation (or variance partition coefficient) were shown to have substantial effects on the inflation of Type I error probabilities. While multilevel analysis is the preferred approach to handling nested data, a relatively simple procedure for adjusting statistical parameter estimates for clustering was demonstrated. Classroom environment researchers using traditional statistical methods (e.g., ANOVA) should not use the individual student as the unit of analysis unless techniques like those demonstrated in this paper are employed.

References

Aitkin, M., & Longford, N. (1986). Statistical modelling issues in school effectiveness studies (with discussion). Journal of the Royal Statistical Society, Series A, 149, 1–43.

Alker, H.R. (1969). A typology of ecological fallacies. In H. Dogan & S. Rokkan (Eds.), Quantitative ecological analysis in the social sciences (pp. 69–86). London: MIT Press.

Allen, D., & Fraser, B.J. (2007). Parent and student perceptions of the classroom learning environment and its association with student outcomes. Learning Environments Research, 10, 67–82.

Anderson, L. (1981). Assessing affective characteristics in the schools. Boston, MA: Allyn & Bacon.

Antia, S.D., Sabers, D.L., & Stinson, M.S. (2006). Validity and reliability of the Classroom Participation Questionnaire with deaf and hard of hearing students in public schools. Journal of Deaf Studies and Deaf Education, 12, 158–171.

Azimioara, M., & Fraser, B.J. (2007, April). The effect of extended instructional time on learning environment, achievement, and attitudes in middle school algebra classes. Paper presented at the annual meeting of the American Educational Research Association, Chicago.

Burstein, L. (1980). The analysis of multi-level data in educational research and evaluation. In D.C. Berliner (Ed.), Review of research in education, Vol. 8 (pp. 158–233). Washington, DC: American Educational Research Association.

Cavanagh, R.F., & Romanoski, J.T. (2006). Rating scale instruments and measurement. Learning Environments Research, 9, 273–289.

Cronbach, L.J. (1976). Research on classrooms and schools: Formulation of questions, design and analysis. Occasional paper of the Stanford Evaluation Consortium. Stanford, CA: Stanford University.

den Brok, P., Brekelmans, M., & Wubbels, T. (2006). Multilevel issues in research using students' perceptions of learning environments: The case of the Questionnaire on Teacher Interaction. Learning Environments Research, 9, 199–213.

Dorman, J.P. (1994). A study of school and classroom environments in Queensland Catholic secondary schools. Unpublished PhD thesis, Curtin University of Technology, Perth, Australia.

Dorman, J.P. (2002). Classroom environment research: Progress and possibilities. Queensland Journal of Educational Research, 18, 112–140.

Dorman, J.P. (2003). Cross-national validation of the What Is Happening In this Class? questionnaire using confirmatory factor analysis. Learning Environments Research, 6, 231–245.

Dorman, J.P. (in press a). Partitioning the variance in scores on classroom environment instruments. Journal of Educational Research.

Dorman, J.P. (in press b). Use of multitrait–multimethod modelling to validate actual and preferred forms of the 'What Is Happening In this Class?' questionnaire. Learning Environments Research.

Dorman, J.P., McRobbie, C.J., & Foster, W.J. (2002). Associations between psychosocial environment in religious education classes and students' attitude to Christianity. Religious Education, 97, 23–42.

Fiedler, M.L. (1975). Bidirectionality of influence in classroom interaction. Journal of Educational Psychology, 67, 735–744.

Fisher, D.L., & Khine, M.S. (Eds.). (2006). Contemporary approaches to research on learning environments. Singapore: World Scientific.

Frailich, M., Kesner, M., & Hofstein, A. (2007). The influence of web-based chemistry learning on students' perceptions, attitudes, and achievements. Research in Science and Technological Education, 25, 179–197.



Fraser, B.J. (1986). Classroom environment. London: Croom Helm.

Fraser, B.J. (1990). Individualised classroom environment questionnaire. Melbourne, VIC: Australian Council for Educational Research.

Fraser, B.J. (1998). Science learning environments: Assessments, effects and determinants. In B.J. Fraser & K.G. Tobin (Eds.), International handbook of science education (pp. 527–564). Dordrecht, the Netherlands: Kluwer.

Fraser, B.J., Anderson, G.J., & Walberg, H.J. (1982). Assessment of learning environments: Manual for Learning Environment Inventory (LEI) and My Class Inventory (MCI) (3rd version). Perth: Western Australian Institute of Technology.

Fraser, B.J., Giddings, G.J., & McRobbie, C.J. (1995). Evolution and validation of a personal form of an instrument for assessing science laboratory classroom environment. Journal of Research in Science Teaching, 32, 399–422.

Freedman, D.A. (1999). Ecological inference and the ecological fallacy. Report prepared for the International Encyclopedia of the Social and Behavioral Sciences (Technical Report No. 549). Retrieved August 30, 2007, from http://www.stanford.edu/class/ed260/freedman549.pdf

Goh, S.C., & Fraser, B.J. (1998). Teacher interpersonal behaviour, classroom environment and student outcomes in primary mathematics in Singapore. Learning Environments Research, 1, 199–229.

Goldstein, H. (2003). Multilevel statistical models (3rd ed.). London: Edward Arnold.

Goldstein, H. (2004). Some observations on the definition and estimation of effect sizes. In I. Schagen & K. Elliot (Eds.), But what does it mean? The use of effect sizes in educational research (pp. 67–71). Slough, UK: NFER.

Goldstein, H., Browne, W., & Rasbash, J. (2002). Partitioning variation in multilevel models. Understanding Statistics, 1, 223–231.

Hedges, L.V. (2007). Correcting a significance test for clustering. Journal of Educational and Behavioral Statistics, 32, 151–179.

Hedges, L.V., & Hedberg, E.C. (2007). Intraclass correlations for planning group-randomised trials in education. Educational Evaluation and Policy Analysis, 29, 60–87.

Hox, J.J. (2002). Multilevel analysis: Techniques and applications. Mahwah, NJ: Erlbaum.

Khine, M.S., & Fisher, D.L. (Eds.). (2003). Technology-rich learning environments: A future perspective. Singapore: World Scientific.

Kunter, M., & Baumert, J. (2006). Who is the expert? Construct and criteria validity of student and teacher ratings of instruction. Learning Environments Research, 9, 231–251.

Lee, V.E. (2000). Using hierarchical linear modeling to study social contexts: The case of school effects. Educational Psychologist, 35, 125–141.

Logan, K.A. (2007). Should computing be taught in single-sex environments? An analysis of the computing learning environment of upper secondary students. Educational Studies, 33, 233–248.

Moos, R.H., & Trickett, E.J. (1987). Classroom Environment Scale manual (2nd ed.). Palo Alto, CA: Consulting Psychologists Press.

Murray, H.A. (1938). Explorations in personality. New York: Oxford University Press.

Patrick, H., Ryan, A.M., & Kaplan, A. (2007). Early adolescents' perceptions of the classroom social environment, motivational beliefs, and engagement. Journal of Educational Psychology, 99, 83–98.

Rasbash, J., Steele, F., Browne, W., & Prosser, B. (2005). A user's guide to MLwiN version 2.0. Bristol, UK: Centre for Multilevel Modelling, University of Bristol.

Raudenbush, S., & Bryk, A. (2002). Hierarchical linear models. Thousand Oaks, CA: Sage.

Raudenbush, S.W., & Willms, J.D. (Eds.). (1991). Schools, classrooms and pupils: International studies of schooling from a multilevel perspective. San Diego, CA: Academic.

Rosenshine, B. (1970). Evaluation of classroom instruction. Review of Educational Research, 40, 279–300.

Rowe, K.J. (2007). Practical multilevel analysis with MLwiN and LISREL: An integrated course (7th ed.). Camberwell, VIC: ACER.

Schmidt, M., & Čagran, B. (2006). Classroom climate in regular primary school settings with children with special needs. Educational Studies, 32, 361–372.

Seidel, T. (2006). The role of student characteristics in studying micro teaching-learning environments. Learning Environments Research, 9, 253–271.

Shavelson, R.J., & Seidel, T. (2006). Approaches in measuring learning environments. Learning Environments Research, 9, 195–197.

Snijders, T.A.B., & Bosker, R.J. (1999). Multilevel analysis: An introduction to basic and advanced multilevel modeling. London: Sage.

Tabachnick, B.G., & Fidell, L.S. (2007). Using multivariate statistics (5th ed.). Boston, MA: Pearson.



Telli, S., den Brok, P., & Cakiroglu, J. (2007). Students' perceptions of science teachers' interpersonal behaviour in secondary schools: Development of a Turkish version of the Questionnaire on Teacher Interaction. Learning Environments Research, 10, 115–129.

Veltri, S., Banning, J.H., & Davies, T.G. (2006). The community college classroom environment: Student perceptions. College Student Journal, 40, 517–527.

Villar, L.M., & Alegre, O.M. (2007). Student perceptions of classroom environment and instructors' reflections. University of Seville. (ERIC Document Reproduction Service No. ED495249)

Walberg, H.J. (1976). Psychology of learning environments: Behavioral, structural, or perceptual? Review of Research in Education, 4, 142–178.

Waldrip, B.G., Reene, P., Fisher, D.L., & Dorman, J.P. (2008). Changing primary students' perceptions of teacher interpersonal behaviours. Research in Science Education, 38, 213–235.

Wheldall, K., Beaman, R., & Mok, M. (1999). Does the Individualized Classroom Environment Questionnaire (ICEQ) measure classroom climate? Educational and Psychological Measurement, 59, 847–854.

Wubbels, T., & Levy, J. (Eds.). (1993). Do you know what you look like? Interpersonal relationships in education. London: Falmer.
