
Educational Measurement: Issues and Practice, Fall 2011, Vol. 30, No. 3, pp. 10–28

Mean Effects of Test Accommodations for ELLs and Non-ELLs: A Meta-Analysis of Experimental Studies

Maria Pennock-Roman, MPR Psychometric and Statistical Research and Consulting, and Charlene Rivera, George Washington University

The objective was to examine the impact of different types of accommodations on performance in content tests such as mathematics. The meta-analysis included 14 U.S. studies that randomly assigned school-aged English language learners (ELLs) to test accommodation versus control conditions or used repeated measures in counter-balanced order. Individual effect sizes (Glass's d) were calculated for 50 groups of ELLs and 32 groups of non-ELLs. Individual effect sizes for English language and native language accommodations were classified into groups according to type of accommodation and timing conditions. Means and standard errors were calculated for each category. The findings suggest that accommodations that require extra printed materials need generous time limits for both the accommodated and unaccommodated groups to ensure that they are effective, equivalent in scale to the original test, and therefore more valid owing to reduced construct-irrelevant variance. Computer-administered glossaries were effective even when time limits were restricted. Although the Plain English accommodation had very small average effect sizes, inspection of individual effect sizes suggests that it may be much more effective for ELLs at intermediate levels of English language proficiency. For Spanish-speaking students with low proficiency in English, the Spanish test version had the highest individual effect size (+1.45).

Keywords: English language learners, test accommodations, meta-analysis, test validity, state assessments

We must be able to accurately assess the language skills and content knowledge of English language learners (ELLs) in order to monitor their academic progress. The term ELLs refers to students whose first language is not English, encompassing those who are just beginning to learn English (often referred to as "limited English proficient" or "LEP") together with those who have already developed considerable proficiency (La Celle Peterson & Rivera, 1994). The imperative to assess ELLs is an explicit part of the 1994 Improving America's Schools Act (IASA, 1994) and the 2001 No Child Left Behind Act (NCLB, 2002), which hold districts and schools accountable to the same standards for all students, including ELLs. However, because ELLs, by definition, are still developing proficiency in English, their knowledge and skills may often be inadequately reflected by the tests used to assess them (e.g., Abedi, 2004b; Abedi, Hofstetter, & Lord, 2004; AERA, APA, & NCME, 1999; Alderman, 1982; Pennock-Roman, 1992, 2002; Rivera, Collum, Shafer Willner, & Sia, 2006; Rivera, Stansfield, Scialdone, & Sharkey, 2000). Thus, state policies increasingly allow test accommodations to facilitate ELLs' participation in large-scale assessments, a strategy initially developed and used to increase the participation of students with disabilities (SWD).

Author note: Maria Pennock-Roman, MPR Psychometric and Statistical Research and Consulting, 3 Upland Way, Falmouth, ME 04105; [email protected]. Charlene Rivera, Executive Director, The George Washington University Center for Equity and Excellence in Education, 1555 Wilson Boulevard, Suite 515, Arlington, VA 22209-2004; [email protected].

Accommodations are defined as changes to a test or testing situation that are intended to improve student access to the content of the test without altering the test construct. When Rivera et al. (2000, 2006) and Shafer Willner, Rivera, and Acosta (2008) reviewed state assessment policies, they identified an abundance of accommodation alternatives that do not address ELL needs specifically. For instance, in the Shafer Willner et al. (2008) study, 64 out of a total of 104 accommodations were responsive exclusively to SWDs. Moreover, Rivera et al. (2006) and Shafer Willner et al. (2008) found little guidance in the policies for distinguishing which accommodations met the linguistic needs of ELLs versus the disability needs of SWDs. In particular, the categorization of accommodations in these policies traditionally focused on the mode of administering the accommodation (i.e., presentation, response, format, and timing/scheduling). This traditional taxonomy ignored other dimensions of assessment design that may reduce the language load by providing ELLs linguistic access to the content of test items.

Hence, Rivera et al. (2006) and Shafer Willner et al. (2008) developed and applied an alternative taxonomy of ELL-responsive accommodations comprising two broad categories: direct linguistic support and indirect linguistic support. Direct linguistic support accommodations involve adjustments to the language of the test in English, the native language, or a combination of the two languages. Indirect linguistic support accommodations involve adjustments to the conditions under which an ELL takes an assessment that may help the student more adequately process language found in test items but entail no changes to the test itself (e.g., amount of time allowed). This linguistic taxonomy facilitates the cataloguing of ELL-responsive accommodation alternatives; it also provides a framework for selecting the most appropriate accommodations for specific ELL students varying in educational history and literacy skills in English and in their native language. A framework for tailoring the choice of test accommodation for each student was needed because research by Kopriva, Emick, Hipolito-Delgado, and Cameron (2007) has demonstrated that accommodations have to be matched to an ELL's particular needs and characteristics in order to be effective.

While guidelines are urgently needed for choosing appropriate and effective methods to address the unique linguistic and socio-cultural needs of ELLs, the body of research necessary to help create such guidelines is still relatively small and unsystematic (Duran, 2008; Rivera et al., 2006; Sireci, Li, & Scarpati, 2003; Shafer Willner et al., 2008). Although the meta-analysis by Kieffer, Lesaux, Rivera, and Francis (2009) allowed a comparison of ELL-responsive accommodation types by average effect sizes across studies, its design did not focus on identifying which types of accommodations are best suited to particular student needs as recommended by Kopriva et al. (2007).

In order to aid the development of guidelines for practitioners, the objective of this meta-analysis was to extend the systematic evaluation of ELL-responsive accommodations by taking into account the characteristics of ELLs and implementation features that may impact the effectiveness of accommodations.

Background

For an ELL, even one having an advanced level of English language proficiency (ELP), a content assessment in English is likely to introduce construct-irrelevant variance (i.e., score variation due to language and/or format) immaterial to the content knowledge and skills the test is intended to assess (AERA, APA, & NCME, 1999). At lower levels of ELP, the scores typically underestimate students' actual knowledge (e.g., Rivera et al., 2006; Abedi, 2004b). Some studies on the test performance of ELLs in high school and college (Alderman, 1982; Pennock-Roman, 2002) have shown that the degree of construct-irrelevant variance differs with students' level of ELP. It may constitute as much as 34% of the variance in test scores for general verbal tests, 17%–18% of the variance in science tests, and 8% of the variance in mathematics tests for ELL students applying to graduate schools (Pennock-Roman, 2002).

Research on accommodations. Syntheses of accommodations research indicate that there are many more empirical studies on accommodations for SWDs (Cormier, Altman, Shyyan, & Thurlow, 2010; Fletcher, Francis, Boudosquie, Copeland, Young, Kalinowski, & Vaughn, 2006; Sireci, Scarpati, & Li, 2005; Thompson, Blount, & Thurlow, 2002; Thurlow, McGrew, Tindal, Thompson, Ysseldyke, & Elliott, 2000) than for ELLs (Duran, 2008; Kieffer et al., 2009; Rivera et al., 2006; Sireci et al., 2003; Shafer Willner et al., 2008). Research on ELLs has focused on the discussion of policy or the review of past studies rather than the empirical evaluation of accommodations.

One of the criticisms of test accommodation research has been reliance on the analysis of mean performance levels as a method to evaluate the effectiveness and validity of test accommodations, an approach of known limitations (see review by Sireci, Scarpati, & Li, 2005). When the average performance of ELLs on an accommodated test is improved compared with the original, unaccommodated version of the test, there is a possibility that the accommodated version was simply easier, not that it reduced construct-irrelevant variance. Such a result would be undesirable because "accommodated test scores [of ELLs] should be sufficiently equivalent in scale to be pooled with unaccommodated scores" (Acosta, Rivera, & Shafer Willner, 2008, p. 1).

To resolve this ambiguity, researchers have suggested examining the improvement in average score for the groups needing accommodations as well as for non-ELLs without disabilities (see review by Sireci et al., 2005). Sizeable positive differences for both ELLs and non-ELLs between accommodated and unaccommodated versions of a test would indicate that the accommodated version was merely easier for both groups. However, if accommodations have sizeable positive effects for ELLs but not for non-ELLs, the evidence would suggest reduced construct-irrelevant variance due to ELP, a pattern known as the interaction hypothesis. When the means of the accommodated and unaccommodated versions of tests, which are constructed to have the same content specifications, are equal for non-ELLs, we can say that they are tau-equivalent. This term designates tests having the same true score and comparable scores near the mean (for a broader discussion of comparability of scores, see Sireci, 2005). Sireci et al. (2005) suggested that "the interaction hypothesis needs qualification," because they argued that an accommodation can be effective even if the gain for non-ELLs (or non-SWDs) is also positive, as long as the gains are significantly larger for ELLs (or SWDs); this concept has been called "differential boost" in studies reviewed by Sireci et al. (2005).

Among studies examining improvements in mean performance levels for ELLs and SWDs due to accommodations, there is increasing evidence that the effectiveness of accommodation types varies by student characteristics. The empirical study by Kopriva et al. (2007) demonstrated that "accommodations keyed to [each student's] particular needs were significantly more efficacious for . . . ELLs" as compared with no accommodations (p. 11). The effectiveness of accommodations may be influenced by several factors, including the language of instruction (Hofstetter, 2003) and the student's level of proficiency in English (Bensoussan, 1983; Kiplinger, Haug, & Abedi, 2000) and in the native language. Analogously, Fletcher et al. (2006) recommended that an accommodation assigned to a SWD should be specific to the type of academic disability.

The Kieffer et al. (2009) meta-analysis identified 21 empirical studies of accommodations administered to ELLs, but found only 11 experimental and quasi-experimental studies eligible for meta-analysis because of methodological and reporting issues. Their search criteria omitted dissertation studies, and their quantitative synthesis excluded experimental studies having a repeated measures design owing to the bias this design may introduce into the effect size measures used—Cohen's d and Hedges's g. Instead, they compared the results of the repeated-measures studies to the averages for each type of accommodation.


The accommodation types for which averages were reported by Kieffer et al. (2009) included: (1) provision of an English dictionary or glossary; (2) "simplified English"—presentation of items having text with reduced grammatical complexity; (3) provision of a bilingual dictionary or glossary (all Spanish-English); (4) "Spanish version"—presentation of items and test directions in Spanish; (5) "dual-language booklet"—side-by-side presentation of an entire test booklet in both English and Spanish; (6) "dual-language questions + read aloud in Spanish"—side-by-side presentation of test items in both English and Spanish, with reading passages presented only in English and test items and directions also read aloud in Spanish on an audiocassette tape via headphones controlled by students; and (7) "extra time"—providing extra time without any change to the test itself. All available effect sizes among studies for a particular type of accommodation were combined in one average regardless of the diversity of samples with respect to language background. However, subsequent regression analysis using a mixed fixed and random effects model examined whether other study variables besides accommodation type had a significant impact on effect sizes. In their discussion, Kieffer et al. (2009, p. 1186) stated:

The findings from the meta-analysis indicated that English dictionaries and glossaries had a statistically significant—if small—impact on the performance of ELLs and that providing tests in simplified English had a negligible impact. Of the remaining five accommodations that have been studied to date, there is limited evidence that any of them are effective in improving the performance of a large and diverse group of ELLs. . . . It is important to note also that none of the accommodations studied were found to affect the performance of native English speakers, thus suggesting little reason to doubt the validity of providing these accommodations. The findings also demonstrate little evidence of moderating effects of any of the reported characteristics of students or tests, albeit with a relatively small collection of samples. (p. 1186)

Specifically, they found that "extra time provided alone or in conjunction with other accommodations . . . did not explain significant variation across the . . . effect sizes" (Kieffer et al., 2009, p. 1184). In evaluating the contribution of extra time and other moderator variables, there was no discussion of the statistical limitations of a random effects approach with so few effect sizes and with categorical variables having highly skewed frequency distributions. Overall, Kieffer et al. concluded that "empirical research to date indicates that . . . although valid, accommodations are largely ineffective in improving the performance of ELLs on large-scale assessment" (p. 1190).

Rationale

The present meta-analysis is a more comprehensive and more detailed examination of average accommodation effect sizes than the Kieffer et al. meta-analysis. We were able to increase the number of experimental studies included in the quantitative analysis by 40% (four additional studies) because we incorporated dissertation research and the study by Kopriva et al. (2007), and because we used Glass's d index, which allowed inclusion of studies using a repeated measures design (Becker, 1988; Glass, McGaw, & Smith, 1981; Morris & DeShon, 2002).

We categorized effect sizes by accommodation type together with time constraints and separated effect sizes for native language accommodations by student ELP and language of instruction. We also categorized effect sizes for non-ELLs in the same way, when possible. These analyses are used to investigate to what degree the original and accommodated versions of tests provided the same level of difficulty for non-ELLs. Our detailed summary is an endeavor to identify under what conditions and for which students particular accommodation types are effective.

Method

Research Questions Considered

The following question was asked for each accommodation type: (1) How effective was this accommodation type for ELLs? This question is answered by using effect sizes that compare the mean on an accommodated test to the mean on the corresponding standard test booklet with no accommodation. For a few studies that incorporated students' language background we asked: (2) Which accommodations were most effective for ELLs with low ELP and/or who were receiving instruction in their native language (Spanish)? (3) Which accommodations were most effective for ELLs with intermediate ELP and/or who were receiving instruction in English (IE)? With respect to the association between time limits during test administration and the effectiveness of accommodations we considered: (4) Were there systematic trends in the distribution of individual effect sizes for ELLs across time conditions (generous vs. restricted time)? We also considered: (5) What were the average effect sizes for categories of accommodations across time conditions (generous vs. restricted time) for ELLs? For each accommodation type administered to non-ELLs we asked the following question: (6) On average, was this accommodation on the same scale (i.e., tau-equivalent) as the unaccommodated test for non-ELLs under restricted time conditions and/or under generous time conditions? This question was answered by using average effect sizes for non-ELLs for categories of accommodations classified by time conditions.

Search and Study Inclusion Criteria

We identified studies completed between 1990 and 2007 in grades K–12 in the United States that investigated at least one accommodation that was ELL-responsive, as defined by Rivera et al. (2006) and Shafer Willner et al. (2008). We consulted Psychological Abstracts, Dissertation Abstracts, and Language Testing, along with the on-line resources of the National Clearinghouse on English Language Acquisition (NCELA) and the Educational Resources Information Center (ERIC). We also examined technical reports for organizations known to carry out relevant accommodations research and the papers cited in relevant syntheses of, and position papers on, accommodations research, including: Abedi (2001, 2004a, 2004b); Butler and Stevens (1997); Kieffer et al. (2009); Hollenbeck (2002); Phillips (2002); Rivera et al. (2006); Sireci et al. (2003); and Sireci et al. (2005). A total of 25 nonoverlapping empirical studies on test accommodations of ELLs were initially identified.

We included empirical studies in the meta-analysis that met each of five criteria: (1) At least one student group received a direct linguistic support accommodation administered singly or in combination with extra time or with another direct linguistic support accommodation. If combinations of accommodations were used, each individual in the group for which a mean was reported received the same combination of accommodations and was tested under the same administration conditions. That is, studies were excluded if every accommodation group for which a mean was reported was composed of individuals who had received different combinations of accommodations. (2) The dependent variable examined was the test score of a student in the accommodation or control conditions. (3) All statistics on test scores for at least one accommodation condition and its corresponding control that are necessary for the calculation of an effect size (i.e., means and SDs) were reported. (4) The research design was experimental. For studies where the control and accommodation groups were independent samples, ELLs and/or non-ELLs were randomly assigned to accommodated and unaccommodated conditions and their test performance was measured and reported. For studies with a repeated measures design, the choice of which test version was administered first was systematically varied: researchers assigned students randomly to the order in which the accommodated and unaccommodated test versions were administered. (5) The data for which effect sizes were calculated for each study did not overlap with the data of the other studies.

The studies by Abedi, Lord, and Hofstetter (1998) and Hofstetter (2003) originally overlapped substantially in that both contained the same Hispanic students. Whereas the Hofstetter (2003) analyses included only the Hispanic students, Abedi et al.'s (1998) analyses comprised a larger group varying in language background. To create wholly independent results across the two studies, the reported means, SDs, and sample sizes for each condition from the two studies were used to derive the corresponding statistics for the non-Hispanic student groups in Abedi et al. (see Appendix). We re-calculated the effect sizes for the Abedi et al. study based on only the non-Hispanic students, thereby reporting results that were independent of Hofstetter's.
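
The exact derivation appears in the article's Appendix; as an illustrative sketch only, one standard way to back out the complementary subgroup's mean and SD from pooled statistics uses sums and sums of squares. The function name below is ours, and we assume the reported SDs use the n − 1 (sample) denominator:

```python
def subtract_subgroup(n_t, m_t, sd_t, n_1, m_1, sd_1):
    """Recover the mean and SD of subgroup 2 from total-group statistics
    (n_t, m_t, sd_t) and subgroup-1 statistics (n_1, m_1, sd_1),
    assuming sample (n - 1) SDs throughout.  Illustrative sketch only."""
    n_2 = n_t - n_1
    m_2 = (n_t * m_t - n_1 * m_1) / n_2           # subgroup-2 mean
    ss_t = (n_t - 1) * sd_t**2 + n_t * m_t**2     # total sum of squared scores
    ss_1 = (n_1 - 1) * sd_1**2 + n_1 * m_1**2     # subgroup-1 sum of squared scores
    var_2 = (ss_t - ss_1 - n_2 * m_2**2) / (n_2 - 1)
    return m_2, var_2 ** 0.5
```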

Although Kieffer et al. categorized the study by Abedi, Courtney, and Leon (2003b) as quasi-experimental, it did meet our criterion for experimental design. Specifically, Abedi, Courtney, and Leon (2003b) stated that a "process was developed to ensure that the test materials and accommodations were distributed efficiently and randomly [italics added], yet as evenly as possible, among both the ELL and non-ELL students" (p. 27). Their Figures 1 and 2 (Abedi et al., 2003b, p. 29) diagram this process. Hence this study was included in the present meta-analysis and is also present in the Kieffer et al. analyses. One study included by Kieffer et al. did not meet our criteria for experimental design: Brown (1999). (See the Appendix for an explanation of the reasons for exclusion of particular investigations.)

The 14 experimental studies that met all our criteria are identified in the reference list by the numbers in parentheses preceding each reference. Studies having a design with independent samples are numbered 1–12, and the two studies with repeated measures designs are numbered 13 and 14. In all, there were ten studies included in the Kieffer et al. analyses that were also included in the present meta-analysis (see references preceded by numbers 1–6, 8–10, and 12).

Calculation of Individual Effect Sizes

The unit of observation was each separate ELL or non-ELL subsample that received an accommodation and that had a corresponding control group within each of the 14 studies.

For each unit of observation, one effect size was calculated using Glass's d index (Glass et al., 1981):

Glass's d = (MeanA − MeanC) / SDC,

where MeanA is the mean for an experimental group taking a particular Accommodated version, MeanC is the mean for the corresponding Control group taking the standard test booklet, and SDC is the estimate of the standard deviation of test scores in the population of students taking the control, standard test booklet. For both ELLs and non-ELLs, the control group's mean is subtracted from the accommodated group's mean. This index minimizes possible biases in combining effect sizes from independent and correlated sample designs within the same meta-analysis (Becker, 1988; Glass, McGaw, & Smith, 1981; Morris & DeShon, 2002). In the Appendix, the advantages of this index for these data are further detailed, and the difference in metric for ELLs and non-ELLs is explained. Hedges's correction factor c(υ) for υ degrees of freedom was multiplied by each effect size to eliminate a statistical bias occurring for small control group sample sizes (Hedges, 1981). These unbiased estimates of true effect sizes were used in all subsequent statistics.
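
As a concrete sketch of these two steps (Glass's d plus the small-sample correction), the following Python fragment uses the common approximation to Hedges's exact correction factor and assumes that υ here is the control group's degrees of freedom, n_C − 1; both are our assumptions, not statements from the article:

```python
def glass_d_unbiased(mean_a, mean_c, sd_c, n_c):
    """Glass's d = (Mean_A - Mean_C) / SD_C, multiplied by Hedges's (1981)
    small-sample correction c(v), approximated here by 1 - 3/(4v - 1),
    with v = n_c - 1 degrees of freedom for the control-group SD."""
    d = (mean_a - mean_c) / sd_c
    v = n_c - 1
    return (1.0 - 3.0 / (4.0 * v - 1.0)) * d
```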

With only one exception, each separate ELL group having a corresponding control group generated an effect size. Owing to extremely small sample sizes for ELL groups in the Rivera and Stansfield (2004) study, it was necessary to combine the means and SDs for more than one randomly equivalent ELL group. The Rivera and Stansfield design included two separate forms of the tests per grade level that were constructed to be parallel tests, with four ELL subsamples per level—two receiving accommodated versions and two receiving the corresponding unaccommodated versions using a randomly equivalent assignment of test forms. The ELL results for the two accommodated versions were joined and the two unaccommodated groups were joined within their grade levels in order to obtain more stable estimates of means and SDs for ELLs.
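
Joining randomly equivalent groups in this way is the reverse of the subgroup decomposition sketched earlier; under the same n − 1 assumption, a brief sketch:

```python
def pool_groups(n_1, m_1, sd_1, n_2, m_2, sd_2):
    """Combine two groups' means and SDs into pooled statistics via
    sums and sums of squares (sample SDs assumed).  Sketch only."""
    n = n_1 + n_2
    m = (n_1 * m_1 + n_2 * m_2) / n               # pooled mean
    ss = ((n_1 - 1) * sd_1**2 + n_1 * m_1**2
          + (n_2 - 1) * sd_2**2 + n_2 * m_2**2)   # pooled sum of squared scores
    var = (ss - n * m**2) / (n - 1)
    return n, m, var ** 0.5
```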

Calculation of Average Effect Sizes

Categorization of accommodations. Before calculating average effect sizes, accommodations were grouped into homogeneous categories that could be anticipated to share the same expected values—a common set of effect size parameters (Morris & DeShon, 2002). First, accommodations used in different studies were classified into 11 types considered qualitatively distinct from each other. Four of the accommodations were the same as those examined by Kieffer et al. (2009): plain English (called "simplified English" by Kieffer et al.), bilingual glossary (paper and pencil versions), Spanish version, and extra time. A fifth type, English dictionary/glossary, reflected paper and pencil versions of this accommodation, with the computer-administered version, pop-up English glossary, comprising a sixth category. The seventh category, dual language (DL), was similar to Kieffer et al.'s "dual language booklet," except that it also included Kieffer et al.'s "dual language questions + read aloud in Spanish" and essay tests for which students had the option of writing their responses either in English or Spanish, as in Aguirre-Munoz (2000), a dissertation study not part of Kieffer et al.'s analyses. Three additional types of accommodations in the present synthesis—picture dictionary, pop-up bilingual glossary, and read aloud (in English), administered individually or in pairs—were unique to the Kopriva et al. study not analyzed by Kieffer et al. The eleventh type was the small groups accommodation, where the unchanged test was administered in small groups. This category comprised only one effect size for ELLs and one for non-ELLs.

Further considerations led to categories within each type that were variants according to timing conditions. For each accommodation type, subsamples were categorized into three groups according to their expected values: (1) equally restricted time constraints for both the accommodated and control groups, with an expected effect size of δj, defined as the true average effect size for the jth accommodation type in the population under restricted time conditions; (2) equally generous time constraints for both the accommodated and control groups, with an expected effect size of δj + δjXT, where the parameter δjXT is a possible time-by-accommodation interaction for experimental conditions allowing generous time; and (3) extra time allowed only for the accommodated group, with an expected effect size of δj + δjXT + δT, where the last parameter is a main effect for time—a possible change in effect size when either the accommodated or control test booklet is administered with extra time. The Appendix contains clarification on why the third group has more parameters. Conditions were categorized as allowing equally generous time for both groups if the authors specifically mentioned allowing extra time for both groups, or if they described the measure as being a power test, which means it allowed unlimited time or sufficient time for all students to finish under standard administration conditions.
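
Restating this parameterization in display form (no new content, just the three expected values defined above):

```latex
E[d_j] =
\begin{cases}
  \delta_j & \text{(1) restricted time for both groups,} \\
  \delta_j + \delta_{jXT} & \text{(2) generous time for both groups,} \\
  \delta_j + \delta_{jXT} + \delta_T & \text{(3) extra time for the accommodated group only.}
\end{cases}
```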

Lastly, we took into account students' levels of ELP and the language of instruction in deciding what effect sizes could be considered as "replications" sharing a common average in the population. The few groups that were administered the Spanish version were so heterogeneous in language background that one could anticipate they would each have a different expected value for the effect size. Thus, we report individual effect sizes for the Spanish version but not their average value across different studies.

Weighting schemes and sampling variances. The effect size values corrected for bias were averaged for each category by weighting each effect size from the ith study for the jth accommodation type by the inverse of its sampling variance (wij = 1/vij), as recommended by Shadish and Haddock (1994). The values for studies with repeated measures were in the middle of the range of values for their corresponding category, revealing no evidence of bias. Therefore, all subsamples for the same accommodation category were grouped together regardless of design.
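
These are the standard fixed-effects formulas (Shadish & Haddock, 1994); a minimal sketch, with variable names of our choosing:

```python
def fixed_effect_mean(ds, vs):
    """Inverse-variance-weighted average of bias-corrected effect sizes
    ds, given their sampling variances vs, plus its standard error."""
    ws = [1.0 / v for v in vs]              # w_ij = 1 / v_ij
    w_sum = sum(ws)
    d_bar = sum(w * d for w, d in zip(ws, ds)) / w_sum
    return d_bar, (1.0 / w_sum) ** 0.5      # SE = sqrt(1 / sum of weights)
```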

Different equations are necessary to calculate the sampling variances for the independent groups and repeated measures designs included among the 14 experimental studies. For designs involving independent groups with possibly unequal variances, we used the normal approximation to the sampling variance of Glass's unbiased effect size using sample estimates of parameters given by Gleser and Olkin (1994). For the repeated measures designs, we used the normal approximation of the sampling variance for Glass's unbiased index given by Becker (1988).
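
For orientation only, the next sketch gives large-sample approximations of the same general form; the exact expressions the authors used are in Gleser and Olkin (1994) and Becker (1988), so these should be read as hedged stand-ins rather than the study's formulas:

```python
def var_glass_d_independent(d, n_a, n_c, sd_a, sd_c):
    """Delta-method variance for Glass's d with independent groups and
    possibly unequal variances (approximation; see Gleser & Olkin, 1994,
    for the exact expression used in the article)."""
    return (sd_a**2 / sd_c**2) / n_a + 1.0 / n_c + d**2 / (2.0 * (n_c - 1))

def var_glass_d_repeated(d, n, r):
    """One common large-sample variance for a repeated measures design,
    where r is the correlation between the two conditions (see Becker,
    1988, for the exact expression)."""
    return 2.0 * (1.0 - r) / n + d**2 / (2.0 * (n - 1))
```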

Additionally, we calculated the sampling variance of the average effect size for each category of accommodation, using a fixed-effects model (Shadish & Haddock, 1994). We opted for a fixed effects approach because it is preferred over a random effects approach when the studies cannot be viewed "as representative of a larger population or universe of implementations of a treatment" (Raudenbush, 1994, p. 316), particularly when the number of studies per category of effect size is small. Specifically, Raudenbush (1994) argued that:

[A] conception of study effects as fixed versus random depends in part on the number of studies available. In the extreme case of . . . only two studies [of a treatment or procedure] . . . the notion of generalizing to a larger population of studies would . . . seem ludicrous. . . . At the other extreme, with several hundred studies, the fixed effects approach would make little sense. (p. 307)

In the present synthesis, there were only two categories of accommodations for ELLs comprising more than three studies that could be considered replications of the same treatment. The final step was to evaluate the homogeneity of effect sizes within categories using the Q statistic (Shadish & Haddock, 1994). Information about the equations used in the present meta-analysis for average effect sizes, standard errors/variances of individual and average effect sizes, and the Q statistic is available from the first author upon request.
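
The Q statistic is computed from the same ingredients as the weighted mean; a brief sketch:

```python
def q_statistic(ds, vs, d_bar):
    """Homogeneity statistic for k effect sizes assumed to share one
    population mean; under homogeneity, Q is approximately chi-square
    distributed with k - 1 degrees of freedom."""
    return sum((d - d_bar) ** 2 / v for d, v in zip(ds, vs))
```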

Results

Effectiveness of Accommodations: Effect Sizes for ELLs

In the 14 experimental studies, 50 separate groups of ELLs received an accommodation and could be contrasted to a randomly equivalent ELL group receiving the unaccommodated test (control condition). This count includes two ELL groups that served as their own control in the studies with a repeated measures design.

Figure 1 displays all 50 effect sizes for ELLs in an "inside-out" display, where the result for a subsample is placed in a rectangle with borders, and its vertical position indicates the interval that contains its effect size. Within each box, the first three entries identify the effect size by the study number, an abbreviation for its accommodation type, and the ELL subgroup within that study (i.e., subgroups classified either by ELP, language of instruction, or grade level). The fourth entry in the box is the value of the effect size d after correction for bias. The author(s) and date corresponding to each study number are identified in the references. Note that studies 1–12 had designs based on independent samples, whereas studies 13–14 had a repeated measures design.

Effect sizes in Figure 1 are displayed in two groupings according to the amount of time allowed in test administration. On the left are effect sizes from studies that offered generous time conditions, either by allowing extended time or by administering a power test (i.e., one that has essentially unlimited time). On the right-hand side of Figure 1 are the more numerous effect sizes from studies that did not mention using a power test or allowing extended time. The left-hand side includes two effect sizes where extra time is the accommodation itself (Study #1—Abedi, Courtney, & Leon, 2003a; Study #4—Abedi, Hofstetter, Baker, & Lord, 2001). Several of the effect sizes on the left are from studies with a power test: (#13) Abedi, Lord, and Plummer (1997); (#14) Albus, Thurlow, Liu, and Bielinski (2005); and (#8) Anderson, Liu, Swierzbin, Thurlow, and Bielinski (2000). Also on the left are effect sizes from (#3) Abedi, Courtney, Mirocha, Leon, and Goldberg (2005), where all conditions were provided with extra time. One effect size on the left is from (#4) Abedi, Hofstetter, Baker, and Lord (2001) that provided extra time for the English dictionary/glossary condition but not for the control group.

[FIGURE 1. Inside-out display of Glass's unbiased d values for accommodations administered to ELLs. The figure itself is not reproduced here; it plots the 50 effect sizes on a vertical scale from −1.20 to +1.45, with "Less Restricted Time" conditions in the left column and "More Restricted Time Limits" in the right column. Note. Each cell includes the study number, followed by the accommodation type, other variables, and the effect size d, in that order. Level of proficiency in English or language of instruction is noted when available rather than grade level due to available space on the figure. Accommodation codes: EG = English Dictionary/Glossary; EGP = English Glossary, Pop-up; EP = English, Plain; RA = Read Aloud; PD = Picture Dictionary; BG = Bilingual Glossary; BP = Bilingual Pop-up Glossary; DL = Dual Language; SV = Spanish Version; GS = Small Groups; T = Extra Time for A Group Only; C1 = Combined RA + PD; C2 = Combined RA + BP; C3 = Combined BP + PD. Other variables: 3, 4, 6, 8 = grade; M = middle school; IE = instruction in English; IS = instruction in Spanish; L, LI, HI = low, low intermediate, and high intermediate English proficiency.]

A positive effect size indicates that the accommodation group performed better on the test, whereas a negative effect size indicates that the control group receiving the standard test version performed better. We use Cohen's (1988) rules of thumb for identifying the absolute value of an effect size as "large" (about 0.8, or interval 0.66 and larger); "medium" (about 0.5, or interval 0.36 to 0.65); or "small" (about 0.2, or interval 0.06 to 0.35). Absolute values smaller than 0.05 are categorized as trivially small.
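
Written out as a small sketch, these bins can be expressed as a classifier (we assign the unmentioned sliver between 0.05 and 0.06 to "trivially small"; that boundary choice is ours):

```python
def label_effect_size(d):
    """Cohen's (1988) rules of thumb, binned as in this article."""
    a = abs(d)
    if a >= 0.66:
        return "large"
    if a >= 0.36:
        return "medium"
    if a >= 0.06:
        return "small"
    return "trivially small"
```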

The cells in Figure 1 are shaded to distinguish direct and indirect linguistic support accommodations. At a glance, it can be seen that direct linguistic support accommodations in English predominate (unshaded cells, which constitute 58% of the 50 effect sizes). Those receiving direct linguistic support accommodations in the native language (medium gray) constitute 36%, whereas indirect linguistic support accommodations (extra time and small groups conditions, shaded in the darkest gray) constitute 6%.

Table 1. ELLs: Contrasting Average Values of Effect Sizes for the Same Accommodation Across Time Conditions

                              Both A & C Groups,     Both A & C Groups,     Extended Time for
Accommodations                Restricted Time        Extended Time          A Group Only
                              (δj)                   (δj + δjXT)            (δj + δjXT + δT)
Pop-Up English Glossary       0.285* (1)             Not studied            Not studied
English Dictionary/Glossary   0.085 (2)              0.229* (9)             0.295 (13)
Picture Dictionary            0.036 (3)              Not studied            Not studied
Plain English                 0.053 (4)              0.108 (10)             Not studied
Read Aloud                    −0.002 (5)             Not studied            Not studied
Dual Language                 0.003 (6)              0.299 (11)             Not studied
Pop-up Bilingual Glossary     0.069 (7)              Not studied            Not studied
Bilingual Glossary            −0.176* (8)            0.247 (12)             Not studied
Extra Time (by itself)¹       N/A                    N/A                    0.233 (14)

Note. Numbers in parentheses refer to the order of the same means in Table 2. The parameters in parentheses represent all components of the expected values for the jth accommodation in that column, which are estimated by the average values in each cell of the table. In addition to the accommodation main effect δj, the expectation may include the extra-time main effect δT and/or the interaction effect δjXT.
*Significantly different from zero at p < .05 (two-tailed).
¹The expected value of the Extra Time accommodation is defined to have only one component—the main effect for time, δT. All other δ components are defined to be zero for this accommodation.

Although effect sizes ranged from −1.13 to +1.45, the majority (36 values) were clustered in the range −0.12 to +0.41. We examined effect sizes to see if there were systematic variations in size according to accommodation type, language background of the students, generosity of time limits, test content (e.g., science, mathematics), and grade level. We were able to find systematic differences according to all of these considerations except for test content and grade level. First we report the variation in accommodation type effects within language background groupings, then the midpoints of the most common intervals of effect sizes within time constraint groupings.

Most effective accommodations for low ELP. A critical question is: Which accommodations were most effective for ELLs with low ELP and/or who were receiving instruction in their native language (Spanish)? The clear answer is that Spanish language versions of tests (SV in Figure 1, all on the right-hand side) had, by far, the largest effect sizes for students with low ELP and/or who had been instructed in Spanish, as compared with other accommodation methods. In particular, for Study #10 (Hofstetter, 2003), the group receiving instruction in Spanish (IS) had an effect size of +0.95 for the Spanish version compared with +0.13 for plain English (EP). Similarly, for Study #7, Aguirre-Munoz (2000), the group with the lowest proficiency in English (L) had an effect size of +1.45 (p < .05, two-tailed) for the Spanish version as compared with +0.40 for plain English, and −0.16 for DL essay tests with the option of writing in English or Spanish.

Most effective accommodations for intermediate ELP. The results appear markedly different if we ask: Which accommodations were most effective for ELLs with intermediate ELP and/or receiving IE? For the high intermediate (HI) ELP group in Study #7, Aguirre-Munoz (2000), the plain English accommodation was more effective (0.57, p < .05, two-tailed) than either the Spanish version (−0.11) or the DL (0.04) accommodation. For the low intermediate (LI) ELP groups in Study #7, Aguirre-Munoz (2000), the plain English effect size (0.13) was not significantly different from zero, but higher than the Spanish version value (−0.02) and essentially equal to the DL (0.12) effect size. For those receiving IE in Study #10 (Hofstetter, 2003), the plain English accommodation was not effective (0.03) but it was still more favorable than the Spanish version (−0.34). The Spanish version result for non-Hispanic ELLs in Study #5 (Abedi, Lord, & Hofstetter, 1998) was extremely negative (−1.13).

These results are consistent in showing that the Spanish version accommodation was not effective for three classifications of ELLs: those with intermediate levels of ELP, those receiving IE, and those with a home language background other than Spanish. At HI ELP, plain English was most effective, but at LI levels of ELP, none of the accommodations had effect sizes larger than 0.13.

Distribution of individual effect sizes across time conditions. In Figure 1, the two most frequent intervals for effect sizes for tests with standard time limits (on the right-hand side) had midpoints of +0.00 and +0.05. In contrast, the most frequent interval of effect sizes for tests with more generous time limits (on the left-hand side) had a midpoint of +0.30. This trend for larger individual effect sizes when available time was extended or generous was explored further in the analyses based on average effect sizes.

Average Effect Sizes for ELLs by Accommodation Type and Time Condition

Average effect sizes, corresponding standard errors, and confidence intervals are shown in Tables 1 and 2 for 14 categories of accommodation-by-time conditions, which, in total, involve 40 independent ELL groups. There were so few studies that reported means for subsamples differing in ELP that it was not meaningful to calculate averages by level of ELP. We do not provide an average for the six Spanish version effect sizes because they were too heterogeneous in language background. Also excluded from the averages in Tables 1 and 2 are three effects for the unique combinations of direct linguistic support accommodations from Study #11 (Kopriva et al., 2007), and the unique small groups effect size from Study #1 (Abedi et al., 2003a). Table 1 displays the means in a cross-tabulation of accommodation type by time condition, whereas Table 2 lists the same averages, one per row, with the corresponding statistical details. To simplify matching values between the two tables for the reader, we numbered the means in Table 1 (shown in parentheses) according to the corresponding row in Table 2. Note that the size of the standard errors varied greatly depending on the number of subsamples on which the average was based and the sampling variances of the individual effect sizes (dependent on sample sizes). Hence, some mean effect sizes larger than 0.25 were not significant if the standard error was also large owing to low statistical power.

Table 2. ELLs: Means and Standard Errors for Effect Sizes

                                  Mean       St     95% CI            # Sub-    Total    Total    Expected
Category                          Effect     Err    Lower    Upper    samples   N (A)    N (C)    Value
A and C Groups with Restricted Time Limits
1  Pop-up English Glossary        0.285*     0.126  0.038    0.532    2         119      166      δj
2  English Dictionary/Glossary    0.085      0.050  −0.013   0.183    6         827      835      δj
3  Picture Dictionary             0.036      0.245  −0.445   0.517    1         36       33       δj
4  Plain English                  0.053      0.042  −0.029   0.136    11        1178     1183     δj
5  Read Aloud                     −0.002     0.246  −0.484   0.480    1         33       33       δj
6  Dual Language                  0.003      0.125  −0.242   0.247    3         88       159      δj
7  Pop-up Bilingual Glossary      0.069      0.249  −0.420   0.558    1         36       33       δj
8  Bilingual Glossary             −0.176*    0.067  −0.308   −0.044   3         324      525      δj
A and C Groups with Little or No Time Constraints
9  English Dictionary/Glossary    0.229*     0.096  0.042    0.417    3         215      217      δj + δjXT
10 Plain English                  0.108      0.062  −0.013   0.229    3         502      555      δj + δjXT
11 Dual Language                  0.299      0.216  −0.123   0.722    1         53       52       δj + δjXT
12 Bilingual Glossary             0.247      0.151  −0.050   0.543    2         80       84       δj + δjXT
A Group Extra Time, C Group No Extra Time
13 English Dictionary/Glossary    0.295      0.244  −0.184   0.773    1         29       144      δj + δjXT + δT
14 Extra Time                     0.233      0.132  −0.026   0.492    2         119      224      δT

*Significantly different from zero at p < .05 (two-tailed).

We consider both the magnitude of the average effect sizes and statistical significance, giving them equal weight, because there was low statistical power for the majority of the effect sizes. Accommodation conditions having nontrivial effect sizes that do not reach statistical significance may be worth examining in future research to verify whether the effect sizes are replicable and statistically significant with a larger sample size. In Table 1, the values in the first column are estimates of the main effect for the accommodation δj under restricted time conditions, whereas the values in the second column incorporate estimates for both the main effect δj and the time interaction δjXT. The values in column three have a third component in the expected value—the main effect of extra time δT.

One can observe higher effect sizes in the second and third columns of Table 1 for conditions having generous time as compared with the first column for conditions with restricted time. In the first column, 6 out of 8 average effect sizes are below 0.10 in absolute value, whereas 5 out of 6 values in columns two and three are above 0.20. In the first column, the largest positive and statistically significant value (0.285, p < .05, two-tailed) was for the English language pop-up glossary accommodation. There was also one statistically significant negative value for the bilingual glossary category (−0.176, p < .05, two-tailed).

There was evidence of nontrivial interaction effects for all four accommodations that were administered under generous time limits for both the accommodated and control groups. Subtracting values in the first column of Table 1 from corresponding values in the second column yields improvements in the average effect size values of 0.144 for the paper and pencil English dictionary/glossary type, of 0.055 for the plain English type, of 0.296 for the DL type, and of 0.423 for the bilingual glossary type.

Variation in effect sizes within the same accommodation category. For each of the nine averages based on more than one effect size value (see Table 2, column 6), we calculated the Q statistic to evaluate whether the values were consistent with the hypothesis that they shared the same population mean value. This analysis suggests that for all but two categories our grouping of subsamples was, indeed, homogeneous. The only two significant results showing heterogeneity in effect sizes within a category were the plain-English-contrasted-to-standard-version category (p < .0122) and the bilingual-glossary-with-extra-time-contrasted-to-standard-test-versions-plus-extra-time category (p < .0370). The significant Q statistic in the bilingual glossary class reflects a large difference between only two values for 4th grade (+0.453) and 8th grade (−0.227) in the Abedi et al. (2005) study (#3 in Figure 1).

Equivalency of Scale for Accommodated and Unaccommodated Tests under Restricted or Generous Time Conditions: Average Effect Sizes for Non-ELLs

In order to examine whether accommodations preserved the scale of the original test, effect sizes were calculated for 32 separate groups of non-ELLs that received an accommodation and were contrasted to a randomly equivalent control group receiving the unaccommodated test. In the 32 non-ELL accommodated groups, 90.6% received direct linguistic support accommodations in English (75%) or the native language (15.6%), the latter mostly bilingual dictionaries (9.4%). Additionally, two non-ELL groups received an indirect linguistic support accommodation—extra time using the unchanged test booklet (6%)—and one group was tested in a small group condition.

Table 3. Non-ELLs: Contrasting Average Values of Effect Sizes for the Same Accommodation Across Time Conditions

                              Both A & C Groups,     Both A & C Groups,     Extended Time for
Accommodations                Restricted Time        Extended Time          A Group Only
                              (δj)                   (δj + δjXT)            (δj + δjXT + δT)
Pop-up English Glossary       0.032 (1)              Not studied            Not studied
English Dictionary/Glossary   −0.004 (2)             0.018 (6)              0.417 (8)
Plain English                 −0.008 (3)             0.064* (7)             Not studied
Dual Language                 −0.169 (4)             Not studied            Not studied
Bilingual Glossary            −0.134 (5)             Not studied            Not studied
Extra Time (by itself)¹       N/A                    N/A                    −0.030 (9)

Note. Numbers in parentheses refer to the order of the same means in Table 4. The parameters in parentheses represent all components of the expected values for the jth accommodation in that column, which are estimated by the average values in each cell of the table. In addition to the accommodation main effect δj, the expectation may include the extra-time main effect δT and/or the interaction effect δjXT.
*Significantly different from zero at p < .05 (two-tailed).
¹The expected value of the Extra Time accommodation is defined to have only one component—the main effect for time, δT. All other δ components are defined to be zero for this accommodation.

Table 4. Non-ELLs: Means and Standard Errors for Effect Sizes

                                  Mean       St     95% CI            # Sub-    Total    Total    Expected
Category                          Effect     Err    Lower    Upper    samples   N (A)    N (C)    Value
A and C Groups with Restricted Time Limits
1  Pop-up English Glossary        0.032      0.109  −0.180   0.245    2         112      229      δj
2  English Dictionary/Glossary    −0.004     0.048  −0.098   0.090    6         871      924      δj
3  Plain English                  −0.008     0.017  −0.042   0.026    10        6,571    6,591    δj
4  Dual Language                  −0.169     0.138  −0.439   0.101    1         74       119      δj
5  Bilingual Glossary             −0.134     0.070  −0.272   0.003    3         305      565      δj
A and C Groups with Little or No Time Constraints
6  English Dictionary/Glossary    0.018      0.079  −0.137   0.173    3         193      187      δj + δjXT
7  Plain English                  0.064*     0.028  0.010    0.118    2         569      631      δj + δjXT
A Group Extra Time, C Group No Extra Time
8  English Dictionary/Glossary    0.417      0.216  −0.006   0.840    1         30       130      δj + δjXT + δT
9  Extra Time                     −0.030     0.118  −0.261   0.201    2         109      228      δT

*Significantly different from zero at p < .05 (two-tailed).

As compared with the values for ELLs, the effect sizes for non-ELLs were closer to zero and less variable. Hence, in Tables 3 and 4, we report average values of effect sizes for those categories of accommodations that comprised all but two of the effect sizes. The two excluded effect sizes—for the small groups accommodation for non-ELLs and the Spanish version accommodation for Hispanic non-ELLs—are given later. Table 3 displays effect sizes by type of accommodation cross-classified by time condition, where applicable, whereas Table 4 gives more statistical information for these averages.

Most of the values for the linguistic accommodations of Table 3 were essentially zero. In the first column, there were two exceptions, DL and bilingual glossary. Both accommodations had negative values that were nontrivially large, suggesting that these accommodations were harder than the unaccommodated test version for non-ELLs when both groups had restricted time limits. In the second column, the plain English accommodation, administered with generous time for both groups, had an average effect size that was very small, yet statistically significant. The average values for the paper and pencil English dictionary/glossary category were essentially zero when both the accommodated and unaccommodated groups received the same time constraint. However, when only the experimental group had extra time, the effect was nontrivial in size (third column), though not statistically significant (+0.417).

Variation in effect sizes within the same accommodation category. The Q statistic results were nonsignificant in 6 out of 7 categories having two or more accommodated subsamples, confirming that the categories were sufficiently homogeneous. The exception was the plain English condition administered with restricted time limits for both the accommodated and control groups. In this category, comprising 10 experimental subsamples, the effect sizes ranged from −0.24 (Abedi, Hofstetter, Baker, & Lord, 2001) to +0.28 for non-Hispanic non-ELLs in Abedi et al. (1998). The most frequent interval included four values in the trivially small range (±.05).

Effect sizes for non-ELLs omitted from Tables 3 and 4. In addition to the average effect sizes, there were two individual effect sizes for accommodations infrequently administered to non-ELLs. There was a small, statistically nonsignificant negative effect of −0.26 based on one study of the small group accommodation (Abedi et al., 2003b). This effect size was based on very small sample sizes (four and nine cases in the experimental and control groups, respectively). There was also one effect size for the Spanish version accommodation for non-ELLs of Hispanic background (−0.78, p < .05, two-tailed) that was based on only 24 cases in the accommodation group and 61 cases in the control group from schools in California (Hofstetter, 2003).

Discussion

In order to facilitate the evaluation of each direct linguistic support accommodation type for specific needs of ELLs, we consider each type one at a time and integrate results for all research questions for that accommodation, where possible. We begin by comparing our findings with the major conclusions of Kieffer et al. (2009) concerning time conditions and the effectiveness of the two most frequently studied accommodations—English dictionary/glossary variants and plain English. Later, we consider other direct linguistic support accommodation types and the limitations of accommodation research up to the present time.

Points of Agreement and Disagreement with Kieffer et al. (2009)

Time constraints. In contrast to Kieffer et al. (2009), we identified a clear pattern of interaction effects between having generous time limits and particular accommodations requiring additional printed materials. This discrepancy can be partly explained by two design differences between the meta-analyses: (1) the categorization of time limits in administration conditions and (2) the selection of studies. For example, Kieffer et al. grouped the power test in the Anderson et al. (2000) study together with accommodations that did not include extra time (see Kieffer et al.'s Supplement A, p. 1192), which is justifiable, in a sense, because the time allowed was routine for that type of test, not "extra" to routine conditions. In contrast, we classified conditions according to the availability of time for students to respond to the test booklet, which logically placed power tests together with designs that provided extra time to both experimental and control groups. This latter categorization reflects conditions where students had ample time to respond to the test, regardless of whether the time provided was routine or expanded beyond the routine. Additionally, we were able to include two more studies with power tests in the meta-analysis.

English dictionary/glossary. Our findings with respect to the English dictionary/glossary conditions differ substantially from Kieffer et al.'s analyses. We computed separate effect sizes for four variants of the English dictionary/glossary condition: (1) pop-up glossary with restricted time for both groups, (2) paper and pencil version with restricted time for both groups, (3) paper and pencil version with generous time for both groups, and (4) paper and pencil version with extra time for the accommodated group only. It was evident that only the pop-up version had a significantly nonzero average effect size (0.285, p < .05, Table 1, column 1) compared with other English dictionary conditions administered with restricted time limits (0.085, p > .05, Table 1, column 1). The average effect size for paper and pencil versions was significantly different from zero only for the three studies with extended time for both accommodated and control groups (0.229, p < .05, Table 1, column 2). When Kieffer et al. grouped these four variants into one omnibus dictionary category, the average result they obtained was intermediate among the four values (0.15). Hence, their overly broad categorization obscured important differences among these variants that are necessary to inform policy about implementing dictionary accommodations.

The greater effectiveness of the pop-up English dictionary/glossary when time limits were restricted is most likely due to its more convenient and time-efficient delivery of information. Moreover, it was tau-equivalent to the original test booklet for non-ELLs.

In contrast, the paper and pencil English dictionary/glossary accommodation required power tests or extra time for all groups to be effective; its effects were essentially zero under restricted time conditions. Also, it was tau-equivalent to the original only when both the original test and the accommodated test received equal time conditions, whether restricted or generous (Table 3, columns 1 and 2). However, if the accommodated version was given extra time but those receiving the original test booklet had restricted time, the dictionary/glossary version was substantially easier for non-ELLs (based on only one effect size) and, therefore, not tau-equivalent (Table 3, column 3). There is no empirical evidence yet to evaluate whether this accommodation type is differentially more effective at low vs. intermediate levels of ELP. The studies that classified ELLs by language of instruction or level of ELP omitted this accommodation type in their design.

Plain English. We have confirmed Kieffer et al.'s findings that (1) average effect size values for plain or simplified English accommodations for ELLs were small, and (2) plain English is essentially tau-equivalent to the original test version for non-ELLs under restricted time conditions. Nonetheless, we disagree with their other conclusions. Specifically, they observed that "there is little reason to be optimistic about the potential effectiveness of simplified English as a test accommodation" (p. 1187). In our analysis it had a substantial individual effect size among ELLs categorized as having HI ELP in the one study (Aguirre-Munoz, 2000) that incorporated language proficiency into its design (Figure 1, Study #7). The greater effectiveness of plain English for students with intermediate ELP was also found in studies by Kiplinger et al. (2000) and Albus et al. (2005). The ELP analyses from these two latter studies did not meet the criteria for inclusion in the meta-analysis owing to insufficient descriptive statistics per ELP group. However, the authors of both investigations found significantly higher test performance for the accommodated groups vs. unaccommodated groups at an intermediate level of ELP, whereas no effects were observed at low levels of ELP.

On average, the plain English test booklet was also a little easier (effect size 0.064, p < .05, in Table 3, column 2) than the original test booklet for non-ELLs (therefore not tau-equivalent) when generous time was provided for both experimental and control conditions. This finding may have occurred because of the presence of nonnative speakers of English within the non-ELL group who had exited ELL programs but who were still experiencing some difficulty in understanding the unaccommodated test. Most researchers relied on the classification of ELLs vs. non-ELLs by schools, although definitions and procedures across schools within the same district or across different districts are often inconsistent (Ragan & Lesaux, 2006). Hence, in the studies examined, the non-ELL groups may have been quite heterogeneous in ELP.


Also, the variation among effect sizes across ELL groups receiving plain English was statistically significant, indicating greater heterogeneity in this accommodation type as compared with other English language accommodations. One likely explanation is that this variation reflects differences in the ELP of the ELL samples. However, it is also possible that the quality of the implementations of the accommodation varied by study. Alternatively, the original test booklets may have varied in grammatical complexity, and some original items may already have had a reduced language load, thereby benefitting little from more simplification. Abedi et al. (1998) and Abedi et al. (2004) have argued that plain English test versions reduce construct-irrelevant variance for both ELLs and non-ELLs and may be a more valid way to assess mathematics and science for all students. In our opinion, judgment on the effectiveness and scale equivalence of plain English should be suspended until there are more investigations having factorial designs that cross type of accommodation with levels of students' ELP and that evaluate the linguistic complexity of both the original test version and its plain English version.

Effectiveness of accommodations. Our findings contradict Kieffer and colleagues' conclusion that accommodations are "largely ineffective" for ELLs (p. 1190). We found that four out of five average effect sizes for direct linguistic support accommodations were 0.229 or larger when ELLs were given sufficient time (Table 1, columns 2 and 3). Kieffer et al. underestimated these effects because they applied a coarser grouping of accommodation conditions in which the more numerous effect sizes under restricted time conditions brought down the average for each category. Additionally, there are many design features of the experimental studies that may have led to an underestimation of the effectiveness of most accommodations.

Meaning of trivial effect sizes for non-ELLs. Like Kieffer et al., we found near-zero effect sizes for non-ELLs for the majority of accommodations. However, unlike Kieffer and colleagues, we do not equate equality of means for non-ELLs with evidence of validity because, in isolation, it does not reflect any reduction of construct-irrelevant variance for ELLs. The equality of means for non-ELLs confirms that most accommodated and unaccommodated versions of tests have equal overall difficulty, which, by itself, is evidence of scale comparability, not validity. It is only when a substantial improvement for ELLs is paired with near-zero improvement for non-ELLs that there is evidence of greater validity due to a reduction in construct-irrelevant variance (Sireci et al., 2005).

Variation by test content and grade level. Our inspection of effect sizes in Figure 1 showed no consistent pattern according to students' grade level or content tested. This finding confirms Kieffer et al.'s nonsignificant results for the effect of these dimensions on effect sizes.

Other English Language Accommodations

The read aloud (−0.002, Table 1, column 1) and the pop-up picture dictionary (0.036, Table 1, column 1) effect sizes, each based on only one study, show essentially zero effects for each accommodation on its own or in combination with the other (Kopriva et al., 2007). Their tau-equivalency in non-ELL groups has not been studied yet. Neither accommodation has been studied with generous time conditions or with designs incorporating language of instruction or ELP level as separate variables in the design. Although there is little experimental research to support the use of read aloud for ELLs, it is the most commonly provided English language accommodation—currently used in 40 states (Shafer Willner et al., 2008; Acosta et al., 2008). Conversely, English dictionary/glossary accommodations have effect sizes above 0.2 in pop-up format or in paper and pencil format with generous time, but they are provided in only 11 states (Shafer Willner et al., 2008; Acosta et al., 2008).

Native Language Accommodations

Dual language. This class of accommodation has not been well studied: only four results were available for ELLs and one for non-ELLs. The limited findings suggest that this accommodation requires generous time limits to be effective for ELLs and tau-equivalent for non-ELLs. Specifically, it was not effective when administered with restricted time limits in the study by Aguirre-Munoz (2000; see Figure 1, Study #7) for ELLs at low or high intermediate levels of ELP (−0.16 and 0.04, respectively), and only slightly effective (0.12) at LI ELP. Its average effect size with restricted time was .003 (ns, Table 1, column 1). However, one study showed an effect size of 0.299 (ns, Table 1, column 2) for ELLs when the test was administered with generous time limits for both the accommodated and unaccommodated test booklets (Anderson et al., 2000).

This type of accommodation was not tau-equivalent for one study of non-ELLs under a restricted time condition because the accommodated version was harder (−0.169, ns, Table 3, column 1) than the original version. When the DL condition takes the form of the side-by-side version of English and Spanish language items, the booklet can be very long, requiring additional time for students to page through the materials. Alternatively, the presentation of items in two languages may have been confusing to students unfamiliar with this format. However, this type of bilingual accommodation was only studied for non-ELLs with restricted time; therefore, we cannot say whether it would be tau-equivalent for non-ELLs under power test conditions.

Bilingual glossary. Like most bilingual accommodations, this type shows some promise, but it has not been well studied: only five effect sizes for ELLs and three effect sizes for non-ELLs were available for the paper and pencil bilingual glossary. None of the studies on this accommodation type incorporated language of instruction or level of ELP in the design. The findings suggest that it is effective for ELLs and tau-equivalent for non-ELLs only when administered with generous time limits. Specifically, it was harder than the original test when time was restricted for both ELLs (−0.176, p < .05, Table 1, column 1) and non-ELLs (−0.134, ns, Table 3, column 1). However, the average effect size increased by almost half a standard deviation for ELLs when power tests or extra time were available for both groups (average effect of 0.247, ns, Table 1, column 3). Its tau-equivalency for non-ELLs under generous time limits has not been studied. Although 50 states allow the use of some type of bilingual dictionary or glossary, in 39 states the dictionaries are not required to be tailored to tests. This practice introduces the possibility of construct-irrelevant variance because it may make answers or hints to test items available to students (Shafer Willner et al., 2008; Acosta et al., 2008).

Pop-up bilingual glossary. The pop-up version of the bilingual glossary was included in only one study, by Kopriva et al. (2007), which did not incorporate level of ELP or language of instruction as separate variables in its design. The small but nontrivial effect size for this accommodation (0.069, ns, Table 1, column 1) was considerably higher than the negative average result for the paper and pencil version under the restricted time conditions. Again, this pattern is consistent with the view that pop-up glossary versions are more effective than paper and pencil glossary versions under restricted time conditions because they deliver information more easily and quickly. However, it is unknown whether it is tau-equivalent for non-ELLs because the Kopriva et al. study included only ELLs.

Spanish version. There were only six effect sizes for ELLs and one for non-ELLs based on the Spanish version accommodation, all under restricted time conditions (see Figure 1). The results showed extreme variation according to language background, from +1.45 (p < .05, two-tailed) for the ELL group with the lowest proficiency in English (L) in Study #7, Aguirre-Munoz (2000), to negative values of −0.34 for Hispanic ELLs receiving IE in Study #10 (Hofstetter, 2003) and of −1.13 for non-Hispanic ELLs (#5—Abedi, Lord, & Hofstetter, 1998). The administration of the Spanish version to non-Hispanic ELLs was probably the inadvertent result of randomly assigned accommodation booklets in a large study; this extreme negative result is not surprising given that those students probably knew little Spanish. There was also one effect size for the Spanish version accommodation for non-ELLs of Hispanic background (−0.78, p < .05, two-tailed, in Hofstetter, 2003) that was based on only 24 cases in the accommodation group and 61 cases in the control group from schools in California. It is understandable that Hispanic non-ELLs in California public schools would perform poorly on items entirely in Spanish because the state policy does not support the maintenance of literacy in the native languages once students exit out of ELL status. In sum, this accommodation shows promise and the highest effect sizes of all, but only when assigned appropriately—for students with literacy skills in their native language, who have had instruction on the subject matter in their native language, and who have low English proficiency. It would be desirable to include this accommodation in future studies where ELLs are classified by language background to see if these few results based on small samples can be replicated.

Possible Underestimation of Effect Sizes

In addition to the scarcity of studies allowing sufficient time for those accommodations involving extra printed materials, there are several design features that may have obscured or artificially lowered the effects of the accommodations. One is that researchers largely ignored the effects of ELP and language of instruction. While the use of control variables such as ELL reading ability or ELP in some studies was a step in the right direction, very few studies looked at possible interactions between ELP and type of accommodation in terms of effectiveness. The few studies that included these variables did show important interactions; sometimes effects were negative or near zero for some levels of ELP while substantially positive for other levels of ELP.

A second design feature that may have reduced effect sizes was a possible poor match between test content and curriculum. Many studies using items from data banks from tests such as NAEP and TIMSS may not have accurately reflected the taught curriculum (e.g., Kiplinger et al. found the test items used were much too hard for students). Even when test items were derived from standards specific to the state (found in five studies), there was no systematic evaluation of the degree to which the assessments included content covered in the classrooms where ELLs and non-ELLs were concentrated, an important issue highlighted by Abedi et al. (2004). Duran, Brown, and McCall (2002) and Gandara, Rumberger, Maxwell-Jolly, and Callahan (2003) have pointed out the content deficiencies in the curriculum for classrooms of ELLs as compared with the curriculum offered in classrooms for non-ELLs. As a result, Abedi et al. (2003b) have advocated studying "a possible differential level of OTL [opportunity to learn] for students with different language background characteristics" (p. 78) and its impact on test performance. When test scores are depressed and have little variance because students have not studied particular content, differences between conditions are artificially lowered.

Gaps in Research on Accommodations

Although the research shows that the most effective methods were English dictionary/glossary and bilingual accommodations with generous time limits (for most ELLs) or native language versions (for ELLs with low ELP), studies of these accommodations are scarce. As a result, there was little statistical power for evaluating the effects of Spanish versions, bilingual glossary, DL, or generous time limits, owing to the small number of subsamples receiving these conditions. However, less restricted and unlimited time conditions are more representative of actual practice in the overwhelming majority of state assessments (Cormier et al., 2010; Thompson et al., 2002).

Design features that impede meta-analyses. Several studies did not report means and SDs by condition, just F tests for overall contrasts together with marginal means; others grouped results for more than one type of accommodation together; still others did not pair the accommodated ELL group with an appropriate ELL control group, thus not allowing an effect size calculation. Even if each original result per unique combination of conditions is not statistically significant by itself because of low power due to small sample sizes, the accumulated evidence across several studies using meta-analysis can point to a nontrivial effect size that is statistically significant in the aggregate.

Conclusions and Recommendations

Our categorization of effect sizes for test accommodations took into account language proficiency, test format, and time constraints, which enabled us to uncover particular circumstances under which selected accommodations were efficacious. With few exceptions, most accommodations did improve the test performance of ELLs beyond a trivial level when students were allowed sufficient time to work with the extra printed materials provided. The findings for accommodations administered under generous time limits are likely to be more generalizable to the many state assessments that employ power tests. The patterns of effect sizes across test administration conditions and students' language history, while not always statistically significant due to low statistical power, confirm the necessity of including these dimensions when categorizing effect sizes. The most promising accommodations with generous time limits appear to be the DL, the bilingual glossary, and the English glossary/dictionary conditions. The most promising accommodation under restricted time limits was the pop-up English glossary; it is likely to continue to be effective with power tests also.

It is probable that true potential effects have been underestimated in the majority of studies thus far. At this point, few researchers have included consideration of additional time for accommodations involving extra printed materials and the corresponding control condition. In calculating average effect sizes, Kieffer et al. (2009) ignored time constraints and format distinctions; their results largely reflect findings for the bulk of studies that did not allow adequate time. Thus, their analyses most likely underestimated potential effect sizes for effective accommodations.

Findings from the few investigations that controlled for levels of ELP or language of instruction suggested that the effectiveness of a particular accommodation may vary widely according to a student's level of English proficiency and the match between the language of the test and the language of instruction. For instance, the plain English accommodation was more effective for students at HI levels of ELP, but less effective for students at lower levels of ELP. For Spanish-speaking students with low ELP, or those receiving content IS, the native language versions of tests were, by far, the most effective accommodations. Conversely, the Spanish version was more difficult than the original test in English for ELLs at intermediate levels of ELP, or those receiving content IE.

Results for non-ELLs revealed that most English language and bilingual accommodations could be considered essentially tau-equivalent to the original version under the same time constraints, with some notable exceptions. Non-ELL students receiving the plain English version scored slightly but significantly higher than those receiving the standard version when both groups were given generous time. In contrast, DL and bilingual glossary accommodations were differentially harder than the standard versions for non-ELLs under equally strict time conditions. In power test conditions, perhaps the DL and bilingual glossary accommodation types would no longer be differentially more difficult and would be tau-equivalent for non-ELLs. Nevertheless, states with assessments involving extra materials for accommodated conditions should investigate in pilot studies what time limits are sufficient to achieve tau-equivalency.

These results imply, but do not conclusively demonstrate, that construct-irrelevant variance due to English proficiency was reduced in tests with selected accommodations as compared with the corresponding original test booklets. When we consider how little research on improving content assessment for ELLs had been done 10–15 years ago, these results indicate great progress. Nevertheless, the many limitations of the existent body of literature attest to the need for increased attention to this area of research.

The importance of continued evaluation of accommodations merits reiteration. Valid tests with accommodations that meet ELLs' linguistic needs are not intended to be the end solution towards the reduction of the achievement gap. Rather, accommodations are one tool that can reduce construct-irrelevant variance in the measurement of students' educational gains. Hence, it is important to continue to improve accommodations and to investigate how best to match them to particular student needs.

The effect size approach to studying whether accommodations improve validity is only a first step—these findings should be verified with other designs to examine construct validity. Koenig and Bachman (2004) have cautioned that comparing score gains is "useful in assessing the effectiveness of an accommodation, . . . [but] cannot confirm that [an accommodation] yields valid score interpretations. . . . To compare the meanings of the two types of scores, it is necessary to obtain other information on the construct of interest by means of measures external to the test" (pp. 96–97). For example, confirmatory factor analysis could be used to investigate the relationship of original and accommodated test scores with measures of ELP. Also, the comparative validity of accommodated and unaccommodated test scores for predicting school achievement could be examined using a regression approach that considers under- or over-prediction of the criterion.

To evaluate accommodations, investigations should incorporate factorial designs that cross type of accommodation with ELP. The investigation of interactions between ELP and accommodations needs to be extended to a greater number of English language accommodations and replicated with more studies for native language accommodations. These analyses would be useful to identify what types of accommodations are most effective for students at particular levels of English and native language proficiency.

Additionally, future investigations should separate the non-ELL student groups into categories such as native speakers of English, language minority students initially classified as fluent in English, and language minority students who have exited ELL programs. For exited ELLs, researchers should record the number of years students have been out of the program and their level of proficiency in English language arts or some other measure of literacy. This breakdown of non-ELL groups and the supplementary variables would be valuable in determining the level of ELP at which accommodations would no longer be necessary (Acosta et al., 2008).

The findings support the conclusion of Abedi, Hofstetter, and Lord's (2004) review that there is no "one-size-fits-all accommodation." These trends suggest that practitioners should take care to select accommodations according to students' individual needs, including ELP and language of instruction, and the availability of time during test administration. The results thus far support Abedi et al.'s (2004, p. 18) assertion, "The language of assessment should match students' primary language of instruction." Also, we concur with the conclusion that ELLs must have the opportunity to learn the content covered in assessments. Much research remains to be done on the effectiveness and validity of accommodations for ELLs, which is still "meager" (Abedi et al., 2004, p. 18) in comparison to the extensive body of work on accommodations for SWDs.

Appendix

Descriptive Summary of Included Studies

Descriptive characteristics for the 14 experimental studies included in the meta-analysis are detailed in Figure A1 (online supporting information). Studies are sorted by design, then by alphabetic order of authors. The first 12 had the independent groups design, whereas Studies 13 and 14 had the repeated measures design with counterbalancing of the order in which accommodated and unaccommodated measures were administered. In two instances where technical reports were later published as journal articles, we cite the journal article reference. Specifically, we cite the article by Duncan et al. (2005), which replaced the earlier technical report, Garcia et al. (2000); and we cite the article by Albus, Thurlow, and Bielinski (2005) instead of the technical report, Albus, Bielinski, Thurlow, and Liu (2001). On the other hand, we derived effect sizes from the technical report by Abedi, Lord, and Plummer (1997) in lieu of the corresponding article by Abedi and Lord (2001) because the former was more complete in descriptive statistics and other information.

FIGURE A1. Descriptive Information for 14 Experimental Studies Evaluated in Meta-Analysis. [The two-page figure tabulates, for each of the 14 studies, the study number, author(s), report date, grade(s), design (random assignment, random assignment of classrooms, random assignment through spiraling of booklets, or counterbalanced repeated measures, with power tests noted), and the ELL and non-ELL sample sizes per accommodation condition and grade; only its caption and footnotes are reproduced here.]
a In its original form, this study partially overlapped with the Hofstetter study (2003), but only non-overlapping results are included here.
b An effect size could not be calculated for these groups because there was no corresponding accommodation or control group.
c Designs with power tests either had no time limits or had very generous time limits, allowing all items to be attempted.
d These groups had a Spanish-language background but were considered non-ELLs because they had had three or more years of instruction in English and therefore met the criteria for inclusion in NAEP assessments.
e These non-ELLs were native speakers of English who had no background in Spanish.
f These students attended basic 8th grade mathematics classes taught in Spanish.
g These students attended a variety of mathematics classes taught in English.

In Figure A1, the footnotes identify groups not included in any effect sizes, i.e., accommodation groups without corresponding control groups or vice versa. It is important to note that there were only five studies using items based on state assessments or standards (Studies #7, 8, 11, 12, and 13); in the remainder of the studies, the tests were experimental forms with items adapted from NAEP or TIMSS item databanks. Also, the two most common content areas tested were mathematics (seven studies) and science (four studies). The schools in seven studies were all in California, and another three studies involved schools from a group of states that included California.

Excluded Studies

Eleven empirical studies did not meet one or more of the first four criteria described in the method section and were not included in the meta-analysis. Castellon-Wellington (2000) did not meet the first criterion because the researcher reported means for students who had received their preferred or not preferred accommodation. Within each preference group, not all individuals received the same accommodation. Johnson and Monroe (2004) also did not meet the first criterion. They reported means for simplified English and original test booklets, but the test administration conditions were not the same for all students within each group. Two studies, Hafner (2000) and Kiplinger et al. (2000), did not meet the third criterion because they did not report all the necessary group statistics. Excluded nonexperimental studies (criterion 4) were: Abedi (2003); Brown (1999); Duran, Brown, and McCall (2002); Shepard, Taylor, and Betebenner (1998); and the dissertation by Nickerson (2004). Although Kieffer et al. had starred the Shepard et al. (1998) study in their reference list as if it were included in their meta-analysis, it was actually excluded, as detailed in their method section, and it was omitted from their tables. The Shaftel, Belton-Kocher, Glasnapp, and Poggio (2003) study was experimental for non-ELLs but not for ELLs. Also, they did not report the necessary descriptive statistics for non-ELLs.

Removing the Overlap between the Abedi et al. (1998) and the Hofstetter (2003) Studies

The original results from Abedi et al. (1998) and Hofstetter (2003) were not independent because they incorporated data from a shared Hispanic group, although the Abedi et al. study also included a non-Hispanic group of students. To separate the results, we derived the effect sizes for the non-Hispanic students in Abedi et al. using the descriptive statistics reported in each study. Specifically, the sum of test scores and the sum of squared test scores for each condition in each study were derived from the reported means and standard deviations. The sums of scores and squared scores and the sample sizes pertaining to each condition for Hispanic students from Hofstetter's results were subtracted from the corresponding values in Abedi et al. Subsequently, these recalculated sums and sample sizes were used to derive the means and standard deviations of the non-Hispanic students from Abedi et al. for each condition.
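As an illustration of this subtraction procedure, the sketch below simply reverses the usual formulas for the mean and the sample variance. The function and variable names are ours, and we assume the reported SDs use n − 1 in the denominator, as is conventional.

```python
def remove_subgroup(n_all, mean_all, sd_all, n_sub, mean_sub, sd_sub):
    """Recover the mean and SD of the remaining students after removing
    a subgroup, using the sums and sums of squares implied by the
    reported descriptive statistics."""
    # Reconstruct the sum of scores and sum of squared scores per group,
    # using sum(x) = n * mean and sum(x^2) = (n - 1) * sd^2 + n * mean^2.
    sum_all = n_all * mean_all
    ss_all = (n_all - 1) * sd_all**2 + n_all * mean_all**2
    sum_sub = n_sub * mean_sub
    ss_sub = (n_sub - 1) * sd_sub**2 + n_sub * mean_sub**2
    # Subtract the subgroup's sums and sample size from the combined values.
    n_rem = n_all - n_sub
    sum_rem = sum_all - sum_sub
    ss_rem = ss_all - ss_sub
    # Convert the remaining sums back into a mean and an SD.
    mean_rem = sum_rem / n_rem
    var_rem = (ss_rem - n_rem * mean_rem**2) / (n_rem - 1)
    return n_rem, mean_rem, var_rem**0.5
```

Applied per condition to the combined statistics in Abedi et al. (1998) with Hofstetter's (2003) Hispanic group as the subgroup, this yields the non-Hispanic means and SDs used for the separated effect sizes.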

Other Considerations in Choosing Glass’s d Index

The possible inequality of the standard deviations for the original test and the various accommodated versions was another consideration for choosing Glass's index. Tests and their corresponding accommodations cannot be assumed to be psychometrically parallel instruments having equal SDs and reliabilities, although they are designed to have the same content specifications. Different test variants could have unequal SDs in the population if one form is systematically easier or harder than the others and leads to restriction of range. These considerations were supported by empirical statistical findings rejecting the homogeneity of variance assumption for accommodation and corresponding control groups in the present study. Having a common metric is essential for comparing effect sizes across a variety of studies (Morris & DeShon, 2002; Becker, 2003), and this consideration was another factor in the choice of Glass's index.
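For clarity, Glass's index standardizes the mean difference by the control group standard deviation alone, so unequal SDs across test versions do not enter the denominator:

d = \frac{\bar{X}_A - \bar{X}_C}{S_C},

where \bar{X}_A and \bar{X}_C are the means of the accommodated and control groups and S_C is the standard deviation of the control group. Because every accommodated version is referenced to the same unaccommodated metric, the index remains comparable even when the accommodated versions differ in variance.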

Caveats in Contrasting Average Effect Sizes for ELLs and Non-ELLs

Although one is tempted to interpret the size of differences between the effects for ELLs and non-ELLs to examine whether accommodations are differentially more effective for ELLs, we must keep in mind that direct subtractions are not meaningful. The indices cannot be said to be on the same metric because the respective control group SDs vary.

Expected Values of Effect Sizes Having Three Parameters

When only the accommodation group receives a generous time limit, the control group mean would not reflect a main effect for time. Consequently, the main effect for time would not cancel out when taking the difference in means for conditions where only the accommodation group receives extra time. Thus, the expected value under these conditions reflects three parameters—two main effects and the interaction (δj + δjXT + δT).
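In sketch form (the μ and expectation notation are ours; effects are expressed in control-group SD units, with μ the expected score under the unaccommodated, restricted-time condition):

E(\bar{X}_A) = \mu + \delta_j + \delta_T + \delta_{j \times T}, \qquad E(\bar{X}_C) = \mu \quad \Rightarrow \quad E(d) = \delta_j + \delta_T + \delta_{j \times T}.

When both groups instead receive the same generous time limit, δT enters both means and cancels, leaving δj + δjXT; with restricted time for both groups, only δj remains. These compositions match the expected-value columns of Table 4.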

Calculations for Inferential Statistics

The equations used for most calculations are available upon request from the first author. We were able to derive the unreported correlations needed for the calculation of sampling variances in repeated measures designs using the test statistics for repeated measures (t or F) together with other descriptive statistics reported by the study authors.
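One standard way to back out such a correlation from a paired t statistic is sketched below (our reconstruction under textbook assumptions; the authors' exact equations are available from them on request). For a one-degree-of-freedom repeated measures contrast, an F statistic can be converted first via t = √F, taking the sign from the mean difference.

```python
import math

def correlation_from_paired_t(t, n, mean1, mean2, sd1, sd2):
    """Infer the correlation between two repeated measures from a paired
    t statistic and the reported descriptive statistics.

    The paired t statistic is t = (mean1 - mean2) / (sd_diff / sqrt(n)),
    so sd_diff can be recovered; then the identity
    sd_diff**2 = sd1**2 + sd2**2 - 2 * r * sd1 * sd2 is solved for r.
    """
    sd_diff = (mean1 - mean2) * math.sqrt(n) / t
    return (sd1**2 + sd2**2 - sd_diff**2) / (2 * sd1 * sd2)
```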

Average values for a group of effect sizes were calculated by weighting each effect size by the inverse of its sampling variance. Inverse variance weights are considered superior to sample size weights for calculating the mean effect because they minimize the variance of the average effect size (Shadish & Haddock, 1994). Calculations for the sampling variances of individual effect sizes involve the population value for the mean effect size, which, in turn, depends on the inverse variance weights. Consequently, the estimate of the mean population value based on inverse variance weights was derived iteratively. For the first iteration, the average effect size was calculated using weights wij = nCij (the sample size for the control group), but subsequent iterations used inverse variance weights. The standard error of estimate for the average effect size j is the square root of its sampling variance, √vj, which was used to calculate 95% confidence bands.
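A minimal sketch of this iterative scheme follows. The sampling variance formula shown is the usual large-sample approximation for Glass's d, with the current estimate of the mean effect substituted for each individual effect size; the authors' exact equations, available on request, may differ in detail.

```python
def weighted_mean_effect(d, n_a, n_c, iterations=20):
    """Iteratively estimate an inverse-variance-weighted mean effect size.

    d, n_a, n_c: equal-length lists of effect sizes and the accommodated
    (A) and control (C) group sample sizes for one category. The first
    pass weights by the control group sample sizes; later passes weight
    by 1/v_i, where v_i is evaluated at the current mean effect.
    """
    k = len(d)
    w = [float(nc) for nc in n_c]  # first iteration: w_ij = n_Cij
    for _ in range(iterations):
        mean_d = sum(wi * di for wi, di in zip(w, d)) / sum(w)
        # Large-sample variance of Glass's d, with mean_d replacing d_i.
        v = [(n_a[i] + n_c[i]) / (n_a[i] * n_c[i])
             + mean_d**2 / (2 * (n_c[i] - 1)) for i in range(k)]
        w = [1.0 / vi for vi in v]
    mean_d = sum(wi * di for wi, di in zip(w, d)) / sum(w)
    se = (1.0 / sum(w)) ** 0.5  # standard error of the weighted mean
    return mean_d, se
```

The 95% confidence band is then mean_d ± 1.96 × se.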

Acknowledgments

We would like to commend and thank the many researchers who successfully carried out the large, complex, and difficult experimental research studies in schools that laid the foundation for this meta-analysis. We are grateful to Steve Sireci and his coauthors for sharing their extensive review on accommodations prior to its publication. At earlier stages of this research, we received invaluable advice and feedback from Brian Gong and the participants at the Edward F. Reidy, Jr. Interactive Lecture Series at the National Center for the Improvement of Educational Assessment in Dover, NH. Others who also provided helpful feedback included Jamal Abedi, Caroline Huie Hofstetter, Charles Stansfield, and David J. Francis. We also appreciated the critiques provided by Sue Rigney (discussant for our presentation at AERA in 2010), Steve Ferrara, Matthew R. Watkins, Barbara Acosta, Jennifer McCreadie, and Lynn Shafer Willner. Steve Ferrara gave us vital guidance and encouragement. The detailed, constructive suggestions by Jacqueline Leighton and three anonymous reviewers helped us to sharpen, refine, and strengthen this paper. Additionally, we would like to give special thanks to Charlotte Blane, Julie Millet, and Roshaun Tyson for their excellent editorial work.


References

References marked with a number in parentheses before the authors' names indicate studies included in the supporting online meta-analysis. Numbers 1–12 had an independent samples design, and numbers 13–14 had a repeated measures design.

Abedi, J. (2001). Assessment and accommodations for English language learners: Issues and recommendations (Policy Brief 4). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing, University of California at Los Angeles (http://www.cse.ucla.edu/products/policy/cresst_policy4.pdf, accessed 1 December 2004).

Abedi, J. (2003). Impact of student language background on content-based performance: Analyses of extant data (CSE Technical Report 603). Los Angeles: University of California, Los Angeles, National Center for Research on Evaluation, Standards, and Student Testing (http://www.cse.ucla.edu/products/Reports/R603.pdf, accessed 1 December 2004).

Abedi, J. (2004a). Inclusion of students with limited English proficiency in NAEP: Classification and measurement issues (CSE Technical Report 629). Los Angeles: University of California, Los Angeles, National Center for Research on Evaluation, Standards, and Student Testing (http://www.cse.ucla.edu/products/Reports/R629.pdf, accessed 1 December 2004).

Abedi, J. (2004b). The No Child Left Behind Act and English language learners: Assessment and accountability issues. Educational Researcher, 33(1), 1–14.

(2) Abedi, J., Courtney, M., & Leon, S. (2003a). Effectiveness and validity of accommodations for English language learners in large-scale assessments (CSE Technical Report No. 608). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing (http://www.cse.ucla.edu/products/Reports/R608.pdf, accessed 8 June 2005).

(1) Abedi, J., Courtney, M., & Leon, S. (2003b). Research-supported accommodation for English language learners in NAEP (CSE Tech. Report No. 586). Los Angeles: University of California, National Center for Research on Evaluation, Standards, and Student Testing (http://www.cse.ucla.edu/products/Reports/TR586.pdf, accessed 1 December 2004).

(3) Abedi, J., Courtney, M., Mirocha, J., Leon, S., & Goldberg, J. (2005). Language accommodations for English language learners in large-scale assessments: Bilingual dictionaries and linguistic modification (CSE Report 666). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing (http://www.cse.ucla.edu/products/reports/r666.pdf, accessed 28 September 2006).

(4) Abedi, J., Hofstetter, C., Baker, E., & Lord, C. (2001). NAEP math performance and test accommodations: Interactions with student language background (CSE Technical Report No. 536). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing (http://www.cse.ucla.edu/products/Reports/newTR536.pdf, accessed 1 December 2004).

Abedi, J., Hofstetter, C. H., & Lord, C. (2004). Assessment accommodations for English language learners: Implications for policy-based research. Review of Educational Research, 74(1), 1–28.

Abedi, J., & Lord, C. (2001). The language factor in mathematics tests. Applied Measurement in Education, 14(3), 219–234.

(5) Abedi, J., Lord, C., & Hofstetter, C. (1998). Impact of selected background variables on students' NAEP math performance (CSE Technical Report No. 478). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing (http://www.cse.ucla.edu/products/Reports/TECH478.pdf, accessed 1 December 2004).

(6) Abedi, J., Lord, C., Kim, C., & Miyoshi, J. (2001). The effects of accommodations on the assessment of limited English proficient students in the National Assessment of Educational Progress (Publication No. NCES 2001–13). Washington, DC: National Center for Education Statistics (http://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=iu200113, accessed 1 December 2004).

(13) Abedi, J., Lord, C., & Plummer, J. R. (1997). Final report of language background as a variable in NAEP mathematics performance. Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing (http://www.cse.ucla.edu/products/Reports/TECH429.pdf, accessed 1 December 2004).

Acosta, B., Rivera, C., & Shafer Willner, L. (2008). Best practices in state assessment policies for accommodating English language learners: A Delphi study. Prepared for the LEP Partnership, U.S. Department of Education. Arlington, VA: The George Washington University Center for Equity and Excellence in Education (http://ceee.gwu.edu/AA/BestPractices.pdf, accessed 3 April 2009).

AERA, APA, & NCME (1999). Standards for educational and psychological testing. Washington, DC: Author.

(7) Aguirre-Munoz, Z. (2000). The impact of language proficiency on complex performance assessments: Examining linguistic accommodation strategies for English language learners (Doctoral dissertation, University of California at Los Angeles). Proquest Dissertations and Theses Full Text (Publication No. AAT 9973171).

Albus, D., Bielinski, J., Thurlow, M., & Liu, K. (2001). The effect of a simplified English language dictionary on a reading test (LEP Projects Report 1). Minneapolis: University of Minnesota, National Center on Educational Outcomes (http://education.umn.edu/NCEO/OnlinePubs/LEP1.html, accessed 1 December 2004).

(14) Albus, D., Thurlow, M., Liu, K., & Bielinski, J. (2005). Reading test performance of English-language learners using an English dictionary. Journal of Educational Research, 98(4), 245–254.

Alderman, D. A. (1982). Language proficiency as a moderator variable in testing academic aptitude. Journal of Educational Psychology, 74(4), 580–587.

(8) Anderson, M., Liu, K., Swierzbin, B., Thurlow, M., & Bielinski, J. (2000). Bilingual accommodations for limited English proficient students on statewide reading tests: Phase 2 (Minnesota Report No. 31). Minneapolis: National Center for Educational Outcomes (http://education.umn.edu/NCEO/OnlinePubs/MnReport31.html, accessed 1 December 2004).

Becker, B. J. (1988). Synthesizing standardized mean-change measures. British Journal of Mathematical and Statistical Psychology, 41, 257–278.

Becker, B. J. (2003). Introduction to the special section on metric in meta-analysis. Psychological Methods, 8(4), 403–405.

Bensoussan, M. (1983). Dictionaries and tests of EFL reading comprehension. ELT Journal, 37(4), 341–345.

Brown, P. (1999). Findings of the 1999 plain language field test inclusive comprehensive assessment system (Publication T99–013.1). Delaware Education Research & Development Center (http://dspace.udel.edu:8080/dspace/handle/19716/2336, accessed 4 October 2006).

Butler, F. A., & Stevens, R. (1997). Accommodation strategies for English language learners on large-scale assessments: Student characteristics and other considerations (CSE Technical Report 448). Los Angeles: University of California, Los Angeles, National Center for Research on Evaluation, Standards, and Student Testing, Graduate School of Education & Information Studies (http://www.cse.ucla.edu/products/Reports/TECH448.pdf, accessed 1 December 2004).

Castellon-Wellington, M. (2000). The impact of preference for accommodations: The performance of English language learners on large-scale academic achievement tests (CSE Technical Report 524). Los Angeles: University of California, Los Angeles, National Center for Research on Evaluation, Standards, and Student Testing, Graduate School of Education & Information Studies (http://www.cse.ucla.edu/products/Reports/TECH524.pdf, accessed 1 December 2004).

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Cormier, D. C., Altman, J., Shyyan, V., & Thurlow, M. L. (2010). A summary of the research on the effects of test accommodations: 2007–2008 (NCEO Technical Report 56). Minneapolis: University of Minnesota, National Center on Educational Outcomes.


(9) Duncan, T. G., del Rio Parent, L., Chen, W.-H., Ferrara, S., Johnson, E., Oppler, S., & Shieh, Y.-Y. (2005). Study of a dual-language test booklet in eighth-grade mathematics. Applied Measurement in Education, 18(2), 129–161.

Duran, R. P. (2008). Assessing English language learners' achievement. Review of Research in Education, 32, 292–327. doi:10.3102/0091732X07309372.

Duran, R. P., Brown, C., & McCall, M. (2002). Assessment of English language learners in the Oregon statewide assessment system: National and state perspectives. In T. M. Haladyna & G. Tindal (Eds.), Large-scale assessment programs for all students: Validity, technical adequacy, and implementation (pp. 371–394). Mahwah, NJ: Erlbaum.

Fletcher, J. M., Francis, D. J., Boudosquie, A., Copeland, K., Young, V., Kalinowski, S., & Vaughn, S. (2006). Effects of accommodations on high-stakes testing for students with reading disabilities. Exceptional Children, 72(2), 136–150.

Gandara, P., Rumberger, R., Maxwell-Jolly, J., & Callahan, R. (2003). English learners in California schools: Unequal resources, unequal outcomes. Educational Policy Analysis Archives, 11(36) (http://epaa.asu.edu/ojs/article/view/264, accessed 2 March 2010).

Garcia, T., del Rio Parent, L., Chen, L., Ferrara, S., Garavaglia, D., Johnson, E., Liang, J., Oppler, S., Searcy, C., Shieh, Y., & Ye, Y. (2000, November). Study of a dual language test booklet in eighth grade mathematics: Final report. Washington, DC: American Institutes for Research.

Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Newbury Park, CA: Sage.

Gleser, L. J., & Olkin, I. (1994). Stochastically dependent effect sizes. In H. M. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 339–355). New York: Russell Sage Foundation.

Hafner, A. L. (2000). Evaluating the impact of test accommodations on test scores of LEP students and non-LEP students. Final report, research supported by the U.S. Department of Education through the Delaware Department of Education, Grant #R279A950022 (http://www.doe.k12.de.us/aab/Atch13.pdf, accessed 1 December 2004).

Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6(2), 107–128.

Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic Press.

Hofstetter, C. H. (2003). Contextual and mathematics accommodation test effects for English language learners. Applied Measurement in Education, 16(2), 159–188.

Hollenbeck, K. (2002). Determining when test alterations are valid accommodations or modifications for large-scale assessment. In T. M. Haladyna & G. Tindal (Eds.), Large-scale assessment programs for all students: Validity, technical adequacy, and implementation (pp. 395–425). Mahwah, NJ: Erlbaum.

Improving America's Schools Act of 1994. Pub. L. No. 103–382.

Johnson, E., & Monroe, B. (2004). Simplified language as an accommodation on math tests. Assessment for Effective Intervention, 29, 35–45. doi:10.1177/073724770402900303 (http://aei.sagepub.com/cgi/content/abstract/29/3/35, accessed 8 March 2010).

Kieffer, M. J., Lesaux, N. K., Rivera, M., & Francis, D. J. (2009). Accommodations for English language learners taking large-scale assessments: A meta-analysis on effectiveness and validity. Review of Educational Research, 79(3), 1168–1201.

Kiplinger, A., Haug, C. A., & Abedi, J. (2000). A math assessment should test math not reading: One state's approach to the problem. Presented at the 30th Annual National Conference on Large-Scale Assessment, Snowbird, UT.

Koenig, J. A., & Bachman, L. F. (Eds.) (2004). Keeping score for all: The effects of inclusion and accommodation policies on large-scale educational assessments. Washington, DC: National Research Council.

(11) Kopriva, R. J., Emick, J. E., Hipolito-Delgado, C. P., & Cameron, C. A. (2007). Do proper accommodation assignments make a difference? Examining the impact of improved decision making on scores for English language learners. Educational Measurement: Issues and Practice, 26(3), 11–20.

La Celle-Peterson, M. W., & Rivera, C. (1994). Is it real for all kids? A framework for equitable assessment policies for English language learners. Harvard Educational Review, 64(1), 55–75.

Morris, S. B., & DeShon, R. P. (2002). Combining effect size estimates in meta-analysis with repeated measures and independent-groups designs. Psychological Methods, 7(1), 105–125.

No Child Left Behind Act of 2001 (2002). Pub. L. No. 107–110.

Nickerson, B. (2004). English language learners, the Stanford Achievement Test, and perceptions regarding the effectiveness of testing accommodations: A study of eighth graders (Doctoral dissertation, The George Washington University, Washington, DC). Dissertation Abstracts International, 65/07, 2465.

Pennock-Roman, M. (1992). Interpreting test performance in selective admissions for Hispanic students. In K. F. Geisinger (Ed.), Psychological testing of Hispanics (pp. 99–136). Washington, DC: American Psychological Association.

Pennock-Roman, M. (2002). Relative effects of English proficiency on general admissions tests versus subject tests. Research in Higher Education, 43(5), 601–623.

Phillips, S. E. (2002). Legal issues affecting special populations in large-scale testing programs. In G. Tindal & T. M. Haladyna (Eds.), Large-scale assessment programs for all students: Validity, technical adequacy, and implementation (pp. 109–148). Mahwah, NJ: Erlbaum.

Ragan, A., & Lesaux, N. (2006). Federal, state, and district level English language learner program entry and exit requirements: Effects on the education of language minority learners. Education Policy Analysis Archives, 14(20) (http://epaa.asu.edu/ojs/article/view/91, accessed 2 March 2010).

Raudenbush, S. W. (1994). Random effects models. In H. M. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 301–321). New York: Russell Sage Foundation.

Rivera, C., Collum, E., Shafer Willner, L., & Sia, J. K., Jr. (2006). An analysis of state assessment policies addressing the accommodation of English language learners. In C. Rivera & E. Collum (Eds.), A national review of state assessment policy and practice for English language learners (pp. 1–173). Mahwah, NJ: Erlbaum.

(12) Rivera, C., & Stansfield, C. W. (2004). The effect of linguistic simplification of science test items on score comparability. Educational Assessment, 9(3), 79–105.

Rivera, C., Stansfield, C. W., Scialdone, L., & Sharkey, M. (2000). An analysis of state policies for the inclusion and accommodation of English language learners in state assessment programs during 1998–1999. Arlington, VA: The George Washington University Center for Equity and Excellence in Education.

Shadish, W. R., & Haddock, C. K. (1994). Combining estimates of effect sizes. In H. M. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 261–281). New York: Russell Sage Foundation.

Shafer Willner, L., Rivera, C., & Acosta, B. (2008). Descriptive study of state assessment policies for accommodating English language learners. Arlington, VA: The George Washington University Center for Equity and Excellence in Education (http://ceee.gwu.edu/AA/DescriptiveStudy.pdf, accessed 3 April 2009).

Shaftel, J., Belton-Kocher, E., Glasnapp, D. R., & Poggio, J. P. (2003). The differential impact of accommodations in statewide assessment: Research summary. Minneapolis: University of Minnesota, National Center on Educational Outcomes (http://www.cehd.umn.edu/NCEO/TopicAreas/Accommodations/Kansas.htm, accessed 2 March 2010).

Shepard, L. A., Taylor, G., Betebenner, D. W., & Educational Resources Information Center (U.S.). (1998). Inclusion of limited-English-proficient students in Rhode Island's grade 4 mathematics performance assessment (CSE Technical Report 486). Los Angeles: University of California, Los Angeles, National Center for Research on Evaluation, Standards, and Student Testing (http://www.cse.ucla.edu/products/Reports/TECH486.pdf, accessed 1 April 2006).

Sireci, S. G. (2005). Unlabeling the disabled: A psychometric perspective on flagging scores from accommodated test administrations. Educational Researcher, 34(1), 3–12.

Sireci, S. G., Li, S., & Scarpati, S. (2003). The effects of test accommodation on test performance: A review of the literature (Center for Educational Assessment Research Report No. 485). Amherst, MA: University of Massachusetts Amherst, Center for Educational Assessment (http://www.education.umn.edu/NCEO/OnlinePubs/TestAccommLitReview.pdf, accessed 15 June 2004).

Sireci, S. G., Scarpati, S., & Li, S. (2005). Test accommodations for students with disabilities: An analysis of the interaction hypothesis. Review of Educational Research, 75(4), 457–490.

Thompson, S., Blount, A., & Thurlow, M. (2002). A summary of research on the effects of test accommodations: 1999 through 2001 (Technical Report 34). Minneapolis: University of Minnesota, National Center on Educational Outcomes.

Thurlow, M. L., McGrew, S., Tindal, G., Thompson, S. J., Ysseldyke, J. E., & Elliott, J. L. (2000). Assessment accommodations research: Considerations for design and analysis (NCEO Technical Report 26). Minneapolis: University of Minnesota, National Center on Educational Outcomes.
