NCLB at Year 8 in the Assessment of English Language Learners: Taking Stock of the Assessment and Accountability Systems National Association of Test Directors

NCLB at Year 8 in the Assessment of English Language Learners: Taking Stock of the Assessment and Accountability Systems

National Association of Test Directors 2009 Symposium

Organized by:

Phil Morse, Los Angeles Unified School District

Edited by:

Joseph O'Reilly, Ph.D. Mesa (AZ) Public Schools

This is the twenty-fourth volume of the published symposia, papers and surveys of the National Association of Test Directors (NATD).

This publication serves an essential mission of NATD ‑ to promote discussion and debate on testing matters from both a theoretical and practical perspective. In the spirit of that mission, the views expressed in this volume are those of the authors and not NATD. The paper and discussant comments presented in this volume were presented at the April, 2009 meeting of the National Council on Measurement in Education (NCME) in San Diego, California.

NCLB at Year 8 in the Assessment of English Language Learners: Taking Stock of the Assessment
Jamal Abedi, University of California, Davis

Policy and Reality: Making Academic Assessments Work for English Learners
Rebecca Kopriva, University of Wisconsin–Madison

Second Generation Accountability for English Language Learners
David Francis, University of Houston

Discussant Comments
Gregory Cizek, University of North Carolina - Chapel Hill
Robert Linquanti, WestEd

Jamal Abedi
University of California, Davis

NCLB at Year 8 in the Assessment of English Language Learners: Taking Stock of the Assessment Systems

How successful has NCLB been in resolving issues concerning assessment and accountability for ELL students?

1. Problems in classification/reclassification of ELL students

2. Inclusion of ELL students in the state and national assessments

3. Quality of assessments (both Title I and Title III) for ELL students

4. Issues concerning accommodations for ELL students

5. Instability of ELL subgroup

Assessment of English Language Learners?

1. What testing strategies may be helpful to improve reliability and validity as they relate to ELL students?

2. What impact has construct-irrelevance variance had on students’ performance on high-stakes tests?

3. What effect has the focus on the English learners subgroups had on their performance in school and on performance and language tests? And

4. Do testing accommodations assist English learners to display accurately their educational ability? Which are effective?

Quality of Assessments for English Language Learners

1. What testing strategies may be helpful to improve reliability and validity as they relate to ELL students?

2. What impact has construct-irrelevance variance had on students’ performance on high-stakes tests?

Measurement Quality for ELL Students

• Language factors impact student performance, particularly in content-based assessments such as math and science

• The performance gap between ELL and non-ELL students is highest on assessments with high language demand

• Standardized achievement tests are usually constructed and field tested for native speakers of English

• For ELL students, language is an additional source of measurement error and reduces reliability of assessments

• For ELL students, language factors (unnecessary linguistic complexity of test items) are a source of construct-irrelevant variance and affect the validity of assessments.

Are the Standardized Achievement Tests Reliable and Valid for these Students?

• The reliability coefficients of the test scores for ELL students are substantially lower than those for non-ELL students

• ELL students’ test outcomes show lower criterion-related validity

• Structural relationships between test components and across measurement domains are lower for ELL students

Performance/Reliability-Gap Between ELL and non-ELL Students

Subject                   Performance-Gap   Reliability-Gap
Reading                   20% - 60%         15% - 40%
Science/Social Sciences   10% - 40%         12% - 35%
Math Problem Solving      8% - 25%          10% - 30%
Math Computation          0% - 10%          10% - 15%

Examining Complex Linguistic Features in Content-Based Test Items

Feature   Feature Description            Categories Combined
1         Item length                    1, 2, 4, 45
2         Vocabulary                     3, 26, 27
3         Nominal heaviness              5, 6, 29, 30, 31, 32
4         Verb voice                     7, 33
5         Modal                          8, 34
6         Relative clause                9, 10, 11, 35, 36, 37
7         Adverbial modification         12, 13, 14, 15, 16, 17, 38, 39, 40, 41
8         Conditional clause             18, 19
9         Complement clause              20, 44
10        Sentence structure             28, 42, 43, 46
11        Preferred argument structure   22, 23, 47, 48
12        Question form                  21
13        Global difficulty              24
14        Content interest               25

Additional Complex Linguistic Features

More recent research has identified these additional features:

• Complex verbs

• Subordinate clauses (including relative clauses)

• Complex noun phrases

• Various entities as subjects

What NCLB Has Accomplished by Focusing on English Language Learners

What effect has the focus on the English learners subgroups had on their performance in school and on performance and language tests?

As a Case Example of Focus On ELLs:

Impact of NCLB on English Language Proficiency Assessment

• ELP assessment status prior to NCLB

• Current status

• The impact/contribution of NCLB on ELP assessment

Status of English Language Proficiency (ELP) Assessment

ELP assessment status prior to NCLB

Existing English language proficiency tests prior to NCLB

There were many different English language proficiency tests prior to the NCLB implementation. However, there are issues with many of these tests. For example:

• Lack of a clear operational definition of ELP for many of these tests

• Differences in types of tasks the tests cover and the specific item content of the tests

• They are based on different theoretical emphases

• Problems with the reliability and validity of these tests, the adequacy of the scoring directions, and issues concerning field testing (samples, etc.).

Reviews of Language Proficiency Tests

Zehler et al. (1994) and Del Vecchio & Guerrero (1995) compared the content and psychometrics of the six most commonly used English language proficiency tests and found:

• The content of the tests differs considerably in types of tasks and specific item content

• The tests differ in the grade levels for which they were designed, time limits and test item format

• The tests represent distinct approaches to definition of language proficiency, reflecting different theoretical emphases

• There were major issues with reliability and validity of the tests, the adequacy of the scoring directions, the limited populations used in field testing

Status of English Language Proficiency (ELP) Assessment

Current status

The impact/contribution of NCLB on ELP assessment

Title III of NCLB requires that:

• States measure proficiency and show progress;

• assess all ELL students;

• provide independent measures for the four skill domains of reading, writing, speaking, and listening;

• report a separate measure for comprehension;

• assess proficiency in academic language and in the language of social interaction, and

• align the assessments with state English Language Development (ELD) standards

Major Events

• The U.S. Department of Education provided Enhanced Assessment Grants under Title VI (Section 6112) of No Child Left Behind (P.L. 107-110; NCLB) for development, validation, and implementation of English proficiency assessments

• Four different consortia of states were funded to develop measures of English language proficiency

• The U.S. Department of Education provided additional Enhancement Grants for validating/ improving these assessments

The Four Consortia

• The Mountain West Assessment Consortium

• The WIDA (ACCESS for ELLs) Consortium

• The ELDA Consortium

• The CELLA Consortium

The Process

• To test four modalities: (1) Reading, (2) Writing, (3) Speaking, and (4) Listening along with (5) Comprehension

• Four grade clusters (K-2; 3-5; 6-8; and 9-12)

• Based on state’s ELP content standards

• To pilot and/or field test items on a large sample and conduct validity studies

• To conduct standard setting

• Additional studies, e.g., examining the validity of the accommodations used

• Provide test administration and technical manual, and scoring and reporting instructions.

Issues with the newly developed ELP assessments

1. ELP Standards

• Which ELP standards, from which of the participating states?

• Is there a set of common ELP standards?

• Do all the participating states have a set of defined ELP standards?

• How are ELP standards defined by states?


Issues with the newly developed ELP assessments

2. Setting achievement levels

• Should achievement levels be set separately for each of the four modalities?

• How can discrepancies be addressed?

• Or, should achievement levels be set at the whole test level?

• How should the total score be obtained:

– Simple composite?

– Latent composite?


Issues with the newly developed ELP assessments

3. Dimensionality issues

• Should the four modalities be considered as four separate subscales/dimensions?

• Sawaki, Stricker, & Oranje (2007) suggest a higher-order factor model with a general factor and four group factors (reading, listening, speaking & writing)

• Lall, Gaj, Broer, Carlson & Gu (2007) found a dominant first factor across the three modalities (listening, reading, & writing) to support a common vertical scale across grades 4-11

Issues with the newly developed ELP assessments

4. Comparability and feasibility issues

• States used different ELP tests to establish the baseline for AMAOs

• Thus, the comparability factor becomes a serious issue

• There are also issues concerning feasibility and test burden

• Some of these assessments require over 6 hours of testing time.

Issues with the newly developed ELP assessments

5. Lack of data to judge the quality of ELP measures

• Validation of achievement levels (internal and external criteria)

• Relationship between Title I and Title III assessments

• Relationship between ELL classification and ELP test scores

Where Are We in Terms of Accommodations for ELL Students at Year 8 of NCLB?

Do testing accommodations assist English learners to display accurately their educational ability? Which are effective?

Principle of Equity and Fairness

•Provide assistance in the form of accommodations

Therefore, the Principle of Equity and Fairness demands assistance to these students

Samples of Accommodations Used for ELL Students That May Not be Relevant

• Test-taker marks answers in a test booklet

• Copying assistance provided between drafts

• Test-taker indicates answers by pointing or other similar method

• Paper is secured to work area with tape/magnet

• Physical assistance is provided

Presenting Language-Related Accommodations for ELLs

• English Extended Time

• Dictionary

• English Glossary

• Bilingual Dictionary/Glossary

• Customized Dictionary

• Native Language Testing

• Linguistically Modified Test

Studies on Linguistic Accommodations

• Results of national data are not conclusive

• Most of the CRESST studies found significant gain for ELL students on linguistically modified version

• However, the outcome of national research on the impact of linguistic modification is mixed (Francis, et al.)

• Sources that lead to such discrepancies include variation in methodology in implementing linguistic modification approach, sampling and power issues, variation in test items and the nature of linguistic complexity, etc.

Summary & Conclusion

What has NCLB accomplished after 7 years?

• More attention to the inclusion, assessment, and accountability of subgroups of students including ELL students

• More focus on standards-based assessments

• More organized efforts in creating and administering English language proficiency assessments

• The quality of content-based assessment for subgroups of students in general and for ELL students in particular is still questionable

• The validity of classification of ELL students is still questionable

For more information contact:

Jamal Abedi:[email protected]

(530) 754-9150



Policy and Reality: Making Academic Assessments Work for English Learners

Rebecca Kopriva
University of Wisconsin, Madison

[email protected]



Future Policy for ELs: Probable Scenarios

• For academic tests, holding schools accountable for ELs and most of the other subgroups is most likely NOT going to go away.

• For the EL subgroup, including ‘former ELs’ in the academic accounting mix will probably be formalized in the next federal legislation.

• Future federal peer reviews of academic tests will most likely be ‘tougher’ on demanding evidence of valid and comparable inferences for students receiving accommodations and students taking alternative assessments within the system, as compared to those taking the general test under standard conditions.

Future Policy for ELs: Probable Scenarios

• A peer review-type system will probably be set up for ELP tests.

• Federal interest in how ELs are identified and exited has been raised.

• NAEP is continuing to clarify how to handle inclusion/exclusion rates for ELs and SDs in states and districts. They are also continuing to clarify accommodation options and how to make sure students with similar profiles in different schools/states get accommodations that make sense.

What Does this Mean for ELs and Academic Content Tests?

EL students need to be included in the testing systems appropriately (otherwise, what’s the point….)

• We know there are about 15 primary and secondary test accommodations that are useful for ELs on large-scale academic tests.

• Research on accommodations and packages is continuing. The investigations need to focus on studying how the accommodations are working for students who need them (as compared to how they work for broad groups of EL students).

• Overall, proper post hoc accommodations appear to be useful for more advanced ELs but even high intermediate students are still at risk.

• There seem to be 3 issues that need attention….

What Are the Most Pressing Issues? What to do About Them?

1) Students with English language proficiency at pre-functional, beginner, and low intermediate levels are generally not served well.

This group seems to need more support than many packages of accommodations provide.

a) If taught in English, they need comprehensive support in both languages, including oral in L1 if not literate in L1.

b) ‘Plain language’ forms are NOT useful if not supported by L1 support that goes beyond bilingual glossaries.

c) Response accommodations are essential for this group.

What Are the Issues? What to do About Them?

2) ELs at the higher intermediate and some advanced English proficiency levels are still at risk as well because items with cognitive complexity generally require more language.

Items with more cognitive complexity (e.g. DOK = 2 and up) often require:

• Language precision that cannot be adequately duplicated in visuals or significantly reduced

• Elucidation of abstract concepts that require more complex language structures

• Contexts that cannot be meaningfully represented in static visuals only.

Bilingual and English glossaries are best for nouns or action verbs. More complex items use abstract nouns, verb tenses, conceptual language, phrase and clause structures usually not found in these glossaries.


To address 1) and 2):

a) Written translations may be useful for students with grade-level literacy in L1 (not common among some groups).

Translations may be of limited use if these students have been taught in English for a substantial period of time. In this case, experts have recommended dual language forms for students so they can check meaning in the other language.

There are about 20 languages used in academic instruction. Issues of quality of translation and evenness of language meaning across forms need to be overseen.


b) For students not literate in L1, oral L1 with English booklets may be useful for lower English proficient ELs. This is being done in some states, and preliminary findings suggest that the technical adequacy of the ‘split’ accommodation is robust and comparable to oral English and standard administration conditions. Translation issues and number of translations still apply.

c) If portfolios are used for lower English proficient students, the data collection, evaluation, and oversight need to be rigorous. See Petit & Rigney (1995).

d) For higher level ELP students who have been in the U.S. for a substantial period of time, oral English with English booklets (preferably ‘plain language’ booklets) seems to be useful.


e) Two projects are investigating using interactive computer capabilities to substitute for language in more complex items. Results have been effective, especially for lower proficient ELs.

Computer capabilities include animated and interactive context and question-building, and response options using clicking, dragging, and modeling.

What language is left is in English on the screen and translated to the student when they click the speaker icon. Far fewer words minimize translation comparability issues.


3) Current procedures are not effective in making sure the proper accommodations are getting to the students who need them.

a) States and districts need to oversee implementation to ensure accommodation decisions are being carried out.

b) The bigger problem, however, is that procedures in use today do not yield effective accommodation decisions or consistent decisions across students with similar profiles.

Several research studies have shown that decisions by teachers or committees often yield results no better than random assignment, and are notoriously inconsistent across LEAs and SEAs.

Further, proper assignments lead to significantly higher scores than improper assignments, with improper assignments no different than random assignments.


A standardized, research-based, online system of matching individual EL students to their accommodations has been recently built. The system (STELLA) can be used to help guide teachers or committees.

STELLA, which collects data from records, teachers, and parents/guardians or students, has been validated and is available for licensing.

It is designed to provide accommodation recommendations for large-scale tests consistent with local state or district policy, while also providing best-practice recommendations useful for classroom use.

STELLA recommends individualized pretest support for teachers too. It goes beyond test prep and focuses on purposes of testing, types of questions used on US tests that are different from the student’s country (e.g. word problems in math), and identification of cognitive skills on US tests but not evaluated in the student’s previous schooling. URL for STELLA: www.wida.us/UW/STELLA

For more information…

Kopriva, R. J. (2008). Improving Testing for English Language Learners. Routledge, NY, NY.

[email protected]


Second Generation Accountability for English Language Learners

David J. Francis
Hugh Roy and Lillie Cranz Cullen Professor of Psychology
University of Houston

Texas Institute for Measurement, Evaluation, and Statistics

Overview

• How can testing and reporting systems be improved in order to promote more meaningful accountability for English Language Learners?

• Do testing accommodations assist English Learners to measure accurately their educational ability? Which are effective?

• How could the NCLB English Language Learner accountability provisions be improved in its reauthorization?

LEP as a Subgroup Under NCLB

• More so than other subgroups (e.g., FRL, gender, ethnicity), there is no universal definition of LEP, nor is there a universally accepted approach to identifying children as LEP.

• Defining the population of interest and monitoring their academic progress is less precise than it needs to be

– States differ in the instruments used to assess language proficiency

– States differ in the criteria used to judge proficiency

– Too often, subjective judgments are part of the decision process

LEP as a Subgroup Under NCLB

• Several factors make LEP unique as a subgroup under NCLB

– Unlike other NCLB subgroups (e.g., gender, ethnicity, learning disabilities), membership in the LEP category is dynamic and developmental

– Moreover, the defining characteristic (i.e., language proficiency) is causally linked to schooling AND to the outcomes of interest (i.e., content area achievement)

Accountability for Subgroups

• NCLB presumes static group membership.

– Instruction is expected to affect subgroup achievement, not subgroup membership

• LEP subgroup membership is affected by school performance

– Students lose their membership as they acquire English

– Acquisition of English is both

    • a consequence of effective schooling, and

    • a mediator of the effects of schooling on content mastery

Accountability for Subgroups

• Membership in LEP subgroup requires LOW performance on a cognitive dimension (viz. English Language) that is causally linked to the outcomes of interest (viz. English Achievement) and is a consequence of effective instruction

• Ignoring the developmental nature of language and its role in mediating the effects of instruction on achievement can bias our estimates of how schools are performing for the LEP subgroup

[Figure: ELL students (included in comparisons) and RFEP students (excluded in comparisons)]

Comparison of ELLs and former ELLs on State Reading Test in Texas 2002

Level of Language Proficiency for ELL Groups

Grade     Beginning   Intermediate   Advanced (2002)   Advanced (2000)
3         13.9        38.3           90.6              90.0
4         13.1        37.4           84.1              93.6
5         16.5        24.1           69.5              96.1
6         14.5        12.8           46.0              86.8
7         15.0        12.4           43.9              85.0
8         23.2        19.2           55.3              90.2
10        21.3        28.5           66.4              85.8
Overall   15.8        30.4           76.4              89.6

http://www.tea.state.tx.us/student.assessment/reporting/results/rpteanalysis/2002/reading/statewide.html

Current Law

• Allows retention of LEP label for accountability purposes for up to two years after achieving FEP status

– This practice boosts the performance of the LEP group

– Gives a less biased view of the performance of LEP students than when FEPs are not counted

– But is it really the view that schools and the public need? Is it sufficiently accurate to be useful?

Including FEPs in the LEP Group: Are we really getting at the right information?

Grade   Achievement %Proficient without FEP-2¹   Achievement %Proficient with FEP-2¹
3       25.4                                     68.6
4       31.9                                     65.6
5       33.5                                     61.4
6       21.2                                     47.3
7       22.1                                     45.1
8       30.7                                     55.1
10      39.2                                     62.4

¹ Hypothetical result based on 2003 percentages in each language proficiency category

Which column best captures the long term results for LEP students? Which column tells us how the school/district/state is doing?

Retaining FEPs in the LEP group…

• Improves on typical practice, but …

– is insufficient to get an accurate and complete picture about the performance of LEP students

    • It does not accurately reflect the long term outcomes for students who began school as LEP, and

    • Confounds language proficiency and achievement

Including FEPs in the LEP Group: Are we really getting at the right information?

Level of Language Proficiency for LEP Groups (first four data columns) and Achievement %Proficient (last two columns)

Grade   Beginning   Intermediate   Advanced (2002)   Advanced (2000)   Without FEP-2¹   With FEP-2¹
3       13.9        38.3           90.6              90.0              25.4             68.6
4       13.1        37.4           84.1              93.6              31.9             65.6
5       16.5        24.1           69.5              96.1              33.5             61.4
6       14.5        12.8           46.0              86.8              21.2             47.3
7       15.0        12.4           43.9              85.0              22.1             45.1
8       23.2        19.2           55.3              90.2              30.7             55.1
10      21.3        28.5           66.4              85.8              39.2             62.4

Which column best captures the long term results for LEP students? Which one really tells us how the school/district/state is doing?

Retaining FEPs in the LEP group…

• Improves on typical practice, but …

– is insufficient to get an accurate and complete picture about the performance of LEP students

    • It does not allow reporting of long term outcomes for students who began school as LEP, and

    • Confounds language proficiency and achievement

What would you conclude if the overall results looked like this from 2002 to 2003?

Grade   Achievement %Proficient 2002¹   Achievement %Proficient 2003
3       68.6                            68.6
4       65.6                            64.9
5       61.4                            60.7
6       47.3                            44.6
7       45.1                            42.3
8       55.1                            52.2
10      62.4                            62.7





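Level of Language Proficiency for LEP Groups (first four data columns) and Achievement %Proficient (last two columns)

Grade   Beginning   Intermediate   Advanced (2002)   Advanced (2000)   %Proficient 2002¹   %Proficient 2003
3       13.9        38.3           90.6              90.0              68.6                68.6
4       13.1        37.4           84.1              93.6              65.6                64.9
5       16.5        24.1           69.5              96.1              61.4                60.7
6       14.5        12.8           46.0              86.8              47.3                44.6
7       15.0        12.4           43.9              85.0              45.1                42.3
8       23.2        19.2           55.3              90.2              55.1                52.2
10      21.3        28.5           66.4              85.8              62.4                62.7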





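Level of Language Proficiency for LEP Groups (first four data columns) and Achievement %Proficient (last two columns)

Grade   Beginning   Intermediate   Advanced (2002)   Advanced (2000)   %Proficient 2002¹   %Proficient 2003
3       15.3        42.1           92.4              91.8              68.6                68.6
4       14.4        41.1           85.8              95.5              65.6                64.9
5       18.2        26.5           70.9              98.0              61.4                60.7
6       15.9        14.1           46.9              88.5              47.3                44.6
7       16.5        13.6           44.8              86.7              45.1                42.3
8       25.5        21.1           56.4              92.0              55.1                52.2
10      23.4        31.4           67.7              87.5              62.4                62.7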



Aggregate Reporting Masks Performance Changes When Demographics Shift

Level of Language Proficiency for LEP Groups (first four data columns) and Achievement %Proficient¹ (last column)

Grade   Beginning   Intermediate   Advanced   Advanced 2 Years Prior   Achievement %Proficient¹
3       15.3        42.1           92.4       91.8                     68.6
4       14.4        41.1           85.8       95.5                     64.9
5       18.2        26.5           70.9       98.0                     60.7
6       15.9        14.1           46.9       88.5                     44.6
7       16.5        13.6           44.8       86.7                     42.3
8       25.5        21.1           56.4       92.0                     52.2
10      23.4        31.4           67.7       87.5                     62.7

¹ Hypothetical result based on increasing achievement in all groups while increasing the percentage of students in the lowest three categories of language proficiency

Achievement is up for children in each ELP category, but overall achievement is the same or lower.

What’s going on?




Grade 3            Beginning   Intermediate   Advanced   Advanced 2 Years Prior
2002 Achievement   13.9        38.3           90.6       90.0
2003 Achievement   15.3        42.1           92.4       91.8


Overall Achievement depends on achievement in each ELP Category




Grade 3            Beginning   Intermediate   Advanced   Advanced 2 Years Prior   Overall %Proficient
2002 Achievement   13.9        38.3           90.6       90.0                     68.6
% of students      16.0%       18.0%          18.0%      48.0%
2003 Achievement   15.3        42.1           92.4       91.8                     68.6
% of students      17.6%       19.8%          19.8%      42.8%

AND on the percentage of students in different ELP categories.

The percentage of children in each language proficiency category is partly a function of instruction, but it is also a function of demographics.
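The Grade 3 figures above can be checked with a short calculation. The sketch below assumes that the overall percent proficient is the enrollment-weighted average of the four category rates; the slides do not state the formula, but this assumption reproduces the 68.6 reported for both years. The function and variable names are illustrative only.

```python
# Grade 3 values copied from the slide above; names are illustrative.
categories = ["Beginning", "Intermediate", "Advanced", "Advanced 2 Years Prior"]

pct_proficient = {
    2002: [13.9, 38.3, 90.6, 90.0],
    2003: [15.3, 42.1, 92.4, 91.8],
}
pct_students = {
    2002: [16.0, 18.0, 18.0, 48.0],
    2003: [17.6, 19.8, 19.8, 42.8],
}


def overall_percent_proficient(rates, shares):
    """Weighted average of category rates, weighted by share of students.

    Assumption: the overall figure is an enrollment-weighted mean.
    """
    return sum(r * s for r, s in zip(rates, shares)) / sum(shares)


for year in (2002, 2003):
    overall = overall_percent_proficient(pct_proficient[year], pct_students[year])
    print(year, round(overall, 1))  # prints 68.6 for both years

# Every category improved from 2002 to 2003 ...
every_category_up = all(
    new > old for old, new in zip(pct_proficient[2002], pct_proficient[2003])
)
print(every_category_up)  # True
```

Under this assumption, every category rate rises while the aggregate stays at 68.6, because a larger share of students sits in the three lower-scoring categories in 2003.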




Allowing FEP students to count in the ELL category for up to two years…

• Boosts the overall percent proficient within the ELL category,

• But it does NOT

– allow us to easily determine the academic achievement of ELLs who become proficient in English

– allow us to easily determine the long term achievement outcomes for children who entered school as ELLs

– provide schools with actionable information about their ELL students’ performance, or

– resolve the problem of aggregation bias when demographics are shifting

Problems with Current Accountability Practice for ELLs

• Overly simplistic view of a complex developmental and educational process

• Fails to take into account the developmental nature of language acquisition

– Language takes time to acquire

– This time frame contains maturational, educational, and environmental influences

    • It can be accelerated through good instruction and experiences in a language rich environment,

    • It CANNOT be reduced to zero

• Fails to take into account the causal role that language plays in acquisition of content area knowledge

How much can the performance of LEP students be improved through appropriate test accommodations?

Meta-Analysis: Literature Search

• Final sample was 11 studies

– Each study used random assignment of ELLs and non-ELLs to testing conditions with and without accommodations (for one study we could not confirm random assignment)

– Involved 38 different samples of students

– Reported 38 different tests of the effectiveness of accommodations for ELLs

Study Descriptions

• Grades included

– 4th: n=11

– 8th: n=23

– 5th or 6th: n=2 each

• Subject Areas

– Math: n=17

– Science: n=20

– Reading: n=1

• Type of test

– NAEP items: n=23

– NAEP and TIMSS: n=6

– State Accountability Assessment: n=9 (two different states)

Types of Accommodations

[Figure: bar chart of the Number of Study Samples (axis from 0 to 20) for each accommodation type: Simplified English, English Dictionary, Bilingual Dictionary, Spanish Version, Dual Language Questions, Dual Language Booklet, Extra Time]

How large are the achievement gaps between ELLs tested without accommodations and non-ELLs?

[Figure: bar chart of Effect Size (axis from 0 to 1.4) for Math and Science; series: Meta-analysis, NAEP - 4th Grade, NAEP - 8th Grade]

Results for Fixed Effects Analysis

Column groups: Effect Size and 95% Confidence Interval (Mean Effect Size, s.e., Lower Limit, Upper Limit); Test of Mean Effect = 0 (Z, p); Test of Heterogeneity in Effect Sizes (Q, df(Q), p(Q))

Accommodation                                     Number of Studies   Mean Effect Size   s.e.    Lower Limit   Upper Limit   Z        p      Q        df(Q)   p(Q)
English Dictionary-Glossary                       11                  .146               0.043   .063          .230          3.427    .001   14.804   10      .139
Simplified English                                16                  .030               0.043   -.053         .114          0.708    .479   23.885   15      .067
Bilingual Dictionary-Glossary                     5                   -.096              0.065   -.223         .031          -1.479   .139   13.53    4       .009
Spanish Version                                   2                   -.263              0.102   -.463         -.062         -2.572   .010   14.465   1       <.001
[accommodation label missing]                     1                   -0.177             0.065   -.223         .031          -1.199   .231
Dual Language Questions + Read Aloud in Spanish   1                   .273               0.195   -.109         .654          1.401    .161
Extra Time                                        2                   .209               0.142   -.069         .488          1.473    .141   0.155    1       .693
TOTAL WITHIN                                                                                                                                 66.844   31      <.001
TOTAL BETWEEN                                                                                                                                24.426   6       <.001
OVERALL MEAN                                      38                  0.038              0.025   -.012         .087          1.481    .139   91.270   37      <.001
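As a reading aid for the fixed effects table, the sketch below recomputes the confidence limits, Z statistic, and two-sided p value for the English Dictionary-Glossary row from its reported mean effect size and standard error, and checks that the within-group and between-group Q statistics add up to the overall Q. It assumes the usual normal-theory formulas (mean ± 1.96 × s.e., Z = mean / s.e.), which the slides do not spell out, and it works from the rounded values in the table, so small discrepancies from the reported figures are expected. The function name is illustrative.

```python
import math


def summarize_effect(mean, se, z_crit=1.96):
    """Normal-theory 95% interval, Z statistic, and two-sided p value.

    Assumption: the table uses the standard large-sample formulas.
    """
    lower = mean - z_crit * se
    upper = mean + z_crit * se
    z = mean / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p from the normal distribution
    return lower, upper, z, p


# English Dictionary-Glossary row (rounded values from the table).
lower, upper, z, p = summarize_effect(0.146, 0.043)
print(f"95% CI: [{lower:.3f}, {upper:.3f}]  Z = {z:.2f}  p = {p:.4f}")

# Heterogeneity statistics reported per accommodation type.
q_by_type = [14.804, 23.885, 13.53, 14.465, 0.155]
q_within_reported = 66.844
q_between_reported = 24.426
q_total_reported = 91.270

print(f"sum of per-type Q: {sum(q_by_type):.3f}")
print(f"within + between:  {q_within_reported + q_between_reported:.3f}")
```

The recomputed limits agree with the reported .063 and .230 to within rounding, and the within plus between Q statistics reproduce the overall Q of 91.270.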

Results for Random Effects Analysis

Column groups: Effect Size and 95% Confidence Interval (Mean Effect Size, s.e., Lower Limit, Upper Limit); Test of Mean Effect = 0 (Z, p); Test of Heterogeneity in Effect Sizes (Q, df(Q), p(Q))

Accommodation                                     Number of Studies   Mean Effect Size   s.e.    Lower Limit   Upper Limit   Z        p      Q       df(Q)   p(Q)
English Dictionary-Glossary                       11                  0.178              0.055   .070          .287          3.232    .001
Simplified English                                16                  0.037              0.067   -.093         .168          0.557    .557
Bilingual Dictionary-Glossary                     5                   -0.039             0.131   -.295         .217          -0.298   .766
Spanish Version                                   2                   0.302              0.719   -1.107        1.711         0.420    .674
[accommodation label missing]                     1                   -0.177             0.065   -.223         .031          -1.199   .231
Dual Language Questions + Read Aloud in Spanish   1                   0.273              0.195   -.109         .654          1.401    .161
Extra Time                                        2                   0.209              0.142   -.069         .488          1.473    .141
TOTAL WITHIN
TOTAL BETWEEN                                                                                                                                9.013   6       .173
OVERALL MEAN                                      38                  0.102              0.037   0.029         0.174         2.753    .006

Findings: Effectiveness

• Of the accommodations studied, only providing English dictionaries had a significant positive effect.

– Hedges’ gu = .15 in fixed effects model; .18 in random effects model

– Approximately 10% – 25% of the difference between ELLs & native English speakers

[Figure: bar chart of Effect Size (axis from 0 to 1.4) for Math, Science, and English Dictionaries; series: Meta-analysis, NAEP - 4th, NAEP - 8th]

Summary of Results

Of the seven types of accommodations used, only one had an overall positive effect on ELL outcomes: English language dictionaries and glossaries

– Produced an average effect which is positive and statistically different from zero

– No indication that this effect varied across the studied conditions

– No evidence that it was not a valid accommodation

– No evidence that effect sizes varied when bundled with extra time, or when glossaries were electronic.

How do time and language work to predict the content area achievement of ELLs?

3-Level Model for ELA and Math

• Unconditional Model (within grade)

– V(Students(schools))

– V(Schools(Districts))

– V(Districts)

• Conditional Models

– Years in US

– ELP

– Years in US and ELP

Unconditional Random Effects for ELA and MATH

| Grade | Source | ELA Estimate | ELA s.e. | ELA Z | ELA p | ELA %Varianceᵃ | MATH Estimate | MATH s.e. | MATH Z | MATH p | MATH %Varianceᵃ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | District | 23.99 | 5.98 | 4.01 | <.0001 | 0.15 | 36.70 | 8.18 | 4.49 | <.0001 | 0.17 |
| 4 | Schools | 25.61 | 3.30 | 7.76 | <.0001 | 0.16 | 34.10 | 4.26 | 8.00 | <.0001 | 0.15 |
| 4 | Students | 111.69 | 2.62 | 42.70 | <.0001 | 0.69 | 149.64 | 3.37 | 44.36 | <.0001 | 0.68 |
| 5 | District | 25.20 | 5.95 | 4.23 | <.0001 | 0.17 | 43.69 | 9.35 | 4.67 | <.0001 | 0.19 |
| 5 | Schools | 15.58 | 2.92 | 5.33 | <.0001 | 0.11 | 34.94 | 5.32 | 6.57 | <.0001 | 0.15 |
| 5 | Students | 107.48 | 2.94 | 36.57 | <.0001 | 0.72 | 151.28 | 3.98 | 38.01 | <.0001 | 0.66 |
| 6 | District | 21.05 | 6.20 | 3.39 | 0.0003 | 0.15 | 48.79 | 11.27 | 4.33 | <.0001 | 0.23 |
| 6 | Schools | 20.47 | 4.04 | 5.06 | <.0001 | 0.14 | 23.93 | 5.30 | 4.51 | <.0001 | 0.11 |
| 6 | Students | 100.77 | 2.82 | 35.77 | <.0001 | 0.71 | 135.55 | 3.67 | 36.96 | <.0001 | 0.65 |
| 7 | District | 25.79 | 6.72 | 3.84 | <.0001 | 0.17 | 58.80 | 12.91 | 4.55 | <.0001 | 0.29 |
| 7 | Schools | 17.57 | 3.73 | 4.72 | <.0001 | 0.12 | 20.00 | 4.37 | 4.58 | <.0001 | 0.10 |
| 7 | Students | 108.15 | 3.06 | 35.36 | <.0001 | 0.71 | 120.66 | 3.30 | 36.56 | <.0001 | 0.60 |
| 8 | District | 26.05 | 7.63 | 3.41 | 0.0003 | 0.16 | 52.35 | 11.19 | 4.68 | <.0001 | 0.27 |
| 8 | Schools | 24.18 | 5.11 | 4.73 | <.0001 | 0.15 | 29.67 | 5.43 | 5.47 | <.0001 | 0.15 |
| 8 | Students | 115.44 | 3.28 | 35.23 | <.0001 | 0.70 | 110.01 | 3.03 | 36.27 | <.0001 | 0.57 |

ᵃ %Variance computed as intra-class correlations (ICCs), viz. as ratio of estimate to sum of estimates for District, School, and Students. Percentages may not sum to 100% due to rounding.
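The %Variance footnote for the unconditional model can be illustrated with a short calculation. The sketch below takes the grade 4 variance component estimates reported for ELA and MATH and divides each by the sum across District, Schools, and Students; the function name `variance_shares` is an illustrative choice, not something taken from the slides.

```python
def variance_shares(district, schools, students):
    """Share of total variance at each level: estimate / sum of estimates."""
    total = district + schools + students
    return {
        "District": district / total,
        "Schools": schools / total,
        "Students": students / total,
    }


# Grade 4 estimates from the unconditional random effects table.
grade4_ela = variance_shares(23.99, 25.61, 111.69)
grade4_math = variance_shares(36.70, 34.10, 149.64)

if __name__ == "__main__":
    for label, shares in (("ELA", grade4_ela), ("MATH", grade4_math)):
        print(label, {level: round(value, 2) for level, value in shares.items()})
```

Rounded to two decimals, these shares reproduce the grade 4 %Variance entries (0.15, 0.16, 0.69 for ELA and 0.17, 0.15, 0.68 for MATH).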

Conditional Random Effects for ELA and MATH predicted from Years in US, ELP, and Years + ELP

| Grade | Source | ELA Years in US | ΔR²ᵃ | ELA ELP-Perf. | ΔR²ᵃ | ELA Years and ELP | ΔR²ᵃ | MATH Years in US | ΔR²ᵃ | MATH ELP-Perf. | ΔR²ᵃ | MATH Years and ELP | ΔR²ᵃ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | District | 27.21 | -0.13 | 15.13 | 0.37 | 14.73 | 0.39 | 41.11 | -0.12 | 29.10 | 0.21 | 26.51 | 0.28 |
| 4 | Schools | 25.04 | 0.02 | 15.66 | 0.39 | 15.72 | 0.39 | 32.74 | 0.04 | 22.62 | 0.34 | 22.85 | 0.33 |
| 4 | Students | 108.37 | 0.03 | 81.83 | 0.27 | 81.67 | 0.27 | 145.14 | 0.03 | 119.72 | 0.20 | 118.84 | 0.21 |
| 5 | District | 25.73 | -0.02 | 11.62 | 0.54 | 11.11 | 0.56 | 45.24 | -0.04 | 36.52 | 0.16 | 35.45 | 0.19 |
| 5 | Schools | 14.83 | 0.05 | 9.25 | 0.41 | 9.53 | 0.39 | 33.28 | 0.05 | 23.34 | 0.33 | 22.88 | 0.35 |
| 5 | Students | 104.37 | 0.03 | 70.30 | 0.35 | 69.65 | 0.35 | 149.57 | 0.01 | 120.02 | 0.21 | 117.86 | 0.22 |
| 6 | District | 22.15 | -0.05 | 9.16 | 0.56 | 8.59 | 0.59 | 49.56 | -0.02 | 35.88 | 0.26 | 31.27 | 0.36 |
| 6 | Schools | 18.24 | 0.11 | 12.68 | 0.38 | 12.90 | 0.37 | 23.81 | 0.01 | 20.05 | 0.16 | 20.34 | 0.15 |
| 6 | Students | 97.03 | 0.04 | 66.38 | 0.34 | 66.07 | 0.34 | 133.72 | 0.01 | 111.82 | 0.18 | 109.45 | 0.19 |
| 7 | District | 27.88 | -0.08 | 11.20 | 0.57 | 11.05 | 0.57 | 61.72 | -0.05 | 47.32 | 0.20 | 43.68 | 0.26 |
| 7 | Schools | 13.08 | 0.26 | 4.53 | 0.74 | 4.63 | 0.74 | 19.42 | 0.03 | 14.44 | 0.28 | 15.03 | 0.25 |
| 7 | Students | 104.51 | 0.03 | 60.65 | 0.44 | 60.68 | 0.44 | 119.63 | 0.01 | 97.70 | 0.19 | 95.91 | 0.21 |
| 8 | District | 26.70 | -0.02 | 10.87 | 0.58 | 9.54 | 0.63 | 51.31 | 0.02 | 42.76 | 0.18 | 37.47 | 0.28 |
| 8 | Schools | 22.99 | 0.05 | 7.58 | 0.69 | 8.46 | 0.65 | 30.17 | -0.02 | 22.09 | 0.26 | 21.84 | 0.26 |
| 8 | Students | 113.83 | 0.01 | 73.83 | 0.36 | 72.40 | 0.37 | 109.00 | 0.01 | 92.13 | 0.16 | 89.03 | 0.19 |

ᵃ ΔR² computed as the change in variance component from the unconditional model (Table 5) relative to the magnitude of the variance component in the unconditional model, i.e., (Table 5 − Table 6)/(Table 5).
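The ΔR² footnote amounts to a proportional reduction in each variance component once predictors are added. As a minimal sketch, the code below applies that ratio to the grade 4 ELA estimates from the unconditional and conditional tables; the helper name `delta_r2` and the dictionary layout are illustrative assumptions.

```python
def delta_r2(unconditional, conditional):
    """Proportional change in a variance component: (uncond - cond) / uncond."""
    return (unconditional - conditional) / unconditional


# Grade 4 ELA variance components from the unconditional model.
unconditional_ela_g4 = {"District": 23.99, "Schools": 25.61, "Students": 111.69}

# Grade 4 ELA variance components from the three conditional models.
conditional_ela_g4 = {
    "Years in US": {"District": 27.21, "Schools": 25.04, "Students": 108.37},
    "ELP-Perf.": {"District": 15.13, "Schools": 15.66, "Students": 81.83},
    "Years and ELP": {"District": 14.73, "Schools": 15.72, "Students": 81.67},
}

results = {
    model: {
        source: round(delta_r2(unconditional_ela_g4[source], estimate), 2)
        for source, estimate in components.items()
    }
    for model, components in conditional_ela_g4.items()
}

if __name__ == "__main__":
    for model, values in results.items():
        print(model, values)
```

A negative value, such as the district-level entry for Years in US, simply means the estimated variance component was larger in the conditional model than in the unconditional one.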

Conditional Random Effects for ELA and MATH predicted from Years in US + ELP measured as (1) Performance Levels, (2) Scaled Score, or (3) Domain Scores

| Grade | Source | ELA Years + ELP-PL | ΔR²ᵃ | ELA Years + ELP-SS | ΔR²ᵃ | ELA Years + ELP-DS | ΔR²ᵃ | MATH Years + ELP-PL | ΔR²ᵃ | MATH Years + ELP-SS | ΔR²ᵃ | MATH Years + ELP-DS | ΔR²ᵃ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | District | 14.73 | 0.39 | 14.00 | 0.42 | 10.92 | 0.54 | 26.51 | 0.28 | 25.27 | 0.31 | 20.12 | 0.45 |
| 4 | Schools | 15.72 | 0.39 | 14.90 | 0.42 | 12.81 | 0.50 | 22.85 | 0.33 | 22.08 | 0.35 | 18.65 | 0.45 |
| 4 | Students | 81.67 | 0.27 | 74.28 | 0.33 | 60.20 | 0.46 | 118.84 | 0.21 | 112.80 | 0.25 | 100.78 | 0.33 |
| 5 | District | 11.11 | 0.56 | 10.72 | 0.57 | 7.66 | 0.70 | 35.45 | 0.19 | 33.94 | 0.22 | 25.35 | 0.42 |
| 5 | Schools | 9.53 | 0.39 | 8.27 | 0.47 | 7.21 | 0.54 | 22.88 | 0.35 | 22.44 | 0.36 | 20.03 | 0.43 |
| 5 | Students | 69.65 | 0.35 | 65.37 | 0.39 | 60.02 | 0.44 | 117.86 | 0.22 | 112.85 | 0.25 | 105.12 | 0.31 |
| 6 | District | 8.59 | 0.59 | 7.02 | 0.67 | 7.60 | 0.64 | 31.27 | 0.36 | 28.26 | 0.42 | 26.63 | 0.45 |
| 6 | Schools | 12.90 | 0.37 | 10.78 | 0.47 | 6.67 | 0.67 | 20.34 | 0.15 | 18.99 | 0.21 | 17.36 | 0.27 |
| 6 | Students | 66.07 | 0.34 | 61.97 | 0.39 | 56.49 | 0.44 | 109.45 | 0.19 | 104.52 | 0.23 | 97.72 | 0.28 |
| 7 | District | 11.05 | 0.57 | 11.09 | 0.57 | 8.04 | 0.69 | 43.68 | 0.26 | 42.19 | 0.28 | 34.16 | 0.42 |
| 7 | Schools | 4.63 | 0.74 | 4.05 | 0.77 | 3.06 | 0.83 | 15.03 | 0.25 | 14.21 | 0.29 | 12.25 | 0.39 |
| 7 | Students | 60.68 | 0.44 | 57.85 | 0.47 | 53.16 | 0.51 | 95.91 | 0.21 | 93.23 | 0.23 | 85.99 | 0.29 |
| 8 | District | 9.54 | 0.63 | 8.36 | 0.68 | 3.32 | 0.87 | 37.47 | 0.28 | 35.03 | 0.33 | 27.24 | 0.48 |
| 8 | Schools | 8.46 | 0.65 | 7.14 | 0.70 | 5.52 | 0.77 | 21.84 | 0.26 | 20.94 | 0.29 | 19.99 | 0.33 |
| 8 | Students | 72.40 | 0.37 | 69.24 | 0.40 | 60.64 | 0.47 | 89.03 | 0.19 | 85.41 | 0.22 | 75.99 | 0.31 |

ᵃ ΔR² computed as the change in variance component from the unconditional model (Table 5) relative to the magnitude of the variance component in the unconditional model, i.e., (Table 5 − Table 7)/(Table 5).

Analysis Summary

• Years in the US predicted ELA and MATH performance at the district, school, and student levels

• However, Years in the US was a relatively weak predictor compared with ELP

• When ELP was included with Years in US, the effects of Years in the US were unsystematic and small.

• Effects of ELP remained strong and consistent (i.e., outcomes increased with increases in ELP)

Analysis Summary

• How ELP was measured made some difference in its value as a predictor; Domain Scores predicted best

• Using Domain Scores for Reading and Writing only was almost as good as using Reading, Writing, Speaking, and Listening

• These results suggest that the academic components of the language assessment are the most important predictors of content area achievement

• It is noteworthy that ELP performance accounted for so much of the school and district variability in ELA and MATH

Conclusions from Statistical Analyses

• Taken together, the results highlight the importance of language in the development of content area knowledge.

– Analyses not presented here further highlight that children need to be taught the content in order to close achievement gaps.

– Gaining proficiency in English was not a guarantee of success on the content area assessments.

• The problem of content area proficiency is not simply a problem of testing as evidenced by the limited impact of test accommodations.

Suggestions for improving the accountability system for ELLs

Suggestions for Improved Accountability for ELLs

We clearly have a reporting problem that is fueled by the dynamic nature of the ELL category and the role that language plays in the development of content area proficiency

1. Create a separate reporting category of ELLs who are reclassified as FEP

2. Report achievement results within ELP proficiency levels:
   • Beginner
   • Intermediate
   • Advanced Intermediate
   • Fluent English Proficient

3. Set achievement performance expectations that are challenging but realistic for each particular group
   – Expect students in the FEP category to perform at levels comparable to monolinguals in the school/district/state
   – Establish more realistic achievement targets for students who are still developing English
     • Goals must be challenging, but attainable, to properly motivate students and teachers
     • Long term expectations for all students are the same, but short term goals must be challenging, yet attainable in the short term

4. Hold schools/districts accountable under Title I for language proficiency

5. Hold schools/districts accountable under Title I for the time required for children to achieve FEP status

6. Expect all students to reach FEP status

7. Short-circuit gamesmanship by limiting the time before children must be counted in the FEP category for achievement accountability

How does a system that incorporates these steps improve accountability?

• Gives schools actionable information about the content area achievement of all ELLs

• Allows more accurate evaluation of school performance in the face of shifting demographics

• Credits language proficiency gains and content area achievement gains that are reasonable given students' language proficiency

• Meaningfully counts the achievement results for all children regardless of their level of language proficiency

• Recognizes the importance of setting challenging but attainable goals to motivate maximal performance

Conclusions

• An accountability model that addresses these issues will provide more accurate information to teachers, principals, and other stakeholders about the performance of ELLs

• Place emphasis on integration of language instruction into content area instruction, and

• Increase the emphasis on teaching content when ELLs first reach school

• Increase the demand for language tests that will serve as better barometers of ELL students’ acquisition of the academic language skills needed to master content domains.

Thank You


Gregory J. Cizek


Purpose: Broad Overview of Three Areas:

1) Strategies for improving reliability and validity of ELL assessments

2) Impact of construct-irrelevant features on student performance

3) Effects of attention to ELL subgroups on student achievement

“The reliability coefficients of the test scores for ELL students are substantially lower than those for non-ELL students” (p. 6)

$r_{xx'} = \dfrac{S_T^2}{S_T^2 + S_E^2}$

“ELL students’ test outcomes show lower criterion-related validity.” (p. 6)

$r_{xy} \le \sqrt{r_{xx'}\, r_{yy'}}$
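The two formulas attached to these quotations link the reliability finding to the validity finding: reliability is the share of observed-score variance that is true-score variance, and the observed test-criterion correlation cannot exceed the square root of the product of the two reliabilities. The sketch below works through that relationship. All variance and reliability values are hypothetical numbers chosen for illustration, not figures from the papers under discussion, and the function names are likewise illustrative.

```python
from math import sqrt


def reliability(true_var, error_var):
    """Classical reliability: true-score variance over observed-score variance."""
    return true_var / (true_var + error_var)


def max_validity(test_reliability, criterion_reliability):
    """Upper bound on the observed test-criterion correlation."""
    return sqrt(test_reliability * criterion_reliability)


# Hypothetical variance components (not from the source): same true-score
# variance in both groups, more error variance for the ELL group.
non_ell_rxx = reliability(true_var=80.0, error_var=20.0)
ell_rxx = reliability(true_var=80.0, error_var=80.0)

# Hypothetical criterion reliability, held equal for both groups.
criterion_ryy = 0.80

non_ell_ceiling = max_validity(non_ell_rxx, criterion_ryy)
ell_ceiling = max_validity(ell_rxx, criterion_ryy)

if __name__ == "__main__":
    print(round(non_ell_rxx, 2), round(non_ell_ceiling, 2))
    print(round(ell_rxx, 2), round(ell_ceiling, 2))
```

Under these assumed numbers, the extra error variance lowers reliability from 0.80 to 0.50, and the ceiling on criterion-related validity falls with it, which is one way lower reliability and lower validity coefficients can go together.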

“Structural relationships between test components and across measurement domains are lower for ELL students.” (p. 6)

* change/multidimensionality of construct

* difficulties in applying standard test development practices

Four grade clusters: K-2, 3-5, 6-8, 9-12

* Test-taker marks answers in a test booklet

* Paper is secured to work area with tape/magnet

1) Review NCLB accountability for language-related subgroups (LRSs)

2) Report on meta-analysis re:

– evidence that specific test accommodations are effective in improving the performance of ELLs;

– evidence that specific test accommodations for ELLs are valid in large-scale assessments.

“1) States differ in the instruments used to assess language proficiency

2) States differ in the criteria used to judge proficiency

3) Too often, subjective judgments are part of the decision process.” (p. 3)

Unlike any other concern of NCLB, ELL subgroup membership is affected by instruction.

Aggregate reporting masks performance changes when demographics shift.
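The masking effect can be made concrete with a small worked example. In the sketch below every count is hypothetical and invented purely for illustration: the proficiency rate rises at each English proficiency level from one year to the next, yet the aggregate rate for the ELL subgroup falls because the second-year cohort contains far more beginners. The level names and function names are also illustrative.

```python
# Hypothetical cohorts: level -> (number tested, number scoring proficient).
# These counts are made up for illustration and are not from the source.
year1 = {"Beginner": (100, 10), "Intermediate": (100, 40), "Advanced": (100, 70)}
year2 = {"Beginner": (200, 30), "Intermediate": (80, 36), "Advanced": (20, 15)}


def rate_by_level(cohort):
    """Proficiency rate within each English proficiency level."""
    return {level: proficient / tested for level, (tested, proficient) in cohort.items()}


def aggregate_rate(cohort):
    """Proficiency rate for the whole subgroup, ignoring level."""
    tested = sum(n for n, _ in cohort.values())
    proficient = sum(p for _, p in cohort.values())
    return proficient / tested


rates1, rates2 = rate_by_level(year1), rate_by_level(year2)
every_level_improved = all(rates2[level] > rates1[level] for level in rates1)
aggregate_fell = aggregate_rate(year2) < aggregate_rate(year1)

if __name__ == "__main__":
    print(rates1, aggregate_rate(year1))
    print(rates2, aggregate_rate(year2))
    print(every_level_improved, aggregate_fell)
```

In this invented case each level gains five percentage points while the aggregate drops from 40% to 27%, so a report that shows only the aggregate would hide real improvement.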

... given these points (variation across states; ELL status unlike other subgroup variables; demographic shifts masking performance changes), perhaps the most serious validity concern, though one only waved at, is not the validity of any accommodations but the validity of aggregating these data with other data obtained for AYP reporting purposes.

• An Area for Presenters to Clarify?

• “We know that there are about 12-15 test accommodations which appear to be useful for ELLs on large-scale academic tests.” (p. 1)

Continuous (or multidimensional) trait(s), but dichotomous accommodation policies.

Possible Solution: “A standardized computer-based system of matching students to accommodations has been built and is ready to be used.” (p. 1)

Robert Linquanti

WestEd


Discussant’s Comments

[Figure: Annual CELDT Scores, 2001-2007. Overall performance of ELs on CELDT (%), 2005-07 (Linquanti, 2008). Stacked bars showing percent of ELs (y-axis, 0 to 100) by year, 2001 through 2007, at each CELDT level: Advanced, Early Advanced, Intermediate, Early Intermediate, Beginning. Annotations: Baseline AMAOs (2003-04), AMAOs (2004-05), AMAOs (2005-06), AMAOs (2006-07), AMAOs (2007-08). Data labels, one series per row across 2001-2007: 11 10 7 7 6 10 9; 23 19 14 13 13 18 17; 40 37 36 33 33 39 39; 21 25 32 33 33 25 28; 4 9 11 15 14 7 8.]

[Figure: CA Title III LEAs Meeting Individual AMAOs. Bars showing percent of LEAs (y-axis, 0 to 100) that met AMAO 1, met AMAO 2, and met AMAO 3, by year, 2003 through 2007. Source: CDE 2008.]

[Figure: CA Title III LEAs Meeting Two or More AMAOs. Bars showing percent of LEAs (y-axis, 0 to 100) that met AMAOs 1 & 2 (CELDT) and that met all AMAOs 1, 2, 3 (CELDT & academic achiev. test), by year, 2003 through 2007. Source: CDE 2008.]

[Figure: CA ELs meeting AMAO 1 Growth by Prior CELDT Level (2006-07 and 2007-08). Bars showing % ELs meeting (y-axis, 0 to 100) for 2006-07 and 2007-08, by prior CELDT level: Beg. (19%->18%), Early Int. (23%->20%), Interm. (38%->40%), EA/A not EP (2%->3%), Eng. Prof. (17%->18%). Percentages in parentheses give each level's share of the AMAO 1 cohort (06/07 -> 07/08).]

[Figure: CA EL Performance on Two Statewide Assessments: 2006-07. Percent of ELs 12 months or more (y-axis, 0% to 100%) by grade, 2nd through 11th, for two series: Eng. Prof. Level (CELDT Fall 2006) and Mid-Basic or Above (CST-ELA Spring 2007). Reference lines: AMAO 2 Goal; Common Reclass. criterion. * % test takers for each test by grade.]

[Figure: CA EL Performance on CELDT and CST-ELA: 2007-08. Percent of ELs 12 months or more (y-axis, 0% to 100%) by grade, 2nd through 11th, plus TOTAL, for two series: Eng. Prof. Level (CELDT Fall 2007) and Proficient or Advanced (CST-ELA Spring 2008). Reference lines: AMAO 2 Goal; AMAO 3 Goal. * % test takers for each test by grade. Total Ns: CST = 1,040,063; CELDT = 1,094,254.]

[Figure: CA 2008 CST-ELA Results: ELs 12 or + months. Stacked bars showing % of students in CST levels (y-axis, 0% to 100%) by grade; levels: Adv., Prof., Basic, BB, FBB. Total N = 1,040,063. Basic or Below: N = 867,673 (83% of all ELs 12 or + mos).]

[Figure: CA 2008 CST-ELA Results: Former ELs (Reclassified-FEP). Stacked bars showing % of students in CST levels (y-axis, 0% to 100%) by grade; levels: Adv., Prof., Basic, BB, FBB. Total N = 596,919. Basic or Below: N = 266,841 (45% of all RFEPs).]

[Figure: Misassignments by subject area, 2003-07 (n = 22,352). Bar chart (y-axis, 0 to 8000); bar values: 7492, 4315, 2960, 1214, 1182, 977, 948, 928, 718, 699, 649, 270. Annotation: 54% (12,077). Source: CTCC, 2008.]

Conclusions from CA’s AMAO Data to Date

• AMAOs 1 & 2 show steady progress

• AMAO 3 is in effect “running the show”

• Title I AYP (single-year, status-bar criterion) is very different from Title III AMAOs 1 & 2 (longitudinal growth-model, sensitive to current level & length-of-time)

Conclusions

• AMAO 1 & 2: ELP criteria and targets accepted as reasonable, meaningful, useful

– Signal many ELs need more focused instructional support to move beyond Intermediate

• AMAO 3: AYP status bar insufficiently sensitive to growth; target to 100% undermining credibility

• While many ELs at Basic, even more below

• Many RFEPs need help (45% Basic or below)

Implications for Accountability

• Specify expected progress in ELD by time

– Lower is faster, higher is slower (Cook, 2008)

– Proficient cut, time to proficiency target are key

• Explore AYP growth models to measure EL academic progress toward proficiency

– Benchmarking, reporting by ELD level

– Do we specify expectation: X years to academic proficiency?

Implications for Accountability

• Focus teacher PD on instruction ELs need to move beyond Intermediate (ELP); into & beyond Basic (Acad. achievement)

• EL reclassification not whole story, not end of story

– Ongoing linguistic, academic needs

– RFEP monitoring with resources to hold the gains over time

In a better world, NCLB reauthorization will…

• Promote true progress models, including ELP level, time in program, prior performance, benchmarking (CRESST)

• Operationalize, measure, & evaluate opportunity to learn

• Foster capacity for internal accountability (local, reciprocal) (Newmann, Elmore)

• Incorporate more authentic, multiple measures of greater utility to teachers (Shepard, Baker, Resnick)
