NCLB at Year 8 in the Assessment of English Language Learners: Taking Stock of the Assessment and Accountability Systems
National Association of Test Directors 2009 Symposium
Organized by:
Phil Morse, Los Angeles Unified School District
Edited by:
Joseph O'Reilly, Ph.D., Mesa (AZ) Public Schools
This is the twenty-fourth volume of the published symposia, papers and surveys of the National Association of Test Directors (NATD).
This publication serves an essential mission of NATD: to promote discussion and debate on testing matters from both a theoretical and practical perspective. In the spirit of that mission, the views expressed in this volume are those of the authors and not NATD. The papers and discussant comments presented in this volume were presented at the April 2009 meeting of the National Council on Measurement in Education (NCME) in San Diego, California.
NCLB at Year 8 in the Assessment of English Language Learners: Taking Stock of the Assessment
Jamal Abedi, University of California, Davis
Policy and Reality: Making Academic Assessments Work for English Learners
Rebecca Kopriva, University of Wisconsin-Madison
Second Generation Accountability for English Language Learners
David Francis, University of Houston
Discussant Comments
Gregory Cizek, University of North Carolina - Chapel Hill
Robert Linquanti, WestEd
Jamal Abedi, University of California, Davis
NCLB at Year 8 in the Assessment of English Language Learners: Taking Stock of the Assessment Systems
How successful has NCLB been in resolving issues concerning
assessment and accountability for ELL students?
1. Problems in classification/reclassification of ELL students
2. Inclusion of ELL students in the state and national assessments
3. Quality of assessments (both Title I and Title III) for ELL students
4. Issues concerning accommodations for ELL students
5. Instability of ELL subgroup
Assessment of English Language Learners?
1. What testing strategies may be helpful to improve reliability and validity as they relate to ELL students?
2. What impact has construct-irrelevant variance had on students’ performance on high-stakes tests?
3. What effect has the focus on the English learners subgroups had on their performance in school and on performance and language tests?
4. Do testing accommodations assist English learners to display accurately their educational ability? Which are effective?
Quality of Assessments for English Language Learners
1. What testing strategies may be helpful to improve reliability and validity as they relate to ELL students?
2. What impact has construct-irrelevant variance had on students’ performance on high-stakes tests?
Measurement Quality for ELL Students
• Language factors impact student performance, particularly in content-based assessments such as math and science
• The performance gap between ELL and non-ELL students is highest on assessments high in language demand
• Standardized achievement tests are usually constructed and field tested for native speakers of English
• For ELL students, language is an additional source of measurement error and reduces reliability of assessments
• For ELL students, language factors (unnecessary linguistic complexity of test items) are a source of construct-irrelevant variance and affect validity of assessments.
Are the Standardized Achievement Tests Reliable and Valid for these Students?
• The reliability coefficients of the test scores for ELL students are substantially lower than those for non-ELL students
• ELL students’ test outcomes show lower criterion-related validity
• Structural relationships between test components and across measurement domains are lower for ELL students
Performance/Reliability-Gap Between ELL and non-ELL Students
Subject                   Performance-Gap   Reliability-Gap
Reading                   20% - 60%         15% - 40%
Science/Social Sciences   10% - 40%         12% - 35%
Math Problem Solving      8% - 25%          10% - 30%
Math Computation          0% - 10%          10% - 15%
Examining Complex Linguistic Features in Content-Based Test Items
Feature number, Feature Description, Categories Combined
1 Item length 1, 2, 4, 45
2 Vocabulary 3, 26, 27
3 Nominal heaviness 5, 6, 29, 30, 31, 32
4 Verb voice 7, 33
5 Modal 8, 34
6 Relative clause 9, 10, 11, 35, 36, 37
7 Adverbial modification 12, 13, 14, 15, 16, 17, 38, 39, 40, 41
8 Conditional clause 18, 19
9 Complement clause 20, 44
10 Sentence structure 28, 42, 43, 46
11 Preferred argument structure 22, 23, 47, 48
12 Question form 21
13 Global difficulty 24
14 Content interest 25
Additional Complex Linguistic Features
More recent research has identified these additional features:
• Complex verbs
• Subordinate clauses (including relative clauses)
• Complex noun phrases
• Various entities as subjects
What NCLB Has Accomplished by Focusing on English Language Learners
What effect has the focus on the English learners subgroups had on their performance in school and on performance and language tests?
As a Case Example of Focus On ELLs:
Impact of NCLB on English Language Proficiency Assessment
• ELP assessment status prior to NCLB
• Current status
• The impact/contribution of NCLB on ELP assessment
Status of English Language Proficiency (ELP) Assessment
ELP assessment status prior to NCLB
Existing English language proficiency tests prior to NCLB
There were many different English language proficiency tests prior to the NCLB implementation. However, there are issues with many of these tests. For example:
• Lack of a clear operational definition of ELP for many of these tests
• Differences in types of tasks the tests cover and the specific item content of the tests
• They are based on different theoretical emphases
• Problems with the reliability and validity of these tests, the adequacy of the scoring directions, and issues concerning field testing (samples, etc.).
Reviews of Language Proficiency Tests
Zehler et al. (1994) and Del Vecchio & Guerrero (1995) compared the content and psychometrics of the six most commonly used English language proficiency tests. They found:
• The content of the tests differs considerably in types of tasks and specific item content
• The tests differ in the grade levels for which they were designed, time limits, and test item format
• The tests represent distinct approaches to the definition of language proficiency, reflecting different theoretical emphases
• There were major issues with reliability and validity of the tests, the adequacy of the scoring directions, and the limited populations used in field testing
Status of English Language Proficiency (ELP) Assessment
Current status
The impact/contribution of NCLB on ELP assessment
Title III of NCLB requires that:
• States measure proficiency and show progress;
• assess all ELL students;
• provide independent measures for the four skill domains of reading, writing, speaking, and listening;
• report a separate measure for comprehension;
• assess proficiency in academic language and in the language of social interaction; and
• align the assessments with state English Language Development (ELD) standards
Major Events
• The U.S. Department of Education provided Enhanced Assessment Grants under Title VI (Section 6112) of No Child Left Behind (P.L. 107-110; NCLB) for development, validation, and implementation of English proficiency assessments
• Four different consortia of states were funded to develop measures of English language proficiency
• The U.S. Department of Education provided additional Enhancement Grants for validating/improving these assessments
The Four Consortia
• The Mountain West Assessment Consortium
• The WIDA (ACCESS for ELLs) Consortium
• The ELDA Consortium
• The CELLA Consortium
The Process
• To test four modalities: (1) Reading, (2) Writing, (3) Speaking, and (4) Listening, along with (5) Comprehension
• Four grade clusters (K-2; 3-5; 6-8; and 9-12)
• Based on state’s ELP content standards
• To pilot and/or field test items on a large sample and conduct validity studies
• To conduct standard setting
• Additional studies, e.g., examining the validity of the accommodations used
• Provide test administration and technical manual, and scoring and reporting instructions.
Issues with the newly developed ELP assessments
1. ELP Standards
• Which ELP standards, from which of the participating states?
• Is there a set of common ELP standards?
• Do all the participating states have a set of defined ELP standards?
• How are ELP standards defined by states?
Issues with the newly developed ELP assessments
2. Setting achievement levels
• Should achievement levels be set separately for each of the four modalities?
• How can discrepancies be addressed?
• Or, should achievement levels be set at the whole test level?
• How should the total score be obtained:
– Simple composite?
– Latent composite?
Issues with the newly developed ELP assessments
3. Dimensionality issues
• Should the four modalities be considered as four separate subscales/dimensions?
• Sawaki, Stricker, & Oranje (2007) suggest a higher-order factor model with a general factor and four group factors (reading, listening, speaking & writing)
• Lall, Gaj, Broer, Carlson & Gu (2007) found a dominant first factor across the three modalities (listening, reading, & writing) to support a common vertical scale across grades 4-11
Issues with the newly developed ELP assessments
4. Comparability and feasibility issues
• States used different ELP tests to establish the baseline for AMAOs
• Thus, the comparability factor becomes a serious issue
• There are also issues concerning feasibility and test burden
• Some of these assessments require over 6 hours of testing time.
Issues with the newly developed ELP assessments
5. Lack of data to judge the quality of ELP measures
• Validation of achievement levels (internal and external criteria)
• Relationship between Title I and Title III assessments
• Relationship between ELL classification and ELP test scores
Where Are We in Terms of Accommodations for ELL Students at Year 8 of NCLB?
Do testing accommodations assist English learners to display accurately their educational ability? Which are effective?
Principle of Equity and Fairness
•Provide assistance in the form of accommodations
Therefore, the Principle of Equity and Fairness demands assistance to these students
Samples of Accommodations Used for ELL Students That May Not be Relevant
• Test-taker marks answers in a test booklet
• Copying assistance provided between drafts
• Test-taker indicates answers by pointing or other similar method
• Paper is secured to work area with tape/magnet
• Physical assistance is provided
Presenting Language-Related Accommodations for ELLs
• English Extended Time
• Dictionary
• English Glossary
• Bilingual Dictionary/Glossary
• Customized Dictionary
• Native Language Testing
• Linguistically Modified Test
Studies on Linguistic Accommodations
• Results of national data are not conclusive
• Most of the CRESST studies found significant gain for ELL students on linguistically modified version
• However, the outcome of national research on the impact of linguistic modification is mixed (Francis, et al.)
• Sources that lead to such discrepancies include variation in methodology in implementing linguistic modification approach, sampling and power issues, variation in test items and the nature of linguistic complexity, etc.
Summary & Conclusion
What has NCLB accomplished after 7 years?
• More attention to the inclusion, assessment, and accountability of subgroups of students including ELL students
• More focus on standards-based assessments
• More organized efforts in creating and administering English language proficiency assessments
• The quality of content-based assessment for subgroups of students in general and for ELL students in particular is still questionable
• The validity of classification of ELL students is still questionable
Policy and Reality: Making Academic Assessments Work for English Learners
Rebecca Kopriva, University of Wisconsin, Madison
Future Policy for ELs: Probable Scenarios
• For academic tests, holding schools accountable for ELs and most of the other subgroups is most likely NOT going to go away.
• For the EL subgroup, including ‘former ELs’ in the academic accounting mix will probably be formalized in the next federal legislation.
• Future federal peer reviews of academic tests will most likely be ‘tougher’ on demanding evidence of valid and comparable inferences for students receiving accommodations and students taking alternative assessments within the system, as compared to those taking the general test under standard conditions.
Future Policy for ELs: Probable Scenarios
• A peer review-type system will probably be set up for ELP tests.
• Federal interest in how ELs are identified and exited has been raised.
• NAEP is continuing to clarify how to handle inclusion/exclusion rates for ELs and SDs in states and districts. They are also continuing to clarify accommodation options and how to make sure students with similar profiles in different schools/states get accommodations that make sense.
What Does this Mean for ELs and Academic Content Tests?
EL students need to be included in the testing systems appropriately (otherwise, what’s the point….)
• We know there are about 15 primary and secondary test accommodations that are useful for ELs on large-scale academic tests.
• Research on accommodations and packages is continuing. The investigations need to focus on studying how the accommodations are working for students who need them (as compared to how they work for broad groups of EL students).
• Overall, proper post hoc accommodations appear to be useful for more advanced ELs but even high intermediate students are still at risk.
• There seem to be 3 issues that need attention….
What Are the Most Pressing Issues? What to do About Them?
1) Students with English language proficiency at pre-functional, beginner, and low intermediate levels are generally not served well.
This group seems to need more support than many packages of accommodations provide.
a) If taught in English, they need comprehensive support in both languages, including oral in L1 if not literate in L1.
b) ‘Plain language’ forms are NOT useful if not supported by L1 support that goes beyond bilingual glossaries.
c) Response accommodations are essential for this group.
What Are the Issues? What to do About Them?
2) ELs at the higher intermediate and some advanced English proficiency levels are still at risk as well because items with cognitive complexity generally require more language.
Items with more cognitive complexity (e.g. DOK = 2 and up) often require:
• Language precision that cannot be adequately duplicated in visuals or significantly reduced
• Elucidation of abstract concepts that require more complex language structures
• Contexts that cannot be meaningfully represented in static visuals only.
Bilingual and English glossaries are best for nouns or action verbs. More complex items use abstract nouns, verb tenses, conceptual language, phrase and clause structures usually not found in these glossaries.
What Are the Issues? What to do About Them?
To address 1) and 2),
a) Written translations may be useful for students with grade-level literacy in L1 (not common among some groups).
Translations may be of limited use if these students have been taught in English for a substantial period of time. In this case, experts have recommended dual language forms for students so they can check meaning in the other language.
There are about 20 languages used in academic instruction. Issues of quality of translation and evenness of language meaning across forms need to be overseen.
What Are the Issues? What to do About Them?
b) For students not literate in L1, oral L1 with English booklets may be useful for lower English proficient ELs. This is being done in some states and preliminary findings suggest that the technical adequacy of the ‘split’ accommodation is robust and comparable to oral English and standard administration conditions. Translation issues and number of translations still apply.
c) If portfolios are used for lower English proficient students, the data collection, evaluation, and oversight need to be rigorous. See Petit & Rigney (1995).
d) For higher level ELP students who have been in the U.S. for a substantial period of time, oral English with English booklets (preferably ‘plain language’ booklets) seems to be useful.
What Are the Issues? What to do About Them?
e) Two projects are investigating using interactive computer capabilities to substitute for language in more complex items. Results have been effective, especially for lower proficient ELs.
Computer capabilities include animated and interactive context and question-building, and response options using clicking, dragging, and modeling.
What language is left is in English on the screen and translated to the student when they click the speaker icon. Far fewer words minimize translation comparability issues.
What Are the Issues? What to do About Them?
3) Current procedures are not effective in making sure the proper accommodations are getting to the students who need them.
a) States and districts need to oversee implementation to ensure accommodation decisions are being carried out.
b) The bigger problem, however, is that procedures in use today do not yield effective accommodation decisions or consistent decisions across students with similar profiles.
Several research studies have shown that decisions by teachers or committees often yield results no better than random assignment, and are notoriously inconsistent across LEAs and SEAs.
Further, proper assignments lead to significantly higher scores than improper assignments, with improper assignments no different than random assignments.
What Are the Issues? What to do About Them?
A standardized, research-based, online system of matching individual EL students to their accommodations has been recently built. The system (STELLA) can be used to help guide teachers or committees.
STELLA, which collects data from records, teachers, and parents/guardians or students, has been validated and is available for licensing.
It is designed to provide accommodation recommendations for large-scale tests consistent with local state or district policy, while also providing best-practice recommendations useful for classroom use.
STELLA recommends individualized pretest support for teachers too. It goes beyond test prep and focuses on purposes of testing, types of questions used on US tests that are different from the student’s country (e.g. word problems in math), and identification of cognitive skills on US tests but not evaluated in the student’s previous schooling. URL for STELLA: www.wida.us/UW/STELLA
For more information…
Kopriva, R. J. (2008). Improving Testing for English Language Learners. New York, NY: Routledge.
Second Generation Accountability for English Language Learners
David J. Francis
Hugh Roy and Lillie Cranz Cullen Professor of Psychology
University of Houston
Texas Institute for Measurement, Evaluation, and Statistics
Overview
• How can testing and reporting systems be improved in order to promote more meaningful accountability for English Language Learners?
• Do testing accommodations assist English Learners to measure accurately their educational ability? Which are effective?
• How could the NCLB English Language Learner accountability provisions be improved in its reauthorization?
LEP as a Subgroup Under NCLB
• More so than other subgroups (e.g., FRL, gender, ethnicity), there is no universal definition of LEP, nor is there a universally accepted approach to identifying children as LEP.
• Defining the population of interest and monitoring their academic progress is less precise than it needs to be
– States differ in the instruments used to assess language proficiency
– States differ in the criteria used to judge proficiency
– Too often, subjective judgments are part of the decision process
LEP as a Subgroup Under NCLB
• Several factors make LEP unique as a subgroup under NCLB
– Unlike other NCLB subgroups (e.g., gender, ethnicity, learning disabilities), membership in the LEP category is dynamic and developmental
– Moreover, the defining characteristic (i.e., language proficiency) is causally linked to schooling AND to the outcomes of interest (i.e., content area achievement)
Accountability for Subgroups
• NCLB presumes static group membership.
– Instruction is expected to affect subgroup achievement, not subgroup membership
• LEP subgroup membership is affected by school performance
– Students lose their membership as they acquire English
– Acquisition of English is both
• a consequence of effective schooling, and
• a mediator of the effects of schooling on content mastery
Accountability for Subgroups
• Membership in LEP subgroup requires LOW performance on a cognitive dimension (viz. English Language) that is causally linked to the outcomes of interest (viz. English Achievement) and is a consequence of effective instruction
• Ignoring the developmental nature of language and its role in mediating the effects of instruction on achievement can bias our estimates of how schools are performing for the LEP subgroup
[Figure: ELL (included in comparisons) and RFEP (excluded in comparisons)]
Comparison of ELLs and former ELLs on State Reading Test in Texas 2002
Level of Language Proficiency for ELL Groups
Grade     Beginning   Intermediate   Advanced (2002)   Advanced (2000)
3         13.9        38.3           90.6              90.0
4         13.1        37.4           84.1              93.6
5         16.5        24.1           69.5              96.1
6         14.5        12.8           46.0              86.8
7         15.0        12.4           43.9              85.0
8         23.2        19.2           55.3              90.2
10        21.3        28.5           66.4              85.8
Overall   15.8        30.4           76.4              89.6
http://www.tea.state.tx.us/student.assessment/reporting/results/rpteanalysis/2002/reading/statewide.html
Current Law
• Allows retention of LEP label for accountability purposes for up to two years after achieving FEP status
– This practice boosts the performance of the LEP group
– Gives a less biased view of the performance of LEP students than when FEPs are not counted
– But is it really the view that schools and the public need? Is it sufficiently accurate to be useful?
Including FEPs in the LEP Group: Are we really getting at the right information?
Achievement %Proficient
Grade   Without FEP-2¹   With FEP-2¹
3       25.4             68.6
4       31.9             65.6
5       33.5             61.4
6       21.2             47.3
7       22.1             45.1
8       30.7             55.1
10      39.2             62.4
http://www.tea.state.tx.us/student.assessment/reporting/results/rpteanalysis/2002/reading/statewide.html
¹ Hypothetical result based on 2003 percentages in each language proficiency category
Which column best captures the long term results for LEP students? Which column tells us how the school/district/state is doing?
Retaining FEPs in the LEP group…
• Improves on typical practice, but …
– is insufficient to get an accurate and complete picture about the performance of LEP students
• It does not accurately reflect the long term outcomes for students who began school as LEP, and
• Confounds language proficiency and achievement
Including FEPs in the LEP Group: Are we really getting at the right information?
Level of Language Proficiency for LEP Groups (Beginning through Advanced) and Achievement %Proficient (without and with FEP-2)
Grade   Beginning   Intermediate   Advanced (2002)   Advanced (2000)   Without FEP-2¹   With FEP-2¹
3       13.9        38.3           90.6              90.0              25.4             68.6
4       13.1        37.4           84.1              93.6              31.9             65.6
5       16.5        24.1           69.5              96.1              33.5             61.4
6       14.5        12.8           46.0              86.8              21.2             47.3
7       15.0        12.4           43.9              85.0              22.1             45.1
8       23.2        19.2           55.3              90.2              30.7             55.1
10      21.3        28.5           66.4              85.8              39.2             62.4
http://www.tea.state.tx.us/student.assessment/reporting/results/rpteanalysis/2002/reading/statewide.html
¹ Hypothetical result based on 2003 percentages in each language proficiency category
Which column best captures the long term results for LEP students? Which one really tells us how the school/district/state is doing?
Retaining FEPs in the LEP group…
• Improves on typical practice, but …
– is insufficient to get an accurate and complete picture about the performance of LEP students
• It does not allow reporting of long term outcomes for students who began school as LEP, and
• Confounds language proficiency and achievement
What would you conclude if the overall results looked like this from 2002 to 2003?
Achievement %Proficient
Grade   2002¹   2003
3       68.6    68.6
4       65.6    64.9
5       61.4    60.7
6       47.3    44.6
7       45.1    42.3
8       55.1    52.2
10      62.4    62.7
http://www.tea.state.tx.us/student.assessment/reporting/results/rpteanalysis/2002/reading/statewide.html
¹ Hypothetical result based on 2003 percentages in each language proficiency category
What would you conclude if the overall results looked like this from 2002 to 2003?
Level of Language Proficiency for LEP Groups (Beginning through Advanced) and Achievement %Proficient (2002 and 2003)
Grade   Beginning   Intermediate   Advanced (2002)   Advanced (2000)   %Proficient 2002¹   %Proficient 2003
3       13.9        38.3           90.6              90.0              68.6                68.6
4       13.1        37.4           84.1              93.6              65.6                64.9
5       16.5        24.1           69.5              96.1              61.4                60.7
6       14.5        12.8           46.0              86.8              47.3                44.6
7       15.0        12.4           43.9              85.0              45.1                42.3
8       23.2        19.2           55.3              90.2              55.1                52.2
10      21.3        28.5           66.4              85.8              62.4                62.7
http://www.tea.state.tx.us/student.assessment/reporting/results/rpteanalysis/2002/reading/statewide.html
¹ Hypothetical result based on 2003 percentages in each language proficiency category
What would you conclude if the overall results looked like this from 2002 to 2003?
Level of Language Proficiency for LEP Groups (Beginning through Advanced) and Achievement %Proficient (2002 and 2003)
Grade   Beginning   Intermediate   Advanced (2002)   Advanced (2000)   %Proficient 2002¹   %Proficient 2003
3       15.3        42.1           92.4              91.8              68.6                68.6
4       14.4        41.1           85.8              95.5              65.6                64.9
5       18.2        26.5           70.9              98.0              61.4                60.7
6       15.9        14.1           46.9              88.5              47.3                44.6
7       16.5        13.6           44.8              86.7              45.1                42.3
8       25.5        21.1           56.4              92.0              55.1                52.2
10      23.4        31.4           67.7              87.5              62.4                62.7
http://www.tea.state.tx.us/student.assessment/reporting/results/rpteanalysis/2002/reading/statewide.html
¹ Hypothetical result based on 2003 percentages in each language proficiency category
Aggregate Reporting Masks Performance Changes When Demographics Shift
Level of Language Proficiency for LEP Groups (Beginning through Advanced 2 Years Prior) and Achievement %Proficient
Grade   Beginning   Intermediate   Advanced   Advanced 2 Years Prior   Achievement %Proficient¹
3       15.3        42.1           92.4       91.8                     68.6
4       14.4        41.1           85.8       95.5                     64.9
5       18.2        26.5           70.9       98.0                     60.7
6       15.9        14.1           46.9       88.5                     44.6
7       16.5        13.6           44.8       86.7                     42.3
8       25.5        21.1           56.4       92.0                     52.2
10      23.4        31.4           67.7       87.5                     62.7
¹ Hypothetical result based on increasing achievement in all groups while increasing the percentage of students in the lowest three categories of language proficiency
Achievement is up for children in each ELP category, but overall achievement is the same or lower.
What’s going on?
Aggregate Reporting Masks Performance Changes When Demographics Shift
Level of Language Proficiency for LEP Groups, Grade 3 (Achievement %Proficient¹)
                   Beginning   Intermediate   Advanced   Advanced 2 Years Prior
2002 Achievement   13.9        38.3           90.6       90.0
2003 Achievement   15.3        42.1           92.4       91.8
¹ Hypothetical result based on increasing achievement in all groups while increasing the percentage of students in the lowest three categories of language proficiency
Overall Achievement depends on achievement in each ELP Category
Aggregate Reporting Masks Performance Changes When Demographics Shift
Level of Language Proficiency for LEP Groups, Grade 3
                   Beginning   Intermediate   Advanced   Advanced 2 Years Prior   Achievement %Proficient¹
2002 Achievement   13.9        38.3           90.6       90.0                     68.6
% of students      16.0%       18.0%          18.0%      48.0%
2003 Achievement   15.3        42.1           92.4       91.8                     68.6
% of students      17.6%       19.8%          19.8%      42.8%
¹ Hypothetical result based on increasing achievement in all groups while increasing the percentage of students in the lowest three categories of language proficiency
Overall Achievement depends on achievement in each ELP Category
AND
on the percentage of students in different ELP categories.
The percentage of children in each language proficiency category is partly a function of instruction, but it is also a function of demographics.
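The Grade 3 figures can be checked with a few lines of arithmetic. The sketch below is illustrative only. It assumes the overall percent proficient is a share-weighted average of the four category rates, and it reuses the hypothetical rates and shares from the slide. The function name `overall_percent_proficient` and the variable names are made up for this example.

```python
def overall_percent_proficient(rates, shares):
    """Share-weighted average of category-level percent proficient.

    rates  -- percent proficient in each ELP category
    shares -- fraction of students in each ELP category
    """
    total_share = sum(shares)
    return sum(r * s for r, s in zip(rates, shares)) / total_share


# Hypothetical Grade 3 values from the slide, in the order:
# Beginning, Intermediate, Advanced, Advanced 2 Years Prior
rates_2002 = [13.9, 38.3, 90.6, 90.0]
shares_2002 = [0.160, 0.180, 0.180, 0.480]
rates_2003 = [15.3, 42.1, 92.4, 91.8]
shares_2003 = [0.176, 0.198, 0.198, 0.428]

overall_2002 = overall_percent_proficient(rates_2002, shares_2002)
overall_2003 = overall_percent_proficient(rates_2003, shares_2003)

# Counterfactual (an assumption for illustration): 2003 rates applied
# to the 2002 mix of students, to separate the two influences.
overall_2003_fixed_mix = overall_percent_proficient(rates_2003, shares_2002)

print(f"2002 overall: {overall_2002:.1f}")
print(f"2003 overall: {overall_2003:.1f}")
print(f"2003 rates with 2002 mix: {overall_2003_fixed_mix:.1f}")
```

Every category rate rises from 2002 to 2003, yet both overall figures round to 68.6. Holding the 2002 mix fixed while applying the 2003 rates gives about 70.7, which suggests the flat aggregate comes from the shift in category shares and not from stalled achievement.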
Aggregate Reporting Masks Performance Changes When Demographics Shift
Level of Language Proficiency for LEP Groups (Beginning through Advanced 2 Years Prior) and Overall %Proficient
Grade   Beginning   Intermediate   Advanced   Advanced 2 Years Prior   Overall %Proficient¹
3       15.3        42.1           92.4       91.8                     68.6
4       14.4        41.1           85.8       95.5                     64.9
5       18.2        26.5           70.9       98.0                     60.7
6       15.9        14.1           46.9       88.5                     44.6
7       16.5        13.6           44.8       86.7                     42.3
8       25.5        21.1           56.4       92.0                     52.2
10      23.4        31.4           67.7       87.5                     62.7
¹ Hypothetical result based on increasing achievement in all groups while increasing the percentage of students in the lowest three categories of language proficiency
Overall Achievement depends on achievement in each ELP Category
AND
on the percentage of students in different ELP categories.
The percentage of children in each language proficiency category is partly a function of instruction, but it is also a function of demographics.
Allowing FEP students to count in the ELL category for up to two years…
• Boosts the overall percent proficient within the ELL category,
• But it does NOT
– allow us to easily determine the academic achievement of ELLs who become proficient in English
– allow us to easily determine the long term achievement outcomes for children who entered school as ELLs
– provide schools with actionable information about their ELL students’ performance, or
– resolve the problem of aggregation bias when demographics are shifting
Problems with Current Accountability Practice for ELLs
• Overly simplistic view of a complex developmental and educational process
• Fails to take into account the developmental nature of language acquisition
– Language takes time to acquire
– This time frame contains maturational, educational, and environmental influences
• It can be accelerated through good instruction and experiences in a language rich environment,
• It CANNOT be reduced to zero
• Fails to take into account the causal role that language plays in acquisition of content area knowledge
How much can the performance of LEP students be improved through appropriate test accommodations?
Meta-Analysis: Literature Search
• Final sample was 11 studies
– Each study used random assignment of ELLs and non-ELLs to testing conditions with and without accommodations (for one study we could not confirm random assignment)
– Involved 38 different samples of students
– Reported 38 different tests of the effectiveness of accommodations for ELLs
Study Descriptions
• Grades included
– 4th: n=11
– 8th: n=23
– 5th or 6th: n=2 each
• Subject Areas
– Math: n=17
– Science: n=20
– Reading: n=1
• Type of test
– NAEP items: n=23
– NAEP and TIMSS: n=6
– State Accountability Assessment: n=9 (two different states)
Types of Accommodations
[Bar chart: Number of Study Samples (axis 0 to 20) for each accommodation type: Simplified English, English Dictionary, Bilingual Dictionary, Spanish Version, Dual Language Questions, Dual Language Booklet, Extra Time]
How large are the achievement gaps between ELLs tested without accommodations and non-ELLs?
[Bar chart: Effect Size (axis 0 to 1.4) for Math and Science; series: Meta-analysis, NAEP - 4th Grade, NAEP - 8th Grade]
Results for Fixed Effects Analysis
Column groups: Effect Size and 95% Confidence Interval (Mean Effect Size, s.e., Lower Limit, Upper Limit); Test of Mean Effect = 0 (Z, p); Test of Heterogeneity in Effect Sizes (Q, df(Q), p(Q))

                                                 Number of  Mean Effect         Lower   Upper
Accommodation                                    Studies    Size         s.e.   Limit   Limit   Z       p     Q       df(Q)  p(Q)
English Dictionary-Glossary                      11         .146         0.043  .063    .230    3.427   .001  14.804  10     .139
Simplified English                               16         .030         0.043  -.053   .114    0.708   .479  23.885  15     .067
Bilingual Dictionary-Glossary                    5          -.096        0.065  -.223   .031    -1.479  .139  13.53   4      .009
Spanish Version                                  2          -.263        0.102  -.463   -.062   -2.572  .010  14.465  1      <.001
Dual Language Booklet                            1          -0.177       0.065  -.223   .031    -1.199  .231
Dual Language Questions + Read Aloud in Spanish  1          .273         0.195  -.109   .654    1.401   .161
Extra Time                                       2          .209         0.142  -.069   .488    1.473   .141  0.155   1      .693
TOTAL WITHIN                                                                                                  66.844  31     <.001
TOTAL BETWEEN                                                                                                 24.426  6      <.001
OVERALL MEAN                                     38         0.038        0.025  -.012   .087    1.481   .139  91.270  37     <.001
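The slides report fixed-effects results without showing the computation. As a minimal sketch, assuming the standard inverse-variance approach (the slides do not name the exact estimator), the pooled mean, its standard error, the Z test, and the Q heterogeneity statistic can be computed as below. The function name and the input values are hypothetical and are not taken from the studies.

```python
import math

def fixed_effects_summary(effects, std_errors):
    """Inverse-variance (fixed-effects) pooling of effect sizes.

    effects and std_errors are parallel lists, one entry per sample.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total_w = sum(weights)
    mean = sum(w * g for w, g in zip(weights, effects)) / total_w
    se_mean = math.sqrt(1.0 / total_w)
    z = mean / se_mean
    # Two-sided p-value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    # Cochran's Q: weighted squared deviations from the pooled mean.
    q = sum(w * (g - mean) ** 2 for w, g in zip(weights, effects))
    return {
        "mean": mean,
        "se": se_mean,
        "lower": mean - 1.96 * se_mean,
        "upper": mean + 1.96 * se_mean,
        "z": z,
        "p": p,
        "q": q,
        "df": len(effects) - 1,
    }

# Hypothetical effect sizes and standard errors for three samples.
result = fixed_effects_summary([0.10, 0.20, 0.30], [0.10, 0.10, 0.10])
for key, value in result.items():
    print(f"{key}: {value:.3f}")
```

Q is compared with a chi-square distribution on df degrees of freedom; a small p(Q) suggests the effect sizes vary more than sampling error alone would explain.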
Results for Random Effects Analysis
Column groups: Effect Size and 95% Confidence Interval (Mean Effect Size, s.e., Lower Limit, Upper Limit); Test of Mean Effect = 0 (Z, p); Test of Heterogeneity in Effect Sizes (Q, df(Q), p(Q))

                                                 Number of  Mean Effect         Lower   Upper
Accommodation                                    Studies    Size         s.e.   Limit   Limit   Z       p     Q       df(Q)  p(Q)
English Dictionary-Glossary                      11         0.178        0.055  .070    .287    3.232   .001
Simplified English                               16         0.037        0.067  -.093   .168    0.557   .557
Bilingual Dictionary-Glossary                    5          -0.039       0.131  -.295   .217    -0.298  .766
Spanish Version                                  2          0.302        0.719  -1.107  1.711   0.420   .674
Dual Language Booklet                            1          -0.177       0.065  -.223   .031    -1.199  .231
Dual Language Questions + Read Aloud in Spanish  1          0.273        0.195  -.109   .654    1.401   .161
Extra Time                                       2          0.209        0.142  -.069   .488    1.473   .141
TOTAL WITHIN
TOTAL BETWEEN                                                                                                 9.013   6      .173
OVERALL MEAN                                     38         0.102        0.037  0.029   0.174   2.753   .006
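A random-effects analysis adds a between-study variance term to each sample's sampling variance, which is why standard errors widen when effects are heterogeneous (for the Spanish Version, from 0.102 under fixed effects to 0.719 here). The sketch below assumes the DerSimonian-Laird estimator, one common choice; the slides do not say which estimator was used, and the function name and inputs are hypothetical.

```python
import math

def dersimonian_laird(effects, std_errors):
    """Random-effects pooling with the DerSimonian-Laird tau^2 estimate."""
    w = [1.0 / se ** 2 for se in std_errors]
    sum_w = sum(w)
    fixed_mean = sum(wi * g for wi, g in zip(w, effects)) / sum_w
    q = sum(wi * (g - fixed_mean) ** 2 for wi, g in zip(w, effects))
    df = len(effects) - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # Re-weight each sample by its total (sampling + between-study) variance.
    w_star = [1.0 / (se ** 2 + tau2) for se in std_errors]
    mean = sum(wi * g for wi, g in zip(w_star, effects)) / sum(w_star)
    se_mean = math.sqrt(1.0 / sum(w_star))
    return tau2, mean, se_mean

# Two hypothetical samples that disagree strongly.
tau2, mean, se_mean = dersimonian_laird([0.0, 0.5], [0.1, 0.1])
print(f"tau^2={tau2:.3f}, mean={mean:.3f}, s.e.={se_mean:.3f}")
```

With homogeneous inputs the tau^2 estimate is zero and the result coincides with the fixed-effects pooling.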
Findings: Effectiveness
• Of the accommodations studied, only providing English dictionaries had a significant positive effect.
– Hedges’ gu = .15 in fixed effects model; .18 in random effects model
– Approximately 10% – 25% of the difference between ELLs & native English speakers
[Figure: bar chart of Effect Size (axis 0 to 1.4) for Math, Science, and English Dictionaries; series: Meta-analysis, NAEP - 4th, NAEP - 8th]
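The "10% – 25% of the difference" statement is a ratio of the accommodation effect to the ELL/non-ELL gap, both expressed in standard deviation units. A small sketch of that arithmetic follows. The effect sizes (.15 and .18) come from the slide; the gap values are hypothetical, chosen only to span the range of the chart axis, and the function name is an assumption.

```python
def share_of_gap(effect_size, gap):
    """Fraction of an achievement gap covered by an effect in the same SD metric."""
    if gap <= 0:
        raise ValueError("gap must be positive")
    return effect_size / gap

# Effect sizes reported on the slide; gap values are hypothetical.
for label, g in (("fixed effects", 0.15), ("random effects", 0.18)):
    for gap in (0.7, 1.0, 1.4):
        print(f"{label}: g={g:.2f}, gap={gap:.1f} -> {share_of_gap(g, gap):.0%}")
```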
Summary of Results
Of the seven types of accommodations used, only one had an overall positive effect on ELL outcomes: English language dictionaries and glossaries
– Produced an average effect which is positive and statistically different from zero
– No indication that this effect varied across the studied conditions
– No evidence that it was not a valid accommodation
– No evidence that effect sizes varied when bundled with extra time, or when glossaries were electronic.
How do time and language work to predict the content area achievement of ELLs?
3-Level Model for ELA and Math
• Unconditional Model (within grade)
– V(Students(schools))
– V(Schools(Districts))
– V(Districts)
• Conditional Models
– Years in US
– ELP
– Years in US and ELP
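The slide lists the variance components but not the model equation. One conventional way to write such a three-level model is sketched below. The notation is an assumption, as is the choice to enter Years in US and ELP as fixed effects for student i in school j in district k.

```latex
% Unconditional model (fit within grade): three variance components
y_{ijk} = \gamma_{000} + v_{k} + u_{jk} + e_{ijk},
\qquad
v_{k} \sim N(0, \sigma^2_{\mathrm{district}}),\quad
u_{jk} \sim N(0, \sigma^2_{\mathrm{school}}),\quad
e_{ijk} \sim N(0, \sigma^2_{\mathrm{student}})

% Conditional model with both predictors
y_{ijk} = \gamma_{000}
        + \beta_{1}\,\mathrm{YearsUS}_{ijk}
        + \beta_{2}\,\mathrm{ELP}_{ijk}
        + v_{k} + u_{jk} + e_{ijk}
```

The three variances correspond to V(Districts), V(Schools(Districts)), and V(Students(schools)) above; the conditional models with only one predictor drop the other term.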
Unconditional Random Effects for ELA and MATH

                 ELA                                          MATH
Grade  Source    Estimate  s.e.   Z      p       %Variance^a  Estimate  s.e.   Z      p       %Variance^a
4      District  23.99     5.98   4.01   <.0001  0.15         36.70     8.18   4.49   <.0001  0.17
       Schools   25.61     3.30   7.76   <.0001  0.16         34.10     4.26   8.00   <.0001  0.15
       Students  111.69    2.62   42.70  <.0001  0.69         149.64    3.37   44.36  <.0001  0.68
5      District  25.20     5.95   4.23   <.0001  0.17         43.69     9.35   4.67   <.0001  0.19
       Schools   15.58     2.92   5.33   <.0001  0.11         34.94     5.32   6.57   <.0001  0.15
       Students  107.48    2.94   36.57  <.0001  0.72         151.28    3.98   38.01  <.0001  0.66
6      District  21.05     6.20   3.39   0.0003  0.15         48.79     11.27  4.33   <.0001  0.23
       Schools   20.47     4.04   5.06   <.0001  0.14         23.93     5.30   4.51   <.0001  0.11
       Students  100.77    2.82   35.77  <.0001  0.71         135.55    3.67   36.96  <.0001  0.65
7      District  25.79     6.72   3.84   <.0001  0.17         58.80     12.91  4.55   <.0001  0.29
       Schools   17.57     3.73   4.72   <.0001  0.12         20.00     4.37   4.58   <.0001  0.10
       Students  108.15    3.06   35.36  <.0001  0.71         120.66    3.30   36.56  <.0001  0.60
8      District  26.05     7.63   3.41   0.0003  0.16         52.35     11.19  4.68   <.0001  0.27
       Schools   24.18     5.11   4.73   <.0001  0.15         29.67     5.43   5.47   <.0001  0.15
       Students  115.44    3.28   35.23  <.0001  0.70         110.01    3.03   36.27  <.0001  0.57

^a %Variance computed as intra-class correlations (ICCs), viz. as ratio of estimate to sum of estimates for District, School, and Students. Percentages may not sum to 100% due to rounding.
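The %Variance column follows directly from the footnote: each variance estimate divided by the sum of the District, Schools, and Students estimates. A minimal sketch using the Grade 4 ELA estimates from the table (the function name is an assumption):

```python
def variance_shares(components):
    """Share of total variance attributable to each level."""
    total = sum(components.values())
    return {source: estimate / total for source, estimate in components.items()}

# Grade 4 ELA variance estimates from the unconditional model.
grade4_ela = {"District": 23.99, "Schools": 25.61, "Students": 111.69}
shares = variance_shares(grade4_ela)
for source, share in shares.items():
    print(f"{source}: {share:.2f}")
```

This reproduces the tabled values of 0.15, 0.16, and 0.69 for Grade 4 ELA.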
Conditional Random Effects for ELA and MATH predicted from Years in US, ELP, and Years + ELP

ELA
Grade  Source    Years in US    ΔR²    ELP-Perf.      ΔR²    Years and ELP  ΔR²
4      District  27.21          -0.13  15.13          0.37   14.73          0.39
       Schools   25.04          0.02   15.66          0.39   15.72          0.39
       Students  108.37         0.03   81.83          0.27   81.67          0.27
5      District  25.73          -0.02  11.62          0.54   11.11          0.56
       Schools   14.83          0.05   9.25           0.41   9.53           0.39
       Students  104.37         0.03   70.30          0.35   69.65          0.35
6      District  22.15          -0.05  9.16           0.56   8.59           0.59
       Schools   18.24          0.11   12.68          0.38   12.90          0.37
       Students  97.03          0.04   66.38          0.34   66.07          0.34
7      District  27.88          -0.08  11.20          0.57   11.05          0.57
       Schools   13.08          0.26   4.53           0.74   4.63           0.74
       Students  104.51         0.03   60.65          0.44   60.68          0.44
8      District  26.70          -0.02  10.87          0.58   9.54           0.63
       Schools   22.99          0.05   7.58           0.69   8.46           0.65
       Students  113.83         0.01   73.83          0.36   72.40          0.37

MATH
Grade  Source    Years in US    ΔR²    ELP-Perf.      ΔR²    Years and ELP  ΔR²
4      District  41.11          -0.12  29.10          0.21   26.51          0.28
       Schools   32.74          0.04   22.62          0.34   22.85          0.33
       Students  145.14         0.03   119.72         0.20   118.84         0.21
5      District  45.24          -0.04  36.52          0.16   35.45          0.19
       Schools   33.28          0.05   23.34          0.33   22.88          0.35
       Students  149.57         0.01   120.02         0.21   117.86         0.22
6      District  49.56          -0.02  35.88          0.26   31.27          0.36
       Schools   23.81          0.01   20.05          0.16   20.34          0.15
       Students  133.72         0.01   111.82         0.18   109.45         0.19
7      District  61.72          -0.05  47.32          0.20   43.68          0.26
       Schools   19.42          0.03   14.44          0.28   15.03          0.25
       Students  119.63         0.01   97.70          0.19   95.91          0.21
8      District  51.31          0.02   42.76          0.18   37.47          0.28
       Schools   30.17          -0.02  22.09          0.26   21.84          0.26
       Students  109.00         0.01   92.13          0.16   89.03          0.19

^a ΔR² computed as the change in variance component from the unconditional model (Table 5) relative to the magnitude of the variance component in the unconditional model: (Table 5 − Table 6)/(Table 5).
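The footnote defines ΔR² as the proportional reduction in a variance component once predictors are added. A minimal sketch using the Grade 4 ELA district-level estimates from the tables (the function name is an assumption):

```python
def delta_r2(unconditional, conditional):
    """Proportional reduction in a variance component after adding predictors."""
    return (unconditional - conditional) / unconditional

# Grade 4 ELA, district level: unconditional estimate and the three conditional estimates.
unconditional = 23.99
conditional = {"Years in US": 27.21, "ELP-Perf.": 15.13, "Years and ELP": 14.73}
changes = {model: delta_r2(unconditional, est) for model, est in conditional.items()}
for model, change in changes.items():
    print(f"{model}: {change:.2f}")
```

A negative value, as for Years in US here, means the estimated variance component was larger in the conditional model than in the unconditional one, which can happen with variance component estimates in multilevel models.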
Conditional Random Effects for ELA and MATH predicted from Years in US + ELP measured as (1) Performance Levels (PL), (2) Scaled Score (SS), or (3) Domain Scores (DS)

ELA
Grade  Source    Years + ELP-PL  ΔR²    Years + ELP-SS  ΔR²    Years + ELP-DS  ΔR²
4      District  14.73           0.39   14.00           0.42   10.92           0.54
       Schools   15.72           0.39   14.90           0.42   12.81           0.50
       Students  81.67           0.27   74.28           0.33   60.20           0.46
5      District  11.11           0.56   10.72           0.57   7.66            0.70
       Schools   9.53            0.39   8.27            0.47   7.21            0.54
       Students  69.65           0.35   65.37           0.39   60.02           0.44
6      District  8.59            0.59   7.02            0.67   7.60            0.64
       Schools   12.90           0.37   10.78           0.47   6.67            0.67
       Students  66.07           0.34   61.97           0.39   56.49           0.44
7      District  11.05           0.57   11.09           0.57   8.04            0.69
       Schools   4.63            0.74   4.05            0.77   3.06            0.83
       Students  60.68           0.44   57.85           0.47   53.16           0.51
8      District  9.54            0.63   8.36            0.68   3.32            0.87
       Schools   8.46            0.65   7.14            0.70   5.52            0.77
       Students  72.40           0.37   69.24           0.40   60.64           0.47

MATH
Grade  Source    Years + ELP-PL  ΔR²    Years + ELP-SS  ΔR²    Years + ELP-DS  ΔR²
4      District  26.51           0.28   25.27           0.31   20.12           0.45
       Schools   22.85           0.33   22.08           0.35   18.65           0.45
       Students  118.84          0.21   112.80          0.25   100.78          0.33
5      District  35.45           0.19   33.94           0.22   25.35           0.42
       Schools   22.88           0.35   22.44           0.36   20.03           0.43
       Students  117.86          0.22   112.85          0.25   105.12          0.31
6      District  31.27           0.36   28.26           0.42   26.63           0.45
       Schools   20.34           0.15   18.99           0.21   17.36           0.27
       Students  109.45          0.19   104.52          0.23   97.72           0.28
7      District  43.68           0.26   42.19           0.28   34.16           0.42
       Schools   15.03           0.25   14.21           0.29   12.25           0.39
       Students  95.91           0.21   93.23           0.23   85.99           0.29
8      District  37.47           0.28   35.03           0.33   27.24           0.48
       Schools   21.84           0.26   20.94           0.29   19.99           0.33
       Students  89.03           0.19   85.41           0.22   75.99           0.31

^a ΔR² computed as the change in variance component from the unconditional model (Table 5) relative to the magnitude of the variance component in the unconditional model: (Table 5 − Table 7)/(Table 5).
Analysis Summary
• Years in the US predicted ELA and MATH performance at the district, school, and student levels
• However, Years in the US was a relatively weak predictor compared with ELP
• When ELP was included with Years in US, the effects of Years in the US were unsystematic and small.
• Effects of ELP remained strong and consistent (i.e., outcomes increased with increases in ELP)
Analysis Summary
• How ELP was measured made some difference in its value as a predictor; Domain Scores predicted best
• Using Domain Scores for Reading and Writing only was almost as good as using Reading, Writing, Speaking, and Listening
• These results suggest that the academic components of the language assessment are the most important predictors of content area achievement
• It is noteworthy that ELP performance accounted for so much of the school and district variability in ELA and MATH
Conclusions from Statistical Analyses
• Taken together, the results highlight the importance of language in the development of content area knowledge.
– Analyses not presented here further highlight that children need to be taught the content in order to close achievement gaps.
– Gaining proficiency in English was not a guarantee of success on the content area assessments.
• The problem of content area proficiency is not simply a problem of testing as evidenced by the limited impact of test accommodations.
Suggestions for improving the accountability system for ELLs
Suggestions for Improved Accountability for ELLs
We clearly have a reporting problem that is fueled by the dynamic nature of the ELL category and the role that language plays in the development of content area proficiency
1. Create a separate reporting category of ELLs who are reclassified as FEP
2. Report achievement results within ELP proficiency levels:
• Beginner
• Intermediate
• Advanced Intermediate
• Fluent English Proficient
Suggestions for Improved Accountability for ELLs
3. Set achievement performance expectations that are challenging but realistic for each particular group
– Expect students in the FEP category to perform at levels comparable to monolinguals in the school/district/state
– Establish more realistic achievement targets for students who are still developing English
• Goals must be challenging, but attainable, to properly motivate students and teachers
• Long term expectations for all students are the same, but short term goals must be challenging, yet attainable in the short term
Suggestions for Improved Accountability for ELLs
4. Hold schools/districts accountable under Title I for language proficiency
5. Hold schools/districts accountable under Title I for the time required for children to achieve FEP status
6. Expect all students to reach FEP status
7. Short-circuit gamesmanship by limiting the time before children must be counted in the FEP category for achievement accountability
How does a system that incorporates these steps improve accountability?
• Gives schools actionable information about the content area achievement of all ELLs
• Allows more accurate evaluation of school performance in the face of shifting demographics
• Credits language proficiency gains and content area achievement gains that are reasonable given students’ language proficiency
• Meaningfully counts the achievement results for all children regardless of their level of language proficiency
• Recognizes the importance of setting challenging but attainable goals to motivate maximal performance
Conclusions
• An accountability model that addresses these issues will provide more accurate information to teachers, principals, and other stakeholders about the performance of ELLs
• Place emphasis on integration of language instruction into content area instruction, and
• Increase the emphasis on teaching content when ELLs first reach school
• Increase the demand for language tests that will serve as better barometers of ELL students’ acquisition of the academic language skills needed to master content domains.
Thank You
Gregory J. Cizek
Purpose: Broad Overview of Three Areas:
1) Strategies for improving reliability and validity of ELL assessments
2) Impact of construct-irrelevant features on student performance
3) Effects of attention to ELL subgroups on student achievement
“The reliability coefficients of the test scores for ELL students are substantially lower than those for non-ELL students” (p. 6)
$r_{xx'} = \dfrac{S_T^2}{S_T^2 + S_E^2}$
“ELL students’ test outcomes show lower criterion-related validity.” (p. 6)
$r_{xy} \le \sqrt{r_{xx'}\, r_{yy'}}$
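The two formulas connect: reliability is the share of observed-score variance that is true-score variance, and the validity coefficient cannot exceed the square root of the product of the test and criterion reliabilities. A minimal sketch with hypothetical variances (none of these numbers come from the paper; the function names are assumptions) shows how extra error variance for one group lowers both quantities.

```python
import math

def reliability(true_var, error_var):
    """r_xx' = S_T^2 / (S_T^2 + S_E^2)."""
    return true_var / (true_var + error_var)

def max_validity(r_xx, r_yy):
    """Upper bound on r_xy given the reliabilities of test x and criterion y."""
    return math.sqrt(r_xx * r_yy)

# Hypothetical variances: same true-score variance, more error variance for ELLs.
r_non_ell = reliability(true_var=80.0, error_var=20.0)
r_ell = reliability(true_var=80.0, error_var=80.0)
criterion_reliability = 0.9  # hypothetical

print(f"non-ELL: r_xx'={r_non_ell:.2f}, ceiling on r_xy={max_validity(r_non_ell, criterion_reliability):.2f}")
print(f"ELL:     r_xx'={r_ell:.2f}, ceiling on r_xy={max_validity(r_ell, criterion_reliability):.2f}")
```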
“Structural relationships between test components and across measurement domains are lower for ELL students.” (p. 6)
* change/multidimensionality of construct
* difficulties in applying standard test development practices
Four grade clusters: K-2, 3-5, 6-8, 9-12
* Test-taker marks answers in a test booklet
* Paper is secured to work area with tape/magnet
1) Review NCLB accountability for language-related subgroups (LRSs)
2) Report on meta-analysis re:
– evidence that specific test accommodations are effective in improving the performance of ELLs;
– evidence that specific test accommodations for ELLs are valid in large-scale assessments.
“1) States differ in the instruments used to assess language proficiency
2) States differ in the criteria used to judge proficiency
3) Too often, subjective judgments are part of the decision process.” (p. 3)
Unlike any other concern of NCLB, ELL subgroup membership is affected by instruction.
Aggregate reporting masks performance changes when demographics shift.
... given these points (variation, unlike other variables, demographic shifts mask performance changes), perhaps the most serious validity concern, yet one only waved at, is not the validity of any accommodations, but the validity of aggregating these data with other data obtained for AYP reporting purposes.
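The masking effect of aggregation can be illustrated with a small worked example. All numbers below are hypothetical, as is the function name: both proficiency groups gain 10 scale-score points from one year to the next, yet the aggregate mean falls because the share of beginners grows.

```python
def aggregate_mean(groups):
    """Weighted mean score across groups given as {name: (n_students, mean_score)}."""
    total_n = sum(n for n, _ in groups.values())
    return sum(n * mean for n, mean in groups.values()) / total_n

# Hypothetical subgroup sizes and mean scores in two consecutive years.
year1 = {"Beginner": (100, 300.0), "Advanced": (300, 400.0)}
year2 = {"Beginner": (300, 310.0), "Advanced": (100, 410.0)}

agg1 = aggregate_mean(year1)
agg2 = aggregate_mean(year2)
print(f"Year 1 aggregate: {agg1:.1f}")
print(f"Year 2 aggregate: {agg2:.1f}")
print(f"Change in aggregate: {agg2 - agg1:+.1f} (each group gained +10.0)")
```

Reporting results within proficiency levels, as suggested earlier in the symposium, would show the gain in each group that the aggregate hides.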
• An Area for Presenters to Clarify?
• “We know that there are about 12-15 test accommodations which appear to be useful for ELLs on large-scale academic tests.” (p. 1)
Continuous (or multidimensional) trait(s), but dichotomous accommodation policies.
Possible Solution: “A standardized computer-based system of matching students to accommodations has been built and is ready to be used.” (p. 1)
Annual CELDT Scores, 2001-2007
[Figure: stacked bar chart of the Percent of ELs (axis 0 to 100) at each CELDT level (Beg, Early Interm, Intermed., Early Adv, Advanced) by year, 2001 to 2007; annotations: Baseline AMAOs (2003-04), AMAOs (2004-05), AMAOs (2005-06), AMAOs (2006-07), AMAOs (2007-08)]
Overall performance of ELs on CELDT (%), 2005-07 (Linquanti, 2008)
CA Title III LEAs Meeting Individual AMAOs
[Figure: bar chart of the Percent of LEAs (axis 0 to 100) that Met AMAO 1, Met AMAO 2, and Met AMAO 3, by year, 2003 to 2007]
Source: CDE 2008
CA Title III LEAs Meeting Two or More AMAOs
[Figure: bar chart of the Percent of LEAs (axis 0 to 100) that Met AMAOs 1 & 2 (CELDT) and Met All AMAOs (1, 2, 3) (CELDT & academic achiev. test), by year, 2003 to 2007]
Source: CDE 2008
CA ELs meeting AMAO 1 Growth by Prior CELDT Level (2006-07 and 2007-08)
[Figure: bar chart of the % of ELs meeting AMAO 1 (axis 0 to 100) in 2006-07 and 2007-08, by prior CELDT level. Levels, with % of AMAO 1 cohort (06/07 -> 07/08): Beg. (19%->18%), Early Int. (23%->20%), Interm. (38%->40%), EA/A not EP (2%->3%), Eng. Prof. (17%->18%)]
CA EL Performance on Two Statewide Assessments: 2006-07
[Figure: bar chart by Grade, 2nd to 11th, of the % of ELs 12 mos or + (axis 0% to 100%) at Eng. Prof. Level (CELDT Fall 2006) and at Mid-Basic or Above (CST-ELA Spring 2007); reference lines: AMAO 2 Goal, Common Reclass. criterion]
* % test takers for each test by grade
CA EL Performance on CELDT and CST-ELA: 2007-08
[Figure: bar chart by Grade, 2nd to 11th plus TOTAL, of the % of ELs 12 mos or + (axis 0% to 100%) at Eng. Prof. Level (CELDT Fall 2007) and at Proficient or Advanced (CST-ELA Spring 2008); reference lines: AMAO 2 Goal, AMAO 3 Goal]
* % test takers for each test by grade
Total Ns: CST = 1,040,063; CELDT = 1,094,254
CA 2008 CST-ELA Results: ELs 12 or + months
[Figure: stacked bar chart by Grade, 2nd to 11th, of the % of Students in CST Levels (axis 0% to 100%): Adv., Prof., Basic, BB, FBB]
Total N = 1,040,063
Basic or Below: N = 867,673 (83% of all ELs 12 or + mos)
CA 2008 CST-ELA Results: Former ELs (Reclassified-FEP)
[Figure: stacked bar chart by Grade, 2nd to 11th, of the % of Students in CST Levels (axis 0% to 100%): Adv., Prof., Basic, BB, FBB]
Total N = 596,919
Basic or Below: N = 266,841 (45% of all RFEPs)
Misassignments by subject area, 2003-07 (n= 22,352)
[Figure: bar chart of misassignment counts by subject area (axis 0 to 8000); callout: 54% (12,077)]
Source: CTCC, 2008
Conclusions from CA’s AMAO Data to Date
• AMAOs 1 & 2 show steady progress
• AMAO 3 is in effect “running the show”
• Title I AYP (single-year, status-bar criterion) is very different from Title III AMAOs 1 & 2 (longitudinal growth-model, sensitive to current level & length-of-time)
Conclusions
• AMAO 1 & 2: ELP criteria and targets accepted as reasonable, meaningful, useful
– Signal many ELs need more focused instructional support to move beyond Intermediate
• AMAO 3: AYP status bar insufficiently sensitive to growth; target to 100% undermining credibility
• While many ELs at Basic, even more below
• Many RFEPs need help (45% Basic or below)
Implications for Accountability
• Specify expected progress in ELD by time
– Lower is faster, higher is slower (Cook, 2008)
– Proficient cut, time to proficiency target are key
• Explore AYP growth models to measure EL academic progress toward proficiency
– Benchmarking, reporting by ELD level
– Do we specify expectation: X years to academic proficiency?
Implications for Accountability
• Focus teacher PD on instruction ELs need to move beyond Intermediate (ELP); into & beyond Basic (Acad. achievement)
• EL reclassification not whole story, not end of story
– Ongoing linguistic, academic needs
– RFEP monitoring with resources to hold the gains over time
In a better world, NCLB reauthorization will…
• Promote true progress models, including ELP level, time in program, prior performance, benchmarking (CRESST)
• Operationalize, measure, & evaluate opportunity to learn
• Foster capacity for internal accountability (local, reciprocal) (Newmann, Elmore)
• Incorporate more authentic, multiple measures of greater utility to teachers (Shepard, Baker, Resnick)