Classroom Assessments in Large Scale Assessment Programs

Catherine Taylor, University of Washington/OSPI

Lesley Klenk, OSPI

History of Criterion-Referenced Assessment Models

• “Measurement-driven instruction” (e.g., Popham, 1987) emerged during the 1980s

• A process wherein the tests are used as the driver for instructional change.

• “If we value something, we must assess it.”

• Minimum-competency movement of the 1980s

• “Drive” instructional practices toward teaching of basic skills

• The movement was successful: teachers did teach to the tests.

• Unfortunately, teachers taught too closely to the tests (Smith, 1991; Haladyna, Nolen, & Haas, 1991).

• The tests were typically multiple-choice tests of discrete skills

• Instruction narrowed to the content that was tested, in the same form in which it was tested.


• Large-scale achievement tests came under widespread criticism:

– Negative impacts on the classroom (Darling-Hammond & Wise, 1985; Madaus, West, Harmon, Lomax, & Viator, 1992; Shepard & Dougherty, 1991)

– Lack of fidelity to valued performances


• Studies compared indirect and direct measures of:

– writing (Stiggins, 1982)

– mathematical problem-solving (Baxter, Shavelson, Herman, Brown, & Valadez, 1993)

– science inquiry (Shavelson, Baxter, & Gao, 1993)

• Demonstrated that some of the knowledge and skills measured in each assessment format overlap

• Moderate to low correlations between different assessment modes

• Questions about the validity of multiple-choice test scores.

• Other studies (Haladyna, Nolen, & Haas, 1991; Shepard & Dougherty, 1991; Smith, 1991) showed:

– pressure to raise scores on large scale tests

– narrowing of the curriculum to the specific content tested

– substantial classroom time spent teaching to the test and item formats.


• In response to criticisms of multiple-choice tests, assessment reformers (e.g., Shepard, 1989; Wiggins, 1989) pressed for:

– Different types of assessment

– Assessments that measure students' achievement of new curriculum standards

– Assessment formats that more closely match the ways knowledge, concepts and skills are used in the world beyond tests

– Assessments that encourage teachers to teach higher order thinking, problem-solving, and reasoning skills rather than rote skills and knowledge


• In response to these pressures to improve tests:

– LEAs, testing companies, and projects (e.g., New Standards Project) incorporated “performance assessments” into testing programs

– “Performance assessments” included:

    • Short-answer items similar to multiple-choice items

    • Carefully scaffolded, multi-step tasks with several short-answer items (e.g., Yen, 1993)

    • Open-ended performance tasks (California, 1990; OSPI, 1997)


• Still, writers criticized these efforts:

– Tasks are contrived and artificial (see, for example, Wiggins, 1992)

– Teachers complain that standardized tests don’t assess what is taught in the classroom

• Shepard (2000) indicated that the promises of high quality performance-based assessments have not been realized:

– Authentic tasks are costly to implement, time-consuming, and difficult to evaluate

– Less expensive performance assessment options are less authentic

Impact of National Curriculum Standards

• Knowledge is expanding rapidly

• Education must shift away from knowledge dissemination

• Students must learn how to:

– Gather information

– Comprehend, analyze, interpret information

– Evaluate the credibility of information

– Synthesize information from different sources

– Develop new knowledge

Early Attempts to Use Portfolios for State Assessment

Three states attempted to use collections of classroom work for state assessment:

– California (Kirst & Mazzeo, 1996; Palmquist, 1994)

– Kentucky (Kentucky State Department of Education, 1996)

– Vermont (Fontana, 1995; Forseth, 1992; Hewitt, 1993; Vermont State Department of Education, 1993, 1994a, 1994b)

Early Attempts to Use Portfolios for State Assessment (continued)

Initial efforts were fraught with problems:

– Inconsistency of raters when applying scoring criteria (Koretz, Stecher, & Deibert, 1992b; Koretz, Stecher, Klein, & McCaffrey, 1994a),

– Lack of teacher preparation in high quality assessment development (Gearhart & Wolf, 1996),

– Inconsistencies in the focus, number, and types of evidence included in portfolios (Gearhart & Wolf, 1996; Koretz et al., 1992b), and

– Costs and logistics associated with processing portfolios (Kirst & Mazzeo, 1996).

Research on Large Scale Portfolio Assessment

• Research on the impact of portfolios showed mixed results:

• Teachers and administrators have generally positive attitudes about use of portfolios (Klein, Stecher, & Koretz, 1995; Koretz et al., 1992a; Koretz et al., 1994a)

• Positive effects on instruction (Stecher & Hamilton, 1994)

• Teachers develop a better understanding of mathematical problem-solving (Stecher & Mitchell, 1995)

• Too much time spent on the assessment process (Stecher & Hamilton, 1994; Koretz et al., 1994a)

• Teachers work too hard to ensure that portfolios "look good" (Callahan, 1997).

Advantages of Using Classroom Evidence in Large-Scale Assessment Programs

• Evidence that teachers are preparing students to meet curriculum and performance standards (opportunity to learn),

• Broader evidence about student achievement

• Opportunity to assess knowledge and skills difficult to assess via standardized tests (e.g., speaking and presenting, report writing, scientific inquiry processes)

• Opportunity to include work that more closely represents the real contexts in which knowledge and skill are applied

Opportunity to Learn

• Little evidence is available about whether teachers are actually teaching to curriculum standards.

• Claims about positive impacts of new assessments on instructional practices are largely anecdotal or based on teacher self-report

• Legal challenges to tests for graduation, placement, and promotion demand evidence that students have had the opportunity to learn tested curriculum (Debra P. v. Turlington, 1979).

• There is no efficient method to assess students’ opportunity to learn the valued concepts and skills

• Collections of classroom work provide a window into the educational experiences of students

• Collections of classroom work provide a window into the educational practices of teachers

• Collections of classroom work could help administrators evaluate the effectiveness of in-service teacher development programs

• Classroom assessments could be used in court cases to provide evidence of individual students’ opportunity to learn

Broader Evidence of Student Learning

• Some students function well in the classroom but do not perform well on tests.

• “Stereotype threat” research: fear of a negative stereotype can lead minority students and girls to perform less well than they should on standardized tests (Aronson, Lustina, Good, Keough, Steele, & Brown, 1999; Steele, 1999; Steele & Aronson, 2000).

• Students may have cultural values or language development issues that inhibit performance on timed, standardized tests

• These factors threaten the validity of large-scale test scores.

• Classroom work can be more sensitive to students’ cultural and linguistic backgrounds

• Collections of classroom work can be more reliable than standardized test scores

Including Standards that are Difficult to Measure on Tests

• Some desirable curriculum standards are too unwieldy to measure on large-scale tests (e.g., scientific inquiry, research reports, oral presentations)

• Historically, standardized tests measure complex work by testing knowledge of how to conduct the work. Examples:

– Knowing where to locate sources for reports

– Knowing how to use tables of contents, bibliographies, card catalogues, and indexes

– Identifying control or experimental variables in a science experiment

– Knowing appropriate strategies for oral presentation

– Knowing appropriate ways to use visual aids

• Critics often note that knowing what to do doesn't necessarily mean one is able to do it.

Authenticity

• Frederickson (1984) raised the question of authenticity in assessment, citing the misrepresentation of domains by standardized tests.

• Wiggins (1989) claimed that in every discipline there are tasks that are authentic to the given discipline.

• Frederickson (1998) stated that authentic achievement is “significant intellectual accomplishment” that results in the “construction of knowledge through disciplined inquiry to produce discourse, products, or performances that have meaning or value beyond success in school” (p. 19, italics added).

• Examples of performances:

– Policy analysis

– Historical narrative and evaluation of historical artifacts

– Geographic analysis of human movement

– Political debate

– Story and poetry writing

– Literary analysis/critique

– Mathematical modeling

– Investment or business analyses

– Geometric design and animation

– Written report of a scientific investigation

– Evaluation of the health of an ecosystem

Authenticity

• Some measurement specialists question the use of the terms “authentic” and “direct” measurement

• All assessments are indirect measures from which we make inferences about other, related performances (Terwilliger, 1997)

• However:

– Validity is related to the degree of inference necessary from scores on a standardized test to valued work

– Authentic classroom work requires less inference than multiple choice test scores

Challenges with Inclusion of Classroom Work in Large Scale Programs

1. Limited teacher preparation in classroom-based assessment (which can limit the quality of classroom-based evidence),

2. Selections of evidence (which can limit comparisons across students),

3. Reliability of raters (which can limit the believability of scores given to student work)

4. Construct irrelevant variance (which can limit the validity of scores)

Solving Teacher Preparation Issues

• Teachers must be taught how to:

– Select, modify, and develop assessments

– Score (evaluate) student work

– Write scoring (marking) rules for assessments that align to standards

• Significant, ongoing professional development in assessment is essential.

• Teachers need to re-examine:

– Important knowledge and skills within each discipline

– How to teach so that students are more independent learners

Selection of Evidence

• "For which knowledge, concepts, and skills do we need classroom-based evidence?"

• Koretz et al. (1992b) claimed that, when teachers are free to select evidence, there is too much diversity in tasks

• Diversity may cause low inter-judge agreement among raters of the portfolios.

• Koretz and his colleagues recommended placing some restrictions on the types of tasks considered acceptable for portfolios.

• Teachers need guidance in terms of what constitutes appropriate types of evidence.

Improving Selections of Evidence

• Provide guidelines for what constitutes an effective collection of evidence

• Provide models for the types of assignments (performances) that will demonstrate the standards.

• Provide blueprints for tests that can assess the EALRs assessed by the WASL

• Provide guides for writing test questions and scoring rubrics

• Provide guides for writing directions and scoring rubrics for assignments (performances)

Guidelines for Collections Include

• Lists of important work samples to collect (e.g., research reports, mathematics problems)

• Number and types of evidence for each category

• Outline of steps in performances and work samples

• Tools for assessment of students’ performances and work samples

Example Lists of Number and Types of Work Samples to Collect

• Writing Performances:

– At least 2 different writing purposes

– At least 3 different audiences

– Some examples from courses other than English

• Science Investigations:

– At least 3 investigations (physical, earth/space, life)

– Observational assessments of hands-on work

– Lab books

– Summary research reports

Develop “Benchmark” Performance Assessments

• Benchmark performances are performances that:

– Have value in their own right

– Are complex and interdisciplinary

– Students are expected to do by the end of some defined period of time (e.g., the end of middle school)

• Performance may require:

– Application of knowledge, concepts and skills across subject disciplines (e.g., survey research)

– Authentic work within one subject discipline (e.g., scientific investigations, expository writing)

Example Description of a Benchmark Performance in Reading

• By the end of middle school, students will select one important character from a novel, short story, or play and write a multi-paragraph essay describing the character, how the character's personality, actions, choices, and relationships influence the outcome of the story, and how the character was affected by the events in the story. Each paragraph will have a central thought that is unified into a greater whole supported by factual material (direct quotations and examples from the text) as well as commentary to explain the relationship between the factual material and the student's ideas.

Example Description of a Benchmark Performance in Mathematics

• By the end of high school, students will investigate and report on a topic of personal interest by collecting data for a research question. Students will construct a questionnaire and obtain a sample from a relevant population. In the report, students will present the results in a variety of appropriate forms (including pictographs, circle graphs, bar graphs, histograms, line graphs, and/or stem and leaf plots, incorporating the use of technology), analyze and interpret the data using statistical measures (central tendency, variability, and range) as appropriate, describe the results, make predictions, and discuss the limitations of their data collection methods. Graphics will be clearly labeled (including name of data, units of measurement and appropriate scale) and informatively titled. References to data in reports will include units of measurement. Sources will be documented.
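To make the statistical measures named in this benchmark concrete, the following is a minimal Python sketch of the kind of summary a student report might contain. The survey question and the response values are invented for illustration; they do not come from the presentation, and the sketch is not part of any scoring guideline.

```python
import statistics

# Hypothetical survey data (an assumption for illustration only):
# hours of homework per week reported by ten students.
hours = [2, 4, 4, 5, 6, 6, 6, 8, 9, 10]

# Measures of central tendency
mean_hours = statistics.mean(hours)
median_hours = statistics.median(hours)
mode_hours = statistics.mode(hours)

# Measures of variability
std_dev = statistics.stdev(hours)       # sample standard deviation
value_range = max(hours) - min(hours)   # range

# Report every figure with its unit of measurement, as the benchmark requires.
print(f"Mean: {mean_hours} hours per week")
print(f"Median: {median_hours} hours per week")
print(f"Mode: {mode_hours} hours per week")
print(f"Standard deviation: {std_dev:.2f} hours per week")
print(f"Range: {value_range} hours per week")
```

A full student report would go beyond these numbers to interpret them and to discuss the limitations of the sample.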

Example of the Process of Developing Benchmark Performances

1. Select work that would be familiar or meaningful:

Purchasing decision

2. Describe the performance in some detail:

A person plans to buy a ___ on credit. The person figures out how much s/he can spend (down-payment and monthly payments), does research on the different types of ___, reads consumer reports or product reviews, compares costs and qualities, and makes a final selection. The person then locates the chosen product and purchases it or finances the purchase.

Example of the Process (continued)

3. Define the steps adults take to complete the performance:

a. A person plans to buy a ___ on credit for ____ purpose.

b. The person figures out how much s/he can spend:

– Determines money available for down-payment

– Compares income and monthly expenses to determine cash available for monthly payment

c. Does research on the different types of ___ including costs and finance options.

d. Reads consumer reports or product reviews

e. Compares costs, qualities, and finance options

f. Makes a final selection.

g. Locates the chosen product and finances the purchase.

Example of the Process (continued)

4. Create grade level appropriate steps:

a. The student plans to buy a ___ on credit for _____ purpose.

The student:

b. Figures out how much s/he can spend:

– Determines money available for down-payment

– Compares income and monthly expenses to determine cash available for monthly payment

c. Does research on at least 3 types of ______

d. Determines costs and finance options.

e. Reads consumer reports or product reviews

f. Compares costs, qualities, and finance options

g. Makes a final selection that is optimal for cost, quality and finance options within budget.

Example of the Process (continued)

5. Identify the EALRs demonstrated at each step:

a. The student plans to buy a ___ on credit for _____ purpose.

The student:

b. Figures out how much s/he can spend (EALR 4.1):

– Determines money available for down-payment (EALR 4.1)

– Compares income and monthly expenses to determine cash available for monthly payment (EALR 3.1)

c. Does research on at least 3 types of ______ (EALR 4.1)

d. Determines costs and finance options (EALR 1.5.4)

e. Reads consumer reports or product reviews (EALR 4.1)

f. Compares costs, qualities, and finance options (EALR 3.1)

g. Makes a final selection that is optimal for cost, quality and finance options within budget (EALR 2.1-2.3)

Example of the Process (continued)

6. Modify the steps as needed to ensure demonstration of the EALRs:

a. The student plans to buy a ___ on credit for _____ purpose.

The student:

b. Figures out how much s/he can spend (EALR 4.1):

– Determines money available for down-payment (EALR 4.1)

– Compares income and monthly expenses to determine cash available for monthly payment (EALR 3.1)

c. Does research on at least 3 types of ____ (EALR 4.1)

d. Determines costs and finance options (EALR 1.5.4)

e. Reads consumer reports or product reviews (EALR 4.1)

f. Creates a table to show comparison of costs, qualities, and finance options (EALR 3.1)

g. Makes a final selection and explains how it is optimal for cost, quality and finance options within budget (EALR 2.1-2.3)
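The arithmetic behind steps b, d, f, and g can be sketched in Python as follows. This is an illustration only: the incomes, prices, interest rates, quality ratings, and the selection rule (highest quality among affordable options, ties broken by lowest total cost) are all assumptions introduced here, not part of the benchmark description. The payment calculation uses the standard fixed-rate amortization formula.

```python
def cash_available(monthly_income, monthly_expenses):
    """Step b: cash left over each month for a loan payment."""
    return monthly_income - monthly_expenses


def monthly_payment(price, down_payment, annual_rate, months):
    """Step d: fixed monthly payment on an amortized loan."""
    principal = price - down_payment
    if principal <= 0:
        return 0.0
    rate = annual_rate / 12  # monthly interest rate
    if rate == 0:
        return principal / months
    return principal * rate / (1 - (1 + rate) ** -months)


def comparison_table(options, down_payment, budget):
    """Step f: one row per option with cost, quality, and finance details."""
    rows = []
    for opt in options:
        payment = monthly_payment(
            opt["price"], down_payment, opt["annual_rate"], opt["months"]
        )
        rows.append({
            "name": opt["name"],
            "quality": opt["quality"],
            "payment": round(payment, 2),
            "total_cost": round(down_payment + payment * opt["months"], 2),
            "within_budget": payment <= budget,
        })
    return rows


def final_selection(rows):
    """Step g (assumed rule): best quality within budget, then lowest total cost."""
    affordable = [row for row in rows if row["within_budget"]]
    if not affordable:
        return None
    return max(affordable, key=lambda row: (row["quality"], -row["total_cost"]))


# Hypothetical figures for a student's purchasing decision.
budget = cash_available(monthly_income=1500.0, monthly_expenses=1400.0)
options = [
    {"name": "Option A", "price": 1200.0, "quality": 3, "annual_rate": 0.12, "months": 12},
    {"name": "Option B", "price": 1500.0, "quality": 5, "annual_rate": 0.00, "months": 12},
    {"name": "Option C", "price": 1400.0, "quality": 4, "annual_rate": 0.06, "months": 24},
]
rows = comparison_table(options, down_payment=200.0, budget=budget)
for row in rows:
    print(row)
print("Selected:", final_selection(rows)["name"])
```

In a classroom task the student would also explain the choice in writing, which is the part of step g that a calculation alone cannot show.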

Possible Authentic Performances in Mathematics

• Survey Research:

– Community issue

– School issue

• Return on investment (costs and sales)

• Purchasing decisions

• Graphic designs

• Animation

• Social science analyses:

– Sources of GDP

– Major categories of federal budget

– Casualties during war

Possible Authentic Performances in Reading

• Literary analyses:

– Comparisons across different works by the same author

– Comparisons across works by different authors on same theme

– Analysis of theme, character, plot development

• Reading journals

• Research reports:

– Summary of information on a topic from multiple sources

– Investigation of a social or natural science research question using multiple sources

– Position paper based on information from multiple sources

Providing example blueprint for tests that can assess the standards

Type of Standard       | Multiple-Choice Items | Short-Answer Items | Essay Items and/or Performance Tasks
Simple application     | 2-4                   | 1-2                |
Multi-step application |                       | 2-3                |
Solve problem          |                       |                    | 2-3
Communicate            |                       | 1-2                | 1-2
Total                  | 2-4                   | 4-6                | 4-5

Example blueprint for tests that can assess standards

Learning Target                       | Multiple-Choice Items | Short-Answer Items | Essay Items and/or Performance Tasks
Main ideas/important details          | 3-4                   | 1-2                |
Analysis, interpretation, & synthesis |                       | 1-2                | 2-3
Critical thinking                     |                       | 1-2                | 2-3
Total                                 | 3-4                   | 4-6                | 4-5

Solving Score Reliability Issues

• Train expert teachers to evaluate diverse collections of evidence

• Expert teachers evaluate the collection of work to determine whether it meets standards
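One common way to check whether trained raters apply the standard consistently is to compare two raters' decisions on the same collections. The sketch below computes exact agreement and Cohen's kappa, which adjusts agreement for chance. It is offered as an illustration: the presentation does not name a particular statistic, and the ratings shown are invented.

```python
from collections import Counter


def exact_agreement(ratings_a, ratings_b):
    """Proportion of collections on which the two raters gave the same rating."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Both raters must rate the same non-empty set of collections")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)


def cohens_kappa(ratings_a, ratings_b):
    """Agreement between two raters, corrected for agreement expected by chance."""
    n = len(ratings_a)
    observed = exact_agreement(ratings_a, ratings_b)
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    categories = set(counts_a) | set(counts_b)
    # Chance agreement: probability both raters pick the same category independently.
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of eight collections of evidence by two expert teachers.
rater_a = ["meets", "meets", "not", "meets", "not", "not", "meets", "meets"]
rater_b = ["meets", "not", "not", "meets", "not", "meets", "meets", "meets"]

agreement = exact_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(f"Exact agreement: {agreement:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
```

Low values on either measure would suggest that raters need further training or that the scoring criteria need clearer wording.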

Construct Irrelevant Variance

• Factors that are unrelated to targeted knowledge and skills that affect validity of performance:

– Teachers provide too much “help”

– Teachers provide differential types of help

– Students get help from parents

– Directions for assignments are not clear

– Students are taught the content but not how to do the type of performance

Solving Construct Irrelevant Variance Problems

• Provide guidelines for what constitutes valid evidence

• Provide model performance assessments or benchmark performance descriptions

• Provide professional development on appropriate levels of help

• Provide professional development on the EALRs and GLEs

• Provide professional development on how to teach to authentic work

Conclusion

• Collections of evidence CAN be used to measure valued knowledge and skills

• Collection of Evidence (COE) guidelines for Washington State:

– Incorporate many of the characteristics that will ensure more valid student scores

– Will continue to improve as more examples are provided

• Scoring of collections:

– Will involve use of the same rigor in scoring as on WASL items

– Will provide reliable student level scores
