Improving the Ways We Report Test Scores

Ronald Hambleton, April Zenisky

University of Massachusetts Amherst, USA

CERA Annual Meeting, June 1, 2010.


Important Time in the Testing Field

New provincial tests in Canada and state tests in the USA being introduced as part of educational reform (e.g., MA went from 7 to more than 24 in 10 years).

Users need to understand and use the scores and score reports correctly (or substantial funding is wasted).


1. Considerable investment of time and money has been made to address technical problems:

• IRT modeling of data, test scoring of performance data, test score equating, reliability estimation, computer technology, DIF analyses, standard-setting, and validity studies.


2. Surprisingly, test score reporting attracts very little attention! Can you name a research study? Without clear and meaningful reporting of information, the other steps are of less value!

Also, on this topic, more than on other technical topics, many persons think they are experts; everyone has an idea here about what to do!


AERA, APA, NCME Test Standards: What do they say about score scales and reporting?


5.10. When test score information is released … those responsible should provide appropriate interpretations.

--information is needed about content coverage, meaning of scores, precision of scores, common misinterpretations, and proper use.


13.14. … Score reports should be accompanied by a clear statement of the degree of measurement error associated with each score or classification level and information on how to interpret the scores.

Major Problems in Score Reporting!

Reporting scales and data displays (the reports) are confusing to many persons: percents vs. percentiles; IQ scores; new scales developed by states and provinces; T scores; stanine scores.


Major Problems in Score Reporting!

Quantitative literacy is not high (three kinds of persons!). Half of the population in the US can’t read bus schedules. What’s 20 million dollars for testing? (1/3 of 1% of the education budget.)

NRT vs. CRT scores.


Major Problems in Score Reporting!

Body of evidence highlighting score reporting problems (e.g., Jaeger).

Reporting scores without error bands.

Too much meaningless score information on some reports (called “chart clutter” by Tufte).

Not providing meaningful diagnostic information.


Goals of the Presentation

1. Consider student reports—improving the meaning of score scales and diagnostic reports.

2. Mention several emerging methodologies for researching score reports and their utility.

3. Identify a seven-step model for improving score report design and evaluation.


Individual Test Score Reports

In the USA, over 30,000,000 individual reports go to parents of school children alone.

There are over 1000 credentialing exams, and some of them exceed 100,000 candidates (e.g., securities, accountants, nurses).


Shortcomings in the Student Reports (Goodman & Hambleton, AME, 2004)

No stated purpose, no advance organizer, no clues about where to start reading.

Performance categories (typically) are not defined, even briefly.

No error bands on any of the reported scores, or even a hint that errors of measurement (i.e., imprecision) are present!


Shortcomings in the Student Reports

Font is often too small to read easily.

Instructional needs information is not always user-friendly, e.g. (to a parent), “You need help in ‘extending meaning by drawing conclusions and using critical thinking to connect and synthesize information within and across text, ideas, and concepts.’”


Shortcomings in the Student Reports

Several undefined terms on the displays: percentile, prompt, z score, performance category, achievement level, and more.

Basically, the reports are crowded!


Two Ideas for Score Reports

Bench-marking is one of our favorites and among the most promising. It capitalizes on item response theory (IRT): strong modeling of data, with items and candidates reported on the same scale.

Researchers have been slow to take advantage of this.


Bench-Marking Solution: Makes Scale Scores More Meaningful

Place boundary points on the reporting scale

Choose a probability associated with “knowing/can do”, say, 65%.

Use the ICCs from IRT to develop descriptions of what examinees can and cannot do between boundary points.
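The three steps above can be sketched in code. This is a minimal illustration, not the authors' procedure: it assumes the three-parameter logistic model with the conventional scaling constant D = 1.7, and the item parameters, boundary points, and function names are all hypothetical. For each item it finds the ability at which the probability of success reaches 0.65, then groups items by the band between boundary points in which that location falls. The items in a band are the raw material for describing what examinees at that level can do.

```python
import math

D = 1.7  # conventional scaling constant (an assumption; some programs use 1.0)


def icc_3pl(theta, a, b, c):
    """Probability of a correct response under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def rp_location(a, b, c, p=0.65):
    """Ability at which the ICC reaches probability p (requires c < p < 1)."""
    if not c < p < 1.0:
        raise ValueError("p must lie strictly between the guessing parameter c and 1")
    return b + math.log((p - c) / (1.0 - p)) / (D * a)


def items_by_band(items, boundaries, p=0.65):
    """Group item names by the band in which each item's location falls.

    Band 0 lies below the first boundary, band 1 between the first and
    second boundaries, and so on.
    """
    bands = {i: [] for i in range(len(boundaries) + 1)}
    for name, (a, b, c) in items.items():
        location = rp_location(a, b, c, p)
        band = sum(location >= cut for cut in boundaries)
        bands[band].append(name)
    return bands


# Hypothetical item parameters (a, b, c), for illustration only.
ITEMS = {
    "item_1": (1.0, -1.5, 0.20),
    "item_2": (0.8, 0.0, 0.25),
    "item_3": (1.2, 1.0, 0.15),
}
BOUNDARIES = [-1.0, 0.0, 1.0]  # hypothetical boundary points on the ability scale

for band, names in items_by_band(ITEMS, BOUNDARIES).items():
    print(band, names)
```

Note that an item's 0.65 location sits above its difficulty parameter b whenever the guessing parameter c is below 0.30, so the choice of probability shifts where items land on the scale.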


(3P) Item Characteristic Curve (ICC)

[Figure: a three-parameter item characteristic curve; horizontal axis: Ability, from -3 to 3.]

Item Characteristic Curves for 60 Items

[Figure: item characteristic curves for 60 items; horizontal axis: Proficiency Scale, from -3 to 3; reference line at P = 0.65.]

Reporting Category    Items    Points
Topic 1               13       16
Topic 2               18       21
Topic 3                9       12
Topic 4                8       11
Topic 5               12       15

Making Score Scales More Meaningful

[Figure: item characteristic curves on the Mathematics reporting scale, from 200 to 800, with a reference line at probability 0.65 and the scale points 400, 500, and 600 marked.]

Meaning of the Mathematics Scale

Level 200-290: Students at this level can sometimes solve very basic problems in each of the content areas. For example, they can solve simple arithmetic problems and read simple data displays.

Level 300-390: Students at this level show a beginning ability to recall and use mathematical facts and terminology to solve basic problems. For example, they can identify the rule for a simple pattern and solve very routine geometry problems.

Level 400-490: Students at this level display the ability to solve a greater variety of basic problems in each of the content areas. For example, they can recognize relationships and solve routine problems presented in verbal, mathematical, or graphical forms.

Level 500-590: Students at this level are able to solve multi-step problems in different content areas and can make connections between content areas. For example, they can solve multi-step percent problems and can use algebraic skills to solve geometry problems.

Level 600-690: Students at this level show a clear increase in ability to solve more demanding problems, to generalize, to understand mathematical terminology, and to make connections. For example, they can solve complex counting problems involving permutations/combinations, generalize complex patterns, and solve multi-step problems involving geometric/algebraic relationships.

Level 700-790: Students at this level have the ability to apply insight, reasoning, and problem solving strategies to solve a wide range of problems both within and across the content areas. For example, they can solve problems involving newly-defined functions in more than two variables and can solve conditional probability problems by constructing and analyzing a table of possible outcomes.
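A report generator could attach these descriptions to a student's score with a simple lookup. The sketch below is illustrative only: the function name is hypothetical, the summaries are shortened paraphrases of the level descriptions above, and it assumes that any score from a level's lower bound up to the next hundred belongs to that level. The descriptions stop at 790, so a score of 800 returns no description.

```python
# Shortened paraphrases of the level descriptions; keys are each level's lower bound.
LEVELS = {
    200: "can sometimes solve very basic problems in each of the content areas",
    300: "show a beginning ability to recall and use mathematical facts and terminology",
    400: "can solve a greater variety of basic problems in each of the content areas",
    500: "can solve multi-step problems and make connections between content areas",
    600: "show a clear increase in ability to solve more demanding problems",
    700: "can apply insight, reasoning, and problem solving strategies to a wide range of problems",
}


def level_for(score):
    """Return (label, summary) for a scale score, or None if no level is described."""
    if not 200 <= score <= 800:
        raise ValueError("score is outside the 200 to 800 reporting scale")
    lower = int(score // 100) * 100
    if lower not in LEVELS:
        return None  # e.g., 800: no description is given for it
    return f"Level {lower}-{lower + 90}", LEVELS[lower]


print(level_for(540))
```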

Common Diagnostic Report

Candidate results by subdomain categories (e.g. math):

Content Domain                      Score Points
1. Data Analysis, Stats (20%)       1 of 10
2. Geometry (10%)                   6 of 8
3. Measurement (20%)                9 of 12
4. Number Sense/Operations (15%)    4 of 9
5. Patterns (35%)                   4 of 22

[Percent Correct column: a bar for each content domain on an axis from 0% to 100%.]

Highly Problematic Report!!

No sense of measurement error.

No guarantee that the items are representative.

No basis for score interpretation.
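To see how much measurement error such a report hides, one can attach a rough band to each percent-correct figure. The sketch below uses the score points from the report above and, as a loud simplification, treats each score point as an independent right-or-wrong trial (a binomial approximation). An operational program would likely use a model-based standard error instead, and the function name is hypothetical.

```python
import math

# Score points earned and available, taken from the diagnostic report above.
DOMAINS = {
    "Data Analysis, Stats": (1, 10),
    "Geometry": (6, 8),
    "Measurement": (9, 12),
    "Number Sense/Operations": (4, 9),
    "Patterns": (4, 22),
}


def percent_with_band(earned, possible, z=1.0):
    """Percent correct with a rough +/- z standard-error band, clipped to 0-100."""
    p = earned / possible
    se = math.sqrt(p * (1.0 - p) / possible)  # binomial approximation (an assumption)
    low = max(0.0, p - z * se)
    high = min(1.0, p + z * se)
    return round(100 * p), round(100 * low), round(100 * high)


for name, (earned, possible) in DOMAINS.items():
    pct, low, high = percent_with_band(earned, possible)
    print(f"{name}: {pct}% (about {low}% to {high}%)")
```

With only 8 to 12 score points in a domain, even a one-standard-error band under this approximation spans roughly 25 to 30 percentage points, which is why a bare percent correct invites over-interpretation.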


Mathematics: Your Performance Compared to Passing Students

Content Domain                       Your Performance    Passing Student Performance    Weaker / Comparable / Stronger
1. Data Analysis, Stats (20%)        10%                 20%                            X
2. Geometry (10%)                    75%                 60%                            X
3. Measurement (20%)                 75%                 90%                            X
4. Number Sense/Operations (15%)     44%                 60%                            X
5. Patterns (35%)                    18%                 65%                            X

Overall Performance                  Weaker / Comparable / Stronger
Multiple Choice (70%)                X
Constructed Response (30%)           X

[Each X marked one of the three columns Weaker, Comparable, or Stronger; the column positions are not recoverable here.]

A Better Report!!

Confidence bands.

A frame of reference: performance of borderline candidates, or passing candidates, for example.
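Combining the two ingredients, a band and a frame of reference, gives one possible rule for placing a candidate in a Weaker, Comparable, or Stronger column. The rule below is hypothetical and is not taken from the report: it calls a result Comparable whenever the reference group's percent falls inside the candidate's band, however that band was obtained. The half-widths in the examples are made-up values.

```python
def compare_to_reference(candidate_pct, band_half_width, reference_pct):
    """Classify a candidate's percent correct against a reference group's percent.

    Hypothetical rule: 'Comparable' when the reference percent lies inside the
    candidate's band (candidate_pct +/- band_half_width), inclusive.
    """
    low = candidate_pct - band_half_width
    high = candidate_pct + band_half_width
    if reference_pct > high:
        return "Weaker"
    if reference_pct < low:
        return "Stronger"
    return "Comparable"


# Candidate percent, assumed band half-width, passing-student percent.
print(compare_to_reference(75, 16, 60))  # reference inside the band
print(compare_to_reference(75, 12, 90))  # reference above the band
print(compare_to_reference(75, 10, 60))  # reference below the band
```

The design point is that the same 15-point gap can be reported as Comparable or Stronger depending on the width of the band, so the band has to be part of the report.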


Score Report Design & Evaluation

Experiments Focus Groups Think-alouds Qualitative Reviews from the Field Tryouts

7 Steps in Report Development

1. Define purpose of score report

2. Identify intended audience(s)

3. Review report examples/literature

4. Develop report(s)

5. Data collection/field test

6. Revise and redesign

7. Ongoing maintenance

Necessary Research

Reducing the size of error bands for knowledge/skill areas:

• improving the quality of test items

• improving the targeting of the test

• capitalizing on correlational information among the skills or other priors
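As one classical illustration of how a prior can shrink an error band, Kelley's regressed estimate pulls an observed score toward the group mean in proportion to the score's unreliability, and its standard error of estimation is smaller than the standard error of measurement. This sketch is not presented as the authors' method: it uses only the group mean as the prior, it does not use correlations among skills, and the numbers are hypothetical.

```python
import math


def sem(sd, reliability):
    """Standard error of measurement for an observed score."""
    return sd * math.sqrt(1.0 - reliability)


def kelley_estimate(observed, group_mean, reliability):
    """Kelley's regressed true-score estimate: shrink the observed score toward the mean."""
    return reliability * observed + (1.0 - reliability) * group_mean


def see(sd, reliability):
    """Standard error of estimation for the Kelley estimate."""
    return sd * math.sqrt(reliability * (1.0 - reliability))


# Hypothetical subscore: SD 20, reliability 0.64, group mean 60, observed score 40.
print(kelley_estimate(40, 60, 0.64))
print(sem(20, 0.64), see(20, 0.64))
```

The ratio of the two standard errors is the square root of the reliability, so the gain is largest for short, unreliable subscales, which is exactly where diagnostic reports struggle.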


Necessary Research (cont.)

Learning to move from the ICCs, to choosing the number of performance categories, to preparing the descriptive statements that can enhance the meaning of a score scale, and validation.

Final Remarks

Important advances have been made in score reporting.

More research needed on matching score reports to intended audiences, and evaluating score reports prior to use.

Diagnostic reports are important to users but need more research.


Final Remarks

Seven-step model should be used, and exemplar reports compiled.

We are pleased to see the developments taking place.

--States, provinces and countries are beginning to use the tools and progress can be seen.

See the NCME bibliography by Deng and Yoo with 70+ pages of references!

