Improving the Ways We Report Test Scores. Ronald Hambleton, April Zenisky, University of Massachusetts Amherst, USA. CERA Annual Meeting, June 1, 2010.
1
Improving the Ways We Report Test Scores
Ronald Hambleton, April Zenisky
University of Massachusetts Amherst, USA
CERA Annual Meeting, June 1, 2010
2
Important Time in the Testing Field
New provincial tests in Canada and state tests in the USA are being introduced as part of educational reform (e.g., MA went from 7 tests to more than 24 in 10 years).
Users need to understand and use the scores and score reports correctly (or substantial funding is wasted).
3
1. Considerable investment of time and money has been made to address technical problems:
• IRT modeling of data, test scoring of performance data, test score equating, reliability estimation, computer technology, DIF analyses, standard-setting, and validity studies.
4
2. Surprisingly, test score reporting attracts very little attention! Can you name a research study?
Without clear and meaningful reporting of information, the other steps are of less value!
Also, on this topic, more than on other technical topics, many persons think they are experts: everyone has an idea about what to do!
5
AERA, APA, NCME Test Standards: What do they say about score scales and reporting?
5.10. When test score information is released … those responsible should provide appropriate interpretations.
--information is needed about content coverage, meaning of scores, precision of scores, common misinterpretations, and proper use.
6
13.14 …Score reports should be accompanied by a clear statement of the degree of measurement error associated with each score or classification level and information on how to interpret the scores.
7
Major Problems in Score Reporting!
Reporting scales and data displays (the reports) are confusing to many persons: percents vs. percentiles; IQ scores; new scales developed by states and provinces; T scores; stanine scores.
8
Major Problems in Score Reporting!
Quantitative literacy is not high (three kinds of persons!). Half of the US population can’t read bus schedules.
What’s 20 million dollars for testing? (1/3 of 1% of the education budget)
Norm-referenced (NRT) vs. criterion-referenced (CRT) test scores.
9
Major Problems in Score Reporting!
Body of evidence highlighting score reporting problems (e.g., Jaeger)
Reporting scores without error bands
Too much meaningless score information on some reports (called “chart clutter” by Tufte)
Not providing meaningful diagnostic information
10
Goals of the Presentation
1. Consider student reports—improving the meaning of score scales and diagnostic reports.
2. Mention several emerging methodologies for researching score reports and their utility.
3. Identify a seven step model for improving score report design and evaluation.
11
Individual Test Score Reports
In the USA alone, over 30,000,000 individual reports go to parents of school children.
There are over 1000 credentialing exams, and some of them exceed 100,000 candidates (e.g., securities, accountants, nurses).
12
Shortcomings in the Student Reports (Goodman & Hambleton, AME, 2004)
No stated purpose, no advance organizer, no clues about where to start reading.
Performance categories (typically) are not defined, even briefly.
No error bands on any of the reported scores, or even a hint that errors of measurement (i.e., imprecision) are present!
13
Shortcomings in the Student Reports
Font is often too small to read easily.
Instructional needs information is not always user-friendly, e.g. (to a parent), “You need help in extending meaning by drawing conclusions and using critical thinking to connect and synthesize information within and across text, ideas, and concepts.”
14
Shortcomings in the Student Reports
Several undefined terms on the displays: percentile, prompt, z score, performance category, achievement level, and more.
Basically, the reports are crowded!
15
Two Ideas for Score Reports
Bench-marking is one of our favorites and the most promising:
Capitalizes on item response theory (IRT): strong modeling of data, and items and candidates being reported on the same scale.
Researchers have been slow to take advantage of this.
16
Bench-Marking Solution: Makes Scale Scores More Meaningful
Place boundary points on the reporting scale
Choose a probability associated with “knowing/can do”, say, 65%.
Use the ICCs from IRT to develop descriptions of what examinees can and cannot do between boundary points.
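The three bench-marking steps above can be sketched in a few lines. This is a minimal illustration, not the presenters' procedure: it assumes a three-parameter logistic ICC with scaling constant D = 1.7, and the item parameters, boundary points, and function names (`icc_3pl`, `rp_location`, `band_for`) are all hypothetical. For each item it finds the scale point where the probability of success reaches the chosen value (0.65) and records which interval between boundary points that location falls in, which is where the item would help describe what examinees "can do".

```python
import math

D = 1.7  # conventional logistic scaling constant (an assumption here)

def icc_3pl(theta, a, b, c):
    """Probability of a correct response under a 3PL item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def rp_location(a, b, c, rp=0.65):
    """Ability at which the ICC reaches the chosen response probability rp."""
    if not c < rp < 1.0:
        raise ValueError("rp must lie strictly between c and 1")
    return b + math.log((rp - c) / (1.0 - rp)) / (D * a)

def band_for(theta, boundaries):
    """Index of the interval, between sorted boundary points, that contains theta."""
    return sum(theta >= cut for cut in boundaries)

# Hypothetical item parameters (a, b, c); not taken from the presentation.
items = {
    "item_1": (1.0, -1.2, 0.20),
    "item_2": (0.8, 0.1, 0.25),
    "item_3": (1.4, 1.0, 0.15),
}
boundaries = [-1.0, 0.0, 1.0]  # hypothetical boundary points on the ability scale

locations = {name: rp_location(*params) for name, params in items.items()}
bands = {name: band_for(theta, boundaries) for name, theta in locations.items()}

for name in items:
    print(f"{name}: P=0.65 at theta={locations[name]:.2f}, band {bands[name]}")
```

Items grouped by band can then be reviewed by content experts to write the "can do" descriptions for each interval of the scale.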
17
(3P) Item Characteristic Curve (ICC)
[Figure: three-parameter ICC. Horizontal axis: Ability (-3 to 3). Vertical axis: Probability of Correct Response (0.0 to 1.0); a Frequency label also appears. Labels A and B are marked on the plot.]
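For reference, a three-parameter ICC of this kind is conventionally written as below. The notation is the standard textbook one and is an assumption here, since the slide shows only the curve: a_i is the item's discrimination, b_i its difficulty, c_i its lower asymptote (pseudo-guessing), and D a scaling constant, often set to 1.7.

```latex
P_i(\theta) = c_i + \frac{1 - c_i}{1 + \exp\left[-D\,a_i\,(\theta - b_i)\right]}
```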
18
Item Characteristic Curves for 60 Items
[Figure: ICCs for 60 items. Horizontal axis: Proficiency Scale (-3 to 3). Vertical axis: Expected Score (on the 0-1 metric), 0.00 to 1.00. A line marks P = 0.65; labels W, N, P appear on the plot.]
Reporting Category   Items   Points
Topic 1              13      16
Topic 2              18      21
Topic 3              9       12
Topic 4              8       11
Topic 5              12      15
19
Making Score Scales More Meaningful
[Figure. Horizontal axis: Mathematics (200 to 800). Vertical axis: Probability (0.00 to 1.00).]
20
Making Score Scales More Meaningful
[Figure. Horizontal axis: Mathematics (200 to 800). Vertical axis: Probability (0.00 to 1.00). A line marks probability 0.65; labels B, P, A appear, with scale points 400, 500, and 600 marked.]
21
Making Score Scales More Meaningful
[Figure. Horizontal axis: Mathematics (200 to 800). Vertical axis: Probability (0.00 to 1.00). Probability 0.65 is marked at scale point 400.]
22
Making Score Scales More Meaningful
[Figure. Horizontal axis: Mathematics (200 to 800). Vertical axis: Probability (0.00 to 1.00). Probability 0.65 is marked at scale point 500.]
23
Making Score Scales More Meaningful
[Figure. Horizontal axis: Mathematics (200 to 800). Vertical axis: Probability (0.00 to 1.00). Probability 0.65 is marked at scale point 600.]
24
Making Score Scales More Meaningful
[Figure. Horizontal axis: Mathematics (200 to 800). Vertical axis: Probability (0.00 to 1.00). Probability 0.65 is marked at scale points 400, 500, and 600.]
25
Meaning of the Mathematics Scale
Scale: 200 300 400 500 600 700 800
Level 200-290: Students at this level can sometimes solve very basic problems in each of the content areas. For example, they can solve simple arithmetic problems and read simple data displays.
Level 300-390: Students at this level show a beginning ability to recall and use mathematical facts and terminology to solve basic problems. For example, they can identify the rule for a simple pattern and solve very routine geometry problems.
Level 400-490: Students at this level display the ability to solve a greater variety of basic problems in each of the content areas. For example, they can recognize relationships and solve routine problems presented in verbal, mathematical, or graphical forms.
Level 500-590: Students at this level are able to solve multi-step problems in different content areas and can make connections between content areas. For example, they can solve multi-step percent problems and can use algebraic skills to solve geometry problems.
Level 600-690: Students at this level show a clear increase in ability to solve more demanding problems, to generalize, to understand mathematical terminology, and to make connections. For example, they can solve complex counting problems involving permutations/combinations, generalize complex patterns, and solve multi-step problems involving geometric/algebraic relationships.
Level 700-790: Students at this level have the ability to apply insight, reasoning, and problem solving strategies to solve a wide range of problems both within and across the content areas. For example, they can solve problems involving newly-defined functions in more than two variables and can solve conditional probability problems by constructing and analyzing a table of possible outcomes.
26
Common Diagnostic Report
Candidate results by subdomain categories (e.g. math):
Content Domain                         Score Points   Percent Correct
1. Data Analysis, Stats (20%)          1 of 10        10%
2. Geometry (10%)                      6 of 8         75%
3. Measurement (20%)                   9 of 12        75%
4. Number Sense/Operations (15%)       4 of 9         44%
5. Patterns (35%)                      4 of 22        18%
(Percent correct is displayed on a 0% to 100% axis.)
27
Highly Problematic Report!!
No sense of measurement error
No guarantee that the items are representative
No basis for score interpretation
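To give a feel for how much imprecision such a report hides, here is a rough sketch that attaches an approximate error band to each percent-correct figure in the common diagnostic report. It assumes a simple binomial model (each score point treated as an independent trial with the same success probability) and a normal approximation, which is crude for subscores this short. The function name and the choice of plus or minus one standard error are illustrative assumptions, not something the presentation specifies.

```python
import math

# (points earned, points possible) as listed on the diagnostic report slide
results = {
    "Data Analysis, Stats": (1, 10),
    "Geometry": (6, 8),
    "Measurement": (9, 12),
    "Number Sense/Operations": (4, 9),
    "Patterns": (4, 22),
}

def percent_with_band(earned, possible, z=1.0):
    """Return (percent, lower, upper) using a binomial standard error.

    z=1.0 gives a band of about one standard error on each side;
    this is an assumed convention, not one stated in the slides.
    """
    p = earned / possible
    se = math.sqrt(p * (1.0 - p) / possible)
    lower = max(0.0, p - z * se)
    upper = min(1.0, p + z * se)
    return round(100 * p), round(100 * lower), round(100 * upper)

for domain, (earned, possible) in results.items():
    pct, lower, upper = percent_with_band(earned, possible)
    print(f"{domain}: {pct}% (about {lower}% to {upper}%)")
```

Under these assumptions the 8-point Geometry subscore of 75% carries a band of roughly 60% to 90%, which is the kind of information the report leaves out.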
28
Mathematics: Your Performance Compared to Passing Students
Content Domain                         Your Performance   Passing Student Performance   Weaker / Comparable / Stronger
1. Data Analysis, Stats (20%)          10%                20%                           X
2. Geometry (10%)                      75%                60%                           X
3. Measurement (20%)                   75%                90%                           X
4. Number Sense/Operations (15%)       44%                60%                           X
5. Patterns (35%)                      18%                65%                           X
Overall Performance                    Weaker / Comparable / Stronger
Multiple Choice (70%)                  X
Constructed Response (30%)             X
29
A Better Report!!
Confidence bands
A frame of reference: performance of borderline candidates, or passing candidates, for example.
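One way the two ingredients above could combine into a weaker/comparable/stronger label is sketched here. The rule is an assumption made for illustration: a score is called comparable when the reference value falls inside the score's band. The 10-point half-width is hypothetical as well, so the labels this prints are not necessarily the ones marked on the presenters' report.

```python
def compare_to_reference(score, reference, half_width):
    """Label a score against a reference, treating score +/- half_width as its band."""
    if abs(score - reference) <= half_width:
        return "Comparable"
    return "Weaker" if score < reference else "Stronger"

# (your percent, passing-student percent) as shown on the comparison report slide
rows = {
    "Data Analysis, Stats": (10, 20),
    "Geometry": (75, 60),
    "Measurement": (75, 90),
    "Number Sense/Operations": (44, 60),
    "Patterns": (18, 65),
}

HALF_WIDTH = 10  # hypothetical band half-width, in percentage points

for domain, (yours, passing) in rows.items():
    label = compare_to_reference(yours, passing, HALF_WIDTH)
    print(f"{domain}: {yours}% vs. {passing}% -> {label}")
```

In practice the half-width would differ by domain, since shorter subscores have wider bands.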
30
Score Report Design & Evaluation
Experiments
Focus Groups
Think-alouds
Qualitative Reviews from the Field
Tryouts
31
7 Steps in Report Development
Define purpose of score report
Identify intended audience(s)
Review report examples/literature
Develop report(s)
Data collection/field test
Revise and redesign
Ongoing maintenance
32
Necessary Research
Reducing the size of error bands for knowledge/skill areas:
• improving the quality of test items
• improving the targeting of the test
• capitalizing on correlational information among the skills or other priors
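As one concrete illustration of how a prior can narrow an error band, the sketch below uses Kelley's regressed score estimate from classical test theory, with the group mean acting as the prior. This is a swapped-in textbook technique, not a method the presentation names, and the function names and numbers are hypothetical. The observed subscore is pulled toward the group mean in proportion to its unreliability, and the standard error of estimation is smaller than the standard error of measurement by a factor of the square root of the reliability.

```python
import math

def kelley_estimate(observed, group_mean, reliability):
    """Regress an observed score toward the group mean (Kelley's formula)."""
    return reliability * observed + (1.0 - reliability) * group_mean

def standard_error_of_measurement(sd, reliability):
    """Band half-width around the observed score."""
    return sd * math.sqrt(1.0 - reliability)

def standard_error_of_estimation(sd, reliability):
    """Band half-width around the regressed (Kelley) estimate."""
    return sd * math.sqrt(reliability * (1.0 - reliability))

# Hypothetical subscore: observed 40, group mean 60, SD 10, reliability 0.5
estimate = kelley_estimate(40, 60, 0.5)
sem = standard_error_of_measurement(10, 0.5)
see = standard_error_of_estimation(10, 0.5)
print(f"estimate={estimate:.1f}, SEM={sem:.2f}, SEE={see:.2f}")
```

The narrower band comes at the cost of shrinking every subscore toward the mean, which is one reason the topic needs further research.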
33
Necessary Research (cont.)
Learning to move from the ICCs, to choosing the number of performance categories, to preparing the descriptive statements that can enhance the meaning of a score scale, and validation.
34
Final Remarks
Important advances have been made in score reporting.
More research is needed on matching score reports to intended audiences, and evaluating score reports prior to use.
Diagnostic reports are important to users but need more research.
35
Final Remarks
The seven-step model should be used, and exemplar reports compiled.
We are pleased to see the developments taking place.
--States, provinces and countries are beginning to use the tools and progress can be seen.
See the NCME bibliography by Deng and Yoo with 70+ pages of references!