SETTING CUTSCORES FOR CERTIFICATION EXAMS
Nathan A. Thompson, Ph.D.
Vice President, ASC
Adjunct Faculty, University of Cincinnati

Introduction to standard setting (cutscores)



Page 2: Introduction to standard setting (cutscores)

Why are cutscores necessary?

As Glaser (1963) pointed out, the reason for the existence of many tests is to make decisions about people:

Mastery: Pass/Fail on educational content

Credentialing: Award/not award a professional credential (certification, certificate, license)

Pre-employment: Hire/not hire for a job (or deem eligible as a candidate)

University selection: Admit/not admit to a university or program

Page 3: Introduction to standard setting (cutscores)

From Livingston (1980), discussing the rationale for cutscores:

Page 4: Introduction to standard setting (cutscores)

Why are cutscores necessary?

What does that mean? Most of what we want to measure lies on a continuum (knowledge, intelligence) and does not fall naturally into “states” (e.g., male/female)

So we need to set a cutscore (or cutscores) on the continuum to sort examinees into groups that reflect interpretations and meanings that are useful to us

Pass is “qualified” and Fail is “unqualified”

Page 5: Introduction to standard setting (cutscores)

How do we set a cutscore?

As the Livingston excerpt notes, all cutscores involve a level of subjectivity or arbitrariness

The higher the stakes of the exam, the more we need to reduce the arbitrariness

Standard setting methods differ in their level of objectivity

A more objective method provides an anchor to validity and defensibility

Page 6: Introduction to standard setting (cutscores)

How do we set a cutscore?

Approach                Example                                  Arbitrariness
Arbitrary round number  70% of items                             MOST
Quota                   Whatever passes 85% of people (z=-1.0)   MOST
Examinee-based          Borderline, Contrasting Groups           LEAST
Content-based           Angoff, Bookmark                         LEAST
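The two "most arbitrary" approaches in the table reduce to simple arithmetic, which the following minimal sketch illustrates. The function names, the choice to round a fractional cutscore up, and the tie handling are assumptions made for illustration, not part of the original slides. Note that z = -1.0 on a normal distribution passes roughly 84% of examinees, which the slide rounds to 85%.

```python
import math
from statistics import NormalDist


def round_number_cutscore(n_items, percent=70):
    """Raw cutscore for an arbitrary percent-correct rule (e.g. 70% of items).

    Rounding up is an assumption; a program could equally round down.
    """
    return math.ceil(n_items * percent / 100)


def quota_cutscore(scores, pass_rate):
    """Score of the last examinee admitted when the top `pass_rate` share passes.

    Examinees tied at the cutscore all pass, so the realized pass rate
    can be slightly higher than requested.
    """
    ranked = sorted(scores, reverse=True)
    # round() guards against floating-point noise before taking the ceiling
    n_pass = max(1, math.ceil(round(pass_rate * len(ranked), 9)))
    return ranked[n_pass - 1]


# Share of a normal score distribution at or above z = -1.0 (about 0.84)
share_passing_at_z = 1 - NormalDist().cdf(-1.0)
```

Neither function looks at what the items measure, which is why both approaches sit at the "MOST" end of the arbitrariness column.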

Page 7: Introduction to standard setting (cutscores)

Examinee-based methods

Borderline Method

Experts familiar with both the content and all of the examinees identify those examinees they consider “borderline”

The mean or median score for those examinees is the cutscore

Contrasting Groups Method

Experts familiar with both the content and all of the examinees sort examinees into Pass and Fail groups (or an external criterion is used)

The point where the two score distributions cross is the cutscore
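Both examinee-based methods can be sketched in a few lines. The function names and sample data below are hypothetical. For Contrasting Groups, the sketch does not locate the crossing point graphically; as a stand-in, it picks the candidate cutscore that misclassifies the fewest examinees, which is a common numerical proxy for where the two distributions cross.

```python
from statistics import mean, median


def borderline_cutscore(borderline_scores, use_median=True):
    """Cutscore as the median (or mean) score of examinees judged borderline."""
    return median(borderline_scores) if use_median else mean(borderline_scores)


def contrasting_groups_cutscore(pass_scores, fail_scores):
    """Cutscore that minimizes misclassification between the two expert-sorted groups.

    A score at or above the cutscore counts as passing. Ties between
    candidate cutscores are resolved in favor of the lowest one.
    """
    candidates = sorted(set(pass_scores) | set(fail_scores))

    def errors(cut):
        false_fail = sum(s < cut for s in pass_scores)    # experts said Pass, test says Fail
        false_pass = sum(s >= cut for s in fail_scores)   # experts said Fail, test says Pass
        return false_fail + false_pass

    return min(candidates, key=errors)


# Hypothetical data for illustration only
borderline = [62, 65, 66, 70, 71]
expert_pass = [70, 72, 75, 80, 85, 90]
expert_fail = [50, 55, 60, 65, 68, 74]

print(borderline_cutscore(borderline))
print(contrasting_groups_cutscore(expert_pass, expert_fail))
```

With small groups, the result is sensitive to individual examinees, so in practice the distributions are often smoothed before the crossing point is read off.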

Page 8: Introduction to standard setting (cutscores)

Examinee-based methods

Are conceptually appealing but have two large disadvantages:

Require examinees to take the test first, so pass/fail decisions cannot be made immediately after they finish the test

Require a way to assign examinees into groups WITHOUT test scores: either experts who are familiar with all examinees or some sort of “gold standard”

Example: For a practice test, results on the real test can be used as a gold standard to set the cutscore

Page 9: Introduction to standard setting (cutscores)

Content-based methods

The Angoff and Bookmark methods require experts to look at items rather than candidates

Bookmark: pilot all items, analyze difficulty statistics, order the items by difficulty in a booklet, and ask experts to place a bookmark

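Angoff: All experts provide a rating from 0 to 100 for each item (the percentage of minimally qualified candidates expected to answer it correctly); the average rating serves as the cutscore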
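The Angoff averaging step can be illustrated with a short sketch. The function name, the data layout (one list of 0 to 100 ratings per expert), and the sample ratings are assumptions for illustration; the slides only state that the average of the ratings serves as the cutscore.

```python
from statistics import mean


def angoff_cutscore(ratings):
    """Average expert ratings into a cutscore.

    ratings: one list per expert, each holding a 0-100 rating per item.
    Returns (percent_cutscore, raw_cutscore), where the raw cutscore is
    the expected number of items answered correctly.
    """
    n_items = len(ratings[0])
    if any(len(expert) != n_items for expert in ratings):
        raise ValueError("every expert must rate every item")

    # Average across experts for each item, then across items
    item_means = [mean(item_ratings) for item_ratings in zip(*ratings)]
    percent_cutscore = mean(item_means)
    raw_cutscore = sum(item_means) / 100
    return percent_cutscore, raw_cutscore


# Hypothetical panel: 3 experts rating 4 items
panel = [
    [60, 70, 80, 90],
    [50, 70, 90, 90],
    [70, 70, 70, 90],
]
percent_cut, raw_cut = angoff_cutscore(panel)
```

Keeping the per-item means around is useful in practice, because items on which experts disagree widely are usually discussed and re-rated before the cutscore is finalized.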

Page 10: Introduction to standard setting (cutscores)

Content-based methods

The Angoff method is the most commonly used approach in certification testing and therefore quite legally defensible

Biggest advantage: does not require the test to be administered to collect data

Can also use data, via the Beuk Compromise, to incorporate examinee-based aspects

The drawback is that it requires a group of subject matter experts to rate all items, which can take time

Page 11: Introduction to standard setting (cutscores)

Content-based methods

The Bookmark method has the advantage that a rating is not required for every item from every expert (which takes a lot of time)

The drawback is that it requires all items to be delivered to a decent-sized sample in order to obtain item difficulty statistics (might not be feasible)
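The Bookmark workflow (pilot, compute difficulty, order the booklet, collect bookmarks) can be sketched as follows. The slides do not say how a bookmark placement becomes a cutscore, and operational studies typically go through an item response theory ability scale. The simplification used here, treating the number of items before the bookmark as a raw cutscore and taking the median across experts, is an assumption, as are the function names and sample data.

```python
from statistics import median


def ordered_item_booklet(p_values):
    """Order item IDs from easiest to hardest.

    p_values: dict mapping item ID to proportion correct in the pilot
    sample (a higher proportion means an easier item).
    """
    return sorted(p_values, key=lambda item: p_values[item], reverse=True)


def bookmark_cutscore(items_before_bookmark):
    """Median across experts of the number of items placed before the bookmark.

    Simplifying assumption: that count is used directly as a raw cutscore.
    """
    return median(items_before_bookmark)


# Hypothetical pilot statistics and expert bookmarks
pilot = {"Q1": 0.45, "Q2": 0.90, "Q3": 0.70, "Q4": 0.30, "Q5": 0.80}
booklet = ordered_item_booklet(pilot)
cut = bookmark_cutscore([3, 3, 4])
```

The sorting step is what makes the pilot sample necessary: without observed difficulty statistics, there is no defensible order in which to present the items.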