Centrality in a Modified-Angoff Standard Setting
PhD Dissertation – Proposal
by
Michael Scott Sommers (張夏石)
Department of Educational Psychology,
National Taiwan Normal University
Outline of the Proposal
Statement of the Problem
Purpose & Motivation
• Introduction to Standard Setting
• Angoff Standard Setting Procedure
• Problems with the Angoff Procedure
• What is Centrality?
• Research Questions
Methods & Analysis
• Materials & Participants
• Analysis
Expected Results
Statement of the Problem
Standard setting is the most widely used method to establish cutscores for high stakes examinations.
Despite this, many questions remain about how the procedure works and what exactly the meaning of the cutscore is.
This is especially true for the Angoff and related methods of standard setting.
It is widely believed that judges in an Angoff standard setting have problems judging the most difficult and the easiest items, and that this inability creates centrality in the estimates that judges are required to provide.
Aim of the study
Study the impact of questions of different difficulty, and of the different rounds of the standard setting, on panelist Centrality in a modified-Angoff procedure.
1. Question 1: Does Centrality exist in the modified-Angoff standard setting?
2. Question 2: How does Centrality change across the rounds of the modified-Angoff procedure?
3. Question 3: Is Centrality explained by differences in panelist ratings between extreme (difficult and easy) items and items of median difficulty?
Purpose & Motivation
Introduction to Standard Setting
• Standard setting is a procedure used to calculate a cutscore for a test.
• A standard is a verbal description of performance.
• Cutscores are the scores on a test needed to separate test takers into the different categories of a standard.
Common European Framework of Reference (CEFR)
C2: Can understand with ease virtually everything heard or read. Can express him/herself spontaneously, very fluently and precisely.
C1: Can understand a wide range of demanding, longer texts. Can express him/herself fluently and spontaneously.
B2: Can understand the main ideas of complex text. Can interact with a degree of fluency and spontaneity that makes interaction possible without strain.
B1: Can understand the main points of clear standard input on familiar matters. Can produce simple connected text on familiar topics.
A2: Can understand sentences and frequently used expressions related to areas of immediate relevance. Can communicate in simple and routine tasks.
A1: Can understand and use familiar everyday expressions and very basic phrases.
B1 Can understand the main points of clear standard input on familiar matters. Can produce simple connected text on familiar topics.
What is a “familiar matter” or “simple text”?
A standard setting procedure can help define these terms so they can be used to decide whether a test shows that this level has been reached.
• The Angoff standard setting procedure is one of the most widely used methods in Taiwan and the world to determine the passing score for high stakes tests.
• First suggested by William Angoff who attributed the idea to his colleague Ledyard Tucker (Cizek & Bunch, 2007).
• There are many different types of standard setting procedures. One recent review (Kaftandjieva, 2010) identified more than 60 different methods for standard setting.
Angoff Standard Setting Procedure
Judges are trained to use a description of performance called Performance Level Descriptors (PLDs) and match test items with these descriptors to create a cutscore that can be used to divide test takers into different levels of performance.
Judges are trained to use the PLDs and imagine a ‘barely proficient student’ (BPS).
Angoff, Step 1: The BPS
BPS
PLDs
Step 2: Item Functioning
TEST ITEM
Judges examine each item to assess its difficulty.
Step 3: Quantifying Estimates
BPS
“I think a BPS has a 68% chance of answering correctly.”
(Probability scale from 0 to 1, with the midpoint .50 and the judge’s estimate of .68 marked.)
Judges quantify their expectations of the outcome as probabilities.
Calculating the Cutscore
Item 1: 68
Item 2: 43
Item 3: 72
Item 4: 80
Item 5: 76
Mean = 67.8
Judge 1 Mean = 67.8
Judge 2 Mean = 72.2
Judge 3 Mean = 65
Judge 4 Mean = 75
Mean across judges = 70
Final Cutscore = 70
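The arithmetic above can be sketched in a few lines of Python, using the illustrative numbers from these slides (one judge's item estimates, then the mean across judges):

```python
# Sketch of the Angoff cutscore arithmetic, using the illustrative
# numbers from these slides.

def judge_mean(estimates):
    """Mean of one judge's item-probability estimates (0-100 scale)."""
    return sum(estimates) / len(estimates)

judge1 = [68, 43, 72, 80, 76]          # Judge 1's five item estimates
print(judge_mean(judge1))              # 67.8

judge_means = [67.8, 72.2, 65, 75]     # one mean per judge
cutscore = sum(judge_means) / len(judge_means)
print(round(cutscore, 1))              # 70.0
```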
Angoff, Step 4: Discussion & Feedback
Sharing Estimates / Discussion
Empirical P-values
Conditional ‘P-values’
% of students who would pass
Typical Angoff Procedure
Round 1 → Discussion & Feedback → Round 2 → Discussion & Feedback → Round 3 → Final Cutscore
Problems with the Angoff Procedure?
It is widely reported in standard setting and testing research that even the most experienced and well-trained judge may have problems estimating the difficulty of items on a test.
Is this true for all items?
Are judges completely wrong?
Are some items easier than others to estimate accurately?
Very difficult and very easy items, i.e. extreme items, appear to be more difficult to estimate correctly.
Items of moderate difficulty can be judged more accurately.
Judges are not using the full range of the scale.
This study is a clarification of the measurement properties associated with this problem.
Does item difficulty and the rounds of the standard setting help us understand the observed centrality of the panelists?
What is the effect of item difficulty and the rounds of a standard setting on the observed centrality in an Angoff standard setting?
What is Centrality?
Centrality is a widely accepted concept in the study of rating scales and raters that describes the clustering of rater scores around the center of a rating scale.
It has not previously been used to describe the results of an Angoff standard setting, but the similarity between the two situations suggests it could produce important results.
A wide range of definitions have been suggested.
These are reviewed in Saal, Downey, and Lahey (1980).
Their review is not very helpful.
By current standards, their conclusions about the measurement of Centrality are useless.
Why?
They include measures that are clearly measuring different things.
They include measures that have been used without discussion of what they’re measuring.
Saal, Downey, and Lahey (1980) is focused on classical measures of Centrality.
Wolfe (2004, pp. 39-40): “centrality...results in a concentration of assigned ratings in the middle of the rating scale…”
A number of related terms
“…restricted range exists when centrality is combined with leniency or harshness. That is, the restriction of range results in a restricted range around a non-central location on the rating scale. The converse of rater centrality occurs when raters tend to overuse the extreme rating scale categories - a rater effect called extremism.”
Other related terms
Central tendency – sometimes used synonymously with Centrality, sometimes used differently.
Rater effect
Rater bias
I will deal only with Wolfe’s (2004, pp. 39-40) definition: “centrality...results in a concentration of assigned ratings in the middle of the rating scale…”
Item Centrality – the ratings a single item receives from different raters are clustered near the center of the rating scale.
Rater Centrality – the ratings a single rater gives to different items are clustered near the center of the rating scale.
Items / persons:

        p1     p2     p3    …    pn
i1      ip11   ip12   ip13  …    ip1n   ← Item Centrality (ratings in one row)
i2      ip21   ip22   ip23  …    ip2n
i3      ip31   ip32   ip33  …    ip3n
…
in      ipn1   ipn2   ipn3  …    ipnn
        ↑
        Person Centrality (ratings in one column)
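The matrix view above can be sketched in Python with hypothetical ratings: Item Centrality shows up as low spread within a row (one item, all raters), Person Centrality as low spread within a column (one rater, all items).

```python
# Hypothetical sketch: rows = items, columns = raters/persons.
# Low spread in a row suggests Item Centrality; low spread in a
# column suggests Person (Rater) Centrality. Numbers are invented.
import statistics

ratings = [
    [0.52, 0.48, 0.50, 0.55],   # item 1 - ratings cluster near the center
    [0.10, 0.85, 0.40, 0.60],   # item 2 - ratings spread widely
]

def row_sd(matrix, i):
    """Spread of one item's ratings across raters."""
    return statistics.pstdev(matrix[i])

def col_sd(matrix, j):
    """Spread of one rater's ratings across items."""
    return statistics.pstdev([row[j] for row in matrix])

# Item 1's ratings cluster far more tightly than item 2's.
print(row_sd(ratings, 0) < row_sd(ratings, 1))   # True
```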
Research Questions
Aim of the study
Study the impact of questions of different difficulty, and of the different rounds of the standard setting, on panelist Centrality in a modified-Angoff procedure.
1. Question 1: Does Centrality exist in the modified-Angoff standard setting?
2. Question 2: How does Centrality change across the rounds of the modified-Angoff procedure?
3. Question 3: Is Centrality explained by differences in panelist ratings between extreme (difficult and easy) items and items of median difficulty?
Methods and Analysis
Materials & Participants
Analysis
Materials & Participants
• Reliability (Cronbach’s Alpha): ~.90 overall (> .80 for listening/reading subtests)
• Item Analysis (Rasch Fit, Point Biserials)
• Construct Validation: Factor Analysis, PCA

Linked Exam – EPT
Spring Midterm – Annual English Proficiency Test (EPT) to assess gains in student proficiency
Angoff Yes/No: Spring 2009 (97-2 academic year) EPT
Angoff: Spring 2010 (98-2 academic year) EPT
~3,000 examinees per year level; ~12,000 examinees
Angoff Standard Setting – July, 2010
Panel I, Panel II, Panel III
18 judges, all with ESL backgrounds
13 female, 5 male
14 English NNS, 4 English NS
3 administrators, 13 teachers, 3 recently-graduated TAs
Angoff Standard Setting - June, 2009
Materials: 40-item EPT Reading and 40-item EPT Listening; CEFR B1 Reading and Listening Descriptors (PLDs)
Training: 1 full day on the CEFR, EPT familiarization, and Angoff procedures (Sommers, 2012)
Procedure: 3 rounds with discussion/feedback data
Data Collection: 18 × 80 × 3 = 4,320 estimates; discussions of items during training and between rounds recorded and transcribed
Analysis
The measurement of Centrality has not been well-studied.
Some research was done in the 1970s and reviewed in Saal, Downey, and Lahey (1980).
This work was based on classical measurement and was not very useful in terms of understanding Centrality.
It suggests the following classical measures for Centrality.
• standard deviation
• distance from the mean
• kurtosis
• rater × ratee ANOVA
Measures similar to Centrality have been developed and used in different fields.
• Job performance evaluation
• Expertise evaluation
Cochran-Weis-Shanteau Index
All of these have been based on the idea that Centrality is an error and its presence indicates a problem.
Thus, Centrality is really an issue of rater accuracy.
Much of this work involves finding ways to check the accuracy of raters.
This is problematic.
These measures do not measure the same thing.
For example, standard deviation and kurtosis are very different ideas and do not always correlate very well.
More recently, Edward Wolfe and his students have developed and begun exploring measures of Centrality based on latent trait theory and Rasch Modeling.
Wolfe and his students (Wolfe, 2004; Wolfe, Moulder & Myford, 2001; Myford & Wolfe, 2003; 2004; Yue, 2011) have suggested a large number of such measures.
• Mean-square fit statistics
• Expected-residual correlations
• Ratee measures and their residuals derived from Multi-Faceted Rasch Measurement models (MFRM)
• Correlation of Rasch measures and measures from raters
• Rater slope (point biserial)
Yue (2011) tested these measures, as well as standard deviation in a series of studies using simulated data.
She found the most sensitive measures were:
• Standard deviation
• Correlation of the raw measure and the Rasch residual value
• Correlation of the Rasch expected value and the Rasch residual value
For my study, I have selected several of these measures to examine Centrality in the Angoff standard setting.
I have chosen:
• Kurtosis
• Standard deviation
• Correlation of the raw measure and the Rasch residual value, which I will call r_measure,res from here on.
I have chosen these because both standard deviation and kurtosis are well-understood classical statistics. They are easily calculated and their interpretation is straightforward.
Yue (2011) has demonstrated that standard deviation has a great deal of utility in detecting Centrality.
She did not study kurtosis. Kurtosis is easy to perform and interpret. I’m curious to see how well kurtosis does compared with standard deviation.
Pearson kurtosis is a measure of the peakedness of a distribution of scores.
It is based on the 4th moment of the distribution, typically calculated as:

Population form: Kurtosis = Σ(X − μ)^4 / (N·σ^4)

where Σ indicates the sum over all scores, X = observed value, μ = population mean, σ = population standard deviation, and N = number of scores.

Sample form: kurtosis = Σ(X − X̄)^4 / (n·s^4)

where X̄ = sample mean, s = sample standard deviation, and n = number of scores in the sample.
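The population form above can be computed directly; here is a pure-Python sketch (scipy.stats.kurtosis with fisher=False should give the same quantity):

```python
# Population (Pearson) kurtosis, computed directly from the formula above:
# the 4th central moment divided by the squared population variance.

def pearson_kurtosis(scores):
    n = len(scores)
    mu = sum(scores) / n                                # population mean
    var = sum((x - mu) ** 2 for x in scores) / n        # population variance (sigma^2)
    m4 = sum((x - mu) ** 4 for x in scores) / n         # 4th central moment
    return m4 / var ** 2                                # = m4 / sigma^4

# A uniform-looking set of scores is flatter than a normal curve (which has 3).
print(pearson_kurtosis([1, 2, 3, 4, 5]))   # 1.7
```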
I want to compare at least one latent trait measure.
I picked one of the latent trait measures from Yue (2011) because they have an advantage in interpretation.
There are no guidelines for the interpretation of kurtosis and standard deviation to determine if there is or is not Centrality.
The latent trait measure r_measure,res is designed so that:
• under perfect Centrality, the measure is −1.0
• under perfect extremism, the measure is +1.0
• with more Centrality, the measure moves from 0 toward −1.0
Rasch Model
log(P_ni1 / (1 − P_ni1)) = β_n − δ_i

where β_n is the location of person n along the underlying latent trait, δ_i is the location of item i along the same latent variable, and P_ni1 and P_ni0 are the probabilities of person n scoring 1 and 0, respectively, on item i.
Residual = X_nr − E_nr

where X_nr is the observed value for rater r and E_nr is the expected value for rater r.
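The model and residual above can be sketched in Python; the beta and delta values below are hypothetical, chosen only to illustrate the calculation:

```python
# Sketch of the dichotomous Rasch model: expected probability of a
# correct response, and the residual for an observed score.
# beta (person ability) and delta (item difficulty) are hypothetical.
import math

def rasch_p(beta, delta):
    """P(X = 1) for person ability beta and item difficulty delta."""
    return 1.0 / (1.0 + math.exp(-(beta - delta)))

def residual(observed, beta, delta):
    """Observed score minus the Rasch expected value."""
    return observed - rasch_p(beta, delta)

print(rasch_p(1.0, 1.0))        # 0.5  (ability equals difficulty)
print(residual(1, 1.0, 1.0))    # 0.5  (correct answer, expected 0.5)
```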
Expected Results
Question 1: Does Centrality exist in the modified-Angoff standard setting?
Hypothesis 1:
Centrality is a fundamental part of the Angoff standard setting procedure and will be detectable in every standard setting.
Research suggests that Item Centrality is built into the methods of the Angoff standard setting.
A decrease in standard deviation across rounds is a “common feature of standard settings” (Cizek, 2001a, p. 10) and is generally interpreted as an indicator of the validity of the particular standard setting being examined.
It is not clear whether there will be Judge Centrality. Some research indicates there might not be, but this has not been a focus of previous work and there is little work that reflects on it.
Expected Result
I expect there will be detectable Centrality in the modified-Angoff standard setting.
It is not clear if Item Centrality can be found in Round 1, but it will be detectable before the final cutscore is derived.
I make no clear predictions about Judge Centrality. Some research indicates there might not be, but this has not been a focus of previous work and there is little work that reflects on it.
Question 2: How does Centrality change across the rounds of the modified-Angoff procedure?
Hypothesis 2:
Centrality is a fundamental part of the Angoff standard setting procedure and will be detectable in every standard setting. It is produced by discussion and feedback, which are basic procedures of the modified-Angoff. As such, there will be more Centrality in the later rounds of the procedure.
Expected Results
Item Centrality should increase across rounds (i.e. the spread of estimates should decrease), as the Angoff procedure is designed to make this happen.
Once again, it is not clear if there will be Judge Centrality. Some research indicates there might not be, but this has not been a focus of previous work and there is little work that reflects on it.
Question 3: Is Centrality explained by differences in panelist ratings between extreme (difficult and easy) items and median difficulty item?
This is the central question of my project.
Hypothesis 3:
Centrality is caused by problems evaluating the most extremely easy and extremely difficult items. It is these items that should contribute the most to valid measures of Centrality.
I propose a series of calculations.
Calculation 1
• Based on measures suggested by Impara & Plake (1998), using differences in raw scores rather than classical or latent trait transformations
• The absolute value of the distance from the midpoint of the logit scale to the item’s empirical p-value or logit difficulty score
• Use this in correlations with classical measures of Centrality, such as standard deviation or kurtosis, and perhaps latent trait measures
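Calculation 1 might look like the following Python sketch. All numbers and the helper pearson_r are hypothetical illustrations, not the study's actual data:

```python
# Hedged sketch of Calculation 1: the distance of each item's empirical
# p-value from the scale midpoint (0.5), correlated with a classical
# Centrality measure (per-item SD of panelist ratings). Data invented.
import statistics

p_values = [0.15, 0.50, 0.85, 0.45]       # hypothetical empirical item p-values
rating_sds = [0.05, 0.20, 0.06, 0.18]     # hypothetical SD of panelist ratings per item

extremity = [abs(p - 0.5) for p in p_values]   # distance from the midpoint

def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# A negative r would mean extreme items receive tighter (more central) ratings.
r = pearson_r(extremity, rating_sds)
print(r < 0)   # True for these illustrative numbers
```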
Expected Results 1
I expect the modified-Angoff will show significant correlations between measures of score differences and item p-value/logit difficulty measures.
Presuming that some measures capture only part of Centrality, I propose combining several measures to see if I can create a more sensitive index.
Combine Kurtosis and Standard Deviation to see if I can use these two different measures to capture some aspect of Centrality.
Calculation 2
• The first-round panelist estimates will be subtracted from the final values in the third round
• The value for kurtosis will be positive or negative depending on whether it, and the Centrality of raters or items, is increasing or decreasing across the standard setting
• A similar operation will be performed for the standard deviation
• Combine these two measures to produce a 2×2 matrix
                        KURTOSIS across 3 rounds
                        positive             negative
ST. DEV.    smaller |  A (HIGHEST         |  B
across 3            |     CENTRALITY)     |
rounds      --------+---------------------+--------------------
            larger  |  C                  |  D (LOWEST
                    |                     |     CENTRALITY)
The A Quadrant shows the highest level of Centrality, containing those items and panelists with a decreasing standard deviation and increasingly positive kurtosis.
The D Quadrant shows the least Centrality, containing those items and panelists with an increasing standard deviation and an increasingly negative kurtosis.
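The quadrant assignment can be sketched as a small Python function; the function name and the use of simple sign tests on the Round 3 minus Round 1 changes are my assumptions:

```python
# Hypothetical sketch of the 2x2 classification. Inputs are the
# Round 3 minus Round 1 changes in SD and in kurtosis for one item
# or one judge; the quadrant letters follow the matrix above.

def quadrant(delta_sd, delta_kurtosis):
    """A = SD shrinks, kurtosis rises (most Centrality);
    D = SD grows, kurtosis falls (least Centrality)."""
    if delta_sd < 0:                       # spread decreased across rounds
        return 'A' if delta_kurtosis > 0 else 'B'
    return 'C' if delta_kurtosis > 0 else 'D'

print(quadrant(-0.04, 0.8))   # A  (tighter and more peaked: most central)
print(quadrant(0.03, -0.5))   # D  (wider and flatter: least central)
```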
I can perform this for each item and for each judge, then identify which items and which judges show the most Centrality.
Are the items with the greatest Centrality also the items with the most extreme p-values / logit difficulty scores (Rasch difficulty scores)?
Expected Results 2
I predict the items with the most extreme difficulty values (easy and difficult) will fall in Quadrant A.
The pattern for judges is not theoretically important for this study.
Questions?