© Fariba Chamani, 2015. Glen Fulcher (2010), Practical Language Testing, Chapter 5: Aligning Tests to Standards

Aligning tests to standards


Content of this chapter
• It's as old as the hills
• The definition of 'standards'
• The uses of standards
• Unintended consequences revisited
• Using standards for harmonization and identity
• How many standards can we afford?
• Performance level descriptors (PLDs) and test scores
• Some initial decisions
• Standard-setting methodologies
• Evaluating standard setting
• Training
• The special case of the CEFR
• You can always count on uncertainty

It's as old as the hills. Standard setting = the process of establishing one or more cut scores on examinations.

Standards-based assessment = Using tests to assess learner performance and achievement in relation to an absolute standard

• A development of criterion-referenced testing
• Uses large-scale standardized tests
• Pre-dates the criterion-referenced testing movement

Definition of 'standard'

Standard = a level of performance required or experienced (Davies et al., 1999).

Example: The standard required for entry to the university is an A in English.

The uses of standards
• Educational purposes (achievement tests)
• Professional purposes (certification of aircraft engineers)
• Political purposes (NCLB, the No Child Left Behind Act, & AYP, Adequate Yearly Progress)
• Immigration policy purposes

Unintended consequences: In the case of NCLB, the ELL (English language learner) group always falls below the standard, and resources are not channeled to where they are most needed.

Mandatory use of English in tests of content subjects puts pressure on the indigenous people to abandon education in their own language

The use of language tests for immigration leads to fraudulent practices and short-term 'paper marriages'.

Using standards for harmonization & identity

To enforce conformity to a single model that helps to create and maintain political unity and identity

Examples: the Carolingian empire of Charlemagne (CE 800–814)

CEFR (Now)

Carolingian empire of Charlemagne Within the empire of Charlemagne in Central and Western Europe various groups followed different calendars and the main Christian festivals fell on different dates

In order to bring uniformity, Charlemagne set a new standard for 'computists', who worked out the times of the festivals. They were required to pass a test in order to get their certificate.

There are no 'correct answers' for the questions in the Carolingian test; they are scored as 'correct' because they are defined as such by the standard, and the standard is arbitrarily chosen with the intention of harmonizing practice.

CEFR (Common European Framework of Reference)

CEFR = a set of standards (six-level scales and their descriptors) that provides a European model for language testing and learning, to enhance European identity and harmonization.

Teachers should align their curriculum and tests to CEFR standards ('linking'); otherwise many European institutions will not recognize the certificates they award.

Problems with the CEFR
• It drains creativity among teachers.
• The same set of standards is used for all people, across different contexts, with different purposes.
• Validation is based on linking the test to the CEFR; this runs against validity theories.

The use of standards and tests for harmonization ultimately leads to a desire for more control

How many standards can we afford?

The number of performance levels depends on the goals and the use of the test

Choosing the fewest performance levels (pass or fail) is ideal, because the more numerous the classes, the greater the danger that a small difference in marks will place a test taker in the wrong class.

The Index of Separation estimates the number of performance levels into which a test can reliably place test takers.

Sometimes we have to use numerous categories to motivate young learners

Performance level descriptors (PLDs) & test scores

PLDs are often developed using intuitive and experiential methods, and the labels and descriptors are simple reflections of the values of policy makers.

There are typically around four levels: 'advanced – proficient – basic – below basic'.

The PLDs provide a conceptual hierarchy of performance that is an indication of the ability or knowledge of the test taker

Standard-setting is the process of deciding on a cut score for a test to mark the boundary between two PLDs. If we have two performance levels (pass and fail), we'll need a single cut score.

Standards-based tests, CRT & scoring rubrics

It is said that the tests used in standards-based testing are criterion-referenced; yet for Glaser the criterion was the domain, and it had nothing to do with standard setting and classification.

The standards-based testing movement has interpreted lsquocriterionrsquo to mean lsquostandardrsquo

The focus within PLDs is on the general levels of competence proficiency or performance while scoring rubrics address only single items

Some initial decisions: All standard-setting methods involve expert judgemental decision-making at some level (Jaeger, 1979).

Decision 1: Compensatory or non-compensatory marking? In compensatory marking, strength in other areas 'compensates' for weakness in one area.

Decision 2: What classification errors can you tolerate?

Decision 3: Are you going to allow test takers who 'fail' a test to retake it? If so, what time lapse is required before retaking the test?


Classification of standard-setting methodologies

(Cycle diagram: test-centered, criterion-referenced, norm-referenced, examinee-centered)

Test-centered methods:
• Angoff
• Ebel
• Nedelsky
• Bookmark

Examinee-centered methods:
• Method of Contrasting Groups
• Method of Borderline Group

Common process of standard setting
1. Select an appropriate standard-setting method, depending upon the purpose of the standard setting, the available data, and personnel.
2. Select a panel of judges based upon explicit criteria.
3. Prepare the PLDs and other materials as appropriate.
4. Train the judges to use the method selected.
5. Rate items or persons; collect and store data.
6. Provide feedback on ratings and initiate discussion, so that judges can explain their ratings, listen to others, and revise their views or decisions before another round of judging.
7. Collect final ratings and establish cut scores.
8. Ask the judges to evaluate the process.
9. Document the process in order to justify the conclusions reached.

Test-centered methods: The judges are presented with individual items or tasks, and are required to make a decision about the expected performance on them by a test taker who is just below the border between two standards.

Angoff method: Experts are given a set of items and rate the probability that a hypothetical borderline learner would answer each test item correctly.

The average of these probabilities across judges or raters is the cut score.

If the test contains polytomous items or tasks, the proportion of the maximum score is used instead of the probability (the modified Angoff method).
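As a sketch (in Python, with invented judge ratings; the chapter gives no numeric example), the Angoff computation looks like this:

```python
# Hypothetical Angoff ratings: for each item, each judge estimates the
# probability that a borderline candidate answers it correctly.
ratings = {
    "judge_1": [0.8, 0.6, 0.4, 0.7, 0.5],
    "judge_2": [0.7, 0.5, 0.5, 0.6, 0.6],
    "judge_3": [0.9, 0.6, 0.3, 0.7, 0.4],
}

def angoff_cut_score(ratings):
    # Summing a judge's probabilities gives that judge's expected raw
    # score for the borderline candidate; averaging across judges gives
    # the panel's recommended cut score.
    per_judge = [sum(probs) for probs in ratings.values()]
    return sum(per_judge) / len(per_judge)

cut = angoff_cut_score(ratings)  # raw-score cut on this 5-item test
```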

Advantages & disadvantages
+ Clarity
+ Simplicity
− Cognitive difficulty: all judges must conceptualize the borderline learner in precisely the same way, which is hard to achieve

Ebel method (2 rounds): Experts independently classify test items by
I. level of difficulty: easy, medium, hard
II. level of relevance: essential, important, acceptable, questionable

Ebel method: The judges estimate the percentage of items in each cell that a borderline test taker would get correct. The percentage for each cell is then multiplied by the number of items in that cell: so if the 'easy/essential' cell has 20 items and the estimate is 85%, then 20 × 85 = 1700.

These numbers for each of the 12 cells are added up and then divided by the total number of items, giving the cut score for a single judge.

Finally, these are averaged across judges to give a final cut score.

All items can thus be classified into 12 cells in a 3×4 grid defined by the three difficulty and four relevance categories, as in the example:

For each expert and each category: A = number of items in the category; B = % of items performed correctly; A×B.

Category            | Expert 3 (A, B, A×B) | Expert 4 (A, B, A×B) | Expert 5 (A, B, A×B)
Essential/Easy      | 11, 60, 660          | 10, 70, 700          | 13, 75, 975
Essential/Medium    | 1, 25, 25            | 3, 25, 75            | 1, 0, 0
Essential/Hard      | 0, 10, 0             | 1, 0, 0              | 0, 0, 0
Questionable/Easy   | 0, 0, 0              | 0, 0, 0              | 0, 0, 0
Questionable/Medium | 0, 0, 0              | 0, 0, 0              | 0, 0, 0
Questionable/Hard   | 0, 0, 0              | 0, 0, 0              | 0, 0, 0
Mean                | 25.1                 | 26.7                 | 35

Mean for all experts: 28
Cut-score: 12
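The per-judge arithmetic can be sketched as follows (Python; only the 20-item 'easy/essential' cell comes from the text, the other cell counts and percentages are invented, and cells with zero items are omitted):

```python
# (difficulty, relevance) -> (number of items A, expected % correct B)
cells = {
    ("easy", "essential"):   (20, 85),  # the 20 x 85 = 1700 example
    ("medium", "essential"): (10, 50),  # illustrative value
    ("hard", "essential"):   (5, 20),   # illustrative value
}

def ebel_cut_score(cells):
    total_items = sum(a for a, _ in cells.values())
    weighted = sum(a * b for a, b in cells.values())  # sum of A x B
    return weighted / total_items  # cut score as a percentage

cut_pct = ebel_cut_score(cells)
```

Averaging these per-judge percentages across the panel then gives the final cut score.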

Problems with Ebel
• The complex cognitive requirements of classifying items according to two criteria, in relation to an imagined borderline student, may be challenging for the judges.
• As it is assumed that some items may have questionable relevance to the construct of interest, the method implicitly throws into doubt the rigor of the test development process and the validity arguments.

Nedelsky method (multiple-choice): The experts estimate which options of each multiple-choice item a borderline test taker would be able to eliminate.

In a four-option item with three distractors, if a candidate can eliminate all 3 distractors, the chance of getting the item right is 1 (100%); but if he can only rule out 1 of the distractors, the chance of answering the item correctly is 1 in 3 (33%).

These probabilities are averaged across all items for each judge, and then across all judges, to arrive at a cut score.
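A sketch of this calculation (Python; the per-item elimination counts are invented):

```python
N_OPTIONS = 4  # four-option multiple-choice items

def item_probability(n_eliminated):
    # Chance of a correct random guess among the remaining options.
    return 1.0 / (N_OPTIONS - n_eliminated)

# One judge's estimates of how many distractors the borderline
# candidate can eliminate on each of four items:
eliminated = [3, 1, 2, 0]
probs = [item_probability(e) for e in eliminated]  # 1.0, 1/3, 0.5, 0.25

# Average across items for this judge; averaging these values across
# judges (and scaling by the number of items) gives the cut score.
judge_cut = sum(probs) / len(probs)
```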

Problems with the Nedelsky method

It assumes that test takers answer multiple-choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options. However, it is highly unlikely that test takers answer items in this way.

The Nedelsky method tends to produce lower cut scores than other methods, and is therefore likely to increase the number of false positives.

Bookmark method: essential materials
• Directions to Bookmark participants
• Ordered item booklet
• Booklet guideline
• Student exemplar papers
• Scoring guide

Basic steps of the procedure

Round I: Experts are informed about the number of cut-scores to establish; they work in small groups, and all the essential material is introduced to them.

Round II: Overview of the cut-scores established by every expert; the same procedure as in the first round is repeated.

Round III: Presentation of the percentage of students falling into each performance level and of each median cut-score from Round 2; after discussion, individual judgments are made.

Procedures in the Bookmark method

Judges are presented with the necessary materials. They are then asked to keep in mind a borderline student and place a 'bookmark' in the booklet between two items, such that the candidate is more likely to answer the items below the bookmark correctly and the items above it incorrectly.

The bookmarks are discussed in the group, and finally the median of the bookmarks for each cut point is taken as that group's recommendation for that cut-point.
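The final aggregation step is just a median; a minimal sketch with invented bookmark positions:

```python
from statistics import median

# Positions (in the ordered item booklet) where five judges placed
# their bookmarks for one cut point -- invented values.
bookmarks = [23, 25, 21, 27, 24]

cut_point = median(bookmarks)  # the group's recommendation: 24
```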

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard; the test is then administered to the test takers to discover where the cut score should lie.

Borderline group method: The judges define what borderline candidates are like, and then identify borderline candidates who fit the definition.

Once the students have been placed into groups, the test can be administered. The median score of the group defined as borderline is used as the cut score.

The main problem: the cut score is dependent upon the particular group used in the study.
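A sketch of the cut-score step (Python; the borderline group's scores are invented):

```python
from statistics import median

# Test scores of candidates whom the judges classified as borderline.
borderline_scores = [54, 61, 58, 49, 63, 57]

cut_score = median(borderline_scores)  # 57.5 for these invented scores
```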

Method of contrasting groups: The procedure involves testing two groups of examinees.
• The classification into groups must be done using independent criteria, such as teacher judgments.
• The test is then given and the score distributions are calculated. There are likely to be overlaps in the distributions.
• The cut score is placed in the region where the two distributions overlap.

(Figure: overlapping score distributions for the 'competent' and 'non-competent' groups)
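One common way to place the cut in the overlap region is to minimize classification errors; a sketch with invented scores (this particular error-minimizing rule is an assumption, not spelled out in the chapter):

```python
# Scores of two independently classified groups -- invented values.
competent     = [62, 65, 58, 70, 61, 67, 55]
non_competent = [48, 52, 57, 45, 60, 50, 54]

def best_cut(competent, non_competent):
    # Try every observed score as a cut and count the decision errors
    # it would produce; return the cut with the fewest errors.
    def errors(cut):
        false_neg = sum(s < cut for s in competent)       # competent, but fails
        false_pos = sum(s >= cut for s in non_competent)  # non-competent, but passes
        return false_neg + false_pos
    return min(sorted(set(competent + non_competent)), key=errors)

cut = best_cut(competent, non_competent)
```

Counting false positives and false negatives explicitly like this is what makes the contrasting-groups approach amenable to the decision-error analysis mentioned below.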

Which method is the 'best'?

It depends on what kind of judgments you can get for your standard-setting study, and on the quality of the judges that you have available.

However, the contrasting-groups approach is recommended where possible, because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores.

The problem is getting the judgments of a number of people on a large group of individuals

Evaluating standard setting (Kane, 1994)

Procedural evidence
• What procedures were used for the standard setting, to ensure that the process is systematic?
• Were the judges properly trained in the methodology and allowed to express their views freely?

Internal evidence
• Deals with the consistency of the results arising from the procedure.
• It also estimates the extent of agreement between judges (e.g. Cohen's kappa).

External evidence
• Correlation of the scores of learners in a borderline-group study with some other test of the same construct.
• High correlation = the established cut scores are defensible.
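Cohen's kappa can be computed directly from two judges' classifications; a minimal sketch with invented pass/fail judgments:

```python
def cohens_kappa(a, b):
    # Observed agreement corrected for the agreement expected by chance.
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    p_exp = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_obs - p_exp) / (1 - p_exp)

judge_a = ["pass", "pass", "fail", "pass", "fail", "fail"]
judge_b = ["pass", "fail", "fail", "pass", "fail", "pass"]

kappa = cohens_kappa(judge_a, judge_b)  # 1/3 for these invented ratings
```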

Training: a critical part of standard setting

Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback

Different views may lead to disagreements among the judges. Training should not be designed to eliminate these variations, but to allow free discussion among the judges; if the judges do not converge, the outcome should be accepted by the researchers.

The training process should not force agreement ('cloning'), because removing the judges' individuality and inducing agreement is a threat to validity.

The special case of the CEFR

• The CEFR Manual contains performance level descriptors for standard setting, in order to introduce a common language and a single reporting system into Europe.
• It recommends five processes to 'relate' language examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). These processes are: familiarization, specification, standardization training/benchmarking, standard-setting, and validation.
• Familiarization, standard-setting, and validation are uncontentious, because they reflect common international assessment practice that is not unique to Europe; the other two sections, however, are problematic.

PLDs in the CEFR & in other standards-based systems

In the CEFR:
• The use of PLDs is institutionalized, and their meaning is generalized across nations.
• Standardization facilitates 'the implementation of a common understanding' of the CEFR, and training is cloning rather than familiarization.
• Benchmarking = the process of rating individual performance samples using the CEFR PLDs.
• Standard-setting = 'mapping' the existing cut scores from tests onto CEFR levels.

In other standards-based systems:
• PLDs are evaluated in terms of their usefulness and meaningfulness; they can be discarded or changed.
• Standardization and training ensure that everyone understands the standard-setting method, yet judgments are freely made.
• Benchmarking = the typical performances that are identified after standard-setting.
• Standard-setting = establishing cut scores on tests.

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus, rather than being forced to see the world through a single lens. Used in this way, standards are never fixed, monolithic edifices; they are open to change, and even rejection, in the service of language education.

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals

Thank You For Your Attention

Page 2: Aligning tests to standards

Content of this chapterItrsquos as old as the hillsThe definition of lsquostandardsrsquoThe uses of standards Unintended consequences revisited Using standards for harmonization and identity How many standards can we afford Performance level descriptors (PLDs) and test scores Some initial decisions Standard-setting methodologies Evaluating standard setting Training The special case of the CEFR You can always count on uncertainty

Itrsquos as old as the hills Standard setting = The process of establishing one or more cut scores on examinations

Standards-based assessment = Using tests to assess learner performance and achievement in relation to an absolute standard

A development of criterion-referenced testing Using large-scale standardized tests Pre-dating the criterion-referenced testing move

Definition of lsquostandardrsquo

Standard = a level of performance required or experienced (Davies et al 1999)

Example The standard required for entry to the university is an A in English

The uses of standards Educational purposes (achievement tests)

Professional purposes (certification of aircraft engineers)

Political purposes (NCLB amp AYP)

Immigration Policy purposes

Unintended consequences In case of NCLB ELL group is always lower than the standard amp resources are not channeled to where they are most needed

Mandatory use of English in tests of content subjects puts pressure on the indigenous people to abandon education in their own language

The use of language tests for immigration leads to fraudulent practices amp short-term paper marriages

Using standards for harmonization amp identity

To enforce conformity to a single model that helps to create and maintain political unity and identity

ExamplesCarolingian empire of Charlemagne (CE 800ndash814)

CEFR (Now)

Carolingian empire of Charlemagne Within the empire of Charlemagne in Central and Western Europe various groups followed different calendars and the main Christian festivals fell on different dates

In order to bring uniformity Charlemagne set a new standard for lsquocomputistsrsquo who worked out the time of festivals They required to pass a test in order to get their certificate

There are no lsquocorrect answersrsquo for the questions in the Carolingian test they are scored as lsquocorrectrsquo because they are defined as such by the standard and the standard is arbitrarily chosen with the intention of harmonizing practice

CEFR (Common European Framework of Reference )

CEFR = A set of standards (six-level scales and their

descriptors ) that provides a European model for language

testing and learning to enhance European identity and

harmonization

Teachers should align their curriculum and tests to CEFR

standards (Linking) otherwise many European institutions

will not recognize the certificate they awarded

Problems with CEFRIt drains creativity among teachers

The same set of standards are used for all people across different contexts with different purposes

Validation is based on linking the test to the CEFR This is against validity theories

The use of standards and tests for harmonization ultimately leads to a desire for more control

How many standards can we afford

The number of performance levels depends on the goals and the use of the test

Choosing the fewest performance levels (pass or fail) is ideal because the more numerous the classes the greater will be the danger of a small difference in marks

Index of Separation estimates the number of performance levels into which a test can reliably place test takers

Sometimes we have to use numerous categories to motivate young learners

Performance level descriptors (PLDs) amp Test scores

PLDs are often developed based on intuitive and experiential method amp the labels and descriptors are simple reflections of the values of policy makers

There are around four levels lsquoadvanced ndash proficient ndash basic ndash below basicrsquo

The PLDs provide a conceptual hierarchy of performance that is an indication of the ability or knowledge of the test taker

Standard-setting is the process of deciding on a cut score for a test to mark the boundary between two PLDs If we have two performance levels (pass and fail) wersquoll need a single cut score

Standard based tests CRT amp scoring rubrics

It is said that tests used in standards-based testing are criterion- referenced yet for Glaser the criterion was the domain and it does not have anything to do with standard setting and classification

The standards-based testing movement has interpreted lsquocriterionrsquo to mean lsquostandardrsquo

The focus within PLDs is on the general levels of competence proficiency or performance while scoring rubrics address only single items

Some initial decisions All standard setting methods involve expert judgemental decision making at some level (Jaegar 1979)

Decision 1 Compensatory or non-compensatory marking The strength in other areas lsquocompensatesrsquo for the weakness in one area

Decision 2 What classification errors can you tolerate

Decision 3 Are you going to allow test takers who lsquofailrsquo a test to retake it If so what time lapse is required to retake the test

Second Page Lorem ipsum dolor sit amet consectetur adipisicing elit sed do eiusmod tempor incididunt ut labore et dolore magna aliqua Ut enim ad minim veniam quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur Excepteur sint occaecat cupidatat non proident sunt in culpa qui officia deserunt mollit anim id est laborum

Cycle Diagram

Test-centered

Criterion-referenced

Norm-referenced

Examinee-centered

Standard-Setting Methods

Classification of

Standard-setting methodologies

Test-centered

bull Angoffbull Ebelbull Nedelskybull Bookmark

Examinee-centered

bull Method of Contrasting Groups

bull Method of Borderline group

Common process of standard setting

Select an appropriate standard setting method depending upon the purpose of the standard setting available data and personnel

Select a panel of judges based upon explicit criteria Prepare the PLDs and other materials as appropriate Train the judges to use the method select Rate items or persons collect and store data Provide feedback on rating and initiate discussion for judges to

explain their ratings listen to others and revise their views or decisions before another round of judging

Collect final ratings and establish cut scores Ask the judges to evaluate the process Document the process in order to justify the conclusions reached

Test-centered methods The judges are presented with individual items or tasks and required to make a decision about the expected performance on them by a test taker who is just below the border between two standards

Angoff method Experts are given a set of items and they need to rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly

The average of these probabilities across judges or raters is the cut score

If the test contains polytomous items or tasks the proportion of the maximum score is used instead of the probability (modified Angoff)

Advantages amp disadvantages

Clarity

Simplicity

Cognitive difficulty in conceptualizing the borderline learner by all judges in precisely the same way

+ -

Ebel method 2 Rounds Experts classify independently test items by

I level of difficulty

II level of relevance

easy medium hard

essential important acceptable questionabl

e

Ebel method The judges estimate the percentage of items a borderline

test taker would get correct for each cell Then the percentage for each cell is multiplied by the

number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700

These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge

Finally these are averaged across judges to give a final cut score

All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example

categories Expert 3 Expert 4 Expert 5 Number of items

in a category

(А)

correctly performed

items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

EssentialEasy 11 60 660 10 70 700 13 75 975

Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0

Questionable

Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0

Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35

Mean for all experts 28

Cut-score 12

hellip

Problems with EBELThe complex cognitive requirements of classifying items

according to two criteria in relation to an imagined borderline

student may be challenging for the judges

As it is assumed that some items may have questionable

relevance to the construct of interest it implicitly throws into

doubt the rigor of the test development process and validity

arguments

Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline

test taker would be able to eliminate

In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )

These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score

Problems with Nedelsky method

It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way

Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives

Bookmark method

Directions to Bookmark participants

Ordered item booklet

Booklet guideline

Student exemplar papers

Scoring guide

Essential materials

Standard Setting

Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments

Overview of established cut-scores by every expert repeating of the same procedure as

in the first step

Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is

introduced to them

Basic steps of the procedure

Round III

Round II

Round I

Procedures in Bookmark method

Judges are presented with the necessary materials Then they are asked to keep in mind a borderline

student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly

The bookmarks are discussed in group and finally

the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie

Borderline group method The judges define what borderline candidates are

like and then identify borderline candidates who fit the definition

Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score

The main problem the cut score is dependent upon the group being used in the study

Method of contrasting groupsProcedure includes testing of two groups of examinees

bullThe classification must be done using independent criteria such as teacher judgments

bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions

bull The cut score will be where overlap is observed in the distributions

Competent Non-competent

Which method is the lsquobestrsquo

It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available

However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores

The problem is getting the judgments of a number of people on a large group of individuals

Evaluating standard setting (Kane 1994)

Procedural evidence

bull What procedures were used for the standard-setting to ensure that the process is systematic

bull Were the judges properly trained in the methodology and allowed to express their views freely

Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )

External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible

Training a critical part of standard setting

Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback

Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers

The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity

The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard

setting in order to introduce a common language and a single reporting system into Europe

bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation

bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic

PLDs in the CEFR & in other standards-based systems

CEFR:
• The use of PLDs is institutionalized, & their meaning is generalized across nations.
• Standardization facilitates 'the implementation of a common understanding' of the CEFR, and training is cloning rather than familiarization.
• Benchmarking = the process of rating individual performance samples using the CEFR PLDs.
• Standard setting = 'mapping' the existing cut scores from tests onto CEFR levels.

Other standards-based systems:
• PLDs are evaluated in terms of their usefulness and meaningfulness; they can be discarded or changed.
• Standardization & training ensure that everyone understands the standard-setting method, yet judgments are freely made.
• Benchmarking = the typical performances identified after standard setting.
• Standard setting = establishing cut scores on tests.

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus, rather than being forced to see the world through a single lens. Used in this way, standards are never fixed, monolithic edifices; they are open to change and even rejection in the service of language education.

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals

Thank You For Your Attention

Page 4: Aligning tests to standards

Definition of lsquostandardrsquo

Standard = a level of performance required or experienced (Davies et al 1999)

Example The standard required for entry to the university is an A in English

The uses of standards Educational purposes (achievement tests)

Professional purposes (certification of aircraft engineers)

Political purposes (NCLB amp AYP)

Immigration Policy purposes

Unintended consequences In case of NCLB ELL group is always lower than the standard amp resources are not channeled to where they are most needed

Mandatory use of English in tests of content subjects puts pressure on the indigenous people to abandon education in their own language

The use of language tests for immigration leads to fraudulent practices amp short-term paper marriages

Using standards for harmonization amp identity

To enforce conformity to a single model that helps to create and maintain political unity and identity

ExamplesCarolingian empire of Charlemagne (CE 800ndash814)

CEFR (Now)

Carolingian empire of Charlemagne Within the empire of Charlemagne in Central and Western Europe various groups followed different calendars and the main Christian festivals fell on different dates

In order to bring uniformity Charlemagne set a new standard for lsquocomputistsrsquo who worked out the time of festivals They required to pass a test in order to get their certificate

There are no lsquocorrect answersrsquo for the questions in the Carolingian test they are scored as lsquocorrectrsquo because they are defined as such by the standard and the standard is arbitrarily chosen with the intention of harmonizing practice

CEFR (Common European Framework of Reference )

CEFR = A set of standards (six-level scales and their

descriptors ) that provides a European model for language

testing and learning to enhance European identity and

harmonization

Teachers should align their curriculum and tests to CEFR

standards (Linking) otherwise many European institutions

will not recognize the certificate they awarded

Problems with CEFRIt drains creativity among teachers

The same set of standards are used for all people across different contexts with different purposes

Validation is based on linking the test to the CEFR This is against validity theories

The use of standards and tests for harmonization ultimately leads to a desire for more control

How many standards can we afford

The number of performance levels depends on the goals and the use of the test

Choosing the fewest performance levels (pass or fail) is ideal because the more numerous the classes the greater will be the danger of a small difference in marks

Index of Separation estimates the number of performance levels into which a test can reliably place test takers

Sometimes we have to use numerous categories to motivate young learners

Performance level descriptors (PLDs) amp Test scores

PLDs are often developed based on intuitive and experiential method amp the labels and descriptors are simple reflections of the values of policy makers

There are around four levels lsquoadvanced ndash proficient ndash basic ndash below basicrsquo

The PLDs provide a conceptual hierarchy of performance that is an indication of the ability or knowledge of the test taker

Standard-setting is the process of deciding on a cut score for a test to mark the boundary between two PLDs If we have two performance levels (pass and fail) wersquoll need a single cut score

Standard based tests CRT amp scoring rubrics

It is said that tests used in standards-based testing are criterion- referenced yet for Glaser the criterion was the domain and it does not have anything to do with standard setting and classification

The standards-based testing movement has interpreted lsquocriterionrsquo to mean lsquostandardrsquo

The focus within PLDs is on the general levels of competence proficiency or performance while scoring rubrics address only single items

Some initial decisions All standard setting methods involve expert judgemental decision making at some level (Jaegar 1979)

Decision 1 Compensatory or non-compensatory marking The strength in other areas lsquocompensatesrsquo for the weakness in one area

Decision 2 What classification errors can you tolerate

Decision 3 Are you going to allow test takers who lsquofailrsquo a test to retake it If so what time lapse is required to retake the test

Second Page Lorem ipsum dolor sit amet consectetur adipisicing elit sed do eiusmod tempor incididunt ut labore et dolore magna aliqua Ut enim ad minim veniam quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur Excepteur sint occaecat cupidatat non proident sunt in culpa qui officia deserunt mollit anim id est laborum

Cycle Diagram

Test-centered

Criterion-referenced

Norm-referenced

Examinee-centered

Standard-Setting Methods

Classification of

Standard-setting methodologies

Test-centered

bull Angoffbull Ebelbull Nedelskybull Bookmark

Examinee-centered

bull Method of Contrasting Groups

bull Method of Borderline group

Common process of standard setting

Select an appropriate standard setting method depending upon the purpose of the standard setting available data and personnel

Select a panel of judges based upon explicit criteria Prepare the PLDs and other materials as appropriate Train the judges to use the method select Rate items or persons collect and store data Provide feedback on rating and initiate discussion for judges to

explain their ratings listen to others and revise their views or decisions before another round of judging

Collect final ratings and establish cut scores Ask the judges to evaluate the process Document the process in order to justify the conclusions reached

Test-centered methods The judges are presented with individual items or tasks and required to make a decision about the expected performance on them by a test taker who is just below the border between two standards

Angoff method Experts are given a set of items and they need to rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly

The average of these probabilities across judges or raters is the cut score

If the test contains polytomous items or tasks the proportion of the maximum score is used instead of the probability (modified Angoff)

Advantages amp disadvantages

Clarity

Simplicity

Cognitive difficulty in conceptualizing the borderline learner by all judges in precisely the same way

+ -

Ebel method 2 Rounds Experts classify independently test items by

I level of difficulty

II level of relevance

easy medium hard

essential important acceptable questionabl

e

Ebel method The judges estimate the percentage of items a borderline

test taker would get correct for each cell Then the percentage for each cell is multiplied by the

number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700

These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge

Finally these are averaged across judges to give a final cut score

All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example

categories Expert 3 Expert 4 Expert 5 Number of items

in a category

(А)

correctly performed

items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

EssentialEasy 11 60 660 10 70 700 13 75 975

Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0

Questionable

Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0

Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35

Mean for all experts 28

Cut-score 12

hellip

Problems with EBELThe complex cognitive requirements of classifying items

according to two criteria in relation to an imagined borderline

student may be challenging for the judges

As it is assumed that some items may have questionable

relevance to the construct of interest it implicitly throws into

doubt the rigor of the test development process and validity

arguments

Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline

test taker would be able to eliminate

In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )

These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score

Problems with Nedelsky method

It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way

Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives

Bookmark method

Directions to Bookmark participants

Ordered item booklet

Booklet guideline

Student exemplar papers

Scoring guide

Essential materials

Standard Setting

Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments

Overview of established cut-scores by every expert repeating of the same procedure as

in the first step

Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is

introduced to them

Basic steps of the procedure

Round III

Round II

Round I

Procedures in Bookmark method

Judges are presented with the necessary materials Then they are asked to keep in mind a borderline

student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly

The bookmarks are discussed in group and finally

the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie

Borderline group method The judges define what borderline candidates are

like and then identify borderline candidates who fit the definition

Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score

The main problem the cut score is dependent upon the group being used in the study

Method of contrasting groupsProcedure includes testing of two groups of examinees

bullThe classification must be done using independent criteria such as teacher judgments

bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions

bull The cut score will be where overlap is observed in the distributions

Competent Non-competent

Which method is the lsquobestrsquo

It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available

However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores

The problem is getting the judgments of a number of people on a large group of individuals

Evaluating standard setting (Kane 1994)

Procedural evidence

bull What procedures were used for the standard-setting to ensure that the process is systematic

bull Were the judges properly trained in the methodology and allowed to express their views freely

Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )

External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible

Training a critical part of standard setting

Training activities include familiarization with the PLDs and the test, looking at the scoring keys, making practice judgments, and getting feedback.

Different views may lead to disagreements among the judges. Training should not be designed to eliminate these variations, but to allow free discussion among the judges. If the judges do not converge, the outcome should be accepted by the researchers.

The training process should not force agreement ('cloning'): removing the judges' individuality and inducing agreement is a threat to validity.

The special case of the CEFR

• The CEFR Manual contains performance level descriptors for standard setting, in order to introduce a common language and a single reporting system into Europe.

• It recommends five processes to 'relate' language examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). These processes are: familiarization; specification; standardization, training and benchmarking; standard setting; and validation.

• Familiarization, standard setting, and validation are uncontentious, because they reflect common international assessment practice that is not unique to Europe; the other two processes, however, are problematic.

PLDs in the CEFR and in other standards-based systems

CEFR:

• The use of PLDs is institutionalized, and their meaning is generalized across nations.

• Standardization facilitates 'the implementation of a common understanding' of the CEFR, and training is cloning rather than familiarization.

• Benchmarking = the process of rating individual performance samples using the CEFR PLDs.

• Standard setting = 'mapping' the existing cut scores from tests onto CEFR levels.

Other standards-based systems:

• PLDs are evaluated in terms of their usefulness and meaningfulness; they can be discarded or changed.

• Standardization and training ensure that everyone understands the standard-setting method, yet judgments are freely made.

• Benchmarking = the typical performances that are identified after standard setting.

• Standard setting = establishing cut scores on tests.

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus, rather than being forced to see the world through a single lens. Used in this way, standards are never fixed, monolithic edifices; they are open to change, and even rejection, in the service of language education.

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems, with the intention of imposing a single acceptable teaching and assessment discourse upon professionals.

Thank You For Your Attention

Page 6: Aligning tests to standards

Unintended consequences In case of NCLB ELL group is always lower than the standard amp resources are not channeled to where they are most needed

Mandatory use of English in tests of content subjects puts pressure on the indigenous people to abandon education in their own language

The use of language tests for immigration leads to fraudulent practices amp short-term paper marriages

Using standards for harmonization amp identity

To enforce conformity to a single model that helps to create and maintain political unity and identity

ExamplesCarolingian empire of Charlemagne (CE 800ndash814)

CEFR (Now)

Carolingian empire of Charlemagne Within the empire of Charlemagne in Central and Western Europe various groups followed different calendars and the main Christian festivals fell on different dates

In order to bring uniformity Charlemagne set a new standard for lsquocomputistsrsquo who worked out the time of festivals They required to pass a test in order to get their certificate

There are no lsquocorrect answersrsquo for the questions in the Carolingian test they are scored as lsquocorrectrsquo because they are defined as such by the standard and the standard is arbitrarily chosen with the intention of harmonizing practice

CEFR (Common European Framework of Reference )

CEFR = A set of standards (six-level scales and their

descriptors ) that provides a European model for language

testing and learning to enhance European identity and

harmonization

Teachers should align their curriculum and tests to CEFR

standards (Linking) otherwise many European institutions

will not recognize the certificate they awarded

Problems with CEFRIt drains creativity among teachers

The same set of standards are used for all people across different contexts with different purposes

Validation is based on linking the test to the CEFR This is against validity theories

The use of standards and tests for harmonization ultimately leads to a desire for more control

How many standards can we afford

The number of performance levels depends on the goals and the use of the test

Choosing the fewest performance levels (pass or fail) is ideal because the more numerous the classes the greater will be the danger of a small difference in marks

Index of Separation estimates the number of performance levels into which a test can reliably place test takers

Sometimes we have to use numerous categories to motivate young learners

Performance level descriptors (PLDs) amp Test scores

PLDs are often developed based on intuitive and experiential method amp the labels and descriptors are simple reflections of the values of policy makers

There are around four levels lsquoadvanced ndash proficient ndash basic ndash below basicrsquo

The PLDs provide a conceptual hierarchy of performance that is an indication of the ability or knowledge of the test taker

Standard-setting is the process of deciding on a cut score for a test to mark the boundary between two PLDs If we have two performance levels (pass and fail) wersquoll need a single cut score

Standard based tests CRT amp scoring rubrics

It is said that tests used in standards-based testing are criterion- referenced yet for Glaser the criterion was the domain and it does not have anything to do with standard setting and classification

The standards-based testing movement has interpreted lsquocriterionrsquo to mean lsquostandardrsquo

The focus within PLDs is on the general levels of competence proficiency or performance while scoring rubrics address only single items

Some initial decisions All standard setting methods involve expert judgemental decision making at some level (Jaegar 1979)

Decision 1 Compensatory or non-compensatory marking The strength in other areas lsquocompensatesrsquo for the weakness in one area

Decision 2 What classification errors can you tolerate

Decision 3 Are you going to allow test takers who lsquofailrsquo a test to retake it If so what time lapse is required to retake the test

Second Page Lorem ipsum dolor sit amet consectetur adipisicing elit sed do eiusmod tempor incididunt ut labore et dolore magna aliqua Ut enim ad minim veniam quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur Excepteur sint occaecat cupidatat non proident sunt in culpa qui officia deserunt mollit anim id est laborum

Cycle Diagram

Test-centered

Criterion-referenced

Norm-referenced

Examinee-centered

Standard-Setting Methods

Classification of

Standard-setting methodologies

Test-centered

bull Angoffbull Ebelbull Nedelskybull Bookmark

Examinee-centered

bull Method of Contrasting Groups

bull Method of Borderline group

Common process of standard setting

Select an appropriate standard setting method depending upon the purpose of the standard setting available data and personnel

Select a panel of judges based upon explicit criteria Prepare the PLDs and other materials as appropriate Train the judges to use the method select Rate items or persons collect and store data Provide feedback on rating and initiate discussion for judges to

explain their ratings listen to others and revise their views or decisions before another round of judging

Collect final ratings and establish cut scores Ask the judges to evaluate the process Document the process in order to justify the conclusions reached

Test-centered methods The judges are presented with individual items or tasks and required to make a decision about the expected performance on them by a test taker who is just below the border between two standards

Angoff method Experts are given a set of items and they need to rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly

The average of these probabilities across judges or raters is the cut score

If the test contains polytomous items or tasks the proportion of the maximum score is used instead of the probability (modified Angoff)

Advantages amp disadvantages

Clarity

Simplicity

Cognitive difficulty in conceptualizing the borderline learner by all judges in precisely the same way

+ -

Ebel method 2 Rounds Experts classify independently test items by

I level of difficulty

II level of relevance

easy medium hard

essential important acceptable questionabl

e

Ebel method The judges estimate the percentage of items a borderline

test taker would get correct for each cell Then the percentage for each cell is multiplied by the

number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700

These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge

Finally these are averaged across judges to give a final cut score

All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example

categories Expert 3 Expert 4 Expert 5 Number of items

in a category

(А)

correctly performed

items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

EssentialEasy 11 60 660 10 70 700 13 75 975

Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0

Questionable

Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0

Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35

Mean for all experts 28

Cut-score 12

hellip

Problems with EBELThe complex cognitive requirements of classifying items

according to two criteria in relation to an imagined borderline

student may be challenging for the judges

As it is assumed that some items may have questionable

relevance to the construct of interest it implicitly throws into

doubt the rigor of the test development process and validity

arguments

Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline

test taker would be able to eliminate

In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )

These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score

Problems with Nedelsky method

It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way

Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives

Bookmark method

Directions to Bookmark participants

Ordered item booklet

Booklet guideline

Student exemplar papers

Scoring guide

Essential materials

Standard Setting

Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments

Overview of established cut-scores by every expert repeating of the same procedure as

in the first step

Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is

introduced to them

Basic steps of the procedure

Round III

Round II

Round I

Procedures in Bookmark method

Judges are presented with the necessary materials Then they are asked to keep in mind a borderline

student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly

The bookmarks are discussed in group and finally

the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie

Borderline group method The judges define what borderline candidates are

like and then identify borderline candidates who fit the definition

Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score

The main problem the cut score is dependent upon the group being used in the study

Method of contrasting groupsProcedure includes testing of two groups of examinees

bullThe classification must be done using independent criteria such as teacher judgments

bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions

bull The cut score will be where overlap is observed in the distributions

Competent Non-competent

Which method is the lsquobestrsquo

It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available

However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores

The problem is getting the judgments of a number of people on a large group of individuals

Evaluating standard setting (Kane 1994)

Procedural evidence

bull What procedures were used for the standard-setting to ensure that the process is systematic

bull Were the judges properly trained in the methodology and allowed to express their views freely

Internal evidence: • Deals with the consistency of results arising from the procedure. • Also estimates the extent of agreement between the judges (e.g. Cohen's kappa).

External evidence: • Correlation of the scores of learners in a borderline-group study with some other test of the same construct. • A high correlation suggests that the established cut scores are defensible.
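As an illustration of the judge-agreement statistic mentioned above, a minimal Cohen's kappa computation (the ratings are hypothetical):

```python
# Two judges classify the same 10 candidates as "pass"/"fail".
# Cohen's kappa corrects the raw agreement rate for chance agreement.
judge_a = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
judge_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "fail", "pass"]

def cohens_kappa(a, b):
    n = len(a)
    labels = set(a) | set(b)
    # observed proportion of agreement
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # chance agreement: product of each judge's marginal proportions per label
    p_chance = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_observed - p_chance) / (1 - p_chance)

print(round(cohens_kappa(judge_a, judge_b), 2))
```

Here the judges agree on 8 of 10 candidates (80%), but once chance agreement is removed, kappa is noticeably lower — which is why kappa, not raw agreement, is reported as internal evidence.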

Training: a critical part of standard setting

Training activities include familiarization with the PLDs and the test, looking at the scoring keys, making practice judgments, and getting feedback.

Different views may lead to disagreements among the judges. Training should not be designed to eliminate this variation, but to allow free discussion among the judges. If the judges do not converge, that outcome should be accepted by the researchers.

The training process should not force agreement ('cloning'): removing the judges' individuality and inducing agreement is a threat to validity.

The special case of the CEFR • The CEFR Manual contains performance level descriptors for standard setting, intended to introduce a common language and a single reporting system into Europe.

• It recommends five processes for 'relating' language examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). These processes are: familiarization, specification, standardization training/benchmarking, standard-setting, and validation.

• Familiarization, standard-setting, and validation are uncontentious, because they reflect common international assessment practice that is not unique to Europe; the other two processes, however, are problematic.

PLDs in the CEFR & in other standards-based systems

The use of PLDs in the CEFR is institutionalized, and their meaning is generalized across nations.

Standardization facilitates 'the implementation of a common understanding' of the CEFR, and training becomes cloning rather than familiarization.

CEFR:
• Benchmarking = the process of rating individual performance samples using the CEFR PLDs
• Standard-setting = 'mapping' the existing cut scores from tests onto CEFR levels

Other standards-based systems:
• PLDs are evaluated in terms of their usefulness and meaningfulness; they can be discarded or changed
• Standardization & training ensure that everyone understands the standard-setting method, yet judgments are freely made
• Benchmarking = the typical performances identified after standard-setting
• Standard-setting = establishing cut scores on tests

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus, rather than being forced to see the world through a single lens. Used in this way, standards are never fixed, monolithic edifices; they are open to change, and even rejection, in the service of language education.

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals

Thank You For Your Attention

  • Glen Fulcher (2010). Practical Language Testing, Chapter 5: A
  • Content of this chapter
  • It's as old as the hills
  • Definition of 'standard'
  • The uses of standards
  • Unintended consequences
  • Using standards for harmonization & identity
  • Carolingian empire of Charlemagne
  • CEFR (Common European Framework of Reference)
  • Problems with the CEFR
  • How many standards can we afford?
  • Performance level descriptors (PLDs) & test scores
  • Standards-based tests, CRT & scoring rubrics
  • Some initial decisions
  • Slide 15
  • Standard-setting methodologies
  • Common process of standard setting
  • Test-centered methods
  • Angoff method
  • Advantages & disadvantages
  • Ebel method
  • Ebel method (2)
  • Slide 23
  • Problems with Ebel
  • Nedelsky method (multiple choice)
  • Problems with the Nedelsky method
  • Bookmark method
  • Slide 28
  • Procedures in the Bookmark method
  • Examinee-centered methods
  • Borderline group method
  • Method of contrasting groups
  • Slide 33
  • Which method is the 'best'?
  • Evaluating standard setting (Kane, 1994)
  • Training: a critical part of standard setting
  • The special case of the CEFR
  • PLDs in the CEFR & in other standards-based systems
  • You can always count on uncertainty
  • Slide 40

Using standards for harmonization amp identity

To enforce conformity to a single model that helps to create and maintain political unity and identity

Examples: Carolingian empire of Charlemagne (CE 800–814)

CEFR (Now)

Carolingian empire of Charlemagne: Within the empire of Charlemagne in Central and Western Europe, various groups followed different calendars, and the main Christian festivals fell on different dates.

In order to bring uniformity, Charlemagne set a new standard for the 'computists' who worked out the times of the festivals: they were required to pass a test in order to receive their certificate.

There are no 'correct answers' for the questions in the Carolingian test; they are scored as 'correct' because they are defined as such by the standard, and the standard is arbitrarily chosen with the intention of harmonizing practice.

CEFR (Common European Framework of Reference)

CEFR = a set of standards (six-level scales and their descriptors) that provides a European model for language testing and learning, intended to enhance European identity and harmonization.

Teachers should align their curricula and tests to CEFR standards ('linking'); otherwise, many European institutions will not recognize the certificates they award.

Problems with the CEFR: It drains creativity among teachers.

The same set of standards is used for all people, across different contexts, and for different purposes.

Validation is based on linking the test to the CEFR; this runs against validity theories.

The use of standards and tests for harmonization ultimately leads to a desire for more control

How many standards can we afford?

The number of performance levels depends on the goals and the use of the test

Choosing the fewest performance levels (pass or fail) is ideal, because the more numerous the classes, the greater the danger that a small difference in marks changes a classification.

The Index of Separation estimates the number of performance levels into which a test can reliably place test takers.

Sometimes we have to use numerous categories to motivate young learners
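The slides name the Index of Separation but give no formula. One common formulation — an assumption here, drawn from Rasch measurement rather than from this chapter — derives separation from test reliability and converts it into a count of statistically distinct strata:

```python
import math

# Assumed formulation (Rasch-style separation, not given in the slides):
# separation G = sqrt(R / (1 - R)) for reliability R, and the number of
# reliably distinguishable performance strata is approximately (4G + 1) / 3.
def distinct_strata(reliability):
    g = math.sqrt(reliability / (1 - reliability))
    return (4 * g + 1) / 3

for r in (0.8, 0.9, 0.95):
    print(round(distinct_strata(r), 1))
```

On this formulation, a test with reliability 0.8 supports about three distinct levels — which is why claiming many performance levels from a modestly reliable test is risky.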

Performance level descriptors (PLDs) & test scores

PLDs are often developed using an intuitive and experiential method, and the labels and descriptors are simply reflections of the values of policy makers.

There are typically around four levels: 'advanced – proficient – basic – below basic'.

The PLDs provide a conceptual hierarchy of performance that is an indication of the ability or knowledge of the test taker

Standard-setting is the process of deciding on a cut score for a test to mark the boundary between two PLDs. If we have two performance levels (pass and fail), we'll need a single cut score.
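A minimal sketch of the relationship between cut scores and PLDs (the cut scores and level labels here are hypothetical): k performance levels require k − 1 cut scores.

```python
import bisect

# Hypothetical cut scores marking the boundaries between four levels
cut_scores = [40, 55, 70]
levels = ["below basic", "basic", "proficient", "advanced"]

def classify(score):
    # A score equal to a cut score counts as reaching the higher level
    return levels[bisect.bisect_right(cut_scores, score)]

print(classify(39), classify(55), classify(83))
```

With only two levels (pass/fail) the `cut_scores` list shrinks to a single value, as the slide notes.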

Standards-based tests, CRT & scoring rubrics

It is said that the tests used in standards-based testing are criterion-referenced; yet for Glaser the criterion was the domain, and it did not have anything to do with standard setting and classification.

The standards-based testing movement has interpreted lsquocriterionrsquo to mean lsquostandardrsquo

The focus within PLDs is on general levels of competence, proficiency, or performance, while scoring rubrics address only single items.

Some initial decisions: All standard-setting methods involve expert judgmental decision making at some level (Jaeger, 1979).

Decision 1: Compensatory or non-compensatory marking? In compensatory marking, strength in other areas 'compensates' for weakness in one area.

Decision 2: What classification errors can you tolerate?

Decision 3: Are you going to allow test takers who 'fail' a test to retake it? If so, what time lapse is required before retaking the test?


[Cycle diagram: classification of standard-setting methods – test-centered, examinee-centered, criterion-referenced, norm-referenced]

Standard-setting methodologies

Test-centered:
• Angoff
• Ebel
• Nedelsky
• Bookmark

Examinee-centered:
• Method of contrasting groups
• Borderline group method

Common process of standard setting

• Select an appropriate standard-setting method, depending upon the purpose of the standard setting, the available data, and personnel.
• Select a panel of judges based upon explicit criteria.
• Prepare the PLDs and other materials as appropriate.
• Train the judges to use the method selected.
• Rate items or persons; collect and store the data.
• Provide feedback on the ratings and initiate discussion, so that judges can explain their ratings, listen to others, and revise their views or decisions before another round of judging.
• Collect final ratings and establish cut scores.
• Ask the judges to evaluate the process.
• Document the process in order to justify the conclusions reached.

Test-centered methods: The judges are presented with individual items or tasks and are required to make a decision about the expected performance on them by a test taker who is just below the border between two standards.

Angoff method: Experts are given a set of items and rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly.

The average of these probabilities across judges or raters is the cut score

If the test contains polytomous items or tasks the proportion of the maximum score is used instead of the probability (modified Angoff)
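The Angoff computation can be sketched as follows (the ratings are hypothetical). Each judge's cut score is the expected raw score of the borderline candidate, i.e. the sum of that judge's item probabilities; the panel cut score is the mean across judges:

```python
# Hypothetical Angoff ratings: for each of 5 items, each judge's estimated
# probability that a borderline candidate answers the item correctly.
ratings = {
    "judge_1": [0.9, 0.6, 0.4, 0.8, 0.5],
    "judge_2": [0.8, 0.7, 0.3, 0.9, 0.6],
    "judge_3": [0.7, 0.5, 0.5, 0.8, 0.4],
}

# Per-judge cut score = expected raw score of the borderline candidate
per_judge = {judge: sum(probs) for judge, probs in ratings.items()}

# Panel cut score = mean across judges (out of 5 raw-score points)
cut_score = sum(per_judge.values()) / len(per_judge)
print(round(cut_score, 2))
```

For the modified Angoff with polytomous items, the probabilities would simply be replaced by each judge's estimated proportion of the maximum score per task.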

Advantages & disadvantages

+ Clarity
+ Simplicity
- Cognitive difficulty: all judges must conceptualize the borderline learner in precisely the same way

Ebel method: 2 rounds. Experts independently classify the test items by:

I. Level of difficulty: easy, medium, hard
II. Level of relevance: essential, important, acceptable, questionable

Ebel method: The judges estimate the percentage of items a borderline test taker would get correct for each cell. The percentage for each cell is then multiplied by the number of items in it: so if the 'easy/essential' cell has 20 items, 20 × 85 = 1700. These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge. Finally, these are averaged across judges to give a final cut score.

All items can be classified into 12 cells in a 3 × 4 grid defined by the three difficulty and four relevance categories, as in the example:

Categories             Expert 3             Expert 4             Expert 5
                       A    B     A·B       A    B     A·B       A    B     A·B
Essential    Easy      11   60    660       10   70    700       13   75    975
             Medium     1   25     25        3   25     75        1    0      0
             Hard       0   10      0        1    0      0        0    0      0
Questionable Easy       0    0      0        0    0      0        0    0      0
             Medium     0    0      0        0    0      0        0    0      0
             Hard       0    0      0        0    0      0        0    0      0
Mean                   25.1                 26.7                 35

Mean for all experts: 28
Cut-score: 12

(A = number of items in a category; B = percentage of items a borderline candidate would perform correctly.)
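A sketch of the per-judge Ebel computation with hypothetical cell values (only three cells shown for brevity; a full study would use all 12):

```python
# Each cell = (number of items, % a borderline test taker would get correct).
# Values are illustrative, not taken from the chapter.
cells = {
    ("essential", "easy"):   (20, 85),
    ("essential", "medium"): (10, 60),
    ("essential", "hard"):   (5, 30),
}

# Multiply each cell's percentage by its item count, sum over cells,
# then divide by the total number of items: this judge's cut score (in %).
weighted = sum(n_items * pct for n_items, pct in cells.values())
total_items = sum(n_items for n_items, _ in cells.values())
cut_score = weighted / total_items
print(round(cut_score, 1))  # -> 70.0
```

The final panel cut score would then be the average of these per-judge values.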

Problems with Ebel: The complex cognitive requirement of classifying items according to two criteria, in relation to an imagined borderline student, may be challenging for the judges.

Because it is assumed that some items may have questionable relevance to the construct of interest, the method implicitly throws into doubt the rigor of the test development process and validity arguments.

Nedelsky method (multiple choice): The experts estimate which options of each multiple-choice item a borderline test taker would be able to eliminate.

In a four-option item with three distractors, if a candidate can eliminate all 3 distractors, the chance of getting the item right is 1 (100%); but if he can rule out only 1 of the distractors, the chance of answering the item correctly is 1 in 3 (33%).

These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score
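The Nedelsky computation for one judge can be sketched as follows (the per-item elimination counts are hypothetical):

```python
# For each four-option item, the judge records how many of the 3 distractors
# a borderline candidate could eliminate. P(correct) is then a random guess
# among the remaining options: 1 / (options - eliminated).
OPTIONS = 4
eliminated = [3, 1, 2, 0, 2]  # hypothetical counts, one per item

probs = [1 / (OPTIONS - e) for e in eliminated]

# Average over items gives this judge's cut score (as a proportion);
# averaging again across judges would give the final cut score.
cut_score = sum(probs) / len(probs)
print(round(cut_score, 2))
```

Note how the guessing assumption keeps every item probability at 1/4 or above, which is one reason the method tends toward low cut scores.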

Problems with Nedelsky method

It assumes that test takers answer multiple-choice items by eliminating the options that they think are distractors and then guessing randomly among the remaining options. However, it is highly unlikely that test takers answer items in this way.

The Nedelsky method tends to produce lower cut scores than other methods, and is therefore likely to increase the number of false positives.

Bookmark method

Essential materials: directions to Bookmark participants, the ordered item booklet, the booklet guideline, student exemplar papers, and the scoring guide.

Basic steps of the procedure:
• Round I: Experts are informed about the number of cut scores to establish. Experts work in small groups, and all the essential material is introduced to them.
• Round II: Overview of the cut scores established by every expert; the same procedure as in the first round is repeated.
• Round III: Presentation of the percentage of students falling into each performance level, and of each median cut score from Round 2. After discussion, individual judgments are made.

Procedures in the Bookmark method

Judges are presented with the necessary materials. They are then asked to keep in mind a borderline student and place a 'bookmark' in the booklet between two items, such that the candidate is more likely to answer the items below the bookmark correctly and the items above it incorrectly.

The bookmarks are discussed in the group and, finally, the median of the bookmarks for each cut point is taken as that group's recommendation for that cut point.
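A minimal sketch of the final step (the bookmark placements are hypothetical):

```python
from statistics import median

# Hypothetical bookmark positions: the point in the ordered item booklet
# where each judge placed the bookmark for one cut point.
bookmarks = [12, 14, 14, 15, 17]

# The group's recommendation for that cut point is the median placement
recommendation = median(bookmarks)
print(recommendation)  # -> 14
```

Using the median rather than the mean keeps a single outlying judge from dragging the group recommendation.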

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie

Borderline group method The judges define what borderline candidates are

like and then identify borderline candidates who fit the definition

Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score

The main problem the cut score is dependent upon the group being used in the study

Method of contrasting groupsProcedure includes testing of two groups of examinees

bullThe classification must be done using independent criteria such as teacher judgments

bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions

bull The cut score will be where overlap is observed in the distributions

Competent Non-competent

Which method is the lsquobestrsquo

It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available

However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores

The problem is getting the judgments of a number of people on a large group of individuals

Evaluating standard setting (Kane 1994)

Procedural evidence

bull What procedures were used for the standard-setting to ensure that the process is systematic

bull Were the judges properly trained in the methodology and allowed to express their views freely

Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )

External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible

Training a critical part of standard setting

Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback

Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers

The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity

The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard

setting in order to introduce a common language and a single reporting system into Europe

bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation

bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic

PLDs in CEFR amp in other standard-based systems

The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations

Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization

Benchmarking = the process of rating individual performance samples using the CEFR PLDs

Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels

PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed

Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made

Benchmarking = the typical performances that are identified after standard-setting

Standard-setting = establishing cut scores on tests

CEFR Other standard-based systems

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals

Thank You For Your Attention

  • Glen Fulcher (2010) Practical Language Testing Chapter 5 A
  • Content of this chapter
  • Itrsquos as old as the hills
  • Definition of lsquostandardrsquo
  • The uses of standards
  • Unintended consequences
  • Using standards for harmonization amp identity
  • Carolingian empire of Charlemagne
  • CEFR (Common European Framework of Reference )
  • Problems with CEFR
  • How many standards can we afford
  • Performance level descriptors (PLDs) amp Test scores
  • Standard based tests CRT amp scoring rubrics
  • Some initial decisions
  • Slide 15
  • Standard-setting methodologies
  • Common process of standard setting
  • Test-centered methods
  • Angoff method
  • Advantages amp disadvantages
  • Ebel method
  • Ebel method (2)
  • Slide 23
  • Problems with EBEL
  • Nedelsky method (Multiple-choice)
  • Problems with Nedelsky method
  • Bookmark method
  • Slide 28
  • Procedures in Bookmark method
  • Examinee-centered methods
  • Borderline group method
  • Method of contrasting groups
  • Slide 33
  • Which method is the lsquobestrsquo
  • Evaluating standard setting (Kane 1994)
  • Training a critical part of standard setting
  • The special case of the CEFR
  • PLDs in CEFR amp in other standard-based systems
  • You can always count on uncertainty
  • Slide 40
Page 8: Aligning tests to standards

Carolingian empire of Charlemagne Within the empire of Charlemagne in Central and Western Europe various groups followed different calendars and the main Christian festivals fell on different dates

In order to bring uniformity Charlemagne set a new standard for lsquocomputistsrsquo who worked out the time of festivals They required to pass a test in order to get their certificate

There are no lsquocorrect answersrsquo for the questions in the Carolingian test they are scored as lsquocorrectrsquo because they are defined as such by the standard and the standard is arbitrarily chosen with the intention of harmonizing practice

CEFR (Common European Framework of Reference )

CEFR = A set of standards (six-level scales and their

descriptors ) that provides a European model for language

testing and learning to enhance European identity and

harmonization

Teachers should align their curriculum and tests to CEFR

standards (Linking) otherwise many European institutions

will not recognize the certificate they awarded

Problems with CEFRIt drains creativity among teachers

The same set of standards are used for all people across different contexts with different purposes

Validation is based on linking the test to the CEFR This is against validity theories

The use of standards and tests for harmonization ultimately leads to a desire for more control

How many standards can we afford

The number of performance levels depends on the goals and the use of the test

Choosing the fewest performance levels (pass or fail) is ideal because the more numerous the classes the greater will be the danger of a small difference in marks

Index of Separation estimates the number of performance levels into which a test can reliably place test takers

Sometimes we have to use numerous categories to motivate young learners

Performance level descriptors (PLDs) amp Test scores

PLDs are often developed based on intuitive and experiential method amp the labels and descriptors are simple reflections of the values of policy makers

There are around four levels lsquoadvanced ndash proficient ndash basic ndash below basicrsquo

The PLDs provide a conceptual hierarchy of performance that is an indication of the ability or knowledge of the test taker

Standard-setting is the process of deciding on a cut score for a test to mark the boundary between two PLDs If we have two performance levels (pass and fail) wersquoll need a single cut score

Standard based tests CRT amp scoring rubrics

It is said that tests used in standards-based testing are criterion- referenced yet for Glaser the criterion was the domain and it does not have anything to do with standard setting and classification

The standards-based testing movement has interpreted lsquocriterionrsquo to mean lsquostandardrsquo

The focus within PLDs is on the general levels of competence proficiency or performance while scoring rubrics address only single items

Some initial decisions All standard setting methods involve expert judgemental decision making at some level (Jaegar 1979)

Decision 1 Compensatory or non-compensatory marking The strength in other areas lsquocompensatesrsquo for the weakness in one area

Decision 2 What classification errors can you tolerate

Decision 3 Are you going to allow test takers who lsquofailrsquo a test to retake it If so what time lapse is required to retake the test

Second Page Lorem ipsum dolor sit amet consectetur adipisicing elit sed do eiusmod tempor incididunt ut labore et dolore magna aliqua Ut enim ad minim veniam quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur Excepteur sint occaecat cupidatat non proident sunt in culpa qui officia deserunt mollit anim id est laborum

Cycle Diagram

Test-centered

Criterion-referenced

Norm-referenced

Examinee-centered

Standard-Setting Methods

Classification of

Standard-setting methodologies

Test-centered

bull Angoffbull Ebelbull Nedelskybull Bookmark

Examinee-centered

bull Method of Contrasting Groups

bull Method of Borderline group

Common process of standard setting

Select an appropriate standard setting method depending upon the purpose of the standard setting available data and personnel

Select a panel of judges based upon explicit criteria Prepare the PLDs and other materials as appropriate Train the judges to use the method select Rate items or persons collect and store data Provide feedback on rating and initiate discussion for judges to

explain their ratings listen to others and revise their views or decisions before another round of judging

Collect final ratings and establish cut scores Ask the judges to evaluate the process Document the process in order to justify the conclusions reached

Test-centered methods The judges are presented with individual items or tasks and required to make a decision about the expected performance on them by a test taker who is just below the border between two standards

Angoff method Experts are given a set of items and they need to rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly

The average of these probabilities across judges or raters is the cut score

If the test contains polytomous items or tasks the proportion of the maximum score is used instead of the probability (modified Angoff)

Advantages amp disadvantages

Clarity

Simplicity

Cognitive difficulty in conceptualizing the borderline learner by all judges in precisely the same way

+ -

Ebel method 2 Rounds Experts classify independently test items by

I level of difficulty

II level of relevance

easy medium hard

essential important acceptable questionabl

e

Ebel method The judges estimate the percentage of items a borderline

test taker would get correct for each cell Then the percentage for each cell is multiplied by the

number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700

These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge

Finally these are averaged across judges to give a final cut score

All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example

categories Expert 3 Expert 4 Expert 5 Number of items

in a category

(А)

correctly performed

items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

EssentialEasy 11 60 660 10 70 700 13 75 975

Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0

Questionable

Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0

Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35

Mean for all experts 28

Cut-score 12

hellip

Problems with EBELThe complex cognitive requirements of classifying items

according to two criteria in relation to an imagined borderline

student may be challenging for the judges

As it is assumed that some items may have questionable

relevance to the construct of interest it implicitly throws into

doubt the rigor of the test development process and validity

arguments

Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline

test taker would be able to eliminate

In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )

These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score

Problems with Nedelsky method

It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way

Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives

Bookmark method

Directions to Bookmark participants

Ordered item booklet

Booklet guideline

Student exemplar papers

Scoring guide

Essential materials

Standard Setting

Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments

Overview of established cut-scores by every expert repeating of the same procedure as

in the first step

Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is

introduced to them

Basic steps of the procedure

Round III

Round II

Round I

Procedures in Bookmark method

Judges are presented with the necessary materials Then they are asked to keep in mind a borderline

student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly

The bookmarks are discussed in group and finally

the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie

Borderline group method The judges define what borderline candidates are

like and then identify borderline candidates who fit the definition

Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score

The main problem the cut score is dependent upon the group being used in the study

Method of contrasting groupsProcedure includes testing of two groups of examinees

bullThe classification must be done using independent criteria such as teacher judgments

bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions

bull The cut score will be where overlap is observed in the distributions

Competent Non-competent

Which method is the lsquobestrsquo

It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available

However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores

The problem is getting the judgments of a number of people on a large group of individuals

Evaluating standard setting (Kane 1994)

Procedural evidence

bull What procedures were used for the standard-setting to ensure that the process is systematic

bull Were the judges properly trained in the methodology and allowed to express their views freely

Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )

External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible

Training a critical part of standard setting

Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback

Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers

The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity

The special case of the CEFR

• The CEFR Manual contains performance level descriptors for standard setting, in order to introduce a common language and a single reporting system into Europe.

• It recommends five processes to 'relate' language examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). These processes are: familiarization, specification, standardization/training/benchmarking, standard setting, and validation.

• Familiarization, standard setting and validation are uncontentious, because they reflect common international assessment practice that is not unique to Europe; the other two processes, however, are problematic.

PLDs in the CEFR & in other standards-based systems

In the CEFR:
• The use of the PLDs is institutionalized, and their meaning is generalized across nations.
• Standardization facilitates 'the implementation of a common understanding' of the CEFR, and training is cloning rather than familiarization.
• Benchmarking = the process of rating individual performance samples using the CEFR PLDs.
• Standard setting = 'mapping' the existing cut scores from tests onto CEFR levels.

In other standards-based systems:
• PLDs are evaluated in terms of their usefulness and meaningfulness; they can be discarded or changed.
• Standardization and training ensure that everyone understands the standard-setting method, yet judgments are freely made.
• Benchmarking = the typical performances that are identified after standard setting.
• Standard setting = establishing cut scores on tests.

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens. Used in this way, standards are never fixed, monolithic edifices; they are open to change, and even rejection, in the service of language education.

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems, with the intention of imposing a single acceptable teaching and assessment discourse upon professionals.

Thank You For Your Attention

Page 9: Aligning tests to standards

CEFR (Common European Framework of Reference )

CEFR = A set of standards (six-level scales and their

descriptors ) that provides a European model for language

testing and learning to enhance European identity and

harmonization

Teachers should align their curriculum and tests to CEFR

standards (Linking) otherwise many European institutions

will not recognize the certificate they awarded

Problems with CEFRIt drains creativity among teachers

The same set of standards are used for all people across different contexts with different purposes

Validation is based on linking the test to the CEFR This is against validity theories

The use of standards and tests for harmonization ultimately leads to a desire for more control

How many standards can we afford

The number of performance levels depends on the goals and the use of the test

Choosing the fewest performance levels (pass or fail) is ideal because the more numerous the classes the greater will be the danger of a small difference in marks

Index of Separation estimates the number of performance levels into which a test can reliably place test takers

Sometimes we have to use numerous categories to motivate young learners

Performance level descriptors (PLDs) amp Test scores

PLDs are often developed based on intuitive and experiential method amp the labels and descriptors are simple reflections of the values of policy makers

There are around four levels lsquoadvanced ndash proficient ndash basic ndash below basicrsquo

The PLDs provide a conceptual hierarchy of performance that is an indication of the ability or knowledge of the test taker

Standard-setting is the process of deciding on a cut score for a test to mark the boundary between two PLDs If we have two performance levels (pass and fail) wersquoll need a single cut score

Standard based tests CRT amp scoring rubrics

It is said that tests used in standards-based testing are criterion- referenced yet for Glaser the criterion was the domain and it does not have anything to do with standard setting and classification

The standards-based testing movement has interpreted lsquocriterionrsquo to mean lsquostandardrsquo

The focus within PLDs is on the general levels of competence proficiency or performance while scoring rubrics address only single items

Some initial decisions All standard setting methods involve expert judgemental decision making at some level (Jaegar 1979)

Decision 1 Compensatory or non-compensatory marking The strength in other areas lsquocompensatesrsquo for the weakness in one area

Decision 2 What classification errors can you tolerate

Decision 3 Are you going to allow test takers who lsquofailrsquo a test to retake it If so what time lapse is required to retake the test

Second Page Lorem ipsum dolor sit amet consectetur adipisicing elit sed do eiusmod tempor incididunt ut labore et dolore magna aliqua Ut enim ad minim veniam quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur Excepteur sint occaecat cupidatat non proident sunt in culpa qui officia deserunt mollit anim id est laborum

Cycle Diagram

Test-centered

Criterion-referenced

Norm-referenced

Examinee-centered

Standard-Setting Methods

Classification of

Standard-setting methodologies

Test-centered

bull Angoffbull Ebelbull Nedelskybull Bookmark

Examinee-centered

bull Method of Contrasting Groups

bull Method of Borderline group

Common process of standard setting

Select an appropriate standard setting method depending upon the purpose of the standard setting available data and personnel

Select a panel of judges based upon explicit criteria Prepare the PLDs and other materials as appropriate Train the judges to use the method select Rate items or persons collect and store data Provide feedback on rating and initiate discussion for judges to

explain their ratings listen to others and revise their views or decisions before another round of judging

Collect final ratings and establish cut scores Ask the judges to evaluate the process Document the process in order to justify the conclusions reached

Test-centered methods The judges are presented with individual items or tasks and required to make a decision about the expected performance on them by a test taker who is just below the border between two standards

Angoff method Experts are given a set of items and they need to rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly

The average of these probabilities across judges or raters is the cut score

If the test contains polytomous items or tasks the proportion of the maximum score is used instead of the probability (modified Angoff)

Advantages amp disadvantages

Clarity

Simplicity

Cognitive difficulty in conceptualizing the borderline learner by all judges in precisely the same way

+ -

Ebel method 2 Rounds Experts classify independently test items by

I level of difficulty

II level of relevance

easy medium hard

essential important acceptable questionabl

e

Ebel method The judges estimate the percentage of items a borderline

test taker would get correct for each cell Then the percentage for each cell is multiplied by the

number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700

These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge

Finally these are averaged across judges to give a final cut score

All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example

categories Expert 3 Expert 4 Expert 5 Number of items

in a category

(А)

correctly performed

items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

EssentialEasy 11 60 660 10 70 700 13 75 975

Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0

Questionable

Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0

Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35

Mean for all experts 28

Cut-score 12

hellip

Problems with EBELThe complex cognitive requirements of classifying items

according to two criteria in relation to an imagined borderline

student may be challenging for the judges

As it is assumed that some items may have questionable

relevance to the construct of interest it implicitly throws into

doubt the rigor of the test development process and validity

arguments

Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline

test taker would be able to eliminate

In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )

These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score

Problems with Nedelsky method

It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way

Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives

Bookmark method

Directions to Bookmark participants

Ordered item booklet

Booklet guideline

Student exemplar papers

Scoring guide

Essential materials

Standard Setting

Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments

Overview of established cut-scores by every expert repeating of the same procedure as

in the first step

Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is

introduced to them

Basic steps of the procedure

Round III

Round II

Round I

Procedures in Bookmark method

Judges are presented with the necessary materials Then they are asked to keep in mind a borderline

student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly

The bookmarks are discussed in group and finally

the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie

Borderline group method The judges define what borderline candidates are

like and then identify borderline candidates who fit the definition

Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score

The main problem the cut score is dependent upon the group being used in the study

Method of contrasting groupsProcedure includes testing of two groups of examinees

bullThe classification must be done using independent criteria such as teacher judgments

bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions

bull The cut score will be where overlap is observed in the distributions

Competent Non-competent

Which method is the lsquobestrsquo

It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available

However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores

The problem is getting the judgments of a number of people on a large group of individuals

Evaluating standard setting (Kane 1994)

Procedural evidence

bull What procedures were used for the standard-setting to ensure that the process is systematic

bull Were the judges properly trained in the methodology and allowed to express their views freely

Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )

External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible

Training a critical part of standard setting

Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback

Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers

The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity

The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard

setting in order to introduce a common language and a single reporting system into Europe

bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation

bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic

PLDs in CEFR amp in other standard-based systems

The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations

Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization

Benchmarking = the process of rating individual performance samples using the CEFR PLDs

Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels

PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed

Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made

Benchmarking = the typical performances that are identified after standard-setting

Standard-setting = establishing cut scores on tests

CEFR Other standard-based systems

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals

Thank You For Your Attention

  • Glen Fulcher (2010) Practical Language Testing Chapter 5 A
  • Content of this chapter
  • Itrsquos as old as the hills
  • Definition of lsquostandardrsquo
  • The uses of standards
  • Unintended consequences
  • Using standards for harmonization amp identity
  • Carolingian empire of Charlemagne
  • CEFR (Common European Framework of Reference )
  • Problems with CEFR
  • How many standards can we afford
  • Performance level descriptors (PLDs) amp Test scores
  • Standard based tests CRT amp scoring rubrics
  • Some initial decisions
  • Slide 15
  • Standard-setting methodologies
  • Common process of standard setting
  • Test-centered methods
  • Angoff method
  • Advantages amp disadvantages
  • Ebel method
  • Ebel method (2)
  • Slide 23
  • Problems with EBEL
  • Nedelsky method (Multiple-choice)
  • Problems with Nedelsky method
  • Bookmark method
  • Slide 28
  • Procedures in Bookmark method
  • Examinee-centered methods
  • Borderline group method
  • Method of contrasting groups
  • Slide 33
  • Which method is the lsquobestrsquo
  • Evaluating standard setting (Kane 1994)
  • Training a critical part of standard setting
  • The special case of the CEFR
  • PLDs in CEFR amp in other standard-based systems
  • You can always count on uncertainty
  • Slide 40
Page 10: Aligning tests to standards

Problems with CEFRIt drains creativity among teachers

The same set of standards are used for all people across different contexts with different purposes

Validation is based on linking the test to the CEFR This is against validity theories

The use of standards and tests for harmonization ultimately leads to a desire for more control

How many standards can we afford

The number of performance levels depends on the goals and the use of the test

Choosing the fewest performance levels (pass or fail) is ideal because the more numerous the classes the greater will be the danger of a small difference in marks

Index of Separation estimates the number of performance levels into which a test can reliably place test takers

Sometimes we have to use numerous categories to motivate young learners

Performance level descriptors (PLDs) amp Test scores

PLDs are often developed based on intuitive and experiential method amp the labels and descriptors are simple reflections of the values of policy makers

There are around four levels lsquoadvanced ndash proficient ndash basic ndash below basicrsquo

The PLDs provide a conceptual hierarchy of performance that is an indication of the ability or knowledge of the test taker

Standard-setting is the process of deciding on a cut score for a test to mark the boundary between two PLDs If we have two performance levels (pass and fail) wersquoll need a single cut score

Standard based tests CRT amp scoring rubrics

It is said that tests used in standards-based testing are criterion- referenced yet for Glaser the criterion was the domain and it does not have anything to do with standard setting and classification

The standards-based testing movement has interpreted lsquocriterionrsquo to mean lsquostandardrsquo

The focus within PLDs is on the general levels of competence proficiency or performance while scoring rubrics address only single items

Some initial decisions All standard setting methods involve expert judgemental decision making at some level (Jaegar 1979)

Decision 1 Compensatory or non-compensatory marking The strength in other areas lsquocompensatesrsquo for the weakness in one area

Decision 2 What classification errors can you tolerate

Decision 3 Are you going to allow test takers who lsquofailrsquo a test to retake it If so what time lapse is required to retake the test

Second Page Lorem ipsum dolor sit amet consectetur adipisicing elit sed do eiusmod tempor incididunt ut labore et dolore magna aliqua Ut enim ad minim veniam quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur Excepteur sint occaecat cupidatat non proident sunt in culpa qui officia deserunt mollit anim id est laborum

Cycle Diagram

Test-centered

Criterion-referenced

Norm-referenced

Examinee-centered

Standard-Setting Methods

Classification of

Standard-setting methodologies

Test-centered

bull Angoffbull Ebelbull Nedelskybull Bookmark

Examinee-centered

bull Method of Contrasting Groups

bull Method of Borderline group

Common process of standard setting

Select an appropriate standard setting method depending upon the purpose of the standard setting available data and personnel

Select a panel of judges based upon explicit criteria Prepare the PLDs and other materials as appropriate Train the judges to use the method select Rate items or persons collect and store data Provide feedback on rating and initiate discussion for judges to

explain their ratings listen to others and revise their views or decisions before another round of judging

Collect final ratings and establish cut scores Ask the judges to evaluate the process Document the process in order to justify the conclusions reached

Test-centered methods The judges are presented with individual items or tasks and required to make a decision about the expected performance on them by a test taker who is just below the border between two standards

Angoff method Experts are given a set of items and they need to rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly

The average of these probabilities across judges or raters is the cut score

If the test contains polytomous items or tasks the proportion of the maximum score is used instead of the probability (modified Angoff)

Advantages amp disadvantages

Clarity

Simplicity

Cognitive difficulty in conceptualizing the borderline learner by all judges in precisely the same way

+ -

Ebel method 2 Rounds Experts classify independently test items by

I level of difficulty

II level of relevance

easy medium hard

essential important acceptable questionabl

e

Ebel method The judges estimate the percentage of items a borderline

test taker would get correct for each cell Then the percentage for each cell is multiplied by the

number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700

These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge

Finally these are averaged across judges to give a final cut score

All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example

categories Expert 3 Expert 4 Expert 5 Number of items

in a category

(А)

correctly performed

items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

EssentialEasy 11 60 660 10 70 700 13 75 975

Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0

Questionable

Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0

Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35

Mean for all experts 28

Cut-score 12

hellip

Problems with EBELThe complex cognitive requirements of classifying items

according to two criteria in relation to an imagined borderline

student may be challenging for the judges

As it is assumed that some items may have questionable

relevance to the construct of interest it implicitly throws into

doubt the rigor of the test development process and validity

arguments

Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline

test taker would be able to eliminate

In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )

These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score

Problems with Nedelsky method

It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way

Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives

Bookmark method

Directions to Bookmark participants

Ordered item booklet

Booklet guideline

Student exemplar papers

Scoring guide

Essential materials

Standard Setting

Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments

Overview of established cut-scores by every expert repeating of the same procedure as

in the first step

Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is

introduced to them

Basic steps of the procedure

Round III

Round II

Round I

Procedures in Bookmark method

Judges are presented with the necessary materials Then they are asked to keep in mind a borderline

student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly

The bookmarks are discussed in group and finally

the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie

Borderline group method The judges define what borderline candidates are

like and then identify borderline candidates who fit the definition

Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score

The main problem the cut score is dependent upon the group being used in the study

Method of contrasting groupsProcedure includes testing of two groups of examinees

bullThe classification must be done using independent criteria such as teacher judgments

bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions

bull The cut score will be where overlap is observed in the distributions

Competent Non-competent

Which method is the lsquobestrsquo

It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available

However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores

The problem is getting the judgments of a number of people on a large group of individuals

Evaluating standard setting (Kane 1994)

Procedural evidence

bull What procedures were used for the standard-setting to ensure that the process is systematic

bull Were the judges properly trained in the methodology and allowed to express their views freely

Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )

External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible

Training a critical part of standard setting

Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback

Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers

The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity

The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard

setting in order to introduce a common language and a single reporting system into Europe

bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation

bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic

PLDs in CEFR amp in other standard-based systems

The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations

Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization

Benchmarking = the process of rating individual performance samples using the CEFR PLDs

Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels

PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed

Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made

Benchmarking = the typical performances that are identified after standard-setting

Standard-setting = establishing cut scores on tests

CEFR Other standard-based systems

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals

Thank You For Your Attention

  • Glen Fulcher (2010) Practical Language Testing Chapter 5 A
  • Content of this chapter
  • Itrsquos as old as the hills
  • Definition of lsquostandardrsquo
  • The uses of standards
  • Unintended consequences
  • Using standards for harmonization amp identity
  • Carolingian empire of Charlemagne
  • CEFR (Common European Framework of Reference )
  • Problems with CEFR
  • How many standards can we afford
  • Performance level descriptors (PLDs) amp Test scores
  • Standard based tests CRT amp scoring rubrics
  • Some initial decisions
  • Slide 15
  • Standard-setting methodologies
  • Common process of standard setting
  • Test-centered methods
  • Angoff method
  • Advantages amp disadvantages
  • Ebel method
  • Ebel method (2)
  • Slide 23
  • Problems with EBEL
  • Nedelsky method (Multiple-choice)
  • Problems with Nedelsky method
  • Bookmark method
  • Slide 28
  • Procedures in Bookmark method
  • Examinee-centered methods
  • Borderline group method
  • Method of contrasting groups
  • Slide 33
  • Which method is the lsquobestrsquo
  • Evaluating standard setting (Kane 1994)
  • Training a critical part of standard setting
  • The special case of the CEFR
  • PLDs in CEFR amp in other standard-based systems
  • You can always count on uncertainty
  • Slide 40
Page 11: Aligning tests to standards

How many standards can we afford?

The number of performance levels depends on the goals and the use of the test.

Choosing the fewest performance levels (pass or fail) is ideal, because the more numerous the classes, the greater the danger that a small difference in marks will change a classification.

The Index of Separation estimates the number of performance levels into which a test can reliably place test takers.

Sometimes, however, we have to use more numerous categories, for example to motivate young learners.
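The chapter does not give the formula behind the Index of Separation, but in Rasch-based measurement the separation statistic is commonly derived from a test's reliability; the sketch below uses those conventional formulas (an assumption imported from Rasch measurement practice, not from this chapter):

```python
import math

def separation_index(reliability: float) -> float:
    # G = sqrt(R / (1 - R)): spread of "true" scores relative to
    # measurement error, computed from a reliability estimate R.
    return math.sqrt(reliability / (1.0 - reliability))

def distinct_strata(reliability: float) -> float:
    # H = (4G + 1) / 3: how many statistically distinct performance
    # levels the test can support.
    g = separation_index(reliability)
    return (4.0 * g + 1.0) / 3.0

# A test with reliability 0.90 separates test takers into about 4 levels.
print(round(distinct_strata(0.90), 1))  # 4.3
```

On this reading, a test that cannot reliably separate candidates into more than four strata should not be reporting six performance levels.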

Performance level descriptors (PLDs) & test scores

PLDs are often developed on the basis of intuitive and experiential methods, and the labels and descriptors are simple reflections of the values of policy makers.

There are typically around four levels: 'advanced – proficient – basic – below basic'.

The PLDs provide a conceptual hierarchy of performance that is an indication of the ability or knowledge of the test taker.

Standard-setting is the process of deciding on a cut score for a test, to mark the boundary between two PLDs. If we have two performance levels (pass and fail), we'll need a single cut score.
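The relationship between cut scores and PLDs can be sketched in a few lines; the boundary values and level labels below are hypothetical:

```python
from bisect import bisect_right

# Hypothetical boundaries: three cut scores divide a 0-100 score range
# into four performance levels.
CUT_SCORES = [40, 55, 70]
LEVELS = ["below basic", "basic", "proficient", "advanced"]

def classify(score: float) -> str:
    # A score at or above a cut score attains the higher level.
    return LEVELS[bisect_right(CUT_SCORES, score)]

print(classify(39), classify(55), classify(70))
```

With two levels the list of cut scores shrinks to a single value, as the text notes.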

Standards-based tests, CRT & scoring rubrics

It is said that the tests used in standards-based testing are criterion-referenced; yet for Glaser the criterion was the domain, and it did not have anything to do with standard setting and classification.

The standards-based testing movement has interpreted 'criterion' to mean 'standard'.

The focus within PLDs is on general levels of competence, proficiency or performance, while scoring rubrics address only single items.

Some initial decisions: All standard-setting methods involve expert judgmental decision making at some level (Jaeger, 1979).

Decision 1: Compensatory or non-compensatory marking? In compensatory marking, strength in other areas 'compensates' for weakness in one area.

Decision 2: What classification errors can you tolerate?

Decision 3: Are you going to allow test takers who 'fail' the test to retake it? If so, what time lapse is required before the test can be retaken?


[Diagram: classification of standard-setting methods — test-centered, criterion-referenced, norm-referenced, examinee-centered]

Standard-setting methodologies

Test-centered:
• Angoff
• Ebel
• Nedelsky
• Bookmark

Examinee-centered:
• Method of Contrasting Groups
• Method of Borderline Group

Common process of standard setting

• Select an appropriate standard-setting method, depending upon the purpose of the standard setting, the available data, and personnel.
• Select a panel of judges based upon explicit criteria.
• Prepare the PLDs and other materials as appropriate.
• Train the judges to use the method selected.
• Rate items or persons; collect and store data.
• Provide feedback on ratings and initiate discussion, so that judges can explain their ratings, listen to others, and revise their views or decisions before another round of judging.
• Collect final ratings and establish cut scores.
• Ask the judges to evaluate the process.
• Document the process in order to justify the conclusions reached.

Test-centered methods: The judges are presented with individual items or tasks and are required to make a decision about the expected performance on them of a test taker who is just below the border between two standards.

Angoff method Experts are given a set of items and they need to rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly

The average of these probabilities across judges or raters is the cut score

If the test contains polytomous items or tasks the proportion of the maximum score is used instead of the probability (modified Angoff)
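The Angoff arithmetic can be sketched in a few lines with invented judge ratings; here each judge's expected raw score for the borderline candidate is the sum of that judge's item probabilities, and the cut score averages these across judges:

```python
from statistics import mean

# Each row: one judge's probabilities that a borderline test taker
# answers each of four items correctly (invented ratings).
judge_ratings = [
    [0.6, 0.4, 0.8, 0.7],
    [0.5, 0.5, 0.9, 0.6],
    [0.7, 0.3, 0.8, 0.8],
]

# A judge's expected raw score is the sum of their probabilities;
# the cut score is the mean of those expectations across judges.
cut_score = mean(sum(r) for r in judge_ratings)
print(round(cut_score, 2))  # 2.53
```

For the modified Angoff, the inner values would be proportions of each task's maximum score rather than probabilities; the aggregation is the same.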

Advantages & disadvantages

+ Clarity
+ Simplicity
− Cognitive difficulty: all judges must conceptualize the borderline learner in precisely the same way

Ebel method (2 rounds): Experts independently classify test items by

I. level of difficulty: easy, medium, hard
II. level of relevance: essential, important, acceptable, questionable

Ebel method: The judges estimate the percentage of items a borderline test taker would get correct for each cell. The percentage for each cell is then multiplied by the number of items in it; so if the 'easy/essential' cell has 20 items and the estimate is 85%, then 20 × 85 = 1,700.

These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge.

Finally, these are averaged across judges to give a final cut score.

All items are classified into the 12 cells of a 3×4 grid defined by the three difficulty and four relevance categories, as in the example (A = number of items in a category; B = % of items performed correctly):

Category               Expert 3: A / B / A·B    Expert 4: A / B / A·B    Expert 5: A / B / A·B
Essential / Easy       11 / 60 / 660            10 / 70 / 700            13 / 75 / 975
Essential / Medium     1 / 25 / 25              3 / 25 / 75              1 / 0 / 0
Essential / Hard       0 / 10 / 0               1 / 0 / 0                0 / 0 / 0
Questionable / Easy    0 / 0 / 0                0 / 0 / 0                0 / 0 / 0
Questionable / Medium  0 / 0 / 0                0 / 0 / 0                0 / 0 / 0
Questionable / Hard    0 / 0 / 0                0 / 0 / 0                0 / 0 / 0
Mean                   25.1                     26.7                     35

Mean for all experts: 28
Cut-score: 12
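The per-judge calculation can be sketched as follows, reusing the hypothetical 'easy/essential: 20 items at 85%' figure from the text plus two invented cells (this is illustrative data, not the table above):

```python
# One judge's Ebel grid: (relevance, difficulty) -> (A = item count,
# B = estimated % correct for a borderline test taker). Invented data.
grid = {
    ("essential", "easy"):   (20, 85),   # 20 x 85 = 1,700
    ("important", "medium"): (10, 60),
    ("acceptable", "hard"):  (5, 40),
}

total_items = sum(a for a, _ in grid.values())
# Sum of A x B over all cells, divided by the total number of items,
# gives this judge's cut score as a percentage.
cut_pct = sum(a * b for a, b in grid.values()) / total_items
print(round(cut_pct, 1))  # 71.4
```

The final cut score would then be the mean of `cut_pct` across all judges.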

Problems with Ebel

• The complex cognitive requirement of classifying items according to two criteria, in relation to an imagined borderline student, may be challenging for the judges.
• As it is assumed that some items may have questionable relevance to the construct of interest, the method implicitly throws into doubt the rigor of the test development process and validity arguments.

Nedelsky method (multiple-choice): The experts estimate which options of each multiple-choice item a borderline test taker would be able to eliminate.

In a four-option item with three distractors, if a candidate can eliminate all 3 distractors, the chance of getting the item right is 1 (100%); but if he can rule out only 1 of the distractors, the chance of answering the item correctly is 1 in 3 (33%).

These probabilities are averaged across all items for each judge, and then across all judges, to arrive at a cut score.
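As a sketch with invented data, the Nedelsky computation can start from each judge's count of distractors the borderline candidate could eliminate per item; here the cut is expressed as an expected raw score (the sum of the per-item guessing probabilities), one common formulation:

```python
from statistics import mean

N_OPTIONS = 4  # four-option multiple-choice items

# For each of five items, each judge's count of distractors a borderline
# test taker could eliminate (invented data, three judges).
eliminated = [
    [3, 1, 2, 0, 3],
    [3, 2, 2, 1, 3],
    [2, 1, 3, 0, 3],
]

def judge_expected_score(counts):
    # With c distractors eliminated, the chance of a correct guess is
    # 1 / (options remaining); summed over items = expected raw score.
    return sum(1.0 / (N_OPTIONS - c) for c in counts)

cut_score = mean(judge_expected_score(c) for c in eliminated)
print(round(cut_score, 2))  # 3.17
```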

Problems with Nedelsky method

It assumes that test takers answer multiple-choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options. However, it is highly unlikely that test takers answer items in this way.

The Nedelsky method tends to produce lower cut scores than other methods, and is therefore likely to increase the number of false positives.

Bookmark method

Essential materials for standard setting:
• Directions to Bookmark participants
• Ordered item booklet
• Booklet guideline
• Student exemplar papers
• Scoring guide

Basic steps of the procedure:
• Round I: Experts are informed about the number of cut scores to establish; they work in small groups, and all the essential material is introduced to them.
• Round II: Overview of the cut scores established by every expert; the same procedure as in the first step is repeated.
• Round III: Presentation of the percentage of students falling into each performance level, and of each median cut score from Round 2; after discussion, individual judgments are made.

Procedures in the Bookmark method

Judges are presented with the necessary materials. They are then asked to keep in mind a borderline student and place a 'bookmark' in the booklet between two items, such that the candidate is more likely to answer the items below correctly and the items above incorrectly.

The bookmarks are discussed in the group and, finally, the median of the bookmarks for each cut point is taken as that group's recommendation for that cut point.
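The final aggregation step is simple; a sketch with invented bookmark placements (item positions in the ordered booklet) for two hypothetical level boundaries:

```python
from statistics import median

# Round-3 bookmark placements from five judges, per boundary (invented).
round3_bookmarks = {
    "basic/proficient": [23, 25, 24, 26, 24],
    "proficient/advanced": [38, 40, 41, 39, 40],
}

# The group's recommended cut point is the median bookmark per boundary.
cuts = {name: median(marks) for name, marks in round3_bookmarks.items()}
print(cuts)  # {'basic/proficient': 24, 'proficient/advanced': 40}
```

The median (rather than the mean) keeps a single outlying judge from dragging the cut point.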

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard; the test is then administered to the test takers to discover where the cut score should lie.

Borderline group method: The judges define what borderline candidates are like, and then identify borderline candidates who fit the definition.

Once the students have been placed into groups, the test can be administered. The median score of the group defined as borderline is used as the cut score.

The main problem: the cut score is dependent upon the group being used in the study.
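The computation itself is a one-liner; a sketch with invented scores for candidates whom the judges classified as borderline:

```python
from statistics import median

# Test scores of the candidates the judges identified as borderline
# (invented data); the test is administered after the grouping.
borderline_scores = [48, 52, 47, 55, 50, 49, 53]

# The median score of the borderline group is the cut score.
cut_score = median(borderline_scores)
print(cut_score)  # 50
```

A different panel nominating a different borderline group would shift this median, which is exactly the group-dependence problem noted above.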

Method of contrasting groups: The procedure includes testing two groups of examinees.

• The classification must be done using independent criteria, such as teacher judgments.
• The test is then given and the score distributions are calculated; there are likely to be overlaps in the distributions.
• The cut score will be placed where overlap is observed in the distributions.

[Figure: overlapping score distributions for the competent and non-competent groups]
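One common way to place the cut within the overlap region is to pick the score that minimizes total classification error; this is a sketch with invented group scores, and it also yields the false-positive and false-negative counts that make this method attractive:

```python
# Invented scores for two independently classified groups.
competent = [61, 64, 58, 70, 66, 55, 68, 72, 59, 63]
non_competent = [42, 50, 47, 38, 56, 44, 52, 40, 49, 57]

def errors_at(cut):
    # Misclassifications if everyone scoring >= cut is passed:
    false_negatives = sum(s < cut for s in competent)       # wrongly failed
    false_positives = sum(s >= cut for s in non_competent)  # wrongly passed
    return false_negatives + false_positives

# Place the cut score where total classification error is smallest.
best_cut = min(range(101), key=errors_at)
print(best_cut, errors_at(best_cut))  # 58 1
```

Weighting the two error types differently (e.g. when a false positive is costlier) would shift `best_cut` accordingly.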

Which method is the 'best'?

It depends on what kind of judgments you can get for your standard-setting study, and on the quality of the judges that you have available.

However, using the contrasting-groups approach is recommended where possible, because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores.

The problem is getting the judgments of a number of people on a large group of individuals.

Evaluating standard setting (Kane, 1994)

Procedural evidence:
• What procedures were used for the standard setting, to ensure that the process is systematic?
• Were the judges properly trained in the methodology and allowed to express their views freely?

Internal evidence:
• Deals with the consistency of results arising from the procedure.
• It also estimates the extent of agreement between judges (Cohen's kappa).

External evidence:
• Correlation of the scores of learners in a borderline-group study with some other test of the same construct.
• High correlation = the established cut scores are defensible.
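For the internal evidence, Cohen's kappa between two judges can be sketched as follows (invented pass/fail ratings of ten scripts):

```python
from collections import Counter

# Two judges' pass/fail decisions on the same ten scripts (invented).
judge_a = ["pass", "pass", "fail", "pass", "fail",
           "pass", "fail", "fail", "pass", "pass"]
judge_b = ["pass", "fail", "fail", "pass", "fail",
           "pass", "fail", "pass", "pass", "pass"]

def cohens_kappa(a, b):
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n       # raw agreement
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)    # chance agreement
    return (observed - expected) / (1 - expected)

print(round(cohens_kappa(judge_a, judge_b), 2))  # 0.58
```

Kappa corrects raw agreement for the agreement two judges would reach by chance alone, which is why it is preferred over a simple percent-agreement figure.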

Training: a critical part of standard setting

Training activities include familiarization with the PLDs and the test, looking at the scoring keys, making practice judgments, and getting feedback.

Different views may lead to disagreements among the judges. Training should not be designed to eliminate these variations, but to allow free discussion among judges. If the judges do not converge, the outcome should be accepted by the researchers.

The training process should not force agreement ('cloning'), because removing the judges' individuality and inducing agreement is a threat to validity.

The special case of the CEFR

• The CEFR Manual contains performance level descriptors for standard setting, in order to introduce a common language and a single reporting system into Europe.
• It recommends five processes to 'relate' language examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). These processes are: familiarization, specification, standardization training/benchmarking, standard-setting, and validation.
• Familiarization, standard-setting, and validation are uncontentious, because they reflect common international assessment practice that is not unique to Europe; however, the other two are problematic.

PLDs in the CEFR & in other standards-based systems

CEFR:
• The use of PLDs is institutionalized, and their meaning is generalized across nations.
• Standardization facilitates 'the implementation of a common understanding' of the CEFR, and training is cloning rather than familiarization.
• Benchmarking = the process of rating individual performance samples using the CEFR PLDs.
• Standard-setting = 'mapping' the existing cut scores from tests onto CEFR levels.

Other standards-based systems:
• PLDs are evaluated in terms of their usefulness and meaningfulness; they can be discarded or changed.
• Standardization & training ensure that everyone understands the standard-setting method, yet judgments are freely made.
• Benchmarking = the typical performances that are identified after standard-setting.
• Standard-setting = establishing cut scores on tests.

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus, rather than being forced to see the world through a single lens. Used in this way, standards are never fixed, monolithic edifices; they are open to change, and even rejection, in the service of language education.

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems, with the intention of imposing a single acceptable teaching and assessment discourse upon professionals.

Thank You For Your Attention


Performance level descriptors (PLDs) amp Test scores

PLDs are often developed based on intuitive and experiential method amp the labels and descriptors are simple reflections of the values of policy makers

There are around four levels lsquoadvanced ndash proficient ndash basic ndash below basicrsquo

The PLDs provide a conceptual hierarchy of performance that is an indication of the ability or knowledge of the test taker

Standard-setting is the process of deciding on a cut score for a test to mark the boundary between two PLDs If we have two performance levels (pass and fail) wersquoll need a single cut score

Standard based tests CRT amp scoring rubrics

It is said that tests used in standards-based testing are criterion- referenced yet for Glaser the criterion was the domain and it does not have anything to do with standard setting and classification

The standards-based testing movement has interpreted lsquocriterionrsquo to mean lsquostandardrsquo

The focus within PLDs is on the general levels of competence proficiency or performance while scoring rubrics address only single items

Some initial decisions All standard setting methods involve expert judgemental decision making at some level (Jaegar 1979)

Decision 1 Compensatory or non-compensatory marking The strength in other areas lsquocompensatesrsquo for the weakness in one area

Decision 2 What classification errors can you tolerate

Decision 3 Are you going to allow test takers who lsquofailrsquo a test to retake it If so what time lapse is required to retake the test

Second Page Lorem ipsum dolor sit amet consectetur adipisicing elit sed do eiusmod tempor incididunt ut labore et dolore magna aliqua Ut enim ad minim veniam quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur Excepteur sint occaecat cupidatat non proident sunt in culpa qui officia deserunt mollit anim id est laborum

Cycle Diagram

Test-centered

Criterion-referenced

Norm-referenced

Examinee-centered

Standard-Setting Methods

Classification of

Standard-setting methodologies

Test-centered

bull Angoffbull Ebelbull Nedelskybull Bookmark

Examinee-centered

bull Method of Contrasting Groups

bull Method of Borderline group

Common process of standard setting

Select an appropriate standard setting method depending upon the purpose of the standard setting available data and personnel

Select a panel of judges based upon explicit criteria Prepare the PLDs and other materials as appropriate Train the judges to use the method select Rate items or persons collect and store data Provide feedback on rating and initiate discussion for judges to

explain their ratings listen to others and revise their views or decisions before another round of judging

Collect final ratings and establish cut scores Ask the judges to evaluate the process Document the process in order to justify the conclusions reached

Test-centered methods The judges are presented with individual items or tasks and required to make a decision about the expected performance on them by a test taker who is just below the border between two standards

Angoff method Experts are given a set of items and they need to rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly

The average of these probabilities across judges or raters is the cut score

If the test contains polytomous items or tasks the proportion of the maximum score is used instead of the probability (modified Angoff)

Advantages amp disadvantages

Clarity

Simplicity

Cognitive difficulty in conceptualizing the borderline learner by all judges in precisely the same way

+ -

Ebel method 2 Rounds Experts classify independently test items by

I level of difficulty

II level of relevance

easy medium hard

essential important acceptable questionabl

e

Ebel method The judges estimate the percentage of items a borderline

test taker would get correct for each cell Then the percentage for each cell is multiplied by the

number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700

These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge

Finally these are averaged across judges to give a final cut score

All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example

categories Expert 3 Expert 4 Expert 5 Number of items

in a category

(А)

correctly performed

items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

EssentialEasy 11 60 660 10 70 700 13 75 975

Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0

Questionable

Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0

Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35

Mean for all experts 28

Cut-score 12

hellip

Problems with EBELThe complex cognitive requirements of classifying items

according to two criteria in relation to an imagined borderline

student may be challenging for the judges

As it is assumed that some items may have questionable

relevance to the construct of interest it implicitly throws into

doubt the rigor of the test development process and validity

arguments

Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline

test taker would be able to eliminate

In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )

These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score

Problems with Nedelsky method

It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way

Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives

Bookmark method

Directions to Bookmark participants

Ordered item booklet

Booklet guideline

Student exemplar papers

Scoring guide

Essential materials

Standard Setting

Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments

Overview of established cut-scores by every expert repeating of the same procedure as

in the first step

Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is

introduced to them

Basic steps of the procedure

Round III

Round II

Round I

Procedures in Bookmark method

Judges are presented with the necessary materials Then they are asked to keep in mind a borderline

student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly

The bookmarks are discussed in group and finally

the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie

Borderline group method The judges define what borderline candidates are

like and then identify borderline candidates who fit the definition

Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score

The main problem the cut score is dependent upon the group being used in the study

Method of contrasting groupsProcedure includes testing of two groups of examinees

bullThe classification must be done using independent criteria such as teacher judgments

bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions

bull The cut score will be where overlap is observed in the distributions

Competent Non-competent

Which method is the lsquobestrsquo

It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available

However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores

The problem is getting the judgments of a number of people on a large group of individuals

Evaluating standard setting (Kane 1994)

Procedural evidence

bull What procedures were used for the standard-setting to ensure that the process is systematic

bull Were the judges properly trained in the methodology and allowed to express their views freely

Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )

External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible

Training a critical part of standard setting

Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback

Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers

The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity

The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard

setting in order to introduce a common language and a single reporting system into Europe

bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation

bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic

PLDs in CEFR amp in other standard-based systems

The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations

Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization

Benchmarking = the process of rating individual performance samples using the CEFR PLDs

Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels

PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed

Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made

Benchmarking = the typical performances that are identified after standard-setting

Standard-setting = establishing cut scores on tests

CEFR Other standard-based systems

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals

Thank You For Your Attention

  • Glen Fulcher (2010) Practical Language Testing Chapter 5 A
  • Content of this chapter
  • Itrsquos as old as the hills
  • Definition of lsquostandardrsquo
  • The uses of standards
  • Unintended consequences
  • Using standards for harmonization amp identity
  • Carolingian empire of Charlemagne
  • CEFR (Common European Framework of Reference )
  • Problems with CEFR
  • How many standards can we afford
  • Performance level descriptors (PLDs) amp Test scores
  • Standard based tests CRT amp scoring rubrics
  • Some initial decisions
  • Slide 15
  • Standard-setting methodologies
  • Common process of standard setting
  • Test-centered methods
  • Angoff method
  • Advantages amp disadvantages
  • Ebel method
  • Ebel method (2)
  • Slide 23
  • Problems with EBEL
  • Nedelsky method (Multiple-choice)
  • Problems with Nedelsky method
  • Bookmark method
  • Slide 28
  • Procedures in Bookmark method
  • Examinee-centered methods
  • Borderline group method
  • Method of contrasting groups
  • Slide 33
  • Which method is the lsquobestrsquo
  • Evaluating standard setting (Kane 1994)
  • Training a critical part of standard setting
  • The special case of the CEFR
  • PLDs in CEFR amp in other standard-based systems
  • You can always count on uncertainty
  • Slide 40
Page 13: Aligning tests to standards

Standard based tests CRT amp scoring rubrics

It is said that tests used in standards-based testing are criterion- referenced yet for Glaser the criterion was the domain and it does not have anything to do with standard setting and classification

The standards-based testing movement has interpreted lsquocriterionrsquo to mean lsquostandardrsquo

The focus within PLDs is on the general levels of competence proficiency or performance while scoring rubrics address only single items

Some initial decisions All standard setting methods involve expert judgemental decision making at some level (Jaegar 1979)

Decision 1 Compensatory or non-compensatory marking The strength in other areas lsquocompensatesrsquo for the weakness in one area

Decision 2 What classification errors can you tolerate

Decision 3 Are you going to allow test takers who lsquofailrsquo a test to retake it If so what time lapse is required to retake the test

Second Page Lorem ipsum dolor sit amet consectetur adipisicing elit sed do eiusmod tempor incididunt ut labore et dolore magna aliqua Ut enim ad minim veniam quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur Excepteur sint occaecat cupidatat non proident sunt in culpa qui officia deserunt mollit anim id est laborum

Cycle Diagram

Test-centered

Criterion-referenced

Norm-referenced

Examinee-centered

Standard-Setting Methods

Classification of

Standard-setting methodologies

Test-centered

bull Angoffbull Ebelbull Nedelskybull Bookmark

Examinee-centered

bull Method of Contrasting Groups

bull Method of Borderline group

Common process of standard setting

Select an appropriate standard setting method depending upon the purpose of the standard setting available data and personnel

Select a panel of judges based upon explicit criteria Prepare the PLDs and other materials as appropriate Train the judges to use the method select Rate items or persons collect and store data Provide feedback on rating and initiate discussion for judges to

explain their ratings listen to others and revise their views or decisions before another round of judging

Collect final ratings and establish cut scores Ask the judges to evaluate the process Document the process in order to justify the conclusions reached

Test-centered methods The judges are presented with individual items or tasks and required to make a decision about the expected performance on them by a test taker who is just below the border between two standards

Angoff method Experts are given a set of items and they need to rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly

The average of these probabilities across judges or raters is the cut score

If the test contains polytomous items or tasks the proportion of the maximum score is used instead of the probability (modified Angoff)

Advantages amp disadvantages

Clarity

Simplicity

Cognitive difficulty in conceptualizing the borderline learner by all judges in precisely the same way

+ -

Ebel method 2 Rounds Experts classify independently test items by

I level of difficulty

II level of relevance

easy medium hard

essential important acceptable questionabl

e

Ebel method: The judges estimate the percentage of items a borderline test taker would get correct for each cell. The percentage for each cell is then multiplied by the number of items in it, so if the 'easy/essential' cell has 20 items: 20 × 85 = 1,700.

These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge.

Finally, these are averaged across judges to give a final cut score.

All items can thus be classified into 12 cells in a 3 × 4 grid defined by the three difficulty and the four relevance categories, as in the example:

(A = number of items in the category; B = % of items a borderline candidate would perform correctly)

Category            |   Expert 3    |   Expert 4    |   Expert 5
                    |  A   B   A×B  |  A   B   A×B  |  A   B   A×B
Essential / Easy    | 11  60   660  | 10  70   700  | 13  75   975
Essential / Medium  |  1  25    25  |  3  25    75  |  1   0     0
Essential / Hard    |  0  10     0  |  1   0     0  |  0   0     0
Questionable / Easy |  0   0     0  |  0   0     0  |  0   0     0
Questionable / Med. |  0   0     0  |  0   0     0  |  0   0     0
Questionable / Hard |  0   0     0  |  0   0     0  |  0   0     0
Mean                |     25.1      |     26.7      |      35

Mean for all experts: 28
Cut score: 12
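The per-judge Ebel calculation can be sketched like this. The cell counts and percentages below are hypothetical, not the figures from the worked example:

```python
# Ebel method for a single judge: for each difficulty x relevance cell,
# multiply the number of items in the cell by the estimated % a
# borderline candidate would get correct, sum over all 12 cells, and
# divide by the total number of items.

def ebel_cut_score(cells):
    """cells = list of (n_items, pct_correct) pairs, one per grid cell."""
    total_items = sum(n for n, _ in cells)
    weighted = sum(n * pct for n, pct in cells)
    return weighted / total_items  # cut score as a percentage

# Hypothetical judge: 20 easy/essential items at 85%, 10 medium items
# at 50%, 5 hard items at 20%; the remaining nine cells are empty.
cells = [(20, 85), (10, 50), (5, 20)] + [(0, 0)] * 9
print(round(ebel_cut_score(cells), 1))
```

The final cut score would then be the mean of this value across all judges, as the text describes.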


Problems with Ebel

The complex cognitive requirements of classifying items according to two criteria, in relation to an imagined borderline student, may be challenging for the judges.

As it is assumed that some items may have questionable relevance to the construct of interest, the method implicitly throws into doubt the rigor of the test development process and validity arguments.

Nedelsky method (multiple-choice): The experts estimate which options of each multiple-choice item a borderline test taker would be able to eliminate.

In a four-option item with three distractors, if a candidate can eliminate all 3 distractors, the chance of getting the item right is 1 (100%); but if he can only rule out 1 of the distractors, the chance of answering the item correctly is 1 in 3 (33%).

These probabilities are averaged across all items for each judge, and then across all judges, to arrive at a cut score.
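A sketch of one judge's Nedelsky calculation, using hypothetical elimination counts:

```python
# Nedelsky method: for each item, the probability of a correct answer
# is 1 / (options remaining after the borderline candidate eliminates
# the distractors he or she can rule out). Averaging over items gives
# one judge's cut score as a proportion; averaging those values over
# judges gives the final cut score.

def nedelsky_cut_score(eliminated_per_item, n_options=4):
    probs = [1 / (n_options - e) for e in eliminated_per_item]
    return sum(probs) / len(probs)

# One hypothetical judge, five 4-option items: number of distractors
# the borderline candidate is judged able to eliminate on each item.
judge = [3, 1, 2, 0, 3]
print(round(nedelsky_cut_score(judge), 2))
```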

Problems with Nedelsky method

It assumes that test takers answer multiple-choice items by eliminating the options they think are distractors and then guessing randomly among the remaining options. However, it is highly unlikely that test takers answer items in this way.

The Nedelsky method tends to produce lower cut scores than other methods, and is therefore likely to increase the number of false positives.

Bookmark method

Essential materials: directions to Bookmark participants; ordered item booklet; booklet guideline; student exemplar papers; scoring guide.

Basic steps of the procedure:

Round I: Experts are informed about the number of cut scores to be established; they work in small groups, and all the essential material is introduced to them.

Round II: Overview of the cut scores established by every expert; the same procedure as in the first round is repeated.

Round III: Presentation of the percentage of students falling into each performance level, and of each median cut score from Round II; after discussion, individual judgments are made.

Procedures in the Bookmark method

Judges are presented with the necessary materials. They are then asked to keep in mind a borderline student and place a 'bookmark' in the booklet between two items, such that the candidate is more likely to answer the items below it correctly and the items above it incorrectly.

The bookmarks are discussed in the group and, finally, the median of the bookmarks for each cut point is taken as the group's recommendation for that cut point.
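The final aggregation step can be sketched as follows, with hypothetical bookmark positions for two cut points:

```python
# Bookmark method, aggregation step: each judge's bookmark is a
# position in the ordered item booklet; the group's recommended cut
# point is the median bookmark position.

import statistics

# Hypothetical bookmarks from five judges for two cut points.
bookmarks = {
    "cut_1": [12, 14, 13, 15, 12],
    "cut_2": [27, 30, 28, 29, 31],
}

recommended = {cut: statistics.median(positions)
               for cut, positions in bookmarks.items()}
print(recommended)
```

The median (rather than the mean) keeps a single outlying judge from dragging the recommendation.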

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard; the test is then administered to the test takers to discover where the cut score should lie.

Borderline group method: The judges define what borderline candidates are like, and then identify borderline candidates who fit the definition.

Once the students have been placed into groups, the test can be administered. The median score of the group defined as borderline is used as the cut score.

The main problem: the cut score is dependent upon the group being used in the study.

Method of contrasting groups: The procedure includes testing two groups of examinees.

• The classification must be done using independent criteria, such as teacher judgments.
• The test is then given and the score distributions are calculated. There are likely to be overlaps in the distributions.
• The cut score is placed where overlap is observed in the distributions.

[Figure: overlapping score distributions for the competent and non-competent groups]
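One common way to operationalize "where the distributions overlap" is to choose the cut that minimizes total classification errors. The sketch below uses that interpretation, with made-up scores:

```python
# Contrasting-groups sketch: given scores for candidates independently
# classified as competent or non-competent, try each observed score as
# a cut and keep the one that minimizes false negatives (competent
# candidates failed) plus false positives (non-competent passed).

def best_cut(competent, non_competent):
    candidates = sorted(set(competent + non_competent))

    def errors(cut):
        fn = sum(s < cut for s in competent)        # competent who fail
        fp = sum(s >= cut for s in non_competent)   # non-competent who pass
        return fn + fp

    return min(candidates, key=errors)

# Hypothetical, overlapping score distributions.
competent = [57, 60, 63, 66, 69]
non_competent = [45, 50, 55, 58, 59]
print(best_cut(competent, non_competent))
```

Because both error counts are visible, this is the setup that lets you report likely decision errors for the chosen cut, as noted below.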

Which method is the 'best'?

It depends on what kinds of judgments you can get for your standard-setting study, and on the quality of the judges you have available.

However, the contrasting-groups approach is recommended where possible, because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores.

The problem is getting the judgments of a number of people on a large group of individuals

Evaluating standard setting (Kane, 1994)

Procedural evidence
• What procedures were used for the standard setting to ensure that the process is systematic?
• Were the judges properly trained in the methodology and allowed to express their views freely?

Internal evidence
• Deals with the consistency of results arising from the procedure.
• It also estimates the extent of agreement between judges (e.g. Cohen's kappa).

External evidence
• Correlation of the scores of learners in a borderline group study with some other test of the same construct.
• High correlation = the established cut scores are defensible.
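The internal-evidence check on judge agreement via Cohen's kappa can be sketched like this, with hypothetical pass/fail classifications:

```python
# Cohen's kappa: agreement between two judges beyond what chance
# agreement from their marginal proportions would predict.
# kappa = (p_observed - p_expected) / (1 - p_expected)

def cohens_kappa(a, b):
    n = len(a)
    labels = set(a) | set(b)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    p_exp = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_obs - p_exp) / (1 - p_exp)

# Two hypothetical judges classifying the same eight candidates.
judge_a = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
judge_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail"]
print(round(cohens_kappa(judge_a, judge_b), 2))
```

Values near 1 indicate strong agreement; values near 0 mean the judges agree no more often than chance.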

Training: a critical part of standard setting

Training activities include familiarization with the PLDs and the test, looking at the scoring keys, making practice judgments, and getting feedback.

Different views may lead to disagreements among the judges. Training should not be designed to eliminate these variations but to allow free discussion among the judges; if the judges do not converge, the outcome should be accepted by the researchers.

The training process should not force agreement ('cloning'), because removing the judges' individuality and inducing agreement is a threat to validity.

The special case of the CEFR

• The CEFR Manual contains performance level descriptors for standard setting, in order to introduce a common language and a single reporting system into Europe.
• It recommends five processes to 'relate' language examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). These processes are: familiarization, specification, standardization training/benchmarking, standard-setting, and validation.
• Familiarization, standard-setting, and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe; the other two processes, however, are problematic.

PLDs in the CEFR and in other standards-based systems

In the CEFR:
• The use of PLDs is institutionalized, and their meaning is generalized across nations.
• Standardization facilitates 'the implementation of a common understanding' of the CEFR, and training is cloning rather than familiarization.
• Benchmarking = the process of rating individual performance samples using the CEFR PLDs.
• Standard-setting = 'mapping' the existing cut scores from tests onto CEFR levels.

In other standards-based systems:
• PLDs are evaluated in terms of their usefulness and meaningfulness; they can be discarded or changed.
• Standardization and training ensure that everyone understands the standard-setting method, yet judgments are freely made.
• Benchmarking = the typical performances that are identified after standard-setting.
• Standard-setting = establishing cut scores on tests.

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens. Used in this way, standards are never fixed, monolithic edifices; they are open to change, and even rejection, in the service of language education.

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems, with the intention of imposing a single acceptable teaching and assessment discourse upon professionals.

Thank You For Your Attention


Some initial decisions: All standard-setting methods involve expert judgmental decision making at some level (Jaeger, 1979).

Decision 1: Compensatory or non-compensatory marking? In compensatory marking, strength in other areas 'compensates' for weakness in one area.

Decision 2: What classification errors can you tolerate?

Decision 3: Are you going to allow test takers who 'fail' a test to retake it? If so, what time lapse is required before the retake?
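Decision 1 can be made concrete with a small sketch. The section names and cut scores are hypothetical:

```python
# Compensatory vs non-compensatory pass/fail rules for a test with
# several sections (hypothetical section and total cut scores).

SECTION_CUTS = {"reading": 50, "writing": 50, "speaking": 50}
TOTAL_CUT = 165

def compensatory(scores):
    # A strong section can offset a weak one: only the total matters.
    return sum(scores.values()) >= TOTAL_CUT

def non_compensatory(scores):
    # Every section must independently reach its own cut score.
    return all(scores[s] >= cut for s, cut in SECTION_CUTS.items())

# Strong reading compensates for weak speaking under the first rule
# but not the second.
candidate = {"reading": 70, "writing": 55, "speaking": 45}
print(compensatory(candidate), non_compensatory(candidate))
```

The same candidate can pass under one rule and fail under the other, which is why this decision has to be made before any cut scores are set.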


Classification of standard-setting methods (cycle diagram): test-centered, examinee-centered, criterion-referenced, norm-referenced.

Standard-setting methodologies

Test-centered:
• Angoff
• Ebel
• Nedelsky
• Bookmark

Examinee-centered:
• Method of contrasting groups
• Borderline group method


Second Page Lorem ipsum dolor sit amet consectetur adipisicing elit sed do eiusmod tempor incididunt ut labore et dolore magna aliqua Ut enim ad minim veniam quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur Excepteur sint occaecat cupidatat non proident sunt in culpa qui officia deserunt mollit anim id est laborum

Cycle Diagram

Test-centered

Criterion-referenced

Norm-referenced

Examinee-centered

Standard-Setting Methods

Classification of

Standard-setting methodologies

Test-centered

bull Angoffbull Ebelbull Nedelskybull Bookmark

Examinee-centered

bull Method of Contrasting Groups

bull Method of Borderline group

Common process of standard setting

Select an appropriate standard setting method depending upon the purpose of the standard setting available data and personnel

Select a panel of judges based upon explicit criteria Prepare the PLDs and other materials as appropriate Train the judges to use the method select Rate items or persons collect and store data Provide feedback on rating and initiate discussion for judges to

explain their ratings listen to others and revise their views or decisions before another round of judging

Collect final ratings and establish cut scores Ask the judges to evaluate the process Document the process in order to justify the conclusions reached

Test-centered methods The judges are presented with individual items or tasks and required to make a decision about the expected performance on them by a test taker who is just below the border between two standards

Angoff method Experts are given a set of items and they need to rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly

The average of these probabilities across judges or raters is the cut score

If the test contains polytomous items or tasks the proportion of the maximum score is used instead of the probability (modified Angoff)

Advantages amp disadvantages

Clarity

Simplicity

Cognitive difficulty in conceptualizing the borderline learner by all judges in precisely the same way

+ -

Ebel method 2 Rounds Experts classify independently test items by

I level of difficulty

II level of relevance

easy medium hard

essential important acceptable questionabl

e

Ebel method The judges estimate the percentage of items a borderline

test taker would get correct for each cell Then the percentage for each cell is multiplied by the

number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700

These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge

Finally these are averaged across judges to give a final cut score

All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example

categories Expert 3 Expert 4 Expert 5 Number of items

in a category

(А)

correctly performed

items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

EssentialEasy 11 60 660 10 70 700 13 75 975

Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0

Questionable

Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0

Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35

Mean for all experts 28

Cut-score 12

hellip

Problems with EBELThe complex cognitive requirements of classifying items

according to two criteria in relation to an imagined borderline

student may be challenging for the judges

As it is assumed that some items may have questionable

relevance to the construct of interest it implicitly throws into

doubt the rigor of the test development process and validity

arguments

Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline

test taker would be able to eliminate

In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )

These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score

Problems with Nedelsky method

It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way

Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives

Bookmark method

Directions to Bookmark participants

Ordered item booklet

Booklet guideline

Student exemplar papers

Scoring guide

Essential materials

Standard Setting

Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments

Overview of established cut-scores by every expert repeating of the same procedure as

in the first step

Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is

introduced to them

Basic steps of the procedure

Round III

Round II

Round I

Procedures in Bookmark method

Judges are presented with the necessary materials Then they are asked to keep in mind a borderline

student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly

The bookmarks are discussed in group and finally

the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie

Borderline group method The judges define what borderline candidates are

like and then identify borderline candidates who fit the definition

Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score

The main problem the cut score is dependent upon the group being used in the study

Method of contrasting groupsProcedure includes testing of two groups of examinees

bullThe classification must be done using independent criteria such as teacher judgments

bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions

bull The cut score will be where overlap is observed in the distributions

Competent Non-competent

Which method is the lsquobestrsquo

It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available

However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores

The problem is getting the judgments of a number of people on a large group of individuals

Evaluating standard setting (Kane 1994)

Procedural evidence

bull What procedures were used for the standard-setting to ensure that the process is systematic

bull Were the judges properly trained in the methodology and allowed to express their views freely

Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )

External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible

Training a critical part of standard setting

Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback

Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers

The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity

The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard

setting in order to introduce a common language and a single reporting system into Europe

bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation

bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic

PLDs in CEFR amp in other standard-based systems

The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations

Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization

Benchmarking = the process of rating individual performance samples using the CEFR PLDs

Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels

PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed

Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made

Benchmarking = the typical performances that are identified after standard-setting

Standard-setting = establishing cut scores on tests

CEFR Other standard-based systems

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems, with the intention of imposing a single acceptable teaching and assessment discourse upon professionals.

Thank You For Your Attention

Page 16: Aligning tests to standards

Standard-setting methodologies

Test-centered:
• Angoff
• Ebel
• Nedelsky
• Bookmark

Examinee-centered:
• Method of contrasting groups
• Borderline group method

Common process of standard setting

Select an appropriate standard setting method depending upon the purpose of the standard setting available data and personnel

Select a panel of judges based upon explicit criteria Prepare the PLDs and other materials as appropriate Train the judges to use the method select Rate items or persons collect and store data Provide feedback on rating and initiate discussion for judges to

explain their ratings listen to others and revise their views or decisions before another round of judging

Collect final ratings and establish cut scores Ask the judges to evaluate the process Document the process in order to justify the conclusions reached

Test-centered methods: The judges are presented with individual items or tasks, and are required to make a decision about the expected performance on them by a test taker who is just below the border between two standards.

Angoff method: Experts are given a set of items, and they need to rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly.

The average of these probabilities across judges or raters is the cut score

If the test contains polytomous items or tasks the proportion of the maximum score is used instead of the probability (modified Angoff)
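As a concrete illustration of the Angoff arithmetic (not from Fulcher's text), here is a minimal sketch in which a judge's implied cut score is the borderline candidate's expected score; the judges and probability values are invented:

```python
from statistics import mean

# Each judge rates, per item, the probability that a borderline
# candidate answers correctly. Judges and values are invented.
ratings = {
    "judge_1": [0.8, 0.6, 0.4, 0.7, 0.5],
    "judge_2": [0.7, 0.5, 0.5, 0.6, 0.4],
    "judge_3": [0.9, 0.6, 0.3, 0.7, 0.6],
}

# A judge's implied cut score is the sum of their item probabilities
# (the borderline candidate's expected score); the panel's cut score
# averages these across judges.
judge_cuts = {judge: sum(probs) for judge, probs in ratings.items()}
cut_score = mean(judge_cuts.values())
print(round(cut_score, 2))  # cut score out of 5 items
```

For a modified Angoff with polytomous tasks, the probabilities would simply be replaced by each judge's estimated proportion of the maximum score per task.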

Advantages & disadvantages
+ Clarity and simplicity.
− Cognitive difficulty: all judges must conceptualize the borderline learner in precisely the same way.

Ebel method: two rounds. Experts independently classify test items by:
I. level of difficulty: easy, medium, hard
II. level of relevance: essential, important, acceptable, questionable

Ebel method: The judges estimate the percentage of items a borderline test taker would get correct for each cell. The percentage for each cell is then multiplied by the number of items in it: so, if the 'easy/essential' cell has 20 items, 20 × 85 = 1700.

These numbers for each of the 12 cells are added up and then divided by the total number of items, to give the cut score for a single judge.

Finally, these are averaged across judges to give a final cut score.

All items are classified into the 12 cells of a 3 × 4 grid defined by the three difficulty and four relevance categories, as in the example below (A = number of items in a category; B = % of items performed correctly):

Category               Expert 3          Expert 4          Expert 5
                       A    B    A×B     A    B    A×B     A    B    A×B
Essential / Easy       11   60   660     10   70   700     13   75   975
Essential / Medium      1   25    25      3   25    75      1    0     0
Essential / Hard        0   10     0      1    0     0      0    0     0
Questionable / Easy     0    0     0      0    0     0      0    0     0
Questionable / Medium   0    0     0      0    0     0      0    0     0
Questionable / Hard     0    0     0      0    0     0      0    0     0
Mean                   25.1              26.7              35

Mean for all experts: 28
Cut-score: 12
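The cell-by-cell arithmetic can be sketched for a single hypothetical judge; the first cell reproduces the 20 × 85 = 1700 example above, and the remaining cells are invented for illustration:

```python
# Hypothetical Ebel grid for a single judge. For each (relevance,
# difficulty) cell: A = number of items, B = percentage a borderline
# test taker would get correct. Only the 20 x 85 cell comes from the
# text; the rest are invented.
grid = {
    ("essential", "easy"):   (20, 85),
    ("essential", "medium"): (10, 60),
    ("important", "easy"):   (5, 80),
    ("important", "hard"):   (5, 30),
}

total_items = sum(a for a, _ in grid.values())
weighted_sum = sum(a * b for a, b in grid.values())

# The judge's cut score, as a percentage of the maximum score.
cut_score_pct = weighted_sum / total_items
print(cut_score_pct)
```

Averaging each judge's `cut_score_pct` across the panel then gives the final cut score.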


Problems with Ebel
• The complex cognitive requirement of classifying items according to two criteria, in relation to an imagined borderline student, may be challenging for the judges.
• As it is assumed that some items may have questionable relevance to the construct of interest, the method implicitly throws into doubt the rigor of the test development process and validity arguments.

Nedelsky method (multiple-choice): The experts estimate which options of each multiple-choice item a borderline test taker would be able to eliminate.

In a four-option item with three distractors, if a candidate can eliminate all 3 distractors, the chance of getting the item right is 1 in 1 (100%); but if the candidate can rule out only 1 of the distractors, the chance of answering the item correctly is 1 in 3 (33%).

These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score
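The calculation can be sketched as follows; the per-item elimination counts are invented, and the cut score is taken as the borderline candidate's expected score (equivalently, the mean item probability times the number of items):

```python
# For each four-option item, the judge records how many distractors the
# borderline candidate could eliminate; the chance of answering correctly
# is 1 / (options remaining). Elimination counts are invented.
OPTIONS = 4
eliminated = [3, 1, 2, 0, 3]

item_probs = [1 / (OPTIONS - e) for e in eliminated]

# The borderline candidate's expected score, used as the judge's cut
# score; these would then be averaged across judges.
cut_score = sum(item_probs)
print(round(cut_score, 2))
```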

Problems with Nedelsky method

It assumes that test takers answer multiple-choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options. However, it is highly unlikely that test takers answer items in this way.

The Nedelsky method tends to produce lower cut scores than other methods, and is therefore likely to increase the number of false positives.

Bookmark method

Essential materials:
• Directions to Bookmark participants
• Ordered item booklet
• Booklet guideline
• Student exemplar papers
• Scoring guide

Standard setting: basic steps of the procedure

Round I: Experts are informed about the number of cut scores to be established; they work in small groups, and all the essential material is introduced to them.

Round II: Overview of the cut scores established by every expert; the same procedure as in the first round is repeated.

Round III: Presentation of the percentage of students falling into each performance level, and of each median cut score from Round II; after discussion, individual judgments are made.

Procedures in the Bookmark method

Judges are presented with the necessary materials. They are then asked to keep in mind a borderline student and place a 'bookmark' in the booklet between two items, such that the candidate is more likely to answer the items below correctly and the items above incorrectly.

The bookmarks are discussed in the group and, finally, the median of the bookmarks for each cut point is taken as that group's recommendation for that cut point.
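The final aggregation step can be sketched in a few lines; the bookmark positions are invented:

```python
from statistics import median

# Final-round bookmark placements: the position in the ordered item
# booklet at which each judge thinks the borderline candidate stops
# answering correctly. Positions are invented.
bookmarks = [23, 25, 24, 27, 25]

# The group's recommended cut point is the median bookmark.
cut_point = median(bookmarks)
print(cut_point)
```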

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard; the test is then administered to the test takers to discover where the cut score should lie.

Borderline group method
The judges define what borderline candidates are like, and then identify borderline candidates who fit the definition.

Once the students have been placed into groups, the test can be administered. The median score of the group defined as borderline is used as the cut score.

The main problem: the cut score is dependent upon the group being used in the study.

Method of contrasting groups
The procedure involves testing two groups of examinees:
• The classification into groups must be made using independent criteria, such as teacher judgments.
• The test is then given and the score distributions are calculated; there are likely to be overlaps in the distributions.
• The cut score is placed where the overlap in the distributions is observed.

(Figure: overlapping score distributions for the 'competent' and 'non-competent' groups.)

Which method is the 'best'?

It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available

However, the contrasting groups approach is recommended where possible, because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores.

The practical problem is obtaining judgments from a number of people on a large group of individuals.
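One way to operationalize 'where the overlap is observed' is to choose the cut that minimizes total classification errors; this is a sketch under that assumption, with invented scores:

```python
# Invented scores for two independently classified groups
# (e.g. by teacher judgment).
competent = [14, 15, 16, 17, 18, 18, 19]
non_competent = [9, 10, 11, 12, 13, 14, 15]

def errors(cut):
    """Total decision errors if 'cut' is the pass mark."""
    false_negatives = sum(s < cut for s in competent)       # competent failed
    false_positives = sum(s >= cut for s in non_competent)  # non-competent passed
    return false_negatives + false_positives

# Choose the cut score that minimizes total decision errors over the
# score range (0-20 here).
best_cut = min(range(0, 21), key=errors)
print(best_cut, errors(best_cut))
```

Because the method yields explicit false-positive and false-negative counts at each candidate cut, it supports exactly the decision-error analysis mentioned above.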

Evaluating standard setting (Kane, 1994)

Procedural evidence:
• What procedures were used for the standard setting, to ensure that the process is systematic?
• Were the judges properly trained in the methodology and allowed to express their views freely?

Internal evidence:
• Deals with the consistency of results arising from the procedure.
• It also estimates the extent of agreement between judges (e.g. Cohen's kappa).
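As a sketch of the judge-agreement check, Cohen's kappa for two judges' pass/fail decisions can be computed directly; the classifications are invented:

```python
# Two judges' pass/fail classifications of the same candidates. Invented.
judge_a = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
judge_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail"]

n = len(judge_a)
observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n

# Chance agreement: sum over labels of the product of each judge's
# marginal label proportions.
labels = set(judge_a) | set(judge_b)
expected = sum((judge_a.count(l) / n) * (judge_b.count(l) / n) for l in labels)

# Kappa corrects observed agreement for agreement expected by chance.
kappa = (observed - expected) / (1 - expected)
print(kappa)
```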

External evidence:
• Correlation of the scores of learners in a borderline group study with some other test of the same construct.
• High correlation = the established cut scores are defensible.

Training: a critical part of standard setting

Training activities include familiarization with the PLDs and the test, looking at the scoring keys, making practice judgments, and receiving feedback.

Different views may lead to disagreements among the judges. Training should not be designed to eliminate these variations, but to allow free discussion among the judges. If the judges do not converge, that outcome should be accepted by the researchers.

The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity

The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard

setting in order to introduce a common language and a single reporting system into Europe

bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation

bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic

PLDs in CEFR amp in other standard-based systems

The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations

Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization

Benchmarking = the process of rating individual performance samples using the CEFR PLDs

Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels

PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed

Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made

Benchmarking = the typical performances that are identified after standard-setting

Standard-setting = establishing cut scores on tests

CEFR Other standard-based systems

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals

Thank You For Your Attention

  • Glen Fulcher (2010) Practical Language Testing Chapter 5 A
  • Content of this chapter
  • Itrsquos as old as the hills
  • Definition of lsquostandardrsquo
  • The uses of standards
  • Unintended consequences
  • Using standards for harmonization amp identity
  • Carolingian empire of Charlemagne
  • CEFR (Common European Framework of Reference )
  • Problems with CEFR
  • How many standards can we afford
  • Performance level descriptors (PLDs) amp Test scores
  • Standard based tests CRT amp scoring rubrics
  • Some initial decisions
  • Slide 15
  • Standard-setting methodologies
  • Common process of standard setting
  • Test-centered methods
  • Angoff method
  • Advantages amp disadvantages
  • Ebel method
  • Ebel method (2)
  • Slide 23
  • Problems with EBEL
  • Nedelsky method (Multiple-choice)
  • Problems with Nedelsky method
  • Bookmark method
  • Slide 28
  • Procedures in Bookmark method
  • Examinee-centered methods
  • Borderline group method
  • Method of contrasting groups
  • Slide 33
  • Which method is the lsquobestrsquo
  • Evaluating standard setting (Kane 1994)
  • Training a critical part of standard setting
  • The special case of the CEFR
  • PLDs in CEFR amp in other standard-based systems
  • You can always count on uncertainty
  • Slide 40
Page 17: Aligning tests to standards

Common process of standard setting

Select an appropriate standard setting method depending upon the purpose of the standard setting available data and personnel

Select a panel of judges based upon explicit criteria Prepare the PLDs and other materials as appropriate Train the judges to use the method select Rate items or persons collect and store data Provide feedback on rating and initiate discussion for judges to

explain their ratings listen to others and revise their views or decisions before another round of judging

Collect final ratings and establish cut scores Ask the judges to evaluate the process Document the process in order to justify the conclusions reached

Test-centered methods The judges are presented with individual items or tasks and required to make a decision about the expected performance on them by a test taker who is just below the border between two standards

Angoff method Experts are given a set of items and they need to rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly

The average of these probabilities across judges or raters is the cut score

If the test contains polytomous items or tasks the proportion of the maximum score is used instead of the probability (modified Angoff)

Advantages amp disadvantages

Clarity

Simplicity

Cognitive difficulty in conceptualizing the borderline learner by all judges in precisely the same way

+ -

Ebel method 2 Rounds Experts classify independently test items by

I level of difficulty

II level of relevance

easy medium hard

essential important acceptable questionabl

e

Ebel method The judges estimate the percentage of items a borderline

test taker would get correct for each cell Then the percentage for each cell is multiplied by the

number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700

These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge

Finally these are averaged across judges to give a final cut score

All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example

categories Expert 3 Expert 4 Expert 5 Number of items

in a category

(А)

correctly performed

items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

EssentialEasy 11 60 660 10 70 700 13 75 975

Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0

Questionable

Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0

Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35

Mean for all experts 28

Cut-score 12

hellip

Problems with EBELThe complex cognitive requirements of classifying items

according to two criteria in relation to an imagined borderline

student may be challenging for the judges

As it is assumed that some items may have questionable

relevance to the construct of interest it implicitly throws into

doubt the rigor of the test development process and validity

arguments

Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline

test taker would be able to eliminate

In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )

These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score

Problems with Nedelsky method

It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way

Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives

Bookmark method

Directions to Bookmark participants

Ordered item booklet

Booklet guideline

Student exemplar papers

Scoring guide

Essential materials

Standard Setting

Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments

Overview of established cut-scores by every expert repeating of the same procedure as

in the first step

Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is

introduced to them

Basic steps of the procedure

Round III

Round II

Round I

Procedures in Bookmark method

Judges are presented with the necessary materials Then they are asked to keep in mind a borderline

student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly

The bookmarks are discussed in group and finally

the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie

Borderline group method The judges define what borderline candidates are

like and then identify borderline candidates who fit the definition

Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score

The main problem the cut score is dependent upon the group being used in the study

Method of contrasting groupsProcedure includes testing of two groups of examinees

bullThe classification must be done using independent criteria such as teacher judgments

bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions

bull The cut score will be where overlap is observed in the distributions

Competent Non-competent

Which method is the lsquobestrsquo

It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available

However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores

The problem is getting the judgments of a number of people on a large group of individuals

Evaluating standard setting (Kane 1994)

Procedural evidence

bull What procedures were used for the standard-setting to ensure that the process is systematic

bull Were the judges properly trained in the methodology and allowed to express their views freely

Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )

External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible

Training a critical part of standard setting

Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback

Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers

The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity

The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard

setting in order to introduce a common language and a single reporting system into Europe

bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation

bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic

PLDs in CEFR amp in other standard-based systems

The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations

Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization

Benchmarking = the process of rating individual performance samples using the CEFR PLDs

Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels

PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed

Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made

Benchmarking = the typical performances that are identified after standard-setting

Standard-setting = establishing cut scores on tests

CEFR Other standard-based systems

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals

Thank You For Your Attention

  • Glen Fulcher (2010) Practical Language Testing Chapter 5 A
  • Content of this chapter
  • Itrsquos as old as the hills
  • Definition of lsquostandardrsquo
  • The uses of standards
  • Unintended consequences
  • Using standards for harmonization amp identity
  • Carolingian empire of Charlemagne
  • CEFR (Common European Framework of Reference )
  • Problems with CEFR
  • How many standards can we afford
  • Performance level descriptors (PLDs) amp Test scores
  • Standard based tests CRT amp scoring rubrics
  • Some initial decisions
  • Slide 15
  • Standard-setting methodologies
  • Common process of standard setting
  • Test-centered methods
  • Angoff method
  • Advantages amp disadvantages
  • Ebel method
  • Ebel method (2)
  • Slide 23
  • Problems with EBEL
  • Nedelsky method (Multiple-choice)
  • Problems with Nedelsky method
  • Bookmark method
  • Slide 28
  • Procedures in Bookmark method
  • Examinee-centered methods
  • Borderline group method
  • Method of contrasting groups
  • Slide 33
  • Which method is the lsquobestrsquo
  • Evaluating standard setting (Kane 1994)
  • Training a critical part of standard setting
  • The special case of the CEFR
  • PLDs in CEFR amp in other standard-based systems
  • You can always count on uncertainty
  • Slide 40
Page 18: Aligning tests to standards

Test-centered methods The judges are presented with individual items or tasks and required to make a decision about the expected performance on them by a test taker who is just below the border between two standards

Angoff method Experts are given a set of items and they need to rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly

The average of these probabilities across judges or raters is the cut score

If the test contains polytomous items or tasks the proportion of the maximum score is used instead of the probability (modified Angoff)

Advantages amp disadvantages

Clarity

Simplicity

Cognitive difficulty in conceptualizing the borderline learner by all judges in precisely the same way

+ -

Ebel method 2 Rounds Experts classify independently test items by

I level of difficulty

II level of relevance

easy medium hard

essential important acceptable questionabl

e

Ebel method The judges estimate the percentage of items a borderline

test taker would get correct for each cell Then the percentage for each cell is multiplied by the

number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700

These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge

Finally these are averaged across judges to give a final cut score

All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example

categories Expert 3 Expert 4 Expert 5 Number of items

in a category

(А)

correctly performed

items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

EssentialEasy 11 60 660 10 70 700 13 75 975

Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0

Questionable

Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0

Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35

Mean for all experts 28

Cut-score 12

hellip

Problems with EBELThe complex cognitive requirements of classifying items

according to two criteria in relation to an imagined borderline

student may be challenging for the judges

As it is assumed that some items may have questionable

relevance to the construct of interest it implicitly throws into

doubt the rigor of the test development process and validity

arguments

Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline

test taker would be able to eliminate

In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )

These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score

Problems with Nedelsky method

It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way

Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives

Bookmark method

Directions to Bookmark participants

Ordered item booklet

Booklet guideline

Student exemplar papers

Scoring guide

Essential materials

Standard Setting

Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments

Overview of established cut-scores by every expert repeating of the same procedure as

in the first step

Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is

introduced to them

Basic steps of the procedure

Round III

Round II

Round I

Procedures in Bookmark method

Judges are presented with the necessary materials Then they are asked to keep in mind a borderline

student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly

The bookmarks are discussed in group and finally

the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie

Borderline group method The judges define what borderline candidates are

like and then identify borderline candidates who fit the definition

Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score

The main problem the cut score is dependent upon the group being used in the study

Method of contrasting groupsProcedure includes testing of two groups of examinees

bullThe classification must be done using independent criteria such as teacher judgments

bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions

bull The cut score will be where overlap is observed in the distributions

Competent Non-competent

Which method is the lsquobestrsquo

It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available

However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores

The problem is getting the judgments of a number of people on a large group of individuals

Evaluating standard setting (Kane 1994)

Procedural evidence

bull What procedures were used for the standard-setting to ensure that the process is systematic

bull Were the judges properly trained in the methodology and allowed to express their views freely

Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )

External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible

Training a critical part of standard setting

Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback

Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers

The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity

The special case of the CEFR
• The CEFR Manual contains performance level descriptors for standard setting, intended to introduce a common language and a single reporting system into Europe.
• It recommends five processes to 'relate' language examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR): familiarization, specification, standardization training/benchmarking, standard-setting, and validation.
• Familiarization, standard-setting, and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe; the other two processes, however, are problematic.

PLDs in the CEFR vs. other standards-based systems

In the CEFR:
• The use of PLDs is institutionalized, and their meaning is generalized across nations.
• Standardization facilitates 'the implementation of a common understanding' of the CEFR, and training is cloning rather than familiarization.
• Benchmarking = the process of rating individual performance samples using the CEFR PLDs.
• Standard-setting = 'mapping' the existing cut scores from tests onto CEFR levels.

In other standards-based systems:
• PLDs are evaluated in terms of their usefulness and meaningfulness; they can be discarded or changed.
• Standardization and training ensure that everyone understands the standard-setting method, yet judgments are freely made.
• Benchmarking = identifying typical performances after standard-setting.
• Standard-setting = establishing cut scores on tests.

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens. Used in this way, standards are never fixed, monolithic edifices; they are open to change, and even rejection, in the service of language education.

Standards-based testing fails if it is used as a policy tool to control educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals.

Thank You For Your Attention

Page 19: Aligning tests to standards

Angoff method Experts are given a set of items and they need to rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly

The average of these probabilities across judges or raters is the cut score

If the test contains polytomous items or tasks the proportion of the maximum score is used instead of the probability (modified Angoff)

Advantages amp disadvantages

Clarity

Simplicity

Cognitive difficulty in conceptualizing the borderline learner by all judges in precisely the same way

+ -

Ebel method 2 Rounds Experts classify independently test items by

I level of difficulty

II level of relevance

easy medium hard

essential important acceptable questionabl

e

Ebel method The judges estimate the percentage of items a borderline

test taker would get correct for each cell Then the percentage for each cell is multiplied by the

number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700

These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge

Finally these are averaged across judges to give a final cut score

All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example

categories Expert 3 Expert 4 Expert 5 Number of items

in a category

(А)

correctly performed

items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

EssentialEasy 11 60 660 10 70 700 13 75 975

Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0

Questionable

Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0

Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35

Mean for all experts 28

Cut-score 12

hellip

Problems with EBELThe complex cognitive requirements of classifying items

according to two criteria in relation to an imagined borderline

student may be challenging for the judges

As it is assumed that some items may have questionable

relevance to the construct of interest it implicitly throws into

doubt the rigor of the test development process and validity

arguments

Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline

test taker would be able to eliminate

In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )

These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score

Problems with Nedelsky method

It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way

Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives

Bookmark method

Directions to Bookmark participants

Ordered item booklet

Booklet guideline

Student exemplar papers

Scoring guide

Essential materials

Standard Setting

Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments

Overview of established cut-scores by every expert repeating of the same procedure as

in the first step

Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is

introduced to them

Basic steps of the procedure

Round III

Round II

Round I

Procedures in Bookmark method

Judges are presented with the necessary materials Then they are asked to keep in mind a borderline

student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly

The bookmarks are discussed in group and finally

the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie

Borderline group method The judges define what borderline candidates are

like and then identify borderline candidates who fit the definition

Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score

The main problem the cut score is dependent upon the group being used in the study

Method of contrasting groupsProcedure includes testing of two groups of examinees

bullThe classification must be done using independent criteria such as teacher judgments

bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions

bull The cut score will be where overlap is observed in the distributions

Competent Non-competent

Which method is the lsquobestrsquo

It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available

However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores

The problem is getting the judgments of a number of people on a large group of individuals

Evaluating standard setting (Kane 1994)

Procedural evidence

bull What procedures were used for the standard-setting to ensure that the process is systematic

bull Were the judges properly trained in the methodology and allowed to express their views freely

Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )

External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible

Training a critical part of standard setting

Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback

Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers

The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity

The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard

setting in order to introduce a common language and a single reporting system into Europe

bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation

bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic

PLDs in CEFR amp in other standard-based systems

The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations

Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization

Benchmarking = the process of rating individual performance samples using the CEFR PLDs

Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels

PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed

Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made

Benchmarking = the typical performances that are identified after standard-setting

Standard-setting = establishing cut scores on tests

CEFR Other standard-based systems

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals

Thank You For Your Attention

  • Glen Fulcher (2010) Practical Language Testing Chapter 5 A
  • Content of this chapter
  • Itrsquos as old as the hills
  • Definition of lsquostandardrsquo
  • The uses of standards
  • Unintended consequences
  • Using standards for harmonization amp identity
  • Carolingian empire of Charlemagne
  • CEFR (Common European Framework of Reference )
  • Problems with CEFR
  • How many standards can we afford
  • Performance level descriptors (PLDs) amp Test scores
  • Standard based tests CRT amp scoring rubrics
  • Some initial decisions
  • Slide 15
  • Standard-setting methodologies
  • Common process of standard setting
  • Test-centered methods
  • Angoff method
  • Advantages amp disadvantages
  • Ebel method
  • Ebel method (2)
  • Slide 23
  • Problems with EBEL
  • Nedelsky method (Multiple-choice)
  • Problems with Nedelsky method
  • Bookmark method
  • Slide 28
  • Procedures in Bookmark method
  • Examinee-centered methods
  • Borderline group method
  • Method of contrasting groups
  • Slide 33
  • Which method is the lsquobestrsquo
  • Evaluating standard setting (Kane 1994)
  • Training a critical part of standard setting
  • The special case of the CEFR
  • PLDs in CEFR amp in other standard-based systems
  • You can always count on uncertainty
  • Slide 40
Page 20: Aligning tests to standards

Advantages amp disadvantages

Clarity

Simplicity

Cognitive difficulty in conceptualizing the borderline learner by all judges in precisely the same way

+ -

Ebel method 2 Rounds Experts classify independently test items by

I level of difficulty

II level of relevance

easy medium hard

essential important acceptable questionabl

e

Ebel method The judges estimate the percentage of items a borderline

test taker would get correct for each cell Then the percentage for each cell is multiplied by the

number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700

These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge

Finally these are averaged across judges to give a final cut score

All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example

categories Expert 3 Expert 4 Expert 5 Number of items

in a category

(А)

correctly performed

items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

EssentialEasy 11 60 660 10 70 700 13 75 975

Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0

Questionable

Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0

Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35

Mean for all experts 28

Cut-score 12

hellip

Problems with EBELThe complex cognitive requirements of classifying items

according to two criteria in relation to an imagined borderline

student may be challenging for the judges

As it is assumed that some items may have questionable

relevance to the construct of interest it implicitly throws into

doubt the rigor of the test development process and validity

arguments

Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline

test taker would be able to eliminate

In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )

These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score

Problems with Nedelsky method

It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way

Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives

Bookmark method

Directions to Bookmark participants

Ordered item booklet

Booklet guideline

Student exemplar papers

Scoring guide

Essential materials

Standard Setting

Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments

Overview of established cut-scores by every expert repeating of the same procedure as

in the first step

Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is

introduced to them

Basic steps of the procedure

Round III

Round II

Round I

Procedures in Bookmark method

Judges are presented with the necessary materials Then they are asked to keep in mind a borderline

student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly

The bookmarks are discussed in group and finally

the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie

Borderline group method The judges define what borderline candidates are

like and then identify borderline candidates who fit the definition

Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score

The main problem the cut score is dependent upon the group being used in the study

Method of contrasting groupsProcedure includes testing of two groups of examinees

bullThe classification must be done using independent criteria such as teacher judgments

bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions

bull The cut score will be where overlap is observed in the distributions

Competent Non-competent

Which method is the lsquobestrsquo

It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available

However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores

The problem is getting the judgments of a number of people on a large group of individuals

Evaluating standard setting (Kane 1994)

Procedural evidence

bull What procedures were used for the standard-setting to ensure that the process is systematic

bull Were the judges properly trained in the methodology and allowed to express their views freely

Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )

External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible

Training a critical part of standard setting

Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback

Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers

The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity

The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard

setting in order to introduce a common language and a single reporting system into Europe

bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation

bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic

PLDs in CEFR amp in other standard-based systems

The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations

Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization

Benchmarking = the process of rating individual performance samples using the CEFR PLDs

Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels

PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed

Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made

Benchmarking = the typical performances that are identified after standard-setting

Standard-setting = establishing cut scores on tests

CEFR Other standard-based systems

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals

Thank You For Your Attention

  • Glen Fulcher (2010) Practical Language Testing Chapter 5 A
  • Content of this chapter
  • Itrsquos as old as the hills
  • Definition of lsquostandardrsquo
  • The uses of standards
  • Unintended consequences
  • Using standards for harmonization amp identity
  • Carolingian empire of Charlemagne
  • CEFR (Common European Framework of Reference )
  • Problems with CEFR
  • How many standards can we afford
  • Performance level descriptors (PLDs) amp Test scores
  • Standard based tests CRT amp scoring rubrics
  • Some initial decisions
  • Slide 15
  • Standard-setting methodologies
  • Common process of standard setting
  • Test-centered methods
  • Angoff method
  • Advantages amp disadvantages
  • Ebel method
  • Ebel method (2)
  • Slide 23
  • Problems with EBEL
  • Nedelsky method (Multiple-choice)
  • Problems with Nedelsky method
  • Bookmark method
  • Slide 28
  • Procedures in Bookmark method
  • Examinee-centered methods
  • Borderline group method
  • Method of contrasting groups
  • Slide 33
  • Which method is the lsquobestrsquo
  • Evaluating standard setting (Kane 1994)
  • Training a critical part of standard setting
  • The special case of the CEFR
  • PLDs in CEFR amp in other standard-based systems
  • You can always count on uncertainty
  • Slide 40
Page 21: Aligning tests to standards

Ebel method 2 Rounds Experts classify independently test items by

I level of difficulty

II level of relevance

easy medium hard

essential important acceptable questionabl

e

Ebel method The judges estimate the percentage of items a borderline

test taker would get correct for each cell Then the percentage for each cell is multiplied by the

number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700

These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge

Finally these are averaged across judges to give a final cut score

All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example

categories Expert 3 Expert 4 Expert 5 Number of items

in a category

(А)

correctly performed

items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

EssentialEasy 11 60 660 10 70 700 13 75 975

Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0

Questionable

Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0

Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35

Mean for all experts 28

Cut-score 12

hellip

Problems with EBELThe complex cognitive requirements of classifying items

according to two criteria in relation to an imagined borderline

student may be challenging for the judges

As it is assumed that some items may have questionable

relevance to the construct of interest it implicitly throws into

doubt the rigor of the test development process and validity

arguments

Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline

test taker would be able to eliminate

In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )

These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score

Problems with Nedelsky method

It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way

Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives

Bookmark method

Directions to Bookmark participants

Ordered item booklet

Booklet guideline

Student exemplar papers

Scoring guide

Essential materials

Standard Setting

Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments

Overview of established cut-scores by every expert repeating of the same procedure as

in the first step

Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is

introduced to them

Basic steps of the procedure

Round III

Round II

Round I

Procedures in Bookmark method

Judges are presented with the necessary materials Then they are asked to keep in mind a borderline

student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly

The bookmarks are discussed in group and finally

the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie

Borderline group method The judges define what borderline candidates are

like and then identify borderline candidates who fit the definition

Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score

The main problem the cut score is dependent upon the group being used in the study

Method of contrasting groupsProcedure includes testing of two groups of examinees

bullThe classification must be done using independent criteria such as teacher judgments

bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions

bull The cut score will be where overlap is observed in the distributions

Competent Non-competent

Which method is the lsquobestrsquo

It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available

However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores

The problem is getting the judgments of a number of people on a large group of individuals

Evaluating standard setting (Kane 1994)

Procedural evidence

bull What procedures were used for the standard-setting to ensure that the process is systematic

bull Were the judges properly trained in the methodology and allowed to express their views freely

Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )

External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible

Training a critical part of standard setting

Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback

Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers

The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity

The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard

setting in order to introduce a common language and a single reporting system into Europe

bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation

bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic

PLDs in CEFR amp in other standard-based systems

The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations

Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization

Benchmarking = the process of rating individual performance samples using the CEFR PLDs

Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels

PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed

Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made

Benchmarking = the typical performances that are identified after standard-setting

Standard-setting = establishing cut scores on tests

CEFR Other standard-based systems

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals

Thank You For Your Attention


Ebel method

For each cell of the grid, the judges estimate the percentage of items a borderline test taker would get correct. The percentage for each cell is then multiplied by the number of items in that cell: if the 'essential/easy' cell has 20 items and the estimate is 85%, then 20 × 85 = 1700.

These products for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge.

Finally, the judges' cut scores are averaged to give the final cut score.

All items are classified into the 12 cells of a 3 × 4 grid defined by the three difficulty and the four relevance categories. In the example below (experts 3–5 only; A = number of items in a category, B = % of items correctly performed by a borderline candidate):

                       Expert 3           Expert 4           Expert 5
Category               A    B    A×B      A    B    A×B      A    B    A×B
Essential / Easy      11   60    660     10   70    700     13   75    975
Essential / Medium     1   25     25      3   25     75      1    0      0
Essential / Hard       0   10      0      1    0      0      0    0      0
Questionable / Easy    0    0      0      0    0      0      0    0      0
Questionable / Medium  0    0      0      0    0      0      0    0      0
Questionable / Hard    0    0      0      0    0      0      0    0      0
Mean                            25.1              26.7                35

Mean for all experts: 28
Cut-score: 12
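The per-judge arithmetic described above can be sketched as follows. The cell counts and percentage estimates here are invented for illustration, not taken from the example table.

```python
def ebel_cut_score(cells):
    """Cut score (as a percentage) for one judge.

    cells: list of (n_items, pct_correct) pairs, one per grid cell, where
    pct_correct is the judge's estimate for a borderline test taker.
    """
    total_items = sum(n for n, _ in cells)
    products = sum(n * pct for n, pct in cells)  # e.g. 20 items x 85 = 1700
    return products / total_items

# Two hypothetical judges rating a 40-item test (illustrative values):
judge_1 = [(20, 85), (10, 60), (5, 40), (5, 20)]
judge_2 = [(20, 80), (10, 55), (5, 45), (5, 25)]

per_judge = [ebel_cut_score(j) for j in (judge_1, judge_2)]  # 65.0, 62.5
final_cut = sum(per_judge) / len(per_judge)                  # 63.75
```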

Problems with the Ebel method

The complex cognitive requirement of classifying items according to two criteria, in relation to an imagined borderline student, may be challenging for the judges.

Because the method assumes that some items may have questionable relevance to the construct of interest, it implicitly throws into doubt the rigor of the test development process and the validity argument.

Nedelsky method (multiple-choice)

The experts estimate which options of each multiple-choice item a borderline test taker would be able to eliminate.

In a four-option item with three distractors, if a candidate can eliminate all 3 distractors, the chance of getting the item right is 1 (100%); but if he can rule out only 1 of the distractors, the chance of answering the item correctly is 1 in 3 (33%).

These probabilities are averaged across all items for each judge, and then across all judges, to arrive at a cut score.
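The averaging step can be sketched as below; the number of items and the elimination judgments are invented for illustration.

```python
def nedelsky_cut_score(options_remaining):
    """Expected proportion correct for a borderline test taker.

    options_remaining: for each item, how many options are left after
    eliminating the distractors the judge believes that test taker can
    rule out (guessing randomly among the rest).
    """
    return sum(1 / r for r in options_remaining) / len(options_remaining)

# Hypothetical judge, three 4-option items: the borderline test taker
# eliminates 3, 1 and 2 distractors, leaving 1, 3 and 2 options.
cut = nedelsky_cut_score([1, 3, 2])   # (1 + 1/3 + 1/2) / 3, about 0.611
```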

Problems with the Nedelsky method

It assumes that test takers answer multiple-choice items by eliminating the options they think are distractors and then guessing randomly among the remaining options. However, it is highly unlikely that test takers actually answer items in this way.

The Nedelsky method also tends to produce lower cut scores than other methods, and is therefore likely to increase the number of false positives.

Bookmark method

Essential materials for a Bookmark standard setting: directions to the Bookmark participants, the ordered item booklet, a booklet guideline, student exemplar papers, and the scoring guide.

Basic steps of the procedure:

• Round I: The experts are informed of the number of cut scores to be established; they work in small groups, and all the essential material is introduced to them.

• Round II: The cut scores established by every expert are reviewed, and the same procedure as in the first round is repeated.

• Round III: The percentage of students falling into each performance level, and each median cut score from Round II, are presented; after discussion, individual judgments are made.

Procedures in the Bookmark method

The judges are presented with the necessary materials. They are then asked to keep a borderline student in mind and place a 'bookmark' in the booklet between two items, such that the candidate is more likely to answer the items below the bookmark correctly and the items above it incorrectly.

The bookmarks are discussed in the group, and finally the median of the bookmarks for each cut point is taken as the group's recommendation for that cut point.
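The final aggregation step is a simple median; the bookmark placements below are invented for illustration.

```python
from statistics import median

# Hypothetical bookmark placements (positions in the ordered item booklet)
# from seven judges, for a single cut point:
bookmarks = [23, 25, 25, 27, 28, 30, 31]

# The group's recommendation for that cut point is the median placement;
# the median resists the pull of a single extreme judge.
group_recommendation = median(bookmarks)   # 27
```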

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard; the test is then administered to the test takers to discover where the cut score should lie.

Borderline group method

The judges define what borderline candidates are like, and then identify the candidates who fit the definition.

Once the students have been placed into groups, the test is administered. The median score of the group defined as borderline is used as the cut score.

The main problem: the cut score is dependent on the particular group used in the study.
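A minimal sketch of the calculation, with invented scores, which also illustrates the group dependence just noted:

```python
from statistics import median

# Hypothetical scores of two different borderline groups on the same test:
group_a = [41, 44, 46, 47, 49, 52, 55]
group_b = [38, 40, 43, 45, 46, 48, 50]

# The cut score is the median of the borderline group's scores -- and it
# shifts with the group used in the study.
cut_a = median(group_a)   # 47
cut_b = median(group_b)   # 45
```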

Method of contrasting groups

The procedure involves testing two groups of examinees, classified in advance as 'competent' and 'non-competent':

• The classification must be done using independent criteria, such as teacher judgments.

• The test is then given and the score distributions of the two groups are calculated. There are likely to be overlaps between the distributions.

• The cut score is set where the overlap between the distributions is observed.

[Figure: overlapping score distributions of the 'competent' and 'non-competent' groups]
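One way to operationalize "where the overlap is" can be sketched as follows, with invented scores: search the overlap region for the pass mark that implies the fewest misclassifications, which also yields the decision-error counts (false positives and false negatives) this method makes available.

```python
# Invented scores for the two independently classified groups:
competent     = [55, 58, 60, 62, 63, 65, 70]
non_competent = [40, 45, 50, 52, 56, 59, 61]

def errors_at(cut):
    """Total misclassifications if 'cut' were the pass mark."""
    false_negatives = sum(s < cut for s in competent)       # competent but fail
    false_positives = sum(s >= cut for s in non_competent)  # non-competent but pass
    return false_negatives + false_positives

# Search the overlap region for the cut score with the fewest errors:
lo, hi = min(competent), max(non_competent)      # overlap runs from 55 to 61
cut = min(range(lo, hi + 1), key=errors_at)      # 55 here, with 3 errors
```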

Which method is the 'best'?

It depends on what kind of judgments you can get for your standard-setting study, and on the quality of the judges you have available.

However, the contrasting-groups approach is recommended where possible, because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores.

The practical problem is obtaining the judgments of a number of people on a large group of individuals.

Evaluating standard setting (Kane, 1994)

Procedural evidence:

• What procedures were used in the standard setting to ensure that the process was systematic?

• Were the judges properly trained in the methodology and allowed to express their views freely?

Internal evidence:

• Deals with the consistency of the results arising from the procedure.

• Also estimates the extent of agreement between judges (e.g. Cohen's kappa).

External evidence:

• Correlation of the scores of learners in a borderline group study with some other test of the same construct.

• A high correlation suggests that the established cut scores are defensible.
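Inter-judge agreement can be estimated as in this sketch of Cohen's kappa for two judges' pass/fail classifications; the ratings are invented for illustration.

```python
# Two judges classify the same ten examinees: 1 = pass, 0 = fail.
judge_a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
judge_b = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]

n = len(judge_a)
p_observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n   # 0.8

# Agreement expected by chance, from each judge's marginal pass rate:
pa, pb = sum(judge_a) / n, sum(judge_b) / n
p_chance = pa * pb + (1 - pa) * (1 - pb)                         # 0.52

# Kappa corrects observed agreement for chance agreement:
kappa = (p_observed - p_chance) / (1 - p_chance)                 # about 0.583
```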

Training: a critical part of standard setting

Training activities include familiarization with the PLDs and the test, looking at the scoring keys, making practice judgments, and getting feedback.

Different views may lead to disagreements among the judges. Training should not be designed to eliminate these variations, but to allow free discussion among the judges; if the judges do not converge, the outcome should be accepted by the researchers.

The training process should not force agreement ('cloning'), because removing the judges' individuality and inducing agreement is a threat to validity.

The special case of the CEFR

• The CEFR Manual contains performance level descriptors for standard setting, intended to introduce a common language and a single reporting system into Europe.

• It recommends five processes for 'relating' language examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment: familiarization; specification; standardization training and benchmarking; standard setting; and validation.

• Familiarization, standard setting and validation are uncontentious, because they reflect common international assessment practice that is not unique to Europe; the other two processes, however, are problematic.

PLDs in the CEFR and in other standards-based systems

CEFR:

• The use of PLDs is institutionalized, and their meaning is generalized across nations.

• Standardization facilitates 'the implementation of a common understanding' of the CEFR, and training is cloning rather than familiarization.

• Benchmarking = the process of rating individual performance samples using the CEFR PLDs.

• Standard setting = 'mapping' the existing cut scores from tests onto CEFR levels.

Other standards-based systems:

• PLDs are evaluated in terms of their usefulness and meaningfulness; they can be discarded or changed.

• Standardization and training ensure that everyone understands the standard-setting method, yet judgments are freely made.

• Benchmarking = the typical performances that are identified after standard setting.

• Standard setting = establishing cut scores on tests.

You can always count on uncertainty

Standards-based testing can be positive if people are able to reach a consensus, rather than being forced to see the world through a single lens. Used in this way, standards are never fixed, monolithic edifices; they are open to change, and even rejection, in the service of language education.

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems, with the intention of imposing a single acceptable teaching and assessment discourse upon professionals.

Thank You For Your Attention

  • Glen Fulcher (2010) Practical Language Testing Chapter 5 A
  • Content of this chapter
  • Itrsquos as old as the hills
  • Definition of lsquostandardrsquo
  • The uses of standards
  • Unintended consequences
  • Using standards for harmonization amp identity
  • Carolingian empire of Charlemagne
  • CEFR (Common European Framework of Reference )
  • Problems with CEFR
  • How many standards can we afford
  • Performance level descriptors (PLDs) amp Test scores
  • Standard based tests CRT amp scoring rubrics
  • Some initial decisions
  • Slide 15
  • Standard-setting methodologies
  • Common process of standard setting
  • Test-centered methods
  • Angoff method
  • Advantages amp disadvantages
  • Ebel method
  • Ebel method (2)
  • Slide 23
  • Problems with EBEL
  • Nedelsky method (Multiple-choice)
  • Problems with Nedelsky method
  • Bookmark method
  • Slide 28
  • Procedures in Bookmark method
  • Examinee-centered methods
  • Borderline group method
  • Method of contrasting groups
  • Slide 33
  • Which method is the lsquobestrsquo
  • Evaluating standard setting (Kane 1994)
  • Training a critical part of standard setting
  • The special case of the CEFR
  • PLDs in CEFR amp in other standard-based systems
  • You can always count on uncertainty
  • Slide 40
Page 23: Aligning tests to standards

All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example

categories Expert 3 Expert 4 Expert 5 Number of items

in a category

(А)

correctly performed

items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

Number of items

in a category

(А)

correctly

performed items

(В)

АВ

EssentialEasy 11 60 660 10 70 700 13 75 975

Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0

Questionable

Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0

Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35

Mean for all experts 28

Cut-score 12

hellip

Problems with EBELThe complex cognitive requirements of classifying items

according to two criteria in relation to an imagined borderline

student may be challenging for the judges

As it is assumed that some items may have questionable

relevance to the construct of interest it implicitly throws into

doubt the rigor of the test development process and validity

arguments

Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline

test taker would be able to eliminate

In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )

These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score

Problems with Nedelsky method

It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way

Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives

Bookmark method

Directions to Bookmark participants

Ordered item booklet

Booklet guideline

Student exemplar papers

Scoring guide

Essential materials

Standard Setting

Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments

Overview of established cut-scores by every expert repeating of the same procedure as

in the first step

Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is

introduced to them

Basic steps of the procedure

Round III

Round II

Round I

Procedures in Bookmark method

Judges are presented with the necessary materials Then they are asked to keep in mind a borderline

student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly

The bookmarks are discussed in group and finally

the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie

Borderline group method The judges define what borderline candidates are

like and then identify borderline candidates who fit the definition

Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score

The main problem the cut score is dependent upon the group being used in the study

Method of contrasting groupsProcedure includes testing of two groups of examinees

bullThe classification must be done using independent criteria such as teacher judgments

bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions

bull The cut score will be where overlap is observed in the distributions

Competent Non-competent

Which method is the lsquobestrsquo

It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available

However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores

The problem is getting the judgments of a number of people on a large group of individuals

Evaluating standard setting (Kane 1994)

Procedural evidence

bull What procedures were used for the standard-setting to ensure that the process is systematic

bull Were the judges properly trained in the methodology and allowed to express their views freely

Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )

External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible

Training a critical part of standard setting

Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback

Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers

The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity

The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard

setting in order to introduce a common language and a single reporting system into Europe

bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation

bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic

PLDs in CEFR amp in other standard-based systems

The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations

Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization

Benchmarking = the process of rating individual performance samples using the CEFR PLDs

Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels

PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed

Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made

Benchmarking = the typical performances that are identified after standard-setting

Standard-setting = establishing cut scores on tests

CEFR Other standard-based systems

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals

Thank You For Your Attention

  • Glen Fulcher (2010) Practical Language Testing Chapter 5 A
  • Content of this chapter
  • Itrsquos as old as the hills
  • Definition of lsquostandardrsquo
  • The uses of standards
  • Unintended consequences
  • Using standards for harmonization amp identity
  • Carolingian empire of Charlemagne
  • CEFR (Common European Framework of Reference )
  • Problems with CEFR
  • How many standards can we afford
  • Performance level descriptors (PLDs) amp Test scores
  • Standard based tests CRT amp scoring rubrics
  • Some initial decisions
  • Slide 15
  • Standard-setting methodologies
  • Common process of standard setting
  • Test-centered methods
  • Angoff method
  • Advantages amp disadvantages
  • Ebel method
  • Ebel method (2)
  • Slide 23
  • Problems with EBEL
  • Nedelsky method (Multiple-choice)
  • Problems with Nedelsky method
  • Bookmark method
  • Slide 28
  • Procedures in Bookmark method
  • Examinee-centered methods
  • Borderline group method
  • Method of contrasting groups
  • Slide 33
  • Which method is the lsquobestrsquo
  • Evaluating standard setting (Kane 1994)
  • Training a critical part of standard setting
  • The special case of the CEFR
  • PLDs in CEFR amp in other standard-based systems
  • You can always count on uncertainty
  • Slide 40
Page 24: Aligning tests to standards

Problems with EBELThe complex cognitive requirements of classifying items

according to two criteria in relation to an imagined borderline

student may be challenging for the judges

As it is assumed that some items may have questionable

relevance to the construct of interest it implicitly throws into

doubt the rigor of the test development process and validity

arguments

Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline

test taker would be able to eliminate

In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )

These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score

Problems with Nedelsky method

It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way

Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives

Bookmark method

Directions to Bookmark participants

Ordered item booklet

Booklet guideline

Student exemplar papers

Scoring guide

Essential materials

Standard Setting

Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments

Overview of established cut-scores by every expert repeating of the same procedure as

in the first step

Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is

introduced to them

Basic steps of the procedure

Round III

Round II

Round I

Procedures in Bookmark method

Judges are presented with the necessary materials Then they are asked to keep in mind a borderline

student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly

The bookmarks are discussed in group and finally

the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie

Borderline group method The judges define what borderline candidates are

like and then identify borderline candidates who fit the definition

Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score

The main problem the cut score is dependent upon the group being used in the study

Method of contrasting groupsProcedure includes testing of two groups of examinees

bullThe classification must be done using independent criteria such as teacher judgments

bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions

bull The cut score will be where overlap is observed in the distributions

Competent Non-competent

Which method is the lsquobestrsquo

It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available

However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores

The problem is getting the judgments of a number of people on a large group of individuals

Evaluating standard setting (Kane 1994)

Procedural evidence

bull What procedures were used for the standard-setting to ensure that the process is systematic

bull Were the judges properly trained in the methodology and allowed to express their views freely

Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )

External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible

Training a critical part of standard setting

Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback

Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers

The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity

The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard

setting in order to introduce a common language and a single reporting system into Europe

bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation

bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic

PLDs in CEFR amp in other standard-based systems

The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations

Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization

Benchmarking = the process of rating individual performance samples using the CEFR PLDs

Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels

PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed

Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made

Benchmarking = the typical performances that are identified after standard-setting

Standard-setting = establishing cut scores on tests

CEFR Other standard-based systems

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals

Thank You For Your Attention

  • Glen Fulcher (2010) Practical Language Testing Chapter 5 A
  • Content of this chapter
  • Itrsquos as old as the hills
  • Definition of lsquostandardrsquo
  • The uses of standards
  • Unintended consequences
  • Using standards for harmonization amp identity
  • Carolingian empire of Charlemagne
  • CEFR (Common European Framework of Reference )
  • Problems with CEFR
  • How many standards can we afford
  • Performance level descriptors (PLDs) amp Test scores
  • Standard based tests CRT amp scoring rubrics
  • Some initial decisions
  • Slide 15
  • Standard-setting methodologies
  • Common process of standard setting
  • Test-centered methods
  • Angoff method
  • Advantages amp disadvantages
  • Ebel method
  • Ebel method (2)
  • Slide 23
  • Problems with EBEL
  • Nedelsky method (Multiple-choice)
  • Problems with Nedelsky method
  • Bookmark method
  • Slide 28
  • Procedures in Bookmark method
  • Examinee-centered methods
  • Borderline group method
  • Method of contrasting groups
  • Slide 33
  • Which method is the lsquobestrsquo
  • Evaluating standard setting (Kane 1994)
  • Training a critical part of standard setting
  • The special case of the CEFR
  • PLDs in CEFR amp in other standard-based systems
  • You can always count on uncertainty
  • Slide 40
Page 25: Aligning tests to standards

Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline

test taker would be able to eliminate

In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )

These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score

Problems with Nedelsky method

It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way

Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives

Bookmark method

Directions to Bookmark participants

Ordered item booklet

Booklet guideline

Student exemplar papers

Scoring guide

Essential materials

Standard Setting

Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments

Overview of established cut-scores by every expert repeating of the same procedure as

in the first step

Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is

introduced to them

Basic steps of the procedure

Round III

Round II

Round I

Procedures in Bookmark method

Judges are presented with the necessary materials Then they are asked to keep in mind a borderline

student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly

The bookmarks are discussed in group and finally

the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie

Borderline group method The judges define what borderline candidates are

like and then identify borderline candidates who fit the definition

Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score

The main problem the cut score is dependent upon the group being used in the study

Method of contrasting groupsProcedure includes testing of two groups of examinees

bullThe classification must be done using independent criteria such as teacher judgments

bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions

bull The cut score will be where overlap is observed in the distributions

Competent Non-competent

Which method is the 'best'?

It depends on the kinds of judgments you can obtain for your standard-setting study and on the quality of the judges you have available.

However, the contrasting-groups approach is recommended where possible, because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores.

The practical problem is obtaining judgments from a number of people on a large group of individuals.

Evaluating standard setting (Kane, 1994)

Procedural evidence:

• What procedures were used for the standard setting, to ensure that the process was systematic?

• Were the judges properly trained in the methodology and allowed to express their views freely?

Internal evidence:

• Deals with the consistency of the results arising from the procedure.

• Also estimates the extent of agreement between judges (e.g., Cohen's kappa).

External evidence:

• Correlation of the scores of learners in a borderline-group study with some other test of the same construct.

• A high correlation suggests that the established cut scores are defensible.
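As an illustration of the internal-evidence check, Cohen's kappa for two judges' pass/fail classifications can be computed as follows (hypothetical judgments for ten examinees):

```python
from collections import Counter

# Hypothetical pass/fail classifications of 10 examinees by two judges.
judge_a = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge_b = ["pass", "fail", "fail", "pass", "fail", "pass", "fail", "pass", "pass", "pass"]

n = len(judge_a)

# Observed agreement: proportion of examinees both judges classify alike.
observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n

# Expected chance agreement, from each judge's marginal proportions.
ca, cb = Counter(judge_a), Counter(judge_b)
expected = sum(ca[k] * cb[k] for k in ca) / n**2

kappa = (observed - expected) / (1 - expected)
print(round(kappa, 2))  # -> 0.58
```

Kappa corrects raw agreement for agreement expected by chance: here the judges agree on 8 of 10 examinees, but only 0.58 of the agreement beyond chance is attributable to genuine consistency.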

Training: a critical part of standard setting

Training activities include familiarization with the PLDs and the test, looking at the scoring keys, making practice judgments, and getting feedback.

Different views may lead to disagreements among the judges. Training should not be designed to eliminate these variations but to allow free discussion among the judges; if the judges do not converge, the outcome should be accepted by the researchers.

The training process should not force agreement ('cloning'): removing the judges' individuality and inducing agreement is a threat to validity.

The special case of the CEFR

• The CEFR Manual contains performance level descriptors for standard setting, intended to introduce a common language and a single reporting system into Europe.

• It recommends five processes to 'relate' language examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR): familiarization, specification, standardization training/benchmarking, standard-setting, and validation.

• Familiarization, standard-setting, and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe; the other two processes, however, are problematic.

PLDs in the CEFR vs. other standards-based systems

CEFR:

• The use of PLDs is institutionalized, and their meaning is generalized across nations.

• Standardization facilitates 'the implementation of a common understanding' of the CEFR, and training is cloning rather than familiarization.

• Benchmarking = rating individual performance samples using the CEFR PLDs.

• Standard-setting = 'mapping' existing cut scores from tests onto CEFR levels.

Other standards-based systems:

• PLDs are evaluated in terms of their usefulness and meaningfulness; they can be discarded or changed.

• Standardization and training ensure that everyone understands the standard-setting method, yet judgments are freely made.

• Benchmarking = identifying typical performances after standard-setting.

• Standard-setting = establishing cut scores on tests.

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals

Thank You For Your Attention

  • Glen Fulcher (2010) Practical Language Testing Chapter 5 A
  • Content of this chapter
  • Itrsquos as old as the hills
  • Definition of lsquostandardrsquo
  • The uses of standards
  • Unintended consequences
  • Using standards for harmonization amp identity
  • Carolingian empire of Charlemagne
  • CEFR (Common European Framework of Reference )
  • Problems with CEFR
  • How many standards can we afford
  • Performance level descriptors (PLDs) amp Test scores
  • Standard based tests CRT amp scoring rubrics
  • Some initial decisions
  • Slide 15
  • Standard-setting methodologies
  • Common process of standard setting
  • Test-centered methods
  • Angoff method
  • Advantages amp disadvantages
  • Ebel method
  • Ebel method (2)
  • Slide 23
  • Problems with EBEL
  • Nedelsky method (Multiple-choice)
  • Problems with Nedelsky method
  • Bookmark method
  • Slide 28
  • Procedures in Bookmark method
  • Examinee-centered methods
  • Borderline group method
  • Method of contrasting groups
  • Slide 33
  • Which method is the lsquobestrsquo
  • Evaluating standard setting (Kane 1994)
  • Training a critical part of standard setting
  • The special case of the CEFR
  • PLDs in CEFR amp in other standard-based systems
  • You can always count on uncertainty
  • Slide 40
Page 26: Aligning tests to standards

Problems with Nedelsky method

It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way

Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives

Bookmark method

Directions to Bookmark participants

Ordered item booklet

Booklet guideline

Student exemplar papers

Scoring guide

Essential materials

Standard Setting

Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments

Overview of established cut-scores by every expert repeating of the same procedure as

in the first step

Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is

introduced to them

Basic steps of the procedure

Round III

Round II

Round I

Procedures in Bookmark method

Judges are presented with the necessary materials Then they are asked to keep in mind a borderline

student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly

The bookmarks are discussed in group and finally

the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie

Borderline group method The judges define what borderline candidates are

like and then identify borderline candidates who fit the definition

Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score

The main problem the cut score is dependent upon the group being used in the study

Method of contrasting groupsProcedure includes testing of two groups of examinees

bullThe classification must be done using independent criteria such as teacher judgments

bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions

bull The cut score will be where overlap is observed in the distributions

Competent Non-competent

Which method is the lsquobestrsquo

It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available

However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores

The problem is getting the judgments of a number of people on a large group of individuals

Evaluating standard setting (Kane 1994)

Procedural evidence

bull What procedures were used for the standard-setting to ensure that the process is systematic

bull Were the judges properly trained in the methodology and allowed to express their views freely

Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )

External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible

Training a critical part of standard setting

Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback

Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers

The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity

The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard

setting in order to introduce a common language and a single reporting system into Europe

bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation

bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic

PLDs in CEFR amp in other standard-based systems

The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations

Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization

Benchmarking = the process of rating individual performance samples using the CEFR PLDs

Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels

PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed

Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made

Benchmarking = the typical performances that are identified after standard-setting

Standard-setting = establishing cut scores on tests

CEFR Other standard-based systems

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals

Thank You For Your Attention

  • Glen Fulcher (2010) Practical Language Testing Chapter 5 A
  • Content of this chapter
  • Itrsquos as old as the hills
  • Definition of lsquostandardrsquo
  • The uses of standards
  • Unintended consequences
  • Using standards for harmonization amp identity
  • Carolingian empire of Charlemagne
  • CEFR (Common European Framework of Reference )
  • Problems with CEFR
  • How many standards can we afford
  • Performance level descriptors (PLDs) amp Test scores
  • Standard based tests CRT amp scoring rubrics
  • Some initial decisions
  • Slide 15
  • Standard-setting methodologies
  • Common process of standard setting
  • Test-centered methods
  • Angoff method
  • Advantages amp disadvantages
  • Ebel method
  • Ebel method (2)
  • Slide 23
  • Problems with EBEL
  • Nedelsky method (Multiple-choice)
  • Problems with Nedelsky method
  • Bookmark method
  • Slide 28
  • Procedures in Bookmark method
  • Examinee-centered methods
  • Borderline group method
  • Method of contrasting groups
  • Slide 33
  • Which method is the lsquobestrsquo
  • Evaluating standard setting (Kane 1994)
  • Training a critical part of standard setting
  • The special case of the CEFR
  • PLDs in CEFR amp in other standard-based systems
  • You can always count on uncertainty
  • Slide 40
Page 27: Aligning tests to standards

Bookmark method

Directions to Bookmark participants

Ordered item booklet

Booklet guideline

Student exemplar papers

Scoring guide

Essential materials

Standard Setting

Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments

Overview of established cut-scores by every expert repeating of the same procedure as

in the first step

Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is

introduced to them

Basic steps of the procedure

Round III

Round II

Round I

Procedures in Bookmark method

Judges are presented with the necessary materials Then they are asked to keep in mind a borderline

student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly

The bookmarks are discussed in group and finally

the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie

Borderline group method The judges define what borderline candidates are

like and then identify borderline candidates who fit the definition

Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score

The main problem the cut score is dependent upon the group being used in the study

Method of contrasting groupsProcedure includes testing of two groups of examinees

bullThe classification must be done using independent criteria such as teacher judgments

bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions

bull The cut score will be where overlap is observed in the distributions

Competent Non-competent

Which method is the lsquobestrsquo

It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available

However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores

The problem is getting the judgments of a number of people on a large group of individuals

Evaluating standard setting (Kane 1994)

Procedural evidence

bull What procedures were used for the standard-setting to ensure that the process is systematic

bull Were the judges properly trained in the methodology and allowed to express their views freely

Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )

External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible

Training a critical part of standard setting

Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback

Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers

The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity

The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard

setting in order to introduce a common language and a single reporting system into Europe

bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation

bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic

PLDs in CEFR amp in other standard-based systems

The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations

Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization

Benchmarking = the process of rating individual performance samples using the CEFR PLDs

Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels

PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed

Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made

Benchmarking = the typical performances that are identified after standard-setting

Standard-setting = establishing cut scores on tests

CEFR Other standard-based systems

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals

Thank You For Your Attention

  • Glen Fulcher (2010) Practical Language Testing Chapter 5 A
  • Content of this chapter
  • Itrsquos as old as the hills
  • Definition of lsquostandardrsquo
  • The uses of standards
  • Unintended consequences
  • Using standards for harmonization amp identity
  • Carolingian empire of Charlemagne
  • CEFR (Common European Framework of Reference )
  • Problems with CEFR
  • How many standards can we afford
  • Performance level descriptors (PLDs) amp Test scores
  • Standard based tests CRT amp scoring rubrics
  • Some initial decisions
  • Slide 15
  • Standard-setting methodologies
  • Common process of standard setting
  • Test-centered methods
  • Angoff method
  • Advantages amp disadvantages
  • Ebel method
  • Ebel method (2)
  • Slide 23
  • Problems with EBEL
  • Nedelsky method (Multiple-choice)
  • Problems with Nedelsky method
  • Bookmark method
  • Slide 28
  • Procedures in Bookmark method
  • Examinee-centered methods
  • Borderline group method
  • Method of contrasting groups
  • Slide 33
  • Which method is the lsquobestrsquo
  • Evaluating standard setting (Kane 1994)
  • Training a critical part of standard setting
  • The special case of the CEFR
  • PLDs in CEFR amp in other standard-based systems
  • You can always count on uncertainty
  • Slide 40
Page 28: Aligning tests to standards

Standard Setting

Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments

Overview of established cut-scores by every expert repeating of the same procedure as

in the first step

Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is

introduced to them

Basic steps of the procedure

Round III

Round II

Round I

Procedures in Bookmark method

Judges are presented with the necessary materials Then they are asked to keep in mind a borderline

student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly

The bookmarks are discussed in group and finally

the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie

Borderline group method The judges define what borderline candidates are

like and then identify borderline candidates who fit the definition

Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score

The main problem the cut score is dependent upon the group being used in the study

Method of contrasting groupsProcedure includes testing of two groups of examinees

bullThe classification must be done using independent criteria such as teacher judgments

bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions

bull The cut score will be where overlap is observed in the distributions

Competent Non-competent

Which method is the lsquobestrsquo

It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available

However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores

The problem is getting the judgments of a number of people on a large group of individuals

Evaluating standard setting (Kane 1994)

Procedural evidence

bull What procedures were used for the standard-setting to ensure that the process is systematic

bull Were the judges properly trained in the methodology and allowed to express their views freely

Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )

External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible

Training a critical part of standard setting

Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback

Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers

The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity

The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard

setting in order to introduce a common language and a single reporting system into Europe

bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation

bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic

PLDs in CEFR amp in other standard-based systems

The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations

Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization

Benchmarking = the process of rating individual performance samples using the CEFR PLDs

Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels

PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed

Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made

Benchmarking = the typical performances that are identified after standard-setting

Standard-setting = establishing cut scores on tests

CEFR Other standard-based systems

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals

Thank You For Your Attention

  • Glen Fulcher (2010) Practical Language Testing Chapter 5 A
  • Content of this chapter
  • Itrsquos as old as the hills
  • Definition of lsquostandardrsquo
  • The uses of standards
  • Unintended consequences
  • Using standards for harmonization amp identity
  • Carolingian empire of Charlemagne
  • CEFR (Common European Framework of Reference )
  • Problems with CEFR
  • How many standards can we afford
  • Performance level descriptors (PLDs) amp Test scores
  • Standard based tests CRT amp scoring rubrics
  • Some initial decisions
  • Slide 15
  • Standard-setting methodologies
  • Common process of standard setting
  • Test-centered methods
  • Angoff method
  • Advantages amp disadvantages
  • Ebel method
  • Ebel method (2)
  • Slide 23
  • Problems with EBEL
  • Nedelsky method (Multiple-choice)
  • Problems with Nedelsky method
  • Bookmark method
  • Slide 28
  • Procedures in Bookmark method
  • Examinee-centered methods
  • Borderline group method
  • Method of contrasting groups
  • Slide 33
  • Which method is the lsquobestrsquo
  • Evaluating standard setting (Kane 1994)
  • Training a critical part of standard setting
  • The special case of the CEFR
  • PLDs in CEFR amp in other standard-based systems
  • You can always count on uncertainty
  • Slide 40
Page 29: Aligning tests to standards

Procedures in Bookmark method

Judges are presented with the necessary materials Then they are asked to keep in mind a borderline

student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly

The bookmarks are discussed in group and finally

the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie

Borderline group method The judges define what borderline candidates are

like and then identify borderline candidates who fit the definition

Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score

The main problem the cut score is dependent upon the group being used in the study

Method of contrasting groupsProcedure includes testing of two groups of examinees

bullThe classification must be done using independent criteria such as teacher judgments

bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions

bull The cut score will be where overlap is observed in the distributions

Competent Non-competent

Which method is the lsquobestrsquo

It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available

However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores

The problem is getting the judgments of a number of people on a large group of individuals

Evaluating standard setting (Kane 1994)

Procedural evidence

bull What procedures were used for the standard-setting to ensure that the process is systematic

bull Were the judges properly trained in the methodology and allowed to express their views freely

Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )

External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible

Training a critical part of standard setting

Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback

Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers

The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity

Examinee-centered methods

The judges make decisions about whether individual test takers are likely to be just below a particular standard; the test is then administered to the test takers to discover where the cut score should lie.

Borderline group method: The judges define what borderline candidates are like, and then identify candidates who fit the definition. Once the students have been placed into groups, the test can be administered; the median score of the group defined as borderline is used as the cut score.

The main problem: the cut score is dependent upon the particular group used in the study.
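As a minimal sketch of the calculation (hypothetical scores, Python standard library only), the cut score from a borderline group study is simply the median of that group's test scores:

```python
from statistics import median

# Hypothetical test scores for candidates the judges classified as "borderline"
borderline_scores = [41, 44, 46, 46, 48, 50, 53]

# The median of the borderline group is taken as the cut score
cut_score = median(borderline_scores)
print(cut_score)  # 46
```

The group-dependence problem noted above is visible here: a different sample of borderline candidates would shift the median, and hence the cut score.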

Method of contrasting groups: The procedure involves testing two groups of examinees, one judged competent and one non-competent.

• The classification must be made using independent criteria, such as teacher judgments.
• The test is then given and the score distributions are calculated; the two distributions are likely to overlap.
• The cut score is set where the overlap in the distributions is observed.

[Figure: overlapping score distributions for the 'competent' and 'non-competent' groups]

Which method is the 'best'?

It depends on what kind of judgments you can obtain for your standard-setting study and on the quality of the judges you have available.

However, the contrasting-groups approach is recommended where possible, because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores.

The practical problem is obtaining judgments from a number of people about a large group of individuals.
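The contrasting-groups logic can be sketched as follows (hypothetical data): for each candidate cut score, count the decision errors it would produce, namely competent examinees who would fail (false negatives) and non-competent examinees who would pass (false positives), and pick the cut score that minimizes total errors:

```python
# Hypothetical score distributions from a contrasting-groups study
competent = [55, 58, 60, 62, 63, 65, 70, 72]
non_competent = [40, 44, 48, 50, 55, 57, 60, 61]

def decision_errors(cut, competent, non_competent):
    """Count misclassifications for a candidate cut score."""
    false_negatives = sum(s < cut for s in competent)       # competent, but would fail
    false_positives = sum(s >= cut for s in non_competent)  # non-competent, but would pass
    return false_negatives + false_positives

# Search the overlap region for the cut score with the fewest errors
candidates = range(min(competent), max(non_competent) + 1)
best_cut = min(candidates, key=lambda c: decision_errors(c, competent, non_competent))
print(best_cut, decision_errors(best_cut, competent, non_competent))  # 58 3
```

This is exactly why the method supports error estimates: because both groups are classified independently beforehand, every candidate cut score yields explicit false-positive and false-negative counts.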

Evaluating standard setting (Kane, 1994)

Procedural evidence
• What procedures were used for the standard setting, to ensure that the process is systematic?
• Were the judges properly trained in the methodology and allowed to express their views freely?

Internal evidence
• Deals with the consistency of results arising from the procedure.
• Also estimates the extent of agreement between judges (e.g. Cohen's kappa).

External evidence
• Correlation of the scores of learners in a borderline group study with some other test of the same construct.
• A high correlation indicates that the established cut scores are defensible.
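The inter-judge agreement mentioned under internal evidence is commonly quantified with Cohen's kappa, which corrects observed agreement for agreement expected by chance. A self-contained sketch, using hypothetical pass/fail judgments from two judges:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters judging the same items."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each rater's marginal category proportions
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2
    return (observed - expected) / (1 - expected)

# Two judges classifying ten examinees as pass (1) or fail (0)
judge_a = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
judge_b = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
print(round(cohens_kappa(judge_a, judge_b), 2))  # 0.58
```

Here the judges agree on 8 of 10 cases (0.80 raw agreement), but kappa discounts the 0.52 agreement expected by chance, giving a more modest 0.58.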

Training: a critical part of standard setting

Training activities include familiarization with the PLDs and the test, looking at the scoring keys, making practice judgments, and getting feedback.

Different views may lead to disagreements among the judges. Training should not be designed to eliminate these variations but to allow free discussion among judges; if the judges do not converge, the outcome should be accepted by the researchers.

The training process should not force agreement ('cloning'), because removing the judges' individuality and inducing agreement is a threat to validity.

The special case of the CEFR

• The CEFR Manual contains performance level descriptors for standard setting, intended to introduce a common language and a single reporting system into Europe.

• It recommends five processes for 'relating' language examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR): familiarization, specification, standardization, training/benchmarking, standard setting, and validation.

• Familiarization, standard setting, and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe; the other two processes, however, are problematic.

PLDs in the CEFR and in other standards-based systems

• PLDs: In the CEFR, the use of PLDs is institutionalized and their meaning is generalized across nations. In other systems, PLDs are evaluated in terms of their usefulness and meaningfulness, and can be discarded or changed.

• Standardization and training: In the CEFR, standardization facilitates 'the implementation of a common understanding' of the CEFR, and training is cloning rather than familiarization. In other systems, standardization and training ensure that everyone understands the standard-setting method, yet judgments are freely made.

• Benchmarking: In the CEFR, benchmarking is the process of rating individual performance samples using the CEFR PLDs. In other systems, benchmarking identifies the typical performances after standard setting has taken place.

• Standard setting: In the CEFR, standard setting means 'mapping' existing cut scores from tests onto CEFR levels. In other systems, it means establishing cut scores on tests.

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus, rather than being forced to see the world through a single lens. Used in this way, standards are never fixed, monolithic edifices; they are open to change, and even rejection, in the service of language education.

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems, with the intention of imposing a single acceptable teaching and assessment discourse upon professionals.

Thank You For Your Attention

  • Glen Fulcher (2010) Practical Language Testing Chapter 5 A
  • Content of this chapter
  • Itrsquos as old as the hills
  • Definition of lsquostandardrsquo
  • The uses of standards
  • Unintended consequences
  • Using standards for harmonization amp identity
  • Carolingian empire of Charlemagne
  • CEFR (Common European Framework of Reference )
  • Problems with CEFR
  • How many standards can we afford
  • Performance level descriptors (PLDs) amp Test scores
  • Standard based tests CRT amp scoring rubrics
  • Some initial decisions
  • Slide 15
  • Standard-setting methodologies
  • Common process of standard setting
  • Test-centered methods
  • Angoff method
  • Advantages amp disadvantages
  • Ebel method
  • Ebel method (2)
  • Slide 23
  • Problems with EBEL
  • Nedelsky method (Multiple-choice)
  • Problems with Nedelsky method
  • Bookmark method
  • Slide 28
  • Procedures in Bookmark method
  • Examinee-centered methods
  • Borderline group method
  • Method of contrasting groups
  • Slide 33
  • Which method is the lsquobestrsquo
  • Evaluating standard setting (Kane 1994)
  • Training a critical part of standard setting
  • The special case of the CEFR
  • PLDs in CEFR amp in other standard-based systems
  • You can always count on uncertainty
  • Slide 40
Page 31: Aligning tests to standards

Borderline group method The judges define what borderline candidates are

like and then identify borderline candidates who fit the definition

Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score

The main problem the cut score is dependent upon the group being used in the study

Method of contrasting groupsProcedure includes testing of two groups of examinees

bullThe classification must be done using independent criteria such as teacher judgments

bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions

bull The cut score will be where overlap is observed in the distributions

Competent Non-competent

Which method is the lsquobestrsquo

It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available

However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores

The problem is getting the judgments of a number of people on a large group of individuals

Evaluating standard setting (Kane 1994)

Procedural evidence

bull What procedures were used for the standard-setting to ensure that the process is systematic

bull Were the judges properly trained in the methodology and allowed to express their views freely

Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )

External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible

Training a critical part of standard setting

Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback

Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers

The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity

The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard

setting in order to introduce a common language and a single reporting system into Europe

bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation

bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic

PLDs in CEFR amp in other standard-based systems

The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations

Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization

Benchmarking = the process of rating individual performance samples using the CEFR PLDs

Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels

PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed

Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made

Benchmarking = the typical performances that are identified after standard-setting

Standard-setting = establishing cut scores on tests

CEFR Other standard-based systems

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals

Thank You For Your Attention

  • Glen Fulcher (2010) Practical Language Testing Chapter 5 A
  • Content of this chapter
  • Itrsquos as old as the hills
  • Definition of lsquostandardrsquo
  • The uses of standards
  • Unintended consequences
  • Using standards for harmonization amp identity
  • Carolingian empire of Charlemagne
  • CEFR (Common European Framework of Reference )
  • Problems with CEFR
  • How many standards can we afford
  • Performance level descriptors (PLDs) amp Test scores
  • Standard based tests CRT amp scoring rubrics
  • Some initial decisions
  • Slide 15
  • Standard-setting methodologies
  • Common process of standard setting
  • Test-centered methods
  • Angoff method
  • Advantages amp disadvantages
  • Ebel method
  • Ebel method (2)
  • Slide 23
  • Problems with EBEL
  • Nedelsky method (Multiple-choice)
  • Problems with Nedelsky method
  • Bookmark method
  • Slide 28
  • Procedures in Bookmark method
  • Examinee-centered methods
  • Borderline group method
  • Method of contrasting groups
  • Slide 33
  • Which method is the lsquobestrsquo
  • Evaluating standard setting (Kane 1994)
  • Training a critical part of standard setting
  • The special case of the CEFR
  • PLDs in CEFR amp in other standard-based systems
  • You can always count on uncertainty
  • Slide 40
Page 32: Aligning tests to standards

Method of contrasting groupsProcedure includes testing of two groups of examinees

bullThe classification must be done using independent criteria such as teacher judgments

bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions

bull The cut score will be where overlap is observed in the distributions

Competent Non-competent

Which method is the lsquobestrsquo

It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available

However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores

The problem is getting the judgments of a number of people on a large group of individuals

Evaluating standard setting (Kane 1994)

Procedural evidence

bull What procedures were used for the standard-setting to ensure that the process is systematic

bull Were the judges properly trained in the methodology and allowed to express their views freely

Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )

External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible

Training a critical part of standard setting

Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback

Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers

The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity

The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard

setting in order to introduce a common language and a single reporting system into Europe

bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation

bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic

PLDs in CEFR amp in other standard-based systems

The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations

Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization

Benchmarking = the process of rating individual performance samples using the CEFR PLDs

Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels

PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed

Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made

Benchmarking = the typical performances that are identified after standard-setting

Standard-setting = establishing cut scores on tests

CEFR Other standard-based systems

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals

Thank You For Your Attention

  • Glen Fulcher (2010) Practical Language Testing Chapter 5 A
  • Content of this chapter
  • Itrsquos as old as the hills
  • Definition of lsquostandardrsquo
  • The uses of standards
  • Unintended consequences
  • Using standards for harmonization amp identity
  • Carolingian empire of Charlemagne
  • CEFR (Common European Framework of Reference )
  • Problems with CEFR
  • How many standards can we afford
  • Performance level descriptors (PLDs) amp Test scores
  • Standard based tests CRT amp scoring rubrics
  • Some initial decisions
  • Slide 15
  • Standard-setting methodologies
  • Common process of standard setting
  • Test-centered methods
  • Angoff method
  • Advantages amp disadvantages
  • Ebel method
  • Ebel method (2)
  • Slide 23
  • Problems with EBEL
  • Nedelsky method (Multiple-choice)
  • Problems with Nedelsky method
  • Bookmark method
  • Slide 28
  • Procedures in Bookmark method
  • Examinee-centered methods
  • Borderline group method
  • Method of contrasting groups
  • Slide 33
  • Which method is the lsquobestrsquo
  • Evaluating standard setting (Kane 1994)
  • Training a critical part of standard setting
  • The special case of the CEFR
  • PLDs in CEFR amp in other standard-based systems
  • You can always count on uncertainty
  • Slide 40
Page 33: Aligning tests to standards

Which method is the lsquobestrsquo

It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available

However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores

The problem is getting the judgments of a number of people on a large group of individuals

Evaluating standard setting (Kane 1994)

Procedural evidence

bull What procedures were used for the standard-setting to ensure that the process is systematic

bull Were the judges properly trained in the methodology and allowed to express their views freely

Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )

External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible

Training a critical part of standard setting

Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback

Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers

The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity

The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard

setting in order to introduce a common language and a single reporting system into Europe

bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation

bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic

PLDs in CEFR amp in other standard-based systems

The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations

Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization

Benchmarking = the process of rating individual performance samples using the CEFR PLDs

Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels

PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed

Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made

Benchmarking = the typical performances that are identified after standard-setting

Standard-setting = establishing cut scores on tests

CEFR Other standard-based systems

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals

Thank You For Your Attention

  • Glen Fulcher (2010) Practical Language Testing Chapter 5 A
  • Content of this chapter
  • Itrsquos as old as the hills
  • Definition of lsquostandardrsquo
  • The uses of standards
  • Unintended consequences
  • Using standards for harmonization amp identity
  • Carolingian empire of Charlemagne
  • CEFR (Common European Framework of Reference )
  • Problems with CEFR
  • How many standards can we afford
  • Performance level descriptors (PLDs) amp Test scores
  • Standard based tests CRT amp scoring rubrics
  • Some initial decisions
  • Slide 15
  • Standard-setting methodologies
  • Common process of standard setting
  • Test-centered methods
  • Angoff method
  • Advantages amp disadvantages
  • Ebel method
  • Ebel method (2)
  • Slide 23
  • Problems with EBEL
  • Nedelsky method (Multiple-choice)
  • Problems with Nedelsky method
  • Bookmark method
  • Slide 28
  • Procedures in Bookmark method
  • Examinee-centered methods
  • Borderline group method
  • Method of contrasting groups
  • Slide 33
  • Which method is the lsquobestrsquo
  • Evaluating standard setting (Kane 1994)
  • Training a critical part of standard setting
  • The special case of the CEFR
  • PLDs in CEFR amp in other standard-based systems
  • You can always count on uncertainty
  • Slide 40
Page 34: Aligning tests to standards

Evaluating standard setting (Kane 1994)

Procedural evidence

bull What procedures were used for the standard-setting to ensure that the process is systematic

bull Were the judges properly trained in the methodology and allowed to express their views freely

Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )

External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible

Training a critical part of standard setting

Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback

Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers

The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity

The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard

setting in order to introduce a common language and a single reporting system into Europe

bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation

bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic

PLDs in CEFR amp in other standard-based systems

The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations

Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization

Benchmarking = the process of rating individual performance samples using the CEFR PLDs

Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels

PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed

Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made

Benchmarking = the typical performances that are identified after standard-setting

Standard-setting = establishing cut scores on tests

CEFR Other standard-based systems

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals

Thank You For Your Attention

  • Glen Fulcher (2010) Practical Language Testing Chapter 5 A
  • Content of this chapter
  • Itrsquos as old as the hills
  • Definition of lsquostandardrsquo
  • The uses of standards
  • Unintended consequences
  • Using standards for harmonization amp identity
  • Carolingian empire of Charlemagne
  • CEFR (Common European Framework of Reference )
  • Problems with CEFR
  • How many standards can we afford
  • Performance level descriptors (PLDs) amp Test scores
  • Standard based tests CRT amp scoring rubrics
  • Some initial decisions
  • Slide 15
  • Standard-setting methodologies
  • Common process of standard setting
  • Test-centered methods
  • Angoff method
  • Advantages amp disadvantages
  • Ebel method
  • Ebel method (2)
  • Slide 23
  • Problems with EBEL
  • Nedelsky method (Multiple-choice)
  • Problems with Nedelsky method
  • Bookmark method
  • Slide 28
  • Procedures in Bookmark method
  • Examinee-centered methods
  • Borderline group method
  • Method of contrasting groups
  • Slide 33
  • Which method is the lsquobestrsquo
  • Evaluating standard setting (Kane 1994)
  • Training a critical part of standard setting
  • The special case of the CEFR
  • PLDs in CEFR amp in other standard-based systems
  • You can always count on uncertainty
  • Slide 40
Page 35: Aligning tests to standards

Training a critical part of standard setting

Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback

Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers

The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity

The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard

setting in order to introduce a common language and a single reporting system into Europe

bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation

bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic

Page 36: Aligning tests to standards

The special case of the CEFR

• The CEFR Manual contains performance level descriptors for standard setting, in order to introduce a common language and a single reporting system into Europe.

• It recommends five processes to 'relate' language examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR): familiarization, specification, standardization training and benchmarking, standard-setting, and validation.

• Familiarization, standard-setting, and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe; the other two processes, however, are problematic.

Page 37: Aligning tests to standards

PLDs in the CEFR & in other standards-based systems

In the CEFR:
• The use of PLDs is institutionalized, and their meaning is generalized across nations.
• Standardization facilitates 'the implementation of a common understanding' of the CEFR, and training is cloning rather than familiarization.
• Benchmarking = the process of rating individual performance samples using the CEFR PLDs.
• Standard-setting = 'mapping' the existing cut scores from tests onto CEFR levels.

In other standards-based systems:
• PLDs are evaluated in terms of their usefulness and meaningfulness; they can be discarded or changed.
• Standardization and training ensure that everyone understands the standard-setting method, yet judgments are freely made.
• Benchmarking = the typical performances that are identified after standard-setting.
• Standard-setting = establishing cut scores on tests.
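As a toy illustration of the contrast above, between establishing cut scores on a test and 'mapping' scores onto CEFR levels, the sketch below assigns a CEFR label to a raw score. All cut scores here are invented for illustration; in practice they would come from a standard-setting study (e.g. an Angoff or Bookmark panel):

```python
# Toy illustration: mapping raw test scores onto CEFR levels via cut scores.
# The cut scores below are hypothetical; real ones are set by judge panels
# in a standard-setting study, then validated.

# Each entry: (minimum raw score out of 100, CEFR level awarded)
CUT_SCORES = [
    (90, "C2"),
    (75, "C1"),
    (60, "B2"),
    (45, "B1"),
    (30, "A2"),
    (0,  "A1"),
]

def cefr_level(raw_score: int) -> str:
    """Return the CEFR level whose cut score the raw score meets or exceeds."""
    for cut, level in CUT_SCORES:
        if raw_score >= cut:
            return level
    raise ValueError("score below the lowest cut score")

print(cefr_level(78))  # C1
print(cefr_level(44))  # A2
```

The point of the sketch is only that the mapping is wholly determined by where the cut scores are placed, which is why the standard-setting judgments behind them, not the lookup itself, carry all the weight.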

Page 38: Aligning tests to standards

You can always count on uncertainty

Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens. Used in this way, standards are never fixed, monolithic edifices; they are open to change, and even rejection, in the service of language education.

Standards-based testing fails if it is used as a policy tool to achieve control of educational systems, with the intention of imposing a single acceptable teaching and assessment discourse upon professionals.

Page 39: Aligning tests to standards

Thank You For Your Attention
