The BILC BAT: A Research and Development Success Story
Ray T. Clifford
BILC Professional Seminar, Vienna, Austria, 11 October
Slide 2
Language is the most complex of human behaviors. Language
proficiency is clearly not a simple, one-dimensional trait.
Therefore, language development cannot be expected to be linear.
However, language proficiency can be assessed against a hierarchy
of identifiable common stages of language skill development.
Slide 3
Testing Language Proficiency in the Receptive Skills
Norm-referenced statistical analyses are problematic when testing
for proficiency. Rasch one-factor IRT analysis assumes: a
one-dimensional trait, linear skill development, and that all test
items discriminate equally well. Norm-referenced statistics are meant
to distinguish all students from one another, not to separate passing
students from failing students.
Slide 4
Testing Language Proficiency in the Receptive Skills
Norm-referenced statistical analyses are problematic when testing
for proficiency. They require too many subjects for use in LCTLs.
About 100 to 300 test subjects of varying abilities must answer
each item. There may not be that number of people to be tested. The
results do not have a direct relationship to proficiency levels or
other external criteria.
Slide 5
Testing Language Proficiency in the Receptive Skills
Norm-referenced statistical analyses are problematic when testing
for proficiency. There has not been an adequate way of ensuring
that the range of skills tested and the difficulty of any given
test match the targeted range of the language proficiency scale.
Setting passing scores using norm-referenced statistics is an
imprecise process. Setting multiple cut-scores from a total test
score violates the criterion-referenced principle of
non-compensatory scoring.
Slide 6
Test Development Procedures: Norm-Referenced Tests Create a
table of test specifications. Train item writers in item-writing
techniques. Develop items. Test the items for difficulty and
reliability by administering them to several hundred learners. Use
statistics to eliminate bad items. Administer the resulting test.
Report results compared to other students or attempt to relate
these norm-referenced results to a polytomous set of criteria (such
as the STANAG scale).
Slide 7
Traditional Method of Setting Cut Scores
[Chart: a test to be calibrated, scored 0-100, is administered to groups of known ability at Levels 1, 2, and 3]
Slide 8
The Results You Hope For:
[Chart: the score ranges of the Level 1, 2, and 3 groups separate cleanly on the 0-100 scale]
Slide 9
The Results You Always Get:
[Chart: the test scores received by the Level 1, 2, and 3 groups overlap, leaving the cut-score boundaries (marked "???") uncertain]
Slide 10
Why is there always an overlap? Total scores are by definition
compensatory scores: every answer guessed correctly adds to the
individual's score, and there is no way to check for ability at a
given proficiency level. Students with different abilities may have
attained the same score, e.g. by answering only the Level 1 questions
correctly, or by answering 25% of all the questions correctly.
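As a minimal sketch (the numbers are illustrative, not from the BAT pilot), the ambiguity of a compensatory total score can be shown in a few lines of Python:

```python
# Two hypothetical examinees on a 60-item test (20 items per level).
# Their compensatory totals are identical, but their profiles differ.
examinee_a = {1: 20, 2: 0, 3: 0}   # answered only the Level 1 items correctly
examinee_b = {1: 7, 2: 7, 3: 6}    # answered about a third of everything

total_a = sum(examinee_a.values())
total_b = sum(examinee_b.values())
print(total_a, total_b)  # both 20: the totals cannot tell them apart

# A non-compensatory, per-level check (70% threshold) can:
passes = {name: scores[1] / 20 >= 0.70
          for name, scores in (("A", examinee_a), ("B", examinee_b))}
print(passes)  # A shows sustained Level 1 ability; B does not
```

The per-level check recovers exactly the distinction that the single total score erases.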
Slide 11
No matter where the cut scores are set, they are wrong for someone.
[Chart: the same overlapping score distributions for the Level 1, 2, and 3 groups, with the candidate cut points still uncertain]
Slide 12
A Better Way We can test language proficiency using
criterion-referenced instead of norm-referenced testing
procedures.
Slide 13
Criterion-Referenced Proficiency Testing in the Receptive
Skills Items must strictly adhere to the proficiency Table of
Specifications. Every component of the test item must be aligned
with and match the specifications of a single level of the
proficiency scale: the text difficulty, the author's purpose, and
the task asked of the reader/listener.
Slide 14
Criterion-Referenced Proficiency Testing in the Receptive
Skills Testing reading and listening proficiency requires
independent, non-compensatory scoring for each proficiency level,
not a single score calculated for the entire test. This makes the
test development process more complex: it requires trained item
writers and reviewers, and it begins with modified Angoff ratings
instead of IRT procedures to validate items.
Slide 15
The BILC Benchmark Advisory Test (Reading) Is a
Criterion-Referenced Proficiency Test.
Slide 16
Steps in the Process 1. We updated the STANAG 6001 Proficiency
Scale. a. Each level describes a measurable point on the scale.
b. These assessment points are not arbitrary, but represent useful
levels of ability, e.g. Survival, Functional, Professional, etc.
c. Thus, each level represents a defined construct of language
ability.
Slide 17
Steps in the Process 2. We validated the scale. a. The
hierarchical nature of these constructs had been operationally but
not statistically validated. b. A statistical validation process was
run in Sofia, Bulgaria. c. The results substantiated the validity of
the scale's operational use.
Slide 18
STANAG 6001 Scale Validation Exercise Conducted at Sofia,
Bulgaria 13 October 2005
Slide 19
Instructions On the top of a blank piece of paper, write the
following information: 1. Your current work assignment: Teacher,
Tester, Administrator, Other ______ 2. Your first (or dominant)
language: _________ 3. You do not need to write your name!
Slide 20
Instructions Next, write the numbers: 0 1 2 3 4 5 down the left
side of the paper.
Slide 21
Instructions You will now be shown 6 descriptions of language
speaking proficiency. Each description will be labeled with a
color.
Slide 22
Instructions Rank the descriptions according to their level of
difficulty by writing their color designation next to the appropriate
number:
0 (easiest) = Color ?
1 (next easiest) = Color ?
2 (next easiest) = Color ?
3 (next easiest) = Color ?
4 (next easiest) = Color ?
5 (most difficult) = Color ?
Slide 23
Ready? The descriptions will now be presented one at a time, in
a random sequence, for 15 seconds each. You will see each of the
descriptors 4 times. Thank you for participating in this
experiment.
Slide 24
STANAG 6001 Scale Validation: A Timed Exercise Without Training
74 people turned in their rankings. They marked their current work
assignments as: Administrator 49, Teacher 26, Tester 19, Other 1
Slide 25
Results of the STANAG Scale Validation ( n = 74 )
Slide 26
Steps in the Process 3. We used the STANAG 6001 base proficiency
levels as the definitive specifications for item development.
a. The author's task and purpose in producing the text have to be
aligned with the question or task asked of the reader. b. The written
(or audio) text type and linguistic characteristics of each item must
also be characteristic of the proficiency level targeted by the
item.
Slide 27
Steps in the Process 4. The items developed then had to pass a
strict review of whether each item matched the design
specifications. a. Multiple expert judges made independent judgments
of whether each item matched the targeted level. b. Only the items
that passed this review with the unanimous consensus of trained
judges were taken to the next step.
Slide 28
Steps in the Process 5. The next step was a bracketing process
to check the adequacy of each question's multiple-choice options.
a. Experts were asked to make independent judgments about how likely
a learner at the next lower level would be to answer the question
correctly. Responses significantly above chance (i.e., 25%) made the
item unacceptable. In such cases the item, the item question, or the
item choices had to be discarded or revised.
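The lower-level bracketing check can be sketched as follows. The pooling of judges' estimates and the 0.05 tolerance above chance are assumptions made for illustration; the slide says only "significantly above chance":

```python
from statistics import mean

CHANCE_RATE = 0.25  # guessing rate on a four-option multiple-choice item
TOLERANCE = 0.05    # assumed margin; the deck does not specify the test used

def flag_for_revision(lower_level_estimates):
    """Judges' independent estimates of how likely a learner at the next
    LOWER level is to answer the item correctly. If the pooled estimate
    is clearly above chance, the item must be discarded or revised."""
    return mean(lower_level_estimates) > CHANCE_RATE + TOLERANCE

print(flag_for_revision([0.20, 0.25, 0.30]))  # at chance -> False (item survives)
print(flag_for_revision([0.40, 0.50, 0.45]))  # well above chance -> True (revise)
```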
Slide 29
Steps in the Process 5. (Cont.) b. Experts made independent
judgments about how likely a learner at the next higher level would
be to answer each question correctly. If the item would not be
answered correctly by this more competent group, it was rejected.
(Because of human limitations such as inattention, fatigue, and
carelessness, it was recognized that the correct-response probability
for this more competent group would be less than 100%.)
Slide 30
Steps in the Process 6. Items that passed the technical
specifications review and the bracketing process then underwent a
Modified Angoff rating procedure. a. Expert judges rated the
probability that each item would be correctly answered by a person
who was fully competent at the targeted proficiency level. b. If the
independent probability ratings produced an outlier rating or a
standard deviation of more than 5 points, the item was rejected
and/or revised.
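A sketch of this agreement screen, assuming hypothetical ratings. The 5-point standard-deviation limit comes from the slide; the two-standard-deviation outlier rule is an assumed stand-in, since the deck does not say how outliers were identified:

```python
from statistics import mean, stdev

def angoff_screen(ratings, max_sd=5.0):
    """Judges' percent-correct estimates (0-100 scale) for a person fully
    competent at the targeted level. Reject the item when agreement is poor."""
    m, sd = mean(ratings), stdev(ratings)
    has_outlier = any(abs(r - m) > 2 * sd for r in ratings)  # assumed rule
    return {"mean": m, "sd": sd, "accept": sd <= max_sd and not has_outlier}

print(angoff_screen([78, 80, 82, 79, 81]))  # tight agreement -> accepted
print(angoff_screen([60, 90, 75, 85, 70]))  # sd ~ 11.9 points -> rejected
```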
Slide 31
Steps in the Process 7. Items found acceptable in the Modified
Angoff rating procedure were assembled into an online test. a. The
test had three subtests of 20 items each. b. There was a separate
subtest for each of the Reading proficiency Levels 1, 2, and 3.
c. Each subtest was graded separately. d. Sustained performance
(passing) on each subtest was defined as the mean Angoff rating
minus one standard deviation, or 70%.
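Under one reading of rule 7d (an assumption: mean minus one standard deviation, with 70% as a floor), the per-subtest cut score could be computed as:

```python
from statistics import mean, stdev

def subtest_cut_score(angoff_ratings, floor=70.0):
    """Passing threshold for one level's 20-item subtest: the mean
    Angoff rating minus one standard deviation, never below 70%."""
    return max(mean(angoff_ratings) - stdev(angoff_ratings), floor)

print(subtest_cut_score([85, 88, 90, 87, 86]))  # a bit above 85%
print(subtest_cut_score([70, 68, 72, 69, 71]))  # would fall below 70 -> floored to 70
```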
Slide 32
More About Scoring Scoring had to follow criterion-referenced,
non-compensatory proficiency assessment procedures. Sustained
ability would be required to qualify as proficient at each level.
Summary ratings would consider both floor and ceiling abilities.
Each learner's performance profile would determine between-level
ratings (if any).
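The non-compensatory floor logic might look like this in outline. The handling of between-level ("plus") ratings is deliberately omitted, since the deck derives those from the fuller performance profile:

```python
def base_level(passed_subtests):
    """passed_subtests maps level -> whether the learner sustained a passing
    score on that level's subtest, e.g. {1: True, 2: True, 3: False}.
    The floor is the highest level with every lower level also passed;
    an isolated pass above a failure does not raise the rating."""
    level = 0
    for lvl in sorted(passed_subtests):
        if passed_subtests[lvl]:
            level = lvl
        else:
            break
    return level

print(base_level({1: True, 2: True, 3: False}))  # -> 2
print(base_level({1: True, 2: False, 3: True}))  # -> 1 (Level 3 pass can't compensate)
```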
Slide 33
And the results? More pilot testing will be done, but here are
the results of the first 36 pilot tests:
Slide 34
Slide 35
Congratulations! Working together, we have solved a major
testing problem, a problem which has plagued language testers for
decades. We have developed a criterion-referenced proficiency test
of Reading which accurately assigns proficiency levels and has both
face and statistical validity.
Slide 36
Questions?
Slide 37
Some additional thoughts The assessment points or levels in the
STANAG 6001 scale may be thought of as chords, each of which
describes a short segment along an extended multi-dimensional
proficiency development scale. These chords represent
cross-dimensional constellations of factors that represent
different levels of language ability. Like the concept of chords in
calculus, these defined progress levels allow us to accurately
measure whether the particular set of factors described at each
level has been mastered. Each proficiency level or factor
constellation can also be seen as a separate construct, and these
constructs can be shown to form an ascending array or hierarchy of
increasing language proficiency which meets Guttman scaling
criteria. Therefore, these points on the scale can also indicate
overall proficiency development.
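The Guttman criterion mentioned above can be stated compactly: in a valid hierarchy, no learner should pass a level after failing a lower one. A minimal check, with the pass/fail profile encoding assumed for illustration:

```python
def is_guttman_consistent(profile):
    """profile is a pass/fail sequence ordered from lowest to highest
    level, e.g. (True, True, False). It is Guttman-consistent if no
    pass occurs after a failure."""
    seen_failure = False
    for passed in profile:
        if passed and seen_failure:
            return False
        seen_failure = seen_failure or not passed
    return True

print(is_guttman_consistent((True, True, False)))  # -> True
print(is_guttman_consistent((True, False, True)))  # -> False (violates the hierarchy)
```

Across a pilot sample, the proportion of consistent profiles is the kind of evidence that supports the scale's hierarchical claim.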