The BILC BAT: A Research and Development Success Story
Ray T. Clifford
BILC Professional Seminar, Vienna, Austria, 11 October
Slide 2
Language is the most complex of human behaviors. Language
proficiency is clearly not a simple, one-dimensional trait.
Therefore, language development cannot be expected to be linear.
However, language proficiency can be assessed against a hierarchy
of identifiable common stages of language skill development.
Slide 3
Testing Language Proficiency in the Receptive Skills
Norm-referenced statistical analyses are problematic when testing
for proficiency. Rasch one-factor IRT analysis assumes: a
one-dimensional trait, linear skill development, and that all test
items discriminate equally well. Norm-referenced statistics are meant
to distinguish all students from one another, not to separate passing
students from failing students.
Slide 4
Testing Language Proficiency in the Receptive Skills
Norm-referenced statistical analyses are problematic when testing
for proficiency. They require too many subjects for use in LCTLs.
About 100 to 300 test subjects of varying abilities must answer
each item. There may not be that number of people to be tested. The
results do not have a direct relationship to proficiency levels or
other external criteria.
Slide 5
Testing Language Proficiency in the Receptive Skills
Norm-referenced statistical analyses are problematic when testing
for proficiency. There has not been an adequate way of ensuring
that the range of skills tested and the difficulty of any given
test match the targeted range of the language proficiency scale.
Setting passing scores using norm-referenced statistics is an
imprecise process. Setting multiple cut-scores from a total test
score violates the criterion-referenced principle of
non-compensatory scoring.
Slide 6
Test Development Procedures: Norm-Referenced Tests Create a
table of test specifications. Train item writers in item-writing
techniques. Develop items. Test the items for difficulty and
reliability by administering them to several hundred learners. Use
statistics to eliminate bad items. Administer the resulting test.
Report results compared to other students or attempt to relate
these norm-referenced results to a polytomous set of criteria (such
as the STANAG scale).
Slide 7
Traditional Method of Setting Cut Scores
[Chart: a test to be calibrated, scored 0-100, is administered to groups of known ability at Levels 1, 2, and 3]
Slide 8
The Results You Hope For:
[Chart: the score ranges of the Level 1, 2, and 3 groups separate cleanly on the 0-100 scale]
Slide 9
The Results You Always Get:
[Chart: the test scores received by the Level 1, 2, and 3 groups overlap, leaving the cut-score boundaries (marked "???") uncertain]
Slide 10
Why is there always an overlap? Total scores are by definition
compensatory scores: every answer guessed correctly adds to the
individual's score, and there is no way to check for ability at a
given proficiency level. Students with different abilities may have
attained the same score, e.g. by answering only the Level 1 questions
correctly, or by answering 25% of all the questions correctly.
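As a minimal sketch (the numbers are illustrative, not from the BAT pilot), the ambiguity of a compensatory total score can be shown in a few lines of Python:

```python
# Two hypothetical examinees on a 60-item test (20 items per level).
# Their compensatory totals are identical, but their profiles differ.
examinee_a = {1: 20, 2: 0, 3: 0}   # answered only the Level 1 items correctly
examinee_b = {1: 7, 2: 7, 3: 6}    # answered about a third of everything

total_a = sum(examinee_a.values())
total_b = sum(examinee_b.values())
print(total_a, total_b)  # both 20: the totals cannot tell them apart

# A non-compensatory, per-level check (70% threshold) can:
passes = {name: scores[1] / 20 >= 0.70
          for name, scores in (("A", examinee_a), ("B", examinee_b))}
print(passes)  # A shows sustained Level 1 ability; B does not
```

The per-level check recovers exactly the distinction that the single total score erases.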
Slide 11
No matter where the cut scores are set, they are wrong for someone.
[Chart: the same overlapping score distributions for the Level 1, 2, and 3 groups, with the candidate cut points still uncertain]
Slide 12
A Better Way We can test language proficiency using
criterion-referenced instead of norm-referenced testing
procedures.
Slide 13
Criterion-Referenced Proficiency Testing in the Receptive
Skills Items must strictly adhere to the proficiency Table of
Specifications. Every component of the test item must be aligned
with and match the specifications of a single level of the
proficiency scale: the text difficulty, the author's purpose, and
the task asked of the reader/listener.
Slide 14
Criterion-Referenced Proficiency Testing in the Receptive
Skills Testing reading and listening proficiency requires
independent, non-compensatory scoring for each proficiency level,
not a single score calculated for the entire test. This makes the
test development process more complex: it requires trained item
writers and reviewers, and it begins with modified Angoff ratings
instead of IRT procedures to validate items.
Slide 15
The BILC Benchmark Advisory Test (Reading) Is a
Criterion-Referenced Proficiency Test.
Slide 16
Steps in the Process 1. We updated the STANAG 6001 Proficiency
Scale. a. Each level describes a measurable point on the scale.
b. These assessment points are not arbitrary, but represent useful
levels of ability, e.g. Survival, Functional, Professional, etc.
c. Thus, each level represents a defined construct of language
ability.
Slide 17
Steps in the Process 2. We validated the scale. a. The
hierarchical nature of these constructs had been operationally but
not statistically validated. b. A statistical validation process was
run in Sofia, Bulgaria. c. The results substantiated the validity of
the scale's operational use.
Slide 18
STANAG 6001 Scale Validation Exercise Conducted at Sofia,
Bulgaria 13 October 2005
Slide 19
Instructions On the top of a blank piece of paper, write the
following information: 1. Your current work assignment: Teacher,
Tester, Administrator, Other ______ 2. Your first (or dominant)
language: _________ 3. You do not need to write your name!
Slide 20
Instructions Next, write the numbers: 0 1 2 3 4 5 down the left
side of the paper.
Slide 21
Instructions You will now be shown 6 descriptions of language
speaking proficiency. Each description will be labeled with a
color.
Slide 22
Instructions Rank the descriptions according to their level of
difficulty by writing their color designation next to the appropriate
number:
0 (easiest) = Color ?
1 (next easiest) = Color ?
2 (next easiest) = Color ?
3 (next easiest) = Color ?
4 (next easiest) = Color ?
5 (most difficult) = Color ?
Slide 23
Ready? The descriptions will now be presented one at a time, in
a random sequence, for 15 seconds each. You will see each of the
descriptors 4 times. Thank you for participating in this
experiment.
Slide 24
STANAG 6001 Scale Validation: A Timed Exercise Without Training
74 people turned in their rankings. They marked their current work
assignments as: Administrator 49, Teacher 26, Tester 19, Other 1
Slide 25
Results of the STANAG Scale Validation ( n = 74 )
Slide 26
Steps in the Process 3. We used the STANAG 6001 base proficiency
levels as the definitive specifications for item development.
a. The author's task and purpose in producing the text have to be
aligned with the question or task asked of the reader. b. The written
(or audio) text type and linguistic characteristics of each item must
also be characteristic of the proficiency level targeted by the
item.
Slide 27
Steps in the Process 4. The items developed then had to pass a
strict review of whether each item matched the design
specifications. a. Multiple expert judges made independent judgments
of whether each item matched the targeted level. b. Only the items
that passed this review with the unanimous consensus of trained
judges were taken to the next step.
Slide 28
Steps in the Process 5. The next step was a bracketing process
to check the adequacy of each question's multiple-choice options.
a. Experts were asked to make independent judgments about how likely
a learner at the next lower level would be to answer the question
correctly. Responses significantly above chance (i.e., 25%) made the
item unacceptable. In such cases the item, the item question, or the
item choices had to be discarded or revised.
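The lower-level bracketing check can be sketched as follows. The pooling of judges' estimates and the 0.05 tolerance above chance are assumptions made for illustration; the slide says only "significantly above chance":

```python
from statistics import mean

CHANCE_RATE = 0.25  # guessing rate on a four-option multiple-choice item
TOLERANCE = 0.05    # assumed margin; the deck does not specify the test used

def flag_for_revision(lower_level_estimates):
    """Judges' independent estimates of how likely a learner at the next
    LOWER level is to answer the item correctly. If the pooled estimate
    is clearly above chance, the item must be discarded or revised."""
    return mean(lower_level_estimates) > CHANCE_RATE + TOLERANCE

print(flag_for_revision([0.20, 0.25, 0.30]))  # at chance -> False (item survives)
print(flag_for_revision([0.40, 0.50, 0.45]))  # well above chance -> True (revise)
```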
Slide 29
Steps in the Process 5. (Cont.) b. Experts made independent
judgments about how likely a learner at the next higher level would
be to answer each question correctly. If the item would not be
answered correctly by this more competent group, it was rejected.
(Because of human limitations such as inattention, fatigue, and
carelessness, it was recognized that the correct-response probability
for this more competent group would be less than 100%.)
Slide 30
Steps in the Process 6. Items that passed the technical
specifications review and the bracketing process then underwent a
Modified Angoff rating procedure. a. Expert judges rated the
probability that each item would be correctly answered by a person
who was fully competent at the targeted proficiency level. b. If the
independent probability ratings produced an outlier rating or a
standard deviation of more than 5 points, the item was rejected
and/or revised.
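A sketch of this agreement screen, assuming hypothetical ratings. The 5-point standard-deviation limit comes from the slide; the two-standard-deviation outlier rule is an assumed stand-in, since the deck does not say how outliers were identified:

```python
from statistics import mean, stdev

def angoff_screen(ratings, max_sd=5.0):
    """Judges' percent-correct estimates (0-100 scale) for a person fully
    competent at the targeted level. Reject the item when agreement is poor."""
    m, sd = mean(ratings), stdev(ratings)
    has_outlier = any(abs(r - m) > 2 * sd for r in ratings)  # assumed rule
    return {"mean": m, "sd": sd, "accept": sd <= max_sd and not has_outlier}

print(angoff_screen([78, 80, 82, 79, 81]))  # tight agreement -> accepted
print(angoff_screen([60, 90, 75, 85, 70]))  # sd ~ 11.9 points -> rejected
```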
Slide 31
Steps in the Process 7. Items found acceptable in the Modified
Angoff rating procedure were assembled into an online test. a. The
test had three subtests of 20 items each. b. There was a separate
subtest for each of the Reading proficiency Levels 1, 2, and 3.
c. Each subtest was graded separately. d. Sustained performance
(passing) on each subtest was defined as the mean Angoff rating
minus one standard deviation, or 70%.
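Under one reading of rule 7d (an assumption: mean minus one standard deviation, with 70% as a floor), the per-subtest cut score could be computed as:

```python
from statistics import mean, stdev

def subtest_cut_score(angoff_ratings, floor=70.0):
    """Passing threshold for one level's 20-item subtest: the mean
    Angoff rating minus one standard deviation, never below 70%."""
    return max(mean(angoff_ratings) - stdev(angoff_ratings), floor)

print(subtest_cut_score([85, 88, 90, 87, 86]))  # a bit above 85%
print(subtest_cut_score([70, 68, 72, 69, 71]))  # would fall below 70 -> floored to 70
```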
Slide 32
More About Scoring Scoring had to follow criterion-referenced,
non-compensatory proficiency assessment procedures. Sustained
ability would be required to qualify as proficient at each level.
Summary ratings would consider both floor and ceiling abilities.
Each learner's performance profile would determine between-level
ratings (if any).
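The non-compensatory floor logic might look like this in outline. The handling of between-level ("plus") ratings is deliberately omitted, since the deck derives those from the fuller performance profile:

```python
def base_level(passed_subtests):
    """passed_subtests maps level -> whether the learner sustained a passing
    score on that level's subtest, e.g. {1: True, 2: True, 3: False}.
    The floor is the highest level with every lower level also passed;
    an isolated pass above a failure does not raise the rating."""
    level = 0
    for lvl in sorted(passed_subtests):
        if passed_subtests[lvl]:
            level = lvl
        else:
            break
    return level

print(base_level({1: True, 2: True, 3: False}))  # -> 2
print(base_level({1: True, 2: False, 3: True}))  # -> 1 (Level 3 pass can't compensate)
```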
Slide 33
And the results? More pilot testing will be done, but here are
the results of the first 36 pilot tests:
Slide 34
Slide 35
Congratulations! Working together, we have solved a major
testing problem, a problem which has plagued language testers for
decades. We have developed a criterion-referenced proficiency test
of Reading which accurately assigns proficiency levels and has both
face and statistical validity.
Slide 36
Questions?
Slide 37
Some additional thoughts The assessment points or levels in the
STANAG 6001 scale may be thought of as chords, each of which
describes a short segment along an extended multi-dimensional
proficiency development scale. These chords represent
cross-dimensional constellations of factors that represent
different levels of language ability. Like the concept of chords in
calculus, these defined progress levels allow us to accurately
measure whether the particular set of factors described at each
level has been mastered. Each proficiency level or factor
constellation can also be seen as a separate construct, and these
constructs can be shown to form an ascending array or hierarchy of
increasing language proficiency which meets Guttman scaling
criteria. Therefore, these points on the scale can also indicate
overall proficiency development.
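The Guttman criterion mentioned above can be stated compactly: in a valid hierarchy, no learner should pass a level after failing a lower one. A minimal check, with the pass/fail profile encoding assumed for illustration:

```python
def is_guttman_consistent(profile):
    """profile is a pass/fail sequence ordered from lowest to highest
    level, e.g. (True, True, False). It is Guttman-consistent if no
    pass occurs after a failure."""
    seen_failure = False
    for passed in profile:
        if passed and seen_failure:
            return False
        seen_failure = seen_failure or not passed
    return True

print(is_guttman_consistent((True, True, False)))  # -> True
print(is_guttman_consistent((True, False, True)))  # -> False (violates the hierarchy)
```

Across a pilot sample, the proportion of consistent profiles is the kind of evidence that supports the scale's hierarchical claim.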