
Technical Report
Development of a Computer-Assisted Assessment of Oral Proficiency for Adult English Language Learners

September 2005


BEST Plus Technical Report
September 2005

Writing: Dorry M. Kenyon, Herb Ware, Lauren Smith Janzen
Editing and proofreading: Jeannie Rennie
Design and illustration: Falls River Arts, Sally Morrison
Production: Driven, Inc.

The writing of this report was supported by funding from the U.S. Department of Education (ED), Office of Vocational and Adult Education (OVAE), under contract No. ED-00-CO-0130. The opinions expressed in this publication do not necessarily reflect the positions or policies of ED.

Recommended citation: Center for Applied Linguistics. (2005). BEST Plus Technical Report. Washington, DC: Author.

© 2005 All rights reserved.

No part of this publication may be reproduced, in any form or by any means, without permission in writing from the Center for Applied Linguistics.

All permission inquiries should be addressed to Publications Coordinator, Center for Applied Linguistics, 4646 40th Street NW, Washington, DC 20016-1859. Telephone 202-362-0700.

Printed in the United States of America


Contents

1. Development of BEST Plus
   1.1 Purpose of BEST Plus
   1.2 Background to the Test
   1.3 Overview of the Development of BEST Plus
       1.3.1 Overview of the First Cycle
       1.3.2 Overview of the Second Cycle
   1.4 The Full-Scale Field Test and Final Test Preparation
   1.5 Development of the Operational Adaptive Test

2. Reliability
   2.1 Interrater Reliability
   2.2 Test/Retest Reliability
   2.3 Parallel-Form Reliability
   2.4 Traditional Reliability and Precision of Measurement

3. Validity
   3.1 Content Evidence
   3.2 Relationship With Other Measures
       3.2.1 Validity Evidence From Program Placement Levels
       3.2.2 Validity Evidence Based on Scores of Other Measures of English Proficiency
       3.2.3 Discussion

4. BEST/BEST Plus Comparability Study
   4.1 Raw Score Relationships (BEST)
   4.2 Raw Score Relationships (BEST Plus)
   4.3 Comparing Performances on the BEST and BEST Plus
       4.3.1 SPL Relationships: All 32 Examinees
       4.3.2 SPL Relationships: 16 Subjects With Final SPLs of 1–6
       4.3.3 SPL Relationships: 16 Subjects With at Least One Final SPL of 7 or Higher
       4.3.4 SPL Relationships: Differences in BEST and BEST Plus SPLs in Higher Proficiency Examinees
   4.4 Summary

5. Interpreting Test Results
   5.1 Criterion-Referenced Interpretation: Student Performance Levels
   5.2 Norm-Referenced Interpretation: National Percentiles
       5.2.1 Countries Represented
       5.2.2 Languages Represented
       5.2.3 Gender
       5.2.4 Age
       5.2.5 Time in Program
       5.2.6 National Percentiles

References


1. Development of BEST Plus

1.1 Purpose of BEST Plus

The purpose of BEST Plus is to assess the oral language proficiency of adult English language learners. Oral language proficiency is understood as the underlying competencies that enable the performance of communicative language functions that integrate both listening and speaking skills. BEST Plus assesses the ability to understand and use unrehearsed, conversational, everyday language within topic areas generally covered in adult English language courses. As an integrative performance assessment of listening and speaking proficiency delivered in a face-to-face mode, the test does not assess an examinee’s proficiency in comprehending presentational language of the type encountered, for example, in listening to a radio or television broadcast, or in producing oral presentational language, such as reading aloud or delivering a prepared speech.

BEST Plus is designed to assess the language proficiency of adult (16 years of age or older) nonnative speakers of English who may or may not have received an education in their native language or in English, but who need to use English to function in day-to-day life in the United States. It is designed for the population of adult students typically found in adult education programs in the United States.

1.2 Background to the Test

BEST Plus is a revision of the oral interview section of the Basic English Skills Test, or BEST (Center for Applied Linguistics, 1984), which discriminates among the levels of English language proficiency described in the Student Performance Levels (SPLs) (Grognet, 1997). BEST Plus is also aligned with the requirements of the National Reporting System (NRS) and with the needs of local programs to provide data on learner progress and achievement for comparison across programs and within and across states. These data can also be used for program evaluation and accountability.

BEST Plus was developed in response to a number of needs in adult English language education. These included the need for a performance assessment that did not take long to administer and that could be given frequently for pretesting and posttesting. Users of the BEST oral interview also expressed a need for an assessment that would extend beyond the BEST ceiling of SPL 7, preferably up to SPL 10.

BEST Plus meets these needs. While the oral interview section of the BEST (Forms B and C) takes about 15 minutes to administer to each examinee, research on the computer-adaptive version of BEST Plus shows that administration time for lower ability students, who comprise much of the testing population, is typically 7 minutes or less. Because it is an adaptive test, administration time varies according to the ability level of the examinee. In general, the computer-adaptive version of BEST Plus can be administered in less than 10 minutes to students whose scores are no higher than SPL 7. BEST Plus can, however, measure the oral English proficiency of students up to SPL 10. Students at the highest SPLs, who cannot be assessed by the BEST oral interview section, comprise a very small percentage of the testing population. They may take up to 15 minutes to assess with the computer-adaptive BEST Plus. The extra time is needed to allow examinees to fully demonstrate all they know and can do.

The oral interview section of the BEST exists in two parallel forms. Although alternate forms are administered for pretesting and posttesting, frequent testing could lead to questions being memorized. Rehearsal of questions and answers by examinees may invalidate the meaning of the test scores. As an adaptive test, BEST Plus has a large pool of underlying items out of which a relatively small number are administered to any individual examinee. The adaptive process also means that examinees will receive different items each time they take the test, particularly as their skills improve and they are administered increasingly challenging items.

BEST Plus is also available in a semi-adaptive print-based format. There are three forms (A, B, and C) of the print-based version, so different forms may be used for pretesting and posttesting. Each form consists of a locator and three level tests. The items in each form are drawn from the item pool of the computer-adaptive version. The test administrator gives the examinee the quick locator to determine which of the three level tests would be most appropriate, then administers the test and marks the scores in the test booklet. The BEST Plus Score Management Software is then used to convert the raw scores into scale scores. (See also Section 2.3, Parallel-Form Reliability.)
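As an illustration of the semi-adaptive flow just described, the sketch below routes an examinee from the locator to one of a form's three level tests. It is a minimal sketch only; the function name and the cut scores are hypothetical placeholders, not the operational values defined by the BEST Plus test booklets or the Score Management Software.

```python
def select_level_test(locator_raw_score, cut_scores=(4, 9)):
    """Route an examinee to one of a form's three level tests.

    The cut scores here are hypothetical; operational values come from the
    published BEST Plus materials, not from this sketch.
    """
    low_cut, high_cut = cut_scores
    if locator_raw_score <= low_cut:
        return "Level 1 test"
    if locator_raw_score <= high_cut:
        return "Level 2 test"
    return "Level 3 test"

# Example: a locator raw score of 6 would route to the middle level test.
print(select_level_test(6))   # -> "Level 2 test"
```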

1.3 Overview of the Development of BEST Plus

Language testing professionals and adult English language educators at the Center for Applied Linguistics (CAL) began the development of BEST Plus in 1999 with funding from the Office of Vocational and Adult Education (OVAE) of the U.S. Department of Education. The initial funding covered the development and piloting of a computer-administered, semi-adaptive version of the BEST oral interview section. Following the successful completion of a prototype in June 2000, OVAE issued a contract to CAL to develop a complete test.

Working with input and direction from a nationally representative 10-member technical working group (TWG), CAL professional staff began the development of the complete BEST Plus in the fall of 2000. The project consisted of two cycles. The first cycle, lasting 1 year, covered the development of a small-scale version of BEST Plus. Only a quarter of the final number of items expected to be needed for the full item pool were developed and field-tested. An initial version of the adaptive test was developed and researched in a reliability study, which was completed in November 2001. In the second cycle, the complete item pool was developed and field-tested, and the adaptive versions were fully developed and researched. The studies from the second cycle provided the data upon which much of this report is based.

1.3.1 Overview of the First Cycle

In the first cycle, CAL staff developed initial specifications for the test and its items. These were reviewed and approved by the TWG. The specifications called for items to be written at three levels, depending on the type of response required. Type 1 items required a simple word or phrase in response, Type 2 required one or two complete sentences, and Type 3 required extended discourse. Items were to be grouped thematically into “folders,” each with a topic falling under 1 of 14 specific topical domains commonly covered in adult education courses. The initial 14 domains are presented in Table 1.

Table 1: Initial List of 14 Domains

General Domain     Specific Domains
Personal           Personal Identification, Health, Family/Parenting, Consumerism, Housing, Recreation/Entertainment
Occupational       Getting a Job, On the Job
Public             Citizenship, Legal Issues, Community Services, Transportation/Directions, Weather/Seasons, Education


Each folder contained a set of six items:

• Three Type 1 items, designed to elicit the use of low-level language skills (e.g., very common vocabulary and easy grammar), most frequently based on a photograph

• Two Type 2 items, designed to elicit intermediate-level language skills

• One Type 3 item, designed to elicit the use of high-level language skills (e.g., less common vocabulary, ideas developed and organized in more complicated structures)

After approval of the initial specifications by the TWG, 12 item writers from across the nation were trained at CAL in workshops held in December 2000 and January 2001. Item writers completed three iterative item-writing assignments, in which they prepared items for assigned folders and submitted draft items to CAL. CAL staff reviewed the items and worked with the item writers to make needed revisions. CAL staff then conducted a final review and approval of all test items.

During this process, project staff came to question the appropriateness of the achievement-oriented domain of citizenship in a proficiency-oriented assessment. The legal domain, likely to require specialized vocabulary or knowledge of the legal system in the United States, was also questioned. In December 2001, project staff decided to combine the citizenship and legal domains to form a new proficiency-oriented “civics” domain. The final list of domains assessed in BEST Plus is presented in Table 2.

Table 2: 13 Domains Represented on BEST Plus

General Domain     Specific Domains
Personal           Personal Identification, Health, Family/Parenting, Consumerism, Housing, Recreation/Entertainment
Occupational       Getting a Job, On the Job
Public             Civics, Community Services, Transportation/Directions, Weather/Seasons, Education

Note: Personal identification questions appear in the warm-up items only.

In the spring of 2001, a small-scale field test was conducted using 156 questions drawn from 5 specific domains (consumerism, community services, health, housing, and on the job). To conduct the field test, participating programs each sent two test administrators to CAL for training. In all, 7 programs and 15 test administrators participated in the small-scale field testing and test administrator training, which focused on how to use the computerized program and, in particular, how to use the new scoring rubric to assign scores to an examinee’s performance. The 156 questions to be field-tested were divided into several fixed forms; different administrators had different forms. The test was administered to more than 740 students; in the end, data from 738 student records were available for use in the item analyses. On the whole, the items and folders selected for the small-scale field test performed well. In essence, this means that the items that were intended to be hardest (Type 3 items) were most challenging to students with low proficiency, and the items intended to be easiest (Type 1 items) were answerable by all but the weakest students. In addition, within each category of item (Types 1, 2, and 3), there was a range of difficulty.

From the 156 items used in the small-scale field test, the best ones, totaling 19 folders, or 114 items, were included in a small-scale reliability study conducted in November 2001. The reliability study used the computer-adaptive version of BEST Plus, relying on the item calibrations (i.e., the item difficulty values) derived from the small-scale field test, to estimate an examinee’s ability during each stage of the test and to choose the best items to be delivered to each examinee as the test progressed. The adaptive algorithm was programmed such that for every folder of six questions, only three would be administered to any examinee. The first item to be administered from the folder would be a Type 1 item. If this appeared too easy for the examinee, the second item would be a Type 2 item. If the Type 1 item was not too easy, the second item would also be a Type 1 item. Finally, if the Type 2 item was also too easy for the examinee, the final item would be a Type 3 item, the most challenging kind. Otherwise, it would be a Type 1 or Type 2 item, as appropriate for the ability level of the examinee. The reliability study demonstrated that the adaptive test could yield reliable results when administered by different test administrators to the same students.
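The branching just described can be summarized in a short sketch. The predicate `too_easy` below is a hypothetical stand-in for the adaptive judgment that an administered item was too easy given the current ability estimate; it is illustrative only and is not code from the operational software.

```python
def folder_item_types(too_easy):
    """Sequence of item types administered from one six-item folder under the
    first-cycle algorithm; only three of the six items are given."""
    first = "Type 1"                               # every folder opens with a Type 1 item
    second = "Type 2" if too_easy(first) else "Type 1"
    if second == "Type 2" and too_easy(second):
        third = "Type 3"                           # the most challenging item type
    else:
        # otherwise a Type 1 or Type 2 item, as appropriate for the examinee
        third = second if not too_easy(second) else "Type 2"
    return [first, second, third]

# A high-ability examinee (everything is too easy) sees Types 1, 2, and 3;
# a very low-ability examinee sees three Type 1 items.
print(folder_item_types(lambda item_type: True))    # -> ['Type 1', 'Type 2', 'Type 3']
print(folder_item_types(lambda item_type: False))   # -> ['Type 1', 'Type 1', 'Type 1']
```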

1.3.2 Overview of the Second Cycle

The items and folders produced by trained item writers and CAL staff performed well during the small-scale field test and the reliability study. However, as a result of observations made by members of the project group during the small-scale reliability study and observations passed on by test administrators after the small-scale field test, CAL project staff decided that the item pool and folder structure should be revised to encourage more engagement from examinees at all levels of proficiency.

CAL staff felt that low-proficiency examinees were not always given the chance to show all that they could do on the adaptive test because the questions they were presented with resulted in what was essentially a vocabulary test, requiring them to name objects or actions from photos. When an examinee could not name the item in a photo, he or she was presented with a second or third photo identification item. For the lowest proficiency students, the entire assessment consisted of such naming questions, never moving on to easy topics about which they might have been able to speak comprehensibly and which might have been more satisfying to them.

CAL staff also felt that high-proficiency examinees were not always pushed to speak as well as they could. Every student began each folder with a photo identification item, which was too easy for high-proficiency students. Project staff felt that it would be better to engage their interest earlier in each folder through the use of questions that would allow them to interact on a more personal level with the test administrator.

To address these concerns, preliminary changes to item, folder, and test specifications were written and revised during a series of meetings among CAL project staff. New sample items and folders were produced to act as models for item writers. Three of the original item writers were then invited to an item-writing workshop at CAL in January 2002. They were trained to the new specifications and were asked for their input. The revised item specifications, below, resulted from the work of project staff members and item writers during this 2-day workshop. In the end, seven item types were developed.

General Specifications for Item Writers

Folders should be personally engaging for the examinee.

Folders should allow the examinee and interviewer to share personal information and opinions.

Ask yourself: Would you really want to know the answers to these questions from an adult ESL student? If not, drop the question.

The wording of setups and questions should be natural and conversational, not like test questions; read your questions aloud to check for naturalness.


Folders belong to a specific domain, but the questions in a folder should be organized around a single, intrinsically interesting theme related to that specific domain.

Questions should flow smoothly throughout the possible paths an examinee might take within a folder.

Specifications for Individual Item Types

Entry Question (EQ)

• Purpose: To introduce the folder topic in a conversational manner.

• The item begins with a true personal statement about the examiner, relevant to the topic and of interest to the examinee, and follows with a conversational question to the examinee that invites him or her to share similar information.

• The item usually uses wh- questions.

• The target response is a short answer (which could be expanded); the question cannot be answered yes or no.

• Possible topics include preferences, personal routines, habits, and activities.

Photo Description (PD)

• Purpose: To allow lower level students to demonstrate what they know about vocabulary related to the folder theme.

• The target response may be a list of words, although more able students may produce more complex language, such as simple sentences; the picture and question should allow for elicitation of a range of possible responses.

• The photo should be rich with possibilities for description, not focused on a single object.

• The item should include a prompt to allow students to show what they can do (e.g., “Tell me about this picture”).

Personal Expansion (PE)

• Purpose: To allow the examinee to talk from and about personal experience on the topic of the folder.

• The target response should be at the sentence level (one or more sentences).

• To the greatest degree possible, the question should build on the entry question.

• The question allows the interviewer to learn interesting information about the examinee’s life, experiences, points of view, and so forth.

General Expansion (GE)

• Purpose: To allow the examinee to talk from personal experience and knowledge on the topic of the folder as it relates to the world more generally (outside his or her personal experience).

• The target response should be at the sentence level (one or more sentences).

• To the greatest degree possible, the question should build on the Personal Expansion question.

• The question may be an easier version of the Elaboration question (see below) that appears in the same folder, allowing, for example, statement of an opinion or preference but not requiring elaboration to support it or to discuss or explain it in detail.


Elaboration (EL)

• Purpose: To allow the examinee to demonstrate the ability to talk in some depth on a topic—describing in detail; narrating an event; or supporting an opinion for which more precise vocabulary, greater control over grammar, and some knowledge of discourse organization will be required.

• The target response should be detailed and elaborate, above the sentence level.

• This question should be personally engaging to both the examinee and the interviewer.

• To the greatest degree possible, the question should build on the Personal Expansion question.

Choice Questions (CQ)

• Purpose: To allow an examinee with very limited English ability to demonstrate comprehension of spoken English by responding to a non-photo-based question.

• The question should offer at least two choices related to the topic of the folder.

• Vocabulary in the question should be at a very basic level, and the question should not be unnecessarily long.

Yes/No Questions (YN)

• Purpose: To allow an examinee at the lowest level of English ability to demonstrate comprehension of spoken English by responding to a non-photo-based yes/no question.

• The question must be related to the topic of the folder.

• Vocabulary in the question should be at a very basic level, and the question should be as short as possible.

Given these new specifications, it was envisioned that there would be many different paths through a folder, depending on the ability of the examinee. However, each folder would start with either the entry question or the photo description question to introduce the topic of the folder to the examinee. Examinees of higher ability might move up to the open-ended expansion questions (i.e., personal expansion, general expansion, and elaboration), while examinees needing more English language support might move down to the questions providing more support (photo description, choice questions, yes/no questions). Each folder would contain one of each type of question; ultimately, the difficulty value of each question would be determined empirically from data collected in the full-scale field test.

1.4 The Full-Scale Field Test and Final Test Preparation

In March 2002, the TWG approved the revised specifications. Local item writers and CAL staff then revised earlier items and created new ones to complete the item pool in May 2002. In-house reviews of the items were conducted by CAL project staff. An in-house pilot of the newest items was conducted with both native and nonnative speakers of English in order to determine if the items were clear, easy to answer, and interesting. In addition, the 12 original item writers were invited to administer or review the items (rather than produce new items) as their fourth and final item-writing assignment. Items were revised according to the feedback received from these pilots and reviews.

In preparation for the field test and for the final adaptive version of the test, the computer program was revised and piloted. In May 2002, the software for field testing was completed. Videotapes of examinees taking the test were collected at a local program in May to prepare samples of performances for training field-test administrators. CAL staff also prepared additional print and video materials for test administrator training workshops.

Ultimately, field-test administrator workshops were held in four states (Florida, Illinois, Massachusetts, and Oregon) and in Washington, DC, with representatives from four additional states (Delaware, Maryland, Pennsylvania, and Virginia) present as well.

More than 40 administrators tested more than 2,400 students from 25 programs in the full-scale field test. Each administrator used a specific nonadaptive form of the test containing 11 folders and was asked to administer it to 65 students representing all ability levels within their program. Each examinee was administered 31 items from the 11 folders—2, 3, or 4 items from each folder. While this version of the test was not adaptive, the items from the folders appeared in random but logical orders based on question type and folder content. Across all administrators, folders appeared several times as 2-, 3-, or 4-item combinations. In this way, all items were administered to at least 100 students, with six warm-up questions and three linking questions administered to all students in order to calibrate the items together.

1.5 Development of the Operational Adaptive Test

The data collected from the full-scale field test were used to calibrate the difficulty of the 258 items in the final BEST Plus pool. Calibration was done using the software Facets, an application of the many-facet Rasch measurement model. The final empirical model used to calibrate the field-test data specified eight different rating scales. Facets calibrated the step difficulty (i.e., the degree of ability needed to move from one rating level to the next) of each scale step for each of the eight scales, and the item difficulty for each of the 258 items; Facets also provided an ability estimate for each of the 2,356 examinees whose complete test data allowed them to be included in the calibration sample. The final model was judged to have acceptable fit, and these difficulty estimates and step calibrations were entered into the final computer-adaptive version of BEST Plus.
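The report does not reproduce the calibration model itself. A standard way to write a many-facet Rasch rating scale model consistent with the facets described above (examinee ability, item difficulty, and the step structure of eight shared rating scales) is

$$\log\!\left(\frac{P_{nik}}{P_{ni(k-1)}}\right) = B_n - D_i - F_{g(i),k}$$

where $P_{nik}$ is the probability that examinee $n$ is rated in category $k$ rather than $k-1$ on item $i$, $B_n$ is the examinee's ability, $D_i$ is the item's difficulty, and $F_{g(i),k}$ is the difficulty of step $k$ of the rating scale $g(i)$ assigned to item $i$. Whether the operational calibration used exactly this parameterization is an assumption here, not a statement from the report.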

By computer adaptive, we mean that BEST Plus is programmed with an algorithm that calculates an examinee’s ability given his or her performance on the test questions. The underlying logit scale is centered at zero (0), which represents the average performance of the more than 2,400 examinees who participated in the full-scale field test of BEST Plus. This value is chosen as the starting ability estimate of each examinee before any questions are administered. The examinee’s ability estimate is updated after the administration and scoring of each item. Thus, the first update occurs after the administration and scoring of the first warm-up item; additional updates occur after each of the following items.
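The operational update formula is not published in this report. The sketch below shows one standard way such an update can be done for Rasch-family polytomous items: a maximum-likelihood re-estimate of ability from the items scored so far, using calibrated item and step difficulties. The function names, the example calibrations, and the choice of estimator are assumptions for illustration only.

```python
import math

def category_probs(theta, item_difficulty, step_difficulties):
    """Rasch rating-scale category probabilities for one polytomous item.

    step_difficulties holds the thresholds tau_1..tau_m; category 0 is the
    reference category. Returns [P(score = 0), ..., P(score = m)].
    """
    numerators = [0.0]                      # log-numerator for category 0
    running = 0.0
    for tau in step_difficulties:
        running += theta - item_difficulty - tau
        numerators.append(running)
    exps = [math.exp(v) for v in numerators]
    total = sum(exps)
    return [e / total for e in exps]

def estimate_ability(responses, theta=0.0, max_iter=25, tol=1e-4):
    """Newton-Raphson maximum-likelihood ability estimate.

    responses: list of (observed_score, item_difficulty, step_difficulties)
    tuples for the items administered and scored so far.
    Returns (ability_estimate, standard_error). Note that a pure ML estimate
    does not converge for all-minimum or all-maximum response strings.
    """
    for _ in range(max_iter):
        observed = expected = information = 0.0
        for score, delta, taus in responses:
            probs = category_probs(theta, delta, taus)
            mean = sum(k * p for k, p in enumerate(probs))
            variance = sum(k * k * p for k, p in enumerate(probs)) - mean * mean
            observed += score
            expected += mean
            information += variance
        step = (observed - expected) / information
        theta += step
        if abs(step) < tol:
            break
    return theta, 1.0 / math.sqrt(information)

# Example with invented calibrations: two items scored 5 and 3 on the 0-9 scale.
thresholds = [-2.0, -1.2, -0.5, 0.0, 0.4, 0.9, 1.3, 1.8, 2.5]
responses = [(5, 0.2, thresholds), (3, 0.8, thresholds)]
print(estimate_ability(responses))
```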

After the six warm-up questions, the remainder of the BEST Plus items are administered out of folders—thematically organized groups of questions. Within each folder, the algorithm chooses the best question to administer next following certain constraints that allow the test to flow smoothly (e.g., the topic of each folder is introduced by a mandatory photo description or entry question). When it is time to move to another folder, the algorithm takes into account the current estimate of the examinee’s ability and the difficulty of the items in the remaining folders and specifies from which folder questions should be administered next to maximize the appropriateness of the items for the examinee. As part of this process, of course, the software ensures that no question, folder, or topical domain is ever visited more than once per administration of the test.


In other words, with the exception of the warm-up questions administered to every student, the remaining items are placed into thematically organized folders of 7 questions, out of which 2, 3, or 4 are administered. The maximum number of questions administered is 25, as follows:

Warm-up: 6 questions

Folder 1: 4 questions

Folder 2: 3 questions

Folder 3: 3 questions

Folder 4: 3 questions

Folder 5: 2 questions

Folder 6: 2 questions

Folder 7: 2 questions

Each of the 7 questions within a folder is of a different item type, such as photo description, yes/no, personal expansion, elaboration, and so on, as described above. Within each folder, the 7 questions range from very easy to very hard.

The performance of each student on each test question is scored on three subscales—Listening Comprehension, Language Complexity, and Communication—with a different range of points possible for each subscale. The sum of the points that could be awarded on the three subscales for a single item ranges from 0 to 9. The scoring possibilities are presented in Table 3.

Table 3: Possible Scores on a BEST Plus Item

Subscale                    Possible Points
Listening Comprehension     0, 1, 2
Language Complexity         0, 1, 2, 3, 4
Communication               0, 1, 2, 3
TOTAL                       0, 1, 2, 3, 4, 5, 6, 7, 8, 9
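The subscale structure in Table 3 can be illustrated with a small sketch; the function and the validation below are illustrative only and are not part of the BEST Plus software.

```python
SUBSCALE_MAX = {
    "Listening Comprehension": 2,
    "Language Complexity": 4,
    "Communication": 3,
}

def item_total(listening, complexity, communication):
    """Total score for one item: the sum of the three subscale scores (0 to 9)."""
    scores = {
        "Listening Comprehension": listening,
        "Language Complexity": complexity,
        "Communication": communication,
    }
    for subscale, value in scores.items():
        if not 0 <= value <= SUBSCALE_MAX[subscale]:
            raise ValueError(f"{subscale} score must be between 0 and {SUBSCALE_MAX[subscale]}")
    return sum(scores.values())

print(item_total(2, 3, 2))   # -> 7
```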

Each folder includes items from 1 of 13 domains that are commonly covered in adult ESL courses in the United States. (See Table 2 for a complete list of BEST Plus domains.) Items from the personal identification domain are in the warm-up folder. Of the remaining 12 domains, one is covered in four folders, one in two folders, and 10 in three folders, with each folder centering on a sub-theme within the larger domain. Altogether, there are 37 folders in the current BEST Plus item pool.

To ensure that randomization occurs, the folder immediately following the warm-up is chosen totally at random. The following folders are selected randomly from among those that contain the next most appropriate item (within a range) based on item selection difficulty values.

Within each folder, the algorithm chooses the most appropriate items to administer within the constraints of a logical sequence of items (e.g., an easier item type, such as photo description or entry question, introduces the theme of the folder first; four questions from the first folder are administered, then three questions each from the next three, then two questions each from the final three).

Once a question is administered and scored, the program recalculates its estimate of the examinee’s ability. With this updated ability estimate, the program selects the next test question to be administered from the folder. That next question should challenge the examinee, but it should not be too difficult or too easy. As indicated above, depending on where the folder is in the sequence of folders, two, three, or four questions from that folder will be administered. Following the administration of the last question in a folder, the program selects a new folder, and the procedure continues, with the ability estimate of the examinee being recalculated after each question is scored.

There are three stopping rules to this process. At the end of a folder, the test is stopped when any of these conditions is met:

A. For six consecutive questions, the examinee’s estimated ability remains below a scale score of 330, which corresponds to SPL 0.

B. Standard error falls below .20 on a 7-logit scale (about 3% of the scale).

C. The maximum number of folders (7 folders, a total of 25 questions including 6 warm-up questions) is reached.

Rule A ensures that examinees of low ability are not overtested. The test stops for them if their ability estimate over six questions remains at the lowest possible ability level (Student Performance Level 0).

Rule B ensures that an adequate level of precision, analogous to the standard error of measurement, is reached before stopping the test, but that the test is no longer than it has to be.

Rule C ensures that the test will stop within a reasonable time limit and with a reasonable level of precision, even if the standard error does not fall below .20.
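A minimal sketch of the three stopping rules, checked at the end of each folder, is shown below. The variable names and the bookkeeping for the "six consecutive questions" counter are assumptions; only the thresholds (a scale score of 330, a standard error of .20 logits, and seven folders) come from the rules above.

```python
QUESTIONS_PER_FOLDER = [4, 3, 3, 3, 2, 2, 2]   # after the 6 warm-up questions

def should_stop(questions_below_spl0, standard_error, folders_completed,
                spl0_ceiling_questions=6, se_target=0.20, max_folders=7):
    """Evaluate the BEST Plus stopping rules at the end of a folder.

    questions_below_spl0: consecutive questions for which the ability
        estimate has stayed below a scale score of 330 (SPL 0).
    standard_error: current standard error of the ability estimate, in logits.
    folders_completed: folders administered so far (excluding the warm-up).
    """
    if questions_below_spl0 >= spl0_ceiling_questions:   # Rule A
        return True
    if standard_error < se_target:                       # Rule B
        return True
    if folders_completed >= max_folders:                 # Rule C
        return True
    return False

# Example: after four folders (13 post-warm-up questions) the standard error
# has dropped to .18 logits, so the test ends under Rule B.
print(should_stop(questions_below_spl0=0, standard_error=0.18, folders_completed=4))   # -> True
```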


2. Reliability

In testing, reliability refers to the consistency of measurement of a particular assessment instrument—in other words, the degree to which a test produces consistent results each time it is used. For instruments such as BEST Plus, in which individuals are trained to administer and rate performances, the issue of how consistently different individuals score the test—known as interrater reliability—is paramount. The test developers must demonstrate that high levels of interrater reliability may be obtained. Nevertheless, as the Standards for Educational and Psychological Testing (AERA/APA/NCME, 1999) make clear, local users of a test such as BEST Plus also hold some responsibility for maintaining high levels of reliability in practice. The Standards read as follows:

Typically, developers and distributors of tests have primary responsibility for obtaining and reporting evidence of reliability. . . . In some instances, however, local users of a test or procedures must accept at least partial responsibility for documenting the precision of measurement. This obligation holds when . . . users must rely on local scorers who are trained to use the scoring rubrics provided by the test developer. In such settings, local factors may materially affect the magnitude of error variance and observed score variance. Therefore, the reliability of scores may differ appreciably from that reported by the developer. (pp. 30-31)

The current BEST Plus development project includes a thorough study of the reliability of the computer-adaptive version of BEST Plus. This study, which is reported on below, demonstrated the typical interrater reliability that may be achieved by qualified and well-trained test administrators. The study also examined the consistency of measurement across administrations of the computer-adaptive form of BEST Plus, another aspect of reliability that is important for this test. This type of reliability is somewhat analogous to traditional parallel-form reliability in that each administration of the adaptive test can be considered a parallel form of the test. Other aspects of reliability, in particular the precision of measurement that is available from the measurement model upon which BEST Plus is built, are also discussed below.

2.1 Interrater Reliability

In November 2002, CAL conducted an interrater reliability study on the computer-adaptive version of BEST Plus. The study involved 32 students from a local adult ESL program, drawn from all levels of the program, and two groups of three raters: Group A and Group B. Within each group of raters, one scorer was the test administrator, one was an experienced scorer who had participated in the small-scale reliability study, and one was a novice scorer who had received about 2 hours of training. Each student was tested twice: once by Group A and once by Group B.

Group A

Table 4 presents the descriptive statistics for the three raters in Group A, who tested or co-scored all 32 examinees. The first column shows from which of the score categories the data were obtained (total score points or total points from subscore categories). The next column shows which rater from Group A supplied the scores in each row. The third column shows the number of examinees who were tested. The fourth column shows the mean score awarded by each rater, the fifth shows the standard deviation of the scores awarded by each rater, and the last two show the minimum and maximum scores awarded by each rater.



Table 4: Descriptive Statistics for Group A

Category                  Rater           N    Mean     Std. Dev.   Min   Max

TOTAL SCORE               Admin-A         32   115.81   30.377      51    174
                          Experienced-A   32   113.53   32.013      40    176
                          Novice-A        32   116.03   29.430      50    168

Listening Comprehension   Admin-A         32    36.63    8.202      16     50
                          Experienced-A   32    36.22    8.596      14     50
                          Novice-A        32    37.03    8.102      17     50

Language Complexity       Admin-A         32    27.09   11.073      10     50
                          Experienced-A   32    26.50   11.438       8     53
                          Novice-A        32    25.53    9.971       9     45

Communication             Admin-A         32    52.09   12.068      25     74
                          Experienced-A   32    50.81   12.973      18     74
                          Novice-A        32    53.47   12.511      24     75

From Table 4 we see that for Group A, the mean scores (across all students and all questions) for total score and all three subscales are very close. They vary least for the Listening Comprehension subscale. For Language Complexity and Communication, there is little variation between the administrator and the experienced scorer; the novice scorer is slightly lower on the Language Complexity subscale and slightly higher on the Communication subscale, but still very close to the others. These data indicate that the raters are applying the rubric with a great deal of consistency.

Table 5 presents the correlation between pairs of scores awarded by different rater pairs within Group A. The first column shows the scoring category. The next three columns show the correlation of the scores of each rating pair. The final column shows the simple average of the three paired correlations.

Table 5: Pearson Correlations for Group A

Category                  Admin-A/Exp-A   Admin-A/Nov-A   Exp-A/Nov-A   Average

TOTAL SCORE               .98             .99             .98           .98
Listening Comprehension   .98             .99             .98           .98
Language Complexity       .97             .97             .97           .97
Communication             .96             .98             .96           .97

The correlations on the total score and all subscores are extremely high between raters in each pair, ranging from a high of .99 between the administrator and novice scorer on the total score and the Listening Comprehension subscale to a low of .96 between the administrator and the experienced rater and between the novice and the experienced rater on the Communication subscale.
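Each figure in Tables 5 and 7 is a Pearson correlation between the total (or subscale) scores two raters awarded to the same 32 examinees. A minimal sketch of that computation is shown below; the score vectors are invented for illustration, not data from the study.

```python
import numpy as np

# Hypothetical total scores awarded by two raters to the same six examinees.
admin_scores = np.array([115, 98, 142, 87, 131, 104])
novice_scores = np.array([118, 95, 139, 90, 128, 108])

pearson_r = np.corrcoef(admin_scores, novice_scores)[0, 1]   # Pearson r
print(round(pearson_r, 2))
```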


Group B

Table 6 shows the descriptive statistics for the three raters in Group B.

Table 6: Descriptive Statistics for Group B

Category                  Rater           N    Mean     Std. Dev.   Min   Max

TOTAL SCORE               Admin-B         32   128.66   32.175      56    185
                          Experienced-B   32   125.22   29.724      63    173
                          Novice-B        32   122.31   31.011      65    183

Listening Comprehension   Admin-B         32    39.47    7.960      19     50
                          Experienced-B   32    39.09    7.892      22     50
                          Novice-B        32    38.44    7.857      24     49

Language Complexity       Admin-B         32    31.81   13.860      10     60
                          Experienced-B   32    31.22   12.178      11     54
                          Novice-B        32    29.41   12.503      12     64

Communication             Admin-B         32    57.38   12.546      27     75
                          Experienced-B   32    54.91   11.663      30     75
                          Novice-B        32    54.47   13.056      29     75

From Table 6 we see that, with the exception of the Listening Comprehension subscale scores, there appears to be greater variation among raters in Group B for both the total scores and the subscale scores than was seen in Group A, and that the variation is systematic. The test administrator awards the highest average scores in all categories; the novice rater awards the lowest. The administrator’s tendency is most clearly seen in the Communication subscale: Although there is very little difference here between the other two raters, the administrator is about 2.7 points higher than the average of the other two. The tendency to rate lower by the novice rater is most clearly seen in the Listening Comprehension and Language Complexity subscales, where there is little difference between the administrator and the experienced rater. For the Listening Comprehension subscale, the novice rater is almost 1 point lower than the mean of the other two raters; for the Language Complexity subscale, she is a bit more than 2 points lower. While the tendencies for raters to be consistently high or low are clear in these data, these variations are not extreme given the size of the standard deviations. As with Group A, these data indicate that the raters were rating with some consistency.

Table 7 parallels Table 5 and shows the correlations between pairs of raters within Group B.

Table 7: Pearson Correlations for Group B

Category                  Admin-B/Exp-B   Admin-B/Nov-B   Exp-B/Nov-B   Average

TOTAL SCORE               .97             .96             .98           .97
Listening Comprehension   .97             .97             .98           .97
Language Complexity       .95             .93             .90           .93
Communication             .93             .93             .94           .93


Again, while somewhat lower than the correlations in Table 5 (which may be expected given the greater variations in average scores), the correlations between raters in each pair in Group B remain quite high, averaging .97 for the total score and the Listening Comprehension subscale. The highest observed correlation, .98, was between the experienced and novice co-scorers on the total score and the Listening Comprehension subscale. The lowest correlation was also between this pair: .90 on the Language Complexity subscale.

Tables 4 and 6 cannot be directly compared because the number of test items the examinees received with each group of raters differed. However, when we compare Tables 5 and 7, we see that the average correlation for the total score for both groups was very high: .98 for Group A and .97 for Group B. For Listening Comprehension, the averages were also .98 and .97; for both Language Complexity and Communication, they were .97 for Group A and .93 for Group B.

In summary, the degree of interrater reliability that can be achieved using the BEST Plus Scoring Rubric appears to be quite high even for novice scorers. It is possible to achieve interrater reliability on each of the BEST Plus subscales that approaches that of the Fluency scale on the current BEST (also rated 0, 1, 2, 3), which is formally reported in the BEST Test Manual (Center for Applied Linguistics, 1989) as .98 for Form B, .96 for Form C, and .98 for Form D (two raters only, n = 29 for each group).

2.2 Test/Retest Reliability

When a test is adaptive, as is BEST Plus, an important reliability question is, “Will the same student receive the same score on two different administrations of the test, with different raters and different items?” In this case there are multiple sources of variation: differences between the two tests administered as well as differences between the two raters administering the test.

In the study reported on above, the same set of 32 students took BEST Plus from both Administrator A and Administrator B. Because the two tests were adaptive, with different items and different numbers of items being administered each time, raw scores cannot be compared—only the final ability estimate (i.e., the final score). The correlation between the final scores on BEST Plus as given by Administrator A and Administrator B was .89. This is a very high correlation that indicates that students who scored high with one administrator scored high with the other, and those who scored low with one administrator scored low with the other. This figure is analogous to a traditional parallel form correlation in that the two versions of the adaptive test were not exactly the same but may be considered parallel. In summary, it appears that extraneous factors producing differential performances were minimal.

2.3 Parallel-Form Reliability

The print-based version of BEST Plus exists in three parallel forms. An important reliability question is “Will the same student taking two different forms of the print-based BEST Plus receive the same score on each administration?” To answer this question, 48 adult ESL students participated in a study in which three groups of 16 students each were administered two of the three forms of the print-based BEST Plus. Sixteen students received Form A and Form B; 16 received Form A and Form C; and 16 received Form B and Form C. Three administrators participated in the study. Each examiner administered one of the three forms. The results were scored using a computer program to produce final ability scores that were then correlated within each group. The correlation between the two scores of the examinees who took Form A and Form B was .93; of the examinees who took Form A and Form C, it was .96; and of the examinees who took Form B and Form C, the correlation was .85. The average across the three forms was .91, which compares favorably with the test/retest reliability of the computer-adaptive BEST Plus. Examinees who scored high on one form of the print-based version scored high on the other, and those who scored low on one form scored low on the other.


2.4 Traditional Reliability and Precision of Measurement

BEST Plus is based on a sophisticated mathematical measurement model that is a many-facet extension of the Rasch model. When using this model or others based on item response theory (IRT), the traditional concept of reliability of a test—a concept that depends upon the ability distribution of the sample—is replaced by the concept of the precision of measurement. Rasch estimation procedures, however, produce a statistic analogous to the traditional reliability estimation of the measure. For BEST Plus, this statistic, from the calibration of the field test items, is .97. This figure represents the relationship between the measures of examinee ability as if estimated without error (i.e., true score) and the measures estimated with error (i.e., true score with error). The outcome of .97 is very high and reflects the heterogeneity of the population included in the sample and the fact that the ability estimates were measured with a low amount of error (average standard error was .20 of a logit for 31 items administered).
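The report does not give the formula behind the .97 figure, but the Rasch analogue to traditional reliability is conventionally computed as the ratio of error-adjusted ("true") person variance to observed person variance:

$$R_{\text{person}} = \frac{\mathrm{SD}_{\text{obs}}^{2} - \mathrm{MSE}}{\mathrm{SD}_{\text{obs}}^{2}}$$

where $\mathrm{SD}_{\text{obs}}^{2}$ is the observed variance of the ability estimates and $\mathrm{MSE}$ is the mean of their squared standard errors. Whether the Facets output reported here uses exactly this formulation is an assumption.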

More appropriate in this context is the standard error of the ability estimate, which is a measure of the precision attained. This standard error depends primarily upon the number of items used and on the appropriateness of the difficulty of the items administered relative to the ability of the examinee. In theory, then, the standard error can vary for each adaptive test administered. In practice for BEST Plus, however, a standard error of .20 of a logit was chosen as a stopping rule for the test. In the research described above, this stopping rule was met for all examinees with an ability within the current BEST range (i.e., up to but not including SPL level 7), even when as few as 16 items were administered.

What does a standard error of .20 of a logit represent? The entire BEST Plus scale, unconverted, is 7 logits long. Thus, the figure of .20 represents 2.9% of the scale. In terms of the BEST Plus score scale, which ranges from 088 to 999, the standard error of the ability estimate is 20 points. This figure can be used if a confidence band for BEST Plus scale scores is desired.
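For example, under the usual normal approximation, a confidence band around a reported scale score $s$ with this standard error would be

$$s \pm 20 \ \text{points (about 68\% confidence)}, \qquad s \pm 1.96 \times 20 \approx s \pm 39 \ \text{points (about 95\% confidence)}.$$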

In the research study reported above, for examinees of high ability levels, the stopping rule of the maximum number of folders was reached before the standard error dropped below .20. For these eight students, the range of the final standard errors was .203 to .277. The average standard error was .231, which translates to about 23 points on the BEST Plus score scale.

The final standard error for every adaptive BEST Plus score is recorded in the BEST Plus Scores Database.


3. Validity

Test validity is a complex issue. In recent years, the focus of validity has been enlarged to highlight the fact that ultimately it is not the test that is validated, but test usage. This focus is clear in the introductory paragraph on validity in the Standards for Educational and Psychological Testing (AERA/APA/NCME, 1999):

Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests. Validity is, therefore, the most fundamental consideration in developing and evaluating tests. The process of validation involves accumulating evidence to provide a sound scientific basis for the proposed score interpretations. It is the interpretations of test scores required by proposed uses that are evaluated, not the test itself. (p. 9)

CAL has designed BEST Plus to be used as an assessment of the oral language skills of adult learners of English. A particular use of BEST Plus is to place examinees into Student Performance Levels (and thereby NRS levels). Although establishing validity is an ongoing process, preliminary evidence for the use of BEST Plus as an assessment of oral language skills is summarized below.

3.1 Content Evidence

The face-to-face mode of delivery of BEST Plus as an oral interaction requiring the use of listening and speaking skills provides some support for the claim that BEST Plus assesses those skills. In addition, the test specifications outlined in Section 1 of this report support such claims. An investigation of the BEST Plus Scoring Rubric, which assesses Listening Comprehension, Language Complexity, and Communication, also provides support. However, for a performance assessment, the most important content evidence comes from the actual performances of examinees. Thus, of particular importance here is evidence from the BEST Plus standard-setting study, reported in more detail in Section 5 of this report. In that study, 11 judges watched 30 videotaped BEST Plus administrations with examinees representing a wide range of ability levels. The judges were asked to assign SPL descriptors to the examinees' performances. The descriptors used were those for general language ability, listening comprehension, and oral communication. While not all descriptors were equally useful, the judges cited many of the descriptors as clearly observed in support of their ratings. All the judges were able to apply appropriate descriptors to the videotaped performances. Thus, to the extent that the SPLs provide appropriate descriptors of behavior related to increasing ability in listening and speaking, this finding provides content support for the claim that BEST Plus elicits behavior indicative of oral language proficiency.

A particularly interesting finding noted by the judges was the presence of many behaviors not included in the descriptors of the SPLs but noted by applied linguistics researchers as characteristic of interpersonal oral communication. These behaviors included those associated with negotiation of meaning, such as requests for clarification and comprehension checks on the part of the examinee.

3.2 Relationship With Other Measures

Empirical support for the validity of new measures often comes from examining their relationship to existing measures. During the BEST Plus full-scale field test, data from ancillary measures available in the participating programs were collected. The following is based on data from 1,866 students.


3.2.1 Validity Evidence From Program Placement Levels

Although programs can use a variety of means for placing their students into levels (e.g., assessments that emphasize literacy skills, academic skills, or oral language skills), in every program there is an understanding that higher ability students are placed into higher level classes and lower ability students into lower level classes. In the BEST Plus full-scale field test, placement data were available for 24 programs, with differing numbers of students in each. For all 24 programs, moderate to high correlations were consistently found between placement levels and scores on BEST Plus. Table 8 presents these results. It shows the number of students in each program who were included in the calculation of the correlation and the rank order correlation between results on the BEST Plus field test and program placement level.

Table 8: Correlations Between BEST Plus Results and Program Placement Level

Program    Number of Students    Correlation
1          111                   .83
2          118                   .78
3           40                   .77
4          130                   .53
5          130                   .80
6           66                   .60
7          130                   .78
8           28                   .87
9           64                   .80
10          65                   .54
11          65                   .70
12          61                   .63
13          67                   .35
14         115                   .65
15          65                   .77
16          64                   .76
17          39                   .86
18          64                   .82
19          65                   .86
20          37                   .85
21          78                   .50
22          66                   .70
23          76                   .77
24         122                   .79

Note: All correlations are significant at the .01 level.

Table 9 summarizes the correlations between placement level and results on the BEST Plus field test. The average correlation across all programs was a strong .72. Table 9 indicates that for 33% of the programs (8 programs), a very high correlation (.80 or above) was obtained, while for 71% (17 programs), a high correlation (.70 or above) was found.


Table 9: Summary of Correlations Between BEST Plus Scale Scores and Placement Levels for the 24 Programs

Range of Correlation    Number of Programs    Percentage
.80 or above            8                     33.3%
.70 to .79              9                     37.5%
.60 to .69              3                     12.5%
.50 to .59              3                     12.5%
Below .50               1                      4.2%

These results, based on 24 diverse programs from throughout the United States, provide empirical evidence of the validity of BEST Plus as a measure of English proficiency and of its potential usefulness as a placement instrument in adult ESL programs.
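
As a minimal sketch of the computation behind Table 8, the example below calculates a rank order (Spearman) correlation between program placement levels and BEST Plus scores for one hypothetical program. The data values are invented, and the use of scipy is an implementation choice for illustration, not a description of the actual field-test analysis.

```python
# Sketch of the rank order correlation reported in Table 8, for one
# hypothetical program. Placement levels and scale scores are invented.
from scipy.stats import spearmanr

placement_levels = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]            # program-assigned levels
best_plus_scores = [350, 402, 410, 455, 470, 500, 505, 540, 560, 610]

rho, p_value = spearmanr(placement_levels, best_plus_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```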

3.2.2 Validity Evidence Based on Scores of Other Measures of English Proficiency

Among the diverse programs and students participating in the full-scale field test of BEST Plus were programs that used other measures of English proficiency. These included the BEST oral interview, the BEST literacy skills test, the CASAS reading test, the CASAS listening test, and the Tests of Adult Basic Education (TABE). What is the relationship between performance on these tests and performance on BEST Plus? Because they all measure ability in English, some correlation between scores on BEST Plus and these other measures would be expected. However, some of these measures (BEST literacy skills test, CASAS reading test, and the TABE) focus on literacy skills, whereas others (the BEST oral interview and the CASAS listening test) focus on the same English skills as BEST Plus. Thus, while some correlation should be found between BEST Plus and all of these other measures, if BEST Plus is measuring oral skills, as it is meant to do, higher correlations should be found with the other oral/aural measures than with the literacy skills measures.

In this section we will examine the relationship between BEST Plus and each of these tests in turn.

Relationship with the BEST oral interview

Of all the tests in this data set, the BEST oral interview would be the one closest to measuring the same construct as BEST Plus: Both assess speaking proficiency, both involve one-on-one face-to-face interviews, and both produce scores that correlate with the Student Performance Levels (SPLs). SPLs based on a form of the BEST oral interview (Form B, Form C, or Short Form scores) were obtained for 304 students from 5 different programs. For these 304 students, a high correlation (.75) was found between the BEST oral interview SPL levels and performance on BEST Plus. While the BEST scores were not collected concurrently with the BEST Plus scores (i.e., some BEST scores could have been several months old), the strong correlation is a good indicator that the two tests measure similar skills.

Additional data were collected in a study of 32 examinees, in which each examinee was administered both BEST Plus and the BEST. Each test performance was scored by three raters (the test administrator and two trained observers). There was strong agreement between the SPLs awarded by the BEST Plus raters and the SPLs awarded by the BEST raters. The correlation between the SPLs awarded on the BEST and BEST Plus was .885. (For details, see Section 4 of this report, BEST/BEST Plus Comparability Study.)


Relationship with the CASAS listening test

While the CASAS listening test requires literacy on the part of test takers to read the multiple choice items, the test assesses a construct that is directly related to what BEST Plus intends to measure. Scale scores on the CASAS listening test were obtained from 101 students in 4 different programs. For these students, a high correlation (.76) was found between their CASAS listening scale scores and their performance on BEST Plus. As with the similarly strong correlation with the BEST oral interview, the strong correlation between the CASAS listening test and BEST Plus is a good indicator that the two tests measure similar skills.

Relationship with the BEST literacy skills test

The BEST literacy skills test assesses the functional reading and writing abilities of English language learners. It should therefore correlate well with BEST Plus as a measure of English ability, although its correlation may be only moderate because it measures different skills than does BEST Plus (reading and writing vs. oral proficiency). It should be noted that students included in our analysis may be assumed to be literate because they have been scored on the literacy skills section of the BEST. There are, however, many students who are not yet literate in English who could be assessed across a wide range of scores on BEST Plus but for whom the literacy skills section of the BEST would not be an appropriate measure.

SPLs on the BEST literacy skills section (Form B or Form C) were obtained from 182 students in 2 different programs. As expected, a moderate correlation (.62) was found between the students’ SPL levels on the BEST literacy skills section and their performance on BEST Plus, but the correlation was lower than for the two assessments that measured more similar skills (i.e., the BEST oral interview and the CASAS listening test).

Relationship with the CASAS reading test

The CASAS reading test, a multiple-choice instrument, should correlate well with BEST Plus as a measure of English ability. Again, as with the literacy skills section of the BEST, its correlation with BEST Plus may be only moderate because it measures a different skill (reading vs. oral proficiency).

CASAS reading scores were obtained from 215 students in 5 different programs. As expected, a moderate correlation (.67) was found between scale scores on the CASAS reading test and performance on BEST Plus, but again the correlation is lower than for the two assessments that measured more similar skills (i.e., the BEST oral interview and the CASAS listening test).

Relationship with the TABE

The TABE, as an assessment of adult basic skills, is perhaps an indirect measure of English ability for learners of English as a second language. As such, there should be some correlation with BEST Plus as a measure of English ability. Again, as with the BEST literacy skills test and the CASAS reading test, which measure different skills than BEST Plus, the correlation of the TABE with BEST Plus may be only moderate.

Scale scores (i.e., grade-level scores) on the TABE were obtained for 309 students from 9 different programs. As expected, a moderate correlation (.65) was found between scores on the TABE and performance on BEST Plus, but again the correlation is lower than for the two assessments that measured more similar skills (i.e., the BEST oral interview and the CASAS listening test).


3.2.3 Discussion

The collection of ancillary data during the BEST Plus full-scale field test has provided some strong evidence of the validity of BEST Plus as a measure of oral English language ability. First, across 24 different programs, a strong correlation was found between performances on BEST Plus and placement levels. These programs use a variety of methods for placing students, yet the average correlation between BEST Plus scores and placement level (.72) was higher than the correlation between BEST Plus scores and assessments of non-oral language skills as measured by the BEST literacy skills test (.62), the CASAS reading test (.67), and the TABE (.65). In fact, for 71% of the programs, the correlation between placement level and BEST Plus performance was .70 or above. This also supports the argument that BEST Plus can be very useful for program placement across a variety of programs.

Second, BEST Plus correlates moderately to highly with other assessments of English language ability, from .62 for the BEST literacy skills section to .76 for the CASAS listening test. The strongest correlations are between BEST Plus and two other measures of oral/aural skills, namely the CASAS listening test (.76) and the BEST oral interview (.75). The correlation with three measures of English literacy skills was not as high (CASAS reading at .67, TABE at .65, and BEST literacy skills at .62).


4. BEST/BEST Plus Comparability Study

Because BEST Plus is a revision of the oral interview section of the Basic English Skills Test (BEST), it is important to determine the extent of agreement between results on the two tests. To examine this, a study was conducted in which 32 nonnative speakers of English of varying language backgrounds and proficiency levels were administered both the BEST and BEST Plus. Half of the examinees took BEST Plus first; the other half took the BEST first. Each interview was scored by three people: the test administrator and two observers. All scoring was done during the administration of the test. The same three individuals scored all 32 administrations of BEST Plus; a separate group of three scored all 32 administrations of the BEST. In each group, one observer was a member of the CAL staff, and the other observer was a member of the staff of the program in which the examinees were enrolled. The results of these interviews and observations were examined for two types of scores: the raw scores and the corresponding Student Performance Levels (SPLs).

4.1 Raw Score Relationships (BEST)

For the BEST, three different item types produce three separate subscores: Listening, Fluency, and Communication. Scores on these three components are summed to obtain a total score. Table 10 presents descriptive statistics for the three raters in this study who scored all 32 administrations of the BEST. The first column shows the score categories from which the data were obtained (total score points or points for each subscore category); the second column shows the rater (Administrator, CAL Staff, or Program Staff) who awarded the scores in each row. The third column shows the number of examinees who were tested, and the fourth column shows the mean score awarded by each rater to all examinees. The fifth column shows the standard deviation of the scores awarded by each rater, and the last two columns show the minimum and maximum scores awarded by each rater.

Table 10: Descriptive Statistics for the BEST

Subscore        Rater            N    Mean    Std. Dev    Min    Max
TOTAL SCORE     Administrator    32   52.25   14.140      25     71
                CAL Staff        32   53.06   12.541      27     70
                Program Staff    32   54.53   14.321      23     72
Listening       Administrator    32    6.53    1.565       2      9
                CAL Staff        32    7.66    1.004       5      9
                Program Staff    32    8.50    1.107       5      9
Fluency         Administrator    32   12.69    6.587       1     23
                CAL Staff        32   12.50    6.431       1     23
                Program Staff    32   13.13    6.724       1     23
Communication   Administrator    32   33.03    7.009      19     44
                CAL Staff        32   32.91    6.669      18     43
                Program Staff    32   32.91    7.554      16     44

From Table 10 we see that for the BEST administration, the mean raw scores (across all students and all questions) for total score and for the Fluency and Communication subscales were very close. They varied more, however, for the Listening subscale. The program staff observer awarded, on average, slightly higher scores for Listening and Fluency than the other two raters, thereby producing a slightly higher average total score. For the Listening subscale, on average, the administrator awarded the lowest scores of the three raters; the CAL staff observer awarded an average between the administrator and the program staff observer.



These data indicate that the raters are scoring the BEST with a great deal of consistency for Fluency and Communication and with slightly less consistency for Listening. This consistency is important because it helps assure consistency in the SPLs awarded to different interviewees by different interviewers.

Although comparing the means, standard deviations, and range of scores awarded to examinees for the BEST gives us some sense of the consistency among the three raters, examining the correlations among the raw score ratings for each pair of raters gives us a more quantitative sense of the similarity of the scoring between them. Table 11 displays these correlations—the interrater agreement—between each pair of raters.

Table 11 shows the correlation between pairs of raw scores awarded by different rater pairs among the three raters that scored all 32 administrations of the BEST. The first column shows the scoring category. The next three columns show the rating pair for whom the scores are being correlated. The final column shows the simple average of the three paired correlations.

Table 11: Pearson Correlations for BEST Administration

Category         Administrator/    Administrator/    CAL Staff/       Average
                 CAL Staff         Program Staff     Program Staff
TOTAL SCORE      .98               .96               .67              .87
Listening        .35               .55               .13              .34
Fluency          .97               .97               .54              .83
Communication    .95               .92               .73              .87

Correlations between the administrator and the CAL staff member and between the administrator and program staff rater are very high on the total score and on the subscores for Fluency and Communication. They range from .92 on Communication for the administrator/program staff pair to .98 on the total score for the administrator/CAL staff pair. For the CAL staff/program staff pair, the correlations are considerably lower, ranging from a very low .13 on Listening to a modest .73 on Communication. For all three pairs of raters, the correlations on Listening range from a very low .13 to a very modest .55. These low correlations confirm the disparity in means, standard deviations, and ranges summarized for Listening in Table 10.
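
The interrater statistics in Tables 11 and 13 are pairwise Pearson correlations among the three raters' raw scores. The sketch below shows how such pairwise correlations can be computed; the eight score totals per rater are hypothetical stand-ins, not the study data.

```python
# Sketch of the pairwise interrater correlations behind Tables 11 and 13.
# The totals below are hypothetical; the study used all 32 examinees.
from itertools import combinations
import numpy as np

totals = {
    "Administrator": [52, 61, 40, 33, 70, 55, 48, 62],
    "CAL Staff":     [53, 60, 42, 30, 69, 54, 50, 63],
    "Program Staff": [55, 63, 41, 29, 72, 56, 47, 65],
}

for (name_a, a), (name_b, b) in combinations(totals.items(), 2):
    r = np.corrcoef(a, b)[0, 1]                 # Pearson correlation coefficient
    print(f"{name_a} / {name_b}: r = {r:.2f}")
```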

4.2 Raw Score Relationships (BEST Plus)

For BEST Plus, the response to each question is scored on three subscales of oral language proficiency: Listening Comprehension, Language Complexity, and Communication. Each is scored using a separate rubric. Listening Comprehension scores range from 0 to 2, Language Complexity from 0 to 4, and Communication from 0 to 3. The following analysis is based on the sum of the raw scores for each subscale across all the items administered to an examinee; total scores represent the sums of the raw scores on the three subscales. Table 12, in a manner similar to Table 10, presents the descriptive statistics for the three raters who scored all 32 examinees who were administered BEST Plus.


Table 12: Descriptive Statistics for BEST Plus

Subscore                  Rater            N    Mean     Std. Dev    Min    Max
TOTAL SCORE               Administrator    32   124.63   36.046      57     163
                          CAL Staff        32   121.63   35.415      57     165
                          Program Staff    32   122.09   35.123      51     162
Listening Comprehension   Administrator    32    37.88   10.238      17      50
                          CAL Staff        32    38.16   10.029      18      50
                          Program Staff    32    38.38   10.219      16      49
Language Complexity       Administrator    32    29.25   11.846      10      45
                          CAL Staff        32    25.09    9.888      10      44
                          Program Staff    32    24.97    9.355       9      39
Communication             Administrator    32    57.50   14.642      28      74
                          CAL Staff        32    58.50   16.250      26      75
                          Program Staff    32    58.84   16.032      26      75

From Table 12 we see that for the BEST Plus administration, the mean raw scores (across all students and all questions) for total score and for the Listening Comprehension and Communication subscales were very close. They varied more, however, for the Language Complexity subscale, with the administrator awarding an average raw score higher than the very close averages for the CAL and program staff observers. The standard deviations and the ranges between minimum and maximum scores, both indicators of the spread in the scores awarded by the three scorers, were very close for the Listening Comprehension and Communication subscales. For Language Complexity, the standard deviation for the administrator was larger than that for the CAL staff and program staff raters. This suggests that in the administrator’s scoring, there were more scores farther from the average than for the other two raters. However, the score ranges on Language Complexity for the administrator and the CAL staff member were practically the same. This suggests that both scorers were awarding scores with the same spread, though the administrator may have been awarding more scores farther from the mean. These data indicate that the raters are applying the BEST Plus rubric with substantial consistency for Listening Comprehension and Communication and with slightly less consistency for Language Complexity. For Listening Comprehension, the scoring appears to be more consistent on BEST Plus than on the BEST.

Table 13, in a manner similar to Table 11, presents the correlation, or interrater agreement, between pairs of raw scores awarded by different rater pairs among the three raters that scored all 32 students on BEST Plus. The columns and rows are interpreted as for Table 11, with the final column showing the simple average of the three paired correlations. For BEST Plus, the subscales are Listening Comprehension, Language Complexity, and Communication.

Table 13: Pearson Correlations for BEST Plus Administration

Category                  Administrator/    Administrator/    CAL Staff/       Average
                          CAL Staff         Program Staff     Program Staff
TOTAL SCORE               .98               .99               .98              .98
Listening Comprehension   .98               .99               .98              .98
Language Complexity       .96               .98               .97              .97
Communication             .98               .97               .98              .98


The correlations between the pairs of raters on the total score and all subscores are extremely high. The lowest is .96 on Language Complexity for the administrator/CAL staff pair. The highest is .99 on both total score and Listening Comprehension for the administrator/program staff pair. The strong correlations for Listening Comprehension (.98, .99, and .98) contrast considerably with the much weaker .35, .55, and .13 on the BEST.

In sum, for relationships among the raw scores awarded by these two groups of administrators and observers, it appears that there is greater agreement among the three raters for BEST Plus than among the three raters for the BEST.

4.3 Comparing Performances on the BEST and BEST Plus

Scores on the BEST oral interview section have most often been interpreted in terms of SPLs. To relate scores on BEST Plus to scores on the BEST, as well as to provide a criterion-referenced interpretation of BEST and BEST Plus scores, it was necessary to determine the relationship between performances on BEST Plus and BEST in terms of SPLs. This relationship was determined primarily through a standard-setting study conducted at the Center for Applied Linguistics in December 2002 and described in Section 5 of this report. In the study of the 32 examinees described above, the SPLs derived from the three BEST Plus raters’ scores were compared with the SPLs derived from the three BEST raters’ scores.

4.3.1 SPL Relationships: All 32 Examinees

Table 14, in a manner similar to Table 10, presents descriptive statistics, this time for SPLs on the BEST and BEST Plus. Again, various columns show the sample size (N), the mean score (in terms of SPLs), standard deviation, and minimum and maximum scores (SPLs) for each of the raters. The minimum to maximum range of SPLs is identical for all three BEST Plus raters, from 1 to 8; the range for all three BEST raters is also identical, but from 2 to 7. The range of SPL means for the three BEST Plus raters, from 5.31 to 5.47, is outside (and higher than) the range of means for the three BEST raters (4.97 to 5.22). The standard deviations for both the BEST Plus and BEST raters are clustered closely together at 2.039 to 2.104 and 1.596 to 1.666, respectively. These differences in means, standard deviations, and minimum to maximum ranges, when considered together, suggest that BEST Plus may result in slightly higher SPLs than the BEST but may also produce a wider range of SPLs. A possible explanation for this may be the difference between the two tests’ ceilings; BEST Plus can assess oral proficiency up to SPL 10, whereas the BEST assesses oral proficiency only up to SPL 7.

Table 14: Descriptive Statistics for BEST Plus and BEST SPLs: All 32 Examinees

Test        Rater            N    Mean    Std. Dev    Min    Max
BEST Plus   Administrator    32   5.47    2.048       1      8
            CAL Staff        32   5.31    2.039       1      8
            Program Staff    32   5.34    2.104       1      8
BEST        Administrator    32   5.00    1.666       2      7
            CAL Staff        32   4.97    1.596       2      7
            Program Staff    32   5.22    1.660       2      7

While comparing the means, standard deviations, and ranges of the SPLs awarded for the BEST and BEST Plus gives us some sense of the consistency between the three BEST Plus raters and the three BEST raters, examining the correlations among the SPLs for these raters will give us a more quantitative sense of the similarity of the SPLs resulting from their scoring.


Because there was not perfect agreement among the raters on the raw scores, there was also not perfect agreement among them as to the SPL that should be awarded to each examinee on the basis of his or her performance on the BEST or BEST Plus. For each examinee on each test, we took as the final SPL the SPL that was derived from at least two of the three raters. In other words, if the SPL based on the administrator’s score was 4, the SPL based on the CAL staff member’s score was 5, and the SPL based on the program staff member’s score was 4, we chose 4 as the final SPL for this analysis. The correlation between the final SPL on the BEST and the final SPL on BEST Plus was .885.
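
A minimal sketch of this majority rule is given below. The rater SPLs shown are hypothetical, and the report does not say how a three-way disagreement, if any occurred, was resolved.

```python
# Sketch of the "final SPL" rule: the SPL derived from at least two of the
# three raters. Rater SPLs here are hypothetical examples.
from collections import Counter

def final_spl(rater_spls):
    """Return the SPL awarded by at least two raters, or None if all three disagree."""
    spl, count = Counter(rater_spls).most_common(1)[0]
    return spl if count >= 2 else None

print(final_spl([4, 5, 4]))   # -> 4, as in the example in the text
```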

In Table 15, we examine the percentage of absolute agreement and absolute and adjacent agreement between the final SPL on the BEST and BEST Plus as described above. Absolute agreement means the final BEST SPL for an examinee was the same as the final BEST Plus SPL for the same examinee. Absolute and adjacent agreement includes cases in which the final BEST SPL was one level lower than, equal to, or one level higher than the final BEST Plus SPL. These are presented in terms of number of cases and percentage of cases.

Table 15: Absolute Agreement of Final SPL for BEST and BEST Plus (n = 32)

Absolute Agreement (Number)                        11
Absolute Agreement (Percentage)                    34%
Absolute and Adjacent Agreement (Number)           29
Absolute and Adjacent Agreement (Percentage)       91%

In a third of the cases (11 of 32 subjects or 34%), the final SPLs on each test matched exactly. Almost all final SPLs (for 29 of 32 subjects, or 91%) were no more than one level different between the two tests.
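
The agreement figures in Tables 15, 18, and 21 can be computed as sketched below; the SPL pairs in the example are hypothetical.

```python
# Sketch of the absolute and absolute-plus-adjacent agreement rates between
# final SPLs on the BEST and BEST Plus. The SPL pairs below are hypothetical.
def agreement_rates(best_spls, best_plus_spls):
    pairs = list(zip(best_spls, best_plus_spls))
    absolute = sum(a == b for a, b in pairs)               # exact matches
    adjacent = sum(abs(a - b) <= 1 for a, b in pairs)      # within one level
    n = len(pairs)
    return absolute / n, adjacent / n

abs_rate, adj_rate = agreement_rates([4, 5, 6, 3], [4, 6, 6, 5])
print(f"absolute: {abs_rate:.0%}, absolute and adjacent: {adj_rate:.0%}")
```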

In discussing this outcome, it is important to remember that the two performances of the examinees on the BEST and BEST Plus were unique. That is to say, the test questions, method of scoring, and raters under one condition (i.e., the administration of the BEST) were totally different from those under the other condition (i.e., the administration of BEST Plus). It is also important to remember that half of the examinees took the BEST first, then BEST Plus, and half took BEST Plus first, then the BEST. There was no way to control for the qualitative performance of examinees across testing conditions. Examinees may have been more interested in the first administration and less engaged in the second, or the first administration may have served as a kind of warm-up, leaving them better prepared for the second.

If BEST and BEST Plus are comparable measures of oral proficiency, taking into consideration changes in examinee performances from one administration to the next, we would still expect many examinees to earn the same SPL on BEST and BEST Plus. In cases where there was a difference in SPLs, we would expect the number of examinees who have higher SPLs on BEST Plus than on the BEST to be about equal to the number of examinees who have lower SPLs on BEST Plus than on the BEST.

Table 16 shows that 11 of the 32 examinees (34%) had absolute agreement between their final SPLs on the BEST and BEST Plus. We see that of the 21 examinees without absolute agreement, 13 (41%) earned higher SPLs on BEST Plus than on the BEST, whereas only 8 (25%) earned lower SPLs on BEST Plus than on the BEST.

Table 16: Final SPL for BEST Plus Relative to Final SPL for the BEST (n = 32)

BEST Plus Lower Than BEST    BEST Plus Equal to BEST    BEST Plus Higher Than BEST
8 (25%)                      11 (34%)                   13 (41%)


Table 16 might lead us to conclude that SPLs awarded on the basis of BEST Plus have a tendency to be higher than those awarded on the basis of the BEST. When we looked at the data more closely, however, we noticed that about half of the examinees were at a very high final SPL level and that most of the examinees who scored higher on BEST Plus than on the BEST had a final SPL of 7 or above. Remember that whereas the BEST Plus can measure up to SPL 10, the BEST has a ceiling of SPL 7. Therefore, we divided the 32 examinees into two groups: 16 whose final SPLs on both tests were in the range of 1 through 6, and 16 whose final SPL on at least one of the two tests was 7 or higher.

4.3.2 SPL Relationships: 16 Subjects With Final SPLs of 1–6

For the first restricted sample of 16 examinees—that in which subjects earned a final SPL of 1 through 6—we examined the same relationships as were examined for the sample of 32 interviewees: the relationships between SPLs awarded as a result of the scores of the BEST Plus raters and those of the BEST raters. The descriptive statistics for this analysis are displayed in Table 17. The rows and columns of Table 17 can be interpreted similarly to those in Table 14.

Table 17: Descriptive Statistics for BEST Plus and BEST SPLs: Subjects With Final SPLs of 1–6

Test        Rater            N    Mean    Std. Dev    Min    Max
BEST Plus   Administrator    16   3.94    1.806       1      7
            CAL Staff        16   3.69    1.621       1      6
            Program Staff    16   3.69    1.740       1      6
BEST        Administrator    16   3.69    1.250       2      6
            CAL Staff        16   3.75    1.065       2      6
            Program Staff    16   4.06    1.526       2      6

(Note that although the SPL based on one of the BEST Plus administrator’s scores placed an examinee into SPL 7, that examinee did not have a final SPL of 7 because only one SPL 7 was received.)

From Table 17, we see that the range of means of the SPLs awarded by the BEST Plus raters falls within the range of the means for the BEST raters. However, the standard deviation is greater for the BEST Plus raters and the minimum to maximum range for the BEST Plus raters is greater than that for the BEST raters. SPLs based on BEST raters’ scores were in a narrower range than those on BEST Plus; however, there was more variability among the standard deviations for the BEST SPLs than for the BEST Plus SPLs. This suggests greater variability between raters on the BEST SPLs than on the BEST Plus SPLs.

As described above, the final SPL on each assessment was that which was derived from at least two of the three raters. The correlation between the final SPL on the BEST and the final SPL on BEST Plus for these 16 subjects was .835. The correlation for the original sample of 32 examinees was .885.

Table 18, like Table 15, shows the percentage of absolute agreement and absolute and adjacent agreement between the final SPLs on the BEST and BEST Plus. In the case of absolute agreement, the final BEST SPL is the same as the final BEST Plus SPL. Absolute and adjacent agreement includes cases in which the final BEST SPL was one level lower than, equal to, or one level higher than the final BEST Plus SPL. These are presented in terms of number of cases and percentage of cases.


Table 18: Absolute Agreement of Final SPL for BEST and BEST Plus: Subjects With Final SPLs of 1–6 (n = 16)

Absolute Agreement (Number)                        5
Absolute Agreement (Percentage)                    31%
Absolute and Adjacent Agreement (Number)           15
Absolute and Adjacent Agreement (Percentage)       94%

There was absolute agreement in 5 out of 16 cases, or 31% of examinees; this compares to 34% absolute agreement for all 32 examinees. As seen in Table 18, there was absolute and adjacent agreement in 15 of 16 cases, or 94%. This compares well with the 91% absolute and adjacent agreement for the full group of 32 examinees.

Table 19 replicates Table 16 for the 16 examinees with final SPLs between 1 and 6.

Table 19: Final SPL for BEST Plus Relative to Final SPL for BEST: Subjects With Final SPLs of 1–6 (n = 16)

BEST Plus Lower Than BEST    BEST Plus Equal to BEST    BEST Plus Higher Than BEST
5 (31%)                      5 (31%)                    6 (38%)

We see in Table 19 that 5 examinees (31%) had absolute agreement between their final SPLs on the BEST and BEST Plus, 6 examinees (38%) earned higher SPLs on BEST Plus than on the BEST, and 5 examinees (31%) earned lower SPLs on BEST Plus than on the BEST. This is consistent with our expectations for examinee performance on BEST and BEST Plus: When examinees’ final SPLs on BEST and BEST Plus are compared, we expect to find many examinees with equal SPLs on both tests and an equal number of examinees with higher and lower SPLs on one test than on the other. Thus, for this group of examinees, we do not see a clear tendency for performances on BEST Plus to result in higher SPLs than performances on the BEST.

4.3.3 SPL Relationships: 16 Subjects With at Least One Final SPL of 7 (or Higher)

Descriptive statistics for the 16 subjects performing at a final SPL of 7 or higher on at least one of the two tests (i.e., those who received an SPL of 7 or above from at least two of the three raters on that test) are shown in Table 20.

Table 20: Descriptive Statistics for BEST Plus and BEST SPLs: Subjects With at Least One Final SPL of 7

Test        Rater            N    Mean    Std. Dev.    Min    Max
BEST Plus   Administrator    16   7.00    .632         6      8
            CAL Staff        16   6.94    .574         6      8
            Program Staff    16   7.00    .516         6      8
BEST        Administrator    16   6.31    .704         5      7
            CAL Staff        16   6.19    .981         4      7
            Program Staff    16   6.38    .719         5      7

Table 20 shows that the BEST Plus SPLs ranged from 6 to 8, whereas the BEST SPLs ranged from 4 to 7. However, we must remember that the highest SPL that can be awarded on the basis of the BEST is 7. From both the range and the means for this group of examinees, we see that they tended to earn a higher SPL on their BEST Plus performance than on their BEST performance.

Because this sample is so truncated and restricted in range, examining the correlations is not helpful.


However, Table 21 replicates Tables 15 and 18, showing the percentage of absolute agreement and absolute and adjacent agreement between the final SPL on the BEST and BEST Plus.

Table 21: Absolute Agreement of Final SPL for BEST and BEST Plus: Subjects With at Least One Final SPL of 7 (n = 16)

Absolute Agreement (Number)                        6
Absolute Agreement (Percentage)                    38%
Absolute and Adjacent Agreement (Number)           14
Absolute and Adjacent Agreement (Percentage)       88%

For those with a final SPL of 7, 6 out of 16 cases (38%) show absolute agreement, or the same final SPL on both assessments, as compared to 34% of all 32 examinees. In 14 out of 16 cases (88%), there was absolute and adjacent agreement, as compared to 91% with all 32 examinees. Again, analysis of the smaller sample of 16 subjects finds percentages of absolute agreement and of absolute and adjacent agreement similar to those in the original sample of 32 subjects. This suggests consistency among raters of the BEST and BEST Plus at all levels of student proficiency.

Table 22 replicates Tables 16 and 19, showing the final SPL for BEST Plus relative to the final SPL for BEST.

Table 22: Final SPL for BEST Plus Relative to Final SPL for BEST: Subjects With at Least One Final SPL of 7 (n = 16)

BEST Plus Lower Than BEST    BEST Plus Equal to BEST    BEST Plus Higher Than BEST
2 (13%)                      6 (38%)                    8 (50%)

Table 22 shows that half of the examinees (8 out of 16) received a higher final SPL on BEST Plus than on the BEST, whereas only 13% (2 out of 16) received a lower final SPL on BEST Plus than on the BEST.

Based on the above results, we can conclude that raters of BEST and BEST Plus are consistent in their scoring of examinees at all student proficiency levels (31-38% absolute agreement and 88-94% absolute and adjacent agreement among final SPLs on the BEST and BEST Plus). While there is no tendency for students at SPLs 1–6 to receive a higher SPL on BEST Plus than on the BEST, there is a strong tendency for higher proficiency examinees (those with a proficiency level of SPL 7 or above on at least one of the two tests) to earn a higher final SPL based on their BEST Plus performance than on their BEST performance.

4.3.4 SPL Relationships: Differences in BEST and BEST Plus SPLs in Higher Proficiency Examinees

CAL staff explored several possible explanations for the differences in BEST and BEST Plus SPLs for those examinees in the higher proficiency group. That is, we asked the question, “Why does an examinee with a final SPL of at least 7 on either assessment tend to earn a higher SPL on BEST Plus than on the BEST?”

One likely reason for this trend is that BEST Plus measures student performance at higher levels of proficiency than does the BEST. As mentioned earlier, BEST Plus can be used to assess student performance up to SPL 10, whereas the BEST can be used to assess student performance no further than SPL 7. This means that an examinee may perform better than the equivalent of SPL 7 on the BEST, but the BEST cannot measure that level of performance.


To further examine possible explanations for the trend of examinees to earn higher final SPLs on BEST Plus than on the BEST, CAL staff reviewed the videos of the administration of both the BEST and BEST Plus to three students who performed at a final SPL of 7. This brief review showed clearly that questions on the adaptive BEST Plus targeted to high performing examinees were more conducive to oral elaboration than the most challenging questions on the BEST. This suggests that examinees have greater opportunity to produce more sophisticated speech while taking BEST Plus than they have while taking the BEST. For example, typical higher level elaboration questions on BEST Plus would be something like these:

Do you think it’s important to keep up with the news? . . . Why/why not?

Some people think that news reports in the United States are unreliable and show only one side of an issue. Others think that the news is accurate and unbiased. What do you think about news reports in the United States? . . . Tell me more.

On the BEST, the eight questions scored for fluency are examples of higher level questions. Typical BEST fluency questions include these:

Is shopping in _____ and the United States the same? How is it different/the same?

In _____ what do people do when they get sick?

The analysis of videotaped performances of three subjects taking the BEST and BEST Plus indicates that a higher performing examinee may have more opportunity to show the test administrator greater oral proficiency on BEST Plus than on the BEST due to the nature of the questions presented.

4.4 Summary

Overall, this study showed that there is a strong relationship between performances on BEST Plus and the BEST. The two tests may be compared via the Student Performance Levels (SPLs). The correlation between the final SPLs produced by the two tests (the agreement of at least two out of three raters) was .885. This high correlation provides evidence that the two tests are measuring a common construct. A side finding of this study was that interrater reliability on BEST Plus was higher than that achieved on the BEST, although their scoring systems are very different.

In terms of absolute agreements, the final SPL based on performances on both tests was the same in 34% of cases. In almost all cases (91%), the final SPL on both tests was the same or at an adjacent level. At the lower proficiency levels (SPLs 1-6), no tendency was noticed for performances on either the BEST or BEST Plus to produce a higher SPL. At higher proficiency levels, at the ceiling of the BEST, examinees tended to be awarded a higher SPL on BEST Plus than on the BEST. A small analysis of three examinees’ videotaped performances on BEST Plus, compared to their performances on the BEST, suggests that the nature of the questions on BEST Plus, which can assess oral proficiency up to SPL 10, may allow examinees to demonstrate greater levels of oral proficiency than the questions on the BEST.


5. Interpreting Test Results

Performances on BEST Plus are reported on a scale that ranges from a minimum of 088 to a maximum of 999. These scaled scores are not to be understood as a simple number-correct score. Rather, they reflect a complex mathematical measure of an examinee’s ability on an underlying scale that has been converted to expand from 088 to 999 for ease of use. This ability measure takes into account the difficulty of the items administered to the examinee as well as the examinee’s performance on the items administered.

5.1 Criterion-Referenced Interpretation: Student Performance Levels

BEST Plus represents an update of the oral interview section of the Basic English Skills Test (BEST). BEST oral interview scores have been most often interpreted in terms of Student Performance Levels (SPLs). To relate scores on BEST Plus to the BEST, as well as to provide a criterion-referenced interpretation of BEST Plus scores, it was necessary to determine the relationship between performances on BEST Plus and the SPLs.

This relationship was determined primarily through a standard-setting study conducted by the Center for Applied Linguistics in December 2002. In this study, 11 panelists from across the United States—experienced adult ESL educators, very familiar with the SPLs for listening and speaking—viewed 30 videotaped administrations of BEST Plus, each lasting about 5 minutes. The videotapes were viewed in order from lowest to highest in terms of the BEST Plus scale scores the examinees had received. After viewing each examinee performance, each panelist determined which SPL was best represented by that performance. They also determined what percentage of that SPL was represented by the performance (e.g., 100%, 90%, 80%, 70%, 60%, or 50%) and what percentage of an adjacent level, if any, was also represented. Each panelist then presented his or her rating to the entire panel. Following the display of all the ratings for a given performance, the panelists discussed their ratings together, particularly the outlying ratings (i.e., the areas of most disagreement). Finally, the panelists viewed the performance a second time, submitted their final ratings for that performance, and moved on to the next performance without further discussion.

Using the averages of the panelists’ final ratings and the known scores for BEST Plus performances viewed on a score scale centered at 500, a logistic regression was run to determine the boundary between adjacent SPL levels (e.g., the boundary between SPL 0 and SPL 1). The boundary was set as the point at which there was a 50% probability that this group of judges would award a rating higher than the current SPL. In other words, the boundary between SPL 0 and SPL 1 indicates the score on the BEST Plus scale at which there is only a 50% probability that these judges would award an SPL of 0 to such a performance, rather than an SPL of 1 (or higher). The boundary between SPL 1 and SPL 2 represented the point on the ability scale at which there was a 50% probability the judges would award either an SPL 1 (or lower) or an SPL 2 (or higher) to a performance at that scale score. The boundary points were rounded to the nearest whole number on the scoring scale.
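
A minimal sketch of this boundary-setting logic appears below. It fits a logistic regression of a binary "judged above SPL k" indicator on the scale score and solves for the score at which that probability is 50%. The scores and judgments are hypothetical, and the operational study worked from the panelists' averaged ratings rather than this simplified binary coding.

```python
# Sketch of locating an SPL boundary: the scale score at which there is a 50%
# probability that the judges would rate a performance above SPL k.
# Scores and judgments below are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

scale_scores = np.array([[330], [360], [390], [410], [430], [450], [470], [500]])
judged_above = np.array([0, 0, 0, 0, 1, 1, 1, 1])     # 1 = rated above SPL k

model = LogisticRegression().fit(scale_scores, judged_above)
# P(above) = .50 where intercept + coefficient * score = 0
boundary = -model.intercept_[0] / model.coef_[0][0]
print(f"Estimated boundary between SPL k and SPL k+1: {boundary:.0f}")
```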

The results of this study are presented in Table 23.


Table 23: BEST Plus to SPL Conversion

SPL    Scale Score Range
0      below 330
1      330 to 400
2      401 to 417
3      418 to 438
4      439 to 472
5      473 to 506
6      507 to 540
7      541 to 598
8      599 to 706
9      707 to 795
10     above 795

Because of the relationship between the SPLs and the levels of the National Reporting System (NRS), it is possible to build a table to convert BEST Plus scale scores to NRS levels. The results are shown in Table 24.

Table 24: BEST Plus to NRS Level Conversion

NRS Level Name           Related SPLs    BEST Plus Scale Scores
Beginning ESL Literacy   0-1             below 401
Beginning ESL            2-3             401 to 438
Low Intermediate ESL     4               439 to 472
High Intermediate ESL    5               473 to 506
Low Advanced ESL         6               507 to 540
High Advanced ESL        7 or more       above 540
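
For programs that want to apply these conversions programmatically, the sketch below implements the lookups in Tables 23 and 24. The boundary values are copied from the tables above; the function and variable names are illustrative, not part of any published BEST Plus software.

```python
# Sketch of the conversions in Tables 23 and 24:
# BEST Plus scale score -> SPL -> NRS level name.
from bisect import bisect_right

# Lowest scale score of each SPL from 1 through 10 (scores below 330 are SPL 0).
SPL_LOWER_BOUNDS = [330, 401, 418, 439, 473, 507, 541, 599, 707, 796]

NRS_LEVELS = [
    ("Beginning ESL Literacy", range(0, 2)),    # SPLs 0-1
    ("Beginning ESL", range(2, 4)),             # SPLs 2-3
    ("Low Intermediate ESL", range(4, 5)),      # SPL 4
    ("High Intermediate ESL", range(5, 6)),     # SPL 5
    ("Low Advanced ESL", range(6, 7)),          # SPL 6
    ("High Advanced ESL", range(7, 11)),        # SPL 7 or more
]

def scale_score_to_spl(score: int) -> int:
    """Return the SPL whose Table 23 range contains the scale score."""
    return bisect_right(SPL_LOWER_BOUNDS, score)

def spl_to_nrs(spl: int) -> str:
    """Return the Table 24 NRS level name for an SPL."""
    return next(name for name, spls in NRS_LEVELS if spl in spls)

spl = scale_score_to_spl(520)       # -> 6
print(spl, spl_to_nrs(spl))         # -> 6 Low Advanced ESL
```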

5.2 Norm-Referenced Interpretation: National Percentiles

During the field test of BEST Plus, complete background data were collected from more than 2,000 adult ESL students. These students came from 24 programs in eight states: Delaware, Florida, Illinois, Maryland, Massachusetts, Oregon, Pennsylvania, and Virginia. This represents a national sample of adult ESL students. Some programs may find it helpful to understand the performance of their students in terms of national norms—that is, to compare the performance of their students with the performance of students in other adult ESL programs.

5.2.1 Countries Represented

Students in the sample represented 100 different countries of origin. Table 25 presents each of the countries of origin that is represented by more than 1% of the students; in other words, the study population included more than 20 students from each of the countries listed here. Column 1 lists the country, column 2 the absolute number of students from that country, and column 3 the percentage of students from that country.


Table 25: Country of Origin

Country              Number of Students    Percentage
Mexico               527                   24.6
China                144                    6.7
Colombia             138                    6.4
Haiti                136                    6.4
El Salvador           91                    4.3
Peru                  69                    3.2
Dominican Republic    67                    3.1
Poland                64                    3.0
Korea                 59                    2.8
Vietnam               49                    2.3
Cuba                  48                    2.2
Honduras              46                    2.1
Brazil                44                    2.1
Guatemala             39                    1.8
Puerto Rico           35                    1.6
Argentina             28                    1.3
Cape Verde            27                    1.3
Ukraine               27                    1.3
Russia                25                    1.2
Japan                 23                    1.1
Venezuela             23                    1.1

In addition to the 21 countries in Table 25, the following 79 countries, listed in alphabetical order, each had at least one student in the sample: Afghanistan, Albania, Algeria, Angola, Armenia, Bahamas, Bangladesh, Belarus, Benin, Bolivia, Bosnia, Bulgaria, Burma, Cambodia, Canada, Chile, Congo, Costa Rica, Croatia, Czech Republic, Ecuador, Egypt, Estonia, Ethiopia, France, Gabon, Georgia, Germany, Ghana, Greece, Guinea, Guinea Bissau, Hungary, India, Indonesia, Iran, Iraq, Israel, Jordan, Kazakhstan, Kenya, Kuwait, Kyrgyzstan, Laos, Latvia, Lebanon, Lithuania, Macedonia, Malaysia, Mali, Moldova, Morocco, Nicaragua, Pakistan, Palestine, Panama, Paraguay, Philippines, Portugal, Romania, Saudi Arabia, Serbia, Slovakia, Somalia, South Africa, Spain, Sudan, Switzerland, Syria, Taiwan, Tanzania, Thailand, Turkey, Turkmenistan, United Arab Emirates, Uruguay, Uzbekistan, Yemen, and Yugoslavia.


5.2.2 Languages Represented

Students in the sample represented 80 native languages or language combinations. Table 26 presents the native language(s) represented by 10 or more students in the sample. Column 1 lists the language or language combination, column 2 the absolute number of students from that language background, and column 3 the percentage of students in the full sample represented by that language or language combination.

Table 26: Native Language(s) Spoken by 10 or More Students

Language(s)           Number of Students    Percentage
Spanish               1,195                 55.8
Chinese                  75                  3.5
French                   71                  3.3
Polish                   63                  2.9
Arabic                   60                  2.8
Korean                   60                  2.8
French/Creole            53                  2.5
Russian                  49                  2.3
Vietnamese               45                  2.1
Cantonese/Chinese        42                  2.0
Portuguese               40                  1.9
Creole                   36                  1.7
Cantonese                26                  1.2
French/Criuolo           24                  1.1
Japanese                 23                  1.1
Farsi                    18                  0.8
Turkish                  16                  0.7
Albanian                 11                  0.5
Mandarin                 11                  0.5
Hindi                    10                  0.5
Khmer                    10                  0.5
Thai                     10                  0.5
Russian/Lithuanian       10                  0.5


In addition to the languages in Table 26, the following were the native language(s) spoken by at least one student in the sample: Amharic, Amharic/Arabic, Arabic/French, Arabic/Somali, Armenian, Bangla, Bengali, Bosnian, Bosnian/Croatian, Bulgarian, Chinese/Mandarin, Chinese/Taiwanese/Mandarin, Chinese/Thai, Czech, Estonian, French/Persian, French/Portuguese, French/Portuguese/Cape Verdean Creole, French/Spanish, Ga, German, Greek, Hebrew, Hungarian, Hungarian/Romanian, Indonesian, Italian, Khmer/Vietnamese, Kurdish, Lao, Lithuanian, Macedonian, Malayalam, Mandarin/Cantonese, Portuguese/Creole, Portuguese/Spanish, Romanian, Romansch, Russian/Armenian, Russian/Polish, Russian/Ukrainian, Russian/Ukrainian/Polish, Russian/Yugoslavian, Serbian/Bosnian, Serbian/Chinese/Russian, Slovak, Somali, Somali/Spanish, Spanish/English, Spanish/Mayan, Swahili, Tagalog, Tarasco, Ukrainian, Urdu, Urdu/Vietnamese, and Uzbek.

5.2.3 Gender

Of the 2,120 students for whom gender was reported, 1,222 (57.6%) were female and 898 (42.3%) were male.

5.2.4 Age

Table 27 provides information on the age of the 2,044 students in the sample for whom age was reported.

Table 27: Age of Students

Age Category      Number of Students    Percentage
Under 20 Yrs       97                     4.7
20-25 Yrs         439                    21.5
26-30 Yrs         386                    18.9
31-35 Yrs         296                    14.5
36-45 Yrs         459                    22.5
46-55 Yrs         351                    17.2
55 Yrs or Over     16                     0.8
Total             2,044                 100.0


5.2.5 Time in Program

Table 28 provides information on the length of time students had been in the program at the time of the study, for the 1,984 students for whom this information was reported.

Table 28: Time in Program

Time in Program        Number of Students    Percentage
Less Than 1 Month      277                    14.0
1-2 Months             379                    19.1
3-4 Months             307                    15.5
5-6 Months             271                    13.7
7-12 Months            445                    22.4
13-24 Months           199                    10.0
More Than 24 Months    106                     5.3
Total                  1,984                 100.0


5.2.6 National Percentiles

The following table may be used to understand a student’s performance on BEST Plus in terms of the sample of 2,053 students who participated in the full-scale field test in the summer of 2002. To use the table, find the student’s BEST Plus scale score in the left-hand column. The number in the right-hand column shows the percentage of students in the national sample whose performance was below that of the student in question. For example, suppose a student scored 407 on BEST Plus. According to the table below, that student did better than 18% of the students in the national sample.

Table 29: National Percentiles

BEST Plus Scale Score    National Percentile
below 273                 0
273-295                   1
296-313                   2
314-325                   3
326-339                   4
340-345                   5
346-354                   6
355-361                   7
362-368                   8
369-372                   9
373-377                  10
378-379                  11
380-385                  12
386-390                  13
391-394                  14
395-397                  15
398-400                  16
401-406                  17
407-410                  18
411-413                  19
414-415                  20
416-417                  21
418-421                  22
422-424                  23
425-427                  24
428-429                  25
430-433                  26
434-436                  27
437-439                  28
440-441                  29
442-444                  30
445-446                  31
447-449                  32
450-451                  33
452-454                  34
455-456                  35
457-459                  36
460-462                  37
463-464                  38
465-466                  39
467-469                  40
470-471                  41
472-473                  42
474-476                  43
477                      44
478-480                  45
481-483                  46
484-486                  47
487-490                  48
491-493                  49
494-495                  50
496-497                  51
498-500                  52
501-502                  53
503-505                  54
506-507                  55
508-509                  56
510-513                  57
514-516                  58
517-518                  59
519-521                  60
522-525                  61
526-527                  62
528-530                  63
531-534                  64
535-539                  65
540-542                  66
543-545                  67
546-547                  68
548-551                  69
552-556                  70
557-560                  71
561-563                  72
564-567                  73
568-574                  74
575-577                  75
578-581                  76
582-584                  77
585-589                  78
590-594                  79
595-602                  80
603-609                  81
610-616                  82
617-622                  83
623-626                  84
627-632                  85
633-642                  86
643-650                  87
651-657                  88
658-669                  89
670-675                  90
676-685                  91
686-692                  92
693-706                  93
707-718                  94
719-734                  95
735-756                  96
757-779                  97
780-841                  98
842+                     99
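
A small sketch of this lookup is given below. The band boundaries are copied from Table 29 (the upper end of each score band for percentiles 0 through 98); scores of 842 or above fall at the 99th percentile. The function name is illustrative only.

```python
# Sketch of a lookup against Table 29: given a BEST Plus scale score, return
# the national percentile (the percentage of the field-test sample scoring below it).
from bisect import bisect_left

BAND_UPPER_BOUNDS = [
    272, 295, 313, 325, 339, 345, 354, 361, 368, 372,
    377, 379, 385, 390, 394, 397, 400, 406, 410, 413,
    415, 417, 421, 424, 427, 429, 433, 436, 439, 441,
    444, 446, 449, 451, 454, 456, 459, 462, 464, 466,
    469, 471, 473, 476, 477, 480, 483, 486, 490, 493,
    495, 497, 500, 502, 505, 507, 509, 513, 516, 518,
    521, 525, 527, 530, 534, 539, 542, 545, 547, 551,
    556, 560, 563, 567, 574, 577, 581, 584, 589, 594,
    602, 609, 616, 622, 626, 632, 642, 650, 657, 669,
    675, 685, 692, 706, 718, 734, 756, 779, 841,
]

def national_percentile(scale_score: int) -> int:
    """Return the Table 29 national percentile for a BEST Plus scale score."""
    return bisect_left(BAND_UPPER_BOUNDS, scale_score)

print(national_percentile(407))   # -> 18, matching the worked example above
```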


References

AERA/APA/NCME. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Center for Applied Linguistics. (1984). Basic English Skills Test. Washington, DC: Author.

Center for Applied Linguistics. (1989). Basic English Skills Test test manual. Washington, DC: Author.

Grognet, A. G. (1997). Performance-based criteria and outcomes. The Mainstream English Language Training Project (MELT) updated for the 1990s and beyond. Denver: Spring Institute for International Studies.

Linacre, J. M. (2002). Facets [Computer program]. Chicago: MESA Press.


Phone: 1-866-845-BEST (2378)
Fax: 1-888-700-3629
Email: [email protected]
Web site: www.best-plus.net
Write: Center for Applied Linguistics, Attn: BEST Plus, 4646 40th Street, NW, Washington, DC 20016-1859

www.cal.org

September 2005