
Spaan Fellow Working Papers in Second or Foreign Language Assessment

Volume 1

2003

Edited by

Jeff S. Johnson

Published by

English Language Institute The University of Michigan

TCF Building

401 E. Liberty, Suite 350 Ann Arbor, MI 48104-2298

[email protected] http://www.lsa.umich.edu/eli


First Printing, April, 2003 © 2003 by the English Language Institute, The University of Michigan. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the publisher. The Regents of The University of Michigan: David A. Brandon, Laurence B. Deitch, Olivia P. Maynard, Rebecca McGowan, Andrea Fischer Newman, Andrew C. Richner, S. Martin Taylor, Katherine E. White, Mary Sue Coleman, ex officio


Table of Contents

Introduction

Fleurquin, Fernando
  Development of a Standardized Test for Young EFL Learners

Shin, Sang-Keun
  A Construct Validation Study of Emphasis Type Questions in the Michigan English Language Assessment Battery

Saito, Yoko
  Investigating the Construct Validity of the Cloze Section in the Examination for the Certificate of Proficiency in English

Al-Hamly, Mashael and Coombe, Christine
  An Investigation into Answer-Changing Practices on Multiple-Choice Questions with Gulf Arab Learners in an EFL Context


Welcome to the inaugural issue of the Spaan Fellow Working Papers in Second or Foreign Language Assessment. This annual series makes available the work done by recipients of the year-long University of Michigan Spaan Fellowship for Studies in Second or Foreign Language Assessment (see our website at www.lsa.umich.edu/eli/spaanfellowship). The four articles in this issue are reports from the 2002 Spaan Fellows: Fernando Fleurquin, Sang-Keun Shin, Yoko Saito, and the two-person team of Mashael Al-Hamly and Christine Coombe.

Fernando Fleurquin's paper provides a detailed account of the development of a standardized EFL test for young learners in Uruguay and Argentina. Sang-Keun Shin's study uses both qualitative and quantitative methods to measure the construct validity of the emphasized word listening items in the Michigan English Language Assessment Battery (MELAB). Yoko Saito employs structural equation modeling to explore the construct validity of cloze items in the 1997 Examination for the Certificate of Proficiency in English (ECPE). Finally, Mashael Al-Hamly and Christine Coombe's report looks at the answer-changing behavior of Gulf Arab EFL learners on the multiple-choice test items in the Michigan English Language Institute College English Test—Grammar, Cloze, Vocabulary, Reading (MELICET—GCVR).

Working with the first cohort of Spaan Fellows has been extremely rewarding for me and my colleagues in the Michigan Testing and Certification Division of the English Language Institute, and this first issue of the Spaan Fellow Working Papers is a testament to the high quality of effort and dedication the first group of Spaan Fellows has given to their projects. I would like to thank the 2002 Spaan Fellows for their patience and understanding during the sometimes tedious and perhaps overly lengthy editing and checking (and reediting and rechecking) process I led them through.

The publication of this issue was also greatly facilitated by the help of several others. I thank, first of all, my fellow members of the Spaan Fellowship Committee, John Swales, Sarah Briggs, and, of course, Mary Spaan, for their expertise and insight in a wonderfully successful Spaan Fellow selection process, which made it nearly impossible to publish a set of reports of questionable quality. I also thank Mary, Sarah, and Amy Yamashiro for contributing so much of their valuable time to help mentor one or more of the 2002 Spaan Fellows. I am indebted to Dawne Adam for her dedication and expertise as a superior copyeditor (any remaining errors, which almost seem inevitable, are my responsibility), and to Shelly Dart for her excellent cover design. Finally, I want to especially express my gratitude to Mary Spaan, whose wholehearted and considerate support has made the task of putting together this edition of the Spaan Working Papers a pleasure indeed.

Jeff S. Johnson, Editor


Development of a Standardized Test for Young EFL Learners

Fernando Fleurquin Alianza Cultural Uruguay Estados Unidos

Motivated by the demand for a standardized measure of young learners’ performance in English, the academic team of the Bi-National Center in Uruguay developed a norm-referenced test of American English for students finishing elementary school. The test development process and specifications are explained. Item analysis and descriptive statistics are provided. To further validate the content of the test, teachers evaluated the tasks and level of difficulty of each item, the test content was compared to the course syllabi and core content of currently used textbooks, and it was also compared to other international exams available on the market. Implications for the development of testing instruments for young learners are discussed.

Motivated by the growing need to offer young learners and parents different ways to acknowledge students’ accomplishments in a foreign language, the Bi-National Center (BNC) in Uruguay, the Alianza Cultural Uruguay Estados Unidos, developed an exam to measure the English proficiency of elementary school level learners. The following paper is the result of a research project conducted in collaboration with the English Language Institute, The University of Michigan in Ann Arbor. After piloting and administering the exam, item analysis and content validity studies were performed. Based on the results obtained, the exam was adapted to meet the specific demands of the target population. In this process, important pedagogical and testing implications were discovered that affect the development of new exams for young English as a foreign language (EFL) learners.

The Educational Context and the Role of International Exams in Uruguay

In countries like Uruguay, where English is learned as a foreign language, students do not have as many chances to use the target language outside of class as they would in an English as a second language (ESL) situation. Still, becoming fluent speakers of English has become a necessity, not because speaking English is required for daily communication, but in order to have better opportunities in the labor market and academic life. Parents have also gained a much more active role, determining some of the directions that schools have taken in recent years. They have expressed their interest in having documented proof of their children's achievement in foreign languages, particularly in English. Aware of all these concerns, primary as well as secondary schools have had to complement their curriculum so that their students are better prepared to face the challenges that a university or working life will demand in the future.

Following this trend, the teaching of English has been one of the areas that has changed the most at our schools. The few private schools that did not offer English as part of their curriculum have added English courses. Other schools have increased the number of hours devoted to foreign languages, particularly English. Still others offer extracurricular programs in English as an additional service to students, so that students can attain better levels of proficiency in the second language. With this growing demand for more complete English programs, schools are also hiring new staff or institutions to be in charge of their English Departments. All these changes are motivated by the fact that schools need to offer a distinct and unique service in order to keep or improve their positioning in the market.

One of the differentiating factors among schools has been the kind and number of external tests that are administered throughout the student's life. In recent years, there has been a growing demand for standardized tests that measure students' abilities and knowledge of the target language. In order to show that students' performance in the language transcends local standards, schools have adopted a number of tests that are developed and scored by foreign universities or institutions. It has been mostly British universities that have developed a variety of tests certifying different levels of English proficiency, starting at very early ages in primary school and proceeding through high school. For example, the Cambridge Young Learners English Tests, the Key English Test (KET) and the Preliminary English Test (PET) developed by the EFL Division of the University of Cambridge Local Examinations Syndicate, and the Trinity Grade Examinations in Spoken English are among the tests most frequently adopted by elementary and secondary schools in Uruguay.

Schools that have incorporated American English and culture as part of their curriculum have a variety of instructional materials to use; however, there is a lack of reliable and valid language tests that evaluate the use of American English at elementary levels. The English Language Institute, The University of Michigan in Ann Arbor, offers the Examination for the Certificate of Proficiency in English (ECPE) and the Examination for the Certificate of Competency in English (ECCE). The ECPE and ECCE are the American English examinations, leading to international certificates, that most Bi-National Centers administer throughout the world. So far, however, there is no test of American English developed to measure young learners' beginning and low intermediate levels of language proficiency in an EFL context. To this end, the Alianza Cultural Uruguay Estados Unidos started to develop such a test for 10- to 14-year-old learners who are studying American English as a foreign language at the Bi-National Center or at an affiliated elementary school.

Considerations About Testing Young EFL Learners

All schools need to determine whether their students have achieved the expected objectives, and the evaluation of English outcomes is no exception. The challenge that we have as educators is to find the most appropriate, valid, and reliable assessment instruments for our group of young learners at each stage of a course. These assessment choices will depend on institutional policies and teachers' beliefs about language learning and about the evaluation of the results of the learning process. However, no single assessment instrument will ever be able to meet all assessment purposes. There is an ample variety of instruments to choose from along a continuum that goes from informal daily assessments to formal standardized tests.

O'Malley and Valdez Pierce (1996) define authentic assessment as "the multiple forms of assessment that reflect student learning, achievement, motivation, and attitudes on instructionally-relevant classroom activities" (p. 4). This kind of assessment fosters a direct relationship between instruction and assessment. Performance assessments prompt the student to elaborate some kind of oral or written response by means of oral reports, individual or group projects, writing samples, demonstrations, or open-ended questions. Student portfolios also provide extremely valuable information about both the learning process and the final results. Self-assessment, an indispensable component that can accompany all kinds of assessment, is used to learn about students' involvement, interests and motivation, learning preferences, perceptions of learning, and progress, thus providing feedback from the learner's point of view. In fact, these evaluation instruments not only have testing purposes, but they are also true opportunities to continue learning from the test, generating a positive washback effect. In this way, instruction informs assessment practices and assessment informs instructional practices. It is mainly the teachers who are responsible for the development and implementation of these authentic assessment instruments along the curriculum, and they process the information obtained in order to make new decisions and improve the conditions for instruction so as to obtain even more effective results.

In our school programs, performance assessment instruments are widely favored. These evaluations, contextualized around topics of interest to the young students, have oral and written components, and activities are selected according to students' age, level, and the language knowledge and skills to be tested. Since these achievement tests reflect the specific kind of work done in class, oral communication is evaluated through role plays or scenarios performed in pairs or small groups, and written tasks resemble the kind of written production done in class or at home for realistic purposes. For instance, at the end of every unit of work, students complete a mini-project that rounds up all the objectives of the unit. Every 3 or 4 units, students take a progress test that evaluates their achievement in relation to the course objectives. All the tasks in these evaluation instruments provide excellent opportunities to give students feedback on their performance. Since each group of learners is different from the others, each teacher devises most of these evaluation instruments, which are then edited and approved by the school coordinator.

During an English course of study with young learners, there are clear schedules for achievement and for proficiency tests. Before the course begins, or during the first classes, we carry out placement, diagnostic, or screening tests to match students' levels with the appropriate course and to make the initial decisions on curriculum development. As the course progresses, we carry out a series of progress tests (quizzes, short achievement tests) at regular intervals in order to assess how well students are learning the target language and to inform instruction. By the end of each course or at the end of a series of courses, in addition to final achievement tests, we administer standardized proficiency tests. Even in an educational framework that places such an emphasis on authentic assessment, standardized tests play important roles in the curriculum.
They enable us to compare individual performance with an external norm, to monitor the performance of groups over the year or across years, to evaluate and compare results from different schools, and to evaluate entire English programs. In addition, standardized tests give schools and parents an external measure of their children's learning that provides credibility to the school English program. In an attempt to devise a new instrument to evaluate young students' use of American English in an EFL context, we developed the Alianza Certificate of Elementary Competence in English (ACECE).


Test Development and Administration

Devising a standardized evaluation tool that can be implemented in different schools was no easy task. The following is a description of the origin of the test, the test specifications, and the steps we followed to develop it. The need to develop this test emerged from the constant demand from primary schools for a standardized measure of their students' abilities in American English. Therefore, we started studying the different school programs in terms of curricular design, English course load, texts used, language objectives, classroom and assessment practices, and overall expectations of students' performance at the end of primary school. With this information, we defined the objectives and level of the test.

The ACECE is a standardized test of American English that aims at measuring young learners' abilities to understand and to communicate in English at an elementary level. The term "elementary" has two connotations. In the first place, it refers to the fact that students are studying English as a Foreign Language as one of the subjects at an elementary school. In addition, it describes the level of English that these students can reach in an EFL context with an average of 3 hours of instruction per week. Students who achieve the elementary level of English and pass the ACECE have the following profile: a) they are able to use English to follow a conversation in familiar contexts, to exchange information, and to express needs, likes, dislikes and opinions, b) they can understand authentic oral and written language related to topics of personal interest, c) they can communicate personal information and narrate personal experiences in writing, d) their oral and written expression is generally related to personal, familiar or school contexts, e) their structures are simple, and mistakes can be very frequent, and f) their active vocabulary is limited to certain themes, but their recognition vocabulary is wider.

The target population for which the test was developed studies English at the Alianza or at an affiliated school. Their ages range from 10 to 13, with an average age of 12. The vast majority are finishing the sixth and last grade of elementary school in our country by the time they take this test. These students have studied American English for 4 to 6 years, with an average of 3 hours of English instruction per week, and have completed between 300 and 400 hours of formal instruction in the target language. Some of the textbooks they have used include New Parade (up to levels 5 or 6, Herrera, 2002), Go for It (up to level 3, Nunan, 1999), and The Language Tree (up to level 4, Vale, 2001).

The ACECE is a standardized, norm-referenced proficiency test that measures students' general language proficiency without any link to a particular course of study, textbook, or pre-established syllabus. Since this test was to be implemented at several sites, and eventually in different countries, we needed a standardized test that would be easily administered and scored. When deciding what areas of language to assess and what sections the test would consist of, we took into consideration that at elementary levels, our EFL courses prioritize the development of oral skills and the acquisition of vocabulary, structures, and strategies to communicate personal and general information in contextualized situations in the target language.
In order to cover different areas of students’ overall proficiency, we decided to evaluate foreign language skills that children of this age would need to use in common everyday situations.


In order to select which specific structures and vocabulary to test, we observed students in classes and surveyed the texts that students had used during their learning history. Comparing the observed level of performance with the expected outcomes defined for all final courses, we selected the structures and vocabulary that are necessary for daily communication in familiar contexts at an elementary level. Table 1 presents the resulting lists of grammar structures and categories of vocabulary actively used by students at this level. These structures and vocabulary items make up the core contents of the test.

Table 1. Grammar and Vocabulary Categories Selected for the Elementary Level Test

Grammar structures:
- Verb: to be (present and past)
- Subject and object pronouns
- Question words
- Present progressive tense
- Simple present tense
- Simple past tense
- Future tense: going to, will
- Modals: should (advice)
- Modals: can (ability)
- Possessive adjectives, pronouns: 's
- Demonstrative pronouns
- There is/there are
- Articles, indefinite pronouns: a, some, any, one
- Much, many
- Prepositions of place and time
- Adjectives and adverbs, comparatives and superlatives

Vocabulary categories:
- Classroom objects and actions
- Family members
- Places and shops in a city, jobs
- Parts of a house, furniture
- Health, common symptoms
- Daily activities at home or school
- Meals and drinks
- Animals
- Descriptions of people, clothes, body parts
- Celebrations and holidays
- Weather
- Time, days, months, seasons
- Entertainment
- Planets, solar system

Concerning the testing tasks to use, we considered that students finishing primary school have already acquired higher-order thinking skills and are able to process abstract concepts and perform complex tasks in a foreign language. Since their attention span is still limited, we did not want to include lengthy activities that would require deep concentration. Since the students who would be tested with this instrument were learning English in different institutional contexts (bilingual schools, English courses at a school or at the Bi-National Center), we needed to choose test tasks that would provide information on students' performance regardless of the course they were taking or the school program they were attending. Therefore, we chose the following test tasks: a) multiple-choice items to evaluate grammar, vocabulary, and reading comprehension, b) a cloze test to evaluate overall proficiency, c) a response to a short letter to evaluate freer written communication, and d) an oral interview to evaluate students' understanding, oral communication, and interaction skills for a given situation. Since our students usually take the University of Michigan English Language Institute's Examination for the Certificate of Competency in English when they are in high school, the exposure to standardized test formats that the ACECE gives these young learners will provide them with experience and strategies that can help them perform better on that future test.


The first three components mentioned above make up the objective test. Since we wanted to make sure that multiple-choice items reflect the contexts in which students usually encounter language, each item consists of a short statement or conversation in a familiar context. To simplify the cognitive demands of this kind of test task, we used three options in all multiple-choice items. Table 2 shows samples of grammar and vocabulary items used.

Table 2. Sample Grammar and Vocabulary Items

Grammar section item:
A: _____ you live in a house?
B: No, I don't. I live in an apartment.
(a) Are  (b) Did  (c) Do

Vocabulary section item:
A: What is your favorite _________?
B: I love ice cream.
(a) dessert  (b) drink  (c) lunch

The objective test starts with 20 multiple-choice grammar items and 20 multiple-choice vocabulary items. A cloze passage with five blanks and a five-item reading comprehension section follow.

The written task consists of a response to one of two given topics. Students have to write a short letter or narrative about topics related to their everyday life. With no time limit to do this, they are encouraged to read the instructions carefully and write a few notes to organize their ideas before writing their response. They also have time to re-read and edit what they wrote.

The oral English test has three main sections. Students are evaluated in pairs by an interviewer and their course teacher. During the first two minutes, the interviewer asks some personal warm-up questions to check students' levels of understanding. Next, students are given several pictures that depict a familiar situation. The interviewer asks them to describe everything they can in order to evaluate students' extended speech. Inference questions may be asked to check students' comprehension of ideas that are not explicitly described (e.g., What season is it?). Last, students have 30 to 60 seconds to prepare a role-play about the situation in the picture; here, interaction and conversational skills are evaluated. Guidelines and holistic rubrics were developed to assist evaluators with the different sections of the interviews and to ensure reliable scores. During the oral interview, the following areas are evaluated: comprehension, pronunciation and intonation, accuracy, fluency, and functional language use. Students' performance is evaluated using the provided holistic rubrics (see Table 3).

The objective test sections comprise 50 points of the 100-point test. Students get 10 more points for the written response, and the oral interview accounts for the remaining 40 points. This distribution reflects the approximate weight each of the language skills has in our curriculum for EFL programs at schools: 40% oral skills (listening and speaking), 10% writing, and 50% knowledge of grammar, vocabulary, and overall proficiency. Final scores summarize students' performance on all sections of the oral and written test. Test results are reported in one of five final categories: Honors (98-100%), High Pass (86-97%), Pass (71-85%), Low Pass (60-70%), or Fail (59% or below). Students who pass the exam receive the Alianza Certificate of Elementary Competence in English.
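
To make the weighting and reporting scheme above concrete, here is a minimal sketch in Python that combines the three components and maps the total to a reported category. It is an illustration only, not part of the ACECE materials; the function names and the example scores are invented, while the component maxima (50/10/40) and the reporting bands follow the description above.

```python
def acece_total(objective_correct, writing_points, oral_points):
    """Combine the three ACECE components into a 0-100 final score.

    objective_correct: correct answers on the 50-item objective test
                       (20 grammar + 20 vocabulary + 5 cloze + 5 reading)
    writing_points:    written response score, 0-10
    oral_points:       oral interview score, 0-40 (holistic rubric, Table 3)
    """
    assert 0 <= objective_correct <= 50
    assert 0 <= writing_points <= 10
    assert 0 <= oral_points <= 40
    return objective_correct + writing_points + oral_points


def acece_category(total):
    """Map a 0-100 final score to the reported result category."""
    if total >= 98:
        return "Honors"
    if total >= 86:
        return "High Pass"
    if total >= 71:
        return "Pass"
    if total >= 60:
        return "Low Pass"
    return "Fail"


# Hypothetical examinee: 41/50 objective, 8/10 writing, 33/40 oral interview.
score = acece_total(41, 8, 33)
print(score, acece_category(score))  # 82 Pass
```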


Table 3. Holistic Rubric for Oral Interviews

Score range 36-40. The examinee…
- responds with a flow of related phrases and sentences
- engages in everyday conversation without relying on concrete contextual support
- communicates thoughts effectively and fluently
- uses near-native pronunciation and intonation

Score range 32-35. The examinee…
- responds with related phrases and sentences
- engages in everyday conversation if s/he can rely on concrete contextual support
- communicates thoughts fluently
- uses correct pronunciation and intonation, although a foreign accent may be perceived

Score range 28-31. The examinee…
- responds with phrases and short sentences, although they may not be clearly related
- engages in everyday conversation only if s/he is supported by the examiner
- communicates thoughts with minimum fluency
- uses language marked by a foreign accent

Score range 20-27. The examinee…
- responds with phrases and incomplete sentences
- has difficulty engaging in everyday conversation
- communicates a few thoughts, and grammar inaccuracies hinder comprehension
- commits pronunciation and intonation inaccuracies that hinder communication at times

Score range 1-19. The examinee…
- responds with phrases
- can't engage in everyday conversation
- attempts to express thoughts which are marked by grammar inaccuracies that make speech incomprehensible
- commits serious pronunciation and intonation inaccuracies that hinder communication

Considering that we were evaluating young learners, and knowing that any evaluation generates some degree of anxiety, we decided not to set any time limits for the written test. Students are able to take their time to think about the different items and work at their own pace. We expected oral interviews to take each pair of students an average of 15 minutes, and students to spend between 30 and 60 minutes on the objective test. Each school chooses the best date to implement the test within the month of November, near the end of the school year. Depending on the school, students can take the oral test or the written test first. The oral and written tests are administered on separate days, on previously scheduled dates. In order to reduce students' anxiety to a minimum, each group of students takes the written test in their own class with their regular teacher.

Before we administered the test for the first time, a selection of 30 items (including reading comprehension, cloze, grammar, and vocabulary items) was compiled and given to a group of our target students to pilot the test. We had several reasons to do this. First, we wanted teachers and students to give us their opinions about the content and format of the test. Neither group raised major objections to the format or the test questions; only a few items required editing. Another purpose was to find out whether students were familiar with multiple-choice items and cloze procedures as part of a standardized test. We also wanted to determine how much time students would take with each item and with the entire test. Finally, we wanted to check how clearly instructions were presented in the test, and to observe students' reactions when carrying out each activity.

We then administered the ACECE to sixth graders in two different schools and to a group of young learners who were finishing the sixth year of the Young Learners' Program at the Alianza. Ninety students took this first test in November of 2001. Students took between sixty and ninety minutes to complete the written test in class with their teacher. On the same day in one case and on the following day in the other two cases, students took the oral part of the test, interviewed by an Alianza evaluator and their teacher. BNC staff scored the written tests. Results were published three days after the test administration.

While we were working on the development of the test, we met with Academic Directors and teachers from other Bi-National Centers in Argentina who had expressed a need for this kind of test as a tool to develop the EFL market for young learners. They immediately became interested in this project, which then turned into a regional initiative. The student population they were aiming at was in early high school. Even though their average age was 14 years, their level of English proficiency was similar to our students' expected level of performance. To adapt the level of the test to a high school context, a team in the BNC in Rosario, Argentina, developed a second test for the same year, 2001. A few changes were proposed for the second version of the test: the reading comprehension passage, the cloze test, and the writing section could be related to the same context, and the categories and rubrics for the speaking section were edited. In order to compare students' performance on Tests 1 and 2, we also administered the second test to a sample of 40 students from two of our schools in April 2002.

Test Analysis

The main objective in analyzing the test results was to improve our test development procedure. For that purpose, we analyzed the test format, content, and items, and compared the levels of Tests 1 and 2 to determine whether they were equivalent. Some of these studies were carried out before the administration of the tests, and others once the test results were available.

We used test development checklists and guidelines to analyze the format and content of the entire tests and of each of the items in the objective sections. Brown (1996) and Alderson, Clapham and Wall (1995) provide several of these guidelines and checklists. We used Brown's guidelines to revise the format of the objective test, and of receptive and productive items in particular. As a result of this item format analysis, we found we needed to improve some distractors, correct or eliminate ambiguous or obvious items, reorder some of the options, and clarify the instructions of the writing task so young learners would clearly know what information to include in their responses.

Item analysis statistics were obtained for the objective sections of both tests. The results appear in Table 4. The mean score for Test 1 was higher than for Test 2 (39.144 vs. 35.114), but Test 2 showed a higher dispersion of scores. The range of scores was 20-49 for Test 1 and 17-49 for Test 2, and the standard deviation was 7.075 for Test 1 and 9.420 for Test 2.

Table 4. Item Analysis

Statistic | Test 1 | Test 2
Number of items | 50 | 50
Number of examinees | 90 | 44
Mean | 39.144 | 35.114
Variance | 50.057 | 88.737
Standard deviation | 7.075 | 9.420
Skew | -0.692 | -0.363
Kurtosis | -0.3444 | -0.953
Minimum | 20 | 17
Maximum | 49 | 49
Range | 29 | 32
Median | 41 | 37
Cronbach's alpha | 0.871 | 0.921
Std error of measurement | 2.574 | 2.654
Mean P | 0.783 | 0.702
Mean item-total correlation | 0.367 | 0.453
Mean biserial | 0.560 | 0.638
Max score (low group) | 35 | 27
N (low group) | 26 | 12
Min score (high group) | 45 | 42
N (high group) | 28 | 14

The Item Facility index (IF), also called Item Difficulty, Facility Value, or Proportion Correct, represents the proportion of students who answer the item correctly. Ideal IFs should be centered on .50, with a range from .30 to .70 (Brown, 1996). Table 5 shows the percentage of items in each section of the test falling into three categories: an acceptable range (from .30 to .70), very difficult items (< .30), and very easy items (> .70). In Test 1, only 26% of the items fell within the acceptable range for IF, while the remaining 74% of the items had a very high IF. Test 2 followed a very similar pattern, with 32% of the items in the appropriate range. There were 13 items in Test 1 with an appropriate IF, and 16 such items in Test 2. Comparing the IFs of Tests 1 and 2, the majority of items in all sections of both tests are above .70. According to this information, both tests were relatively easy for our target audience.
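
For readers who want to reproduce this kind of item analysis, the short sketch below computes item facility (proportion correct) and Cronbach's alpha from a matrix of scored 0/1 responses. It is a minimal illustration using a small, invented response matrix; it is not the software or the data used in this study.

```python
# Rows are examinees, columns are items; 1 = correct, 0 = incorrect.
# The matrix is invented for illustration only.
responses = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 1],
]

n_examinees = len(responses)
n_items = len(responses[0])

# Item facility (IF): proportion of examinees answering each item correctly.
item_facility = [sum(row[i] for row in responses) / n_examinees
                 for i in range(n_items)]

def variance(values):
    """Sample variance (n - 1 denominator)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

# Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / variance of totals).
item_variances = [variance([row[i] for row in responses]) for i in range(n_items)]
total_scores = [sum(row) for row in responses]
alpha = (n_items / (n_items - 1)) * (1 - sum(item_variances) / variance(total_scores))

print("IF per item:", [round(f, 2) for f in item_facility])
print("Cronbach's alpha:", round(alpha, 3))
```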


The Item Discrimination index (ID) "indicates the degree to which an item separates the students who performed well from those who performed poorly. These two groups are sometimes referred to as the high and low scorers or upper and lower-proficiency students" (Brown, 1996, pp. 66-67). IDs can range from +1.00 to -1.00. Very good items have IDs equal to or higher than .40, and those between .30 and .39 are "reasonably good but possibly subject to improvement" (Ebel, 1979, cited in Brown, 1996, p. 70). Table 5 also shows the percentage of items from each test that had good IDs (equal to or above .30) and the percentage of items that do not discriminate appropriately between high and low scorers (ID < .29). The percentage of items with IDs above .30 was 64% for Test 1 and 82% for Test 2, which means that the majority of items discriminate well between the two groups. The sections that had poorly discriminating items were grammar (one fourth of the items in this section) and vocabulary (half of the items in this section) in Test 1, and vocabulary (one fourth of the items in this section) and reading comprehension (2 out of 5 items) in Test 2.

Table 5. IFs and IDs for Test 1 and Test 2 Subsections

Test 1 | Cloze | Grammar | Vocabulary | Reading | Total
.30 < IF < .70 (good) | 2% | 16% | 8% | 0% | 26%
IF > .70 (easy) | 8% | 24% | 32% | 10% | 74%
IF < .30 (hard) | 0% | 0% | 0% | 0% | 0%
Total | 10% | 40% | 40% | 10% | 100%
ID > .30 (good) | 6% | 30% | 18% | 10% | 64%
ID < .29 (poor) | 4% | 10% | 22% | 0% | 36%
Total | 10% | 40% | 40% | 10% | 100%

Test 2 | Cloze | Grammar | Vocabulary | Reading | Total
.30 < IF < .70 (good) | 8% | 14% | 10% | 0% | 32%
IF > .70 (easy) | 2% | 24% | 30% | 10% | 66%
IF < .30 (hard) | 0% | 2% | 0% | 0% | 2%
Total | 10% | 40% | 40% | 10% | 100%
ID > .30 (good) | 10% | 36% | 30% | 6% | 82%
ID < .29 (poor) | 0% | 4% | 10% | 4% | 18%
Total | 10% | 40% | 40% | 10% | 100%
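
A companion sketch for the discrimination index contrasts the proportion correct in a high-scoring group with that in a low-scoring group. The response matrix and the top/bottom-third grouping rule are assumptions made for illustration; the exact grouping used in the study is not reproduced here.

```python
# Rows are examinees, columns are items; 1 = correct, 0 = incorrect (invented data).
responses = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]

n_items = len(responses[0])

# Rank examinees by total score and take the top and bottom thirds.
ranked = sorted(responses, key=sum, reverse=True)
group_size = max(1, len(ranked) // 3)
high_group, low_group = ranked[:group_size], ranked[-group_size:]

def proportion_correct(group, item):
    return sum(row[item] for row in group) / len(group)

# ID = proportion correct in the high group minus proportion correct in the low group.
item_discrimination = [proportion_correct(high_group, i) - proportion_correct(low_group, i)
                       for i in range(n_items)]

# Items with ID >= .40 are usually considered very good; .30-.39 reasonably good;
# lower values suggest the item does not separate high and low scorers well.
print([round(d, 2) for d in item_discrimination])
```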

Validation

Henning (1987) defines validity as "the appropriateness of a given test or any of its component parts as a measure of what it is purported to measure" (p. 89). The concept of test validity is therefore directly linked to and inseparable from the population that the test was developed for. This is why, in order to validate the content of the test, we took different and complementary points of view to analyze how our test specifications matched the needs of our population. For this purpose, we compared the test content with the contents of three commercially available textbooks, with our course syllabi for BNC Children's courses, and with commercially available international tests, and we also analyzed the correlations between the two tests and between the subsections within the tests.

Content Validity

The first content validation measure we took was to compare the content of the grammar and vocabulary test sections with the contents of three textbooks used with the target group of young learners in the different school programs. The texts were New Parade (Herrera, 2000), Go For It (Nunan, 1999), and The Language Tree (Vale, 2001). Table 6 shows the results of this comparison. All the target structures and the great majority of the vocabulary categories appeared as specific language points in each of the three textbook series.


Table 6. Language Core Content in the Tests, Current Syllabus, and Textbook Series

Grammar structures | Our syllabus for children's courses | New Parade | Go For It | Language Tree
Verb: to be (present and past) | CH 1, 2, 3 | Yes | Yes | Yes
Subject and object pronouns | CH 1, 2, 3, 4 | Yes | Yes | Yes
Question words | CH 1, 2, 3, 4, 5, 6 | Yes | Yes | Yes
Present progressive tense | CH 1, 2, 3 | Yes | Yes | Yes
Simple present tense | CH 1, 2, 3, 4, 5 | Yes | Yes | Yes
Simple past tense | CH 3, 4, 5, 6 | Yes | Yes | Yes
Future tense: going to, will | CH 4, 5, 6 | Yes | Yes | Yes
Modals: should (advice) | CH 4, 5, 6 | Yes | Yes | Yes
Modals: can (ability) | CH 2, 3, 6 | Yes | Yes | Yes
Possessive adjectives, pronouns: 's | CH 1, 2, 3, 4, 5 | Yes | Yes | Yes
Demonstrative pronouns | CH 1, 2 | Yes | Yes | Yes
There is/there are | CH 2, 3 | Yes | Yes | Yes
Articles, indefinite pronouns: a, some, any, one | CH 1, 2, 3, 4 | Yes | Yes | Yes
Much, many | CH 1, 3 | Yes | Yes | Yes
Prepositions of place and time | CH 1, 2, 6 | Yes | Yes | Yes
Adjectives and adverbs, comparatives and superlatives | CH 1, 2, 3, 4, 5, 6 | Yes | Yes | Yes

Vocabulary categories | Our syllabus for children's courses | New Parade | Go For It | Language Tree
Classroom objects and actions | CH 1, 2, 3 | No | Yes | Yes
Family members | CH 1, 2, 3, 4, 5 | Yes | Yes | Yes
Places and shops in a city, jobs | CH 2, 3, 5 | Yes | Yes | Yes
Parts of a house, furniture | CH 1, 2 | No | Yes | No
Health, common symptoms | CH 2, 3, 4 | Yes | Yes | Yes
Daily activities at home or school | CH 1, 2, 3, 4, 5 | Yes | Yes | Yes
Meals and drinks | CH 2, 3, 4 | Yes | Yes | Yes
Animals | CH 1, 2, 3, 4 | Yes | Yes | Yes
Descriptions of people, clothes, parts of the body | CH 1, 2, 3, 5 | Yes | Yes | Yes
Celebrations and holidays | CH 1, 2, 4, 5, 6 | Yes | Yes | Yes
Weather | CH 3 | Yes | Yes | Yes
Time, days, months, seasons | CH 1, 2, 3 | Yes | Yes | Yes
Entertainment | CH 3, 5, 6 | Yes | Yes | Yes
Planets, solar system | CH 6 | Yes | Yes | Yes

Sixty percent of the structures were explicitly taught during the year when students took the test (question words, simple present tense, simple past tense, future tense with "going to" and with "will," comparatives and superlatives). The rest of the structures had been taught in previous years and were incorporated as part of the textbook or daily input during that same year (to be, present progressive, possessive adjectives, subject and object pronouns, demonstrative pronouns, there is/are, articles, indefinite pronouns, prepositions of time and place). Vocabulary categories can also be divided into two groups: those that had been taught in previous years (e.g., time, colors, symptoms) and those that were part of one or more of the surveyed textbooks used in the current year (e.g., school subjects, personality descriptions, celebrations). Twelve of the 14 vocabulary categories included in the test had been taught during the current year. The other two (classroom objects/actions and parts of a house/furniture) had appeared in lower levels of the same textbook. From this comparison, it was observed that the structures and vocabulary included in the test matched those that had been presented in the three textbooks students had used during the last two years.

Each of our six annual courses for young learners has a specially prepared syllabus. We compared the syllabi for all these courses with the core contents of Tests 1 and 2. Table 6 also shows in which of our young learners' courses the target structures or vocabulary categories were explicitly taught. Grammar structures are continuously recycled, so it is not surprising to find the same structure in 2, 3, or up to 6 courses. The same happens with vocabulary categories. Beyond the initial stages in the course when these elements are explicitly taught, the same structures and vocabulary are incorporated into the input students receive every class. All the grammar and vocabulary categories included in the test were part of the syllabi for the children's series. Two vocabulary topics appeared only once in the six syllabi: the weather and the solar system. Although there is only one unit devoted to each of these two topics along the series of courses, students talk about the weather at the beginning of every single class, so it would be logical to include this information as part of the test. The solar system category will be eliminated in future tests since it may not have been sufficiently developed and not all the textbook series used included the same kind of information about it.

Two additional teachers with ample experience with young learners were asked to analyze each section and all the items of the two tests, study the target language evaluated in each item, and point out in which of our courses students are expected to actively use each of the target structures and vocabulary elements. The first question they answered was how old students should be in order to pass this test. In this regard, they said that students of 9 to 11 years of age would be at the target level for the test. Next, they estimated whether their students in fourth, fifth, or sixth grade in English could answer each of the items. Table 7 shows how many grammar and vocabulary items they estimated students at those grade levels could answer. From their estimation, students in fourth grade would be able to answer 32.5% (Test 1) or 35% (Test 2) of the items correctly. Students in fifth grade would learn new language forms and be able to answer 50% (Test 1) or 47.5% (Test 2) of the grammar and vocabulary items correctly; adding the 32.5% or 35% of the items covered during fourth grade, teachers would expect their fifth grade students to answer 82.5% of the grammar and vocabulary items on both tests correctly. They also indicated that students finishing sixth grade should answer 100% of the items correctly. Therefore, students in either fifth or sixth grade could be prepared to pass these two sections of both tests.

As additional feedback, they considered that the reading passages used to evaluate reading comprehension and the reading items were appropriate for students in sixth grade. They felt that the cloze test would be a very complex task for young students because they are not used to discrete reading comprehension tasks of this sort. However, they considered that the cloze passage could be simplified and used as part of the objective test by editing some of the options. Lastly, they made some suggestions about some of the distractors in the vocabulary section and the reading and cloze passages. Their overall conclusions confirmed that the level of both tests was appropriate for 11- or 12-year-old students who are finishing fifth or sixth grade of elementary school.


Table 7. Teachers' Estimation of the Grade Level Needed to Pass Each Section

Test 1 | Fourth grade | Fifth grade | Sixth grade | Total # of items
Grammar | 4 | 12 | 4 | 20
Vocabulary | 9 | 8 | 3 | 20
Total | 13 | 20 | 7 | 40
Percentage | 32.5% | 50% | 17.5% | 100%

Test 2 | Fourth grade | Fifth grade | Sixth grade | Total # of items
Grammar | 7 | 13 | 0 | 20
Vocabulary | 7 | 6 | 7 | 20
Total | 14 | 19 | 7 | 40
Percentage | 35.0% | 47.5% | 17.5% | 100.0%

We also compared our test with the descriptions of two commercially available tests that are used in other schools at the same levels: Trinity's Grade Examinations in Spoken English and the examinations for young learners from the University of Cambridge. Table 8 shows some of the specifications for the three tests.

Trinity's grade examinations form a series of twelve tests of spoken English that range from a very low level of proficiency (Grade 1) to an advanced (or near-native) level of proficiency (Grade 12). Although this examination only evaluates spoken English, we compared the content and the expected candidate performance for both tests. Trinity's ESOL Syllabus for 2002 (Trinity College London, 2001) describes the candidate's profile, as well as the format, procedure, and assessment criteria for the different levels of these oral exams. Trinity's Grade 5 oral test is the one that is most similar to the target level of our test for young learners; Grade 5 corresponds to the next-to-last course in the elementary stage. In terms of testing procedure, there are similarities and differences between the two tests. Trinity's oral exam takes between 10 and 20 minutes, while ours does not have a fixed time limit. Trinity's exam has four sections: after an introduction, the candidate is asked to present a topic that s/he has prepared, and the examiner asks questions and prompts a discussion on the topic. Our oral evaluation consists of three stages, and students are not asked to prepare a topic in advance. Instead, they are given familiar pictures to describe and improvise an informal conversation based on them. In both tests the examiner asks further questions to stimulate deeper levels of comprehension and communication, and to elicit extended and unassisted discourse from the candidate.

We also compared our test to the battery of tests for young learners prepared by the Local Examinations Syndicate of the University of Cambridge. The Cambridge Young Learners English Tests include three tests aimed at students from 7 to 12 years of age who have taken from 100 hours ("Starters") to 250 hours ("Flyers") of instruction. The other exams developed for young learners are the Key English Test (KET) and the Preliminary English Test (PET). KET aims at the level of proficiency defined as Cambridge Level One, which is reached after 180 to 200 hours of instruction: "Successful candidates have the linguistic ability to satisfy their most basic needs in everyday situations" (University of Cambridge Local Examinations Syndicate, 1998, p. 4). PET aims at Cambridge Level Two, a low intermediate level of proficiency which requires approximately 400 hours of instruction. Our test aims at a level of proficiency similar to or slightly above KET. KET has three components: reading and writing (50% of the test), listening (25%), and speaking (25%). The reading and writing paper has eight parts using multiple-choice, matching, right/wrong, fill-in-gaps, cloze, and writing activities.


Table 8. Comparison of ACECE with Commercially Available Tests for Young Learners

Skills to evaluate
  ACECE: Listening, speaking, reading, writing, grammar, vocabulary
  Trinity Exam of Spoken English: Listening and speaking
  Cambridge KET: Listening, speaking, reading, writing, grammar, vocabulary

Test level(s)
  ACECE: Only one level, at the end of elementary school (6th grade)
  Trinity: Twelve levels, from initial to Grade 12
  KET: Only one level

Test tasks
  ACECE: Multiple choice, cloze test, letter writing, oral interview in pairs
  Trinity: Oral interview
  KET: Multiple choice, matching, gap filling, true/false, complete sentences, note taking, writing, oral interview in pairs

Timing
  ACECE: As long as students need
  Trinity: 10-20 min
  KET: R&W: 1 hr 10 min; L: 25-30 min; S: 8-10 min

Number of items
  ACECE: Grammar: 20 items; Vocabulary: 20 items; Reading: 5 items; Cloze: 5 items
  Trinity: —
  KET: R&W: 56 items; L: 25 items

Weight of test components
  ACECE: Objective test (grammar, vocabulary, reading, cloze): 50%; Writing: 10%; Oral interview: 40%
  Trinity: L&S: 100%
  KET: L: 25%; S: 25%; R&W: 50%

Assessment criteria for oral interviews
  ACECE: Understanding; pronunciation; accuracy; fluency; vocabulary; functional language use
  Trinity: Readiness; pronunciation; usage; focus
  KET: Interactive skill; ability to communicate clearly; accuracy of language use (grammar, vocabulary, pronunciation)

Writing task
  ACECE: Brief response to a letter or short narrative (letter or paragraph format)
  Trinity: —
  KET: Brief message (up to 25 words)

Assessment criteria for writing sections
  ACECE: Organization and format; communication of ideas; accuracy; vocabulary
  Trinity: —
  KET: Communication of ideas; coherence; errors (spelling, grammar, punctuation)

Results
  ACECE: Honors (98%-100%); High Pass (86%-97%); Pass (71%-85%); Low Pass (60%-70%); Fail (<59%)
  Trinity: Pass with Distinction (85%-100%); Pass with Merit (75%-84%); Pass (65%-74%)
  KET: Pass with Merit; Pass; Narrow Fail; Fail


The listening section consists of five parts, using multiple-choice, matching, and note-taking activities. The speaking section has two parts in which students interact, giving and asking for personal and nonpersonal information. The KET Handbook provides a list of 13 topics that students at this level are expected to handle (personal identification; house, home and environment; daily life, work and study; free time, sport and entertainment; travel and holidays; relations with other people; health; shopping; food and drink; services: post office, bank, police, etc.; places; language; weather; p. 8). KET has two passing grades (Pass with Merit and Pass) and two failing grades (Narrow Fail and Fail). The passing level usually lies between 72% and 74% of the overall score (University of Cambridge Local Examinations Syndicate, 1998, p. 5).

Again, there are several similarities and differences between our test and KET. Although the test tasks are different, the level of our test can be equated to that of KET. Students who take KET are expected to complete a wider range of activities using language more actively. However, the written response proposed by KET consists of a 20-25 word message, while in our test students have to produce a much more extended response showing their command of written language and basic writing conventions. The passing grade for KET is higher than for our test (72-74% vs. 60%), but our test places higher demands on students' production (more extended written response, unprepared role plays and oral interaction). We could say that both tests are intended to discriminate the students who have achieved the elementary level of proficiency in the target language from those who have not reached that level, although different assessment criteria are used for each test.

Construct Validity

A group of 44 students took both ACECE tests in order to determine whether the two tests were aimed at an equivalent level of English proficiency. The correlation coefficient between students' scores on the two tests was .70. Table 9 shows the final results of each test for this sample of students.

Table 10 shows the correlations among the four sections of the two tests, taking all items from the same category together. In this case, for example, the five items from the cloze section of Test 1 and the five items from the same section of Test 2 were grouped together. The same was done with the grammar, vocabulary, and reading comprehension sections, thus obtaining 10 cloze items, 40 grammar items, 40 vocabulary items, and 10 reading comprehension items, totaling 100 items between the two tests. We correlated each section with the other sections in this way. The highest correlations we obtained here were between the vocabulary and grammar sections (0.84), followed by the cloze and grammar sections (0.78) and the cloze and vocabulary sections (0.72). The lowest correlations in this category involved the reading comprehension sections (0.64 for reading/cloze, 0.64 for reading/grammar). However, the reading and vocabulary sections showed a better correlation (0.76).

We also compared each of the sections from each test separately (see Table 11). The highest correlation in this category was between grammar and vocabulary from Test 1 (0.82). Grammar from Test 1 showed a high correlation with the grammar and vocabulary sections from Test 2 (0.70 and 0.78, respectively). Both vocabulary sections also had a high correlation coefficient (0.76).
The lowest correlations involved the cloze passages and reading comprehension (0.34 for cloze 1/cloze 2; 0.49 for reading 1/reading 2; 0.36 for reading 1/cloze 2; 0.44 for reading 2/cloze 2). Based on the test content considerations and the statistical test analysis information, we can state that our test for young learners is measuring what it is supposed to measure.


Table 9. Correlation Between Scores on Tests 1 and 2

Student | Score 1 | Score 2
401 | 45 | 47
403 | 40 | 40
405 | 20 | 17
406 | 36 | 26
407 | 47 | 45
408 | 44 | 37
409 | 37 | 31
410 | 36 | 34
411 | 31 | 26
412 | 20 | 30
413 | 49 | 45
414 | 47 | 48
415 | 41 | 19
416 | 43 | 44
417 | 48 | 49
418 | 45 | 43
501 | 45 | 49
502 | 48 | 49
503 | 35 | 33
505 | 31 | 26
506 | 24 | 24
507 | 26 | 21
508 | 40 | 37
509 | 46 | 40
512 | 31 | 34
513 | 43 | 46
514 | 38 | 37
515 | 33 | 31
516 | 27 | 42
518 | 36 | 37
603 | 33 | 20
604 | 45 | 40
605 | 46 | 43
606 | 44 | 42
607 | 44 | 39
608 | 43 | 32
609 | 43 | 19
610 | 43 | 39
611 | 38 | 27
612 | 29 | 17
613 | 42 | 43
614 | 31 | 25
615 | 46 | 36
617 | 36 | 32

Statistics | Test 1 | Test 2
N | 44 | 44
Mean | 38.52 | 35.02
SD | 7.756 | 9.532
Min | 20 | 17
Max | 49 | 49
Range | 29 | 32
Correlation | 0.706

Table 10. Combined Test Section Correlation Indices

 | Cloze test (10 items) | Grammar (40 items) | Vocabulary (40 items) | Reading (10 items)
Cloze test | 1.000
Grammar section | 0.783 | 1.000
Vocabulary | 0.720 | 0.843 | 1.000
Reading | 0.637 | 0.644 | 0.760 | 1.000

Table 11. Section Correlation Indices for Test 1 and Test 2

 | Cloze 1 | Gram 1 | Vocab 1 | Read 1 | Gram 2 | Vocab 2 | Read 2 | Cloze 2
Cloze 1 | 1.000
Gram 1 | 0.596 | 1.000
Vocab 1 | 0.627 | 0.822 | 1.000
Read 1 | 0.556 | 0.625 | 0.681 | 1.000
Gram 2 | 0.513 | 0.700 | 0.541 | 0.348 | 1.000
Vocab 2 | 0.529 | 0.783 | 0.763 | 0.519 | 0.769 | 1.000
Read 2 | 0.647 | 0.537 | 0.591 | 0.490 | 0.574 | 0.700 | 1.000
Cloze 2 | 0.337 | 0.626 | 0.547 | 0.361 | 0.640 | 0.581 | 0.444 | 1.000

covers the core contents of our elementary courses. Second, the test contents match the contents of three different textbooks widely used at the elementary level. Third, expert teachers who surveyed the test agreed that students in fifth or sixth grade would be able to pass it. Last, the test results and IF indices show that students could complete the different sections of the test, and in fact it was relatively easy for our student population. The test items discriminate fairly well between high and low performers. From the comparison with international tests, the level of our test is equivalent to the levels of the Oral Examination Grade 5 from Trinity College and KET from the University of Cambridge Local Examinations Syndicate.

The levels of our own two tests, however, were slightly different. Descriptive statistics show that both tests have a similar central tendency (as shown by the means, 39.144 for Test 1 and 34.114 for Test 2). These results represent final scores of 78% and 70%, respectively, and both correspond to a passing level. However, 70% is the limit of our "Low pass" category, while 78% is in the middle of the "Pass" level. While the two tests both have a wide dispersion, Test 2 has a slightly higher one (SD = 7.08 and range of 29 for Test 1, and SD = 9.42 and range of 32 for Test 2). The correlation coefficients show that there is a positive relationship between the two tests and among the main sections of the objective parts. We can conclude that the two tests are aiming at slightly different levels of proficiency, and adjustments should be made to ensure that students are being tested with the same criteria if we want to use these two tests as equivalent instruments.
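As a minimal sketch of the descriptive comparison above, the snippet below computes the mean, standard deviation, range, and correlation for paired raw totals, assuming a 50-point raw maximum (as the percentages reported here imply). The file and column names are placeholders.

    import pandas as pd

    # Placeholder file: one row per student with the raw total on each form (out of 50).
    scores = pd.read_csv("acece_totals.csv")  # columns: test1, test2

    for col in ["test1", "test2"]:
        s = scores[col]
        # mean, SD, range, and the mean expressed as a percentage of the 50-point total
        print(col, round(s.mean(), 2), round(s.std(), 2), s.max() - s.min(),
              f"{s.mean() / 50:.0%}")

    # Pearson correlation between the two administrations
    print(round(scores["test1"].corr(scores["test2"]), 3))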

Pedagogical and Testing Implications for the Development of Exams for Young Learners

After years of developing classroom-based achievement tests for our young learners, we started a test development project that involved many participants and many stages. From the beginning of this initiative, we were determined to keep track of our process and results in order to improve our final product and to share our conclusions with the worldwide ESL/EFL community. The following are some of the conclusions we reached during this process.

All curriculum design or test development procedures need to start with clear specifications. We limited the objectives and scope of the project to a certain age group of students, to a specific level of proficiency, to certain language content and skills, and to specific school contexts. Each of these aspects was defined in advance; however, while we were developing and administering the test, we had to work out some of those definitions in more detail. For example, students who had joined the English program at the school during the current year were not ready to take this test and were not required to do so. When developing the speaking tasks, we also prepared holistic rubrics to evaluate students' performance, but different interviewers did not agree on the same final evaluations. These rubrics had to be adapted to ensure greater consistency among oral interviewers.

The test objectives and core contents need to be carefully defined. Before preparing actual test items, a team of teachers needs to agree on the objectives and particularly on the level of difficulty that the test will have. Several factors can influence the latter: the kind of tasks selected for the test, the items prepared, the level of accuracy expected, or the cutoff score selected for a passing grade, among others. Knowing the target students' level of proficiency, observing them in class, analyzing some of the textbooks that they use, and
comparing their performance to the expected outcomes for their level (specified in the course syllabi) were some of the steps we took to define the level of this test. After the test was administered, the students' scores, teachers' analysis of the difficulty of each item, comparison with textbook contents, and item analysis validated that the level and content of the test were in the appropriate range for our target population.

Teachers were surprised at the level of performance of some of their students on the test. In some cases, students seemed to talk a lot more during the oral interviews than in their regular class. Others were surprised to find so much information and so few mistakes in the letters that students wrote in a relatively short time. No teacher expressed surprise at the low level of any of the students. Obviously, there may be differences between students' performance on a test and their performance in class. However, the productive sections of the test, where students had to express their own ideas, prompted them to show the best of their performance. Offering appropriate opportunities for students to express their own ideas was a central part of the test, and this goal was fully achieved. A few of the teachers commented that they had not proposed the same kinds of activities (e.g., role plays, letter writing) because they did not consider that the students would be ready at this stage. Being present during the interviews, teachers observed their students groping for words, monitoring and correcting some of their own mistakes, and in most cases, speaking more than they usually did in class. The test, then, had important implications for some of the teachers. They noticed there was a gap between their interpretation of the course objectives and the actual outcome level determined by those same objectives. In this way, by involving course teachers in the testing process, the collaboration contributed to enhancing consistency in the program, unifying criteria and improving the decisions teachers make in the classroom.

Regarding item development, we concluded that preparing multiple-choice items with three options gave us an accurate and reliable measure of our young learners' performance. We obtained reasonably good results using three options to discriminate between the low and high scorers. Students did not show any difficulties with this type of task. Some used the provided answer sheet while others wrote their responses in their test booklet. The only problem we found here was that a few students wrote other words that were not provided in the options. These students may have required some strategy training for this kind of test. For future tests, we are considering adding some visual contextual clues for some of the items in order to make the test more visually appealing and appropriate for students with other styles of perception.

However, in order to control the impact of the objective section on the performance of 12-year-old learners, we decided that the four sections included here (grammar, vocabulary, reading comprehension, and cloze test) would comprise 50% of the total weight of the test, while speaking would account for 40% and writing for the remaining 10%. This was another important decision in the development process, and it had its own washback effect. Knowing the relative weight of the speaking section, students participated more actively in English in class during the weeks before and after the test.
Teachers have mentioned that they will spend much more class time on this skill, and that they will propose new challenges for students to improve their oral communication. In terms of the number of items selected for each section, we have found that 20 items for the grammar and vocabulary sections enabled us to test all the core contents included in the initial specifications. We will keep the same distribution for future tests. An improvement that was suggested by centers in Argentina was to categorize the items around themes and not
linguistic categories. For instance, students will have some multiple-choice items about school, their vacations, and the weather, following a more logical sequence.

On the other hand, we were not satisfied with the responses in the cloze or the reading comprehension sections. There were too few items in each of these sections (5 in each) to obtain reliable statistics. The challenge we found was to select a passage that relates to students' interests, that is simple enough to understand the global idea in a short time, and that is cohesive enough to enable the student to fill in the missing gaps. To overcome this, in Test 2 we gave students a word bank with the words for the entire passage instead of giving them 3 options for each blank, as in Test 1. The majority chose the correct answers, but the correlations of their scores with those in other sections were among the lowest we obtained in the entire test. Similar problems were found with the reading comprehension passage. For future tests, we will explore the need to include other kinds of tasks to engage students and check their reading and writing in a more meaningful and reliable way.

The most important conclusions about editing items were reached after the students took the test. Item Facility and Item Discrimination indices showed us how accurately each item was measuring what we expected. Also, knowing how often distractors had been chosen by the low and high groups helped us in the editing process of items for new exams. Take the following example:

A: _____________ we go to the baseball game?
B: No, you can't. You need to finish your homework.
(a) Should   (b) Can   (c) Will

The item is obviously checking the use of the modal auxiliary can to ask for permission. The item had an IF of .94 and an ID of .20. This means that 94% of the students answered this item correctly. In fact, 1% chose option (a), 94% chose option (b) and 5% chose option (c). Such an easy item was discarded and was not used again. In addition to using a high-frequency question that students are very familiar with, the hint to fill in the blank is provided in the second part of the conversation. These kinds of errors in item writing need to be avoided and should be spotted during the editing process. The following is an example of an item that turned out to be a very good one.

I __________ in 5th grade last year.
(a) were   (b) was   (c) am

Although it seemed like a very easy item at first sight, IF for this item was .71 and ID was .64 for our population. Of the 29% of the students who answered it incorrectly, 8% chose option (a), 20% chose option (c) and 1% wrote a different word on the blank not provided in the options. Of the group of students who got the lowest scores on the test, 12% chose option (a) and 52% chose option (c), thus accounting for the ID of .64. It clearly discriminates between the low and high achievers of this group. In the following example, a different kind of response was obtained: IF was .33 and ID was .13. Only 33% of our population answered it correctly, and it does not discriminate between the high and low scorers.

A: How _____________ popcorn do you want?
B: 2 bags, please.
(a) often   (b) many   (c) much

Sixty-three percent of the high-scoring students answered it incorrectly, choosing option (b), and 24% of the low scorers answered it correctly. The item can be categorized as very ambiguous and misleading. In the first place, students at this age may identify the words how
much or how often as single expressions and not as independent words that can be combined with different meanings. Isolating the words that belong in these expressions may decontextualize the item, creating an additional difficulty. Furthermore, students are taught that how many is used for countable things, and the response of the conversation starts with a number. This could also have misled even the high scorers to choose answer (b). Editing for this item would involve changing the question, the answer, or both. It would be more advisable to leave each of the entire expressions in the options and make sure that ambiguous information was not included in any item.

Once we obtained all the statistics, we could order items on an Excel spreadsheet according to IFs as the first criterion and according to IDs as the second. In this way we could make up a database of items, selecting the most appropriate ones for future use. We could also decide which ones would require some editing, and which ones to discard.

Not setting a time limit for the test had very good results. Observing the students while they took the written test, we noticed that a few finished very quickly and handed in the test in 45 minutes. The rest of the students took up to 95 minutes to complete the written part of the test. Among those who handed it in last were some of the students who got the highest scores in each group. If we had limited their chances to think and respond at their own pace, their results would have been different, and would not have reflected such a high level of performance. The only limitation we found was due to logistical reasons at one of the schools. A different group had to use the same room where students were taking the test, and the students from the first group who needed time to finish had to stop, leave the room, and find another place to continue and finish their test. In future administrations, this needs to be carefully anticipated, and administrators will need to book a room for more time than was initially expected.

When evaluating the written responses, we relied on the rubrics that were developed for this section. We had decided to involve the students' teachers in the scoring, while a second evaluator would also read the papers and score them. In cases in which the first two scorers did not choose the same score for the written response, a third reader was involved. The final scores were decided by the two official evaluators. We found that the holistic rubrics initially proposed were effective for some of the evaluators but not for all. Even after training all the evaluators using samples of students' work, there were disagreements, especially with the lowest and highest categories. In order to solve these differences, in cases where there was a disagreement of +/- 1 point (scale from 1 to 8) between the first two readers, we used a different kind of scale and procedure for the next reading session. Tables 12 and 13 show the two rubrics used. Instead of holistic rubrics, we defined the areas we were evaluating using a four-category analytic scale. The second rubric enabled readers to come to an agreement on the final scores for the writing section much more easily. Since evaluating each writing task with these four categories takes a little more time than with the holistic rubric, we used it only for cases in which there were disagreements and did not adopt it for the initial readings. The evaluation of the oral performance underwent a similar evolution.
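Before turning to the rubrics themselves, the item-bank bookkeeping described above (computing IF and ID for every item and then ordering the pool by those two criteria) can be sketched in a few lines of code. This is illustrative only; the file layout and the conventional 27% upper/lower-group definition are assumptions, not details of the actual spreadsheet procedure we used.

    import pandas as pd

    # Hypothetical matrix of scored responses: one row per student, one 0/1 column per item.
    responses = pd.read_csv("scored_responses.csv", index_col="student_id")

    total = responses.sum(axis=1)
    n_group = max(1, int(len(responses) * 0.27))          # assumed 27% upper/lower groups
    upper = responses.loc[total.sort_values(ascending=False).index[:n_group]]
    lower = responses.loc[total.sort_values().index[:n_group]]

    item_stats = pd.DataFrame({
        "IF": responses.mean(),                # item facility: proportion answering correctly
        "ID": upper.mean() - lower.mean(),     # discrimination: upper-group IF minus lower-group IF
    })

    # Order the bank by facility first and discrimination second, as described above.
    item_bank = item_stats.sort_values(["IF", "ID"])
    print(item_bank)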
We started with holistic rubrics (Table 3). During the interviews, there were also disagreements between interviewers, who expressed that it was not always easy to choose the most appropriate category since students' performance showed elements from different categories. To solve this problem, we prepared another set of rubrics after the first interviews that replaced the original ones. The new rubrics are shown in Table 14. We changed some of the categories to reflect what was actually taking place in the interviews. For example, by adding the concept
of “extended discourse” to the category on fluency, evaluators not only paid attention to students' fluency but also to the amount of information that students generated in the interviews. If students did not sustain the conversation, evaluators had the responsibility to elicit more extended discourse from the candidates, thus ensuring that all students would have the same opportunities to be evaluated, regardless of their natural extroversion or introversion. This change of rubrics was very useful for all interviewers, and these criteria will be incorporated in the course syllabi for all teachers of elementary levels in the coming year.

Table 12. Holistic Rubrics for the Writing Section

Points   Descriptor
9-10     Communicates all the requested information. Grammar, lexis or punctuation mistakes do not interfere with the message.
7-8      Communicates the requested information. Mistakes of grammar, lexis, spelling or punctuation slightly interfere with getting the message across.
5-6      Communicates most of the requested information. Mistakes make the message hard to understand.
3-4      Much of the requested information is not included. Mistakes hinder comprehension.
1-2      Attempts at expressing ideas in writing are unsuccessful.

Table 13. Scale Used for Writing Section Evaluation When Disagreements Occur

Category to evaluate       Possible points   Poor   Good   Very good
Format and organization    2                 0      1      2
Communication of ideas     3                 1      2      3
Grammar                    3                 1      2      3
Vocabulary                 2                 0      1      2
Total                      10

Table 14. New Rubric for Oral Interview

Category to evaluate           Failing   Poor   Good   Very good   Excellent   Total
Understanding                  1         2-3    4-5    6-7         8           1-8
Pronunciation/intonation       1         2-3    4-5    6-7         8           1-8
Accuracy                       1         2-3    4-5    6-7         8           1-8
Fluency & extended discourse   1         2-3    4-5    6-7         8           1-8
Conversational skills          1         2-3    4-5    6-7         8           1-8
Total                                                                          5-40

Future stages of this project involve the development of progressive tests for different levels and age groups of young learners. Although we mainly worked with 12-year-old students, some of the information we obtained in this experience can be extrapolated to 10- to 14-year-old learners. However, if the target population is in the first three years of elementary school, a new kind of test will need to be designed.

An important step that still needs to be taken is to evaluate the face validity of the test with the school authorities that initially requested this kind of standardized evaluation. Involving school directors, assistant directors and even school teachers will increase the impact of the exam in the local communities. Brochures will be prepared to give parents information about the test before it is administered. Parents will receive information on their
children's performance and what they are able to do with the language. Special descriptors of levels of performance will be developed to translate the test score into abilities in the foreign language. In the case of students who fail the exam, special areas of remedial work will be suggested, and students will have the opportunity to take the test again within the following four months so they can catch up with the level of their classmates within the same year.

The most important challenge we faced was related to the number of people actively involved in the administration of the test. Teachers who worked in the same school had radically different criteria for evaluating students' oral or written performances. Even interviewers and readers, who are academic coordinators from the Alianza and have had much more experience with evaluation criteria, came to divergent decisions about the oral and written samples. Working on manuals for teachers and evaluators will be an essential step in order to ensure that every participant in the process has the same information and follows the same procedures. Training sessions for all elementary-level teachers and evaluators will be held during the year. Practice tests will be developed on the basis of older versions of the test, selecting the best items from the IF and ID continuum. In this way, teachers will have more opportunities to develop and refine their own judgment criteria based on the established guidelines.

Last, one of the most important conclusions is related to the way we worked on this project. Only a team of committed teachers and administrators can succeed with this endeavor. Carrying out the initial research, discussing alternatives, piloting partial results, writing and editing items, and deciding the appropriate level of difficulty of a test, among other tasks, cannot be done by a single person. The teachers who work with students of this age need to participate together with the school authorities and the test developers in order to produce a test that meets the needs of the market, places fair demands on students, and actually helps teachers make more effective decisions. In addition, the interaction that this project generated helped all of us grow personally and professionally, and even joined the efforts of teachers, coordinators and directors from different Bi-National Centers, transcending the borders of our country. This is a reward that cannot be easily transmitted via a written report, but we want to encourage others to engage in these collaborative professional endeavors.

Conclusion

We are not judging whether a standardized test of this kind is the most suitable for our young learners. By developing this testing instrument, we responded to the need for an American English test for the international school EFL market. Our test is another instrument that can give information on students' overall level of achievement regardless of their course of study, textbooks, course load or school context. This test was not intended to evaluate achievement or progress during the course, but rather to expand the array of assessment elements and to give an objective and external measure of the students' level in relation to a preestablished reference. In that sense, we believe it has achieved the objective that generated it. A proficiency and summative test of this sort needs to be used in conjunction with other assessment instruments that serve formative purposes. Students in our schools earned a final certificate that they value as a reward for their efforts studying English for many years.

We have developed the first standardized test of American English for young, 10- to 13-year-old learners in Argentina and Uruguay. Starting as a personal and institutional initiative, it soon became a project that involved two different countries, and it has great potential to keep expanding into other markets. We believe that this testing experience was highly successful for all who participated in the process. Students, teachers and school authorities were very satisfied with the final results. Scrutinized from different points of view, the test met our criteria for content validation. Future research on our test development project will involve studies of concurrent and predictive validity. Once common evaluation criteria have been established for all schools, students' grades during the year will be compared to their scores on the test. In addition, students' progress will be monitored during the next three to five years until they take the University of Michigan Examination for the Certificate of Competency in English. Their results on both exams can then be compared.

Acknowledgments

I would like to thank Amy Yamashiro for her outstanding mentoring style, her encouraging words and her multifaceted support; Diane Larsen-Freeman, Joan Morley, Mary Spaan, Barbara Dobson, Sarah Briggs, Jeff Johnson and all the staff of the English Language Institute, The University of Michigan, for their assistance and feedback on this project; and all the school teachers, Alianza coordinators, and students who actively participated in the development or administration of the test. Most of all, I would like to express my deepest gratitude to my wife Fabiana and my two sons, Bruno and Guillermo, for their understanding, support and love.

References

Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge: Cambridge University Press.

Brown, J. D. (1996). Testing in language programs. Upper Saddle River, NJ: Prentice Hall Regents.

Henning, G. (1987). A guide to language testing: Development, evaluation, research. Boston, MA: Heinle & Heinle.

Herrera, M. (2000). New parade. White Plains, NY: Pearson Education.

Nunan, D. (1999). Go for it. Boston, MA: Heinle & Heinle.

O'Malley, J. M., & Valdez Pierce, L. (1996). Authentic assessment for English language learners: Practical approaches for teachers. Reading, MA: Addison Wesley.

Trinity College. (2001). Grade examinations in spoken English for speakers of other languages. London: Trinity College London.

University of Cambridge ESOL Examinations. (n.d.). Common European framework for modern languages. Retrieved January 21, 2003, from http://www.cambridge-efl.org.uk/exam/engfile/efl2.cfm

University of Cambridge Local Examinations Syndicate. (1998). KET Key English Test handbook. Cambridge: University of Cambridge Local Examinations Syndicate.

Vale, D. (2001). The language tree. Oxford, UK: Macmillan Heinemann ELT.


A Construct Validation Study of Emphasis Type Questions in the Michigan English Language Assessment Battery

Sang-Keun Shin University of California, Los Angeles

Taxonomies of listening skills have long recognized the importance of the ability of second language learners to recognize the use of sentence stress within an utterance (Munby, 1978; Richards, 1983). Emphasis type questions in the MELAB represent one of the first attempts to measure this important component of listening ability in the context of standardized testing. This study has considered five types of construct validity evidence concerning the usefulness of this promising item type: item analysis, internal structure of the test, test-taking strategies, content analysis, and prediction of item difficulty. On the basis of this evidence, the paper provides a summary of validation arguments for and against interpreting emphasis items as a measure of sentence stress in a second language.

Taxonomies of listening skills have long recognized the importance of the ability of second language learners to recognize the use of stress in connected speech (Munby, 1978; Richards, 1983). According to Dirven and Oakeshott-Taylor (1984), stress and intonation are more important to word recognition than the actual sounds. Lynch (1998) also suggests that prosodic features have a direct impact on how listeners interpret discourse segments and can carry considerable meaning that supplements the literal meaning of the words. Yet, despite this recognition, we know little about how to assess such ability. Emphasis type questions in the Michigan English Language Assessment Battery (MELAB) represent one of the first attempts to measure this important component of listening ability on a large scale in the context of standardized testing. However, since this task type has not been validated, not much is known about the constructs being measured by these items.

The primary aim of this study, therefore, is to examine the construct validity of emphasis type questions. Because it is the inferences made on the basis of test scores, and the way that they are used, rather than the test itself, that are the object of validation, validation is a process through which a variety of evidence about test interpretation and use is produced and examined (Messick, 1989). Thus, construct validity evidence refers to the judgmental and empirical justifications that do or do not support the inferences made about these items. Since we cannot observe test takers' listening ability directly, we need to examine the construct validity evidence to assess the extent to which components of prosodic competence are responsible for test takers' performance on the emphasis items. Here, the following five types of construct validity evidence will be examined: 1) item analysis, 2) internal structure of the test, 3) test-taking strategies, 4) content analysis, and 5) prediction of item difficulty.


Method

MELAB Listening Section

The listening portion of the MELAB is a tape-recorded segment containing 50 questions which lasts about 25 minutes. All listening items are multiple choice (MC) with three options to choose from. Table 1 presents the overall design of the listening section.

Table 1. The Structure of MELAB Listening Section Items (number of items: Form BB / Form CC)

Short Question: Choose the appropriate answer to a short question (8 / 8)
Statement: Identify the paraphrase of single utterances or short conversational exchanges (7 / 7)
Emphasis: Choose the appropriate response to short expressions articulated with emphasis on particular lexical items; identify how a speaker might continue after emphasizing a certain lexical term (10 / 10)
Lecture: Select appropriate answers to short questions based on a 3-4 minute mini-lecture (13 / 11)
Conversation: Select the appropriate answer to short questions based on an approximately 4-5 minute conversation (12 / 14)

The MELAB has two types of emphasis questions (English Language Institute, The University of Michigan, 2001). As can be seen in Example 1, the first item type involves presenting a question or statement phrased in a certain way, with special emphasis on a particular structure. Test takers are then asked to choose the answer that best corresponds to what the speaker would be most likely to say next.

Example 1:
On the recording, test-takers will hear:
    I need the small red cup,
In their test book, they will read:
    a. not the big one.
    b. not the green one.
    c. not the plate.

The other item type presents test takers with a question that has an emphasized word and asks them to select the most appropriate answer to the question, as can be seen in Example 2.

Example 2:
On the recording, test-takers will hear:
    Do you have John's keys?
In their test book, they will read:
    a. No, but Jane does.
    b. No, I have Jim's.
    c. No, only his bags.


The MELAB Technical Manual (English Language Institute, The University of Michigan, 1996) states that the items that include emphasis focus specifically on the meaning conveyed by suprasegmental or prosodic aspects of a speaker's utterance. The utterance stem is deliberately ambiguous, and this ambiguity can only be resolved by integrating the stress information contained in the pronunciation of the utterance. As such, the emphasis items serve to measure a particularly important performance skill in listening, one that is linked to the capacity to extract underlying meaning from language.

Data

This study analyzed the 20 emphasis items in two forms of the MELAB listening test: Forms BB and CC. The performance of 1,793 test takers was analyzed for Form BB and that of 1,410 for Form CC.

Results

Item Analysis

The converted (or scaled) mean scores of the two forms were 80 and 79, respectively. These values are sufficiently close to the population mean of 77.40, which is reported in the MELAB Technical Manual. The Cronbach's alpha reliability of the two forms was the same: 0.80. Table 2 presents the difficulty and discrimination indices of the emphasis items.

Table 2. Item Difficulty and Discrimination Indices

        Item Difficulty         Item Discrimination
Item    Form BB    Form CC      Form BB    Form CC
16      0.26       0.63         0.25       0.40
17      0.47       0.96         0.32       0.05
18      0.35       0.44         0.35       0.32
19      0.55       0.65         0.26       0.47
20      0.37       0.66         0.37       0.41
21      0.69       0.15         0.23       0.21
22      0.47       0.69         0.32       0.31
23      0.44       0.85         0.45       0.23
24      0.65       0.52         0.51       0.37
25      0.41       0.34         0.54       0.29

As can be seen in Table 2, the item difficulty indices ranged from 0.15 to 0.96. The item facility of item 21 on Form CC was lowest, 0.15, suggesting that this item was extremely difficult. In contrast, the item facility of item 17 on Form CC was 0.96, indicating that this item was extremely easy, at least for this population. All the item discrimination indices were positive, suggesting that those who achieved a high overall score tended to answer all of the items correctly more often than those who had low overall scores on the test. The item discrimination index for item 17 on Form CC was 0.05, indicating that this item did not discriminate well between good and poor listeners, which can probably be accounted for by the high item facility index of the item.
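As a point of reference for the reliability figure reported above, Cronbach's alpha can be computed directly from a matrix of scored item responses. The sketch below is generic and runs on simulated data; it is not the actual MELAB analysis, and the matrix dimensions and pass rate are arbitrary assumptions.

    import numpy as np

    def cronbach_alpha(items: np.ndarray) -> float:
        """Cronbach's alpha for a (students x items) matrix of 0/1 scores."""
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)
        total_var = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Simulated stand-in for a 50-item listening section taken by 200 examinees.
    rng = np.random.default_rng(0)
    demo = (rng.random((200, 50)) < 0.6).astype(int)
    print(round(cronbach_alpha(demo), 2))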


Internal Structure of the MELAB Listening Section

The construct validity of a test can be supported to the extent that the observed dimensionality of response data is consistent with the hypothesized dimensionality of a construct. To test the hypothesis that the emphasis items measure different constructs than other items, an exploratory factor analysis (EFA) was conducted to explore the internal structure of the listening section. Since the item variables were dichotomous data, a matrix of tetrachoric correlations among the variables was produced. A principal components analysis was used to extract the initial factors, and the scree plot and eigenvalues obtained from the initial extractions were examined as initial indications of the number of factors represented by the data. Then, the principal axes were extracted with the number of factors equal to one above and one below the number of factors indicated by the initial principal components extractions. These extractions were rotated to both orthogonal and oblique solutions using varimax and promax rotation procedures, respectively. The final criteria used for deciding the best number of factors to extract were simple structure and meaningful interpretation.

Only the first 25 items were included in the analysis. This was because the input for these items was single sentences, whereas the input for the rest of the items (Items 26-50) was longer than single sentences. Since different input types may result in method rather than trait factors, the items with longer text were excluded from the analysis. The EFA was carried out using MicroFACT (Waller, 2000), and the results of the analysis are presented in Table 3.

As can be seen in Table 3, the two-factor oblique solution seemed the most interpretable, with a meaningful pattern of variable loadings on all extracted factors in both forms.1 For Form BB, 11 out of 15 short question and statement items had high loadings on the second factor, whereas 8 out of 10 emphasis items had high loadings on the first factor. Similarly, for Form CC, 14 short question and statement items had high loadings on the first factor, whereas 9 emphasis items had high loadings on the second factor. We can conclude, therefore, that the emphasis items measure different constructs than the other items, thereby supporting the construct validity of the use of emphasis items.

Test-Taking Strategies

Verbal protocol analysis is used to explore the closeness-of-fit between testers' presumptions about what is being tested and the actual processes that test takers go through to produce acceptable answers (Bachman, 1990). Two English as a second language (ESL) students, Naoko and Jin, provided immediate retrospection of the test-taking processes that they went through while responding to the items. Naoko was an intermediate-high and Jin an advanced ESL student. They listened to the input only once, as in a real test, and then provided retrospection. To examine any differences in their strategy use across item types, they were also asked to tackle the statement and short answer items.

Analysis of the retrospection data showed that both participants employed different strategies according to item type. When they were responding to both the statement and short question items, they processed all of the input. For the emphasis items, however, they focused on emphasized words. As can be seen in Examples 3 and 4, Naoko processed all of the input and, as a result of her comprehension, got these items right.

Example 3: Student 1/Item 5 He asked what the appropriate price for the house was. So the answer is a, not more than 25,000 dollars.


Example 4: Student 1/Item 7 Well, did people stop him when he became famous? The answer is no, but they followed him.

On the other hand, both test takers were able to solve emphasis items by concentrating only on emphasized words. For example, Naoko got item 17 right by processing ‘first’ as can be seen in Example 5. (Emphasized words are presented in bold).

Example 5: Student 1/Item 17 First two paragraphs. Did he say to read the first two paragraphs? I am not sure. But I knew that I had to find out which word was emphasized. So I just tried to catch the emphasized word, and did not pay attention to the rest of the sentence. First two paragraphs, so the answer is c, not the last two.

Table 3. Factor Loadings

        Form BB                  Form CC
Item    Factor 1   Factor 2     Factor 1   Factor 2
1       -0.139      0.689        0.604     -0.086
2        0.032      0.483        0.600     -0.017
3        0.100      0.457        0.505      0.171
4        0.120      0.470        0.535      0.023
5       -0.069      0.614        0.368     -0.003
6       -0.029      0.324        0.567     -0.038
7       -0.035      0.594        0.416      0.071
8        0.102      0.430        0.356      0.248
9       -0.037      0.402        0.522     -0.187
10       0.169      0.284        0.338      0.016
11       0.156      0.141        0.291      0.117
12       0.020      0.308        0.387      0.028
13       0.059      0.386        0.364      0.122
14       0.129      0.115        0.528      0.084
15       0.160      0.308        0.495     -0.121
16       0.490     -0.003        0.162      0.392
17       0.534     -0.051        0.055      0.263
18       0.716     -0.037        0.095      0.616
19       0.256      0.009        0.118      0.621
20       0.446      0.102        0.054      0.636
21       0.182      0.184       -0.034      0.675
22       0.493      0.001        0.216      0.121
23       0.773     -0.054        0.153      0.371
24       0.357      0.333       -0.005      0.491
25       0.731      0.074       -0.202      0.812
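For readers who want to experiment with a comparable two-factor solution, the sketch below runs a factor analysis in Python. It is only a rough stand-in for the procedure reported here: scikit-learn fits a Pearson-based maximum-likelihood factor model with an orthogonal varimax rotation, not the tetrachoric, principal-axis, promax analysis carried out in MicroFACT, and the response matrix shown is simulated rather than the actual MELAB data.

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    # Simulated stand-in for the (test takers x 25 items) matrix of 0/1 responses;
    # in the real analysis this would be the scored responses to items 1-25.
    rng = np.random.default_rng(42)
    responses = (rng.random((1793, 25)) < 0.6).astype(float)

    # Two-factor solution with an orthogonal (varimax) rotation.
    fa = FactorAnalysis(n_components=2, rotation="varimax")
    fa.fit(responses)

    # Loadings: one row per item, one column per factor (cf. the layout of Table 3).
    loadings = fa.components_.T
    for item, row in enumerate(loadings, start=1):
        print(f"Item {item:2d}: {row.round(3)}")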


As shown in Example 6, Jin used the same strategy when responding to emphasis items.

Example 6: Student 2/Item 23
Cotton ribbon, he said something about a typewriter. I didn't exactly understand what he said but 'cotton' was definitely stressed. So the answer is b, not a nylon ribbon, right?

Since Naoko and Jin employed different strategies for different item types, they also missed items of different types for different reasons. For the statement and question type items, they provided wrong answers when they failed to understand all or part of the input. This was because these 'paraphrase recognition' and 'response evaluation' items (Buck, 2001) ask test takers to choose either an option that means the same thing as what they heard or the most appropriate response. In the following example, Naoko had trouble with the first part of the input, that is, 'can't you be more specific,' and thus had to make a semi-educated guess. She selected 'yes, he's doing now' instead of 'no, that's all I know.'

Example 7: Student 1/Item 2

He said something about what he does but I didn’t get the first part. As for the emphasis items, the two students gave wrong answers primarily when they failed to locate emphasized words. For example, in Examples 8 and 9, Naoko had trouble identifying emphasized words and subsequently provided wrong answers to both items.

Example 8: Student 1/Item 18 I heard that she wanted her children to have a double room on the second floor. I understood what she said but failed to decide upon the emphasized word. There were three or four words that were stressed.

Example 9: Student 2/Item 20 The question was about what graduate students should do by next Monday. But was next emphasized or was it graduate? Graduate was prominent but I had to listen to the rest of the sentence to see if another word was more prominent than graduate. And then I got confused. I am not sure which one was more emphasized.

Another noteworthy fact is that even when they misunderstood the input or failed to comprehend the whole sentence, they were still able to answer some items correctly, as long as they managed to locate the emphasized word. In Example 10, for instance, Naoko misunderstood the first intonation unit (Gilbert, 1983).

Example 10: Student 1/Item 20 In the first part, the girl asked : Are you a graduate student ? Since graduate was emphasized, the anwer is c, not undergraduate.

The first part of the question was 'Are graduate students supposed to register..?' Naoko misunderstood this part of the question as 'Are you a graduate student?' However, since she recognized that 'graduate' was emphasized, she selected undergraduate as an answer. In Example 11, Naoko misheard 'meal' as 'mail.' But, since she caught the emphasized word 'pink,' she got this item correct.


Example 11: Student 1/Item 22 The woman said eat pink something before mail.

However, that strategy does not always work. Even though Naoko identified the emphasized words in Example 12, she provided a wrong answer because she misinterpreted the meaning of the verb ‘call’.

Example 12: Student 1/Item 20 The director asked the girl to call the new students. Call the new students, which means call their names. There is no corresponding response among the alternatives. Call their names, not write them. That doesn’t make a sense. There is no appropriate choice to choose.

Jin provided an incorrect response to the following item because she perceived 'my dress' as one chunk.2 Example 13: Student 2/Item

Will my dress be ready by…? Mydress Mydress, What is mydress? Examples 12 and 13 suggest that these items also require using the bottom-up aspect of listening skill.

Are these construct-relevant or construct-irrelevant test-taking processes? Verbal protocol analysis shows that these items do indeed require test takers to pay attention to input stress in order to answer the questions. They had to first locate emphasized words and then infer either what the speaker would say next or the appropriate answer to the question. They missed items when they failed to locate and process the emphasized word. The fact that test takers need to use stress information to infer implied meaning does support the construct validity of these items.

However, verbal protocol data also suggest that the test method itself, primarily the decontextualized input and MC item format, may introduce construct-irrelevant variance. In other words, test takers got certain items right or wrong for the wrong reasons. For some items, test takers did not have to process all of the input because they were able to answer the questions by simply processing the emphasized word. For other items, they were able to choose the correct answer even when they misunderstood the message. In addition, even when they understood all of the input, they sometimes gave wrong answers to certain items because they failed to identify the emphasized words. Are these test-taking strategies part of the communicative strategies that we hope would have an effect on performance in a language test? It is not clear to what extent the strategies that are required for emphasis items fall within the construct definition. It appears that test developers need to clearly define what they intend these items to measure, so that a variety of construct evidence can be examined in light of the definition.

Content Analysis

Content evidence can be gathered by collecting the judgments of experts concerning the precise ability (or abilities) that they believe a test measures. As such, content analysis provides evidence for a hypothesized match between test items or tasks and the construct that the test is intended to measure (Chapelle, 1999). For a systematic content analysis, a hypothesis was postulated concerning the abilities required by the emphasis items based on the 'speech production framework' proposed by Celce-Murcia and Olshtain (2000).
Celce-Murcia and Olshtain propose that listening has both top-down and bottom-up aspects. Top-down listening processes involve activation of both schematic and contextual knowledge. According to them, schematic knowledge is generally thought to comprise two types of prior knowledge: 1) content schemata and 2) formal schemata. Contextual knowledge involves an understanding of the specific listening situation at hand (i.e., listeners assess who the participants are, what the setting is, what the topic and purpose are, etc.). Celce-Murcia and Olshtain also propose that good listeners make use of their understanding of the ongoing discourse or co-text (i.e., they attend to what has already been said and predict what is likely to be said next). The bottom-up level of the listening process involves prior knowledge of the language system: 1) phonology, 2) vocabulary, and 3) grammar. The phonology of a language is often described by linguists in terms of segmental and suprasegmental systems, where segmental refers to individual vowel and consonant sounds and their distribution, while suprasegmental refers to patterns of rhythm (i.e., the timing of syllable length, syllable stress, pauses, and word-level stress patterns) and intonation contours (i.e., pitch patterns, signaling new vs. old information, contrast stress, and contradictions/disagreements) that accompany sound sequences when language is used for oral communication.

A five-point scale was developed on which two ESL instructors rated emphasis items according to the degree to which they required the components of listening comprehension (see Appendix). They first listened to all the items and assessed them according to the scale. They were also interviewed about their perceptions of the usefulness of the emphasis items.

The first instructor, Peter, had 12 years of teaching experience, and his teaching areas included pronunciation and communication strategies. Peter concurred with the construct definition of the emphasis items. He felt that these items do deal with emphatic stress. However, he also pointed out a few limitations of these items. For example, he thought that some target words were overly emphasized and, thus, sounded quite unnatural. His comments include: "There was no secondary stress at all," "They just give away answers," and "This one sounds OK." He pointed out that there was a slight pause before emphasized words in a few of the sentences. He added that test input was limited to the sentence level and, thus, highly decontextualized. He questioned how often students had to process context-free prosodic information in the real world. Peter also suggested that the MELAB failed to assess many other important prosodic components, such as the closed-choice alternative question intonation pattern.

The second instructor, Judith, had 25 years of teaching experience. She currently teaches communication skills and pronunciation in an ESL service course. Her first response to the emphasis items was very positive. She found the items "very creative." She believed that these items may induce a positive washback for test takers and teachers, so that they pay more attention to suprasegmental issues in listening comprehension. However, she was also concerned about test method effects because some students may answer some of the questions correctly by simply processing the emphasized words. Judith also questioned the authenticity of the input.
First, she felt that some sentences failed to sound natural; and, since the sentences are read in a very slow and deliberate manner, they have only a few oral features. Second, even though language use extends over much longer pieces of text than sentences, the input is limited to sentence level. In addition, test takers are not required to relate linguistic information to the context. Judith also pointed out that the fact that test takers may get some items by processing only emphasized words makes it more likely that instruction may focus on test-taking strategies rather than communication.


The two raters' responses to the questionnaire were very similar, and interrater reliability was very high (the Pearson correlation coefficient was .96). Both raters thought that the emphasis items deal with both emphatic and contrastive stress. However, both of them thought that the emphasis items involved only very limited aspects of the top-down listening processes.

Prediction of Item Difficulty

Prediction of item difficulty investigates the extent to which relevant factors affect item difficulty and discrimination, and it provides evidence for the substantive aspect of construct validity by revealing the extent to which hypothesized knowledge and processes appear to be responsible for learner performance. The following three variables were selected from studies investigating the item difficulties of listening comprehension tests, such as those by Brown (1995), Buck (2001), Freedle and Kostin (1996, 1999), and Nissan, DeVincenzi, and Tang (1996).

Variable 1: Length of stimulus. Since test takers hear each input only once, it was presumed that the difficulty of the item would be related to the length of the stimulus. It was thus hypothesized that a longer sentence would be more difficult because examinees would have more language to process. The length was measured in seconds, using Pitchworks (Tehrani, 2000).

Variable 2: Intonation. Since test takers are asked to identify emphasized words in the sentence(s), it was hypothesized that the more prominent the emphasized words were, the easier the items would be. To test the hypothesis, the ratio between the intensity of the syllable in the emphasized word and that of the syllable in the preceding word was estimated. Since acoustic intensity is the appropriate measure corresponding to loudness (Ladefoged, 2001), each sentence was digitized, and the intensity was measured using Pitchworks (Tehrani, 2000). Figure 1 illustrates how the ratio of item 23 was estimated (a rough sketch of this kind of measurement is given after the list of variables below).

Figure 1. Intensity Ratio of Item 23

Variable 3: Content words. Variable 3 was the total number of content words in the sentence, which served as a way of estimating the relative information load of the input. It was hypothesized that the greater the number of content words, the more difficult the items would be.
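As flagged under Variable 2, a rough open-source approximation of the intensity-ratio measurement might look like the sketch below. It assumes the recording has been exported to an audio file and that the start and end times of the two syllables have already been hand-marked; the file name and timestamps are placeholders, and the original study used Pitchworks rather than this procedure.

    import numpy as np
    import librosa

    # Placeholder file and hand-marked syllable boundaries (in seconds).
    y, sr = librosa.load("item23.wav", sr=None)
    emphasized = (1.42, 1.60)   # stressed syllable of the emphasized word
    preceding = (1.18, 1.30)    # syllable of the preceding word

    def rms(signal, rate, start, end):
        """Root-mean-square amplitude of one hand-marked segment."""
        segment = signal[int(start * rate):int(end * rate)]
        return float(np.sqrt(np.mean(segment ** 2)))

    # Intensity ratio of the emphasized syllable to the preceding syllable.
    ratio = rms(y, sr, *emphasized) / rms(y, sr, *preceding)
    print(round(ratio, 2))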


The dependent variable used in this study was an item difficulty index from classical test theory reported in Table 2. Descriptive statistics for the three variables are summarized in Table 4. A multiple regression model was used to investigate the relationship between item difficulty and the three variables, and the results of the regression analysis are reported in Table 5.

Table 4. Descriptive Statistics for Each Measure

Variable        Mean   Std. Deviation   Minimum   Maximum
Length          3.67   0.60             2.80      5.20
Intonation      1.10   0.05             1.00      1.22
Content Words   6.15   1.35             4.00      8.00

Table 5. Regression Table

Model   R       R2      Std. Error of Estimate
1       0.685   0.469   0.157

Variable        B          Std. Error of B   T        Sig. of T
Constant        -2.42      0.811             -2.984   0.009
Length          0.005      0.063             0.079    0.938
Intonation      2.697      0.737             3.66     0.002
Content Words   -0.00159   0.028             -0.058   0.955

The results of the regression analysis show that only one variable, Intonation, had a significant impact on item difficulty. As can be seen in Table 5, the regression coefficient for Intonation is significantly different from 0. The R2 value for the regression equation used is 0.47, suggesting that 47% of the variance in item difficulty could be accounted for by these variables. Even though only 47% of the variance of item difficulty could be explained by these variables, the fact that Intonation, that is, the relative prominence of emphasized words, was the only significant predictor of item difficulty does support the construct validity of the items.3
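The regression summarized in Table 5 could be rerun along the following lines. The code is a sketch only: the data file and column names are placeholders rather than the actual dataset, and the ordinary least squares model simply mirrors the three-predictor design described above.

    import pandas as pd
    import statsmodels.api as sm

    # Placeholder file: one row per emphasis item, with its difficulty index and the three predictors.
    items = pd.read_csv("emphasis_items.csv")  # columns: difficulty, length, intonation, content_words

    X = sm.add_constant(items[["length", "intonation", "content_words"]])
    model = sm.OLS(items["difficulty"], X).fit()

    print(model.rsquared)   # cf. R2 = 0.469 in Table 5
    print(model.params)     # unstandardized B coefficients
    print(model.pvalues)    # significance of each predictor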

Summary

A review of the evidence for construct validity suggests that the use of emphasis items results in tests where certain aspects of listening ability can be inferred. First, item analysis showed that most of the emphasis items successfully discriminated between good vs. poor listeners. Second, the results of EFA revealed that the emphasis items measure different listening skills than other items. Third, the immediate retrospection of the two ESL students indicated that they processed stress information in the input sentences to answer the questions. Fourth, two ESL instructors reported that the emphasis items measured the test takers' ability to process emphatic and contrastive stress at the sentence level. Finally, the prominence of the emphasized word had a significant impact on item difficulty. However, two raters also pointed out that listening to a series of decontextualized questions and statements appears to be highly inauthentic, at least outside of a language test. Table 6 provides a summary of validation arguments both for and against the validity of score-based inferences and use of emphasis items (I owe the idea of a validation table to Chapelle, 1994).


Table 6. Summary of Validation Arguments: Interpretation of the Emphasis Items as a Measure of Prosodic Competence

Item Analysis
  Argues in favor: a wide range of item difficulty; all positive item discrimination indices

Internal Structure of Tests
  Argues in favor: separate dimension for the emphasis items

Test-Taking Strategies
  Argues in favor: different test-taking strategies for the emphasis items (paying attention to stress)
  Argues against: potential sources of construct-irrelevant variance

Content Analysis
  Argues in favor: measuring test takers' ability to process emphatic and contrastive stress
  Argues against: lack of authenticity, primarily due to decontextualized input; construct underrepresentation (omission of other prosodic components)

Predicting Item Difficulty
  Argues in favor: prominence of the emphasized word is a significant predictor of item difficulty

Impact
  Argues in favor: positive washback, raising awareness of the importance of prosody for both teachers and students

Discussion

The primary purpose of this study was to examine the usefulness of emphasis item types for listening comprehension tests. We have considered five sources of construct validity evidence, which largely support the adequacy of interpreting these items as a measure of the ability of test takers to process prosodic information in the stimulus. However, we have also seen that there are potential limitations of the item types, mainly as a result of their decontextualized input and MC format.

Even though this validation study has provided a summary of validation arguments, it has not drawn the conclusion that the emphasis items are valid or invalid. It should be noted that validity is not an all-or-nothing attribute and that validation is an ongoing process. Since the same items can be valid for one purpose but invalid for another, each test user should evaluate the construct validity evidence of emphasis items in accordance with a given purpose and then draw his or her own validation conclusions. It may take time and great pains to design items that assess test takers’ prosodic competence, but we need to remember that such items are of critical importance. This is because prosody is clearly involved at all levels of listening processing, even the processing of single words. The review of construct validity evidence suggests that emphasis



item types show promise of providing very simple measures of emphatic and contrastive stress. Even though some of the oral features of spoken texts are absent, emphasis items may be quite suitable for large-scale, standardized listening comprehension tests.

References

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford, UK: Oxford University Press.

Brown, G. (1995). Dimensions of difficulty in listening comprehension. In D. J. Mendelsohn & J. Rubin (Eds.), A guide for the teaching of second language listening (pp. 59–73). San Diego, CA: Dominie Press.

Buck, G. (2001). Assessing listening. Cambridge, UK: Cambridge University Press.

Celce-Murcia, M., & Olshtain, E. (2000). Discourse and context in language teaching. Cambridge, UK: Cambridge University Press.

Chapelle, C. (1994). Are C-tests valid measures for L2 vocabulary research? Second Language Research, 10(2), 157-187.

Chapelle, C. (1999). Validity in language assessment. Annual Review of Applied Linguistics, 19, 1-19.

Dirven, R., & Oakeshott-Taylor, J. (1984). Listening comprehension (Part I). Language Teaching, 17(4), 326-342.

English Language Institute, The University of Michigan. (1996). MELAB technical manual. Ann Arbor, MI: The University of Michigan.

English Language Institute, The University of Michigan. (2001). MELAB 2000-2001: Information bulletin and registration forms. Ann Arbor, MI: The University of Michigan.

Freedle, R., & Kostin, I. (1996). The prediction of TOEFL listening comprehension item difficulty for minitalk passages: Implications for construct validity (TOEFL Research Report No. 56). Princeton, NJ: Educational Testing Service.

Freedle, R., & Kostin, I. (1999). Does text matter in a multiple-choice test of comprehension? The case for the construct validity of TOEFL’s minitalks. Language Testing, 16(1), 2-32.

Gilbert, J. B. (1983). Pronunciation and listening comprehension. Cross Currents, 10, 53-61.

Ladefoged, P. (2001). A course in phonetics. Fort Worth, TX: Harcourt College.

Lynch, T. (1998). Theoretical perspective on listening. Annual Review of Applied Linguistics, 18, 3-19.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: American Council on Education/Macmillan.

Munby, J. (1978). Communicative syllabus design. Cambridge, UK: Cambridge University Press.

Nissan, S., DeVincenzi, F., & Tang, L. (1996). An analysis of factors affecting the difficulty of dialogue items in TOEFL listening comprehension (TOEFL Research Report No. 51). Princeton, NJ: Educational Testing Service.

Richards, J. C. (1983). Listening comprehension: Approach, design, procedure. TESOL Quarterly, 17(2), 219-240.

Tehrani, H. T. (2000). PitchWorks software for Windows. Los Angeles: Scicon R & D, Inc.

Waller, N. G. (2000). MicroFACT for Windows: Factor analysis for dichotomous and polytomous response data. St. Paul, MN: Assessment Systems Corporation.



Appendix

The Listening Ability Scale

If the ability is not required for successful completion of the task, write “0”; if the ability may be involved, but is not critical to the successful completion of the task, write “1”; if the ability is critical to successful completion of the task, and at a basic level, write “2”; if critical, but intermediate level, write “3”; and if critical and advanced level, write “4”.

0 = Not Required; 1 = Somewhat Involved; 2 = Critical, Basic; 3 = Critical, Intermediate; 4 = Critical, Advanced

1. Vocabulary
2. Listening strategy
3. Background knowledge
4. Grammar
5. Stress on new information
6. Contrastive stress
7. Emphatic stress
8. Lexical stress
9. Content schemata
10. Formal schemata
11. Contextual knowledge
12. Understanding of the on-going discourse
13. Knowledge of the segmental system of English

Notes

1 How large a loading has to be to be judged as important cannot be stated in absolute terms. In the present study, a loading on a factor was considered to be relatively important when it was more than twice the size of another factor loading for the same variable.

2 During the postinterview, she asked what ‘mydress’ meant. When she learned that she misheard ‘my dress’ as one word, she complained that she made a mistake because ‘my’ received too much stress.

3 We could have accounted for more than 47% of the variance in item difficulty if we had included more than three variables. Due to the lack of appropriate measures, however, a few variables were not included in the analysis. For example, the frequency of the emphasized words was considered an important variable to include in the study. However, an academic word frequency list based on spoken corpora was not available.





Investigating the Construct Validity of the Cloze Section in the Examination for the Certificate of Proficiency in English

Yoko Saito

Teachers College, Columbia University

This study addresses the question of the construct validity of the cloze test in the Examination for the Certificate of Proficiency in English (ECPE), which is developed by the English Language Institute, The University of Michigan. Through a rigorous investigation using structural equation modeling (SEM) as the primary statistical procedure, a model composed of two factors, lexico-grammatical ability and reading ability, was confirmed and accepted as the best representation of the data. Furthermore, this study empirically demonstrates that the cloze section of the ECPE measures the form and meaning of grammar. In other words, cloze items appear to measure grammatical knowledge on the sentential and suprasentential levels rather than overall language proficiency. This research also demonstrates the usefulness of structural equation modeling for the examination of construct validity.

Since the introduction of the cloze method in 1953 by Taylor, there have been numerous studies on cloze from both theoretical and methodological perspectives in the field of language testing. In spite of the extensive research carried out to determine the validity and reliability of cloze tests, inconsistent research findings leave the question, “what does cloze measure?” unanswered. For instance, studies on cloze indicate a wide range of reliability estimates from a low of 0.31 to a high of 0.96 (Alderson, 1979; Bachman, 1985; Brown, 1984, 1988; Hinofotis, 1980; Irvine, Atai, and Oller, 1974; Mullen, 1979; Oller, 1972; Oller and Inal, 1971; Stubbs and Tucker, 1974). Correspondingly, various results have been obtained for the validity of cloze tests (Alderson, 1979, 1980; Bachman, 1985; Brown, 1984, 1988; Hanania and Shikhani, 1986; Hinofotis, 1980; Irvine et al., 1974; Mullen, 1979; Oller, 1972; Oller and Conrad, 1971; Oller and Inal, 1971; Stubbs and Tucker, 1974). Many of the cloze validity studies show moderate to high correlations between cloze tests and standardized tests, and with their sub-tests, such as listening comprehension, reading comprehension, writing, and the Foreign Service Institute (FSI) oral interview (e.g., Hinofotis, 1980; Oller, 1973). Based on the high correlations with other criterion measures, past researchers recommended the cloze test as an integrative test of overall proficiency in English as a second language (ESL).

Similarly, Oller (1979) argues that cloze tests assess the “pragmatic expectancy grammar” that underlies language performance. The theory asserts that test-takers would use the same expectancy grammar for completing a cloze test as they would in any other language context. In other words, the reduced redundancy that results from the cloze procedure forces the test-takers to rely upon their knowledge of underlying linguistic rules and also to retain the coherence of the passage to fill in the blanks. Further, constraints imposed by the internalized rule system allow the test-takers to make predictions about the content of the passage (Laesch and van Kleeck, 1987). Based on this theory, Oller (1983) claims that cloze scores could be



interpreted as indicators of “a general language proficiency factor” (p. 3), which could be applied for various testing purposes.

Contrary to the idea that cloze tests are sensitive to constraints beyond clause boundaries, and therefore measure higher order processing abilities, some researchers argue that cloze tests provide a measure of lower order proficiency such as grammar and vocabulary (Alderson, 1979; Markham, 1985; Purpura, 1999; Shanahan, Kamil, and Tobin, 1982). Alderson (1979) examined the effect of certain methodological variables: passage difficulty, deletion ratio, and scoring criteria. The results showed that a change in each methodological variable had a significant impact on the validity of the cloze test. He concluded that if the cloze is sensitive to the changes in deletion rate, then the claim, “cloze measures higher-order skills” becomes questionable. Accounting for its sensitivity to the deletion ratio, the cloze procedure was found to be essentially “sentence-bound” (1979, p. 225), measuring lower-order skills.

A recent study conducted by Purpura (1999) draws a similar conclusion to Alderson (1979), and Shanahan et al. (1982). Purpura investigated the internal structure of the cloze used to assess English language proficiency. Although the cloze produced a high degree of internal consistency reliability, the majority of items appeared to measure lexical meaning, and to a lesser extent, morphosyntactic form. He also found that grammatical meaning and grammatical form were measured by content words and function words, respectively. Furthermore, the modals and logical connectors appeared to measure both grammatical form and meaning. Based on these results, he concluded that “the cloze task was not a single, global measure of language ability; rather, it measured two highly-related, but separate components of grammatical knowledge – grammatical form and meaning – on the sentential and suprasentential levels.” (Purpura, forthcoming, p. 31).

Despite the vast amount of research, unanswered questions remain because the results presented in numerous cloze studies differ extensively from study to study. Bachman (1985) suggests that the inconsistency among the results of these studies may be partially due to the methods used in constructing cloze tests. Cloze performance can be affected by contextual features of the test method, such as scoring systems (Alderson, 1979, 1980; Brown, Yamashiro, and Ogane, 2001), deletion ratio (Abraham and Chapelle, 1992; Alderson, 1979, 1980; Bachman, 1982, 1985; Black, 1993; Farhady and Keramati, 1996; Shanahan et al., 1982), passage difficulty (Alderson, 1979; Brown, 1984; Klein-Braley, 1983; Sasaki, 2000), number of items (Sciarone and Schoorl, 1989), test topic (Alderson and Urquhart, 1985), and method of student response (Abraham and Chapelle, 1992; Bensoussan and Ramraz, 1984; Black, 1993; Storey, 1997). Moreover, the studies vary extensively in the quality of the theories on which they are based, in the clarity and consistency with which they apply theory, in the way sources of variance are managed, and in the way the data obtained are interpreted (Jonz and Oller, 1994). In addition to faulting the differences in test methods, Jonz and Oller argue that some of the earlier studies had serious limitations in research design, such as small sample sizes and a failure to control for the proficiency levels of the test-takers.

In addition to these variables, the choice of statistical procedure for the experimental design may also have affected the results of the research. Because most of the studies were designed in the 1970’s and 80’s, a majority used correlation analyses to determine the validity of the cloze procedure (except Bachman, 1982; Purpura, 1999; Turner, 1989). Now, however, more sophisticated statistical procedures, such as structural equation modeling (SEM), have increasingly gained attention among language testers (Kunnan, 1995; Purpura, 1999; Sasaki, 1993). SEM provides a means of generating models based on substantive theory, in which hypothetical relationships between latent and observed variables can be tested, evaluated, and



modified. If the SEM procedure had been used for the data in the earlier studies, rather than correlational analyses, more precise information on the underlying construct of cloze might have been revealed (Kunnan, 1998). Given the number of difficulties discussed regarding the experimental designs and the appropriate interpretation of results, the ways in which we resolve these issues will certainly provide valuable insights into the investigation of the validity of cloze tests.

Examination for the Certificate of Proficiency in English

This study examines the underlying construct of the cloze test in the Examination for the Certificate of Proficiency in English (ECPE), which is administered by the English Language Institute at The University of Michigan. The ECPE is an advanced-level ESL examination, which is designed to measure the following language abilities: speaking, listening, writing, reading, and lexical grammar. The exam consists of five components: speaking, listening, writing, grammar/vocabulary/reading and cloze. It is assumed that each section of the test measures a separate ability, and that together they determine the English proficiency of the test-takers. A certificate of proficiency is awarded only to those who obtain passing scores on all five sections of the ECPE. Unfortunately, there have not been any recognized empirical studies on the construct validation of each section to show that all five sections measure distinct abilities; thus, it may be possible that two sections of the test measure the same ability.

According to the Michigan Certificate Examinations General Information Bulletin 2001-2002, the ECPE cloze section is intended to assess “… an understanding of the organizational features of written text as well as grammatical knowledge and pragmatic knowledge of English, particularly knowledge about expected vocabulary in certain contexts” (English Language Institute, The University of Michigan, 2002, p. 8). This description seems analogous to the test specification for the grammar/vocabulary/reading (GVR) section. If the cloze section and the GVR section are measuring the same language ability, the test-takers should only have to pass either the cloze or the grammar section, not both. Another solution would be to combine the two sections and reduce the number of subtests in the ECPE battery to four. Either way, an investigation of the construct validity of the cloze section is needed to strengthen the validity of the ECPE.

The purpose of this study is to investigate whether the cloze section of the ECPE merits being a separate section of the ECPE battery. To provide a rationale for the cloze to be (or not to be) an independent section, the underlying trait structure of the cloze section must be compared with the underlying construct of the GVR section.

The current study addresses the following research questions:
1. To what extent do the items in each component (grammar, vocabulary, and reading) perform as a homogeneous group?
2. What is the underlying trait structure of foreign language test performance of English, as measured by the ECPE GVR section?
3. To what extent do the items in the cloze section perform as a homogeneous group?
4. What is the underlying trait structure of foreign language test performance of English, as measured by the ECPE cloze section?
5. What is the relationship between the cloze and the GVR sections?
6. Does the cloze section merit being a separate section of the ECPE battery?



Method

Participants

The data were collected from 79 different test centers all over the world in 1997. This involved 12,468 students of English as a foreign language. The majority of the participants in the study, 75 percent, spoke Greek (N = 9,237). Sixteen percent spoke Portuguese (N = 2,012); and 8 percent spoke Spanish (N = 984). The breakdown of participants by their native language is shown below (Table 1).

With regard to gender, the majority of participants in the study were female (N = 8,175), representing 65.6 percent of the population, while 30.4 percent were male (N = 3,785), and 4.0 percent failed to report the information (N = 507). Also, the majority of participants (72.4%) were 22 years of age or younger. The median and the mean ages were 19 and 21, respectively, with the youngest participant being 7 and the oldest being 87. The data revealed a wide range of ages; however, there is a possibility that some of the age information may not be accurate. Although the participants were asked to darken the circles of the last two digits of the year they were born on the answer sheet, some people may have mistakenly marked the wrong circles.

Table 1. Native Language of Participants
Language     Number   Percent
Arabic           84     0.77
Greek         9,237    74.6
Portuguese    2,012    16.2
Spanish         984     7.9
Others          151     0.66
Total        12,468   100.0

The ECPE Test

The test was developed by the English Language Institute of The University of Michigan (ELI-UM) for advanced-level students. The ECPE Test consists of five sections with 180 selected-response items, one writing task, and one speaking task. The ECPE is designed to measure the test-takers’ English language performance levels in the different areas of language ability. The participants were given 155 minutes to complete all the sections in the exam.

Although all the sections of the ECPE need in-depth investigations of their underlying constructs, this paper examines only the GVR and cloze sections. The GVR section consists of 100 multiple-choice items measuring three types of language abilities: Grammar (40 items), Vocabulary (40 items), and Reading (20 items). The cloze section consists of a total of 40 multiple-choice items in two passages, with 20 items each. Table 2 describes the sections of the ECPE.

Procedures

The ECPE is administered annually at over 130 testing centers in about 25 countries. Writing, listening, cloze, and GVR sections are given during a single administration period, followed by the Interactive Oral Communication Section (IOCS) on a different date. In order to ensure test security and to avoid an unfair advantage to any test-taker, each participant’s identity is checked, and all the test booklets are collected after the exam.



Table 2. Description of the ECPE
Task                                 Time (minutes)   Number of Items
Writing                              30               (1)
Listening                            25               40 Multiple-Choice
Cloze                                25               40 Multiple-Choice
Grammar/Vocabulary/Reading           60
  Grammar (GRAM)                                      40
  Vocabulary (VOC)                                    40
  Reading (READ)                                      20
Interactive Oral Communication       15               (1)

The ECPE answer sheets are distributed first, followed by the test packet. The general instructions are read aloud in English by the test administrator. No questions regarding the test items are answered. For the GVR and cloze sections, the students read the directions and fill in their responses to each item on the provided answer sheet.

Analyses

Prior to the statistical analyses, I labeled the test items according to the section of the test. For the cloze section, “C” was marked in front of the number of the item to indicate “cloze.” Because there are two passages in the cloze, I labeled the first passage A and the latter B. For example, test item 41 was identified as CA41 (cloze section, passage A) and item 80 as CB80 (cloze section, passage B). Similarly, GVR items were marked with G (i.e., G100), V (i.e., V120), and R (i.e., R160), respectively.

Coding of the GV Section

Before statistical analyses were performed, all the GVR and cloze items were coded to determine what these items were measuring. For the grammar and vocabulary items in the GVR section, the coding was based on the model of grammatical ability proposed by Purpura (forthcoming), which provides a theoretical definition of grammar (See Figure 1). According to his model, language ability is primarily composed of two parts: grammatical knowledge and pragmatic knowledge. Grammatical knowledge is divided into two closely related components: grammatical form and grammatical meaning. Each knowledge component is then defined in terms of six subcomponents, including, at the sentential level, phonological or graphological forms/meaning, lexical forms/meaning, and morphosyntactic forms/meaning, and at the suprasentential level, cohesive forms/meaning, information management forms/meaning, and interaction forms/meaning. Using this model, I attempted to categorize the items according to what domain of grammatical knowledge each item was measuring. Although it is necessary to introduce the model in order to classify the items appropriately, describing each component of the model is beyond the purview of this paper. Therefore, I will only focus on the components which appear to be in the GV section of the ECPE: lexical form (LFORM), lexical meaning (LMEAN), morphosyntactic form (MFORM), cohesive form (CFORM), and cohesive meaning (CMEAN).

Figure 1. A Theoretical Definition of Grammar (Purpura, forthcoming, p. 84). The figure contrasts grammar with pragmatics. Grammatical forms and their corresponding grammatical (literal functional) meanings are specified at the subsentential or sentential level (phonological/graphological, lexical, and morphosyntactic forms and meanings) and at the suprasentential or discourse level (cohesive, information management, and interactional forms and meanings). Pragmatic meanings comprise implied contextual, sociolinguistic, psychological, and rhetorical meanings.

According to Purpura (forthcoming), “knowledge of lexical form enables us to understand and produce those features of words that encode grammar rather than those that reveal meaning” (p. 25). These include orthography, part of speech (e.g., happy, happiness), morphological irregularity, word formation (e.g., nightstand; kickoff), countability (e.g., children; people) and gender (e.g., actress) restrictions, co-occurrence restrictions, and formulaic expressions. A co-occurrence restriction occurs when a verb or a transitive adjective is followed by a particular preposition (e.g., depend on X; yield to X) or a given noun phrase is preceded by a particular preposition (e.g., in my opinion) (Celce-Murcia and Larsen-Freeman, 1983). An example of lexical form (LFORM) follows:

Mary gets along _____ her roommates well.
a. with *
b. of
c. for
d. to

In this example, the word along is followed by the preposition with. This is considered the grammatical dimension of lexis, representing a co-occurrence restriction with prepositions (Purpura, forthcoming).

Another component in the model which is closely associated with lexical form is lexical meaning (LMEAN). A difference between the two is that LFORM focuses on the grammatical structure of a word, whereas the LMEAN emphasizes the literal meaning of a word. Consider the following example:



There’s a serious ____ between the two university football teams.
a. competition *
b. bile
c. temper
d. exasperation

All four choices for the blank are nouns, thus this item is not measuring the form of the word. Instead, it is examining whether the test-takers understand the meaning of the word. Competition carries the meaning of rivalry and is the most appropriate choice in this example.

A third component which is often tested in the GV section is morphosyntactic form (MFORM). As the name of the component suggests, it focuses on a morphological and/or syntactic form of the language. The features of morphosyntactic form are the articles, prepositions, pronouns, inflectional affixes (e.g., -ed), derivational affixes (e.g., un-), simple, compound and complex sentences, mood, and voice. Consider the following example:

I had a hard time _____ for the exam this weekend.
a. studying *
b. to study
c. with study
d. study

In this example, a gerund should be included in the blank. By looking at the choices, the test-takers must recognize that studying is a gerund form (–ing form) that functions as a noun. This type of item on the test provides the same word with different alternatives of form in order to measure the test-takers’ ability to understand the appropriate morphosyntactic form of the language.

The last components measured in the exam are cohesive form (CFORM) and cohesive meaning (CMEAN). According to Purpura (forthcoming), “knowledge of cohesive form enables us to use the phonological, lexical and morphosyntactic features of the language in order to interpret and express cohesion on both the sentential and discourse levels” (p. 27). This includes cohesive devices such as logical connectors (e.g., therefore; however), pronoun referents, and ellipses (e.g., so do I; I do too). Purpura further states that CFORM is closely associated with CMEAN through cohesive devices that make connections between cohesive forms and their referential meanings within the linguistic environment. Following is a good example of measuring CFORM and CMEAN:

“I didn’t go to Jane’s party last night.”
“________________.”
a. Neither did I *
b. I don’t either
c. So do I
d. So did I

All four choices are grammatically correct; however, the inverted negative expression, Neither did I, is most appropriate in this context. Test-takers who choose I don’t either as the answer understand the cohesive meaning yet fail to acknowledge the meaning (past tense) difference. On the other hand, selecting So did I as the answer shows awareness of the tense, yet a failure to comprehend the cohesive meaning of the sentence. In summary, this item examines whether the test-taker understands the use of cohesive form and meaning at the discourse level.



In order to determine what each item in the GV section is measuring, a total of eleven doctoral students in language testing at Teachers College, Columbia University, were asked to code the items according to Purpura’s framework. They were given the descriptions of the coding scheme listed above and were asked to classify each item. When they were unsure of the appropriate classification, they were asked to use their best judgment. The students applied the same coding to the cloze sections to investigate whether the GV items and the cloze items measure common traits. Although most of the item coding was consistent among students, there were some items on which the students did not agree. Those items were carefully examined and coded by a professor who specializes in language testing at Teachers College.

As a result, the GV items were classified using three of the five components: 25 items were coded lexical form (LFORM), 28 items lexical meaning (LMEAN), and 27 items morphosyntactic form (MFORM) (see Table 3).

Table 3. Initial Taxonomy of the GV Items in the GVR Section (80 Items)
Component                       Number of Items   Items
Lexical Form (LFORM)            25                81, 82, 88, 91, 102, 103, 105, 110, 115, 116, 125, 126, 134, 138, 140, 142, 145, 146, 149, 151, 154, 155, 156, 159, 160
Lexical Meaning (LMEAN)         28                95, 96, 108, 121, 122, 123, 124, 127, 128, 129, 130, 131, 132, 133, 135, 136, 137, 139, 141, 143, 144, 147, 148, 150, 152, 153, 157, 158
Morphosyntactic Form (MFORM)    27                83, 84, 85, 86, 87, 89, 90, 92, 93, 94, 97, 98, 99, 100, 101, 104, 106, 107, 109, 111, 112, 113, 114, 117, 118, 119, 120

Coding of the R Section

The reading section items were divided into two question types: reading for explicit information and reading for inferential information (Purpura, 1999). For the explicit information (EXP) questions, the participants were asked about specific information in the text. The inferential information (INF) items required the participants to derive meaning not explicitly stated in the text. The coding procedure used for the grammar and vocabulary items was also used for the reading items, resulting in 12 EXP items and 8 INF items, as seen in Table 4.

Table 4. Initial Taxonomy of the Reading Items in the GVR Section (20 Items)
Component                                   Number of Items   Items
Reading for Explicit Information (EXP)      12                161, 162, 164, 165, 167, 168, 172, 173, 175, 178, 179, 180
Reading for Inferential Information (INF)   8                 163, 166, 169, 170, 171, 174, 176, 177



Coding of the Cloze Section

Two coding systems were applied in order to investigate whether cloze tests are sensitive to constraints beyond clause boundaries (Chavez-Oller, Chihara, Weaver, and Oller, 1985; Chihara, Oller, Weaver, and Chavez-Oller, 1977; Fotos, 1991; Jonz, 1990; Oller, 1973; Oller and Conrad, 1971) or only measure sentence-level processing abilities such as grammar and vocabulary (Alderson, 1979, 1983; Markham, 1985; Porter, 1983; Purpura, 1999; Shanahan et al., 1982).

I first employed the coding used by Hale, Stansfield, Rock, Hicks, Butler, and Oller (1988) for their TOEFL cloze study. This coding assumes that cloze items measure higher order processing abilities. The second coding was based on the model of grammatical ability proposed by Purpura (forthcoming); this is the same coding used for the GV section in this study. Contrary to the Hale et al.’s (1988) coding, Purpura’s categorization assumes that cloze items measure grammatical knowledge rather than global language proficiency. The two coding schemes for the cloze items were examined through the following statistical analyses in order to determine which coding more properly measures the underlying construct of the cloze test.

Hale et al.’s (1988) classification was based on the assumption that a cloze test includes skills such as grammar, vocabulary, and reading comprehension. According to the TOEFL study, not only are these skills interrelated in certain respects, but the classification scheme also assumes that reading comprehension is involved to some degree in all items (Hale et al., 1988). They developed a four-category scheme as follows:

1. Reading Comprehension/Grammar (RG)
2. Reading Comprehension/Vocabulary (RV)
3. Grammar/Reading Comprehension (GR)
4. Vocabulary/Reading Comprehension (VR)

The following description of each category with examples is taken from their study. (The following is a partial quotation.)

Reading Comprehension/Grammar (RG)
In this category, the task is one of understanding propositional information at an interclausal level, but answering the question also emphasizes knowledge of syntax (i.e., sequential arrangement and markers of such arrangements) rather than of lexicon.

Example: A ballad is a folk song; however, a folk song is not a ballad ______ it tells a story.
a. because
b. if
c. whether
d. unless

Reading Comprehension/Vocabulary (RV)
In this category, the problem is one of long-range constraints, but a lexical choice is required to solve it. The reader’s task is basically one of understanding the text and getting the propositional information out of elements that may be some distance apart (usually across clause boundaries), yet a lexical choice is also required.

Example: … known as the Lost Sea. It is listed (in) the Guinness Book of World Records as the world’s largest underground ______.
a. water
b. body
c. lake
d. cave

(Parentheses denote another place where a word has been deleted, the correct response here being the word “in.”)

Grammar/Reading Comprehension (GR)
Here the source of item difficulty involves relatively short-range grammatical constraints -- usually a few words on either side of the blank, or within a single grammatical phrase or clause. The item primarily taps knowledge of surface syntax, and reading comprehension is involved primarily because the reader must understand within-clause propositional information.

Example: It is generally understood that a ballad is a song that tells a story, (but) a folk song is not so ______ defined.
a. easy
b. easily
c. ease
d. easier

Vocabulary/Reading Comprehension (VR)
The primary aspect of this category is vocabulary (including idioms and collocations), although it also invokes reading comprehension to the extent of understanding the information presented within clause boundaries. The main source of difficulty, from the examinee’s standpoint, is vocabulary -- not grammar and not the understanding of long-range textual constraints.

Example: In fact, there are folk songs for many occupations -- railroading, ______ cattle, and so on.
a. following
b. mustering
c. concentrating
d. herding

(Hale et al., 1988, pp. 11-12)

A total of four people (three testing experts at the English Language Institute at The University of Michigan and I) participated in coding the items. We used the coding scheme described above to classify each item. Because some items were difficult to classify, we indicated our degree of certainty on a four-point scale for each item (very certain (4), somewhat certain (3), somewhat uncertain (2), and very uncertain (1)). When we did not rate an item as very certain (4), we indicated the other classification(s) the item could receive.

All four judges placed 15 out of the 40 cloze items in the same group, and three judges placed 20 other items in the same group. The fourth judge gave 5 of these 20 items a secondary rating that agreed with the other judges’ primary rating. (For example, three judges rated an item grammar/reading (GR). The fourth judge rated it reading/ grammar (RG); however, s/he marked GR as the second choice.) The remaining 5 items were controversial among judges; two judges picked one category while the other two judges chose different categories. (For



example, two judges rated an item GR, but the other two judges marked it RG and RV. In short, the judges appeared to have some difficulty coding the items.)

Each of the 40 items was placed into a category based on the judges’ ratings. Where three or four judges indicated the same category (35 of the 40 items), the item was assigned to that category. Where two judges marked one category yet the other two judges were split between two other categories, the item was assigned to the first category. Table 5 summarizes the coding of the cloze section.

Table 5. Initial Taxonomy of the Cloze Section Based on the TOEFL Study Coding
Component                  Number of Items   Items
Reading-Grammar (RG)       10                42, 43, 45, 48, 51, 59, 60, 70, 71, 74
Reading-Vocabulary (RV)    12                44, 47, 49, 54, 57, 61, 63, 64, 69, 76, 77, 80
Grammar-Reading (GR)       9                 41, 50, 56, 58, 62, 68, 72, 73, 75
Vocabulary-Reading (VR)    9                 46, 52, 53, 55, 65, 66, 67, 78, 79

The second coding scheme using Purpura’s (forthcoming) model of grammatical ability

produced the following taxonomy of the items in the cloze section (see Table 6). While RG/RV/GR/VR coding uses four components, this coding uses five components: lexical form (8 items), morphosyntactic form (2 items), cohesive form (6 items), lexical meaning (23 items), and cohesive meaning (1 item).

Table 6. Second Taxonomy of the Cloze Section Based on Purpura’s Model
Component                        Number of Items   Items
Lexical Form (LFORM)             8                 49, 52, 53, 54, 60, 66, 68, 73
Morphosyntactic Form (MSFORM)    2                 72, 78
Cohesive Form (CFORM)            6                 41, 51, 58, 62, 70, 75
Lexical Meaning (LMEAN)          23                43, 44, 45, 46, 47, 48, 50, 55, 56, 57, 59, 61, 63, 64, 65, 67, 69, 71, 74, 76, 77, 79, 80
Cohesive Meaning (CMEAN)         1                 42

Descriptive Statistics

To examine the central tendency and dispersion, I calculated descriptive statistics using SPSS Version 10 for the PC. The standard deviations were checked to identify items with no variability. Subsequently, to examine the item distribution, I calculated the kurtosis and skewness of each variable. This allowed for an examination of any potential violations to the assumption of normality. The kurtosis and skewness were expected to range from –3 to +3. If any of the items lay outside the acceptable limits, the items were flagged and further examined to determine whether they could be deleted from the test analyses.
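A minimal sketch of this screening step, assuming a hypothetical DataFrame of dichotomously scored items (the original analysis used SPSS), might look like this:

```python
# Hedged sketch of the item screening described above: mean, SD, skewness, and
# kurtosis per item, flagging items outside the +/-3 range. "ecpe_item_scores.csv"
# is a hypothetical file with one 0/1 column per item.
import pandas as pd

responses = pd.read_csv("ecpe_item_scores.csv")

summary = pd.DataFrame({
    "mean": responses.mean(),
    "sd": responses.std(),
    "skewness": responses.skew(),
    "kurtosis": responses.kurt(),                 # pandas reports excess kurtosis
})
flagged = summary[(summary["skewness"].abs() > 3) | (summary["kurtosis"].abs() > 3)]
print(flagged)                                    # items to inspect before further analyses
```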



Reliability Analysis

To examine consistency of measurement, the internal consistency reliability estimates were calculated to examine the homogeneity of the test items in each component of the coding schemes for the GVR and cloze sections. I performed reliability analyses on the data using SPSS Version 10.1 for the PC to examine (a) how each item correlated with the other items in the component and (b) how the items in each scale performed as a group. The item-total correlations for each item as well as the overall estimate of the scale reliability were investigated. For the GVR and the cloze sections, I used Cronbach’s alpha and the adjusted alpha for the scale if the item was to be deleted. As for the cloze, the reliability estimates may be overestimated because the cloze items may violate the assumption of independence (Bachman, 1990).
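The reliability analyses described here can be approximated with a short script. The sketch below computes Cronbach’s alpha and corrected item-total correlations from a hypothetical DataFrame of 0/1 item scores; it illustrates the statistic itself, not the author’s SPSS procedure.

```python
# Sketch of an internal-consistency check: Cronbach's alpha and corrected
# item-total correlations for one component (e.g., the 40 grammar items).
# "items" is a hypothetical DataFrame of 0/1 scores, one column per item.
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

def item_total_correlations(items: pd.DataFrame) -> pd.Series:
    # correlate each item with the total score computed from the remaining items
    return pd.Series(
        {col: items[col].corr(items.drop(columns=col).sum(axis=1)) for col in items}
    )

# Example use (hypothetical data): print(cronbach_alpha(grammar_items))
```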

Exploratory Factor Analysis

Subsequent to the coding of the items, I performed a series of exploratory factor analyses (EFAs), using Mplus Version 2 for the PC to examine the patterns of correlations among the items within and across each component of the coding schemes for the cloze and GVR sections. In other words, I used EFA to determine whether the items in each component were measuring the same underlying construct and if each component represented an independent construct.

Following Kim and Muller’s (1978) and Purpura’s (1999) procedures for EFA, I followed three steps in performing the EFAs: (1) preparation of the matrix to be analyzed, (2) extraction of the initial factors, and (3) rotation and interpretation. First, for both cloze and GVR, because of the dichotomous nature of the variables I produced a matrix of tetrachoric correlations among the various items. The data were analyzed and evaluated for factor analytic appropriateness. I based all appropriateness decisions on the determinant of the correlation matrix.

With regard to the extraction, I examined the eigenvalues obtained from the initial extraction, which provided a preliminary indication of the number of factors represented by the data. Consequently, these initial extractions together with the theoretical design of the cloze and the GVR section were used to determine the ultimate number of underlying factors to be extracted.

Following the determination of the minimum and maximum number of factors to extract, the extractions were rotated to an orthogonal solution using a varimax rotation and to an oblique solution using a promax rotation (Purpura, 1999). To determine whether to interpret the orthogonal or the oblique solution, I examined the interfactor correlation matrices. In sum, I used simple structure and meaningful interpretation as final criteria for deciding the best number of factors to extract.
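As an illustration of this EFA sequence, the following sketch uses the factor_analyzer package as a stand-in for Mplus. Note that it fits the factors to an ordinary Pearson correlation structure rather than the tetrachoric matrix used in the study, so its loadings would differ from those reported; the data file is hypothetical.

```python
# Illustrative EFA sketch: extract three factors with an oblique (promax)
# rotation and inspect loadings and eigenvalues. Results are only an
# approximation of the study's tetrachoric-based Mplus analysis.
import pandas as pd
from factor_analyzer import FactorAnalyzer

gvr = pd.read_csv("gvr_item_scores.csv")          # hypothetical 0/1 item data

fa = FactorAnalyzer(n_factors=3, rotation="promax", method="minres")
fa.fit(gvr)

loadings = pd.DataFrame(fa.loadings_, index=gvr.columns, columns=["F1", "F2", "F3"])
print(loadings.round(3))                          # inspect for simple structure
print(fa.get_eigenvalues()[0][:10])               # eigenvalues guide the number of factors
```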

Finally, I performed the reliability analyses with the revised taxonomy of the ECPE based on the results of the EFA. Once again, the reliability analyses examined (a) the homogeneity of the items in the cloze and the GVR section and (b) the degree of consistency of each section.

Structural Equation Modeling

The primary statistical procedure used in this study was structural equation modeling (SEM). SEM is a means of representing interrelationships between observed and latent variables and among latent variables based on substantive theory (Purpura, 1999). Each relationship in the model is defined by a set of mathematical equations, and the entire model is empirically tested for overall model-data fit. SEM involves two steps in the analyses:



validating the measurement model, and fitting the structural model (Purpura, 1999). The former is called Confirmatory Factor Analysis (CFA), which examines the hypothesis of linkages between observed variables and their latent variables in the individual measurement model. The latter refers to the procedures for testing the hypotheses of linkages among latent variables. In the current study, I will examine the CFA in order to answer the question of how the items in the GVR and the cloze sections compose the underlying structure of the exam. A flow chart of these procedures is seen in Figure 2.

Figure 2. A Flow Chart of Statistical Procedures Used in this Study (adapted from Purpura, 1999). The procedures were: (1) data preparation (scoring, inputting, checking for missing values, coding the items); (2) descriptive statistics for the GVR and cloze sections (examining central tendencies, checking for normality); (3) reliability analyses for the cloze and GVR sections (examining the homogeneity of the cloze section and of the grammar, vocabulary, and reading components); (4) exploratory factor analysis for the cloze and GVR sections (examining the item clusters, creating composite variables); (5) confirmatory factor analysis for the cloze and GVR sections separately (examining the measurement models); and (6) confirmatory factor analysis for the combined cloze and GVR sections (examining the measurement models).
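To make the CFA step concrete, the sketch below specifies a two-factor measurement model of the kind described above using the semopy package. The composite variable names, the correlated-factor specification, and the data file are assumptions for the example, not the study’s actual model or software.

```python
# Illustrative two-factor CFA sketch (lexico-grammatical ability and reading),
# written in semopy's lavaan-style syntax. All names below are hypothetical.
import pandas as pd
from semopy import Model, calc_stats

data = pd.read_csv("ecpe_composites.csv")    # hypothetical composite scores

desc = """
LexicoGrammar =~ MFORM + LFORM + LMEAN
Reading       =~ READ1 + READ2 + READ3 + READ4
LexicoGrammar ~~ Reading
"""

model = Model(desc)
model.fit(data)
print(model.inspect())        # loadings and factor covariance estimates
print(calc_stats(model))      # fit indices for evaluating model-data fit
```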

Results

The results are discussed in three sections, with each section containing four subsections (descriptive statistics, reliability analysis, exploratory factor analysis, and structural equation modeling). First, the results of the GVR section are presented, followed by the results of the cloze section. Finally, the results of the combined GVR and cloze sections are discussed.

The GVR Section

Descriptive Statistics

First, I analyzed the item-level data from the GVR section based on all 12,468 test-takers (see Appendix A). The means for the grammar section ranged from 0.30 to 0.99, suggesting a wide range of item-difficulty levels. The standard deviation ranged from 0.11 to 0.50. Nine items (G89, G90, G96, G97, G98, G100, G102, G112, G117) had means above 0.91, and the values for skewness and kurtosis of those nine items were beyond +/- 3. However, it is perfectly normal to expect high kurtosis values for these items because the mean values were also extremely high.




The means for the vocabulary section ranged from 0.20 to 0.85, again suggesting a wide range of item-difficulty levels, and the standard deviations ranged from 0.36 to 0.50. All values for skewness and kurtosis were within the accepted limits, indicating that all the items appeared to be normally distributed. Compared to the grammar section, this section contains well-balanced item difficulty levels with no item means over 0.85.

The means for the reading section ranged from 0.50 to 0.98, suggesting a moderate range of item-difficulty levels, and the standard deviations ranged from 0.15 to 0.50. All values for skewness and kurtosis were within the accepted limits except for two items (R161 and R165), indicating univariate normality.

There were many items that did not fall within the accepted limits of the descriptive statistics described in the Method section. For instance, items with a mean value higher than 0.90 had kurtosis and skewness values over the acceptable limit. A range of difficulty levels was necessary in the test, and the items that fell outside the accepted descriptive statistic limits did not appear to cause any substantial threats to normality when the reliability analyses were performed. Hence, these items were kept for the subsequent analyses.

Internal Consistency Reliability for the GVR Section

Reliability analyses were performed to examine the extent to which the items in each component of the coding schemes (grammar, vocabulary, and reading) performed as a homogeneous group and the extent to which the items related to other items in the GVR section of the ECPE.

In answer to the first research question, the results showed that all original sections yielded alphas of 0.50 or more: grammar (α = 0.80), vocabulary (α = 0.74), and reading (α = 0.72). The standard error of measurement for each section was then examined to estimate an average of the distribution of error deviations across each section. The standard error of measurement for grammar, vocabulary, and reading were 2.37, 2.81, and 1.71, respectively.

On the sub-section level, the grammar section had a high internal consistency reliability (over 0.80); however, the alpha of the vocabulary section and the reading section were somewhat moderate. In other words, the items in the grammar section appeared to measure the same construct within the section more than the items in the vocabulary and reading sections did. When the reading items were fixed to 40 items, using the Spearman-Brown Prophecy formula, reliability increased to 0.84. This indicates that when the number of items is held constant to 40 items, READ has the highest internal consistency reliability among the three types of items.
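The Spearman-Brown projection reported here is easy to verify: with the reading section doubled from 20 to 40 items, the corrected reliability is k*alpha / (1 + (k - 1)*alpha) with k = 2. A minimal check in Python:

```python
# Spearman-Brown prophecy formula, applied to the reading section figures
# reported above (alpha = 0.72 for 20 items, projected to 40 items).
def spearman_brown(alpha: float, new_len: int, old_len: int) -> float:
    k = new_len / old_len
    return (k * alpha) / (1 + (k - 1) * alpha)

print(round(spearman_brown(0.72, 40, 20), 2))   # -> 0.84
```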

In order to examine the internal consistency reliability for the GVR section of the ECPE, all 100 items were included in the analyses. The results yielded an alpha of 0.87 with a standard error of measurement of 4.10. The reasonably high alpha of 0.87 suggests that the items in the exam appear to be measuring the same construct: English as a foreign language (EFL) grammar, vocabulary, and reading comprehension test performance.

A summary of the Cronbach’s alpha reliability estimates for internal consistency of the GVR section is presented in Table 7. Although the present reliability analyses provide invaluable information on the homogeneity of the items as well as the degree of consistency of each section, the information on the underlying trait structure of the GVR section remains unknown. In order to answer the second research question (listed above), exploratory factor analysis was performed. It is discussed in the following section.



Table 7. Reliability Estimates for the GVR Section
GVR Section   Number of Items   Reliability Estimate   Corrected Reliability Estimate   Standard Error of Measurement
Grammar       40                0.80                   --                               2.37
Vocabulary    40                0.74                   --                               2.81
Reading       20                0.72                   0.84                             1.71
Total         100               0.87                   --                               --

Exploratory Factor Analysis

To investigate the trait structure of the GVR section of the ECPE, a matrix of tetrachoric correlations using all 100 items was generated in Mplus Version 2. Then a series of EFAs was performed on the GVR section. The following presents a summary of the findings.

I first performed EFAs on all 100 items in the GVR section. These analyses produced a three factor promax rotation that seemed to maximize parsimony and interpretability. Although each component of the GVR was designed to measure three factors, each representing one of the three components identified in the test content specifications (grammar, vocabulary, and reading), many of the vocabulary and grammar items loaded on the same factors. This indicates that some of the vocabulary items were measuring the same trait as many of the grammar items, and vice versa. Table 8 presents the initial 3-factor loadings of the GVR section.

While the grammar and vocabulary items combined in loading on two factors, all 20 reading items loaded on one factor, indicating that they appeared to be measuring one underlying construct. This implies that reading items measure a distinct ability from grammar and vocabulary. Based on the results of the EFAs, the GVR section of the ECPE was considered to measure two underlying construct abilities: reading ability and lexico-grammatical ability (L-G).

The next step was to analyze the GRAM and VOC (GV) items together to examine how the lexico-grammatical ability was measured. The EFAs on the GV items produced a three factor promax rotation that maximized parsimony and interpretability. In the course of these analyses, 37 items (G82, G83, G84, G85, G88, G92, G95, G96, G97, G98, G99, G101, G102, G104, G106, G107, G108, G109, G110, G113, G114, G120, V121, V124, V129, V130, V131, V132, V134, V139, V141, V144, V145, V147, V149, V157, and V159) produced extremely low factor loadings (lower than 0.3) and double loadings. This may be due to the moderate values of reliability for grammar and vocabulary sections (α = 0.80 and α = 0.74, respectively). If the items in each section had been performing more homogeneously, the EFA may not have produced this many items with low factor loadings and double loadings. Because these items distract from investigating the construct validity of the ECPE, 37 items were dropped from further analyses. Some items with a factor loading of a little less than 0.30 were kept because they clearly loaded on a factor. For example, Item V158, with a loading value of 0.276, was kept in the analysis even though it did not make the cut-off line of 0.30. A reason for keeping the item was that the loadings for morphosyntactic form and lexical form were extremely low: 0.095 and –0.074, respectively. In other words, when the factor loadings were compared, Item V158 was clearly measuring lexical meaning. Ultimately, 43 items were retained to measure the three underlying factors. Through a series of EFAs, a promax solution again produced three factors, as seen in Table 9. The factor correlation matrix is shown in Table 10. Based on the Purpura’s grammatical ability model, these three factors are the following.


Table 8. The Initial EFA Results of GVR Section: Promax Rotation Item F1 F2 F3 Item F1 F2 F3 Item F1 F2 F3 G81 0.545 -0.187 0.091 G114 -0.338 0.550 -0.021 V147 0.285 0.197 -0.056 G82 0.248 0.186 0.018 G115 0.731 -0.281 0.054 V148 -0.693 0.819 -0.049 G83 0.468 0.074 -0.089 G116 0.671 -0.117 0.053 V149 0.140 0.086 -0.032 G84 0.625 -0.164 0.016 G117 0.143 0.074 0.188 V150 -0.117 0.558 -0.084 G85 0.373 0.023 0.027 G118 0.025 0.419 0.072 V151 0.239 0.042 -0.073 G86 0.064 0.153 0.044 G119 0.182 0.124 0.168 V152 -0.107 0.401 -0.043 G87 0.497 -0.096 0.068 G120 0.548 -0.028 -0.064 V153 -0.162 0.518 0.193 G88 0.198 0.216 0.180 V121 0.252 0.125 0.037 V154 0.499 -0.109 0.072 G89 -0.025 0.361 0.113 V122 -0.055 0.466 -0.009 V155 0.343 -0.061 -0.129 G90 0.192 0.048 0.219 V123 -0.088 0.539 0.001 V156 0.301 -0.053 0.192 G91 0.539 -0.150 0.075 V124 -0.111 0.366 -0.094 V157 0.371 -0.033 0.063 G92 0.358 0.083 0.109 V125 0.255 0.055 -0.048 V158 -0.061 0.298 0.033 G93 0.162 0.168 0.109 V126 0.314 0.022 0.073 V159 0.471 -0.017 0.134 G94 0.313 0.277 0.007 V127 0.048 0.444 -0.037 V160 0.602 -0.330 0.128 G95 0.285 0.109 -0.066 V128 0.077 0.432 -0.049 R161 0.093 -0.087 0.370 G96 0.304 -0.043 0.183 V129 0.331 0.041 0.012 R162 -0.011 -0.015 0.363 G97 0.698 -0.142 0.117 V130 0.281 0.055 -0.068 R163 0.144 -0.063 0.459 G98 0.252 0.013 0.167 V131 0.162 0.276 -0.062 R164 -0.027 0.042 0.387 G99 0.035 0.367 0.031 V132 0.209 0.176 0.002 R165 0.150 -0.090 0.571 G100 0.295 0.096 0.098 V133 0.017 0.390 0.023 R166 -0.007 0.072 0.449 G101 0.239 0.051 0.178 V134 -0.388 0.535 -0.017 R167 -0.058 0.211 0.367 G102 0.462 -0.135 0.142 V135 -0.062 0.319 0.282 R168 -0.003 0.040 0.400 G103 0.801 -0.384 0.114 V136 -0.258 0.603 -0.022 R169 -0.124 0.119 0.390 G104 0.600 -0.043 0.076 V137 0.223 0.291 0.102 R170 -0.155 0.173 0.534 G105 0.448 0.066 -0.053 V138 0.643 -0.144 -0.058 R171 0.028 -0.064 0.556 G106 0.576 -0.140 0.037 V139 0.345 0.145 -0.027 R172 -0.008 0.015 0.557 G107 0.390 0.228 0.007 V140 0.529 -0.053 -0.107 R173 0.031 0.023 0.409 G108 0.507 -0.061 0.085 V141 0.340 0.215 -0.037 R174 0.095 -0.128 0.418 G109 0.487 -0.113 -0.015 V142 0.442 0.049 -0.023 R175 0.165 -0.131 0.543 G110 0.148 0.193 0.065 V143 -0.114 0.360 0.115 R176 0.094 0.051 0.371 G111 0.283 0.280 -0.037 V144 0.344 0.215 -0.073 R177 0.041 0.070 0.452 G112 0.208 0.201 0.146 V145 0.260 0.105 -0.103 R178 0.003 0.062 0.502 G113 0.420 0.308 -0.050 V146 0.577 -0.061 -0.014 R179 -0.045 0.104 0.461 R180 0.135 -0.037 0.371



Table 9. EFA Results of GV Section: Promax Rotation
Item   Code   F1: MF    F2: LF    F3: LM
G118   MF      0.525    -0.163     0.177
G93    MF      0.505    -0.015    -0.023
G112   MF      0.497     0.062     0.033
G90    MF      0.446     0.099    -0.049
G100   MF      0.445     0.124    -0.038
G89    MF      0.440    -0.165     0.181
G87    MF      0.430     0.298    -0.200
G117   MF      0.425     0.040    -0.055
G94    MF      0.385     0.149     0.116
G111   MF      0.338     0.104     0.151
G86    MF      0.313    -0.058     0.010
G119   MF      0.297     0.113     0.065
G103   LF      0.096     0.776    -0.234
G115   LF      0.147     0.657    -0.193
V160   LF     -0.029     0.654    -0.136
V138   LF     -0.085     0.640     0.017
G116   LF      0.173     0.596    -0.052
V146   LF      0.052     0.542     0.030
V140   LF     -0.092     0.509     0.068
V154   LF      0.077     0.487    -0.027
G81    LF      0.186     0.476    -0.157
G91    LF      0.206     0.462    -0.130
V142   LF      0.036     0.412     0.122
V155   LF     -0.231     0.383     0.100
V156   LF     -0.231     0.342     0.019
V126   LF      0.031     0.339     0.097
G105   LF      0.123     0.339     0.077
V151   LF     -0.130     0.276     0.134
V125   LF     -0.081     0.265     0.142
V148   LM      0.103    -0.707     0.616
V150   LM     -0.141    -0.050     0.610
V123   LM     -0.071    -0.018     0.589
V136   LM      0.012    -0.243     0.549
V153   LM      0.138    -0.090     0.507
V133   LM     -0.119     0.105     0.495
V127   LM     -0.008     0.071     0.465
V143   LM     -0.007    -0.034     0.393
V128   LM      0.100     0.027     0.380
V122   LM      0.148    -0.099     0.378
V152   LM      0.026    -0.118     0.364
V135   LM      0.135     0.055     0.357
V137   LM      0.166     0.222     0.307
V158   LM      0.095    -0.074     0.276

LF = Lexical Form, MF = Morphosyntactic Form, LM = Lexical Meaning


Table 10. GV Section Factor Correlation Matrix

        MFORM   LFORM   LMEAN
MFORM   1.00
LFORM   0.31    1.00
LMEAN   0.36    0.23    1.00

The three factors were interpreted as follows:

1. A form factor (MFORM), which includes items dealing with morphosyntactic form.
2. Another form factor (LFORM), which consists of lexical forms involving word formation, co-occurrence restrictions, and formulaic expressions.
3. A meaning factor (LMEAN), which includes denotational (literal) meanings, meanings of formulaic expressions, semantic fields, etc.

An inspection of the interfactor correlation matrix indicates that the correlations among the three factors are low. Based on these analyses, 14 items were used to form the LMEAN composite variable for the subsequent analyses; 17 and 12 items were used to form the LFORM and MFORM composite variables, respectively.
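To make the composite-forming step concrete, the sketch below shows how such composites can be computed from an examinee-by-item matrix of 0/1 scores. It is only an illustration, not the analysis code used in the study; the DataFrame and its contents are placeholders, and only the MFORM item list (taken from Table 9) comes from the data reported here.

```python
import numpy as np
import pandas as pd

# Items retained for the MFORM composite, as listed in Table 9
mform_items = ["G118", "G93", "G112", "G90", "G100", "G89",
               "G87", "G117", "G94", "G111", "G86", "G119"]

# Placeholder examinee-by-item matrix of 0/1 scores (the real data are not shown here)
rng = np.random.default_rng(0)
responses = pd.DataFrame(rng.integers(0, 2, size=(100, len(mform_items))),
                         columns=mform_items)

# The MFORM composite is simply the number of retained items answered correctly
mform = responses[mform_items].sum(axis=1)
print(mform.describe())
```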

After performing the EFAs on the GV section, I performed a separate series of EFAs on the reading section of the test. The reading section is composed of four discrete passages with five corresponding items per passage. Although two factors, representing (1) reading for explicit information and (2) reading for inferential information, were expected (Purpura, 1999), the EFAs produced a 4-factor promax solution that seemed to maximize parsimony and interpretability. The 4-factor solution produced an interesting result: the items loaded according to passage, which clearly indicates that the items are text dependent (see Table 11). For example, items 161-165, which belong to the first passage, loaded together on the first factor; items 166-170, from the second passage, loaded on the second factor; and so forth. An inspection of the interfactor correlation matrix (Table 12) shows that all four factors are moderately correlated, with correlation coefficients over 0.50.

Post-EFA Reliability Analyses

Following the exploratory factor analysis, I performed reliability analyses with the revised taxonomy, which organizes the GVR section of the ECPE into two sections with four variables (see Table 13). The sections are grammar/vocabulary (GV) and reading (READ). The GV section contains three components: morphosyntactic form (MFORM), lexical form (LFORM), and lexical meaning (LMEAN). The reading section contains only one component, READ. In examining the new GVR section for its overall reliability and the degree to which the items within each scale related to one another, I found that the estimates for all four components were moderate: MFORM (α = 0.52), LFORM (α = 0.72), LMEAN (α = 0.64), and READ (α = 0.72). The standard error of measurement was also examined for each component to estimate the average spread of measurement error; the values for MFORM, LFORM, LMEAN, and READ were 1.21, 1.71, 1.55, and 1.71, respectively.
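For reference, both statistics reported above can be computed directly from item-level data. The sketch below is a minimal illustration (not the study's own code) of Cronbach's alpha and the standard error of measurement; as a rough check, an alpha of 0.52 combined with the MFORM standard deviation of 1.75 from Table 14 yields a standard error of measurement of about 1.21, in line with the value reported here.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: examinees x items matrix of 0/1 scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

def sem_of_measurement(items: np.ndarray) -> float:
    """Standard error of measurement: SD of total scores times sqrt(1 - alpha)."""
    sd_total = items.sum(axis=1).std(ddof=1)
    return sd_total * np.sqrt(1 - cronbach_alpha(items))
```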

In addition to the reliability estimates for all 63 items, corrected reliability estimates computed with the Spearman-Brown Prophecy formula are provided for the MFORM, LFORM, and LMEAN items (see Table 13). The purpose of using corrected reliability estimates is to make estimates comparable across tests of differing lengths, as reliability is dependent on test length (Hatch and Lazaraton, 1991; Henning, 1987). For instance, an alpha of 0.8 on a 100-item test is not comparable to an alpha of 0.8 on a 10-item test.


Table 11. EFA Results of READ Section: Promax Rotation

Item   Code   F1 Passage 1   F2 Passage 2   F3 Passage 3   F4 Passage 4
165    EXP     0.782          0.118         -0.005         -0.097
164    EXP     0.445          0.048          0.042         -0.022
161    EXP     0.424         -0.101         -0.004          0.171
162    EXP     0.379          0.063         -0.005          0.009
163    INF     0.333          0.074          0.154          0.063
169    INF     0.027          0.529         -0.075         -0.007
170    INF     0.007          0.523          0.083          0.031
166    INF     0.052          0.412          0.079          0.051
168    EXP     0.016          0.369          0.146         -0.015
167    EXP     0.093          0.365         -0.014          0.086
175    EXP     0.060         -0.018          0.570          0.080
172    EXP     0.044          0.028          0.558          0.024
171    INF     0.011          0.113          0.552         -0.032
173    EXP    -0.078          0.065          0.533         -0.023
174    INF    -0.014          0.006          0.412          0.085
177    INF    -0.017          0.019         -0.019          0.662
176    INF    -0.088          0.039         -0.020          0.625
179    EXP     0.002          0.159         -0.030          0.465
178    EXP     0.000          0.049          0.154          0.464
180    EXP    -0.015         -0.089          0.195          0.426

EXP = Reading for Explicit Information, INF = Reading for Implicit Information

Table 12. READ Section Factor Correlation Matrix

               Passage 1   Passage 2   Passage 3   Passage 4
F1 Passage 1   1.00
F2 Passage 2   0.55        1.00
F3 Passage 3   0.54        0.60        1.00
F4 Passage 4   0.51        0.53        0.66        1.00

Table 13. Reliability Estimates for the GVR Section (63 items)

Section        Number of Items   Reliability Estimates   Corrected Reliability Estimates   Standard Error of Measurement
GV    MFORM    12                0.52                    0.86                              1.21
      LFORM    17                0.72                    0.78                              1.71
      MEAN     14                0.64                    0.84                              1.55
READ  READ     20                0.72                    0.72                              1.71
Total          63                0.81                    0.87

Corrected reliability estimates for the MFORM, LFORM, and LMEAN components were fixed to 40 items, since that was the original number of items in both the grammar and vocabulary sections. According to the corrected reliability estimates, MFORM, LFORM, and LMEAN produced fairly high reliabilities, indicating that the items within each component are highly homogeneous.


Corrected reliability estimates for the reading section are not required because no items were deleted from the section.
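The correction itself is a one-line formula. The sketch below is a minimal illustration of the Spearman-Brown projection; using the LMEAN figures (14 retained items, α = 0.64) projected back to 40 items gives roughly the 0.84 reported in Table 13.

```python
def spearman_brown(alpha: float, old_len: int, new_len: int) -> float:
    """Project the reliability of an old_len-item test onto a new_len-item test."""
    n = new_len / old_len
    return (n * alpha) / (1 + (n - 1) * alpha)

# LMEAN: 14 retained items with alpha = 0.64, projected back to a 40-item length
print(round(spearman_brown(0.64, 14, 40), 2))  # ~0.84
```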

The internal consistency reliability for the revised GVR section yielded an alpha of 0.81 with a standard error of measurement of 3.20. The corrected reliability estimate for the overall GVR section was fixed to 100 items, which produced an alpha of 0.87. Despite the extensive decrease in the number of items, the alpha value did not change much from the original value. This suggests that the items remaining in the GVR section appear to measure the construct consistently.

Structural Equation Modeling

According to the EFA results, the GVR section is hypothesized to be composed of two components measuring lexico-grammatical ability and reading ability. Table 14 presents a summary of descriptive statistics for the seven composite variables in the GVR section. Since the lexico-grammatical component includes only one type of meaning item (LMEAN), the lexical meaning items are simply labeled MEAN in the further analyses.

Table 14. Distributions of the GVR Section (63 items)

Variable                     Mean    Std. Dev.   Kurtosis   Skewness   Min   Max   # Poss.
Lexico-Grammatical Ability
  MFORM                      9.37    1.75         0.49      -0.69      0     12    12
  LFORM                     10.61    3.24        -0.19      -0.55      0     17    17
  MEAN                       8.51    2.58        -0.45      -0.08      0     14    14
Reading Ability
  READ 1                     4.48    0.77         2.89      -1.65      0      5     5
  READ 2                     3.79    1.19         0.17      -0.88      0      5     5
  READ 3                     4.00    1.15         0.82      -1.14      0      5     5
  READ 4                     3.04    1.45        -0.88      -0.27      0      5     5

Based on the results of the EFAs, the abbreviated GVR section of the ECPE was

represented as a two-factor model of foreign language test performance of English containing two intercorrelated factors, lexico-grammatical ability (L-G) and reading ability. Across the two factors, there are seven observed variables (Morphosyntactic Form, Lexical Form, Lexical Mean, Passage 1, Passage 2, Passage 3, and Passage 4), with each variable hypothesized to load on only one factor (see Figure 3). This is a first-order confirmatory factor analysis designed to test the multidimensionality of foreign language test performance of English as measured by the abridged GVR section. Before exploring the trait structure of the GVR section, I first investigated the statistical assumptions underlying the estimation procedure used in these analyses and then proceeded to assess model-data fit. The statistical analysis showed that the variables in this model were univariately normally distributed; thus, further statistical analyses proceeded. Model 1.1 in Figure 3 addresses the second research question: what is the underlying trait structure of foreign language test performance of English as measured by the ECPE GVR section?

Model 1.1 is similar to the model presented in Purpura's study (1999). This is a first-order confirmatory factor analysis designed to investigate the multidimensionality of the foreign language test performance of English measured by the GVR section.


Figure 3. Initially Hypothesized 2-Factor Model of the GVR Section: Model 1.1

* = Freely estimated

1.0 = Fixed

Prior to exploring the trait structure of this model, I examined the univariate and multivariate sample statistics for sample normality. As shown in Table 14, the skewness and kurtosis values were within the acceptable limits, indicating that these variables are normally distributed. Consequently, using Mplus Version 2, I assessed the hypothesized model to determine to what extent the model fit the sample data. The model-data fit statistics for Model 1.1 produced a chi-square of 873.562 with 13 degrees of freedom (p < 0.001) and a CFI of 0.92, as seen in Table 15. A root mean square error of approximation (RMSEA) of 0.07 indicates a degree of global misfit. Although this model fit is not completely unsatisfactory, it does not provide substantial evidence for acceptance of Model 1.1. As Model 1.1 was not the best model, I made no interpretation of the individual parameter estimates. Instead, I further investigated a model which better represents the sample data.

Table 15. Results for Initially Hypothesized 2-Factor Model of the GVR Section: Model 1.1

Goodness of fit summary:
  Comparative fit index (CFI)                        0.92
  Tucker-Lewis index (TLI)                           0.92
Standardized residual matrix:
  Standardized Root Mean Square Residual (SRMR)      0.04
  Root Mean Square Error of Approximation (RMSEA)    0.07
Chi-square test of model fit:
  Value                                              873.562
  Degrees of Freedom                                 13
  P-Value                                            0.0000
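For readers who wish to verify the fit indices, RMSEA can be recovered from the reported chi-square, its degrees of freedom, and the sample size. The sketch below uses one common formulation of the index (an assumption about the exact formula Mplus applies); plugging in the Model 1.1 values reproduces a value of about 0.07.

```python
import math

def rmsea(chi_square: float, df: int, n: int) -> float:
    """Root mean square error of approximation from a chi-square model-fit statistic."""
    return math.sqrt(max((chi_square - df) / (df * (n - 1)), 0.0))

# Model 1.1: chi-square = 873.562, df = 13, N = 12,468 examinees
print(round(rmsea(873.562, 13, 12468), 3))  # ~0.073
```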



Based on the results of Model 1.1, I reconceptualized the model using a series of fitting procedures. Numerous models were examined; however, most were misfitting or substantively irrelevant. Through this investigation, one model emerged that appeared to represent the sample data well from both a substantive and a statistical point of view. The two observed variables under lexico-grammatical ability that are assumed to measure form (lexical form and morphosyntactic form) were combined into a single variable, FORM. In the revised model, the GVR section is hypothesized to be composed of two factors with six observed variables (Model 1.2, Figure 4).

Figure 4. The Revised 2-Factor Model of the GVR Section: Model 1.2

* = Freely estimated
1.0 = Fixed

Prior to assessing the model-data fit, the descriptive statistics for FORM were reexamined, as seen in Table 16. All values for skewness and kurtosis were within the acceptable limits, indicating that the variable is univariately normally distributed; therefore, the SEM was performed on Model 1.2.

Table 16. Distributions of the FORM Variable in the GVR Section (63 items)

Variable   Mean    Std. Dev.   Kurtosis   Skewness   Min   Max   # Poss.
FORM       19.98   4.00        0.13       -0.56      0     29    29

Using Mplus Version 2, I evaluated the model for overall model-data fit, as seen in Table 17. The revised two-factor model of foreign language test performance of English produced a chi-square value of 408.302 with 8 degrees of freedom, representing a substantial drop in overall chi-square (∆χ2(5) = 465.26) from the initially hypothesized model. This reduction in χ2 demonstrated a substantial improvement in goodness of fit. Along with the chi-square, the CFI (0.96) also reflected an improvement in model-data fit (∆ = 0.04). Although a smaller RMSEA may be preferred, Model 1.2 is a better representation of the data than the previous model.
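Because Model 1.2 is nested in Model 1.1, the chi-square drop can also be evaluated as a formal difference test. The sketch below is an illustration with scipy, not the study's own procedure.

```python
from scipy.stats import chi2

# Chi-square values and degrees of freedom reported for Models 1.1 and 1.2
chi_diff = 873.562 - 408.302   # = 465.26
df_diff = 13 - 8               # = 5
p_value = chi2.sf(chi_diff, df_diff)
print(f"delta chi2({df_diff}) = {chi_diff:.2f}, p = {p_value:.3g}")
```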



Table 17. Results for the Revised 2-Factor Model of the GVR Section: Model 1.2

Goodness of fit summary:
  Comparative fit index (CFI)                        0.96
  Tucker-Lewis index (TLI)                           0.97
Standardized residual matrix:
  Standardized Root Mean Square Residual (SRMR)      0.03
  Root Mean Square Error of Approximation (RMSEA)    0.06
Chi-square test of model fit:
  Value                                              408.302
  Degrees of Freedom                                 8
  P-Value                                            0.0000

The standardized solution shown in Table 18 presents Model 1.2 in the form of substantive

relationships represented by mathematical equations. For instance, the equation for FORM shows that the FORM (V1) items depend on one latent variable, lexico-grammatical ability, and one error term (E1), which accounts for any measurement error in this variable as well as any specific systematic component of the variable not captured in the latent variables. I evaluated the feasibility of the individual parameter estimates and discovered all to be reasonable and statistically significant at the 0.05 level. This indicates that the underlying factors are well measured by the observed variables, and that these variables are measuring lexico-grammatical and reading ability. Moreover, the variances of the error terms as well as all the parameter estimates were statistically significant. The loadings in the standardized solution were somewhat moderate, ranging from a low 0.34 to a moderate 0.62. Model 1.2, along with the standardized parameter estimates, is presented in Figure 5.

In sum, Model 1.2 provides strong evidence for acceptance of the two-factor solution of foreign language test performance measured by the abbreviated GVR section as a reasonable explanation of the correlations among the observed variables. This solution asserts the notion that the shortened GVR section of the ECPE consists of two underlying factors: lexico-grammatical ability and reading ability. According to this model, reading ability results vary because the reading passages are different, not because the item types measure different skills (i.e., reading for explicit information and reading for inferential information). On the other hand, lexico-grammatical ability results vary because the item types measure different skills. Furthermore, this solution produced a high interfactor correlation (r = 0.97) between lexico-grammatical ability and reading ability, suggesting that these abilities are not purely independent. Instead, these two abilities are inextricably related.

Table 18. Parameter Estimates for Model 1.2

Standardized Solution:
  FORM      = V1 = 0.55 F1 + 0.89 E1
  MEAN      = V2 = 0.34 F1 + 0.70 E2
  Passage 1 = V3 = 0.45 F2 + 0.80 E3
  Passage 2 = V4 = 0.56 F2 + 0.69 E4
  Passage 3 = V5 = 0.60 F2 + 0.61 E5
  Passage 4 = V6 = 0.62 F2 + 0.70 E6


Figure 5. The Revised 2-Factor Model of the GVR Section with Standardized Parameter Estimates: Model 1.2

* = Freely estimated

Cloze Section

Descriptive Statistics

First, I analyzed the item-level data from the cloze section based on all 12,468 test-takers (see Appendix B). The means for the cloze section ranged from 0.31 to 0.97, suggesting a wide range of item-difficulty levels. The standard deviation ranged from 0.16 to 0.50. There are four items (CB62, CB72, CB75, CB78) that had means above 0.94, and the values for skewness and kurtosis for those four items were beyond +/- 3. However, it is entirely logical to expect high kurtosis values for these items because the mean values were so high.
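With dichotomous items, the item mean is simply the proportion of examinees answering correctly, so this screening can be expressed in a few lines. The sketch below is illustrative only: the response matrix is a placeholder, and the ±3 cutoff follows the criterion mentioned above (scipy reports excess kurtosis, which may differ slightly from the convention used in the study).

```python
import numpy as np
from scipy.stats import kurtosis, skew

def screen_items(items: np.ndarray, limit: float = 3.0) -> None:
    """Flag dichotomous items whose skewness or kurtosis falls outside +/- limit.

    items: examinees x items array of 0/1 responses (placeholder layout).
    """
    for j in range(items.shape[1]):
        col = items[:, j]
        p = col.mean()                 # item facility (proportion correct)
        sk = skew(col)
        ku = kurtosis(col)             # excess kurtosis
        if abs(sk) > limit or abs(ku) > limit:
            print(f"item {j + 41}: facility = {p:.2f}, skew = {sk:.2f}, kurtosis = {ku:.2f}")

# Example with placeholder data (500 examinees, 40 items); one very easy item is flagged
rng = np.random.default_rng(0)
demo = (rng.random((500, 40)) < 0.7).astype(int)
demo[:, 0] = (rng.random(500) < 0.97).astype(int)
screen_items(demo)
```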

Although some items did not fall within the accepted limits of the descriptive statistics described in the Method section, they did not appear to pose any substantial threat to normality in the reliability analyses and were therefore kept for the subsequent analyses.

Internal Consistency Reliability

Reliability analysis was performed in order to examine the internal consistency reliability for the cloze section, as seen in Table 19. The results show that the first passage and the second passage yielded alphas of 0.60 and 0.50, respectively. The overall cloze section yielded an alpha of 0.70. Such a moderate value is surprising because higher reliability is normally expected for an exam like the ECPE with over 12,000 participants and 40 items. Compared to the reliability estimate in the GVR section (α = 0.87), 0.70 seems fairly low, and suggests that the items in the cloze section do not appear to measure strongly a homogeneous construct. In order to answer the fourth research question (what is the underlying trait structure of foreign language test performance of English, as measured by the ECPE cloze section?), I performed exploratory factor analysis, which is discussed in the following section.



Table 19. Reliability Estimates for the Cloze Section

Section          Number of Items   Reliability Estimates
First Passage    20                0.60
Second Passage   20                0.50
Total            40                0.70

Exploratory Factor Analysis

To investigate the factorial structure of the cloze section, a matrix of tetrachoric correlations using all 40 items was generated in Mplus Version 2. A series of EFAs was then performed on the cloze section, which produced a two-factor promax solution that seemed to maximize parsimony and interpretability. Although it was initially hypothesized that the cloze items were measuring four factors, each representing one of the four components identified in the coding (RG, RV, GR, VR), the EFA results offered no substantial support for four factors. In the course of these analyses, 15 items (CA41, CA47, CA52, CA53, CA55, CA57, CA58, CB60, CB65, CB67, CB68, CB70, CB71, CB74, and CB76) were dropped due to extremely low (lower than 0.3) factor loadings or double loadings (see Table 20). The same procedure as for the GVR analysis was used for dropping items with factor loadings lower than 0.30. The deletion of this many items reflects the moderate reliability of the cloze section: had the cloze items been performing more homogeneously, the EFA would not have produced so many items with low factor loadings and double loadings.

Table 20. The Initial EFA Results of Cloze Section: Promax Rotation

Item   Code   F1       F2         Item   Code   F1       F2
C41    CF     0.241    0.118      C61    LM     0.117    0.255
C42    CM    -0.187    0.409      C62    CF     0.306    0.117
C43    LM    -0.076    0.411      C63    LM     0.024    0.348
C44    LM     0.087    0.248      C64    LM    -0.080    0.337
C45    LM     0.122    0.319      C65    LM     0.079    0.197
C46    LM     0.154    0.231      C66    LF     0.411   -0.254
C47    LM     0.162    0.105      C67    LM     0.135    0.148
C48    LM    -0.009    0.301      C68    MF     0.214    0.161
C49    LF     0.418    0.087      C69    LM    -0.093    0.441
C50    LM     0.109    0.306      C70    CF     0.159    0.061
C51    CF     0.343    0.190      C71    LM     0.204    0.128
C52    LM     0.209    0.121      C72    MF     0.541   -0.153
C53    CF     0.159    0.139      C73    LF     0.233    0.078
C54    CF     0.357    0.180      C74    LM     0.158    0.183
C55    LM     0.181    0.193      C75    CF     0.583   -0.084
C56    LM     0.012    0.476      C76    LM     0.018    0.035
C57    LM     0.062    0.130      C77    LM    -0.021    0.353
C58    MF     0.145    0.139      C78    MF     0.686   -0.213
C59    LM     0.184    0.336      C79    LM    -0.132    0.680
C60    MF     0.263    0.196      C80    LM    -0.065    0.396

LF = Lexical Form, MF = Morpho-Syntactic Form, CF = Cohesive Form, LM = Lexical Mean, CM = Cohesive Mean
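The item-dropping rule described above (retain an item only if it loads at least 0.30 on exactly one factor) can be written as a short filter over the loadings in Table 20. The sketch below is a minimal illustration with a handful of items; it is not guaranteed to reproduce the final item set exactly, since loadings shift once items are removed and the EFA is re-run.

```python
# Abridged loadings from Table 20: item -> (F1, F2)
loadings = {
    "CA41": (0.241, 0.118),   # dropped: neither loading reaches 0.30
    "CA42": (-0.187, 0.409),  # kept: loads on F2 only
    "CA49": (0.418, 0.087),   # kept: loads on F1 only
    "CA51": (0.343, 0.190),   # kept
    "CA57": (0.062, 0.130),   # dropped
}

def keep(f1: float, f2: float, cut: float = 0.30) -> bool:
    """Retain an item only if it loads at or above the cutoff on exactly one factor."""
    strong = [abs(f1) >= cut, abs(f2) >= cut]
    return sum(strong) == 1

for item, (f1, f2) in loadings.items():
    print(item, "keep" if keep(f1, f2) else "drop")
```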


Although deleting 15 items from the subsequent analyses may seem excessive, it was necessary to remove the low- and/or double-loading items so that they would not distort the investigation of the construct validity of the ECPE. Ultimately, 25 cloze items were used to measure the two underlying factors (Table 21). Based on Purpura's theoretical model, the items loading on the first factor appeared to be measuring grammatical forms, and those loading on the second factor appeared to be measuring grammatical meaning.

Table 21. EFA Results of the Shortened Cloze Section: Promax Rotation

Item   Code   F1 FORM   F2 MEAN
CB78   MF      0.662    -0.149
CB75   CF      0.569    -0.036
CB72   MF      0.521    -0.099
CA49   LF      0.387     0.131
CB66   LF      0.365    -0.205
CA54   CF      0.326     0.219
CA51   CF      0.309     0.229
CB62   CF      0.294     0.149
CB73   LF      0.243     0.098
CB79   LM     -0.104     0.664
CA56   LM      0.012     0.472
CB69   LM     -0.062     0.428
CA43   LM     -0.069     0.400
CB80   LM     -0.041     0.390
CA42   CM     -0.168     0.388
CA59   LM      0.143     0.353
CB63   LM      0.021     0.345
CB77   LM     -0.023     0.343
CA45   LM      0.114     0.333
CB64   LM     -0.073     0.323
CA50   LM      0.098     0.316
CA48   LM     -0.003     0.298
CB61   LM      0.109     0.265
CA44   LM      0.088     0.259
CA46   LM      0.137     0.246

LF = Lexical Form, MF = Morpho-Syntactic Form, CF = Cohesive Form, LM = Lexical Mean, CM = Cohesive Mean

An inspection of the interfactor correlation matrix indicated a moderate correlation between the two factors (0.55). This moderate correlation indicates that the FORM items and the MEAN items are not measuring the same construct, yet they are somewhat interdependent. Based on these analyses, 16 items were used to form the MEAN composite variable and 9 items were used to form the FORM composite variable for the subsequent analyses.

Post-EFA Reliability Analyses

Following the exploratory factor analysis, I performed a reliability analysis with the revised cloze section taxonomy containing two factors (FORM and MEAN).


The analysis yielded alphas of 0.40 and 0.59 for the FORM and MEAN items, respectively (see Table 22). The Spearman-Brown formula was used to examine the corrected reliability (with the test length fixed to 40 items), yielding values within the acceptable limits of 0.50 or more (0.75 and 0.78 for FORM and MEAN, respectively). The corrected reliability estimate for the entire shortened cloze section increased to 0.74 from the original cloze section reliability estimate of 0.70.

Table 22. Reliability Estimates for the Revised Cloze Section: FORM and MEAN

        Number of Items   Items Kept                                          Reliability Estimates   Corrected Reliability Estimates
FORM    9                 49, 51, 54, 62, 66, 72, 73, 75, 78                  0.29                    0.64
MEAN    16                42, 43, 44, 45, 46, 48, 50, 56, 59, 61, 63, 64,     0.58                    0.77
                          69, 77, 79, 80
Total   25                                                                    0.58                    0.70

Structural Equation Modeling (SEM)

Based on the results of the EFAs, the shortened foreign language cloze test performance of English appeared to be measured by two variables: FORM and MEAN. I attempted to perform SEM on these factors by treating FORM and MEAN as observed variables. However, such a model is under-identified, because there are only two observed variables in the one-factor model. An under-identified model has one or more parameters that cannot be uniquely determined due to insufficient information in the covariance matrix (Schumaker and Lomax, 1996), and it therefore produces unreliable parameter estimates. To compensate for this limitation, item-level SEM was performed for the EFA-generated factors. In other words, FORM and MEAN were treated as underlying factors, while the items were treated as observed variables in this model. Model 2 addresses the following research question: what is the underlying trait structure of foreign language test performance of English measured by cloze?
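The identification problem can be seen by simple counting: a covariance matrix of p observed variables supplies p(p + 1)/2 unique pieces of information, and a model is under-identified when it asks for more free parameters than that. The sketch below illustrates the counting for the two-indicator case discussed above; the parameter count assumes a one-factor model with the factor variance fixed and uncorrelated errors.

```python
def unique_moments(p: int) -> int:
    """Number of distinct variances and covariances among p observed variables."""
    return p * (p + 1) // 2

# Two composite indicators (FORM, MEAN) loading on one factor:
# free parameters = 2 loadings + 2 error variances = 4,
# but only unique_moments(2) = 3 pieces of information -> under-identified.
print(unique_moments(2))   # 3

# Twenty-five item-level indicators supply far more information:
print(unique_moments(25))  # 325
```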

Using Mplus Version 2, I evaluated the hypothesized model to determine to what extent the model fit the sample data for the item level analysis. Table 23 presents the summary of the model fit. The data produced the standardized root mean square residual of 0.041. The goodness of fit index for this model produced a chi-square value of 1862.276 with 255 degrees of freedom and a CFI of 0.90. Although a CFI of 0.95 and above is preferred, the RMSEA is acceptable (RMSEA = 0.022), indicating that this model is a good representation of the data.

Table 23. Results for EFA-Generated Cloze Section 2-Factor Model: Model 2

Goodness of fit summary:
  Comparative fit index (CFI)                        0.90
  Tucker-Lewis index (TLI)                           0.91
Standardized residual matrix:
  Standardized Root Mean Square Residual (SRMR)      0.04
  Root Mean Square Error of Approximation (RMSEA)    0.02
Chi-square test of model fit:
  Value                                              1862.276
  Degrees of Freedom                                 255
  P-Value                                            0.0000


Considering that there are 25 observed variables in the model, CFI = 0.90 seems to be an acceptable model fit for this data.

I then evaluated the feasibility of the individual parameter estimates and found all of them to be substantively reasonable and statistically significant at the 0.05 level. This indicates that the underlying factors are reasonably measured by the observed variables. In other words, the items in the shortened cloze section appear to measure two underlying factors: FORM and MEAN. The standardized solution in Table 24 shows that the factor loadings for Model 2 were somewhat moderate, ranging from a low of 0.10 to a high of 0.59. Model 2, along with the standardized parameter estimates, is presented in Figure 6.

Table 24. Parameter Estimates for Model 2

Standardized Solution:
  49 = V11 = 0.48 F1 + 0.77 E11
  51 = V12 = 0.53 F1 + 0.72 E12
  54 = V13 = 0.55 F1 + 0.70 E13
  62 = V14 = 0.42 F1 + 0.17 E14
  66 = V15 = 0.10 F1 + 0.99 E15
  72 = V16 = 0.35 F1 + 0.88 E16
  73 = V17 = 0.31 F1 + 0.90 E17
  75 = V18 = 0.45 F1 + 0.80 E18
  78 = V19 = 0.41 F1 + 0.83 E19
  42 = V20 = 0.25 F2 + 0.94 E20
  43 = V21 = 0.35 F2 + 0.88 E21
  44 = V22 = 0.32 F2 + 0.90 E22
  45 = V23 = 0.42 F2 + 0.82 E23
  46 = V24 = 0.35 F2 + 0.88 E24
  48 = V25 = 0.30 F2 + 0.91 E25
  50 = V26 = 0.38 F2 + 0.85 E26
  56 = V27 = 0.48 F2 + 0.78 E27
  59 = V28 = 0.47 F2 + 0.78 E28
  61 = V29 = 0.35 F2 + 0.88 E29
  63 = V30 = 0.35 F2 + 0.88 E30
  64 = V31 = 0.26 F2 + 0.93 E31
  69 = V32 = 0.37 F2 + 0.87 E32
  77 = V33 = 0.32 F2 + 0.90 E33
  79 = V34 = 0.59 F2 + 0.66 E34
  80 = V35 = 0.37 F2 + 0.86 E35

To summarize, Model 2 provides sufficient evidence for the acceptance of the two-factor

solution of foreign language cloze test performance of English as a reasonable explanation of the correlations among the observed variables. This model suggests that the selected items in the cloze section are measuring two underlying constructs: grammatical forms and grammatical meanings.


Figure 6. EFA-Generated 2-Factor Model of the Cloze Section: Model 2

* = Freely estimated



The GVR and the Cloze Sections

The primary purpose of this study was to determine how the cloze items relate to the various parts of the GVR section and to investigate whether the cloze section merits being a separate section of the ECPE battery. The first step in answering this research question was to examine the correlations among the cloze scores and the three parts in the GVR section. Table 25 presents the Pearson product-moment correlations of the total cloze score, the total GVR score, and the scores of the three parts (grammar, vocabulary, and reading) in the GVR section.

Table 25. Correlations of Cloze and GVR Scores

              GVR
              Grammar   Vocabulary   Reading   Total GVR
Total Cloze   0.59*     0.49*        0.55*     0.66*

* significant at the 0.01 level (2-tailed)

The correlations show that the cloze items appear to measure more of the grammatical aspects of the language than the vocabulary and reading comprehension aspects; however, the differences in correlations for all three parts in the GVR were not substantial (0.49 to 0.59). The correlation between the total cloze score and the total GVR score was 0.66, indicating that there is a moderate degree of overlap in the processes measured by the cloze and the GVR. In other words, these two sections appear to measure a homogeneous construct to some extent but not so robustly.
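The product-moment correlations in Table 25 are straightforward to compute from section total scores. The sketch below is an illustration with placeholder data rather than the actual ECPE score vectors.

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder score vectors; in practice these would be the examinees' section totals
rng = np.random.default_rng(0)
gvr_grammar = rng.integers(0, 41, size=500)
cloze_total = np.clip(gvr_grammar + rng.integers(-8, 9, size=500), 0, 40)

r, p = pearsonr(cloze_total, gvr_grammar)
print(f"Cloze vs. Grammar: r = {r:.2f} (p = {p:.3g})")
```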

To further investigate the relationship between the cloze and the GVR sections, the correlation between these two sections based on the SEM analyses was examined. According to the SEM results, both the abbreviated cloze and the grammar and vocabulary parts of the GVR measure FORM and MEAN. Thus, the extent to which the cloze form/mean items correlate with the GVR form/mean items was investigated. Table 26 presents the results of the correlational analysis.

Table 26. Correlations of the Cloze Form/Mean Items and the GVR Form/Mean Items

              GVR FORM   GVR MEAN
Cloze FORM    0.45*      0.07*
Cloze MEAN    0.34*      0.43*

* significant at the 0.01 level (2-tailed)

The cloze FORM and the GVR FORM correlated the highest with a value of 0.45. The second highest correlation was between the cloze MEAN and the GVR MEAN, with the value of 0.43. Although the correlations are not particularly high, this analysis shows that the FORM items in the cloze and GVR sections correlate, and the MEAN items in the cloze and GVR sections correlate. The cloze MEAN and the GVR FORM showed a low correlation of 0.34, suggesting a weak relationship between the two item types. Both correlations between different item types across the cloze and GVR sections produced low correlations (0.07 and 0.34), indicating a weak relationship between the FORM and the MEAN items.

Following the correlation analyses, the overall SEM model for the cloze and GVR sections was examined in order to investigate the underlying construct of the combined sections. Based on the results of the SEMs, the following model (Figure 7) was initially hypothesized as the


overall model. It contains two intercorrelated factors (L-G = lexico-grammatical ability and READ = reading ability) with eight observed variables (GVR Mean, GVR Form, Cloze Form, Cloze Mean, Reading passage 1, passage 2, passage 3, and passage 4), and each observed variable is hypothesized to load on only one factor. Errors associated with each observed variable (E36 through E39 and E3 through E6) are assumed to be uncorrelated.

Figure 7. Initially Hypothesized Model of the Overall Cloze and the GVR Section: Model 3.1

* = Freely estimated
1.0 = Fixed

Model 3.1 is a first-order confirmatory factor analysis designed to examine the

multidimensionality of the foreign language test performance of English measured by the abbreviated cloze and GVR sections. Due to the exploratory nature of this study, I examined the relationships among the variables with the objective of generating the best fitting and most substantively meaningful model, rather than simply confirming or rejecting this particular model.

Prior to exploring the trait structure of this model, the univariate and multivariate statistical assumptions underlying the maximum likelihood estimation procedure were examined. The univariate skewness and kurtosis values indicated satisfactory normality. Then, the trait structure of the hypothesized model was examined to investigate the extent to which the model fit the sample data. With regard to overall model adequacy, the data produced a root mean square error of approximation (RMSEA) of 0.10, indicating a degree of global misfit (see Table 27). Furthermore, the initially hypothesized two-factor model produced a chi-square value of 2188.00 with 19 degrees of freedom (p < 0.001). This again suggests a weakly fitting model. Along with the RMSEA and the chi-square value, the comparative fit index (CFI) of 0.88 confirms that this model does not provide compelling



evidence for acceptance. Although the individual parameters of this model were evaluated, I did not interpret them due to the inadequate fit of the overall model.

Table 27. Results for Initially Hypothesized Overall Cloze/GVR Section Model: Model 3.1

Goodness of fit summary:
  Comparative fit index (CFI)                        0.88
  Tucker-Lewis index (TLI)                           0.89
Standardized residual matrix:
  Standardized Root Mean Square Residual (SRMR)      0.05
  Root Mean Square Error of Approximation (RMSEA)    0.10
Chi-square test of model fit:
  Value                                              2188.00
  Degrees of Freedom                                 19
  P-Value                                            0.0000

Based on the results of Model 3.1, I performed a series of post hoc fitting procedures in order to find a better fitting model. My primary concern was: if the FORM items in the cloze and GVR sections are measuring the same trait, to what extent are they correlated? The same question was raised for the MEAN items. Therefore, in the revised model, the error terms associated with Cloze FORM-GVR FORM and Cloze MEAN-GVR MEAN are hypothesized to be correlated. Model 3.2, presented in Figure 8, was built from both a substantive and a statistical point of view.
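Although the study itself fit these models in Mplus Version 2, the structure of Model 3.2 can be illustrated with a lavaan-style specification such as the one supported by the Python package semopy. Everything below is an assumption made for illustration: the variable names, the simulated data, and the exact semopy calls are not taken from the study.

```python
import numpy as np
import pandas as pd
import semopy

# Lavaan-style description of Model 3.2: two correlated factors, with correlated
# residuals between the matching FORM and MEAN composites across sections
desc = """
LG   =~ GVR_Form + GVR_Mean + Cloze_Form + Cloze_Mean
READ =~ Passage1 + Passage2 + Passage3 + Passage4
GVR_Form ~~ Cloze_Form
GVR_Mean ~~ Cloze_Mean
"""

# Simulated placeholder data with a rough two-factor structure (illustration only)
rng = np.random.default_rng(0)
n = 1000
lg = rng.normal(size=n)
rd = 0.9 * lg + 0.45 * rng.normal(size=n)      # factors made highly correlated
def indicator(f):
    return 0.6 * f + 0.8 * rng.normal(size=n)
data = pd.DataFrame({
    "GVR_Form": indicator(lg), "GVR_Mean": indicator(lg),
    "Cloze_Form": indicator(lg), "Cloze_Mean": indicator(lg),
    "Passage1": indicator(rd), "Passage2": indicator(rd),
    "Passage3": indicator(rd), "Passage4": indicator(rd),
})

model = semopy.Model(desc)
model.fit(data)
print(model.inspect())           # parameter estimates
print(semopy.calc_stats(model))  # fit indices (chi-square, CFI, RMSEA, ...)
```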

Figure 8. The Revised Model of the Overall Cloze and the GVR Section: Model 3.2

* = Freely estimated

1.0 = Fixed



All eight variables produced satisfactory skewness and kurtosis values based on the sample statistics; thus, further analysis proceeded. With respect to goodness of fit, Model 3.2 produced a root mean square error of approximation (RMSEA) of 0.05, indicating an insignificant degree of misfit (see Table 28). It also produced a chi-square statistic of 751.721 with 17 degrees of freedom, representing a drastic decrease in overall chi-square (∆χ2(2) = 1436.279) from the initially hypothesized model. This reduction in χ2 exhibited a substantial improvement in goodness of fit. Along with the chi-square, the CFI (0.96) also reflected an extensive improvement in model-data fit (∆ = 0.08).

Table 28. Results for the Overall Cloze and the GVR Section: Model 3.2

Goodness of fit summary:
  Comparative fit index (CFI)                        0.96
  Tucker-Lewis index (TLI)                           0.97
Standardized residual matrix:
  Standardized Root Mean Square Residual (SRMR)      0.03
  Root Mean Square Error of Approximation (RMSEA)    0.05
Chi-square test of model fit:
  Value                                              751.721
  Degrees of Freedom                                 17
  P-Value                                            0.0000

These statistics provide strong evidence for acceptance of Model 3.2. As seen in Table 29, the loadings in the standardized solution ranged from a low 0.31 for GVR Mean to a moderately high 0.62 for Passage 4 in the reading section. Nonetheless, all factor loadings were found to be statistically significant.

Table 29. Parameter Estimates for Model 3.2

Standardized Solution:
  GVR Mean   = V36 = 0.31 F1 + 0.95 E36
  GVR Form   = V37 = 0.57 F1 + 0.82 E37
  Cloze Mean = V38 = 0.60 F1 + 0.81 E38
  Cloze Form = V39 = 0.53 F1 + 0.85 E39
  Passage 1  = V3  = 0.46 F1 + 0.89 E3
  Passage 2  = V4  = 0.56 F1 + 0.83 E4
  Passage 3  = V5  = 0.59 F1 + 0.81 E5
  Passage 4  = V6  = 0.62 F1 + 0.78 E6

Figure 9 provides a diagrammatic representation of Model 3.2, in which the standardized parameter estimates are indicated. An inspection of Model 3.2 illustrates that the ECPE foreign language test performance of English for the selected items is represented by two highly related underlying factors measured by eight observed variables. The high (0.95) interfactor correlation suggests that lexico-grammatical ability and reading ability are closely related. The most significant finding in this model is that the two error terms are significantly


related to each other. Although the correlations of 0.21 and 0.31 are not high, this indicates that there is some redundant content being measured across the cloze and GVR sections.

Figure 9. Results for the Overall Cloze and the GVR Section with Standardized Parameter Estimates: Model 3.2

* = Freely estimated

In summary, Model 3.2 provides a reasonable explanation of the underlying construct of the shortened ECPE cloze and GVR sections. With its two intercorrelated factors, eight measured variables and two correlated errors, this model generally supports the hypothesis that cloze is measuring form and meaning, thereby supporting the notion that cloze does not measure processing abilities beyond the clause level (Alderson, 1979; Shanahan et al., 1982; Markham, 1985). Factor 1, lexico-grammatical ability, is represented by items assessing forms and meanings in both the cloze and the GVR sections. Factor 2, reading ability, is represented by items assessing reading in four different passages in the GVR section.

Discussion

The present study investigated six research questions concerning the underlying trait structure of the ECPE cloze and GVR sections. The first research question investigated the extent to which the items in each component (grammar, vocabulary, and reading) in the GVR section performed as a homogeneous group. The reliability analysis indicated that the grammar items were the most homogeneous among the three components, with a Cronbach's alpha of 0.80. However, when the number of items for all three components was held constant at 40 items, the reading items had the highest internal consistency, with an alpha of 0.84. The internal consistency reliability range of 0.72 to 0.84 suggests that the items reasonably measure the



same construct within each component. The internal consistency reliability of the overall GVR section produced a high alpha of 0.87, indicating that the items in the GVR section appear to measure reliably second language GVR test performance of English. The information provided by the reliability analysis proved valuable in determining whether or not to proceed with the exploratory and the confirmatory factor analyses, which were designed to analyze the hypothesized underlying structures of the GVR and cloze sections, as well as to analyze whether or not the composite variables were measuring the language ability they were designed to measure.

The second research question examined the underlying trait structure of foreign language test performance of English measured by the GVR section. In order to answer the question, exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) were used. The EFA indicated that the selected grammar items measure morphosyntactic form and lexical form, while the selected vocabulary items measure lexical form and lexical meaning. In other words, lexical form is measured in both grammar and vocabulary sections, whereas lexical meaning is only measured in the shortened vocabulary section, and morphosyntactic form is measured only in the shortened grammar section. This is not a surprising result, since lexical form and lexical meaning are closely associated by definition (Purpura, forthcoming).

The CFA proved valuable in confirming the hypothesized two-factor underlying trait structure, indicating moderate loadings between observed variables and their hypothesized factors. This confirmed that test takers’ performance on the abbreviated GVR section was explained by two hypothesized factors (lexico-grammatical ability and reading ability), and that lexico-grammatical ability is measured by two observed variables: form and meaning.

Although the reading items were expected to load on the explicit and implicit reading factors, they loaded according to passages. This may suggest that reading items in this test are text dependent rather than item type dependent. There is a need for further investigation of how the choice of reading passages affects the measure of reading ability.

The interfactor correlation between lexico-grammatical ability and reading ability was extremely high (r = 0.97). This suggests that these two factors are not purely independent. Rather, lexico-grammatical ability and the variables that measure the lexico-grammatical ability appear to be closely related to reading ability and vice versa.

The third and fourth research questions addressed the underlying trait structure of the cloze section. With regard to the third question, the internal consistency reliability of the cloze items was investigated, resulting in moderate estimates of 0.57 and 0.45, which indicates that items in each passage did not perform in a homogeneous way. The internal consistency reliability estimate of the overall cloze section was 0.65, suggesting a weak homogeneity of the cloze items. Compared to the reliability for the grammar and the vocabulary parts in the GVR section, the cloze reliability seems rather low. This may suggest that cloze items are not measuring a single underlying construct. As previously discussed, past studies have indicated a wide range of reliability estimates (0.31 to 0.96). Considering this wide range, the reliability of 0.65 seems reasonable. However, considering the ECPE is a high-stakes exam with over 12,000 subjects and a reasonable number of items, 0.65 seems rather low. In order to understand the reason for the low reliability, the cloze items and their distractors should be studied further.

The fourth research question examined the underlying trait structure of foreign language test performance of English measured by the cloze section, using EFA followed by CFA. According to the Michigan Certificate Examinations General Information Bulletin, “the cloze section is intended to assess the understanding of the organizational features of written text as well as grammatical knowledge and pragmatic knowledge of English, particularly knowledge


about expected vocabulary in certain contexts” (English Language Institute, The University of Michigan, 2002, p. 8). In other words, the cloze section is intended to assess higher-order processing abilities. However, based on the rigorous investigation using SEM in the present study, the selected cloze test items appeared to be accounted for by two factors: form and meaning. The findings indicate that the abbreviated cloze section appears to measure only lower-order processing skills and not the skill of comprehending the organizational features of written text.

The fifth research question, examining the relationship between the cloze and GVR sections, was addressed by composing a model that would fit both statistically and substantively. As a result, the final model identified two underlying factors, each with four observed variables and two correlated errors. The reading ability factor is represented by four reading passage variables in the GVR section. The lexico-grammatical ability factor on the other hand is represented by two FORM variables and two MEAN variables measured in the cloze and GVR sections. The model indicates moderate relationships between variables and their respective hypothesized factors, as well as a high interfactor correlation, indicating that these abilities are inextricably related. The correlated errors provide evidence of some redundancy in the content being measured across the cloze and GVR sections. However, low error values (0.21 and 0.31) suggest that form and meaning measured in the cloze and GVR sections are different. This finding leads to the last research question, which asks whether the cloze section merits being a separate section of the ECPE battery.

In light of the above observations, the answer to the sixth research question is that the GVR and the cloze items may be integrated into one section. Although the correlation of the FORM/MEAN items was low between the cloze and the GVR sections, many of the cloze items and the items in the grammar and the vocabulary sections were essentially measuring form and meaning. Given this observation, it may be unnecessary to expect the test-takers to pass both the cloze and GVR sections along with the remaining three sections of the ECPE to obtain the certificate of proficiency.

The pre- and post-EFA reliability analyses showed that reliability stayed constant after 37 items in the GVR were deleted (observing the corrected reliability based on the Spearman-Brown formula). The deletion of 15 cloze section items also resulted in an unchanged reliability estimate (0.70). Therefore, if the cloze and the GVR sections were to be combined, test developers could reduce the number of items without substantially decreasing the reliability of the section.

Conclusion

This study investigated the underlying construct of the 1997 cloze section in the ECPE,

which was developed by the English Language Institute, The University of Michigan. The question to be answered was: if the cloze section measures the same construct as the GVR section, why should the cloze section remain a distinct section of the ECPE battery? In order to answer this question, I attempted to identify the underlying construct of the cloze section and compare it with the trait structure of the GVR section.

Through a rigorous investigation using structural equation modeling, I determined that the cloze section appears to measure grammar forms and meaning rather than overall language proficiency. When the cloze and the GVR items were included in the same model, the cloze again measured forms and meaning, along with the GV items.


Although this study provides beneficial information regarding the underlying trait structure of the ECPE cloze and GVR sections, it is imperative to recognize the study's limitations. Though the underlying structure based on Purpura's model may appear to be a plausible representation of second language test performance of English as measured by the cloze and GVR sections, the results should not be generalized. There may be other models that would be a better representation of the ECPE cloze and GVR sections. In order to create an accurate and full representation of the cloze and GVR sections, this study should be replicated and the results should be confirmed by other studies with ECPE tests from different years. Furthermore, different models need to be rigorously tested for a better model fit.

In addition, it is important to recognize that distractor efficiency was not evaluated prior to performing any statistical procedures. This may have affected the credibility of the subsequent statistical analyses. When writing distractors for multiple-choice items, one should be careful to assess the intended test-taker abilities accurately. One of the guidelines to follow in writing grammar distractors is that all choices should belong to the same grammatical category. For instance, the distractors should not include prepositions when the correct answer is a conjunction. Some questions violated this guideline, and they may have caused a decrease in the reliability of the cloze section. Because the initial cloze test reliability produced a moderate value of 0.65, there is a possibility that the cloze test was not strongly measuring a homogeneous construct from the beginning. This may have affected the results of the EFA and SEM analyses. Revisions to the distractors may increase the reliability and produce a more accurate representation of the underlying construct of the cloze section.

It is also imperative to acknowledge that there are various other research questions to be asked and answered, especially concerning the ethnicity, age, and gender of the subjects. Further research could analyze these variables to develop more detailed descriptions or explanations of second language test performance for these populations. A reliability analysis followed by a confirmatory factor analysis could again be used to determine the underlying construct and the strength of relationships among the language ability variables for these groups. Such an analysis may reveal different factor loadings or even a disparate underlying trait structure.

From a methodological point of view, this study has demonstrated the significance of using various statistical procedures, especially structural equation modeling. SEM has presented evidence that it can be a powerful research tool for investigating the underlying construct of latent factors and for providing insights into the interrelationships among the latent factors as well as the observed variables.

Despite the limitations, the findings of this study have contributed to a deeper understanding of the construct validity of cloze items, which has been debated for many decades. According to Oller and Jonz (1994), the cloze procedure contributes to the understanding of the “basic theoretical questions about human mental abilities as well as urgent practical questions about designing curricula. We are convinced that answers to such fundamental questions about meaningfulness will have countless invaluable applications” (p. 12). It is hoped that this study will encourage ESL test administrators to apply these findings to improve the validity of cloze tests.


References

Abraham, R. G., & Chapelle, C. A. (1992). The meaning of cloze test scores: An item difficulty perspective. The Modern Language Journal, 76(4), 468-479.
Alderson, J. C. (1979). The cloze procedure and proficiency in English as a foreign language. TESOL Quarterly, 13(2), 219-227.
Alderson, J. C. (1980). Native and nonnative speaker performance on cloze tests. Language Learning, 30(1), 59-76.
Alderson, J. C. (1983). The cloze procedure and proficiency in English as a foreign language. In J. W. Oller (Ed.), Issues in language testing research (pp. 219-228). Rowley, MA: Newbury House.
Alderson, J. C., & Urquhart, A. H. (1985). The effect of students' academic discipline on their performance on ESP reading tests. Language Testing, 2(2), 192-204.
Bachman, L. F. (1982). The trait structure of cloze test scores. TESOL Quarterly, 16(1), 61-70.
Bachman, L. F. (1985). Performance on cloze tests with fixed-ratio and rational deletions. TESOL Quarterly, 19(3), 535-556.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford, UK: Oxford University Press.
Bensoussan, M., & Ramraz, R. (1984). Testing EFL reading comprehension using a multiple-choice rational cloze. Modern Language Journal, 68(3), 230-239.
Black, J. H. (1993). Learning and reception strategy use and the cloze procedure. The Canadian Modern Language Review, 49(3), 418-445.
Brown, J. D. (1984). A cloze is a cloze is a cloze? In J. Handscombe, R. A. Orem, & B. P. Taylor (Eds.), On TESOL '83: The question of control (pp. 109-119). Washington, DC: Teachers of English to Speakers of Other Languages.
Brown, J. D. (1988). Tailored cloze: Improved with classical item analysis techniques. Language Testing, 5(1), 19-31.
Brown, J. D., Yamashiro, A. D., & Ogane, E. (2001). The emperor's new cloze: Strategies for revising cloze tests. In T. Hudson & J. D. Brown (Eds.), A focus on language testing development (Technical Report No. 21). Honolulu, HI: Second Language Teaching & Curriculum Center, University of Hawai'i at Manoa.
Celce-Murcia, M., & Larsen-Freeman, D. (1999). The grammar book: An ESL/EFL teacher's course (2nd ed.). Boston, MA: Heinle & Heinle.
Chavez-Oller, M. A., Chihara, T., Weaver, K. A., & Oller, J. W. (1985). When are cloze items sensitive to constraints across sentences? Language Learning, 35(2), 181-206.
Chihara, T., Oller, J. W., Weaver, K. A., & Chavez-Oller, M. A. (1977). Are cloze items sensitive to constraints across sentences? Language Learning, 27(1), 63-73.
English Language Institute, The University of Michigan. (2002). Michigan Certificate Examinations general information bulletin 2001-2002. Ann Arbor, MI: English Language Institute, The University of Michigan.
Farhady, H., & Keramati, M. N. (1996). A text-driven method for the deletion procedure in cloze passages. Language Testing, 13(2), 191-207.
Fotos, S. S. (1991). The cloze test as an integrative measure of EFL proficiency: A substitute for essays on college entrance examinations? Language Learning, 41(3), 313-336.
Hale, G. A., Stansfield, C. W., Rock, D. A., Hicks, M. M., Butler, F. A., & Oller, J. W. (1988). Multiple-choice cloze items and the Test of English as a Foreign Language (Research Reports No. 26). Princeton, NJ: Educational Testing Service.
Hanania, E., & Shikhani, M. (1986). Interrelationships among three tests of language proficiency: Standardized ESL, cloze, and writing. TESOL Quarterly, 20(1), 97-109.
Hatch, E., & Lazaraton, A. (1991). The research manual: Design and statistics for applied linguistics. Boston, MA: Heinle & Heinle.
Henning, G. (1987). A guide to language testing. Boston, MA: Heinle & Heinle.
Hinofotis, F. B. (1980). Cloze as an alternative method of ESL placement and proficiency testing. In J. W. Oller & K. Perkins (Eds.), Research in language testing (pp. 121-128). Rowley, MA: Newbury House.
Irvine, P., Atai, P., & Oller, J. W. (1974). Cloze, dictation, and the Test of English as a Foreign Language. Language Learning, 24(2), 245-252.
Jonz, J. (1990). Another turn in the conversation: What does cloze measure? TESOL Quarterly, 24(1), 61-83.
Jonz, J., & Oller, J. W. (1994). A critical appraisal of related cloze research. In J. W. Oller & J. Jonz (Eds.), Cloze and coherence (pp. 371-407). London: Bucknell University Press.
Kim, J. O., & Muller, C. W. (1978). Introduction to factor analysis: What it is and how to do it. Newbury Park, CA: Sage University Press.
Klein-Braley, C. (1983). A cloze is a cloze is a question. In J. W. Oller (Ed.), Issues in language testing research (pp. 218-228). Rowley, MA: Newbury House.
Kunnan, A. J. (1995). Test taker characteristics and test performance: A structural equation modeling approach. Cambridge, UK: Cambridge University Press.
Kunnan, A. J. (1998). An introduction to structural equation modeling for language assessment research. Language Testing, 15(3), 295-332.
Laesch, K. B., & van Kleeck, A. (1987). The cloze test as an alternative measure of language proficiency of children considered for exit from bilingual education programs. Language Learning, 37(2), 171-189.
Markham, P. L. (1985). The rational deletion cloze and global comprehension in German. Language Learning, 35(3), 423-430.
Mullen, K. (1979). More on cloze tests as tests of proficiency in English as a second language. In E. Briere & F. B. Hinofotis (Eds.), Concepts in language testing: Some recent studies (pp. 21-32). Washington, DC: Teachers of English to Speakers of Other Languages.
Oller, J. W. (1972). Scoring methods and difficulty levels for cloze tests of proficiency in English as a second language. Modern Language Journal, 56(3), 151-158.
Oller, J. W. (1973). Cloze tests of language proficiency and what they measure. Language Learning, 23(1), 105-118.
Oller, J. W. (1979). Language tests at school. London: Longman Group Limited.
Oller, J. W. (1983). Evidence for a general language proficiency factor: An expectancy grammar. In J. W. Oller (Ed.), Issues in language testing research (pp. 3-10). Rowley, MA: Newbury House.
Oller, J. W., & Conrad, C. A. (1971). The cloze technique and ESL proficiency. Language Learning, 21(2), 183-194.
Oller, J. W., & Inal, N. (1971). A cloze test of English prepositions. TESOL Quarterly, 5(4), 315-326.
Oller, J. W., & Jonz, J. (1994). Why cloze procedure? In J. W. Oller & J. Jonz (Eds.), Cloze and coherence (pp. 1-20). London: Bucknell University Press.
Porter, D. (1983). The effects of quantity of context on the ability to make linguistic predictions: A flaw in a measure of general proficiency. In A. Hughes & D. Porter (Eds.), Current developments in language testing (pp. 63-74). London: Academic Press.
Purpura, J. E. (1999). Learner strategy use and performance on language tests: A structural equation modeling approach. Cambridge, UK: Cambridge University Press.
Purpura, J. E. (Forthcoming). Assessing grammar. Cambridge, UK: Cambridge University Press.
Sasaki, M. (1993). Relationships among second language proficiency, foreign language aptitude and intelligence: A structural equation modeling approach. Language Learning, 43(3), 313-344.
Sasaki, M. (2000). Effects of cultural schemata on students' test-taking processes for cloze tests: A multiple data source approach. Language Testing, 17(1), 85-114.

Shumacker, R. E., & Lomax, R. G. (1996). A beginner’s guide to structural equation modeling. Mahwah, NJ: Lawrence Erlbaum Associates.

Sciarone, A. G., & Schoorl, J. J. (1989). The cloze test: Or why small isn’t always beautiful. Language Learning, 39(3), 415-438.

Shanahan, T., Kamil, M. L., & Tobin, A. W. (1982). Cloze as a measure of intersentential comprehension. Reading Research Quarterly, 17(2), 229-255.

Storey, P. (1997). Examining the test-taking process: A cognitive perspective on the discourse cloze test. Language Testing, 14(2), 214-231.

Stubbs, J. B., & Tucker, G. R. (1974). The cloze test as a measure of English proficiency. Modern Language Journal, 58(5-6), 239-241.

Taylor, W. L. (1953). Cloze procedure: A new tool for measuring readability. Journalism Quarterly, 30(4), 414-438.

Turner, C. E. (1989). The underlying factor structure of L2 cloze test performance in Francophone, university-level students: Casual modeling as an approach to construct validation. Language Testing, 6(2), 172-197.

Page 83: Spaan Fellow Working Papers in Second or Foreign Language … · 2018-12-17 · Spaan Fellow Working Papers in Second or Foreign Language Assessment Volume 1 2003 Edited by Jeff S

79

Appendix A

GVR Section - Descriptive Statistics

Item    Mean    Std. Dev    Kurtosis    Skewness
G81     0.84    0.36         1.55       -1.89
G82     0.84    0.37         1.37       -1.83
G83     0.52    0.50        -1.99       -0.09
G84     0.86    0.35         2.33       -2.08
G85     0.70    0.46        -1.21       -0.89
G86     0.76    0.43        -0.54       -1.21
G87     0.77    0.42        -0.29       -1.31
G88     0.36    0.48        -1.65        0.59
G89     0.93    0.26         8.95       -3.31
G90     0.93    0.26         8.86       -3.30
G91     0.84    0.36         1.63       -1.91
G92     0.81    0.39         0.63       -1.62
G93     0.79    0.41         0.04       -1.43
G94     0.30    0.46        -1.19        0.90
G95     0.36    0.48        -1.66        0.58
G96     0.92    0.27         7.85       -3.14
G97     0.91    0.29         6.05       -2.84
G98     0.91    0.29         5.66       -2.77
G99     0.79    0.41        -0.01       -1.41
G100    0.96    0.20        19.74       -4.66
G101    0.76    0.43        -0.58       -1.20
G102    0.95    0.21        16.93       -4.35
G103    0.86    0.35         2.31       -2.08
G104    0.83    0.38         1.07       -1.75
G105    0.62    0.49        -1.78       -0.47
G106    0.86    0.35         2.34       -2.08
G107    0.36    0.48        -1.64        0.60
G108    0.72    0.45        -1.08       -0.96
G109    0.87    0.33         3.04       -2.25
G110    0.80    0.40         0.31       -1.52
G111    0.80    0.40         0.21       -1.49
G112    0.93    0.26         8.65       -3.26
G113    0.31    0.46        -1.35        0.81
G114    0.41    0.49        -1.88        0.35
G115    0.87    0.34         2.73       -2.17
G116    0.69    0.46        -1.35       -0.81
G117    0.99    0.11        75.99       -8.83
G118    0.51    0.50        -2.00       -0.04
G119    0.72    0.45        -1.09       -0.96
G120    0.60    0.49        -1.85       -0.39


GVR Section - Descriptive Statistics (cont.)

Item    Mean    Std. Dev    Kurtosis    Skewness
V121    0.77    0.42        -0.43       -1.25
V122    0.50    0.50        -2.00       -0.01
V123    0.85    0.36         1.93       -1.98
V124    0.26    0.44        -0.84        1.08
V125    0.53    0.50        -1.98       -0.14
V126    0.49    0.50        -2.00        0.03
V127    0.48    0.50        -1.99        0.09
V128    0.38    0.48        -1.75        0.50
V129    0.77    0.42        -0.31       -1.30
V130    0.68    0.47        -1.43       -0.76
V131    0.31    0.46        -1.35        0.80
V132    0.35    0.48        -1.62        0.61
V133    0.70    0.46        -1.27       -0.85
V134    0.41    0.49        -1.88        0.35
V135    0.80    0.40         0.18       -1.48
V136    0.20    0.40         0.28        1.51
V137    0.75    0.43        -0.61       -1.18
V138    0.73    0.45        -0.98       -1.01
V139    0.38    0.49        -1.76        0.49
V140    0.39    0.49        -1.81        0.44
V141    0.53    0.50        -1.98       -0.13
V142    0.61    0.49        -1.81       -0.43
V143    0.84    0.37         1.28       -1.81
V144    0.27    0.44        -0.94        1.03
V145    0.42    0.50        -1.89        0.33
V146    0.65    0.48        -1.62       -0.62
V147    0.49    0.50        -2.00        0.06
V148    0.38    0.48        -1.74        0.51
V149    0.42    0.49        -1.89        0.33
V150    0.59    0.49        -1.86       -0.38
V151    0.26    0.44        -0.75        1.12
V152    0.41    0.49        -1.88        0.35
V153    0.79    0.40         0.13       -1.46
V154    0.63    0.48        -1.70       -0.55
V155    0.27    0.44        -0.87        1.06
V156    0.75    0.43        -0.63       -1.17
V157    0.55    0.50        -1.95       -0.21
V158    0.84    0.37         1.41       -1.85
V159    0.81    0.40         0.37       -1.54
V160    0.59    0.50        -1.87       -0.36


GVR Section - Descriptive Statistics (cont.)

Item    Mean    Std. Dev    Kurtosis    Skewness
R161    0.93    0.26         9.31       -3.36
R162    0.86    0.34         2.54       -2.13
R163    0.81    0.39         0.54       -1.59
R164    0.90    0.31         4.70       -2.59
R165    0.98    0.15        39.40       -6.43
R166    0.76    0.43        -0.47       -1.24
R167    0.75    0.43        -0.69       -1.15
R168    0.83    0.38         0.95       -1.72
R169    0.67    0.47        -1.48       -0.72
R170    0.78    0.41        -0.10       -1.38
R171    0.84    0.36         1.57       -1.89
R172    0.82    0.39         0.67       -1.64
R173    0.68    0.47        -1.42       -0.76
R174    0.77    0.42        -0.36       -1.28
R175    0.86    0.35         2.33       -2.08
R176    0.50    0.50        -2.00        0.00
R177    0.57    0.50        -1.93       -0.27
R178    0.76    0.43        -0.51       -1.22
R179    0.63    0.48        -1.72       -0.53
R180    0.58    0.49        -1.89       -0.33


Appendix B

Cloze Section - Descriptive Statistics

Item    Mean    Std. Dev    Kurtosis    Skewness
CA41    0.87    0.34         2.64       -2.15
CA42    0.51    0.50        -2.00       -0.03
CA43    0.68    0.47        -1.41       -0.77
CA44    0.48    0.50        -1.99        0.09
CA45    0.40    0.49        -1.85        0.39
CA46    0.76    0.43        -0.57       -1.20
CA47    0.69    0.46        -1.32       -0.82
CA48    0.82    0.39         0.69       -1.64
CA49    0.87    0.34         2.64       -2.15
CA50    0.51    0.50        -2.00       -0.03
CA51    0.74    0.44        -0.85       -1.07
CA52    0.80    0.40         0.28       -1.51
CA53    0.58    0.49        -1.90       -0.33
CA54    0.62    0.48        -1.74       -0.51
CA55    0.74    0.44        -0.78       -1.11
CA56    0.70    0.46        -1.21       -0.89
CA57    0.79    0.40         0.11       -1.46
CA58    0.86    0.35         2.21       -2.05
CA59    0.57    0.50        -1.93       -0.26
CA60    0.70    0.46        -1.21       -0.89
CB61    0.83    0.38         0.93       -1.71
CB62    0.90    0.30         4.95       -2.64
CB63    0.46    0.50        -1.98        0.16
CB64    0.35    0.48        -1.59        0.64
CB65    0.68    0.47        -1.43       -0.76
CB66    0.56    0.50        -1.95       -0.23
CB67    0.31    0.46        -1.36        0.80
CB68    0.61    0.49        -1.81       -0.43
CB69    0.81    0.40         0.39       -1.55
CB70    0.64    0.48        -1.64       -0.60
CB71    0.52    0.50        -1.99       -0.10
CB72    0.95    0.21        16.55       -4.31
CB73    0.77    0.42        -0.37       -1.28
CB74    0.74    0.44        -0.81       -1.09
CB75    0.97    0.16        31.28       -5.77
CB76    0.63    0.48        -1.73       -0.52
CB77    0.35    0.48        -1.60        0.64
CB78    0.94    0.25        10.47       -3.53
CB79    0.44    0.50        -1.94        0.24
CB80    0.51    0.50        -2.00       -0.06


An Investigation into Answer-Changing Practices on Multiple-Choice Questions with Gulf Arab Learners in an EFL Context

Mashael Al-Hamly, Kuwait University
Christine Coombe, Dubai Men's College

This study investigates whether the practice of answer changing on multiple-choice questions (MCQs) is beneficial to Gulf Arab students' overall test performance. The proficiency exam used in this research project is the Michigan English Language Institute College English Test—Grammar, Cloze, Vocabulary, Reading (MELICET-GCVR), which was developed using retired forms of the Michigan English Language Assessment Battery (MELAB). This proficiency exam was administered to 286 students at Kuwait University and Dubai Men's College, Higher Colleges of Technology. From these data, inferences are drawn as to whether changing answers is an effective strategy for Gulf Arab learners. It is hoped that this research will provide baseline data on the answer-changing behavior of EFL students taking an English-language proficiency exam. The effects of students' level of test anxiety and of item difficulty on answer changing are also reported. Finally, information is provided on the efficacy of one specific objective-test strategy, that of changing answers.

Throughout a student's formal education, one of the most widely used devices for evaluating progress and performance has been the multiple-choice test or some variation of this objective format. Educators agree that the skills students need to do well on a test depend on the skill area assessed and the question format (Foster, Paulk, and Reiderer, 1999; Burdess, 1991; Scruggs and Marsing, 1988). A frequently used format in large-scale, high-stakes English language proficiency exams is the multiple-choice question (MCQ). A statement often heard in test preparation classes dealing with MCQ formats is "go with your first response." The question arose as to whether this is a research-based practice (Torrence, 1986) or simply anecdotal advice or conventional wisdom on the part of the teacher. Stough (1993) suggests that most skills-based instructional programs that focus on test-taking strategies tend to give students "common sense" suggestions rather than empirically verifiable strategies.

Several researchers have examined the answer-changing behavior of students taking objective tests. Most of the research is aimed at testing the accuracy of "first impressions" in test taking. This bit of academic folk wisdom is typically stated as the belief that "one should not change answers on objective tests because initial reactions to test questions are intuitively more accurate than subsequent responses" (Benjamin, Cavell and Shallenberger, 1984, p. 133; see also Hanna, 1989).

The first empirical study on answer changing dates back to 1929, when Mathews investigated the answer-changing behavior of college-level students in educational psychology courses and found that more than 53% of the answers changed on MCQs were changed from wrong to right. The basic finding of this first important study was that for every point lost, roughly 2 to 3 points were gained. Later researchers have replicated these results with remarkable consistency. (For a general review of the early literature, see Mueller and Wasser, 1977; Benjamin et al., 1984.)

Since 1929, at least 56 studies have been published on issues surrounding answer-changing behavior on objective tests. The most consistent finding in these studies has been that there is nothing inherently wrong with changing initial answers. In fact, empirical evidence uniformly indicates that a) only a small percentage of items are actually changed, b) most of these changes are from wrong to right answers, c) most test-takers are answer changers, and d) most answer changers are point gainers.

Prior studies have investigated a number of different variables and their relationship to answer changing. Researchers have examined the possible effects of gender on answer-changing behavior (Bath, 1967; Copeland, 1972; Geiger, 1991a; Reiling and Taylor, 1972), the subject's ethnicity (Payne, 1984), differences due to item difficulty (Green, 1981; Ramsey, Ramsey and Barnes, 1987; Vidler and Hansen, 1980), differences due to item type (Geiger, 1991b), the cognitive styles of students (Friedman and Cook, 1995), students' perceptions of answer changes (Geiger, 1991a; Mathews, 1929; Prinsell, Ramsey and Ramsey, 1994; Skinner, 1983), and answer changes as a function of test anxiety (Green, 1981). In addition, researchers have found a positive relationship between gains from answer changing and overall student performance (Friedman and Cook, 1995). Testwiseness is another area that has received empirical attention: Millman, Bishop and Ebel (1975) indicate that the judicious changing of one's original answer selection is a basic aspect of being testwise. Appendix A details a fairly comprehensive list of the studies conducted on answer changing.

Test Anxiety

Test anxiety, as defined by Dusek (1980), is "an unpleasant feeling or emotional state that has psychological and behavioral commitments and that is experienced in formal testing or other evaluative situations" (p. 88). Similarly, Spielberger (1983) defined anxiety as "a subjective feeling of tension, apprehension, nervousness, and worry associated with arousal of the autonomic nervous system." Test anxiety has been shown to be one of the most important negative motivators in education and has direct, sometimes debilitating effects on school success (Hill and Wigfield, 1984). When anxiety is restricted to the language learning situation, it falls into the category of specific anxiety reactions (El-Banna, 1989, p. 6).

Green (1981) investigated answer changing as a function of test anxiety with undergraduate students of statistics. Her findings indicate that students who experience more test anxiety make more item-response changes than students who experience less test anxiety. Her results also suggest that both high and low anxiety students profit to a similar extent proportionally from answer changing (p. 225). In addition to her research on test anxiety, she also examined item difficulty. Her findings indicate that more answers were changed on difficult rather than on easy items by students who experience both high and low levels of test anxiety.

Rationale for the Study

The present study is unique in at least three ways. First, although there is a large body of literature on answer changing and students' test performance in content areas, very little, if anything, has been done in the field of language learning in general and English as a second or foreign language (ES/FL) in particular. Second, a large percentage of the studies reported in the literature were conducted using achievement tests rather than proficiency tests, as in the present study. Finally, and most important, no available studies in the literature have investigated the relationship between language test anxiety and the answer-changing behavior of ES/FL learners.

Purpose of the Study

The primary purpose of the present study is to determine the effect of answer changing on overall test scores. Because there are no studies available on our student population (tertiary-level students in the Arabian Gulf), or our content area under study (EFL) and test type (English language proficiency), this study attempts to provide baseline data on the above variables. More specifically, this study aims at answering three types of questions: general questions, student-related questions, and test-item related questions.

This research project posits the following research questions:
1. Do Gulf Arab students engage in the practice of answer changing on MCQs? If so, to what extent?
2. Do the students in this study benefit from answer changing in terms of gains in total test scores on the MELICET-GCVR?
3. Do the results for students taking an EFL test in this study support the answer changing research carried out in other content areas, which asserts that change usually produces gain?
4. Is there a significant difference between female and male students in relation to the number of answer changes made and the quality of these changes?
5. Is there a significant correlation between students' test anxiety and the number and quality of answer changes made?
6. Is there a significant correlation between students' self-estimates of performance on both the total and sub-sections of the MELICET-GCVR and the number and quality of answer changes made?
7. Is there a significant correlation between students' proficiency levels as measured by the Oxford Quick Placement Test (QPT) and the MELICET-GCVR and the number and quality of answer changes made?
8. Are there any significant differences between the MELICET-GCVR sub-sections with regard to the number and quality of answer changes?

Pilot Study

A pilot study was conducted in May 2002 (approximately five months prior to the planned actual data collection). This study was conducted with the intention of piloting the instruments and projected procedures. Results of this study were used to make modifications in the research protocol. A random sample of 39 tertiary-level students from both sites participated in the pilot study.

A major purpose of the pilot study was to further determine the effectiveness of the various instruments developed by the researchers. Also of concern in the pilot study was the sequence of procedures in the protocol as well as the translations of the instruments into Arabic. In addition, the pilot study allowed the researchers to determine whether any of the experimental procedures were causing subjects difficulty. The pilot study indicated that minor adjustments needed to be made to the above-mentioned areas.

Method

Population and Sample Selection

There were 286 (147 male, 139 female) university-level students ranging in age from 17 to 32 from Dubai Men's College and Kuwait University who participated in the main study (see Table 1). The participants were enrolled in a variety of different degree programs.

Table 1. Main Study Participants
Institution            Male    Female    Total
Dubai Men's College    106     0         106
Kuwait University      41      139       180
Total                  147     139       286

Instrumentation

A number of different instruments were used in the study: the Test Anxiety Inventory (TAI), the QPT, the MELICET-GCVR, a Student's Profile Form, and a Posttest Questionnaire.

Test Anxiety Inventory

The TAI is a frequently employed and thoroughly researched self-report psychometric scale developed to measure individual differences in test anxiety in high school and college students (Spielberger, 1972, 1983). The one-page Likert-scale questionnaire asks subjects to record their degree of agreement or disagreement in twenty areas, all related to the concept of test anxiety. Response choices are (1) almost never, (2) sometimes, (3) often, and (4) almost always. "Almost never" indicates low test anxiety and is scored "1"; "almost always" indicates high test anxiety and is scored "4." The scoring weights are reversed on Item One only, since it is intended as a checks-and-balances item. All twenty items are used to determine the total TAI score. The minimum TAI total score (very low, if any, anxiety) is 20; the maximum (very high anxiety) is 80. The objective of the TAI is to learn how frequently students experience anxiety symptoms before, during, and after tests (Appendix B).
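
As a concrete illustration of this scoring rule (this sketch is ours and is not part of the TAI materials; the function name and sample responses are hypothetical), Item One is reverse-scored and all twenty item values are summed:

def tai_total(responses):
    # responses: twenty integers coded 1 (almost never) to 4 (almost always)
    assert len(responses) == 20 and all(1 <= r <= 4 for r in responses)
    reversed_item_one = 5 - responses[0]   # 1 -> 4, 2 -> 3, 3 -> 2, 4 -> 1
    return reversed_item_one + sum(responses[1:])

# Hypothetical respondent answering "sometimes" (2) to every item:
print(tai_total([2] * 20))   # 41 (Item One scored as 3, plus nineteen items scored 2)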

The TAI was originally published in English and later translated into Arabic by the Arabic-speaking author. The TAI was translated in an effort to ensure total comprehension on the part of the subjects and to decrease the risk of inaccurate information (Oller and Perkins, 1978). The translation was validated by asking another highly educated native speaker of Arabic to translate the document back into English. Every attempt was made to provide an Arabic version that was as faithful a representation of the English as possible. The back-translated versions were compared with the original English to ensure that the translation was accurate. An Arabic/English linguist and the author checked the Arabic and back-translated English versions of the instruments and compared them to the original English version. The same process was carried out with the other instruments, namely the Student's Profile Form and the posttest questionnaire, which were originally designed and written in English and then translated into Arabic.

Oxford Quick Placement Test

Subjects’ English language proficiency level was determined through the QPT. This test assesses reading, vocabulary and grammar using a typical MCQ (stem with either three or four response options) format. The paper and pen version of the test, which was used in this study, consists of two parts.

The items themselves are typical of those produced by the University of Cambridge Local Examinations Syndicate (UCLES). The first five questions on Part 1 of the test relate to signs and where they would most likely be found. Questions 6-20 are MCQs based on cloze passages. Questions 21-40 are fill-in-the-blank MCQs with a one-sentence context. A variety of language points are tested in this section. Detailed information about how to organize the administration of the two parts of the test is provided in the QPT User Manual (2000).

MELICET-GCVR

The MELICET-GCVR is a 100-item, multiple-choice test of grammar items in a conversational format, cloze items in a paragraph format, vocabulary items requiring selection of a synonym or completion of a sentence, and reading comprehension items. The MELICET-GCVR is a retired form of one section of the MELAB, which is used to assess the English language proficiency of students who are applying to universities or colleges where the language of instruction is English, and to assess the general English language proficiency of professionals who will need to use English in their work (English Language Institute, The University of Michigan, 2001).

Student's Profile Form

A form was developed for the purpose of collecting student demographic information and signed permission to participate in the study (Appendix C). This document was used to collect the following information: name, date, contact information, field of study, institution, university I.D. number, gender, and year of study. Students were also asked to read a 'permission statement' in Arabic and/or English and sign it. This permission statement was based on The University of Michigan's Human Subjects Consent Form.

Posttest Questionnaire

In order to collect information on students' self-estimates, a posttest questionnaire was developed (Appendix D). This Likert-scale questionnaire asked subjects to assess their abilities in English, based on their performance on the MELICET-GCVR, in the following areas: whole test, grammar section, cloze section, vocabulary section, and reading section. Students were asked to rate these areas on a 5-point Likert scale from 1 (very poor) to 5 (excellent). A final question on the posttest questionnaire asked students whether or not they had changed any answers on their answer sheet. To respond, subjects circled yes or no.


Data Collection Procedures

Data for the study was collected in two separate administrations. A protocol for the administration of all instruments was designed to ensure that procedures would be standard throughout data collection (Appendix E). On the first day of data collection, students were asked to complete the Student's Profile Form, sign the consent form, and fill out the TAI. The QPT was then administered under exam conditions.

On the second day of data collection, the MELICET-GCVR was administered to students, followed by the Self-Assessment Posttest Questionnaire. Not more than 10 days elapsed from one data collection period to the next.

Scoring

The answer sheets for the QPT were hand-marked using the overlay provided in the test kit, and the scores were recorded manually on the Student's Profile Form under section D, labeled 'For Tester Use Only'. The TAI was scored manually and the resulting total entered under section F on the Student's Profile Form. Because no published interpretations of the TAI existed, the researchers classified students scoring 20-39, 40-60, and 61-80 as having low, moderate, or high levels of test anxiety, respectively.
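
The banding rule just described can be stated compactly; the following sketch (ours; the function name is hypothetical) simply applies the 20-39 / 40-60 / 61-80 cut-offs:

def anxiety_band(total):
    # total: a TAI total score between 20 and 80
    if total <= 39:
        return "low"
    if total <= 60:
        return "moderate"
    return "high"

print(anxiety_band(42))   # "moderate"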

Optical mark read (OMR) sheets for the MELICET-GCVR were examined for erasures indicating answer changes. Initially a count of answer changes was made and recorded on the Student’s Profile. Then each answer change on each OMR sheet was further examined and classified according to the direction of change. These directions are as follows: Right to Wrong (R-W), Wrong to Right (W-R) and Wrong to Wrong (W-W).
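
For illustration, the direction of a single detected change can be classified from the initial mark, the final mark, and the keyed answer; the function and the example item below are hypothetical, not taken from the MELICET-GCVR materials:

def change_direction(initial, final, key):
    if initial == final:
        return None            # no change on this item
    if initial == key:
        return "R-W"           # right changed to wrong
    return "W-R" if final == key else "W-W"

# Hypothetical item keyed "b": the examinee first marked "c", then erased it and marked "b".
print(change_direction("c", "b", "b"))   # "W-R"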

The above-mentioned scores were then transferred onto the OMR sheet into the categories provided and sent to The University of Michigan for detailed, computerized statistical analysis.

Data Analysis

Students' scores on the above-mentioned instruments were statistically analyzed using the statistical software package SPSS 11.0 for Windows. Correlations were run to answer the research questions posited. In addition, item statistics for the MELICET-GCVR were computed using ITEMAN (Assessment Systems Corporation, 1996).
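
As an illustration of the two kinds of statistics involved (this is not the SPSS or ITEMAN output itself, and the toy data are ours), the sketch below computes a Pearson correlation between two score vectors and a classical item difficulty index, i.e., the proportion of examinees answering an item correctly:

import statistics

def pearson_r(x, y):
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))

def item_difficulty(scores):
    # scores: 0/1 item scores across examinees; returns the proportion correct
    return sum(scores) / len(scores)

print(round(pearson_r([25, 38, 44, 52, 67], [1, 3, 2, 4, 5]), 2))   # strong positive r in this toy data
print(item_difficulty([1, 1, 0, 1, 0]))                             # 0.6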

Results

The results are organized according to the research questions they are intended to answer.

Research Question 1: Do Gulf Arab students engage in the practice of answer changing on MCQs? If so, to what extent? Of the 286 students who took part in this study, 192 (67%) changed answers; other studies report a much higher percentage of answer changers. Students made 760 answer changes (ACs), accounting for a small percentage of actual test answers (2.65%). This finding is consistent with the literature, which reports that only a small percentage of answers, between 2% and 3%, are actually changed.

Research Question 2: Do the students in this study benefit from answer changing in terms of gains in total test scores on the MELICET-GCVR? The results of this study are consistent with previous research findings in that the greatest percentage of answer changes was from wrong to right (44%). Of the students who changed answers, 57% gained points while 19% lost points (Table 3). Answer changing for the remaining students had no positive or negative effect on their overall MELICET-GCVR score. The mean point gain was 2.05.

Research Question 3: Do the results for students taking the EFL test in this study support the answer changing research carried out in other content areas, which asserts that change usually produces gain? This study supports the idea that students who change answers on EFL tests are more likely to benefit from doing so. Unlike other studies, however, students made a higher percentage of wrong to wrong answer changes (37%) than has been previously reported in other content areas. Table 2 presents a summary of the number and direction of answer changes.

Table 2. Number and Direction of Answer Changes
Right-Wrong    Wrong-Right    Wrong-Wrong    Total ACs
144 (19%)      332 (44%)      284 (37%)      760

Research Question 4: Is there a significant difference between female and male students in relation to the number of answer changes made and the quality of these changes? In the current study there was no significant difference between males and females in the total number of answer changes; 66.9% of the females (n = 93) changed answers, as did 67.3% of the males (n = 99). Although there were slightly more male answer changers in the sample, there were more female "gainers" (62%) than male (52%). The mean gain in points due to answer changing was 2.25 and 2.05 for males and females, respectively (Table 3). As regards the current literature base, there is no established trend in answer changing vis-à-vis gender. While a number of studies indicate that males tend to make more wrong to right answer changes than females, our study indicates the contrary. Another consistent finding in the literature is that females make more total answer changes than males; again, our results indicate the opposite, with males making more answer changes (420, or 2.86 changes per person) than females (340, or 2.45 changes per person).

Table 3. Summary of Individual Gains/Losses by Gender
                       Men (n = 99)    Women (n = 93)    Total (n = 192)
Gainers    n           51              58                109
           percent     52%             62%               57%
           mean gain   2.25            2.05
Losers     n           23              13                36
           percent     23%             14%               19%
Samers     n           25              22                47
           percent     25%             24%               24%
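
The gainer/loser/samer classification in Table 3 follows from each examinee's net score change, since each Wrong-to-Right change gains one point, each Right-to-Wrong change loses one, and Wrong-to-Wrong changes have no effect on the score. A sketch (ours; the function name is hypothetical):

def net_change_category(wr_changes, rw_changes):
    net = wr_changes - rw_changes      # Wrong-to-Wrong changes do not affect the score
    if net > 0:
        return "gainer"
    if net < 0:
        return "loser"
    return "samer"

print(net_change_category(3, 1))   # "gainer" (net gain of 2 points)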

Research Question 5: Is there a significant correlation between students' test anxiety and the number and quality of answer changes made? Subjects participating in this study had a mean level of test anxiety of 42.71 (standard deviation = 11.00, minimum = 20, maximum = 71), which indicates a moderate level of test anxiety. Further findings indicate that there are very low correlations between test anxiety and both the number and quality of answer changes made. Table 4 shows the correlations between test anxiety (TAI) and both the total number of answer changes and the direction (Right-Wrong, Wrong-Right, Wrong-Wrong) of answer changes, for the candidates who made at least one answer change. None of the correlations is statistically significant.


Table 4. Correlations Between TAI and Answer Changes
       Total ACs    R-W      W-R     W-W
TAI    -.074        -.136    .024    -.102
n      286          88       141     130

Students were further subdivided into test anxiety groups, whereby students scoring from 20-39 on the TAI were classified as low test anxious and students scoring between 61 and 80 were classified as high test anxious. Of the total number of answer changers, 124 students, or 65%, were classified as low test anxious while only 20 students, or 10%, were classified as high test anxious. When comparing these two groups, it was found that low test anxious students made fewer answer changes on average (2.61) than high test anxious students (2.90). However, the average number of points gained from these answer changes was marginal, at .73 and .55 respectively. Our results on this issue support the current knowledge base that students with high levels of test anxiety make more answer changes than students with low levels of test anxiety.

Table 5 indicates the relationship between the total number and quality of answer changes and low versus high levels of test anxiety. One interesting finding is the significant correlation found between students with low test anxiety scores and the number of answer changes they made (r = .196). Further investigation reveals a significant correlation between low test anxious students and the Wrong to Wrong direction of answer changes (r = .223).

Table 5. Correlations Between Level of Test Anxiety and Number and Quality of Answer Changes
                     n      Number ACs    R-W     W-R      W-W
Low test anxious     124    .196*         .111    .110     .223*
High test anxious    20     -.121         .048    -.123    -.147
* Correlation is significant at the 0.05 level (2-tailed)

Another point of interest in the literature is the issue of gender and test anxiety. The present study supports previous research that in general females seem to have higher levels of test anxiety than males. That said, the difference between males and females is marginal with females reporting a mean level of test anxiety of 43.20 while males report 42.24.

Research Question 6: Is there a significant correlation between students’ self-estimates of performance on both total and sub-sections of the MELICET-GCVR and number and quality of answer changes made? As shown in Table 6, there seems to be a low correlation (r = .024) between students’ self-estimates of their overall performance on the MELICET-GCVR and the number of answer changes made. As expected based on overall correlations, student self-estimates on the individual sub-sections of the test indicate low correlations with number and quality of answer changes. The only statistically significant correlation is between self-assessment on the cloze section and the number of wrong to right answer changes (r = .131).


Table 6. Students' Self-Assessment and Number and Quality of Answer Changes
Self-assessment    Number ACs    R-W      W-R     W-W
Whole Test         .024          -.048    .086    -.027
Grammar            -.032         -.085    .004    -.028
Cloze              .082          .015     .131*   .015
Vocabulary         .011          .014     .017    -.009
Reading            .006          -.051    .032    .002
* Correlation is significant at the 0.05 level (2-tailed)

Research Question 7: Is there a significant correlation between students' proficiency level as measured by the QPT and the MELICET-GCVR and the number and quality of answer changes made? Descriptive statistics for the QPT and MELICET-GCVR test scores are listed in Table 7. Research question 7 results (Table 8) indicate that there is a negative correlation between students' performance on both the QPT and the MELICET-GCVR and the total number of answer changes made. In other words, high scorers tend not to change answers. In fact, the higher the student scores on a test, the more reluctant s/he is to change answers. More specifically, higher scoring students were found to make fewer Wrong to Wrong answer changes, as evidenced by the inverse relationship (p < .05) between Wrong to Wrong answer changes and scores on both the QPT and the MELICET-GCVR.

Table 7. Descriptive Statistics for QPT and MELICET-GCVR Scores
Test            Possible    Mean     Standard Deviation    Minimum    Maximum
QPT             5           1.85     0.92                  1          5
MELICET-GCVR    100         41.43    15.20                 14         95
Grammar         30          14.98    5.21                  1          29
Cloze           20          8.72     3.45                  1          20
Vocabulary      30          11.55    5.55                  0          30
Reading         20          6.20     3.58                  0          20

Table 8. Correlations Between QPT and MELICET-GCVR Scores and Number and Quality of Answer Changes
Scores          Number ACs    R-W      W-R     W-W
QPT             -.044         .046     .023    -.147*
MELICET-GCVR    -.029         -.067    .097    -.133*
* Correlation is significant at the 0.05 level (2-tailed)
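
The asterisks in Tables 4 through 8 mark correlations significant at the .05 level (two-tailed). One standard way to check such a value, given a correlation r and the relevant sample size n, is the statistic t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom. The sketch below is ours, not a description of the authors' SPSS procedure, and the r and n used are purely illustrative:

import math

def t_for_r(r, n):
    # t statistic for testing whether a Pearson correlation differs from zero
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(round(t_for_r(0.20, 124), 2))   # about 2.25, above the two-tailed .05 critical value (~1.98) for 122 df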

Research Question 8: Are there any significant differences between any MELICET-GCVR sub-sections with regard to the number and quality of answer changes? As indicated in Table 9, the number and direction of answer changes seemed to differ depending on the sub-section of the MELICET-GCVR. The highest number of answer changes took place in the Cloze section of the test, while the fewest were found in the Reading section. As far as the quality of these answer changes is concerned, students seemed to benefit most from changing their answers in the Cloze section: 53% of the answer changes in this section were from Wrong to Right. In contrast, students were least successful when they changed answers in the Vocabulary section of the test. Interestingly, the highest percentage of Wrong to Wrong answer changing behavior occurred in the Reading section. Timing might be an issue here because the Reading section is the final section of the 75-minute test. Further research is needed to investigate this phenomenon.


Table 9. MELICET-GCVR Sub-Sections and Answer Changes
Sub-section            R-W         W-R          W-W         Total ACs
Grammar (k = 30)       34 (20%)    75 (43%)     65 (37%)    174 (23%)
Cloze (k = 20)         34 (14%)    129 (53%)    82 (33%)    245 (32%)
Vocabulary (k = 30)    57 (25%)    81 (36%)     87 (39%)    225 (30%)
Reading (k = 20)       19 (16%)    47 (41%)     50 (43%)    116 (15%)

Research Question 9: Is there a significant correlation between item difficulty and frequency of change? To determine whether item difficulty was related to the number and quality of answer changes, item difficulty indices for all items on the MELICET-GCVR were correlated with answer changing. The positive correlation between item difficulty and Wrong to Right answer changes means that more examinees changed the harder items from wrong to right, and the strong negative correlation between item difficulty and Wrong to Wrong changes means that it was more common for the examinees to change the easier items from wrong to wrong (Table 10).

Table 10. Correlations Between Item Difficulty and Number and Quality of Answer Changes
                   Total ACs    R-W      W-R     W-W
Item Difficulty    -.133        -.088    .217*   -.452*
* Correlation is significant at the 0.05 level (2-tailed)
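
Research Question 9 shifts the unit of analysis from examinees to items: each of the 100 items contributes one difficulty index and one count of changes in each direction, and these item-level values are then correlated. A sketch of the aggregation step (the item numbers, records, and names below are hypothetical):

from collections import Counter

# Hypothetical change records: (item_number, direction) for every detected erasure
changes = [(7, "W-R"), (7, "W-W"), (8, "R-W"), (7, "W-R")]

per_item = {}
for item, direction in changes:
    per_item.setdefault(item, Counter())[direction] += 1

print(per_item[7]["W-R"])   # 2 Wrong-to-Right changes recorded on item 7

# Each item's per-direction counts can then be correlated with its difficulty index across all items.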

Pedagogical Implications

What may be inferred from the research results reported here and how might these results inform practice? The fact that 67% of our students changed answers and 57% of them gained points as a result of this practice supports the need for language educators to better understand the answer-changing phenomenon.

Based on several decades of research, students should not be dissuaded from changing answers, as this strategy has been shown to be effective in a number of contexts and across a wide variety of disciplines. Our study expands the existing answer-changing literature and provides valuable information about an as-yet-unstudied population, that of EF/SL learners.

Answer changing is part and parcel of test preparation skills and strategies. In high-stakes testing cultures like those in the Arabian Gulf, teachers sometimes feel uncertain about their role in preparing students for tests. It is our belief that answer changing as a test-taking strategy should be explicitly discussed in language classrooms in general and in test preparation courses in particular. Open discussion about answer changing and other successful test-taking strategies will serve to increase the transparency surrounding assessment.

Studies in other disciplines have identified and researched a number of variables that might influence the answer changing practices of their students. However, to date, no definitive findings have been offered.

Recommendations for Further Research

It should be acknowledged that the present study is just a first attempt to shed light on the answer-changing behavior and practices of one type of EFL learner, the Gulf Arab student. Further research needs to address the same issues in different EF/SL contexts.

Similar studies might investigate answer changing under actual exam conditions. The majority of students in this study reported either low or moderate levels of test anxiety. Anecdotal evidence, however, indicates that because of the high-stakes nature of testing in the Gulf, students often experience higher levels of test anxiety. This discrepancy about what is reported versus what is expected could be the result of one of two factors or a combination of both. First, data was collected under experimental conditions in a simulated testing situation. Students were told that scores received on the MELICET-GCVR would have no effect on their overall course grade. A second issue concerns the TAI used in the study. As previously mentioned, the TAI was not specifically designed to be used in a foreign/second language testing context. Rather it identifies students’ anxiety levels in general testing conditions. Although this instrument has been widely used in previous studies, it is crucial for F/SL educators to design a measure that will more precisely assess anxiety in language assessment.

More in-depth studies are also needed to investigate the variables that might affect answer changing. One important avenue for research is investigating the reasons that EF/SL students have for changing answers in an appropriate direction. Additional qualitative investigations must be conducted to determine the students’ rationale for changing answers.

Future researchers in this area could also investigate the relationship between test preparation skills and other personal variables like cognitive style and their impact on answer changing.

Conclusion

The present study corroborates previous findings in that there is nothing wrong with changing answers on an MCQ test, as most changes are from Wrong to Right and result in points gained.

When results of this study are combined with the prior literature, findings suggest that we should encourage students to change answers judiciously after they have scrutinized their original answers for more plausible alternatives. Most important, students should not be influenced by incorrect perceptions about answer changing.


Acknowledgments

The authors would like to thank administration, faculty and students at their respective institutions, Kuwait University and Dubai Men’s College, for granting them the support needed to carry out this research. Particular gratitude is expressed to the Mary Spaan Fellowship Committee at The University of Michigan for awarding the authors a 2002 Fellowship for Research in Foreign/Second Language Assessment. This study would not have been possible without the dedicated work of Mary Spaan and Jeff Johnson who provided endless hours of mentoring and encouragement. Any inaccuracies are the authors’ responsibility.

References

Assessment Systems Corporation. (1996). ITEMAN [Computer software]. St. Paul, MN: Assessment Systems Corporation.

Balance, C. (1977). Students' expectations and other answer-changing behavior. Psychological Reports, 41(1), 163-166.

Bath, J. (1967). Answer-changing behavior on objective examinations. The Journal of Educational Research, 6(3), 105-107.

Benjamin, L., Cavell, T., & Shallenberger, W. (1984). Staying with initial answers on objective tests: Is it a myth? Teaching of Psychology, 11(3), 133-141.

Burdess, N. (1991). The handbook of student skills. Sydney: Prentice Hall of Australia Pty Ltd.

Copeland, D. A. (1972). Should chemistry students change answers on multiple-choice tests? Journal of Chemical Education, 49(4), 258.

Crocker, L., & Benson, J. (1980). Does answer-changing affect test quality? Measurement and Evaluation in Guidance, 12(4), 233-239.

Dusek, J. B. (1980). The development of test anxiety in children. In I. G. Sarason (Ed.), Test anxiety: Theory, research and applications. Hillsdale, NJ: Lawrence Erlbaum Associates.

El-Banna, A. (1989). Language anxiety and language proficiency among EFL/ESL learners at university level: An exploratory investigation. Educational Resources Information Center Document (ED) 308698.

English Language Institute, The University of Michigan. (2001). MELICET-GCVR user’s manual. Ann Arbor, MI: The University of Michigan.

Foote, R., & Belinky, C. (1972). It pays to switch? Consequences of changing answers on multiple-choice examinations, Psychological Reports, 31(2), 667-673.

Foster, S., Paulk, A., & Reiderer, D. (1999). Can we really teach test-taking skills? Retrieved January, 2002, from http://nova.edu/~aed/horizons/vol13n1.html.

Friedman, S., & Cook, G. (1995). Is an examinee’s cognitive style related to the impact of answer-changing on multiple-choice tests? Journal of Experimental Education, 63(3), 199-213.

Friedman-Erickson, S. (1994, June). To change or not to change: The multiple choice dilemma. Paper presented at the Annual Institute of the American Psychological Society on the Teaching of Psychology, Washington, DC.

Geiger, M. (1991a). Changing multiple choice answers: A validation and extension. College Student Journal, 25, 181-186.


Geiger, M. (1991b). Changing multiple-choice answers: Do students accurately perceive their performance? The Journal of Experimental Education, 59(3), 250-57.

Geiger, M. (1996). On the benefits of changing multiple-choice answers: Student perception and performance. Education, 117(1), 108-116.

Geiger, M. (1997). An examination of the relationship between answer changing, testwiseness, and examination performance. The Journal of Experimental Education, 66(1), 49-60.

Green, K. (1981). Item-response changes on multiple-choice tests as a function of test anxiety. Journal of Experimental Education, 49(4), 225-228.

Hanna, G. (1989). To change answers or not to change answers: That is the question. The Clearing House, 62(9), 414-416.

Hill, K. T., & Wigfield, A. (1984). Test anxiety: A major educational problem and what can be done about it. Elementary School Journal, 85(1), 105-126.

Lynch, D., & Smith, B.C. (1972, April). To change or not to change item responses when taking tests: Empirical evidence for test takers. Paper presented at the Annual Meeting of the American Educational Research Association, Chicago, IL.

Mathews, C.O. (1929). Erroneous first impressions on objective tests. Journal of Educational Psychology, 20(4), 280-286.

McMorris, R., Lawrence, P., & Schwarz, S. (1987). Attitudes, behaviors, and reasons for changing responses following answer-changing instruction. Journal of Educational Measurement, 24(2), 131-143.

McMorris, R., & Hoops Weideman, A. (1986). Answer changing after instruction on answer changing. Measurement and Evaluation in Counseling and Development, 19(2), 93-101.

Mercer, M. (1979, April). Answer changing and students’ test scores. Paper presented at the 63rd Annual Meeting of the American Educational Research Association, San Francisco, CA.

Millman, J., Bishop, C.H., & Ebel, R. (1975). An analysis of testwiseness. Educational and Psychological Measurement, 12(1), 251-254.

Mueller, J., & Wasser, V. (1977). Implications of changing answers on objective test items. Journal of Educational Measurement, 14(1), 9-13.

Oller, J. W., & Perkins, K. (1978). Language proficiency as a source of variance in self-reported affective variables. In J. W. Oller and K. Perkins (Eds.), Language in education: Testing the tests (pp. 103-125). Rowley, MA: Newbury House.

Pascale, P. (1974). Changing initial answers on multiple-choice achievement tests. Measurement and Evaluation in Guidance, 6(4), 236-238.

Payne, B. (1984). The relationship of test anxiety and answer-changing behavior: An analysis by race and sex. Measurement and Evaluation in Guidance, 16(4), 205-210.

Penfield, D., & Mercer, M. (1980). Answer changing and statistics. Educational Research Quarterly, 5(1), 50-57.

Prinsell, C., Ramsey, P. H., & Ramsey, P. P. (1994). Score gains, attitudes and behavior changes due to answer-changing instruction. Journal of Educational Measurement, 31(4), 327-337.

Quick Placement Test Manual. (2000). Cambridge, UK: Cambridge University Press.

Ramsey, P., Ramsey, P. P., & Barnes, M. J. (1987). Effects of student confidence and item difficulty on test score gains due to answer changing. Teaching of Psychology, 14(4), 206-210.


Reile, P., & Briggs, L. (1952). Should students change their initial answers on objective-type tests?: More evidence regarding an old problem. Journal of Educational Psychology, 43(2), 110-115.

Reiling, E., & Taylor, R. (1972). A new approach to the problem of changing initial responses to multiple-choice questions. Journal of Educational Measurement, 9(1), 67-70.

Scruggs, T. E., & Marsing, L. (1988). Teaching test-taking skills to behaviorally disordered students. Behavioral Disorders, 13(4), 240-244.

Schwarz, S., McMorris, R., & DeMers, L. (1991). Reasons for changing answers: An evaluation using personal interviews. Journal of Educational Measurement, 28(2), 163-171.

Shatz, M. A., & Best, J. B. (1987). Students’ reasons for changing answers on objective tests. Teaching of Psychology, 14(4), 241-242.

Skinner, N. (1983). Switching answers on multiple-choice questions: Shrewdness or shibboleth? Teaching of Psychology, 10(4), 220-221.

Slem, C. (1985, August). The effects of an educational intervention on answer changing. Paper presented at the 93rd Annual Convention of the American Psychological Association, Los Angeles, CA.

Smith, M., White, K., & Coop, R. (1979). The effect of item type on the consequences of changing answers on multiple choice tests. Journal of Educational Measurement, 16(3), 203-208.

Spielberger, C. D. (1972). Conceptual and methodological issues in anxiety research. In C.D. Spielberger (Ed.) Anxiety: Current trends in theory and research 2 (pp. 481-493). New York: Academic Press.

Spielberger, C. D. (1983). Manual for the State-Trait Anxiety Inventory Form Y. Palo Alto, CA: Consulting Psychologists Press.

Stallworth-Clark, R., Cohran, J., & Scott, J. (1998, November). Test anxiety and effect of anxiety-reduction training on students’ performance on the Georgia Regents’ Reading Exam. Paper presented at the Annual Meeting of the Georgia Educational Research Association, Atlanta, GA.

Stoffer, G., Davis, K., & Brown, J. (1977). The consequences of changing initial answers on objective tests: A stable effect and a stable misconception. Journal of Educational Research, 70(5), 272-276.

Stough, L. (1993, April). Research on multiple-choice questions: Implications for strategy instruction. Paper presented at the Annual Convention of the Council for Exceptional Children, San Antonio, TX.

Torrence, D. (1986, October). Changing answers as a test taking strategy for taking objective tests. Paper presented at American Educational Research Association Conference, Kansas City, MO.

Vidler, D., & Hansen, R. (1980). Answer changing on multiple-choice tests. Journal of Experimental Education, 49(1), 18-20.

Wagner, D., Cook, G., & Friedman, S. (1998). Staying with their first impulse? The relationship between impulsivity/reflectivity, field dependence/field independence and answer changes on multiple-choice exams in a fifth-grade sample. Journal of Research and Development in Education, 31(3), 166-175.


Appendix A
Summary of Answer-Changing Studies

Each entry gives the author, year, number of subjects (Ss), content area, and main findings. (A brief sketch of how the wrong-right and right-wrong change tallies reported below translate into a net score gain follows the table.)

Balance (1977). 144 Ss; Professional.
• 86% changed answers after reconsideration
• Majority believed that changing answers leads to a loss in test scores
• Ss’ beliefs have no effect on test-taking behavior

Bath (1967). 77 Ss; Educational psychology.
• Female Ss made more answer changes
• Changing answers did not lower the score
• Female Ss outperformed male Ss
• W-W answer changes were made mostly by low-proficiency female Ss

Benjamin (1984). Survey of 33 answer-changing studies carried out before 1984.
• A small percentage of items are actually changed
• The majority of answer changes are from wrong to right
• Most answer changers are point gainers
• Most test-takers are answer changers

Crocker & Benson (1980). 289 7th graders; Math.
• 57% of 7th graders vs. 89% of college Ss changed answers
• Young examinees benefited more from answer changes
• Teachers should not discourage answer changes

Foote & Belinky (1972). 222 Ss; Psychology.
• 55% changed their answers from W-R
• 24% changed their answers from W-W
• 21% changed their answers from R-W

Friedman-Erickson (1994). 244 Ss; Psychology & child & adult development.
• 56% changed answers from W-R
• 24% changed answers from R-W
• 20% changed answers from W-W

Friedman & Cook (1995). 106 Ss; Psychology.
• Cognitive-style variables had little impact on the canonical solution
• A combination of the effect of answer changes, the number of changes, and unit examination scores were the most influential components of the first canonical variates

Geiger (1991a). 120 university accounting Ss; Business administration.
• 59.9% changed answers from W-R
• 21% changed answers from R-W
• 19% changed answers from W-W
• No evidence that Ss behave differently toward numeric and non-numeric questions in their answer-changing behavior
• Males changed more answers on numeric Qs from R-W
• Males changed more answers on numeric Qs than females

Geiger (1991b). 124 university accounting Ss; Business administration.
• When Ss change answers on MCQs they tend to gain roughly 3 points for every point lost
• Ss underestimate the benefits of answer changes

Geiger (1996). 279 university Ss; Accounting.
• When Ss change answers on MCQs they tend to gain roughly 2-3 points for every point lost
• 73% of the Ss increased their total points for the semester
• Most Ss held negative perceptions toward answer changing

Geiger (1997). 150 college Ss; Accounting.
• Answer-changing behavior was related to performance on the MCQ part of the exam but not to performance on the non-MCQ part of the exam or to testwiseness
• Testwiseness scores were related to performance on both MCQ and non-MCQ parts of the exam

Green (1981). 70 graduate Ss; Statistics course.
• High test-anxious Ss make more answer changes than low test-anxious Ss
• All Ss benefit from answer changing
• More changes were made on difficult than on easy items for all Ss

Lynch & Smith (1972). 178 university Ss; Education courses.
• 68% of answer changing was from wrong to right
• 32% of answer changing was from right to wrong
• Significant relationship between answer changes and test score
• Nonsignificant tendency for low scorers to do more poorly than high scorers
• Low but significant relation between item difficulty and number of answer changers

Mercer (1979). 200 Ss in grades 4-7; Standardized test in math and reading.
• Answer changing increased test scores
• No significant differences in the total number of answer changes between high-scoring Ss and low-scoring Ss
• Males made significantly more wrong-right answer changes than females on math tests at grade level
• Sex and the interaction between sex and grade level did not influence the proportion of answer changes
• Sex, grade level, and their interaction did not influence the proportion of total wrong-right answer changes
• Test content seemed to be related to the proportion of total wrong-right answer changes

McMorris et al. (1987). 6 master’s-level classes; Educational/psychological measurement.
• Gains from answer changing
• The amount of change was the same for Ss instructed and uninstructed in answer changing
• Ss in the bottom third changed more answers and made more W-W changes than did the other two groups
• The most frequently given reason for changing was ‘rethought,’ followed by ‘reread,’ ‘reread and rethought,’ and clerical

McMorris & Hoops Weideman (1986). 51 graduate Ss; Educational/psychological measurement.
• Gains for informed test takers
• Most changed their answers because they rethought or reread items

Pascale (1974). 94 Ss; Measurement.
• Male Ss did better
• No effect of proficiency level on test score
• Ss improved their score when they changed answers

Penfield & Mercer (1980). 83 graduate Ss; Statistics.
• Gains from answer changing
• No sex differences noted
• High-scoring Ss make more changes than lower-scoring Ss

Payne (1984). 134 test items; Science.
• No sex differences
• Black Ss made more answer changes
• Black Ss and female Ss had higher test anxiety

Prinsell et al. (1994). 300 university Ss; Psychology and statistics.
• A significant increase in favorability toward answer changing after instruction
• Mean gain score did not change significantly after instruction
• Gains from answer changing

Ramsey et al. (1987). 95 university Ss; Psychology and business administration.
• Answer-change rate of 6.6%
• No significant effect found for ability, gender, or the interaction between ability and gender
• Significant main effect for item difficulty, Ss’ change confidence, and their interaction
• No significant effect on gain score found for ability or any interaction with ability

Reile & Briggs (1952). 124 Ss; Psychology.
• Female Ss changed more answers but profited less
• More answer changes on items placed at the beginning
• D and F Ss made more changes, but mostly from W-W and R-W
• A and B Ss made the most W-R changes

Reiling & Taylor (1972). 416 exams.
• Gains were made from changing answers
• Final grade, sex, and analytical Qs had no impact on answer-changing gains

Shatz & Best (1987). 65 UG Ss; Psychology.
• Ss who reported guessing as their reason for changing answers were not nearly as likely to benefit from their answer changing as were Ss who reported other reasons

Schwarz et al. (1991). 104 MA Ss; Educational & psychological measurement.
• Answer changes lead to gains
• All types of Ss (low, mid, high) gained from changing
• No significant relationship found between grades and answer changes
• Ss used six strategies when taking tests

Skinner (1983). 68 first-year Ss; Psychology.
• 51% changed answers from W-R
• 26.3% changed answers from R-W
• 22.3% changed answers from W-W
• Only 4% of answers were changed
• Female Ss made more changes (double)
• Male Ss made 54% successful changes
• Female Ss made 50% successful changes

Smith et al. (1979). 157 UG Ss; Educational psychology.
• Gains from answer changing
• Misconception among Ss of a negative effect of answer changing
• 68% of Ss believed answer changing leads to losses

Slem (1985). 470 college Ss; Effective study techniques.
• No significant difference found between instructed and uninstructed Ss on answer changing
• Answer changing produced less than a 1% net gain for the Ss

Stallworth-Clark et al. (1998). 90 university Ss; Reading test (GRE).
• Anxiety-reduction training helped improve test scores for anxious Ss
• As anxiety rose, test scores declined

Stoffer et al. (1977). 76 UG Ss & 107 Air Force Ss; Psychology vs. technical skills.
• Gains obtained from answer changes
• Most Ss (67% of academic Ss & 72% of military Ss) benefited from changing answers
• High scorers benefited more

Vidler & Hansen (1980). 162 college Ss; Psychology.
• Most Ss changed answers
• Most answer changes were from wrong to right
• Answer changes were made on difficult rather than easy items
• The relationship between item difficulty and direction of answer changes was nonsignificant

Wagner et al. (1998). 41 fifth-grade Ss; Science.
• Ss changed an average of 2.02 answers
• Ss gained an average of .83 points as a result of their changes
• The more impulsive Ss changed more answers and gained more points than the more reflective Ss
• Field dependence/independence was not significantly related to answer-changing behavior
• Changing answers on exams is beneficial, but the benefit differs depending on the Ss’ level of impulsivity/reflectivity
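The studies above typically classify each answer change as wrong-to-right (W-R), right-to-wrong (R-W), or wrong-to-wrong (W-W) and then report the resulting net gain. The short Python sketch below illustrates that bookkeeping; the function name, the tallies, and the one-point-per-item scoring are illustrative assumptions and are not taken from any of the studies listed.

    # Combine answer-change tallies, classified as the studies above report them,
    # into a net score change, assuming each item is worth one point.
    # All numbers here are hypothetical.

    def net_gain(w_to_r, r_to_w, w_to_w=0):
        """Net score change from answer changing (W-W changes leave the score unchanged)."""
        return w_to_r - r_to_w

    changes = {"w_to_r": 12, "r_to_w": 4, "w_to_w": 3}      # hypothetical tallies
    print(net_gain(**changes))                              # 8 points gained overall
    print(changes["w_to_r"] / changes["r_to_w"])            # 3.0 points gained per point lost

Read against the table, a roughly 3-to-1 ratio of W-R to R-W changes is what produces the "about 3 points gained for every point lost" pattern reported by Geiger (1991b).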


Appendix B

Test Anxiety Inventory


Name: ……………………………………………….

DIRECTIONS: A number of statements which people have used to describe themselves are given below. Read each statement and then circle the appropriate number to indicate how you generally feel. There are no right or wrong answers. Do not spend too much time on any one statement, but give the answer which best describes how you generally feel.

Response scale: 1 = Almost Never, 2 = Sometimes, 3 = Often, 4 = Almost Always

1. I feel confident and relaxed while taking tests ........................................ 1 2 3 4
2. While taking examinations I have an uneasy, upset feeling ..................... 1 2 3 4
3. Thinking about my grade in a course interferes with my work on tests ... 1 2 3 4
4. I freeze up on important exams ............................................................... 1 2 3 4
5. During exams I find myself thinking about whether I’ll ever get through school ... 1 2 3 4
6. The harder I work at taking a test, the more confused I get .................... 1 2 3 4
7. Thoughts of doing poorly interfere with my concentration on tests ......... 1 2 3 4
8. I feel very jittery when taking an important test ...................................... 1 2 3 4
9. Even when I’m well prepared for a test, I feel very nervous about it ....... 1 2 3 4
10. I start feeling very uneasy just before getting a test paper back ............. 1 2 3 4
11. During tests I feel very tense ................................................................. 1 2 3 4
12. I wish examinations did not bother me so much .................................... 1 2 3 4
13. During important tests I am so tense that my stomach gets upset .......... 1 2 3 4
14. I seem to defeat myself while working on important tests ..................... 1 2 3 4
15. I feel very panicky when I take an important test .................................. 1 2 3 4
16. I worry a great deal before taking an important examination ................. 1 2 3 4
17. During tests I find myself thinking about the consequences of failing ... 1 2 3 4
18. I feel my heart beating very fast during important tests ......................... 1 2 3 4
19. After an exam is over I try to stop worrying about it, but I just can’t ..... 1 2 3 4
20. During examinations I get so nervous that I forget facts I really know ... 1 2 3 4


Appendix C
Student’s Profile

• Date: …………………………………
• Name: …………………………………………………………………
• Contact Phone Number: …………………………………….
• English Course: ………………………..
• Field of Study: (circle)   English Major   Arts   Engineering
• Student University I.D. Number: ……………………………………..

A. Gender: (circle)   1 = Male   2 = Female
B. Country: (circle)   1 = Kuwait   2 = UAE
C. University Year: (circle)   1 = First   2 = Second   3 = Third   4 = Fourth

I give my permission to Kuwait University, Dubai Men’s College, and the University of Michigan to use my responses to questionnaires, test questions, and interview questions for research purposes. I understand that my name will not be revealed.

Signature of candidate …………………………………

___________________________________________________________________
For Tester Use Only:
D. QPT Score: 1 2 3 4 5
E. AnsCh: 0 = None   1   2   3   4   5   6   7   8   9 = >8
F. TAI Scale: 1 = 20-26   2 = 27-33   3 = 34-39   4 = 40-46   5 = 47-53   6 = 54-60   7 = 61-66   8 = 67-73   9 = 74-80
G. Whole Test Self-Assessment: 1 2 3 4 5
   Self-Assessment on Sub-Sections (Office): G 1 2 3 4 5   C 1 2 3 4 5   V 1 2 3 4 5   R 1 2 3 4 5
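Field F of the profile form above collapses the TAI total (20 items, each scored 1-4, so totals run from 20 to 80) into nine coded bands. The minimal Python sketch below shows that binning; the band boundaries are copied from the form, while the plain summation of responses with no reverse scoring is an assumption, since the form does not state the scoring rule.

    # Band boundaries taken from field F of the Student's Profile form.
    TAI_BANDS = [
        (20, 26, 1), (27, 33, 2), (34, 39, 3), (40, 46, 4), (47, 53, 5),
        (54, 60, 6), (61, 66, 7), (67, 73, 8), (74, 80, 9),
    ]

    def tai_band(item_responses):
        """Map 20 TAI responses (each 1-4) to the 1-9 band code recorded by the tester.

        Assumes the total is a simple sum of the responses (no reverse scoring).
        """
        total = sum(item_responses)
        for low, high, code in TAI_BANDS:
            if low <= total <= high:
                return code
        raise ValueError("TAI total %d is outside the 20-80 range" % total)

    # A respondent answering 'Sometimes' (2) to every item totals 40, i.e., band 4.
    print(tai_band([2] * 20))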


Appendix D
Post-Test Questionnaire

Student Name: …………………………………………

Dear Student:
Please help us rate your abilities in English. Read the following statements/questions about the exam you have just finished. Your answers to this questionnaire will not affect your test score. We appreciate your honesty when answering the questions.

• On a scale of one to five (from Very Poor through Average to Excellent), I think my performance on the
   Whole test is:            1 2 3 4 5
   Grammar section is:       1 2 3 4 5
   Cloze section is:         1 2 3 4 5
   Vocabulary section is:    1 2 3 4 5
   Reading section is:       1 2 3 4 5

• Did you change any of your answers after you had marked them on the answer sheet?   Yes   No


Appendix E
Protocol for the Administration of the QPT

Before & during the test
1. Put the test booklets, answer sheets, Student’s Profile Forms (which have been stapled to the TAI Scale), and pencils on the tables. (Put the Student’s Profile Form on top, stapled to the TAI Scale, then the answer sheet, and then the test booklet.)
2. Let the students into the room.
3. Ask them to use pencils only for marking their answers on the answer sheet.
4. Ask students to fill out the Student’s Profile Form and instruct them not to write anything in the bottom section of the form where it reads “For Tester Use Only.” Go around checking that all have signed the consent statement.
5. Ask students to answer all items on the TAI Scale. Give no more than 10 minutes.
6. Go around checking that all have signed the consent statement and completed the TAI. Collect the Student’s Profile Form and attached TAI.
7. Ask students to write their first name, father’s name, and family name on the answer sheet.
8. Review the instructions and examples for the QPT. Inform students that they will have 30 minutes to answer all questions on the QPT.
9. Instruct the students to start the test. Start timing the test after the instructions and examples have been explained.
10. Do not give any help in answering actual test problems, and do not translate any part of the test itself. Do not give help by spelling out words or by pronouncing words on the test. No dictionaries or other aids are allowed.
11. Inform students at half-time and five minutes before time is up. Ask students to stop writing when the time is up.
12. Collect the answer sheets and test booklets from the students.
13. Remind the students of the date and time of the MELICET-GCVR exam. Stress the importance of their presence.

Protocol for the Administration of the MELICET-GCVR

Before & during the test
1. Distribute the test booklets, answer sheets, and pencils on the tables.
2. Let the students into the room.
3. Ask them to use pencils only for marking their answers on the answer sheet.
4. Instruct the students to write their first name, father’s name, and family name, as well as their date of birth, on the bubble sheet.
5. Review the instructions and examples for the MELICET-GCVR. Inform students that they will have 75 minutes to finish answering all questions on the test.
6. Instruct the students to start the test. Start timing the test after the instructions and examples have been explained.
7. Do not give any help in answering actual test problems, and do not translate any part of the test itself. Do not give help by spelling out words or by pronouncing words on the test. No dictionaries or other aids are allowed.
8. Inform students at half-time and five minutes before time is up. Ask students to stop writing when time is up.
9. Collect the answer sheets and test booklets from the students. While doing so, distribute the post-test questionnaire to the students and read the instructions aloud to them. Ask the students to write their names at the top of the questionnaire. Give them 5-10 minutes to mark their answers. Collect the questionnaires.