
Second Language Proficiency Assessment: Current Issues. Language in Education: Theory and Practice, No. 70


DOCUMENT RESUME

ED 296 612                                                    FL 017 539

AUTHOR: Lowe, Pardee, Jr., Ed.; Stansfield, Charles W., Ed.
TITLE: Second Language Proficiency Assessment: Current Issues. Language in Education: Theory and Practice, No. 70.
INSTITUTION: ERIC Clearinghouse on Languages and Linguistics, Washington, D.C.
SPONS AGENCY: Office of Educational Research and Improvement (ED), Washington, DC.
REPORT NO: ISBN-0-13-798398-0
PUB DATE: 88
CONTRACT: 400-86-0019
NOTE: 207p.
AVAILABLE FROM: Prentice-Hall, Inc., Book Distribution Center, Route 59 at Brook Hill Dr., West Nyack, NY 10994.
PUB TYPE: Information Analyses - ERIC Information Analysis Products (071) -- Information Analyses (070)
EDRS PRICE: MF01/PC09 Plus Postage.
DESCRIPTORS: Educational History; Evaluation Criteria; *Language Proficiency; *Language Tests; *Reading Skills; Research Needs; *Second Languages; Test Theory; *Uncommonly Taught Languages; *Writing Skills

ABSTRACT
A collection of essays on current issues in the field of second language proficiency assessment includes: "The Unassimilated History" (Pardee Lowe, Jr.), which chronicles the development of proficiency testing; "A Research Agenda" (John L.D. Clark and John Lett), a discussion of research considerations and needs in proficiency testing; "Issues Concerning the Less Commonly Taught Languages" (Irene Thompson, Richard T. Thompson, and David Hiple), which examines the relevance and appropriateness of proficiency testing theory and practice for less commonly taught languages; "Issues in Reading Proficiency Assessment," including "A Framework for Discussion" (Jim Child) and "Interpretations and Misinterpretations" (June K. Phillips), discussions of proficiency testing in the government and academic contexts; and "Issues in Writing Proficiency Assessment," including "The Government Scale" (Martha Herzog) and "The Academic Context" (Anne Katz), which look at an unexplored area in proficiency testing. (MSE)

Reproductions supplied by EDRS are the best that can be made from the original document.


Language in Education: Theory and Practice

SECOND LANGUAGE PROFICIENCY ASSESSMENT: CURRENT ISSUES

Pardee Lowe, Jr., and Charles W. Stansfield, Editors

A publication of the Center for Applied Linguistics
Prepared by the ERIC Clearinghouse on Languages and Linguistics

PRENTICE HALL REGENTS Englewood Cliffs, New Jersey 07632


LANGUAGE IN EDUCATION: Theory and Practice 70

This publication was prepared with funding from the Office of Educational Research and Improvement, U.S. Department of Education, under contract no. 400-86-0019. The opinions expressed in this report do not necessarily reflect the positions or policies of OERI or ED.

Production supervision: Arthur Maisel
Cover design: Karen Stephens
Manufacturing buyer: Art Michalez

Published 1988 by Prentice-Hall, Inc.
A Division of Simon & Schuster
Englewood Cliffs, New Jersey 07632

All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in writing from the publisher.

Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

ISBN 0-13-798398-0

Prentice-Hall International (UK) Limited, London
Prentice-Hall of Australia Pty. Limited, Sydney
Prentice-Hall Canada Inc., Toronto
Prentice-Hall Hispanoamericana, S.A., Mexico
Prentice-Hall of India Private Limited, New Delhi
Prentice-Hall of Japan, Inc., Tokyo
Simon & Schuster Asia Pte. Ltd., Singapore
Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro

Language in Education: Theory and Practice

ERIC (Educational Resources Information Center) is a nationwide network of information centers, each responsible for a given educational level or field of study. ERIC is supported by the Office of Educational Research and Improvement of the U.S. Department of Education. The basic objective of ERIC is to make current developments in educational research, instruction, and personnel preparation readily accessible to educators and members of related professions.

ERIC/CLL. The ERIC Clearinghouse on Languages and Linguistics (ERIC/CLL), one of the specialized clearinghouses in the ERIC system, is operated by the Center for Applied Linguistics (CAL). ERIC/CLL is specifically responsible for the collection and dissemination of information on research in languages and linguistics and its application to language teaching and learning.

LANGUAGE IN EDUCATION: THEORY AND PRACTICE. In addition to processing information, ERIC/CLL is involved in information synthesis and analysis. The Clearinghouse commissions recognized authorities in languages and linguistics to write analyses of the current issues in their areas of specialty. The resultant documents, intended for use by educators and researchers, are published under the series title, Language in Education: Theory and Practice. The series includes practical guides for classroom teachers and extensive state-of-the-art papers.

This publication may be purchased directly from Prentice-Hall, Inc., Book Distribution Center, Route 59 at Brook Hill Dr., West Nyack, NY 10994, telephone (201) 767-5049. It also will be announced in the ERIC monthly abstract journal Resources in Education (RIE) and will be available from the ERIC Document Reproduction Service, Computer Microfilm International Corp., 3900 Wheeler Ave., Alexandria, VA 22304. See RIE for ordering information and ED number.

For further information on the ERIC system, ERIC/CLL, and CAL/Clearinghouse publications, write to ERIC Clearinghouse on Languages and Linguistics, Center for Applied Linguistics, 1118 22nd St. NW, Washington, DC 20037.

Gina Doggett, Editor, Language in Education

Acknowledgments

The editors wish to acknowledge the contributions of several individuals who helped us carry out the many tasks required to bring a volume such as this to fruition. The outside reviewers, Anthony A. Ciccone, Claire J. Kramsch, and Sally Sieloff Magnan, provided excellent direction. In addition, each chapter author reviewed and critiqued at least one other chapter, and patiently responded to many suggestions, a process that resulted in multiple drafts of each chapter. Gina Doggett, the Language in Education series editor, provided general guidance and assisted us with the editing of each chapter. Jeanne Rennie and Dorry Kenyon also provided many useful editorial suggestions. Finally, we would like to thank the Office of Educational Research and Improvement of the U.S. Department of Education, through which this publication was funded.

P.L.
C.W.S.

December 1987


Contents

Introduction ................................................... 1
    Pardee Lowe, Jr., Charles W. Stansfield

Chapter I: The Unassimilated History .......................... 11
    Pardee Lowe, Jr.

Chapter II: A Research Agenda ................................. 53
    John L.D. Clark, John Lett

Chapter III: Issues Concerning the Less Commonly
    Taught Languages .......................................... 83
    Irene Thompson, Richard T. Thompson, David Hiple

Chapter IV: Issues in Reading Proficiency Assessment
    Section 1: A Framework for Discussion .................... 125
        Jim Child
    Section 2: Interpretations and Misinterpretations ........ 136
        June K. Phillips

Chapter V: Issues in Writing Proficiency Assessment
    Section 1: The Government Scale .......................... 149
        Martha Herzog
    Section 2: The Academic Context .......................... 178
        Anne Katz

Introduction

This decade has seen a surge of interest in second language proficiency assessment within the academic community. This interest follows on a long history of activity in assessing second language proficiency within the U.S. government, which began in 1956 with the initial development of the Foreign Service Institute (FSI) Oral Proficiency Rating Scale (Sollenberger, 1978). By 1973, the Interagency Language Roundtable (ILR) had assumed primary responsibility for the scale. This intergovernmental committee comprises representatives of government agencies that are concerned with second language teaching and testing, including the FSI, the Peace Corps, the Central Intelligence Agency, the National Security Agency, the Defense Language Institute, the Department of Education, and a number of others. Together, these organizations have considerable experience and expertise in intensive second language instruction. Drawing on this experience, the ILR Testing Committee has continued to refine the government's definitions of proficiency associated with each level and each skill on its ILR scale (Interagency Language Roundtable, 1985).

In recent years, the American Council on the Teaching of Foreign Languages (ACTFL) has developed and disseminated a derivative set of proficiency guidelines (or skill-level descriptions, as they would be called by the ILR) with the assistance of the ILR and the Educational Testing Service (ETS). This set of guidelines is designed to relate to the academic learner. This learner typically differs from the government employee in a number of important ways. The typical student in the government setting is an adult learner in an immersion program who has both a utilitarian motive for studying a second language and the opportunity to apply classroom learning to daily job requirements in the target-language country. The academic learner, on the other hand, usually studies the foreign language as part of an effort to obtain a humanistic education. His or her exposure is typically limited to three to five hours per week of contact with the teacher in the classroom. Also, instead of completing this study of a language in one year or less of intensive instruction, the academic learner often continues to study it for two or more years. In short, students in academia learn the language in the classroom, while the government language-learning situation is characterized by learners who have an opportunity to "acquire" (Krashen, 1981) the language both inside and outside the classroom.

The fact that the government and the foreign language teaching profession typically encounter two quite divergent learner groups suggests that each has an important perspective to contribute to an understanding of second language proficiency and its assessment. For this reason, in this volume we have brought together a group of authors with experience in academia, in government, or in both.

As the influence of the ACTFL guidelines has been increasingly felt in academic circles, a number of concerns about them have arisen. Many concerns are quite legitimate, while others seem to reflect an incomplete understanding of the origin and intent of the guidelines. We hope that this volume, which attempts both to clarify the ACTFL and ILR documents and to propose additional work that is needed on them, will contribute to a better understanding of this approach to language proficiency assessment and further advance the field.

The history of today's proficiency testing movement can be traced from its foundations in one government agency (FSI) through its development with cooperating government agencies (the ILR) to its expansion into academia through the work of ACTFL and ETS. To reflect this developmental history, the guidelines are referred to throughout this book as the AEI (ACTFL/ETS/ILR) guidelines. Although other definitions of foreign language proficiency have appeared (Bachman & Savignon, 1986; Cummins, 1984), the AEI guidelines stand apart as the most comprehensive and widely implemented definitions to date.

Differences, however, do exist between the ILR and the ACTFL/ETS scales. The ILR scale extends from 0 (for no ability to communicate effectively or understand the language) to 5 (for ability equivalent to that of a well-educated native speaker/listener/reader/writer). The scale includes pluses at Levels 0 through 4, designating performance that substantially surpasses the requirements for a given level but fails to be sustained at the next higher level, thus furnishing an 11-range scale.

The ACTFL/ETS scale, derived from the ILR scale, provides three distinctions each at the ILR 0/0+ and 1/1+ levels. Thus, it is more sensitive than the ILR scale at the lower levels of proficiency. Note, however, that the ACTFL/ETS scale places all ILR levels at 3 and above (i.e., 3, 3+, 4, 4+, and 5) under an omnibus designation, Superior. The ACTFL/ETS scale thus has 9 ranges. Moreover, the ACTFL/ETS scale bears prose designations, such as ACTFL/ETS Advanced, rather than the ILR numeric designations, such as ILR Level 2. The two scales are compared in the figure below.

In this volume, ten experts in second language testing draw on their experience to identify significant issues and share perceptions and concerns about possible solutions. The authors make no attempt to hide their struggle with the implications of the AEI system. Thus their writings reflect the provisional nature of available knowledge on assessing second language proficiency. The present state of flux is at least partially due to the melding of two different traditions of measurement: a top-down, holistic approach deriving from the work of the ILR, and the various uncoordinated systems in use in academia that are traditionally bottom-up and atomistic in nature.

ILR Scale    ACTFL/ETS Scale
             Reading & Listening                   Speaking & Writing
5            Distinguished                         Superior
4+           Distinguished                         Superior
4            Distinguished                         Superior
3+           Superior                              Superior
3            Superior                              Superior

             All Skills
2+           Advanced Plus
2            Advanced
1+           Intermediate-High
1            Intermediate-Mid / Intermediate-Low
0+           Novice-High
0            Novice-Mid / Novice-Low / Absolute Zero

Relationship of ILR Scale to ACTFL/ETS Scale
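As a purely illustrative aside, the correspondence just described can also be captured as a small lookup table. The minimal Python sketch below encodes the mapping for speaking and writing only, with the eleven ILR ranges on the left and the nine ACTFL/ETS designations on the right; the string codes for plus levels and the function name are assumptions made for the example, not part of either scale document.

# Illustrative sketch: ILR-to-ACTFL/ETS correspondence for speaking and writing.
# ILR 3 and above collapse into the omnibus ACTFL/ETS designation "Superior";
# ILR 0 and 1 each split into two ACTFL/ETS ranges.
ILR_TO_ACTFL = {
    "0":  ["Novice-Low", "Novice-Mid"],
    "0+": ["Novice-High"],
    "1":  ["Intermediate-Low", "Intermediate-Mid"],
    "1+": ["Intermediate-High"],
    "2":  ["Advanced"],
    "2+": ["Advanced Plus"],
    "3":  ["Superior"], "3+": ["Superior"],
    "4":  ["Superior"], "4+": ["Superior"],
    "5":  ["Superior"],
}

def actfl_for(ilr_level: str) -> str:
    """Return the ACTFL/ETS designation(s) for a given ILR level."""
    return " / ".join(ILR_TO_ACTFL[ilr_level])

print(actfl_for("2"))   # Advanced
print(actfl_for("4+"))  # Superior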


The volume focuses on the skills of speaking, reading, and writing. The authors of each chapter treat them separately and in mixes that best illustrate the issues they present.


Speaking, the most fully understood and studied skill in the history of proficiency assessment, forms the major focus of the first three chapters, "The Unassimilated History," "A Research Agenda," and "Issues Concerning the Less Commonly Taught Languages," but receives only passing attention in the remaining chapters. The book's concluding chapters, each containing a section on testing in the government and a section on testing in the academic sector, are devoted to the nature and assessment of reading and writing proficiency.

The first chapter deals with the causes, gaps, and misperceptions that seem to have led many to assume that the ACTFL/ETS guidelines appeared on the scene suddenly in the early 1980s. The chapter seeks to describe language proficiency assessment as it was carried out before then, both in academia and in the U.S. government. It attempts neither to apologize for nor to criticize this prior history, but maintains that this history must be understood in order to understand the AEI concept of proficiency more fully, to use it more intelligently, and to investigate it more thoroughly.

It is true that, in the decades preceding the 1980s, the relevant research conducted within the government was not disseminated to academia. However, in the 1950s, when the ILR definitions were first written, the profession may not have been ready to assimilate a type of testing that stressed functional foreign language skills, thus radically differing from academia's focus on literature and culture. In addition, there seem to be aspects of the ILR system that were never really made known outside the government, such as the realignment of tasks according to difficulty across all the skill modalities in 1978-79. Nor did manuals exist on how to administer a proficiency test. A manual on oral interview testing was not begun in earnest until 1981. To this day, there are no ILR manuals describing the testing of reading, listening, or writing. Thus, the nonassimilation of the developmental history prior to the 1982 appearance of the ACTFL/ETS Guidelines is partially understandable, but remains a significant problem to be overcome before the nature of the AEI scales can be truly understood.

The second chapter of this volume proposes a research agenda, examining proficiency testing in light of validity and reliability concerns. The strengths of the AEI model are discussed, along with areas of needed research, future development, and possible modifications. The nature of the language tested and its appropriateness and intelligibility as judged by native speakers are also treated. Suggestions are made about standardization of the interview process and optimum interview length. Attention is also devoted to the need for research into proficiency and language attrition.

The third chapter addresses the application of proficiency guidelines to the less commonly taught languages. Questions of relevance and appropriateness in theory and practice are addressed. In addition, questions of Eurocentric bias and the impact of applying the provisional generic guidelines to languages with different typologies are traced with reference to speaking and to reading. Theoretical and practical problems in adapting the guidelines to specific languages are discussed, and finally, policy issues affecting the various constituencies are raised, including the role of the federal government and the language and area studies centers.

The fourth chapter's first section, on evaluating reading proficiency in the government setting, distinguishes between skill-level statements, typology of texts, and reader performance. The author proposes refinements to the reading skill-level definitions and addresses the conflict between idealized testing of reading and the practical issues that must be faced when actual performance is to be rated. The hierarchical structure of reading tasks for both native and nonnative readers is discussed, with a focus on the problem that arises when foreign language students are turned into decoders of advanced-level texts. An example is the Russian-language student who attempts to decipher Dostoevsky with dictionary in hand. This approach to reading comprehension contrasts with an alternate strategy that seeks automatic comprehension of texts at a near-native reading rate.

The second section, on reading in the academic setting, discusses the instructional implications of both the traditional approach and that suggested by a proficiency orientation. The chapter describes the reading process and reader strategies. It examines how the higher-level nonnative reader pulls together reading strategies, as well as native- and target-language linguistic ability, to read mid- and higher-level texts of a general nature.

The last chapter addresses a heretofore unexplored area of proficiency testing: writing. The first section, concerning the testing of writing proficiency in the government setting, treats the history and use of the ILR writing scale. Improvements to the scale are proposed and areas of research suggested.

The second section, concerning writing in an academic setting, examines the concept of the major writing assignment, as well as the differences between testing the writing of native English users and testing writing proficiency in a second or foreign language. The role of the guidelines in assessing foreign language writing competence concludes the chapter.

For reasons of both space and development, it has been impossible to include three topics related to the present discussion: listening comprehension, culture, and translation. Noting that listening tasks can be categorized into participative listening and nonparticipative listening (Phillips & Omaggio, 1984; Valdman, 1987), let it be said here that participative listening may not be as readily separated from speaking as the ACTFL/ETS Guidelines and the ILR Skill Level Descriptions imply. As central as culture is to the field of foreign language education, cultural skills differ significantly from the skills described in other guidelines, as is substantiated by the developers of the ACTFL Proficiency Culture Guidelines (see Hiple, 1987; Galloway, 1987). Thus the topic of culture lies outside the domain of the present work. Translation, too, as a bicultural, bilingual, and multiple-skill undertaking, presents issues beyond the scope of this volume.

Lastly, because this book focuses on proficiency assessment, space permits only passing comment on the implications of proficiency for curriculum design and classroom methodology. A full treatment of its effects on curriculum is properly the subject of another volume.

A driving force behind proficiency assessment has been the search for a common metric (Educational Testing Service, 1981) of a language user's ability regardless of the text taught, instructional method used, or site of instruction. A proficiency score is thus a characterization of how well a person can function in the language in question: how well he or she speaks, writes, reads, or understands the language. Although the need for a national metric is open to debate, the establishment of a common measure will be possible only when we can adequately and accurately assess the level of skills language users possess. We hope this volume will contribute to an understanding of this measurement process by both classroom foreign language teachers and second language testing professionals.

Pardee Lowe, Jr., Office of Training and Education, Central Intelligence Agency;

and Charles W. Stansfield, Director, ERIC Clearinghouse on Languages and Linguistics

NOTE

1. The ILR is a consortium of government agencies with a need to hire, train, test, and use employees whose jobs require skills in a foreign language. The ILR meets monthly in plenary session (except in summer) and regularly in committees, including management, curriculum and research, computer-based training, and testing committees. The testing committee prepared the skill-level descriptions referred to in this volume.


References

Bachman, L.F., & Savignon, S.J. (1986). The evaluationof communicative language proficiency: A critiqueof the ACTFL oral interview. Modern LanguageJournal, 70, 291-97.

Cummins, J. (1984). Wanted: A theoretical frameworkfor relating language proficiency to academicachievement among bilingual students. In C.Rivera (Ed.), Language proficiency and academicachievement. Clevedon/Avon, England: Multilin-gual Matters.

Educational Testing Service. (1981). A common metricfor language proficiency (Final Report for Depart-ment of Education Grant No. G008001739). Prince-ton, NJ: Author.

Galloway, V. (1987). From defining to developing pro-ficiency: A look at the decisions. In H. Byrnes & M.Canale (Eds.): Defining and developing proficiency:Guidelines, implementations, and concepts. Hast-ings-on-Hudson, NY: Lmerican Council on theTeaching of Foreign Languages.

Hiple, D.V. (1987). A progress report on ACTFL Profi-ciency Guidelines, 1 982-1 986. In H. Byrnes & M.Canale (Eds.), Defining and developing proficiency:Guidelines, implementations, and concepts. Hast-ings-on-Hudson, NY: American Council on theTeaching of Foreign Languages.

Interagency Language Roundtable. (1985). Languageskill level descriptions (Internal Document). Wash-ington, DC: Author. Also available in Appendix Eof R.P. Duran, M. Canale, J. Penfield, C.W.Stansfield, & J.E. Liskin-Gasparro (1985). TOEFL

4 6

10 PROFICIENCY ASSESSMENT

from a communicative viewpoint on language profi-ciency: A working paper (TOEFL Research Report17). Princeton, NJ: Educational Testing Service.

Krashen, S.D. (1981). Second language acquisition andsecond language learning. Oxford, England: Perga-mon Press.

Phillips, J.K., & Omaggio, A.C. (Eds.). (1984). [SpecialIssue on the 1983 ACTFL Symposium on ReceptiveLanguage Skills). Foreign Language Annals, 17(4).

Sollenberber, H.E. (1978). Development and currentuse of the FSI Oral Interview Test. In J.L.D. Clark(Ed.), Direct testing of speaking proficiency: Theoryand application. Princeton, NJ: Educational Test-ing Service.

Valdman, A. (Ed.) (1987). Proceedings of the Sym-posium on the Evaluation of Foreign LanguageProficiency (March 1987). Bloomington, IN: Indi-ana University.

17

The Unassimilated History

by Pardee Lowe, Jr., Office of Training and Education, CIA

As the proficiency movement matures, there seem to be as many opinions on the nature of proficiency and its merits as there are participants in the debate. Some confusion surrounds the genesis of the movement, which is thought by many to have begun in 1982, when the ACTFL Provisional Proficiency Guidelines were published. These are often called the ACTFL/ETS guidelines because the American Council on the Teaching of Foreign Languages (ACTFL) and the Educational Testing Service (ETS) cooperated in their production. When the guidelines were revised in 1986, the qualifier "provisional" was dropped, and they became the ACTFL Proficiency Guidelines.

Proficiency testing procedures2 were used by the government for years before the guidelines were developed for use in academia (Jones, 1975; Liskin-Gasparro, 1984a; Lowe, 1985b; Sollenberger, 1978; Wilds, 1975). However, until now, the early stages of the guidelines' history seem to have been unassimilated by the second language education community. It is the task of this chapter to record this unassimilated history in the hope of clearing up the misunderstandings that consistently arise in the current literature and at conferences where proficiency is discussed. However, this chapter is not intended for those who are new to the proficiency concept. It necessarily begins in medias res, specifically addressing those already generally familiar with proficiency testing and struggling with the issues it presents, issues that delving into its prior history should help to clarify.

Defining AEI Proficiency

The designation AEI is a ready clue to the guidelines' origins. To be historically correct, the acronym should be ILR/ETS/ACTFL, because after the State Department's Foreign Service Institute (FSI) originated the proficiency assessment system, the Interagency Language Roundtable (ILR) refined it. It was first used outside the government by ETS to rate the language proficiency of Peace Corps volunteers. ETS later, through the Common Metric Project in 1979, examined the system's suitability for use in academia (ETS, 1981). Finally, ACTFL adopted ETS' recommendations and produced the guidelines much as they exist today.

The ACTFL/ETS scale was purposely derived from and designed to be commensurate with the ILR scale. As a result, the designation AEI expressly refers to the many facets the two systems share. To reflect this joint history, an operational definition outlining a common core might read:

    AEI proficiency equals achievement (ILR functions, content, accuracy) plus functional evidence of internalized strategies for creativity expressed in a single global rating of general language ability over a wide range of functions and topics at any given ILR level. (Lowe, 1985b, 1986)3

Achievement is used here as it is traditionally understood in the academic classroom, as well as in the government sense of mastery of ILR functions, content, and accuracy when these are memorized but not used creatively in a consistent and sustained manner. Functional refers to obvious, usable skills; that is, those that are proven by use, not merely implied. For example, the presence of a conditional does not prove consistent and sustained ability to hypothesize.

The need for a fuller definition of proficiency is clear, although simple definitions exist. Schulz (1986) mentions "the ability to send and receive different messages in different real life situations" (p. 374). Larson and Jones (1984) summarize definitions of proficiency before 1984 as "the ability to communicate accurately." Lowe (1986, p. 392) also offers the following short characterization: "doing things through language." Convenient as these short characterizations of proficiency may be, they can also be misleading.

Larson and Jones' review of the literature (1984) suggests that everyone knows what "proficiency" is. Unfortunately, any two observers who have not received training in AEI proficiency assessment, when asked to assess the "proficiency" of a performance, will select different aspects to rate; their definitions of proficiency would differ accordingly.

Webster's New International Dictionary suggests sources for such differing interpretations of proficiency by presenting a continuum of definitions ranging from indications of partial control (progression, advancement) to proof of expert control. While the dictionary stresses expertness, popular usage in foreign language circles tends to stress advancement. But this leaves several questions unanswered:

1. What constitutes advancement?
2. How is advancement quantified?
3. How much advancement is involved?
4. Is that advancement significant?

The existence of an abundance of definitions of proficiency in the foreign language profession, from any advancement deemed significant by the observer to (near) expertness, is demonstrated by Larson and Jones' (1984) survey of proficiency definitions:

    Certainly germane to the issue of proficiency is an understanding of what is to be meant by "proficiency." Interpretations of the term range from the ability simply to use properly the phonology and structural devices of the language . . . to complete mastery of all the components of communication. . . . Others . . . have suggested that language proficiency depends on linguistic functions, situational contexts, and personal needs.

(p. 114)

But this range of interpretations proves too wide to be useful; one person's "proficiency" might be another's "achievement." Clifford (as quoted in Buck, 1984, p. 311) segments this continuum into three parts:

    An achievement test checks whether language exists; a performance test checks whether language is used; a proficiency test checks whether language is used creatively.

To the third category might be added "and accurately."

A more complete definition of proficiency than those mentioned by Schulz or Larson and Jones is needed because it is necessary to specify exactly how to determine the presence and extent of proficiency. The present operational definition specifies that performances be judged against ILR functions, content, and accuracy (see the appropriate trisection).4 In considering the nature of other conceptualizations of proficiency, such as the one posited by Bachman and Savignon (1986), the AEI division into ILR functions, content, and accuracy has more than passing significance. This division suggests that Bachman and Savignon's concept of Communicative Language Proficiency (CLP), renamed Communicative Language Ability (CLA), and AEI proficiency may prove incompatible. The position of accuracy within CLA remains unclear; nor is it clear that CLA is not an atomistic, bottom-up view of language ability rather than a holistic, top-down view like that of AEI proficiency (Bachman, personal communication). These possible differences have considerable significance for any research agenda.

AEI History and Possible Hindrances to Its Assimilation

A historical context is important to the understanding of a concept's true nature. This is especially so in the case of AEI proficiency, because disregarding the past also poses a risk of altering the underlying definitions and testing procedures. As a result, the original purpose, the assessment of foreign language users' abilities according to a consistent scale, may be defeated. Although the ILR definitions have addressed this purpose satisfactorily for more than 30 years, only recently has essential information about the AEI scales begun to emerge in a systematic and integrated fashion. This chapter joins in the effort to document the historical context of the proficiency movement.

The following discussion rests on 14 years' experience in administering and certifying interviews of oral language skills and reading comprehension; training oral interviewers and oral interview trainers; training test designers and item writers; refining the ILR Skill Level Descriptions; and developing and revising the ACTFL Proficiency Guidelines.

In-House ILR Research

To researchers, a major drawback of the AEI scales has been the lack of research data. Studies conducted to date afford little understanding of the rationale behind the government's research decisions (Lowe, 1985b). The limited amount of early research compels the formulation of a broadly based research agenda (Bachman & Clark, 1987; Clark & Lett, this volume). Academia's initial lack of interest in the system was another reason for the small number of studies available.

Proficiency evaluation was developed when academia focused heavily on literature. The government did not set out to create a radically new testing system. Linguists at FSI would gladly have adopted or modified a system for ascertaining oral proficiency if one had existed; the fact was that none did. Francis Cartier, former chief of evaluation at the Defense Language Institute (DLI) in Monterey, Calif., enunciated the dictum by which most ILR linguist-managers must abide: "Theory has no deadline! I must have a Spanish reading proficiency test in place by next Tuesday" (comment made at the ILR pre-Georgetown University Round Table [GURT] symposium, March 1980). And so the government set out to design and implement its own test.

Not generally known is that the requisite expertise existed within the government, and that it derived from academia. Claudia Wilds, a primary designer of oral proficiency assessment procedures and later head of testing at the FSI School of Language Studies, was a former student of John B. Carroll, then at Harvard University. Under Wilds' influence, FSI drew on the latest psychometric theory of the early 1950s. The ILR system adopted two characteristics of this theory: criterion-referenced testing and Osgood's semantic differential (Osgood, Suci, & Tannenbaum, 1957). Both were pioneering applications in foreign language testing.

Most government research was conducted in-house and not published, a fact that has led some to believe that little research was conducted. In retrospect, this practice was unfortunate. Still, the single early publication on proficiency testing (Rice, 1959) aroused little notice in academia, with its radically different concerns.


Today it is difficult to determine exactly which aspects of proficiency were scrutinized psychometrically. Rice (1959) indicates that efforts to identify the factors contributing to the global score were undertaken and regression equations were calculated to ascertain the extent of their contribution at Level 3/Superior. Except for Rice's article, however, little else was reported from 1954 to 1978. A study of interagency interrater reliability was conducted by FSI and the Central Intelligence Agency staff in 1973-74 (Wilds & Jones, 1974). (Interrater reliability is a measure of the degree of agreement between the separate global scores when two raters rate the same performance independently.) Jones, then at CIA, who designed and executed the study with Wilds at FSI, reported that correlations between raters at the two agencies in each of three languages, French, German, and Spanish, surpassed .89, demonstrating that suitable reliability could be attained.5 The full study is still unpublished.
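For readers unfamiliar with the statistic, interrater reliability of the kind reported above is commonly expressed as a correlation between two raters' independent global scores. The minimal Python sketch below illustrates the general idea with invented ratings (plus levels coded as the base level plus 0.5); it is not the procedure of the unpublished Wilds and Jones study, only an assumed example of a Pearson correlation between paired ratings.

# Minimal sketch: interrater reliability as a Pearson correlation between two
# raters' independent global scores. All ratings below are hypothetical;
# plus levels (e.g., 2+) are coded as the base level plus 0.5.
import math

rater_a = [1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 3.0, 3.5, 4.0]
rater_b = [1.0, 2.0, 2.0, 2.5, 2.5, 3.0, 3.5, 3.5, 4.0]

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

print(f"Interrater correlation: {pearson(rater_a, rater_b):.2f}")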

In 1974, Jones and Spolsky organized an extraordinary session at the pre-GURT symposium devoted to proficiency testing. Testing experts from academia and the government participated in a discussion of the role of proficiency in U.S. language testing and training. The symposium inaugurated an annual symposium series that sporadically produces proceedings (Clark, 1978; Frith, 1980; Jones & Spolsky, 1975). Even then, however, interest remained confined to a small group, primarily within the government.

Interest in the ILR's proficiency system emerged in academia in the late 1970s, by which time language requirements at universities had been eliminated or reduced drastically. A telling criticism in discussions of educational reform was the statement: "I studied _______ (name your language!) for four years and can't order a cup of coffee!" Americans are a practical breed, and this utilitarian criticism hit home. Other factors came into play, such as the growth of overseas language training programs and a greater awareness among students of their lack of language competence.


In addition, U.S. foreign language educators became more familiar with European functional-notional syllabi. Concurrently, dissatisfaction grew within the foreign language community with the status quo. Putting aside structural linguistics, linguists were beginning to branch out into conversational analysis, discourse analysis, sociolinguistics, and even psycholinguistics. All of these factors put pressure on foreign language departments. Once again a need emerged, this time in the context of public education, for the teaching and testing of functional foreign language skills, precisely the situation faced by the State Department in the early 1950s.

The need in academia was met by a happy series of events:

1. testing workshops conducted by James Frith, dean of FSI's School of Languages, and Marianne Adams, FSI's head of testing;
2. the ETS Common Yardstick project (1981); and
3. the ACTFL/ETS Guidelines project.

All three efforts had ILR support and led to a derivative, commensurate scale for academia. These events are well documented by Frith (1979), Liskin-Gasparro (1984a), Murphy and Jimenez (1984), Hiple (1984, 1986), and Lowe (1985b). In 1974, the government began assembling a manual on the oral proficiency interview, which was not completed until 1985 (Lowe, 1985a).

In the mid-1980s, many observers, some of whom later espoused or criticized the ACTFL/ETS Guidelines, maintained that the ILR conducted little research, rarely published it, and failed to systematize and integrate the government's experience and findings into the mainstream of American foreign language test and curriculum development (Shohamy, 1987). These critics were generally accurate in their assessment for the period extending from the 1950s until 1974. From 1974 to 1978, a small group of academicians became aware of the ILR system, but little use was made of either the awareness or the system. From 1978 on, however, the exchange grew; ACTFL and ETS became observers at meetings of the ILR and participated in the revision of the ILR Skill Level Descriptions, which led to the appearance of the provisional ACTFL/ETS Guidelines in 1982 (Hiple, 1987). Starting in 1974 with the Jones and Spolsky symposium and gathering momentum from 1978 to the present, the ILR has conducted considerable research on the nature of AEI proficiency (Lowe, 1985b).

In an academic setting, this research would have been carried out early in the system's existence. In the government context, a practical approach was an immediate necessity. Fortunately, the wedding of academia's and the government's current interests has culminated in a proposed common research agenda (Bachman & Clark, 1987; Clark & Lett, this volume).

Any discussion of the significant issues in AEI proficiency should build on the fullest possible understanding of its nature. While an operational definition has been proposed, a discussion of difficulties in wording the definitions, a fuller characterization of the AEI framework, and a review of the framework's applications and testing procedures should aid in understanding the chapters that follow. The problems cited here have proven to be stumbling blocks to many seeking to assimilate the system.

Words Versus Experience

A crucial misperception of the AEI scales assumes that their wording reflects the system. Every deviation in wording between versions may be thought to imply notable differences. This is not usually the case. The drafters of the ACTFL/ETS Guidelines and the ILR Skill Level Descriptions took exceptional care to ensure commensurability.6 Different wording usually signals no difference between the two systems, but rather an alternate way of expressing the same aspect of proficiency evaluation.


The different wording results in part from the fact that ILR base ranges were split into ACTFL/ETS subranges; for example, ILR Level 1 was subdivided into ACTFL/ETS Intermediate-Low and Intermediate-Mid (see the figure in the Introduction to this volume). ILR raters had used arrows, assigning an upward arrow for a strong performance and a downward arrow for a weak performance. The ACTFL/ETS distinctions of Low and High performance at ILR Levels 0 and 1 were derived from this procedure of assigning upward and downward arrows (Lowe, 1980).

Problems arose when ILR experience was distilled into words for the first time. Behind differences in wording lurks the greater challenge of describing a sophisticated system verbally. A central problem is that a single document is expected to speak to three audiences: examinees, examiners, and administrators. Perhaps the moment is ripe, as Clark has contended, to produce several documents, each with a distinct goal (Clark, 1987a). When the interagency handbook on oral proficiency testing reached 500 pages, it proved so unwieldy that it had to be streamlined (Lowe, 1985a). Yet many questions still arise that such a handbook cannot address. Frankly, it may never encompass all possible situations. Thus, AEI proficiency should be viewed more as a dynamic constitution than as a Napoleonic Code.7 Perhaps the best access to the system is offered by firsthand experience: observing tests, being trained as a tester, or being trained as a tester trainer. Indeed, the definitions cannot replace hands-on exposure. This is the case for speaking and reading (Lange & Lowe, 1987), and so may be true of all skill modalities.

Wording problems, such as overlayering, arose for historical reasons; problems also arose in shifts in focus from level to level, such as in describing plus-level performances and in describing 1980s tasks in 1950s language. These problems in wording have contributed to widespread inability to internalize the system.


Overlayering

Overlayering occurs when statements describing learner behaviors are superimposed on earlier statements describing user behaviors.8 After a series of revisions, the scale's final form was fixed at FSI in the mid-1950s. A needs analysis of jobs at home and abroad produced a set of hierarchical categories, called levels, into which language performances by government language users could be sorted. Initially, the needs analysis focused on users and their performances in the field. When the assessment system was field-tested later at FSI, language learners comprised its major population.

The confusion has been exacerbated by academia's burgeoning interest in proficiency-oriented curricula, emphasizing learners, precisely because the definitions were never intended for this population. Yet the challenge of adapting proficiency guidelines for use in the classroom continues to fascinate the profession (Liskin-Gasparro, 1984b). Moreover, the statements about ILR learner outcomes were developed in the context of the government's intensive language learning programs.

To some, this fact suggests that meaningful AEI levels are unattainable in academia in programs for nonmajors (Schulz, 1986). This will prove untrue for two reasons: First, AEI levels do not refer to the speaking skill modality alone, an early misunderstanding. Significant functional ability can be attained in listening, reading, and perhaps writing. Second, unlike intensive government training, with its constant exposure to the target language, regular classroom instruction in academia provides psychological absorption time, time for the material to "sink in." This may prove important, especially in the initial stages of foreign language instruction, and its possible significance bears investigation.

Thus AEI testing should not remain unattempted for fear that students will not demonstrate satisfactory gains. While traditional instruction may result in meager gains in speaking skills, gains in other skills may be somewhat greater. With proficiency-oriented instruction, gains may be greater still.

Shifting Focus in the Plus-Level Descriptions

A shift in focus is a move from stressing one view to stressing another within or across definitions. With regard to the AEI scales, this means that a definition may emphasize the user at one point and the learner at another, usually within the same level. Less obvious are shifts in focus between levels. In the following discussion, the plus-level descriptions are examined.9

Writing plus-level descriptions poses particular problems. While base-level descriptions may be written in a more positive vein, plus-level descriptions by nature must cite deficiencies. By definition, plus-level performances must evince many characteristics of the next higher base level, but are not sufficiently consistent or sustained to merit the higher rating. Byrnes (1987a) characterizes the mixed nature of these performances as "turbulent." Plus-level definitions must account for a wide variety of performances, difficult to capture verbally. Perhaps plus levels could be regarded as a build-up of competence, not all of which has yet been translated into performance. The description writer must choose a particular vantage point from which to compose the description, and the focus may be different from level to level, depending on the principal need as the guideline writer sees it. The 0+ Level (ACTFL/ETS Novice) definition, for example, targets general 0+ Level behaviors, while the Level 2+/Advanced Plus description stresses the worst case, the "street" or "terminal" 2+ (defined later). Recent revisions have made every effort to remove these problems.


Alignment of Task Difficulty Across Skills

The year 1978 was a watershed year in the history of proficiency evaluation, because it is when the ILR Testing Subcommittee rethought the definitions. Before 1978, the gestalt nature of the ILR system had not been consistently elaborated in the definitions. The scale's early midpoint descriptions were vying with its more pervasive threshold descriptions, causing misunderstandings about acceptable performances. Threshold systems contrast markedly with midpoint systems.

A gestalt is any clearly recognizable constellation of factors, that is, a unit or an entity that is wholly indicative of a single level and no other (Lowe, 1980). In terms of the AEI scales, a threshold is a perceptual gestalt, a figure or a constellation of factors that are clearly present to perform the function(s) in question when the major border between levels is crossed (Lowe, 1980). ILR thresholds exist at each border between a plus level and the next higher base level: 0+/1, 1+/2, 2+/3, 3+/4, 4+/5.

In contrast, in a midpoint system, a weak gestalt represents the lower subrange; a strong gestalt, the mid; and a still stronger gestalt, the plus subrange. Midpoint systems are ubiquitous. The color spectrum is an example, with each color's shadings of pale, normal, and deep hues. In a threshold system, on the other hand, a weak gestalt occurs at the plus-level border; a strong gestalt occurs just above the border between the plus level and the next higher base level; and a still stronger gestalt occurs toward the mid-range of the next higher level. For example, driving is a threshold system. The plus level below might be characterized by passing a test on the rules of the road, and the next higher level by crossing a threshold when the driver can simultaneously recognize a stop sign, release the accelerator, downshift, and brake.

Past-tense narration presents a linguistic example. At the levels below the threshold, the examinee has been gathering the linguistic bits and pieces (vocabulary, grammar, etc.) with which to perform the functions of the next higher base level. But the examinee cannot yet put all the pieces together. Use of the past is sporadic and faltering. Once the threshold has been crossed, however, the examinee can satisfactorily perform the suitable functions. At this point, there is no doubt that past-tense narration occurs in a consistent and sustained fashion. Thus, the confusion between focusing on midpoints versus focusing on thresholds in interpreting the scale was a problem in using the earlier definitions.

Further complicating a correct interpretation of the definitions is the offset problem; that is, the tendency for individuals to exhibit unequal levels of competence in the various skill modalities. For example, many ILR students learn to speak, listen, and read, but possess significantly lower writing facility in the target language. For students in academia, the reverse can be true, with writing skills sometimes outpacing speaking skills.

Faced with this problem, writers of ILR Skill Level Descriptions from 1978 onward felt that a certain consistency in revising the definitions was in order. There were two possible directions, neither clearly demarcated in earlier attempts. The first, the uniform approach, would have linked levels so that a typical individual rated S-1 (speaking Level 1) would be regarded as automatically having a certain ability in listening comprehension, such as L-1. This approach was rejected because, as Lowe (1985b) showed, not all speakers show such uniformity; that is, there is not always a link between ability levels. In addition, examinees in some government language training programs show a consistent tendency to acquire some skills faster than others. Even when an offset commonly occurs in a language, its magnitude may differ by examinee. In Lowe's (1985b) sample, the offset between speaking and listening in Spanish ranged from a plus level to two whole levels.


The profile approach was chosen for the level guidelines instead, making all skill modalities parallel to speaking and to one another in tasks, difficulty, and rating standards. Consequently, an S-1's creativity in speaking was matched by a similar "creativity" in understanding, characterized by phrases in the definitions such as "able to understand sentence-length utterances" when recombined in new ways at the same level (ACTFL, 1986). This approach requires reference to user profiles, and reflects speakers' tendency to operate at different levels in different skill modalities. A possible skill profile might be: 1+ in speaking, 2 in listening, 3 in reading, and 2 in writing. A profile reflects a foreign language user's skills more accurately because it captures variations among them.
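As an illustrative aside, a profile of this kind is straightforward to represent as a small data structure, which also makes the offset between skill modalities easy to read off. The minimal Python sketch below encodes the example profile above (1+ in speaking, 2 in listening, 3 in reading, 2 in writing); coding plus levels as 0.5 and the offset calculation are assumptions made for the example, not part of the ILR or ACTFL/ETS documents.

# Hypothetical sketch: a four-skill profile with plus levels coded as +0.5.
# Values follow the example profile in the text: S-1+, L-2, R-3, W-2.
profile = {"speaking": 1.5, "listening": 2.0, "reading": 3.0, "writing": 2.0}

# The offset is the spread between the strongest and weakest skill modalities.
strongest = max(profile, key=profile.get)
weakest = min(profile, key=profile.get)
offset = profile[strongest] - profile[weakest]

print(f"Largest offset: {offset} levels ({strongest} vs. {weakest})")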

1980s Tasks in 1950s Language

A longstanding dilemma that has only recently been addressed is that of stating 1980s tasks in 1950s language. 1980s tasks refer to recent identifications of the many tasks requiring language. 1950s language refers to the structuralist view of language, in which the AEI scales were first expressed. This is not to imply that the scales were previously incapable of rating 1980s tasks. For example, the terms coherence and cohesion did not, until recently, appear in the definitions. Yet no high-level rating (i.e., 4, 4+, 5) would be assigned to a speaker who could not produce a coherent or cohesive sample. This is one of many examples in which the AEI scales holistically handle areas that are just beginning to be understood atomistically.

Given growing recognition of the sophistication of language, one might wonder whether a system of the scope of the AEI scales could be devised today. The task would certainly be much more complex, because since the early 1950s, infinitely more has been learned about the nature of language and language acquisition, through transformational grammar, discourse analysis, pragmatics, and so on, than was known then, though there is obviously more to learn. The ILR system was devised when views of language were simpler and more readily captured by a few terms. For speaking, the commonly accepted factors were pronunciation, fluency, vocabulary, and grammar; added later were sociolinguistics, culture, and tasks accomplished.10 Similarly, factors may be identified for the other skill modalities.11 Some believe that such terms are misleading because they may be seen as "structuralist" in origin (Bachman & Savignon, 1986). Even the term grammar miscommunicates to some (Higgs, 1985; Garrett, 1986). Others overemphasize vocabulary. However, properly expanded and defined through research, manuals, and workshops, these expressions usually communicate AEI proficiency to the teacher in the classroom, the administrator behind the scenes, and the foreign language user. These terms encompass the domains usually considered when describing global performance.

Today's knowledge allows for expansion beyond phrase- or sentence-level grammatical descriptions to describing discourse structure and the realization of propositions (Byrnes, 1987a). Pragmatics can be discussed, as well as the real-world constraints on what can be said (Child, 1987). If a proficiency system had to be built today, the task would be too daunting were it not for the framework inherited through the AEI scales, a simple one that permits considerable expansion.

The framework permits expansion, refinement, and a top-down view of proficiency that provides an antidote to the bottom-up activities of achievement-oriented classrooms. In sum, the proficiency system of the early 1950s was just sophisticated enough to blaze an inviting and rewarding trail, and just simple enough to allow drafters of new guidelines to maintain their bearings in the face of new and intriguing problems in evaluation.


The AEI Framework12

The government's need for an evaluation system to deal with oral proficiency led to another impediment to assimilating the system, the lack of an overall statement on the AEI characteristics common to all skill modalities. This section attempts to address that need.

The Outlines of the Proficiency Levels. The definition for each level outlines the target area into which characteristic behaviors fall. The definitions cite representative behaviors, content, and accuracy as needed, but make no attempt to produce an exhaustive description. For a consistent statement on the required ILR functions, content, and accuracy in a given skill modality, see the respective functional trisection. For details, consult the relevant manuals (ILR or ACTFL/ETS) for speaking, and the relevant literature for all skills (Educational Testing Service, 1982; Lowe, 1985a, 1985b).

Stress on General Language. Language may be viewed as a continuum ranging from general language, through work-related language, to job-specific language. A major characteristic of the AEI scales is their stress on general language.

Stress on Proficiency, not Achievement. The AEI scales stress proficiency, except at the lowest levels (0, 0+/Novice-Low, Mid, and High), at which students tend to regurgitate the limited material they have learned (achievement) and possess few, if any, strategies to use the material creatively. The expected creativity must reflect the requisite quality and quantity (consistent and sustained production) for the level. "Creativity" is understood here as the ability to generate sentences and discourse that are new to the examinee, not the ability to produce immortal prose.

Quality statements are important throughout the scale. Nonetheless, at the lowest levels they play down perfection; and indeed, even at the highest levels they assume only the "near perfection" of the well-educated native, who still occasionally lapses. Nonnatives are permitted lapses at the highest levels, provided they are native-like lapses.

Process Versus Product. It is the goal of the AEI scales to rate the examinee's facility in a given skill, not solely or even primarily the product produced. Thus, in rating an examinee's writing proficiency, the examiner reports: "John Simpson writes French at Level 2+/Advanced Plus," and not "John Simpson's essay was a 2+." The essay obviously has to demonstrate 2+ writing skills, which means that it must be a ratable sample revealing consistent and sustained ability. It is the examinee's ability, not merely his or her performance, that receives the rating (Child, this volume).

Gestalt Nature of AEI Rating. Because of the gestalt nature of the AEI scales, rating proves most efficient and accurate when raters judge "wholes" or "near wholes" rather than bits and pieces. The system provides yet another reason to stress performance that is "consistent" and "sustained" (Lowe, 1980).

Noncompensatory Core. In the ILR scale, Levels 3 through 5 are noncompensatory. This means that a core of ILR functions and certain levels of accuracy must be demonstrated for the examinee to earn the rating. Nevertheless, the lower a speaker is on the AEI scales, the more compensation is possible. An examinee's strong vocabulary, for example, may under certain conditions compensate for weak grammar at Level 2+/Advanced Plus and lower.

Even at lower levels, however, compensation is sometimes impossible. This results from the presence of a noncompensatory core; in a given sample, certain features in each language prove indispensable and must appear with the degree of quality the definitions require. For example, in West European languages, some past-tense features are part of the central core to earn a Level 2/Advanced rating. A failure to use past tense at all, or without the requisite quality, blocks assignment of the level in question.

Criterion Shift at the Intermediate High/Advanced Border. At the border between 1+/Intermediate High and 2/Advanced, or lower, the definitions state that an examinee must be able to communicate with a native used to dealing with foreigners. This often requires the listener to do most of the work, particularly at the lowest levels. The shift of responsibility materially affects the rating.

Expression of Ability in a Global Rating. The examinee's ability is expressed in a single level rating. Factors contributing to the rating have been identified, but the global score does not represent a summation of scores on the individual factors (for details, see Bachman & Savignon, 1986; Bachman & Clark, 1987; Clark, 1987a; Clifford, 1980).

Full-Range Nature of the Scale. The AEI scales cover the full range of performances from knowledge of isolated words (0, 0+/Novice levels) to ability equivalent to that of a well-educated native speaker (Level 5, the peak of Superior). The apex of the full range (ILR Level 5) is included in both scales because the ACTFL/ETS category of Superior comprises ILR ranges 3, 3+, 4, 4+, and 5, for which the ultimate reference point is the educated native speaker/listener/reader/writer. The term "educated native speaker" has occasioned much discussion and led to misunderstandings about the scales (Bachman & Savignon, 1986; Lowe, 1985b, 1986). Space limitations preclude a full airing here, but the confusion appears to be based partly on a failure to assimilate the wide range of behaviors the scales encompass (Lowe, 1987; Valdman, 1987).

The confusion may also be due in part to differences in terminology. ILR terms such as school, classic, street, and terminal may require explanation. School learners are formal learners who typically have a solid foundation in grammar but limited everyday vocabulary. A classic speaker evinces a balance between solid grammar and useful vocabulary. Street learners are those who have learned a second language informally and whose everyday vocabulary, through exposure in the country of the language, is extensive (for the level in question), but whose grammar proves nonexistent, weak, or fossilized vis-à-vis the level. A terminal speaker is similar to a street learner, with language that is probably irremediable. Except for the classic speaker, these designations cannot in ILR usage apply above Level 2+/Advanced Plus. Because the levels above 2+ are noncompensatory, any global rating must be based on a common floor in grammar and vocabulary (see Higgs & Clifford, 1982; Lowe, 1985b for details). In contrast to these few terms, academia uses a plethora of expressions (Valdman, 1987).

Since the AEI scales were first instituted, the field's understanding of sociolinguistics, culture, and the relationship of standard language to dialect and regional variants has burgeoned. As the 1987 Indiana Symposium underscored, high priority should be assigned to incorporating these insights more fully into the AEI definitions (Byrnes, 1987a, 1987b).

Top-Down View of Abilities. The vantage point of the AEI scales is from the top downward; that is, the reference point is at the top of the scale, which represents the performance of a well-educated native speaker of the target language. This one aspect alone may account for the difficulty some have had in assimilating the scales. Most classroom instruction and testing assumes the opposite vantage point: from the bottom up. Bits and pieces are taught and tested, and it is hoped that by manipulating these fragments students will somehow produce larger units. In contrast, the top-down view assumes that to successfully produce units at one level, learners must have a developing sense of the larger units above that level. For example, the language of a student at the Intermediate/1 Level is often characterized by short, discrete sentences. To produce a longer, intelligible unit, the student must first have a sense of a larger unit such as a complex or compound sentence.

The salient characteristics of top-downness include the educated native speaker at the scales' apex; the gestalt nature of AEI levels; the expanding noncompensatory core as the scale is ascended; and the full-range nature of the scale. These characteristics apply to all four skill modalities. Together they further distinguish AEI proficiency from other kinds of proficiency.

Applications and Procedures

Another factor that contributes to difficulty in internalizing the system is the wide variety of approaches to testing the nonoral skills, many of which suspiciously resemble those used in achievement testing.13 Superficial similarities aside (e.g., in reading, multiple-choice and cloze formats), any AEI proficiency testing procedure must:

1. obtain a ratable sample, usually more extensive than that of an achievement test;

2. establish consistent and sustained production, whatever the skill; and

3. establish the presence of creativity at levels above 0+/Novice High.

These three characteristics are the hallmarks of AEI proficiency in the applications and procedures discussed here.

The AEI proficiency framework has been applied to many evaluation situations, including assessment at the end of training or a course; for placement; before or after a stay in a target-language country; and for awarding course credit (Fischer, 1984). Although the AEI framework is often used for diagnostic purposes, its principal application has been to assess proficiency.

Testing approaches vary by skill modality and, in some instances, by agency or school. A general outline is presented here.

All Skill Modalities. Testers of AEI proficiency use either the ILR Skill Level Descriptions or the ACTFL/ETS Proficiency Guidelines. In both, the concept of AEI proficiency is constant, requiring the testing of an examinee's ability to perform in a consistent and sustained manner for a suitable period, depending on the level.

Rating Procedures. All rating procedures ultimately depend on the definitions. In procedures using interviews, the government tends to use two raters, while academia uses one. In direct rating of speaking and writing, the sample is elicited and the examinee's performance is evaluated on the spot against the definitions. Oral proficiency interviews are audiotaped for possible verification later. Through listening and reading interviews, the receptive skills can also be rated immediately. Indirect rating occurs after paper-and-pencil tests have been normed. It may appear that tests rated in this way have no connection to real-life performances. But this is an illusion, because before they are approved as testing instruments, such tests are calibrated directly, through interviews, against real-life performances of examinees. Thus, behind the indirect proficiency test, by which a given skill is rated through a multiple-choice instrument, for example, lie real-life behaviors representative of the AEI definitions that are used to rate interviews.
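To make the norming idea concrete, the following is a minimal sketch, in Python, of how indirect-test scores might be anchored to interview-based ratings. The sample data, the use of the median score per level, and the cutoff rule are illustrative assumptions for this sketch only, not a description of any agency's actual calibration procedure.

```python
# Hypothetical sketch of calibrating an indirect (multiple-choice) test
# against interview-based ratings. Data and cutoff rule are illustrative
# assumptions, not any agency's actual norming procedure.

from statistics import median

# (interview-assigned level, score on the indirect test) for a norming sample
calibration = [(1, 22), (1, 25), (2, 34), (2, 38), (2, 36), (3, 49), (3, 52)]

def level_cutoffs(data):
    """Return the median indirect-test score observed at each interview level."""
    by_level = {}
    for level, score in data:
        by_level.setdefault(level, []).append(score)
    return {level: median(scores) for level, scores in sorted(by_level.items())}

def score_to_level(score, cutoffs):
    """Assign the highest level whose median calibration score the examinee meets."""
    assigned = 0
    for level, cutoff in cutoffs.items():
        if score >= cutoff:
            assigned = level
    return assigned

cutoffs = level_cutoffs(calibration)
print(cutoffs)                      # {1: 23.5, 2: 36, 3: 50.5}
print(score_to_level(40, cutoffs))  # a new score of 40 maps to level 2
```

The only point of the sketch is that the level reported by the indirect instrument is ultimately traceable to levels assigned in interviews.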

Moreover, all proficiency tests require both a "floor," the level at which the examinee can consistently and sustainedly perform, and a "ceiling," the level at which the examinee can no longer sustain performance consistently. This differs from achievement tests, which have a floor but no ceiling.

Defining Levels of the Four Skills. In a holistic, top-down view, the definitions of the four skills may be assumed; in their fullest form, they are defined as the way in which a well-educated native speaker speaks, listens, reads, or writes. In an atomistic, bottom-up view, on the other hand, opinions vary widely on the nature of each skill. The lowest levels of reading, for example, may be represented by the ability to decode, the ability to scan for specific information, or the ability to skim for general information, among other tasks. Because atomistic definitions are difficult to interrelate with holistic ones, commonalities across skill modalities within the AEI scales are stressed instead.

In a holistic, top-down view of the four skills, there are two other reference points besides the well-educated native speaker at the apex. These reference points commend themselves naturally because they derive from a functional constellation of features. The middle reference point is the first general utility level, and the lowest reference point is the first usable level. The difference between the latter two is the degree of independence and accuracy. At the first general utility level, speakers have a level of production that is generally independent, and they are confident that they will not commit egregious errors or pass misinformation. The first usable level, in contrast, represents a level of production that is reached when the examinee is no longer limited to formulaic language and can use his or her knowledge of the language to create novel sentences.

Here the concept of profiles is useful, because the first usable level in reading falls at a different point on the scale from the first usable level in speaking (i.e., R-3 versus S-1; see Figure 1.1). ILR experience confirms the placement of these two levels for speaking, and their placement is generally confirmed for reading. ILR experience may not be extensive enough to confirm the author's experience of these two categories for listening and writing.

In sum, the AEI scales present three levels with diminishing degrees of utility referenced from the top down: the well-educated native level, the first general utility level, and the first usable level. These reference points cut across skills, but do not necessarily fall at the same points on the scales for every skill.


[Figure 1.1 (ILR Skills Overview) appears here. It arrays the four skill modalities (speaking, listening, reading, writing) against the ILR levels and gives a capsule characterization of each level: Level 5 (Educated native): functions equivalent to a well-educated native (Special Purpose Mode). Level 4 (Representation): tailors language to suit audience (Projective Mode). Level 3 (Abstract): thinks in the target language; supports opinion; hypothesizes; handles unfamiliar situations and topics (Evaluative Mode). Level 2 (Concrete): operates in past and future as well as present time (Instructive Mode). Level 1 (Survival): consistently creates with language; tends to operate in present time (Orientation Mode). Level 0+: operates with memorized material (Enumerative Mode). Level 0: no functional ability.]

Legend: Solid line = first general utility level; dotted line = first usable level.

Figure 1.1. ILR Skills Overview

Note. From "'The' Question" by P. Lowe, Jr., 1984, Foreign Language Annals, 17, p. 382. Copyright 1984 by the American Council on the Teaching of Foreign Languages. Reprinted by permission.


Consequently, testing may take place at different levels, depending on the level of proficiency desired for the skill tested and depending on the nature of the test procedure or instrument.

Testing Speaking. Testers use the oral proficiency interview, although some tasks, such as briefing and debriefing, are agency-specific. A general-language AEI interview is currently used by ACTFL, ETS, and most ILR agencies. The interview used at FSI is more exclusively job-related than at other agencies. Because a single testing procedure was used in the earliest stages of proficiency assessment, the transfer to academia was simplest for oral proficiency testing.

For the other skills, the transfer has been more difficult, with varying approaches and less widespread understanding of the techniques involved. Moreover, academia has used numerous testing approaches to listening, reading, and writing, and has defined these skills differently.

Testing Listening. From a holistic, top-down point of view, ILR users generally agree on the nature of listening, but many disagree on the contexts that a listening proficiency test should cover.14 The lack of clear-cut guidelines for assessing these differing contexts hinders the design of listening comprehension tests. It is unclear whether the test should focus on participative listening, overheard participative listening, nonparticipative listening, or a mix of these three. The first category is distinguished from the others by the presence of opportunities to seek clarification; overheard participative and nonparticipative listening are differentiated by the nature of their content, that is, real-life, everyday situations versus lectures and broadcasts. No consensus has been reached on the issue of whether broadcast-quality tapes should be used for real-life contexts; whether natural noise should be included as part and parcel of the background; or whether a combination should be used. Moreover, perhaps a variety of voices, genders, ages, and regional variants should be included.15 Obviously, a lack of suitable guidelines hampers development in this area.

Some ILR agencies, like FSI, assign no listening score. Other agencies evaluate listening comprehension, but by varying methods. Some derive a listening score from the oral proficiency interview, interspersing speaking probes with listening probes. Such questions are specifically designed to test a rise in the level of listening comprehension (see Lowe, 1985b, on the listening offset). Others supplement the oral proficiency interview through a listening comprehension interview whose passages are either played on tape or read aloud to the examinee by the tester(s). The examinee verbalizes the essential meaning of the passages (in English at lower and mid levels; optionally in the target language at higher levels). At least one agency, DLI, administers a separate tape-mediated listening comprehension examination as part of its Defense Language Proficiency Tests (DLPTs). Earlier versions of the test, DLPT I and II, tested only reading and listening comprehension. The latest version, DLPT III, includes a semi-direct speaking test (see Lowe & Clifford, 1980, on the ROPE; Herzog, cited in Rose, 1987).

Testing Reading. In a holistic, top-down view, the definition of the reading skill is also straightforward. However, because current views of reading are generally atomistic and bottom-up, there is much disagreement in academia on the nature of reading proficiency. Drawing on ILR experience concerning the relative utility of the different skills at various AEI levels, the AEI top-down categories and the more atomistic bottom-up categories may be combined into the following schema, with the two sets of categories intersecting at Level 3/Superior: At the top, Level 5 is the well-educated native reader; at Level 4, the reader is at the first general utility level; and Level 3 represents the first usable level. All three categories assume automatic comprehension, with diminishing degrees of ability in ILR functions and content, and diminishing accuracy, as the scale is descended. From the bottom up, the AEI definitions already reflect categories that are common in academia. Possible definitions range from decoding ability to automatic comprehension. The AEI definitions assume decoding ability at the 0, 0+/Novice Level, and automatic comprehension beginning with the 3/Superior Level. The ability to puzzle out what the topic of a passage may be after multiple passes is vastly different from that of automatically comprehending the same passage in one pass at near-native reading speed. The AEI definitions assume this higher-level ability starting at Level 3/Superior, and throughout the scale they require proof of consistent and sustained ability to read (however that may be defined for the level in question). The AEI scales' assumption of such ability differs sharply from what most in academia consider as reading in a foreign language, at least at its earliest stages.

Three other factors influence bottom-up views of reading proficiency, primarily involving the fact that reading is a receptive skill and must be tested by indirect means. First, the prompt, in this case the text the examinee is supposed to understand, must be graded for AEI level(s). Unless the level of the text is determined, it is impossible to assess a level of proficiency for the examinee. While much work remains to be done in developing criteria for categorizing texts at particular levels, the ILR Testing Committee has graded more than 30 English texts to exemplify the text rating process, and it is possible to develop capsule characterizations of these text types (see Figure 1.2).16 Doubtless other text types would lead to slightly different characterizations.

A second complication is that the rater's perception of the extent to which an examinee comprehends a reading text cannot be direct, but must occur through another channel, often a productive skill, that is, speaking or writing. If this other skill is poorly controlled, then this channel may introduce so much disturbance that the rater may not discover the level at which the examinee truly understood the target-language text. For example, an examinee with a high level of proficiency in reading Japanese may be unable to demonstrate his or her true level of comprehension of a Japanese text when asked to paraphrase it orally in Japanese due to a low level of speaking proficiency.

A third and hotly debated factor is the contribution of real-world knowledge (Bernhardt, 1986; Lange & Lowe, 1987). Such knowledge may be presumed among well-educated native speakers. Indeed, in the government's experience, the scales are developmental. A speaker at the peak of the scale would be expected to integrate various abilities and real-world knowledge; at lower levels, less real-world knowledge is presumed. When the scale is applied to adolescents and children, however, less real-world knowledge can be expected at all levels. Hirsch (1987) goes far in answering the question of how much knowledge can be assumed with a lengthy list of cultural facts he expects educated American adults to know. Obviously, other languages place similar demands on well-educated natives (e.g., Stein's [1946] Kulturfahrplan for Germans).

In sum, because academia has had greater experience with reading than with speaking, opinions diverge markedly regarding the need, efficacy, and appropriateness of conducting AEI reading proficiency tests (Bernhardt, 1986).17 Procedures for assessing reading include the reading interview used at FSI (now being revised); all-level, multiple-choice reading proficiency examinations (CIA and DLI); and level-specific tests (NSA) (Lowe, 1984a, 1984b). The reading interview usually follows the oral proficiency interview, in which case testers select reading passages at or slightly below the speaking level. If it does not follow the oral proficiency interview, the testers begin with a Level 1 passage. They then progress upward or downward, selecting the next passage as indicated by the examinee's performance and continuing iteratively until both a floor and a ceiling in the examinee's performance are established.


Figure 1.2. Capsule Characterizations of Written Texts

The following capsules represent the major characteristics used in grading passages according to the ILR reading proficiency scales. The following assumptions were made.

1. The descriptions would primarily treat expository prose.

2. The descriptions would outline, but not exhaustively describe, the target area into which most passages at each level fall.

3. The descriptions would be cumulative, each higher-level description subsuming those below and modifying them as necessary.

4. Straightforward texts exist whose vocabulary, structure, and organization clearly demonstrate the level in question, prove readily gradable, and consequently need not be described separately.

5. Most texts focus on a single level, but also evince variety. For example, a text graded Level 3 might nonetheless use 3+ or 4 vocabulary or structure, but in such a way as not to prove essential to understanding the core of the passage.

6. The major plus rule applies to grading texts: To be considered a plus level, a passage must evince numerous features of the next higher base level but fail to use them as consistently and sustainedly as a passage from that level would. Plus levels are treated separately only at Levels 0+ and 1+, where attention must be drawn to their salient features.

7. The descriptions should build on the work of several ILR agencies, in this case Child (1987) and Ray T. Clifford's additions to Child's work.

8. Experience is the best teacher. Consequently, the best introduction to grading texts is a series of graded passages and commentary with sufficient variation at each level to impart a sense for the system.

CAPSULE CHARACTERIZATIONS OF WRITTEN TEXTS BY LEVEL

Clifford's Formulaic Mode (commonly fixed phrases and isolated words): To the extent that 0+ texts exist, they tend to be loosely connected groups of words, such as those referring to the weather: sun, rain, and so on. Texts are often strongly supported by context, usually visual.

Child's Orientation Mode (main ideas): Level 1 texts contain short, discrete, simple (occasionally compound) sentences, whose vocabulary and structure are simple and whose ordering of information may prove quite loose. Reorderings both within sentences and within paragraphs are often possible. Moreover, information is not tightly packed, and consequently words may be deleted without drastically altering meaning. Passages at this level may be supported by context.

Level 1+: Such texts present somewhat longer sentences; vocabulary often contains higher-level items; and sentence structure may be compound. Passages may be written in the past, and are usually packed with information. Material cannot be easily deleted without altering the sentence's meaning, and reordering often proves impossible as material is presented in a logical sequence compared with Level 1.

Child's Instructive Mode (main ideas plus supporting facts, but not details): Level 2 texts contain complex sentences, verb tenses other than the present, and, in news articles, densely packed information. Vocabulary grows more topic-specific, and organization begins to reflect target-language types. Often less obvious, but more pervasive, is shaping: the subtle choosing of material, of ordering, and of interpretative comment on the author's part. The author assumes some shared target-language culture as background.

Child's Evaluative Mode (main ideas, supporting facts, details, and inferences): Level 3 texts often treat abstract topics, and the language itself reflects this through abstract formulations. Texts may present author-intended inference, hypothesis, and suasion, with the author often intending that the reader evaluate the material. At this level, information-packing may lead to stylistics, inference, and occasionally emotional aspects of the author's message. Shaping is evident, as are target-language orderings that may deviate markedly from those of American English expository prose. The author assumes that readers share much target culture as background.

Child's Projective Mode (marked by unpredictable turns of thought): Level 4 texts demonstrate the author's virtuosity with language, often mixing registers (formal and informal, etc.), achieving subtlety and nuance, frequently evincing tone (irony, humor, etc.), cogently persuading, and generally challenging the reader to follow unpredictable turns of thought. Language achieves a high level of abstraction, and author shaping and organization are clearly evident. The author assumes the reader shares target-language culture at a high level. Choice of vocabulary, idioms, and structure proves so highly appropriate that attempts at substitution produce nuances not intended by the author.

Clifford's Special Purpose Mode (the highest levels of language, organization, esthetics, and thought): Level 5 texts are often written for special purposes and therefore are often idiosyncratic (e.g., avant-garde literature, high-level legal documents). Choice of vocabulary, structure, style, register, organization, cultural references, and so on, proves absolutely appropriate.


Thus, like the oral proficiency interview, but unlike more common reading tests, the reading interview is adaptive; the difficulty of the items is tailored to the examinee.
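The floor-and-ceiling search just described can be pictured as a simple adaptive loop. In the sketch below, the level numbers, the starting level, and the judge() callback standing in for the testers' judgment are illustrative assumptions, not part of the official ILR or ACTFL procedure.

```python
# Hypothetical sketch of the adaptive passage-selection loop described above.
# Level numbers, starting level, and judge() are illustrative assumptions.

LEVELS = [0, 1, 2, 3, 4, 5]  # simplified base levels (plus levels omitted)

def reading_interview(judge, start_level=1):
    """Select passages iteratively until both a floor (highest level handled)
    and a ceiling (lowest level not handled) are found.
    judge(level) returns True if the examinee handles a passage at that level."""
    floor, ceiling = None, None
    level = start_level
    while floor is None or ceiling is None:
        if judge(level):
            floor = level if floor is None else max(floor, level)
            if level == LEVELS[-1]:      # top of the scale reached
                ceiling = level
                break
            level += 1                   # try a harder passage
        else:
            ceiling = level if ceiling is None else min(ceiling, level)
            if level == LEVELS[0]:       # bottom of the scale reached
                floor = level
                break
            level -= 1                   # try an easier passage
    return floor, ceiling

# Example: an examinee who handles Level 2 passages but not Level 3
# returns a floor of 2 and a ceiling of 3.
print(reading_interview(lambda lvl: lvl <= 2))
```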

The examinee demonstrates the extent of understanding by paraphrasing the content of the passage. At the Low and Mid levels, the examinee uses English; at High levels, he or she has the option of paraphrasing in the target language.

At each level, the reading definitions state the nature of the tasks and degree of understanding required. At Level 1/Intermediate, the examinee should understand the main ideas. At Level 2/Advanced, he or she should understand the main ideas plus some supporting facts. At Level 3/Superior, the examinee should understand the main ideas and supporting facts; draw out inferences intended by the author; and perceive analogies. The level of the passages is raised until a linguistic ceiling is reached.

Although the reading interview is now an uncommon approach, it is important to note that many statements in the definitions refer to the behaviors observed during reading interviews. Some behaviors parallel those in the oral proficiency interview. For example, the definitions state that a Level 0/Novice examinee tends to understand isolated words. In an actual reading interview, a Level 0/Novice examinee typically points to individual words in a Level-1 text, the ones he or she understands in a text that is otherwise incomprehensible to him or her.

Testing Writing. Writing is the skill least tested by ILR agencies. Only two agencies, CIA and DLI, test writing, and then only for special cases (Herzog and Katz, this volume). The usual approach has been to choose a topic according to the level of proficiency required for a particular job. The resultant performance is judged against the definitions. An analogous scoring procedure used elsewhere is holistic rating (Herzog and Katz, this volume). Many CIA jobs, for example, require Level 3/Superior writing skills or better.

Because any writing performance is fixed in black and white, Ericson (personal communication) recommends a three-stage approach to rating:

1. Read the composition for communicative effectiveness;

2. Reread for specific factors contributing to both success and failure to communicate; and

3. Combine the results of the two approaches to derive the global writing score.

Herzog and Katz (this volume) address the difficulties of this approach.

Conclusion

This chapter has advanced probable reasons why proficiency assessment's earlier history has often gone unassimilated. To a large extent, this chapter has presented proficiency assessment in its own terms, that is, as a consistent framework, evolving over time and applying to all four skill modalities. It is hoped that this chapter provides an overarching context for the significant issues, explored in the following chapters, that have arisen as proficiency assessment has been introduced into academia.

NOTES

1. A general familiarity with the ACTFL/ETS/ILR definitions is assumed. Concise introductions are found in Lowe and Liskin-Gasparro (1986) and Lowe (1983) for speaking, and in Lowe (1984a) for the other skill modalities.

2. A distinction should be drawn between a procedure and an instrument. While an instrument is invariable for all administrations, a procedure permits the administrator to vary the testing approach. Both aid assessment, but in different ways. In any assessment procedure, the content may vary, but the performance criteria (levels, functions, and accuracy) required are constant, so that each administration at a given level furnishes a parallel test. Interviews, whether for speaking, listening, or reading, are classified as procedures. Paper-and-pencil tests of any kind are instruments.

3. The term functions is used here to mean fixed task universals that characterize specific levels in the ILR system. It is not used in the functional/notional sense of a set of variable qualifiers affecting language communication (Munby, 1978).

4. Trisections are systematic condensations of the AEI definitions, and focus on ILR functions, context/content, and accuracy for each level. (For a discussion and a chart showing the trisection for each skill, see Bragger, 1985, p. 47 [speaking]; Omaggio, 1986, pp. 129-132 [listening]; Magnan, 1985, p. 111 [writing]; Omaggio, 1986, pp. 153-156 [reading].)

5. For a more recent study, see Clark, 1987b.

6. Some differences in wording represent attempts to use the terminology more current in the literature on the nonoral skills (see the section entitled "1980s Tasks in 1950s Language" later in this chapter).

7. Revision of the definitions is a continuing process both for ACTFL and the ILR, with new insights and experience being incorporated as they become available. In this sense, all sets of the definitions are provisional.

8. A clearly important third group, as yet unaddressed by the AEI definitions, is language losers: those who are in the process of forgetting or have already mostly forgotten the target language(s) (Lambert & Freed, 1982). Research is needed to determine what happens when people who know a language fail to use it; which skill modalities are most stable, and which are least stable (Lowe, cited in Lambert & Freed, 1982). An overarching question is how such performances challenge the system.

The obvious differences among these three groups (users, learners, and losers) provide a rich field for investigation. Another set of differences lies in the often disparate goals between the many government programs that are intensive and focused on teaching functional skills and some programs in academia whose articulation between program content and academia's overall goals for the learner is less clear.

9. The ILR Skill Level Descriptions' plus levels are inconsistently designated in the ACTFL/ETS versions, being referred to as plus at the Advanced level (Advanced Plus = 2+) but as High at the lower levels (Intermediate High, Novice High = 1+, 0+ respectively).


10. The factors enumerated here are used by most ILR agencies, though FSI uses a slightly different set.

11. For example, in reading, instead of pronunciation, orthography; instead of fluency, speed. See Herzog and Katz, this volume, for factors contributing to writing, such as organization and discourse methods.

12. This discussion complements Lowe, 1985b and 1986.

13. For a possibly more atomistic, bottom-up view of proficiency and applicable testing procedures, see Larson and Jones, 1984.

14. For details on listening, see Douglas (1987); Foreign Language Annals (September 1984); and Joiner (1986). In addition, many of the statements made here about reading may well apply to listening, which is related as a receptive skill.

15. For additional clarification of these problems, see Valdman (1987).

16. The accompanying figure builds on the work of Child, Clifford, Herzog, and Lowe, and outlines primarily the characteristics of exemplary expository prose texts at each ILR level.

17. Several AEI reading proficiency tests have been developed in academia: Japanese Proficiency Test and Russian Proficiency Test (ETS); Chinese Proficiency Test (Center for Applied Linguistics); and the Hindi Proficiency Test (University of Pennsylvania).

References

American Council on the Teaching of Foreign Languages. (1986). ACTFL proficiency guidelines. Hastings-on-Hudson, NY: Author.

American Council on the Teaching of Foreign Languages. (1982). ACTFL provisional proficiency guidelines. Hastings-on-Hudson, NY: Author.


Bachman, L.F., & Clark, J.L.D. (1987). The measurement of foreign/second language proficiency. Annals of the American Academy of Political and Social Science, 490, 20-33.

Bachman, L.F., & Savignon, S.J. (1986). The evaluation of communicative language proficiency: A critique of the ACTFL oral interview. Modern Language Journal, 70, 291-297.

Bernhardt, E. (1986). Proficient texts or proficient readers? ADFL Bulletin, 18(1), 25-28.

Bragger, J.D. (1985). The development of oral proficiency. In A.C. Omaggio (Ed.), Proficiency, curriculum, articulation: The ties that bind. Middlebury, VT: Northeast Conference.

Buck, K.A. (1984). Nurturing the hothouse special: When proficiency becomes performance. Die Unterrichtspraxis, 17, 307-311.

Byrnes, H. (1987a). Discourse structure and communicative strategies in the development of communicative ability. In A. Valdman (Ed.), Proceedings of the Symposium on the Evaluation of Foreign Language Proficiency. Bloomington, IN: Indiana University.

Byrnes, H. (1987b). Second language acquisition: Insights from a proficiency orientation. In H. Byrnes & M. Canale (Eds.), Defining and developing proficiency: Guidelines, implementations, and concepts. Lincolnwood, IL: National Textbook Co.

Child, J. (1987). Language proficiency and the typology of texts. In H. Byrnes & M. Canale (Eds.), Defining and developing proficiency: Guidelines, implementations, and concepts. Lincolnwood, IL: National Textbook Co.


Clark, J.L.D. (1978). Direct testing of speaking proficiency: Theory and application [Proceedings of a joint ETS/ILR/GURT conference]. Princeton, NJ: Educational Testing Service.

Clark, J.L.D. (1987a). Sociolinguistic aspects of communicative ability: The issue of accuracy vs. fluency. In A. Valdman (Ed.), Proceedings of the Symposium on the Evaluation of Foreign Language Proficiency. Bloomington, IN: Indiana University.

Clark, J.L.D. (1987b). A study of the comparability of speaking proficiency interview ratings across three government language training agencies. In K.M. Bailey, T.L. Dale, & R.T. Clifford (Eds.), Language testing research. Monterey, CA: Defense Language Institute.

Clifford, R.T. (1980). FSI factor scores and global rating. In J.R. Frith (Ed.), Measuring spoken language proficiency. Washington, DC: Georgetown University Press.

Cummins, J. (1984). Wanted: A theoretical framework for relating language proficiency to academic achievement among bilingual students. In C. Rivera (Ed.), Language proficiency and academic achievement. Clevedon/Avon, England: Multilingual Matters.

Douglas, D. (1987). Testing listening comprehension. In A. Valdman (Ed.), Proceedings of the Symposium on the Evaluation of Foreign Language Proficiency. Bloomington, IN: Indiana University.

Educational Testing Service. (1981). A common metric for language proficiency (Final Report for Department of Education Grant No. G008001739). Princeton, NJ: Author.


Educational Testing Service. (1982). ETS oral proficiency testing manual. Princeton, NJ: Author.

Fischer, W.B. (1984). Not just lip service: Systematic oral testing in a first-year college German program. Die Unterrichtspraxis, 17, 225-39.

Frith, J. (1979). Testing the testing kit. ADFL Bulletin, 11(2), 12-14.

Garrett, N. (1986). The problem with grammar: What kind can the language learner use? Modern Language Journal, 70, 133-48.

Higgs, T.V. (1985). Teaching grammar for proficiency. Foreign Language Annals, 18, 289-96.

Higgs, T.V., & Clifford, R.T. (1982). The push toward communication. In T.V. Higgs (Ed.), Curriculum, competence and the foreign language teacher. Lincolnwood, IL: National Textbook Co.

Hiple, D.V. (1984). The ACTFL proficiency projects: An update. Die Unterrichtspraxis, 17, 327-29.

Hiple, D.V. (1986). A progress report on ACTFL proficiency guidelines, 1982-1986. In H. Byrnes & M. Canale (Eds.), Defining and developing proficiency: Guidelines, implementations, and concepts. Lincolnwood, IL: National Textbook Co.

Hiple, D.V. (1987). The dynamics of organizational growth and development in an association providing adult continuing professional education. Unpublished doctoral dissertation, State University of New Jersey, Rutgers, NJ.

Hirsch, E.D., Jr. (1987). Cultural literacy: What every American needs to know. Boston: Houghton Mifflin.


Interagency Language Roundtable. (1985). ILR skill level descriptions. Washington, DC: Author.

Joiner, E.G. (1986). Listening in the foreign language. In B.H. Wing (Ed.), Listening, reading, writing: Analysis and application. Middlebury, VT: Northeast Conference.

Jones, R.L. (1975). Testing language proficiency in the U.S. government. In R.L. Jones & B. Spolsky (Eds.), Testing language proficiency. Arlington, VA: Center for Applied Linguistics.

Jones, R.L., & Spolsky, B. (Eds.). (1975). Testing language proficiency. Arlington, VA: Center for Applied Linguistics.

Lambert, R.D., & Freed, B.F. (Eds.). (1982). The loss of language skills. Rowley, MA: Newbury House.

Lange, D., & Lowe, P., Jr. (1987). Grading reading passages according to the ACTFL/ETS/ILR proficiency standard: Can it be learned? In K. Bailey & R.T. Clifford (Eds.), Proceedings of the pre-TESOL testing colloquium. Monterey, CA: Defense Language Institute.

Larson, J.W., & Jones, R.L. (1984). Proficiency testing for the other language modalities. In T.V. Higgs (Ed.), Teaching for proficiency: The organizing principle. Lincolnwood, IL: National Textbook Co.

Liskin-Gasparro, J.E. (1984a). The ACTFL proficiency guidelines: A historical perspective. In T.V. Higgs (Ed.), Teaching for proficiency: The organizing principle. Lincolnwood, IL: National Textbook Co.

Liskin-Gasparro, J.E. (1984b). The ACTFL proficiency guidelines: Gateway to testing and curriculum. Foreign Language Annals, 17, 475-89.


Lowe, P., Jr. (1980, March). Gestalts, thresholds, and the ILR definitions. Paper given at the 1980 presession sponsored by the Georgetown University Round Table on Languages and Linguistics and the Interagency Language Roundtable, Washington, DC.

Lowe, P., Jr. (1983). The ILR oral interview: Origins, applications, pitfalls, and implications. Die Unterrichtspraxis, 16(2), 230-244.

Lowe, P., Jr. (1984a). Setting the stage: Constraints on ILR receptive skills testing. Foreign Language Annals, 17, 375-79.

Lowe, P., Jr. (1984b). "The" question. Foreign Language Annals, 17, 381-88.

Lowe, P., Jr. (1985a). The ILR handbook on oral interview testing. Washington, DC: DLI-ILR Oral Interview Project.

Lowe, P., Jr. (1985b). The ILR proficiency scale as a synthesizing research principle: The view from the mountain. In C.J. James (Ed.), Foreign language proficiency in the classroom and beyond. Lincolnwood, IL: National Textbook Co.

Lowe, P., Jr. (1986). Proficiency: Panacea, framework, process? A reply to Kramsch, Schulz and, particularly, to Bachman and Savignon. The Modern Language Journal, 70, 391-97.

Lowe, P., Jr. (1987). Fear challenges to proficiency: Comments from within the AEI proficiency framework. In A. Valdman (Ed.), Proceedings of the Symposium on Evaluating Foreign Language Proficiency. Bloomington, IN: Indiana University.

Lowe, P., Jr., & Clifford, R.T. (1980). Developing an indirect measure of overall oral proficiency. In J.R. Frith (Ed.), Measuring spoken language proficiency. Washington, DC: Georgetown University Press.

Lowe, P., Jr., & Liskin-Gasparro, J.E. (1986). Testing speaking proficiency: The oral interview [ERIC Q&A]. Washington, DC: Center for Applied Linguistics.

Magnan, S.S. (1985). Teaching and testing proficiency in writing: Skills to transcend the second language classroom. In A.C. Omaggio (Ed.), Proficiency, curriculum, articulation: The ties that bind. Middlebury, VT: Northeast Conference.

Munby, J. (1978). Communicative syllabus design. Cambridge, England: Cambridge University Press.

Murphy, C., & Jimenez, R. (1984). Proficiency projects in action. In T.V. Higgs (Ed.), Teaching for proficiency: The organizing principle. Lincolnwood, IL: National Textbook Co.

Omaggio, A.C. (Ed.). (1985). Proficiency, curriculum, articulation: The ties that bind. Middlebury, VT: Northeast Conference.

Omaggio, A.C. (1986). Teaching language in context: Proficiency-oriented instruction. Boston: Heinle & Heinle.

Osgood, C.E., Suci, G.J., & Tannenbaum, P.H. (1957). The measurement of meaning. Urbana, IL: University of Illinois Press.

Rice, F.A. (1959). Language proficiency testing at the Foreign Service Institute. The Linguistic Reporter, 1, 4.


Rose, M.G. (Ed.). (1987). Translation excellence: Assessment, achievement, maintenance (American Translators Association Scholarly Monograph Series, Vol. 1). Binghamton, NY: University Center at Binghamton (SUNY).

Schulz, R.A. (1986). From achievement to proficiency through classroom instruction: Some caveats. Modern Language Journal, 70, 373-79.

Shohamy, E. (1987). Comments. In A. Valdman (Ed.), Proceedings of the Symposium on Evaluating Foreign Language Proficiency. Bloomington, IN: Indiana University.

Sollenberger, H.E. (1978). Development and current use of the FSI oral interview test. In J.L.D. Clark (Ed.), Direct testing of speaking proficiency. Princeton, NJ: Educational Testing Service.

Stein, W. (1946). Der Kulturfahrplan. Berlin-Grunewald: F.A. Herbig.

Valdman, A. (1987). The problem of the target model in proficiency-oriented language instruction. In A. Valdman (Ed.), Proceedings of the Symposium on Evaluating Foreign Language Proficiency. Bloomington, IN: Indiana University.

Wilds, C.P. (1975). The oral interview test. In R.L. Jones & B. Spolsky (Eds.), Testing language proficiency. Arlington, VA: Center for Applied Linguistics.

Wilds, C.P., & Jones, R. (1974). Interagency interrater interreliability study (Internal document). Washington, DC: Foreign Service Institute; Central Intelligence Agency.


II
A Research Agenda

by John L.D. Clark and John Lett, Defense Language Institute

So much has been said about proficiency-based testing within the past few years that the reader may wonder what else can possibly be learned or reported about it. Fortunately for the researcher in this area, and also ultimately for the user of the information provided by this research, it is unlikely that a point will ever be reached at which "enough" systematic and meaningful data will have been gathered about the strengths and weaknesses, appropriate and inappropriate applications, valid and less valid interpretations, and other major characteristics of proficiency-based testing approaches. While it is not possible to provide a comprehensive inventory of potential research activities that might be carried out over the next several years, a number of issues can be brought to the reader's attention that seem to warrant high priority on the proficiency testing research agenda. Although the present focus is on oral proficiency testing, the research concepts and challenges addressed can be considered substantially applicable to assessment in the other three skill areas (listening, reading, and writing) as well.

The oral proficiency interview is examined first from the two fundamental perspectives of validity and reliability, each of which is seen to subsume and suggest a number of major empirically based studies. A discussion follows of scaling and measurement considerations at issue in the interview procedure and associated rating process. Finally, a recently proposed comprehensive program of proficiency testing research is described that is based on an expanded model of communicative language ability that could serve as an overall organizing structure and vehicle for conducting virtually all the research activities discussed here.

Validity Considerations in Proficiency Testing

The crucial question in assessing the validity of any type of test is simple and straightforward: "How well does the test do what it is supposed to do?" With regard to tests of general proficiency in a second language, it can reasonably be assumed that such tests are supposed to determine "how well a person can speak (or understand speech in, read, or write) the language." While the meaning of "speak" (or understand, read, write) in this context is fairly clear, the proper interpretation and application of the terms "how well" and "the language" are considerably more complex, and go to the heart of the validity issue.

To address first what is meant by "the language" in a proficiency testing context, it is usually understood and agreed that "proficiency" measurement involves determining the examinee's ability to perform in a linguistically (and sociolinguistically) appropriate manner within a variety of language-use situations encountered in real-world contexts external to the instructional setting per se. Although this orientation does not rule out the possibility that a given instructional program would itself emphasize this type of language practice and development on the student's part, it does require that the test developer look exclusively at language use in "real-world" settings in specifying both the linguistic content and manner of operation of the test.

Obviously, no single proficiency test of any reasonable length could more than merely sample the virtually unlimited range of language and language-use situations potentially encountered in the "real world." The science and art of proficiency testing thus involves, to a great extent, the judicious and informed selection of language content and testing techniques in such a way as to maximize the extent to which one can legitimately infer from an examinee's test performance a similar manner and quality of performance in a variety of real-world settings that the test is designed to reflect.

One approach is to restrict the range of real-world situations to which a given test is intended to apply and to focus test content and testing procedures on the smaller area. This strategy was adopted by Munby (1978), who developed a procedure for specifying the linguistic and sociolinguistic situations in which a particular examinee (for example, a waiter in a summer resort) would need to function. Although a language performance test focusing on such a delimited area could readily be developed, its use and interpretation would by the same token necessarily be restricted to the given area. In view of both the economic and administrative infeasibility of developing and using separate performance measures for each of the virtually unlimited number of real-world situations that could be proposed as areas for direct, specialized testing, the only viable alternative would seem to be to use a considerably smaller number of more general tests designed in such a way as to permit reasonable assumptions about the level and quality of examinee performance across a wider range of language-use contexts. This approach has the potential drawback of reduced measurement accuracy by comparison with specifically tailored performance tests.

Given the wide-scale use of the ACTFL/ETS/ILR (abbreviated here to AEI) speaking proficiency interview and associated scoring scale in both academic and government settings, it would be useful to consider, from both conceptual and empirical research standpoints, the extent to which the interview process adequately samples the communicative situations and associated linguistic contexts in which nonnative speakers are most frequently required to perform. In this regard, an immediate observation is that the conversational portion of the AEI interview, while constituting a highly realistic sample of polite, reasonably formal conversation between relative strangers, does not lend itself well to exemplifying discourse situations in which, for example, the examinee is in a higher-status role than his or her interlocutor; is engaging in relaxed conversation with friends; or is communicating in a highly emotional or other affect-laden manner. Although an approximation of these other types of discourse settings is offered by the use of printed cards describing various communication situations that the examinee is asked to role-play with the interviewer, this approximation remains at a considerable physical and psychological remove from the genuine communicative situation.

A second concern is that neither the regular interview portion nor the situational role-play provides a ready or straightforward opportunity for the examinee to engage in extended narration, although experienced interviewers are often able to elicit some "monologue" discourse from the examinee. Also, the testing procedures used by the Foreign Service Institute (FSI) have recently been modified to include a "briefing" section in which the examinee is required to speak continuously for several minutes about information provided in a printed text studied a few minutes previously. It remains fair to say, however, that although the role-play procedure, briefing technique, and other modifications of the basic interview represent useful attempts to go beyond the sociolinguistic and discourse constraints of "polite conversation," these have been fairly adventitious and have not been guided by a preliminary analysis of the various situations (and their relative frequencies) in which nonnative learners of the language would characteristically be required to communicate.

One way to address this aspect of the validity question would be to conduct a Munby-type (1978) analysis, at a suitable level of generality, of the communication situations most frequently encountered by given groups of nonnative users of the language. Such an analysis would produce information on which to base the content of role plays and possibly other components of the interview, depending on the group. Americans traveling abroad as tourists might constitute one such group, while another group might consist of native English speakers studying or working in the target-language country over extended periods. Analyses of the speaking requirements of these and several other groups could be conducted, and the resulting language-use profiles could be compared. To the extent that the profiles overlap, the use of a single testing process and content to apply to several different groups would be a reasonable approach. On the other hand, it would not be appropriate to attempt to apply a single uniform process and content across groups whose language-use needs are found to differ appreciably as to types of language and the situations in which the language would need to be used.
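As a rough illustration of what "overlap" between language-use profiles might mean operationally (an assumption made for this sketch, not a procedure proposed by the authors), one could compare two groups' situation inventories with a simple set-overlap measure:

```python
# Hypothetical illustration of comparing language-use profiles across groups.
# The situation inventories and the Jaccard measure are assumptions made
# for this sketch, not part of the authors' proposal.

def jaccard(a, b):
    """Proportion of situations shared by two groups' language-use profiles."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

tourists = {"ordering meals", "asking directions", "shopping", "hotel check-in"}
students_abroad = {"ordering meals", "asking directions", "attending lectures",
                   "registering for courses", "shopping"}

print(round(jaccard(tourists, students_abroad), 2))  # 0.5: half the pooled situations are shared

# A high overlap would support a single test for both groups;
# a low overlap would argue for separate content or procedures.
```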

In other words, such linguistic needs analyses, applied as content/process criteria in the review of a given proficiency test, would serve to indicate and document the range of applicability of the instrument in question. At the same time, such analyses could identify real-world contexts for which the test content and manner of operation would not represent a reasonable sample; in such instances, examinee performance on the test would not provide an accurate prediction of real-life performance.

Turning to issues involved in determining "how well" the examinee is able to use the language in real-life communication settings, it is proposed for purposes of initial discussion that development and validation of the AEI (or any other scale purporting to define and quantify second language proficiency) should take adequately into account native-speaker judgments of the nature and effectiveness of the examinee's communication attempts, because it is precisely with native interlocutors that the examinee would be expected to communicate in the real-life contexts the scale is intended to reflect. It should be noted that scale development based on native-speaker judgments would need to take into consideration a number of sources of variance inherent in such judgments, including the sociolinguistic context in which they are made; the age, sex, educational level, and profession of the native-speaker judge; and whether the impact of these factors varies according to the linguistic and cultural norms of the language and national community to which the judge belongs. These and related issues are discussed later in this section.

Native-speaker judgment studies available to date suggest that this is a fertile field of inquiry with major implications for proficiency-based measurement. From a strict comprehension-of-message point of view, Burt (1975) found that errors by nonnative English speakers involving incorrect word order or other sentence-level distortions resulted in considerably greater miscommunication than did "local" errors such as omission of the auxiliary in phrases such as "Why [do] we like . . . ?" Similarly, Chastain (1980) found, though in a writing rather than a speaking context, that native speakers of Spanish were much more tolerant of simple "grammatical" errors (e.g., use of the wrong gender of definite and indefinite articles) than they were of errors that affect meaning, such as the use of a contextually inappropriate lexical item. With regard to affective reactions of native speakers to various types of nonnative speech, Raisler (1976) found that spoken English that was proficient in all respects except for accent was not only negatively received by native listeners but also considered, quite erroneously, to contain "grammar errors." In a study by Galloway (1980), native speakers of Spanish viewing videotapes of American learners attempting to communicate in that language gave considerable credit to people whom they perceived as exerting "effort in expressing themselves," by comparison with other more linguistically advanced students for whom such effort was less apparent. Galloway concluded that "an individual's lack of grammatical accuracy may not produce negative reactions if the desire and urgency to communicate are evident" (p. 430). (See also Mattran, 1977; Ludwig, 1982.)

Intriguing questions can be posed about the nature and degree of "fit" between the AEI level ratings of given examinees and native-speaker perceptions of their communicative effectiveness. An immediate area of investigation would be the possibility of rating "disconnects" between various portions of the AEI scale and native speakers' quantified judgments. Would native speakers indeed make distinctions between, for example, examinees rated Novice-Mid and those rated Novice-High, or would they assign both to a single undifferentiated group? In the latter case, some concern might emerge as to whether the "fine-grained" ACTFL levels are indicative of true differences in communicative performance in the perception of native speakers.
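One way such a "disconnect" might be probed empirically is with a simple rank-based comparison of native-speaker judgments collected for examinees at two adjacent sublevels. The following is a minimal sketch only; the rating values, group sizes, and variable names are invented for illustration and are not drawn from any actual study.

```python
# Sketch: do naive native-speaker judgments separate two adjacent ACTFL levels?
# Hypothetical data: each value is one native listener's 1-9 "communicative
# effectiveness" rating of a speech sample from an examinee at that level.
from scipy.stats import mannwhitneyu

novice_mid_ratings = [3, 4, 3, 5, 4, 3, 4, 2, 4, 3]    # examinees rated Novice-Mid
novice_high_ratings = [4, 5, 4, 6, 5, 4, 5, 3, 5, 4]   # examinees rated Novice-High

# A rank-based test is used because the judgments are at best ordinal.
stat, p = mannwhitneyu(novice_mid_ratings, novice_high_ratings, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, p = {p:.3f}")
# A nonsignificant result would suggest that native listeners do not treat
# the two sublevels as genuinely different groups.
```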

Assuming research designs that provide adequate control of native-speaker variables, detailed linguistic analysis of "less proficient" and "more proficient" nonnative speech as judged by native listeners could be carried out both for its own sake and as one approach to validating the AEI level descriptions in a noncircular manner. For example, a set of interview samples might be judged by naive native speakers to constitute increasingly "proficient" performances; if it were also found to exhibit the types and degrees of control of structure, lexicon, phonology, and so on, that would be predicted for them on the basis of the AEI descriptions, considerable confidence could be placed in the operational value of the level descriptions as reflective of real-life linguistic phenomena. Identification of linguistic aspects that do not pattern similarly on the two continua might suggest desirable changes in wording for one or more of the level descriptions, including possibly some shifting of the points on the AEI scale at which given linguistic phenomena are considered to occur.

While close examination of the AEI scale through the optic of naive native-speaker judgments is a research area of extremely high potential, a properly conducted research program in this area would need to address several conceptual and operational complexities. First is the degree of consistency and uniformity that native speakers would demonstrate in their judgments of language-related phenomena. If a given native speaker's appraisal of the level of proficiency represented by a particular speech sample were found to vary widely from one judging occasion to the next (i.e., low intrarater reliability), or if several native speakers, working independently, were found to assign widely differing ratings to the same speech samples (i.e., low interrater reliability), meaningful comparisons between "native judgment" data and AEI level assignments would be extremely difficult or impossible to make.

Second, closely related to the reliability-of-judging issue is the matter of providing valid and meaningful judging instructions. In this regard, references to "grammar," "vocabulary," "fluency," and other linguistic features would probably not be easily interpreted by the naive native speaker and would give rise to highly idiosyncratic judging performances. Instructions to assign given speech samples to one of several arbitrary categories of "increasing proficiency" (the so-called Q-sort technique) would be expected to provide more meaningful and reliable data, as would the forced choice between pairs of speech samples (this process, however, would be considerably more laborious and time-consuming).

A third very significant issue is the distinction between linguistic and nonlinguistic characteristics of nonnative speakers' communicative performances as these might influence the judgments of native speakers. Variations in the psychological or personality characteristics of given individuals (e.g., taciturn vs. garrulous, outgoing vs. withdrawn) may be expected to influence native speakers' estimates of linguistic proficiency to some extent, independently of the technical accuracy of the language performance. Ethnic differences between nonnative speakers and native judges, as well as real or perceived differences in social status, degree of education, and so on, as these traits are conveyed aurally and/or visually in the course of the judging process, would be expected to introduce a certain degree of error variability in the rating of linguistic proficiency per se.

A fourth concern is that examinees vary widely in the extent to which they are perceived as effective communicators in their native language. At least some of this variance would presumably be reflected in appraisals of their target-language competence by native judges. Although it would be appropriate for this variance to contribute to the results of tests aimed at, for example, selecting the individual best suited for a job in which interpersonal communication skills were a primary consideration, it would seem desirable to limit such variance when the objective of the judgment is simply to assess the extent to which an individual has acquired the ability to use a second linguistic system. One approach lending itself to exploratory research would be to administer the AEI interview to examinees in both the target language and the native language, statistically partial out the effects of the latter on the former, and examine the resulting scores from a variety of perspectives.
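The statistical step mentioned here, partialling native-language interview performance out of the target-language scores, amounts to a residualization. The sketch below illustrates the arithmetic with invented scores and a simple least-squares adjustment; it is an illustration of the general technique, not a procedure prescribed by the authors.

```python
# Sketch: remove the variance that native-language (L1) interview scores share
# with target-language (L2) interview scores, leaving residual L2 scores.
# All scores are invented for illustration.
import numpy as np

l1_scores = np.array([2.5, 3.0, 3.5, 2.0, 4.0, 3.0, 2.5, 3.5])  # native-language interview
l2_scores = np.array([1.5, 2.0, 2.5, 1.0, 3.0, 1.5, 2.0, 2.5])  # target-language interview

# Ordinary least-squares fit of L2 on L1; the residuals are the L2 scores with
# the linear effect of native-language communicative ability partialled out.
slope, intercept = np.polyfit(l1_scores, l2_scores, 1)
adjusted_l2 = l2_scores - (slope * l1_scores + intercept)

print("Adjusted (residual) L2 scores:", np.round(adjusted_l2, 2))
```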

Despite the complexities in the use of native-speaker judgments as external criteria for developing language proficiency tests, and the associated need to pay close attention to these factors in the design and conduct of research studies, none of these issues should be considered to pose insurmountable problems or negate in any way the extremely important role that "native-speaker studies" would be expected to play in further investigations of the validity and practical utility of the AEI proficiency scale and rating technique.

The discussion so far has dealt primarily with the substance and manner of operation of the AEI testing procedure itself, or with questions of content validity. Another large potential research area is that of the construct validity of the AEI testing approach. Briefly characterized, construct validity studies involve the empirical determination of the tenability of hypotheses derived from measurement claims made by or implicit in the testing procedures under study. With regard to the AEI interview and scoring scale, probably the most salient hypothesis, arising directly from the stated purpose of this testing procedure, is that examinees who score at a given level on the AEI scale will in fact be able, in the real-world setting, to carry out each of the language-use tasks or functions at issue in that proficiency level description.

One theoretically possible way to test this hypothesis is constant surreptitious observation of examinees at given score levels going about their day-to-day language-use activities in the target language. This approach, unfortunately, is impracticable, despite its conceptual appeal. A second, considerably more feasible approach is to have the examinee provide a self-report of ability to perform the indicated tasks or functions within the actual language-use setting. Self-report questionnaires on perceived degree of functional ability in the language have been developed and used with some success in a parametric study of second language learning by Peace Corps volunteers (Carroll, Clark, Goddu, Edwards, & Hendrick, 1966) and in projects of the Experiment in International Living (1976). More recently, an Educational Testing Service study found that a set of "can-do" self-appraisal statements correlated at a level of about .60-.65 with results of an AEI-type direct proficiency interview (Clark, 1981).

Another criterion data-gathering possibility is to ask second parties who are in a reasonable position to do so to make proficiency-related judgments about examinees' language performance in specified contexts. For example, supervisors of foreign service officers might be asked to make judgments of these individuals' ability to perform the particular communicative tasks at issue in the upper range of the AEI scale, with these judgments being subsequently correlated with the officially measured proficiency levels.

In all of these construct validation efforts, it would of course be necessary to keep in mind the potential unreliability of the self-rating, second-party, or other types of criterion data themselves, and to refine and objectify as much as possible the questionnaires or other elicitation procedures used in obtaining the validation information. While "perfect" criterion data will probably never be obtained, this consideration should in no way vitiate the concept of construct validation as applied to language proficiency testing, nor diminish the practical need for such validation. To the contrary, the dearth of construct validation studies within AEI research to date suggests a high priority for such investigations.

Reliability Considerations

Issues of reliability also figure prominently in the program of research needed to more fully investigate and develop both the AEI technique and other approaches to language proficiency assessment. Just as there are several different aspects of validity, each contributing to an appreciation of the overall "validation situation" for a particular instrument or testing approach, several different types of "reliability" exist, each meriting examination within a comprehensive analysis of this aspect of test performance. To begin with a generalized definition of the term, a test can be viewed as reliable to the extent that it provides identical scoring results for a given individual across each of a number of varying test-administration or other conditions (and under the assumption that the individual's true knowledge or ability in the subject matter remains constant across the testing occasions in question). For the AEI interview, major "varying conditions" would include:


1. variation in the types of discourse, topics, and so on broached in the interview, producing differences in performance attributable solely or predominantly to the "luck of the content draw";

2. variation in interviewing technique, including differences in the personality or "style" of the interviewers, that affects performance independently of overall proficiency; and

3. for any given interview performance, variation in ratings assigned by the judge(s), either for a given judge rerating a particular interview (intrarater reliability) or across two or more judges rating the same performance (interrater reliability); a brief computational sketch of these two indices follows this list.
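As a rough illustration of the two indices named in item 3 (a minimal sketch with invented ratings, not drawn from any actual AEI data), intrarater reliability can be estimated by correlating one judge's ratings of the same interviews on two occasions, and interrater reliability by correlating two judges' independent ratings of the same interviews. A fuller analysis might prefer an intraclass correlation or a weighted kappa.

```python
# Sketch: simple correlation-based estimates of intrarater and interrater
# reliability for interview ratings (invented data).
from scipy.stats import pearsonr

judge_a_first = [2, 3, 1, 4, 2, 3, 5, 2]    # Judge A, first rating occasion
judge_a_second = [2, 3, 2, 4, 2, 3, 4, 2]   # Judge A, same tapes rerated later
judge_b = [3, 3, 1, 4, 2, 4, 5, 2]          # Judge B, same tapes, rated independently

intra_r, _ = pearsonr(judge_a_first, judge_a_second)   # intrarater reliability
inter_r, _ = pearsonr(judge_a_first, judge_b)          # interrater reliability
print(f"intrarater r = {intra_r:.2f}, interrater r = {inter_r:.2f}")
```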

The potential for scoring unreliability associated with the first item might be reduced in at least two ways. First, the length of the interview could be extended so that a wider range of discourse styles and topical areas can be explored. However, given the already considerable human-resource requirements of interview-based testing of "normal" duration (about 10-40 minutes, depending on the overall proficiency level of the examinee), lengthening of the interview would probably not be well received from an administrative or budgetary perspective. A more viable approach would be to standardize the content of the interview more closely, both across examinees and interviewers, to reduce or eliminate the content sampling problem as a potential source of unreliability. At one extreme of the degree-of-standardization continuum is the completely fixed format of the so-called "semidirect" speaking test (Clark, 1979), in which the examinee orally responds to questions posed by a tape recording and/or to stimulus material in a printed booklet, an approach that completely eliminates across-test variation in test content. Semidirect tests, designed to reflect as closely as possible the AEI interview with respect to both the topical areas dealt with and types of discourse elicited, have been developed (Clark, 1986) and have been found to correlate significantly (r = .89-.96) with concurrent live interviews. However, because the semidirect approach does not provide for the examinee to negotiate meaning, use repair strategies, or carry out other interactive discourse tasks characteristic of live conversation, the face and operational validity of this testing procedure is somewhat less than that of the live interview.

A combined approach is possible that would both maintain the interactive nature of the live interview and increase the uniformity of test content: The tester could use printed scripts or other aids as a guide in selecting particular types of questions or topics to be broached in the interview. This general approach has already been implemented to some extent in the "situation cards" developed by ACTFL, which provide descriptions of potential role-play scenarios at various AEI scale levels. Analogous lists of possible questions or categories of questions from which the tester could draw as necessary during the conversational portion of the interview could also help increase the across-interview uniformity of content and testing procedure. Cautions to heed when using this approach include both the need to provide a sufficient number of questions in the selection pool to keep details of the test content from becoming known to prospective examinees, and the need to maintain a reasonable degree of flexibility and spontaneity of discussion within the test as a whole.

Across-interview variation in examinee performance attributable to differences in testers' elicitation techniques, including affective or personality characteristics of the interviewer, may also be presumed to negatively affect test-retest reliability. Interviewers who show genuine interest in the information being communicated and who interact with the examinees in an empathetic manner would, in general, be expected to elicit better performances from the same examinees than would more disinterested or "colder" interviewers, especially at the lower range of the proficiency scale. In view of the fairly subtle phenomena at issue, an empirical test of this hypothesis would probably require an exaggerated approach in which participating interviewers would consciously emphasize either a highly interested and friendly or a highly disinterested and cool testing style in interviewing the same group of examinees. Counterbalanced administration of the two types of interview, together with single- or double-blind ratings, would be needed to control for test-order and other effects, and a fairly large number of examinees also would be required. However, there are no insurmountable research-design impediments to a thorough examination of these and other variables in interviewer and elicitation techniques as these may relate to differences in examinee performance across different interview occasions.

A major program of empirical research is needed to fully examine the intra- and interrater reliability characteristics of the AEI interview and scoring scale. Although a few small-scale studies have been conducted in this area (Adams, 1978; Clark, 1978, 1986; Clark & Li, 1986; Shohamy, 1983), little definitive information is presently available in answer to questions such as the following: For trained raters, what is the typical amount of scoring variation that a given rater will exhibit in the repetitive scoring of a given examinee performance? Secondly, how and to what extent does intrarater reliability vary as a function of the technical mode (audiotape versus videotape) under which the rerating takes place? It may be hypothesized that reratings based on audiotape playback will, in general, result in somewhat lower ratings than were initially assigned. Such a phenomenon would be attributable to the fact that grammatical and other linguistic shortcomings in the speech sample would be more salient in a relaxed, after-the-fact, audio-only review than they would have been during the real-time interview, in which gestures or other visual cues on the part of the examinee might have masked (and perhaps properly so, from a communicative standpoint) certain technical flaws in the examinee's speech performance per se. In addition, in interviewing situations in which a single tester is required to both elicit and rate the speech sample, the cognitive and performance demands of the former activity may make it difficult for the tester/rater to attend as closely as would otherwise be possible to specific linguistic details of the examinee's performance. For check-rating purposes, videotape (rather than audiotape) playback would more closely approximate the original communicative situation and would be considered preferable for this reason alone. If audiotape-based check-ratings were found to be systematically biased by comparison with the original ratings, again presumably in the direction of lower scores, this would even more strongly suggest the advisability of archiving and check-scoring interviews by means of videotapes rather than audiotapes.
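The hypothesized downward bias of audiotape-based check-ratings could be examined with a simple paired comparison of the original live ratings and the audio-only reratings of the same interviews. The sketch below uses invented ratings and a standard paired t-test purely to show the form such a check might take; it is not a procedure specified by the authors.

```python
# Sketch: is there a systematic shift between original live ratings and
# audiotape-based check-ratings of the same interviews? (Invented data.)
import numpy as np
from scipy.stats import ttest_rel

live_ratings = np.array([3, 2, 4, 3, 2, 5, 3, 4, 2, 3])     # assigned at the interview
audio_reratings = np.array([3, 2, 3, 3, 2, 4, 3, 3, 2, 3])  # later audio-only reratings

mean_shift = (audio_reratings - live_ratings).mean()
t, p = ttest_rel(audio_reratings, live_ratings)
print(f"mean shift = {mean_shift:+.2f} levels, t = {t:.2f}, p = {p:.3f}")
# A reliably negative shift would argue for archiving and check-scoring
# interviews on videotape rather than audiotape.
```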

A third question is, for situations in which a single individual is required to both administer the interview and at the same time listen attentively to and ultimately rate the examinee's speaking performance, what effect would this dual testing task have on rating reliability, by comparison with a two-tester procedure in which one person would concentrate on elicitation and the second on performance analysis and rating? Presumably, the cognitive and performance demands on the interviewer of eliciting a proper and complete speech sample would make it more difficult to attend to specific linguistic details of the examinee's performance than would be the case with a separate listener/rater concentrating only on these aspects. Thus, strictly from a scoring reliability standpoint, the listener/rater would probably be in a somewhat better position to make consistent rating judgments than the interviewer. On the other hand, the validity of this process could be questioned to the extent that the interviewer/rater attended more carefully to particular strengths or weaknesses in the examinee's linguistic performance than would be the case under normal communication situations in real life. In any event, the first step in examining questions of this type would be to conduct a comparative study of these two administration formats to determine the nature, extent, and probable practical significance of any rating reliability differences.

Fourth, an important question that has yet to be systematically addressed concerns changes in scoring reliability on the part of testers following their initial interviewer/rater training. It is generally assumed, and indeed borne out in informal experience, that certified testers tend to exhibit over time at least some degree of departure in rating performance from the original standards against which they were trained. However, the direction and magnitude of these scoring variations, as a function of both the amount of elapsed time following initial training and the amount of interviewing/rating activity engaged in over that time period, have not been rigorously investigated. One possible research approach would be to have trained raters periodically rerate a set of calibrated interviews (possibly in alternate forms for the different rating occasions) covering the full range of AEI score levels. They would also be asked to keep a chronological log of the real interviews they were conducting and rating, as well as any other interviewing/rating-related activities (such as check-rating of colleagues' interviews) in which they were engaged during the study. Assuming a sufficiently large number of participants, it would be possible to place each person along a broad continuum of post-training activity, ranging from virtually constant interviewing and rating to little or no activity following initial training. Various correlative data could also be gathered, including the duration and intensity of the initial training; the degree of accuracy in rating the initial calibration tapes (ranging from borderline to consistently on target); trainer judgments as to how thoroughly the individual appeared to have understood and "internalized" the basic rating concepts; the tester's own level of proficiency in the target language, specifying whether the tester is a native or nonnative speaker; and, to the extent reasonably possible, several different measures of psychological or personality characteristics that might be hypothesized to contribute to or militate against the maintenance of rating accuracy over time.

Such a study could provide data critical to arriving at informed answers to questions concerning:

1. changes in rating reliability over time in the absence of interviewing/rating opportunities;

2. the direction of these changes (i.e., tendency to increasing severity, generosity, or random variation); the comparability of these changes across raters (as opposed to idiosyncratic or unpredictable variation);

3. possible interactions between maintenance or loss of scoring accuracy and the particular proficiency levels evaluated; for example, level 3 or higher performance might continue to be accurately rated, with wider departures at the lower levels (or vice versa);

4. the contribution of ongoing interviewing/rating activities to maintenance or loss of rating accuracy;

5. performance during tester training as a predictor of maintenance or loss of rating accuracy after training;

6. native versus nonnative speaker status as a predictor of maintenance or loss of accuracy; for nonnatives, the influence of the rater's own proficiency level in the target language; and

7. the relationship of background and/or personality characteristics of raters to maintenance or loss of accuracy.

Answers to such questions could lead in turn to the development of more effective guidelines for the selection of prospective interviewer/raters; improved initial training procedures; specifications for the optimum amount and frequency of posttraining rating practice needed to maintain accuracy; and more precisely designed and targeted "refresher/retraining" procedures.

A reliability-related question with major practical implications is that of optimum interview length. An earlier study in this area (Clark, 1978) showed quite high correlations between ratings based on severely shortened interviews (about 6 minutes on average) and longer "standard" interviews of 20 minutes or more. Although the need for face and content validity, as well as user acceptance of the results, is likely to require a somewhat longer speech sample, it may be hypothesized that for each score level on the proficiency scale a rough total time limit exists beyond which scoring reliability can increase very little if at all. Should this hypothesis be confirmed and the limits identified, longer interviews could be said to waste expensive tester resources without increasing rating accuracy.

Although several research approaches to this question are possible, one reasonable procedure would be to have several raters independently listen to a set of normal-length interview tapes. At two-minute intervals, the raters would be asked to make their best current guess as to the examinee's proficiency level. In addition, they would be asked to indicate precisely when they become "certain" of the level they would ultimately assign. Listening would continue for several minutes beyond this point to provide an opportunity for any change of mind. For each rater, examination of the scores assigned at each of the checkpoints would indicate, for each proficiency level, the length of interview at which the individual's rating judgments tended to stabilize. Across-rater comparisons at each time period would reveal changes in interrater reliability (if any) as a function of increasing test length, as well as possibly suggest an "upper bound" length (for each proficiency level) beyond which little or no further improvement in scoring reliability would be expected.
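A small bookkeeping routine could summarize such checkpoint data. In the sketch below, the stabilization point for a tape is taken as the first checkpoint after which one rater's running rating never changes again; the tape names, ratings, and two-minute interval are hypothetical and serve only to illustrate the tabulation.

```python
# Sketch: for each taped interview, find the elapsed time at which one rater's
# running rating stops changing (hypothetical ratings recorded every two
# minutes of listening).
checkpoint_ratings = {
    "tape_01": [1, 2, 2, 2, 2, 2],   # ratings at minutes 2, 4, 6, 8, 10, 12
    "tape_02": [2, 2, 3, 3, 3, 3],
    "tape_03": [3, 3, 3, 4, 4, 4],
}

def stabilization_minute(ratings, interval=2):
    """Return the elapsed minutes after which the rating never changes again."""
    last_change = 0
    for i in range(1, len(ratings)):
        if ratings[i] != ratings[i - 1]:
            last_change = i
    return (last_change + 1) * interval

for tape, ratings in checkpoint_ratings.items():
    print(tape, "stabilized at minute", stabilization_minute(ratings),
          "with final rating", ratings[-1])
```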

Scaling and Measurement Considerations

The preceding sections have examined the AEI testing procedure from the twin perspectives of validity and reliability, each of which was seen to generate a number of secondary issues meriting further research. Although the issues of scaling to be addressed in this section can also be viewed as components of validity and reliability in the broadest sense of these terms, it is convenient to separate them for discussion here. Two basic scaling-related questions are addressed: first, to what extent can the AEI level descriptions be considered a scale in the rigorous psychometric meaning of this term? Secondly, what are the properties of this scale and the implications of these properties for the accurate measurement of communicative performance?

In order for a system of ratings or descriptions to be considered a scale, the ratings or descriptions must (a) denote the relative presence or absence of some substance or quality and (b) be capable of validly and reliably ordering people or objects according to the extent to which they possess the substance or quality in question. For example, an attitude scale that purports to rate individuals according to their degree of ethnocentricity must be able to demonstrate that it can (a) denote or label individuals possessing varying degrees of "ethnocentrism" and (b) validly and reliably (across varying testing instruments and raters) order individuals who are known, from some source of information external to that provided by the scale itself, to differ along the dimension of "ethnocentrism." These definitions assume that the dimension of "ethnocentrism" is indeed scalable and that the construct underlying the scale is essentially unidimensional.

Before discussing in detail a number of scale-related issues in the AEI proficiency level descriptions, it is necessary to acknowledge that over a period of about 30 years, language training professionals, chiefly within the U.S. government, have effectively used the ILR proficiency descriptions and associated testing procedures to make hundreds of thousands of judgments about examinee language performance. Furthermore, these judgments have generally been considered, with a fairly substantial amount of empirical support, to provide sufficiently reliable and valid indications of language performance to properly accomplish the practical assessment purposes for which they were designed and used within the government community. In this regard, development and use of the ILR scale and testing procedure must in all candor be viewed as the most significant and most highly consequential measurement initiative in the proficiency testing field to have occurred over the last three decades.

Nevertheless, despite the high degree of practical utility and the high level of "user satisfaction" that are quite properly attributed to the ILR testing approach, close examination of the ILR level descriptions with regard to the specific psychometric properties that a true measurement scale is expected to demonstrate presents some technical, and ultimately practical, concerns. More explicitly stated, although the fact that individuals can be trained to make reliable judgments that are also viewed as valid by their peers may provide evidence for attributing scalar properties to the judging system being used, such evidence is not in and of itself sufficient to fully establish the scalar nature of the system: Unidimensionality and other definitional characteristics of a true scale must also be examined and verified.

The unidimensionality question is especially pertinent in the area of human performance. In practice, many scales violate the unidimensionality principle, and with relative impunity. In fact, to the extent that the various component dimensions of a particular construct (such as "language proficiency") are conceptually relatable and are all equally scalable, it is often meaningful and psychometrically appropriate to combine them into a single "scale" and to make judgments based on the values obtained on that single index. This is indeed the approach adopted in defining and using many social-psychological scales, and it is also the implicit, if not always explicit, assumption underlying ILR scale-based proficiency assessment. Note, for example, the original FSI "factor" subscales of accent, lexicon, structure, fluency, and listening comprehension, each of which was considered to contribute, in a generally additive manner, to the total proficiency rating; as well as the "functional trisection" of content, function, and accuracy, as initially propounded by Higgs and Clifford (1982).

Closer examination of the linguistic aspects underlying the factor subscales and the functional trisection, as well as the observed behavior of both types of subscales as they are used in practice, suggests that both are somewhat problematic in terms of scalability. The factor scores are manifestly not linearly related throughout the ILR scale range in that their proportional contribution to the overall proficiency level rating varies as a function of the level itself, as manifested in the well-known "butterfly" graph (see Figure 2.1) presented and discussed by Higgs and Clifford (1982). By the same token, the trisection components do not appear to possess equal scalar characteristics. The linguistic accuracy of an examinee's performance may reasonably be viewed as lying somewhere along a continuum ranging from none at all to that associated with native-speaker competence. Also, in terms of sociolinguistic functions, it could be posited that at least some types of functions can be ordered along a continuum. For example, simple description may be viewed as less demanding than supporting one's opinions or convincing others to change theirs. However, the assumption that the content of a given performance can be similarly ordered is considerably more difficult to support, particularly in view of the fact that in testing practice, the ability to function in specified content domains has been assumed to be both a requirement for a given skill-level rating and evidence that an examinee may deserve it. Recognition of the so-called "hothouse special" phenomenon (i.e., an individual inordinately well versed in a particular lexical domain or specific limited area of discourse) is an acknowledgment of this problem, but only from one point of view: that of requiring the individual so labeled to demonstrate control over the other content areas specified in the given level description in order to receive that rating.

[Figure 2.1. Hypothesized Relative Contribution Model: the relative contribution (in percent) of vocabulary, grammar, pronunciation, fluency, and sociolinguistic factors to the overall rating at ILR levels 0 through 5.]

Note. From "The Push Toward Communication" (p. 69) by T.V. Higgs & R.T. Clifford in T.V. Higgs (Ed.), Curriculum, Competence, and the Foreign Language Teacher, 1982, Lincolnwood, IL: National Textbook Co. Copyright 1982 by National Textbook Co. Reprinted by permission.

Whether certain content areas are intrinsically more difficult to handle than others (a necessary condition for placing them along a difficulty continuum) has yet to be established.

A second area in need of detailed research is that of across-language variation in the scaling of particular elements within the accuracy, function, and content domains. For example, with regard to accuracy, the ability to properly convey time/tense is probably a much easier accomplishment (and hence, presumably, lower on the scale) in Chinese than in German. Similarly, expressing apologies would presumably fall at an appreciably different point on the sociolinguistic function continuum for Japanese than for Spanish. Although it would certainly be possible to specify different configurations of the function, content, and accuracy scales for each different language or language group, such an approach would raise serious questions as to the underlying measurement theory or concepts involved, since it would represent an ad hoc process of proficiency scale definition rather than an a priori, principled, and, most important from the validity standpoint, generalizable approach.

One way to resolve the content dilemma might be to examine it in terms of two separate dimensions of lexicon: range of vocabulary and precision of vocabulary. Viewing examinee control of "content" in this manner might help reduce the size and significance of the test-versus-real-world sampling problem posed earlier. With regard first to range of vocabulary, a given level of proficiency could be defined as requiring (among other things) the ability to make use of the lexicon associated with a number of areas of discourse. Each higher proficiency level would require control of lexicon in an increasing number of areas. Precision of vocabulary would be scaled on a continuum ranging from virtually no ability to retrieve and use basic lexicon of the content areas in question to a high level of accuracy and sophistication in lexical choice, including the ability to readily express fine nuances of meaning within each content area. However, this approach would still leave unaddressed the variation in content attributable more to functions than to lexicon.

Another alternative to developing a uniform, hierarchical scale of "content" would be to make no scalar claims whatsoever for the content dimension, but simply to describe the examinee's ability to function within each of several explicitly stated content areas. This approach could give rise to an assessment procedure in which, for example, an examinee would be evaluated as a Level 3 speaker in the areas of politics, fine arts, and economics (i.e., would have satisfied Level 3 function and accuracy requirements within these lexical domains), with no specific performance claims being made or implied for discourse areas outside these particular domains.

One experimental approach to resolving, or at least considerably clarifying, the content issue would be to carry out a concurrent validation study in which given examinees would be administered both a traditional interview and several much more highly specialized "lexical domain" tests in such diverse areas as food, lodging, transportation, autobiographical information, current events, sports, economics, politics, science and technology, and other appropriate specialized topics. At issue would be both the uniformity of a given examinee's performance across the specific lexical domains and the extent to which the traditional interview would be able to predict performance for some or all of the lexical domains. Results of such a study would help to determine (a) whether the traditional interview is doing a reasonable job of standing in for more detailed assessments of examinee control across a variety of content domains; (b) whether and in what respects caution should be exercised in interpreting interview results as reflective of broad "content control"; (c) whether relatively modest changes in the interviewing process with respect to the types of content areas broached might provide appreciable improvements in predictive validity; or (d) whether the correspondence between interview performance and lexical/content control across a variety of areas is so tenuous as to indicate a need for complete rethinking of the assessment procedures necessary to this task. Experience with the AEI procedure to date suggests that the actual outcomes of the proposed study would involve some combination of the first three possible conclusions; in any event, a considerable amount of properly obtained and carefully analyzed empirical data would be needed to adequately address these issues.

Levels of Measurement

In addition to the question of dimensionality raised in the preceding section, it is necessary to consider the statistical level of measurement (categorical, ordinal, interval, or ratio) represented by the AEI scale. The level-of-measurement question is important because it relates to the types of uses to which the measurement procedure can legitimately be put, as well as to the kinds of statements or inferences that can properly be drawn on the basis of the testing results. A categorical scale would be quite satisfactory for separating individuals into groups, such as those who can do a given job and those who cannot, or those who are better suited for job A than for job B. In both of these cases, a categorical scale is adequate only when differences in types of linguistic performance are being evaluated, not greater or lesser degrees of performance.

For most uses to which the AEI interview is typically put, an ordinal or, in some instances, interval scale would be required. For example, the task of placing incoming students into appropriate language courses, or that of selecting the best-qualified individuals for important positions requiring language skills, would require at least an ordinal scale, since these applications all involve ranking or ordering the individuals in question in the language. An even higher level of measurement, the interval scale, is needed when the measurement intent is, for example, to determine how nearly the student has approached the next higher proficiency level since the last assessment was made. Experimental studies involving the use of parametric statistics (e.g., correlational analyses using language proficiency scores as criterion variables in language-learning research) also require interval-level data. The use of noninterval data in such analyses places the results in jeopardy to the extent that the statistical procedure used is affected by violation of the assumption that interval data are in fact being used.
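The practical consequence can be illustrated by setting a rank-based statistic, which assumes only ordinal data, beside a parametric one, which treats the ratings as interval-level measurements. The data and numeric coding of the proficiency levels in the sketch below are invented solely to show that the two coefficients rest on different assumptions about the scale.

```python
# Sketch: Spearman rank correlation assumes only ordinal data; Pearson
# correlation treats the proficiency ratings as interval-level measurements.
# (Invented data; proficiency levels coded numerically for the computation.)
from scipy.stats import pearsonr, spearmanr

proficiency = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]    # AEI-type level ratings
study_hours = [120, 260, 400, 620, 900, 1300, 1800]  # hypothetical hours of instruction

pearson_r, _ = pearsonr(proficiency, study_hours)
spearman_r, _ = spearmanr(proficiency, study_hours)
print(f"Pearson r = {pearson_r:.2f} (interval assumption), "
      f"Spearman rho = {spearman_r:.2f} (ordinal only)")
```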

Although considerable evidence exists, both statistical and anecdotal, that the AEI and similar proficiency scales are at least ordinal in nature, the question of whether they are interval as well is more complex because of the empirical difficulties involved in investigating certain fundamental assumptions associated with language proficiency theory as embodied in the AEI guidelines and testing approach. For example, it is generally presumed, and indeed asserted, by those involved in AEI-type testing that the "distance" between any two levels increases as one progresses up the proficiency scale, and that the "plus" point within any given level is closer to the value of the next higher level, rather than midway between the two levels. Although certified testers are taught to make judgments based on these assumptions, to determine whether the scale properties implicit in these assumptions have any basis in external "fact" would require that the proficiency level ratings themselves be compared against some other scale of known interval properties and of such a nature as to have a high degree of face validity within the proficiency testing movement. Various investigations that might be carried out within the overall research framework described in the following section could help to resolve this question, as well as others posed earlier in this chapter.


A Proposed Framework for Language Proficiency Testing Research

Bachman and Clark (1987) proposed the detailed and systematic investigation of a wide variety of proficiency-based testing issues within the framework of a single large-scale research program, under whose auspices a number of individuals and institutions would be asked to combine their efforts in an attempt to provide reliable and detailed information on each of these issues. In outline form, the major steps in the proposed program are to:

1. specify a prototypical model of communicative language ability, including but not limited to the performance elements currently at issue in the AEI context;

2. develop highly detailed and comprehensive performance tests covering each of the component aspects of the communicative ability model;

3. develop shorter, necessarily less comprehensive, practically oriented tests that could be used in real-life applications as effective surrogates for the criterion measures at issue in (2);

4. establish the types of construct validation data to be gathered to support or contradict the validity of the communicative ability model, the criterion measures, and the practically oriented tests;

5. carry out a series of research studies in which the criterion performance tests, practically oriented tests, and construct validation data are intercorrelated and compared on a two- and three-way basis;

6. draw necessary conclusions based on these studies and correspondingly revise the proficiency model, criterion tests, and/or practically oriented tests; and

7. repeat steps (5) and (6) until the researchers involved express high confidence in the theoretical and psychometric quality of the data, and operational test users find high practical utility in the results.


A comprehensive study conducted along the general lines outlined here would fully accommodate the investigation of the various issues raised in this chapter concerning the AEI scale and testing procedure. In addition, it would do so within the context of an expanded communicative ability model that might offer some useful new perspectives for the further elaboration of the AEI approach and/or for the development of a variety of other types of instruments and procedures in the service of developing increasingly valid, reliable, and practical language proficiency testing.

References

Adams, M.L. (1978). Measuring foreign language speaking proficiency: A study of agreement among raters. In J.L.D. Clark (Ed.), Direct testing of speaking proficiency: Theory and application. Princeton, NJ: Educational Testing Service.

Bachman, L.F., & Clark, J.L.D. (1987). The measurement of foreign/second language proficiency. Annals of the American Academy of Political and Social Science, 490 (March), 20-33.

Burt, M.K. (1975). Error analysis in the adult EFL classroom. TESOL Quarterly, 9, 53-63.

Carroll, J.B., Clark, J.L.D., Goddu, R.J.B., Edwards, T.M., & Handrick, F.A. (1966). A parametric study of language training in the Peace Corps. Cambridge, MA: Laboratory for Research in Instruction, Harvard Graduate School of Education.

Chastain, K. (1980). Native speaker reaction to instructor-identified student second-language errors. Modern Language Journal, 64, 210-215.


Clark, J.L.D. (1978). Interview testing research at Educational Testing Service. In J.L.D. Clark (Ed.), Direct testing of speaking proficiency: Theory and application. Princeton, NJ: Educational Testing Service.

Clark, J.L.D. (1979). Direct vs. semi-direct tests of speaking ability. In E.J. Briere & F.B. Hinofotis (Eds.), Concepts in language testing: Some recent studies. Washington, DC: Teachers of English to Speakers of Other Languages.

Clark, J.L.D. (1981). Survey measures: Language; Survey measures: Results. In T.S. Barrows (Ed.), College students' knowledge and beliefs: A survey of global understanding. New Rochelle, NY: Change Magazine Press.

Clark, J.L.D. (1986). Development of a tape-mediated, ACTFL/ILR scale-based test of Chinese speaking proficiency. In C.W. Stansfield (Ed.), Technology and language testing. Washington, DC: Teachers of English to Speakers of Other Languages.

Clark, J.L.D., & Li, Y.C. (1986). Development, validation, and dissemination of a proficiency-based test of speaking ability in Chinese and an associated assessment model for other less commonly taught languages (Final project report for Grant No. G008402258, U.S. Department of Education). Washington, DC: Center for Applied Linguistics.

Experiment in International Living. (1976). Your objectives, guidelines, and assessment: An evaluation form of communicative competence (rev. ed.). Brattleboro, VT: Author.

Galloway, V.B. (1980). Perceptions of the communicative efforts of American students of Spanish. Modern Language Journal, 64, 428-433.


Higgs, T.V., & Clifford, R.T. (1982). The push toward communication. In T.V. Higgs (Ed.), Curriculum, competence, and the foreign language teacher (ACTFL Foreign Language Education Series, Vol. 13). Skokie, IL: National Textbook Co.

Ludwig, J. (1982). Native-speaker judgments of second language learners' efforts at communication: A review. Modern Language Journal, 66, 274-283.

Mattran, K.J. (1977). Native speaker reactions to speakers of ESL. TESOL Quarterly, 11, 407-414.

Munby, J. (1978). Communicative syllabus design: A sociolinguistic model for defining the content of purpose-specific language programmes. Cambridge, England: Cambridge University Press.

Raisler, I. (1976). Differential response to the same message delivered by native and foreign speakers. Foreign Language Annals, 9, 256-259.

Shohamy, E. (1983). Rater reliability of the oral interview speaking test. Foreign Language Annals, 16, 219-222.


Issues Concerning the Less Commonly Taught Languages

by Irene Thompson, George Washington University; Richard T. Thompson, ACTFL/Center for Applied Linguistics; and David Hiple, ACTFL

This chapter addresses the application of proficiency guidelines to the less commonly taught languages. Questions of relevance and appropriateness in theory and practice will be addressed. A distinction is drawn among commonly, less commonly, and much less commonly taught languages. This distinction is more practical than theoretical and relates to questions of supply and demand, the need for priority-setting, and the availability of trained specialists in specific languages, as well as the likelihood of developing such specialists in many of these languages.

Questions of Eurocentric bias and the impact of the application of the provisional generic guidelines to languages with different typologies, such as Chinese, Japanese, and Arabic, and their role in a subsequent redefinition of the generic guidelines themselves are traced for both speaking and reading.

Theoretical and practical problems in adapting the guidelines to specific less commonly taught languages are discussed, including the presence of Hindi-English code switching at high levels of proficiency in educated native speakers, special problems of diglossia in Arabic, complex inflectional morphologies in languages such as Russian, the early appearance of significant problems in register in Indonesian and Japanese, and the complex and predominantly nonphonologically based writing systems of languages such as Chinese and Japanese, with repercussions for the generic reading and writing guidelines.

Finally, policy issues affecting the various constituencies are addressed, including the role of the federal government, the language and area studies centers most directly affected by recent federal legislation, and pending regulations relating to proficiency testing and competency-based language programs.

Foreign Language Enrollments

A survey by Brod and Devens (1985) indicates that in 1983, a total of 784,515 college students were enrolled in Spanish, French, and German language courses as follows: 386,238 in Spanish, 270,123 in French, and 128,154 in German. These three languages have come to be known as the commonly taught languages.

Enrollment figures in 1983 reveal a second cluster of foreign languages taught at the college level (following Spanish, French, and German). These were Italian, 38,672; Russian, 30,386; Hebrew, 18,199; Japanese, 16,127; Chinese, 13,178; Portuguese, 4,447; and Arabic, 3,436. To illustrate the comparative significance of these figures, the number of students enrolled in Japanese courses in 1983 represented about 4 percent of the number of students enrolled in Spanish courses, and the number of students enrolled in Arabic courses represented about 1 percent of those enrolled in Spanish. This second cluster of languages is often referred to as the less commonly taught languages.

After this cluster of less commonly taught languages, enrollments reveal that most other foreign languages are much less commonly taught. In 1983, for example, 507 college students enrolled in Swahili courses, 219 college students studied Hindi, and 85 students enrolled in Indonesian courses, the latter figure representing approximately 2 percent of enrollments in Arabic. Yet, even Indonesian scholars, with their 85 students, could take comfort in the fact that only 14 students enrolled in Uzbek, while only 4 students enrolled in Ibo. These languages may thus be referred to as the much less commonly taught languages.

At the secondary level, differences between enrollments in commonly taught and less commonly taught languages are even more dramatic. An American Council on the Teaching of Foreign Languages (ACTFL) survey (1984) indicated that Spanish, French, German, Latin, and Italian, in that order, accounted for approximately 99 percent of the 2,740,198 foreign language enrollments. Of the remaining 1 percent, 5,497 students were enrolled in Russian, 1,980 in Chinese, and 51 in Arabic.

Thus, within what has been referred to as the less commonly taught languages, there is a wide range in student enrollments, reflecting the fact that some languages are, indeed, much less commonly taught. This distribution provides a basis for priority-setting in the face of limited training resources, both human and financial.

ACTFL Proficiency Initiatives Beyond the Commonly Taught Languages

ACTFL's activities in extending the language proficiency assessment movement beyond government and into academia have progressed from projects in commonly taught languages, initiated in 1981, to projects in less commonly taught languages, begun in 1983, to projects in much less commonly taught languages, started in 1985. The initial projects involved the development of proficiency guidelines for French, German, and Spanish, as well as the training of individuals to administer and evaluate oral proficiency tests in Spanish, French, German, and Italian. The second stage of activities involved developing proficiency guidelines for Chinese, Japanese, and Russian, and oral proficiency tester training in Arabic, Chinese, ESL/EFL, Japanese, Portuguese, and Russian. The third stage of activities involved a dissemination project, undertaken by ACTFL jointly with the Center for Applied Linguistics, to extend proficiency concepts to Hindi, Indonesian, and Swahili, as well as preliminary oral proficiency tester-training activities in Hindi, Indonesian, Swahili, and a small sample of other African languages such as Hausa and Lingala.

Development of Proficiency Guidelines for the Less Commonly Taught Languages

In 1983, ACTFL received support from the U.S. Department of Education to initiate the second stage of the guidelines project to create language-specific proficiency statements for Chinese, Japanese, and Russian. As the working committees began their task, a West European bias of the existing generic guidelines became most evident in statements concerning accuracy in speaking and in statements dealing with the writing system (Hiple, 1987).

With respect to accuracy in speaking, this bias was seen in references to grammatical constructions common to most West European languages, such as inflections, subject-verb and adjective-noun agreement, articles, prepositions, tense usage, passive constructions, question forms, and relative clauses. It was felt that learners of languages such as Japanese, Chinese, and Arabic had to deal with a very different set of grammatical constructions, and thus that the accuracy statements in the generic guidelines were not relevant to these languages. To some extent, this was true even of Russian, which, for instance, lacks articles.

After completing the initial draft of the Chinese, Japanese, and Russian guidelines, it became evident that creating meaningful guidelines for those languages would not be possible without a revision of the generic guidelines. As a result, ACTFL petitioned the Department of Education for and was granted an amendment to the project to revise the generic guidelines in order to broaden them enough to accommodate language-specific statements for Chinese, Japanese, and Russian.

With respect to writing, the West European bias of the generic guidelines was even more obvious. The guidelines assumed that learners would not require much time to master the principles and mechanics of the writing system. Therefore, Novice writers were assumed to be capable of functions such as filling out simple forms and making simple lists. This was felt to be unrepresentative of languages such as Japanese, Chinese, and Arabic, in which the complexities of the writing system dictated a more graduated set of statements regarding the development of real-life writing functions.

The evolution of the proficiency guidelines as related to less commonly taught languages can be traced from the Interagency Language Roundtable (ILR) definitions through the Provisional Proficiency Guidelines to the revised Proficiency Guidelines and the respective language-specific descriptions in several uncommonly taught languages. Space does not allow inclusion of all the changes in the definitions for all levels in all four skills, and only selected levels in speaking and reading are discussed here.

Evolution of the Speaking Guidelines. The ILR definition for Level 1 speaking proficiency (S-1) and the corresponding ACTFL provisional definitions for the Intermediate-Low and Intermediate-Mid levels serve to illustrate the evolution of the speaking guidelines. It will be remembered that S-1 (Elementary Proficiency) is defined as being able to "satisfy minimum courtesy requirements and maintain very simple face-to-face conversations on familiar topics."

An obvious difference between the ILR S-1 definition and the ACTFL Intermediate descriptions is that the equivalent to the ILR S-1 is represented by two subranges, Intermediate-Low and Intermediate-Mid, on the ACTFL scale. Liskin-Gasparro (1984) describes the Common Yardstick Project of the Educational Testing Service (ETS) and the need for a scale that discriminates more finely at the lower end, where most of the foreign language students in schools and colleges tend to cluster. This need is particularly real in less commonly taught languages, whose students can expect to invest more time in order to arrive at the Intermediate level than students of commonly taught languages. For example, the School of Language Studies of the Foreign Service Institute estimates that students may require twice as much time to attain S-1 proficiency in Arabic, Chinese, Japanese, and Korean as to attain the same level of proficiency in Spanish or French. Thus, the need to distinguish among subranges of the Intermediate level of proficiency seems particularly compelling for a number of less commonly taught languages.

Content/context. The most noticeable aspect of the ILR S-1 definition is its orientation toward satisfying minimum courtesy requirements, survival needs such as getting food and lodging, and work demands such as giving information about business hours and explaining routine procedures. ACTFL's Provisional Guidelines retained the courtesy and survival requirements but left out reference to specific work demands. Instead, the requirement of satisfying limited social demands was added at the Intermediate-Mid level. The language-specific Provisional Guidelines began the process of adaptation of content to the academic environment by including contexts appropriate for academic learners, such as references to school (French and Spanish Intermediate-Low), learning the target language and other academic studies (German Intermediate-Low), autobiographical information, leisure time activities, daily schedule, future plans (French and Spanish Intermediate-Mid), and academic subjects (German Intermediate-Mid).

This process of content/context adaptation continued in the 1986 version of the ACTFL guidelines with the introduction at the Intermediate-Low level of the more general statement "Able to handle successfully a limited number of interactive, task-oriented and social situations." As a result, language-specific statements in the revised language-specific guidelines of 1986 include references to greetings, introductions, biographical information, social amenities, making, accepting, or declining invitations, handling routine exchanges with authorities, and making social arrangements.

Accuracy. When it came to accuracy, the problem of adaptation was more serious, because many of the accuracy requirements in the ILR descriptions and the ACTFL Provisional Guidelines were typically reflective of Indo-European languages and irrelevant for Arabic, Chinese, Japanese, and Russian.

An attempt was made, therefore, to remove many of the quality statements and reserve them for their proper place in the language-specific guidelines. For instance, the Intermediate-Mid description in the Provisional Guidelines contained references to subject-verb agreement, adjective-noun agreement, and inflections. In the revised guidelines of 1986, all references to these structures were removed. As a result of this revision, the language-specific guidelines were free to include accuracy statements that were more representative of their languages. Thus, the Arabic guidelines specify verb-object phrases, common adverbials, word order, and negation (Allen, 1984); the Chinese guidelines refer to word order, auxiliaries, and time markers (ACTFL, 1986a); the Japanese guidelines single out formal nonpast/past, affirmative/negative forms, demonstratives, classifiers, and particles (ACTFL, 1987); and the Russian guidelines include references to adjective-noun and subject-predicate agreement, and a developmental hierarchy of cases (ACTFL, 1986c).

A related refinement was the reformatting of the guidelines to present the generic and the language-specific statements together so that the two could be viewed simultaneously. This change in format was particularly useful in light of the attempt to maintain the neutrality of the generic descriptions and to focus on the accuracy statements in the language-specific descriptions. This proved helpful in oral proficiency tester training workshops because it was no longer necessary to deal with two separate documents. Also, language-specific guidelines no longer needed to repeat the same generic statements and could concentrate instead on providing appropriate language-specific examples.

Evolution of the Reading Guidelines. The most noticeable aspect of the reading guidelines for the ACTFL Novice-Low and Novice-Mid (ILR R-0) levels is complete negativity: "Consistently misunderstands or cannot comprehend at all." In the Provisional Guidelines, only the Novice-Low level was characterized negatively as "No functional ability in reading the foreign language." However, this negative wording was felt to be unhelpful, because it focused excessively on what the candidates could not do, and not on what they could do in the target language. Therefore, the revised generic statements for reading substituted a positive statement that recognizes the beginning of reading development: "Able occasionally to identify isolated words and/or major phrases when strongly supported by context." This allowed the Chinese Novice-Low description to include a reference to "some Romanization symbols and a few simple characters." At the same time, the Russian examinee at the Novice-Low level recognizes some letters of the Cyrillic alphabet in printed form.

When the ILR R-0 level was divided into two subranges for academic use, the Novice-Mid description allowed for some development of reading ability by stating: "Sufficient understanding of the written language to interpret highly contextualized words or cognates within predictable areas. Vocabulary for comprehension limited to simple elementary needs, such as names, addresses, dates, street signs, building names, short informative signs (e.g., no smoking, entrance/exit) and formulaic vocabulary requesting same." Although this positive wording was a step in the right direction, reference to cognates and the specificity of the examples posed a number of problems for noncognate languages with nonalphabetic writing systems. Chinese, for example, shares no cognates with English; students of Chinese must learn both characters and Romanization system(s). Thus the reading of names, for instance, becomes a rather advanced skill.

The revised generic guidelines of 1986 distinguish among alphabetic, syllabic, and character-based writing systems, thus allowing greater latitude for languages such as Chinese and Japanese. The reference to cognates was modified as follows: "The reader can identify an increasing number of highly contextualized words and/or phrases including cognates and borrowed words, where appropriate." All references to specific materials representative of this level were left out.

As a result of these changes in the generic guidelines, the Chinese description of the Novice-Mid reader includes the ability to identify/recognize a small set of typeset or carefully hand-printed radicals and characters in traditional full form or in simplified form, and full control over at least one Romanization system. The reading context includes public writing in high-context situations, such as characters for "male" and "female" on restroom doors. In contrast, the Novice-Mid reader in Russian can identify all letters of the Cyrillic alphabet in printed form and can read personal names, street signs, public signs, and some names on maps. The Arabic Novice-Mid reader can identify the letters but has difficulty recognizing all four forms of each letter as well as the way in which these letters are joined to each other in forming words. He or she can recognize individual Arabic words from memorized lists as well as highly contextualized words and cognates such as public signs.

The question arises whether such changes in the language-specific statements have led to a lack of comparability across sets of guidelines, weakening the unity of the system. This problem resulted from ACTFL's decision to subdivide the lower-level descriptions into three subranges, Novice-Low, Novice-Mid, and Novice-High, so that learner achievement at the very beginning stages of language learning could be recognized. The entire Novice level is prefunctional, and, as such, represents a developmental sequence leading to the first functional level (i.e., the Intermediate level), rather than an index of real-life functional ability. If the description of this developmental sequence is to have any validity, the statements at the Novice level will have to take on a somewhat divergent character. This divergence should not represent serious difficulties for testing, because any test at the Novice level will turn out to be a test of achievement, not of proficiency.

Theoretical and Practical Problems in Adapting the Proficiency Guidelines to Specific Languages

On the practical side, there is little doubt that the proficiency guidelines have succeeded in injecting some vitality into the language teaching field by offering both a framework for program planning and an instrument for assessing student progress.

On the theoretical side, the development of language proficiency guidelines is an ambitious attempt to define and quantify foreign language proficiency at various points in its development and to describe the essential features of each stage in a few well-chosen sentences accompanied by a few carefully selected examples. For such an attempt to be successful, it must be a dynamic process that involves constant refinement of current understanding of language competency and continual validation of language content and of real-world language-use situations typical of language performance at different levels of proficiency (for a detailed discussion of validation of second-language testing, see Clark & Lett, this volume). Because this formidable task can never be complete, the guidelines will always reflect a stage in the developing understanding of the dynamics of interlanguage.

For the moment, the guidelines have raised many questions that cannot be answered due to lack of empirical research. This lack may be partially due to the relatively recent introduction of the guidelines into the academic setting. The following partial research agenda applies to all languages:

1. validation of the claims in the guidelines regarding the developmental hierarchies of different aspects of linguistic performance, including pragmatic and discourse strategies, in different languages;

2. determination of differences in specific aspects of linguistic performance across major scale boundaries;

3. examination of specific linguistic features that distinguish planned from unplanned discourse; and

4. examination of the validity of the developmental sequence outlined in the receptive skills guidelines.

In addition to these problems, which affect all languages for which guidelines have been developed or are being developed, the development of guidelines for languages with different typologies has brought forth a host of problems that hitherto had not been dealt with. These problems are examined next in the context of African languages, Arabic, Chinese, Hindi, Indonesian, Japanese, and Russian.

The Case of Russian. The availability of government testers to train the initial contingent of academic testers in Russian made it possible for a group of trained individuals to begin work on the Russian guidelines in 1984.1 According to Thompson (1987), unlike the other less commonly taught languages, Russian, an Indo-European language, faced no special challenges in developing language-specific guidelines from a generic starter kit. This adaptation process could be best characterized by a conflict between the desire to make the level descriptions come to life through a variety of examples and the desire to preserve the global character of these descriptions.

The process of adaptation was also not without some uneasiness caused by a conflict between the desire to make the Russian guidelines conform to those in French, German, and Spanish, and the need to include in them references to features that are unique to Russian. These features were related to content/context, accuracy, and the lack of provision in the generic guidelines at the Novice level for the learning of another alphabetical system.

With respect to content/context, the committee members felt that they had to correct the West-European and adult/professional bias in favor of contexts in which U.S. students in the Soviet Union would most likely find themselves, and to provide examples of topics that Americans would most likely discuss with Russians, particularly at the Advanced and Superior levels. It was felt that many of the survival situations mentioned in the French, Spanish, and German guidelines could not be applied to Russian because they would either simply not occur in the Soviet Union or they would be structured differently. This is true even at the Novice level, where some familiar categories in West European languages (e.g., telling time, days of the week, months) require Intermediate-level skills in Russian.

Somewhat more serious problems presented themselves in the area of accuracy. For example, the provisional guidelines in the Advanced-Plus description referred to lack of accuracy as follows: "Areas of weakness range from simple constructions such as plurals, articles, prepositions, and negatives to more complex structures such as tense usage, passive constructions, word order, and relative clauses." Such specificity caused problems for Russian, in which additional grammatical categories such as pronominal, adjectival and nominal declensions, verbal aspect, modality, verbs of motion, and prefixation, among others, present significant difficulties for the learners. The case of Russian indicated that considerable variation in both inventories of structural features and their hierarchical development had to be accommodated before the Superior level.

The removal of references to specific structures in the revised guidelines of 1986 facilitated the subsequent revision of the Russian-specific guidelines, for the revision committee members2 no longer felt constrained by the imposition of developmental hierarchies for grammar more characteristic of less inflected languages. As a result, the revised Russian guidelines in the description of the Advanced-Plus speaker refer to cases, aspect, mood, word order, and the use of particles.

Although all members of the Russian guidelines committee had been trained in the administration of the oral proficiency interview and all were experienced teachers of Russian, they were somewhat uneasy about positing a developmental hierarchy of acquisition of grammatical, discourse, sociolinguistic, and pragmatic features on the basis of observation and experience rather than research evidence. It was felt then, and is felt still, that the availability of large amounts of data from taped oral interviews in Russian should provide the impetus for psycholinguistic research into characteristics of learner speech at different levels of proficiency, as suggested by Byrnes (1987). The results of this research may guide efforts to reexamine and reevaluate some of the statements in the current version of the Russian proficiency guidelines with regard to various aspects of learner performance at different levels of proficiency, keeping in mind, of course, the danger of a cyclical effect in using interview data to validate oral proficiency interview traits.

Finally, the lack of accommodation in the provisional reading/writing guidelines for learning to recognize/produce Cyrillic letters caused some concern. While some transfer can be made from West European languages when it comes to recognizing/producing Cyrillic letters, most of them represent a different pattern of letter-sound correspondence or have different shapes altogether. In addition, the printed and longhand versions of the letters look quite different. Thus, learners must go through a training period before they are able to recognize/produce Cyrillic script. The provisional guidelines, however, describe the Novice-Low reader as having no functional reading ability, while the Novice-Mid reader is described as already able to read highly contextualized words or cognates within predictable areas such as names, addresses, dates, street signs, building names, short informative signs, and so on. It was felt that there was a discontinuity between these two subranges that did not reflect the early stages of learning to read in Russian.

The problem was solved by having the Novice-Low reader in Russian recognize some letters of the Cyrillic alphabet in printed form and a few international words and names. By contrast, the Novice-Mid reader could identify all letters of the Cyrillic alphabet in printed form and some contextualized words such as names, public signs, and so on. Finally, the Novice-High reader could identify various typefaces in printed form or in longhand as well as highly contextualized words, phrases, and sentences on maps and buildings and in schedules, documents, newspapers, and simple personal notes. In this manner, the Novice level was designed to represent the gradual beginning steps in learning to read Russian.

The recommendations of the National Committee on Russian Language Study (AAASS, 1983), which called for the development of a common metric and its use to set standards for Russian language study, helped pave the way for the introduction of the proficiency guidelines and the oral proficiency interview into the Russian field. The response has been generally quite positive, and during the past few years there has been a good deal of activity in the field involving the guidelines and the oral interview test. The following deserve mention: (a) there is now a contingent of more than a dozen certified oral proficiency testers and several academic tester-trainers in Russian, thereby ending dependence on the U.S. government for training; (b) curriculum workshops geared to teaching for proficiency are now being offered to secondary and postsecondary Russian language teachers throughout the academic year and especially during summers at various locations nationwide; (c) videos of oral interview tests at all levels were developed at Middlebury College under a grant from the Social Science Research Council for use in tester training; (d) the Educational Testing Service has developed an Advanced Russian Listening and Reading test based on ACTFL Listening and Reading Guidelines that reports raw scores and/or proficiency ratings from Intermediate-High (ILR 1+) to Superior (ILR 3) or higher; (e) major Russian overseas programs such as the Council on International Educational Exchange, the American Council of Teachers of Russian, and the Middlebury College Russian Program at the Pushkin Institute in Moscow use the ACTFL oral proficiency interview and the ETS Advanced Listening/Reading Test for pre- and post-program evaluation of participants, and data are being collected to update Carroll's (1967) study with respect to Russian; and (f) some institutions have introduced graduation requirements for undergraduate and graduate majors in Russian in terms of proficiency levels in various skills. Other institutions are using the oral proficiency interview to screen prospective teaching assistants. Many institutions are reevaluating their language courses by setting objectives in terms of proficiency levels in various skill combinations.

The Case of Hindi. Hindi presents another set of problems hitherto not encountered in the development of guidelines for other languages. Specifically, accommodations must be made for Hindi-English code switching. Code switching is generally an indication of a relatively low level of proficiency in a second language, but in the case of Hindi, appropriate Hindi-English code switching is representative of educated native Hindi speakers.

Secondly, in terms of extending the oral proficiency interview to additional less commonly taught languages, a major problem presented itself when no government tester was available to train academic testers in a particular language. ACTFL first addressed this problem in the case of Hindi. The solution, a time-consuming one, was to train testers in a language other than the target language, and then to help the most interested ones transfer the concepts, procedures, and rating criteria to the target language.

Several problems had to be resolved before guidelines could be created for Hindi (Gambhir, 1987). According to Gambhir, it was desirable to study the concept of an educated native speaker in the multilingual speech community of India, where English is used by the educated elite in most formal and professional domains, and where Hindi is primarily relegated to the more restricted domain of informal socialization and areas of higher education dealing with language, literature, and culture. Because of the widespread use of English, which is the co-official language of India along with Hindi, in government, education, science, technology, and commerce, most educated native speakers of Hindi lack opportunities to develop higher levels of proficiency normally associated with professional, educational, and formal domains of language use. If the Hindi guidelines are to reflect the actual use of Hindi by educated native speakers, these limitations must be taken into account.

According to Gambhir (1987), an additional problem is the presence of two styles in the speech of educated native speakers of Hindi. The spoken style, which contains many borrowings from English, Persian, and Arabic, is used in speech and writing for informal purposes, while the written style, which contains many Sanskrit words, is reserved for formal speech and writing. The spoken style is characterized by frequent, rule-governed, Hindi-English code switching when used with Hindi-English bilinguals, which does not occur in interactions between Hindi-English bilinguals and monolingual speakers of Hindi. Since educated native speakers of Hindi use both a mixed and an unmixed code in informal speech and writing, the proficiency guidelines for Hindi must reflect this aspect of sociolinguistic competence. The situation is paradoxical: Hindi-English code switching indicates lesser proficiency in everyday, survival situations but greater proficiency in formal, professional settings.

In addition, the content/context in which Hindi is used needs to be elucidated. For instance, according to Gambhir, the Superior-level functions in Hindi are mostly exercised in the areas of language, culture, and literature; the Advanced-level functions occur mostly in informal social situations; and the Intermediate-level functions occur mostly in rural areas and in contacts with uneducated or less educated native speakers of Hindi who normally have little or no contact with foreigners.

Finally, in addition to making accuracy statements regarding control of various phonological, morphosyntactic, and discourse features of Hindi, statements regarding sociolinguistic competence must take into account the complexity of rules governing style according to the relative status, age, sex, and relationship of the interlocutors as well as the formality/informality of the situation.

To decide which linguistic features should be expected to be fully, partially, or conceptually controlled at which level, Gambhir suggests combining two different approaches: one based on experience as to what to expect in terms of functions, content/context, and accuracy at what level, and the other based on an analysis of a large number of interviews at different levels. This combined approach requires a tentative formulation of level descriptions through analysis of actual interviews and supplementing missing data with observations based on experience.

The process is already under way. Hindi testers3 trained in administering the oral proficiency interview in ESL started the transfer by administering the interview in Hindi. They identified the best Hindi interviews, translated the questions asked in those interviews into English, and met with experienced ESL testers to obtain feedback on elicitation techniques and assistance in rating the samples.

In late spring of 1987, an oral proficiency testing workshop was conducted for professors of Hindi and Tamil at the South Asia Department of the University of California at Berkeley, with representatives from Columbia University and the University of Washington also attending. Half the practice interviews were conducted in Hindi, and half were conducted in English. The Hindi testers are working toward tester certification in Hindi, and the Tamil testers are working through English initially, with the long-term goal of becoming testers in Tamil. A second tester-training workshop in Hindi has been scheduled in April 1988 at the University of Illinois at Urbana.

Vijay Gambhir and the University of Pennsylvania have won a Department of Education grant to create proficiency guidelines for Hindi. The development of guidelines and the training of testers will be mutually reinforcing, because experiential knowledge acquired through testing will contribute to the writing of guidelines, and the guidelines will ultimately improve testers' ability to elicit and reliably rate speech samples.

The Case of Indonesian. Wolff (1987) reported that he attended a German tester-training workshop and then conducted about 20 half-hour interviews ranging from Novice to Superior with students of Indonesian at Cornell University. The interviews were presented to a small group of individuals at the 1986 meeting of the Association for Asian Studies and discussed with oral proficiency interview experts from ACTFL and the Central Intelligence Agency with regard to their content and as to what they showed about the basic characteristics of students at different levels.

As a result of this preliminary work, Wolff does not think that there are any particular features of Indonesian that could not be measured by a common metric, even though Indonesian is significantly different from the more commonly taught West European languages. Despite the fact that the grammar of Indonesian is based on a totally different set of principles than those on which most commonly taught Indo-European languages are based, there is no reason, according to Wolff, why the generic guidelines expressed in terms of functional abilities at different levels would not be applicable to Indonesian as well.

Wolff suggests that the next step in the development of guidelines for Indonesian is the determination of the features of phonology, grammar, and vocabulary that can be associated with each stage of proficiency in Indonesian. In addition, Wolff thinks that a determination should be made as to the candidate's ability to make use of the appropriate style, register, and sociolinguistic rules of Indonesian. Wolff makes the point that these rules are quite rigid, and that Indonesians do not have a great amount of tolerance for deviation from sociolinguistic norms. Even the simplest utterance must adhere to rules for acknowledging the relative social status of the conversational partners through appropriate use of various sociolinguistic rules, such as those for forms of address. As a result, a set of guidelines for testing students' communicative ability in Indonesian will have to include specific statements regarding degree of control of these sociolinguistic rules at different levels of proficiency. Wolff makes the additional point that oral proficiency interviews in Indonesian must include routines that require the use of sociolinguistic rules representative of various settings peculiar to the Indonesian culture.

In terms of actual need, Wolff thinks that the development of proficiency guidelines for Indonesian is a worthwhile endeavor for two reasons. First, Indonesia is the fifth largest country in the world and is one of the few former colonies in which the local language has truly become a national language. Secondly, although the total number of people studying Indonesian is not very large, they represent different levels of proficiency, and it is important to be able to assess their competence.

As Wolff sees it, however, a number of problems arise when it comes to testing a significant number of these individuals. These stem from the fact that Indonesian is largely taught by the linguist-native informant method. Linguists often have too many other professional responsibilities to devote much time to proficiency testing, and native informants are typically temporarily employed and are not professional language teachers. Wolff suggests as one possible solution the development of a semi-direct test of speaking proficiency that would be validated against the ACTFL interview. Experience with a semi-direct test of speaking proficiency for Chinese, reflecting as closely as possible the functions and content of the oral proficiency interview, shows that the former correlates highly with the latter (Clark, 1986). This would not eliminate the need for tester training, however, because trained specialists would still be required to evaluate the taped material. The chief advantage of a semi-direct over a direct test is the elimination of travel time and expense for face-to-face contact with the examinee. This is a major consideration in the case of a less commonly taught language such as Indonesian, given the sparse, widely separated examinees and the difficulty in maintaining the skills of testers who would not be engaged in proficiency testing on a regular basis. The main disadvantage of a semi-direct approach is that oral responses to tape-recorded or printed stimulus material do not allow examinees to demonstrate their ability to use discourse strategies (Clark & Lett, this volume). An alternative solution would be to adopt the government practice of conducting interviews with two testers: a native speaker trained to elicit a ratable speech sample, and a linguist experienced in both evaluation and elicitation techniques. This solution would require availability of testing teams throughout the country.
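
Validating a semi-direct test against the interview is, at bottom, a concurrent-validity check: the same examinees take both instruments, and the two sets of scores are then correlated. The short sketch below is purely illustrative; the paired scores are hypothetical and are not drawn from Clark's (1986) Chinese data.

    # Illustrative only: hypothetical paired scores for the same examinees,
    # coded on a numeric version of the ILR scale (plus levels as .5).
    import numpy as np

    semi_direct = np.array([1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 1.5, 3.5, 2.5, 4.0])
    interview   = np.array([1.0, 1.5, 2.0, 2.5, 2.5, 3.0, 1.0, 3.5, 3.0, 4.0])

    # Pearson correlation between the two instruments (concurrent validity estimate).
    r = np.corrcoef(semi_direct, interview)[0, 1]
    print(f"semi-direct vs. interview: r = {r:.2f}")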

In late spring 1987, two additional professors of Indonesian, one each from the University of Wisconsin and Ohio State University, trained as oral proficiency testers in English as a first step toward becoming testers in Indonesian. These testers will seek certification in English and then begin to test experimentally in Indonesian to gather information about the application of the oral interview to that language.

The Case of Arabic. In the 1980s, as Arabists began preparing to introduce proficiency testing to their field and to develop proficiency guidelines in Arabic, they had to (a) define the spectrum of language use that characterizes an "educated native speaker of Arabic"; (b) identify the registers of language that such a person habitually uses and the processes of switching between them; and (c) determine the impact of all these factors on oral proficiency testing and proficiency guidelines. Allen (1984, 1987) explains the problem as follows. The language used for oral communication in a given Arab community is the language that people learn at home, and it can be one of a number of colloquial dialects that vary from country to country and from one community to another. Geographically contiguous colloquial dialects are mutually comprehensible, but geographically separated ones are less so. As a result, in certain situations, a form of the standard written language, referred to as Modern Standard Arabic (MSA), which Arabs learn in school, is used for oral communication. Thus, the colloquial dialect is reserved for day-to-day usage, while MSA is generally restricted to formal situations such as lectures, newscasts, and pan-Arab and international conferences. MSA, being a literary language, is almost exclusively used for all writing purposes, formal and informal. The major exceptions are some Egyptian drama and any dialectal folk poetry.

This diglossic situation creates a major problem in proficiency testing. According to McCarus (1987), academic programs in the United States generally teach MSA because very few programs can afford to teach one or more dialects as well. The result is a somewhat anomalous situation in which a student might be an Intermediate-High or even an Advanced speaker when it comes to formal or academic discussion involving politics or religion, for example, but only a Novice-High or Intermediate-Low when it comes to dealing with basic survival situations.

Two solutions have been proposed. One is to ignore the dialects and write the guidelines for MSA alone. The other is to choose, in addition to MSA, a major colloquial dialect, such as Egyptian, and write two sets of guidelines, one for MSA and one for the dialect.

Allen and his associates at the University of Pennsylvania have received a Department of Education grant to prepare a set of provisional proficiency guidelines for Arabic. After considering the advantages and disadvantages of the two solutions, the committee decided to write one set of guidelines. As a result of their work, a preliminary set of Arabic Guidelines was published in Al-cArabiyya (Allen, 1985). This set has been renamed the Provisional Guidelines for Modern Standard Arabic (as opposed to Arabic alone).

The next step, according to Allen (1987), is to set up a measure of appropriateness of language usage and rate the candidate's ability to use colloquial Arabic or MSA according to sociolinguistic rules adhered to by native speakers of Arabic. McCarus (personal communication) also supports the idea of a single set of guidelines for Arabic. He points out that if the proficiency levels for each of the four language skills are defined, it is up to the individual to do the best he or she can, using colloquial Arabic or MSA, where appropriate. The examiner should represent the Arab region for which the person is being tested (e.g., an Egyptian tester if the examinee is going to Egypt). In other cases, for a general rating in Arabic, the tester and the examinee should control mutually intelligible dialects.

The current working solution for oral proficiency testing in Arabic is for the tester to conduct the interview in MSA and to accept responses in MSA or in any colloquial dialect. Only at the Superior level is the examinee expected to demonstrate sustained proficiency in MSA.

Allen admits that the solution of writing a preliminary set of guidelines, including those for speaking and listening, based on a single variety (MSA) does not reflect the natural use of Arabic, except under special circumstances; nevertheless, MSA, potentially at least, is a means of communication between any two educated Arabs. In addition, this solution keeps Arabic in conformity with other languages for which guidelines have been developed thus far.

An alternative solution, according to Allen, would be to write guidelines for speaking and listening on the basis of the colloquial dialects, and those for reading and writing on the basis of the standard written language. This solution entails choosing specific dialect(s) and implementing Arabic courses that would offer the combination of a colloquial dialect with the standard written dialect, a practice not currently in effect in most U.S. universities, though it is the accepted solution at the Foreign Service Institute.

The Case of Chinese. According to Walton (1987), one of the problems with adapting the guidelines to languages such as Chinese and Japanese for use in academic testing lies in the area of reading/writing rather than in speaking/listening. The fact is that, in these languages, more time is needed to reach a comparable level of proficiency in reading and writing than in languages such as Spanish and French. This largely results from the nature of the writing system. Walton points out that exposure to a writing system, such as Chinese, does not per se constitute meaningful input. For instance, people who have spent some time in France or Spain will, in all probability, learn how to write their names and addresses and to read street signs and menu items in French or Spanish without specific training. In Chinese, however, extensive training is needed to be able to perform these simple written functions. Hence the achievement of even the Novice level in reading and writing depends on a fairly protracted period of instruction and a series of developmental steps that are quite different from those involved in achieving the same level of proficiency in languages with alphabetic writing systems. Even with the refinement of the lower end of the continuum to include three subranges of the Novice level, students take a long time to achieve any measurable proficiency and to move from one subrange to another.

Thus, in order to put students on the scale, the Chinese guidelines committee4 decided to make some compromises. As a result, the Chinese reading guidelines make a distinction at the Novice and Intermediate levels between reading specially prepared materials and puzzling out authentic texts (i.e., between fluent reading and going through a sentence word by word, figuring out both word meanings and sentence structure using a dictionary).

Another concern, according to Walton, is that the testing situation constrains the elicitation of certain sociolinguistic behaviors in languages such as Chinese and Japanese. These cultures require certain language behaviors that are vastly different from those common to most West European languages. An example of such a behavior is the Chinese unwillingness to give a direct "no," preferring evasive answers that could be interpreted as "no." Concomitant knowledge of when and when not to press for definite answers is also part of such behavior. Thus, there is a need to define these features and to design situations in which they might be elicited.

Another problem Walton describes is that it may not be appropriate for a foreigner to speak to a Chinese the way Chinese speak to each other, because they themselves expect certain behaviors of foreigners. This, of course, is a problem for all languages, but perhaps especially for languages whose cultures differ greatly from European ones. How and whether to include such expected behaviors in proficiency guidelines is an unresolved question.

The Case of Japanese. As is the case with Chinese, the Japanese Guidelines (as of spring 1987) reflect the difficulties of adapting the reading/writing guidelines to a language with a complex, nonalphabetic writing system that includes both syllabaries and characters. As in the Chinese Guidelines, descriptions of reading ability in Japanese at the Intermediate and Advanced levels distinguish between reading of specially prepared (quasi-authentic) materials without a dictionary and reading of authentic materials with a dictionary.

With regard to the Japanese Writing Guidelines, a developmental hierarchy was posited, starting with the ability to make limited use of the two syllabaries (Hiragana or Katakana as appropriate) at the Novice level to an emergent ability to write some Kanji (characters) at the Intermediate level.

The Case of African Languages. The problems confronting proficiency testing in African languages stem from the fact that Africa is a region of considerable linguistic diversity, representing 1,000 to 1,500 languages, and from the fact that resources for studying and teaching African languages are quite limited (Dwyer & Hiple, 1987).

The African language teaching community in the United States has initiated a number of important projects designed to coordinate the instruction efforts for African languages in this country. Two of these projects were designed to explore the application of the ACTFL proficiency model to African languages: a workshop at Stanford in 1986 and a meeting at Madison in 1987. In addition, the development of a language proficiency profiling model was undertaken.

The ACTFL workshop hosted by Stanford provided five intensive days of predominantly English-based training topped off with sessions using Hausa and Swahili. With an initial goal of examining the suitability of the ACTFL model for African languages, the workshop participants concluded that the model was based on sound principles and could provide a reliable and valid means for evaluating learners' proficiency in African languages.

In May 1987, representatives from the ten Title VI African Studies Centers assembled at a meeting hosted by Michigan State University and the University of Wisconsin to rank common goals and to discuss the coordination of activities among the centers. A principal agenda item was team interviewing in much less commonly taught languages. There was near consensus on the potential for a team testing model, including a certified tester who is not necessarily proficient in the target language and a native speaker who is not necessarily a certified tester. The group suggested a three-year plan to reach this goal: (a) in 1988, a standard ACTFL workshop would be held, possibly using English, French, and Arabic as the languages of certification; (b) in 1989, a workshop would explore the design of the team approach and develop instructions for the native speaker and his or her role in the interview; and (c) in 1989, two workshops would be held using the ACTFL team approach and Hausa and Swahili as the focal languages.

A related development in the field of African languages involving the ACTFL proficiency guidelines is the "profiling" model of Bennett, Biersteker, and Dihoff (1987). This model, which grew out of a questionnaire for the evaluation of existing Swahili textbooks, was designed, according to its authors, to supplement the ACTFL global rating with a more detailed analysis of the candidate's performance for diagnostic purposes.

Problems with Tester Training

The original testing kit workshops held in 1979-80 at the Foreign Service Institute with support from the Department of Education brought interested academics together with experienced government testers in a common effort to test the hypothesis that the proficiency guidelines developed, modified, and validated over a 30-year period within the federal government had applicability in a traditional academic setting.

Seven years later, after about 50 workshops training more than 1,500 individuals in 10 languages, the answer should be clear. This does not mean that the generic guidelines are not subject to further modification as new languages representing still differing typologies are added, nor does it mean that all problems in developing compatible guidelines for additional languages and training individuals in their application in testing situations have been solved.

Less Commonly Taught Languages

It was not until 1984 that tester training became available in the less commonly taught languages. These included Arabic, Chinese, Japanese, Portuguese, and Russian. The inclusion of a particular language in this list was, of course, arbitrary, and was simply a product of whether a U.S. government tester/trainer was available to conduct initial training and whether there was interest on the part of the academic community. In the case of these languages, government trainers were available to conduct a number of initial workshops.

As a result, adequate numbers of testers in these languages were trained, with new testers being added constantly to the list. In addition, through a series of tester trainer workshops conducted since 1983, ACTFL was able to terminate its dependence on government trainers and to develop its own cadre of academic trainers not only in the commonly taught languages, but also in Russian, Chinese, Japanese, and Arabic.

Much Less Commonly Taught Languages

Recent demands from the academic sector for training in some of the nearly 160 less commonly taught languages currently available at U.S. institutions of higher education boasting National Resource Centers in Foreign Languages and Area Studies are a serious strain on national training capacity. For most of these languages, no trainer is available. This results in serious problems not only for the academic community but also for the U.S. government in general and for the Department of Education in particular. New federal legislation and companion regulations were published for public comment on Oct. 2, 1987, mandating proficiency testing for these languages. This has immediate implications for tester training as well as for adaptation of proficiency guidelines to these languages.

Using FSI trainers, ETS provided early training of Peace Corps oral proficiency testers in English for all of the requisite languages. Unfortunately, no follow-up was conducted to evaluate the effectiveness of such training. More recently, a number of academics have been trained in an intermediary language (e.g., English, French, German) while they are preparing to test in the target language (e.g., Hindi, Indonesian, Hausa, Hebrew, Polish, and Thai).

Research Needs

While the ideal training situation is target-language specific, cross-language training makes it possible to reach languages that are otherwise inaccessible. Such training will not only extend proficiency testing to all of the much less commonly taught languages, but it will perforce provide a significant research opportunity.

Questions of interrater reliability are normally studied within specific languages. Yet, if the generic guidelines are truly generic and the training procedures are truly standardized, both rating and elicitation should be standard across different languages. The opportunity that presents itself is to study interlanguage reliability in a way heretofore not possible. A properly designed research project could provide, for the first time, empirical evidence on both the reliability of ratings across languages by the same individuals and the degree of generality among the various language-specific guidelines.
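
One way to make such a study concrete is to compute a chance-corrected agreement index, such as a weighted kappa, for pairs of raters who have scored the same interviews, and then to compare the coefficients obtained for the same rater pairs across different target languages. The sketch below is a minimal illustration under assumed conditions: the ratings are hypothetical and are coded as integers on an ordinal version of the scale.

    # Illustrative only: hypothetical ratings from two raters who scored the
    # same ten interviews, coded 0-9 on an ordinal version of the proficiency scale.
    import numpy as np

    def weighted_kappa(rater_a, rater_b, n_levels):
        """Quadratic-weighted Cohen's kappa for ratings coded 0..n_levels-1."""
        a, b = np.asarray(rater_a), np.asarray(rater_b)
        observed = np.zeros((n_levels, n_levels))
        for i, j in zip(a, b):
            observed[i, j] += 1
        observed /= observed.sum()
        # Expected agreement if the two raters' marginal distributions were independent.
        expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
        # Quadratic penalty that grows with the distance between the two ratings.
        rows, cols = np.indices((n_levels, n_levels))
        weights = (rows - cols) ** 2 / (n_levels - 1) ** 2
        return 1 - (weights * observed).sum() / (weights * expected).sum()

    rater_a = [2, 3, 5, 4, 7, 1, 6, 5, 3, 8]
    rater_b = [2, 4, 5, 4, 6, 1, 6, 4, 3, 8]
    print(round(weighted_kappa(rater_a, rater_b, n_levels=10), 3))

Comparable coefficients for the same rater pairs across languages would bear directly on whether rating behavior is in fact standard across languages.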

Another valuable procedure for testing in many less commonly taught languages has been used by U.S. government testers for some time. This procedure involves the presence of a tester certified in one or more languages (preferably related to the target language) and an educated native speaker/informant of that language. With appropriate direction and experience in observation during the testing of a candidate, reliable ratings can be assigned by the testing team. As in the case of cross-language training, joint testing also will provide important research opportunities.

A research agenda might include the following:

1. A study of interrater reliability between government and ACTFL-certified oral proficiency testers. The two groups of testers receive somewhat different training, use different guidelines, and conduct and evaluate tests differently (i.e., ILR testers work in teams of two, with the final rating representing an agreement of the two testers, or the lower of the two ratings if an agreement cannot be reached; ACTFL testers conduct the interview test alone and later score it from a tape-recording). How do these differences affect ratings? For example, do ratings based on audiotape playback result in lower ratings? (For a more detailed discussion of this question, see Clark & Lett, this volume).

2. Intrarater reliability: An examination of differences in testing one's own students as opposed to testing someone else's. This question is related to the larger question of testing familiar people versus strangers. In testing his or her own students, the tester usually knows their level beforehand, and at the lower levels has practiced most of the questions and answers with the examinee in class. Is it more difficult for the interviewer to show interest in the information being communicated and to interact with the examinee in a "nonteacher" manner under these conditions? Is the examinee more comfortable with an interviewer he or she knows and with whose speech he or she is familiar? How does this influence the examinee's performance?

3. Interrater reliability: An investigation of possible differences between native and nonnative interviewers with regard to both elicitation and rating. Are native and nonnative interviewers likely to ask different questions or to phrase them differently? Is there a difference, for example, in the way native and nonnative interviewers react to the sociolinguistic aspects of the examinee's performance? Do native and nonnative interviewers have a different "style" in conducting interviews?

4. Interrater reliability: Differences in reliability of ratings at different levels of proficiency. In academic situations, testers are likely to have considerably more practice in conducting interviews at the Novice and Intermediate levels than at the Advanced and Superior levels. Does this mean that such testers would be more proficient in conducting and more reliable in rating lower-level interviews? A similar question is raised by Clark and Lett (this volume).

5. Maintenance of rating reliability over time, particularly in the less commonly taught languages. While maintenance of rating reliability applies to all languages, this problem may be more serious for testers in the less commonly taught languages, who are likely to

mandates competency-based language training and testing, strengthened by federal regulations affecting 93 National Resource Centers in Foreign Language and Area Studies at 53 premier institutions of higher education involving about 160 less commonly taught languages, presents monumental challenges.

As universities seek to come into compliance, there will be intense competition for the limited training resources currently available. The Department of Education, academia, and the major relevant professional associations, especially ACTFL, will need to cooperate to set realistic priorities and develop the necessary guidelines.

Which languages will be designated for priority development, and who will receive initial tester training? Will the tester training be language specific or through English or another language? Will interim procedures need to be developed to satisfy federal regulations until a sufficient cadre of testing specialists is available? Will training be extended to the precollegiate level for selected languages, and will teachers at that level receive training? Where will the required research on proficiency testing be carried out, and by whom?

What will be the role of the new federally authorized language resource centers in the area of proficiency testing? Could a small number of such regionally located centers assume responsibility for testing and tester training in their regions? How would the relationships and responsibilities of the privately funded Johns Hopkins Foreign Language Resource Center and the National Resource Centers be defined and coordinated?

The answers to these and other policy questions will, of course, require a cooperative effort by the affected constituencies. It is possible now, however, to sketch a broad outline of some of the options and factors that will influence them.

It would seem reasonable to use recent enrollment data as a general guidepost in establishing priorities. Decisions made by universities to offer courses in the less commonly taught and much less commonly taught languages, as well as individual decisions by students to study these languages, already represent a prioritization, albeit implicit, and a decision that a particular language has relative cultural, economic, or political value and is thus worth teaching and studying.

Because language-specific tester training currently exists in only a handful of the less commonly taught languages, it is likely that most startup training will be through English or another language known to the prospective tester (e.g., English or French for Asian and African specialists). In some cases, it will be possible to conduct training in a language that is structurally similar to the target language, such as training in Russian in order to test in other Slavic languages. There will also be situations in which testers in one language will work together with native speakers of the target language in teams. The experienced tester may know the subject language only minimally, or may know a related language, thus being able to understand it without being able to speak it, and therefore being able to work with the native speaker in a capacity similar to that of the former linguist/informant method of language instruction: guiding the informant through the interview and making decisions as to the final rating. It is also possible that semi-direct tests of oral proficiency will be developed and validated against the oral interview for those much less commonly taught languages for which maintaining a cadre of trained testers will not be possible.

In reauthorizing Title VI of the Higher Education Act (formerly NDEA Title VI), Congress established a new Section 603, Foreign Language Resource Centers. These centers "shall serve as resources to improve the capacity to teach and learn foreign languages effectively." Activities carried out by such centers may include "the development and application of proficiency testing appropriate to an educational setting" and "the training of teachers in the administration and interpretation of proficiency tests" (The Congressional Record, Sept. 22, 1986).

In establishing these Foreign Language Resource Centers, Congress anticipated that giving them responsibility for direct testing and training teachers to test would present problems for individual teachers and institutions. A small number of such centers, strategically located in the United States with regional responsibilities, could significantly advance the national capacity to meet the new federal requirements for proficiency testing and competency-based training. It is also through such centers that the more limited needs at the precollegiate level can be adequately met.

In discussing the research needs, Byrnes (1987) notes that "some of the greatest benefits of the increasing work being undertaken in academia with oral proficiency testing may well lie beyond the areas that come to mind most readily, such as placement, syllabus scope and sequence, course and program evaluation, entry and exit requirements, and required proficiency levels of TAs or teachers." Rather, she sees as the most exciting prospect of the proficiency movement "its potential for giving language practitioners a framework within which to observe and evaluate the development of second-language proficiency in their students" (p. 113).

A view of the oral proficiency interview not only as a test but also as data that could yield important insights into second-language (L2) acquisition processes poses the need for language specialists to conduct second-language acquisition research, both theoretical and classroom-oriented. This research should proceed along several different lines, starting with a determination of the ultimate level of proficiency attainable under a given set of conditions (Lowe, 1985; Natelson & Allen, n.d.; Pica, 1983; Swain, 1985). Other important areas are learner variables (Beebe, 1983; Bialystok, 1983); input variables (Chaudron, 1983; Seliger, 1983); the relationship between L2 acquisition and L2 instruction (Lightbown, 1983); and the effects of formal as opposed to informal exposure on different aspects of language performance (i.e., grammar, vocabulary, fluency, and sociolinguistic and pragmatic features).

To perform such much-needed research, the prospective researchers require a background in second-language acquisition, research design, and statistics. Such a multidisciplinary background is not obtainable in highly compartmentalized foreign language departments, which typically emphasize literature and linguistics. If second-language acquisition research is to extend from ESL into commonly and especially into uncommonly taught languages, the training of language specialists must extend beyond its current boundaries of literature and linguistics to include the disciplines just mentioned. An alternative solution for performing language acquisition research in languages for which researchers with such a background do not exist is to cooperate with other departments to form interdisciplinary research teams that, in addition to a specialist in an uncommonly taught language, would include psycholinguists, educational psychologists, statisticians, and psychometricians.

Conclusion

This chapter attempts to place the development and application of proficiency guidelines to the less commonly taught languages in broader perspective. The authors have sought to highlight the significance of developing and defining the generic guidelines, from their initial application to commonly taught West European languages to accommodating an increasing number of languages with widely varying typologies, as less commonly and much less commonly taught languages are brought within the scope of the proficiency movement. The case is made to support the notion that guidelines are precisely guidelines, that they are dynamic and subject to modification as experience with new languages representing other linguistic typologies is accumulated. This experience may in turn have a backlash effect on the more commonly taught languages.

Incipient experience with the much less commonly taught languages reveals serious problems in training testers and suggests imaginative and productive alternatives that hold promise for research as well.

Recent legislative action has mandated competency-based language programs and proficiency testing for the commonly as well as for the less commonly taught languages. Language resource centers will bear special responsibilities in this area and will spearhead cooperative planning and policy development in the future.

NOTES

1. The Russian Guidelines Committee was composed of Thomas Beyer (Middlebury College), Dan Davidson (Bryn Mawr College), Irene Thompson, Chair (George Washington University), Gerald Erwin (Ohio State University), and Don Jarvis (Brigham Young University). All members of the committee had either received tester training or had attended familiarization workshops. Two members (Erwin and Thompson) had previous experience as government testers.

2. The Revision Committee had two members, Thomas Beyer and Irene Thompson. Both had been active testers in the interim.

3. Vijay Gambhir (University of Pennsylvania) is a certified tester and tester trainer. Two other Hindi teachers (one at Columbia University and the other at the University of Washington) are completing their training as oral proficiency testers.

4. Members of the Chinese Guidelines Committee were Albert E. Dien, Stanford University; Ying-che Li, University of Hawaii; Chun Tan-choi, Government Language School; Shou-hsin Teng, University of Massachusetts; A. Ronald Walton, University of Maryland; and Huei-ling Worthy, Government Language School.

124

Less Commonly Taught Languages 119

References

Allen, R.M.A. (1985). Arabic proficiency guidelines. Al-cArabiyya, 18(1-2), 45-70.

Allen, R.M.A. (1987). The Arabic guidelines: Where now? In C.W. Stansfield & C. Harman (Eds.), ACTFL proficiency guidelines for the less commonly taught languages (Dept. of Education Program No. G008540634). Washington, DC: Center for Applied Linguistics; and Hastings-on-Hudson, NY: American Council on the Teaching of Foreign Languages.

American Association for the Advancement of Slavic Studies. (1983). Russian language study in the United States: Recommendations of the National Committee on Russian Language Study. Stanford, CA: Author.

American Council on the Teaching of Foreign Languages. (1982). ACTFL provisional proficiency guidelines. Hastings-on-Hudson, NY: Author.

American Council on the Teaching of Foreign Languages. (1984). Foreign language enrollments in public secondary schools, Fall 1982. Foreign Language Annals, 17, 611-623.

American Council on the Teaching of Foreign Languages. (1986a). ACTFL Chinese proficiency guidelines. Hastings-on-Hudson, NY: Author.

American Council on the Teaching of Foreign Languages. (1986b). ACTFL proficiency guidelines. Hastings-on-Hudson, NY: Author.

American Council on the Teaching of Foreign Languages. (1986c). ACTFL Russian proficiency guidelines. Hastings-on-Hudson, NY: Author.


American Council on the Teaching of Foreign Languages. (1987). ACTFL Japanese proficiency guidelines. Hastings-on-Hudson, NY: Author.

Bennett, P., Biersteker, A., & Dihoff, I. (1987). Proficiency profiling guidelines: Generic, Swahili, Hausa (Rev. ed.). Unpublished manuscript.

Beebe, L.M. (1983). Risk-taking and the language learner. In H.W. Seliger & M.H. Long (Eds.), Classroom oriented research in second language acquisition (pp. 39-66). Rowley, MA: Newbury House.

Bialystok, E. (1983). Inferencing: Testing the "hypothesis-testing" hypothesis. In H.W. Seliger & M.H. Long (Eds.), Classroom oriented research in second language acquisition (pp. 104-124). Rowley, MA: Newbury House.

Brod, R.I., & Devens, M.S. (1985). Foreign language enrollments in U.S. institutions of higher education, Fall 1983. ADFL Bulletin, 16, 57-63.

Byrnes, H. (1987). Second language acquisition: Insights from a proficiency orientation. In H. Byrnes & M. Canale (Eds.), Defining and developing proficiency: Guidelines, implementations and concepts (ACTFL Foreign Language Education Series, pp. 107-131). Lincolnwood, IL: National Textbook Co.

Carroll, J.B. (1967). Foreign language proficiency levels attained by language majors near graduation from college. Foreign Language Annals, 1, 131-151.

Chaudron, C. (1983). Foreigner talk in the classroom: An aid to learning? In H.W. Seliger & M.H. Long (Eds.), Classroom oriented research in second language acquisition (pp. 127-145). Rowley, MA: Newbury House.


Clark, J.L.D. (1986). Development of a tape-mediated, ACTFL/ILR scale-based test of Chinese speaking proficiency. In C.W. Stansfield (Ed.), Technology and language testing. Washington, DC: Teachers of English to Speakers of Other Languages.

Dwyer, D., & Hiple, D. (1987). African language teaching and ACTFL team testing. In C.W. Stansfield & C. Harman (Eds.), ACTFL proficiency guidelines for the less commonly taught languages (Dept. of Education Program No. G008540634). Washington, DC: Center for Applied Linguistics; and Hastings-on-Hudson, NY: American Council on the Teaching of Foreign Languages.

Gambhir, V. (1987). Some preliminary thoughts about proficiency guidelines in Hindi. In C.W. Stansfield & C. Harman (Eds.), ACTFL proficiency guidelines for the less commonly taught languages (Dept. of Education Program No. G008540634). Washington, DC: Center for Applied Linguistics; and Hastings-on-Hudson, NY: American Council on the Teaching of Foreign Languages.

Hiple, D. (1987, March). The extension of language proficiency guidelines and oral proficiency testing to the less commonly taught languages. Paper presented at the Indiana Symposium on Foreign Language Evaluation, Bloomington, IN.

Interagency Language Roundtable. (1985). ILR language skill level descriptions. Arlington, VA: Author.

Lightbown, P.M. (1983). Exploring relationships between developmental and instructional sequences in L2 acquisition. In H.W. Seliger & M.H. Long (Eds.), Classroom oriented research in second language acquisition (pp. 217-245). Rowley, MA: Newbury House.


Liskin-Gasparro, J.E. (1984). The ACTFL proficiency guidelines: A historical perspective. In T.V. Higgs (Ed.), Teaching for proficiency, the organizing principle (ACTFL Foreign Language Education Series, pp. 11-42). Lincolnwood, IL: National Textbook Co.

Lowe, P., Jr. (1985). The ILR proficiency scale as a synthesizing research principle: The view from the mountain. In C.J. James (Ed.), Foreign language proficiency in the classroom and beyond (ACTFL Foreign Language Education Series, pp. 9-54). Lincolnwood, IL: National Textbook Co.

McCarus, E. (1987, March). The application of ACTFL proficiency guidelines to Arabic. Paper presented at the Indiana Symposium on Foreign Language Evaluation, Bloomington, IN.

Natelson, E.R., & Allen, D.A. (n.d.). Prediction of success in French, Spanish, and Russian foreign language learning: An analysis of FY67-74 student data. Monterey, CA: Defense Language Institute.

Pica, T. (1983). Adult acquisition of English as a second language under different conditions of exposure. Language Learning, 33, 465-497.

Seliger, H.W. (1983). Learner interaction in the classroom and its effects on language acquisition. In H.W. Seliger & M.H. Long (Eds.), Classroom oriented research in second language acquisition (pp. 246-267). Rowley, MA: Newbury House.

Staff. (1986, September 22). Congressional Record. Washington, DC: U.S. Government Printing Office.

Swain, M. (1985). Communicative competence: Some roles of comprehensible input and comprehensible output in its development. In S. Gass & C.G. Madden (Eds.), Input in second language acquisition (pp. 235-253). Rowley, MA: Newbury House.

Thompson, I. (1987, March). Adapting ACTFL proficiency guidelines to Russian: Problems, applications, and implications. Paper presented at the Indiana Symposium on Foreign Language Evaluation, Bloomington, IN.

Walton, R. (1987, March). The application of the ACTFL proficiency guidelines to Chinese and Japanese. Paper presented at the Indiana Symposium on Foreign Language Evaluation, Bloomington, IN.

Wolff, J.U. (1987, March). The application of the ILR-ACTFL test and guidelines to Indonesian. Paper presented at the Indiana Symposium on Foreign Language Evaluation, Bloomington, IN.


IV. Reading Proficiency Assessment

Section 1: A Framework for Discussion

by Jim Child, National Security Agency

The assessment of reading proficiency parallels the assessment of speaking proficiency by categorizing a reader's functional abilities according to AEI reading definitions, reflecting the tasks, content, and accuracy requirements for language at each level. Performance is rated holistically by comparing it to the level descriptions.

Unlike speaking, however, reading is a receptive skill. As such, it is an ability to which testers have indirect access only. Testing must always be achieved through another channel: through speaking in a reading interview; through writing in answers to content questions; and through responses to multiple-choice items on the content of the text.

Compared with the customary classroom progress test, reading proficiency tests are usually longer in duration to permit the examinee to prove consistent and sustained ability. Reading proficiency tests also differ from classroom progress tests in the levels of difficulty covered and in the range of tasks included, since difficulty and range are both required to establish a ceiling, or upper limit of proficiency, as well as a floor. That an Interagency Language Roundtable (ILR) test can furnish an accurate assessment of a foreign language user's reading skills has been accepted within the government for a number of years. But the fact that the test works does not explain how or why it works. This chapter addresses a number of questions related to implementing such a test.

The construct of reading proficiency is surrounded by controversy. The skill has not been fully defined, and available descriptions of the reading process vary significantly. Also, the role of texts in proficiency assessment is poorly understood and is the subject of disagreement. For example, there is a debate over the degree to which background knowledge is a factor in the student's attempt to internalize second language texts at the various levels. The theoretical underpinnings of reading assessment must be reviewed and expanded. In the next section, Phillips examines how language proficiency is viewed in the academic setting and its backlash effect on reading in the foreign language classroom. Both this and the next section reflect the state of flux of reading proficiency assessment.

This section focuses on reasons for the current debate on use of the ACTFL/ETS/ILR (abbreviated AEI) scale to assess reading proficiency. An examination of language proficiency and the language learner leads, in turn, to a discussion of the classification of texts. The section concludes with a discussion of test items and the evaluation of performance.

Teaching for "second language proficiency" isgathering increasing momentum as an instructionalapproach, even though the term has eluded precisedefinition. To judge from the heated (and occasionallyrancorous) exchanges on the subject at conferences andsymposia, it would seem that a number of disparatephenomena have come together in search of a consen-sus at any price. One group has taken its cue from thetime-honored ILR descriptions of four skills and fivelevels, legitimized by decades of use and refined as

11 0 11i ...*

Reading: Framework for Discussion 127

necessary. Another sees proficiency in quite differentterms, feeling with some justification that the ILR skilllevel statements, as well as the ACTFL versions mod-eled on these, are too experiential and inadequatelysupported by theory. The AEI skill level definitions,argues the first group, are oriented too heavily towardclassroom language study, especially adult study, asopposed to language acquisition in a natural environ-ment.1 But many critics of the ILR standards feel that itis premature, if not actually impossible, to impose theseor similar standards on the academic community with-out a theoretical base for them.

Continuing research on the cognitive strategies involved in study and acquisition is to be enthusiastically encouraged. Nonetheless, from an operational point of view, the time-honored skill levels and the instruments for measuring them (especially the oral proficiency interview for speaking and, more recently, new kinds of reading tests) are effective. Moreover, the government has a persistent need to carry out wide-scale testing of employees and their families and other individuals whose language proficiency must be a matter of record.

This section refers to ILR and derivative skill level statements that have historically reflected how the language community (originally, government-agency language schools; later, academic institutions) has viewed proficiency. Because it is difficult to discuss proficiency in any but a relative way in the absence of objective criteria, a "content" paradigm is offered that is somewhat more elaborate than a text typology presented earlier (Child, 1987). Finally, the question of evaluating reading proficiency using contextual tasks is addressed. These tasks require the examinee to identify and supply missing "parts" of passages.

The learner first attempts to communicate or understand all language material at whatever level the situation requires, and does so to the extent possible relative to a text typology that is sensitized to the needs of the target language. Both the product of the learner as he or she tries to meet language demands and the norms governing that behavior (i.e., texts graded according to type) are conceived of globally, in terms of final results. On the other hand, tasks are process-oriented: the examinee is evaluated for his or her semantic and syntactic skills while applying general language knowledge to specific textual problems.

While exhaustive descriptions of examinee behavior are available, and some materials on the gradation of texts, little exists on approaches to developing textual items (i.e., adequate analyses of item types, whether driven by form or content; the acceptability of weighting the different types; "pass-fail cutoffs").

The Language Learner

Considerable literature addresses the subject of learning a second language, both from the perspective of the student as he or she tries to develop strategies for making progress in that language, and from the experience of raters and drafters of skill level statements. Most of the literature explores the process of learning and teaching a second language. Yet the skill level statements, both ILR and ACTFL, describe in detail what learners do or appear to do once they reach each level: that is, they address the competencies attained.

As noted earlier, continuing research in the language-learning field is a prerequisite to any quantum leap in the teaching of reading a second language. Unfortunately, with few exceptions, students do not so much learn to read as to decipher. Decipherment is essentially an activity in which form triumphs over content. Exercises may train the student who has insufficiently mastered the phonomorphological system (the system of language forms as expressed in alphabets, syllabaries, or characters) to read "word for word" to the detriment of meaning, thus causing the means to interfere with the ends. Other kinds of drills may support decipherment at the expense of reading in a different way; students may be asked to parse sentences with the aim of attaching formal labels to words conveying messages. Skillfully planned curricula that place content before form would go a long way toward changing the relatively entrenched discrete-item approach (Phillips, 1984).

A change of this sort would in addition comport with the natural human tendency to generate or process texts. No such natural tendency to perform grammatical analysis exists. The entire constituency of language users (native speakers and learners), psychologically prepared to speak or read for meaning, has little motivation for grammar exercises. The factors limiting proficiency from this perspective are degree of verbal intelligence, technical and/or cultural background knowledge, and ability to use various strategies, such as circumlocution, in the absence of adequate control of the second language.

While general intelligence is not open to intervention, background knowledge in the cultural and technical domains can certainly be improved. Each learner brings to the target language a particular fund of knowledge about the world that contributes to his or her progress. What is often not considered, however, is the extent to which various kinds of knowledge affect language behavior at different levels in second language learning. To the extent that it has been an issue at all, subject-matter control has been a source of concern with respect to potential effects on test results. An examinee may appear (especially in an oral proficiency interview) to know the second language better than he or she actually does. This phenomenon is known to linguists as "semantic feedback" and to testers as the "hothouse special." Knowledge of subject matter, rather than command of the target language system, determines the performance of the examinee and interferes with the evaluation of proficiency in the testing situation. Yet in the realm of practical applications, content-oriented skills may be a sine qua non. One of the most useful approaches to deciphering unfamiliar material is to gain currency on the topic or topics to be processed by studying these in one's native language first, and then to transfer this content knowledge to reading the material in a second language.

Another strategy for the learner who does not control a particular construction or lexical item is to use circumlocution, which ensures communication at the expense of precision or elegance. While an examinee may be marked down for inaccuracy, a person using language "in the real world" may salvage a precarious situation, fully justifying the mastery of this strategy.

A great deal of research is needed on the effects of technical and cultural knowledge and circumlocution skills on performance. It will be useful to explore ways to measure these skills' contribution to communication and to enhance them, given the virtual certainty that they will be needed in the future.

Classification of Texts

To establish the reading skill level of a language learner, a theoretical construct is needed that is applicable to all written languages and provides an objective, common yardstick of evaluation. A criterion of this sort has always been implicit in the ILR statements, even though it is not fully appreciated. The question is dealt with at length by Child (1987), but two examples are noteworthy here.

The lead sentence of the ILR statement on reading for Level 2 is worded as follows:


Sufficient comprehension to read simple, authentic written material in a form equivalent to usual printing or typescript on subjects within a familiar context.

The main focus of this (and every other) ILR reading statement is the mental activity of the reader, without specifying what "simple authentic written material" might be like in one language or another. Although the description is further qualified as "straightforward, familiar, factual material," what is considered straightforward and factual may differ from one language to another. What is factual or straightforward to readers in one culture may appear to be propaganda to those of another. However, philosophical differences as to what constitutes factuality aside, the direct presentation of material that is amenable to reportorial or narrative treatment (regardless of its truth value) falls under the "instructive" rubric of textual Level 2.

From the formal standpoint, the question is the degree of complexity of expression in which the content is embedded (i.e., the syntax). Generally, the syntactic patterns at Level 2 are high-frequency structures with minimal information, in keeping with the semantic and pragmatic expectations of the level. Element order within sentences is rarely atypical. For example, if the normal order of the declarative sentence in a language is subject-object-verb, with variations permitted only for special rhetorical purposes, the S-O-V arrangement will appear almost exclusively in Level 2 texts because these (at least in written forms) rarely involve affective or evaluative material. As for morphology (the system of affixes or "function words"), these elements are usually obligatory and are critical for accuracy (even when expendable in lower-level communication), both within sentence or clause and at the discourse level. Finally, intonational patterns, clearly marked in speech but underrepresented in writing, are also obligatorily supplied for phrases in sentence-long units.


A second example is from reading Level 4. According to the ILR statement, the person reading at this level is "able to read fluently and accurately all styles and forms of the language pertinent to professional needs." This description, admittedly vague, assumes an extensive command of the target culture (both in its narrower and broader senses) as well as whatever content domains are controlled by the examinee, who "is able to relate inferences in the text to real-world knowledge and understand almost all sociolinguistic and cultural references." Depending on the topic, the language may be highly metaphorical or allusive, with elaborate rhetoric or a deliberately crafted lexicon meant to convey worlds beyond the words. The Child (1987) typology subsumes such varied products under the Projective Mode as follows: "Shared information and assumptions are at a minimum and personal input is paramount."

Formally, virtually any syntactic device, but especially those of low frequency and high information content, may be encountered, again according to the supposed needs of both writer and reader. For instance, low-frequency word order (perhaps combined with an infrequent lexical item) may be used to achieve a special effect (e.g., "His objections notwithstanding" for "Despite his objections"). By definition, obligatory morphemes are as important at Level 4 as elsewhere, but the cohesive devices (reference, substitution, etc.) that are immediately clear in Level 2 texts often require interpretation at the higher level. Synonymy, too, makes much greater demands vis-à-vis lexical cohesion than at Level 2, as individual stylistic choices, the major characteristic of Level 4, come into play.

Limited as these examples are, they illustrate what learners can do at two stages of attainment, especially when combined with intralanguage textual descriptions anchoring the behavioral statements. As noted earlier, both documents reflect results, encapsulating the end product of the language-learning process.


Test Items and Evaluation

Test design and evaluation, especially in a contextual format, are extraordinarily difficult tasks. First it is essential to create a sound balance of item types at each textual level tested, ensuring that these are distributed relatively evenly throughout the exercise. Then decisions need to be made on "pass-fail cutoffs," determining how many mistakes, and of what kind, are allowable at each level.

The experience of most teachers and test designers in the second and foreign language field has been dominated by formal grammar. Traditionally, parts of speech are tested in terms of case endings for nouns, tense markers for verbs, adjective-noun agreement, and so on. In more recent approaches to test design and evaluation, the content of a passage may be the primary focus of the exercise, with formal features included to establish accuracy.

A means to reconcile form and content is critical to the entire concept of proficiency testing. In an experiment at the National Security Agency, differentially weighted items are provided at Levels 1+ and 2+. At Level 1+, high-frequency vocabulary items, especially verbs and nouns, are deleted, as they interact in case frames within sentences ("case" here refers to the logical relations obtaining between noun and verb as well as among nouns; such relations may or may not be formally indicated, depending on the typology of the language concerned). The aim is to elicit the examinee's knowledge of the forms of the language being studied. At Level 2+, items involving the linkage among verbal and nominal forms are, of course, deleted. In addition, however, extended textual segments that contain formal cohesive elements are deleted: items that refer to nouns and verbs elsewhere in the text. These include mainly third-person pronouns, adverbs such as thus and so, conjunctions, and even verbal forms such as does and is and nouns such as thing, matter, and others. Testing such cohesive forms is a way to determine whether the examinee is truly following the argument (i.e., the content) of a given text in a second language. Because form and content are reasonably congruent in most languages up to Level 2+ (i.e., concepts are expressed in roughly predictable ways, with a limited number of variations), form and content are testable together. At higher levels, where considerable intellectual or esthetic creativity comes into play, the expression of content may run from the (deceptively) simple to the highly complex. Testing texts at these levels requires a degree of sophistication that does not lend itself to thinking in terms of items.

Unfortunately, the practices of teaching and testing have resulted in a dichotomy between form and content. Thus it often happens that learners have been force-fed on paradigms and mastered them to a surprising degree, only to fail completely in processing simple texts. By the same token, some learners have acquired their second language in natural language-use environments and are content-oriented as a result. Such learners may have little trouble following the main argument of a text (except for vocabulary items that are seldom encountered in spoken language registers) but may have trouble restoring grammatical deletions, an important exercise in all language batteries.

These considerations led to the idea of developing "mixes" of items and weighting them differentially. Because the problem of different types of learners is not unique to one institution, but pervades the foreign and second language education community, the analysis of item types and their organization into contextual tests must go forward.
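To make the weighting idea concrete, the following short sketch (in Python) shows one way differentially weighted contextual items might be scored against a pass-fail cutoff. It is purely illustrative: the item categories, weights, and cutoff values are invented for the example and are not those of the experiment described above.

"""Illustrative sketch only: scoring a "mix" of deletion items with
differential weights and a hypothetical pass-fail cutoff per level."""

# Each deleted item is tagged with the kind of knowledge it probes.
# Weights are hypothetical.
ITEM_WEIGHTS = {
    "content_word": 1.0,      # high-frequency nouns/verbs in case frames (Level 1+ items)
    "cohesive_device": 1.5,   # pronouns, pro-verbs, conjunctions, etc. (Level 2+ items)
}

# Hypothetical minimum weighted score (share of the maximum) required at each level.
PASS_CUTOFFS = {"1+": 0.70, "2+": 0.80}


def score_passage(responses, level):
    """responses: list of (item_type, answered_correctly) for one passage.
    Returns the weighted proportion correct and whether the cutoff is met."""
    earned = sum(ITEM_WEIGHTS[t] for t, ok in responses if ok)
    possible = sum(ITEM_WEIGHTS[t] for t, _ in responses)
    ratio = earned / possible if possible else 0.0
    return ratio, ratio >= PASS_CUTOFFS[level]


if __name__ == "__main__":
    answers = [("content_word", True), ("content_word", True),
               ("cohesive_device", False), ("cohesive_device", True)]
    print(score_passage(answers, "2+"))   # -> (0.7, False)

The design choice illustrated is simply that missed cohesive items cost more than missed content words at the higher level, which is one way the "mixes" described above could be weighted; actual weights and cutoffs would have to come from the kind of item analysis the author calls for.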


Summary

Professionals concerned with the assessment of reading proficiency must come to terms with the interaction between reader and text. Readers bring to their interpretation of texts various degrees of knowledge of the world and control of the grammar and lexicon of the target language. Texts, on the other hand, are the result of native speakers reporting facts, narrating events, and commenting on topics of interest. Such texts obviously represent appropriate material for evaluating problems of form and content peculiar to each language and for determining textual levels of difficulty. Once these levels are determined, appropriate tests can be designed and administered to assess examinees' proficiency.

NOTE

1. Study is distinct here from acquisition; learning is used as the generic term for gaining control of a second language regardless of the method.

References

Child, J. (1987). Language proficiency levels and the typology of texts. In H. Byrnes & M. Canale (Eds.), Defining and developing proficiency. Lincolnwood, IL: National Textbook Co.

Phillips, J.K. (1984). Practical implications of recent research in reading. Foreign Language Annals, 17(4), 285-296.


Section 2: Interpretations and Misinterpretations

by June K. Phillips, Tennessee Foreign Language Institute

While speaking proficiency assessment gained fairly easy entrance into academia, the reception given reading proficiency assessment has been cool, to say the least. In addition to the reasons discussed by Child in the first section of this chapter, two further reasons are discussed here. First, academia has always concerned itself with reading, devised ways to teach it, and conducted considerable research into the nature of the reading process. Thus, the reading proficiency definitions, unlike the oral proficiency interviews, do not fill a void but must contend with numerous other views about testing reading and the nature of the reading process. Second, the tendency in schools and colleges to deal with higher-level texts before the requisite skills are mastered by the student leads to entirely different views of reading skills than the view suggested by the ACTFL/ETS/ILR (AEI) proficiency guidelines. The major controversy has been formulated in the question, "proficient readers or proficient texts?" This chapter argues that this opposition is counterproductive. This is but one of many significant issues in adapting reading proficiency assessment to the academic setting.

Reading proficiency is discussed here in terms of both texts and the reading process, and implications of the AEI reading definitions (including texts and process) are explored for the teaching of reading in the foreign language classroom. After a discussion of needed research, the section ends with observations on the present situation and directions for the future.

Reading skills as described in the AEI definitions parallel those in other modalities in that they address a continuum of tasks that native readers are capable of performing. They differ in that learners must do more than manage the language they control as output; they must deal with the input provided by a writer who is not attentive to their level of language. The continuum, therefore, must consider both texts and tasks. Those who have claimed that the definitions for reading address only text type have not read them carefully or thoroughly. As Child suggested, the inclusion of statements about reader behaviors and learner strategies contributes to the complexity of the level descriptions and current procedures for assessment. Thus, the descriptions address proficient readers and not just proficient texts, contrary to Bernhardt's (1986) interpretation. Such mixing of diverse elements in the AEI definitions calls for refinement. As the profession grows in its knowledge of receptive processes, confirms or rejects models of second-language reading, and develops more sensitive measures of comprehension, the definitions should be modified to reflect these gains.

The Texts to Be Read

In their present form, the reading proficiency definitions can enhance traditional reading experiences in several ways; primary among them is the delineation of common text types according to the impact they have on readers. At one time, in the academic setting, it was common for most foreign language learners to deal with a narrow range of texts, usually 200-250 words long, in a recombination or highly controlled narrative. The small number of students who became language "majors" would enter the world of literature, ideally to understand and appreciate, but more realistically to decode and manipulate. Most important, neither group ever attained a level of proficiency that encompassed more than a few strata of the texts available to the native reader or tasks resembling what the second language reader can do in his or her first language. Academia must now make a conscious decision to develop a curriculum that includes a wider range of reading skills and passage types. The alternative is to maintain a selective focus in which readers may become proficient with limited tasks in specific texts. Reading proficiency measures are not appropriate or useful in the latter case, because the hierarchical assumptions of the AEI definitions cannot be met. The reading comprehension tasks in which passages meet the criterion of a single type (see Child's [1987] typology) produce a "score," whereas a reading proficiency test according to the AEI scale demonstrates the reader's array of skills. This is because when a sustained level is assigned, the examiner assumes that lower-level functions and texts can be handled successfully.

It is important to note that the hierarchy suggested by the reading definitions differs in reality from that of speaking, from which it was derived. Acceptance of the definitions implies the intention to include the spectrum of skills in instruction and evaluation. In reading, the hierarchy is not confirmed automatically by student performance, whereas in speaking, no learner can "narrate, describe, or speak in paragraphs" (Advanced/2) who cannot "maintain simple face-to-face conversations" (Intermediate/1). Independent of method or materials, learners measured over time perform in speaking at Novice/0 levels with words and learned materials before they demonstrate creativity in sentence-length structures (Intermediate/1). Schooling can, on the other hand, produce Advanced/2 readers who may well lack the visual literacy of the Novice/0, especially if their skills have been developed in a language-centered program. Reading is not a natural skill; it is learned. Thus the presence or absence of the hierarchical assumptions for reading depends on the goals of the curriculum.

While it may be possible to teach students to read only segments of the continuum mastered by native readers, such comprehension should be defined solely in terms of passage types read. Otherwise, students presume that their ability to read an edited text and score highly on a reading comprehension test means that they can "read." To their amazement, their first experience in the target culture often vividly demonstrates that reading the realia around them is not as easy as they imagined. The task of defining or assessing reading proficiency within the AEI paradigm imposes the necessity to demonstrate that the hierarchy applies to reading outside the classroom, and that any rating implies successful reading of lower-level texts. Thus, test passages must be representative at all levels.

Processes and Strategies for Reading

The continuum of text types on the reading scales is only part of the description contained in the AEI definitions. Just as for the other skills (see Lowe, this volume, Note 4), functional trisections have been charted for reading to clarify dimensions embedded in the level descriptions. For reading, one trisection includes text type, reader function, and reader strategies (Canale, Child, Jones, Liskin-Gasparro, & Lowe, 1984). Two additional segments that may be considered are author intent and author accuracy.

This breakout is important in identifying strategies readers use and the functions they demonstrate with the various texts. On its surface, the listing of strategies horizontally with text types could be confusing if interpreted to mean that these strategies are uniquely associated with the level at which they appear. In reality, effective readers use all these strategies at all levels in their efforts to assign meaning to a text. Actually, the side-by-side placement in the trisection, as well as the mention of specific strategies in the narrative descriptions, indicates that consistent control of that strategy correlates well with the accomplishment of the function required. For example, with "orientational" Intermediate/1 level texts, the reader usually needs only skimming and scanning strategies to identify the main ideas. This strategy tends to be the most important for Level 1 functioning.

In describing the reading process in general, these same labels are used to designate learning strategies. Consequently, an individual reader aiming for comprehension of an orientational text (e.g., names of stores, street signs, travel forms) may also use inferencing or contextual guessing strategies to assign actual meaning.

To summarize, the AEI definitions, through the rather tightly packed descriptions, provide insights into the tasks and accuracy of comprehension that readers demonstrate as they interact with a range of texts. For teachers, this knowledge is a basis for choosing materials, particularly authentic ones, and for developing activities that set reasonable tasks given the reading levels targeted for their students.

Unlike oral proficiency testing, for which no suitable testing mechanism existed, academia has always tested reading in some form using a wide variety of methods. As a result, the task of introducing tests of reading proficiency according to the AEI scales presents problems, because traditional tests have largely required decoding parts of passages, rather than demonstrating consistent and sustained control of ILR functions and of the content specified by the AEI definitions. In fact, a mismatch exists between what the AEI scales define as reading and as suitable reading content and what academia has traditionally regarded as reading and suitable reading test material. The problem is not that AEI reading proficiency tests may not furnish generally valid and reliable results, but rather that the classroom teacher will, without an introduction to the system and how it affects curriculum, be baffled by it. If, as has been maintained, teachers tend to teach for the test, two changes must occur for reading proficiency tests to make sense in the classroom: (a) new strategies must be taught and (b) new types of material must be included. Once these changes are introduced, then teacher and student alike will understand more readily how an AEI reading proficiency score reflects the examinee's ability.

Implications for Teaching Proficiency in Reading

The decision to integrate concepts of the reading guidelines into the foreign language curriculum entails identifying materials and learner strategies that surpass the current level of performance. Several issues concerning both texts and strategies must be addressed.

A Wider Range of Texts

At the start of foreign language classes, students should develop skills in reading Novice-level materials at the lowest end of the continuum that reinforce or enrich the topics being studied. These selections are highly contextualized and often require some background knowledge on the part of the student. Lacking that, the teacher may need to provide cultural contexts, advance organizers, or other activities that facilitate the student's access to the text. Many basic textbooks contain bits of realia and cultural information, but these need to be converted from decorative pieces to sources of information. The basic text must usually be supplemented with a collection of reading materials. Most beginning courses follow a topical presentation. To find appropriate materials, the question to ask is, for example, "What does one read in French or German or Arabic that contains expressions of weather, or colors, or clothing, or food?" Appropriate texts at this level do not display the characteristics of paragraphs but are in the form of symbols, short phrases, and lists. It is relatively unimportant if students cannot understand every word or detail; recognition of essential information appropriate to the task suffices and stretches the learners' thinking beyond the mechanical processing of artificial documents. The abundance of Novice/0 level texts is attributable to the heavy reliance among the world's societies on signs, symbols, advertisements, announcements, and so on.

As students ascend the scale, authentic materials representing other text types should systematically provide for "real-world" reading practice. At the Intermediate level, fewer materials are available, because this type of fairly simple, straightforward passage is of limited use to native readers. Good materials can be found in subject areas related to "survival" topics, particularly shopping, the post office, banking, transportation, food, and lodging. Brochures with good visual support, though aimed at native speakers, try to simplify both language and procedures for new customers. Another resource is general-audience magazines, but the teacher may need to specify how much and what parts of articles are at the level of challenge. Materials must not be so difficult that they cause frustration.

The large majority of texts used by native readers is at the Advanced/2 level and above. Teachers and authors of foreign language textbooks play an important role in selecting the actual texts, for these must reflect content, interest, and language that is or can be made accessible to the student. The key to accessibility lies in deciding what the learners can be asked to do and helping them to do it. It is more a matter of editing the task than editing the text.


Teaching Effective Strategies

Before a wider range of texts will lead to more proficient reading levels, the process of reading must also be taken into account. At beginning levels, when "foreign" describes the culture and context as well as the language, successful reading depends on the learner's development of a set of effective strategies. A wide spectrum of techniques has been proposed and explored (see Grellet, 1981; Omaggio, 1986; Phillips, 1984; Swaffar, 1985 for overviews). At a minimum, they involve some degree of prereading activities, skimming or scanning phases, and comprehension guides to verify that the reader can carry out a similar task in real life.

Most recent research has confirmed the importance of prereading activities for students. Acting as advance organizers that activate students' abilities to predict, anticipate, guess from context, absorb into existing schema, or simply recall background knowledge or experience, this first stage is critical to comprehension of authentic materials. (For examples of research studies, see Carrell & Eisterhold, 1983; Hudson, 1982; Haus & Levine, 1985; Steffenson, Joag-Dev, & Anderson, 1979.) Teachers who are aware of their students' linguistic, cognitive, and experiential background are in the best position to determine effective prereading activities. Furthermore, they can best judge how much preparation is required for entry into a specific passage. The teacher must walk a narrow line between providing too much and too little help.

Passages do not exist at as many levels as there are differences in student ability. The proficiency descriptions can serve as a tool for designing tasks and determining specific areas that can be tested for comprehension. Tasks requiring intensive in-class reading dominated in the past, but more recent materials include texts that are to be scanned only for main ideas, supporting detail, or designated pieces of information. Using these procedures, many passages that were once considered too difficult become readable.

Instruction that is built on these principles, derived from the definitions, blends concern for the reading process, the learner's repertoire of effective strategies, and exposure to texts representative of those in the target language. Students exiting such programs will have a chance of improving their reading skills as they learn new information about their environment and that of others. More important, they will be able to manage reading tasks that are carried out for real work purposes; their skills will allow them to enter the spectrum of readings available to native readers so that they no longer exhibit mixed proficiency, jumping from level to level in a nonfunctional way.

While the implications of AEI reading proficiency testing for the classroom continue to be explored, tests in use, though not thoroughly understood, nevertheless seem to work. It should be reiterated that AEI reading proficiency testing has proven effective, as evidenced by its use in government for more than 30 years. Yet the ILR has devoted less attention to reading than to speaking, and little understanding exists about the interrelationship of functions, content, and accuracy in assessing reading proficiency. These facts compel more research and underscore the urgency of the need. Perhaps this research will explain why the AEI reading scales work, especially when the content from one test passage to another varies greatly and examinees often only partially exhibit the skills set forth in the AEI definitions.

The Research Agenda for Reading

The reading descriptions and the implications that academic teachers and administrators draw from them must undergo tests of applicability, assessment, and rigorous experimentation. Assumptions must be challenged, and the practices being advocated must be tested to determine whether they have positive effects on comprehension. The implementation of a wide-ranging research agenda cannot be delayed. Lett and Clark (this volume) address several relevant issues, and what follows is an addendum to that chapter.

First, valid tests of comprehension of passages identified by level must be developed specifically for the academic world. A program to assess whether performance at one level can guarantee that the lower levels are also controlled would provide evidence of the existence of a hierarchy for those who learn to read a foreign language. Within level descriptions, research to confirm whether the trisectional statements appropriately combine text, function, and strategy could also be conducted. Claims that the descriptions must be validated should be answered, but the face validity has been established by the historical derivation of the model; that is, they are the result of analyzing learner performance. Research is probably not capable of producing a set of guidelines, because no single theory could generate them. The AEI definitions deal with specific linguistic items and how and to what extent learners comprehend them. Research can, however, confirm or reject their viability as good descriptors of performance.

Another agenda item is classroom-based research on the interactions that occur between reader and text to determine whether instructional strategies render texts accessible. Every aspect of prereading, the effects of advance organizers, or applications of schema theory (Minsky, 1982) should continue to be explored. Experimentation with newer kinds of test items or formats for comprehension would also be valuable. Computer programs that provide help menus that respond to student prompting would tailor intensive reading practice to the individual learner's needs, a major advance over the full-class problem-solving that usually occurs. The ultimate research and development effort would be to develop a computer-adaptive reading test that could efficiently assess the range of texts a student can process to given levels of accuracy, so that a rating given a student conveyed the kind of information that the oral proficiency interview does (Carton & Kaya-Carton, 1986; Dandonoli, 1987). Students would then have a fairly accurate idea of what they can read and what the next stage requires.
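As a rough illustration of the adaptive logic such a test might use, the sketch below (in Python) moves an examinee up the scale after a passed passage and back down after a failed one, reporting the highest level sustained. Everything in it is an assumption made for the example: the simplified level scale, the passage bank, the stubbed examinee, and the simple up-down rule, which stands in for the calibrated psychometric model an operational computer-adaptive test would require.

"""Minimal sketch of one possible adaptive routing rule for a reading test;
illustrative only, not any actual ACTFL/ILR instrument."""

import random

LEVELS = ["0+", "1", "1+", "2", "2+", "3"]  # simplified AEI-style scale


def administer(level, passage_bank, examinee_answers):
    """Present one passage at the given level; return True if the examinee
    meets the comprehension criterion for that passage."""
    passage = random.choice(passage_bank[level])
    return examinee_answers(passage, level)


def adaptive_reading_test(passage_bank, examinee_answers, max_passages=8):
    """Route the examinee up after a pass and down after a failure;
    return the highest level at which a passage was handled successfully."""
    idx, highest_sustained = 0, None
    for _ in range(max_passages):
        if administer(LEVELS[idx], passage_bank, examinee_answers):
            highest_sustained = LEVELS[idx]
            if idx == len(LEVELS) - 1:
                break
            idx += 1          # probe the ceiling
        else:
            if idx == 0:
                break
            idx -= 1          # drop back toward the floor
    return highest_sustained


if __name__ == "__main__":
    bank = {lvl: ["sample passage at " + lvl] for lvl in LEVELS}
    # Stub examinee who handles everything up to Level 2.
    reads_to_2 = lambda passage, level: LEVELS.index(level) <= LEVELS.index("2")
    print(adaptive_reading_test(bank, reads_to_2))   # -> "2"

The point of the sketch is only the shape of the procedure: passages are sampled at a working level, the level moves with performance, and the reported rating is the highest level sustained, which is the kind of information the paragraph above argues a computer-adaptive reading test should deliver.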

Interpretations, misinterpretations, implications for the classroom, the need for research: all influence how AEI reading proficiency testing will be received. This chapter has come full circle, from testing room to classroom and back. While the starting point and end point are the same, the understanding of that point changes significantly when the AEI definitions are studied with new eyes. Frankly, examiners in the past have tested reading achievement, the presence of bits and pieces of content and ability. What the AEI definitions suggest be tested in the future, particularly when students are taught for the test, is proficiency; that is, consistent and sustained ability to read a wide variety of texts with appropriate reader strategies.

References

Bernhardt, E.R. (1986). Proficient texts or proficient readers? ADFL Bulletin, 18(1), 25-28.

Carrell, P., & Eisterhold, J. (1983). Schema theory and ESL reading pedagogy. TESOL Quarterly, 17, 553.

Canale, M., Child, J., Jones, R.L., Liskin-Gasparro, J.E., & Lowe, P., Jr. (1984). The testing of reading and listening proficiency: A synthesis. Foreign Language Annals, 17, 389-391.

Carton, A.S., & Kaya-Carton, E. (1986). Multidimensionality of foreign language reading proficiency: Preliminary considerations in assessment. Foreign Language Annals, 19, 95-102.

Child, J. (1987). Language proficiency levels and the typology of texts. In H. Byrnes & M. Canale (Eds.), Defining and developing proficiency. Lincolnwood, IL: National Textbook Co.

Dandonoli, P. (1987). ACTFL's current research in proficiency testing. In H. Byrnes & M. Canale (Eds.), Defining and developing proficiency. Lincolnwood, IL: National Textbook Co.

Grellet, F. (1981). Developing reading skills: A practical guide to reading comprehension. Cambridge, England: Cambridge University Press.

Haus, G.J., & Levine, M.G. (1985). The effect of background knowledge on the reading comprehension of second language readers. Foreign Language Annals, 18, 391-397.

Hudson, T. (1982). The effects of induced schemata on the "short circuit" in L2 reading: Nondecoding factors in L2 reading performance. Language Learning, 32, 1-31.

Minsky, M. (1982). A framework for representing knowledge. In J. Haugeland (Ed.), Mind design. Cambridge, MA: MIT Press.

Omaggio, A.C. (1986). Teaching language in context: Proficiency-oriented instruction. Boston: Heinle and Heinle.

Phillips, J.K. (1984). Practical implications of recent research in reading. Foreign Language Annals, 17, 285-296.


Steffenson, M.S., Joag-Dev, C., & Anderson, R.C. (1979). A cross cultural perspective on reading comprehension. Reading Research Quarterly, 15, 10-29.

Swaffar, J.K. (1985). Reading authentic texts in a foreign language: A cognitive model. Modern Language Journal, 69, 15-34.


V. Issues in Writing Proficiency Assessment

Section 1: The Government Scale

by Martha Herzog, Defense Language Institute

Writing has never been given great emphasis in government language evaluation. The original effort in the 1950s to inventory, define, and measure the language proficiency of government employees led to a system covering all four skill modalities. However, the testing that evolved concurrently with the language skill level descriptions focused on speaking and reading, with some attention paid to listening comprehension (see Sollenberger, 1978). For many years, the writing proficiency scale was primarily used only for reference.

While their curricula may require some writing, the government language schools do not test this skill as part of the end-of-course evaluation. Furthermore, any writing done during the course is not evaluated according to the proficiency scale.

In about 1980, however, the Civilian Personnel Office of the Defense Language Institute (DLI) took a careful look at writing proficiency. In the hope of improving the assessment of prospective teachers' language ability, the personnel specialists proposed to test all four skills according to the government scale. This was a pioneering effort, in advance not only of the ACTFL guidelines project and the proficiency movement, but also of the DLI Test Division's major undertakings for measuring students' language proficiency. At first, the personnel office concentrated on training testers for English, but their ultimate goal was to create a sizable cadre of testers in all languages taught at DLI. This goal has been accomplished. DLI has more than 250 certified testers in 26 languages, serving both the personnel system and the Test Division. Though fully trained to assess writing proficiency, these testers rate this skill only for the personnel office; under normal circumstances, they are the only federal employees who actively use the writing level descriptions. The present focus is on the one practical use of the writing scale.

By the time DLI became seriously involved in proficiency testing in 1981, the Interagency Language Roundtable (ILR) Testing Committee had identified a need to revise and expand the level descriptions for all four skills. Revisions were needed for several reasons. First, there were no written descriptions of the plus levels despite their widespread use. The Foreign Service Institute had developed helpful guidelines for awarding plus ratings in speaking (Adams & Frith, 1979), and the agencies shared an informal tradition concerning the general use of plusses. It seemed time to agree on complete descriptions for all 11 points on the scale.

It was also time to provide a fuller explanation of the functional ability found at each level. This was not a matter of a need for change; rather, the need was for amplification. Examples were needed for the speaking descriptions. Within the reading descriptions, a distinction had to be made between texts at a specific level and people who read at that level. Also added was an explanation of the limited functions a Level 1 or Level 2 reader can perform with higher-level texts.

Another problem that interested DLI involved commensurability among skills. For example, experience and analysis had shown that the standards for attaining a Level 1 in reading were somewhat lower than those for speaking. As the issue of commensurability was discussed, the writing descriptions were carefully scrutinized. DLI's participation in these discussions may have marked the first analysis by an agency that actually proposed to use the writing scale for testing. It turned out that the writing descriptions were not as carefully graduated as those for other skills. Thus, while a Level 1 speaker could perform certain limited language tasks independently and the Level 3 speaker could function effectively in both social and professional situations, all writers below Level 5 required the assistance of an editor.1 Inadequacies in the writing descriptions particularly pointed up the need to revise the system to make it commensurate.

Several joint government and academic projects were begun at a DLI testing conference late in 1981. Among them was an on-the-spot exercise by Pardee Lowe, Jr. and Adrian S. Palmer to remove the most glaring deficiencies from the writing descriptions. During a two-day period, they rewrote the base-level descriptions to bring them into line with speaking. Most of this work found its way into the revised ILR level descriptions that were finally published in 1984, and into the ACTFL guidelines for writing.

This important effort by Lowe and Palmer paved the way for revision of the entire scale and should not be minimized. Nevertheless, in retrospect, it appears that the ILR Testing Committee might have devoted attention to additional improvements. During a year of intensive rewriting, it was clear that the Testing Committee assigned the lowest priority to descriptions for writing.

As noted earlier, even the improved writing scale is little used within the federal government. If the military services were to ask DLI to test students, its certified testers could certainly evaluate the writing samples; they receive good training in the use of the scale. However, DLI's Personnel Office remains the only client for such testing.



Applicants for teaching positions are tested in both English and the target language. The lowest acceptable score is a Level 2 in English (Advanced on the ACTFL scale) and Level 3 in the target language (ACTFL Superior; for a comparison, see the figure in the Introduction, p. 4, this volume). The present writing test consists of a single essay answering the question, "How has your background and experience prepared you for employment at DLI?" The essay is rarely written under controlled conditions, and it is tacitly assumed that one version (English or target language) will probably be a translation of the other. Because the topic has not varied for several years, those who choose to compromise the test are in an excellent position to do so.

The greater problem, however, is that this assignment severely limits educated, skilled writers' attempts to show versatility with prose style or their ability to develop a complex idea. A superficial glance at the scale would suggest that even the best writer could handle the topic adequately at Level 3. Careful scrutiny of the descriptions reveals an even more fundamental difficulty. Level 3 implies both formal and informal style, tone, and subject matter; clearly, no single essay can demonstrate this range of ability.

A Direction for Future Tests

Because of higher priorities for test development, DLI has not yet developed a meaningful writing test for prospective teachers. However, the time seems right to begin work.

An extremely interesting prototype test that covers the ACTFL scale from Novice through Superior has already been proposed by Magnan (1985). She designed the prototype to be administered in two sessions, each lasting about 2.5 hours. The testing time is lengthy, but no surprise to those familiar with the guidelines or those who have administered essay exams under controlled conditions. Magnan's model succeeds in suiting topics appropriately to proficiency levels and in demanding sufficient variety of topic and task to fully meet the requirements of the ACTFL guidelines. If DLI were to test students' writing ability, it would afford an excellent chance to experiment with this model and indeed with many of the suggested topics. Magnan's test should be given a thorough trial, particularly at institutions that implement a proficiency-based curriculum.

An earlier prototype had been developed by the author at DLI covering Levels 2 through 5, the portion of the scale used by the personnel office. It is intended only for English-language testing. Tests of similar design and length should be generated in the various target languages by teams certified to test in each language. At the higher levels of proficiency, it is doubtful that topics and tasks can be designed that will work successfully in more than one language. Every aspect of the test, including the instructions, must be sensitive to the demands, problems, and conventions of writing in the specific language. If this assumption is too cautious, a research project suggested later in this chapter will show this.

Several certified testers were consulted in the development of the English prototype.2 To test through Level 5 requires samples of formal and informal writing and a "variety of prose styles pertinent to professional/educational needs." The minimum number of writing assignments and topics and the minimum length of each assignment that would enable examinees to demonstrate their true writing ability had to be determined.

It was assumed at first that one assignment, with a choice of topics, would be needed for each level tested. However, DLI's extensive experience with testing reading between 1982 and 1986 revealed that a carefully selected text with well-designed questions can measure reading comprehension skill accurately at several proficiency levels. For example, DLI has been able to use Level 3 texts in a cloze format to discriminate precisely among readers who scored at Levels 1 through 3+ in a thorough face-to-face test of reading. On the basis of this experience, it can be hypothesized that writing tasks aimed at Levels 2, 3, and 4 will furnish samples that allow discrimination among writers through the upper part of the scale.

Like Magnan, DLI was concerned about test length. But in addition to the problems she anticipated, DLI realized that job applicants are not as readily available as students. Therefore, a single testing session was needed. And the concern had to be to determine minimum, rather than optimum, length.

A carefully designed and validated test becomes both expensive and valuable. To preserve the investment, the test had to be administered under controlled conditions at DLI or another government office. DLI's assignment of topics had to take this factor into account. Use of reference materials such as dictionaries, thesauruses, or style manuals could not be permitted unless their availability at every testing site could be guaranteed. In addition, DLI sought to create topics that no category of applicants would be likely to have prepared for in advance. At the same time, the topics had to be of interest to the average applicant. The task of narrowing the test to measure writing proficiency according to the level descriptions and nothing more (not general knowledge, professional preparation, intelligence, or retention of published articles) was far from easy. And until the test can be validated and tried, it can by no means be assumed that all the problems have been resolved.

The proposed test appears in Figure 5.1. Part I presents a straightforward Level 2 task. This particular topic covers the reference in the level descriptions to "routine social correspondence" and would show ability to write about "daily situations." It assumes organizational and discourse ability at the paragraph level only. The instructions mention grammar and vocabulary because the level descriptions are relatively specific about



The following assignments will be used to evaluate your English writing proficiency according to the ILR language proficiency descriptions. Pay careful attention both to the topic and the intended reader.

PART ONE

Assume that you have just returned from a trip and are writing a letter to a close friend. Describe a particularly memorable experience that occurred while you were traveling.

This will be one paragraph in a longer letter to your friend. The paragraph should be about 100 words in length.

You will be judged on the style and organization of this paragraph as well as vocabulary and grammar. Remember, the intended reader is a close friend.

PART TWO

Imagine that your reader is a young American, 20 to 25 years old, bright, educated, and outgoing. Choose one of the following topics. In about 300 words, write about the subject in a way that will be interesting to this reader.

You will be judged on the style, organization, coherence, and complexity of your essay as well as the richness and precision of vocabulary and the accuracy of grammar and spelling.

A. Explain why it would be valuable to learn a second language.
B. Describe a person who had a great influence on your life. Explain why this person was so important to you.
C. Present your own definition of success, giving reasons.

PART THREE

Assume that you have been asked to write a paper to be presented during the annual meeting of a professional organization to which you belong; it will later be printed in their quarterly newsletter.

Select one of the following topics and write a paper of approximately 750 words.

You will be judged on the style, organization, logical development, and complexity of your paper as well as the richness and precision of vocabulary, accuracy of grammar and spelling, and the suitability for the intended audience.

A. Teachers' resistance to change.
B. The influence of television on language skills.
C. Quality versus equality in higher education.
D. The move toward neutralizing gender in language.

Figure 5.1. Proposed Test of Writing



the accuracy expected at Level 2.3 However, because the examinee is urged to pay attention to style and to address a particular audience, the evaluator can also use this assignment for part of the overall rating of higher-level examinees. Experience has not shown that Level 2 writers themselves control style or tailor their writing for an assumed audience. However, reliance on the single DLI topic (discussion of past experience) may have the effect of limiting examiners' knowledge of the Level 2 writer's ability.

Part I should certainly screen out examinees whose performance falls below Level 2. However, it may not provide an adequate sample to discriminate between the true Level 2s and the Level 2+s.

Part II should do a great deal more. The choice of topics should allow examinees to write on subjects they have thought about before, without offering any likelihood of specific preparation for the test. The true Level 2 could probably write well enough to confirm the evaluation of the first assignment. However, this second assignment should establish a ceiling for the current proficiency as a single paragraph might not. The solid 2+ writer could use this assignment to demonstrate an ability to "write about concrete topics relating to particular interests and special fields of competence."

After completing Parts I and II, examinees whose proficiency level exceeds 2+ will have shown their higher-level ability through their style, organization, and tailoring of the subject to the audience. In most cases, grammatical control and a sufficient difference in style between Parts I and II would assure evaluators that an examinee is at least Level 3. Trial will be required to determine how much discrimination can be made on the basis of these two essays.

In combination with the first two assignments, Part III should provide enough information for assigning proficiency levels at the top of the scale. Any of the four topics can be considered pertinent to professional interests and needs. Admittedly, they are extremely general; however, generality should ensure lack of conscious preparation for the test and eliminate the need for reference materials. As part of the overall test battery, Part III is challenging, but the choice of topics and their broad relationship to language teaching should create a fair test. Topics more closely related to the field might well create interference in the scoring. If the objective is to assign a writing proficiency level, the examinee's subject-matter expertise should not be evaluated. Applicants have other opportunities to present their professional credentials. On the other hand, examinees can define the topic as they choose. Their professional experience may well influence the organization and development of their ideas into a style appropriate to the audience.

These three diverse assignments should allow higher-level examinees to demonstrate their full proficiency. Both formal and informal styles are required. Topics are social and professional, affording three opportunities to develop and organize ideas and to demonstrate control of grammar and strength of vocabulary. A certain practicality is inherent in the tasks. Finally, the writer must tailor his or her presentation to three quite different audiences. Even the Level 5, who may not consider these topics especially stimulating intellectually, should find that he or she has been tested against the level descriptions within realistic time constraints.

These constraints will have to be determined through field trial. At this point, it appears that 2.5 hours should be allowed for completion of Parts I and II, followed by a break. Examinees who want to qualify only for the minimum score could leave at the break. This procedure would provide a thorough test of Level 2 without submitting lower-level writers to a frustrating experience. Those who return to complete Part III should be given no more than 2.5 hours.

Naturally, rigorous validation would be needed to ensure that these or other topics are appropriate and sufficient to discriminate at the required levels. In addition, multiple test forms should be prepared.



Following validation and trial implementation of the English proficiency test, similar but not identical target-language tests should be developed. Such a design could include a topic similar to Part II of the English test and designate a specific target-culture reader. Two separate assignments similar to those in Part III would be useful. These should address distinct audiences, with contrasting requirements for topic, tasks, and style; however, both should be appropriate for the target culture.

The instructions and the scoring criteria should be equally sensitive to rhetorical patterns in the language tested. In this respect, testing specialists have a great many questions to ask the native speakers/writers who are certified to test their language. While little research has been done on contrastive rhetoric, Kaplan (1966, 1972) certainly has raised important issues. The ILR descriptions and the ACTFL guidelines for writing must be considered tentative until his assumptions have been tested (see Gregg, 1986; Mohan, 1986a, 1986b; Mohan & Lo, 1985; Ricento, 1986). The DLI experience seems to support Kaplan's contention that English rhetorical conventions do not apply universally (see also Carrell, 1982; Carrell & Eisterhold, 1983). Until evidence appears to the contrary, an English model should not be imposed on the writing of other languages. Not only should topics be culturally oriented, but instructions to both examinees and raters should refer to the rhetorical features characteristic of good prose in the target language.

Scoring of the English prototype, at least initially, should follow the holistic method now used at DLI. Two certified testers rate the essay independently, consulting only the level descriptions. Discrepancies that cross a major scale border (e.g., 2+ and 3) require involving a third rater to determine final scoring (see Lowe, 1978, 1982, for comparable discussions of the third rater's role in speaking tests).
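To make the adjudication rule concrete, the sketch below encodes the two-rater procedure just described: identical independent ratings stand, while a discrepancy that crosses a major scale border calls for a third rating. The function names, the handling of plus levels, and the tie-breaking for same-level disagreements are illustrative assumptions, not DLI's actual scoring procedure.

    # Illustrative sketch (not DLI's actual software) of the two-rater holistic
    # scoring rule described above, using the 11-point ILR scale.
    ILR_SCALE = ["0", "0+", "1", "1+", "2", "2+", "3", "3+", "4", "4+", "5"]

    def base_level(rating):
        """Base level of a rating: '2' and '2+' both belong to base level 2."""
        return int(rating.rstrip("+"))

    def crosses_major_border(a, b):
        """True when two ratings straddle a major scale border (e.g., 2+ vs. 3)."""
        return base_level(a) != base_level(b)

    def final_score(rater_a, rater_b, third_rating=None):
        """Resolve two independent holistic ratings into a final score."""
        if rater_a == rater_b:
            return rater_a
        if crosses_major_border(rater_a, rater_b):
            if third_rating is None:
                raise ValueError("Discrepancy crosses a major border; a third rating is required.")
            return third_rating
        # Disagreement within one base level (e.g., 2 vs. 2+): the chapter does
        # not specify the resolution, so taking the lower rating is an assumption.
        return min(rater_a, rater_b, key=ILR_SCALE.index)

    # Example: ratings of 2+ and 3 cross a border, so the third rating decides.
    print(final_score("2+", "3", third_rating="3"))   # "3"
    print(final_score("2", "2+"))                     # "2" (assumed rule)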



Analysis of the Viability of the Scale

Working on an English prototype, trying to apply the design to a variety of languages, and trying to anticipate scoring difficulties led to doubts concerning the viability of the scale. The improvements begun by Lowe and Palmer are clear to those familiar with the earlier version.4 The current writing scale is generally commensurate with the speaking scale in terms of language production, and with the reading scale in terms of text. However, it now appears that much has been omitted.

The writing descriptions stress sentence-level skills. Perhaps as an inheritance from the previous iteration of government descriptions, emphasis is placed on spelling, punctuation, and control of grammatical structure; lexical accuracy is considered more fully than style. Many of the factors traditionally considered when evaluating essays are minimized or eliminated. Examples include organization, methods of development, diction and tone, creativity (or the classical "invention"), and attention to the audience. (Weaver, 1967, and Brooks & Warren, 1972, present the fundamentals of rhetoric for the native writer.)

Although it was not their purpose, companion essays by Dvorak (1986) and Osterholm (1986) reinforce the impression that the writing descriptions have crucial pieces missing. Osterholm points out that focusing on lower-level goals effectively blocks beginning writers from achieving mid-level goals such as developing a paragraph or using the paragraph to support the central argument. Dvorak refers to this concern for the conventions of language form as "transcription" and uses the term "composition" for "the skills involved in effectively developing and communicating an idea or making a point" (p. 145). She adds that for the last 20 years, most articles on foreign language writing have concentrated on the lower-level skills, discussing ways to "reduce and repair error damage" (p. 148). Usually, foreign language teachers concentrate on transcription and regard composition, as Dvorak defines it, as outside their scope of interest.

It was within this general atmosphere that successive government committees narrowed the scope of the writing proficiency descriptions. However, by following the comparatively recent trend of foreign language classroom evaluation and ignoring several centuries of rhetorical analysis, DLI has not only eliminated valuable tools for evaluating prose but, theoretically, may have contributed to the tendency of examinees to become blocked by sentence-level problems. The damage is, no doubt, only "theoretical," thanks to the extremely limited use of the descriptions to date.

The ILR Testing Committee inadvertently allowed this to happen both because of the infrequent use of the writing scale and because of an assumption that the two productive skills were highly parallel. Of course, parallels do exist. However, the oral interview does not examine polished, practiced speech; in fact, considerable effort is made to prevent the examinee from using prepared material. While the examiners' activities are highly structured, the interview places the examinee in a series of situations that closely approximate spontaneous speech acts of daily life. Writing cannot be tested in this way. Little writing beyond phone messages and notes is spontaneous. Most writing is expected to be polished and organized for the reader. Academic assignments (tests, essays, term papers) and professional reports and papers all require planning, organization, and revision. Except for the necessary time limitations, a test of writing proficiency makes almost the same demands on the writer as the writing done for academic or professional purposes. A writing test is a performance test. Therefore, the scoring system should include the factors used in evaluating the external criterion: the successful essay, report, or paper.

A traditional rhetorical analysis of an English essay would place considerable emphasis on organization. Attention would be given to the writer's skill in ordering material in a way that fits the subject, limiting the scope, providing the intended emphasis, creating effective transitions, and achieving overall unity and coherence within a framework that develops ideas clearly for the intended reader. These and other aspects of organization are so crucial to the evaluation of prose that it may seem that the writing proficiency descriptions refer to something else. In a sense, of course, they do; the level descriptions focus on a writer's performance across a variety of tasks rather than on the characteristics of a single essay. However, rating performance does not alter the fact that a proficiency level can be assigned only to a writer after evaluating examples of his or her prose. Whether traditional scoring is global or factored, organization must be seriously examined. An effective Level 3 description should define a type and degree of organization acceptable for communication and clarity in general professional writing. The higher-level descriptions should indicate progressive refinements in skill and greater sophistication in types of prose organization. Conversely, lower-level descriptions should suggest the limitations of the less proficient writer in ordering material and making transitions.

Unfortunately, when revising the writing descriptions, the ILR committee did not focus on this part of the task. Instead of defining Level 3 as the threshold of basic organizational competence, the ILR description states, "Relationship of ideas is consistently clear." This is accurate but not adequate. Level 3+ is defined negatively, omitting any reference to improvements over Level 3 proficiency: "Organization may suffer due to lack of variety in organizational patterns or in variety of cohesive devices." Level 4 is described with more precision, although, like the 3+ description, it may overemphasize variety: "Expository prose is clearly, consistently and explicitly organized. The writer employs a variety of organizational patterns, uses a wide variety of cohesive devices such as ellipsis and parallelisms, and subordination in a variety of ways." Organization is not mentioned at levels higher than 4. The lower-level descriptions provide a fairly good idea of organizational limitations. The 0+ description makes it clear that nothing beyond lists and filling out forms can be expected. Level 1 states adequately, "Writing tends to be a loose collection of sentences (or fragments) on a given topic and provides little evidence of conscious organization." The description says the Level 1+ writer "can create sentences and short paragraphs" but "generally cannot use basic cohesive elements of discourse to advantage." The Level 2 description is cryptic ("Uses a limited number of cohesive devices"), and the Level 2+ description does not mention organization.

Methods of development at the essay and paragraph level are also intrinsic to traditional rhetoric. This factor is so fully covered in the descriptions of reading levels (with descriptive and narrative texts found at Level 2 and argumentation at Level 3) that one would expect to find parallel references in the writing descriptions. This is not the case.

On the other hand, the ACTFL guidelines do address discourse functions. "Narratives and descriptions of a factual nature" are mentioned in the Advanced-level guidelines; Advanced-Plus states, "can describe and narrate personal experiences fully but has difficulty supporting points of view in written discourse"; and the Superior-level writer is said to be able to "hypothesize and present arguments or points of view accurately and effectively." The government descriptions would benefit from incorporating this approach, although research may be needed to validate it.

The place of style in evaluating prose has not been completely neglected in the descriptions. There are references to diction in terms of control or precision of vocabulary, which may be the only point that low-proficiency writers can be expected to consider. The Level 2+ description touches on aspects of style, both positive and negative, that may emerge: "Often shows surprising fluency and ease of expression. . . . Normally controls general vocabulary with some misuse of everyday vocabulary evident. Shows a limited ability to use circumlocution. . . . Style is still obviously foreign."



The Level 3 description covers the topic in relatively abstract terms: "Able to use the language effectively in most formal and informal written exchanges on practical, social, and professional topics." Adequacy of vocabulary and the foreign character of style are also mentioned. The Level 3+ description refers to use of "a few prose styles" and notes possible weakness in expressing "subtleties and nuances." The higher-level descriptions continue at the same level of abstraction to suggest graduated control. For example, the Level 4 description states: "Able to write the language precisely and accurately in a variety of prose styles," and the Level 4+ description adds a reference to "use of stylistic devices."

Without doubt, the entire writing scale would be improved by analyses of the contribution of diction and tone to style, the role of figurative language, the expressive effect of sentences with various rhetorical patterns, and the writer's awareness of connotations of words. These facets of style distinguish the prose of higher-level writers, although research is clearly needed to determine the level at which sensitivity to connotation emerges or when figurative language can be integrated into a text. While stylistic concerns should not dominate the level descriptions, a more detailed discussion of style would increase their usefulness.

The aspect of traditional rhetorical analysis that is most conspicuously absent from the descriptions is invention or creativity. Nothing is said at the lower levels. The Level 4 description states that writing is "adequate to express all his/her experiences," and Level 5 says, "In addition to being clear, explicit and informative, the writing and ideas are also imaginative." While these brief statements imply a great deal, they do not constitute an adequate analysis of the topic, particularly given that educated adults with writing proficiency below Level 4 have such limited control of rhetorical elements in the second language that they cannot fully express their thoughts. The gradually emerging ability to break the barriers to invention should be traceable with a certain amount of precision at Levels 2 through 5.

Overall, the current level descriptions say too little about organization, methods of development, and invention. Style and a factor that may be called "attention to the audience" or "reader-based writing" are included, but not in useful proportions. Sentence-level matters (grammar, punctuation, spelling, and mechanics) are disproportionately emphasized.

These deficiencies do not make the current scale unusable. The employees whose writing samples are rated against this scale may be required to write correspondence in English for circulation within DLI as well as target-language material for beginning students. Skilled evaluators can probably rate their potential ability to perform these tasks and could do a better job with a more thorough test. Because the current descriptions discuss the quality of correspondence, reports, and job-related writing, they are certainly adequate for DLI's purpose. However, it would be worthwhile to bring them closer to the descriptions of the other skill modalities with a view toward broader application.

Value of an Improved Scale

An improved writing scale would be as important as the improved writing proficiency test discussed earlier. Descriptions more aligned with traditional rhetoric could be used to evaluate candidates for educational programs. Government employees would find it beneficial to cite their English writing proficiency score when applying for graduate school. Universities would probably find a Level 3 or 4 more meaningful if the description provided reference points for organization and methods of development rather than merely for sentence-level skills.

Another potential use of an improved writing scale would be for accrediting translators. While prospective translators should be evaluated in several skills, in their case writing is certainly one of the most important. To be effective, the translator must be able to write competently at the level of the original text. A higher-level reader with appropriate subject matter and cultural knowledge, but without appropriate writing skills, may be able to perform several functions successfully. He or she may be able to prepare reasonably accurate summaries of a text. He or she may be able to prepare answers to specific questions that interest readers. However, unless he or she can replicate the author's presentation, convey subtleties, and address the audience with the tone of the originator, the translator will lose the qualities that make the text unique and thus worth reading. Those selecting translators for particular projects might well appreciate a writing proficiency score related to the aspects of texts that merit their translation.

A Method of Improving the Scale

The level descriptions are not to be rewritten here. The ILR experience has shown the value of developing the descriptions in a committee of interested people who have worked with the concepts extensively and gained insights that discussion will bring out. Judgments should not be hasty, for a great deal of research is needed.

Therefore, the following procedure is recommended. For several years, the CIA Language School and DLI have used a Speaking Performance Profile as part of the training for the oral interview; the Government Language School has integrated this aid into the scoring of interviews (see Lowe, 1982). The profile isolates six factors, all related to the speaking descriptions (pronunciation, fluency, sociolinguistic/cultural information, grammar, vocabulary, and tasks) and charts each on a 0-to-5 scale with a brief shorthand description. (A copy is reproduced in the appendix.) The profile complements but neither duplicates nor substitutes for the full descriptions.5

An analogous profile should be constructed for writing, stating the performance characteristic of each level for six factors: organization, methods of development, style, invention, reader focus, and sentence-level skills. The initial statements would be tentative, based on DLI's empirical experience and collective judgment. DLI, and perhaps other government agencies, could use the profile together with the new writing proficiency test until enough samples and proficiency data are accumulated to test the hypothesis of the profile. Trial, observation, and concurrent research could then provide a basis for refining the profile and revising the level descriptions.
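To visualize how such a profile might be held as a record, the sketch below pairs one tentative rating per factor with a separately assigned global level. The factor names come from the text; the class itself, the reuse of the 11-point scale labels for each factor, and the validation step are illustrative assumptions rather than an existing instrument. Keeping the global rating as its own field mirrors the note that the profile factors are not totaled to yield a score.

    # Minimal sketch of a Writing Performance Profile record; assumes the six
    # factors named above and the 11-point ILR scale. Illustrative only.
    from dataclasses import dataclass

    ILR_SCALE = ["0", "0+", "1", "1+", "2", "2+", "3", "3+", "4", "4+", "5"]
    FACTORS = ("organization", "methods_of_development", "style",
               "invention", "reader_focus", "sentence_level_skills")

    @dataclass
    class WritingProfile:
        factor_ratings: dict   # factor name -> scale point, e.g. {"organization": "2+"}
        global_rating: str     # assigned holistically; not totaled from the factors

        def validate(self):
            """Check that every factor is present and every rating is a scale point."""
            for factor in FACTORS:
                if self.factor_ratings.get(factor) not in ILR_SCALE:
                    raise ValueError(f"{factor} needs a rating from the 0-to-5 scale")
            if self.global_rating not in ILR_SCALE:
                raise ValueError("global rating must be a point on the 0-to-5 scale")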

If a profile based on traditional rhetorical analysis seems a step backward to those interested in writing as process (see Osterholm, 1986), it can be argued that the tradition is useful and is more compatible with proficiency than the strictly sentence-level concerns that pervade the current descriptions. Nevertheless, the current proposal appears to emphasize writing-as-product more strongly than ever in the attention given to completed texts. Research into both areas is needed, and the results should be mutually beneficial. It is particularly important to determine the effect on a writer's proficiency when he or she encounters varying topics, under varying conditions, and whether tests with optional equivalent topics provide better opportunities to demonstrate writing skills (see Coffman, 1971). At some point, DLI evaluation must focus on the product created by the examinee. However, research into process should help in developing the best test to elicit those products. Traditional rhetoric should provide a better model for evaluation by putting all the factors that go into writing proficiency into proper proportion.



Nature of an Improved Scale

To show how the profile might look, Figures 5.2 and 5.3 show preliminary statements for two factors: organization and methods of development. These first drafts, based entirely on the author's experience, are intended to begin discussion. Colleagues in the government, and those in the academic community who have applied the ACTFL writing guidelines, may bring forward quite different observations to help clarify the content and condense the statements about these 11-point analyses.

0 No functional ability.

0+ Can produce lists and fill in blanks appropriately.

1 Writing tends to be a loose collection of sentences (or fragments) on a given topic and provides little evidence of conscious organization.

1+ Can produce short paragraphs on limited topics (e.g., survival and social needs). Generally cannot use basic cohesive elements of discourse to advantage.

2 Can produce a series of paragraphs on limited topics. Uses a limited number of cohesive devices and a kind of ordering of major points. However, lack of unity and proportion may confuse the reader. Relationship of ideas may not always be clear. There are few meaningful transitions.

2+ Can produce a series of paragraphs that form a text; narratives and descriptions are usually organized adequately. Relationship of ideas is generally clear; there is evidence of effort to keep parts in proportion. Uses cohesive devices with some success. However, clarity, unity, and coherence may be flawed in any text.

3 Narratives and descriptions are well organized; expository prose is usually organized adequately. Ordering of major points fits purpose of the text. Material is presented coherently. Relationship of ideas is clear. Transitions are usually successful.

Figure 5.2. Profile of Writing Organization



3+ Usually able to fit the type of organization to subject matter and purpose. Narration, description, and exposition are always well organized. Can usually present arguments clearly and coherently. Texts are normally unified. Transitions are nearly always successful.

4 Can regularly organize various types of narrative, description, exposition, and argumentation appropriately. Even complex ideas and unique points of view can be ordered clearly. Transitions effectively aid understanding.

4+ Good ability to organize all types of writing according to needs of subject matter and purpose. Clarity is created through unity and coherence. Parts are always in proportion to whole. Transitions are skillful.

5 Strong ability to organize all types of writing according to needs of subject matter and purpose. Clarity is created through unity and coherence even for complex and unusual subjects. There is a strong effect of correct proportion. Transitions are skillful and well integrated into text. A degree of originality is often present in organizational techniques.

Figure 5.2. (cont.)

Despite attempts to keep the criteria broad, this preliminary version is biased toward English prose organization. A similar analysis of methods of development is even more obviously restricted to the principles that apply to English paragraphs and essays (see Carrell & Eisterhold, 1983). Considerable discussion and research will be needed to determine whether these conclusions have any application to other languages or whether more language-general statements can be framed.

The statements on methods of development are also briefer and more tentative than those on organization. A proficiency level has been assigned for the lowest point at which some practicable use may be made of a method; naturally, any method can be used with greater effectiveness at a higher level. Again, it should be clear that research is needed to determine the order of difficulty, to set cut-off levels for second-language writers' minimal control of a method, and to learn more about the language-specific issues. Because the subject of style is uncharted territory and far more complex than organization and methods of development, it


0 No functional ability.

0+ Can produce lists (on familiar topics).

1 Can produce statements (on familiar topics).

1+ Can produce simple time-ordered narration of anecdotes or incidents; simple descriptions of common objects, rooms, places.

2 Can use elementary classification (e.g., subjects in school, shops in mall); logical definitions.

2+ Can use exposition of process (e.g., how to change a tire or bake bread); more complex narration, including attention to narrator's perspective and conflict; example and illustration; comparison and contrast; cause and effect; narration that includes a sketch or profile; description that includes a variety of sensory impressions.

3 Can produce analysis that goes beyond classification, process, or comparison to present and support an opinion; arguments to support an opinion; elaborated descriptions (for example, showing group interaction).

3+ Can produce an extended definition (for purposes of persuasion); argumentation that is logically and rhetorically elaborated, combining many of the previously mentioned methods of development; persuasion involving ethical or philosophical arguments.

4 Can write professional or educational essays that successfully combine factual reporting or documentation with supported opinion, analysis, and other techniques. (Success implies ensuring that the reader can readily discern the difference between fact and opinion. Discourse development will be highly integrated with style and organization.)

4+ Controls full range of methods of development available to the educated native writer, including conventions applied in a specific field (e.g., law, medicine, science, military, literature, etc.).

Figure 5.3. Profile of Methods of Writing Development

would be premature even to begin a profile until rated writing samples are gathered for research. Such samples should be analyzed from at least four perspectives.



Diction. The choice of both concrete and abstract words should be analyzed to determine the writer's understanding of their denotative and connotative value. Starting at about Level 2+, the writer should show awareness of connotations. The use of technical vocabulary at Level 3 and above should also be examined.

Figurative Language. Considerable work needs to be done to learn how much use and control can be expected at Level 3 and above, assuming that little figurative language appears below that level. As a preliminary hypothesis, the order of usage may be as follows: imagery, simile, metaphor, allusion, symbol, and trope.

Tone. Attitudes toward the subject and toward the reader should begin to be controlled at Level 3 and above. Irony, understatement, overstatement, hyperbole, and humor should be available in the repertoire of writers at Level 3+ and above. Since maladroit writing must also be rated, some attention should be given to the place of circumlocution, euphemism, and cliché in setting the tone.

Rhetorical Patterns of Sentences. The rhythm of prose for both formal and informal style should be considered at Level 3 and above.

In the attempts to add invention to the profile and, hence, to the level descriptions, analysis of writing samples will be needed. Perhaps, also, current studies of writing-as-process will provide useful information on the acquisition of this factor. As noted earlier, the current level descriptions include specific statements about reader focus and sentence-level skills; it should not be difficult to construct preliminary scales for these factors. There is no question that profiles would have to be tried and revised as lessons are learned from experience and research.


The Need for Research

This chapter concludes with a list of research topics that, ideally, would accompany the construction and trial of writing proficiency tests and collection of rated samples. Development of a Writing Performance Profile should occur concurrently, and eventually the completed profile, the test results, and research conclusions could all be brought together in a project to rewrite the level descriptions. Research needed to support this effort includes the following:

1. Determination of the validity and reliability of any essay examination as a performance test of writing proficiency, given time restrictions, the limited opportunity to revise and polish, and the relatively small number of topics offered to sample the skill modality. Determination must also be made of the external criterion for such a performance test and how to use it for validation (see Coffman, 1971).

2. Determination of the validity and reliability of the specific type of sampling proposed in this chapter and by Magnan (1985).

3. Determination of inter- and intrarater reliability of scoring such tests according to the ILR level descriptions and the ACTFL guidelines. The fundamental problems have been covered by Coffman (1971). Because of the inadequacies of the current DLI topic for evaluating the full 0-5 range, reliability statistics have not been collected for the scoring of writing. Informal observation suggests no greater need to use third raters for scoring essays than for speaking samples; interrater reliability for speaking tests has traditionally exceeded .80 at all government agencies.

4. Full-scale examination of contrastive rhetoric, as suggested by Kaplan (1972), to learn both about possible differences across languages and about the implications such differences would have for essay test design, topics, scoring criteria, and instructions to examinees. If results suggest extensive differences, study and discussion would necessarily follow to assure that new level descriptions avoid bias toward more commonly taught languages.

5. Comparison of the proposed Writing Performance Profile factors to global ratings when evaluating samples.

6. Although the inherent philosophy of the level descriptions indicates a global score must be assigned, some consideration should be given to the inclusion of part scores for the proposed tests.

7. Examination of the role of writing proficiency as part of a translator's skill.

8. Research examining the process of writing by both natives and nonnatives in a variety of professional disciplines should be monitored for relevance to revised level descriptions.

9. Examination of native-writer norms. Two types of good writing can be postulated. One type is produced by artists or essayists whose imagination and originality distinguish them as more creative than their peers. The other type is produced by the educated writer who has something to say, organizes and states it well, but displays no real creative genius. Should only the former be eligible for a 4+ or 5 rating? The results of all research on writing must be used to determine what constitutes successful writing by natives so that tests do not demand more from nonnatives.

NOTES

1. The Level 2 descriptions included the statement, "Material normally requires editing by a more proficient writer"; Level 3, "All formal writing needs to be edited by an educated native"; Level 4, "Errors are rare and do not interfere with understanding. Nevertheless, drafts of official correspondence and documents need to be edited by an educated native."

2. Without associating them with any shortcomings found in the prototype test, the author wishes to acknowledge the assistance of Anne Wright, Robert Kluender, and Carl Erickson. All were on the DLI staff at that time.


3. Magnan's test made it clear that evaluative information was an important addition to the instructions. The higher-level test was amended accordingly; all wording is derived from Magnan's test.

4. The addition of plus-level descriptions permitted more attention to the gradation of skills. For example, the current 0+ description appropriately contains the statement, formerly found at Level 1: "Can write numbers and dates, own name, nationality, address, etc." References to the need for an editor's assistance at Level 4 and below have been removed, although the ability to edit texts is noted at Levels 4+ and 5.

5. The rating is global; the profile factors are not totaled to yield a score.

References

Adams, M.L., & Frith, J. (Eds.). (1979). Testing kit. Arlington, VA: Foreign Service Institute.

Brooks, C., & Warren, R.P. (1972). Modern rhetoric. New York: Harcourt Brace Jovanovich.

Carrell, P.L. (1982). Cohesion is not coherence. TESOL Quarterly, 16, 479-488.

Carrell, P.L., & Eisterhold, J.C. (1983). Schema theory and ESL reading pedagogy. TESOL Quarterly, 17, 553-573.

Coffman, W.E. (1971). Essay examinations. In R.L. Thorndike (Ed.), Educational measurement. Washington, DC: American Council on Education.

Dvorak, T. (1986). Writing in the foreign language. In B.H. Wing (Ed.), Listening, reading, writing: Analysis and application (Reports of the Northeast Conference on the Teaching of Foreign Languages, pp. 145-167). Middlebury, VT: Northeast Conference.

Gregg, J. (1986). Comments on Bernard A. Mohan and Winnie Au-Yeung Lo's "Academic writing and Chinese students: Transfer and developmental factors." A reader reacts. TESOL Quarterly, 20, 354-358.

Kaplan, R.B. (1966). Cultural thought patterns in intercultural education. Language Learning, 16, 1-20.

Kaplan, R.B. (1972). The anatomy of rhetoric: Prolegomena to a functional theory of rhetoric. Philadelphia, PA: Center for Curriculum Development.

Lowe, P., Jr. (1978). Third rating of FSI interviews. In J.L.D. Clark (Ed.), Direct testing of speaking proficiency: Theory and application (pp. 159-105). Princeton, NJ: Educational Testing Service.

Lowe, P., Jr. (1982). ILR handbook on oral interview testing (DLI/LS Joint Oral Interview Project). Arlington, VA: Foreign Service Institute Language School.

Magnan, S.S. (1985). Teaching and testing proficiency in writing: Skills to transcend the second-language classroom. In A.C. Omaggio (Ed.), Proficiency, curriculum, articulation: The ties that bind (Reports of the Northeast Conference on the Teaching of Foreign Languages, pp. 109-136). Middlebury, VT: Northeast Conference.

Mohan, B.A., & Lo, W.A. (1985). Academic writing and Chinese students: Transfer and developmental factors. TESOL Quarterly, 19, 515-534.

Mohan, B.A. (1986a). On evidence for cross-cultural rhetoric. TESOL Quarterly, 20, 358-361.



Mohan, B.A. (1986b). On hypotheses in cross-cultural rhetoric research. TESOL Quarterly, 20, 569-573.

Osterholm, K. (1986). Writing in the native language. In B.H. Wing (Ed.), Listening, reading, writing: Analysis and application (Reports of the Northeast Conference on the Teaching of Foreign Languages, pp. 117-143). Middlebury, VT: Northeast Conference.

Ricento, T. (1986). Comments on Bernard A. Mohan and Winnie Au-Yeung Lo's "Academic writing and Chinese students: Transfer and developmental factors." TESOL Quarterly, 20, 565-568.

Sollenberger, H.E. (1978). Development and current use of the FSI oral interview test. In J.L.D. Clark (Ed.), Direct testing of speaking proficiency: Theory and application (pp. 1-12). Princeton, NJ: Educational Testing Service.

Weaver, R.M. (1967). A rhetoric and handbook. New York: Holt, Rinehart and Winston.




Appendix: Speaking Performance Profile

[The appendix chart reproduces the Speaking Performance Profile. It rates six factors, each on a 0-to-5 scale with brief shorthand descriptors: Pronunciation (P), Fluency/Integrative (FL), Sociolinguistic/Culture (SC), Grammar (G), Vocabulary (V), and Tasks (T). Key: ENS = educated native speaker; NS = native speaker; NL = native language; the remaining key entry refers to the target language. The shorthand descriptors themselves are not legible in this reproduction.]

Section 2: The Academic Context

by Anne Katz, University of California at Berkeley

Until recently, writing has been the neglected sister of the four language skill areas customarily taught in second and foreign language classes: listening, speaking, reading, and writing. During the long period when a traditional grammar approach dominated language teaching, writing was viewed as a technique to practice, and thus learn, grammar and the rules of correct usage. Tasks included copying words and sentences, dictation, and translating. The aim of this approach to writing was correctness of linguistic form, usually at the sentence level.

The postwar advent of the audio-lingual approach brought a major change in focus for many language programs, from a formal undertaking to a practical means of teaching oral communication. This new approach emphasized oral skills, with students practicing and learning the patterns of spoken language. Techniques for teaching writing remained essentially the same, however. Tasks still included copying, writing from dictation, and practicing word and sentence patterns. The aim of these tasks was to reinforce language patterns introduced in oral practice.

The advent of the proficiency movement, along with a resurgence of interest in developing students' writing skills, has opened up the possibility of viewing writing in the foreign language classroom in a new light (Magnan, 1985). With a focus on communicative approaches to teaching language, teachers have become more aware than ever of the importance of gauging students' needs in designing appropriate language teaching curricula (Savignon, 1983). Writing research has pointed out the cognitive benefits accruing from using writing to discover and explore meaning in both language classes and content classes (Applebee, 1984).

Taking heed of this new perspective, some second and foreign language specialists interested in teaching writing have leaned heavily on the first language (L1) writing literature to provide both theory and techniques for helping student writers (for a sampling from both first and second language writing specialists, see Dvorak, 1986; Hughey, Wormuth, Hartfiel, & Jacobs, 1983; Magnan, 1985; Osterholm, 1986; and Zamel, 1976, 1982, 1985). Drawing from this literature, second and foreign language writing specialists have redefined the province of language teachers, encouraging them to approach the teaching of writing as process rather than as product, from the composition side rather than from the grammar side. Thus, writing tasks are seen as ways to use language to communicate meaning rather than as ways to practice or test writing skills.

Uses of Writing in the Foreign Language Classroom

While descriptions of this new approach to teaching writing are surfacing in the pedagogical literature, a few studies of the use of proficiency procedures in the academic classroom are also emerging. The news, for those interested in writing, offers mixed results.

In a preliminary analysis of the effects of the proficiency-based foreign language requirement instituted at the University of Pennsylvania, Freed (1987) points out that the program is designed to help students "acquire functionally useful communicative abilities as well as structural accuracy in the basic four skills" (p. 140). To satisfy the foreign language requirement, students must pass a proficiency test that includes exams of functional skills in oral interaction, listening comprehension, and reading comprehension, as well as in writing.

This focus on multiskill proficiency has led to change in the curriculum, to what Freed describes as "more creative teaching and more emphasis on extended use of the language in all four skills" (p. 145). For teaching writing, this has meant "more varied and practical types of writing assignments, and a decreased emphasis on formal grammatical manipulation" (p. 142).

On the other hand, the proficiency movement's effect on other classrooms has produced somewhat different results. In describing an alternative first-year sequence in French, German, and Spanish, Clausen (1986) outlines a program that focuses on listening comprehension, culture, and speaking in everyday situations.

This innovative program is intended as an alternative to standard, grammar-oriented first-year classes. Clausen explains that her department is still working out how to implement a proficiency-oriented curriculum. While oral testing has been introduced, changes involving the other skill areas are "more sporadic and vary greatly from one instructor to another" (p. 35).

Although more uniform enthusiasm for including writing in the curriculum might be expected or desired, it is understandable that writing continues to play a secondary role. For while the proficiency orientation has provided new ways of thinking about language and has developed guidelines for all four skill areas, the emphasis in the classroom continues to be on oral fluency. As Dvorak (1986) notes, writing in the foreign language classroom is still widely regarded as "speech from a pencil" (p. 147).

One reason for the continued emphasis on oral fluency certainly is tied to the most recent history of language teaching methodology, as discussed earlier. The focus on proficiency and new ways of thinking about writing are relatively recent developments. Classroom teachers have been trained according to principles embodied in previous approaches to language teaching. In reviewing modern methodologists of the late 1970s and early 1980s, Dvorak reports that their discussions of "writing" focus specifically on teaching the conventions of language form. For the most part, then, the development of "writing" skill is narrowly defined in terms of the development of language.

Dvorak also points out that even the textbooks students encounter in class continue to support this concept of writing. Written assignments are linked to conversation practice or advanced grammar lessons. In this way, writing can be seen essentially as transcription, as a means to extend the lesson or to vary the day's activities. "Advanced composition" is defined as either free composition, translation, or some combination of the two. Writing is evaluated according to the frequency and gravity of error in the student's product. From this perspective, the development of writing skill entails increased fluency and accuracy in the target language but says nothing about how well the learner has managed to deal with the content or with the notions of audience or purpose contained in the task.

While part of the difficulty in introducing new ways of thinking about writing into the foreign language classroom may stem from the traditional ways in which foreign language has been taught, it may also have something to do with the very nature of foreign language requirements. Few schools require extensive foreign language study. One year, at most two, will satisfy most university requirements. Freed (1987) notes, for example, that while the proficiency test is given to fourth-semester language students, students who wish to take it during the third semester may do so. Given the limited amount of time allotted for instruction, teachers and program coordinators generally give writing short shrift.

Writing as a means of conveying meaning is not normally a part of the curriculum except in upper-division or graduate-level courses in literature. Even then, students are assumed to have mastered in other classes the necessary prerequisite skills of organizing and developing their ideas.

It is not hard to understand the rationale behind the role of writing in the foreign language classroom. One important factor concerns the generally perceived nature of language study. One learns a foreign language in order to speak it, not write it. Thus, when students enter the classroom, expecting to develop their communicative abilities, they are primarily expecting to learn to speak with other users of the foreign language. The strength of students' perceived needs is an important factor that is given a great deal of attention both in second language acquisition research (Ellis, 1985) and in syllabus design (Richards, 1984).

Writing is also perceived as the most difficult of the language skills, both by teachers and by students. To begin with, students often have difficulty learning how to transcribe units of meaning in the new language, and how to untangle a new system of sound-symbol correspondences. And while it may be difficult to detect whether inflectional endings are present in the flow of speech, there is no doubt as to their presence, or absence, when students begin writing a dictation or an essay. When writing is used in the classroom to extend the grammar lesson and vary the range of activities students engage in, and is evaluated in terms of accuracy and fluency instead of in terms of a process through which to capture meaning, these perceptions of difficulty are reinforced.

The Role of Proficiency Testing in the Foreign Language Classroom

It is not unknown in education for both teachers and students to work toward the test that will be used to evaluate the efforts of each (Purnell, 1982). Thus, a key role that a test of writing proficiency may play in foreign language classrooms is that of spurring curricular innovation. It is crucial, then, to weigh the direction that a test of writing proficiency will take. Given the influence of the L1 writing literature on the emerging theory and practice of second and foreign language writing classrooms, it makes sense to tap into this rich vein for insights useful for shaping the direction needed in using the ACTFL guidelines for assessing writing proficiency.

Assessment of Writing Proficiency in English

Large-scale assessment of writing has emerged as a major trend in the field of writing today (Wolcott, 1987). While writing teachers and researchers explore and describe developing "processes" of students, it is students' "products" that continue to serve as indicators of their competence at critical points along the way. High schools, colleges, and universities require such information to make placement decisions and to determine more realistic levels of foreign language skills as students emerge from language programs to take their places in government, business, and industry (Omaggio, 1983). These data also serve to inform judgments about the worth of particular programs and to aid in making budget decisions.

Most of the literature on writing assessment revolves around assessing writing proficiency in English as a first language. Odell (1981) suggests the first step in assessing writing is to determine what is meant by competence in writing. He points out that such a notion must encompass a range of skills, including lexical, syntactic, and creative fluencies, discourse skills, and appreciational skills. His notion of competence also includes discovering what one wants to say in writing; it requires the writer to come to some conclusion about the topic at hand. This aspect of competence involves the writer in making "appropriate" choices guided by an awareness of audience and purpose. This definition of competence makes it quite clear that different writing tasks require different writing skills, evoking a different balance of options to satisfy the demands of audience and purpose. Such a notion of writing competence entails a vision of writing as communication, not merely as a channel.

It also poses certain difficulties in devising appropriate tasks for assessing that competence. To measure competence, Odell underlines the importance of obtaining an adequate sample of students' writing and choosing an appropriate measure of writing competence.

According to Odell, obtaining an adequate sample entails devising assessment procedures that meet several criteria:

1. Students should be asked to write under conditions that resemble as closely as possible those under which "important" writing is done. To best represent their competence, students need time to "engage in the process of discovery."

2. Students should produce more than one type of writing. The rationale for this criterion is based on evidence that the "ability to do one kind of writing task may not imply equal ability with other kinds of tasks. As well, one kind of writing may not be equally important for all students in all schools in all communities." In addition to producing more than one type, students should be asked to write for more than one audience and for more than one purpose.

3. Information about the audience and the purpose of the piece of writing students produce should be included in the prompts for writing. Also, the prompts might indicate the form of response desired, be it a letter, a summary, an essay, or a journal entry.

4. The demands of the writing tasks assigned should be carefully assessed to determine whether different topics require students to draw on different sources of development or to provide different modes of development.

5. Several pieces of writing (at least two for each kind of writing) should be collected to make a judgment of competency.

Once the samples of writing have been collected, it is necessary to choose an appropriate measure of writing competence. While objective, multiple-choice tests have long been used for determining competence in writing, one of the most common current means of assessing writing competence is holistic evaluation.

Holistic evaluation is a guided procedure for assessing a sample of writing. Cooper (1977) describes the procedure as follows:

The rater takes a piece of writing and either (1) matches it with another piece in a graded series of pieces or (2) scores it for the prominence of certain features important to that kind of writing or (3) assigns it a letter or number grade. The placing, scoring, or grading occurs quickly, impressionistically, after the rater has practiced the procedure with other raters. (p. 3)

The strength of holistic evaluation lies in the apparent validity of its results. For, as Cooper continues:

A piece of writing communicates a whole message with a particular tone to a known audience for some purpose: information, argument, amusement, ridicule, titillation. At present, holistic evaluation by a human respondent gets us closer to what is essential in such a communication than frequency counts do. (p. 3)

It is important, however, to remember that holistic evaluation is a tool, a procedure for assessing a piece of writing according to prescribed criteria. The strength of the claim for validity rests in the set of features selected by the test designers. The features deemed essential by one group of readers may not necessarily be considered essential by another. In a critical overview of holistic scoring, Charney (1984) points out that deciding on the boundaries of categories for the purpose of evaluation or determining the salient features to be included in scoring guides are all "matters of opinion."

Charney's caveat has critical implications for applying the guidelines to create tests of writing, particularly across languages where different features may hold varying degrees of saliency. Native users from different rhetorical traditions may choose contrastive sets of features to characterize "good" writing.

A Specific Example of an L2 Rating Scale for Writing: The TOEFL Test of Written English

Before considering exactly how a model test of foreign language writing proficiency might look, it will be useful to examine an existing test of writing for nonnative users of the target language. One such major test is the new writing section of the Test of English as a Foreign Language (TOEFL), the Test of Written English (TWE).

The first issue facing the developers of a test of writing in English as a second language (ESL) was to redefine the notion of written competence. As Odell (1981) cautions, it is necessary to begin with an understanding of what it is that is being measured. In a discussion of the research framing the development of the TWE, Carlson and Bridgeman (1986) explain how they approached their initial research task from the perspective of "functional communicative competency." Contrasting their approach with the more traditional grammatically based one, they defined a functionally based communicative approach as entailing "the ability to use language to communicate effectively within the specific context in which the communication takes place" (p. 129). The key assumption here is that performance on a task is conditioned by the specific dimensions of that task. To evaluate performance, then, those specific dimensions must be clarified.

Because the TOEFL is used as an admissions instrument for colleges and universities, Carlson and Bridgeman designed a survey to assess the parameters of the tasks students face in an academic setting. Using the results of this survey, they devised an examination centered essentially on two very different types of writing: (a) a description and interpretation of a graph or chart, and (b) comparison and contrast, plus taking a position. Subjects wrote two essays of each type and were allotted 30 minutes to respond to each question.

These were the two comparison-and-contrast topics used in developing the test:

Some people say that exploration of outer space has many advantages; other people feel that it is a waste of money and other resources. Write a brief essay in which you discuss each of these positions. Give one or two advantages and disadvantages of space exploration, and explain which position you support.

Many people enjoy active physical recreation like sports and other forms of exercise. Other people prefer intellectual activities like reading or listening to music. In a brief essay, discuss one or two benefits of physical activities and of intellectual activities. Explain which kind of recreation you think is more valuable to someone your age.

For the first of two topics that required the description and interpretation of a chart or graph, students studied three graphs showing changes in farming in the United States from 1940 to 1980 and followed these instructions:

Suppose that you are writing a report in which you must interpret the three graphs shown above. Write the section of that report in which you discuss how the graphs are related to each other and explain the conclusions you have reached from the information in the graphs. Be sure the graphs support your conclusions.

The next graphic was a pair of charts showing the area and population of the continents, with these instructions:

Suppose you are to write a report in which you interpret these charts. Discuss how the information in the Area chart is related to the information in the Population chart. Explain the conclusions you have reached from the information in the two charts. Be sure the charts support your conclusions.

These sample topics illustrate the kind of writing prompts TWE has continued to use since its first administration in July 1986. Overall, Carlson and Bridgeman found that with careful topic selection and adequate training of raters, they could reliably evaluate the writing performance of ESL students.

To score the essays produced in response to these prompts, Stansfield (1986) directed the development of a six-point scoring guide for use by raters trained in holistic evaluation of writing. The scoring guide is given in Figure 5.5.

While the defining criteria of each level cut clearly across the full range of writing performance, a range of ability remains within each of the six levels. Thus, for example, there are "high" Level 4s, papers that strain the upper boundary of minimal competence, as well as "low" Level 4s, papers that barely constitute the same designation.


6 Clearly demonstrates competence in writing on both the rhetorical and syntactic levels, though it may have occasional errors. A paper in this category:

- is well organized and well developed
- effectively addresses the writing task
- uses appropriate details to support a thesis or illustrate ideas
- shows unity, coherence, and progression
- displays consistent facility in the use of language
- demonstrates syntactic variety and appropriate word choice

5 Demonstrates competence in writing on both the rhetorical and syntactic levels, though it will have occasional errors. A paper in this category:

- is generally well organized and well developed, though it may have fewer details than does a 6 paper
- may address some parts of the task more effectively than others
- shows unity, coherence, and progression
- demonstrates some syntactic variety and range of vocabulary
- displays facility in language, though it may have more errors than does a 6 paper

4 Demonstrates minimal competence in writing on both the rhetorical and syntactic levels. A paper in this category:

- is adequately organized
- addresses the writing topic adequately but may slight parts of the task
- uses some details to support a thesis or illustrate ideas
- demonstrates adequate but undistinguished or inconsistent facility with syntax and usage
- may contain some serious errors that occasionally obscure meaning

3 Demonstrates some developing competence in writing, but it remains flawed on either the rhetorical or syntactic level, or both. A paper in this category may reveal one or more of the following weaknesses:

- inadequate organization or development
- failure to support or illustrate generalizations with appropriate or sufficient detail
- an accumulation of errors in sentence structure and/or usage
- a noticeably inappropriate choice of words or word forms

2 Suggests incompetence in writing. A paper in this category is seriously flawed by one or more of the following weaknesses:

- failure to organize or develop
- little or no detail, or irrelevant specifics
- serious and frequent errors in usage or sentence structure
- serious problems with focus

1 Demonstrates incompetence in writing. A paper in this category will contain serious and persistent writing errors, may be illogical or incoherent, or may reveal the writer's inability to comprehend the question. A paper that is severely underdeveloped also falls into this category.

Figure 5.5. The TWE's 6-Point Guidelines

Various factors contribute to the range of ability within levels, including the writer's skill in dealing with the topic or ability in selecting relevant detail to develop main points.

For ESL students, a major issue concerns the balance between rhetorical and syntactic skills. Since the scoring guidelines require testers to consider both rhetorical and syntactic skill when determining scores, essays exhibiting differing levels in each of these two areas could receive the same score because the overall effect of the essays could demonstrate similar levels of writing performance.

Carlson and Bridgeman point out that for native speakers, organization skills tend to parallel mechanical ones. With ESL students, however, greater disparity is to be expected. Carlson and Bridgeman suggest resolving this difficulty by having the readers reach a consensus about how to evaluate such essays. Clearly, the difficulties involved in dealing with a range of ability within levels are resolvable, as Carlson and Bridgeman reported consistently high reliability for the holistic scores (.80-.85, according to Spearman-Brown calculations of the reliability of a score based on the ratings).
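For readers unfamiliar with the procedure, the Spearman-Brown calculation referred to here is the standard psychometric formula for the reliability of a score based on k parallel ratings, given the reliability r of a single rating. The formula itself is standard; the illustrative numbers below are back-calculated for this discussion and are not figures reported by Carlson and Bridgeman:

\[ r_{k} = \frac{k\,r}{1 + (k - 1)\,r} \]

With two raters (k = 2), for instance, single-rating correlations of roughly .67 to .74 would yield composite reliabilities of about .80 to .85, the range cited above.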

The implementation of the TWE shows that it is possible to assess reliably the writing of students using their second or foreign language. It is now time to consider taking advantage of the insights garnered from the L1 writing assessment literature and from the implementation of the TWE to evaluate the proficiency definitions and procedures for testing writing proficiency in a foreign language.

Assessing Foreign Language Proficiency

The first step in assessing writing, according to Odell (1981), is to determine what is meant by competence in writing. The proficiency guidelines determine how this competence will be defined. The four basic levels range from Novice, the ability to produce isolated words and phrases, to Superior, the ability to write formally and informally on practical, social, and professional topics. Within each level, the guidelines describe specific functions, content, and degrees of accuracy appropriate to that level (see Magnan, 1985, for a fuller analysis of the levels organized according to these three areas).

One problem with the guidelines may be the high degree of specificity included for each level. For example, to receive a rating of Intermediate-Mid, a writer must be able to control the syntax of noncomplex sentences and basic inflectional morphology. The guidelines are presented as generic descriptions, yet second and foreign language specialists have collected little evidence to determine how well these descriptions work in assessing proficiency across the broad range of targeted languages (see Rosengrant, 1987, however, for a study of emerging grammatical and functional ability among Russian learners). Given Charney's caveat about the relative nature of categories used for evaluation, the descriptions should be considered tentative in nature. While specific criteria provide a clear focus for evaluators undertaking the task of assessing written proficiency, more cross-cultural research is needed to confirm the selection of features deemed characteristic of each level.

Another issue related to the guidelines' level of specificity is the difficulty of tying specific sentence-level skills to rhetorical forms. It is necessary to consider how proficiency should be weighted in each of these areas. For, as Carlson and Bridgeman point out in their discussion of the TWE, greater disparity is expected in the growth of fluency in these skills among nonnative users of a language. The problem in assessment revolves around deriving a single score from unmatched skill levels.

Another and perhaps more disturbing problem with the guidelines concerns the linear model of development assumed across the levels of emerging proficiency. According to the guidelines, as students progress from Novice to Intermediate and beyond, they add new skills: students who acquire the ability to meet a number of practical writing needs (Intermediate-Mid level), for example, are still assumed to be able to supply information on simple forms and documents (Novice-High level). Within such a model, then, writers accrue skills, building on past successes with written language to forge texts containing newer, more complex linguistic structures and rhetorical functions.

Several difficulties emerge from this view of writing development. A linear model of development focuses attention on products, the texts created by student writers, because under such a model, the evaluation of writing skill limits itself to examining the texts against specific criteria. Yet current writing research and instruction have reconceptualized what writing entails, enlarging the focus of inquiry to examine the cognitive processes writers engage in during the creation of written text. Researchers have suggested that an analysis of changes in written products may not reveal how (or perhaps even whether) writers have changed the way they go about producing text (Bereiter, 1980; Shuy, 1981). With this new focus on not only what student writers produce but also how they go about producing it, writing specialists are urging curricular and assessment innovations that take into account the process of writing. If writing proficiency is envisioned to include these complex cognitive processes, then the question becomes how an assessment model that ostensibly focuses on products will take into consideration the underlying processes used to produce those texts.

Once the scope of the notion of writing is enlarged to include the variety of factors involved in the writing process (the demands of topic, audience, and purpose, for example), additional difficulties arise with a linear model of writing growth. In their study of indices of growth in writing, Freedman and Pringle (1980) examined the relationship between growth in rhetorical control and growth in the writer's capacity for abstraction to higher levels. They found that as students tackled more difficult topics or attempted more sophisticated lines of argument, their written texts did not appear as proficient as when they dealt with less challenging tasks. They argued that when students take on a more cognitively difficult task, they tax their rhetorical skills, with the result that their texts appear less skillful rhetorically and grammatically. "It is," they conclude, "quite simply more difficult to write when the task is more intellectually taxing" (p. 322).

Part of the difficulty in responding to more challenging tasks may have to do with the writer's level of knowledge about the topic. In discussing how writers interpreted writing tasks, Ruth and Murphy (1984) presented a complex model of how different background knowledge results in different interpretations of the task. For a striking example of the ambiguity residing in the interaction between a question and its receiver, they drew on the Interviewer's Manual from the Michigan Survey Research Center. When an interviewer asked, "Do you think the government should control profits or not?" the respondent answered, "Certainly not. Only Heaven should control prophets." Ruth and Murphy argued that such examples illustrate the illusion created by assuming that questioners and respondents, as well as test designers, takers, and raters, have a "common linguistic and social frame of reference." Given such potential complexity of interpretation, a model of writing assessment, they suggest, should accommodate a range of possible responses.

While more discussion and research (for example, see Herzog's suggestions, this volume) will lead to fuller specifications of competence in the definitions to be used in evaluating the skills students develop, it is important now to consider the forms that evaluation can take in the classroom. As was argued earlier, it is often the forms of assessment that drive, or at least reinforce, the foci of instruction in the classroom. AEI testing procedures must be devised that take into account the curricular innovations that foreign language writing specialists have suggested and that satisfy the requirements of programmatic assessment.

The enormous range of the AEI guidelines precludes extensive discussion of all the various forms of assessment needed to determine foreign language writing proficiency. In the academic community, however, the skills attained in an intermediate-level course are often designated as the criterion level for satisfying foreign language requirements. Given the kinds of writing described for the Intermediate-High through Advanced levels, the following discussion provides some suggestions for testing writing proficiency.

Holistic Evaluation

Explicit in the guidelines themselves is a holistic criterion. Lowe (1986) describes one of the guidelines' fixed characteristics as "expression of ability in a global rating" (p. 394). Thus, even though each level of the guidelines describes specific, required abilities, evaluation requires raters to take all of these "parts" and blend them into a "whole" score. This key characteristic fits in nicely with the tenets of holistic evaluation described earlier.

A characteristic of holistic evaluation that has been described as a limitation may turn out to be a strength in the context of proficiency assessment. To achieve reliability, raters often need to interpret and negotiate the criteria used for scoring (a procedure described, for example, by Carlson & Bridgeman, 1986). Sample papers are used to flesh out the guidelines and provide a basis for such negotiation of meaning. While this discussion may be said to create an artificial aura of reliability (Charney, 1984), it also creates a community of readers with shared criteria and, thus, acknowledges the interaction that occurs between text and reader. For in evaluating a sample of text, raters draw not only on the surface features of the sample of writing, but also on their implicit notions about textual conventions. Discussions about samples of writing evince these implicit notions, allowing raters to discard or retain them as the training sessions reach a consensus.

Because the guidelines have sparked a variety of responses from foreign language practitioners who believe they do not address the full range of ability found in foreign language classrooms, the negotiation inherent in implementing the guidelines may provide a way for disparate viewpoints to use the scale. One example of a successful melding of two viewpoints through implementing the guidelines is reported by Hoffman and James (1986). They describe the ACTFL proficiency rating system as a "splendid tool" for integrating the foreign language and literature foci in their department. By designing a range of assignments in a literary context, they found that they could set intellectually rigorous literary tasks for all students, notwithstanding their varying foreign language proficiencies. Thus, they found that assessment was not limited to language or literary criteria alone, but combined both foci.

Given the leeway for negotiation, it would make sense, then, to devise assessment procedures using a holistic form of evaluation. Holistic evaluation, however, is a tool when used within an assessment plan. In the TWE, for example, holistic evaluation is used to assess one sample of writing from each candidate. This makes sense given that the TWE is designed as a test of writing performance within predetermined types of writing. Generally, for reasons of financial economy, holistic evaluation in large assessment situations has been used to assess only one piece of writing per writer.

Yet limiting evaluation to one sample violates many of the tenets of "good" assessment discussed in the earlier review of L1 writing assessment. Lloyd-Jones (1982) goes so far as to state, "A writing sample is not real writing" (p. 3). Obtaining only one sample would also seem to violate the intent of the guidelines themselves. The guidelines are written as descriptions of proficiency. As such, they describe what the language learner is "able" to do at each level. Given what is known about variability in writing resulting from differences in type of writing, perceived audience, wording of topic, and time allowed for writing, it would be difficult to assess "ability" from one piece of writing. Thus assessment should not focus on a single product of specific conditions, but rather must include a variety of tasks, situations, types, and audiences (see Herzog, this volume). To assess learners' writing ability adequately requires a different approach to AEI assessment.

Portfolio Assessment

Portfolio assessment is a method by which students enrolled in a writing course produce a collection or "portfolio" of pieces of writing to be evaluated by a teacher other than the classroom teacher. The portfolios contain several types of writing and include both impromptu and revised papers (Elbow, 1986). Generally, the writing is evaluated holistically, although some pieces may be evaluated using other scoring methods.
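To make the mechanics of such a procedure concrete, the sketch below shows one way a portfolio and its ratings might be represented and collapsed into a single holistic result. It is illustrative only: the names (PortfolioPiece, score_portfolio), the 1-6 scale, and the simple averaging rule are assumptions made for the example, not part of Elbow's proposal or of any published scoring procedure.

# Illustrative sketch only: a minimal data structure for portfolio-based
# holistic scoring, under the assumptions stated above.
from dataclasses import dataclass
from statistics import mean

@dataclass
class PortfolioPiece:
    mode: str          # e.g., "summary", "letter", "essay"
    impromptu: bool    # True for timed, unrevised writing
    ratings: list      # independent holistic ratings on an assumed 1-6 scale

def score_portfolio(pieces):
    # Average each piece's ratings, then average across pieces,
    # so that no single type of writing determines the result.
    if not pieces:
        raise ValueError("A portfolio must contain at least one piece.")
    return mean(mean(p.ratings) for p in pieces)

portfolio = [
    PortfolioPiece(mode="summary", impromptu=True, ratings=[4, 5]),
    PortfolioPiece(mode="letter", impromptu=False, ratings=[5, 5]),
    PortfolioPiece(mode="essay", impromptu=False, ratings=[3, 4]),
]
print(round(score_portfolio(portfolio), 2))  # prints 4.33

In practice, a program might weight revised pieces differently from impromptu ones, or report a profile of scores rather than a single number; the point here is only that a portfolio yields multiple data points per writer.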

The key advantage of this method is that it allows raters to evaluate a variety of types of writing. The portfolio may contain samples of different modes of discourse. Pieces of writing may be designed for different purposes and different audiences. This advantage ties in with Odell's argument that to evaluate students' ability, "we need to evaluate several different kinds of writing performance" (p. 115). Van Oorsouw (1986) points out that in addition to allowing a fairer assessment of students' ability, "it also offers the evaluators the opportunity to learn much more about their students' writing than other methods offer" (p. 20).

This method also allows students to develop a more realistic sense of who is assessing the writing and what the criteria are for those judgments. Hoetker (1982) suggests that while elaborate fictional topics may work well in teaching writing, they do not serve student writers as well in assessment situations. Rather, he suggests that a set of instructions describing the actual rhetorical context might provide students with useful information. Portfolio assessment provides such rhetorical context.

This method also draws the evaluators together, leading to discussion about criteria and what goes on in the classroom. As in the description of holistic scoring, portfolio assessment engenders a sense of community among the evaluators as they determine the standards to be implemented.

Finally, this method allows students to draw on the skills they learn in process-centered classrooms. Students are evaluated on pieces of writing that they can plan and revise, and that take time to produce. The method works in both directions, for the criteria used to evaluate the writing will be brought back to the classroom.

While the benefits of portfolio assessment are great, there are costs. Obviously, this method of evaluation involves considerable planning by both teachers and administrators. Because more papers must be processed for evaluation, the overall cost will be greater. In addition, the method works best when it is part of an academic course. Large-scale assessment for placement purposes, for example, would be difficult to organize and carry out. Despite the difficulties, portfolio assessment offers useful possibilities for evaluating AEI foreign language proficiency.

To determine future directions for both defining proficiency levels and testing them, further research is needed on how students develop proficiency in foreign languages. How closely are students' writing skills tied to reading skills? Do students progress in similar ways across different cultures? Do different languages and cultures place the same value on various types of discourse as English? The answers to these questions would provide needed direction for further changes in the proficiency guidelines and may prompt new thinking about writing and its processes.

References

Applebee, A.N. (1984). Contexts for learning to write: Studies of secondary school instruction. Norwood, NJ: Ablex.

Bereiter, C. (1980). Development in writing. In L.W. Gregg & E.R. Steinberg (Eds.), Cognitive processes in writing. Hillsdale, NJ: Erlbaum.

Carlson, S.B., & Bridgeman, B. (1986). Testing ESL student writers. In K.L. Greenberg, H.S. Wiener, & R.A. Donovan (Eds.), Writing assessment: Issues and strategies (pp. 126-152). New York: Longman.

Charney, D. (1984). The validity of using holistic scoring to evaluate writing: A critical overview. Research in the Teaching of English, 18, 65-81.

Clausen, J. (1986). Warming up to proficiency: A project in process. ADFL Bulletin, 18, 34-37.

Cooper, C.R. (1977). Holistic evaluation of writing. In C.R. Cooper & L. Odell (Eds.), Evaluating writing: Describing, measuring, judging (pp. 3-31). Urbana, IL: National Council of Teachers of English.

Dvorak, T. (1986). Writing in the foreign language. In B.H. Wing (Ed.), Listening, reading, writing: Analysis and application (Reports of the Northeast Conference on the Teaching of Foreign Languages, pp. 145-167). Middlebury, VT: Northeast Conference.

Elbow, P. (1986). Portfolio assessment as an alternative to proficiency testing. Notes from the National Testing Network in Writing (November), 3, 12.

Ellis, R. (1985). Understanding second language acquisition. Oxford, England: Oxford University Press.

Freed, B.F. (1987). Preliminary impressions of the effects of a proficiency-based language requirement. Foreign Language Annals, 20, 139-146.

Freedman, A., & Pringle, I. (1980). Writing in the college years: Some indices of growth. College Composition and Communication, 31, 311-324.

Hoetker, J. (1982). Essay examination topics and students' writing. College Composition and Communication, 33, 377-392.

Hoffman, E.F., & James, D. (1986). Toward the integration of foreign language and literature teaching at all levels of the college curriculum. ADFL Bulletin, 18, 29-33.

Hughey, J.B., Wormuth, D.R., Hartfiel, V.F., & Jacobs, H.L. (1983). Teaching ESL composition: Principles and techniques. Rowley, MA: Newbury House.

Lloyd-Jones, R. (1982). Skepticism about test scores. Notes from the National Testing Network in Writing (October), 3, 9.

Lowe, P., Jr. (1986). Proficiency: Panacea, framework, process? A reply to Kramsch, Schulz, and, particularly, to Bachman and Savignon. Modern Language Journal, 70, 391-397.

Magnan, S.S. (1985). Teaching and testing proficiency in writing: Skills to transcend the second-language classroom. In A.C. Omaggio (Ed.), Proficiency, curriculum, articulation: The ties that bind (Reports of the Northeast Conference on the Teaching of Foreign Languages, pp. 109-136). Middlebury, VT: Northeast Conference.

Odell, L. (1981). Defining and assessing competence in writing. In C.R. Cooper (Ed.), The nature and measurement of competency in English (pp. 95-138). Urbana, IL: National Council of Teachers of English.

Omaggio, A.C. (1983). Proficiency-oriented classroom testing (Language in Education series No. 52). Washington, DC: ERIC Clearinghouse on Languages and Linguistics. (ERIC Document Reproduction Service No. ED 233 589)

Osterholm, K.K. (1986). Writing in the native language. In B.H. Wing (Ed.), Listening, reading, writing: Analysis and application (Reports of the Northeast Conference on the Teaching of Foreign Languages, pp. 117-143). Middlebury, VT: Northeast Conference.

Purnell, R.B. (1982). A survey of the testing of writing proficiency in college: A progress report. College Composition and Communication, 33, 407-410.

Richards, J.C. (1984). The secret life of methods. TESOL Quarterly, 18, 7-23.

Rosengrant, S.F. (1987). Error patterns in written Russian. Modern Language Journal, 71, 138-146.

Ruth, L., & Murphy, S. (1984). Designing topics for writing assessment: Problems of meaning. College Composition and Communication, 35, 410-422.

Savignon, S. (1983). Communicative competence: Theory and classroom practice. New York: Addison-Wesley.

Shuy, R.W. (1981). Toward a development theory of writing. In C.H. Frederiksen & J.F. Dominic (Eds.), Writing: Process, development and communication (pp. 119-132). Hillsdale, NJ: Erlbaum.

Stansfield, C. (1986). A history of the Test of Written English: The developmental year. Language Testing, 3, 224-234.

Van Oorsouw, J. (1986). Assessing writing proficiency: Methods and assumptions. Unpublished manuscript.

Wolcott, W. (1987). Writing instruction and assessment: The need for interplay between process and product. College Composition and Communication, 38, 40-47.

Zamel, V. (1976). Teaching composition in the ESL classroom: What we can learn from research in the teaching of English. TESOL Quarterly, 12, 67-76.

Zamel, V. (1982). Writing: The process of discovering meaning. TESOL Quarterly, 16, 195-209.

Zamel, V. (1985). Responding to student writing. TESOL Quarterly, 19, 79-101.
