
Meeting the Challenges of Curriculum Construction and Change: Revision and Validity Evaluation of a Placement Test

ANASTASIA MOZGALINA
Georgetown University
Department of Linguistics
1421 37th Street NW
Poulton Hall 240
Box 571051
Washington, DC 20057-1051
Email: [email protected]

MARIANNA RYSHINA–PANKOVA
Georgetown University
German Department
Intercultural Center 466
Washington, DC 20057-1048
Email: [email protected]

Achievement of advanced literacy as a goal of foreign language (FL) study within the available amount of time requires that FL departments construct a well-articulated program and optimize student learning at each stage of the curriculum. One essential element of such optimization is the development of assessment procedures to place students into courses that enable successful fostering of their abilities. Ideally, such assessment practices should incorporate aspects of textual literacy, including a well-motivated link between meaning-oriented textual semantics and the required lexicogrammatical features. This article reports on the revision and validity evaluation of a C-test as one component of the placement test in the Georgetown University German program and as an instrument that includes accounting for textual literacy. We begin with the reasons for the test revision and report on the development and evaluation of the new C-test texts, which enabled better alignment with the curriculum and demonstrate a fitting range of reliable distinctions among examinees of broadly differing abilities. The article concludes by highlighting the central role of contextually relevant assessment practices for the success of a program that aims to develop advanced literacy in a FL and the lessons learned throughout the evaluation process.

Keywords: C-test; placement; validity evaluation; curricular alignment

The Modern Language Journal, 99, 2, (2015)
DOI: 10.1111/modl.12217
0026-7902/15/346–370 $1.50/0
© 2015 The Modern Language Journal

AN INFLUENTIAL REFORM PROPOSAL IN the field of foreign language (FL) education, the MLA 2007 report, Foreign Languages and Higher Education, calls for a reconceptualization of the goals of foreign language study that should surpass instrumental proficiency in a foreign language and enable “translingual and transcultural competence” (p. 5). One compelling consequence of the stated educational aspirations is that they demand a carefully considered curriculum that integrates language and content learning in a principled manner over an extended instructional sequence. In such a curriculum, curricular levels are characterized in terms of specific intended content and language learning goals and provide learners with a pathway toward gradual development of advanced forms of literacy. Because achieving that goal is a long-term endeavor, time is of the essence: Students need to start or continue their learning at the level that will both challenge them and enable them to be successful as they reach toward the targeted performance levels. Furthermore, given the crisis in the humanities in general and FL education in particular (see, for example, the MLA 2007 and the MLA 2014 reports), working toward such optimization of learning is no longer an option. While many remedies have been proposed, we focus on one that is rarely invoked, namely a placement test that arises out of the local context and its programmatic progression as expressed in a carefully conceptualized and carefully spelled out curriculum across multiple levels. Aside from administrative benefits, developing such an assessment instrument will render more likely the attainment of a program’s stated learning goals within the time frames that tend to govern collegiate FL programs (e.g., language requirement, minor, major).

In other words, the call for constructing articulated programs put forth by the MLA report has far-reaching implications not only for curricular, materials, and pedagogical development, but also for assessment practices. Among other uses, these practices should include the goal of placing learners into courses at the curricular level that is most appropriate for them in the program they are entering. Worded more pointedly, the longstanding practice whereby FL programs base placement on the results of generic program-external procedures such as language proficiency and placement tests (e.g., AP subject test, CAPE [BYU], Wisconsin test, etc.), or on no more than years of language study, self-assessment, or learner preferences, is fundamentally flawed: It gives little recognition to the fact that teaching and, by extension, learning, arise within the context of a particular program and its underlying goals and approaches. Placement should hence be based on program-specific procedures that are aligned with the local curriculum.

A corollary consequence arising from the MLA 2007 report is this: Inasmuch as advanced forms of literacy are fundamentally about textual literacy, a suitable placement test must seek to incorporate aspects of textual literacy and do so by building on a well-motivated link between meaning-oriented textual semantics and the lexicogrammatical features that realize it. An assessment format that meets that demand is the so-called C-test, developed in 1981 and applied to more than 20 languages, including English, French, German, and Turkish. This productive measure assesses students’ discourse-level comprehension as well as micro-processing abilities, including lexical, syntactic, and morphological features of the language in question, in this case, German (Grotjahn, 1996), and thus presents an appropriate instrument for placing students in a curriculum that targets an advanced form of literacy (Norris, 2006, 2008).

In the content- and language-integrated and literacy-oriented program at the Georgetown University German Department (GUGD), the C-test that was developed and validated in 1999 in conjunction with the department’s major curricular reform initiative (Norris, 2003) can rightly be said to have served the program’s interest in good student placement well. However, as Norris (2013) emphasized, validity evaluation should be seen as a process, not a product. In other words, all assessment practices within a FL program should be subjected to periodic validity evaluation. That demand reflects the cyclical nature of curricular practice whereby changes in a program and its pedagogies lead to changes in assessment, and these changes, in turn, affect pedagogies and program parameters. Pointing to the same issue, Brown (2008) argues that “no curriculum project (least of all testing) can be done systematically without carefully considering the language teaching approaches, syllabuses, and testing methods involved in a particular context” (p. 284). In other words, without alignment to the other important parts of a FL program, the use of any test, but most especially a placement test, is not justified. Because curricular practices change, tests too must change. Indeed, Brown (2008) contends that curriculum developers “must recognize that, in all cases, what they have in hand at the moment is their Current Best Shot and that their testing can always be better” (p. 302).

In the case of the GUGD, the need to change this particular assessment practice became ever more compelling as the curriculum and its learning goals evolved in light of findings from the fields of FL pedagogy, curriculum construction, and second language acquisition. In particular, the curriculum went through an important transformation when it shifted from being primarily content- and task-based to explicitly espousing the integration of content and language via a theoretically well-motivated conceptualization of genre and its translation into level-appropriate pedagogies.

In this article we report on our effort to develop and validate the new C-test. While this project followed the methods recommended by Norris (2003) for the construction and validation of the original GUGD placement test, its value goes beyond following, in craft-like fashion, his guidelines. Rather, we see its value in a demonstration of how the revised C-test (and its newly selected texts) responds to the curricular changes that took place over more than a decade, especially in terms of a more differentiated awareness of the discourse, semantic, and linguistic features that characterize the program’s curricular levels in the transformed practice.

Accordingly, we begin by explicating the curricular changes that motivated development of the new placement test. Our next focus is on the validity evaluation initiated to assure closer alignment of the C-test with these changes. Specifically, we describe the curricular context and the need for revision of the C-test as arising, among other factors, from the department’s sustained placement exam practice itself. We then describe the process of development of a new C-test and how it differs from the earlier version. Two evaluation questions, identified by the intended test users, guide the discussion: (a) to what extent does the test, which was developed by a curriculum-based approach, produce an accurate measurement tool, and (b) does the test enable effective distinctions among learners across all levels of the multi-year program? We conclude with implications of the study for the profession at large.

CHANGES IN CURRICULUM AND LEARNING GOALS: REPOSITIONING ASSESSMENT NEEDS

Curricular Context

The GUGD offers an undergraduate foreign language program that annually serves around 200 to 300 highly motivated undergraduate students, with a majority studying a variety of disciplines, but also including language majors and minors, as well as double majors. The German department created the Developing Multiple Literacies (undergraduate) curriculum (Byrnes, 1998) in order to acknowledge the background as well as the academic and professional aspirations of its students. Among its defining characteristics is the integration of content instruction with language learning from the very beginning and throughout the entire program; in other words, it dismisses the frequent distinction between lower-level language courses and upper-level, literary–cultural content courses.

Table 1 presents the sequential structure of the GUGD curriculum, which consists of five hierarchically related levels. Levels I to III are offered either as nonintensive courses with 3 contact hours per week or as intensive courses with 6 contact hours (in four class meetings); students can complete one level within either one or two 14-week semesters. Expected student learning outcomes for one intensive semester are identical to those for two nonintensive semesters within a given level. Second-semester courses (1.2, 2.2, 3.2) are offered in spring semesters only, whereas intensive courses are offered each semester. The 3-hours-per-week Level IV course Text in Context is the last sequenced course in the hierarchical curriculum. Upon its completion, students may enroll in a variety of nonsequenced courses focused on political, literary, or other content areas depending on their individual goals and interests.

TABLE 1
Sequenced Levels in the GUGD Curriculum

Curricular Level                   | Courses                                                        | Contact Hours | Abbreviation
I. Experiencing the German-speaking world | Introductory German 1                                   | 3             | 1.1
                                   | Introductory German 2                                          | 3             | 1.2
                                   | Intensive Basic German                                         | 6             | 1.Int
II. Contemporary Germany           | Intermediate German 1                                          | 3             | 2.1
                                   | Intermediate German 2                                          | 3             | 2.2
                                   | Intensive Intermediate German                                  | 6             | 2.Int
III. Stories and Histories         | Advanced German 1                                              | 3             | 3.1
                                   | Advanced German 2                                              | 3             | 3.2
                                   | Intensive Advanced German                                      | 6             | 3.Int
IV. Text in Context                | Text in Context                                                | 3             | 4
                                   | Five additional courses (e.g., Issues and Trends)              | 3 each        | 4
V. Domain-specific courses         | Post-sequence courses (in literature, business, history, etc.) | 3 each        | 5

In order to enroll in the program, students with prior German learning experience take an obligatory placement exam. Developed in 1999, it is a very carefully constructed and validated set of instruments and procedures that has served the department well over the years. It consists of four instruments: (a) a Language Profile Survey that seeks information about students’ experience with learning/using German, (b) a C-test, (c) a Listening Comprehension Test (LCT), and (d) a Reading Comprehension Test (RCT). Each of these subtests was designed to provide quick and efficient information about how well students are able to understand and process German language texts (both aural and written) like those found at various levels within the GUGD curriculum. The first evaluation of the three components of the placement exam (Norris, 2003) showed that the LCT and RCT scores provided less consistent information than the C-test scores. At the same time, the C-test tapped into a wider range of student ability levels than the two other components did. Therefore, the C-test scores became the primary source of input for decision making.

After the initial validity evaluation study conducted by Norris in 2000, a longer-term research and development agenda was suggested in order to accomplish several goals, among them computerization of the placement exam and development of additional test forms in order to address potential security issues. Since then, several changes have taken place in test administration and the curriculum itself that led to the need for the current placement revision project.

Curricular Changes

The initial test used a pen-and-paper format and was administered in the first week of classes. In the summer of 2008, the test was transferred into an electronic environment in such a way that the five C-test texts (one for each curricular level) with gaps were entered into the electronic platform and a time limit of 25 minutes was set on the C-test. In this fashion, students slated to enroll in the program (usually for the upcoming fall semester) can take the test from home and, on the basis of their performance, will receive a preliminary recommendation for course placement that enables them to preregister for courses over the summer. Two days before the beginning of classes in the fall semester, students take the remaining listening and reading parts of the test in a proctored environment on campus to confirm the earlier, preliminary placement.

Because assessment and evaluation were included as a guiding heuristic for the program, the initial curricular design has undergone sustained inspection, evaluation, reflection, and improvement. The resultant changes can be broadly summarized in terms of two areas: (a) development of a clearer and more differentiated vision of the role of genre in the curriculum and (b) a more refined conceptualization of the progression in textual genres and genre-based tasks that are at the heart of the program’s pedagogical work. Both aspects are systematically implemented across the levels of the curriculum: in materials selection and didacticization, task construction, instruction, and assessment.

A Shift From a Content- and Task-Based to a Genre-Based Curriculum. As stated, the curricular reform initiated over a 3-year period in 1997–2000, for which the original C-test was developed, resulted in a task- and content-based curriculum (Byrnes, 2001; Eigler, 2001). In subsequent years, the curriculum underwent a major shift toward connecting task and content on the basis of the construct of genre (e.g., Byrnes et al., 2006), itself adapted from the genre-based approach developed within systemic functional linguistics (Martin, 1984). That shift gave the program and its curriculum a firm foundation on which to integrate social context, subject matter content, and language use more substantively. As is to be expected, after adopting the definition of genre as “a staged, goal-oriented, purposeful activity” (Martin, 1984, p. 25), implementing that notion in concrete curricular action took time and required further adaptation and learning on the part of all the practitioners in the program (see Byrnes, Maxim, & Norris, 2010, for extensive discussion). Research in discourse analysis of various genres, primarily in the systemic–functional tradition, was necessary to analyze and describe the various authentic texts used in the program at various levels and ultimately translate this knowledge into level-appropriate pedagogies and assessment practices.

For that more detailed work, the program’s educators and researchers turned to studies on the discourse, semantic, and linguistic characteristics of various genres, especially those inspired by systemic–functional linguists who presented careful analyses of different types of narrative (Coffin, 2006; Martin & Rose, 2008) and expository genres (Coffin, 2006; Martin, 1989). At the same time, the program conducted original research on genres specific to the German cultural realm. For example, diverse research initiatives investigated the contextual and textual features of the German political appeal (Crane et al., 2005), book reviews (Ryshina–Pankova, 2010), and printed interviews from the German media (Rinner & Weigert, 2006). In time, the resulting more differentiated understanding of how these genres work (e.g., the staged nature of texts) began to guide text selection, didacticization, and the formulation of instructions for writing and speaking tasks.

In a similar vein, it took time to determine and describe connections between the communicative goals and the lexicogrammatical features of texts salient for the realization of the program’s learning goals at each curricular level. For example, the foci on language features in texts gradually came to be driven more by highlighting the saliency of these features for the nature of meaning-making in specific genres. These insights and decisions ultimately enabled teachers to highlight more explicitly certain linguistic features that construe meanings in a particular way, thus leading to a better integration of content and language in their pedagogical approaches.

A third important and gradual change in the curriculum pertains to a heightened awareness of the contextual situatedness of genres and specifically their interactive nature. The role of audience, authorial stance, and evaluation, even in simple texts, as well as the presence of other voices and their specific linguistic realizations, has received more attention in the analyses of the materials and tasks, once more resulting in adjusted forms of text didacticization, assignment formulation, and pedagogy. For example, explicit emphasis on the illocutionary force of language is made as early as the second unit of the Level I beginning course within the context of a fashion show speaking task. There, various adjectives that describe clothes are discussed not only in terms of their referential content but also in terms of their promotional function within advertisement genres such as a fashion show.

These new understandings of genre had a significant impact on the conceptualization of the genre-based writing tasks as well as the speaking tasks that were introduced after the development of the original C-test. Here, the changes took place in two areas. First, a more differentiated and theoretically justified understanding was achieved of the relationship between authentic texts as model texts and tasks based on them. Second, the rubrics for the genre-based tasks were refined to reflect the difference between the more abstract structural aspects of genres and their stages and the more specific content aspects that fill the stages. These two types of changes are discussed in the next section.

A More Refined Understanding of the Progression of Genres in Materials and Tasks. Initially, curricular progression across the five curricular levels was defined in general terms as a move from private discourses that characterize interactions in personal settings with friends and family to public discourses typical of communication in institutions with nonintimates (Gee, 1998). While such a conceptualization worked well as a general framework, it did not capture well the differences between various types of communication that might appear close to each other on the private-to-public continuum. For example, it provided few insights into the difference between a personal recount in a letter and a recount of historical events in a letter. It also provided little guidance for where and how to include literary texts.

Eventually, through a more thorough exploration of the research literature, especially as detailed within a systemic–functional framework (Coffin, 2006; Halliday & Matthiessen, 2004), the progression has come to be understood in much more differentiated, spelled out, and multifaceted terms. This understanding turned out to be instrumental for making more systematic and theoretically informed practical decisions in materials revision, selection of lexicogrammatical foci, and instructional practices. In the end, curricular trajectories were defined with regard to four main areas: content, type of audience, role of language, and type of genre. In terms of content, the progression is now characterized as a move toward more domain-specific thematic areas, from more everyday life topics (e.g., family, holidays, nature) to themes related to specific subject areas like history, literature, business, etc. In terms of the relationship with the audience, the materials shift from those that represent communication with intimates that is typically overtly dialogical and explicitly evaluative (e.g., as in a conversation with friends or a personal letter) to those with nonintimates in public settings that are implicitly dialogic and more subtly evaluative (e.g., as in a letter to the editor, a public speech, or an academic article). The next curricular axis addresses the role of language in communication in terms of the shift from oral-like discourse, where language typically accompanies social activity, to written-like discourse, where language constitutes social activity. Finally, in terms of genres, the progression moves from narrative to more interpretative and argumentative genres: from personal recounts to personal narratives, to biographical narratives, to historical narratives, to different types of explanations, to expositions and, finally, reflective discussions (Coffin, 2006). All these curricular trajectories are interconnected and have implications for language use that are also spelled out in the programmatic documents (see, for example, Byrnes & Sprang, 2004; Byrnes et al., 2010; Crane, 2006; Ryshina–Pankova & Byrnes, 2013, for a more detailed account of the role and meaning of these trajectories).

In response to these changes, and after close to 15 years of experience with the curriculum since the development of the original placement test, program administrators sought to make sure that the placement exam was updated and aligned with recent developments in the curriculum.

DEVELOPMENT AND VALIDATION OF A CURRICULUM-BASED C-TEST

What Is a C-Test?

C-tests, like the classic cloze tests, are an operationalization of the principle of reduced redundancy testing (Klein–Braley, 1997). A typical C-test consists of five to eight texts of between 75 and 100 words, the meanings of which can be understood clearly without additional supporting material. Ideally, C-test texts come from authentic sources and cover different topics. C-tests have been developed and researched as indicators of general language proficiency among literate populations of learners, and findings repeatedly have shown consistent measurement qualities and strong relationships with other indicators of proficiency (e.g., Eckes & Grotjahn, 2006). Development of a C-test usually comprises three phases: (a) text selection, (b) text and test preparation, and (c) pilot testing. After the texts are selected, they should be transformed into C-test texts according to standard C-test development recommendations (Grotjahn, 2002): For each text, the first and the final sentence should be left intact, in order to provide necessary semantic context (Grotjahn, 1987). Beginning with the second word of the second sentence, the second half of each second word should be deleted until 20/25 deletions are made. For words with an odd number of letters, the second half of the word plus one letter is deleted (e.g., Gegensatz → Gege_____; ‘opposite’). Compound words are treated as an exception in that only the second half of the second compound word is deleted (e.g., Nebentisch → Nebenti_____; ‘neighboring table’). Numbers and dates written numerically are not changed, nor are acronyms or proper nouns.
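Because the deletion procedure is fully mechanical, it can be sketched in a few lines of code. The following Python function is an illustrative simplification rather than the instrument used in the study: the compound-word exception is omitted, numbers and other non-alphabetic tokens simply never match the letter pattern, and proper nouns are not detected automatically (in German all nouns are capitalized, so an exclusion list would have to be supplied by hand); only all-caps acronyms are skipped.

```python
import re

# letters only; umlauts and ß included so German tokens match
WORD = re.compile(r"([A-Za-zÄÖÜäöüß]+)([^A-Za-zÄÖÜäöüß]*)$")

def make_ctest(text, n_gaps=20):
    """Gap a short text following the standard C-test rules (simplified):
    first and last sentence intact; starting with the second word of the
    second sentence, the second half of every second word is deleted
    (for odd-length words, half plus one letter)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    if len(sentences) < 3:
        raise ValueError("a C-test text needs at least three sentences")
    gapped = [sentences[0]]          # first sentence stays intact
    solutions = []                   # deleted halves, i.e., the answer key
    word_idx = 0                     # 0 = first word of the second sentence
    for sentence in sentences[1:-1]:
        out = []
        for token in sentence.split():
            match = WORD.match(token)
            hit = (word_idx % 2 == 1                  # every second word
                   and len(solutions) < n_gaps
                   and match is not None              # skips numbers, dates, symbols
                   and not match.group(1).isupper())  # skips acronyms
            if hit:
                stem_len = len(match.group(1)) // 2   # odd length: half + 1 deleted
                stem, deleted = match.group(1)[:stem_len], match.group(1)[stem_len:]
                solutions.append(deleted)
                out.append(stem + "_" * len(deleted) + match.group(2))
            else:
                out.append(token)
            word_idx += 1
        gapped.append(" ".join(out))
    gapped.append(sentences[-1])     # final sentence stays intact
    return " ".join(gapped), solutions
```

Applied to any three-sentence text, the function reproduces the halving rule from the examples above: a nine-letter word keeps its first four letters, just as Gegensatz keeps Gege.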

Depending on the use of the test, various studies have investigated the validity of C-tests using different approaches. In order to demonstrate that C-tests measure overall language proficiency, other criterion measures have been used to validate C-tests, among them standardized tests like the TestDaF (Test of German as a Foreign Language) for German (Eckes & Grotjahn, 2006) or the TCF (Test of Knowledge of French) for French (Reichert, Keller, & Martin, 2010). However, depending on test use, using other tests as criterion measures may not provide the information needed for making relevant decisions based on the C-test scores, for example, for placing students into local language curricula. Therefore, similar to Norris (2006, 2013), the current validity evaluation study adopts a utilization-based approach to the development and validation of a German placement C-test, because this approach addresses the local needs of the test users.

Development of the New C-Test

Following Norris (2006), we as researchers engaged the local experts in the program at different stages of test development and validation (see Online Materials for the timeline of the entire project). In particular, curriculum-based text selection was a collaborative endeavor that required active involvement of instructors at all curricular levels rather than being the responsibility of just one faculty member or the director of curriculum.

Text Selection. Since the primary intended users of the original test agreed that students’ abilities to process a variety of texts would provide the most direct indication of differences between curricular levels, it was essential to purposefully select texts that would represent the junctures within the sequenced curriculum where placement decisions needed to be made.

The texts for the new C-test had to be selected in line with the following criteria: The texts had to represent each level in terms of its typical genre, its content foci, and the lexicogrammatical features that instantiate the genre. Level instructors received documents characterizing each level in terms of these focal features so that they could identify suitable texts. All instructors at the respective levels individually selected at least three written texts which they deemed to be representative of the kinds of texts that students would be able to understand and process upon successful completion of the second semester of nonintensive study within the given level (e.g., 1.2, 2.2, etc.).

Despite these specific guidelines, text selection was challenging for three reasons. The genre and content specifications had to be combined (e.g., instructors would look for the theme of German holidays realized through the genre of recount), texts had to be authentic, and, finally, the texts typically could not be used in their entirety, requiring instructors to find coherent textual excerpts that would incorporate the needed features. As a result, the texts that were selected demonstrated various types of alignment with the curriculum, in terms of genres, language features, and content foci.

The second author, in her role as the director of curriculum, collected from 8 to 24 narrative, descriptive, expository, or related texts for each level and asked instructors at the respective levels to rank the texts on a scale from 1–10 in order of relevance to the level-specific curricular goals and to justify their ranking (see Online Materials for an example of a completed rating sheet). Texts that received the highest rankings were then discussed during five level meetings where the instructors for each level (from three to six per level) agreed on the 3 to 4 most representative texts, 16 in total. These were chosen for the initial pilot testing.
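The rating step lends itself to a simple aggregation. The sketch below is purely illustrative (the titles and scores are hypothetical, not taken from the study's rating sheets): it averages each candidate text's 1–10 instructor ratings and returns a shortlist for discussion at the level meeting.

```python
from statistics import mean

def shortlist_texts(ratings, top_k=4):
    """Average each candidate text's 1-10 relevance ratings across instructors
    and return the top_k titles for discussion at the level meeting.
    `ratings` maps a text title to the list of individual instructor ratings."""
    averaged = {title: mean(scores) for title, scores in ratings.items()}
    return sorted(averaged, key=averaged.get, reverse=True)[:top_k]

# hypothetical Level I candidates rated by three instructors
level_one = {
    "Recount: Weihnachten bei uns": [9, 8, 10],
    "Description: Mein Stadtviertel": [7, 8, 6],
    "Blog: Umzug nach Berlin": [5, 6, 4],
    "Recount: Ein Ferientag": [8, 9, 9],
    "Report: Deutsche Feiertage": [6, 5, 7],
}
print(shortlist_texts(level_one, top_k=3))  # the three highest-rated titles
```

Averaging is only one defensible aggregation; medians or rank sums would dampen the influence of a single enthusiastic rater.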

Text and Test Preparation. After the texts wereselected, standard C-test development recom-mendations (Grotjahn, 2002) were used to trans-form them into C-texts, with a few modificationsto the GUGD context. For example, in several cir-cumstances (e.g., repeated deletion of the samewords in a single text), slight adjustments in thetext (adding or removing a word) were made toelicit production of language features that wouldmore accurately reflect the curricular expecta-tions of the corresponding level. Two texts wereeliminated at the stage of text preparation be-cause there were too many repetitive deletions,and the whole passage would otherwise have to bereformulated in order to produce desirable items.The final set of 14 C-texts was pre-piloted with

the ordinary faculty of the program with the aim of (a) looking for alternative correct answers, (b) identifying unsolvable items, and (c) obtaining feedback on whether students of particular levels would be able to solve the items on a given text. Based on this feedback, 2 texts for each level, 10 in total, were identified where the deletions best represented students' abilities at each particular level. Furthermore, some of the introductory and closing sentences were adjusted to provide students with the most appropriate semantic context.

Next, the test was computerized and piloted

with nine graduate students in an online environment under time constraints to make sure that the test worked as intended in the electronic version and could be completed within the given time. See Figure 1 for the overview of the C-test preparation and piloting steps.

Pilot Testing. The first goal of pilot testing was to investigate measurement qualities of the new C-texts, as well as their appropriateness for GUGD undergraduate students. For this, the 10-text C-test was pilot tested with 87 enrolled students from all curricular levels (1.2: N = 40; 1 Int: N = 6; 2.2: N = 7; 2 Int: N = 4; 3.2: N = 11; 3 Int: N = 1; 4: N = 3; and 5: N = 15) at the end of the Spring 2013 semester.

Due to the restricted amount of time in which

placement tests can be administered at the university and in order to avoid examinee fatigue, only a limited number of C-texts can be used during the placement exam. Therefore, the second goal of pilot testing was to reduce the number of the developed C-texts to the five C-texts with the best measurement qualities. Based on score distribution and reliability analyses, five texts (2, 4, 6, 7, and 10) with the highest point–biserial correlations were chosen for retention, one for each curricular level.

In addition, actual answers of two examinees

randomly chosen from each of the five levels (1.2, 2.2, 3.2, 4, and 5), 10 C-tests in total, were analyzed in detail. In our respective roles as test developer and director of curriculum, we examined students' answers item by item, looking for anomalies and unexpected responses. After discussion, 4 blanks out of 125 were adjusted; e.g., Nudelhaus 'noodle house' was substituted by Kneipe 'bar,' because the item was not correctly solved by all 10 participants and does not represent a frequent cultural phenomenon in Germany. For some items, alternative correct answers were added that violated the deletion rules but were correct in terms of meaning (e.g., for ein JournALIST beschrieb 'a journALIST described,' the option ein JournAL beschrieb 'a journAL described' was added as another correct answer). According to Grotjahn (2004), acceptance of alternative answers can lead to higher test reliability. These qualitative analyses of the items aimed at improving the quality of the test prior to administering it as a part of the actual placement test.
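The retention criterion described above, keeping the texts whose scores correlate most strongly with the overall result, can be sketched as follows. The `pearson` helper and the toy score matrix are our own illustrations, not the study's data or analysis code.

```python
# Sketch of the text-retention criterion: keep the C-texts whose scores
# correlate most strongly with the total score of the remaining texts.
# All data below are hypothetical toy values.

def pearson(xs, ys):
    # Pearson product-moment correlation between two score lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# rows = examinees, columns = per-text scores (one column per C-text)
scores = [
    [20, 18, 15, 9, 7],
    [15, 12, 8, 5, 3],
    [24, 22, 20, 18, 15],
    [10, 8, 4, 2, 1],
    [18, 16, 12, 10, 6],
]

totals = [sum(row) for row in scores]
by_text = list(zip(*scores))  # transpose: one tuple of scores per text

# Correlate each text with the total of the REMAINING texts, so a text's
# own points do not inflate its coefficient.
for i, text_scores in enumerate(by_text, start=1):
    rest = [t - s for t, s in zip(totals, text_scores)]
    print(f"Text {i}: r = {pearson(list(text_scores), rest):.2f}")
```

In practice one would rank the texts by these coefficients and retain the top candidate per curricular level, as the authors did with texts 2, 4, 6, 7, and 10.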

Comparison of the Old and New C-Tests

In this section we exemplify in greater detail the difference between the old C-test and the five new C-texts chosen as a result of the piloting; we do so in terms of represented genres, lexicogrammatical features, and text content.

Table 2 illustrates the genre axis that the newly

selected texts represent. They follow the specified trajectory from a simple recount to a more complex biographical recount of the life of a public figure to an exposition to a statistical report in support of a statement, and finally a discussion. This stands in contrast to the less generically varied texts in the original C-test, which had been developed prior to a curricular commitment to this trajectory of genres.

FIGURE 1
C-Test Preparation and Piloting Steps

The greater variety of genres in the revised C-test was intended to target a greater variety of language features in focus at the respective curricular levels. Appendix A provides a comparative overview of the linguistic features in the texts of the old and revised C-tests. It exemplifies how the old and the new texts representing Level I (Text 1) are different. The linguistic features of the new Text 1 are more accurately aligned with the

language features considered focal at the end of the level, since this text contains the Present Perfect as contrasted with the Present in Text 1 of the old C-test. Furthermore, it has more varied instances of evaluative language use, which has gained more emphasis in the current curriculum. Finally, the ideational context of these language features (restaurant theme vs. university) is more closely aligned with the topical areas of the current Level I. The same is true for the Level III text (Text 3) from the revised test. Whereas the old text featured elements of chronology, the new text is dominated by intraclausal causative structures that become more important at the end of Level III, where a shift from temporal structuring, as in narratives, to more causative progression, as in explanations, is fostered. Furthermore, the new Level III text contains a variety of evaluative expressions realized as modal adjuncts (insbesondere 'especially'), adverbs (primär 'primarily'), nouns (Mangel 'lack'), or adjectives (katastrophal 'disastrous'), while the old Level III text lacks this feature altogether. However, Appendix A also reveals similarities between the old and new texts. For example, both the old and the new Level V texts contain hypothetical structures, domain-specific vocabulary, complex clauses, and evaluative expressions.

TABLE 2
Genres in the New and Old C-Test Texts

Text | New C-Test | Old C-Test
Text 1 | Personal recount | Comparative description
Text 2 | Biographical recount (public figure) | Description
Text 3 | Exposition | Description
Text 4 | Statistical report in support of a statement/exposition | Discussion of the phenomenon
Text 5 | Discussion | Statistical report in support of a statement

Note. The genres of the new C-test correspond to the salient genres of each curricular level.

The final requirement for the new texts was

close alignment with the content foci of the existing curriculum. In this respect, as can be seen in Table 3, the old test comes close to the salient themes in the curriculum as well. The only disparity seems to be in Text 1 of the original test, where the content focus is on the differences between the higher education systems in the United States and Germany, a theme that is no longer part of the Level I curriculum.

THE STUDY

Two evaluation questions (EQs) were posed to investigate the validity of the C-test:

EQ1. Is the developed C-test accurate as a tool for measuring a wide range of student abilities?

EQ2. Is the developed C-test effective as a means for distinguishing among students across the range of GUGD curricular levels?

Several conditions would have to be met in order to demonstrate that the C-test is accurate

as a tool for measuring a wide range of student abilities (EQ1). These conditions correspond to the assumptions that underlie five inferential links (a–e) between curricular placement decisions and the test-based information: (a) the C-test would elicit a broad distribution of examinee scores, (b) the C-test would produce scores that distinguish reliably among individual examinees, (c) the individual C-texts would elicit expected response patterns from examinees; in particular, mean scores would gradually decrease from the level I text to the level V text, (d) students who are enrolled in the progressively higher curricular levels would perform with higher scores on all five texts than students at the preceding curricular levels, and furthermore, test scores would decrease from Text 1 to Text 5 for all levels, and (e) there would be a strong relationship between the new C-test scores and the old C-test scores, and a moderate correlation with the Reading and Listening comprehension parts of the placement test.

To investigate the effectiveness of the C-test as

a means for distinguishing among students across the range of GUGD curricular levels (EQ2), the following conditions would have to be met: (f) average C-test scores would reflect clear differences between groups of students from each of the curricular levels at both the beginning and the end of the semester, (g) average C-test scores would increase between the beginning and the end of the semester, and (h) longitudinal changes in C-test scores would be similar to cross-sectional differences in the C-test scores of students at both the beginning and the end of the semester.

TABLE 3
Content Foci in the New and Old C-Test Texts

Text | New C-Test | Old C-Test
Text 1 | Everyday sphere: Eating in a restaurant | Higher education system in Germany and the United States
Text 2 | German politicians: Merkel | Actions to protect the environment
Text 3 | Post-war period in Germany | East and West Germany before the fall of the Wall
Text 4 | Advantages of Germany as a competitive location for starting business | Meaning of money
Text 5 | Consequences of introducing the United States of Europe in the European Union | Opinions of Germans on the introduction of the Euro in Germany

Note. The content foci of the new C-test correspond to the salient content foci of each curricular level.


Method

Procedures. The measurement accuracy and effectiveness of the C-test instrument were investigated using several sources of evidence. The Fall 2013 placement exam administration provided an initial opportunity to collect evidence for investigating key measurement qualities of test scores and the five texts that comprised the new C-test. Hence, the five new C-texts chosen after the piloting stage were integrated into the original placement test (in addition to the five C-texts of the old C-test) and administered under standard exam administration conditions to students who registered for placement testing into the German language program at Georgetown prior to the Fall 2013 semester. The results of the new C-test, however, were not yet considered for making decisions about placing students at that point. The students who took the test were not aware of the fact that one set of the C-test texts was being piloted and that their scores on it would not be considered for the placement decision.

Additional data were collected during the Fall 2013 semester in order to facilitate interpretations regarding the effectiveness of the C-test at distinguishing between students across the five curricular levels of the GUGD program. All continuing (nonplaced) students (i.e., those already studying in the program) were asked to take both the new and the old C-test in the first week of the semester. Furthermore, all students (placed and nonplaced) took the test at the end of the semester. For the semester-beginning administration, students were informed that their test scores would be used for calibrating the placement exam and that they should therefore perform as well as they could. For the semester-end administration, students were informed that a final administration of the C-test was needed in order to investigate their language development over the course of a semester of instruction. The C-tests were administered following identical procedures used for the placement exam administration, thereby maintaining equivalent performance conditions. In particular, students took the C-test at home; they had 25 minutes to complete each set of the C-tests, and the instructions to examinees were similar to the instructions used for actual placement. See Appendix B for the instructions.

To prevent any order effect, for both administrations the order of the C-tests was randomized, with half of the students taking the old C-test first and the other half the new C-test first.

Participants. A total of 222 students took the test at the beginning and/or the end of the semester. Only 39 students took the actual placement test at the beginning of the semester. A total of 85 nonplaced (already enrolled from a previous semester) students took the tests at the beginning of Fall 2013, and 96 students took the tests at the end of the semester. Furthermore, 66 students took the test at both administrations, providing data for longitudinal comparisons. Table 4 summarizes the numbers of participants for each test administration.

Several constraints limited the number and representativeness of students from whom C-test scores were collected during the pre–post-semester intervention study. First, the C-test was not administered to the beginning level I students because it could have a detrimental effect on their motivation. It was also assumed that because of students' very limited language learning experience, the results would most likely show very low scores anyway. As such, the C-test was not administered at the beginning of the semester to students in the first-year first-semester course sections (1.1), nor in the first-year intensive course sections (1.Int). However, six students from level I courses took the test anyway, for some unknown reason, and their scores were kept for the pre–post cross-sectional and longitudinal comparisons. Further, not many students from level 1.1 took the test at the end of the semester either, reporting to their instructors that the test was too hard for them and that they had to give up in the middle of the test. Second, since second-semester courses (1.2, 2.2, 3.2) are not offered in the Fall semester, no C-test score data were collected at the beginning or the end of the semester for these second-semester curricular levels. Third, although it would have been ideal to administer the test in the classroom and thus have a higher number of participants, most instructors hesitated to devote valuable instructional time twice during the semester to test piloting, so the students took the test from home and their participation was voluntary. As a result, fewer students from each curricular level took the test than were actually enrolled in the program.

TABLE 4
Distribution of Participants in the Study

Test Administration | Student Population | N
Fall 2013 Beginning | Placed | 39
Fall 2013 Beginning | Nonplaced | 85
Fall 2013 Beginning | All | 124
Fall 2013 End | Placed | 17
Fall 2013 End | Nonplaced | 72
Fall 2013 End | All | 96

Analyses. In order to evaluate the accuracy of the C-test in measuring a wide range of student abilities (EQ1), descriptive statistics were calculated and Classical Test Theory (CTT) analyses were conducted to estimate the reliability and error associated with the overall scores. Furthermore, Rasch model item response theory (IRT) analyses were employed using FACETS software (Linacre, 1998) to enable a more exacting analysis of test and item characteristics. The IRT analyses provided overall examinee separation indexes and standard error estimates, which indicated the extent to which the interval measurement scale modeled for the C-tests distinguished accurately among examinee ability levels. Corresponding α equivalents were also calculated for each separation index. Furthermore, each individual text was analyzed with simple descriptive statistics in terms of its ease/difficulty and the extent to which it was able to separate examinees into distinct ability levels. Finally, the relationships of the new C-test scores with the old C-test and with the Reading and Listening comprehension tests were investigated by calculating Pearson product–moment correlation coefficients. These correlations enabled interpretations about the extent to which any of the scores might effectively represent the others.

Cross-sectional comparisons between the C-test

scores of nonplaced and placed students and their corresponding curricular levels were conducted based on students' pre–post semester scores to enable interpretations about the extent to which

the C-test instrument effectively distinguished among students across the range of GUGD curricular levels (EQ2). These cross-sectional comparisons enabled interpretations about the extent to which the C-test effectively and consistently distinguished among students representing the full range of curricular abilities. Furthermore, a pre–post design observing the longitudinal changes in C-test scores as a result of the semester of instruction was employed to enable interpretations about the extent to which the developed C-test was sensitive to learners' curriculum-related language development.
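The CTT quantities used in these analyses can be illustrated with a minimal sketch. The item matrix below is a hypothetical toy example (the actual test had 125 items and far more examinees), and `cronbach_alpha` and `sem` are helper names of our own, not FACETS output.

```python
# Minimal sketch of two CTT quantities used in the analyses:
# Cronbach's alpha and the standard error of measurement (SEM).
# The 4-item x 6-examinee matrix below is hypothetical toy data.

def cronbach_alpha(items):
    # items: one list per item, each holding that item's scores
    # across examinees (same examinee order in every list).
    k = len(items)
    n = len(items[0])
    totals = [sum(col[i] for col in items) for i in range(n)]

    def var(xs):  # sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_var_sum = sum(var(col) for col in items)
    return k / (k - 1) * (1 - item_var_sum / var(totals))

def sem(sd_total, alpha):
    # Classical SEM: expected scatter of observed scores around
    # true scores, in raw score points.
    return sd_total * (1 - alpha) ** 0.5

items = [
    [1, 1, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 1],
    [1, 1, 0, 1, 1, 1],
    [0, 1, 0, 1, 0, 1],
]
a = cronbach_alpha(items)
print(f"alpha = {a:.2f}")
print(f"SEM at SD = 25 raw points: {sem(25.0, a):.2f}")
```

With a high alpha such as the 0.95 reported below, the same formula yields a correspondingly small SEM relative to the score scale.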

RESULTS

EQ1: Measurement Accuracy of the GUGD C-Test Texts

In order to investigate the extent to which the C-test elicits a wide distribution of scores from placement examinees of differing abilities as well as from nonplaced students already studying across the range of curricular levels (Condition A), descriptive statistics for the 2013 placement test administration were calculated. For the population of incoming students, the frequency of the score distribution displayed in Figure 2 suggests that the test spreads students out in a normal distribution that is rather well centered.

Furthermore, mean and median values for the

new C-test as shown in Table 5 support this observation, showing almost no skewness (0.01) for the new C-test. Although minimum scores were relatively high for the test, their occurrence can be explained by a higher number of more advanced students on this placement occasion.

FIGURE 2
Score Frequency Distribution of the New C-Test, Fall 2013 Placement Exam (N = 39)

Note. The arrow on the x-axis indicates mean values.


TABLE 5
Descriptive Statistics for the New C-Test for Fall 2013, Beginning

Statistic | Placed Students | Nonplaced Students
N (examinees) | 39 | 85
k (items) | 125 (5 texts) | 125 (5 texts)
Mean | 68.15 | 56.42
Median | 64.00 | 55.00
SD | 25.00 | 24.64
Min | 22 | 4
Max | 115 | 114

In order to explore the distribution pattern in more detail, the distribution of scores among the nonplaced students including level I students was also investigated. As seen in Table 5, nonplaced students on the whole scored at slightly lower means (by about 10 points) than the placed students. This pattern was probably due to the inclusion of the level I students in the analyses. Nevertheless, the presence of the lower level students in the test population enabled us to demonstrate that the new C-test utilizes almost an entire score range and elicits performances associated with examinees' differing ability levels. In addition, large standard deviations for placed and nonplaced students confirmed that the new C-test consistently elicited a wide dispersion of scores.

Last, a frequency distribution as displayed in Figure 3 demonstrated only slightly positive skewness for the new C-test with the nonplaced students (skew = 0.27), showing again a rather balanced distribution of the test scores.

Thus, the descriptive statistics demonstrated the ability of the new C-test to effectively elicit a wide distribution of scores from placed and nonplaced examinees of differing levels of language abilities for two different student samples.
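The skewness values cited above (0.01 and 0.27) can be computed as in the sketch below; a value near zero indicates a roughly symmetric score distribution. The score list is illustrative only, not the study's data.

```python
# Sample skewness of a score distribution: values near 0 indicate the
# balanced, well-centered distributions described above. The scores
# below are hypothetical toy values.

def skewness(xs):
    n = len(xs)
    m = sum(xs) / n
    sd = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    # Average cubed z-score: positive = long right tail,
    # negative = long left tail.
    return sum(((x - m) / sd) ** 3 for x in xs) / n

scores = [22, 35, 41, 55, 64, 64, 70, 88, 102, 115]
print(f"skew = {skewness(scores):.2f}")
```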

FIGURE 3
Score Frequency Distribution of the New C-Test Scores for Nonplaced Students, Fall 2013 Beginning (N = 85)

Note. The arrow on the x-axis indicates mean values.

The next step was to investigate to what extent the C-test scores distinguish reliably among examinees of differing curricular ability levels within distinct placed and nonplaced student populations (Condition B). For this analysis, both classical test theory and item response theory estimates were calculated for all students at the beginning and the end of the semester. Table 6 provides CTT and IRT estimates for the combined population of placed and nonplaced students at the beginning and the end of the semester.

TABLE 6
IRT Reliability Estimates for All Student Populations

Statistic | New C-Test, Beginning | New C-Test, End
N | 124 | 94
k | 125 (5 texts) | 125 (5 texts)
α | 0.95 | 0.95
SEM (+/−) | 2.27 | 2.36
IRT examinee separation | 6.01 | 5.76
IRT separation reliability | 0.95 | 0.94
IRT standard error | 0.24 | 0.25

Very high Cronbach's α reliability estimates (0.95) for both test administrations indicated that total raw scores were very effective at distinguishing consistently among individual examinees. A Rasch model IRT equivalent to test score reliability is provided by the examinee separation index, expressed in logit values. The higher the separation, the greater the ability of the test to consistently differentiate individual examinees' abilities. As seen in Table 6, the new C-test showed a high consistency in separating examinees from one administration to the other, with a change of only 0.25 logits. Finally, standard error of measurement was calculated using both CTT and IRT analyses. Table 6 shows that the raw scores for the new C-test at the beginning of the semester indicated approximately plus or minus 2 points of error for a given score on the C-test, which indicates a high consistency of the scores. Standard error of measurement for the C-test at the end of the semester was slightly higher (by 0.09 raw score points), likely due to a lower number of participants and not a change in the test quality. In sum, three indicators of the reliability of the C-test demonstrated that the new C-test distinguished reliably among individual examinees.

In order to evaluate the extent to which individual C-texts function as intended at eliciting examinee performances and contributing to the total score (Condition C), the difficulty of each text was investigated using descriptive statistics for examinee performances on each C-test text for both placed and nonplaced students. Since each text had been intentionally selected under the assumption that it would prove most appropriate to the abilities at a given GUGD curricular level, from Text 1 representing level I of the curriculum to Text 5 representing level V of the curriculum, mean scores on each text would have to decrease from the easiest to the most difficult text. Table 7 displays descriptive statistics for each of the new C-test texts for the placed students.

As predicted, the examinees performed with

decreasing average scores from the first through the fifth text on the new C-test, with only a slight difference in the difficulty between the level IV and level V texts. Furthermore, large standard deviations for each text indicated the capacity of the individual texts to distinguish among examinees in a broad distribution theoretically reflective of ability differences. The same pattern was observed when the scores on the C-test were combined for both placed and nonplaced students at the end of the semester, as displayed in Table 8.

TABLE 7
Descriptive Statistics for the New C-Test Texts on Fall 2013 GUGD Placement Exam

Statistic | Text 1 | Text 2 | Text 3 | Text 4 | Text 5
Mean | 18.00 | 16.31 | 11.87 | 11.13 | 10.85
SD | 5.15 | 5.50 | 5.58 | 6.20 | 4.96
Min | 6 | 6 | 3 | 0 | 0
Max | 25 | 25 | 24 | 24 | 21

Note. N = 39.

TABLE 8
Descriptive Statistics for the New C-Test Texts for All Students, Fall 2013, End

Statistic | Text 1 | Text 2 | Text 3 | Text 4 | Text 5
Mean | 17.14 | 15.67 | 11.09 | 10.52 | 9.87
SD | 5.33 | 4.76 | 4.95 | 5.42 | 4.23
Min | 4 | 4 | 2 | 2 | 2
Max | 25 | 25 | 24 | 23 | 25

Note. N = 96.

Thus, since three samples of examinees produced similar average score differences among the five texts, it can be concluded that the C-test can consistently elicit performances related to curricular-level ability differences. One somewhat unexpected finding for the new C-test was that Text 4 and Text 5 appeared to be of a similar level of difficulty for the students at the beginning of the semester. For the end of the semester, however, Text 5 as expected produced on average lower scores than Text 4.

In order to get a more exacting measure of

difficulty independent of the sample and provide additional information regarding the qualities of the C-test texts, Rasch model analyses of response patterns on both C-tests at the beginning and at the end of the semester were conducted. Figures 4 and 5 provide graphic displays of both item measures and examinee measures according to the common interval scale resulting from the Rasch model analyses for the combined population of students. The predicted differences among the texts were apparent for Texts 1 through 4. The figures also show that the measures of difficulty in logits for Texts 4 and 5 proved to be very similar for the analyses conducted with the students at the beginning of the semester, but the difficulty of Text 5 was slightly higher for the students who took the test at the end of the semester. Nevertheless, ability estimates were widely distributed across the scale for both administrations; in other words, the majority of examinee ability levels were captured within the range of text difficulty estimates.
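Under the Rasch model, an examinee of ability theta (in logits) succeeds on an item of difficulty b with probability 1 / (1 + exp(-(theta - b))). The sketch below uses this relation to show why two texts whose difficulty estimates sit very close together, as reported for Texts 4 and 5, yield nearly the same success probabilities and hence largely redundant information; the specific difficulty values are hypothetical, not the study's estimates.

```python
import math

# Rasch model: probability that an examinee of ability theta (logits)
# succeeds on an item of difficulty b (logits).
def p_success(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

theta = 0.0                # an average examinee, anchored at 0 logits
text4, text5 = 0.60, 0.68  # hypothetical difficulties ~0.08 logits apart

# Nearly identical probabilities: the two texts separate examinees
# in almost the same way.
print(f"P(success, Text 4) = {p_success(theta, text4):.3f}")
print(f"P(success, Text 5) = {p_success(theta, text5):.3f}")
```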


FIGURE 4
Examinee and Item Measures on FACETS Ruler for the New C-Test for All Students, Fall 2013, Beginning

FIGURE 5
Examinee and Item Measures on FACETS Ruler for the New C-Test for All Students, Fall 2013, End


FIGURE 6
Average Performance on the New C-Test Texts by Curricular Semester-Level Groups, Fall 2013, Beginning, N = 102

Note. Total N = 102 because not all students who took the placement test enrolled into the program in the same semester.

As seen in Figure 5, there is a large gap between Text 2 and Text 3 (a 0.95 difference), an abrupt increase in difficulty with no text centered on the ability of an average student, as anchored at level 0 on the Rasch scale. This gap suggests that an additional text with a difficulty around 0.00 on the logit scale might be required. Furthermore, Texts 4 and 5 were estimated to differ by only 0.08 logits, suggesting some redundancy in the information they can provide about examinees' level of ability. Overall, standard error estimates were found to be quite low for each test. In addition, strong point–biserial correlation coefficients between each item measure and the total test score indicated that each of the five texts was contributing in equivalent and predicted ways to examinees' overall test scores (see Online Materials for the IRT statistics for Figures 4 and 5).

A further condition (D) that needed to be examined was whether groups of examinees enrolled in the progressively higher curricular levels perform with higher scores on all five texts than the students at the preceding curricular levels (assumption 1) and whether they perform with decreasing accuracy from Text 1 through Text 5 (assumption 2). For this analysis, average scores of the correctly enrolled students were combined with the average scores of the nonplaced students

for the individual curricular level and examined independently for each curricular level. At the beginning of the semester, students from the intensive courses were combined with the first course of the corresponding level (e.g., students who were enrolled in the level II intensive course were analyzed together with the students from the 2.1 course). Figure 6 shows average score patterns on each C-test text for examinees enrolled in the five curricular levels at the beginning of the semester.

The first assumption was supported for the

C-test administration at the beginning of the semester, with increasing average scores on each text at each increasing curricular level. The fact that level I and II students performed comparably high on Texts 3 through 5 might be due to the fact that four of the level I students were in the intensive basic course and their German proficiency might be closer to the 1.2 curricular level. The second assumption was also largely supported, with the exception of the level IV and V texts for the students enrolled in level I and level II courses. For some reason, level I and II students scored higher on Text 5 than on Text 4. Since an accurate placement instrument should differentiate well between students of all proficiency levels, the unexpected functioning of Text 4 would necessitate the inclusion of a different C-test text in the placement exam or perhaps the revision of some of the Text 4 items after closer inspection.

Finally, correlations of the C-test scores with the scores on the reading and listening sections of the placement test, as well as with the scores on the old C-test, were investigated to look for evidence of whether Condition E was met. Table 9 provides the Pearson product–moment correlation coefficients between the scores of the new C-test, the two other parts of the placement test, and the old C-test.

While high correlations were found between the old and the new C-test, the correlations with the reading and listening comprehension tests turned out to be moderate. The results are similar to the correlations found in previous research, for example between the OnDaF C-test and the Reading (0.56) and Listening (0.62) parts of the TestDaF (Eckes & Grotjahn, 2006), demonstrating that the three sections of the placement test tap into different aspects of students' language abilities and should therefore be used together for producing more accurate and triangulated placement decisions.
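One rough way to read these coefficients is through shared variance: squaring a correlation gives the proportion of variance in one score that the other accounts for. The short sketch below applies this to the coefficients reported in Table 9.

```python
# Shared variance (r squared) for the new C-test's correlations with the
# other placement instruments, using the coefficients from Table 9.
# A high r^2 with the old C-test but modest r^2 with reading/listening
# illustrates why the subtests are used together rather than singly.

correlations = {"old C-test": 0.92, "RCT": 0.57, "LCT": 0.62}

for name, r in correlations.items():
    print(f"new C-test vs {name}: r = {r:.2f}, "
          f"shared variance = {r * r:.0%}")
```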

TABLE 9
Pearson Correlations Among Placement Subtests, Fall 2013

Subtest | New C-Test | Old C-Test | RCT | LCT
New C-Test (Pearson r) | 1 | .92** | .57** | .62**
New C-Test (N) | 39 | 39 | 34 | 34
Old C-Test (Pearson r) | | 1 | .62** | .63**
Old C-Test (N) | | 41 | 35 | 35

Note. **Correlations are statistically significant at the p < 0.01 level. RCT = Reading Comprehension Test; LCT = Listening Comprehension Test.

EQ2: Effectiveness of the C-Test at Distinguishing Among GUGD Curricular Levels

The effectiveness of the C-test was evaluated in terms of the extent to which it is sensitive to the GUGD curricular level structure. First, evidence was sought for whether average C-test scores would reflect clear differences between students from each of the curricular levels at both the beginning and the end of the semester (Condition F) and whether average scores would increase between the beginning and the end of the semester (Condition G). For this analysis, average C-test scores were investigated cross-sectionally for the combined population of nonplaced and placed students. From Table 10 and Figure 7 it is evident that average C-test scores differentiated clearly between students enrolled in different levels of the curriculum, both in terms of the increase in scores between the beginning and the end of the semester for each curricular level and in terms of the overall increase in scores. A continual increase in mean scores is apparent across the curricular levels, with level I students scoring on average 22 points and level V students scoring on average 99 points out of the total 125 points on the combined five C-test texts. Furthermore, substantial mean score differences between the beginning and the end of the semester are apparent within each of the levels, with the largest differences attributable to the second intensive course (a 32.3 score point difference) and the level IV course (a 16.38 score point difference).

One discrepancy can be observed in the overallpattern: Students of the level I intensive coursescored a little higher both at the beginning andat the end of the semester than the level II (2.1)students. However, this discrepancy should betreated with caution because of a very low num-ber of participants in the level I intensive course.At the same time, this pattern can be related to

TABLE 10C-Test Descriptive Statistics for the New C-Test forAll Students, Beginning and End of the Semester ofInstruction, Fall 2013

GUGDLevel N Mean SD Min Max

1.1 BEG 4 22.25 12.71 4 33END 5 38.4 12.22 26 58

1.INT BEG 2 35.5 0.71 35 36END 5 48.6 6.07 40 56

2.1 BEG 25 34.6 8.42 20 52END 21 47.71 11.82 31 66

2.INT BEG 2 29.5 16.23 18 41END 5 61.8 14.86 39 78

3.1 BEG 20 60.9 17.86 30 102END 17 68 13.15 46 95

3.INT BEG 10 67.3 20.37 44 108END 10 78.17 13.55 61 90

4 BEG 29 70.1 16.56 28 100END 23 86.39 15.67 63 121

5 BEG 10 82.8 20.98 44 114END 2 99 22.63 83 115

362 The Modern Language Journal 99 (2015)

FIGURE 7Cross-Sectional Comparison of Average C-Test Scores

a higher motivation typical of students, many of them German majors, who enroll in beginning intensive courses.

A further set of analyses addressed the critical assumption that consistent longitudinal increases in overall C-test scores would be found as a result of one semester of studies at each of the curricular levels, with students in intensive courses showing a larger increase in scores than students in nonintensive courses (Condition H). Table 11 and Figure 8 show that average C-test scores differentiated clearly between students enrolled in different levels of the curriculum, both in terms of the increase in scores between the beginning and the end of the semester for each group and in terms of the overall continual increase in scores. Furthermore, substantial differences are apparent within each of the levels except level V, where no increase would be expected. The largest change is attributable to the three intensive courses: 13.5 points for the level I course, 25.4 points for the level II course, and 11.5 points for the level III intensive course.

Analogous to the cross-sectional comparisons, one discrepancy can be observed in the overall pattern in the longitudinal comparisons: Students from the level I intensive course scored equally high at the beginning of the semester and a little higher at the end of the semester than level II (2.1) students. Again, this discrepancy should be treated with caution because of the very low number of participants in the level I intensive courses.

Overall, findings from both cross-sectional and longitudinal analyses provided substantial evidence to support the inferential assumptions regarding the effectiveness of the C-test at distinguishing between students at different curricular levels. It is clear that students could be consistently differentiated between levels, in particular between levels II and III, III and IV, and IV

TABLE 11
C-Test Descriptive Statistics for Longitudinal Comparisons

GUGD Level        N   Mean    SD     Min   Max
1.1    BEG        2   26.00    2.82   24    28
       END        2   33.50    2.12   32    35
1.Int  BEG        2   35.50    0.70   35    36
       END        2   49.00    4.24   46    52
2.1    BEG       17   35.71    8.02   20    48
       END       17   45.65   11.18   31    65
2.Int  BEG        2   29.50   16.26   18    41
       END        2   54.50   21.92   39    70
3.1    BEG       16   58.75   18.42   30   102
       END       16   67.38   14.31   46   107
3.Int  BEG        6   67.30   20.37   44   108
       END        6   78.80   19.97   51   121
4      BEG       20   71.85   15.30   44    99
       END       20   80.55   15.31   55   103
5      BEG        2   99.50   20.51   85   114
       END        2   99.00   22.63   83   115


FIGURE 8
Longitudinal Comparison of Average C-Test Scores

and V, as scores from both semester-beginning and semester-end C-test administrations demonstrated substantial mean differences between these groups. Because of the small number of participants in the level I courses (1.1 and 1.Int), it is impossible to talk confidently about the consistency of distinctions made between level I and level II students, even though the data did show a clear difference between the 1.1 and 1.Int courses.
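The per-level gains discussed above can be recomputed directly from the semester-beginning and semester-end means in Table 11. The following sketch is our own illustration, not part of the original analysis; the means are transcribed from Table 11, and the variable and function names are ours:

```python
# (level, mean at beginning of semester, mean at end of semester),
# transcribed from Table 11.
TABLE_11_MEANS = [
    ("1.1", 26.0, 33.5),
    ("1.Int", 35.5, 49.0),
    ("2.1", 35.71, 45.65),
    ("2.Int", 29.5, 54.5),
    ("3.1", 58.75, 67.38),
    ("3.Int", 67.3, 78.8),
    ("4", 71.85, 80.55),
    ("5", 99.5, 99.0),
]

def gains(rows):
    """Return (level, end-minus-beginning gain) pairs, rounded to two decimals."""
    return [(level, round(end - beg, 2)) for level, beg, end in rows]

for level, gain in gains(TABLE_11_MEANS):
    print(f"{level:>6}: {gain:+.2f}")
```

As expected from the discussion, the intensive courses show the largest gains, while level V shows essentially no change over the semester.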

IMPLICATIONS AND CONCLUSIONS

Any validity evaluation study should start with a purpose. The current study aimed to investigate whether the newly developed C-test, which was initiated to assure closer alignment with the changes implemented in the curriculum over the last 15 years, was accurate and effective at distinguishing among a wide variety of students’ abilities. As this study reports, the new C-test and its constituent texts were found to provide trustworthy estimates of students’ abilities, which led to effective distinctions across the full range of curricular levels into which students were to be placed. However, the analyses also demonstrated a gap in assessed difficulty between Text 2 and Text 3 (i.e., between levels II and III). Furthermore, the new Text 4 appeared to function unexpectedly for the lower level students. A careful investigation of the new level IV text revealed that it is oriented toward the “Business in Germany” thematic area and, in comparison to the level V text, included fewer functional words than content words, which might have made it more difficult than Text 5 for lower level learners.

At the same time, the old C-test level IV text was more general in its content orientation and also targeted the generic structures relevant to the goals of curricular level IV. Specifically, the old level IV text is a discussion of the nature of a certain phenomenon (money). In addition, it elicits the relevant lexicogrammatical features that are in focus in level IV instruction: relational clauses indicative of an interpretative stance, subjunctive II for rendering other people’s speech/opinions, and series of nominalized clauses (e.g., Wie wichtig Geld in unserer Gesellschaft ist und was es bedeutet, nicht genug davon zu haben, wird vielen Menschen erst dann schmerzhaft bewusst… ‘how important money is in our society and what it means not to have enough of it becomes painfully clear to many people only when…’). At the same time, the vocabulary is quite general rather than domain specific, except, perhaps, for one instance: das Konto überziehen, ‘overdraw an account.’ Considering these findings, the well-functioning Text 4 from the old C-test set was selected for inclusion in the future version of the placement exam, replacing Text 4 from the new set of C-test texts.
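Difficulty comparisons of this kind are made on the logit (log-odds) scale used in the Rasch analyses reported in this article. As a rough, hypothetical illustration only — this is the simple log-odds transformation, not the many-facet Rasch estimation performed with FACETS, and the proportions below are invented, not the study’s data:

```python
import math

def approx_logit_difficulty(p_correct):
    """Rough logit difficulty from an item's proportion correct.

    Harder items (low proportion correct) get higher logit values;
    an item answered correctly by half the examinees sits at 0 logits.
    """
    if not 0 < p_correct < 1:
        raise ValueError("proportion correct must be strictly between 0 and 1")
    return math.log((1 - p_correct) / p_correct)

# Invented proportions correct for five texts, for illustration only:
for text, p in [("Text 1", 0.85), ("Text 2", 0.70), ("Text 3", 0.50),
                ("Text 4", 0.35), ("Text 5", 0.25)]:
    print(f"{text}: {approx_logit_difficulty(p):+.2f} logits")
```

On this scale, a well-targeted set of texts spans a range of logit values with no large gaps — the property the revised text set was chosen to restore.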

In order to investigate this new set of C-test texts, a Rasch model analysis was rerun for the combined population of test takers. Figure 9 provides graphic displays of both item measures and examinee measures according to the common interval scale resulting from Rasch model analyses for the chosen five C-test texts (see Online


FIGURE 9
Examinee and Item Measures on FACETS Ruler for the Final C-Test

Materials for the IRT statistics for Figure 9). As can be seen, the inclusion of the old Text 4 solved the problem of not having a text of average difficulty, producing a logit value of 0.04.

This final combination of C-test texts was administered at the end of the Spring 2014 semester to the 1.2, 2.2, and 3.2 courses in order to determine the internal cut scores. The newly designed C-test was put into use in the summer of 2014.

So, what lessons were learned from the evaluation process? First, engagement in the development of the new placement test and its validity evaluation in the GUGD program, which was initiated to assure closer alignment of the placement procedures with the ongoing changes in the curriculum, highlighted the key curricular transformations since the time when the initial C-test was constructed. Creating guidelines for choosing texts for the new C-test and the very process of text selection drew attention to the curricular goals by level, including genre, content foci, and lexicogrammatical features, as well as to the overall learning trajectories in the program. A corollary benefit of participating in this project was that it provided experts and novices with an important opportunity to engage in activities that are central to their professional development and the profession as a whole. For example, discussion about the C-test texts attuned the participating graduate student instructors to different aspects of curriculum construction, assessment, materials development, and other program-related issues.

Second, whereas the old C-test could reliably

spread out the incoming students across courses, the newly developed C-test is more reflective of the current curricular goals and therefore offers more certainty that students’ test scores in fact predict where they fit best among the GUGD curricular levels in terms of their content-, genre-, and language-learning goals. Thus, our evidence for the alignment between the test and the curriculum comes not only from the quantitative results of the analysis, as was already the case in the original validation study, but also from the discourse analysis of the C-test texts.

Third, the analysis revealed an interesting language acquisition progression trend in the assessed student population. Namely, a big leap in performance between levels 2 and 3 was observed in the descriptive analysis for all curricular semester-level groups (Figure 6). This difference was independent of text difficulty, since it was consistent across all five texts (i.e., for each text there was a big gap between how level 2 and level 3 students performed on it). This finding is trustworthy because each level had 30 participants. It is therefore appropriate to conclude that this difference in performance is likely to reflect the nature of learning in the curriculum, namely that the crucial noticeable surge in learning takes place at the early advanced level, after 4 semesters of FL study. A similar finding is reported in Byrnes et al. (2010) for the development of syntactic complexity, with the most dramatic increase in this measure occurring after four semesters of German language study. Besides being an interesting statement about the learning trajectory in the curriculum, these outcomes provide a convincing rationale for extending the FL requirement beyond the intermediate level of instruction, because it is at the early advanced level that learners actually get to reap the fruit of their efforts.

Finally, now that the newly developed C-test seems to be functioning well in its intended role, its use will nevertheless continue to be monitored and adjusted as necessary, precisely due to the dynamic nature of the curriculum. Meanwhile, the GUGD will engage with the next evaluation step, namely the revision of the listening and reading sections of the existing placement exam. One might comment that a commitment to this type of curricular assessment development requires first of all resources, and perhaps also a certain kind of leadership and political will in a department — ingredients that tend to be out of one’s control. But at a deeper level, such assessment projects can be considered part of the ethical responsibility the program owes to its students if attainment of translingual and transcultural competence within a relatively short period of time is at stake. If such responsibility is taken seriously, development of a locally relevant placement test within the context of curricular structuring and restructuring can no longer be seen as an option one is free to pursue or not, but rather as an ethical imperative in foreign language studies.

ACKNOWLEDGMENTS

We would like to express our gratitude to the faculty and graduate students of the German department for their participation in the revision of the placement test. Without them the project would not have come to fruition. We would also like to thank John Norris for his valuable recommendations throughout the project as well as the anonymous reviewers for their helpful comments.

REFERENCES

Brown, J. D. (2008). Testing-context analysis: Assessment is just another part of language curriculum development. Language Assessment Quarterly, 5, 275–312.

Byrnes, H. (1998). Constructing curricula in collegiate foreign language departments. In H. Byrnes (Ed.), Learning foreign and second languages: Perspectives in research and scholarship (pp. 262–295). New York: The Modern Language Association.

Byrnes, H. (2001). Reconsidering graduate students’ education as teachers: ‘It takes a department!’ Modern Language Journal, 85, 512–530.

Byrnes, H., Crane, C., Maxim, H. H., & Sprang, K. (2006). Taking text to task: Issues and choices in curriculum construction. ITL: International Journal of Applied Linguistics, 152, 85–110.

Byrnes, H., Maxim, H. H., & Norris, J. M. (2010). Realizing advanced L2 writing development in a collegiate curriculum: Curricular design, pedagogy, assessment. Modern Language Journal, 94, S1.

Byrnes, H., & Sprang, K. A. (2004). Fostering advanced L2 literacy: A genre-based, cognitive approach. In H. Byrnes & H. H. Maxim (Eds.), Advanced foreign language learning: A challenge to college programs (pp. 47–85). Boston: Heinle.

Coffin, C. (2006). Historical discourse: The language of time, cause and evaluation. London: Continuum.

Crane, C. (2006). Modelling a genre-based foreign language curriculum: Staging advanced L2 learning. In H. Byrnes (Ed.), Advanced language learning: The contribution of Halliday and Vygotsky (pp. 227–245). London: Continuum.

Crane, C., Liamkina, O., Maxim, H. H., & Ryshina–Pankova, M. (2005, March). Implementing genre-based pedagogy for the advanced learner: Materials and assessment. Pre-conference workshop. Georgetown University Roundtable on Languages and Linguistics (GURT), Washington, DC.

Eckes, T., & Grotjahn, R. (2006). A closer look at the construct validity of C-tests. Language Testing, 23, 290–325.

Eigler, F. (2001). Designing a third-year German course. Die Unterrichtspraxis [Teaching German], 34, 107–118.

Gee, J. P. (1998). What is literacy? In V. Zamel & R. Spack (Eds.), Negotiating academic literacies: Teaching and learning across languages and cultures (pp. 51–59). Mahwah, NJ: Lawrence Erlbaum.

Grotjahn, R. (1987). How to construct and evaluate a C-Test: A discussion of some problems and some statistical analyses. In R. Grotjahn, C. Klein–Braley, & D. K. Stevenson (Eds.), Taking their measure: The validity and validation of language tests (pp. 219–254). Bochum, Germany: Brockmeyer.

Grotjahn, R. (1996). ‘Scrambled’ C-Tests: Untersuchungen zum Zusammenhang zwischen Lösungsgüte und sequentieller Textstruktur [‘Scrambled’ C-Tests: Investigations of the correlation between the quality of answers and sequential text structure]. In R. Grotjahn (Ed.), Der C-Test: Theoretische Grundlagen und praktische Anwendungen [The C-test: Theoretical foundations and practical applications] (Vol. 3, pp. 95–125). Bochum, Germany: Brockmeyer.

Grotjahn, R. (2002). Konstruktion und Einsatz von C-Tests: Ein Leitfaden für die Praxis [Construction and application of C-tests: Guidelines for use]. In R. Grotjahn (Ed.), Der C-Test: Theoretische Grundlagen und praktische Anwendungen [The C-test: Theoretical foundations and practical applications] (Vol. 4, pp. 211–225). Bochum, Germany: AKS-Verlag.

Grotjahn, R. (2004). Der C-Test: Aktuelle Entwicklungen [The C-test: Current developments]. In A. Wolff, T. Ostermann, & C. Chlosta (Eds.), Integration durch Sprache [Integration through language]. Beiträge der 31. Jahrestagung DaF 2003 (pp. 535–550). Regensburg, Germany: Fachverband Deutsch als Fremdsprache.

Grotjahn, R. (2010). Gesamtdarbietung, Einzeltextdarbietung, Zeitbegrenzung und Zeitdruck: Auswirkungen auf Item- und Testkennwerte und C-Test-Konstrukt [Overall performance, individual text performance, time limit and time pressure: Impact on the item and test parameters and the C-test construct]. In R. Grotjahn (Ed.), Der C-Test: Beiträge aus der aktuellen Forschung [The C-test: Contributions from current research] (pp. 265–296). Frankfurt am Main, Germany: Peter Lang.

Halliday, M. A. K., & Matthiessen, C. M. I. M. (2004). An introduction to functional grammar (3rd ed.). London: Edward Arnold.

Klein–Braley, C. (1997). C-Tests in the context of reduced redundancy testing: An appraisal. Language Testing, 14, 47–84.

Linacre, J. M. (1998). A user’s guide to FACETS. Chicago: MESA Press.

Martin, J. R. (1984). Language, register and genre. In F. Christie (Ed.), Children writing: Reader (pp. 21–30). Geelong, Australia: Deakin University Press.

Martin, J. R. (1989). Factual writing. Oxford: Oxford University Press.

Martin, J. R., & Rose, D. (2008). Genre relations: Mapping culture. London: Equinox.

MLA Ad Hoc Committee on Foreign Languages. (2007). Foreign languages and higher education: New structures for a changed world. Profession, 1–11.

MLA Task Force. (2014). Report of the task force on doctoral study in Modern Language and Literature. Accessed 17 December 2014 at http://www.mla.org/report_doctoral_study_2014

Norris, J. M. (2003). Validity evaluation in foreign language assessment. (Unpublished doctoral dissertation). University of Hawai‘i at Manoa, Honolulu, HI.

Norris, J. M. (2006). Development and evaluation of a curriculum-based German C-test for placement purposes. In R. Grotjahn (Ed.), Der C-Test: Theoretische Grundlagen und praktische Anwendungen [The C-Test: Theoretical principles and practical uses] (Vol. 5, pp. 45–83). New York: Peter Lang.

Norris, J. M. (2008). Validity evaluation in language assessment. Frankfurt am Main, Germany: Peter Lang.

Norris, J. M. (2013, October). Reconsidering assessment validity at the intersection of measurement and evaluation. Invited plenary address at the annual conference of the East Coast Organization of Language Testers, Georgetown University, Washington, DC.

Reichert, M., Keller, U., & Martin, R. (2010). The C-test, the TCF and the CEFR: A validation study. In R. Grotjahn (Ed.), Der C-Test: Beiträge aus der aktuellen Forschung [The C-test: Contributions from current research] (Vol. 18, pp. 205–231). New York: Peter Lang.

Rinner, S., & Weigert, A. (2006). From sports to the EU economy: Integrating curricula through genre-based content courses. In H. Byrnes, H. Weger–Guntharp, & K. Sprang (Eds.), Educating for advanced foreign language capacities (pp. 136–151). Washington, DC: Georgetown University Press.

Ryshina–Pankova, M. (2010). Towards mastering the discourses of reasoning: Use of grammatical metaphor at advanced levels of foreign language acquisition. Modern Language Journal, 92, 181–197.

Ryshina–Pankova, M., & Byrnes, H. (2013). Writing as learning to know: Tracing knowledge construction in L2 German compositions. Journal of Second Language Writing, 22, 179–197.


APPENDIX A

Salient Lexicogrammatical Features in the New and Old C-Test Texts

Level 1 Texts

New C-Test:
- Present perfect tense (Perfekt): hat geregnet ‘rained,’ haben gespielt ‘played,’ haben bestellt ‘ordered,’ hat gegeben ‘gave’
- Causative subordinate clause: weil ‘because’
- Simple evaluative terms: stark (geregnet) ‘(rained) heavily,’ sehr voll ‘very crowded,’ ziemlich lange ‘quite long,’ viel gelacht ‘laughed a lot’
- Everyday lexis (related to eating out); material processes, concrete participants: regnen ‘rain,’ spielen ‘play,’ bestellen ‘order,’ geben ‘give,’ Tisch reservieren ‘reserve a table,’ Kneipe finden ‘find a bar,’ auf einen Platz warten ‘wait to be seated,’ Kellner ‘waiter,’ Speisekarte ‘menu,’ Karten spielen ‘play cards,’ lachen ‘laugh’

Old C-Test:
- Present tense: es gibt ‘there is,’ bezahlen ‘pay,’ beginnen ‘start’
- Temporal subordinate clause: wenn ‘when’
- Only one instance of evaluation: gute (Hilfe) ‘good (help)’
- University-related lexis; numerals; relational, existential, material processes: Studenten ‘students,’ Universitäten ‘universities,’ zwanzig Jahre ‘twenty years,’ fünf ‘five,’ sieben Jahre ‘seven years,’ studieren ‘study’

Level 2 Texts

New C-Test:
- Simple past: beschrieb ‘described,’ verbrachte ‘spent,’ studiert ‘studied’
- Temporal structuring: von... bis ‘from... to,’ 1954, 1978
- Everyday lexis describing a person; verbs referring to significant life events: verheiratet ‘married,’ evangelisch ‘evangelical,’ kinderlos ‘childless,’ ostdeutsch ‘Eastern German’; geboren ‘born,’ an die Macht kommen ‘come to power,’ studieren ‘study’
- Evaluation: weltbekannte (Merkel) ‘world-famous (Merkel),’ größten (Teil) ‘biggest (part),’ erfolgreich (Abitur) ‘successful (school leaving examination)’

Old C-Test:
- Present tense
- Modality: könnte ‘could,’ sollte ‘should,’ müsste ‘would have to,’ kann ‘can’
- Simple clauses: primarily simple clauses (one exception)
- Nature- and environment-related lexis: radfahren ‘ride a bike,’ weniger Auto fahren ‘drive less,’ Wasser ‘water,’ Elektrizität benutzen ‘use electricity,’ Umweltbewusstsein fordern ‘require environmental awareness’
- Evaluation: so wenig wie möglich ‘as little as possible,’ natürlich ‘of course,’ nur ‘only,’ retten ‘save’

Level 3 Texts

New C-Test:
- Simple past: hungerten ‘starved,’ führten ‘led to,’ kamen ‘came,’ lag ‘was due to’
- Simple clauses: primarily simple clauses
- Intraclausal causative structures: zum Hunger kam ‘along with hunger came,’ führten zu ‘led to’
- Nominalisations: Hilfsmaßnahmen ‘aid efforts,’ Hilfssendungen ‘relief consignment,’ Verbesserung ‘improvement’
- Domain-specific lexis related to the Nachkriegszeit ‘postwar period’: Ernährungslage ‘food situation,’ Hunger ‘hunger,’ Mangel an Wohnraum und Kälte ‘shortage of housing and cold’
- Idiomatic expressions: es geht um ‘it is about,’ lag an ‘it was due to’
- Evaluation: vor allem ‘mainly,’ primär ‘primarily,’ insbesondere ‘in particular,’ doch ‘after all,’ katastrophal ‘disastrous,’ Mangel ‘shortage’

Old C-Test:
- Simple past: gab ‘gave,’ hatte ‘had’
- Chronological structures: seit ‘since,’ von... bis ‘from... to,’ zwischen 1949–1961 ‘between 1949–1961’
- Comparative subordinate clause: während ‘while’
- Domain-specific lexis (history, economics): Wirtschaftssystem ‘economic system,’ Marktwirtschaft ‘market economy,’ zentrale Planung einführen ‘introduce central planning,’ Auswanderung verhindern ‘prevent emigration,’ Staaten ‘countries’
- Routine formulae: es gab ‘there was’ (2 times)
- No evaluation

Level 4 Texts

New C-Test:
- Present tense
- Mostly simple clauses
- Domain-specific lexis (economics, business, statistics): Schuldenkrise ‘debt crisis,’ Standort ‘headquarters,’ Beratungsgesellschaft ‘consulting firm,’ Investoren ‘investors,’ Arbeitsplätze schaffen ‘create jobs,’ ausländische Unternehmen ‘foreign organizations,’ die Infrastruktur ‘infrastructure,’ das Qualifikationsniveau der Arbeitskräfte ‘the qualification levels of the employees,’ das soziale Klima ‘social climate,’ die Zahl ‘number,’ ansteigen ‘grow,’ die Studie ‘the study’
- Evaluation: so viele... wie nie zuvor ‘so many... as never before,’ stieg deutlich an ‘grew considerably,’ besonders gute (Noten) ‘especially good (grades),’ vor allem ‘mainly’

Old C-Test:
- Present tense
- Indirect speech: viele Menschen sagen, Geld hätte für sie... ‘many people would say that money for them would…’
- Complex clauses (relative clauses, nominalized clauses, conditional clauses, infinitive clauses): diejenigen, die genug davon haben ‘those who have enough of it,’ Menschen, die zu wenig davon besitzen ‘people who have too little thereof’; wie wichtig Geld in unserer Gesellschaft ist und... wird daher vielen Menschen erst dann schmerzhaft bewusst ‘how important money is in our society... becomes painfully clear to many people only when…’; wenn das Konto überzogen worden ist ‘when an account is overdrawn’; was es bedeutet, nicht genug zu haben ‘what it means not to have enough of it’
- Primarily not domain-specific lexis, but abstract processes that refer to reflection and interpretation: Geld besitzen ‘possess money,’ Gesellschaft ‘society,’ sagen ‘say,’ zum Lebensthema werden ‘become a central theme in one’s life,’ schmerzhaft bewusst ‘painfully clear,’ besondere Bedeutung haben ‘have special meaning’
- Evaluation: keine besondere (Bedeutung) ‘no special meaning,’ bemerkenswerterweise ‘remarkably,’ zum großen (Lebensthema) ‘central theme (of life),’ wichtig ‘important,’ schmerzhaft (bewusst) ‘painfully (clear)’

Level 5 Texts

New C-Test:
- Hypothetical structures: hätte ‘would have,’ würden ‘would,’ sein mag ‘could be’
- Domain-specific lexis (European Union, politics): Euro einführen ‘introduce the euro,’ demokratische Teilnahme ‘democratic participation,’ Souveränität aufgeben ‘abandon sovereignty’
- Complex clauses
- Evaluation: den Vorzug ‘advantage,’ keine Chance haben ‘have no chance,’ mehr Teilnahme ‘more participation’

Old C-Test:
- Simple past: wuchs ‘grew,’ waren ‘were,’ nahm zu ‘increased’
- Hypothetical structures: würde ‘would’ (3 times)
- Domain-specific lexis (statistics, economics): Einführung des Euro ‘introduction of the euro,’ gemeinsame Währung ‘common currency,’ die Mark durch den Euro ersetzen ‘replace the German mark with the euro,’ Währungsumstellung ‘currency changeover,’ Stabilität der deutschen Wirtschaft ‘stability of the German economy,’ Umfrage ‘survey,’ zunehmen ‘increase,’ ein Viertel der Befragten ‘one quarter of the respondents,’ wachsen ‘grow’
- Complex clauses
- Evaluation: Skepsis ‘skepticism,’ dagegen ‘in contrast,’ die Sorge ‘concern,’ schaden ‘damage,’ so stark wie ‘as strong as’

APPENDIX B

Instructions for the C-Test

The C-test asks you to complete the second half of words which have been deleted at regular intervals throughout a series of otherwise intact texts. This i___ an exa___ sentence fr___ such a___ exam. As you complete the words, you recreate a meaningful text. However, in order to do so, you obviously have to know both the deleted words and the surrounding words, you have to understand the meaning conveyed by sentences within the text, and you have to understand the grammatical relationships expressed between particular words and between sentences. As such, the C-test presents you with a challenging language task. Accordingly, only very advanced language learners will be able to correctly answer all of the items in each C-test text.

Additional instructions that students can find at the top of the page where the actual test is taken:

• Please read the entire passage prior to filling the gaps for that section.
• The number of deleted letters is half (of the word) or half plus one.
• In compound words, e.g., Wörterbuch, only the second half of the second part will be deleted: Wörterbu____.
• You must provide only the letters that complete each word. Do not provide the entire word. For example, the correct response to da____ is s, NOT das.
• You may use appropriate accented characters such as Ä or ß. You may also use transliterations such as AE or SS.
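The deletion principle described in these instructions can be sketched programmatically. The following is a simplified, hypothetical illustration of our own — it gaps every second word, as the example sentence in the instructions suggests, and ignores the compound-word and intact-sentence rules that real C-test construction applies:

```python
def make_ctest_gaps(sentence):
    """Turn every second word into a C-test gap.

    The first half of the word is kept; the deleted second half --
    half the letters for even-length words, half plus one for
    odd-length words -- is replaced by underscores.
    """
    out = []
    for i, word in enumerate(sentence.split()):
        if i % 2 == 1 and len(word) > 1:
            deleted = (len(word) + 1) // 2  # half, or half plus one
            out.append(word[: len(word) - deleted] + "_" * deleted)
        else:
            out.append(word)
    return " ".join(out)

print(make_ctest_gaps("This is an example sentence from such an exam"))
```

In an actual C-test the number of underscores need not reveal the number of deleted letters; this sketch only shows how the half-plus-one deletion rule operates on a word sequence.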