
An Analysis of Published Tests of Writing Proficiency

Richard J. Stiggins and Nancy J. Bridgeford
Northwest Regional Educational Laboratory


Both language arts educators and educational testing specialists are well acquainted with differences of opinion regarding the best way to measure writing proficiency. Many educators contend that writing assessment should be based on actual demonstration of skill; that is, to properly evaluate writing proficiency, students must actually write (direct assessment of writing). Others argue that objective tests (indirect assessment of writing) can provide much useful information and can do so more efficiently and economically than writing sample tests.

Because the correlations between direct and indirect writing test scores, as well as the similarities and dissimilarities of each approach, have been addressed in depth elsewhere (Davis, Scriven, & Thomas, 1981; Stiggins, 1982), they will not be recounted here. However, there is no question that interest in the direct measurement of writing skill via writing exercises is growing, spurred by the development of new writing sample scoring procedures and the advocacy of direct assessment by such prominent national teacher organizations as the National Council of Teachers of English. Half of all statewide writing assessment programs now rely exclusively on writing samples, and most others use writing samples in combination with objective tests (Melton & McCready, 1981).

This research was completed by the Center for Performance Assessment under grant #400-80-0105 with the National Institute of Education (NIE) of the Department of Education. The opinions expressed in this report do not necessarily reflect the position, policy, or endorsement of NIE.


Nevertheless, nearly half of all statewide writing assessments still include objective tests, and many local districts rely principally on published objective tests in districtwide assessment of student writing skill. Consequently, objective tests remain a frequently used and important assessment method in many large-scale testing programs.

Therefore, though educators need to be apprised of developments in writing sample assessments, they also must remain aware of the nature and quality of published writing/language usage tests. If published tests are to be used effectively, they must be of high quality and carefully selected to match identified assessment purposes. With this consideration in mind, the authors undertook a review of available writing/language usage tests. Thirty-five test publishers were contacted, and technical information was obtained on 51 writing/language usage tests. Although this selection does not include all available tests, it does include tests from most major test publishing companies.

Each test was reviewed and profiled according to nine characteristics: test type, publication date, testing time, use of writing sample, mode of score interpretation, grade level, skills tested, and type(s) of reliability and validity reported. These profiles were then analyzed with four major questions in mind:

• What are the major characteristics of available tests?

• How have these tests changed over time?

• What role do writing samples have in published tests of writing?

• What is the psychometric quality of these tests?

It is important to note that the skills measured by any given test are taken to be those explicitly reported by the test publisher in the test's technical manual. We presume, based on this information, that test items in each instrument actually relate to the skill areas identified. However, that link was not explicitly verified in this analysis.

Overview of the Tests

The characteristics of the tests reviewed are summarized in Table I. On the basis of these data, several generalizations are warranted:

• The number of language arts/writing tests available from test publishers has grown steadily in the last 10 years.

• The tests are about equally divided between stand-alone tests and those included in multitest achievement batteries.

• The typical test takes less than an hour to administer.

• Approximately 30 percent of all tests developed in recent years include an optional writing sample.

• Most tests are norm referenced, but criterion-referenced interpretation (demonstration of mastery of specific writing skills) is a common option.

• The majority of tests are designed for students in elementary, junior high, and high school; fewer appear to be available for adults.

• Mechanics and usage are tested more frequently than other skills, but higher order skills (e.g., organization and composition) are covered also.

• Most technical manuals address internal consistency reliability and content validity; few address other aspects of reliability and validity.

Evolution of the Tests

Table I also reports information on changes in test characteristics over time. For simplicity and clarity, tests are categorized by publication date into 3-year intervals. Because the primary focus of this analysis was the relatively recent evolution of tests, tests developed prior to 1973 were categorized together. Most of the tests in this category were produced in the late 1960s and early 1970s.

Development of writing/language usage tests has accelerated in recent years. Two-thirds of the tests reviewed were published in the last 6 years, with more than half of those appearing in the past 3 years. The increase in writing test availability appears to parallel the highly publicized and widespread concern over declining basic skills documented by the National Assessment of Educational Progress and other assessment programs.

The type of test available to consumers (either stand-alone or battery) has undergone a marked shift in the past 10 years. In the sample of tests reviewed, tests published prior to 1973 were evenly divided between stand-alone measures and subtests of achievement batteries. From 1973 through 1978, the vast majority of language skills tests (80% in 1973-1975 and 67% in 1976-1978) were part of larger test batteries. During the last 3 years, however, publishers have reversed this trend. Since 1979, 63 percent of all the writing skills tests developed have been stand-alone tests. This change may relate to the fact (discussed below) that many of the newer tests are intended for college use, where assessment is conducted on a subject-by-subject basis for purposes such as placement.

Overall, testing time has remained constant during the last 3 years, even though the use of writing samples in published tests has risen markedly. One of every two tests developed since 1976 offers an optional writing sample. In fact, almost all of the 15 tests that include writing samples have appeared during the last 3 years.

Test score interpretation has also changed over the past 10 years. Prior to 1975, test developers relied almost totally on norm-referenced interpretation of scores. From 1976 to 1978, criterion-referenced interpretation emerged, as nearly three-quarters of the tests developed during those years included that option. In recent years, however, that trend has reversed, with a greater number of recently developed tests relying on norms.

No age group has been neglected in the race to construct new writing assessment instruments. For example, in the last 6 years, 14 tests were developed for use at elementary grade levels, 14 for use in junior high, 20 for high school, 9 for college, and 3 for adults. Though these tests are spread across age groups, the data show a marked increase in the past 3 years in the proportion of tests developed for college students. All 9 of the college-level tests reviewed were published in the last 3 years. It is important to note, however, that some of these tests are used in college admissions testing, and new forms of these tests are developed each year for security purposes.

What about the types of skills tested? Here another changing pattern emerges. More of the recently developed tests are measuring students' ability to recognize good sentence structure and to organize information. This change is accompanied by a corresponding decline in attention to spelling and mechanics. The emphasis on more complex composition skills (e.g., sentence structure, organization) may also relate to the increase in tests designed for older student populations and the consequent need to test more advanced writing skills.

Test quality, as judged by technical reports of test reliability and validity, offers an additional perspective on how tests have changed. First, consider reliability. Overall, the reporting of some data on reliability has declined. For example, during the past 3 years, only 79 percent of the tests provided estimates of internal consistency reliability, versus 93 percent in 1976-1978 and 100 percent in 1973-1975. During the last 6 years, a few test publishers have begun to report interrater reliability, an important consideration in the use of writing samples. But, in a rather surprising turnabout, a substantial number of new tests (21%) provide no information at all on reliability. Until the upswing in writing test development in 1976, 100 percent of the tests reported some reliability information to prospective purchasers. This issue is addressed in greater detail below.

Information on validity shows similar changes. Although recently developed tests address a wider variety of types of validity than did their predecessors, slightly over one-fourth of the tests published in the last 3 years report no information on validity, thereby providing consumers with little assurance that these tests are capable of accomplishing their intended purposes.


The Use of Writing Samples

Because of the apparent growth in the popularity of writing samples in recently published tests, a more detailed analysis was conducted of some of the characteristics of these tests. The results of that analysis are reported in Table II. As this table indicates, writing samples are most often an option with stand-alone tests and are rarely included as part of achievement batteries. Because test batteries are commonly used in elementary and junior high schools and seldom at higher levels, writing samples are less often gathered at lower grade levels. Optional writing samples are more common in high school and college tests.

Writing samples in the tests reviewed have these characteristics: First, over half the tests rely on only one writing exercise or prompt to assess student proficiency. Four tests require two writing samples, and three call for three or more. Two-thirds of the writing sample exercises are timed (these allow an average of about 30 minutes per exercise). About a third are untimed. Most test user guides provide instructions for local scoring of writing samples; only two publishers offer central scoring services. Moreover, scoring instructions in the manuals vary greatly and range from detailed guidebooks to cursory suggestions. The majority of tests provide instructions for holistic scoring, with a few offering guidance for analytical and/or primary trait scoring.¹ Also, only about a quarter of the technical manuals provide data on levels of rater consistency (interrater reliability) that can be expected using suggested scoring procedures.

Test Quality Issues

These results suggest a number of potentially positive trends as well as some problematic issues in the evolution of published tests of writing/language usage. The most obvious positive change is the rapid emergence of direct writing assessment options. Many test publishers are being responsive to the desires of language arts educators in expanding the array of measurement methods used to assess writing skill.

Second, the nature of skills tested in objective tests appears to be changing. Objective test items are being used to measure higher order writing skills (e.g., organization) more frequently. This is particularly true at higher grade levels. Certainly at the lower levels there is a need to assess the students' mastery of such foundational skills as spelling and mechanics; however, as higher order skills become more crucial, tests are needed to verify the development of those skills. Those tests are being developed.

Third, newer tests address more diverse aspects of reliability and validity in their technical manuals than did their predecessors. However, this apparently positive change must be tempered with the criticism that some test developers are reporting no reliability and/or validity information at all. Moreover, though the array of evidence of reliability and validity addressed is expanding, a great deal of additional improvement is needed. The technical adequacy of many tests included in this analysis has not been adequately demonstrated in the manuals available. Many reports do not meet commonly accepted psychometric standards as outlined in the American Psychological Association's (APA) Standards for Educational and Psychological Tests (1974). Although reasons for this are uncertain, it may be occurring because test publishers are rushing new tests into print before completing their own technical analysis of test characteristics.

In the APA Standards for reporting validity information, presentation of "evidence of validity for each inference for which the test is recommended" is considered essential. Moreover, "if validity for some suggested interpretation has not been investigated, that fact should be made clear" (p. 31). According to the Standards, that evidence is to cover criterion-related validity (concurrent, predictive), content validity, and construct validity.

TABLE II
Characteristics of Tests Using Writing Samples

Characteristic                     N      %
Type of Test
  Stand alone                     13     87
  Battery                          2     13
Grade Level Tested
  Elementary                       4     17
  Junior high                      3     13
  High school                      6     26
  College                         10     44
  Adult                            0      0
Number of Exercises
  1                                8     53
  2                                4     27
  3 or 4                           3     20
Timed Test
  Yes                             10     67
  No                               5     33
Scoring Responsibility
  Local                           15    100
  Central scoring option           2     13
Scoring Method
  Holistic                        12     80
  Analytical                       5     33
  Primary trait                    1      7
Interrater Reliability Reported
  Yes                              4     27
  No                              11     73



The extent to which these standards are being met among the tests reviewed in the study is revealed in Table III. The results are disappointing.

In reporting validity information, perhaps most surprising is the lack of evidence of construct validity (Table I) among the 51 tests. Although these tests are designed to measure at least some essential prerequisites of effective writing, few publishers report data correlating their test scores with scores achieved in actual writing samples. The reasons for inattention to this fundamental validity issue are not apparent. Procedures are well developed for using writing samples to generate reliable and valid writing skill scores, and the statistical treatments required to estimate the needed validity coefficients are straightforward. If objective tests of writing proficiency are to be regarded as valid by English educators, test developers must do everything in their power to establish the acceptability of this assessment option. Researchers are consistently obtaining correlations between objective test scores and writing sample scores in the order of .60 and .70 (Breland & Gaynor, 1979; Hogan & Mishler, 1980; Moss, Cole, & Khampalikit, 1982). We can only hope that test developers will follow the lead of these researchers and broaden the investigation of validity issues to more adequately address this fundamental aspect of construct validity.
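The validity coefficient referred to above is simply the product-moment correlation between students' scores on the objective test and their scores on a scored writing sample. As a minimal sketch of that computation (the scores below are hypothetical and are not drawn from any of the tests reviewed):

```python
# A minimal sketch (hypothetical data): estimating a criterion-related validity
# coefficient as the Pearson correlation between objective (indirect) test
# scores and holistically scored writing samples (direct assessment).
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

# Hypothetical scores for ten students.
objective_scores = [34, 28, 41, 25, 38, 30, 45, 27, 36, 33]  # objective test raw scores
essay_ratings    = [4, 3, 5, 2, 4, 3, 5, 3, 4, 4]            # holistic writing sample ratings

print(f"validity coefficient r = {pearson_r(objective_scores, essay_ratings):.2f}")
```

With real data, a coefficient in the .60 to .70 range reported by the researchers cited above would offer moderate support for the objective test as an indirect measure of writing skill.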

APA Standards also require that test developers report parallel forms, internal consistency, and test/retest reliability estimates for objective test scores. Nearly all the tests reviewed report internal consistency estimates and standard errors of measurement. In addition, equivalence of test forms is generally addressed for most of those tests offered in parallel forms. However, as Table III illustrates, only 11 of the 51 tests have been analyzed from a test/retest perspective, and only 8 address all three types of reliability. Moreover, 10 percent of all tests and 21 percent of the newest tests (developed since 1978) report no reliability information at all.
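For readers less familiar with these indices, the sketch below shows one common way an internal consistency estimate (Cronbach's alpha, equivalent to KR-20 for right/wrong items) and the associated standard error of measurement might be computed. The item responses are hypothetical, and the code is illustrative rather than a description of any publisher's procedures.

```python
# A minimal sketch (hypothetical data): Cronbach's alpha and the standard error
# of measurement (SEM) for a short objective test scored 1 = correct, 0 = wrong.
from statistics import pstdev, pvariance

# Rows = examinees, columns = items (hypothetical responses).
responses = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]

k = len(responses[0])                          # number of items
totals = [sum(row) for row in responses]       # total score per examinee

item_variances = [pvariance([row[i] for row in responses]) for i in range(k)]
alpha = (k / (k - 1)) * (1 - sum(item_variances) / pvariance(totals))

# SEM = standard deviation of total scores times the square root of (1 - reliability).
sem = pstdev(totals) * (1 - alpha) ** 0.5

print(f"Cronbach's alpha = {alpha:.2f}")
print(f"standard error of measurement = {sem:.2f}")
```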

TABLE III
Frequency of Reporting Combinations of Validity and Reliability

Combination                                             N      %
Validity
  Content only                                         23     45
  Content, concurrent                                   7     14
  Concurrent only                                       5     10
  Content, predictive, construct                        4      8
  Content, concurrent, predictive, construct            3      6
  Concurrent, predictive                                2      4
  Content, predictive                                   1      2
  No validity reported                                  6     11
Reliability
  Internal consistency only                            25     49
  Internal consistency, parallel forms, test/retest     8     15
  Internal consistency, parallel forms                  6     11
  Internal consistency, interrater                      2      4
  Internal consistency, test/retest                     2      4
  Parallel forms only                                   2      4
  Internal consistency, test/retest, interrater         1      2
  No reliability reported                               5     10

Interrater reliability, which has gained importance with the development of direct writing assessments, is also overlooked by 75 percent of the test developers offering writing samples as an option. This may mean that these test developers have yet to verify the fact that their exercises and recommended scoring procedures will yield accurate and consistent scores. In defense of this omission, publishers might contend that local test users actually do the scoring and that local districts are, therefore, responsible for verifying the reliability of ratings. Though this may be true, it seems essential that publishers verify (with data) the fact that examiners who follow the publishers' suggested scoring procedures and use proper training of raters can expect to obtain reliable scores.
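As a sketch of the kind of evidence a publisher or local district could report, the code below computes two simple indices of rater consistency for a set of hypothetical holistic ratings: the interrater correlation and the percentage of exact agreement. (The statistics.correlation function used here requires Python 3.10 or later; the ratings are invented for illustration.)

```python
# A minimal sketch (hypothetical ratings): two simple checks of rater consistency
# for writing samples scored independently by two trained raters on a 1-6 scale.
from statistics import correlation

rater_a = [4, 3, 5, 2, 4, 6, 3, 5, 2, 4]
rater_b = [4, 3, 4, 2, 5, 6, 3, 5, 3, 4]

interrater_r = correlation(rater_a, rater_b)                      # Pearson r between raters
agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

print(f"interrater correlation = {interrater_r:.2f}")
print(f"exact agreement        = {agreement:.0%}")
```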

Summary and Conclusions

Despite important developments in the use of writing samples to measure writing proficiency, objective tests of language usage and writing skill continue to be used extensively in school assessment programs. The foregoing discussion summarizes some of the key characteristics of many of the published tests available to educators. On the basis of this analysis, the following conclusions seem warranted. First, published objective writing tests are changing in important and positive directions; both the skills being tested and the assessment methods used appear to be evolving in response to developments in the field. It is especially important to note that writing samples are becoming a common component of published tests of writing skill. Second, the documentation of the technical adequacy of many tests is inadequate; test developers could do a great deal more to verify the psychometric adequacy of these tests.


Those concerned with the measurement of writing proficiency are urged to carefully observe the continuing evolution of these tests. Objective tests will remain an important part of writing assessment for the foreseeable future. These tests will provide sound assessments of writing skill only if test developers produce tests of demonstrated quality, and those tests are evaluated and carefully selected to meet specific assessment needs.

Users and prospective users of tests of writing proficiency can influence the development of psychometrically sound tests by refusing to purchase and use those tests for which adequate evidence of reliability and validity is not available.

References

American Psychological Association. Standards for educational and psychological tests. Washington, D.C.: Author, 1974.

Breland, H. M., & Gaynor, J. L. A comparison of direct and indirect assessments of writing skill. Journal of Educational Measurement, 1979, 16, 119-128.

Davis, B. G., Scriven, M., & Thomas, S. The evaluation of composition instruction. Pt. Reyes, Calif.: Edgepress, 1981.

Hogan, T. P., & Mishler, C. Relationships between essay tests and objective tests of language skills for elementary school students. Journal of Educational Measurement, 1980, 17, 219-227.

Melton, V., & McCready, M. Survey of large-scale writing assessment programs. Ruston, La.: Louisiana Technological University, 1981.

Moss, P. A., Cole, N. S., & Khampalikit, C. A comparison of procedures to assess written language skills in grades 10, 7, and 4. Journal of Educational Measurement, 1982, 19, 37-47.

Stiggins, R. J. An analysis of direct and indirect writing assessment methods. Research in the Teaching of English, 1982, 16(2), 57-69.

Footnote

¹ Holistic scoring calls for the reader to rate overall writing performance on a single rating scale. Analytical scoring breaks performance down into component parts (e.g., organization, wording, ideas) for rating on multiple scales. And primary trait scoring requires rating of attributes of performance unique to a particular audience and writing purpose (e.g., persuasiveness, awareness of audience).
