
Using Web-Based Testing for Large-Scale Assessment

By Laura S. Hamilton, Stephen P. Klein, and William Lorié

RAND Education


Building on more than 25 years of research and evaluation work, RAND Education has as its mission the improvement of educational policy and practice in formal and informal settings from early childhood on.


ACKNOWLEDGMENTS

This work was supported by the National Science Foundation, Division of Elementary, Secondary, and Informal Education, under grant ESI-9813981. We are grateful to Tom Glennan for suggestions that improved this paper.


TABLE OF CONTENTS

I. INTRODUCTION

II. THE CONTEXT OF LARGE-SCALE TESTING
    How Tests Are Used
    How Testing Is Done

III. A NEW TECHNOLOGY OF TESTING
    Adaptive Versus Linear Administration
    Item Format
    Possible Scenario for Web-Based Test Administration
    Potential Advantages over Paper-and-Pencil Testing

IV. ISSUES AND CONCERNS WITH COMPUTERIZED ADAPTIVE TESTING
    Psychometric Issues Related to CATs
    Item Bank Development and Management

V. ISSUES AND CONCERNS WITH WEB-BASED TESTING
    Infrastructure
    Human Capital
    Costs and Charges
    Reporting Results

VI. CONCLUSION

REFERENCES


I. INTRODUCTION

Efforts to improve the quality of education in the United States increasingly emphasize the need for high-stakes achievement testing. The availability of valid, reliable, and cost-effective measures of achievement is critical to the success of many reform efforts, including those that seek to motivate and reward school personnel and students. Accurate measurement of student achievement is also important for initiatives, such as vouchers and charter schools, that require publicly available information about the academic performance of schools. This paper begins with a brief discussion of the context surrounding large-scale (e.g., statewide) achievement testing. We then describe a new approach to assessment that we believe holds promise for reshaping the way achievement is measured. This approach uses tests that are delivered to students over the Internet and are tailored (“adapted”) to each student’s own level of proficiency.

We anticipate that this paper will be of interest to policymakers, educators, and test developers who are charged with improving the measurement of student achievement. We are not advocating the wholesale replacement of all current paper-and-pencil measures with web-based testing. However, we believe that current trends toward greater use of high-stakes tests and the increasing presence of technology in the classroom will lead assessment in this direction. Indeed, systems similar to those we describe in this report are already operational in several U.S. school districts and in other countries. Furthermore, we believe that although web-based testing holds promise for improving the way achievement is measured, a number of factors may limit its usefulness or potentially lead to undesirable outcomes. It is therefore imperative that the benefits and limitations of this form of testing be explored and the potential consequences be understood. The purpose of this paper is to stimulate discussion and research that will address the many issues raised by a shift toward web-based testing.

After presenting a brief background on large-scale testing, we describe the new technology of testing and illustrate it with an example. We then discuss a set of issues that need to be investigated. Our list is not exhaustive, and we do not provide answers to the many questions we raise. Instead, we hope that this discussion reveals the critical need for cross-disciplinary research to enhance the likelihood that the coming shift to an emphasis on web-based testing will truly benefit students.

II. THE CONTEXT OF LARGE-SCALE TESTING

Testing is closely linked with the current emphasis on standards-based reform and accountability. Nearly every state in the United States has adopted academic standards in four core subjects—English, mathematics, science, and social studies. Most states assess achievement toward these standards in at least reading and mathematics at one or more grade levels. The grade levels at which tests are administered and the number of subjects tested continue to increase. Some states also are exploring the use of open-ended test items (e.g., essays) rather than relying solely on the less-expensive multiple-choice format. As a result of these trends, the cost of assessment for states has doubled from approximately $165 million in 1996 to $330 million in 2000, in constant dollars (Achieve, Inc., 1999). California alone tested over four million students in 1999 using the commercially available Stanford 9 tests.

How Tests Are Used

State and district test results are used to make decisions about schools and teachers. For example, in recent years, state policymakers have instituted high-stakes accountability systems for schools and districts by tying various rewards and sanctions (e.g., extra funds for the school or reassignment of staff) to student achievement. These accountability systems typically involve disseminating results to the public. This puts pressure on lagging schools to improve their performance. In 1999, 36 states issued school-level “report cards” to the public (Achieve, Inc., 1999).

Test scores are also used to make important decisions about individual students. Several states (including New York, California, and Massachusetts) are developing high school exit examinations that students must pass to earn diplomas. Many of the nation’s large school districts have adopted policies that tie promotion from one grade to the next to performance on district or state tests. The use of test scores for tracking, promotion, and graduation is on the rise and suggests that the need for valid and reliable data on individual student achievement will continue to grow (National Research Council, 1998).


Teachers also develop and adopt tests that they use for instructional feedback and to assign grades. In this paper, we focus on externally mandated, large-scale tests rather than these classroom assessments, though both forms of measurement have a significant impact on students.

How Testing Is Done

Many large-scale testing programs purchase their tests from commercial publishers. The three largest companies are Harcourt Educational Measurement, which publishes the Stanford Achievement Tests; CTB/McGraw-Hill, which publishes the Terra Nova; and Riverside Publishing, which publishes the Iowa Tests of Basic Skills (ITBS). In some cases, these publishers have adapted their materials to accommodate the needs of particular testing programs, such as by adding items that are aligned with a state’s content standards. Typically, however, states and districts purchase these “off-the-shelf” materials and use them as is, even when it is evident that their curricula are not aligned especially well with the test’s content.

The other main approach to large-scale testing is the use of measures developed by states or districts. For example, several states have developed tests that are designed to reflect their own content and performance standards, and others have plans to do this in the future. Some of the state-developed tests include constructed-response or performance-based items that are intended to measure important aspects of curriculum standards that are difficult to assess well with multiple-choice tests. Although there are some commercially available constructed-response tests, most of the states that use this type of item have developed their own measures, typically with the help of an outside contractor.

Despite the diversity of tests administered by states and districts, most large-scale testing programs share several common features. All state programs and most district programs currently rely heavily on paper-and-pencil exams, although a few districts also use some hands-on measures, such as in science. Almost all emphasize multiple-choice items, and many rely solely on this format. Tests are typically administered once per year, in the spring, with results generally released in early to late summer. Finally, many programs stagger subjects across grade levels; e.g., math and social studies in grades 4 and 7, reading and science in grades 5 and 8. However, a few states, such as California, test every student every year in almost every core subject, and the general trend is toward increasing the number of grade levels tested. The amount of time and resources devoted to testing is often a point of friction between those who want to measure student progress and those who are concerned about taking class time away from instruction for testing (and for preparing students to take the tests).

There are several limitations to the current approach to large-scale assessment. First, the reliance on paper-and-pencil multiple-choice tests limits the kinds of skills that can be measured. For this reason, many states and districts have experimented with other formats, such as hands-on testing, but these can be very expensive (Stecher & Klein, 1997) and do not necessarily measure the constructs that their developers intended (Hamilton, Nussbaum, & Snow, 1997). A second limitation arises from the lag time between test administration and score reporting. Answer sheets typically must be sent to an outside vendor for scoring, and the test results must then be linked with school and district data systems. This process takes even longer if items cannot be machine-scored, such as when open-response questions are used. Consequently, students, parents, and teachers generally do not receive scores from tests administered in the spring until the summer or fall, which severely limits the usefulness of the results for guiding instruction. A third problem is the distinct possibility of security breaches undermining the validity of the test when the same questions are repeated across years (Linn, Graue, & Sanders, 1990; Koretz & Barron, 1998). Security also can be a problem when the results are used to make high-stakes decisions for teachers and schools, regardless of whether questions are changed each year (Linn, 2000).

Other limitations to the typical approach to testing relate to the integration between assessment and instruction. Statewide tests are typically administered apart from the regular curriculum, so students may perceive these tests as disconnected from their everyday school experiences. Consequently, they may not be sufficiently motivated to perform their best, which in turn compromises the validity of results. In addition, separating classroom instruction from assessment leads to the perception that students spend too much time taking tests, even if the amount of time devoted to testing is minimal. This occurs in part because the time spent testing is often considered as time that is taken away from instruction. A related concern is the narrowing of curriculum that often occurs as a result of high-stakes testing. There is a tendency for teachers to focus on the topics that are tested and to neglect those that are not (Kellaghan & Madaus, 1991; Madaus, 1988; Stecher et al., 1998). Thus, although the purpose of the test may be to monitor progress of schools, the test is likely to have a significant influence on instruction if there are high stakes attached to the scores. Finally, the need to develop or adopt assessments that are aligned with state and local standards means that existing tests may not be suitable for many schools. This creates problems for generalizing results from one jurisdiction to another.

Although the traditional paper-and-pencil standardized multiple-choice test continues to be the norm, a few districts have recently experimented with computer-based testing. Advances in psychometrics and information technology are likely to accelerate the adoption of this approach, particularly when the tests are administered (or downloaded into the school) via the Internet (Klein & Hamilton, 1999). We believe that this form of testing may address many of the problems discussed above, though not all. In the next section, we describe this approach and discuss some of its advantages.

III. A NEW TECHNOLOGY OF TESTING

The role of information technology in virtually every type of educational enterprise is growing rapidly. Educational assessment is no exception. Several well-known tests are now administered via the computer, including the Graduate Record Exam (GRE), the Graduate Management Admissions Test (GMAT), and the Medical Licensing Examination. The high speed and large storage capacities of today’s computers, coupled with their rapidly shrinking costs, make computerized testing a promising alternative to traditional paper-and-pencil measures. Although computers are now used widely for large-scale, high-stakes admissions and licensing exams, their use in the K–12 market is still quite limited. In addition, most existing computerized assessments for K–12 students are administered locally on a stand-alone workstation rather than over the Internet. However, we expect this will change within a few years.

The next sections of this paper discuss some relevant technical issues. This is followed by a description of the kind of system we envision and a scenario for how it might be implemented. We then discuss the advantages of this approach. Later we address the issues and concerns that are likely to arise as we move toward this new form of testing.

As mentioned earlier, we are primarily concerned with the type of large-scale testing that is conducted at the district and state levels, rather than with the tests that teachers use for instructional feedback. However, the effects of large-scale tests on instruction must be considered because we know that high-stakes assessment influences what happens in the classroom. Furthermore, information technology may offer opportunities to create a closer link between large-scale assessment and instruction, so it is worth considering these tests’ instructional effects as well. The computer and the Internet obviously offer promising new approaches for teacher-made tests, but this is beyond the scope of this paper.


The computerized testing approach we discuss below has three main features. First, items are administered adaptively. Second, the system makes use of several different types of questions, including selected-response (e.g., multiple-choice) and constructed-response items (e.g., short-answer or fill-in-the-blank questions). Third, the assessment is administered via the Internet rather than relying solely on stand-alone workstations.

Adaptive Versus Linear Administration

Most paper-and-pencil tests present items in a linear fashion—that is, items are administered sequentially in a predefined order, and all students are asked the same questions within a given “form” (version) of the test. Students may skip items and go back, but the order of presentation is constant. Some computerized tests are also linear. However, technology provides the opportunity to allow examinee responses to influence the difficulty of the questions the student is asked. This is known as adaptive testing. In this type of testing, the examinee responds to an item (or set of items). If the examinee does well on the item(s), then the examinee is asked more-difficult items. If the examinee does not answer the item(s) correctly, the examinee is asked easier items. This process continues until the examinee’s performance level is determined. Because information about the difficulty of each item is stored in the computer, the examinee’s “score” is affected by the difficulty of the items the examinee is able to answer correctly.

Computers permit this type of interactive testing to be conducted in a rapid and sophisticated manner. Because items can be scored automatically and immediately, the computer can select the next item almost instantly after an examinee responds to the previous one. Current computerized adaptive testing systems, or CATs, use item response theory (IRT) to estimate examinee proficiency and determine item selection. The length of a CAT may be specified in advance or it may be based on the examinee’s responses. With the latter type of CAT, the computer stops administering items once the examinee’s proficiency has been estimated to some prespecified degree of precision.
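
To make these mechanics concrete, the sketch below shows one way such a select–score–update–check loop could be implemented. It assumes a one-parameter (Rasch) IRT model, maximum-information item selection, and a precision-based stopping rule; the item bank, function names, and threshold values are illustrative assumptions for this discussion, not a description of any operational system.

```python
import math
import random

def prob_correct(theta, difficulty):
    """Rasch model: probability of a correct response at proficiency theta."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def item_information(theta, difficulty):
    """Fisher information an item contributes at theta (p * (1 - p) under the Rasch model)."""
    p = prob_correct(theta, difficulty)
    return p * (1.0 - p)

def update_theta(theta, responses):
    """One Newton-Raphson step toward the maximum-likelihood proficiency estimate."""
    gradient = sum(score - prob_correct(theta, b) for b, score in responses)
    information = sum(item_information(theta, b) for b, _ in responses)
    theta = theta + gradient / max(information, 1e-6)
    return max(-4.0, min(4.0, theta))  # keep the estimate within a plausible range

def run_cat(bank, answer_item, se_target=0.35, max_items=30):
    """Administer items adaptively until the precision target or the item cap is reached."""
    theta, responses, used = 0.0, [], set()
    se = float("inf")
    while len(responses) < max_items:
        # Choose the unused item that is most informative at the current estimate.
        item_id = max((i for i in bank if i not in used),
                      key=lambda i: item_information(theta, bank[i]))
        used.add(item_id)
        score = answer_item(item_id)            # 1 if correct, 0 if incorrect
        responses.append((bank[item_id], score))
        theta = update_theta(theta, responses)
        se = 1.0 / math.sqrt(sum(item_information(theta, b) for b, _ in responses))
        if se < se_target:                      # stop once proficiency is estimated precisely enough
            break
    return theta, se, len(responses)

if __name__ == "__main__":
    # Illustrative bank: item difficulties spread evenly from -3.0 to +3.0.
    bank = {f"item{k:02d}": -3.0 + 0.1 * k for k in range(61)}
    true_theta = 1.2
    simulated_student = lambda item_id: int(random.random() < prob_correct(true_theta, bank[item_id]))
    print(run_cat(bank, simulated_student))
```

Operational CAT engines typically add more refined estimators, content balancing, and exposure control, but the basic cycle of selecting an item, scoring it, updating the proficiency estimate, and checking a stopping rule is the same.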

Item Format

Currently, most CAT systems rely on multiple-choice items; i.e., questions in which the examinee selects one choice from among four or five alternatives. These selected-response items are commonly used in large-scale testing because the answers to them can be machine-scored, which minimizes costs. Computers also can accommodate selected-response items that vary from the standard four- or five-option multiple-choice item. For example, examinees might be asked to select a subset of choices from a list.

Many large-scale testing programs that traditionally have relied on selected-response items are now exploring the use of items that require examinees to generate their own answers—these are called constructed-response items. Computers can accommodate a wide variety of such items. For example, students may be asked to move or organize objects on the screen (e.g., on a history test, put events in the order in which they occurred). Other tests may involve students using the computer to prepare essay answers or other products. Some constructed-response items may be machine scored, particularly if the responses are brief and straightforward (e.g., a numerical response to a math problem). Researchers are exploring the use of computerized essay scoring, and it is likely that future generations of examinees will take tests that involve automatic scoring of extended responses. Currently, however, most constructed responses must be scored by trained readers.
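
As a rough illustration of how an item bank might distinguish these formats internally, the sketch below defines two simple record types. The field names and scoring rule are our own assumptions made for illustration, not those of any existing system.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SelectedResponseItem:
    """Multiple-choice or choose-all-that-apply question; always machine-scorable."""
    item_id: str
    stem: str
    options: list            # answer choices shown to the examinee
    correct: set             # indices of the correct option(s)
    difficulty: float        # calibrated difficulty used for adaptive selection

@dataclass
class ConstructedResponseItem:
    """Short-answer, ordering, or essay question; may require human scoring."""
    item_id: str
    prompt: str
    difficulty: float
    machine_scorable: bool = False    # e.g., True for a single numeric answer to a math problem
    answer_key: Optional[str] = None  # exact-match key when machine-scorable
    rubric: Optional[str] = None      # scoring guide for trained readers otherwise

def score_selected(item: SelectedResponseItem, chosen: set) -> int:
    """Dichotomous scoring: credit only if the examinee chose exactly the correct set."""
    return int(chosen == item.correct)
```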

Possible Scenario for Web-Based Test Administration

Clearly, computers offer the possibility of radically changing the way students take tests. There are many ways this could happen. We discuss below one scenario for computerized testing, involving the delivery of assessment items adaptively and the collection of student response data over the Internet. It is important to recognize that the adoption of this or other computer-based assessment systems is likely to be gradual: It will evolve over a period of time during which old and new approaches are used simultaneously. Thus, there will be efforts to ensure the comparability of results from web-based and paper-and-pencil systems. We return to this problem in a later section.

In the scenario we envision, a large set of test items is maintained on a central server. This “item bank” contains thousands of questions per subject area, covering a wide range of topics and difficulty levels. For example, a math item bank would include items covering numbers and operations, algebra, geometry, statistics, and other topics. Within each area, questions range from extremely easy to very difficult. The questions also are drawn from a wide range of grade levels.

On the day of testing, the entire item bank (or a large portion of it) for the subject being tested (e.g., science) is downloaded to a school from the central server. Students take the test in their classrooms or in the school’s computer lab. The items are administered adaptively, so each response leads to a revised estimate of the student’s proficiency and a decision either to stop testing or to administer an additional item that is harder or easier than the previous one. The student’s final score is computed almost instantly and is uploaded to a centralized data file. Scores may be given to students that same day so they know how well they did. Students complete the testing program several times a year rather than the “spring only” approach that is typical of current statewide testing programs. Each student has a unique identifier, so students’ progress can be monitored even if they change classrooms or schools.
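
A stripped-down sketch of this day-of-testing data flow appears below. The “central server” is just an in-memory stand-in, and the record fields (student identifier, score, standard error) are assumptions chosen to match the scenario; a real system would add authentication, encryption, and a secure transport layer.

```python
import json
from datetime import date

# Hypothetical central store standing in for the state or district server.
CENTRAL_ITEM_BANKS = {
    "science": [{"item_id": "sci-001", "difficulty": -1.2},
                {"item_id": "sci-002", "difficulty": 0.4}],
}
CENTRAL_RESULTS = []

def download_item_bank(subject):
    """Pull the subject's bank down to the school server (simulated here as a deep copy)."""
    return json.loads(json.dumps(CENTRAL_ITEM_BANKS[subject]))

def upload_result(student_id, subject, theta, se):
    """Push one scored administration back up, keyed to the student's unique identifier
    so progress can be tracked across classrooms, schools, and test dates."""
    CENTRAL_RESULTS.append({
        "student_id": student_id,
        "subject": subject,
        "test_date": date.today().isoformat(),
        "score": theta,
        "standard_error": se,
    })

if __name__ == "__main__":
    bank = download_item_bank("science")
    # ... administer the items adaptively here (see the earlier CAT sketch) ...
    upload_result(student_id="S-0042", subject="science", theta=0.8, se=0.33)
    print(CENTRAL_RESULTS)
```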

The scores can be used to inform several decisions. Policymakers and staff in district or state offices of education may use the results to monitor achievement across schools and provide rewards or interventions. Results also may provide evidence regarding the effectiveness of various educational programs and curriculum materials.

Most large-scale assessments are used for external monitoring and accountability purposes. They are rarely if ever used for instructional feedback. Nevertheless, the greater frequency of administration and the prompt availability of results from a computer-based system may enable teachers to use the scores to assign grades, modify their instruction in response to common problems or misconceptions that arise, and provide individualized instruction that is tailored to student needs. Similarly, principals may use the results to monitor student progress across different classrooms.

Potential Advantages over Paper-and-Pencil Testing

A computerized-adaptive testing system that is Internet-based offers several advantages over paper-and-pencil multiple-choice tests. Some of these benefits arise from the use of computers, and adaptive administration in particular, whereas others derive from delivering the tests over the Internet. These benefits are discussed in turn below.

Benefits of Computerized Adaptive Testing

One of the major advantages of CAT is decreased testing time. Because the difficulty of the questions a student is asked is tailored to that student’s proficiency level, students do not waste time on questions that are much too easy or too difficult for them. It takes many fewer items to achieve a desired level of score precision using CAT than using a standard multiple-choice test (see, e.g., Bunderson et al., 1989). This not only saves time, but it may minimize student frustration, boredom, and test anxiety. Similarly, this method reduces the likelihood of ceiling and floor effects that occur when a test is too easy or too hard for a student, thereby providing a more accurate measurement for these students than is obtained when the same set of questions is administered to all students. The use of computers may also reduce costs because the hardware can be used for other instructional purposes rather than being dedicated solely to the testing function. Whether such savings are realized depends on a number of factors that we discuss later.
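
The precision claim can be stated more formally using a standard IRT relationship; the notation below follows common usage and is not drawn from this paper. The standard error of a proficiency estimate shrinks as the information contributed by the administered items accumulates, and adaptive selection makes each item contribute close to its maximum information.

```latex
% Test information is the sum of the item information functions for the
% items actually administered; the standard error of the proficiency
% estimate falls as information accumulates.
\[
  I(\theta) = \sum_{i=1}^{n} I_i(\theta),
  \qquad
  \mathrm{SE}(\hat{\theta}) \approx \frac{1}{\sqrt{I(\hat{\theta})}} .
\]
% Because an adaptive test picks each item to be highly informative near the
% current estimate of theta, a target standard error is reached with fewer
% items than on a fixed form, whose items are most informative only for
% examinees near the middle of the proficiency range.
```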

Another potential benefit is improved test security. Because each student within a classroom takes a different test (i.e., one that is tailored to that student’s proficiency level) and because the bank from which the questions are drawn contains several thousand items, there is little risk of students being exposed to items in advance or of teachers coaching their students on specific items before or during the testing session. As software to increase test security becomes more widely available, the risk of unauthorized access to test items and results should diminish, thereby further increasing the validity of test results.

CATs are particularly useful for evaluating growth over time. Progress can be measured on a continuous scale that is not tied to grade levels. This scale enables teachers and parents to track changes in students’ proficiency during the school year and across school years, both within and across content areas. Students take different items on different occasions, so scores are generally not affected by exposure to specific items. Thus, the test can be administered several times during the year without threatening the validity of the results.[1] This offers much greater potential for the results to have a positive influence on instruction than is currently available in the typical one-time-only spring test administration schedule. CATs can also accommodate the testing of students who transferred into the school during the year, those who may have been absent on the scheduled testing date, and those with learning and other disabilities who may require additional time, large type, or other testing accommodations. Several school systems serving special populations, such as the Juvenile Court and Community Schools in Los Angeles, have adopted CAT systems to address the widely varying ability levels and extremely high mobility rates of their students.

[1] Item exposure is a topic of growing interest among psychometricians and test publishers, and its effects need to be considered when developing item banks and testing schedules. New methods have been devised to control exposure so that CATs can be administered multiple times to the same examinees, though none eliminates the risk completely (see, e.g., Revuelta & Ponsoda, 1998).

Finally, computer-based testing offers the opportunity to develop new types of questions, especially those that can assess complex problem-solving skills. For example, students can observe the effects on plant growth of various amounts of water, types of fertilizer, and exposure to sunlight in order to make inferences about the relationships among these factors. Several development efforts are currently under way to utilize technology by developing innovative tasks, including essay questions that can be machine-scored, simulations of laboratory science experiments, and other forms of constructed-response items that require students to produce, rather than just select, their answers. Many of these efforts have sought to incorporate multimedia technology to expand the range of activities in which students can engage. Bennett (1998) describes some of the possibilities offered by computer technology, such as the use of multimedia to present films and broadcasts as artifacts for analysis on a history test.

Computerized assessments are especially appropriate for evaluating progress in areas where computers are used frequently, such as writing. Russell and Haney (1997), for example, found that students who were accustomed to using computers in their classes performed better on a writing test when they could use computers rather than paper and pencil. Students using computers wrote more and organized their compositions better. As instruction comes to depend more heavily on technology, assessment will need to follow in order to provide valid measurement that is aligned with curriculum (Russell & Haney, 2000).

Benefits of Web Administration

Web-based tests offer efficient and inexpensive scoring. Scoring is done on-line, eliminating the need for packaging, shipping, and processing of answer sheets. Students could be given their results immediately after completing the tests. A web-based system would allow all records to be stored automatically at a central location, facilitating the production of score summaries. Norms could be constantly updated, and analyses of results could be done quickly and efficiently. Teachers would have results in time to incorporate the information into their instruction. Teachers could also use results for assigning grades, so that students would be motivated to do well on the tests.

There are clear benefits to maintaining the testing software and item banks in a central location so they can be downloaded onto school computers via the Internet. Economies of scale would be achieved by refreshing the item bank from a central location. New questions could easily be inserted into existing tests to gather the data on them for determining their difficulty and whether they would be appropriate for operational use. Updating of software is done centrally rather than locally, so there is no need for expensive hardware and software at the school site. Moreover, downloading the bank (or a portion of it) onto a local server and uploading the results daily avoids the delays in computer response times that might otherwise arise if students were connected directly to the main server. Thus, administering tests via the web addresses several logistical and technical problems associated with both paper-and-pencil testing and computer-based testing on local workstations.

IV. ISSUES AND CONCERNS WITH COMPUTERIZED ADAPTIVE TESTING

Despite the many potential advantages of CATs, a number of issues must be resolved before a computerized testing program can be implemented on a large scale. This section discusses several of these. In the next section, we discuss some of the additional issues associated with administering CATs over the web. We do not attempt to resolve these issues. Instead, this discussion is intended to help formulate a research agenda that will support the future development of web-based CATs.

Psychometric Issues Related to CATs

The use of CATs raises a host of psychometric concerns that have been the focus of intense discussion in the psychometric community over the last decade. Some of the major issues that have been examined are summarized below, but the discussion is not exhaustive.

Does the Medium Matter?

Do CATs function differently from paper-and-pencil tests that contain the same items? CATs reduce certain kinds of low-frequency error associated with paper-and-pencil testing, such as answer sheet/item number mismatches, distractions from other items on the printed page, and errors made by scanners. However, CATs may introduce other kinds of errors. For example, a reading comprehension item that has the passage and questions on separate screens might measure different skills than the traditional one with the passage and questions on the same or facing pages. Using multiple screens places a heavy emphasis on an examinee’s ability to recall a passage, or part of one, presented on a previous screen (Bunderson et al., 1989). CATs also may place heavier demands on certain skills, such as typing, thereby changing the nature of what is tested.

In general, studies have shown that computer-based and paper-and-pencil tests tend to function similarly, at least in the multiple-choice format (Bunderson et al., 1989; Mead & Drasgow, 1993; Perkins, 1993; Segall et al., 1997; Zandvliet & Farragher, 1997). The two methods produce similar statistical distributions (means, standard deviations, reliabilities, and standard errors of measurement) and are comparable in their predictive validity. However, there are still possible threats to comparability.

The dependence of test scores on keyboarding speed for one medium of test administration but not another is one potential threat. A few studies have examined relationships between specific types and amounts of experience with computers and performance on tests. For example, Russell (1999) found that keyboarding speed was a good predictor of students’ scores on open-ended language arts and science tests taken from the Massachusetts Comprehensive Assessment System and the National Assessment of Educational Progress (NAEP). However, controlling for keyboarding speed, computer experience was not related to test performance. In contrast, scores on an open-ended math test were only weakly predicted by keyboarding speed, probably because the answers to math items rely mainly on less frequently used number and symbol keys. Students with computer experience are not much faster at identifying and using these keys than are students without computer experience.

Some studies have found that differences in mean scores due to administration medium can be explained by test type. Van de Vijver and Harsveld (1994) conducted a study with 326 applicants to the Royal Military Academy in the Netherlands. One group took the computerized version of the General Aptitude Test Battery (GATB), and the other took a paper-and-pencil form of this test. From this study and others, Van de Vijver and colleagues tentatively concluded that cognitively simple clerical tests are more susceptible to medium effects than are more complex tasks. They speculated that these differences might be related to previous computer usage and should disappear with repeated administration of the computerized version.

Some medium effects have been observed with open-response tests. As discussed earlier, experimental studies have shown that students who are accustomed to writing with computers perform better on writing tests that allow them to use computers than they do on standard paper-and-pencil writing tests (Russell & Haney, 1997; Russell, 1999). The medium can also affect the way that responses are scored. Powers et al. (1994) discovered that essay responses printed from a computer are assigned lower scores than the same essays presented in handwritten format. Additional research is needed to examine medium effects across the range of subject areas, grade levels, and item formats that are likely to be included in a large-scale testing system, and to identify implications of these effects.

Additional research is also needed to examine differences in medium effects on tests in different subjects as well as for different examinee groups. This research is especially important in contexts in which there is a need to compare the results from the two approaches to testing—e.g., if computerized testing is phased in gradually, perhaps starting with older children and working backwards to younger ones (or vice versa).

How Should an Adaptive Test Be Implemented?

As discussed above, there are advantages to making tests adaptive. In particular, adaptive tests are usually shorter (i.e., fewer items are needed to obtain a given level of reliability) than standard tests. This occurs because item difficulty is more closely tailored to each examinee’s proficiency level. In a meta-analysis of 20 studies, Bergstrom (1992) found that mode of administration (adaptive or non-adaptive) did not affect performance, regardless of test content or examinee age.

Still, questions remain about how to implement adaptivity. For example, there are three approaches to determining when to stop an adaptive test. Stopping rules can dictate fixed-length, variable-length, or mixed solutions to this problem. On a fixed-length test, the number of items to be administered is fixed in advance of the administration (but not the specific items). On a variable-length test, items are administered until an examinee’s proficiency level is estimated to within a prespecified degree of precision. A mixed solution to the stopping rule problem combines some aspects of the fixed-length and variable-length strategies. Researchers have found that fixed-length tests can perform as well as variable-length tests in terms of the level of uncertainty about the final proficiency estimate (McBride, Wetzel, & Hetter, 1997), but the decision about stopping rules needs to be informed by a number of factors related to the context of testing.
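
The distinctions among the three families of stopping rules can be made concrete in a few lines of code; the function below is a sketch under assumed threshold values, intended only as an illustration.

```python
def should_stop(n_items, se, rule="mixed",
                fixed_length=25, se_target=0.35,
                min_items=10, max_items=40):
    """Decide whether to end an adaptive test, given the number of items
    administered so far and the current standard error of the proficiency estimate."""
    if rule == "fixed":
        # Fixed-length: stop after a predetermined number of items.
        return n_items >= fixed_length
    if rule == "variable":
        # Variable-length: stop once the estimate is precise enough.
        return se <= se_target
    # Mixed: require a minimum test length, stop when the precision target is met,
    # and never exceed a maximum length even if that target is not reached.
    return n_items >= max_items or (n_items >= min_items and se <= se_target)
```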

Another question pertains to item review, or the practice of allowing examinees to change their answers to previously completed questions. This is often cited as a way to make CATs more palatable to examinees who are accustomed to paper-and-pencil testing. When asked, examinees often state a preference for this option (Vispoel, Rocklin, & Wang, 1994). Some research suggests that when item review is permitted, examinees who change earlier answers improve their scores, but only by a small amount (Gershon & Bergstrom, 1995). Although the ability to change answers is common on paper-and-pencil tests, researchers and test developers have expressed concern that CATs might be susceptible to the use of item review to “game” the test. For example, examinees might deliberately answer early questions wrong or omit these questions so that subsequent questions are easier, and then go back and change their initial wrong answers, which could result in a proficiency estimate that is artificially high (Wainer, 1993). Use of this strategy does sometimes result in higher scores, but its effects depend on a number of factors, including the statistical estimation method used and the examinee’s true proficiency level (Vispoel et al., 1999). To mitigate the risk of using item review to “game” the test, review may be limited to small timed sections of the test (Stocking, 1996), and techniques to identify the plausibility of particular response patterns may be used to evaluate results (Hulin, Drasgow, & Parsons, 1983).

Item Bank Development and Management

Successful implementation of a CAT system requires a sufficiently large bank of items from which the testing software can select questions during test administration. Some of the issues related to item bank management are psychometric. For example, how large must the banks be to accommodate a particular testing schedule or test length? How many times can an item be reused before it must be removed from the bank to eliminate effects of overexposure? What is the optimal distribution of item difficulty in a bank? What is the most effective way to pretest items that will be used to refresh a bank? These and related topics have been the focus of recent research on CATs. For example, Stocking (1994) provided guidelines for determining the optimal size of an item bank and for making one bank equivalent to another. Most of the research on this problem has focused on multiple-choice questions, which are scored either correct or incorrect and which can be completed quickly by examinees. Use of constructed-response items raises additional questions, particularly because fewer items can be administered in a given amount of time.
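
One simple family of exposure-control strategies, often described in the CAT literature as “randomesque” selection, chooses at random among the several most informative eligible items rather than always administering the single best one, and retires items whose cumulative use exceeds a cap. The sketch below illustrates the idea; the Rasch information calculation and the parameter values are illustrative assumptions, not recommendations.

```python
import math
import random
from collections import Counter

exposure_counts = Counter()  # administrations per item, accumulated across examinees

def rasch_information(theta, difficulty):
    """Item information under the Rasch model: p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-(theta - difficulty)))
    return p * (1.0 - p)

def select_item(theta, bank, used, k=5, max_exposure=200):
    """Pick one of the k most informative eligible items at random ('randomesque' selection).
    Assumes the bank is large enough that at least one item is always eligible."""
    eligible = [item_id for item_id in bank
                if item_id not in used and exposure_counts[item_id] < max_exposure]
    top_k = sorted(eligible, key=lambda i: rasch_information(theta, bank[i]), reverse=True)[:k]
    choice = random.choice(top_k)
    exposure_counts[choice] += 1
    return choice
```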

Several organizations have produced item banks appropriate for K–12 testing. The Northwest Evaluation Association (NWEA), for example, has published CATs in several subjects and has item banks that are appropriate for students in elementary through secondary grades. Many of these testing systems offer the possibility of modifying the assessment so that it is aligned with local or state content standards. However, it is not always clear what methods have been used to determine this alignment or whether curtailing the types of questions that can be asked would change their characteristics.

Developing new test items is time-consuming and expensive. Although items may continue to be produced by professional test publishers, new ways of generating tests should be studied. It is becoming increasingly possible for computers to generate items automatically, subject to certain constraints. This type of item generation, which can occur in real time as examinees take the test, has great potential for saving money and for reducing test security problems. However, it may lead to certain undesirable test preparation strategies.

Teachers are another promising source of items, and including teachers in the item generation process may significantly enhance their acceptance of this type of large-scale testing program. The inclusion of teachers may be especially valuable in situations where a test is intended to be aligned with a particular local curriculum.

Regardless of the source of the items, several practical concerns arise. These include copyright laws and protections, means of safeguarding data on item characteristics, and arrangements for selling or leasing item banks to schools, educational programs, and parents.


V. ISSUES AND CONCERNS WITH WEB-BASED TESTING

Administering CATs via the Internet rather than on stand-alone machines adds another layer of complexity. However, some of the problems discussed below, such as those related to infrastructure, would need to be addressed even in a system that used locally administered CATs. We address them here because the requirements that must be in place for successful web-based test administration are often closely linked with those that are necessary for any computerized testing. The section begins with a discussion of infrastructure, including equipment and personnel, followed by a brief discussion of costs. Finally, we bring up some issues related to reporting of assessment results.

Infrastructure

The feasibility of delivering assessment over the Internet depends on both the material infrastructure and the human capital already in place in schools. Recent data indicate that the availability of computers and the frequency and quality of Internet access are becoming sufficient to support this form of testing. In 1999, 95 percent of public schools had some form of Internet access, and fully 65 percent of instructional classrooms in the nation’s public schools were connected to the Internet (U.S. Department of Education, 2000). In earlier years, schools with large proportions of students living in poverty were less likely to be connected than were wealthier schools, but this difference had disappeared by 1999. The percentage of instructional rooms that are connected, in contrast, does vary by socioeconomic status. However, this gap is likely to shrink due in part to the E-rate program, which requires telecommunications companies to provide funds for Internet access in schools (U.S. Department of Education, 1996).

The quality of Internet connectivity has also improved over the years. For example, the percentage of schools connecting using a dedicated line has increased significantly over a recent two-year period. The use of dial-up networking declined from nearly 75 percent of schools in 1996 to 14 percent in 1999 (U.S. Department of Education, 2000). Most of these were replaced with speedier dedicated-line network connections. However, in contrast to the number of schools with Internet access, gaps between poor and wealthy schools in the quality of connection persist. For example, 72 percent of low-poverty schools (those with fewer than 11 percent of students eligible for free or reduced-price lunches) and 50 percent of high-poverty schools (those with 71 percent or more eligible students) had dedicated lines in 1999 (U.S. Department of Education, 2000).

Although the improvements in connectivity are encouraging, some important questions remain. For example, what is the quality of the hardware available to students? Are existing allocations of computers to classrooms sufficient to support several rounds of testing each year? In addition, the figures cited above illustrate that schools serving large numbers of students who live in poverty may have the basic equipment but lack the features that are necessary for effective implementation of a large-scale web-based assessment system. True equality of access is clearly an important consideration. More-detailed surveys of infrastructure and a better understanding of what types of schools are in need of improvements will help determine the feasibility and fairness of implementing the kind of system we have been describing.

There are additional questions related to how computers are used. Are some machines dedicated solely to testing or are they used for other instructional activities? What are the test and data security implications of using the computers for multiple purposes? Placement of computers throughout the school building is another important consideration. Do computers need to be in a lab or can a web-based assessment system be implemented in a school that has only a few computers per classroom? Answers to these and related questions have implications for cost estimates. For example, testing costs would be reduced if the computers that were used for assessment activities were also being used for other instructional purposes.

Human Capital

Data suggest that most schools have staff with computer knowledge sufficient to supervise student computer use and that students are increasingly familiar with computers (U.S. Department of Education, 1999). NAEP data from 1996 indicate that approximately 80 percent of fourth and eighth grade teachers received some professional training in computer use during the preceding five years (Wenglinsky, 1998), and it is likely that this number has grown since then. However, it is not clear what the quality of this training was or how many of these staff members have the skills needed to address the unexpected technical problems that will inevitably arise. Research indicates that teachers’ willingness to use computers is influenced by the availability of professional development opportunities and on-site help (Becker, 1994), so an investment in training and support will be necessary for any large-scale instructional or assessment effort that relies on technology. Equity considerations also need to be addressed, since teachers in more-affluent school districts tend to have easier access to professional development and support.

Students’ increasing familiarity with the use of computers increases the feasibility of implementing a CAT system in schools. Data from 1997 showed that a majority of students used computers at home or school for some purpose, including school assignments, word processing, e-mail, Internet browsing, databases, and graphics and design (U.S. Department of Education, 1999). Of these activities, the majority of computer use was for school assignments. Some differences in computer use were observed across racial/ethnic and socioeconomic groups, with larger differences reported for home computer use than for school use. For example, in the elementary grades, 84 percent of white students and 70 percent of black students used computers at school in 1997. The corresponding percentages for home computer use were 52 percent and 19 percent, a much larger difference (U.S. Department of Education, 1999). The gap in school computer use is likely to shrink as schools continue to acquire technological resources. However, the difference in access to computers at home, along with variation in the quality of technology in the schools (which is not currently measured well), raises concerns about the fairness of any testing system that requires extensive use of computers.

Because schools differ in the degree to which they are prepared for large-scale computerized testing, it may be necessary to implement a system of testing that supports both paper-and-pencil and computerized adaptive testing simultaneously. Such a system is only feasible if there is sufficient evidence supporting the comparability of these two forms of testing. Although much of the research discussed earlier suggests a reasonable degree of comparability, the differences that have been observed, particularly on open-response tests and those that require reading long passages, suggest that caution is warranted in making comparisons across testing approaches.

Costs and Charges

A number of costs are associated with this form of testing, including development of item banks, development of software with secure downloading and uploading capabilities, and acquisition of the necessary hardware. How these costs compare with the cost of traditional paper-and-pencil testing is not known, and the cost difference between these two options is likely to change as technology becomes less expensive. A critical area of research, therefore, is an analysis of the costs of alternative approaches. This analysis would need to consider the direct costs of equipment and software, as well as labor and opportunity costs. For example, if web-based testing can be completed in half the time required for traditional testing, teachers may spend the extra time providing additional instruction to students. In addition, cost estimates should consider the fact that the computers and Internet connections used for assessment are likely to be used for other instructional and administrative activities.

It will also be important to investigate various ways of distributing costs and getting materials to schools. Should schools pay an annual fee to lease access to the item bank or should they be charged for each administration of the test? The advantages and limitations of alternative approaches must be identified and compared.

An additional source of expenses is the scoring of responses, particularly essays and other constructed-response questions. Student answers to open-ended questions are usually scored by trained raters. However, the technology required for computerized scoring of such responses is developing rapidly. Data about the feasibility of machine scoring of open-ended responses will be critical for determining the degree to which an operational web-based CAT system can incorporate novel item types.

Cost analysis of assessment is complicated by the fact that many of the costs and benefits are difficult to express in monetary terms, or even to quantify. Furthermore, costs and benefits vary depending on the context of testing and how scores are used. For example, if high school graduation is contingent on passing an exit examination, some students who would have been given diplomas in the absence of the test will be retained in school. Although the cost to the school of providing an extra year of education to a student can be quantified, the opportunity costs for the student and the effects of the retention on his or her self-concept and motivation to learn are much more difficult to express in monetary terms. We also do not have sufficient evidence to determine whether, and to what degree, the extra year of high school will increase productivity or result in increased dropout rates and therefore increase social costs (Catterall & Winters, 1994). This is just one example of a potential outcome of testing, and to the degree that web-based CATs improve the validity of this type of decision (e.g., by increasing test security), a cost-benefit analysis needs to reflect this improvement. The current assessment literature is lacking in studies of costs and benefits of different approaches to, and outcomes of, large-scale assessment.

Reporting Results

As discussed earlier, two of the anticipated benefits of a web-based CAT system are its capabilities for instant scoring and centralized data storage. These features have the potential for improving the quality and utility of the information obtained from testing and making results more timely and useful for students, parents, teachers, and policymakers. Current assessment programs do not share these features. Thus, we have little experience to guide decisions about how to generate, store, and distribute results. For example, what types of information should be produced? Should the database of school, district, state, or national norms be continuously updated and used to interpret individuals’ results? Should students receive their scores immediately, or is it better to have score reports distributed by teachers some time after the testing date?

In today’s high-stakes testing environment, the ease of access to results may have both positive and negative consequences for students and teachers, and it is imperative that we anticipate possible misuse of scores. For example, although the possibility for principals and superintendents to have timely information about the progress of students by classroom may help them provide needed assistance and encouragement, this information could also be used in ways that unfairly penalize teachers or students. It may also increase public pressure to raise test scores, creating unrealistic demands for rapid improvement. These concerns underscore the need to examine the incentive structures and policy context surrounding any assessment system, including the kind of system we have described here.

Transmitting test questions and possibly student responses over the Internet raises a host of additional concerns. How will security of the tests and results be ensured? Who will have access to what data and when? Who will have access to the system? For example, can students practice for the tests by accessing some or all of the item banks from their homes? Will parents be able to see how well their children are doing relative to the typical performance of students in the same or other classrooms? The answers to these and related questions will play a large part in determining the validity, feasibility, fairness, and acceptability of web-based testing.

VI. CONCLUSION

Web-based testing, and especially the computerized adaptive version of it, will soon be competing with and possibly replacing the paper-and-pencil tests that are now relied upon for large-scale K–12 assessment programs. This new type of testing will use the Internet to deliver tests to students in their schools. Test developers and policymakers will need to prepare for this transition by embarking on a comprehensive program of research that addresses a number of critical web-based testing issues.

A system in which adaptive assessments are delivered to students at their schools over the Internet has several advantages, including decreased testing time, enhanced security, novel item types, and rapid reporting. However, before this system can be put in place, a number of issues need to be considered, such as the psychometric quality of the measures, methods for maintaining item banks, infrastructure, human capital, costs, comparability with paper-and-pencil measures across subject matter areas, and reporting strategies. Research is needed in all of these and other related areas to provide the foundation for a smooth and effective transition to web-based testing.

Several issues and questions that were not mentioned in this paper undoubtedly will arise concerning the implementation of web-based testing in general and CATs in particular. However, we anticipate that the foregoing discussion provides some of the broad brush strokes for framing a research agenda to investigate the validity, feasibility, and appropriateness of this form of assessment. A program of research that is designed to address these and other related important questions will require the expertise of researchers from a wide range of disciplines, including psychometrics, psychology, sociology, information sciences, political science, and economics, as well as input from educators, test developers, and policymakers. Information technology has permeated nearly all aspects of our lives, including education, and it is only a matter of time before large-scale testing will utilize current technologies, including the Internet. It is critical, therefore, that we begin a program of research now, while there is still time to steer this process in a direction that will improve education and benefit students.


REFERENCES

Achieve, Inc. (1999). National Education Summit briefing book (available at http://www.achieve.org).

Becker, H. J. (1994). How exemplary computer-using teachers differ from other teachers: Implications for realizing the potential of computers in schools. Journal of Research on Computing in Education, 26, 291-320.

Bennett, R. E. (1998). Reinventing assessment. Princeton, NJ: Educational Testing Service.

Bergstrom, B. A. (1992). Ability measure equivalence of computer adaptive and pencil and paper tests: A research synthesis. Paper presented at the annual meeting of the American Educational Research Association, San Francisco. ERIC ED377228.

Bunderson, C. V., Inouye, D. K., & Olsen, J. B. (1989). The four generations of computerized educational measurement. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 367-407). New York: Macmillan.

Catterall, J. S., & Winters, L. (1994). Economic analysis of testing: Competency, certification, and “authentic” assessments (CSE Technical Report 383). Los Angeles: Center for Research on Evaluation, Standards, and Student Testing.

Gershon, R., & Bergstrom, B. (1995). Does cheating on CAT pay: NOT! ERIC ED392844.

Hamilton, L. S., Nussbaum, E. M., & Snow, R. E. (1997). Interview procedures for validating science assessments. Applied Measurement in Education, 10, 181-200.

Hulin, C. L., Drasgow, F., & Parsons, C. K. (1983). Item response theory: Applications to psychological measurement. Homewood, IL: Dow Jones-Irwin.

Kellaghan, T., & Madaus, G. F. (1991). National testing: Lessons for America from Europe. Educational Leadership, 49, 87-93.

Klein, S. P., & Hamilton, L. (1999). Large-scale testing: Current practices and new directions. Santa Monica: RAND.

Koretz, D., & Barron, S. I. (1998). The validity of gains on the Kentucky Instructional Results Information System (KIRIS). Santa Monica: RAND.

Linn, R. L. (2000). Assessments and accountability. Educational Researcher, 29 (2), 4-16.

Linn, R. L., Graue, M. E., & Sanders, N. M. (1990). Comparing state and district test results to national norms: The validity of claims that “everyone is above average.” Educational Measurement: Issues and Practice, 9, 5-14.

Madaus, G. F. (1988). The influence of testing on the curriculum. In L. Tanner (Ed.), Critical issues in curriculum (pp. 83-121). Chicago: University of Chicago Press.


McBride, J. R., Wetzel, C. D., & Hetter, R. D. (1997). Preliminary psychometric research for CAT-ASVAB: Selecting an adaptive testing strategy. In Computerized adaptive testing: From inquiry to operation (pp. 83-95). Washington, DC: American Psychological Association.

Mead, A. D., & Drasgow, F. (1993). Equivalence of computerized and paper-and-pencil cognitive ability tests: A meta-analysis. Psychological Bulletin, 114, 449-458.

National Research Council, Committee on Appropriate Test Use (1998). High stakes: Tests for tracking, promotion, and graduation. Washington, DC: National Academy Press.

Perkins, B. (1993). Differences between computer administered and paper administered computer anxiety and performance measures. ERIC ED355905.

Powers, D., Fowles, M., Farnum, M., & Ramsey, P. (1994). Will they think less of my handwritten essay if others word process theirs? Effects on essay scores of intermingling handwritten and word-processed essays. Journal of Educational Measurement, 31, 220-233.

Revuelta, J., & Ponsoda, V. (1998). A comparison of item exposure control methods in computerized adaptive testing. Journal of Educational Measurement, 35, 311-327.

Russell, M. (1999). Testing writing on computers: A follow-up study comparing performance on computer and on paper. Education Policy Analysis Archives, 7 (20), available at http://olam.ed.asu.edu/epaa/.

Russell, M., & Haney, W. (1997). Testing writing on computers: An experiment comparing student performance on tests conducted via computer and via paper-and-pencil. Education Policy Analysis Archives, 5 (3), available at http://olam.ed.asu.edu/epaa/.

Russell, M., & Haney, W. (2000). Bridging the gap between testing and technology in schools. Education Policy Analysis Archives, 8 (19), available at http://olam.ed.asu.edu/epaa/.

Segall, D. O., Moreno, K. E., Kieckhaefer, W. F., Vicino, F. L., & McBride, J. R. (1997). Validation of the Experimental CAT-ASVAB System. In Sands, W. A., Waters, B. K., & McBride, J. R. (Eds.), Computerized adaptive testing: From inquiry to operation (pp. 103-114). Washington, DC: American Psychological Association.

Stecher, B. M., Barron, S., Kaganoff, T., & Goodwin, J. (1998). The effects of standards-based assessment on classroom practices: Results of the 1996-97 RAND survey of Kentucky teachers of mathematics and writing (CSE Technical Report 482). Los Angeles: Center for Research on Evaluation, Standards, and Student Testing.

Stecher, B. M., & Klein, S. P. (1997). The cost of science performance assessments in large-scale testing programs. Educational Evaluation and Policy Analysis, 19, 1-14.


Stocking, M. L. (1994). Three practical issues for modern adaptive testing pools. ETS Report ETS-RR-94-5. Princeton, NJ: Educational Testing Service.

Stocking, M. (1996). Revising answers to items in computerized adaptive testing: A comparison of three models. ETS Report ETS-RR-96-12. Princeton, NJ: Educational Testing Service.

U.S. Department of Education (1996). Getting America’s students ready for the 21st century: Meeting the Technology Literacy Challenge. Washington, DC: U.S. Government Printing Office.

U.S. Department of Education (1999). The condition of education 1999. Available at http://www.nces.ed.gov/pubs99/condition99/index.html.

U.S. Department of Education (2000). Internet access in U.S. public schools and classrooms: 1994-99 (NCES 2000-086). Washington, DC: U.S. Government Printing Office.

Van de Vijver, F. J. R., & Harsveld, M. (1994). The incomplete equivalence of the paper-and-pencil and computerized versions of the General Aptitude Test Battery. Journal of Applied Psychology, 79 (6), 852-859.

Vispoel, W. P., Rocklin, T. R., & Wang, T. (1994). Individual differences and test administration procedures: A comparison of fixed-item, computerized-adaptive, and self-adapted testing. Applied Measurement in Education, 53, 53-79.

Vispoel, W. P., Rocklin, T. R., Wang, T., & Bleiler, T. (1999). Can examinees use a review option to obtain positively biased ability estimates on a computerized adaptive test? Journal of Educational Measurement, 36, 141-157.

Wainer, H. (1993). Some practical considerations when converting a linearly administered test to an adaptive format. Educational Measurement: Issues and Practice, 12, 15-20.

Wenglinsky, H. (1998). Does it compute? The relationship between educational technology and student achievement in mathematics. ETS Policy Information Report. Princeton, NJ: Educational Testing Service.

Zandvliet, D., & Farragher, P. (1997). A comparison of computer-administered and written tests. Journal of Research on Computing in Education, 29 (4), 423-438.