PHREE Background Paper Series
Document No. PHREE/91/34
Effective Assessment of Educational Progress: A Review of Strategies for
Measuring Learning Achievement
By
Abigail M. Harris (Consultant)

Education and Employment Division
Population and Human Resources Department
The World Bank
February 1991
This publication serves as an outlet for background products from the ongoing work program of policy research and analysis of the Education and Employment Division in the Population and Human Resources Department of the World Bank. The views expressed are those of the author(s) and should not be attributed to the World Bank.
Effective Assessment of Educational Progress: A Review of Strategies for Measuring Learning Achievement
Contents

I. Introduction and Rationale

II. Selecting Assessment Goals
    What constitutes educational achievement?
    Selecting assessment questions
    Avoiding a popular pitfall

III. Selecting Assessment Strategies
    Norm-referenced tests
    Criterion-referenced tests
    Combinations and derivations
        Item response theory and item banking
        Proficiency scaling
    Summary

IV. Standardization and Local Conditions

V. Final Considerations

References
Abstract
This paper discusses the purposes of educational assessment and the
advantages and disadvantages of three types of standardized tools for
measuring student learning: norm-referenced tests, criterion-referenced tests,
and combinations of the two. The purpose is to provide a general, non-
technical overview for Bank education sector staff.
Effective Assessment of Educational Progress: A Review of Strategies for Measuring Learning Achievement
I. Introduction and Rationale
Evaluation of how well an educational system is performing can be
accomplished in a variety of ways. Tracking enrollments, surveying student
attitudes, observing classroom interaction, cataloging educational resources,
monitoring teacher qualifications, and analyzing student achievement are all
strategies that can be used to gauge educational productivity. This paper
focuses on assessing student academic achievement to evaluate student academic
progress or growth. Two major types of assessment approaches will be
described: criterion-referenced tests and norm-referenced tests. These two
approaches are not mutually exclusive (e.g., proficiency scaling can be both
norm-referenced and criterion-referenced), although both their purposes and
the nature of the data they provide are different.
Monitoring the academic achievement of students is a critical component
of effective education. Appropriate assessment makes it possible to answer
important questions about the individual student, the classroom, the school,
and the educational system as a whole. Such knowledge provides a basis for
decision-making regarding educational goals and effective allocation of
resources. Yet, there is no single, preordained process for evaluating how
well an educational system is performing. The choice of process depends upon
the kind of information that is desired and the particular questions that need
to be addressed. For example, consider the situation in many developing
countries in which tests have been used primarily for admissions/selection
purposes in order to control the flow of students to higher education.
Typically, the examinations that are designed exclusively for this purpose
have not provided information about how the students taking the test this year
compared to those who took last year's admissions test. Nor do the results
provide information about the actual literacy or skills mastery levels of
examinees. To obtain this additional information, the scope and design of the
admissions test would need to be altered, or, more likely, additional
assessment would be appropriate. By first defining assessment needs (for
individuals, classrooms, schools, or the entire system) and identifying the
significant questions, it is then possible to begin to design an effective and
efficient assessment program.
The selected assessment process must take into account local conditions
and needs. In some educational systems, the testing and scoring environment
can be highly controlled, and the same test or test questions can be reused
without lapses in test security. In other systems, conditions are not as
standardized: test administration procedures and conditions vary, each test
becomes part of the public domain as soon as it is administered, and scoring
procedures are subjective or not consistent across administrations. The
degree to which testing procedures (administration, scoring, etc.) can be
standardized is an important consideration in designing effective assessment.
If data are to be meaningful, it must be possible to exercise some controls
over the assessment process. In instances where such control is not feasible,
alternative data collection designs (e.g., sampling) may be advisable. The
assessment process that is selected must reflect these kinds of differences.
II. Selecting Assessment Goals
What constitutes educational achievement?
An important first step in planning an assessment program is deciding
what aspects of educational achievement will be evaluated. Over the years,
educational achievement has been characterized in a variety of ways, but two
major conceptions have dominated recent writing and thinking (Cole, 1990). One
perspective focuses on minimum competency, emphasizing mastery of basic skills
and raising all students to a minimum level of proficiency. The other
perspective gives priority to the development of higher order thinking skills
and advanced knowledge; primary goals are to improve the academic preparation
of students who will receive advanced schooling and perhaps to increase the
proportion of students who have this training. This dichotomy is illustrated
by the attempt in Sri Lanka in 1974-75 to replace testing measures that were
used for selection purposes ("0" and "A" level examinations) with measures of
educational attainment (national examinations emphasizing pre-vocational
subjects). Public pressure forced a reversal of this decision since the new
examinations were not recognized by many universities and the value of the
certification of the new examination was undermined (Heyneman & Ransom, 1989).
The choice to emphasize (monitor progress, support with resources, etc.)
one aspect or conception of educational achievement affects what teachers
teach and how they teach it. For example, the incentive for teachers to devote
their energies and resources to teaching basic skills to less able students is
minimized by a system that evaluates and rewards only the brightest, most able
students. In deciding on the goals of education and how achievement of these
goals will be measured, one must recognize the potential impact of emphasizing
accountability associated with only one particular conception of educational
achievement.
The alternative is to recognize and support the importance of academic
achievement at various instructional levels. The risk, however, is that the
educational resources (including those allocated for assessment) will then be
spread too thinly to be effective. Only through thoughtful consideration of
the long-term goals of education and careful planning in selecting the goals
of assessment can a balance be achieved.
Selecting Assessment Questions
Another issue to consider in selecting assessment goals and strategies
is the questions policy makers may want answered. There are two main types of
questions. The first type focuses on the specific skills or information that
students do or do not know. For example, "Has this particular child mastered
two place addition without regrouping?" or "What percentage of the
intermediate level students are able to identify all of the countries in
Europe and their capitals?" Progress can be monitored by comparing mastery
levels of the same students over time (e.g., "At the beginning of primary
school, 5% of the students could name the letters of the alphabet. What
percentage of these students can name the letters of the alphabet at the end
of their first school year?") or by comparing the performance of different
students (e.g., "Last year, 65% of the intermediate level students responded
correctly to a two-digit addition problem that didn't involve regrouping; what
is the percentage of intermediate level students this year who responded
correctly to this type of problem?") These types of questions are best
answered using criterion-referenced testing which requires that test
performance be linked to skills that can be well defined. The emphasis is on
describing performance on specified curriculum objectives.
The second type of question focuses on the relative ranking of the
performance of individual students or groups of students compared to a
preestablished normative or reference group. For example, "How does the
performance of this intermediate level student (or students) compare to that
of students in a national sample?" or "On the average, how does the
performance of intermediate level students this year compare with that of
intermediate level students last year?" or "Compared with a 1988 national
sample, what is the proportion of children this year who scored in the lowest
quartile and what is the proportion scoring in the top quartile?" It is also
possible to monitor relative performance of the same student or students over
time: "Last year this student scored at the 45th percentile compared to a
national sample. What is his percentile or relative ranking this year
compared to the national sample?" These kinds of questions are
norm-referenced, that is, the performance of students has been compared to the
performance of children in the norming or standardization group.
Norm-referenced testing allows you to evaluate the performance of an
individual or group of children with reference to other children, but it
requires that the tests that are used be referenced to a common external scale.
Avoiding a Popular Pitfall
A common pitfall in educational assessment is to act as if all attained
scores have meaning and can provide information on changes over time.
However, tests that lack a referent cannot provide trend data; indeed, their
scores carry little meaning at all. Unfortunately, many tests used in developing countries
are neither criterion-referenced nor norm-referenced. There may be good
reasons for this; in some settings it may not be possible to reuse the same
test or test questions from one year to the next. Yet, without a reference or
a way of linking a test to some external scale or criteria, scores derived
from the test lack meaning and cannot be used to answer most of the questions
that are of interest to policy makers.
For example, responding correctly to 75% of the questions or items on a
general achievement or admissions test does not give evidence that the student
has mastered 75% of the subject matter covered by the examination. What would
it mean to say that a person has mastered 75% of a test covering World War II?
Does the person know 75% of the important dates? Was the person able to list
75% of the important battles? Did the examinee write essays that were 75%
accurate? Although this type of percentage scale is widely used, its
meaningfulness is very limited. It should not be confused with
criterion-referenced statements (e.g., "Fifty-five percent of the students
tested correctly identified the dates of World War I and World War II."). The
latter requires a specificity of proficiency that is not reflected in the
percent correct score of a test covering multiple curriculum objectives such
as a general achievement test or an admissions test.
Another popular misinterpretation is to say that there is "an annual
failure rate of 50%" and conclude that this says something meaningful about
relative performance from one year to the next. Generally, in developing
countries new examinations are developed every year and the rate of success is
arbitrarily set to control the flow of students. Similarly, saying that a
student scored higher than 60% of the students taking the test this year
doesn't necessarily allow you to say anything about how the student would have
compared to students taking a different test last year. In both instances,
unless provisions have been made for equating the scores or calibrating the
items on the two tests, comparisons across years are not possible. Perhaps,
for example, the teachers have been more effective in teaching this past year,
and, as a group, the students taking the test this year know more than the
students who took last year's test. Unless some reference point is
used--whether norm-referenced or criterion-referenced or some combination or
derivation of the two--test performance will not provide the evidence
necessary to demonstrate this educational improvement.
III. Selecting Assessment Strategies
The assessment questions that are identified provide the basis for
determining the appropriate assessment strategies. As suggested above, some
questions are answered most effectively by norm-referenced testing whereas
other questions lend themselves to criterion-referenced testing. Also, due to
recent psychometric advances, in some situations, it is possible to use
strategies that combine or enhance these two major assessment techniques.
Properties of norm-referenced, criterion-referenced, and combination approaches
are summarized briefly below.
Norm-referenced tests
Norm-referenced testing is used when students' scores are interpreted
with reference to a particular group, usually referred to as the
standardization or norm group. The intent is to assess where a particular
community, school, or individual lies on a normal distribution curve whose mean
and distribution are established from the pretesting of students in the
the norm group. The emphasis is on the relative standing of individuals
rather than on absolute mastery of content. Thus, one might say, "This year
55% of the students tested scored above the national average whereas last year
52% of the students tested scored above the national average." Because the
norm group provides a basis for anchoring and interpreting raw scores--in
this case for establishing the national average--it is possible to use the
test to make performance comparisons between individuals, schools, performance
last year versus this year, and so on.
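To make the arithmetic of such norm-referenced statements concrete, the following minimal sketch (in Python) computes a percentile rank against a norm group and the share of current examinees scoring above the norm-group average. All data are simulated and every name and number is illustrative, not drawn from any actual norming study.

```python
# Minimal sketch: interpreting raw scores against a fixed norm group.
# All data are simulated; percentile_rank is an illustrative helper.
import numpy as np

def percentile_rank(raw_score, norm_scores):
    """Percentage of the norm group scoring below the given raw score."""
    return 100.0 * np.mean(np.asarray(norm_scores) < raw_score)

rng = np.random.default_rng(0)
# Hypothetical norm group of 1,000 examinees on a 60-item test.
norm_scores = rng.normal(loc=35, scale=8, size=1000).round()
national_average = norm_scores.mean()

# This year's examinees, compared against the same fixed norm.
this_year = rng.normal(loc=36, scale=8, size=500).round()
share_above = 100.0 * np.mean(this_year > national_average)
print(f"{share_above:.0f}% of this year's students scored above the national average")
print(f"A raw score of 45 falls at about the {percentile_rank(45, norm_scores):.0f}th percentile")
```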
In order to create norms, the test is administered to the norm group.
If generalizations are going to be made (e.g., compared to a national sample
of students who were tested at the completion of primary school, compared to
other schools in the region, etc.), it is important that this sample be
representative of the group who will eventually use the examination. It
should match as closely as possible the various demographic characteristics of
the population with which it will be used. The sample also should be large
enough to ensure stability of the test scores and inclusion of the various
groups that are represented in the designated population.
A goal in the development of norm-referenced tests is total score
variability. If all examinees get all items right or all items wrong, there
is no score variability and no way of distinguishing or ranking the examinees.
Consequently, it is important to have items that are answered correctly by
some but not all of the individuals taking the test. Further, it is desirable
if performance on each of the items is highly correlated with total test score
(i.e., high scorers get the item correct while low scorers get the item
incorrect). This aspect of an item is termed item discrimination.
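The two classical indices just described can be computed directly from a matrix of scored responses. The sketch below follows common conventions (proportion correct as the difficulty index; a point-biserial correlation with the total score as the discrimination index); the response data and names are illustrative, not from the paper.

```python
# Minimal sketch of classical item statistics on a 0/1 response matrix
# (rows = examinees, columns = items). Data are illustrative.
import numpy as np

responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 1],
])

total_scores = responses.sum(axis=1)

# Item difficulty: proportion of examinees answering each item correctly.
difficulty = responses.mean(axis=0)

# Item discrimination: correlation of each item with the total score
# (a simple point-biserial; high values mean high scorers tend to get
# the item right while low scorers tend to get it wrong).
discrimination = np.array([
    np.corrcoef(responses[:, j], total_scores)[0, 1]
    for j in range(responses.shape[1])
])

print("difficulty:    ", difficulty.round(2))
print("discrimination:", discrimination.round(2))
```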
Score variability is determined in part by the distribution of relatively easy
or hard items on the examination (i.e., item difficulty). This distribution
should depend on the purpose(s) of the assessment. For example, a test used
exclusively for selection purposes should include items that will accurately
rank order examinees who are on the borderline between being accepted or
rejected. For scholarship or placement purposes it also may be useful to have
an accurate ranking of those who score above this decision point. If a
ranking of low scorers is not required, items that would rank or discriminate
between low scorers (easier items) would not be necessary.
Examinations used for system-wide monitoring of educational progress
need to discriminate accurately at a broad range of achievement levels. For
example, it is desirable to know if an increase in the allocation of textbooks
or other school resources is associated with changes in performance of low
scoring students as well as high scoring students. Remedial and enrichment
programs may affect students differentially. A selection test such as is
described above (i.e., with a high concentration of relatively difficult
items) would be unlikely to reflect the effects of a program designed to train
all students to at least a minimum level of literacy. If one test is going to
be used to rank examinees accurately at all achievement levels, a broad
distribution of item difficulties (easy, medium, and hard items) is desirable.
The potential trade-off is that to do this the test may need to be excessively
long.
Limitations and Considerations. There are several important limitations
of norm-referenced tests. A critical factor is that the norms that are used
must provide comparisons that are relevant and accurate in terms of the
purpose for which the test was administered. For some purposes national or
regional norms may be the most relevant. In other circumstances norms
developed on a particular portion of the population may be meaningful. In
either case, careful thought must be given to selecting the appropriate group.
For example, national norms could refer to a representative sample of all
children attending state-run schools, all children attending any type of school
(public or private), or all children (those attending any type of school and
those not attending school). If a national goal is to increase the proportion
of children attending schools, the norms associated with the first two groups
would quickly become outdated since, it is hoped, the "attending school"
population would be changing annually.
In addition, norms based on the performance of individuals are
appropriate for the evaluation of individuals but are not appropriate for
comparisons between groups of students. The variability of scores for
individuals is far greater than the variability, for example, of school means.
Therefore, if individual norms are used to interpret school means, a school
whose students average higher than the mean of the individual norms will be
undervalued, since the average performance of those students will appear
less exceptional than it actually is. The reverse is true for schools whose
average is below the mean. Thus, if a purpose of testing is to compare
schools, school-mean norms must be generated.
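A small simulation illustrates the point; all distributions and numbers below are assumptions chosen only to show that school means vary far less than individual scores.

```python
# Minimal sketch of why individual norms misrepresent school means:
# the spread of school averages is much narrower than the spread of
# individual scores. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n_schools, pupils_per_school = 200, 50

# Individual scores: a school effect plus within-school variation.
school_effects = rng.normal(0, 5, size=n_schools)
scores = school_effects[:, None] + rng.normal(50, 15, size=(n_schools, pupils_per_school))

school_means = scores.mean(axis=1)
print(f"SD of individual scores: {scores.std():.1f}")
print(f"SD of school means:      {school_means.std():.1f}")
# A school mean one individual-SD above the overall average is in fact
# several school-mean SDs above average, so judging schools against
# individual norms understates how unusual that school is.
```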
Another limitation of norm-referenced testing is that the same or
equated forms of the test must be used in order to make the link with the
reference group. Using the same form from one year to the next can be
problematic. A survey of the 50 states in the United States found that no
state reported being below the norm on any of the six major nationally normed,
commercially available achievement tests and most states reported being well
above the national norm (Cannell, 1988). Of course this does not mean that all
states are above the current year's national average! What it does suggest is
that the average reported achievement this year was higher than the average
achievement of the national norm groups when the norming data were collected
several years previously. In addition, this assertion that all states are
above the national average probably reflects teaching to the test since
typically the same test forms are reused annually.
To avoid reusing the same test form, it becomes necessary to construct
equivalent forms of the same test. However, since two forms of a test can
rarely if ever be made to be precisely equivalent in level and range of
difficulty, it becomes necessary to equate the forms. Equating is a process
through which the correspondence of scores on different forms of a test is
established. The objective is to convert the system of units of one form to
the system of units of the other so that scores derived from the two forms
after conversion will be directly equivalent.
Empirical data are needed to equate test scores. The data collection
procedure that is used must provide some type of commonality or overlap
between the two sets of test data obtained from two different forms of the
test. This commonality can come from an overlap in examinees (the same
examinees take items from both tests) or in items (each set of examinees takes
one test but the same subset of items are on both tests). There are a number
of acceptable ways this can be accomplished, most of which are extensions or
variations on three basic designs: counterbalanced, equivalent groups, and
anchor tests. In the counterbalanced design, a sample of subjects is randomly
assigned into two subgroups. Both groups are administered both forms of the
test; however, the order in which the two forms are given is reversed in the
two groups. With the equivalent groups design, subjects are randomly assigned
to two subgroups that are assumed to be equivalent and each group receives a
unique form of the test. With the anchor test design, a small number of common
items are embedded in both forms of the test. The set of common items is
called the anchor test and responses to the anchor test can be used as the
basis for equating. Methods associated with all three of these designs can be
jeopardized by breaches in test security.
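As one concrete illustration, the sketch below applies the linear equating method under the equivalent-groups design: a form-X score is mapped onto the form-Y scale by matching standardized deviation scores. This is a simplified textbook procedure; operational equating, particularly under the anchor-test design, uses more elaborate methods (e.g., equipercentile or Tucker approaches). The data are simulated and all names are illustrative.

```python
# Minimal sketch of linear equating under the equivalent-groups design.
# Two randomly equivalent groups each take one form; scores on form X
# are converted to the form-Y scale by matching z-scores.
import numpy as np

rng = np.random.default_rng(2)
form_x = rng.normal(32, 7, size=2000)   # group A takes form X (a harder form)
form_y = rng.normal(36, 8, size=2000)   # group B takes form Y

def linear_equate(x, mean_x, sd_x, mean_y, sd_y):
    """Map a form-X score to the form-Y scale: same z-score, Y units."""
    return mean_y + sd_y * (x - mean_x) / sd_x

x_score = 40
y_equiv = linear_equate(x_score, form_x.mean(), form_x.std(), form_y.mean(), form_y.std())
print(f"A score of {x_score} on form X corresponds to about {y_equiv:.1f} on form Y")
```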
Criterion-referenced tests
When students' scores are interpreted with reference to a well defined
skill, the assessment is criterion-referenced. One might ask, "Can a
particular child perform a specific mathematical task or has the child
acquired particular knowledge?" Similarly, one might want to know what
percentage of children can successfully complete this task or have mastered a
particular skill. In criterion-referenced measurement the emphasis is on
assessing mastery of specific, clearly defined, relevant behaviors. Tests are
specifically constructed to support generalizations about an individual's
performance relative to a specified domain of instructionally relevant tasks.
Criterion-referenced interpretation does not depend on comparisons between
students, that is, a criterion-referenced interpretation is not diminished if
all students obtain the same score because each student's score is interpreted
independently. Score variability is not essential as it is with
norm-referenced testing.
The principal challenge in criterion-referenced testing is to adequately
and operationally define the domain of content or behaviors the test is to
measure. By defining the domain, one is specifying the boundaries of the
knowledge or skill that is to be evaluated. A domain is well-defined when it
is clear to those reading the domain definition and specifications (whether
test developer, user, or policy-maker) what kinds of tasks or items should and
should not be considered as part of the domain. Well-defined domains are a
necessary condition for criterion-referenced assessment since the basic idea
is to generalize how well an examinee can perform with regard to a particular
domain based on a sampling of items from that domain. To illustrate: an
examinee need not complete all of the possible combinations for adding two
single-digit numbers (e.g., 1 + 1 = 2; 1 + 2 = 3; and so on) in order for one
to draw the conclusion that this domain of skills is mastered if the student
can demonstrate success on a representative sampling of the possible
combinations.
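A minimal sketch of this domain-sampling logic follows; the domain (the 100 single-digit addition facts), the sample of 15 items, and the 80 percent mastery cutoff are all illustrative assumptions, not prescriptions from the paper.

```python
# Minimal sketch of domain sampling for a criterion-referenced test:
# enumerate the full domain, draw a representative sample of items, and
# report a mastery decision against an illustrative cutoff.
import random

domain = [(a, b) for a in range(10) for b in range(10)]  # all 100 addition facts
random.seed(0)
test_items = random.sample(domain, 15)  # the sampled test form

def administer(items, answer):
    """Proportion of sampled items the examinee's answer function gets right."""
    return sum(answer(a, b) == a + b for a, b in items) / len(items)

proportion_correct = administer(test_items, answer=lambda a, b: a + b)  # a perfect examinee
print("mastered" if proportion_correct >= 0.80 else "not mastered")
```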
Several decisions must be made in selecting and defining the domains
that will be assessed. It is necessary to decide the academic achievement
level(s) that will be the focus (i.e., whether basic skills, higher level
skills, or a broad spectrum of skills) of the assessment effort. One must
decide also if it is sufficient to gauge mastery versus nonmastery, or, if
possible, if the test should also provide information on levels of mastery or
other diagnostic information.
The validity and interpretation of criterion-referenced test scores are
contingent upon the precision of the definition and specifications of the
domain. This task is facilitated if the domain can be ordered, that is, the
behaviors in the domain can be ordered along a continuum of achievement.
Indicating a student's position within a well-defined ordered domain indicates
what that individual can and cannot do. Nitko (1980) points out several ways
that a domain can be ordered: subject matter difficulty or complexity (e.g.,
a series of math items ordered such that answering each item requires mastery
of a skill over and above the skills measured by easier items); degree of
proficiency with which complex skills are performed (e.g., differences
associated with being an amateur versus an expert; may include speed and
accuracy); prerequisite learning or developmental sequences (e.g., ordering
the domain based on a hierarchy of learning or developmental prerequisites
such as those defined by Piaget or Gagné);1/ location on an empirically
defined latent trait (described below); or judged social or aesthetic quality.
Well-defined ordered domains make it possible to identify and describe levels
of domain mastery rather than simply mastery versus nonmastery.
A large number of important school learning outcomes represent
behavioral domains that conceptually cannot be fully ordered along a continuum
of achievement. Examples include locating words in a dictionary or
demonstrating reading comprehension. While these activities may require the
student to perform a particular sequence of behaviors, measuring the student's
proficiency with these skills would not necessarily locate the individual's
performance on a continuum within the domain. Many of these skills can be
well-defined (e.g., by using strategies developed by Popham, 1978, 1984);
however, the resulting criterion-referenced tests would be nonscaled (i.e.,
knowing someone's score would not necessarily reveal their location on a
continuum of achievement or identify the tasks on this continuum that they
could or could not do).
Once the domain has been well-defined, the test and item specifications
can be devised. The assumption is that if item and test specifications can be
described in sufficient detail, then items constructed to meet these
specifications, viewed collectively, constitute a sample from the theoretical
domain of all possible items that could be written for each of the objectives
defined in the specifications. These items can be evaluated for item-objective
congruence and the ability to differentiate between masters and nonmasters.
Repeated item sampling, while yielding different items, would nonetheless
allow domain-referenced interpretation of student performance, facilitating a
criterion-referenced form of equating.

1/ Several psychologists assert that children learn and acquire skills in a
specified hierarchical sequence (e.g., rote learning precedes problem solving)
or go through invariant stages as they develop cognitively (e.g., thinking in
concrete terms precedes abstract thinking). See reference list for more detail.
There are several important advantages to criterion-referenced testing.
One benefit is that the obtained scores have meaning for instructional
purposes. If a student or students haven't mastered a particular skill, this
will be evident from their scores, and remediation can be planned accordingly.
A related benefit is that when students demonstrate mastery, their skill level
can be defined in terms that are meaningful to people within and outside of
the educational system. These benefits make it possible to monitor skill
acquisition at the individual, classroom, school, community, and system levels
and to identify meaningful trends in educational achievement.
Limitations and Considerations. A limitation of criterion-referenced
testing has been in the area of actual test construction. One major problem
has been that items written to the same specifications do not always permit
comparable interpretations. Further, design of a domain-sampling test calls
for several arbitrary decisions: (1) defining the boundaries of the domain,
(2) deciding on what format will be used to present the questions, (3)
deciding in what form the examinee will be asked to respond (e.g., open ended
versus multiple choice), (4) deciding how large a sample of items is
sufficient to represent the domain, and (5) specifying what percentage of the
domain the examinee must appear to have learned in order to be credited with
"mastery" of the domain (Thorndike, 1982).
The multiple choice format is frequently the format of choice because of
its objectivity and ease of scoring. However, preparing fool-proof test
specifications for multiple choice examinations is complicated by the fact
that the choice of the wrong answers or distractors has an impact on the item
difficulty. Consider the question: What is the weight in pounds of the
average adult human male? If the five possible responses are whole numbers
between 150 and 220, the problem is more difficult than if the responses are
five numbers, each of which is larger than the preceding response by a factor
of ten (e.g., 2, 20, 200, 2,000, 20,000). Or, taken to an unlikely extreme, what
if four of the possible responses are names of colors and only one of the
choices is a number? Similarly, an item that asks for the date that an event
occurred tends to be easier if the response choices are centuries apart rather
than single years apart.
As was mentioned earlier, another limitation of criterion-referenced
testing is that, in non-ordered domains, levels of proficiency are not
identifiable. One must set arbitrary mastery/nonmastery cutoff scores. That
is, a given test score is arbitrarily assumed to represent mastery. While a
variety of empirical and judgmental methods have been identified for setting
cutoff scores, the end result is still limited to mastery/nonmastery or
pass/fail. Similarly, intermediate gradations (e.g., a score range designated
to represent "partial mastery") can only be set arbitrarily and do not
indicate which skills within the domain a student has or has not mastered.
Also, criterion-referenced testing may not be appropriate in all content
areas. For example, if the intent of testing is to measure mental ability,
consider the difficulty one would have trying to define the boundaries of this
construct. Similarly, critical thinking, creativity, reasoning ability, and
so on, appear to be beyond the scope of criterion-referenced testing.
Combinations and Derivations
Item Response Theory (IRT) and Item Banking. Item response theory (also
referred to as latent trait theory) emphasizes levels of performance on a
particular trait or construct. An IRT or latent trait model specifies a
relationship (described by a mathematical function) between observable
examinee test performance and the unobservable (thus, latent) trait or ability
assumed to underlie performance on the test.
An underlying assumption in IRT methodology is that the trait to be
measured is unidimensional and that all the items in a test are homogeneous in
the sense of measuring only a single ability or latent trait. Using IRT
methodology it is possible to estimate properties of items (i.e., item
difficulty, item discrimination, and guessing parameters) that are invariant
across groups of examinees and that can be used to estimate the level at which
a given individual falls on the specified latent trait. Thus, given a set of
items that have been fitted to a latent trait model, it is possible to
estimate an examinee's ability on the same ability scale from any subset of
items (as long as the number of items is not too small) in the domain of items
that have been fitted to the model.
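For concreteness, the sketch below evaluates one widely used latent trait model, the three-parameter logistic (3PL) item response function, which expresses the probability of a correct response as a mathematical function of the latent ability and the item's difficulty, discrimination, and guessing parameters. The parameter values here are illustrative.

```python
# Minimal sketch of the three-parameter logistic (3PL) item response
# function. theta is the latent ability; a, b, c are the item's
# discrimination, difficulty, and guessing parameters.
import math

def p_correct(theta, a, b, c):
    """3PL model: P(correct | theta) = c + (1 - c) / (1 + exp(-a(theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# An item of moderate difficulty (b = 0.5), good discrimination (a = 1.2),
# and a 20% guessing floor (c = 0.2), evaluated at three ability levels.
for theta in (-1.0, 0.0, 1.0):
    print(f"theta = {theta:+.1f}: P(correct) = {p_correct(theta, 1.2, 0.5, 0.2):.2f}")
```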
In some respects, IRT models are analogous to well-defined ordered
domains in criterion-referenced testing. Both make the assumption of
unidimensionality of the trait or domain to be measured. Both focus on
locating individuals on a continuum of ability or proficiency.
A logical extension of IRT is item banking. For the testing of a particular
construct, instead of creating a different test each time a test is needed,
one can establish an item bank or a warehouse of test items on that construct.
Each time a test is needed, items can be selected in a mix-and-match fashion
from this item bank to make a new test. The principal requirement of an item
bank is a large number of field-tested items with known properties (e.g., item
discrimination, item difficulty, and so on) so that the ability of each new
test to reflect the construct is known. The scores from each new test are
comparable to those from other tests constructed from the same item bank.
Because IRT methodology allows item difficulty and item discrimination to be
calibrated to a construct or trait, rather than to a total test score or a
particular set of individuals, it is well suited to item banking.
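A minimal sketch of the mix-and-match step follows; the bank contents, field names, and target difficulty mix are illustrative assumptions rather than any operational specification.

```python
# Minimal sketch of assembling a new form from a bank of calibrated items.
import random

bank = [
    {"id": i, "difficulty": random.Random(i).uniform(-2, 2)}
    for i in range(500)
]  # stands in for field-tested items with known parameters

def assemble_form(bank, n_easy=10, n_medium=20, n_hard=10):
    """Mix-and-match a form with a target spread of item difficulties."""
    easy = [it for it in bank if it["difficulty"] < -0.5]
    medium = [it for it in bank if -0.5 <= it["difficulty"] <= 0.5]
    hard = [it for it in bank if it["difficulty"] > 0.5]
    rng = random.Random(42)
    return rng.sample(easy, n_easy) + rng.sample(medium, n_medium) + rng.sample(hard, n_hard)

form = assemble_form(bank)
print(len(form), "items drawn from the bank for this administration")
```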
Limitations of IRT methodology are that (1) very large samples of examinees
are required initially for the calibration of items, (2) sophisticated data
analysis procedures are used to calibrate items, and (3) items are calibrated
in order to be reused, thus making test security an important issue.
Proficiency Scaling. Proficiency scaling combines norm-referenced
testing with some of the advantages of criterion-referenced testing. The
purpose of such scaling is to relate selected score levels on a
norm-referenced test to specific levels of mastery or proficiency. Thus,
getting 15 questions right on a 60 item math test could place the examinee at
the 10th percentile of the norming sample (norm-referenced) and be associated
with mastery of addition of two single-digit numbers; getting 55 questions
right could place the student at the 85th percentile and be associated with
mastery of addition involving 2 two-digit numbers with carrying. Ideally,
once examinees know their scores on a test, they would simultaneously be able
to determine their rankings relative to the norm group and their levels of
proficiency on the ability or domain measured by the test.
Two approaches have been proposed for proficiency scaling. The first
uses a theoretical hierarchical model of cognitive demand as the basis for
identifying relevant proficiency levels and then attempts to locate sets of
items that reflect these levels. This approach is referred to as "model
based" and it is theoretically consistent with criterion-referenced,
well-defined, ordered domains. Both criterion-referenced ordered domain and
model based approaches assume that skills can be ordered on a continuum and
that proficiency at the second level requires mastery of levels one and two,
at the third level, mastery of levels one, two and three, and so on. With the
model based approach, once proficiency levels have been defined, small sets of
items that reflect each of the levels are identified in an existing or newly
developed norm-referenced test. The next step is to evaluate empirically
whether data from actual student performance is consistent with the assumed
hierarchical model (i.e., whether student scores are correlated with
performance on the items representing each proficiency level). If the data
support the proposed model, the proficiency levels can be tied to specific
scores on the norm-referenced test.
The second major approach to proficiency scaling--empirical
anchoring--allows test performance data to build the model. No a priori
cognitive model is assumed and, in a sense, the test is allowed to describe
itself. Performance data from a large sample of examinees is used to identify
items that discriminate well at the test score levels for which a proficiency
description is desired. An item is considered to discriminate well if the
majority of students at one test score level can successfully complete the
item but the majority of students at the next lower test score level are not
able to complete the item successfully. Once a subset of discriminating items
has been identified for each of the predetermined test score levels, a group
of subject matter experts examines the item subsets and provides proficiency
descriptions for each of the designated test score levels.
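The item-flagging step of empirical anchoring can be sketched as follows; the simulated response data, the chosen score levels, and the majority (50 percent) rule are illustrative assumptions.

```python
# Minimal sketch of the empirical-anchoring step: flag items that the
# majority of examinees at one test score level answer correctly but the
# majority at the next lower level do not. Data are simulated.
import numpy as np

rng = np.random.default_rng(3)
n_examinees, n_items = 5000, 40
ability = rng.normal(0, 1, n_examinees)
difficulty = np.linspace(-2, 2, n_items)
# Simulated 0/1 responses from a simple logistic model.
p = 1 / (1 + np.exp(-(ability[:, None] - difficulty)))
responses = (rng.random((n_examinees, n_items)) < p).astype(int)
total = responses.sum(axis=1)

def discriminating_items(level, lower, band=2):
    """Items passed by most examinees near `level` but failed by most near `lower`."""
    at_level = responses[np.abs(total - level) <= band].mean(axis=0)
    at_lower = responses[np.abs(total - lower) <= band].mean(axis=0)
    return np.where((at_level > 0.5) & (at_lower < 0.5))[0]

for lower, level in [(10, 15), (15, 20), (20, 25)]:
    print(f"score level {level}: candidate anchor items {discriminating_items(level, lower)}")
```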
There are three limitations to proficiency scaling approaches: first,
the test must measure a trait or domain that can be ordered; second, very
large sample sizes are needed to establish proficiency levels; third, items
used for linking proficiency levels to test score levels are reused, thus
making test security important.
Summary
As is evident from this discussion, selecting an assessment approach
should take into account the kinds of questions that one hopes to answer and
the limitations of using the approach given local conditions.
Criterion-referenced and norm-referenced approaches each have advantages
as well as limitations. While not mutually exclusive, the purposes of these
two approaches are different, and, in most instances, it is unlikely that a
single test will be optimal in responding to these purposes. It may be that
different approaches are needed at different times in the educational
sequence. For example, annual criterion-referenced testing could be used to
monitor selected proficiency levels whereas less frequent norm-referenced
testing could be used to gauge overall academic progress. Regardless of the
assessment approaches that are selected, however, referencing examinations to
an external scale is critical to obtaining meaningful data.
The assessment approach that is selected also depends upon the
feasibility of using the approach given local conditions (i.e., feasibility of
maintaining test security, availability of adequate resources for test
development and data analysis, etc.). Some of these issues have been mentioned
as part of the limitations of the various approaches (e.g., test and item
security); others are discussed below.
IV. Standardization and Local Conditions
Once the assessment questions have been identified, it is necessary to
consider local conditions and the degree to which the requirements of various
assessment approaches can be met. If meaningful comparisons between tests,
administrations, and educational units (individuals, schools, etc.) are to be
made, there must be some standardization in the assessment practices that are
utilized. With regard to test administration, standardized procedures would
apply to the verbal instructions that are provided during the administration,
timing, materials, and the testing environment. Differences in exam time
provided, amount of disruption, clarity of directions, and availability of
technical aids (e.g., calculators, dictionaries, etc.), could be expected to
give some examinees an unfair advantage or disadvantage.
Test scoring represents another aspect of the assessment process that
should be standardized. Procedures for scoring should be defined so as to
minimize the subjectivity involved. Not surprisingly, the scoring of essay and
open-ended questions is more problematic than the scoring of questions posed
in an objective format such as multiple-choice, true-false, or matching.
Related to standardization of scoring is equating. If one exam would be
scored differently by two different readers, the feasibility of equating two
different tests is tenuous.
If large scale standardization is not possible, one alternative is to
reduce the scope of the assessment process. Some educational questions can be
answered effectively using an evaluation design in which data are collected
from selected samples of students rather than the entire population. In
limiting the scope, more resources are available to focus on collecting data
that have known reliability. Proper sampling of examinees would permit
generalizations to the broader population.
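As a minimal sketch of this idea, the following estimates a system-wide mastery rate from a simple random sample of 1,200 students and attaches a basic 95 percent margin of error; all numbers are illustrative.

```python
# Minimal sketch of answering a system-level question from a sample
# rather than the full population: estimate the proportion of students
# reaching mastery, with a simple 95% margin of error.
import math, random

random.seed(4)
population = [random.random() < 0.62 for _ in range(200_000)]  # true rate 62%

sample = random.sample(population, 1200)
p_hat = sum(sample) / len(sample)
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / len(sample))
print(f"Estimated mastery rate: {p_hat:.1%} +/- {margin:.1%}")
```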
V. Final Considerations
Effective assessment of educational progress is a complex process.
Deciding how to assess student academic achievement and monitor student
academic progress requires consideration of many different factors. One must
begin by deciding upon assessment goals. To do this it is necessary to
consider what kind of achievement will be assessed and what are the important
assessment questions. The next step is selecting the appropriate assessment
strategies. At this stage it is critical to keep in mind that it is better to
select a manageable process that results in valid, reliable data that is
meaningful for answering questions about students than to be overly ambitious.
The ultimate choice of approaches must be responsive to the assessment
questions to which answers are sought; at the same time, it must take into
account the limitations and possibilities due to local conditions and
resources.
Devising and implementing an assessment program that is both effective
and efficient is likely to be a multistep, multistage process. It is
frequently useful to develop an implementation plan in which parts of the
program are introduced in stages. For example, by initially limiting the
content areas that will be covered or the age levels that will be assessed, or
by sampling from the larger population, resources can be focused on promoting
quality rather than quantity.
A final point is that education goals should drive the agenda of testing
and evaluation, and not the other way around. Tests should not drive the
curriculum. However, policy makers and educators must recognize that tests can
have this effect. Teachers and students alike know what is valued and rewarded
and they act accordingly. This is an additional reason for exercising care in
planning assessment. Decisions about what kind of achievement to measure and
how educational progress will be monitored must be an integral part of
educational goals.
References

Anastasi, A. (1988). Psychological testing (6th ed.). New York: Macmillan.

Angoff, W.H. (1971). Scales, norms, and equivalent scores. In R.L. Thorndike (Ed.), Educational Measurement (2nd ed.). Washington, D.C.: American Council on Education.

Berk, R.A. (Ed.). (1980). Criterion-referenced measurement: The state of the art. Baltimore, MD: Johns Hopkins University Press.

Berk, R.A. (Ed.). (1984). A guide to criterion-referenced test construction. Baltimore, MD: Johns Hopkins University Press.

Cannell, J.J. (1988). Nationally normed elementary achievement testing in America's public schools: How all 50 states are above the national average. Educational Measurement: Issues and Practice, 7, 5-9.

Cole, N.S. (1990). Conceptions of educational achievement. Educational Researcher, 19, 2-7.

Gagné, R.M. (1970). The conditions of learning (2nd ed.). New York: Holt, Rinehart & Winston.

Glass, G.V. (1978). Standards and criteria. Journal of Educational Measurement, 15, 237-261.

Hambleton, R.K., & Cook, L.L. (1977). Latent trait models and their use in the analysis of educational test data. Journal of Educational Measurement, 14, 73-96.

Hambleton, R.K., & Eignor, D.R. (1978). Guidelines for evaluating criterion-referenced tests and test manuals. Journal of Educational Measurement, 15, 321-327.

Hambleton, R.K., Swaminathan, H., Algina, J., & Coulson, D.B. (1978). Criterion-referenced testing and measurement: A review of technical issues and developments. Review of Educational Research, 48, 1-48.

Heyneman, S.P., & Ransom, A.W. (1989). Using examinations and testing to improve educational quality. Paper presented at the VIIth World Congress of Comparative Education, Montreal, Canada.

Nitko, A.J. (1980). Distinguishing the many varieties of criterion-referenced tests. Review of Educational Research, 50, 461-485.

Nitko, A.J. (1984). Defining "criterion-referenced test." In R.A. Berk (Ed.), A guide to criterion-referenced test construction. Baltimore, MD: Johns Hopkins University Press.

Petersen, N.S., Kolen, M.J., & Hoover, H.D. (1989). Scaling, norming, and equating. In R.L. Linn (Ed.), Educational Measurement (3rd ed.). New York: Macmillan.

Piaget, J. (1971). The theory of stages in cognitive development. In D.R. Green, M.P. Ford, & G.B. Flamer (Eds.), Measurement and Piaget. New York: McGraw-Hill.

Popham, W.J. (1978). Criterion-referenced measurement. Englewood Cliffs, NJ: Prentice-Hall.

Popham, W.J. (1984). Specifying the domain of content or behaviors. In R.A. Berk (Ed.), A guide to criterion-referenced test construction. Baltimore, MD: Johns Hopkins University Press.

Suen, H.K. (1990). Principles of test theories. Hillsdale, NJ: Lawrence Erlbaum.

Thorndike, R.L. (1982). Applied psychometrics. Boston, MA: Houghton Mifflin.