PHREE Background Paper Series
Document No. PHREE/91/34
Effective Assessment of Educational Progress: A Review of Strategies for
Measuring Learning Achievement
By
Abigail M. Harris (Consultant)

Education and Employment Division
Population and Human Resources Department
The World Bank
February 1991
This publication serves as an outlet for background products from the ongoing work program of policy research and analysis of the Education and Employment Division in the Population and Human Resources Department of the World Bank. The views expressed are those of the author(s) and should not be attributed to the World Bank.
Effective Assessment of Educational Progress: A Review of Strategies for Measuring Learning Achievement
Contents

I. Introduction and Rationale

II. Selecting Assessment Goals
    What constitutes educational achievement?
    Selecting assessment questions
    Avoiding a popular pitfall

III. Selecting Assessment Strategies
    Norm-referenced tests
    Criterion-referenced tests
    Combinations and derivations
        Item response theory and item banking
        Proficiency scaling
    Summary

IV. Standardization and Local Conditions

V. Final Considerations

References
Abstract
This paper discusses the purposes of educational assessment and the
advantages and disadvantages of three types of standardized tools for
measuring student learning: norm-referenced tests, criterion-referenced tests,
and combinations of the two. The purpose is to provide a general, non-
technical overview for Bank education sector staff.
Effective Assessment of Educational Progress: A Review of Strategies for Measuring Learning Achievement
I. Introduction and Rationale
Evaluation of how well an educational system is performing can be
accomplished in a variety of ways. Tracking enrollments, surveying student
attitudes, observing classroom interaction, cataloging educational resources,
monitoring teacher qualifications, and analyzing student achievement are all
strategies that can be used to gauge educational productivity. This paper
focuses on assessing student academic achievement to evaluate student academic
progress or growth. Two major types of assessment approaches will be
described: criterion-referenced tests and norm-referenced tests. These two
approaches are not mutually exclusive (e.g., proficiency scaling can be both
norm-referenced and criterion-referenced), although both their purposes and
the nature of the data they provide are different.
Monitoring the academic achievement of students is a critical component
of effective education. Appropriate assessment makes it possible to answer
important questions about the individual student, the classroom, the school,
and the educational system as a whole. Such knowledge provides a basis for
decision-making regarding educational goals and effective allocation of
resources. Yet, there is no single, preordained process for evaluating how
well an educational system is performing. The choice of process depends upon
the kind of information that is desired and the particular questions that need
to be addressed. For example, consider the situation in many developing
countries in which tests have been used primarily for admissions/selection
purposes in order to control the flow of students to higher education.
Typically, the examinations that are designed exclusively for this purpose
have not provided information about how the students taking the test this year
compared to those who took last year's admissions test. Nor do the results
provide information about the actual literacy or skills mastery levels of
examinees. To obtain this additional information, the scope and design of the
admissions test would need to be altered, or, more likely, additional
assessment would be appropriate. By first defining assessment needs (for
individuals, classrooms, schools, or the entire system) and identifying the
significant questions, it is then possible to begin to design an effective and
efficient assessment program.
The selected assessment process must take into account local conditions
and needs. In some educational systems, the testing and scoring environment
can be highly controlled, and the same test or test questions can be reused
without lapses in test security. In other systems, conditions are not as
standardized: test administration procedures and conditions vary, each test
becomes part of the public domain as soon as it is administered, and scoring
procedures are subjective or not consistent across administrations. The
degree to which testing procedures (administration, scoring, etc.) can be
standardized is an important consideration in designing effective assessment.
If data are to be meaningful, it must be possible to exercise some controls
over the assessment process. In instances where such control is not feasible,
alternative data collection designs (e.g., sampling) may be advisable. The
assessment process that is selected must reflect these kinds of differences.
II. Selecting Assessment Goals
What constitutes educational achievement?
An important first step in planning an assessment program is deciding
what aspects of educational achievement will be evaluated. Over the years,
educational achievement has been characterized in a variety of ways, but two
major conceptions have dominated recent writing and thinking (Cole, 1990). One
perspective focuses on minimum competency, emphasizing mastery of basic skills
and raising all students to a minimum level of proficiency. The other
perspective gives priority to the development of higher order thinking skills
and advanced knowledge; primary goals are to improve the academic preparation
of students who will receive advanced schooling and perhaps to increase the
proportion of students who have this training. This dichotomy is illustrated
by the attempt in Sri Lanka in 1974-75 to replace testing measures that were
used for selection purposes ("0" and "A" level examinations) with measures of
educational attainment (national examinations emphasizing pre-vocational
subjects). Public pressure forced a reversal of this decision since the new
examinations were not recognized by many universities and the value of the
certification of the new examination was undermined (Heyneman & Ransom, 1989).
The choice to emphasize (monitor progress, support with resources, etc.)
one aspect or conception of educational achievement affects what teachers
teach and how they teach it. For example, the incentive for teachers to devote
their energies and resources to teaching basic skills to less able students is
minimized by a system that evaluates and rewards only the brightest, most able
students. In deciding on the goals of education and how achievement of these
goals will be measured, one must recognize the potential impact of emphasizing
accountability associated with only one particular conception of educational
achievement.
The alternative is to recognize and support the importance of academic
achievement at various instructional levels. The risk, however, is that the
educational resources (including those allocated for assessment) will then be
spread too thinly to be effective. Only through thoughtful consideration of
the long-term goals of education and careful planning in selecting the goals
of assessment can a balance be achieved.
Selecting Assessment Questions
Another issue to consider in selecting assessment goals and strategies
is the questions policy makers may want answered. There are two main types of
questions. The first type focuses on the specific skills or information that
students do or do not know. For example, "Has this particular child mastered
two place addition without regrouping?" or "What percentage of the
intermediate level students are able to identify all of the countries in
Europe and their capitals?" Progress can be monitored by comparing mastery
levels of the same students over time (e.g., "At the beginning of primary
school, 5% of the students could name the letters of the alphabet. What
percentage of these students can name the letters of the alphabet at the end
of their first school year?") or by comparing the performance of different
students (e.g., "Last year, 65% of the intermediate level students responded
correctly to a two-digit addition problem that didn't involve regrouping; what
is the percentage of intermediate level students this year who responded
correctly to this type of problem?") These types of questions are best
answered using criterion-referenced testing which requires that test
performance be linked to skills that can be well defined. The emphasis is on
describing performance on specified curriculum objectives.
The second type of question focuses on the relative ranking of the
performance of individual students or groups of students compared to a
preestablished normative or reference group. For example, "How does the
performance of this intermediate level student (or students) compare to that
of students in a national sample?" or "On the average, how does the
performance of intermediate level students this year compare with that of
intermediate level students last year?" or "Compared with a 1988 national
sample, what is the proportion of children this year who scored in the lowest
quartile and what is the proportion scoring in the top quartile?" It is also
possible to monitor relative performance of the same student or students over
time: "Last year this student scored at the 45th percentile compared to a
national sample. What is his percentile or relative ranking this year
compared to the national sample?" These kinds of questions are
norm-referenced, that is, the performance of students has been compared to the
performance of children in the norming or standardization group.
Norm-referenced testing allows you to evaluate the performance of an
individual or group of children with reference to other children, but it
requires that the tests that are used be referenced to a common external scale.
Avoiding a Popular Pitfall
A common pitfall in educational assessment is to act as if all attained
scores have meaning and can provide information on changes over time.
However, tests that lack a referent cannot provide trend data; indeed, their
scores carry little meaning at all. Unfortunately, many tests used in developing countries
are neither criterion-referenced nor norm-referenced. There may be good
reasons for this; in some settings it may not be possible to reuse the same
test or test questions from one year to the next. Yet, without a reference or
a way of linking a test to some external scale or criteria, scores derived
from the test lack meaning and cannot be used to answer most of the questions
that are of interest to policy makers.
For example, responding correctly to 75% of the questions or items on a
general achievement or admissions test does not give evidence that the student
has mastered 75% of the subject matter covered by the examination. What would
it mean to say that a person has mastered 75% of a test covering World War II?
Does the person know 75% of the important dates? Was the person able to list
75% of the important battles? Did the examinee write essays that were 75%
accurate? Although this type of percentage scale is widely used, its
meaningfulness is very limited. It should not be confused with
criterion-referenced statements (e.g., "Fifty-five percent of the students
tested correctly identified the dates of World War I and World War II."). The
latter requires a specificity of proficiency that is not reflected in the
percent correct score of a test covering multiple curriculum objectives such
as a general achievement test or an admissions test.
Another popular misinterpretation is to say that there is "an annual
failure rate of 50%" and conclude that this says something meaningful about
relative performance from one year to the next. Generally, in developing
countries new examinations are developed every year and the rate of success is
arbitrarily set to control the flow of students. Similarly, saying that a
student scored higher than 60% of the students taking the test this year
doesn't necessarily allow you to say anything about how the student would have
compared to students taking a different test last year. In both instances,
unless provisions have been made for equating the scores or calibrating the
items on the two tests, comparisons across years are not possible. Perhaps,
for example, the teachers have been more effective in teaching this past year,
and, as a group, the students taking the test this year know more than the
students who took last year's test. Unless some reference point is
used--whether norm-referenced or criterion-referenced or some combination or
derivation of the two--test performance will not provide the evidence
necessary to demonstrate this educational improvement.
III. Selecting Assessment Strategies
The assessment questions that are identified provide the basis for
determining the appropriate assessment strategies. As suggested above, some
questions are answered most effectively by norm-referenced testing whereas
other questions lend themselves to criterion-referenced testing. Also, due to
recent psychometric advances, in some situations, it is possible to use
strategies that combine or enhance these two major assessment techniques.
Properties of norm-referenced, criterion-referenced, and combination approaches
are summarized briefly below.
Norm-referenced tests
Norm-referenced testing is used when students' scores are interpreted
with reference to a particular group, usually referred to as the
standardization or norm group. The intent is to assess where a particular
community, school, or individual lies on a normal distribution curve whose mean
and distribution are established from the pretesting of students in the
the norm group. The emphasis is on the relative standing of individuals
rather than on absolute mastery of content. Thus, one might say, "This year
55% of the students tested scored above the national average whereas last year
52% of the students tested scored above the national average." Because the
norm group provides a basis for anchoring and interpreting raw scores--in
this case for establishing the national average--it is possible to use the
test to make performance comparisons between individuals, schools, performance
last year versus this year, and so on.
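To make the arithmetic of such norm-referenced statements concrete, the following minimal sketch (in Python) computes a percentile rank against a norm group and the share of current examinees scoring above the norm-group average. All data are simulated and every name and number is illustrative, not drawn from any actual norming study.

```python
# Minimal sketch: interpreting raw scores against a fixed norm group.
# All data are simulated; percentile_rank is an illustrative helper.
import numpy as np

def percentile_rank(raw_score, norm_scores):
    """Percentage of the norm group scoring below the given raw score."""
    return 100.0 * np.mean(np.asarray(norm_scores) < raw_score)

rng = np.random.default_rng(0)
# Hypothetical norm group of 1,000 examinees on a 60-item test.
norm_scores = rng.normal(loc=35, scale=8, size=1000).round()
national_average = norm_scores.mean()

# This year's examinees, compared against the same fixed norm.
this_year = rng.normal(loc=36, scale=8, size=500).round()
share_above = 100.0 * np.mean(this_year > national_average)
print(f"{share_above:.0f}% of this year's students scored above the national average")
print(f"A raw score of 45 falls at about the {percentile_rank(45, norm_scores):.0f}th percentile")
```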
In order to create norms, the test is administered to the norm group.
If generalizations are going to be made (e.g., compared to a national sample
of students who were tested at the completion of primary school, compared to
other schools in the region, etc.), it is important that this sample be
representative of the group who will eventually use the examination. It
should match as closely as possible the various demographic characteristics of
the population with which it will be used. The sample also should be large
enough to ensure stability of the test scores and inclusion of the various
groups that are represented in the designated population.
A goal in the development of norm-referenced tests is total score
variability. If all examinees get all items right or all items wrong, there
is no score variability and no way of distinguishing or ranking the examinees.
Consequently, it is important to have items that are answered correctly by
some but not all of the individuals taking the test. Further, it is desirable
if performance on each of the items is highly correlated with total test score
(i.e., high scorers get the item correct while low scorers get the item
incorrect). This aspect of an item is termed item discrimination.
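The two classical indices just described can be computed directly from a matrix of scored responses. The sketch below follows common conventions (proportion correct as the difficulty index; a point-biserial correlation with the total score as the discrimination index); the response data and names are illustrative, not from the paper.

```python
# Minimal sketch of classical item statistics on a 0/1 response matrix
# (rows = examinees, columns = items). Data are illustrative.
import numpy as np

responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 1],
])

total_scores = responses.sum(axis=1)

# Item difficulty: proportion of examinees answering each item correctly.
difficulty = responses.mean(axis=0)

# Item discrimination: correlation of each item with the total score
# (a simple point-biserial; high values mean high scorers tend to get
# the item right while low scorers tend to get it wrong).
discrimination = np.array([
    np.corrcoef(responses[:, j], total_scores)[0, 1]
    for j in range(responses.shape[1])
])

print("difficulty:    ", difficulty.round(2))
print("discrimination:", discrimination.round(2))
```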
Score variability is determined in part by the distribution of relatively easy
or hard items on the examination (i.e., item difficulty). This distribution
should depend on the purpose(s) of the assessment. For example, a test used
exclusively for selection purposes should include items that will accurately
rank order examinees who are on the borderline between being accepted or
rejected. For scholarship or placement purposes it also may be useful to have
an accurate ranking of those who score above this decision point. If a
ranking of low scorers is not required, items that would rank or discriminate
between low scorers (easier items) would not be necessary.
Examinations used for system-wide monitoring of educational progress
need to discriminate accurately at a broad range of achievement levels. For
example, it is desirable to know if an increase in the allocation of textbooks
or other school resources is associated with changes in performance of low
scoring students as well as high scoring students. Remedial and enrichment
programs may affect students differentially. A selection test such as is
described above (i.e., with a high concentration of relatively difficult
items) would be unlikely to reflect the effects of a program designed to train
all students to at least a minimum level of literacy. If one test is going to
be used to rank examinees accurately at all achievement levels, a broad
distribution of item difficulties (easy, medium, and hard items) is desirable.
The potential trade-off is that to do this the test may need to be excessively
long.
Limitations and Considerations. There are several important limitations
of norm-referenced tests. A critical factor is that the norms that are used
must provide comparisons that are relevant and accurate in terms of the
purpose for which the test was administered. For some purposes national or
regional norms may be the most relevant. In other circumstances norms
developed on a particular portion of the population may be meaningful. In
either case, careful thought must be given to selecting the appropriate group.
For example, national norms could refer to a representative sample of all
children attending state-run schools, all children attending any type of school
(public or private), or all children (those attending any type of school and
those not attending school). If a national goal is to increase the proportion
of children attending schools, the norms associated with the first two groups
would quickly become outdated since, it is hoped, the "attending school"
population would be changing annually.
In addition, norms based on the performance of individuals are
appropriate for the evaluation of individuals but are not appropriate for
comparisons between groups of students. The variability of scores for
individuals is far greater than the variability, for example, of school means.
Therefore, if individual norms are used to interpret school means, a school
whose students average higher than the mean of the individual norms will be
undervalued, since the average performance of those students will appear
less exceptional than it actually is. The reverse is true for schools whose
average is below the mean. Thus, if a purpose of testing is to compare
schools, school-mean norms must be generated.
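A small simulation illustrates the point; all distributions and numbers below are assumptions chosen only to show that school means vary far less than individual scores.

```python
# Minimal sketch of why individual norms misrepresent school means:
# the spread of school averages is much narrower than the spread of
# individual scores. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n_schools, pupils_per_school = 200, 50

# Individual scores: a school effect plus within-school variation.
school_effects = rng.normal(0, 5, size=n_schools)
scores = school_effects[:, None] + rng.normal(50, 15, size=(n_schools, pupils_per_school))

school_means = scores.mean(axis=1)
print(f"SD of individual scores: {scores.std():.1f}")
print(f"SD of school means:      {school_means.std():.1f}")
# A school mean one individual-SD above the overall average is in fact
# several school-mean SDs above average, so judging schools against
# individual norms understates how unusual that school is.
```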
Another limitation of norm-referenced testing is that the same or
equated forms of the test must be used in order to make the link with the
reference group. Using the same form from one year to the next can be
problematic. A survey of the 50 states in the United States found that no
state reported being below the norm on any of the six major nationally normed,
commercially available achievement tests and most states reported being well
above the national norm (Cannell, 1988). Of course this does not mean that all
states are above the current year's national average! What it does suggest is
that the average reported achievement this year was higher than the average
achievement of the national norm groups when the norming data were collected
several years previously. In addition, this assertion that all states are
above the national average probably reflects teaching to the test since
typically the same test forms are reused annually.
To avoid reusing the same test form, it becomes necessary to construct
equivalent forms of the same test. However, since two forms of a test can
rarely if ever be made to be precisely equivalent in level and range of
difficulty, it becomes necessary to equate the forms. Equating is a process
through which the correspondence of scores on different forms of a test is
established. The objective is to convert the system of units of one form to
the system of units of the other so that scores derived from the two forms
after conversion will be directly equivalent.
Empirical data are needed to equate test scores. The data collection
procedure that is used must provide some type of commonality or overlap
between the two sets of test data obtained from two different forms of the
test. This commonality can come from an overlap in examinees (the same
examinees take items from both tests) or in items (each set of examinees takes
one test but the same subset of items are on both tests). There are a number
of acceptable ways this can be accomplished, most of which are extensions or
variations on three basic designs: counterbalanced, equivalent groups, and
anchor tests. In the counterbalanced design, a sample of subjects is randomly
assigned into two subgroups. Both groups are administered both forms of the
test; however, the order in which the two forms are given is reversed in the
two groups. With the equivalent groups design, subjects are randomly assigned
to two subgroups that are assumed to be equivalent and each group receives a
unique form of the test. With the anchor test design, a small number of common
items are embedded in both forms of the test. The set of common items is
called the anchor test and responses to the anchor test can be used as the
basis for equating. Methods associated with all three of these designs can be
jeopardized by breaches in test security.
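As one concrete illustration, the sketch below applies the linear equating method under the equivalent-groups design: a form-X score is mapped onto the form-Y scale by matching standardized deviation scores. This is a simplified textbook procedure; operational equating, particularly under the anchor-test design, uses more elaborate methods (e.g., equipercentile or Tucker approaches). The data are simulated and all names are illustrative.

```python
# Minimal sketch of linear equating under the equivalent-groups design.
# Two randomly equivalent groups each take one form; scores on form X
# are converted to the form-Y scale by matching z-scores.
import numpy as np

rng = np.random.default_rng(2)
form_x = rng.normal(32, 7, size=2000)   # group A takes form X (a harder form)
form_y = rng.normal(36, 8, size=2000)   # group B takes form Y

def linear_equate(x, mean_x, sd_x, mean_y, sd_y):
    """Map a form-X score to the form-Y scale: same z-score, Y units."""
    return mean_y + sd_y * (x - mean_x) / sd_x

x_score = 40
y_equiv = linear_equate(x_score, form_x.mean(), form_x.std(), form_y.mean(), form_y.std())
print(f"A score of {x_score} on form X corresponds to about {y_equiv:.1f} on form Y")
```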
Criterion-referenced tests
When students' scores are interpreted with reference to a well defined
skill, the assessment is criterion-referenced. One might ask, "Can a
particular child perform a specific mathematical task or has the child
acquired particular knowledge?" Similarly, one might want to know what
percentage of children can successfully complete this task or have mastered a
particular skill. In criterion-referenced measurement the emphasis is on
assessing mastery of specific, clearly defined, relevant behaviors. Tests are
specifically constructed to support generalizations about an individual's
performance relative to a specified domain of instructionally relevant tasks.
Criterion-referenced interpretation does not depend on comparisons between
students, that is, a criterion-referenced interpretation is not diminished if
all students obtain the same score because each student's score is interpreted
independently. Score variability is not essential as it is with
norm-referenced testing.
The principal challenge in criterion-referenced testing is to adequately
and operationally define the domain of content or behaviors the test is to
measure. By defining the domain, one is specifying the boundaries of the
knowledge or skill that is to be evaluated. A domain is well-defined when it
is clear to those reading the domain definition and specifications (whether
test developer, user, or policy-maker) what kinds of tasks or items should and
should not be considered as part of the domain. Well-defined domains are a
necessary condition for criterion-referenced assessment since the basic idea
is to generalize how well an examinee can perform with regard to a particular
domain based on a sampling of items from that domain. To illustrate: an
examinee need not complete all of the possible combinations for adding two
single-digit numbers (e.g., 1 + 1 = 2; 1 + 2 = 3; and so on) in order for one
to draw the conclusion that this domain of skills is mastered if the student
can demonstrate success on a representative sampling of the possible
combinations.
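A minimal sketch of this domain-sampling logic follows; the domain (the 100 single-digit addition facts), the sample of 15 items, and the 80 percent mastery cutoff are all illustrative assumptions, not prescriptions from the paper.

```python
# Minimal sketch of domain sampling for a criterion-referenced test:
# enumerate the full domain, draw a representative sample of items, and
# report a mastery decision against an illustrative cutoff.
import random

domain = [(a, b) for a in range(10) for b in range(10)]  # all 100 addition facts
random.seed(0)
test_items = random.sample(domain, 15)  # the sampled test form

def administer(items, answer):
    """Proportion of sampled items the examinee's answer function gets right."""
    return sum(answer(a, b) == a + b for a, b in items) / len(items)

proportion_correct = administer(test_items, answer=lambda a, b: a + b)  # a perfect examinee
print("mastered" if proportion_correct >= 0.80 else "not mastered")
```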
Several decisions must be made in selecting and defining the domains
that will be assessed. It is necessary to decide the academic achievement
level(s) that will be the focus (i.e., whether basic skills, higher level
skills, or a broad spectrum of skills) of the assessment effort. One must
decide also if it is sufficient to gauge mastery versus nonmastery, or, if
possible, if the test should also provide information on levels of mastery or
other diagnostic information.
The validity and interpretation of criterion-referenced test scores are
contingent upon the precision of the definition and specifications of the
domain. This task is facilitated if the domain can be ordered, that is, the
behaviors in the domain can be ordered along a continuum of achievement.
Indicating a student's position within a well-defined ordered domain indicates
what that individual can and cannot do. Nitko (1980) points out several ways
that a domain can be ordered: subject matter difficulty or complexity (e.g.,
a series of math items ordered such that answering each item requires mastery
of a skill over and above the skills measured by easier items); degree of
proficiency with which complex skills are performed (e.g., differences
associated with being an amateur versus an expert; may include speed and
accuracy); prerequisite learning or developmental sequences (e.g., ordering
the domain based on a hierarchy of learning or developmental prerequisites
such as those defined by Piaget or Gagné);1/ location on an empirically
defined latent trait (described below); or judged social or aesthetic quality.
Well-defined ordered domains make it possible to identify and describe levels
of domain mastery rather than simply mastery versus nonmastery.
A large number of important school learning outcomes represent
behavioral domains that conceptually cannot be fully ordered along a continuum
of achievement. Examples include locating words in a dictionary or
demonstrating reading comprehension. While these activities may require the
student to perform a particular sequence of behaviors, measuring the student's
proficiency with these skills would not necessarily locate the individual's
performance on a continuum within the domain. Many of these skills can be
well-defined (e.g., by using strategies developed by Popham, 1978, 1984);
however, the resulting criterion-referenced tests would be nonscaled (i.e.,
knowing someone's score would not necessarily reveal their location on a
continuum of achievement or identify the tasks on this continuum that they
could or could not do).
Once the domain has been well-defined, the test and item specifications
can be devised. The assumption is that if item and test specifications can be
described in sufficient detail, then items constructed to meet these
specifications, viewed collectively, constitute a sample from the theoretical
domain of all possible items that could be written for each of the objectives
defined in the specifications. These items can be evaluated for item-objective
congruence and the ability to differentiate between masters and nonmasters.
Repeated item sampling, while yielding different items, would nonetheless
allow domain-referenced interpretation of student performance, facilitating a
criterion-referenced form of equating.

1/ Several psychologists assert that children learn and acquire skills in a
specified hierarchical sequence (e.g., rote learning precedes problem solving)
or go through invariant stages as they develop cognitively (e.g., thinking in
concrete terms precedes abstract thinking). See reference list for more detail.
There are several important advantages to criterion-referenced testing.
One benefit is that the obtained scores have meaning for instructional
purposes. If a student or students haven't mastered a particular skill, this
will be evident from their scores, and remediation can be planned accordingly.
A related benefit is that when students demonstrate mastery, their skill level
can be defined in terms that are meaningful to people within and outside of
the educational system. These benefits make it possible to monitor skill
acquisition at the individual, classroom, school, community, and system levels
and to identify meaningful trends in educational achievement.
Limitations and Considerations. A limitation of criterion-referenced
testing has been in the area of actual test construction. One major problem
has been that items written to the same specifications do not always permit
comparable interpretations. Further, design of a domain-sampling test calls
for several arbitrary decisions: (1) defining the boundaries of the domain,
(2) deciding on what format will be used to present the questions, (3)
deciding in what form the examinee will be asked to respond (e.g., open ended
versus multiple choice), (4) deciding how large a sample of items is
sufficient to represent the domain, and (5) specifying what percentage of the
domain the examinee must appear to have learned in order to be credited with
"mastery" of the domain (Thorndike, 1982).
The multiple choice format is frequently the format of choice because of
its objectivity and ease of scoring. However, preparing fool-proof test
specifications for multiple choice examinations is complicated by the fact
that the choice of the wrong answers or distractors has an impact on the item
difficulty. Consider the question: What is the weight in pounds of the
average adult human male? If the five possible responses are whole numbers
between 150 and 220, the problem is more difficult than if the responses are
five numbers, each of which is larger than the preceding response by a factor
of ten (e.g., 2, 20, 200, 2,000, 20,000). Or, taken to an unlikely extreme, what
if four of the possible responses are names of colors and only one of the
choices is a number? Similarly, an item that asks for the date that an event
occurred tends to be easier if the response choices are centuries apart rather
than single years apart.
As was mentioned earlier, another limitation of criterion-referenced
testing is that, in non-ordered domains, levels of proficiency are not
identifiable. One must set arbitrary mastery/nonmastery cutoff scores. That
is, a given test score is arbitrarily assumed to represent mastery. While a
variety of empirical and judgmental methods have been identified for setting
cutoff scores, the end result is still limited to mastery/nonmastery or
pass/fail. Similarly, intermediate gradations (e.g., a score range designated
to represent "partial mastery") can only be set arbitrarily and do not
indicate which skills within the domain a student has or has not mastered.
Also, criterion-referenced testing may not be appropriate in all content
areas. For example, if the intent of testing is to measure mental ability,
consider the difficulty one would have trying to define the boundaries of this
construct. Similarly, critical thinking, creativity, reasoning ability, and
so on, appear to be beyond the scope of criterion-referenced testing.
Combinations and Derivations
Item Response Theory (IRT) and Item Banking. Item response theory (also
referred to as latent trait theory) emphasizes levels of performance on a
particular trait or construct. An IRT or latent trait model specifies a
relationship (described by a mathematical function) between observable
examinee test performance and the unobservable (thus, latent) trait or ability
assumed to underlie performance on the test.
An underlying assumption in IRT methodology is that the trait to be
measured is unidimensional and that all the items in a test are homogeneous in
the sense of measuring only a single ability or latent trait. Using IRT
methodology it is possible to estimate properties of items (i.e., item
difficulty, item discrimination, and guessing parameters) that are invariant
across groups of examinees and that can be used to estimate the level at which
a given individual falls on the specified latent trait. Thus, given a set of
items that have been fitted to a latent trait model, it is possible to
estimate an examinee's ability on the same ability scale from any subset of
items (as long as the number of items is not too small) in the domain of items
that have been fitted to the model.
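For concreteness, the sketch below evaluates one widely used latent trait model, the three-parameter logistic (3PL) item response function, which expresses the probability of a correct response as a mathematical function of the latent ability and the item's difficulty, discrimination, and guessing parameters. The parameter values here are illustrative.

```python
# Minimal sketch of the three-parameter logistic (3PL) item response
# function. theta is the latent ability; a, b, c are the item's
# discrimination, difficulty, and guessing parameters.
import math

def p_correct(theta, a, b, c):
    """3PL model: P(correct | theta) = c + (1 - c) / (1 + exp(-a(theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# An item of moderate difficulty (b = 0.5), good discrimination (a = 1.2),
# and a 20% guessing floor (c = 0.2), evaluated at three ability levels.
for theta in (-1.0, 0.0, 1.0):
    print(f"theta = {theta:+.1f}: P(correct) = {p_correct(theta, 1.2, 0.5, 0.2):.2f}")
```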
In some respects, IRT models are analogous to well-defined ordered
domains in criterion-referenced testing. Both make the assumption of
unidimensionality of the trait or domain to be measured. Both focus on
locating individuals on a continuum of ability or proficiency.
A logical extension of IRT is item banking. For the testing of a particular
construct, instead of creating a different test each time a test is needed,
one can establish an item bank or a warehouse of test items on that construct.
Each time a test is needed, items can be selected in a mix-and-match fashion
from this item bank to make a new test. The principal requirement of an item
bank is a large number of field-tested items with known properties (e.g., item
discrimination, item difficulty, and so on) so that the ability of each new
test to reflect the construct is known. The scores from each new test are
comparable to those from other tests constructed from the same item bank.
Because IRT methodology allows item difficulty and item discrimination to be
calibrated to a construct or trait, rather than to a total test score or a
particular set of individuals, it is well suited to item banking.
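A minimal sketch of the mix-and-match step follows; the bank contents, field names, and target difficulty mix are illustrative assumptions rather than any operational specification.

```python
# Minimal sketch of assembling a new form from a bank of calibrated items.
import random

bank = [
    {"id": i, "difficulty": random.Random(i).uniform(-2, 2)}
    for i in range(500)
]  # stands in for field-tested items with known parameters

def assemble_form(bank, n_easy=10, n_medium=20, n_hard=10):
    """Mix-and-match a form with a target spread of item difficulties."""
    easy = [it for it in bank if it["difficulty"] < -0.5]
    medium = [it for it in bank if -0.5 <= it["difficulty"] <= 0.5]
    hard = [it for it in bank if it["difficulty"] > 0.5]
    rng = random.Random(42)
    return rng.sample(easy, n_easy) + rng.sample(medium, n_medium) + rng.sample(hard, n_hard)

form = assemble_form(bank)
print(len(form), "items drawn from the bank for this administration")
```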
Limitations of IRT methodology are that (1) very large samples of examinees
are required initially for the calibration of items, (2) sophisticated data
analysis procedures are used to calibrate items, and (3) items are calibrated
in order to be reused, thus making test security an important issue.
Proficiency Scaling. Proficiency scaling combines norm-referenced
testing with some of the advantages of criterion-referenced testing. The
purpose of such scaling is to relate selected score levels on a
norm-referenced test to specific levels of mastery or proficiency. Thus,
getting 15 questions right on a 60 item math test could place the examinee at
the 10th percentile of the norming sample (norm-referenced) and be associated
with mastery of addition of two single-digit numbers; getting 55 questions
right could place the student at the 85th percentile and be associated with
mastery of addition involving 2 two-digit numbers with carrying. Ideally,
once examinees know their scores on a test, they would simultaneously be able
to determine their rankings relative to the norm group and their levels of
proficiency on the ability or domain measured by the test.
Two approaches have been proposed for proficiency scaling. The first
uses a theoretical hierarchical model of cognitive demand as the basis for
identifying relevant proficiency levels and then attempts to locate sets of
items that reflect these levels. This approach is referred to as "model
based" and it is theoretically consistent with criterion-referenced,
well-defined, ordered domains. Both criterion-referenced ordered domain and
model based approaches assume that skills can be ordered on a continuum and
that proficiency at the second level requires mastery of levels one and two,
at the third level, mastery of levels one, two and three, and so on. With the
model based approach, once proficiency levels have been defined, small sets of
items that reflect each of the levels are identified in an existing or newly
developed norm-referenced test. The next step is to evaluate empirically
whether data from actual student performance is consistent with the assumed
hierarchical model (i.e., whether student scores are correlated with
performance on the items representing each proficiency level). If the data
support the proposed model, the proficiency levels can be tied to specific
scores on the norm-referenced test.
The second major approach to proficiency scaling--empirical
anchoring--allows test performance data to build the model. No a priori
cognitive model is assumed and, in a sense, the test is allowed to describe
itself. Performance data from a large sample of examinees is used to identify
items that discriminate well at the test score levels for which a proficiency
description is desired. An item is considered to discriminate well if the
majority of students at one test score level can successfully complete the
item but the majority of students at the next lower test score level are not
able to complete the item successfully. Once a subset of discriminating items
has been identified for each of the predetermined test score levels, a group
of subject matter experts examines the item subsets and provides proficiency
descriptions for each of the designated test score levels.
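The item-flagging step of empirical anchoring can be sketched as follows; the simulated response data, the chosen score levels, and the majority (50 percent) rule are illustrative assumptions.

```python
# Minimal sketch of the empirical-anchoring step: flag items that the
# majority of examinees at one test score level answer correctly but the
# majority at the next lower level do not. Data are simulated.
import numpy as np

rng = np.random.default_rng(3)
n_examinees, n_items = 5000, 40
ability = rng.normal(0, 1, n_examinees)
difficulty = np.linspace(-2, 2, n_items)
# Simulated 0/1 responses from a simple logistic model.
p = 1 / (1 + np.exp(-(ability[:, None] - difficulty)))
responses = (rng.random((n_examinees, n_items)) < p).astype(int)
total = responses.sum(axis=1)

def discriminating_items(level, lower, band=2):
    """Items passed by most examinees near `level` but failed by most near `lower`."""
    at_level = responses[np.abs(total - level) <= band].mean(axis=0)
    at_lower = responses[np.abs(total - lower) <= band].mean(axis=0)
    return np.where((at_level > 0.5) & (at_lower < 0.5))[0]

for lower, level in [(10, 15), (15, 20), (20, 25)]:
    print(f"score level {level}: candidate anchor items {discriminating_items(level, lower)}")
```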
There are three limitations to proficiency scaling approaches: first,
the test must measure a trait or domain that can be ordered; second, very
large sample sizes are needed to establish proficiency levels; third, items
used for linking proficiency levels to test score levels are reused, thus
making test security important.
Summary
As is evident from this discussion, selecting an assessment approach
should take into account the kinds of questions that one hopes to answer and
the limitations of using the approach given local conditions.
Criterion-referenced and norm-referenced approaches each have advantages
as well as limitations. While not mutually exclusive, the purposes of these
two approaches are different, and, in most instances, it is unlikely that a
single test will be optimal in responding to these purposes. It may be that
different approaches are needed at different times in the educational
sequence. For example, annual criterion-referenced testing could be used to
monitor selected proficiency levels whereas less frequent norm-referenced
testing could be used to gauge overall academic progress. Regardless of the
assessment approaches that are selected, however, referencing examinations to
an external scale is critical to obtaining meaningful data.
The assessment approach that is selected also depends upon the
feasibility of using the approach given local conditions (i.e., feasibility of
maintaining test security, availability of adequate resources for test
development and data analysis, etc.). Some of these issues have been mentioned
as part of the limitations of the various approaches (e.g., test and item
security); others are discussed below.
IV. Standardization and Local Conditions
Once the assessment questions have been identified, it is necessary to
consider local conditions and the degree to which the requirements of various
assessment approaches can be met. If meaningful comparisons between tests,
administrations, and educational units (individuals, schools, etc.) are to be
made, there must be some standardization in the assessment practices that are
utilized. With regard to test administration, standardized procedures would
apply to the verbal instructions that are provided during the administration,
timing, materials, and the testing environment. Differences in exam time
provided, amount of disruption, clarity of directions, and availability of
technical aids (e.g., calculators, dictionaries, etc.), could be expected to
give some examinees an unfair advantage or disadvantage.
Test scoring represents another aspect of the assessment process that
should be standardized. Procedures for scoring should be defined so as to
minimize the subjectivity involved. Not surprisingly, the scoring of essay and
open-ended questions is more problematic than the scoring of questions posed
in an objective format such as multiple-choice, true-false, or matching.
Related to standardization of scoring is equating. If one exam would be
scored differently by two different readers, the feasibility of equating two
different tests is tenuous.
If large scale standardization is not possible, one alternative is to
reduce the scope of the assessment process. Some educational questions can be
answered effectively using an evaluation design in which data are collected
from selected samples of students rather than the entire population. In
limiting the scope, more resources are available to focus on collecting data
that have known reliability. Proper sampling of examinees would permit
generalizations to the broader population.
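As a minimal sketch of this idea, the following estimates a system-wide mastery rate from a simple random sample of 1,200 students and attaches a basic 95 percent margin of error; all numbers are illustrative.

```python
# Minimal sketch of answering a system-level question from a sample
# rather than the full population: estimate the proportion of students
# reaching mastery, with a simple 95% margin of error.
import math, random

random.seed(4)
population = [random.random() < 0.62 for _ in range(200_000)]  # true rate 62%

sample = random.sample(population, 1200)
p_hat = sum(sample) / len(sample)
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / len(sample))
print(f"Estimated mastery rate: {p_hat:.1%} +/- {margin:.1%}")
```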
V. Final Considerations
Effective assessment of educational progress is a complex process.
Deciding how to assess student academic achievement and monitor student
academic progress requires consideration of many different factors. One must
begin by deciding upon assessment goals. To do this it is necessary to
consider what kind of achievement will be assessed and what are the important
assessment questions. The next step is selecting the appropriate assessment
strategies. At this stage it is critical to keep in mind that it is better to
select a manageable process that results in valid, reliable data that is
meaningful for answering questions about students than to be overly ambitious.
The ultimate choice of approaches must be responsive to the assessment
questions to which answers are sought; at the same time, it must take into
account the limitations and possibilities due to local conditions and
resources.
Devising and implementing an assessment program that is both effective
and efficient is likely to be a multistep, multistage process. It is
frequently useful to develop an implementation plan in which parts of the
program are introduced in stages. For example, by initially limiting the
content areas that will be covered or the age levels that will be assessed, or
by sampling from the larger population, resources can be focused on promoting
quality rather than quantity.
A final point is that education goals should drive the agenda of testing
and evaluation, and not the other way around. Tests should not drive the
curriculum. However, policy makers and educators must recognize that tests can
have this effect. Teachers and students alike know what is valued and rewarded
and they act accordingly. This is an additional reason for exercising care in
planning assessment. Decisions about what kind of achievement to measure and
how educational progress will be monitored must be an integral part of
educational goals.
References

Anastasi, A. (1988). Psychological testing (6th ed.). New York: Macmillan.

Angoff, W.H. (1971). Scales, norms, and equivalent scores. In R.L. Thorndike (Ed.), Educational Measurement (2nd ed.). Washington, D.C.: American Council on Education.

Berk, R.A. (Ed.). (1980). Criterion-referenced measurement: The state of the art. Baltimore, MD: Johns Hopkins University Press.

Berk, R.A. (Ed.). (1984). A guide to criterion-referenced test construction. Baltimore, MD: Johns Hopkins University Press.

Cannell, J.J. (1988). Nationally normed elementary achievement testing in America's public schools: How all 50 states are above the national average. Educational Measurement: Issues and Practice, 7, 5-9.

Cole, N.S. (1990). Conceptions of educational achievement. Educational Researcher, 19, 2-7.

Gagné, R.M. (1970). The conditions of learning (2nd ed.). New York: Holt, Rinehart & Winston.

Glass, G.V. (1978). Standards and criteria. Journal of Educational Measurement, 15, 237-261.

Hambleton, R.K., & Cook, L.L. (1977). Latent trait models and their use in the analysis of educational test data. Journal of Educational Measurement, 14, 73-96.

Hambleton, R.K., & Eignor, D.R. (1978). Guidelines for evaluating criterion-referenced tests and test manuals. Journal of Educational Measurement, 15, 321-327.

Hambleton, R.K., Swaminathan, H., Algina, J., & Coulson, D.B. (1978). Criterion-referenced testing and measurement: A review of technical issues and developments. Review of Educational Research, 48, 1-48.

Heyneman, S.P., & Ransom, A.W. (1989). Using examinations and testing to improve educational quality. Paper presented at the VIIth World Congress of Comparative Education, Montreal, Canada.

Nitko, A.J. (1980). Distinguishing the many varieties of criterion-referenced tests. Review of Educational Research, 50, 461-485.

Nitko, A.J. (1984). Defining "criterion-referenced test." In R.A. Berk (Ed.), A guide to criterion-referenced test construction. Baltimore, MD: Johns Hopkins University Press.

Petersen, N.S., Kolen, M.J., & Hoover, H.D. (1989). Scaling, norming, and equating. In R.L. Linn (Ed.), Educational Measurement (3rd ed.). New York: Macmillan.

Piaget, J. (1971). The theory of stages in cognitive development. In D.R. Green, M.P. Ford, & G.B. Flamer (Eds.), Measurement and Piaget. New York: McGraw-Hill.

Popham, W.J. (1978). Criterion-referenced measurement. Englewood Cliffs, NJ: Prentice-Hall.

Popham, W.J. (1984). Specifying the domain of content or behaviors. In R.A. Berk (Ed.), A guide to criterion-referenced test construction. Baltimore, MD: Johns Hopkins University Press.

Suen, H.K. (1990). Principles of test theories. Hillsdale, NJ: Lawrence Erlbaum.

Thorndike, R.L. (1982). Applied psychometrics. Boston, MA: Houghton Mifflin.