Accountability: The Task, the Tools, and the Pitfalls
Author: Walter N. Durost
Source: The Reading Teacher, Vol. 24, No. 4, Testing (Jan., 1971), pp. 291-304, 367
Published by: Wiley on behalf of the International Reading Association
Stable URL: http://www.jstor.org/stable/20196498
Accessed: 24/06/2014 23:44
Walter Durost is an Adjunct Professor of Education at the
University of New Hampshire.
Accountability: the task, the tools, and the
pitfalls
WALTER N. DUROST
Within recent months the United States Office of Education has taken the position, especially in the case of Title I, that it is essential to show "hard data" to prove that the enormous amounts of money being spent are not being wasted. Thus, the term "accountability" has come into vogue and it is very much the "in" thing to talk in these terms. What does accountability mean, in fact? According to Webster's Third New International Dictionary Unabridged, accountability means, "the quality or state of being accountable, liable or responsible." The American Heritage Dictionary is even more specific on the point. It says, "accountable, adj., accountability, noun, subject to having to report, explain, or justify; responsible, answerable."
SOURCES OF DEMAND
In both of these definitions the idea comes through loud and
clear that the individual or individuals carrying out some assigned task are to be held accountable for their performance. There is, by
implication at least, the further thought that if the individual or
individuals performing the given task do not provide convincing evidence of their success, they will somehow be made to "pay the
piper." Whether or not the threat is real is beside the point if the
local authorities accept it as real. In Title I this conceivably could
mean losing the funds now supplied by the federal government for
Title I projects. There is a demand for positive proof to present to
Congress, for example, that Title I programs are successful. The
reasonableness of the demand is not the question; the problem is
how to meet it without destroying public education in the process
by destroying public confidence in its schools. That this is a real
threat is perfectly evident by examining the course of events in
recent months. Perhaps some examples will suffice.
In instance after instance school bond issues are being defeated and budgets are being cut. In other instances private contractors are being paid to do the work of the school, almost always with federal money. In most instances the contractor gets paid only to the extent that he can "prove" real gains, usually stipulated in terms of months of grade. There is no real understanding of the fallacies involved in this procedure, relating to the operation of chance effects; inadequately standardized tests; and instances, especially in the upper grades, where a "gain" of one point of score is translated into several "months." Since the cases selected for instruction are usually those who are retarded to start with, regression effect alone on repeat testing will guarantee a "gain" which is, in fact, due to chance alone.
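The regression effect just described can be illustrated with a brief simulation (the figures are invented; the model simply adds independent measurement error to a fixed "true" reading level, with no instruction effect whatever):

```python
import random
import statistics

random.seed(42)

# Model: each child has a stable "true" reading level; every testing
# adds independent measurement error.  There is NO instruction effect.
N = 5000
true_level = [random.gauss(50, 10) for _ in range(N)]
pretest = [t + random.gauss(0, 5) for t in true_level]
posttest = [t + random.gauss(0, 5) for t in true_level]

# Select the lowest-scoring quarter on the pretest, as remedial
# programs commonly do.
cutoff = sorted(pretest)[N // 4]
selected = [i for i in range(N) if pretest[i] <= cutoff]

pre_mean = statistics.mean(pretest[i] for i in selected)
post_mean = statistics.mean(posttest[i] for i in selected)

print(f"pretest mean of selected group:  {pre_mean:.1f}")
print(f"posttest mean of selected group: {post_mean:.1f}")
print(f"apparent 'gain' from chance alone: {post_mean - pre_mean:.1f}")
```

The selected group scores higher on retest even though nothing was taught: children chosen partly for their bad luck on the first testing regress toward their true level on the second.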
Many such contract arrangements provide for individualized
instruction under very favorable environmental conditions and
using sophisticated teaching machines. Extraneous rewards are
offered to the students, such as green stamps, free radios, etc. Moreover, the time allotment for this specialized instruction is much greater than that allowed in the normal curriculum.
This notion of "accountability" as a threatening gesture from the state or federal government is a potent source of contamination in any experimental program and denies, to those who are interested in experimentation to find new and better ways, the opportunity to discover these better ways of doing things without personal or professional risk. That freedom of mind which permits the individual to try but fail, and yet count his failure as a success provided it shows the way to some new and better procedure, is in danger of being seriously eroded. Scientific method cannot possibly survive in an atmosphere suffused with threats, real or implied, and therefore the very use of the term "accountability" is anathema to those who are vitally interested in innovation.
The United States Office of Education is now demanding credible, convincing evidence, in terms of test results and/or equally objective data, to prove that Title I funds and similar government subventions are being spent wisely. This demand is both prudent and needful, provided the pitfalls described above are carefully avoided. This calls for more valid and more reliable tests, for the development of non-test techniques for evaluation, and for improved and understandable statistical analysis.
Coincidentally, there is now a high level of public interest in
developing an increased level of reading skill among American
children and adults. Educators have long been aware of the need,
but only very infrequently has financial support at the local level
been forthcoming, especially for corrective programs which are
expensive in terms of personnel and logistical support. Reading has been singled out both because of its essential contribution to all phases of the curriculum and because of the deeply seated belief that having a literate society is of vital concern, especially in a technological age, a fact few would deny.

The criterion for success in Title I reading projects, according
to one memorandum issued by the Office of Education, was a gain of one full year of grade placement for a school year (which usually means seven months) of instruction, with a further provision that the children selected for instruction under these programs should have been at least a full year below normal grade-for-age at the time they began their work. Unfortunately, the grade equivalent or grade placement score, used so confidently in this memorandum to define the population to be taught and the criteria of success, is a very tricky statistic. In many instances, it is an almost meaningless concept which distorts the truth and cannot be depended upon to describe scientifically either the initial status or the degree of progress made by any particular group during any period of time. Gains recorded in grade equivalents are often within the range of chance. This is especially true of short periods of seven to nine months, which usually apply. The specific difficulties involved in the use of the grade equivalent as a statistical unit will be discussed in more detail later.
Under the pressure of time, and suffering from a woeful lack
of properly trained personnel, many school communities and larger educational units have undertaken programs which were not well
suited to the needs of their groups and have collected data, the
significance of which it is almost impossible to determine. It is
the purpose of this article to explore in some detail and to illustrate, where possible, some of the problems involved in accounting for
the progress, or lack of progress, in Title I projects and programs,
particularly the remedial reading programs at the local and state
level. In order to do this, many pertinent questions need discussion.
EVALUATION PROCEDURES
Are proposals for evaluating outcomes in the reading projects presently being written into applications, as required by law, adequate to the need? This is unanswerable as a general question. It must be answered project by project, in terms of what is happening in a particular locality or state department. One of the author's tasks as a Consultant to the Title I Office in the New Hampshire State Department of Education has been to help plan the analysis of the data gathered by a statewide testing program in connection with the evaluative activities of the Title I Office.
One step in this direction has been to develop a series of categories into which Title I projects may be pigeon-holed. In order to obtain the necessary data as to the number of students participating in Title I programs, project by project, in each of the categories, IN, or registration, cards and OUT, or termination, cards were devised. These are overprinted, precoded IBM cards designed to make card punching more objective and accurate. On these cards, the teachers responsible for Title I instruction are required to complete, for each child, a questionnaire which indicates the category in which this child's instruction falls, plus other questions having to do with the amount of time spent each week, etc. This procedure has been in effect for one year, and a preliminary analysis of the IN cards shows, for example, that a very large percent of all pupils in the programs in grades two, four, six, and eight (the only grades so far completed) have been engaged in some project related to corrective work or additional instruction intended to improve the quality of the child's reading performance. The ten categories and the number of children coming under each category according to the analysis of the IN cards for the Fall '69 program are listed below in Table 1. The percentage in the reading project alone is shown in Table 2.
Table 1  Number of cases enrolled in each project category in 1969-70 New Hampshire Title I program

Project Category           Grade Two   Grade Four   Grade Six   Grade Eight
 1. Language                   31          17           11          16
 2. Reading                   657         606          286         398
 3. Speech                     69          26           18           4
 4. Math                        6          55           42           2
 5. Guidance                   79          56           27          19
 6. Special Education           3           2            3           8
 7. Psychiatric Services       14           -            -           -
 8. Aides                      50          27           22           6
 9. Cultural Enrichment        16          25            5          25
10. Other                      35          10           31          16

Totals                        960         824          445         494
Table 2  Number and percentage of Title I cases in corrective reading programs in 1969-70, New Hampshire State Department of Education

Grade          Two       Four      Six       Eight
Number         657       606       286       398
Percentage     68%       74%       64%       80%
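The entries in Table 2 follow directly from Table 1: the reading enrollment in each grade divided by that grade's total enrollment. A brief check, using the figures from Table 1:

```python
# Reading enrollments (Table 1, category 2) and grade totals (Table 1).
reading = {"two": 657, "four": 606, "six": 286, "eight": 398}
totals = {"two": 960, "four": 824, "six": 445, "eight": 494}

# Percentage of all Title I cases in each grade that fall in the
# corrective reading category, to one decimal place.
percentages = {g: round(100 * reading[g] / totals[g], 1) for g in reading}
print(percentages)
```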
Ideally, as one step in the process, every Title I project submitted should be reviewed from the point of view of the appropriate evaluative evidence that can be collected, summarized, analyzed, and reported on. Practically, this may never be possible, because some projects are approved and funded on the basis of the best judgment of the Title I staff that these projects are good for the target schools in question and will result in a generally improved education for the educationally deprived children involved.
In 1968-69, a Fall-Spring comparison in grade equivalents was attempted. In the absence of the criterion information described above, no clear differentiation of those in remedial reading programs versus all others could be made. This analysis proved to be quite unsatisfactory for several reasons. The lack of pupil information concerning the nature of the projects and the level of student participation was one major cause of concern, which is now taken care of by the IN and OUT cards. Another difficulty was the impossibility of making sensible comparisons from grade to grade because of the inadequacies of the grade equivalent type of score transformation. There was no way of knowing what would be normal gain over seven calendar months. It certainly was not seven months of grade equivalent, as commonly assumed. Furthermore, the bivariate distributions for the separate grades two, four, six, and eight clearly revealed the lack of equivalence of the grade equivalent from grade to grade. Children tested at the beginning of grade two could not go below a grade equivalent of 1.0 but could go very much higher. Many pupils at grade eight earned grade equivalents of 10.0+, whereas the fact of the matter is that reading is not recognized as a formal subject above grade eight.

The availability of more complete student information for the
1969-70 program will make possible a more systematic comparison than before, and this is now underway. State stanines will be used in making this comparison. In order to obtain meaningful spring stanines, a random sample of the statewide grade populations was retested in the spring along with those designated as Title I cases in reading by the IN card documents. A separate study is being done on the basis of the random sample to assess changes which can be considered typical for the instructional period involved.
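Stanines are single-digit normalized standard scores. A minimal sketch of the conversion (these are the standard 4-7-12-17-20-17-12-7-4 percent bands, not New Hampshire's actual norm tables, and boundary cases here are arbitrarily sent to the higher stanine):

```python
import bisect

# The standard stanine bands place 4, 7, 12, 17, 20, 17, 12, 7, and 4
# percent of a reference group in stanines 1 through 9.  The cumulative
# percentage boundaries between the bands are therefore:
CUTS = [4, 11, 23, 40, 60, 77, 89, 96]

def stanine(percentile_rank):
    """Convert a percentile rank (0-100) within the reference group
    (here, the statewide grade population) to a stanine (1-9)."""
    return bisect.bisect_right(CUTS, percentile_rank) + 1

# A child at the statewide median falls in stanine 5:
print(stanine(50))
```

A Title I child who moves from state stanine 2 in the fall to stanine 4 in the spring has gained relative to his statewide peers, a comparison that grade equivalents, with their unequal units, cannot support.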
Since the reading group is such a large proportion of the total
Title I population at every grade level, only this category is being
analyzed separately. Attention is being given to the idea of evaluating all of the remaining children not in corrective reading programs as one group, on the assumption that the individualized attention received may, in itself, result in improved academic achievement as measured by standardized tests.
Even refining the major sample analyzed to include only reading projects completely ignores one important aspect of the operating situation. Even those supposedly under corrective reading instruction are not being handled in a sufficiently similar manner in all projects to make this kind of comprehensive statewide evaluation very meaningful. It could be, for example, that, buried in the mass of data, there are some local projects where terrific gains are being made as a result of improved methods of identifying, diagnosing, and instructing children with correctible reading difficulties. These gains might be masked and hidden because many other children are being instructed in programs which generally are well-intentioned and, perhaps, even very successful from some different point of view but do not emphasize large gains in score on an objective reading test as the primary criterion of success.
If this is, indeed, true at the state level, how much truer must
it be when an attempt is made at the national level to merge data for thousands of projects loosely regarded as being related to
corrective reading and to draw conclusions on the effectiveness
of Title I sponsored instruction for improving the general reading
ability of the disadvantaged. Would it not be more sensible if, at level after level in this hierarchy, from the Office of Education down to the local school, well-planned, carefully documented attempts were made to see what could be done to improve the efficiency of corrective work in reading? This might require improving, first, the selection procedures; second, the diagnosis, so as to pinpoint the nature of each child's difficulty; and finally, the evaluative procedures, to be sure that the instruction following analysis and diagnosis was effective.
SELECTION OF CASES
Are the children finally chosen for remedial or corrective work in reading really the ones who need this kind of help? Choosing cases for corrective or remedial work in reading raises a problem, first of semantics and secondarily of basic educational philosophy. Many children are chosen for corrective or remedial work in reading for one reason: they are reading below their normal grade placement for their chronological age. Many of these children may be slow learners. This is said with no attempt whatsoever to label such children as being of low intelligence, but simply to state the inescapable fact that there are great differences among children in regard to the rapidity with which they respond to instruction.
Perhaps the point can best be made if the reader will assume for the moment the validity of the point of view taken by the author that one indication of a child's potential reading ability is the level of his oral vocabulary. While this oral vocabulary may be measured in many different ways, it is measured most pertinently, in the author's opinion, by the use of the reading reinforced technique discussed elsewhere in this issue. According to this technique, a child not only takes a reading test under standard conditions but is also given an equivalent form with the further condition stipulated that the teacher read aloud as the child reads silently.

This oral bridge to comprehension will result in absolute gains in score for a majority of children under normal circumstances because, apparently, few children really do live up to their potential as so determined. (A few very able readers are slowed down by the oral reading and achieve less well under reinforcement.) However, experimental evidence is rather convincing that children with correctible reading defects will make enormous gains, comparatively speaking, under these conditions if their oral vocabulary has been adequately developed both in and out of school.
This issue has been raised at this time, however, to make another point as well; namely, that according to this procedure, it is entirely possible for a child who is reading at grade level, or even above the normal grade for his age, to make substantial gains under reinforcement, which indicates that he is not reading up to his potential. In the author's opinion, these children very likely are genuinely reading-disabled children who need help in order to realize their full potential and thus make reading more effective as a means of communication and learning. This in no way bars consideration of, or the inclusion in the program of, children whose reading abilities are far below normal for chronological age or grade placement. In other words, use of this technique hurts no one and does provide at least one objective basis on which to select children for corrective work in reading. If this technique is then followed by adequate analysis, it appears highly probable that a group of children will be identified who can really be helped by a carefully planned and thoroughly documented analysis of their difficulties, followed by intensive instruction related to the nature of their problem.

What about improving the reading skills of those poor
achievers who appear to be reading about as well as one would expect for their level of measured mental ability or, to put it differently, children who appear to be slow learners on the basis of some kind of objective evidence, including well-documented observation by the teacher? Teaching these "slow learners" to read more effectively can be accomplished adequately under conditions specified in most Title I projects, which involve intensive instruction in small groups, or individually, with materials carefully selected to be suitable for the child's needs. It is also undoubtedly true that if this kind of individual attention were given to all children in school, they would show similar substantial gains in reading and would reach something approaching their maximum level of reading ability much sooner than they do under the present rather casual program of instruction, especially in the upper elementary grades.

If this general improvement in reading, especially for slow learners, is the intent of the program, it should be clearly stated as such. It should not be labelled as a corrective program intended to discover and remove specific difficulties which have barred the child's normal progress.
One suggestion for improving the selection process can be
made and implemented immediately. This involves the administration of two forms of the test under normal conditions, given about a week apart. The purpose would be to identify those children who show a consistent pattern of low achievement, as compared to those whose achievement seems to be quite variable on the two measures. Wide variability of scores on the two tests suggests that these children have problems outside the reading realm, relating possibly to their emotional adjustment and temperamental factors. Such children need to have quite as much attention paid to these factors as to the mechanics of correcting any existing reading defects they may have. Moreover, these emotional problems are harder to deal with and often call for a team approach involving the social worker and the psychologist as well as the teacher.
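The screening rule suggested above can be sketched in a few lines (the cutoff and tolerance values here are hypothetical placeholders an examiner would set from the test's norms and its standard error of measurement):

```python
def classify(form_a, form_b, low_cutoff, tolerance):
    """Classify a child from scores on two equivalent test forms
    given about a week apart.  `low_cutoff` and `tolerance` are
    examiner-chosen raw-score values, hypothetical here."""
    if abs(form_a - form_b) > tolerance:
        # Wide variability suggests problems outside the reading
        # realm (emotional adjustment, temperamental factors).
        return "variable"
    if form_a <= low_cutoff and form_b <= low_cutoff:
        # Consistently low on both forms: candidate for corrective reading.
        return "consistently low"
    return "adequate"

print(classify(18, 20, low_cutoff=25, tolerance=6))
print(classify(15, 34, low_cutoff=25, tolerance=6))
```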
TEST INSTRUMENTS
Are the instruments used for pre- and post-testing appropriate for the task (validity), and are they sufficiently reliable to produce results which can be said to be non-chance in character?

At the present time about the only reading tests available for this purpose are the standardized reading survey tests, most of which have been produced as part of a comprehensive achievement battery, such as the Stanford, Metropolitan, California, etc. These tests were not constructed as instruments intended to indicate with maximum accuracy the reading level achieved by a particular student at a particular time in his school career. Thus, the use of these tests for before-after testing requires the tests to do something for which they were not constructed.
Reliability coefficients for these tests generally tend to be high compared to the reliability coefficients in other tested areas, and generally these tests are considered to be adequate measuring instruments. Standard errors of measurement are tolerable in size for the purposes for which the instruments were intended. However, they were not intended to measure gain in reading ability over relatively short periods of time.
For children who have well-documented, correctible reading difficulties, even these instruments will be successful in indicating substantial gains over the period of time in question. The Crowley-Ellis study reported in this journal gives some evidence on this point, although it is not intended as a definitive study. It uses national fall and spring stanines on the Metropolitan as the basis for comparison and measures the success of the program by the extent to which individuals gain by substantially larger margins than normally expected for a given level of initial score. Additional studies on larger numbers of cases under more
carefully controlled conditions are needed to validate the approach, but in the author's opinion it constitutes the most promising approach to assessing the progress of the disabled reader using available survey tests. The extent to which individuals make substantial gains over and above the gain in score needed to maintain their current status with respect to their peer groups is convincing evidence that these plus gains are specific to the program being evaluated.
A far more vital question to ask at this point, however, is whether it is possible to produce instruments which are more sensitive and better suited to the task of measuring gains in reading achievement over these relatively short periods of time. The answer to this is quite definitely positive, but the specifications for such tests or series of tests are hard to write.
It is also evident that it would be possible to construct tests
where a larger percentage of the material to be read would be
closer to the operating level of the child taking the test than is
possible with tests organized to cover wide ranges of ability or
skill. In such a test, items which are so easy that the child could answer them without real effort, or so difficult that he rarely answered them correctly, would need to be eliminated as ineffective in establishing the child's place along some kind of continuum of reading ability with highest accuracy.
GRADE EQUIVALENTS
Are the widely used grade equivalents ever the proper statistical units for indicating change from beginning to end of the training period? If so, under what circumstances? It was stated earlier in this article that the use of grade equivalents is of questionable value in measuring the status of a child on a before-after basis in a corrective reading program.

In order to understand the deficiencies of grade equivalents
one must understand the procedure whereby they are obtained. Tests are administered to a series of grades at the same time in the school year, on instruments that either are identical for successive grades or have been scaled to provide for continuity from grade to grade. The "scaled" scores for successive grades are then plotted on cross-section paper against grade and time of year of testing, and a continuous smooth curve of best fit is drawn through these plotted points. Obviously, the plotted points are twelve months apart, since the children whose averages are plotted are in successive grades. However, grade equivalents assume that all gain in reading takes place during the period a child is in school, ignoring the two- and three-month vacation. The procedure further assumes that this period is typically ten months.
Neither assumption is correct. Thus, the procedure of dividing the difference in score from one grade to another into ten parts and calling these "months of grade" can be seriously misleading, especially in vocabulary and reading, where substantial data indicate continued gains in score during out-of-school months. Such norms cannot possibly determine accurately whether or not a child has made normal or more than normal gain during a period of seven months of in-school instruction.
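The norming procedure just described can be sketched in a few lines (the grade medians are invented; real norm tables also smooth the trend line and extrapolate beyond the observed range):

```python
# Hypothetical grade medians from a fall standardization: the median
# raw score earned by children tested in each grade.
medians = {2: 20, 3: 28, 4: 34, 5: 38, 6: 41}

def grade_equivalent(raw):
    """Interpolate a raw score into a grade equivalent the way norm
    lines are built: the gap between adjacent grade medians is split
    into ten "months," ignoring the summer vacation and treating the
    twelve calendar months between testings as a ten-month year."""
    for g in sorted(medians)[:-1]:
        lo, hi = medians[g], medians[g + 1]
        if lo <= raw <= hi:
            months = 10 * (raw - lo) / (hi - lo)
            return round(g + months / 10, 1)
    return None  # beyond the observed range: only subjective extension remains

# Halfway between the grade 3 and grade 4 medians:
print(grade_equivalent(31))

# At the top of this range a whole year's growth is only three raw
# points (38 to 41), so one raw point is "worth" several months:
print(10 / (medians[6] - medians[5]), "months per raw-score point")
```

Note how thin the unit becomes at the upper grades: with only three raw points separating adjacent medians, a single item answered correctly or incorrectly moves a child more than three "months" of grade.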
Furthermore, there is ample evidence that when tests are scaled in some kind of absolute unit, the rate of growth in reading and vocabulary is very substantial in the early grades (or ages) and becomes less and less substantial as one goes up the age-grade ladder until, at the junior high school level, gains are only marginal as compared to the astounding variability in reading ability found in the scores of students tested at these higher grades. Thus, increases of only two or three points of score, in terms of items answered correctly, may account for all the gain to be found from the seventh grade to the eighth grade over an entire twelve-month period!

Grade equivalents for the enormous range of scores above or
below the successive averages must be obtained by some artificial means. Above grade eight, no real data at all are available under most circumstances, and the grade equivalents can be assigned only in terms of subjective judgment. The temptation is great to extend the norm line (the trend line through the averages of successive grades where data are available) upward as steeply as possible, in order to provide grade equivalents for the maximum number of points of score above the median of the highest grade tested. This procedure results in assigning grade values even up into the senior high school years, in spite of the fact that this is obviously arrant nonsense, since there is no continuum of instruction in reading in these grades.
NORMATIVE POPULATIONS
Are the national norms on the available tests used really representative of any truly national population? The matter of national norms for achievement tests prepared by different authors and published by different publishers is a very vexing and complex issue. No author or publisher can be charged with neglecting his responsibilities solely on the basis that his norms differ from those of some other author or publisher on a similar test, such as reading, unless the procedures for obtaining such norms have been carefully and stringently laid down and adhered to by all. The United States Office of Education is cognizant of this problem of variation in the norm assigned to equivalent scores and
has under consideration a study which will attempt to equate a
half-dozen of the most widely used reading tests in order to set
up tables of comparable scores. The office then will attempt to
establish one set of national norms which will meet the most
rigid specifications for representativeness. The net result of this
approach should be to eliminate one objectionable feature involved
in contract teaching; namely, the selection and use of a test that
the experimenter feels will make his efforts look good.
Perhaps, in order to drive this point home, it would be well
to point out that the same population given two different reading
tests, both claiming to have representative national norms, can
be substantially below the norm on one test while being at or
even substantially above the norm on another presumably equally well standardized test.
The only saving grace in this situation is that national norms really are not a necessary requirement for a large number of the uses to which such tests can be put. Data gathered on local populations, whether these be state or county or smaller administrative units, often are more meaningful and more useful in determining those children who need corrective instruction as well as in evaluating growth.
REALITY OF GAINS
Is the evidence convincing that gains in over-all reading
skill, as opposed to gains in reading test scores, have really taken
place and is this evidence equally convincing at all grade levels?
A challenge to the validity of the standardized reading test is implied. There are masses of statistical data to indicate that these tests are valid, in that children with high reading scores, especially with high associated mental ability scores, do well in school. In other words, they are given high letter grades. They tend to graduate at the top of their class or to be in the upper ranges of the rank-in-class usually reported at the end of the period of instruction available under public education. Thus, one can say that such tests do have validity. The question remains unanswered, however, as to the extent to which the test validity might be improved if it could be related more directly to the criterion of successful reading skill measured in some other way. To put this question differently, is it self-evident that a child who makes a high score on a reading test will be able to read comprehendingly, at his grade level, the instructional material placed before him in any of the various subject matter areas to which he may be subjected? No definitive studies in this area have been made recently, to the knowledge of this author.
A very interesting series of studies can be imagined in this area, a real challenge to the ingenuity of the investigator. One good starting point would be to study the level of agreement between the usual reading test, with unlimited reference to the material read, and a test requiring exposure for only a controlled amount of time, after which the questions would be answered without an opportunity to re-read the passages.
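The level of agreement in such a study would presumably be summarized as a correlation between the two testing conditions. A minimal sketch, with invented paired scores for the same pupils under each condition:

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation between two paired score lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented paired scores: the usual unlimited-reference condition
# versus the controlled-exposure condition, pupil by pupil.
unlimited = [34, 28, 41, 22, 37, 30]
timed     = [30, 25, 38, 18, 33, 29]
print(round(pearson_r(unlimited, timed), 3))
```

A correlation near 1.0 would suggest the two conditions rank pupils alike; a markedly lower value would suggest the timed condition taps something the usual format misses.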
DATA COLLECTION AND PROCESSING
Are the required accountability data being collected according to acceptable scientific procedures, and are the results processable by modern data processing equipment? In some respects this is the Achilles heel of the entire attempt at evaluating Title I projects. Unless one single type of project is imposed on an entire unit, with the specifications agreed upon in advance, the multiplicity of data collected will result in total chaos from the point of view of administrative units larger than the local unit; that is, larger than the level at which the project is carried out.
For example, if fifty communities are all carrying on what each calls a remedial project in reading, but hardly any two of them are pursuing the matter from the same point of view, how is collective data processing possible? Perhaps by using the same measuring instrument for pre- and post-testing? But if each of these fifty communities attempts to evaluate its product using a different instrument, possibly administered at a different time of the year, then the data simply are not combinable in any meaningful sense, even with the most adequate equating tables, though reliance on such tables is the prevailing practice.
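An equating table maps a score on one test to the score holding the same percentile rank on another. A crude sketch of the equipercentile idea, with invented score distributions, shows both what the tables do and why they cannot rescue incomparable designs: the mapping is only as good as the assumption that the two distributions describe the same kind of performance.

```python
def equipercentile_map(score, from_scores, to_scores):
    """Map a score on one test to the score holding the same percentile
    rank in the other test's distribution: equipercentile equating in
    its crudest form, on invented data."""
    from_sorted = sorted(from_scores)
    to_sorted = sorted(to_scores)
    # Fraction of the source distribution at or below the score.
    rank = sum(1 for s in from_sorted if s <= score) / len(from_sorted)
    # The score occupying that same rank in the target distribution.
    count = round(rank * len(to_sorted))
    index = max(0, min(count - 1, len(to_sorted) - 1))
    return to_sorted[index]

# A raw 30 on Test X sits at the 60th percentile of X's distribution;
# the equated Test Y score is whatever sits at Y's 60th percentile.
print(equipercentile_map(30, [10, 20, 30, 40, 50], [15, 25, 35, 45, 55]))  # 35
```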
If a common instrument is applied throughout the entire population, one must pose the following question: Does this objective instrument presume to measure the goals of all these projects equally well? Only if the common goal is generalized improvement in the quality of the students' reading is this remotely logical. Is the chosen instrument suited for comparison of averages only, or does it measure individual performance adequately?

A consideration of these vexing questions must result in choosing one horn of a dilemma: 1] measure some generalized factor, possibly reading effectiveness commensurate with ability; or 2] select or devise valid instruments to measure the particular aspect of reading which the project is intended to teach. The choice becomes clearer if the desired outcomes are clearly defined in behavioral objective terms, a process still only dimly understood and vaguely practiced.

For the moment, assume that the objective is the same; namely, improvement in the effectiveness with which the students under instruction tackle and master the printed page. Valid accounting of outcomes requires that all of the communities involved administer the same test(s) under the stipulated conditions, with precise adherence to the time limits and directions for both pre- and post-testing. Furthermore, it requires enough identifying information for every student to permit the matching of cases tested at the beginning and at the end of the period of instruction.
It also assumes that all children who were selected for the program initially are included in the final evaluation, even if they have been dropped from the program. To make this last point perfectly clear: it would be entirely possible for the separate schools within a community- or district-wide project to sabotage the results completely if they tested at the end of the project, and/or returned data to the central office, only for those children who, in their opinion, made notable gains under instruction.
Perhaps an additional word is needed on the matter of identifying cases from fall to spring testing, especially where there is no existing imposed student ID number, such as a Social Security number. It is possible to generate a code from the student's name, birth date, age, and sex which will uniquely identify him within close limits, especially if this information is used in conjunction with some locality information such as school, supervisory unit, or school district. The generation of such a code, however, depends upon collecting the needed information, name, sex, birth date, and so on, consistently at the fall, or first, administration and again in the spring, at the end of instruction.
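In modern terms, such a code amounts to hashing the identifying fields into a matching key. A minimal sketch; the particular fields, their order, and the use of a hash are illustrative assumptions, and the essential point is that the fields must be normalized so the same pupil yields the same key in fall and spring:

```python
import hashlib

def match_key(name, birth_date, sex, school):
    """Build a matching key from identifying fields. Field names and
    the choice of hash are illustrative; a real project would fix the
    exact fields in advance. Normalizing case and whitespace matters,
    since a fall record and a spring record must yield the same key."""
    normalized = "|".join(
        str(field).strip().lower()
        for field in (name, birth_date, sex, school)
    )
    return hashlib.sha1(normalized.encode()).hexdigest()[:12]

# The same pupil, recorded with stray capitals and spaces, still matches.
fall = match_key("Mary Jones", "1959-03-14", "F", "Lincoln Elementary")
spring = match_key("  mary jones ", "1959-03-14", "f", "lincoln elementary")
print(fall == spring)  # True
```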
Collecting such information proves to be very difficult, perhaps because the need for identical information on a repeat basis is not fully communicated. The end result is almost complete chaos if the matching information is inaccurate or incomplete. It then becomes necessary to do a person-by-person identification of cases through visual inspection of punched cards produced and interpreted for this purpose. Such "eyeballing" permits investigation of a number of different clues before concluding that the cards in question do, indeed, belong to the same individual, but it is prohibitively expensive for large groups. There is no guarantee at all that the lost data will be random, or that an analysis of the remaining data would yield the same result as an examination of the total population. Some statistical safeguards are possible, but at great cost.
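Part of what the clerk does by eye can be approximated mechanically: accept a match when several clues agree even though the key fields do not. A sketch under stated assumptions; the fields, the similarity threshold, and the rule itself are all illustrative, not a prescription:

```python
import difflib

def likely_same_person(rec_a, rec_b, threshold=0.85):
    """Approximate 'eyeballing': treat two records as the same pupil
    if the names are close (allowing for misspellings) and the birth
    dates agree exactly. Threshold and fields are illustrative."""
    name_similarity = difflib.SequenceMatcher(
        None, rec_a["name"].lower(), rec_b["name"].lower()
    ).ratio()
    return name_similarity >= threshold and rec_a["birth"] == rec_b["birth"]

# A spelling slip between fall and spring records still matches.
fall = {"name": "Jonathan Smith", "birth": "1960-11-02"}
spring = {"name": "Jonathon Smith", "birth": "1960-11-02"}
print(likely_same_person(fall, spring))  # True
```

Even so, the author's caution stands: however the matching is done, cases lost to bad identifying data are not guaranteed to be a random sample.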
REPORTING
Are the reports of results both scientifically sound and objective? The question applies both at the local level and at the central office level, whether the "central office" be the state or the federal government. In projects such as those carried on under Title I, the pressure to show good growth or gains as a result of the experimental factor may be great, especially if there is a threat, potential or real, to future funding of the program. Sometimes the pressure from the local administration to show gains causes teachers and others directly involved with the children to lose their own objectivity and, either consciously or unconsciously, to "teach for the test." The contamination of teaching material reported in the Texarkana project is a good case in point.
Spring testing, that is, testing at the end of a school year, has in the author's opinion always suffered from a serious drawback: the temptation to teach for the test is always present, especially for teachers who are inexperienced in such matters or who fear, usually groundlessly, that they will be rated by the level of success of their students without regard to those students' initial status. The fact that such attempts are sometimes ludicrously obvious seems not to deter them, because this is simply not realized ahead of time. Any analysis of a before-and-after testing program, done for example in terms of community means expressed in standard scores, will quickly reveal communities where the post-test mean is ridiculously out of line with pre-test performance, either on the evaluation instrument or on some other measure of learning potential.
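Such a screen is easy to sketch. A minimal version, with invented community means in standard-score units; the cutoff of one standard-score unit of gain is an assumption for illustration, and a real screen would be set against the observed distribution of gains across all communities:

```python
def flag_suspect_gains(communities, max_gain=1.0):
    """Flag communities whose mean post-test standard score exceeds
    the pre-test mean by more than `max_gain` standard-score units.
    The cutoff is illustrative only."""
    return [
        name for name, (pre_mean, post_mean) in communities.items()
        if post_mean - pre_mean > max_gain
    ]

# Invented community means, expressed as standard (z) scores.
means = {
    "Community A": (-0.40, -0.15),  # plausible growth
    "Community B": (-0.55, 1.20),   # out of line with pre-test standing
    "Community C": (0.10, 0.30),
}
print(flag_suspect_gains(means))  # ['Community B']
```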
CONCLUSION
For the reading teacher and for the reading supervisor or specialist, this is a time of great opportunity. Public sentiment is strongly in favor of anything that can be done to improve reading skills, both for the population as a whole and, more specifically, for those in whom correctable defects can be identified and assessed. It would be nothing short of criminal for our profession to fail to evaluate, objectively and adequately, the outcomes of our present efforts. A valid determination that a technique is not effective is as valuable as a determination that it is effective. We can learn from our mistakes as well as from our successes. There can be no onus on making a mistake once; it is inexcusable only if the same mistake is perpetuated because we fail to assess the nature of our failure and take proper steps to correct it.
What is needed is more effective evaluation at the local level wherever projects are set up on the basis of local option, and the
supplementing of the local effort by encouraging cross-validation of programs which embody the best of what has been found. The problem then is to obtain a convincing mass of data showing that a replicable technique has been identified and perfected which will give results.
Every person intimately involved in the problem of improving reading instruction has a responsibility in this matter. Persons in charge of such programs should take a second look at the evaluative procedures being used locally. Are they sound? Are the comparisons being made in ways which are statistically acceptable and meaningful? Are the data being analyzed, and the results communicated, in such a way that the success or failure of the local effort is clearly demonstrated? Only when positive responses are forthcoming can appropriate action be taken, either to correct or drop inept programs or to strengthen those that seem most promising.