
The Reading Teacher, Vol. 24, No. 4, Testing (Jan., 1971), pp. 291-304, 367. Published by Wiley on behalf of the International Reading Association. Stable URL: http://www.jstor.org/stable/20196498


Walter Durost is an Adjunct Professor of Education at the University of New Hampshire.

Accountability: the task, the tools, and the pitfalls

WALTER N. DUROST

Within recent months the United States Office of Education has taken the position, especially in the case of Title I, that it is essential to show "hard data" to prove that the enormous amounts of money being spent are not being wasted. Thus, the term "accountability" has come into vogue and it is very much the "in" thing to talk in these terms. What does accountability mean, in fact? According to Webster's Third New International Dictionary Unabridged, accountability means, "the quality or state of being accountable, liable or responsible." The American Heritage Dictionary is even more specific on the point. It says, "accountable, adj., accountability, noun, subject to having to report, explain, or justify; responsible, answerable."

SOURCES OF DEMAND

In both of these definitions the idea comes through loud and clear that the individual or individuals carrying out some assigned task are to be held accountable for their performance. There is, by implication at least, the further thought that if the individual or individuals performing the given task do not provide convincing evidence of their success, they will somehow be made to "pay the piper." Whether or not the threat is real is beside the point if the local authorities accept it as real. In Title I this conceivably could mean losing the funds now supplied by the federal government for Title I projects. There is a demand for positive proof to present to Congress, for example, that Title I programs are successful. The reasonableness of the demand is not the question; the problem is how to meet it without destroying public education in the process by destroying public confidence in its schools. That this is a real threat is perfectly evident by examining the course of events in recent months. Perhaps some examples will suffice.

In instance after instance school bond issues are being defeated and budgets are being cut. In other instances private contractors are being paid to do the work of the school, almost always with federal money. In most instances the contractor gets paid only to the extent that he can "prove" real gains, usually stipulated in terms of months of grade. There is no real understanding of the fallacies involved in this procedure relating to the operation of chance effects; inadequately standardized tests; instances, especially in the upper grades, where a "gain" of one point of score is translated into several "months." Since the cases selected for instruction are usually those who are retarded to start with, regression effect alone on repeat testing will guarantee a "gain" which is, in fact, due to chance alone.
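The regression effect is easy to demonstrate. The following small simulation (invented numbers, not Title I data) selects the lowest-scoring fifth of a group on a fallible pretest, provides no instruction whatever, and retests:

```python
# Sketch: regression toward the mean manufactures a "gain" for a group
# selected for low pretest scores, with no instruction at all.
# All values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_level = rng.normal(50, 10, n)           # stable "true" reading level
pretest = true_level + rng.normal(0, 5, n)   # observed score = level + error
posttest = true_level + rng.normal(0, 5, n)  # retest: same level, fresh error

low_group = pretest < np.percentile(pretest, 20)  # the cases usually selected
print(f"selected pretest mean:  {pretest[low_group].mean():.1f}")
print(f"selected posttest mean: {posttest[low_group].mean():.1f}")
```

The selected group's retest mean runs several points above its pretest mean simply because the pretest errors of those chosen were, on average, negative; a contractor paid by "months of gain" collects for this artifact.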

Many such contract arrangements provide for individualized instruction under very favorable environmental conditions and using sophisticated teaching machines. Extraneous rewards are offered to the students, such as green stamps, free radios, etc. Moreover, the time allotment for this specialized instruction is much greater than allowed in the normal curriculum.

This notion of "accountability" as being a threatening gesture from the state or federal government is a potent source of contamination in any experimental program and denies, to those who are interested in experimentation to find new and better ways, the opportunity to discover these better ways of doing things without personal or professional risk. That freedom of mind which permits the individual to try but fail and yet count his failure as a success, provided it shows the way to some new and better procedure, is in danger of being seriously eroded. Scientific method cannot possibly survive in an atmosphere suffused with threats, real or implied, and therefore the very use of the term "accountability" is anathema to those who are vitally interested in innovation.

The United States Office of Education is now demanding credible, convincing evidence, in terms of test results and/or equally objective data, to prove that Title I funds and similar government subventions are being spent wisely. This demand is both prudent and needful, provided the pitfalls described above are carefully avoided. This calls for more valid and more reliable tests, for the development of non-test techniques for evaluation, and for improved and understandable statistical analysis.

Coincidentally, there is now a high level of public interest in developing an increased level of reading skill among American children and adults. Educators have long been aware of the need, but only very infrequently has financial support at the local level been forthcoming, especially for corrective programs which are expensive in terms of personnel and logistical support. Reading has been singled out both because of its essential contribution to all phases of the curriculum and because of the deeply seated belief that having a literate society is of vital concern, especially in a technological age, a fact few would deny.

The criterion for success in Title I reading projects, according to one memorandum issued by the Office of Education, was a gain of one full year of grade placement for a school year (which usually means seven months) of instruction, with a further provision that the children selected for instruction under these programs should have been at least a full year below normal grade-for-age at the time they begin their work. Unfortunately, the grade equivalent or grade placement score, used so confidently in this memorandum to define the population to be taught and the criteria of success, is a very tricky statistic. In many instances, it is an almost meaningless concept which distorts the truth and cannot be depended upon to describe scientifically either the initial status or the degree of progress made by any particular group during any period of time. Gains recorded in grade equivalents are often within the range of chance. This is especially true of short periods of seven to nine months, which usually apply. The specific difficulties involved in the use of the grade equivalent as a statistical unit will be discussed in more detail later.

Under the pressure of time, and suffering from a woeful lack of properly trained personnel, many school communities and larger educational units have undertaken programs which were not well suited to the needs of their groups and have collected data the significance of which it is almost impossible to determine. It is the purpose of this article to explore in some detail and to illustrate, where possible, some of the problems involved in accounting for the progress, or lack of progress, in Title I projects and programs, particularly the remedial reading programs at the local and state level. In order to do this, many pertinent questions need discussion.

EVALUATION PROCEDURES

Are proposals for evaluating outcomes in the reading projects presently being written into applications, as required by law, adequate to the need? This is unanswerable as a general question. It must be answered project by project in terms of what is happening in a particular locality or state department. One of the author's tasks as a Consultant to the Title I Office in the New Hampshire State Department of Education has been to help plan the analysis of the data gathered by a statewide testing program in connection with the evaluative activities of the Title I Office.

One step in this direction has been to develop a series of categories into which Title I projects may be pigeon-holed. In order to obtain the necessary data as to the number of students participating in Title I programs, project by project, in each of the categories, IN, or registration, cards and OUT, or termination, cards were devised. These are overprinted precoded IBM cards to make card punching more objective and accurate. On these cards, the teachers responsible for Title I instruction are required to complete, for each child, a questionnaire which indicates the category in which this child's instruction falls, plus other questions having to do with the amount of time spent each week, etc. This procedure has been in effect for one year, and a preliminary analysis of the IN cards shows, for example, that a very large percent of all pupils in the programs in Grades two, four, six, and eight (the only grades so far completed) have been engaged in some project related to corrective work or additional instruction intended to improve the quality of the child's reading performance. The ten categories and the number of children coming under each category according to the analysis of the IN cards for the Fall '69 program are listed below in Table 1. The percentage in the reading project alone is shown also in Table 2.

Table 1  Number of cases enrolled in each project category in 1969-70 New Hampshire Title I program

Project Category              Grade Two    Four    Six    Eight
 1. Language                         31      17     11       16
 2. Reading                         657     606    286      398
 3. Speech                           69      26     18        4
 4. Math                              6      55     42        2
 5. Guidance                         79      56     27       19
 6. Special Education                 3       2      3        8
 7. Psychiatric Services             14       -      -        -
 8. Aides                            50      27     22        6
 9. Cultural Enrichment              16      25      5       25
10. Other                            35      10     31       16
    Totals                          960     824    445      494

Table 2  Number and percentage of Title I cases in corrective reading programs in 1969-70, New Hampshire State Department of Education

Grade          Two     Four    Six    Eight
Number         657     606     286    398
Percentage     68½%    73½%    64%    80%
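By way of illustration only, the tally behind a table such as Table 1 is a straightforward count over the punched IN cards. The card fields and records below are assumed for the sketch, not the actual New Hampshire card layout:

```python
# Sketch: aggregating IN (registration) cards into category-by-grade counts.
# Field names and records are hypothetical.
from collections import Counter

in_cards = [
    {"grade": 2, "category": "Reading"},
    {"grade": 2, "category": "Reading"},
    {"grade": 2, "category": "Speech"},
    {"grade": 4, "category": "Reading"},
    {"grade": 4, "category": "Math"},
]

counts = Counter((card["grade"], card["category"]) for card in in_cards)
for (grade, category), n in sorted(counts.items()):
    print(f"Grade {grade}: {category:<8} {n}")
```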

Ideally, as one step in the process, every Title I project submitted should be reviewed from the point of view of the appropriate evaluative evidence that can be collected, summarized, analyzed, and reported on. Practically, this may never be possible because some projects are approved and funded on the basis of the best judgment of the Title I Staff that these projects are good for the target schools in question and will result in a generally improved education for the educationally deprived children involved.

In 1968-69, a Fall-Spring comparison in grade equivalents was attempted. In the absence of the criterion information described above, no clear differentiation of those in remedial reading programs versus all others could be made. This analysis proved to be quite unsatisfactory for several reasons. The lack of pupil information concerning the nature of the projects and level of student participation was one major cause of concern, which is now taken care of by the IN and OUT cards. Another difficulty was the impossibility of making sensible comparisons from grade to grade because of the inadequacies of the grade equivalent type of score transformation. There was no way of knowing what would be normal gain over seven calendar months. It certainly was not seven months of grade equivalent, as commonly assumed. Furthermore, the bivariate distributions for the separate grades two, four, six, and eight clearly revealed the lack of equivalence of the grade equivalent from grade to grade. Children tested at the beginning of grade two could not go below a grade equivalent of 1.0 but could go very much higher. Many pupils at grade eight earned grade equivalents of 10.0+, whereas the fact of the matter is that reading is not recognized as a formal subject above grade eight.

The availability of more complete student information for the 1969-70 program will make possible a more systematic comparison than before, and this is now underway. State stanines will be used in making this comparison. In order to obtain meaningful spring stanines, a random sample of the statewide grade populations was retested in the spring along with those designated as Title I cases in Reading by the IN card documents. A separate study is being done on the basis of the random sample to assess changes which can be considered typical for the instructional period involved.
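The logic of that separate study can be sketched as follows; the score changes are invented, and the point is only that the random-sample retest supplies the baseline against which Title I gains must be judged:

```python
# Sketch: only gain in excess of the typical (random-sample) change over the
# same period is evidence of program effect. Data are illustrative.
import numpy as np

random_sample_gain = np.array([2, 4, 3, 5, 1, 4, 3, 2, 4, 3])  # statewide retest
title1_reading_gain = np.array([6, 5, 8, 4, 7, 6, 9, 5])       # Title I reading cases

baseline = random_sample_gain.mean()
excess = title1_reading_gain.mean() - baseline
print(f"typical change for the period: {baseline:.2f} points")
print(f"Title I gain beyond typical:   {excess:.2f} points")
```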

Since the reading group is such a large proportion of the total Title I population at every grade level, only this category is being analyzed separately. Attention is being given to the idea of evaluating all of the remaining children not in corrective reading programs as one group, on the assumption that the individualized attention received may, in itself, result in improved academic achievement as measured by standardized tests.

Even refining the major sample analyzed to include only reading projects completely ignores one important aspect of the operating situation. Even those supposedly under corrective reading instruction are not being handled in a sufficiently similar manner in all projects to make this kind of comprehensive statewide evaluation very meaningful. It could be, for example, that, buried in the mass of data, there are some local projects where terrific gains are being made as a result of improved methods of identifying, diagnosing, and instructing children with correctible reading difficulties. These gains might be masked and hidden because many other children are being instructed in programs which generally are well-intentioned and, perhaps, even very successful from some different point of view but do not emphasize large gains in score on an objective reading test as the primary criterion of success.

If this is, indeed, true at the state level, how much truer must it be when an attempt is made at the national level to merge data for thousands of projects loosely regarded as being related to corrective reading and to draw conclusions on the effectiveness of Title I sponsored instruction for improving the general reading ability of the disadvantaged. Would it not be more sensible if, at level after level in this hierarchy from the Office of Education down to the local school, well-planned, carefully documented attempts were made to see what could be done to improve the efficiency of corrective work in reading? This might require improving, first, the selection procedures; second, the diagnosis, so as to pinpoint the nature of each child's difficulty; and finally, the evaluative procedures, to be sure that the instruction following analysis and diagnosis was effective.

SELECTION OF CASES

Are the children finally chosen for remedial or corrective work in reading really the ones who need this kind of help? Choosing cases for corrective or remedial work in reading raises a problem, first of semantics and secondarily of basic educational philosophy. Many children are chosen for corrective or remedial work in reading for one reason: they are reading below their normal grade placement for their chronological age. Many of these children may be slow learners. This is said with no attempt whatsoever to label such children as being of low intelligence, but simply to state the inescapable fact that there are great differences among children in regard to the rapidity with which they respond to instruction.

Perhaps the point can best be made if the reader will assume for the moment the validity of the point of view taken by the author that one indication of a child's potential reading ability is the level of his oral vocabulary. While this oral vocabulary may be measured in many different ways, it is measured most pertinently, in the author's opinion, by the use of the reading reinforced technique discussed elsewhere in this issue. According to this technique, a child not only takes a reading test under standard conditions but is also given an equivalent form with the further condition stipulated that the teacher read aloud as the child reads silently. This oral bridge to comprehension will result in absolute gains in score for a majority of children under normal circumstances because, apparently, few children really do live up to their potential as so determined. (A few very able readers are slowed down by the oral reading and achieve less well under reinforcement.) However, experimental evidence is rather convincing that children with correctible reading defects will make enormous gains, comparatively speaking, under these conditions if their oral vocabulary has been adequately developed both in and out of school.

This issue has been raised at this time, however, to make another point as well; namely, that according to this procedure, it is entirely possible for a child who is reading at-grade or even above the normal grade for his age to make substantial gains under reinforcement, which indicates that he is not reading up to his potential. In the author's opinion, these children very likely are genuinely reading-disabled children who need help in order to realize their full potential and thus make reading more effective as a means of communication and learning. This in no way bars consideration of, or the inclusion in the program of, children whose reading abilities are far below normal for chronological age or grade placement. In other words, use of this technique hurts no one and does provide at least one objective basis on which to select children for corrective work in reading. If this technique is then followed by adequate analysis, it appears highly probable that a group of children will be identified who can really be helped by a carefully planned and thoroughly documented analysis of their difficulties followed by intensive instruction related to the nature of their problem.

What about improving the reading skills of those poor achievers who appear to be reading about as well as one would expect for their level of measured mental ability or, to put it differently, children who appear to be slow learners on the basis of some kind of objective evidence, including well-documented observation by the teacher? Teaching these "slow learners" to read more effectively can be accomplished adequately under conditions specified in most Title I projects, which involve intensive instruction in small groups, or individually, with materials carefully selected to be suitable for the child's needs. It is also undoubtedly true that if this kind of individual attention were given to all children in school they would show similar substantial gains in reading and would reach something approaching their maximum level of reading ability much sooner than they do under the present rather casual program of instruction, especially in the upper elementary grades.

If this general improvement in reading, especially for slow learners, is the intent of the program, it should be clearly stated as such. It should not be labelled as a corrective program intended to discover and remove specific difficulties which have barred the child's normal progress.

One suggestion for improving the selection process can be made and implemented immediately. This involves the administration of two forms of the test under normal conditions given about a week apart. The purpose would be to identify those children who show a consistent pattern of low achievement as compared to those whose achievement seems to be quite variable on the two measures. Wide variability of scores on the two tests suggests that these children have problems outside the reading realm, relating possibly to their emotional adjustment and temperamental factors. Such children need to have quite as much attention paid to these factors as to the mechanics of correcting any existing reading defects they may have. Moreover, these emotional problems are harder to deal with and often call for a team approach involving the social worker and the psychologist as well as the teacher.
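A minimal sketch of this screening rule follows. The standard error of measurement, the low-score cutoff, and the use of twice the standard error of a difference as the "wide variability" criterion are all illustrative assumptions, not values given in this article:

```python
# Sketch: classify children from two equivalent forms given a week apart.
# SEM, cutoffs, and scores are assumed for illustration.
import math

SEM = 3.0                     # standard error of measurement of one form
SE_DIFF = math.sqrt(2) * SEM  # standard error of the difference of two forms
LOW_CUTOFF = 35               # assumed "low achievement" raw score

def classify(form_a: float, form_b: float) -> str:
    if abs(form_a - form_b) > 2 * SE_DIFF:   # too far apart to be chance
        return "variable: look beyond reading (emotional, temperamental factors)"
    if (form_a + form_b) / 2 < LOW_CUTOFF:
        return "consistently low: candidate for corrective reading"
    return "consistent and adequate"

print(classify(30, 32))  # consistently low
print(classify(28, 47))  # variable
```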

TEST INSTRUMENTS

Are the instruments used for pre- and post-testing appropriate for the task (validity) and are they sufficiently reliable to produce results which can be said to be non-chance in character? At the present time about the only reading tests available for this purpose are the standardized reading survey tests, most of which have been produced as part of a comprehensive achievement battery, such as Stanford, Metropolitan, California, etc. These tests were not constructed as instruments intended to indicate with maximum accuracy the reading level achieved by a particular student at a particular time in his school career. Thus, the use of these tests for before-after testing requires the tests to do something for which they were not constructed.

Reliability coefficients for these tests generally tend to be high compared to the reliability coefficients in other tested areas, and generally these tests are considered to be adequate measuring instruments. Standard errors of measurement are tolerable in size for the purposes for which the instruments were intended. However, they were not intended to measure gain in reading ability over relatively short periods of time.

For children who have well-documented, correctible reading difficulties, even these instruments will be successful in indicating substantial gains over the period of time in question. The Crowley-Ellis study reported in this journal gives some evidence on this point, although this is not intended as a definitive study. It uses national Fall and Spring stanines on the Metropolitan as the basis for comparison and measures the success of the program by the extent to which individuals gain by substantially larger margins than normally expected for a given level of initial score. Additional studies on larger numbers of cases under more carefully controlled conditions are needed to validate the approach, but in the author's opinion it constitutes the most promising approach to assessing the progress of the disabled reader using available survey tests. The extent to which individuals make substantial gains over and above the gain in score needed to maintain their current status with respect to their peer groups is convincing evidence that these plus gains are specific to the program being evaluated.

A far more vital question to ask at this point, however, is whether it is possible to produce instruments which are more sensitive and better suited to the task of measuring gains in reading achievement over these relatively short periods of time. The answer to this is quite definitely positive, but the specifications for such tests or series of tests are hard to write.

It is also evident that it would be possible to construct tests where a larger percentage of the material to be read would be closer to the operating level of the child taking the test than is possible with tests organized to cover wide ranges of ability or skill. In such a test, items which are so easy that the child could answer them without real effort, or so difficult that he rarely answered them correctly, would need to be eliminated as being ineffective in establishing the child's place along some kind of continuum of reading ability with highest accuracy.
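A toy screening of this kind, using the proportion of the target group answering each item correctly, might look as follows; the items, p-values, and cutoffs are invented:

```python
# Sketch: drop items far too easy or far too hard for the group, keeping
# those near the examinees' operating level. Values are illustrative.
items = {
    "item01": 0.98,  # too easy: nearly everyone passes, tells us little
    "item02": 0.72,
    "item03": 0.55,  # near 0.5: most informative for ranking this group
    "item04": 0.31,
    "item05": 0.04,  # too hard: nearly everyone fails
}

KEEP_RANGE = (0.25, 0.80)  # assumed cutoffs
kept = {name: p for name, p in items.items() if KEEP_RANGE[0] <= p <= KEEP_RANGE[1]}
print(sorted(kept))  # ['item02', 'item03', 'item04']
```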

GRADE EQUIVALENTS

Are the widely used grade equivalents ever the proper statistical units for indicating change from beginning to end of the training period? If so, under what circumstances? It was stated earlier in this article that the use of grade equivalents is of questionable value in measuring the status of a child on a before-after basis in a corrective reading program.

In order to understand the deficiencies of grade equivalents one must understand the procedure whereby they are obtained. Tests are administered to a series of grades at the same time in the school year on instruments that either are identical for successive grades or have been scaled to provide for continuity from grade to grade. The "scaled" scores for successive grades are then plotted on cross-section paper against grade and time of year of testing, and a continuous smooth curve of best fit is drawn through these plotted points. Obviously, the plotted points are twelve months apart, since the children whose averages are plotted are in successive grades. However, grade equivalents assume that all gain in reading takes place during the period a child is in school, ignoring the two and three month vacation. The procedure further assumes that this period is typically ten months. Neither assumption is correct. Thus, the procedure of dividing the difference in score from one grade to another into ten parts and calling these months of grade can be seriously misleading, especially in vocabulary and reading, where substantial data indicate continued gains in score during out-of-school months. Such norms cannot possibly determine accurately whether or not a child has made normal or more than normal gain during a period of seven months of in-school instruction.
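The mechanics just described can be sketched in a few lines. The norming grades and mean scaled scores below are invented; the decimal part of the result is the "months of grade," obtained by interpolating along the norm line:

```python
# Sketch: assigning a grade equivalent by interpolation between successive
# grade means, each grade-to-grade gap treated as ten "months."
# Norm values are illustrative.
import numpy as np

grade_at_norming = np.array([2.1, 3.1, 4.1, 5.1])       # grades tested each fall
mean_scaled_score = np.array([38.0, 52.0, 61.0, 67.0])  # assumed grade means

def grade_equivalent(score: float) -> float:
    # Beyond the highest mean, real data run out and extrapolation begins:
    # exactly the subjective-judgment zone discussed below.
    return float(np.interp(score, mean_scaled_score, grade_at_norming))

print(round(grade_equivalent(56.5), 1))  # 3.6: halfway between two grade means
# Note the shrinking yearly gaps (14, 9, 6 points): near the top of the range,
# two or three raw-score points can span several "months" of grade.
```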

Furthermore, there is ample evidence that when tests are scaled in some kind of absolute unit, the rate of growth in reading and vocabulary is very substantial in the early grades (or ages) and becomes less and less substantial as one goes up the age-grade ladder until, at the junior high school level, gains are only marginal as compared to the astounding variability in reading ability found in the scores of students tested at these higher grades. Thus, increases of only two or three points of score in terms of items answered correctly may account for all the gain to be found from the seventh grade to the eighth grade over an entire twelve month period!

Grade equivalents for the enormous range of scores above or below the successive averages must be obtained by some artificial means. Above grade eight, no real data at all are available under most circumstances and the grade equivalents can be assigned only in terms of subjective judgment. The temptation is great to extend the norm line (the trend line through the averages of successive grades where data are available) upward as steeply as possible in order to provide grade equivalents for the maximum number of points of score above the median of the highest grade tested. This procedure results in assigning grade values even up into the senior high school years, in spite of the fact that this is obviously errant nonsense since there is no continuum of instruction in reading in these grades.

NORMATIVE POPULATIONS

Are the national norms on the available tests used really representative of any truly national population? The matter of national norms for achievement tests prepared by different authors and published by different publishers is a very vexing and complex issue. No author or publisher can be charged with neglecting his responsibilities solely on the basis that his norms differ from those of some other author or publisher on a similar test, such as reading, unless the procedures for obtaining such norms have been carefully and stringently laid down and adhered to by all. The United States Office of Education is cognizant of this problem of variation in norm assigned to equivalent scores and has under consideration a study which will attempt to equate a half-dozen of the most widely used reading tests in order to set up tables of comparable scores. The office then will attempt to establish one set of national norms which will meet the most rigid specifications for representativeness. The net result of this approach should be to eliminate one objectionable feature involved in contract teaching; namely, the selection and use of a test that the experimenter feels will make his efforts look good.
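Such tables of comparable scores rest on some form of equating. A much-simplified equipercentile sketch, with invented score lists for one group taking both tests, conveys the idea:

```python
# Sketch: map a Test A score to the Test B score holding the same percentile
# rank in a common group. A real equating study is far more careful; the
# scores here are invented.
import numpy as np

test_a = np.array([12, 15, 18, 20, 22, 25, 27, 30, 33, 38])  # one group, Test A
test_b = np.array([40, 44, 47, 50, 52, 55, 57, 60, 64, 70])  # same group, Test B

def a_to_b(score_a: float) -> float:
    pct = np.searchsorted(np.sort(test_a), score_a) / len(test_a)
    return float(np.quantile(np.sort(test_b), min(pct, 1.0)))

print(a_to_b(25))  # the Test B score at the same rank as 25 on Test A: 53.5
```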

Perhaps, in order to drive this point home, it would be well to point out that the same population given two different reading tests, both claiming to have representative national norms, can be substantially below the norm on one test while being at or even substantially above the norm on another presumably equally well standardized test.

The only saving grace in this situation is that national norms really are not a necessary requirement for a large number of uses to which such tests can be put. Data gathered on local populations, whether these be state or county or smaller administrative units, often are more meaningful and more useful in determining those children who need corrective instruction as well as in evaluating growth.

REALITY OF GAINS

Is the evidence convincing that gains in over-all reading skill, as opposed to gains in reading test scores, have really taken place, and is this evidence equally convincing at all grade levels? A challenge of the validity of the standardized reading test is implied. There are masses of statistical data to indicate that these tests are valid in that children with high reading scores, especially with high associated mental ability scores, do well in school. In other words, they are given high letter grades. They tend to graduate at the top of their class or to be in the upper ranges of the rank-in-class usually reported at the end of the period of instruction available under public education. Thus, one can say that such tests do have validity. The question remains unanswered, however, as to the extent to which the test validity might be improved if it could be related more directly to the criterion of successful reading skill measured in some other way. To put this question differently, is it self-evident that a child who makes a high score on a reading test will be able to read the instructional material placed before him comprehendingly at his grade level in any of the various subject matter areas to which he may be subjected? No definitive studies in this area have been made recently to the knowledge of this author.

A very interesting series of studies can be imagined in this area, a real challenge to the ingenuity of the investigator. One good starting point would be to study the relative level of agreement between the usual reading test with unlimited reference to the material read and a test requiring exposure only for a controlled amount of time, after which the questions would be answered without an opportunity to re-read the passages.

DATA COLLECTION AND PROCESSING

Are the required accountability data being collected according to acceptable scientific procedures and are the results processable by modern data processing equipment? In some respects, this is the Achilles heel of the entire attempt at evaluation of Title I projects. Unless one single type of project is imposed on an entire unit with the specifications agreed upon in advance, the multiplicity of data collected will result in total chaos from the point of view of administrative units larger than the local unit; that is, the level at which the project is carried out.

For example, if fifty communities are all carrying on what each calls a remedial project in reading but hardly any two of them are pursuing this matter from the same point of view, how is collective data processing possible? Perhaps by using the same measuring instrument for pre- and post-testing? But if each of these fifty communities attempts to evaluate its product, each using different instruments, possibly administered at different times of the year, then the data simply are not combinable in any meaningful sense, even with the most adequate equating tables, use of which is the prevailing practice.

If a common instrument is applied throughout the entire population, one must pose the following question: Does this objective instrument presume to measure the goals of all these projects equally well? Only if the common goal is generalized improvement in the quality of reading of the students under instruction is this remotely logical. Is the chosen instrument suited for comparison of averages only, or does the instrument measure individual performance adequately?

A consideration of these vexing questions must result in choosing one horn of the dilemma: 1] measure some generalized factor, possibly reading effectiveness commensurate with ability; or 2] select or devise valid instruments to measure the particular aspect of reading which the project is intended to teach. The choice becomes clearer if the desired outcomes are clearly defined in behavioral objective terms, a process only dimly understood and vaguely practiced.

For the moment, assume that the objective is the same; namely, improvement in the effectiveness with which the students under instruction tackle and master the directed page. For valid accounting of outcomes, all of the communities involved must administer the same test(s) under the stipulated conditions as to time and precise adherence to directions for both pre- and post-testing. Furthermore, it requires that there will be enough identification information available for every student to permit the matching of cases tested at the beginning and the end of the period of instruction.

It also assumes that all children who are selected for the program initially are included in the final evaluation, even if they have been dropped from the program. To make this last point perfectly clear, it would be entirely possible for the separate schools within a community or district-wide project to sabotage the results completely if they tested at the end of the project and/or returned data to the central office only for those children who, in their opinion, made notable gains under instruction.

Perhaps an additional word is needed in regard to the matter of identifying cases from fall to spring testing, especially where there is no existing imposed student ID number, such as a Social Security number, or something similar. It is possible to generate a code from the student's name, his birth date, age and sex which will uniquely identify him within close limits, especially if the information is used in conjunction with some locality information, such as school, supervisory unit, or school district. However, the generation of such a code depends upon the collection of the information needed in regard to name, sex, birth date, etc., consistently at the beginning of fall, or first, administration and again in the spring, or at the end of instruction.
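One hypothetical form such a generated code might take is sketched below; the fields, normalization rules, and sample records are assumed details, not the actual procedure used:

```python
# Sketch: build a matching key from name, birth date, and sex plus locality
# information, normalized identically in fall and spring. Hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class StudentRecord:
    name: str        # "Surname, Given"
    birth_date: str  # "YYYY-MM-DD" (age follows from this)
    sex: str         # "M" or "F"
    school: str
    district: str

def match_key(r: StudentRecord) -> tuple:
    surname, _, given = r.name.strip().upper().partition(",")
    return (surname.strip(), given.strip()[:3],  # 3 letters absorb nickname drift
            r.birth_date, r.sex.upper(), r.school.upper(), r.district.upper())

fall = StudentRecord("Smith, Robert", "1960-03-14", "M", "ELM ST", "SU 29")
spring = StudentRecord("SMITH, Rob", "1960-03-14", "M", "Elm St", "SU 29")
print(match_key(fall) == match_key(spring))  # True: the two cards pair up
```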

The collection of such information proves to be very difficult, perhaps because the need for identical information on a repeat basis is not fully communicated. The end result is almost complete chaos if the matching information is inaccurate or incomplete. This makes it necessary to do a person-by-person identification of cases through visual inspection of punched cards produced and interpreted for this purpose. Such "eyeballing" permits investigation of a number of different clues before concluding that the cards in question are, indeed, applicable to the same individual, but it is prohibitively expensive for large groups. There is no guarantee at all that the lost data will be random and that the net result of the analysis of the remaining data would be the same as it would be if the total population were examined. Some statistical safeguards are possible, but at great cost.

REPORTING

Are the reports of results both scientifically sound and objective? The question can apply both at the local level and at the central office level, whether the "central office" be the state or the federal government. In projects such as those carried on under Title I, the pressure to show good growth or gains as a result of the experimental factor may be great, especially if there is a threat, either potential or real, to future programs from lack of funding. Sometimes the pressure on the part of the local administration to show gains causes teachers and others directly involved with the children to lose their own objective point of view and either consciously or unconsciously "teach for the test." Such contamination of teaching material reported in the Texarkana project is a good case in point.

Spring testing, that is, testing at the end of a school year, in the author's opinion has always suffered from the serious drawback that the temptation to teach for the test is always present, especially in the minds of teachers who are inexperienced in such matters or who fear, usually groundlessly, that they will be rated by the level of success of their students, especially without regard to the initial status of these students. The fact that their attempts to measure gains are sometimes ludicrously obvious seems not to deter them, because this fact is just not realized ahead of time. Any analysis of a before-after testing program done, for example, in terms of community means expressed in standard score terms will quickly reveal communities where the post-test mean, so expressed, is just ridiculously out of line with the pre-test performance, either on the evaluation instrument or on some other measure of learning potential.
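Such a screen is easy to state in code; the community names, means, and flagging threshold below are all invented:

```python
# Sketch: flag communities whose post-test mean, in standard-score terms,
# is implausibly out of line with pre-test performance. Illustrative data.
communities = {
    "Community A": (48.0, 51.0),
    "Community B": (45.5, 47.0),
    "Community C": (46.0, 58.5),  # suspicious jump
}
FLAG_JUMP = 7.0  # assumed screening cutoff, in standard-score points

for name, (pre_mean, post_mean) in communities.items():
    if post_mean - pre_mean > FLAG_JUMP:
        print(f"{name}: post mean {post_mean} vs. pre {pre_mean}: investigate")
```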

CONCLUSION

For the reading teacher and for the reading supervisor or specialist, this is a time of great opportunity. Public sentiment is strongly in favor of anything that can be done to improve reading skills, both for the population as a whole and, more specifically, for those for whom correctible defects can be identified and assessed. It would be nothing short of criminal for our profession to fail to evaluate objectively and adequately the outcomes of our present efforts. A valid determination that a technique is not effective is as valuable as a determination that it is effective. We can learn by our mistakes as well as our successes. There can be no onus on making a mistake once; it is inexcusable if the same mistake is perpetuated because we fail to assess the nature of our failure and take proper steps to correct it.

What is needed is more effective evaluation at the local level wherever projects are set up on the basis of local option, and the supplementing of the local effort by encouraging cross validation of programs which embody the best of what has been found. The problem then is to obtain a convincing mass of data to show that a replicable technique has been identified and perfected which will give results.


Every person intimately involved in the problem of improving reading instruction has a responsibility in this matter. Persons in charge of such programs should take a second look at the evaluative procedures being used locally. Are they sound? Are the comparisons being made in ways which are statistically acceptable and meaningful? Are these data being analyzed and the results communicated in such a way that the success or the failure of the local effort is clearly demonstrated? Only when positive responses are forthcoming can appropriate action be taken either to correct or drop inept programs and to strengthen those that seem most promising.
