"The Effects of Accountability System Design on Teachers’ Use of Test Score Data" by Jennifer Jennings


Teachers College Record, Volume 114, 110304, November 2012, 23 pages. Copyright © by Teachers College, Columbia University. ISSN 0161-4681.

The Effects of Accountability System Design on Teachers' Use of Test Score Data

    JENNIFER JENNINGS

    New York University

Background/Context: Many studies have concluded that educational accountability policies increase data use, but we know little about how to design accountability systems to encourage productive versus distortive uses of test score data.

Purpose: I propose that five features of accountability systems affect how test score data are used and examine how individual and organizational characteristics interact with system features to influence teachers' data use. First, systems apply varying amounts of pressure. Second, the locus of pressure varies across systems. Third, systems diverge in the distributional goals they set for student performance. Fourth, the characteristics of assessments vary across systems. Finally, systems differ in scope, that is, whether they incorporate multiple measures or are process- or outcome-oriented.

Research Design: I review the literature on the effects of accountability systems on teachers' data use and propose a research agenda to further our understanding of this area.

Conclusions/Recommendations: Researchers have spent much more time analyzing test score data than investigating how teachers use data in their work. Evolving accountability systems provide new opportunities for scholars to study how the interactions between accountability features, individual characteristics, and organizational contexts affect teachers' test score data use.

"The focus on data, I would say, is the driving force [behind education] reform. No longer can we guess. We need to challenge ourselves every day to see what the data mean." (Secretary of Education Arne Duncan, 2010, quoted in Prabhu, 2010)


Since the 1970s, American education policy has relied on test-based accountability policies to improve student achievement and to close achievement gaps between advantaged and disadvantaged groups. Central to the theory of action underlying accountability is the idea that newly available test score data, in conjunction with the sanctions attached to these data, change the way that schools and teachers do business. In the view of many policy makers, exemplified by the Secretary of Education quoted at the beginning of this article, data are the primary driver of education reform.

Because data cannot do anything by themselves, what's missing from this account is an understanding of whether and how data change practice at the district, school, and classroom level and lead to educational improvement. Scholars have identified a number of potential positive effects of accountability-induced test score data use (hereafter, "data use") on classroom and school practices, such as supporting diagnosis of student needs, identifying strengths and weaknesses in the curriculum, identifying content not mastered by students, motivating teachers to work harder and smarter, changing instruction to better align it with standards, encouraging teachers to obtain professional development that will improve instruction, and more effectively allocating resources within schools (Stecher, 2004). Other scholars studying the unintended consequences of accountability systems have been more skeptical about the transformative potential of data use because educators can also use data as a strategic resource to manipulate performance on accountability measures (Koretz, 2008).

Many studies have found that accountability policies increase data use (Kerr, Marsh, Ikemoto, & Barney, 2006; Marsh, Pane, & Hamilton, 2006; Massell, 2001). Yet little is known about how features of accountability systems affect how educators use data, because accountability has been conceived of as one treatment in the literature. This is misleading, because accountability systems, even under the No Child Left Behind Act (NCLB), differ in important ways that have implications for schools' and teachers' use of data. As Lee (2008) wrote, "We need to know who is accountable for what, how, and why" (p. 625). Almost 20 years after the implementation of the first state accountability systems, we still know little about how features of accountability systems interact with organizational and individual characteristics to influence teachers' responses to accountability.

This article draws on the existing literature to catalog the features of accountability systems that may affect data use and reviews what we know, and what we don't know, about their effects. I limit my scope to the use of test score data, which at present are the primary data used in schools


math or science?) or students (Which students should I tutor after school?). Data provide an objective rationale for making such decisions. Because of the cultural standing of data, they also provide a legitimate way to account for one's actions to other parties, such as a colleague or a principal.

Together, these types of data use capture how teachers view their schools, students, and themselves (lens); how they determine what's working, what's going wrong, and why (diagnosis); what they should do in response (compass); how they establish whether it worked (monitoring); and how they justify decisions to themselves or to others (legitimizer).

Because of the positive valence of data use in the practitioner literature (e.g., Boudet, City, & Murnane, 2005) and in the culture at large, it is worth further refining these categories to distinguish between productive and distortive uses of data. At the core of data use is the process of making inferences from student test scores regarding students' and schools' performance, responding (or choosing not to respond) to these inferences, monitoring the effectiveness of this response, and accounting for it to oneself and others. I thus define productive data use as practices that improve student learning and do not invalidate the inferences about student- and school-level performance that policy makers, educators, and parents hope to make. To the extent that teachers' use of test score data to make instructional and organizational decisions produces score gains that do not generalize to other measures of learning (for example, other measures of achievement or other measures of educational attainment), and thus leads us to make invalid inferences about which schools, teachers, and programs are effective, I will characterize this as distortive data use.

Two concrete examples are useful in making this distinction clear. In an extreme case such as cheating, teachers use formative assessment data to determine which students are lowest performing and thus whose answer sheets should be altered. As a result, scores increase substantially, and we infer that school quality has improved when it has not. On the other hand, consider a teacher who uses formative assessment data to pinpoint her students' weaknesses in the area of mathematics and finds that they perform much worse on statistics and probability problems than geometry problems. She searches for more effective methods for teaching this content (methods focused on the most important material in this strand, not the details of the specific test), and students' performance on multiple external assessments improves in this strand of mathematics. We infer that students made gains in statistics and probability, which is, in this case, a valid inference.


THE EFFECTS OF FEATURES OF ACCOUNTABILITY SYSTEMS ON TEST SCORE DATA USE

I focus my review on the impact of five features of accountability systems on teachers' data use and discuss these features in terms of productive and distortive forms of data use. I chose these features based on my review of the literature because they represent, in my view, the most important dimensions on which accountability systems differ.

First, accountability systems apply varying amounts of pressure. Systems differ in the required pace of improvement and vary on a continuum from supportive to punitive pressure. Second, the locus of pressure varies across accountability systems. Systems may hold districts, schools, or students accountable for performance, and recent policies hold individual teachers accountable. Third, accountability systems vary in the distributional goals they set for student performance. Prioritizing growth versus proficiency may produce different responses, as will holding schools accountable for racial and socioeconomic subgroups. Fourth, features of assessments vary across systems. To the extent that teachers feel that using test data will improve scores on a given test, teachers may be more likely to use it, though it is not clear whether they will use it in productive or distortive ways. Fifth, the scope of the accountability system may affect data use. This includes whether an accountability system incorporates multiple measures, or is process- or outcome-oriented. An accountability system that rewards teachers for short-term test score increases will likely produce different uses of data than one that measures persistent effects on test scores.

In the preceding description, accountability features are treated as universally understood and processed by all teachers. The implication is that adopting a design feature will lead to a predicted set of responses. But there are good reasons to believe that teachers may understand and interpret the implications of these features differently. Coburn (2005) studied how three reading teachers responded to changes in the institutional environment. She found that teachers' sensemaking, framed by their prior beliefs and practices, influenced their responses to messages from the institutional environment. Diamond (2007), studying the implementation of high-stakes testing policy in Chicago, found that teachers' understanding of and responses to this policy were mediated by their interactions with colleagues. A number of other studies confirm that organizational actors may not perceive and react to the organizational environment similarly (Coburn, 2001, 2006; Spillane, Reiser, & Reimer, 2002; Weick, 1995). These studies largely found that organizational actors both within and between organizations construct the demands of, and appropriate responses to, accountability systems differently. As a result, individuals and organizations respond in varied ways (and potentially use data in varied ways) that are not simply a function of what policy makers perceive as teachers' incentives.

The object of interest in this review, then, is not the average effect of the five accountability system features on data use. Rather, I consider how individual and organizational characteristics interact with different features to produce the responses we observe.

    VARYING AMOUNTS OF PRESSURE

Accountability systems vary in the amount and type of pressure exerted on schools. The empirical problem for studying this issue, however, is that accountability pressure is in the eye of the beholder; that is, it does not exist objectively in the environment. What is perceived as significant pressure in one school may be simply ignored in another. Teachers use data as a lens for understanding environmental demands, but the meanings they attach to these data vary across individuals and schools. Although the studies that I discuss next treat pressure as a quantity known and agreed on by all teachers, I emphasize that understanding the effects of pressure on data use first requires a more complex understanding of how teachers use data to establish that they face accountability pressure.

Accountability pressure may have heterogeneous effects on data use depending on schools' proximity to accountability targets. As noted, teachers may not have a uniform understanding of their proximity to targets, but most studies simulate this pressure by predicting schools' odds of missing accountability targets. Only one study has examined the effect of schools' proximity to accountability targets (here, adequate yearly progress [AYP]) on instructional responses. Combining school-level data on test performance and survey data from the RAND study of the implementation of NCLB in three states (Pennsylvania, Georgia, and California), Reback, Rockoff, and Schwartz (2010) examined the effects of accountability pressure on a series of plausibly data-driven instructional behaviors. Schools that were furthest from AYP targets were substantially more likely to focus on students close to proficiency relative to those that were very likely to make AYP (53% of teachers vs. 26%), to focus on topics emphasized on the state test (84% vs. 69%), and to "look for particular styles and formats of problems in the state test and emphasize them in [their] instruction" (100% vs. 67%; Reback et al., 2010).

This study does not illuminate the process of data use in these schools. These activities could have resulted from data use as a lens and stimulus rather than examination of test data to diagnose student performance levels and content weaknesses. Nonetheless, the study provides some support for the hypothesis that the amount of accountability pressure affects how, and how much, data are used; data appear to be used to target both students and item content and styles when schools face more accountability pressure. It also demonstrates that teachers facing objectively low risks of missing accountability targets still respond strongly to accountability systems. In these low-risk schools, more than two thirds of teachers in this study appeared to use data to focus on particular content and item formats.

Hamilton et al.'s (2007) study of the implementation of NCLB in three states (Pennsylvania, Georgia, and California) also considered how variation in state accountability systems may affect instructional responses, and data use in particular. Hamilton et al. found that districts and schools responded in similar ways across the three states. Among the most common improvement strategies deployed were aligning curricula and instruction with standards, providing extra support for low-performing students, encouraging educators to use test results for planning instruction, adopting benchmark assessments, and engaging in test preparation activities. Though this review focuses on teachers, some of the most useful information on data use in this study came from principal surveys. More than 90% of principals in all three states reported that they were using student test data to improve instruction, though a higher fraction of principals in Georgia found data useful than in the other two states. Georgia also appeared to be an outlier in data use in other ways. For example, 89% of districts required interim assessments in elementary school math, whereas only 44% did in California, and 38% did in Pennsylvania. Other studies suggest that these findings could be a function of variation in pressure resulting from where accountability targets were set and how quickly schools were expected to increase scores. Pedulla et al.'s (2003) study provides some evidence on this issue; they found that higher stakes increase data use, and another study has found that the amount of pressure a district is facing is associated with teachers' emphasis of tested content and skills (Center on Education Policy, 2007).

Beyond these two studies, the management literature on organizational learning and behavior provides further insight into how the amount of accountability pressure might interact with organizational characteristics to affect data use. Current federal accountability targets of 100% proficiency by 2014 (see Note 2) can be understood as a stretch goal, which Sitkin, Miller, See, Lawless, and Carton (in press) described as having two features: extreme difficulty, "an extremely high level of difficulty that renders the goal seemingly impossible given current situational characteristics and resources"; and extreme novelty, "there are no known paths for achieving the goal given current capabilities (i.e., current practices, skills, and knowledge)" (p. 9). Ordonez, Schweitzer, Galinsky, and Bazerman (2009) characterized this problem as "goals gone wild," writing that they "can narrow focus, motivate risk-taking, lure people into unethical behavior, inhibit learning, increase competition, and decrease intrinsic motivation" (p. 17). Schweitzer, Ordonez, and Douma (2004) empirically tested this idea in a laboratory experiment and found that subjects with unmet goals engaged in unethical behavior at a higher rate than those told to do their best and that this effect was stronger when subjects just missed their goals. Although we have little empirical evidence on how these findings might generalize to the level of organizations, Sitkin et al. proposed that high-performing organizations are the most likely to benefit from stretch goals, whereas these goals are likely to have disruptive, suboptimal effects in low-performing organizations.

To the extent that the management literature described earlier applies to schools, these findings suggest that high- and low-performing schools will respond differently to accountability pressure. We can hypothesize that the lowest performing schools may be more likely to pursue distortive uses of data when faced with stretch goals. These schools may attempt to quickly increase test scores using some of the approaches documented in the studies reviewed here, such as targeting high-return content or students. Higher performing schools, which often have higher organizational capacity, may respond in more productive ways.

There is some evidence for this argument in the educational literature on data use; two studies have found that schools with higher organizational capacity are more likely to use data productively (Marsh et al., 2006; Supovitz & Klein, 2003). Marsh et al. synthesized the results of four RAND studies that incorporated study of teachers' data use and found that staff preparation to analyze data, the availability of support to help make sense of data, and organizational norms of openness and collaboration facilitated data use. Supovitz and Klein (2003), in their study of America's Choice schools' use of data, found that a central barrier to data use was the lack of technical ability to manipulate data to answer questions about student performance.

    LOCUS OF PRESSURE

Accountability systems hold some combination of students, teachers, schools, and districts accountable for improving performance. In cities like New York and Chicago, students must attain a certain level of performance on tests to be promoted to the next grade, and 26 states currently


there? Do teachers with substantially greater effectiveness on high-stakes than low-stakes tests use item-level data to diagnose student needs and inform their instruction? The second task is to explain whether there are systematic teacher characteristics that are associated with different levels and types of data use. Do new teachers understand the line between productive and distortive data use differently than more experienced teachers? Does preservice training now focus more on data use, which would produce different types of usage among new teachers? Do untenured teachers experience accountability pressure in a more acute way that makes them more attentive to test score data? Answering each of these questions requires an understanding of how teachers internalize the locus of pressure, what they do as a result, and how these behaviors vary across teachers both between and within organizations.

The locus of accountability may also affect how data are used to target resources, such as time and attention in and out of class, to students. Targeting reflects the use of data as diagnosis and potentially as legitimizer. That is, test scores are used to diagnose students in need of additional resources, and the imperative to improve test scores may legitimize the use of these data to target additional time and attention to these students. Teachers (or their administrators) may also use administrative data to determine which students do not require targeted resources in the short term. Corcoran (2010) found that in the Houston Independent School District, a large fraction of students did not have two consecutive years of test data available. This creates incentives for teachers to use data to identify students who "don't count" if teachers are held individually accountable for student value-added, so they can focus their attention elsewhere. Even if accountability is at the level of the school, Jennings and Crosta (2010) found that 7% of students are not continuously enrolled in the average school in Texas and thus do not contribute to accountability indicators. This may be consequential for students. For example, Booher-Jennings (2005) found that teachers in the Texas school that she studied focused less attention on students who were not counted in the accountability scheme because they were not continuously enrolled. These actions were legitimated as data-driven decisions. In this case, data systems that were intended to improve the outcomes of all students were instead used to determine which students would not affect the school's scores.
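To make this accounting logic concrete, here is a minimal sketch of how a data system might flag students who would not enter an accountability calculation, following the two exclusions described above (no two consecutive years of test data; not continuously enrolled). The record layout, field names, and the specific rule are hypothetical illustrations, not any state's actual definition.

```python
# A minimal sketch: flagging students who would not "count" because they lack
# two consecutive tested years or continuous enrollment. Hypothetical fields.

from dataclasses import dataclass

@dataclass
class StudentRecord:
    student_id: str
    years_tested: list           # e.g., [2009, 2010]
    continuously_enrolled: bool  # enrolled from fall snapshot through test date

def counts_for_accountability(s: StudentRecord) -> bool:
    """A student 'counts' only with two consecutive tested years and
    continuous enrollment (real rules vary by state; this is illustrative)."""
    consecutive = any(y + 1 in s.years_tested for y in s.years_tested)
    return consecutive and s.continuously_enrolled

students = [
    StudentRecord("A", [2009, 2010], True),
    StudentRecord("B", [2010], True),         # no prior-year score
    StudentRecord("C", [2009, 2010], False),  # mid-year mover
]

uncounted = [s.student_id for s in students if not counts_for_accountability(s)]
print(uncounted)  # ['B', 'C']: the students a distortive user could deprioritize
```

The point of the sketch is that the same query can serve a productive purpose (finding mobile students who need extra support) or a distortive one (deciding whose learning can be safely ignored); the code is identical, and only the response differs.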

Because policies that hold individual teachers accountable for scores are new and have not been studied, the performance management literature on individual incentives is helpful in understanding how teachers may use data differently in these cases. Much of this literature is based on formal models of human behavior rather than empirical data. These studies suggest that high-powered individual incentives focused on a limited set of easily measurable goals are likely to distort behavior and lead to undesirable uses of data (Baker, Gibbons, & Murphy, 1994; Campbell, 1979; Holmstrom & Milgrom, 1991). If this precept applies to schools, we can predict that individual accountability focused largely on test scores will encourage distortive uses of data. But we may also expect that these responses will be mediated by the organizational contexts in which teachers work (Coburn, 2001). In some places, this pressure may be experienced acutely, whereas in others, principals and colleagues may act as a buffer. As I will discuss later, subjective performance measures have been proposed as a way to offset these potentially distortive uses of data.

DISTRIBUTIONAL GOALS OF THE ACCOUNTABILITY SYSTEM: PROFICIENCY, GROWTH, AND EQUITY

The goals of an accountability system affect how student performance is measured, which in turn may affect how data are used. The three major models currently in use are a status (i.e., proficiency) model, a growth model, or some combination of the two. These models create different incentives for using data as both diagnosis and compass to target resources to students. In the case of status models (by which I mean models that focus on students' proficiency), teachers have incentives to move as many students over the cut score as possible but need not attend to the average growth in their class. In a growth model, teachers have incentives to focus on those students who they believe have the greatest propensity to exhibit growth. Because state tests generally have strong ceiling effects that limit the measurable growth of high-performing students (Koedel & Betts, 2009), teachers may focus on lower performing students in a growth system.
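A stylized example may clarify how the two models score the same classroom differently. The sketch below computes a proficiency rate and a mean gain from invented scores and an invented cut score; it illustrates the incentive logic only, not any state's actual formula.

```python
# A stylized illustration, not any state's actual formula: one classroom
# scored under a status (proficiency) model and under a growth model.
# Scores and the cut score are invented for the example.

scores_last_year = [18, 22, 29, 31, 38, 45]
scores_this_year = [20, 27, 31, 33, 39, 45]
PROFICIENCY_CUT = 30

# Status model: fraction of students at or above the cut this year.
status = sum(s >= PROFICIENCY_CUT for s in scores_this_year) / len(scores_this_year)

# Growth model: average year-over-year gain.
gains = [t - l for l, t in zip(scores_last_year, scores_this_year)]
growth = sum(gains) / len(gains)

print(f"proficiency rate: {status:.2f}")  # 0.67
print(f"mean gain: {growth:.2f}")         # 2.00
# Note the student stuck at 45 (the test ceiling): no measurable gain is
# possible, so a growth model gives a teacher little incentive to focus there.
```

Under the status metric, only movement across the cut registers; under the growth metric, every point of gain registers except where the ceiling binds, which is the targeting asymmetry the paragraph above describes.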

No studies to date have investigated whether status models have different effects on data use than growth models. However, we can generate hypotheses about these effects by considering a growing body of literature that has assessed how accountability systems affect student achievement across the test score distribution. A prime suspect in producing uneven distributional effects is the reliance of current accountability systems on proficiency rates, a threshold measure of achievement. Measuring achievement this way can lead teachers to manipulate systems of measurement to create the appearance of improvement. For example, teachers can focus on "bubble students," those close to the proficiency cut score (Booher-Jennings, 2005; Hamilton et al., 2007; Neal & Schanzenbach, 2007; Reback, 2008). Test score data appear to play a central role in making these targeting choices and may also be used to legitimize these choices as data-driven decision making (Booher-Jennings, 2005). Because sanctions are a function of passing rates, slightly increasing the scores of a small number of students can positively impact the school's accountability rating.

A large body of evidence addresses the issue of distributional effects and provides insight into the extent to which teachers are using data to target resources to students. The literature is decidedly mixed. One study in Chicago found negative effects of accountability pressure on the lowest performing students (Neal & Schanzenbach, 2007), whereas another in Texas found larger gains for marginal students and positive effects for low-performing students as well (Reback, 2008). Four studies identified positive effects on low-performing students (Dee & Jacob, 2009; Jacob, 2005; Ladd & Lauen, 2010; Springer, 2007), whereas four found negative effects on high-performing students (Dee & Jacob, 2009; Krieg, 2008; Ladd & Lauen, 2010; Reback, 2008). Because these studies intended to establish effects at the level of the population, they did not directly attend to how teachers varied in their use of data to target students and how organizational context may have mediated these responses. I return to these issues in my proposed research agenda.

Only one study to date has compared the effects of status and growth models on achievement. Analyzing data from North Carolina, which has a low proficiency bar, Ladd and Lauen (2010) found that low-achieving students made more progress under a status-based accountability system. In contrast, higher achieving students made more progress under a growth-based system. This suggests that teachers' allocation of resources is responsive to the goals of the measurement system; teachers targeted students below the proficiency bar under a status system, and those expected to make larger gains (higher performing students) under a growth system. As more states implement growth models, researchers will have additional opportunities to address this question and determine what role the difficulty of the proficiency cut plays in affecting how teachers allocate their attention.

A second feature that may be important for how data are used to target students is whether the system requires subgroup accountability, and what cutoffs are established for separately counting a subgroup. States vary widely in how they set their subgroup cutoffs. In Georgia and Pennsylvania, 40 students count as a subgroup, whereas in California, schools must enroll 100 students, or 50 students if that constitutes 15% of school enrollment (Hamilton et al., 2007). Only one study, by Lauen and Gaddis (2010), has addressed the impact of NCLB's subgroup requirements. Though they found weak and inconsistent effects of subgroup accountability, Lauen and Gaddis found large effects of subgroup accountability on low-achieving students' test scores in reported subgroups; these were largest for Hispanic students. These findings suggest that we need to know more about how data are used for targeting in schools that are separately accountable for subgroups compared with similar schools that are not.
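As a concrete illustration of how such cutoffs determine whether a subgroup is visible to the system, the sketch below encodes the rules reported by Hamilton et al. (2007) in simplified form; real AYP subgroup rules include further conditions omitted here.

```python
# A simplified encoding of the subgroup-counting rules reported by Hamilton
# et al. (2007): GA and PA count a subgroup at 40 students; CA at 100, or at
# 50 if the subgroup is at least 15% of enrollment. Illustrative only.

def subgroup_counts(state: str, subgroup_n: int, school_n: int) -> bool:
    if state in ("GA", "PA"):
        return subgroup_n >= 40
    if state == "CA":
        return subgroup_n >= 100 or (subgroup_n >= 50 and subgroup_n / school_n >= 0.15)
    raise ValueError(f"no rule encoded for {state}")

# The same 45-student subgroup is accountable in PA but invisible in CA:
print(subgroup_counts("PA", 45, 400))  # True
print(subgroup_counts("CA", 45, 400))  # False
```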

To summarize, most of our knowledge about the effects of distributional goals of accountability systems comes from studies that examine test scores, rather than data use, as the focus of study. These studies raise many questions about the role data use played in producing these outcomes. First, data use has made targeting more sophisticated, real-time, and accurate, but we know little about how targeting varies across teachers and schools. Second, we need to know whether teachers, administrators, or both are driving targeting behavior. For example, whereas 77% to 90% of elementary school principals reported encouraging teachers to focus their efforts on students close to meeting the standards, only 29% to 37% of teachers reported doing so (Hamilton et al., 2007). Third, we need to know more about the uses of summative data for monitoring the effectiveness of targeting processes. How do teachers interpret students' increases in proficiency when they are applying targeting practices? Depending on the inference teachers want to make, targeting can be perceived as a productive or distortive use of data. Targeting students below passing creates the illusion of substantial progress on proficiency, making it distortive if the object of interest is change in student learning. On the other hand, the inferences made based on test scores would not be as distortive if teachers examined average student scale scores. At present, we do not know to what extent teachers draw these distinctions in schools. A final area of interest, which will be discussed in more detail in the following section, is the extent to which targeting increases students' skills generally or is tailored to predictable test items that will push students over the cut score.

    FEATURES OF ASSESSMENTS

Features of assessments may affect whether teachers use data in productive or distortive ways. Here, I focus on three attributes of assessments (the framing of standards, the sampling of standards, and the cognitive demand of the skills represented on state tests) because they are most relevant to the potentially distortive uses of data. Many more features of assessments, such as the extent to which items are "coachable," should also be explored. The specificity of standards and their content varies widely across states (Finn, Petrilli, & Julian, 2006). Framing standards too broadly leads teachers to use test data to focus on tests rather than on the standards themselves (Stecher, Chun, Barron, & Ross, 2000). In other words, if a standard requires that students understand multiple instantiations of a skill but the test always samples the same one, teachers will likely ignore the unsampled parts of the standard. On the other hand, overly narrow framing of standards also enables test preparation that increases overall scores without increasing learning. For example, in some cases, state standards are framed so narrowly that they describe a test question rather than a set of skills (Jennings & Bearak, 2010). The implication for data use is that the framing of standards may affect how much teachers use item data disaggregated at the standard level. Ultimately, this turns on teachers' assessments of how much information standard-level data provide.

By the same token, the sampling of standards (what is actually covered on the tests) may also affect how teachers use data to inform instruction. Teachers themselves have reported significant misalignment between the tests and the standards, such that content was included that was not covered in the curriculum, or important content was omitted (Hamilton et al., 2007). Omissions are not randomly drawn from the standards; for example, Rothman, Slattery, Vranek, and Resnick (2002) found that tests were more likely to cover standards that were less cognitively demanding. Studies of standard coverage on the New York, Massachusetts, and Texas tests confirm these impressions and have found that state tests do not cover the full state standards and are predictable across years in ways that facilitate test-specific instruction, though there is substantial variation across subjects and states (Jennings & Bearak, 2010). At one end of the continuum is New York; in mathematics, in no grade was more than 55% of the state standards tested in 2009. By contrast, the Texas math tests covered almost every standard in 2009, and the Massachusetts exams covered roughly four fifths of the standards in that year. Jennings and Bearak (2010) analyzed test item-level data and found that students performed better on standards that predictably accounted for a higher fraction of test points. This suggests that teachers had targeted their attention to standards most likely to increase test scores. Survey evidence largely confirms these findings. In the RAND three-state study, teachers reported that there were many standards to be tested, so teachers had identified highly assessed standards on which to focus their attention (Hamilton & Stecher, 2006).

A complementary group of studies on "teaching to the format" illustrates how features of the assessment can affect how data are used to influence practice. Studies by Borko and Elliott (1999), Darling-Hammond and Wise (1985), McNeil (2000), Shepard (1988), Shepard and Dougherty (1991), Smith and Rottenberg (1991), Pedulla et al. (2003), and Wolf and McIver (1999) all demonstrate how teachers focus their instruction not only on the content of the test, but also on its format, by presenting material in the formats in which it will appear on the test and designing tasks to mirror the content of the tests. To the extent that students learn how to correctly answer questions when they are presented in a specific format but struggle with the same skills when they are presented in a different format, this use of test data to inform instruction is distortive because it inflates scores.

Taken together, these studies suggest that different assessments create incentives for teachers to use data in different ways. They also suggest that teachers are using relatively detailed data about student performance on state exams in their teaching but provide little insight into the types of teachers or schools where these practices are most likely to be prevalent. School organizational characteristics, such as the availability of data systems, instructional support staff, and professional development opportunities, may affect how features of assessments are distilled for teachers. Teachers' own beliefs about whether these uses constitute good teaching may also matter. Some teachers view focusing on frequently tested standards as best practice, whereas others see this as teaching to the test. Another important area for inquiry is whether this type of data use is arising from the bottom up or the top down. In many cases, teachers are not making these decisions alone. Rather, school and district leaders may mandate uses of data and changes in practice that will increase test scores, and teachers may unevenly respond to these demands (Bulkley, Fairman, Martinez, & Hicks, 2004; Hannaway, 2007; Koretz, Barron, Mitchell, & Stecher, 1996; Koretz, Mitchell, Barron, & Keith, 1996; Ladd & Zelli, 2002).

    SCOPE

Many have hypothesized that accountability systems based on multiple process and outcome measures may encourage more productive uses of data than those based only on test scores. In a policy address, Ladd (2007) proposed the use of teams of inspectors that would produce qualitative evaluations of school quality so that accountability systems would more proactively promote good practice. Hamilton et al. (2007) suggested creating a broader set of indicators to provide more complete information to the public about how schools are performing and to lessen the unwanted consequences of test-based accountability. The hope is that by measuring multiple outcomes and taking account of the processes through which outcomes are produced, educators will have weaker incentives to use distortive means.


A large body of literature in economics and management has considered how multiple measures may influence distortive responses to incentives, particularly when firms have multiple goals. In their seminal article, Holmstrom and Milgrom (1991) outlined two central problems in this area: Organizations have multiple goals that are not equally easy to measure, and success in one goal area may not lead to improved performance in other goal areas. They showed that when organizational effectiveness in achieving goals is weakly correlated across goal domains and information across these domains is asymmetric, workers will focus their attention on easily measured goals to the exclusion of others. They recommended minimizing strong objective performance incentives in these cases.

Because comprehensive multiple measures systems have not been implemented in the United States, it is currently difficult to study their effects on data use. Existing studies have examined how the use of multiple measures affects the validity of inferences we can make about school quality (Chester, 2005) or have described the features of these systems (Brown, Wohlstetter, & Liu, 2008), but none has evaluated their effects on data use. The European and U.K. experience with inspectorates provides little clear guidance on this issue; scholars continue to debate whether these systems have improved performance or simply led to gaming of inspections (Ehren & Visscher, 2006).

    A PROPOSED RESEARCH AGENDA

To improve accountability system design and advance our scholarly knowledge of data use, researchers should study both productive and distortive uses of data. Next, I propose a series of studies that would help build our understanding of how accountability system features affect teachers' data use and what factors produce variability in teachers' responses.

    AMOUNT OF PRESSURE

There are two specific features of regulatory accountability systems that should be explored to understand their impacts on data use: where the level of expected proficiency or gain is set, and how quickly schools are expected to improve. There is wide variation in the cut scores for proficiency that states set under NCLB, as well as substantial variation in the required pace of improvement. As Koretz and Hamilton (2006) have written, establishing the right amount of required improvement and its effects has been a recurrent problem in performance management systems and is generally aspirational rather than evidence based. Some states used linear trends to reach 100% proficiency by 2013-2014, whereas others allowed slower progress early on but required the pace of improvement to increase as 2014 approached (Porter, Linn, & Trimble, 2005). Such differences across states can be exploited to understand the impact of the amount of pressure on data use.

Also worth exploring is how schools react to cross-cutting accountability pressures. Local, state, and federal accountability systems are currently layered on top of each other, yet we know little about which sanctions drive educators' behavior and why. For example, California maintained its pre-NCLB state growth system and layered an NCLB-compliant status system on top. In New York City, the A-F Progress Report system was layered on top of the state and federal accountability system. In both cases, growth and status systems frequently produced different assessments of school quality. We need to know more about how educators make sense of the competing priorities of these systems and determine how to respond to their multiple demands. For example, we might expect that judgments of low performance from both systems would intensify accountability pressure and increase teachers' use of test score data, whereas conflicting judgments might lead teachers to draw on test score data to legitimize one designation versus the other.

    LOCUS OF PRESSURE

Rapidly evolving teacher evaluation systems and the staggered implementation of these systems across time and space provide an opportunity to understand how data use differs when both teachers and schools are held accountable.

We need to know more about how educators understand value-added data that are becoming widely used and distributed and how they respond in their classrooms. For example, we need to know more about how teachers interpret the meaning of these measures as well as their validity. We need to know to what extent teachers perceive that these measures accurately capture their impact and how this perceived legitimacy affects the way they put these data to use. Of particular interest is how these interpretations vary by teacher and principal characteristics such as experience, grade level, subject taught, teacher preparation (i.e., traditional vs. alternative), demographics, and tenure. Researchers should also investigate the interaction between individual and organizational characteristics, such as the school's level of performance, racial and socioeconomic composition, resources, and climate.
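For readers unfamiliar with these measures, the sketch below shows one deliberately simplified flavor of value-added estimate: regress current scores on prior scores and average each teacher's residuals. Operational models add many controls and shrinkage adjustments; the data here are simulated, and the specification is illustrative only, not any state's or district's actual model.

```python
# A deliberately simplified value-added estimate, to make concrete what the
# "value-added data" discussed above might contain. Simulated data; real
# models include many more controls and shrinkage.

import numpy as np

rng = np.random.default_rng(0)
n = 300
teacher = rng.integers(0, 10, n)              # 10 hypothetical teachers
prior = rng.normal(0, 1, n)                   # last year's score (standardized)
true_effect = rng.normal(0, 0.2, 10)          # unobserved teacher effects
score = 0.7 * prior + true_effect[teacher] + rng.normal(0, 0.5, n)

# OLS of current score on prior score (with intercept).
X = np.column_stack([np.ones(n), prior])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
residual = score - X @ beta

# Teacher "value-added" = mean residual of that teacher's students.
va = {t: residual[teacher == t].mean() for t in range(10)}
print({t: round(v, 2) for t, v in va.items()})
```

Even this toy version makes visible why interpretation matters: a teacher's estimate moves with which students happen to be in the class and how noisy the test is, which is precisely the validity question teachers must weigh.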

Once we have a better understanding of how teachers interpret these data, scholars should study whether and how teachers' data use changes in response to measures of their value-added. For example, we need to understand whether teachers pursue productive or distortive forms of data use as they try to improve their value-added scores and how these reactions are mediated by individual and organizational characteristics.

GOALS OF THE ACCOUNTABILITY SYSTEM: STATUS, GROWTH, AND EQUITY

Although scholars have proposed a variety of theories to predict how teachers will respond to status- versus growth-oriented systems, few of these theories have been tested empirically. What seems clear is that status- and growth-based systems create different incentives for using data to inform instruction. Though there is no extant research in this area, one might hypothesize that in a status-based system, a teacher might use formative assessment data to reteach skill deficits of students below the threshold. Under a growth system, the same teacher might use these data to maximize classroom growth, which could occur by focusing on the skills that the majority of the class missed. Alternatively, a more sophisticated approach could be used whereby teachers determine which students have the greatest propensity to exhibit growth and focus on the skill deficits of these students. As accountability systems evolve, researchers should study how the incentives implied by status and growth systems affect data use practices.
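The sketch below renders these three hypothesized targeting rules as code, applied to invented formative assessment results; the data structures, thresholds, and the "propensity to grow" measure are all hypothetical.

```python
# Sketches of the three hypothesized targeting rules, applied to invented
# item-level formative assessment results. Illustrative only.

import numpy as np

rng = np.random.default_rng(1)
n_students, n_skills = 25, 8
mastery = rng.random((n_students, n_skills)) < 0.6   # True = skill mastered
score = mastery.mean(axis=1) * 100                   # crude overall score
expected_growth = rng.normal(5, 2, n_students)       # hypothetical propensity
CUT = 60

# Status system: reteach the deficits of students below the threshold.
status_targets = np.where(score < CUT)[0]

# Growth system, classroom version: reteach the skills most students missed.
growth_skills = np.argsort(mastery.mean(axis=0))[:3]

# Growth system, student version: focus on highest expected-growth students.
growth_targets = np.argsort(expected_growth)[-5:]

print(len(status_targets), growth_skills, growth_targets)
```

The same formative assessment file thus supports three quite different allocations of teacher attention, which is why the design of the accountability metric, and not the data alone, shapes who gets taught what.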

All the ideas posed in the preceding paragraphs suggest that teachers have a clear understanding of how status versus growth systems work. Because growth systems are based on sophisticated statistical models, there are good reasons to suspect that this is untrue. Researchers should also study how teachers vary in their understanding of these systems, how these understandings vary between and within organizations, and how they shape teachers' use of data.

    FEATURES OF ASSESSMENTS

As described in the literature review, assessments offer different opportunities to use data to improve test scores. This area will become particularly relevant with the implementation of the Common Core standards and assessments, which focus on fewer concepts but promote deeper understanding of them. The switch to a new set of standards will provide a unique opportunity for researchers to observe the transition in data use that occurs when the features of standards and assessments change.

We also need to know how teachers vary in the extent to which they use distortive approaches to increase scores, which can be enabled by assessments that build in predictable features. Existing research suggests that there is substantial variation across states in such responses, which suggests that features of assessments may matter. More investigation at the organizational level is needed. For example, there is substantial variation in the fraction of teachers reporting that they emphasize certain assessment styles and formats of problems in their classrooms. One hypothesis is that features of assessments make these behaviors a higher return strategy in some places than others. Another hypothesis is that variation in score reporting produces variation in teachers' data use. Researchers could contrast teachers' responses in states and districts that disaggregate subscores in great detail, relative to those that provide only overall scores (see Note 3). Once we better understand the features of assessments and assessment reporting that contribute to these differences, researchers can design assessments that promote desired uses of data and minimize undesired uses. Using existing student-by-item-level administrative data, it is now possible to model teachers' responses to these incentives, but what is missing from such studies is an understanding of the data-related behaviors that produced them. Future studies using both survey and qualitative approaches to study data use can help to unpack these findings.

    SCOPE

Many have hypothesized that accountability systems based on multiple measures (and in particular, those that are both process- and outcome-oriented) may produce more productive uses of data. Future studies should establish how teachers interpret multiple measures systems. These systems will put different weights on different types of measures, requiring teachers to decide how to allocate their time among them. For example, in New York City, 85% of schools' letter grades are based on student test scores, whereas 15% are based on student, teacher, and parent surveys and attendance records. Likewise, new systems of teacher evaluation incorporate both test scores (in some cases, up to 51% of the evaluation) and other evaluations. We need to know how teachers understand these systems in practice and how the weights put on different types of measures influence their understanding.
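A toy calculation shows how such weights steer attention. The sketch below uses the 85/15 split cited above for New York City's Progress Reports; the component scores themselves are invented.

```python
# A toy composite using the weights cited above for New York City's Progress
# Reports (85% test-score measures, 15% surveys and attendance). Component
# scores are invented; the point is how the weights steer teacher attention.

WEIGHTS = {"test_scores": 0.85, "surveys_and_attendance": 0.15}

def composite(components: dict) -> float:
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must sum to 1
    return sum(WEIGHTS[k] * components[k] for k in WEIGHTS)

school = {"test_scores": 55.0, "surveys_and_attendance": 90.0}
print(composite(school))  # 60.25: a 35-point survey advantage moves little
```

Under weights this lopsided, a rational allocation of effort still flows overwhelmingly to the tested measures, which is why the weighting scheme, and not merely the presence of multiple measures, should be an object of study.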

    CONCLUSION

The rise of the educational accountability movement has created a flurry of enthusiasm for the use of data to transform practice and generated reams of test score data that teachers now work with every day. Researchers have spent much more time analyzing these test score data themselves than trying to understand how teachers use data in their work. What this literature review makes clear is just how scant our knowledge is about what teachers are doing with these data on a day-to-day basis. Given the widespread policy interest in redesigning accountability systems to minimize the undesired consequences of these policies, understanding how accountability features influence teachers' data use is an important first step in that enterprise.

    Notes

1. Federal, state, and district policy makers, of course, formally use data to measure how schools are doing and to apply rewards or sanctions, but I focus here on use of data by teachers.

2. Recent regulations now make it possible for states to request waivers from this requirement.

3. I thank a reviewer for this point.

    References

Baker, G., Gibbons, R., & Murphy, K. J. (1994). Subjective performance measures in optimal incentive contracts. Quarterly Journal of Economics, 109, 1125-1156.

Booher-Jennings, J. (2005). Below the bubble: Educational triage and the Texas accountability system. American Educational Research Journal, 42, 231-268.

Borko, H., & Elliott, R. (1999). Hands-on pedagogy versus hands-off accountability: Tensions between competing commitments for exemplary math teachers in Kentucky. Phi Delta Kappan, 80, 394-400.

Boudet, K., City, E., & Murnane, R. (Eds.). (2005). Data-wise: A step-by-step guide to using assessment results to improve teaching and learning. Cambridge, MA: Harvard Education Press.

Brown, R. S., Wohlstetter, P., & Liu, S. (2008). Developing an indicator system for schools of choice: A balanced scorecard approach. Journal of School Choice, 2, 392-414.

Buddin, R. (2010). Los Angeles teacher ratings: FAQ and about. Los Angeles Times. Retrieved from http://projects.latimes.com/value-added/faq/

Bulkley, K., Fairman, J., Martinez, M. C., & Hicks, J. E. (2004). The district and test preparation. In W. A. Firestone & R. Y. Schorr (Eds.), The ambiguity of test preparation (pp. 113-142). Mahwah, NJ: Erlbaum.

Campbell, D. T. (1979). Assessing the impact of planned social change. Evaluation and Program Planning, 2, 67-90.

Center on Education Policy. (2007). Choices, changes, and challenges: Curriculum and instruction in the NCLB era. Retrieved from http://www.cep-dc.org

Chester, M. D. (2005). Making valid and consistent inferences about school effectiveness from multiple measures. Educational Measurement: Issues and Practice, 24, 40-52.

Coburn, C. E. (2001). Collective sensemaking about reading: How teachers mediate reading policy in their professional communities. Educational Evaluation and Policy Analysis, 23, 145-170.


Coburn, C. E. (2005). Shaping teacher sensemaking: School leaders and the enactment of reading policy. Educational Policy, 19, 476-509.

Coburn, C. E. (2006). Framing the problem of reading instruction: Using frame analysis to uncover the microprocesses of policy implementation. American Educational Research Journal, 43, 343-379.

Corcoran, S. P. (2010). Can teachers be evaluated by their students' test scores? Should they be? Providence, RI: Annenberg Institute, Brown University.

Corcoran, S. P., Jennings, J. L., & Beveridge, A. A. (2010). Teacher effectiveness on high- and low-stakes tests (Working paper). New York University.

Darling-Hammond, L., & Wise, A. E. (1985). Beyond standardization: State standards and school improvement. Elementary School Journal, 85, 315-336.

Dee, T. S., & Jacob, B. (2009). The impact of No Child Left Behind on student achievement (NBER working paper). Cambridge, MA: National Bureau of Economic Research.

Diamond, J. B. (2007). Where rubber meets the road: Rethinking the connection between high-stakes testing policy and classroom instruction. Sociology of Education, 80, 285-313.

Ehren, M., & Visscher, A. J. (2006). Towards a theory on the impact of school inspections. British Journal of Educational Studies, 54, 51-72.

Finn, C. E., Petrilli, M. J., & Julian, L. (2006). The state of state standards. Washington, DC: Thomas B. Fordham Foundation.

Hamilton, L. S., & Stecher, B. M. (2006). Measuring instructional responses to standards-based accountability. Santa Monica, CA: RAND.

Hamilton, L. S., Stecher, B. M., Marsh, J. A., McCombs, J. S., Robyn, A., Russell, J. L., . . . Barney, H. (2007). Implementing standards-based accountability under No Child Left Behind: Responses of superintendents, principals, and teachers in three states. Santa Monica, CA: RAND.

Hannaway, J. (2007, November). Unbounding rationality: Politics and policy in a data rich system. Mistisfer lecture, University Council of Education Administration, Alexandria, VA.

Holmstrom, B., & Milgrom, P. (1991). Multitask principal-agent analyses: Incentive contracts, asset ownership, and job design. Journal of Law, Economics, and Organization, 7, 24-52.

Jacob, B. A. (2005). Accountability, incentives, and behavior: Evidence from school reform in Chicago. Journal of Public Economics, 89, 761-796.

Jennings, J. L., & Bearak, J. (2010, August). State test predictability and teaching to the test: Evidence from three states. Paper presented at the annual meeting of the American Sociological Association, Atlanta, GA.

Jennings, J. L., & Crosta, P. (2010, November). The unaccountables. Paper presented at the annual meeting of APPAM, Boston, MA.

Kerr, K. A., Marsh, J. A., Ikemoto, G. S., & Barney, H. (2006). Strategies to promote data use for instructional improvement: Actions, outcomes, and lessons from three urban districts. American Journal of Education, 112, 496-520.

Koedel, C., & Betts, J. (2009). Value-added to what? How a ceiling in the testing instrument influences value-added estimation (Working paper). University of Missouri.

Koretz, D. (2008). Measuring up: What standardized testing really tells us. Cambridge, MA: Harvard University Press.

Koretz, D., Barron, S., Mitchell, K., & Stecher, B. (1996a). The perceived effects of the Kentucky Instructional Results Information System. Santa Monica, CA: RAND.

Koretz, D., & Hamilton, L. S. (2006). Testing for accountability in K-12. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 531-578). Westport, CT: American Council on Education/Praeger.

Koretz, D., Mitchell, K., Barron, S., & Keith, S. (1996b). The perceived effects of the Maryland School Performance Assessment Program (CSE Tech. Rep. No. 409). Los Angeles: University of California, Los Angeles.


Shepard, L. A. (1988, April). The harm of measurement-driven instruction. Paper presented at the annual meeting of the American Educational Research Association, Washington, DC.

Shepard, L. A., & Dougherty, K. D. (1991). The effects of high stakes testing. In R. L. Linn (Ed.), Annual meetings of the American Educational Research Association and the National Council on Measurement in Education. Chicago, IL.

Sitkin, S., Miller, C., See, K., Lawless, M., & Carton, D. (in press). The paradox of stretch goals: Pursuit of the seemingly impossible in organizations. Academy of Management Review, 36.

Smith, M. L., & Rottenberg, C. (1991). Unintended consequences of external testing in elementary schools. Educational Measurement: Issues and Practice, 10, 7-11.

Spillane, J. P., Reiser, B. J., & Reimer, T. (2002). Policy implementation and cognition: Reframing and refocusing implementation research. Review of Educational Research, 72, 387-431.

Springer, M. (2007). The influence of an NCLB accountability plan on the distribution of student test score gains. Economics of Education Review, 27, 556-563.

Stecher, B. (2004). Consequences of large-scale high-stakes testing on school and classroom practice. In L. Hamilton, B. M. Stecher, & S. Klein (Eds.), Making sense of test-based accountability in education (pp. 79-100). Santa Monica, CA: RAND.

Stecher, B. M., Chun, T. J., Barron, S. I., & Ross, K. E. (2000). The effects of the Washington State education reform on schools and classrooms: Initial findings. Santa Monica, CA: RAND.

Supovitz, J. A., & Klein, V. (2003). Mapping a course for improved student learning: How innovative schools systematically use student performance data to guide improvement. Philadelphia: Consortium for Policy Research in Education, University of Pennsylvania.

Urbina, I. (2010, January 12). As school exit tests prove tough, states ease standards. The New York Times, p. A1.

Weick, K. (1995). Sensemaking in organizations. London, England: Sage.

Wolf, S. A., & McIver, M. C. (1999). When process becomes policy: The paradox of Kentucky state reform for exemplary teachers of writing. Phi Delta Kappan, 80, 401-406.

JENNIFER JENNINGS is assistant professor of sociology at New York University. Her research interests include the effects of organizational accountability systems on racial, socioeconomic, and gender disparities in educational and health outcomes and the effects of schools and teachers on non-test-score outcomes.