Using the Introduction of a New Test to Investigate the Distribution of Score Inflation
A working paper of the Education Accountability Project at the Harvard Graduate School of Education
http://projects.iq.harvard.edu/eap
Daniel Koretz, Marcus Waldman, Carol Yu, Meredith Langi, and Aaron Orzech
Harvard Graduate School of Education
December 8, 2014
© 2014 by the authors. All rights reserved.
The research reported here was supported by the Institute of Education Sciences, U.S.
Department of Education through Grant R305A110420 to the President and Fellows of Harvard
College. The authors thank the Kentucky Department of Education and the Kentucky Center for
Education and Workforce Statistics for providing the data used in this study. They particularly
thank Charles McGrew and Barrett Ross of KCEWS, who devoted considerable time to making
this work possible. The opinions expressed are solely those of the authors and do not represent
views of KDE, KCEWS, the Institute, the U.S. Department of Education, or the staff of any of
these organizations.
Abstract
High-stakes testing can induce undesirable forms of test preparation and score
inflation. This study uses changes in relative performance when a test is replaced to
investigate the distribution of score inflation. Using new high-stakes and norm-referenced
tests first administered in Kentucky in 2012, we investigated the relationship between
changes in relative performance and student and school characteristics. The performance
of poor students declined relative to others in their schools, and the mean scores of
schools with high concentrations of poor students declined more than those of schools
with lower concentrations, above and beyond the effects of student-level poverty.
Students with disabilities declined in performance, while Asian students improved. These
findings are all consistent with the hypothesis that score inflation tends to be more severe
among groups with low average performance.
High-stakes testing in various forms has been a cornerstone of U.S. education
policy for decades. Numerous studies have shown that as a result, scores can become
inflated—that is, scores can increase more than improvements in achievement warrant—
and that this bias can be very large (e.g., Jacob, 2007; Koretz & Barron, 1998). A smaller
number of studies have found that on average, both test preparation and score inflation
affect disadvantaged students more than others (e.g., Herman & Golan, 1993; Klein et
al., 2000).
However, the literature investigating the distribution of score inflation remains
limited. Much of it is highly aggregated, e.g., comparing inflation for subgroups at the
level of states (e.g., Klein et al., 2000). Although it is reasonable to expect the processes
that create score inflation to differ within and between schools (Koretz &
Hamilton, 2006), few studies have applied a multi-level framework to the evaluation of
inflation.
Using statewide data from Kentucky, this study takes a novel approach to exploring
the distribution of score inflation: we examine the distribution of changes in performance
when a long-standing high-stakes test was replaced by a new high-stakes test aligned with
the Common Core State Standards. We expect substantial variation in these difference
scores because of both measurement error and differences in test content. However,
systematic variations in the difference scores may signal variations in inappropriate
preparation for the old test. We investigate whether systematic variation in difference
scores is consistent with the literature showing greater inappropriate test preparation and
inflation among disadvantaged students. We fit two-level models to estimate the
distribution of inflation both within and between schools. To guard against artifacts
stemming from the specifics of the new test, we replicate these analyses using as an
outcome a second, low-stakes test first administered in the same year.
Background
The problem of score inflation has been well documented over the past quarter
century. Score inflation can arise even under low-stakes conditions because even then,
teachers may focus on the specific content of the test rather than on the broader domain
from which it samples and that it is intended to represent (Lindquist, 1951). However, the
empirical literature evaluating inflation arose decades later in response to the increasing
importance of high-stakes testing, which increases incentives to focus on the specific
content of the test. Most often, potential inflation has been evaluated by comparing trends
in scores on a high-stakes test to trends on another, lower-stakes “audit” test measuring a
similar domain. The logic of these studies is that for inferences based on the high-stakes
test to be valid, performance must generalize from that test to the largely latent domain,
and if performance does generalize to the domain, it should show a reasonable degree of
generalization to performance on other tests designed to support similar inferences. Most
often, the audit test has been the National Assessment of Educational Progress (NAEP;
e.g., Klein et al., 2000; Jacob, 2007; Ho & Haertel, 2006; Koretz & Barron, 1998). NAEP
has several advantages: it is widely accepted as a high-quality test, and there are few
incentives to prepare students specifically for it. However, numerous other tests have
been used as audit measures, including commercial norm-referenced tests (Koretz, Linn,
Dunbar, & Shepard, 1991; Ng & Koretz, 2013) and college-admissions tests (Koretz &
Barron, 1998).
Although no studies to date directly link specific behaviors of individual
educators to score inflation, a considerable body of research has documented responses to
testing that could produce score inflation. Examples include narrowing of instruction,
adapting instruction to mirror the format or style of test items, overemphasizing scoring
rubrics used on the test, and teaching test-taking tricks (e.g., Koretz, Barron, Mitchell,
& Stecher, 1996; Pedulla et al., 2003; Stecher & Mitchell, 1995). In addition, some
studies have documented behaviors that would not bias individuals’ scores but can bias
aggregate scores. One is disproportionately focusing resources on students thought to be
near the “proficient” cut score—so-called “bubble students”—because in current
standards-based accountability systems, it is the binary classification of students as
proficient or not that has the most severe consequences (Booher-Jennings, 2005; Gillborn
& Youdell, 2000; Neal & Schanzenbach, 2010; Stecher et al., 2008). While this may
produce valid gains for these students, it creates a biased view of aggregate improvement.
Numerous studies have found that undesirable forms of test preparation are more
severe among disadvantaged students and in schools serving a high percentage of
disadvantaged students. These schools often have a stronger emphasis on assigning drills
of test-style items and teaching test-taking strategies (Cimbricz, 2002; Diamond &
Spillane, 2004; Eisner, 2001; Firestone, Camilli, Yurecko, Monfils, & Mayrowetz, 2000;
Herman & Golan, 1993; Jacob, Stone, & Roderick, 2004; Jones, Jones, & Hargrove,
2003; Ladd & Zelli, 2002; Lipman, 2002; Luna & Turner, 2001; McNeil, 2000; McNeil
& Valenzuela, 2001; Taylor et al., 2002; Urdan & Paris, 1994). Evidence about the
distribution of score inflation is more limited but consistent with this. Several reports
show that the gains made by low-income and minority students relative to white students
on state tests are not matched on audit tests (Klein et al., 2000; Jacob, 2007; Ho &
Haertel, 2006).
In a study that compared performance on several different tests in Texas, Klein et
al. (2000) noticed an anomalous pattern in the between-school relationships between
high-stakes test scores and other variables. In the case of other tests, relationships
between scores on different tests and between scores and SES were stronger at the school
level than at the student level, which is the anticipated effect of aggregation. In the case
of high-stakes test scores, however, the reverse was true: in almost every case, the
relationship between high-stakes scores and other variables was weaker at the school
level than at the student level. These findings, unlike the main findings of their study,
were based on a sample of only 20 schools in a single district, so the authors were
hesitant to draw conclusions from them. However, Koretz & Hamilton (2006) noted that
these findings are consistent with greater test preparation and score inflation in low-
scoring schools. Therefore, they recommended examining differences between within-
and between-school relationships as a part of investigating potential score inflation. To
our knowledge, however, few studies have explored this.
The findings to date suggesting greater test preparation and score inflation among
disadvantaged students, although limited in number, are not surprising given the specifics
of schooling and accountability in the U.S. Both No Child Left Behind and many of the
state accountability programs that preceded it require larger gains—often, much larger
gains—by low-scoring groups of students. In addition, many low-scoring schools face
additional difficulties in raising scores, such as high transience and relatively
inexperienced staff.
We therefore undertook the present study with the following hypotheses:
• Within schools, low-scoring groups of students will on average have more
inflated scores on the old high-stakes test and will therefore decline in relative
performance on the two new tests.
• Similarly, schools with high concentrations of students from low-scoring groups
will decline in relative performance, above and beyond the decline predicted by
the within-school student-level relationships.
In keeping with the earlier studies noted above, we hypothesized that both poor
students and students in high-poverty schools would decline in relative performance with
the introduction of the new test. Because of Kentucky’s demographics, we did not have a
strong hypothesis about the relationships between race/ethnicity and changes in
performance. Kentucky’s Hispanic student population is small, and its African-American
population is small and highly concentrated: 50 percent are in Jefferson County
(Louisville), and another 13% are in Fayette County (Lexington). This leaves the race
effect, if there is one, substantially conflated with district effects. Nonetheless, we
included dummy variables for race/ethnicity. Although we are not aware of research
evaluating differential test preparation or score inflation affecting students with
disabilities, we expect that they will be affected more, for the same reasons that
other low-scoring groups have been. Some students with disabilities are exempted from
the general-education testing program and hence would not create similar incentives for
teachers, but those students do not appear in our data.
In addition, in the general case, one might expect that schools that had shown
particularly rapid cohort-to-cohort gains on the previous high-stakes test would show
declines in relative performance. However, it is not clear whether one should expect this
in the present context. The earlier testing program was in place for 12 years before the
change to the new program used as an outcome in this study. While one might expect to
find relationships between cohort-to-cohort gains and inflation in the early years of
testing programs, test preparation activities might have been so well diffused through the
state at the time of our data that variations in gains on the earlier test might no longer
have predictive power. Nonetheless, we evaluated this possibility.
This study evaluates these hypotheses by examining changes in performance
when one high-stakes test is replaced by another. This approach, not previously used in
this literature, shares two assumptions with the more conventional approach. We assume
that the audit test used for comparison—in this case, the new test—has not (yet) been the
focus of intensive test preparation, and that it is sufficiently similar to the old test in terms
of the intended inference to serve as an audit.
This approach faces two threats. The first is that teachers will have already
engaged in substantial preparation focused on the new test. If preparation for the new test
is extensive enough and is distributed similarly to preparation for the old test, the result
will be a Type II error, either failing to identify inflation or underestimating it. This
possibility therefore does not threaten positive findings from our analysis.
The second threat is more serious: the possibility that the new and old tests are
designed to sample differently enough from the domain to make the adequacy of the new
test as an audit measure questionable. This may be a particularly important threat
currently because of the introduction of tests aligned with the Common Core State
Standards, including the primary test used as an outcome in this study. If the new and old
tests are sufficiently different in their sampling, differences in alignment between
curriculum and the two tests could create systematic variations in the difference scores
we examine even in the absence of any behavioral responses to testing—and hence
without any score inflation. Moreover, it is possible that such differences in alignment are
consistent with our hypotheses. For example, suppose that the new test includes content
not included in the old test and that this new content was emphasized more in advantaged
schools than in disadvantaged schools before the introduction of the new test. This is
plausible when the new content is more advanced. This would create systematic
variations in the difference scores similar to that produced by greater inflation in
disadvantaged schools.
We took advantage of a unique aspect of Kentucky’s testing program to address
this limitation. In the spring of 2012, the Kentucky Department of Education first
administered two new tests. One, the Kentucky Performance Rating for Educational
Progress (KPREP) test, is the state's new high-stakes test, aligned with the Common Core
State Standards. This replaced the Kentucky Core Content Test (KCCT), the high-stakes test
last administered in 2011. The second was a norm-referenced test, the Stanford 10 (SAT-
10). This allowed us to replicate our analyses of KPREP with identical analysis of the
SAT-10. Similar findings with the SAT-10 would strengthen the inference that variations
in the difference score reflect preparation for the KCCT rather than specific attributes of
the KPREP.
Data
Our data consist of student-level test scores for students in grades 3-8 from spring
2007 through spring 2012. These data include KPREP scores for 2012 and KCCT scores
for all earlier years.
For this analysis, we limited our sample to students who were in grades 5-8 in
2012 in schools classified as A1 by the Kentucky Department of Education. These are
public schools other than preschools, special education schools, or other alternative
programs. We also limited our analysis to schools with more than 10 students. Because
we used difference scores as our outcome, we included only the 71% of students in these
schools who had scores in the appropriate grades in both 2012 and 2011. We removed
26,244 students across all grades who were assigned the highest or lowest obtainable
scale scores (HOSS/LOSS) on either test. Reasons for this exclusion are discussed in
detail below. (A sensitivity test examining the impact of this exclusion is included in the
Appendix.) Finally, we excluded students whose recorded scale scores were outside the
range for the relevant grade, as these students were either out-of-grade or given incorrect
scores. Our final count across grades 5-8 was 181,477 students, which was approximately
63% of our original data.
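As a sketch, the sample restrictions above can be expressed as a sequence of filters. All column names and values below are hypothetical stand-ins, not the names in the actual KDE/KCEWS extracts, and the analogous check that scores fall within the valid range for each grade is omitted for brevity.

```python
import pandas as pd

# Toy rows standing in for the statewide student-level file.
df = pd.DataFrame({
    "school_class":      ["A1", "A1", "A5", "A1"],
    "school_n_students": [250, 250, 40, 8],
    "kcct_2011":         [480.0, 575.0, 500.0, 510.0],
    "kprep_2012":        [210.0, None, 215.0, 220.0],
    "kcct_loss":         [325.0, 325.0, 325.0, 325.0],  # hypothetical bound
    "kcct_hoss":         [575.0, 575.0, 575.0, 575.0],  # hypothetical bound
})

# A1 public schools with more than 10 students.
df = df[(df["school_class"] == "A1") & (df["school_n_students"] > 10)]

# Students with scores in the appropriate grades in both 2011 and 2012.
df = df.dropna(subset=["kcct_2011", "kprep_2012"])

# Remove students at the highest or lowest obtainable scale score (HOSS/LOSS).
df = df[(df["kcct_2011"] > df["kcct_loss"]) & (df["kcct_2011"] < df["kcct_hoss"])]
```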
The final analytic sample was very similar in terms of demographics to the
original data. In both samples, approximately 82% of the students were white, 11%
African-American, 4% Hispanic, and 1% Asian. As a result of dropping the students who
scored HOSS or LOSS, there is a slightly higher percentage of poor students (60%) in our
analytic sample compared to 57% in the original data.
Methods
Outcome variable
Our outcome variable is the difference between a student’s score on KPREP in
2012 and his or her score on the KCCT in the prior grade in 2011.
These two tests were scaled differently, and the shapes of the distributions of
scores were markedly different. It was therefore not reasonable to use a simple difference
score, and simple standardization would have retained differences between the
distributions other than means and standard deviations. We therefore needed to transform
the scale scores, and the distribution of KCCT scores further necessitated a sensitivity
analysis to address censoring.
The KCCT was scaled using a combination of the three-parameter logistic and
generalized partial credit IRT models, which does not provide a one-to-one mapping of
scale scores to raw scores. To
obtain such a mapping, raw scores were mapped to the test characteristic curve (TCC;
Kentucky Department of Education, 2009). Mapping scale scores to the TCC often
stretches the tails of the distribution considerably more than some other scaling
approaches (e.g., Thissen & Orlando, 2001). In addition, the estimation method used does
not directly estimate scale scores for students with perfect scores, either 0 or 100 percent
of possible credit. This was handled by setting the lowest and highest obtainable scale
scores, LOSS and HOSS, a priori and assigning these to students with perfect scores
(Kentucky Department of Education, 2009).
In contrast, the KPREP is scaled using a Rasch model (Kentucky Department of
Education, n.d.), which provides scale scores that are mapped one-to-one with raw
scores. Rasch scaling will often stretch the tails of the distribution less than mapping to
the TCC. Rasch scaling, like the estimation method used with the KCCT, does not
directly estimate scale scores for students with perfect scores. In the case of KPREP, this
was handled in two stages. First, a small adjustment (±0.25) was made to perfect scores
to permit direct estimation of the underlying scale score. Second, LOSS and HOSS of
100 and 300 were imposed on the reporting scale (Johnson, 2014).
The result of these decisions was very different distributions of scores on the two
tests in the two years we analyzed. As expected, the KCCT distribution shows more
stretching of the tails. This is shown in Figure 1, which displays adjacent grades (grade 5
for the KCCT and grade 6 for the KPREP) because adjacent grades are used in the
calculation of our outcome measure. In addition, the KCCT distribution shows modest
left-censoring and substantial right-censoring, which could reflect both the location of
HOSS and LOSS and raw-score censoring. In the grade 5 KCCT, 13% of students were
in the HOSS spike and 4% in the LOSS spike. In contrast, the KPREP distributions were
free of censoring. As the figure suggests, with HOSS and LOSS included, the KCCT is
more skewed (skewness = -.38) than is the KPREP (skewness = .29).
We took two steps to address these differences in scale. First, we normalized
(probit-transformed) scores on both tests by calculating percentile ranks and mapping
these to the cumulative normal distribution. Second, because we have no information on
the appropriate distribution of performance for students assigned HOSS or LOSS, we
excluded those students from our main analyses. Our outcome variable was then the
simple difference in these probit-transformed scores.
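A minimal sketch of this transformation and of the resulting outcome variable follows, with toy data standing in for the real score columns. The half-rank offset is one standard way to keep percentile ranks off the 0 and 1 boundaries, where the normal quantile function diverges; the paper does not specify how boundary ranks were handled, so this detail is an assumption.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

def probit_normalize(scores: pd.Series) -> pd.Series:
    """Map scores to percentile ranks, then to standard-normal quantiles."""
    n = scores.notna().sum()
    pct = (scores.rank(method="average") - 0.5) / n  # midpoint percentile ranks
    return pd.Series(norm.ppf(pct), index=scores.index)

# Toy score vectors on deliberately different scales.
rng = np.random.default_rng(0)
kcct_2011 = pd.Series(rng.normal(60, 15, size=1_000))
kprep_2012 = pd.Series(rng.normal(200, 25, size=1_000))

# Outcome: the simple difference of the probit-transformed scores.
y = probit_normalize(kprep_2012) - probit_normalize(kcct_2011)
```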
Predictors
Our predictor variables were a number of student characteristics, school-level
aggregates of these student-level variables, and historical data on school mean test scores.
We included three race/ethnicity dummy variables, for Black, Hispanic, and
Asian. The reference group, coded zero, included Whites and others. A dummy variable for
poverty was defined by whether the student participated in the free or reduced-price
lunch program. The final dummy indicated whether a student had an individualized
education program (IEP) under the Individuals with Disabilities Education Act.
Because most students with recognized disabilities have IEPs, this is a reasonable
proxy for the presence of a disability.
The school-level aggregate variables were the means of these variables, that is,
the proportions of students with the dummy variables set to 1. However, these aggregate
variables were calculated from all students enrolled in each school, without excluding the
students noted above, because we expect educators’ responses to testing to be influenced
by the characteristics of all the students present in the school.
In addition to these aggregates, to evaluate the possible relevance of earlier
cohort-to-cohort gains, we included a school-level variable representing the change in
schools' mean scores on the KCCT from 2007 to 2010. In order to remain
consistent in our use of KCCT scores, we normalized this trend variable, but with respect
to the distribution of school means rather than students’ scores. We then subtracted the
2007 normalized school mean from the 2010 normalized mean. We also specified models
using each individual year of change (e.g., 2007 to 2008), but these yielded similar
results while introducing more noise, so our final models use a single change
variable. This change does not include the change between 2010 and 2011 because 2011
scores are used in the calculation of the outcome variable.
Finally, we included a dummy variable indicating whether the school was located
in a county identified as Appalachian by the Appalachian Regional Commission because
many of the communities identified as Appalachian are highly disadvantaged.
Analytic strategy
We used a difference-score approach in which the outcome is the difference
between a student’s normalized score on KPREP in the first year of administration (2012)
and her normalized score on the KCCT in the prior year and grade, that is:
$$Y_{ij} = S^{\text{KPREP}}_{ijtg} - S^{\text{KCCT}}_{ij(t-1)(g-1)} \qquad (1)$$
where S is the normalized score and i, j, t, and g index student, school, year, and grade,
respectively. Because the KPREP and KCCT scores are normalized and not linked, this
difference score contains no information about absolute changes in performance with the
introduction of the new test. Rather, it measures the change in a student's position in the
distribution of scores with the introduction of the new test. We assume that scores on the
new KPREP test are less affected by test preparation and therefore that on average,
students whose scores on KCCT were more inflated should have negative values on the
outcome, that is, they should on average fall in relative position.
We began with simple student-level ordinary least squares models for both the
KPREP and the SAT-10. These models are of the form:
$$Y_{i} = \beta_{0} + \mathbf{X}_{i}\boldsymbol{\beta}_{1} + e_{i}, \qquad (2)$$
where $i$ indexes students, $Y$ is the difference score, and $\mathbf{X}_{i}$ is a vector of student
characteristics. These models confound within- and between-school relationships and
ignore clustering of students within schools, but they provide a baseline description.
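A sketch of this baseline specification, fit with statsmodels on synthetic stand-in data; the dummy-variable names are illustrative, and the paper fits one such model per grade and test.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5_000
# Synthetic stand-in for the analytic sample.
df = pd.DataFrame({
    "poor": rng.integers(0, 2, n),
    "iep": rng.integers(0, 2, n),
    "asian": rng.integers(0, 2, n),
    "black": rng.integers(0, 2, n),
    "hispanic": rng.integers(0, 2, n),
})
df["y"] = -0.08 * df["poor"] + rng.normal(scale=1.0, size=n)  # toy outcome

# Equation (2): student-level OLS on the difference score.
ols = smf.ols("y ~ poor + iep + asian + black + hispanic", data=df).fit()
print(ols.params)
```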
We then estimated two-level mixed models (students nested in schools) not only
for accurate standard errors, but also because the structural relationships at the school
level are substantively important. Because score inflation depends on the behavior of
educators, we expect it to be a teacher- and school-level variable, and we expect
substantial predictable variations in inflation between schools. Our core models were the
following random-intercepts models:
$$Y_{ij} = \beta_{0j} + \mathbf{X}_{ij}\boldsymbol{\beta}_{1} + e_{ij}$$
$$\beta_{0j} = \gamma_{00} + \bar{\mathbf{X}}_{\cdot j}\boldsymbol{\gamma}_{01} + \mathbf{Z}_{j}\boldsymbol{\gamma}_{02} + \gamma_{03}W_{0j} + u_{j} \qquad (3)$$
where $\mathbf{X}_{ij}$ is a vector of student characteristics, $\bar{\mathbf{X}}_{\cdot j}$ is the vector of school means of these
student characteristics, $\mathbf{Z}_{j}$ is the vector of school mean change variables, and $W_{0j}$ is a dummy
variable indicating whether a school is located in a county identified as Appalachian.
Estimates of $\gamma_{03}$ were small and not significant, so we dropped the term $\gamma_{03}W_{0j}$ from the
models reported here. As noted, because the use of the individual-year school change variables
did not add useful information, we replaced $\mathbf{Z}_{j}$ with a single variable indicating school mean
change over the three-year period in the models reported below.
We grand-mean-centered the student-level outcome and predictor variables in our
models. We centered all test scores before calculating the gain scores. We centered all
school-level predictors, including historical school mean test scores, on the means of
school means.
Grand-mean centering at the student level yields parameter estimates for level-2
variables that are direct estimates of context effects (Raudenbush & Bryk, 2002). That is,
the parameter estimates for level-2 aggregate variables indicate the extent to which they
have an association with the outcome above and beyond the effects of the corresponding
level-1 variables.
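A sketch of the random-intercepts specification in equation (3) with grand-mean centering, using statsmodels' MixedLM on synthetic data. Variable names are ours; for brevity the sketch carries a single student-level predictor, its school proportion, and the school change variable, whereas the actual models include the full set of dummies.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_schools, n_per = 200, 40
school = np.repeat(np.arange(n_schools), n_per)

# Synthetic stand-ins for the analytic variables.
df = pd.DataFrame({
    "school": school,
    "poor": rng.integers(0, 2, n_schools * n_per),
    "change_07_10": np.repeat(rng.normal(size=n_schools), n_per),
})
df["y"] = rng.normal(size=len(df))

# Grand-mean-center the outcome and the student-level predictor.
df["y_c"] = df["y"] - df["y"].mean()
df["poor_c"] = df["poor"] - df["poor"].mean()

# School proportion poor, centered on the mean of school means so that
# its coefficient is a direct estimate of the context effect.
school_mean = df.groupby("school")["poor"].transform("mean")
df["prop_poor_c"] = school_mean - df.groupby("school")["poor"].mean().mean()

# Random intercept for school (equation 3), fixed effects at both levels.
fit = smf.mixedlm("y_c ~ poor_c + prop_poor_c + change_07_10",
                  data=df, groups=df["school"]).fit()
print(fit.summary())
```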
SAT-10 Replication
Our replication using the SAT-10 differed from the model in equation (3) only in
the construction of the outcome measure. Rather than subtracting KCCT scores from
KPREP scores in the following year and grade, we subtracted them from SAT-10 scores
in the following year and grade:
$$Y_{ij} = S^{\text{SAT10}}_{ijtg} - S^{\text{KCCT}}_{ij(t-1)(g-1)} \qquad (4)$$
Falsification Tests
The falsification tests differed from the primary analyses in equation (3) in that
they were lagged by one and two additional years. That is, the outcome in the first
falsification model was the difference between KCCT in the final and next-to-final years
in which it was administered:
$$Y_{ij} = S^{\text{KCCT}}_{ij(t-1)g} - S^{\text{KCCT}}_{ij(t-2)(g-1)} \qquad (5)$$
The second falsification test was lagged one additional year:
$$Y_{ij} = S^{\text{KCCT}}_{ij(t-2)g} - S^{\text{KCCT}}_{ij(t-3)(g-1)} \qquad (6)$$
Our motivation for including the second falsification test in equation (6) was concern that
performance on the 2010 KCCT might have been anomalous because educators knew
that it was the final year of that testing program.
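Constructing the lagged outcomes amounts to differencing successive KCCT administrations instead of subtracting the KCCT from a new test; a sketch, again with hypothetical wide-format columns of probit-normalized scores:

```python
import pandas as pd

# Hypothetical wide table: one row per student, probit-normalized KCCT
# scores for the grade the student occupied in each year.
df = pd.DataFrame({
    "kcct_2009": [0.12, -0.40, 1.21],
    "kcct_2010": [0.30, -0.22, 0.94],
    "kcct_2011": [0.18, -0.51, 1.10],
})

# Equation (5): final KCCT year minus the next-to-final year.
df["y_falsification_1"] = df["kcct_2011"] - df["kcct_2010"]

# Equation (6): lagged one additional year.
df["y_falsification_2"] = df["kcct_2010"] - df["kcct_2009"]

# Each falsification model then refits equation (3) with these outcomes.
```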
Sensitivity Tests
As noted, the scaling of the KCCT resulted in appreciable censoring, with a
substantial number of students scoring at the HOSS and a more modest number at the
LOSS. Because we have no information about the actual distribution of performance of
these students, we consider it most appropriate to remove them from our analyses.
However, we also fitted models in which we included those students in order to clarify
the effects of excluding them. These are presented in the Appendix.
Results
We first present the simple student-level OLS results for both the KPREP and
SAT-10 difference scores. We then discuss our principal findings based on two-level
models, including the SAT-10 and falsification models. Because the difference scores are
calculated by subtracting prior scores from recent scores, a negative sign indicates a
decline in relative performance.
Student-Level OLS Results
As expected, all eight of the OLS models (four grades by two tests) showed a
decline in performance by poor students (Table 1). These declines were all significant but
varied markedly in size, from 0.02 SD to 0.11 SD. Students with IEPs also declined in
seven of eight models, in many cases by more, with effects up to 0.19 SD. Asian students,
about whom we had no specified hypothesis, increased in all comparisons, in some cases
by roughly 0.3 SD. The coefficients for Black students were inconsistent.
Two-Level KPREP and SAT-10 Models
Many of our findings are consistent with our hypotheses, but there are a number
of exceptions, particularly in grade 8. Most of the findings of our main KPREP analyses
were mirrored in our SAT-10 replication. We discuss grades 5 through 7 first and then
turn to grade 8.
Within schools, the relative performance of poor students in grades 5, 6, and 7 fell
when the new KPREP was introduced. These decreases ranged from -0.06 to -0.09
standard deviations, and all were highly significant (Table 2). Our SAT-10 replication
yielded similar findings for these relationships (Table 3). These findings, however, do not
represent the total association between poverty and the change in performance. Rather,
they represent only the within-school relationship.
At the school level, an increase in the proportion of poor students predicted a
decline in relative performance. In the case of KPREP, this effect ranged from -0.18 in
grade 6 to -0.54 in grade 7 (Table 2). Again, we found similar results with our SAT-10
replication (Table 3). Recall that because of the grand-mean centering, these are estimates
of context effects. That is, even after controlling for the effects of the poverty status of
individual students, an increase in the proportion of poor students predicted a decline in
performance.
These school-level coefficients are substantial, but interpreting their magnitude
requires additional calculation. The student-level and school-level coefficients are not
directly comparable because they are not on the same scale. The student-level poverty
coefficient is the adjusted mean difference between poor and non-poor students. In
contrast, because the school poverty variable runs from 0 to 1.0, the school-level poverty
coefficient is the estimated difference between schools with 0% and 100% poor students.
To put the magnitude of the school-level coefficients into perspective, we can apply them
to the school-level distribution of the proportion poor. For example, we can compare the
estimated change in performance for schools at the 25th and 75th percentiles on the
distribution of proportion poor. In grade 7, where this coefficient was largest, the
difference in the proportion poor for the schools at the 25th and 75th percentiles of this
distribution was 0.24. Therefore, our KPREP model in this grade predicts that the school
at the 75th percentile would be 0.13 student-level standard deviations (.24 × .54) lower in
relative performance than the school at the 25th percentile, even after taking into account
the student-level effect of poverty.
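Written out, this interquartile comparison is simply the school-level coefficient from Table 2 scaled by the interquartile difference in the proportion poor:

$$\hat{\gamma}_{\text{prop. poor}} \times \left(\bar{P}_{75} - \bar{P}_{25}\right) = (-0.54)(0.24) \approx -0.13,$$

where $\bar{P}$ denotes a school's proportion of poor students.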
Relationships between change scores and both the Black/White dummy and the
school proportion Black were inconsistent in sign and often non-significant. In all three
grades, students with IEPs declined in relative performance on the KPREP, with
coefficients ranging from -.03 to -.22 (Table 2). These declines were mirrored in the SAT-
10 (Table 3). We found no significant school-level relationships between the proportion
of students with IEPs and changes in relative performance on either test.
While we had no explicit hypothesis about Asian students, we found a modest
student-level increase in relative performance for Asian students in grades 5-7 on the
KPREP, but only in grade 6 on the SAT-10. There were no school-level effects of
proportion Asian in any instance.
The relationships with schools' mean gains were inconsistent. In grades 5 and 6,
we found a modest negative relationship between mean gains on KCCT for the three
prior years and relative performance on the KPREP, and these effects were significant at
𝑝 < .001 (Table 2). Similar results were found for the SAT-10 (Table 3). However, no
relationships were found for either test in grade 7.¹ Moreover, these findings failed the
falsification test described below in several instances. That is, we found negative
associations in some of the falsification models, when ideally we should not. This
suggests that we should not have confidence in the estimates for this coefficient that we
obtained in our main models.
The results in grade 8 were weaker, and this was again true with both the KPREP
and the SAT-10. The negative student-level coefficient for poor students found in the
other grades appeared in grade 8 as well, although the coefficient for poverty was very
small (-0.03 with both tests; Tables 2 and 3). The grade-8 student-level coefficients for
Asian and IEP students were roughly comparable to those in the other grades. However,
the school-level coefficients for proportion poor were neither substantial nor significant
in grade 8.
Falsification tests
Our main models use a decline in relative performance with the introduction of
the new KPREP and SAT-10 as an indication of greater score inflation. To test whether the
estimates of our main models stem from the change in tests, we conducted falsification
tests in which we replicated our main models with data lagged by one and two years,
such that the outcome was the change between two successive administrations of the old
KCCT. The results of the falsification tests can be seen in Tables 4 and 5. We expect
that our results will fail to replicate in the falsification tests.

¹ While we do not have a full explanation of the lack of effects in grade 7, we found that part of the
difference between this grade and the other two appears to stem from multivariate outliers, specifically, a
much larger range of Mahalanobis distances. Eliminating observations with large distances made the grade 7
coefficient for the change variable somewhat more similar to those in the other grades.
For the most part, the falsification models failed to replicate the findings from our
main models. Unlike the estimates from the core two-level model presented in equation
(3), the student-level poor coefficients in the falsification models are near zero and
generally not significant. In contrast, the estimates from the core model are universally
negative, greater in magnitude, and highly significant (p<0.01). We found only two
exceptions. First, the grade 7 student-level poor coefficient is negative and significant in
the 2010-11 falsification model (p = 0.01). The grade 6 student-level poor coefficient is
also negative and significant in the 2009-10 falsification model (p = 0.05). Even in these
exceptional cases, however, the point estimates are still
attenuated compared to the core two-level model.
In all grades except for grade 7, the school-level poor coefficients were also
attenuated in the falsification models and sometimes reversed in sign (positive). For
example, the point estimates were significant and positive in grade 6 (p = 0.01) and grade
8 in 2009-10 (p = 0.05), indicating that schools with high concentrations of poor students
at times improved in relative rank between successive administrations of the KCCT. As
the exception, the grade 7 school-level poor
estimate was negative and highly significant in 2010-11 (p < 0.01). Once again, however,
the magnitude of this estimate is less than the estimate from the core model.
Discussion
Our findings are largely consistent with the hypothesis that test-specific
preparation for the previous KCCT high-stakes test was more intensive—or at least more
effective—with disadvantaged students. As predicted, within schools, poor students fell
in relative performance when the new high-stakes test was introduced. Similarly, students
with IEPs fell in relative performance. These findings, and the finding that Asian students
increased in relative performance, are consistent with the notion that students who are at
risk of scoring poorly receive more intensive test preparation.
In our view, the school-level relationship between the proportion of poor students
and the change in relative performance is particularly important. Our models provided
estimates of context or concentration effects: an increase in the proportion of poor
students predicted a decline in relative performance above and beyond the effects of
student-level poverty. While there is evidence that teachers can to some degree target test
preparation to specific groups of students within their classes (e.g., Booher-Jennings,
2005), it is reasonable to expect that much of the variation in test preparation lies between
teachers and schools. Our findings are consistent with this expectation and support the
suggestion by Koretz & Hamilton (2006) that it is important to examine aggregate
relationships in evaluating score inflation.
The SAT-10 replication and the falsification tests both substantially strengthen
these conclusions. If systematic changes in relative performance were a result of the
specific content sampling used to create the new high-stakes test, our primary results
would not be replicated when using the SAT-10 as the outcome measure, but our findings
were replicated consistently. Similarly, if our findings were a result of weaknesses of the
data or the specifications employed, one would expect our findings to be replicated when
we lagged our data by a single year, which removed the change in testing but left all else
unaltered. However, our results were not replicated in the falsification tests.
Our data are not sufficient to explore the reasons for the far weaker findings in the
grade 8 data. We found weaker findings in grade 8 using the SAT-10 as well, which rules
out an anomaly in the construction of the eighth-grade KPREP as an explanation. It may
be that both of the new tests are similar enough to the eighth-grade KCCT that the
difference scores do not capture useful variation. It is also plausible that preparation for
the KCCT was distributed differently in the eighth grade than in the other grades,
although we know of no other data to suggest this. Additional data would be needed to
explain these findings.
The modest sizes of some of our findings warrant discussion. Given our data, we
can only speculate, but there are a number of reasons why our findings may
underestimate these variations. First, as the present approach is novel, we do not know
what fraction of score inflation can be captured by changes in high-stakes tests. We
expect that this will vary depending on the context. New high-stakes tests are rarely
introduced without preparation, and the more extensive the advance preparation, the less
likely it is that this approach will capture a meaningful share of test preparation and
inflation. For example, we replicated our models with data from New York State, which
provided extensive preparation for the introduction of its Common Core tests in 2013,
such as the launching of extensive web-based preparation two years before the
introduction of the tests.² Our Kentucky findings did not replicate in the New York data.
Second, no single audit measure is likely to capture fully the variations in test preparation
activities and score inflation. Our data are not sufficient to evaluate these competing
explanations.
Our findings have two broad implications, one substantive and the other
methodological. First, this paper adds to the slowly growing literature indicating that test
preparation and score inflation are in some cases more severe for disadvantaged students.
² See www.engageny.org/.
This is of concern not only because of the negative implications about the quality of
education provided to these students, but also because the result of this differential bias is
an illusion of greater equity that can distort both policy and practice. Second, our findings
suggest that the introduction of new material into high-stakes tests can serve as one
method to help monitor for inappropriate test preparation and score inflation.
References
Booher-Jennings, J. (2005). Below the bubble: "Educational Triage" and the Texas
Accountability System. American Educational Research Journal, 42(2), 231-268.
Cimbricz, S. (2002). State-mandated testing and teachers’ beliefs and practice. Education
Policy Analysis Archives, 10(2). Retrieved from
http://epaa.asu.edu/epaa/v10n2.html
Diamond, J. B., & Spillane, J. P. (2004). High-stakes accountability in urban elementary
schools: Challenging or reproducing inequality? Teachers College Record, 106(6),
1145-1176.
Eisner, E. (2001). What does it mean to say a school is doing well? Phi Delta Kappan,
82(5), 367-372.
Firestone, W. A., Camilli, G., Yurecko, M., Monfils, L., & Mayrowetz, D. (2000). State
standards, socio-fiscal context and opportunity to learn in New Jersey. Education
Policy Analysis Archives, 8(35). Retrieved from http://olam.ed.asu.edu/epaa/v8n35.
Gillborn, D. & Youdell, D. (2000). Rationing education: policy, practice, reform, and
equity. Buckingham, UK: Open University Press.
Herman, J. L., & Golan, S. (1993). The effects of standardized testing on teaching and
schools. Educational Measurement: Issues and Practice, 12(4), 20-25, 41-42.
Ho, A. D., & Haertel, E. H. (2006). Metric-Free Measures of Test Score Trends and Gaps
with Policy-Relevant Examples (CSE Report 665). Los Angeles, CA: National Center
for Research on Evaluation, Standards, and Student Testing (CRESST), Center for the
Study of Evaluation, University of California, Los Angeles.
Jacob, B. (2007). Test-based accountability and student achievement: An investigation of
differential performance on NAEP and state assessments. Cambridge, MA:
National Bureau of Economic Research (Working Paper 12817).
Jacob, R. T., Stone, S., & Roderick, M. (2004). Ending social promotion: The response of
teachers and students. Chicago, IL: Consortium on Chicago School Research.
Retrieved March 29, 2011, from http://www.eric.ed.gov/PDFS/ED483823.pdf
Jones, M. G., Jones, B. D., & Hargrove, T. Y. (2003). The unintended consequences of
high-stakes testing. Lanham, MD: Rowman & Littlefield Publishers, Inc.
Johnson, M. (2014). Personal communication, August 27.
Kentucky Department of Education (2009). Commonwealth Accountability Testing System,
2008-09 Technical Report, Version 1.2. Frankfort, KY: Author.
Kentucky Department of Education (n.d.). Kentucky Performance Rating for Educational
Progress, 2011-12 Technical Manual, Version 1.1. Frankfort, KY: Author.
Klein, S. P., Hamilton, L.S., McCaffrey, D.F., and Stecher, B.M. (2000). What do test
scores in Texas tell us? Santa Monica, CA: RAND (Issue Paper IP-202). Last
accessed from http://www.rand.org/publications/IP/IP202/ on June 4, 2013.
Koretz, D., and Barron, S. I. (1998). The Validity of Gains on the Kentucky Instructional
Results Information System (KIRIS). MR-1014-EDU, Santa Monica: RAND.
Koretz, D., Barron, S., Mitchell, K., and Stecher, B. (1996). The Perceived Effects of the
Kentucky Instructional Results Information System (KIRIS). MR-792-PCT/FF,
Santa Monica: RAND.
Koretz, D., and Hamilton, L. S. (2006). Testing for accountability in K-12. In R. L.
Brennan (Ed.), Educational measurement (4th ed.), 531-578. Westport, CT:
American Council on Education/Praeger.
Koretz, D., Linn, R. L., Dunbar, S. B., & Shepard, L. A. (1991). The effects of high-
stakes testing: Preliminary evidence about generalization across tests, in R. L.
Linn (chair), The Effects of High Stakes Testing, symposium presented at the
annual meetings of the American Educational Research Association and the
National Council on Measurement in Education, Chicago, April.
Ladd, H. F., & Zelli, A. (2002). School-based accountability in North Carolina: The
responses of school principals. Educational Administration Quarterly, 38(4), 494-
529. doi: 10.1177/001316102237670
Lindquist, E. F. (1951). Preliminary considerations in objective test construction. In E. F.
Lindquist (Ed.), Educational Measurement (pp. 119-184). Washington, DC:
American Council on Education.
Lipman, P. (2002). Making the global city, making inequality: The political economy and
cultural politics of Chicago school policy. American Educational Research Journal,
39(2), 379-419.
Luna, C., & Turner, C. L. (2001). The impact of the MCAS: Teachers talk about high-
stakes testing. The English Journal, 91(1), 79-87.
McNeil, L. M. (2000). Contradictions of school reform: Educational costs of
standardized testing. New York, NY: Routledge.
McNeil, L. M. & Valenzuela, A. (2001). The harmful impact of the TAAS system of
testing in Texas: Beneath the accountability rhetoric. In M. Kornhaber & G. Orfield
(Eds.), Raising standards or raising barriers? Inequality and high-stakes testing in
public education (pp. 127-150). New York, NY: Century Foundation.
Neal, D. and Schanzenbach, D. (2010). Left behind by design: Proficiency counts and
test-based accountability. Review of Economics & Statistics, 92(2), 263–283.
Ng, H. L., & Koretz, D. (2013). Sensitivity of School-Performance Ratings to the Test
Used. A working paper of the Educational Accountability Project.
http://projects.iq.harvard.edu/files/eap/files/houston_paper_wpdraft_032513_1.pdf
Pedulla, J. J., Abrams, L. M., Madaus, G. F., Russell, M. K., Ramos, M. A., & Miao, J.
(2003). Perceived effects of state-mandated testing programs on teaching and
learning: Findings from a national survey of teachers. Boston, MA: National Board
on Educational Testing and Public Policy. Retrieved from
http://www.bc.edu/research/nbetpp/statements/nbr2.pdf
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and
data analysis methods, second edition. Thousand Oaks, CA: Sage.
Stecher, B. M., Epstein, S., Hamilton, L. S., Marsh, J. A., Robyn, A., McCombs, J. S.,
Russell, J., & Naftel, S. (2008). Pain and gain: Implementing NCLB in three states,
2004 – 2006. Santa Monica, CA: RAND. Retrieved from
http://www.rand.org/pubs/monographs/2008/RAND_MG784.pdf
Stecher, B. M., & Mitchell, K. J. (1995): Portfolio Driven Reform: Vermont Teachers’
Understanding of Mathematical Problem Solving (CSE Technical Report 400).
Los Angeles, CA: University of California Center for Research on Evaluation,
Standards, and Student Testing.
Taylor, G., Shepard, L., Kinner, F., & Rosenthal, J. (2002). A survey of teachers’
perspectives on high-stakes testing in Colorado: What gets taught, what gets lost
(CSE Technical Report 588). Los Angeles, CA: University of California. Retrieved
September 20, 2010, from http://eric.ed.gov/PDFS/ED475139.pdf
Thissen, D., & Orlando, M. (2001). Item response theory for items scored in two
categories. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 141-186).
Mahwah, N.J.: Lawrence Erlbaum Associates.
Urdan, T. C., & Paris, S. G. (1994). Teachers’ perceptions of standardized achievement
tests. Educational Policy, 8(2), 137-157.
Figures
Figure 1. Distributions of KCCT and KPREP scores for calculation of the difference
score for students in grade 6 in 2012, with the x-axis scale bounded by LOSS and HOSS.
Tables
Table 1
OLS Regression results for 2012 KPREP minus 2011 KCCT analysis and for 2012 SAT-10 minus 2011 KCCT analysis
               Grade 4-5               Grade 5-6               Grade 6-7               Grade 7-8
           KPREP-KCCT SAT10-KCCT  KPREP-KCCT SAT10-KCCT  KPREP-KCCT SAT10-KCCT  KPREP-KCCT SAT10-KCCT
Poor        -0.082***  -0.077***   -0.069***  -0.046***   -0.108***  -0.090***   -0.021**   -0.019**
            (0.007)    (0.007)     (0.007)    (0.008)     (0.007)    (0.007)     (0.007)    (0.007)
IEP         -0.037***  -0.092***   -0.139***  -0.193***    0.049***  -0.121***   -0.115***  -0.151***
            (0.011)    (0.011)     (0.011)    (0.011)     (0.010)    (0.011)     (0.011)    (0.011)
Asian        0.302***   0.136***    0.284***   0.185***    0.234***   0.106***    0.284***   0.163***
            (0.031)    (0.031)     (0.032)    (0.033)     (0.030)    (0.032)     (0.031)    (0.031)
Black        0.067***  -0.015      -0.013      0.016       0.089***   0.104***    0.151***   0.152***
            (0.012)    (0.012)     (0.012)    (0.012)     (0.011)    (0.011)     (0.011)    (0.011)
Hispanic     0.045*    -0.006      -0.013     -0.043*      0.063***   0.033       0.100***   0.053**
            (0.018)    (0.018)     (0.019)    (0.020)     (0.018)    (0.019)     (0.018)    (0.018)
Intercept    0.000      0.000       0.000      0.000       0.000      0.000       0.000      0.000
            (0.004)    (0.004)     (0.004)    (0.004)     (0.003)    (0.003)     (0.003)    (0.003)
N           46,821     46,240      44,943     44,452      44,767     44,321      44,946     44,405
Note. *p<0.05 **p<0.01 ***p<0.001
Table 2
Regression results for 2012 KPREP minus 2011 KCCT analysis
                           Grade 4-5    Grade 5-6    Grade 6-7    Grade 7-8
Student-Level Variables
  Poor                     -0.057***    -0.086***    -0.075***    -0.030***
                           (0.008)      (0.008)      (0.007)      (0.007)
  IEP                      -0.080***    -0.220***    -0.031**     -0.182***
                           (0.010)      (0.011)      (0.010)      (0.011)
  Asian                     0.187***     0.245***     0.144***     0.166***
                           (0.037)      (0.035)      (0.032)      (0.033)
  Black                     0.012        0.022       -0.037**      0.040***
                           (0.013)      (0.013)      (0.012)      (0.011)
  Hispanic                  0.023        0.025        0.005        0.055**
                           (0.018)      (0.019)      (0.018)      (0.017)
School-Level Variables
  Proportion Poor          -0.310***    -0.182*      -0.544***    -0.055
                           (0.068)      (0.074)      (0.078)      (0.088)
  Proportion IEP           -0.032       -0.147       -0.437       -0.011
                           (0.195)      (0.200)      (0.237)      (0.265)
  Proportion Asian         -0.265        0.188       -0.411        1.825*
                           (0.446)      (0.534)      (0.694)      (0.832)
  Proportion Black         -0.010       -0.273**      0.377**      0.168
                           (0.087)      (0.090)      (0.119)      (0.131)
  Proportion Hispanic       0.310       -0.133        0.176       -0.050
                           (0.195)      (0.230)      (0.340)      (0.385)
  School Difference 07-10  -0.052***    -0.049***    -0.016       -0.020
                           (0.012)      (0.012)      (0.011)      (0.014)
Intercept                   0.035**      0.017       -0.039**     -0.005
                           (0.011)      (0.011)      (0.012)      (0.012)
N                          37,139       37,320       39,707       39,862
Note. *p<0.05 **p<0.01 ***p<0.001
Table 3
Regression results for 2012 SAT-10 minus 2011 KCCT analysis

                           Grade 4-5    Grade 5-6    Grade 6-7    Grade 7-8
Student-Level Variables
  Poor                     -0.035***    -0.064***    -0.065***    -0.027***
                           (0.008)      (0.008)      (0.007)      (0.007)
  IEP                      -0.138***    -0.275***    -0.196***    -0.219***
                           (0.010)      (0.011)      (0.011)      (0.011)
  Asian                     0.037        0.166***     0.021        0.073*
                           (0.038)      (0.038)      (0.034)      (0.035)
  Black                    -0.083***     0.034*      -0.020        0.016
                           (0.013)      (0.014)      (0.012)      (0.012)
  Hispanic                 -0.036       -0.030       -0.026       -0.009
                           (0.018)      (0.020)      (0.019)      (0.018)
School-Level Variables
  Proportion Poor          -0.455***    -0.227***    -0.496***    -0.055
                           (0.059)      (0.068)      (0.074)      (0.069)
  Proportion IEP            0.120       -0.024       -0.219       -0.119
                           (0.171)      (0.183)      (0.228)      (0.211)
  Proportion Asian          0.077        0.298       -0.486        1.283*
                           (0.388)      (0.484)      (0.663)      (0.642)
  Proportion Black         -0.008       -0.246**      0.397***     0.261**
                           (0.076)      (0.082)      (0.113)      (0.101)
  Proportion Hispanic       0.328        0.023        0.326       -0.039
                           (0.170)      (0.209)      (0.324)      (0.298)
  School Difference 07-10  -0.073***    -0.048***    -0.011       -0.026*
                           (0.010)      (0.011)      (0.010)      (0.011)
Intercept                   0.039***     0.023*      -0.031**     -0.006
                           (0.010)      (0.010)      (0.012)      (0.010)
N                          36,682       37,176       39,386       39,550
Note. *p<0.05 **p<0.01 ***p<0.001
Table 4
Regression results for 2011 KCCT minus 2010 KCCT analysis
                           Grade 4-5    Grade 5-6    Grade 6-7    Grade 7-8
Student-Level Variables
  Poor                      0.003       -0.011       -0.020**      0.002
                           (0.008)      (0.007)      (0.007)      (0.006)
  IEP                       0.070***    -0.023*      -0.014       -0.046***
                           (0.010)      (0.011)      (0.010)      (0.010)
  Asian                     0.066        0.131***     0.170***     0.042
                           (0.036)      (0.037)      (0.035)      (0.031)
  Black                     0.029*       0.027*       0.015       -0.022*
                           (0.012)      (0.012)      (0.011)      (0.011)
  Hispanic                  0.051**      0.037*       0.043*       0.021
                           (0.018)      (0.018)      (0.017)      (0.017)
School-Level Variables
  Proportion Poor          -0.065        0.039       -0.298***     0.113
                           (0.058)      (0.058)      (0.076)      (0.067)
  Proportion IEP            0.071        0.178        0.534*      -0.513*
                           (0.152)      (0.169)      (0.209)      (0.212)
  Proportion Asian         -0.029        0.096       -0.446        0.790
                           (0.447)      (0.463)      (0.710)      (0.758)
  Proportion Black          0.236**     -0.316***     0.126        0.097
                           (0.073)      (0.076)      (0.119)      (0.102)
  Proportion Hispanic      -0.012       -0.052        0.100       -0.239
                           (0.186)      (0.198)      (0.350)      (0.311)
  School Difference 07-10  -0.045***    -0.037***    -0.033**     -0.034*
                           (0.010)      (0.011)      (0.012)      (0.013)
Intercept                   0.022*       0.020*      -0.015        0.006
                           (0.009)      (0.010)      (0.011)      (0.010)
N                          33,433       34,560       36,273       36,142
Note. *p<0.05 **p<0.01 ***p<0.001
Table 5
Regression results for 2010 KCCT minus 2009 KCCT analysis
                           Grade 4-5    Grade 5-6    Grade 6-7    Grade 7-8
Student-Level Variables
  Poor                     -0.001       -0.016*      -0.020**     -0.009
                           (0.007)      (0.007)      (0.007)      (0.006)
  IEP                      -0.005       -0.049***     0.068***     0.057***
                           (0.010)      (0.011)      (0.010)      (0.010)
  Asian                     0.083*       0.065        0.115***     0.041
                           (0.038)      (0.039)      (0.034)      (0.032)
  Black                    -0.027*       0.009        0.012       -0.019
                           (0.012)      (0.012)      (0.011)      (0.011)
  Hispanic                  0.023       -0.032        0.051**      0.008
                           (0.019)      (0.019)      (0.018)      (0.018)
School-Level Variables
  Proportion Poor          -0.038        0.157**     -0.014        0.142*
                           (0.055)      (0.055)      (0.070)      (0.061)
  Proportion IEP            0.200       -0.117        0.133       -0.307
                           (0.154)      (0.149)      (0.221)      (0.200)
  Proportion Asian         -0.111       -0.296       -0.757        1.028
                           (0.465)      (0.480)      (0.877)      (0.838)
  Proportion Black         -0.119       -0.584***    -0.092       -0.132
                           (0.073)      (0.075)      (0.124)      (0.091)
  Proportion Hispanic       0.117       -0.239       -0.334        0.301
                           (0.194)      (0.196)      (0.383)      (0.301)
  School Difference 07-10  -0.001       -0.027*      -0.026        0.012
                           (0.012)      (0.012)      (0.015)      (0.014)
Intercept                   0.011        0.015       -0.005        0.016
                           (0.010)      (0.009)      (0.012)      (0.009)
N                          32,899       33,488       34,862       36,546
Note. *p<0.05 **p<0.01 ***p<0.001
Appendix
Sensitivity Test: Including HOSS and LOSS
The sizable number of students who obtained the highest and lowest obtainable
scores (HOSS/LOSS) on the KCCT (see Figure 1) indicates a substantial amount of
censoring, particularly at the high end of the distribution. Because censoring is likely to
artificially attenuate relationships between the outcome and predictors, we excluded
students with censored scores. To test the impact of this decision, we replicated our
analyses with the students who had censored scores included. As expected, we found that
including these students weakened our findings. Importantly, we found that the student-
level poor estimates were less negative (smaller in absolute value) when students who
obtained the HOSS or LOSS were included. A comparison between estimates derived
from the KPREP and NRT (SAT-10) models including and excluding students at the HOSS or
LOSS can be seen in Tables A1 and A2, respectively.
Table A1
Regression results for 2012 KPREP minus 2011 KCCT analysis with students performing at the HOSS or LOSS included

                           Grade 4-5    Grade 5-6    Grade 6-7    Grade 7-8
Student-Level Variables
  Poor                     -0.028***    -0.043***    -0.039***    -0.004
                           (0.008)      (0.007)      (0.007)      (0.007)
  IEP                      -0.032**     -0.124***     0.054***    -0.105***
                           (0.010)      (0.010)      (0.010)      (0.010)
  Asian                     0.222***     0.262***     0.148***     0.171***
                           (0.029)      (0.030)      (0.029)      (0.030)
  Black                     0.044***     0.053***    -0.007        0.074***
                           (0.013)      (0.013)      (0.012)      (0.012)
  Hispanic                  0.010        0.025        0.007        0.066***
                           (0.017)      (0.018)      (0.018)      (0.018)
School-Level Variables
  Proportion Poor          -0.311***    -0.137       -0.536***    -0.019
                           (0.073)      (0.077)      (0.079)      (0.089)
  Proportion IEP           -0.126       -0.189       -0.547*      -0.146
                           (0.210)      (0.207)      (0.240)      (0.269)
  Proportion Asian         -0.136        0.432       -0.371        1.734*
                           (0.480)      (0.552)      (0.702)      (0.845)
  Proportion Black          0.033       -0.236*       0.442***     0.222
                           (0.094)      (0.093)      (0.120)      (0.133)
  Proportion Hispanic       0.400       -0.173        0.315       -0.041
                           (0.211)      (0.293)      (0.344)      (0.391)
  School Difference 07-10  -0.062***    -0.058***    -0.020       -0.026
                           (0.012)      (0.012)      (0.011)      (0.014)
Intercept                  -0.021       -0.008       -0.053***    -0.016
                           (0.012)      (0.012)      (0.012)      (0.013)
N                          46,521       44,527       44,442       44,493
Note. *p<0.05 **p<0.01 ***p<0.001
Table A2
Regression results for 2012 NRT (SAT-10) minus 2011 KCCT analysis with students performing at the HOSS or LOSS included

                           Grade 4-5    Grade 5-6    Grade 6-7    Grade 7-8
Student-Level Variables
  Poor                      0.000       -0.014       -0.030***     0.003
                           (0.008)      (0.008)      (0.007)      (0.007)
  IEP                      -0.086***    -0.184***    -0.117***    -0.142***
                           (0.010)      (0.011)      (0.010)      (0.010)
  Asian                     0.065*       0.163***     0.023        0.062*
                           (0.030)      (0.032)      (0.031)      (0.031)
  Black                    -0.045***     0.066***     0.005        0.050***
                           (0.013)      (0.013)      (0.012)      (0.012)
  Hispanic                 -0.046*      -0.026       -0.033        0.000
                           (0.018)      (0.020)      (0.019)      (0.018)
School-Level Variables
  Proportion Poor          -0.443***    -0.177*      -0.494***    -0.042
                           (0.065)      (0.071)      (0.077)      (0.071)
  Proportion IEP            0.012       -0.090       -0.386       -0.226
                           (0.187)      (0.191)      (0.235)      (0.219)
  Proportion Asian          0.083        0.335       -0.537        1.221
                           (0.423)      (0.506)      (0.684)      (0.666)
  Proportion Black          0.038       -0.196*       0.484***     0.309**
                           (0.084)      (0.086)      (0.117)      (0.105)
  Proportion Hispanic       0.451*      -0.033        0.471       -0.014
                           (0.187)      (0.220)      (0.336)      (0.309)
  School Difference 07-10  -0.081***    -0.056***    -0.016       -0.035**
                           (0.011)      (0.011)      (0.011)      (0.011)
Intercept                  -0.024*      -0.013       -0.494***    -0.015
                           (0.011)      (0.011)      (0.077)      (0.010)
N                          46,155       44,367       44,246       44,329
Note. *p<0.05 **p<0.01 ***p<0.001