Using the Introduction of a New Test to Investigate the Distribution of Score Inflation
A working paper of the Education Accountability Project at the Harvard Graduate School of Education
http://projects.iq.harvard.edu/eap
Daniel Koretz, Marcus Waldman, Carol Yu, Meredith Langi, and Aaron Orzech
Harvard Graduate School of Education
December 8, 2014
© 2014 by the authors. All rights reserved.
The research reported here was supported by the Institute of Education Sciences, U.S.
Department of Education through Grant R305A110420 to the President and Fellows of Harvard
College. The authors thank the Kentucky Department of Education and the Kentucky Center for
Education and Workforce Statistics for providing the data used in this study. They particularly
thank Charles McGrew and Barrett Ross of KCEWS, who devoted considerable time to making
this work possible. The opinions expressed are solely those of the authors and do not represent
views of KDE, KCEWS, the Institute, the U.S. Department of Education, or the staff of any of
these organizations.
Abstract
High-stakes testing can induce undesirable forms of test preparation and score
inflation. This study uses changes in relative performance when a test is replaced to
investigate the distribution of score inflation. Using new high-stakes and norm-referenced
tests first administered in Kentucky in 2012, we investigated the relationship between
changes in relative performance and student and school characteristics. The performance
of poor students declined relative to others in their schools, and the mean scores of
schools with high concentrations of poor students declined more than those of schools
with lower concentrations, above and beyond the effects of student-level poverty.
Students with disabilities declined in performance, while Asian students improved. These
findings are all consistent with the hypothesis that score inflation tends to be more severe
among groups with low average performance.
High-stakes testing in various forms has been a cornerstone of U.S. education
policy for decades. Numerous studies have shown that as a result, scores can become
inflated—that is, scores can increase more than improvements in achievement warrant—
and that this bias can be very large (e.g., Jacob, 2007; Koretz & Barron, 1998). A smaller
number of studies have found that on average, both test preparation and score inflation
affect disadvantaged students more than others (e.g., Herman & Golan, 1993; Klein et
al., 2000).
However, the literature investigating the distribution of score inflation remains
limited. Much of it is highly aggregated, e.g., comparing inflation for subgroups at the
level of states (e.g., Klein et al., 2000). Although it is reasonable to expect the processes
that create score inflation to differ within and between schools (Koretz &
Hamilton, 2006), few studies have applied a multi-level framework to the evaluation of
inflation.
Using statewide data from Kentucky, this study takes a novel approach to exploring
the distribution of score inflation: we examine the distribution of changes in performance
when a long-standing high-stakes test was replaced by a new high-stakes test aligned with
the Common Core State Standards. We expect substantial variation in these difference
scores because of both measurement error and differences in test content. However,
systematic variations in the difference scores may signal variations in inappropriate
preparation for the old test. We investigate whether systematic variation in difference
scores is consistent with the literature showing greater inappropriate test preparation and
inflation among disadvantaged students. We fit two-level models to estimate the
distribution of inflation both within and between schools. To guard against artifacts
stemming from the specifics of the new test, we replicate these analyses using as an
outcome a second, low-stakes test first administered in the same year.
Background
The problem of score inflation has been well documented over the past quarter
century. Score inflation can arise even under low-stakes conditions because even then,
teachers may focus on the specific content of the test rather than on the broader domain
from which it samples and that it is intended to represent (Lindquist, 1951). However, the
empirical literature evaluating inflation arose decades later in response to the increasing
importance of high-stakes testing, which increases incentives to focus on the specific
content of the test. Most often, potential inflation has been evaluated by comparing trends
in scores on a high-stakes test to trends on another, lower-stakes “audit” test measuring a
similar domain. The logic of these studies is that for inferences based on the high-stakes
test to be valid, performance must generalize from that test to the largely latent domain,
and if performance does generalize to the domain, it should show a reasonable degree of
generalization to performance on other tests designed to support similar inferences. Most
often, the audit test has been the National Assessment of Educational Progress (NAEP;
e.g., Klein et al., 2000; Jacob, 2007; Ho & Haertel, 2006; Koretz & Barron, 1998). NAEP
has several advantages: it is widely accepted as a high-quality test, and there are few
incentives to prepare students specifically for it. However, numerous other tests have
been used as audit measures, including commercial norm-referenced tests (Koretz, Linn,
Dunbar, & Shepard, 1991; Ng & Koretz, 2013) and college-admissions tests (Koretz &
Barron, 1998).
Although no studies to date directly link specific behaviors of individual
educators to score inflation, a considerable body of research has documented responses to
testing that could produce score inflation. Examples include narrowing of instruction,
adapting instruction to mirror the format or style of test items, overemphasizing scoring
rubrics used on the test, and teaching test-taking tricks (e.g., Koretz, Barron, Mitchell,
& Stecher, 1996; Pedulla et al., 2003; Stecher & Mitchell, 1995). In addition, some
studies have documented behaviors that would not bias individuals’ scores but can bias
aggregate scores. One is disproportionately focusing resources on students thought to be
near the “proficient” cut score—so-called “bubble students”—because in current
standards-based accountability systems, it is the binary classification of students as
proficient or not that has the most severe consequences (Booher-Jennings, 2005; Gillborn
& Youdell, 2000; Neal & Schanzenbach, 2010; Stecher et al., 2008). While this may
produce valid gains for these students, it creates a biased view of aggregate improvement.
Numerous studies have found that undesirable forms of test preparation are more
severe among disadvantaged students and in schools serving a high percentage of
disadvantaged students. These schools often have a stronger emphasis on assigning drills
of test-style items and teaching test-taking strategies (Cimbricz, 2002; Diamond &
Spillane, 2004; Eisner, 2001; Firestone, Camilli, Yurecko, Monfils, & Mayrowetz, 2000;
Herman & Golan, 1993; Jacob, Stone, & Roderick, 2004; Jones, Jones, & Hargrove,
2003; Ladd & Zelli, 2002; Lipman, 2002; Luna & Turner, 2001; McNeil, 2000; McNeil
& Valenzuela, 2001; Taylor et al., 2002; Urdan & Paris, 1994). Evidence about the
distribution of score inflation is more limited but consistent with this. Several reports
show that the gains made by low-income and minority students relative to white students
on state tests are not matched on audit tests (Klein et al., 2000; Jacob, 2007; Ho &
Haertel, 2006).
In a study that compared performance on several different tests in Texas, Klein et
al. (2000) noticed an anomalous pattern in the between-school relationships between
high-stakes test scores and other variables. In the case of other tests, relationships
between scores on different tests and between scores and SES were stronger at the school
level than at the student level, which is the anticipated effect of aggregation. In the case
of high-stakes test scores, however, the reverse was true: in almost every case, the
relationship between high-stakes scores and other variables was weaker at the school
level than at the student level. These findings, unlike the main findings of their study,
were based on a sample of only 20 schools in a single district, so the authors were
hesitant to draw conclusions from them. However, Koretz & Hamilton (2006) noted that
these findings are consistent with greater test preparation and score inflation in low-
scoring schools. Therefore, they recommended examining differences between within-
and between-school relationships as a part of investigating potential score inflation. To
our knowledge, however, few studies have explored this.
The findings to date suggesting greater test preparation and score inflation among
disadvantaged students, although limited in number, are not surprising given the specifics
of schooling and accountability in the U.S. Both No Child Left Behind and many of the
state accountability programs that preceded it require larger gains—often, much larger
gains—by low-scoring groups of students. In addition, many low-scoring schools face
additional difficulties in raising scores, such as high transience and relatively
inexperienced staff.
We therefore undertook the present study with the following hypotheses:
• Within schools, low-scoring groups of students will on average have more
inflated scores on the old high-stakes test and will therefore decline in relative
performance on the two new tests.
• Similarly, schools with high concentrations of students from low-scoring groups
will decline in relative performance, above and beyond the decline predicted by
the within-school student-level relationships.
In keeping with the earlier studies noted above, we hypothesized that both poor
students and students in high-poverty schools would decline in relative performance with
the introduction of the new test. Because of Kentucky’s demographics, we did not have a
strong hypothesis about the relationships between race/ethnicity and changes in
performance. Kentucky’s Hispanic student population is small, and its African-American
population is small and highly concentrated: 50 percent are in Jefferson County
(Louisville), and another 13% are in Fayette County (Lexington). This leaves the race
effect, if there is one, substantially conflated with district effects. Nonetheless, we
included dummy variables for race/ethnicity. Although we are not aware of research
evaluating differential test preparation or score inflation affecting students with
disabilities, we expect that they will be affected more, for the same reasons that
other low-scoring groups have been. Some students with disabilities are exempted from
the general-education testing program and hence would not create similar incentives for
teachers, but those students do not appear in our data.
In addition, in the general case, one might expect that schools that had shown
particularly rapid cohort-to-cohort gains on the previous high-stakes test would show
declines in relative performance. However, it is not clear whether one should expect this
in the present context. The earlier testing program was in place for 12 years before the
change to the new program used as an outcome in this study. While one might expect to
find relationships between cohort-to-cohort gains and inflation in the early years of
testing programs, test preparation activities might have been so well diffused through the
state at the time of our data that variations in gains on the earlier test might no longer
have predictive power. Nonetheless, we evaluated this possibility.
This study evaluates these hypotheses by examining changes in performance
when one high-stakes test is replaced by another. This approach, not previously used in
this literature, shares two assumptions with the more conventional approach. We assume
that the audit test used for comparison—in this case, the new test—has not (yet) been the
focus of intensive test preparation, and that it is sufficiently similar to the old test in terms
of the intended inference to serve as an audit.
This approach faces two threats. The first is that teachers will have already
engaged in substantial preparation focused on the new test. If preparation for the new test
is extensive enough and is distributed similarly to preparation for the old test, the result
will be a Type II error, either failing to identify inflation or underestimating it. This
possibility therefore does not threaten positive findings from our analysis.
The second threat is more serious: the possibility that the new and old tests are
designed to sample differently enough from the domain to make the adequacy of the new
test as an audit measure questionable. This may be a particularly important threat
currently because of the introduction of tests aligned with the Common Core State
Standards, including the primary test used as an outcome in this study. If the new and old
tests are sufficiently different in their sampling, differences in alignment between
curriculum and the two tests could create systematic variations in the difference scores
we examine even in the absence of any behavioral responses to testing—and hence
without any score inflation. Moreover, it is possible that such differences in alignment are
consistent with our hypotheses. For example, suppose that the new test includes content
not included in the old test and that this new content was emphasized more in advantaged
schools than in disadvantaged schools before the introduction of the new test. This is
plausible when the new content is more advanced. This would create systematic
variations in the difference scores similar to that produced by greater inflation in
disadvantaged schools.
We took advantage of a unique aspect of Kentucky’s testing program to address
this limitation. In the spring of 2012, the Kentucky Department of Education first
administered two new tests. One, the Kentucky Performance Rating for Educational
Progress (KPREP) test, is the state's new high-stakes test, aligned with the Common Core
State Standards. This replaced the Kentucky Core Content Test (KCCT), the high-stakes test
last administered in 2011. The second was a norm-referenced test, the Stanford 10 (SAT-
10). This allowed us to replicate our analyses of KPREP with identical analysis of the
SAT-10. Similar findings with the SAT-10 would strengthen the inference that variations
in the difference score reflect preparation for the KCCT rather than specific attributes of
the KPREP.
Data
Our data consist of student-level test scores for students in grades 3-8 from spring
2007 through spring 2012. These data include KPREP scores for 2012 and KCCT scores
for all earlier years.
For this analysis, we limited our sample to students who were in grades 5-8 in
2012 in schools classified as A1 by the Kentucky Department of Education. These are
public schools other than preschools, special education schools, or other alternative
programs. We also limited our analysis to schools with more than 10 students. Because
we used difference scores as our outcome, we included only the 71% of students in these
schools who had scores in the appropriate grades in both 2012 and 2011. We removed
26,244 students across all grades who were assigned the highest or lowest obtainable
scale scores (HOSS/LOSS) on either test. Reasons for this exclusion are discussed in
detail below. (A sensitivity test examining the impact of this exclusion is included in the
Appendix.) Finally, we excluded students whose recorded scale scores were outside the
range for the relevant grade, as these students were either out-of-grade or given incorrect
scores. Our final count across grades 5-8 was 181,477 students, which was approximately
63% of our original data.
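As a sketch, the sample restrictions above can be expressed as a sequence of filters. All column names and values below are hypothetical stand-ins, not the names in the actual KDE/KCEWS extracts, and the analogous check that scores fall within the valid range for each grade is omitted for brevity.

```python
import pandas as pd

# Toy rows standing in for the statewide student-level file.
df = pd.DataFrame({
    "school_class":      ["A1", "A1", "A5", "A1"],
    "school_n_students": [250, 250, 40, 8],
    "kcct_2011":         [480.0, 575.0, 500.0, 510.0],
    "kprep_2012":        [210.0, None, 215.0, 220.0],
    "kcct_loss":         [325.0, 325.0, 325.0, 325.0],  # hypothetical bound
    "kcct_hoss":         [575.0, 575.0, 575.0, 575.0],  # hypothetical bound
})

# A1 public schools with more than 10 students.
df = df[(df["school_class"] == "A1") & (df["school_n_students"] > 10)]

# Students with scores in the appropriate grades in both 2011 and 2012.
df = df.dropna(subset=["kcct_2011", "kprep_2012"])

# Remove students at the highest or lowest obtainable scale score (HOSS/LOSS).
df = df[(df["kcct_2011"] > df["kcct_loss"]) & (df["kcct_2011"] < df["kcct_hoss"])]
```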
The final analytic sample was very similar in terms of demographics to the
original data. In both samples, approximately 82% of the students were white, 11%
African-American, 4% Hispanic, and 1% Asian. As a result of dropping the students who
scored HOSS or LOSS, there is a slightly higher percentage of poor students (60%) in our
analytic sample compared to 57% in the original data.
Methods
Outcome variable
Our outcome variable is the difference between a student’s score on KPREP in
2012 and his or her score on the KCCT in the prior grade in 2011.
These two tests were scaled differently, and the shapes of the distributions of
scores were markedly different. It was therefore not reasonable to use a simple difference
score, and simple standardization would have retained differences between the
distributions other than means and standard deviations. We therefore needed to transform
the scale scores, and the distribution of KCCT scores further necessitated a sensitivity
analysis to address censoring.
The KCCT was scaled using a combination of the three-parameter logistic and
generalized partial credit IRT models, which does not provide a one-to-one mapping of
scale scores to raw scores. To
obtain such a mapping, raw scores were mapped to the test characteristic curve (TCC;
Kentucky Department of Education, 2009). Mapping scale scores to the TCC often
stretches the tails of the distribution considerably more than some other scaling
approaches (e.g., Thissen & Orlando, 2001). In addition, the estimation method used does
not directly estimate scale scores for students with perfect scores, either 0 or 100 percent
of possible credit. This was handled by setting the lowest and highest obtainable scale
scores, LOSS and HOSS, a priori and assigning these to students with perfect scores
(Kentucky Department of Education, 2009).
In contrast, the KPREP is scaled using a Rasch model (Kentucky Department of
Education, n.d.), which provides scale scores that are mapped one-to-one with raw
scores. Rasch scaling will often stretch the tails of the distribution less than mapping to
the TCC. Rasch scaling, like the estimation method used with the KCCT, does not
directly estimate scale scores for students with perfect scores. In the case of KPREP, this
was handled in two stages. First, a small adjustment (±0.25) was made to perfect scores
to permit direct estimation of the underlying scale score. Second, LOSS and HOSS of
100 and 300 were imposed on the reporting scale (Johnson, 2014).
The result of these decisions was very different distributions of scores on the two
tests in the two years we analyzed. As expected, the KCCT distribution shows more
stretching of the tails. This is shown in Figure 1, which displays adjacent grades (grade 5
for the KCCT and grade 6 for the KPREP) because adjacent grades are used in the
calculation of our outcome measure. In addition, the KCCT distribution shows modest
left-censoring and substantial right-censoring, which could reflect both the location of
HOSS and LOSS and raw-score censoring. In the grade 5 KCCT, 13% of students were
in the HOSS spike and 4% in the LOSS spike. In contrast, the KPREP distributions were
free of censoring. As the figure suggests, with HOSS and LOSS included, the KCCT is
more skewed (skewness = -.38) than is the KPREP (skewness = .29).
We took two steps to address these differences in scale. First, we normalized
(probit-transformed) scores on both tests by calculating percentile ranks and mapping
these to the cumulative normal distribution. Second, because we have no information on
the appropriate distribution of performance for students assigned HOSS or LOSS, we
excluded those students from our main analyses. Our outcome variable was then the
simple difference in these probit-transformed scores.
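A minimal sketch of this transformation and of the resulting outcome variable follows, with toy data standing in for the real score columns. The half-rank offset is one standard way to keep percentile ranks off the 0 and 1 boundaries, where the normal quantile function diverges; the paper does not specify how boundary ranks were handled, so this detail is an assumption.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

def probit_normalize(scores: pd.Series) -> pd.Series:
    """Map scores to percentile ranks, then to standard-normal quantiles."""
    n = scores.notna().sum()
    pct = (scores.rank(method="average") - 0.5) / n  # midpoint percentile ranks
    return pd.Series(norm.ppf(pct), index=scores.index)

# Toy score vectors on deliberately different scales.
rng = np.random.default_rng(0)
kcct_2011 = pd.Series(rng.normal(60, 15, size=1_000))
kprep_2012 = pd.Series(rng.normal(200, 25, size=1_000))

# Outcome: the simple difference of the probit-transformed scores.
y = probit_normalize(kprep_2012) - probit_normalize(kcct_2011)
```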
Predictors
Our predictor variables were a number of student characteristics, school-level
aggregates of these student-level variables, and historical data on school mean test scores.
We included three race/ethnicity dummy variables, for Black, Hispanic, and
Asian. The reference group, coded zero, included Whites and others. A dummy variable for
poverty was defined by whether the student participated in the free or reduced-price
lunch program. The final dummy indicated whether a student had an individualized
education program (IEP) under the Individuals with Disabilities Education Act.
Because most students with recognized disabilities have IEPs, this is a reasonable
proxy for the presence of a disability.
The school-level aggregate variables were the means of these variables, that is,
the proportions of students with the dummy variables set to 1. However, these aggregate
variables were calculated from all students enrolled in each school, without excluding the
students noted above, because we expect educators’ responses to testing to be influenced
by the characteristics of all the students present in the school.
In addition to these aggregates, to evaluate the possible relevance of earlier
cohort-to-cohort gains, we included a school-level variable representing the change in
schools' mean scores on the KCCT from 2007 to 2010. In order to remain
consistent in our use of KCCT scores, we normalized this trend variable, but with respect
to the distribution of school means rather than students’ scores. We then subtracted the
2007 normalized school mean from the 2010 normalized mean. We also specified models
using each individual year of change (e.g., 2007 to 2008), but these yielded similar
results while introducing more noise, so our final models use a single change
variable. This change does not include the change between 2010 and 2011 because 2011
scores are used in the calculation of the outcome variable.
Finally, we included a dummy variable indicating whether the school was located
in a county identified as Appalachian by the Appalachian Regional Commission because
many of the communities identified as Appalachian are highly disadvantaged.
Analytic strategy
We used a difference-score approach in which the outcome is the difference
between a student’s normalized score on KPREP in the first year of administration (2012)
and her normalized score on the KCCT in the prior year and grade, that is:
$$Y_{ij} = S^{\text{KPREP}}_{ijtg} - S^{\text{KCCT}}_{ij(t-1)(g-1)} \qquad (1)$$
where S is the normalized score and i, j, t, and g index student, school, year, and grade,
respectively. Because the KPREP and KCCT scores are normalized and not linked, this
difference score contains no information about absolute changes in performance with the
introduction of the new test. Rather, it measures the change in a student's position in the
distribution of scores with the introduction of the new test. We assume that scores on the
new KPREP test are less affected by test preparation and therefore that on average,
students whose scores on KCCT were more inflated should have negative values on the
outcome, that is, they should on average fall in relative position.
We began with simple student-level ordinary least squares models for both the
KPREP and the SAT-10. These models are of the form:
$$Y_{i} = \beta_{0} + \mathbf{X}_{i}\boldsymbol{\beta}_{1} + e_{i}, \qquad (2)$$
where $i$ indexes students, $Y$ is the difference score, and $\mathbf{X}_{i}$ is a vector of student
characteristics. These models confound within- and between-school relationships and
ignore clustering of students within schools, but they provide a baseline description.
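A sketch of this baseline specification, fit with statsmodels on synthetic stand-in data; the dummy-variable names are illustrative, and the paper fits one such model per grade and test.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5_000
# Synthetic stand-in for the analytic sample.
df = pd.DataFrame({
    "poor": rng.integers(0, 2, n),
    "iep": rng.integers(0, 2, n),
    "asian": rng.integers(0, 2, n),
    "black": rng.integers(0, 2, n),
    "hispanic": rng.integers(0, 2, n),
})
df["y"] = -0.08 * df["poor"] + rng.normal(scale=1.0, size=n)  # toy outcome

# Equation (2): student-level OLS on the difference score.
ols = smf.ols("y ~ poor + iep + asian + black + hispanic", data=df).fit()
print(ols.params)
```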
We then estimated two-level mixed models (students nested in schools) not only
for accurate standard errors, but also because the structural relationships at the school
level are substantively important. Because score inflation depends on the behavior of
educators, we expect it to be a teacher- and school-level variable, and we expect
substantial predictable variations in inflation between schools. Our core models were the
following random-intercepts models:
$$Y_{ij} = \beta_{0j} + \mathbf{X}_{ij}\boldsymbol{\beta}_{1} + e_{ij}$$
$$\beta_{0j} = \gamma_{00} + \bar{\mathbf{X}}_{\cdot j}\boldsymbol{\gamma}_{01} + \mathbf{Z}_{j}\boldsymbol{\gamma}_{02} + \gamma_{03}W_{0j} + u_{j} \qquad (3)$$
where $\mathbf{X}_{ij}$ is a vector of student characteristics, $\bar{\mathbf{X}}_{\cdot j}$ is the vector of school means of these
student characteristics, $\mathbf{Z}_{j}$ is the vector of school mean change variables, and $W_{0j}$ is a dummy
variable indicating whether a school is located in a county identified as Appalachian.
Estimates of $\gamma_{03}$ were small and not significant, so we dropped the term $\gamma_{03}W_{0j}$ from the
models reported here. As noted, because the use of the individual-year school change variables
did not add useful information, we replaced $\mathbf{Z}_{j}$ with a single variable indicating school mean
change over the three-year period in the models reported below.
We grand-mean-centered the student-level outcome and predictor variables in our
models. We centered all test scores before calculating the gain scores. We centered all
school-level predictors, including historical school mean test scores, on the means of
school means.
Grand-mean centering at the student level yields parameter estimates for level-2
variables that are direct estimates of context effects (Raudenbush & Bryk, 2002). That is,
the parameter estimates for level-2 aggregate variables indicate the extent to which they
have an association with the outcome above and beyond the effects of the corresponding
level-1 variables.
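A sketch of the random-intercepts specification in equation (3) with grand-mean centering, using statsmodels' MixedLM on synthetic data. Variable names are ours; for brevity the sketch carries a single student-level predictor, its school proportion, and the school change variable, whereas the actual models include the full set of dummies.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_schools, n_per = 200, 40
school = np.repeat(np.arange(n_schools), n_per)

# Synthetic stand-ins for the analytic variables.
df = pd.DataFrame({
    "school": school,
    "poor": rng.integers(0, 2, n_schools * n_per),
    "change_07_10": np.repeat(rng.normal(size=n_schools), n_per),
})
df["y"] = rng.normal(size=len(df))

# Grand-mean-center the outcome and the student-level predictor.
df["y_c"] = df["y"] - df["y"].mean()
df["poor_c"] = df["poor"] - df["poor"].mean()

# School proportion poor, centered on the mean of school means so that
# its coefficient is a direct estimate of the context effect.
school_mean = df.groupby("school")["poor"].transform("mean")
df["prop_poor_c"] = school_mean - df.groupby("school")["poor"].mean().mean()

# Random intercept for school (equation 3), fixed effects at both levels.
fit = smf.mixedlm("y_c ~ poor_c + prop_poor_c + change_07_10",
                  data=df, groups=df["school"]).fit()
print(fit.summary())
```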
SAT-10 Replication
Our replication using the SAT-10 differed from the model in equation (3) only in
the construction of the outcome measure. Rather than subtracting KCCT scores from
KPREP scores in the following year and grade, we subtracted them from SAT-10 scores
in the following year and grade:
$$Y_{ij} = S^{\text{SAT10}}_{ijtg} - S^{\text{KCCT}}_{ij(t-1)(g-1)} \qquad (4)$$
Falsification Tests
The falsification tests differed from the primary analyses in equation (3) in that
they were lagged by one and two additional years. That is, the outcome in the first
falsification model was the difference between KCCT in the final and next-to-final years
in which it was administered:
$$Y_{ij} = S^{\text{KCCT}}_{ij(t-1)g} - S^{\text{KCCT}}_{ij(t-2)(g-1)} \qquad (5)$$
The second falsification test was lagged one additional year:
$$Y_{ij} = S^{\text{KCCT}}_{ij(t-2)g} - S^{\text{KCCT}}_{ij(t-3)(g-1)} \qquad (6)$$
Our motivation for including the second falsification test in equation (6) was concern that
performance on the 2010 KCCT might have been anomalous because educators knew
that it was the final year of that testing program.
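Constructing the lagged outcomes amounts to differencing successive KCCT administrations instead of subtracting the KCCT from a new test; a sketch, again with hypothetical wide-format columns of probit-normalized scores:

```python
import pandas as pd

# Hypothetical wide table: one row per student, probit-normalized KCCT
# scores for the grade the student occupied in each year.
df = pd.DataFrame({
    "kcct_2009": [0.12, -0.40, 1.21],
    "kcct_2010": [0.30, -0.22, 0.94],
    "kcct_2011": [0.18, -0.51, 1.10],
})

# Equation (5): final KCCT year minus the next-to-final year.
df["y_falsification_1"] = df["kcct_2011"] - df["kcct_2010"]

# Equation (6): lagged one additional year.
df["y_falsification_2"] = df["kcct_2010"] - df["kcct_2009"]

# Each falsification model then refits equation (3) with these outcomes.
```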
Sensitivity Tests
As noted, the scaling of the KCCT resulted in appreciable censoring, with a
substantial number of students scoring at the HOSS and a more modest number at the
LOSS. Because we have no information about the actual distribution of performance of
these students, we consider it most appropriate to remove them from our analyses.
However, we also fitted models in which we included those students in order to clarify
the effects of excluding them. These are presented in the Appendix.
Results
We first present the simple student-level OLS results for both the KPREP and
SAT-10 difference scores. We then discuss our principal findings based on two-level
models, including the SAT-10 and falsification models. Because the difference scores are
calculated by subtracting prior scores from recent scores, a negative sign indicates a
decline in relative performance.
Student-Level OLS Results
As expected, all eight of the OLS models (four grades by two tests) showed a
decline in performance by poor students (Table 1). These declines were all significant but
varied markedly in size, from 0.02 SD to 0.11 SD. Students with IEPs also declined in
seven of eight models, in many cases by more, with effects up to 0.19 SD. Asian students,
about whom we had no specified hypothesis, increased in all comparisons, in some cases
by roughly 0.3 SD. The coefficients for Black students were inconsistent.
Two-Level KPREP and SAT-10 Models
Many of our findings are consistent with our hypotheses, but there are a number
of exceptions, particularly in grade 8. Most of the findings of our main KPREP analyses
were mirrored in our SAT-10 replication. We discuss grades 5 through 7 first and then
turn to grade 8.
Within schools, the relative performance of poor students in grades 5, 6, and 7 fell
when the new KPREP was introduced. These decreases ranged from -0.06 to -0.09
standard deviations, and all were highly significant (Table 2). Our SAT-10 replication
yielded similar findings for these relationships (Table 3). These findings, however, do not
represent the total association between poverty and the change in performance. Rather,
they represent only the within-school relationship.
At the school level, an increase in the proportion of poor students predicted a
decline in relative performance. In the case of KPREP, this effect ranged from -0.18 in
grade 6 to -0.54 in grade 7 (Table 2). Again, we found similar results with our SAT-10
replication (Table 3). Recall that because of the grand-mean centering, these are estimates
of context effects. That is, even after controlling for the effects of the poverty status of
individual students, an increase in the proportion of poor students predicted a decline in
performance.
These school-level coefficients are substantial, but interpreting their magnitude
requires additional calculation. The student-level and school-level coefficients are not
directly comparable because they are not on the same scale. The student-level poverty
coefficient is the adjusted mean difference between poor and non-poor students. In
contrast, because the school poverty variable runs from 0 to 1.0, the school-level poverty
coefficient is the estimated difference between schools with 0% and 100% poor students.
To put the magnitude of the school-level coefficients into perspective, we can apply them
to the school-level distribution of the proportion poor. For example, we can compare the
estimated change in performance for schools at the 25th and 75th percentiles on the
distribution of proportion poor. In grade 7, where this coefficient was largest, the
difference in the proportion poor for the schools at the 25th and 75th percentiles of this
distribution was 0.24. Therefore, our KPREP model in this grade predicts that the school
at the 75th percentile would be 0.13 student-level standard deviations (.24 × .54) lower in
relative performance than the school at the 25th percentile, even after taking into account
the student-level effect of poverty.
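Written out, this interquartile comparison is simply the school-level coefficient from Table 2 scaled by the interquartile difference in the proportion poor:

$$\hat{\gamma}_{\text{prop. poor}} \times \left(\bar{P}_{75} - \bar{P}_{25}\right) = (-0.54)(0.24) \approx -0.13,$$

where $\bar{P}$ denotes a school's proportion of poor students.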
Relationships between change scores and both the Black/White dummy and the
school proportion Black were inconsistent in sign and often non-significant. In all three
grades, students with IEPs declined in relative performance on the KPREP, with
coefficients ranging from -.03 to -.22 (Table 2). These declines were mirrored in the SAT-
10 (Table 3). We found no significant school-level relationships between the proportion
of students with IEPs and changes in relative performance on either test.
While we had no explicit hypothesis about Asian students, we found a modest
student-level increase in relative performance for Asian students in grades 5-7 on the
KPREP, but only in grade 6 on the SAT-10. There were no school-level effects of
proportion Asian in any instance.
The relationships with schools' mean gains were inconsistent. In grades 5 and 6,
we found a modest negative relationship between mean gains on KCCT for the three
prior years and relative performance on the KPREP, and these effects were significant at
𝑝 < .001 (Table 2). Similar results were found for the SAT-10 (Table 3). However, no
relationships were found for either test in grade 7.¹ Moreover, these findings failed the
falsification test described below in several instances. That is, we found negative
associations in some of the falsification models, when ideally we should not. This
suggests that we should not have confidence in the estimates for this coefficient that we
obtained in our main models.
The results in grade 8 were weaker, and this was again true with both the KPREP
and the SAT-10. The negative student-level coefficient for poor students found in the
other grades appeared in grade 8 as well, although the coefficient for poverty was very
small (-0.03 with both tests; Tables 2 and 3). The grade-8 student-level coefficients for
Asian and IEP students were roughly comparable to those in the other grades. However,
the school-level coefficients for proportion poor were neither substantial nor significant
in grade 8.
Falsification tests
Our main models use a decline in relative performance with the introduction of
the new KPREP and SAT-10 as an indication of greater score inflation. To test whether the
estimates of our main models stem from the change in tests, we conducted falsification
tests in which we replicated our main models with data lagged by one and two years,
such that the outcome was the change between two successive administrations of the old
KCCT. The results of the falsification tests can be seen in Tables 4 and 5. We expect
that our results will fail to replicate in the falsification tests.

¹ While we do not have a full explanation of the lack of effects in grade 7, we found that part of the
difference between this grade and the other two appears to stem from multivariate outliers, specifically, a
much larger range of Mahalanobis distances. Eliminating observations with large distances made the grade 7
coefficient for the change variable somewhat more similar to those in the other grades.
For the most part, the falsification models failed to replicate the findings from our
main models. Unlike the estimates from the core two-level model presented in equation
(3), the student-level poor coefficients in the falsification models are near zero and
generally not significant. In contrast, the estimates from the core model are universally
negative, greater in magnitude, and highly significant (p<0.01). We found only two
exceptions. First, the grade 7 student-level poor coefficient is negative and significant in
the 2010-11 falsification model (p = 0.01). The grade 6 student-level poor coefficient is
also negative and significant in the 2009-10 falsification model (p = 0.05). Even in these
exceptional cases, however, the point estimates are still
attenuated compared to the core two-level model.
In all grades except for grade 7, the school-level poor coefficients were also
attenuated in the falsification models and sometimes reversed in sign (positive). For
example, the point estimates were significant and positive in grade 6 (p = 0.01) and grade
8 in 2009-10 (p = 0.05), indicating that schools with high concentrations of poor students
at times improved in relative rank between successive administrations of the KCCT. As
the exception, the grade 7 school-level poor
estimate was negative and highly significant in 2010-11 (p < 0.01). Once again, however,
the magnitude of this estimate is less than the estimate from the core model.
Discussion
Our findings are largely consistent with the hypothesis that test-specific
preparation for the previous KCCT high-stakes test was more intensive—or at least more
effective—with disadvantaged students. As predicted, within schools, poor students fell
in relative performance when the new high-stakes test was introduced. Similarly, students
with IEPs fell in relative performance. These findings, and the finding that Asian students
increased in relative performance, are consistent with the notion that students who are at
risk of scoring poorly receive more intensive test preparation.
In our view, the school-level relationship between the proportion of poor students
and the change in relative performance is particularly important. Our models provided
estimates of context or concentration effects: an increase in the proportion of poor
students predicted a decline in relative performance above and beyond the effects of
student-level poverty. While there is evidence that teachers can to some degree target test
preparation to specific groups of students within their classes (e.g., Booher-Jennings,
2005), it is reasonable to expect that much of the variation in test preparation lies between
teachers and schools. Our findings are consistent with this expectation and support the
suggestion by Koretz & Hamilton (2006) that it is important to examine aggregate
relationships in evaluating score inflation.
The SAT-10 replication and the falsification tests both substantially strengthen
these conclusions. If systematic changes in relative performance were a result of the
specific content sampling used to create the new high-stakes test, our primary results
would not be replicated when using the SAT-10 as the outcome measure, but our findings
were replicated consistently. Similarly, if our findings were a result of weaknesses of the
data or the specifications employed, one would expect our findings to be replicated when
we lagged our data by a single year, which removed the change in testing but left all else
unaltered. However, our results were not replicated in the falsification tests.
Our data are not sufficient to explore the reasons for the far weaker findings in the
grade 8 data. We found weaker findings in grade 8 using the SAT-10 as well, which rules
out an anomaly in the construction of the eighth-grade KPREP as an explanation. It may
be that both of the new tests are similar enough to the eighth-grade KCCT that the
difference scores do not capture useful variation. It is also plausible that preparation for
the KCCT was distributed differently in the eighth grade than in the other grades,
although we know of no other data to suggest this. Additional data would be needed to
explain these findings.
The modest sizes of some of our findings warrant discussion. Given our data, we
can only speculate, but there are a number of reasons why our findings may
underestimate these variations. First, as the present approach is novel, we do not know
what fraction of score inflation can be captured by changes in high-stakes tests. We
expect that this will vary depending on the context. New high-stakes tests are rarely
introduced without preparation, and the more extensive the advance preparation, the less
likely it is that this approach will capture a meaningful share of test preparation and
inflation. For example, we replicated our models with data from New York State, which
provided extensive preparation for the introduction of its Common Core tests in 2013,
such as the launching of extensive web-based preparation two years before the
introduction of the tests.² Our Kentucky findings did not replicate in the New York data.
Second, no single audit measure is likely to capture fully the variations in test preparation
activities and score inflation. Our data are not sufficient to evaluate these competing
explanations.
Our findings have two broad implications, one substantive and the other
methodological. First, this paper adds to the slowly growing literature indicating that test
preparation and score inflation are in some cases more severe for disadvantaged students.
² See www.engageny.org/.
This is of concern not only because of the negative implications about the quality of
education provided to these students, but also because the result of this differential bias is
an illusion of greater equity that can distort both policy and practice. Second, our findings
suggest that the introduction of new material into high-stakes tests can serve as one
method to help monitor for inappropriate test preparation and score inflation.
References
Booher-Jennings, J. (2005). Below the bubble: "Educational Triage" and the Texas
Accountability System. American Educational Research Journal, 42(2), 231-268.
Cimbricz, S. (2002). State-mandated testing and teachers’ beliefs and practice. Education
Policy Analysis Archives, 10(2). Retrieved from
http://epaa.asu.edu/epaa/v10n2.html
Diamond, J. B., & Spillane, J. P. (2004). High-stakes accountability in urban elementary
schools: Challenging or reproducing inequality? Teachers College Record, 106(6),
1145-1176.
Eisner, E. (2001). What does it mean to say a school is doing well? Phi Delta Kappan,
82(5), 367-372.
Firestone, W. A., Camilli, G., Yurecko, M., Monfils, L., & Mayrowetz, D. (2000). State
standards, socio-fiscal context and opportunity to learn in New Jersey. Education
Policy Analysis Archives, 8(35). Retrieved from http://olam.ed.asu.edu/epaa/v8n35.
Gillborn, D. & Youdell, D. (2000). Rationing education: policy, practice, reform, and
equity. Buckingham, UK: Open University Press.
Herman, J. L., & Golan, S. (1993). The effects of standardized testing on teaching and
schools. Educational Measurement: Issues and Practice, 12(4), 20-25, 41-42.
Ho, A. D., & Haertel, E. H. (2006). Metric-Free Measures of Test Score Trends and Gaps
with Policy-Relevant Examples (CSE Report 665). Los Angeles, CA: National Center
for Research on Evaluation, Standards, and Student Testing (CRESST), Center for the
Study of Evaluation, University of California, Los Angeles.
Jacob, B. (2007). Test-based accountability and student achievement: An investigation of
differential performance on NAEP and state assessments. Cambridge, MA:
National Bureau of Economic Research (Working Paper 12817).
Jacob, R. T., Stone, S., & Roderick, M. (2004). Ending social promotion: The response of
teachers and students. Chicago, IL: Consortium on Chicago School Research.
Retrieved March 29, 2011, from http://www.eric.ed.gov/PDFS/ED483823.pdf
Jones, M. G., Jones, B. D., & Hargrove, T. Y. (2003). The unintended consequences of
high-stakes testing. Lanham, MD: Rowman & Littlefield Publishers, Inc.
Johnson, M. (2014). Personal communication, August 27.
Kentucky Department of Education (2009). Commonwealth Accountability Testing System,
2008-09 Technical Report, Version 1.2. Frankfort, KY: Author.
Kentucky Department of Education (n.d.). Kentucky Performance Rating for Educational
Progress, 2011-12 Technical Manual, Version 1.1. Frankfort, KY: Author.
Klein, S. P., Hamilton, L.S., McCaffrey, D.F., and Stecher, B.M. (2000). What do test
scores in Texas tell us? Santa Monica, CA: RAND (Issue Paper IP-202). Last
accessed from http://www.rand.org/publications/IP/IP202/ on June 4, 2013.
Koretz, D., and Barron, S. I. (1998). The Validity of Gains on the Kentucky Instructional
Results Information System (KIRIS). MR-1014-EDU, Santa Monica: RAND.
Koretz, D., Barron, S., Mitchell, K., and Stecher, B. (1996). The Perceived Effects of the
Kentucky Instructional Results Information System (KIRIS). MR-792-PCT/FF,
Santa Monica: RAND.
Koretz, D., and Hamilton, L. S. (2006). Testing for accountability in K-12. In R. L.
Brennan (Ed.), Educational measurement (4th ed.), 531-578. Westport, CT:
American Council on Education/Praeger.
Koretz, D., Linn, R. L., Dunbar, S. B., & Shepard, L. A. (1991). The effects of high-
stakes testing: Preliminary evidence about generalization across tests, in R. L.
Linn (chair), The Effects of High Stakes Testing, symposium presented at the
annual meetings of the American Educational Research Association and the
National Council on Measurement in Education, Chicago, April.
Ladd, H. F., & Zelli, A. (2002). School-based accountability in North Carolina: The
responses of school principals. Educational Administration Quarterly, 38(4), 494-
529. doi: 10.1177/001316102237670
Lindquist, E. F. (1951). Preliminary considerations in objective test construction. In E. F.
Lindquist (Ed.), Educational Measurement (pp. 119-184). Washington, DC:
American Council on Education.
Lipman, P. (2002). Making the global city, making inequality: The political economy and
cultural politics of Chicago school policy. American Educational Research Journal,
39(2), 379-419.
Luna, C., & Turner, C. L. (2001). The impact of the MCAS: Teachers talk about high-
stakes testing. The English Journal, 91(1), 79-87.
McNeil, L. M. (2000). Contradictions of school reform: Educational costs of
standardized testing. New York, NY: Routledge.
McNeil, L. M. & Valenzuela, A. (2001). The harmful impact of the TAAS system of
testing in Texas: Beneath the accountability rhetoric. In M. Kornhaber & G. Orfield
(Eds.), Raising standards or raising barriers? Inequality and high-stakes testing in
public education (pp. 127-150). New York, NY: Century Foundation.
Neal, D. and Schanzenbach, D. (2010). Left behind by design: Proficiency counts and
test-based accountability. Review of Economics & Statistics, 92(2), 263–283.
Ng, H. L., & Koretz, D. (2013). Sensitivity of School-Performance Ratings to the Test
Used. A working paper of the Educational Accountability Project.
http://projects.iq.harvard.edu/files/eap/files/houston_paper_wpdraft_032513_1.pdf
Pedulla, J. J., Abrams, L. M., Madaus, G. F., Russell, M. K., Ramos, M. A., & Miao, J.
(2003). Perceived effects of state-mandated testing programs on teaching and
learning: Findings from a national survey of teachers. Boston, MA: National Board
on Educational Testing and Public Policy. Retrieved from
http://www.bc.edu/research/nbetpp/statements/nbr2.pdf
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and
data analysis methods, second edition. Thousand Oaks, CA: Sage.
Stecher, B. M., Epstein, S., Hamilton, L. S., Marsh, J. A., Robyn, A., McCombs, J. S.,
Russell, J., & Naftel, S. (2008). Pain and gain: Implementing NCLB in three states,
2004 – 2006. Santa Monica, CA: RAND. Retrieved from
http://www.rand.org/pubs/monographs/2008/RAND_MG784.pdf
Stecher, B. M., & Mitchell, K. J. (1995): Portfolio Driven Reform: Vermont Teachers’
Understanding of Mathematical Problem Solving (CSE Technical Report 400).
Los Angeles, CA: University of California Center for Research on Evaluation,
Standards, and Student Testing.
Taylor, G., Shepard, L., Kinner, F., & Rosenthal, J. (2002). A survey of teachers’
perspectives on high-stakes testing in Colorado: What gets taught, what gets lost
(CSE Technical Report 588). Los Angeles, CA: University of California. Retrieved
September 20, 2010, from http://eric.ed.gov/PDFS/ED475139.pdf
Thissen, D., & Orlando, M. (2001). Item response theory for items scored in two
categories. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 141-186).
Mahwah, N.J.: Lawrence Erlbaum Associates.
Urdan, T. C., & Paris, S. G. (1994). Teachers’ perceptions of standardized achievement
tests. Educational Policy, 8(2), 137-157.
Figures
Figure 1. Distributions of KCCT and KPREP scores for calculation of the difference
score for students in grade 6 in 2012, with the x-axis scale bounded by LOSS and HOSS.
Tables
Table 1
OLS Regression results for 2012 KPREP minus 2011 KCCT analysis and for 2012 SAT-10 minus 2011 KCCT analysis
               Grade 4-5               Grade 5-6               Grade 6-7               Grade 7-8
           KPREP-KCCT SAT10-KCCT  KPREP-KCCT SAT10-KCCT  KPREP-KCCT SAT10-KCCT  KPREP-KCCT SAT10-KCCT
Poor        -0.082***  -0.077***   -0.069***  -0.046***   -0.108***  -0.090***   -0.021**   -0.019**
            (0.007)    (0.007)     (0.007)    (0.008)     (0.007)    (0.007)     (0.007)    (0.007)
IEP         -0.037***  -0.092***   -0.139***  -0.193***    0.049***  -0.121***   -0.115***  -0.151***
            (0.011)    (0.011)     (0.011)    (0.011)     (0.010)    (0.011)     (0.011)    (0.011)
Asian        0.302***   0.136***    0.284***   0.185***    0.234***   0.106***    0.284***   0.163***
            (0.031)    (0.031)     (0.032)    (0.033)     (0.030)    (0.032)     (0.031)    (0.031)
Black        0.067***  -0.015      -0.013      0.016       0.089***   0.104***    0.151***   0.152***
            (0.012)    (0.012)     (0.012)    (0.012)     (0.011)    (0.011)     (0.011)    (0.011)
Hispanic     0.045*    -0.006      -0.013     -0.043*      0.063***   0.033       0.100***   0.053**
            (0.018)    (0.018)     (0.019)    (0.020)     (0.018)    (0.019)     (0.018)    (0.018)
Intercept    0.000      0.000       0.000      0.000       0.000      0.000       0.000      0.000
            (0.004)    (0.004)     (0.004)    (0.004)     (0.003)    (0.003)     (0.003)    (0.003)
N           46,821     46,240      44,943     44,452      44,767     44,321      44,946     44,405
Note. *p<0.05 **p<0.01 ***p<0.001
Table 2
Regression results for 2012 KPREP minus 2011 KCCT analysis
                           Grade 4-5    Grade 5-6    Grade 6-7    Grade 7-8
Student-Level Variables
  Poor                     -0.057***    -0.086***    -0.075***    -0.030***
                           (0.008)      (0.008)      (0.007)      (0.007)
  IEP                      -0.080***    -0.220***    -0.031**     -0.182***
                           (0.010)      (0.011)      (0.010)      (0.011)
  Asian                     0.187***     0.245***     0.144***     0.166***
                           (0.037)      (0.035)      (0.032)      (0.033)
  Black                     0.012        0.022       -0.037**      0.040***
                           (0.013)      (0.013)      (0.012)      (0.011)
  Hispanic                  0.023        0.025        0.005        0.055**
                           (0.018)      (0.019)      (0.018)      (0.017)
School-Level Variables
  Proportion Poor          -0.310***    -0.182*      -0.544***    -0.055
                           (0.068)      (0.074)      (0.078)      (0.088)
  Proportion IEP           -0.032       -0.147       -0.437       -0.011
                           (0.195)      (0.200)      (0.237)      (0.265)
  Proportion Asian         -0.265        0.188       -0.411        1.825*
                           (0.446)      (0.534)      (0.694)      (0.832)
  Proportion Black         -0.010       -0.273**      0.377**      0.168
                           (0.087)      (0.090)      (0.119)      (0.131)
  Proportion Hispanic       0.310       -0.133        0.176       -0.050
                           (0.195)      (0.230)      (0.340)      (0.385)
  School Difference 07-10  -0.052***    -0.049***    -0.016       -0.020
                           (0.012)      (0.012)      (0.011)      (0.014)
Intercept                   0.035**      0.017       -0.039**     -0.005
                           (0.011)      (0.011)      (0.012)      (0.012)
N                          37,139       37,320       39,707       39,862
Note. *p<0.05 **p<0.01 ***p<0.001
Table 3
Regression results for 2012 SAT-10 minus 2011 KCCT analysis

                           Grade 4-5    Grade 5-6    Grade 6-7    Grade 7-8
Student-Level Variables
  Poor                     -0.035***    -0.064***    -0.065***    -0.027***
                           (0.008)      (0.008)      (0.007)      (0.007)
  IEP                      -0.138***    -0.275***    -0.196***    -0.219***
                           (0.010)      (0.011)      (0.011)      (0.011)
  Asian                     0.037        0.166***     0.021        0.073*
                           (0.038)      (0.038)      (0.034)      (0.035)
  Black                    -0.083***     0.034*      -0.020        0.016
                           (0.013)      (0.014)      (0.012)      (0.012)
  Hispanic                 -0.036       -0.030       -0.026       -0.009
                           (0.018)      (0.020)      (0.019)      (0.018)
School-Level Variables
  Proportion Poor          -0.455***    -0.227***    -0.496***    -0.055
                           (0.059)      (0.068)      (0.074)      (0.069)
  Proportion IEP            0.120       -0.024       -0.219       -0.119
                           (0.171)      (0.183)      (0.228)      (0.211)
  Proportion Asian          0.077        0.298       -0.486        1.283*
                           (0.388)      (0.484)      (0.663)      (0.642)
  Proportion Black         -0.008       -0.246**      0.397***     0.261**
                           (0.076)      (0.082)      (0.113)      (0.101)
  Proportion Hispanic       0.328        0.023        0.326       -0.039
                           (0.170)      (0.209)      (0.324)      (0.298)
  School Difference 07-10  -0.073***    -0.048***    -0.011       -0.026*
                           (0.010)      (0.011)      (0.010)      (0.011)
Intercept                   0.039***     0.023*      -0.031**     -0.006
                           (0.010)      (0.010)      (0.012)      (0.010)
N                          36,682       37,176       39,386       39,550
Note. *p<0.05 **p<0.01 ***p<0.001
Table 4
Regression results for 2011 KCCT minus 2010 KCCT analysis
                           Grade 4-5    Grade 5-6    Grade 6-7    Grade 7-8
Student-Level Variables
  Poor                      0.003       -0.011       -0.020**      0.002
                           (0.008)      (0.007)      (0.007)      (0.006)
  IEP                       0.070***    -0.023*      -0.014       -0.046***
                           (0.010)      (0.011)      (0.010)      (0.010)
  Asian                     0.066        0.131***     0.170***     0.042
                           (0.036)      (0.037)      (0.035)      (0.031)
  Black                     0.029*       0.027*       0.015       -0.022*
                           (0.012)      (0.012)      (0.011)      (0.011)
  Hispanic                  0.051**      0.037*       0.043*       0.021
                           (0.018)      (0.018)      (0.017)      (0.017)
School-Level Variables
  Proportion Poor          -0.065        0.039       -0.298***     0.113
                           (0.058)      (0.058)      (0.076)      (0.067)
  Proportion IEP            0.071        0.178        0.534*      -0.513*
                           (0.152)      (0.169)      (0.209)      (0.212)
  Proportion Asian         -0.029        0.096       -0.446        0.790
                           (0.447)      (0.463)      (0.710)      (0.758)
  Proportion Black          0.236**     -0.316***     0.126        0.097
                           (0.073)      (0.076)      (0.119)      (0.102)
  Proportion Hispanic      -0.012       -0.052        0.100       -0.239
                           (0.186)      (0.198)      (0.350)      (0.311)
  School Difference 07-10  -0.045***    -0.037***    -0.033**     -0.034*
                           (0.010)      (0.011)      (0.012)      (0.013)
Intercept                   0.022*       0.020*      -0.015        0.006
                           (0.009)      (0.010)      (0.011)      (0.010)
N                          33,433       34,560       36,273       36,142
Note. *p<0.05 **p<0.01 ***p<0.001
Table 5
Regression results for 2010 KCCT minus 2009 KCCT analysis
                           Grade 4-5    Grade 5-6    Grade 6-7    Grade 7-8
Student-Level Variables
  Poor                     -0.001       -0.016*      -0.020**     -0.009
                           (0.007)      (0.007)      (0.007)      (0.006)
  IEP                      -0.005       -0.049***     0.068***     0.057***
                           (0.010)      (0.011)      (0.010)      (0.010)
  Asian                     0.083*       0.065        0.115***     0.041
                           (0.038)      (0.039)      (0.034)      (0.032)
  Black                    -0.027*       0.009        0.012       -0.019
                           (0.012)      (0.012)      (0.011)      (0.011)
  Hispanic                  0.023       -0.032        0.051**      0.008
                           (0.019)      (0.019)      (0.018)      (0.018)
School-Level Variables
  Proportion Poor          -0.038        0.157**     -0.014        0.142*
                           (0.055)      (0.055)      (0.070)      (0.061)
  Proportion IEP            0.200       -0.117        0.133       -0.307
                           (0.154)      (0.149)      (0.221)      (0.200)
  Proportion Asian         -0.111       -0.296       -0.757        1.028
                           (0.465)      (0.480)      (0.877)      (0.838)
  Proportion Black         -0.119       -0.584***    -0.092       -0.132
                           (0.073)      (0.075)      (0.124)      (0.091)
  Proportion Hispanic       0.117       -0.239       -0.334        0.301
                           (0.194)      (0.196)      (0.383)      (0.301)
  School Difference 07-10  -0.001       -0.027*      -0.026        0.012
                           (0.012)      (0.012)      (0.015)      (0.014)
Intercept                   0.011        0.015       -0.005        0.016
                           (0.010)      (0.009)      (0.012)      (0.009)
N                          32,899       33,488       34,862       36,546
Note. *p<0.05 **p<0.01 ***p<0.001
Appendix
Sensitivity Test: Including HOSS and LOSS
The sizable number of students who obtained the highest and lowest obtainable
scores (HOSS/LOSS) on the KCCT (see Figure 1) indicates a substantial amount of
censoring, particularly at the high end of the distribution. Because censoring is likely to
artificially attenuate relationships between the outcome and predictors, we excluded
students with censored scores. To test the impact of this decision, we replicated our
analyses with the students who had censored scores included. As expected, we found that
including these students weakened our findings. Importantly, we found that the student-
level poor estimates were less negative (smaller in absolute value) when students who
obtained the HOSS or LOSS were included. A comparison between estimates derived
from the KPREP and NRT (SAT-10) models including and excluding students at the HOSS or
LOSS can be seen in Tables A1 and A2, respectively.
Table A1
Regression results for 2012 KPREP minus 2011 KCCT analysis with students performing at the HOSS or LOSS included

                           Grade 4-5    Grade 5-6    Grade 6-7    Grade 7-8
Student-Level Variables
  Poor                     -0.028***    -0.043***    -0.039***    -0.004
                           (0.008)      (0.007)      (0.007)      (0.007)
  IEP                      -0.032**     -0.124***     0.054***    -0.105***
                           (0.010)      (0.010)      (0.010)      (0.010)
  Asian                     0.222***     0.262***     0.148***     0.171***
                           (0.029)      (0.030)      (0.029)      (0.030)
  Black                     0.044***     0.053***    -0.007        0.074***
                           (0.013)      (0.013)      (0.012)      (0.012)
  Hispanic                  0.010        0.025        0.007        0.066***
                           (0.017)      (0.018)      (0.018)      (0.018)
School-Level Variables
  Proportion Poor          -0.311***    -0.137       -0.536***    -0.019
                           (0.073)      (0.077)      (0.079)      (0.089)
  Proportion IEP           -0.126       -0.189       -0.547*      -0.146
                           (0.210)      (0.207)      (0.240)      (0.269)
  Proportion Asian         -0.136        0.432       -0.371        1.734*
                           (0.480)      (0.552)      (0.702)      (0.845)
  Proportion Black          0.033       -0.236*       0.442***     0.222
                           (0.094)      (0.093)      (0.120)      (0.133)
  Proportion Hispanic       0.400       -0.173        0.315       -0.041
                           (0.211)      (0.293)      (0.344)      (0.391)
  School Difference 07-10  -0.062***    -0.058***    -0.020       -0.026
                           (0.012)      (0.012)      (0.011)      (0.014)
Intercept                  -0.021       -0.008       -0.053***    -0.016
                           (0.012)      (0.012)      (0.012)      (0.013)
N                          46,521       44,527       44,442       44,493
Note. *p<0.05 **p<0.01 ***p<0.001
Table A2
Regression results for 2012 NRT (SAT-10) minus 2011 KCCT analysis with students performing at the HOSS or LOSS included

                           Grade 4-5    Grade 5-6    Grade 6-7    Grade 7-8
Student-Level Variables
  Poor                      0.000       -0.014       -0.030***     0.003
                           (0.008)      (0.008)      (0.007)      (0.007)
  IEP                      -0.086***    -0.184***    -0.117***    -0.142***
                           (0.010)      (0.011)      (0.010)      (0.010)
  Asian                     0.065*       0.163***     0.023        0.062*
                           (0.030)      (0.032)      (0.031)      (0.031)
  Black                    -0.045***     0.066***     0.005        0.050***
                           (0.013)      (0.013)      (0.012)      (0.012)
  Hispanic                 -0.046*      -0.026       -0.033        0.000
                           (0.018)      (0.020)      (0.019)      (0.018)
School-Level Variables
  Proportion Poor          -0.443***    -0.177*      -0.494***    -0.042
                           (0.065)      (0.071)      (0.077)      (0.071)
  Proportion IEP            0.012       -0.090       -0.386       -0.226
                           (0.187)      (0.191)      (0.235)      (0.219)
  Proportion Asian          0.083        0.335       -0.537        1.221
                           (0.423)      (0.506)      (0.684)      (0.666)
  Proportion Black          0.038       -0.196*       0.484***     0.309**
                           (0.084)      (0.086)      (0.117)      (0.105)
  Proportion Hispanic       0.451*      -0.033        0.471       -0.014
                           (0.187)      (0.220)      (0.336)      (0.309)
  School Difference 07-10  -0.081***    -0.056***    -0.016       -0.035**
                           (0.011)      (0.011)      (0.011)      (0.011)
Intercept                  -0.024*      -0.013       -0.494***    -0.015
                           (0.011)      (0.011)      (0.077)      (0.010)
N                          46,155       44,367       44,246       44,329
Note. *p<0.05 **p<0.01 ***p<0.001