Of Cabbages and Kings: Classroom Observations & Value-Added Measures
Julie Cohen & Pam Grossman Stanford University
March 30, 2011
Paper to be presented at the Annual Meeting of AERA, April 2011. We would like to thank the Carnegie Corporation, W.T. Grant Foundation, and the Spencer
Foundation for funding this work.
Introduction and research questions
Teachers and teacher quality are the focus of current discussions about
educational improvement. A number of studies suggest teachers represent one of the most
important factors affecting student achievement (e.g., Rivkin, Hanushek and Kain, 2005;
Rockoff, 2004). This recognition has led to policies, codified into the Race to the Top
legislation, promoting more rigorous evaluation of teachers. There is tremendous
enthusiasm among policy-makers about the use of value-added methodologies to assess
teacher effectiveness, including using such measures instead of years of experience or
education coursework to make consequential tenure decisions. This enthusiasm is
tempered by many researchers’ skepticism that such measures can be used to evaluate
individual teachers (e.g., Raudenbush, 2004; Rothstein, 2009; McCaffrey et al., 2004). To
address these concerns, researchers advocate using multiple measures, including
structured observations, to capture features of teaching (Gitomer, 2009; Goe, Bell, &
Little, 2008). This allows researchers and practitioners alike to understand the process,
the work that teachers do in classrooms, associated with outcomes such as student
achievement gains. However, relatively few studies have tried to go into the classrooms
of teachers identified as more or less effective to try to understand the relationship
between value-added measures and classroom instruction (e.g., Grossman et al., 2009;
Kane, Taylor, Tyler, & Wooten, 2010). While many might assume a straightforward
relationship between teaching practices and student achievement gains, the nature and
strength of that relationship may largely depend upon how both classroom practice and
student learning are conceptualized and measured.
In this paper, we focus on a variety of issues related to measuring classroom
practice, using both value-added measures and one subject-specific observation
instrument, the Protocol for Language Arts Teaching Observation or PLATO, as the basis
for this exploration. Our research questions are:
o How do classroom practices of more effective teachers differ from those
of less effective teachers? Are there consistent patterns both within and
across schools?
o Are value-added measures identifying teachers who score higher on
measures of classroom practice?
Background
Even as research has begun to document that teachers matter, there is less
certainty about what attributes of teachers actually make the most difference in raising
student achievement. Some studies identify teacher attributes, including certification, the
selectivity of teachers’ undergraduate institutions, and scores on tests of general
knowledge and verbal ability, as factors related to student achievement gains (see Rice,
2003, for a review). Other work indicates that programmatic differences in teacher
preparation and other measured characteristics may account for only a limited portion of
the variation in student achievement among teachers, particularly in English/Language
Arts (ELA) (Boyd, Grossman, Lankford, Loeb, & Wyckoff, 2009; Kane, Rockoff, &
Staiger, 2005).
This emphasis on teacher characteristics and preparation as well as student
outcomes on standardized assessments obscures the importance of classroom instruction.
If teaching, and not just teachers, matters, then it is crucial to understand what teaching
practices are associated with student achievement gains. We know little about whether
instruction, rather than the composition of students, is substantively different in the
classrooms of high value-added ELA teachers. Identifying classroom practices associated
with high student achievement gains, and then targeting these practices in teacher
education, mentoring, coaching, and professional development, provides a possible way
to improve the quality of instruction for many students.
Our study focuses specifically on classroom practices within English/Language
Arts classrooms for several reasons: our value-added analyses are able to link teachers to
students’ scores in ELA on city and state tests, and we believe that teaching effectiveness
is likely to be domain-specific (e.g. Seidel & Shavelson, 2007). In addition, there is
growing concern about adolescent literacy in this country, given that there have been
relatively few literacy gains among 8th and 12th grade students, as measured by the National
Assessment of Educational Progress (http://nces.ed.gov/nationsreportcard). Though a
number of curricular and instructional programs have been aimed at improving
elementary literacy (see Snow & Biancarosa, 2003, for an overview of 12 such
programs), we still know relatively little about the instructional practices that support
adolescent literacy achievement, particularly in urban schools.
While we know that teachers in other content areas can, and must, help develop
students’ academic literacy, ELA teachers are those most explicitly charged with
developing such skills. We focus on instruction in ELA classrooms where we can look at
relationships between instruction and student achievement. This study looks at teaching
in middle school classrooms because so little research has focused on instruction at this
level, yet we believe that middle school is a consequential time in students’ academic
lives (Carnegie Corporation, 1989).
In order to look at the quality of instruction provided in English/Language Arts,
we need measures of teaching that can be used across multiple settings and schools.
Structured observation protocols direct observers to focus on specific facets of
instruction, and provide a common technical vocabulary for describing those facets.
Consistent language for describing teaching is rare (Grossman & McDonald, 2008), but it
allows for comparisons of practice across classrooms. Some researchers have focused on
more “generic” features of classroom practice that cut across grade levels and subject
areas, such as behavior management and instructional planning and reflection (Danielson,
2006; Pianta, LaParo, Stuhlman, 2004). A few protocols have targeted specific content
areas, particularly mathematics (Hill, 2005) and elementary reading (Taylor, Pearson,
Peterson, & Rodriguez, 2005; Hoffman, Sailors, & Duffy, 2004). None of the existing
observation protocols, however, provides a way to measure the quality of ELA classroom
practice across the multiple domains of ELA, particularly at the secondary level.
The paucity of discipline-based observation approaches has been a persistent
problem in efforts to develop assessments of teaching (Kennedy, in press). Indeed, as
Kennedy argues, “until recently, assessments have not attended to the intellectual
substance of teaching: to the content actually presented, how that content is represented,
and whether and how students are engaged with it…Documenting the intellectual
meaning of teaching events remains the illusive final frontier in performance assessment”
(p. 21). To that end, the PLATO instrument builds on existing observation tools and
research on effective teaching practices in ELA in an attempt to parse the different facets
of teaching events in secondary ELA classrooms.
Research Design and Methods
Sample
We first identified 45 middle schools (6th-8th grade) in New York City that are
similar in terms of student demographics (more than 70% minority students, more than
50% of students qualifying for free and reduced lunch). To maximize potential observation
time, we identified a subset of 37 schools that were geographically clustered and had at
least 10 ELA teachers. We then contacted principals to request participation in the study.
After identifying willing schools, we worked with the school’s literacy coach or another
administrator to recruit teachers. We observed in all schools in which the majority of
teachers were willing to participate. Though we did not sample based on value-added
models as we had in previous rounds of data collection (see Grossman et al., 2009), we
assumed that if the majority of teachers in a school participated we would have a range of
levels of “effectiveness” within the sample (see Table 1 for background characteristics of
teachers in this sample). 1
1 There were a number of instances in which teachers were eliminated from the study. For example, a few teachers designated as teaching ELA actually only taught special education classes or classes in which Spanish was the language of instruction. We did not observe in their classrooms.
Table 1: Background characteristics of teachers in the sample
Columns (left to right): Variable (%); 2008-9 Sample (N=179)*; 2008-9 Teachers without value-added coefficients used in analyses (N=54)*; 2008-9 Teachers with value-added coefficients used in analyses (N=125)*; New York City Middle School Teachers (N=3777)

College Recommended                48.8           53.49            47.15             47.3
New York City Teaching Fellows     16.3           18.6             15.45             24.3
Teach For America                  1.2            4.65             0                 8.7
Individual Evaluation              16.3           11.63            17.89             9
Temporary License                  13.3           6.98             14.45             0.9
Female                             83.05          80.77            84                83.1
White                              77.51          88.46            72.65             69.5
Black                              11.24          5.77             13.68             14.2
Hispanic                           5.92           3.84             6.84              9.7
Other ethnicity                    5.33           1.92             6.84              6.5
Age                                36.67          32.07 (10.65)    38.58 (10.65)     32
Years of Experience                7.17           4.29 (5.33)      8.368 (4.93)
SAT Math (diff. N)                 500.44         526.36 (90.48)   492.06 (100.02)   495
SAT Verbal (diff. N)               526.22         505 (114.22)     533.09 (88.83)    509
LAST score first time (diff. N)    257.013 (24)   264.32 (18.9)    254.24 (25.24)    258 (21.8)

Our final sample of 13 schools included sites located in all of the boroughs in
New York City, except for Staten Island. Six of those schools were located in Queens,
five were in Brooklyn, one was in the Bronx, and one was in Manhattan. These schools
represented a large range in school size: at one school, there were only 10 ELA teachers,
and at the largest, there were 38 ELA teachers (see Table 2 for demographics of the
schools in the study). The final participation rates also varied from 45% of ELA teachers
in a school participating in the study to 90% of teachers participating in the study. As
can be seen in Table 1, the teachers in our sample do not differ significantly from other
ELA teachers in NYC public schools in terms of gender or scores on LAST, the Liberal
Arts and Sciences Test required of teachers in New York. Our sample differs somewhat
in terms of pathway from the larger population of middle school ELA teachers in NYC;
we have fewer teachers in our sample who entered teaching through Teach For America
or the NYC Teaching Fellows. Perhaps most surprising, we have a much larger
percentage of teachers who entered teaching on a Temporary License.
Table 2: Information on Middle Schools Included in the Study
Columns (left to right): School Code; Environment Grade 2008-9; Student Performance Score 2008-9 (out of 25); Student Progress Score 2008-9 (out of 60); # of ELA Teachers Total; ELA Teachers Participating in Study; Participating Teachers for whom we can calculate a value-added coefficient; Free and Reduced Lunch; English Language Learners; Black; His

1    A   20.80   43.70   31   45.2%   71.4%    93%   36%   7%
2    A   20.30   41.80   25   72.0%   72.2%    89%   23%   3%
3    B   18.20   44.60   20   90.0%   61.1%    91%   18%   15%
4    B   20.90   44.80   27   59.3%   68.7%    88%   11%   3%
5    A   19.50   29.80   23   60.9%   78.6%    92%   65%   6%
6    A   21.90   38.90   16   50.0%   75.0%    94%   5%    91%
7    D   17.40   42.00   25   80.0%   70.0%    58%   4%    74%
8    C   15.40   37.90   10   50.0%   100.0%   76%   13%   18%
9    C   21.70   43.10   38   57.9%   90.9%    78%   12%   7%
10   D   18.60   33.70   12   58.3%   42.9%    90%   4%    80%
11   A   18.10   44.30   20   50.0%   80.0%    94%   30%   7%
12   A   22.7    50.9    16   81.3%   84.6%    83%   8%    5%
13   B   22.4    44.8    17   76.5%   69.2%    82%   16%   5%
Raters
We worked with EDC to recruit potential raters for this study. Members of the
PLATO team trained 12 new raters in New York City who had both an ELA background and
experience teaching in the middle grades. The two-day training was face-to-face and
focused on all of the PLATO elements. We provided numerous opportunities for
potential raters to score video clips and receive feedback on their scores. By the end of
the training and with some follow-up with individual raters, 12 of 14 potential raters
achieved 80% reliability, with exact score matches on at least five videos of ELA
instruction that had been master coded by several members of the PLATO team (for more
details about the raters, their use of the instrument and overall reliability, see Cor, 2011).
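As a minimal illustration of the exact-match criterion described above, the Python sketch below computes the share of a rater's scores that equal the master codes and compares it with an 80% threshold. The function names and the example scores are hypothetical, and the sketch pools all ratings together; the study's actual reliability procedure (for example, how agreement was tallied across videos and elements) is not specified here and may differ.

```python
def exact_agreement(rater_scores, master_scores):
    """Return the share of ratings that match the master code exactly."""
    if len(rater_scores) != len(master_scores):
        raise ValueError("score lists must have the same length")
    if not master_scores:
        raise ValueError("at least one score is required")
    matches = sum(r == m for r, m in zip(rater_scores, master_scores))
    return matches / len(master_scores)


def meets_threshold(rater_scores, master_scores, threshold=0.80):
    """True when exact agreement reaches the (assumed) 80% criterion."""
    return exact_agreement(rater_scores, master_scores) >= threshold


# Hypothetical ratings on the 1-4 PLATO scale; not data from the study.
master = [3, 2, 4, 1, 3, 3, 2, 4, 1, 2]
rater = [3, 2, 4, 1, 3, 2, 2, 4, 1, 3]

agreement = exact_agreement(rater, master)  # 8 of the 10 ratings match
print(f"exact agreement: {agreement:.0%}")
```

Adjacent-score matching, which the first version of PLATO used, would replace the equality test with a check that the two scores differ by at most one point.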
Revision of PLATO rubric
The PLATO protocol used in this study included twelve elements of instruction
highlighted in existing literature on adolescent literacy and effective instruction in
secondary ELA: purpose, intellectual challenge, representations of content, connections to prior knowledge,
connections to personal and cultural experiences, modeling, strategy use and instruction,
guided practice, classroom discourse, text-based instruction, accommodations for
language learning, and classroom environment (for details on the development of these
elements and the background literature on each, see Grossman et al., 2009). These
elements were refined based on analysis of previous rounds of data collection. Text-based
instruction was added to assess how regularly and effectively a teacher and students
reference texts and use those textual references to meet the goals of ELA instruction. The
classroom environment element looks at both time and behavior management to assess
the teacher’s efficient organization of classroom routines and materials to ensure that
instructional time is maximized, and the degree to which student behavior facilitates
academic work. A factor analysis of a previous round of data collection suggests
three underlying factors: disciplinary demand and representation of content, instructional
scaffolding, and classroom environment.
The first version of PLATO used a seven-point scale, in which reliability was
achieved based on 80% agreement with exact and adjacent score matching. Based on
feedback from experts (Gitomer, personal communication), we switched to a 4-point
scale. Each element includes a rubric that details how to score instruction on a scale from
one (lowest) to four (highest). In addition, PLATO captures the content of instruction
(writing, reading, literature, grammar) and activity structures (whole group, small group,
independent work). PLATO is designed for use over multiple segments of instruction in
each lesson; each observation cycle captures 15 minutes of instruction with five minutes
for scoring. Observations focus on the classroom experience of the “average” student,
and try to weigh the balance of evidence across a fifteen-minute segment.
Observation Process
Teachers were observed on three separate days of instruction for at least two class
periods per day in two waves of observations. The number of observation cycles varied
depending on the length of class periods, but on average, teachers were observed for
twelve PLATO cycles. Neither observers nor teachers knew the teachers’ value-added
coefficient or quartile during observations. To ensure consistency among raters, 15% of
observations were double-coded. The study was designed to be able to conduct a
generalizability study, so teachers were observed by multiple raters, and raters were
assigned to multiple schools. (For results of the g-study, see Cor, 2011).
Value-added models
To calculate teachers’ value-added scores, we chose to run a number of different
specifications. The base used to estimate teacher effects is summarized by Equation 1.
$$A_{isgty} = \gamma_0 + \gamma_1 A_{is'(g-1)t'(y-1)} + \gamma_2 S_{iy} + C_{jt}\gamma_3 + T_{jt}\gamma_4 + \pi_s + \pi_g + \varepsilon_{isgty} \qquad (1)$$
Here the standardized achievement level (test score) A of student i in school s in
grade g with teacher t in year y is a linear function of the student’s test score in the prior
year, characteristics of the student S, characteristics of the other students in the same
grade with the same teacher in that year C, and the teacher’s attributes T, including
classroom practices. Student characteristics include gender, race/ethnicity, poverty
status, days absent during the prior year, and suspensions in the prior year. The aggregate
(teacher by grade by school) student characteristics include race/ethnicity, poverty status,
average attendance in the prior year, average suspensions in the prior year, average
student test scores in the prior year, and the standard deviation of student test scores in
the prior year. Teaching experience is measured by dummy variables for each year.
Other teacher attributes will vary by model but include test score performance along a
number of dimensions. In addition, the model includes fixed effects for grades and
schools. The standard errors are clustered at the teacher level to account for the fact that
teacher-level variables are of primary interest in our analysis. This approach identifies the
effectiveness of a teacher attribute such as classroom practices by comparing teachers
with different attributes within the same school.
There are active debates concerning the best specification for estimating teacher
effects. Because there is no consensus on the best approach, we chose to combine two
measures for our analyses in this paper. In particular, we use one estimate that models
gains in student achievement as a function of student fixed-effects, student time varying
characteristics, school characteristics, classroom characteristics, and year and grade
indicator variables. This strategy identifies value-added by comparing teachers who
teach the same students, usually in different years. Our other estimate includes student
controls, school controls, classroom controls and year and grade indicator variables. The
student controls include gender, race, eligibility for free lunch, prior year test scores in
math and ELA, and English learner status, among other factors. Classroom variables
include the aggregates of all the individual variables plus the standard deviations of the
prior year test scores. The school variables include enrollment, the percent of both black
and Hispanic students, the percent of English learners, and the school average
expenditures per pupil. We shrink each measure of value-added using empirical Bayes
techniques to adjust for estimation error in calculating value-added. This is generally a
conservative approach to the identification of the effects.
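To make the idea of shrinkage concrete, the Python sketch below applies one common textbook form of empirical Bayes shrinkage: each estimate is pulled toward the overall mean by a weight equal to the estimated signal variance divided by signal variance plus that estimate's sampling variance. This is an assumption made for illustration, not necessarily the estimator used in this paper, and the function name, inputs, and example values are all hypothetical.

```python
from statistics import mean, pvariance


def eb_shrink(estimates, std_errors):
    """Shrink noisy estimates toward their mean (simple empirical Bayes form)."""
    grand_mean = mean(estimates)
    # Average sampling variance across teachers (the noise component).
    noise_var = mean([se ** 2 for se in std_errors])
    # Signal variance: observed variance minus noise, floored at zero.
    signal_var = max(pvariance(estimates) - noise_var, 0.0)
    shrunk = []
    for est, se in zip(estimates, std_errors):
        denom = signal_var + se ** 2
        weight = signal_var / denom if denom > 0 else 1.0
        shrunk.append(grand_mean + weight * (est - grand_mean))
    return shrunk


# Hypothetical value-added estimates and standard errors; not study data.
raw = [0.30, -0.20, 0.10, -0.20]
ses = [0.05, 0.20, 0.10, 0.05]
shrunk = eb_shrink(raw, ses)
for r, s, adj in zip(raw, ses, shrunk):
    print(f"raw={r:+.2f}  se={s:.2f}  shrunk={adj:+.3f}")
```

In this toy example the two teachers with the same raw estimate end up with different shrunken values: the one measured with more error is pulled closer to the mean, which is why shrinkage is described as conservative.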
It is important to note that while we observed in the spring of the 2008-9 school year,
the value-added models were based on student achievement data from the 2007-8 school
year. Since we know that year-to-year correlations of teachers’ value-added scores are in
the range of .50, we are eager to re-run these analyses with the 2008-09 data. We have
recently received the student achievement scores from 2008-9 and are in the process of
developing value-added models using this data.
Analysis
We analyzed the relationship between teachers’ value-added and elements of their
observed practices, using both OLS and logistic regression. In the first year of data
collection, we compared teachers in the 4th quartile to those in the 2nd quartile. However,
in this round of data collection, our sample was skewed toward the lower quartile
teachers. As seen in Table 3, we had many more teachers in the 1st and 2nd quartiles (70%
of the sample), and very few in the 4th quartile (11.25%). To maximize the number of
teachers in our analyses for the logistic regressions, we compared the “more effective”
teachers in the 3rd quartile (second highest quartile) of the specified VAM to teachers in
the “less effective” 1st quartile (the lowest quartile). We are not sure why teachers in the
lower quartiles were more likely to volunteer for this study, as it seems counter-intuitive.
However, the skewed nature of the sample provides a cautionary note for studies that rely
on volunteers. The skewed distribution may also affect our ability to accurately measure
relationships with classroom practices.
Table 3: Value-added distribution by quartile and school
School         1st Quartile   2nd Quartile   3rd Quartile   4th Quartile
1              6              2              0              0
2              4              5              1              0
3              2              2              0              1
4              1              3              5              0
5              3              3              0              2
6              1              1              4              1
7              2              5              1              2
8              0              0              0              0
9              4              3              2              1
10             1              1              0              0
11             4              0              0              1
12             0              0              0              0
13             3              0              2              1
Total Number   31             25             15             9
Percentage     38.75%         31.25%         18.75%         11.25%
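The percentages in Table 3 follow directly from the quartile totals. The short Python check below is only an arithmetic illustration of that calculation, using the totals reported in the table; the variable names are ours.

```python
# Quartile totals as reported in the "Total Number" row of Table 3.
counts = {"1st": 31, "2nd": 25, "3rd": 15, "4th": 9}

total = sum(counts.values())
shares = {quartile: 100 * n / total for quartile, n in counts.items()}

for quartile, share in shares.items():
    print(f"{quartile} quartile: {share:.2f}%")

# Combined share of the two lowest quartiles.
lower_half = shares["1st"] + shares["2nd"]
print(f"1st and 2nd quartiles combined: {lower_half:.2f}%")
```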
Limitations
We recognize that there are a number of limitations to this study. First, we do not
yet have the value-added data for the year we actually observed these teachers, so these
findings represent the relationships between classroom practices and teachers’ value-
added scores from previous years of instruction. Since the correlation between teachers’
value added scores from one year to the next varies, and can be as low as .3 (McCaffrey
et al., 2009), we are eager to re-run these analyses with the VA scores from the 2008-09
achievement data.
A second limitation has to do with the skewed sample. While we tried our best to
recruit 80% of the ELA teachers at every school, participation varied by school. More
troubling, the teachers in the highest VA quartile were least likely to volunteer. This
meant that we had a more restricted range of teachers than we anticipated, which may
have affected the findings as well.
Findings
Frequency distributions by element
The distribution of scores differed significantly by element (see Table 4). On
elements such as Purpose, Representations of Content, and Classroom Environment the
vast majority of instructional segments received a score of 3 or 4. In contrast, most
instructional segments scored a 1 on Modeling and Strategy Instruction. Other elements
like Guided Practice and Text-Based Instruction had closer to a uniform distribution of
scores.
Table 4
Element                                     Sample Size         1     2     3     4
Purpose                                     2462 observations   0%    10%   63%   27%
Intellectual Challenge                      2462 observations   9%    43%   39%   10%
Representations of Content                  2462 observations   1%    11%   75%   13%
Connections-Prior Knowledge                 2462 observations   41%   16%   25%   18%
Connections-Personal/Cultural Experiences   2462 observations   69%   10%   12%   9%
Models/Modeling                             2462 observations   66%   15%   11%   8%
Explicit Strategy Instruction               2462 observations   74%   13%   9%    5%
Guided Practice                             2462 observations   17%   23%   40%   20%
Classroom Discourse                         2462 observations   24%   41%   27%   7%
Text-based Instruction                      2462 observations   24%   18%   30%   28%
Accommodations for Language Learning        2462 observations   50%   44%   5%    1%
Classroom Environment                       2462 observations   2%    8%    22%   68%
Figure 1: Distributions of PLATO scores
Distribution of scores by teacher
We then looked at the basic descriptive statistics of PLATO scores aggregated at
the teacher, rather than the observation, level (see Table 5). They reflect similar patterns
noted in the frequency distributions. Classroom Environment had the highest mean score
across our elements, suggesting that on average, these were well-managed classrooms.
Purpose had the 2nd highest mean score, again suggesting that there was a clear purpose
related to ELA instruction in most of the classrooms we observed. Explicit strategy
instruction had the lowest mean score across teachers (1.44), followed by Modeling,
Accommodations for Language Learners (both 1.58) and Connections to Personal
Experience (1.62).
Figure 2: Mean scores of PLATO elements
OLS and Logistic (1st to 3rd Quartile Comparisons)
In looking at the OLS regression coefficients for the entire sample (Table 6), we
see that none of the elements is significantly related to value-added scores of teachers,
with the exception of Text-Based Instruction, which is negatively related.
Table 6: OLS of Value-added Coefficients regressed on PLATO scores
                                            Regression Coefficient   Std. Deviation
Purpose                                     -0.091                   0.16
Intellectual Challenge                      -0.182                   0.223
Representations of Content                  -0.081                   0.135
Connections-Prior Knowledge                 -0.028                   0.265
Connections-Personal/Cultural Experiences   -0.175                   0.229
Models/Modeling                             -0.112                   0.245
Explicit Strategy Instruction               0.213                    0.205
Guided Practice                             -0.414                   0.258
Classroom Discourse                         0.055                    0.264
Text-based Instruction                      -0.736*                  0.309
Accommodations for Language Learning        0.016                    0.177
Classroom Environment                       0.067                    0.256
Number of teachers=125
We then ran logistic regressions to compare the classroom practices of teachers in
the 1st and 3rd quartiles (N=46)2. Table 7 below shows the odds of being in the 3rd quartile
group based on a 1 unit higher score on each of the PLATO elements. While a number of
the elements (Purpose, Connections to Prior Knowledge, Connections to Personal
Experience) have odds ratios above 2.0, only two elements reach even marginal significance:
Modeling (4.340, p<.05) and Explicit Strategy Instruction (4.698, p<.10).
Table 7: Effect of PLATO Elements on the Odds Ratios Predicting Being in the 3rd Value-Added Quartile Versus the First Value-Added Quartile
                                                   Likelihood of Being in 3rd Quartile   Standard Error
Purpose                                            2.271                                 (2.308)
Intellectual Challenge                             0.914                                 (0.683)
Representations of Content                         0.767                                 (0.973)
Connections to Prior Knowledge                     2.392                                 (1.604)
Connections to Personal and Cultural Experiences   2.163                                 (1.612)
Modeling                                           4.340*                                (3.026)
Strategy Instruction                               4.698+                                (3.829)
Guided Practice                                    0.777                                 (0.509)
Classroom Discourse                                1.182                                 (0.840)
Text-Based Instruction                             0.974                                 (0.535)
Accommodations for Language Learning               1.445                                 (1.359)
Classroom Environment                              1.135                                 (0.831)
Number of Teachers=51
Results presented as odds-ratios
+ p<.10, * p<.05, ** p<.01
2 We have many fewer teachers in the logistic regressions because the value-added model used for this study is a composite of two separate models. Teachers were identified in a certain quartile only if they were in that quartile in both value-added specifications. For this reason, we lost a large number of teachers when the sample was divided in quartile groups.
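Because Table 7 reports odds ratios, a brief worked conversion may help with interpretation. The Python sketch below shows how an odds ratio relates to a logistic regression coefficient (its natural log) and how it changes a probability for a one-unit higher PLATO score. The baseline probability of 0.5 is purely an assumption chosen for illustration, and the function names are ours; only the odds ratio of 4.340 for Modeling comes from the table.

```python
import math


def log_odds_coefficient(odds_ratio):
    """Logistic regression coefficient corresponding to an odds ratio."""
    return math.log(odds_ratio)


def probability_after_increase(baseline_prob, odds_ratio, units=1.0):
    """Probability implied by multiplying the baseline odds by odds_ratio**units."""
    if not 0 < baseline_prob < 1:
        raise ValueError("baseline_prob must be strictly between 0 and 1")
    odds = baseline_prob / (1 - baseline_prob)
    new_odds = odds * odds_ratio ** units
    return new_odds / (1 + new_odds)


modeling_or = 4.340      # odds ratio for Modeling reported in Table 7
baseline = 0.5           # hypothetical starting probability, not from the study

coef = log_odds_coefficient(modeling_or)
new_prob = probability_after_increase(baseline, modeling_or)
print(f"log-odds coefficient: {coef:.3f}")
print(f"probability after a one-point higher score: {new_prob:.2f}")
```

Under that assumed baseline, a one-point higher Modeling score moves the probability of being in the 3rd rather than the 1st quartile from 0.50 to roughly 0.81. The large standard errors in Table 7 mean such point calculations should be read loosely.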
Results for teachers with 3-10 years of experience in New York City
We then re-ran the analyses focusing on teachers with between 3 and 10 years of
teaching experience in New York City schools. Since our original study focused on
teachers with between 3-6 years of experience, and because experience may be associated
with the use of different classroom practices, we were interested in exploring differences
in a more restricted sample. Table 8 below provides the mean years of experience of ELA
teachers at the schools in our sample. It indicates that the experience profile of the
teachers in our sample varied among schools. Schools 6 and 7 have teachers with the
highest mean level of experience, while schools 1 and 4 have the lowest mean
experience. As a result, restricting the sample in this way differentially impacted the
samples across schools.
Table 8: Mean years of experience by school
School   Mean Years of Experience   Std. Dev   Min   Max
1        4.214                      2.259      1     8
2        9.833                      5.813      1     20
3        5.823                      5.714      1     20
4        4.867                      3.044      1     11
5        6.5                        5.08       1     21
6        9.375                      4.897      4     17
7        9.047                      6.359      1     20
8        9.8                        11.691     2     29
9        5.952                      2.923      1     13
10       7.375                      7.501      1     22
11       7.556                      4.927      1     16
12       6.71                       4.196      2     18
13       8.462                      5.517      1     22
With our restricted sample, we then re-ran both our logistic regressions comparing
teachers in the 3rd and 1st quartiles as well as the OLS regressions, looking at the full
range of value-added coefficients. In the logistic regressions, higher PLATO scores were
positively associated with being in the 3rd quartile on nine of the twelve elements
including: Purpose, Intellectual Challenge, Connections to Prior Knowledge,
Connections to Personal Experience, Strategy Instruction, Modeling, Guided Practice,
Accommodations for Language Learning, and Classroom Environment. The teachers’
scores on Modeling (p < .01) and Explicit Strategy Instruction (p < .05) were significantly
predictive of the teachers’ odds of being in the 3rd quartile. Using OLS regressions, seven of the
twelve PLATO elements had positive associations with the teachers’ value-added
coefficients including: Purpose, Connections to Personal Experience, Strategy
Instruction, Modeling, Classroom Discourse, Accommodations for Language Learning, and
Classroom Environment. Though none of these associations are significant, they do
suggest that on average, for the teachers with 3-10 years of experience, higher scores on
many of the PLATO elements may be associated with higher value-added coefficients.
Table 9: Logistic regressions for teachers with 3-10 years of experience
                                            Likelihood of Being in 3rd Quartile   Std. Error
Purpose                                     3.791                                 (4.821)
Intellectual Challenge                      1.182                                 (1.114)
Representations of Content                  0.993                                 (1.587)
Connections-Prior Knowledge                 2.727                                 (2.396)
Connections-Personal/Cultural Experiences   3.488                                 (3.023)
Models/Modeling                             17.430**                              (19.191)
Explicit Strategy Instruction               12.414*                               (13.014)
Guided Practice                             1.544                                 (1.336)
Classroom Discourse                         0.620                                 (0.582)
Text-based Instruction                      0.983                                 (0.679)
Accommodations for Language Learning        5.816                                 (7.122)
Classroom Environment                       1.172                                 (1.047)
Number of Teachers= 39
Results presented as odds-ratios
* p<0.05, ** p<0.01, *** p<0.001
Table 10: OLS regressions for teachers with 3-10 years of experience
                                            Regression Coefficient   Std. Error
Purpose                                     0.017                    (0.203)
Intellectual Challenge                      -0.035                   (0.276)
Representations of Content                  -0.100                   (0.165)
Connections-Prior Knowledge                 -0.120                   (0.340)
Connections-Personal/Cultural Experiences   0.116                    (0.324)
Models/Modeling                             0.281                    (0.332)
Explicit Strategy Instruction               0.439                    (0.272)
Guided Practice                             -0.231                   (0.309)
Classroom Discourse                         0.073                    (0.336)
Text-based Instruction                      -0.636                   (0.401)
Accommodations for Language Learning        0.035                    (0.233)
Classroom Environment                       0.078                    (0.346)
Number of Teachers= 96
Adjusting for psychometric properties of PLATO
We then reran these analyses with adjusted scores that predicted a PLATO score
of each teacher on each item based on the following measurement model:
$$\log\left(\frac{P_{nijkr}}{P_{nijk(r-1)}}\right) = B_n - D_i - C_j - F_k - G_r$$
This is a polytomous Rasch model that adjusts for the difficulty of occasion and
segment of measurement as well as the severity of the rater. In the model, Pnijkr is the
probability of observing category r for teacher n encountering occasion i, segment j, and
judge k. Pnijk(r-1) is the probability of observing category r-1. Bn is the ability of
teacher n, Di is the difficulty of occasion i, Cj is the difficulty of segment j, Fk is the
severity of judge k, and Gr is the difficulty of being observed in category r relative to
category r-1. Therefore, the scores can be viewed as an estimate of the teacher’s underlying
skill in each instructional element that is adjusted for the measurement circumstance of each
observation. A reliability analysis found that the Cronbach's alpha for the 12 PLATO
items used in conjunction increased from .72 to .85 when using the adjusted scores.
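To show how the measurement model above turns facet parameters into score probabilities, the Python sketch below computes the probability of each score category from the adjacent-category log-odds. It is a minimal illustration under stated assumptions: the function name is ours, the parameter values are hypothetical rather than estimates from this study, and the sketch only evaluates the model for given parameters; it does not estimate them.

```python
import math


def category_probabilities(ability, occasion, segment, rater, thresholds):
    """Category probabilities for a many-facet polytomous Rasch model.

    thresholds holds G_r for each step from category r-1 to r, so a
    four-point scale needs three values. The log-odds of category r
    versus r-1 equal ability - occasion - segment - rater - G_r.
    """
    eta = ability - occasion - segment - rater
    # Cumulative sums of the adjacent-category log-odds; lowest category is 0.
    cumulative = [0.0]
    for g in thresholds:
        cumulative.append(cumulative[-1] + eta - g)
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(cumulative)
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical parameter values in logits; not estimates from the study.
thresholds = [-1.0, 0.0, 1.0]
probs = category_probabilities(0.5, 0.1, -0.2, 0.3, thresholds)
severe = category_probabilities(0.5, 0.1, -0.2, 1.3, thresholds)

for score, p in enumerate(probs, start=1):
    print(f"P(score={score}) = {p:.3f}")
```

Holding the teacher, occasion, and segment fixed, a more severe rater (larger rater parameter) shifts probability toward lower scores, which is the kind of measurement circumstance the adjusted scores are meant to remove.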
Table 11 shows the relationship between PLATO elements and teachers’ value-
added scores, once we adjust for the various sources of error.
Table 11: Comparison of Logistic Regression Results with Raw and Adjusted Scores- Total sample of teachers
Likelihood of being in Q3                   Original Scores   Adjusted Scores
Purpose                                     2.271             1.326
Intellectual Challenge                      0.914             1.077
Representations of Content                  0.767             1.027
Connections-Prior Knowledge                 2.392             1.700
Connections-Personal/Cultural Experiences   2.163             1.925
Models/Modeling                             4.340*            2.658
Explicit Strategy Instruction               4.698             1.579
Guided Practice                             0.777             0.890
Classroom Discourse                         1.182             1.458
Text-based Instruction                      0.974             1.068
Accommodations for Language Learning        1.445             1.123
Classroom Environment                       1.135             0.959
Results presented as odds-ratios
* p<0.05, ** p<0.01, *** p<0.001

Table 12: Comparison of Logistic Regression Results with Raw and Adjusted Scores- Teachers with 3-10 years of Experience
Likelihood of being in Q3                   Original Scores   Adjusted Scores
Purpose                                     3.791             1.546
Intellectual Challenge                      1.182             1.204
Representations of Content                  0.993             1.101
Connections-Prior Knowledge                 2.727             1.789
Connections-Personal/Cultural Experiences   3.488             3.312
Models/Modeling                             17.430**          6.196*
Explicit Strategy Instruction               12.414*           4.993*
Guided Practice                             1.544             1.174
Classroom Discourse                         0.620             1.006
Text-based Instruction                      0.983             0.892
Accommodations for Language Learning        5.816             2.267
Classroom Environment                       1.172             0.961
Results presented as odds-ratios
* p<0.05, ** p<0.01, *** p<0.001
These results suggest that while adjusting the scores based on our g-study generally
reduced the odds ratios for many of the elements, it also raised the odds ratios for
Classroom Discourse and Representation of Content in both the full and restricted
samples.
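As a reading aid for the odds ratios in Tables 11 and 12, the sketch below shows how an odds ratio translates into a change in probability. It assumes the conventional reading of a logistic regression odds ratio (the multiplicative change in the odds of being in Q3 associated with a one-point higher element score). The 25% baseline probability and the function name are hypothetical and chosen only for illustration.

```python
def apply_odds_ratio(baseline_prob, odds_ratio):
    """Probability implied by multiplying the baseline odds by odds_ratio."""
    if not 0 < baseline_prob < 1:
        raise ValueError("baseline_prob must be strictly between 0 and 1")
    odds = baseline_prob / (1 - baseline_prob)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)


# Odds ratios for Models/Modeling in the total sample (Table 11).
raw_or, adjusted_or = 4.340, 2.658
baseline = 0.25  # hypothetical baseline probability, not taken from the study

p_raw = apply_odds_ratio(baseline, raw_or)
p_adjusted = apply_odds_ratio(baseline, adjusted_or)
print(f"raw: {p_raw:.3f}, adjusted: {p_adjusted:.3f}")
```

Under these assumptions, the same baseline yields a noticeably smaller implied probability with the adjusted odds ratio than with the raw one, which is the kind of shrinkage the adjusted columns display for most elements.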
In general, the reason unadjusted and adjusted scores differentially predict the
likelihood of being in the third quartile is that the raw scores are systematically affected by
measurement features in the data collection process. For example, it could be that the
teachers with the highest unadjusted scores were more likely to be rated by the most
lenient raters. As a result, their scores are spuriously high compared to the rest of the
teachers. Alternatively, it could be that the teachers with the higher scores were observed
on fewer occasions and on fewer segments than teachers with lower scores. Once again
the result would be a spuriously high likelihood of being in the third quartile. In order to
verify these and other potential explanations, the measurement characteristics of the
teachers most likely to be in the third quartile must be tracked to identify potential
predictors of systematic differences.
Discussion
These results provide a mixed picture of the relationship between teachers’
classroom practices, as measured by PLATO, and teachers’ value-added scores. When
we use logistic regression, two of the elements, Modeling and Explicit Strategy
Instruction, seem to distinguish between teachers in the 1st and 3rd value-added quartiles.
This relationship seems to hold up across a number of different analyses, and is
particularly strong among teachers with 3-10 years of experience. It is easy to understand
why “Strategy Use and Instruction,” explicitly teaching students how to successfully
complete academic tasks, and “Modeling,” demonstrating or enacting the processes in
which students will engage, might be considered “high-leverage practices” for
achievement gains on standardized English Language Arts assessments. At the highest
level of Modeling, the teacher provides a specific, concrete image of what student work,
including process or intellectual work, can and should look like and decomposes the
process, highlighting specific features for students to replicate. The opportunity to watch
an “expert,” the teacher, engage in the same activity as students helps make visible how a
more experienced reader or writer approaches the task. This, in turn, might help students
better understand the processes being modeled and become more persistent and
flexible when they approach novel tasks such as those they face on standardized
assessments, the primary tool for determining a teacher’s value-added. Moreover, to be
successful on such assessments, students must consistently employ strategies to interpret
literary text, make a compelling argument, or analyze grammatical errors. It is the
flexibility of strategies that makes them critical for success on a range of ELA tasks.
When students understand when and how to use specific strategies, as well as why they
are useful, they are better able to attack less familiar tasks or material. During our first
year of data collection, we found the vast majority of teachers provided students with
directions for completing activities, but they did not instruct them on the nuances of how
to complete those activities effectively. In literature circles, students were often told to
analyze a character’s actions or determine the meaning of unknown words without any
discussion of the strategies that would enable them to do so. Similarly, teachers
highlighted the features of cinquains or editorials but did not teach students how they
might approach different types of writing based on those features. Thus the goal of many
lessons was completion of the specific task rather than mastering a more broadly
applicable skill. Those teachers who actually taught students strategies were nearly
always in the highest value-added quartile.
However, other elements do not show a clear relationship to teachers’ value-
added scores. While in some of these analyses, most relationships between elements and
value-added scores are positive, few of the other relationships are significant. In the
remainder of this paper, we explore a number of hypotheses related to our findings.
Measuring well: Measuring what matters
One hypothesis is that these particular classroom practices are not associated with
teachers’ impact on student achievement in ELA. While certainly possible, this
hypothesis seems unlikely to be true for all elements, given the previous research on the
relationship between time on task and student achievement (Denham & Lieberman, 1980)
and on the relationship between the cognitive demand of classrooms and student learning
(Newmann, Lopez, & Bryk, 1998).
A second hypothesis is that we are not measuring these practices well. There are
a number of reasons to question how well we measured these instructional
elements. We had just significantly revised the PLATO instrument from the version used
in our earlier study, and the revised instrument may not have adequately captured the
qualities of classroom practice we intended to capture. We know, for example, that some
of the measures, such as Representation of Content, shifted their meaning during data
collection. Raters began to use the score of 3 as a “default score,” so we may not have
measured well the quality of teachers’ representations of content during these
observations. Secondly, we shifted our scale from a 7-point scale to a 4-point scale; in this
version of PLATO, we may not have done a good job of distinguishing among the
various score points. In addition, some of these elements were new to this version of
PLATO, such as Text-Based Instruction and Classroom Environment. So it is possible
that while these practices are indeed components of effective instruction, our instrument
did not do a good job of capturing them during this round of data collection.
Another potential problem with PLATO as a measurement tool is that some of our
elements measure aspects of instruction that are always present in some form, such as
intellectual challenge, purpose, representation of content, and classroom environment,
while other elements normally occur only at discrete points in the lesson, such as strategy
instruction, guided practice, etc. The g-study indicates that the greater
variation occurs across segments for the same teacher, which suggests that obtaining an accurate
measure of classroom practice would require multiple observations. In the first year of
data collection using PLATO we observed six days of instruction, while in this second
year, we observed only three days of instruction. This may have meant that we missed
low-incidence practices among teachers who do, in fact, use these practices.
There may also have been problematic scoring by raters during this year of
classroom observations. In the first year of the study, most raters had expertise in the
instrument; in fact, many of the raters were also the instrument developers. In contrast,
during Year 2, none of the developers regularly observed in classrooms. While our g-
study indicates that relatively little of the error is attributable to raters, it is possible that
the Year 2 raters agreed with each other but scored differently from the first
group of raters, and thus from the developers of PLATO. Because we do not have videos
from these observations, we are unable to check this hypothesis directly. The one rater who
had the most experience with PLATO, however, scored differently from the other raters during
Year 2, lending some support to this hypothesis. In looking at the school-level data,
we can also spot some effects of the school context on rater behavior. In schools
where there were few, or no, 3rd and 4th quartile teachers, scores may have drifted
upward, as raters re-calibrated to what was typical practice in a school. Without
consistent examples of what PLATO would consider high-quality teaching, raters may
have redefined the scale somewhat.
While PLATO may not have done a good job measuring the quality of teacher
practice, it is also possible that the standardized tests in ELA do not do a good job of
measuring some aspects of student outcomes. For example, the quality of classroom
discourse may be important in developing students’ reasoning abilities and conceptual
understanding of literature and writing, but these abilities may not be measured well by
the tests that were used to construct value-added scores. Few would argue that students
need not develop their ability to engage in productive academic discourse, but the skills
learned in these discussions may not be captured on state assessments. In this case, the
observation scores might provide better measures of this particular outcome than value-
added scores. In addition to developing multiple measures of instruction, we need to
develop multiple measures of student outcomes to ensure that classroom instruction
supports the development of a broad range of learning outcomes for students.
The model matters
One of the striking findings relates to the difference in the results depending upon
whether we used OLS or logistic regression. Again, there might be several hypotheses
for why there are negative correlations between PLATO and value-added coefficients
under OLS and positive relationships using logistic regression, particularly in the
restricted sample. One hypothesis might be that OLS may not be the best approach to
looking at relationships between practices and teachers’ value-added scores. In
particular, OLS assumes that there is a normal distribution of errors, which is highly
unlikely given the skewed distribution of both our dependent and independent variables.
Because the sample is heavily weighted towards teachers with lower value-added
coefficients, the few observations in the 4th quartile have an inappropriately large
influence on the slope of the fitted line. OLS is also extremely sensitive to outliers; a few
outliers in our sample—a few 4th quartile teachers who scored lower on some PLATO
elements or 1st quartile teachers who scored higher—would affect the overall results. In
contrast, logistic regression creates two groups (“more and less effective” teachers) that
are defined vis-à-vis some parameters, in this case percentile cut-off points, which
eliminates the potential for outlier effects and does not necessitate normally distributed
data.
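The sensitivity of OLS to a few extreme observations can be illustrated with a small synthetic example. The sketch below uses invented numbers (not study data) and a hypothetical helper, ols_slope, to show how adding a single teacher with a very high value-added coefficient and a low observation score changes the fitted slope.

```python
def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Synthetic data (hypothetical): observation score x, value-added coefficient y.
x = [1.0, 1.5, 2.0, 2.5, 3.0]
y = [-0.2, -0.1, 0.0, 0.1, 0.2]
slope_without = ols_slope(x, y)

# Add one outlier: a very high value-added teacher with a low observation score.
slope_with = ols_slope(x + [1.0], y + [1.5])

print(slope_without, slope_with)
```

In this toy case a single added observation is enough to reverse the sign of the slope. Coding the outcome as membership in a quartile, as the logistic models do, caps how far any one value-added coefficient can pull the fit, although it discards information about distances within each group.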
We also know that value-added estimates for individual teachers tend to fluctuate
significantly from year to year (McCaffrey, Sass, Lockwood, & Mihaly, 2009),
particularly those in the middle of the distribution of value-added scores. So including the
full range of teachers might mean that some teachers who are in our 2nd and 3rd quartiles
are misidentified and are really more or less effective at raising achievement than these
particular VA scores might suggest. Because of this, small distinctions between teachers
with similar value-added scores may be less meaningful. As a result, treating value-added
coefficients as a continuous variable is problematic. Blunter categories, such as quartiles
used in a logistic regression, may make more sense.
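The argument for blunter categories can be sketched as follows. The example assumes hypothetical value-added coefficients for eight teachers and a hypothetical helper, quartile_of; the cut points come from Python's statistics.quantiles with its default method, which is only one of several conventions for computing quartiles and is not necessarily the one used in the study.

```python
from statistics import quantiles


def quartile_of(score, cut_points):
    """Return the quartile (1-4) of score given the three quartile cut points."""
    return 1 + sum(score > c for c in cut_points)


# Hypothetical value-added coefficients for eight teachers (not study data).
va_year1 = [-0.30, -0.20, -0.10, -0.05, 0.05, 0.10, 0.20, 0.30]
cuts = quantiles(va_year1, n=4)  # three cut points

# A small year-to-year shift changes every teacher's continuous score.
va_year2 = [v + 0.02 for v in va_year1]

# Count teachers whose quartile assignment is the same in both years.
unchanged = sum(
    quartile_of(a, cuts) == quartile_of(b, cuts)
    for a, b in zip(va_year1, va_year2)
)
print(unchanged, "of", len(va_year1), "teachers keep the same quartile")
```

In this toy case the continuous scores all move while the quartile labels do not. The trade-off is that a teacher sitting very close to a cut point could switch categories after an equally small shift, so quartiles reduce, but do not remove, sensitivity to noise.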
Another possibility for the lack of strong relationships between VA scores and
scores on PLATO could be the influence of school context. As illustrated by the work of
Julie Cohen and Michelle Brown, certain classroom practices may be associated with
particular school contexts, which makes it more difficult to distinguish among teachers in
the same context. While value-added models control for school characteristics, they do
not address what economists call “unobservables,” which could include the school
culture, collegial interaction, curriculum, and instructional coherence.³ Schools that are
higher-functioning may have higher quality instruction overall.
³ Models with school fixed effects are designed to address these issues but also limit comparisons to teachers within the same school.
Implications: Leveraging measurement for improvement
Our experience designing PLATO suggests that it is not easy work developing a
systematic observation system and that it is even more difficult to train others to use such
a tool reliably. Given the challenges, we should be somewhat wary of the Race to the
Top mandate that districts develop teacher evaluation systems that include multiple
indicators, including classroom observations. Such systems will take time and expertise
to develop.
Despite these various issues, continuing to develop better measures of both
learning and classroom practice is work worth doing. Classroom observation systems
offer the potential to both measure and improve the quality of teaching. Part of the
challenge involved in improving the quality of teaching in our nation’s schools is the lack
of valid and reliable measures for assessing teaching effectiveness or tools for targeting
specific features of instruction (Gitomer, 2009). Without such tools, it is nearly
impossible to identify effective classroom practices and support teachers’ growth in
classroom instruction. Identifying classroom practices associated with high student
achievement gains and then targeting these practices in professional development
provides a potentially powerful approach for improving the quality of instruction for all
students. Value-added measures may be able to distinguish between teachers who have
differential impact on student achievement scores, but they reveal nothing about the
mechanisms through which teachers achieve this impact.
Among the middle school ELA classrooms we studied, the practices of modeling
and strategy instruction seem to be strong predictors of teachers’ effectiveness as
measured by value-added. The good news about this finding is that these are practices
that teachers can develop, providing a lever for instructional improvement. The fact that
they occur so infrequently also makes them an easy target for reform. Helping teachers
integrate more strategy instruction and modeling into their ELA lessons would provide
much more support for students in the areas of reading and writing. If these findings
continue to hold up, a next step would be to leverage these findings through targeted
professional development in these practices. We believe that the new tools and
technology being developed now have tremendous potential for raising the floor on
classroom practice.
References:
Biancarosa, G., & Snow, C. E. (2004). Reading next: A vision for action and research in
middle and high school literacy: A report from the Carnegie Corporation of New
York. Washington DC: Alliance for Excellent Education.
Boyd, D.J., Grossman, P.L., Lankford, H., Loeb, S., & Wyckoff, J. (2009). Teacher
preparation and student achievement. Educational Evaluation and Policy
Analysis, 31(4), 416-440.
Carnegie Corporation of New York. (1989). Turning Points: Preparing American youth
for the 21st century. New York: Carnegie Corporation of New York.
Cor, K. (2011). The measurement properties of the PLATO rubric. Paper presented at
American Educational Research Association, New Orleans, LA.
Danielson, C. (2007). Enhancing professional practice: A framework for teaching.
Alexandria, VA: Association for Supervision and Curriculum Development.
Denham, C., & Lieberman, A. (Eds.). (1980). Time to learn. Washington, DC: National
Institute of Education.
Gitomer, D. (Ed.). (2009). Measurement issues and assessment for teaching quality. Los
Angeles: Sage Publications.
Goe, L., Bell, C., & Little, O. (2008). Approaches to evaluating teacher effectiveness:
A research synthesis. Washington, DC: National Comprehensive Center for
Teacher Quality. Retrieved November 29, 2009, from
http://www.tqsource.org/publications/EvaluatingTeachEffectiveness.pdf
Grossman, P., Loeb, S., Cohen, J., Hammerness, K., Wyckoff, J., Boyd, D., et al. (2010).
Measure for Measure: The relationship between measures of instructional practice
in middle school English Language Arts and teachers’ value-added scores. NBER
Working Paper No. 16015.
Grossman, P., & McDonald, M. (2008). Back to the future: Directions for research in
teaching and teacher education. American Educational Research Journal, 45(1),
184-205.
Hill, H. (2005). Content across communities: Validating measures of elementary
mathematics instruction. Educational Policy, 19, 447-475.
Hoffman, J.V., Sailors, M., & Duffy, G. (2004). The effective elementary classroom
literacy environment: Examining the validity of the TEX-IN3 observation system.
Journal of Literacy Research, 36, 303-334.
Kane, T. J., Rockoff, J., & Staiger, D. (2008). What does certification tell us about
teacher effectiveness? Evidence from New York City. Economics of Education
Review, 27(6), 615-631.
Kane, T.J., Taylor, E.S., Tyler, J.H., & Wooten, A.L. (2010). Identifying effective
classroom practices using student achievement data. NBER Working Paper No. 15803.
Kennedy, M. M. (in press). Recognizing good teaching when we see it. To appear in
Handbook of Teacher Assessment and Teacher Quality. M. Kennedy (Ed.) San
Francisco, Jossey-Bass.
McCaffrey, D.F., Lockwood, J.R., Louis, T.A., & Hamilton, L. (2004). Models for value-
added modeling of teacher effects. Journal of Educational and Behavioral
Statistics, 29(1), 67-101.
McCaffrey, D., Sass, T., Lockwood, J., & Mihaly, K. (2009). The intertemporal
variability of teacher effect estimates. Education Finance and Policy, 4(4), 572-
606.
Newmann, F., Lopez, G., & Bryk, A. (1998). The quality of intellectual work in Chicago
schools. Chicago, IL: Consortium on Chicago School Research.
Pianta, R. C., LaParo, K. M., & Stuhlman, M. (2004). The classroom assessment scoring
system: Findings from the prekindergarten year. The Elementary School Journal,
104, 409-426.
Raudenbush, S.W. (2004). What are value-added models estimating and what does
this imply for statistical practice? Journal of Educational and Behavioral
Statistics, 29(1), 121-129.
Rice, J.K. (2003). Teacher quality: Understanding the effectiveness of teacher attributes.
Washington: Economic Policy Institute.
Rivkin, S., Hanushek, E., & Kain, J. (2005). Teachers, schools, and academic
achievement. Econometrica, 73(2), 417-458.
Rockoff, J. (2004). The impact of individual teachers on student achievement: Evidence
from panel data. American Economic Review, 94(2), 247-252.
Rothstein, J. (2009). Student sorting and bias in value-added estimation: Selection on
observables and unobservables. Education Finance and Policy, 4(4), 537-571.
Seidel, T.S., & Shavelson, R. (2007). Teaching effectiveness research in the past decade:
The role of theory and research design in disentangling meta-analysis results.
Review of Educational Research, 77(4), 454-499.
Snow, C. E., & Biancarosa, G. (2003). Adolescent literacy and the achievement gap:
What do we know and where do we go from here? Report of the Adolescent
Literacy Funders Meeting. New York: Carnegie Corporation of New York.
Taylor, B. M., Pearson, P. D., Peterson, D.S., & Rodriguez, M.C. (2005). The CIERA
School Change Framework: An evidence-based approach to professional
development and school reading improvement. Reading Research Quarterly,
40(1), 40–69.