NWEA presentation on teacher evaluation in Massachusetts
John Cronin, Ph.D., Director
The Kingsbury Center @ NWEA
Implementing the framework for teacher evaluation in Massachusetts
Three definitions
• Performance/Status
• Growth
• Improvement
Examples
• Performance – 75% of the students in my 7th grade mathematics class will achieve the qualifying score needed for placement in 8th grade Algebra.
• Growth – 65% of my students will show growth on the NWEA mathematics test that is greater than the state reported norm.
• Improvement – Last year 40% of my students showed growth on the NWEA mathematics test that was greater than the norm. This year 50% of my students will show greater than normal growth.
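The three definitions can be made concrete with a small calculation. The sketch below is illustrative only: the student records, the cut score of 235, and the norm growth values are made-up numbers rather than NWEA data, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Student:
    fall_score: float
    spring_score: float
    norm_growth: float  # hypothetical normative growth for this student


def pct(flags):
    """Percentage of True values in a list of booleans."""
    return 100.0 * sum(flags) / len(flags)


def performance(students, cut_score):
    """Performance/status: share of students at or above a cut score."""
    return pct([s.spring_score >= cut_score for s in students])


def growth(students):
    """Growth: share of students whose fall-to-spring gain exceeds the norm."""
    return pct([(s.spring_score - s.fall_score) > s.norm_growth for s in students])


def improvement(growth_this_year, growth_last_year):
    """Improvement: change in the growth percentage from one year to the next."""
    return growth_this_year - growth_last_year


# Made-up class of four students (all values are assumptions).
students = [
    Student(220, 236, 8),  # gain 16 exceeds norm 8, meets the cut score
    Student(225, 230, 7),  # gain 5 is below norm 7, misses the cut score
    Student(230, 240, 6),  # gain 10 exceeds norm 6, meets the cut score
    Student(210, 219, 9),  # gain 9 does not exceed norm 9, misses the cut score
]

status_pct = performance(students, cut_score=235)
growth_pct = growth(students)
improvement_pts = improvement(growth_pct, growth_last_year=40.0)
```

With these numbers, half the class meets the cut score, half shows above-norm growth, and the improvement over last year's 40% is 10 percentage points. The same class can therefore look different depending on which of the three definitions a goal uses.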
[Figure: Mathematics. Number of students by fall RIT score; categories: Up, Down, No Change.]
One district’s change in 5th grade math performance relative to Kentucky cut scores
[Figure: Mathematics. Number of students by student's score in fall; categories: Met growth target, Failed growth target.]
Number of students who achieved the normal mathematics growth in that district
Preliminary estimates of NY’s college ready proficiency cut scores relative to the NWEA scale
Grade   Mathematics               Reading
        Current NY   NY/Mass      Current   NY/Mass
3       209          199/199      207       198/191
4       220          207/205      214       205/200
5       234          214/221      220       212/210
6       233          221/230      223       216/213
7       241          224/237      227       220/214
8       245          235/242      229       223/215
Comparing NY and Mass cut scores to NY’s estimated college ready cut scores
[Figure: bar chart by grade (Grade 3 through Grade 8); y-axis 0 to 90; series: Prior NY, Mass, Current. Data labels in extraction order: 39, 36, 33, 40, 37, 52, 59, 76, 77, 63, 68, 69, 69, 72, 82, 69, 75, 73.]
Comparing NY and Mass cut scores to NY’s estimated college ready cut scores - ELA
[Figure: bar chart by grade (Grade 3 through Grade 8); y-axis 0 to 80; series: Prior NY, Mass, Current NY. Data labels in extraction order: 48, 46, 51, 51, 53, 53, 51, 60, 53, 41, 35, 29, 72, 71, 74, 71, 73, 69.]
Issues in the use of growth measures
Measurement design of the instrument
Many assessments are not designed to measure growth. Others do not measure growth equally well for all students.
Tests are not equally accurate for all students
[Figure panels: California STAR; NWEA MAP]
Issues in the use of growth measures
Instructional alignment
Tests used for teacher evaluation must align to the teacher’s instructional responsibilities.
Purpose of multiple measures
The strongest system evaluates with different measures and methodologies without the expectation that these will necessarily predict one another.
The problem with spring-spring testing
[Figure: Student's spring to spring growth trajectory, on a monthly timeline from 3/12 to 3/13. The interval spans Teacher 1, Summer, and Teacher 2.]
[Figure: test and retest of Test 1 and Test 2, April '12 and April '13.]
Why growth results are likely to be inconsistent
Common problems with instructional alignment
• Using school level math and reading results in the evaluation of music, art, and other specials teachers.
• Using general tests of a discipline (reading, math, science) as a major component of the evaluation of high school teachers delivering specialized courses.
Florida Teachers Sue Over Evaluation System
New York Times, April 17, 2013
Seven Florida teachers have brought a federal lawsuit to protest job evaluation policies that tether individual performance ratings to the test scores of students who are not even in their classes. The suit, which was filed Tuesday in conjunction with three local affiliates of the National Education Association in Federal District Court for the Northern District of Florida in Gainesville, says Florida’s two-year-old evaluation system violates teachers’ rights of due process and equal protection. Under a 2011 law, schools and districts must evaluate teachers in part based on how much their students learn, as measured by standardized tests. But since Florida, like most states, administers only math and reading tests and only in selected grades, many teachers do not teach tested subjects. One of the plaintiffs, a first-grade teacher, was rated on the basis of test scores of students in a different school in her district, and another, who teaches vocational classes to aspiring health care workers, was rated based on test scores of students in grades and subjects she had never taught. “This lawsuit highlights the absurdity of the current evaluation system,” said Andy Ford, president of the Florida Education Association.
Testing
Metric (Growth Score)
Analysis (Value-Added)
Evaluation (Rating)
The testing to teacher evaluation process
Issues in the use of growth and value-added measures
Instability of results
A variety of factors can cause value-added results to lack stability.
The use of multiple years of data is highly recommended.
Issues in the use of growth and value-added measures
“Among those who ranked in the top category on the TAKS reading test, more than 17% ranked among the lowest two categories on the Stanford. Similarly more than 15% of the lowest value-added teachers on the TAKS were in the highest two categories on the Stanford.”
Corcoran, S., Jennings, J., & Beveridge, A., Teacher Effectiveness on High and Low Stakes Tests, Paper presented at the Institute for Research on Poverty summer workshop, Madison, WI (2010).
Teachers with growth scores in lowest and highest quintile over two years using NWEA’s Measures of Academic Progress
           Bottom quintile Y1&Y2   Top quintile Y1&Y2
Number     59/493                  63/493
Percent    12%                     13%
r = .64, r² = .41
Typical r values for measures of teaching effectiveness range between .30 and .60 (Brown Center on Education Policy, 2010)
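The figures in the quintile table can be checked with a few lines of arithmetic. The sketch below recomputes the percentages and r² from the reported counts. The 4% chance baseline is an added assumption for comparison, not a figure from the presentation: it is what would be expected if a teacher's quintile in year 2 were completely independent of the quintile in year 1.

```python
def share(count, total):
    """Percentage of `total` represented by `count`."""
    return 100.0 * count / total


TOTAL_TEACHERS = 493      # from the table
BOTTOM_BOTH_YEARS = 59    # bottom quintile in year 1 and year 2
TOP_BOTH_YEARS = 63       # top quintile in year 1 and year 2

bottom_pct = share(BOTTOM_BOTH_YEARS, TOTAL_TEACHERS)
top_pct = share(TOP_BOTH_YEARS, TOTAL_TEACHERS)

# r is the reported year-to-year correlation; r squared is the share of
# variance in one year's scores associated with the other year's scores.
r = 0.64
r_squared = r ** 2

# Assumed baseline: with five equal groups and no year-to-year relationship,
# 1 in 25 teachers would land in the same extreme quintile twice.
chance_pct = 100.0 / (5 * 5)

print(f"bottom both years: {bottom_pct:.1f}%")
print(f"top both years:    {top_pct:.1f}%")
print(f"r squared:         {r_squared:.2f}")
print(f"chance baseline:   {chance_pct:.1f}%")
```

Under that assumption, roughly three times as many teachers stay in an extreme quintile as chance alone would produce, yet most of the year-to-year variance remains unexplained.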
Reliability of estimates
Testing
Metric (Growth Score)
Analysis (Value-Added)
Evaluation (Rating)
The testing to teacher evaluation process
Colorado Growth Model
• Is a descriptive and NOT a causal model
• Estimates the progress of students relative to the progress that would be expected of students like these.
• Adjusts for two major factors
  – Starting score
  – Past trajectory of performance
• Does not adjust for
  – Income of the students
  – ELL status
  – Mobility
• Result is a ranking of the school’s growth relative to other schools.
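The idea of comparing students with "students like these" can be illustrated with a deliberately simplified sketch. This is not the Colorado Growth Model itself, which estimates student growth percentiles with quantile regression and also conditions on past trajectory. Here students are only grouped into bands by starting score, each student is ranked within the band, and schools are ranked by their median percentile. The band width, function names, and data are all assumptions.

```python
from collections import defaultdict
from statistics import median


def growth_percentiles(records, band_width=10):
    """records: list of (school, prior_score, current_score).

    Returns (school, percentile) pairs, where percentile is the share of
    students in the same prior-score band who have a lower current score.
    """
    bands = defaultdict(list)
    for _school, prior, current in records:
        bands[int(prior // band_width)].append(current)

    result = []
    for school, prior, current in records:
        peers = bands[int(prior // band_width)]
        below = sum(1 for score in peers if score < current)
        result.append((school, 100.0 * below / len(peers)))
    return result


def rank_schools(records, band_width=10):
    """Rank schools by the median growth percentile of their students."""
    by_school = defaultdict(list)
    for school, percentile in growth_percentiles(records, band_width):
        by_school[school].append(percentile)
    medians = {school: median(values) for school, values in by_school.items()}
    return sorted(medians.items(), key=lambda item: item[1], reverse=True)


# Made-up records: (school, prior score, current score).
records = [
    ("A", 201, 215), ("A", 203, 212), ("A", 221, 230), ("A", 224, 229),
    ("B", 202, 208), ("B", 205, 206), ("B", 222, 226), ("B", 228, 225),
]
ranking = rank_schools(records)
```

Because every percentile is defined relative to other students, the output is purely descriptive and normative: it says which school's students grew more than similar peers, not why.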
Issues in the use of growth and value-added measures
Differences among value-added models
Los Angeles Times Study
Los Angeles Times Study #2
Issues in the use of value-added measures
Control for statistical error
All models attempt to address this issue. Nevertheless, many teachers' value-added scores will fall within the range of statistical error.
Issues in the use of growth and value-added measures
Control for statistical error
New York City
New York City #2
[Figure: Mathematics Growth Index Distribution by Teacher - Validity Filtered. Y-axis: Average Growth Index Score and Range, from -12.00 to 12.00; teachers grouped into quintiles Q1 through Q5.]
Each line in this display represents a single teacher. The graphic shows the average growth index score for each teacher (green line), plus or minus the standard error of the growth index estimate (black line). We removed students who had tests of questionable validity and teachers with fewer than 20 students.
Range of teacher value-added estimates
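A minimal sketch of how such a range could be computed for one teacher follows. It assumes each student's growth index has already been calculated, applies the 20-student minimum mentioned on the slide, and uses the ordinary standard error of the mean; the presentation does not state the exact error formula, so that choice and the function names are assumptions.

```python
from math import sqrt
from statistics import mean, stdev

MIN_STUDENTS = 20  # teachers with fewer students are excluded, as on the slide


def teacher_estimate(growth_index_scores):
    """Return (average, standard_error), or None when there are too few students."""
    n = len(growth_index_scores)
    if n < MIN_STUDENTS:
        return None
    average = mean(growth_index_scores)
    standard_error = stdev(growth_index_scores) / sqrt(n)
    return average, standard_error


def distinguishable_from_zero(average, standard_error):
    """True when the band average +/- one standard error excludes zero."""
    return abs(average) > standard_error


# Made-up growth index scores for three hypothetical teachers.
small_class = teacher_estimate([1.0] * 10)           # too few students
steady = teacher_estimate([3.0, 1.0] * 10)           # average 2.0, narrow band
noisy = teacher_estimate([4.0, -4.0] * 10)           # average 0.0, wide band

steady_clear = distinguishable_from_zero(*steady)
noisy_clear = distinguishable_from_zero(*noisy)
```

A teacher whose band crosses zero cannot be separated from average growth on this evidence, which is the situation the slide describes for many teachers in the middle quintiles.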
Issues with local tests and goal setting
• Validity and reliability of assessments.
• Teachers/administrators are unlikely to set goals that are inconsistent with their current performance.
• It is difficult to set goals without prior evidence or context.
Limitations of Colorado Growth Model
• The metrics are inherently NORMATIVE.
• Value-added metrics don’t measure improvement in the school population over time.
• Changes in the value-added model have the greatest effect on extreme cases.
What Makes Schools Work Study - Mathematics
[Figure: scatter plot of growth index in Year 1 and Year 2, both axes from -10.0 to 15.0; series: Growth Index within Group, Growth Index relative to NWEA Norms.]
Data used represents a portion of the teachers who participated in Vanderbilt University’s What Makes Schools Work Project, funded by the federal Institute of Education Sciences
Testing
Metric (Growth Score)
Analysis (Value-Added)
Evaluation (Rating)
The testing to teacher evaluation process
Establishing high, low, average growth
• When norms are available.
  – + 1 standard deviation = high
  – Within 1 standard deviation = average
  – - 1 standard deviation = low
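The rule above can be written as a short classification function. This is a sketch under assumptions: the norm mean and standard deviation used in the example are made-up values, and the slide does not say how to treat growth that sits exactly one standard deviation from the norm, so it is counted as average here.

```python
def classify_growth(observed_growth, norm_mean, norm_sd):
    """Label growth as high, average, or low relative to normative growth."""
    z = (observed_growth - norm_mean) / norm_sd
    if z > 1.0:
        return "high"
    if z < -1.0:
        return "low"
    return "average"  # within one standard deviation, boundaries included


# Hypothetical norm: mean growth of 6 points with a standard deviation of 4.
labels = [classify_growth(g, norm_mean=6.0, norm_sd=4.0) for g in (12.0, 6.0, 1.0, 10.0)]
```

For the four example values the labels are high, average, low, and average; the last case sits exactly on the boundary and shows why the cut-point convention needs to be decided in advance.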
The final rating
• Is inherently a judgment – not a bad thing.
• Evidence should inform and not direct the judgment.
• The implemented system should differentiate performance.
Suggested reading
Baker, B., Oluwole, J., & Green, P. (2013). The legal consequences of mandating high stakes decisions based on low quality information: Teacher evaluation in the Race to the Top Era. Education Policy Analysis Archives, 21(5).
Presenter - John Cronin, Ph.D.
Contacting us:
NWEA Main Number: 503-624-1951
E-mail: [email protected]
The presentation and recommended resources are available at our SlideShare site: http://www.slideshare.net/NWEA/tag/kingsbury-center
Thank you for attending