Teacher evaluation presentation3 mass

John Cronin, Ph.D., Director

The Kingsbury Center @ NWEA

Implementing the framework for teacher evaluation in Massachusetts

Three definitions

• Performance/Status
• Growth
• Improvement

Examples

• Performance – 75% of the students in my 7th grade mathematics class will achieve the qualifying score needed for placement in 8th grade Algebra.

• Growth – 65% of my students will show growth on the NWEA mathematics test that is greater than the state reported norm.

• Improvement – Last year 40% of my students showed growth on the NWEA mathematics test that was greater than the norm. This year 50% of my students will show greater than normal growth.
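The three definitions can be made concrete with a short calculation. The following is a minimal sketch in Python, assuming hypothetical student records, a hypothetical qualifying score of 235, and a hypothetical growth norm for each student; none of these values come from the presentation.

```python
from dataclasses import dataclass


@dataclass
class Student:
    fall_score: float     # score at the start of the year
    spring_score: float   # score at the end of the year
    growth_norm: float    # typical (norm) growth for a student like this one


def percent(count, total):
    """Return count as a percentage of total (0.0 for an empty class)."""
    return 100.0 * count / total if total else 0.0


def performance_rate(students, qualifying_score):
    """Performance/status: share of students at or above a cut score."""
    met = sum(1 for s in students if s.spring_score >= qualifying_score)
    return percent(met, len(students))


def growth_rate(students):
    """Growth: share of students whose fall-to-spring gain exceeds the norm."""
    grew = sum(1 for s in students
               if (s.spring_score - s.fall_score) > s.growth_norm)
    return percent(grew, len(students))


def improvement(this_year_rate, last_year_rate):
    """Improvement: change in a rate from one year to the next, in points."""
    return this_year_rate - last_year_rate


# Hypothetical class of four students (illustrative values only).
roster = [
    Student(220, 236, 8),   # gain 16 > 8, meets the cut score
    Student(230, 234, 6),   # gain 4 <= 6, below the cut score
    Student(210, 221, 9),   # gain 11 > 9, below the cut score
    Student(228, 240, 7),   # gain 12 > 7, meets the cut score
]

print(performance_rate(roster, qualifying_score=235))  # 50.0
print(growth_rate(roster))                             # 75.0
print(improvement(50.0, 40.0))                         # 10.0
```

The last call mirrors the improvement example: moving from 40% to 50% of students showing greater than normal growth is a 10-point improvement.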

[Figure: Mathematics. Chart of the number of students by fall RIT score; legend: No Change, Down, Up.]

One district’s change in 5th grade math performance relative to Kentucky cut scores

[Figure: Mathematics. Chart of the number of students by the student’s score in fall; legend: Failed growth target, Met growth target.]

Number of students who achieved the normal mathematics growth in that district

Preliminary estimates of NY’s college ready proficiency cut scores relative to the NWEA scale

Grade   Mathematics             Reading
        Current NY   NY/Mass    Current   NY/Mass
3       209          199/199    207       198/191
4       220          207/205    214       205/200
5       234          214/221    220       212/210
6       233          221/230    223       216/213
7       241          224/237    227       220/214
8       245          235/242    229       223/215

Comparing NY and Mass cut scores to NY’s estimated college ready cut scores

[Figure: bar chart by grade, Grade 3 through Grade 8; vertical axis 0 to 90; series: Prior NY, Mass, Current. Data labels as extracted, in reading order: 39, 36, 33, 40, 37, 52, 59, 76, 77, 63, 68, 69, 69, 72, 82, 69, 75, 73.]

Comparing NY and Mass cut scores to NY’s estimated college ready cut scores - ELA

[Figure: bar chart by grade, Grade 3 through Grade 8; vertical axis 0 to 80; series: Prior NY, Mass, Current NY. Data labels as extracted, in reading order: 48, 46, 51, 51, 53, 53, 51, 60, 53, 41, 35, 29, 72, 71, 74, 71, 73, 69.]

Issues in the use of growth measures

Measurement design of the instrument

Many assessments are not designed to measure growth. Others do not measure growth equally well for all students.

Tests are not equally accurate for all students

[Figure: panels for California STAR and NWEA MAP.]

Issues in the use of growth measures

Instructional alignment

Tests used for teacher evaluation must align to the teacher’s instructional responsibilities.

Purpose of multiple measures

The strongest system evaluates with different measures and methodologies without the expectation that these will necessarily predict one another.

The problem with spring-spring testing

[Figure: timeline from 3/12 to 3/13 showing a student’s spring to spring growth trajectory across Teacher 1, summer, and Teacher 2.]

[Figure: test and retest results (Test 1, Test 2) for April ‘12 and April ‘13.]

Why growth results are likely to be inconsistent

Common problems with instructional alignment

• Using school level math and reading results in the evaluation of music, art, and other specials teachers.

• Using general tests of a discipline (reading, math, science) as a major component of the evaluation of high school teachers delivering specialized courses.

Florida Teachers Sue Over Evaluation System
New York Times, April 17, 2013

Seven Florida teachers have brought a federal lawsuit to protest job evaluation policies that tether individual performance ratings to the test scores of students who are not even in their classes. The suit, which was filed Tuesday in conjunction with three local affiliates of the National Education Association in Federal District Court for the Northern District of Florida in Gainesville, says Florida’s two-year-old evaluation system violates teachers’ rights of due process and equal protection. Under a 2011 law, schools and districts must evaluate teachers in part based on how much their students learn, as measured by standardized tests. But since Florida, like most states, administers only math and reading tests and only in selected grades, many teachers do not teach tested subjects. One of the plaintiffs, a first-grade teacher, was rated on the basis of test scores of students in a different school in her district, and another, who teaches vocational classes to aspiring health care workers, was rated based on test scores of students in grades and subjects she had never taught. “This lawsuit highlights the absurdity of the current evaluation system,” said Andy Ford, president of the Florida Education Association.

Testing

Metric (Growth Score)

Analysis (Value-Added)

Evaluation (Rating)

The testing to teacher evaluation process

Issues in the use of growth and value-added measures

Instability of results

A variety of factors can cause value-added results to lack stability.

The use of multiple years of data is highly recommended.


“Among those who ranked in the top category on the TAKS reading test, more than 17% ranked among the lowest two categories on the Stanford. Similarly more than 15% of the lowest value-added teachers on the TAKS were in the highest two categories on the Stanford.”

Corcoran, S., Jennings, J., & Beveridge, A., Teacher Effectiveness on High and Low Stakes Tests, Paper presented at the Institute for Research on Poverty summer workshop, Madison, WI (2010).

Teachers with growth scores in lowest and highest quintile over two years using NWEA’s Measures of Academic Progress

          Bottom quintile Y1&Y2   Top quintile Y1&Y2
Number    59/493                  63/493
Percent   12%                     13%

r = .64, r² = .41

Typical r values for measures of teaching effectiveness range between .30 and .60 (Brown Center on Education Policy, 2010)
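The reported figures can be checked with a few lines of arithmetic. This sketch in Python uses only the counts and the correlation reported above; the variable names are illustrative.

```python
# Counts reported above: teachers who stayed in the same extreme
# quintile in both years, out of 493 teachers.
total_teachers = 493
bottom_both_years = 59
top_both_years = 63

bottom_pct = round(100 * bottom_both_years / total_teachers)
top_pct = round(100 * top_both_years / total_teachers)

# Squaring the year-to-year correlation gives the share of variance in
# one year's growth scores that is accounted for by the other year's.
r = 0.64
r_squared = round(r ** 2, 2)

print(bottom_pct, top_pct, r_squared)  # 12 13 0.41
```

For comparison, if quintile placement in the two years were unrelated, about 4% of teachers (0.2 × 0.2) would be expected to land in the same extreme quintile in both years.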

Reliability of estimates



Colorado Growth Model

• Is a descriptive and NOT a causal model
• Estimates the progress of students relative to the progress that would be expected of students like these.
• Adjusts for two major factors
  – Starting score
  – Past trajectory of performance
• Does not adjust for
  – Income of the students
  – ELL status
  – Mobility
• Result is a ranking of the school’s growth relative to other schools.
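The idea of comparing a student's progress with "students like these" can be illustrated with a deliberately simplified sketch in Python. It is not the Colorado Growth Model's actual estimation method: it defines peers only by a band of prior scores (the `band_width` parameter is a hypothetical choice), ignores past trajectory, and uses made-up records.

```python
import statistics
from collections import defaultdict


def growth_percentiles(students, band_width=10):
    """Rank each student's current score among peers with a similar prior score.

    students: list of (student_id, prior_score, current_score) tuples.
    band_width: hypothetical width of the prior-score bands that define peers.
    Returns {student_id: percentile}, the share of peers (including the
    student) scoring at or below the student, from 0 to 100.
    """
    bands = defaultdict(list)
    for student_id, prior, current in students:
        bands[prior // band_width].append((student_id, current))

    result = {}
    for peers in bands.values():
        for student_id, current in peers:
            at_or_below = sum(1 for _, other in peers if other <= current)
            result[student_id] = 100.0 * at_or_below / len(peers)
    return result


# Hypothetical (student_id, prior_score, current_score) records.
records = [
    ("a", 200, 210), ("b", 203, 215), ("c", 207, 205), ("d", 209, 220),
    ("e", 221, 230), ("f", 225, 226),
]
percentiles = growth_percentiles(records)

# One possible school-level summary: the median percentile of its students.
school_ids = ["a", "b", "c", "d"]
school_median = statistics.median(percentiles[i] for i in school_ids)
print(percentiles["a"], school_median)  # 50.0 62.5
```

Because every percentile is defined relative to other students, the output describes where growth ranks and says nothing about what caused it, which matches the descriptive, non-causal framing above.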


Differences among value-added models

Los Angeles Times Study

Los Angeles Times Study #2

http://projects.latimes.com/value-added/value-added-comparison



Issues in the use of value-added measures

Control for statistical error

All models attempt to address this issue. Nevertheless, many teachers’ value-added scores will fall within the range of statistical error.


Control for statistical error

New York City

New York City #2

http://projects.wsj.com/nyc-teachers/



[Figure: Mathematics Growth Index Distribution by Teacher - Validity Filtered. Axis: average growth index score and range, from -12.00 to 12.00; teachers grouped into quintiles Q1 through Q5.]

Each line in this display represents a single teacher. The graphic shows the average growth index score for each teacher (green line), plus or minus the standard error of the growth index estimate (black line). We removed students who had tests of questionable validity and teachers with fewer than 20 students.

Range of teacher value-added estimates
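The display described above plots each teacher's average growth index plus or minus its standard error. A minimal sketch of that calculation in Python follows. It assumes a growth index of zero means growth equal to the norm, uses the sample standard deviation to estimate the standard error, and runs on made-up scores; the function names are hypothetical, and only the 20-student minimum comes from the presentation.

```python
import math
import statistics


def growth_estimate(growth_index_scores, min_students=20):
    """Return (mean, standard_error) for one teacher, or None if too few students."""
    n = len(growth_index_scores)
    if n < min_students:
        return None
    mean = statistics.fmean(growth_index_scores)
    std_err = statistics.stdev(growth_index_scores) / math.sqrt(n)
    return mean, std_err


def distinguishable_from_zero(mean, std_err):
    """True when the band of mean +/- one standard error excludes zero."""
    return abs(mean) > std_err


# Hypothetical student growth index scores for three teachers.
teacher_a = [4.0, -3.0] * 10   # noisy class of 20, mean 0.5
teacher_b = [3.0, 2.0] * 10    # consistent class of 20, mean 2.5
teacher_c = [5.0] * 10         # only 10 students, excluded

for name, scores in [("A", teacher_a), ("B", teacher_b), ("C", teacher_c)]:
    estimate = growth_estimate(scores)
    if estimate is None:
        print(name, "excluded: fewer than 20 students")
    else:
        mean, std_err = estimate
        print(name, round(mean, 2), round(std_err, 2),
              distinguishable_from_zero(mean, std_err))
```

In this illustration teacher A's average of 0.5 sits inside its own error band, so it cannot be separated from normal growth, while teacher B's average of 2.5 can.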

Issues with local tests and goal setting

• Validity and reliability of assessments.
• Teachers/administrators are unlikely to set goals that are inconsistent with their current performance.
• It is difficult to set goals without prior evidence or context.

Limitations of Colorado Growth Model

• The metrics are inherently NORMATIVE.
• Value-added metrics don’t measure improvement in the school population over time.
• Changes in the value-added model have the greatest effect on extreme cases.

What Makes Schools Work Study - Mathematics

[Figure: scatter plot of teacher growth index with axes Year 1 and Year 2, each from -10.0 to 15.0; series: Growth Index within Group, Growth Index relative to NWEA Norms.]

Data used represents a portion of the teachers who participated in Vanderbilt University’s What Makes Schools Work Project, funded by the federal Institute of Education Sciences



Establishing high, low, average growth

• When norms are available:
  – +1 standard deviation = high
  – Within 1 standard deviation = average
  – -1 standard deviation = low
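The standard deviation rule above can be written as a small function. This is a minimal sketch in Python, assuming the norm supplies a mean and a standard deviation of growth, and assuming that growth exactly 1 standard deviation from the mean counts as average; the slide does not specify how boundary cases are handled, and the function name and example values are hypothetical.

```python
def classify_growth(observed_growth, norm_mean, norm_sd):
    """Label growth as 'high', 'average', or 'low' relative to a norm."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    # Distance from the norm mean, in standard deviation units.
    z = (observed_growth - norm_mean) / norm_sd
    if z > 1:
        return "high"
    if z < -1:
        return "low"
    return "average"   # assumption: exactly +/-1 SD is treated as average


# Hypothetical norm: mean growth of 6 points, standard deviation of 3.
print(classify_growth(10, norm_mean=6, norm_sd=3))  # high
print(classify_growth(6, norm_mean=6, norm_sd=3))   # average
print(classify_growth(2, norm_mean=6, norm_sd=3))   # low
```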

The final rating

• Is inherently a judgment – not a bad thing.
• Evidence should inform and not direct the judgment.
• The implemented system should differentiate performance.

Suggested reading

Baker, B., Oluwole, J., & Green, P. (2013). The legal consequences of mandating high stakes decisions based on low quality information: Teacher evaluation in the Race to the Top Era. Education Policy Analysis Archives, 21(5).

Presenter - John Cronin, Ph.D.

Contacting us:
NWEA Main Number: 503-624-1951
E-mail: [email protected]

The presentation and recommended resources are available at our SlideShare site: http://www.slideshare.net/NWEA/tag/kingsbury-center

Thank you for attending
