The Role of Calibration in Determining Educator Effectiveness

THE IMPORTANCE OF FAIR, RELIABLE, VALID AND DEFENSIBLE EVALUATIONS

By Albert “Duffy” Miller, Ed.D., President, Teaching Learning Solutions

EXECUTIVE SUMMARY

So why does teacher evaluation matter? Because teaching matters. The core of education

is teaching and learning, and the teaching-learning connection works best when we have

effective teachers working with every student every day. Simply put, high-quality evaluation

systems help ensure high-quality teachers.

Yet, assessing teacher effectiveness consistently and accurately is a challenge, and too

often, the decisions made about a teacher’s effectiveness are not defensible. To establish

quality teacher evaluation practices, evaluators must be fair, their work must be valid, and

the judgments they reach about a teacher’s practice must be reliable. Calibration, a solution

provided by TalentEd with the help of Teaching Learning Solutions (TLS), helps evaluators

build the skills they will need to complete fair and valid teacher evaluations, and through

calibration and assessment of the evaluators’ evidence and ratings of teachers’ practice, will

ensure that the judgments made are reliable and defensible.

Fairness in the evaluation process occurs when teachers have confidence that observers

follow similar processes, and have the necessary skills to gather and provide objective and

representative evidence of their teaching practice.

Calibration promotes fairness in several ways. The platform itself can be customized in

design to follow the district’s established and negotiated processes, and use of the platform

allows evaluator fidelity to those processes to be monitored. Platform

customization guarantees that the roles of both the teacher and evaluator are clearly

understood, and the responsibilities of each party are completed on the platform as part of

the evaluation process. And finally, as districts calibrate using the platform, the observers’

evidence (the product of their work gathered to ensure objectivity and representativeness),

and teacher ratings are assessed for quality and accuracy.

OBJECTIVE, WELL REPRESENTED EVIDENCE HAS SEVERAL BENEFITS AND USES

1. It provides a lens to the teacher through which she can view and reflect upon her practice

2. The evidence is a vehicle through which honest feedback can be provided

3. The use of objective evidence establishes trust in the evaluators’ work

Validity is established when the criteria used to assess teacher practice are representative

of effective teaching, and when there is a research base that establishes what represents

effective practice and establishes how that practice correlates with student achievement.

Calibration is designed for the use of any rubric selected by a district, or approved by the

district’s regulatory agency (e.g., State Department of Education). While the validity of some

rubrics has been established through research, such as the CLASS, FfT, MQI, and PLATO

criteria, in other circumstances districts may revise existing criteria to bring a tighter focus to

practices that they know from their own study of their students’ achievement have a positive

influence. Working with the district or state, TalentEd/TLS provide aligned and scored videos

for use with multiple criteria as part of the calibration process. If a district has revised existing

rubrics, or created their own rubrics, TLS scorers will score videos against the criteria, and

as districts move forward with implementation, will provide consultation to ensure that the


effective levels of performance are validated by high levels of student achievement using

district achievement data. Additionally, when needed, TLS will provide custom professional

development to district evaluators ensuring that they understand the priorities of the selected

scoring criteria, thus promoting accuracy in the scoring process.

Teachers want assurance that multiple evaluators will reach agreement when assessing their

performance. Nothing will destroy trust in the evaluation process as much as two different

evaluators reaching different conclusions about a teacher’s level of effectiveness. District

executives and boards must be confident that the ratings given to teachers are consistent

interpretations of the criteria used for evaluation. When the interpretations are accurate there can

be confidence that the evaluation process will help identify those teachers with the skills

necessary to lead, support and mentor the work of their colleagues, and to identify which

teachers need support to move their practice to a higher level of effectiveness. The Calibration

solution provides districts with a tool to assess both inter-rater agreement and inter-rater

reliability, thus assuring teachers that the ratings given for performance will be consistent

among observers, and assuring district leaders that the evaluators are accurately assessing

and identifying both practice that is exemplary and practice in need of more support and development.

Let’s briefly review how Calibration supports effective teacher evaluation practices. Teacher

evaluation is, first, about documenting the quality of teacher performance. Then, its focus

shifts to helping teachers improve their performance as well as holding them accountable for

their work. The most effective way to evaluate teachers is to watch them teach, and those doing

the watching must agree on what the performance should be and how the observation should be

done.


Calibration provides a structure for observation of teachers that:

• Can incorporate any rubric selected by a district,

• Uses video training on a customized platform to teach the observer to follow established

processes and recognize the teaching skills desired by the district,

• Teaches the observers to produce valid observations by using video training to rate the

observed behaviors against a master score,

• Trains the observers to produce reliable observations: ratings that remain consistent

across time and across different observers, no matter how many people observe or how

often they do so.

There is also a great deal of evidence that says that, in order to develop an effective teacher,

we need to observe that teacher regularly, provide specific feedback to the teacher, and offer

differentiated professional development opportunities that support growth in identified areas.

The most powerful part of any observation is the feedback that comes from the observation

process. Effective evaluation models are based on frequent observations, and focus on

improving teacher effectiveness, especially when paired with coaching that supports the

teacher in areas where growth is needed. A sound, valid, and defensible observation process

is the foundation upon which teacher development is built.

The data gathered in the Calibration solution give districts critical information for identifying areas in

which instructional improvement is needed and areas where strength exists, and inform the design of

school improvement and personal growth for all teachers at all levels.


EFFECTIVE TEACHERS RESULT IN HIGHER STUDENT ACHIEVEMENT

Education leaders and policymakers have grappled with the difficult issue of education

reform, a topic that has received only lip service in the past. Because there is now a body of

recent research that links student achievement with teaching behavior, the improvement of

teaching practice becomes crucial. How do we help all teachers reach their full potential in

the classroom, and how do we respond to ineffective teaching?

None of these issues can be addressed without better teacher observation and evaluation

systems. Evaluation/observation models need to incorporate elements of effective coaching

practices that use evidence from the observations to provide teachers with evidence-based

feedback that helps them grow as professionals, whether they are developing or highly

effective in practice. The results of the observation process need to provide school leaders

with a consistent, appropriate and authentic way to measure teachers’ practice in order to

support each teacher’s development, and to give the schools the data they need to build the

strongest possible support and instructional teams.

While we expect that evaluation models provide a vehicle to do all of these things, in most

cases, they haven’t even come close. Instead, they have typically been perfunctory compliance


exercises that rated all teachers “good” or “great” and yielded little useful information. As

Secretary of Education Arne Duncan noted in a summer 2010 speech, “our system of teacher

evaluation… frustrates teachers who feel that their good work goes unrecognized and ignores

other teachers who would benefit from additional support.”

The research that shows the correlation between teacher behavior and student success

covers a broad area of the country and is documented throughout many discipline areas. For

instance, in the late 1990s, all 184 schools in Milwaukee — private, public, and charter schools

— focused on improving quality of instruction and student outcomes across the system.

They agreed that one way to improve was through direct observation of instruction and by

providing tools for, and support to, principals’ classroom observation.

Their conclusion, as reported by the Brookings Institution, was that “Having a top-quartile

teacher rather than a bottom-quartile teacher four years in a row could be enough to

close the black-white test score gap.”1 Researchers looked at a large data set on

individuals in grades 3 through 7 in the State of Texas. This data set permitted analysis of

the question of differences in teacher quality in the determination of student outcomes in

2005. Their conclusion, as reported in the journal Econometrica, was that “Having a high-

quality teacher throughout elementary school can substantially offset or even eliminate the

disadvantage of low socio-economic background. Teachers and, therefore, schools matter

importantly for student achievement.”2

Researchers have also shown that the best predictor of a teacher’s effectiveness is his or

her previous success with children in the classroom. Most other factors pale in comparison,

including a teacher’s preparation route, advanced degrees, and even experience level (after

the first few years).

THE LESSON IS CLEAR: TO ENSURE THAT EVERY CHILD LEARNS FROM THE

MOST EFFECTIVE TEACHERS POSSIBLE, SCHOOLS MUST BE ABLE TO GAUGE

THEIR TEACHERS’ PERFORMANCE FAIRLY AND ACCURATELY.3


In Tennessee, researchers reported that the two most important factors impacting student

gain are differences in classroom teacher effectiveness and the prior achievement level of the

student. The teacher effect is highly significant in every analysis and has a larger effect than

any other factor in twenty of the thirty analyses.

Researchers at the University of Washington reported that “The effect of increases in teacher

quality swamps the impact of any other educational investment, such as reductions in class

size.” Education research convincingly shows that teacher quality is the most important

schooling factor influencing student achievement. A very good teacher, as opposed to a very

bad one, can make as much as a full year’s difference in learning growth for students.4

The research done by Wright, Horn and Sanders sums it up as “More can be done to improve

education by improving the effectiveness of teachers than by any other single factor.”5

Years of research have proven that nothing schools can do for their students matters more

than giving them effective teachers. A few years with effective teachers can put even the

most disadvantaged students on the path to college. A few years with ineffective teachers

can deal students an academic blow from which they may never recover.

VALUE OF EVALUATIONS AND THE IMPACT OF OBSERVATIONS

A major portion of any teacher evaluation is the observation of that teacher. An observation

is usually done by an administrator in the building who watches a sufficient part of a lesson,

matches it to criteria, and then discusses the observation with the teacher. The feedback/

coaching loop includes that discussion and a plan for some type of intervention so that

improvement can happen. There is a built-in assumption that everyone can improve.

Effective evaluation models incorporate multiple observations, conducted by one, two,

or even more evaluators.


Evaluations, including the observation component, should provide all teachers with regular

feedback that helps them grow as professionals, no matter how long they have been in the

classroom. Observations and evaluation models should give schools the information they

need to build the strongest possible instructional teams, and help districts hold school leaders

accountable for supporting each teacher’s development. These models must include direct

observations of the teachers working with the students, and must be done in a systematic

and regular fashion.

One report, The Widget Effect: Our National Failure to Acknowledge and Act on Differences

in Teacher Effectiveness, details the difficulties shown by most teacher evaluation systems,

providing evidence of a multitude of design flaws. These include the fact that most

evaluation systems presently in use are:

INFREQUENT: Many teachers — especially more experienced teachers — aren’t

evaluated every year. These teachers might go years between receiving any

meaningful feedback on their performance.

UNFOCUSED: A teacher’s most important responsibility is to help students learn,

yet student academic progress rarely factors directly into evaluations. Instead,

teachers are often evaluated based on superficial judgments about behaviors

and practices that may not have any impact on student learning—like the

presentation of their bulletin boards.

UNDIFFERENTIATED: In many school districts, teachers can earn only two

possible ratings: “satisfactory” or “unsatisfactory.” This pass/fail system makes it

impossible to distinguish great teaching from good, good from fair, and fair from

poor.

UNHELPFUL: In many of the districts studied, teachers overwhelmingly reported

that evaluations don’t give them useful feedback on their performance in the

classroom.


INCONSEQUENTIAL: The results of evaluations are rarely used to make

important decisions about development, compensation, tenure or promotion.6

Teachers contribute to student learning in ways that can largely be observed and measured.

Through focused, rigorous observation of classroom practice, examination of student work,

and analysis of students’ performance on high-quality assessments, it is possible to accurately

distinguish effective teaching from ineffective teaching.

Observation/evaluation data should form the foundation of teacher development.

Although there must be meaningful consequences for consistently poor performance, the

primary purpose of evaluations should not be punitive. Good evaluation models identify

excellent teachers and help teachers of all skill levels understand how they can improve;

they encourage a school culture that prizes excellence and continual growth. With better

teacher evaluations in place, school districts can also do a better job holding school leaders

accountable for doing their most important job: helping teachers reach their peak. Removing

persistently underperforming teachers is a necessary but insufficient step for building a

thriving teacher workforce.7

FAIRNESS, RELIABILITY AND VALIDITY = DEFENSIBLE EVALUATIONS

There are many instances in which an observation of a research participant’s performance is

made by someone such as an instructor, a teacher, or some other professional. Because of the

wide variability that can occur in the scoring of performance, it is important to determine the

consistency of such an evaluation. Evaluation of the degree of agreement that exists between

two scorers or raters is referred to as inter-rater agreement. In this context, agreement

means that scores of different observers using the same criteria are consistent. Simply, when

multiple observers score the same teacher’s practice, their scores should be nearly the same

when using the same criteria.


The simplest way to determine the degree of consistency between two or more scorers is

to compute a correlation coefficient between the scores provided by the different scorers.

When that correlation is high, a teacher whose classroom is visited by two observers can feel confident that the scores

he/she receives from both evaluators are consistent. The teacher would soon lose trust in the

evaluation process if one evaluator’s scores were significantly different from the scores of the

second evaluator.
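
The correlation check described above can be sketched in a few lines of Python. This is a minimal illustration and not part of the Calibration platform: the function names, the 1 to 4 rubric scale, and the sample ratings are all assumptions made for the example. It also reports the share of exactly matching scores, because a correlation alone can hide a systematic gap between two scorers.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from each mean.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a rater gives a constant score")
    return cov / sqrt(var_x * var_y)


def exact_agreement(x, y):
    """Fraction of items on which both raters gave the identical score."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


# Hypothetical ratings from two observers on six rubric components (1-4 scale).
observer_a = [3, 2, 4, 3, 1, 2]
observer_b = [3, 2, 3, 3, 1, 2]

r = pearson(observer_a, observer_b)
agree = exact_agreement(observer_a, observer_b)
print(f"correlation = {r:.2f}, exact agreement = {agree:.0%}")
```

One design note: if one observer always scores exactly one level higher than the other, the correlation is perfect even though the two never assign the same rating, which is why the sketch reports both numbers.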

Inter-rater agreement is assessed through a procedure in which two or more individuals observe

the same person’s practice. The observers record

their scores of the practice and then compare scores to see if they are similar or different.

This procedure has the advantage of offsetting any bias that one individual might bring to scoring,

and of adjusting for a single instance in which a teacher or observer faces a particularly challenging

lesson. It has the disadvantage of requiring the researcher to train the observers and requiring

the observers to negotiate outcomes and reconcile differences in their observations.

Frequently, the agreement between two or more scorers is not very strong unless some

degree of training and practice precedes the scoring. Fortunately, with training, the degree

of agreement can improve. The important issue is that training is often required and that a

measure of the reliability of an evaluation of performance by scorers is necessary.


If an evaluation instrument is to be used in making decisions, it must be valid as well as

reliable. What is referred to as “content validity” is the extent to which the questions on

the instrument and the scores from these questions are representative of all the possible

questions that a researcher could ask about the skills. Typically, researchers go to a panel of

experts and have them identify whether the questions are valid. That is, they ask:

• Do the items represent the thing you are trying to measure?

• Have you excluded any important content areas?

• Have you included any irrelevant items?

In some schools, validation is done by taking an average of multiple evaluations. These

evaluations may be performed at different times by the same individual or by different

people. But taking an average of invalid or inaccurate evaluations does not necessarily result

in a valid result. This is a much less defensible method than using a gold standard by which

all evaluations are measured. In establishing such a standard, the experts will make a judgment

about how well the task samples the content. They then make a judgment as to whether the

behavior adequately represents the content. They finally make a judgment of the degree to

which the evidence supports the validity. To measure validity, observers’ ratings must be

measured against a “true” or master score. A critical component of the calibration training

process is master-scored videos. Thus, if an observation is to be considered both reliable and

valid, a teacher has to be confident that his/her rating has been accurately scored against

the selected criteria. Inter-rater reliability and validity provide an assurance to teachers that

the evaluators who rate their practice have high levels of agreement with one another, and

that the ratings they apply are an accurate and reliable interpretation of their practice when

scored against the selected criteria.
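
As a rough illustration of measuring observers against a master score, the sketch below compares one observer’s ratings of a master-scored video with the expert ratings. It is a hedged example only: the function name, the 1 to 4 scale, the sample scores, and the "within one level" tolerance are assumptions for illustration, not features of any particular rubric or of the Calibration platform.

```python
def accuracy_against_master(observer_scores, master_scores):
    """Summarize how closely an observer's ratings match the master scores."""
    if len(observer_scores) != len(master_scores) or not master_scores:
        raise ValueError("need two non-empty score lists of equal length")
    # Positive difference: observer rated higher than the master scorer.
    diffs = [o - m for o, m in zip(observer_scores, master_scores)]
    n = len(diffs)
    return {
        # Share of components scored identically to the master score.
        "exact": sum(d == 0 for d in diffs) / n,
        # Share scored within one rubric level (an assumed tolerance).
        "within_one": sum(abs(d) <= 1 for d in diffs) / n,
        # Average signed difference; above zero suggests lenient scoring.
        "mean_bias": sum(diffs) / n,
    }


# Hypothetical master scores and one observer's scores for the same video.
master = [3, 2, 4, 1, 3]
observer = [3, 3, 4, 2, 3]

report = accuracy_against_master(observer, master)
print(report)
```

Reporting the signed bias alongside the match rates helps distinguish an observer who is inconsistent from one who is consistently lenient or severe, since the two call for different follow-up training.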


In this context, the term inter-rater agreement is used when referring just to agreement

between the scorers. Inter-rater reliability refers to both agreement between scorers and

content validity.8,9 Thus, validating based on an average might result in inter-rater agreement,

but only validating against an expert standard will yield inter-rater reliability. An average

doesn’t give the reader the same picture as looking at the reliability. It doesn’t take into

account whether the content that you’re being rated on is content that has been agreed upon

as important to observe. For example, observers might agree that a teacher is not dressed

properly for teaching, but teacher attire might not be something that everyone agrees really

matters in the way children learn. The raters would all agree, but attire would not be a valid

measure of teaching.

RESEARCH-BASED EDUCATION METHODOLOGIES SUPPORT EVIDENCE

Classroom observations have long been a staple of most teacher evaluation systems but they

have generally not been “fair, reliable and valid”. It’s only recently that research has agreed

on what behaviors make for a good teacher. The observations and rubrics used in teacher

evaluations were not previously grounded in clearly articulated models or frameworks

of effective teaching. In contrast, the 2007 Framework for Teaching, the Classroom

Assessment Scoring System, or CLASS, and other newer observation criteria are explicitly

grounded in models of effective teaching and/or sets of teacher performance standards.


Observation protocols are used with rubrics that differentiate among various levels of teacher

performance. Evaluators, including principals, district administrators and other teachers, are

expected to provide detailed records of what they observe, organized around the standards

and rubrics. Thus, these observations are characterized as evidence-based and are less likely

to rely on subjective judgments that are unsupported by evidence or concrete examples.

Data from these observations can then be used to:

• Provide detailed feedback to teachers

• Plan tailored opportunities for professional learning for teachers

• Contribute to overall ratings of teacher performance when combined with other data

The Framework for Teaching, developed by Charlotte Danielson, an education consultant

from Princeton, New Jersey, is one of the most-used designs for assessing instruction

in the United States. It can be used with K-12 teachers in core content areas such as

mathematics, English language arts, science, history/social studies, and world languages.

Danielson’s framework features four domains: planning and preparation, classroom environment,

instruction, and professional responsibilities. Trained evaluators assess teachers in all four

domains with classroom observations focusing on teacher performance in two of these domains:

classroom environment and instruction. In most districts that use the Framework for Teaching —

or modified versions of it — teachers are observed multiple times in a school year, through both

formal and informal (unannounced) observations. Research from Cincinnati and Chicago schools

indicates that teachers’ observation ratings based on the Framework for Teaching are related

to their effects on student achievement gains. The Danielson Framework has been approved by

many states.10, 11

A different model, the Classroom Assessment Scoring System (CLASS) teacher assessment

model, was first developed by Robert Pianta and his colleagues for use in early childhood

settings. In a recent Center for American Progress report, Pianta outlined the ways that CLASS

can be employed across subject areas and grade levels to assess teachers’ interactions with

children in three broad domains: classroom organization, instructional support, and emotional

support.12


These domains are common from preschool through the 12th grade. CLASS identifies

particular types of interaction within each domain that vary by grade. It assesses “effective

teacher-student interactions across pre-K–12 in a way that is sensitive to important

developmental and context shifts that occur as students mature.”

In research on CLASS and a precursor to CLASS — the Classroom Observation System —

Pianta and colleagues have reported relationships between teachers’ ratings based on these

observation protocols and achievement gains at the preschool and elementary school levels.

A third model, the New York State United Teachers’ (NYSUT) Teacher Practice Rubric,

2012 Edition, has also been approved by the New York State Education Department. In a

comparison of the NYSUT and Danielson models, it is fairly clear that the specific behaviors which could

be observed in a classroom are consistent across models. These “criteria” are structured

as detailed rubrics in each of the models so that observers can be trained to differentiate

according to level. The table below details some of the specifics of the Danielson model and

the NYSUT model so that the reader can see the consistencies. More information can be

found in the original documents.13

DANIELSON MODEL

DOMAIN 1 – PLANNING AND PREPARATION

Demonstrating knowledge of content and pedagogy

Demonstrating knowledge of students

Setting instructional objectives

Demonstrating knowledge of resources

Designing coherent instruction

Designing student assessment

DOMAIN 2 – CLASSROOM ENVIRONMENT

Creating an environment of respect and rapport

Establishing a culture of learning

Managing classroom procedures

Managing student behaviors

Organizing physical space

DOMAIN 3 – INSTRUCTION

Communicating with students

Using questioning and discussion techniques

Engaging students in learning

Using assessment in instruction

Demonstrating flexibility and responsiveness

DOMAIN 4 – PROFESSIONAL RESPONSIBILITY

Reflecting on teaching

NYSUT MODEL

1. KNOWLEDGE OF STUDENTS AND STUDENT LEARNING

1.1 Child development

1.2 Research-based knowledge of learning

1.3 Knowledge of diversity

1.4 Knowledge of students and families

1.5 Response to diversity

1.6 Knowledge of technology & effect on learning

2. KNOWLEDGE OF CONTENT AND INSTRUCTIONAL PLANNING

2.1 Knowledge of content

2.2 Cross-discipline, creative thinking

2.3 Instructional strategies

2.4 Align with learning standards

2.5 Connect to prior knowledge

2.6 Utilize materials to promote student success

3. INSTRUCTIONAL PRACTICE

3.1 Engage and challenge all students

3.2 Communicate clearly and accurately

3.3 Set high expectations and challenge students

3.4 Use a variety of instructional approaches

3.5 Develop multidisciplinary skills

3.6 Assess, monitor and adapt instruction

4. LEARNING ENVIRONMENT

4.1 Create a safe and supportive environment

4.2 Create a challenging and stimulating environment

4.3 Manage the learning environment

4.4 Create a safe and productive learning environment

5. ASSESSMENT FOR STUDENT LEARNING

5.1 Use a range of assessment practices

5.2 Use data to monitor and plan

5.3 Communicate aspects of the evaluation system

5.4 Evaluate effectiveness of the measurement

5.5 Prepare students to understand the assessment system

6. PROFESSIONAL RESPONSIBILITIES AND COLLABORATION

6.1 Uphold professional standards and policy

6.2 Engage and collaborate with colleagues and community

6.3 Collaborate with families

6.4 Manage and perform non-instructional duties

6.5 Comply with laws and policies

7. PROFESSIONAL GROWTH

7.1 Reflect on practice

7.2 Engage in professional development

7.3 Communicate and collaborate to improve practice

7.4 Remain current in content knowledge and pedagogy


It is evident in examining the above criteria that not all of the parts can be observed directly.

While all the pieces are necessary to produce a strong learning environment, a classroom

observer on a visit to the classroom will be able to see only select pieces. In the Danielson

model, Domains 2 and 3 (Classroom Environment and Instruction) would be observable. In

the NYSUT model, Instructional Practice, the Learning Environment, and some indicators of

Assessment for Student Learning could be observed.

The Framework for Teaching, CLASS, NYSUT criteria and other newer observation

criteria have the potential to support implementation of the Common Core standards and

assessments in a number of ways. First, these criteria provide teachers and administrators

with a common language for analyzing and documenting teaching practices. Second, the

use of these criteria provides teachers and evaluators with a guide against which evidence

of teachers’ strengths and areas in need of improvement can be assessed. Third,

the observation results can be used to make decisions about professional development for

individual teachers.

Districts face challenges in implementing observation criteria. Districts must be able to

ensure the validity and reliability of the protocols by implementing a standardized approach

to training evaluators and monitoring evaluators’ ratings. In addition, using new protocols

requires principals to demonstrate instructional leadership, provide timely and meaningful

feedback to teachers, and connect teachers to relevant opportunities for professional learning

and development. Despite their strong potential to contribute to changes in instruction, it

is important to note that the Framework for Teaching and other models are generic with

regard to subject matter and, as noted, can be used across grade levels and content areas.

On one hand, this leads to efficiencies for districts in that they only need to train evaluators to

use one observation protocol. At the same time, it also means that such observation criteria

are not able to measure or directly promote teachers’ pedagogical content knowledge. If,

as many believe, pedagogical content knowledge is necessary in order to implement the

Common Core in mathematics and English language arts, some observation criteria would

need to be supplemented by other forms of teacher evaluation, or by revisions that privilege

the type of pedagogical practice necessary for effective teaching in a specific

content area.14, 15, 16, 17


Given the variety of possible criteria to select, districts may very well opt to adapt well-

researched criteria with some of their own priorities, focusing on identified instructional needs.

The criteria chosen need to be flexible enough to adapt to the district’s choices.

ALIGNING TO DISTRICT-DEFINED FRAMEWORK AND EXPECTATIONS

Teaching begins with a teacher’s understanding of what is to be learned and how it is to

be taught. It proceeds through a series of activities during which the students are provided

specific instruction and opportunities for learning, though the learning itself ultimately

remains the responsibility of the students. The specific methods of teaching, the specific

content to be taught, and the scope and sequence of what is taught (the “curriculum”) are

determined by the school district. When teachers are evaluated and, specifically, when they

are observed teaching, the behavior of the teacher is rated against the district’s expectations.

Any system which is designed to evaluate that observation needs to be sufficiently flexible to

take the district’s expectations into account.

At the same time, there is a research base which details what makes “good” teaching. The

various models described previously (Danielson, etc.) have been validated by comparing

them with student progress, as well as validated by expert teachers. Whatever system is

used for teacher evaluation and observation needs to build on that research base. Teacher


observations and evaluations are valid when using research-based criteria, and when

observers collect objective evidence that is correctly aligned with the selected criteria against

which teachers’ practice is measured.

According to the criteria, most individuals doing evaluations in a district are in administrative

positions and have administrative credentials. With the increase in the number of evaluations,

most districts are asking almost all of their administrative staff to do some observations. This

makes it all the more important that these observers be trained using the same model. What

observers look for and how they rate it need to be consistent. Calibration provides that

training. Moreover, after training, the observations submitted can be examined for consistency

across individual observers. If one observer’s ratings regularly differ from the others’, more

training is necessary.
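
The consistency check described above can be pictured with a minimal sketch. It assumes every observer has rated the same set of recorded lessons, uses the median rating on each lesson as a stand-in for consensus, and flags any observer whose average gap from that consensus reaches a threshold. The function name, the sample ratings, and the threshold of 1.0 rubric point are illustrative assumptions, not the statistical treatment used in the Calibration platform.

```python
from statistics import median


def flag_divergent_observers(ratings, threshold=1.0):
    """Return the observers whose ratings sit far from the group consensus.

    ratings maps an observer name to a list of rubric scores, one per lesson,
    with every observer scoring the same lessons in the same order.
    threshold is a hypothetical cut-off in rubric points.
    """
    lengths = {len(scores) for scores in ratings.values()}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("every observer must rate the same non-empty set of lessons")
    n_lessons = lengths.pop()

    # Consensus for each lesson: the median of all observers' scores.
    consensus = [
        median(scores[i] for scores in ratings.values()) for i in range(n_lessons)
    ]

    # Average absolute gap between each observer and the consensus.
    gaps = {
        name: sum(abs(s - c) for s, c in zip(scores, consensus)) / n_lessons
        for name, scores in ratings.items()
    }
    return sorted(name for name, gap in gaps.items() if gap >= threshold)


# Hypothetical ratings on a 1-4 rubric for four recorded lessons.
ratings = {
    "Observer A": [3, 2, 4, 3],
    "Observer B": [3, 2, 3, 3],
    "Observer C": [3, 3, 4, 3],
    "Observer D": [1, 1, 2, 2],
}
flagged = flag_divergent_observers(ratings)
print(flagged)
```

In this made-up data only Observer D rates consistently lower than the group, so only Observer D would be a candidate for additional training.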

IMPROVING INTER-RATER RELIABILITY AMONG OBSERVERS USING A CALIBRATION PLATFORM

Calibration has the flexibility to incorporate any criteria (rubric) selected by the district.

When teachers are observed, the observers must be consistent in how they rate

specific teaching skills, whether they are using already published criteria or

the district’s own criteria. The observers need to be trained very specifically to be accurate

in evaluating what they are observing. Calibration incorporates expertly scored videos

against which observer accuracy is measured. These videos are used to train the observers

so that they are actually matching what they observe to the specific rubrics. If districts have

constructed their own rubrics, or revised existing versions of published rubrics, videos for

assessing evaluator accuracy can be scored by experts to ensure consistent interpretation

of the district’s criteria. The training of the observers is one of the three measures that make

Calibration defensible and usable, because rating the behaviors against a master scorer

assures an evidence-based “fair” system.
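
As an illustration of measuring observer accuracy against an expertly scored video, the sketch below compares one observer's ratings with the master scores and reports two simple rates: exact agreement and agreement within one rubric level. The function name, the 1-4 scale, and the sample scores are assumptions made for this example; the source does not specify how the platform computes accuracy.

```python
def agreement_with_master(observer_scores, master_scores):
    """Return (exact, adjacent) agreement rates against the master scores.

    exact    = share of components where the observer matches the master score
    adjacent = share of components where the observer is within one level
    """
    if len(observer_scores) != len(master_scores) or not master_scores:
        raise ValueError("score lists must be non-empty and the same length")
    pairs = list(zip(observer_scores, master_scores))
    exact = sum(1 for o, m in pairs if o == m) / len(pairs)
    adjacent = sum(1 for o, m in pairs if abs(o - m) <= 1) / len(pairs)
    return exact, adjacent


# Hypothetical scores on a 1-4 rubric for eight components of one video.
master = [3, 2, 4, 3, 1, 2, 3, 4]
observer = [3, 2, 3, 3, 2, 2, 4, 1]
exact, adjacent = agreement_with_master(observer, master)
print(f"exact agreement: {exact:.3f}, within one level: {adjacent:.3f}")
```

Reporting both rates separates an observer who is usually close to the master score from one who is frequently far off, which matters when deciding how much retraining is needed.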

In addition to making sure that the observer rates the very specific behavior given in the rubric,

there is the question of who is doing the rating and how many times that rating is done,

that is, the consistency across the different ratings, also known as inter-rater reliability.


As an example, if a teacher is to be rated six times and the Principal does three observations

and the Assistant Principal does the other three, do both observers agree on what they have

seen? Here also, the video-based training given the observers helps with that consistency.

In this case, the TalentEd/TLS protocol ensures that the score again shows the use of an

evidence-based process.
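
Whether two observers agree on what they have seen can be quantified once both have rated the same lessons, for example the same training videos. One common statistic for this is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is illustrative only: the source does not say which statistic the TalentEd/TLS protocol uses, and the ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items on a categorical scale."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items given the same score by both raters.
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n

    # Chance agreement: based on how often each rater uses each score.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used one and the same score throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of the same six lessons on a 1-4 rubric.
principal = [3, 3, 2, 4, 3, 2]
assistant_principal = [3, 2, 2, 4, 3, 3]
kappa = cohen_kappa(principal, assistant_principal)
print(f"kappa = {kappa:.3f}")
```

Here the two observers match on four of six lessons, but because some of that agreement would occur by chance, kappa comes out noticeably lower than the raw two-thirds agreement rate.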

Even more importantly, the observer evidence can be reviewed statistically and assessed for

objectivity, as well as for alignment between the two observers. Even if all of the ratings

are done by one individual, there is often “drift” from one rating to another depending on

many variables. This statistical treatment ensures that the teacher rating is consistent and

that the final rating is based upon defensible evidence. This is the evidence of reliability

in the TalentEd/TLS protocol.
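
To make the idea of “drift” concrete, the sketch below looks at one observer's ratings of a sequence of master-scored videos and fits a straight line to the signed error (observer score minus master score) over time. A slope near zero suggests stable rating; a clearly positive or negative slope suggests the observer is becoming more lenient or more severe. This is a simplified illustration under stated assumptions (scores listed in the order they were given, a linear trend), not the analysis performed by the platform's statisticians.

```python
def drift_slope(observer_scores, master_scores):
    """Least-squares slope of (observer - master) error across rating order."""
    if len(observer_scores) != len(master_scores) or len(master_scores) < 2:
        raise ValueError("need at least two paired scores of equal length")
    errors = [o - m for o, m in zip(observer_scores, master_scores)]
    n = len(errors)

    # Rating order 0, 1, 2, ... serves as the time axis.
    mean_x = (n - 1) / 2
    mean_e = sum(errors) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxe = sum((x - mean_x) * (e - mean_e) for x, e in enumerate(errors))
    return sxe / sxx


# Hypothetical sequence: the observer matches the master scores at first,
# then starts rating one level higher on the last two videos.
master = [3, 2, 3, 4, 2, 3]
observer = [3, 2, 3, 4, 3, 4]
slope = drift_slope(observer, master)
print(f"drift per rating: {slope:+.3f} rubric points")
```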

Calibration ensures fairness by incorporating statistical methods of measuring accuracy.

It incorporates content validity by scoring against protocols validated by master scorers.

It incorporates reliability by seeing that the observations are consistent, no matter how

many people or how many times, and it incorporates fairness by letting the district agree

on the behavior to be observed. The program thus assesses inter-rater reliability, so that

the observation is valid, reliable and fair. The three quantitative measures provide the most

comprehensive analysis of an observer’s work. This type of statistical treatment is much

more powerful than a simple comparison to the mean, which is sometimes done by other

software. The calibration and certification results are reviewed by statisticians, and if desired,

a complete analysis and report can be prepared for the district with recommendations for

differentiated professional development for observers.


Because districts use evaluation for more than one purpose, the observers’ evidence-gathering

skills and accuracy in rating teachers’ practice are scored at four performance levels:

• Uncertified / Ineffective

• Initially certified / Developing

• Certified / Effective

• Certified with distinction / Highly Effective

The data can then be used for evaluation and/or tenure purposes, which is of more value

to a district than a simple pass/fail because it permits different types of remediation. While

uncertified or initially certified teachers may need some focus on classroom management,

those who have more experience will probably benefit more from professional development

that focuses on increasing questioning and inquiry in a classroom. And, even those who

are certified with distinction may benefit from professional development targeting the use

of technology to increase student involvement. As professional development needs are

developed, additional observer training can be provided to address specific topics.


In addition, because this flexible platform allows accuracy to be assessed at either the

component or element level of the rubric, a specific and fairly individualized plan of professional

development can be structured. Reports include detailed analysis of cohort scores and individual

results.

• Professional development can be structured based on individual scores

• Customized reports can be prepared for individuals, describing how they performed

• Professional development can be structured based on level of performance; e.g., Ineffective,

Developing, Effective, Highly Effective

• Customized reports can be prepared for groups, describing how they performed individually

and as a cohort

BEHAVIOR AND EVIDENCE

• Behavior: Observer data is reliable, valid, and fair. Evidence: Statistical scores for

inter-rater agreement, validity, and inter-rater reliability.

• Behavior: Observer data is consistent and calibrated. Evidence: Observers trained on

master-scored video.

• Behavior: Observations can be based on Common-Core Standards. Evidence: Observers use an

evidence-based process.

• Behavior: Protocols can be varied for districts. Evidence: District specific rubrics and

processes can be used.

• Behavior: Observations can be conducted by more than one person and be reliable. Evidence:

Graph can be shown of the distribution of scores across observations, and the chart shows

any “outliers.”

• Behavior: Observations will show when one specific observation is different from all the

others, permitting some conversation about “outliers.” Evidence: Graph can be shown of the

distribution of scores across observations, and the chart shows any “outliers.”

• Behavior: Evaluation data is “defensible.” Evidence: Data is consistent across scorers and

relates to district’s protocols.


In summary then, Calibration provides a flexible system that ensures that the data on which

teacher evaluations are built will be viewed by the staff and administration as reliable, valid

and fair. Because the protocols used in an observation are based on an evidence-based

process and because the observers are trained on master-scored videos, teachers will agree

that the observation does rate them on what they are actually expected to teach and is

fair. Because the observers are trained and the observations are examined for consistency,

teachers will view the observations as both reliable and believable.

When teachers agree that the system is fair, a strong professional development program

can be built upon that foundation, targeting specific areas for growth as well as professional

development programs that build on areas of expertise. Lastly, should personnel decisions

need to be made, the information behind them provides a strong, defensible

basis.


GLOSSARY

Calibration — Calibration is a comparison between measurements – one of known magnitude

or correctness made or set with one device, and another measurement made in as similar a

way as possible with a second device. The device with the known or assigned correctness

is called the standard and, in this case, would be the expert rubrics developed by a research

base. The second device here would be the rubrics as applied by each observer.

Certification — Generally, certification is granted by the specific state in which the district

is located. Most states have levels of teacher certification: an initial level granted to a newly

trained teacher and a second level granted after a certain number of years of teaching

experience. The number of years varies from state to state, and whether or not tenure is

granted at the same time varies from state to state.

Inter-rater Reliability — The simplest way to determine the degree of consistency, between

two or more scorers who independently measure and evaluate the same performance, is to

compute a correlation coefficient between the scores provided by the different scorers.

Reliability here refers to two components:

1. The validity of the criteria, that is, an expert judgment about how well the behavior

adequately represents the criteria and the degree to which the evidence supports the

validity. To measure validity, observers’ ratings must be measured against a “true” or

master score.

2. The inter-rater agreement — the consistency and stability of the score.
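
The correlation coefficient mentioned in this entry can be computed as in the following minimal sketch, which implements the Pearson coefficient for two scorers who have rated the same performances. The function name and the sample scores are illustrative assumptions.

```python
from math import sqrt


def pearson_r(scores_x, scores_y):
    """Pearson correlation between two scorers' ratings of the same performances."""
    if len(scores_x) != len(scores_y) or len(scores_x) < 2:
        raise ValueError("need at least two paired scores of equal length")
    n = len(scores_x)
    mean_x = sum(scores_x) / n
    mean_y = sum(scores_y) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores_x, scores_y))
    sxx = sum((x - mean_x) ** 2 for x in scores_x)
    syy = sum((y - mean_y) ** 2 for y in scores_y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a scorer never varies")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores: scorer 2 is always exactly one level higher than scorer 1.
scorer_1 = [1, 2, 3, 4]
scorer_2 = [2, 3, 4, 5]
r = pearson_r(scorer_1, scorer_2)
print(f"r = {r:.2f}")
```

In this example the two scorers never give the same score, yet the correlation is perfect because they rank the performances identically. This is one reason a correlation coefficient is usually read alongside a measure of agreement with a master score rather than on its own.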

Inter-rater Agreement — Evaluation of the degree of agreement that exists between two

or more scorers or raters is referred to as inter-rater agreement. In this context, agreement

means that scores from an instrument are stable and consistent. Scores should be nearly the

same when an observer administers the instrument multiple times at different times.

Observation — Inter-rater agreement is a procedure used when making observations of

behavior. It involves observations made by two or more individuals of someone’s behavior.


The observers record their evidence of the behavior and then interpret the evidence against

the valid criteria prior to determining a score. Observations are generally done by building

administrators, including Principals, Assistant Principals, Curriculum Coordinators, Facilitators,

etc., but may also be conducted by mentors, coaches, members of a Peer Assistance

program, etc.

Evaluation — Evaluation is a systematic determination of a teacher’s performance and merit,

using criteria governed by a set of standards. It can assist a school or district to ascertain

the degree of achievement of a teacher in regard to the aims and objectives of the district.

The primary purpose of evaluation is to enable reflection, honor expertise, and assist in the

identification of future change.

Defensible Evaluations — Generally, when examining teacher evaluations, the defensibility

breaks down into two areas – substantive and procedural:

Substantive refers to issues of validity and reliability in the system’s design to permit

comparability between similar teachers.

Procedural deals with issues regarding implementation of the system, including clarity

around the criteria; use of appropriate evidence; training of evaluators; and opportunities for

feedback to improve.

Value Added Measures — Value added measures generally refer to the importance of

teachers as a source of variance in student outcomes. Policymakers see VAM as a possible

component of education reform through improved teacher evaluations. VAM are generally

complex statistical techniques that can provide estimates of the effects of teachers and schools

that are not distorted by the powerful effects of such non-educational factors as family

background and socioeconomic confounders.

Formative Evaluation — Formative evaluation is an assessment procedure done in order

to modify teaching and learning activities to improve student achievement. This type of

assessment aids learning by generating feedback on performance, and enables the learner


to restructure their understanding/skills. Formative assessment is not distinguished by the

format of assessment, but by how the information is used. The same test may act as either

formative or summative.

Summative Evaluation — Summative evaluation refers to the assessment of the learning

and summarizes the development of learners at a particular time. Summative assessment

may be used for diagnostic assessment to identify any weaknesses, and then build on that

using a professional development plan. Summative assessment is commonly used to refer to

assessment of educational faculty by their respective supervisors. It is uniformly applied to

faculty members with the object of measuring all teachers by the same criteria to determine

the level of their performance. It is meant to meet the school or district’s needs for teacher

accountability, and looks to provide remediation for sub-standard performance and also

provides grounds for dismissal if necessary. It may also be used to determine career ladder

opportunities or other incentives for highly effective teachers. Summative assessment is

characterized as assessment of learning and is contrasted with formative assessment, which is

assessment for learning.


REFERENCES

1. Gordon, R., Kane, T. J., & Staiger, D. O. (April, 2006). Identifying Effective Teachers Using

Performance on the Job. Hamilton Project Discussion Paper, Brookings Institution.

2. Rivkin, S. G., Hanushek, E. A., & Kain, J. F. (March, 2005). Teachers, Schools, and Academic

Achievement. Econometrica, Vol. 73, No. 2, 417–458.

3. Jordan, Mendro, & Weerasinghe. (July, 1997). The Effects of Teachers on Longitudinal

Student Achievement: A Preliminary Report on Research on Teacher Effectiveness. Presented

at the CREATE Annual Meeting.

4. Goldhaber, D. (May, 2009). Teacher Pay Reforms: The Political Implications of Recent

Research. Center for American Progress, University of Washington and Urban Institute.

5. Wright, S. P., Horn, S. P., & Sanders, W. L. (1997). Teacher and classroom context effects on

student achievement: Implications for teacher evaluation. Journal of Personnel Evaluation in

Education 12:3, 247–256, 1998.

6. Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (June, 2009). The Widget Effect: Our

National Failure to Acknowledge and Act on Differences in Teacher Effectiveness. The New

Teacher Project Think Tank.

7. Abe, Y., Thomas, V., Sinicrope, C., & Gee, K. A. (2012). Effects of the Pacific CHILD

Professional Development Program. (NCEE 2013–4002). Washington, DC: National Center

for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S.

Department of Education.

8. Johnson, B., & Christensen, L. (2004). Educational Research: Quantitative, Qualitative, and

Mixed Approaches, 2nd edition. Boston: Pearson.

9. Creswell, J. W. (2005). Educational Research: Planning, Conducting, and Evaluating

Quantitative and Qualitative Research. Boston: Pearson.

10. Danielson, C. (1996). Enhancing Professional Practice: A Framework for Teaching.

Alexandria, Virginia: Association for Supervision and Curriculum Development.

11. Danielson, C., & McGreal, T. (2000). Teacher Evaluation to Enhance Professional Practice.

Alexandria, Virginia: Association for Supervision and Curriculum Development.

12. Pianta, R. C., et al. (2008). Classroom Effects on Children’s Achievement Trajectories in

Elementary School. American Educational Research Journal 45 (2): 365–398.

13. New York State United Teachers (NYSUT) Teacher Practice Rubric, 2012 Edition.

14. Kane, T. J., et al. (2010). Identifying Effective Classroom Practices Using Student

Achievement Data. NBER Working Paper 15803. Washington: National Bureau of Economic

Research.

15. Downer, J. T., et al. (2012). Observations of Teacher-Child Interactions in Classrooms

Serving Latinos and Dual Language Learners: Applicability of the Classroom Assessment

Scoring System in Diverse Settings. Early Childhood Research Quarterly 27 (1): 21–32.

16. Sartain, L., Stoelinga, S. R., & Brown, E. R. (2011). Rethinking Teacher Evaluation in Chicago:

Lessons Learned from Classroom Observations, Principal-Teacher Conferences, and District

Implementation. Chicago: Consortium on Chicago School Research.

17. Youngs, P. (February, 2013). Using Teacher Evaluation Reform and Professional

Development to Support Common Core Assessments. Center for Americ


ABOUT THE AUTHOR

Albert “Duffy” Miller, Ed.D. is President of Teaching Learning Solutions. Duffy was a high

school principal for sixteen years, during which he served as President of Vermont Principals’

Association and Chairman of the New England Association of Schools and Colleges Commission on

Public Secondary Schools. He supports the work of the Commission on Public Secondary

Schools by providing professional development for schools around high school accreditation.

He works with schools nationally assisting with high school reform efforts, establishing

small learning communities and embedding literacy instruction in all content areas. He

has worked closely with large urban districts, suburban districts, small rural districts and

state departments of education in the areas of teacher and administrator professional

development, and teaches courses at the graduate level. Duffy has served as a consultant

assisting numerous districts and states around the United States in using commonly accepted

and district/state specific frameworks as tools to improve instruction.

TalentEd is focused on improving education and increasing student achievement by

connecting student and educator data. The TalentEd Growth Platform includes tools for

educator professional development, evaluation, observation, and calibration, as well as student

assessment development and delivery. Calibration is a key component in the TalentEd Growth

Platform, empowering observers with the skills and knowledge to conduct unbiased and

equitable evaluations for all district employees. The TalentEd Growth Platform is used by more

than 1 million educators and 11 million students in school districts across the country.

For information, visit www.talentedk12.com
