The Role of Calibration in Determining Educator Effectiveness
THE IMPORTANCE OF FAIR, RELIABLE, VALID AND DEFENSIBLE EVALUATIONS
By Albert “Duffy” Miller, Ed.D., President, Teaching Learning Solutions
EXECUTIVE SUMMARY
So why does teacher evaluation matter? Because teaching matters. The core of education
is teaching and learning, and the teaching-learning connection works best when we have
effective teachers working with every student every day. Simply put, high-quality evaluation
systems help ensure high-quality teachers.
Yet, assessing teacher effectiveness consistently and accurately is a challenge, and too
often, the decisions made about a teacher’s effectiveness are not defensible. To establish
quality teacher evaluation practices, evaluators must be fair, their work must be valid, and
the judgments they reach about a teacher’s practice must be reliable. Calibration, a solution
provided by TalentEd with the help of Teaching Learning Solutions (TLS), helps evaluators
build the skills they need to complete fair and valid teacher evaluations. Through calibration
and assessment of the evaluators’ evidence and ratings of teachers’ practice, it also ensures
that the judgments made are reliable and defensible.
Fairness in the evaluation process occurs when teachers have confidence that observers
follow similar processes, and have the necessary skills to gather and provide objective and
representative evidence of their teaching practice.
Calibration promotes fairness in several ways. The platform itself can be customized to
follow the district’s established and negotiated processes, and use of the platform allows
evaluator fidelity to those processes to be monitored. Platform customization guarantees
that the roles of both the teacher and evaluator are clearly understood, and the
responsibilities of each party are completed on the platform as part of the evaluation
process. Finally, as districts calibrate using the platform, the observers’ evidence (the
product of their work, gathered to ensure objectivity and representativeness) and teacher
ratings are assessed for quality and accuracy.
OBJECTIVE, WELL REPRESENTED EVIDENCE HAS SEVERAL BENEFITS AND USES
1. It provides a lens to the teacher through which she can view and reflect upon her practice
2. The evidence is a vehicle through which honest feedback can be provided
3. The use of objective evidence establishes trust in the evaluators’ work
Validity is established when the criteria used to assess teacher practice are representative
of effective teaching, and when there is a research base that establishes what represents
effective practice and establishes how that practice correlates with student achievement.
Calibration is designed for the use of any rubric selected by a district, or approved by the
district’s regulatory agency (e.g., State Department of Education). The validity of some
rubrics, such as the CLASS, FfT, MQI, and PLATO criteria, has been established through
research. In other circumstances, districts may revise existing criteria to bring a tighter focus
to practices that they know, from their own study of their students’ achievement, to have a
positive influence. Working with the district or state, TalentEd/TLS provide aligned and scored videos
for use with multiple criteria as part of the calibration process. If a district has revised existing
rubrics, or created their own rubrics, TLS scorers will score videos against the criteria, and
as districts move forward with implementation, will provide consultation to ensure that the
effective levels of performance are validated by high levels of student achievement using
district achievement data. Additionally, when needed, TLS will provide custom professional
development to district evaluators ensuring that they understand the priorities of the selected
scoring criteria, thus promoting accuracy in the scoring process.
Teachers want assurance that multiple evaluators will reach agreement when assessing their
performance. Nothing will destroy trust in the evaluation process as much as two different
evaluators reaching different conclusions about a teacher’s level of effectiveness. District
executives and boards must be confident that the ratings given to teachers are consistent
interpretations of the criteria used for evaluation. When the interpretations are accurate, there
can be confidence that the evaluation process will help identify those teachers with the skills
necessary to lead, support and mentor the work of their colleagues, and those teachers who
need support to move their practice to a higher level of effectiveness. The Calibration
solution provides districts with a tool to assess both inter-rater agreement and inter-rater
reliability, thus assuring teachers that the ratings given for performance will be consistent
among observers, and assuring district leaders that the evaluators are accurately identifying
both practice that is exemplary and practice that is in need of more support and development.
Let’s briefly review how Calibration supports effective teacher evaluation practices. Teacher
evaluation is, first, about documenting the quality of teacher performance. Then, its focus
shifts to helping teachers improve their performance as well as holding them accountable for
their work. The most effective way to evaluate teachers is to watch them teach, and those doing
the watching must agree on what the performance should be and how the observation should be
done.
Calibration provides a structure for observation of teachers that:
• Can incorporate any rubric selected by a district,
• Uses video training on a customized platform to teach the observer to follow established
processes and recognize the teaching skills desired by the district,
• Teaches the observers to produce valid observations by using video training to rate the
observed behaviors against a master score,
• Trains the observers to produce reliable observations: ratings that stay consistent across
different observers and across repeated observations over time.
There is also a great deal of evidence that says that, in order to develop an effective teacher,
we need to observe that teacher regularly, provide specific feedback to the teacher and
differentiated professional development opportunities that support growth in identified areas.
The most powerful part of any observation is the feedback that comes from the observation
process. Effective evaluation models are based on frequent observations and focus on
improving teacher effectiveness, especially when paired with coaching that supports the
teacher in areas where growth is needed. A sound, valid, and defensible observation process
is the foundation upon which teacher development is built.
The data gathered in the Calibration solution give districts critical information for identifying
areas in which instructional improvement is needed and areas in which strength exists, and inform
the design of school improvement and personal growth for all teachers at all levels.
EFFECTIVE TEACHERS RESULT IN HIGHER STUDENT ACHIEVEMENT
Education leaders and policymakers have grappled with the difficult issue of education
reform, a topic that has received only lip service in the past. Because there is now a body of
recent research that links student achievement with teaching behavior, the improvement of
teaching practice becomes crucial. How do we help all teachers reach their full potential in
the classroom, and how do we respond to ineffective teaching?
None of these issues can be addressed without better teacher observation and evaluation
systems. Evaluation/observation models need to incorporate elements of effective coaching
practices that use evidence from the observations to provide teachers with evidence-based
feedback that helps them grow as professionals, whether they are developing or highly
effective in practice. The results of the observation process need to provide school leaders
with a consistent, appropriate and authentic way to measure teachers’ practice in order to
support each teacher’s development, and to give the schools the data they need to build the
strongest possible support and instructional teams.
While we expect evaluation models to provide a vehicle for all of these things, in most
cases they haven’t even come close. Instead, they have typically been perfunctory compliance
exercises that rated all teachers “good” or “great” and yielded little useful information. As
Secretary of Education Arne Duncan noted in a summer 2010 speech, “our system of teacher
evaluation… frustrates teachers who feel that their good work goes unrecognized and ignores
other teachers who would benefit from additional support.”
The research that shows the correlation between teacher behavior and student success
covers a broad area of the country and is documented throughout many discipline areas. For
instance, in the late 1990s, all 184 schools in Milwaukee — private, public, and charter schools
— focused on improving quality of instruction and student outcomes across the system.
They agreed that one way to improve was through direct observation of instruction and by
providing tools for, and support to, principals’ classroom observation.
Their conclusion, as reported by the Brookings Institution, was that “Having a top-quartile
teacher rather than a bottom-quartile teacher four years in a row could be enough to
close the black-white test score gap.”1 Researchers also looked at a large data set on
students in grades 3 through 7 in the State of Texas. This data set permitted analysis of
how differences in teacher quality affect student outcomes. Their conclusion, as reported in
the journal Econometrica in 2005, was that “Having a high-quality teacher throughout
elementary school can substantially offset or even eliminate the disadvantage of low
socio-economic background. Teachers and, therefore, schools matter importantly for student
achievement.”2
Researchers have also shown that the best predictor of a teacher’s effectiveness is his or
her previous success with children in the classroom. Most other factors pale in comparison,
including a teacher’s preparation route, advanced degrees, and even experience level (after
the first few years).
THE LESSON IS CLEAR: TO ENSURE THAT EVERY CHILD LEARNS FROM THE
MOST EFFECTIVE TEACHERS POSSIBLE, SCHOOLS MUST BE ABLE TO GAUGE
THEIR TEACHERS’ PERFORMANCE FAIRLY AND ACCURATELY.3
In Tennessee, researchers reported that the two most important factors impacting student
gain are differences in classroom teacher effectiveness and the prior achievement level of the
student. The teacher effect is highly significant in every analysis and has a larger effect than
any other factor in twenty of the thirty analyses.
Researchers at the University of Washington reported that “The effect of increases in teacher
quality swamps the impact of any other educational investment, such as reductions in class
size.” Education research convincingly shows that teacher quality is the most important
schooling factor influencing student achievement. A very good teacher, as opposed to a very
bad one, can make as much as a full year’s difference in learning growth for students.4
The research done by Wright, Horn and Sanders sums it up as “More can be done to improve
education by improving the effectiveness of teachers than by any other single factor.”5
Years of research have proven that nothing schools can do for their students matters more
than giving them effective teachers. A few years with effective teachers can put even the
most disadvantaged students on the path to college. A few years with ineffective teachers
can deal students an academic blow from which they may never recover.
VALUE OF EVALUATIONS AND THE IMPACT OF OBSERVATIONS
A major portion of any teacher evaluation is the observation of that teacher. An observation
is usually done by an administrator in the building who watches a sufficient part of a lesson,
matches it to criteria, and then discusses the observation with the teacher. The feedback/
coaching loop includes that discussion and a plan for some type of intervention so that
improvement can happen. There is a built-in assumption that everyone can improve.
Effective evaluation models incorporate multiple observations, whether by one evaluator,
two, or even more.
Evaluations, including the observation component, should provide all teachers with regular
feedback that helps them grow as professionals, no matter how long they have been in the
classroom. Observations and evaluation models should give schools the information they
need to build the strongest possible instructional teams, and help districts hold school leaders
accountable for supporting each teacher’s development. These models must include direct
observations of the teachers working with the students, and must be done in a systematic
and regular fashion.
One report, The Widget Effect: Our National Failure to Acknowledge and Act on Differences
in Teacher Effectiveness, details the difficulties shown by most teacher evaluation systems,
providing evidence of a multitude of design flaws. According to the report, most
evaluation systems presently in use are:
INFREQUENT: Many teachers — especially more experienced teachers — aren’t
evaluated every year. These teachers might go years between receiving any
meaningful feedback on their performance.
UNFOCUSED: A teacher’s most important responsibility is to help students learn,
yet student academic progress rarely factors directly into evaluations. Instead,
teachers are often evaluated based on superficial judgments about behaviors
and practices that may not have any impact on student learning—like the
presentation of their bulletin boards.
UNDIFFERENTIATED: In many school districts, teachers can earn only two
possible ratings: “satisfactory” or “unsatisfactory.” This pass/fail system makes it
impossible to distinguish great teaching from good, good from fair, and fair from
poor.
UNHELPFUL: In many of the districts studied, teachers overwhelmingly reported
that evaluations don’t give them useful feedback on their performance in the
classroom.
INCONSEQUENTIAL: The results of evaluations are rarely used to make
important decisions about development, compensation, tenure or promotion.6
Teachers contribute to student learning in ways that can largely be observed and measured.
Through focused, rigorous observation of classroom practice, examination of student work,
and analysis of students’ performance on high-quality assessments, it is possible to accurately
distinguish effective teaching from ineffective teaching.
Observation/evaluation data should form the foundation of teacher development.
Although there must be meaningful consequences for consistently poor performance, the
primary purpose of evaluations should not be punitive. Good evaluation models identify
excellent teachers and help teachers of all skill levels understand how they can improve;
they encourage a school culture that prizes excellence and continual growth. With better
teacher evaluations in place, school districts can also do a better job holding school leaders
accountable for doing their most important job: helping teachers reach their peak. Removing
persistently underperforming teachers is a necessary but insufficient step for building a
thriving teacher workforce.7
FAIRNESS, RELIABILITY AND VALIDITY = DEFENSIBLE EVALUATIONS
There are many instances in which an observation of a research participant’s performance is
made by someone such as an instructor, a teacher, or some other professional. Because of the
wide variability that can occur in the scoring of performance, it is important to determine the
consistency of such an evaluation. Evaluation of the degree of agreement that exists between
two scorers or raters is referred to as inter-rater agreement. In this context, agreement
means that scores of different observers using the same criteria are consistent. Simply, when
multiple observers score the same teacher’s practice, their scores should be nearly the same
when using the same criteria.
The simplest way to determine the degree of consistency between two or more scorers is
to compute a correlation coefficient between the scores provided by the different scorers.
Thus, if two observers visit the same classroom, the teacher should be able to feel confident
that the scores he/she receives from both evaluators are consistent. The teacher would soon lose
trust in the evaluation process if one evaluator’s scores were significantly different from the
scores of the second evaluator.
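As a minimal sketch of that calculation, the Python below computes a Pearson correlation coefficient between two observers' scores for the same lesson. The function name, the 1-4 rubric scale, and the sample scores are illustrative assumptions; they are not taken from the Calibration platform or from any district's data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a rater gives a constant score")
    return cov / sqrt(var_x * var_y)


# Hypothetical rubric scores (1-4) from two observers on the same six components.
observer_a = [3, 2, 4, 3, 1, 2]
observer_b = [3, 2, 3, 3, 1, 2]

r = pearson(observer_a, observer_b)
print(f"correlation between observers: {r:.2f}")
```

A correlation coefficient captures whether two observers rank practice in the same order, not whether they assign the same scores: an observer who always scores exactly one point higher than a colleague would still correlate perfectly. For that reason it can be useful to report a count of exact matches alongside it.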
Inter-rater agreement is a procedure in which two or more individuals observe the same
person’s practice. The observers record their scores of the practice and then compare scores
to see if they are similar or different. It has the advantage of negating any bias that one
individual might bring to scoring, and of adjusting for a single instance in which the teacher or
observer is working with a particularly challenging lesson. It has the disadvantage of requiring
the researcher to train the observers and requiring the observers to negotiate outcomes and
reconcile differences in their observations.
Frequently, the agreement between two or more scorers is not very strong unless some
degree of training and practice precedes the scoring. Fortunately, with training, the degree
of agreement can improve. The important issue is that training is often required and that a
measure of the reliability of an evaluation of performance by scorers is necessary.
If an evaluation instrument is to be used in making decisions, it must be valid as well as
reliable. What is referred to as “content validity” is the extent to which the questions on
the instrument and the scores from these questions are representative of all the possible
questions that a researcher could ask about the skills. Typically, researchers go to a panel of
experts and have them identify whether the questions are valid. That is, they ask:
• Do the items represent the thing you are trying to measure?
• Have you excluded any important content areas?
• Have you included any irrelevant items?
In some schools, validation is done by taking an average of multiple evaluations. These
evaluations may be performed at different times by the same individual or by different
people. But averaging invalid or inaccurate evaluations does not necessarily produce
a valid result. This is a much less defensible method than using a gold standard against which
all evaluations are measured. In accomplishing this task, the experts will make a judgment
about how well the task samples the content. They then make a judgment as to whether the
behavior adequately represents the content. They finally make a judgment of the degree to
which the evidence supports the validity. To measure validity, observers’ ratings must be
measured against a “true” or master score. A critical component of the calibration training
process is master-scored videos. Thus, if an observation is to be considered both reliable and
valid, a teacher has to be confident that his/her rating has been accurately scored against
the selected criteria. Inter-rater reliability and validity provide an assurance to teachers that
the evaluators who rate their practice have high levels of agreement among each other, and
that the ratings they apply are an accurate and reliable interpretation of their practice when
scored against the selected criteria.
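The difference between agreeing with each other and agreeing with a master score can be sketched in a few lines of Python. The function name, the tolerance parameter, and all of the scores below are hypothetical; they are only meant to show how two observers can match each other perfectly while both differ from the master score.

```python
def agreement_rate(ratings, reference, tolerance=0):
    """Share of components where a rating is within `tolerance` points of the reference."""
    if len(ratings) != len(reference) or not ratings:
        raise ValueError("ratings and reference must be non-empty and equal in length")
    hits = sum(abs(r - m) <= tolerance for r, m in zip(ratings, reference))
    return hits / len(ratings)


# Hypothetical 1-4 rubric scores for five components of one master-scored video.
master_score = [3, 2, 4, 3, 2]
observer_a = [2, 2, 3, 2, 2]
observer_b = [2, 2, 3, 2, 2]

# Agreement between the two observers (no reference to the master score).
between_observers = agreement_rate(observer_a, observer_b)
# Exact agreement with the master score.
a_vs_master_exact = agreement_rate(observer_a, master_score)
# Agreement with the master score when a one-point difference is tolerated.
a_vs_master_adjacent = agreement_rate(observer_a, master_score, tolerance=1)

print(between_observers, a_vs_master_exact, a_vs_master_adjacent)
```

In this made-up example the two observers agree with each other on every component, yet match the master score exactly on only two of the five components. That is the situation described above: agreement among raters without accuracy against the selected criteria.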
In this context, the term inter-rater agreement is used when referring just to agreement
between the scorers. Inter-rater reliability refers to both agreement between scorers and
content validity.8,9 Thus, validating based on an average might result in inter-rater agreement,
but only validating against an expert standard will yield inter-rater reliability. An average
doesn’t give the reader the same picture as looking at the reliability. It doesn’t take into
account whether the content that you’re being rated on is content that has been agreed upon
as important to observe. For example, observers might agree that a teacher is not dressed
properly for teaching, but teacher attire might not be something that everyone agrees really
matters in the way children learn. The raters would all agree, but it would not be a valid part
of teaching.
RESEARCH-BASED EDUCATION METHODOLOGIES SUPPORT EVIDENCE
Classroom observations have long been a staple of most teacher evaluation systems but they
have generally not been “fair, reliable and valid”. It’s only recently that research has agreed
on what behaviors make for a good teacher. The observations and rubrics used in teacher
evaluations were not previously grounded in clearly articulated models or frameworks
of effective teaching. In contrast, the 2007 Framework for Teaching, the Classroom
Assessment Scoring System, or CLASS, and other newer observation criteria are explicitly
grounded in models of effective teaching and/or sets of teacher performance standards.
Observation protocols are used with rubrics that differentiate among various levels of teacher
performance. Evaluators, including principals, district administrators and other teachers, are
expected to provide detailed records of what they observe, organized around the standards
and rubrics. Thus, these observations are characterized as evidence-based and are less likely
to rely on subjective judgments that are unsupported by evidence or concrete examples.
Data from these observations can then be used to:
• Provide detailed feedback to teachers
• Plan tailored opportunities for professional learning for teachers
• Contribute to overall ratings of teacher performance when combined with other data
The Framework for Teaching, developed by Charlotte Danielson, an education consultant
from Princeton, New Jersey, is one of the most-used designs for assessing instruction in the
United States. It can be used with K-12 teachers in core content areas such as
mathematics, English language arts, science, history/social studies, and world languages.
Danielson’s framework features four domains: planning and preparation, classroom environment,
instruction, and professional responsibilities. Trained evaluators assess teachers in all four
domains with classroom observations focusing on teacher performance in two of these domains:
classroom environment and instruction. In most districts that use the Framework for Teaching —
or modified versions of it — teachers are observed multiple times in a school year, through both
formal and informal (unannounced) observations. Research from Cincinnati and Chicago schools
indicates that teachers’ observation ratings based on the Framework for Teaching are related
to their effects on student achievement gains. The Danielson Framework has been approved by
many states.10, 11
A different model, the Classroom Assessment Scoring System (CLASS) teacher assessment
model, was first developed by Robert Pianta and his colleagues for use in early childhood
settings. In a recent Center for American Progress report, Pianta outlined the ways that CLASS
can be employed across subject areas and grade levels to assess teachers’ interactions with
children in three broad domains: classroom organization, instructional support, and emotional
support.12
These domains are common from preschool through the 12th grade. CLASS identifies
particular types of interaction within each domain that vary by grade. It assesses “effective
teacher-student interactions across pre-K–12 in a way that is sensitive to important
developmental and context shifts that occur as students mature.”
In research on CLASS and a precursor to CLASS — the Classroom Observation System —
Pianta and colleagues have reported relationships between teachers’ ratings based on these
observation protocols and achievement gains at the preschool and elementary school levels.
A third model, the New York State United Teachers’ (NYSUT) Teacher Practice Rubric,
2012 Edition, has also been approved by the New York State Education Department. In a
comparison of the NYSUT model with the Danielson model, it is fairly clear that the specific
behaviors which could be observed in a classroom are consistent across models. These “criteria” are structured
as detailed rubrics in each of the models so that observers can be trained to differentiate
according to level. The table below details some of the specifics of the Danielson model and
the NYSUT model so that the reader can see the consistencies. More information can be
found in the original documents.13
DANIELSON MODEL
DOMAIN 1 – PLANNING AND PREPARATION
Demonstrating knowledge of content and pedagogy
Demonstrating knowledge of students
Setting instructional objectives
Demonstrating knowledge of resources
Designing coherent instruction
Designing student assessment
DOMAIN 2 – CLASSROOM ENVIRONMENT
Creating an environment of respect and rapport
Establishing a culture of learning
Managing classroom procedures
Managing student behaviors
Organizing physical space
DOMAIN 3 – INSTRUCTION
Communicating with students
Using questioning and discussion techniques
Engaging students in learning
Using assessment in instruction
Demonstrating flexibility and responsiveness
DOMAIN 4 – PROFESSIONAL RESPONSIBILITY
Reflecting on teaching
NYSUT MODEL
1. KNOWLEDGE OF STUDENTS AND STUDENT LEARNING
1.1 Child development
1.2 Research-based knowledge of learning
1.3 Knowledge of diversity
1.4 Knowledge of students and families
1.5 Response to diversity
1.6 Knowledge of technology & effect on learning
2. KNOWLEDGE OF CONTENT AND INSTRUCTIONAL PLANNING
2.1 Knowledge of content
2.2 Cross-discipline, creative thinking
2.3 Instructional strategies
2.4 Align with learning standards
2.5 Connect to prior knowledge
2.6 Utilize materials to promote student success
3. INSTRUCTIONAL PRACTICE
3.1 Engage and challenge all students
3.2 Communicate clearly and accurately
3.3 Set high expectations and challenge students
3.4 Use a variety of instructional approaches
3.5 Develop multidisciplinary skills
3.6 Assess, monitor and adapt instruction
4. LEARNING ENVIRONMENT
4.1 Create a safe and supportive environment
4.2 Create a challenging and stimulating environment
4.3 Manage the learning environment
4.4 Create a safe and productive learning environment
5. ASSESSMENT FOR STUDENT LEARNING
5.1 Use a range of assessment practices
5.2 Use data to monitor and plan
5.3 Communicate aspects of the evaluation system
5.4 Evaluate effectiveness of the measurement
5.5 Prepare students to understand the assessment system
6. PROFESSIONAL RESPONSIBILITIES AND COLLABORATION
6.1 Uphold professional standards and policy
6.2 Engage and collaborate with colleagues and community
6.3 Collaborate with families
6.4 Manage and perform non-instructional duties
6.5 Comply with laws and policies
7. PROFESSIONAL GROWTH
7.1 Reflect on practice
7.2 Engage in professional development
7.3 Communicate and collaborate to improve practice
7.4 Remain current in content knowledge and pedagogy
It is evident in examining the above criteria that not all of the parts can be observed directly.
While all the pieces are necessary to produce a strong learning environment, a classroom
observer on a visit to the classroom will be able to see only select pieces. In the Danielson
model, Domains 2 and 3 (Classroom Environment and Instruction) would be observable. In
the NYSUT model, Instructional Practice, the Learning Environment, and some indicators of
Assessment for Student Learning could be observed.
The Framework for Teaching, CLASS, NYSUT criteria and other newer observation
criteria have the potential to support implementation of the Common Core standards and
assessments in a number of ways. First, these criteria provide teachers and administrators
with a common language for analyzing and documenting teaching practices. Second, the
use of these criteria provides teachers and evaluators with a guide against which evidence can be
weighed to diagnose teachers’ strengths and areas in need of improvement. Third,
the observation results can be used to make decisions about professional development for
individual teachers.
Districts face challenges in implementing observation criteria. Districts must be able to
ensure the validity and reliability of the protocols by implementing a standardized approach
to training evaluators and monitoring evaluators’ ratings. In addition, using new protocols
requires principals to demonstrate instructional leadership, provide timely and meaningful
feedback to teachers, and connect teachers to relevant opportunities for professional learning
and development. Despite their strong potential to contribute to changes in instruction, it
is important to note that the Framework for Teaching and other models are generic with
regard to subject matter and, as noted, can be used across grade levels and content areas.
On one hand, this leads to efficiencies for districts in that they only need to train evaluators to
use one observation protocol. At the same time, it also means that such observation criteria
are not able to measure or directly promote teachers’ pedagogical content knowledge. If,
as many believe, pedagogical content knowledge is necessary in order to implement the
Common Core in mathematics and English language arts, some observation criteria would
need to be supplemented by other forms of teacher evaluation, or by revisions that privilege
the type of pedagogical practice necessary for effective teaching in a specific
content area.14, 15, 16, 17
Given the variety of possible criteria to select from, districts may very well opt to adapt
well-researched criteria to reflect some of their own priorities, focusing on identified
instructional needs. The criteria chosen need to be flexible enough to adapt to the district’s choices.
ALIGNING TO DISTRICT-DEFINED FRAMEWORK AND EXPECTATIONS
Teaching begins with a teacher’s understanding of what is to be learned and how it is to
be taught. It proceeds through a series of activities during which the students are provided
specific instruction and opportunities for learning, though the learning itself ultimately
remains the responsibility of the students. The specific methods of teaching, the specific
content to be taught, and the scope and sequence of what is taught (the “curriculum”) are
determined by the school district. When teachers are evaluated and, specifically, when they
are observed teaching, the behavior of the teacher is rated against the district’s expectations.
Any system which is designed to evaluate that observation needs to be sufficiently flexible to
take the district’s expectations into account.
At the same time, there is a research base which details what makes “good” teaching. The
various models described previously (Danielson, etc.) have been validated by comparing
them with student progress, as well as validated by expert teachers. Whatever system is
used for teacher evaluation and observation needs to build on that research base. Teacher
observations and evaluations are valid when using research-based criteria, and when
observers collect objective evidence that is correctly aligned with the selected criteria against
which teachers’ practice is measured.
According to the criteria, most individuals doing evaluations in a district are in administrative
positions and have administrative credentials. With the increase in the number of evaluations,
most districts are asking almost all of their administrative staff to do some observations. This
makes it all the more important that these observers be trained using the same model. There
needs to be consistency in what observers do and in how they do it. Calibration provides that
training. Moreover, after training, the observations submitted can be examined for consistency
across individual observers. If one observer’s ratings regularly differ from those of the others,
more training is necessary.
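The consistency check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the TalentEd/TLS method: the function name, the 1-4 rating scale, the sample scores, and the one-level threshold are all hypothetical. It compares each observer's ratings with the median rating of the other observers on the same lessons and flags anyone whose average gap reaches the threshold.

```python
from statistics import median


def flag_divergent_observers(scores_by_observer, threshold=1.0):
    """Return observers whose ratings regularly differ from their peers.

    scores_by_observer maps an observer name to a list of ratings, where
    position i in every list refers to the same observed lesson.
    The threshold (in rubric levels) is an assumed cut point, not a standard.
    """
    names = list(scores_by_observer)
    lesson_count = len(scores_by_observer[names[0]])
    flagged = {}
    for name in names:
        total_gap = 0.0
        for i in range(lesson_count):
            # Median of the other observers' ratings for the same lesson.
            peers = [scores_by_observer[other][i] for other in names if other != name]
            total_gap += abs(scores_by_observer[name][i] - median(peers))
        mean_gap = total_gap / lesson_count
        if mean_gap >= threshold:
            flagged[name] = mean_gap
    return flagged


# Hypothetical ratings from four observers on the same four lessons.
scores = {
    "Observer A": [3, 2, 4, 3],
    "Observer B": [3, 2, 3, 3],
    "Observer C": [3, 3, 4, 3],
    "Observer D": [1, 4, 2, 1],
}
print(flag_divergent_observers(scores))
```

An observer flagged this way is a candidate for additional training rather than proof of inaccuracy; the group itself could be the one that has drifted, which is why comparison against master-scored video remains important.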
IMPROVING INTER-RATER RELIABILITY AMONG OBSERVERS USING A CALIBRATION PLATFORM
Calibration has the flexibility to incorporate any criteria (rubric) selected by the district.
When teachers are to be observed, the observers must be consistent in how they rate
the specific good teaching skills, whether they are using already published criteria or
the district’s own criteria. The observers need to be trained very specifically to be accurate
in evaluating what they are observing. Calibration incorporates expertly scored videos
against which observer accuracy is measured. These videos are used to train the observers
so that they are actually matching what they observe to the specific rubrics. If districts have
constructed their own rubrics, or revised existing versions of published rubrics, videos for
assessing evaluator accuracy can be scored by experts to ensure consistent interpretation
of the district’s criteria. The training of the observers is one of the three measures that make
Calibration defensible and usable, because rating the behaviors against a master scorer
assures an evidence-based “fair” system.
In addition to making sure that the observer rates the very specific behavior given in the rubric,
there is also the question of who is doing the rating and how many times that rating is done,
that is, the consistency across the different ratings, also known as inter-rater reliability.
As an example, if a teacher is to be rated six times and the Principal does three observations
and the Assistant Principal does the other three, do both observers agree on what they have
seen? Here also, the video-based training given the observers helps with that consistency.
In this case, the TalentEd/TLS protocol ensures that the score again shows the use of an
evidence-based process.
Even more importantly, the observer evidence can be reviewed statistically and assessed for
objectivity as well as for alignment between the two observers. Even if all of the ratings
are done by one individual, there is often “drift” from one rating to another depending on
many variables. This statistical treatment ensures that the teacher rating is consistent and
that the final rating is based upon defensible evidence. This is the evidence of reliability
in the TalentEd/TLS protocol.
Calibration ensures fairness by incorporating statistical methods of measuring accuracy.
It incorporates content validity by scoring against protocols validated by master scorers.
It incorporates reliability by seeing that the observations are consistent, no matter how
many people or how many times, and it incorporates fairness by letting the district agree
on the behavior to be observed. The program thus assesses inter-rater reliability, so that
the observation is valid, reliable and fair. The three quantitative measures provide the most
comprehensive analysis of an observer’s work. This type of statistical treatment is much
more powerful than a simple comparison to the mean, which is sometimes done by other
software. The calibration and certification results are reviewed by statisticians, and if desired,
a complete analysis and report can be prepared for the district with recommendations for
differentiated professional development for observers.
Because districts use evaluation for more than one purpose, the observers’ evidence-gathering
skills and accuracy in rating teachers’ practice are scored at four performance levels:
• Uncertified / Ineffective
• Initially certified / Developing
• Certified / Effective
• Certified with distinction / Highly Effective
The data can then be used for evaluation and/or tenure purposes, which is of more value
to a district than a simple pass/fail because it permits different types of remediation. While
uncertified or initially certified teachers may need some focus on classroom management,
those who have more experience will probably benefit more from professional development
that focuses on increasing questioning and inquiry in a classroom. And, even those who
are certified with distinction may benefit from professional development targeting the use
of technology to increase student involvement. As professional development needs are
developed, additional observer training can be provided to address specific topics.
In addition, because this flexible platform allows accuracy to be assessed at either the
component or element level of the rubric, a specific and fairly individualized plan of professional
development can be structured. Reports include detailed analysis of cohort scores and individual
results.
• Professional development can be structured based on individual scores
• Customized reports can be prepared for individuals, describing how they performed
• Professional development can be structured based on level of performance; e.g., Ineffective, Developing, Effective, Highly Effective
• Customized reports can be prepared for groups, describing how they performed individually and as a cohort
BEHAVIOR | EVIDENCE
Observer data is reliable, valid, and fair | Statistical scores for inter-rater agreement, validity, inter-rater reliability
Observer data is consistent and calibrated | Observers trained on master-scored video
Observations can be based on Common-Core Standards | Observers use an evidence-based process
Protocols can be varied for districts | District specific rubrics and processes can be used
Observations can be conducted by more than one person and be reliable | Graph can be shown of distribution of different observations of scores and the chart shows any “outliers”
Observations will show when one specific observation is different from all the others, permitting some conversation about “outliers” | Graph can be shown of distribution of different observations of scores and the chart shows any “outliers”
Evaluation data is “defensible” | Data is consistent across scorers and relates to district’s protocols
In summary then, Calibration provides a flexible system that ensures that the data on which
teacher evaluations are built will be viewed by the staff and administration as reliable, valid
and fair. Because the protocols used in an observation are based on an evidence-based
process and because the observers are trained on master-scored videos, teachers will agree
that the observation does rate them on what they are actually expected to teach and is
fair. Because the observers are trained and the observations are examined for consistency,
teachers will view the observations as both reliable and believable.
When teachers agree that the system is fair, a strong professional development program
can be built upon that foundation, targeting specific areas for growth as well as professional
development programs that build on areas of expertise. Lastly, should there need to be personnel
decisions made, the information on which the decision is made provides a strong, defensible
basis.
GLOSSARY
Calibration — Calibration is a comparison between measurements – one of known magnitude
or correctness made or set with one device, and another measurement made in as similar a
way as possible with a second device. The device with the known or assigned correctness
is called the standard and, in this case, would be the expert rubrics developed by a research
base. The second device here would be the rubrics as applied by each observer.
Certification — Generally, certification is granted by the specific state in which the district
is located. Most states have levels of teacher certification: an initial level granted to a newly
trained teacher and a second level granted after a certain number of years of teaching
experience. The number of years varies from state to state, and whether or not tenure is
granted at the same time varies from state to state.
Inter-rater Reliability — The simplest way to determine the degree of consistency, between
two or more scorers who independently measure and evaluate the same performance, is to
compute a correlation coefficient between the scores provided by the different scorers.
Reliability here refers to two components:
1. The validity of the criteria, that is, an expert judgment about how well the behavior
adequately represents the criteria and the degree to which the evidence supports the
validity. To measure validity, observers’ ratings must be measured against a “true” or
master score.
2. The inter-rater agreement — the consistency and stability of the score.
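As a minimal sketch of the two computations named in this entry, the Python below calculates a Pearson correlation coefficient between two scorers and the share of an observer's ratings that exactly match a master score. The function names, the 1-4 scale, and the sample ratings are assumptions made for illustration; the paper does not specify which coefficient or matching rule the platform uses.

```python
from math import sqrt


def pearson_correlation(scores_a, scores_b):
    """Pearson correlation between two scorers' ratings of the same items."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least two ratings")
    n = len(scores_a)
    mean_a = sum(scores_a) / n
    mean_b = sum(scores_b) / n
    covariance = sum((a - mean_a) * (b - mean_b) for a, b in zip(scores_a, scores_b))
    spread_a = sum((a - mean_a) ** 2 for a in scores_a)
    spread_b = sum((b - mean_b) ** 2 for b in scores_b)
    if spread_a == 0 or spread_b == 0:
        raise ValueError("correlation is undefined when a scorer never varies")
    return covariance / sqrt(spread_a * spread_b)


def accuracy_against_master(observer_scores, master_scores):
    """Fraction of ratings that exactly match the master ("true") score."""
    if len(observer_scores) != len(master_scores) or not master_scores:
        raise ValueError("need two equal-length, non-empty lists")
    matches = sum(o == m for o, m in zip(observer_scores, master_scores))
    return matches / len(master_scores)


# Hypothetical ratings (1-4 scale) for six rubric components of one video.
master = [3, 2, 4, 1, 3, 2]
observer = [3, 2, 3, 1, 3, 3]
print(round(pearson_correlation(observer, master), 2))
print(round(accuracy_against_master(observer, master), 2))
```

The two numbers answer different questions: a high correlation shows that two scorers rank performances similarly, while accuracy against the master score shows whether the observer lands on the same rubric level. An observer who rates every component one level too high would correlate perfectly with the master scorer and still match on none of the ratings.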
Inter-rater Agreement — Evaluation of the degree of agreement that exists between two
or more scorers or raters is referred to as inter-rater agreement. In this context, agreement
means that scores from an instrument are stable and consistent. Scores should be nearly the
same when an observer administers the instrument multiple times at different times.
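Agreement as defined here can be illustrated with a short Python sketch. It reports exact agreement and agreement within one rubric level for two sets of ratings of the same lessons. The function name, the ratings, and the choice of "within one level" as a second measure are assumptions for illustration only; the same calculation applies whether the two lists come from two observers or from one observer rating on two occasions.

```python
def agreement_rates(first_ratings, second_ratings):
    """Return (exact agreement, agreement within one level) as fractions."""
    if len(first_ratings) != len(second_ratings) or not first_ratings:
        raise ValueError("need two equal-length, non-empty lists")
    pairs = list(zip(first_ratings, second_ratings))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return exact, adjacent


# Hypothetical ratings (1-4 scale) of the same six lessons by two observers.
principal = [3, 2, 4, 1, 3, 2]
assistant_principal = [3, 3, 4, 3, 3, 2]
exact, adjacent = agreement_rates(principal, assistant_principal)
print(f"exact agreement: {exact:.0%}, within one level: {adjacent:.0%}")
```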
Observation — Inter-rater agreement is a procedure used when making observations of
behavior. It involves observations made by two or more individuals of someone’s behavior.
The observers record their evidence of the behavior and then interpret the evidence against
the valid criteria prior to determining a score. Observations are generally done by building
administrators, including Principals, Assistant Principals, Curriculum Coordinators, Facilitators,
etc., but may also be conducted by mentors, coaches, members of a Peer Assistance
program, etc.
Evaluation — Evaluation is a systematic determination of a teacher’s performance and merit,
using criteria governed by a set of standards. It can assist a school or district to ascertain
the degree of achievement of a teacher in regard to the aims and objectives of the district.
The primary purpose of evaluation is to enable reflection, honor expertise, and assist in the
identification of future change.
Defensible Evaluations — Generally, when examining teacher evaluations, the defensibility
breaks down into two areas – substantive and procedural:
Substantive refers to issues of validity and reliability in the system’s design to permit
comparability between similar teachers.
Procedural deals with issues regarding implementation of the system, including clarity
around the criteria; use of appropriate evidence; training of evaluators; and opportunities for
feedback to improve.
Value Added Measures — Value added measures generally refer to the importance of
teachers as a source of variance in student outcomes. Policymakers see VAM as a possible
component of education reform through improved teacher evaluations. VAM are generally
complex statistical techniques that can provide estimates of the effects of teachers and schools
that are not distorted by the powerful effects of such non-educational factors as family
background and socioeconomic confounders.
Formative Evaluation — Formative evaluation is an assessment procedure done in order
to modify teaching and learning activities to improve student achievement. This type of
assessment aids learning by generating feedback on performance, and enables the learner
to restructure their understanding/skills. Formative assessment is not distinguished by the
format of assessment, but by how the information is used. The same test may act as either
formative or summative.
Summative Evaluation — Summative evaluation refers to the assessment of the learning
and summarizes the development of learners at a particular time. Summative assessment
may be used for diagnostic assessment to identify any weaknesses, and then build on that
using a professional development plan. Summative assessment is commonly used to refer to
assessment of educational faculty by their respective supervisors. It is uniformly applied to
faculty members with the object of measuring all teachers by the same criteria to determine
the level of their performance. It is meant to meet the school or district’s needs for teacher
accountability, and looks to provide remediation for sub-standard performance and also
provides grounds for dismissal if necessary. It may also be used to determine career ladder
opportunities or other incentives for highly effective teachers. Summative assessment is
characterized as assessment of learning and is contrasted with formative assessment, which is
assessment for learning.
REFERENCES
1. Gordon, R., Kane, T. J., & Staiger, D. O. (April, 2006). Identifying Effective Teachers Using
Performance on the Job. Hamilton Project Discussion Paper, Brookings Institution.
2. Rivkin, S. G., Hanushek, E. A., & Kain, J. F. (March, 2005). Teachers, Schools, and Academic
Achievement. Econometrica, Vol. 73, No. 2, 417–458.
3. Jordan, Mendro, & Weerasinghe. (July, 1997) The Effects of Teachers on Longitudinal
Student Achievement, A Preliminary Report on Research on Teacher Effectiveness, Presented
at the CREATE Annual Meeting.
4. Goldhaber, D. (May 2009) Teacher Pay Reforms, The Political Implications of Recent
Research, Center for American Progress, University of Washington and Urban Institute
5. Wright, S.P., Horn, S.P., & Sanders, W.L. (1997). Teacher and classroom context effects on
student achievement: Implications for teacher evaluation. Journal of Personnel Evaluation in
Education 12:3 247-256, 1998
6. Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (June, 2009) The Widget Effect: Our
National Failure to Acknowledge and Act on Differences in Teacher Effectiveness. The New
Teacher Project Think Tank
7. Abe, Y., Thomas, V., Sinicrope, C., & Gee, K. A. (2012). Effects of the Pacific CHILD
Professional Development Program. (NCEE 2013–4002). Washington, DC: National Center
for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S.
Department of Education.
8. Johnson, B. & Christensen, L. (2004) Educational Research: Quantitative, Qualitative, and
Mixed Approaches, 2nd edition. Boston: Pearson.
9. Creswell, J. W. (2005) Educational Research: Planning, Conducting, and Evaluating
Quantitative and Qualitative Research. Boston: Pearson.
10. Danielson, C. (1996) Enhancing Professional Practice: A Framework for Teaching.
Alexandria, Virginia: Association for Supervision and Curriculum Development
11. Danielson, C. & McGreal, T. (2000) Teacher Evaluation to Enhance Professional Practice.
Alexandria, Virginia: Association for Supervision and Curriculum Development
12. Pianta, R. C. et al, (2008) Classroom Effects on Children’s Achievement Trajectories in
Elementary School, American Educational Research Journal 45 (2): 365–398
13. New York State United Teachers (NYSUT) Teacher Practice Rubric, 2012 Edition
14. Kane, T. J., et al., (2010) Identifying Effective Classroom Practices Using Student
Achievement Data. NBER Working Paper 15803. Washington: National Bureau of Economic
Research
15. Downer, J. T. et al. (2012) Observations of Teacher-Child Interactions in Classrooms
Serving Latinos and Dual Language Learners: Applicability of the Classroom Assessment
Scoring System in Diverse Settings. Early Childhood Research Quarterly 27 (1): 21–32.
16. Sartain, L., Stoelinga, S. R. & Brown, E. R. (2011) Rethinking Teacher Evaluation in Chicago:
Lessons Learned from Classroom Observations, Principal-Teacher Conferences, and District
Implementation. Chicago: Consortium on Chicago School Research.
17. Youngs, P. (February, 2013) Using Teacher Evaluation Reform and Professional
Development to Support Common Core Assessments. Center for Americ
ABOUT THE AUTHOR
Albert “Duffy” Miller, Ed.D. is President of Teaching Learning Solutions. Duffy was a high
school principal for sixteen years, during which he served as President of Vermont Principals’
Association and Chairman of the New England Association of Schools and Colleges Commission on
Public Secondary Schools. He supports the work of the Commission on Public Secondary
Schools by providing professional development for schools around high school accreditation.
He works with schools nationally assisting with high school reform efforts, establishing
small learning communities and embedding literacy instruction in all content areas. He
has worked closely with large urban districts, suburban districts, small rural districts and
state departments of education in the areas of teacher and administrator professional
development, and teaches courses at the graduate level. Duffy has served as a consultant
assisting numerous districts and states around the United States in using commonly accepted
and district/state specific frameworks as tools to improve instruction.
TalentEd is focused on improving education and increasing student achievement by
connecting student and educator data. The TalentEd Growth Platform includes tools for
educator professional development, evaluation, observation, and calibration, as well as student
assessment development and delivery. Calibration is a key component in the TalentEd Growth
Platform, empowering observers with the skills and knowledge to conduct unbiased and
equitable evaluations for all district employees. The TalentEd Growth Platform is used by more
than 1 million educators and 11 million students in school districts across the country.
For information, visit www.talentedk12.com