
    INTER-RATER RELIABILITY OF FLIGHT SCHOOL INSTRUCTORS:

    A FOUNDATIONAL STUDY

    By

    Matthew Vail Smith

An Applied Project Presented in Partial Fulfillment of the Requirements for the Degree

    Master of Science in Technology

    ARIZONA STATE UNIVERSITY

    December 2007


    2007 Matthew Vail Smith

    All Rights Reserved


    INTER-RATER RELIABILITY OF FLIGHT SCHOOL INSTRUCTORS:

    A FOUNDATIONAL STUDY

    by

    Matthew Vail Smith

    has been approved

    December 2007

    Graduate Supervisory Committee:

Mary Niemczyk, Chair

William McCurry

    ACCEPTED BY THE GRADUATE COLLEGE


    ACKNOWLEDGMENTS

    I would like to acknowledge the help of several people:

Dr. Joel Hutchinson, who helped me to overcome the mental blocks I struggled with.

    Lisa Cahill, ASU Polytechnic Writing Center, for her constructive criticism and helpful

    suggestions.

    Professors Merrill Karp and Jim Anderson for introducing me to the PCATD and

    explaining the possibilities it offered.

    Greg and David, the Lab Assistants who taught me how to use the PCATD.

    The student volunteers who flew the sample flights.

    The four flight instructors who took time out of their busy schedules to watch three hours

    of footage.

    Committee member Dr. William McCurry for his guidance and suggestions.

    And very special thanks to my committee chair, Dr. Mary Niemczyk, without whose

    unwavering faith and support, I could never have accomplished this project and

    graduated.


TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER
1 INTRODUCTION
     Statement of Purpose
     Scope
     Assumptions
     Limitations
     Equipment Used
     Chapter Summary
2 LITERATURE REVIEW
     Background
     Cohen's Kappa
     Inter-rater reliability in Sports
     Inter-rater reliability in Psychology
     Inter-rater reliability in Health Care
     Inter-rater reliability in Education
     Chapter Summary
3 METHOD
     Flight Pattern
     Pilot Participants
     Rater Participants
     Scoring Rubric
     Flying the Pattern
     Experiment Execution
     Chapter Summary
4 RESULTS
     Raw Scores
     Contingency Tables
     Summary of Results
5 CONCLUSIONS AND RECOMMENDATIONS
     Recurrent Training
     Scoring Rubric Improvements
     Technical Improvements
     Recommendations for Further Research
     Commercial Application of this Study
     Summary

REFERENCES

APPENDIX
     A Scoring Rubric
     B Briefing and Script
     C Instructor Instructions
     D Score Sheet


LIST OF TABLES

1. Example of the inter-rater reliability Contingency Table used in this Experiment
2. Raters' Raw Scores
3. Rater 1 versus Rater 2
4. Rater 1 versus Rater 3
5. Rater 1 versus Rater 4
6. Rater 2 versus Rater 3
7. Rater 2 versus Rater 4
8. Rater 3 versus Rater 4
9. Summary of Results


LIST OF FIGURES

1. Contingency Table Highlighting Agreement Cells
2. Contingency Table with Chance-Corrected Agreement
3. Pattern D


    CHAPTER 1

    INTRODUCTION

    Several educational institutions exist to train students to become professional

    pilots. As part of the regular curriculum, students must attend ground school and engage

in the required number of flight training hours. Ground school and the written exams issued by the Federal Aviation Administration (FAA) are standardized, as are the required flight syllabi. However, training from school to school is not identical, even though fully compliant with FAA regulations. Even in a flight school that has very exacting standards, training may differ under different instructors for any number of reasons, such as the instructors' abilities and interests. Some pilots dislike instructing and only do it to build hours and to put experience on a resume. Others do it because they enjoy sharing their love of flying with others. All instructors, regardless of their personal characteristics, must do one thing: evaluate student performance. And yet, because of their personal characteristics, instructors perceive student performances differently from one another. The reasons for differences in instructor perception of student performance can be systematic or arbitrary, conscious or subconscious, innocuous or malicious; one simply cannot catalog another's motives, but one can see the result of the instructors' perceptions: difference.

When scoring a student pilot, there is the student pilot's performance, which is objective, and the instructor pilot's perception of that performance, which is subjective. In the best of circumstances, the performance and the recorded perception of that performance share a high degree of similarity. That is, the instructor ought always to record a score that accurately and precisely reflects the student's performance. However,


    this is not always the case. Some perceptions of performance are too forgiving, while

    others are overly critical. In other words, the same student pilot can receive a passing

    score from an overly forgiving instructor and a failing score from an overly critical

    instructor for an identical or near-identical performance, leaving the student confused or

frustrated. There is a problem ensuring that all student pilots receive standardized scores that reflect the student pilot's performance with a high degree of reliability.

Students, as well as the other stakeholders of flight schools, must be sure that the scoring system is such that the scores are a meaningful indicator of the student's performance rather than an arbitrary indicator of the instructor's perception.

    Furthermore, the scores should be consistent from one instructor to another. This

    problem can be examined with an inter-rater reliability study. Inter-rater reliability is

used to assess "the degree to which different raters/observers give consistent estimates of the same phenomenon" (Trochim, 2001, p. 96). This investigation, then, seeks to offer

    any flight school a method to determine the inter-rater reliability of its instructor pilots.

    Chapter 1 introduces the problem and sets the parameters of this investigation.

    Chapter 2, the literature review, examines pertinent inter-rater reliability literature dealing

    with both the statistical theory and application of inter-rater reliability. The literature

    reviewed in this study does not come from aviation sources because, after an exhaustive

search of reputable science journal databases, the researcher could not find aviation inter-rater reliability studies. Instead, the literature reviewed comes from other fields such as

    sports, psychology, health care and education, where inter-rater reliability studies are

    used extensively. Many lessons learned from these fields may be applied to aviation,

    especially in the sub-fields of aviation human factors and flight training/pilot education.


    Chapter 3 discusses the methodology used to plan, design and execute the project and to

    analyze the data. Chapter 4 examines the results. Chapter 5 discusses two possible ways

to improve inter-rater reliability at the flight school, suggests technical improvements for executing the project, offers a commercial application, makes suggestions for further research, and summarizes the project.

    Statement of Purpose

The purpose of this investigation, then, is to determine the reliability of ratings of student pilot performance across instructor pilots. In order to accomplish this task, this

    investigation:

    defines inter-rater reliability and discusses its application to pilot training;

    reviews literature regarding inter-rater reliability;

    describes the method (experiment) that was used to assess inter-rater

    reliability;

    analyzes the data as collected from the performed experiment using

Cohen's kappa coefficient;

    discusses the results;

    makes recommendations for corrective action;

    suggests a commercial application for this research; and

    suggests areas for further research.

    Scope

    The scope of this investigation is a foundational study in which the rating

    performances of a cross-section of instructors are analyzed to determine inter-rater


    reliability. Four instructor pilots were asked to watch the flight performances of ten

    students flying the same instrument flight pattern as recorded on a DVD. The testing of

    the raters took place throughout the course of a single afternoon in a controlled

    environment, under the supervision of the researcher.

    Assumptions

    This investigation assumes that there may be a difference between the raters in

    terms of their evaluation of student performance that is worth examining and that the

    traditional methods for determining inter-rater reliability, such as the kappa coefficient,

    are sound. Furthermore, it assumes that the principles of inter-rater reliability are

    transferable from one field to another.

    Limitations

This investigation has a few limitations. First, this study does not, and indeed cannot, presume to act as a predictive model. It measures what exists now, but cannot

    definitively state that raters will evaluate in this way or that. This study does not consider

    questions of gender, racial or other forms of favoritism or bias because bias is an error

    that causes a rater to be unreliable. This study does not seek to answer why the raters are

    reliable or not, but only to establish a repeatable method for determining inter-rater

    reliability. Therefore, this study does not claim to be exhaustive. It is a foundational

    study that seeks only to show that inter-rater reliability studies can be adapted from other

    fields and made useful for aviation research, and it uses the instructors of the flight school

    as test subjects.

    It cannot be over-emphasized that this study investigates neither the student pilots

    nor their performance. The student pilots and their performance are only means to the


    end of examining inter-rater reliability. Whether a student pilot is a good pilot or a poor

    pilot is entirely moot. This study investigates how reliably the raters rate the flight

    performances, not the flight performances or the students who flew them.

    Finally, there were budgetary limitations. This study was funded entirely by the

    researcher. Much of the equipment used, as listed below, belonged to the flight school.

However, the researcher personally paid for the video camera, its accessories, and the computer used to transfer the footage to DVD.

    Equipment Used

    The following equipment was used to complete this project:

an Elite brand Personal Computer Aviation Training Device (PCATD);

    a computer projector and a movie screen;

    a Sony DCR-HC28 video camera, used to record the flight instruments

    (computer simulated instrument panel); the camera was equipped with a

FireWire output in order to transfer the recorded footage to the hard

    drive of a computer;

    an iMac personal computer with iMovie HD and iDVD, used to organize

    the recorded footage and create DVDs for the raters (instructor pilots) to

    view; and

    a PC, projector and movie screen for showing the DVDs to the raters.

    Chapter Summary

    In order to ensure that students are scored fairly and consistently, flight schools

    must consider the inter-rater reliability of their instructor pilots. This study describes the


    method for testing inter-rater reliability of flight school instructors that the researcher

    developed and discusses the research on which this method is based.


    CHAPTER 2

    LITERATURE REVIEW

    This chapter is a review of literature related to inter-rater reliability. The chapter

    begins by establishing the background of inter-rater reliability: explaining what inter-rater

    reliability is and discussing a coefficient used to measure inter-rater reliability. The

    coefficient discussed, kappa, is the one used to analyze the data in this study. The rest of

    the chapter focuses on how, in the absence of inter-rater reliability studies in aviation,

    inter-rater reliability studies have been used in other fields, such as sports, psychology,

    health care and education.

    Background

    Inter-rater reliability measures the extent of agreement between two or more

individual raters. Inter-rater reliability is used to measure the consistency of a scoring or rating system and of those who use it (DeVellis, 2005; Trochim, 2001). Since this study

    seeks to establish the inter-rater reliability of instructor pilots, it is helpful to have some

    background on inter-rater reliability and how it has been used.

    In his 2005 entry into the Encyclopedia of Social Measurement, Robert F.

    DeVellis managed to pack extensive information into a few short pages. DeVellis reports

    that there are two influences at work in the process of measuring scores: (1) the true

    score of the object, person, event, or other phenomenon being measured, and (2) error

    (i.e. everything other than the true score of the phenomenon of interest) (p. 315). In

    Chapter One, Introduction, true score was referred to as objective performance. Error can

be influenced by the instructor's perception. Or, rather, the instructor's perception is susceptible to error; hence the disconnect between the true score (objective performance)


and the recorded score (instructor's perception). Error is simply a phenomenon to be

    dealt with through statistical processes and analysis. This investigation seeks to measure

    rater error. It does not study what errors are, why errors exist, or the moral implications

    of error.

    The purpose of the kappa statistic is to account for and eliminate agreement by

chance (chance being a type of error) so that the researcher can get a clearer idea of

    how much agreement there really is between raters. The coefficient, then, distinguishes

between purposeful agreement and accidental agreement. In a reliability formula, the variability attributable to the true score is the numerator, while the total variability of the obtained score (true score plus error) is the denominator. Thus, whatever reliability coefficient is used, it is "the ratio of variability ascribable to the true score relative to the total variability of the obtained score" (DeVellis, 2005). Or, in the terms chosen for this investigation, it relates the pilot's objective performance to the instructor's recorded perception of that performance. In this study, it is assumed that any disconnect in the relationship between the pilot's performance (true score) and the instructor's recorded perception (obtained score) is due to the raters, not the pilot.
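In the standard notation of classical test theory (a convention not spelled out in DeVellis's entry, but consistent with it), this ratio may be written as

$$ \text{reliability} = \frac{\sigma^{2}_{\text{true}}}{\sigma^{2}_{\text{obtained}}} = \frac{\sigma^{2}_{\text{true}}}{\sigma^{2}_{\text{true}} + \sigma^{2}_{\text{error}}} $$

so that a coefficient near 1 means the recorded scores reflect mostly the flight performance itself, while lower values mean that more of the recorded score is attributable to rater error.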

    The way to find this coefficient, then, is to measure rater against rater rather than

    pilot against rater. Each rater observed the exact same flight performances. Therefore,

    the raters ought to record identical scores. In practice they may or may not. This is why

    one performs an inter-rater reliability study, to discover these discrepancies between true

    score and obtained score, should discrepancy (error) exist.


Cohen's Kappa

    In the late 1950s and throughout the 1960s, Jacob Cohen conducted seminal

    research focusing on inter-rater reliability. Cohen proposed a coefficient represented by

the Greek letter kappa (κ) as the standard coefficient for inter-rater reliability, with .70 being considered reliable. This is not merely 70% agreement, because agreement can happen by chance. Instead, kappa accounts for the expected frequency of ratings, thus eliminating mere chance agreement (Cohen, 1960; Gwet, 2002b).

Cohen's original article, "A coefficient of agreement for nominal scales," which appeared in Educational and Psychological Measurement, explains the kappa coefficient and raises three points that are foundational to inter-rater reliability:

1. The units are independent.

2. The categories of the nominal scale are independent, mutually exclusive, and exhaustive.

3. The judges operate independently. (Cohen, 1960, p. 37)

Dr. Kilem Gwet's paper explaining Cohen's kappa gave additional information not presented in Cohen's article, such as explaining how to use Cohen's kappa step by step. Gwet's work gave much inspiration to this investigation, and the methodology he describes has been adapted for use in this project. What follows is a brief paraphrasing of the methodology provided by Cohen, as explained by Gwet (2002b).

    Two raters observe three species of turtles. They are told to identify the species to

which each turtle belongs (Y, R, or C). Thirty-six turtles are observed and the raters tally their judgments in a three-by-three table. (Three, because Y, R, and C.)

    If Rater 1 claims Y and Rater 2 claims R, then the tally goes in the box that

    corresponds with Y/R: first column, second row. If both raters claim R, then the tally


    goes into the R/R box in the middle of the table: second column, second row. And so on.

The row and column tallies were then totaled in order to ensure that the correct number of observations, 36, was recorded. The total number of agreements is calculated by summing the values of the diagonal cells of the table: a = 9 + 8 + 6 = 23 (Gwet, 2002b).

Figure 1 shows Gwet's contingency table. The cells showing agreement (Y/Y, R/R and C/C) are shaded.

Figure 1. Contingency Table Highlighting Agreement Cells (Gwet, 2002b)

                          Rater 1
                    Y      R      C     Row totals:
              Y     9      3      1        13
   Rater 2    R     4      8      2        14
              C     2      1      6         9
   Column totals:  15     12      9        23

    Out of the thirty-six turtles observed, the raters agreed on 23 decisions, thus

    making the agreement level 64%. That is not good enough because some of the

    agreements may have been mere chance agreements.

In order to account for chance agreement, one must compute the expected frequency (ef) for each agreement cell by dividing the product of its row and column totals by the number of samples (N). Figure 2 shows that, when this is done for each agreement cell, a = 9 + 8 + 6 = 23 becomes ef = 5.42 + 4.67 + 2.25 = 12.34. This is the expected frequency of agreement by chance.
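Spelled out for each agreement cell, using the row and column totals from Figure 1, the expected frequencies are:

ef(Y) = (13 × 15) / 36 = 5.42
ef(R) = (14 × 12) / 36 = 4.67
ef(C) = (9 × 9) / 36 = 2.25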


Figure 2. Contingency Table with Chance-Corrected Agreement (Gwet, 2002b)

                          Rater 1
                    Y          R          C          Row totals:
              Y     9 (5.42)   3          1             13
   Rater 2    R     4          8 (4.67)   2             14
              C     2          1          6 (2.25)       9
   Column totals:  15         12          9             23 (12.34)

To find kappa, then, one divides the difference of a minus ef by the difference of N (number of samples) minus ef (the sum of the expected frequencies). That is:

κ = (a - ef) / (N - ef) = (23 - 12.34) / (36 - 12.34) = .45

Kappa is evaluated next. As was stated above, a kappa of .70 or greater is considered satisfactory; less than .70 is not. This example has a kappa of .45, denoting rather weak inter-rater reliability.
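The same arithmetic can be verified with a short Python sketch (an illustration only; the counts are those from Figures 1 and 2, and the names are chosen for this illustration rather than taken from Gwet):

    # Cohen's kappa for the two-rater turtle example (Figures 1 and 2).
    # Rows are Rater 2's species calls; columns are Rater 1's.
    table = [
        [9, 3, 1],   # Rater 2 said Y
        [4, 8, 2],   # Rater 2 said R
        [2, 1, 6],   # Rater 2 said C
    ]

    N = sum(sum(row) for row in table)                  # 36 turtles
    row_totals = [sum(row) for row in table]            # 13, 14, 9
    col_totals = [sum(col) for col in zip(*table)]      # 15, 12, 9

    a = sum(table[i][i] for i in range(3))              # agreements: 23
    ef = sum(row_totals[i] * col_totals[i] / N for i in range(3))
    # ef is about 12.33 (12.34 in Figure 2, where each cell is rounded first)

    kappa = (a - ef) / (N - ef)
    print(round(kappa, 2))                              # 0.45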

In this case, Gwet's recommendation was to retrain the raters to recognize the species better. Specifically, the raters had trouble with two species in particular, so Gwet recommended that the raters focus on correctly discriminating between these two types of turtles in order to improve inter-rater reliability (Gwet, 2002b).

Gwet's explanation of Cohen's kappa showed two raters with thirty-six samples

    of three species. The current inter-rater reliability study has four raters judging ten flight

    performance samples on a scale of 1 to 5. Chapter 3, Method, will discuss the application

of Cohen's kappa to this project.

Remarkably enough, Gwet's article explaining Cohen's kappa coefficient was later followed by a second article on why kappa is insufficient (Gwet, 2002a). However interesting Gwet's argument is regarding kappa's insufficiency and his alternative


coefficient's merits, the researcher did not find Gwet's alternative method in literature other than his own, whereas the researcher found Cohen's kappa coefficient used extensively. Therefore, Gwet's criticism of kappa is mentioned here only to make the reader aware that there are other means (other coefficients) of determining inter-rater reliability. This study uses Cohen's kappa, since it is widely accepted, while Gwet's new

    coefficient is not.

    Inter-rater reliability in Sports

    Flying and sports are related activities in that they are both simultaneously

    physical and mental, or psychomotor, to denote the inseparability between the physical

    and mental aspects. Being physical acts, they can be measured. And being measurable,

    they can be used in an inter-rater reliability study.

One such study, "Development of an Instrument to Assess Jump-Shooting Form in Basketball" (Lindeman, Libkuman, King, & Kruse, 2000), examined the physical form and movements of a jump shot. Basketball coaches have written books that discuss what proper shooting form is, and the study used that information to create an instrument for assessing jump shots. Four raters then viewed videotapes of 32 shooters and rated the shooters' form and movement according to the instrument developed. The conclusion was that the instrument may help discern a correlation between a shooter's form and the shooter's success rate.

    The jump shot study shows the validity of an inter-rater reliability study when

    observing psychomotor activity. By analogy, then, an inter-rater reliability study is likely

    valid when observing flight performances, because it, too, observes psychomotor activity.


    Inter-rater reliability in Psychology

    Inter-rater reliability studies are often used in psychology to determine if scales

    and other methods of measuring patient behavior are reliable means of assessment. These

    studies have been used to assess rating scales and assessment methods related to sleep

    disorders (Ferri, Bruni, Miano, Smerieri, Spruyt & Terzano, 2005), mental capacity

    (Raymont, Buchanan, David, Hayward, Wessley & Hotopf, 2006), agoraphobia

    (Schmidt, Salas, Bernert & Schatschneider, 2005), delusions (Bell, Halligan & Ellis,

    2006 and Meyers, English, Gabriele, Peasley-Milkus, Heo, Flint, et al., 2006), social

    dysfunction in schizophrenia and related illnesses (Monroe-Blum, Collins, McCleary, &

    Nuttall, 1996), and other means of rating psychological disorders (Drake, Haddock,

    Terrier, Bentall & Lewis, 2007).

    Using inter-rater reliability studies to validate psychological testing is not limited

    to the United States. It is used in China (Leung & Tsang, 2006), Korea (Joo, Joo, Hong,

    Hwang, Maeng, Han, et al., 2004), Japan (Kaneda, Ohmoria & Fujii, 2001), in the Arabic

    language (Kadri, Agoub, El Gnaoui, Mchichi Alami, Hergueta & Moussaoui, 2005),

    Turkey (Tural, Fidaner, Alkin & Bandelow, 2002), Greece (Papavasiliou, Rapidi, Rizou,

    Petrapoulou & Tzavara, 2007 and Kolaitas, Korpa, Kolvin & Tsiantis, 2003), and France

    (Thuile, Even, Friedman & Guelfi, 2005). In all of these articles, scales or other methods

    of assessment were tested, or foreign language translations of English language scales

    and methods of assessment were tested and validated using inter-rater reliability studies.

    It seems, then, that inter-rater reliability studies serve a very useful purpose in

determining the validity of scoring or rating rubrics. Thus, one can surmise that an inter-rater reliability study may be very useful to a flight school that needs to measure the

    reliability of its raters and scoring rubrics.

    Inter-rater reliability in Health Care

    Training health care practitioners also has parallels to training pilots. Both health

    care practice and the practice of flying require both mental aptitude and the physical

    skills to carry out their mentally-driven tasks. This fact is true for the entire gamut of

    health care practitioners from nurses to surgeons and the gamut of pilots from the simple

    sport (ultra-light) pilot to captains of 747s. All of the individuals in these vast and

    diverse groups require a level of mental and physical harmony that demands high-level

    training. This training regimen is ready-made for inter-rater reliability studies.

Research regarding nursing in triage units suggests that live exercises may be more reliable than paper-based ones. "Triage tool inter-rater reliability: a comparison of live versus paper case scenarios" (Worster, Sardo, Fernandes, Eva, & Upadhy, 2007) shows that the kappa was acceptable in both live and paper cases; however, the agreement in live cases was much higher (.90 live versus .76 on paper).

    Therefore it seems that it is better to test inter-rater reliability of instructor pilots with a

    live flight scenario rather than a paper-based scenario.

    Paper-based scenarios would have been easy enough to create for the

    instructor/raters being investigated, but as this triage nursing study makes clear, live is

    more desirable because it is more reliable. The researcher did not conduct this present

    investigation live due to physical constraints of aircraft and budgetary constraints.

    Instead, the performances that the raters observed were captured on video for viewing at

    another place and time, which is consistent with other studies reviewed in this chapter.


    Bann, Davis, Moorthy, Munz, Hernandez, Khan, Datta, and Darzi (2005) studied

11 surgical trainees and put them through a 15-minute, six-station rotation of basic

    surgical tasks. Each trainee performed the six-station rotation on five separate occasions

for a total of 90 minutes of observation. All of the trainees' performances were video

    recorded for later review. The six tasks each had criteria determining what makes a

    trainee competent or not at that task. For example, in the suturing task, trainees were

    rated on the time taken and total number of movements used to complete the task

    (Bann, et al., 2005). The trainees were further rated on the quality of the suture, based on

    the squareness and orientation of the knots. The authors emphasized that their measuring

    instrument was able to discern both quantity and quality of work.

    The researchers used the Spearman correlation coefficient (rho) in their statistical

    analysis, which is used to examine correlations between sittings. (Bann, Davis,

    Moorthy, Munz, Hernandez, Khan, Datta, & Darzi, 2005). (Since neither the pilots nor

    the raters sit for their part of the study more than once, there will not be any improvement

    to measure. Therefore, rho is not necessary to this study.) On the other hand, the

researchers used Cronbach's alpha coefficient to test a number of internal-consistency measures;

    these included the inter-rater reliability of video assessment and intra-task reliability.

    (Bann, et al., 2005). The result of this experiment was that video assessment is indeed a

reliable means of assessing performance. Yet another study concluded that inter-rater reliability of videotaped cases was excellent, having a coefficient of .93 (Hulsman, Mollema, Oort, Hoos & de Haes, 2006).

    In a rare example, James D. Michelson, MD drew a direct parallel between

    medicine and aviation. Moreover, Michelson specifically cites the usefulness and


    ubiquity of simulator training in aviation, and suggests that more and better simulators be

    developed in the training of orthopedic surgeons. Michelson cites other studies that

    suggest good, but not perfect, correlation (Michelson, 2006) and later suggests that

    simulator-based competency standards be developed and will likely come built-in to the

    software packages of off-the-shelf simulators in the future. One benefit of using

    simulators is that they are asynchronous. That is, a resident doctor need not have a

    supervisor present during training if using a simulator. Furthermore, the data collected

    during the simulation can be reviewed by more than one supervisor or rater

independently, meeting Cohen's third requirement that raters perform their duties

    independently (Cohen, 1960).

    Inter-rater reliability studies are not used solely in the training of health care

    professionals, but also to verify the rubrics for various cases such as rating the

    effectiveness of out-of-hospital CPR (Rittenberger, Martin, Kelly, Roth, Hostler, &

    Callaway, 2006) and for rating the severity of rosacea (Bamford, Gessert, & Renier,

    2004). The authors of the rosacea article admitted that when the scale ranged from 1 to

    10, the inter-rater reliability coefficient indicated unreliable rating. But when the scale

    was reduced to a range from 1 to 5, the inter-rater reliability coefficient was much

    greater, indicating reliability.

    Inter-rater reliability is also used a great deal in physiotherapy. Holey and

    Watson (1995) provided a stark example of the necessity for kappa rather than using

    mere percentage of agreement when performing an inter-rater reliability study. In some

    cases the percentage of agreement was 100%, while the kappa coefficient, which

accounts for chance agreement, was 0.01, indicating almost no agreement beyond chance.


    Kappa has also been found useful in determining inter-rater reliability in other

studies. A study conducted by Kolt, Brewer, Pizzari, Schoo, & Garrett (2007) combined two inter-rater reliability studies, one in which six physiotherapists and physiotherapy students examined videotaped cases and another that compared two live clinical sessions. The results were that the inter-rater reliability of the first study was very high (κ = .87 to .93) and that of the second study varied from very good to good (κ = .76 to .89 and .63 to .76). Dionne, Bybee, & Tomaka (2006) used kappa to establish moderate reliability (κ = .55) in a study using 20 patients and 54 trained clinicians. Fifty-four raters is the greatest

    number of raters seen in the entire literature review.

    Inter-rater reliability in Education

Laura D. and William L. Goodwin wrote "An analysis of statistical techniques used in the Journal of Educational Psychology, 1979-1983" (1985) in order to discern the most commonly used statistical methods in educational psychology. The Journal of

    Educational Psychology (JEP) is a long-established, peer-reviewed journal. Therefore, it

    is understood that the statistical methods used by its contributors are useful and

    appropriate for anyone doing research in a field related to educational psychology,

    including this investigation.

    From 1979-1983, 40 out of 92 reliability studies in the JEP were inter-rater

reliability studies. Inter-rater reliability studies comprised nearly half of the studies, by far the greatest percentage. Considering how commonly researchers use inter-rater reliability studies to establish or verify reliability in an educational setting, the Goodwins'

    article indicates that performing an inter-rater reliability study at flight schools, which are

    rightly considered educational institutions, is a legitimate pursuit.


    A common use of inter-rater reliability studies in education assesses writing. The

    question of what constitutes good or bad writing cannot be answered with an inter-rater

    reliability study. Instead, much like the rubrics used to rate medical observations or the

    jump-shot as discussed previously, the rubrics for scoring essays must be created first by

    an expert or group of collaborating experts who know what good writing is. Qualitative

    characteristics must be sorted and presented in such a way that raters can quantify their

    observations and opinions of the writing samples. Lee (2004) noticed that, given a

holistic scoring rubric, raters scored computer-based writing samples provided by English as a Second Language (ESL) students far more reliably than paper-based (that is to say, handwritten) writing samples. The holistic rubric included several criteria that

    accounted not only for the quality of content, but also quality of expression, as

    determined by the writing experts. Lee suggests that the raters may need to learn how not

    to discriminate against messy handwriting, and that correcting that bias may help to make

    the scores that the raters awarded more reliable.

    Penny, Johnson and Gordon (2000) introduced the idea of augmenting a holistic

    rubric with benchmark writing samples. Writing, like many other human activities, is

performed on a continuum. That is, one cannot easily discern discrete moments, but rather observe ability over the passage of time. Assigning an integer to rate a performance (that is, shifting from a qualitative to a quantitative measuring system) requires a snapshot, or a discrete variable. In many cases, this means assigning a rating from 1 to 5. Inter-rater reliability studies show whether the quality of writing (or whatever act is being rated) is being accurately translated into a quantity, which can then be

    measured. Introducing benchmark papers helped those charged with assessing writing


    samples to more accurately rate the quality of writing because each integer had an

    exemplar to which the raters could refer. Thus, the inter-rater reliability was increased,

    and may also have led to a greater external validity.

    Chapter Summary

    Inter-rater reliability literature is plentiful and offers researchers several methods

    and many examples of how to design and execute inter-rater reliability studies. The

    articles featured in this study were chosen because the fields of study all involved training

and featured psychomotor skills that are analogous to and transferable to evaluating

    pilot training.


    CHAPTER 3

    METHOD

    This investigation was designed to assess inter-rater reliability between instructor

    pilots when observing flights performed by student pilots. This study included

    videotaping the performance of student pilots flying an industry standard instrument

    flight rules (IFR) pattern. The researcher transferred the footage to a DVD. Four

    instructor pilots reviewed DVDs of the flight performance footage and scored the student

pilots' performances on a scale of 1 to 5. The researcher then analyzed the scores using Cohen's kappa coefficient. The resulting coefficients are discussed in Chapter Four,

    Results.

    Flight Pattern

In The Pilot's Manual: Instrument Flying (Kirshner, 1990), there are several flight

    patterns to choose from. The pattern used for this investigation is referred to as Pattern

    D. It was chosen because it is long enough to give the raters something substantial to

    score, yet not so time-consuming as to prove burdensome. An illustration of the pattern

    appears in Figure 3.

    Pilot Participants

    Student pilots enrolled in a flight program at a four-year research university

    participated by flying the aforementioned flight pattern using a PCATD. The researcher

    explained to the students that they were being videotaped for the purpose of investigating

    inter-rater reliability. They were assured that these scores, good or bad, would not figure

    into their course average. Their identities were protected by preventing any

    distinguishing features from being recorded on video. Also, the order in which the flight


    performances were viewed was different from the order they were recorded. Thus, the

student who flew the first flight on the day of recording might actually have flown the last flight viewed by the raters. The researcher did not collect or record any

    demographic data about the student pilot participants in order to abide by the limitations

    as discussed in Chapter One, Introduction.

    Rater Participants

    The rater-participants were selected from the pool of instructor pilots at the flight

    school. All instructor pilots were offered a chance to participate and the researcher

    enlisted the help of four volunteers. These instructor pilots watched and scored the

    flights that are contained on the DVDs. They are the raters, whose reliability this study

    investigates. Just as the student pilots who flew the pattern were assured that their

    participation would not affect their scores in school, the raters were assured of their

    anonymity and that their performance in this study would not impact their employment at

    the flight school. Also just as with the student pilot participants, the researcher did not

    collect or record any demographic data about the rater participants in order to abide by

    the limitations as discussed in Chapter One, Introduction.


Figure 3. Pattern D (Kirshner, 1990)


    Scoring Rubric

    In order to measure inter-rater reliability, there must be an established scoring

    rubric. The flight school at which this study was performed already has a scoring

rubric (an explanation of how scores are determined), which was used in this

    investigation. The reader is referred to the scoring rubric in Appendix A, which explains

    what the scores represent.

    As stated in Chapter Two, Literature Review, there is a difference between quality

    and quantity, yet in studies such as this and those in the social sciences, medical science,

    and education, researchers must change qualitative performance into quantitative data in

    order to perform statistical analysis. One cannot average words or put words into a

    formula. Thus, words (qualities) must be transformed into numbers (quantities). There is

no analyzing "poor," "good," or "great," but one can analyze scores of 1, 3, or 5. This is

    precisely the reason for this inter-rater reliability study: to determine if the student pilot

    flight performance is being accurately transformed into a quantitative score according to

    the scoring rubric.

    Flying the Pattern

Before the student pilots sat at the PCATD, the researcher briefed them. The pattern is rather complex, and, depending on the skill of the student pilot, the researcher gave oral instructions as necessary. As stated in Chapter One, Introduction, this study is not investigating the student pilots. Therefore, the student pilot's ability to perform the flight pattern well or poorly is immaterial. What this study investigates is whether the raters agree about the student pilot's performance. Therefore, helping a lesser-skilled

    student pilot complete the pattern does not affect the inter-rater reliability. The raters


were entirely unaware of which students referred to the pattern and which performed the pattern from memory. After the flight was finished, there was a

    debriefing. See the script in Appendix B.

    Experiment Execution

    After the flight patterns were recorded, it was time to test the raters. The raters

    viewed the DVDs in a controlled environment so as not to influence or be influenced by

    other raters, just as specified by Cohen (1960). Then raters were asked to score the

student pilots' performances according to the scoring rubric. After the raters scored the student pilots' performances, the researcher analyzed the data.

Cohen's kappa coefficient is derived using only two raters at a time. Several studies cited

    in Chapter Two, Literature Review, used only two raters, some four, some more. After a

    very thorough search, the researcher could not find any research that suggests an optimal

    number of raters for inter-rater reliability studies. In this study, there are four raters

    because the researcher looked for an even number of raters, as most of the other studies

    had, and four instructor pilots made themselves available for testing purposes.

    The researcher used six contingency tables similar to the tables described in

    Chapter Two, Literature Review, but adapted the table to provide the resultant

information in conformity with the APA style manual. (Gwet's contingency tables do not

    conform to the APA manual.) The example table (Table 1) shows hypothetical Rater X

    versus hypothetical Rater Y. The numbers 1 through 5 indicate the scores which raters

    can give to student pilots. A score of 1 represents an unsatisfactory performance; a 2,

    marginal; a 3, good; a 4, very good; and a 5, excellent, as described in Appendix A. Pairs

    of scores from the ten flights (A through J) were tallied in the table according to the rules


    as described by Gwet (2002b) in Chapter Two, Literature Review. That is, if rater 1

gives a score of 3 and rater 2 gives a score of 3, then one point will be tallied in the cell (3, 3). (In Table 1 below, the 0 entries denote nothing, as this is only an example. N is 10 because the number of samples is already known.)

    Table 1

    Example of the inter-rater reliability Contingency Table used in this Experiment

                         Rater X
   Score           1    2    3    4    5    Row totals:    a     ef
           1       0    0    0    0    0        0          0     0
           2       0    0    0    0    0        0          0     0
   Rater Y 3       0    0    0    0    0        0          0     0
           4       0    0    0    0    0        0          0     0
           5       0    0    0    0    0        0          0     0
   Column totals:  0    0    0    0    0        N          Σa    Σef
                                               10          0     0

The tables will account for each possible permutation without replicating pairs. After the result of each table is tallied according to Cohen's kappa method, the resultant coefficients will then be analyzed to determine the inter-rater reliability of the instructor pilots in comparison with each other. The row totals and the column totals should each add up to 10, which is N, the only constant in the equation. Column a is the number of agreements. This number is simply the cells showing agreement (e.g., 1, 1; 2, 2; etc.) brought over to a single column. Column ef is the expected frequency. (The method to derive ef was discussed earlier.) At the bottom of column a and column ef are the sum of a (Σa) and the


sum of ef (Σef). In the next chapter, these tables will have beneath them the kappa

    equation worked out, resulting in the kappa coefficient.
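The tallying and kappa arithmetic described above can also be expressed compactly in code; the following Python sketch (illustrative only; the function and variable names are not part of the study materials) builds the contingency table for one rater pair and returns Σa, Σef, and kappa:

    def pairwise_kappa(scores_x, scores_y, categories=(1, 2, 3, 4, 5)):
        # Tally the contingency table: rows are Rater Y, columns are Rater X.
        n = len(scores_x)
        table = {(y, x): 0 for y in categories for x in categories}
        for x, y in zip(scores_x, scores_y):
            table[(y, x)] += 1

        row_totals = {y: sum(table[(y, x)] for x in categories) for y in categories}
        col_totals = {x: sum(table[(y, x)] for y in categories) for x in categories}

        # Sum the agreement (diagonal) cells and their chance-expected frequencies.
        sum_a = sum(table[(c, c)] for c in categories)
        sum_ef = sum(row_totals[c] * col_totals[c] / n for c in categories)

        kappa = (sum_a - sum_ef) / (n - sum_ef)
        return sum_a, sum_ef, kappa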

    Chapter Summary

    In summary, the methodology is as follows. The researcher enlisted student pilots

    as volunteers to fly Instrument Pattern D using the Elite PCATD. A video camera

    recorded the image of the simulated instrument panel on the movie screen during the

flights. After recording the student pilots' flights, the researcher transferred the footage

    onto DVDs for easier viewing. Each flight was assigned a letter, A through J. The

    researcher then enlisted the help of four instructor pilots to be the rater participants. The

    instructor pilots watched and scored the flights in a controlled environment. Upon

    finishing their task, the researcher collected their score sheets and placed the scores into

    the contingency tables. The researcher then took the pertinent numbers from the table

    (those that indicate agreement) and put them into the kappa formula.

    If the coefficient, kappa, is .70 or greater, the rater pairs can be said to exhibit

    greater reliability; if less than .70, then the rater pairs may be said to exhibit lesser

    reliability. The next chapter will discuss the results of this experiment.


    CHAPTER 4

    RESULTS

    The experiment was conducted in a classroom equipped with a PC, projector and

    movie screen. The four raters sat in the same room, but were seated far apart to prevent

    communication between raters. They were given instructions and a score sheet

(Appendices C and D, respectively) and were briefed by the researcher about how to

    behave during the test (i.e. no talking, gesturing, or using other means of communicating

    during flights, no talking about the flights during break times, etc.). It took three hours to

    watch all of the flights, including two short restroom breaks and one longer break time

    during which the researcher switched from the first to the second DVD.

    Raw Scores

    The raters watched the flights and marked the scores on the score sheet that was

    provided. The researcher collected the score sheets and the raw scores are in Table 2

    below.

    Table 2

Raters' Raw Scores

                           Sample Flight
   Rater     A    B    C    D    E    F    G    H    I    J
     1       4    5    2    1    4    3    2    3    1    5
     2       4    5    1    1    4    4    2    4    1    3
     3       3    3    1    1    3    4    1    4    1    2
     4       3    5    1    1    3    3    2    4    1    4


    At first glance, these scores appear to show good agreement, especially in sample

    flights C, D, G, H and I. A brief examination of the raw scores also reveals that Rater 1

evenly distributed the scores, the only rater to do so. Raters 2 and 4 had very similar results, with the only disagreements being between scores of 3 and 4. Rater 3 gave the most

    scores of 1, and gave no scores of 5. However, to properly analyze the data for inter-rater

    reliability, these raw scores must be tallied in the contingency tables.
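As a spot check (an illustration, not part of the original analysis), applying the same tallying to the Rater 1 and Rater 2 rows of Table 2 in a few lines of Python reproduces the Σa = 6, Σef = 2, and κ = .50 reported in Table 3 below:

    # Rater 1 and Rater 2 scores for sample flights A through J (Table 2).
    rater1 = [4, 5, 2, 1, 4, 3, 2, 3, 1, 5]
    rater2 = [4, 5, 1, 1, 4, 4, 2, 4, 1, 3]

    n = len(rater1)            # 10 sample flights
    scores = range(1, 6)       # possible scores, 1 through 5

    # Agreements and chance-expected agreements on the diagonal cells.
    sum_a = sum(1 for x, y in zip(rater1, rater2) if x == y)                 # 6
    sum_ef = sum(rater1.count(s) * rater2.count(s) / n for s in scores)      # 2.0

    kappa = (sum_a - sum_ef) / (n - sum_ef)
    print(sum_a, sum_ef, kappa)                                              # 6 2.0 0.5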

    Contingency Tables

    To analyze the data, the researcher created a series of contingency tables as

illustrated in Table 1. Tables 3 through 8 below are the contingency tables that were used to sort and analyze the data. These tables were adapted from Gwet (2002b) in order to conform to APA standards and to show the data without the redundancy of Gwet's (2002b) tables. Beneath each contingency table is the mathematical work used to derive

    the kappa coefficient.


    Table 3

    Rater 1 versus Rater 2

                         Rater 1
   Score           1    2    3    4    5    Row totals:    a     ef
           1       2    1    0    0    0        3          2     .6
           2       0    1    0    0    0        1          1     .2
   Rater 2 3       0    0    0    0    1        1          0     .2
           4       0    0    2    2    0        4          2     .8
           5       0    0    0    0    1        1          1     .2
   Column totals:  2    2    2    2    2        N          Σa    Σef
                                               10          6     2

Given: N = 10, Σa = 6, Σef = 2

κ = (Σa - Σef) / (N - Σef) = (6 - 2) / (10 - 2) = 4 / 8 = .50

    Table 4

    Rater 1 versus Rater 3

                         Rater 1
   Score           1    2    3    4    5    Row totals:    a     ef
           1       2    2    0    0    0        4          2     .8
           2       0    0    0    0    1        1          0     .2
   Rater 3 3       0    0    0    2    1        3          0     .6
           4       0    0    2    0    0        2          0     .4
           5       0    0    0    0    0        0          0     0
   Column totals:  2    2    2    2    2        N          Σa    Σef
                                               10          2     2

Given: N = 10, Σa = 2, Σef = 2

κ = (Σa - Σef) / (N - Σef) = (2 - 2) / (10 - 2) = 0 / 8 = 0


    Table 5

    Rater 1 versus Rater 4

                         Rater 1
   Score           1    2    3    4    5    Row totals:    a     ef
           1       2    1    0    0    0        3          2     .6
           2       0    1    0    0    0        1          1     .2
   Rater 4 3       0    0    0    0    1        1          0     .2
           4       0    0    2    2    0        4          2     .8
           5       0    0    0    0    1        1          1     .2
   Column totals:  2    2    2    2    2        N          Σa    Σef
                                               10          6     2

Given: N = 10, Σa = 6, Σef = 2

κ = (Σa - Σef) / (N - Σef) = (6 - 2) / (10 - 2) = 4 / 8 = .50

    Table 6

    Rater 2 versus Rater 3

                         Rater 2
   Score           1    2    3    4    5    Row totals:    a     ef
           1       3    1    0    0    0        4          3     1.2
           2       0    0    1    0    0        1          0     .1
   Rater 3 3       0    0    0    2    1        3          0     .1
           4       0    0    0    2    0        2          2     .8
           5       0    0    0    0    0        0          0     0
   Column totals:  3    1    1    4    1        N          Σa    Σef
                                               10          5     2.2

Given: N = 10, Σa = 5, Σef = 2.2

κ = (Σa - Σef) / (N - Σef) = (5 - 2.2) / (10 - 2.2) = 2.8 / 7.8 = .38


    Table 7

    Rater 2 versus Rater 4

                         Rater 2
   Score           1    2    3    4    5    Row totals:    a     ef
           1       3    0    0    0    0        3          3     .9
           2       0    1    0    0    0        1          1     .2
   Rater 4 3       0    0    0    3    0        3          0     .3
           4       0    0    1    1    0        2          1     .8
           5       0    0    0    0    1        1          1     .2
   Column totals:  3    1    1    4    1        N          Σa    Σef
                                               10          6     2.4

Given: N = 10, Σa = 6, Σef = 2.4

κ = (Σa - Σef) / (N - Σef) = (6 - 2.4) / (10 - 2.4) = 3.6 / 7.6 = .47

    Table 8

    Rater 3 versus Rater 4

                         Rater 3
   Score           1    2    3    4    5    Row totals:    a     ef
           1       3    0    0    0    0        3          3     1.2
           2       1    0    0    0    0        1          0     .1
   Rater 4 3       0    0    2    1    0        3          2     .9
           4       0    1    0    1    0        2          1     .6
           5       0    0    1    0    0        1          0     .1
   Column totals:  4    1    3    3    1        N          Σa    Σef
                                               10          6     2.9

Given: N = 10, Σa = 6, Σef = 2.9

κ = (Σa - Σef) / (N - Σef) = (6 - 2.9) / (10 - 2.9) = 3.1 / 7.1 = .44


    Summary of Results

    The scores have been tallied and the kappa for each rater pair calculated. As

    stated previously throughout this study, the minimum desirable kappa coefficient is .70.

    The results in this study were markedly lower.

    Table 9

    Summary of Results

    Rater Pair Kappa

    Rater 1 vs. Rater 2 .50

    Rater 1 vs. Rater 3 .00

    Rater 1 vs. Rater 4 .50

    Rater 2 vs. Rater 3 .38

    Rater 2 vs. Rater 4 .47

    Rater 3 vs. Rater 4 .44

    Average .38

The best kappa was .50, and the worst, 0. The average kappa coefficient was .38, just over half of the desired .70.

    Although all of the rater pairings in this study fell far below .70, one rater, Rater

    3, seemed the least reliable of the four. The three pairings in which Rater 3 was involved

    were the least reliable, one of which had a kappa of 0, entirely unreliable. Rater 1, with

    whom Rater 3 shared the kappa of 0, enjoyed the two highest reliability scores, .50, with

    Raters 2 and 4.

Each rater was paired three times. When each rater's three pairings were averaged, Rater 1 scored a .33; Rater 2, .45; Rater 3, .27; and Rater 4, .47. However,


removing Rater 3 from the averages, so that each rater was paired only twice, Rater 1's average rose to .50, Rater 2's to .48, and Rater 4's to .48. Among Raters 1, 2 and 4, the scores are extremely similar (pair 1 & 2, .50; pair 1 & 4, .50; and pair 2 & 4, .47). Thus it

    seems that removing Rater 3 improved the inter-rater reliability in this study. Without

    Rater 3 the overall average reliability increased from .38 to .49. This is still well below

    .70, but much better.
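For reference, these per-rater averages follow directly from the pairwise kappas in Table 9. The short sketch below is illustrative only; it simply averages the published coefficients with and without the pairings that involve Rater 3.

    # Pairwise kappas from Table 9
    pair_kappa = {
        ("Rater 1", "Rater 2"): 0.50,
        ("Rater 1", "Rater 3"): 0.00,
        ("Rater 1", "Rater 4"): 0.50,
        ("Rater 2", "Rater 3"): 0.38,
        ("Rater 2", "Rater 4"): 0.47,
        ("Rater 3", "Rater 4"): 0.44,
    }

    def average_for(rater, exclude=None):
        # Average every pairing that includes `rater`, optionally dropping
        # pairings that also include `exclude` (e.g., Rater 3).
        values = [k for pair, k in pair_kappa.items()
                  if rater in pair and exclude not in pair]
        return sum(values) / len(values)

    for r in ("Rater 1", "Rater 2", "Rater 4"):
        print(r, round(average_for(r), 2), round(average_for(r, "Rater 3"), 2))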

The next chapter discusses two methods to improve inter-rater reliability at the flight school, offers recommendations for improving the execution of the study and for further research, and describes a commercial application of this study.


    CHAPTER 5

    DISCUSSION

The resultant coefficients show that the study did not yield good inter-rater reliability, so the flight school needs some way to improve it. Two suggestions are to engage in extensive recurrent training and to improve the scoring rubric. There are also ways to improve the technical aspects of the study and avenues for further research. Finally, the researcher proposes a commercial application for this inter-rater reliability study.

    Recurrent Training

The previous chapter described the raw scores and the resultant kappa coefficients for the four raters. These scores show low inter-rater reliability, which may indicate a need for recurrent training to help the flight school reinforce the scoring criteria. In the case of Rater 3, more training would be required than for Raters 1, 2 and 4. In Sample C, while Raters 1, 2 and 4 agreed upon a score of 5, Rater 3 awarded a score of 3. In Sample G, where all others gave a score of 2, Rater 3 gave a 1. And in Sample J, where there was no agreement among any raters, Rater 3 gave the low score of 2. After examining the raw scores, it is evident that the most common disagreement was between the scores 3 and 4. It may be that Raters 1, 2 and 4 need to review the standards to help them differentiate between performances that rate a 3 rather than a 4, while Rater 3 needs more extensive training to align that rater's expectations of student performance with flight school standards.

It may also be helpful to train instructor pilots to interpret the standards used to score student pilot performance by starting with simple maneuvers and working up to complex patterns, just as the students themselves must work their way up from simple maneuvers to complex patterns. This recurrent training may be of little use, however, unless the standards are better defined through an improved scoring rubric.

    Scoring Rubric Improvements

It could be that the scoring rubric needs improving. Referring again to Appendix A, there is a disconnect between the description of the quality of performance and quantifiable data. For example, an Excellent (5) grade will be issued when a student's performance far exceeds and is well above the completion standards. Unfortunately, there is little to define exactly what makes a performance far exceed or be well above the completion standards. The same can be said for scores 4, 3, 2, and 1. The definitions of the scores are too broad.

    The scoring sheet (Appendix D) offered the rater the completion standards from

    the lesson in which Pattern D is taught. The altitude standard asks only that a student

    pilot remain within plus or minus 200 feet of the starting altitude. This standard is very

    broadly defined and leaves too much open to interpretation by individual instructor pilots

and hence affects inter-rater reliability. One way to fine-tune the altitude standard would be to tie each score to a tolerance band, for example (see the sketch following this list):

a score of 5 requires the student to remain within plus or minus 50 feet;

a 4, plus or minus 100 feet;

a 3, plus or minus 150 feet;

a 2, plus or minus 200 feet; and

a 1 indicates that the student violated the 200-foot limit in either direction and is therefore unsatisfactory.
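As a sketch only (the tolerance bands are the example proposed above, not adopted flight school standards), such a banded altitude standard could even be scored automatically from the largest deviation recorded during a flight:

    def altitude_score(max_deviation_ft):
        # Map the largest altitude deviation (feet, either direction) to a
        # 1-5 grade using the example bands proposed above.
        bands = [(50, 5), (100, 4), (150, 3), (200, 2)]
        for limit, score in bands:
            if abs(max_deviation_ft) <= limit:
                return score
        return 1  # the +/- 200 ft completion standard was violated

    print(altitude_score(120))  # a 120-foot excursion would earn a 3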

    The other standards, heading, bank angle and airspeed, could also be redefined to

    more precisely indicate how skilled the student is, rather than leaving a broad range that

is susceptible to loose interpretation. Fine-tuning the standards and retraining the instructor pilots in these newer, more precisely defined standards might help to improve inter-rater reliability. Fine-tuning these standards may require

    further research.

    Technical Improvements

Although the researcher is confident in the methodology, the technical execution of the experiment could be improved. This project was the researcher's first attempt to record video footage from a PCATD and then transfer that footage to DVD. While the footage was usable, the quality could be improved by recording the footage directly from the PCATD rather than through intermediate media. The footage had to travel through several steps: from the PCATD to the projector, to the screen, to the video camera, to the iMac, to the iMovie HD application, to the iDVD application, to actual DVDs. The transfer from camera to the digital movie applications iMovie HD and iDVD is not problematic because there is no noticeable degradation of footage from one digital source to another. Thus, removing the projector, movie screen, and video camera from the middle would likely produce higher quality images, making the footage easier to watch. Since the raters all watched the same footage, the footage quality does not affect the inter-rater reliability; it would only do so if some raters watched one set of footage and other raters watched an improved version of the footage.

    In summary, the technical execution of the project could be improved simply by

    learning how to use all of the features of the iMovie HD and iDVD applications to their

fullest extent. Other high-end video editing applications, such as Final Cut, should also be considered, provided the future researcher has the budget to make these technological upgrades.

    Recommendations for Further Research

With the technical improvement recommendations out of the way, this is an opportunity to discuss the future for which this project is the foundation. As stated in Chapter One, Introduction, this project was a foundational study, meant to lay the groundwork and establish a method for studying inter-rater reliability that can be used at any flight school with the resources to carry out the experiment.

    The first recommendation is to expand the number of samples, the number of

    raters, or both. This researcher would also encourage a future researcher to test other

    means of measuring inter-rater reliability. Chapter Two, Literature Review, cited studies

    which used alpha and rho. In the interest of finding the best analytical method, alpha,

    rho, and other coefficients should be tested along with the increase in samples and raters

until an agreed-upon method is derived.
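As one purely illustrative example of an alternative statistic (not a method used in this study), a rank correlation such as Spearman's rho could be computed for a rater pair; the sketch below assumes SciPy is available and uses hypothetical scores.

    from scipy.stats import spearmanr  # assumes SciPy is installed

    # Hypothetical 1-5 scores from two raters for ten sample flights
    rater_x = [5, 4, 5, 3, 2, 4, 2, 1, 3, 5]
    rater_y = [5, 4, 3, 3, 2, 3, 1, 1, 4, 5]

    rho, p_value = spearmanr(rater_x, rater_y)
    print(round(rho, 2), round(p_value, 3))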

    The second recommendation is to choose different patterns. One suggestion is to

    begin testing particular maneuvers such as shallow, medium and steep turns, ascending

    and descending turns, or constant airspeed climbs. These are just examples, and a future

    researcher could experiment with particular maneuvers rather than entire patterns. At the


    same time, one could also consider choosing from a catalog of other instrument patterns,

    more or less challenging than Pattern D.

    Recommendations one and two do not cast doubt on the methodology of this

    study. Adding more raters might lead to more agreement, but it might also lead to more

disagreement. Likewise, adding more sample flights may either lower or raise reliability. What must be avoided at all costs is designing a study that is

structured to create agreement. Testing particular maneuvers rather than patterns is not necessarily better: maneuvers are just one part of flight training, and the goal of training is not merely to make a pilot proficient at individual maneuvers, but to give the pilot such a depth of understanding and technical ability that he or she can spontaneously serialize or combine discrete maneuvers into an organic flight that has unity from takeoff to landing. Testing only maneuvers, only patterns, or only spontaneous flights is therefore not inherently better. However, more samples, raters, other patterns, and other statistical

    methods all deserve to be tested for the sake of expanding our body of knowledge and for

    perfecting a method that one day could become tried and true. In short, researchers

    must trust in the scientific method to continually develop better means of testing and

    never rest contented with existing research.

Upon doing further research, fine-tuning the standards and putting instructors through updated training, one may find that the method can be adapted for commercial use.


    Commercial Application of this Study

Once this experiment has been tested and re-tested so that its results are replicable and consistent, and the method has been deemed valid by a panel of experts in related fields, this study can be developed into an instructor training program for the commercial market and sold to flight schools.

    Following some of the recommendations above, perhaps the instructor training

    program could begin by evaluating maneuvers and testing reliability. Upon reaching a

kappa of .70 or greater, the instructor trainee can move on to the next phase: learning how to evaluate simple patterns, then complex patterns, and finally how to reliably rate IFR check rides. The training need not happen

    only using a PCATD. The method and training system must be such that as the training

progresses, the footage from the PCATD is replaced by footage from a full simulator, and the full-simulator footage is eventually replaced by footage from an actual aircraft, because the instructors and their students will experience training in all three media.

    Future researchers who wish to apply this project to a commercial application

must establish baseline flights, just as Penny et al. (2000) established benchmark essays for

    scoring writing samples. For example, a future researcher may find that a particular

    flight has been viewed by raters and they have consensus that the flight is a 3. A

    researcher for a commercial developer or flight school must build up a catalog of baseline

    flights that have all been tested and create a test in which the established baseline scores

    are entered into the contingency table as Rater 1, while the rater currently being tested

    becomes Rater 2. Thus, the future researcher or tester will place the New Rater versus

    the baseline scores. A kappa of .70 or greater shows that the New Rater can score flights


    reliably, while a kappa less than .70 will indicate that the New Rater needs further

instruction before being allowed to rate actual flights. The result may be that flight schools can effectively and economically screen potential flight instructors or maintain standards with current instructors.
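A minimal sketch of such a screening check follows; the baseline and candidate scores are hypothetical, and the use of scikit-learn's kappa implementation is an assumption of convenience rather than part of this study.

    from sklearn.metrics import cohen_kappa_score  # assumes scikit-learn is installed

    # Hypothetical consensus baseline scores for ten catalog flights and a
    # new rater's scores for the same flights (illustration only)
    baseline = [3, 5, 2, 4, 1, 3, 4, 2, 5, 3]
    new_rater = [3, 5, 2, 3, 1, 3, 4, 2, 4, 3]

    kappa = cohen_kappa_score(baseline, new_rater)
    if kappa >= 0.70:
        print(f"kappa = {kappa:.2f}: the new rater may rate actual flights")
    else:
        print(f"kappa = {kappa:.2f}: further instruction is recommended")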

    Summary

"The search for valid, reliable, feasible, and fair assessments of cognitive and human performance is, in many ways, at the very heart of educational measurement" (Penny et al., 2000). In a very real way, instructor pilots are educators, and their

    evaluations of student performance are educational measurements. The researcher sought

    to find research in scientific and educational journals that would help to lay the

    foundation of inter-rater reliability studies in flight training. To that end, four flight

school instructors (raters) were tested according to a methodology inspired by the literature reviewed, with statistical analysis based upon Cohen's kappa coefficient. This coefficient is commonly used in inter-rater reliability studies across several fields, from education, social science, psychology, and medicine to sports, and it is used quite often in training situations. In this study, kappa was applied to flight training, specifically to test instructor pilots for inter-rater reliability. Ultimately, the study indicated that inter-rater reliability was low, with an average kappa of .38, well below the desired .70. Nevertheless, this study was successful in that it demonstrated a usable method for testing inter-rater reliability in flight training and provides a basis for further research and commercial development.


    REFERENCES

Bamford, J.T.M., Gessert, C.E., & Renier, C.M. (2004). Measurement of the severity of rosacea. [Electronic version]. Journal of the American Academy of Dermatology, 51(5), 697-703.

Bann, S., Davis, I.M., Moorthy, K., Munz, Y., Hernandez, J., Khan, M., Datta, V., & Darzi, A. (2005). The reliability of multiple objective measures of surgery and the role of human performance. [Electronic version]. The American Journal of Surgery, 189, 747-752.

Bell, V., Halligan, P.W., & Ellis, H.D. (2006). Diagnosing delusions: A review of inter-rater reliability. [Electronic version]. Schizophrenia Research, 86, 76-79.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37-46.

DeVellis, R.F. (2005). Inter-rater reliability. [Electronic version]. In Encyclopedia of Social Measurement (Vol. 2, pp. 317-322). New York: Elsevier Inc.

Dionne, C.P., Bybee, R.F., & Tomaka, J. (2006). Inter-rater reliability of McKenzie assessment in patients with neck pain. [Electronic version]. Physiotherapy, 92, 75-82.

Drake, R., Haddock, G., Terrier, N., Bentall, R., & Lewis, S. (2007). The Psychotic Symptom Rating Scales (PSYRATS): Their usefulness and properties in first episode psychosis. [Electronic version]. Schizophrenia Research, 89, 119-122.

Ferri, R., Bruni, O., Miano, S., Smerieri, A., Spruyt, K., & Terzano, M. (2005). Inter-rater reliability of sleep cyclic alternating pattern (CAP) scoring and validation of a new computer-assisted CAP scoring method. [Electronic version]. Clinical Neurophysiology, 116, 696-707.

Goodwin, L.D., & Goodwin, W.L. (1985). An analysis of statistical techniques used in the Journal of Educational Psychology, 1979-1983. [Electronic version]. Educational Psychologist, 20(1), 13-21.

Gwet, K. (2002a). Kappa statistic is not satisfactory for assessing the extent of agreement between raters. Retrieved December 15, 2006, from http://www.stataxis.com/files/articles/kappa_statistic_is_not_satisfactory.pdf

Gwet, K. (2002b). Cohen's Kappa. Retrieved December 15, 2006, from http://www-class.unl.edu/psycrs/handcomp/hckappa.pdf


Holey, L.A., & Watson, M.J. (1995). Inter-rater reliability of connective tissue zones recognition. [Electronic version]. Physiotherapy, 61(7), 369-372.

Hulsman, R.L., Mollema, E.D., Oort, F.J., Hoos, A.M., & de Haes, J.C.J.M. (2006). Using standardized video cases for assessment of medical communication skills: Reliability of an objective structured video examination by computer. [Electronic version]. Patient Education and Counseling, 60, 24-31.

Joo, E.-J., Joo, Y.-H., Hong, J.-P., Hwang, S., Maeng, S.-J., Han, J.-H., Yang, B.-H., Lee, Y.-S., & Kim, Y.-S. (2004). Korean version of the Diagnostic Interview for Genetic Studies: Validity and reliability. [Electronic version]. Comprehensive Psychiatry, 45(3), 225-229.

Kadri, N., Agoub, M., El Gnaoui, S., Mchichi Alami, Kh., Hergueta, T., & Moussaoui, D. (2005). Moroccan colloquial Arabic version of the Mini International Neuropsychiatric Interview (MINI): Qualitative and quantitative validation. [Electronic version]. European Psychiatry, 20, 193-195.

Kaneda, Y., Ohmoria, T., & Fujii, A. (2001). The serotonin syndrome: Investigation using the Japanese version of the Serotonin Syndrome Scale. [Electronic version]. Psychiatry Research, 105, 135-142.

Kirshner, W.K. (1990). The Pilot's Manual: Instrument Flying (4th ed.). Ames, IA: Iowa State Press.

Kolaitas, J., Korpa, T., Kolvin, I., & Tsiantis, J. (2003). Letter to the editor. [Electronic version]. European Psychiatry, 18, 374-375.

Kolt, G.S., Brewer, B.W., Pizzari, T., Schoo, A.M.M., & Garrett, N. (2006). The Sport Injury Rehabilitation Adherence Scale: A reliable scale for use in clinical physiotherapy. [Electronic version]. Physiotherapy, 93(1), 17-22.

Lee, H.K. (2004). A comparative study of ESL writers' performance in a paper-based and a computer-delivered writing test. [Electronic version]. Assessing Writing, 9, 4-26.

Leung, T.K.S., & Tsang, H.W.H. (2006). Chinese version of the Assessment of Interpersonal Problem Solving Skills. [Electronic version]. Psychiatry Research, 143, 189-197.

Lindeman, B., Libkuman, T., King, D., & Kruse, B. (2000). Development of an instrument to assess jump-shooting form in basketball. [Electronic version]. Journal of Sports Behavior, 23(4), 335-348.


Meyers, B.S., English, J., Gabriele, M., Peasley-Miklus, C., Heo, M., Flint, A.J., Mulsant, B.H., & Rothschild, A.J. (2006). A Delusion Assessment Scale for psychotic major depression: Reliability, validity, and utility. Biological Psychiatry, 60, 1336-1342.

Michelson, J.D. (2006). Simulation in orthopaedic education: An overview of theory and practice. [Electronic version]. The Journal of Bone & Joint Surgery, 88-A(6), 1405-1411.

Monroe-Blum, H., Collins, E., McCleary, L., & Nuttall, S. (1996). The social dysfunction index (SDI) for patients with schizophrenia and related disorders. [Electronic version]. Schizophrenia Research, 20, 211-219.

Papavasilou, A.S., Rapidi, C.A., Rizou, C., Petrapoulou, K., & Tzavara, Ch. (2006). Reliability of Greek version Gross Motor Function Classification System. [Electronic version]. Brain & Development, 29, 79-82.

Penny, J., Johnson, R.L., & Gordon, B. (2000). The effect of rating augmentation on inter-rater reliability: An empirical study of a holistic rubric. [Electronic version]. Assessing Writing, 7, 143-164.

Raymont, V., Buchanan, A., David, A.S., Hayward, P., Wessley, S., & Hotopf, M. (2006). The inter-rater reliability of mental capacity assessments. [Electronic version]. Law and Psychiatry, 30, 112-117.

Rittenberger, J.C., Martin, J.R., Kelly, L.J., Roth, R.N., Hostler, D., & Callaway, C.W. (2006). Inter-rater reliability for witnessed collapse and presence of bystander CPR. [Electronic version]. Resuscitation, 70, 410-415.

Schmidt, N.B., Salas, D., Bernert, R., & Schatschneider, C. (2005). Diagnosing agoraphobia in the context of panic disorder: Examining the effect of the DSM-IV criteria on diagnostic decision-making. [Electronic version]. Behavior Research and Therapy, 43, 1219-1229.

Thuile, J., Even, C., Friedman, S., & Guelfi, J.-D. (2005). Inter-rater reliability of the French version of the core index for melancholia. [Electronic version]. Journal of Affective Disorders, 88, 193-208.

Trochim, W.M.K. (2001). The Research Methods Knowledge Base (2nd ed.). Mason, OH: Thomson.

Tural, U., Fidaner, H., Alkin, T., & Bandelow, B. (2002). Assessing the severity of panic disorder and agoraphobia: Validity, reliability and objectivity of the Turkish translation of the Panic and Agoraphobia Scale (P & A). [Electronic version]. Journal of Anxiety Disorders, 16, 331-340.


Worster, A., Sardo, A.A., Fernandes, C.M.B., Eva, K., & Upadhy, S. (2007). Triage tool inter-rater reliability: A comparison of live versus paper case scenarios. [Electronic version]. Journal of Emergency Nursing, 33(4), 319-323.


    APPENDIX A

    SCORING RUBRIC



    APPENDIX B

    BRIEFING AND SCRIPT


    Brief

    Thank you for participating in this inter-rater reliability study. You are not being

    tested. Your upcoming flight will be scored by instructors for research purposes only.

    Your performance here today will not have any effect on your scores in school. Your

    name is not being recorded. Even I, the researcher, am not keeping a record of your

    name or any information about you.

During this flight you will be asked to fly Pattern D from The Pilot's Manual: Instrument Flying. Whether you have a passing or thorough knowledge of this flight

    pattern is not important. I will talk you through the flight, if necessary. I will not keep

    track of the time for you. I will, however, give you ample time before the next maneuver.

    Remember, you are not the one being tested. This flight is being used to test your

    instructors. Even though your performance is not being tested, I ask that you still try

    your best just as you would in a real plane with a real instructor pilot.

    Do you have any questions?

    Instructions

    This flight will begin with you already airborne. You are flying at 6000 feet,

    straight and level, heading 360, at 130 knots cruising speed. The flight will end with you

    airborne as well. Do you have any questions before we begin?

1. Begin now. Keep the aircraft straight and level for one minute.

    2. At the one minute mark, turn left to heading 315.

    3. When you come to heading 315, fly straight and level for one minute.

    4. Turn right, 180 degrees to heading 135.


    5. When you reach heading 135, fly straight and level for 30 seconds.

    6. Turn right 45 degrees to heading 180.

    7. When you reach heading 180, fly straight and level for 2 minutes.

    8. At the two-minute mark, turn right thirty degrees to heading 210.

    9. Fly straight and level for 45 seconds.

10. Turn left 210 degrees to heading 360.

11. When you reach heading 360, fly straight and level for 2 minutes.

12. Turn right 180 degrees to heading 180.

13. Fly straight and level for 2 minutes.

14. Turn right 180 degrees to heading 360. Fly straight and level for 2 minutes.

15. You have finished the flight. Please stop.

    Debrief

    Thank you for flying this pattern. Your flight is one of many that will be used to

    help us test the reliability of the instructor pilots. Although a recording of your flight has

    been made, no information about you has been kept, and thus no information about you

    can or will be shared.

    Do you have any questions before you go?


    APPENDIX C

    INSTRUCTIONS TO RATERS


    Brief

    Thank you for being kind enough to participate in this study. We are soon going

    to watch DVDs containing 10 sample flights. Before we watch these flights, I must lay

    out some ground rules:

    1. We will watch each flight only once.

    2. Score each flight at the end of the flight. Do not wait until all flights are over to

    score them all. Take each flight as it is.

    3. You have been given a copy of Pattern D and the scoring rubric, which you may

    refer to throughout this process. On the score sheet, there is also a brief summary

    of the standards and of the scoring rubric.

    4. You must not communicate with each other while watching the flights. This

    includes talking, nodding, winking, gesturing, making faces, etc.

    5. We will take short breaks after every two videos, and a long break at the end of

    the first DVD.

    6. You may talk during the break times, but you must refrain from talking about the

    flights. Please keep the conversation to unrelated topics.

    7. At the end of the viewing, after I have collected your score sheets, we may then

    discuss any flights. You will not have the ability to change your scores.

    Do you have any questions before we begin?


    APPENDIX D

    SCORING SHEET


    Standards:

Altitude: +/- 200 feet

Heading: +/- 15 degrees

Bank angle: +/- 10 degrees

Airspeed: +/- 15 KIAS

    Grading Scale:

    5 Excellent

    4 Very Good

    3 Good

    2 Marginal

    1 Unsatisfactory

    Note: Standards are taken directly from a lesson pertinent to Pattern D. The grading scale

    is the same as described in the scoring rubric.

Give a score of 1 to 5 for each sample.

    A B C D E F G H I J