

2011; 33: e572–e585

WEB PAPER

Reliability and validity of student peer assessment in medical education: A systematic review

RENÉE SPEYER 1, WALMARI PILZ 2, JOLIEN VAN DER KRUIS 3 & JAN WOUTER BRUNINGS 2

1 HAN University of Applied Sciences, The Netherlands, 2 Maastricht University Medical Center, The Netherlands, 3 Jeroen Bosch Hospital, The Netherlands

Abstract

Background: Peer assessment has been demonstrated to be an effective educational intervention for health science students.

Aims: This study aims to give an overview of all instruments or questionnaires for peer assessment used in medical and allied health professional educational settings, together with their psychometric characteristics as described in the literature.

Methods: A systematic literature search was carried out using the electronic databases Pubmed, Embase, ERIC, PsycINFO and Web of Science, covering all available inclusion dates up to May 2010.

Results: Out of 2899 hits, 28 studies were included, describing 22 different instruments for peer assessment, mainly in medical educational settings. Although most studies considered professional behaviour the main subject of assessment and usually described peer assessment as an assessment tool, great diversity was found in educational settings and applications of peer assessment, in the dimensions or constructs covered, in the number of items and scoring system per questionnaire, and in psychometric characteristics.

Conclusions: Although quite a few instruments for peer assessment have been identified, many questionnaires did not provide sufficient psychometric data. Still, the final choice of an instrument for educational purposes can only be justified by sufficient reliability and validity as well as by the discriminative and evaluative purposes of the assessment.

Introduction

In medical practice, peer assessment is considered to be a valuable instrument. Peer assessment can be used to stimulate students to participate in educational activities and clarify assessment criteria, to improve team performance or to determine individual effort. Peer assessment encourages students to develop a critical attitude towards each other's professional behaviour. By judging their peers, students might gain insight into their own performance, although Eva and Regehr (2005) pointed out fundamental limitations of self-assessment that are not ameliorated by peer assessment. Peer teaching and learning have been demonstrated to be an effective educational intervention for health science students on clinical placements (Secomb 2007). According to Gielen (2007), peer assessment has five main goals: the use of peer assessment as an assessment tool and as a learning tool, the installation of social control in the learning environment, the preparation of students for self-monitoring and self-regulation in lifelong learning, and the active participation of students in the classroom. The most well-known goal of peer assessment is its use as an assessment tool. As the judgements by peers need to be valid and reliable, quality criteria for the peers' judgements can be formulated depending on the subgoal of peer assessment as an assessment tool. For example, if peer assessment is to replace staff assessment, agreement or concurrent validity is important. Agreement, however, requires an adequate reference for comparison. In the literature, disagreement exists on the most appropriate choice of reference (e.g. assessment by teachers or peers, other episodes of assessment by the same assessors, or self-assessment by the assessees). A prerequisite for achieving this goal of using peer assessment as an assessment tool is that stakeholders need to have confidence in and accept the results of the assessment.

Practice points

. In this systematic review, 22 different instruments for peer assessment, mainly in medical educational settings, have been identified in studies of sufficient quality.
. Many of these peer assessment tools do not provide sufficient psychometric data.
. The use of an instrument for educational purposes can only be justified by its sufficient reliability and validity as well as by the discriminative and evaluative purposes of the assessment.
. Peer assessment is a common tool in medical educational settings and can be an effective format in peer teaching and peer learning for health science students.
. The outcome of this review stresses the need to focus on the psychometric characteristics of peer assessment tools in future research.

Correspondence: R. Speyer, Institute of Health Studies, HAN University of Applied Sciences, PO Box 6960, 6503 GL Nijmegen, The Netherlands. Tel: 00 31 24 3531128; fax: 00 31 3875580; email: [email protected]

e572 ISSN 0142–159X print/ISSN 1466–187X online/11/110572–14 © 2011 Informa UK Ltd. DOI: 10.3109/0142159X.2011.610835

Med Teach. Downloaded from informahealthcare.com by University of Cambridge on 10/31/14. For personal use only.

The second goal refers to peer assessment as a learning tool. Three processes are able to produce or support this learning: learning by the assessee through assessment for learning and feedback, learning by the assessor through assessing for learning, and learning by both through peer learning processes. By providing feedback to students, their professional behaviour can be adjusted and improved if necessary. The third goal, peer assessment as a tool for social control, requires the most external control; here, the efficiency in reaching desired behaviour and avoiding undesired behaviour is considered the most important quality concept. However, when students are helped to learn how to assess themselves as lifelong learners and to grow into independent learners, a fourth goal has been achieved, resulting in self-regulation and self-monitoring of learning behaviour by the students themselves. The final goal is most directly linked to autonomy support in the classroom, as peer assessment becomes a tool to stimulate active participation of students in their learning within student-centred learning environments; quality of assessment can then be conceptualized as the development of a 'sense of ownership' of the learning and assessment for each student.

Thus, the quality of peer assessment can be influenced by a variety of factors, including the reliability of the assessment, the interaction between peers, the stakes of the assessment and the assumption of equivalence between the evaluations of each (student) colleague or peer (Norcini 2003). Although the use of peer assessment may be limited by issues such as required confidentiality or anonymity, its ratings are based on credible sources, encompass habitual behaviours and reflect future academic and clinical performance (Epstein 2007).

Peer assessment is used in various educational settings (van Zundert et al. 2010), including peer assessment by medical students (Linn et al. 1975; Arnold et al. 2007). Evans et al. (2004) described a systematic review on the validity and reliability of existing instruments for rating peers (professional colleagues) in medical practice. The authors concluded that the instruments developed for peer review needed further assessment of validity before widespread use was merited. As doubts have been raised about the psychometric characteristics of the assessment instruments used for peer review by physicians, the same problems might be expected when studying peer assessment by medical students (Dijcks et al. 2003). In general, it is necessary to have exact knowledge of the psychometric characteristics of the assessment instruments being used, because the outcome of instruments showing insufficient validity or reliability cannot be correctly interpreted. Especially when using peer assessment as an assessment tool, good measurement properties are essential.

In the literature, no systematic review has been described of peer assessment in medical or paramedical education. This study presents such a systematic review and identifies existing instruments for peer assessment in educational settings by medical students and students in allied health professions. As medical or paramedical educational settings may assess student performance or professional behaviour highly specific to this type of education (e.g. physician performance or clinical competence), it was decided to restrict this review to instruments designed for and used in these specific settings, rather than including peer assessment in other settings as well. In this study, peer assessment is defined according to Topping (1998): 'An arrangement in which individuals consider the amount, level, value, worth, quality, or success of the products or outcomes of learning of peers of similar status'. Peer assessment refers to the assessment of a student's performance undertaken by a fellow (peer) student in the same field, in order to maintain or enhance the quality of the work or performance in that medical field. Once the instruments have been identified in the literature, the psychometric characteristics of these assessment tools are described in order to facilitate an adequate choice of tools in future research or assessment settings.

The purpose of this study is twofold: to give an overview of all instruments or questionnaires for peer assessment used in medical and allied health professional educational settings, and to present their psychometric characteristics as described in the literature.

Method

A literature search was performed independently by two reviewers using five electronic literature databases (Table 1): Pubmed, Embase, ERIC, PsycINFO and Web of Science. All available inclusion dates up to 22 April 2010 were searched. The search was limited to English, German, French, Spanish and Dutch publications. Mesh or thesaurus terms were supplemented with free text words (truncation or wild card) to identify the most recent publications over the last year.

Only articles describing an instrument or questionnaire for peer assessment in paramedical or medical educational settings were included. Both original articles describing such peer assessment tools and articles presenting data on the reliability or validity of any of these instruments were included in this study. Studies about peer rating (Levine et al. 2007), in which report marks are assigned by peers to one another without using a well-defined rating scale and/or a specified subject of assessment, were excluded, because the assignment of report marks may be a method of peer evaluation but is not considered an instrument or tool for peer assessment as defined in this review. Peer nomination (McCormack et al. 2007) and performance ranking (Reiter et al. 2002) were excluded for similar reasons. Studies describing peer assessment of group performance without presenting data on individual performance (Miller 2003), and studies with differences in educational level between peer assessors and those being assessed (Davis & Inamdar 1988; Wendling & Hoekstra 2002; Field et al. 2007; Bucknall et al. 2008; Ljungman & Silen 2008; Maker & Donnelly 2008), e.g. senior or graduating students versus first-year students, do not fit the concept of peer assessment as defined by Topping (1998); therefore, these studies were not included. Review articles and case reports were excluded as well. Study quality was assessed using criteria based on Bangert-Drowns et al. (1997). Criteria referred to adequate reporting of population or study characteristics (e.g. educational level, number of peer assessors), exclusion of inappropriate student tasks (e.g. prediction of grades), and reproducibility of the study, excluding studies providing inadequate procedural information or global ratings and peer assessment without criteria or structure provided (Burnett & Cavaye 1980; Magin 2001; Parikh et al. 2001; McCarty et al. 2005; Nieder et al. 2005; Chenot et al. 2007; Hulsman et al. 2009). Finally, the citations listed in the selected studies were searched for additional literature.

Results

Overview of studies

The combined searches of Mesh or Thesaurus terms plus free text words in five databases yielded a total of 2899 articles (Table 1), of which 28 met the inclusion criteria (see Methods section). Table 2 presents an overview of all included articles, listing authors in alphabetical order. For each study, the following data are summarized based on the original article: number of participants and study settings, subject of peer assessment, description of the instrument for peer assessment, and psychometric data according to the author(s)' findings. The first column presents the authors in combination with the year of publication. In the second column, information is summarized on the number of participants in relation to students' compliance or amount of missing data, the educational level of the students involved, the subject area or course name in combination with the educational activity in which students participated, the moments of measurement, and the number of groups and evaluations per student. Next, the subject of peer assessment is mentioned, followed by the description of the instrument being used, including the number of items and rating scale as well as the number of distinctive domains. The final column shows reliability and validity data on the peer assessment instrument as described by the author(s).

Other measurement tools besides peer assessment instruments that were used in any of the study designs of the included articles are mentioned in Table 2 only when considered important for adequate interpretation of, for example, the psychometrics of peer assessment. For instance, to determine the criterion validity of a peer assessment instrument, the gold standard (usually faculty assessment) has been named. Furthermore, as the purpose of this review is to summarize information on the reliability and validity of peer assessment instruments, information about, for example, the effects of peer assessment in general or comparisons of study outcomes in different groups (Perera et al. 2010) has not been reported.

General description

Looking at the included articles, all publications were written in English with the exception of one Spanish study by Amato and Novales-Castro (2009). Four studies were published in the 1970s, another two in the 1980s and four in the 1990s. The other 18 studies were published in the last decade, suggesting a growing interest in peer assessment and, especially, in its reliability and validity. Although the purpose of this study was to give an overview of all instruments or questionnaires for peer assessment in paramedical or medical educational settings, no such questionnaire was found for paramedical settings. Almost all studies focused on medical students. One study described peer assessment in pharmacy students (O'Brien et al. 2008), and another in a combined group of medical and dentistry students (Nofziger et al. 2010). The actual number of participants in these studies, taking into account the number of drop-outs or missing data if mentioned by the author(s), ranged from 16 to 349 students. Seven studies included student populations with fewer than 50 subjects, 8 studies had between 50 and 100 subjects, and 13 had more than 100 subjects. The median number of subjects was 98 (25th percentile = 51; 75th percentile = 160).

Most studies considered professional behaviour the main subject of assessment (Bryan et al. 2005; Cottrell et al. 2006; Kovach et al. 2009), whereas other studies focused on topics such as leadership capacities (Chen et al. 2009), interview skills (Rudy et al. 2001; Perera et al. 2010) or problem-based performance (Sullivan et al. 1999; Papinczak et al. 2007a,b; Amato & Novales-Castro 2009). The majority of the studies used peer assessment mainly as an assessment tool. Many questionnaires covered diverse domains of interest. For example, the concept of resident performance by Davis (2002) included both the domain of clinical competence and that of interpersonal skills.

Table 1. Systematic literature search (N total = 2899 abstracts identified).

Mesh or thesaurus terms (limits: NA):
Pubmed, 482 abstracts: (Education, Medical OR Education, Medical, Undergraduate OR Education, Medical, Graduate OR Education, Medical, Continuing) AND Peer Review
Embase, 747 abstracts: Peer Review AND Medical Education
ERIC, 49 abstracts: Peer Evaluation AND Medical Education
PsycINFO, 21 abstracts: Peer Evaluation AND Medical Education

Free text words, truncation or wild card (limits: last year):
Pubmed, 429 abstracts: (Medical education OR medicine OR paramedical student* OR medical student*) AND (peer review OR peer assessment OR peer evaluation OR peer feedback)
Embase, 228 abstracts: same terms; January '09 to present
ERIC, 6 abstracts: same terms
PsycINFO, 33 abstracts: same terms
Web of Science, 791 articles and 113 proceedings papers: same terms


Tab

le2

.S

um

mary

of

chara

cte

ristic

sof

inst

rum

ents

for

peer

ass

ess

ment

inm

ed

icaland

allied

health

pro

fess

ionaled

ucatio

n.

Refe

rence

(alp

hab

etic

alord

er)

Part

icip

ants

/stu

dy

sett

ings

Sub

ject

of

ass

ess

ment

Desc

riptio

nof

inst

rum

ent

for

peer

ass

ess

ment

Sum

mary

of

psy

chom

etr

icd

ata

by

the

auth

or(s)

a

Ala

gna

and

Red

dy

(1985)

Firs

tp

reclin

icalye

ar

med

ical

stud

ents

inanato

my

dis

sectio

n

gro

up

s(N¼

128)

Tw

om

om

ents

of

measu

rem

ent:

16

weeks

into

cours

eand

4m

onth

s

late

r;

Num

ber

of

peer

ratin

gs

per

stud

ent?

Gro

up

functio

nin

gTw

oite

ms

on

satis

factio

nas

lab

part

ners

and

frie

nd

liness

felt

tow

ard

sm

em

bers

(seve

n-

poin

tsc

ale

)

NA

Am

ato

and

Nova

les-

Cast

ro(2

009)

(Sp

anis

hla

nguage)

Third

year

med

icalst

ud

ents

(N¼

334)

Activ

ity:

casu

istr

y(1

–11

pro

ble

m-

base

dp

atie

nt

case

sp

er

gro

up

);

Fiv

eto

nin

est

ud

ents

per

gro

up

Pro

ble

m-b

ase

dle

arn

ing

perf

or-

mance

with

intu

toria

lsett

ing:

1.

Activ

ityand

inte

ractio

nw

ith

gro

up

mem

bers

;

2.

Qualit

yof

ora

lcase

pre

senta

tion

Tw

oq

uest

ionnaire

s(s

ee

sub

ject

of

ass

ess

ment):

Each

of

seve

nite

ms

(five

-poin

t

scale

)

NA

Arn

old

et

al.

(1981)

Six

thye

ar

med

icalg

rad

uate

sd

urin

g

finalro

tatio

n(N

Gro

up

58)

and

gra

duate

sst

illin

resi

dency

train

ing

(NG

roup

86)

ove

ra

perio

dof

4ye

ars

Com

ple

ted

ata

(conse

nt

agre

e-

ment):

NG

roup

50/5

8and

NG

roup

56/8

6

(Slig

ht

ad

diti

onalre

ductio

nin

num

ber

of

case

sb

ecause

of

mis

sing

data

.)

Num

ber

of

measu

rem

ents

?:

Fin

al-

year

eva

luatio

n;

At

least

two

peer

eva

luatio

ns

per

stud

ent

Clin

icalp

erf

orm

ance

Willoughb

yet

al.

(1979)

Inte

rnalc

onsi

stency:

Cro

nb

ach’s

alp

ha,

for

both

gro

up

scom

bin

ed

,w

as

0.9

4

Inte

rrate

rre

liab

ility

:usi

ng

analy

sis

of

varia

nce,

intr

acla

sscorr

ela

tion

coeffic

ients

for

Gro

up

s1

and

2w

ere

,re

spectiv

ely

,0.5

0and

0.5

2

Relia

bility

(psy

choso

cia

lb

ias)

:no

statis

tically

signifi

cant

corr

ela

tions

em

erg

ed

betw

een

peer

ratin

gs

and

race,gend

er,

geogra

phic

alo

rigin

or

socia

lcla

ssof

stud

ents

.O

nly

few

statis

tically

signifi

cant

rela

tion-

ship

sw

ere

found

betw

een

peer

ratin

gs

and

pers

onalit

yor

vocatio

nal

inte

rest

s

Crit

erio

nva

lidity

:p

eer

eva

luatio

ns

(Gro

up

2)

show

ed

hig

hsi

gnifi

cant

corr

ela

tions

with

faculty

ratin

gs

of

stud

ents

’fin

al-

year

rota

tion

(Clin

icalP

erf

orm

ance

Eva

luatio

n,

CP

E),

the

Natio

nalB

oard

Med

ical

Exa

min

atio

n(N

BM

Ep

art

II),th

eQ

uart

erly

Pro

file

Exa

min

atio

nand

the

cum

ula

tive

gra

de-p

oin

tave

rage

(baccala

ure

ate

degre

e)

Pre

dic

tive

valid

ity:

signifi

cant

corr

ela

tions

betw

een

peer

ratin

gs

durin

g

the

finaly

ear

and

CP

Es

giv

en

tost

ud

ents

as

gra

duate

sb

yre

sid

ency

sup

erv

isors

,w

ere

0.3

4(G

roup

2)

and

0.3

7(G

roup

1)

Ste

p-w

ise

multi

ple

regre

ssio

nanaly

sis

use

dfo

rd

ete

rmin

ing

the

best

pre

dic

tor

of

stud

ent

perf

orm

ance

by

com

bin

ing

achie

vem

ent

mea-

sure

sd

urin

gm

ed

icalsc

hool,

resu

lted

ineq

uatio

ns

exc

lud

ing

peer

eva

luatio

n

Bry

an

et

al.

(2005)

Firs

tye

ar

med

icals

tud

ents

inG

ross

and

deve

lop

menta

lanato

my

cours

ed

urin

g5

conse

cutiv

e

years

(N¼

213)

Tw

om

om

ents

of

measu

rem

ent:

mid

poin

tand

end

of

cours

e;

Thre

eto

four

stud

ents

(tw

oto

thre

e

peer

ass

ess

ments

)p

er

gro

up

:

tota

lof

1234

peer

ass

ess

ments

;

Onlin

eq

uest

ionnaire

Pro

fess

ionalis

mFiv

eite

ms

(five

-poin

tsc

ale

)p

lus

two

op

en

quest

ions

Crit

erio

nva

lidity

:lin

ear

regre

ssio

nin

dic

ate

da

statis

tically

signifi

cant

corr

ela

tion

betw

een

peer

ratin

gs

and

finalgra

de

(r¼

0.2

6),

with

an

ass

ocia

ted

3.3

-poin

tin

cre

ase

inth

efin

alg

rad

ep

er

peer-

ratin

gp

oin

t

(contin

ued

)

Reliability and validity of peer assessment

e575

Med

Tea

ch D

ownl

oade

d fr

om in

form

ahea

lthca

re.c

om b

y U

nive

rsity

of

Cam

brid

ge o

n 10

/31/

14Fo

r pe

rson

al u

se o

nly.

Page 5: Reliability and validity of student peer assessment in medical education: A systematic review

Tab

le2

.C

ontinue

d.

Refe

rence

(alp

hab

etic

alord

er)

Part

icip

ants

/stu

dy

sett

ings

Sub

ject

of

ass

ess

ment

Desc

riptio

nof

inst

rum

ent

for

peer

ass

ess

ment

Sum

mary

of

psy

chom

etr

icd

ata

by

the

auth

or(s)

a

Chen

et

al.

(2009)

Firs

tye

ar

med

icalst

ud

ents

inth

e

conse

cutiv

eB

asi

cst

ructu

reand

Hum

an

stru

ctu

reb

locks

(N¼

49)

Activ

ity:

rota

ting

gro

up

lead

ing

role

for

3-w

eek

perio

d(tota

lperio

dof

12

weeks

);

13

gro

up

softh

ree

tofo

ur

stud

ents

;

Six

tonin

ep

eer

eva

luatio

ns

per

lead

er

(ass

igned

by

cours

e

inst

ructo

rs);

Ele

ctr

onic

surv

eys

Lead

ers

hip

Seve

nite

ms

(six

-poin

tsc

ale

)

ass

ess

ing

altr

uis

m,

com

-

pass

ion,

resp

ect,

inte

grit

y,

resp

onsi

bility

,com

mitm

ent

toexc

elle

nce

and

self-

refle

ctio

n,

plu

ssp

ace

for

com

ments

(str

ength

sand

are

as

for

imp

rove

ment)

(Mod

ified

vers

ion

of

the

Pro

fess

ionalA

ssocia

te

Ratin

gs

inst

rum

ent

by

Ram

sey

and

Wenric

h’s

(1999)

NA

Cott

rell

et

al.

(2006)

Firs

tye

ar

med

icalst

ud

ents

(N¼

111)

inp

rob

lem

-base

d

learn

ing

cours

e

�91%

com

plia

nce

Sin

gle

measu

rem

ent:

End

of

cours

e?

14

gro

up

sof�

8st

ud

ents

plu

s

facilita

tor;

�7

peer

eva

luatio

np

er

stud

ent;

Onlin

eq

uest

ionnaire

Pro

fess

ionalsk

ills

Nin

eite

ms

(one

item

per

dom

ain

;si

x-p

oin

tsc

ale

,not

ob

serv

ed

):fir

st,

fourt

h

(mid

dle

of

contin

uum

as

pre

ferr

ed

leve

lof

achie

ve-

ment),

and

seve

nth

leve

ls

anchore

dw

ithd

eta

iled

desc

riptio

ns

Nin

ed

om

ain

s

(honest

yand

inte

grit

y,account-

ab

ility

,re

sponsi

bility

,

resp

ectfuland

non-j

ud

ge-

menta

lb

ehavi

our,

com

pas-

sion

and

em

path

y,m

atu

rity,

skilf

ulcom

munic

atio

n,

con-

fidentia

lity

and

priv

acy

inall

patie

nt

affairs

,se

lf-d

irecte

d

learn

ing

and

ap

pra

isalsk

ills)

Conte

nt

valid

ity:

dom

ain

sare

defin

ed

usi

ng

the

med

icalsc

hool’s

Cod

e

of

Pro

fess

ionalis

mb

ase

don

exi

stin

glit

era

ture

and

fram

ew

ork

s(e

.g.

the

AA

MC

Med

icalS

tud

ent

Ob

jectiv

es

Pro

ject)

Inte

rnalconsi

stency:

Cro

nb

ach’s

alp

ha

for

two

rand

om

sam

ple

sof

the

tota

ld

ata

set

were

0.8

2and

0.7

6,

ind

icatin

gm

od

era

teconsi

stency

Inte

rrate

rre

liab

ility

:genera

lizab

ility

and

decis

ion

stud

ies

found

that

when

usi

ng

five

ratin

gs

per

stud

ent,

the

genera

lizab

ility

coeffic

ient

was

0.4

5.

When

usi

ng

13

ratin

gs,

the

coeffic

ient

imp

rove

dto

0.7

0.

Thus,

incre

asi

ng

the

num

ber

of

rate

rs,

will

red

uce

err

or

and

imp

rove

peer

eva

luatio

n’s

measu

rem

ent

pre

cis

ion

Dannefer et al. (2005)
Participants/study settings: Second year medical students (N=97). Completed data: N=96. Single measurement: 2 months prior to end of second year. Peer assessor: co-member of problem-based learning group or small group; 15–16 peer evaluations (stratified by gender) per student: total of 1519 peer assessments. Online questionnaire.
Subject of assessment: Professional competence.
Description of instrument: Peer Assessment Protocol (PAP): 14 plus one global rating item (five-point scale with first and fifth levels anchored with detailed descriptions; unable to evaluate). Two domains (WH, work habits; IA, interpersonal habits).
Psychometric data: Internal consistency: exploratory factor analysis of aggregated peer ratings suggested a two-dimensional conceptualization of professional competence. The first factor, WH, referred to cognitive skills and study habits, whereas the second factor, IA, referred to skills/humanistic qualities commonly associated with dimensions of professionalism. Interrater reliability: generalizability and decision studies found that approximately six peer raters were needed for the WH and IA scales in order to achieve a generalizability coefficient of 0.70 (sufficient for formative purposes); reliability was not significantly enhanced for the scales as the number of peer raters exceeded 15. Criterion validity: the WH factor was statistically significantly correlated with other performance measures of knowledge and clinical skills (0.21 ≤ r ≤ 0.53), whereas the IA factor was not; IA correlated highly with peer relationship variables.

R. Speyer et al. e576
Med Teach Downloaded from informahealthcare.com by University of Cambridge on 10/31/14. For personal use only.
Page 6: Reliability and validity of student peer assessment in medical education: A systematic review

Table 2. Continued.
Reference (alphabetical order); Participants/study settings; Subject of assessment; Description of instrument for peer assessment; Summary of psychometric data by the author(s)^a

Davis (2002)
Participants/study settings: Medical residents (four at each postgraduate level) at the department of obstetrics and gynaecology (N=16). Single measurement: year-end evaluation of residency programme; 15 peer evaluations per student.
Subject of assessment: Resident performance.
Description of instrument: 16 items (seven-point scale; unable to evaluate). Three domains (clinical competency, interpersonal skills, overall assessment) (modified version of peer assessment form by Ramsey et al. 1993).
Psychometric data: Reliability: intraclass correlation coefficients measuring the reliability in peer and faculty ratings per domain varied, respectively, between 0.78 and 0.90 and between 0.66 and 0.84 (excellent reliability). Criterion validity: Pearson correlation coefficients for peer rating versus faculty rating (gold standard) varied between 0.72 and 0.86, indicating statistically significant associations.

Dijcks et al. (2003)
Participants/study settings: Fourth year medical students (N=212). Excluding all students with fewer than five of the eight judges agreeing (40%); inclusion 128. Single measurement; eight students judged their peer.
Subject of assessment: Competence (functioning as a student and as a future physician).
Description of instrument: Two items (three-point scale; unknown) assessing competence as a student and competence as a future doctor. (Competences as a student and as a future doctor were rated by peers as 'do not know', respectively, in 25% and 29% of all cases.)
Psychometric data: Reliability: intraclass correlation treating differences between judges in rank ordering as error yielded a value of 0.68 for the first item and 0.66 for the second item. Criterion validity: peer assessment was used as gold standard. Reversing the relationship and taking objective tests as criterion, the peer assessment of the students was valid. Both competence ratings were highly correlated, but did not reach unity. Best univariate predictor for competence as a student was the block test (test at end of unit or block covering the content of the block), followed by a near-equal correlation by the progress test (general knowledge assessment). For competence as a future doctor all correlations were lower and the best predictor was the OSCE. Predictive validity: discriminant analysis correctly classified 63% of the students for the competence rating as a student, with the progress and skills tests being identified as significant predictors. For functioning as a future doctor, only the OSCE was identified as a significant predictor (percentage correctly predicted was 45%). Using the extremes 'poor' and 'good' only (omitting the middle classification), these percentages were, respectively, 87% (only progress test predictor significant) and 87% (OSCE only).

Kovach et al. (2009)
Participants/study settings: Third year medical students during internal medicine clerkship over a period of 5 years (N=349). Single measurement: end of clerkship; ≈18 students in each clerkship; ≈12 peer evaluations and one faculty rating (?) per student (range 6–19).
Subject of assessment: Professional behaviour.
Description of instrument: Three items (five-point scale; unable to evaluate) assessing self-directed learning, interpersonal relationships, motivation/dependability/responsibility (plus open comments).
Psychometric data: Internal consistency: Cronbach's alpha for the three-item peer evaluation tool was 0.89 (strong internal consistency on a per student basis). Reliability/Criterion validity: peer assessments had statistically lower ratings than those by faculty. Peer and faculty ratings were weakly correlated (r=0.29), but statistically significant. There was also little correlation between peer assessment and other performance measures.

Linn et al. (1975)
Participants/study settings: Junior medical students (N=54). Two moments of measurement: test–retest (≈1 week apart). Peer assessor: co-member of ward assignments; ≈7–10 peer evaluations per student: total of 874 peer assessments.
Subject of assessment: Physician performance.
Description of instrument: Performance Rating Scale: 16 items (four-point scale). Two domains (KF, knowledge factor; RF, relationship factor).
Psychometric data: Face and consensual validity: present (according to authors). Internal consistency: using factor analysis, two distinct factors emerged: an interpersonal or relationship factor (RF) and a knowledge or skill factor (KF), each explaining, respectively, 40% and 29% of the variance. Test–retest reliability: correlations between repeated measurements of averaged peer ratings per student were high (rKF=0.91, rRF=0.90), thus indicating good reproducibility. Criterion validity: significant correlations between faculty final grades versus RF and KF were, respectively, 0.33 and 0.50.


Linn et al. (1976)
Participants/study settings: First year junior medical students (N=98) during surgical clerkship. Student participation: 98/102. Activity: surgical clerkship (small groups). Single measurement: end of 12-week rotation; ≈8 peer evaluations per student: total of 830 peer assessments.
Subject of assessment: Physician performance.
Description of instrument: 13 items (four-point scale). Two domains (knowledge, interpersonal relationships).
Psychometric data: Predictive validity: peer knowledge ratings were highly predictive for performance on the NBME, whereas peer ratings on the relationship factor were not statistically significant as predictors. In the case of final grades in surgery, both peer ratings on knowledge and relationships were highly related, thus good predictors.

Lurie et al. (2006a)
Participants/study settings: Second year medical students over a period of 2 years (two consecutive classes: N total = 77 + 85 = 162). Student participation per class: 77/93 and 85/108 students; complete data: 98%. Two moments of measurement: near the end of second and third years. Different methods of peer selection (second year: peer-raters were assigned from class lists; third year: students recommended peers to rate and be rated by); 6–12 peer evaluations/classmates per student each year.
Subject of assessment: Professional work habits and interpersonal attributes.
Description of instrument: Dannefer et al. (2005).
Psychometric data: Internal consistency: Cronbach's alphas for the WH and IA scales for the second year versus the third year were, respectively, 0.84 and 0.94 versus 0.89 and 0.92. Convergent validity: both scales were moderately correlated with one another (r_year2=0.36 and r_year3=0.28). (Test–retest) reliability: scores on both scales improved significantly between the second and the third year (possibly influenced by different selection of peer evaluators). The second year ratings were highly predictive for the third year ratings for both scales (rWH=0.64 and rIA=0.62). (Interrater) reliability: generalizability and decision analysis revealed that one class was consistently more discriminating with the WH scale, while the other was more discriminating with the IA scale. It is suggested that individual peer assessments reflect not only the individual, but also the culture of the entire group, thus creating a possible bias. Differences between students consistently represented only a small part of the overall variance in ratings, probably as a result of students' unwillingness to use the entire scale. Depending on the class, year and scale, the number of raters needed to achieve reasonable reliability ranged between 7 and 28.

Lurie et al. (2006b)
Participants/study settings: Second year medical students over a period of 2 years (three consecutive classes: 296). Two moments of measurement: near the end of second and third years. Different methods of peer selection (second year: peer-raters were assigned from class lists; third year: students recommended peers to rate and be rated by). Number of peer evaluations per student: 10 in second year and 6 in third year.
Subject of assessment: Professional work habits and interpersonal attributes.
Description of instrument: Dannefer et al. (2005).
Psychometric data: Internal consistency: factor analysis of scores in the entire sample was the most consistent with a simple one-factor solution (65% of total variance). A single scale comprising all items had a Cronbach's alpha of 0.95. Content validity: students with low levels of peer-assessed IA are more negative in their judgements of classmates; this finding provides evidence of the validity of peer assessment as a method of measuring IA. Reliability: third year medical students with low levels of IA are more likely to be rated by other students with low levels of IA, irrespective of method of rater assignment. Biases in selection of raters among third year students do not appear to affect the results of peer assessment.


Lurie et al. (2007)
Participants/study settings: Graduating medical students (N=281). Complete data set: 240 (exclusion of 41: no participation in peer assessments). Two moments of measurement: near the end of second and third years; 6–12 peer evaluations/classmates per student each year.
Subject of assessment: Professional work habits and interpersonal attributes.
Description of instrument: Dannefer et al. (2005).
Psychometric data: Predictive validity: peer-assessed WH was predictive of later Medical Student Performance Evaluation ranking groups in both the second and third years, whereas IA ratings were not (discriminant function analysis). Later internship directors' ratings of students' general clinical, interpersonal and professional qualities were significantly related to second and third year peer-assessed WH scores (respectively, r=0.32 and r=0.43), but not to IA scores.

Magzoub et al. (1998)
Participants/study settings: Medical students (N=34). Activity: rural field training. Single measurement: end of training. Three groups of 11–12 students; eight peer evaluations per student: total of 272 peer assessments.
Subject of assessment: Performance in community settings.
Description of instrument: 22 items (six-point scale). Four domains (effort, community interaction, leadership, subject matter-contribution).
Psychometric data: Interrater reliability: true variance associated with students for the four domains varied between 30.5% and 43.3%. The generalizability coefficients for each domain ranged from 0.88 to 0.91 with eight raters per student. The standard error of measurement ranged between 0.21 and 0.25, demonstrating that reliability is acceptable. (Construct) validity/Internal consistency: although the correlation coefficients between the four domains were rather high (0.61 ≤ r ≤ 0.89), a one-factor model using confirmatory factor analysis did not fit the data. A model in which social interaction skills (leadership and community interaction) and content-related skills (effort and subject-matter contribution) were explained by second-order factors did fit the data. (Acceptability: peer assessment proved quite acceptable to students.)

Nofziger et al. (2010)
Participants/study settings: Second (N=101) and fourth (N=83) year students of Medicine and of Dentistry.
Subject of assessment: Professional behaviours, particularly interpersonal dimensions.
Description of instrument: 15 items (five-point scale; unable to assess) plus two open-ended free-text questions (strengths and weaknesses). Two domains (work habits, interpersonal attributes) plus global items.
Psychometric data: NA

O’B

rien

et

al.

(2008)

Second

year

pharm

acy

stud

ents

in

the

Thera

peutic

sIcours

e

First

(pilo

t)st

ud

y(N¼

86)

Com

ple

teru

bric

sfo

rp

eers

and

inst

ructo

rs,

resp

ectiv

ely

,89.8

%

and

43.7

%

Activ

ity:

case

pre

senta

tion;

Six

sectio

ns

(one

inst

ructo

rp

er

sectio

n)

div

ided

into

four

gro

up

s

of

thre

eto

four

stud

ents

;

13–1

5p

eer

eva

luatio

ns

and

one

inst

ructo

reva

luatio

np

er

stud

ent

Second

stud

y(N¼

Np

ilot?

)

Revi

sion:

peer

eva

luatio

ns

by

pre

-

sente

r’s

gro

up

mem

bers

only

(Tw

oto

thre

ep

eer

eva

luatio

ns

per

stud

ent);

Pap

er

ass

ess

ment

rep

laced

by

onlin

esc

orin

gsy

stem

Perf

orm

ance

of

case

pre

senta

tions

10

item

s(d

esc

riptio

ns

of

thre

e

leve

lsof

com

pete

nce

per

item

:lo

west

,m

idd

le,

and

hig

hest

leve

l,re

spectiv

ely

,5,

7.5

,and

10

poin

ts)

Thre

ed

om

ain

s(k

now

led

ge:

know

led

ge

of

dis

ease

state

,

know

led

ge

of

dru

gth

era

py;

skills:

patie

nt

ass

ess

ment,

thera

peutic

pla

nd

eve

lop

-

ment,

com

munic

atio

nw

ith

small

gro

up

,p

rese

nta

tion

styl

e;

pro

fess

ionalb

ehav-

iour:

pro

fess

ionalatt

ire/

ap

peara

nce,

resp

ectful

inte

ractio

ns,

pro

fess

ional

ap

pro

ach,

pre

pare

dness

).

Conte

nt

valid

ity:

dim

ensi

ons

mirr

or

the

recom

mend

atio

ns

from

both

the

AC

PE

Sta

nd

ard

s2007

and

the

Cente

rfo

rth

eA

dva

ncem

ent

of

Pharm

aceutic

alE

ducatio

n/E

ducatio

nalO

utc

om

es

First

(pilo

t)st

ud

y

Flo

or

or

ceiling

effects

:no

floor

effects

,b

utab

out40%

ofth

ein

stru

cto

rs’

ratin

gs

achie

ved

the

hig

hest

poss

ible

score

s

Relia

bility

:p

eer

ass

ess

ment

score

s(rub

ric-b

ase

dcom

posi

teto

talsu

m

score

and

score

sp

er

dim

ensi

on)

were

statis

tically

hig

her

than

inst

ructo

rsc

ore

s

Second

stud

y

Relia

bility

:com

posi

tesc

ore

sfr

om

inst

ructo

rsand

peers

were

signifi

-

cantly

low

er

com

pare

dto

pilo

tre

sults

.S

till,

peer

ass

ess

ment

score

s

were

signifi

cantly

hig

her

than

inst

ructo

rsc

ore

s

(contin

ued

)


Papinczak et al. (2007b)
Participants/study settings: First year students of the Bachelor of Medicine and Bachelor of Surgery (N=125). Withdrawal: 16% (of 125 students). Activity: presentation of a summary of the previous week's medical problem. Two moments of measurement (8–10 weeks between both measurements); 14 tutorial groups; eight to nine peer evaluations and one tutor evaluation per student.
Subject of assessment: Problem-based learning performance within tutorial setting.
Description of instrument: 17 items (five-point scale). Five domains (responsibility and respect, information processing, communication, critical analysis, self-awareness).
Psychometric data: Face validity: unanimous face validity for all domains, except some dissent for the self-awareness sub-score (based on three experienced PBL facilitators). Construct validity: all five constructs or domains are extensively reported in literature; each PBL tutor categorized all 17 items in accordance with the domains as defined on the instrument. Internal consistency: Cronbach's alpha ranged from 0.76 to 0.84 (good internal consistency). Reliability: Pearson correlation coefficients for peer-averaged and tutor assessment, ranging from 0.40 to 0.60 (some improvement over time), proved acceptably reliable; self-awareness items were problematic ('not applicable'). Test–retest reliability: stable total group mean over time.

Papinczak et al. (2007a)
Participants/study settings: First year medical students of the Bachelor of Medicine and Bachelor of Surgery in problem-based learning curricula (N=125). Withdrawal: three tutorial groups. Qualitative study. Activity: presentation of a summary of the previous week's medical problem. Weekly measurement of presenting student; 20 min PBL tutorial time each week for 20 weeks of the year; 13 tutorial groups of 9–10 students.
Subject of assessment: Papinczak et al. (2007b).
Description of instrument: Papinczak et al. (2007b).
Psychometric data: Papinczak et al. (2007b).

Perera et al. (2010)
Participants/study settings: First semester medical undergraduate students in small group communication skills teaching sessions (N=97). Activity: three tutorial meetings, followed by a three-station objective structured clinical examination (OSCE). Single measurement (during OSCE): at end of the semester; four to five peer evaluations per student.
Subject of assessment: Basic interview skills.
Description of instrument: Objectively Structured (Self-assessment) and Peer feedback: four open-ended and 15 closed questions (five-point scale) assessing building rapport, listening skills, language, interview style and interview structure.
Psychometric data: NA


Risucci et al. (1989)
Participants/study settings: Surgical residents (N=32); because of missing values: 27. Single measurement; 26 peer evaluations and 29 evaluations by supervisors per resident.
Subject of assessment: Clinical competence.
Description of instrument: 10 items (five-point scale) assessing technical ability, basic science knowledge, judgement, relations with patients, relations with peers, reliability, industry, personal appearance and reaction to pressure. Two domains (interpersonal and knowledge).
Psychometric data: Criterion validity: peer and supervisor ratings of surgical residents correlated highly with each other (0.66 ≤ r ≤ 0.86). Over-all ratings by peers (averaged score over all peer evaluations in each performance area, per resident) and supervisors were highly intercorrelated (r=0.92). Peer and supervisor ratings moderately and significantly correlated with the American Board of Surgery In-Training Examination scores. (Construct) validity/Internal consistency: when the matrix of over-all ratings by peers (27 resident by 10-item matrix of average peer rating) was factored, one general factor emerged, explaining 85.3% of the variance with each of the items loading significantly. However, factor analysis of the matrix derived two factors, labelled as interpersonal and knowledge factors, explaining 69.8% and 10.5% of variance, respectively.

Roark et al. (2006)
Participants/study settings: First (N=18), second (N=17) and third (N=16) year residents of otolaryngology in a surgical postgraduate programme, for a total of 26 residents during a three-year study period. Compliance for completion of forms was 97%. (Online assessment system for surgical postgraduate education programme including a set of instruments.) Moment of measurement: end of each rotation; 269 peer evaluations and 660 faculty evaluations in total.
Subject of assessment: Clinical competencies in otolaryngology.
Description of instrument: 10 items (five-point scale) assessing integrity, respect, overall clinical skills, compassion, problem-solving, responsibility, ambulatory care skills, management of complex problems, management of hospitalized patients, medical knowledge. Six domains (patient care, medical knowledge, practice-based learning and improvement, interpersonal and communication skills, professionalism, systems-based practice). (Faculty evaluation used a quite similar, but not identical, instrument.)
Psychometric data: Face validity: literature search (content by the American Board of Internal Medicine). Content validity: expert panel (representative committee of residents and faculty) judged whether a set of instruments sampled all general competencies of the Accreditation Council of Graduate Medical Education; peer and faculty evaluations are part of this assessment system. Convergent validity (comparing ratings among the different evaluator types): total average score assigned by faculty is statistically lower than that by peers (4.31 versus 4.63). Criterion validity: correlation was observed in 'respect of patients' but not in 'medical knowledge' between faculty and resident peer groups.

Rudy et al. (2001)
Participants/study settings: First year medical students (N=97) participating in interviewing course. Because of missing values (incomplete peer and faculty evaluation): 82. Activity: patient interview. Single measurement midway through the course; ≈8 students per tutorial group; ≈7 peer evaluations and one faculty preceptor evaluation per student.
Subject of assessment: (Patient) interviewing performance.
Description of instrument: Three items (15-point scale) assessing interviewing style (empathy, self-presentation, interest, respect and rapport), interview structure (ability to follow guidelines, information flow and completion of topics), and interviewing techniques (open/closed questions, summarization, legitimization, transition statements, and appropriate opening and closing). Interviewing performance (three-item linear composite): mean of all three items.
Psychometric data: Internal consistency: coefficient alphas for group peer-averaged and faculty composite ratings were, respectively, 0.88 and 0.84. Reliability: Pearson correlations between faculty and peer ratings were moderate, but statistically significant (r=0.50). Peer ratings were significantly higher than faculty ratings.


Sullivan et al. (1999)
Participants/study settings: Third year medical students during surgical clerkship (N=154). Complete peer and tutor evaluation for 152. Activity: casuistry. Single measurement: peer evaluation at the end (six group meetings); ≈8 students per tutorial group; each student evaluated all peers in group.
Subject of assessment: Problem-Based Learning performance within tutorial setting.
Description of instrument: Three items on problem-solving, independent learning and group participation (five-point scale).
Psychometric data: Reliability: Pearson correlations for peer and faculty (tutor) ratings were moderate in the areas of independent learning (r=0.50) and group participation (r=0.54), and low in the area of problem-solving (r=0.24).

Thomas et al. (1999)
Participants/study settings: Interns (N=16) at the end-of-month ward rotations (two inpatient firms). Nine moments of measurement (9 months); 4–23 evaluations per intern; 177 evaluations in total for 16 interns (76 interns evaluating interns; 101 senior residents evaluating interns); 197 faculty inpatient end-of-month evaluations (similar items).
Subject of assessment: Clinical competence.
Description of instrument: 10 items (nine-point scale). Two domains (technical skills: cognitive and psychomotor skills and behaviours; interpersonal skills: interpersonal skills and humanistic behaviours).
Psychometric data: Interrater reliability: NA (anonymous evaluation returns). Internal consistency: average interitem correlation and Cronbach's alpha for interns' assessment of interns were, respectively, 0.73 and 0.96 (for senior resident evaluation of interns, respectively, 0.55 and 0.93). Exploratory factor analysis (principal component analysis) resulted in two domains (technical skills and interpersonal skills). Reliability: Pearson product-moment correlations between intern and faculty evaluations were moderate to high (except for medical knowledge); correlations between intern and senior resident evaluations were low (all r ≤ 0.44, except for procedural skills r=0.73).

Willoughby et al. (1979)
Participants/study settings: Sixth year medical graduates (N=64). Four moments of measurement: end of each rotation; CPE evaluation per rotation by one to three peers, one supervisor (faculty member, resident or physician) and self.
Subject of assessment: Clinical performance.
Description of instrument: CPE: 11 items or dimensions (nine-point scale) assessing attitude, peer relations, reliability, medical information, concepts, skills, maturity, patient rapport, ingenuity, conscientiousness and integrity.
Psychometric data: NA. (Reliability and validity analyses were based on mean CPE scores; thus, apart from peer assessment, ratings by supervisors and self-evaluations were included as well.)

van

Rose

nd

aaland

Jennett

(1994)

Resi

dents

inth

ecore

(3-y

ear)

inte

r-

nalm

ed

icin

ep

rogra

mm

e

(N¼

22)

Mom

ents

of

measu

rem

ent:

end

of

each

rota

tion;

Conse

nsu

sfa

culty

ratin

gp

er

rota

-

tion

(of

all

pre

cep

tors

invo

lved

)

per

stud

ent;

Num

ber

of

peer

ratin

gs

per

stu-

dent?

74

matc

hed

peer

and

faculty

ratin

gs

Clin

icalcom

pete

nce

In-t

rain

ing

eva

luatio

nre

port

(ITE

R)

(Mid

way

durin

gst

ud

y:change

of

5–1

0-p

oin

tsc

ale

s.)

Com

pete

nce

cate

gorie

s:w

ritte

n

work

up

s,his

tory

taki

ng,

phys

icalexa

min

atio

ns,

case

pre

senta

tions,

pro

ced

ure

s,

clin

icalju

dgem

ent,

team

rela

tionsh

ips,

phys

icia

n–

patie

nt

rela

tionsh

ips,

ind

us-

trio

usn

ess

and

enth

usi

asm

,

basi

csc

ience

and

clin

ical

know

led

ge,

teachin

g,

ove

rall

com

pete

nce

(Mid

way

durin

gst

ud

y:so

me

ad

just

ment

of

eva

luatio

n

cate

gorie

s.)

Relia

bility

:si

gnifi

cant

diff

ere

nt

score

sb

etw

een

peer

and

faculty

ratin

gs

(Wilc

oxo

nm

atc

hed

-pair

signed

-rank

test

)fo

rth

ecate

gorie

sp

hys

ical

exa

min

atio

ns,

team

rela

tionsh

ips,

ind

ust

riousn

ess

and

enth

usi

asm

,

teachin

g,

phys

icia

n–p

atie

nt

rela

tionsh

ips,

and

case

pre

senta

tions.

No

signifi

cant

diff

ere

nces

for

writ

ten

work

up

s,his

tory

taki

ng,

pro

ce-

dura

lski

lls,

clin

icalj

ud

gem

ent,

basi

csc

ience

and

clin

icalk

now

led

ge,

and

ove

rall

com

pete

nce.

(Consi

dera

ble

consi

stency

betw

een

pair

and

faculty

ratin

gs

acro

sscate

gorie

s,b

ut,

where

diff

ere

nces

exi

sted

,

faculty

more

often

gra

ded

hig

her

than

peers

.)

Crit

erio

nva

lidity

:va

lidatio

nb

ycom

parin

gIT

ER

with

OS

CE

and

perf

or-

mance

on

the

Am

eric

an

Colle

ge

of

Phys

icia

ns’

multi

ple

-choic

e

exa

min

atio

n(s

econd

year):

faculty

revi

ew

ers

ass

igned

low

ITE

R

phys

icalexa

min

atio

nsc

ore

sto

resi

dents

with

low

OS

CE

score

s,

where

as

peers

did

not.

Furt

herm

ore

,no

oth

er

note

wort

hy

rela

tionsh

ips

aThe

auth

or(s)

pro

vid

e(s

)no

psy

chom

etr

icd

ata

of

the

inst

rum

ent

for

peer

ass

ess

ment.

NA

,not

ap

plic

ab

le.
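Several of the table entries above report Pearson product-moment correlations between peer and faculty ratings. As a minimal illustration of how such a coefficient is computed, the following sketch uses invented rating data on a nine-point scale; the values are not taken from any of the reviewed studies.

```python
import math

# Hypothetical paired ratings (nine-point scale): each trainee is rated on
# the same item by a peer and by a faculty member. Invented data.
peer = [7, 8, 6, 9, 5, 7, 8, 6, 7, 9]
faculty = [6, 8, 7, 9, 5, 6, 8, 5, 7, 8]

def pearson_r(x, y):
    """Pearson product-moment correlation between two paired score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(f"r = {pearson_r(peer, faculty):.2f}")  # → r = 0.87
```

With these invented scores the peer–faculty agreement would count as "moderate to high" in the terms used by the reviewed studies; in practice the coefficient is computed over all matched peer and faculty evaluations.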


of interpersonal skills, while the concept of physician perfor-

mance by Linn et al. (1975, 1976) referred to knowledge and

interpersonal relationships. The number of items per ques-

tionnaire varied greatly. The shortest questionnaire consisted

of two single items (Alagna & Reddy 1985), whereas the

longest questionnaire by Magzoub et al. (1998) referred to a

22-item instrument. In total, the 28 included studies described

22 different instruments for peer assessment in mainly medical

educational settings. Three studies by Lurie et al. (2006a,b,

2007) were based on the Peer Assessment Protocol by

Dannefer et al. (2005), Arnold et al. (1981) included the

Clinical Performance Evaluation by Willoughby et al. (1979),

Papinczak et al. (2007a,b) published two studies using the

same questionnaire, and Linn et al. (1975, 1976) presented two

different versions of the same peer assessment questionnaire.

Psychometric characteristics

The psychometric characteristics for each peer assessment

instrument as described by the author(s) have been summa-

rized in Table 2; Main findings on validity and reliability of

peer assessment have been presented in this table, whereas

Table 3 provides a glossary of psychometric terms used in this

review. Six studies provided no psychometric data

(Willoughby et al. 1979; Alagna & Reddy 1985; Amato &

Novales-Castro 2009; Chen et al. 2009; Nofziger et al. 2010;

Perera et al. 2010). Only a few studies (Cottrell et al. 2006;

Roark et al. 2006; O’Brien et al. 2008) described the concept of

content validity: the extent to which the domain of interest was

sampled comprehensively by the items in the questionnaire

(Terwee et al. 2007). This content validity was usually

explained by referring to existing literature or frameworks.

Criterion validity, frequently mentioned in Table 2, refers to

the relationship linking the attributes in a tool with the

performance on a criterion, whereas predictive validity (Linn

et al. 1976; Arnold et al. 1981; Dijcks et al. 2003; Lurie et al.

2007) indicates the degree to which test scores predict

performance on some future criterion and convergent validity

(Lurie et al. 2006a) the extent to which different measures of

the same construct correlate with one another (DeVon et al.

2007). Although no single criterion or gold standard will ever

be perfect (Norman et al. 1996), usually, faculty ratings are

considered to be the gold standard in educational settings (Linn

et al. 1975; van Rosendaal & Jennett 1994; Davis 2002; Dijcks

et al. 2003; Bryan et al. 2005; Roark et al. 2006). Construct

validity refers to the extent to which scores on a particular

questionnaire relate to other measures in a manner that is

consistent with theoretically derived hypotheses concerning

the concepts that are being measured (Terwee et al. 2007).

Only a few studies provided detailed information on construct

validity (Papinczak et al. 2007a,b). Reliability pertains to the

ability of an instrument to consistently measure an attribute

(DeVon et al. 2007). Internal consistency is a measure of

homogeneity, indicating the extent to which items in a

(sub)scale are inter-correlated, measuring the same construct.

Reproducibility describes the degree to which an instrument is

free of measurement error by yielding stable scores over time

(absolute measurement error), and the extent to which

patients can be distinguished from each other, despite

measurement errors (relative measurement error; Terwee

et al. 2007; Timmerman et al. 2007). Interrater reliability and

test–retest reliability refer to this concept of reproducibility as

well. Finally, floor and ceiling effects reflect the number of

respondents achieving the lowest or highest possible score,

thus, describing the data distribution. Only the study by

O’Brien et al. (2008) provided this information.
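The internal consistency and floor/ceiling statistics described above are straightforward to compute. The following sketch uses invented scores on a hypothetical three-item scale; it shows Cronbach's alpha and the proportion of respondents at the lowest or highest possible total score.

```python
def cronbach_alpha(items):
    """Cronbach's alpha; `items` is a list of per-item score lists
    (one inner list per item, one entry per respondent)."""
    k = len(items)          # number of items
    n = len(items[0])       # number of respondents
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    item_vars = sum(var(row) for row in items)
    totals = [sum(items[i][j] for i in range(k)) for j in range(n)]
    return k / (k - 1) * (1 - item_vars / var(totals))

def floor_ceiling(scores, low, high):
    """Share of respondents at the lowest / highest possible score."""
    n = len(scores)
    return (sum(s == low for s in scores) / n,
            sum(s == high for s in scores) / n)

# Five respondents scored on three items of a 1-5 scale (invented values).
items = [[4, 5, 3, 4, 2],
         [4, 4, 3, 5, 2],
         [5, 5, 3, 4, 1]]
print(round(cronbach_alpha(items), 2))  # → 0.93

totals = [sum(col) for col in zip(*items)]
print(floor_ceiling(totals, low=3, high=15))  # → (0.0, 0.0)
```

A floor or ceiling proportion near zero, as here, indicates the scale can still discriminate between respondents at both ends of its range.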

Discussion and conclusion

Summarizing

This review has identified 22 different instruments for peer

assessment in – mainly – medical educational settings. Only

those studies were included that were of sufficient study

quality (Bangert-Drowns et al. 1997), thus excluding articles

with inadequate reporting of subject characteristics or study

settings, as well as global ratings or unstructured peer

assessments. Most authors design their own instrument for

Table 3. Glossary of psychometric terms.

Agreement: The extent to which the scores on repeated measures are close to each other: absolute measurement error (Terwee et al. 2007).
Construct validity: The extent to which a measurement corresponds to theoretical concepts (constructs) concerning the phenomenon under study (Last 2001).
Content validity: The extent to which the domain of interest is comprehensively sampled by the items in the questionnaire (Terwee et al. 2007).
Convergent validity: The degree to which a measure is correlated with other measures that it is theoretically predicted to correlate with. Convergent validity is a variant of construct validity (Last 2001).
Criterion validity: The extent to which the measurement correlates with an external criterion of the phenomenon under study (Last 2001).
Discriminant validity: The degree to which the measure is not similar to (diverges from) other measures that it theoretically should not be similar to. Discriminant validity is a variant of construct validity (Last 2001).
Floor or ceiling effect: The number of respondents who achieved the lowest or highest possible score (Terwee et al. 2007; McHorney & Tarlov 1995).
Internal consistency: The extent to which items in a (sub)scale are intercorrelated, thus measuring the same construct (Terwee et al. 2007).
Predictive validity: The degree to which test scores predict performance on some future criterion (Dijcks et al. 2003).
Reliability: The extent to which patients can be distinguished from each other, despite measurement errors: relative measurement error (Terwee et al. 2007). The extent to which the same measurements of individuals obtained under different conditions yield similar results (Everitt 2006).
Reproducibility: The degree to which repeated measurements in stable persons provide similar answers (Terwee et al. 2007).
Test–retest reliability: An index of score consistency over a brief period of time (typically several weeks), usually the correlation coefficient determined between administration of the test twice with a certain amount of time between administrations (Everitt 2006).


peer assessment to be applied in their specific educational

environment according to their own criteria and scoring

system. The subject of assessment, although frequently

overlapping across studies, shows great diversity as well,

being adjusted or revised according to the authors’ working

surroundings or personal interest. Most studies used peer

assessment mainly as an assessment tool.

Psychometry and heterogeneity

For six instruments, no information about the questionnaire’s

psychometric characteristics was available. Although these

questionnaires are listed in Table 2, their application and

usefulness in educational settings are considered to be

doubtful without further assessment of their psychometric

properties. Many other studies likewise provide fragmentary

or insufficient data on reliability and validity of the tools being

used. In the literature, quality criteria can be found for

measurement properties of questionnaires and standards for

educational and psychological testing (AERA 1999; Terwee

et al. 2007). But still, the rating of a questionnaire’s psycho-

metric characteristics is highly dependent on the reporting and

availability of data in the corresponding study. Underreporting

of psychometric information does not necessarily imply a poor

study design or performance.

In view of the heterogeneity of the study designs and

diversity in questionnaires for peer assessment as well as the

frequently restricted availability of data on a questionnaire’s

psychometric characteristics, statistical pooling of the data was

neither possible nor useful for this review. The outcome of

this review has given an overview of all available instruments

for peer assessment in, predominantly, medical education as

well as their reliability and validity according to the author(s).

However, in general, peer assessment using tools not showing

good validity and/or reliability should be avoided because of

their questionable contribution to evaluating students’ performance.

Especially when using peer assessment as an assessment

tool, good measurement properties are essential.

Future research

Choices of peer assessment instruments in education can be

justified, on the one hand, in terms of optimal psychometric

qualities, and on the other hand, by taking into account the

discriminative and evaluative purposes of the assessment.

There is no such thing as one universal instrument for peer

assessment, no single gold standard. However, before applying

peer assessment on a large scale as a valuable instrument for

evaluation of a student’s performance by quantifying judge-

ments made by his or her future colleagues, research will need

to focus more thoroughly on exploring and determining the

exact psychometric characteristics per instrument for peer

assessment. Surprisingly, although peer assessment is a

common tool in medical educational settings, in the literature,

the lack of information about questionnaires’ psychometry has

seldom been mentioned or questioned. As stated before, in

general, the application of unvalidated or unreliable instru-

ments or questionnaires will result in data that cannot be

adequately interpreted, nor lead to any useful contribution to

formal assessment. Perhaps one of the most important

findings of this review is the well-founded indication of the

need for caution when using peer assessment in medical

educational settings.

Declaration of interest: The authors report no declarations

of interest.

Notes on contributors

RENEE SPEYER, PhD, SLP, MS, is research coordinator at the Institute of

Health Studies, HAN University of Applied Sciences, Nijmegen (The

Netherlands).

WALMARI PILZ, SLP, is a speech and language pathologist at the Medical

University Hospital, Maastricht (The Netherlands).

JOLIEN VAN DER KRUIS, MD, works at the Jeroen Bosch Hospital in Den

Bosch as a medical doctor (The Netherlands).

JAN WOUTER BRUNINGS, MD, is an ENT specialist/laryngologist at the

Medical University Hospital Maastricht (The Netherlands).

References

AERA 1999. American Psychological Association, National Council on

Measurement in Education. Standards for educational and psycholog-

ical testing. Washington, DC: American Educational Research

Association.

Alagna SW, Reddy DM. 1985. Self and peer ratings and evaluations of group

process in mixed-sex and male medical training groups. J Appl Soc

Psychol 15(1):31–45.

Amato D, de Jesus Novales-Castro X. 2009. Aceptación del aprendizaje

basado en problemas y de la evaluación entre pares por los estudiantes

de medicina [Acceptance of problem-based learning and of peer

assessment among medical students]. Gac Med 145(3):197–205.

Arnold L, Shue CK, Kalishman S, Prislin M, Pohl C, Pohl H, Stern DT. 2007.

Can there be a single system for peer assessment of professionalism

among medical students? A multi-institutional study. Acad Med

82(6):578–586.

Arnold L, Willoughby L, Calkins V, Gammon L, Eberhart G. 1981. Use of

peer evaluation in the assessment of medical students. J Med Educ

56:35–42.

Bangert-Drowns RL, Wells-Parker E, Chevillard I. 1997. Assessing the

methodological quality of research in narrative reviews and meta-

analyses. In: Bryant KJ, Windle M, West SG, editors. The science of

prevention. Methodological advances from alcohol and substance

abuse. Washington, DC: American Psychological Association.

pp 405–429.

Bryan RE, Krych AJ, Carmichael SW, Viggiano TR, Pawlina W. 2005.

Assessing professionalism in early medical education: Experience with

peer evaluation and self-evaluation in the gross anatomy course. Ann

Acad Med Singapore 34:486–491.

Bucknall V, Sobic EM, Wood HL, Howlett SC, Taylor R, Perkins GD. 2008.

Peer assessment of resuscitation skills. Resuscitation 77:211–215.

Burnett W, Cavaye G. 1980. Peer assessment by fifth year students of

surgery. Assess Eval High Educ 5(3):273–278.

Chen LP, Gregory JK, Camp CL, Juskewitch JE, Pawlina W, Lachman N.

2009. Learning to lead: Self- and peer evaluation of team leaders in the

human structure didactic block. Anat Sci Educ 2:210–217.

Chenot J-F, Simmenroth-Nayda A, Koch A, Fischer T, Scherer M, Emmert B,

Stanske B, Kochen MM, Himmel W. 2007. Can student tutors act as

examiners in an objective structured clinical examination? Med Educ

41:1032–1038.

Cottrell S, Diaz S, Cather A, Shumway J. 2006. Assessing medical student

professionalism: An analysis of a peer assessment. Med Educ Online

11(8):1–8.


Dannefer EF, Henson LC, Bierer SB, Grady-Weliky TA, Meldrum S, Nofziger

AC, Barclay C, Epstein RM. 2005. Peer assessment of professional

competence. Med Educ 39:713–722.

Davis JD. 2002. Comparison of faculty, peer, self, and nurse assessment

of obstetrics and gynecology residents. Obstet Gynecol

99(4):647–651.

Davis JK, Inamdar S. 1988. Use of peer ratings in a pediatric residency.

J Med Educ 63:647–649.

DeVon HA, Block ME, Moyle-Wright P, Ernst DM, Hayden SJ, Lazzara DJ,

Savoy SM, Kostas-Polston E. 2007. A psychometric toolbox for testing

validity and reliability. J Nurs Scholarsh 39(2):155–164.

Dijcks R, Prince KJAH, van der Vleuten CPM, Scherpbier AJJA. 2003.

Validity of objective tests towards peer-rated competence by students.

Med Teach 25(3):273–276.

Epstein RM. 2007. Assessment in medical education. N Engl J Med

356:387–396.

Eva KW, Regehr G. 2005. Self-assessment in the health professions: A

reformulation and research agenda. Acad Med 80(10):S46–S54.

Evans R, Elwyn G, Edwards A. 2004. Review of instruments for peer

assessment of physicians. BMJ 328(7450):1240–1243.

Everitt BS. 2006. Medical statistics from A to Z: A guide for clinicians and

medical students. Cambridge: Cambridge University Press.

Field M, Burke JM, McAllister D, Lloyd DM. 2007. Peer-assisted learning: A

novel approach to clinical skills learning for medical students. Med

Educ 41:411–418.

Gielen S. 2007. Peer assessment as a tool for learning [dissertation]. Leuven,

Belgium: Catholic University of Leuven.

Hulsman RL, Harmsen AB, Fabriek M. 2009. Reflective teaching of medical

communication skills with DiViDU: Assessing the level of student

reflection on recorded consultations with simulated patients. Patient

Educ Couns 74:142–149.

Kovach RA, Resch DS, Verhulst SJ. 2009. Peer assessment of profession-

alism: A five-year experience in medical clerkship. J Gen Intern Med

24(6):742–746.

Last JM (editor). 2001. A dictionary of epidemiology. Oxford: Oxford

University Press.

Levine RE, Kelly PA, Karakoc T, Haidet P. 2007. Peer evaluation in a clinical

clerkship: Students’ attitudes, experiences, and correlations with

traditional assessments. Acad Psych 31(1):19–24.

Linn BS, Arostegui M, Zeppa R. 1975. Performance rating scale for peer and

self assessment. Br J Med Educ 9:98–101.

Linn BS, Arostegui M, Zeppa R. 1976. Peer and self assessment in

undergraduate surgery. J Surg Res 21:453–456.

Ljungman AG, Silen C. 2008. Examination involving students as peer

examiners. Assess Eval High Educ 33(3):289–300.

Lurie SJ, Lambert DR, Nofziger AC, Epstein RM, Grady-Weliky TA. 2007.

Relationship between peer assessment during medical school, dean’s

letter rankings, and ratings by internship directors. J Gen Intern Med

22:13–16.

Lurie SJ, Nofziger AC, Meldrum S, Mooney C, Epstein RM. 2006a. Temporal

and group-related trends in peer assessment amongst medical students.

Med Educ 40:840–847.

Lurie SJ, Nofziger AC, Meldrum S, Mooney C, Epstein R. 2006b. Effects of

rater selection on peer assessment among medical students. Med Educ

40:1088–1097.

Magin D. 2001. Reciprocity as a source of bias in multiple peer assessment

of group work. Stud High Educ 26(1):53–63.

Magzoub MEMA, Schmidt HG, Dolmans DHJM, Abdelhameed AA. 1998.

Assessing students in community settings: The role of peer evaluation.

Adv Health Sci Educ 3:3–13.

Maker VK, Donnelly MB. 2008. Surgical resident peer evaluations – What

have we learned. J Surg Educ 65(1):8–16.

McCarty T, Parkes MV, Anderson TT, Mines J, Skipper BJ, Grebosky J. 2005.

Improved patient notes from medical students during web-based

teaching using faculty-calibrated peer review and self-assessment. Acad

Med 80(10):567–570.

McCormack WT, Lazarus C, Stern D, Small Jr PA. 2007. Peer nomination: A

tool for identifying medical student exemplars in clinical competence

and caring, evaluated at three medical schools. Acad Med

82(11):1033–1039.

McHorney CA, Tarlov AR. 1995. Individual-patient monitoring in clinical

practice: Are available health status surveys adequate? Qual Life Res

4(4):293–307.

Miller PJ. 2003. The effect of scoring criteria specificity on peer and self-

assessment. Assess Eval High Educ 28(4):383–394.

Nieder GL, Parmelee DX, Stolfi A, Hudes PD. 2005. Team-based learning

in a medical gross anatomy and embryology course. Clin Anat

18:56–63.

Nofziger AC, Naumburg EH, Davis BJ, Mooney CJ, Epstein RM. 2010.

Impact of peer assessment on the professional development of medical

students: A qualitative study. Acad Med 85(1):140–147.

Norcini JJ. 2003. Peer assessment of competence. Med Educ 37:539–543.

Norman GR, Swanson DB, Case SM. 1996. Conceptual and methodological

issues in studies comparing assessment formats. Teach Learn Med

8:208–216.

O’Brien CE, Franks AM, Stowe CD. 2008. Multiple rubric-based assessments

of student case presentations. Am J Pharm Educ 72(3):1–7.

Papinczak T, Young L, Groves M. 2007a. Peer assessment in problem-based

learning: A qualitative study. Adv Health Sci Educ 12:169–186.

Papinczak T, Young L, Groves M, Haynes M. 2007b. An analysis of peer, self,

and tutor assessment in problem-based learning tutorials. Med Teach

29:e122–e132.

Parikh A, McReelis K, Hodges B. 2001. Student feedback in problem based

learning: A survey of 103 final year students across five Ontario medical

schools. Med Educ 35:632–636.

Perera J, Mohamadou G, Kaur S. 2010. The use of objective structured self-

assessment and peer-feedback (OSSP) for learning communication

skills: Evaluation using a controlled trial. Adv Health Sci Educ

15:185–193.

Ramsey PG, Wenrich MD. 1999. Use of professional associate ratings to

assess the performance of practicing physicians: Past, present, future.

Adv Health Sci Educ Theory Pract 4:27–38.

Ramsey PG, Wenrich MD, Carline JD, Inui TS, Larson EB, LoGerfo JP. 1993.

Use of peer ratings to evaluate physician performance. JAMA

269:1655–1660.

Reiter HI, Eva KW, Hatala RM, Norman GR. 2002. Self and peer assessment

in tutorials: Application of a relative-ranking model. Acad Med

77(11):1134–1139.

Risucci DA, Tortolani AJ, Ward RJ. 1989. Ratings of surgical residents by

self, supervisors and peers. Surg Gynecol Obstet 169:519–526.

Roark RM, Schaefer SD, Yu GP, Branovan DI, Peterson SJ, Lee WN. 2006.

Assessing and documenting general competencies in otolaryngology

resident training programs. Laryngoscope 116:682–695.

Rudy DW, Fejfar MC, Griffith CH, Wilson JF. 2001. Self- and peer

assessment in a first-year communication and interviewing course.

Eval Health Prof 24:436–445.

Secomb J. 2007. A systematic review of peer teaching and learning in

clinical education. J Clin Nurs 17:703–715.

Sullivan ME, Hitchcock MA, Dunnington GL. 1999. Peer and self assessment

during problem-based tutorials. Am J Surg 177:266–269.

Terwee CB, Bot SDM, de Boer MR, van der Windt DAWM, Knol DL, Dekker J,

Bouter LM, de Vet HC. 2007. Quality criteria were proposed for

measurement properties of health status questionnaires. J Clin Epidemiol

60:34–42.

Thomas PA, Gebo KA, Hellmann DB. 1999. A pilot study of peer review in

residency training. J Gen Intern Med 14:551–554.

Timmerman AA, Meesters CMG, Speyer R, Anteunis LJC. 2007.

Psychometric qualities of questionnaires for the assessment of otitis

media impact. Clin Otolaryngol 32(6):429–439.

Topping K. 1998. Peer assessment between students in colleges and

universities. Rev Educ Res 68(3):249–276.

van Rosendaal FMA, Jennett PA. 1994. Comparing peer and faculty

evaluations in an internal medicine residency. Acad Med

69(4):299–303.

van Zundert M, Sluijsmans D, van Merrienboer J. 2010. Effective peer

assessment processes: Research findings and future directions. Learn

Instruct 20(4): 270–279.

Wendling A, Hoekstra L. 2002. Interactive peer review: An innovative

resident evaluation tool. Fam Med 34(10):738–743.

Willoughby TL, Gammon LC, Jonas HS. 1979. Correlates of clinical

performance during medical school. J Med Educ 54:453–460.
