2018 Annual Educational Conference
SES084: Milestone Evaluations: Discovery of Some Thought Provoking Rating Trends
Marty Keil, MA, PhD Candidate, Stanford University
Nancy Piro, PhD, Sr. Program Manager/Education Specialist, Stanford Health Care
Ann Dohn, MA, DIO & GME Director, Stanford Health Care
} Marty Keil, MA
} Nancy Piro, PhD
} Ann Dohn, MA
None of the above speakers have any conflicts of interest to report.
Agenda
} Milestones
− Background and Goals
− Milestone Project Initiation
} Milestone Score Analysis
− Research Questions
− Data Collection and Analysis
− Summary of Findings
− What does this mean for your institution?
} Milestone Narrative (Comments) Analysis
− Research Questions
− Data Collection and Analysis
− Summary of Findings
− What does this mean for your institution?
} Current Directions – Feedback Model
} Impact of Resident Evaluations on Faculty Evaluations of Residents
} Interventions
What is a Milestone?
} “A defined observable marker of an individual’s ability along a developmental continuum”
− Used for planning and teaching
} “Educators will use Milestones to design educational activities and teach specific abilities.”
Royal College of Physicians and Surgeons of Canada, 2015
ACGME goals for Milestone Evaluation System
} Develop measures of resident performance across core competencies
− that also provide valuable feedback for residents to focus their practice
} High Quality Assessment Measures
− Reliable
− Valid
} High Quality Feedback System
− Specific
− Timely
− Constructive
What are we experiencing?
} We now have Milestones in all ACGME Programs… but
− Our largest number of ACGME citations has been for evaluations and feedback
− Our internal GME House Staff survey demonstrates that our trainees are dissatisfied with their evaluation feedback
− Our ACGME Survey:
So what to do???
} Milestone Project Initiation
− 2016 Annual Institutional Review (AIR)
} Committee members recommended pursuing a collaborative relationship with the Stanford Graduate School of Education (SGSOE) to leverage their expertise to
− analyze Milestone Evaluation Trends
− create improvement strategies
Milestone Project Initiation and Goals
} Initial Groundwork (GME and SGSOE)
− Defined the problem from GME’s perspective
− Brainstormed with the School of Education
} Began working with Faculty and a PhD Candidate focusing on improving graduate education
} Project Implementation
− Stakeholder Determination (Anesthesia, Neurology, OB/GYN, Orthopedic Surgery, Psychiatry, and Radiology)
− Expanding to include other Institutions
Milestone Score Analysis – Research Questions
} How reliable are milestone score evaluations? What factors contribute to variance within the system, including attendings, residents, specialties, and institutions?
} Do attendings differentiate between the six core competencies when evaluating milestones?
} What are the relationships between milestone scores and time, as well as milestone scores and PGY, and how do these variables interact?
} Can we identify evaluators that consistently provide high and low quality feedback?
} What evaluator characteristics contribute to high quality feedback and assessment?
Milestone Score Analysis – Data Collection
} Data extracted from our Residency Management System (RMS) Online Portal and from the Ad Hoc Resident Reporting functionality, concatenating several cells to delineate PGY Levels
} Data collected from 5 ACGME Core Specialties across 2 large academic medical centers (8 programs total)
} All data de-identified to protect Resident Confidentiality
} Datasets from 2013-2017
− 4-year period beginning with the implementation of ACGME Milestones
} Data Filtering to include only
− End-of-rotation milestone subcompetency scores from Faculty
− Competencies evaluated in more than 30% of rotations
} Data analysis using R -> open-source code applicable to all files
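A minimal R sketch of the filtering above, assuming a raw RMS export `raw_evals`; the column names (`Evaluation.Type`, `Evaluator.Role`, `Rotation.ID`, `Subcompetency`) are illustrative, not the actual export fields.
```r
library(dplyr)

# Keep only end-of-rotation subcompetency scores submitted by faculty
milestones <- raw_evals %>%
  filter(Evaluation.Type == "End of Rotation",
         Evaluator.Role == "Faculty")

# Keep only subcompetencies evaluated in more than 30% of rotations
n_rotations <- n_distinct(milestones$Rotation.ID)
frequent <- milestones %>%
  group_by(Subcompetency) %>%
  summarise(coverage = n_distinct(Rotation.ID) / n_rotations, .groups = "drop") %>%
  filter(coverage > 0.30)

milestones <- semi_join(milestones, frequent, by = "Subcompetency")
```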
Milestone Score Analysis – Data Collection and Overview
} Summary of Data
Program Institution Evaluations Evaluators Residents Competencies
Neurology B 504 55 39 28
Neurology A 1067 90 36 29
OBGYN B 1851 58 57 28
OBGYN A 1376 66 31 27
Psychiatry B 648 50 64 22
Psychiatry A 935 97 65 21
Anesthesia A 2406 66 116 25
Radiology A 2965 79 67 10
Milestone Score Analysis – A Classic Progression Plot
[Figure: Mean Milestone Scores by PGY – Obstetrics & Gynecology; legend: ACGME National Report (2016) vs. “B” OB/GYN (2013-2017); x-axis: PGY 1–4, y-axis: Milestone Score 1–5]
} Scores increase each year in evenly spaced increments
− Signals consistent growth in resident performance and measurement validity
Milestone Score Analysis – Individual Plots
} Delving deeper we see highly variable scores for each resident across rotations
[Figure: Milestone score trajectories over time, one panel per de-identified resident – “B” Neurology; x-axis: Time (days, 0–900), y-axis: Milestone Score 1–5]
Milestone Score Analysis – Reliability
} Mixed Model Results demonstrate an evaluation system with low Intraclass Correlation Coefficients (ICC), suggesting poor reliability
} Attendings explain significantly more of the variance than residents in six programs
Institution  Program     Percent Explained by Evaluator  ICC Score (Reliability)
B            Neurology   86.3%                           0.086
A            Neurology   59.6%                           0.266
B            OBGYN       88.9%                           0.058
A            OBGYN       93.7%                           0.043
B            Psychiatry  84.2%                           0.117
A            Psychiatry  92.9%                           0.042
A            Anesthesia  28.2%                           0.481
A            Radiology   43.0%                           0.396
Mixed Model: Milestone.Score ~ Total.Time + Program + Institution + (1|Evaluator.ID) + (1|Evaluatee.ID)
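A minimal lme4 sketch of the formula above and of the variance decomposition behind the table; the data frame `milestones` and its columns are illustrative, and normalizing both shares by the total variance is an assumption about how the slide's figures were derived.
```r
library(lme4)

fit <- lmer(
  Milestone.Score ~ Total.Time + Program + Institution +
    (1 | Evaluator.ID) + (1 | Evaluatee.ID),
  data = milestones
)

# Variance components: the ICC (reliability) is read here as the share of
# variance attributable to which resident is rated; the evaluator share is
# normalized the same way.
vc <- as.data.frame(VarCorr(fit))
total_var     <- sum(vc$vcov)
icc_resident  <- vc$vcov[vc$grp == "Evaluatee.ID"] / total_var
pct_evaluator <- vc$vcov[vc$grp == "Evaluator.ID"] / total_var
```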
Milestone Score Analysis – Reliability Scaled
} Milestone scores scaled by each evaluator’s average score (a rescaling sketch follows the table)
} Results demonstrate that the issue extends beyond an incorrect internal reference point
Institution  Program     Percent Explained by Evaluator  ICC Score (Reliability)
B            Neurology   43.0%                           0.156
A            Neurology   19.3%                           0.32
B            OBGYN       90.4%                           0.05
A            OBGYN       73.1%                           0.075
B            Psychiatry  23.3%                           0.234
A            Psychiatry  29.3%                           0.087
A            Anesthesia   44.6%                          0.351
A            Radiology   71.9%                           0.21
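One plausible reading of the rescaling step, sketched with illustrative column names: center each evaluator's scores on that evaluator's own mean and refit the same model (whether the scores were centered or divided is an assumption).
```r
library(dplyr)
library(lme4)

# Center each evaluator's scores on that evaluator's own mean
milestones_scaled <- milestones %>%
  group_by(Evaluator.ID) %>%
  mutate(Scaled.Score = Milestone.Score - mean(Milestone.Score, na.rm = TRUE)) %>%
  ungroup()

# Refit the same crossed-effects model on the rescaled scores
fit_scaled <- lmer(
  Scaled.Score ~ Total.Time + Program + Institution +
    (1 | Evaluator.ID) + (1 | Evaluatee.ID),
  data = milestones_scaled
)
```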
Milestone Score Analysis – Evaluator and Resident Variation Plots
} Evaluator means dramatically different compared to resident means for any particular PGY level (PGY1 level most variable) – Institution “A”
[Figures: Mean Milestone Scores by Resident (PGY1) and Mean Milestone Scores by Evaluator (PGY1) – Obstetrics & Gynecology, Institution “A”; y-axis: Milestone Score 1–5]
Milestone Score Analysis – Split Regression Analysis
} Split the mixed model by PGY year (one model per 365-day block of time in program)
} Determine the slope and intercept for each model (a per-PGY fitting sketch follows the plot)
Mixed Model: Milestone.Score ~ Time + Program + Institution + (1|Evaluator.ID) + (1|Evaluatee.ID)
[Figure: Radiology milestone scores vs. total time in program, colored by PGY level (1–4); x-axis: total time (days, 0–1500), y-axis: Milestone Score 1–5]
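A simplified sketch of the split analysis with illustrative column names: each 365-day PGY block gets a plain linear fit whose intercept and slope are kept; lm() stands in here for the per-PGY mixed models described above.
```r
library(dplyr)

split_fits <- milestones %>%
  mutate(PGY = floor(Total.Time / 365) + 1) %>%          # 365-day blocks
  group_by(Institution, Program, PGY) %>%
  group_modify(~ {
    m <- lm(Milestone.Score ~ Total.Time, data = .x)     # per-block linear fit
    data.frame(Intercept = unname(coef(m)[1]),
               Slope     = unname(coef(m)[2]))
  }) %>%
  ungroup()
```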
Milestone Score Analysis – Lag Analysis
} For all programs besides psychiatry the y-intercept increases across years while the slope is relatively flat
} This suggests widespread scoring bias focused on PGY instead of time
Institution  Program     PGY1 Intercept  PGY1 Slope  PGY2 Intercept  PGY2 Slope  PGY3 Intercept  PGY3 Slope  PGY4 Intercept  PGY4 Slope
B            Neurology   2.97            0.0007      3.83            0.0002      3.97            0.0003      NA              NA
A            Neurology   2.9425          0.0013      2.8284          0.0012      2.9381          0.0011      2.616           0.0012
B            OBGYN       1.6841          0.0007      2.2305          0.0006      2.0359          0.0014      3.1788          0.0007
A            OBGYN       2.0119          0.0014      2.4949          0.0014      2.9962          0.0009      3.2742          0.0007
B            Psychiatry  2.6922          -0.0002     3.1067          -0.0008     -0.0425         0.0039      NA              NA
A            Psychiatry  3.0038          0.0005      2.9641          0.0008      2.5556          0.0013      2.613           0.001
A            Anesthesia  2.525           0.0008      2.6424          0.0009      3.014           0.0008      2.6803          0.0009
A            Radiology   1.45            0.0001      2.4151          0.0003      3.1739          0.0002      3.8521          0.0003
Milestone Score Analysis – Factor Analysis
} Feedback should be specific
} Factor analysis results in one factor explaining more than 90% of variance (a factanal sketch follows the table)
[Figure: Factor Analysis Results – “B” Neurology]
Institution  Program     Variance Explained by 1st Factor
B            Neurology   .95
A            Neurology   .90
B            OBGYN       .97
A            OBGYN       .92
B            Psychiatry  .96
A            Psychiatry  .82
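A minimal sketch of such a per-program check with illustrative column names (`Evaluation.ID`, `Subcompetency`): reshape each evaluation into one row of subcompetency scores and fit a single-factor model.
```r
library(dplyr)
library(tidyr)

# One row per evaluation, one column per subcompetency score
wide <- milestones %>%
  pivot_wider(id_cols = Evaluation.ID,
              names_from = Subcompetency,
              values_from = Milestone.Score)

scores <- na.omit(as.matrix(wide[, -1]))
fa <- factanal(scores, factors = 1)

# Share of total (standardized) variance carried by the single factor
prop_first_factor <- sum(fa$loadings[, 1]^2) / ncol(scores)
```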
Milestone Score Analysis – Mean Variation per Evaluation
} Low standard deviation between subcompetency scores within an evaluation (computation sketched after the figure)
} Overall, 56.6% of evaluations had the same score for all rated subcompetencies
Institution  Program     Mean SD in Evaluation  Proportion No Variation
B            Neurology   0.248                  0.377
A            Neurology   0.302                  0.321
B            OBGYN       0.19                   0.586
A            OBGYN       0.261                  0.311
B            Psychiatry  0.191                  0.517
A            Psychiatry  0.425                  0.276
A            Anesthesia  0.084                  0.789
A            Radiology   0.121                  0.714
[Figure: Mean Evaluator Variance per Evaluation – “B” Radiology; mean SD of subcompetency scores per evaluation, by evaluator]
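A sketch of the within-evaluation summary behind the table and plot, using illustrative column names: the SD across each evaluation's subcompetency scores, then the share of evaluations with no variation at all.
```r
library(dplyr)

# SD of subcompetency scores within each completed evaluation
per_eval <- milestones %>%
  group_by(Institution, Program, Evaluation.ID) %>%
  summarise(sd_within = sd(Milestone.Score, na.rm = TRUE),
            all_same  = n_distinct(Milestone.Score) == 1,
            .groups   = "drop")

# Program-level summary: mean within-evaluation SD and proportion with no variation
variation_summary <- per_eval %>%
  group_by(Institution, Program) %>%
  summarise(mean_sd_in_evaluation = mean(sd_within, na.rm = TRUE),
            prop_no_variation     = mean(all_same),
            .groups = "drop")
```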
Milestone Score Analysis – Timing Analysis
} Feedback should be timely
} Mean evaluation delay from end of rotation was more than 1 month – Institution “A” (delay computation sketched after the figure)
Institution  Program     Mean Delay (days)
B            Neurology   81.739
A            Neurology   41.191
B            OBGYN       37.124
A            OBGYN       36.477
B            Psychiatry  59.316
A            Psychiatry  66.541
A            Anesthesia  41.071
A            Radiology   29.693
[Figure: Mean Evaluator Delay – Radiology; mean evaluation delay by evaluator, y-axis: Delay (days), 0–150]
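A sketch of the delay calculation with illustrative column names (`Rotation.End.Date`, `Evaluation.Date`): days from the end of the rotation to the submitted evaluation, averaged per program.
```r
library(dplyr)

delays <- milestones %>%
  mutate(delay_days = as.numeric(difftime(Evaluation.Date, Rotation.End.Date,
                                          units = "days"))) %>%
  group_by(Institution, Program) %>%
  summarise(mean_delay_days = mean(delay_days, na.rm = TRUE), .groups = "drop")
```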
Milestone Score Analysis – Evaluator Characteristics
} Evaluator Quality Outcomes:
− Delay of evaluation
− Competency variation within evaluation
− Mean milestone score
} Evaluator Characteristics:
− Number of years teaching
− Faculty position (Assistant, Associate, Full Professor, Director, Chair)
− Clinical supervision hours
} Mixed Models show no significant relationships between evaluator characteristics and evaluator quality outcomes
} Models do show a relationship between delay and other quality outcomes
− Evaluations submitted after longer delays have significantly higher milestone scores and significantly lower competency variation
Clinical Competency Committee (CCC)
} Steady progression of scores over time
[Figure: CCC milestone scores over time, one panel per de-identified resident; x-axis: Time (days, 500–1500), y-axis: Milestone Score 2–4]
Clinical Competency Committee
} Factor analysis across Institution “A”’s 5 programs shows one factor loading across competencies
} Standard deviation analysis shows increased variation within evaluations for certain programs’ CCC scores
Program     Mean SD across Evaluations  Percent All Same
Neurology   0.953                       33.82%
OBGYN       0.416                       1.10%
Psychiatry  0.368                       1.01%
Anesthesia  0.141                       0%
Radiology   0.212                       42.85%
Clinical Competency Committee
} How does the CCC aggregate rotation data? (comparison sketched below)
− Mean of six scores
− Predictive model controlling for random effects
− Last score of six
} Mean of six scores is most predictive of CCC scores
} Significant relationship between rotation scores and CCC scores when time is included in the model
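A sketch of the aggregation comparison with illustrative data frames (`rotation_scores`, `ccc_scores`) and column names; the predictive-model option is omitted for brevity, so this only contrasts the mean and the last of the rotation scores.
```r
library(dplyr)

# Summarize each resident's rotation scores per review period two ways
rotation_summary <- rotation_scores %>%
  group_by(Evaluatee.ID, Review.Period, Subcompetency) %>%
  summarise(mean_score = mean(Milestone.Score, na.rm = TRUE),
            last_score = last(Milestone.Score, order_by = Evaluation.Date),
            .groups    = "drop")

# Join to the CCC scores and compare which summary predicts them better
ccc_joined <- inner_join(ccc_scores, rotation_summary,
                         by = c("Evaluatee.ID", "Review.Period", "Subcompetency"))

fit_mean <- lm(CCC.Score ~ mean_score, data = ccc_joined)
fit_last <- lm(CCC.Score ~ last_score, data = ccc_joined)
c(mean = summary(fit_mean)$r.squared, last = summary(fit_last)$r.squared)
```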
Milestone Score Analysis – Summary of Findings
1. Milestone evaluations are unreliable across specialties and institutions
2. Evaluators are more predictive of milestone scores than the residents
3. Evaluators often rate milestone competencies along one dimension
4. Milestone score is a non-linear function of time (jump at each PGY transition)
5. Evaluator characteristics do not appear to influence milestone scores
6. Monthly milestone evaluations directly impact CCC scores for residents
What does this mean for your institution?
Narrative Comment Analysis – Research Questions
} Do open-ended narrative comments more reliably differentiate residents when compared to milestone evaluation scores?
} How common is constructive feedback in narrative comments?
} How much variation is there among attendings and residents in the amount of constructive feedback they provide and receive?
} Do evaluation prompts increase the amount of high quality feedback?
} What relationships are there between attending characteristics and narrative feedback quality?
Narrative Comment Analysis – Data Collection and Summary
} Data extracted from the RMSs for end-of-rotation attending comments
} Data from 5 specialties at 2 institutions, 10 programs total
} 10,044 complete evaluations divided into 21,290 sentence chunks
Program Institution Comment Chunks
Neurology B 1681
Neurology A 1039
OBGYN B 529
OBGYN A 930
Psychiatry B 2793
Psychiatry A 3864
Anesthesia B 3471
Anesthesia A 3244
Radiology B 247
Radiology A 3492
Narrative Comment Analysis – Coding Procedure
} Valence: Positive, Neutral, Negative/Corrective
} Specificity: General, Specific
Example 1 (Positive, General): "I really like this Resident. She is hardworking, personable, takes initiative and goes the extra mile. These qualities are rare and greatly appreciated."
Example 2 (Neutral, General): "Appropriate fund of knowledge for her level"
Example 3 (Negative, Specific): "After the meeting, the timeliness of her notes did improve, but still needs work on content"
Example 4 (Positive, Specific): "He was efficient, responsible, and went the extra mile with patient to create safety discharge plans that were feasible for patients in follow-up"
Example 5 (Neutral, General): "Continue reading."
Narrative Comment Analysis – Evaluator and Resident Variation Plot
} 40% of OB/GYN evaluators never provided critical feedback to residents (tally sketched after the plots)
} There is greater variation in narrative critical feedback than in milestone scores among residents
[Figures: Percent of comment chunks with negative valence, by resident (“A” OB/GYN Resident Means) and by evaluator (“A” OB/GYN Evaluator Means); y-axis: 0.00–1.00]
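A sketch of the tally behind the plots, assuming a hand-coded data frame `comments` with illustrative columns (`Evaluator.ID`, `Valence`): the share of each evaluator's comment chunks with negative/corrective valence.
```r
library(dplyr)

pct_negative <- comments %>%
  group_by(Evaluator.ID) %>%
  summarise(pct_negative = mean(Valence == "Negative"),
            n_chunks     = n(),
            .groups      = "drop")

# Fraction of evaluators who never gave critical (negative/corrective) feedback
mean(pct_negative$pct_negative == 0)
```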
Narrative Comment Analysis – Summary of Data
Overall Prompts: n = 31502
  Overview: n = 13392 (Complete: n = 6458; Nothing: n = 6934)
  Strengths: n = 8112 (Complete: n = 5347; Nothing: n = 2765)
  Weaknesses / Improvement: n = 9548 (Complete: n = 2306; Nothing: n = 7242)
    Of the 2306 complete Weaknesses/Improvement responses: Constructive: n = 898; Positive: n = 180; Nothing: n = 1228
Narrative Comment Analysis “A” – Evaluator Characteristics
} No OB/GYN faculty member teaching for over 20 years provided negative feedback
} This relationship was also apparent with faculty rank
[Figure: Teaching Years vs. Critical Feedback – percent of comments with negative valence by years teaching, “A” OB/GYN; x-axis: Teaching Years (0–40), y-axis: Percent with Negative (0.00–1.00)]
Narrative Comment Analysis – Summary
} OB/GYN narrative comments more reliably differentiate resident abilities compared to milestone scores (may apply more broadly to all specialties)
} There is significant variation between evaluators in the amount of critical feedback they provide
} Critical and constructive feedback is rare in narrative comments, occurring less than 10% of the time
} Prompting attendings for resident areas of improvement/weaknesses does not guarantee the comment will address this request (still, less than 10% of these prompts were constructive)
} There is a significant positive relationship between faculty rank and the amount of constructive feedback given to residents. This is also true for years teaching.
What does this mean for your institution?
Current Directions – Feedback Model
[Diagram: Feedback loop between Evaluator and Student: data provided to evaluator → interpretation of data → providing feedback → student adapts behavior]
Current Directions – Feedback Model Breakdown
} Data Provided to Evaluator
− Inadequate observational data to make an accurate assessment
− Not a valid construct: the gradation of behaviors described does not exist
} Evaluator Data Interpretation
− Unable to differentiate levels of competence
− Incorrect reference points of competency
− Cognitive overload: unable to evaluate many competencies at once
} Evaluator Providing Feedback
− Lack of descriptive feedback: poor motivation, lacking incentive structure, time
− Lack of critical feedback: fear of retaliation, uncertainty, poor documentation, risk of hurting the relationship
Currently conducting resident focus groups, attending interviews, and online surveys to explore these hypotheses
Current Directions – A call for data collaborators
Measures of Resident Teaching Quality
} Student evaluations of teacher performance on rotations
} Student learning outcomes
} Observational Data of resident teachers
− Observation data by peers or supervisors
− Process-based analysis through electronic data
} Example: Assessing feedback quality
− Milestone score analysis: Evaluator mean score, competency variation, timing
− Narrative comment analysis: Positive/Negative, Specific
Resident Scores of Faculty Teaching
} Ceiling effect in teaching scores for all residency programs
} Across 5 programs from Institution A, 62% of faculty evaluations had no variance in scores
Program     Percent All Same
Neurology   56.3%
OBGYN       53.4%
Psychiatry  52.3%
Anesthesia  75.2%
Radiology   69.2%
[Figure: Feedback Rating Distribution – Institution A Neurology; x-axis: Constructive Feedback Rating (2–6), y-axis: # of Ratings (0–600)]
Individual Teacher Ratings
} Student evaluations of attending feedback do not adequately differentiate faculty performance
} Teacher ratings do not correlate strongly with process data
[Figure: Feedback Score by Attending – Neurology, Institution A; y-axis: Feedback Score (2–7), one point per de-identified Faculty ID]
Effect of Student Ratings on Teaching Behavior
} In many residency programs, teaching scores contribute to promotion and bonus decisions
} If teacher ratings are based solely on student evaluations, faculty may be incentivized to avoid giving negative, specific feedback
} Faculty may also be incentivized not to provide low milestone scores
} Automated process measures allow program directors to identify faculty that may need extra training on differentiating competencies and submitting feedback in a timely, informative manner
Current Directions – A call for data collaborators
Questions?
} Marty Keil: [email protected]
} Nancy Piro: [email protected]
} Ann Dohn: [email protected]