2018 Annual Educational Conference
SES084: Milestone Evaluations: Discovery of Some Thought Provoking Rating Trends
Marty Keil, MA, PhD Candidate, Stanford University
Nancy Piro, PhD, Sr. Program Manager/Education Specialist, Stanford Health Care
Ann Dohn, MA, DIO & GME Director, Stanford Health Care
} Marty Keil, MA
} Nancy Piro, PhD
} Ann Dohn, MA
None of the above speakers have any conflicts of interest to report.
Agenda
} Milestones
− Background and Goals
− Milestone Project Initiation
} Milestone Score Analysis
− Research Questions
− Data Collection and Analysis
− Summary of Findings
− What does this mean for your institution?
} Milestone Narrative (Comments) Analysis
− Research Questions
− Data Collection and Analysis
− Summary of Findings
− What does this mean for your institution?
} Current Directions – Feedback Model
} Impact of Resident Evaluations on Faculty Evaluations of Residents
} Interventions
What is a Milestone?
} “A defined observable marker of an individual’s ability along a developmental continuum”
− Used for planning and teaching
} “Educators will use Milestones to design educational activities and teach specific abilities.”
Royal College of Physicians and Surgeons of Canada, 2015
ACGME goals for Milestone Evaluation System
} Develop measures of resident performance across core competencies
− that also provide valuable feedback for residents to focus their practice
} High Quality Assessment Measures
− Reliable
− Valid
} High Quality Feedback System
− Specific
− Timely
− Constructive
What are we experiencing?
} We now have Milestones in all ACGME Programs… but
− Our largest number of ACGME citations has been for evaluations and feedback
− Our internal GME House Staff survey demonstrates that our trainees are dissatisfied with their evaluation feedback
− Our ACGME Survey:
So what to do???
} Milestone Project Initiation
− 2016 Annual Institutional Review (AIR)
} Committee members recommended pursuing a collaborative relationship with the Stanford Graduate School of Education (SGSOE) to leverage their expertise to
− analyze Milestone Evaluation Trends
− create improvement strategies
Milestone Project Initiation and Goals
} Initial Groundwork (GME and SGSOE)
− Defined the problem from GME’s perspective
− Brainstormed with the School of Education
} Began working with Faculty and a PhD Candidate focusing on improving graduate education
} Project Implementation
− Stakeholder Determination (Anesthesia, Neurology, OB/GYN, Orthopedic Surgery, Psychiatry, and Radiology)
− Expanding to include other Institutions
Milestone Score Analysis – Research Questions
} How reliable are milestone score evaluations? What factors contribute to variance within the system, including attendings, residents, specialties, and institutions?
} Do attendings differentiate between the six core competencies when evaluating milestones?
} What are the relationships between milestone scores and time, as well as milestone scores and PGY, and how do these variables interact?
} Can we identify evaluators that consistently provide high and low quality feedback?
} What evaluator characteristics contribute to high quality feedback and assessment?
Milestone Score Analysis – Data Collection
} Data extracted from our Residency Management System (RMS) Online Portal and from the Ad Hoc Resident Reporting functionality, concatenating several cells to delineate PGY Levels
} Data collected from 5 ACGME Core Specialties across 2 large academic medical centers (8 programs total)
} All data de-identified to protect Resident Confidentiality
} Datasets from 2013-2017
− 4-year period beginning with the implementation of ACGME Milestones
} Data Filtering to include only
− End-of-rotation milestone subcompetency scores from Faculty
− Competencies evaluated in more than 30% of rotations
} Data analysis using R -> open-source code applicable to all files
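A minimal R sketch of the filtering above, assuming a raw RMS export `raw_evals`; the column names (`Evaluation.Type`, `Evaluator.Role`, `Rotation.ID`, `Subcompetency`) are illustrative, not the actual export fields.
```r
library(dplyr)

# Keep only end-of-rotation subcompetency scores submitted by faculty
milestones <- raw_evals %>%
  filter(Evaluation.Type == "End of Rotation",
         Evaluator.Role == "Faculty")

# Keep only subcompetencies evaluated in more than 30% of rotations
n_rotations <- n_distinct(milestones$Rotation.ID)
frequent <- milestones %>%
  group_by(Subcompetency) %>%
  summarise(coverage = n_distinct(Rotation.ID) / n_rotations, .groups = "drop") %>%
  filter(coverage > 0.30)

milestones <- semi_join(milestones, frequent, by = "Subcompetency")
```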
Milestone Score Analysis – Data Collection and Overview
} Summary of Data
Program Institution Evaluations Evaluators Residents Competencies
Neurology B 504 55 39 28
Neurology A 1067 90 36 29
OBGYN B 1851 58 57 28
OBGYN A 1376 66 31 27
Psychiatry B 648 50 64 22
Psychiatry A 935 97 65 21
Anesthesia A 2406 66 116 25
Radiology A 2965 79 67 10
Milestone Score Analysis – A Classic Progression Plot
[Figure: Mean Milestone Scores by PGY – Obstetrics & Gynecology; legend: ACGME National Report (2016) vs. “B” OB/GYN (2013-2017); x-axis: PGY 1–4, y-axis: Milestone Score 1–5]
} Scores increase each year in evenly spaced increments
− Signals consistent growth in resident performance and measurement validity
Milestone Score Analysis – Individual Plots
} Delving deeper we see highly variable scores for each resident across rotations
[Figure: Milestone score trajectories over time, one panel per de-identified resident – “B” Neurology; x-axis: Time (days, 0–900), y-axis: Milestone Score 1–5]
Milestone Score Analysis – Reliability
} Mixed Model Results demonstrate an evaluation system with low Intraclass Correlation Coefficients (ICC), suggesting poor reliability
} Attendings explain significantly more of the variance than residents in six programs
Institution  Program     Percent Explained by Evaluator  ICC Score (Reliability)
B            Neurology   86.3%                           0.086
A            Neurology   59.6%                           0.266
B            OBGYN       88.9%                           0.058
A            OBGYN       93.7%                           0.043
B            Psychiatry  84.2%                           0.117
A            Psychiatry  92.9%                           0.042
A            Anesthesia  28.2%                           0.481
A            Radiology   43.0%                           0.396
Mixed Model: Milestone.Score ~ Total.Time + Program + Institution + (1|Evaluator.ID) + (1|Evaluatee.ID)
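A minimal lme4 sketch of the formula above and of the variance decomposition behind the table; the data frame `milestones` and its columns are illustrative, and normalizing both shares by the total variance is an assumption about how the slide's figures were derived.
```r
library(lme4)

fit <- lmer(
  Milestone.Score ~ Total.Time + Program + Institution +
    (1 | Evaluator.ID) + (1 | Evaluatee.ID),
  data = milestones
)

# Variance components: the ICC (reliability) is read here as the share of
# variance attributable to which resident is rated; the evaluator share is
# normalized the same way.
vc <- as.data.frame(VarCorr(fit))
total_var     <- sum(vc$vcov)
icc_resident  <- vc$vcov[vc$grp == "Evaluatee.ID"] / total_var
pct_evaluator <- vc$vcov[vc$grp == "Evaluator.ID"] / total_var
```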
Milestone Score Analysis – Reliability Scaled
} Milestone scores scaled by each evaluator’s average score (a rescaling sketch follows the table)
} Results demonstrate that the issue extends beyond an incorrect internal reference point
Institution  Program     Percent Explained by Evaluator  ICC Score (Reliability)
B            Neurology   43.0%                           0.156
A            Neurology   19.3%                           0.32
B            OBGYN       90.4%                           0.05
A            OBGYN       73.1%                           0.075
B            Psychiatry  23.3%                           0.234
A            Psychiatry  29.3%                           0.087
A            Anesthesia   44.6%                          0.351
A            Radiology   71.9%                           0.21
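One plausible reading of the rescaling step, sketched with illustrative column names: center each evaluator's scores on that evaluator's own mean and refit the same model (whether the scores were centered or divided is an assumption).
```r
library(dplyr)
library(lme4)

# Center each evaluator's scores on that evaluator's own mean
milestones_scaled <- milestones %>%
  group_by(Evaluator.ID) %>%
  mutate(Scaled.Score = Milestone.Score - mean(Milestone.Score, na.rm = TRUE)) %>%
  ungroup()

# Refit the same crossed-effects model on the rescaled scores
fit_scaled <- lmer(
  Scaled.Score ~ Total.Time + Program + Institution +
    (1 | Evaluator.ID) + (1 | Evaluatee.ID),
  data = milestones_scaled
)
```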
Milestone Score Analysis – Evaluator and Resident Variation Plots
} Evaluator means dramatically different compared to resident means for any particular PGY level (PGY1 level most variable) – Institution “A”
[Figures: Mean Milestone Scores by Resident (PGY1) and Mean Milestone Scores by Evaluator (PGY1) – Obstetrics & Gynecology, Institution “A”; y-axis: Milestone Score 1–5]
Milestone Score Analysis – Split Regression Analysis
} Split the mixed model by PGY year (one model per 365-day block of time in program)
} Determine the slope and intercept for each model (a per-PGY fitting sketch follows the plot)
Mixed Model: Milestone.Score ~ Time + Program + Institution + (1|Evaluator.ID) + (1|Evaluatee.ID)
[Figure: Radiology milestone scores vs. total time in program, colored by PGY level (1–4); x-axis: total time (days, 0–1500), y-axis: Milestone Score 1–5]
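A simplified sketch of the split analysis with illustrative column names: each 365-day PGY block gets a plain linear fit whose intercept and slope are kept; lm() stands in here for the per-PGY mixed models described above.
```r
library(dplyr)

split_fits <- milestones %>%
  mutate(PGY = floor(Total.Time / 365) + 1) %>%          # 365-day blocks
  group_by(Institution, Program, PGY) %>%
  group_modify(~ {
    m <- lm(Milestone.Score ~ Total.Time, data = .x)     # per-block linear fit
    data.frame(Intercept = unname(coef(m)[1]),
               Slope     = unname(coef(m)[2]))
  }) %>%
  ungroup()
```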
Milestone Score Analysis – Lag Analysis
} For all programs besides psychiatry the y-intercept increases across years while the slope is relatively flat
} This suggests widespread scoring bias focused on PGY instead of time
Institution  Program     PGY1 Intercept  PGY1 Slope  PGY2 Intercept  PGY2 Slope  PGY3 Intercept  PGY3 Slope  PGY4 Intercept  PGY4 Slope
B            Neurology   2.97            0.0007      3.83            0.0002      3.97            0.0003      NA              NA
A            Neurology   2.9425          0.0013      2.8284          0.0012      2.9381          0.0011      2.616           0.0012
B            OBGYN       1.6841          0.0007      2.2305          0.0006      2.0359          0.0014      3.1788          0.0007
A            OBGYN       2.0119          0.0014      2.4949          0.0014      2.9962          0.0009      3.2742          0.0007
B            Psychiatry  2.6922          -0.0002     3.1067          -0.0008     -0.0425         0.0039      NA              NA
A            Psychiatry  3.0038          0.0005      2.9641          0.0008      2.5556          0.0013      2.613           0.001
A            Anesthesia  2.525           0.0008      2.6424          0.0009      3.014           0.0008      2.6803          0.0009
A            Radiology   1.45            0.0001      2.4151          0.0003      3.1739          0.0002      3.8521          0.0003
Milestone Score Analysis – Factor Analysis
} Feedback should be specific
} Factor analysis results in one factor explaining more than 90% of variance (a factanal sketch follows the table)
[Figure: Factor Analysis Results – “B” Neurology]
Institution  Program     Variance Explained by 1st Factor
B            Neurology   .95
A            Neurology   .90
B            OBGYN       .97
A            OBGYN       .92
B            Psychiatry  .96
A            Psychiatry  .82
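A minimal sketch of such a per-program check with illustrative column names (`Evaluation.ID`, `Subcompetency`): reshape each evaluation into one row of subcompetency scores and fit a single-factor model.
```r
library(dplyr)
library(tidyr)

# One row per evaluation, one column per subcompetency score
wide <- milestones %>%
  pivot_wider(id_cols = Evaluation.ID,
              names_from = Subcompetency,
              values_from = Milestone.Score)

scores <- na.omit(as.matrix(wide[, -1]))
fa <- factanal(scores, factors = 1)

# Share of total (standardized) variance carried by the single factor
prop_first_factor <- sum(fa$loadings[, 1]^2) / ncol(scores)
```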
Milestone Score Analysis – Mean Variation per Evaluation
} Low standard deviation between subcompetency scores within an evaluation (computation sketched after the figure)
} Overall, 56.6% of evaluations had the same score for all rated subcompetencies
Institution  Program     Mean SD in Evaluation  Proportion No Variation
B            Neurology   0.248                  0.377
A            Neurology   0.302                  0.321
B            OBGYN       0.19                   0.586
A            OBGYN       0.261                  0.311
B            Psychiatry  0.191                  0.517
A            Psychiatry  0.425                  0.276
A            Anesthesia  0.084                  0.789
A            Radiology   0.121                  0.714
[Figure: Mean Evaluator Variance per Evaluation – “B” Radiology; mean SD of subcompetency scores per evaluation, by evaluator]
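A sketch of the within-evaluation summary behind the table and plot, using illustrative column names: the SD across each evaluation's subcompetency scores, then the share of evaluations with no variation at all.
```r
library(dplyr)

# SD of subcompetency scores within each completed evaluation
per_eval <- milestones %>%
  group_by(Institution, Program, Evaluation.ID) %>%
  summarise(sd_within = sd(Milestone.Score, na.rm = TRUE),
            all_same  = n_distinct(Milestone.Score) == 1,
            .groups   = "drop")

# Program-level summary: mean within-evaluation SD and proportion with no variation
variation_summary <- per_eval %>%
  group_by(Institution, Program) %>%
  summarise(mean_sd_in_evaluation = mean(sd_within, na.rm = TRUE),
            prop_no_variation     = mean(all_same),
            .groups = "drop")
```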
Milestone Score Analysis – Timing Analysis
} Feedback should be timely
} Mean evaluation delay from end of rotation was more than 1 month – Institution “A” (delay computation sketched after the figure)
Institution  Program     Mean Delay (days)
B            Neurology   81.739
A            Neurology   41.191
B            OBGYN       37.124
A            OBGYN       36.477
B            Psychiatry  59.316
A            Psychiatry  66.541
A            Anesthesia  41.071
A            Radiology   29.693
[Figure: Mean Evaluator Delay – Radiology; mean evaluation delay by evaluator, y-axis: Delay (days), 0–150]
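A sketch of the delay calculation with illustrative column names (`Rotation.End.Date`, `Evaluation.Date`): days from the end of the rotation to the submitted evaluation, averaged per program.
```r
library(dplyr)

delays <- milestones %>%
  mutate(delay_days = as.numeric(difftime(Evaluation.Date, Rotation.End.Date,
                                          units = "days"))) %>%
  group_by(Institution, Program) %>%
  summarise(mean_delay_days = mean(delay_days, na.rm = TRUE), .groups = "drop")
```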
Milestone Score Analysis – Evaluator Characteristics
} Evaluator Quality Outcomes:
− Delay of evaluation
− Competency variation within evaluation
− Mean milestone score
} Evaluator Characteristics:
− Number of years teaching
− Faculty position (Assistant, Associate, Full Professor, Director, Chair)
− Clinical supervision hours
} Mixed Models show no significant relationships between evaluator characteristics and evaluator quality outcomes
} Models do show a relationship between delay and other quality outcomes
− Evaluations submitted after longer delays have significantly higher milestone scores and significantly lower competency variation
Clinical Competency Committee (CCC)
} Steady progression of scores over time
[Figure: CCC milestone scores over time, one panel per de-identified resident; x-axis: Time (days, 500–1500), y-axis: Milestone Score 2–4]
Clinical Competency Committee
} Factor analysis across Institution “A”’s 5 programs shows one factor loading across competencies
} Standard deviation analysis shows increased variation within evaluations for certain programs’ CCC scores
Program     Mean SD across Evaluations  Percent All Same
Neurology   0.953                       33.82%
OBGYN       0.416                       1.10%
Psychiatry  0.368                       1.01%
Anesthesia  0.141                       0%
Radiology   0.212                       42.85%
Clinical Competency Committee
} How does the CCC aggregate rotation data? (comparison sketched below)
− Mean of six scores
− Predictive model controlling for random effects
− Last score of six
} Mean of six scores is most predictive of CCC scores
} Significant relationship between rotation scores and CCC scores when time is included in the model
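A sketch of the aggregation comparison with illustrative data frames (`rotation_scores`, `ccc_scores`) and column names; the predictive-model option is omitted for brevity, so this only contrasts the mean and the last of the rotation scores.
```r
library(dplyr)

# Summarize each resident's rotation scores per review period two ways
rotation_summary <- rotation_scores %>%
  group_by(Evaluatee.ID, Review.Period, Subcompetency) %>%
  summarise(mean_score = mean(Milestone.Score, na.rm = TRUE),
            last_score = last(Milestone.Score, order_by = Evaluation.Date),
            .groups    = "drop")

# Join to the CCC scores and compare which summary predicts them better
ccc_joined <- inner_join(ccc_scores, rotation_summary,
                         by = c("Evaluatee.ID", "Review.Period", "Subcompetency"))

fit_mean <- lm(CCC.Score ~ mean_score, data = ccc_joined)
fit_last <- lm(CCC.Score ~ last_score, data = ccc_joined)
c(mean = summary(fit_mean)$r.squared, last = summary(fit_last)$r.squared)
```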
Milestone Score Analysis – Summary of Findings
1. Milestone evaluations are unreliable across specialties and institutions
2. Evaluators are more predictive of milestone scores than the residents
3. Evaluators often rate milestone competencies along one dimension
4. Milestone score is a non-linear function of time (jump at each PGY transition)
5. Evaluator characteristics do not appear to influence milestone scores
6. Monthly milestone evaluations directly impact CCC scores for residents
What does this mean for your institution?
Narrative Comment Analysis – Research Questions
} Do open-ended narrative comments more reliably differentiate residents when compared to milestone evaluation scores?
} How common is constructive feedback in narrative comments?
} How much variation is there among attendings and residents in the amount of constructive feedback they provide and receive?
} Do evaluation prompts increase the amount of high quality feedback?
} What relationships are there between attending characteristics and narrative feedback quality?
Narrative Comment Analysis – Data Collection and Summary
} Data extracted from the RMSs for end-of-rotation attending comments
} Data from 5 specialties at 2 institutions, 10 programs total
} 10,044 complete evaluations divided into 21,290 sentence chunks
Program Institution Comment Chunks
Neurology B 1681
Neurology A 1039
OBGYN B 529
OBGYN A 930
Psychiatry B 2793
Psychiatry A 3864
Anesthesia B 3471
Anesthesia A 3244
Radiology B 247
Radiology A 3492
Narrative Comment Analysis – Coding Procedure
} Valence: Positive, Neutral, Negative/Corrective
} Specificity: General, Specific
Example 1 (Positive, General): "I really like this Resident. She is hardworking, personable, takes initiative and goes the extra mile. These qualities are rare and greatly appreciated."
Example 2 (Neutral, General): "Appropriate fund of knowledge for her level"
Example 3 (Negative, Specific): "After the meeting, the timeliness of her notes did improve, but still needs work on content"
Example 4 (Positive, Specific): "He was efficient, responsible, and went the extra mile with patient to create safety discharge plans that were feasible for patients in follow-up"
Example 5 (Neutral, General): "Continue reading."
Narrative Comment Analysis – Evaluator and Resident Variation Plot
} 40% of OB/GYN evaluators never provided critical feedback to residents (tally sketched after the plots)
} There is greater variation in narrative critical feedback than in milestone scores among residents
[Figures: Percent of comment chunks with negative valence, by resident (“A” OB/GYN Resident Means) and by evaluator (“A” OB/GYN Evaluator Means); y-axis: 0.00–1.00]
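A sketch of the tally behind the plots, assuming a hand-coded data frame `comments` with illustrative columns (`Evaluator.ID`, `Valence`): the share of each evaluator's comment chunks with negative/corrective valence.
```r
library(dplyr)

pct_negative <- comments %>%
  group_by(Evaluator.ID) %>%
  summarise(pct_negative = mean(Valence == "Negative"),
            n_chunks     = n(),
            .groups      = "drop")

# Fraction of evaluators who never gave critical (negative/corrective) feedback
mean(pct_negative$pct_negative == 0)
```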
Narrative Comment Analysis – Summary of Data
Overall Prompts: n = 31502
  Overview: n = 13392 (Complete: n = 6458; Nothing: n = 6934)
  Strengths: n = 8112 (Complete: n = 5347; Nothing: n = 2765)
  Weaknesses / Improvement: n = 9548 (Complete: n = 2306; Nothing: n = 7242)
    Of the 2306 complete Weaknesses/Improvement responses: Constructive: n = 898; Positive: n = 180; Nothing: n = 1228
Narrative Comment Analysis “A” – Evaluator Characteristics
} No OB/GYN faculty member teaching for over 20 years provided negative feedback
} This relationship was also apparent with faculty rank
[Figure: Teaching Years vs. Critical Feedback – percent of comments with negative valence by years teaching, “A” OB/GYN; x-axis: Teaching Years (0–40), y-axis: Percent with Negative (0.00–1.00)]
Narrative Comment Analysis – Summary
} OB/GYN narrative comments more reliably differentiate resident abilities compared to milestone scores (may apply more broadly to all specialties)
} There is significant variation between evaluators in the amount of critical feedback they provide
} Critical and constructive feedback is rare in narrative comments, occurring less than 10% of the time
} Prompting attendings for resident areas of improvement/weaknesses does not guarantee the comment will address this request (still, less than 10% of these prompts were constructive)
} There is a significant positive relationship between faculty rank and the amount of constructive feedback given to residents. This is also true for years teaching.
What does this mean for your institution?
Current Directions – Feedback Model
[Diagram: Feedback loop between Evaluator and Student: data provided to evaluator → interpretation of data → providing feedback → student adapts behavior]
Current Directions – Feedback Model Breakdown
} Data Provided to Evaluator
− Inadequate observational data to make an accurate assessment
− Not a valid construct: the gradation of behaviors described does not exist
} Evaluator Data Interpretation
− Unable to differentiate levels of competence
− Incorrect reference points of competency
− Cognitive overload: unable to evaluate many competencies at once
} Evaluator Providing Feedback
− Lack of descriptive feedback: poor motivation, lacking incentive structure, time
− Lack of critical feedback: fear of retaliation, uncertainty, poor documentation, risk of hurting the relationship
Currently conducting resident focus groups, attending interviews, and online surveys to explore these hypotheses
Current Directions – A call for data collaborators
Measures of Resident Teaching Quality
} Student evaluations of teacher performance on rotations
} Student learning outcomes
} Observational Data of resident teachers
− Observation data by peers or supervisors
− Process-based analysis through electronic data
} Example: Assessing feedback quality
− Milestone score analysis: Evaluator mean score, competency variation, timing
− Narrative comment analysis: Positive/Negative, Specific
Resident Scores of Faculty Teaching
} Ceiling effect in teaching scores for all residency programs
} Across 5 programs from Institution A, 62% of faculty evaluations had no variance in scores
Program     Percent All Same
Neurology   56.3%
OBGYN       53.4%
Psychiatry  52.3%
Anesthesia  75.2%
Radiology   69.2%
[Figure: Feedback Rating Distribution – Institution A Neurology; x-axis: Constructive Feedback Rating (2–6), y-axis: # of Ratings (0–600)]
Individual Teacher Ratings
} Student evaluations of attending feedback do not adequately differentiate faculty performance
} Teacher ratings do not correlate strongly with process data
[Figure: Feedback Score by Attending – Neurology, Institution A; y-axis: Feedback Score (2–7), one point per de-identified Faculty ID]
Effect of Student Ratings on Teaching Behavior
} In many residency programs, teaching scores contribute to promotion and bonus decisions
} If teacher ratings are based solely on student evaluations, faculty may be incentivized to avoid giving negative, specific feedback
} Faculty may also be incentivized not to provide low milestone scores
} Automated process measures allow program directors to identify faculty that may need extra training on differentiating competencies and submitting feedback in a timely, informative manner
Current Directions – A call for data collaborators
Questions?
} Marty Keil: [email protected]
} Nancy Piro: [email protected]
} Ann Dohn: [email protected]