
Advances in Health Sciences Education 9: 83–92, 2004. © 2004 Kluwer Academic Publishers. Printed in the Netherlands.


Psychometric Structure of a Comprehensive Objective Structured Clinical Examination: A Factor Analytic Approach

KEVIN VOLKAN1,∗, STEVEN R. SIMON2, HARLEY BAKER1 and I. DAVID TODRES2

1Program in Psychology, Rm 206, Prof. Building, California State University Channel Islands, Camarillo, CA 93012, USA; 2Harvard Medical School (∗author for correspondence; Phone: 805 437 8867; Fax: 805 437 8864; E-mail: kevin.volkan@csuci.edu)

Abstract. Problem Statement and Background: While the psychometric properties of Objective Structured Clinical Examinations (OSCEs) have been studied, their latent structures have not been well characterized. This study examines a factor analytic model of a comprehensive OSCE and addresses implications for measurement of clinical performance. Methods: An exploratory maximum likelihood factor analysis with a Promax rotation was used to derive latent structures for the OSCE. Results: A model with two correlated factors fit the data well. The first factor, related to Physical Examination and History-Taking, was labeled information gathering; the second factor, related to Differential Diagnosis/Clinical Reasoning and Patient Interaction, was labeled reasoning and information dissemination. Case Management did not contribute to either factor. The factors accounted for a total of 61.6% of the variance in the skills variables. Conclusions: Recognizing the psychometric components of OSCEs may support and enhance the use of OSCEs for measuring clinical competency of medical students.

Key words: clinical competency, clinical performance, factor analysis, medical students, Objective Structured Clinical Exam, psychometrics, standardized patients

Objective Structured Clinical Examinations (OSCEs) are often instituted as a comprehensive test to ensure minimum competencies among medical school graduates in both the application of knowledge and performance of diagnostic and therapeutic maneuvers in clinical settings (Epstein & Hundert, 2002). OSCEs have the ability to measure knowledge and skills necessary for competent clinical practice (Hamann et al., 2002). However, how well OSCEs function in this regard depends on a number of issues, which are in turn related to their psychometric structure. For example, there are differences of opinion about whether trained faculty or standardized patients should rate students (Martin et al., 1996), whether global ratings are more valid and useful than checklists (Cohen et al., 1991; Hodges et al., 1999; Swartz et al., 1999), and how to define pass/fail criteria for the OSCE (Boulet et al., 2002; Cusimano, Rothman & Keystone, 1998; Tamblyn et al., 1998).


While most OSCEs rely on face validity and acceptable reliabilities, the psychometric structure of this type of examination is often not well understood (Newble & Swanson, 1988; Matsell, Wolfish & Hsu, 1991; Hull et al., 1993; Hodges et al., 1996; Regehr et al., 1998). For instance, while some studies have identified OSCEs using global rating scales as having higher reliabilities than OSCEs using checklists, other studies have shown that this is not always the case (Regehr et al., 1998; Hamann et al., 2002).

Well-conducted studies have examined the predictive and criterion-related validity of OSCEs, but few studies have analyzed the internal latent psychometric structure of OSCEs or related performance examinations (Collins & Gamble, 1996; Kwolek et al., 1997; Robins et al., 2001). Consequently, the internal latent measurement structures of OSCEs are largely unknown. Therefore, we undertook the present study to analyze the underlying internal structure of an OSCE given at the end of the third year of medical school to ascertain how well the latent structures characterize the measured clinical skills.

Background

OSCEs are usually multi-stationed clinical examinations in which students rotate through stations of a set duration, performing specified clinically related tasks. Although some stations may use paper cases, computer simulation, or other inanimate media, stations using standardized patients (SPs) are usually central to OSCEs. SPs are non-physicians trained to act as patients with specific complaints, to simulate some physical findings, and to behave as a real patient might in response to an examinee's behavior. At each station, examinees are asked to perform certain clinical tasks and are scored using carefully designed checklists or global rating scales to evaluate their performance of those tasks. OSCEs standardize clinical assessment of students by ensuring that all examinees encounter the same problems and are asked to do exactly the same tasks, under controlled conditions and close observation (Carraccio & Englander, 2000; Harden & Gleeson, 1979; Harden et al., 1975; van der Vleuten & Swanson, 1990).

Stations are usually chosen to represent a broad clinical base, with standardized patient cases representing common or educationally important clinical problems. OSCEs also are usually designed to test at least five broad skill areas across the stations. These skills typically include History-Taking; Physical Examination; Differential Diagnosis/Clinical Reasoning; Patient Interaction; and Case Management (Matsell, Wolfish & Hsu, 1991; Hamann et al., 2002; van der Vleuten & Newble, 1995). Little is known about the psychometric structure of OSCEs as defined by latent variables derived from the measurement of clinical skills. This psychometric structure could provide useful information about the internal validity of the OSCE.

The Harvard Medical School Comprehensive Examination was designed as an evaluation of clinical and basic science knowledge and its application in clinical settings. The exam has been described in detail elsewhere (Todres et al., 2000). This comprehensive OSCE is also used to help medical educators identify relationships between the curriculum and student skills, especially during clinical clerkships (Morag et al., 2001). Students generally take the exam in the summer between their third and fourth years of medical school. Each student receives detailed information on his or her performance in each of the exam content area stations. Students receive information on their performance in four skill areas which are applicable across various content areas of the exam. A unique aspect of the Harvard exam is the presence of a faculty preceptor in each of the stations. These preceptors score each student at each station using a detailed checklist according to whether or not they observed the student performing a certain task or behavior, or whether a student evidences specific knowledge or reasoning through an interaction with the preceptor. At the end of each station the faculty preceptors (and when applicable the SPs) also give a few minutes of feedback to each student on his or her performance. All faculty serving as preceptors for the exam undergo training in how to score the checklists in their station and in how to give effective constructive feedback to students. Checklists are used for scoring because of their simplicity and because of concerns that faculty preceptors would not have time to properly use global rating scales during the 20 minutes allotted to each station. Surveys of the students post exam indicate that they enjoyed the exam experience, especially receiving feedback on their performance directly from the faculty (Todres et al., 2000).

Methods

SAMPLE

One hundred sixty-nine medical students at Harvard Medical School participated in a comprehensive OSCE at the end of the third year of medical school during the summer and fall of 1999. All students in the third year were required to take the exam.

DATA COLLECTION

One hundred twenty-nine faculty members served as examiners. The examination had nine stations: obstetrics/gynecology, pediatrics, breast exam, pathology, psychiatry, asthma, radiology, cardiology and neurology. The number of items in each station ranged from 18 to 53, for a total of 361 items. Items were carefully crafted through consensus among content experts and pilot testing so that each item represented one concept or unit of behavior. Skill items were distributed across stations as appropriate to the content. Sample checklist items are shown in Table I. Items were scored dichotomously according to whether a specific behavior was performed or whether a student exhibited specific knowledge in an area. Item distribution across stations is shown in Table II.


Table I. Example OSCE items, by skills variable

History-Taking
1. Psychiatry: Student asks if any family history of mental illness/problems
2. Neurology: Student asks if patient has any visual disturbances

Physical Exam
1. Pediatrics: Student examines eyes: conjunctivae and movement
2. Breast: Student examines under BOTH arms for lymph nodes

Differential Diagnosis/Clinical Reasoning
1. Neurology: Student rules out headache due to mass lesion – hemorrhage
2. Psychiatry: Primary diagnosis includes major depressive disorder

Patient Interaction
1. Breast: Student reassures patient appropriately
2. Cardiology: Student asks patient to follow up over phone next morning

Case Management
1. Pathology: Student explains incorrect options for treatment of pneumonia patient
2. Cardiology: Student would administer sub-lingual nitroglycerin

ANALYSES

Items were combined across stations to create scales representing performance in different knowledge and skills areas. Five broad skill scales derived from items across the stations were constructed. These skill areas were History-Taking (109 items); Physical Examination (67 items); Differential Diagnosis/Clinical Reasoning (121 items); Patient Interaction (32 items); and Case Management (32 items). Cronbach's alpha reliability coefficient was calculated for each scale and for the overall examination. This reliability coefficient is the same as the Kuder-Richardson 20 coefficient for binary items and has been shown to be robust for OSCEs that use large numbers of raters (Hamann et al., 2002).
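To make this reliability computation concrete, a minimal sketch in Python is given below (the original analyses were run in SPSS). The matrix of dichotomous checklist scores and its dimensions are hypothetical stand-ins, not the study data.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha; identical to Kuder-Richardson 20 when items are scored 0/1."""
    items = np.asarray(items, dtype=float)       # examinees x items matrix
    k = items.shape[1]                           # number of items in the scale
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of examinees' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical example: 169 examinees by 109 dichotomous History-Taking items.
rng = np.random.default_rng(1)
history_items = rng.integers(0, 2, size=(169, 109))
print(round(cronbach_alpha(history_items), 2))
```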

Factor analysis was used to derive factors that fit the observed data. Factor analysis examines a group of variables for structural relationships and then checks whether this structure can be explained by underlying or latent variables called factors. In psychometric terms, factor analysis seeks to uncover the hidden structure of a test by elucidating latent factors, which underlie a larger number of variables that are directly observed. This process works conceptually in the same way a physician deduces an underlying disease from a group of symptoms. In this way factor analysis can check assumptions about what an exam is measuring as well as reduce the number of variables for analysis to a more manageable size.
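For reference, the common factor model underlying this kind of analysis can be written in standard matrix notation (a textbook formulation rather than an equation taken from the original article) as

\[
\mathbf{x} = \boldsymbol{\mu} + \Lambda \mathbf{f} + \boldsymbol{\varepsilon},
\qquad
\operatorname{Cov}(\mathbf{x}) = \Lambda \Phi \Lambda^{\top} + \Psi,
\]

where \(\Lambda\) contains the loadings of the observed skills scales on the latent factors \(\mathbf{f}\), \(\Phi\) is the factor correlation matrix (the identity matrix when the rotation is orthogonal), and \(\Psi\) is the diagonal matrix of unique variances.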

Maximum Likelihood was used for extraction because reliability coefficients in many OSCEs show considerable variation and unevenness in magnitude. Such variation is important because the magnitude of the reliability constrains the size of the correlations among the scales (Pedhazur & Schmelkin, 1991). Such constraints limit the magnitude of the communalities (Mulaik, 1972). Therefore, maximum likelihood is the preferred method of extraction because it is less affected in this situation than are other methods (Lawley, 1971; Mulaik, 1972). For instance, principal components analysis assumes the empirical correlations are accurate. If these correlations are artificially constrained, principal components analysis fails to represent accurately the underlying dimensionality.

Promax rotation was used because it provides the most parsimonious simple structure solution. In Promax, factors are free to correlate or not, in order to arrive at the best fit to the data. Promax begins with an orthogonal (e.g., Varimax) rotation and then tries to improve the goodness of fit. If an orthogonal solution is truly appropriate, Promax will retain this solution (Hendrikson & White, 1964). All analyses were performed using SPSS software (SPSS, 1999).
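For readers who wish to reproduce this kind of extraction and rotation outside SPSS, a minimal sketch in Python is shown below. It assumes the third-party factor_analyzer package, and the file name osce_skill_scales.csv (one column per skills scale, one row per examinee) is a hypothetical placeholder.

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

# Hypothetical examinees-by-scales data: one row per student, one column per skills scale.
skills = pd.read_csv("osce_skill_scales.csv")

# Maximum likelihood extraction of two factors followed by an oblique Promax rotation.
fa = FactorAnalyzer(n_factors=2, method="ml", rotation="promax")
fa.fit(skills)

pattern = pd.DataFrame(fa.loadings_, index=skills.columns,
                       columns=["Factor 1", "Factor 2"])  # pattern matrix loadings
print(pattern.round(2))
print(fa.phi_)                   # factor correlation matrix (available for oblique rotations)
print(fa.get_factor_variance())  # variance, proportion, and cumulative proportion explained
```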

Results

The nine station scales on the Comprehensive OSCE generally displayed acceptable reliabilities given the multiple sources of error variance for this type of exam. Cronbach's alpha reliability coefficients for each station were as follows: obstetrics/gynecology, 0.66; pediatrics, 0.79; breast exam, 0.84; pathology, 0.61; psychiatry, 0.72; asthma, 0.75; radiology, 0.85; cardiology, 0.78; and neurology, 0.73. With one notable exception, the skills scales showed moderate variability in reliabilities: History-Taking, 0.62; Physical Examination, 0.80; Differential Diagnosis/Clinical Reasoning, 0.79; Patient Interaction, 0.62; and Case Management and Treatment, 0.45. These coefficients are shown in Table II along with skills scale means, standard deviations and correlations between skills scales.

Anecdotal evidence for the validity of the Comprehensive OSCE was seen among students who failed the exam. For the 1999 administration of the Comprehensive OSCE, the five students who failed the exam (as defined by scoring two standard deviations below the overall mean) were also concurrently under review by the medical school's student promotion and review committee for poor coursework. Another study, comparing the relationship of student performance in the radiology clerkship to radiology station scores on the comprehensive OSCE, gave better evidence of the predictive validity of the OSCE. This study found a linear relationship of clerkship grades with radiology station OSCE scores, and a dose-response-type relationship between time since completing the radiology clerkship and radiology station OSCE scores. This latter non-linear relationship indicated that student radiological knowledge increased until about eight months after the radiology clerkship, when it began to decline and level off (Morag et al., 2001).

Factor Model

Prior to factoring the five skill scales, the Kaiser–Meyer–Olkin (Kaiser, 1974) measure of sampling adequacy (MSA) and the Bartlett (1950) test of sphericity were calculated. These tests help to assess the degree to which a correlation matrix is suitable for factor analytic exploration (Dziuban & Shirkey, 1974). The MSA was 0.68, only slightly lower than the recommended value of 0.70 (Kaiser, 1974).


Table II. Skill and station characteristics (number of items, scale characteristics, and correlations between skills scales)

History-Taking: 109 items; alpha 0.62; mean 60.68; SD 6.23; correlations with HX 1, PE 0.36, DX 0.18, PI 0.18, CM 0.34
Physical Exam: 67 items; alpha 0.80; mean 64.72; SD 11.25; correlations with HX 0.36, PE 1, DX 0.17, PI 0.24, CM 0.23
Dif. Dx/Clinical Reasoning: 121 items; alpha 0.79; mean 63.11; SD 8.37; correlations with HX 0.18, PE 0.17, DX 1, PI 0.42, CM 0.26
Patient Interaction: 32 items; alpha 0.62; mean 64.36; SD 11.29; correlations with HX 0.18, PE 0.24, DX 0.42, PI 1, CM 0.30
Case Management: 32 items; alpha 0.45; mean 69.18; SD 9.31; correlations with HX 0.34, PE 0.23, DX 0.42, PI 0.30, CM 1
All Items: 361 items; alpha 0.86; mean 63.32; SD 5.62

Station alphas: obstetrics/gynecology, 0.66; pediatrics, 0.79; breast exam, 0.84; pathology, 0.61; psychiatry, 0.72; asthma, 0.75; radiology, 0.85; cardiology, 0.78; neurology, 0.73

(HX = History-Taking; PE = Physical Exam; DX = Dif. Dx/Clinical Reasoning; PI = Patient Interaction; CM = Case Management. The original table also reports the number of items each station contributes to each skills scale.)


Table III. Factor analysis results

Loadings

Skills variables Factor 1 Factor 2

History-Taking 0.87 −0.08

Physical Exam 0.38 0.16

Dif. Dx/Clinical Reasoning 0.05 0.53

Patient Interaction −0.02 0.78

Case Management 0.31 0.29

Factor % Variance Explained 41.4 20.2

Total % Variance Explained 61.6

The Bartlett test of sphericity yielded a highly significant result (p < 0.00001). Thus, both the MSA and the Bartlett test indicated factor analysis was appropriate for these data.
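These suitability checks can be computed outside SPSS as well; the short sketch below again assumes the third-party factor_analyzer package and the hypothetical osce_skill_scales.csv file described in the Methods sketch.

```python
import pandas as pd
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

skills = pd.read_csv("osce_skill_scales.csv")   # hypothetical examinees-by-scales data

chi_square, p_value = calculate_bartlett_sphericity(skills)  # Bartlett test of sphericity
kmo_per_scale, kmo_overall = calculate_kmo(skills)           # overall KMO is the MSA
print(f"Bartlett chi-square = {chi_square:.1f}, p = {p_value:.5g}")
print(f"MSA (overall KMO) = {kmo_overall:.2f}")
```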

Maximum likelihood extracted and retained two statistically significant factors (p < 0.05) that accounted for 61.6% of the variance (λ = 2.07, 1.01). Both the Scree Test (Cattell, 1966) and an analysis of the residual correlation matrix confirmed the adequacy of the two-factor solution. The pattern matrix factor loadings were used for interpretation, as these loadings function similarly to regression weights of the factors to the observed skills variables (Hutcheson, 1999; Norman & Streiner, 1998). Tabachnick & Fidell (2001, p. 625) agree with other statisticians (e.g., Comrey & Lee, 1992) that “only variables with loadings of 0.32 and above are interpreted”. As such, History (0.87) and Physical Examination (0.38) loaded on the first factor, which can be labeled as representing information gathering. Similarly, Patient Interaction (0.78) and Differential Diagnosis/Clinical Reasoning (0.53) loaded on the second factor, which can be labeled as reasoning and dissemination. Case Management failed to load on either factor 1 (0.31) or factor 2 (0.29). The results of the factor analysis are listed in Table III. The two factors were moderately correlated (r = 0.39, p < 0.001).
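Applying that 0.32 interpretation threshold to the pattern loadings reported in Table III can be scripted directly; the snippet below is a small illustration using the published values, not a re-analysis of the raw data.

```python
import pandas as pd

# Pattern matrix loadings as reported in Table III.
loadings = pd.DataFrame(
    {"Factor 1": [0.87, 0.38, 0.05, -0.02, 0.31],
     "Factor 2": [-0.08, 0.16, 0.53, 0.78, 0.29]},
    index=["History-Taking", "Physical Exam", "Dif. Dx/Clinical Reasoning",
           "Patient Interaction", "Case Management"])

# Keep only loadings at or above 0.32 in absolute value; the rest become NaN.
salient = loadings.where(loadings.abs() >= 0.32)
print(salient)
# Case Management shows no loading at or above the threshold on either factor.
```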

Discussion

This study used factor analysis to show that the internal structure of a comprehensive OSCE consists of two underlying constructs: one related to information gathering and another related to reasoning and dissemination. These two constructs were found to be moderately correlated with each other, as expected. Case Management was at best weakly related to both factors.

Physical examination and history-taking may load together on the first factor as each of these two skills represents an information gathering activity. In addition, both of these skill sets are taught concurrently beginning in the early days of medical school, and indeed both students and their faculty teachers tend to aggregate the two skills together in various settings, as in “doing a complete H and P.” The second factor seems more related to reasoning and dissemination processes. Differential Diagnosis/Clinical Reasoning represents the processes by which students create and test alternate hypotheses related to a clinical scenario. The Patient Interaction component of this variable could be related to further testing of initial hypotheses through querying and/or communicating the hypotheses to the patient. The loading of Differential Diagnosis/Clinical Reasoning with the Patient Interaction scale also reflects the way in which the exam was conducted. In those stations that contained items for both of these skills areas, students would discuss their diagnosis with the faculty preceptor and then subsequently discuss this diagnosis with the patient. Both of these tasks involved the student disseminating information – first to the examiner and then to the SP.

Unlike history-taking and physical examination, the reasoning and dissemination processes related to clinical competency tend to be emphasized later in medical school, after the rudiments of the basic sciences, history-taking and physical examination have been mastered. Some prior evidence suggests students learn skills such as interviewing and physical examination well (Engler et al., 1981). However, it appears that these skills can decline when students concentrate on more problem-solving skills related to reasoning out a diagnosis, and communicating their findings to medical colleagues and patients later on in their training (Engler et al., 1981; Kraan et al., 1990). Both types of skills are important for medical school graduates. However, medical school faculty members may inappropriately assume that the skills developed earlier remain firmly in place as students begin to do more advanced training. Deficiencies in basic skills can go unnoticed because students are infrequently observed in their performance of these skills late in their training (Lane & Gottlieb, 2000). Therefore, the ability to measure both information gathering as well as reasoning and dissemination skills is important for a comprehensive OSCE given near the end of a student's undergraduate medical education.

The Case Management scale is more problematic with regard to these two factors. Its weak relationship, along with its low reliability, indicates that this scale needs more revision before it can be relied upon to deliver useful information.

While standardized, multi-station examinations like OSCEs have many advantages over traditional oral or written exams, they can still contain sizable error and should therefore be used in the context of other indicators of student performance (van der Vleuten, 2000). Nevertheless, OSCEs can serve as an important measure of competence in clinical observation and reasoning. Further understanding of the relationship between the dimensions of OSCEs will provide insight into how students develop clinical abilities and will guide the future design of these exams.


References

Bartlett, M.S. (1950). Tests of significance in factor analysis. British Journal of Psychology 3: 77–85.

Boulet, J.R., McKinley, D.W., Norcini, J.J. & Whelan, G.P. (2002). Assessing the comparability of standardized patient and physician evaluations of clinical skills. Advances in Health Sciences Education 7: 85–97.

Carraccio, C. & Englander, R. (2000). The objective structured clinical examination: a step in the direction of competency-based evaluation. Archives of Pediatrics and Adolescent Medicine 154(7): 736–741.

Cattell, R.B. (1966). The Scree test for the number of factors. Multivariate Behavior Research 1: 245–276.

Cohen, R., Rothman, A.I., Poldre, P. & Ross, J. (1991). Validity and generalizability of global ratings in an objective structured clinical examination. Academic Medicine 66: 545–548.

Collins, J.P. & Gamble, G.D. (1996). A multi-format interdisciplinary final examination. Medical Education 30(4): 259–265.

Comrey, A.L. & Lee, H.B. (1992). A First Course in Factor Analysis, 2nd edn. Hillsdale, NJ: Lawrence Erlbaum Associates.

Cusimano, M.D., Rothman, A. & Keystone, J. (1998). Defining standards of competent performance on an OSCE. Academic Medicine 73: S112–S113.

Dziuban, C.D. & Shirkey, E.C. (1974). When is a correlation matrix appropriate for factor analysis? Some decision rules. Psychological Bulletin 81: 358–361.

Engler, C.M., Saltzman, G.A., Walker, M.L. & Wolf, F.M. (1981). Medical student acquisition and retention of communication and interviewing skills. Journal of Medical Education 56(7): 572–579.

Epstein, R.M. & Hundert, E.M. (2002). Defining and assessing professional competence. Journal of the American Medical Association 278(2): 243–244.

Hamann, C., Volkan, K., Fishman, M.B., Silvestri, R.C., Simon, S. & Fletcher, S.W. (2002). How well do second-year students learn physical diagnosis: Results from an objective structured clinical examination. BMC Medical Education 2(1): 1–11.

Harden, R.M. & Gleeson, F.A. (1979). Assessment of clinical competence using an objective structured clinical examination (OSCE). Medical Education 13(1): 41–54.

Harden, R.M., Stevenson, M., Downie, W.W. & Wilson, G.M. (1975). Assessment of clinical competence using objective structured examination. British Medical Journal 1(5955): 447–451.

Hendrikson, A.E. & White, P.O. (1964). PROMAX: A quick method for rotation to oblique simple structure. The British Journal of Mathematical and Statistical Psychology 17: 65–70.

Hodges, B., Regehr, G., Hanson, M. & McNaughton, N. (1998). Validation of an objective structured clinical examination in psychiatry. Academic Medicine 73: 910–912.

Hodges, B., Regehr, G., McNaughton, N., Tiberius, R. & Hanson, M. (1999). OSCE checklists do not capture increasing levels of expertise. Academic Medicine 74: 1129–1134.

Hull, A.L., Hodder, S., Berger, B., Ginsberg, D., Lindheim, N., Quan, J. & Kleinhenz, M.E. (1995). Validity of three clinical performance assessments of internal medicine clerks. Academic Medicine 70(6): 517–522.

Hutcheson, G. & Sofroniou, N. (1999). The Multivariate Social Scientist: Introductory Statistics Using Generalized Linear Models. Thousand Oaks, CA: Sage Publications.

Kaiser, H.F. (1974). An index of factorial simplicity. Psychometrika 39: 31–36.

Kraan, H.F., Crijnen, A.A., de Vries, M.W., Zuidweg, J., Imbos, T. & Van der Vleuten, C.P. (1990). To what extent are medical interviewing skills teachable? Medical Teacher 12(3–4): 315–328.

Kwolek, C.J., Donnelly, M.B., Sloan, D.A., Birrell, S.N., Strodel, W.E. & Schwartz, R.W. (1997). Ward evaluations: Should they be abandoned? Journal of Surgical Research 69(1): 1–6.

Lane, J.L. & Gottlieb, R.P. (2000). Structured clinical observations: A method to teach clinical skills with limited time and financial resources. Pediatrics 105(4 Pt 2): 973–977.

Lawley, D.N. (1971). Factor Analysis as a Statistical Method, 2nd edn. London: Butterworths Press.

Martin, J.A., Reznick, R.K., Rothman, A., Tamblyn, R.M. & Regehr, G. (1996). Academic Medicine 71(2): 170–175.

Matsell, D.G., Wolfish, N.M. & Hsu, E. (1991). Reliability and validity of the objective structured clinical examination in pediatrics. Medical Education 25: 293–299.

Morag, E.L.G., Volkan, K., Shaffer, K., Novelline, R. & Lang, E. (2001). Clinical competence assessment in radiology: Introduction of an objective structured clinical examination in the medical school curriculum. Academic Radiology 8: 74–81.

Mulaik, S.A. (1972). The Foundations of Factor Analysis. New York: McGraw-Hill.

Newble, D. & Swanson, D. (1988). Psychometric characteristics of the objective structured clinical examination. Medical Education 22: 325–334.

Norman, G. & Streiner, D.L. (1998). Biostatistics: The Bare Essentials. Hamilton & London: BC Decker.

Pedhazur, E.J. & Schmelkin, L.P. (1991). Measurement, Design, and Analysis: An Integrated Approach. Hillsdale, NJ: Lawrence Erlbaum.

Regehr, G., MacRae, H., Reznick, R.K. & Szalay, D. (1998). Comparing the psychometric properties of checklists and global rating scales for assessing performance on an OSCE-format examination. Academic Medicine 73: 993–997.

Robins, L.S., White, C.B., Alexander, G.L., Gruppen, L.D. & Grum, C.M. (2001). Assessing medical students’ awareness of and sensitivity to diverse health beliefs using a standardized patient station. Academic Medicine 76(1): 76–80.

SPSS (1999). Statistical Package for the Social Sciences (Version 9). Chicago, IL: SPSS, Inc.

Swartz, M.H., Colliver, J.A., Bardes, C.L., Charon, R., Fried, E.D. & Moroff, S. (1999). Global ratings of videotaped performance versus global ratings of actions recorded on checklists: a criterion for performance assessment with standardized patients. Academic Medicine 74: 1028–1032.

Tabachnick, B.G. & Fidell, L.S. (2001). Using Multivariate Statistics, 4th edn. Boston, MA: Allyn & Bacon.

Tamblyn, R., Abrahamowicz, M., Brailovsky, C., Grand’Maison, P., Lescop, J., Norcini, J., Girard, N. & Haggerty, J. (1998). Journal of the American Medical Association 280(11): 989–996.

Todres, D., Volkan, K., Newell, A., Hinrichs, P. & Arky, R. (2000). HMS institutes a fourth year comprehensive examination. Medical Education News 5(2): 1–13.

van der Vleuten, C. & Swanson, D. (1990). Assessment of clinical skills with Standardized Patients: State of the art. Teaching and Learning in Medicine 2(2): 58–76.

van der Vleuten, C. & Newble, D. (1995). How can we test clinical reasoning? The Lancet 345(8956): 1032–1034.

van der Vleuten, C. (2000). Validity of final examinations in undergraduate medical training. British Medical Journal 321(7270): 1217–1219.