

MEASUREMENT EQUIVALENCE AND MULTISOURCE RATINGS FOR NON-MANAGERIAL POSITIONS: RECOMMENDATIONS FOR RESEARCH AND PRACTICE

James M. Diefendorff
Louisiana State University

Stanley B. Silverman
University of Akron

Gary J. Greguras
Singapore Management University

ABSTRACT: The present investigation applies a comprehensive sequence of confirmatory factor analysis tests (Vandenberg and Lance, Organizational Research Methods, 3, 4–69, 2000) to the examination of the measurement equivalence of self, peer, and supervisor ratings of non-managerial targets across several performance dimensions. Results indicate a high degree of measurement equivalence across rater sources and performance dimensions. The paper illustrates how this procedure can identify very specific areas of non-equivalence and how the complexity of a multisource feedback system may be represented using such procedures. Implications of these results and recommendations for both research and practice are offered.

KEY WORDS: 360 degree feedback; measurement equivalence; non-managerial ratings

Organizations often implement multisource feedback systems as a tool to develop their employees (Church & Bracken, 1997; Yammarino & Atwater, 1997). In fact, multisource feedback systems are so popular that almost all Fortune 500 companies use this approach to assess managerial performance (Cheung, 1999). Feedback from multiple sources is assumed to provide a clearer and more comprehensive picture of a person's strengths and weaknesses than feedback from only one's supervisor (London & Smither, 1995).

Address correspondence to James M. Diefendorff, The Business School, University of Colorado at Denver, Campus Box 165, P.O. Box 173364, Denver, CO 80217-3364. E-mail: [email protected].

Journal of Business and Psychology, Vol. 19, No. 3, Spring 2005 (© 2005). DOI: 10.1007/s10869-004-2235-x


This comprehensive feedback is assumed to improve recommendations for individual development and to enhance one's self-awareness and job performance (Tornow, 1993). Ratees in multisource feedback systems typically are managers who are evaluated by some combination of potential rater sources (e.g., supervisors, subordinates, and peers). Recognizing the potential advantages of collecting performance information from multiple rater sources, organizations have more recently begun using such systems to evaluate non-managerial employees.

Despite the popularity of multisource feedback systems, many of the assumptions underlying the advantages and practical uses of these systems remain untested (Church & Bracken, 1997). One often overlooked assumption is that the measurement instrument means the same thing to all raters and functions the same across rater sources (Cheung, 1999). However, if the rating instrument is not equivalent across rater sources, substantive interpretations of the ratings and practical recommendations may be inaccurate or misleading. That is, violating the (often untested) assumption of measurement equivalence can render comparisons across sources meaningless and likely compromises the utility of using the ratings to create personal development plans or to make administrative decisions. As Cheung (1999) noted, multisource feedback systems typically involve differences among raters, and correctly identifying and understanding these differences is critical for the effective use of the performance information.

The current paper has two main objectives. First, a comprehensive data analytic technique for assessing the equivalence of multisource ratings, as suggested by Vandenberg and Lance (2000), is discussed and illustrated. Whereas the vast majority of existing research on measurement equivalence has assessed only conceptual equivalence, the current study describes a series of tests assessing both conceptual and psychometric equivalence. The establishment of measurement equivalence of an instrument across different groups is a prerequisite to making meaningful comparisons between groups (Drasgow & Kanfer, 1985; Reise, Widaman, & Pugh, 1993). In the absence of measurement equivalence, substantive interpretations may be incorrect or misleading (Maurer, Raju, & Collins, 1998; Vandenberg & Lance, 2000). As such, the establishment of the measurement equivalence of instruments has important implications for both theory development and practical interventions.

The second purpose of this paper is to apply this data analytic technique to assess the measurement equivalence of multisource feedback ratings across rater sources and performance dimensions for non-managerial employees. Although several studies have partially assessed the measurement equivalence of multisource ratings, the ratees in these studies have primarily been managers.


Assessing the measurement equivalence of ratings of non-managerial targets across rater sources represents the next logical step in this line of research and is important because organizations are increasingly using multisource ratings to evaluate and develop non-managerial employees. Establishing the measurement equivalence of multisource ratings also matters because multisource feedback reports often are designed to compare and contrast ratings across rater sources (Dalessio & Vasilopoulos, 2001; London & Smither, 1995). Discrepancies between rater sources are often highlighted in feedback meetings, and the target is encouraged to reflect upon these rater source differences. However, if the measurement equivalence of the multisource instrument across rater sources has not been established, the observed differences may reflect non-equivalence between sources rather than true differences. Such misinterpretations could misguide individual development plans and organizational decisions.

MEASUREMENT EQUIVALENCE AND MULTISOURCE RATINGS

In discussing the measurement equivalence of performance ratings, Cheung (1999) identified two main categories: conceptual equivalence and psychometric equivalence. Conceptual equivalence indicates that the items on the feedback instrument have the same factor structure across different rater sources. That is, across sources there is the same number of underlying dimensions, the specific behaviors (represented as items) load on the same dimensions, and the item loadings are of roughly the same magnitude. Thus, the instrument and the underlying performance constructs conceptually mean the same thing across rater sources. Psychometric equivalence indicates that the instrument not only has the same factor structure, but is responded to in the same manner by the different rater sources. That is, across rater sources the items and scales exhibit the same degree of reliability, variance, range of ratings, mean level of ratings, and intercorrelations among dimensions. A lack of psychometric equivalence may indicate one of several rating biases (e.g., halo, severity), depending on where the non-equivalence lies. Cheung demonstrated how confirmatory factor analysis (CFA) procedures can be used to identify various types of conceptual and psychometric equivalence between self and manager ratings. In addition, he showed how CFA is superior to other methods in identifying non-equivalence.

Several studies have used CFA procedures to assess the conceptual equivalence of multisource ratings across sources. For example, Lance and Bennett (1997) analyzed self, supervisor, and peer ratings of eight samples of US Air Force airmen on one general performance factor (labeled Interpersonal Proficiency). In 5 of the 8 samples, results indicated that the different rater sources held different conceptualizations of performance (i.e., conceptual non-equivalence).


As such, these authors suggested that many of the observed differences between rater sources (e.g., differences in mean rating level) may be the result of rater source differences in their conceptualizations of performance. Interestingly, these results contrast with a study by Woehr, Sheehan, and Bennett (1999), who also analyzed Air Force airmen ratings from the same large US Air Force Job Performance Measurement (JPM) project. Rather than investigating the overall performance dimension (i.e., Interpersonal Proficiency) as suggested and supported by Lance and Bennett (1997), Woehr et al. (1999) analyzed self, supervisor, and peer ratings for the eight performance dimensions separately. With this sample, Woehr et al. (1999) found that the different rater sources were relatively conceptually equivalent.

Aside from the conflicting results from the two military studies described above, the remaining research on the measurement equivalence of multisource ratings generally has found that ratings across rater sources and performance dimensions are conceptually equivalent. For example, Cheung (1999) investigated self and supervisor ratings of 332 mid-level executives on two broad performance dimensions (labeled internal and external roles). Results indicated that the ratings of the mid-level executives and their managers were conceptually equivalent. Likewise, Maurer et al. (1998) found that peer and subordinate ratings of managers on a team-building dimension were conceptually equivalent (the Maurer et al. study did not test for psychometric invariance). Finally, in the most comprehensive study of the measurement equivalence of multisource ratings to date, Facteau and Craig (2001) analyzed self, supervisor, subordinate, and peer ratings of a managerial sample on seven performance dimensions. Results indicated that ratings from these various rater sources across the seven performance dimensions were conceptually invariant (with the exception of one error covariance in the self and subordinate groups). Taken together, the existing literature on the measurement equivalence of multisource ratings (with the exception of the study by Lance & Bennett, 1997) has indicated that ratings from different sources across various performance dimensions are conceptually equivalent.

The current study differs from these existing studies in several important ways. The current study applies a more comprehensive set of nested models to test for both conceptual and psychometric equivalence than do previous studies. For example, the Maurer et al. (1998) and Facteau and Craig (2001) studies assessed only conceptual equivalence. The current study also differs from previous studies by conducting several tests for partial invariance and by illustrating how various potential sources of non-equivalence may be identified. This illustration and description can help researchers and practitioners pinpoint the source of non-equivalence, if observed.


Finally, the sample used in the current study is a salaried, non-managerial sample, whereas previous studies have investigated managerial or military samples. Although multisource feedback systems were originally, and are typically, designed to evaluate managerial performance, many organizations are beginning to implement such systems for non-managerial positions. As such, this is the first study to investigate the equivalence of multisource ratings in a civilian, non-managerial sample. This paper first discusses a sequence of eight nested models that could be used to test for all types of equivalence between groups. Following this discussion, the importance of assessing the measurement equivalence of supervisor, peer, and self-ratings of the non-managerial sample of the current study is highlighted.

MODELS FOR TESTING MEASUREMENT EQUIVALENCE

Vandenberg and Lance (2000) recently reviewed and integrated the literature on measurement equivalence in organizational research, calling for increased application of measurement equivalence techniques before substantive hypotheses are tested. In their review, they stated that "violations of measurement equivalence assumptions are as threatening to substantive interpretations as is an inability to demonstrate reliability and validity" (p. 6). That is, non-equivalence between groups indicates that the measure is not functioning the same across groups, and any substantive interpretation of differences (e.g., supervisor ratings of a dimension being higher than peer ratings) may be suspect. As Vandenberg and Lance (2000) note, because classical test theory (Crocker & Algina, 1986) cannot adequately identify measurement equivalence across populations, the recommended approach is to use CFA procedures in a series of hierarchically nested models. The primary advantage of CFA procedures over traditional approaches is that they account for measurement error (Bollen, 1989) in estimating group differences. The advantage of Vandenberg and Lance's (2000) approach over other approaches is that it represents the most comprehensive series of measurement equivalence tests in the literature and thus is best equipped to identify differences between groups.

Applying Vandenberg and Lance's (2000) approach, Table 1 presents the sequence of models proposed to test the measurement equivalence of a multisource feedback instrument, the specific nature of the constraint applied at each step, and the implications of rejecting the null hypothesis of no differences between rater sources. We adopt the terminology suggested by Vandenberg and Lance (2000) in identifying the various models and the constraints imposed.

The initial test in the sequence is that of the Equality of Covariance Matrices (Model 0), which assesses overall measurement equivalence between sources.


Table 1
The Sequence of Measurement Invariance Tests Used in the Present Investigation

Model 0. Equality of Covariance Matrices (Σ^g = Σ^g′)
  Constraint: Constrain everything to be equal across groups.
  Implication of rejecting H0: Indicates some form of non-equivalence between rater sources.

Model 1. Configural Invariance
  Constraint: Test for an equivalent factor structure across rater groups.
  Implication of rejecting H0: There is disagreement in the number or composition of factors across rater sources.

Model 2. Metric Invariance (Λ^g_x = Λ^g′_x)
  Constraint: Like items' factor loadings are constrained to be equal across sources.
  Implication of rejecting H0: There is disagreement over the pattern of factor loadings across rater sources; sources do not agree regarding the relative importance of behavioral indicators in defining the dimension.

Model 3. Scalar Invariance (τ^g_x = τ^g′_x)
  Constraint: Like items' intercepts are constrained across rater groups.
  Implication of rejecting H0: The item indicators have different mean levels across groups, suggesting the possibility of a response bias, such as leniency/severity for a rater source.

Model 4. Invariant Uniquenesses (Θ^g_δ = Θ^g′_δ)
  Constraint: Constrain uniquenesses across rater groups.
  Implication of rejecting H0: The rating instrument exhibits different levels of reliability for different rater sources (e.g., there are differences in the amount of error variance for each item across rater sources).

Model 5. Invariant Factor Variances (Φ^g_jj = Φ^g′_jj)
  Constraint: Factor variances are constrained to be equal across rater groups.
  Implication of rejecting H0: The ratings from different sources do not use an equivalent range of the scale; thus, non-equivalence could indicate range restriction for a rater source.

Model 6. Invariant Factor Covariances (Φ^g_jj′ = Φ^g′_jj′)
  Constraint: Factor covariances across dimensions are constrained to be equal across rater groups.
  Implication of rejecting H0: The relationships between constructs within rater sources differ across rater groups; this can reflect a halo error for sources with strong correlations between factors.

Model 7. Invariant Factor Means (κ^g = κ^g′)
  Constraint: Mean factor ratings are constrained to be equal across rater groups.
  Implication of rejecting H0: Different sources are rating the focal employee at different levels on the latent construct, suggesting the possibility of a leniency/severity bias.


In this model, everything is constrained to be equal across rater sources, so failure to reject the null hypothesis indicates complete measurement equivalence across the groups and no subsequent tests are required. This is the most restrictive test and is done first to identify whether any between-group differences exist. If the null hypothesis is rejected, subsequent tests are needed to identify the source(s) of the non-equivalence. Although Cheung (1999) did not perform this analysis, Vandenberg and Lance (2000) recommend it as a useful initial test.

The second test, referred to as Model 1 because it is the first test in a series of nested models in the invariance hierarchy, is that of Configural Invariance. Configural invariance is the least restrictive model, positing only an equivalent factor structure across rater sources. If the null hypothesis of no between-group differences is rejected, it is interpreted to mean that the rater sources disagree over the number or composition of factors contained in the instrument. This finding could be due to several issues, including rater sources possessing different understandings of performance, having access to different types of performance information (Campbell & Lee, 1988), or outright disagreeing about the ratee's job duties. This test is the first step in demonstrating the conceptual equivalence of the scale across rater sources. If the null hypothesis is rejected, further tests are not conducted because the underlying constructs are defined differently across rater sources.

Model 2 provides a test of Metric Invariance, where the factor loadings are constrained to be equal across rater groups. Cheung (1999) identified this as the second test for conceptual equivalence. At issue here is whether the strength of the relationship between specific behaviors (items) and the performance dimensions is the same for different rater sources. For example, configural invariance (Model 1) may be present such that two rater sources agree that a behavior (e.g., listens to others) is related to a performance dimension (e.g., interpersonal skills), but metric invariance (Model 2) may be absent because one source (peers) considers the behavior as more important for defining the dimension than another source (managers). Rejecting the null hypothesis at this step indicates that the item indicators load on the factors differently for different rater sources.

If the null hypothesis is not rejected for Model 1 or Model 2, then there is complete conceptual equivalence for the instrument. Some researchers suggest that conceptual equivalence is the primary requirement for using the instrument in different groups (Cheung & Rensvold, 1998; Reise et al., 1993; Ryan, Chan, Ployhart, & Slade, 1999). That is, although other sources of non-equivalence may be revealed in subsequent tests, having conceptual equivalence indicates that the measure can be compared and used across groups. The remaining models are categorized as tests of psychometric equivalence and reflect between-group differences that are often related to substantive hypotheses in multisource feedback research (e.g., mean differences, differences in range restriction).


A finding of non-equivalence on one of these remaining tests does not necessarily mean the instrument is inappropriate for use across sources.

Model 3 is the test of Scalar Invariance, where like items' intercepts for different sources are constrained to be equal. In a practical sense, this test examines whether the means of the item indicators are equivalent across rater groups (Bollen, 1989). Differences in intercepts may mean that, although the sources agree conceptually on the dimensions of the instrument, one source consistently rates the focal employees lower (severity bias) or higher (leniency bias) than other sources. The presence of mean differences between sources has been found in the literature, with supervisors rating more leniently than peers and focal employees rating more leniently than both peers and supervisors (Harris & Schaubroeck, 1988). Rejecting the null hypothesis at this step indicates that there may be a response bias (leniency/severity) at the item level for one or more rater sources (Taris, Bok, & Meijer, 1998). Although Cheung did not use this test, it could provide diagnostic evidence regarding the prevalence and consistency of bias in the instrument at the item level of analysis.

Model 4 is the test of Invariant Uniquenesses, where the item indicator unique variances are constrained to be equal across raters. The unique variance for each item is considered to reflect measurement error, so constraining these to be equal essentially tests whether the instrument has equivalent levels of reliability across sources. That is, because scale reliability decreases as random measurement error increases, this constraint assesses whether the scale has the same degree of reliability across sources. Rejecting the null hypothesis demonstrates that there are differences in the degree of error present in the measure across rater sources. Possible reasons for differences in measurement error between sources include unequal opportunities to observe performance (Rothstein, 1990), unfamiliarity with scale items, or inexperience with the rating format (Cheung, 1999).
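To make the link between uniquenesses and reliability concrete, under a standard congeneric measurement model (our notation for orientation, not reproduced from the article), the reliability of item i loading on factor j in group g can be written as

    \rho_{i}^{(g)} = \frac{\left(\lambda_{ij}^{(g)}\right)^{2}\,\phi_{jj}^{(g)}}{\left(\lambda_{ij}^{(g)}\right)^{2}\,\phi_{jj}^{(g)} + \theta_{ii}^{(g)}}

so with loadings (Model 2) and factor variances (Model 5) already constrained, equal uniquenesses θ_ii imply equal item reliabilities across rater sources.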

Model 5 is the test of Invariant Factor Variances, where the variances of the latent constructs are constrained to be equal across rating sources. Factor variances represent the degree of variability, or range, of the latent construct used by the rater sources. The question being addressed with this test is whether different rater sources use more or less of the possible range of the performance construct than other sources. Rejecting the null hypothesis suggests that one or more sources have a relatively restricted range in their ratings.

Model 6 is the test of Invariant Factor Covariances, where the relationships among the latent constructs within a rater source are constrained to be equal across rater groups.


Rejecting the null hypothesis indicates that the rater sources differ in the strength of the relationships among the latent factors, indicating a possible halo effect for one or more sources. That is, the relationships between the latent factors are different across rater sources, with some sources more strongly discriminating among the performance dimensions than others. There is some evidence that supervisor ratings exhibit greater halo than self-ratings (Holzbach, 1978), suggesting that individuals are better able to make distinctions among dimensions for their own performance than are observers.

Model 7 is the test of Invariant Factor Means, where the means of the latent constructs are constrained to be equal across groups. Rejection of the null hypothesis indicates that the rater sources are rating the focal employee at different levels on the latent construct. Similar to the test of Scalar Invariance, this test is a way to evaluate the presence of leniency/severity for a particular rater source, but at the construct level rather than the item level. Differences in mean ratings may be a more accurate indicator of any leniency or severity bias because idiosyncratic item effects between sources will likely cancel each other out.
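As a compact summary of where each constraint in Table 1 applies, the LISREL-type measurement model underlying Models 1–7 can be written in standard notation (a sketch for orientation rather than a reproduction of the authors' exact specification) as

    x^{(g)} = \tau^{(g)} + \Lambda^{(g)}\,\xi^{(g)} + \delta^{(g)}, \qquad
    \mathrm{Cov}\big(\xi^{(g)}\big) = \Phi^{(g)}, \quad
    \mathrm{Cov}\big(\delta^{(g)}\big) = \Theta_{\delta}^{(g)}, \quad
    \mathrm{E}\big(\xi^{(g)}\big) = \kappa^{(g)}

where g indexes the rater source. Model 2 equates Λ across groups, Model 3 equates τ, Model 4 equates Θ_δ, Models 5 and 6 equate the diagonal and off-diagonal elements of Φ, respectively, and Model 7 equates κ.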

In addition to these specific constraints applied between rater sources (Models 0–7), the size of the correlations between sources on a dimension (i.e., the level of agreement) can be investigated from the analysis of the covariance structure in the CFA procedures (Cheung, 1999). Consistent with other conceptualizations, low correlations indicate high disagreement (Harris & Schaubroeck, 1988). Again, the primary advantage of using the CFA results to estimate the correlations between sources on a dimension is that CFA controls for measurement error. Taken together, the series of hierarchical tests described above provides a comprehensive framework for examining the equivalence of ratings from multiple rater sources across multiple performance dimensions.
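The between-source correlations referred to here are simple rescalings of the estimated factor covariances; using the usual definition,

    \rho_{jk} = \frac{\phi_{jk}}{\sqrt{\phi_{jj}\,\phi_{kk}}}

where φ_jk is the estimated covariance between the latent dimensions rated by two sources and φ_jj and φ_kk are the corresponding factor variances. This is how the correlations above the diagonal of Table 5 (below) relate to the variances and covariances beneath it.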

Multisource Ratings of Non-managerial Jobs

As organizations continue their shift toward fewer levels of management (Whitaker, 1992), more emphasis will be placed on individual accountability, performance, and development (Murphy & Cleveland, 1995). As a way to measure and develop individual performance, organizations increasingly are using multisource ratings for both managerial and non-managerial employees. Although previous research generally has observed that ratings of managers are conceptually equivalent across rater sources, research has not investigated whether the same holds for non-managers. As discussed below, because of the differences between managerial and non-managerial positions, there are several reasons why results from managerial samples may not generalize to non-managerial samples.



First, employees at different organizational levels (e.g., managers and non-managers) often have different experiences with, and perceptions of, the organization in general, and of evaluation and feedback systems in particular (Ilgen & Feldman, 1983; Mount, 1983; Williams & Levy, 2000). Second, managers within an organization are more likely than non-managers to have participated in the development and implementation of new interventions (e.g., multisource feedback systems) and, therefore, are more likely to have a better understanding of their processes and procedures (Pooyan & Eberhardt, 1989; Williams & Levy, 2000). Third, managers are more likely to have received training with respect to appraisal systems than are non-managers (Williams & Levy, 2000). Finally, the nature of work for managers and non-managers often is quite different. That is, managerial work tends to be harder to observe and more discontinuous than non-managerial work (Borman & Brush, 1993), potentially making it more difficult to evaluate.

Consistent with these expectations, several empirical studies have found differences between managers' and non-managers' perceptions and use of appraisal systems. For example, Conway and Huffcutt (1997) found that supervisor and peer ratings were more reliable and contained more true score variance for non-managerial than for managerial jobs. Further, Conway and Huffcutt (1997) observed that correlations between supervisor, peer, and self-ratings were moderated by job type (i.e., managerial versus non-managerial) such that correlations between sources were higher for non-managerial jobs than for managerial jobs. In another study, Williams and Levy (2000) found that managers were more satisfied with their appraisals, perceived the procedures to be fairer, and had higher levels of perceived system knowledge than did non-managers. Additionally, Mount (1983) found that non-supervisory employees responded to performance appraisal systems in a more global way than did supervisors. Given that the requirements of managerial and non-managerial jobs differ, Williams and Levy (2000) called for research investigating the effects of these differences on important individual and organizational processes and outcomes. The current study begins to respond to that call by examining the measurement equivalence of multisource ratings for non-managerial jobs.

METHOD

Participants

The ratees in this study were non-managerial, professional employees (e.g., accountants, engineers, programmers, research scientists) from a Fortune 150 firm. In addition to rating themselves, participants were rated by their peers and managers.


Only participants with ratings from all three sources were included in the present investigation. Furthermore, in instances where an employee had more than one rating from a source (i.e., more than one peer or supervisor rating), a single rating was randomly selected for inclusion in the study. The final sample consisted of 2,158 ratings from each source (one per focal employee), for a total of 6,474 ratings. No demographic information was collected on the participants.

Measures and Procedures

Individuals participated in a multisource feedback program for developmental purposes. The feedback program consisted of four tools: (1) the feedback instrument, which was administered to the focal employee, peers, and supervisors; (2) an individual feedback report given to the ratee providing details on how he/she was rated; (3) a self-managed workbook for interpreting the feedback report, developing an action plan, and setting goals for changing behaviors; and (4) a development guide providing specific recommendations for improving performance in particular areas (including behavioral strategies and a list of developmental resources). The original feedback instrument consisted of 53 items assessing a variety of behaviors identified as necessary for successful performance. Participants rated the extent to which an individual displayed each behavior on a scale from 1 (not at all) to 5 (a great deal).

RESULTS

Preliminary Analyses

Rating forms with more than 15% of the responses missing (i.e., seven or more items) were excluded from further analyses. If less than 15% of the values were missing, values were imputed using the expectation maximization (EM) estimation procedure (Dempster, Laird, & Rubin, 1977). Across sources, this resulted in the estimation and replacement of less than 1.28% of the responses on average (.80% for self ratings, 1.18% for supervisor ratings, and 1.84% for peer ratings). To initially identify the factor structure underlying the instrument, managerial ratings of a separate sample (N = 1,012) of non-managerial employees from the same organization were submitted to principal axis exploratory factor analysis (EFA) with varimax rotation. This sample of employees did not differ in any discernible way from the primary sample. Managerial ratings were used for this purpose because managers tend to be the most familiar with rating and evaluating others' performance, and these ratings may provide a good frame of reference for identifying the underlying structure of performance. In identifying the factors, items that had a high loading (.40 or higher) on one factor and low cross-loadings (.32 or lower) on the other factors were retained.


The EFA resulted in the extraction of three primary factors, labeled Relationships (12 items, α = .94), Customer Focus (8 items, α = .92), and Continual Learning (8 items, α = .91). These dimensions were then confirmed using the focal sample of this study (see below).
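A minimal Python sketch of the preliminary screening and item-retention rules described above, assuming a pandas DataFrame of rating forms with hypothetical item column names; the EM step is approximated with scikit-learn's IterativeImputer rather than the exact EM algorithm the authors used, and the EFA step uses the factor_analyzer package as one possible stand-in for the original software:

    import numpy as np
    import pandas as pd
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from factor_analyzer import FactorAnalyzer

    def screen_and_impute(forms: pd.DataFrame, item_cols, max_missing=0.15) -> pd.DataFrame:
        """Drop forms with more than 15% missing items, then impute the remainder."""
        keep = forms[item_cols].isna().mean(axis=1) <= max_missing
        kept = forms.loc[keep].copy()
        # IterativeImputer stands in for the EM-based imputation reported in the article.
        kept[item_cols] = IterativeImputer(max_iter=50, random_state=0).fit_transform(kept[item_cols])
        return kept

    def retained_items(loadings: np.ndarray, primary=0.40, cross=0.32):
        """Keep items with a primary loading >= .40 and all cross-loadings <= .32."""
        keep = []
        for i, row in enumerate(np.abs(loadings)):
            top = row.argmax()
            if row[top] >= primary and np.all(np.delete(row, top) <= cross):
                keep.append(i)
        return keep

    # Principal-axis-style EFA with varimax rotation on the separate managerial sample,
    # extracting the three factors reported in the article ("manager_forms" and
    # "item_cols" are hypothetical names).
    # efa = FactorAnalyzer(n_factors=3, rotation="varimax", method="principal")
    # efa.fit(screen_and_impute(manager_forms, item_cols)[item_cols])
    # print(retained_items(efa.loadings_))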

Overview of Analyses

The present investigation used LISREL 8.3 (Jöreskog & Sörbom, 1993) to test all of the CFA models. For all analyses, the loading of one item chosen arbitrarily from each factor was set to 1.0 to scale the latent variables (Bollen, 1989). As a preliminary step, we examined whether the proposed 3-factor structure fit the observed covariance matrices separately for each source. Data from all three sources showed good fit with the hypothesized factor structure. Next, the ratings were combined into one data set to test the various levels of measurement equivalence. The data were structured in a repeated-measures fashion, such that ratings from different sources on the same individual were identified as being the same case. This was done because the ratings from the three sources were not independent (they were all evaluations of the same employee), rendering a true multiple-group CFA procedure inappropriate (Cheung, 1999).
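The repeated-measures structuring described above amounts to placing the self, supervisor, and peer ratings of the same employee on a single row, so that one covariance matrix over all source-by-item variables can be analyzed in a single sample with cross-source constraints. A minimal pandas sketch, assuming hypothetical columns 'employee_id' and 'source' and one randomly selected rating per source per employee (as in the article):

    import pandas as pd

    def to_wide(long_ratings: pd.DataFrame, item_cols) -> pd.DataFrame:
        """One row per ratee, with columns such as item_01_self, item_01_supervisor, item_01_peer."""
        wide = long_ratings.pivot(index="employee_id", columns="source", values=item_cols)
        # Flatten the (item, source) MultiIndex into single column names.
        wide.columns = [f"{item}_{source}" for item, source in wide.columns]
        return wide.reset_index()

The covariance matrix of the resulting wide data set (3 sources x 28 retained items = 84 observed variables) is then what the constrained single-sample CFA analyzes in place of a conventional multiple-group model.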

In evaluating the adequacy of a given model, the present investigation utilized the χ² goodness-of-fit statistic and the following fit indices: (a) the Tucker-Lewis Index (TLI), which is identified in LISREL as the non-normed fit index (NNFI; Tucker & Lewis, 1973); (b) the root mean square error of approximation (RMSEA; Steiger, 1990); (c) the standardized root mean square residual (SRMR; Bentler, 1995); and (d) the Comparative Fit Index (CFI). The lower bound of good fit for the TLI and the CFI is considered to be .90, whereas for the RMSEA and the SRMR the upper bounds for good fit are considered to be .08 and .10, respectively (Vandenberg & Lance, 2000). Although Hu and Bentler (1999) recommended more stringent standards for evaluating model fit (CFI = .95, TLI = .95, RMSEA = .06, SRMR = .08), Vandenberg and Lance (2000) suggested that this recommendation may be premature and that more research is needed before these new standards are adopted. They suggested that the more commonly accepted standards presented above be considered the lower bound for good fit, and that reaching the standards suggested by Hu and Bentler (1999) would indicate very good fit.
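These incremental and absolute fit indices are simple functions of the model and null (independence) model chi-squares; a sketch using one common parameterization of the standard formulas (SRMR is omitted because it requires the residual covariance matrix, and the independence-model values are not reported in the article):

    import math

    def fit_indices(chi2_m, df_m, chi2_null, df_null, n):
        """TLI/NNFI, CFI, and RMSEA from model and independence-model chi-squares."""
        tli = ((chi2_null / df_null) - (chi2_m / df_m)) / ((chi2_null / df_null) - 1.0)
        cfi = 1.0 - max(chi2_m - df_m, 0.0) / max(chi2_null - df_null, chi2_m - df_m, 0.0)
        rmsea = math.sqrt(max(chi2_m - df_m, 0.0) / (df_m * (n - 1)))
        return {"TLI": tli, "CFI": cfi, "RMSEA": rmsea}

    # Resulting values would be compared against the lower bounds of .90 (TLI, CFI)
    # and the upper bounds of .08 (RMSEA) and .10 (SRMR) cited from Vandenberg and
    # Lance (2000).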

To analyze the various types of equivalence, the series of hierarchical modeling procedures identified by Vandenberg and Lance (2000) and discussed above was used. For each comparison, we constrained the relevant parameters to be equal across rating sources and examined whether a significant reduction in model fit occurred from the less constrained to the more constrained model.


Although the χ² difference test is the most common method of examining the difference between nested models, the present investigation also examined changes in other fit indices to make a more informed decision regarding model fit. The rationale for doing so is that, with such a large sample, traditional χ² tests and even χ² difference tests may be overly sensitive, yielding significant results even with small model misfit (Cheung & Rensvold, 1999).
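A sketch of the nested-model comparison logic, assuming the chi-square, degrees of freedom, and (optionally) CFI of the less and more constrained models are available; the change in CFI is reported alongside the significance test in the spirit of the Cheung and Rensvold argument, not as a cutoff adopted by the authors:

    from scipy.stats import chi2

    def nested_comparison(chi2_less, df_less, chi2_more, df_more, cfi_less=None, cfi_more=None):
        """Chi-square difference test for a more constrained model nested in a less constrained one."""
        d_chi2 = chi2_more - chi2_less
        d_df = df_more - df_less
        out = {"delta_chi2": d_chi2, "delta_df": d_df, "p": chi2.sf(d_chi2, d_df)}
        if cfi_less is not None and cfi_more is not None:
            out["delta_cfi"] = cfi_less - cfi_more
        return out

    # Model 1 vs. Model 2 from Table 2: delta_chi2 = 224.59 on delta_df = 50 (p < .05),
    # yet delta_CFI = 0 -- the pattern that motivates weighing multiple indices at N = 2,158.
    # nested_comparison(11517.15, 3282, 11741.74, 3332, cfi_less=.92, cfi_more=.92)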

In addition to comparing models where constraints were applied across all scales and rater sources, partial invariance also was examined, where constraints were placed on specific parameters in an attempt to identify the sources of non-equivalence. That is, if non-equivalence was detected, a search for its cause was conducted by selectively freeing constraints on parameters and examining whether the Δχ² values and other fit statistics improved.

One way partial invariance was examined was to test the series of models identified by Vandenberg and Lance (2000) for each dimension separately. Examining the dimensions separately can help to demonstrate whether the sources are more equivalent on some dimensions than others. A second way partial invariance was examined was by freeing one source at a time and constraining the remaining two sources to be equal in a round-robin fashion. This procedure can determine whether one rater source is primarily responsible for the misfit present in the data (i.e., two sources are similar and one is dissimilar), or whether the misfit is fairly uniform across sources (i.e., all three sources are different from each other). To illustrate the nature of any non-equivalence, descriptive data (e.g., factor loadings, latent means, reliabilities) also are reported.
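The round-robin search can be organized as a simple loop; in this sketch, fit_partial_model is a hypothetical callback standing in for whatever SEM routine (here, LISREL) refits the specified invariance model with constraints applied only to the given pair of sources and returns its fit statistics:

    from itertools import combinations

    SOURCES = ("self", "supervisor", "peer")

    def round_robin(fit_partial_model, model_name):
        """Constrain each pair of sources in turn, leaving the third source free."""
        results = {}
        for constrained in combinations(SOURCES, 2):
            freed = next(s for s in SOURCES if s not in constrained)
            # fit_partial_model is assumed to return, e.g., (chi2, df) for the
            # partially constrained model; its implementation depends on the SEM
            # software actually used.
            results[f"{constrained[0]}+{constrained[1]} (free: {freed})"] = fit_partial_model(
                model_name, constrained_sources=constrained
            )
        return results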

Tests of Measurement Equivalence

Table 2 presents the fit statistics for Models 0–7 (described above) and the change in χ² and change in df between nested models. For each model, the fit indices (with the exception of the χ² goodness-of-fit statistic, which is overly sensitive with large samples) were above the minimum fit recommendations. Furthermore, although there are significant changes in χ² for each comparison of nested models, the changes in the other fit statistics were quite small, suggesting that the more constrained models may not be much worse and that the significant changes in χ² may be due to the large sample used in this investigation (Cheung & Rensvold, 1999). For illustrative purposes, the model comparisons are pursued with the goal of demonstrating how various measurement differences can be identified. In addition, descriptive data are presented at the item level in Tables 3 (factor loadings and item means) and 4 (variances of the measurement errors and scale reliabilities), and at the latent construct level in Table 5 (factor means, variances, covariances, and intercorrelations).


Tables 6–8 present the sequence of tests of measurement equivalence separately for each of the three performance dimensions. Finally, Table 9 presents the Δχ² and Δdf for tests of partial invariance where only two sources are constrained to be equal at a time (i.e., one source is freed at a time).

Model 0. The overall fit of this model was above the minimum fit requirements recommended by Vandenberg and Lance (2000) (RMSEA = .020; TLI = .97; SRMR = .053; CFI = .98). At this point, given the high degree of fit for the fully constrained model, Vandenberg and Lance suggest that no further model testing is required, and it may be concluded that the ratings are equivalent conceptually and psychometrically across rater groups. Given the illustrative nature of the current investigation, and in particular the ability of CFA techniques to reveal various types of rater source differences, we chose to proceed with the hierarchical tests for Models 1–7, providing descriptive data and tests of partial invariance to identify the nature of potential between-group differences.

Model 1. The overall fit of the test of equivalent factor structure was above the minimum requirements suggested by Vandenberg and Lance (2000) (RMSEA = .037; TLI = .92; SRMR = .036; CFI = .92), indicating that the groups used the same number of factors and that the items loaded on the same dimension for each rater source. Thus, performance is defined similarly across rater sources.

Table 2
Results for the Sequence of Measurement Invariance Tests

Model                              df     χ²         RMSEA  TLI  SRMR  CFI
0. Invariant Covariance Matrices   2434   4583.49    .020   .97  .053  .98
1. Configural Invariance           3282   11517.15   .037   .92  .036  .92
   1 versus 2:  Δdf = 50, Δχ² = 224.59*, ΔCFI = 0
2. Metric Invariance               3332   11741.74   .037   .92  .038  .92
   2 versus 3:  Δdf = 50, Δχ² = 791.87*, ΔCFI = .01
3. Scalar Invariance               3382   12533.61   .038   .91  .038  .91
   3 versus 4:  Δdf = 56, Δχ² = 315.01*, ΔCFI = 0
4. Invariant Uniquenesses          3438   12848.62   .038   .91  .039  .91
   4 versus 5:  Δdf = 6, Δχ² = 171.52*, ΔCFI = 0
5. Invariant Factor Variances      3444   13020.14   .039   .91  .058  .91
   5 versus 6:  Δdf = 6, Δχ² = 82.66*, ΔCFI = 0
6. Invariant Factor Covariances    3450   13102.80   .039   .91  .057  .91
   6 versus 7:  Δdf = 6, Δχ² = 327.35*, ΔCFI = 0
7. Invariant Factor Means          3456   13430.15   .039   .90  .055  .91

Note. * Significant at p < .05.


Note, of course, that the fit of all models will be good given that Model 0 fit the data well.

Model 2. The overall fit of the test of Metric Invariance (equal factor loadings) was quite high (RMSEA = .037; TLI = .92; SRMR = .038; CFI = .92), indicating good fit as well. Although the change in χ² (Δχ²(50) = 224.59, p < .05) was significant (suggesting that H0 should be rejected), the changes in the other fit indices revealed only slight differences between the models, with the SRMR changing by only .002 and all other fit indices remaining the same.

Table 3
Standardized Factor Loadings and Item Means

                     Self-ratings       Supervisor ratings   Peer ratings
Item                 Loading   Mean     Loading   Mean       Loading   Mean

Relationships
  Item 1+              .70     4.31       .76     4.13         .77     4.22
  Item 2+              .68     3.89       .73     3.70         .73     3.85
  Item 3+              .67     4.02       .74     3.74         .76     3.91
  Item 4               .69     4.10       .77     3.92         .79     4.05
  Item 5+              .71     4.14       .76     3.91         .78     4.07
  Item 6*+             .64     4.27       .75     4.18         .76     4.21
  Item 7               .70     3.91       .74     3.70         .78     3.84
  Item 8               .69     4.06       .74     3.92         .78     4.02
  Item 9               .67     3.88       .76     3.71         .75     3.77
  Item 10*+            .61     4.40       .73     4.21         .75     4.25
  Item 11+             .62     4.30       .66     4.04         .72     4.16
  Item 12              .60     4.00       .67     3.85         .69     3.94

Customer Focus
  Item 1+              .78     3.86       .79     3.66         .79     3.83
  Item 2               .71     4.08       .74     3.92         .78     4.03
  Item 3               .75     3.95       .80     3.77         .77     3.94
  Item 4               .71     4.01       .73     3.81         .74     3.96
  Item 5+              .69     3.84       .75     3.71         .72     3.93
  Item 6+              .71     3.71       .71     3.54         .74     3.80
  Item 7+              .71     3.83       .74     3.72         .74     3.86
  Item 8+              .63     3.49       .60     3.34         .67     3.71

Continuous Learning
  Item 1+              .75     3.97       .80     3.71         .77     3.94
  Item 2+              .67     4.09       .76     3.83         .71     4.00
  Item 3+              .71     3.87       .76     3.58         .76     3.78
  Item 4+              .66     3.91       .75     3.69         .74     3.94
  Item 5               .65     3.92       .70     3.72         .74     3.92
  Item 6+              .64     3.86       .67     3.71         .72     3.90
  Item 7+              .62     4.15       .69     3.72         .70     3.95
  Item 8+              .62     3.56       .58     3.39         .66     3.63

* Factor loadings are significantly different at p < .05. + Item means are significantly different at p < .05.


To get a better feel for the differences in factor loadings, the test of Metric Invariance was conducted independently for each item. Table 3 displays the factor loadings for each item across rater sources and whether the Δχ² was significant. As can be seen, only two items had significantly different factor loadings across groups, with the self-ratings loading lower than the other two sources in both cases. Examining Model 2 separately for each dimension (Tables 6–8) does not identify one dimension as providing particularly poor fit compared to the other dimensions.

Table 4
Scale Reliabilities and Item Variances (Variances of Measurement Error)

                              Self-ratings   Supervisor ratings   Peer ratings
Relationships (scale α)           .90              .93                .94
  Item 1                          .23              .23                .26
  Item 2                          .32              .32                .33
  Item 3                          .30              .28                .29
  Item 4                          .24              .24                .24
  Item 5                          .22              .22                .23
  Item 6                          .26              .25                .29
  Item 7                          .25              .26                .23
  Item 8*                         .24              .27                .24
  Item 9*                         .29              .25                .30
  Item 10                         .24              .23                .25
  Item 11*                        .26              .34                .31
  Item 12*                        .28              .25                .27

Customer Focus (scale α)          .89              .90                .91
  Item 1                          .21              .19                .20
  Item 2                          .24              .22                .21
  Item 3                          .20              .18                .21
  Item 4*                         .25              .24                .27
  Item 5*                         .30              .26                .26
  Item 6*                         .32              .29                .27
  Item 7*                         .27              .23                .25
  Item 8*                         .48              .39                .35

Continual Learning (scale α)      .86              .89                .90
  Item 1                          .26              .26                .26
  Item 2                          .30              .30                .33
  Item 3*                         .30              .27                .25
  Item 4*                         .33              .30                .27
  Item 5*                         .32              .31                .26
  Item 6*                         .37              .34                .29
  Item 7                          .29              .28                .29
  Item 8*                         .44              .41                .37

Note. * Significant differences at p < .05.


Furthermore, Table 9 does not reveal that one particular rater source was the primary cause of the differences in factor loadings (across all items). Thus, it can be concluded that although very small differences in factor loadings do exist between sources, there are generally high levels of conceptual equivalence, and no one dimension or rater source contributed a disproportionate amount to the level of non-equivalence found for the assessed model.

Table 5
Means, Variances, Covariances, and Intercorrelations of Latent Factors

                                          Self              Supervisor          Peer
Factor                         Mean   ξ1    ξ2    ξ3     ξ4    ξ5    ξ6      ξ7    ξ8    ξ9
Self
  Relationships (ξ1)           4.31   .22   .69   .67    .15   .04   .03     .21   .11   .10
  Customer Focus (ξ2)          3.86   .18   .31   .66    .03   .19   .03     .09   .21   .08
  Continuous Learning (ξ3)     3.97   .18   .21   .33    .03   .03   .21     .08   .12   .24
Supervisor
  Relationships (ξ4)           4.13   .04   <.01  <.01   .33   .65   .64     .23   .12   .14
  Customer Focus (ξ5)          3.66   .01   .06   .01    .21   .32   .63     .14   .24   .14
  Continuous Learning (ξ6)     3.71   <.01  -.01  .08    .24   .23   .42     .13   .08   .29
Peer
  Relationships (ξ7)           4.22   .06   .03   .03    .08   .05   .05     .38   .74   .73
  Customer Focus (ξ8)          3.83   .03   .07   .04    .04   .08   .03     .27   .35   .77
  Continuous Learning (ξ9)     3.94   .03   .03   .09    .05   .05   .12     .29   .29   .41

Note. Correlations are above the diagonal; variances are on the diagonal and covariances are below the diagonal.

Table 6
The Sequence of Invariance Tests for the Relationship Dimension

Model                              df    χ²        RMSEA  TLI  SRMR  CFI
0. Invariant Covariance Matrices   322   1034.73   .031   .97  .073  .98
1. Configural Invariance           555   3141.41   .049   .93  .034  .94
   1 versus 2:  Δdf = 22, Δχ² = 68.14*, ΔCFI = 0
2. Metric Invariance               577   3209.55   .049   .93  .037  .94
   2 versus 3:  Δdf = 22, Δχ² = 202.51*, ΔCFI = 0
3. Scalar Invariance               599   3412.06   .049   .93  .037  .94
   3 versus 4:  Δdf = 24, Δχ² = 91.19*, ΔCFI = .01
4. Invariant Uniquenesses          623   3503.25   .049   .93  .038  .93
   4 versus 5:  Δdf = 2, Δχ² = 145.48*, ΔCFI = 0
5. Invariant Factor Variances      625   3648.73   .050   .93  .079  .93
   5 versus 7:  Δdf = 2, Δχ² = 141.94*, ΔCFI = 0
7. Invariant Factor Means          627   3790.67   .051   .93  .075  .93

Note. * Significant at p < .05.


Model 3. The overall fit of the model testing for Scalar Invariance (equal item means) was quite good (RMSEA = .038; TLI = .91; SRMR = .038; CFI = .91), but again there was a significant change in chi-square between Model 2 and Model 3 (Δχ²(50) = 791.87, p < .05), suggesting worse fit when the item means were constrained. In addition, the largest changes in the other fit statistics occurred for this comparison, with the CFI and TLI decreasing by .01 and the RMSEA increasing by .001. The item means, and whether they were observed to be significantly different across sources, are presented in Table 3. As indicated, the means of 19 of the 28 items are significantly different across sources.

Table 7
The Sequence of Invariance Tests for the Customer Focus Dimension

Model                              df    χ²        RMSEA  TLI  SRMR  CFI
0. Invariant Covariance Matrices   150   747.47    .041   .96  .026  .98
1. Configural Invariance           225   671.94    .031   .98  .022  .98
   1 versus 2:  Δdf = 14, Δχ² = 67.58*, ΔCFI = 0
2. Metric Invariance               239   739.52    .032   .98  .029  .98
   2 versus 3:  Δdf = 14, Δχ² = 290.70*, ΔCFI = .01
3. Scalar Invariance               253   1030.22   .039   .97  .029  .97
   3 versus 4:  Δdf = 16, Δχ² = 119.39*, ΔCFI = 0
4. Invariant Uniquenesses          269   1149.61   .039   .96  .032  .97
   4 versus 5:  Δdf = 2, Δχ² = 5.45, ΔCFI = 0
5. Invariant Factor Variances      271   1155.06   .039   .97  .035  .97
   5 versus 7:  Δdf = 2, Δχ² = 160.57*, ΔCFI = .01
7. Invariant Factor Means          273   1315.63   .042   .96  .033  .96

Note. * Significant at p < .05.

Table 8
The Sequence of Invariance Tests for the Continuous Learning Dimension

Model                              df    χ²        RMSEA  TLI  SRMR  CFI
0. Invariant Covariance Matrices   150   875.65    .046   .94  .046  .97
1. Configural Invariance           225   1772.08   .058   .92  .036  .94
   1 versus 2:  Δdf = 14, Δχ² = 79.38*, ΔCFI = .01
2. Metric Invariance               239   1851.46   .057   .92  .040  .93
   2 versus 3:  Δdf = 14, Δχ² = 294.90*, ΔCFI = .01
3. Scalar Invariance               253   2146.36   .061   .91  .041  .92
   3 versus 4:  Δdf = 16, Δχ² = 98.73*, ΔCFI = 0
4. Invariant Uniquenesses          269   2245.09   .061   .92  .043  .92
   4 versus 5:  Δdf = 2, Δχ² = 35.43*, ΔCFI = 0
5. Invariant Factor Variances      271   2280.52   .061   .91  .053  .92
   5 versus 6:  Δdf = 2, Δχ² = 273.73*, ΔCFI = .02
7. Invariant Factor Means          273   2554.25   .064   .90  .055  .90

Note. * Significant at p < .05.


Examining Model 3 for each dimension (Tables 6–8) does not reveal that the misfit is due to one particular dimension. In addition, freeing the means for a particular source does not result in a nonsignificant Δχ² (Table 9), but does reveal that constraining the peer and supervisor ratings (and freeing the self ratings) yields the smallest increase in χ². Examination of the means reveals that self ratings are the highest for 21 items, peer ratings are the highest for 7 items, and supervisor ratings are the lowest for all items.

Model 4. The overall fit of the model testing for Invariant Uniquenesses (measurement error/item reliability) was good (RMSEA = .038; TLI = .91; SRMR = .039; CFI = .91), but the change in χ² relative to Model 3 was significant (Δχ²(56) = 315.01, p < .05). The change in other fit statistics again was negligible, with only the SRMR increasing by .001. The item uniquenesses (variances of measurement errors) and scale reliabilities are presented in Table 4; as can be seen, 14 of the item uniquenesses were significantly different between rater sources (as indicated by a significant Δχ² when the uniquenesses for each item were constrained separately). It does not appear that the majority of the misfit is due to any one dimension (see Tables 6–8). Furthermore, freeing the self ratings and constraining the supervisor and peer ratings results in the smallest increase in χ², suggesting that self-ratings may contribute more to the lack of model fit, with their uniquenesses being slightly larger. The impact of having larger item uniquenesses can be seen in the scale reliabilities, with the self-ratings having the lowest reliability across dimensions. Importantly, though, the reliabilities for all dimensions and sources are quite high, suggesting that any practical differences in reliability may be negligible.

Table 9
Tests of Partial Invariance for Each Model, with the Parameters for Two Sources Constrained to Be Equal and the Parameter for One Source Free to Vary

                                   Source ratings constrained to be equal (freed source in parentheses)
                                   Self and Supervisor (Peer)   Self and Peer (Supervisor)   Supervisor and Peer (Self)
Partial Invariance Model           Δdf     Δχ²                   Δdf     Δχ²                  Δdf     Δχ²
2. Metric Invariance               25      152.47                25      97.22                25      93.16
3. Scalar Invariance               25      383.64                25      488.26               25      309.99
4. Invariant Uniquenesses          28      147.38                28      216.80               28      99.48
5. Invariant Factor Variances      3       86.95                 3       153.38               3       20.69
6. Invariant Factor Covariances    3       48.46                 3       12.14                3       59.26
7. Invariant Factor Means          2       232.89                2       73.16                2       219.82


Model 5. The overall fit of the model testing for Invariant Factor Variances was good (RMSEA = .039; TLI = .91; SRMR = .058; CFI = .91), but was significantly different from Model 4 (Δχ²(6) = 171.52, p < .05). The change in other fit statistics was very small, with the SRMR having the largest change (an increase of .019). Analyzing this model separately for each performance dimension shows that the Relationships dimension appears to hold the lion's share of the misfit (Table 6). Specifically, the Customer Focus dimension does not show a significant Δχ² for this constraint (Table 7), and the Continuous Learning dimension had a decrease in model fit roughly a quarter that of the Relationships dimension (Table 8). These differences can be seen in the data in Table 5, where the variances range from .22 to .38 for the Relationships dimension, from .31 to .35 for the Customer Focus dimension, and from .33 to .42 for the Continuous Learning dimension. In addition, it can be seen that the latent variances are generally smaller for self-ratings than for the other two sources (Table 5). This difference is reflected in the relatively small Δχ² when the factor variances for supervisors and peers are constrained and those for self-ratings are freed (see Table 9). This finding is consistent with Cheung (1999), who found that the variance of self-ratings was smaller than the variance of supervisor ratings. In sum, these findings demonstrate that self-ratings have less range than the other sources, and that across sources there are larger differences in the range used for the Relationships dimension than for the other dimensions.

Model 6. The overall fit of the model testing for Invariant Factor Covariances (relationships among factors) was good (RMSEA = .039; TLI = .91; SRMR = .057; CFI = .91), but was significantly different from Model 5 (Δχ²(6) = 82.66, p < .05). The change in the other fit statistics was negligible, with the SRMR actually improving by .001 and all other statistics remaining the same. The correlations reported above the diagonal in Table 5 are estimated from the variances and covariances presented in the bottom half of the table. As indicated, the correlations between dimensions within a source are of fairly high magnitude, with the supervisor ratings exhibiting the smallest correlations (.64 on average) and the peer ratings the highest (.75 on average). A comparison of models separately by dimension was not possible because this test explicitly compares the relationships between dimensions. The tests of partial invariance across sources (Table 9) revealed that the smallest Δχ² occurred when the supervisor rating covariances were free to vary and the self and peer rating covariances were constrained to be equal. Thus, in general, supervisor ratings appear to discriminate among the dimensions better than self or peer ratings, and peer ratings tend to discriminate less than self-ratings.

Model 7. The overall fit of the model testing for Invariant Factor Means was good (RMSEA = .039; TLI = .90; SRMR = .055; CFI = .91), but was significantly different from Model 6 (Δχ²(6) = 327.35, p < .05).


The change in other fit statistics also was generally small, with the TLI decreasing by .01 and the SRMR changing by .002. The factor means are reported in Table 5. Consistent with the item means (test for Scalar Invariance), self-ratings are the highest and supervisor ratings are the lowest for each dimension. With regard to specific individual dimension comparisons, the Customer Focus dimension had the worst level of fit (Table 7), and the other two dimensions were roughly equivalent in their levels of fit (see Tables 6 and 8). The tests of partial invariance freeing one source at a time revealed that constraining only the self and peer ratings resulted in the smallest decrease in model fit (Table 9). This finding is consistent with the means presented in Table 5.

Correlations Between Sources

The correlations among the latent factors are presented in the top half of Table 5 and were estimated from the variance-covariance matrix in the bottom half of the table. The average correlation between sources on the same dimension was .18 for self and supervisor ratings, .22 for self and peer ratings, and .25 for supervisor and peer ratings. In addition, the average between-source correlation for each dimension was .20 for Relationships, .21 for Customer Focus, and .25 for Continuous Learning.

DISCUSSION

A common practice within multisource feedback systems is to compare ratings between various rater sources. However, if the feedback instrument is not equivalent, then ratings across sources or performance dimensions are not directly comparable (Drasgow & Kanfer, 1985). Surprisingly, the issue of the measurement equivalence of multisource ratings has received relatively little attention (Facteau & Craig, 2001). Based on the work of Vandenberg and Lance (2000) and Cheung (1999), this paper applies a method for comprehensively examining the measurement equivalence of self, peer, and supervisor ratings across three performance dimensions for a sample of professional, non-managerial employees. This is the first study to apply the complete series of nested CFA models recommended by Vandenberg and Lance (2000) to multisource feedback ratings of salaried, non-managerial employees. The primary contribution of this study is that it illustrates, in detail, how the specific sources of rater differences can be identified even with very complex datasets involving multiple raters and multiple performance dimensions. A second contribution of this paper is that it demonstrates that measurement equivalence is present in multisource ratings for a sample of non-managerial professionals.


The results of the current study indicate that ratings on the three performance dimensions were largely conceptually equivalent across rater sources. These results are consistent with the existing literature on the measurement equivalence of multisource ratings (e.g., Maurer et al., 1998) and extend those findings to a non-managerial sample. As Sulsky and Keown (1998) note, the ultimate utility of multisource systems may depend on our ability to develop some consensus on the meaning of performance. These results are encouraging and suggest that comparisons made between rater sources may provide meaningful information. Further, these results suggest that the differences we observe between rater sources are likely not due to different rater sources conceptualizing performance differently, as was suggested by Campbell and Lee (1988). Future research should investigate the measurement equivalence of ratings over time, as some recent research suggests that the meaning of performance changes over time (Stennett, Johnson, Hecht, Green, Jackson, & Thomas, 1999). Finding that the meaning of performance has changed over time would have serious implications for studies investigating behavioral change longitudinally.

Recall that the first and most restrictive test revealed complete equivalence in the data. Because this restrictive test suggested equivalence, no further tests of equivalence generally need to be conducted (Vandenberg & Lance, 2000). However, as illustrated and discussed above, the CFA technique has the power to detect even very small conceptual and psychometric differences between raters. Thus, in the presence of larger differences between rating sources, the techniques demonstrated here could be used to pinpoint the source of non-equivalence down to the item level.

Consistent with past research (e.g., Conway & Huffcutt, 1997), the correlations between rater sources were quite low. Specifically, the average correlation between sources was .18 for self and supervisor ratings, .22 for self and peer ratings, and .25 for supervisor and peer ratings. These results also are consistent with past research that has observed supervisor-peer correlations to be higher than self-peer or self-supervisor correlations (e.g., Harris & Schaubroeck, 1988). Importantly, the current study was able to rule out the possibility that these low correlations were due to rater sources interpreting the multisource instrument differently. The observed low correlations also support a fundamental assumption of multisource feedback systems, namely, that different rater sources represent distinct perspectives that provide unique information (London & Smither, 1995).

Although supervisors are the most widely used rater source in performance management systems (Murphy & Cleveland, 1995), some have suggested that peers may be a better source of performance information (Murphy & Cleveland, 1995; Murphy, Cleveland, & Mohler, 2001). Peers may have more opportunities to observe ratee performance and likely work more closely with one another than do other rater sources (Murphy & Cleveland, 1995). Additionally, peer ratings are often perceived as better because they appear to be more reliable than supervisory ratings, likely a result of aggregating across peers (Scullen, 1997). Although the current study suggests that self, peer, and supervisor ratings are equally internally consistent, past research has found that the inter-rater reliability of supervisors is higher than that of peers when controlling for the number of raters (Greguras & Robie, 1998). Future research should continue to investigate the conditions that influence the quality of multisource ratings.

IMPLICATIONS FOR PRACTICE

The model testing framework in this paper is recommended for practitioners to aid them in interpreting results, providing feedback to non-managerial employees, and refining the rating instrument and procedure. The most obvious benefit to practitioners of using CFA procedures is that they can identify whether the scale measures the same thing across all rater sources (i.e., conceptual equivalence). Even a carefully developed measure may not be conceptualized the same way across rater sources; a direct comparison of the factor structure using the procedures outlined and demonstrated above can assess whether the instrument is conceptualized equivalently across sources. If the measure is not conceptually equivalent across sources, interpreting between-source differences in means, errors, or variances could be inaccurate. Thus, any performance improvement feedback given to employees on the basis of this information may not reflect employees' actual developmental needs, leading employees down the wrong path in terms of performance improvement and leaving important needs unaddressed.
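
As a concrete starting point, the conceptual (configural) check described above amounts to fitting the same measurement model to each rater source and verifying that the structure fits acceptably in every group before constraining parameters to be equal. The sketch below assumes the Python semopy package and hypothetical item names; it illustrates the general approach only and is not the procedure or software used in this study (multi-group SEM tools such as LISREL, EQS, or lavaan are more typical choices in practice).

# Fit the same three-factor model separately to each rater source's data
# and collect overall fit statistics (item names are hypothetical).
import semopy

MODEL_DESC = """
Relationships      =~ rel1 + rel2 + rel3
CustomerFocus      =~ cf1 + cf2 + cf3
ContinuousLearning =~ cl1 + cl2 + cl3
"""

def fit_by_source(data_by_source):
    # data_by_source: dict mapping 'self', 'peer', 'supervisor' to DataFrames
    stats = {}
    for source, df in data_by_source.items():
        model = semopy.Model(MODEL_DESC)
        model.fit(df)
        stats[source] = semopy.calc_stats(model)  # chi-square, CFI, RMSEA, etc.
    return stats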

If there is conceptual equivalence across rater sources, a second use for practitioners is the accurate identification of between-source differences in ratings. For example, the procedure can identify between-source differences in rating level (e.g., supervisor ratings are the lowest), which can aid the interpretation of ratings and guide the subordinate's development plan (e.g., more attention should be given to the relative differences among ratings within a source, rather than to the differences between supervisor and self or peer ratings). Thus, this procedure can aid in the identification of specific psychometric differences between sources, which can assist the practitioner in better understanding the rating system and providing feedback to employees.

Another implication of this procedure for practitioners is that the results of these analyses can inform changes to the instrument, the instruction and training of raters, and the system as a whole. If problematic items are identified, they can be removed. If the performance dimensions are conceptualized differently across rater sources, how the different sources define performance should be reevaluated, and the results reflected in future versions of the scale. Furthermore, where there are psychometric differences, rater training and instruction can be used to help eliminate these rating "errors" by pointing out the propensity to rate high or the failure to distinguish among performance dimensions. Finally, the presence of both conceptual and psychometric differences may require an overhaul of the entire system so that a more rigorous and less error-prone procedure is used.

Finally, the substantive implication of these findings is that ratings across rater sources are comparable for non-managerial employees, so meaningful comparisons may be made across different rater sources. Multisource rating systems assume that a ratee's self-awareness is increased by reviewing self-other rating discrepancies (Tornow, 1993) and that, in turn, this increase in self-awareness facilitates ratee development and performance improvement (Church, 1997; Tornow, 1993). The results of the current study suggest that, from a measurement perspective, the ratings are equivalent and can be meaningfully compared in an attempt to increase one's self-awareness and performance.

Future Research

Although the accumulating research on the measurement equivalence of multisource ratings generally indicates that the ratings are conceptually equivalent, there are several avenues for future research. First, research should continue to explore ratee and rater characteristics that may influence the measurement equivalence of multisource ratings (Maurer et al., 1998). Second, given that research indicates that rating purpose differentially impacts the dependability of ratings from different rater sources (e.g., Greguras, Robie, Schleicher, & Goff, 2003), future research should explore the impact that rating purpose may have on the conceptualization of performance, or the use of performance instruments, across different rater sources. Third, recent research has investigated the effects of multisource feedback on employee development longitudinally (e.g., Bailey & Fletcher, 2002). If studies that assess behavioral change are to be meaningfully interpreted, the measurement equivalence of multisource ratings collected longitudinally must first be established. Fourth, future research needs to examine the extent to which multisource feedback instruments exhibit measurement equivalence across cultures. It is common practice in multinational companies to develop a feedback instrument in one country and use it in multiple countries. Testing for measurement equivalence across countries would provide insight into whether the feedback instrument is interpreted and used similarly by individuals from different cultures.

REFERENCES

Bailey, C., & Fletcher, C. (2002). The impact of multiple source feedback on management development: Findings from a longitudinal study. Journal of Organizational Behavior, 23, 853–867.
Bentler, P. M. (1995). EQS structural equations program manual. Encino, CA: Multivariate Software.
Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
Borman, W. C., & Brush, D. H. (1993). More progress toward a taxonomy of managerial performance requirements. Human Performance, 6, 1–21.
Campbell, D. J., & Lee, C. (1988). Self-appraisal in performance evaluation: Development versus evaluation. Academy of Management Review, 13, 302–314.
Cheung, G. W. (1999). Multifaceted conceptions of self-other ratings disagreement. Personnel Psychology, 52, 1–36.
Cheung, G. W., & Rensvold, R. B. (1999). Testing factorial invariance across groups: A reconceptualization and proposed new method. Journal of Management, 25, 1–27.
Church, A. H. (1997). Managerial self-awareness in high-performing individuals in organizations. Journal of Applied Psychology, 82, 281–292.
Church, A. H., & Bracken, D. W. (1997). Advancing the state of the art of 360-degree feedback. Group & Organization Management, 22, 149–161.
Conway, J. M., & Huffcutt, A. I. (1997). Psychometric properties of multisource performance ratings: A meta-analysis of subordinate, supervisor, peer, and self-ratings. Human Performance, 10, 331–360.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Fort Worth, TX: Harcourt Brace.
Dalessio, A. T., & Vasilopoulos, N. L. (2001). Multisource feedback reports: Content, formats, and levels of analysis. In D. W. Bracken, C. W. Timmreck, & A. H. Church (Eds.), The handbook of multisource feedback (pp. 181–203). San Francisco, CA: Jossey-Bass.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39, 1–38.
Drasgow, F., & Kanfer, R. (1985). Equivalence of psychological measurement in heterogeneous populations. Journal of Applied Psychology, 70, 662–680.
Facteau, J. D., & Craig, S. B. (2001). Are performance appraisal ratings from different rating sources comparable? Journal of Applied Psychology, 86, 215–227.
Greguras, G. J., & Robie, C. (1998). A new look at within-source interrater reliability of 360-degree feedback ratings. Journal of Applied Psychology, 83, 960–968.
Greguras, G. J., Robie, C., Schleicher, D. J., & Goff, M. (2003). A field study of the effects of rating purpose on multisource ratings. Personnel Psychology, 56, 1–21.
Harris, M. M., & Schaubroeck, J. (1988). A meta-analysis of self-supervisor, self-peer, and peer-supervisor ratings. Personnel Psychology, 41, 43–62.
Holzbach, R. L. (1978). Rater bias in performance ratings: Superior, self, and peer ratings. Journal of Applied Psychology, 63, 579–588.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55.
Ilgen, D. R., & Feldman, J. M. (1983). Performance appraisal: A process focus. In L. Cummings & B. Staw (Eds.), Research in organizational behavior (Vol. 5). Greenwich, CT: JAI Press.
Joreskog, K., & Sorbom, D. (1993). LISREL 8: Structural equation modeling with the SIMPLIS command language. Chicago: Scientific Software.
Lance, C. E., & Bennett, W., Jr. (1997, April). Rater source differences in cognitive representation of performance information. Paper presented at the meeting of the Society for Industrial and Organizational Psychology, St. Louis, MO.
London, M., & Smither, J. W. (1995). Can multisource feedback change perceptions of goal accomplishment, self-evaluations, and performance-related outcomes? Theory-based applications and directions for research. Personnel Psychology, 48, 803–839.
Maurer, T. J., Raju, N. S., & Collins, W. C. (1998). Peer and subordinate appraisal measurement equivalence. Journal of Applied Psychology, 83, 693–702.
Mount, M. K. (1983). Comparisons of supervisory and employee satisfaction with a performance appraisal system. Personnel Psychology, 36, 99–110.
Murphy, K. R., & Cleveland, J. N. (1995). Understanding performance appraisal: Social, organizational, and goal-based perspectives. Thousand Oaks, CA: Sage Publications.
Murphy, K. R., Cleveland, J. N., & Mohler, C. J. (2001). Reliability, validity, and meaningfulness of multisource ratings. In D. W. Bracken, C. W. Timmreck, & A. H. Church (Eds.), The handbook of multisource feedback (pp. 275–288). San Francisco, CA: Jossey-Bass.
Pooyan, A., & Eberhardt, B. J. (1989). Correlates of performance appraisal satisfaction among supervisory and non-supervisory employees. Journal of Business Research, 19, 215–226.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552–566.
Rothstein, H. R. (1990). Interrater reliability of job performance ratings: Growth to asymptote level with increasing opportunity to observe. Journal of Applied Psychology, 75, 322–327.
Ryan, A. M., Chan, D., Ployhart, R. E., & Slade, L. A. (1999). Employee attitude surveys in a multinational organization: Considering language and culture in assessing measurement equivalence. Personnel Psychology, 52, 37–58.
Scullen, S. E. (1997). When ratings from one source have been averaged, but ratings from another source have not: Problems and solutions. Journal of Applied Psychology, 82, 880–888.
Steiger, J. H. (1990). Structural model evaluation and modification: An interval estimation approach. Multivariate Behavioral Research, 25, 173–180.
Stennett, R. B., Johnson, C. D., Hecht, J. E., Green, T. D., Jackson, K., & Thomas, W. (1999, August). Factorial invariance and multirater feedback. Poster presented at the 14th Annual Conference of the Society for Industrial and Organizational Psychology, Atlanta, GA.
Sulsky, L. M., & Keown, L. (1998). Performance appraisal in the changing world of work: Implications for the meaning and measurement of work performance. Canadian Psychology, 39, 52–59.
Taris, T. W., Bok, I. A., & Meijer, Z. Y. (1998). Assessing stability and change of psychometric properties of multi-item concepts across different situations: A general approach. Journal of Psychology, 132, 301–316.
Tornow, W. W. (1993). Editor's note: Introduction to special issue on 360-degree feedback. Human Resource Management, 32, 211–219.
Tucker, L. R., & Lewis, C. (1973). The reliability coefficient for maximum likelihood factor analysis. Psychometrika, 38, 1–10.
Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3, 4–69.
Whitaker, A. (1992). The transformation in work: Post-Fordism revisited. In M. Reed & H. Hughes (Eds.), Rethinking organization: New directions in organization theory and analysis. London: Sage.
Williams, J. R., & Levy, P. E. (2000). Investigating some neglected criteria: The influence of organizational level and perceived system knowledge on appraisal reactions. Journal of Business and Psychology, 14, 501–513.
Woehr, D. J., Sheehan, M. K., & Bennett, W., Jr. (1999, April). Understanding disagreement across rating sources: An assessment of the measurement equivalence of raters in 360 degree feedback systems. Poster presented at the 14th Annual Conference of the Society for Industrial and Organizational Psychology, Atlanta, GA.
Yammarino, F., & Atwater, L. (1997). Do managers see themselves as others see them? Implications of self-other rating agreement for human resource management. Organizational Dynamics, 25(4), 35–44.