PERSONNEL PSYCHOLOGY 1988, 41
A META-ANALYSIS OF SELF-SUPERVISOR, SELF-PEER, AND PEER-SUPERVISOR RATINGS
MICHAEL M. HARRIS, JOHN SCHAUBROECK Krannert Graduate School of Management
Purdue University
Reviews of self-supervisor, self-peer, and peer-supervisor ratings have generally concluded that there is at best a modest correlation between different rating sources. Nevertheless, there has been much inconsistency across studies. Accordingly, a meta-analysis was conducted. The results indicated a relatively high correlation between peer and supervisor ratings (ρ = .62) but only a moderate correlation between self-supervisor (ρ = .35) and self-peer ratings (ρ = .36). While rating format (dimensional versus global) and rating scale (trait versus behavioral) had little impact as moderators, job type (managerial/professional versus blue-collar/service) did seem to moderate self-peer and self-supervisor ratings.
The use of multiple sources for performance ratings has gained considerable acceptance in the past two decades. Numerous advantages of using multiple raters have been cited: for example, enhanced ability to observe and measure various job facets (Borman, 1974; Henderson, 1984), greater reliability, fairness, and ratee acceptance (Latham & Wexley, 1982), and improved defensibility of the performance appraisal program from a legal standpoint (Bernardin & Beatty, 1984). The literature on multiple raters has focused on self-ratings. A number of scholars have argued that self-ratings can promote personal development, improve communication between supervisors and subordinates, and clarify differences of opinion between supervisors and other managers (Carroll & Schneier, 1982; Cummings & Schwab, 1973).
Despite the alleged gains from self-ratings, empirical research shows frequent lack of agreement between self-ratings and those provided by other sources. Mabe and West (1982) reviewed a number of studies and found, on average, a low correlation between self-ratings and others' ratings, including supervisor and peer appraisals. Thornton (1980), in his summary of self-ratings, concluded that "individuals have a significantly different view of their own job performance than that held by other people" (p. 268). Nevertheless, both Mabe and West and Thornton found that research often
The authors were equal contributors to this manuscript; order of listing was determined alphabetically.
The authors would like to thank Michael A. Campion for his helpful comments.
Correspondence and requests for reprints should be addressed to Michael M. Harris, Krannert Graduate School of Management, Purdue University, West Lafayette, IN 47907.
COPYRIGHT © 1988 PERSONNEL PSYCHOLOGY, INC.
led to conflicting results. Indeed, a casual inspection of research on self-supervisor ratings shows a wide range of findings. For example, Williams and Seiler (1973) reported an average self-supervisor correlation of .60; Pym and Auld (1965) found an average self-supervisor correlation of .56 across three independent studies. Conversely, Klimoski and London (1974) reported an average self-supervisor correlation of .05, and Ferris, Yates, Gilmore, and Rowland (1985) found a self-supervisor correlation of .02. Others (e.g., Heneman, 1974) have found moderate correlations between self-supervisor ratings. Thus, no firm conclusions can be drawn regarding the extent of self-supervisor agreement.
Low correlations between different raters are not unique to comparisons with self-ratings. On the one hand, some researchers have observed relatively high peer-supervisor correlations: for example, Lawler (1967) reported an average correlation of .57. On the other hand, others have found much lower correlations between peer and supervisor performance ratings: Springer (1953), for example, reported an average correlation of .28, and Tucker, Cline, and Schmitt (1967) reported .21. Thus, Landy and Farr (1980) have concluded that different raters exhibit moderate to low agreement. Moreover, they have suggested that the limited correspondence between self-ratings and those from other sources is not due to any unique differences beyond that to be expected for multiple raters in general.
It is important to note that a number of scholars have asserted that the general lack of convergence between different raters is neither surprising nor problematic. Supporters of this position argue that different raters may observe different dimensions of performance, or have different definitions of effective performance and, thus, arrive at different assessments of the same individual’s performance (Borman, 1974; Landy & Farr, 1980). As Wexley and Klimoski (1984) pointed out, however, a definitive test of this hypothesis has not been conducted.
A review of the literature suggests that a number of explanations have been offered as to why low correlations exist between different raters. These can be sorted into at least three categories: (1) egocentric bias, (2) differences in organizational level, and (3) observational opportunities. What follows is a brief discussion of each of these explanations.
Egocentric Bias
While several different explanations fall under this rubric, the underlying premise is that ratee (or self-) ratings are biased in some fashion, while other raters (e.g., peers and supervisor) share a set of common perceptions. One of the most commonly discussed versions of egocentric bias is the main effect of defensiveness, wherein a self-rater is inclined to inflate his/her rating in order to enhance the evaluation (Holzbach, 1978;
Steel & Ovalle, 1984). This would lead the self-ratings to have a restricted range, thereby attenuating the correlation between self- and others’ ratings. Correlations between others’ ratings (e.g., peers and supervisors) would be considerably higher as there would be no range restriction. However, correcting for range restriction should eliminate any differences between correlations based on self- and others’ ratings and the correlations based on different observers’ ratings.
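The attenuation-and-recovery logic of this main-effect argument can be sketched numerically. In the illustration below, all values and function names are ours (hypothetical, not from the paper); the formulas are the standard direct range-restriction relations discussed by Hunter, Schmidt, and Jackson (1982).

```python
import math

def restrict(rho, u):
    """Observed correlation when one variable's spread is compressed to a
    fraction u of its unrestricted standard deviation (direct restriction)."""
    return u * rho / math.sqrt(1 - rho**2 * (1 - u**2))

def correct(r, u):
    """Standard range-restriction correction: recovers the unrestricted value."""
    big_u = 1.0 / u
    return big_u * r / math.sqrt(1 + r**2 * (big_u**2 - 1))

# Hypothetical: a "true" self-other correlation of .62, with defensiveness
# compressing self-rating spread to 75% of the others' spread.
rho_true, u = 0.62, 0.75
r_observed = restrict(rho_true, u)    # attenuated by the restricted range
r_corrected = correct(r_observed, u)  # correction restores rho_true
```

If defensiveness operated only as a main effect, applying such a correction to self-other correlations should bring them into line with peer-supervisor correlations.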
A second version of the egocentric bias explanation assumes that ratee defensiveness may be moderated by other variables, such as self-esteem (Baird, 1977; Kay, Meyer, & French, 1965). If the impact of ratee defensiveness depends on a third variable, such as self-esteem, the order of self-ratings may be affected. That is, high self-esteem ratees may inflate their ratings, while low self-esteem ratees may not. If the third variable is not taken into account, correcting for range restriction will not completely eradicate the egocentric bias, and correlations between observer ratings would remain higher than correlations between self- and others' ratings.
A third type of egocentric bias is posited by attribution theory (DeVader, Bateson, & Lord, 1986; Jones & Nisbett, 1972). According to this theory, actors (i.e., self-raters) attribute good performance to their own behavior and poor performance to environmental factors. Conversely, observers (e.g., peers and supervisors) attribute good performance to environmental factors and poor performance to the actors' dispositions. Hence, this theory would suggest that self-ratings will correlate poorly with ratings provided by other sources. Correcting for range restriction should have little effect. However, there should be substantial agreement between observers (i.e., peers and supervisors).
In addition to predictions about the correlations, all three versions of the egocentric bias explanation would predict differences in mean ratings. Specifically, self-ratings should be significantly higher than either peer or supervisor ratings.
Differences in Organizational Level
Two versions of the organizational-level explanation for rater disagreement exist. Some researchers (e.g., Klimoski & London, 1974; Zammuto, London, & Rowland, 1982) have asserted that raters at different levels weight performance dimensions differently. This suggests that raters at different levels (e.g., peers and supervisors) disagree on the overall rating, but they presumably would agree on dimensional ratings. A second version of this explanation maintains that raters at different levels define and measure performance differently (e.g., Borman, 1974; Landy, Farr, Saal, & Freytag, 1976). This latter hypothesis would suggest that raters at different levels disagree on both dimensions and overall ratings. At the
same time, both explanations suggest that raters on the same level (e.g., self and peers) would provide similar ratings. Note that neither version of the organizational-level explanation would necessarily suggest differences in the mean ratings of self and others.
Observational Opportunities
A third explanation for lack of agreement across different raters concerns differing observational opportunities. Specifically, peers are hypothesized to have more opportunities to observe ratees, and at more revealing times, than do supervisors (Latham & Wexley, 1982; Lawler, 1967). This explanation implies that supervisors disagree with a ratee because they have few opportunities to observe the individual's performance, and the behaviors they do observe do not reflect true performance levels. Thus, one would expect self-supervisor and peer-supervisor ratings to exhibit far lower correlations than peer-self ratings. However, as with the organizational-level explanation, no differences between mean self-, peer, and supervisor ratings would be expected.
While the three basic explanations described above have been presented as mutually exclusive, it is possible that several are operating simultaneously. For example, Latham and Wexley (1982) provide an observational-opportunities explanation for the lack of agreement between peers and supervisors and suggest that defensiveness may be the cause of peer-self disagreement. Accordingly, while a direct test of these alternative hypotheses is not possible without measurement of proposed moderators, comparison of average correlations for different rater pairs can provide some direction and insight into which explanation may account for low correlations between raters.
The purpose of this paper was to conduct a quantitative review of empirical research on rater agreement. While a number of qualitative reviews have been reported (e.g., Landy & Farr, 1980; Thornton, 1980), a definitive estimate of rater agreement has not been ascertained.¹ Furthermore, the apparent lack of consistent findings regarding rater agreement may be explained by sampling error, as many of these studies are based on small sample sizes. For example, Pym and Auld (1965) had an average of 70 subjects in each sample; Ferris et al. (1985) reported an N of 81. Thus, one explanation for the lack of consistency from study to study is that sampling error is affecting the results.
A second explanation for the apparent lack of consistency between studies is the nature of the job. That is, blue-collar and service jobs are relatively routine, and performance is well defined compared with managerial or professional jobs (Campbell, Dunnette, Lawler, & Weick, 1970; Cascio, 1987). Hence, it is to be expected that there will be a higher correlation between different raters for blue-collar/service jobs than for managerial/professional jobs.

¹The one quantitative review (Mabe & West, 1982) did not separate out self-peer and self-supervisor correlations, nor did it examine peer-supervisor correlations.
Two other moderators are predicted by the explanations discussed above. As described by the first organizational-level hypothesis, different raters may weight dimensions differently. If this is the case, one would expect greater agreement on dimensional ratings than on global ratings. Hence, another potential moderator is rating format (dimensional versus global). Second, rating scale may be an important moderator, whereby rater agreement will be higher for behavioral than for trait scales. This follows from the second version of the organizational-level hypothesis, which asserts that different raters may have different definitions of effective job performance (Borman, 1974; Zedeck & Baker, 1972). Hence, the more well defined and behaviorally based the rating scale, the greater the agreement expected between different raters.
In summary, the present study used meta-analysis to review the literature regarding the correlation between self-supervisor, self-peer, and peer-supervisor ratings. Specifically, the questions of interest were as follows:
1. (a) Overall, what is the average correlation between self-supervisor, self-peer, and peer-supervisor ratings? (b) What is the average difference between supervisor, peer, and self-ratings? (c) Will the data support an egocentric-bias, organizational-level, or observational-opportunities explanation for rater disagreement?
2. Can differences between studies within each rater pair be attributed to statistical artifacts such as sampling error?
3. Assuming that differences between studies cannot be explained by statistical artifacts, can the variance be explained by moderators such as job type (managerial/professional versus blue-collar/service), rating format (dimensional versus global), and rating scale (behavioral versus trait)?
Method
Analysis
The Hunter, Schmidt, and Jackson (1982) meta-analytic procedure designed to correct for unreliability and range restriction with sporadic information was employed for the correlations. This procedure involves a sequence of computations. In step 1, the sample-weighted average correlation and the sample-weighted variance of the correlations are calculated.
Next, the expected variance due to sampling error is determined, and this is subtracted from the observed variance.
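Step 1 can be sketched as follows. The study data and helper name are ours (purely illustrative), and the sampling-error term uses the common (1 − r̄²)²/(N̄ − 1) approximation from Hunter et al. (1982); their book gives the full procedure.

```python
def step1(rs, ns):
    """Sample-size-weighted mean correlation, weighted observed variance,
    and residual variance after removing expected sampling-error variance.
    rs: observed correlations; ns: sample sizes (one per study)."""
    total_n = sum(ns)
    r_bar = sum(n * r for r, n in zip(rs, ns)) / total_n
    var_obs = sum(n * (r - r_bar) ** 2 for r, n in zip(rs, ns)) / total_n
    n_bar = total_n / len(rs)                      # average sample size
    var_err = (1 - r_bar ** 2) ** 2 / (n_bar - 1)  # expected sampling-error variance
    return r_bar, var_obs, var_obs - var_err       # residual ("true") variance

rs = [0.60, 0.05, 0.35, 0.02, 0.56]  # hypothetical study correlations
ns = [70, 81, 120, 95, 60]           # hypothetical sample sizes
r_bar, var_obs, var_res = step1(rs, ns)
```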
Step 2 of the meta-analysis involves estimating reliability and range restriction across the variables. Once this is done, the average correlation is corrected for unreliability and range restriction, as is the variance.² The formulas for these estimates may be found in Hunter et al. (1982). In accord with the main effect defensiveness hypothesis, self-other correlations were corrected for both of these artifacts. However, since there was no theoretical reason to correct for range restriction for peer-supervisor correlations, only corrections for measurement error were performed for this category.
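A rough sketch of these step-2 corrections follows. The function names are ours, the inputs are hypothetical values loosely patterned on the self-supervisor row of Table 1, and the simple disattenuate-then-correct ordering used here is not the exact sequencing in Hunter et al. (1982), so it will not reproduce the paper's reported corrected values precisely.

```python
import math

def disattenuate(r, rxx, ryy):
    """Correct an observed correlation for unreliability in both measures."""
    return r / math.sqrt(rxx * ryy)

def correct_range_restriction(r, u):
    """Correct for direct range restriction. Here u is the ratio of the
    restricted SD to the unrestricted SD (e.g., SD of self-ratings over
    SD of others' ratings)."""
    big_u = 1.0 / u
    return (big_u * r) / math.sqrt(1 + r * r * (big_u * big_u - 1))

# Hypothetical inputs in the spirit of Table 1's self-supervisor row
r_obs = 0.22           # sampling-error-corrected mean correlation
rxx, ryy = 0.90, 0.91  # average reliabilities of the two rating sources
u = 0.78               # SD ratio (self vs. supervisor)

rho = correct_range_restriction(disattenuate(r_obs, rxx, ryy), u)
```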
The third step examines whether variation between studies is due to artifacts or is caused by moderator variables. The 25% rule was used, whereby a moderator search was initiated if the ratio of unexplained variance (i.e., variance remaining after correcting for sampling error, measurement error, and range restriction) to total variance was greater than 25%.
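As a one-line decision rule (the helper name is ours, not from the paper):

```python
def needs_moderator_search(unexplained_var, total_var, threshold=0.25):
    """The '25% rule': initiate a moderator search when the ratio of
    unexplained (residual) variance to total variance exceeds 25%."""
    return (unexplained_var / total_var) > threshold

# Table 1 reports 33%, 86%, and 82% unexplained variance for the three
# rater pairs, so all three trigger a moderator search under this rule.
flags = [needs_moderator_search(p, 1.0) for p in (0.33, 0.86, 0.82)]
```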
The fourth step involves testing for moderators. Towards this end, effect sizes are subgrouped and a separate analysis is performed on each subgroup. Supportive evidence for the hypothesized moderator variable exists when (1) the mean correlation differs from subgroup to subgroup and (2) the corrected variance is lower on average in the subgroups than for the overall set.
Mean differences between peer, supervisor, and self-ratings were converted into a d statistic. Procedures described by Hunter et al. (1982) were then used to obtain the average effect size and true variance.³ The d statistic was calculated such that the first rater in each rater pair (e.g., self in self-supervisor ratings) was associated with the higher ratings.
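These steps can be sketched as follows (hypothetical data and our own function names; the sampling-error variance of d uses the (4/N̄)(1 + d̄²/8) approximation associated with Hunter et al., 1982):

```python
def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference, signed so a positive d means the
    first rater (e.g., self) provided the higher ratings."""
    sp = (((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)) ** 0.5
    return (mean1 - mean2) / sp

def average_d(ds, ns):
    """Sample-size-weighted mean effect size, plus observed variance minus
    the variance expected from sampling error ("true" variance)."""
    total_n = sum(ns)
    d_bar = sum(n * d for d, n in zip(ds, ns)) / total_n
    var_obs = sum(n * (d - d_bar) ** 2 for d, n in zip(ds, ns)) / total_n
    n_bar = total_n / len(ds)
    var_err = (4.0 / n_bar) * (1 + d_bar**2 / 8)
    return d_bar, var_obs - var_err

# Hypothetical self-supervisor pair: self mean 4.0, supervisor mean 3.5
d = cohens_d(4.0, 1.0, 50, 3.5, 1.0, 50)           # positive: self rated higher
d_bar, true_var = average_d([0.7, 0.3], [100, 100])
```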
Sample
The search for studies to be used in the meta-analysis was confined to journal articles and conference papers in published proceedings. The search began with manual inspection of each of the following journals for the past thirty years: Academy of Management Journal (1958-1986), Journal of Applied Psychology (1956-1986), Journal of Educational Psychology (1956-1986), Journal of Occupational Psychology (1975-1986), Journal of Vocational Behavior (1971-1986), Organizational Behavior and Human Decision Processes (formerly Organizational Behavior and Human Performance) (1966-1986), and Personnel Psychology (1956-1986). Conference papers from the Midwest and National proceedings of the Academy of Management were also examined. Computer-based searches were conducted on PsycINFO (Psychological Abstracts) for the period since 1967 and the ABI/INFORM database for the period since 1975. References for other studies were obtained from these sources to expand the research base.

²Note that the source of the range restriction differs from the traditional situation wherein range restriction is due to a lack of complete data. For the present situation, we computed a ratio of the standard deviation of self-ratings to the standard deviation of others' ratings (either peers or supervisors). This ratio is referred to as u in Hunter et al. (1982) and is used along with other formulas to correct for range restriction.

³Because Hunter et al. (1982) do not provide formulas for correcting for measurement error with sporadic information, we used the average reliabilities listed in Table 1.
Three judgments were made concerning the collection and analysis of data. First, only those reliability estimates based on accepted formulas were selected to determine the mean reliabilities across studies. This excluded, among other estimates, unadjusted average inter-item correlation coefficients. Second, laboratory studies were deliberately excluded from the analysis. This was done because the conditions under which laboratory ratings occur grossly exaggerate the observation time factor. More importantly, assigned roles cannot be taken to reflect true perspectives of supervisors and peers (Ilgen & Favero, 1985). Finally, some studies contained multiple measures of performance. Since these measures were typically not independent, the effect sizes were averaged and the mean effect size was used in the meta-analysis.
Results
The literature search produced a total of 36 independent self-supervisor correlations, 23 independent peer-supervisor correlations, and 11 independent self-peer correlations (approximately 5 studies did not provide either means or correlations). Table 1 shows the average correlations, reliabilities, and range restriction for the three rater combinations. The reliabilities are based on internal consistency (alpha) measures. It is noteworthy that the average reliabilities are substantially higher than the .60 figure used by Schmidt and his colleagues (e.g., see Pearlman, Schmidt, & Hunter, 1980). There are three reasons for using the more conservative figures found here. First, Schmidt and his colleagues based their reliability estimates on the correlations between ratings from different supervisors at different points in time. In the present study, only internal consistency measures were found. Second, the .60 figure used by other researchers assumes that one source of error is the use of different raters. In the present case, at least for self-ratings, the rater did not change. Hence, a reliability of .60 would be an underestimate. Finally, there has been considerable debate over the accuracy of .60 as an estimate of reliability (Sackett, Tenopyr, Schmitt, & Kehoe, 1985; Schmitt & Schneider, 1983). Use of the higher, and hence more conservative, figures therefore seemed in order.
TABLE 1
Meta-analysis Results for Correlations

                     Total    No. of                                   Mean           Standard    90% confidence interval   % Unexplained
Rater combination    sample   correlations   āᵃ       b̄ᵃ       uᵃ      correlationᵇ   deviation   Lower bound  Upper bound  variance
Self-supervisor      3,957    36             .90(7)   .91(7)   .78(8)  .35(.22)       .11(.12)    .17          .53          33
Peer-supervisorᶜ     2,643    23             .87(5)   .89(5)   NA      .62(.48)       .24(.22)    .22          1.00         86
Self-peer            984      11             .90(7)   .87(5)   .84(3)  .36(.24)       .19(.13)    .05          .67          82

ᵃā, b̄, and u represent, respectively, the average reliability for the first rater in each combination, the average reliability for the second rater in each combination, and the ratio of the standard deviation of the first rater to the standard deviation of the second rater in each pair (see Hunter et al., 1982, for further explanation). Figures in parentheses indicate the number of studies involved in the calculations.
ᵇNumbers in parentheses include corrections for sampling error; numbers not in parentheses are corrected for all possible artifacts.
ᶜMeans and variance corrected for measurement error; see text for explanation.
Because there was only one self-peer study reporting reliabilities, es- timates of peer-rating reliability were obtained from the peer-supervisor studies; likewise, estimates of self-rating reliability were taken from self- supervisor studies.
As shown in Table 1, all three rater combinations had lower 90% confidence intervals that were greater than zero. Hence, the correlation between ratings from different raters is statistically significant. However, there were marked differences between rater combinations: The peer-supervisor correlation (ρ = .62) was substantially higher than either the self-supervisor (ρ = .35) or the self-peer correlation (ρ = .36). Finally, all three rater combinations contained more than 25% unexplained variance even after sampling error, measurement error, and range restriction were taken into account; accordingly, moderator variables were examined.
Table 2 summarizes the moderator analysis for self-supervisor correlations. Consistent with the hypothesis, there was slightly more agreement for dimensional (ρ = .36) than for global (ρ = .29) ratings. However, there was no decrease in variance across the subgroups. In accord with the rating-scale hypothesis, there was also a small difference between trait and behavioral scales, wherein greater agreement occurred for the latter. While the ratio of unexplained to total variance was less than 25% (15%) for behavioral ratings, there was no decrease in sample-size weighted variance averaged across subgroups compared with the whole set. Thus, rating scale and rating format were not significant moderators for self-supervisor correlations. However, job type appeared to be a significant moderator. When sampling error and other artifacts were taken into account, no variance remained for managerial/professional employees. Although substantial variance remained for blue-collar/service workers, the average corrected variance across the subgroups was lower than for the set as a whole. Moreover, there was lower agreement for managerial/professional employees (ρ = .27) than for blue-collar/service employees (ρ = .42). Hence, job type did seem to moderate self-supervisor correlations.
Table 3 summarizes the results of the moderator analysis for peer-supervisor correlations. Contrary to expectations, slightly higher agreement was observed for global than for dimensional ratings; nonetheless, there was no decrease in variance across subsets. More surprising was the finding that greater agreement was obtained for trait (ρ = .64) than for behavioral ratings (ρ = .53). There was only a slight reduction in the sample-size weighted average variance across subsets as compared with the variance in the whole set (σ = .23 across subsets as compared with .24 for the whole set). Finally, job type did not seem to affect the amount of agreement.
Due to the small number of studies (2) using either behaviorally based rating scales or dimensional formats, these variables were not investigated as moderators for self-peer correlations. Thus, only job type was considered; the results are shown in Table 4.

TABLE 2
Moderator Analysis Results: Self-Supervisor Correlations

                             Total    No. of         Mean           Standard    90% confidence interval   % Unexplained
Moderator                    sample   correlations   correlationᵃ   deviation   Lower bound  Upper bound  variance
Rating formatᵇ
  Dimensional                2,542    29             .36(.23)       .16(.15)    .10          .62          50
  Global                     2,838    24             .29(.18)       .16(.14)    .03          .55          58
Rating scale
  Trait                      2,899    21             .32(.21)       .12(.12)    .12          .52          46
  Behavioral                 860      13             .43(.28)       .08(.13)    .30          .56          15
Job type
  Managerial/professional    1,935    11             .27(.17)       .00(.04)    .27          .27          00
  Blue-collar/service        2,022    25             .42(.27)       .16(.15)    .16          .68          45

ᵃNumbers in parentheses include corrections for sampling error; numbers not in parentheses are corrected for all possible artifacts.
ᵇCombined sample size is larger than the total sample since both dimensional and global ratings are included from the same studies.

TABLE 3
Moderator Analysis Results: Peer-Supervisor Correlations

                             Total    No. of         Mean           Standard    90% confidence interval   % Unexplained
Moderator                    sample   correlations   correlationᵃ   deviation   Lower bound  Upper bound  variance
Rating formatᵇ
  Dimensional                1,331    14             .57(.44)       .23(.20)    .19          .95          80
  Global                     1,914    15             .65(.50)       .25(.21)    .24          1.00         86
Rating scale
  Trait                      2,137    18             .64(.49)       .24(.23)    .24          1.00         88
  Behavioral                 506      5              .53(.41)       .20(.18)    .20          .86          80
Job type
  Managerial/professional    905      8              .64(.48)       .24(.21)    .24          1.00         80
  Blue-collar/service        1,738    15             .62(.47)       .27(.23)    .17          1.00         87

ᵃNumbers in parentheses include corrections for sampling error; numbers not in parentheses are corrected for all possible artifacts.
ᵇCombined sample size is larger than the total sample since both dimensional and global ratings are included from the same studies.

TABLE 4
Moderator Analysis Results: Self-Peer Correlationsᵃ

                             Total    No. of         Mean           Standard    90% confidence interval   % Unexplained
Moderator                    sample   correlations   correlationᵇ   deviation   Lower bound  Upper bound  variance
Job type
  Managerial/professional    534      6              .31(.20)       .00(.00)    .31          .31          00
  Blue-collar/service        380      5              .40(.26)       .28(.19)    -.06         .86          14

ᵃThere were too few studies to test either rating format or rating scale as moderators.
ᵇNumbers in parentheses include corrections for sampling error; numbers not in parentheses are corrected for all possible artifacts.
As in the self-supervisor category, little or no variance remained for either subgroup of self-peer correlations when sampling error and other artifacts were taken into account. Accordingly, there was a reduction in average variance across subsets as compared with the whole set, while the average correlation was higher for blue-collar/service workers (ρ = .40) than for managerial/professional employees (ρ = .31). Thus, job type did meet the criteria for acceptance as a moderator.
A total of 18 independent self-supervisor means, 7 independent peer-supervisor means, and 4 independent self-peer means were located. As predicted by the egocentric-bias explanation, there was a fairly large difference between self- and supervisor ratings (d = .70). However, the true variance was also large (σ² = .25). Thus, the difference in means between self- and supervisor ratings was not statistically significant. Similarly, while on average self-ratings were somewhat higher than peer ratings (d = .28), a substantial amount of variance remained even after correcting for sampling error (σ² = .09). Hence, the difference in mean self-peer ratings was nonsignificant. Finally, while peers provided somewhat higher ratings than did supervisors (d = .23), the difference was not significant, given the large variance (σ² = .11).⁴
Discussion
The first question of interest concerned the average correlation and mean difference between rater pairs. Overall, the data suggest that after correcting for measurement error and range restriction, moderate agreement between self-peer (ρ = .36) and self-supervisor (ρ = .35) ratings exists. There was much higher agreement between peers and supervisors (ρ = .62). Nonetheless, the lower 90% confidence interval was greater than zero for all three rater combinations. The results of this study then are somewhat contrary to conclusions by Landy and Farr (1980). As noted earlier, they concluded that different raters exhibit low to moderate agreement. The present results indicate that while this is true for self-ratings compared with ratings by others, peer-supervisor ratings demonstrate a far higher average correlation. This discrepancy between a meta-analytic-based literature review and a traditional narrative review is not surprising, given the differences between the two procedures. The present results indicate the value of conducting quantitative reviews.
⁴A moderator analysis subgrouping by job type for self-supervisor means revealed no reduction in variance.
In terms of mean differences, on average, self-ratings were over a half standard deviation higher than supervisor ratings and approximately one-quarter of a standard deviation higher than peer ratings. Nevertheless, given the large variances, the lower-bound confidence intervals included zero in both cases. However, examination of the effect sizes showed that in only one case were the self-ratings lower than supervisor ratings.
The second question concerned whether or not differences between studies within each rater pair could be attributed to statistical artifacts. The results indicated that in all cases substantial amounts of variance still remained, even when the appropriate corrections were made. Accordingly, moderator analyses were conducted.
Of the three moderators, only job type showed any meaningful effects. Specifically, self-supervisor and self-peer correlations were lower for managerial/professional employees than for blue-collar/service employees, and no true variance existed for the former category. However, such effects were not obtained for peer-supervisor correlations. This suggests that while incumbents of managerial/professional jobs have very different views of their performance than do others, observers exhibit far more agreement. One possible explanation is that egocentric bias is more likely to occur in ambiguous contexts (i.e., managerial/professional jobs) than in well-defined jobs (i.e., blue-collar/service). Whatever the explanation, job type does not appear to affect observer agreement; further research is needed to test why this is the case. The one instance where rating scale met the criteria for a moderator should be viewed cautiously for several reasons. First, only a small number of studies (5) using behaviorally based rating scales were available, and the total sample size was small (n = 506). Second, the decrease in amount of variance across subsets was very small. Thus, rating scale was at best a relatively minor moderator of peer-supervisor agreement. In answer to the third question then, moderators did help account for some of the variance.
Recall earlier that three basic explanations were reviewed as to why disagreement between raters may occur. The present findings are consistent with two versions of the egocentric-bias theory. Specifically, in accord with attribution theory (DeVader et al., 1986; Jones & Nisbett, 1972), observers (i.e., peer-supervisor combinations) displayed higher agreement than did "actors" with observers (i.e., self-supervisor and self-peer combinations). The findings also mesh with the moderated defensiveness explanation: even after the range restriction corrections, observer-observer ratings (i.e., peer-supervisor combinations) demonstrated much higher agreement than self-observer ratings.
Conversely, there was no support for the main-effect-of-defensiveness explanation, as correcting for range restriction did not reduce the difference between the self-peer/self-supervisor correlations and the peer-supervisor correlation. Nor was there evidence that observational differences accounted for rater disagreement, given the relatively high correlation between peer and supervisor ratings. Finally, there was no support for the organizational-level hypothesis: an almost identical correlation was found between ratings by individuals at the same level (self-raters and peers) and individuals at different levels (self-raters and supervisors). Moreover, the highest correlation was found for raters at different levels (peer-supervisor). Nevertheless, further research is needed to test each possible explanation of rater disagreement more directly.
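The range restriction corrections referred to here are not spelled out in this section; assuming the standard Thorndike Case II formula used in validity generalization work (Hunter, Schmidt, & Jackson, 1982), the corrected correlation is:

```latex
r_c = \frac{r\,U}{\sqrt{1 + r^2\left(U^2 - 1\right)}},
\qquad U = \frac{\sigma_{\text{unrestricted}}}{\sigma_{\text{restricted}}}
```

where \(r\) is the correlation observed in the restricted sample, \(U\) is the ratio of the unrestricted to the restricted standard deviation on the rating, and \(r_c\) is the corrected estimate. Since \(U > 1\) under restriction, the correction can only raise a correlation, which is why a defensiveness main effect would have been expected to shrink the self-observer gap after correction.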
It is also noteworthy that, with one minor exception, neither rating format nor rating scale was a significant moderator. Other meta-analyses of performance ratings have reached similar conclusions. For example, Kraiger and Ford (1985) found that rating scale did not moderate ratee race effects, and Heneman (1986) likewise reported a lack of support for rating format as a moderator. Hence, despite expectations that instrumentation would be a critical moderator, research has found it less useful in identifying sources of variation than was previously thought. Conversely, the findings regarding job type suggest that it may be a moderator of importance that other researchers in performance appraisal should examine.
Although there are a number of potential reasons why a substantial amount of variance between studies remained in most instances, two seem most plausible. First, it seems likely that studies differed in regard to criterion contamination or rater independence. Specifically, studies may have differed as to how much raters knew about others' ratings. In some studies, subtle forms of criterion contamination may have existed. For example, it is possible that a few studies involving self-ratings were conducted shortly after performance appraisal reviews, wherein ratees had substantial information regarding supervisor ratings. In other instances, employees may have received informal feedback, while in yet other studies employees may have had little or no feedback of any sort. Unfortunately, very few studies report enough detail about the amount or type of feedback to test whether this variable moderates rater agreement.
A second explanation for variance between studies may be differences in raters’ opportunity to observe job performance. For example, workers may have more interaction with peers in some cases than in others, thereby leading to differential agreement with supervisors. Jobs with much task interdependence may lead to greater agreement between self- and peer ratings than jobs where workers perform independently.
Other potential moderators include the purpose of the rating (e.g., Zedeck & Cascio, 1982), the amount of rater training (e.g., Smith, 1986), and rater motivation (Bernardin & Beatty, 1984). However, a review of our studies showed that these variables were also rarely, if ever, reported. Hence, no moderator analyses could be performed. Moreover, McIntyre, Smith, and Hassett (1984) found that purpose of rating had little or no effect on ratings, and many studies have failed to show that training increases the accuracy of performance ratings, particularly when only a lecture approach is used (Smith, 1986). Further theory development and empirical research will be necessary to achieve greater knowledge about which conditions affect rater agreement.
Although one can always argue that a "file-drawer" problem may exist such that inclusion of missing studies would change the results, this is unlikely here. First, an extensive literature review was conducted. Second, Rosenthal's (1979) index suggests that it would take many studies to dramatically alter the results, given the size of the correlations and the total sample size found here.
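Rosenthal's (1979) index is the "fail-safe N": the number of unretrieved null-result studies that would be needed to pull the combined significance level below criterion. As an illustrative sketch (the function name and example values are hypothetical, not drawn from the present data):

```python
def fail_safe_n(z_values, z_crit=1.645):
    """Rosenthal's (1979) fail-safe N: how many file-drawer studies
    averaging z = 0 would reduce the combined Stouffer Z,
    sum(z) / sqrt(k + N), below the one-tailed .05 criterion."""
    sum_z = sum(z_values)
    return sum_z ** 2 / z_crit ** 2 - len(z_values)

# Ten retrieved studies, each with z = 3.0, would tolerate roughly
# 323 unpublished null studies before losing significance:
# fail_safe_n([3.0] * 10) ≈ 322.6
```

The larger the observed correlations and total sample size, the larger this number becomes, which is the basis for the claim above.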
Future research needs to address two issues. First, a more direct test of the competing hypotheses is needed. That is, a study should be conducted wherein different raters (e.g., peers, supervisors, and self-raters) assess performance and provide information regarding moderators. For example, attributions regarding the cause of performance, self-esteem, and amount of observational opportunity, as well as other possible factors, would be measured.
Second, depending on the outcome of this research, further work on ways to ameliorate rater bias would be useful. For example, if differential attributions are a source of conflict, the Performance Distribution Assessment (PDA), described by Bernardin and Beatty (1984), might provide a vehicle for increasing rater agreement. Specifically, the PDA attempts to take situational constraints into account. Research should also focus on identifying possible moderators of self-other agreement. For instance, impression management may be a variable that can account for self-supervisor and self-peer disagreement (Zerbe & Paulhus, 1987). Clearly, more research examining a variety of individual differences is in order.
In terms of practical implications, the results suggest that self-ratings will generally show only moderate correlations with ratings by others; this is particularly true for managerial/professional employees. Practitioners considering the use of self-ratings should be aware that much disagreement is likely. Conversely, peer-supervisor ratings often (but not always) reflect relatively high rater agreement, even for managerial/professional jobs. For legal purposes, then, peer-supervisor ratings may be useful. Using raters from different levels may also help to develop consensus, eliminate biases, and perhaps in turn lead to greater acceptance by ratees. Nevertheless, a host of other problems is associated with using peer ratings (Latham & Wexley, 1982).
In conclusion, the present results show that peer-supervisor ratings demonstrate a relatively high correlation, whereas self-supervisor and self-peer ratings exhibit only moderate correlations. Compared with the correlation between supervisor ratings and results-oriented outcomes (production, sales, etc.; Heneman, 1986), the average correlation obtained here between supervisor and peer ratings is substantially higher. Thus, from both an absolute and a relative point of view, there is high convergence between observer ratings. The moderator analysis showed that correlations between self-peer and self-supervisor ratings are particularly low for managerial/professional workers, whereas peer-supervisor correlations are not. While these findings are consistent with an egocentric-bias explanation for rater disagreement, future research should test the alternative explanations described earlier in a more direct fashion than was possible here.
REFERENCES
Baird LS. (1977). Self and superior ratings of performance: As related to self-esteem and satisfaction with supervision. Academy of Management Journal, 20, 291-300.
Bernardin HJ, Beatty RW. (1984). Performance appraisal: Assessing human behavior at work. Boston, MA: Kent Publishing Company.
Borman WC. (1974). The rating of individuals in organizations: An alternate approach. Organizational Behavior and Human Performance, 12, 105-124.
Campbell JP, Dunnette MD, Lawler EE, Weick KE. (1970). Managerial behavior, performance, and effectiveness. New York: McGraw-Hill.
Carroll SJ, Schneier CE. (1982). Performance appraisal and review systems: The identification, measurement, and development of performance in organizations. Glenview, IL: Scott, Foresman.
Cascio WF. (1987). Applied psychology in personnel management. Englewood Cliffs, NJ: Prentice-Hall.
Cummings LL, Schwab DP. (1973). Performance in organizations: Determinants and appraisal. Glenview, IL: Scott, Foresman.
DeVader CL, Bateson AG, Lord RG. (1986). Attribution theory: A meta-analysis of attributional hypotheses. In Locke E (Ed.), Generalizing from laboratory to field settings (pp. 63-81). Lexington, MA: Lexington Books.
Ferris GR, Yates VL, Gilmore DC, Rowland KM. (1985). The influence of subordinate age on performance ratings and causal attributions. PERSONNEL PSYCHOLOGY, 38, 545-557.
Henderson RI. (1984). Performance appraisal. Reston, VA: Reston Publishing Company.
Heneman HG III. (1974). Comparisons of self- and superior ratings of managerial performance. Journal of Applied Psychology, 59, 638-642.
Heneman RL. (1986). The relationship between supervisory ratings and results-oriented measures of performance: A meta-analysis. PERSONNEL PSYCHOLOGY, 39, 811-826.
Holzbach RL. (1978). Rater bias in performance ratings: Superior-, self-, and peer ratings. Journal of Applied Psychology, 63, 579-588.
Hunter JE, Schmidt FL, Jackson GB. (1982). Meta-analysis: Cumulating research findings across studies. Beverly Hills, CA: Sage Publications.
Ilgen DR, Favero JL. (1985). Limits in generalization from psychological research to performance appraisal processes. Academy of Management Review, 10, 311-321.
Jones EE, Nisbett RE. (1972). The actor and the observer: Divergent perceptions of the causes of behavior. In Jones EE, Kanouse DA, Kelly HH, Nisbett RE, Valins S, Weiner B (Eds.), Attribution: Perceiving the causes of behavior (pp. 79-94). Morristown, NJ: General Learning Press.
Kay E, Meyer HH, French JRP. (1965). Effects of threat in a performance appraisal interview. Journal of Applied Psychology, 49, 311-317.
Klimoski RJ, London M. (1974). Role of the rater in performance appraisal. Journal of Applied Psychology, 59, 445-451.
Kraiger K, Ford JK. (1985). A meta-analysis of ratee race effects in performance ratings. Journal of Applied Psychology, 70, 56-65.
Landy FJ, Farr JL. (1980). Performance rating. Psychological Bulletin, 87, 72-107.
Landy FJ, Farr JL, Saal FE, Freytag WR. (1976). Behaviorally anchored scales for rating the performance of police officers. Journal of Applied Psychology, 61, 458-557.
Latham GP, Wexley KN. (1982). Increasing productivity through performance appraisal. Reading, MA: Addison-Wesley.
Lawler EE. (1967). The multitrait-multirater approach to measuring managerial job performance. Journal of Applied Psychology, 51, 369-381.
Mabe PA, West SG. (1982). Validity of self-evaluation of ability: A review and meta-analysis. Journal of Applied Psychology, 67, 280-296.
McIntyre RM, Smith DE, Hassett CE. (1984). Accuracy of performance ratings as affected by rater training and perceived purpose of rating. Journal of Applied Psychology, 69, 147-156.
Pearlman K, Schmidt FL, Hunter JE. (1980). Validity generalization results for tests used to predict job proficiency and training success in clerical occupations. Journal of Applied Psychology, 65, 373-406.
Pym DLA, Auld HD. (1965). The self-rating as a measure of employee satisfactoriness. Journal of Occupational Psychology, 39, 103-113.
Rosenthal R. (1979). The "file drawer problem" and tolerance for null results. Psychological Bulletin, 86, 638-641.
Sackett PR, Tenopyr ML, Schmitt N, Kehoe J. (1985). Commentary on forty questions about validity generalization and meta-analysis. PERSONNEL PSYCHOLOGY, 38, 697-798.
Schmitt N, Schneider B. (1983). Current issues in personnel selection. In Rowland KW, Ferris GD (Eds.), Research in personnel and human resources, Vol. 1 (pp. 85-125). Greenwich, CT: JAI Press.
Smith DE. (1986). Training programs for performance appraisal: A review. Academy of Management Review, 11, 22-40.
Springer D. (1953). Ratings of candidates for promotion by co-workers and supervisors. Journal of Applied Psychology, 37, 347-351.
Steel RP, Ovalle NK. (1984). Self-appraisal based upon supervisory feedback. PERSONNEL PSYCHOLOGY, 37, 667-685.
Thornton GC. (1980). Psychometric properties of self-appraisals of job performance. PERSONNEL PSYCHOLOGY, 33, 263-271.
Tucker MF, Cline VB, Schmitt JR. (1967). Prediction of creativity and other performance measures from biographical information among pharmaceutical scientists. Journal of Applied Psychology, 51, 131-138.
Wexley KN, Klimoski R. (1984). Performance appraisal: An update. In Rowland KM, Ferris GD (Eds.), Research in personnel and human resources, Vol. 2 (pp. 35-79). Greenwich, CT: JAI Press.
Williams WE, Seiler DA. (1973). Relationship between measures of effort and performance. Journal of Applied Psychology, 57, 49-54.
Zammuto RF, London M, Rowland KM. (1982). Organization and rater differences in performance appraisals. PERSONNEL PSYCHOLOGY, 35, 643-658.
Zedeck S, Baker HT. (1972). Nursing performance as measured by behavioral expectations scales: A multitrait-multirater analysis. Organizational Behavior and Human Performance, 7, 457-466.
Zedeck S, Cascio W. (1982). Performance decision as a function of purpose of rating and training. Journal of Applied Psychology, 67, 752-758.
Zerbe WJ, Paulhus DL. (1987). Socially desirable responding in organizational behavior: A reconception. Academy of Management Review, 12, 250-264.
STUDIES INCLUDED IN THE META-ANALYSIS
Baird LS. (1977). Self and superior ratings of performance: As related to self-esteem and satisfaction with supervision. Academy of Management Journal, 20, 291-300.
Blackburn RT, Clark MJ. (1975). An assessment of faculty performance: Some correlates between administrator, colleague, student, and self ratings. Sociology of Education, 48, 242-256.
Booker GS, Miller RW. (1966). A closer look at peer ratings. Personnel, 43, 42-47.
Borman WC. (1974). The rating of individuals in organizations: An alternate approach. Organizational Behavior and Human Performance, 12, 105-124.
Boruch RF, Larkin JD, Wolins L, MacKinney AC. (1970). Alternative methods of analysis: Multitrait-multimethod data. Educational and Psychological Measurement, 30, 833-853.
Brief AP, Aldag RJ, Van Sell M. (1977). Moderators of the relationships between self and superior evaluations of job performance. Journal of Occupational Psychology, 50, 129-134.
Burke RJ, Weitzel W, Deszca G. (1982). Supervisors' and subordinates' communication experiences in managing job performance: Two blind men describing an elephant? Psychological Reports, 50, 527-534.
Campbell DJ. (1985). Self-appraisals from two perspectives: Esteem versus consistency influences. 28th Annual Proceedings of the Midwest Academy of Management.
Centra JA. (1975). Colleagues as raters of classroom instruction. Journal of Higher Education, 46, 327-337.
Doyle KD, Crichton CI. (1978). Student, peer, and self evaluations of college instructors. Journal of Educational Psychology, 70, 815-826.
Ekpo-Ufot A. (1979). Self-perceived task-relevant abilities, rated job performance, and complaining behavior of junior employees in a government ministry. Journal of Applied Psychology, 64, 429-434.
Ferris GR, Yates VL, Gilmore DC, Rowland KM. (1985). The influence of subordinate age on performance ratings and causal attributions. PERSONNEL PSYCHOLOGY, 38, 545-557.
Fuqua DR, Johnson AW, Newman JL, Anderson MW, Gade EM. (1984). Variability across sources of performance ratings. Journal of Counseling Psychology, 31, 249-252.
Gormly J. (1984). Correspondence between personality trait ratings and behavior events. Journal of Personality, 52, 220-232.
Griffiths RD. (1975). The accuracy and correlates of psychiatric patients' self-assessment of their work behavior. British Journal of Social and Clinical Psychology, 14, 181-189.
Gunderson EK, Nelson PD. (1966). Criterion measures for extremely isolated groups. PERSONNEL PSYCHOLOGY, 19, 67-80.
Heneman HG III. (1974). Comparisons of self- and superior ratings of managerial performance. Journal of Applied Psychology, 59, 638-642.
Hoffman EL, Rohrer JH. (1954). An objective peer evaluation scale: Construction and validity. Educational and Psychological Measurement, 14, 332-341.
Holzbach RL. (1978). Rater bias in performance ratings: Superior-, self-, and peer ratings. Journal of Applied Psychology, 63, 579-588.
Ilgen DR, Peterson RB, Martin BA, Boeschen DA. (1981). Supervisor and subordinate reactions to performance appraisal sessions. Organizational Behavior and Human Performance, 28, 311-330.
Kaufman GG, Johnson JC. (1974). Scaling peer ratings: An examination of the differential validities of positive and negative nominations. Journal of Applied Psychology, 59, 302-306.
Kay E, Meyer HH, French JRP. (1965). Effects of threat in a performance appraisal interview. Journal of Applied Psychology, 49, 311-317.
King LM, Hunter JE, Schmidt FL. (1980). Halo in a multidimensional forced-choice performance evaluation scale. Journal of Applied Psychology, 65, 507-516.
Kirchner WK. (1965). Relationships between supervisory and subordinate ratings for technical personnel. Journal of Industrial Psychology, 3, 57-60.
Klimoski RJ, London M. (1974). Role of the rater in performance appraisal. Journal of Applied Psychology, 59, 445-451.
Kraut AI. (1975). Prediction of managerial success by peer and training staff ratings. Journal of Applied Psychology, 60, 389-394.
Kubany AJ. (1957). Use of sociometric peer nominations in medical education research. Journal of Applied Psychology, 41, 389-394.
Lawler EE. (1967). The multitrait-multirater approach to measuring managerial job performance. Journal of Applied Psychology, 51, 369-381.
Levine EL, Flory A, Ash RA. (1977). Self-assessment in personnel selection. Journal of Applied Psychology, 62, 428-435.
Love KG. (1981). Comparison of peer assessment methods: Reliability, validity, friendship bias, and user reaction. Journal of Applied Psychology, 66, 451-457.
Maslow AH, Zimmerman W. (1956). College teaching ability, scholarly activity and personality. Journal of Educational Psychology, 47, 185-189.
Mihal WL, Graumenz JL. (1984). An assessment of the accuracy of self-assessment for career decision making. Journal of Vocational Behavior, 25, 245-253.
Mitchell TR, Albright DW. (1972). Expectancy theory predictions of the satisfaction, effort, performance, and retention of naval aviation officers. Organizational Behavior and Human Performance, 8, 1-20.
Motowidlo SJ. (1982). Relationship between self-rated performance and pay satisfaction among sales representatives. Journal of Applied Psychology, 67, 209-213.
Mount MK. (1984). Supervisor, self, and subordinate ratings of performance and satisfaction with supervision. Journal of Management, 10, 305-320.
Nealy SM, Owen TW. (1970). A multitrait-multimethod analysis of predictors and criteria of nursing performance. Organizational Behavior and Human Performance, 5, 348-365.
Orpen C. (1983). Effects of race of rater and ratee on peer-ratings of managerial potential. Psychological Reports, 52, 507-510.
Parker JW, Taylor EK, Barrett RS, Martens L. (1959). Rating scale content: III. Relationship between supervisor and self-ratings. PERSONNEL PSYCHOLOGY, 12, 49-63.
Porter LW, Lawler EE. (1968). Managerial attitudes and performance. Homewood, IL: Irwin.
Prien EP, Liske RE. (1962). Assessments of higher-level personnel: III. Rating criteria: A comparative analysis of supervisor ratings and incumbent ratings of job performance. PERSONNEL PSYCHOLOGY, 15, 187-194.
Pym DLA, Auld HD. (1965). The self-rating as a measure of employee satisfactoriness. Journal of Occupational Psychology, 39, 103-113.
Schmitt N, Saari BB. (1978). Behavior, situation, and rater variance in descriptions of leader behaviors. Multivariate Behavioral Research, 13, 483-495.
Schneier CE, Beatty RW. (1978). The influence of role prescriptions on the performance appraisal process. Academy of Management Journal, 21, 129-135.
Shapiro GL, Dessler G. (1985). Are self-appraisals more realistic among professionals or nonprofessionals in health care? Public Personnel Management, 14, 285-290.
Shore LF, Thornton GC III. (1986). Effects of gender on self- and supervisory ratings. Academy of Management Journal, 29, 115-129.
Shrauger JS, Lund AK. (1975). Self-evaluation and reactions to evaluations from others. Journal of Personality, 43, 94-108.
Siegel L. (1982). Paired comparison evaluations of managerial effectiveness by peers and supervisors. PERSONNEL PSYCHOLOGY, 35, 843-852.
Smircich L, Chesser RJ. (1981). Supervisors' and subordinates' perceptions of performance: Beyond disagreement. Academy of Management Journal, 24, 198-205.
Springer D. (1953). Ratings of candidates for promotion by co-workers and supervisors. Journal of Applied Psychology, 37, 347-351.
Steel RP, Ovalle NK. (1984). Self-appraisal based upon supervisory feedback. PERSONNEL PSYCHOLOGY, 37, 667-685.
Thornton GC. (1968). The relationship between supervisory and self-appraisals of executive performance. PERSONNEL PSYCHOLOGY, 21, 441-455.
Tucker MF, Cline VB, Schmitt JR. (1967). Prediction of creativity and other performance measures from biographical information among pharmaceutical scientists. Journal of Applied Psychology, 51, 131-138.
Webb WB, Nolan CY. (1955). Student, supervisor and self-ratings of instructional proficiency. Journal of Educational Psychology, 46, 42-46.
Williams WE, Seiler DA. (1973). Relationship between measures of effort and job performance. Journal of Applied Psychology, 57, 49-54.