PERSONNEL PSYCHOLOGY 1988, 41
A META-ANALYSIS OF SELF-SUPERVISOR, SELF-PEER, AND PEER-SUPERVISOR RATINGS
MICHAEL M. HARRIS, JOHN SCHAUBROECK Krannert Graduate School of Management
Purdue University
Reviews of self-supervisor, self-peer, and peer-supervisor ratings have generally concluded that there is at best a modest correlation between different rating sources. Nevertheless, there has been much inconsistency across studies. Accordingly, a meta-analysis was conducted. The results indicated a relatively high correlation between peer and supervisor ratings (ρ = .62) but only a moderate correlation between self-supervisor (ρ = .35) and self-peer ratings (ρ = .36). While rating format (dimensional versus global) and rating scale (trait versus behavioral) had little impact as moderators, job type (managerial/professional versus blue-collar/service) did seem to moderate self-peer and self-supervisor ratings.
The use of multiple sources for performance ratings has gained considerable acceptance in the past two decades. Numerous advantages of using multiple raters have been cited: for example, enhanced ability to observe and measure various job facets (Borman, 1974; Henderson, 1984), greater reliability, fairness, and ratee acceptance (Latham & Wexley, 1982), and improved defensibility of the performance appraisal program from a legal standpoint (Bernardin & Beatty, 1984). The literature on multiple raters has focused on self-ratings. A number of scholars have argued that self-ratings can promote personal development, improve communication between supervisors and subordinates, and clarify differences of opinion between supervisors and other managers (Carroll & Schneier, 1982; Cummings & Schwab, 1973).
Despite the alleged gains from self-ratings, empirical research shows frequent lack of agreement between self-ratings and those provided by other sources. Mabe and West (1982) reviewed a number of studies and found, on average, a low correlation between self-ratings and others' ratings, including supervisor and peer appraisals. Thornton (1980), in his summary of self-ratings, concluded that "individuals have a significantly different view of their own job performance than that held by other people" (p. 268). Nevertheless, both Mabe and West and Thornton found that research often
The authors were equal contributors to this manuscript; order of listing was determined alphabetically.
The authors would like to thank Michael A. Campion for his helpful comments.
Correspondence and requests for reprints should be addressed to Michael M. Harris, Krannert Graduate School of Management, Purdue University, West Lafayette, IN 47907.
COPYRIGHT © 1988 PERSONNEL PSYCHOLOGY, INC.
led to conflicting results. Indeed, a casual inspection of research on self-supervisor ratings shows a wide range of findings. For example, Williams and Seiler (1973) reported an average self-supervisor correlation of .60; Pym and Auld (1965) found an average self-supervisor correlation of .56 across three independent studies. Conversely, Klimoski and London (1974) reported an average self-supervisor correlation of .05, and Ferris, Yates, Gilmore, and Rowland (1985) found a self-supervisor correlation of .02. Others (e.g., Heneman, 1974) have found moderate correlations between self-supervisor ratings. Thus, no firm conclusions can be drawn regarding the extent of self-supervisor agreement.
Low correlations between different raters are not unique to comparisons with self-ratings. On the one hand, some researchers have observed relatively high peer-supervisor correlations: for example, Lawler (1967) reported an average correlation of .57. On the other hand, others have found much lower correlations between peer and supervisor performance ratings: Springer (1953), for example, reported an average correlation of .28, and Tucker, Cline, and Schmitt (1967) reported .21. Thus, Landy and Farr (1980) have concluded that different raters exhibit moderate to low agreement. Moreover, they have suggested that the limited correspondence between self-ratings and those from other sources is not due to any unique differences beyond that to be expected for multiple raters in general.
It is important to note that a number of scholars have asserted that the general lack of convergence between different raters is neither surprising nor problematic. Supporters of this position argue that different raters may observe different dimensions of performance, or have different definitions of effective performance and, thus, arrive at different assessments of the same individual’s performance (Borman, 1974; Landy & Farr, 1980). As Wexley and Klimoski (1984) pointed out, however, a definitive test of this hypothesis has not been conducted.
A review of the literature suggests that a number of explanations have been offered as to why low correlations exist between different raters. These can be sorted into at least three categories: (1) egocentric bias, (2) differences in organizational level, and (3) observational opportunities. What follows is a brief discussion of each of these explanations.
Egocentric Bias
While several different explanations fall under this rubric, the underlying premise is that ratee (or self-) ratings are biased in some fashion, while other raters (e.g., peers and supervisor) share a set of common perceptions. One of the most commonly discussed versions of egocentric bias is the main effect of defensiveness, wherein a self-rater is inclined to inflate his/her rating in order to enhance the evaluation (Holzbach, 1978;
Steel & Ovalle, 1984). This would lead the self-ratings to have a restricted range, thereby attenuating the correlation between self- and others’ ratings. Correlations between others’ ratings (e.g., peers and supervisors) would be considerably higher as there would be no range restriction. However, correcting for range restriction should eliminate any differences between correlations based on self- and others’ ratings and the correlations based on different observers’ ratings.
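The attenuation-and-recovery logic of this main-effect argument can be sketched numerically. In the illustration below, all values and function names are ours (hypothetical, not from the paper); the formulas are the standard direct range-restriction relations discussed by Hunter, Schmidt, and Jackson (1982).

```python
import math

def restrict(rho, u):
    """Observed correlation when one variable's spread is compressed to a
    fraction u of its unrestricted standard deviation (direct restriction)."""
    return u * rho / math.sqrt(1 - rho**2 * (1 - u**2))

def correct(r, u):
    """Standard range-restriction correction: recovers the unrestricted value."""
    big_u = 1.0 / u
    return big_u * r / math.sqrt(1 + r**2 * (big_u**2 - 1))

# Hypothetical: a "true" self-other correlation of .62, with defensiveness
# compressing self-rating spread to 75% of the others' spread.
rho_true, u = 0.62, 0.75
r_observed = restrict(rho_true, u)    # attenuated by the restricted range
r_corrected = correct(r_observed, u)  # correction restores rho_true
```

If defensiveness operated only as a main effect, applying such a correction to self-other correlations should bring them into line with peer-supervisor correlations.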
A second version of the egocentric bias explanation assumes that ratee defensiveness may be moderated by other variables, such as self-esteem (Baird, 1977; Kay, Meyer, & French, 1965). If the impact of ratee defensiveness depends on a third variable, such as self-esteem, the order of self-ratings may be affected. That is, high self-esteem ratees may inflate their ratings, while low self-esteem ratees may not. If the third variable is not taken into account, correcting for range restriction will not completely eradicate the egocentric bias, and correlations between observer ratings would remain higher than correlations between self- and others' ratings.
A third type of egocentric bias is posited by attribution theory (DeVader, Bateson, & Lord, 1986; Jones & Nisbett, 1972). According to this theory, actors (i.e., self-raters) attribute good performance to their own behavior and poor performance to environmental factors. Conversely, observers (e.g., peers and supervisors) attribute good performance to environmental factors and poor performance to the actors' dispositions. Hence, this theory would suggest that self-ratings will correlate poorly with ratings provided by other sources. Correcting for range restriction should have little effect. However, there should be substantial agreement between observers (i.e., peers and supervisors).
In addition to predictions about the correlations, all three versions of the egocentric bias explanation would predict differences in mean ratings. Specifically, self-ratings should be significantly higher than either peer or supervisor ratings.
Differences in Organizational Level
Two versions of the organizational-level explanation for rater disagreement exist. Some researchers (e.g., Klimoski & London, 1974; Zammuto, London, & Rowland, 1982) have asserted that raters at different levels weight performance dimensions differently. This suggests that raters at different levels (e.g., peers and supervisors) disagree on the overall rating, but they presumably would agree on dimensional ratings. A second version of this explanation maintains that raters at different levels define and measure performance differently (e.g., Borman, 1974; Landy, Farr, Saal, & Freytag, 1976). This latter hypothesis would suggest that raters at different levels disagree on both dimensions and overall ratings. At the
same time, both explanations suggest that raters on the same level (e.g., self and peers) would provide similar ratings. Note that neither version of the organizational-level explanation would necessarily suggest differences in the mean ratings of self and others.
Observational Opportunities
A third explanation for lack of agreement across different raters concerns differing observational opportunities. Specifically, peers are hypothesized to have more opportunities to observe ratees, and at more revealing times, than do supervisors (Latham & Wexley, 1982; Lawler, 1967). This explanation implies that supervisors disagree with a ratee because they have few opportunities to observe the individual's performance, and the behaviors they do observe do not reflect true performance levels. Thus, one would expect self-supervisor and peer-supervisor ratings to exhibit far lower correlations than peer-self ratings. However, as with the organizational-level explanation, no differences between mean self-, peer, and supervisor ratings would be expected.
While the three basic explanations described above have been presented as mutually exclusive, it is possible that several are operating simultaneously. For example, Latham and Wexley (1982) provide an observational-opportunities explanation for the lack of agreement between peers and supervisors and suggest that defensiveness may be the cause of peer-self disagreement. Accordingly, while a direct test of these alternative hypotheses is not possible without measurement of proposed moderators, comparison of average correlations for different rater pairs can provide some direction and insight into which explanation may account for low correlations between raters.
The purpose of this paper was to conduct a quantitative review of empirical research on rater agreement. While a number of qualitative reviews have been reported (e.g., Landy & Farr, 1980; Thornton, 1980), a definitive estimate of rater agreement has not been ascertained.¹ Furthermore, the apparent lack of consistent findings regarding rater agreement may be explained by sampling error, as many of these studies are based on small sample sizes. For example, Pym and Auld (1965) had an average of 70 subjects in each sample; Ferris et al. (1985) reported an N of 81. Thus, one explanation for the lack of consistency from study to study is that sampling error is affecting the results.
A second explanation for the apparent lack of consistency between studies is the nature of the job. That is, blue-collar and service jobs are relatively routine, and performance is well defined compared with managerial or professional jobs (Campbell, Dunnette, Lawler, & Weick, 1970; Cascio, 1987). Hence, it is to be expected that there will be a higher correlation between different raters for blue-collar/service jobs than for managerial/professional jobs.

¹The one quantitative review (Mabe & West, 1982) did not separate out self-peer and self-supervisor correlations, nor did it examine peer-supervisor correlations.
Two other moderators are predicted by the explanations discussed above. As described by the first organizational-level hypothesis, different raters may weight dimensions differently. If this is the case, one would expect greater agreement on dimensional ratings than on global ratings. Hence, another potential moderator is rating format (dimensional versus global). Second, rating scale may be an important moderator, whereby rater agreement will be higher for behavioral than for trait scales. This follows from the second version of the organizational-level hypothesis, which asserts that different raters may have different definitions of effective job performance (Borman, 1974; Zedeck & Baker, 1972). Hence, the more well defined and behaviorally based the rating scale, the greater the agreement expected between different raters.
In summary, the present study used meta-analysis to review the literature regarding the correlation between self-supervisor, self-peer, and peer-supervisor ratings. Specifically, the questions of interest were as follows:
1. (a) Overall, what is the average correlation between self-supervisor, self-peer, and peer-supervisor ratings? (b) What is the average difference between supervisor, peer, and self-ratings? (c) Will the data support an egocentric-bias, organizational-level, or observational-opportunities explanation for rater disagreement?
2. Can differences between studies within each rater pair be attributed to statistical artifacts such as sampling error?
3. Assuming that differences between studies cannot be explained by statistical artifacts, can the variance be explained by moderators such as job type (managerial/professional versus blue-collar/service), rating format (dimensional versus global), and rating scale (behavioral versus trait)?
Method
Analysis
The Hunter, Schmidt, and Jackson (1982) meta-analytic procedure designed to correct for unreliability and range restriction with sporadic information was employed for the correlations. This procedure involves a sequence of computations. In step 1, the sample-weighted average correlation and the sample-weighted variance of the correlations are calculated.
Next, the expected variance due to sampling error is determined, and this is subtracted from the observed variance.
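Step 1 can be sketched as follows. The study data and helper name are ours (purely illustrative), and the sampling-error term uses the common (1 − r̄²)²/(N̄ − 1) approximation from Hunter et al. (1982); their book gives the full procedure.

```python
def step1(rs, ns):
    """Sample-size-weighted mean correlation, weighted observed variance,
    and residual variance after removing expected sampling-error variance.
    rs: observed correlations; ns: sample sizes (one per study)."""
    total_n = sum(ns)
    r_bar = sum(n * r for r, n in zip(rs, ns)) / total_n
    var_obs = sum(n * (r - r_bar) ** 2 for r, n in zip(rs, ns)) / total_n
    n_bar = total_n / len(rs)                      # average sample size
    var_err = (1 - r_bar ** 2) ** 2 / (n_bar - 1)  # expected sampling-error variance
    return r_bar, var_obs, var_obs - var_err       # residual ("true") variance

rs = [0.60, 0.05, 0.35, 0.02, 0.56]  # hypothetical study correlations
ns = [70, 81, 120, 95, 60]           # hypothetical sample sizes
r_bar, var_obs, var_res = step1(rs, ns)
```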
Step 2 of the meta-analysis involves estimating reliability and range restriction across the variables. Once this is done, the average correlation is corrected for unreliability and range restriction, as is the variance.² The formulas for these estimates may be found in Hunter et al. (1982). In accord with the main effect defensiveness hypothesis, self-other correlations were corrected for both of these artifacts. However, since there was no theoretical reason to correct for range restriction for peer-supervisor correlations, only corrections for measurement error were performed for this category.
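A rough sketch of these step-2 corrections follows. The function names are ours, the inputs are hypothetical values loosely patterned on the self-supervisor row of Table 1, and the simple disattenuate-then-correct ordering used here is not the exact sequencing in Hunter et al. (1982), so it will not reproduce the paper's reported corrected values precisely.

```python
import math

def disattenuate(r, rxx, ryy):
    """Correct an observed correlation for unreliability in both measures."""
    return r / math.sqrt(rxx * ryy)

def correct_range_restriction(r, u):
    """Correct for direct range restriction. Here u is the ratio of the
    restricted SD to the unrestricted SD (e.g., SD of self-ratings over
    SD of others' ratings)."""
    big_u = 1.0 / u
    return (big_u * r) / math.sqrt(1 + r * r * (big_u * big_u - 1))

# Hypothetical inputs in the spirit of Table 1's self-supervisor row
r_obs = 0.22           # sampling-error-corrected mean correlation
rxx, ryy = 0.90, 0.91  # average reliabilities of the two rating sources
u = 0.78               # SD ratio (self vs. supervisor)

rho = correct_range_restriction(disattenuate(r_obs, rxx, ryy), u)
```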
The third step examines whether variation between studies is due to artifacts or is caused by moderator variables. The 25% rule was used, whereby a moderator search was initiated if the ratio of unexplained variance (i.e., variance remaining after correcting for sampling error, measurement error, and range restriction) to total variance was greater than 25%.
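As a one-line decision rule (the helper name is ours, not from the paper):

```python
def needs_moderator_search(unexplained_var, total_var, threshold=0.25):
    """The '25% rule': initiate a moderator search when the ratio of
    unexplained (residual) variance to total variance exceeds 25%."""
    return (unexplained_var / total_var) > threshold

# Table 1 reports 33%, 86%, and 82% unexplained variance for the three
# rater pairs, so all three trigger a moderator search under this rule.
flags = [needs_moderator_search(p, 1.0) for p in (0.33, 0.86, 0.82)]
```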
The fourth step involves testing for moderators. Towards this end, effect sizes are subgrouped and a separate analysis is performed on each subgroup. Supportive evidence for the hypothesized moderator variable exists when (1) the mean correlation differs from subgroup to subgroup and (2) the corrected variance is lower on average in the subgroups than for the overall set.
Mean differences between peer, supervisor, and self-ratings were converted into a d statistic. Procedures described by Hunter et al. (1982) were then used to obtain the average effect size and true variance.³ The d statistic was calculated such that the first rater in each rater pair (e.g., self in self-supervisor ratings) was associated with the higher ratings.
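These steps can be sketched as follows (hypothetical data and our own function names; the sampling-error variance of d uses the (4/N̄)(1 + d̄²/8) approximation associated with Hunter et al., 1982):

```python
def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference, signed so a positive d means the
    first rater (e.g., self) provided the higher ratings."""
    sp = (((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)) ** 0.5
    return (mean1 - mean2) / sp

def average_d(ds, ns):
    """Sample-size-weighted mean effect size, plus observed variance minus
    the variance expected from sampling error ("true" variance)."""
    total_n = sum(ns)
    d_bar = sum(n * d for d, n in zip(ds, ns)) / total_n
    var_obs = sum(n * (d - d_bar) ** 2 for d, n in zip(ds, ns)) / total_n
    n_bar = total_n / len(ds)
    var_err = (4.0 / n_bar) * (1 + d_bar**2 / 8)
    return d_bar, var_obs - var_err

# Hypothetical self-supervisor pair: self mean 4.0, supervisor mean 3.5
d = cohens_d(4.0, 1.0, 50, 3.5, 1.0, 50)           # positive: self rated higher
d_bar, true_var = average_d([0.7, 0.3], [100, 100])
```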
Sample
The search for studies to be used in the meta-analysis was confined to journal articles and conference papers in published proceedings. The search began with manual inspection of each of the following journals for the past thirty years: Academy of Management Journal (1958-1986), Journal of Applied Psychology (1956-1986), Journal of Educational Psychology (1956-1986), Journal of Occupational Psychology (1975-1986), Journal of Vocational Behavior (1971-1986), Organizational Behavior and Human Decision Processes (formerly Organizational Behavior and Human Performance) (1966-1986), and Personnel Psychology (1956-1986). Conference papers from the Midwest and National proceedings of the Academy of Management were also examined. Computer-based searches were conducted on PsycINFO (Psychological Abstracts) for the period since 1967 and the ABI/INFORM database for the period since 1975. References for other studies were obtained from these sources to expand the research base.

²Note that the source of the range restriction differs from the traditional situation wherein range restriction is due to a lack of complete data. For the present situation, we computed a ratio of the standard deviation of self-ratings to the standard deviation of others' ratings (either peers or supervisors). This ratio is referred to as u in Hunter et al. (1982) and is used along with other formulas to correct for range restriction.

³Because Hunter et al. (1982) do not provide formulas for correcting for measurement error with sporadic information, we used the average reliabilities listed in Table 1.
Three judgments were made concerning the collection and analysis of data. First, only those reliability estimates based on accepted formulas were selected to determine the mean reliabilities across studies. This excluded, among other estimates, unadjusted average inter-item correlation coefficients. Second, laboratory studies were deliberately excluded from the analysis. This was done because the conditions under which laboratory ratings occur grossly exaggerate the observation time factor. More importantly, assigned roles cannot be taken to reflect true perspectives of supervisors and peers (Ilgen & Favero, 1985). Finally, some studies contained multiple measures of performance. Since these measures were typically not independent, the effect sizes were averaged and the mean effect size was used in the meta-analysis.
Results
The literature search produced a total of 36 independent self-supervisor correlations, 23 independent peer-supervisor correlations, and 11 independent self-peer correlations (approximately 5 studies did not provide either means or correlations). Table 1 shows the average correlations, reliabilities, and range restriction for the three rater combinations. The reliabilities are based on internal consistency (alpha) measures. It is noteworthy that the average reliabilities are substantially higher than the .60 figure used by Schmidt and his colleagues (e.g., see Pearlman, Schmidt, & Hunter, 1980). There are three reasons for using the more conservative figures found here. First, Schmidt and his colleagues based their reliability estimates on the correlations between ratings from different supervisors at different points in time. In the present study, only internal consistency measures were found. Second, the .60 figure used by other researchers assumes that one source of error is the use of different raters. In the present case, at least for self-ratings, the rater did not change. Hence, a reliability of .60 would be an underestimate. Finally, there has been considerable debate over the accuracy of .60 as an estimate of reliability (Sackett, Tenopyr, Schmitt, & Kehoe, 1985; Schmitt & Schneider, 1983). Use of the higher, and hence more conservative, figures therefore seemed in order.
TABLE 1
Meta-analysis Results for Correlations

                     Total    No. of                                   Mean           Standard    90% confidence interval   % Unexplained
Rater combination    sample   correlations   āᵃ       b̄ᵃ       uᵃ      correlationᵇ   deviation   Lower bound  Upper bound  variance
Self-supervisor      3,957    36             .90(7)   .91(7)   .78(8)  .35(.22)       .11(.12)    .17          .53          33
Peer-supervisorᶜ     2,643    23             .87(5)   .89(5)   NA      .62(.48)       .24(.22)    .22          1.00         86
Self-peer            984      11             .90(7)   .87(5)   .84(3)  .36(.24)       .19(.13)    .05          .67          82

ᵃā, b̄, and u represent, respectively, the average reliability for the first rater in each combination, the average reliability for the second rater in each combination, and the ratio of the standard deviation of the first rater to the standard deviation of the second rater in each pair (see Hunter et al., 1982, for further explanation). Figures in parentheses indicate the number of studies involved in the calculations.
ᵇNumbers in parentheses include corrections for sampling error; numbers not in parentheses are corrected for all possible artifacts.
ᶜMeans and variance corrected for measurement error; see text for explanation.
Because there was only one self-peer study reporting reliabilities, es- timates of peer-rating reliability were obtained from the peer-supervisor studies; likewise, estimates of self-rating reliability were taken from self- supervisor studies.
As shown in Table 1, all three rater combinations had lower 90% confidence intervals that were greater than zero. Hence, the correlation between ratings from different raters is statistically significant. However, there were marked differences between rater combinations: The peer-supervisor correlation (ρ = .62) was substantially higher than either the self-supervisor (ρ = .35) or the self-peer correlation (ρ = .36). Finally, all three rater combinations contained more than 25% unexplained variance even after sampling error, measurement error, and range restriction were taken into account; accordingly, moderator variables were examined.
Table 2 summarizes the moderator analysis for self-supervisor correlations. Consistent with the hypothesis, there was slightly more agreement for dimensional (ρ = .36) than for global (ρ = .29) ratings. However, there was no decrease in variance across the subgroups. In accord with the rating-scale hypothesis, there was also a small difference between trait and behavioral scales, wherein greater agreement occurred for the latter. While the ratio of unexplained to total variance was less than 25% (15%) for behavioral ratings, there was no decrease in sample-size weighted variance averaged across subgroups compared with the whole set. Thus, rating scale and rating format were not significant moderators for self-supervisor correlations. However, job type appeared to be a significant moderator. When sampling error and other artifacts were taken into account, no variance remained for managerial/professional employees. Although substantial variance remained for blue-collar/service workers, the average corrected variance across the subgroups was lower than for the set as a whole. Moreover, there was lower agreement for managerial/professional employees (ρ = .27) than for blue-collar/service employees (ρ = .42). Hence, job type did seem to moderate self-supervisor correlations.
Table 3 summarizes the results of the moderator analysis for peer-supervisor correlations. Contrary to expectations, slightly higher agreement was observed for global than for dimensional ratings; nonetheless, there was no decrease in variance across subsets. More surprising was the finding that greater agreement was obtained for trait (ρ = .64) than for behavioral ratings (ρ = .53). There was only a slight reduction in the sample-size weighted average variance across subsets as compared with the variance in the whole set (σ = .23 across subsets as compared with .24 for the whole set). Finally, job type did not seem to affect the amount of agreement.
Due to the small number of studies (2) using either behaviorally based rating scales or dimensional formats, these variables were not investigated as moderators for self-peer correlations. Thus, only job type was considered; the results are shown in Table 4.

TABLE 2
Moderator Analysis Results: Self-Supervisor Correlations

                             Total    No. of         Mean           Standard    90% confidence interval   % Unexplained
Moderator                    sample   correlations   correlationᵃ   deviation   Lower bound  Upper bound  variance
Rating formatᵇ
  Dimensional                2,542    29             .36(.23)       .16(.15)    .10          .62          50
  Global                     2,838    24             .29(.18)       .16(.14)    .03          .55          58
Rating scale
  Trait                      2,899    21             .32(.21)       .12(.12)    .12          .52          46
  Behavioral                 860      13             .43(.28)       .08(.13)    .30          .56          15
Job type
  Managerial/professional    1,935    11             .27(.17)       .00(.04)    .27          .27          00
  Blue-collar/service        2,022    25             .42(.27)       .16(.15)    .16          .68          45

ᵃNumbers in parentheses include corrections for sampling error; numbers not in parentheses are corrected for all possible artifacts.
ᵇCombined sample size is larger than the total sample since both dimensional and global ratings are included from the same studies.

TABLE 3
Moderator Analysis Results: Peer-Supervisor Correlations

                             Total    No. of         Mean           Standard    90% confidence interval   % Unexplained
Moderator                    sample   correlations   correlationᵃ   deviation   Lower bound  Upper bound  variance
Rating formatᵇ
  Dimensional                1,331    14             .57(.44)       .23(.20)    .19          .95          80
  Global                     1,914    15             .65(.50)       .25(.21)    .24          1.00         86
Rating scale
  Trait                      2,137    18             .64(.49)       .24(.23)    .24          1.00         88
  Behavioral                 506      5              .53(.41)       .20(.18)    .20          .86          80
Job type
  Managerial/professional    905      8              .64(.48)       .24(.21)    .24          1.00         80
  Blue-collar/service        1,738    15             .62(.47)       .27(.23)    .17          1.00         87

ᵃNumbers in parentheses include corrections for sampling error; numbers not in parentheses are corrected for all possible artifacts.
ᵇCombined sample size is larger than the total sample since both dimensional and global ratings are included from the same studies.

TABLE 4
Moderator Analysis Results: Self-Peer Correlationsᵃ

                             Total    No. of         Mean           Standard    90% confidence interval   % Unexplained
Moderator                    sample   correlations   correlationᵇ   deviation   Lower bound  Upper bound  variance
Job type
  Managerial/professional    534      6              .31(.20)       .00(.00)    .31          .31          00
  Blue-collar/service        380      5              .40(.26)       .28(.19)    -.06         .86          14

ᵃThere were too few studies to test either rating format or rating scale as moderators.
ᵇNumbers in parentheses include corrections for sampling error; numbers not in parentheses are corrected for all possible artifacts.
As in the self-supervisor category, little or no variance remained for either subgroup of self-peer correlations when sampling error and other artifacts were taken into account. Accordingly, there was a reduction in average variance across subsets as compared with the whole set, while the average correlation was higher for blue-collar/service workers (ρ = .40) than for managerial/professional employees (ρ = .31). Thus, job type did meet the criteria for acceptance as a moderator.
A total of 18 independent self-supervisor means, 7 independent peer-supervisor means, and 4 independent self-peer means were located. As predicted by the egocentric-bias explanation, there was a fairly large difference between self- and supervisor ratings (d = .70). However, the true variance was also large (σ² = .25). Thus, the difference in means between self- and supervisor ratings was not statistically significant. Similarly, while on average self-ratings were somewhat higher than peer ratings (d = .28), a substantial amount of variance remained even after correcting for sampling error (σ² = .09). Hence, the difference in mean self-peer ratings was nonsignificant. Finally, while peers provided somewhat higher ratings than did supervisors (d = .23), the difference was not significant, given the large variance (σ² = .11).⁴
Discussion
The first question of interest concerned the average correlation and mean difference between rater pairs. Overall, the data suggest that after correcting for measurement error and range restriction, moderate agreement between self-peer (ρ = .36) and self-supervisor (ρ = .35) ratings exists. There was much higher agreement between peers and supervisors (ρ = .62). Nonetheless, the lower 90% confidence interval was greater than zero for all three rater combinations. The results of this study then are somewhat contrary to conclusions by Landy and Farr (1980). As noted earlier, they concluded that different raters exhibit low to moderate agreement. The present results indicate that while this is true for self-ratings compared with ratings by others, peer-supervisor ratings demonstrate a far higher average correlation. This discrepancy between a meta-analytic-based literature review and a traditional narrative review is not surprising, given the differences between the two procedures. The present results indicate the value of conducting quantitative reviews.
⁴A moderator analysis subgrouping by job type for self-supervisor means revealed no reduction in variance.
In terms of mean differences, on average, self-ratings were over a half standard deviation higher than supervisor ratings and approximately one-quarter of a standard deviation higher than peer ratings. Nevertheless, given the large variances, the lower-bound confidence intervals included zero in both cases. However, examination of the effect sizes showed that in only one case were the self-ratings lower than supervisor ratings.
The second question concerned whether or not differences between studies within each rater pair could be attributed to statistical artifacts. The results indicated that in all cases substantial amounts of variance still remained, even when the appropriate corrections were made. Accordingly, moderator analyses were conducted.
Of the three moderators, only job type showed any meaningful effects. Specifically, self-supervisor and self-peer correlations were lower for managerial/professional employees than for blue-collar/service employees, and no true variance existed for the former category. However, such effects were not obtained for peer-supervisor correlations. This suggests that while incumbents of managerial/professional jobs have very different views of their performance than do others, observers exhibit far more agreement. One possible explanation is that egocentric bias is more likely to occur in ambiguous contexts (i.e., managerial/professional jobs) than in well-defined jobs (i.e., blue-collar/service). Whatever the explanation, job type does not appear to affect observer agreement; further research is needed to test why this is the case. The one instance where rating scale met the criteria for a moderator should be viewed cautiously for several reasons. First, only a small number of studies (5) using behaviorally based rating scales were available, and the total sample size was small (n = 506). Second, the decrease in amount of variance across subsets was very small. Thus, rating scale was at best a relatively minor moderator of peer-supervisor agreement. In answer to the third question then, moderators did help account for some of the variance.
Recall earlier that three basic explanations were reviewed as to why disagreement between raters may occur. The present findings are consistent with two versions of the egocentric-bias theory. Specifically, in accord with attribution theory (DeVader et al., 1986; Jones & Nisbett, 1972), observers (i.e., peer-supervisor combinations) displayed higher agreement than did "actors" with observers (i.e., self-supervisor and self-peer combinations). The findings also mesh with the moderated defensiveness explanation: even after the range restriction corrections, observer-observer ratings (i.e., peer-supervisor combinations) demonstrated much higher agreement than self-observer ratings.
Conversely, there was no support for the main-effect-of-defensiveness explanation, as correcting for range restriction did not reduce the difference between the self-peer/self-supervisor correlations and the peer-supervisor correlation. Nor was there evidence that observational differences accounted for rater disagreement, given the relatively high correlation between peer and supervisor ratings. Finally, there was no support for the organizational-level hypothesis: an almost identical correlation was found between ratings by individuals at the same level (self-raters and peers) and individuals at different levels (self-raters and supervisors). Moreover, the highest correlation was found for raters at different levels (peer-supervisor). Nevertheless, further research is needed to test each possible explanation of rater disagreement more directly.
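The range restriction corrections referred to here are not spelled out in this section; assuming the standard Thorndike Case II formula used in validity generalization work (Hunter, Schmidt, & Jackson, 1982), the corrected correlation is:

```latex
r_c = \frac{r\,U}{\sqrt{1 + r^2\left(U^2 - 1\right)}},
\qquad U = \frac{\sigma_{\text{unrestricted}}}{\sigma_{\text{restricted}}}
```

where \(r\) is the correlation observed in the restricted sample, \(U\) is the ratio of the unrestricted to the restricted standard deviation on the rating, and \(r_c\) is the corrected estimate. Since \(U > 1\) under restriction, the correction can only raise a correlation, which is why a defensiveness main effect would have been expected to shrink the self-observer gap after correction.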
It is also noteworthy that, with one minor exception, neither rating format nor rating scale was a significant moderator. Other meta-analyses of performance ratings have reached similar conclusions. For example, Kraiger and Ford (1985) found that rating scale did not moderate ratee race effects, and Heneman (1986) likewise reported a lack of support for rating format as a moderator. Hence, despite expectations that instrumentation would be a critical moderator, research has found it less useful in identifying sources of variation than was previously thought. Conversely, the findings regarding job type suggest that it may be a moderator of importance that other researchers in performance appraisal should examine.
Although there are a number of potential reasons why a substantial amount of variance between studies remained in most instances, two seem most plausible. First, it seems likely that studies differed in regard to criterion contamination or rater independence. Specifically, studies may have differed as to how much raters knew about others' ratings. In some studies, subtle forms of criterion contamination may have existed. For example, it is possible that a few studies involving self-ratings were conducted shortly after performance appraisal reviews, wherein ratees had substantial information regarding supervisor ratings. In other instances, employees may have received informal feedback, while in yet other studies employees may have had little or no feedback of any sort. Unfortunately, very few studies report enough detail about the amount or type of feedback to test whether this variable moderates rater agreement.
A second explanation for variance between studies may be differences in raters’ opportunity to observe job performance. For example, workers may have more interaction with peers in some cases than in others, thereby leading to differential agreement with supervisors. Jobs with much task interdependence may lead to greater agreement between self- and peer ratings than jobs where workers perform independently.
Other potential moderators include the purpose of the rating (e.g., Zedeck & Cascio, 1982), the amount of rater training (e.g., Smith, 1986), and rater motivation (Bernardin & Beatty, 1984). However, a review of our studies showed that these variables were also rarely, if ever, reported. Hence, no moderator analyses could be performed. Moreover, McIntyre, Smith, and Hassett (1984) found that purpose of rating had little or no effect on ratings, and many studies have failed to show that training increases the accuracy of performance ratings, particularly when only a lecture approach is used (Smith, 1986). Further theory development and empirical research will be necessary to achieve greater knowledge about which conditions affect rater agreement.
Although one can always argue that a "file-drawer" problem may exist such that inclusion of missing studies would change the results, this is unlikely here. First, an extensive literature review was conducted. Second, Rosenthal's (1979) index suggests that it would take many studies to dramatically alter the results, given the size of the correlations and the total sample size found here.
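Rosenthal's (1979) index is the "fail-safe N": the number of unretrieved null-result studies that would be needed to pull the combined significance level below criterion. As an illustrative sketch (the function name and example values are hypothetical, not drawn from the present data):

```python
def fail_safe_n(z_values, z_crit=1.645):
    """Rosenthal's (1979) fail-safe N: how many file-drawer studies
    averaging z = 0 would reduce the combined Stouffer Z,
    sum(z) / sqrt(k + N), below the one-tailed .05 criterion."""
    sum_z = sum(z_values)
    return sum_z ** 2 / z_crit ** 2 - len(z_values)

# Ten retrieved studies, each with z = 3.0, would tolerate roughly
# 323 unpublished null studies before losing significance:
# fail_safe_n([3.0] * 10) ≈ 322.6
```

The larger the observed correlations and total sample size, the larger this number becomes, which is the basis for the claim above.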
Future research needs to address two issues. First, a more direct test of the competing hypotheses is needed. That is, a study should be conducted wherein different raters (e.g., peers, supervisors, and self-raters) assess performance and provide information regarding moderators. For example, attributions regarding the cause of performance, self-esteem, and amount of observational opportunity, as well as other possible factors, would be measured.
Second, depending on the outcome of this research, further work on ways to ameliorate rater bias would be useful. For example, if differential attributions are a source of conflict, the Performance Distribution Assessment (PDA), described by Bernardin and Beatty (1984), might provide a vehicle for increasing rater agreement. Specifically, the PDA attempts to take situational constraints into account. Research should also focus on identifying possible moderators of self-other agreement. For instance, impression management may be a variable that can account for self-supervisor and self-peer disagreement (Zerbe & Paulhus, 1987). Clearly, more research examining a variety of individual differences is in order.
In terms of practical implications, the results suggest that self-ratings will generally show only moderate correlations with ratings by others; this is particularly true for managerial/professional employees. Practitioners considering the use of self-ratings should be aware that much disagreement is likely. Conversely, peer-supervisor ratings often (but not always) reflect relatively high rater agreement, even for managerial/professional jobs. For legal purposes, then, peer-supervisor ratings may be useful. Using raters from different levels may also help to develop consensus, eliminate biases, and perhaps in turn lead to greater acceptance by ratees. Nevertheless, a host of other problems is associated with using peer ratings (Latham & Wexley, 1982).
In conclusion, the present results show that peer-supervisor ratings demonstrate a relatively high correlation, whereas self-supervisor and self-peer ratings exhibit only moderate correlations. Compared with the correlation between supervisor ratings and results-oriented outcomes (production, sales, etc.; Heneman, 1986), the average correlation obtained here between supervisor and peer ratings is substantially higher. Thus, from both an absolute and a relative point of view, there is high convergence between observer ratings. The moderator analysis showed that correlations between self-peer and self-supervisor ratings are particularly low for managerial/professional workers, whereas peer-supervisor correlations are not. While these findings are consistent with an egocentric-bias explanation for rater disagreement, future research should test the alternative explanations described earlier in a more direct fashion than was possible here.
REFERENCES
Baird LS. (1977). Self and superior ratings of performance: As related to self-esteem and satisfaction with supervision. Academy of Management Journal, 20, 291-300.
Bernardin HJ, Beatty RW. (1984). Performance appraisal: Assessing human behavior at work. Boston, MA: Kent Publishing Company.
Borman WC. (1974). The rating of individuals in organizations: An alternate approach. Organizational Behavior and Human Performance, 12, 105-124.
Campbell JP, Dunnette MD, Lawler EE, Weick KE. (1970). Managerial behavior, performance, and effectiveness. New York: McGraw-Hill.
Carroll SJ, Schneier CE. (1982). Performance appraisal and review systems: The identification, measurement, and development of performance in organizations. Glenview, IL: Scott, Foresman.
Cascio WF. (1987). Applied psychology in personnel management. Englewood Cliffs, NJ: Prentice-Hall.
Cummings LL, Schwab DP. (1973). Performance in organizations: Determinants and appraisal. Glenview, IL: Scott, Foresman.
DeVader CL, Bateson AG, Lord RG. (1986). Attribution theory: A meta-analysis of attributional hypotheses. In Locke E (Ed.), Generalizing from laboratory to field settings (pp. 63-81). Lexington, MA: Lexington Books.
Ferris GR, Yates VL, Gilmore DC, Rowland KM. (1985). The influence of subordinate age on performance ratings and causal attributions. PERSONNEL PSYCHOLOGY, 38, 545-557.
Henderson RI. (1984). Performance appraisal. Reston, VA: Reston Publishing Company.
Heneman HG III. (1974). Comparisons of self- and superior ratings of managerial performance. Journal of Applied Psychology, 59, 638-642.
Heneman RL. (1986). The relationship between supervisory ratings and results-oriented measures of performance: A meta-analysis. PERSONNEL PSYCHOLOGY, 39, 811-826.
Holzbach RL. (1978). Rater bias in performance ratings: Superior-, self-, and peer ratings. Journal of Applied Psychology, 63, 579-588.
Hunter JE, Schmidt FL, Jackson GB. (1982). Meta-analysis: Cumulating research findings across studies. Beverly Hills, CA: Sage Publications.
Ilgen DR, Favero JL. (1985). Limits in generalization from psychological research to performance appraisal processes. Academy of Management Review, 10, 311-321.
Jones EE, Nisbett RE. (1972). The actor and the observer: Divergent perceptions of the causes of behavior. In Jones EE, Kanouse DA, Kelly HH, Nisbett RE, Valins S, Weiner B (Eds.), Attribution: Perceiving the causes of behavior (pp. 79-94). Morristown, NJ: General Learning Press.
Kay E, Meyer HH, French JRP. (1965). Effects of threat in a performance appraisal interview. Journal of Applied Psychology, 49, 311-317.
Klimoski RJ, London M. (1974). Role of the rater in performance appraisal. Journal of Applied Psychology, 59, 445-451.
Kraiger K, Ford JK. (1985). A meta-analysis of ratee race effects in performance ratings. Journal of Applied Psychology, 70, 56-65.
Landy FJ, Farr JL. (1980). Performance rating. Psychological Bulletin, 87, 72-107.
Landy FJ, Farr JL, Saal FE, Freytag WR. (1976). Behaviorally anchored scales for rating the performance of police officers. Journal of Applied Psychology, 61, 458-557.
Latham GP, Wexley KN. (1982). Increasing productivity through performance appraisal. Reading, MA: Addison-Wesley.
Lawler EE. (1967). The multitrait-multirater approach to measuring managerial job performance. Journal of Applied Psychology, 51, 369-381.
Mabe PA, West SG. (1982). Validity of self-evaluation of ability: A review and meta-analysis. Journal of Applied Psychology, 67, 280-296.
McIntyre RM, Smith DE, Hassett CE. (1984). Accuracy of performance ratings as affected by rater training and perceived purpose of rating. Journal of Applied Psychology, 69, 147-156.
Pearlman K, Schmidt FL, Hunter JE. (1980). Validity generalization results for tests used to predict job proficiency and training success in clerical occupations. Journal of Applied Psychology, 65, 373-406.
Pym DLA, Auld HD. (1965). The self-rating as a measure of employee satisfactoriness. Journal of Occupational Psychology, 39, 103-113.
Rosenthal R. (1979). The "file drawer problem" and tolerance for null results. Psychological Bulletin, 86, 638-641.
Sackett PR, Tenopyr ML, Schmitt N, Kehoe J. (1985). Commentary on forty questions about validity generalization and meta-analysis. PERSONNEL PSYCHOLOGY, 38, 697-798.
Schmitt N, Schneider B. (1983). Current issues in personnel selection. In Rowland KW, Ferris GD (Eds.), Research in personnel and human resources, Vol. 1 (pp. 85-125). Greenwich, CT: JAI Press.
Smith DE. (1986). Training programs for performance appraisal: A review. Academy of Management Review, 11, 22-40.
Springer D. (1953). Ratings of candidates for promotion by co-workers and supervisors. Journal of Applied Psychology, 37, 347-351.
Steel RP, Ovalle NK. (1984). Self-appraisal based upon supervisory feedback. PERSONNEL PSYCHOLOGY, 37, 667-685.
Thornton GC. (1980). Psychometric properties of self-appraisals of job performance. PERSONNEL PSYCHOLOGY, 33, 263-271.
Tucker MF, Cline VB, Schmitt JR. (1967). Prediction of creativity and other performance measures from biographical information among pharmaceutical scientists. Journal of Applied Psychology, 51, 131-138.
Wexley KN, Klimoski R. (1984). Performance appraisal: An update. In Rowland KM, Ferris GD (Eds.), Research in personnel and human resources, Vol. 2 (pp. 35-79). Greenwich, CT: JAI Press.
Williams WE, Seiler DA. (1973). Relationship between measures of effort and performance. Journal of Applied Psychology, 57, 49-54.
Zammuto RF, London M, Rowland KM. (1982). Organization and rater differences in performance appraisals. PERSONNEL PSYCHOLOGY, 35, 643-658.
Zedeck S, Baker HT. (1972). Nursing performance as measured by behavioral expectations scales: A multitrait-multirater analysis. Organizational Behavior and Human Performance, 7, 457-466.
Zedeck S, Cascio W. (1982). Performance decision as a function of purpose of rating and training. Journal of Applied Psychology, 67, 752-758.
Zerbe WJ, Paulhus DL. (1987). Socially desirable responding in organizational behavior: A reconception. Academy of Management Review, 12, 250-264.
STUDIES INCLUDED IN THE META-ANALYSIS
Baird LS. (1977). Self and superior ratings of performance: As related to self-esteem and satisfaction with supervision. Academy of Management Journal, 20, 291-300.
Blackburn RT, Clark MJ. (1975). An assessment of faculty performance: Some correlates between administrator, colleague, student, and self ratings. Sociology of Education, 48, 242-256.
Booker GS, Miller RW. (1966). A closer look at peer ratings. Personnel, 43, 42-47.
Borman WC. (1974). The rating of individuals in organizations: An alternate approach. Organizational Behavior and Human Performance, 12, 105-124.
Boruch RF, Larkin JD, Wolins L, MacKinney AC. (1970). Alternative methods of analysis: Multitrait-multimethod data. Educational and Psychological Measurement, 30, 833-853.
Brief AP, Aldag RJ, Van Sell M. (1977). Moderators of the relationships between self and superior evaluations of job performance. Journal of Occupational Psychology, 50, 129-134.
Burke RJ, Weitzel W, Deszca G. (1982). Supervisors' and subordinates' communication experiences in managing job performance: Two blind men describing an elephant? Psychological Reports, 50, 527-534.
Campbell DJ. (1985). Self-appraisals from two perspectives: Esteem versus consistency influences. 28th Annual Proceedings of the Midwest Academy of Management.
Centra JA. (1975). Colleagues as raters of classroom instruction. Journal of Higher Education, 46, 327-337.
Doyle KD, Crichton CI. (1978). Student, peer, and self evaluations of college instructors. Journal of Educational Psychology, 70, 815-826.
Ekpo-Ufot A. (1979). Self-perceived task-relevant abilities, rated job performance, and complaining behavior of junior employees in a government ministry. Journal of Applied Psychology, 64, 429-434.
Ferris GR, Yates VL, Gilmore DC, Rowland KM. (1985). The influence of subordinate age on performance ratings and causal attributions. PERSONNEL PSYCHOLOGY, 38, 545-557.
Fuqua DR, Johnson AW, Newman JL, Anderson MW, Gade EM. (1984). Variability across sources of performance ratings. Journal of Counseling Psychology, 31, 249-252.
Gormly J. (1984). Correspondence between personality trait ratings and behavior events. Journal of Personality, 52, 220-232.
Griffiths RD. (1975). The accuracy and correlates of psychiatric patients' self-assessment of their work behavior. British Journal of Social and Clinical Psychology, 14, 181-189.
Gunderson EK, Nelson PD. (1966). Criterion measures for extremely isolated groups. PERSONNEL PSYCHOLOGY, 19, 67-80.
Heneman HG III. (1974). Comparisons of self- and superior ratings of managerial performance. Journal of Applied Psychology, 59, 638-642.
Hoffman EL, Rohrer JH. (1954). An objective peer evaluation scale: Construction and validity. Educational and Psychological Measurement, 14, 332-341.
Holzbach RL. (1978). Rater bias in performance ratings: Superior-, self-, and peer ratings. Journal of Applied Psychology, 63, 579-588.
Ilgen DR, Peterson RB, Martin BA, Boeschen DA. (1981). Supervisor and subordinate reactions to performance appraisal sessions. Organizational Behavior and Human Performance, 28, 311-330.
Kaufman GG, Johnson JC. (1974). Scaling peer ratings: An examination of the differential validities of positive and negative nominations. Journal of Applied Psychology, 59, 302-306.
Kay E, Meyer HH, French JRP. (1965). Effects of threat in a performance appraisal interview. Journal of Applied Psychology, 49, 311-317.
King LM, Hunter JE, Schmidt FL. (1980). Halo in a multidimensional forced-choice performance evaluation scale. Journal of Applied Psychology, 65, 507-516.
Kirchner WK. (1965). Relationships between supervisory and subordinate ratings for technical personnel. Journal of Industrial Psychology, 3, 57-60.
Klimoski RJ, London M. (1974). Role of the rater in performance appraisal. Journal of Applied Psychology, 59, 445-451.
Kraut AI. (1975). Prediction of managerial success by peer and training staff ratings. Journal of Applied Psychology, 60, 389-394.
Kubany AJ. (1957). Use of sociometric peer nominations in medical education research. Journal of Applied Psychology, 41, 389-394.
Lawler EE. (1967). The multitrait-multirater approach to measuring managerial job performance. Journal of Applied Psychology, 51, 369-381.
Levine EL, Flory A, Ash RA. (1977). Self-assessment in personnel selection. Journal of Applied Psychology, 62, 428-435.
Love KG. (1981). Comparison of peer assessment methods: Reliability, validity, friendship bias, and user reaction. Journal of Applied Psychology, 66, 451-457.
Maslow AH, Zimmerman W. (1956). College teaching ability, scholarly activity and personality. Journal of Educational Psychology, 47, 185-189.
Mihal WL, Graumenz JL. (1984). An assessment of the accuracy of self-assessment for career decision making. Journal of Vocational Behavior, 25, 245-253.
Mitchell TR, Albright DW. (1972). Expectancy theory predictions of the satisfaction, effort, performance, and retention of naval aviation officers. Organizational Behavior and Human Performance, 8, 1-20.
Motowidlo SJ. (1982). Relationship between self-rated performance and pay satisfaction among sales representatives. Journal of Applied Psychology, 67, 209-213.
Mount MK. (1984). Supervisor, self, and subordinate ratings of performance and satisfaction with supervision. Journal of Management, 10, 305-320.
Nealy SM, Owen TW. (1970). A multitrait-multimethod analysis of predictors and criteria of nursing performance. Organizational Behavior and Human Performance, 5, 348-365.
Orpen C. (1983). Effects of race of rater and ratee on peer-ratings of managerial potential. Psychological Reports, 52, 507-510.
Parker JW, Taylor EK, Barrett RS, Martens L. (1959). Rating scale content: III. Relationship between supervisor and self-ratings. PERSONNEL PSYCHOLOGY, 12, 49-63.
Porter LW, Lawler EE. (1968). Managerial attitudes and performance. Homewood, IL: Irwin.
Prien EP, Liske RE. (1962). Assessments of higher-level personnel: III. Rating criteria: A comparative analysis of supervisor ratings and incumbent ratings of job performance. PERSONNEL PSYCHOLOGY, 15, 187-194.
Pym DLA, Auld HD. (1965). The self-rating as a measure of employee satisfactoriness. Journal of Occupational Psychology, 39, 103-113.
Schmitt N, Saari BB. (1978). Behavior, situation, and rater variance in descriptions of leader behaviors. Multivariate Behavioral Research, 13, 483-495.
Schneier CE, Beatty RW. (1978). The influence of role prescriptions on the performance appraisal process. Academy of Management Journal, 21, 129-135.
Shapiro GL, Dessler G. (1985). Are self-appraisals more realistic among professionals or nonprofessionals in health care? Public Personnel Management, 14, 285-290.
Shore LF, Thornton GC III. (1986). Effects of gender on self- and supervisory ratings. Academy of Management Journal, 29, 115-129.
Shrauger JS, Lund AK. (1975). Self-evaluation and reactions to evaluations from others. Journal of Personality, 43, 94-108.
Siegel L. (1982). Paired comparison evaluations of managerial effectiveness by peers and supervisors. PERSONNEL PSYCHOLOGY, 35, 843-852.
Smircich L, Chesser RJ. (1981). Supervisors' and subordinates' perceptions of performance: Beyond disagreement. Academy of Management Journal, 24, 198-205.
Springer D. (1953). Ratings of candidates for promotion by co-workers and supervisors. Journal of Applied Psychology, 37, 347-351.
Steel RP, Ovalle NK. (1984). Self-appraisal based upon supervisory feedback. PERSONNEL PSYCHOLOGY, 37, 667-685.
Thornton GC. (1968). The relationship between supervisory and self-appraisals of executive performance. PERSONNEL PSYCHOLOGY, 21, 441-455.
Tucker MF, Cline VB, Schmitt JR. (1967). Prediction of creativity and other performance measures from biographical information among pharmaceutical scientists. Journal of Applied Psychology, 51, 131-138.
Webb WB, Nolan CY. (1955). Student, supervisor and self-ratings of instructional proficiency. Journal of Educational Psychology, 46, 42-46.
Williams WE, Seiler DA. (1973). Relationship between measures of effort and job performance. Journal of Applied Psychology, 57, 49-54.