
Measuring the Coworker Effects on Wages∗

Jianhong Xin†

Saturday 9th January, 2021


Abstract

The fast-growing literature studying the impact of coworkers on individuals’ wages has recently made significant progress by developing techniques that allow it to move from small and idiosyncratic case studies to more generalizable studies based on large labor markets. However, I show that the empirical methodology underlying this shift delivers a large positive or negative bias in measured coworker effects in realistic settings. I combine insights from assortative matching theory with recent computer science advances in graph embedding techniques to develop a machine learning method that allows researchers to obtain efficient and unbiased estimates in those settings. The proposed method makes it possible to measure non-parametrically the potentially heterogeneous impact of different coworkers on individuals’ wages. I am currently using the proposed method to measure coworker effects in matched employer-employee panel data covering the entire population of Denmark.

Keywords: Coworker Effects, Two-Sided Unobserved Heterogeneity, Assortative Matching, Machine Learning, Graph Embedding, Matrix Completion.

∗I am deeply indebted to Iourii Manovskii, Marcus Hagedorn, and Dirk Krueger for their invaluable guidance and support. I would like to thank Harold Cole, José-Víctor Ríos-Rull, Andrew Postlewaite, Xu Cheng, Guillermo Ordonez, Hanming Fang, and all other seminar participants at the University of Pennsylvania and the 2020 Joint Statistical Meetings.

†University of Pennsylvania, Department of Economics, The Ronald O. Perelman Center for Political Science and Economics, 133 South 36th Street, Philadelphia, PA 19104. Email: [email protected].

https://economics.sas.upenn.edu/people/jianhong-xin

1 Introduction

How does the wage of a worker depend on where she works and whom she works with? How can we disentangle the contributions to wages of the unobserved components: the worker’s individual productivity, her firm’s productivity, and her coworkers’ productivities? What is the magnitude of the complementarity between the productivity of the worker and that of the firm? What is the magnitude of the complementarity between the productivities of coworkers? How can we predict the potential wage and output of any worker relocated to a firm she has never worked at, with coworkers she may never have encountered? Measuring coworker effects on wages at the scale of a local labor market provides the key to answering these questions. The empirics of coworker effects also pave the way for subsequent research, for instance on the efficiency of the labor market allocation: a method that predicts wages and output in a counterfactual meeting of any workers in any firm is capable of projecting the optimal assignment of workers against search frictions, and can shed light on the design of policies, such as taxation and unemployment insurance, that aim to achieve the optimal assignment.

The central empirical challenge in estimating coworker effects is the well-known selection problem (Manski (1993), Brock and Durlauf (2001), Angrist (2014), Bramoulle et al. (2020)). The problem is rooted in sorting in the labor market: workers may be endogenously matched with their peers across firms and occupations. The sorting may be based on similar productive attributes or common external causes, which could confound the peer effects. Moreover, a large proportion of these attributes may not be observed in the data, given that observable worker and firm characteristics account for only a small fraction of the observed variation in wages. 1

The conventional approach to circumventing the selection problem is to exploit exogenous sources of variation in exposure to coworker influences. Researchers can randomize peer components or the treatment affecting the outcome through field experiments (Sacerdote (2001), Duflo et al. (2011), Banerjee et al. (2015), Eckles et al. (2016) and Feld and Zölitz (2017)), or they have to rely on a variety of strong and context-specific exogeneity assumptions for observational data. 2 Perhaps not surprisingly, this empirical research has been limited to small and idiosyncratic case studies. It is difficult to apply these methods to investigate coworker effects at a broader scale, and the empirical findings of these studies exhibit a large degree of heterogeneity, which makes it difficult to generalize their results to the whole market as well. 3

1A large number of studies have found evidence that workers with similar unobservable productive attributes systematically self-select across firms and become peers. To name just a few, Andrews et al. (2012) find a positive correlation between worker and firm contributions to wages in German data. Hagedorn et al. (2017) identify sorting based on unobserved characteristics of both workers and firms and find positive assortative matching (PAM), which reinforces the previous finding. Abowd et al. (2018) find that unobservable differences in worker productivity are strongly positively correlated with unobservable differences in firm productivity across sectors. Crane (2014) finds, using U.S. census data, that productive, fast-growing firms tend to hire more productive workers. Mendes et al. (2007) document PAM in Portuguese data. Borovikova and Shimer (2017) find that high-wage workers work for high-wage firms in Austria.

2For example, models studying peer effects in classrooms typically assume idiosyncratic variations in peer

Yet there is growing recent interest in the literature in moving beyond these case studies to investigate coworker effects in large labor markets. In a recent advance, Cornelissen et al. (2017) (henceforth CDS) are the first to estimate average coworker effects on wages for the whole labor market in Germany. They propose an empirical method to measure the average worker’s wage response to a change in her coworker quality with a linear-in-means model. Building on the worker and firm fixed effects model pioneered by Abowd et al. (1999) (AKM hereafter), the method accounts for selection into peer groups by including additive worker and firm fixed effects. Following the approach of Arcidiacono et al. (2012), the method estimates coworker effects emanating from unobservable characteristics by iteratively estimating the focal worker’s ability, the firm’s ability, and the spillover coefficient that maps her coworker quality, measured by the average of the coworkers’ fixed effects, into her own wage response. Similar strategies based on two-way additive fixed effects are also adopted by Hanushek et al. (2003), Betts and Zau (2004), Lavy et al. (2012), Burke and Sass (2013) and Sanz de Galdeano (2020).

However, this method is confronted with two challenges. First, by assuming linearly additive fixed effects, the model cannot capture potential interactions between the unobserved heterogeneity of agents on the two sides of the market. In particular, wages that are monotone in worker and firm productivity are inconsistent with standard sorting models in which the complementarity between worker and firm productivities plays a key role (e.g. Becker (1973), Eeckhout and Kircher (2011), Gautier and Teulings (2006), Lopes de Melo (2013), Lentz (2010), Lise et al. (2016), Hagedorn et al. (2014)). The estimated worker and firm fixed effects may therefore fail to reflect the true worker abilities and firm productivities when these attributes are truly complementary in the underlying production process. 4 Based on

components across cohorts (Hoxby (2000)). Models studying peer effects on networks often abstract from correlated effects based on unobservable characteristics. The typical assumption is that the network and observables are exogenous to the outcome (Bramoulle et al. (2009)), or that the endogenous formation of the network is conditional only on observed characteristics (Bramoulle et al. (2020)). These assumptions abstract from correlated effects based on unobserved heterogeneity.

3Mas and Moretti (2009), for instance, find strong evidence of positive productivity spillovers conditional on the physical presence of coworkers in a supermarket chain, while in a different setting Bloom et al. (2014) conclude quite the opposite. Waldinger (2009) investigates spillovers among university researchers and finds no evidence of localized peer effects, in sharp contrast to the findings of Herbst and Mas (2015).

4From a theoretical point of view, Gautier and Teulings (2006) and Eeckhout and Kircher (2011) show the role of complementarity in determining the sign and magnitude of sorting and its implications for reallocation. From an empirical point of view, Hagedorn et al. (2017) non-parametrically estimate the wage and output profiles and find overall positive assortative matching between workers and firms and positive


these estimates, the CDS method can induce a sizable misspecification bias in the measured

coworker effects: I show that the bias can be large, with its sign being either positive or

negative, depending on the strength and sign of the wage complementarity.5

Second, the linear-in-means model captures only average coworker effects. Despite its tractability, the linear-in-means model cannot be reconciled with the vast heterogeneity in coworker effects documented in the empirical literature: across occupations and industries, different workers are found to be affected differently by their coworkers. 6 This second challenge stems from the fact that coworker influences are heterogeneous and that this heterogeneity may not be observed in the data. Yet there has been little discussion in the literature of how to estimate heterogeneous coworker effects based on unobservables.

In this paper, I propose a new semi-parametric methodology to measure the effects of coworkers on wages. Combining economic theory with recent advances in machine learning, the method accounts for the dependence of wages on the heterogeneous productive attributes of both workers and firms, whether observed or unobserved. First, the method delivers efficient and robust estimates in the presence of potential complementarities between these attributes on the two sides of the market. Second, going beyond linear-in-means models, the method is also the first to capture heterogeneous coworker effects based on unobserved heterogeneity.

The main idea for identifying non-parametrically wages that are non-linear in the complementarities and the coworker effects is to partition the set of workers into clusters such that workers within a cluster are similar to each other in their productivities and coworker influences. Then, at the cluster level, I estimate the complementarities between each worker cluster and each firm, as well as the coworker effects between worker clusters.

The identification consists of two stages. In the first stage, I identify similar workers and cluster workers based on their similarity. To achieve this goal, I develop a method in the spirit of Hagedorn et al. (2020) that takes into account the potentially time-varying impact of the evolving set of coworkers. The key underlying assumption is that wages are driven by the unobserved “type” of the worker, the firm she is matched with, and the “types” of her coworkers. Here, a “type” is interpreted as time-invariant unobserved membership in a group with certain productive worker attributes that govern the worker’s productivity

complementarity between their productivities in output, in sharp contrast to what an additive firm and worker fixed effects model would imply.

5Specifically, “wage complementarity” refers to the wage component that is specific to the combination of the productivity types of the worker and the firm in the match.

6There are two other major problems with the linear-in-means model recognized in the literature. First, the model is of limited interest because it constrains the net effect of regrouping peers, so the estimates have only limited policy implications. Second, the linear-in-means model is often misspecified and is not supported by numerous empirical studies (Hoxby and Weingarth (2005), Sacerdote (2011)).


and her peer influence on coworkers. The premise is that there could be a large number of unobserved types of workers and coworkers that meaningfully determine the observed wages in the data, and the core of the method is to partition workers of a large number of types into a relatively small number of clusters so that each cluster is populated with workers of similar types. I leverage an insight shared by most sorting models in the literature: two workers with similar attributes and similar work histories must earn similar wages in the same firm when influenced by the same set of coworkers. These restrictions allow me to identify pairs of workers with similar unobserved types by comparing their wages observed in the same firm at each point in time.

Given a worker partition, the method makes it possible to predict counterfactually the “out-of-sample” wage of any worker relocated to a firm she has never worked at, with coworkers she may never have encountered, using the wages of other workers in the target firm who belong to her “similar worker cluster”. The clustering-based estimation can be viewed as a “matrix completion” process that measures the global consequences of a reallocation for wages (and potentially other economic outcomes) between every worker and every firm with any combination of coworkers. Methodologically, completing the wage matrix is also important for two reasons. First, it allows me to evaluate the similarity of workers who have not worked together in the same firm, by comparing their “completed wages” based on the wages of other workers in their clusters. Second, it allows me to validate the quality of the clustering by testing the accuracy of the out-of-sample predictions.
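The completion step can be illustrated with a minimal sketch. It assumes observations arrive as hypothetical `(worker, firm, log_wage)` tuples and that a worker-to-cluster mapping `cluster_of` is already available; the function name `complete_wages` is introduced here for illustration only. For simplicity the sketch ignores coworker composition and fills a missing worker-firm cell with the average wage of the worker's cluster in that firm.

```python
from collections import defaultdict


def complete_wages(observations, cluster_of):
    """Build a predictor of a worker's wage in any firm from cluster-firm averages.

    observations: iterable of (worker, firm, log_wage) tuples (hypothetical layout).
    cluster_of:   dict mapping each worker to a cluster label.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for worker, firm, wage in observations:
        key = (cluster_of[worker], firm)
        sums[key] += wage
        counts[key] += 1

    def predict(worker, firm):
        # "Completed" wage: mean wage of the worker's cluster in the target firm.
        key = (cluster_of[worker], firm)
        if counts.get(key, 0) == 0:
            return None  # no comparable worker observed in this firm
        return sums[key] / counts[key]

    return predict
```

A cell stays empty (`None`) when no member of the worker's cluster is observed in the target firm, which is the sparsity problem discussed next.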

In the data, however, most individual workers have worked for only a few firms with a limited number of coworkers, and not many similar workers have worked together in the same firm. This poses a sparsity problem for the proposed method, since the theory implies that wages are comparable only within firms. To address the sparsity problem, I adopt a hierarchical clustering approach that groups workers iteratively. I start with the most restrictive criterion, joining only pairs of workers with the minimal average wage difference throughout their time as coworkers. Although the strictness of the criterion limits the number of initial merges, joining these workers into bigger clusters mitigates the sparsity problem by expanding the connectivity of the coworker network and the set of available comparisons. This is because I can now make comparisons not only between a worker and her immediate coworkers in the same firm, but also between her and the coworkers that another individual from her cluster encountered at different firms, based on her wages “completed” by the cluster average. I continue to merge similar workers until no more workers can be merged under the current criterion. Then I move iteratively through a set of progressively relaxed similarity criteria, allowing workers to be merged within an increasingly large cutoff for their wage differences. The process delivers a hierarchical sequence of worker clusterings.
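The iterative merging can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: observations are hypothetical `(worker, firm, period, log_wage)` tuples, the distance between two clusters is taken to be the average absolute gap between their mean wages over the firm-period cells they share, and the names `cluster_distances` and `hierarchical_partitions` are invented for this sketch.

```python
from collections import defaultdict
from itertools import combinations


def cluster_distances(observations, label):
    """Average absolute gap in mean wages between clusters sharing a (firm, period) cell."""
    cell = defaultdict(list)  # (firm, period, cluster) -> wages
    for worker, firm, period, wage in observations:
        cell[(firm, period, label[worker])].append(wage)
    by_cell = defaultdict(dict)  # (firm, period) -> {cluster: mean wage}
    for (firm, period, c), wages in cell.items():
        by_cell[(firm, period)][c] = sum(wages) / len(wages)
    gaps = defaultdict(list)
    for means in by_cell.values():
        for a, b in combinations(sorted(means), 2):
            gaps[(a, b)].append(abs(means[a] - means[b]))
    return {pair: sum(g) / len(g) for pair, g in gaps.items()}


def hierarchical_partitions(observations, thresholds):
    """Return one worker -> cluster mapping per cutoff, visiting cutoffs in increasing order."""
    label = {w: w for w, _, _, _ in observations}  # start with singleton clusters
    sequence = []
    for cutoff in sorted(thresholds):
        while True:
            dist = cluster_distances(observations, label)
            candidates = [(d, pair) for pair, d in dist.items() if d <= cutoff]
            if not candidates:
                break  # nothing left to merge under the current criterion
            _, (a, b) = min(candidates)  # merge the closest admissible pair first
            label = {w: (a if c == b else c) for w, c in label.items()}
        sequence.append(dict(label))
    return sequence
```

Because distances are recomputed from cluster averages after every merge, two workers who never shared a firm can end up in the same cluster through a common coworker, which is how merging expands the set of available comparisons. The sketch recomputes all distances at each step and is meant for small examples only.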

There is an apparent trade-off in deciding when to terminate the algorithm. As the procedure iterates forward, joining more individuals into larger clusters, the coworker network becomes increasingly dense. The method is therefore able to make predictions for a wider range of workers, as larger sets of comparable workers become available. At the same time, however, the accuracy of the predictions gradually deteriorates, subject to a greater “approximation error”, since each cluster becomes less pure as workers with lower similarity are included. To balance this trade-off, I resort to machine learning: I split the data into three random subsets, namely the training set, the validation set, and the test set. I estimate the hierarchical sequence of worker clusterings using only observations in the training set. The optimal worker partition is then chosen from this sequence as the one that best predicts the observations in the validation set. For evaluation purposes, the performance of the selected worker partition is assessed and reported using the test set.
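The selection rule can be sketched in a few lines. The sketch assumes hypothetical `(worker, firm, log_wage)` observations, scores each candidate partition by mean squared prediction error on the held-out set, and, as an assumption of this illustration rather than a rule stated in the text, falls back to the overall training mean when a cluster-firm cell is empty so that low coverage is penalized. All function names and the default split shares are illustrative.

```python
import random
from collections import defaultdict


def split_observations(observations, shares=(0.6, 0.2, 0.2), seed=0):
    """Randomly split observations into training, validation and test sets."""
    rows = list(observations)
    random.Random(seed).shuffle(rows)
    n_train = int(shares[0] * len(rows))
    n_valid = int(shares[1] * len(rows))
    return rows[:n_train], rows[n_train:n_train + n_valid], rows[n_train + n_valid:]


def prediction_mse(partition, train, holdout):
    """Mean squared error of cluster-firm average wages (from train) on holdout."""
    sums, counts = defaultdict(float), defaultdict(int)
    for worker, firm, wage in train:
        key = (partition[worker], firm)
        sums[key] += wage
        counts[key] += 1
    fallback = sum(wage for _, _, wage in train) / len(train)
    errors = []
    for worker, firm, wage in holdout:
        key = (partition.get(worker), firm)
        pred = sums[key] / counts[key] if key in counts else fallback
        errors.append((wage - pred) ** 2)
    return sum(errors) / len(errors)


def select_partition(partitions, train, validation):
    """Return the index of the partition with the lowest validation error, and all scores."""
    scores = [prediction_mse(p, train, validation) for p in partitions]
    best = min(range(len(partitions)), key=scores.__getitem__)
    return best, scores
```

The test set is deliberately left out of `select_partition`; calling `prediction_mse` on it once, with the selected partition, gives the reported out-of-sample performance.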

In the second stage of the approach, I estimate the wage complementarities and coworker effects given the worker clustering obtained in the first stage. The key to accounting for potential self-selection in estimating coworker effects is to control for the wage complementarity that governs the sorting of workers across firms based on their unobserved productivity types. To control for the wage complementarity empirically, I include two-dimensional “match fixed effects” indexed jointly by the identity of the individual’s worker cluster and the identity of the individual firm in the match. Moving beyond the AKM approach, which relies on a restrictive decomposition into worker and firm fixed effects, the proposed method non-parametrically identifies the wage complementarity, which reflects non-linear interactions in each type of match.

Conditional on the wage complementarity, I non-parametrically identify coworker effects using the focal worker’s wage response to variations in her coworker productivity distribution that are driven by coworkers’ job mobility. Here, the important identification assumption is that, conditional on the productivity of the worker, the productivity of the firm, and the productivity of all her coworkers, the wages of the worker are exogenous to her coworkers’ job mobility. Empirically, I approximate the coworker productivity distribution of each individual with discrete “coworker components”, that is, the mass function giving the measure of coworkers assigned to each worker bin. I then estimate the coworker effects function by projecting the wages of the worker on her coworker components conditional on the match effects. Each coefficient captures a coworker influence, namely the elasticity of the wage response to the coworker share of each productivity type. Going beyond the linear-in-means model, the proposed method non-parametrically estimates heterogeneous coworker influences, as these coefficients are unrestricted by functional form assumptions. In an important extension of the method, I further generalize the framework to measure complementarities between coworkers. I estimate a two-dimensional coworker effect function that allows for asymmetric coworker influence from one cluster to another. In this general framework, I show that the identification of the worker clustering follows the identical machine learning algorithm, and that the two-dimensional coworker effects can be estimated separately for the focal workers in each worker cluster.
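A minimal sketch of this second stage, under simplifying assumptions: observations are hypothetical `(worker, firm, period, log_wage)` tuples, clusters are labelled `0, ..., n_clusters - 1`, the coworker components are leave-one-out cluster shares, and the projection is a plain least-squares fit after demeaning within each (worker cluster, firm) cell to absorb the match fixed effects. The function names and the linear specification are choices of this illustration.

```python
import numpy as np
from collections import defaultdict


def coworker_shares(observations, cluster_of, n_clusters):
    """Share of a worker's coworkers (same firm and period) in each cluster."""
    counts = defaultdict(lambda: np.zeros(n_clusters))
    for worker, firm, period, _ in observations:
        counts[(firm, period)][cluster_of[worker]] += 1
    shares = np.zeros((len(observations), n_clusters))
    for r, (worker, firm, period, _) in enumerate(observations):
        others = counts[(firm, period)].copy()
        others[cluster_of[worker]] -= 1  # exclude the focal worker
        if others.sum() > 0:
            shares[r] = others / others.sum()
    return shares


def estimate_coworker_effects(observations, cluster_of, n_clusters):
    """Project wages on coworker shares, conditional on (cluster, firm) match effects."""
    shares = coworker_shares(observations, cluster_of, n_clusters)
    y = np.array([row[3] for row in observations], dtype=float)
    X = shares[:, 1:].copy()  # cluster 0 is the reference, since shares sum to one
    cells = defaultdict(list)
    for r, (worker, firm, _, _) in enumerate(observations):
        cells[(cluster_of[worker], firm)].append(r)
    for rows in cells.values():
        # Within transformation: removes the match fixed effect of each cell.
        X[rows] -= X[rows].mean(axis=0)
        y[rows] -= y[rows].mean()
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef
```

Each returned coefficient is read relative to the reference cluster. Estimating the function separately for focal workers in each cluster, as in the two-dimensional extension, would amount to calling the same routine on the corresponding subsample.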

I am now taking the algorithm to the administrative matched employer-employee data of Denmark, which cover a population of 3 million workers over a span of 20 years. Despite its high accuracy, the proposed hierarchical clustering algorithm is relatively slow on large datasets: its computational complexity is quadratic in space and cubic in time. To meet the demand for scalability, I integrate the baseline method with Graph Convolutional Networks (GCN), a recent computer science advance in graph embedding techniques (Kipf and Welling (2016), Hamilton et al. (2017)). In recent years, GCN-based graph embedding techniques have enjoyed tremendous success in applications across disciplines, ranging from natural language processing and knowledge graphs to online recommender systems (Xu et al. (2018), Cai et al. (2018), Zhou et al. (2018), Wu et al. (2019)). The fundamental idea is to represent each node of the graph by a vector in a low-dimensional space such that similar pairs of nodes are embedded close to each other in that space. However, while these graph embedding methods are computationally efficient, I find that their accuracy is limited when tested on simulated data.

Following Hagedorn et al. (2020), I integrate the baseline hierarchical clustering algorithm with the graph embedding techniques through a divide-and-conquer strategy. The “dividing” step computes worker embeddings using GraphSAGE, groups closely embedded workers, and divides them into separate subsets. The “conquering” step applies the baseline hierarchical clustering algorithm to each local subset only: when the dividing step is relatively accurate, only similar workers are assigned to each cluster. The divide-and-conquer strategy can therefore significantly reduce the dimension of the problem by eliminating a large volume of redundant comparisons without compromising accuracy.

This paper is related to multiple strands of the literature. The first contribution is to the fast-moving research on peer effects using large matched employer-employee data. I develop a parsimonious machine-learning-based approach that delivers reliable and testable results. My framework extends Cornelissen et al. (2017) by allowing wages to reflect flexible worker-firm complementarities and to capture heterogeneous peer effects across unobserved worker productivities. Moving beyond homogeneous peer effects, Sanz de Galdeano (2020) finds heterogeneous peer effects across observed characteristics in matched employer-employee (MEE) data for Brazil, taking an approach similar to that of Arcidiacono et al. (2012) and Cornelissen et al. (2017). In parallel with the literature focusing on contemporaneous peer effects, Herkenhoff et al. (2018) and Jarosch et al. (2019) find an asymmetric flow of knowledge spillovers from high-wage workers to low-wage workers over the years. As for evaluating the efficacy of the estimates, Eckles and Bakshy (2020) conduct a constructed observational study, comparing the predictions of observational estimators of peer behavior based on a nonexperimental control group with a randomized experiment. My method takes a machine learning approach and evaluates out-of-sample predictions on the test set. Second, the paper contributes to the literature on identifying labor market sorting based on unobserved heterogeneity. Bonhomme et al. (2019) and Lentz et al. (2018) propose random-effect-based approaches to bicluster workers and firms based on the wage distribution. Complementary to these existing methods, my approach delivers, in finite samples, precise and accurate counterfactual predictions for any individual worker if allocated to any firm, conditional on the set of coworkers. The third contribution of the paper is to the literature on team production: the method identifies worker-firm sorting and wage complementarities separately from the wage effects of coworkers. With additional assumptions on the bargaining protocol and the market structure, my method also provides a way to estimate non-parametrically worker production in teams from observed wages, which are an equilibrium object and can be non-parametrically inverted for output. In contrast, Bonhomme (2020) quantifies individual contributions given observed team output. Finally, the paper contributes to the literature on search frictions and assortative matching. To my knowledge, this paper is the first to estimate the complementarity between worker and firm productivity while taking the coworker effects into account.

The rest of the paper is organized as follows. Section 2 sets up the environment and illustrates the extent of the misspecification bias in measured coworker effects when the standard assumption that wages are linear in worker and firm effects is violated in the data. Section 3 proposes a novel machine-learning-based approach and applies it to an extended framework of Cornelissen et al. (2017). The extension allows for worker-firm complementarity and captures heterogeneous peer effects. Section 4 presents simulation results that show the efficiency and efficacy of the algorithm. Section 5 integrates my baseline method with recent graph embedding techniques to achieve scalability. Section 6 concludes.

2 Background

This section introduces the framework of Cornelissen et al. (2017), the first and leading empirical methodology that attempts to estimate peer effects for a large local labor market. I then show that the method is not robust to the presence of wage complementarity between workers and firms, which can induce a sizable misspecification bias in the measured coworker effects.

2.1 Environment

In matched employer-employee data for workers $I = \{1, \dots, N\}$ and firms $J = \{1, \dots, M\}$, we can keep track of the wage of each individual worker $i \in I$ when he is matched with firm $j \in J$ in year $t \in T$. For every observed match $(i, j, t)$ in each period, denote the log wage by $w_{ijt}$ and set the match indicator $m_{ijt} = 1$; otherwise $m_{ijt} = 0$. Each worker can hold at most one job in each year. The set of workers in firm $j$ in period $t$ is referred to as the peer group $P_{jt} = \{i \in I \mid m_{ijt} = 1\}$. Denote $N_{jt} = |P_{jt}| - 1$, the number of coworkers of each member of the peer group. Denote the coworker set of a reference worker $i$ by $P_{\sim i,jt} = \{i' \in I,\ i' \neq i \mid m_{i'jt} = 1\}$.
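To make the notation concrete, the following sketch builds the peer groups $P_{jt}$ and coworker sets $P_{\sim i,jt}$ from a list of observed matches. The tuple layout `(i, j, t)` and the function names are assumptions made for this illustration.

```python
from collections import defaultdict


def peer_groups(matches):
    """Map each (firm j, period t) to the peer group P_jt.

    matches: iterable of (i, j, t) tuples, one per observed match (m_ijt = 1).
    """
    groups = defaultdict(set)
    for i, j, t in matches:
        groups[(j, t)].add(i)
    return dict(groups)


def coworkers(groups, i, j, t):
    """Coworker set of worker i in firm j at time t: P_jt without i."""
    return groups.get((j, t), set()) - {i}
```

With these definitions, $N_{jt}$ is `len(groups[(j, t)]) - 1`, which equals the size of the coworker set of any member of the peer group.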

2.2 The Method of Cornelissen et al. (2017)

Cornelissen et al. (2017) is the leading empirical method to measure peer effects based on

matched employer-employee data. To account for the selection problem, i.e. the endogenous

sorting of high-productivity workers into high-productivity peer groups or high-productivity

firms based on unobserved attributes, CDS extend the worker and firm fixed effects model

of Abowd et al. (1999) by including control variables and multiple fixed effects. The goal is

to estimate the following wage equation: 7

$$w_{ijt} = \alpha_i + \phi_{jt} + \gamma \bar{\alpha}_{\sim i,jt} + \varepsilon_{ijt}. \tag{1}$$

Here $\alpha_i$ is a worker fixed effect for worker $i$ that captures permanent worker ability, and $\phi_{jt}$ is a time-varying firm fixed effect for firm $j$ at time $t$. The model captures peer effects in a “linear-in-means” setup: the term $\bar{\alpha}_{\sim i,jt}$ is peer productivity, defined as the average worker fixed effect in the peer group, computed by excluding individual $i$:

$$\bar{\alpha}_{\sim i,jt} = \frac{1}{N_{jt}} \sum_{i' \in P_{\sim i,jt}} \alpha_{i'}$$

The spillover coefficient $\gamma$, which measures the coworkers’ impact on wages, is the key parameter of interest.
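As a small illustration of the linear-in-means structure, the sketch below computes the leave-one-out peer mean and the systematic part of Equation (1) for given fixed effects. The function names are hypothetical and the fixed effects are taken as known inputs; the sketch does not implement the iterative CDS estimator.

```python
def peer_means(alpha, group):
    """Leave-one-out mean of worker fixed effects for every member of a peer group.

    alpha: dict mapping worker -> fixed effect; group: collection of at least two workers.
    """
    total = sum(alpha[i] for i in group)
    n_others = len(group) - 1  # N_jt
    return {i: (total - alpha[i]) / n_others for i in group}


def predicted_wage(alpha_i, phi_jt, gamma, peer_mean):
    """Systematic part of Equation (1): worker effect + firm effect + spillover term."""
    return alpha_i + phi_jt + gamma * peer_mean
```

Computing the group total once and subtracting each worker's own effect avoids recomputing the sum for every member of the peer group.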

7For expositional clarity, this wage equation is simplified relative to CDS by abstracting from occupation fixed effects and from the influence of other observable time-varying characteristics, as they do not affect the conclusions of this section.


2.2.1 Homogeneous peer effects

Two underlying assumptions in CDS are restrictive. The first is that the peer effect function is homogeneous. This is inconsistent with the heterogeneity found in empirical studies.

2.2.2 Worker-firm complementarity

The second important assumption, inherited from AKM, is that the worker-firm wage component is additively separable into a worker fixed effect and a time-varying firm fixed effect:

$$\alpha_i + \phi_{jt}.$$

The first implication of this assumption is that the wage gap between two coworkers is constant: if worker $i$ is more productive and earns a higher wage than his coworker $i'$ in one firm $j$, he is expected to do so in every other firm $j'$ in the economy, irrespective of the productivity of the firm:

$$E(w_{ijt} - w_{i'jt}) = (\alpha_i - \alpha_{i'})\left(1 - \frac{\gamma}{N_{jt}}\right), \quad \forall j \in J.$$
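As a check on this expression, here is a short derivation from Equation (1), assuming that the error term has mean zero for both workers and writing $S_{jt} = \sum_{k \in P_{jt}} \alpha_k$ for the sum of fixed effects in the peer group (a symbol introduced here for convenience):

$$
\begin{aligned}
\bar{\alpha}_{\sim i,jt} - \bar{\alpha}_{\sim i',jt}
  &= \frac{S_{jt} - \alpha_i}{N_{jt}} - \frac{S_{jt} - \alpha_{i'}}{N_{jt}}
   = -\frac{\alpha_i - \alpha_{i'}}{N_{jt}}, \\
E(w_{ijt} - w_{i'jt})
  &= (\alpha_i - \alpha_{i'}) + \gamma\left(\bar{\alpha}_{\sim i,jt} - \bar{\alpha}_{\sim i',jt}\right)
   = (\alpha_i - \alpha_{i'})\left(1 - \frac{\gamma}{N_{jt}}\right).
\end{aligned}
$$

The firm effect $\phi_{jt}$ is common to both coworkers and cancels, which is why the gap does not depend on firm productivity.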

Second, a high-wage firm always pays a high wage premium: for two firms $j$ and $j'$ with equal peer productivity $\bar{\alpha}_{\sim i,jt} = \bar{\alpha}_{\sim i,j't}$, the expected wage difference is independent of the productivity of the worker employed:

$$E(w_{ijt} - w_{ij't} \mid \bar{\alpha}_{\sim i,jt} = \bar{\alpha}_{\sim i,j't}) = \phi_{jt} - \phi_{j't}, \quad \forall i \in I.$$

Together, these two implications rule out any interdependence of worker and firm productivity in wages, so that firms gain nothing from selecting the right job applicants, and job seekers gain nothing from selecting the best job. These implications are inconsistent with most structural models in the assortative matching literature, in which the production function makes it optimal to sort workers into the firms where joint output is maximized.

In more realistic settings, however, wages can reflect complementarities between worker and firm productivity. Worker ability could be complementary (or substitutable) to firm productivity, such that a high-ability worker becomes more (or less) productive, relative to a low-ability worker, when moving from a low-productivity firm to a high-productivity firm. As a consequence, the interdependence of worker and firm productivity gives rise to positive (or negative) assortative matching, i.e. in equilibrium high-productivity workers sort into high-productivity (or low-productivity) firms with other high-productivity colleagues. Hagedorn et al. (2014), for instance, find positive assortative matching in German administrative data, in line with theories that attribute sorting to worker-firm complementarity.

The complementarity has two important implications. First, it induces sorting between workers and firms, which poses the well-known empirical challenge of the “selection problem” in the estimation of coworker effects. The selection problem arises when cohorts of coworkers are not formed at random. Second, the interdependence of worker and firm productivity implies that wages cannot be decomposed into additively separable worker and firm fixed effects, in sharp contrast to the specification of CDS.

2.2.3 Misspecification bias

The CDS method cannot correctly account for the selection problem induced by the complementarities between workers and firms. The misspecification bias can be sizable, and its sign can be either positive or negative, depending on the magnitude and sign of the complementarity.

Data generating process To illustrate the misspecification bias, I present the performance of the CDS estimator applied to wages simulated from a simple alternative data generating process. In contrast to Equation (1), the wage reflects the complementarities between workers and firms:

$$w_{ift} = w(\alpha_i, \phi_f) = \left(\alpha_i^{\rho} + \phi_f^{\rho}\right)^{1/\rho}. \tag{2}$$

Here, each worker $i \in I$ is endowed with a permanent latent productivity $\alpha_i$ and each firm $f \in J$ with a permanent latent productivity $\phi_f$ on entering the market. Both $\alpha_i$ and $\phi_f$ are drawn independently from the standard uniform distribution and cannot be observed in the data. Importantly, I focus on the case where wages incorporate no coworker effect: wages are determined solely by the productivity of the worker $\alpha_i$ and of the firm $\phi_f$ in the match, not by coworkers. The substitution parameter $\rho$ controls the curvature of the wage function $w$ and thereby the magnitude of the complementarity. In particular, when $\rho = 1$, Equation (2) reduces to Equation (1) with $\gamma = 0$. To generate realistic peer selection and positive assortative matching, assume that a worker and a firm match if and only if

$$|\alpha_i - \phi_f| < 0.1.$$
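The static part of this data generating process can be sketched as follows. The sketch only draws types, applies the matching band, and evaluates Equation (2) on every feasible worker-firm pair; it omits the search dynamics, the time dimension, and the one-job-per-year restriction, and the function names, the `band` argument, and the seed are assumptions of the illustration. The case $\rho = 0$ is not handled.

```python
import numpy as np


def ces_wage(alpha, phi, rho):
    """Wage function of Equation (2); rho = 1 gives the additive case alpha + phi."""
    return (alpha**rho + phi**rho) ** (1.0 / rho)


def simulate_matches(n_workers, n_firms, rho, band=0.1, seed=0):
    """Draw latent types and return wages for all feasible worker-firm pairs."""
    rng = np.random.default_rng(seed)
    alpha = rng.uniform(size=n_workers)  # worker productivities, standard uniform
    phi = rng.uniform(size=n_firms)      # firm productivities, standard uniform
    feasible = np.abs(alpha[:, None] - phi[None, :]) < band  # matching rule
    workers, firms = np.nonzero(feasible)
    wages = ces_wage(alpha[workers], phi[firms], rho)
    return alpha, phi, workers, firms, wages
```

For example, `simulate_matches(200, 20, rho=-1.0)` returns index arrays `workers` and `firms` of equal length, one entry per feasible match, together with the corresponding wages.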


The rest of the model follows a basic search and matching paradigm. The model uses a standard calibration to moments of labor market mobility. The details are deferred to Section 4.1.

Bias in finite sample The simulation results for the estimator $\hat{\gamma}$ from Equation (1) are shown in Figure (2). The 95% bootstrap confidence interval is constructed using $B = 100$ bootstrap samples.

Note that when $\rho = 1$, the worker and firm productivities $x$ and $y$ are linearly additive. In this case, the estimator $\hat{\gamma}$ correctly recovers the true magnitude of the peer effect, i.e. the true value $\gamma = 0$. As alternative cases, I simulate data for four different values of $\rho \in \{-3, -1, 0.5, 1.5\}$. In each case, the estimate $\hat{\gamma}$ is subject to misspecification bias. The size and sign of the bias depend on the strength of the complementarity, measured by $|\rho - 1|$: when $\rho > 1$, the bias is upward, $\hat{\gamma} > 0$; when $\rho < 1$, the bias is downward, $\hat{\gamma} < 0$.

Figure 1: $E\hat{\gamma} \neq 0$ when $\rho \neq 1$.

Asymptotic performance The CDS estimator $\hat{\gamma}$ does not converge to the true $\gamma = 0$ asymptotically: the misspecification bias does not vanish as the length of the simulation T approaches infinity.

Figure 2: γ̂ is inconsistent.

The CDS method detects a negative spillover parameter in a simulation where the true one is zero. The intuition for the bias can be illustrated by contradiction. Restricting the spillover parameter to zero in (1) amounts to fitting an AKM regression with worker effects and time-varying firm effects only,

$$w_{ift} = \alpha_i + \phi_{ft} + u_{ift}.$$

A negative correlation then arises between the peer quality $\hat{\alpha}_{-i,ft}$ and the regression residual $u_{ift}$. This negative correlation implies that, once the restriction is lifted, the fit of the model improves when the spillover parameter is lowered to a negative number. That is, in the case of (1):

$$\gamma = \frac{\operatorname{cov}(\bar{\alpha}_{-i,ft},\, u_{ift})}{\operatorname{var}(\bar{\alpha}_{-i,ft})} < 0.$$

The correlation is negative because, in the presence of complementarity, the estimated constant worker fixed effect tends to underpredict the wage of a high-productivity worker and to overpredict the wage of a low-productivity worker. Focusing only on the within-peer-group variation that the identification uses, the regression residual is systematically higher for better workers, who by construction are paired with worse coworkers within the same peer group.

The example implies that estimating the match component w(x, y) is vital for estimating coworker effects. It controls for the variation in wages accounted for by the movement of unobservable worker and firm characteristics that may be endogenously correlated with unobserved coworker attributes. If the complementarity in w(x, y) is correctly measured, the estimation of coworker effects will be free from the selection problem. This is the main focus of the paper and is developed in the next section.

3 Machine-learning based Approach

In this section, I propose an economic-theory-based machine-learning approach to estimate wage peer effects in a more general framework, extending Cornelissen et al. (2017) in two important directions. First, the framework allows wages to reflect a flexible interdependence of worker and firm productivities, so that complementarities or substitutabilities between them can be captured. In accordance with assortative matching theories, these complementarities can potentially account for the endogenous peer selection. Second, the framework allows for heterogeneous coworker effects across workers with different productivity levels. By doing so, the model can reconcile the mixed empirical findings in the peer effects literature, where different workers may have heterogeneous impacts on the wages of their coworkers in various scenarios.

In addition to these two relaxations, my method makes it possible to predict the counterfactual wage of an individual who is reallocated to any firm in the economy with a different set of coworkers. From a macroeconomic perspective, the wage is an equilibrium object of a structural model and can be non-parametrically inverted to recover other equilibrium outcomes such as output and productivity. Thus, by empirically estimating the complementarities and coworker effects in wages, researchers can draw further inference about complementarities and coworker effects in these outcomes as well. This opens the door to substantive questions: for instance, how to assess labor market efficiency and how to design policies that achieve the efficient assignment of coworkers. Of course, answering such questions requires additional assumptions about the labor market, regarding the market structure, the bargaining protocol, etc.

3.1 The Framework

The goal is to jointly estimate worker-firm complementarities and heterogeneous coworker

effects in wages in the following framework:

$$w_{ift} = \underbrace{w(x_i, y_f)}_{\text{match effects}} + \underbrace{\frac{1}{N_{ft}} \sum_{j \in P_{-i,ft}} a(x_j)}_{\text{coworker effects}} + \nu_{ift}. \tag{3}$$

where $w_{ift}$ are the wages observed in a large matched employer-employee dataset, $x_i$ is the unobserved productivity of worker i and $y_f$ the latent productivity of firm f. Productivities $x_i$ and $y_f$ are drawn at the birth of the worker and of the vacancy from exogenous distributions whose supports can be normalized to the closed unit interval,

$$x_i \in X = [0, 1], \qquad y_f \in Y = [0, 1],$$

and remain constant over time. The match effect $w(x_i, y_f)$ is the component that captures the complementarity between worker productivity $x_i$ and firm productivity $y_f$ in wages. Denote by $P_{-i,ft}$ the (self-exclusive) coworker set of worker i in her peer group. Peer groups are defined as the set of workers at workplace (firm) f at time t and can therefore be indexed by (f, t). Denote by $N_{ft} = |P_{-i,ft}|$ the number of coworkers. The disturbance $\nu_{ift}$ captures the variation accounted for by all other factors and satisfies $E[\nu_{ift} \mid x_i, y_f] = 0$.

The key objects of interest are the match and coworker effect components $w(x_i, y_f)$ and $a(x_j)$. The first key assumption is that, conditional on the productivities $x_i$, $y_f$ and $x_j$, wages do not depend on the identity of the worker i, the firm f or the coworker j. The second key assumption is that the mappings w and a are both finite and continuous on compact sets:

$$w : X \times Y \xrightarrow{\ \text{cts}\ } \mathbb{R}, \qquad a : X \xrightarrow{\ \text{cts}\ } \mathbb{R}.$$

I start with the baseline model (3) as it is an immediate generalization of the CDS method and thus a good benchmark.8 Importantly, I do not impose restrictive functional-form assumptions on the match and coworker effect components w(x, y) and a(x), so that w(x, y) can flexibly reflect arbitrary interactions between worker and firm productivity, and a(x) can capture heterogeneous peer effects between coworkers.

The example in Section 2 implies that estimating the match component w(x, y) is vital for estimating the coworker effects a(x), as it controls for the variation in wages accounted for by the movement of unobservable worker and firm characteristics that may be endogenously correlated with unobserved coworker attributes. Once w(x, y) is correctly measured, the estimation of the coworker effect a(x) will be free from the selection problem.
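To make the structure of (3) concrete, the following sketch evaluates the noise-free wage of every worker in one peer group for user-supplied functions w and a. It assumes that the types are known, which is not the case in the data, and the function name `model3_wages` is hypothetical.

```python
def model3_wages(x, y_f, w, a):
    """Noise-free wages under equation (3) for one peer group (f, t).

    x   : list of worker productivities in the peer group
    y_f : productivity of the firm
    w   : match-effect function w(x_i, y_f)
    a   : coworker-effect function a(x_j)
    Each worker has N_ft = len(x) - 1 coworkers (self-exclusive set).
    """
    n_coworkers = len(x) - 1
    total_a = sum(a(xj) for xj in x)
    wages = []
    for xi in x:
        # Leave-one-out average of a(.) over the worker's coworkers.
        spill = (total_a - a(xi)) / n_coworkers if n_coworkers > 0 else 0.0
        wages.append(w(xi, y_f) + spill)
    return wages

# Linear special case: w(x, y) = x + y and a(x) = gamma * x with gamma = 0.1.
linear = model3_wages([0.2, 0.4, 0.9], 0.5,
                      w=lambda xi, yf: xi + yf,
                      a=lambda xj: 0.1 * xj)
```

In the linear special case the coworker term equals gamma times the leave-one-out mean of coworker productivity, so the best worker in the group faces the weakest peers.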

8 Specifically, when the match and spillover components are both linear,

$$w(x, y) = x + y, \qquad a(x) = \gamma x,$$

equation (3) reduces to the CDS specification (1). As in CDS, my framework abstracts from endogenous peer effects: own wage is independent of peer wages conditional on the productivity of the worker, her firm and her coworkers. The method therefore bypasses the reflection problem, as well as the challenge of distinguishing between endogenous and exogenous peer effects, highlighted in the literature since Manski (1993).

3.1.1 Identification

When the types of workers $\{x_i\}_{i \in I}$ and firms $\{y_f\}_{f \in F}$ are observable, the coworker effect function a(x) in equation (3) is identified when the following identification assumption holds:

Assumption 3.1. Identification: For every worker i and all her coworkers $j \in P_{-i,ft}$:

$$\nu_{ift} \perp\!\!\!\perp h_{-i,ft}(k) \;\Big|\; x_i,\ y_f,\ \{x_j\}_{j \in P_{-i,ft}},$$

where $h_{-i,ft}(k)$ is the measure of her coworkers j belonging to productivity type $x_j = k$:

$$h_{-i,ft}(k) = \frac{1}{N_{ft}} \sum_{j \in P_{-i,ft}} \mathbf{1}\{x_j = k\}.$$

Identification assumption (3.1) states that, conditional on the types of the worker, the firm and her coworkers, the wage disturbances $\nu_{ift}$ are exogenous to the mobility of her coworkers. Importantly, identification holds under this assumption for a general wage function incorporating contemporaneous coworker effects:

$$w_{ift} = g_n\left(x_i, y_f, \{x_j\}_{j \in P_{-i,ft}}\right) + \nu_{ift}, \tag{4}$$

where n is the size of the peer group and $\{x_j\}_{j \in P_{-i,ft}}$ is the collection of coworker productivities.

Theorem 3.1. Under Assumption (3.1), the general wage equation (4) is identified.

The baseline (3) is identified as it is a special case of equation (4).9 Identification holds when the types of both workers and firms, x and y, are observed. The greater challenge is how to measure the match and spillover functions when these types are not observed. This is discussed in the next section.

9 One concern regarding Assumption (3.1) is that there could be peer-group-level shocks that simultaneously affect wages and coworker composition. In that case Assumption (3.1) is violated and estimating (3) may lead to biased results. The issue can be addressed with a “within estimator” of w and a, i.e. by estimating (3) conditional on time-varying peer-group fixed effects:

$$w_{ift} = w(x_i, y_f) + \frac{1}{N_{ft}} \sum_{j \in P_{-i,ft}} a(x_j) + Z_{ft} + \epsilon_{ift}. \tag{5}$$

Equation (5) is identified when

$$\epsilon_{ift} \perp\!\!\!\perp h_{-i,ft}(x_j) \;\Big|\; x_i,\ y_f,\ \{x_j\}_{j \in P_{-i,ft}}.$$

The identification only uses the variation in wages and coworker composition within peer groups.

3.2 The Method

I develop an economic-theory-based semi-parametric approach to estimate the wage function (3). The method can be viewed as an extension of the non-parametric method proposed by Hagedorn, Manovskii and Xin (2020) that allows for the coworker impact on wages. The main takeaway from their work is that the wage function can be estimated non-parametrically by grouping workers with similar unobserved productivities using a hierarchical clustering approach. From a machine learning perspective, estimating $w(x_i, y_f)$ and $a(x_j)$ non-parametrically can be viewed as a matrix completion problem: the goal is to best predict the wage of any worker $i \in I$ counterfactually matched with any firm $f \in F$, based on the wages in observed matches $\{i, f, t, w_{ift}\}$ in the matched employer-employee dataset, and under the constraint of (3).

Once coworker effects are taken into account, the wages of the same worker matched with the same firm can move in response to the evolution of the peer composition. To identify similar workers, I leverage the insight that workers with similar matching histories receive similar wages when working in the same firm at the same point in time. Thus, the similarity between two coworkers can be measured by their average wage distance during their coworkership. Notice that, as implied by both (3) and (4), any pair of similar coworkers i and $i' \in P_{ft}$ at the same firm in the same period should receive similar coworker influence on their wages, as these two workers share almost identical coworker sets $P_{-i,ft}$ and $P_{-i',ft}$. To my knowledge, this feature is in line with most assortative matching models in the literature, including Hagedorn et al. (2014), Gautier and Teulings (2012), Eeckhout and Kircher (2011), Lise et al. (2016).

The identification of unobserved worker productivities, wage complementarities and coworker effects can therefore be conducted in two consecutive stages: “clustering” and “estimation”. In the “clustering” stage, the aim is to identify in the data groups of workers who are similar in latent productivity, assign them to the same group, and predict the wage of a worker at firms where she did not work based on what workers assigned to the same group earn at that firm in the same year. The worker clustering takes an agglomerative hierarchical approach: the algorithm starts with each worker initialized as a single-point cluster, iteratively merges the most similar pair of “child” clusters at the current stage into a new “parent” cluster, and updates the similarity between the new cluster and the remaining clusters. In the “estimation” stage, I estimate the match and coworker effect functions w(x, y) and a(x) given the assigned worker clustering.

I adopt a cross validation method to decide the number of clusters. The optimal clustering

is chosen to make the best out-of-sample prediction of wages.


3.3 The “Clustering” Stage

This section explains how the algorithm works for the “clustering” stage.

3.3.1 Notations

A clustering C of a set of workers I is a partition of I:

$$C = \{C_1, C_2, \dots, C_K\}, \qquad C_k = \{i \in I \mid c_i = k\}.$$

The assignment function c maps individuals to their cluster labels, $c : I \to \mathcal{K}$. Clusters are indexed by integers from the cluster-label set $\mathcal{K} \equiv \{1, \dots, K\}$, so the number of clusters in C is K. Cluster $C_k$ and firm f match at t if some worker $i \in C_k$ works at firm f at time t. Denote the matching set:

$$C_{kft} \equiv \{i \in C_k : i \text{ works at firm } f \text{ at time } t\}.$$

Denote the matching indicator between worker clusters and firms in a given period, defined on the set $C \times F \times T$:

$$\Omega_{k,f,t} = \begin{cases} 1 & \text{if some worker } i \in C_k \text{ works at firm } f \text{ at time } t, \\ 0 & \text{otherwise.} \end{cases}$$

When worker cluster $C_k$ matches firm f in year t, the cluster mean $\mu_{kft}$ can be evaluated:

$$\mu_{kft} = \frac{1}{|C_{kft}|} \sum_{k' \in C_{kft}} w_{k'ft}, \qquad \text{if } \Omega_{kft} = 1.$$

Dissimilarity Within firm f at t, the wage distance between individual workers j and k is

$$D_{jkft} = w_{jft} - w_{kft} = w(x_j, y_f) - w(x_k, y_f) + \frac{a(x_k) - a(x_j)}{N_{ft}}. \tag{6}$$

Note that when workers j and k are similar, $x_j \approx x_k$, then $D_{jkft} \approx 0$ given that both w and a are continuous. The average wage distance between individual workers j and k is taken over all periods t and firms f at which both are observed:

$$D_{jk} = \operatorname*{mean}_{\{f,t \,\mid\, \Omega_{jft}=1,\ \Omega_{kft}=1\}} D_{jkft}. \tag{7}$$

The similarity between each pair of workers is measured by their average wage distance during their coworkership:

$$d(j, k) = \begin{cases} |D_{jk}| & \text{if } \sum_{t \in T} \Omega_{jft}\Omega_{kft} > 0, \\ \infty & \text{otherwise.} \end{cases}$$

In particular, worker j is similar to worker k at cutoff κ if d(j, k) < κ; otherwise worker j is dissimilar to k at κ. The hierarchical clustering algorithm sequentially merges individual workers into worker clusters, so I also define the wage distance and the similarity measure between a pair of worker clusters. Given a clustering C, the wage distance between the worker clusters $C_j$ and $C_k$ within firm f at t is:

$$D_{jkft} = \mu_{jft} - \mu_{kft}. \tag{8}$$

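The average wage distance and the dissimilarity between $C_j$ and $C_k$ are:

$$D_{jk} = \operatorname*{mean}_{\{f,t \,\mid\, \Omega_{jft}=1,\ \Omega_{kft}=1\}} D_{jkft}, \tag{9}$$

$$d(C_j, C_k) = \begin{cases} |D_{jk}| & \text{if } \sum_{t \in T} \Omega_{jft}\Omega_{kft} > 0, \\ \infty & \text{otherwise.} \end{cases} \tag{10}$$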

Worker cluster $C_j$ is similar to $C_k$ at cutoff κ if $d(C_j, C_k) \le \kappa$. Likewise, worker cluster $C_j$ is dissimilar to $C_k$ at cutoff κ if $d(C_j, C_k) > \kappa$. Define the affinity graph $A \in S^K$ of C: vertex $j \in \mathcal{K}$ of A represents a cluster $C_j \in C$, and edge (j, k) represents the similarity between $C_j$ and $C_k$ at κ:

$$A_{j,k}(\kappa) = \begin{cases} 1 & \text{if } d(C_j, C_k) < \kappa, \\ -1 & \text{if } d(C_j, C_k) > \kappa, \\ 0 & \text{if } C_j \text{ and } C_k \text{ do not match.} \end{cases} \tag{11}$$
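The definitions (8) to (11) can be sketched in a few lines. The example below is illustrative only: the dictionary layout and the function names are assumptions, and because (11) leaves the boundary case d = κ unspecified, the sketch treats it as dissimilar.

```python
import math

def cluster_distance(mu_j, mu_k):
    """d(C_j, C_k) from (9)-(10).

    mu_j, mu_k: dicts mapping (firm, period) -> cluster mean wage.
    Returns |average of mu_j - mu_k over shared (firm, period) cells|,
    or infinity when the two clusters never share a cell.
    """
    shared = mu_j.keys() & mu_k.keys()
    if not shared:
        return math.inf
    avg = sum(mu_j[cell] - mu_k[cell] for cell in shared) / len(shared)
    return abs(avg)

def affinity(means, kappa):
    """Affinity entries A[(j, k)] in {1, -1, 0} following (11).

    Assumption: d == kappa is coded as dissimilar (-1).
    """
    labels = sorted(means)
    A = {}
    for pos, j in enumerate(labels):
        for k in labels[pos + 1:]:
            d = cluster_distance(means[j], means[k])
            if math.isinf(d):
                value = 0        # never observed in the same (firm, period)
            elif d < kappa:
                value = 1        # similar at kappa
            else:
                value = -1       # dissimilar at kappa
            A[(j, k)] = A[(k, j)] = value
    return A

# Hypothetical cluster means: clusters 1 and 2 share firm "f", 2 and 3 share "g".
means = {
    1: {("f", 1): 10.0, ("f", 2): 12.0},
    2: {("f", 1): 10.5, ("g", 1): 8.0},
    3: {("g", 1): 11.0},
}
A = affinity(means, kappa=1.0)
```

Note that Algorithm 1 states the distance as the mean of absolute differences, whereas (9) and (10) take the absolute value of the mean difference; the sketch follows (9) and (10).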

Components Denote by $\Pi \in P(C)$ a partition of the worker clustering C that has L subsets. The l-th subset of Π consists of $K_l$ member clusters, $S_l = \{C_{1_l}, C_{2_l}, \dots, C_{K_l}\}$:

$$\Pi = \{\{C_{1_1}, \dots, C_{K_1}\}, \dots, \{C_{1_L}, \dots, C_{K_L}\}\} = \{S_1, \dots, S_L\}, \qquad \sum_{l=1}^{L} K_l = K.$$

A subset S is a path-similar component at cutoff κ if its member clusters form a connected component of the affinity graph A. That is, for any pair of member clusters $C_k, C_{k'} \in S$, there exists a path on S, $\{C_k \to C_{p(1)} \to \dots \to C_{p(N)} \to C_{k'}\}$, such that any two consecutive clusters on the path are similar at κ: $A_{p(n),\,p(n+1)}(\kappa) = 1$ for all $n \in \{0, \dots, N\}$, with $p(0) = k$ and $p(N+1) = k'$. A partition Π is a path-similar partition of C at κ if all its subsets are path-similar components at κ. Denote it by $\Pi^{*}(C)$.

A subset S is a disagreement-free component at cutoff κ if it contains no dissimilar pair of member clusters at κ, i.e. $A_{k,k'}(\kappa) > -1$ for all $C_k, C_{k'} \in S$. A partition Π is a disagreement-free partition of C at κ if all its subsets are disagreement-free components at κ. Denote it by $\Pi^{**}(C)$.
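The two notions of component can be checked mechanically. The sketch below finds path-similar components as connected components of the graph formed by the edges with A = 1, and tests whether a component is disagreement-free. The representation of A as a dictionary keyed by label pairs is an assumption made for illustration.

```python
from collections import deque

def path_similar_components(labels, A):
    """Connected components of the graph whose edges are pairs with A == 1."""
    seen, components = set(), []
    for start in labels:
        if start in seen:
            continue
        component, queue = [], deque([start])
        seen.add(start)
        while queue:                      # breadth-first search
            j = queue.popleft()
            component.append(j)
            for k in labels:
                if k not in seen and A.get((j, k), 0) == 1:
                    seen.add(k)
                    queue.append(k)
        components.append(sorted(component))
    return components

def is_disagreement_free(component, A):
    """True if no pair of member clusters is dissimilar (A == -1)."""
    return all(A.get((j, k), 0) > -1
               for j in component for k in component if j != k)

# Hypothetical affinity entries: 1~2 and 2~3 are similar, 1 and 3 are dissimilar.
pairs = {(1, 2): 1, (2, 3): 1, (1, 3): -1}
A = {}
for (j, k), value in pairs.items():
    A[(j, k)] = A[(k, j)] = value

components = path_similar_components([1, 2, 3, 4], A)
```

In this example clusters 1, 2 and 3 form one path-similar component through the chain 1, 2, 3, yet that component contains the dissimilar pair (1, 3), which is exactly the kind of disagreement discussed in Section 3.7.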

3.3.2 The Algorithm

The input of the algorithm is the wage information $w_{ift}$ for all worker-firm pairs (i, f) that match at time t, i.e. $\Omega_{i,f,t} = 1$. The output of the algorithm is a worker clustering C. The algorithm goes through a number of iterations $\iota \in \{0, 1, 2, \dots, \bar{\iota}\}$, each associated with a regularization parameter $\kappa \in \{0, \epsilon, 2\epsilon, \dots, \bar{\kappa} \equiv \bar{\iota}\epsilon\}$, where ε is a small number. At the first iteration ι = 0, the worker clustering $C^0$ is initialized as the set of single-worker clusters, i.e. N worker clusters each containing one individual worker: $C^0 = \{C^0_1 = \{1\}, C^0_2 = \{2\}, \dots, C^0_N = \{N\}\}$, with the corresponding initial assignment function $c^0_i = i$ for all $i \in I$. The outcome of each iteration ι is a worker clustering $C^{\iota}$.

Workers are clustered in a hierarchical order of their similarities: at each iteration ι, the algorithm decides whether or not to group each pair of current worker clusters $C^{\iota}_j, C^{\iota}_k \in C^{\iota}$ based on the current cutoff κ. If $C^{\iota}_j$ and $C^{\iota}_k$ are similar at κ, they are merged into the same worker cluster:

merge $C^{\iota}_j$ and $C^{\iota}_k$ if $A_{j,k}(\kappa) = 1$.

The algorithm can be summarized as follows:

Algorithm 1 Worker Clustering (Baseline)
function Worker Clustering
    Initialize clustering: one worker = one cluster
    Distance $d(C_j, C_k) = \operatorname{mean}_{f \in F,\, t \in T} |\mu_{jft} - \mu_{kft}|$
    for $\kappa \in [0, \dots, \bar{\kappa}]$ do
        if $d(C_j, C_k) < \kappa$ then
            Merge $(C_j, C_k)$ and update cluster wage $\hat{w}$, $\hat{a}$
        end if
        Repeat until the distance for all pairs is $\ge \kappa$
    end for
end function
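A minimal, unoptimized sketch of the baseline clustering loop is given below. It assumes that the data arrive as (worker, firm, period, wage) tuples and that the closest admissible pair is merged first within each cutoff; the data layout and all function names are hypothetical, and the estimation of ŵ and â is left out.

```python
import math
from collections import defaultdict

def cluster_means(obs, members):
    """Mean wage of the workers in `members` for each (firm, period) cell."""
    cells = defaultdict(list)
    for worker, firm, period, wage in obs:
        if worker in members:
            cells[(firm, period)].append(wage)
    return {cell: sum(v) / len(v) for cell, v in cells.items()}

def distance(mu_j, mu_k):
    """Mean absolute gap in cluster means over shared cells (Algorithm 1)."""
    shared = mu_j.keys() & mu_k.keys()
    if not shared:
        return math.inf
    return sum(abs(mu_j[c] - mu_k[c]) for c in shared) / len(shared)

def worker_clustering(obs, eps, n_iter):
    """Return [(kappa, clustering)] for kappa = 0, eps, ..., n_iter * eps."""
    clusters = [frozenset([w]) for w in sorted({o[0] for o in obs})]
    path = []
    for it in range(n_iter + 1):
        kappa = it * eps
        while True:
            means = [cluster_means(obs, c) for c in clusters]
            best = None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    d = distance(means[a], means[b])
                    if d < kappa and (best is None or d < best[0]):
                        best = (d, a, b)
            if best is None:          # all remaining distances are >= kappa
                break
            _, a, b = best
            merged = clusters[a] | clusters[b]
            clusters = [c for pos, c in enumerate(clusters)
                        if pos not in (a, b)] + [merged]
        path.append((kappa, list(clusters)))
    return path

# Toy data: workers 1, 2, 3 at firm "f"; workers 3, 4 at firm "g".
obs = [(1, "f", 1, 10.0), (2, "f", 1, 10.1), (3, "f", 1, 12.0),
       (3, "g", 1, 12.5), (4, "g", 1, 12.6)]
path = worker_clustering(obs, eps=0.25, n_iter=2)
```

Recomputing every cluster mean after each merge keeps the sketch short but is quadratic in the number of clusters per merge, so it is only suitable for small examples.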


3.3.3 Theoretical Properties

This subsection derives theoretical properties of the worker clustering by Algorithm 1 with

additional assumptions.

If κ increases slowly, i.e. the step increment ε is small at each iteration, at most one pair of workers is merged into the same cluster in the next clustering $C^{\iota+1}$, and Algorithm 1 can accurately group similar workers.

Theorem 3.2. When κ increases slowly, each iteration ι associated with cutoff κ delivers a clustering $C^{\iota}$ that is a path-similar partition of workers I at κ. That is, for each pair of individuals $j, k \in I$ in the same cluster, $j, k \in C$ with $C \in C^{\iota}$, there exists a path connecting j and k within C such that each pair of adjacent workers on the path is similar at κ.

Proof. By induction. Theorem 3.2 trivially holds for ι = 1. Assume it holds for ι = n with cutoff $\kappa' = n\epsilon$, and consider iteration ι = n + 1 with cutoff $\kappa = (n+1)\epsilon > \kappa'$. Without loss of generality, suppose that $C_j$ and $C_k$ are the two clusters to be agglomerated at this iteration, with $j \in C_j$ and $k \in C_k$. By the induction hypothesis, each of $C_j$ and $C_k$ is path-similar at the previous cutoff $\kappa' < \kappa$, and is therefore also path-similar at κ. Consider the closest pair of workers across the two clusters, $j^* \in C_j$ and $k^* \in C_k$. They are similar at κ, as $d(j^*, k^*) \le d(j^*, C_k) \le d(C_j, C_k) < \kappa$. The newly formed cluster $C_{j,k}$ is then also path-similar, since any two members $j, k \in C_{j,k}$ are connected by a path of the form $j \to j^* \to k^* \to k$ along which all adjacent workers are similar at κ. Therefore, the theorem also holds for ι = n + 1.

For computational efficiency, ε is set to a reasonably large number in practice, so that multiple worker clusters in the same path-similar component (i.e. the same connected component of the affinity graph A(κ)) can be merged in each iteration ι. Ruling out disagreement within each path-similar component requires additional constraints. This discussion is deferred to Section 3.7.

Homophily A worker of productivity $x \in X$ exhibits local homophily in the neighborhood $B_r(x) \equiv \{x' \in X : |x' - x| < r\}$ if any pair of workers with similar types, $k, k' \in B_r(x)$, have been coworkers with independent probability

$$p(k, k') > \frac{\log\left(\mu_r(x) N\right)}{\mu_r(x) N}.$$

Here $\mu_r(x)$ is the fraction of workers whose type lies in $B_r(x)$, $\mu_r(x) = \int_{B_r(x)} \phi(x')\, dx'$, and $\phi$ is the probability density of worker types. Local homophily assumes that workers with similar latent productivities have a higher tendency to become coworkers. It guarantees sufficient local matching density for workers who are close in productivity space to meet in the same workplace. Algorithm 1 can detect and group workers with similar unobserved productivity where local homophily holds.

Single-crossing A worker of productivity $x_j \in X$ satisfies the single-crossing condition if the average wage distance between individual workers j and k, $D_{jk}(x_j)$, is monotonically increasing in the productivity $x_j$, for all $x_k \in X$ and $y_f \in Y$. Recall that by (6) and (7):

$$D_{jk}(x_j) = \operatorname*{mean}_{\{f,t \,\mid\, \Omega_{jft}=1,\ \Omega_{kft}=1\}} \left[ w(x_j, y_f) - w(x_k, y_f) + \frac{a(x_k) - a(x_j)}{N_{ft}} \right].$$

Note that $D_{jk}$ is a finite function defined on the compact set X. The single-crossing condition assumes that a worker of higher productivity always receives a higher wage. Under this condition, $D_{jk}(x_j)$ crosses zero only once, at $x_j = x_k$. Consequently, small wage distances between workers can be mapped into proximity of their productivities. A sufficient condition for single-crossing is that the wage w(x, y) is increasing in x and the size of the peer group $N_{ft}$ is large, so that $|a(x_k) - a(x_j)| / N_{ft}$ is always dominated by $|w(x_j, y_f) - w(x_k, y_f)|$.

Theorem 3.3. (No global split) Suppose $x \in X$ exhibits local homophily in $B_r(x)$. Then for any pair of workers j and k with similar productivities, $x_j, x_k \in B_r(x)$, there exists κ > 0 at which Algorithm 1 terminates and delivers a path-similar clustering $C^*$ at κ that assigns both workers to the same cluster with high probability:

$$\lim_{N \to \infty} p(c^*_k = c^*_j) = 1.$$

Lemma 3.1. (Erdős and Rényi, ’60) Let G(N, p) denote the random graph that has N vertices and in which the edge between any pair of vertices forms independently with probability

$$p = \frac{p_0 \log N}{N}.$$

Graph G(N, p) is connected with high probability if and only if $p_0 > 1$.
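The threshold in Lemma 3.1 can be illustrated by simulation. The sketch below draws random graphs at $p = p_0 \log N / N$ and reports the share that turn out connected; it is a finite-sample illustration of an asymptotic statement, and the function names and the default number of trials are arbitrary choices.

```python
import math
import random

def is_connected(n, p, rng):
    """Draw one G(n, p) graph and test connectivity with a graph search."""
    adj = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:      # each edge forms independently
                adj[u].append(v)
                adj[v].append(u)
    seen = {0}
    stack = [0]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n

def connected_share(n, p0, trials=20, seed=0):
    """Share of simulated graphs that are connected at p = p0 * log(n) / n."""
    rng = random.Random(seed)
    p = p0 * math.log(n) / n
    return sum(is_connected(n, p, rng) for _ in range(trials)) / trials

share_above = connected_share(300, p0=3.0, trials=5)   # well above threshold
share_below = connected_share(300, p0=0.2, trials=5)   # well below threshold
```

Values of $p_0$ close to one converge slowly, so a sharp transition should only be expected for large N.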

Proof. As N → ∞, it follows immediately from Lemma 3.1 that the sub-graph $G(\mu_r(x) N, p)$ is connected with high probability, i.e. there exists a path connecting all workers in $B_r(x)$ such that any adjacent workers j, k on the path are coworkers. It remains to show that there is a κ > 0 at which all adjacent workers j and k on the path are also similar. Since $j, k \in B_r(x)$, $|x_j - x_k| < r$. Because the wage distance function (6) is continuous, the average distance is bounded: $\exists\, \delta_{jkft} : |D_{jk}(x_j) - D_{jk}(x_k)| < \delta_{jkft}$. Therefore, one can choose $\kappa = \max_{j,k,f,t} \delta_{jkft}$ such that all adjacent workers j and k are similar at κ. That is, $B_r(x)$ constitutes a path-similar component and all its members are assigned to the same cluster.

Theorem 3.4. (No local contamination) Assume that all $x \in X$ satisfy the single-crossing condition. Then there exists κ > 0 at which Algorithm 1 terminates and delivers a path-similar clustering $C^*$ such that

$$\forall k : |x_j - x_k| > r, \qquad p(c^*_k = c^*_j) = 0.$$

Proof. Theorem 3.4 follows immediately from the continuity and the monotonicity of the wage distance $D_{jk}(x_j)$. Without loss of generality, assume that $x_j > x_k + r$, so that $\kappa_0 \equiv D_{jk}(x_j) - D_{jk}(x_k) > 0$. For all small $\kappa < \kappa_0$, workers j and k are dissimilar and are not grouped into the same cluster.

Theorem 3.3 states that if Algorithm 1 is terminated at a relatively large cutoff κ, the resulting clustering C can detect and group all similar workers in a densely connected neighborhood. On the other hand, Theorem 3.4 states that if Algorithm 1 is terminated at a relatively small cutoff κ′, the corresponding clustering C′ can distinguish any dissimilar pair of workers j and k. The optimal stopping criterion for κ must balance the tradeoff between “splitting similar workers into multiple clusters” and “contaminating a cluster by introducing dissimilar workers”. The choice of the optimal κ is discussed in Section 3.5.

3.4 The “Estimation” Stage

Based on the worker clustering C and the assignment function $c_i$, the “estimation” stage recovers the complementarity and coworker effect functions $w(x_i, y_f)$ and $a(x_j)$ at the worker cluster level. The idea is that when workers assigned to the same cluster are close in productivity space, I can recover the coworker influence on any individual from their cluster average:

$$E\, w_{ift} = w(x_i, y_f) + \frac{1}{N_{ft}} \sum_{j \in P_{-i,ft}} a(x_j) \approx \hat{w}(c_i, f) + \frac{1}{N_{ft}} \sum_{j \in P_{-i,ft}} \hat{a}(c_j), \tag{12}$$

where $\hat{w} : C \times F \to \mathbb{R}$ and $\hat{a} : C \to \mathbb{R}$.

The estimation recovers the unobserved heterogeneous characteristics of workers and firms that determine wages. Although $x_i$ and $y_f$ are latent, the cluster membership $c_i$ and the firm identity f are observable after the “clustering” stage. Conditional on the observed $c_i$ and f, equation (13) is identified under Assumption 3.1.

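$$\begin{aligned} w_{ift} &= \hat{w}(c_i, f) + \frac{1}{N_{ft}} \sum_{j \in P_{-i,ft}} \hat{a}(c_j) + \tilde{\nu}_{ift} \\ &= \hat{w}(c_i, f) + \frac{1}{N_{ft}} \sum_{k \in \mathcal{K}} h_{-i,ft}(k)\, \hat{a}(k) + \tilde{\nu}_{ift} \end{aligned} \tag{13}$$

The coworker component $h_{-i,ft}(k) = \sum_{j \in P_{-i,ft}} \mathbf{1}\{c_j = k\}$ counts the number of coworkers in $P_{-i,ft}$ assigned to cluster k. The match component $\hat{w}(c_i, f)$ is captured by the joint worker-cluster-by-firm fixed effect, and the coworker effect $\hat{a}(c_j)$ is estimated as the coefficient of the wage response to changes in the coworker components.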
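As an illustration of how (13) separates the two components, the sketch below treats the special case of two worker clusters. It removes the cluster-by-firm fixed effect by demeaning within (cluster, firm) cells and regresses the wage on the share of coworkers in cluster 1. This is a simplified stand-in for the full estimation, not the author's implementation; with two clusters the coworker shares sum to one, so only the difference â(1) − â(0) is recovered. The function name and the data layout are assumptions.

```python
from collections import defaultdict

def within_slope(rows):
    """OLS slope of wage on coworker share after cell demeaning.

    rows: (own_cluster, firm, share_of_coworkers_in_cluster_1, wage).
    Demeaning within (own_cluster, firm) cells absorbs the match effect
    w_hat(c_i, f); the slope estimates a_hat(1) - a_hat(0).
    Requires some within-cell variation in the coworker share.
    """
    cells = defaultdict(list)
    for cluster, firm, share, wage in rows:
        cells[(cluster, firm)].append((share, wage))
    num = den = 0.0
    for obs in cells.values():
        s_bar = sum(s for s, _ in obs) / len(obs)
        w_bar = sum(w for _, w in obs) / len(obs)
        num += sum((s - s_bar) * (w - w_bar) for s, w in obs)
        den += sum((s - s_bar) ** 2 for s, _ in obs)
    return num / den

# Hypothetical noise-free data generated from equation (13).
w_hat = {(0, "f"): 1.0, (1, "f"): 2.0, (0, "g"): 1.5}
a_hat = {0: 0.1, 1: 0.4}
rows = [(c, f, s, w_hat[(c, f)] + (1 - s) * a_hat[0] + s * a_hat[1])
        for (c, f) in w_hat for s in (0.0, 0.25, 0.5, 1.0)]
slope = within_slope(rows)
```

With more than two clusters the same idea carries over to a multivariate regression on the coworker shares, with one cluster omitted as the reference group.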

3.5 Regularization

I use a regularization criterion to select, among the sequence of clusterings $\{C^{\iota}\}$ over all iterations, the worker clustering that minimizes the generalization error of the machine learning algorithm. In particular, I penalize cluster sizes that are too small, and select the clustering $C^{\iota}$ (and the $\hat{w}$ and $\hat{a}$ it implies) that minimizes the RMSE of the out-of-sample forecast.

To evaluate the criterion, I randomly split the data into three parts: the training set (80%), the validation set (10%), and the test set (10%).10 Based only on observations in the training set, I estimate the sequence of worker clusterings and functions $\{C^{\iota}, \hat{w}^{\iota}, \hat{a}^{\iota}\}$ for each iteration ι. I can then make out-of-sample predictions for each observation $w_{i,f',t'}$ in the validation set and the test set. If $C^{\iota}_i$ matches $f'$ at $t'$ in the training set, i.e. the algorithm can find one or more workers assigned to the same cluster $C^{\iota}_i$ who worked at firm $f'$ at $t'$, the optimal prediction is the average wage in that cluster:

$$\tilde{w}_{if't'} = \hat{w}(c_i, f') + \frac{1}{N_{f't'}} \sum_{j \in P_{-i,f't'}} \hat{a}(c_j) = \mu_{c_i, f't'}.$$

If no such worker exists, the prediction cannot be based on a similar reference worker. The best predictor is then worker i’s average wage in the training sample conditional on all available information. The predicted wage of i at a new firm $f'$ at $t'$ is

$$\hat{w}^{\iota}_{if't'} = \begin{cases} \tilde{w}_{if't'} & \text{if } C^{\iota}_i \text{ and } f' \text{ match at } t', \\ \text{worker } i\text{’s training-sample average} & \text{otherwise.} \end{cases}$$

10 Since wages are interdependent within each peer group, the random split is made at the peer-group level. For example, if peer group (f, t) is drawn and assigned to the training set, all workers in $P_{ft}$ are assigned to the training set.

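The criterion function is the RMSE evaluated on the validation set,

$$Q(w, \hat{w}^{\iota}) = \left( \operatorname*{mean}_{(i,f,t) \in \{\text{validation set}\}} \left(\hat{w}^{\iota}_{ift} - w_{ift}\right)^2 \right)^{1/2}, \tag{14}$$

and the optimal clustering is selected as

$$\iota^* = \arg\min_{\iota} \{Q(w, \hat{w}^{\iota})\}. \tag{15}$$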
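The selection rule can be sketched as follows: predict each validation wage with the cluster mean when the worker's cluster is observed in that (firm, period) cell of the training data, fall back to the worker's personal training average otherwise, and pick the iteration with the lowest RMSE. The data structures and function names below are assumptions for illustration, and the penalty on small cluster sizes is not modeled.

```python
import math

def predict(worker, cell, cluster_of, cluster_cell_mean, personal_mean):
    """Cluster mean if the worker's cluster matches `cell`, else own average."""
    key = (cluster_of[worker], cell)
    if key in cluster_cell_mean:
        return cluster_cell_mean[key]
    return personal_mean[worker]

def rmse(validation, cluster_of, cluster_cell_mean, personal_mean):
    """Criterion (14): root mean squared prediction error on the validation set."""
    squared = [(predict(i, cell, cluster_of, cluster_cell_mean, personal_mean)
                - wage) ** 2
               for i, cell, wage in validation]
    return math.sqrt(sum(squared) / len(squared))

def select_iteration(candidates, validation, personal_mean):
    """candidates: one (cluster_of, cluster_cell_mean) pair per iteration."""
    scores = [rmse(validation, c, m, personal_mean) for c, m in candidates]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Hypothetical example: worker 1 is observed at firm "g" only in validation.
personal_mean = {1: 10.0, 2: 10.2}
validation = [(1, ("g", 2), 11.0)]
singletons = ({1: 1, 2: 2},
              {(1, ("f", 1)): 10.0, (2, ("f", 1)): 10.2, (2, ("g", 2)): 11.1})
merged = ({1: "A", 2: "A"},
          {("A", ("f", 1)): 10.1, ("A", ("g", 2)): 11.1})
best, scores = select_iteration([singletons, merged], validation, personal_mean)
```

In the example, merging workers 1 and 2 lets the wage of worker 1 at firm "g" be predicted from worker 2, which lowers the validation error relative to the single-worker clustering.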

At the initial iteration ι = 0, each worker forms its own cluster. At this iteration, the average of each single-worker cluster accurately reflects the wages of that individual. However, I cannot make out-of-sample predictions based on “similar reference workers” and can only rely on the worker’s personal average. Therefore, criterion (14) is large at the initial iteration. As the algorithm proceeds, the wage cutoff κ = ιε gradually increases, more workers are grouped as similar, and the average cluster size grows. Criterion (14) first decreases as more workers can be predicted with the cluster average instead of the personal average. It increases again when the average cluster size becomes too large to predict individual wages accurately. In the limit where the wage cutoff goes to infinity, all workers are assigned to one single big cluster, and the algorithm predicts the same average wage for all workers, which cannot be accurate. The lowest point of the U-shaped regularization curve represents the optimal trade-off.

Figure 3: Criterion Q(w, ŵ) on the validation set.


3.6 Measuring coworker complementarities

In this section, I show how to further generalize the framework (3) to allow for the com-

plementarity between coworker productivities in the coworker effects. Further allowing for

coworker complementarities besides the complementarity between the worker and the firm

has important implications on efficiency and reallocation of workers across the firms.

3.6.1 Framework

In a more relaxed framework, wages can be determined as 11

$$w_{ift} = \underbrace{w(x_i, y_f)}_{\text{match effects}} + \underbrace{\frac{1}{N_{ft}} \sum_{j \in P_{-i,ft}} a(x_i, x_j)}_{\text{general coworker effects}} + \nu_{ift}. \tag{17}$$

Here, I adopt the two-dimensional continuous spillover function $a : X^2 \to \mathbb{R}$ to capture the wage complementarity between worker i and her coworkers $j \in P_{-i,ft}$. In particular, $a(x_i, x_j)$ reflects the influence exerted by coworker j on the focal worker i. Note that, conditional on the productivities $x_i, y_f, \{x_j\}_{j \in P_{-i,ft}}$ being observed, identification holds for (17) as it is a special case of the general wage function (4).

3.6.2 The “clustering” stage

Importantly, the identification of unobserved worker productivity is identical to the baseline equation (6), i.e. the worker similarity can be measured with the wage distance between individual workers j and k within firm f at t:

$$D_{jkft} = w_{jft} - w_{kft} = w(x_j, y_f) - w(x_k, y_f) + \frac{a(x_i, x_k) - a(x_i, x_j)}{N_{ft}}. \tag{18}$$

11 Similarly to (3), one can include time-varying fixed effects to account for the potential endogeneity caused by common shocks (e.g. technology shocks at the firm or other cohort level):

$$w_{ift} = w(x_i, y_f) + \frac{1}{N_{ft}} \sum_{j \in P_{-i,ft}} a(x_i, x_j) + Z_{ft}(x_i) + \epsilon_{ift}. \tag{16}$$

This equation is identified under the exogeneity assumption

$$\epsilon_{ift} \perp\!\!\!\perp h_{-i,ft}(x_j) \;\Big|\; x_i,\ y_f,\ \{x_j\}_{j \in P_{-i,ft}},$$

and empirically requires no multicollinearity among the realizations of

$$X = \{\mathbf{1}\{c_i = l, f\},\ \{h_{-i,ft}(c_j)\},\ \mathbf{1}\{c_i = l, f, t\}\}.$$

As the similarity can be measured pairwise with an identical distance function, the worker clustering obtained in the “clustering” stage identifies the types $\{x_i\}_{i \in I}$ for the extended model.

3.6.3 The “estimation” stage

Given the worker clustering C obtained in the clustering stage, I estimate the coworker effect function $\hat{a}(c_i = l, c_j = k)$ defined for each pair of interacting worker clusters $C_k$ and $C_l$. The estimation and identification proceed as in (13), but separately for each group of workers in the same cluster $c_i = l$:

$$\begin{aligned} w_{ift} &= \hat{w}(c_i, f) + \frac{1}{N_{ft}} \sum_{j \in P_{-i,ft}} \hat{a}(c_i, c_j) + \tilde{\nu}_{ift} \\ &= \hat{w}(l, f) + \frac{1}{N_{ft}} \sum_{k \in \mathcal{K}} h_{-i,ft}(k)\, \hat{a}(l, k) + \tilde{\nu}_{ift} \end{aligned} \tag{19}$$

The coworker component $h_{-i,ft}(k) = \sum_{j \in P_{-i,ft}} \mathbf{1}\{c_j = k\}$ counts the number of coworkers in $P_{-i,ft}$ assigned to cluster k.

3.7 Disagreement-free Partition

When the cutoff step ε is small, at most one pair of workers is merged into the same cluster in the next clustering $C^{\iota+1}$. While this is accurate, the algorithm is slow. In practice, I use a reasonably large cutoff step ε so that multiple worker clusters in the same path-similar component are merged in each iteration ι. This is more efficient. The problem is that there can be disagreement within the same path-similar component. For example, a component $\{C_i, C_j, C_k\}$ can be path-similar at a certain κ if worker cluster $C_i$ is similar to cluster $C_j$, and cluster $C_j$ is similar to cluster $C_k$. It can nevertheless contain a “disagreement” if worker clusters $C_i$ and $C_k$ are observably dissimilar at a workplace where $C_j$ is absent. Ruling out disagreement within each path-similar component requires additional constraints.

The idea is to partition each path-similar component into a finer collection of disagreement-free components:

$$\{S^{**}_{l(1)}, \dots, S^{**}_{l(M)}\} = \text{split disagreement}(S^{*}_l).$$

The subroutine split disagreement is a fine-tuning device that rules out dissimilar workers within a path-similar component. It takes each path-similar component $S^{*}_l \in \Pi^{*}$ as input and, once a disagreement is found, recursively finds the minimum cut that splits the component into multiple disconnected “child components”. Finding the minimum cut of a component means removing the smallest set of edges, weighted by similarity, such that the component is split into disconnected child sub-components. If both child components are disagreement-free, the algorithm stops. Otherwise the procedure is repeated until no “child component” contains any dissimilar pair of worker clusters at the cutoff κ. The procedure can be viewed as a tree traversal with depth-first search.

Algorithm 2 Split Disagreement
function Split Disagreement    ▷ Input: $S^{*}$    ▷ Output: $S^{**}$
    Initialize partition list $S^{**} = \{\}$
    if component $S^{*}$ is disagreement-free then
        add $S^{*}$ to partition list $S^{**}$
    else
        Find the most distant pair of clusters
            $C_j, C_k = \arg\max_{C_j, C_k \in S^{*}} d(C_j, C_k)$
        Find minimum cut
            $\{S^{*}_l, S^{*}_r\} = \text{maxflow}(S^{*}, C_j, C_k)$
        Repeat for the component containing $C_j$:
            $\{S^{**}_{l(1)}, \dots, S^{**}_{l(M)}\} = \text{split disagreement}(S^{*}_l)$
        Repeat for the component containing $C_k$:
            $\{S^{**}_{r(1)}, \dots, S^{**}_{r(M')}\} = \text{split disagreement}(S^{*}_r)$
        Add $\{S^{**}_{l(1)}, \dots, S^{**}_{l(M)}, S^{**}_{r(1)}, \dots, S^{**}_{r(M')}\}$ to partition list $S^{**}$
    end if
end function
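The recursion in Algorithm 2 can be sketched with a small max-flow routine. The version below is an illustration under simplifying assumptions rather than the author's implementation: every similar edge gets unit capacity instead of a similarity weight, the pair to separate is the most distant dissimilar pair, and the minimum cut is found with a plain Edmonds-Karp search. All names are hypothetical.

```python
from collections import deque

def min_cut_side(nodes, edges, source, sink):
    """Source side of a minimum source-sink cut (unit-capacity Edmonds-Karp)."""
    cap = {}
    adj = {u: set() for u in nodes}
    for u, v in edges:                 # undirected edge = two unit arcs
        cap[(u, v)] = cap.get((u, v), 0) + 1
        cap[(v, u)] = cap.get((v, u), 0) + 1
        adj[u].add(v)
        adj[v].add(u)
    while True:
        parent = {source: None}        # BFS in the residual graph
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return set(parent)         # nodes still reachable from the source
        v = sink
        while parent[v] is not None:   # push one unit of flow along the path
            u = parent[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u

def split_disagreement(component, similar, dissimilar, dist):
    """Recursively split a component until no dissimilar pair remains.

    similar / dissimilar: sets of frozenset({j, k}) with A = 1 / A = -1.
    dist: dict frozenset({j, k}) -> d(C_j, C_k) for the dissimilar pairs.
    """
    comp = set(component)
    bad = [pair for pair in dissimilar if pair <= comp]
    if not bad:
        return [comp]
    j, k = sorted(max(bad, key=lambda pair: dist[pair]))
    edges = [tuple(pair) for pair in similar if pair <= comp]
    left = min_cut_side(comp, edges, j, k)
    right = comp - left
    return (split_disagreement(left, similar, dissimilar, dist)
            + split_disagreement(right, similar, dissimilar, dist))

# Chain 1 ~ 2 ~ 3 in which clusters 1 and 3 are observably dissimilar.
similar = {frozenset({1, 2}), frozenset({2, 3})}
dissimilar = {frozenset({1, 3})}
dist = {frozenset({1, 3}): 2.0}
parts = split_disagreement({1, 2, 3}, similar, dissimilar, dist)
```

A child component returned by the cut need not stay connected through similar edges, so a fuller version would re-run the connected-component step on each child before merging.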


Incorporating this modification, I propose the efficient version algorithm for worker clus-

tering as follows:

Algorithm 3 Worker Clustering (Efficient)

function Worker Clustering
    Initialize clustering: one worker = one cluster
    Distance d(Cj, Ck) = mean_{f∈F, t∈T} |µjft − µkft|
    for ι ∈ [0, ..., ῑ] do
        Current clustering Cι, cutoff: κ = ι�
        Evaluate affinity graph Aj,k(κ)
        Find its connected components Π∗(Cι) = {S∗1, ..., S∗L}
        for l ∈ [1, ..., L] do
            S∗∗ ≡ {S∗∗l(1), ..., S∗∗l(M)} = split disagreement(S∗l)
        end for
        Disagreement-free partition
            Π∗∗ = {S∗∗1(1), ..., S∗∗1(M′), ..., S∗∗L(1), ..., S∗∗L(M′′)}
        Merge the clusters within each component of Π∗∗ into a single cluster
        The output is the clustering for the next iteration, Cι+1
    end for
end function
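The outer loop of Algorithm 3 can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the disagreement-splitting step is omitted, all pairwise distances are recomputed at every cutoff (so it is not efficient), and the data layout (`wages` as a dictionary keyed by hypothetical `(worker, workplace)` pairs) and function names are illustrative. The distance follows the definition in the algorithm: the mean absolute gap between cluster-average wages over the workplaces where both clusters are observed.

```python
from itertools import combinations
from statistics import mean


def cluster_distance(ca, cb, wages):
    """Mean absolute gap in cluster-average wages over shared workplaces."""
    def avg_by_workplace(cluster):
        by_place = {}
        for (worker, place), w in wages.items():
            if worker in cluster:
                by_place.setdefault(place, []).append(w)
        return {p: mean(ws) for p, ws in by_place.items()}

    mu_a, mu_b = avg_by_workplace(ca), avg_by_workplace(cb)
    shared = mu_a.keys() & mu_b.keys()
    if not shared:
        return None                      # never observed at a common workplace
    return mean(abs(mu_a[p] - mu_b[p]) for p in shared)


def worker_clustering(wages, cutoffs):
    """Return [(kappa, clustering)] for an increasing sequence of cutoffs."""
    clusters = [frozenset([w]) for w in sorted({w for w, _ in wages})]
    hierarchy = []
    for kappa in cutoffs:
        parent = list(range(len(clusters)))        # union-find over clusters

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        # Affinity graph: link two clusters if their distance is within kappa.
        for a, b in combinations(range(len(clusters)), 2):
            d = cluster_distance(clusters[a], clusters[b], wages)
            if d is not None and d <= kappa:
                parent[find(a)] = find(b)
        # Merge every connected component into a single cluster.
        merged = {}
        for i, c in enumerate(clusters):
            merged.setdefault(find(i), set()).update(c)
        clusters = [frozenset(m) for m in merged.values()]
        hierarchy.append((kappa, sorted(sorted(c) for c in clusters)))
    return hierarchy
```

With wages for workers 1, 2, 3, 5, 6 at one workplace and workers 4, 5 at another, workers 4 and 6 end up in the same cluster through worker 5 even though they are never coworkers.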

Illustration of Worker Clustering

[Figure: six panels. Left panels plot the wage Wi,f,t (y-axis) against worker i (x-axis) for a given firm and period. Right panels show the worker hierarchy, with dissimilarity |Di,i′| on the y-axis.]

1. The input of the algorithm is the wages wift observed in the matched employer-employee data (y-axis) for all workers i (x-axis) at the same workplace, i.e. in firm f in the same period t. Individuals exert influence on their coworkers’ wages.

2. Starting from the first iteration, associated with the minimum κ (y-axis of the right panel), the algorithm assigns the most similar workers (“1” and “2”) to the same cluster.

3. Continuing to the next iteration with a larger κ, the next most similar pair of workers (“3” and “4”) is grouped.

4. As the cutoff κ increases further, a larger fraction of single-worker clusters is agglomerated into bigger clusters. Note that merges take place not only between individual workers but also between worker clusters (e.g. between cluster “12” and individual “3”).

5. As the size of the clusters increases, the algorithm can compare and group a wider range of workers. For example, workers “4” and “6” have never been coworkers. Nonetheless, they are identified as similar and clustered via the intermediate worker “5”. Note that “4” and “5” earn similar wages in firm f′ at t′, while “5” and “6” earn similar wages in firm f at t.

6. When κ increases to infinity, the algorithm ultimately agglomerates all workers. The output is a “worker hierarchy”: for every individual worker, a group of similar workers given the cutoff κ.

Illustration of Regularization

[Figure: two rows of panels. Each row shows the training wages Wi,f,t by worker i, the worker hierarchy with dissimilarity |Di,i′| and the cutoff κ, and the wages in a held-out workplace used for prediction; the cross-validation error is evaluated on the held-out wages.]

1. Each level of the cutoff κ corresponds to a unique worker clustering, based on which the algorithm predicts out-of-sample wages in the validation set. With κ fixed at the level shown in the middle panel, the corresponding clustering is C = {{1, 2, 3}, {4, 5, 6}}. The algorithm uses the average wage of workers 2 and 3 in firm f′ at t′ to predict the wage of worker 1 (the dashed square) and compares it to the actual wage (the solid square) in the validation set to evaluate the RMSE.

2. Search for the κ that minimizes the prediction error. Note that in this case the RMSE decreases when moving to a lower κ, corresponding to the clustering C = {{1, 2}, {3}, {4, 5, 6}}, and the counterfactual wage for worker 1 is predicted with worker 2’s wage only.
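The cross-validation step described in the two panels can be sketched as follows. This is a minimal illustration, assuming that the worker hierarchy is available as a dictionary from cutoff κ to clustering and that training and validation wages are dictionaries keyed by `(worker, workplace)`; the function names are hypothetical.

```python
from math import sqrt


def predict(worker, place, clustering, train):
    """Average training wage of same-cluster coworkers at the given workplace."""
    cluster = next(c for c in clustering if worker in c)
    peers = [w for (j, p), w in train.items()
             if p == place and j in cluster and j != worker]
    return sum(peers) / len(peers) if peers else None


def rmse(clustering, train, validation):
    """Root mean squared prediction error on the validation wages."""
    errors = []
    for (i, p), actual in validation.items():
        guess = predict(i, p, clustering, train)
        if guess is not None:
            errors.append((guess - actual) ** 2)
    return sqrt(sum(errors) / len(errors)) if errors else float("inf")


def select_cutoff(hierarchy, train, validation):
    """Pick the cutoff whose clustering predicts held-out wages best."""
    return min(hierarchy, key=lambda kappa: rmse(hierarchy[kappa], train, validation))
```

In the setting of the illustration, if workers 2 and 3 earn 12.0 and 14.0 at the held-out workplace and worker 1 actually earns 12.1, the finer clustering {{1, 2}, {3}, {4, 5, 6}} predicts 12.0 and is preferred to the coarser one, which predicts 13.0.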


4 Simulation Results

To evaluate the accuracy of the estimator in finite samples, I run Monte Carlo simulations. I simulate wages from a data generating process with a standard calibration.

4.1 Simulation

This section shows that Algorithm 1 effectively estimates the worker clustering, the wage complementarity and the coworker effects function. Each worker i ∈ I has productivity xi, x : I → R, and each firm f ∈ F has productivity yf, y : F → R. In each year t ∈ {1, ..., T}, unemployed workers search randomly and apply for job vacancies in firms. The average monthly job finding rate λ̄ is set to 40%. To generate realistic positive assortative matching, i.e. high-productivity workers working in high-productivity firms, assume that an offer is accepted if and only if

$$|x_i - y_f| < 0.1.$$

For each match (xi, yf), worker i leaves firm f at an exogenous job separation rate δ = 3%. Importantly, the match effect component for (xi, yf) is

$$w(x_i, y_f) = \left(x_i^{\rho} + y_f^{\rho}\right)^{1/\rho},$$

and the coworker effect exerted by a coworker j of type xj is a(xj). The wage is determined by Equation (3), given worker productivity xi, firm productivity yf and all coworker productivities {xj}j∈P−i,ft:

$$w_{ift} = w(x_i, y_f) + \frac{1}{N_{ft}} \sum_{j \in P_{-i,ft}} a(x_j) + \nu_{ift}.$$

The parameters are summarized in Table I:

Workers               N = 10,000
Firms                 M = 200
Years                 T = 20
Worker types          xi ∼ U[0, 1]
Firm types            yf ∼ U[0, 1]
Job finding rate      λ = 40%
Job separation rate   δ = 3%
Meeting               random
Matching set          {(x, y) : ||x − y|| < 0.1}

Table I: Data Generating Process
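The matching part of this data generating process can be sketched in a few lines of Python. The sketch is illustrative only and rests on assumptions that the text does not pin down: the function and argument names are hypothetical, the job finding and separation rates are applied once per simulated period, ρ = 1/2 is used as the default, and only the match component w(xi, yf) is computed (coworker effects and the noise term are left out).

```python
import random


def ces_wage(x, y, rho=0.5):
    """CES match effect (x^rho + y^rho)^(1/rho)."""
    return (x ** rho + y ** rho) ** (1.0 / rho)


def simulate_matches(n_workers=10_000, n_firms=200, n_periods=20,
                     find_rate=0.40, sep_rate=0.03, band=0.1, seed=0):
    """Simulate employment spells; returns types and (worker, firm, period, wage)."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n_workers)]   # worker types ~ U[0, 1]
    y = [rng.random() for _ in range(n_firms)]     # firm types   ~ U[0, 1]
    employer = [None] * n_workers
    records = []
    for t in range(n_periods):
        for i in range(n_workers):
            if employer[i] is not None and rng.random() < sep_rate:
                employer[i] = None                 # exogenous separation
            elif employer[i] is None and rng.random() < find_rate:
                f = rng.randrange(n_firms)         # random meeting
                if abs(x[i] - y[f]) < band:        # offer accepted inside the matching set
                    employer[i] = f
            if employer[i] is not None:
                f = employer[i]
                records.append((i, f, t, ces_wage(x[i], y[f])))
    return x, y, records
```

Because offers are accepted only inside the band |x − y| < 0.1, every simulated match pairs a worker with a firm of similar type, which is what produces the positive sorting described above.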

4.1.1 Data Generating Process #1

In the first DGP, the match effect takes a CES functional form and there are no peer effects:

$$w(x, y) = \left(x^{1/2} + y^{1/2}\right)^{2}, \qquad a(x_i) = 0.$$

Clustering The resulting cluster assignment ci for all workers is displayed in Figure 12. For each pixel, the x-axis shows the true worker productivity xi and the y-axis shows the label of the cluster ci assigned to the same worker i. The clustering C displayed in Figure 12 is highly accurate: workers that are close in X are assigned to the same or adjacent clusters, and the assignment function ci lies along the 45-degree line.

Figure 12: Worker cluster assignment ci. Note that cluster labels are identified only up to a permutation. Here, cluster labels are ranked by the average true worker productivity.

Coworker effects The coworker effect function is accurately estimated. As Figure 13 shows, when there is no coworker effect in the DGP, the algorithm correctly detects zero: â(x) ≈ 0.

Figure 13: Estimated coworker effects â(cj) and the ground truth a(xj)

To evaluate the accuracy of the estimator â, I use the relative risk, defined as the square root of the ratio of the mean squared error of the estimator â(X) to the total variance of the true coworker effects in the population:

$$RR = \left( \frac{\int_0^1 (\hat a(x) - a(x))^2 \phi(x)\,dx}{\int_0^1 a(x)^2 \phi(x)\,dx} \right)^{1/2} \times 100\%,$$

where φ(x) is the probability density function of workers of type x. To create a comparable benchmark, I show the performance of the implied CDS estimator of coworker influence for each individual i:

$$\hat a^{CDS}_i = \hat\gamma \cdot \hat\psi_i.$$

The results are summarised in Table II.

                 Baseline estimator â    CDS estimator âCDS
Relative Risk    6.87%                   inf

Table II: Relative Risk for both estimators
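The relative risk defined above can be approximated numerically. The following is a minimal sketch, assuming a uniform type density on [0, 1] by default and a midpoint rule for the integrals; the function name and arguments are hypothetical, and returning infinity when the denominator is zero is a convention chosen here for illustration.

```python
import math


def relative_risk(a_hat, a, phi=lambda x: 1.0, n_grid=10_000):
    """Midpoint-rule approximation of RR on [0, 1], in percent."""
    h = 1.0 / n_grid
    num = den = 0.0
    for k in range(n_grid):
        x = (k + 0.5) * h
        num += (a_hat(x) - a(x)) ** 2 * phi(x) * h   # mean squared error
        den += a(x) ** 2 * phi(x) * h                # second moment of the truth
    if den == 0.0:
        return 0.0 if num == 0.0 else math.inf
    return 100.0 * math.sqrt(num / den)
```

For example, with a(x) = 0.05x and an estimate that is off by a constant 0.001 everywhere, the ratio under the root is 10⁻⁶ / (0.0025/3) = 0.0012, which gives a relative risk of about 3.46%.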


4.1.2 Data Generating Process #2

In the second DGP, I simulate wages with the same match effect function, but now with a positive coworker effect function:

$$w(x_i, y_f) = \left(x_i^{1/2} + y_f^{1/2}\right)^{2}, \qquad a(x_j) = 0.05\,x_j.$$

Coworker effects The coworker effect function is again accurately estimated. In Figure 14, the estimated and true coworker effect functions lie on top of each other: â(x) ≈ a(x).

Figure 14: Estimated coworker effects â(ci) and the ground truth a(xi)


General coworker effects The results for the general estimator between types,

$$\hat a(x = l, x' = k),$$

are displayed as follows:

Figure 15: Estimated coworker effects â(ci, cj) and the ground truth a(xi, xj)

The performance of the baseline estimator â(l), the two-dimensional estimator â(l, k) and the benchmark estimator âCDS is summarised in Table III.

                 Baseline est. â(k)    General est. â(l, k)    CDS est. âCDS
Relative Risk    3.42%                 5.68%                   44.81%

Table III: Relative Risk for the three estimators

5 “Divide and Conquer”: Achieving Scalability

While the baseline hierarchical clustering algorithm is accurate in identifying unobserved worker productivities and estimating coworker effects, one impediment to applying it to large administrative data is its computational complexity in both time and space. The time complexity of an algorithm measures how the number of operations scales with the size of the data, and the space complexity measures its memory usage. For a typical hierarchical clustering, the time complexity is O(N² log N) and the space requirement is quadratic, O(N²), where N is the number of workers in the data. This is a formidable volume of computation and memory at the scale of the matched employer-employee data of Denmark, where N = 3 million workers.
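A back-of-the-envelope calculation makes these orders of magnitude concrete. The sketch below assumes 8 bytes per stored distance and a base-2 logarithm, and ignores all constants, so the numbers are indicative only; the function name is hypothetical.

```python
import math


def clustering_cost(n_workers, bytes_per_entry=8):
    """Rough size of the O(N^2) distance matrix and the O(N^2 log N) work."""
    pairs = n_workers * n_workers                 # entries of the distance matrix
    memory_tb = pairs * bytes_per_entry / 1e12    # decimal terabytes
    operations = pairs * math.log2(n_workers)     # constants ignored
    return pairs, memory_tb, operations


pairs, memory_tb, operations = clustering_cost(3_000_000)
print(f"{pairs:.1e} entries, {memory_tb:.0f} TB, {operations:.1e} operations")
```

For N = 3 million, the full distance matrix has 9 × 10¹² entries, or about 72 TB at 8 bytes each, which is well beyond the memory of a single machine.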

5.1 Graph Embedding Techniques

To accommodate the demand for scalability, I combine the baseline Algorithm 1 with recent advances in graph embedding techniques based on graph neural networks (GNN).

Graph embedding is a widely applied graph analysis method that has achieved groundbreaking success in recent applications in domains ranging from recommender systems to pharmaceutics (see Cai et al. (2018), Wu et al. (2019) and Zhou et al. (2018) for detailed reviews of these recent developments and applications). The goal is to map a graph into a low-dimensional space in which the graph information is best preserved. In my application, the focus is to find embeddings for each worker i and each peer group (time-varying firm f by t) given the wage matrix w ∈ R^{N×FT}. Here, w is viewed as a bipartite graph between worker nodes i ∈ I and time-varying firm nodes ft ∈ F × T, with the weight on edge (i, ft) being the observed wage wi,ft. The embedding is a low-dimensional vector representation of an individual worker or peer group, hi ∈ R^v and hft ∈ R^v, that best preserves the wage information in the data.

Graph embeddings can be computed efficiently with GNN techniques, pioneered by Kipf and Welling (2016). A GNN follows a neighborhood aggregation scheme: the embedding of a worker node is computed by recursively aggregating and transforming the embeddings of its neighboring coworkers (Xu et al. (2018)). The scheme therefore enables flexible interactions between coworkers, reflected in the embeddings of the workers and the time-varying firms.
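The neighborhood aggregation idea can be illustrated with a deliberately stripped-down Python sketch. It is not a GNN: the learned aggregation function is replaced by a plain mean with no trainable parameters, the wage weights on the edges are ignored, and the names `aggregate`, `h_worker` and `h_firm` are hypothetical. It only shows how information travels between workers through shared workplaces on the bipartite graph.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def aggregate(edges, h_worker, h_firm, rounds=2):
    """Alternate mean aggregation between workplace nodes and worker nodes.

    edges: iterable of (worker, workplace) pairs
    h_worker, h_firm: dict mapping node -> list of floats (initial embeddings)
    """
    edges = list(edges)
    for _ in range(rounds):
        # Workplace embedding <- mean of the embeddings of its workers.
        new_firm = {}
        for ft in h_firm:
            neighbours = [h_worker[i] for i, p in edges if p == ft]
            new_firm[ft] = mean_vector(neighbours) if neighbours else h_firm[ft]
        # Worker embedding <- mean of the embeddings of its workplaces.
        new_worker = {}
        for i in h_worker:
            neighbours = [new_firm[p] for j, p in edges if j == i]
            new_worker[i] = mean_vector(neighbours) if neighbours else h_worker[i]
        h_worker, h_firm = new_worker, new_firm
    return h_worker, h_firm
```

With workers 1 and 2 at one workplace and workers 2 and 3 at another, a single round already pulls the embeddings of workers 1 and 3 toward each other through worker 2, although the two never work together.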

Illustration of GNN-based Graph Embedding

[Figure: two panels, “Wage Matrix” (left) and “Graph Convolutional Network” (right). The right panel shows the update h^k_ft = gθ({h^{k−1}_i, h^{k−1}_j}) and a loss L(H; W) that aggregates, over observed pairs (i, ft), the discrepancy between Wift and wθ(hft, hi).]

Figure 16: The implementation of GraphSAGE, considering second-order neighborhood aggregation. Each circle in both panels represents a worker node (“i”), each square a time-varying firm node (“ft”), and each edge the wage for the match (“wi,ft”). All grey squares in the right panel represent the same graph neural network function gθ (parameterized by θ), which aggregates the embeddings of the neighbor nodes (“i” and “j”) to update the embedding of the current node (“ft”). Once the embeddings are computed, wages can be recovered by the neural network function wθ. The objective is to find embeddings H = {{hi}, {hft}}, wθ and gθ that minimize the loss function L(H, W).

The worker embeddings can be computed efficiently with a GNN: the time and space complexity is only linear, O(N). In addition, the computation and optimization of neural networks is highly modular and parallelizable, and can easily be distributed on GPUs.

5.2 Worker embeddings

Figure 17 displays the worker embeddings for wages simulated by the data generating process in Section 4.1. The coordinates of a node are the two-dimensional t-SNE representation of the embedding of a worker. Each node is colored by the true productivity type x. The GNN algorithm distinguishes workers well globally, as workers in similar colors tend to appear at similar locations, but the result is subject to many local mistakes and is less accurate than the baseline hierarchical clustering Algorithm 1.

$$\{\text{Worker-by-time-varying-firm wage matrix}\} \xrightarrow{\text{Graph Embedding}} \mathbb{R}^{v} \xrightarrow{k\text{-means}} \text{subsets}$$

Figure 17: t-SNE representation of Worker Embedding

5.3 “Divide and Conquer”

The baseline hierarchical clustering is accurate but very costly in both computation and memory usage in big-data applications. Despite being less accurate, the GNN-based graph embedding can be implemented with high performance and efficiency. This section proposes to integrate the baseline algorithm with the GNN approach through a divide-and-conquer strategy. The integration is closely related to the proposal by Hagedorn et al. (2020).

In the “division” step, compute worker embeddings using the GNN, group closely embedded workers, and divide them into separate subsets. In the “conquering” step, apply hierarchical clustering only within each local subset: on the premise that only similar workers are assigned to each subset, this step significantly reduces the dimension of the problem by eliminating a large volume of redundant comparisons without compromising accuracy. In case the GNN mistakenly splits similar workers into different subsets, I reshuffle the subsets and repeat the procedure.

Compared with the random-walk-based “node2vec” embedding employed by Hagedorn, Manovskii and Xin (2020), the GNN-based embedding has a number of advantages. First, it is suitable for studying coworker effects, as it explicitly models the interaction between neighboring coworkers in a flexible manner. Second, the GNN approach allows node feature information to be incorporated. Moreover, the computation of a GNN can easily be parallelized and carried out efficiently on GPUs.
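The two steps of the strategy can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the embeddings are taken as given, the division step uses a plain k-means with a deterministic initialization, the local clustering routine is passed in as a hypothetical `local_cluster` function, and the reshuffling step is omitted.

```python
def kmeans(points, k, iters=20):
    """Plain Lloyd iterations on a dict node -> vector; returns lists of nodes."""
    nodes = sorted(points)
    step = max(1, len(nodes) // k)
    centers = [list(points[n]) for n in nodes[::step][:k]]   # deterministic start

    def nearest(p):
        return min(range(len(centers)),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))

    groups = []
    for _ in range(iters):
        groups = [[] for _ in centers]
        for n in nodes:
            groups[nearest(points[n])].append(n)
        centers = [
            [sum(col) / len(g) for col in zip(*(points[n] for n in g))] if g else c
            for g, c in zip(groups, centers)
        ]
    return [g for g in groups if g]


def divide_and_conquer(embeddings, k, local_cluster):
    """Divide workers by embedding, then cluster within each subset only."""
    result = []
    for subset in kmeans(embeddings, k):          # "division" step
        result.extend(local_cluster(subset))      # "conquering" step
    return result


def n_pairs(n):
    """Number of pairwise comparisons among n workers."""
    return n * (n - 1) // 2
```

With six workers split into two subsets of three, the number of pairwise comparisons falls from 15 to 6; the saving grows quickly with the number of workers and subsets.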

5.4 Simulation Results

To illustrate the efficiency of the divide-and-conquer strategy, I simulate Data Generating Process #2 of Section 4.1 with a large number of workers, N = 100,000.

Clustering The resulting cluster assignment ci for all workers under the divide-and-conquer algorithm is displayed in Figure 18. The clustering C displayed is accurate: workers that are close in X are assigned to the same or adjacent clusters, and the assignment function ci lies along the 45-degree line.

Figure 18: Worker cluster assignment ci (divide-and-conquer).

Coworker effects Figure 19 indicates that the coworker effect function is accurately estimated.

Figure 19: Estimated coworker effects â(ci) and the ground truth a(xi)

For the estimator â on X:

$$\mathrm{RMSE} = \left( \int_0^1 (\hat a(x) - a(x))^2 \phi(x)\,dx \right)^{1/2} = 0.16\%$$


6 Conclusion

In this paper I have developed a new empirical methodology for studying peer effects. I show that the leading empirical methodology is biased under worker-firm complementarity. I develop a semi-parametric approach to jointly estimate the wage complementarities and coworker effects. The method can also capture heterogeneous coworker effects, which helps to reconcile the diverging results in the microeconomic literature. The approach combines recent advances in machine learning with insights from economic theory. To accommodate the demand for computational efficiency, I integrate the baseline algorithm with GNN-based graph embedding techniques. I am currently using the proposed method to measure co-worker effects in the matched employer-employee panel data covering the entire population of Denmark.

References

Abowd, J. M., F. Kramarz, and D. N. Margolis (1999): “High Wage Workers and

High Wage Firms,” Econometrica, 67, 251–334.

Abowd, J. M., F. Kramarz, S. Pérez-Duarte, and I. M. Schmutte (2018): “Sorting

Between and Within Industries: A Testable Model of Assortative Matching,” Annals of

Economics and Statistics, 1–32.

Andrews, M. J., L. Gill, T. Schank, and R. Upward (2012): “High Wage Workers

Match with High Wage Firms: Clear Evidence of the Effects of Limited Mobility Bias,”

Economics Letters, 117, 824–827.

Angrist, J. (2014): “The perils of peer effects,” Labour Economics, 30, 98–108.

Arcidiacono, P., G. Foster, N. Goodpaster, and J. Kinsler (2012): “Estimating

spillovers using panel data, with an application to the classroom,” Quantitative Economics,

3, 421–470.

Banerjee, A., E. Duflo, R. Glennerster, and C. Kinnan (2015): “The Miracle of

Microfinance? Evidence from a Randomized Evaluation,” American Economic Journal:

Applied Economics, 7, 22–53.

Becker, G. (1973): “A Theory of Marriage: Part I,” Journal of Political Economy, 81,

813–846.

Betts, J. and A. Zau (2004): “Peer groups and academic achievement: Panel evidence from administrative data.”

Bloom, N., J. Liang, J. Roberts, and Z. J. Ying (2014): “Does Working from Home Work? Evidence from a Chinese Experiment,” The Quarterly Journal of Economics, 130, 165–218.

Bonhomme, S. (2020): “Heterogeneity, Sorting and Complementarity,” Working paper,

National Bureau of Economic Research.

Bonhomme, S., T. Lamadon, and E. Manresa (2019): “A Distributional F

Documents

Measuring the Coworker E ects on Wages · 2021. 1. 9. · Measuring the Coworker E ects on Wages Jianhong Xiny Saturday 9th January, 2021 [Click for Latest Version] Abstract The fast