
Causal Inference with Panel Data

Yiqing Xu (Stanford University)

Northwestern-Duke Causal Inference Workshop

19 August 2019

Motivation

Cadre Visits on Loans

[Figure: Log outstanding loans by quarter before/after a cadre's visit, 182 treated firms vs. 542 control firms, with visits marked.]

1

Cadre Visits on Loans

[Figure: "The Effect of Cadre Visits on Loans": diff-in-diffs in loans by quarter before/after a cadre's visit, with 95% confidence intervals; a flat pre-treatment difference is implicitly assumed.]

2

Cadre Visits on Loans

[Figure: "The 'Effect' of Cadre Visits on Loans": diff-in-diffs in loans by quarter before/after a cadre's visit, with 95% confidence intervals.]

3

[Figure: "The 'Effect' of Assault Weapon Ban on Crime Rate": diff-in-diffs in crime rate (%) by year before/after the reform, with 95% confidence intervals.]

[Figure: "The 'Effect' of WTO on Trade": diff-in-diffs in log trade volume by year before/after both parties join the WTO, with 95% confidence intervals.]

[Figure: "The 'Effect' of Democratization on Economic Development": diff-in-diffs in log GDP per capita by year before/after democratization, with 95% confidence intervals.]

[Figure: "The 'Effect' of Tim Geithner Connections on Stock Market Return": diff-in-diffs in abnormal return (%) by day before/after the news leak, with 95% confidence intervals.]

Motivations

Two-way fixed effects (FE) and difference-in-differences (DiD) methods

are commonly used in the social sciences.

• Abundant observational panel data

• Accounting for unobserved unit and time heterogeneity

• Easy to estimate and interpret

[Figure: Number of papers published in the APSR/AJPS/JOP using "difference-in-differences" or "fixed effects" + "panel data", 2000-2016.]

5

We Try to Answer

1. What about time-varying confounders, e.g. when a "pre-trend" exists?

2. What if treatment effects are heterogeneous, as they always are?

3. What are the differences between two-way FE and DiD?

4. Are the assumptions credible? What are the hypothetical experiments behind these models?

5. How can we do better than DiD or synthetic control?

6

What’s Special about Panel Data?

• The fundamental problem of causal inference (Holland 1986)

τi = Y1i − Y0i

• A statistical solution makes use of others’ information

e.g. ATE = E[Y1]− E[Y0]

• A scientific solution exploits homogeneity or invariance assumptions

e.g. A rock stays a rock.

e.g. The long-run growth rate of the US economy is 2.5%.

• Panel data allow us to construct treated counterfactuals using information from both the past and from other units, with the caveat that the treatment assignment mechanism may be complicated

• Panel data are also difficult because of all kinds of interference (SUTVA violations)

7

Causal Inference with Panel Data

• “Scientific” solution: modeling (but all models are wrong...)

• Statistical solution: similar to the Selection-on-Observable (SOO)

approach, e.g., matching/reweighting

• Panel data make both easier

• Pre-trends are observable → more information for modeling

• The additional dimension helps relax the conventional ignorability

assumption

• And we can do more...

8

Roadmap

• DiD and synthetic control
• FE/DiD assumptions
• New estimators
  • Matching and reweighting
  • Outcome models
  • Diagnostic tools
• Hybrid methods
• Conclusions

Quick Review of DiD and Synth

Difference-in-Differences

Yit = τit Dit + αi + ξt + εit

or, in potential-outcomes form:

Yit(0) = αi + ξt + εit
Yit(1) = Yit(0) + τit

• τit is the treatment effect for unit i at time t

• Y 0it is a combination of two additive fixed effects and idiosyncratic

errors

• E[εit ] = 0 and εit ⊥⊥ Dis , for all i , t, s (strict exogeneity)

• ATT = E[τit |Dit = 1] can be non-parametrically identified if there

are only two periods (or two treatment histories)

10
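To make the estimation concrete, here is a minimal sketch (not from the original slides) of fitting the constant-effect, two-way fixed-effects version of this model on long-format data; the column names y, d, unit, and time are hypothetical.

```python
# A minimal sketch: two-way FE / DiD regression Y_it = tau*D_it + alpha_i + xi_t + e_it
# on a long-format pandas DataFrame with (hypothetical) columns y, d, unit, time.
import pandas as pd
import statsmodels.formula.api as smf

def twoway_fe_did(df: pd.DataFrame) -> float:
    """Return the estimated ATT (tau) under constant-effect assumptions."""
    fit = smf.ols("y ~ d + C(unit) + C(time)", data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df["unit"]}  # cluster SEs by unit
    )
    return fit.params["d"]
```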

Difference-in-Differences

Y = ( Y(0)_T,pre     ?
      Y(0)_C,pre     Y(0)_C,post )

• τit is the treatment effect for unit i at time t

• Y 0it is a combination of two additive fixed effects and idiosyncratic

errors

• E[εit ] = 0 and εit ⊥⊥ Dis , for all i , t, s (strict exogeneity)

• ATT = E[τit |Dit = 1] can be non-parametrically identified if there

are only two periods (or two treatment histories)

11

DiD

[Figure: Treatment status plot, units by time periods 1-30, under control vs. under treatment (classic two-group DiD design).]

12

Semi-parametric DiD

• Abadie (2005) proposes “semi-parametric DiD”

• Assumption: non-parallel outcome dynamics between treated and

controls caused by observed characteristics

• Two-step strategy:

1. estimate the propensity score based on observed covariates; compute

the fitted value

2. run a weighted DiD model

• The idea of using pre-treatment variables to adjust trends is a

precursor to synthetic control

• Strezhnev (2018) extends this approach to incorporate pre-treatment

outcomes

13

Synthetic Control: Basic Idea

• J + 1 units in periods 1, 2, . . . ,T ; one treated “1”, J controls

• Region “1” is exposed to the intervention after period T0

• We aim to estimate the effect of the intervention on Region “1”

[Diagram: J+1 units observed over periods 1, ..., T0, ..., T; unit "1" is treated after T0, and weights W* on the J control units are used to construct its counterfactual.]

14

Synthetic Control: Insights

• Athey and Imbens (2017): “[a]rguably the most important

innovation in the policy evaluation literature in the last 15 years.”

• A combo of many innovations

• Take advantage of pre-treatment outcomes

• Use cross-sectional instead of temporal correlations in data

• Use a convex combination of donor units to construct a counterfactual

• Reserve some pre-treatment periods for testing

15

Difference-in-Differences (DiD)

[Figure: Average outcomes of treated and control units by time relative to the treatment, with T0 marked (DiD comparison).]

16

Synthetic Control (and Many Extensions)

[Figure: Treated unit and reweighted (synthetic) control outcomes by time relative to the treatment, with T0 marked.]

17

Theoretical Justification

Yit = τit Dit + θ′t Zi + ξt + λ′i ft + εit

or, in potential-outcomes form:

Yit(0) = θ′t Zi + ξt + λ′i ft + εit
Yit(1) = Yit(0) + τit

• Suppose there are R time-varying signals ft out there

• Each unit (e.g. country, participant) picks up a fixed linear

combination of these signals based on factor loadings λi

• Since these “confounders” are evidenced in the pre-treatment

outcomes for both treated and controls, we can try to use this

information to “balance on” these confounders

• We will discuss the model-based approach later

18

Theoretical Justification

Yit(0) = θ′t Zi + ξt + λ′i ft + εit
Yit(1) = Yit(0) + τit

• Let W = (w2, ..., wJ+1)′ with wj ≥ 0 and w2 + ... + wJ+1 = 1
• Let Y(K1)_i, ..., Y(KM)_i be M > R linear functions of the pre-intervention outcomes
• Suppose we can choose W* such that:
  Z1 = Σ_{j=2}^{J+1} w*_j Zj  and  Y(k)_1 = Σ_{j=2}^{J+1} w*_j Y(k)_j,  for k ∈ {K1, ..., KM}
• When T0 is large, an approximately unbiased estimator of τ1t is:
  τ̂1t = Y1t − Σ_{j=2}^{J+1} w*_j Yjt,  t ∈ {T0 + 1, ..., T}

19

Implementation

• Let X1 = (Z1, Y(K1)_1, ..., Y(KM)_1)′ be a (k × 1) vector of pre-intervention characteristics for the treated unit and X0, a (k × J) matrix, for the controls.
• The vector W* is chosen to minimize ‖X1 − X0W‖, subject to the weight constraints.
• We consider ‖X1 − X0W‖_V = √((X1 − X0W)′ V (X1 − X0W)), where V is some (k × k) symmetric and positive semidefinite matrix.
• There are various ways to choose V (subjective assessment of the predictive power of X, regression, minimizing MSPE, cross-validation, etc.).

20
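A minimal sketch of this weight search (illustrative, not taken from any synth package), assuming X1 is a k-vector of pre-intervention characteristics for the treated unit, X0 is the k × J matrix for the controls, and V is a user-supplied PSD matrix:

```python
# Synthetic control weights: minimize ||X1 - X0 W||_V subject to simplex constraints.
import numpy as np
from scipy.optimize import minimize

def synth_weights(X1, X0, V):
    J = X0.shape[1]

    def loss(w):
        diff = X1 - X0 @ w
        return diff @ V @ diff                # squared V-weighted discrepancy

    cons = ({"type": "eq", "fun": lambda w: w.sum() - 1.0},)  # weights sum to 1
    bounds = [(0.0, 1.0)] * J                                 # non-negative weights
    w0 = np.full(J, 1.0 / J)                                  # uniform starting point
    res = minimize(loss, w0, method="SLSQP", bounds=bounds, constraints=cons)
    return res.x
```

In practice V itself is chosen by an outer loop (e.g. minimizing pre-treatment MSPE), as noted on the previous slide.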

Example: Proposition 99 on Cigarette Consumption

• In 1988, California first passed comprehensive tobacco control

legislation (cigarette tax, media campaign etc.)

• Using 38 states that had never passed such programs as controls

[Figure: Per-capita cigarette sales (in packs), 1970-2000, California vs. the rest of the U.S., with the passage of Proposition 99 marked ("Cigarette Consumption: CA and the Rest of the U.S.").]

21

Example: Proposition 99 on Cigarette Consumption

• In 1988, California first passed comprehensive tobacco control

legislation (cigarette tax, media campaign etc.)

• Using 38 states that had never passed such programs as controls

[Figure: Per-capita cigarette sales (in packs), 1970-2000, California vs. synthetic California, with the passage of Proposition 99 marked ("Cigarette Consumption: CA and Synthetic CA").]

22

Predictor Means: Actual vs. Synthetic California

Variables                          California (Real)   California (Synthetic)   Average of 38 control states
Ln(GDP per capita)                 10.08               9.86                     9.86
Percent aged 15-24                 17.40               17.40                    17.29
Retail price                       89.42               89.41                    87.27
Beer consumption per capita        24.28               24.20                    23.75
Cigarette sales per capita 1988    90.10               91.62                    114.20
Cigarette sales per capita 1980    120.20              120.43                   136.58
Cigarette sales per capita 1975    127.10              126.99                   132.81

Note: All variables except lagged cigarette sales are averaged over the 1980-1988 period (beer consumption is averaged 1984-1988).

23

Smoking Gap Between CA and Synthetic CA

[Figure: Gap in per-capita cigarette sales (in packs) between California and synthetic California, 1970-2000, with the passage of Proposition 99 marked.]

24

Smoking Gap for CA and 38 Control States

(All States in Donor Pool)

[Figure: Placebo gaps in per-capita cigarette sales for California and the 38 control states (all states in the donor pool), with the passage of Proposition 99 marked.]

25

Smoking Gap for CA and 34 Control States

(Pre-Prop. 99 MSPE ≤ 20 Times Pre-Prop. 99 MSPE for CA)

[Figure: Placebo gaps in per-capita cigarette sales for California and the 34 remaining control states, with the passage of Proposition 99 marked.]

26

Smoking Gap for CA and 29 Control States

(Pre-Treatment MSPE ≤ 5 Times Pre-Treatment MSPE for CA)

[Figure: Placebo gaps in per-capita cigarette sales for California and the 29 remaining control states, with the passage of Proposition 99 marked.]

27

Smoking Gap for CA and 19 Control States

(Pre-Treatment MSPE ≤ 2 Times Pre-Treatment MSPE for CA)

[Figure: Placebo gaps in per-capita cigarette sales for California and the 19 remaining control states, with the passage of Proposition 99 marked.]

28

Ratio Post-Treatment MSPE to Pre-Treatment MSPE

(All 38 States in Donor Pool)

[Figure: Histogram of the ratio of post- to pre-Proposition 99 mean squared prediction error across the donor pool, with California marked.]

29

Limitations and Potential Solutions

• Deal with only one treated unit at a time

• Multiple treated units, e.g. Acemoglu et al. (2017)

• Inference is hard

• Permutation inference and sensitivity analysis, e.g. Hahn and Shi (2016); Sergio et al. (2017); Chernozhukov (2017)

• Allow too much user discretion, e.g. cherry-picking the pre-treatment outcome functions Y(k)_i results in over-rejection (Ferman et al. 2017)

• Slow to implement and sometimes difficult to find a solution

• Other reweighting approaches...

• Model-based methods

30

Literature

• Difference-in-differences (Ashenfelter & Card 1985; Card and Krueger 1994)

• Synthetic control (Abadie and Gardeazabal 2003; Abadie et al 2010, 2015)

• Best subset or penalized regression models (e.g. Hsiao et al. 2012;

Valero 2015; Doudchenko and Imbens 2016; Chernozhukov et al 2019)

• Matching and reweighting methods (e.g. Abadie 2005; Imai and Kim

2018; Hazlett and Xu 2018; Strezhnev 2018; Imai et al. 2019)

• Outcome model: factor augmented and matrix completion methods

(e.g. Gobillon and Magnac 2016; Xu 2017; Athey et al 2019)

• Hybrid approaches (e.g. Ben-Michael, Feller and Rothstein 2018;

Arkhangelsky et al. 2019)

• Growing ...

31

Literature

[Diagram: Literature overview. Difference-in-Differences (DiD) (Ashenfelter & Card 1985; Card & Krueger 1994) is extended to deal with time-varying confounders along two branches.
Matching/Reweighting: Semi-parametric DiD (Abadie 2005); Synthetic Control Method (Abadie & Gardeazabal 2003; Abadie et al. 2010); Panel Matching (Imai & Kim 2018; Imai, Kim & Wang 2018); Propensity Score Reweighting (Austin 2011; Blackwell & Glynn 2018; Strezhnev 2018); Balancing (Robbins et al. 2017; Hazlett & Xu 2018); Regression Methods (Hsiao et al. 2012; Doudchenko & Imbens 2016).
Outcome Modeling: Interactive Fixed Effects (Bai 2009; Gobillon & Magnac 2016; Xu 2017); Matrix Completion (Athey et al. 2018).
Hybrid Methods: Augmented Synth (Ben-Michael et al. 2018); Synthetic DiD (Arkhangelsky et al. 2018).
Cross-cutting challenges: multiple treated units, computational and statistical efficiency, robustness, inference.]

Roadmap

• DiD and synthetic control
• FE/DiD assumptions
• New estimators
  • Matching and reweighting
  • Outcome models
  • Diagnostic tools
• Hybrid methods
• Conclusions

FE/DiD Assumptions

Assumptions for Twoway FEs

Yit = τDit + X′it β + αi + ξt + εit,  in which Dit is dichotomous

1. Functional form
   • Additive fixed effects
   • Constant and contemporaneous treatment effect
   • Linearity in covariates

2. Strict exogeneity
   εit ⊥ Djs, Xjs, αj, ξs  ∀ i, j, t, s
   ⇒ {Yit(0), Yit(1)} ⊥ Djs | X, α, ξ  ∀ i, j, t, s

   With only two groups, this implies parallel trends:
   ⇒ E[Yit(0) − Yit′(0) | X] = E[Yjt(0) − Yjt′(0) | X],  i ∈ T, j ∈ C, ∀ t, t′

33

Shortcomings of Twoway FEs

Yit = τDit + X′it β + αi + ξt + εit

Assumptions

1. Functional form

2. Strict exogeneity

Challenges

1. Treatment effect heterogeneity leads to bias

2. Difficult to evaluate strict exogeneity

3. Strict exogeneity means a lot more than what you think

4. A deeper question: what does fixed effects approach imply from a

design-based inference perspective?

34

Treatment Effect Heterogeneity Causes Bias

Yit = τit Dit + αi + εit

Within estimator:

τ̂ = argmin_τ Σi Σt {(Yit − Ȳi) − τ(Dit − D̄i)}²

Causal estimand (ATT):

τ = E[τit | Ci = 1, Dit = 1],  where Ci = 1 if var(Dit) ≠ 0

Proposition:

τ̂ → E[Ci σ²i (Yi(1) − Yi(0))] / E[Ci σ²i] = E[Ci σ²i τi] / E[Ci σ²i] ≠ τ,  with σ²i ≡ var(Dit)

i.e. a unit fixed-effects model gives a "variance-weighted" average treatment effect, cf. Chernozhukov et al. (2013), Theorem 1

35
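A small illustrative simulation (not from the slides) of this proposition: when unit effects are correlated with the within-unit variance of treatment, the within estimator recovers the variance-weighted average rather than the ATT.

```python
# Simulate Y_it = tau_i*D_it + alpha_i + e_it and compare the within estimator
# with the variance-weighted average and the true ATT.
import numpy as np

rng = np.random.default_rng(0)
N, T = 2000, 10
p_i = rng.uniform(0.1, 0.9, N)                  # unit-specific treatment propensity
tau_i = 3.0 * p_i                               # effects correlated with treatment variance
D = rng.binomial(1, p_i[:, None], (N, T))
alpha = rng.normal(0, 1, N)
Y = alpha[:, None] + tau_i[:, None] * D + rng.normal(0, 1, (N, T))

# Within (unit FE) estimator: regress demeaned Y on demeaned D
Yd = Y - Y.mean(axis=1, keepdims=True)
Dd = D - D.mean(axis=1, keepdims=True)
tau_fe = (Dd * Yd).sum() / (Dd ** 2).sum()

s2 = D.var(axis=1)                              # within-unit treatment variance
tau_vw = (s2 * tau_i).sum() / s2.sum()          # variance-weighted benchmark
att = (D * tau_i[:, None]).sum() / D.sum()      # true ATT over treated observations
print(tau_fe, tau_vw, att)                      # tau_fe tracks tau_vw, not the ATT
```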

Shortcomings of Twoway FEs

Yit = τDit + X′it β + αi + ξt + εit

Assumptions

1. Functional form

2. Strict exogeneity

Challenges

1. Treatment effect heterogeneity leads to bias

2. Difficult to evaluate strict exogeneity

3. Strict exogeneity means a lot more than what you think

4. A deeper question: what does fixed effects approach imply from a

design-based inference perspective?

36

Common Practice: Dynamic Treatment Effect Plots

Adhikari and Alm (2016)

1. parametric assumptions

2. arbitrarily chosen base category

3. unreliable tests

37

Shortcomings of Twoway FEs

Yit = τDit + X′it β + αi + ξt + εit

Assumptions

1. Functional form

2. Strict exogeneity

Challenges

1. Treatment effect heterogeneity leads to bias

2. Difficult to evaluate strict exogeneity

3. Strict exogeneity means a lot more than what you think

4. A deeper question: what does fixed effects approach imply from a

design-based inference perspective?

37

Reinterpreting Strict Exogeneity (Imai and Kim 2019)

1. No unobserved time-varying confounder exists

2. Past outcomes don’t directly affect current outcome (no LDV)

3. Past treatments don’t directly affect current outcome (no “carryover

effect”)

4. Past outcomes don’t directly affect current treatment (no “feedback”)

38

Reinterpreting Strict Exogeneity (Imai and Kim 2019)

1. No unobserved time-varying confounder exists

2. Past outcomes don’t directly affect current outcome (no LDV)

3. Past treatments don’t directly affect current outcome (no “carryover

effect”)

4. Past outcomes don’t directly affect current treatment (no “feedback”)

• To relax 2 or 3, “block”/control for past treatments → but how many?

• To relax 4, need instrumental variables (Arellano and Bond 1991) → hard to

justify instruments; bad finite sample properties

• Often end up directly controlling for an arbitrary number of past treatments and LDVs → Nickell bias

39

Shortcomings of Twoway FEs

Yit = τDit + X′it β + αi + ξt + εit

Assumptions

1. Functional form

2. Strict exogeneity

Challenges

1. Treatment effect heterogeneity leads to bias

2. Difficult to evaluate strict exogeneity

3. Strict exogeneity means a lot more than what you think

4. A deeper question: what does fixed effects approach imply from a

design-based inference perspective? → hypothetical experiment?

40

Hypothetical Experiment?

Strict exogeneity implies the following data generating process:

αi, Xi → Di → Yi

Treatment statuses are assigned at one shot (e.g. randomly within units), not sequentially!

Example: random assignment within units

[Figure: Treatment status plot, units by time periods 1-30, under control vs. under treatment.]

41

Hypothetical Experiment?

Strict exogeneity implies the following data generating process:

αi, Xi → T0i → Di → Yi

Treatment timings are assigned at one shot, not sequentially!

Example: staggered adoption (Athey and Imbens 2018)

[Figure: Treatment status plot under staggered adoption, units by time periods 1-35.]

42

What We Try to Do

Yit = τDit + X′it β + αi + ξt + εit

Assumptions

1. Functional form

2. Strict exogeneity

What we do

? Allow treatment effect heterogeneity

? Develop methods to evaluate strict exogeneity

X Relax the no-time-varying confounder assumption

- Think harder about the hypothetical experiment

43

Roadmap

• DiD and synthetic control
• FE/DiD assumptions
• New estimators
  • Matching and reweighting
  • Outcome models
  • Diagnostic tools
• Hybrid methods
• Conclusions

Matching and Reweighting

The Matching/Reweighting Approach

Synthetic control

Yit(0) = λ′i ft + ξt + εit,  with E[εit] = 0, ∀ i, t

• Choose weights on the controls such that Yit = Σ_{i′≠i} w*_{i′} Yi′t, ∀ t ≤ T0
• As a result: λi = Σ_{i′≠i} w*_{i′} λi′

New Developments

• Propensity score reweighting – extension to Abadie (2005)

• Panel matching

• Panel regression methods

• Balancing: mean balancing and trajectory balancing

44

Panel Matching (Imai, Kim & Wang 2019)

Assumptions

• Sequential exogeneity (past info can affect today’s treatment)

→ parallel trends after conditioning (similar to Abadie 2005)

• SUTVA (no spillover effect)

• Limited carryover effect

Procedure

1. Create a matched set for each treated observation based on

treatment history

2. Refine the matched set via any matching or weighting method

3. Compute causal effect via weighted DiD using the refined set

4. Calculate model-based standard errors

45
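A minimal sketch of step 1 of this procedure (illustrative, not the authors' code), assuming D is an N × T matrix of 0/1 treatment indicators; the refinement, weighting, and inference steps (2-4) are omitted.

```python
# Build matched sets: for each unit that switches into treatment at time t,
# collect controls with the same L-period treatment history that are still untreated at t.
import numpy as np

def matched_sets(D: np.ndarray, L: int):
    N, T = D.shape
    sets = {}
    for i in range(N):
        for t in range(L, T):
            if D[i, t] == 1 and D[i, t - 1] == 0:            # newly treated at t
                hist = D[i, t - L:t]
                sets[(i, t)] = [
                    j for j in range(N)
                    if j != i and D[j, t] == 0               # under control at t
                    and np.array_equal(D[j, t - L:t], hist)  # same treatment history
                ]
    return sets
```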

Example: Democracy and Economic Growth

[Figure: Treatment status plot, countries by year (1960-2008), showing missing, autocracy, and democracy observations ("Democracy and Economic Growth: Treatment Status").]

46

Example: Democracy and Economic Growth

• Match based on treatment history for the past L periods

47

Example: Democracy and Economic Growth

• Refine the matched set based on covariates and pre-treatment

outcomes

[Figure: The number of matched control units.]

48

Example: Democracy and Economic Growth

[Figure: Estimated treatment effects from panel matching.]

49

Panel Matching: Advantages and Limitations

Advantages

• Require sequential exogeneity instead of strict exogeneity

• Allow treatment reversal

• Allow a variety of matching/reweighting methods

Limitations

• Lots of data (w/ info on outcome dynamics) are dropped

• Normally, imbalances remain

• Many choices require user discretion

50

Panel Regression Methods

• Synthetic control finds a convex combination of controls to mimic

the treated

• Implicitly assumes

• non-negative weights

• no intercept shift

• weights add up to 1

• Other ways to obtain weights

• Constrained regression

• Regression based on the best subset

• Penalized regression

• Balancing

51

Constrained Regression

• No intercept shift; weights add up to 1; non-negativity

ω̂constr = argmin_ω Σ_{t≤T0} (Y1t − ω′YC,t)²
s.t. Σ_{i∈C} ωi = 1 and ωi ≥ 0, ∀ i ∈ C

• Limitation: requires T0 > Nco

52

Best Subset (Hsiao et al. 2012)

• Choose the number of weights that are allowed to be non-zero
• Optimize the weights after taking out the intercept:

(μ̂subset, ω̂subset) = argmin_{μ,ω} Σ_{t≤T0} (Y1t − μ − ω′YC,t)²
s.t. Σ_{i∈C} 1{ωi ≠ 0} ≤ k

• Bottom-up approach: search for the best 1, then the best 2, then

the best 3 ... (greedy)

• Weights can be negative and do not need to add up to 1

53

Penalized Regression (Doudchenko and Imbens 2016)

• Use elastic net (L1 and L2 regularization) to select weights

(μ̂en, ω̂en) = argmin_{μ,ω} ‖Y1,pre − μ − ω′YC,pre‖² + λ · ( (1−α)/2 · ‖ω‖²₂ + α‖ω‖₁ ),

in which λ and α are tuning parameters

• Allow intercept shift; weights can be negative and do not add up to 1

• Randomization inference: (1) treated unit is randomly selected; (2)

the timing of the onset of the treatment is randomly selected

54
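A minimal sketch of this elastic-net step using scikit-learn (illustrative only), assuming y1_pre holds the treated unit's T0 pre-treatment outcomes and Yc_pre is the T0 × J matrix of control outcomes; sklearn's alpha and l1_ratio play the roles of λ and α above (up to scaling), and in practice are chosen by cross-validation.

```python
# Elastic-net weights on control units, following the Doudchenko-Imbens idea.
import numpy as np
from sklearn.linear_model import ElasticNet

def en_weights(y1_pre: np.ndarray, Yc_pre: np.ndarray, lam: float, alpha: float):
    # sklearn's `alpha` is the overall penalty (lambda) and `l1_ratio` is alpha above
    model = ElasticNet(alpha=lam, l1_ratio=alpha, fit_intercept=True, max_iter=10000)
    model.fit(Yc_pre, y1_pre)
    return model.intercept_, model.coef_   # mu_hat, omega_hat (weights may be negative)
```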

Example: Proposition 99 in California

Adapted from Doudchenko and Imbens (2016)

55

Example: German Reunification

Adapted from Doudchenko and Imbens (2016)

56

Differences Among Existing Methods

Methods compared: synth, constrained regression (cstr), best subset (Hsiao), elastic net (EN), IPW, panel matching (PMatch), mean balancing (meanbal), and trajectory balancing (tjbal).

How the weights are obtained: balancing (synth, meanbal, tjbal), regression (cstr, Hsiao, EN, IPW), or matching (PMatch).

The methods differ on: allowing a short T0; non-negative weights; weights adding up to 1; intercept shift; convex combination; multiple treated units; computational efficiency; dimension reduction; (almost) exact balancing; higher-order features.

See Doudchenko & Imbens (2016) and Ben-Michael et al. (2018) for detailed discussions.

57

Balancing: A Simple, Unified Framework

Almost all commonly used panel data models imply a common function space, with Y(0)_post linear in Y_pre.

Linearity in Pre-Treatment Outcomes (LPO)

E[Yit(0) | Yi,pre] = (1, Yi,pre)′ θt,  T0 < t ≤ T

Models satisfying LPO include:
• Diff-in-Diffs
• Two-way FEs
• ARMA
• IFE, Synth

Basic idea: equal means on Yi,pre imply equal means on Yit(0) for t > T0, regardless of θt (Robbins et al. 2017)

58

Mean Balancing

• Objective: choose weights on controls to get same average

pre-treatment trajectory for weighted controls as treated,

(1/Ntr) Σ_{i∈T} Yi,pre = Σ_{j∈C} wj Yj,pre

• Conceptually similar to synth: choosing weights to predict for

counterfactual averages

• In practice: seek approximate balance, working from largest toward

smallest principal components of Ypre(Ypre)′ with a stopping rule of

minimizing the upper bound of biases (Hazlett & Xu 2018)

59

Mean Balancing

Assumption 1 (Linearity in Pre-Treatment Outcomes)
E[Yit(0) | Yi,pre] = (1, Yi,pre)′ θt,  T0 < t ≤ T

Assumption 2 (Conditional Ignorability)
Yit(0) ⊥ Gi | Yi,pre,  ∀ t > T0

Assumption 3 (Feasibility)
(1/Ntr) Σ_{Gi=1} Yit = Σ_{Gi=0} wi Yit, t ≤ T0,  with wi ≥ 0 and Σ_{Gi=0} wi = 1

Under the above assumptions, we have:

Proposition 4 (Unbiasedness)
E[ATT̂_t | Ypre] = ATTt,  ∀ t > T0

60

Limitations of Mean Balance

1. Weights that achieve mean balancing can leave treated and control

different on non-linear functions of Ypre

2. Mean balancing limited by number of pre-treatment points

• Few pre-treatment periods = few constraints unless you make more

• With enough periods, anything that matters to Y(0) will appear in Ypre and can be balanced on; with fewer periods, there are no guarantees

61

Trajectory Balancing

• Mean balancing: a simple approach to get the same average pre-treatment trajectory for weighted controls as for the treated:
  (1/Ntr) Σ_{i∈T} Yi,pre = Σ_{j∈C} wj Yj,pre
• Trajectory balancing: apply a feature mapping Yi,pre ↦ φ(Yi,pre), then balance:
  (1/Ntr) Σ_{i∈T} φ(Yi,pre) = Σ_{j∈C} wj φ(Yj,pre)
• That is, get mean balance on a feature expansion instead:
  φ(Yi,pre, Xi),  φ : R^P ↦ R^{P′}

62

Choice of φ()

A good choice of φ() is one that:

• requires little or no user discretion

• includes all continuous functions (at the limit)

• perhaps, prioritizes low frequency, smoother functions

• allows covariates to play a role

Implementation: Gaussian kernel then approximation via principal

components

63

Implementation

• Instead of Ypre, form the kernel matrix Ki,j = k([Yi, Xi], [Yj, Xj]) = exp(−‖[Yi, Xi] − [Yj, Xj]‖²/h)
• This replaces each unit's [Yi,pre, Xi] with a vector ki encoding how similar observation i is to observations 1, 2, ...

• SVD this matrix to obtain components/ eigenvectors

• Choose weights to get mean balance on these, starting from largest

• We choose the number of principal components to include by minimizing the

upper bound of bias in the ATT estimates

64
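A minimal sketch of this kernel step (illustrative only), assuming Z stacks each unit's pre-treatment outcomes and covariates row-wise and the bandwidth h is a user choice; the weight search and the bias-bound stopping rule are omitted.

```python
# Gaussian kernel matrix and its leading principal components, to be mean-balanced on.
import numpy as np

def kernel_features(Z: np.ndarray, h: float, n_components: int):
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    K = np.exp(-sq / h)                                   # Gaussian kernel matrix
    U, s, _ = np.linalg.svd(K)                            # principal components of K
    return U[:, :n_components] * s[:n_components]         # features to balance on
```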

When Averages Fail and φ()’s Thrive

Intuition: mean balancing is okay but may emphasize “wrong” features

of the pre-treatment trend

• Trajectory balancing gets you similarity of whole trajectories rather

than just equal means at each time point → balance on

“higher-order” features such as variance, curvature, etc.

• Approximately, trajectory balancing gets multivariate distribution of

Ypre for the controls equal to that of the treated, whereas mean

balancing only gets marginals equal

• This can matter when non-linear functions of Ypre are confounders, especially when T0 is short

65

When Mean Balancing Fails: A Severe Example

• N = 200 countries with simulated GDP over 24 years, t ∈ {1, 2, ..., 24}
• Two "types" of countries:

Volatile with no growth:
GDPit = 5 + ai sin(0.2πt) + bi cos(0.2πt) + 0.1 εit,  εit ~ N(0, 1), ai, bi ~ U(−1, 1)

Steadily growing:
GDPit = 4 + ci · 1.03^t + 0.1 εit,  εit ~ N(0, 1), ci ~ U(0.9, 1.1)

• A randomly selected 25% of the stable type take the treatment.

66

When Mean Balancing Fails: A Severe Example

[Figure: Simulated GDP trajectories of control and treated units (top), and pre-treatment trajectories of the heavily weighted control units under mean balancing vs. kernel balancing, with treated and weighted-control averages (bottom).]

67

When Mean Balancing Fails: A Severe Example

68

What Information is Encoded in the Kernel Matrix?

[Figure: First principal component of K plotted against the log variance of pre-treatment outcomes, for treated units, all controls, and weighted controls.]

69

Two Extensions

• Incorporating covariates

Assumption 5 (Linearity in φ(Xi, Yi,pre))
E[Yit(0) | Xi, Yi,pre] = φ(Xi, Yi,pre)′ θt,  T0 < t ≤ T

• Intercept shift and demeaning

Assumption 6 (Parallel Trends)
E[Yit(0) − Yis(0) | Yi,pre] = E[Yit(0) − Yis(0) | Yi,pre, Gi],  ∀ t, s ∈ {1, 2, ..., T}

70

Truex (2014): Return to office in China’s Parliament

• Treatment: CEO taking a seat in the

National People’s Congress (NPC)

Outcome: Return on assets (ROA)

• 48 treated firms, 984 controls

Pre-treatment: 2005-2007

Post-treatment: 2008-2010

• Two covariates: state ownership,

revenue in 2007

• Balancing on: roa2005, roa2006,

roa2007, so_portion, rev2007 (and higher-order terms through a kernel transformation)

[Figure: Treatment status plot, firms by year (2005-2010).]

71

Balance on Pre-treatment Outcome Trajectories

[Figure: Pre-treatment return-on-assets trajectories (2005-2007) for the treated firms, the top 25% most heavily weighted controls under mean balancing, and the top 25% under kernel balancing.]

72

Balance Check

[Figure: Balance checks on roa2005, roa2006, roa2007, so_portion, and rev2007: differences in means and relative differences in variances, (Var_co − Var_tr)/Var_tr, for unweighted data, mean balancing, and kernel balancing.]

73

Truex (2014): Main Results

[Figure: Estimated differences in means in return on assets by year (2005-2010) under mean balancing and kernel balancing ("NPC Membership and Return on Assets").]

74

Summary

• Removing time-invariant confounders is costly, i.e., no carryover effect and no feedback from past Y to current D (Kim & Imai 2018)

• Sequential exogeneity may be more desirable than strict exogeneity

• Panel non-parametric and semi-parametric methods have gone a

long way

• Things quickly get more complex when the number of different

treatment histories grows

• Inference is hard especially when there are only a small number of

treated units

75

Roadmap

• DiD and synthetic control
• FE/DiD assumptions
• New estimators
  • Matching and reweighting
  • Outcome models
  • Diagnostic tools
• Hybrid methods
• Conclusions

Outcome Models

Basic Idea

• In a panel setting, treat the treated observations' Y(0) as missing data
• Predict Y(0) based on an outcome model
• (Use pre-treatment data for model selection)
• Estimate the ATT by averaging differences between Y(1) and the predicted Y(0)

[Figure: Treatment status plot under staggered adoption, units by time periods 1-35.]

76

Basic Idea

ATT_s = E[τit | Di,t−s = 0, Di,t−s+1 = ... = Dit = 1 (treated for s consecutive periods), ∀ i ∈ T]

[Figure: A 10-unit by 10-period treatment status matrix, with each cell labeled by its time relative to the onset of treatment (staggered adoption).]

77

A New Plot for “Dynamic Treatment Effects”

[Figure: Estimated effect on Y by time relative to the treatment, showing no "pre-trend".]

78

A New Plot for “Dynamic Treatment Effects”

[Figure: Estimated effect on Y by time relative to the treatment, showing a "pre-trend".]

79

Model-based Counterfactual Estimators

A model-based counterfactual estimator proceeds in the following steps:

• Step 1. Train the model using observations under the control

condition (Dit = 0).

• Step 2. Predict the counterfactual outcome Yit(0) for each

observation under the treatment condition (Dit = 1) and obtain an

estimate of the individual treatment effect: τit = Yit − Yit(0).

• Step 3. Generate estimates for the causal quantities of interest

ATT = E[τit |Dit = 1,∀i ∈ T ,∀t], or

ATTs = E[τit |Di,t−s = 0,Di,t−s+1 = Di,t−s+2 = · · · = Dit = 1︸ ︷︷ ︸s periods

,∀i ∈ T ].

80
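A minimal sketch of these three steps that is agnostic about the outcome model; fit_model is a hypothetical user-supplied function (not from any package) that learns Y(0) from the control cells and returns an object with a predict() method (e.g. a two-way FE, IFE, or MC fit).

```python
# Generic model-based counterfactual estimator: train on controls, impute Y(0), average.
import numpy as np

def counterfactual_att(Y, D, fit_model):
    ctrl = (D == 0)
    model = fit_model(Y, mask=ctrl)          # Step 1: train on D_it = 0 cells (hypothetical API)
    Y0_hat = model.predict()                 # Step 2: impute Y_it(0) for every cell
    tau = Y - Y0_hat                         # individual effect estimates
    return tau[D == 1].mean()                # Step 3: ATT over treated observations
```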

Review of Three Estimators

We review three estimation strategies:

• FEct:
  Yit(0) = X′it β + αi + ξt
• IFEct (Gobillon & Magnac 2016; Xu 2017):
  Yit(0) = X′it β + λ′i ft
• Matrix Completion (MC) (Athey et al. 2018):
  Yit(0) = X′it β + Lit,
  where the matrix {Lit}N×T is a lower-rank approximation of {Yit(0)}N×T with missing values

Remarks:
• DiD is a special case of FEct
• Both IFEct and MC are estimated via iterative algorithms
• Cross-validation is used to choose the tuning parameter

81

IFEct

Xu (2017) proposes a three-step approach based on a latent factor model:

Control:  Yit(0) = X′it β + αi + ξt + λ′i ft + εit
Treated:  Yit(0) = X′it β + αi + ξt + λ′i ft + εit  (pre-treatment)
          Yit(1) = X′it β + αi + ξt + λ′i ft + εit + τit  (post-treatment)

1. Expectation-Maximization (Gobillon & Magnac 2016)

[Figure: Schematic of the data with an r × T matrix of factors; treated (pre), treated (post), and control observations over 30 time periods.]

82

Election Day Registration (EDR) and Voter Turnout

Causal inference is a missing data problem.

[Figure: Treatment status plot, U.S. states by presidential election year (1920-2012), showing treated (pre), treated (post), and control state-years relative to the EDR reform.]

83

Main Results

[Figure: Estimated effect of EDR on turnout (%) by time relative to the treatment, FEct and IFEct.]

84

Factors and Factor Loadings

[Figure: The two estimated factors over time (1920-2016) and the corresponding factor loadings for each state.]

85

Example: Cadre Visits on Loans

[Figure: GSC estimates of the effect of cadre visits by quarter before/after the visit, with 95% confidence intervals.]

86

Example: Property Rights and Land Improvement (Sanford 2019)

• Do property rights lead to improved land quality?

• “Experiment”: giving peasants in Borgou, Benin land titles

• Use satellite (remote sensing) data to measure land improvement,

i.e., switch from annual crops to perennial crops (bushes and trees)

• Use IFEct to construct counterfactuals

87

Original Satellite Images

88

Treated and Counterfactual Averages

Pre-treatment Outcome

89

Treated and Counterfactual Averages

Post-treatment Outcome (1 Year)

90

Treated and Counterfactual Averages

Post-treatment Outcome (4 Years)

91

Factors and Loadings

92

Geographic Distribution of Heavily-weighted Controls

A bigger number represents higher dissimilarity

93

Matrix Completion Methods

[Figure: EDR treatment status plot for a subset of states, showing treated states before/after EDR and control states.]

• Recall that our main goal is to predict treated counterfactuals
• Taking advantage of the matrix structure, matrix completion methods use non-treated data to achieve this goal
• The basic idea is to find a lower-rank representation of the matrix to impute the "missing data"
• Xu (2017) is a special case of this approach

94

Matrix Completion Methods

• Recall the baseline DiD setup:

  Y = ( Y(0)_T,pre     ?
        Y(0)_C,pre     Y(0)_C,post )

• Matrix completion (MC) methods attempt to find a lower-rank representation of Y, which we call L, that predicts the missing values in Y
• Athey et al. (2018) generalize Xu (2017) with different ways of constructing L
• Plus, missingness can be arbitrary → accommodates reversible treatments (note: strict exogeneity is still required)

95

Matrix Completion Methods

• Mathematically,
  Yit = Lit + αi + ξt + X′it β + εit,
  in which Lit is an element of L, an (N × T) matrix
• We need regularization on L because there are too many parameters:
  min_L (1/#Controls) Σ_{Dit=0} (Yit − Lit)² + λL ‖L‖*
• The nuclear norm ‖·‖* generally leads to a low-rank solution for L:
  ‖L‖* = Σ_{i=1}^{min(N,T)} σi(L),
  in which σi(L) are the singular values of L

96
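A minimal sketch (covariates and additive fixed effects omitted, names illustrative) of how the nuclear-norm penalty is typically handled in practice: iteratively impute the treated ("missing") cells with a soft-thresholded SVD of the completed matrix.

```python
# Soft-impute style matrix completion: soft-threshold singular values by lambda.
import numpy as np

def soft_impute(Y, treated_mask, lam, n_iter=200):
    L = np.where(treated_mask, 0.0, Y)                # start with zeros in missing cells
    for _ in range(n_iter):
        Z = np.where(treated_mask, L, Y)              # keep observed entries from Y
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - lam, 0.0)                  # soft-threshold singular values
        L = (U * s) @ Vt                              # low-rank update
    return L                                          # imputed Y(0) for treated cells
```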

IFEct vs. MC

• Singular value decomposition of L:  L_{N×T} = S_{N×N} Σ_{N×T} R_{T×T}
• The two approaches differ in how Σ_{N×T} is regularized:

  IFE (best subset): keep the r largest singular values σ1, ..., σr on the diagonal and set all others to zero

  MC (nuclear norm): soft-threshold every singular value, replacing σi with |σi − λL|+ on the diagonal, in which |a|+ = max(a, 0)

97

IFEct vs. MC

The pros and cons of IFEct and MC:

• IFEct works better with a small number of strong factors

• MC works better with a large number of weak factors

[Figure: MSPE by the number of factors in the DGP (1-9), for IFEct with known r, IFEct with cross-validated r, and MC, with the variance of each factor indicated.]

98

Over-fitting of IFEct and MC

When the true DGP is a two-factor IFE model:

[Figure: Pre-treatment sigma, SD(ATT), Bias(ATT), and RMSE(ATT) for IFEct as the number of factors in the model varies (0-6), and for MC as the tuning parameter varies.]

99

Inferential Methods

• Non-parametric block bootstrap
  • sample with replacement across units
  • valid when N is large and Ntr/N is fixed
• A permutation-based test for sharp nulls (Chernozhukov et al. 2019)
  • e.g. Yit(1) = Yit(0), ∀ i ∈ T, t > T0i
  • randomization over time (by blocks) instead of across units
  • valid if T is large, errors are stationary and weakly dependent, and estimators are consistent or stable
  • exact if errors are i.i.d.

100

Algorithm for Testing the Sharp Null

Based on Chernozhukov et al (2019):

• Assume the sharp null is true: H0: Yit(0) = Yit(1), ∀ i ∈ T, t.
• Denote Zt = (Y1t, Y2t, ..., YNt, X1t, X2t, ..., XNt)′.
• An i.i.d. block permutation πh is a one-to-one mapping πh: {b1, ..., bK} ↦ {b1, ..., bK}, in which {b1, ..., bK} is a partition of {1, ..., T}.

101

Algorithm for Testing the Sharp Null

Step 1. Using a counterfactual estimator on the real data, calculate the test statistic S(u) = √|O| · |ATT̂|, in which |O| is the number of treated observations.

Step 2. Generate a random i.i.d. block permutation πh; reorganize the data (over the time dimension) based on πh while fixing the treatment assignment matrix.

Step 3. Re-estimate the test statistic on the permuted data, obtaining S(u_πh), using the same method as in Step 1.

Step 4. Repeat Steps 2-3 H times, obtaining {S(u_π1), ..., S(u_πH)}.

Step 5. Calculate the p-value: p = 1 − F(S(u)), in which
F(x) = (1/H) Σ_{h=1}^{H} 1{S(u_πh) < x}

102
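A minimal sketch of Steps 1-5 (illustrative only), assuming estimate_att is any counterfactual estimator taking (Y, D) and returning the ATT; covariates are omitted and the partition is taken to be contiguous time blocks of equal size.

```python
# Block-permutation test of the sharp null: permute time blocks, hold D fixed.
import numpy as np

def sharp_null_pvalue(Y, D, estimate_att, block_size, H=500, seed=0):
    rng = np.random.default_rng(seed)
    T = Y.shape[1]
    n_treated_obs = int((D == 1).sum())
    stat = np.sqrt(n_treated_obs) * abs(estimate_att(Y, D))      # S(u) on real data

    blocks = [np.arange(b, min(b + block_size, T)) for b in range(0, T, block_size)]
    perm_stats = []
    for _ in range(H):
        order = np.concatenate([blocks[k] for k in rng.permutation(len(blocks))])
        Y_perm = Y[:, order]                                      # permute time blocks of the data
        perm_stats.append(np.sqrt(n_treated_obs) * abs(estimate_att(Y_perm, D)))
    return float(np.mean(np.array(perm_stats) >= stat))           # p = 1 - F(S(u))
```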

Diagnostic Tests

A Simulated Example

Data Generating Process:

• T = 35, N = 200

• Outcome model: a linear interactive fixed effect model with two factors: one drift

process and one white noise.

Yit = τitDit + 5 + 1 · Xit,1 + 3 · Xit,2 + λi1 · f1t + λi2 · f2t + αi + ξt + εit

• Treatment assignment: staggered adoption with T0i correlated with the additive and interactive fixed effects.

• Treatment effects: τi,t>T0i = 0.2(t − T0i) + eit

[Figure: Treatment status plot under staggered adoption, units by time periods 1-35.]

103

Dynamic Treatment Effects

[Figure: Estimated dynamic treatment effects on Y by time relative to the treatment, IFEct.]

104

1. Placebo Test

• Drop S periods before the treatment’s onset, and estimate the

average treatment effect in these periods.

• Test whether the average effect is significant

• Robust to model misspecification, but using only limited information

[Figure: Placebo test with IFEct on the simulated data; placebo p-value: 0.400.]

105

2. Equivalence Test

• Extension to Hartman and Hidalgo (2018) in a TSCS setting

• H0: |ATTt | > τ vs. H1: |ATTt | ≤ τ

• We calculate the maximal possible τ and compare it with a pre-specified threshold: 0.36 × sd(Yit | Dit = 0)

• It has more power when the sample size grows larger, and is more

likely to reject the Null (hence, equivalence holds) when a

confounder is trivial

• Drawback 1: setting the threshold requires user discretion

• Drawback 2: easy to pass when pre-treatment data are used to fit

the model (IFEct and MC)

106

Equivalence Test

[Figure: Equivalence test with MC on the simulated data; Wald p-value: 0.321.]

107

Why Not a Wald test?

• A simpler test: conduct a Wald (F) test on the pre-treatment residual averages
• However, when a small confounder induces a negligible bias relative to the ATT, a Wald test will almost always reject its null when there are enough data

• The Equivalence Test avoids this problem

108

Why Not a Wald Test?

• There exists a time-varying confounder
• We vary its influence on the bias in the ATT

[Figure: Proportion of simulations declared balanced by the Wald test vs. the equivalence test, as the bias grows from 0 to 0.4 SD(Y).]

109

Why Not a Wald Test?

[Figure: Proportion of simulations declared balanced by the Wald test vs. the equivalence test as the number of units N grows (0-1000), for bias = 0.08 SD(Y) and bias = 0.28 SD(Y).]

110

Hainmueller & Hangartner (2019)

Does indirect democracy benefit immigrant minorities?

• Unit of analysis: 1400 Swiss municipalities from 1991-2009

• Treatment: Indirect (vs. direct) democracy

• Outcome: Naturalization rate

[Figure: Treatment status plot, Swiss municipalities by year (1991-2009), direct vs. indirect democracy.]

111

Hainmueller & Hangartner (2019)

[Figure: Placebo test for the effect of indirect democracy on the naturalization rate (%); placebo p-value: 0.404.]

112

Xu (2017): EDR on Turnout

Does Election Day Registration increase turnout?

• Unit of analysis: 47 US states from 1920 to 2012

• Treatment: EDR reform

• Outcome: Turnout

[Figure: EDR treatment status plot, states by presidential election year (1920-2012).]

113

Xu (2017): EDR on Turnout

FEct

[Figure: FEct dynamic treatment effects on turnout (%) (Wald p-value: 0.129) and placebo test (placebo p-value: 0.088), by time relative to the treatment.]

114

Xu (2017): EDR on Turnout

IFEct

[Figure: IFEct dynamic treatment effects on turnout (%) (Wald p-value: 0.728) and placebo test (placebo p-value: 0.316), by time relative to the treatment.]

115

Xu (2017): EDR on Turnout

MC

[Figure: MC dynamic treatment effects on turnout (%) (Wald p-value: 0.644) and placebo test (placebo p-value: 0.094), by time relative to the treatment.]

116

Acemoglu et al. (2019): Democracy on Growth

• Unit of analysis: 184 countries over 51 years (1960-2010)

• Treatment: a dichotomous measure of democracy and autocracy

• Outcome: Log GDP per capita (in 2000 dollars)

Entering Democracy

117

Acemoglu et al. (2019): Democracy on Growth

• Unit of analysis: 184 countries over 51 years (1960-2010)

• Treatment: a dichotomous measure of democracy and autocracy

• Outcome: Log GDP per capita (in 2000 dollars)

[Figure: Effect on log GDP per capita by time relative to entering democracy, FEct (Wald p-value: 0.111), IFEct (Wald p-value: 0.182), and MC (Wald p-value: 0.079).]

Entering Democracy

118

Acemoglu et al. (2019): Democracy on Growth

• Unit of analysis: 184 countries over 51 years (1960-2010)

• Treatment: a dichotomous measure of democracy and autocracy

• Outcome: Log GDP per capita (in 2000 dollars)

[Figure: Effect on log GDP per capita by time relative to exiting democracy, FEct (Wald p-value: 0.890), IFEct (Wald p-value: 0.270), and MC (Wald p-value: 0.620).]

Exiting Democracy

119

Roadmap

• DiD and synthetic control
• FE/DiD assumptions
• New estimators
  • Matching and reweighting
  • Outcome models
  • Diagnostic tools
• Hybrid methods
• Conclusions

Hybrid Methods

Hybrid Methods

• So far, we've surveyed two groups of methods: (1) those constructing balancing weights; (2) those modeling the conditional outcomes

• Combining the two approaches will likely produce doubly robust

estimators

• Some methods we discussed, including semi-parametric DiD, panel

matching, trajectory balancing, are already doing a simple version of

it (balancing plus regression)

• We review two new methods that formally adopt this idea

• Augmented synthetic control (Ben-Michael et al 2018): modeling first

• Synthetic DiD (Arkhangelsky et al. 2019): weighting first

120

Augmented Synthetic Control

• Assuming one treated unit (unit N)

• Procedure

1. Run an outcome model (FEct, IFEct, MC, etc.) and obtain model fit m(Xi )

2. Balance on the residual averages, obtaining weights γi for the controls

3. The treated counterfactual is constructed as:
   Ŷ_N^aug(0) = m̂(XN) + Σ_{i=1}^{N−1} γi (Yi − m̂(Xi))

• The balancing weights take care of the remaining biases from the outcome

model; the estimator is thus doubly robust

• Inference via clustered bootstrap

121
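A minimal sketch of step 3 (illustrative only), assuming Y_post and m_hat are length-N arrays of realized outcomes and outcome-model fits for a single post-treatment period (treated unit last) and gamma holds the balancing weights on the N−1 controls:

```python
# Augmented synthetic control: outcome-model fit plus reweighted control residuals.
import numpy as np

def augmented_counterfactual(Y_post, m_hat, gamma):
    controls = slice(0, len(Y_post) - 1)
    resid = Y_post[controls] - m_hat[controls]   # control residuals from the outcome model
    return m_hat[-1] + gamma @ resid             # imputed Y(0) for the treated unit
```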

Synthetic DiD

• Assuming one treated unit (unit N) and one post-treatment period

(period T ); weights add up to 1

• Procedure:
  1. Estimate a "synthetic control weight" for each control unit:
     ω̂^sc = argmin_ω Σ_{t=1}^{T−1} ( Σ_{i=1}^{N−1} ωi Yit − YNt )²
  2. Estimate a "synthetic control weight" for each time period:
     λ̂^sc = argmin_λ Σ_{i=1}^{N−1} ( Σ_{t=1}^{T−1} λt Yit − YiT )²
  3. Estimate a weighted DiD by minimizing:
     Σ_{i=1}^{N} Σ_{t=1}^{T} (Yit − μ − αi − βt − Xit γ − Dit τ)² ω̂i λ̂t
• If either the SC weights or the outcome model is correct, the causal effect will be identified (doubly robust)
• Inference via jackknife

122
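A minimal sketch of step 3 as a weighted two-way FE regression (illustrative only), assuming the unit and time weights from steps 1-2 have been merged into long-format data as hypothetical columns w_unit and w_time alongside y, d, unit, and time:

```python
# Synthetic DiD step 3: weighted two-way FE regression with weights omega_i * lambda_t.
import statsmodels.formula.api as smf

def sdid_tau(df):
    w = df["w_unit"] * df["w_time"]
    fit = smf.wls("y ~ d + C(unit) + C(time)", data=df, weights=w).fit()
    return fit.params["d"]
```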

Conclusions

Concluding Remarks

• Removing unit fixed effects is costly: it requires strict exogeneity; the alternative is sequential exogeneity
• "Parallel trends" can very well be wrong; when T is large, we can assess the assumption by checking the "pre-trend" using diagnostic tests

• Counterfactual estimators can relax the homogeneity assumption with

little cost

• Both the matching/reweighting approach (dealing with D) and the

model-based approach (dealing with Y) can help with time-varying

confounders with sufficient data

• Be aware of the modeling assumptions when the latter is employed

• A hybrid estimator is doubly robust

123

Practical Recommendations

• Plotting raw data helps us see obvious problems

• Start from conventional estimators (i.e. DiD, FEct) and check the

“pre-trend”

• If they don’t work, try easy fixes, e.g. trimming the data to make

the treated and controls more alike

• If that doesn't work either, we need more complex methods
• Always ask yourself first: "what's the hypothetical experiment?"

124

Packages

• panelView: panel data visualization

• fastplm: fast panel linear fixed effects estimation (coming soon)

• gsynth: IFEct/MC approach with non-reversible treatments

• fect: IFEct/MC methods with diagnostic tests (coming soon)

• tjbal: trajectory balancing

Thank you!

yiqingxu@stanford.edu

http://yiqingxu.org

github.com/xuyiqing

125

References

• Abadie, Alberto, Alexis Diamond, and Jens Hainmueller (2010). "Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program." Journal of the American Statistical Association 105(490): 493–505.

• Imai, Kosuke and In Song Kim (2019). "When Should We Use Unit Fixed Effects Regression Models for Causal Inference with Longitudinal Data?" American Journal of Political Science 63(2): 467–490.

• Abadie, Alberto (2005). "Semiparametric Difference-in-Differences Estimators." Review of Economic Studies 72(1): 1–19.

• Hsiao, Cheng, H. Steve Ching, and Shui Ki Wan (2012). "A Panel Data Approach for Program Evaluation: Measuring the Benefits of Political and Economic Integration of Hong Kong with Mainland China." Journal of Applied Econometrics 27(5): 705–740.

• Doudchenko, Nikolay and Guido Imbens (2016). "Balancing, Regression, Difference-in-Differences and Synthetic Control Methods: A Synthesis." Working Paper, Stanford University.

• Hazlett, Chad and Yiqing Xu (2018). "Trajectory Balancing: A Kernel Method for Causal Inference with Time-Series Cross-Sectional Data." Working Paper, UCLA.

• Xu, Yiqing (2017). "Generalized Synthetic Control Method: Causal Inference with Interactive Fixed Effects Models." Political Analysis 25(1): 57–76.

• Athey, Susan, Mohsen Bayati, Nikolay Doudchenko, Guido Imbens, and Khashayar Khosravi (2017). "Matrix Completion Methods for Causal Panel Data Models." Working Paper, Stanford University.

• Liu, Licheng, Ye Wang, and Yiqing Xu (2019). "A Practical Guide to Counterfactual Estimators for Causal Inference with Time-Series Cross-Sectional Data." Working Paper, Stanford University.

• Ben-Michael, Eli, Avi Feller, and Jesse Rothstein (2018). "The Augmented Synthetic Control Method." Working Paper, UC Berkeley.

• Arkhangelsky, Dmitry, Susan Athey, David A. Hirshberg, Guido W. Imbens, and Stefan Wager (2019). "Synthetic Difference-in-Differences." Working Paper, UC Berkeley.

126
