
Page 1: Chernozhukov et al. on Double / Debiased Machine Learning

Chernozhukov et al. on Double / Debiased Machine Learning

Slides by Chris Felton, Princeton University, Sociology Department

Sociology Statistics Reading Group, November 2018

Page 2: Chernozhukov et al. on Double / Debiased Machine Learning

Papers

Background:

Robinson, Peter. 1988. “Root-N-Consistent Semiparametric Regression,” Econometrica 56(4):931-954.

Main paper:

Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. “Double/Debiased Machine Learning for Treatment and Structural Parameters,” Econometrics Journal 21(1):1-68.

Page 5: Chernozhukov et al. on Double / Debiased Machine Learning

Why This Paper?

Provides a general framework for estimating treatment effects using machine learning (ML) methods

In particular, we can use any (preferably n^(1/4)-consistent) ML estimator with this approach

Enables us to construct valid confidence intervals for our treatment effect estimates

Introduces a √n-consistent estimator

As n → ∞, the estimation error θ̂ − θ goes to zero at a rate of n^(-1/2) (or 1/√n)

We really like our estimators to be at least √n-consistent

1/n^(1/2) will approach 0 more quickly than, e.g., 1/n^(1/4) as n grows
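To make the rate comparison concrete, here is a quick numeric illustration (my addition, not from the slides):

```python
# 1/sqrt(n) shrinks much faster than 1/n^(1/4) as the sample size grows.
for n in [100, 10_000, 1_000_000]:
    print(f"n = {n:>9,}   1/sqrt(n) = {n ** -0.5:.4f}   1/n^(1/4) = {n ** -0.25:.4f}")
```

For n = 10,000, for example, 1/√n = 0.01 while 1/n^(1/4) = 0.1.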

Page 13: Chernozhukov et al. on Double / Debiased Machine Learning

Why Use ML for Causal Inference, Anyway?

In observational studies, we often estimate causal effects by conditioning on confounders

We typically condition on confounders by making strong assumptions about the functional form of our model

E.g., a standard OLS model assumes a linear and additive conditional expectation function

If we misspecify the functional form, we will end up with biased estimates of treatment effects even in the absence of unmeasured confounding

Our parametric specifications often lack strong substantive justification

ML provides a systematic framework for learning the form of the conditional expectation function from the data

Page 20: Chernozhukov et al. on Double / Debiased Machine Learning

Why Use ML for Causal Inference, Anyway?

We sometimes find ourselves working with high-dimensional data

I.e., we have many covariates p relative to the number of observations n

Two types of high-dimensionality:

1 We simply have many measured covariates (e.g., text data, genetic data)

2 We have few measured covariates but wish to generate many non-linear transformations of and interactions between these covariates

ML models perform much better in high dimensions than traditional statistical models do

Page 27: Chernozhukov et al. on Double / Debiased Machine Learning

Why Use ML for Causal Inference, Anyway?

1 ML allows us to do causal inference with minimal assumptions about the functional form of our model

Warning: ML does not help us relax identification assumptions (e.g., no unmeasured confounding, parallel trends, exclusion restriction, etc.)

2 ML allows us to do causal inference with high-dimensional data

Page 31: Chernozhukov et al. on Double / Debiased Machine Learning

Why Using ML for Causal Inference is Tricky

ML methods were designed for prediction

But off-the-shelf ML methods are biased estimators for treatment effects

To minimize MSE = bias² + variance, we accept some bias in exchange for lower variance

Consistent ML methods converge more slowly than 1/√n

Off-the-shelf methods also fail to provide confidence intervals for our treatment effect estimates
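For reference, here is the standard decomposition behind “MSE = bias² + variance,” written out for a generic estimator θ̂ of θ (my addition, not a slide from the deck):

```latex
\mathrm{MSE}(\hat\theta)
  = \mathbb{E}\big[(\hat\theta - \theta)^2\big]
  = \underbrace{\big(\mathbb{E}[\hat\theta] - \theta\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat\theta - \mathbb{E}[\hat\theta])^2\big]}_{\text{variance}}
```

Regularization lowers the variance term at the cost of a non-zero bias term.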

Page 37: Chernozhukov et al. on Double / Debiased Machine Learning

Key Aims of Double / Debiased Machine Learning (DML)

1 Eliminate the bias

2 Achieve √n-consistency

3 Construct valid confidence intervals

Page 41: Chernozhukov et al. on Double / Debiased Machine Learning

Outline of Presentation

1 Introduce partially linear model set-up

2 Explain two sources of estimation bias from ML and how we overcome them

Correct bias from regularization with Neyman orthogonality

We will see that we can achieve Neyman orthogonality using a residuals-on-residuals approach reminiscent of Robinson (1988) and the Frisch–Waugh–Lovell theorem

Correct bias from overfitting using sample-splitting

Employ cross-fitting to avoid the loss of efficiency that normally comes with sample-splitting

3 Outline a procedure for conducting inference with DML

4 Examine estimators for the ATE and variance that go beyond the partially linear model set-up

Page 50: Chernozhukov et al. on Double / Debiased Machine Learning

Don’t worry if none of that makes sense yet! It’s not as complicated as it sounds, and we will work through it slowly.

Page 52: Chernozhukov et al. on Double / Debiased Machine Learning

The Partially Linear Model Set-up

Page 58: Chernozhukov et al. on Double / Debiased Machine Learning

The Partially Linear Model Set-up

Y = Dθ0 + g0(X) + U

D = m0(X) + V

Y: Outcome

D: Treatment

X: Measured confounders

U and V are our error terms

We assume zero conditional mean:

E[U | X, D] = 0    E[V | X] = 0

The Partially Linear Model Set-up

Y = Dθ0 + g0(X ) + U

D = m0(X ) + V

θ0: The true treatment effect“theta-naught”Warning: not necessarily the Average Treatment Effect

Regression gives us a weighted average of individual treatmenteffects where weights are determined by the conditionalvariance of treatment (see Aronow and Samii 2016)

g0(·): some function mapping X to Y ,conditional on D

m0(·): some function mapping X to D

Page 65: Chernozhukov et al. on Double / Debiased Machine Learning

The Partially Linear Model Set-up

Y = Dθ0 + g0(X) + U

D = m0(X) + V

This set-up allows both Y and D to be non-linear and interactive functions of X, in contrast to standard OLS, which assumes a linear and additive model

However, note that this partially linear model assumes that the effect of D on Y is additive and linear

Our confounders can interact with one another, but not with our treatment!

And remember, we’re still making the standard identification assumptions (unconfoundedness conditional on X, positivity, and consistency)
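As a concrete illustration of this set-up, here is a minimal simulation sketch (my own toy example: the particular g0, m0, and θ0 = 0.5 below are made up, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta0 = 5_000, 0.5                               # illustrative true effect

X = rng.normal(size=(n, 3))                          # measured confounders
g0 = np.sin(X[:, 0]) + X[:, 1] * X[:, 2]             # non-linear, interactive in X (made up)
m0 = 1.0 / (1.0 + np.exp(-(X[:, 0] + X[:, 1])))      # treatment depends non-linearly on X (made up)

D = m0 + rng.normal(scale=0.5, size=n)               # D = m0(X) + V
Y = D * theta0 + g0 + rng.normal(size=n)             # Y = D*theta0 + g0(X) + U
```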

Page 69: Chernozhukov et al. on Double / Debiased Machine Learning

The Partially Linear Model Set-up

Y = Dθ0 + g0(X) + U

D = m0(X) + V

If ML is useful because it allows us to relax linearity and additivity, why would we assume linearity and additivity in D?

We’re just using this model for illustration!

We can assume a fully interactive and non-linear model when we actually use DML

But our partially linear model set-up will allow us to better explain how DML works

Page 73: Chernozhukov et al. on Double / Debiased Machine Learning

Where We Are and Where We’re Going

We’ve introduced the partially linear model set-up

Next, we will introduce an intuitive procedure, “the naive approach,” for estimating θ0 with ML assuming a partially linear model

We will show that this estimation procedure is biased and not √n-consistent

Then we’re going to illustrate two sources of this bias and show how DML avoids these two types of bias

Page 78: Chernozhukov et al. on Double / Debiased Machine Learning

Causal Inference with ML: The Naive Approach

The naive approach: estimate Y = Dθ̂0 + ĝ0(X) + U using ML

θ̂0: our naive estimate of θ0, “theta-naught-hat”

How might we estimate θ0 and g0(X)?

Remember, m0(X) = E[D | X], but g0(X) ≠ E[Y | X]

That’s because Dθ0 is also included in this model

In the paper, the authors use ℓ0(X) for E[Y | X]

To estimate both θ0 and g0(X), we could use an iterative method that alternates between using random forest for estimating g0(X) and OLS for estimating θ0 (see the sketch below)

Alternatively, we could generate many non-linear transformations of the covariates in X as well as interactions between these covariates and use LASSO to estimate the model
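A minimal sketch of that iterative random-forest-plus-OLS idea (my illustration under the partially linear model, not code from the paper); it assumes arrays Y, D, X like those simulated earlier:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def naive_plm_estimate(Y, D, X, n_iter=10):
    """Naive (biased) estimate of theta0: alternate between fitting g0 with a
    random forest and updating theta0 with a no-intercept OLS step."""
    theta_hat = 0.0
    for _ in range(n_iter):
        # Fit g0_hat(X) to the current outcome residual Y - D * theta_hat
        rf = RandomForestRegressor(n_estimators=200, min_samples_leaf=20, random_state=0)
        rf.fit(X, Y - D * theta_hat)
        g_hat = rf.predict(X)
        # OLS step: regress Y - g0_hat(X) on D (no intercept)
        theta_hat = np.dot(D, Y - g_hat) / np.dot(D, D)
    return theta_hat

theta_naive = naive_plm_estimate(Y, D, X)  # biased in general (see the next slides)
```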

Page 87: Chernozhukov et al. on Double / Debiased Machine Learning

Two Sources of Bias in Our Naive Estimator

1 Bias from regularization

To avoid overfitting the data with complex functional forms, ML algorithms use regularization

This decreases the variance of the estimator and reduces overfitting...

...but introduces bias and prevents √n-consistency

2 Bias from overfitting

Sometimes our efforts to regularize fail to prevent overfitting

Overfitting: mistaking noise for signal

More carefully, we overfit when we model the idiosyncrasies of our particular sample too closely, which may lead to poor out-of-sample performance

Overfitting → bias and slow convergence

Page 97: Chernozhukov et al. on Double / Debiased Machine Learning

Two Sources of Bias in Our Naive Estimator

For clarity, we will isolate each type of bias

When we look at regularization bias, we will assume we have used sample-splitting to avoid bias from overfitting

When we look at bias from overfitting, we will assume we have used orthogonalization to prevent regularization bias

We’ll explain how sample-splitting and orthogonalization work soon

Page 102: Chernozhukov et al. on Double / Debiased Machine Learning

Regularization Bias

Let’s start by looking at the scaled estimation error in θ̂0 when we use sample-splitting without orthogonalization

Page 103: Chernozhukov et al. on Double / Debiased Machine Learning

Regularization Bias

$$
\sqrt{n}\,(\hat{\theta}_0 - \theta_0)
= \underbrace{\left(\frac{1}{n}\sum_{i \in I} D_i^2\right)^{-1} \frac{1}{\sqrt{n}} \sum_{i \in I} D_i U_i}_{:=\,a}
\;+\; \underbrace{\left(\frac{1}{n}\sum_{i \in I} D_i^2\right)^{-1} \frac{1}{\sqrt{n}} \sum_{i \in I} D_i \big(g_0(X_i) - \hat{g}_0(X_i)\big)}_{:=\,b}
$$

This looks scary, so let’s take it one term at a time

√n(θ̂0 − θ0) represents our scaled estimation error

If we want consistency, we want this estimation error to go to zero

a ⇝ N(0, Σ). Great!

b is the sum of terms that do not have mean zero, divided by √n

Specifically, g0(Xi) − ĝ0(Xi) will not have mean zero because ĝ0 is biased

b goes to zero too slowly, if at all, for our estimator to be √n-consistent!
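A heuristic way to see why b is the troublesome term (my gloss of the paper’s rate argument): each summand Di(g0(Xi) − ĝ0(Xi)) has a non-zero mean whose size is governed by how fast ĝ0 converges. If ĝ0 converges at rate n^(-φ) with φ < 1/2, then, ignoring the Op(1) factor in front,

```latex
b \;\sim\; \frac{1}{\sqrt{n}} \cdot n \cdot O\!\left(n^{-\varphi}\right)
  \;=\; O\!\left(n^{1/2 - \varphi}\right),
```

which does not vanish (and typically diverges), so √n(θ̂0 − θ0) is not centered at zero.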

Page 111: Chernozhukov et al. on Double / Debiased Machine Learning

Causal Inference with ML using Orthogonalization

To overcome this regularization bias, let’s use orthogonalization

Instead of fitting one ML model, we fit two:

1 Estimate D = m0(X) + V, our treatment model
2 Estimate Y = Dθ0 + g0(X) + U as we do in the naive approach, our outcome model
3 Regress Y − ĝ0(X) on V̂

The resulting θ̌0 (“theta-naught-check”) is free of regularization bias!

We can call this a “partialling-out” approach because we have partialled out the associations between X and D and between Y and X (conditional on D)

Page 119: Chernozhukov et al. on Double / Debiased Machine Learning

Causal Inference with ML using Orthogonalization

In notation:

\[
\check\theta_0 = \Big(\tfrac{1}{n}\textstyle\sum_{i \in I} \hat V_i D_i\Big)^{-1}
  \tfrac{1}{n} \textstyle\sum_{i \in I} \hat V_i \big(Y_i - \hat g_0(X_i)\big)
\]

Look familiar?

What about now?

\[
\hat\beta_{IV} = (Z'D)^{-1} Z'y
\]

It’s very similar to our standard linear instrumental variable estimator, two-stage least squares!
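
To make the analogy explicit (a gloss I am adding, not a formula from the paper): the treatment residual V̂ plays the role of the instrument Z, and the partialled-out outcome Y − ĝ0(X) plays the role of y, so the two expressions line up term by term (the 1/n factors cancel):

\[
\check\theta_0 = \Big(\textstyle\sum_{i \in I} \hat V_i D_i\Big)^{-1} \textstyle\sum_{i \in I} \hat V_i \big(Y_i - \hat g_0(X_i)\big)
\;\;\longleftrightarrow\;\;
\hat\beta_{IV} = (Z'D)^{-1} Z'y,
\qquad Z_i \leftrightarrow \hat V_i, \;\; y_i \leftrightarrow Y_i - \hat g_0(X_i)
\]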

Page 123: Chernozhukov et al. on Double / Debiased Machine Learning

How Orthogonalization De-biases

Remember b from the scaled estimation error equation? Now we have b∗:

\[
b^* = \big(E[V^2]\big)^{-1} \tfrac{1}{\sqrt{n}} \textstyle\sum_{i \in I}
  \underbrace{\big(m_0(X_i) - \hat m_0(X_i)\big)}_{\hat m_0 \text{ estimation error}}
  \underbrace{\big(g_0(X_i) - \hat g_0(X_i)\big)}_{\hat g_0 \text{ estimation error}}
\]

Because this term is based on the product of two estimation errors, it vanishes more quickly

If ĝ0 and m̂0 are each n^(1/4)-consistent, θ̌0 will be √n-consistent

To see why, just note that n^(1/4) × n^(1/4) = n^(1/2)
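
A back-of-the-envelope version of that rate argument (heuristic, treating (1/√n) Σ over i ∈ I as √n times a typical term and ignoring constants):

\[
|b^*| \;\approx\; \sqrt{n}\;\cdot\;\underbrace{n^{-1/4}}_{\text{error in }\hat m_0}\;\cdot\;\underbrace{n^{-1/4}}_{\text{error in }\hat g_0}
\;=\; \sqrt{n}\cdot n^{-1/2} \;=\; O(1)
\]

so the term stays under control, and it vanishes as soon as the two nuisance rates are even slightly faster than n^(-1/4). In b, by contrast, the √n factor multiplied a single ML estimation error, which is why that term shrank too slowly for √n-consistency.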

Page 129: Chernozhukov et al. on Double / Debiased Machine Learning

A Quick Detour

Chernozhukov et al. use a second partialling-out estimator for partially linear models as well

This estimator is very similar to Robinson’s partialling-out estimator, which is in turn very similar to the Frisch–Waugh–Lovell partialling-out estimator

If you’ve taken an introductory statistics course, you’ve probably learned about the Frisch–Waugh–Lovell theorem

But even if you haven’t (or don’t remember!), reviewing the Frisch–Waugh–Lovell theorem and Robinson can help us build intuition for how DML works

Page 134: Chernozhukov et al. on Double / Debiased Machine Learning

The Frisch–Waugh–Lovell Theorem

Let’s say we want to estimate the following model using OLS:

Y = β0 + β1D + β2X + U

The Frisch–Waugh–Lovell Theorem shows us that we can recover the OLS estimate of β1 using a residuals-on-residuals OLS regression:

1 Regress D on X using OLS
  Let D̂ be the predicted values of D and let the residuals V̂ = D − D̂
2 Regress Y on X using OLS
  Let Ŷ be the predicted values of Y and let the residuals Ŵ = Y − Ŷ
3 Regress Ŵ on V̂ using OLS

The estimated coefficient on V̂ will be the same as the estimated coefficient β1 from regressing Y on D and X using OLS!
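
A quick numerical check of the theorem (a sketch with simulated data that I am adding; the data-generating values and variable names are placeholders, not anything from the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    X = rng.normal(size=n)
    D = 0.8 * X + rng.normal(size=n)
    Y = 1.0 + 2.0 * D + 3.0 * X + rng.normal(size=n)   # beta1 = 2 in the simulation

    def ols(y, Z):
        """OLS coefficients via least squares."""
        return np.linalg.lstsq(Z, y, rcond=None)[0]

    # Full regression of Y on (1, D, X): coefficient on D
    beta1_full = ols(Y, np.column_stack([np.ones(n), D, X]))[1]

    # Residuals-on-residuals version
    Zx = np.column_stack([np.ones(n), X])
    V_hat = D - Zx @ ols(D, Zx)                 # residuals from D ~ X
    W_hat = Y - Zx @ ols(Y, Zx)                 # residuals from Y ~ X
    beta1_fwl = ols(W_hat, V_hat[:, None])[0]   # slope of W_hat on V_hat

    print(beta1_full, beta1_fwl)                # the two estimates agree (up to floating point)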

Page 142: Chernozhukov et al. on Double / Debiased Machine Learning

Robinson

The Frisch–Waugh–Lovell procedure:

1 Linear regression of D on X
2 Linear regression of Y on X
3 Linear regression of the residuals from 2 on the residuals from 1

Robinson’s innovation: let’s replace the linear regressions from 1 and 2 with some non-parametric regression

Robinson’s procedure:

1 Kernel regression of D on X
2 Kernel regression of Y on X
3 Linear regression of the residuals from 2 on the residuals from 1

Page 152: Chernozhukov et al. on Double / Debiased Machine Learning

Another Way to Orthogonalize

DML using residuals-on-residuals regression:

1 Estimate D = m0(X) + V
2 Estimate Y = ℓ0(X) + U
  Note the absence of D and the switch from g0(·) to ℓ0(·), which is essentially E[Y | X]
3 Regress Û on V̂ using OLS for an estimate θ̌0

Page 157: Chernozhukov et al. on Double / Debiased Machine Learning

Another Way to Orthogonalize

Robinson’s procedure:

1 Predict D with X using kernel regression
2 Predict Y with X using kernel regression
3 Linear regression of the residuals from 2 on the residuals from 1

DML residuals-on-residuals procedure:

1 Predict D with X using any n^(1/4)-consistent ML model
2 Predict Y with X using any n^(1/4)-consistent ML model
3 Linear regression of the residuals from 2 on the residuals from 1
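
A minimal sketch of the residuals-on-residuals procedure on simulated data (my illustration, not the authors’ code; random forests stand in for “any sufficiently accurate ML model”, and sample-splitting is deliberately omitted here because it is the topic of the next slides):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    n, p, theta = 2000, 10, 1.0
    X = rng.normal(size=(n, p))
    D = np.sin(X[:, 0]) + X[:, 1] + rng.normal(size=n)                   # D = m0(X) + V
    Y = theta * D + np.cos(X[:, 0]) + X[:, 2] ** 2 + rng.normal(size=n)  # Y = D*theta + g0(X) + U

    # Steps 1 and 2: predict D with X and Y with X, keep the residuals
    V_hat = D - RandomForestRegressor(n_estimators=200, random_state=0).fit(X, D).predict(X)
    U_hat = Y - RandomForestRegressor(n_estimators=200, random_state=0).fit(X, Y).predict(X)

    # Step 3: OLS of the Y-residuals on the D-residuals (no intercept needed)
    theta_check = (V_hat @ U_hat) / (V_hat @ V_hat)
    print(theta_check)   # roughly theta; without sample-splitting it can still be noticeably biased

The same “fit, residualize, regress” steps reappear below with sample-splitting and cross-fitting added, which is what removes the remaining overfitting bias.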

Page 165: Chernozhukov et al. on Double / Debiased Machine Learning

Where We Are and Where We’re Going

We saw that we can eliminate regularization bias using orthogonalization

Now we’re going to show how we can eliminate bias from overfitting using sample-splitting and cross-fitting

Page 168: Chernozhukov et al. on Double / Debiased Machine Learning

Bias from Overfitting

ML algorithms sometimes overfit our data due to their flexibility, and this can lead to bias and slower convergence

Let’s say we estimate θ0 using orthogonalization but without sample-splitting

That is, we fit our machine learning models and estimate our target parameter θ0 on the same set of observations

Our scaled estimation error √n(θ̌0 − θ0) = a∗ + b∗ + c∗

We looked at b∗ before, and we don’t have to worry about a∗

Page 174: Chernozhukov et al. on Double / Debiased Machine Learning

Bias from Overfitting

But c∗ contains terms like this:

\[
\tfrac{1}{\sqrt{n}} \textstyle\sum_{i \in I} V_i \big(g_0(X_i) - \hat g_0(X_i)\big)
\]

V: the error term (not the estimated residuals) from D = m0(X) + V

g0(Xi) − ĝ0(Xi): estimation error in ĝ0 from Y = Dθ0 + g0(X) + U

What happens if we estimate θ0 with the same set of observations we used to fit m̂0(X) and ĝ0(X)?

Page 178: Chernozhukov et al. on Double / Debiased Machine Learning

Bias from Overfitting

But c∗ contains terms like this:

\[
\tfrac{1}{\sqrt{n}} \textstyle\sum_{i \in I} V_i \big(g_0(X_i) - \hat g_0(X_i)\big)
\]

Overfitting: modeling the noise too closely

When ĝ0 is overfit, it will pick up on some of the noise U from the outcome model

If our noise terms V and U are associated, estimation error in ĝ0 from overfitting might be associated with V

Note: g0 and V might also be associated if we have any unmeasured confounding, but we’re assuming that away

Page 183: Chernozhukov et al. on Double / Debiased Machine Learning

How to Avoid Bias from Overfitting

To break the association between V and ĝ0 and avoid bias from overfitting, we employ sample-splitting:

1 Randomly partition our data into two subsets
2 Fit your ML models ĝ0 and m̂0 on the first subset
3 Estimate θ0 in the second subset using the ĝ0 and m̂0 functions we fit in the first subset

Key: We never estimate θ0 using the same observations that we used to fit ĝ0 and m̂0

Page 189: Chernozhukov et al. on Double / Debiased Machine Learning

Cross-Fitting

The standard approach to sample-splitting will reduce efficiency and statistical power

We can avoid the loss of efficiency and power using cross-fitting:

1 Randomly partition your data into two subsets
2 Fit two ML models ĝ0,1 and m̂0,1 in the first subset
3 Estimate θ̌0,1 in the second subset using the ĝ0,1 and m̂0,1 functions we fit in the first subset
4 Fit two ML models ĝ0,2 and m̂0,2 in the second subset
5 Estimate θ̌0,2 in the first subset using the ĝ0,2 and m̂0,2 functions we fit in the second subset
6 Average our two estimates θ̌0,1 and θ̌0,2 for our final estimate θ̌0
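
A sketch of the two-fold version in code (my illustration on simulated data, not the authors’ implementation; the learners and variable names are placeholders). A single call of the helper corresponds to the plain sample-splitting scheme from the previous slide; averaging the two calls is the cross-fitting step:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    n, p, theta = 2000, 10, 1.0
    X = rng.normal(size=(n, p))
    D = np.sin(X[:, 0]) + X[:, 1] + rng.normal(size=n)
    Y = theta * D + np.cos(X[:, 0]) + X[:, 2] ** 2 + rng.normal(size=n)

    def theta_on_holdout(train, est):
        """Fit the two nuisance regressions on `train`, estimate theta on `est`."""
        m_hat = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[train], D[train])
        l_hat = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[train], Y[train])
        V = D[est] - m_hat.predict(X[est])   # treatment residuals on the held-out fold
        U = Y[est] - l_hat.predict(X[est])   # outcome residuals on the held-out fold
        return (V @ U) / (V @ V)             # residuals-on-residuals OLS slope

    idx = rng.permutation(n)
    fold1, fold2 = idx[: n // 2], idx[n // 2 :]

    theta_1 = theta_on_holdout(fold1, fold2)   # fit on fold 1, estimate on fold 2
    theta_2 = theta_on_holdout(fold2, fold1)   # fit on fold 2, estimate on fold 1
    print((theta_1 + theta_2) / 2)             # cross-fit estimate, close to theta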

Page 198: Chernozhukov et al. on Double / Debiased Machine Learning

Where We Are and Where We’re Going

We now (hopefully) have a good intuition for how fitting two ML models allows us to remove bias and achieve faster convergence

Next we’re going to look at how the authors formally define Neyman orthogonality and DML

Page 201: Chernozhukov et al. on Double / Debiased Machine Learning

Defining Neyman Orthogonality

Remember: orthogonalize D with respect to X → eliminate regularization bias

Now we formalize this “Neyman orthogonality” condition

Let η0 = (g0, m0), η = (g, m)

We have a new friend! Her name is η0 (“eta-naught”)

η0 is our nuisance parameter

We don’t really care what g0 and m0 are

We just want to use them to get good estimates of θ0

The exact form of our nuisance parameter isn’t a scientifically substantive quantity of interest

Page 209: Chernozhukov et al. on Double / Debiased Machine Learning

Defining Neyman Orthogonality

Introduce the following score function:

\[
\psi(W; \theta, \eta_0) =
  \underbrace{(D - m_0(X))}_{V}
  \times
  \underbrace{\big(Y - g_0(X) - (D - m_0(X))\theta\big)}_{U}
\]

Looks scary! Let’s break it down

ψ: our score function “psi”

W: our data

Each underbraced term represents a noise term from our partially linear model

Page 215: Chernozhukov et al. on Double / Debiased Machine Learning

Defining Neyman Orthogonality

Introduce the following moment condition:

\[
\psi(W; \theta, \eta_0) = (D - m_0(X)) \times \big(Y - g_0(X) - (D - m_0(X))\theta\big) = 0
\]

So we want our score function to = 0. Why?

Y = Vθ0 + g0(X) + U

Y = (D − m0(X))θ0 + g0(X) + U

V = (D − m0(X)) is our regressor

U = (Y − g0(X) − (D − m0(X))θ) is our error term
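
One extra step that may help (a derivation I am adding, not a line from the slides): take expectations in this condition and solve for θ, using D − m0(X) = V,

\[
E\big[V\,\big(Y - g_0(X) - V\theta\big)\big] = 0
\quad\Longrightarrow\quad
\theta_0 = \frac{E\big[V\,(Y - g_0(X))\big]}{E\big[V^2\big]},
\]

which is the population version of the θ̌0 formula from a few slides back (there the sample average of V̂iDi stands in for E[V²]; the two agree because E[V · m0(X)] = 0).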

Page 221: Chernozhukov et al. on Double / Debiased Machine Learning

Defining Neyman Orthogonality

We’re saying we want our regressor V and our error term U to be orthogonal to one another¹

Two vectors are orthogonal to one another when their dot product equals 0

This moment condition is very similar to saying we want our error to be uncorrelated with our regressor

It is also very similar to (but slightly weaker than) the standard zero conditional mean assumption for OLS

¹Moment conditions like this will look more familiar to people who know Generalized Method of Moments (GMM). My understanding is that while GMM is popular in economics, it’s less common in political science and sociology, which is why I explain what’s going on in a little more depth here.

Page 225: Chernozhukov et al. on Double / Debiased Machine Learning

Defining Neyman Orthogonality

Now we can define Neyman orthogonality!

∂η E[ψ(W; θ0, η0)][η − η0] = 0

In words: the (Gateaux) derivative of the expectation of our score function with respect to our nuisance parameter, taken in the direction η − η0, is 0

Recall that a derivative represents our instantaneous rate of change

Thus, when the derivative = 0, our score function is robust to small perturbations in η0

It doesn’t change much when η0 moves around a little
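
We can verify this directly for the partialling-out score above, writing the nuisance as η = (g, m) and using only E[U | X] = 0 and E[D − m0(X) | X] = 0:

\[
\partial_g E[\psi(W;\theta_0,\eta_0)][g - g_0]
  = -\,E\big[(D - m_0(X))\,(g - g_0)(X)\big] = 0,
\]
\[
\partial_m E[\psi(W;\theta_0,\eta_0)][m - m_0]
  = -\,E\big[(m - m_0)(X)\,U\big] + \theta_0\,E\big[(D - m_0(X))\,(m - m_0)(X)\big] = 0,
\]

since every term has conditional mean zero given X. A score that used D in place of the residualized D − m0(X) would fail the first check, because E[D (g − g0)(X)] is not zero in general.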

Page 232: Chernozhukov et al. on Double / Debiased Machine Learning

Now we’re ready to formally define DML!

Page 233: Chernozhukov et al. on Double / Debiased Machine Learning

Defining DML

1 Take a K-fold random partition (Ik), k = 1, ..., K, of observation indices [N] = {1, ..., N} such that the size of each fold Ik is n = N/K. Also, for each k ∈ [K] = {1, ..., K}, define Ik^c := {1, ..., N} \ Ik

Create K equally sized folds

Ik^c is the complement of Ik: if we have 100 observations and Ik is the set of observations 1–20, then Ik^c is the set of observations 21–100

Page 234: Chernozhukov et al. on Double / Debiased Machine Learning

Defining DML

2 For each k ∈ [K], construct an ML estimator

η̂0,k = η̂0((Wi)i∈Ik^c)

Important: The estimator we use for fold k was fit on Ik^c! This is the sample splitting we talked about earlier that removes bias from overfitting

Page 235: Chernozhukov et al. on Double / Debiased Machine Learning

Defining DML

3 Construct the estimator θ̃0 (“theta-naught-tilde”) as the solution to

(1/K) ∑_{k=1}^{K} En,k[ψ(W; θ̃0, η̂0,k)] = 0

Note that θ̃0 is not indexed by k, but the nuisance parameter η̂0,k is

We’re finding the θ̃0 that sets the average of the scores across all folds to zero, where the scores vary by fold due to η̂0,k

This is a slightly different2 version of the cross-fitting approach we talked about earlier that enables us to do sample splitting without loss of efficiency

2In particular, we are no longer taking the average of K different estimates of θ0 but instead finding the one estimate that sets the average of the K score functions to zero. Chernozhukov et al. recommend this latter approach because it behaves better in smaller samples.
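
Below is a minimal sketch of steps 1–3 for the partially linear score above, assuming scikit-learn-style learners and numpy arrays; the function name dml_plr, the learner arguments, and the 5-fold default are illustrative choices, not from the paper. Following the slides’ notation, g(X) is fit by regressing Y on X and m(X) by regressing D on X.

# Minimal sketch of DML with cross-fitting for the partially linear score
# psi(W; theta, eta) = (D - m(X)) * (Y - g(X) - (D - m(X)) * theta).
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def dml_plr(y, d, X, learner_g, learner_m, n_folds=5, random_state=0):
    # y, d, X: numpy arrays; learner_g, learner_m: unfitted sklearn-style regressors
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=random_state)
    u_hat = np.zeros(len(y))  # out-of-fold residuals Y - g_hat(X)
    v_hat = np.zeros(len(y))  # out-of-fold residuals D - m_hat(X)
    for train_idx, test_idx in kf.split(X):                       # step 1: folds Ik, Ik^c
        g_hat = clone(learner_g).fit(X[train_idx], y[train_idx])  # step 2: fit on Ik^c
        m_hat = clone(learner_m).fit(X[train_idx], d[train_idx])
        u_hat[test_idx] = y[test_idx] - g_hat.predict(X[test_idx])
        v_hat[test_idx] = d[test_idx] - m_hat.predict(X[test_idx])
    # Step 3: solve the pooled score equation sum_i v_i * (u_i - v_i * theta) = 0
    theta = np.sum(v_hat * u_hat) / np.sum(v_hat ** 2)
    # Plug-in standard error from the sandwich formula discussed later in the slides
    psi = v_hat * (u_hat - v_hat * theta)
    J = np.mean(-v_hat ** 2)              # sample analogue of J0 = E[psi_a]
    se = np.sqrt(np.mean(psi ** 2) / J ** 2 / len(y))
    return theta, se

For example, learner_g and learner_m could both be, say, random forests; any learners that are (roughly) n^(1/4)-consistent for their nuisance functions will do.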

Page 236: Chernozhukov et al. on Double / Debiased Machine Learning

The ATE

Even under unconfoundedness, OLS does not (necessarily) give us the ATE

We set up the following moment condition for estimation of the ATE:

ψ(W; θ, η) := (g(1,X) − g(0,X)) + D(Y − g(1,X))/m(X) − (1 − D)(Y − g(0,X))/(1 − m(X)) − θ

Recall the score function has to = 0 (in expectation)

So we’re saying we want the three big terms in the middle to equal θ, on average

Let’s look at them more closely

Page 242: Chernozhukov et al. on Double / Debiased Machine Learning

The ATE

(g(1,X) − g(0,X)) + D(Y − g(1,X))/m(X) − (1 − D)(Y − g(0,X))/(1 − m(X))

The first term, g(1,X) − g(0,X), is the biased treatment effect estimate from the ML models; the remaining two terms are the debiasing terms

g(1,X): predicted outcome given X when D = 1, i.e., predicted outcome for treated units

g(0,X): predicted outcome given X when D = 0, i.e., predicted outcome for control units

Page 245: Chernozhukov et al. on Double / Debiased Machine Learning

The ATE

(g(1,X) − g(0,X)) + D(Y − g(1,X))/m(X) − (1 − D)(Y − g(0,X))/(1 − m(X))

The second term is the residuals for the treated divided by the probability of receiving treatment; the third term is the residuals for the control divided by the probability of receiving control

Recall that we can define the ATE as E[Y(1) − Y(0)] in potential outcomes notation, where Y(1) and Y(0) represent potential outcomes under treatment and control, respectively

When our prediction of Y(1), i.e. g(1,X), is downwardly biased, our ATE estimate will be biased downward

When our prediction of Y(0), i.e. g(0,X), is downwardly biased, our ATE estimate will be biased upward

Page 249: Chernozhukov et al. on Double / Debiased Machine Learning

The ATE

(g(1,X) − g(0,X)) + D(Y − g(1,X))/m(X) − (1 − D)(Y − g(0,X))/(1 − m(X))

That’s why we want to add the residuals for the treated units and subtract the residuals for the control units

To see this, note that, e.g., D(Y − g(1,X)) is, for treated units, the observed outcome under treatment minus the predicted outcome under treatment (and is zero for control units)

Page 252: Chernozhukov et al. on Double / Debiased Machine Learning

The ATE

(g(1,X) − g(0,X)) + D(Y − g(1,X))/m(X) − (1 − D)(Y − g(0,X))/(1 − m(X))

Finally, we weight by the inverse probability of treatment, 1/m(X), for the treated units because units with a high probability of treatment will be overrepresented among the treated units

And we weight by the inverse probability of control, 1/(1 − m(X)), for the control units because units with a high probability of control will be overrepresented among the control units
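
Putting the last few slides together, here is why the debiasing term exactly repairs the bias in g(1, ·) (using consistency DY = DY(1), unconfoundedness, and overlap 0 < m(X) < 1; the control term is symmetric):

\[
E\!\left[\frac{D\,(Y - g(1,X))}{m(X)}\right]
= E\!\left[\frac{E[D \mid X, Y(1)]\,\big(Y(1) - g(1,X)\big)}{m(X)}\right]
= E\big[Y(1) - g(1,X)\big].
\]

So E[g(1,X)] plus the treated debiasing term recovers E[Y(1)] no matter how biased g(1, ·) is, and likewise for Y(0); the expectation of the three middle terms is therefore E[Y(1) − Y(0)], the ATE.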

Page 255: Chernozhukov et al. on Double / Debiased Machine Learning

The ATE

(g(1,X) − g(0,X)) + D(Y − g(1,X))/m(X) − (1 − D)(Y − g(0,X))/(1 − m(X))

Important: be sure to assess common support!

If the probability of treatment or control is very low in some strata of X, the debiasing terms will blow up

Lack of common support → unstable estimates of treatment effects
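
Below is a minimal cross-fitted sketch of the ATE estimator built on this score, again assuming scikit-learn-style learners and numpy arrays; the function name dml_ate and the clipping threshold trim are illustrative (clipping is one simple way to flag and tame poor overlap, not a recommendation from the paper).

# Cross-fitted ATE estimate based on the doubly robust score above.
# learner_g: regressor for E[Y | D, X]; learner_m: classifier with predict_proba for P(D = 1 | X).
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def dml_ate(y, d, X, learner_g, learner_m, n_folds=5, trim=0.01, random_state=0):
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=random_state)
    score = np.zeros(len(y))  # out-of-fold values of the three middle terms
    for train_idx, test_idx in kf.split(X):
        # Fit the outcome model on the complement fold, with D as an extra column of X
        g_hat = clone(learner_g).fit(np.column_stack([d[train_idx], X[train_idx]]), y[train_idx])
        m_hat = clone(learner_m).fit(X[train_idx], d[train_idx])
        X_te, d_te, y_te = X[test_idx], d[test_idx], y[test_idx]
        g1 = g_hat.predict(np.column_stack([np.ones(len(test_idx)), X_te]))   # g(1, X)
        g0 = g_hat.predict(np.column_stack([np.zeros(len(test_idx)), X_te]))  # g(0, X)
        m = np.clip(m_hat.predict_proba(X_te)[:, 1], trim, 1 - trim)          # clipped propensity
        score[test_idx] = (g1 - g0
                           + d_te * (y_te - g1) / m
                           - (1 - d_te) * (y_te - g0) / (1 - m))
    theta = score.mean()                       # averaged score = 0  <=>  theta = mean of the three middle terms
    se = score.std(ddof=1) / np.sqrt(len(y))   # plug-in standard error (see the variance slides that follow)
    return theta, se

In practice, you would also inspect the distribution of the estimated m(X) by treatment status, rather than relying on clipping alone, to judge common support.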

Page 259: Chernozhukov et al. on Double / Debiased Machine Learning

Variance and Confidence Intervals

To get valid confidence intervals, we assume that the score is linear in θ in the following sense:

ψ(w; θ, η) = ψa(w; η) θ + ψb(w; η)  ∀ w ∈ W, θ ∈ Θ, η ∈ T

We use this new term ψa(w; η) to estimate the asymptotic variance of our estimator
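
Both scores in these slides are linear in θ, so ψa and ψb can be read off directly:

\[
\text{Partially linear score:}\quad
\psi_a(W;\eta) = -(D - m(X))^2,\qquad
\psi_b(W;\eta) = (D - m(X))\,(Y - g(X));
\]
\[
\text{ATE score:}\quad
\psi_a(W;\eta) = -1,\qquad
\psi_b(W;\eta) = (g(1,X) - g(0,X)) + \frac{D\,(Y - g(1,X))}{m(X)} - \frac{(1-D)\,(Y - g(0,X))}{1 - m(X)}.
\]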

Page 262: Chernozhukov et al. on Double / Debiased Machine Learning

Variance and Confidence Intervals

We use the following estimator for the asymptotic variance of DML:

σ̂2 = Ĵ0⁻¹ ( (1/K) ∑_{k=1}^{K} En,k[ ψ(W; θ̃0, η̂0,k) ψ(W; θ̃0, η̂0,k)′ ] ) (Ĵ0⁻¹)′

i.e., J0 inverse, times the average outer product of the score function with itself, times the transpose of J0 inverse, where

Ĵ0 = (1/K) ∑_{k=1}^{K} En,k[ψa(W; η̂0,k)]
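
For example, for the ATE score ψa(W; η) = −1, so Ĵ0 = −1 and the sandwich collapses to the average squared score; the usual Wald-type confidence interval is then

\[
\hat\sigma^2 = \frac{1}{K}\sum_{k=1}^{K} E_{n,k}\!\big[\psi(W;\tilde\theta_0,\hat\eta_{0,k})^2\big],
\qquad
\tilde\theta_0 \;\pm\; z_{1-\alpha/2}\,\hat\sigma/\sqrt{N}.
\]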

Page 265: Chernozhukov et al. on Double / Debiased Machine Learning

Wrapping Up

Useful for selection-on-observables with high-dimensional confounding

Avoids strong functional form assumptions

Two sources of bias: regularization and overfitting

Two tools for eliminating bias: orthogonalization and cross-fitting

√n-consistency if both nuisance parameter estimators are n^(1/4)-consistent

Asymptotically valid confidence intervals
