High-Dimensional Sparse Econometric Models, an Introductionice.uchicago.edu/2011_presentations/29_Belloni/Sparse_Lecture_ICE... · High-Dimensional Sparse Econometric Models, an Introduction

High-Dimensional Sparse EconometricModels,

an Introduction

Alexandre Belloni

ICE, July 2011

Alexandre Belloni High-Dimensional Sparse Econometrics

Outline

1. High-Dimensional Sparse Econometric Models (HDSM)◮ Models◮ Motivating Examples for linear/nonparametric regression

2. Estimation Methods for linear/nonparametric regression◮ ℓ1-penalized estimators

◮ formulations◮ penalty parameter

◮ computational methods◮ post-ℓ1-penalized estimators◮ summary of rates of convergence

3. Plenty of model selection applications in Econometrics◮ Results for IV Estimators◮ Results for Quantile Regression◮ Results for Variable Length Markov Chain


High-Dimensional Sparse Models

◮ We have a large number of parameters p, potentially largerthan the sample size n

◮ The idea is that a low dimensional submodel captures veryaccurately all essential features of the full dimensionalmodel.

◮ The key assumption is that the number of relevantparameters is smaller than the sample size

◮ fundamentally, a model selection problem

◮ One question is whether sparse models make sense.◮ Another question is how to compute such models.◮ Another question is what guarantees can we establish.


1. High-Dimensional Sparse Econometric Model

HDSM. A response variable yi obeys

yi = x ′i β0 + ǫi , ǫi ∼ N(0, σ2), i = 1, ..., n

where xi are fixed p-dimensional regressors,

xi = (xij , j = 1, ..., p), En[x2ij ] = 1.

p possibly much larger than n.

The key assumption is sparsity:

s := ‖β0‖0 =

p∑

j=1

1{β0j 6= 0} ≪ n.

This generalizes the standard large parametric econometricmodel by letting the identity

T := support (β0)

of the relevant s regressors be unknown.Alexandre Belloni High-Dimensional Sparse Econometrics

High p motivation

◮ transformations of basic regressors zi ,

xi = (P1(zi ), ..., Pp(zi ))

◮ for example, in wage regressions, Pj ’s are polynomials orB-splines in education and experience.

◮ or simply a very large list of regressors,◮ a list of country characteristics in cross-country growth

regressions (Barro & Lee)◮ housing characteristics in hedonic regressions

(American Housing Survey)◮ price and product characteristics at the point of purchase

(scanner data, TNS).


Sparseness

◮ The key assumption is that the number of non-zeroregression coefficients is smaller than the sample size:

s := ‖β0‖0 =

p∑

j=1

1{β0j 6= 0} ≪ n.

◮ The idea is that a low s-dimensional submodel accuratelyapproximates the full p-dimensional model. Theapproximation error in fact is zero.

◮ Can easily relax this assumption and put in a non-zeroapproximation error.

◮ We can relax the Gaussianity of the error terms ǫi

substantially based on self-normalized moderate deviationtheory results


High-Dimensional Sparse Econometric Model

HDSM′. For yi the response variable and zi the fixedelementary regressors

yi = g(zi) + ǫi , E [ǫi ] = 0, E [ǫ2i ] = σ2, i = 1, ..., n

Consider a p-dimensional dictionary of series terms,

xi = (P1(zi ), ..., Pp(zi ))

for example, B-splines, polynomials, and interactions, where

p is possibly much larger than n.

Expand via generalized series: g(zi) = x ′i β0 + ri ,

◮ x ′i β0 is the best s−dimensional parametric approximation

s := ‖β0‖0 =

p∑

j=1

1{β0j 6= 0} ≪ n.

◮ ri is the approximation error.Alexandre Belloni High-Dimensional Sparse Econometrics

High-Dimensional Sparse Econometric Model

HDSM′. As a result we have an approximately sparse model:

yi = x ′i β0 + ri + ǫi , i = 1, . . . , n

The parametric part x ′i β0 is the target, with s non-zero

coefficients.We choose s to balance the approximation error with the oracleestimation error:

En[r2i ] ∝ σ2 s

n.

◮ This model generalizes the exact sparse model, by lettingin approximation error and non-Gaussian errors.

◮ This model also generalizes the standard series regressionmodel by letting the identity

T = support (β0)

of the most important s series terms be unknown.


Example 1: Series Models of Wage Function

◮ In this example, abstract away from the estimationquestions.

◮ Consider a series expansion of the conditional expectationE [yi |zi ] of wage yi given education zi .

◮ A conventional series approximation to the regressionfunction is, for example,

E [yi |zi ] = β1P1(zi ) + β2P2(zi) + β3P3(zi) + β4P4(zi) + ri

where P1, ..., P4 are first low-order polynomials or splineswith a few knots.


8 10 12 14 16 18 20

6.0

6.5

7.0

education

wage

Traditional Approximation of Expected Wage Function using Polynomials

◮ In the figure, true regression function E [yi |zi ] computedusing U.S. Census data.


◮ With the same number of parameters, can we a find amuch better series approximation?

◮ For this to be possible, some high-order series terms in theexpansion

E [yi |zi ] =

p∑

k=1

β0kPk (zi) + r ′i

must have large coefficients.◮ This indeed happens in our example, since E [yi |zi ] exhibits

sharp oscillatory behavior in some regions, which lowdegree polynomials fail to capture.

◮ And we find the “right” series terms by usingℓ1-penalization methods


8 10 12 14 16 18 20

6.0

6.5

7.0

education

wage

Lasso Approximation of Expected Wage Function using Polynomials and Dummies


8 10 12 14 16 18 20

6.0

6.5

7.0

education

wage

Traditional vs Lasso Approximation of Expected Wage Functions

with Equal Number of Parameters


Errors of Conventional and the lasso-Based SparseApproximations

L2 error L∞ errorConventional Series Approximation 0.135 0.290lasso-Based Series Approximation 0.031 0.063


Example 2: Cross-Country Growth Regression

◮ Relation between growth rate and initial per capita GDP

GrowthRate = α0 + α1 log(GDP) + ǫ

◮ Test the convergence hypothesis, α1 < 0 , poor countriescatch up with richer countries. Prediction from the classicalSolow growth model.

◮ Linear regression only with log(GDP) yields:

GrowthRate = 0.0352 + 0.0013 log(GDP) + ǫ

◮ Test the convergence hypothesis conditional on covaritesdescribing institutions and technological factors:

GrowthRate = α0 + α1 log(GDP) +

p∑

j=1

βjXj + ǫ



In Barro-Lee data, we have p = 60 covariates, n = 90observations. Need to select significant covariates among 60.

Penalization Real GDP per capita (log) is in all modelsParameter

λ Additional Selected Variables2.3662 Black Market Premium (log)1.3935 Black Market Premium (log)

Political Instability0.929 Black Market Premium (log)

Political InstabilityRatio of nominal government expenditure on defense

to nominal GDPRatio of import to GDP



◮ We find a sparse model for conditioning effects using ℓ1-penalization. We re-fit the resulting model using LS.

◮ We find that the coefficient on lagged GDP is negative, andthe confidence intervals, adjusted for model selection,exclude zero.

Coefficient Confidence Setlagged GDP [-0.0261, -0.0007]

◮ Our findings fully support the conclusions reached in Barroand Lee and Barro and Sala-i-Martin.


2. Estimation via ℓ1-penalty methods

The canonical LS which minimizes

Q̂(β) = En[(yi − x ′i β)2]

is not consistent when dim(β) = p ≥ n.We can try minimizing an AIC/BIC type criterion function

Q̂(β) + λ‖β‖0, ‖β‖0 =

p∑

j=1

1{βj 6= 0}

but this is not computationally feasible when p is large.

Essentially a search over too many (non-nested) models.

Technically, it is a combinatorial problem known to be NP-hard.



It was required to balance and understand simultaneously

computational issues︸︷︷︸hard combinatorial search

and statistical issues︸︷︷︸lack of global identification

Understanding that statistical aspects can help computationalmethods and vice-versa was crucial.

Further developments in model selection estimation will relyheavily on computational advances.

(MCMC is an example of a computational method with aprofound impact on the statistics/econometrics literature)



avoid computing: Q̂(β)+λ‖β‖0 but get something meaninful

A solution (Tibshirani, JRSS, 96) is to replace theℓ0-“norm” by the closest convex function, the ℓ1-norm:

‖β‖1 =

p∑

j=1

|βj |

◮ lasso estimator β̂ then minimizes

Q̂(β) + λ‖β‖1

◮ this change makes the regularization function to be convex◮ good computational methods for large scale problems◮ many ways to combine the fit and regularization functions


Formulations

Idea: balance fit Q̂(β) and penalty ‖β‖1 using a parameter λ

Several ways to do it:

minβ Q̂(β) + λ‖β‖1 lasso

minβ Q̂(β) : ‖β‖1 ≤ λ sieve

minβ ‖β‖1 : Q̂(β) ≤ λ ℓ1-minimization method

minβ ‖β‖1 : ‖∇Q̂(β)‖∞ ≤ λ Dantzig selector

minβ

√Q̂(β) + λ‖β‖1

√lasso


Penalty parameter λ

Motivation/heuristic is to choose the parameter λ so that β0 isfeasible for the problem but not too many more points.

sieve: ideally set λ = ‖β0‖1

ℓ1-minimization: ideally set λ = σ2

Dantzig selector: ideally set λ = ‖∇Q̂(β0)‖∞ = ‖2En[ǫixi ]‖∞

Note that lasso and√

lasso are unrestricted minimizations.

Their penalties are chosen to majorate the score at the truevalue, respectively

λ > ‖∇Q̂(β0)‖∞ and λ > ‖∇√

Q̂(β0)‖∞


Heuristics for lasso via Convex Geometry

−5 −4 −3 −2 −1 0 1 2 3 4 5

βT c

true value(β0T c = 0)

Q̂(β)



−5 −4 −3 −2 −1 0 1 2 3 4 5

βT c

Q̂(β)


Q̂(β0) + ∇Q̂(β0)′βT c



−5 −4 −3 −2 −1 0 1 2 3 4 5

βT c

Q̂(β)λ = ‖∇Q̂(β0)‖∞

λ‖β‖1




−5 −4 −3 −2 −1 0 1 2 3 4 5

βT c

Q̂(β)


λ‖β‖1

λ > ‖∇Q̂(β0)‖∞


Heuristics via Convex Geometry

−5 −4 −3 −2 −1 0 1 2 3 4 5

βT c

Q̂(β) + λ‖β‖1


λ > ‖∇Q̂(β0)‖∞


Penalty parameters for lasso and√

lasso

◮ In the case of lasso, the optimal choice of penalty level λ,guaranteeing near-oracle performance, is

λ =σ · 2√

2 log p/n ≥P ‖2En[ǫixi ]‖∞ = ‖∇Q̂(β0)‖∞.

◮ The choice relies on knowing σ, which may be apriori hardto estimate when p ≫ n.

◮ In the case of√

lasso the score is a pivotal random variable,with distribution independent of unknown parameters:

∥∥∥∥∇√

Q̂(β0)

∥∥∥∥∞

=‖En[ǫixi ]‖∞√

En[ǫ2i ]

=‖En[(ǫi/σ)xi ]‖∞√

En[(ǫi/σ)2],

◮ hence for√

lasso the penalty λ does not depend on σ.


Computational Methods for lasso

Big Picture of Computational Methods

Method Problem Size Comp. ComplexityComponentwise extremely large convergence

First-order very large pn(

log pε

)

(up to 100000)Interior-point large (p + n)3.5 log

(p+nε

)

(up to 5000)

Remark:Several implementations for these methods are available:

◮ SDPT3 or SeDuMi for interior-point methods◮ TFOCS for first-order methods◮ several downloadable codes for componentwise methods

(easy to implement)Alexandre Belloni High-Dimensional Sparse Econometrics

Computational Methods for lasso: Componentwise

Componentwise Method.A common approach to solve multivariate optimization:

(i) pick a component and fix all remaining components,

(ii) improve the objective function along the chosencomponent,

(iii) loop steps (i)-(iii) until convergence is achieved.

This has been widely used when the minimization over a singlecomponent can be done very efficiently.

This is the case of lasso.

Its simple implementation also contributed for the widespreaduse of this approach.


Computational Methods for lasso: Componentwise

For a current iterate β, let β−j = (β1, . . . , βj−1, 0, βj+1, . . . , βp)′:

◮ if 2En[xij(yi − x ′i β−j)] > λ, the minimum over βj is

βj =(−2En[xij (yi − x ′

i β−j)] + λ)/En[x2

ij ];

◮ if 2En[xij(yi − x ′i β−j)] < −λ the minimum over βj is

βj =(−2En[xij (yi − x ′

i β−j)] − λ)/En[x2

ij ];

◮ if 2|En[xij(yi − x ′i β−j)]| ≤ λ the minimum over βj is

βj = 0.

Remark: Similar formulas can be computed for other methodslike

√lasso.


Computational Methods for lasso: first-order methods

First-order methods. The new generation of first-ordermethods focus on structured convex problems that can be castas

minw

f (A(w) + b) + h(w) or minw

h(w) : A(w) + b ∈ K .

where◮ f is a smooth function◮ h is a structured function that is possibly non-differentiable

or with extended values◮ K is a structure convex set (typically a convex cone).



lasso is cast as

minw

f (A(w) + b) + h(w)

with

f (·) =‖ · ‖2

n, h(·) = λ‖ · ‖1, A = X , and b = −Y

Thus,

minβ

‖Xβ − Y‖2

n+ λ‖β‖1

Remark: A number of different first-order methods have beenproposed and analyzed. We just mention briefly one.



On every iteration for a given µ > 0 and current point βk is

β(βk ) = arg minβ

2En[xi (yi − x ′i β

k )]′β +12µ‖β − βk‖2 +

λ

n‖β‖1.

It follows that the minimization in β above is separable and canbe solved by soft-thresholding, letting

mkj := βk

j + 2En[xij(yi − x ′i β

k )]/µ

We update our iterate as

βj(βk ) = sign

(mk

j

)max

{∣∣∣mkj

∣∣∣ − λ/µ, 0}

.

Remark: The additional regularization term µ‖β − βk‖2 bringcomputational stability that allows the algorithm to enjoy bettercomputational complexity properties.


Computational Methods for lasso: ipm

Interior-point methods. Interior-point methods (ipms) solverstypically focus on solving conic programming problems instandard form,

minw

c′w : Aw = b, w ∈ K . (2.1)

The conic constraint will be biding at the optimal solution.These methods use a “barrier function” FK for the cone K that is

strictly convex, smooth and FK (w) → ∞ when w → ∂K ,

and for some µ > 0 proceed to compute

w(µ) = arg minw

c′w + µFK (w) : Aw = b. (2.2)

Letting µ → 0, w(µ) converges to the solution (2.1).We can recover an approximate solution for (2.1) efficiently byusing an approximate solution for w(µk ) as a warm-start toapproximate w(µk+1), µk > µk+1.



Not all barriers are created equally, solving the problems

w(µ) = arg minw

c′w + µFK (w) : Aw = b. (2.3)

will be tractable only if FK is a special barrier (depends on K ).Three cones play a major role in optimization theory:

◮ Non-negative orthant:IRp

+ = {w ∈ IRp : wj ≥ 0, j = 1, . . . , p},FK (w) = −

∑pj=1 log(wi)

◮ Second-order cone:Qn+1 = {(w , t) ∈ IRn × IR+ : t ≥ ‖w‖},FK (w) = − log(t − ‖w‖)

◮ Semi-definite positive matrices:Sk×k

+ = {W ∈ IRk×k : W = W ′, v ′Wv ≥ 0, for all v ∈ IRk},FK (w) = − log det(W )



These cones (and three others) are called self-scaled cones.

Their cartesian products are also self-scaled.

Therefore, when using ipms, it is VERY importantcomputationally to arrive a formulation that is not only convexbut also one that relies on self-scaled cones.

It turns out that many tractable penalization methods (includingℓ1 but others as well) can be formulated with these three cones.



lasso minβ

Q̂(β) + λ‖β‖1

to formulate the ‖ · ‖1, let β = β+ − β−, where β+, β− ∈ IRp+.

for any vector v ∈ IRn and any scalar t ≥ 0 we have that

v ′v ≤ t is equivalent to ‖(v , (t − 1)/2︸︷︷︸=:u

)‖ ≤ (t + 1)/2︸︷︷︸=:z

.

mint,β+,β−,u,z,v

tn

+ λ

p∑

j=1

β+

j + β−j

v = Y − Xβ+ + Xβ−

t = −1 + 2zt = 1 + 2u(v , u, z) ∈ Qn+2, t ≥ 0, β+ ∈ IRp

+, β− ∈ IRp+.


Quick comparison on a very sparse model:s = 5, β0j = 1 for j = 1, . . . , s, β0j = 0 for j > s.

Table: Average running time over 100 simulations. Differentprecisions across methods.

n = 100, p = 500 Componentwise First-order Interior-pointlasso 0.2173 10.99 2.545√lasso 0.3268 7.345 1.645




Post-Model Selection Estimator

Define the post-lasso estimator as follows:

◮ Step 1:Select the model using the lasso, T̂ := support(β̂).

◮ Step 2:Apply ordinary least square (LS) to the selected model T̂ .

Remark: We could use√

lasso, Dantzig selector, sieve orℓ1-minimization instead of lasso.


Post-Model Selection Estimator Geometry

β1

β2

β0β̂

minβ∈Rp

‖β‖1 : Q̂(β) ≤ γ,

In this example the correct model is selected, but note the shrinkagebias. The shrinkage bias can be removed by post model selection

estimator, i.e. applying OLS to the selected model.


Post-Model Selection Estimator Geometry

β1

β2

β0β̂

minβ∈Rp

‖β‖1 : Q̂(β) ≤ γ,

In this example the correct model was not selected. The selectedmodel contains a “junk” regressor which can impact the post model

selection estimator.


Issues on post model selection estimator

◮ lasso and others ℓ1-penalized methods will successfully“zero out” lots of irrelevant regressors.

◮ however, these procedures are not perfect at modelselection – might include “junk”, exclude some relevantregressors,

◮ and these procedures bias/shrink the non-zero coefficientestimates towards zero.

◮ post model selection estimators are motivated to removethe bias towards zero

◮ but results need to account for possible misspecification inthe selected model.


Regularity Condition

◮ A simple sufficient condition is as follows.Condition RSE. Take any C > 1.Then for all n large enough, the sC-sparse eigenvalues of the

empirical design matrix En[xix ′i ],

are bounded away from zero and from above; namely,

0 < κ ≤ min‖δ‖0≤sC

En[x ′i δ]

2

‖δ‖2 ≤ max‖δ‖0≤sC

En[x ′i δ]

2

‖δ‖2 ≤ κ′ < ∞.(2.4)

◮ This holds under i.i.d. sampling if E [xix ′i ] has eigenvalues

bounded away from zero and above, and:• xi has light tails (i.e., log-concave) and s log p = o(n);• or bounded regressors maxij |xij | ≤ K and s log4 p = o(n).


Result 1: Rates for ℓ1-penalized methods

Theorem (Rates)Under condition RSE, wp → 1

√En[x ′

i β̂ − x ′i β0]2 . σ

√s log(n ∨ p)

n

◮ the rate is very close to the oracle rate√

s/n, obtainablewhen we know the “true” model T .

◮ p shows up only through a logarithmic term log p.◮ for Dantzig selector:

◮ Candes and Tao (Annals of Statistics 2007)◮ for lasso:

◮ Meinshausen and Yu (Annals of Statistics 2009),◮ Bickel, Ritov, Tsybakov (Annals of Statistics 2009)

◮ for√

lasso:◮ B., Chernozhukov and Wang (Biometrika, 2011)


Result 2: Post-Model Selection Estimator

Theorem (Rate for Post-Model Selection Estimator)Under conditions of Result 1, wp → 1

√En[x ′

i β̂PL − x ′i β0]2 . σ

√sn

log(n ∨ p),

Under some further exceptional cases faster, up to σ√

sn .

◮ Even though lasso does not in general perfectly select therelevant regressors, post-lasso performs at least as well.

◮ analysis account for a selected model that can be misspecified◮ This result was first derived in the context of median regression

◮ B. and Chernozhukov (Annals of Statistics, 2011),◮ and extended to least squares by

◮ B. and Chernozhukov (R & R Bernoulli, 2011).


Monte Carlo

◮ In this simulation we used

s = 6, p = 500, n = 100

yi = x ′i β0 + ǫi , ǫi ∼ N(0, σ2),

β0 = (1, 1, 1/2, 1/3, 1/4, 1/5, 0, . . . , 0)′

xi ∼ N(0,Σ), Σij = (1/2)|i−j |, σ2 = 1

◮ Ideal benchmark: Oracle estimator which runs OLS of yi

on xi1, ..., xi6. This estimator is not feasible outsideMonte-Carlo.


Monte Carlo Results: Risk

Risk:√

E [(x ′i β̂ − x ′β0)2]

n = 100, p = 500

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

LASSO Post-LASSO Oracle

Estimation Risk


Monte Carlo Results: Bias

Bias: ‖E [β̂] − β0‖2

n = 100, p = 500

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

LASSO Post-LASSO

Bias


Applications of model selection


Gaussian Instrumental Variable

Gaussian IV model

yi = d ′i α + ǫi ,

di = D(zi) + vi , (first stage)

(ǫi

vi

)∼ N

(0,

(σ2

ǫ σǫv

σǫv σ2v

))

◮ can have additional controls wi in both equations;suppressed here, see references for details

◮ want to estimate “optimal” instrument D(zi) = E [di |zi ]using a large list of “technical instruments,”

xi := (xi1, ..., xip)′ := (P1(zi ), ..., Pp(zi ))′, and we can write

D(zi) = x ′i β0 + ri , with ‖β0‖0 = s, En[r2

i ] . σ2vs/n

where p is large, possibly much larger than n.◮ use estimated optimal instrument to estimate α.


2SLS with LASSO/Post-LASSO estimated Optimal IV

2SLS with LASSO/Post-LASSO estimated Optimal IV

◮ Step 1:Estimate D̂(zi ) = x ′

i β̂ using LASSO or Post-LASSOestimator.

◮ Step 2:α̂ = [En[diD̂(zi)

′]]−1En[D̂(zi)yi ]


2SLS with LASSO-based IV

Theorem (2SLS with lasso-based IV)Under conditions of Result 1, and s2 log2 p = o(n), we have

(Λ̂∗)−1/2√n(α̂ − α) →d N(0, I),

where Λ̂∗ := σ̂2{En[D̂(zi)D̂(zi)′]}−1.

◮ For more results, see B., Chernozhukov, and Hansen(submitted, ArXiv 2010)

◮ The requirement s2 log2 p = o(n) can be relaxed tos log p = o(n) if split IV estimator is used.

◮ Extension to non-Gaussian and heteroskedastic errors in B.,Chen, Chernozhukov and Hansen (submitted, ArXiv 2010)required adjustments in the penalty function


Application of IV: Eminent Domain

Estimate economic consequences of government take-over ofproperty rights from individuals

◮ yi = economic outcome in a region i , e.g. housing priceindex

◮ di = number of property take-overs decided in a court oflaw, by panels of 3 judges

◮ xi = demographic characteristics of judges, that arerandomly assigned to panels: education, politicalaffiliations, age, experience etc.

◮ fi = xi + various interactions of components of xi ,◮ a very large list p = p(fi) = 344


Application of IV: Eminent Domain, Continued.

◮ Outcome is log of housing price index; endogenousvariable is government take-over

◮ Can use 2 elementary instruments, suggested by reallawyers (Chen and Yeh, 2010)

◮ Can use all 344 instruments and select approximately theright set using LASSO.

Estimator Instruments Price Effect Rob Std Error2SLS 2 .07 .0322SLS / LASSO IVs 4 .05 .017


ℓ1-Penalized Quantile Regression

◮ a response variable y and p-dimensional covariates x◮ u-th conditional quantile function of y given x is given by

x ′β(u), β(u) ∈ Rp.

◮ the dimension p is large, possibly much larger than thesample size n

◮ the true model β(u) is sparse having only s < n non-zerocomponents; we don’t know which.

◮ in population, β(u) minimizes

Qu(β) = E[ρu(y − x ′β)],

where ρu(t) = (u − 1{t ≤ 0})t = ut+ + (1 − u)t−.


ℓ1-Penalized Quantile Regression

◮ canonical estimator that minimizes

Q̂u(β) = n−1n∑

i=1

ρu(yi − x ′i β)

is not consistent when p ≥ n.◮ ℓ1-penalized quantile regression β̂(u) solves

minβ∈Rp

Q̂u(β) + λu‖β‖1.


ℓ1-QR Rate Result

Theorem (Rates)Under standard conditions, letting U ⊂ (0, 1) be a compact set,wp→ 1

supu∈U

‖β̂(u) − β(u)‖2 .P

√s log(n ∨ p)

n.

◮ very similar to the uniform rate√

s log n/n when thesupport is known

◮ similar rates to lasso but also hold uniformly over u ∈ U◮ although similar rates substantially different analysis


Application ℓ1-QR: Conditional Independence

Let X ∈ IRm denote a random vector. We are interest indetermining conditional independence structure of X , for everypair of components i , j determine if

Xi ⊥⊥ Xj | X−i ,−j .

Conditional independence between two components of aGaussian distributed random vector X ∼ N(0,Σ) has been wellunderstood:For Σ = Var(X ), Xi and Xj are independent conditioned on theother components if and only if Σ−1

ij = 0.



Let X ∈ IRm denote a random vector. We are interest indetermining conditional independence structure of X , for everypair of components i , j determine if

Xi ⊥⊥ Xj | X−i ,−j .

We consider a non-Gaussian high-dimensional setting.

Xi ⊥⊥ Xj |X−i ,−j

if and only if

QXi(u|X−i) = QXi

(u|X−i ,−j) for all u ∈ (0, 1).



In order to approximate conditional independence consider: wesay that Xi and Xj are U -quantile uncorrelated if

β i(u) ∈ arg min E [ρu(Xi − β′X−i)] and

β j(u) ∈ arg min E [ρu(Xj − β′X−j)]

are such that β ij (u) = 0 and β j

i (u) = 0 for all u ∈ U .



In order to approximate conditional independence consider:◮ U ⊂ (0, 1) a compact set of quantile indices◮ a class of functions F

We will say thatXi ⊥⊥U ,F Xj | X−i ,−j

if and only if for every u ∈ U

f i ,ju (0, X−i ,−j) ∈ arg min

f∈FE [ρu(Xi − f (Xj , X−i ,−j))] and

f j ,iu (0, X−i ,−j) ∈ arg min

f∈FE [ρu(Xj − f (Xi , X−i ,−j))]

This approach requires:◮ series approximation for F ;◮ uniform estimation across all quantile indices u ∈ U ;◮ model selection.



Let X be a vector of log returns. Edge (i , j) if Xi 6⊥⊥ Xj |X−i ,−j

EWST

BTFGAEPI

JJSF

DGSE

PLXS

CAU

RMCF

HGR

SUNW

KDE

ORCL

MSFT

TW

OSIP

FUND

LLTC

DJCO

DELL

IBM



Consider IBM returns first.

EWST

BTFGAEPI

JJSF

DGSE

PLXS

CAU

RMCF

HGR

SUNW

KDE

ORCL

MSFT

TW

OSIP

FUND

LLTC

DJCO

DELL

IBM



Next consider all returns.

EWST

BTFGAEPI

JJSF

DGSE

PLXS

CAU

RMCF

HGR

SUNW

KDE

ORCL

MSFT

TW

OSIP

FUND

LLTC

DJCO

DELL

IBM



Next we investigate how this dependence structure is affectedby “downside movement” of the market.This is particulary relevant for hedging since during downsidemovements of the market is when hedge is most needed.

We quantify downside movement by conditioning on theobservations with the largest 25%, 50%, 75% and 100%downside movement of the market.(The 100% corresponds to the overall dependence.)



100% (No downside movement)

EWST

BTFGAEPI

JJSF

DGSE

PLXS

CAU

RMCF

HGR

SUNW

KDE

ORCL

MSFT

TW

OSIP

FUND

LLTC

DJCO

DELL

IBM



75%

EWST

BTFGAEPI

JJSF

DGSE

PLXS

CAU

RMCF

HGR

SUNW

KDE

ORCL

MSFT

TW

OSIP

FUND

LLTC

DJCO

DELL

IBM



50%

EWST

BTFGAEPI

JJSF

DGSE

PLXS

CAU

RMCF

HGR

SUNW

KDE

ORCL

MSFT

TW

OSIP

FUND

LLTC

DJCO

DELL

IBM



25%

EWST

BTFGAEPI

JJSF

DGSE

PLXS

CAU

RMCF

HGR

SUNW

KDE

ORCL

MSFT

TW

OSIP

FUND

LLTC

DJCO

DELL

IBM



100% vs. 25%

EWST

BTFGAEPI

JJSF

DGSE

PLXS

CAU

RMCF

HGR

SUNW

KDE

ORCL

MSFT

TW

OSIP

FUND

LLTC

DJCO

DELL

IBMEWST

BTFGAEPI

JJSF

DGSE

PLXS

CAU

RMCF

HGR

SUNW

KDE

ORCL

MSFT

TW

OSIP

FUND

LLTC

DJCO

DELL

IBM

The right graph is much sparser than the left graph. Thus, whenmarkets are moving down several variables become (nearly)conditionally independent. Therefore, hedging strategies thatdo not accounting for downside movements might not offer thedesired protection exactly when it is needed the most.


Variable Length Markov Chains

Consider a dynamic model:

◮ dynamic discrete choice models◮ discrete stochastic dynamic programming

The main feature we are interest:

history of outcomes potentially impactsthe probability distribution of the next outcome



Consider at each time t agents choose: st ∈ A = {Low , High}.Typical assumption: Markov Chain of order 1 (2 cond. prob.)

t = 0

lllllllllllll

RRRRRRRRRRRRR

t = −1 L H



Consider at each time t agents choose: st ∈ A = {Low , High}.Markov Chain of order 2 requires to estimate 4 cond. prob.

t = 0

lllllllllllll

RRRRRRRRRRRRR

t = −1 L

yyyy

yyyy

EEEE

EEEE

H

yyyy

yyyy

EEEE

EEEE

t = −2 LL LH HL HH



Consider at each time t agents choose: st ∈ A = {Low , High}.Markov Chain of order 3 requires to estimate 8 cond. prob.

t = 0

lllllllllllll

RRRRRRRRRRRRR

t = −1 L

yyyy

yyyy

EEEE

EEEE

H

yyyy

yyyy

EEEE

EEEE

t = −2 LL

yyyy

yyyy

LH

yyyy

yyyy

HL

EEEE

EEEE

HH

EEEE

EEEE

t = −3 LLL LLH LHL LHH HLL HLH HHL HHH



Consider at each time t agents choose: st ∈ A = {Low , High}.Markov Chain of order K requires to estimate 2K cond. prob.

t = 0

lllllllllllll

RRRRRRRRRRRRR

t = −1 L

yyyy

yyyy

EEEE

EEEE

H

yyyy

yyyy

EEEE

EEEE

t = −2 LL

yyyy

yyyy

LH

yyyy

yyyy

HL

EEEE

EEEE

HH

EEEE

EEEE

t = −3 LLL

%%%%%

��

LLH

%%%%%

��

LHL

%%%%%

��

LHH

%%%%%

��

HLL

%%%%%

��

HLH

%%%%%

��

HHL

%%%%%

��

HHH

%%%%%

��

t = −K ......

......

......

......



Issues:

◮ K is unknown (potentially infinite)◮ full K -order Markov chain becomes large quickly◮ similar histories may lead to similar transition probabilities

Motivation:◮ bias (small model) vs. variance (large model)◮ make the order of the Markov Chain depend on the history◮ model selection balances:

misspecification and length of confidence intervals



Variable Length Markov Chain:very flexible but need to select the “context tree”

t = 0

lllllllllllll

RRRRRRRRRRRRR

t = −1 L

yyyy

yyyy

EEEE

EEEE

H

yyyy

yyyy

EEEE

EEEE

t = −2 LL LH

yyyy

yyyy

HL HH

t = −3 LHL LHH

%%%%%

��

...



A penalty method based on (uniform) confidence intervals.

Uniform estimation of transition probabilities over all possiblehistories x ∈ A−1

−∞:

supx∈A−1

−∞

‖P̂(·|x) − P(·|x)‖∞

(bounds achieve the minimax rates for some classes)

Results allow to establish:◮ in dynamic discrete choice models:

uniform estimation of dynamic marginal effects◮ in discrete stochastic dynamic programming:

uniform estimation of the value function


Conclusion


Conclusion

1. ℓ1 and post-ℓ1-penalization methods◮ offer a good way to control for confounding factors;◮ offer a good way to select the series terms;◮ are especially useful when p ≥ n


Conclusion


2. ℓ1-penalized estimators achieve near-oracle rates


Conclusion



3. post-ℓ1-penalized estimators achieve at least near-oraclerates


Conclusion



3. post-ℓ1-penalized estimators achieve at least near-oraclerates

4. applications of model selection in econometrics estimators


References

◮ Bickel, Ritov and Tsybakov, “Simultaneous analysis of Lasso andDantzig selector”, Annals of Statistics, 2009.

◮ Candes and T. Tao, “The Dantzig selector: statistical estimation when pis much larger than n,” Annals of Statistics, 2007.

◮ Meinshausen and Yu, “Lasso-type recovery of sparse representationsfor high-dimensional data,” Annals of Statistics, 2009.

◮ Tibshirani, “Regression shrinkage and selection via the Lasso,” J. Roy.

Statist. Soc. Ser. B, 1996.


References

◮ Belloni and Chernozhukov, “ℓ1-Penalized Quantile Regression for HighDimensional Sparse Models,” 2007, Annals of Statistics.

◮ Belloni and Chernozhukov, “Post-ℓ1-Penalized Estimators inHigh-Dimensional Linear Regression Models”, ArXiv, 2009.

◮ Belloni, Chernozhukov and Wang, “√

LASSO: Pivotal Recovery ofSparse Signals via Conic Programming,” Biometrika, 2011.

◮ Belloni and Chernozhukov, “High Dimensional Sparse EconometricModels: An Introduction,” Ill-Posed Inverse Problems andHigh-Dimensional Estimation, Springer Lecture Notes in Statistics,2011.

◮ Belloni, Chernozhukov, and Hansen, “Estimation and InferenceMethods for High-dimensional Sparse Econometric Models: a Review,”2011, submitted.

◮ Belloni and Oliveira, “Approximate Group Context Tree: application to

dynamic programming and dynamic choice models,” 2011, submitted.


Documents

High-Dimensional Sparse Econometric Models, an Introductionice.uchicago.edu/2011_presentations/29_Belloni/Sparse_Lecture_ICE... · High-Dimensional Sparse Econometric Models, an Introduction