Lecture 1: Dimension reduction estimation: can linear methods solve nonlinear problems?
Fall 2014
Hong Kong Baptist University
University of Connecticut
Part I: Why dimension reduction?
Question: Why dimension reduction?
Regression modelling: dimension reduction makes data visualization possible. The current practice of regression analysis is:
Fit a simple model;
Check the residual plot;
If the residual plot does not show a systematic pattern, then stop; otherwise continue to fit a more complex model.
Why dimension reduction?
Question: How to give a comprehensive residual plot?
In the one-dimensional case, the residual plot is informative. Given (X1, Y1), ..., (Xn, Yn), first fit a linear regression model:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad 1 \le i \le n.$$

Find the least squares estimators:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}. \tag{1}$$
Why dimension reduction?
Let Ŷi = β̂0 + β̂1Xi be the fitted values and ei = Yi − Ŷi the residuals. We can simply plot ei against Xi; this is called the one-dimensional residual plot.
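As a quick illustration, here is a minimal sketch (assuming numpy and matplotlib are available; the simulated data-generating model is hypothetical) of computing (1) and the one-dimensional residual plot:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical one-dimensional data; any sample (X_i, Y_i) would do.
n = 100
X = rng.normal(size=n)
Y = 1.0 + 2.0 * X + rng.normal(scale=0.5, size=n)

# Closed-form least squares estimators from equation (1).
beta1 = np.sum((Y - Y.mean()) * (X - X.mean())) / np.sum((X - X.mean()) ** 2)
beta0 = Y.mean() - beta1 * X.mean()

# Fitted values and residuals for the one-dimensional residual plot.
resid = Y - (beta0 + beta1 * X)
plt.scatter(X, resid)
plt.xlabel("X")
plt.ylabel("residual")
plt.show()
```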
Figure: Residual plot (residuals versus fitted values ŷ).
Why dimension reduction?
What should we do if Xi is a vector in Rp? Currently two methods are in frequent use:
(1) Residual plot: ei versus Ŷi (note that Ŷi is always one-dimensional).
(2) Scatter plot matrix, in which we plot ei against each predictor, and each predictor against any other predictor, forming a (p + 1) × (p + 1) matrix of scatter plots.
However, these methods are intrinsically marginal: they cannot reflect the whole picture of the regression relation. Let us see this through an example.
Why dimension reduction?
Example 1. 100 pairs (X1, Y1), ..., (X100, Y100) are generated from some model, where each Xi = (Xi1, Xi2, Xi3) is a three-dimensional vector in R3.
The scatter plot matrix is shown below. From it, the data appear to have the following features:
(1) Y does not seem to depend on X2;
(2) Y seems to depend on X1 in a nonlinear way;
(3) Y seems to depend on X3 in a nonlinear way.
Why dimension reduction?
Figure: Scatter plot matrix of Y, X1, X2, and X3.
Why dimension reduction?
However, (X, Y) are actually generated from the following model:

$$Y = |X_1 + X_2| + \varepsilon,$$

where ε is independent of X = (X1, X2, X3), which is multivariate normal with mean 0 and covariance matrix

$$\Sigma = \begin{pmatrix} 1 & 0 & 0.8 \\ 0 & 0.2 & 0 \\ 0.8 & 0 & 1 \end{pmatrix}.$$
Note that Y does not depend on X3, and Y does depend on X2.
X2 has a much smaller variance than X1 or X3, while X1 and X3 are highly correlated (correlation 0.8). As a result, the marginal dependence of Y on X3 is induced through X1, and the effect of X2 is masked by its small variance.
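A minimal simulation of this model (a sketch, assuming numpy and matplotlib) makes the point concrete: the marginal scatter plots are misleading, while the plot of Y against X1 + X2 reveals the structure.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Covariance matrix from the text: X2 has small variance,
# and X1, X3 are highly correlated.
Sigma = np.array([[1.0, 0.0, 0.8],
                  [0.0, 0.2, 0.0],
                  [0.8, 0.0, 1.0]])
n = 100
X = rng.multivariate_normal(np.zeros(3), Sigma, size=n)
Y = np.abs(X[:, 0] + X[:, 1]) + rng.normal(scale=0.5, size=n)  # noise scale assumed

# Marginal plots (misleading) versus the sufficient plot (informative).
fig, axes = plt.subplots(1, 4, figsize=(12, 3))
for j in range(3):
    axes[j].scatter(X[:, j], Y)
    axes[j].set_xlabel(f"X{j + 1}")
axes[3].scatter(X[:, 0] + X[:, 1], Y)
axes[3].set_xlabel("X1 + X2")
axes[0].set_ylabel("Y")
plt.tight_layout()
plt.show()
```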
Why dimension reduction?
Once again, the scatter plot matrix cannot capture the true relation between X and Y.
What can truly capture the relation between X and Y is the scatter plot of Y versus X1 + X2.
But how can we make this plot before we know that X1 + X2 is the predictor? This is the question of dimension reduction.
Find the linear combination X1 + X2 before any regression modelling is performed!
Models under dimension reduction structure
Some examples:
Linear model: Y = βTX + ε;
Generalized linear model: Y = g(βTX) + ε for a given monotonic function g;
Single-index model: Y = g(βTX) + ε for an unknown function g;
Multi-index model (1): Y = g(βTX) + ε, where β is a p × q orthogonal matrix;
Multi-index model (2): Y = g1(β1TX) + g2(β2TX)ε, where β = (β1, β2);
Multi-index model (3): Y = g(βTX, ε).
Further observation on dimension reduction structure
In these models, all the information about the response Y is captured by βTX rather than by the original X!
In other words, once βTX is given, no more information about Y can be acquired from the remaining part of X: Y is then conditionally independent of X whenever ε is independent of X.
We call this property conditional independence and write it as Y ⊥⊥ X | βTX.
Based on this, we consider a generic framework in which "model-free" methods can be developed to estimate the parameters of interest and then to establish a model.
Central Subspace
The goal of dimension reduction is to seek β ∈ Rp×q, q < p, such that

Y ⊥⊥ X | βTX.

However, such a β is not unique: as the next proposition shows, it is identified at best up to its column space. To make the notion precise, we first define the column space.
Proposition 1.1. If A is any q × q non-singular matrix, then Y ⊥⊥ X | βTX if and only if Y ⊥⊥ X | (βA)TX. This holds because (βA)TX = AT(βTX) is a one-to-one function of βTX.
Column space. For a matrix B, we denote by S(B) the subspace spanned by the columns of B: if b1, ..., bq are the columns of B, then S(B) consists of all linear combinations c1b1 + ... + cqbq for constants c1, ..., cq.
Central Subspace
If γ is another matrix such that S(β) ⊆ S(γ), then Y ⊥⊥ X | βTX implies Y ⊥⊥ X | γTX. Therefore, we are naturally interested in the smallest dimension reduction space, which achieves the maximal reduction of the dimension of X.
Definition
Definition. If the intersection of all dimension reduction spaces for (X, Y) is itself a dimension reduction space, this space is called the central subspace, written SY|X.
Reference: Cook (1994, 1998). Thus the goal of dimension reduction is to find the central subspace SY|X.
Sufficient plot
Once we know the central subspace, or equivalently its p × q basis matrix β = (β1, ..., βq), a comprehensive scatter plot or residual plot against βTX is informative.
When q = 1, the scatter plot itself is informative; when q = 2, spin (rotating-plot) software gives a comprehensive view of the data. Usually this suffices for most data analyses.
Example 1 (continued). The model is

$$Y = |X_1 + X_2| + \varepsilon.$$

Thus the central subspace is spanned by (1, 1, 0)T. The sufficient plot is the scatter plot of Y versus X1 + X2, shown below.
Sufficient plot
Figure: Sufficient plot: scatter plot of Y versus X1 + X2.
Part II: Ordinary least squares
Assumption
Assumption 2.1. Let β be a p × q matrix whose columns form an orthonormal basis of SY|X. Assume that E(X | βTX) is a linear function of X; that is, for a constant vector c and a p × q matrix C,

$$E(X \mid \beta^T X) = c + C\beta^T X.$$
We first make some notes on the intuition behind and implications of this assumption.
In practice, we do not know β at the outset, so we typically replace this assumption by requiring that E(X | γTX) be linear in X for all γ ∈ Rp×q. This is equivalent to elliptical symmetry of the distribution of X.
Ordinary least squares
Elliptical symmetry can often be approximately achieved by an appropriate transformation of the original data, such as a power transformation or a logarithm; see Cook and Weisberg (1994).
Hall and Li (1993) demonstrated that, if the original dimension p is much larger than the structural dimension q, then E(X | βTX) is approximately linear in X.
Ordinary least squares
In fact, C can be taken to be β:
Theorem
Theorem 2.2. If Assumption 2.1 holds, then

$$E(X \mid \beta^T X) = c + C\beta^T X =: P_\beta(X),$$

where Pβ = ββT is the projection operator onto S(β).
Ordinary least squares
Recall the assumption that E(X) = 0 and var(X) = Ip. Then we have:
Theorem
Theorem 2.3. Suppose that Assumption 2.1 holds. Then E(XY) = βc for some q × 1 vector c, so E(XY) is a vector in SY|X. In other words, E(XY) identifies a vector in the central subspace. (Indeed, E(XY) = E{E(X | βTX)Y} = β E{(βTX)Y} by Theorem 2.2, since c = 0 when E(X) = 0.)
Ordinary least squares
Theorem
Theorem 2.4. Suppose that Y | X follows the model Y = g(βTX) + ε, where β = (β1, ..., βp)T is a vector (q = 1). Suppose that Assumption 2.1 holds. Then E(XY) is proportional to β in the model.
Ordinary least squares
At the population level, the identifying procedure can be described as follows. First, standardize X to Z = Σx^{-1/2}(X − µ). Identify a vector in SY|Z, then transform back to SY|X = Σx^{-1/2} SY|Z. At the sample level, the estimating procedure is as follows.

Step 1. Compute the sample mean and variance, µ̂ = En(X) and Σ̂x = varn(X), and standardize Xi to Ẑi = Σ̂x^{-1/2}(Xi − µ̂).

Step 2. Center Yi to Ỹi = Yi − E(Y), estimated in sample by Ỹi = Yi − En(Y).

Step 3. Let γ̂ = En(Ẑ Ỹ) estimate E(ZY) ∈ SY|Z.

Step 4. Let β̂ = Σ̂x^{-1/2} γ̂ estimate the corresponding vector E(XY) ∈ SY|X.
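A compact sketch of these four steps in numpy (function and variable names are mine, not from the source):

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def ols_direction(X, Y):
    """Estimate a vector in the central subspace S_{Y|X} by OLS."""
    n = len(Y)
    mu = X.mean(axis=0)                    # Step 1: sample mean
    W = inv_sqrt(np.cov(X, rowvar=False))  # Step 1: Sigma_x^{-1/2}
    Z = (X - mu) @ W                       # standardized predictors Z_i
    Yc = Y - Y.mean()                      # Step 2: centered response
    gamma = Z.T @ Yc / n                   # Step 3: gamma-hat = E_n(Z Y)
    return W @ gamma                       # Step 4: beta-hat, a vector in S_{Y|X}
```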
Applications
For the generalized linear model Y = g(βTX) + ε, the single-index model Y = g(βTX) + ε, and the transformation model H(Y) = βTX + ε, the above result shows that OLS can be applied to estimate the direction γ = β/‖β‖, even when the L2-norm ‖β‖ of β is not 1.
As OLS has a closed form, estimating γ is computationally very efficient under the above nonlinear models.
After that, we can define the one-dimensional variable z = γTX, rewrite the model as Y = g(αz) + ε, and estimate α in a one-dimensional setting.
Thus, estimating β = αγ can be performed in this two-step procedure (see Feng and Zhu, 2013, CSDA).
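As a sketch of this two-step procedure for a generalized linear model with a known link (g = exp is chosen purely for illustration; scipy is assumed, and ols_direction is reused from the code above):

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(2)

# Simulate Y = exp(beta^T X) + eps; the true beta here is hypothetical.
n, p = 500, 3
beta = np.array([1.0, 0.5, 0.0])
X = rng.normal(size=(n, p))
Y = np.exp(X @ beta) + rng.normal(scale=0.1, size=n)

# Step 1: estimate the direction gamma = beta / ||beta|| by OLS.
gamma = ols_direction(X, Y)
gamma /= np.linalg.norm(gamma)

# Step 2: reduce to z = gamma^T X and fit the scalar alpha in g(alpha * z).
z = X @ gamma
(alpha,), _ = curve_fit(lambda z, a: np.exp(a * z), z, Y, p0=[1.0])
beta_hat = alpha * gamma  # two-step estimate of beta = alpha * gamma
```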
Part III: Principal Hessian Directions
The biggest disadvantages of OLS are: 1) it can estimate at most one direction in the central subspace; 2) it cannot identify the direction under a symmetric link function such as the one in Y = |βTX| + ε.
For example, for Y = |X1 + X2| + ε we have β = (1, 1, 0)T, but this direction cannot be identified by OLS: by symmetry, E(XY) = 0.
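A quick numerical check (reusing ols_direction and the Example 1 design) confirms this: the OLS vector is essentially noise around zero rather than anything proportional to (1, 1, 0).

```python
import numpy as np

rng = np.random.default_rng(3)

Sigma = np.array([[1.0, 0.0, 0.8],
                  [0.0, 0.2, 0.0],
                  [0.8, 0.0, 1.0]])
n = 5000
X = rng.multivariate_normal(np.zeros(3), Sigma, size=n)
Y = np.abs(X[:, 0] + X[:, 1]) + rng.normal(scale=0.5, size=n)

# For this symmetric link E(XY) = 0, so even with large n the OLS
# vector carries no directional information.
print(ols_direction(X, Y))  # all entries close to 0
```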
OLS plot
Figure: OLS scatter plots of y versus the estimated OLS predictor for n = 20, 50, 100, with estimated directions (−1.13, 0.22, 0.94), (0.02, 0.61, −0.12), and (0.01, 0.11, 1.49) respectively; none is close to the true direction (1, 1, 0).
Principal Hessian Directions
Another method: consider the conditional mean E(Y | X) of Y given X. When E(Y | X) = E(Y | βTX), its second derivative is

$$\frac{\partial^2 E(Y \mid X)}{\partial X \, \partial X^T} = \beta \, \frac{\partial^2 E(Y \mid \beta^T X)}{\partial(\beta^T X)\, \partial(\beta^T X)^T} \, \beta^T,$$

which is a p × p matrix. Now apply Stein's Lemma (1956): when the distribution of X is standard normal,

$$E\!\left(\frac{\partial^2 E(Y \mid X)}{\partial X \, \partial X^T}\right) = E(Y X X^T) = \beta \, E\!\left(\frac{\partial^2 E(Y \mid \beta^T X)}{\partial(\beta^T X)\, \partial(\beta^T X)^T}\right) \beta^T,$$

where the middle factor on the right-hand side is a q × q matrix. The p × p matrix βE(·)βT therefore has at most q non-zero eigenvalues, and the corresponding eigenvectors will be shown to lie in the central subspace.
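For reference, here is the form of Stein's identity being used, in a short sketch (assuming X standard normal, Y centered, and m(x) = E(Y | X = x) twice differentiable): applying the first-order identity E{f(X)X} = E{∇f(X)} twice gives

$$E(YXX^T) = E\{m(X)XX^T\} = E\{\nabla^2 m(X)\} + E\{m(X)\}\, I_p = E\{\nabla^2 m(X)\},$$

where the last step uses E{m(X)} = E(Y) = 0 after centering.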
Principal Hessian Directions
Assumption
Assumption 3.1. Assume that the conditional variance

var(X | βTX) = C

is a non-random p × p matrix.
This assumption is satisfied if X is multivariate normal.
Principal Hessian directions: population development
Let α be the OLS vector E(XY), and let e be the residual from the simple linear regression, that is,

e = Y − αTX.

Note that, in the standardized coordinates, the intercept of the OLS fit is zero, because E(Y) = 0 and E(X) = 0, so E(Y) − αTE(X) = 0. That is why there is no constant term in e.
Definition
Definition 3.1. The matrix H1 = E(YXXT) is called the y-based Hessian matrix; the matrix H2 = E(eXXT) is called the e-based Hessian matrix.
The central result of Part III is that the column space of either Hessian matrix is a subspace of the central subspace.
Principal Hessian directions: population development
Theorem
Theorem 3.1. Suppose that Assumptions 2.1 and 3.1 hold. Then the column space of H1 is a subspace of SY|X.
Theorem
Theorem 3.2. Suppose that Assumptions 2.1 and 3.1 hold. Then the column space of H2 is a subspace of SY|X.
Sample estimator of pHd
Again, we use the idea of first transforming to Z, estimating SY|Z, and then transforming back to SY|X. We summarize the computation in the following steps.

Step 1. Standardize X1, ..., Xn to Ẑ1, ..., Ẑn and center Y1, ..., Yn to Ỹ1, ..., Ỹn, as described in the algorithm for OLS.

Step 2. Compute the OLS of Ỹi versus Ẑi to get α̂:

α̂ = (varn(Ẑ))−1 covn(Ẑ, Ỹ), and α̂0 = En(Ỹ) − α̂TEn(Ẑ) = 0.

Because of the standardization, we have:
Sample estimator of pHd
varn(Ẑ) = Ip and covn(Ẑ, Ỹ) = En(Ẑ Ỹ) − En(Ẑ)En(Ỹ) = En(Ẑ Ỹ).

This means the OLS vector for Ỹ on Ẑ is α̂ = En(Ẑ Ỹ). The residual is êi = Ỹi − α̂TẐi.

Step 3. Construct the y-based and e-based Hessian matrices:

Ĥ1 = En(Ỹ Ẑ ẐT), Ĥ2 = En(ê Ẑ ẐT).

Step 4. Assume, for now, that we know the structural dimension q. Let γ̂1, ..., γ̂q be the q eigenvectors corresponding to the q largest eigenvalues of Ĥ1Ĥ1T, and let δ̂1, ..., δ̂q be the q eigenvectors corresponding to the q largest eigenvalues of Ĥ2Ĥ2T. We use γ̂1, ..., γ̂q and δ̂1, ..., δ̂q as estimators of SY|Z.
Sample estimator of pHd
Let β̂i = Σ̂x−1/2 γ̂i and η̂i = Σ̂x−1/2 δ̂i, for i = 1, ..., q.
We then use β̂1, ..., β̂q and η̂1, ..., η̂q as estimators of SY|X.
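Putting Steps 1 through 4 and the back-transformation together, here is a self-contained numpy sketch (names are mine), applied to Example 1:

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def phd_directions(X, Y, q, use_residual=True):
    """Estimate q central-subspace directions by pHd (Steps 1-4)."""
    n = len(Y)
    W = inv_sqrt(np.cov(X, rowvar=False))       # Sigma_x^{-1/2}
    Z = (X - X.mean(axis=0)) @ W                # Step 1: standardize X
    Yc = Y - Y.mean()                           # Step 1: center Y
    alpha = Z.T @ Yc / n                        # Step 2: OLS vector E_n(Z Y)
    w = Yc - Z @ alpha if use_residual else Yc  # e-based or y-based weights
    H = (Z * w[:, None]).T @ Z / n              # Step 3: Hessian matrix
    vals, vecs = np.linalg.eigh(H @ H.T)        # Step 4: spectral decomposition
    dirs = vecs[:, np.argsort(vals)[::-1][:q]]  # top-q eigenvectors
    return W @ dirs                             # transform back to S_{Y|X}

# Example 1: recover the direction (1, 1, 0) up to scale.
rng = np.random.default_rng(4)
Sigma = np.array([[1.0, 0.0, 0.8], [0.0, 0.2, 0.0], [0.8, 0.0, 1.0]])
X = rng.multivariate_normal(np.zeros(3), Sigma, size=1000)
Y = np.abs(X[:, 0] + X[:, 1]) + rng.normal(scale=0.5, size=1000)
b = phd_directions(X, Y, q=1).ravel()
print(b / np.abs(b).max())  # approximately proportional to (1, 1, 0)
```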
We have assumed that the structural dimension q is known. In practice this must be determined by the data. There are several proposals in the literature.
The following is the pHd scatter plot for the model Y = |X1 + X2| + ε described before.
pHd plot
Figure: pHd scatter plots of y versus the first pHd predictor (r-pHd1) for n = 20, 50, 100, with estimated directions (−0.9, −0.31, 0.27), (0.72, 0.66, −0.17), and (−0.62, −0.74, −0.23) respectively; as n grows, the estimate approaches the true direction (1, 1, 0) up to sign.