Lecture 1: Dimension reduction estimation: can linear methods solve nonlinear problems?
Fall 2014
Hong Kong Baptist University
University of Connecticut
Part I: Why dimension reduction?
Question: Why dimension reduction?
Regression modelling: dimension reduction makes data visualization possible. The current practice of regression analysis is:
Fit a simple model;
Check the residual plot;
If the residual plot does not show a systematic pattern, then stop; otherwise continue to fit a more complex model.
Why dimension reduction?
Question: How to give a comprehensive residual plot?
In the one-dimensional case, the residual plot is informative. Given (X1, Y1), ..., (Xn, Yn), first fit a linear regression model:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad 1 \le i \le n.$$

Find the least squares estimators:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}. \tag{1}$$
Why dimension reduction?
Let Ŷi = β̂0 + β̂1Xi be the fitted values and ei = Yi − Ŷi the residuals. We can simply plot ei against Xi; this is called the one-dimensional residual plot.
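As a quick illustration, here is a minimal sketch (assuming numpy and matplotlib are available; the simulated data-generating model is hypothetical) of computing (1) and the one-dimensional residual plot:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical one-dimensional data; any sample (X_i, Y_i) would do.
n = 100
X = rng.normal(size=n)
Y = 1.0 + 2.0 * X + rng.normal(scale=0.5, size=n)

# Closed-form least squares estimators from equation (1).
beta1 = np.sum((Y - Y.mean()) * (X - X.mean())) / np.sum((X - X.mean()) ** 2)
beta0 = Y.mean() - beta1 * X.mean()

# Fitted values and residuals for the one-dimensional residual plot.
resid = Y - (beta0 + beta1 * X)
plt.scatter(X, resid)
plt.xlabel("X")
plt.ylabel("residual")
plt.show()
```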
Figure: Residual plot (residuals versus fitted values ŷ).
Why dimension reduction?
What should we do if Xi is a vector in Rp? Currently two methods are in frequent use:
(1) Residual plot: ei versus Ŷi (note that Ŷi is always one-dimensional).
(2) Scatter plot matrix, in which we plot ei against each predictor, and each predictor against any other predictor, forming a (p + 1) × (p + 1) matrix of scatter plots.
However, these methods are intrinsically marginal: they cannot reflect the whole picture of the regression relation. Let us see this through an example.
Why dimension reduction?
Example 1. 100 pairs (X1, Y1), ..., (X100, Y100) are generated from some model, where each Xi = (Xi1, Xi2, Xi3) is a three-dimensional vector in R3.
The scatter plot matrix is shown below. From it, the data appear to have the following features:
(1) Y does not seem to depend on X2;
(2) Y seems to depend on X1 in a nonlinear way;
(3) Y seems to depend on X3 in a nonlinear way.
Why dimension reduction?
Figure: Scatter plot matrix of Y, X1, X2, and X3.
Why dimension reduction?
However, (X, Y) are actually generated from the following model:

$$Y = |X_1 + X_2| + \varepsilon,$$

where ε is independent of X = (X1, X2, X3), which is multivariate normal with mean 0 and covariance matrix

$$\Sigma = \begin{pmatrix} 1 & 0 & 0.8 \\ 0 & 0.2 & 0 \\ 0.8 & 0 & 1 \end{pmatrix}.$$
Note that Y does not depend on X3, and Y does depend on X2.
X2 has a much smaller variance than X1 or X3, while X1 and X3 are highly correlated (correlation 0.8). As a result, the marginal dependence of Y on X3 is induced through X1, and the effect of X2 is masked by its small variance.
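A minimal simulation of this model (a sketch, assuming numpy and matplotlib) makes the point concrete: the marginal scatter plots are misleading, while the plot of Y against X1 + X2 reveals the structure.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Covariance matrix from the text: X2 has small variance,
# and X1, X3 are highly correlated.
Sigma = np.array([[1.0, 0.0, 0.8],
                  [0.0, 0.2, 0.0],
                  [0.8, 0.0, 1.0]])
n = 100
X = rng.multivariate_normal(np.zeros(3), Sigma, size=n)
Y = np.abs(X[:, 0] + X[:, 1]) + rng.normal(scale=0.5, size=n)  # noise scale assumed

# Marginal plots (misleading) versus the sufficient plot (informative).
fig, axes = plt.subplots(1, 4, figsize=(12, 3))
for j in range(3):
    axes[j].scatter(X[:, j], Y)
    axes[j].set_xlabel(f"X{j + 1}")
axes[3].scatter(X[:, 0] + X[:, 1], Y)
axes[3].set_xlabel("X1 + X2")
axes[0].set_ylabel("Y")
plt.tight_layout()
plt.show()
```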
Why dimension reduction?
Once again, the scatter plot matrix cannot capture the true relation between X and Y.
What can truly capture the relation between X and Y is the scatter plot of Y versus X1 + X2.
But how can we make this plot before we know that X1 + X2 is the predictor? This is the question of dimension reduction.
Find the linear combination X1 + X2 before any regression modelling is performed!
Models under dimension reduction structure
Some examples:
Linear model: Y = βTX + ε;
Generalized linear model: Y = g(βTX) + ε for a given monotonic function g;
Single-index model: Y = g(βTX) + ε for an unknown function g;
Multi-index model (1): Y = g(βTX) + ε, where β is a p × q orthogonal matrix;
Multi-index model (2): Y = g1(β1TX) + g2(β2TX)ε, where β = (β1, β2);
Multi-index model (3): Y = g(βTX, ε).
Further observation on dimension reduction structure
In these models, all the information about the response Y is captured by βTX rather than by the original X!
In other words, once βTX is given, no more information about Y can be acquired from the remaining part of X: Y is then conditionally independent of X whenever ε is independent of X.
We call this property conditional independence and write it as Y ⊥⊥ X | βTX.
Based on this, we consider a generic framework in which "model-free" methods can be developed to estimate the parameters of interest and then to establish a model.
Central Subspace
The goal of dimension reduction is to seek β ∈ Rp×q, q < p, such that

Y ⊥⊥ X | βTX.

However, such a β is not unique: as the next proposition shows, it is identified at best up to its column space. To make the notion precise, we first define the column space.
Proposition 1.1. If A is any q × q non-singular matrix, then Y ⊥⊥ X | βTX if and only if Y ⊥⊥ X | (βA)TX. This holds because (βA)TX = AT(βTX) is a one-to-one function of βTX.
Column space. For a matrix B, we denote by S(B) the subspace spanned by the columns of B: if b1, ..., bq are the columns of B, then S(B) consists of all linear combinations c1b1 + ... + cqbq for constants c1, ..., cq.
Central Subspace
If γ is another matrix such that S(β) ⊆ S(γ), then Y ⊥⊥ X | βTX implies Y ⊥⊥ X | γTX. Therefore, we are naturally interested in the smallest dimension reduction space, which achieves the maximal reduction of the dimension of X.
Definition
Definition. If the intersection of all dimension reduction spaces for (X, Y) is itself a dimension reduction space, this space is called the central subspace, written SY|X.
Reference: Cook (1994, 1998). Thus the goal of dimension reduction is to find the central subspace SY|X.
Sufficient plot
Once we know the central subspace, or equivalently its p × q basis matrix β = (β1, ..., βq), a comprehensive scatter plot or residual plot against βTX is informative.
When q = 1, the scatter plot itself is informative; when q = 2, spin (rotating-plot) software gives a comprehensive view of the data. Usually this suffices for most data analyses.
Example 1 (continued). The model is

$$Y = |X_1 + X_2| + \varepsilon.$$

Thus the central subspace is spanned by (1, 1, 0)T. The sufficient plot is the scatter plot of Y versus X1 + X2, shown below.
Sufficient plot
Figure: Sufficient plot: scatter plot of Y versus X1 + X2.
Part II: Ordinary least squares
Assumption
Assumption 2.1. Let β be a p × q matrix whose columns form an orthonormal basis of SY|X. Assume that E(X | βTX) is a linear function of X; that is, for a constant vector c and a p × q matrix C,

$$E(X \mid \beta^T X) = c + C\beta^T X.$$
We first make some notes on the intuition behind and implications of this assumption.
In practice, we do not know β at the outset, so we typically replace this assumption by requiring that E(X | γTX) be linear in X for all γ ∈ Rp×q. This is equivalent to elliptical symmetry of the distribution of X.
Ordinary least squares
Elliptical symmetry can often be approximately achieved by an appropriate transformation of the original data, such as a power transformation or a logarithm; see Cook and Weisberg (1994).
Hall and Li (1993) demonstrated that, if the original dimension p is much larger than the structural dimension q, then E(X | βTX) is approximately linear in X.
Ordinary least squares
In fact, C can be taken to be β:
Theorem
Theorem 2.2. If Assumption 2.1 holds, then

$$E(X \mid \beta^T X) = c + C\beta^T X =: P_\beta(X),$$

where Pβ = ββT is the projection operator onto S(β).
Ordinary least squares
Recall the assumption that E(X) = 0 and var(X) = Ip. Then we have:
Theorem
Theorem 2.3. Suppose that Assumption 2.1 holds. Then E(XY) = βc for some q × 1 vector c, so E(XY) is a vector in SY|X. In other words, E(XY) identifies a vector in the central subspace. (Indeed, E(XY) = E{E(X | βTX)Y} = β E{(βTX)Y} by Theorem 2.2, since c = 0 when E(X) = 0.)
Ordinary least squares
Theorem
Theorem 2.4. Suppose that Y | X follows the model Y = g(βTX) + ε, where β = (β1, ..., βp)T is a vector (q = 1). Suppose that Assumption 2.1 holds. Then E(XY) is proportional to β in the model.
Ordinary least squares
At the population level, the identifying procedure can be described as follows. First, standardize X to Z = Σx^{-1/2}(X − µ). Identify a vector in SY|Z, then transform back to SY|X = Σx^{-1/2} SY|Z. At the sample level, the estimating procedure is as follows.

Step 1. Compute the sample mean and variance, µ̂ = En(X) and Σ̂x = varn(X), and standardize Xi to Ẑi = Σ̂x^{-1/2}(Xi − µ̂).

Step 2. Center Yi to Ỹi = Yi − E(Y), estimated in sample by Ỹi = Yi − En(Y).

Step 3. Let γ̂ = En(Ẑ Ỹ) estimate E(ZY) ∈ SY|Z.

Step 4. Let β̂ = Σ̂x^{-1/2} γ̂ estimate the corresponding vector E(XY) ∈ SY|X.
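A compact sketch of these four steps in numpy (function and variable names are mine, not from the source):

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def ols_direction(X, Y):
    """Estimate a vector in the central subspace S_{Y|X} by OLS."""
    n = len(Y)
    mu = X.mean(axis=0)                    # Step 1: sample mean
    W = inv_sqrt(np.cov(X, rowvar=False))  # Step 1: Sigma_x^{-1/2}
    Z = (X - mu) @ W                       # standardized predictors Z_i
    Yc = Y - Y.mean()                      # Step 2: centered response
    gamma = Z.T @ Yc / n                   # Step 3: gamma-hat = E_n(Z Y)
    return W @ gamma                       # Step 4: beta-hat, a vector in S_{Y|X}
```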
Applications
For the generalized linear model Y = g(βTX) + ε, the single-index model Y = g(βTX) + ε, and the transformation model H(Y) = βTX + ε, the above result shows that OLS can be applied to estimate the direction γ = β/‖β‖, even when the L2-norm ‖β‖ of β is not 1.
As OLS has a closed form, estimating γ is computationally very efficient under the above nonlinear models.
After that, we can define the one-dimensional variable z = γTX, rewrite the model as Y = g(αz) + ε, and estimate α in a one-dimensional setting.
Thus, estimating β = αγ can be performed in this two-step procedure (see Feng and Zhu, 2013, CSDA).
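As a sketch of this two-step procedure for a generalized linear model with a known link (g = exp is chosen purely for illustration; scipy is assumed, and ols_direction is reused from the code above):

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(2)

# Simulate Y = exp(beta^T X) + eps; the true beta here is hypothetical.
n, p = 500, 3
beta = np.array([1.0, 0.5, 0.0])
X = rng.normal(size=(n, p))
Y = np.exp(X @ beta) + rng.normal(scale=0.1, size=n)

# Step 1: estimate the direction gamma = beta / ||beta|| by OLS.
gamma = ols_direction(X, Y)
gamma /= np.linalg.norm(gamma)

# Step 2: reduce to z = gamma^T X and fit the scalar alpha in g(alpha * z).
z = X @ gamma
(alpha,), _ = curve_fit(lambda z, a: np.exp(a * z), z, Y, p0=[1.0])
beta_hat = alpha * gamma  # two-step estimate of beta = alpha * gamma
```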
Part III: Principal Hessian Directions
The biggest disadvantages of OLS are: 1) it can estimate at most one direction in the central subspace; 2) it cannot identify the direction under a symmetric link function such as the one in Y = |βTX| + ε.
For example, for Y = |X1 + X2| + ε we have β = (1, 1, 0)T, but this direction cannot be identified by OLS: by symmetry, E(XY) = 0.
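A quick numerical check (reusing ols_direction and the Example 1 design) confirms this: the OLS vector is essentially noise around zero rather than anything proportional to (1, 1, 0).

```python
import numpy as np

rng = np.random.default_rng(3)

Sigma = np.array([[1.0, 0.0, 0.8],
                  [0.0, 0.2, 0.0],
                  [0.8, 0.0, 1.0]])
n = 5000
X = rng.multivariate_normal(np.zeros(3), Sigma, size=n)
Y = np.abs(X[:, 0] + X[:, 1]) + rng.normal(scale=0.5, size=n)

# For this symmetric link E(XY) = 0, so even with large n the OLS
# vector carries no directional information.
print(ols_direction(X, Y))  # all entries close to 0
```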
OLS plot
Figure: OLS scatter plots of y versus the estimated OLS predictor for n = 20, 50, 100, with estimated directions (−1.13, 0.22, 0.94), (0.02, 0.61, −0.12), and (0.01, 0.11, 1.49) respectively; none is close to the true direction (1, 1, 0).
Principal Hessian Directions
Another method: consider the conditional mean E(Y | X) of Y given X. When E(Y | X) = E(Y | βTX), its second derivative is

$$\frac{\partial^2 E(Y \mid X)}{\partial X \, \partial X^T} = \beta \, \frac{\partial^2 E(Y \mid \beta^T X)}{\partial(\beta^T X)\, \partial(\beta^T X)^T} \, \beta^T,$$

which is a p × p matrix. Now apply Stein's Lemma (1956): when the distribution of X is standard normal,

$$E\!\left(\frac{\partial^2 E(Y \mid X)}{\partial X \, \partial X^T}\right) = E(Y X X^T) = \beta \, E\!\left(\frac{\partial^2 E(Y \mid \beta^T X)}{\partial(\beta^T X)\, \partial(\beta^T X)^T}\right) \beta^T,$$

where the middle factor on the right-hand side is a q × q matrix. The p × p matrix βE(·)βT therefore has at most q non-zero eigenvalues, and the corresponding eigenvectors will be shown to lie in the central subspace.
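For reference, here is the form of Stein's identity being used, in a short sketch (assuming X standard normal, Y centered, and m(x) = E(Y | X = x) twice differentiable): applying the first-order identity E{f(X)X} = E{∇f(X)} twice gives

$$E(YXX^T) = E\{m(X)XX^T\} = E\{\nabla^2 m(X)\} + E\{m(X)\}\, I_p = E\{\nabla^2 m(X)\},$$

where the last step uses E{m(X)} = E(Y) = 0 after centering.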
Principal Hessian Directions
Assumption
Assumption 3.1. Assume that the conditional variance

var(X | βTX) = C

is a non-random p × p matrix.
This assumption is satisfied if X is multivariate normal.
Principal Hessian directions: population development
Let α be the OLS vector E(XY), and let e be the residual from the simple linear regression, that is,

e = Y − αTX.

Note that, in the standardized coordinates, the intercept of the OLS fit is zero, because E(Y) = 0 and E(X) = 0, so E(Y) − αTE(X) = 0. That is why there is no constant term in e.
Definition
Definition 3.1. The matrix H1 = E(YXXT) is called the y-based Hessian matrix; the matrix H2 = E(eXXT) is called the e-based Hessian matrix.
The central result of Part III is that the column space of either Hessian matrix is a subspace of the central subspace.
Principal Hessian directions: population development
Theorem
Theorem 3.1. Suppose that Assumptions 2.1 and 3.1 hold. Then the column space of H1 is a subspace of SY|X.
Theorem
Theorem 3.2. Suppose that Assumptions 2.1 and 3.1 hold. Then the column space of H2 is a subspace of SY|X.
Sample estimator of pHd
Again, we use the idea of first transforming to Z, estimating SY|Z, and then transforming back to SY|X. We summarize the computation in the following steps.

Step 1. Standardize X1, ..., Xn to Ẑ1, ..., Ẑn and center Y1, ..., Yn to Ỹ1, ..., Ỹn, as described in the algorithm for OLS.

Step 2. Compute the OLS of Ỹi versus Ẑi to get α̂:

α̂ = (varn(Ẑ))−1 covn(Ẑ, Ỹ), and α̂0 = En(Ỹ) − α̂TEn(Ẑ) = 0.

Because of the standardization, we have:
Sample estimator of pHd
varn(Ẑ) = Ip and covn(Ẑ, Ỹ) = En(Ẑ Ỹ) − En(Ẑ)En(Ỹ) = En(Ẑ Ỹ).

This means the OLS vector for Ỹ on Ẑ is α̂ = En(Ẑ Ỹ). The residual is êi = Ỹi − α̂TẐi.

Step 3. Construct the y-based and e-based Hessian matrices:

Ĥ1 = En(Ỹ Ẑ ẐT), Ĥ2 = En(ê Ẑ ẐT).

Step 4. Assume, for now, that we know the structural dimension q. Let γ̂1, ..., γ̂q be the q eigenvectors corresponding to the q largest eigenvalues of Ĥ1Ĥ1T, and let δ̂1, ..., δ̂q be the q eigenvectors corresponding to the q largest eigenvalues of Ĥ2Ĥ2T. We use γ̂1, ..., γ̂q and δ̂1, ..., δ̂q as estimators of SY|Z.
Sample estimator of pHd
Let β̂i = Σ̂x−1/2 γ̂i and η̂i = Σ̂x−1/2 δ̂i, for i = 1, ..., q.
We then use β̂1, ..., β̂q and η̂1, ..., η̂q as estimators of SY|X.
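Putting Steps 1 through 4 and the back-transformation together, here is a self-contained numpy sketch (names are mine), applied to Example 1:

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def phd_directions(X, Y, q, use_residual=True):
    """Estimate q central-subspace directions by pHd (Steps 1-4)."""
    n = len(Y)
    W = inv_sqrt(np.cov(X, rowvar=False))       # Sigma_x^{-1/2}
    Z = (X - X.mean(axis=0)) @ W                # Step 1: standardize X
    Yc = Y - Y.mean()                           # Step 1: center Y
    alpha = Z.T @ Yc / n                        # Step 2: OLS vector E_n(Z Y)
    w = Yc - Z @ alpha if use_residual else Yc  # e-based or y-based weights
    H = (Z * w[:, None]).T @ Z / n              # Step 3: Hessian matrix
    vals, vecs = np.linalg.eigh(H @ H.T)        # Step 4: spectral decomposition
    dirs = vecs[:, np.argsort(vals)[::-1][:q]]  # top-q eigenvectors
    return W @ dirs                             # transform back to S_{Y|X}

# Example 1: recover the direction (1, 1, 0) up to scale.
rng = np.random.default_rng(4)
Sigma = np.array([[1.0, 0.0, 0.8], [0.0, 0.2, 0.0], [0.8, 0.0, 1.0]])
X = rng.multivariate_normal(np.zeros(3), Sigma, size=1000)
Y = np.abs(X[:, 0] + X[:, 1]) + rng.normal(scale=0.5, size=1000)
b = phd_directions(X, Y, q=1).ravel()
print(b / np.abs(b).max())  # approximately proportional to (1, 1, 0)
```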
We have assumed that the structural dimension q is known. In practice this must be determined by the data. There are several proposals in the literature.
The following is the pHd scatter plot for the model Y = |X1 + X2| + ε described before.
pHd plot
Figure: pHd scatter plots of y versus the first pHd predictor (r-pHd1) for n = 20, 50, 100, with estimated directions (−0.9, −0.31, 0.27), (0.72, 0.66, −0.17), and (−0.62, −0.74, −0.23) respectively; as n grows, the estimate approaches the true direction (1, 1, 0) up to sign.