Raymond J. Carroll
Texas A&M University
http://stat.tamu.edu/~carroll
Non/Semiparametric Regression and Clustered/Longitudinal Data
Outline
• Series of Semiparametric Problems:
  • Panel data
  • Matched studies
  • Family studies
  • Finance applications
Outline
• General Framework:
  • Likelihood-criterion functions
• Algorithms: kernel-based
• General Results:
  • Semiparametric efficiency
  • Backfitting and profiling
• Splines and kernels: Summary and conjectures
Acknowledgments

Xihong Lin, Harvard University
Basic Problems
• Semiparametric problems
• Parameter of interest, called β
• Unknown function θ(•)
• The key is that the unknown function is evaluated multiple times in computing the likelihood for an individual
Example 1: Panel Data
• i = 1,…,n clusters/individuals
• j = 1,…,m observations per cluster
Subject | Wave 1 | Wave 2 | … | Wave m
1       | X      | X      | … | X
2       | X      | X      | … | X
…       | …      | …      | … | …
n       | X      | X      | … | X
Example 1: Marginal Parametric Model
• Y = Response
• X, Z = time-varying covariates

Y_ij = Z_ij^T β + X_ij^T θ + ε_ij,   cov(ε_i) = Σ

• General Result: We can improve efficiency for (β, θ) by accounting for correlation: Generalized Least Squares (GLS)
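For reference, the GLS estimator invoked here has the standard closed form; a sketch assuming Σ is known, writing W_ij = (Z_ij^T, X_ij^T)^T, γ = (β^T, θ^T)^T, and W_i for the m × dim(γ) matrix whose rows are the W_ij^T:

$$ \hat{\gamma}_{GLS} = \Big( \sum_{i=1}^n W_i^T \Sigma^{-1} W_i \Big)^{-1} \sum_{i=1}^n W_i^T \Sigma^{-1} Y_i $$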
Example 1: Marginal Semiparametric Model
• Y = Response
• X, Z = varying covariates

Y_ij = Z_ij^T β + θ(X_ij) + ε_ij,   cov(ε_i) = Σ

• Question: can we improve efficiency for β by accounting for correlation?
Example 1: Marginal Nonparametric Model
• Y = Response
• X = varying covariate
• θ(•) = unknown function

Y_ij = θ(X_ij) + ε_ij,   cov(ε_i) = Σ

• Question: can we improve efficiency by accounting for correlation? (GLS)
Example 2: Matched Studies
• Prospective logistic model: i = person, S = stratum

pr(Y_iS = 1) = H{δ_S + Z_iS^T β + θ(X_iS)}

• The usual idea is that the stratum-dependent random variables δ_S may have been chosen by an extremely weird process, hence impossible to model.
Example 2: Matched Studies
• The usual likelihood is determined by

pr(Y_0S = 0, Y_1S = 1 | Y_0S + Y_1S = 1) = H{(Z_1S − Z_0S)^T β + θ(X_1S) − θ(X_0S)}

• Note how the conditioning removes δ_S
• Also note: the function θ(•) is evaluated twice per stratum
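To see why the stratum effect drops out, here is the standard one-line conditioning argument, writing η_iS = δ_S + Z_iS^T β + θ(X_iS) and using the logistic form H(u) = e^u / (1 + e^u):

$$ pr(Y_{0S}=0, Y_{1S}=1 \mid Y_{0S}+Y_{1S}=1) = \frac{e^{\eta_{1S}}}{e^{\eta_{0S}} + e^{\eta_{1S}}} = H(\eta_{1S} - \eta_{0S}) $$

and δ_S cancels in the difference η_1S − η_0S.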
Example 3: Model in Finance
• Model in finance:

Y_i = Σ_{j=1}^m β^{j−1} θ(X_ij) + ε_i

• Note how the function θ(•) is evaluated m times for each subject
Example 3: Model in Finance
• Model in finance:

Y_i = Σ_{j=1}^m β^{j−1} θ(X_ij) + ε_i

• Previous literature used an integration estimator, namely first solved via backfitting:

Y_i = Σ_{j=1}^m θ_j(X_ij) + ε_i

• Computation was pretty horrible
• For us: exact computation, general theory
Example 4: Twin Studies
• Family consists of twins, followed longitudinally
• Baseline for each twin modeled nonparametrically via θ(X_i1), θ(X_i2)
• Longitudinal component modeled parametrically via Z_ij1, Z_ij2, β
General Formulation
• These examples all have common features:
• They have a parameter β
• They have an unknown function θ(•)
• The function is evaluated multiple times for each unit (individual, matched pair, family)
• This distinguishes them from standard semiparametric models
General Formulation
• Y_ij = Response
• X_ij, Z_ij = possibly varying covariates
• Loglikelihood (or criterion function):

L{Y_i, Z_i, β, θ(X_i1), …, θ(X_im)}

• All my examples have this criterion function
General Formulation: Examples
• Loglikelihood (or criterion function):

L{Y_i, Z_i, β, θ(X_i1), …, θ(X_im)}

• As stated previously, this is not a standard semiparametric problem, because of the multiple function evaluations
General Formulation: Overview
• Loglikelihood (or criterion function):

L{Y_i, Z_i, β, θ(X_i1), …, θ(X_im)}

• For these problems, I will give constructive methods of estimation with:
  • Asymptotic expansions and inference available
• If the criterion function is a likelihood function, then the methods are semiparametric efficient
• Methods avoid solving integral equations
The Semiparametric Model
• Y = Response
• X, Z = time-varying covariates

Y_ij = Z_ij^T β + θ(X_ij) + ε_ij,   cov(ε_i) = Σ

• Question: can we improve efficiency for β by accounting for correlation, i.e., what method is semiparametric efficient?
Semiparametric Efficiency
• The semiparametric efficient score is readily worked out.
• Involves a Fredholm equation of the 2nd kind
• Effectively impossible to solve directly: involves densities of each X conditional on the others
• The usual device of solving integral equations does not work here (or at least is not worth trying)
The Efficient Score (Yuck!)
X = (X_1, …, X_m),   Z = (Z_1, …, Z_m)

Efficient score for β:

{X − φ(Z)}^T Σ_eff^{−1} {Y − Xβ − θ(Z)}

where φ solves a Fredholm integral equation of the 2nd kind:

Σ_{j=1}^m Σ_{k=1}^m σ_eff^{jk} E[{X_k − φ_k(Z_k)} | Z_j = z] f_j(z) = 0
My Approach
• First pretend that if you knew β, then you could solve for θ(•, β)
• I am going to suggest an algorithm for then estimating θ(•, β)
• I am then going to turn to the question of estimating β
Profiling

• Profile methods work like this:
• Fix β
• Apply your smoother: S(Y − Zβ)
• Call the result θ̂(X, β)
• Maximize the Gaussian loglikelihood function in β:

Σ_{i=1}^n L{Y_i, Z_i, β, θ̂(X_i1, β), …, θ̂(X_im, β)}

• Explicit solution for most smoothers in Gaussian cases
Profiling in Gaussian Problems

• Profile methods maximize

Σ_{i=1}^n L{Y_i, Z_i, β, θ̂(X_i1, β), …, θ̂(X_im, β)}

• This can be difficult numerically in nonlinear problems
• A type of backfitting is often much easier numerically
Backfitting Methods
• Backfitting methods work like this:
• Fix β_old
• Apply your smoother: S(Y − Zβ_old)
• Call the result θ̂(X, β_old)
• Maximize the loglikelihood function in β_new:

Σ_{i=1}^n L{Y_i, Z_i, β_new, θ̂(X_i1, β_old), …, θ̂(X_im, β_old)}

• Iterate until convergence (explicit solution for most smoothers, but different from profiling); a code sketch of the loop follows below
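A minimal numerical sketch of this loop for the Gaussian partially linear model, assuming a Nadaraya-Watson smoother stands in for the generic smoother S; the function and variable names are illustrative, not from the talk:

```python
import numpy as np

def backfit_plm(y, Z, X, h, tol=1e-8, max_iter=200):
    """Backfitting for y = Z beta + theta(X) + eps in the Gaussian case.

    A Nadaraya-Watson smoother matrix S plays the role of 'your smoother'.
    """
    # Smoother matrix: S[i, k] proportional to K((X_i - X_k)/h)
    K = np.exp(-0.5 * ((X[:, None] - X[None, :]) / h) ** 2)
    S = K / K.sum(axis=1, keepdims=True)

    beta = np.zeros(Z.shape[1])
    for _ in range(max_iter):
        theta_hat = S @ (y - Z @ beta)                 # smooth partial residuals
        beta_new = np.linalg.lstsq(Z, y - theta_hat, rcond=None)[0]
        if np.max(np.abs(beta_new - beta)) < tol:      # converged
            return beta_new, theta_hat
        beta = beta_new
    return beta, theta_hat
```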
Backfitting/Profiling Example
• Partially linear model, one function
Y = Z^T β + θ(X) + ε

• Define Ỹ = Y − E(Y | X) and Z̃ = Z − E(Z | X)
• Fit the expectations by local linear kernel regression (or whatever)
Backfitting/Profiling Example
• The Estimators are
• These are numerically different, but asymptotically equivalent
• The equivalence is a subtle calculation, even in this simple context
β̂_B = (Z^T Z̃)^{−1} Z^T Ỹ   (backfitting)
β̂_P = (Z̃^T Z̃)^{−1} Z̃^T Ỹ   (profiling)

with Z̃ and Ỹ the smoothed-out versions of Z and Y defined above; see the numerical sketch below
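A small simulation sketch of the two formulas, with hypothetical data and a Nadaraya-Watson smoother matrix S standing in for the kernel fit, shows they are numerically different but close:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.uniform(-2, 2, n)
Z = np.sin(X) + rng.normal(size=n)          # Z correlated with X
y = 1.5 * Z + np.cos(2 * X) + rng.normal(size=n)

h = 0.3
K = np.exp(-0.5 * ((X[:, None] - X[None, :]) / h) ** 2)
S = K / K.sum(axis=1, keepdims=True)        # linear smoother matrix

Z_t = Z - S @ Z                             # Z-tilde = Z - E-hat(Z | X)
y_t = y - S @ y                             # Y-tilde = Y - E-hat(Y | X)
beta_B = (Z @ y_t) / (Z @ Z_t)              # backfitting fixed point
beta_P = (Z_t @ y_t) / (Z_t @ Z_t)          # profiling
print(beta_B, beta_P)                       # numerically different, both near 1.5
```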
Backfitting/Profiling Example
• The asymptotic equivalence of profiling and backfitting in this partially linear model has one subtlety
• Profiling: off-the-shelf smoothers are OK
• Backfitting: off-the-shelf smoothers need to be undersmoothed to get rid of asymptotic bias
Backfitting/Profiling
• Hu et al. (2004, Biometrika) showed that in general problems:
  • Backfitting is generally more variable than profiling, for linear-type problems
  • Backfitting and profiling need not have the same limit distributions
General Formulation: Revisited
• Y_ij = Response
• X_ij, Z_ij = varying covariates
• Loglikelihood (or criterion function):

L{Y_i, Z_i, β, θ(X_i1), …, θ(X_im)}

• The key is that the function θ(•) is evaluated multiple times for each individual
• The goal is to estimate θ(•) and β efficiently
General Formulation: Revisited
• What I want to show you is a constructive solution, i.e., one that can be computed:
  • Different from solving integral equations
  • Completely general
  • Theoretically sound
• The methodology is based on kernel methods, i.e., local methods.
• First a little background
Simple Local Likelihood
• Consider a nonparametric regression with iid data
Y_i = θ(X_i) + ε_i

• The loglikelihood function is

L{Y_i, θ(X_i)} = −(1/2) {Y_i − θ(X_i)}²
Simple Local Likelihood
• Let K be a density function, and h a bandwidth
• Your target is the function at x
• The kernel weights for local likelihood are K{(X_i − x)/h}
• If K is the uniform density, only observations within h of x get any weight
Simple Local Likelihood
Only observations within h = 0.25 of x = -1.0 get any weight
Simple Local Likelihood
• Near x, the function should be nearly linear
• The idea then is to do a likelihood estimate local to x via weighting, i.e., maximize

Σ_{i=1}^n K{(X_i − x)/h} L{Y_i, α_0 + α_1 (X_i − x)}

• Then announce θ̂(x) = α̂_0
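In the Gaussian case this local maximization reduces to weighted least squares, so a minimal sketch of computing θ̂(x) is as follows (illustrative names; local linear fit with a Gaussian kernel):

```python
import numpy as np

def local_linear(x0, X, y, h):
    """Local likelihood at x0 in the Gaussian case: maximize the
    kernel-weighted loglikelihood in (alpha0, alpha1), announce alpha0-hat."""
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)       # kernel weights K((X_i - x0)/h)
    D = np.column_stack([np.ones_like(X), X - x0])
    WD = D * w[:, None]
    alpha = np.linalg.solve(D.T @ WD, WD.T @ y)  # weighted normal equations
    return alpha[0]                              # theta-hat(x0) = alpha0-hat
```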
Simple Local Likelihood
• In the linear model, local likelihood is local linear regression
• It is essentially equivalent to loess, splines, etc.
• I’ll now use local likelihood ideas to solve the general problem
General Formulation: Revisited
• Likelihood (or criterion function):

L{Y_i, Z_i, β, θ(X_i1), …, θ(X_im)}

• The goal is to estimate the function at a target value t
• Fix β. Pretend that the formulation involves m different functions:

L{Y_i, Z_i, β, θ_1(X_i1), …, θ_m(X_im)}
General Formulation: Revisited
• Pretend that the formulation involves m different functions:

L{Y_i, Z_i, β, θ_1(X_i1), …, θ_m(X_im)}

• Pretend that θ_2(X_i2), …, θ_m(X_im) are known
• Fit a local linear regression via local likelihood: θ_1(X_i1) ≈ α_0 + α_1 (X_i1 − t)
• Get the local score function A_i1(α_0, α_1) for (α_0, α_1)
General Formulation: Revisited
• Repeat: Pretend knowing θ_1(X_i1), …, θ_{j−1}(X_i,j−1), θ_{j+1}(X_i,j+1), …, θ_m(X_im)
• Fit a local linear regression: θ_j(X_ij) ≈ α_0 + α_1 (X_ij − t)
• Get the local score function A_ij(α_0, α_1)
• Finally, solve

0 = Σ_{i=1}^n Σ_{j=1}^m A_ij(α_0, α_1)

• Explicit solution in the Gaussian cases; a code sketch follows below
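A minimal iterated sketch of solving this stacked score equation in the Gaussian case, for the model Y_ij = θ(X_ij) + ε_ij with cov(ε_i) = Σ; it uses a local-constant fit rather than the local linear fit above for brevity, and the grid/interpolation device is an illustrative assumption:

```python
import numpy as np

def iterated_local_theta(t_grid, X, Y, Sig_inv, h, n_iter=20):
    """Iterated local likelihood for Y_ij = theta(X_ij) + eps_ij, cov(eps_i) = Sigma.

    X, Y are (n, m); Sig_inv is the m x m inverse covariance. At each target t,
    solve the stacked local score 0 = sum_ij A_ij pretending the other function
    values are known at their current estimates, then iterate.
    """
    n, m = X.shape
    theta_grid = np.full(len(t_grid), Y.mean())      # flat starting values
    for _ in range(n_iter):
        # current theta-hat at all design points, by interpolation on the grid
        theta_X = np.interp(X.ravel(), t_grid, theta_grid).reshape(n, m)
        new = np.empty(len(t_grid))
        for g, t in enumerate(t_grid):
            K = np.exp(-0.5 * ((X - t) / h) ** 2)    # n x m kernel weights
            num = den = 0.0
            for j in range(m):
                r = Y - theta_X
                r[:, j] = Y[:, j]                    # slot j is the unknown a0
                score = r @ Sig_inv[j]               # sum_k sigma^{jk} r_ik
                num += np.sum(K[:, j] * score)
                den += np.sum(K[:, j]) * Sig_inv[j, j]
            new[g] = num / den                       # closed-form Gaussian update
        theta_grid = new
    return theta_grid
```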
Main Results
• Semiparametric efficient for β
• Backfitting (under-smoothed) = profiling
• The equivalence of backfitting and profiling is not obvious in the general case
Main Results
• Explicit variance formulae
• High-order expansions for parameters and functions, used for estimating population quantities such as population means, etc.
Marginal Approaches
• The most standard approach is a marginal one
• Often, we can write, for known G,

E(Y_ij | Z_ij, X_ij) = G{Z_ij, β, θ(X_ij)}

• Similar would be to write the likelihood function for single observations:

L{Y_ij, Z_ij, β, θ(X_ij)}
Marginal Approaches
• The marginal approaches ignore the correlation structure
• Lots, and lots, and lots of papers
• Methods tend to be very inefficient if the correlation structure is important
Econometric Example
• In panel data, interest can be in random-fixed effects models
• Our usual variance components model, where δ_i is independent of everything:

Y_ij = Z_ij^T β + θ(X_ij) + δ_i + ε_ij

• If so, this is a version of our partially linear model, hence already solved by us
Econometric Example
• Econometricians, though, worry that δ_i is correlated with Z or X:

Y_ij = Z_ij^T β + θ(X_ij) + δ_i + ε_ij

• This says that δ_i represents unmeasured variables. This is the fixed-effects model
• They want to know the effects of (X, Z), controlling for individual factors
Econometric Example
• Starting model:

Y_ij = Z_ij^T β + θ(X_ij) + δ_i + ε_ij

• Get rid of the δ_i terms by differencing, e.g.,

Y_ij − Y_i1 = (Z_ij − Z_i1)^T β + θ(X_ij) − θ(X_i1) + ε_ij − ε_i1

• A special case of our model!
Econometric Example
• Model:

Y_ij − Y_i1 = (Z_ij − Z_i1)^T β + θ(X_ij) − θ(X_i1) + ε_ij − ε_i1

• The ε_ij − ε_i1 terms are correlated over j = 2,…,m
• The variance efficiency loss from ignoring these correlations is (2+m)/4
Econometric Example
• Example: China Health and Nutrition Survey
• No parametric part
• Response Y = caloric intake (log scale)
• Predictor X = income
• Initial random effects model result suggests that for very low incomes, an increase in income is NOT associated with an increase in calories
Econometric Example
• Random effects model suggests that for very low incomes, an increase in income is NOT associated with an increase in calories
• The fixed effects model fits with economic theory and common sense
• Specification test confirms this
Econometric Example
• The fixed effects cubic regression fit is far too steep at either end.
• The nonparametric fit makes much more sense
Remarks on Splines
• Splines are a practical alternative to kernels
• Penalized splines (smoothing splines, P-splines, etc.) with penalty parameter λ:
  • Easy to develop, very flexible
  • Computable, truly nonparametric
  • Difficult theory (Mammen & van de Geer; Mammen & Nielsen)
  • For smoothing splines in the partially linear model, for example, they are equivalent to kernel methods
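As a concrete illustration of the penalized-spline idea, here is a generic sketch (truncated-power basis with a ridge penalty λ on the knot coefficients; a common textbook formulation, not necessarily the variants compared in the talk):

```python
import numpy as np

def pspline_fit(x, y, n_knots=20, lam=1.0, degree=1):
    """P-spline: truncated power basis plus ridge penalty lam on knot coefficients."""
    knots = np.quantile(x, np.linspace(0, 1, n_knots + 2)[1:-1])  # interior knots
    B = np.column_stack([x ** p for p in range(degree + 1)] +
                        [np.maximum(x - k, 0.0) ** degree for k in knots])
    D = np.diag([0.0] * (degree + 1) + [1.0] * n_knots)  # penalize knot terms only
    coef = np.linalg.solve(B.T @ B + lam * D, B.T @ y)   # penalized normal equations
    return B @ coef                                       # fitted values
```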
Remarks on Splines
• Unpenalized splines
• There are theoretical results for non-penalized splines
• These methods assume fixed, known knots
• Then slowly grow the number of knots
• Theoretically equivalent to our methods
• The theory, and the method, are irrelevant
Unpenalized Splines
No penalty and standard number of knots = crazy curves
Unpenalized Splines
• The theoretical results for unpenalized splines require that the relationship between the number of knots k and the sample size n be

k ~ n^{1/5}, i.e., n ~ k^5   (n = 16,807 corresponds to k = 7)

• Every paper in this area does data analysis with ≤ 5 knots. Why?
Splines With Knot Selection
• There is a nice literature on using fixed-knot splines but with the knots selected
• Basically, use model-selection techniques to zero out some of the coefficients
• This gets the smoothness back
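One hedged sketch of that idea, using the lasso to zero out truncated-power coefficients (the lasso is just one of several model-selection devices used in this literature; names are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

def knot_selected_spline(x, y, n_knots=30, alpha=0.01):
    """Fixed-knot linear spline with knots 'selected' by lasso-zeroed coefficients."""
    knots = np.quantile(x, np.linspace(0, 1, n_knots + 2)[1:-1])
    B = np.column_stack([x] + [np.maximum(x - k, 0.0) for k in knots])
    model = Lasso(alpha=alpha, max_iter=50_000).fit(B, y)  # intercept fit by sklearn
    kept = np.flatnonzero(model.coef_[1:])                 # surviving knot terms
    return model, knots[kept]
```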
Conclusions
• General likelihood:

L{Y_i, Z_i, β, θ(X_i1), …, θ(X_im)}

• Distinguishing property: unknown function evaluated repeatedly for each individual
• Kernel method: iterated local likelihood calculations, explicit solution in Gaussian cases
Conclusions
• General results:
• Semiparametric efficient: construction, no integral equations need to be solved
• Backfitting and profiling: asymptotically equivalent
Conclusions
• Smoothing Splines and Kernels: Asymptotically the same in the Gaussian case
• Splines: generally easier to compute, although smoothing parameter selection can be intensive
• Unpenalized splines: irrelevant theory, need knot selection
Conclusions
• Splines and Kernels: One might conjecture that splines can be constructed for the general problem that are asymptotically efficient
• Open Problem: is this true, and how?
Thanks!
http://stat.tamu.edu/~carroll
Conjectured Approach
• Mammen and Nielsen worked in a nonlinear least squares context with multiple functions:

E(Y_ij | Z_ij, X_ij) = G{Z_ij, θ(X_ij)}

• Roughly, the obvious version of their method is

argmax_a Σ_{i=1}^n ∫ L{Y_i, Z_i, a(x_1), …, a(x_m)} Π_{j=1}^m K{(X_ij − x_j)/h} dx_1 ⋯ dx_m

• Both methods are semiparametric efficient when profiled
Conjectured Approach
• Roughly, the obvious version of the Mammen and Nielsen method is
argmax_a Σ_{i=1}^n ∫ L{Y_i, Z_i, a(x_1), …, a(x_m)} Π_{j=1}^m K{(X_ij − x_j)/h} dx_1 ⋯ dx_m

• This can be used for the model

Y_i = Σ_{j=1}^m β^{j−1} θ(X_ij) + ε_i