Expectile and Quantile Regression and Other …indico.ictp.it/event/a12175/session/41/contribution/27/...Expectile and Quantile Regression and Other Extensions Lawrence Kazembe University

2453-14

School on Modelling Tools and Capacity Building in Climate and Public Health

KAZEMBE Lawrence

15 - 26 April 2013

University of Malawi Chancellor College

Faculty of Science Department of Mathematical Sciences 18 Chirunga Road, 0000 Zomba

MALAWI

Expectile and Quantile Regression and Other Extensions

Expectile and Quantile Regression

and Other Extensions

Lawrence Kazembe

University of Namibia

Windhoek, Namibia

A presentation atSchool on Modelling Tools and Capacity Building in

Climate and Public Health

ICTP, Trieste, Italy

1

Objectives and Questions

• Objective:Introduce other regression beyond the mean

• Are there median regression models

• What about regression at the mode?

• Can we fit regression model at any other locationstoo?

• What of the variance, skewness and kurtosis?

2

Prelimaries

• Given a r.v. y ∼ f(·) then

-E(Y ) = μ defines the mean;

-V ar(Y ) = σ2 is the variance.

• General assumption: two measures completely de-

fine the distribution (stationarity concept)

• Classical regression is often characterized by relat-

ing to the mean

-E(y|x) = βx if x is the set of covariates.

-y ∼ N(μ, σ2), then μ = βx

3

-y ∼ Bern(π), then log(

π1−π

)= βx

-y ∼ Pois(λ), then log(λ) = βx.

• Statisticians are mean lovers (Friedman, Fried-man & Amoo)

• It goes like:We are ”mean” lovers. Deviation is considered normal. We are right 95%

of the time. Statisticians do it discretely and continuously. We can legally

comment on someone’s posterior distribution.

• Why the mean?-Mean regression reduces complexity-However, the mean is not sufficient to describe adistribution.

Examples Plot (Rent in Munich, Kneib et al)

4

Examples Quantile Plot (Rent in Munich, Kneib

et al)

5

Example (Malaria vs Altitude , Kazembe et al)

6

Example (Botswana data)

7

Motivating Examples

• Epidemiology and Public Health:

-Investigating height, weight and body mass index

as a function of different covariates e.g. age (Wei

et al. 2006)

-Exploring stunting curves in African and Indian

children (Yue et al. 2012)

-Generating age-specific centile charts (Chitty et al.

1994).

• Economics:

-Study of determinants of wages (Koenker, 2005).

8

• Education:

-The performance of students in public schools (Koenker

and Hallock, 2001).

• Climate data:

- Spatiotemporal analysis of Boston temperature

(Reich 2012)

Double GLM

• Classical regression assumes a homogeneous vari-

ance

y|x, ε ∼ N(βx, σ2)

ε ∼ N(0, σ2)

• However variance heterogeneity (heteroscedasticity)

is an order in real-life problems

• The variance of the response may depend on the

covariates

9

• Normal regression example:

-Regression for mean and variance of a normal dis-

tribution

yi = ηi1 + exp(η2i)εi, εi ∼ N(0,1)

such that

E(yi|xi) = η1i

V ar(yi|xi) = exp(η2i)2

Regression for location, scale and shape

• In general: Specify a distribution for the response

on all parameters and relate to the predictors.

• The generalized additive model location, scale and

shape (GAMLSS) is a statistical model developed

by Rigby and Stasinopoulos (2005).

• For a probability (density) function f(yi|μi, σi, νi, τi)conditional on (μi, σi, νi, τi) a vector of four distri-

bution parameters, each of which can be a function

10

to the explanatory variables (X)

g1(μ) = η1 = β1X1 +J1∑j=1

hj1(xj1) (1)

g2(σ) = η2 = β2X2 +J2∑j=1

hj2(xj2) (2)

g3(ν) = η3 = β3X3 +J3∑j=1

hj3(xj3) (3)

g4(τ) = η4 = β4X4 +J4∑j=1

hj4(xj4) (4)

• GAMLSS assumes different models for each param-

eter

-Model 1: Mean regression

-Model 2: Dispersion regression

-Model 3: Skewness regression

-Model 4: Kurtosis regression

• Other summary measures are also permissible

Quantile Regression

• Quantile, Centile, and Percentile are terms that canbe used interchangeably-A 0.5 quantile ≡ 50 percentile, which is a median.

• Related terms are quartiles, quintiles and deciles:divides distribution into 4, 5, and 10 parts

• Quantiles are related to the median.

• Suppose Y has a cumulative distribution F (y) =P (Y ≤ τ). Then τth quantile of Y is defined to be

Q(τ) =∫ τ

−∞f(y)dy = inf{y : τ ≤ F (y)}

11

for 0 < τ < 1.

• The quantile regression drops the parametric as-sumption for the error/ response distribution.

• Fit separate models for different asymmetries τ ∈[0,1]:

yi = ηiτ + εiτ

• Instead of assuming E(yi ≤ ηiτ = 0, one assumes

P (εiτ ≤ 0) = τ

or

Fyi(ηiτ) = P (εiτ ≤ 0) = τ

• This gives a set of regression function at any as-

sumed quantile value.

• Assumptions:

-it is distribution-free since it does not make any

specific assumption on the type of errors

-it does not even require i.i.d errors

-it allows for heterogeneity.

Expectile Regression

• Expectiles are related to the mean, as are quantiles

related to the median.

∑ni=1 |yi − ηi| → min

∑ni=1wτ(yi, ηiτ)|yi − ηiτ | → min

median regression quantile regression

∑ni=1 |yi − ηi|2 → min

∑ni=1wτ(yi, ηiτ)|yi − ηiτ |2 → min

mean regression expectile regression

where wτ is the check function defined by

12

wτ =

⎧⎨⎩τ if yi > μ(τ)

1− τ if yi ≤ μ(τ)

for some population expectile μ(τ) for different val-ues of an asymmetric parameter 0 < τ < 1.

• Expectiles are obtained by solving

τ =

∫ eτ−∞ |y − eτ |fy(y)dy∫∞−∞ |y − eτ |fy(y)dy=

Gy(eτ)− eτFy(eτ)

2(Gy(eτ)− eτFy(eτ)) + (eτ − μ)

where-fy(·) and Fy(·) denote the density and cumulativedistribution function of y.-Gy(e) =

∫ e−∞ yfy(y)dy is the partial moment func-tion of y and-Gy(∞) = μ is the expectation of y.

Worked example - binomial data

13

500 1500

−1

01

2

ALT

s(A

LT, d

f = c

(2, 4

, 2))

:1

500 1500

−0.

40.

00.

20.

40.

60.

8

ALT

s(A

LT, d

f = c

(2, 4

, 2))

:2

500 1500

−0.

20.

00.

10.

2

ALT

s(A

LT, d

f = c

(2, 4

, 2))

:3500 1500

−0.

10−

0.05

0.00

0.05

0.10

ALT

s(A

LT, d

f = c

(2, 4

, 2))

:4

500 1500

−0.

3−

0.2

−0.

10.

00.

1

ALT

s(A

LT, d

f = c

(2, 4

, 2))

:5

500 1500

−0.

3−

0.2

−0.

10.

00.

1ALT

s(A

LT, d

f = c

(2, 4

, 2))

:6

8 10 14

−1.

0−

0.5

0.0

0.5

1.0

1.5

TMIN

s(T

MIN

, df =

c(2

, 4, 2

)):1

8 10 14

−0.

40.

00.

20.

40.

60.

8

TMIN

s(T

MIN

, df =

c(2

, 4, 2

)):2

8 10 14

−0.

10.

00.

10.

20.

30.

4

TMIN

s(T

MIN

, df =

c(2

, 4, 2

)):3

8 10 14

−0.

2−

0.1

0.0

0.1

0.2

TMIN

s(T

MIN

, df =

c(2

, 4, 2

)):4

8 10 14

−0.

5−

0.3

−0.

10.

1

TMIN

s(T

MIN

, df =

c(2

, 4, 2

)):5

8 10 14

−0.

5−

0.3

−0.

10.

1TMIN

s(T

MIN

, df =

c(2

, 4, 2

)):6

14

25 30 35

−2

−1

01

TMAX

s(T

MA

X, d

f = c

(2, 4

, 2))

:1

25 30 35

−0.

8−

0.4

0.0

0.4

TMAX

s(T

MA

X, d

f = c

(2, 4

, 2))

:2

25 30 35

−0.

20.

00.

20.

4

TMAX

s(T

MA

X, d

f = c

(2, 4

, 2))

:325 30 35

−0.

100.

000.

100.

20

TMAX

s(T

MA

X, d

f = c

(2, 4

, 2))

:4

25 30 35

−0.

2−

0.1

0.0

0.1

0.2

TMAX

s(T

MA

X, d

f = c

(2, 4

, 2))

:5

25 30 35

−0.

050.

000.

050.

100.

15TMAX

s(T

MA

X, d

f = c

(2, 4

, 2))

:615

Reference

• Hudson, I. L., Rea, A., Dalrymple, M. L., Eilers, P. H. C(2008). Climate impacts on sudden infant death syndrome:a GAMLSS approach. Proceedings of the 23rd internationalworkshop on statistical modelling pp. 277280.

• Beyerlein, A., Fahrmeir, L., Mansmann, U., Toschke., A. M(2008). Alternative regression models to assess increase inchildhood BM. BMC Medical Research Methodology, 8(59).

• Smyth, G. K (1989). Generalized linear models with varyingdispersion. J. R. Statist. Soc. B, 51, 4760.

• Sobotka, F., and T. Kneib, 2010. Geoadditive Expectile Re-gression. Computational Statistics and Data Analysis, doi:10.1016/j.csda.2010.11.015.

16

Documents

Expectile and Quantile Regression and Other …indico.ictp.it/event/a12175/session/41/contribution/27/...Expectile and Quantile Regression and Other Extensions Lawrence Kazembe University