Upload
others
View
39
Download
0
Embed Size (px)
Citation preview
2453-14
School on Modelling Tools and Capacity Building in Climate and Public Health
KAZEMBE Lawrence
15 - 26 April 2013
University of Malawi Chancellor College
Faculty of Science Department of Mathematical Sciences 18 Chirunga Road, 0000 Zomba
MALAWI
Expectile and Quantile Regression and Other Extensions
Expectile and Quantile Regression
and Other Extensions
Lawrence Kazembe
University of Namibia
Windhoek, Namibia
A presentation atSchool on Modelling Tools and Capacity Building in
Climate and Public Health
ICTP, Trieste, Italy
1
Objectives and Questions
• Objective:Introduce other regression beyond the mean
• Are there median regression models
• What about regression at the mode?
• Can we fit regression model at any other locationstoo?
• What of the variance, skewness and kurtosis?
2
Prelimaries
• Given a r.v. y ∼ f(·) then
-E(Y ) = μ defines the mean;
-V ar(Y ) = σ2 is the variance.
• General assumption: two measures completely de-
fine the distribution (stationarity concept)
• Classical regression is often characterized by relat-
ing to the mean
-E(y|x) = βx if x is the set of covariates.
-y ∼ N(μ, σ2), then μ = βx
3
-y ∼ Bern(π), then log(
π1−π
)= βx
-y ∼ Pois(λ), then log(λ) = βx.
• Statisticians are mean lovers (Friedman, Fried-man & Amoo)
• It goes like:We are ”mean” lovers. Deviation is considered normal. We are right 95%
of the time. Statisticians do it discretely and continuously. We can legally
comment on someone’s posterior distribution.
• Why the mean?-Mean regression reduces complexity-However, the mean is not sufficient to describe adistribution.
Examples Plot (Rent in Munich, Kneib et al)
4
Examples Quantile Plot (Rent in Munich, Kneib
et al)
5
Example (Malaria vs Altitude , Kazembe et al)
6
Example (Botswana data)
7
Motivating Examples
• Epidemiology and Public Health:
-Investigating height, weight and body mass index
as a function of different covariates e.g. age (Wei
et al. 2006)
-Exploring stunting curves in African and Indian
children (Yue et al. 2012)
-Generating age-specific centile charts (Chitty et al.
1994).
• Economics:
-Study of determinants of wages (Koenker, 2005).
8
• Education:
-The performance of students in public schools (Koenker
and Hallock, 2001).
• Climate data:
- Spatiotemporal analysis of Boston temperature
(Reich 2012)
Double GLM
• Classical regression assumes a homogeneous vari-
ance
y|x, ε ∼ N(βx, σ2)
ε ∼ N(0, σ2)
• However variance heterogeneity (heteroscedasticity)
is an order in real-life problems
• The variance of the response may depend on the
covariates
9
• Normal regression example:
-Regression for mean and variance of a normal dis-
tribution
yi = ηi1 + exp(η2i)εi, εi ∼ N(0,1)
such that
E(yi|xi) = η1i
V ar(yi|xi) = exp(η2i)2
Regression for location, scale and shape
• In general: Specify a distribution for the response
on all parameters and relate to the predictors.
• The generalized additive model location, scale and
shape (GAMLSS) is a statistical model developed
by Rigby and Stasinopoulos (2005).
• For a probability (density) function f(yi|μi, σi, νi, τi)conditional on (μi, σi, νi, τi) a vector of four distri-
bution parameters, each of which can be a function
10
to the explanatory variables (X)
g1(μ) = η1 = β1X1 +J1∑j=1
hj1(xj1) (1)
g2(σ) = η2 = β2X2 +J2∑j=1
hj2(xj2) (2)
g3(ν) = η3 = β3X3 +J3∑j=1
hj3(xj3) (3)
g4(τ) = η4 = β4X4 +J4∑j=1
hj4(xj4) (4)
• GAMLSS assumes different models for each param-
eter
-Model 1: Mean regression
-Model 2: Dispersion regression
-Model 3: Skewness regression
-Model 4: Kurtosis regression
• Other summary measures are also permissible
Quantile Regression
• Quantile, Centile, and Percentile are terms that canbe used interchangeably-A 0.5 quantile ≡ 50 percentile, which is a median.
• Related terms are quartiles, quintiles and deciles:divides distribution into 4, 5, and 10 parts
• Quantiles are related to the median.
• Suppose Y has a cumulative distribution F (y) =P (Y ≤ τ). Then τth quantile of Y is defined to be
Q(τ) =∫ τ
−∞f(y)dy = inf{y : τ ≤ F (y)}
11
for 0 < τ < 1.
• The quantile regression drops the parametric as-sumption for the error/ response distribution.
• Fit separate models for different asymmetries τ ∈[0,1]:
yi = ηiτ + εiτ
• Instead of assuming E(yi ≤ ηiτ = 0, one assumes
P (εiτ ≤ 0) = τ
or
Fyi(ηiτ) = P (εiτ ≤ 0) = τ
• This gives a set of regression function at any as-
sumed quantile value.
• Assumptions:
-it is distribution-free since it does not make any
specific assumption on the type of errors
-it does not even require i.i.d errors
-it allows for heterogeneity.
Expectile Regression
• Expectiles are related to the mean, as are quantiles
related to the median.
∑ni=1 |yi − ηi| → min
∑ni=1wτ(yi, ηiτ)|yi − ηiτ | → min
median regression quantile regression
∑ni=1 |yi − ηi|2 → min
∑ni=1wτ(yi, ηiτ)|yi − ηiτ |2 → min
mean regression expectile regression
where wτ is the check function defined by
12
wτ =
⎧⎨⎩τ if yi > μ(τ)
1− τ if yi ≤ μ(τ)
for some population expectile μ(τ) for different val-ues of an asymmetric parameter 0 < τ < 1.
• Expectiles are obtained by solving
τ =
∫ eτ−∞ |y − eτ |fy(y)dy∫∞−∞ |y − eτ |fy(y)dy=
Gy(eτ)− eτFy(eτ)
2(Gy(eτ)− eτFy(eτ)) + (eτ − μ)
where-fy(·) and Fy(·) denote the density and cumulativedistribution function of y.-Gy(e) =
∫ e−∞ yfy(y)dy is the partial moment func-tion of y and-Gy(∞) = μ is the expectation of y.
Worked example - binomial data
13
500 1500
−1
01
2
ALT
s(A
LT, d
f = c
(2, 4
, 2))
:1
500 1500
−0.
40.
00.
20.
40.
60.
8
ALT
s(A
LT, d
f = c
(2, 4
, 2))
:2
500 1500
−0.
20.
00.
10.
2
ALT
s(A
LT, d
f = c
(2, 4
, 2))
:3500 1500
−0.
10−
0.05
0.00
0.05
0.10
ALT
s(A
LT, d
f = c
(2, 4
, 2))
:4
500 1500
−0.
3−
0.2
−0.
10.
00.
1
ALT
s(A
LT, d
f = c
(2, 4
, 2))
:5
500 1500
−0.
3−
0.2
−0.
10.
00.
1ALT
s(A
LT, d
f = c
(2, 4
, 2))
:6
8 10 14
−1.
0−
0.5
0.0
0.5
1.0
1.5
TMIN
s(T
MIN
, df =
c(2
, 4, 2
)):1
8 10 14
−0.
40.
00.
20.
40.
60.
8
TMIN
s(T
MIN
, df =
c(2
, 4, 2
)):2
8 10 14
−0.
10.
00.
10.
20.
30.
4
TMIN
s(T
MIN
, df =
c(2
, 4, 2
)):3
8 10 14
−0.
2−
0.1
0.0
0.1
0.2
TMIN
s(T
MIN
, df =
c(2
, 4, 2
)):4
8 10 14
−0.
5−
0.3
−0.
10.
1
TMIN
s(T
MIN
, df =
c(2
, 4, 2
)):5
8 10 14
−0.
5−
0.3
−0.
10.
1TMIN
s(T
MIN
, df =
c(2
, 4, 2
)):6
14
25 30 35
−2
−1
01
TMAX
s(T
MA
X, d
f = c
(2, 4
, 2))
:1
25 30 35
−0.
8−
0.4
0.0
0.4
TMAX
s(T
MA
X, d
f = c
(2, 4
, 2))
:2
25 30 35
−0.
20.
00.
20.
4
TMAX
s(T
MA
X, d
f = c
(2, 4
, 2))
:325 30 35
−0.
100.
000.
100.
20
TMAX
s(T
MA
X, d
f = c
(2, 4
, 2))
:4
25 30 35
−0.
2−
0.1
0.0
0.1
0.2
TMAX
s(T
MA
X, d
f = c
(2, 4
, 2))
:5
25 30 35
−0.
050.
000.
050.
100.
15TMAX
s(T
MA
X, d
f = c
(2, 4
, 2))
:615
Reference
• Hudson, I. L., Rea, A., Dalrymple, M. L., Eilers, P. H. C(2008). Climate impacts on sudden infant death syndrome:a GAMLSS approach. Proceedings of the 23rd internationalworkshop on statistical modelling pp. 277280.
• Beyerlein, A., Fahrmeir, L., Mansmann, U., Toschke., A. M(2008). Alternative regression models to assess increase inchildhood BM. BMC Medical Research Methodology, 8(59).
• Smyth, G. K (1989). Generalized linear models with varyingdispersion. J. R. Statist. Soc. B, 51, 4760.
• Sobotka, F., and T. Kneib, 2010. Geoadditive Expectile Re-gression. Computational Statistics and Data Analysis, doi:10.1016/j.csda.2010.11.015.
16