On the Estimation of the Conditional L1-Median
Muna Yaqoub Mohamed Quraiq
Supervised by
Prof. Raid B. Salha
Prof. of Mathematical Statistics
A thesis submitted in partial fulfillment of the requirements for the degree of
Master of Science in Mathematics
March / 2017
The Islamic University–Gaza
Research and Postgraduate Affairs
Faculty of Science
Master of Mathematics
Mathematical Statistics
Declaration (in Arabic)

I, the undersigned, presenter of the thesis entitled "On the Estimation of the Conditional L1-Median", declare that the work contained in this thesis is the product of my own effort, except where reference is made to the work of others, and that this thesis as a whole, or any part of it, has not previously been submitted by me or by others for a degree or a scientific or research title at any other educational or research institution.
Declaration
I understand the nature of plagiarism, and I am aware of the University’s
policy on this.
The work provided in this thesis, unless otherwise referenced, is the
researcher's own work, and has not been submitted by others elsewhere
for any other degree or qualification.
Student's name: Muna Yaqoub Mohamed Quraiq
Signature: Muna Quraiq
Date: 4/3/2017
Abstract
In this thesis, we study the conditional L1-median, which plays an important role in nonparametric prediction.

An estimator of the L1-median based on the Nadaraya-Watson estimator of the conditional distribution function has been studied, and its consistency has been derived under some conditions.

Another estimator, based on the double kernel estimator of the conditional distribution function, has been proposed in order to obtain a smoother estimator than that based on the Nadaraya-Watson estimator.

The performance of the two estimators has been tested using two bivariate time series data sets. The comparison indicated that the double kernel estimator gives smooth prediction curves, while the Nadaraya-Watson estimator gives a smaller mean square error.
Abstract in Arabic

This study is concerned with the conditional median, which plays an important role in nonparametric prediction. An estimator of the median based on the Nadaraya-Watson estimator of the conditional distribution function has been studied, and its consistency has been proved under certain conditions. Another estimator, based on the double kernel estimator of the conditional distribution function, has been proposed in order to obtain a smoother estimator than the one based on the Nadaraya-Watson estimator. The performance of the two estimators has been tested using two bivariate time series. The comparison indicated that the double kernel estimator gave smooth prediction curves, while the Nadaraya-Watson estimator gave estimates with a small error.
Dedication
To My Parents.
To My husband.
To My brothers and sisters.
To My Friends.
To all Knowledge Seekers.
Acknowledgment
First, I thank Allah for His many blessings, especially the blessing of success. I would like to thank several people for
helping me with this work. I would like to thank my professor
Dr. Raid Salha for his support, patience, encouragement, and
spending long hours helping me. Without his support I could never have accomplished this thesis. I would like to thank the
whole group of Department of Mathematics who gave me great
support and helped me in my work. I would like to thank all my
colleagues and friends at the Department of Mathematics for
encouraging me. I want to express my great gratitude to my
entire family for their continuous support and encouragement
during my whole life and especially during this work. And
special thanks to my father, mother, husband and my children.
Table of Contents
Declaration ......................................................................................................................... i
Abstract in English ............................................................................................................ ii
Abstract in Arabic ............................................................................................................ iii
Dedication ........................................................................................................................ iv
Acknowledgment .............................................................................................................. v
Table of Contents ............................................................................................................. vi
List of Tables .................................................................................................................. vii
List of Figures ................................................................................................................ viii
List of Abbreviations ....................................................................................................... ix
List of Symbols ................................................................................................................. x
Chapter 1 Preliminaries ................................................................................................. 4
1.1 Basic Definitions and Notations ............................................................................. 4
1.2 Kernel Density Estimation of the Pdf ................................................................... 10
1.3 Properties of the Kernel Estimator ........................................................................ 12
1.4 Optimal Bandwidth ............................................................................................... 15
Chapter 2 Nadaraya-Watson Estimator of L1-Median ............................................ 19
2.1 Importance of the Median ......................................................................... 20
2.2 The Conditional L1-Median ................................................................................. 21
2.3 The Nadaraya-Watson Estimator ......................................................................... 25
2.4 The Consistency of the Nadaraya-Watson Estimator of the L1-Median ............ 28
2.4.1 Assumption and Main Results……………………………………………29
2.4.2 Preliminary Lemmas……………………………...……………………...30
2.4.3 Proof of the Theorem in Section 2.4.1………………………………….…36
Chapter 3 The Double Kernel Estimator .................................................................. 39
3.1 The Double Kernel Estimator ............................................................................. 40
3.2 Consistency of the Double Kernel Estimator ...................................................... 41
Chapter 4 Applications ................................................................................................. 44
4.1 Application 1 ......................................................................................................... 45
4.2 Application 2 ...………………………………………………50
4.3 Discussion and conclusion ...............................................................................56
The Reference List ........................................................................................................ 58
List of Tables
Table (4.1): Summary statistics of IBM and SP500 data................................................ 45
Table (4.2): The DK and NW median estimators for the IBM data .............................. 48
Table (4.3): The DK and NW median estimators for the SP500 data ........................... 49
Table (4.4): MSE of the DK and the NW estimators for Application 1 ........................ 50
Table (4.5): Summary statistics of Cisco and Intel data ................................................. 51
Table (4.6): The DK and NW median estimators for the Cisco data ............................. 54
Table (4.7): The DK and NW median estimators for the Intel data............................... 55
Table (4.8): MSE of the DK and the NW estimators for Application 2 ........................ 56
List of Figures
Figure (1.1): Kernel density estimation based on 7 points ([11]). ............................. 11
Figure (1.2): Kernel density estimates of the Ethanol data ([11]) ............................. 12
Figure (1.3): Kernel density estimates based on different bandwidths h = 0.25 (solid
curve), h = 0.5 (dashed curve), h = 0.75 (dotted curve) ([11]) .................................. 18
Figure (4.1): Time plot of the rescaled IBM stock. ................................................... 46
Figure (4.2): Time plot of the rescaled SP500 stock. ................................................ 46
Figure (4.3): Scatterplot of the rescaled IBM stock versus the rescaled SP500 stock.
................................................................................................................................... 47
Figure (4.4): Scatterplot of the squares of the rescaled IBM stock versus the squares
of the rescaled SP500 stock. ...................................................................................... 47
Figure (4.5): Graph of the NW and the DK estimators for IBM . ........................... 48
Figure (4.6): Graph of the NW and the DK estimators for SP500. .......................... 49
Figure (4.7): Time plot of the rescaled Cisco stock. .................................................. 52
Figure (4.8): Time plot of the rescaled Intel stock. ................................................... 52
Figure (4.9): Scatterplot of the rescaled Cisco stock versus the rescaled Intel stock.
.................................................................................................................................... 53
Figure (4.10): Scatterplot of the squares of the rescaled Cisco stock versus the
squares of the rescaled Intel stock. ............................................................................. 53
Figure (4.11): Graph of the NW and the DK estimators for Cisco ............................... 54
Figure (4.12): Graph of the NW and the DK estimators for Intel………………… 55
List of Abbreviations
arg Argument
Cdf Cumulative distribution function
Cov Covariance
C.I. Confidence Interval
i.i.d. Independent and identically distributed
Inf Infimum
Min Minimum
MSE Mean Square Error
o Small oh
O Big oh
Pdf Probability density function
Var. Variance
N-W Nadaraya-Watson
DK Double Kernel
w.p With probability
List of Symbols
Symbol Description
X univariate random variable, X ∈ R.
X multivariate random variable, X ∈ Rd, d ≥ 2.
f(x) Probability density function.
F (x) Cumulative distribution function.
Kh Scaled univariate kernel function.
f̂ kernel estimator for the function f .
P probability set function.
h the bandwidth smoothing parameter.
µ the mean.
σ2 the variance.
E the expectation.
R(K) = ∫K²(x)dx.
IA the indicator function.
| · | the absolute value function.
R the set of real numbers.
∏ product.
N(0, 1) standard normal (Gaussian) distribution.
µj(K) j-th moment of a kernel K.
K(·) the kernel function.
→p converge in probability.
→d converge in distribution.
wi weight function.
|| · || the lp-norm function.
|| · ||p,α the norm-like function.
Preface
The probability density function is a fundamental concept in statistics. Consider any random variable X that has probability density function (pdf) f(x), x ∈ R. When we observe data, the pdf of the underlying distribution may be unknown, so it is useful to estimate it. There are many methods for the statistical estimation of the density function, and they fall into two kinds: parametric estimation and nonparametric estimation. It is helpful to distinguish between the two. Parametric estimation assumes that the sample under study comes from a known family of distributions, such as the Gaussian or Gamma distributions, and then estimates the unknown parameters of that distribution using, for example, the method of moments, maximum likelihood estimators, Bayes estimators, or minimum chi-square estimators. For example, if we assume that the data follow a normal distribution with mean µ and variance σ², we can estimate the parameters µ and σ² and substitute them into the normal density formula. We then obtain the estimated density function, denoted by f̂(x).
On the other hand, nonparametric estimation is a very useful way of dealing with data from an unknown distribution. It is used to estimate the density function in order to choose a suitable model for a given data set. Examples of nonparametric estimators include the histogram estimator, the naive estimator, the kernel estimator, the Nadaraya-Watson estimator, and the kernel nearest neighbor (KNN) estimator; for more details see [15], [21] and [25].
In this thesis, we will study the kernel estimation of the conditional L1-median.
The sample median is defined as the middle value of a set of ranked data, i.e. the sample median splits the data into two parts with an equal number of data points in each. Usually, the sample median is taken as an estimator of the population median m, a quantity which splits the distribution into two halves in the sense that P(Y ≤ m) = P(Y ≥ m) = 1/2. Multivariate time series arise when several time series are observed simultaneously over time. A multivariate time series consists of multiple single series referred to as components; see [26] and [8]. When the individual
series are related to each other, there is a need for jointly analyzing the series rather
than treating each one separately. By so doing, one hopes to improve the accuracy
of the predictions by utilizing the additional information available from the related
series in the predictions of each other. Several extensions of the concept of univari-
ate median were introduced in the literature with application in different fields of
statistics. One of them, the so called L1- median.
A nonparametric estimator is proposed for estimating the L1-median of a multivariate conditional distribution when the covariates take values in an infinite dimensional space. The multivariate case is more appropriate for predicting the components of a vector of random variables simultaneously rather than predicting each of them separately.
In this thesis, we will study the conditional L1-median, which plays an important role in nonparametric prediction. The main goal of this thesis is to modify the estimator of the L1-median by using the double kernel estimator rather than the Nadaraya-Watson estimator. In our study, we will discuss the consistency of the proposed estimator. Also, applications using real data to test the performance of the L1-median estimator will be given.
This thesis consists of four chapters which are organized as follows:
Chapter 1
This chapter contains notations, some basic definitions, and facts that we need in the thesis. Also, we introduce the kernel density estimation of the pdf and the properties of the kernel estimator.
Chapter 2
In this chapter, we introduce the L1-median, and we use the Nadaraya-Watson (NW) estimator of the conditional cdf to estimate it. We will study the asymptotic consistency properties of the NW estimator.
Chapter 3
In this chapter, we use the double kernel (DK) estimator of the cdf rather than the
NW estimator.
Chapter 4
In this chapter, we will practically compare the two estimators, the NW and the DK estimators, and give a conclusion.
Chapter 1
Preliminaries
This chapter contains notations, some basic definitions, and facts that we need in the remainder of this thesis. It is organized as follows: In Section 1.1, we introduce
some basic definitions and notations related to the areas of this thesis. In Section
1.2, we introduce the kernel density estimator of the probability density function
(pdf). In the next section, we summarize some properties of the kernels. Finally, in
Section 1.4, we present the problems of the optimal bandwidth selection.
1.1 Basic Definitions and Notations
In this section, we will introduce some basic definitions and theorems that will be of help in the remainder of this thesis.
Consider any random variable X that has pdf f . Specifying the function f gives a
natural description of the distribution of X, and allows probabilities associated with
X to be found from the relation

P(a < X < b) = ∫_a^b f(x)dx

for any real constants a and b with a < b. If the observed data are drawn from a distribution with unknown pdf, then the construction of an estimator of the unknown density function is called density estimation. Density estimation has experienced a wide explosion of interest over the last 40 years. It has been applied in many fields, including archeology, chemistry, banking, climatology, genetics, economics, hydrology and physiology. Density estimates give valuable indications of such features as skewness and multimodality in the data; in some cases they will yield conclusions directly, while in others all they do is point the way to further analysis and data collection.
Definition 1.1.1. (Estimator) [14] An estimator is any statistic from the sample
data which is used to give information about an unknown parameter in the popula-
tion.
For example, the sample mean is an estimator of the population mean. Estimators
of population parameters are sometimes distinguished from the true value by using
the hat symbol. For example, the normal distribution has two parameters, the mean µ and the standard deviation σ. Their estimators are denoted by µ̂ and σ̂.
Definition 1.1.2. [14] Let X be a random variable with pdf with parameter θ. Let
X1,X2,... ,Xn be a random sample from the distribution of X and let θ̂ denotes an
estimator of θ. We say θ̂ is an unbiased estimator of θ if
E(θ̂) = θ.
If θ̂ is not unbiased, we say that θ̂ is a biased estimator of θ.
Example 1.1.1. S² is an unbiased estimator of σ².

Proof.

E(S²) = E( (1/(n−1)) ∑(Xi − x̄)² )
= (1/(n−1)) E(∑(Xi − x̄)²)
= (1/(n−1)) E(∑(Xi − µ + µ − x̄)²)
= (1/(n−1)) E[ ∑(Xi − µ)² − 2(x̄ − µ)∑(Xi − µ) + ∑(x̄ − µ)² ]
= (1/(n−1)) [ nE(Xi − µ)² − 2nE(x̄ − µ)² + nE(x̄ − µ)² ]
= (1/(n−1)) [ nσ² − nE(x̄ − µ)² ]
= (1/(n−1)) [ nσ² − n(σ²/n) ]
= (σ²/(n−1)) [n − 1] = σ².

So, S² is an unbiased estimator of σ².
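As a quick numerical illustration of this unbiasedness (a NumPy sketch; the sample size, number of repetitions, and the normal population are illustrative choices, not part of the text), averaging S² over many repeated samples should reproduce σ²:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0           # true population variance
n, reps = 10, 20000    # small samples, many repetitions

# ddof=1 gives S^2 with divisor n - 1, the unbiased estimator above
s2 = np.array([rng.normal(0.0, np.sqrt(sigma2), n).var(ddof=1)
               for _ in range(reps)])

print(s2.mean())  # close to sigma2
```

Note that even for samples as small as n = 10, the average of S² over many repetitions is close to σ²; unbiasedness is a property of the expectation, not of any single sample.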
Definition 1.1.3. [14] If θ̂ is an unbiased estimator of θ and

Var(θ̂) = 1 / ( n E[ (∂ ln f(X)/∂θ)² ] ),   (1.1.1)

then θ̂ is called a minimum variance unbiased estimator (efficient) of θ.
Example 1.1.2. x̄ is a minimum variance unbiased estimator for µ in a normal population.

Proof.

Var(x̄) = σ²/n.

f(x) = (1/(σ√(2π))) exp(−(1/2)((x − µ)/σ)²)

ln f(x) = −ln σ − (1/2) ln 2π − (1/2)((x − µ)/σ)²

∂ ln f(x)/∂µ = −(1/2)·2·((x − µ)/σ)(−1/σ) = (x − µ)/σ²

E((x − µ)/σ²)² = (1/σ⁴) E(x − µ)² = σ²/σ⁴ = 1/σ²

1 / ( n E(∂ ln f(x)/∂µ)² ) = 1/(n(1/σ²)) = σ²/n.

Hence Var(x̄) attains the bound (1.1.1), and x̄ is efficient.
Definition 1.1.4. [14] The statistic θ̂ is a consistent estimator of the parameter θ if and only if for each c > 0,

lim_{n→∞} P(|θ̂ − θ| < c) = 1.   (1.1.2)

Example 1.1.3. x̄ is a consistent estimator for µ in a normal population.

Theorem 1.1.1. [14] If θ̂ is an unbiased estimator of θ and Var(θ̂) → 0 as n → ∞, then θ̂ is a consistent estimator of θ.
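Theorem 1.1.1 can be illustrated numerically for θ̂ = x̄: since Var(x̄) = σ²/n → 0, the empirical probability P(|x̄ − µ| < c) should approach 1 as n grows. A NumPy sketch (the sample sizes, the constant c, and the number of repetitions are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, c, reps = 0.0, 0.1, 5000

coverage = {}
for n in (10, 100, 1000):
    # reps independent sample means of size-n N(mu, 1) samples
    xbar = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)
    coverage[n] = float(np.mean(np.abs(xbar - mu) < c))

print(coverage)  # the probabilities increase toward 1 as n grows
```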
Definition 1.1.5. [14] The statistic θ̂ is a sufficient estimator of the parameter θ if and only if for each value t of θ̂ the conditional probability distribution or density of the random sample X1, X2, ..., Xn given θ̂ = t is independent of θ.

Example 1.1.4. x̄ is a sufficient estimator for µ in a normal population.
There are two types of density estimation:
• Parametric Estimation.
• Nonparametric Estimation.
Parametric Estimation.
The parametric approach for estimating f(x) is to assume that f(x) is a member
of some parametric family of distributions, e.g. N(µ, σ²), and then to estimate the
parameters of the assumed distribution from the data. For example, fitting a normal
distribution leads to the estimator

fn(x) = (1/(√(2π)S)) exp(−(x − x̄)²/(2S²)), x ∈ R,

where

x̄ = (1/n) ∑_{i=1}^n xi

and

S² = (1/(n−1)) ∑_{i=1}^n (xi − x̄)².
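This parametric fit can be sketched in a few lines (NumPy; the function name and the simulated data below are illustrative, not from the text):

```python
import numpy as np

def normal_fit_pdf(data):
    """Fit a normal density by the sample mean x-bar and S, return the fitted pdf."""
    xbar = data.mean()
    s = data.std(ddof=1)   # S, with divisor n - 1, as in the formula above
    def f_n(x):
        return np.exp(-(x - xbar) ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)
    return f_n

rng = np.random.default_rng(2)
sample = rng.normal(5.0, 2.0, 500)   # illustrative data: N(5, 4)
f_n = normal_fit_pdf(sample)
print(f_n(5.0))  # near the true peak value 1/(2 sqrt(2 pi)) ≈ 0.199
```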
This approach has advantages as long as the distributional assumption is correct,
or at least is not seriously wrong. It is easy to apply and it yields (relatively) stable
estimates. The main disadvantage of the parametric approach is its lack of flexibility.
Each parametric family of distributions imposes restrictions on the shapes that f(x)
can have. For example, the density function of the normal distribution is symmetrical and bell-shaped, and is therefore unsuitable for representing skewed or bimodal densities.
Nonparametric Estimation.
If the data that we study come from an unknown distribution, i.e., the density function f(x) is unknown, then we must estimate the density function. This estimation is called nonparametric estimation. There are many nonparametric statistical objects of potential interest, including density functions (univariate and multivariate), density derivatives, conditional density functions, conditional distribution functions, regression functions, median functions, quantile functions, and variance functions. Many nonparametric problems are generalizations of univariate density estimation. There are many methods for obtaining a nonparametric estimate of a pdf:

1. Histogram.

2. The Naive Estimator.

3. Kernel Density Estimation.
Definition 1.1.6. (Indicator function) [23] If A is any set, we define the indicator function IA of the set A to be the function given by

IA(x) = 1 if x ∈ A, and IA(x) = 0 if x ∉ A.

Definition 1.1.7. (Converge in Probability) [16]
Let Xn be a sequence of random variables and let X be a random variable defined on a sample space. We say Xn converges in probability to X if for all ε > 0, we have

lim_{n→∞} P[|Xn − X| ≥ ε] = 0,   (1.1.3)

or equivalently,

lim_{n→∞} P[|Xn − X| < ε] = 1.   (1.1.4)

If so, we write Xn →p X.
Definition 1.1.8. (Converge in Distribution) [16]
Let Xn be a sequence of random variables and let X be a random variable. Let FXn and FX be, respectively, the cdfs of Xn and X. Let C(FX) denote the set of all points where FX is continuous. We say that Xn converges in distribution to X if

lim_{n→∞} FXn(x) = FX(x), for all x ∈ C(FX).   (1.1.5)

We denote this convergence by Xn →d X.
Definition 1.1.9. (Converge with Probability 1) [16]
Let {Xn}_{n=1}^∞ be a sequence of random variables on (Ω, L, P). We say that Xn converges almost surely to a random variable X (Xn →a.s. X), or converges with probability 1 to X, or converges strongly to X, if and only if

P({w : Xn(w) → X(w), as n → ∞}) = 1,

or equivalently, for all ε > 0,

lim_{N→∞} P(|Xn − X| < ε for all n ≥ N) = 1.
Definition 1.1.10. (Order Notation O and o) [27]
Let an and bn each be sequences of real numbers. We say that an is of big order bn (an is big oh bn) as n → ∞, and write an = O(bn) as n → ∞, if and only if

lim sup_{n→∞} |an/bn| < ∞.

Similarly, we say that an is of small order bn (an is small oh bn), and write an = o(bn) as n → ∞, if and only if

lim_{n→∞} an/bn = 0.
Theorem 1.1.2. (Taylor's Theorem) [27]
Suppose that f is a real-valued function defined on R and let x ∈ R. Assume that f has p continuous derivatives in an interval (x − δ, x + δ) for some δ > 0. Then for any sequence αn converging to zero,

f(x + αn) = ∑_{j=0}^p (αn^j / j!) f^(j)(x) + o(αn^p).
1.2 Kernel Density Estimation of the Pdf
We present the kernel density estimation of the pdf and review some important
definitions and aspects in this area. In statistics, kernel density estimation (KDE) is a nonparametric way to estimate the pdf of a random variable. Kernel density estimation is a fundamental data smoothing problem, where inferences about the population are made based on a finite data sample.
Definition 1.2.1. (Kernel Estimator of a Probability Density Function) [25] Suppose that X1, ..., Xn is a random sample of data from an unknown continuous distribution with pdf f(x) and cumulative distribution function (cdf) F(x). The kernel estimator of the probability density function is defined as

f̂(x) = (1/(nh)) ∑_{i=1}^n K((x − Xi)/h),   (1.2.1)

where the bandwidth h = hn is a sequence of positive numbers converging to zero and K(·) is the kernel function, usually taken to be symmetric and satisfying

∫_{−∞}^{∞} K(x)dx = 1.

The j-th moment of the kernel is

µj(K) = ∫_{−∞}^{∞} y^j K(y)dy.   (1.2.2)

The density estimates derived using such kernels can fail to be probability densities, because they can be negative for some values of x. Typically, K is chosen to be a symmetric pdf. There is a large body of literature on choosing K and h well, where "well" means that the estimate converges asymptotically as rapidly as possible in some suitable norm on pdfs.
A slightly more compact formula for the kernel estimator can be obtained by introducing the rescaling notation Kh(u) = h⁻¹K(u/h). This allows us to write

f̂(x) = n⁻¹ ∑_{i=1}^n Kh(x − Xi).
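A minimal NumPy implementation of the estimator (1.2.1) with a Gaussian kernel (the kernel choice, the bandwidth, and the seven data points are illustrative; the actual points behind Figure 1.1 are not given in the text):

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal pdf used as the kernel K."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(x, data, h):
    """f_hat(x) = (1/(n h)) * sum_i K((x - X_i)/h), Eq. (1.2.1)."""
    x = np.asarray(x, dtype=float)
    u = (x[..., None] - data) / h          # scaled distance to every X_i
    return gaussian_kernel(u).mean(axis=-1) / h

data = np.array([2.1, 2.4, 2.5, 3.0, 3.9, 4.2, 4.4])  # seven illustrative points
xs = np.linspace(0.0, 6.5, 651)
f_hat = kde(xs, data, h=0.4)
print(f_hat.max())
```

Because the Gaussian kernel is itself a pdf, the resulting estimate is nonnegative and integrates to one, exactly as the definition requires.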
Figure 1.1: Kernel density estimation based on 7 points([11])
From Figure (1.1), we note that:

(1) The shape of each bump is determined by the kernel function.

(2) The width of each bump is determined by the bandwidth h.

That is, the value of the kernel estimate at the point x is the average of the n kernel ordinates at this point.
Figure 1.2: Kernel density estimates of the Ethanol data ([11])
Figure 1.2 shows the kernel density estimates of the Ethanol data based on the
same bandwidth hn = 0.2, but using different kernels. The solid curve stands for
the triangular kernel, the dashed curve for the uniform kernel, and the dotted curve
for the normal kernel.
1.3 Properties of the Kernel Estimator
In this section, we will introduce some important properties of the kernel. A kernel is a piecewise continuous, even function, symmetric around zero and integrating to one, i.e.,

K(x) = K(−x),  ∫_{−∞}^{∞} K(x)dx = 1.

The kernel function need not have bounded support, and in most applications K is a positive pdf.

Definition 1.3.1. [5] A kernel function K is said to be of order p if its first nonzero moment is µp, i.e., if

µj(K) = 0, j = 1, 2, ..., p − 1;  µp(K) ≠ 0;

where

µj(K) = ∫_{−∞}^{∞} y^j K(y)dy.
We consider the following conditions:

(i) The unknown density function f(x) has a continuous second derivative f″(x).

(ii) The bandwidth h = hn satisfies lim_{n→∞} h = 0 and lim_{n→∞} nh = ∞.

(iii) The kernel K is a bounded pdf of order 2, symmetric about the origin, so that ∫_{−∞}^{∞} zK(z)dz = 0 and ∫_{−∞}^{∞} z²K(z)dz ≠ 0.
Expanding f(x − hz) in a Taylor series about x, we obtain

f(x − hz) = f(x) − hzf′(x) + (1/2)h²z²f″(x) + o(h²).

Hence

E(f̂(x)) = ∫K(z)[f(x) − hzf′(x) + (1/2)h²z²f″(x)]dz + o(h²)
= f(x)∫K(z)dz − hf′(x)∫zK(z)dz + (h²/2)f″(x)∫z²K(z)dz + o(h²).

This leads to

E f̂(x) = f(x) + (1/2)h²f″(x)∫z²K(z)dz + o(h²),   (1.3.1)

where we have used ∫K(z)dz = 1 and ∫zK(z)dz = 0.

For the variance, let z = (x − y)/h; then we get

Var(f̂(x)) = (1/(nh²))[ h∫K²(z)f(x − zh)dz − (h∫K(z)f(x − zh)dz)² ].   (1.3.3)

Using the Taylor series for f(x − zh) as before, we have

Var(f̂(x)) = (1/(nh²))[ h∫(f(x) − hzf′(x))K²(z)dz − o(h²) ]
= (1/(nh))[ ∫(f(x) − hzf′(x))K²(z)dz − o(h) ]
= (nh)⁻¹f(x)∫K²(z)dz + o((nh)⁻¹).   (1.3.2)

By the assumptions on h, we have the results.
From the lemmas above, we have the following properties of the bias and the variance:
1. The bias is of order h2, which implies that f̂(x) is an asymptotically unbiased
estimator.
2. The bias is large whenever the absolute value of the second derivative |f″(x)| is large. This occurs for several densities at peaks, where the bias is negative, and at valleys, where the bias is positive.
3. The variance is of order (nh)−1, which means that the variance converges to zero
by Condition (ii).
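The h² order of the bias can be checked exactly when both the data density and the kernel are standard normal, because then E f̂(x) = (Kh ∗ f)(x) is itself a normal density with variance 1 + h². A small sketch (the evaluation point and the bandwidths are illustrative choices):

```python
import numpy as np

def phi(x, s=1.0):
    """Normal density with mean 0 and standard deviation s."""
    return np.exp(-x ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)

x0 = 0.0
results = {}
for h in (0.4, 0.2, 0.1, 0.05):
    exact_mean = phi(x0, np.sqrt(1 + h ** 2))  # E f_hat(x0): N(0, 1 + h^2) density
    exact_bias = exact_mean - phi(x0)
    f2 = (x0 ** 2 - 1) * phi(x0)               # f''(x0) for the standard normal
    approx_bias = 0.5 * h ** 2 * f2            # (h^2/2) f''(x0), since mu_2(K) = 1
    results[h] = (exact_bias, approx_bias)
    print(h, exact_bias, approx_bias)
```

As h shrinks, the exact bias and the leading term (h²/2)f″(x0) agree ever more closely; the bias is negative here because x0 = 0 is a peak of the density.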
1.4 Optimal Bandwidth
The problem of selecting the bandwidth is very important in kernel density estimation. The choice of an appropriate bandwidth is critical to the performance of most nonparametric density estimators. When the bandwidth is very small, the estimate will be very close to the original data. The estimate will be almost unbiased, but it will have large variation under repeated sampling. If the bandwidth is very large, the estimate will be very smooth, lying close to the mean of all the data. Such an estimate will have small variance, but it will be highly biased. There are many rules for bandwidth selection, for example normal scale rules, over-smoothed bandwidth selection rules, least squares cross-validation, biased cross-validation, estimation of density functionals, and plug-in bandwidth selection. For more details see [25] and [27].
We shall use two types of error criteria. The mean square error (MSE) is used to measure the error when estimating the density function at a single point. It is defined by

MSE{fn(x)} = E{fn(x) − f(x)}².   (1.4.1)

We can write the MSE as the sum of the squared bias and the variance at x:

MSE(fn(x)) = {Efn(x) − f(x)}² + Var(fn(x)).   (1.4.2)
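The decomposition (1.4.2) is an algebraic identity and can be verified directly on simulated kernel estimates at a point (a NumPy sketch; the density, bandwidth, and sample sizes are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
phi = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)  # Gaussian kernel and true pdf

x0, n, h, reps = 0.0, 200, 0.3, 4000
# reps independent kernel estimates f_n(x0) from N(0, 1) samples
est = np.array([phi((x0 - rng.normal(0.0, 1.0, n)) / h).mean() / h
                for _ in range(reps)])

mse = np.mean((est - phi(x0)) ** 2)                       # E{f_n(x0) - f(x0)}^2
bias2_plus_var = (est.mean() - phi(x0)) ** 2 + est.var()  # squared bias + variance
print(mse, bias2_plus_var)  # equal: the decomposition is exact
```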
A second type of criteria measures the error when estimating the density over the
whole real line. The best known of this type is the mean integrated squared error (MISE), introduced by [22]. The MISE is defined as

MISE(fn) = E ∫_{−∞}^{∞} {fn(x) − f(x)}² dx.   (1.4.3)

By changing the order of integration we have

MISE(fn) = ∫_{−∞}^{∞} MSE{fn(x)} dx = ∫_{−∞}^{∞} {Efn(x) − f(x)}² dx + ∫_{−∞}^{∞} Var(fn(x)) dx.   (1.4.4)
Equation (1.4.4) gives the MISE as the sum of the integrated squared bias and the integrated variance. Substituting (1.3.1) and (1.3.2), we conclude that

MISE(fn) = AMISE(fn) + o{h⁴ + (nh)⁻¹},   (1.4.5)

where AMISE is the asymptotic mean integrated squared error, given by

AMISE(fn) = (1/4)h⁴µ2(K)²R(f″) + (nh)⁻¹R(K),   (1.4.6)

see [27].
The natural way of choosing h is to plot out several curves and choose the estimate that best matches one's prior (subjective) ideas. However, this method is not practical in pattern recognition, since we typically have high-dimensional data.
Assume a standard density function and find the value of the bandwidth that minimizes the mean integrated squared error (MISE):

hMISE = arg min_h E[ ∫ (fn(x) − f(x))² dx ].   (1.4.7)

If we assume that the true distribution is Gaussian and we use a Gaussian kernel, the bandwidth h can be computed using the following equation from [25]:

h∗ = 1.06 S N^{−1/5},   (1.4.8)
where S is the sample standard deviation and N is the number of training examples.
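Equation (1.4.8) is straightforward to apply. Moreover, for a Gaussian kernel and a Gaussian reference density, (1.4.9) below with R(K) = 1/(2√π), µ2(K) = 1 and R(f″) = 3/(8√π σ⁵) reduces to (4/3)^{1/5} σ N^{−1/5} ≈ 1.0592 σ N^{−1/5}, which is where the constant 1.06 comes from. A sketch of both (the simulated data are illustrative):

```python
import numpy as np

def silverman_bandwidth(data):
    """Normal-reference rule h* = 1.06 * S * N^(-1/5), Eq. (1.4.8)."""
    data = np.asarray(data, dtype=float)
    return 1.06 * data.std(ddof=1) * len(data) ** (-1 / 5)

def gaussian_hopt(sigma, n):
    """Optimal bandwidth (1.4.9) for a Gaussian kernel and N(0, sigma^2) density."""
    r_k = 1.0 / (2.0 * np.sqrt(np.pi))                 # R(K) for the Gaussian kernel
    r_f2 = 3.0 / (8.0 * np.sqrt(np.pi) * sigma ** 5)   # R(f'') for N(0, sigma^2)
    return (r_k / (r_f2 * n)) ** (1 / 5)               # mu_2(K) = 1

rng = np.random.default_rng(4)
sample = rng.normal(0.0, 1.0, 400)
print(silverman_bandwidth(sample), gaussian_hopt(1.0, 400))  # both near 0.32
```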
By differentiating (1.4.6) with respect to h, we can find the optimal bandwidth with respect to the AMISE criterion. This yields the optimal bandwidth

hopt = [ R(K) / (µ2(K)²R(f″)n) ]^{1/5}.   (1.4.9)

Therefore, if we substitute (1.4.9) into (1.4.6), we obtain the smallest value of AMISE for estimating f using the kernel K:

inf_{h>0} AMISE{fn} = (5/4) {µ2(K)²R(K)⁴R(f″)}^{1/5} n^{−4/5}.
Notice that in (1.4.9) the optimal bandwidth depends on the unknown density being estimated, so we cannot use (1.4.9) directly to find the optimal bandwidth hopt. Also, from (1.4.9) we can draw the following useful conclusions:

• The optimal bandwidth converges to zero as the sample size increases, but at a very slow rate.

• The optimal bandwidth is inversely proportional to R(f″)^{1/5}. Since R(f″) measures the curvature of f, this means that for a density function with little curvature, the optimal bandwidth will be large. Conversely, if the density function has a large curvature, the optimal bandwidth will be small.
Figure 1.3: Kernel density estimates based on different bandwidths h = 0.25 (solid curve), h = 0.5 (dashed curve), h = 0.75 (dotted curve) ([11])
Summary

In this chapter, we introduced some basic definitions and theorems that we will need in this thesis. We studied the definition of an estimator and its types, gave an overview of nonparametric estimation and its common methods, and then presented the kernel density estimation of the pdf and the properties of the kernel estimator. In the next chapter, we will study the Nadaraya-Watson estimator of the L1-median.
Chapter 2

Nadaraya-Watson Estimator of the L1-Median
Introduction
Conditional distribution functions underlie many popular statistical objects of interest. They are rarely modeled directly in the parametric setting and have perhaps received even less attention in the kernel setting. Nevertheless, as will be seen, they are extremely useful for a range of tasks, whether directly estimating the conditional distribution function [6] or modeling conditional quantiles. The conditional median depends directly on the conditional distribution function. Indeed, estimating the conditional distribution is much more informative, since it allows us not only to recover the expected value E(Y|X) and the variance σ²(Y|X), but also to provide the general shape of the conditional distribution. In this context, several nonparametric methods are applicable for estimating the conditional distribution function based on data (X1, Y1), ..., (Xn, Yn). One class of kernel-type estimators is the Nadaraya-Watson estimator, which is one of the most widely known and used estimators of the conditional distribution function. Conditional distribution estimation was introduced by [22]. A bias correction was proposed by [17], and [12] proposed a direct estimator based on local polynomial estimation. The Nadaraya-Watson (NW) estimator was created independently by [28] and [19]. In this chapter, we will investigate a nonparametric method for estimating the conditional L1-median, namely the Nadaraya-Watson estimator.
In this chapter we present the NW estimator in order to estimate the conditional L1-median. In Section 2.1, we discuss the importance of the median and give some historical notes. In Section 2.2, we present the conditional L1-median. In Section 2.3, the NW estimator of the conditional L1-median is studied. Then, in Section 2.4, the asymptotic properties of the NW estimator are discussed and derived.
2.1 Importance of the Median
The median is the value separating the higher half of a data sample, a population, or
a probability distribution, from the lower half. In simple terms, it may be thought
of as the ”middle” value of a data set. For example, in the data set 1, 3, 3, 6, 7, 8,
9, the median is 6, the fourth number in the sample. The median is a commonly
used measure of the properties of a data set in statistics and probability theory.
The basic advantage of the median in describing data compared to the mean (often
simply described as the ”average”) is that it is not skewed so much by extremely large
or small values, and so it may give a better idea of a ’typical’ value. For example, in
understanding statistics like household income or assets which vary greatly, a mean
may be skewed by a small number of extremely high or low values. Median income,
for example, may be a better way to suggest what a ’typical’ income is.
Because of this, the median is of central importance in robust statistics, as it is the most resistant statistic, having a breakdown point of 50%: so long as no more than half the data are contaminated, the median will not give an arbitrarily large or small result.
The median is another way to measure the center of a numerical data set. A statistical median is much like the median of an interstate highway: on many highways, the median is the middle, and an equal number of lanes lie on either side of it. In a numerical data set, the median is the point at which there is an equal number of points whose values lie above and below the median value; thus the median is truly the middle of the data set.
The mean and the median are the measures of location most used by investigators, because these measures are easy to understand. Most investigators describe the location of the data by the arithmetic mean, except when the data are highly skewed, highly kurtotic, or contaminated with outliers, in which case the median is often used. In the case of asymmetric data, the median is almost always close to the mode, and in many cases a good approach would be to estimate the mode itself. See [20].
Advantages of the Median

1. It is very simple to understand and easy to calculate.

2. It is a special average used for qualitative phenomena like intelligence or beauty, which are not quantified but for which ranks are given.

Disadvantages of the Median

1. It is less representative.

2. It takes a long time to calculate for a very large set of data.
2.2 The Conditional L1-Median

In this section, we study the conditional median: we define the median as the solution to the problem of minimizing a sum of absolute residuals, and then we study the conditional L1-median.
Definition 2.2.1. Let Y₁, Y₂, ..., Y_n be a random sample from a distribution with pdf f(y) and cdf F(y). Then the median of the distribution is defined by

θ = inf{y : F(y) ≥ 0.5}.
Example 2.2.1. To characterize the median as a minimization problem, we solve

arg min_{θ∈R} ∑_{i=1}^{n} |Y_i − θ|.
Proof. [7] For a single observation Y with density f,

E|Y − θ| = ∫_{−∞}^{∞} |y − θ| f(y) dy
         = ∫_{−∞}^{θ} −(y − θ) f(y) dy + ∫_{θ}^{∞} (y − θ) f(y) dy.

Differentiating with respect to θ and setting the derivative equal to zero,

(d/dθ) E|Y − θ| = ∫_{−∞}^{θ} f(y) dy − ∫_{θ}^{∞} f(y) dy = 0,

so that

∫_{−∞}^{θ} f(y) dy = ∫_{θ}^{∞} f(y) dy.

This is the definition of the median: P(Y ≤ θ) = P(Y ≥ θ) = 0.5.

The symmetry of the piecewise linear absolute value function implies that minimizing the sum of absolute residuals equates the numbers of positive and negative residuals.
Median regression estimates the conditional median of Y given X = x and corresponds to the minimization of E(|Y − θ| | X = x) over θ. The associated loss function is r(u) = |u|. Equivalently, we can take the loss function to be ρ_{0.5}(u) = 0.5|u|, because the positive and negative residuals receive equal weight.
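This characterization is easy to verify numerically. The sketch below (an illustration of ours, not part of the thesis) evaluates the L1 loss on a grid and checks that its minimizer agrees with the sample median of the data set from Section 2.1:

```python
import numpy as np

# Numerical check that the sample median minimizes sum_i |Y_i - theta|.
y = np.array([1.0, 3.0, 3.0, 6.0, 7.0, 8.0, 9.0])  # data set from Section 2.1
grid = np.linspace(0.0, 10.0, 10001)               # candidate values of theta
loss = np.abs(y[:, None] - grid[None, :]).sum(axis=0)
theta_hat = grid[np.argmin(loss)]
print(theta_hat, np.median(y))  # the grid minimizer sits at the median, 6
```
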
Definition 2.2.2. (The Conditional Median) [1]
Let (X₁, Y₁), (X₂, Y₂), ..., (X_n, Y_n) be a random sample from a distribution with conditional cdf F(y|x). Then the conditional median M(x) is defined by

M(x) = inf{y : F(y|x) ≥ 0.5}.
Definition 2.2.3. Suppose we have a complex vector space V. A norm is a function f : V → R which satisfies:

1. f(x) ≥ 0 for all x ∈ V;

2. f(x + y) ≤ f(x) + f(y) for all x, y ∈ V;

3. f(λx) = |λ| f(x) for all λ ∈ C and x ∈ V;

4. f(x) = 0 if and only if x = 0.

We usually denote a norm by ‖x‖.
Examples of the most important norms are as follows:

• The 2-norm or Euclidean norm: ‖x‖₂ = (∑_{i=1}^{n} |x_i|²)^{1/2}.

• The 1-norm: ‖x‖₁ = ∑_{i=1}^{n} |x_i|.

• For any real p ≥ 1, the p-norm: ‖x‖_p = (∑_{i=1}^{n} |x_i|^p)^{1/p}.

• The ∞-norm, also called the sup-norm: ‖x‖_∞ = max_i |x_i|. This notation is used because ‖x‖_∞ = lim_{p→∞} ‖x‖_p.
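These norms can be computed directly; the small helper below is illustrative (the function name `p_norm` is ours) and also shows numerically that the p-norm approaches the sup-norm as p grows:

```python
import numpy as np

def p_norm(x, p):
    """The p-norm (sum |x_i|^p)^(1/p); p = inf gives the sup-norm."""
    x = np.asarray(x, dtype=float)
    if np.isinf(p):
        return np.abs(x).max()
    return (np.abs(x) ** p).sum() ** (1.0 / p)

x = [3.0, -4.0]
print(p_norm(x, 1))       # 7.0, the 1-norm
print(p_norm(x, 2))       # 5.0, the Euclidean norm
print(p_norm(x, np.inf))  # 4.0, the sup-norm
print(p_norm(x, 50))      # already very close to 4, illustrating the limit
```
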
Several extensions of the concept of the univariate median have been introduced in the literature, with applications in different fields of statistics. One of them is the so-called L1-median. A nonparametric estimator has been proposed for estimating the L1-median of a multivariate conditional distribution when the covariates take values in an infinite dimensional space. The multivariate case is more appropriate when we wish to predict the components of a vector of random variables simultaneously rather than predicting each of them separately.
Definition 2.2.4. (Convex function) A function f : M −→ R defined on a nonempty subset M of Rⁿ and taking real values is called convex if:

1. the domain M of the function is convex;

2. for any x, y ∈ M and every λ ∈ [0, 1], one has

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y). (2.2.1)

If the above inequality is strict whenever x ≠ y and 0 < λ < 1, f is called strictly convex.
An example of convex functions is given by norms. Recall that a real-valued function ‖x‖ on Rⁿ is called a norm if it:

• is nonnegative everywhere: ‖x‖ ≥ 0 for all x ∈ Rⁿ;

• is homogeneous: ‖ax‖ = |a| ‖x‖ for all x ∈ Rⁿ and a ∈ R;

• satisfies the triangle inequality: ‖x + y‖ ≤ ‖x‖ + ‖y‖;

• satisfies ‖x‖ = 0 if and only if x = 0.
Let ‖·‖ denote any strictly convex norm on R^d, that is, a norm with ‖α + β‖ < ‖α‖ + ‖β‖ whenever α and β are not proportional. The norms ‖·‖_p : R^d −→ R with 1 < p < ∞ are strictly convex. In the sequel we restrict attention to the Euclidean norm and, for notational simplicity, write ‖·‖ for ‖·‖₂.
For a fixed x ∈ R^s, define a function of α, α ∈ R^d, by

φ(α, x) = E(‖Y − α‖ − ‖Y‖ | X = x) = ∫_{R^d} (‖y − α‖ − ‖y‖) F(dy|x),

where F(·|x) is the conditional probability measure of Y given X = x. When the norm is strictly convex, the existence and uniqueness of µ ∈ R^d such that

φ(µ, x) = inf_{α∈R^d} φ(α, x)

are guaranteed. The vector µ is called the L1-median of the conditional probability measure F(·|x) [18].
Definition 2.2.5. (The Conditional L1-Median) [2], [3]
The L1-median of Y conditionally on X = x is defined by

µ(x) = arg min_{α∈R^d} φ(α, x),

where, for α ∈ R^d,

φ(α, x) = ∫_{R^d} (‖y − α‖ − ‖y‖) F(dy|x).
In this section, we introduced the median as a minimization problem in both the univariate and the multivariate case, and the L1-median was introduced. In the next section, we will study the Nadaraya-Watson estimator of the univariate conditional median; its mean and variance will be discussed.
2.3 The Nadaraya-Watson Estimator

In this section, we present some basic facts about the Nadaraya-Watson (NW) estimator for later use. It is one of the popular nonparametric methods for estimating the conditional density function f(y|x). We consider the kernel estimation of the conditional cumulative distribution function (cdf). Let (X₁, Y₁), (X₂, Y₂), ..., (X_n, Y_n) be a random sample from a distribution with conditional probability density function (pdf) f(y|x). Then the cdf F(y|x) is given by

F(y|x) = ∫_{−∞}^{y} f(u|x) du,

where

f(y|x) = f(x, y) / f_X(x).

Now we introduce the basic equations of the kernel conditional density estimator. The kernel K(u) is assumed to be a Borel symmetric function, h is a sequence of positive numbers converging to zero, called the bandwidth, and K_h(x) = K(x/h)/h.
Standard kernel estimators of f(x, y) and f_X(x) are

f̂(x, y) = (1/n) ∑_{i=1}^{n} K_h(x − X_i) K_h(y − Y_i),

f̂_X(x) = (1/n) ∑_{i=1}^{n} K_h(x − X_i),

and

f̂(y|x) = f̂(x, y) / f̂_X(x) = ∑_{i=1}^{n} K_h(x − X_i) K_h(y − Y_i) / ∑_{i=1}^{n} K_h(x − X_i).
Now the estimator of the conditional cdf is given by

F̂(y|x) = ∫_{−∞}^{y} f̂(u|x) du = ∑_{i=1}^{n} K_h(x − X_i) ∫_{−∞}^{y} K_h(u − Y_i) du / ∑_{i=1}^{n} K_h(x − X_i).

Alternatively, F(y|x) can be estimated by using the indicator function I(Y_i ≤ y), which yields the Nadaraya-Watson estimator

F̂_NW(y|x) = ∑_{i=1}^{n} I(Y_i ≤ y) K_h(x − X_i) / ∑_{i=1}^{n} K_h(x − X_i).
Remark 2.3.1. We have

0 ≤ F̂_NW(y|x) ≤ 1.

Indeed, if y < Y_i for all i = 1, 2, ..., n, then I(Y_i ≤ y) = 0 for all i, so

F̂_NW(y|x) = ∑_{i=1}^{n} I(Y_i ≤ y) K_h(x − X_i) / ∑_{i=1}^{n} K_h(x − X_i) = 0.

If y lies between the Y_i's, i.e. some but not all of the Y_i are less than or equal to y, then I(Y_i ≤ y) = 0 for some i and I(Y_i ≤ y) = 1 for the others, so

0 < F̂_NW(y|x) < 1.

If y ≥ Y_i for all i, then I(Y_i ≤ y) = 1 for all i, so

F̂_NW(y|x) = ∑_{i=1}^{n} K_h(x − X_i) / ∑_{i=1}^{n} K_h(x − X_i) = 1.
Definition 2.3.1. The NW estimator of the conditional median m(x) is defined as

m̂_NW(x) = inf{y ∈ R : F̂_NW(y|x) ≥ 0.5}.
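A minimal sketch of F̂_NW and m̂_NW, assuming a Gaussian kernel and searching for the median over the sample values of Y (the thesis's own computations use S-Plus; this Python version, with its simulated model, is only illustrative):

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def F_nw(y, x, X, Y, h):
    """Nadaraya-Watson estimate of F(y|x): a kernel-weighted empirical cdf."""
    w = gaussian_kernel((x - X) / h)
    return np.sum((Y <= y) * w) / np.sum(w)

def median_nw(x, X, Y, h):
    """m_NW(x) = inf{y : F_NW(y|x) >= 0.5}, attained at one of the Y_i."""
    for y in np.sort(Y):
        if F_nw(y, x, X, Y, h) >= 0.5:
            return y
    return np.max(Y)

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, 500)
Y = 2.0 * X + rng.normal(scale=0.3, size=500)  # true conditional median: 2x
print(median_nw(0.5, X, Y, h=0.1))             # should be near 2 * 0.5 = 1
```
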
In the following theorem, we discuss the expectation of the NW estimator (the X_i are treated as fixed).

Theorem 2.3.1. (The Expectation of the NW Estimator)
Let Y₁, ..., Y_n be independent random variables. The expectation of the estimator F̂_NW(y|x) is given by

E[F̂_NW(y|x)] = ∑_{i=1}^{n} [ K_h(x − X_i) / ∑_{j=1}^{n} K_h(x − X_j) ] F(y|X_i).

Proof.

E[F̂_NW(y|x)] = E[ ∑_{i=1}^{n} I{Y_i ≤ y} K_h(x − X_i) / ∑_{i=1}^{n} K_h(x − X_i) ]
= ∑_{i=1}^{n} K_h(x − X_i) E[I{Y_i ≤ y}] / ∑_{i=1}^{n} K_h(x − X_i)
= ∑_{i=1}^{n} K_h(x − X_i) ∫_{−∞}^{y} f(t|X_i) dt / ∑_{i=1}^{n} K_h(x − X_i)
= ∑_{i=1}^{n} K_h(x − X_i) F(y|X_i) / ∑_{i=1}^{n} K_h(x − X_i)
= ∑_{i=1}^{n} [ K_h(x − X_i) / ∑_{j=1}^{n} K_h(x − X_j) ] F(y|X_i). ∎
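The formula can be checked by Monte Carlo, holding the design points X_i fixed and averaging F̂_NW over repeated draws of the Y_i. The sketch below is ours and uses a hypothetical model Y_i | X_i ~ N(X_i, 1):

```python
import numpy as np
from math import erf, sqrt

# Monte Carlo check of Theorem 2.3.1 with the X_i held fixed:
# E[F_NW(y|x)] = sum_i w_i(x) F(y|X_i), w_i(x) = K_h(x-X_i)/sum_j K_h(x-X_j).
X = np.array([-0.5, 0.0, 0.4, 0.8])   # fixed design points
x0, y0, h = 0.2, 0.3, 0.5
w = np.exp(-0.5 * ((x0 - X) / h) ** 2)
w = w / w.sum()                        # NW weights w_i(x0)

Phi = lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0)))  # standard normal cdf

rng = np.random.default_rng(2)
reps = 20000
est = np.empty(reps)
for r in range(reps):
    Y = X + rng.normal(size=X.size)    # model: Y_i | X_i ~ N(X_i, 1)
    est[r] = np.sum((Y <= y0) * w)     # F_NW(y0|x0) for this draw
theory = sum(wi * Phi(y0 - Xi) for wi, Xi in zip(w, X))
print(est.mean(), theory)              # the two agree up to Monte Carlo error
```
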
Theorem 2.3.2. (The Variance of the Nadaraya-Watson Estimator) [9]
Let Y₁, ..., Y_n be independent random variables. The variance of the estimator F̂_NW(y|x) is given by

Var[F̂_NW(y|x)] = ∑_{i=1}^{n} [ K²((x − X_i)/h_n) / ( ∑_{j=1}^{n} K((x − X_j)/h_n) )² ] [F(y|X_i) − F²(y|X_i)].

Proof. Write w_i = K_h(x − X_i) / ∑_{j=1}^{n} K_h(x − X_j), so that F̂_NW(y|x) = ∑_{i=1}^{n} w_i I{Y_i ≤ y}. Since the Y_i are independent and Var(I{Y_i ≤ y}) = F(y|X_i) − F²(y|X_i), we obtain

Var[F̂_NW(y|x)] = ∑_{i=1}^{n} w_i² [F(y|X_i) − F²(y|X_i)],

which is the stated result. ∎

2.4 The NW Estimator of the Conditional L1-Median [2]
Let φ_n(α, x) be the estimate of φ(α, x) defined by

φ_n(α, x) = ∫_{R^d} (‖y − α‖ − ‖y‖) F_n(dy|x) = ∑_{i=1}^{n} (‖Y_i − α‖ − ‖Y_i‖) K_h(x − X_i) / ∑_{i=1}^{n} K_h(x − X_i).

From the definition of µ(x), it seems natural to estimate it by minimizing the estimate φ_n(α, x).

Definition 2.4.1. (The minimizer µ_n is the estimated L1-median) [2], [3]

µ_n(x) = arg min_{α∈R^d} φ_n(α, x) = arg min_{α∈R^d} ∑_{i=1}^{n} ‖Y_i − α‖ K_h(x − X_i),

where the last equality is obtained by removing the terms independent of α in the expression for φ_n(α, x).
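In practice, the minimizer in Definition 2.4.1 can be computed by a Weiszfeld-type fixed-point iteration for the weighted spatial median. The sketch below uses that standard algorithm; it is not code from the thesis, and the helper names and Gaussian kernel are our illustrative choices:

```python
import numpy as np

def weighted_l1_median(Y, w, n_iter=200, eps=1e-8):
    """Minimize sum_i w_i * ||Y_i - alpha|| by Weiszfeld-type iteration."""
    Y = np.asarray(Y, dtype=float)
    alpha = np.average(Y, axis=0, weights=w)   # start at the weighted mean
    for _ in range(n_iter):
        d = np.maximum(np.linalg.norm(Y - alpha, axis=1), eps)
        g = w / d                              # reweight by inverse distance
        alpha = (Y * g[:, None]).sum(axis=0) / g.sum()
    return alpha

def mu_n(x, X, Y, h):
    """NW estimator of the conditional L1-median (Definition 2.4.1)."""
    w = np.exp(-0.5 * ((x - X) / h) ** 2)      # Gaussian kernel weights
    return weighted_l1_median(Y, w)

# Equal weights on a point set symmetric about (1, 0): the L1-median is (1, 0).
pts = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0], [1.0, -1.0]])
print(weighted_l1_median(pts, np.ones(4)))
```
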
2.4.1 Assumptions and Main Results

Let C be a compact subset of R^s on which the marginal density of X, denoted by g, is bounded below by some positive constant. We now state some assumptions which are required to prove the theoretical results of this section.

(A1) The density g of X is uniformly continuous.

(A2) The function µ(·) satisfies a uniform uniqueness property over C:

∀ε > 0, ∃η > 0, ∀t : C −→ R^d,
sup_{x∈C} ‖µ(x) − t(x)‖ ≥ ε ⟹ sup_{x∈C} |φ(µ(x), x) − φ(t(x), x)| ≥ η.

(A3) The kernel K is a bounded, positive, Hölderian function.

(A4) The sequence (h_n)_{n≥1} satisfies

lim_{n→∞} n h_n^s / log n = ∞.

(A5) For any Borel set V ⊂ R^d and any α ∈ R^d, the functions Q(V|·) and φ(α, ·) are continuous on C.
Under the previous conditions, the results in Theorem 2.4.1 can be proved.

Theorem 2.4.1. [2], [3] Assume (A1), (A2), (A3), (A4) and (A5). Then

1. with probability 1 (w.p.1), one can find an integer N ≥ 1 such that if n ≥ N and x ∈ C, the median µ_n(x) associated with the probability measure Q_n(·|x) exists and is unique; moreover, the function µ_n(·) is continuous on C;

2. the function µ(·) is continuous on C;

3. w.p.1,

sup_{x∈C} ‖µ_n(x) − µ(x)‖ → 0 as n → ∞.
2.4.2 Preliminary Lemmas

We review here some preliminary lemmas and definitions that will be helpful in following the proof of Theorem 2.4.1.

Definition 2.4.2. (Borel set)
A Borel set is any set in a topological space that can be formed from open sets (or, equivalently, from closed sets) through the operations of countable union and countable intersection.

Definition 2.4.3. (Continuity, Uniform Continuity) [13]
Let f : I −→ R and let x₀ ∈ I. We say that f is continuous at x₀ if

lim_{x→x₀} f(x) = f(x₀).

Using the (ε, δ) definition of the limit, this means the following:

∀ε > 0, ∃δ = δ(ε, x₀) > 0, ∀x ∈ I, |x − x₀| < δ ⇒ |f(x) − f(x₀)| < ε.

If f is continuous at x for all x ∈ I, we say f is continuous on I. We say that f is uniformly continuous on I if

∀ε > 0, ∃δ = δ(ε) > 0, ∀x, x₀ ∈ I, |x − x₀| < δ ⇒ |f(x) − f(x₀)| < ε.
Lemma 2.4.1. Assume (A5). We have

lim_{‖α‖→∞} sup_{n≥1} sup_{x∈C} | φ_n(α, x)/‖α‖ − 1 | = 0.

Proof. Since for all p ≥ 1 the map x ↦ P(‖Y‖ > p | X = x) is a continuous function (by (A5)), one can find x_p ∈ C such that

sup_{x∈C} P(‖Y‖ > p | X = x) = P(‖Y‖ > p | X = x_p). (2.4.1)

The sequence (x_p)_{p≥1} being in the compact set C, one can extract a subsequence (p_k)_{k≥1} such that

x_{p_k} −→ x_∞ as k −→ ∞. (2.4.2)

Then, w.p.1, for any α ∈ R^d − {0}, x ∈ C, n ≥ 1 and k ≥ 1, we have

| φ_n(α, x)/‖α‖ − 1 | ≤ ∫_{R^d} | (‖y − α‖ − ‖y‖ − ‖α‖)/‖α‖ | F_n(dy|x)
≤ ∫_{‖y‖≤p_k} | (‖y − α‖ − ‖y‖ − ‖α‖)/‖α‖ | F_n(dy|x) + ∫_{‖y‖>p_k} | (‖y − α‖ − ‖y‖ − ‖α‖)/‖α‖ | F_n(dy|x).

Now, for all α ∈ R^d − {0} and y ∈ R^d, the triangle inequality gives | ‖y − α‖ − ‖y‖ | ≤ ‖α‖, so

| ‖y − α‖ − ‖y‖ − ‖α‖ | ≤ | ‖y − α‖ − ‖y‖ | + ‖α‖ ≤ 2‖α‖,

and also ‖α‖ − ‖y‖ ≤ ‖y − α‖ ≤ ‖α‖ + ‖y‖, so

| ‖y − α‖ − ‖y‖ − ‖α‖ | ≤ 2‖y‖.

Thus we get the inequality

| φ_n(α, x)/‖α‖ − 1 | ≤ ∫_{‖y‖≤p_k} (2‖y‖/‖α‖) F_n(dy|x) + ∫_{‖y‖>p_k} 2 F_n(dy|x).

Letting ‖α‖ tend to infinity, we obtain, w.p.1 and for all k ≥ 1,

lim_{‖α‖→∞} sup_{n≥1} sup_{x∈C} | φ_n(α, x)/‖α‖ − 1 | ≤ 2 sup_{n≥1} sup_{x∈C} ∫_{‖y‖>p_k} F_n(dy|x). (2.4.3)
The last upper bound is now shown to tend to zero as k tends to infinity. For k, n ≥ 1 and x ∈ C, let us denote

q_n^x(k) = ∫_{‖y‖>p_k} F_n(dy|x) = ∑_{i=1}^{n} I(‖Y_i‖ > p_k) K((x − X_i)/h_n) / ∑_{i=1}^{n} K((x − X_i)/h_n).

Assuming (A1), (A3), (A4) and (A5), we have, w.p.1, for any k ≥ 1,

sup_{x∈C} | q_n^x(k) − P(‖Y‖ > p_k | X = x) | −→ 0 as n −→ ∞, [4]. (2.4.4)

The P-null set where the convergence in (2.4.4) fails may be chosen independent of k. Moreover, let us note that

( sup_{x∈C} q_n^x(k) )_{k≥1} and ( sup_{x∈C} P(‖Y‖ > p_k | X = x) )_{k≥1}

are decreasing sequences of positive numbers (because the sequence (p_k)_{k≥1} is increasing), hence both converge as k → ∞. Consequently, w.p.1, the convergence in (2.4.4) is uniform in k ≥ 1, i.e.

sup_{k≥1} sup_{x∈C} | q_n^x(k) − P(‖Y‖ > p_k | X = x) | → 0 as n → ∞.

Let ε > 0. By the above property, one can find, w.p.1, an integer N ≥ 1 such that if n > N, k ≥ 1 and x ∈ C,

q_n^x(k) ≤ ε + P(‖Y‖ > p_k | X = x). (2.4.5)

Now, w.p.1, if k ≥ 1,

sup_{n≥1} sup_{x∈C} q_n^x(k) ≤ sup_{n=1,...,N} sup_{x∈C} q_n^x(k) + sup_{n>N} sup_{x∈C} q_n^x(k).
On the one hand, by the very definition of q_n^x(k),

sup_{n=1,...,N} sup_{x∈C} q_n^x(k) ≤ max_{i=1,...,N} I(‖Y_i‖ > p_k).

On the other hand, according to (2.4.5),

sup_{n>N} sup_{x∈C} q_n^x(k) ≤ ε + sup_{x∈C} P(‖Y‖ > p_k | X = x).

But the Y_i's are P-a.s. finite random variables, so that, with probability 1,

lim sup_{k→∞} sup_{n≥1} sup_{x∈C} q_n^x(k)

is bounded above by

ε + lim sup_{k→∞} sup_{x∈C} P(‖Y‖ > p_k | X = x). (2.4.6)

As ε is arbitrary, the proof will be completed by showing that the last term in (2.4.6) equals 0. Let k ≥ 1. According to (2.4.1),

sup_{x∈C} P(‖Y‖ > p_k | X = x) = P(‖Y‖ > p_k | X = x_{p_k}).

Moreover, x_{p_k} → x_∞ as k → ∞, according to (2.4.2). Now, if k ≥ 1 and p′ ≤ p_k,

P(‖Y‖ > p_k | X = x_{p_k}) ≤ P(‖Y‖ > p′ | X = x_{p_k}),

so that if p′ ≥ 1,

lim_{k→∞} P(‖Y‖ > p′ | X = x_{p_k}) = P(‖Y‖ > p′ | X = x_∞),

because P(‖Y‖ > p′ | X = ·) is a continuous function on C by (A5). Letting p′ → ∞, we get

lim_{k→∞} sup_{x∈C} P(‖Y‖ > p_k | X = x) = 0,

because Y is P-a.s. finite. ∎
Lemma 2.4.2. Let us assume that K is a positive probability density. W.p.1, one can find an integer N ≥ 1 such that if n ≥ N and x ∈ C, the L1-median µ_n(x) associated with the probability measure Q_n(·|x) (which is not supported by a straight line) exists and is unique.

Proof. Let x ∈ C and n ≥ 1. W.p.1, the set of minimizers of φ_n(·, x) is non-empty. Let us prove that this set contains only one point. By assumption, if D is any straight line in R^d,

Q(D|x) = P(Y ∈ D | X = x) < 1.

Obviously, this gives

P(Y ∈ D) < 1. (2.4.7)

Now, let us denote by D(y₁, y₂) the line which connects two points y₁, y₂ ∈ R^d (y₁ ≠ y₂), and by P_Y the distribution of Y. Clearly,

P(Y₁, Y₂, Y₃ on the same line) = P(Y₃ ∈ D(Y₁, Y₂)) = ∫_{R^{2d}} P(Y₃ ∈ D(y₁, y₂)) P_Y(dy₁) P_Y(dy₂),

because Y₁, Y₂ and Y₃ are independent and identically distributed random variables. Then, according to (2.4.7), one gets

P(Y₁, Y₂, Y₃ on the same line) < 1.

Let us denote by Ω₁ the subset of Ω defined by

Ω₁ = {Y₁, Y₂, ... not all on the same line}.

From the previous inequality, we have

P(Ω₁^c) ≤ P(Y₁, Y₂, Y₃ on the same line) < 1.

But Ω₁^c is an element of the σ-field generated by the independent random variables Y₁, Y₂, ..., and it is symmetric, so P(Ω₁^c) = 0 by the Hewitt-Savage 0-1 law (see [9]). Hence P(Ω₁) = 1, and one can work on Ω₁ instead of Ω. For all ω ∈ Ω₁, one can find N(ω) ≥ 1 such that if n ≥ N(ω), the points Y₁(ω), Y₂(ω), ..., Y_n(ω) are not on a straight line. Now, let x ∈ R^s, ω ∈ Ω₁ and n ≥ N(ω). Because K is a positive function, the support of the probability measure Q_n(·|x)(ω) is

{Y₁(ω), ..., Y_n(ω)}.

But n ≥ N(ω), so this support is not included in any straight line. Consequently, according to Theorem 2.17 of [18], the set of L1-medians associated with the probability measure Q_n(·|x)(ω) contains only one element: µ_n(x)(ω). ∎
Lemma 2.4.3. Let us assume that K is a continuous positive kernel. W.p.1, one can find an integer N ≥ 1 such that if n ≥ N, µ_n(·) is continuous on C.

Proof. According to Lemma 2.4.2, w.p.1, one can find an integer N ≥ 1 such that if n ≥ N and x ∈ C, the L1-median µ_n(x) associated with the probability measure Q_n(·|x) exists and is unique. We now prove that, P-a.s., if n ≥ N, µ_n(·) is continuous on C. Let x ∈ C, and let (x_p)_{p≥1} be a sequence in C such that x_p −→ x as p −→ ∞. By the continuity of K, one obviously gets that the sequence of probability measures (Q_n(·|x_p))_{p≥1} converges weakly to Q_n(·|x). Moreover, Q_n(·|x) is not supported by any straight line, so that, according to Corollary 2.26 of [18], µ_n(x_p) −→ µ_n(x) as p −→ ∞. Hence, w.p.1, if n ≥ N, µ_n(·) is a continuous function on C. ∎
Lemma 2.4.4. Assume (A1), (A2), (A3), (A4) and (A5). W.p.1, for any A > 0,

sup_{‖α‖≤A} sup_{x∈C} | φ_n(α, x) − φ(α, x) | −→ 0 as n −→ ∞.

Proof. We clearly have, w.p.1, for all α ∈ R^d and i ≥ 1,

| ‖Y_i − α‖ − ‖Y_i‖ | ≤ ‖α‖.

Consequently, assuming (A1), (A3), (A4) and (A5), we have, for all α ∈ R^d and w.p.1,

sup_{x∈C} | φ_n(α, x) − φ(α, x) | −→ 0 as n −→ ∞, (2.4.8)

[4]. But w.p.1, if n ≥ 1, x ∈ C and α, α′ ∈ R^d,

| φ_n(α, x) − φ_n(α′, x) | ≤ ‖α − α′‖ (2.4.9)

and

| φ(α, x) − φ(α′, x) | ≤ ‖α − α′‖.

From this we deduce that the P-null set on which (2.4.8) fails may be chosen independent of α. Moreover, according to (2.4.9), w.p.1 the sequence of functions (φ_n(·, x), n ≥ 1) is equicontinuous, and this property is independent of x. Thus, according to (2.4.8) and the Ascoli-Arzelà theorem [9], we get that, w.p.1, if A > 0,

sup_{‖α‖≤A} sup_{x∈C} | φ_n(α, x) − φ(α, x) | −→ 0 as n −→ ∞.

This completes the proof of the lemma. ∎
2.4.3 Proof of Theorem 2.4.1

Assuming (A3), assertion (1) follows from Lemmas 2.4.2 and 2.4.3. Moreover, (2) is a straightforward consequence of (1) and (3), so one only needs to prove (3). The proof is divided into two steps.

Step 1. We first prove that, w.p.1, one can find r > 0 and N ≥ 1 such that

sup_{n≥N} sup_{x∈C} ‖µ_n(x)‖ ≤ r and sup_{x∈C} ‖µ(x)‖ ≤ r.

From Lemma 2.4.1, one can find, w.p.1, r₁ > 0 such that, if ‖α‖ > r₁, for all n ≥ 1 and all x ∈ C,

φ_n(α, x) ≥ (1/2) ‖α‖. (2.4.10)

We have already proved in Lemma 2.4.2 that, assuming (A3), w.p.1 there exists N ≥ 1 such that if n ≥ N and x ∈ C, the L1-median µ_n(x) associated with the probability measure Q_n(·|x) exists and is unique. Assume now that there exist n ≥ N and x ∈ C such that

‖µ_n(x)‖ > r₁.

Then, according to (2.4.10),

φ_n(µ_n(x), x) ≥ (1/2) ‖µ_n(x)‖ > 0.

But, by the very definition of µ_n(x),

φ_n(µ_n(x), x) = inf_{α∈R^d} φ_n(α, x) ≤ φ_n(0, x) = 0.

This is impossible. Hence, w.p.1, for all n ≥ N,

sup_{x∈C} ‖µ_n(x)‖ ≤ r₁.

Similar arguments lead to the existence of a real number r₂ > 0 such that

sup_{x∈C} ‖µ(x)‖ ≤ r₂.

The desired result is then obtained with r = max(r₁, r₂).

Step 2. Conclusion. W.p.1, if n ≥ N (N was fixed in Step 1),

sup_{x∈C} | φ(µ(x), x) − φ(µ_n(x), x) | ≤ sup_{x∈C} | φ(µ(x), x) − φ_n(µ_n(x), x) | + sup_{x∈C} | φ_n(µ_n(x), x) − φ(µ_n(x), x) |.

From Step 1,

sup_{n≥N} sup_{x∈C} ‖µ_n(x)‖ ≤ r and sup_{x∈C} ‖µ(x)‖ ≤ r,

so that

φ(µ(x), x) = inf_{α∈R^d} φ(α, x) = inf_{‖α‖≤r} φ(α, x)

and

φ_n(µ_n(x), x) = inf_{α∈R^d} φ_n(α, x) = inf_{‖α‖≤r} φ_n(α, x).

Thus, w.p.1, if n ≥ N,

sup_{x∈C} | φ(µ(x), x) − φ(µ_n(x), x) | ≤ sup_{x∈C} | inf_{‖α‖≤r} φ(α, x) − inf_{‖α‖≤r} φ_n(α, x) | + sup_{‖α‖≤r} sup_{x∈C} | φ_n(α, x) − φ(α, x) |.

Assuming (A1), (A3), (A4) and (A5), from Lemma 2.4.4 we get

sup_{x∈C} | φ(µ(x), x) − φ(µ_n(x), x) | → 0 as n → ∞,

w.p.1. Then, applying (A2), one gets

sup_{x∈C} ‖µ(x) − µ_n(x)‖ → 0 as n → ∞,

w.p.1. This completes the proof of Theorem 2.4.1. ∎
Through the study of the NW estimator, we have seen that large bias and boundary effects are its main drawbacks. Hence, in the next chapter, we will study another estimator, known as the double kernel (DK) estimator.
Chapter 3

The Double Kernel Estimator
Introduction
Let (X, Y) be a two-dimensional random variable with joint distribution function F(x, y). Large bias and boundary effects are considered the most important defects of the NW estimator. The NW estimator has been treated and modified in order to obtain a more refined estimator, called the double kernel (DK) estimator; see [10].

Various consistency proofs for the kernel density estimator have been developed over the last few decades. Important milestones are pointwise consistency and almost sure uniform convergence with a fixed bandwidth on the one hand, and the rate of convergence with a fixed or even a variable bandwidth on the other hand. While considering global properties of the empirical distribution function is sufficient for strong consistency, proofs of exact convergence rates use deeper information about the underlying empirical processes. A unifying feature, however, is that both earlier and more recent proofs use bounds on the probability that a sum of random variables deviates from its mean; see [25].

This chapter studies the double kernel estimation of the conditional L1-median of Y for a given value of X, based on a random sample from the above distribution. The joint asymptotic consistency of the conditional L1-median estimated at a finite number of distinct points is established under some regularity conditions. The aim of this chapter is to introduce the double kernel (DK) estimator and its aspects, to discuss the properties of the DK estimator of the conditional L1-median, and to study its asymptotic consistency.

This chapter consists of two sections. In Section 3.1, we introduce the DK estimator of the conditional distribution function F̂_DK(y|x). In Section 3.2, we investigate the asymptotic consistency of the DK estimator.
3.1 The Double Kernel Estimator

If f(x, y) is the joint pdf of the random variables X and Y at (x, y), and g(x) is the marginal pdf of X at x, the conditional pdf of Y given X = x is given by

f(y|x) = f(x, y) / g(x), g(x) > 0, (3.1.1)

for each y within the range of Y, and then

F(y|x) = ∫_{−∞}^{y} f(u|x) du. (3.1.2)

Now we introduce the basic equations of the kernel conditional distribution function estimator. K(u) is assumed to be a kernel function and h_n = h is a sequence of positive numbers converging to zero. Standard kernel estimators of f(x, y), g(x) and f(y|x) are

f̂(x, y) = (nh²)^{−1} ∑_{i=1}^{n} K((x − X_i)/h) K((y − Y_i)/h), (3.1.3)

ĝ(x) = (nh)^{−1} ∑_{i=1}^{n} K((x − X_i)/h), (3.1.4)

and

f̂(y|x) = f̂(x, y) / ĝ(x) = ∑_{i=1}^{n} K((x − X_i)/h) K((y − Y_i)/h) / ( h ∑_{i=1}^{n} K((x − X_i)/h) ).
Definition 3.1.1. The double kernel estimator of the conditional distribution function F(y|x) is defined as

F̂_DK(y|x) = ∫_{−∞}^{y} f̂(u|x) du = B_n(x, y) / ĝ(x),

where

B_n(x, y) = (nh_n)^{−1} ∑_{i=1}^{n} K((x − X_i)/h_n) K̂((y − Y_i)/h_n), K̂(y) = ∫_{−∞}^{y} K(u) du,

K is a probability density function, and h_n is a sequence of positive numbers converging to zero.
[29] used a double kernel approach in which the indicator function in the NW estimator is replaced by a continuous distribution function Ω((y − Y_i)/h₂). Then the estimator takes the form

F̂_DK(y|x) = ∑_{i=1}^{n} w_i(x) Ω((y − Y_i)/h₂),

where

Ω(y) = ∫_{−∞}^{y} W(u) du

is a distribution function with associated density function W(u).
Definition 3.1.2. The double kernel estimator of the conditional median m(x) is defined as

m̂_DK(x) = inf{y ∈ R : F̂_DK(y|x) ≥ 0.5}.
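A minimal sketch of F̂_DK and m̂_DK, taking the logistic cdf as a convenient smooth Ω (the thesis does not fix a particular Ω; the kernels, bandwidths, grid search, and simulated model here are our illustrative choices):

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2)

def omega(u):
    """A smooth distribution function Omega: here the logistic cdf."""
    return 1.0 / (1.0 + np.exp(-u))

def F_dk(y, x, X, Y, h1, h2):
    """Double kernel estimate of F(y|x): the indicator I(Y_i <= y) of the
    NW estimator is replaced by the smooth weight Omega((y - Y_i)/h2)."""
    w = gaussian_kernel((x - X) / h1)
    return np.sum(omega((y - Y) / h2) * w) / np.sum(w)

def median_dk(x, X, Y, h1, h2):
    """m_DK(x) = inf{y : F_DK(y|x) >= 0.5}, located on a grid of y values."""
    for y in np.linspace(Y.min() - 1.0, Y.max() + 1.0, 2001):
        if F_dk(y, x, X, Y, h1, h2) >= 0.5:
            return y
    return Y.max()

rng = np.random.default_rng(3)
X = rng.uniform(-1.0, 1.0, 500)
Y = 2.0 * X + rng.normal(scale=0.3, size=500)   # true conditional median: 2x
print(median_dk(0.5, X, Y, h1=0.1, h2=0.05))    # smooth estimate near 1
```
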
3.2 Consistency of the Double Kernel Estimator

This is the main section of this chapter. Here, we introduce the DK estimator of the L1-median as a minimization problem; then the asymptotic consistency of this estimator is discussed and derived under some regularity conditions. Let φ_n(α, x) be the estimate of φ(α, x) defined by

φ_n(α, x) = ∫_{R^d} (‖y − α‖ − ‖y‖) F_n(dy|x) = ∑_{i=1}^{n} (‖Y_i − α‖ − ‖Y_i‖) K_h(x − X_i) K̂_h(y − Y_i) / ∑_{i=1}^{n} K_h(x − X_i).

From the definition of µ(x), it seems natural to estimate it by minimizing the estimate φ_n(α, x).

Definition 3.2.1. (The minimizer µ_n is the estimated L1-median)

µ_n(x) = arg min_{α∈R^d} φ_n(α, x) = arg min_{α∈R^d} ∑_{i=1}^{n} ‖Y_i − α‖ K_h(x − X_i) K̂_h(y − Y_i),

where the last equality is obtained by removing the terms independent of α in the expression for φ_n(α, x).
Theorem 3.2.1. Assume (A1), (A2), (A3), (A4) and (A5). Then

1. with probability 1 (w.p.1), one can find an integer N ≥ 1 such that if n ≥ N and x ∈ C, the median µ_n(x) associated with the probability measure Q_n(·|x) exists and is unique; moreover, the function µ_n(·) is continuous on C;

2. the function µ(·) is continuous on C;

3. w.p.1,

sup_{x∈C} ‖µ_n(x) − µ(x)‖ → 0 as n → ∞.

Lemma 3.2.1. (Bochner Lemma) [24] Suppose that K is a Borel function satisfying the conditions:

1. sup_{x∈R^s} |K(x)| < ∞

Lemma 3.2.4. Let us assume that K is a positive probability density. W.p.1, one can find an integer N ≥ 1 such that if n ≥ N and x ∈ C, the L1-median µ_n(x) associated with the probability measure Q_n(·|x) (which is not supported by a straight line) exists and is unique.

Lemma 3.2.5. Let us assume that K is a continuous positive kernel. W.p.1, one can find an integer N ≥ 1 such that if n ≥ N, µ_n(·) is continuous on C.

The proofs of the previous lemmas and of the main theorem are obtained using the same techniques as in Chapter 2, after replacing the indicator function I(Y_i ≤ y) by Ω((y − Y_i)/h).
Chapter 4

Application
The performance of the estimators of the conditional L1-median is our concern in this chapter. We use the two estimators studied in Chapters 2 and 3 to analyze and forecast two bivariate time series. The dependency between stock markets in the world of finance indicates that we must deal with multiple time series jointly rather than with each time series alone. The L1-median estimators enable us to predict multivariate medians of different time series. Therefore, we use the NW and DK estimators to analyze bivariate time series. The S-Plus program is used to compute the two estimators based on their theoretical equations from the previous two chapters.

This chapter consists of three sections. In Section 4.1, the first application is given using a bivariate time series from [26] giving the asset prices of two international series, the IBM stock and the SP500 index. The next section is similar to the first, but it depends on another bivariate time series from [26], giving the asset prices of the Intel and Cisco companies. Finally, we close this chapter with Section 4.3, which contains some conclusions and suggestions drawn from our study in this thesis.
4.1 Application 1

We illustrate the application of the NW and DK estimators of the conditional bivariate median by considering the prediction of a financial position with a bivariate financial time series.

Prediction for the IBM and SP500 series

Consider the bivariate time series of the monthly log returns of the IBM stock and the SP500 index from January 1926 to December 1999, consisting of 888 observations. This data set is taken from [26]. We rescaled the data so that they range from zero to one. Now, let x_{1,t} = {IBM}_t and x_{2,t} = {SP500}_t. Thus x_t = (x_{1,t}, x_{2,t}) is a bivariate time series. The two time series x_{1,t} and x_{2,t} are correlated. Table 4.1 provides some summary statistics of the rescaled series. Figures 4.1 and 4.2 show the time plots of the two series, while Figures 4.3 and 4.4 show the scatterplots of the two series and of their squares, respectively.
Table 4.1: Summary statistics of IBM and SP500 data
Series Min. 1st Qu Mean Median 3rd Qu. Max.
IBM 0.00 0.46 0.52 0.52 0.59 1.00
SP500 0.00 0.47 0.51 0.52 0.56 1.00
Details of Calculation

We computed µ_n(x) by finding the minimum over a finite sample of points. For the NW estimator,

µ_n(x) = arg min_{α∈R^d} φ_n(α, x) = arg min_{α∈R^d} ∑_{i=1}^{n} ‖Y_i − α‖ K_h(x − X_i), (4.1.1)

and for the DK estimator,

µ_n(x) = arg min_{α∈R^d} φ_n(α, x) = arg min_{α∈R^d} ∑_{i=1}^{n} ‖Y_i − α‖ K_h(x − X_i) K̂_h(y − Y_i). (4.1.2)
We use the first 880 bivariate observations to predict the last 8 observations of the bivariate time series, using the NW estimator from (4.1.1) and the DK estimator from (4.1.2) for the median. The results for the IBM data are listed in Table 4.2, and Figure 4.5 shows the true observations for IBM together with their predictions using the NW and DK estimators. The results for the SP500 data are listed in Table 4.3 and their graph is shown in Figure 4.6.
Figure 4.1: Time plot of the rescaled IBM stock
Figure 4.2: Time plot of the rescaled SP500 stock
Figure 4.3: Scatterplot of the rescaled IBM stock versus the rescaled SP500 stock
Figure 4.4: Scatterplot of the squares of the rescaled IBM stock versus the squares
of the rescaled SP500 stock
Figure 4.5: Graph of the NW and the DK estimators for IBM
Table 4.2: The DK and NW median estimators for the IBM data.
i DK estimator True value NW estimator
880 0.7051316 0.6751316 0.6651316
881 0.7051316 0.6811093 0.6351316
882 0.6851316 0.4560167 0.5851316
883 0.6351316 0.4889528 0.5351316
884 0.5851316 0.4542471 0.5651316
885 0.5951316 0.1577721 0.5351316
886 0.5751316 0.5832439 0.5651316
887 0.6051316 0.5777071 0.5551316
Figure 4.6: Graph of the NW and the DK estimators for SP500
Table 4.3: The DK and NW median estimators for the SP500 data
i DK estimator True value NW estimator
880 0.5068489 0.6751316 0.5068489
881 0.5468489 0.6811093 0.5468489
882 0.5868489 0.4560167 0.5468489
883 0.5968489 0.4889528 0.5168489
884 0.5468489 0.4542471 0.5368489
885 0.5668489 0.1577721 0.5268489
886 0.5468489 0.5832439 0.5268489
887 0.5668489 0.5777071 0.5168489
In Table 4.4 we report the mean squared error (MSE) of the NW and DK kernel
estimators,
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,$$
where $y_i$ is the true value and $\hat{y}_i$ its predicted value. As seen from
Table 4.4, the NW estimator attains a smaller MSE than the DK estimator for both
series.
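The MSE above amounts to a few lines of Python; the sketch below uses hypothetical arrays in place of the actual predictions:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared prediction error: (1/n) * sum_i (y_i - yhat_i)^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical true values and predictions:
print(mse([0.675, 0.681, 0.456], [0.665, 0.635, 0.585]))
```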
Table 4.4: MSE of the DK and the NW estimators for Application 1
IBM SP500
DK 0.03557136 0.004800537
NW 0.02206879 0.003111601
4.2 Application 2
In this application, we apply the NW and DK estimators to a bivariate time series
consisting of the Cisco and Intel data.
Prediction for the Cisco and Intel series
Suppose that we want to predict values of the following two time series: the stocks
of Cisco Systems and the Intel Corporation. We use the daily log returns of the
two stocks from January 2, 1991 to December 31, 1999, with 2258 observations. [26]
considered this data set and computed the VaR at the end of the data span by
building univariate and multivariate volatility models. We rescaled the data so that
they range from zero to one. Now let x1,t = {Cisco}t and x2,t = {Intel}t; thus
xt = (x1,t, x2,t) is a bivariate time series.
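The rescaling of the data to the range from zero to one is an ordinary min-max transformation, which can be sketched as follows (the function name and the sample returns are illustrative):

```python
import numpy as np

def rescale01(x):
    """Min-max rescale a series so that it ranges from zero to one."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Example on hypothetical daily log returns:
print(rescale01([-0.04, 0.00, 0.02, 0.04]))
```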
The two time series x1,t and x2,t are correlated. Table 4.5 provides some summary
statistics of the rescaled series. Figures 4.7 and 4.8 show the time plots of the
two series, while Figures 4.9 and 4.10 show the scatterplots of the two series
and of their squares, respectively.
Table 4.5: Summary statistics of Cisco and Intel data
Series Min. 1st Qu. Mean Median 3rd Qu. Max.
Cisco 0.00 0.55 0.59 0.54 0.64 1.00
Intel 0.00 0.48 0.54 0.54 0.60 1.00
We use the first 2258 bivariate observations to predict the last 8 observations of
the bivariate time series using the NW and DK estimators for the median. The results
for the Cisco data are listed in Table 4.6, and Figure 4.11 shows the true observations
for Cisco together with their predictions from the NW and DK estimators. The results
for the Intel data are listed in Table 4.7 and graphed in Figure 4.12.
Figure 4.7: Time plot of the rescaled Cisco stock
Figure 4.8: Time plot of the rescaled Intel stock
Figure 4.9: Scatterplot of the rescaled Cisco stock versus the rescaled Intel stock
Figure 4.10: Scatterplot of the squares of the rescaled Cisco stock versus the squares
of the rescaled Intel stock
Figure 4.11: Graph of the NW and the DK estimators for Cisco
Table 4.6: The DK and NW median estimators for the Cisco data
i DK estimator True value NW estimator
2258 0.7280763 0.7280763 0.7380763
2259 0.6880763 0.5314482 0.7080763
2260 0.6380763 0.5408192 0.6580763
2261 0.6580763 0.6439006 0.6380763
2262 0.6280763 0.6509391 0.6380763
2263 0.6580763 0.4613087 0.6380763
2264 0.6280763 0.5078365 0.6480763
2265 0.6580763 0.6794206 0.6280763
Figure 4.12: Graph of the NW and the DK estimators for Intel
Table 4.7: The DK and NW median estimators for the Intel data
i DK estimator True value NW estimator
2258 0.5176012 0.4776012 0.5176012
2259 0.5576012 0.3765592 0.5576012
2260 0.5176012 0.4469881 0.5576012
2261 0.5576012 0.4590812 0.5476012
2262 0.5276012 0.6055674 0.5676012
2263 0.5576012 0.4269851 0.5576012
2264 0.5276012 0.8381123 0.5576012
2265 0.5576012 0.5740417 0.5576012
Table 4.8 summarizes the computed MSE for the two time series using the NW and DK
estimators. The results indicate that the NW estimator performs better than the DK
estimator for the Cisco data, while the DK estimator performs better than the NW
estimator for the Intel data.
Table 4.8: MSE of the DK and the NW estimators for Application 2
Intel Cisco
DK 0.0110432 0.02111191
NW 0.01898825 0.01234954
4.3 Discussion and Conclusion
The need to deal with multiple time series rather than dealing with each time series,
alone gives the researchers the idea to use L1- median estimator, for more details
see [2].
[2] has used the NW estimator for estimating the L1- median. This estimator de-
pends on the indicator function and this make the resulting curves are not smooth.
To get curves much smoothers, we have proposed to use a DK estimator for the L1
median. By looking on Figure 4.5, Figure 4.6, Figure 4.11 and Figure 4.12, we note
that the estimated curves using the DK is smoother than that we have obtained
using the NW estimator.
The comparison based on the MSE indicated that the NW estimator was better than
the DK estimator for three of the four time series.
From these results, we can conclude the following:

1. To obtain smooth curves, the DK estimator should be used.

2. In general, the NW estimator gives curves that are closer to the real data.

3. Modified estimators based on the NW estimator, such as reweighted versions or
   versions with a variable bandwidth, would likely improve on the estimators used
   here, but they require more complicated computations and programming.
Bibliography

[1] Al Attal, A. (2016). On the Kernel Estimation of the Conditional Median. The Islamic University of Gaza.

[2] Berlinet, A., Cadre, B. and Gannoun, A. (1998). On the Conditional L1-Median and its Estimation. Nonparametric Statistics.

[3] Berlinet, A., Cadre, B. and Gannoun, A. (2001). On the Conditional L1-Median and its Estimation. University of Montpellier, France.

[4] Bosq, D. and Lecoutre, J. P. (1987). Théorie de l'estimation fonctionnelle. Economica.

[5] Hansen, B. E. (2009). Lecture Notes on Nonparametrics. University of Wisconsin, Spring 2009.

[6] Cameron, A. C. and Trivedi, P. K. (1998). Regression Analysis of Count Data. New York: Cambridge University Press.

[7] Casella, G. and Berger, R. (2002). Statistical Inference. USA.

[8] De Gooijer, J. G., Gannoun, A. and Zerom, D. (2004). A Multivariate Quantile Predictor. UvA Econometrics, Discussion Paper 2002/08.

[9] Dudley, R. M. (1989). Real Analysis and Probability. Chapman and Hall.

[10] Fan, J., Hu, T. C. and Truong, Y. K. (1994). Robust Nonparametric Function Estimation. Scandinavian Journal of Statistics, 21, 433-446.

[11] Fan, J. and Yao, Q. (2003). Nonlinear Time Series: Nonparametric and Parametric Methods. Springer-Verlag, USA.

[12] Fan, J., Yao, Q. and Tong, H. (1996). Estimation of conditional densities and sensitivity measures in nonlinear dynamical systems. Biometrika, Vol. 83, 189-206.

[13] Florescu, I. (2015). Probability and Stochastic Processes. John Wiley and Sons.

[14] Freund, J. (1992). Mathematical Statistics. Arizona State University.

[15] Casella, G. and Berger, R. L. (1990). Statistical Inference. Cornell University, North Carolina State University.

[16] Hogg, R., McKean, J. and Craig, A. (2005). Introduction to Mathematical Statistics. University of Iowa, Western Michigan University, University of Iowa.

[17] Hyndman, R. J. (1996). Estimating and visualizing conditional densities. Journal of Computational and Graphical Statistics, Vol. 5, 315-336.

[18] Kemperman, J. H. B. (1987). The Median of a Finite Measure on a Banach Space. In: Statistical Data Analysis Based on the L1-Norm and Related Methods, Y. Dodge (ed.), Amsterdam: North-Holland, 217-230.

[19] Nadaraya, E. A. (1964). On Estimating Regression. Theory of Probability and its Applications, 10, 186-190.

[20] Rider, P. R. (1960). Variance of the median of small samples from several special populations. Journal of the American Statistical Association.

[21] Rao, C. R. (1965). Linear Statistical Inference and Its Applications. Wiley, New York.

[22] Rosenblatt, M. (1969). Conditional Probability Density and Regression Estimators. New York: Academic Press.

[23] Royden, H. L. (1997). Real Analysis. Stanford University.

[24] Salha, R. (2006). Kernel Estimation for the Conditional Mode and Quantiles of Time Series. University of Macedonia, Economic and Social Sciences.

[25] Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.

[26] Tsay, R. S. (2002). Analysis of Financial Time Series. John Wiley and Sons.

[27] Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman and Hall.

[28] Watson, G. S. (1964). Smooth regression analysis. Sankhya, Series A, 26, 359-372.

[29] Yu, K. and Jones, M. C. (1998). Local Linear Quantile Regression. Journal of the American Statistical Association, Vol. 93, No. 441, 228-237.