On the Estimation of the Conditional L1-Median
Muna Yaqoub Mohamed Quraiq
Supervised by
Prof. Raid B. Salha
Prof. of Mathematical Statistics
A thesis submitted in partial fulfillment of the requirements for the degree of
Master of Science in Mathematics
March / 2017
The Islamic University–Gaza
Research and Postgraduate Affairs
Faculty of Science
Master of Mathematics
Mathematical Statistics
Declaration (in Arabic)

I, the undersigned, presenter of the thesis entitled "On the Estimation of the Conditional L1-Median", declare that the work contained in this thesis is the product of my own effort, except where reference is made to the work of others, and that this thesis as a whole, or any part of it, has not previously been submitted by me or by others for a degree or a scientific or research title at any other educational or research institution.
Declaration
I understand the nature of plagiarism, and I am aware of the University’s
policy on this.
The work provided in this thesis, unless otherwise referenced, is the
researcher's own work, and has not been submitted by others elsewhere
for any other degree or qualification.
Student's name: Muna Yaqoub Mohamed Quraiq
Signature: Muna Quraiq
Date: 4/3/2017
Abstract
In this thesis, we study the conditional L1-median, which plays an important role in nonparametric prediction.

An estimator of the L1-median based on the Nadaraya-Watson estimator of the conditional distribution function has been studied, and its consistency has been derived under some conditions.

Another estimator, based on the double kernel estimator of the conditional distribution function, has been proposed in order to obtain a smoother estimator than that based on the Nadaraya-Watson estimator.

The performance of the two estimators has been tested using two bivariate time series data sets. The comparison indicated that the double kernel estimator gives smooth prediction curves, while the Nadaraya-Watson estimator gives a smaller mean square error.
Abstract in Arabic

This study is concerned with the conditional median, which plays an important role in nonparametric prediction. An estimator of the median based on the Nadaraya-Watson estimator of the conditional distribution function has been studied, and its consistency has been proved under certain conditions. Another estimator, based on the double kernel estimator of the conditional distribution function, has been proposed in order to obtain a smoother estimator than the one based on the Nadaraya-Watson estimator. The performance of the two estimators has been tested using two bivariate time series. The comparison indicated that the double kernel estimator gave smooth prediction curves, while the Nadaraya-Watson estimator gave estimates with a small error.
Dedication
To My Parents.
To My husband.
To My brothers and sisters.
To My Friends.
To all Knowledge Seekers.
Acknowledgment
First, I thank Allah for His many blessings, especially the blessing of success. I would like to thank several people for
helping me with this work. I would like to thank my professor
Dr. Raid Salha for his support, patience, encouragement, and
spending long hours helping me. Without his support I could never have accomplished this thesis. I would like to thank the
whole group of Department of Mathematics who gave me great
support and helped me in my work. I would like to thank all my
colleagues and friends at the Department of Mathematics for
encouraging me. I want to express my great gratitude to my
entire family for their continuous support and encouragement
during my whole life and especially during this work. And
special thanks to my father, mother, husband and my children.
Table of Contents
Declaration ......................................................................................................................... i
Abstract in English ............................................................................................................ ii
Abstract in Arabic ............................................................................................................ iii
Dedication ........................................................................................................................ iv
Acknowledgment .............................................................................................................. v
Table of Contents ............................................................................................................. vi
List of Tables .................................................................................................................. vii
List of Figures ................................................................................................................ viii
List of Abbreviations ....................................................................................................... ix
List of Symbols ................................................................................................................. x
Chapter 1 Preliminaries ................................................................................................. 4
1.1 Basic Definitions and Notations ............................................................................. 4
1.2 Kernel Density Estimation of the Pdf ................................................................... 10
1.3 Properties of the Kernel Estimator ........................................................................ 12
1.4 Optimal Bandwidth ............................................................................................... 15
Chapter 2 Nadaraya-Watson Estimator of L1-Median ............................................ 19
2.1 Importance of the Median ......................................................................... 20
2.2 The Conditional L1-Median ................................................................................. 21
2.3 The Nadaraya-Watson Estimator ......................................................................... 25
2.4 The Consistency of the Nadaraya-Watson Estimator of the L1-Median ............ 28
2.4.1 Assumption and Main Results……………………………………………29
2.4.2 Preliminary Lemmas……………………………...……………………...30
2.4.3 Proof of the Theorem in Section 2.4.1………………………………….…36
Chapter 3 The Double Kernel Estimator .................................................................. 39
3.1 The Double Kernel Estimator ............................................................................. 40
3.2 Consistency of the Double Kernel Estimator ...................................................... 41
Chapter 4 Applications ................................................................................................. 44
4.1 Application 1 ......................................................................................................... 45
4.2 Application 2 ...………………………………………………50
4.3 Discussion and conclusion ...............................................................................56
The Reference List ........................................................................................................ 58
List of Tables
Table (4.1): Summary statistics of IBM and SP500 data................................................ 45
Table (4.2): The DK and NW median estimators for the IBM data .............................. 48
Table (4.3): The DK and NW median estimators for the SP500 data ........................... 49
Table (4.4): MSE of the DK and the NW estimators for Application 1 ........................ 50
Table (4.5): Summary statistics of Cisco and Intel data ................................................. 51
Table (4.6): The DK and NW median estimators for the Cisco data ............................. 54
Table (4.7): The DK and NW median estimators for the Intel data............................... 55
Table (4.8): MSE of the DK and the NW estimators for Application 2 ........................ 56
List of Figures
Figure (1.1): Kernel density estimation based on 7 points ([11]). ............................. 11
Figure (1.2): Kernel density estimates of the Ethanol data ([11]) ............................. 12
Figure (1.3): Kernel density estimates based on different bandwidths h = 0.25 (solid
curve), h = 0.5 (dashed curve), h = 0.75 (dotted curve) ([11]) .................................. 18
Figure (4.1): Time plot of the rescaled IBM stock. ................................................... 46
Figure (4.2): Time plot of the rescaled SP500 stock. ................................................ 46
Figure (4.3): Scatterplot of the rescaled IBM stock versus the rescaled SP500 stock.
................................................................................................................................... 47
Figure (4.4): Scatterplot of the squares of the rescaled IBM stock versus the squares
of the rescaled SP500 stock. ...................................................................................... 47
Figure (4.5): Graph of the NW and the DK estimators for IBM . ........................... 48
Figure (4.6): Graph of the NW and the DK estimators for SP500. .......................... 49
Figure (4.7): Time plot of the rescaled Cisco stock. .................................................. 52
Figure (4.8): Time plot of the rescaled Intel stock. ................................................... 52
Figure (4.9): Scatterplot of the rescaled Cisco stock versus the rescaled Intel stock.
.................................................................................................................................... 53
Figure (4.10): Scatterplot of the squares of the rescaled Cisco stock versus the
squares of the rescaled Intel stock. ............................................................................. 53
Figure (4.11): Graph of the NW and the DK estimators for Cisco ............................... 54
Figure (4.12): Graph of the NW and the DK estimators for Intel………………… 55
List of Abbreviations
arg Argument
Cdf Cumulative distribution function
Cov Covariance
C.I. Confidence Interval
i.i.d. Independent and identically distributed
Inf Infimum
Min Minimum
MSE Mean Square Error
o Small oh
O Big oh
Pdf Probability density function
Var. Variance
N-W Nadaraya-Watson
DK Double Kernel
w.p With probability
List of Symbols
Symbol Description
X univariate random variable, X ∈ R.
X multivariate random variable, X ∈ Rd, d ≥ 2.
f(x) Probability density function.
F (x) Cumulative distribution function.
Kh Scaled univariate kernel function.
f̂ kernel estimator for the function f .
P probability set function.
h the bandwidth smoothing parameter.
µ the mean.
σ2 the variance.
E the expectation.
R(K) = ∫K²(x)dx.
IA the indicator function.
| · | the absolute value function.
R the set of real numbers.
∏ product.
N(0, 1) standard normal (Gaussian) distribution.
µj(K) j-th moment of a kernel K.
K(·) the kernel function.
→p converge in probability.
→d converge in distribution.
wi weight function.
|| · || the lp-norm function.
|| · ||p,α the norm-like function.
Preface
The probability density function is a fundamental concept in statistics. Consider any random variable X that has probability density function (pdf) f(x), x ∈ R. When we observe data, the pdf of the underlying distribution may be unknown, so it is useful to estimate it. There are many methods for the statistical estimation of the density function, and they fall into two kinds: parametric estimation and nonparametric estimation. It is helpful to distinguish between the two. Parametric estimation assumes that the sample under study comes from a known family of distributions, such as the Gaussian or Gamma distributions, and then estimates the unknown parameters of that distribution using, for example, the method of moments, maximum likelihood estimators, Bayes estimators, or minimum chi-square estimators. For example, if we assume that the data follow a normal distribution with mean µ and variance σ², we can estimate the parameters µ and σ² and substitute them into the normal density formula. We then obtain the estimated density function, denoted by f̂(x).
On the other hand, nonparametric estimation is a very useful way of dealing with data from an unknown distribution. It is used to estimate the density function in order to choose a suitable model for a given data set. Examples of nonparametric estimators include the histogram estimator, the naive estimator, the kernel estimator, the Nadaraya-Watson estimator, and the kernel nearest neighbor (KNN) estimator; for more details see [15], [21] and [25].
In this thesis, we will study the kernel estimation of the conditional L1-median.
The sample median is defined as the middle value of a set of ranked data, i.e. the sample median splits the data into two parts with an equal number of data points in each. Usually, the sample median is taken as an estimator of the population median m, a quantity which splits the distribution into two halves in the sense that P(Y ≤ m) = P(Y ≥ m) = 1/2. Multivariate time series arise when several time series are observed simultaneously over time. A multivariate time series consists of multiple single series referred to as components; see [26] and [8]. When the individual
series are related to each other, there is a need for jointly analyzing the series rather
than treating each one separately. By so doing, one hopes to improve the accuracy
of the predictions by utilizing the additional information available from the related
series in the predictions of each other. Several extensions of the concept of univari-
ate median were introduced in the literature with application in different fields of
statistics. One of them, the so called L1- median.
A nonparametric estimator is proposed for estimating the L1-median of a multivariate conditional distribution when the covariates take values in an infinite dimensional space. The multivariate case is more appropriate for predicting the components of a vector of random variables simultaneously rather than predicting each of them separately.
In this thesis, we will study the conditional L1-median, which plays an important role in nonparametric prediction. The main goal of this thesis is to modify the estimator of the L1-median by using the double kernel estimator rather than the Nadaraya-Watson estimator. In our study, we will discuss the consistency of the proposed estimator. Also, applications using real data to test the performance of the L1-median estimator will be given.
This thesis consists of four chapters which are organized as follows:
Chapter 1
This chapter contains notations, some basic definitions, and facts that we need in the thesis. Also, we introduce the kernel density estimation of the pdf and the properties of the kernel estimator.
Chapter 2
In this chapter, we introduce the L1-median, and we use the Nadaraya-Watson (NW) estimator of the conditional cdf to estimate it. We will study the asymptotic consistency properties of the NW estimator.
Chapter 3
In this chapter, we use the double kernel (DK) estimator of the cdf rather than the
NW estimator.
Chapter 4
In this chapter, we will practically compare the two estimators, the NW and the DK estimators, and give a conclusion.
Chapter 1
Preliminaries
This chapter contains notations, some basic definitions, and facts that we need in the remainder of this thesis. It is organized as follows: In Section 1.1, we introduce
some basic definitions and notations related to the areas of this thesis. In Section
1.2, we introduce the kernel density estimator of the probability density function
(pdf). In the next section, we summarize some properties of the kernels. Finally, in
Section 1.4, we present the problems of the optimal bandwidth selection.
1.1 Basic Definitions and Notations
In this section, we will introduce some basic definitions and theorems that will be of help in the remainder of this thesis.
Consider any random variable X that has pdf f . Specifying the function f gives a
natural description of the distribution of X, and allows probabilities associated with
X to be found from the relation

P(a < X < b) = ∫_a^b f(x)dx

for any real constants a and b with a < b. If the observed data are drawn from a distribution with unknown pdf, then the construction of an estimator of the unknown density function is called density estimation. Density estimation has experienced a wide explosion of interest over the last 40 years. It has been applied in many fields, including archeology, chemistry, banking, climatology, genetics, economics, hydrology and physiology. Density estimates give valuable indications of such features as skewness and multimodality in the data; in some cases they will yield conclusions directly, while in others all they do is point the way to further analysis and data collection.
Definition 1.1.1. (Estimator) [14] An estimator is any statistic from the sample
data which is used to give information about an unknown parameter in the popula-
tion.
For example, the sample mean is an estimator of the population mean. Estimators
of population parameters are sometimes distinguished from the true value by using
the hat symbol. For example, the normal distribution has two parameters, the mean µ and the standard deviation σ. Their estimators are denoted by µ̂ and σ̂.
Definition 1.1.2. [14] Let X be a random variable with pdf with parameter θ. Let
X1,X2,... ,Xn be a random sample from the distribution of X and let θ̂ denotes an
estimator of θ. We say θ̂ is an unbiased estimator of θ if
E(θ̂) = θ.
If θ̂ is not unbiased, we say that θ̂ is a biased estimator of θ.
Example 1.1.1. S² is an unbiased estimator of σ².

Proof.

E(S²) = E( (1/(n−1)) ∑(Xi − x̄)² )
= (1/(n−1)) E(∑(Xi − x̄)²)
= (1/(n−1)) E(∑(Xi − µ + µ − x̄)²)
= (1/(n−1)) E[ ∑(Xi − µ)² − 2(x̄ − µ)∑(Xi − µ) + ∑(x̄ − µ)² ]
= (1/(n−1)) [ nE(Xi − µ)² − 2nE(x̄ − µ)² + nE(x̄ − µ)² ]
= (1/(n−1)) [ nσ² − nE(x̄ − µ)² ]
= (1/(n−1)) [ nσ² − n(σ²/n) ]
= (σ²/(n−1)) [n − 1] = σ².

So, S² is an unbiased estimator of σ².
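As a quick numerical illustration of this unbiasedness (a NumPy sketch; the sample size, number of repetitions, and the normal population are illustrative choices, not part of the text), averaging S² over many repeated samples should reproduce σ²:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0           # true population variance
n, reps = 10, 20000    # small samples, many repetitions

# ddof=1 gives S^2 with divisor n - 1, the unbiased estimator above
s2 = np.array([rng.normal(0.0, np.sqrt(sigma2), n).var(ddof=1)
               for _ in range(reps)])

print(s2.mean())  # close to sigma2
```

Note that even for samples as small as n = 10, the average of S² over many repetitions is close to σ²; unbiasedness is a property of the expectation, not of any single sample.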
Definition 1.1.3. [14] If θ̂ is an unbiased estimator of θ and

Var(θ̂) = 1 / ( n E[ (∂ ln f(X)/∂θ)² ] ),   (1.1.1)

then θ̂ is called a minimum variance unbiased estimator (efficient) of θ.
Example 1.1.2. x̄ is a minimum variance unbiased estimator for µ in a normal population.

Proof.

Var(x̄) = σ²/n.

f(x) = (1/(σ√(2π))) exp(−(1/2)((x − µ)/σ)²)

ln f(x) = −ln σ − (1/2) ln 2π − (1/2)((x − µ)/σ)²

∂ ln f(x)/∂µ = −(1/2)·2·((x − µ)/σ)(−1/σ) = (x − µ)/σ²

E((x − µ)/σ²)² = (1/σ⁴) E(x − µ)² = σ²/σ⁴ = 1/σ²

1 / ( n E(∂ ln f(x)/∂µ)² ) = 1/(n(1/σ²)) = σ²/n.

Hence Var(x̄) attains the bound (1.1.1), and x̄ is efficient.
Definition 1.1.4. [14] The statistic θ̂ is a consistent estimator of the parameter θ if and only if for each c > 0,

lim_{n→∞} P(|θ̂ − θ| < c) = 1.   (1.1.2)

Example 1.1.3. x̄ is a consistent estimator for µ in a normal population.

Theorem 1.1.1. [14] If θ̂ is an unbiased estimator of θ and Var(θ̂) → 0 as n → ∞, then θ̂ is a consistent estimator of θ.
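Theorem 1.1.1 can be illustrated numerically for θ̂ = x̄: since Var(x̄) = σ²/n → 0, the empirical probability P(|x̄ − µ| < c) should approach 1 as n grows. A NumPy sketch (the sample sizes, the constant c, and the number of repetitions are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, c, reps = 0.0, 0.1, 5000

coverage = {}
for n in (10, 100, 1000):
    # reps independent sample means of size-n N(mu, 1) samples
    xbar = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)
    coverage[n] = float(np.mean(np.abs(xbar - mu) < c))

print(coverage)  # the probabilities increase toward 1 as n grows
```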
Definition 1.1.5. [14] The statistic θ̂ is a sufficient estimator of the parameter θ if and only if for each value t of θ̂ the conditional probability distribution or density of the random sample X1, X2, ..., Xn given θ̂ = t is independent of θ.

Example 1.1.4. x̄ is a sufficient estimator for µ in a normal population.
There are two types of density estimation:
• Parametric Estimation.
• Nonparametric Estimation.
Parametric Estimation.
The parametric approach for estimating f(x) is to assume that f(x) is a member
of some parametric family of distributions, e.g. N(µ, σ²), and then to estimate the
parameters of the assumed distribution from the data. For example, fitting a normal
distribution leads to the estimator

fn(x) = (1/(√(2π)S)) exp(−(x − x̄)²/(2S²)), x ∈ R,

where

x̄ = (1/n) ∑_{i=1}^n xi

and

S² = (1/(n−1)) ∑_{i=1}^n (xi − x̄)².
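This parametric fit can be sketched in a few lines (NumPy; the function name and the simulated data below are illustrative, not from the text):

```python
import numpy as np

def normal_fit_pdf(data):
    """Fit a normal density by the sample mean x-bar and S, return the fitted pdf."""
    xbar = data.mean()
    s = data.std(ddof=1)   # S, with divisor n - 1, as in the formula above
    def f_n(x):
        return np.exp(-(x - xbar) ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)
    return f_n

rng = np.random.default_rng(2)
sample = rng.normal(5.0, 2.0, 500)   # illustrative data: N(5, 4)
f_n = normal_fit_pdf(sample)
print(f_n(5.0))  # near the true peak value 1/(2 sqrt(2 pi)) ≈ 0.199
```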
This approach has advantages as long as the distributional assumption is correct,
or at least is not seriously wrong. It is easy to apply and it yields (relatively) stable
estimates. The main disadvantage of the parametric approach is its lack of flexibility.
Each parametric family of distributions imposes restrictions on the shapes that f(x)
can have. For example, the density function of the normal distribution is symmetrical and bell-shaped, and is therefore unsuitable for representing skewed or bimodal densities.
Nonparametric Estimation.
If the data that we study come from an unknown distribution, i.e., the density function f(x) is unknown, then we must estimate the density function. This estimation is called nonparametric estimation. There are many nonparametric statistical objects of potential interest, including density functions (univariate and multivariate), density derivatives, conditional density functions, conditional distribution functions, regression functions, median functions, quantile functions, and variance functions. Many nonparametric problems are generalizations of univariate density estimation. There are many methods for obtaining a nonparametric estimate of a pdf:

1. Histogram.

2. The Naive Estimator.

3. Kernel Density Estimation.
Definition 1.1.6. (Indicator function) [23] If A is any set, we define the indicator function IA of the set A to be the function given by

IA(x) = 1 if x ∈ A, and IA(x) = 0 if x ∉ A.

Definition 1.1.7. (Converge in Probability) [16]
Let Xn be a sequence of random variables and let X be a random variable defined on a sample space. We say Xn converges in probability to X if for all ε > 0, we have

lim_{n→∞} P[|Xn − X| ≥ ε] = 0,   (1.1.3)

or equivalently,

lim_{n→∞} P[|Xn − X| < ε] = 1.   (1.1.4)

If so, we write Xn →p X.
Definition 1.1.8. (Converge in Distribution) [16]
Let Xn be a sequence of random variables and let X be a random variable. Let FXn and FX be, respectively, the cdfs of Xn and X. Let C(FX) denote the set of all points where FX is continuous. We say that Xn converges in distribution to X if

lim_{n→∞} FXn(x) = FX(x), for all x ∈ C(FX).   (1.1.5)

We denote this convergence by Xn →d X.
Definition 1.1.9. (Converge with Probability 1) [16]
Let {Xn}_{n=1}^∞ be a sequence of random variables on (Ω, L, P). We say that Xn converges almost surely to a random variable X (Xn →a.s. X), or converges with probability 1 to X, or converges strongly to X, if and only if

P({w : Xn(w) → X(w), as n → ∞}) = 1,

or equivalently, for all ε > 0,

lim_{N→∞} P(|Xn − X| < ε for all n ≥ N) = 1.
Definition 1.1.10. (Order Notation O and o) [27]
Let an and bn each be sequences of real numbers. We say that an is of big order bn (an is big oh bn) as n → ∞, and write an = O(bn) as n → ∞, if and only if

lim sup_{n→∞} |an/bn| < ∞.

Similarly, we say that an is of small order bn (an is small oh bn), and write an = o(bn) as n → ∞, if and only if

lim_{n→∞} an/bn = 0.
Theorem 1.1.2. (Taylor's Theorem) [27]
Suppose that f is a real-valued function defined on R and let x ∈ R. Assume that f has p continuous derivatives in an interval (x − δ, x + δ) for some δ > 0. Then for any sequence αn converging to zero,

f(x + αn) = ∑_{j=0}^p (αn^j / j!) f^(j)(x) + o(αn^p).
1.2 Kernel Density Estimation of the Pdf
We present the kernel density estimation of the pdf and review some important
definitions and aspects in this area. In statistics, kernel density estimation (KDE) is a nonparametric way to estimate the pdf of a random variable. Kernel density estimation is a fundamental data smoothing problem, where inferences about the population are made based on a finite data sample.
Definition 1.2.1. (Kernel Estimator of a Probability Density Function) [25] Suppose that X1, ..., Xn is a random sample of data from an unknown continuous distribution with pdf f(x) and cumulative distribution function (cdf) F(x). The kernel estimator of the probability density function is defined as

f̂(x) = (1/(nh)) ∑_{i=1}^n K((x − Xi)/h),   (1.2.1)

where the bandwidth h = hn is a sequence of positive numbers converging to zero and K(·) is the kernel function, usually taken to be symmetric and satisfying

∫_{−∞}^{∞} K(x)dx = 1.

The j-th moment of the kernel is

µj(K) = ∫_{−∞}^{∞} y^j K(y)dy.   (1.2.2)

The density estimates derived using such kernels can fail to be probability densities, because they can be negative for some values of x. Typically, K is chosen to be a symmetric pdf. There is a large body of literature on choosing K and h well, where "well" means that the estimate converges asymptotically as rapidly as possible in some suitable norm on pdfs.
A slightly more compact formula for the kernel estimator can be obtained by introducing the rescaling notation Kh(u) = h⁻¹K(u/h). This allows us to write

f̂(x) = n⁻¹ ∑_{i=1}^n Kh(x − Xi).
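A minimal NumPy implementation of the estimator (1.2.1) with a Gaussian kernel (the kernel choice, the bandwidth, and the seven data points are illustrative; the actual points behind Figure 1.1 are not given in the text):

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal pdf used as the kernel K."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(x, data, h):
    """f_hat(x) = (1/(n h)) * sum_i K((x - X_i)/h), Eq. (1.2.1)."""
    x = np.asarray(x, dtype=float)
    u = (x[..., None] - data) / h          # scaled distance to every X_i
    return gaussian_kernel(u).mean(axis=-1) / h

data = np.array([2.1, 2.4, 2.5, 3.0, 3.9, 4.2, 4.4])  # seven illustrative points
xs = np.linspace(0.0, 6.5, 651)
f_hat = kde(xs, data, h=0.4)
print(f_hat.max())
```

Because the Gaussian kernel is itself a pdf, the resulting estimate is nonnegative and integrates to one, exactly as the definition requires.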
Figure 1.1: Kernel density estimation based on 7 points([11])
From Figure (1.1), we note that:

(1) The shape of each bump is determined by the kernel function.

(2) The width of each bump is determined by the bandwidth h.

That is, the value of the kernel estimate at the point x is the average of the n kernel ordinates at this point.
Figure 1.2: Kernel density estimates of the Ethanol data ([11])
Figure 1.2 shows the kernel density estimates of the Ethanol data based on the
same bandwidth hn = 0.2, but using different kernels. The solid curve stands for
the triangular kernel, the dashed curve for the uniform kernel, and the dotted curve
for the normal kernel.
1.3 Properties of the Kernel Estimator
In this section, we will introduce some important properties of the kernel. A kernel is a piecewise continuous, even function, symmetric around zero and integrating to one, i.e.,

K(x) = K(−x),  ∫_{−∞}^{∞} K(x)dx = 1.

The kernel function need not have bounded support, and in most applications K is a positive pdf.

Definition 1.3.1. [5] A kernel function K is said to be of order p if its first nonzero moment is µp, i.e., if

µj(K) = 0, j = 1, 2, ..., p − 1;  µp(K) ≠ 0;

where

µj(K) = ∫_{−∞}^{∞} y^j K(y)dy.
We consider the following conditions:

(i) The unknown density function f(x) has a continuous second derivative f″(x).

(ii) The bandwidth h = hn satisfies lim_{n→∞} h = 0 and lim_{n→∞} nh = ∞.

(iii) The kernel K is a bounded pdf of order 2, symmetric about the origin, so that ∫_{−∞}^{∞} zK(z)dz = 0 and ∫_{−∞}^{∞} z²K(z)dz ≠ 0.
Expanding f(x − hz) in a Taylor series about x, we obtain

f(x − hz) = f(x) − hzf′(x) + (1/2)h²z²f″(x) + o(h²).

Hence

E(f̂(x)) = ∫K(z)[f(x) − hzf′(x) + (1/2)h²z²f″(x)]dz + o(h²)
= f(x)∫K(z)dz − hf′(x)∫zK(z)dz + (h²/2)f″(x)∫z²K(z)dz + o(h²).

This leads to

E f̂(x) = f(x) + (1/2)h²f″(x)∫z²K(z)dz + o(h²),   (1.3.1)

where we have used ∫K(z)dz = 1 and ∫zK(z)dz = 0.

For the variance, let z = (x − y)/h; then we get

Var(f̂(x)) = (1/(nh²))[ h∫K²(z)f(x − zh)dz − (h∫K(z)f(x − zh)dz)² ].   (1.3.3)

Using the Taylor series for f(x − zh) as before, we have

Var(f̂(x)) = (1/(nh²))[ h∫(f(x) − hzf′(x))K²(z)dz − o(h²) ]
= (1/(nh))[ ∫(f(x) − hzf′(x))K²(z)dz − o(h) ]
= (nh)⁻¹f(x)∫K²(z)dz + o((nh)⁻¹).   (1.3.2)

By the assumptions on h, we have the results.
From the lemmas above, we have the following properties of the bias and the variance:
1. The bias is of order h2, which implies that f̂(x) is an asymptotically unbiased
estimator.
2. The bias is large whenever the absolute value of the second derivative |f″(x)| is large. This occurs for several densities at peaks, where the bias is negative, and at valleys, where the bias is positive.
3. The variance is of order (nh)−1, which means that the variance converges to zero
by Condition (ii).
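The h² order of the bias can be checked exactly when both the data density and the kernel are standard normal, because then E f̂(x) = (Kh ∗ f)(x) is itself a normal density with variance 1 + h². A small sketch (the evaluation point and the bandwidths are illustrative choices):

```python
import numpy as np

def phi(x, s=1.0):
    """Normal density with mean 0 and standard deviation s."""
    return np.exp(-x ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)

x0 = 0.0
results = {}
for h in (0.4, 0.2, 0.1, 0.05):
    exact_mean = phi(x0, np.sqrt(1 + h ** 2))  # E f_hat(x0): N(0, 1 + h^2) density
    exact_bias = exact_mean - phi(x0)
    f2 = (x0 ** 2 - 1) * phi(x0)               # f''(x0) for the standard normal
    approx_bias = 0.5 * h ** 2 * f2            # (h^2/2) f''(x0), since mu_2(K) = 1
    results[h] = (exact_bias, approx_bias)
    print(h, exact_bias, approx_bias)
```

As h shrinks, the exact bias and the leading term (h²/2)f″(x0) agree ever more closely; the bias is negative here because x0 = 0 is a peak of the density.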
1.4 Optimal Bandwidth
The problem of selecting the bandwidth is very important in kernel density estimation. The choice of an appropriate bandwidth is critical to the performance of most nonparametric density estimators. When the bandwidth is very small, the estimate will be very close to the original data. The estimate will be almost unbiased, but it will have large variation under repeated sampling. If the bandwidth is very large, the estimate will be very smooth, lying close to the mean of all the data. Such an estimate will have small variance, but it will be highly biased. There are many rules for bandwidth selection, for example normal scale rules, over-smoothed bandwidth selection rules, least squares cross-validation, biased cross-validation, estimation of density functionals, and plug-in bandwidth selection. For more details see [25] and [27].
We shall use two types of error criteria. The mean square error (MSE) is used to measure the error when estimating the density function at a single point. It is defined by

MSE{fn(x)} = E{fn(x) − f(x)}².   (1.4.1)

We can write the MSE as the sum of the squared bias and the variance at x:

MSE(fn(x)) = {Efn(x) − f(x)}² + Var(fn(x)).   (1.4.2)
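The decomposition (1.4.2) is an algebraic identity and can be verified directly on simulated kernel estimates at a point (a NumPy sketch; the density, bandwidth, and sample sizes are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
phi = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)  # Gaussian kernel and true pdf

x0, n, h, reps = 0.0, 200, 0.3, 4000
# reps independent kernel estimates f_n(x0) from N(0, 1) samples
est = np.array([phi((x0 - rng.normal(0.0, 1.0, n)) / h).mean() / h
                for _ in range(reps)])

mse = np.mean((est - phi(x0)) ** 2)                       # E{f_n(x0) - f(x0)}^2
bias2_plus_var = (est.mean() - phi(x0)) ** 2 + est.var()  # squared bias + variance
print(mse, bias2_plus_var)  # equal: the decomposition is exact
```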
A second type of criteria measures the error when estimating the density over the
whole real line. The best known of this type is the mean integrated squared error (MISE), introduced by [22]. The MISE is defined as

MISE(fn) = E ∫_{−∞}^{∞} {fn(x) − f(x)}² dx.   (1.4.3)

By changing the order of integration we have

MISE(fn) = ∫_{−∞}^{∞} MSE{fn(x)} dx = ∫_{−∞}^{∞} {Efn(x) − f(x)}² dx + ∫_{−∞}^{∞} Var(fn(x)) dx.   (1.4.4)
Equation (1.4.4) gives the MISE as the sum of the integrated squared bias and the integrated variance. Substituting (1.3.1) and (1.3.2), we conclude that

MISE(fn) = AMISE(fn) + o{h⁴ + (nh)⁻¹},   (1.4.5)

where AMISE is the asymptotic mean integrated squared error, given by

AMISE(fn) = (1/4)h⁴µ2(K)²R(f″) + (nh)⁻¹R(K),   (1.4.6)

see [27].
The natural way of choosing h is to plot out several curves and choose the estimate that best matches one's prior (subjective) ideas. However, this method is not practical in pattern recognition, since we typically have high-dimensional data.
Assume a standard density function and find the value of the bandwidth that minimizes the mean integrated squared error (MISE):

hMISE = arg min_h E[ ∫ (fn(x) − f(x))² dx ].   (1.4.7)

If we assume that the true distribution is Gaussian and we use a Gaussian kernel, the bandwidth h can be computed using the following equation from [25]:

h∗ = 1.06 S N^{−1/5},   (1.4.8)
where S is the sample standard deviation and N is the number of training examples.
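Equation (1.4.8) is straightforward to apply. Moreover, for a Gaussian kernel and a Gaussian reference density, (1.4.9) below with R(K) = 1/(2√π), µ2(K) = 1 and R(f″) = 3/(8√π σ⁵) reduces to (4/3)^{1/5} σ N^{−1/5} ≈ 1.0592 σ N^{−1/5}, which is where the constant 1.06 comes from. A sketch of both (the simulated data are illustrative):

```python
import numpy as np

def silverman_bandwidth(data):
    """Normal-reference rule h* = 1.06 * S * N^(-1/5), Eq. (1.4.8)."""
    data = np.asarray(data, dtype=float)
    return 1.06 * data.std(ddof=1) * len(data) ** (-1 / 5)

def gaussian_hopt(sigma, n):
    """Optimal bandwidth (1.4.9) for a Gaussian kernel and N(0, sigma^2) density."""
    r_k = 1.0 / (2.0 * np.sqrt(np.pi))                 # R(K) for the Gaussian kernel
    r_f2 = 3.0 / (8.0 * np.sqrt(np.pi) * sigma ** 5)   # R(f'') for N(0, sigma^2)
    return (r_k / (r_f2 * n)) ** (1 / 5)               # mu_2(K) = 1

rng = np.random.default_rng(4)
sample = rng.normal(0.0, 1.0, 400)
print(silverman_bandwidth(sample), gaussian_hopt(1.0, 400))  # both near 0.32
```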
By differentiating (1.4.6) with respect to h, we can find the optimal bandwidth with respect to the AMISE criterion. This yields the optimal bandwidth

hopt = [ R(K) / (µ2(K)²R(f″)n) ]^{1/5}.   (1.4.9)

Therefore, if we substitute (1.4.9) into (1.4.6), we obtain the smallest value of AMISE for estimating f using the kernel K:

inf_{h>0} AMISE{fn} = (5/4) {µ2(K)²R(K)⁴R(f″)}^{1/5} n^{−4/5}.
Notice that in (1.4.9) the optimal bandwidth depends on the unknown density being estimated, so we cannot use (1.4.9) directly to find the optimal bandwidth hopt. Also, from (1.4.9) we can draw the following useful conclusions:

• The optimal bandwidth converges to zero as the sample size increases, but at a very slow rate.

• The optimal bandwidth is inversely proportional to R(f″)^{1/5}. Since R(f″) measures the curvature of f, this means that for a density function with little curvature, the optimal bandwidth will be large. Conversely, if the density function has a large curvature, the optimal bandwidth will be small.
Figure 1.3: Kernel density estimates based on different bandwidths h = 0.25 (solid curve), h = 0.5 (dashed curve), h = 0.75 (dotted curve) ([11])
Summary

In this chapter, we introduced some basic definitions and theorems that we will need in this thesis. We studied the definition of an estimator and its types, gave an overview of nonparametric estimation and its common methods, and then presented the kernel density estimation of the pdf and the properties of the kernel estimator. In the next chapter, we will study the Nadaraya-Watson estimator of the L1-median.
Chapter 2

Nadaraya-Watson Estimator of the L1-Median
Introduction
Conditional distribution functions underlie many popular statistical objects of interest. They are rarely modeled directly in the parametric setting and have perhaps received even less attention in the kernel setting. Nevertheless, as will be seen, they are extremely useful for a range of tasks, whether directly estimating the conditional distribution function [6] or modeling conditional quantiles. The conditional median depends directly on the conditional distribution function. Indeed, estimating the conditional distribution is much more informative, since it allows us not only to recover the expected value E(Y|X) and the variance σ²(Y|X), but also to provide the general shape of the conditional distribution. In this context, several nonparametric methods are applicable for estimating the conditional distribution function based on data (X1, Y1), ..., (Xn, Yn). One class of kernel-type estimators is the Nadaraya-Watson estimator, which is one of the most widely known and used estimators of the conditional distribution function. Conditional distribution estimation was introduced by [22]. A bias correction was proposed by [17], and [12] proposed a direct estimator based on local polynomial estimation. The Nadaraya-Watson (NW) estimator was created independently by [28] and [19]. In this chapter, we will investigate a nonparametric method for estimating the conditional L1-median, namely the Nadaraya-Watson estimator.
In this chapter we present the NW estimator in order to estimate the conditional L1-median. In Section 2.1, we discuss the importance of the median and give some historical notes. In Section 2.2, we present the conditional L1-median. In Section 2.3, the NW estimator of the conditional L1-median is studied. Then, in Section 2.4, the asymptotic properties of the NW estimator are discussed and derived.
2.1 Importance of the Median
The median is the value separating the higher half of a data sample, a population, or
a probability distribution, from the lower half. In simple terms, it may be thought
of as the ”middle” value of a data set. For example, in the data set 1, 3, 3, 6, 7, 8,
9, the median is 6, the fourth number in the sample. The median is a commonly
used measure of the properties of a data set in statistics and probability theory.
The basic advantage of the median in describing data compared to the mean (often
simply described as the ”average”) is that it is not skewed so much by extremely large
or small values, and so it may give a better idea of a ’typical’ value. For example, in
understanding statistics like household income or assets which vary greatly, a mean
may be skewed by a small number of extremely high or low values. Median income,
for example, may be a better way to suggest what a ’typical’ income is.
Because of this, the median is of central importance in robust statistics, as it is the most resistant statistic, having a breakdown point of 50%: so long as no more than half the data are contaminated, the median will not give an arbitrarily large or small result.
The median is another way to measure the center of a numerical data set. A statistical median is much like the median of an interstate highway: on many highways, the median is the middle, and an equal number of lanes lie on either side of it. In a numerical data set, the median is the point at which there is an equal number of points whose values lie above and below the median value; thus the median is truly the middle of the data set.
The mean and the median are the measures of location most used by investigators, because these measures are easy to understand. Most investigators describe the location of the data by the arithmetic mean, except when the data are highly skewed, highly kurtotic, or contaminated with outliers, in which case the median is often used. In the case of asymmetric data, the median is almost always close to the mode, and in many cases a good approach would be to estimate the mode itself. See [20].
Advantages of the Median

1. It is very simple to understand and easy to calculate.

2. It is a special average used for qualitative phenomena like intelligence or beauty, which are not quantified but for which ranks are given.

Disadvantages of the Median

1. It is less representative.

2. It takes a long time to calculate for a very large set of data.
2.2 The Conditional L1-Median

In this section, we study the conditional median: we define the median as the solution to the problem of minimizing a sum of absolute residuals, and then we study the conditional L1-median.
Definition 2.2.1. Let Y₁, Y₂, ..., Y_n be a random sample from a distribution with pdf f(y) and cdf F(y). Then the median of the distribution is defined by

θ = inf{y : F(y) ≥ 0.5}.
Example 2.2.1. To characterize the median as a minimization problem, we solve

arg min_{θ∈R} ∑_{i=1}^{n} |Y_i − θ|.
Proof. [7] For a single observation Y with density f,

E|Y − θ| = ∫_{−∞}^{∞} |y − θ| f(y) dy
         = ∫_{−∞}^{θ} −(y − θ) f(y) dy + ∫_{θ}^{∞} (y − θ) f(y) dy.

Differentiating with respect to θ and setting the derivative equal to zero,

(d/dθ) E|Y − θ| = ∫_{−∞}^{θ} f(y) dy − ∫_{θ}^{∞} f(y) dy = 0,

so that

∫_{−∞}^{θ} f(y) dy = ∫_{θ}^{∞} f(y) dy.

This is the definition of the median: P(Y ≤ θ) = P(Y ≥ θ) = 0.5.

The symmetry of the piecewise linear absolute value function implies that minimizing the sum of absolute residuals equates the numbers of positive and negative residuals.
Median regression estimates the conditional median of Y given X = x and corresponds to the minimization of E(|Y − θ| | X = x) over θ. The associated loss function is r(u) = |u|. Equivalently, we can take the loss function to be ρ_{0.5}(u) = 0.5|u|, because the positive and negative residuals receive equal weight.
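This characterization is easy to verify numerically. The sketch below (an illustration of ours, not part of the thesis) evaluates the L1 loss on a grid and checks that its minimizer agrees with the sample median of the data set from Section 2.1:

```python
import numpy as np

# Numerical check that the sample median minimizes sum_i |Y_i - theta|.
y = np.array([1.0, 3.0, 3.0, 6.0, 7.0, 8.0, 9.0])  # data set from Section 2.1
grid = np.linspace(0.0, 10.0, 10001)               # candidate values of theta
loss = np.abs(y[:, None] - grid[None, :]).sum(axis=0)
theta_hat = grid[np.argmin(loss)]
print(theta_hat, np.median(y))  # the grid minimizer sits at the median, 6
```
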
Definition 2.2.2. (The Conditional Median) [1]
Let (X₁, Y₁), (X₂, Y₂), ..., (X_n, Y_n) be a random sample from a distribution with conditional cdf F(y|x). Then the conditional median M(x) is defined by

M(x) = inf{y : F(y|x) ≥ 0.5}.
Definition 2.2.3. Suppose we have a complex vector space V. A norm is a function f : V → R which satisfies:

1. f(x) ≥ 0 for all x ∈ V;

2. f(x + y) ≤ f(x) + f(y) for all x, y ∈ V;

3. f(λx) = |λ| f(x) for all λ ∈ C and x ∈ V;

4. f(x) = 0 if and only if x = 0.

We usually denote a norm by ‖x‖.
Examples of the most important norms are as follows:

• The 2-norm or Euclidean norm: ‖x‖₂ = (∑_{i=1}^{n} |x_i|²)^{1/2}.

• The 1-norm: ‖x‖₁ = ∑_{i=1}^{n} |x_i|.

• For any real p ≥ 1, the p-norm: ‖x‖_p = (∑_{i=1}^{n} |x_i|^p)^{1/p}.

• The ∞-norm, also called the sup-norm: ‖x‖_∞ = max_i |x_i|. This notation is used because ‖x‖_∞ = lim_{p→∞} ‖x‖_p.
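These norms can be computed directly; the small helper below is illustrative (the function name `p_norm` is ours) and also shows numerically that the p-norm approaches the sup-norm as p grows:

```python
import numpy as np

def p_norm(x, p):
    """The p-norm (sum |x_i|^p)^(1/p); p = inf gives the sup-norm."""
    x = np.asarray(x, dtype=float)
    if np.isinf(p):
        return np.abs(x).max()
    return (np.abs(x) ** p).sum() ** (1.0 / p)

x = [3.0, -4.0]
print(p_norm(x, 1))       # 7.0, the 1-norm
print(p_norm(x, 2))       # 5.0, the Euclidean norm
print(p_norm(x, np.inf))  # 4.0, the sup-norm
print(p_norm(x, 50))      # already very close to 4, illustrating the limit
```
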
Several extensions of the concept of the univariate median have been introduced in the literature, with applications in different fields of statistics. One of them is the so-called L1-median. A nonparametric estimator has been proposed for estimating the L1-median of a multivariate conditional distribution when the covariates take values in an infinite dimensional space. The multivariate case is more appropriate when we wish to predict the components of a vector of random variables simultaneously rather than predicting each of them separately.
Definition 2.2.4. (Convex function) A function f : M −→ R defined on a nonempty subset M of Rⁿ and taking real values is called convex if:

1. the domain M of the function is convex;

2. for any x, y ∈ M and every λ ∈ [0, 1], one has

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y). (2.2.1)

If the above inequality is strict whenever x ≠ y and 0 < λ < 1, f is called strictly convex.
An example of convex functions is given by norms. Recall that a real-valued function ‖x‖ on Rⁿ is called a norm if it:

• is nonnegative everywhere: ‖x‖ ≥ 0 for all x ∈ Rⁿ;

• is homogeneous: ‖ax‖ = |a| ‖x‖ for all x ∈ Rⁿ and a ∈ R;

• satisfies the triangle inequality: ‖x + y‖ ≤ ‖x‖ + ‖y‖;

• satisfies ‖x‖ = 0 if and only if x = 0.
Let ‖·‖ denote any strictly convex norm on R^d, that is, a norm with ‖α + β‖ < ‖α‖ + ‖β‖ whenever α and β are not proportional. The norms ‖·‖_p : R^d −→ R with 1 < p < ∞ are strictly convex. In the sequel we restrict attention to the Euclidean norm and, for notational simplicity, write ‖·‖ for ‖·‖₂.
For a fixed x ∈ R^s, define a function of α, α ∈ R^d, by

φ(α, x) = E(‖Y − α‖ − ‖Y‖ | X = x) = ∫_{R^d} (‖y − α‖ − ‖y‖) F(dy|x),

where F(·|x) is the conditional probability measure of Y given X = x. When the norm is strictly convex, the existence and uniqueness of µ ∈ R^d such that

φ(µ, x) = inf_{α∈R^d} φ(α, x)

are guaranteed. The vector µ is called the L1-median of the conditional probability measure F(·|x) [18].
Definition 2.2.5. (The Conditional L1-Median) [2], [3]
The L1-median of Y conditionally on X = x is defined by

µ(x) = arg min_{α∈R^d} φ(α, x),

where, for α ∈ R^d,

φ(α, x) = ∫_{R^d} (‖y − α‖ − ‖y‖) F(dy|x).
In this section, we introduced the median as a minimization problem in both the univariate and the multivariate case, and the L1-median was introduced. In the next section, we will study the Nadaraya-Watson estimator of the univariate conditional median; its mean and variance will be discussed.
2.3 The Nadaraya-Watson Estimator

In this section, we present some basic facts about the Nadaraya-Watson (NW) estimator for later use. It is one of the popular nonparametric methods for estimating the conditional density function f(y|x). We consider the kernel estimation of the conditional cumulative distribution function (cdf). Let (X₁, Y₁), (X₂, Y₂), ..., (X_n, Y_n) be a random sample from a distribution with conditional probability density function (pdf) f(y|x). Then the cdf F(y|x) is given by

F(y|x) = ∫_{−∞}^{y} f(u|x) du,

where

f(y|x) = f(x, y) / f_X(x).

Now we introduce the basic equations of the kernel conditional density estimator. The kernel K(u) is assumed to be a Borel symmetric function, h is a sequence of positive numbers converging to zero, called the bandwidth, and K_h(x) = K(x/h)/h.
Standard kernel estimators of f(x, y) and f_X(x) are

f̂(x, y) = (1/n) ∑_{i=1}^{n} K_h(x − X_i) K_h(y − Y_i),

f̂_X(x) = (1/n) ∑_{i=1}^{n} K_h(x − X_i),

and

f̂(y|x) = f̂(x, y) / f̂_X(x) = ∑_{i=1}^{n} K_h(x − X_i) K_h(y − Y_i) / ∑_{i=1}^{n} K_h(x − X_i).
Now the estimator of the conditional cdf is given by

F̂(y|x) = ∫_{−∞}^{y} f̂(u|x) du = ∑_{i=1}^{n} K_h(x − X_i) ∫_{−∞}^{y} K_h(u − Y_i) du / ∑_{i=1}^{n} K_h(x − X_i).

Alternatively, F(y|x) can be estimated by using the indicator function I(Y_i ≤ y), which yields the Nadaraya-Watson estimator

F̂_NW(y|x) = ∑_{i=1}^{n} I(Y_i ≤ y) K_h(x − X_i) / ∑_{i=1}^{n} K_h(x − X_i).
Remark 2.3.1. We have

0 ≤ F̂_NW(y|x) ≤ 1.

Indeed, if y < Y_i for all i = 1, 2, ..., n, then I(Y_i ≤ y) = 0 for all i, so

F̂_NW(y|x) = ∑_{i=1}^{n} I(Y_i ≤ y) K_h(x − X_i) / ∑_{i=1}^{n} K_h(x − X_i) = 0.

If y lies between the Y_i's, i.e. some but not all of the Y_i are less than or equal to y, then I(Y_i ≤ y) = 0 for some i and I(Y_i ≤ y) = 1 for the others, so

0 < F̂_NW(y|x) < 1.

If y ≥ Y_i for all i, then I(Y_i ≤ y) = 1 for all i, so

F̂_NW(y|x) = ∑_{i=1}^{n} K_h(x − X_i) / ∑_{i=1}^{n} K_h(x − X_i) = 1.
Definition 2.3.1. The NW estimator of the conditional median m(x) is defined as

m̂_NW(x) = inf{y ∈ R : F̂_NW(y|x) ≥ 0.5}.
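A minimal sketch of F̂_NW and m̂_NW, assuming a Gaussian kernel and searching for the median over the sample values of Y (the thesis's own computations use S-Plus; this Python version, with its simulated model, is only illustrative):

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def F_nw(y, x, X, Y, h):
    """Nadaraya-Watson estimate of F(y|x): a kernel-weighted empirical cdf."""
    w = gaussian_kernel((x - X) / h)
    return np.sum((Y <= y) * w) / np.sum(w)

def median_nw(x, X, Y, h):
    """m_NW(x) = inf{y : F_NW(y|x) >= 0.5}, attained at one of the Y_i."""
    for y in np.sort(Y):
        if F_nw(y, x, X, Y, h) >= 0.5:
            return y
    return np.max(Y)

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, 500)
Y = 2.0 * X + rng.normal(scale=0.3, size=500)  # true conditional median: 2x
print(median_nw(0.5, X, Y, h=0.1))             # should be near 2 * 0.5 = 1
```
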
In the following theorem, we discuss the expectation of the NW estimator (the X_i are treated as fixed).

Theorem 2.3.1. (The Expectation of the NW Estimator)
Let Y₁, ..., Y_n be independent random variables. The expectation of the estimator F̂_NW(y|x) is given by

E[F̂_NW(y|x)] = ∑_{i=1}^{n} [ K_h(x − X_i) / ∑_{j=1}^{n} K_h(x − X_j) ] F(y|X_i).

Proof.

E[F̂_NW(y|x)] = E[ ∑_{i=1}^{n} I{Y_i ≤ y} K_h(x − X_i) / ∑_{i=1}^{n} K_h(x − X_i) ]
= ∑_{i=1}^{n} K_h(x − X_i) E[I{Y_i ≤ y}] / ∑_{i=1}^{n} K_h(x − X_i)
= ∑_{i=1}^{n} K_h(x − X_i) ∫_{−∞}^{y} f(t|X_i) dt / ∑_{i=1}^{n} K_h(x − X_i)
= ∑_{i=1}^{n} K_h(x − X_i) F(y|X_i) / ∑_{i=1}^{n} K_h(x − X_i)
= ∑_{i=1}^{n} [ K_h(x − X_i) / ∑_{j=1}^{n} K_h(x − X_j) ] F(y|X_i). ∎
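The formula can be checked by Monte Carlo, holding the design points X_i fixed and averaging F̂_NW over repeated draws of the Y_i. The sketch below is ours and uses a hypothetical model Y_i | X_i ~ N(X_i, 1):

```python
import numpy as np
from math import erf, sqrt

# Monte Carlo check of Theorem 2.3.1 with the X_i held fixed:
# E[F_NW(y|x)] = sum_i w_i(x) F(y|X_i), w_i(x) = K_h(x-X_i)/sum_j K_h(x-X_j).
X = np.array([-0.5, 0.0, 0.4, 0.8])   # fixed design points
x0, y0, h = 0.2, 0.3, 0.5
w = np.exp(-0.5 * ((x0 - X) / h) ** 2)
w = w / w.sum()                        # NW weights w_i(x0)

Phi = lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0)))  # standard normal cdf

rng = np.random.default_rng(2)
reps = 20000
est = np.empty(reps)
for r in range(reps):
    Y = X + rng.normal(size=X.size)    # model: Y_i | X_i ~ N(X_i, 1)
    est[r] = np.sum((Y <= y0) * w)     # F_NW(y0|x0) for this draw
theory = sum(wi * Phi(y0 - Xi) for wi, Xi in zip(w, X))
print(est.mean(), theory)              # the two agree up to Monte Carlo error
```
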
Theorem 2.3.2. (The Variance of the Nadaraya-Watson Estimator) [9]
Let Y₁, ..., Y_n be independent random variables. The variance of the estimator F̂_NW(y|x) is given by

Var[F̂_NW(y|x)] = ∑_{i=1}^{n} [ K²((x − X_i)/h_n) / ( ∑_{j=1}^{n} K((x − X_j)/h_n) )² ] [F(y|X_i) − F²(y|X_i)].

Proof. Write w_i = K_h(x − X_i) / ∑_{j=1}^{n} K_h(x − X_j), so that F̂_NW(y|x) = ∑_{i=1}^{n} w_i I{Y_i ≤ y}. Since the Y_i are independent and Var(I{Y_i ≤ y}) = F(y|X_i) − F²(y|X_i), we obtain

Var[F̂_NW(y|x)] = ∑_{i=1}^{n} w_i² [F(y|X_i) − F²(y|X_i)],

which is the stated result. ∎

2.4 The NW Estimator of the Conditional L1-Median [2]
Let φ_n(α, x) be the estimate of φ(α, x) defined by

φ_n(α, x) = ∫_{R^d} (‖y − α‖ − ‖y‖) F_n(dy|x) = ∑_{i=1}^{n} (‖Y_i − α‖ − ‖Y_i‖) K_h(x − X_i) / ∑_{i=1}^{n} K_h(x − X_i).

From the definition of µ(x), it seems natural to estimate it by minimizing the estimate φ_n(α, x).

Definition 2.4.1. (The minimizer µ_n is the estimated L1-median) [2], [3]

µ_n(x) = arg min_{α∈R^d} φ_n(α, x) = arg min_{α∈R^d} ∑_{i=1}^{n} ‖Y_i − α‖ K_h(x − X_i),

where the last equality is obtained by removing the terms independent of α in the expression for φ_n(α, x).
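In practice, the minimizer in Definition 2.4.1 can be computed by a Weiszfeld-type fixed-point iteration for the weighted spatial median. The sketch below uses that standard algorithm; it is not code from the thesis, and the helper names and Gaussian kernel are our illustrative choices:

```python
import numpy as np

def weighted_l1_median(Y, w, n_iter=200, eps=1e-8):
    """Minimize sum_i w_i * ||Y_i - alpha|| by Weiszfeld-type iteration."""
    Y = np.asarray(Y, dtype=float)
    alpha = np.average(Y, axis=0, weights=w)   # start at the weighted mean
    for _ in range(n_iter):
        d = np.maximum(np.linalg.norm(Y - alpha, axis=1), eps)
        g = w / d                              # reweight by inverse distance
        alpha = (Y * g[:, None]).sum(axis=0) / g.sum()
    return alpha

def mu_n(x, X, Y, h):
    """NW estimator of the conditional L1-median (Definition 2.4.1)."""
    w = np.exp(-0.5 * ((x - X) / h) ** 2)      # Gaussian kernel weights
    return weighted_l1_median(Y, w)

# Equal weights on a point set symmetric about (1, 0): the L1-median is (1, 0).
pts = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0], [1.0, -1.0]])
print(weighted_l1_median(pts, np.ones(4)))
```
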
2.4.1 Assumptions and Main Results

Let C be a compact subset of R^s on which the marginal density of X, denoted by g, is bounded below by some positive constant. We now state some assumptions which are required to prove the theoretical results of this section.

(A1) The density g of X is uniformly continuous.

(A2) The function µ(·) satisfies a uniform uniqueness property over C:

∀ε > 0, ∃η > 0, ∀t : C −→ R^d,
sup_{x∈C} ‖µ(x) − t(x)‖ ≥ ε ⟹ sup_{x∈C} |φ(µ(x), x) − φ(t(x), x)| ≥ η.

(A3) The kernel K is a bounded, positive, Hölderian function.

(A4) The sequence (h_n)_{n≥1} satisfies

lim_{n→∞} n h_n^s / log n = ∞.

(A5) For any Borel set V ⊂ R^d and any α ∈ R^d, the functions Q(V|·) and φ(α, ·) are continuous on C.
Under the previous conditions, the results in Theorem 2.4.1 can be proved.

Theorem 2.4.1. [2], [3] Assume (A1), (A2), (A3), (A4) and (A5). Then

1. with probability 1 (w.p.1), one can find an integer N ≥ 1 such that if n ≥ N and x ∈ C, the median µ_n(x) associated with the probability measure Q_n(·|x) exists and is unique; moreover, the function µ_n(·) is continuous on C;

2. the function µ(·) is continuous on C;

3. w.p.1,

sup_{x∈C} ‖µ_n(x) − µ(x)‖ → 0 as n → ∞.
2.4.2 Preliminary Lemmas

We review here some preliminary lemmas and definitions that will be helpful in following the proof of Theorem 2.4.1.

Definition 2.4.2. (Borel set)
A Borel set is any set in a topological space that can be formed from open sets (or, equivalently, from closed sets) through the operations of countable union and countable intersection.

Definition 2.4.3. (Continuity, Uniform Continuity) [13]
Let f : I −→ R and let x₀ ∈ I. We say that f is continuous at x₀ if

lim_{x→x₀} f(x) = f(x₀).

Using the (ε, δ) definition of the limit, this means the following:

∀ε > 0, ∃δ = δ(ε, x₀) > 0, ∀x ∈ I, |x − x₀| < δ ⇒ |f(x) − f(x₀)| < ε.

If f is continuous at x for all x ∈ I, we say f is continuous on I. We say that f is uniformly continuous on I if

∀ε > 0, ∃δ = δ(ε) > 0, ∀x, x₀ ∈ I, |x − x₀| < δ ⇒ |f(x) − f(x₀)| < ε.
Lemma 2.4.1. Assume (A5). We have

lim_{‖α‖→∞} sup_{n≥1} sup_{x∈C} | φ_n(α, x)/‖α‖ − 1 | = 0.

Proof. Since for all p ≥ 1 the map x ↦ P(‖Y‖ > p | X = x) is a continuous function (by (A5)), one can find x_p ∈ C such that

sup_{x∈C} P(‖Y‖ > p | X = x) = P(‖Y‖ > p | X = x_p). (2.4.1)

The sequence (x_p)_{p≥1} being in the compact set C, one can extract a subsequence (p_k)_{k≥1} such that

x_{p_k} −→ x_∞ as k −→ ∞. (2.4.2)

Then, w.p.1, for any α ∈ R^d − {0}, x ∈ C, n ≥ 1 and k ≥ 1, we have

| φ_n(α, x)/‖α‖ − 1 | ≤ ∫_{R^d} | (‖y − α‖ − ‖y‖ − ‖α‖)/‖α‖ | F_n(dy|x)
≤ ∫_{‖y‖≤p_k} | (‖y − α‖ − ‖y‖ − ‖α‖)/‖α‖ | F_n(dy|x) + ∫_{‖y‖>p_k} | (‖y − α‖ − ‖y‖ − ‖α‖)/‖α‖ | F_n(dy|x).

Now, for all α ∈ R^d − {0} and y ∈ R^d, the triangle inequality gives | ‖y − α‖ − ‖y‖ | ≤ ‖α‖, so

| ‖y − α‖ − ‖y‖ − ‖α‖ | ≤ | ‖y − α‖ − ‖y‖ | + ‖α‖ ≤ 2‖α‖,

and also ‖α‖ − ‖y‖ ≤ ‖y − α‖ ≤ ‖α‖ + ‖y‖, so

| ‖y − α‖ − ‖y‖ − ‖α‖ | ≤ 2‖y‖.

Thus we get the inequality

| φ_n(α, x)/‖α‖ − 1 | ≤ ∫_{‖y‖≤p_k} (2‖y‖/‖α‖) F_n(dy|x) + ∫_{‖y‖>p_k} 2 F_n(dy|x).

Letting ‖α‖ tend to infinity, we obtain, w.p.1 and for all k ≥ 1,

lim_{‖α‖→∞} sup_{n≥1} sup_{x∈C} | φ_n(α, x)/‖α‖ − 1 | ≤ 2 sup_{n≥1} sup_{x∈C} ∫_{‖y‖>p_k} F_n(dy|x). (2.4.3)
The last upper bound is now shown to tend to zero as k tends to infinity. For k, n ≥ 1 and x ∈ C, let us denote

q_n^x(k) = ∫_{‖y‖>p_k} F_n(dy|x) = ∑_{i=1}^{n} I(‖Y_i‖ > p_k) K((x − X_i)/h_n) / ∑_{i=1}^{n} K((x − X_i)/h_n).

Assuming (A1), (A3), (A4) and (A5), we have, w.p.1, for any k ≥ 1,

sup_{x∈C} | q_n^x(k) − P(‖Y‖ > p_k | X = x) | −→ 0 as n −→ ∞, [4]. (2.4.4)

The P-null set where the convergence in (2.4.4) fails may be chosen independent of k. Moreover, let us note that

( sup_{x∈C} q_n^x(k) )_{k≥1} and ( sup_{x∈C} P(‖Y‖ > p_k | X = x) )_{k≥1}

are decreasing sequences of positive numbers (because the sequence (p_k)_{k≥1} is increasing), hence both converge as k → ∞. Consequently, w.p.1, the convergence in (2.4.4) is uniform in k ≥ 1, i.e.

sup_{k≥1} sup_{x∈C} | q_n^x(k) − P(‖Y‖ > p_k | X = x) | → 0 as n → ∞.

Let ε > 0. By the above property, one can find, w.p.1, an integer N ≥ 1 such that if n > N, k ≥ 1 and x ∈ C,

q_n^x(k) ≤ ε + P(‖Y‖ > p_k | X = x). (2.4.5)

Now, w.p.1, if k ≥ 1,

sup_{n≥1} sup_{x∈C} q_n^x(k) ≤ sup_{n=1,...,N} sup_{x∈C} q_n^x(k) + sup_{n>N} sup_{x∈C} q_n^x(k).
On the one hand, by the very definition of q_n^x(k),

sup_{n=1,...,N} sup_{x∈C} q_n^x(k) ≤ max_{i=1,...,N} I(‖Y_i‖ > p_k).

On the other hand, according to (2.4.5),

sup_{n>N} sup_{x∈C} q_n^x(k) ≤ ε + sup_{x∈C} P(‖Y‖ > p_k | X = x).

But the Y_i's are P-a.s. finite random variables, so that, with probability 1,

lim sup_{k→∞} sup_{n≥1} sup_{x∈C} q_n^x(k)

is bounded above by

ε + lim sup_{k→∞} sup_{x∈C} P(‖Y‖ > p_k | X = x). (2.4.6)

As ε is arbitrary, the proof will be completed by showing that the last term in (2.4.6) equals 0. Let k ≥ 1. According to (2.4.1),

sup_{x∈C} P(‖Y‖ > p_k | X = x) = P(‖Y‖ > p_k | X = x_{p_k}).

Moreover, x_{p_k} → x_∞ as k → ∞, according to (2.4.2). Now, if k ≥ 1 and p′ ≤ p_k,

P(‖Y‖ > p_k | X = x_{p_k}) ≤ P(‖Y‖ > p′ | X = x_{p_k}),

so that if p′ ≥ 1,

lim_{k→∞} P(‖Y‖ > p′ | X = x_{p_k}) = P(‖Y‖ > p′ | X = x_∞),

because P(‖Y‖ > p′ | X = ·) is a continuous function on C by (A5). Letting p′ → ∞, we get

lim_{k→∞} sup_{x∈C} P(‖Y‖ > p_k | X = x) = 0,

because Y is P-a.s. finite. ∎
Lemma 2.4.2. Let us assume that K is a positive probability density. W.p.1, one can find an integer N ≥ 1 such that if n ≥ N and x ∈ C, the L1-median µ_n(x) associated with the probability measure Q_n(·|x) (which is not supported by a straight line) exists and is unique.

Proof. Let x ∈ C and n ≥ 1. W.p.1, the set of minimizers of φ_n(·, x) is non-empty. Let us prove that this set contains only one point. By assumption, if D is any straight line in R^d,

Q(D|x) = P(Y ∈ D | X = x) < 1.

Obviously, this gives

P(Y ∈ D) < 1. (2.4.7)

Now, let us denote by D(y₁, y₂) the line which connects two points y₁, y₂ ∈ R^d (y₁ ≠ y₂), and by P_Y the distribution of Y. Clearly,

P(Y₁, Y₂, Y₃ on the same line) = P(Y₃ ∈ D(Y₁, Y₂)) = ∫_{R^{2d}} P(Y₃ ∈ D(y₁, y₂)) P_Y(dy₁) P_Y(dy₂),

because Y₁, Y₂ and Y₃ are independent and identically distributed random variables. Then, according to (2.4.7), one gets

P(Y₁, Y₂, Y₃ on the same line) < 1.

Let us denote by Ω₁ the subset of Ω defined by

Ω₁ = {Y₁, Y₂, ... not all on the same line}.

From the previous inequality, we have

P(Ω₁^c) ≤ P(Y₁, Y₂, Y₃ on the same line) < 1.

But Ω₁^c is an element of the σ-field generated by the independent random variables Y₁, Y₂, ..., and it is symmetric, so P(Ω₁^c) = 0 by the Hewitt-Savage 0-1 law (see [9]). Hence P(Ω₁) = 1, and one can work on Ω₁ instead of Ω. For all ω ∈ Ω₁, one can find N(ω) ≥ 1 such that if n ≥ N(ω), the points Y₁(ω), Y₂(ω), ..., Y_n(ω) are not on a straight line. Now, let x ∈ R^s, ω ∈ Ω₁ and n ≥ N(ω). Because K is a positive function, the support of the probability measure Q_n(·|x)(ω) is

{Y₁(ω), ..., Y_n(ω)}.

But n ≥ N(ω), so this support is not included in any straight line. Consequently, according to Theorem 2.17 of [18], the set of L1-medians associated with the probability measure Q_n(·|x)(ω) contains only one element: µ_n(x)(ω). ∎
Lemma 2.4.3. Let us assume that K is a continuous positive kernel. W.p.1, one can find an integer N ≥ 1 such that if n ≥ N, µ_n(·) is continuous on C.

Proof. According to Lemma 2.4.2, w.p.1, one can find an integer N ≥ 1 such that if n ≥ N and x ∈ C, the L1-median µ_n(x) associated with the probability measure Q_n(·|x) exists and is unique. We now prove that, P-a.s., if n ≥ N, µ_n(·) is continuous on C. Let x ∈ C, and let (x_p)_{p≥1} be a sequence in C such that x_p −→ x as p −→ ∞. By the continuity of K, one obviously gets that the sequence of probability measures (Q_n(·|x_p))_{p≥1} converges weakly to Q_n(·|x). Moreover, Q_n(·|x) is not supported by any straight line, so that, according to Corollary 2.26 of [18], µ_n(x_p) −→ µ_n(x) as p −→ ∞. Hence, w.p.1, if n ≥ N, µ_n(·) is a continuous function on C. ∎
Lemma 2.4.4. Assume (A1), (A2), (A3), (A4) and (A5). W.p.1, for any A > 0,

sup_{‖α‖≤A} sup_{x∈C} | φ_n(α, x) − φ(α, x) | −→ 0 as n −→ ∞.

Proof. We clearly have, w.p.1, for all α ∈ R^d and i ≥ 1,

| ‖Y_i − α‖ − ‖Y_i‖ | ≤ ‖α‖.

Consequently, assuming (A1), (A3), (A4) and (A5), we have, for all α ∈ R^d and w.p.1,

sup_{x∈C} | φ_n(α, x) − φ(α, x) | −→ 0 as n −→ ∞, (2.4.8)

[4]. But w.p.1, if n ≥ 1, x ∈ C and α, α′ ∈ R^d,

| φ_n(α, x) − φ_n(α′, x) | ≤ ‖α − α′‖ (2.4.9)

and

| φ(α, x) − φ(α′, x) | ≤ ‖α − α′‖.

From this we deduce that the P-null set on which (2.4.8) fails may be chosen independent of α. Moreover, according to (2.4.9), w.p.1 the sequence of functions (φ_n(·, x), n ≥ 1) is equicontinuous, and this property is independent of x. Thus, according to (2.4.8) and the Ascoli-Arzelà theorem [9], we get that, w.p.1, if A > 0,

sup_{‖α‖≤A} sup_{x∈C} | φ_n(α, x) − φ(α, x) | −→ 0 as n −→ ∞.

This completes the proof of the lemma. ∎
2.4.3 Proof of Theorem 2.4.1

Assuming (A3), assertion (1) follows from Lemmas 2.4.2 and 2.4.3. Moreover, (2) is a straightforward consequence of (1) and (3), so one only needs to prove (3). The proof is divided into two steps.

Step 1. We first prove that, w.p.1, one can find r > 0 and N ≥ 1 such that

sup_{n≥N} sup_{x∈C} ‖µ_n(x)‖ ≤ r and sup_{x∈C} ‖µ(x)‖ ≤ r.

From Lemma 2.4.1, one can find, w.p.1, r₁ > 0 such that, if ‖α‖ > r₁, for all n ≥ 1 and all x ∈ C,

φ_n(α, x) ≥ (1/2) ‖α‖. (2.4.10)

We have already proved in Lemma 2.4.2 that, assuming (A3), w.p.1 there exists N ≥ 1 such that if n ≥ N and x ∈ C, the L1-median µ_n(x) associated with the probability measure Q_n(·|x) exists and is unique. Assume now that there exist n ≥ N and x ∈ C such that

‖µ_n(x)‖ > r₁.

Then, according to (2.4.10),

φ_n(µ_n(x), x) ≥ (1/2) ‖µ_n(x)‖ > 0.

But, by the very definition of µ_n(x),

φ_n(µ_n(x), x) = inf_{α∈R^d} φ_n(α, x) ≤ φ_n(0, x) = 0.

This is impossible. Hence, w.p.1, for all n ≥ N,

sup_{x∈C} ‖µ_n(x)‖ ≤ r₁.

Similar arguments lead to the existence of a real number r₂ > 0 such that

sup_{x∈C} ‖µ(x)‖ ≤ r₂.

The desired result is then obtained with r = max(r₁, r₂).

Step 2. Conclusion. W.p.1, if n ≥ N (N was fixed in Step 1),

sup_{x∈C} | φ(µ(x), x) − φ(µ_n(x), x) | ≤ sup_{x∈C} | φ(µ(x), x) − φ_n(µ_n(x), x) | + sup_{x∈C} | φ_n(µ_n(x), x) − φ(µ_n(x), x) |.

From Step 1,

sup_{n≥N} sup_{x∈C} ‖µ_n(x)‖ ≤ r and sup_{x∈C} ‖µ(x)‖ ≤ r,

so that

φ(µ(x), x) = inf_{α∈R^d} φ(α, x) = inf_{‖α‖≤r} φ(α, x)

and

φ_n(µ_n(x), x) = inf_{α∈R^d} φ_n(α, x) = inf_{‖α‖≤r} φ_n(α, x).

Thus, w.p.1, if n ≥ N,

sup_{x∈C} | φ(µ(x), x) − φ(µ_n(x), x) | ≤ sup_{x∈C} | inf_{‖α‖≤r} φ(α, x) − inf_{‖α‖≤r} φ_n(α, x) | + sup_{‖α‖≤r} sup_{x∈C} | φ_n(α, x) − φ(α, x) |.

Assuming (A1), (A3), (A4) and (A5), from Lemma 2.4.4 we get

sup_{x∈C} | φ(µ(x), x) − φ(µ_n(x), x) | → 0 as n → ∞,

w.p.1. Then, applying (A2), one gets

sup_{x∈C} ‖µ(x) − µ_n(x)‖ → 0 as n → ∞,

w.p.1. This completes the proof of Theorem 2.4.1. ∎
Through the study of the NW estimator, we have seen that large bias and boundary effects are its main drawbacks. Hence, in the next chapter, we will study another estimator, known as the double kernel (DK) estimator.
Chapter 3

The Double Kernel Estimator
Introduction
Let (X, Y) be a two-dimensional random variable with joint distribution function F(x, y). Large bias and boundary effects are considered the most important defects of the NW estimator. The NW estimator has been treated and modified in order to obtain a more refined estimator, called the double kernel (DK) estimator; see [10].

Various consistency proofs for the kernel density estimator have been developed over the last few decades. Important milestones are pointwise consistency and almost sure uniform convergence with a fixed bandwidth on the one hand, and the rate of convergence with a fixed or even a variable bandwidth on the other hand. While considering global properties of the empirical distribution function is sufficient for strong consistency, proofs of exact convergence rates use deeper information about the underlying empirical processes. A unifying feature, however, is that both earlier and more recent proofs use bounds on the probability that a sum of random variables deviates from its mean; see [25].

This chapter studies the double kernel estimation of the conditional L1-median of Y for a given value of X, based on a random sample from the above distribution. The joint asymptotic consistency of the conditional L1-median estimated at a finite number of distinct points is established under some regularity conditions. The aim of this chapter is to introduce the double kernel (DK) estimator and its aspects, to discuss the properties of the DK estimator of the conditional L1-median, and to study its asymptotic consistency.

This chapter consists of two sections. In Section 3.1, we introduce the DK estimator of the conditional distribution function F̂_DK(y|x). In Section 3.2, we investigate the asymptotic consistency of the DK estimator.
3.1 The Double Kernel Estimator

If f(x, y) is the joint pdf of the random variables X and Y at (x, y), and g(x) is the marginal pdf of X at x, the conditional pdf of Y given X = x is given by

f(y|x) = f(x, y) / g(x), g(x) > 0, (3.1.1)

for each y within the range of Y, and then

F(y|x) = ∫_{−∞}^{y} f(u|x) du. (3.1.2)

Now we introduce the basic equations of the kernel conditional distribution function estimator. K(u) is assumed to be a kernel function and h_n = h is a sequence of positive numbers converging to zero. Standard kernel estimators of f(x, y), g(x) and f(y|x) are

f̂(x, y) = (nh²)^{−1} ∑_{i=1}^{n} K((x − X_i)/h) K((y − Y_i)/h), (3.1.3)

ĝ(x) = (nh)^{−1} ∑_{i=1}^{n} K((x − X_i)/h), (3.1.4)

and

f̂(y|x) = f̂(x, y) / ĝ(x) = ∑_{i=1}^{n} K((x − X_i)/h) K((y − Y_i)/h) / ( h ∑_{i=1}^{n} K((x − X_i)/h) ).
Definition 3.1.1. The double kernel estimator of the conditional distribution function F(y|x) is defined as

F̂_DK(y|x) = ∫_{−∞}^{y} f̂(u|x) du = B_n(x, y) / ĝ(x),

where

B_n(x, y) = (nh_n)^{−1} ∑_{i=1}^{n} K((x − X_i)/h_n) K̂((y − Y_i)/h_n), K̂(y) = ∫_{−∞}^{y} K(u) du,

K is a probability density function, and h_n is a sequence of positive numbers converging to zero.
[29] used a double kernel approach in which the indicator function in the NW estimator is replaced by a continuous distribution function Ω((y − Y_i)/h₂). Then the estimator takes the form

F̂_DK(y|x) = ∑_{i=1}^{n} w_i(x) Ω((y − Y_i)/h₂),

where

Ω(y) = ∫_{−∞}^{y} W(u) du

is a distribution function with associated density function W(u).
Definition 3.1.2. The double kernel estimator of the conditional median m(x) is defined as

m̂_DK(x) = inf{y ∈ R : F̂_DK(y|x) ≥ 0.5}.
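A minimal sketch of F̂_DK and m̂_DK, taking the logistic cdf as a convenient smooth Ω (the thesis does not fix a particular Ω; the kernels, bandwidths, grid search, and simulated model here are our illustrative choices):

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2)

def omega(u):
    """A smooth distribution function Omega: here the logistic cdf."""
    return 1.0 / (1.0 + np.exp(-u))

def F_dk(y, x, X, Y, h1, h2):
    """Double kernel estimate of F(y|x): the indicator I(Y_i <= y) of the
    NW estimator is replaced by the smooth weight Omega((y - Y_i)/h2)."""
    w = gaussian_kernel((x - X) / h1)
    return np.sum(omega((y - Y) / h2) * w) / np.sum(w)

def median_dk(x, X, Y, h1, h2):
    """m_DK(x) = inf{y : F_DK(y|x) >= 0.5}, located on a grid of y values."""
    for y in np.linspace(Y.min() - 1.0, Y.max() + 1.0, 2001):
        if F_dk(y, x, X, Y, h1, h2) >= 0.5:
            return y
    return Y.max()

rng = np.random.default_rng(3)
X = rng.uniform(-1.0, 1.0, 500)
Y = 2.0 * X + rng.normal(scale=0.3, size=500)   # true conditional median: 2x
print(median_dk(0.5, X, Y, h1=0.1, h2=0.05))    # smooth estimate near 1
```
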
3.2 Consistency of the Double Kernel Estimator

This is the main section of this chapter. Here, we introduce the DK estimator of the L1-median as a minimization problem; then the asymptotic consistency of this estimator is discussed and derived under some regularity conditions. Let φ_n(α, x) be the estimate of φ(α, x) defined by

φ_n(α, x) = ∫_{R^d} (‖y − α‖ − ‖y‖) F_n(dy|x) = ∑_{i=1}^{n} (‖Y_i − α‖ − ‖Y_i‖) K_h(x − X_i) K̂_h(y − Y_i) / ∑_{i=1}^{n} K_h(x − X_i).

From the definition of µ(x), it seems natural to estimate it by minimizing the estimate φ_n(α, x).

Definition 3.2.1. (The minimizer µ_n is the estimated L1-median)

µ_n(x) = arg min_{α∈R^d} φ_n(α, x) = arg min_{α∈R^d} ∑_{i=1}^{n} ‖Y_i − α‖ K_h(x − X_i) K̂_h(y − Y_i),

where the last equality is obtained by removing the terms independent of α in the expression for φ_n(α, x).
Theorem 3.2.1. Assume (A1), (A2), (A3), (A4) and (A5). Then

1. with probability 1 (w.p.1), one can find an integer N ≥ 1 such that if n ≥ N and x ∈ C, the median µ_n(x) associated with the probability measure Q_n(·|x) exists and is unique; moreover, the function µ_n(·) is continuous on C;

2. the function µ(·) is continuous on C;

3. w.p.1,

sup_{x∈C} ‖µ_n(x) − µ(x)‖ → 0 as n → ∞.

Lemma 3.2.1. (Bochner Lemma) [24] Suppose that K is a Borel function satisfying the conditions:

1. sup_{x∈R^s} |K(x)| < ∞

Lemma 3.2.4. Let us assume that K is a positive probability density. W.p.1, one can find an integer N ≥ 1 such that if n ≥ N and x ∈ C, the L1-median µ_n(x) associated with the probability measure Q_n(·|x) (which is not supported by a straight line) exists and is unique.

Lemma 3.2.5. Let us assume that K is a continuous positive kernel. W.p.1, one can find an integer N ≥ 1 such that if n ≥ N, µ_n(·) is continuous on C.

The proofs of the previous lemmas and of the main theorem are obtained using the same techniques as in Chapter 2, after replacing the indicator function I(Y_i ≤ y) by Ω((y − Y_i)/h).
Chapter 4

Application
The performance of the estimators of the conditional L1-median is our concern in this chapter. We use the two estimators studied in Chapters 2 and 3 to analyze and forecast two bivariate time series. The dependency between stock markets in the world of finance indicates that we must deal with multiple time series jointly rather than with each time series alone. The L1-median estimators enable us to predict multivariate medians of different time series. Therefore, we use the NW and DK estimators to analyze bivariate time series. The S-Plus program is used to compute the two estimators based on their theoretical equations from the previous two chapters.

This chapter consists of three sections. In Section 4.1, the first application is given using a bivariate time series from [26] giving the asset prices of two international series, the IBM stock and the SP500 index. The next section is similar to the first, but it depends on another bivariate time series from [26], giving the asset prices of the Intel and Cisco companies. Finally, we close this chapter with Section 4.3, which contains some conclusions and suggestions drawn from our study in this thesis.
4.1 Application 1

We illustrate the application of the NW and DK estimators of the conditional bivariate median by considering the prediction of a financial position with a bivariate financial time series.

Prediction for the IBM and SP500 series

Consider the bivariate time series of the monthly log returns of the IBM stock and the SP500 index from January 1926 to December 1999, consisting of 888 observations. This data set is taken from [26]. We rescaled the data so that they range from zero to one. Now, let x_{1,t} = {IBM}_t and x_{2,t} = {SP500}_t. Thus x_t = (x_{1,t}, x_{2,t}) is a bivariate time series. The two time series x_{1,t} and x_{2,t} are correlated. Table 4.1 provides some summary statistics of the rescaled series. Figures 4.1 and 4.2 show the time plots of the two series, while Figures 4.3 and 4.4 show the scatterplots of the two series and of their squares, respectively.
Table 4.1: Summary statistics of IBM and SP500 data
Series Min. 1st Qu Mean Median 3rd Qu. Max.
IBM 0.00 0.46 0.52 0.52 0.59 1.00
SP500 0.00 0.47 0.51 0.52 0.56 1.00
Details of Calculation

We computed µ_n(x) by finding the minimum over a finite sample of points. For the NW estimator,

µ_n(x) = arg min_{α∈R^d} φ_n(α, x) = arg min_{α∈R^d} ∑_{i=1}^{n} ‖Y_i − α‖ K_h(x − X_i), (4.1.1)

and for the DK estimator,

µ_n(x) = arg min_{α∈R^d} φ_n(α, x) = arg min_{α∈R^d} ∑_{i=1}^{n} ‖Y_i − α‖ K_h(x − X_i) K̂_h(y − Y_i). (4.1.2)
We use the first 880 bivariate observations to predict the last 8 observations of the bivariate time series, using the NW estimator from (4.1.1) and the DK estimator from (4.1.2) for the median. The results for the IBM data are listed in Table 4.2, and Figure 4.5 shows the true observations for IBM together with their predictions using the NW and DK estimators. The results for the SP500 data are listed in Table 4.3 and their graph is shown in Figure 4.6.
Figure 4.1: Time plot of the rescaled IBM stock
Figure 4.2: Time plot of the rescaled SP500 stock
Figure 4.3: Scatterplot of the rescaled IBM stock versus the rescaled SP500 stock
Figure 4.4: Scatterplot of the squares of the rescaled IBM stock versus the squares
of the rescaled SP500 stock
Figure 4.5: Graph of the NW and the DK estimators for IBM
Table 4.2: The DK and NW median estimators for the IBM data.
i DK estimator True value NW estimator
880 0.7051316 0.6751316 0.6651316
881 0.7051316 0.6811093 0.6351316
882 0.6851316 0.4560167 0.5851316
883 0.6351316 0.4889528 0.5351316
884 0.5851316 0.4542471 0.5651316
885 0.5951316 0.1577721 0.5351316
886 0.5751316 0.5832439 0.5651316
887 0.6051316 0.5777071 0.5551316
Figure 4.6: Graph of the NW and the DK estimators for SP500
Table 4.3: The DK and NW median estimators for the SP500 data
i DK estimator True value NW estimator
880 0.5068489 0.6751316 0.5068489
881 0.5468489 0.6811093 0.5468489
882 0.5868489 0.4560167 0.5468489
883 0.5968489 0.4889528 0.5168489
884 0.5468489 0.4542471 0.5368489
885 0.5668489 0.1577721 0.5268489
886 0.5468489 0.5832439 0.5268489
887 0.5668489 0.5777071 0.5168489
In Table 4.4 we report the mean squared error (MSE) of the NW and DK kernel
estimators,
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,$$
where $y_i$ is the true value and $\hat{y}_i$ its predicted value. As seen from
Table 4.4, the NW estimator attains a smaller MSE than the DK estimator for both
series.
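The MSE above amounts to a few lines of Python; the sketch below uses hypothetical arrays in place of the actual predictions:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared prediction error: (1/n) * sum_i (y_i - yhat_i)^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical true values and predictions:
print(mse([0.675, 0.681, 0.456], [0.665, 0.635, 0.585]))
```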
Table 4.4: MSE of the DK and the NW estimators for Application 1
IBM SP500
DK 0.03557136 0.004800537
NW 0.02206879 0.003111601
4.2 Application 2
In this application, we apply the NW and DK estimators to a bivariate time series
consisting of the Cisco and Intel data.
Prediction for the Cisco and Intel series
Suppose that we want to predict values of the following two time series: the stocks
of Cisco Systems and the Intel Corporation. We use the daily log returns of the
two stocks from January 2, 1991 to December 31, 1999, with 2258 observations. [26]
considered this data set and computed the VaR at the end of the data span by
building univariate and multivariate volatility models. We rescaled the data so that
they range from zero to one. Now let x1,t = {Cisco}t and x2,t = {Intel}t; thus
xt = (x1,t, x2,t) is a bivariate time series.
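The rescaling of the data to the range from zero to one is an ordinary min-max transformation, which can be sketched as follows (the function name and the sample returns are illustrative):

```python
import numpy as np

def rescale01(x):
    """Min-max rescale a series so that it ranges from zero to one."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Example on hypothetical daily log returns:
print(rescale01([-0.04, 0.00, 0.02, 0.04]))
```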
The two time series x1,t and x2,t are correlated. Table 4.5 provides some summary
statistics of the rescaled series. Figures 4.7 and 4.8 show the time plots of the
two series, while Figures 4.9 and 4.10 show the scatterplots of the two series
and of their squares, respectively.
Table 4.5: Summary statistics of Cisco and Intel data
Series Min. 1st Qu. Mean Median 3rd Qu. Max.
Cisco 0.00 0.55 0.59 0.54 0.64 1.00
Intel 0.00 0.48 0.54 0.54 0.60 1.00
We use the first 2258 bivariate observations to predict the last 8 observations of
the bivariate time series using the NW and DK estimators for the median. The results
for the Cisco data are listed in Table 4.6, and Figure 4.11 shows the true observations
for Cisco together with their predictions from the NW and DK estimators. The results
for the Intel data are listed in Table 4.7 and graphed in Figure 4.12.
Figure 4.7: Time plot of the rescaled Cisco stock
Figure 4.8: Time plot of the rescaled Intel stock
Figure 4.9: Scatterplot of the rescaled Cisco stock versus the rescaled Intel stock
Figure 4.10: Scatterplot of the squares of the rescaled Cisco stock versus the squares
of the rescaled Intel stock
Figure 4.11: Graph of the NW and the DK estimators for Cisco
Table 4.6: The DK and NW median estimators for the Cisco data
i DK estimator True value NW estimator
2258 0.7280763 0.7280763 0.7380763
2259 0.6880763 0.5314482 0.7080763
2260 0.6380763 0.5408192 0.6580763
2261 0.6580763 0.6439006 0.6380763
2262 0.6280763 0.6509391 0.6380763
2263 0.6580763 0.4613087 0.6380763
2264 0.6280763 0.5078365 0.6480763
2265 0.6580763 0.6794206 0.6280763
Figure 4.12: Graph of the NW and the DK estimators for Intel
Table 4.7: The DK and NW median estimators for the Intel data
i DK estimator True value NW estimator
2258 0.5176012 0.4776012 0.5176012
2259 0.5576012 0.3765592 0.5576012
2260 0.5176012 0.4469881 0.5576012
2261 0.5576012 0.4590812 0.5476012
2262 0.5276012 0.6055674 0.5676012
2263 0.5576012 0.4269851 0.5576012
2264 0.5276012 0.8381123 0.5576012
2265 0.5576012 0.5740417 0.5576012
Table 4.8 summarizes the computed MSE for the two time series using the NW and DK
estimators. The results indicate that the NW estimator performs better than the DK
estimator for the Cisco data, while the DK estimator performs better than the NW
estimator for the Intel data.
Table 4.8: MSE of the DK and the NW estimators for Application 2
Intel Cisco
DK 0.0110432 0.02111191
NW 0.01898825 0.01234954
4.3 Discussion and Conclusion
The need to deal with multiple time series rather than dealing with each time series,
alone gives the researchers the idea to use L1- median estimator, for more details
see [2].
[2] has used the NW estimator for estimating the L1- median. This estimator de-
pends on the indicator function and this make the resulting curves are not smooth.
To get curves much smoothers, we have proposed to use a DK estimator for the L1
median. By looking on Figure 4.5, Figure 4.6, Figure 4.11 and Figure 4.12, we note
that the estimated curves using the DK is smoother than that we have obtained
using the NW estimator.
The comparison based on the MSE indicated that the NW estimator was better than
the DK estimator for three of the four time series.
From these results, we can conclude the following:

1. To obtain smooth curves, the DK estimator should be used.

2. In general, the NW estimator gives curves that are closer to the real data.

3. Modified estimators based on the NW estimator, such as reweighted versions or
   versions with a variable bandwidth, would likely improve on the estimators used
   here, but they require more complicated computations and programming.
Bibliography

[1] Al Attal, A. (2016). On the Kernel Estimation of the Conditional Median. The Islamic University of Gaza.

[2] Berlinet, A., Cadre, B. and Gannoun, A. (1998). On the Conditional L1-Median and its Estimation. Nonparametric Statistics.

[3] Berlinet, A., Cadre, B. and Gannoun, A. (2001). On the Conditional L1-Median and its Estimation. University of Montpellier, France.

[4] Bosq, D. and Lecoutre, J. P. (1987). Théorie de l'estimation fonctionnelle. Economica.

[5] Hansen, B. E. (2009). Lecture Notes on Nonparametrics. University of Wisconsin, Spring 2009.

[6] Cameron, A. C. and Trivedi, P. K. (1998). Regression Analysis of Count Data. New York: Cambridge University Press.

[7] Casella, G. and Berger, R. (2002). Statistical Inference. USA.

[8] De Gooijer, J. G., Gannoun, A. and Zerom, D. (2004). A Multivariate Quantile Predictor. UvA Econometrics, Discussion Paper 2002/08.

[9] Dudley, R. M. (1989). Real Analysis and Probability. Chapman and Hall.

[10] Fan, J., Hu, T. C. and Truong, Y. K. (1994). Robust Nonparametric Function Estimation. Scandinavian Journal of Statistics, 21, 433-446.

[11] Fan, J. and Yao, Q. (2003). Nonlinear Time Series: Nonparametric and Parametric Methods. Springer-Verlag, USA.

[12] Fan, J., Yao, Q. and Tong, H. (1996). Estimation of conditional densities and sensitivity measures in nonlinear dynamical systems. Biometrika, Vol. 83, 189-206.

[13] Florescu, I. (2015). Probability and Stochastic Processes. John Wiley and Sons.

[14] Freund, J. (1992). Mathematical Statistics. Arizona State University.

[15] Casella, G. and Berger, R. L. (1990). Statistical Inference. Cornell University, North Carolina State University.

[16] Hogg, R., McKean, J. and Craig, A. (2005). Introduction to Mathematical Statistics. University of Iowa, Western Michigan University, University of Iowa.

[17] Hyndman, R. J. (1996). Estimating and visualizing conditional densities. Journal of Computational and Graphical Statistics, Vol. 5, 315-336.

[18] Kemperman, J. H. B. (1987). The Median of a Finite Measure on a Banach Space. In: Statistical Data Analysis Based on the L1-Norm and Related Methods, Y. Dodge (ed.), Amsterdam: North-Holland, 217-230.

[19] Nadaraya, E. A. (1964). On Estimating Regression. Theory of Probability and its Applications, 10, 186-190.

[20] Rider, P. R. (1960). Variance of the median of small samples from several special populations. Journal of the American Statistical Association.

[21] Rao, C. R. (1965). Linear Statistical Inference and Its Applications. Wiley, New York.

[22] Rosenblatt, M. (1969). Conditional Probability Density and Regression Estimators. New York: Academic Press.

[23] Royden, H. L. (1997). Real Analysis. Stanford University.

[24] Salha, R. (2006). Kernel Estimation for the Conditional Mode and Quantiles of Time Series. University of Macedonia, Economic and Social Sciences.

[25] Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.

[26] Tsay, R. S. (2002). Analysis of Financial Time Series. John Wiley and Sons.

[27] Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman and Hall.

[28] Watson, G. S. (1964). Smooth regression analysis. Sankhya, Series A, 26, 359-372.

[29] Yu, K. and Jones, M. C. (1998). Local Linear Quantile Regression. Journal of the American Statistical Association, Vol. 93, No. 441, 228-237.