
NONPARAMETRIC REGRESSION ESTIMATION WHEN THE REGRESSOR TAKES ITS VALUES IN A METRIC SPACE∗

Sophie DABO-NIANG a and Noureddine RHOMARI a,b

a CREST, Timbre J340, 3, Av. Pierre Larousse, 92245 Malakoff cedex, France

[email protected]; [email protected]

b Université Mohammed I, Faculté des Sciences, 60 000 Oujda, Maroc

[email protected]

(1st version, September 2001)

AMS classification: 62G08; 62H30.

Key words and phrases: Kernel estimates; regression function; metric-valued random vectors; discrimination.

Abstract: We study a nonparametric regression estimator when the explanatory variable takes its values in a separable semi-metric space. We establish some asymptotic results and give upper bounds of the p-mean and (pointwise and integrated) almost sure estimation errors, under general conditions. We end with an application to discrimination in a semi-metric space and study, as an example, the case where the explanatory variable is the Wiener process in C[0, 1].

Résumé: Dans ce travail, nous étudions l'estimateur à noyau de la régression quand la variable explicative prend ses valeurs dans un espace semi-métrique séparable. Nous établissons sa consistance en moyenne d'ordre p et presque sûre (ponctuelle et intégrée), sous des conditions générales. Nous précisons aussi des bornes supérieures de ces erreurs d'estimation. Ensuite, nous appliquons ces résultats à la discrimination de variables d'un espace semi-métrique et étudions comme exemple le cas où la variable explicative est le processus de Wiener dans C[0, 1].

∗Comments are welcome


1 Introduction

Let (X, Y) be a random vector in X × R with E(|Y|) < ∞, where (X, d) is a separable semi-metric space equipped with the semi-metric d. The distribution of (X, Y) is often unknown, and so is the regression function m(x) = E(Y|X = x), x ∈ X. We then want to estimate the regression operator m(x) by m_n(x), a function of x and of a sample (X_1, Y_1), . . . , (X_n, Y_n) of (X, Y). The present framework includes the classical finite dimensional case X = R^d, d ≥ 1, but also spaces of infinite dimension, such as the usual function spaces or sets of probabilities on a given measurable space. This problem has been widely studied when the variable X lies in a finite dimensional space, and there are many references on this topic, contrary to the infinite dimensional observation case.

In this work, X may be infinite dimensional; it can be, for example, a function space, which is an important case. Recently, the statistics of functional data has attracted growing interest. These questions in infinite dimension are particularly interesting, both for the fundamental problems they raise (see Bosq (2000)) and for the many applications they allow (see Ramsay and Silverman (1997)). One possible application is when the (X_i)_i are random curves, for instance in C(R) or C[0, T]. Many phenomena, in various areas (e.g. weather, medicine, ...), are continuous in time and may or must be represented by curves. One can be interested either in forecasting or in classification.

For example, let us consider the weather data of some country. One can investigate to what extent total annual precipitation for weather stations can be predicted from the pattern of temperature variation through the year. Let Y be the logarithm of total annual precipitation at a weather station, and let X = X(t), t ∈ [0, T], be its temperature function; the interval [0, T] is either [0, 12] or [0, 365], depending on whether monthly or daily data are used. For more examples, we refer to Ramsay and Silverman (1997) and the references therein.

The popular estimate of m(x), x ∈ X, which is a locally weighted average, is defined by the following expression

m_n(x) = Σ_{i=1}^n W_ni(x) Y_i   (1)

where (W_n1(x), . . . , W_nn(x)) is a probability vector of weights and each W_ni(x) is a Borel measurable function of x, X_1, . . . , X_n. We will be interested in the kernel method

W_ni(x) = K(d(X_i, x)/h_n) / Σ_{j=1}^n K(d(X_j, x)/h_n),   (2)

if Σ_{j=1}^n K(d(X_j, x)/h_n) ≠ 0, and W_ni(x) = 0 otherwise, where (h_n)_n is a sequence of positive numbers and K is a nonnegative measurable function on R.

When (X_i, Y_i) ∈ R^d × R^k, the estimation of m was treated by many authors with various weights, including the nearest neighbor and kernel methods. For the references cited hereafter, we limit ourselves to the general case where no assumption is made concerning the existence of a density with respect to some reference measure. Stone (1977) showed that the nearest neighbor estimate is universally consistent, that is,

E(|m_n(X) − m(X)|^p) → 0 as n → ∞ whenever E(|Y|^p) < ∞, p ≥ 1,

and this for more general weight vectors. Devroye and Wagner (1980), and Spiegelman and Sacks (1980), extended the same result to the kernel estimate. Universal consistency results were also presented by Gyorfi (1981). Krzyzak (1986) gave rates of convergence in probability and almost surely for a wide class of kernel estimates; see also the references cited by these authors. Pointwise consistency was treated, among others, by Devroye (1981) for both the nearest neighbor and the kernel methods and by Greblicki et al. (1984) for more general kernel estimates. For example, Devroye (1981) proved that, for P_X-almost all x, m_n(x) is consistent in p-mean, that is,

lim_{n→∞} E(|m_n(x) − m(x)|^p) = 0,   (3)

for P_X-almost all x in R^d, whenever E(|Y|^p) < ∞, p ≥ 1, h_n → 0 and lim_{n→∞} n h_n^d = ∞, where P_X is the probability distribution of X. He also established its almost sure pointwise convergence and its almost sure integrated (w.r.t. P_X) consistency if in addition Y is bounded and lim_{n→∞} n h_n^d / log n = ∞.

In this paper we extend Devroye's (1981) results to the case where X takes values in a general separable semi-metric space, which may be infinite dimensional. We also give rates of convergence, which are optimal in the case of finite dimensional X.

The literature on this topic in infinite dimension is relatively limited, to our knowledge. Bosq (1983) and Bosq and Delecroix (1985) deal with general kernel predictors for Hilbert-valued Markovian processes. In the case of independent observations, Kulkarni and Posner (1995) studied nearest neighbor estimation in a general separable metric space X. They gave a rate of convergence connected to metric covering numbers in X. The recent work of Ferraty and Vieu (2000) is on the kernel method for X lying in a semi-normed vector space, whose distribution has a finite fractal dimension (see condition (4) below). They proved that, when Y is bounded, the kernel estimate related to (1) and (2) converges almost surely to m(x), if m is continuous at x and if there exist two positive numbers a(x), c(x) such that

lim_{h→0} P(X ∈ B_h^x) / h^{a(x)} = c(x)   (4)

and h_n → 0 and lim_{n→∞} n h_n^{a(x)} / log n = ∞, where B_h^x denotes the closed ball of radius h and center x. Under similar conditions, Ferraty et al. (2002) extended this result to dependent observations and obtained an a.s. uniform convergence on compact sets, together with its rate.

Here, the same statements are established, in the independent case, under general assumptions (see theorem 2 below). The conditions involve the probability of the explanatory variable X being in the ball B_{h_n}^x. We require that

lim_{n→∞} n P(X ∈ B_{h_n}^x) / log n = ∞,

which is fulfilled for all random variables X and some class of sequences (h_n) converging to 0, and m may be discontinuous at x; hence we extend [12]'s results, and in the case of condition (4) we improve their rate of convergence. The proofs are different. A condition similar to (4) allowed [13] to extend the study to dependent observations and to establish a.s. uniform consistency under a uniform-type version of (4). However, condition (4) excludes an interesting class of processes having infinite fractal dimension, like the Wiener process (see paragraph 4.2). The proof used here can easily be extended to the dependent case via the coupling method, but it seems difficult to obtain uniform consistency with it; the a.s. convergence remains valid for absolutely regular (β-mixing) processes (see [26]). Note that our conditions, written in terms of P(X ∈ B_{h_n}^x), unify regression estimation with finite and infinite dimensional explanatory variables, and the results, especially the bounds of the estimation errors, highlight how the small ball probabilities related to the explanatory variable influence the convergences.

Let µ be the probability distribution of X. In the case of the finite dimensional space X = R^d, the crucial tool needed to prove those results, e.g. in [9], [18] and [27], is the differentiation theorem for a probability measure: if f ∈ L^p(R^d, µ) = {f : (R^d, B_{R^d}, µ) → (R, B_R) : ‖f‖_p < ∞}, then

lim_{h→0} (1/µ(B_h^x)) ∫_{B_h^x} |f(z) − f(x)|^p dµ(z) = 0,   (5)

for µ-almost all x, p ≥ 1. This holds for all Borel locally finite measures on R^d, in particular for probability measures. Unfortunately, that statement is in general false when X is an infinite dimensional space, even for probability measures, as proved by Preiss (1979). He constructed a Gaussian measure γ on a Hilbert space H and an integrable function f such that

lim_{s→0} inf { (1/γ(B_h^x)) ∫_{B_h^x} f dγ ; x ∈ H, 0 < h < s } = ∞.

However, (5) remains valid, under some additional conditions, for some probability measures µ on a Hilbert space, as stated in Tiser's theorem below.

Theorem A (Tiser, 1988). Let H be a Hilbert space and let γ be a Gaussian measure whose covariance operator has the spectral representation Rx = Σ_i c_i⟨x, e_i⟩e_i, where (e_i) is an orthonormal system in H. Suppose that c_{i+1} ≤ c_i i^{−α} for a given α > 5/2. Then, for all f ∈ L^p(H, γ), p > 1,

lim_{h→0} (1/γ(B_h^x)) ∫_{B_h^x} |f(z) − f(x)| dγ(z) = 0, for γ-almost every x.

Remark 1
In the above theorem of Tiser, the convergence can be stated with |·|^{p′} instead of |·|, for all f ∈ L^p_{loc}(H, γ) and p′ < p (loc stands for locally integrable), that is,

lim_{h→0} (1/γ(B_h^x)) ∫_{B_h^x} |f(z) − f(x)|^{p′} dγ(z) = 0, for γ-almost every x.

For this, we use either his own proof, or the same idea as in the first part of the proof of Lemma 2.1 in Devroye (1981), or the same argument as in the proof of theorem 7.15, page 108, in [32].

Remark 2 It is obvious that (5) remains valid for all almost surely continuous functions f and all Borel locally finite measures on a semi-metric space. In particular, it holds at each point x of continuity of f, for every probability measure µ. That is why we preferred to work with general regression functions fulfilling condition (5).

In this work, we shall prove that (5) and E(|Y|^p) < ∞, for some p ≥ 1, imply (3), even for an X-valued random variable X with X infinite dimensional. We also show the pointwise and integrated (w.r.t. µ, i.e. the L^1(µ)-error ∫|m_n − m| dµ) strong convergence of m_n when the r.v. Y is bounded, under general conditions (m may be discontinuous). We give the rate of convergence for each type of consistency. We then apply these results to the nonparametric discrimination of variables in a semi-metric space (e.g. the curve classification problem). As an example we present the special case where µ is a Wiener measure and X = C[0, 1] is equipped with the sup-norm. This example, which does not fulfill [12]'s condition (4), illustrates the extension and the contribution of this work: we can apply it to regression with an explanatory variable whose distribution has an infinite fractal dimension and to the nonparametric classification of general curves, which is new, to our knowledge.

It is obvious that all the results also hold when Y takes values in R^k, k ≥ 1, thanks to the linearity of the estimator (1) with respect to the (Y_i)'s. Another work, dealing with Y in a separable Banach space, which may also be a curve, shows that the same results remain valid for regression estimation with a Banach space valued response variable. As mentioned above, we can also drop the independence condition and replace it by an absolutely regular mixing condition (see [26]).

The rest of the paper is organized as follows. We give the assumptions and some comments in section 2. Section 3 contains the results on pointwise p-mean consistency and on strong consistency, followed by the study of the integrated estimation error in the special case of bounded Y. Section 4 is devoted to the application to the nonparametric discrimination problem, followed by the example of the Wiener measure and X = C[0, 1] equipped with the sup-norm. The proofs are postponed to the last section.

Remark 3 Let X′ be a separable semi-metric space, and let Φ : R → R and Ψ : X → X′ be measurable functions. It is clear that the results below can be applied to estimate mΨΦ(z) = E(Φ(Y)|Ψ(X) = z). The introduction of these functions allows us to cover and estimate various (functional) parameters related to the distribution of (X, Y), by varying them. For example:

i) If Φ = Id and Ψ = Id, then mIdId(x) = E(Y|X = x) = m(x).

ii) If A is a measurable set and Φ(Y) = 1I_A(Y), then mΨΦ(z) = P(Y ∈ A|Ψ(X) = z) corresponds to the conditional distribution of Y given Ψ(X), ...

Further, Ψ can be used when X is not separable, in order to fulfill the separability condition. We have omitted these functions just to simplify the notation.

2 Assumptions

Let us assume that there exist positive numbers r, a and b such that

a 1I_{{|u|≤r}} ≤ K(u) ≤ b 1I_{{|u|≤r}},   (6)

and, for a fixed x in X, let B_h^x denote the closed ball of radius h and center x, and suppose

h_n → 0 and lim_{n→∞} n µ(B_{rh_n}^x) = ∞.   (7)

We assume also that, for some p ≥ 1,

lim_{h→0} (1/µ(B_h^x)) ∫_{B_h^x} |m(w) − m(x)|^p dµ(w) = 0,   (8)

and

E[|Y − m(X)|^p | X ∈ B_{rh_n}^x] = o([n µ(B_{rh_n}^x)]^{p/2}).   (9)

Remark 4
1- (8) is satisfied at every point of continuity of m, for all µ.
2- (8) does not imply, in general, that m is continuous. Indeed, the Dirichlet function defined on [0, 1] by m(x) = 1I_{{x rational in [0,1]}} fulfills (8) with µ any probability measure absolutely continuous with respect to Lebesgue measure on [0, 1], but it is nowhere continuous (example from Krzyzak (1986)); in infinite dimension, take a countable dense set instead of the rationals and a continuous measure.
3- If m verifies (8) for p = 1 and is bounded in a neighborhood of x, then (8) is true for all p.
4- When Y is bounded, condition (9) is obviously satisfied.
5- It is clear that E[|Y − m(X)|^p | X ∈ B_{rh_n}^x] = (1/µ(B_{rh_n}^x)) E[|Y − m(X)|^p 1I_{B_{rh_n}^x}(X)], so condition (9) is satisfied, for example, when (7) holds and either E[|Y − m(X)|^p | X ∈ B_{rh_n}^x] = O(1) or lim_{n→∞} n(µ(B_{rh_n}^x))^{1+2/p} = ∞.
6- As noticed above, thanks to theorem A, (8) and (9) remain valid for all f ∈ L^{p′}(H, µ) and µ-almost all x in H, p′ > p ≥ 1, where H is a Hilbert space and µ a Gaussian measure as in theorem A; thus, for such a probability measure, (8) and (9) are fulfilled when E|Y|^{p′} < ∞. In this case theorem 1, below, holds for µ-almost all x in H.
7- Recall that (8) and (9) hold for µ-almost all x and all µ in the finite dimensional space X = R^d, whenever E|Y|^p < ∞; see [9] or [32].

In the finite dimensional case X = R^d, we have h^d = O(µ(B_h^x)) for µ-almost all x and all probability measures µ on R^d (see [9] or [32]); so, in this case, condition (7) is implied by h_n → 0 and lim_{n→∞} n h_n^d = ∞. The formulation of the assumptions and of the bounds of the estimation errors in terms of the probability µ(B_h^x) thus unifies the two cases of finite and infinite dimensional explanatory variables. It should be noted that the evaluation of µ(B_h^x), known as the small ball probability, is very difficult in infinite dimension (see e.g. [21]).

3 Main results

3.1 p-mean consistency

Theorem 1 If E(|Y|^p) < ∞ and if (6), (7) and (9) hold, then

E(|m_n(x) − m(x)|^p) → 0 as n → ∞,

for all x in X such that m satisfies (8).

If µ is discrete, the result in theorem 1 holds for all points in the support of µ whenever E(|Y|^p) < ∞ and h_n → 0.

Rate of p-mean convergence The next result gives a rate of p-mean convergence, p ≥ 2. We therefore need to reinforce condition (8) in order to evaluate the bias sharply. We suppose that m is "p-mean Lipschitz" in a neighborhood of x, with parameters 0 < τ = τ_x ≤ 1 and c = c_x > 0, that is,

(1/µ(B_h^x)) ∫_{B_h^x} |m(w) − m(x)|^p dµ(w) ≤ c_x h^{pτ}, as h → 0.   (10)

It is obvious that (10) is satisfied if m is Lipschitz with parameters τ = τ_x and c_x, but it is weaker than the Lipschitz condition and it does not imply, in general, even the continuity of m at x (see the example in remark 4). A similar assumption was used by Krzyzak (1986).

Corollary 1 If E(|Y|^p) < ∞, E(|Y − m(X)|^p | X ∈ B_{rh_n}^x) = O(1) and if (6), (7), (10) are satisfied, with p ≥ 2, then we have

E(|m_n(x) − m(x)|^p) = O(h_n^{pτ} + (1/(nµ(B_{rh_n}^x)))^{p/2}).

Without the assumption of the existence of probability densities, this statement seems to be new, in finite dimension, for an unbounded variable Y.

Remark 5
1. If we assume the condition lim sup_{h→0} µ(B_h^x)/h^{a(x)} < ∞ for some a(x) > 0, which is weaker than (4), then for h_n = c n^{−1/(2τ+a(x))} we have

E(|m_n(x) − m(x)|^p) = O(n^{−pτ/(2τ+a(x))}), p ≥ 2

(a short verification is sketched just after this remark).
2. In the finite dimensional case, X = R^d, the best rate derived from the bound in corollary 1 is optimal (cf. [29]), because h^d = O(µ(B_h^x)) for µ-almost all x and all probability measures µ on R^d; see lemma 2.2 of [9] and its proof, or [32, p. 189]. In this case, we obtain a rate similar to the above with d instead of a(x).
3. In R^d, Spiegelman and Sacks (1980) obtained the same rate for E(|m_n(X) − m(X)|^2) in the case p = 2, τ = 1 and bounded Y. Krzyzak (1986) obtained, for a more general kernel, rates in probability and for a.s. convergence under condition (10) and an unbounded response variable Y.
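A short verification of the rate in item 1 above, assuming only lim sup_{h→0} µ(B_h^x)/h^{a(x)} < ∞, so that 1/(nµ(B_{rh_n}^x)) = O(1/(n h_n^{a(x)})):

```latex
% Plug h_n = c\,n^{-1/(2\tau + a(x))} into the bound of Corollary 1:
E\bigl(|m_n(x)-m(x)|^p\bigr)
   = O\Bigl(h_n^{p\tau} + \bigl(n h_n^{a(x)}\bigr)^{-p/2}\Bigr),
\qquad
h_n^{p\tau} = c^{p\tau}\, n^{-p\tau/(2\tau+a(x))},
\qquad
\bigl(n h_n^{a(x)}\bigr)^{-p/2} = c^{-p\,a(x)/2}\, n^{-p\tau/(2\tau+a(x))},
```

since n h_n^{a(x)} = c^{a(x)} n^{2τ/(2τ+a(x))}; the bias and variance terms are then of the same order, which gives the stated rate.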

3.2 Strong consistency

Below, we give the pointwise and µ-integrated almost sure convergence results for the kernel estimate, together with their rates of convergence. In order to simplify the proofs, we assume that |Y| ≤ M < ∞ a.s.; with truncation techniques the results can be extended to unbounded Y, see the comments below.

Theorem 2 If Y is bounded, (6) holds, h_n → 0 and lim_{n→∞} nµ(B_{rh_n}^x)/log n = ∞, then

|m_n(x) − m(x)| → 0, a.s., as n → ∞,   (11)

for all x in X such that m satisfies (8).

If in addition (8) holds for µ-almost all x, then

lim_{n→∞} E(|m_n(X) − m(X)| | X_1, Y_1, . . . , X_n, Y_n) = 0, a.s.   (12)

Rate of a.s. convergence We now establish the rate of strong consistency.

Corollary 2 Assume that |Y| ≤ M. If (6) and (10), for p = 1, hold, if h_n → 0 and lim_{n→∞} nµ(B_{rh_n}^x)/log n = ∞, then we have

|m_n(x) − m(x)| = O(h_n^τ + (log n/(nµ(B_{rh_n}^x)))^{1/2}), a.s.

In fact we obtain, more precisely, for large n, a.s.,

|m_n(x) − m(x)| ≤ c_x(b/a)(rh_n)^τ + 6M(b/a)(nµ(B_{rh_n}^x)/log n)^{−1/2},

and the 6M can be replaced by any almost sure bound of

2 max(√(E[|m(X) − m(x)|² | X ∈ B_{rh_n}^x]), 2√(E[|Y − m(X)|² | X]) 1I_{B_{rh_n}^x}(X)).

But if m is Lipschitz in a neighborhood of x, with parameters 0 < τ = τ_x ≤ 1 and c_x > 0, this bound becomes

|m_n(x) − m(x)| ≤ c_x(b/a)(rh_n)^τ + 2M(b/a)(nµ(B_{rh_n}^x)/log n)^{−1/2},

and the 2M can be replaced by any almost sure bound of

2√(E[|Y − m(X)|² | X]) 1I_{B_{rh_n}^x}(X).

We can relax the boundedness of Y by assuming that it satisfies the Cramér condition; in this case the log n above is replaced by (log n)³, see the comments below.

Remark 6
1. If the distribution of X satisfies the additional condition lim sup_{h→0} µ(B_h^x)/h^{a(x)} < ∞ for some a(x) > 0 (weaker than (4)) and m is continuous at x, then (7) and (8) are fulfilled, and we recover theorem 1 of Ferraty and Vieu (2000). If in addition (10) holds for p = 1, then for h = c(log n/n)^{1/(2τ+a(x))} we find that

|m_n(x) − m(x)| = O((log n/n)^{τ/(2τ+a(x))}), a.s.,

which improves the bound given by Ferraty and Vieu (2000) for m Lipschitz under an additional condition on the expansion of the limit in (4). They assumed that there exists a positive real b(x) such that µ(B_h^x) = h^{a(x)}c(x) + O(h^{a(x)+b(x)}) and obtained the rate O((n/log n)^{−γ(x)/(2γ(x)+a(x))}), where γ(x) = min{b(x), τ}.
2. In finite dimension, X = R^d, Krzyzak (1986) obtained, for a slightly more general kernel, similar rates in probability and for almost sure convergence in the bounded case, but weaker ones for unbounded Y.

3.3 How to choose the window h?

It is well known that the performance of the kernel estimate depends on the choice of the window parameter h. The bound in corollary 2 is simple and easy to compute, so it allows us to choose the window parameter h that minimizes that bound. We can also use other methods, such as cross-validation (a leave-one-out variant is sketched below).
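As an illustration of the cross-validation alternative, here is a minimal leave-one-out sketch; it reuses the hypothetical kernel_regression and sup_metric helpers from the sketch in the introduction, and candidate_hs is a user-chosen grid of bandwidths.

```python
import numpy as np

def select_bandwidth_cv(X, Y, candidate_hs, d=sup_metric, K=lambda u: float(abs(u) <= 1.0)):
    # Leave-one-out cross-validation: pick the h minimizing the empirical
    # prediction error sum_i (Y_i - m_n^{(-i)}(X_i))^2, where m_n^{(-i)} is
    # the kernel estimate computed without the i-th observation.
    Y = np.asarray(Y, dtype=float)
    best_h, best_err = None, np.inf
    for h in candidate_hs:
        err = 0.0
        for i in range(len(X)):
            X_minus = X[:i] + X[i + 1:]              # X is assumed to be a list of curves
            Y_minus = np.concatenate([Y[:i], Y[i + 1:]])
            err += (Y[i] - kernel_regression(X[i], X_minus, Y_minus, h, d, K)) ** 2
        if err < best_err:
            best_h, best_err = h, err
    return best_h
```

This is only a sketch; in practice one would restrict the grid of bandwidths to values for which the balls B_{rh}^x are nonempty for most observations.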

4 Applications

4.1 Discrimination.

Let an object O have an observed variable X in some space X and an unknown nature-class Y in some discrete set.

Classical discrimination is about predicting the unknown nature Y of an object O having a d-dimensional observation X. The unknown nature Y of the object is called a class and takes values in a finite set {1, . . . , M}. One constructs a function g, called a classifier, defined on X and taking values in {1, . . . , M}, which represents one's guess g(X) of Y given X. We err on Y if g(X) ≠ Y, and the probability of error of a classifier g is L(g) = P{g(X) ≠ Y}. The question is how to build this function g, and what is the best way to determine the class of this object? In discrimination, the problem is to classify the object O, i.e. to decide on the value of Y from X and a sample of this pair of variables, by an empirical rule g_n, which is a measurable function from X × (X × {1, . . . , M})^n into {1, . . . , M}; its probability of error is P{g_n(X) ≠ Y | X_1, Y_1, . . . , X_n, Y_n}, where (X_1, Y_1), . . . , (X_n, Y_n) are observations of (X, Y).

It is well known that the Bayes classifier, defined by

g* = arg min_{g:X→{1,...,M}} P{g(X) ≠ Y},

is the best possible classifier with respect to the probability of error (see e.g. [11]). Note that g* depends upon the distribution of (X, Y), which is unknown. We therefore build an estimator g_n of the classifier based on a sample of observations.

Here we consider an observation X belonging to some separable semi-metric space X, with an unknown class Y. X may be of infinite dimension, e.g. a curve. We want to predict this class from X and a sample of this pair of variables, for example to classify objects with curve parameters (e.g. to determine the risk Y of patients according to an electrocardiogram curve, ...); in this case the (X_i) take values in some function space, for instance C(R). In this section we extend Devroye's results to the setting where X belongs to a semi-metric space X. This space may be infinite dimensional, e.g. a function space, as mentioned above.

Let X be a random variable with values in X and let Y take values in {1, . . . , M}; Y is estimated from X and (X_1, Y_1), . . . , (X_n, Y_n) by g_n(X). The quality of this estimate is measured by the probability of error

L_n = P{g_n(X) ≠ Y | X_1, Y_1, . . . , X_n, Y_n},

and it is clear that

L_n ≥ L* = inf_{g:X→{1,...,M}} P{g(X) ≠ Y},

where L* is the Bayes probability of error. The Bayes classifier g* can be approximated by g_n chosen so that

Σ_{i=1}^n W_ni(x) 1I_{{Y_i = g_n(x)}} = max_{1≤j≤M} Σ_{i=1}^n W_ni(x) 1I_{{Y_i = j}}.   (13)

Such a classifier g_n (not necessarily uniquely determined) is called an approximate Bayes classifier.
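A minimal sketch of the rule (13) follows, again reusing the hypothetical sup_metric helper from the earlier sketch; classes are labeled 1, . . . , M as in the paper, and the choice made on an empty neighborhood is arbitrary.

```python
import numpy as np

def kernel_classify(x, X, labels, h, n_classes, d=sup_metric, K=lambda u: float(abs(u) <= 1.0)):
    # Approximate Bayes rule (13): return the class j maximizing the
    # weighted vote sum_i W_ni(x) 1{Y_i = j}; ties go to the smallest j.
    weights = np.array([K(d(Xi, x) / h) for Xi in X], dtype=float)
    s = weights.sum()
    if s == 0.0:
        return 1                                   # empty neighborhood: arbitrary class
    weights /= s
    labels = np.asarray(labels)
    votes = np.array([weights[labels == j].sum() for j in range(1, n_classes + 1)])
    return int(1 + np.argmax(votes))
```

With the indicator kernel, this is simply a majority vote among the observations whose curve X_i lies within distance h of x.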

The convergences in probability and almost surely are established in the following theorem.

Theorem 3 If (6), (7) and (8) hold for µ-almost all x, then lim_{n→∞} L_n = L* in probability.

If in addition lim_{n→∞} nµ(B_{rh_n}^x)/log n = ∞ for µ-almost all x, then lim_{n→∞} L_n = L* almost surely.

4.2 Example of Wiener process

In the simple but interesting setting where X = C[0, 1], the set of all continuous real functions on [0, 1] equipped with the sup norm, let X = W be the standard Wiener process, so that µ = P_w is the Wiener measure. Define the set

S = {x ∈ C[0, 1] : x(0) = 0, x is absolutely continuous and ∫_0^1 x′²(t) dt < ∞},

where ′ stands for the derivative. Then (see Csaki, 1980),

exp(−(1/2) ∫_0^1 x′²(t) dt) P_w(B_h^0) ≤ P_w(B_h^x) ≤ P_w(B_h^0),

as h → 0 and x ∈ S. In addition P_w(B_h^0)/exp(−π²/(8h²)) → 1 as h → 0; see Bogachev (1998) (cf. also Lifshits (1995), section 18). Hence, for x ∈ S,

exp(−(1/2) ∫_0^1 x′²(t) dt) ≤ lim inf_{h→0} P_w(B_h^x)/exp(−π²/(8h²)) ≤ lim sup_{h→0} P_w(B_h^x)/exp(−π²/(8h²)) ≤ 1.

From this we deduce that, for h = (c log n)^{−1/2}, we have c(x) n^{−cπ²/8} ≤ P_w(B_h^x) ≤ n^{−cπ²/8} for large n. Together with corollary 1, this implies that if h = (c log n)^{−1/2} with c < 8/π², then

E(|m_n(x) − m(x)|^p) = O((log n)^{−τp/2}), p ≥ 2,

and an analogous a.s. rate holds if Y is bounded.
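The bandwidth calculation behind these rates can be made explicit; the sketch below assumes the two-sided small-ball bound stated above and takes r = 1 in (6) for simplicity:

```latex
% With h_n = (c\log n)^{-1/2} and x \in S,
P_w\bigl(B_{h_n}^x\bigr) \;\ge\; c(x)\, e^{-\pi^2/(8h_n^2)} \;=\; c(x)\, n^{-c\pi^2/8},
\qquad
\frac{n\, P_w(B_{h_n}^x)}{\log n} \;\ge\; \frac{c(x)\, n^{\,1-c\pi^2/8}}{\log n}
   \;\longrightarrow\; \infty \quad\text{when } c < 8/\pi^{2},
```

so (7) and the condition of theorem 2 hold, the variance terms (nµ(B_{h_n}^x))^{−p/2} and (log n/(nµ(B_{h_n}^x)))^{1/2} are polynomially small, and the bounds of corollaries 1 and 2 are dominated by the bias term h_n^{pτ} = (c log n)^{−pτ/2} (resp. h_n^τ).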

Note that for all x ∈ S and all a > 0 we have lim_{h→0} µ(B_h^x)/h^a = 0; hence the standard Wiener measure on (C[0, 1], ‖·‖_∞) has an infinite fractal dimension and therefore does not satisfy the condition of [12] and [13, 14] (cf. condition (4)).

5 Comments

1) If Y is unbounded but admits an exponential moment, E(e^{θ‖Y‖}) < ∞ for some θ > 0, then (11) takes place whenever h_n → 0, lim_{n→∞} nµ(B_{rh_n}^x)/log³ n = ∞ and m fulfills (8), and the log n in corollary 2 is replaced by (log n)³. We can assume only the existence of a polynomial moment, but in this case the log n in the condition and in the bound of corollary 2 becomes a power of n.

2) We think that the conditions h_n → 0 and lim_{n→∞} nµ(B_{rh_n}^x) = ∞ (resp. lim_{n→∞} nµ(B_{rh_n}^x)/log n = ∞) are necessary in theorem 1 (resp. theorem 2).

3) As we have mentioned, since (8) may hold for discontinuous regression functions m, all the results above can be valid for some discontinuous regression operators. In addition, if K(u) = 1I_{{|u|≤r}}, h_n → 0 and lim_{n→∞} nµ(B_{rh_n}^x) = ∞, then (8) is necessary.

4) The proofs we use allow us not to assume a particular structure for the distribution of X, as in [12, 13, 14]; they allow us to include processes having an infinite fractal dimension, for example the Wiener process.

5) (X, d) is a general semi-metric space; it may be, for example, any subset of the usual function spaces (e.g. a set of probability densities equipped with one of the usual metrics, or any subspace of functions which is not a linear space, obtained for example under some constraints, ...), or:

a. (P, d_pro), a set of probabilities defined on some measurable space, with d_pro the Prohorov metric.

b. (P_R, d_K), a set of probabilities defined on R endowed with its Borel σ-field, with d_K(P, Q) = sup_x |P(−∞, x] − Q(−∞, x]|, the Kolmogorov metric (see the sketch at the end of this section).

c. (D[0, 1], d_sko), the set of càdlàg functions on [0, 1], with d_sko the Skorohod metric.

6) It should be noted that the same results remain valid when Y takes values in a separable Banach space (of finite or infinite dimension).

7) By means of the approximation of dependent r.v.'s by independent ones, we can drop the independence hypothesis and replace it by β-mixing (see [26]).
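As a small illustration of point 5.b, here is a sketch of the Kolmogorov semi-metric between distribution functions evaluated on a common grid; it can be passed as the argument d of the hypothetical kernel_regression sketch given in the introduction.

```python
import numpy as np

def kolmogorov_metric(F1, F2):
    # Kolmogorov distance d_K(P, Q) = sup_x |F_P(x) - F_Q(x)| between two
    # distribution functions F1, F2 evaluated on the same grid of points.
    return float(np.max(np.abs(np.asarray(F1, dtype=float) - np.asarray(F2, dtype=float))))
```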

6 Proofs.

We apply, with slight modifications, proofs similar to those in Devroye (1981), except that we use our lemma 1 below instead of his lemma 2.1. We add some complements in the proof of strong consistency, and also in order to obtain a rate of convergence; but, for the sake of completeness, we present the proofs in their entirety. Inequality (15) below gives a sharper bound than Devroye's (1981) inequality 2.7.

Let us first introduce some notation needed in the sequel. According to the definition of the kernel estimate (1) and (2), we adopt the convention 0/0 = 0. Recall that B_h^x denotes the closed ball of radius h and center x, and define

ξ = 1I_{{d(X,x)≤rh_n}},   ξ_i = 1I_{{d(X_i,x)≤rh_n}}   and   N_n = Σ_{t=1}^n ξ_t.

The lemma below is needed to prove both the p-mean and the strong consistency.

Lemma 1 For every measurable f ≥ 0 and all x in X we have, for all n ≥ 1,

E(Σ_{i=1}^n W_ni(x) f(X_i) | ξ_1, . . . , ξ_n) ≤ (b/a)/µ(B_{rh_n}^x) ∫_{B_{rh_n}^x} f(w) dµ(w) 1I_{{N_n≠0}},   (14)

and

E(Σ_{i=1}^n W_ni(x) f(X_i)) ≤ (b/a)[1 − (1 − µ(B_{rh_n}^x))^n]/µ(B_{rh_n}^x) ∫_{B_{rh_n}^x} f(w) dµ(w).   (15)

Proof of lemma 1. Define first

U_1 = Σ_{i=1}^n f(X_i) ξ_i / Σ_{t=1}^n ξ_t.

It is easy to see, from the kernel condition (6) and the positivity of f, that

Σ_{i=1}^n W_ni(x) f(X_i) ≤ (b/a) U_1,

and remark that, since (X_1, ξ_1), . . . , (X_n, ξ_n) are i.i.d.,

ξ_i E(f(X_i)|ξ_i) = ξ_i E(f(X)|ξ = 1) = ξ_i (1/µ(B_{rh_n}^x)) E(f(X) 1I_{{d(X,x)≤rh_n}}),

and E(f(X_i)|ξ_1, . . . , ξ_n) = E(f(X_i)|ξ_i). Then, we have

E(U_1 | ξ_1, . . . , ξ_n, N_n ≠ 0) = Σ_{i=1}^n E(f(X_i)|ξ_i) ξ_i / Σ_{t=1}^n ξ_t = Σ_{i=1}^n E(f(X)|ξ = 1) ξ_i/N_n = E(f(X)|ξ = 1) = (1/µ(B_{rh_n}^x)) E(f(X) 1I_{{d(X,x)≤rh_n}}),

and E(U_1 | ξ_1, . . . , ξ_n, N_n = 0) = 0, which can be summarized as

E(U_1 | ξ_1, . . . , ξ_n) = (1/µ(B_{rh_n}^x)) E(f(X) 1I_{{d(X,x)≤rh_n}}) 1I_{{N_n≠0}}.

Taking now the expectation of this quantity, we obtain

E(U_1) = (1/µ(B_{rh_n}^x)) E(f(X) 1I_{{d(X,x)≤rh_n}}) P(N_n ≠ 0) = (1/µ(B_{rh_n}^x)) E(f(X) 1I_{{d(X,x)≤rh_n}}) [1 − (1 − µ(B_{rh_n}^x))^n].

This ends the proof of lemma 1.

Proof of theorem 1 and corollary 1. We use the same arguments as

those in Theorem 2.1 in [9].

If we apply successively Jensen’s inequality we obtain for all p ≥ 1,

31−pE

[∣∣∣∣∣n∑

i=1

Wni(x)Yi −m(x)

∣∣∣∣∣p]

≤ E

[∣∣∣∣∣n∑

i=1

Wni(x)(Yi −m(Xi))

∣∣∣∣∣p]

+ E

[n∑

i=1

Wni(x)|m(Xi)−m(x)|p]

+ |m(x)|pP (Nn = 0) (16)

By the positivity of the kernel and the independence of the (X_i)'s we have

P(N_n = 0) = [1 − µ(B_{rh_n}^x)]^n ≤ exp(−nµ(B_{rh_n}^x)) → 0,

because r is finite and nµ(B_{rh_n}^x) → ∞ for all x, so the last term of (16) tends to 0. For corollary 1, remark that exp(−nµ(B_{rh_n}^x)) = o(1/(nµ(B_{rh_n}^x))^{p/2}).

By Lemma 1, the second term on the right-hand side of (16) is not greater than

(b/a)/µ(B_{rh_n}^x) ∫_{B_{rh_n}^x} |m(w) − m(x)|^p dµ(w),

which goes to 0 for x such that m satisfies (8). For the rate of convergence, this quantity is bounded by b c_x (rh_n)^{pτ}/a = O(h_n^{pτ}) if m satisfies the p-mean Lipschitz condition (10) in a neighborhood of x.

We now show that the first term on the right-hand side of (16) tends to 0 for x such that (9) holds. We distinguish the cases p ≥ 2 and 1 ≤ p < 2.

Suppose first that p ≥ 2. Since (X_1, Y_1), . . . , (X_n, Y_n) are i.i.d. and W_ni(x) depends only upon (X_j)_j, the Marcinkiewicz and Zygmund (1937) inequalities (see also Petrov (1975) or Chow and Teicher (1997)), applied conditionally on (X_1, . . . , X_n), and Jensen's inequality yield that there exists a constant C = C(p) > 0, depending only upon p, such that

E[|Σ_{i=1}^n W_ni(x)(Y_i − m(X_i))|^p]
  ≤ C E[|Σ_{i=1}^n W_ni²(x)(Y_i − m(X_i))²|^{p/2}]
  ≤ C E[(sup_j W_nj(x))^{p/2} |Σ_{i=1}^n W_ni(x)(Y_i − m(X_i))²|^{p/2}]
  ≤ C E[(sup_j W_nj(x))^{p/2} Σ_{i=1}^n W_ni(x)|Y_i − m(X_i)|^p]
  = C n E[(sup_j W_nj(x))^{p/2} W_nn(x)|Y_n − m(X_n)|^p].   (17)

As above, using once more that the (X_i, Y_i) are i.i.d. and that W_ni(x) depends upon (X_j)_j only, we have

E[(sup_j W_nj(x))^{p/2} W_nn(x)|Y_n − m(X_n)|^p]
  = E[E[(sup_j W_nj(x))^{p/2} W_nn(x)|Y_n − m(X_n)|^p | X_1, . . . , X_n]]
  = E[(sup_j W_nj(x))^{p/2} W_nn(x) k(X_n)],   (18)

where k(w) = E[|Y − m(X)|^p | X = w].

Now define U = K(d(X_n, x)/h_n), u = E(U), V = Σ_{j=1}^{n−1} K(d(X_j, x)/h_n) and Z_{n−1} = min(1, b/V). Note that aµ(B_{rh_n}^x) ≤ u ≤ bµ(B_{rh_n}^x) and W_nj(x) = K(d(X_j, x)/h_n)/(U + V) ≤ Z_{n−1} for all j. Then we have

(sup_j W_nj(x))^{p/2} ≤ (Z_{n−1})^{p/2}.

This last bound and (18) give an estimate of (17),

E[|Σ_{i=1}^n W_ni(x)(Y_i − m(X_i))|^p] ≤ C n E[Z_{n−1}^{p/2} W_nn(x) k(X_n)].

It is clear that, for all c > 0, Z_{n−1}^{p/2} = Z_{n−1}^{p/2} 1I_{{V<c}} + Z_{n−1}^{p/2} 1I_{{V≥c}} ≤ 1I_{{V<c}} + (b/c)^{p/2}. Then, for c = (n − 1)u/2, we have

C n E[Z_{n−1}^{p/2} W_nn(x) k(X_n)] ≤ C n P(V < (n − 1)u/2) E(k(X_n) 1I_{{d(X_n,x)≤rh_n}}) + C n (2b/((n − 1)u))^{p/2} E(W_nn(x) k(X_n)).   (19)

Now we apply (15) to k(·) in order to obtain

n E(W_nn(x) k(X_n)) = E[Σ_{i=1}^n W_ni(x) k(X_i)] ≤ (b/a)/µ(B_{rh_n}^x) ∫_{B_{rh_n}^x} k(w) dµ(w).

Therefore, the last term of (19) can be bounded as follows:

C n (2b/((n − 1)u))^{p/2} E(W_nn(x) k(X_n)) ≤ 2^{p/2} C (b/a)^{1+p/2} / ([(n − 1)µ(B_{rh_n}^x)]^{p/2} µ(B_{rh_n}^x)) ∫_{B_{rh_n}^x} k(w) dµ(w),

which converges to 0 if nµ(B_{rh_n}^x) → ∞ and (9) is satisfied. Moreover, for corollary 1, when E[|Y − m(X)|^p | X ∈ B_{rh_n}^x] is bounded, this last term is O(1/(nµ(B_{rh_n}^x))^{p/2}).

It remains to show that the first term on the right-hand side of (19) goes to 0. With U and V defined above, and u = E(U) = E(V)/(n − 1) ≥ aµ(B_{rh_n}^x), put σ² = var(U). Because |U − EU| ≤ 2b and σ² ≤ bu, applying Bernstein's inequality for sums of bounded independent random variables (see Bennett, 1962), we obtain

P(V < (n − 1)u/2) = P(V − E(V) < −(n − 1)u/2) ≤ exp(−(n − 1)(u/2)²/(2σ² + 2bu/3)) ≤ exp(−3(n − 1)u/(28b)) ≤ exp(−c_1 nµ(B_{rh_n}^x)),

for n ≥ 2, where c_1 = 3a/(56b). Thus,

C n P(V < (n − 1)u/2) E(1I_{{d(X_n,x)≤rh_n}} k(X_n))   (20)
  ≤ C n exp(−c_1 nµ(B_{rh_n}^x)) ∫_{B_{rh_n}^x} k(w) dµ(w)
  = C nµ(B_{rh_n}^x) exp(−c_1 nµ(B_{rh_n}^x)) (1/µ(B_{rh_n}^x)) ∫_{B_{rh_n}^x} k(w) dµ(w).

Then (20) tends to 0 according to (9) and nµ(B_{rh_n}^x) → ∞; it is also o(1/(nµ(B_{rh_n}^x))^{p/2}) when E[|Y − m(X)|^p | X ∈ B_{rh_n}^x] is bounded. Thus, theorem 1 is proved when p ≥ 2, with the bound stated in corollary 1.

For 1 ≤ p < 2, we use the usual truncation technique. Let M be a positive number and define Y′ = Y 1I_{{|Y|≤M}} and Y″ = Y 1I_{{|Y|>M}}. Then theorem 1 is true for (X, Y′), m′(x) = E(Y′|X = x), and for every fixed M. It suffices to prove that the remainder term related to Y″ tends to 0, that is,

E[Σ_{i=1}^n W_ni(x)|Y″_i − m″(X_i)|^p] = E[Σ_{i=1}^n W_ni(x) g_M(X_i)]
  ≤ (b/a)/µ(B_{rh_n}^x) ∫_{B_{rh_n}^x} g_M(w) dµ(w)
  ≤ 2^p (b/a)/µ(B_{rh_n}^x) E|Y 1I_{{|Y|>M}}|^p,

where g_M(w) = E[|Y″ − m″(X)|^p | X = w]; the first inequality comes from lemma 1. The last term goes to 0 as M → ∞, for all x and h > 0. This ends the proof of theorem 1 and corollary 1.

To prove the almost sure convergence of the kernel estimate we need the following lemma on binomial random variables.

Lemma 2 (Devroye, 1981, Lemma 4.1) If N_n is a binomial random variable with parameters n and p = p_n such that np/log n → ∞, then

Σ_{n=1}^∞ E(exp(−sN_n)) < ∞, for all s > 0.

Proof of theorem 2 and corollary 2. We apply the same arguments as

those in theorem 4.2 in [9].

The conditional expectation in (12) is exactly ∫|m_n − m| dµ; it follows from (11), as in [9], by application of Fubini's theorem and the dominated convergence theorem (see Glick (1974)). To prove (11), recall first that ξ_i = 1I_{{d(X_i,x)≤rh_n}} and N_n = Σ_{t=1}^n ξ_t, and define

U_1(x) = (b/a) Σ_{i=1}^n |m(X_i) − m(x)| ξ_i / Σ_{t=1}^n ξ_t,

U_2(x) = |Σ_{i=1}^n W_ni(x)(Y_i − m(X_i))|.

Then we have the inequality

|m_n(x) − m(x)| ≤ U_2(x) + U_1(x) + |m(x)| 1I_{{N_n=0}}.   (21)

We have just seen in the proof of theorem 1 that

P(N_n = 0) = [1 − µ(B_{rh_n}^x)]^n ≤ exp(−nµ(B_{rh_n}^x)),

which is the general term of a series summable with respect to n when lim_{n→∞} nµ(B_{rh_n}^x)/log n > 1. So the last term in (21) is almost surely 0 for large n.

Let Z^x_{1,i} = |m(X_i) − m(x)| ξ_i and ξ^{(n)} = (ξ_1, . . . , ξ_n); then |Z^x_{1,i} − E_{ξ^{(n)}}(Z^x_{1,i})| ≤ 4Mξ_i and var(Z^x_{1,i} | ξ_1, . . . , ξ_n) ≤ 4M²ξ_i, where E_{ξ^{(n)}}(·) = E(· | ξ_1, . . . , ξ_n). Then, by Bernstein's inequality, for any ε > 0,

P(|U_1(x) − E_{ξ^{(n)}}(U_1(x))| > ε | ξ_1, . . . , ξ_n) ≤ P(|Σ_{i=1}^n (Z^x_{1,i} − E_{ξ^{(n)}}(Z^x_{1,i}))| > ε(a/b)N_n | ξ_1, . . . , ξ_n) ≤ 2 exp(−c_1 N_n),

where c_1 = a²ε²/(8bM(bM + aε/3)). But, from lemma 1, we get

E_{ξ^{(n)}}(U_1(x)) ≤ (b/a)/µ(B_{rh_n}^x) E(|m(X) − m(x)| 1I_{{d(X,x)≤rh_n}}) = (b/a)/µ(B_{rh_n}^x) ∫_{B_{rh_n}^x} |m(w) − m(x)| dµ(w),   (22)

which tends to 0 under (8). Thus, for large n, P(E_{ξ^{(n)}}(U_1(x)) > ε | ξ_1, . . . , ξ_n) = 0 and

P(U_1(x) > 2ε | ξ_1, . . . , ξ_n) ≤ 2 exp(−c_1 N_n).

Therefore, for all ε > 0,

P(U_1(x) > 2ε) ≤ 2 E{exp(−c_1 N_n)},

and since N_n is a binomial random variable with parameters n and p(x) = µ(B_{rh_n}^x) such that np(x)/log n → ∞ as n → ∞, the last term in the above inequality is summable with respect to n by Lemma 2. Hence U_1(x) → 0, a.s., by the Borel-Cantelli lemma.

Remark 7 If m were assumed to be continuous at x, U_1(x) could easily be shown to converge to 0 using the continuity: indeed, max_i |m(X_i) − m(x)| ξ_i ≤ ε for small h, and then |U_1(x)| ≤ (b/a)ε for large n.

The term U_2(x) is treated in a similar way. Let Z^x_{2,i} = (Y_i − m(X_i)) K(d(X_i, x)/h_n); then |Z^x_{2,i}| ≤ 2bMξ_i, E(Z^x_{2,i} | X_1, . . . , X_n) = 0 and var(Z^x_{2,i} | X_1, . . . , X_n) ≤ b²M²ξ_i. So, as above, we also have, by Bernstein's inequality, with c_2 = a²ε²/(2bM(bM + 2aε/3)),

P(|U_2(x)| > ε | X_1, . . . , X_n) ≤ P(|Σ_{i=1}^n Z^x_{2,i}| > εaN_n | X_1, . . . , X_n) ≤ 2 exp(−c_2 N_n),

and taking the expectation we obtain

P(|U_2(x)| > ε) ≤ 2 E(exp(−c_2 N_n)).   (23)

Now, from Lemma 2 and the Borel-Cantelli lemma, U_2(x) → 0, a.s. Gathering the convergences above, we deduce (11). This ends the proof of theorem 2.

For the rate of convergence, using the p-mean Lipschitz hypothesis (10) on m in a neighborhood of x, we have from (22)

E_{ξ^{(n)}}(U_1(x)) ≤ c(b/a)(rh_n)^τ,

so P(E_{ξ^{(n)}}(U_1(x)) > c(b/a)(rh_n)^τ | ξ_1, . . . , ξ_n) = 0. Take now ε_1 = c(b/a)(rh_n)^τ + C_1(nµ(B_{rh_n}^x)/log n)^{−1/2}, for an appropriate positive number C_1, and as before we obtain

P(U_1(x) > ε_1) ≤ 2 E{exp(−c_1 N_n)},   (24)

where c_1, defined above, corresponds to ε = C_1(nµ(B_{rh_n}^x)/log n)^{−1/2}. We remark (see the proof of lemma 4.1 in [9]) that E(exp(−θN_n)) ≤ exp(−θ′nµ(B_{rh_n}^x)), where θ′ = min(θ/2, 1/10); the Borel-Cantelli lemma then allows us to finish. Indeed, for n large enough, c′_1 = min(c_1/2, 1/10) = c_1/2, with c_1 corresponding to ε, both defined above, and (24) is bounded by 2 times

exp(−c′_1 nµ(B_{rh_n}^x)) ≤ exp(−a²C_1² log n / (16bM(bM + aC_1(nµ(B_{rh_n}^x)/log n)^{−1/2}/3))) ≤ exp(−C′_1 log n).

The last term is the general term of a series summable with respect to n for a suitable choice of C_1 ensuring that C′_1 > 1 (in fact C_1 > 4bM/a).

Remark 8 If m were assumed to be τ-Lipschitz in a neighborhood of x, U_1(x) could be treated directly, because max_i |m(X_i) − m(x)| ξ_i ≤ c(rh_n)^τ for small h and then U_1(x) ≤ c(b/a)(rh_n)^τ for large n.

To conclude, we show in the same way as before that U_2(x) = O((nµ(B_{rh_n}^x)/log n)^{−1/2}). For this purpose we choose ε_2 = C_2(nµ(B_{rh_n}^x)/log n)^{−1/2}, for an appropriate positive number C_2, in the exponential bound of (23). Then, for n large enough, c′_2 = min(c_2/2, 1/10) = c_2/2, with c_2 corresponding to ε_2, both defined above, and (23) is bounded by 2 times

exp(−c′_2 nµ(B_{rh_n}^x)) ≤ exp(−a²C_2² log n / (4bM(bM + 2aC_2(nµ(B_{rh_n}^x)/log n)^{−1/2}/3))) ≤ exp(−C′_2 log n),

which is the general term of a series summable with respect to n for a suitable choice of C_2 ensuring that C′_2 > 1 (in fact C_2 > 2bM/a).

Therefore we obtain, a.s., for large n,

U_1(x) ≤ c(b/a)(rh_n)^τ + (4bM/a)(nµ(B_{rh_n}^x)/log n)^{−1/2},

and

U_2(x) ≤ (2bM/a)(nµ(B_{rh_n}^x)/log n)^{−1/2}.

Proof of theorem 3. It is a simple application of theorem 2, since (13) implies that

0 ≤ L_n − L* ≤ 2 Σ_{j=1}^M E{|P(Y = j|X) − Σ_{i=1}^n W_ni(X) 1I_{{Y_i=j}}| | X_1, Y_1, . . . , X_n, Y_n};

see Devroye (1981) or Stone (1977).

References

[1] Bennett, G. (1962) Probability inequalities for sums of independent random variables. J. Amer. Statist. Assoc., 57, 33–45.

[2] Bogachev, V.I. (1998) Gaussian Measures. American Mathematical Society.

[3] Bosq, D. (1983) Sur la prédiction non paramétrique de variables aléatoires et mesures aléatoires. Zeit. Wahrs. Ver. Geb., 64, 541–553.

[4] Bosq, D. (2000) Linear Processes in Function Spaces: Theory and Applications. Lecture Notes in Statistics, 149, Springer.

[5] Bosq, D. and Delecroix, M. (1985) Nonparametric prediction of a Hilbert space valued random variable. Stoch. Proc. and their Appl., 19, 271–280.

[6] Chow, Y.S. and Teicher, H. (1997) Probability Theory: Independence, Interchangeability, Martingales. Springer, 3rd ed., New York.

[7] Csaki, E. (1980) A relation between Chung's and Strassen's laws of the iterated logarithm. Zeit. Wahrs. Ver. Geb., 19, 287–301.

[8] Dabo-Niang, S. (2002) Estimation de la densité dans un espace de dimension infinie : application aux diffusions. C. R. Acad. Sci. Paris I, 334, 213–216.

[9] Devroye, L. (1981) On the almost everywhere convergence of nonparametric regression function estimates. Ann. Statist., 9, 1310–1319.

[10] Devroye, L. and Wagner, T.J. (1980) Distribution-free consistency results in nonparametric discrimination and regression function estimation. Ann. Statist., 8, 231–239.

[11] Devroye, L., Gyorfi, L. and Lugosi, G. (1996) A Probabilistic Theory of Pattern Recognition. Springer, New York.

[12] Ferraty, F. and Vieu, P. (2000) Dimension fractale et estimation de la régression dans des espaces vectoriels semi-normés. C. R. Acad. Sci. Paris I, 330, 139–142.

[13] Ferraty, F., Goia, A. and Vieu, P. (2002a) Functional nonparametric model for time series: a fractal approach for dimension reduction. Test, to appear.

[14] Ferraty, F., Goia, A. and Vieu, P. (2002b) Régression non paramétrique pour des variables aléatoires fonctionnelles mélangeantes. C. R. Acad. Sci. Paris I, to appear.

[15] Gyorfi, L. (1978) On the rate of convergence of nearest neighbor rules. IEEE Transac. Information Th., IT-41, 509–512.

[16] Gyorfi, L. (1981) Recent results on nonparametric regression estimate and multiple classification. Problems Control Inform. Theory, 10, 43–52.

[17] Glick, N. (1974) Consistency conditions for probability estimators and integrals of density estimators. Utilitas Math., 6, 61–74.

[18] Greblicki, W., Krzyzak, A. and Pawlak, M. (1984) Distribution-free pointwise consistency of kernel regression estimate. Ann. Statist., 12, 1570–1575.

[19] Krzyzak, A. (1986) The rates of convergence of kernel regression estimates and classification rules. IEEE Transac. Inform. Theory, IT-32, 668–679.

[20] Kulkarni, S.R. and Posner, S.E. (1995) Rates of convergence of nearest neighbor estimation under arbitrary sampling. IEEE Transac. Information Th., 41, 1028–1039.

[21] Lifshits, M.A. (1995) Gaussian Random Functions. Kluwer Academic Publishers.

[22] Parthasarathy, K.R. (1967) Probability Measures on Metric Spaces. Academic Press.

[23] Petrov, V.V. (1975) Sums of Independent Random Variables. Springer, Berlin.

[24] Preiss, D. (1979) Gaussian measures and covering theorem. Comment. Math. Univ. Carolin., 20, 95–99.

[25] Ramsay, J.O. and Silverman, B.W. (1997) Functional Data Analysis. Springer, New York.

[26] Rhomari, N. (2001) Kernel regression estimation in Banach space under dependence. Preprint.

[27] Spiegelman, C. and Sacks, J. (1980) Consistent window estimation in nonparametric regression. Ann. Statist., 8, 240–246.

[28] Stone, C.J. (1977) Consistent nonparametric regression. Ann. Statist., 5, 595–645.

[29] Stone, C.J. (1980) Optimal rates of convergence for nonparametric estimators. Ann. Statist., 8, 1348–1360.

[30] Stone, C.J. (1982) Optimal global rates of convergence for nonparametric regression. Ann. Statist., 10, 1040–1053.

[31] Tiser, J. (1988) Differentiation theorem for Gaussian measures on Hilbert spaces. Trans. Amer. Math. Soc., 308, 655–666.

[32] Wheeden, R. and Zygmund, A. (1977) Measure and Integral. Marcel Dekker, New York.