NONPARAMETRIC REGRESSION ESTIMATION WHEN THE REGRESSOR TAKES ITS VALUES IN A METRIC SPACE∗
Sophie DABO-NIANG a and Noureddine RHOMARI a,b
a CREST, Timbre J340, 3, Av. Pierre Larousse, 92245 Malakoff cedex, France
[email protected]; [email protected]
b Université Mohammed I, Faculté des Sciences, 60 000 Oujda, Maroc
(1st version, September 2001)
AMS classification: 62G08; 62H30.
Key words and phrases: Kernel estimates; regression function; metric-valued random vectors; discrimination.
Abstract: We study a nonparametric regression estimator when the explanatory variable takes its values in a separable semi-metric space. We establish some asymptotic results and give upper bounds for the p-mean and (pointwise and integrated) almost sure estimation errors, under general conditions. We end with an application to discrimination in a semi-metric space and study, as an example, the case where the explanatory variable is the Wiener process in C[0, 1].
Résumé: In this work, we study the kernel estimator of the regression function when the explanatory variable takes its values in a separable semi-metric space. We establish its consistency in p-mean and almost surely (pointwise and integrated), under general conditions. We also give upper bounds for these estimation errors. We then apply these results to the discrimination of variables in a semi-metric space and study, as an example, the case where the explanatory variable is the Wiener process in C[0, 1].
∗Comments are welcome
1 Introduction
Let (X, Y) be a random vector in X × R with E(|Y|) < ∞, where (X, d) is a separable semi-metric space equipped with the semi-metric d. The distribution of (X, Y) is often unknown, and so is the regression function m(x) = E(Y|X = x), x ∈ X. We then want to estimate the regression operator m(x) by m_n(x), a function of x and of a sample (X_1, Y_1), . . . , (X_n, Y_n) of (X, Y). The present framework includes the classical finite dimensional case X = R^d, d ≥ 1, but also spaces of infinite dimension, such as the usual function spaces or sets of probabilities on some given measurable space. This problem has been widely studied when the variable X lies in a finite dimensional space, and there are many references on this topic, contrary to the infinite dimensional observation case.
In this work, X may be of infinite dimension; it can be, for example, a function space, which is an important case. Recently, the statistics of functional data has attracted growing interest. These questions in infinite dimension are particularly interesting, both for the fundamental problems they raise, see Bosq (2000), and for the many applications they may allow, see Ramsay and Silverman (1997). One possible application is when the (X_i)_i are random curves, for instance in C(R) or C[0, T]. Many phenomena in various areas (e.g. weather, medicine, . . .) are continuous in time and may or must be represented by curves. We may be interested either in forecasting or in classification.
For example, let us consider the weather data of some country. One can investigate to what extent total annual precipitation at weather stations can be predicted from the pattern of temperature variation through the year. Let Y be the logarithm of total annual precipitation at a weather station, and let X = X(t), t ∈ [0, T], be its temperature function, where the interval [0, T] is either [0, 12] or [0, 365] depending on whether monthly or daily data are used. For more examples, we refer to Ramsay and Silverman (1997) and the references therein.
The popular estimate of m(x), x ∈ X, which is a locally weighted average, is defined by the following expression

m_n(x) = Σ_{i=1}^n W_{ni}(x) Y_i,   (1)

where (W_{n1}(x), . . . , W_{nn}(x)) is a probability vector of weights and each W_{ni}(x) is a Borel measurable function of x, X_1, . . . , X_n. We will be interested in the kernel method

W_{ni}(x) = K(d(X_i, x)/h_n) / Σ_{j=1}^n K(d(X_j, x)/h_n),   (2)

if Σ_{j=1}^n K(d(X_j, x)/h_n) ≠ 0, and W_{ni}(x) = 0 otherwise, where (h_n)_n is a sequence of positive numbers and K is a nonnegative measurable function on R.
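As a purely illustrative sketch (not part of the paper), the estimator (1) with the kernel weights (2) can be written in a few lines. The uniform kernel below satisfies condition (6) with a = b = r = 1, the semi-metric d is passed as a function, and all names are hypothetical.

```python
import math

def kernel_regression(x, sample, d, h, K=lambda u: 1.0 if abs(u) <= 1.0 else 0.0):
    """Locally weighted average (1) with kernel weights (2).
    `sample` is a list of (X_i, Y_i) pairs, `d` a semi-metric on the
    X-space, `h` the window h_n, and `K` a nonnegative kernel on R
    (here the uniform kernel on [-1, 1])."""
    weights = [K(d(xi, x) / h) for xi, _ in sample]
    total = sum(weights)
    if total == 0.0:  # convention: W_ni(x) = 0 when the normalizer vanishes
        return 0.0
    return sum(w * yi for w, (_, yi) in zip(weights, sample)) / total

# Toy illustration: "curves" discretized on a grid with the sup-metric,
# mimicking an explanatory variable in C[0, 1].
d_sup = lambda f, g: max(abs(a - b) for a, b in zip(f, g))
grid = [t / 50 for t in range(51)]
sample = [([math.sin(2 * math.pi * t * (1 + 0.1 * i)) for t in grid],
           1.0 + 0.1 * i) for i in range(20)]
x0 = sample[3][0]
print(kernel_regression(x0, sample, d_sup, h=0.3))
```

For a very large window every observation falls in the ball and the estimate reduces to the overall mean of the Y_i; for a very small window only the nearest curve contributes.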
When (X_i, Y_i) ∈ R^d × R^k, the estimation of m was treated by many authors with various weights, including the nearest neighbor and kernel methods. For the references cited hereafter, we limit ourselves to the general case where no assumption is made on the existence of a density with respect to some reference measure. Stone (1977) showed that the nearest neighbor estimate is universally consistent, that is,

E(|m_n(X) − m(X)|^p) −→ 0 as n → ∞ whenever E(|Y|^p) < ∞, p ≥ 1,

and this for more general weight vectors. Devroye and Wagner (1980), and Spiegelman and Sacks (1980), extended the same result to the kernel estimate. Universal consistency results were also presented by Gyorfi (1981). Krzyzak (1986) gave rates of convergence in probability and almost surely for a wide class of kernel estimates; see also the references cited by these authors. Pointwise consistency was treated, among others, by Devroye (1981) for both the nearest neighbor and the kernel methods, and by Greblicki et al. (1984) for more general kernel estimates. For example, Devroye (1981) proved that m_n(x) is consistent in p-mean, that is,

lim_{n→∞} E(|m_n(x) − m(x)|^p) = 0,   (3)

for P_X-almost all x in R^d, whenever E(|Y|^p) < ∞, p ≥ 1, h_n → 0 and lim_{n→∞} n h_n^d = ∞, where P_X is the probability distribution of X. He also established its almost sure pointwise convergence and its almost sure integrated (w.r.t. P_X) consistency if, in addition, Y is bounded and lim_{n→∞} n h_n^d / log n = ∞.
In this paper we extend Devroye's (1981) results to the case where X takes values in a general separable semi-metric space, which may be of infinite dimension. We also give precise rates of convergence, which are optimal in the case of a finite dimensional X.
The literature on this topic in infinite dimension is relatively limited, to our knowledge. Bosq (1983) and Bosq and Delecroix (1985) deal with general kernel predictors for Hilbert-valued Markovian processes. In the case of independent observations, Kulkarni and Posner (1995) studied nearest neighbor estimation in a general separable metric space X. They gave a rate of convergence connected to metric covering numbers in X. The recent work of Ferraty and Vieu (2000) concerns the kernel method for X lying in a semi-normed vector space whose distribution has a finite fractal dimension (see condition (4) below). They proved that, when Y is bounded, the kernel estimate related to (1) and (2) converges almost surely to m(x), if m is continuous at x and if there exist two positive numbers a(x), c(x) such that

lim_{h→0} P(X ∈ B^x_h) / h^{a(x)} = c(x)   (4)

and h_n → 0 and lim_{n→∞} n h_n^{a(x)} / log n = ∞, where B^x_h denotes the closed ball of radius h and center x. Under similar conditions, Ferraty et al. (2002) extended this result to dependent observations and obtained an a.s. uniform convergence on compact sets, with a rate.
Here, the same statements, in the independent case, are established under general assumptions (see theorem 2 below). The conditions are connected with the probability of the explanatory variable X being in the ball B^x_{h_n}. We ask that lim_{n→∞} n P(X ∈ B^x_{h_n}) / log n = ∞, which is fulfilled for all random variables X and some class of sequences (h_n) converging to 0, and m may be discontinuous at x; hence we extend [12]'s results, and in the case of condition (4) we improve their rate of convergence. The proofs are different. A condition similar to (4) allows [13] to extend the study to dependent observations and to establish a.s. uniform consistency under a uniform-type version of (4). However, condition (4) excludes an interesting class of processes having infinite fractal dimension, like the Wiener process (see paragraph 4.2). The proof we use here can easily be extended to the dependent case via the coupling method, but it seems difficult to obtain the uniform consistency with it; the a.s. convergence remains valid for absolutely regular (β-mixing) processes (see [26]). Note that our conditions, written in terms of P(X ∈ B^x_{h_n}), unify the regression estimation with finite and infinite dimensional explanatory variables, and the results, especially the bounds on the estimation errors, highlight how the probability of small balls related to the explanatory variable influences the convergences.
Let µ be the probability distribution of X. In the case of the finite dimensional space X = R^d, the crucial tool needed to prove those results, e.g., in [9], [18] and [27], is the differentiation result for a probability measure, i.e. if f ∈ L^p(R^d, µ) = {f : (R^d, B_{R^d}, µ) −→ (R, B_R) : ‖f‖_p < ∞}, then

lim_{h→0} (1/µ(B^x_h)) ∫_{B^x_h} |f(z) − f(x)|^p dµ(z) = 0,   (5)
for µ-almost all x, p ≥ 1. This holds for all Borel locally finite measures on
Rd, in particular for probability measures. Unfortunately, that statement is
in general false when X is an infinite dimensional space, even for probability
measures, as proved by Preiss (1979). He constructs a Gaussian measure γ on a Hilbert space H and an integrable function f such that

lim_{s→0} inf { (1/γ(B^x_h)) ∫_{B^x_h} f dγ : x ∈ H, 0 < h < s } = ∞.
However, (5) remains valid, under some additional conditions, for some probability measures µ on a Hilbert space, as stated in Tiser's theorem below.

Theorem A (Tiser, 1988). Let H be a Hilbert space and let γ be a Gaussian measure whose covariance operator has the spectral representation Rx = Σ_i c_i ⟨x, e_i⟩ e_i, where (e_i) is an orthonormal system in H. Suppose c_{i+1} ≤ c_i i^{−α} for a given α > 5/2. Then, for all f ∈ L^p(H, γ), p > 1,

lim_{h→0} (1/γ(B^x_h)) ∫_{B^x_h} |f(z) − f(x)| dγ(z) = 0,   for γ-almost every x.
Remark 1
In the above Tiser's theorem, the convergence can be stated with |·|^{p′} in place of |·|, for all f ∈ L^p_loc(H, γ) and p′ < p (loc stands for locally integrable), that is

lim_{h→0} (1/γ(B^x_h)) ∫_{B^x_h} |f(z) − f(x)|^{p′} dγ(z) = 0,   for γ-almost every x.

For this, we use either his own proof, or the same idea as in the first part of the proof of Lemma 2.1 in Devroye (1981), or the same argument as in the proof of theorem 7.15, page 108 in [32].
Remark 2 It is obvious that (5) remains valid for all almost surely continuous functions f and all Borel locally finite measures on a semi-metric space. In particular, it holds at each point x of continuity of f, for all probability measures µ. That is why we prefer to work with general regression functions fulfilling condition (5).
In this work, we shall prove that (5) and E(|Y|^p) < ∞, for some p ≥ 1, imply (3), even for a random variable X with values in an infinite dimensional space X. We also show the pointwise and integrated (w.r.t. µ, i.e. the L^1(µ)-error ∫ |m_n − m| dµ) strong convergence of m_n when the r.v. Y is bounded, under general conditions (m may be discontinuous). We give the rate of convergence for each type of consistency. We then apply these results to the nonparametric discrimination of variables in a semi-metric space (e.g. the curve classification problem). As an example, we present the special case where µ is a Wiener measure and X = C[0, 1] equipped with the sup-norm. This example, which does not fulfill [12]'s condition (4), illustrates the extension and the contribution of this work; we can apply it to regression with an explanatory variable whose distribution has an infinite fractal dimension, and to the nonparametric classification of general curves, which is new, to our knowledge.
It is obvious that all the results also hold when Y takes values in R^k, k ≥ 1, thanks to the linearity of the estimator (1) with respect to the (Y_i)'s. Another work, dealing with Y in a separable Banach space, which may also be a curve, shows that the same results remain valid for regression estimation with a response variable in a Banach space. As mentioned above, we can also drop the independence condition and replace it by the absolutely regular mixing condition (see [26]).
The rest of the paper is organized as follows. We give the assumptions and some comments in section 2. Section 3 contains results on pointwise consistency in p-mean and strong consistency, followed by the study of the integrated estimation error in the special case of bounded Y. Section 4 is devoted to the application to the nonparametric discrimination problem, followed by the example of the Wiener measure and X = C[0, 1] equipped with the sup-norm. The proofs are postponed to the last section.
Remark 3 Let X′ be a separable semi-metric space, and let Φ : R −→ R and Ψ : X −→ X′ be measurable functions. It is clear that the results below can be applied to estimate m_{ΨΦ}(z) = E(Φ(Y)|Ψ(X) = z). The introduction of these functions allows us to cover and estimate various (functional) parameters related to the distribution of (X, Y), by varying them. For example:
i) If Φ = Id and Ψ = Id, then m_{IdId}(x) = E(Y|X = x) = m(x).
ii) Let A be a measurable set and Φ(Y) = 1I_A(Y); then m_{ΨΦ}(z) = P(Y ∈ A|Ψ(X) = z) corresponds to the conditional distribution of Y given Ψ(X).
Further, Ψ can be used when X is not separable, in order to fulfill the separability condition.
We have omitted these functions just to simplify the notations.
2 Assumptions
Let us assume that there exist positive numbers r, a and b such that

a 1I_{|u|≤r} ≤ K(u) ≤ b 1I_{|u|≤r},   (6)

and, for a fixed x in X, let B^x_h denote the closed ball of radius h and center x, and suppose

h_n → 0 and lim_{n→∞} n µ(B^x_{rh_n}) = ∞.   (7)

We assume also that, for some p ≥ 1,

lim_{h→0} (1/µ(B^x_h)) ∫_{B^x_h} |m(w) − m(x)|^p dµ(w) = 0,   (8)

and

E[|Y − m(X)|^p | X ∈ B^x_{rh_n}] = o([n µ(B^x_{rh_n})]^{p/2}).   (9)
Remark 4
1- (8) is satisfied at every point of continuity of m, for all µ.
2- (8) does not imply, in general, that m is continuous. Indeed, the Dirichlet function defined on [0, 1] by m(x) = 1I_{x rational in [0,1]} fulfills (8) with µ a probability measure absolutely continuous with respect to the Lebesgue measure on [0, 1], but it is nowhere continuous (example from Krzyzak (1986)); in infinite dimension, take instead of the rationals some countable dense set, and a continuous measure.
3- If m verifies (8) for p = 1 and is bounded in a neighborhood of x, then (8) holds for all p.
4- When Y is bounded, condition (9) is obviously satisfied.
5- It is clear that E[|Y − m(X)|^p | X ∈ B^x_{rh_n}] = (1/µ(B^x_{rh_n})) E[|Y − m(X)|^p 1I_{B^x_{rh_n}}(X)]; then condition (9) is satisfied, for example, when (7) holds and E[|Y − m(X)|^p | X ∈ B^x_{rh_n}] = O(1), or when lim_{n→∞} n (µ(B^x_{rh_n}))^{1+2/p} = ∞.
6- As noticed above, thanks to theorem A, (8) and (9) remain valid for all f ∈ L^{p′}(H, µ) and µ-almost all x in H, p′ > p ≥ 1, where H is a Hilbert space and µ a Gaussian measure as in theorem A; thus, for such a probability measure, (8) and (9) are fulfilled when E|Y|^{p′} < ∞. In this case theorem 1 below holds for µ-almost all x in H.
7- Recall that (8) and (9) hold for µ-almost all x and all µ in the finite dimensional space X = R^d, whenever E|Y|^p < ∞; see [9] or [32].

In the finite dimensional case X = R^d, we have h^d = O(µ(B^x_h)) for µ-almost all x and all probability measures µ on R^d (see [9] or [32]), so in this case condition (7) is implied by h_n → 0 and lim_{n→∞} n h_n^d = ∞. Thus the formulation of the assumptions and the bounds on the estimation errors in terms of the probability µ(B^x_h) unify the two cases of finite and infinite dimensional explanatory variables. It should be noted that the evaluation of µ(B^x_h), known as the probability of small balls, is very difficult in infinite dimension (see e.g. [21]).
3 Main results
3.1 p-mean consistency
Theorem 1 If E(|Y|^p) < ∞ and if (6), (7) and (9) hold, then

E(|m_n(x) − m(x)|^p) −→ 0 as n → ∞,

for all x in X such that m satisfies (8).
If µ is discrete, the result of theorem 1 holds for all points in the support of µ whenever E(|Y|^p) < ∞ and h_n → 0.
Rate of p-mean convergence The next result gives a rate of p-mean convergence, p ≥ 2. We therefore need to reinforce condition (8) in order to evaluate the bias sharply. We suppose that m is "p-mean Lipschitz" in a neighborhood of x, with parameters 0 < τ = τ_x ≤ 1 and c = c_x > 0, that is,

(1/µ(B^x_h)) ∫_{B^x_h} |m(w) − m(x)|^p dµ(w) ≤ c_x h^{pτ}, as h → 0.   (10)
It is obvious that (10) is satisfied if m is Lipschitz with parameters τ = τ_x and c_x, but (10) is weaker than the Lipschitz condition and does not imply, in general, even the continuity of m at x (see the example in remark 4). A similar assumption was used by Krzyzak (1986).
Corollary 1 If E(|Y|^p) < ∞, E(|Y − m(X)|^p | X ∈ B^x_{rh_n}) = O(1) and if (6), (7), (10) are satisfied, with p ≥ 2, then we have

E(|m_n(x) − m(x)|^p) = O( h_n^{pτ} + (1/(n µ(B^x_{rh_n})))^{p/2} ).

Without the assumption of existence of probability densities, this statement in finite dimension seems to be new for an unbounded variable Y.
Remark 5
1. If we assume the condition lim sup_{h→0} µ(B^x_h)/h^{a(x)} < ∞, for some a(x) > 0, which is weaker than (4), we have, for h = c n^{−1/(2τ+a(x))},

E(|m_n(x) − m(x)|^p) = O(n^{−pτ/(2τ+a(x))}), p ≥ 2.

2. In the finite dimensional case, X = R^d, the best rate derived from the bound in corollary 1 is optimal (cf. [29]), because h^d = O(µ(B^x_h)) for µ-almost all x and all probability measures µ on R^d; see lemma 2.2 of [9] and its proof, or [32, p. 189]. In this case, we obtain a rate similar to the one above, with d in place of a(x).
3. In R^d, Spiegelman and Sacks (1980) obtained the same rate for E(|m_n(X) − m(X)|^2) in the case p = 2, τ = 1 and bounded Y. Krzyzak (1986) obtained, for a more general kernel, rates in probability and a.s. convergence under condition (10) with an unbounded response variable Y.
3.2 Strong consistency
Below, we give the pointwise and µ-integrated almost sure convergence results for the kernel estimate, with their rates of convergence. In order to simplify the proofs, we assume that |Y| ≤ M < ∞ a.s. With truncation techniques, however, the results can be extended to unbounded Y; see the comments below.
Theorem 2 If Y is bounded, (6) holds, h_n → 0 and lim_{n→∞} n µ(B^x_{rh_n}) / log n = ∞, then

|m_n(x) − m(x)| −→ 0 a.s., as n → ∞,   (11)

for all x in X such that m satisfies (8).
If in addition (8) holds for µ-almost all x,

lim_{n→∞} E(|m_n(X) − m(X)| | X_1, Y_1, . . . , X_n, Y_n) = 0 a.s.   (12)
Rate of a.s. convergence We now establish the rate of strong consistency.

Corollary 2 Assume |Y| ≤ M. If (6) and (10), for p = 1, hold, and if h_n → 0 and lim_{n→∞} n µ(B^x_{rh_n}) / log n = ∞, then we have

|m_n(x) − m(x)| = O( h_n^τ + (log n / (n µ(B^x_{rh_n})))^{1/2} ) a.s.

In fact, we obtain more precisely, for large n, a.s.,

|m_n(x) − m(x)| ≤ c_x (b/a) (r h_n)^τ + 6M (b/a) (n µ(B^x_{rh_n}) / log n)^{−1/2},

and the 6M can be replaced by some almost sure bound of

2 max( (E[|m(X) − m(x)|^2 | X ∈ B^x_{rh_n}])^{1/2}, 2 (E[|Y − m(X)|^2 | X])^{1/2} 1I_{B^x_{rh_n}}(X) ).

But if m is Lipschitz in a neighborhood of x, with parameters 0 < τ = τ_x ≤ 1 and c_x > 0, this bound becomes

|m_n(x) − m(x)| ≤ c_x (b/a) (r h_n)^τ + 2M (b/a) (n µ(B^x_{rh_n}) / log n)^{−1/2},

and the 2M can be replaced by some almost sure bound of

2 (E[|Y − m(X)|^2 | X])^{1/2} 1I_{B^x_{rh_n}}(X).

We can relax the boundedness of Y by assuming that it satisfies the Cramer condition; in that case the log n above is replaced by (log n)^3, see the comments below.
Remark 6
1. If the distribution of X satisfies the additional condition lim sup_{h→0} µ(B^x_h)/h^{a(x)} < ∞, for some a(x) > 0 (weaker than (4)), and m is continuous at x, then (7) and (8) are fulfilled, and therefore we recover theorem 1 of Ferraty and Vieu (2000). If in addition (10) holds for p = 1, then we find, for h = c (log n / n)^{1/(2τ+a(x))}, that

|m_n(x) − m(x)| = O((log n / n)^{τ/(2τ+a(x))}) a.s.,

which improves the bound given by Ferraty and Vieu (2000) for Lipschitz m, obtained under an additional condition on the expansion of the limit in (4). They assumed that there exists a positive real b(x) such that µ(B^x_h) = h^{a(x)} c(x) + O(h^{a(x)+b(x)}) and obtained the rate O((n/ log n)^{−γ(x)/(2γ(x)+a(x))}), where γ(x) = min{b(x), τ}.
2. In finite dimension, X = R^d, Krzyzak (1986) obtained, for a slightly more general kernel, similar rates in probability and almost sure convergence in the bounded case, but weaker ones for unbounded Y.
3.3 How to choose the window h?
It is well known that the performance of the kernel estimate depends on the choice of the window parameter h. The bound in corollary 2 is simple and easy to compute, so it allows us to choose the window parameter h that minimizes that bound. We can also use other methods, such as cross-validation.
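A minimal leave-one-out cross-validation sketch for choosing h, assuming the uniform kernel 1I_{|u|≤1}; the helper names are hypothetical and not part of the paper.

```python
def loo_cv_score(h, sample, d):
    """Leave-one-out mean squared prediction error of the kernel
    estimate with the uniform kernel K(u) = 1I_{|u|<=1} and window h."""
    err = 0.0
    for i, (xi, yi) in enumerate(sample):
        # uniform kernel: the estimate is the mean over the ball B^{x_i}_h
        neigh = [yj for j, (xj, yj) in enumerate(sample)
                 if j != i and d(xj, xi) <= h]
        pred = sum(neigh) / len(neigh) if neigh else 0.0
        err += (yi - pred) ** 2
    return err / len(sample)

def select_window(sample, d, candidates):
    """Pick the h among `candidates` minimizing the CV score."""
    return min(candidates, key=lambda h: loo_cv_score(h, sample, d))
```

On a noiseless linear toy sample, a small window is correctly preferred over windows so large that the estimate degenerates to the global mean.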
4 Applications
4.1 Discrimination.
Let O be an object having an observed variable X in some space X and an unknown nature-class Y in some discrete set.
Classical discrimination is about predicting the unknown nature Y of an object O from a d-dimensional observation X. The unknown nature Y of the object is called a class and takes values in a finite set {1, . . . , M}. One constructs a function g, called a classifier, defined on X and taking values in {1, . . . , M}, which represents one's guess g(X) of Y given X. We err on Y if g(X) ≠ Y, and the probability of error of a classifier g is L(g) = P{g(X) ≠ Y}. The question is how to build this function g, and what is the best manner to determine the class of this object? In discrimination, the problem is to classify the object O, i.e. to decide on the value of Y from X and a sample of this pair of variables by an empirical rule g_n, a measurable function from X × (X × {1, . . . , M})^n into {1, . . . , M}; its probability of error is P{g_n(X) ≠ Y | X_1, Y_1, . . . , X_n, Y_n}, where (X_1, Y_1), . . . , (X_n, Y_n) are observations from (X, Y).
It is well known that the Bayes classifier, defined by

g∗ = arg min_{g : X → {1,...,M}} P{g(X) ≠ Y},

is the best possible classifier with respect to the probability of misclassification (see e.g. [11]). Note that g∗ depends on the distribution of (X, Y), which is unknown. However, we can build an estimator g_n of the classifier based on a sample of observations.
Here we consider an observation X belonging to some separable semi-metric space X, with an unknown class Y. X may be of infinite dimension, e.g. a curve. We want to predict this class from X and a sample of this pair of variables; for example, to classify objects with curve parameters (e.g. determine the risk, Y, of patients according to an electrocardiogram curve); in this case the (X_i) take values in some function space, for instance C(R).
In this section we extend Devroye's results to the setting of X belonging to a semi-metric space X. This space may be of infinite dimension, e.g. a function space, as we have mentioned.
Let X be a random variable with values in X and let Y, taking values in {1, . . . , M}, be estimated from X and (X_1, Y_1), . . . , (X_n, Y_n) by g_n(X). The quality of this estimate is measured by the probability of error

L_n = P{g_n(X) ≠ Y | X_1, Y_1, . . . , X_n, Y_n},

and it is clear that

L_n ≥ L∗ = inf_{g : X → {1,...,M}} P{g(X) ≠ Y},

where L∗ is the Bayes probability of error. The Bayes classifier g∗ can be approximated by g_n chosen so that

Σ_{i=1}^n W_{ni}(x) 1I_{Y_i = g_n(x)} = max_{1≤j≤M} Σ_{i=1}^n W_{ni}(x) 1I_{Y_i = j}.   (13)

Such a classifier g_n (not necessarily uniquely determined) is called an approximate Bayes classifier.
The convergences in probability and almost surely are established in the following theorem.

Theorem 3 If (6), (7) and (8) hold for µ-almost all x, then lim_{n→∞} L_n = L∗ in probability.
If in addition lim_{n→∞} n µ(B^x_{rh_n}) / log n = ∞ for µ-almost all x, then lim_{n→∞} L_n = L∗ almost surely.
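For the uniform kernel, the rule (13) reduces to a majority vote among the sample points falling in the ball around x, since the common normalizer of the weights W_{ni}(x) cancels when comparing classes. A minimal sketch, with hypothetical names and the uniform kernel assumed:

```python
def classify(x, sample, d, h, n_classes):
    """Approximate Bayes rule (13): pick the class j maximizing the
    kernel-weighted vote sum_i W_ni(x) 1I{Y_i = j}; with the uniform
    kernel K(u) = 1I_{|u|<=1} this is a majority vote in the ball."""
    votes = [0.0] * (n_classes + 1)  # classes are labeled 1..n_classes
    for xi, yi in sample:
        if d(xi, x) <= h:            # uniform-kernel weight (up to normalizer)
            votes[yi] += 1.0
    return max(range(1, n_classes + 1), key=lambda j: votes[j])
```

With curve-valued X_i one would simply pass a semi-metric d on curves, e.g. the sup-metric, in place of a Euclidean distance.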
4.2 Example of Wiener process
In the simple but interesting setting X = C[0, 1], the set of all continuous real functions defined on [0, 1] equipped with the sup norm, let X = W be the standard Wiener process. Hence µ = P_w is the Wiener measure. Define the set

S = {x ∈ C[0, 1] : x(0) = 0, x is absolutely continuous and ∫_0^1 x′^2(t) dt < ∞},

where ′ stands for the derivative. Then (see Csaki, 1980),

exp( −(1/2) ∫_0^1 x′^2(t) dt ) P_w(B^0_h) ≤ P_w(B^x_h) ≤ P_w(B^0_h),

as h → 0 and x ∈ S. In addition, P_w(B^0_h) / exp(−π^2/(8h^2)) → 1 as h → 0, see Bogachev (1999) (cf. also Lifshits (1995), section 18). Hence, for x ∈ S,

exp( −(1/2) ∫_0^1 x′^2(t) dt ) ≤ lim inf_{h→0} P_w(B^x_h) / exp(−π^2/(8h^2)) ≤ lim sup_{h→0} P_w(B^x_h) / exp(−π^2/(8h^2)) ≤ 1.

From this we deduce that, for h = (c log n)^{−1/2}, we have c(x) n^{−cπ^2/8} ≤ P_w(B^x_h) ≤ n^{−cπ^2/8} for large n. With corollary 1, this implies that, if h = (c log n)^{−1/2} with c < 8/π^2, then

E(|m_n(x) − m(x)|^p) = O((log n)^{−τp/2}), p ≥ 2,

and also a.s. if Y is bounded.
Note that for all x ∈ S and all a > 0 we have lim_{h→0} µ(B^x_h)/h^a = 0; hence the standard Wiener measure on (C[0, 1], ‖·‖_∞) has an infinite fractal dimension and thus does not verify the condition of [12] and [13, 14] (cf. condition (4)).
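The asymptotics P_w(B^0_h) ≈ exp(−π^2/(8h^2)) can be probed by simulation. The sketch below, which is an illustration only and not part of the proofs, approximates the Wiener process by a Gaussian random walk on a grid, so it slightly overestimates the small-ball probability (excursions between grid points are missed).

```python
import math
import random

def small_ball_probability(h, steps=150, paths=3000, seed=0):
    """Monte Carlo estimate of P(sup_{t<=1} |W(t)| <= h) for a standard
    Wiener process W, approximated by a Gaussian random walk."""
    rng = random.Random(seed)
    dt = 1.0 / steps
    inside = 0
    for _ in range(paths):
        w, stays = 0.0, True
        for _ in range(steps):
            w += rng.gauss(0.0, math.sqrt(dt))
            if abs(w) > h:
                stays = False
                break
        inside += stays
    return inside / paths

# Theory predicts P_w(B^0_h) ~ exp(-pi^2 / (8 h^2)) as h -> 0.
for h in (1.0, 0.75, 0.5):
    est = small_ball_probability(h)
    asym = math.exp(-math.pi ** 2 / (8 * h ** 2))
    print(f"h={h}  MC estimate={est:.4f}  exp(-pi^2/(8h^2))={asym:.4f}")
```

The estimates decay rapidly as h shrinks, in line with the exponential small-ball rate that drives the slow logarithmic convergence rate above.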
5 Comments
1) If Y is unbounded but admits an exponential moment E(e^{θ‖Y‖}) < ∞, for some θ > 0, then (11) still holds whenever h_n → 0, lim_{n→∞} n µ(B^x_{rh_n}) / (log n)^3 = ∞ and m fulfills (8), and the log n in corollary 2 is replaced by (log n)^3. We can assume only the existence of a polynomial moment, but in this case the log n in the condition and in the bound of corollary 2 becomes a power of n.
2) We think that the conditions h_n → 0 and lim_{n→∞} n µ(B^x_{rh_n}) = ∞ (resp. lim_{n→∞} n µ(B^x_{rh_n}) / log n = ∞) are necessary in theorem 1 (resp. theorem 2).
3) As we have mentioned, since (8) may hold for discontinuous regression functions m, all the results above can be valid for some discontinuous regression operators. In addition, if K(u) = 1I_{|u|≤r}, and if h_n → 0 and lim_{n→∞} n µ(B^x_{rh_n}) = ∞, then (8) is necessary.
4) The proofs we use allow us not to assume a particular structure for the distribution of X, as in [12, 13, 14]; they allow us to include processes having infinite fractal dimension, for example the Wiener process.
5) (X, d) is a general semi-metric space; it may be, for example, any subset of the usual function spaces (e.g. a set of probability densities equipped with the usual metrics, or any subspace of functions which is not a linear space, obtained for example under some constraints), or:
a. (P, d_pro), a set of probabilities defined on some measurable space, with d_pro the Prohorov metric.
b. (P_R, d_K), a set of probabilities defined on R endowed with its Borel σ-field, with d_K(P, Q) = sup_x |P(−∞, x] − Q(−∞, x]| the Kolmogorov metric.
c. (D[0, 1], d_sko), the set of cadlag functions on [0, 1], with d_sko the Skorohod metric.
6) It should be noted that the same results remain valid when Y takes values in a separable Banach space (of finite or infinite dimension).
7) By means of approximation of dependent r.v.'s by independent ones, we can drop the independence hypothesis and replace it by β-mixing (see [26]).
6 Proofs.
We apply, with slight modifications, proofs similar to those in Devroye (1981), except that we use our lemma 1 below instead of his lemma 2.1. We add some complements in the proof of strong consistency, and also in order to obtain a rate of convergence. For the sake of completeness we present the proofs in their entirety. Inequality (15) below gives a sharper bound than Devroye's (1981) inequality 2.7.
Let us first introduce some notations needed in the sequel. According to the definition of the kernel estimate (1) and (2), we adopt the convention 0/0 = 0. Recall that B^x_h denotes the closed ball of radius h and center x, and define

ξ = 1I_{d(X,x)≤rh_n}, ξ_i = 1I_{d(X_i,x)≤rh_n} and N_n = Σ_{t=1}^n ξ_t.

The lemma below is needed to prove the consistency in p-mean as well as the strong one.
Lemma 1 For all measurable f ≥ 0 and all x in X we have, for all n ≥ 1,

E( Σ_{i=1}^n W_{ni}(x) f(X_i) | ξ_1, . . . , ξ_n ) ≤ ((b/a)/µ(B^x_{rh_n})) ∫_{B^x_{rh_n}} f(w) dµ(w) 1I_{N≠0},   (14)

and

E( Σ_{i=1}^n W_{ni}(x) f(X_i) ) ≤ ((b/a)[1 − (1 − µ(B^x_{rh_n}))^n]/µ(B^x_{rh_n})) ∫_{B^x_{rh_n}} f(w) dµ(w).   (15)
Proof of lemma 1. Define first

U_1 = Σ_{i=1}^n f(X_i) ξ_i / Σ_{t=1}^n ξ_t.

From the kernel condition (6) and the positivity of f it is easy to see that

Σ_{i=1}^n W_{ni}(x) f(X_i) ≤ (b/a) U_1,

and remark that, the (X_1, ξ_1), . . . , (X_n, ξ_n) being i.i.d.,

ξ_i E(f(X_i)|ξ_i) = ξ_i E(f(X)|ξ = 1) = ξ_i (1/µ(B^x_{rh_n})) E( f(X) 1I_{d(X,x)≤rh_n} ),

and E(f(X_i)|ξ_1, . . . , ξ_n) = E(f(X_i)|ξ_i). Then we have

E(U_1|ξ_1, . . . , ξ_n, N_n ≠ 0) = Σ_{i=1}^n E(f(X_i)|ξ_i) ξ_i / Σ_{t=1}^n ξ_t
  = Σ_{i=1}^n E(f(X)|ξ = 1) ξ_i / N_n
  = E(f(X)|ξ = 1)
  = (1/µ(B^x_{rh_n})) E( f(X) 1I_{d(X,x)≤rh_n} ),

and E(U_1|ξ_1, . . . , ξ_n, N = 0) = 0, which can be summarized as

E(U_1|ξ_1, . . . , ξ_n) = (1/µ(B^x_{rh_n})) E( f(X) 1I_{d(X,x)≤rh_n} ) 1I_{N≠0}.

Taking now the expectation of this quantity, we obtain

E(U_1) = (1/µ(B^x_{rh_n})) E( f(X) 1I_{d(X,x)≤rh_n} ) P(N ≠ 0)
  = (1/µ(B^x_{rh_n})) E( f(X) 1I_{d(X,x)≤rh_n} ) [1 − (1 − µ(B^x_{rh_n}))^n].

This ends the proof of lemma 1.
Proof of theorem 1 and corollary 1. We use the same arguments as those of theorem 2.1 in [9].
Applying Jensen's inequality successively, we obtain for all p ≥ 1,

3^{1−p} E[ | Σ_{i=1}^n W_{ni}(x) Y_i − m(x) |^p ]
  ≤ E[ | Σ_{i=1}^n W_{ni}(x)(Y_i − m(X_i)) |^p ]
  + E[ Σ_{i=1}^n W_{ni}(x) |m(X_i) − m(x)|^p ]
  + |m(x)|^p P(N_n = 0).   (16)

By the positivity of the kernel and the independence of the (X_i)'s we have

P(N_n = 0) = [1 − µ(B^x_{rh_n})]^n ≤ exp(−n µ(B^x_{rh_n})) −→ 0,

because r is finite and n µ(B^x_{rh_n}) → ∞ for all x, so the last term of (16) tends to 0. For corollary 1, remark that exp(−n µ(B^x_{rh_n})) = o([n µ(B^x_{rh_n})]^{−p/2}).
By lemma 1, the second term on the right side of (16) is not greater than

((b/a)/µ(B^x_{rh_n})) ∫_{B^x_{rh_n}} |m(w) − m(x)|^p dµ(w),

which goes to 0 for x such that m satisfies (8). As for the rate of convergence, this quantity is bounded by b c (r h_n)^{pτ}/a = O(h_n^{pτ}) if m satisfies the p-mean Lipschitz condition (10) in a neighborhood of x.
We now show that the first term on the right side of (16) tends to 0 for x such that (9) holds. We distinguish the cases p ≥ 2 and 1 ≤ p < 2.
Suppose first that p ≥ 2. Since (X_1, Y_1), . . . , (X_n, Y_n) are i.i.d. and W_{ni}(x) depends only upon the (X_j)_j, the inequality of Marcinkiewicz and Zygmund (1937) (see also Petrov (1975) or Chow and Teicher (1997)), applied conditionally on (X_1, . . . , X_n), together with Jensen's inequality, yields a constant C = C(p) > 0 depending only upon p such that

E[ | Σ_{i=1}^n W_{ni}(x)(Y_i − m(X_i)) |^p ]
  ≤ C E[ | Σ_{i=1}^n W_{ni}^2(x)(Y_i − m(X_i))^2 |^{p/2} ]
  ≤ C E[ (sup_j W_{nj}(x))^{p/2} | Σ_{i=1}^n W_{ni}(x)(Y_i − m(X_i))^2 |^{p/2} ]
  ≤ C E[ (sup_j W_{nj}(x))^{p/2} Σ_{i=1}^n W_{ni}(x) |Y_i − m(X_i)|^p ]
  = C n E[ (sup_j W_{nj}(x))^{p/2} W_{nn}(x) |Y_n − m(X_n)|^p ].   (17)
As above, using once more the fact that the (X_i, Y_i) are i.i.d. and that W_{ni}(x) depends upon the (X_j)_j only, we have

E[ (sup_j W_{nj}(x))^{p/2} W_{nn}(x) |Y_n − m(X_n)|^p ]
  = E[ E[ (sup_j W_{nj}(x))^{p/2} W_{nn}(x) |Y_n − m(X_n)|^p | X_1, . . . , X_n ] ]
  = E[ (sup_j W_{nj}(x))^{p/2} W_{nn}(x) k(X_n) ],   (18)

where k(w) = E[|Y − m(X)|^p | X = w].
Now define U = K(d(X_n, x)/h_n), u = E(U), V = Σ_{j=1}^{n−1} K(d(X_j, x)/h_n) and Z_{n−1} = min(1, b/V). Note that a µ(B^x_{rh_n}) ≤ u ≤ b µ(B^x_{rh_n}) and W_{nj}(x) = K(d(X_j, x)/h_n)/(U + V) ≤ Z_{n−1} for all j. Then we have

(sup_j W_{nj}(x))^{p/2} ≤ (Z_{n−1})^{p/2}.
So, this last bound and (18) give an estimate of (17):

E[ | Σ_{i=1}^n W_{ni}(x)(Y_i − m(X_i)) |^p ] ≤ C n E[ Z_{n−1}^{p/2} W_{nn}(x) k(X_n) ].

It is clear that for all c > 0, Z_{n−1}^{p/2} = Z_{n−1}^{p/2} 1I_{V<c} + Z_{n−1}^{p/2} 1I_{V≥c} ≤ 1I_{V<c} + (b/c)^{p/2}. Then, for c = (n − 1)u/2, we have

C n E[ Z_{n−1}^{p/2} W_{nn}(x) k(X_n) ] ≤ C n P(V < (n − 1)u/2) E( k(X_n) 1I_{d(X_n,x)≤rh_n} )
  + C n (2b/((n − 1)u))^{p/2} E(W_{nn}(x) k(X_n)).   (19)

Now we apply (15) to k(·) in order to obtain

n E(W_{nn}(x) k(X_n)) = E[ Σ_{i=1}^n W_{ni}(x) k(X_i) ] ≤ ((b/a)/µ(B^x_{rh_n})) ∫_{B^x_{rh_n}} k(w) dµ(w).

Therefore, the last term of (19) can be bounded by

C n (2b/((n − 1)u))^{p/2} E(W_{nn}(x) k(X_n))
  ≤ (2^{p/2} C (b/a)^{1+p/2} / ([(n − 1) µ(B^x_{rh_n})]^{p/2} µ(B^x_{rh_n}))) ∫_{B^x_{rh_n}} k(w) dµ(w),

which converges to 0 if n µ(B^x_{rh_n}) → ∞ and (9) is satisfied. And, for corollary 1, when E[|Y − m(X)|^p | X ∈ B^x_{rh_n}] is bounded, this last term is O(1/(n µ(B^x_{rh_n}))^{p/2}).
It remains to show that the second term of (19) goes to 0. With U and V
defined above, and u = E(U) = E(V )/(n− 1) ≥ aµ(Bnrhn
), put σ2 = var(U).
Because |U−EU | ≤ 2b and σ2 ≤ bu, applying Bernstein’s inequality, for sums
of bounded independent random variables (see Bennett, 1962), we obtain
P (V < (n− 1)u/2) = P (V − E(V ) < −(n− 1)u/2)
≤ exp
(−(n− 1)(u/2)2
2σ2 + 2bu/3
)≤ exp(−3(n− 1)u/28b)
≤ exp(−c1nµ(Bxrhn
)),
for n ≥ 2, where c1 = 3a/56b. Thus,
C n P(V < (n − 1)u/2) E( 1I_{d(X_n,x)≤rh_n} k(X_n) )     (20)
  ≤ C n exp( −c_1 nµ(B^x_{rh_n}) ) ∫_{B^x_{rh_n}} k(w) dµ(w)
  = C nµ(B^x_{rh_n}) exp( −c_1 nµ(B^x_{rh_n}) ) (1/µ(B^x_{rh_n})) ∫_{B^x_{rh_n}} k(w) dµ(w).

Then (20) tends to 0 according to (9) and nµ(B^x_{rh_n}) → ∞; and when E[ |Y − m(X)|^p | X ∈ B^x_{rh_n} ] is bounded, it is also o( 1/(nµ(B^x_{rh_n}))^{p/2} ). Thus Theorem 1 is proved when p ≥ 2, with the bound stated in Corollary 1.
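The Bernstein step above can be checked numerically. The sketch below uses a toy setting of our own, not the paper's: V is a sum of n − 1 i.i.d. variables bounded in [0, b], here U_j ~ Uniform(0, 1), so b = 1, u = E(U) = 1/2 and var(U) = 1/12 ≤ bu. It compares the empirical probability P(V < (n − 1)u/2) with the bound exp(−(n − 1)(u/2)²/(2σ² + bu/3)).

```python
import math
import random

random.seed(0)

# Toy parameters (ours, not the paper's): n - 1 i.i.d. Uniform(0, 1) summands.
n, b = 20, 1.0
u, sigma2 = 0.5, 1.0 / 12.0
threshold = (n - 1) * u / 2.0

# Monte Carlo estimate of P(V < (n - 1) u / 2).
reps = 100_000
hits = sum(
    1 for _ in range(reps)
    if sum(random.random() for _ in range(n - 1)) < threshold
)
empirical = hits / reps

# Bernstein bound exp(-(n-1)(u/2)^2 / (2 sigma^2 + b u / 3)), as in the proof.
bound = math.exp(-(n - 1) * (u / 2.0) ** 2 / (2.0 * sigma2 + b * u / 3.0))
print(empirical, bound)
```

The empirical probability sits well below the exponential bound, as the inequality guarantees for every n.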
For 1 ≤ p < 2, we use the usual truncation technique. Let M be a positive number and define Y′ = Y 1I_{|Y|≤M} and Y″ = Y 1I_{|Y|>M}. Then Theorem 1 holds for (X, Y′), with m′(x) = E(Y′|X = x), for every fixed M. It suffices to prove that the remainder term related to Y″ tends to 0, that is,

E[ Σ_{i=1}^n W_ni(x) |Y″_i − m″(X_i)|^p ] = E[ Σ_{i=1}^n W_ni(x) g_M(X_i) ]
  ≤ ((b/a)/µ(B^x_{rh_n})) ∫_{B^x_{rh_n}} g_M(w) dµ(w)
  ≤ 2^p ((b/a)/µ(B^x_{rh_n})) E| Y 1I_{|Y|>M} |^p,

where g_M(w) = E[ |Y″ − m″(X)|^p | X = w ]; the first inequality comes from Lemma 1. For all x and h > 0, the last term goes to 0 as M → ∞. This ends the proof of Theorem 1 and Corollary 1.
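The truncation argument rests on the fact that E|Y 1I_{|Y|>M}|^p → 0 as M → ∞ whenever E|Y|^p < ∞. A quick Monte Carlo illustration, with choices that are ours and not the paper's (Y standard normal, p = 3/2):

```python
import random

random.seed(1)

# Illustration: the truncation remainder E[|Y|^p 1{|Y| > M}] vanishes as M grows.
p = 1.5
sample = [random.gauss(0.0, 1.0) for _ in range(200_000)]

def remainder(M):
    """Monte Carlo estimate of E[ |Y|^p 1{|Y| > M} ]."""
    return sum(abs(y) ** p for y in sample if abs(y) > M) / len(sample)

vals = [remainder(M) for M in (0.0, 1.0, 2.0, 4.0)]
print(vals)  # decreasing toward 0 as M grows
```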
To prove the almost sure convergence of the kernel estimate we need the following lemma on binomial random variables.

Lemma 2 (Devroye, 1981, Lemma 4.1) If N_n is a binomial random variable with parameters n and p = p_n such that np/log n → ∞, then

Σ_{n=1}^∞ E( exp(−s N_n) ) < ∞, for all s > 0.
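Lemma 2 can be checked numerically, since E(exp(−sN_n)) = (1 − p + p e^{−s})^n exactly for N_n ~ Binomial(n, p). The sketch below uses the illustrative choice (ours, not from the lemma) p_n = (log n)²/n, for which np_n/log n = log n → ∞:

```python
import math

# Exact E[exp(-s N_n)] for N_n ~ Binomial(n, p): the probability generating
# function of the binomial evaluated at exp(-s).
def mgf_neg(n, p, s):
    return (1.0 - p + p * math.exp(-s)) ** n

# Illustrative choice: p_n = (log n)^2 / n, so n p_n / log n -> infinity.
s = 1.0
terms = [mgf_neg(n, min(1.0, math.log(n) ** 2 / n), s) for n in range(2, 20_001)]
partial = sum(terms)
print(partial, terms[-1])  # finite partial sum, negligible tail term
```

The terms decay faster than any power of n, consistent with the summability asserted by the lemma.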
Proof of Theorem 2 and Corollary 2. We apply the same arguments as those of Theorem 4.2 in [9]. The conditional expectation in (12) is exactly ∫ |m_n − m| dµ; it follows from (11), as in [9], by application of Fubini's theorem and the dominated convergence theorem (see Glick (1974)). To prove (11), recall first that ξ_i = 1I_{d(X_i,x)≤rh_n} and N_n = Σ_{t=1}^n ξ_t, and define

U_1(x) = (b/a) Σ_{i=1}^n |m(X_i) − m(x)| ξ_i / Σ_{t=1}^n ξ_t,
U_2(x) = | Σ_{i=1}^n W_ni(x) (Y_i − m(X_i)) |.

Then we have the inequality

|m_n(x) − m(x)| ≤ U_2(x) + U_1(x) + |m(x)| 1I_{N_n=0}.     (21)
We have just seen in the proof of Theorem 1 that

P(N_n = 0) = [1 − µ(B^x_{rh_n})]^n ≤ exp( −nµ(B^x_{rh_n}) ),

which is the general term of a summable series in n when lim_{n→∞} nµ(B^x_{rh_n})/log n > 1. So the last term in (21) is almost surely 0 for large n.
Let Z^x_{1,i} = |m(X_i) − m(x)| ξ_i and ξ^{(n)} = (ξ_1, ..., ξ_n); then |Z^x_{1,i} − E_{ξ^{(n)}}(Z^x_{1,i})| ≤ 4Mξ_i and var(Z^x_{1,i} | ξ_1, ..., ξ_n) ≤ 4M²ξ_i, where E_{ξ^{(n)}}(·) = E(· | ξ_1, ..., ξ_n). Then, by Bernstein's inequality, for any ε > 0,

P( |U_1(x) − E_{ξ^{(n)}}(U_1(x))| > ε | ξ_1, ..., ξ_n )
  ≤ P( | Σ_{i=1}^n (Z^x_{1,i} − E_{ξ^{(n)}}(Z^x_{1,i})) | > ε(a/b)N_n | ξ_1, ..., ξ_n )
  ≤ 2 exp(−c_1 N_n),

where c_1 = a²ε² / (8bM(bM + aε/3)). But from Lemma 1 we get

E_{ξ^{(n)}}(U_1(x)) ≤ ((b/a)/µ(B^x_{rh_n})) E( |m(X) − m(x)| 1I_{d(X,x)≤rh_n} )
  = ((b/a)/µ(B^x_{rh_n})) ∫_{B^x_{rh_n}} |m(w) − m(x)| dµ(w),     (22)

which tends to 0 under (8). Thus, for large n, P( E_{ξ^{(n)}}(U_1(x)) > ε | ξ_1, ..., ξ_n ) = 0 and

P( U_1(x) > 2ε | ξ_1, ..., ξ_n ) ≤ 2 exp(−c_1 N_n).

Therefore, for all ε > 0,

P( U_1(x) > 2ε ) ≤ 2 E{ exp(−c_1 N_n) },

and since N_n is a binomial random variable with parameters n and p(x) = µ(B^x_{rh_n}) such that np(x)/log n → ∞ as n → ∞, the last term of the above inequality is summable in n by Lemma 2. Hence U_1(x) → 0 a.s., by the Borel–Cantelli lemma.
Remark 7 If m were assumed continuous at x, U_1(x) could easily be shown to converge to 0 using the continuity: indeed, max_i |m(X_i) − m(x)| ξ_i ≤ ε for small h, and then |U_1(x)| ≤ (b/a)ε for large n.
The term U_2(x) is treated in a similar way. Let Z^x_{2,i} = (Y_i − m(X_i)) K(d(X_i, x)/h_n); then |Z^x_{2,i}| ≤ 2bMξ_i, E(Z^x_{2,i} | X_1, ..., X_n) = 0 and var(Z^x_{2,i} | X_1, ..., X_n) ≤ b²M²ξ_i. So, as above, Bernstein's inequality gives, with c_2 = a²ε² / (2bM(bM + 2aε/3)),

P( |U_2(x)| > ε | X_1, ..., X_n ) ≤ P( | Σ_{i=1}^n Z^x_{2,i} | > εaN_n | X_1, ..., X_n ) ≤ 2 exp(−c_2 N_n),

and taking expectations we obtain

P( |U_2(x)| > ε ) ≤ 2 E( exp(−c_2 N_n) ).     (23)

Now, from Lemma 2 and the Borel–Cantelli lemma, U_2(x) → 0 a.s. Gathering the above convergences, we deduce (11). This ends the proof of Theorem 2.
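The estimator whose almost sure convergence was just proved can be sketched in a few lines. Everything below is an illustrative toy setting of our own, not the paper's: X is a discretized random curve X_t = c·t on [0, 1], d is the sup semi-metric, K is the uniform kernel on [0, 1] (so a = b = 1 on its support), and m(X) = sup_t X_t.

```python
import random

random.seed(2)

# Minimal sketch of the kernel regression estimate
#   m_n(x) = sum_i W_ni(x) Y_i,
#   W_ni(x) = K(d(X_i, x)/h_n) / sum_j K(d(X_j, x)/h_n).

def sup_dist(f, g):
    """Sup semi-metric between two curves sampled on the same grid."""
    return max(abs(u - v) for u, v in zip(f, g))

def kernel_estimate(sample, x, h):
    # Uniform kernel: weight 1 on the ball of radius h around x, 0 outside.
    weights = [1.0 if sup_dist(X, x) <= h else 0.0 for X, _ in sample]
    total = sum(weights)
    if total == 0.0:              # empty ball (N_n = 0): return a default value
        return 0.0
    return sum(w * Y for w, (_, Y) in zip(weights, sample)) / total

grid = [t / 20.0 for t in range(21)]
sample = []
for _ in range(2000):
    c = random.uniform(-1.0, 1.0)
    X = [c * t for t in grid]
    Y = max(X) + random.gauss(0.0, 0.1)   # m(X) = sup_t X_t = max(c, 0)
    sample.append((X, Y))

x = [0.5 * t for t in grid]               # target curve (c = 0.5), m(x) = 0.5
est = kernel_estimate(sample, x, h=0.05)
print(est)
```

With 2000 curves and bandwidth h = 0.05, the estimate lands close to m(x) = 0.5, in line with the pointwise consistency of Theorem 2.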
For the rate of convergence, using the p-mean Lipschitz hypothesis (10) on m in a neighborhood of x, we have from (22)

E_{ξ^{(n)}}(U_1(x)) ≤ c(b/a)(rh_n)^τ,

hence P( E_{ξ^{(n)}}(U_1(x)) > c(b/a)(rh_n)^τ | ξ_1, ..., ξ_n ) = 0. Take now ε_1 = c(b/a)(rh_n)^τ + C_1 (nµ(B^x_{rh_n})/log n)^{−1/2}, for an appropriate positive number C_1; as before we obtain

P( U_1(x) > ε_1 ) ≤ 2 E{ exp(−c_1 N_n) },     (24)

where c_1, defined above, corresponds to ε = C_1 (nµ(B^x_{rh_n})/log n)^{−1/2}. We remark (see the proof of Lemma 4.1 in [9]) that E(exp(−θN_n)) ≤ exp(−θ′nµ(B^x_{rh_n})), where θ′ = min(θ/2, 1/10); the Borel–Cantelli lemma then allows us to finish. Indeed, for n large enough, c′_1 = min(c_1/2, 1/10) = c_1/2, with c_1 corresponding to ε, both defined above; then (24) is bounded by 2 times

exp( −c′_1 nµ(B^x_{rh_n}) ) ≤ exp( − a²C_1² log n / (16bM(bM + aC_1(nµ(B^x_{rh_n})/log n)^{−1/2}/3)) ) ≤ exp(−C′_1 log n).

The last term is the general term of a summable series in n for a suitable choice of C_1 ensuring C′_1 > 1 (in fact C_1 > 4bM/a).
Remark 8 If m were assumed τ-Lipschitz in a neighborhood of x, U_1(x) could be treated directly, because max_i |m(X_i) − m(x)| ξ_i ≤ c(rh_n)^τ for small h, and then U_1(x) ≤ c(b/a)(rh_n)^τ for large n.
To conclude, we show in the same way as before that U_2(x) = O( (nµ(B^x_{rh_n})/log n)^{−1/2} ). For this purpose we choose ε_2 = C_2 (nµ(B^x_{rh_n})/log n)^{−1/2}, for an appropriate positive number C_2, in the exponential bound of (23). Then, for n large enough, c′_2 = min(c_2/2, 1/10) = c_2/2, with c_2 corresponding to ε_2, both defined above, and (23) is bounded by 2 times

exp( −c′_2 nµ(B^x_{rh_n}) ) ≤ exp( − a²C_2² log n / (4bM(bM + 2aC_2(nµ(B^x_{rh_n})/log n)^{−1/2}/3)) ) ≤ exp(−C′_2 log n),

which is the general term of a summable series in n for a suitable choice of C_2 ensuring C′_2 > 1 (in fact C_2 > 2bM/a).

Therefore we obtain, a.s., for large n,

U_1(x) ≤ c(b/a)(rh_n)^τ + (4bM/a)(nµ(B^x_{rh_n})/log n)^{−1/2},

and

U_2(x) ≤ (2bM/a)(nµ(B^x_{rh_n})/log n)^{−1/2}.
Proof of Theorem 3. It is a simple application of Theorem 2, since (13) implies that

0 ≤ L_n − L* ≤ 2 Σ_{j=1}^M E{ | P(Y = j | X) − Σ_{i=1}^n W_ni(X) 1I_{Y_i=j} | | X_1, Y_1, ..., X_n, Y_n };

see Devroye (1981) or Stone (1977).
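The rule behind Theorem 3 classifies a new observation X to the class j maximizing Σ_i W_ni(X) 1I_{Y_i=j}; with a uniform kernel this is a majority vote over the sample curves lying in the ball of radius h around X for the sup semi-metric. A toy sketch follows, with all choices (class-dependent mean levels, bandwidth, noise) ours rather than the paper's:

```python
import random

random.seed(3)

def sup_dist(f, g):
    """Sup semi-metric between two curves sampled on the same grid."""
    return max(abs(u - v) for u, v in zip(f, g))

def classify(sample, x, h, n_classes=2):
    # Majority vote among sample curves within distance h of x.
    votes = [0] * n_classes
    for X, j in sample:
        if sup_dist(X, x) <= h:
            votes[j] += 1
    return max(range(n_classes), key=lambda j: votes[j])

grid = [t / 10.0 for t in range(11)]
def curve(label):
    shift = 0.0 if label == 0 else 1.0    # class-dependent mean level
    return [shift + random.gauss(0.0, 0.2) for _ in grid]

sample = [(curve(j), j) for j in (0, 1) for _ in range(200)]
tests = [(curve(j), j) for j in (0, 1) for _ in range(50)]
errors = sum(classify(sample, x, h=0.8) != j for x, j in tests)
print(errors, "errors out of", len(tests))
```

Because the two classes are well separated in sup distance, the kernel rule misclassifies essentially none of the test curves, in line with the consistency L_n → L* of Theorem 3.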
References
[1] Bennett, G. (1962) Probability inequalities for sums of independent random variables. J. Amer. Statist. Assoc., 57, 33–45.
[2] Bogachev, V.I. (1998) Gaussian Measures. American Mathematical So-
ciety.
[3] Bosq, D. (1983) Sur la prediction non parametrique de variables
aleatoires et mesures aleatoires. Zeit. Wahrs. Ver. Geb., 64, 541–553.
[4] Bosq, D. (2000) Linear Processes in Function Spaces: Theory and ap-
plications. Lecture Notes in Statistics, 149, Springer.
[5] Bosq, D. and Delecroix, M. (1985) Nonparametric prediction of a Hilbert
space valued random variable. Stoch. Proc. and their Appl., 19, 271–280.
[6] Chow, Y.S. and Teicher, H. (1997) Probability Theory: Independence,
Interchangeability, Martingales. Springer, 3rd ed., New York.
[7] Csaki, E. (1980) A relation between Chung's and Strassen's laws of the iterated logarithm. Zeit. Wahrs. Ver. Geb., 19, 287–301.
[8] Dabo-Niang, S. (2002) Estimation de la densite dans un espace de dimension infinie : application aux diffusions. C. R. Acad. Sci. Paris I, 334, 213–216.
[9] Devroye, L. (1981) On the absolute everywhere convergence of nonpara-
metric regression function estimates. Ann. Statist., 9, 1310–1319.
[10] Devroye, L. and Wagner, T. J. (1980) Distribution-free consistency re-
sults in nonparametric discrimination and regression function estima-
tion. Ann. Statist., 8, 231–239.
[11] Devroye, L., Gyorfi, L. and Lugosi, G. (1996) A Probabilistic Theory of Pattern Recognition. Springer, New York.
[12] Ferraty, F. and Vieu, P. (2000) Dimension fractal et estimation de la regression dans des espaces vectoriels semi-normes. C. R. Acad. Sci. Paris I, 330, 139–142.
[13] Ferraty, F., Goia, A. and Vieu, P. (2002a) Functional nonparametric
model for time series: a fractal approach for dimension reduction. Test,
to appear.
[14] Ferraty, F., Goia, A. and Vieu, P. (2002b) Regression nonparametrique pour des variables aleatoires fonctionnelles melangeantes. C. R. Acad. Sci. Paris I, to appear.
[15] Gyorfi, L. (1978) On the rate of convergence of nearest neighbor rules. IEEE Transac. Information Th., vol. IT-41, 509–512.
[16] Gyorfi, L. (1981) Recent results on nonparametric regression estimate and multiple classification. Problems Control Inform. Theory, 10, 43–52.
[17] Glick, N. (1974) Consistency conditions for probability estimators and
integrals of density estimators. Utilitas Math., 6, 61–74.
[18] Greblicki, W., Krzyzak, A. and Pawlak, M. (1984) Distribution-free
pointwise consistency of kernel regression estimate. Ann. Statist., 12,
1570–1575.
[19] Krzyzak, A. (1986) The rates of convergence of kernel regression estimates and classification rules. IEEE Transac. Inform. Theory, IT-32, 668–679.
[20] Kulkarni, S.R. and Posner, S.E. (1995) Rates of convergence of nearest neighbor estimation under arbitrary sampling. IEEE Transac. Information Th., 41, 1028–1039.
[21] Lifshits, M.A. (1995) Gaussian Random Functions. Kluwer Academic
Publishers.
[22] Parthasarathy, K.R. (1967) Probability Measures on Metric Spaces. Aca-
demic Press.
[23] Petrov, V. V. (1975) Sums of Independent Random Variables. Springer, Berlin.
[24] Preiss, D. (1979) Gaussian measures and covering theorem. Comment.
Math. Univ. Carolin., 20, 95–99.
[25] Ramsay, J. O. and Silverman, B. W. (1997) Functional Data Analysis. Springer, New York.
[26] Rhomari, N. (2001) Kernel regression estimation in Banach space under
dependence. Preprint.
[27] Spiegelman, C. and Sacks, J. (1980) Consistent window estimation in
nonparametric regression. Ann. Statist., 8, 240–246.
[28] Stone, C. J. (1977) Consistent nonparametric regression. Ann. Statist.,
5, 595–645.
[29] Stone, C. J. (1980) Optimal rates of convergence for nonparametric es-
timators. Ann. Statist., 8, 1348–1360.
[30] Stone, C. J. (1982) Optimal global rates of convergence for nonparamet-
ric regression. Ann. Statist., 10, 1040–1053.
[31] Tiser, J. (1988) Differentiation theorem for Gaussian measures on
Hilbert spaces. Trans. Amer. Math. Soc., 308, 655–666.
[32] Wheeden, R. and Zygmund, A. (1977) Measure and integral. Marcel
Dekker, New York.