On the Construction of Minimax Optimal Nonparametric Tests with Kernel Embedding Methods
Tong Li
Submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy under the Executive Committee
of the Graduate School of Arts and Sciences
COLUMBIA UNIVERSITY
2021
© 2021
Tong Li
All Rights Reserved
Abstract
On the Construction of Minimax Optimal Nonparametric Tests with Kernel Embedding Methods
Tong Li
Kernel embedding methods have witnessed a great deal of practical success in the area of
nonparametric hypothesis testing in recent years. Yet ever since their introduction, researchers in
this area have faced a persistent question: which kernel should be selected? The performance of the
associated nonparametric tests can vary dramatically with the choice of kernel. While kernel
selection is usually done in an ad hoc fashion, we ask whether there is a principled way of
selecting the kernel so as to ensure that the associated nonparametric tests
have good performance. As consistency results against fixed alternatives do not tell the full story
about the power of the associated tests, we study their statistical performance within the minimax
framework. First, focusing on the case of goodness-of-fit tests, our analyses show that a vanilla
version of the kernel embedding based test could be suboptimal, and suggest a simple remedy by
moderating the kernel. We prove that the moderated approach provides optimal tests for a wide
range of deviations from the null and can also be made adaptive over a large collection of
interpolation spaces. Then, we study the asymptotic properties of goodness-of-fit, homogeneity
and independence tests using Gaussian kernels, arguably the most popular and successful among
such tests. Our results provide theoretical justifications for this common practice by showing that
tests using a Gaussian kernel with an appropriately chosen scaling parameter are minimax
optimal against smooth alternatives in all three settings. In addition, our analysis also pinpoints
the importance of choosing a diverging scaling parameter when using Gaussian kernels and
suggests a data-driven choice of the scaling parameter that yields tests optimal, up to an iterated
logarithmic factor, over a wide range of smooth alternatives. Numerical experiments are
presented to further demonstrate the practical merits of our methodology.
Table of Contents

List of Tables
List of Figures
Acknowledgments
Dedication

Chapter 1: Introduction
1.1 Kernel Embedding
1.2 Nonparametric Hypothesis Testing
1.3 Minimax Framework
1.4 Kernel Selection and Adaptation

Chapter 2: Moderated Kernel Embedding
2.1 Background and Problem Setting
2.2 Operating Characteristics of MMD Based Test
2.2.1 Asymptotics under $H_0^{\rm GOF}$
2.2.2 Power Analysis for MMD Based Tests
2.3 Optimal Tests Based on Moderated MMD
2.3.1 Moderated MMD Test Statistic
2.3.2 Operating Characteristics of $\eta_{\varrho_n}^2(\hat P_n, P_0)$ Based Tests
2.3.3 Minimax Optimality
2.4 Adaptation

Chapter 3: Gaussian Kernel Embedding
3.1 Test for Goodness-of-fit
3.2 Test for Homogeneity
3.3 Test for Independence
3.4 Adaptation
3.4.1 Test for Goodness-of-fit
3.4.2 Test for Homogeneity
3.4.3 Test for Independence

Chapter 4: Numerical Experiments
4.1 Effect of Scaling Parameter
4.2 Efficacy of Adaptation
4.3 Data Example

Chapter 5: Conclusion and Discussion

Chapter 6: Proofs

References

Appendix A: Some Technical Results and Proofs Related to Chapter 2
Appendix B: Some Technical Results and Proofs Related to Chapter 3
B.1 Properties of Gaussian Kernel
B.2 Proof of Lemma 5
B.3 Proof of Lemma 6
B.4 Decomposition of dHSIC and Its Variance Estimation
B.5 Theoretical Properties of Independence Tests for General $d$

List of Tables

4.1 Frequency that each DAG in Figure 4.4 was selected by three tests.

List of Figures

4.1 Observed power against log(a) in Experiment I (left) and Experiment II (right).
4.2 Observed power versus sample size in Experiment III for $d = 1, 10, 100, 1000$ from left to right.
4.3 Observed power versus sample size in Experiment IV for $d = 2, 10, 100, 1000$ from left to right.
4.4 DAGs with the top 3 highest probabilities of being selected.
Acknowledgements
First of all, I want to express my sincere gratitude to my Ph.D. advisor Prof. Ming Yuan. His
guidance and support are invaluable to my research and my life. I have learned a lot from his deep
insights and decent work ethic.
I am very grateful to Prof. Bodhisattva Sen, Prof. Victor de la Pena, Prof. Cynthia Rush and
Prof. Bin Cheng for their serving on the committee. Their comments and feedback were very
helpful in revising my thesis and provided inspiration for future research.
I also want to thank my close friends and my fellow students, Yi Li, Youran Qi, Chensheng
Kuang, Shulei Wang, Cuize Han, Fan Gao, Luxi Cao, Yuan Li, Yuqing Xu, Chaoyu Yuan, Guanhua
Fang, Yuanzhe Xu and Shun Xu. Their company has made this journey much more pleasurable.
Their suggestions and help have been indispensable during the hard moments of my life.
Finally, I want to thank my parents Xiuyuan Li and Ping Zhou. They have always been sup-
porting and encouraging me. I owe a lot to them.
Dedication
To my beloved parents who always give me unconditional support and encouragement.
Chapter 1: Introduction
1.1 Kernel Embedding
Tests for goodness-of-fit, homogeneity and independence are central to statistical inferences.
Numerous techniques have been developed for these tasks and are routinely used in practice. In
recent years, there has been renewed interest in them from both statistics and related fields, as they
arise naturally in many modern applications where the performance of classical methods is less
than satisfactory. In particular, nonparametric inferences via the embedding of distributions into
a reproducing kernel Hilbert space (RKHS) have emerged as a popular and powerful technique to
tackle these challenges. The approach immediately allows for easy access to the rich machinery
for RKHS and has found great successes in a wide range of applications from causal discovery to
deep learning. See, e.g., Muandet et al. (2017) for a recent review.
More specifically, let $K(\cdot,\cdot)$ be a symmetric and positive definite function defined over $\mathcal{X}\times\mathcal{X}$,
that is, $K(x,y)=K(y,x)$ for all $x,y\in\mathcal{X}$, and the Gram matrix $[K(x_i,x_j)]_{1\le i,j\le n}$ is positive
definite for any distinct $x_1,\dots,x_n\in\mathcal{X}$. The Moore-Aronszajn Theorem indicates that such a
function, referred to as a kernel, can always be uniquely identified with an RKHS $\mathcal{H}_K$ of functions
over $\mathcal{X}$. The embedding
$$\mu_P(\cdot) := \int_{\mathcal{X}} K(x,\cdot)\, P(dx)$$
maps a probability distribution $P$ into $\mathcal{H}_K$. The difference between two probability distributions $P$
and $Q$ can then be conveniently measured by
$$\gamma_K(P,Q) := \|\mu_P - \mu_Q\|_{\mathcal{H}_K}.$$
Under mild regularity conditions, it can be shown that $\gamma_K(P,Q)$ is an integral probability metric so
that it is zero if and only if $P=Q$, and
$$\gamma_K(P,Q) = \sup_{f\in\mathcal{H}_K:\, \|f\|_{\mathcal{H}_K}\le 1} \int_{\mathcal{X}} f\, d(P-Q).$$
As such, $\gamma_K(P,Q)$ is often referred to as the maximum mean discrepancy (MMD) between $P$ and $Q$.
See, e.g., Sriperumbudur et al. (2010) or Gretton et al. (2012a) for details. In what follows, we shall
drop the subscript $K$ whenever its choice is clear from the context. It was noted recently that MMD
is also closely related to the so-called energy distance between random variables (Székely et al.,
2007; Székely and Rizzo, 2009) commonly used to measure independence. See, e.g., Sejdinovic
et al. (2013) and Lyons (2013).
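To make the quantity concrete, the squared MMD admits a simple plug-in estimate computed from Gram matrices. The sketch below uses a Gaussian kernel merely as one common choice; the kernel, bandwidth, sample sizes, and function names are our own illustrative assumptions, not prescribed by the text.

```python
import numpy as np

def gaussian_kernel(x, y, nu=1.0):
    """Gaussian kernel K(x, y) = exp(-nu * ||x - y||^2)."""
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-nu * d2)

def mmd2(x, y, nu=1.0):
    """Plug-in (V-statistic) estimate of the squared MMD between samples x and y."""
    kxx = gaussian_kernel(x, x, nu)
    kyy = gaussian_kernel(y, y, nu)
    kxy = gaussian_kernel(x, y, nu)
    return kxx.mean() + kyy.mean() - 2.0 * kxy.mean()

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 1))
y_same = rng.normal(0.0, 1.0, size=(200, 1))   # same distribution as x
y_diff = rng.normal(1.5, 1.0, size=(200, 1))   # mean-shifted alternative
print(mmd2(x, y_same), mmd2(x, y_diff))
```

Under the mean-shifted alternative the statistic is visibly larger than under the matched sample, which is precisely what the tests described next exploit.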
1.2 Nonparametric Hypothesis Testing
Given a sample from $P$ and/or $Q$, estimates of $\gamma(P,Q)$ can be derived by replacing $P$ and
Q with their respective empirical distributions. These estimates can subsequently be used for
nonparametric hypothesis testing. Here are several notable examples that we shall focus on in this
work.
Goodness-of-fit tests. The goal of goodness-of-fit tests is to check if a sample comes from
a pre-specified distribution. Let $X_1,\dots,X_n$ be $n$ independent $\mathcal{X}$-valued samples from a certain
distribution $P$. We are interested in testing if the hypothesis $H_0^{\rm GOF}: P = P_0$ holds for a fixed $P_0$.
Deviation from $P_0$ can be conveniently measured by $\gamma(P, P_0)$, which can be readily estimated by
$$\gamma(\hat P_n, P_0) := \sup_{f\in\mathcal{H}_K:\, \|f\|_{\mathcal{H}_K}\le 1} \int_{\mathcal{X}} f\, d(\hat P_n - P_0),$$
where $\hat P_n$ is the empirical distribution of $X_1,\dots,X_n$. A natural procedure is to reject $H_0$ if the
estimate exceeds a threshold calibrated to ensure a certain significance level, say $\alpha$ ($0<\alpha<1$).
Homogeneity tests. Homogeneity tests check if two independent samples come from a common
population. Given two independent samples $X_1,\dots,X_n \overset{\rm iid}{\sim} P$ and $Y_1,\dots,Y_m \overset{\rm iid}{\sim} Q$, we are
interested in testing if the null hypothesis $H_0^{\rm HOM}: P = Q$ holds. Discrepancy between $P$ and $Q$ can
be measured by $\gamma(P, Q)$ and, similar to before, it can be estimated by the MMD between $\hat P_n$ and
$\hat Q_m$:
$$\gamma(\hat P_n, \hat Q_m) := \sup_{f\in\mathcal{H}_K:\, \|f\|_{\mathcal{H}_K}\le 1} \int_{\mathcal{X}} f\, d(\hat P_n - \hat Q_m).$$
Again we reject $H_0$ if the estimate exceeds a threshold calibrated to ensure a certain significance
level.
Independence tests. How to measure or test for independence among a set of random variables
is another classical problem in statistics. Let $X = (X_1,\dots,X_d)^\top \in \mathcal{X}_1\times\cdots\times\mathcal{X}_d$ be a random
vector. If the random vectors $X_1,\dots,X_d$ are jointly independent, then the distribution of $X$ can be
factorized:
$$H_0^{\rm IND}: P_X = P_{X_1}\otimes\cdots\otimes P_{X_d}.$$
Dependence among $X_1,\dots,X_d$ can be naturally measured by the difference between the joint
distribution and the product distribution evaluated under MMD:
$$\gamma(P_X, P_{X_1}\otimes\cdots\otimes P_{X_d}) = \|\mu_{P_X} - \mu_{P_{X_1}\otimes\cdots\otimes P_{X_d}}\|_{\mathcal{H}_K}.$$
When $d=2$, $\gamma^2(P_X, P_{X_1}\otimes P_{X_2})$ can be expressed as the squared Hilbert-Schmidt norm of the cross-
covariance operator associated with $X_1$ and $X_2$ and is therefore referred to as the Hilbert-Schmidt
independence criterion (HSIC; Gretton et al., 2005). The more general case as given above is
sometimes referred to as dHSIC (see, e.g., Pfister et al., 2018). As before, we proceed to reject the
independence assumption when $\gamma(\hat P_X^n, \hat P_{X_1}^n\otimes\cdots\otimes \hat P_{X_d}^n)$ exceeds a certain threshold, where $\hat P_X^n$ and
$\hat P_{X_j}^n$ are the empirical distributions of $X$ and $X_j$ respectively.
1.3 Minimax Framework
In all these cases the test statistic, namely $\gamma^2(\hat P_n, P_0)$, $\gamma^2(\hat P_n, \hat Q_m)$ or $\gamma^2(\hat P_X^n, \hat P_{X_1}^n\otimes\cdots\otimes\hat P_{X_d}^n)$,
is a V-statistic. Following standard asymptotic theory for V-statistics (see, e.g., Serfling, 2009), it
can be shown that under mild regularity conditions, when appropriately scaled by the sample size,
they converge to a mixture of $\chi_1^2$ distributions with weights determined jointly by the underlying
probability distribution and the choice of kernel $K$. In contrast, it can also be derived that for a
fixed alternative,
$$\gamma^2(\hat P_n, P_0) \to_p \gamma^2(P, P_0), \qquad \gamma^2(\hat P_n, \hat Q_m) \to_p \gamma^2(P, Q)$$
and
$$\gamma^2(\hat P_X^n, \hat P_{X_1}^n\otimes\cdots\otimes\hat P_{X_d}^n) \to_p \gamma^2(P_X, P_{X_1}\otimes\cdots\otimes P_{X_d}).$$
This immediately suggests that all aforementioned tests are consistent against fixed alternatives in
that their power tends to one as sample sizes increase. Although useful, such consistency results
do not tell the full story about the power of these tests, or whether there are yet more powerful methods.
Specifically, although consistency against any fixed alternative is proved, the rate at which the
power converges to 1 may vary across alternatives, and it remains unclear whether a
given sample size is large enough to ensure a good power for the underlying alternative. In other
words, can we detect the difference between the null and the alternative hypotheses with a high
probability even in the worst scenario?
This concern naturally brings in the notion of uniform consistency, meaning power that converges to
1 uniformly over all alternatives under consideration, and leads us to adopt the minimax hypothesis
testing framework pioneered by Burnashev (1979), Ingster (1987), and Ingster (1993). See also
Ermakov (1991), Spokoiny (1996), Lepski and Spokoiny (1999), Ingster and Suslina (2000),
Ingster (2000), Baraud (2002), Fromont and Laurent (2006), Fromont et al. (2012), and Fromont et
al. (2013), and references therein. Within this framework, we consider testing against alternatives
getting closer and closer to the null hypothesis as the sample size increases. The smallest departure
from the null hypothesis that can be detected consistently, in a minimax sense, is referred to as
the optimal detection boundary, and a test that maintains uniform consistency as the departure
converges to 0 at the rate of the optimal detection boundary is called minimax rate optimal.
1.4 Kernel Selection and Adaptation
The critical importance of kernel selection is widely recognized in practice, as the statistical
performances of the associated tests can vary dramatically with different kernels. Yet, the way
it is done is usually ad hoc and how to do so in a more principled way remains one of the chief
practical challenges. See, e.g., Gretton et al. (2008), Fukumizu et al. (2009), Gretton et al. (2012b),
and Sutherland et al. (2017). In the following chapters, we address this problem by proposing two
kernel selection methods in different settings such that the associated tests are shown to be minimax
rate optimal.
However, such kernel selection methods depend on regularity conditions of the underlying
space of probability distributions, and whether kernel selection can be done in an agnostic manner
remains another challenge. This naturally brings about the issue of adaptation. To address this challenge,
we introduce a simple testing procedure by maximizing a normalized MMD over a pre-specified
class of kernels. A similar idea of maximizing MMD over a class of kernels was first introduced
by Sriperumbudur et al. (2009). Our analysis, however, suggests that it is more desirable to maxi-
mize normalized MMD instead. More specifically, we show that the proposed procedure can attain
the optimal rate, up to an iterated logarithmic factor, over spaces of probability distributions with
different regularity conditions.
The rest of the thesis is organized as follows. In Chapter 2, focusing on the case of goodness-
of-fit tests, our analyses show that a vanilla version of the kernel embedding based test could be
suboptimal, and suggest a simple remedy by moderating the kernel. We prove that the moderated
approach provides optimal tests and can also be made adaptive over a wide range of deviations from
the null. Then, in Chapter 3 we study the asymptotic properties of goodness-of-fit, homogeneity
and independence tests using Gaussian kernels, arguably the most popular and successful among
such tests. Our results provide theoretical justifications for this common practice by showing that
tests using a Gaussian kernel with an appropriately chosen scaling parameter are minimax optimal
against smooth alternatives in all three settings. In addition, we suggest a data-driven choice of the
scaling parameter that yields tests optimal, up to an iterated logarithmic factor, over a wide range
of smooth alternatives. Numerical experiments are presented in Chapter 4 to further demonstrate
the practical merits of our methodology. We conclude with some summary discussion in Chapter
5. All the main proofs are relegated to Chapter 6. Other technical results and their proofs are
collected in the appendices.
Chapter 2: Moderated Kernel Embedding
2.1 Background and Problem Setting
In this chapter, we focus on goodness-of-fit tests. Specifically, with $n$ independent $\mathcal{X}$-valued
samples $X_1,\dots,X_n$ from a certain distribution $P$, we are interested in testing if the hypothesis
$$H_0^{\rm GOF}: P = P_0$$
holds for a fixed $P_0$. Problems of this kind have a long and illustrious history in statistics and are
often associated with household names such as the Kolmogorov-Smirnov test, Pearson's chi-square
test, and Neyman's smooth test. A plethora of other techniques have also been proposed over the
years in both parametric and nonparametric settings (e.g., Ingster and Suslina, 2003; Lehmann
and Romano, 2008). Most of the existing techniques are developed with the domain X = R or
[0, 1] in mind and work the best in these cases. Modern applications, however, oftentimes involve
domains different from these traditional ones. For example, when dealing with directional data,
which arise naturally in applications such as diffusion tensor imaging, it is natural to consider X
as the unit sphere in R3 (e.g., Jupp, 2005). Another example occurs in the context of ranking or
preference data (e.g., Ailon et al., 2008). In these cases, X can be taken as the group of permuta-
tions. Furthermore, motivated by several applications, combinatorial testing problems have been
investigated recently (e.g., Addario-Berry et al., 2010), where the spaces under consideration are
specific combinatorially structured spaces.
We consider kernel embedding and maximum mean discrepancy (MMD) based goodness-of-fit
tests. Specifically, the goodness-of-fit test can be carried out conveniently by first constructing an
estimate $\hat\gamma(P, P_0)$ of $\gamma(P, P_0)$, and then rejecting $H_0$ if the estimate exceeds a threshold calibrated
to ensure a certain significance level, say $\alpha$ ($0<\alpha<1$).
We adopt the minimax framework to evaluate the above-mentioned testing strategy. To fix
ideas, we assume in this chapter that $P$ is dominated by $P_0$ under the alternative so that the Radon-
Nikodym derivative $dP/dP_0$ is well defined, and use the $\chi^2$ divergence between $P$ and $P_0$,
$$\chi^2(P, P_0) := \int_{\mathcal{X}} \left(\frac{dP}{dP_0}\right)^2 dP_0 - 1,$$
as the separation metric to quantify the departure from the null hypothesis. We are particularly
interested in the detection boundary, namely how close $P$ and $P_0$ can be in terms of $\chi^2$ distance
under the alternative so that a test based on a sample of $n$ observations can still consistently
distinguish between the null hypothesis and the alternative. For example, in the parametric setting
where $P$ is known up to a finite-dimensional parameter under the alternative, the detection boundary
of the likelihood ratio test is $n^{-1/2}$ under mild regularity conditions (e.g., Theorem 13.5.4 in
Lehmann and Romano, 2008, and the discussion leading to it). We are concerned here with alternatives
that are nonparametric in nature. Our first result suggests that the detection boundary for the
aforementioned $\gamma_K(\hat P_n, P_0)$ based test is of the order $n^{-1/4}$. However, our main results indicate,
perhaps surprisingly at first, that this rate is far from optimal and the gap between it and the usual
parametric rate can be largely bridged.
In particular, we argue that the distinguishability between $P$ and $P_0$ depends on how close
$u := dP/dP_0 - 1$ is to the RKHS $\mathcal{H}_K$. The closeness of $u$ to $\mathcal{H}_K$ can be measured by the distance
from $u$ to an arbitrary ball in $\mathcal{H}_K$. In particular, we shall consider the case where $\mathcal{H}_K$ is dense in
$L_2(P_0)$, and focus on functions that are polynomially approximable by $\mathcal{H}_K$ for concreteness. More
precisely, for some constants $M, \theta > 0$, denote by $\mathcal{F}(\theta; M)$ the collection of functions $f \in L_2(P_0)$
such that for any $R > 0$, there exists an $f_R \in \mathcal{H}_K$ such that
$$\|f_R\|_{\mathcal{H}_K} \le R, \qquad \text{and} \qquad \|f - f_R\|_{L_2(P_0)} \le M R^{-1/\theta}.$$
We also adopt the convention that
$$\mathcal{F}(0; M) = \{f \in \mathcal{H}_K : \|f\|_{\mathcal{H}_K} \le M\}.$$
We investigate the optimal rate of detection for testing $H_0^{\rm GOF}: P = P_0$ against
$$H_1^{\rm GOF}(\Delta_n, \theta, M) : P \in \mathcal{P}(\Delta_n, \theta, M), \tag{2.1}$$
where $\mathcal{P}(\Delta_n, \theta, M)$ is the collection of distributions $P$ on $(\mathcal{X}, \mathcal{B})$ satisfying
$$dP/dP_0 - 1 \in \mathcal{F}(\theta; M), \qquad \text{and} \qquad \chi(P, P_0) \ge \Delta_n.$$
We call $r_n$ the optimal rate of detection if for any $c > 0$, there exists no consistent test whenever
$\Delta_n \le c\, r_n$; and on the other hand, a consistent test exists as long as $\Delta_n \gg r_n$.
Throughout this chapter, we shall assume
$$\int_{\mathcal{X}\times\mathcal{X}} K^2(x, x')\, dP_0(x)\, dP_0(x') < \infty.$$
Hence the Hilbert-Schmidt integral operator
$$L_K(f)(x) = \int_{\mathcal{X}} K(x, x') f(x')\, dP_0(x'), \qquad \forall\, x \in \mathcal{X}$$
is well defined. The spectral decomposition theorem ensures that $L_K$ admits an eigenvalue
decomposition. Let $\{\varphi_k\}_{k\ge 1}$ denote the orthonormal eigenfunctions of $L_K$ with eigenvalues $\lambda_k$'s such
that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k \ge \cdots > 0$. Then, as proved in, e.g., Dunford and Schwartz (1963),
$$K(x, x') = \sum_{k\ge 1} \lambda_k \varphi_k(x) \varphi_k(x') \tag{2.2}$$
in $L_2(P_0 \otimes P_0)$. We further assume that $K$ is continuous and that $P_0$ is nondegenerate, meaning the
support of $P_0$ is $\mathcal{X}$. Then Mercer's theorem ensures that (2.2) holds pointwise. See, e.g., Theorem
4.49 of Steinwart and Christmann (2008).
As shown in Gretton et al. (2012a), the squared MMD between two probability distributions $P$
and $P_0$ can be expressed as
$$\gamma_K^2(P, P_0) = \iint K(x, x')\, d(P - P_0)(x)\, d(P - P_0)(x'). \tag{2.3}$$
Write
$$\bar K(x, x') = K(x, x') - \mathbb{E}_{P_0} K(x, X) - \mathbb{E}_{P_0} K(X, x') + \mathbb{E}_{P_0} K(X, X'), \tag{2.4}$$
where the subscript $P_0$ signifies the fact that the expectation is taken over $X, X' \sim P_0$ independently.
By (2.4), $\gamma_K^2(P, P_0) = \gamma_{\bar K}^2(P, P_0)$. Therefore, without loss of generality, we can focus on kernels
that are degenerate under $P_0$, i.e.,
$$\mathbb{E}_{P_0} K(X, \cdot) = 0. \tag{2.5}$$
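The centering step (2.4) and the resulting degeneracy (2.5) have a familiar finite-sample analogue: double-centering the Gram matrix so that every row and column averages to zero. A small numerical sketch, in which the sample size, kernel, and all names are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=(50, 1))                    # sample standing in for P0
d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
K = np.exp(-d2)                                 # Gaussian Gram matrix

# Empirical analogue of (2.4): subtract row, column, and grand means
# via the centering matrix H = I - 11^T / n.
n = len(x)
H = np.eye(n) - np.ones((n, n)) / n
K_bar = H @ K @ H

# Degeneracy (2.5) in empirical form: each row/column of K_bar averages to zero.
print(np.abs(K_bar.mean(axis=0)).max())
```

Because $H K H$ projects out the constant direction exactly, the printed value is zero up to floating-point error.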
Passing from a nondegenerate kernel to a degenerate one, however, presents a subtlety regarding
universality. Universality of a kernel is essential for MMD in that it ensures that $dP/dP_0 - 1$ resides
in the linear space spanned by its eigenfunctions. See, e.g., Steinwart (2001) for the definition
of a universal kernel and Sriperumbudur et al. (2011) for a detailed discussion of different types of
universality. Observe that $dP/dP_0 - 1$ necessarily lies in the orthogonal complement of constant
functions in $L_2(P_0)$. A degenerate kernel $K$ is universal if its eigenfunctions $\{\varphi_k\}_{k\ge 1}$ form an
orthonormal basis of the orthogonal complement in $L_2(P_0)$ of the linear space $\{c \cdot \varphi_0 : c \in \mathbb{R}\}$,
where $\varphi_0(x) \equiv 1$. In what follows, we shall assume that $K$ is both degenerate and universal.
For the sake of concreteness, we shall also assume that $K$ has infinitely many positive eigenvalues
decaying polynomially, i.e.,
$$0 < \liminf_{k\to\infty} k^{2s} \lambda_k \le \limsup_{k\to\infty} k^{2s} \lambda_k < \infty \tag{2.6}$$
for some $s > 1/2$. In addition, we also assume that the eigenfunctions of $K$ are uniformly bounded,
i.e.,
$$\sup_{k\ge 1} \|\varphi_k\|_\infty < \infty. \tag{2.7}$$
Together with Assumption (2.6), Assumption (2.7) ensures that Mercer's decomposition (2.2) holds uniformly.
2.2 Operating Characteristics of MMD Based Test
2.2.1 Asymptotics under $H_0^{\rm GOF}$
Note that (2.5) implies $\mathbb{E}_{P_0} \varphi_k(X) = 0$ for all $k \ge 1$. Hence
$$\gamma^2(P, P_0) = \sum_{k\ge 1} \lambda_k \left[\mathbb{E}_P \varphi_k(X)\right]^2$$
for any $P$. Accordingly, when $P$ is replaced by the empirical distribution $\hat P_n$, the empirical squared
MMD can be expressed as
$$\gamma^2(\hat P_n, P_0) = \sum_{k\ge 1} \lambda_k \left[\frac{1}{n}\sum_{i=1}^n \varphi_k(X_i)\right]^2.$$
Classic results on the asymptotics of V-statistics (Serfling, 2009) imply that
$$n\,\gamma^2(\hat P_n, P_0) \to_d \sum_{k\ge 1} \lambda_k Z_k^2 := W$$
under $H_0^{\rm GOF}$, where $Z_k \overset{\rm iid}{\sim} N(0, 1)$. Let $\Phi_{\rm MMD}$ be an MMD based test, which rejects $H_0^{\rm GOF}$ if and
only if $n\,\gamma^2(\hat P_n, P_0)$ exceeds the $1-\alpha$ quantile $q_{w,1-\alpha}$ of $W$, i.e.,
$$\Phi_{\rm MMD} = \mathbf{1}\{n\,\gamma^2(\hat P_n, P_0) > q_{w,1-\alpha}\}.$$
The above limiting distribution of $n\,\gamma^2(\hat P_n, P_0)$ immediately suggests that $\Phi_{\rm MMD}$ is an asymptotic
$\alpha$-level test.
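Because the null limit $W = \sum_{k\ge 1}\lambda_k Z_k^2$ involves only the known spectrum of the kernel, the quantile $q_{w,1-\alpha}$ can be approximated by direct simulation once the leading eigenvalues are available. The sketch below assumes, purely for illustration, a polynomially decaying spectrum $\lambda_k = k^{-2}$ truncated at 200 terms; nothing in the text prescribes this particular spectrum.

```python
import numpy as np

rng = np.random.default_rng(4)
lam = np.arange(1, 201) ** -2.0          # illustrative spectrum, truncated at 200 terms
z = rng.standard_normal((100_000, lam.size))
w = (lam * z**2).sum(axis=1)             # Monte Carlo draws of W = sum_k lambda_k Z_k^2
q = np.quantile(w, 0.95)                 # estimate of q_{w, 0.95}
print(q)
```

The estimated quantile sits well above $\mathbb{E}W = \sum_k \lambda_k$, reflecting the right skew of a weighted chi-square mixture dominated by its leading eigenvalue.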
2.2.2 Power Analysis for MMD Based Tests
We now investigate the power of $\Phi_{\rm MMD}$ in testing $H_0^{\rm GOF}$ against $H_1^{\rm GOF}(\Delta_n, \theta, M)$ given by (2.1).
Recall that the type II error of a test $\Phi: \mathcal{X}^n \to [0, 1]$ for testing $H_0$ against a composite alternative
$H_1: P \in \mathcal{P}$ is given by
$$\beta(\Phi; \mathcal{P}) = \sup_{P\in\mathcal{P}} \mathbb{E}_P\left[1 - \Phi(X_1, \dots, X_n)\right],$$
where $\mathbb{E}_P$ means taking expectation over $X_1, \dots, X_n \overset{\rm iid}{\sim} P$. For brevity, we shall write $\beta(\Phi; \Delta_n, \theta, M)$
instead of $\beta(\Phi; \mathcal{P}(\Delta_n, \theta, M))$ in what follows, with $\mathcal{P}(\Delta_n, \theta, M)$ defined right below (2.1). The
performance of a test $\Phi$ can then be evaluated by its detection boundary, that is, the smallest $\Delta_n$
under which the type II error converges to 0 as $n \to \infty$. Our first result establishes the convergence
rate of the detection boundary for $\Phi_{\rm MMD}$ in the case when $\theta = 0$. Hereafter, we abbreviate
$M$ in $\mathcal{P}(\Delta_n, \theta, M)$, $H_1^{\rm GOF}(\Delta_n, \theta, M)$ and $\beta(\Phi; \Delta_n, \theta, M)$, unless it is necessary to emphasize the
dependence.
Theorem 1. Consider testing $H_0^{\rm GOF}$ against $H_1^{\rm GOF}(\Delta_n, 0)$ by $\Phi_{\rm MMD}$.
(i) If $n^{1/4}\Delta_n \to \infty$, then
$$\beta(\Phi_{\rm MMD}; \Delta_n, 0) \to 0 \quad \text{as } n \to \infty;$$
(ii) conversely, there exists a constant $c_0 > 0$ such that
$$\liminf_{n\to\infty} \beta(\Phi_{\rm MMD}; c_0 n^{-1/4}, 0) > 0.$$
Theorem 1 shows that when the alternative $H_1^{\rm GOF}(\Delta_n, 0)$ is considered, the detection boundary
of $\Phi_{\rm MMD}$ is of the order $n^{-1/4}$. It is of interest to compare the detection rate achieved by $\Phi_{\rm MMD}$ with
that in a parametric setting, where consistent tests are available if $n^{1/2}\Delta_n \to \infty$. See, e.g., Theorem
13.5.4 in Lehmann and Romano (2008) and the discussion leading to it. It is natural to ask to what
extent such a gap can be attributed to the fundamental difference between parametric and
nonparametric testing problems. We shall now argue that this gap is actually largely due to the
sub-optimality of $\Phi_{\rm MMD}$, and that the detection boundary of $\Phi_{\rm MMD}$ could be significantly
improved through a slight modification of the MMD.
2.3 Optimal Tests Based on Moderated MMD
2.3.1 Moderated MMD Test Statistic
The basic idea behind MMD is to project two probability measures onto a unit ball in $\mathcal{H}_K$
and use the distance between the two projections to measure the distance between the original
probability measures. If the Radon-Nikodym derivative of $P$ with respect to $P_0$ is far away from
$\mathcal{H}_K$, the distance between the two projections may not honestly reflect the distance between the
original measures. More specifically, $\gamma^2(P, P_0) = \sum_{k\ge 1} \lambda_k [\mathbb{E}_P \varphi_k(X)]^2$, while the $\chi^2$ distance
between $P$ and $P_0$ is $\chi^2(P, P_0) = \sum_{k\ge 1} [\mathbb{E}_P \varphi_k(X)]^2$. Considering that $\lambda_k$ decreases with $k$,
$\gamma^2(P, P_0)$ can be much smaller than $\chi^2(P, P_0)$. To overcome this problem, we consider a moderated
version of the MMD which allows us to project the probability measures onto a larger ball in $\mathcal{H}_K$.
In particular, write
$$\eta_{K,\varrho}(P, Q; P_0) = \sup_{f\in\mathcal{H}_K:\, \|f\|_{L_2(P_0)}^2 + \varrho^2 \|f\|_{\mathcal{H}_K}^2 \le 1} \int_{\mathcal{X}} f\, d(P - Q) \tag{2.8}$$
for a given distribution $P_0$ and a constant $\varrho > 0$. A distance between probability measures of this type
was first introduced by Harchaoui et al. (2007) when considering kernel methods for the two sample
test. A subtle difference between $\eta_{K,\varrho}(P, Q; P_0)$ and the distance from Harchaoui et al. (2007) is
the set of $f$ that we optimize over on the right-hand side of (2.8). In the case of the two sample test,
there is no information about $P_0$ and therefore one needs to replace the norm $\|\cdot\|_{L_2(P_0)}$ with the
empirical $L_2$ norm.
It is worth noting that $\eta_{K,\varrho}(P, Q; P_0)$ can also be identified with a particular type of MMD.
Specifically, $\eta_{K,\varrho}(P, Q; P_0) = \gamma_{\bar K_\varrho}(P, Q)$, where
$$\bar K_\varrho(x, x') := \sum_{k\ge 1} \frac{\lambda_k}{\lambda_k + \varrho^2}\, \varphi_k(x) \varphi_k(x').$$
We shall nonetheless still refer to $\eta_{K,\varrho}(P, Q; P_0)$ as a moderated MMD in what follows to emphasize
the critical importance of moderation. We shall also abbreviate the dependence of $\eta$ on $K$ and
$P_0$ unless necessary. The unit ball in (2.8) is defined in terms of both the RKHS norm and the $L_2(P_0)$
norm. Recall that $u = dP/dP_0 - 1$, so that
$$\sup_{\|f\|_{L_2(P_0)}\le 1} \int_{\mathcal{X}} f\, d(P - P_0) = \sup_{\|f\|_{L_2(P_0)}\le 1} \int_{\mathcal{X}} f u\, dP_0 = \|u\|_{L_2(P_0)} = \chi(P, P_0).$$
We can therefore expect that a smaller $\varrho$ will make $\eta_\varrho^2(P, P_0)$ closer to $\chi^2(P, P_0)$, since the unit
ball under consideration becomes more similar to the unit ball in $L_2(P_0)$. This can also be verified
by noticing that
$$\lim_{\varrho\to 0} \eta_\varrho^2(P, P_0) = \lim_{\varrho\to 0} \sum_{k\ge 1} \frac{\lambda_k}{\lambda_k + \varrho^2} \left[\mathbb{E}_P \varphi_k(X)\right]^2 = \sum_{k\ge 1} \left[\mathbb{E}_P \varphi_k(X)\right]^2 = \chi^2(P, P_0).$$
Therefore, we choose $\varrho$ converging to 0 when constructing our test statistic.
Hereafter we shall attach the subscript $n$ to $\varrho$ to signify its dependence on $n$. We shall argue that
letting $\varrho_n$ converge to 0 at an appropriate rate as $n$ increases indeed results in a test more powerful
than $\Phi_{\rm MMD}$. The test statistic we propose is the empirical version of $\eta_{\varrho_n}^2(P, P_0)$:
$$\eta_{\varrho_n}^2(\hat P_n, P_0) = \frac{1}{n^2} \sum_{i,j=1}^n \bar K_{\varrho_n}(X_i, X_j) = \sum_{k\ge 1} \frac{\lambda_k}{\lambda_k + \varrho_n^2} \left[\frac{1}{n}\sum_{i=1}^n \varphi_k(X_i)\right]^2. \tag{2.9}$$
This test statistic is similar in spirit to the homogeneity test proposed previously by Harchaoui
et al. (2007), albeit motivated from a different viewpoint. In either case, it is intuitive to expect
improved performance over the vanilla version of the MMD when $\varrho_n$ converges to zero at an
appropriate rate. The main goal of the present work is to precisely characterize the amount of
moderation needed to ensure maximum power.
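Comparing (2.9) with the vanilla statistic, the only change is the weight placed on the $k$-th eigendirection: $\lambda_k/(\lambda_k + \varrho_n^2)$ in place of $\lambda_k$. The snippet below, using an illustrative spectrum $\lambda_k = k^{-2}$ (our assumption, not derived from any particular kernel), shows how the weights flatten toward 1 as $\varrho$ shrinks, which is why $\eta_\varrho^2$ approaches the $\chi^2$ distance:

```python
import numpy as np

lam = np.arange(1, 101) ** -2.0          # illustrative eigenvalues, lambda_k = k^{-2}

def weights(rho):
    """Moderated weights lambda_k / (lambda_k + rho^2) appearing in (2.9)."""
    return lam / (lam + rho**2)

for rho in (1.0, 0.1, 0.01):
    print(rho, weights(rho)[:3])         # leading weights grow toward 1 as rho shrinks
```

Each weight is strictly below 1 for any $\varrho > 0$ and increases monotonically as $\varrho$ decreases, so moderation interpolates between the MMD-type weighting and the flat $\chi^2$-type weighting.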
2.3.2 Operating Characteristics of $\eta_{\varrho_n}^2(\hat P_n, P_0)$ Based Tests
Although the expression for $\eta_{\varrho_n}^2(\hat P_n, P_0)$ given by (2.9) looks similar to that of $\gamma^2(\hat P_n, P_0)$,
their asymptotic behaviors are quite different. At a technical level, this is due to the fact that the
eigenvalues of the underlying kernel
$$\lambda_{n,k} := \frac{\lambda_k}{\lambda_k + \varrho_n^2}$$
depend on $n$ and may not be uniformly summable over $n$. As presented in the following theorem,
a certain type of asymptotic normality, instead of a sum of chi-squares as in the case of $\gamma^2(\hat P_n, P_0)$,
holds for $\eta_{\varrho_n}^2(\hat P_n, P_0)$ under $P_0$, which helps determine the rejection region of the $\eta_{\varrho_n}^2$ based test.
Theorem 2. Assume that $\varrho_n \to 0$ as $n \to \infty$ in such a fashion that $n\varrho_n^{1/(2s)} \to \infty$. Then under
$H_0^{\rm GOF}$,
$$v_n^{-1/2}\left[n\,\eta_{\varrho_n}^2(\hat P_n, P_0) - A_n\right] \to_d N(0, 2),$$
where
$$v_n = \sum_{k\ge 1} \left(\frac{\lambda_k}{\lambda_k + \varrho_n^2}\right)^2, \qquad \text{and} \qquad A_n = \frac{1}{n} \sum_{i=1}^n \bar K_{\varrho_n}(X_i, X_i).$$
In light of Theorem 2, a test that rejects $H_0$ if and only if
$$2^{-1/2} v_n^{-1/2}\left[n\,\eta_{\varrho_n}^2(\hat P_n, P_0) - A_n\right]$$
exceeds $z_{1-\alpha}$ is an asymptotic $\alpha$-level test, where $z_{1-\alpha}$ stands for the $1-\alpha$ quantile of the standard
normal distribution. We refer to this test as $\Phi_{\rm M3d}$, where the subscript M3d stands for Moderated
MMD. The performance of $\Phi_{\rm M3d}$ under the alternative hypothesis is characterized by the following
theorem, showing that its detection boundary is much improved when compared with that of
$\Phi_{\rm MMD}$.
Theorem 3. Consider testing H_0^GOF against H_1^GOF(Δ_n, θ) by Φ_M3d with ϱ_n = c n^{−2s(θ+1)/(4s+θ+1)} for an arbitrary constant c > 0. If n^{2s/(4s+θ+1)} Δ_n → ∞, then Φ_M3d is consistent in that

β(Φ_M3d; Δ_n, θ) → 0,  as n → ∞.

Theorem 3 indicates that the detection boundary for Φ_M3d is n^{−2s/(4s+θ+1)}. In particular, when testing H_0^GOF against H_1^GOF(Δ_n, 0), i.e., θ = 0, it becomes n^{−2s/(4s+1)}. This is to be contrasted with the detection boundary for Φ_MMD, which, as suggested by Theorem 1, is of the order n^{−1/4}. It is also worth noting that the detection boundary for Φ_M3d deteriorates as θ increases, implying that it is harder to test against a larger interpolation space.
2.3.3 Minimax Optimality

It is of interest to investigate whether the detection boundary of Φ_M3d can be further improved. We now show that the answer is negative in a certain sense.

Theorem 4. Consider testing H_0^GOF against H_1^GOF(Δ_n, θ) for some θ < 2s − 1. If lim sup_{n→∞} Δ_n n^{2s/(4s+θ+1)} < ∞, then there exists α ∈ (0, 1) such that for any Φ_n of level α (asymptotically) based on X_1, ..., X_n,

lim sup_{n→∞} β(Φ_n; Δ_n, θ) > 0.

Together with Theorem 3, this suggests that Φ_M3d is rate optimal in the minimax sense when considering the L2 distance as the separation metric and F(θ, q) as the regularity condition of the alternative space.
2.4 Adaptation
Despite the minimax optimality of Φ_M3d, a practical challenge in using it is the choice of an appropriate tuning parameter ϱ_n. In particular, Theorem 3 suggests that ϱ_n needs to be taken at the order of n^{−2s(θ+1)/(4s+θ+1)}, which depends on the values of s and θ. On the one hand, since P_0 and K are known a priori, so is s. On the other hand, θ reflects the property of dP/dP_0, which is typically not known in advance. This naturally brings us to the issue of adaptation (see, e.g., Spokoiny, 1996; Ingster, 2000). In other words, we are interested in a single testing procedure that can achieve the detection boundary for testing H_0^GOF against H_1^GOF(Δ_n(θ), θ) simultaneously over all θ ≥ 0. We emphasize the dependence of Δ_n on θ since the detection boundary may depend on θ, as suggested by the results from the previous section. To this end, we build upon the test statistic introduced before.
More specifically, write

ϱ_* = (√(log log n)/n)^{2s}

and

m_* = log_2 [ϱ_*^{-1} (√(log log n)/n)^{2s/(4s+1)}].

Then our test statistic is taken to be the maximum of T_{n,ϱ_n} over ϱ_n = ϱ_*, 2ϱ_*, 2²ϱ_*, ..., 2^{m_*}ϱ_*:

T_n^{GOF(adapt)} := sup_{0≤k≤m_*} T_{n, 2^k ϱ_*},  (2.10)

where

T_{n,ϱ_n} = (2v_n)^{-1/2} [n η²_{ϱ_n}(P_n, P_0) − A_n].
It turns out that, if an appropriate rejection threshold is chosen, T_n^{GOF(adapt)} can achieve a detection boundary very similar to the one obtained before, but now simultaneously over all θ ≥ 0.
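As a rough illustration, the dyadic grid underlying (2.10) and the maximization can be sketched as follows. Here `stat_fn` stands in for a (hypothetical) user-supplied routine that computes the standardized statistic T_{n,ϱ} from the data, and rounding m_* down to an integer is our own choice.

```python
import numpy as np

def adaptive_stat(n, s, stat_fn):
    """Sketch of (2.10): maximize T_{n,rho} over the dyadic grid
    rho = rho_*, 2 rho_*, ..., 2^{m_*} rho_*.

    stat_fn(rho) should return the standardized statistic T_{n,rho}.
    """
    lo = (np.sqrt(np.log(np.log(n))) / n) ** (2 * s)               # rho_*
    hi = (np.sqrt(np.log(np.log(n))) / n) ** (2 * s / (4 * s + 1))  # upper end
    m_star = int(np.floor(np.log2(hi / lo)))
    grid = lo * 2.0 ** np.arange(m_star + 1)
    return max(stat_fn(rho) for rho in grid)
```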
Theorem 5. (i) Under H_0^GOF,

lim_{n→∞} P(T_n^{GOF(adapt)} ≥ √(3 log log n)) = 0;

(ii) on the other hand, there exists a constant c_1 > 0 such that

lim_{n→∞} inf_{P ∈ ∪_{θ≥0} P(Δ_n(θ), θ)} P(T_n^{GOF(adapt)} ≥ √(3 log log n)) = 1,

provided that Δ_n(θ) ≥ c_1 (n^{-1}√(log log n))^{2s/(4s+θ+1)}.
Theorem 5 immediately suggests that the test that rejects H_0^GOF if and only if T_n^{GOF(adapt)} ≥ √(3 log log n) is consistent for testing it against H_1^GOF(Δ_n(θ), θ) for all θ ≥ 0, provided that

Δ_n(θ) ≥ c_1 (n^{-1}√(log log n))^{2s/(4s+θ+1)}.

We note that the detection boundary given in Theorem 5 is similar, but inferior by a factor of (log log n)^{s/(4s+θ+1)}, to that from Theorem 4. As our next result indicates, such an extra factor is indeed unavoidable and is the price one needs to pay for adaptation.
Theorem 6. Let 0 < θ_1 < θ_2 < 2s − 1. Then there exists a positive constant c_2 such that if

lim sup_{n→∞} sup_{θ∈[θ_1,θ_2]} Δ_n(θ) (n/√(log log n))^{2s/(4s+θ+1)} ≤ c_2,

then

lim_{n→∞} inf_{Φ_n} [E_{P_0} Φ_n + sup_{θ∈[θ_1,θ_2]} β(Φ_n; Δ_n(θ), θ)] = 1.
Similar to Theorem 4, Theorem 6 shows that there is no consistent test for H_0^GOF against H_1^GOF(Δ_n, θ) simultaneously over all θ ∈ [θ_1, θ_2] if Δ_n(θ) ≤ c_2 (n^{-1}√(log log n))^{2s/(4s+θ+1)} for all θ ∈ [θ_1, θ_2] and a sufficiently small c_2. Together with Theorem 5, this suggests that the above-mentioned adaptive test is indeed rate optimal.
Chapter 3: Gaussian Kernel Embedding
3.1 Test for Goodness-of-fit
Throughout this chapter, we shall consider goodness-of-fit, homogeneity and independence
tests. We focus on continuous data, e.g., X = R๐ , and Gaussian kernels, which are arguably the
most popular and successful choice in practice.
Among the three testing problems that we consider, it is instructive to begin with the case of
goodness-of-fit. Obviously, the choice of kernel ๐พ plays an essential role in kernel embedding of
distributions. In particular, when data are continuous, Gaussian kernels are commonly used. More
specifically, a Gaussian kernel with a scaling parameter ν > 0 is given by

G_{d,ν}(x, y) = exp(−ν‖x − y‖_d²),  ∀ x, y ∈ R^d.

Hereafter ‖·‖_d stands for the usual Euclidean norm in R^d. For brevity, we shall suppress the subscript d in both ‖·‖ and G when the dimensionality is clear from the context. When P and Q are probability distributions defined over X = R^d, we shall write the MMD between them with a Gaussian kernel and scaling parameter ν as γ_ν(P, Q), where the subscript signifies the specific value of the scaling parameter.
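For concreteness, the Gaussian kernel and the resulting plug-in estimate of γ_ν²(P, Q) between two samples can be computed as below (a minimal numpy sketch; the function names are ours):

```python
import numpy as np

def gaussian_kernel(x, y, nu):
    """G_{d,nu}(x, y) = exp(-nu * ||x - y||^2), for rows of x (n,d) and y (m,d)."""
    sq = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-nu * sq)

def mmd_sq(x, y, nu):
    """Plug-in (V-statistic) estimate of gamma_nu^2(P, Q)."""
    return (gaussian_kernel(x, x, nu).mean()
            + gaussian_kernel(y, y, nu).mean()
            - 2.0 * gaussian_kernel(x, y, nu).mean())
```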
We shall restrict our attention to distributions with smooth densities. Denote by W_d^{s,2} the sth order Sobolev space on R^d, that is,

W_d^{s,2} = {f : R^d → R | f is almost surely continuous and ∫(1 + ‖ω‖²)^s |F(f)(ω)|² dω < ∞},

where F(f) is the Fourier transform of f:

F(f)(ω) = (2π)^{−d/2} ∫_{R^d} f(x) e^{−i x^⊤ ω} dx.

In what follows, we shall again suppress the subscript d in W_d^{s,2} when it is clear from the context. For any f ∈ W^{s,2}, we shall write

‖f‖²_{W^{s,2}} = ∫_{R^d} (1 + ‖ω‖²)^s |F(f)(ω)|² dω.
Let ๐ and ๐0 be the density functions of P and P0 respectively. We are interested in the case when
both ๐ and ๐0 are elements from W๐ ,2.
Note that we can rewrite the null hypothesis ๐ปGOF0 in terms of density functions: ๐ปGOF
0 : ๐ = ๐0
for some prespecified denstiy ๐0 โ W๐ ,2. To better quantify the power of a test, we shall consider
testing against an alternative that is increasingly closer to the null as the sample size ๐ increases:
๐ปGOF1 (ฮ๐; ๐ ) : ๐ โ W๐ ,2(๐), โ๐ โ ๐0โ๐ฟ2 โฅ ฮ๐,
where
W๐ ,2(๐) ={๐ โ W๐ ,2 : โ ๐ โW๐ ,2 โค ๐
}.
and
โ ๐ โ2๐ฟ2
=
โซR๐๐ 2(๐ฅ)๐๐ฅ.
The alternative hypothesis H_1^GOF(Δ_n; s) is composite, and the power of a test Φ based on X_1, ..., X_n ∼ p is therefore defined as

power(Φ; H_1^GOF(Δ_n; s)) := inf_{p ∈ W^{s,2}(M), ‖p−p_0‖_{L2} ≥ Δ_n} P{Φ rejects H_0^GOF}.
Let

Ḡ_ν(x, y; P_0) = G_ν(x, y) − E_{X∼P_0} G_ν(X, y) − E_{X∼P_0} G_ν(x, X) + E_{X,X′∼iid P_0} G_ν(X, X′),

and recall that

γ²_ν(P_n, P_0) = (1/n²) Σ_{i,j=1}^n Ḡ_ν(X_i, X_j; P_0).

As in Chapter 2, we correct for bias and use instead the following U-statistic:

γ²_ν(P, P_0) := (1/(n(n−1))) Σ_{1≤i≠j≤n} Ḡ_ν(X_i, X_j; P_0),

which we shall focus on in what follows.
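In general the centering expectations under P_0 are integrals against the null density; as a hedged numerical sketch, they can be approximated by averages over a large reference sample drawn from P_0 (our own device here — for particular pairs of kernel and null distribution they are available in closed form):

```python
import numpy as np

def centered_gram(x, x0, nu):
    """Monte Carlo approximation of the P0-centered kernel matrix
    Gbar_nu(x_i, x_j; P0): expectations under P0 are replaced by
    averages over a reference sample x0 drawn from P0."""
    def gram(a, b):
        sq = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-nu * sq)
    g = gram(x, x)
    gx0 = gram(x, x0).mean(axis=1)    # approximates E_{X~P0} G(x_i, X)
    g00 = gram(x0, x0).mean()         # approximates E G(X, X')
    return g - gx0[:, None] - gx0[None, :] + g00

def gof_ustat(x, x0, nu):
    """U-statistic version: average of Gbar over off-diagonal pairs."""
    gbar = centered_gram(x, x0, nu)
    n = len(x)
    return (gbar.sum() - np.trace(gbar)) / (n * (n - 1))
```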
The choice of the scaling parameter ν is essential when using RKHS embedding for goodness-of-fit testing. While the importance of a data-driven choice of ν is widely recognized in practice, almost all existing theoretical studies assume that a fixed kernel, and therefore a fixed scaling parameter, is used. Here we shall demonstrate the benefit of using a data-driven scaling parameter, and especially of choosing a scaling parameter that diverges with the sample size.
More specifically, we argue that, with appropriate scaling, γ²_ν(P, P_0) can be viewed as an estimate of ‖p − p_0‖²_{L2} when ν → ∞ as n → ∞. Note that

∫(p − p_0)² = ∫p² − 2∫p·p_0 + ∫p_0².
The first term can be estimated by

∫p² ≈ (1/n) Σ_{i=1}^n p(X_i) ≈ (1/n) Σ_{i=1}^n p̂_{h,−i}(X_i),

where p̂_{h,−i} is a kernel density estimate of p with the ith observation removed and bandwidth h:

p̂_{h,−i}(x) = (1/(n(2πh²)^{d/2})) Σ_{j≠i} G_{(2h²)^{-1}}(x, X_j).

Thus, we can estimate ∫p² by

(1/(n(n−1)(2πh²)^{d/2})) Σ_{1≤i≠j≤n} G_{(2h²)^{-1}}(X_i, X_j).
Similarly, the cross-product term can be estimated by

∫p·p_0 ≈ ∫p̂_h(x) p_0(x) dx = (1/(n(2πh²)^{d/2})) Σ_{i=1}^n ∫ G_{(2h²)^{-1}}(x, X_i) p_0(x) dx.

Together, we can view

(1/(n(n−1)(2πh²)^{d/2})) Σ_{1≤i≠j≤n} Ḡ_{(2h²)^{-1}}(X_i, X_j; P_0)

as an estimate of ∫(p − p_0)². Following standard asymptotic properties of the kernel density estimator (see, e.g., Tsybakov, 2008), we know that

(π/ν)^{−d/2} γ²_ν(P, P_0) →_p ‖p − p_0‖²_{L2}

if ν → ∞ in such a fashion that ν = o(n^{4/d}). Motivated by this observation, we shall now consider testing H_0^GOF using γ²_ν(P, P_0) with a diverging ν. To signify the dependence of ν on the sample size, we shall add a subscript n in what follows.
Under ๐ปGOF0 , it is clear E๐พ2
a๐ (P, P0) = 0. Note also that
var(๐พ2a๐ (P, P0))
=2
๐(๐ โ 1)E[๏ฟฝ๏ฟฝa๐ (๐1, ๐2)
]2
=2
๐(๐ โ 1)
[E
[๐บa๐ (๐1, ๐2)
]2 โ 2E[๐บa๐ (๐1, ๐2)๐บa๐ (๐1, ๐3)] +(E
[๐บa๐ (๐1, ๐2)
] )2]
=2
๐(๐ โ 1)
[E๐บ2a๐ (๐1, ๐2) โ 2E[๐บa๐ (๐1, ๐2)๐บa๐ (๐1, ๐3)] +
(E
[๐บa๐ (๐1, ๐2)
] )2]. (3.1)
Simple calculations yield

var(γ²_{ν_n}(P, P_0)) = [2(π/(2ν_n))^{d/2}/n²] · ‖p_0‖²_{L2} · (1 + o(1)),

assuming that ν_n → ∞. We shall show that

(n/√2)(2ν_n/π)^{d/4} γ²_{ν_n}(P, P_0) →_d N(0, ‖p_0‖²_{L2}).
To use this as a test statistic, however, we will need to estimate var(γ²_{ν_n}(P, P_0)). To this end, it is natural to consider estimating each of the three terms on the rightmost side of (3.1) by U-statistics:

s²_{n,ν_n} = (1/(n(n−1))) Σ_{1≤i≠j≤n} G²_{ν_n}(X_i, X_j)
  − 2[(n−3)!/n!] Σ_{1≤i,j_1,j_2≤n, |{i,j_1,j_2}|=3} G_{ν_n}(X_i, X_{j_1}) G_{ν_n}(X_i, X_{j_2})
  + [(n−4)!/n!] Σ_{1≤i_1,i_2,j_1,j_2≤n, |{i_1,i_2,j_1,j_2}|=4} G_{ν_n}(X_{i_1}, X_{j_1}) G_{ν_n}(X_{i_2}, X_{j_2}).

Note that s²_{n,ν_n} is not always positive. To avoid a negative estimate of the variance, we can replace it with a sufficiently small value, say 1/n², whenever it is negative or too small. Namely, let

ŝ²_{n,ν_n} = max{s²_{n,ν_n}, 1/n²},

and consider the test statistic

T^GOF_{n,ν_n} := (n/√2) ŝ^{-1}_{n,ν_n} γ²_{ν_n}(P, P_0).
We have
Theorem 7. Let ν_n → ∞ as n → ∞ in such a fashion that ν_n = o(n^{4/d}). Then, under H_0^GOF,

(n/√2)(2ν_n/π)^{d/4} γ²_{ν_n}(P, P_0) →_d N(0, ‖p_0‖²_{L2}).  (3.2)

Moreover,

T^GOF_{n,ν_n} →_d N(0, 1).  (3.3)
Theorem 7 immediately implies that the test, denoted by Φ^GOF_{n,ν_n,α} (α ∈ (0, 1)), that rejects H_0^GOF if and only if T^GOF_{n,ν_n} exceeds z_α, the upper α quantile of the standard normal distribution, is an asymptotic α-level test.
We now proceed to study its power against a smooth alternative. Following the same argument as before, it can be shown that

(1/(n(n−1)(π/ν_n)^{d/2})) Σ_{1≤i≠j≤n} Ḡ_{ν_n}(X_i, X_j; P_0) →_p ‖p − p_0‖²_{L2}

and

(2ν_n/π)^{d/2} ŝ²_{n,ν_n} →_p ‖p‖²_{L2},

so that

n^{-1}(ν_n/(2π))^{d/4} T^GOF_{n,ν_n} →_p 2^{-1/2} ‖p − p_0‖²_{L2}/‖p‖_{L2}.

This immediately implies that, if ν_n → ∞ in such a manner that ν_n = o(n^{4/d}), then Φ^GOF_{n,ν_n,α} is consistent for a fixed p ≠ p_0 in that its power converges to one. In fact, as n increases, more and more subtle deviations from p_0 can be detected by Φ^GOF_{n,ν_n,α}. A refined analysis of the asymptotic behavior of T^GOF_{n,ν_n} yields the following.
Theorem 8. Assume that n^{2s/(d+4s)} Δ_n → ∞. Then for any α ∈ (0, 1),

lim_{n→∞} power{Φ^GOF_{n,ν_n,α}; H_1^GOF(Δ_n; s)} = 1,

provided that ν_n ≍ n^{4/(d+4s)}.
In other words, Φ^GOF_{n,ν_n,α} has a detection boundary of the order O(n^{−2s/(d+4s)}), which turns out to be minimax optimal in that no other test can attain a detection boundary with a faster rate of convergence. More precisely, we have:

Theorem 9. Assume that lim inf_{n→∞} n^{2s/(d+4s)} Δ_n < ∞ and that p_0 is a density such that ‖p_0‖_{W^{s,2}} < M. Then there exists some α ∈ (0, 1) such that for any test Φ_n of level α (asymptotically) based on X_1, ..., X_n ∼ p,

lim inf_{n→∞} power{Φ_n; H_1^GOF(Δ_n; s)} < 1.

Together, Theorems 8 and 9 suggest that Gaussian kernel embedding of distributions is especially suitable for testing against smooth alternatives, and that it yields a test that can consistently detect the smallest departures, in terms of rate of convergence, from the null distribution. The idea can also be readily applied to tests of homogeneity and independence, which we shall examine next.
3.2 Test for Homogeneity
As in the case of goodness of fit test, we shall consider the case when the underlying distri-
butions have smooth densities so that we can rewrite the null hypothesis as ๐ปHOM0 : ๐ = ๐ โ
W๐ ,2(๐), and the alternative hypothesis as
๐ปHOM1 (ฮ๐; ๐ ) : ๐, ๐ โ W๐ ,2(๐), โ๐ โ ๐โ๐ฟ2 โฅ ฮ๐.
The power of a test ฮฆ based on ๐1, . . . , ๐๐ โผ ๐ and ๐1, . . . , ๐๐ โผ ๐ is given by
power(ฮฆ;๐ปHOM1 (ฮ๐; ๐ )) := inf
๐,๐โW๐ ,2 (๐),โ๐โ๐โ๐ฟ2โฅฮ๐P{ฮฆ rejects ๐ปHOM
0 }
26
To fix ideas, we shall also assume that ๐ โค ๐/๐ โค ๐ถ for some constants 0 < ๐ โค ๐ถ < โ.
In addition, we shall express explicitly only the dependence on ๐ and not ๐, for brevity. Our
treatment, however, can be straightforwardly extended to more general situations.
Recall that

γ²_{ν_n}(P_n, Q_m) = (1/n²) Σ_{1≤i,j≤n} G_{ν_n}(X_i, X_j) + (1/m²) Σ_{1≤i,j≤m} G_{ν_n}(Y_i, Y_j) − (2/(nm)) Σ_{i=1}^n Σ_{j=1}^m G_{ν_n}(X_i, Y_j).

As before, to reduce bias, we shall focus instead on a closely related estimate of γ²_{ν_n}(P, Q):

γ²_{ν_n}(P, Q) = (1/(n(n−1))) Σ_{1≤i≠j≤n} G_{ν_n}(X_i, X_j) + (1/(m(m−1))) Σ_{1≤i≠j≤m} G_{ν_n}(Y_i, Y_j) − (2/(nm)) Σ_{i=1}^n Σ_{j=1}^m G_{ν_n}(X_i, Y_j).
It is easy to see that, under H_0^HOM,

E γ²_{ν_n}(P, Q) = 0

and

var(γ²_{ν_n}(P, Q)) = 2[1/(n(n−1)) + 2/(nm) + 1/(m(m−1))] E_{(X,Y)∼P⊗Q} Ḡ²_{ν_n}(X, Y),

where

Ḡ_{ν_n}(x, y) = G_{ν_n}(x, y) − E_{X∼P} G_{ν_n}(X, y) − E_{Y∼Q} G_{ν_n}(x, Y) + E_{(X,Y)∼P⊗Q} G_{ν_n}(X, Y).
It is therefore natural to consider estimating the variance by ŝ²_{n,m,ν_n} = max{s²_{n,m,ν_n}, 1/N²}, where

s²_{n,m,ν_n} = (1/(N(N−1))) Σ_{1≤i≠j≤N} G²_{ν_n}(Z_i, Z_j)
  − 2[(N−3)!/N!] Σ_{1≤i,j_1,j_2≤N, |{i,j_1,j_2}|=3} G_{ν_n}(Z_i, Z_{j_1}) G_{ν_n}(Z_i, Z_{j_2})
  + [(N−4)!/N!] Σ_{1≤i_1,i_2,j_1,j_2≤N, |{i_1,i_2,j_1,j_2}|=4} G_{ν_n}(Z_{i_1}, Z_{j_1}) G_{ν_n}(Z_{i_2}, Z_{j_2}),

with N = n + m, and Z_i = X_i if i ≤ n and Z_i = Y_{i−n} if i > n. This leads to the following test statistic:

T^HOM_{n,ν_n} = [nm/(√2(n + m))] · ŝ^{-1}_{n,m,ν_n} · γ²_{ν_n}(P, Q).
As before, we can show:

Theorem 10. Let ν_n → ∞ as n → ∞ in such a fashion that ν_n = o(n^{4/d}). Then under H_0^HOM: p = q ∈ W^{s,2}(M),

T^HOM_{n,ν_n} →_d N(0, 1),  as n → ∞.

Motivated by Theorem 10, we can consider a test, denoted by Φ^HOM_{n,ν_n,α}, that rejects H_0^HOM if and only if T^HOM_{n,ν_n} exceeds z_α. By construction, Φ^HOM_{n,ν_n,α} is an asymptotic α-level test. We now turn to studying its power against H_1^HOM. As in the case of the goodness-of-fit test, we can prove that Φ^HOM_{n,ν_n,α} is minimax optimal in that it can detect the smallest difference between p and q in terms of rate of convergence. More precisely, we have:

Theorem 11. (i) Assume that n^{2s/(d+4s)} Δ_n → ∞. Then for any α ∈ (0, 1),

lim_{n→∞} power{Φ^HOM_{n,ν_n,α}; H_1^HOM(Δ_n; s)} = 1,

provided that ν_n ≍ n^{4/(d+4s)}.

(ii) Conversely, if lim inf_{n→∞} n^{2s/(d+4s)} Δ_n < ∞, then there exists some α ∈ (0, 1) such that for any test Φ_n of level α (asymptotically) based on X_1, ..., X_n ∼ p and Y_1, ..., Y_m ∼ q,

lim inf_{n→∞} power{Φ_n; H_1^HOM(Δ_n; s)} < 1.
3.3 Test for Independence
Similarly, we can also use Gaussian kernel embedding to construct minimax optimal tests of independence. Let X = (X^1, ..., X^p)^⊤ ∈ R^d be a random vector with subvectors X^j ∈ R^{d_j} for j = 1, ..., p, so that d_1 + ··· + d_p = d. Denote by p the joint density function of X and by p_j the marginal density of X^j. We assume that both the joint density and the marginal densities are smooth. Specifically, we shall consider testing

H_0^IND: p = p_1 ⊗ ··· ⊗ p_p,  p_j ∈ W^{s,2}(M_j), 1 ≤ j ≤ p,

against a smooth departure from independence:

H_1^IND(Δ_n; s): p ∈ W^{s,2}(M), p_j ∈ W^{s,2}(M_j), 1 ≤ j ≤ p, and ‖p − p_1 ⊗ ··· ⊗ p_p‖_{L2} ≥ Δ_n,

where M = Π_{j=1}^p M_j, so that p_1 ⊗ ··· ⊗ p_p ∈ W^{s,2}(M) under both the null and alternative hypotheses.

Given a sample {X_1, ..., X_n} of independent copies of X, we can naturally estimate the so-called dHSIC γ²_{ν_n}(P, P^{X^1} ⊗ ··· ⊗ P^{X^p}) by

γ²_{ν_n}(P_n, P^{X^1}_n ⊗ ··· ⊗ P^{X^p}_n) = (1/n²) Σ_{1≤i,j≤n} G_{ν_n}(X_i, X_j)
  + (1/n^{2p}) Σ_{1≤i_1,...,i_p,j_1,...,j_p≤n} G_{ν_n}((X^1_{i_1}, ..., X^p_{i_p}), (X^1_{j_1}, ..., X^p_{j_p}))
  − (2/n^{p+1}) Σ_{1≤i,j_1,...,j_p≤n} G_{ν_n}(X_i, (X^1_{j_1}, ..., X^p_{j_p})).
To correct for the bias, we shall instead consider the following estimate of γ²_{ν_n}(P, P^{X^1} ⊗ ··· ⊗ P^{X^p}):

γ²_{ν_n}(P, P^{X^1} ⊗ ··· ⊗ P^{X^p}) = (1/(n(n−1))) Σ_{1≤i≠j≤n} G_{ν_n}(X_i, X_j)
  + [(n−2p)!/n!] Σ_{1≤i_1,...,i_p,j_1,...,j_p≤n, |{i_1,...,i_p,j_1,...,j_p}|=2p} G_{ν_n}((X^1_{i_1}, ..., X^p_{i_p}), (X^1_{j_1}, ..., X^p_{j_p}))
  − 2[(n−p−1)!/n!] Σ_{1≤i,j_1,...,j_p≤n, |{i,j_1,...,j_p}|=p+1} G_{ν_n}(X_i, (X^1_{j_1}, ..., X^p_{j_p})).

Under H_0^IND, we have

E γ²_{ν_n}(P, P^{X^1} ⊗ ··· ⊗ P^{X^p}) = 0.
Deriving its variance, however, requires a bit more work. Write

h_j(x^j, y) = E_{X∼P^{X^1}⊗···⊗P^{X^p}} G_{ν_n}((X^1, ..., X^{j−1}, x^j, X^{j+1}, ..., X^p), y)

and

g_j(x^j, y) = h_j(x^j, y) − E_{X^j∼P^{X^j}} h_j(X^j, y) − E_{Y∼P} h_j(x^j, Y) + E_{(X^j,Y)∼P^{X^j}⊗P} h_j(X^j, Y).

With slight abuse of notation, also denote by

h_{j_1,j_2}(x^{j_1}, y^{j_2}) = E_{X,Y∼iid P^{X^1}⊗···⊗P^{X^p}} G_{ν_n}((X^1, ..., X^{j_1−1}, x^{j_1}, X^{j_1+1}, ..., X^p), (Y^1, ..., Y^{j_2−1}, y^{j_2}, Y^{j_2+1}, ..., Y^p))

and

g_{j_1,j_2}(x^{j_1}, y^{j_2}) = h_{j_1,j_2}(x^{j_1}, y^{j_2}) − E_{X^{j_1}∼P^{X^{j_1}}} h_{j_1,j_2}(X^{j_1}, y^{j_2}) − E_{Y^{j_2}∼P^{X^{j_2}}} h_{j_1,j_2}(x^{j_1}, Y^{j_2}) + E_{(X^{j_1},Y^{j_2})∼P^{X^{j_1}}⊗P^{X^{j_2}}} h_{j_1,j_2}(X^{j_1}, Y^{j_2}).
Then we have:

Lemma 1. Under H_0^IND,

var(γ²_{ν_n}(P, P^{X^1} ⊗ ··· ⊗ P^{X^p})) = (2/(n(n−1))) (E Ḡ²_{ν_n}(X, Y) − 2 Σ_{1≤j≤p} E(g_j(X^j, Y))² + Σ_{1≤j_1,j_2≤p} E(g_{j_1,j_2}(X^{j_1}, Y^{j_2}))²) + O(E G²_{ν_n}(X, Y)/n³).  (3.4)

In light of Lemma 1, a variance estimator can be derived by estimating the leading term on the righthand side of (3.4) term by term using U-statistics. Formulae for estimating the variance for general p are tedious, and we defer them to the appendix for space considerations. In the special case when p = 2, the leading term on the righthand side of (3.4) takes a much simplified form:

(2/(n(n−1))) E[Ḡ_{ν_n}(X^1, Y^1)]² · E[Ḡ_{ν_n}(X^2, Y^2)]²,
where ๐ ๐ , ๐ ๐ โผiid P๐ ๐ for ๐ = 1, 2. Thus, we can estimate E[๏ฟฝ๏ฟฝa๐ (๐ ๐ , ๐ ๐ )]2 by
๐ 2๐, ๐ ,a๐ =1
๐(๐ โ 1)โ
1โค๐1โ ๐2โค๐๐บ2a๐ (๐
๐
๐1, ๐
๐
๐2)
โ 2(๐ โ 3)!๐!
โ1โค๐,๐1,๐2โค๐|{๐,๐1,๐2}|=3
๐บa๐ (๐๐
๐, ๐
๐
๐1)๐บa๐ (๐
๐
๐, ๐
๐
๐2)
+ (๐ โ 4)!๐!
โ1โค๐1,๐2,๐1,๐2โค๐|{๐1,๐2,๐1,๐2}|=4
๐บa๐ (๐๐
๐1, ๐
๐
๐1)๐บa๐ (๐
๐
๐2, ๐
๐
๐2)
31
and var(๐พ2a๐ (P, P๐
1 โ P๐2)) by 2/[๐(๐ โ 1)] ๏ฟฝ๏ฟฝ2๐,a๐ where
๏ฟฝ๏ฟฝ2๐,a๐ := max{๐ 2๐,1,a๐๐
2๐,2,a๐ , 1/๐
2}.
so that a test statistic for ๐ปIND0 is
๐ IND๐,a๐
:=๐โ
2๏ฟฝ๏ฟฝโ1๐,a๐
๐พ2a๐ (P, P
๐1 โ P๐2).
Test statistics for general p > 2 can be defined accordingly. Again, we have:

Theorem 12. Let ν_n → ∞ as n → ∞ in such a fashion that ν_n = o(n^{4/d}). Then under H_0^IND,

T^IND_{n,ν_n} →_d N(0, 1),  as n → ∞.

Motivated by Theorem 12, we can consider a test, denoted by Φ^IND_{n,ν_n,α}, that rejects H_0^IND if and only if T^IND_{n,ν_n} exceeds z_α. By construction, Φ^IND_{n,ν_n,α} is an asymptotic α-level test. We now turn to studying its power against H_1^IND. As in the case of the goodness-of-fit test, we can prove that Φ^IND_{n,ν_n,α} is minimax optimal in that it can detect the smallest departure from independence in terms of rate of convergence. More precisely, we have:

Theorem 13. (i) Assume that n^{2s/(d+4s)} Δ_n → ∞. Then for any α ∈ (0, 1),

lim_{n→∞} power{Φ^IND_{n,ν_n,α}; H_1^IND(Δ_n; s)} = 1,

provided that ν_n ≍ n^{4/(d+4s)}.

(ii) Conversely, if lim inf_{n→∞} n^{2s/(d+4s)} Δ_n < ∞, then there exists some α ∈ (0, 1) such that for any test Φ_n of level α (asymptotically) based on X_1, ..., X_n ∼ p,

lim inf_{n→∞} power{Φ_n; H_1^IND(Δ_n; s)} < 1.
3.4 Adaptation
The results presented in the previous sections not only suggest that Gaussian kernel embedding of distributions is especially suitable for testing against smooth alternatives, but also indicate the importance of choosing an appropriate scaling parameter in order to detect small deviations from the null hypothesis. To achieve maximum power, the scaling parameter should be chosen according to the smoothness of the underlying density functions. This, however, presents a practical challenge because the level of smoothness is rarely known a priori. This naturally brings about the question of adaptation: can we devise an agnostic testing procedure that does not require such knowledge but still attains similar performance? We shall show in this section that this is possible, at least for sufficiently smooth densities.
3.4.1 Test for Goodness-of-fit
We again begin with the test for goodness-of-fit. As we show in Section 3.1, under H_0^GOF, T^GOF_{n,ν_n} →_d N(0, 1) if 1 ≪ ν_n ≪ n^{4/d}, whereas for any p ∈ W^{s,2} such that ‖p − p_0‖_{L2} ≫ n^{−2s/(d+4s)}, T^GOF_{n,ν_n} → ∞ provided that ν_n ≍ n^{4/(d+4s)}. This motivates us to consider the following test statistic:

T^{GOF(adapt)}_n = max_{1≤ν_n≤n^{2/d}} T^GOF_{n,ν_n}.

In light of the earlier discussion, it is plausible that such a statistic could be used to detect any smooth departure from the null provided that the level of smoothness s ≥ d/4. We now argue that this is indeed the case. More specifically, we shall proceed to reject H_0^GOF if and only if T^{GOF(adapt)}_n exceeds the upper α quantile, denoted by q^GOF_{n,α}, of its null distribution. In what follows, we shall call this test Φ^{GOF(adapt)}. Note that, even though it is hard to derive an analytic form for q^GOF_{n,α}, it can be readily evaluated via the Monte Carlo method.
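The Monte Carlo calibration just mentioned can be sketched generically as below; `stat_fn` and `sample_null` are hypothetical user-supplied callables (the statistic routine and a sampler from P_0), and the number of replications is ours.

```python
import numpy as np

def mc_null_quantile(stat_fn, sample_null, n, alpha, n_mc=200, seed=0):
    """Approximate the upper-alpha quantile q_{n,alpha} of the null
    distribution of a statistic: draw n_mc samples of size n from the
    null and recompute the statistic on each."""
    rng = np.random.default_rng(seed)
    stats = [stat_fn(sample_null(rng, n)) for _ in range(n_mc)]
    return float(np.quantile(stats, 1.0 - alpha))
```

One would then reject the null whenever the statistic computed on the data exceeds the returned quantile.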
To study the power of Φ^{GOF(adapt)} against H_1^GOF with different levels of smoothness, we shall consider the following alternative hypothesis:

H_1^{GOF(adapt)}(Δ_{n,s} : s ≥ d/4): p ∈ ∪_{s≥d/4} {p ∈ W^{s,2}(M) : ‖p − p_0‖_{L2} ≥ Δ_{n,s}}.

The following theorem characterizes the power of Φ^{GOF(adapt)} against H_1^{GOF(adapt)}(Δ_{n,s} : s ≥ d/4).

Theorem 14. There exists a constant c > 0 such that if

lim inf_{n→∞} Δ_{n,s} (n/log log n)^{2s/(d+4s)} > c,

then

power{Φ^{GOF(adapt)}; H_1^{GOF(adapt)}(Δ_{n,s} : s ≥ d/4)} → 1.
Theorem 14 shows that Φ^{GOF(adapt)} has a detection boundary of the order (log log n/n)^{2s/(d+4s)} when p ∈ W^{s,2} for any s ≥ d/4. If s is known in advance, as we show in Section 3.1, the optimal test is based on T^GOF_{n,ν_n} with ν_n ≍ n^{4/(d+4s)} and has a detection boundary of the order O(n^{−2s/(d+4s)}). The extra factor (log log n)^{2s/(d+4s)}, polynomial in the iterated logarithm, is the price we pay to ensure that no knowledge of s is required and that Φ^{GOF(adapt)} is powerful against smooth alternatives for all s ≥ d/4.
3.4.2 Test for Homogeneity
The treatment for homogeneity tests is similar. Instead of T^HOM_{n,ν_n}, we now consider a test based on

T^{HOM(adapt)}_n = max_{1≤ν_n≤n^{2/d}} T^HOM_{n,ν_n}.

If T^{HOM(adapt)}_n exceeds the upper α quantile, denoted by q^HOM_{n,α}, of its null distribution, then we reject H_0^HOM. In what follows, we shall refer to this test as Φ^{HOM(adapt)}. As before, we do not have a closed-form expression for q^HOM_{n,α}, and it needs to be evaluated via the Monte Carlo method. In particular, in the case of the homogeneity test, we can approximate q^HOM_{n,α} by permutation: we randomly shuffle {X_1, ..., X_n, Y_1, ..., Y_m} and compute the test statistic as if the first n shuffled observations were from the first population and the other m were from the second population. This is repeated multiple times in order to approximate the critical value q^HOM_{n,α}.
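The permutation scheme just described can be sketched as follows, with `stat_fn` a placeholder for any two-sample statistic (e.g., the adaptive statistic above):

```python
import numpy as np

def permutation_quantile(x, y, stat_fn, alpha, n_perm=100, seed=0):
    """Permutation approximation of the critical value for a two-sample
    statistic: pool the samples, reshuffle, and recompute stat_fn on the
    first n versus the remaining m shuffled observations."""
    rng = np.random.default_rng(seed)
    z = np.concatenate([x, y])
    n = len(x)
    stats = []
    for _ in range(n_perm):
        perm = rng.permutation(len(z))
        stats.append(stat_fn(z[perm[:n]], z[perm[n:]]))
    return float(np.quantile(stats, 1.0 - alpha))
```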
The following theorem characterizes the power of Φ^{HOM(adapt)} against an alternative with different levels of smoothness:

H_1^{HOM(adapt)}(Δ_{n,s} : s ≥ d/4): (p, q) ∈ ∪_{s≥d/4} {(p, q) : p, q ∈ W^{s,2}(M), ‖p − q‖_{L2} ≥ Δ_{n,s}}.

Theorem 15. There exists a constant c > 0 such that if

lim inf_{n→∞} Δ_{n,s} (n/log log n)^{2s/(d+4s)} > c,

then

power{Φ^{HOM(adapt)}; H_1^{HOM(adapt)}(Δ_{n,s} : s ≥ d/4)} → 1.
Similar to the case of the goodness-of-fit test, Theorem 15 shows that Φ^{HOM(adapt)} has a detection boundary of the order O((n/log log n)^{−2s/(d+4s)}) when p − q ∈ W^{s,2} for any s ≥ d/4. In light of the results from Section 3.2, this is optimal up to an extra factor polynomial in the iterated logarithm. The main advantage is that Φ^{HOM(adapt)} is powerful against smooth alternatives simultaneously for all s ≥ d/4.
3.4.3 Test for Independence
Similarly, for the independence test, we shall adopt the following test statistic:

T^{IND(adapt)}_n = max_{1≤ν_n≤n^{2/d}} T^IND_{n,ν_n},

and reject H_0^IND if and only if T^{IND(adapt)}_n exceeds the upper α quantile, denoted by q^IND_{n,α}, of its null distribution. In what follows, we shall refer to this test as Φ^{IND(adapt)}. The critical value q^IND_{n,α} can also be evaluated via a permutation test. See, e.g., Pfister et al. (2018) for detailed discussions.
We now show that Φ^{IND(adapt)} is powerful in testing against alternatives with different levels of smoothness:

H_1^{IND(adapt)}(Δ_{n,s} : s ≥ d/4): p ∈ ∪_{s≥d/4} {p ∈ W^{s,2}(M), p_j ∈ W^{s,2}(M_j), 1 ≤ j ≤ p, ‖p − p_1 ⊗ ··· ⊗ p_p‖_{L2} ≥ Δ_{n,s}}.

More specifically, we have:

Theorem 16. There exists a constant c > 0 such that if

lim inf_{n→∞} Δ_{n,s} (n/log log n)^{2s/(d+4s)} > c,

then

power{Φ^{IND(adapt)}; H_1^{IND(adapt)}(Δ_{n,s} : s ≥ d/4)} → 1.

Similar to before, Theorem 16 shows that Φ^{IND(adapt)} is optimal, up to an extra factor polynomial in the iterated logarithm, for detecting smooth departures from independence simultaneously for all s ≥ d/4.
Chapter 4: Numerical Experiments
To further complement our theoretical development and demonstrate the practical merits of
the proposed methodology, we conducted several sets of numerical experiments. We shall mainly
consider Gaussian kernels in this chapter as they are the most popular choices in practice for
continuous data.
4.1 Effect of Scaling Parameter
Our first set of experiments was designed to illustrate the importance of the scaling parameter and highlight the potential room for improvement over the "median" heuristic, one of the most common data-driven choices of the scaling parameter in practice (see, e.g., Gretton et al., 2008; Pfister et al., 2018).

• Experiment I: the homogeneity test with the underlying distributions being a normal distribution and a mixture of several normal distributions. Specifically,

  p(x) = φ(x; 0, 1),  q(x) = 0.5 × φ(x; 0, 1) + 0.1 × Σ_{μ∈S} φ(x; μ, 0.05),

  where φ(x; μ, σ) denotes the density of N(μ, σ²) and S = {−1, −0.5, 0, 0.5, 1}.
โข Experiment II: the joint independence test of ๐1, ยท ยท ยท , ๐5 where
๐1, ยท ยท ยท , ๐4, (๐5)โฒ โผiid ๐ (0, 1), ๐5 =๏ฟฝ๏ฟฝ(๐5)โฒ
๏ฟฝ๏ฟฝ ร sign
( 4โ๐=1
๐ ๐
).
Clearly ๐1, ยท ยท ยท , ๐5 are jointly dependent sinceโ๐๐=1 ๐
๐ โฅ 0.
In both experiments, our primary goal is to investigate how the power of the Gaussian MMD based test is influenced by a pre-fixed scaling parameter. These tests are also compared with ones whose scaling parameter is selected via the "median" heuristic. In order to evaluate tests with different scaling parameters under a unified framework, we determined the critical values for each test via a permutation test.

For Experiment I we fixed the sample sizes at n = m = 200, and for Experiment II at n = 400. The number of permutations was set at 100, and the significance level at α = 0.05. We first repeated the experiments 100 times under the null to verify that the permutation tests indeed yield the correct size, up to Monte Carlo error. Each experiment was then repeated 100 times and the observed power (± one standard error) recorded for different choices of the scaling parameter. The results are summarized in Figure 4.1. It is perhaps not surprising that the scaling parameter selected via the "median" heuristic has little variation across simulation runs, and we represent its performance by a single value.
โ1 0 1 2 3 40
0.2
0.4
0.6
0.8
1
log(a)
Pow
er
Single fixed aMedian
โ3 โ2 โ1 0 1 20
0.2
0.4
0.6
0.8
1
log(a)
Figure 4.1: Observed power against log(a) in Experiment I (left) and Experiment II(right).
The importance of the scaling parameter is evident from Figure 4.1, with the observed power varying quite significantly across different choices. It is also of interest to note that in these settings the "median" heuristic typically does not yield a scaling parameter with great power. More specifically, in Experiment I, log(ν_median) ≈ 0.2 while the maximum power is attained at log(ν) = 4; in Experiment II, log(ν_median) ≈ −2.15 while the maximum power is attained at log(ν) = 1. This suggests that a more appropriate choice of the scaling parameter may lead to much improved performance.
4.2 Efficacy of Adaptation
Our second set of experiments aims to illustrate that the adaptive procedures we proposed in Section 3.4 indeed yield more powerful tests when compared with other alternatives that are commonly used in practice. In particular, we compare the proposed self-normalized adaptive test (S.A.) with a couple of data-driven approaches, namely the "median" heuristic (Median) and the unnormalized adaptive test (U.A.) proposed in Sriperumbudur et al. (2009). When computing both the self-normalized and unnormalized test statistics, we first rescaled the squared distance ‖X_i − X_j‖² by the dimensionality d before taking the maximum within a certain range of the scaling parameter.
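For reference, one common form of the "median" heuristic sets the scaling parameter from the median pairwise squared distance of the pooled sample; the particular normalization below (no extra factor of 2) is our assumption, as conventions vary across papers.

```python
import numpy as np

def median_heuristic_nu(z):
    """'Median' heuristic: nu = 1 / median of the pairwise squared
    distances among the rows of z (one common convention)."""
    sq = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(z), k=1)
    return 1.0 / np.median(sq[iu])
```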
We considered two experimental setups:

• Experiment III: the homogeneity test with the underlying distributions being

  X ∼ N(0, I_d),  Y ∼ N(0, (1 + 2n^{−1/2}) I_d).

  As the "signal strength", the ratio between the variances of X and Y in each single direction is set to decrease to 1 at the order 1/√n with n, which is the decreasing order of variance ratio that can be detected by the classical F-test.

• Experiment IV: the independence test of X^1, X^2 ∈ R^{d/2}, where X = (X^1, X^2) follows a mixture of

  N(0, I_d) and N(0, (1 + 6n^{−3/5}) I_d)

  with mixture probability 0.5. Similarly, the ratio between the variances in each direction is set to decrease with n, but at a slightly faster rate.
To better compare the different methods, we considered different combinations of sample size and dimensionality for each experiment. More specifically, for Experiment III the sample sizes were set to n = m = 25, 50, 75, ..., 200 and the dimension to d = 1, 10, 100, 1000; for Experiment IV the sample sizes were n = 100, 200, ..., 600 and the dimension d = 2, 10, 100, 1000. In both experiments, we fixed the significance level at α = 0.05 and used 100 permutations to calibrate the critical values as before. Again we simulated under H_0 to verify that the resulting tests have the targeted size, up to Monte Carlo error. The power of each method, estimated from 100 such experiments, is reported in Figures 4.2 and 4.3.
Figure 4.2: Observed power versus sample size in Experiment III for d = 1, 10, 100, 1000, from left to right.
Figure 4.3: Observed power versus sample size in Experiment IV for d = 2, 10, 100, 1000, from left to right.
As Figures 4.2 and 4.3 show, for both experiments these tests are comparable in low-dimensional settings. But as d increases, the proposed self-normalized adaptive test becomes more and more preferable to the two alternatives. For example, in Experiment IV with d = 1000, the observed power of the proposed self-normalized adaptive test is about 90% when n = 600, while the other two tests have power of only around 15%.
4.3 Data Example
Finally, we considered applying the proposed self-normalized adaptive test in a data example from Mooij et al. (2016). The data set consists of three variables recorded at different weather stations: altitude (Alt), average temperature (Temp) and average duration of sunshine (Sun). One goal of interest is to infer the causal relationship among the three variables by identifying a suitable directed acyclic graph (DAG) over them. Following Peters et al. (2014), if a set of random variables $X_1, \cdots, X_p$ follows a DAG $\mathcal G_0$, then we assume that they obey a system of additive models:
$$X_j = \sum_{k \in {\rm PA}_j} f_{j,k}(X_k) + \varepsilon_j, \qquad \forall\, 1 \le j \le p,$$
where the $\varepsilon_j$'s are independent Gaussian noises and ${\rm PA}_j$ denotes the collection of parent nodes of node $j$ specified by $\mathcal G_0$. As shown by Peters et al. (2014), $\mathcal G_0$ is identifiable from the joint distribution of $X_1, \cdots, X_p$ under the assumption that the $f_{j,k}$'s are nonlinear. A natural way of deciding which DAG underlies a set of random variables is therefore to test the independence of the regression residuals after fitting the additive models induced by each candidate DAG. In our case, there are in total 25 possible DAGs over the three variables. We can apply independence tests to the residuals for each of the 25 DAGs and choose the one with the largest $p$-value as the most plausible underlying DAG. See Peters et al. (2014) for more details.
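The residual-independence step above can be sketched with a kernel-embedding independence statistic. The snippet below is a minimal illustration rather than the exact test studied in this thesis: it uses a plain (biased) HSIC statistic with a fixed Gaussian kernel scale and the same 100-permutation calibration described in the text; the function names are ours.

```python
import numpy as np

def rbf_gram(x, scale=1.0):
    # Gaussian kernel Gram matrix for a one-dimensional sample.
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * scale ** 2))

def hsic(x, y, scale=1.0):
    # Biased HSIC estimate trace(H Kx H Ky) / n^2, a standard
    # kernel-embedding measure of dependence between x and y.
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(H @ rbf_gram(x, scale) @ H @ rbf_gram(y, scale)) / n ** 2

def independence_pvalue(x, y, n_perm=100, seed=0):
    # Permutation p-value: permuting y breaks any dependence on x.
    rng = np.random.default_rng(seed)
    t0 = hsic(x, y)
    count = sum(hsic(x, rng.permutation(y)) >= t0 for _ in range(n_perm))
    return (1 + count) / (1 + n_perm)
```

In the DAG-selection procedure described above, such a p-value would be computed for the residuals of each candidate DAG, and the candidate with the largest p-value retained.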
As before, we considered three different tests of independence: the proposed self-normalized adaptive test (S.A.), the Gaussian kernel embedding based independence test with the scaling parameter determined by the "median" heuristic (Median), and the unnormalized adaptive test from Sriperumbudur et al. (2009) (U.A.). Note that the three variables are on different scales, so we standardized them before applying the tests of independence.
The overall sample size of the data set is 349. Each time we randomly selected 150 samples and computed the $p$-value associated with each DAG. The $p$-value was again computed based on 100 permutations. We repeated the experiment 1000 times and recorded, for each test, the DAG with the largest $p$-value. All three tests agree on the top three most frequently selected DAGs, and they are shown
in Figure 4.4.
[Three candidate DAGs over the nodes Alt, Temp and Sun, labeled DAG I, DAG II and DAG III.]
Figure 4.4: DAGs with the top 3 highest probabilities of being selected.
In addition, we report in Table 4.1 the frequencies with which these three DAGs were selected by each of the tests. They are generally comparable, with the proposed method more consistently selecting DAG I, the one heavily favored by all three methods.
Test      DAG I    DAG II    DAG III
Median    78.5     4.7       14.5
U.A.      81.4     8.1       8.5
S.A.      83.4     9.8       4.7
Table 4.1: Frequency (%) with which each DAG in Figure 4.4 was selected by the three tests.
Chapter 5: Conclusion and Discussion
In this thesis, we aim to address the problem of kernel selection when using kernel embedding for the purpose of nonparametric hypothesis testing. This problem has confronted researchers and practitioners ever since kernel embedding methods were first proposed, and most existing solutions are ad hoc. We propose principled ways of kernel selection in two different settings and prove that they ensure minimax rate optimality for the associated tests. Because these kernel selection methods depend on the regularity of the underlying space of probability distributions, we also propose adaptive test statistics, whose sacrifice in terms of detection boundary is only a polynomial of an iterated logarithmic factor of the sample size.
There remain many interesting problems in this area to be explored further. For example, can we adopt fast computation techniques to compute the kernel based test statistics approximately, so as to substantially reduce the computational complexity while maintaining statistical optimality? Parallel results in the context of regression have been derived, but such results appear to be lacking for hypothesis testing. A second direction involves resampling methods such as the permutation and the bootstrap. In practice, out of concern that the sample size may not be large enough, resampling methods are often used to determine the rejection boundary. Can we still ensure the statistical optimality of the proposed tests when incorporating resampling methods?

In addition, it would be interesting to know whether similar principled kernel selection methods can be developed for a broader range of nonparametric testing problems, such as conditional independence testing, which can be very useful in Bayesian network learning and causal discovery.
Chapter 6: Proofs
Throughout this chapter, we shall write $a_n \lesssim b_n$ if there exists a universal constant $C > 0$ such that $a_n \le C b_n$. Similarly, we write $a_n \gtrsim b_n$ if $b_n \lesssim a_n$, and $a_n \asymp b_n$ if both $a_n \lesssim b_n$ and $a_n \gtrsim b_n$. When the constant depends on another quantity $D$, we shall write $a_n \lesssim_D b_n$; the relations $\gtrsim_D$ and $\asymp_D$ are defined accordingly.
Proof of Theorem 1. Part (i). The proof of the first part consists of two key steps. First, we show that the population counterpart $n\gamma_K^2(P, P_0)$ of the test statistic converges to $\infty$ uniformly, i.e.,
$$n \inf_{P \in \mathcal P(\Delta_n, 0)} \gamma_K^2(P, P_0) \to \infty.$$
Then, we argue that the deviation of $\gamma_K^2(\hat P_n, P_0)$ from $\gamma_K^2(P, P_0)$ is uniformly negligible compared with $\gamma_K^2(P, P_0)$ itself.
It is not hard to see that
$$\gamma_K(\hat P_n, P_0) = \sqrt{\sum_{k \ge 1} \lambda_k \Big[\frac1n\sum_{i=1}^n \varphi_k(X_i)\Big]^2} \ge \sqrt{\sum_{k \ge 1} \lambda_k [E_P \varphi_k(X)]^2} - \sqrt{\sum_{k \ge 1} \lambda_k \Big[\frac1n\sum_{i=1}^n \varphi_k(X_i) - E_P \varphi_k(X)\Big]^2}.$$
Thus,
$$P\big\{n\gamma_K^2(\hat P_n, P_0) < q_{w,1-\alpha}\big\} \le P\bigg\{\sqrt{n\sum_{k\ge1}\lambda_k[E_P\varphi_k(X)]^2} - \sqrt{n\sum_{k\ge1}\lambda_k\Big[\frac1n\sum_{i=1}^n\varphi_k(X_i)-E_P\varphi_k(X)\Big]^2} < \sqrt{q_{w,1-\alpha}}\bigg\}$$
$$= P\bigg\{\sqrt{n\sum_{k\ge1}\lambda_k\Big[\frac1n\sum_{i=1}^n\varphi_k(X_i)-E_P\varphi_k(X)\Big]^2} > \sqrt{n\sum_{k\ge1}\lambda_k[E_P\varphi_k(X)]^2} - \sqrt{q_{w,1-\alpha}}\bigg\}.$$
Suppose that
$$n\sum_{k\ge1}\lambda_k[E_P\varphi_k(X)]^2 > q_{w,1-\alpha}.$$
Then
$$P\big\{n\gamma_K^2(\hat P_n, P_0) < q_{w,1-\alpha}\big\} \le \frac{E_P\Big\{n\sum_{k\ge1}\lambda_k\big[\frac1n\sum_{i=1}^n\varphi_k(X_i)-E_P\varphi_k(X)\big]^2\Big\}}{\Big\{\sqrt{n\sum_{k\ge1}\lambda_k[E_P\varphi_k(X)]^2}-\sqrt{q_{w,1-\alpha}}\Big\}^2}.$$
Observe that for any $P \in \mathcal P(\Delta_n, 0)$,
$$E_P\Big\{n\sum_{k\ge1}\lambda_k\Big[\frac1n\sum_{i=1}^n\varphi_k(X_i)-E_P\varphi_k(X)\Big]^2\Big\} = \sum_{k\ge1}\lambda_k\,\mathrm{Var}[\varphi_k(X)] \le \sum_{k\ge1}\lambda_kE_P\varphi_k^2(X) \le \Big(\sup_{k\ge1}\|\varphi_k\|_\infty\Big)^2\sum_{k\ge1}\lambda_k < \infty.$$
This implies that
$$\lim_{n\to\infty}\beta(\Phi_{\rm MMD};\Delta_n,0) = \lim_{n\to\infty}\sup_{P\in\mathcal P(\Delta_n,0)}P\big\{n\gamma_K^2(\hat P_n,P_0)<q_{w,1-\alpha}\big\}$$
$$\le \lim_{n\to\infty}\frac{\sup_{P\in\mathcal P(\Delta_n,0)}E_P\Big\{n\sum_{k\ge1}\lambda_k\big[\frac1n\sum_{i=1}^n\varphi_k(X_i)-E_P\varphi_k(X)\big]^2\Big\}}{\inf_{P\in\mathcal P(\Delta_n,0)}\Big\{\sqrt{n\sum_{k\ge1}\lambda_k[E_P\varphi_k(X)]^2}-\sqrt{q_{w,1-\alpha}}\Big\}^2} = 0,$$
provided that
$$\inf_{P\in\mathcal P(\Delta_n,0)}n\sum_{k\ge1}\lambda_k[E_P\varphi_k(X)]^2 \to \infty, \quad \text{as } n\to\infty. \qquad (6.1)$$
It now suffices to show that (6.1) holds if $n\Delta_n^4 \to \infty$ as $n\to\infty$.
To this end, let $u = dP/dP_0 - 1$ and
$$a_k = \langle u,\varphi_k\rangle_{L_2(P_0)} = E_P\varphi_k(X) - E_{P_0}\varphi_k(X) = E_P\varphi_k(X).$$
It is clear that
$$\sum_{k\ge1}\lambda_k^{-1}a_k^2 = \|u\|_K^2, \quad\text{and}\quad \sum_{k\ge1}a_k^2 = \|u\|^2_{L_2(P_0)} = \chi^2(P,P_0).$$
By the definition of $\mathcal P(\Delta_n,0)$,
$$\sup_{P\in\mathcal P(\Delta_n,0)}\sum_{k\ge1}\lambda_k^{-1}a_k^2 \le M^2, \quad\text{and}\quad \inf_{P\in\mathcal P(\Delta_n,0)}\sum_{k\ge1}a_k^2 \ge \Delta_n^2.$$
Since $n\Delta_n^4\to\infty$ as $n\to\infty$, we get, by the Cauchy-Schwarz inequality,
$$\inf_{P\in\mathcal P(\Delta_n,0)}n\sum_{k\ge1}\lambda_k[E_P\varphi_k(X)]^2 = \inf_{P\in\mathcal P(\Delta_n,0)}n\sum_{k\ge1}\lambda_ka_k^2 \ge \inf_{P\in\mathcal P(\Delta_n,0)}\frac{n\big(\sum_{k\ge1}a_k^2\big)^2}{\sum_{k\ge1}\lambda_k^{-1}a_k^2} \ge \frac{n\Delta_n^4}{M^2} \to \infty$$
as $n\to\infty$.
Part (ii). In proving the second part, we will make use of the following lemma, which can be obtained by adapting the argument in Gregory (1977). It gives the limit distribution of the V-statistic under $P_n$ such that $P_n$ converges to $P_0$ at the rate $n^{-1/4}$.

Lemma 2. Consider a sequence of probability measures $\{P_n : n\ge1\}$ contiguous to $P_0$ satisfying $u_n = dP_n/dP_0 - 1 \to 0$ in $L_2(P_0)$. Suppose that for any fixed $k$,
$$\lim_{n\to\infty}\sqrt n\langle u_n,\varphi_k\rangle_{L_2(P_0)} = \tilde a_k, \quad\text{and}\quad \lim_{n\to\infty}\sum_{k\ge1}\lambda_k\big(\sqrt n\langle u_n,\varphi_k\rangle_{L_2(P_0)}\big)^2 = \sum_{k\ge1}\lambda_k\tilde a_k^2 + \tilde a_0 < \infty,$$
for some sequence $\{\tilde a_k : k\ge0\}$. Then
$$\frac1n\sum_{k\ge1}\lambda_k\Big[\sum_{i=1}^n\varphi_k(X_i)\Big]^2 \xrightarrow{d} \sum_{k\ge1}\lambda_k(Z_k+\tilde a_k)^2 + \tilde a_0,$$
where $X_1,\dots,X_n\overset{\rm i.i.d.}\sim P_n$, and the $Z_k$'s are independent standard normal random variables.
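The limiting law in Lemma 2 is a weighted sum of (shifted) chi-squares. As a minimal sketch, one might simulate the null version $\sum_k\lambda_kZ_k^2$ by truncating the series; the eigenvalue sequence below is an illustrative choice of ours, not one prescribed by the thesis.

```python
import numpy as np

def weighted_chisq(lam, n_draws, seed=0):
    # Draws from the truncated weighted chi-square law sum_k lam_k Z_k^2,
    # with Z_k independent standard normals.
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n_draws, len(lam)))
    return (lam * Z ** 2).sum(axis=1)

lam = 1.0 / np.arange(1, 51) ** 2     # illustrative lambda_k ~ k^{-2}
draws = weighted_chisq(lam, 100_000)
# The mean of this law equals sum_k lam_k.
```

Empirical quantiles of such draws give one way to approximate critical values of weighted chi-square limits like the one above.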
Write $\delta(k) = \lambda_kk^{2s}$. By assumption (2.6),
$$0 < \underline\delta := \inf_{k\ge1}\delta(k) \le \sup_{k\ge1}\delta(k) =: \bar\delta < \infty.$$
Consider a sequence $\{P_n : n\ge1\}$ such that
$$dP_n/dP_0 - 1 = C_1\sqrt{\lambda_{k_n}}\,[\delta(k_n)]^{-1}\varphi_{k_n},$$
where $C_1$ is a positive constant and $k_n = \lfloor C_2n^{\frac1{4s}}\rfloor$ for some positive constant $C_2$. Both $C_1$ and $C_2$ will be determined later. Since $\sup_{k\ge1}\|\varphi_k\|_\infty < \infty$ and $\lim_{k\to\infty}\lambda_k = 0$, there exists $n_0 > 0$ such that the $P_n$'s are well-defined probability measures for any $n\ge n_0$.

Note that
$$\|u_n\|_K^2 = \frac{C_1^2}{\delta^2(k_n)} \le \underline\delta^{-2}C_1^2$$
and
$$\|u_n\|^2_{L_2(P_0)} = \frac{C_1^2\lambda_{k_n}}{\delta^2(k_n)} = \frac{C_1^2}{\delta(k_n)}k_n^{-2s} \ge \bar\delta^{-1}C_1^2k_n^{-2s} \sim \bar\delta^{-1}C_1^2C_2^{-2s}n^{-1/2},$$
where $A_n\sim B_n$ means that $\lim_{n\to\infty}A_n/B_n = 1$. Thus, by choosing $C_1$ sufficiently small and $c_0^2 = \frac12\bar\delta^{-1}C_1^2C_2^{-2s}$, we ensure that $P_n\in\mathcal P(c_0n^{-1/4},0)$ for sufficiently large $n$.
To apply Lemma 2, we note that
$$\lim_{n\to\infty}\|u_n\|^2_{L_2(P_0)} = \lim_{n\to\infty}\frac{C_1^2\lambda_{k_n}}{\delta^2(k_n)} = 0.$$
In addition, for any fixed $k$,
$$\tilde a_{k,n} = \sqrt n\langle u_n,\varphi_k\rangle_{L_2(P_0)} = 0$$
for sufficiently large $n$, and
$$\sum_{k\ge1}\lambda_k\tilde a_{k,n}^2 = \frac{nC_1^2\lambda_{k_n}^2}{\delta^2(k_n)} = nC_1^2k_n^{-4s} \to C_1^2C_2^{-4s}$$
as $n\to\infty$. Thus, Lemma 2 implies that
$$n\gamma_K^2(\hat P_n,P_0) \xrightarrow{d} \sum_{k\ge1}\lambda_kZ_k^2 + C_1^2C_2^{-4s}.$$
Now take $C_2 = \big(2C_1^2/q_{w,1-\alpha}\big)^{\frac1{4s}}$ so that $C_1^2C_2^{-4s} = \frac12q_{w,1-\alpha}$. Then
$$\liminf_{n\to\infty}\beta(\Phi_{\rm MMD};c_0n^{-1/4},0) \ge \lim_{n\to\infty}P\big(n\gamma_K^2(\hat P_n,P_0) < q_{w,1-\alpha}\big) = P\Big(\sum_{k\ge1}\lambda_kZ_k^2 < \frac12q_{w,1-\alpha}\Big) > 0,$$
which concludes the proof.
Proof of Theorem 2. Let $\bar K_n(\cdot,\cdot) := \bar K_{\varrho_n}(\cdot,\cdot)$. Note that
$$v_n^{-1/2}\big[n\hat\eta^2_{n\varrho}(\hat P_n,P_0) - A_n\big] = 2(n^2v_n)^{-1/2}\sum_{j=2}^n\sum_{i=1}^{j-1}\bar K_n(X_i,X_j).$$
Let $\zeta_{nj} = \sum_{i=1}^{j-1}\bar K_n(X_i,X_j)$. Consider a filtration $\{\mathcal F_m : m\ge1\}$ where $\mathcal F_m = \sigma\{X_i : 1\le i\le m\}$. Due to the assumption that $K$ is degenerate, we have $E\varphi_k(X) = 0$ for any $k\ge1$, which implies that
$$E(\zeta_{nj}\,|\,\mathcal F_{j-1}) = \sum_{i=1}^{j-1}E[\bar K_n(X_i,X_j)\,|\,\mathcal F_{j-1}] = \sum_{i=1}^{j-1}E[\bar K_n(X_i,X_j)\,|\,X_i] = 0,$$
for any $j\ge2$. Write
$$M_{nm} = \begin{cases} 0 & m = 1,\\ \sum_{j=2}^m\zeta_{nj} & m\ge2.\end{cases}$$
Then for any fixed $n$, $\{M_{nm}\}_{m\ge1}$ is a martingale with respect to $\{\mathcal F_m : m\ge1\}$ and
$$v_n^{-1/2}\big[n\hat\eta^2_{n\varrho}(\hat P_n,P_0) - A_n\big] = 2(n^2v_n)^{-1/2}M_{nn}.$$
We now apply the martingale central limit theorem to $M_{nn}$. Following the argument from Hall (1984), it can be shown that
$$\Big[\frac12n^2E\bar K_n^2(X,X')\Big]^{-1/2}M_{nn} \xrightarrow{d} N(0,1), \qquad (6.2)$$
provided that
$$\big[EG_n^2(X,X') + n^{-1}E\bar K_n^2(X,X')\bar K_n^2(X,X'') + n^{-2}E\bar K_n^4(X,X')\big]\big/\big[E\bar K_n^2(X,X')\big]^2 \to 0, \qquad (6.3)$$
as $n\to\infty$, where $G_n(x,x') = E\bar K_n(X,x)\bar K_n(X,x')$. Since
$$E\bar K_n^2(X,X') = \sum_{k\ge1}\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^2 = v_n,$$
(6.2) implies that
$$v_n^{-1/2}\big[n\hat\eta^2_{n\varrho}(\hat P_n,P_0) - A_n\big] = \sqrt2\cdot\Big(\frac12n^2E\bar K_n^2(X,X')\Big)^{-1/2}M_{nn} \xrightarrow{d} N(0,2).$$
It therefore suffices to verify (6.3).
Note that
$$E\bar K_n^2(X,X') = \sum_{k\ge1}\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^2 \ge \sum_{\lambda_k\ge\varrho_n^2}\frac14 + \frac1{4\varrho_n^4}\sum_{\lambda_k<\varrho_n^2}\lambda_k^2 = \frac14\big|\{k:\lambda_k\ge\varrho_n^2\}\big| + \frac1{4\varrho_n^4}\sum_{\lambda_k<\varrho_n^2}\lambda_k^2 \asymp \varrho_n^{-1/s},$$
where the last step holds by considering that $\lambda_k\asymp k^{-2s}$. Similarly,
$$EG_n^2(X,X') = \sum_{k\ge1}\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^4 \le \big|\{k:\lambda_k\ge\varrho_n^2\}\big| + \varrho_n^{-8}\sum_{\lambda_k<\varrho_n^2}\lambda_k^4 \asymp \varrho_n^{-1/s},$$
and
$$E\bar K_n^2(X,X')\bar K_n^2(X,X'') = E\Big\{\sum_{k\ge1}\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^2\varphi_k^2(X)\Big\}^2 \le \Big(\sup_{k\ge1}\|\varphi_k\|_\infty\Big)^4\Big\{\sum_{k\ge1}\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^2\Big\}^2 \asymp \varrho_n^{-2/s}.$$
Thus there exists a positive constant $C_3$ such that
$$EG_n^2(X,X')\big/\big[E\bar K_n^2(X,X')\big]^2 \le C_3\varrho_n^{1/s} \to 0, \qquad (6.4)$$
and
$$n^{-1}E\bar K_n^2(X,X')\bar K_n^2(X,X'')\big/\big[E\bar K_n^2(X,X')\big]^2 \le C_3n^{-1} \to 0, \qquad (6.5)$$
as $n\to\infty$. On the other hand,
$$E\bar K_n^4(X,X') \le \|\bar K_n\|^2_\infty E\bar K_n^2(X,X'),$$
where
$$\|\bar K_n\|_\infty = \sup_x\Big\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\varphi_k^2(x)\Big\} \le \Big(\sup_{k\ge1}\|\varphi_k\|_\infty\Big)^2\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2} \asymp \varrho_n^{-1/s}.$$
This implies that for some positive constant $C_4$,
$$n^{-2}E\bar K_n^4(X,X')\big/\big[E\bar K_n^2(X,X')\big]^2 \le n^{-2}\|\bar K_n\|^2_\infty\big/E\bar K_n^2(X,X') \le C_4\big(n^2\varrho_n^{1/s}\big)^{-1} \to 0 \qquad (6.6)$$
as $n\to\infty$. Together, (6.4), (6.5) and (6.6) ensure that condition (6.3) holds.
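The order assessment $v_n = \sum_k\big(\lambda_k/(\lambda_k+\varrho_n^2)\big)^2 \asymp \varrho_n^{-1/s}$ used in the proof above is easy to probe numerically with the illustrative choice $\lambda_k = k^{-2s}$: halving $\varrho_n$ should roughly multiply $v_n$ by $2^{1/s}$. A sketch under that assumed eigenvalue decay:

```python
import numpy as np

def v(rho, s=1.0, K=1_000_000):
    # v_n = sum_k (lam_k / (lam_k + rho^2))^2 with lam_k = k^{-2s}, truncated at K terms.
    lam = np.arange(1.0, K + 1.0) ** (-2 * s)
    r = lam / (lam + rho ** 2)
    return float(np.sum(r ** 2))

ratio = v(0.01) / v(0.02)   # with s = 1, expect about 2^{1/s} = 2
```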
Proof of Theorem 3. Note that
$$n\hat\eta^2_{n\varrho}(\hat P_n,P_0) - \frac1n\sum_{i=1}^n\bar K_n(X_i,X_i) = \frac1n\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\sum_{\substack{1\le i,j\le n\\ i\ne j}}\varphi_k(X_i)\varphi_k(X_j)$$
$$= \frac1n\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\sum_{\substack{1\le i,j\le n\\ i\ne j}}[\varphi_k(X_i)-E_P\varphi_k(X)][\varphi_k(X_j)-E_P\varphi_k(X)]$$
$$\quad + \frac{2(n-1)}n\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)]\sum_{1\le i\le n}[\varphi_k(X_i)-E_P\varphi_k(X)] + (n-1)\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)]^2$$
$$=: W_1 + W_2 + W_3.$$
Obviously, $E_PW_1W_2 = 0$. We first argue that the following three statements together imply the desired result:
$$\lim_{n\to\infty}\inf_{P\in\mathcal P(\Delta_n,\theta)}v_n^{-1/2}W_3 = \infty, \qquad (6.7)$$
$$\sup_{P\in\mathcal P(\Delta_n,\theta)}\big(E_PW_1^2/W_3^2\big) = o(1), \qquad (6.8)$$
$$\sup_{P\in\mathcal P(\Delta_n,\theta)}\big(E_PW_2^2/W_3^2\big) = o(1). \qquad (6.9)$$
To see this, note that (6.7) implies that
$$\lim_{n\to\infty}\inf_{P\in\mathcal P(\Delta_n,\theta)}P\Big(v_n^{-1/2}\big[n\hat\eta^2_{n\varrho}(\hat P_n,P_0)-A_n\big]\ge\sqrt2z_{1-\alpha}\Big)$$
$$\ge \lim_{n\to\infty}\inf_{P\in\mathcal P(\Delta_n,\theta)}P\Big(v_n^{-1/2}W_3\ge2\sqrt2z_{1-\alpha},\ W_1+W_2+W_3\ge\frac12W_3\Big) = \lim_{n\to\infty}\inf_{P\in\mathcal P(\Delta_n,\theta)}P\Big(W_1+W_2+W_3\ge\frac12W_3\Big).$$
On the other hand, (6.8) and (6.9) imply that
$$\lim_{n\to\infty}\inf_{P\in\mathcal P(\Delta_n,\theta)}P\Big(W_1+W_2+W_3\ge\frac12W_3\Big) = 1 - \lim_{n\to\infty}\sup_{P\in\mathcal P(\Delta_n,\theta)}P\Big(W_1+W_2+W_3<\frac12W_3\Big)$$
$$\ge 1 - \lim_{n\to\infty}\sup_{P\in\mathcal P(\Delta_n,\theta)}\frac{E_P(W_1+W_2)^2}{(W_3/2)^2} = 1.$$
This immediately implies that $\Phi_{\rm M3d}$ is consistent. We now show that (6.7)-(6.9) indeed hold.

Verifying (6.7). We begin with (6.7). Since $v_n\asymp\varrho_n^{-1/s}$ and $W_3 = (n-1)\eta^2_{n\varrho}(P,P_0)$, (6.7) is equivalent to
$$\lim_{n\to\infty}\inf_{P\in\mathcal P(\Delta_n,\theta)}n\varrho_n^{\frac1{2s}}\eta^2_{n\varrho}(P,P_0) = \infty.$$
For any $P\in\mathcal P(\Delta_n,\theta)$, let $u = dP/dP_0 - 1$ and $a_k = \langle u,\varphi_k\rangle_{L_2(P_0)} = E_P\varphi_k(X)$. Based on the assumption that $K$ is universal, $u = \sum_{k\ge1}a_k\varphi_k$. We consider the cases $\theta = 0$ and $\theta > 0$ separately.

(1) First consider $\theta = 0$. It is clear that
$$\eta^2_{n\varrho}(P,P_0) = \sum_{k\ge1}a_k^2 - \sum_{k\ge1}\frac{\varrho_n^2}{\lambda_k+\varrho_n^2}a_k^2 \ge \|u\|^2_{L_2(P_0)} - \varrho_n^2\sum_{k\ge1}\frac1{\lambda_k}a_k^2 \ge \|u\|^2_{L_2(P_0)} - \varrho_n^2M^2.$$
Take $\varrho_n\le\sqrt{\Delta_n^2/(2M^2)}$ so that $\varrho_n^2M^2\le\frac12\Delta_n^2$. Then we have
$$\inf_{P\in\mathcal P(\Delta_n,0)}\eta^2_{n\varrho}(P,P_0) \ge \frac12\inf_{P\in\mathcal P(\Delta_n,0)}\|u\|^2_{L_2(P_0)} = \frac12\Delta_n^2.$$
(2) Now consider the case when $\theta > 0$. For $P\in\mathcal P(\Delta_n,\theta)$ and any $t > 0$, there exists $u_t\in\mathcal H(K)$ such that
$\|u-u_t\|_{L_2(P_0)}\le Mt^{-1/\theta}$ and $\|u_t\|_K\le t$. Let $b_k = \langle u_t,\varphi_k\rangle_{L_2(P_0)}$. Then
$$\eta^2_{n\varrho}(P,P_0) = \sum_{k\ge1}a_k^2 - \sum_{k\ge1}\frac{\varrho_n^2}{\lambda_k+\varrho_n^2}a_k^2 \ge \|u\|^2_{L_2(P_0)} - 2\sum_{k\ge1}\frac{\varrho_n^2}{\lambda_k+\varrho_n^2}(a_k-b_k)^2 - 2\sum_{k\ge1}\frac{\varrho_n^2}{\lambda_k+\varrho_n^2}b_k^2$$
$$\ge \|u\|^2_{L_2(P_0)} - 2\sum_{k\ge1}(a_k-b_k)^2 - 2\varrho_n^2\sum_{k\ge1}\frac1{\lambda_k}b_k^2 = \|u\|^2_{L_2(P_0)} - 2\|u-u_t\|^2_{L_2(P_0)} - 2\varrho_n^2\|u_t\|_K^2.$$
Taking $t = \big(2M/\|u\|_{L_2(P_0)}\big)^\theta$ yields that
$$\eta^2_{n\varrho}(P,P_0) \ge \|u\|^2_{L_2(P_0)} - 2M^2t^{-2/\theta} - 2\varrho_n^2t^2 = \frac12\|u\|^2_{L_2(P_0)} - 2\varrho_n^2t^2.$$
Now by choosing
$$\varrho_n \le \frac1{2\sqrt2}(2M)^{-\theta}\Delta_n^{1+\theta},$$
we can ensure that
$$2\varrho_n^2t^2 \le \frac14\|u\|^2_{L_2(P_0)},$$
so that
$$\inf_{P\in\mathcal P(\Delta_n,\theta)}\eta^2_{n\varrho}(P,P_0) \ge \inf_{P\in\mathcal P(\Delta_n,\theta)}\frac14\|u\|^2_{L_2(P_0)} \ge \frac14\Delta_n^2.$$
In both cases, with $\varrho_n\le C\Delta_n^{\theta+1}$ for a sufficiently small $C = C(M) > 0$, $\lim_{n\to\infty}\varrho_n^{\frac1{2s}}n\Delta_n^2 = \infty$ suffices to ensure that (6.7) holds. Under the condition that $\lim_{n\to\infty}\Delta_nn^{\frac{2s}{4s+\theta+1}} = \infty$,
$$\varrho_n = cn^{-\frac{2s(\theta+1)}{4s+\theta+1}} \le C\Delta_n^{\theta+1}$$
for sufficiently large $n$, and $\lim_{n\to\infty}\varrho_n^{\frac1{2s}}n\Delta_n^2 = \infty$ holds as well.
Verifying (6.8). Rewrite $W_1$ as
$$W_1 = \frac1n\sum_{\substack{1\le i,j\le n\\ i\ne j}}\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[\varphi_k(X_i)-E_P\varphi_k(X)][\varphi_k(X_j)-E_P\varphi_k(X)] =: \frac1n\sum_{\substack{1\le i,j\le n\\ i\ne j}}F_n(X_i,X_j).$$
Then
$$E_PW_1^2 = \frac1{n^2}\sum_{\substack{i\ne j\\ i'\ne j'}}E_PF_n(X_i,X_j)F_n(X_{i'},X_{j'}) = \frac{2n(n-1)}{n^2}E_PF_n^2(X,X') \le 2E_PF_n^2(X,X'),$$
where $X,X'\overset{\rm i.i.d.}\sim P$. Recall that, for any two random variables $Y_1,Y_2$ such that $EY_1^2 < \infty$,
$$E[Y_1-E(Y_1|Y_2)]^2 = EY_1^2 - E[E(Y_1|Y_2)^2] \le EY_1^2.$$
Together with the fact that
$$F_n(X,X') = \bar K_n(X,X') - E_P[\bar K_n(X,X')|X] - E_P[\bar K_n(X,X')|X'] + E_P\bar K_n(X,X')$$
$$= \bar K_n(X,X') - E_P[\bar K_n(X,X')|X] - E\big\{\bar K_n(X,X') - E_P[\bar K_n(X,X')|X]\,\big|\,X'\big\},$$
we have
$$E_PF_n^2(X,X') \le E_P\big\{\bar K_n(X,X') - E_P[\bar K_n(X,X')|X]\big\}^2 \le E_P\bar K_n^2(X,X').$$
Thus, to prove (6.8), it suffices to show that
$$\lim_{n\to\infty}\sup_{P\in\mathcal P(\Delta_n,\theta)}E_P\bar K_n^2(X,X')\big/W_3^2 = 0.$$
For any $f\in L_2(P_0)$ and positive definite kernel $G(\cdot,\cdot)$ such that $E_{P_0}G^2(X,X') < \infty$, let
$$\|f\|_G := \sqrt{E_{P_0}[f(X)f(X')G(X,X')]}.$$
By the positive definiteness of $G(\cdot,\cdot)$, the triangle inequality holds for $\|\cdot\|_G$, i.e., for any $f_1,f_2\in L_2(P_0)$,
$$\big|\|f_1\|_G - \|f_2\|_G\big| \le \|f_1-f_2\|_G.$$
Thus by taking $G = \bar K_n^2$, $f_1 = dP/dP_0$ and $f_2 = 1$, we have
$$\Big|\sqrt{E_P\bar K_n^2(X,X')} - \sqrt{E_{P_0}\bar K_n^2(X,X')}\Big| \le \sqrt{E_{P_0}[u(X)u(X')\bar K_n^2(X,X')]}. \qquad (6.10)$$
We now appeal to the following lemma to bound the right hand side of (6.10).

Lemma 3. Let $G$ be a Mercer kernel defined over $\mathcal X\times\mathcal X$ with eigenvalue-eigenfunction pairs $\{(\mu_k,\phi_k) : k\ge1\}$ with respect to $L_2(P)$ such that $\mu_1\ge\mu_2\ge\cdots$. If $G$ is a trace kernel in that $EG(X,X) < \infty$, then for any $f\in L_2(P)$,
$$E_P[f(X)f(X')G^2(X,X')] \le \mu_1\Big(\sum_{k\ge1}\mu_k\Big)\Big(\sup_{k\ge1}\|\phi_k\|_\infty\Big)^2\|f\|^2_{L_2(P)}.$$
By Lemma 3, we get
$$E_{P_0}[u(X)u(X')\bar K_n^2(X,X')] \le C_5\Big(\sum_k\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)\|u\|^2_{L_2(P_0)} \asymp \varrho_n^{-1/s}\|u\|^2_{L_2(P_0)}.$$
Recall that
$$E_{P_0}\bar K_n^2(X,X') = \sum_k\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^2 \asymp \varrho_n^{-1/s}.$$
In the light of (6.10), they imply that
$$E_P\bar K_n^2(X,X') \le 2\big\{E_{P_0}\bar K_n^2(X,X') + E_{P_0}[u(X)u(X')\bar K_n^2(X,X')]\big\} \le C_6\varrho_n^{-1/s}\big[1+\|u\|^2_{L_2(P_0)}\big].$$
On the other hand, as already shown in the part verifying (6.7), $\varrho_n\asymp\Delta_n^{\theta+1}$ suffices to ensure that for sufficiently large $n$,
$$\frac14\|u\|^2_{L_2(P_0)} \le \eta^2_{n\varrho}(P,P_0) \le \|u\|^2_{L_2(P_0)}, \qquad \forall\,P\in\mathcal P(\Delta_n,\theta).$$
Thus
$$\lim_{n\to\infty}\sup_{P\in\mathcal P(\Delta_n,\theta)}E_P\bar K_n^2(X,X')\big/W_3^2 \le 16C_6\bigg\{\Big(\lim_{n\to\infty}\inf_{P\in\mathcal P(\Delta_n,\theta)}\varrho_n^{1/s}n^2\|u\|^4_{L_2(P_0)}\Big)^{-1} + \Big(\lim_{n\to\infty}\inf_{P\in\mathcal P(\Delta_n,\theta)}\varrho_n^{1/s}n^2\|u\|^2_{L_2(P_0)}\Big)^{-1}\bigg\} = 0$$
provided that $\lim_{n\to\infty}n^{\frac{2s}{4s+\theta+1}}\Delta_n = \infty$. This immediately implies (6.8).
Verifying (6.9). Observe that
$$E_PW_2^2 \le 4nE_P\Big\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)][\varphi_k(X)-E_P\varphi_k(X)]\Big\}^2 \le 4nE_P\Big\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)]\varphi_k(X)\Big\}^2$$
$$= 4nE_{P_0}\bigg([1+u(X)]\Big\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)]\varphi_k(X)\Big\}^2\bigg).$$
It is clear that
$$E_{P_0}\Big\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)]\varphi_k(X)\Big\}^2 = \sum_{k,k'\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\cdot\frac{\lambda_{k'}}{\lambda_{k'}+\varrho_n^2}E_P\varphi_k(X)E_P\varphi_{k'}(X)E_{P_0}[\varphi_k(X)\varphi_{k'}(X)]$$
$$= \sum_{k\ge1}\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^2[E_P\varphi_k(X)]^2 \le \eta^2_{n\varrho}(P,P_0).$$
On the other hand, by the Cauchy-Schwarz inequality,
$$E_{P_0}\bigg(u(X)\Big\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)]\varphi_k(X)\Big\}^2\bigg)$$
$$\le \sqrt{E_{P_0}\bigg(u^2(X)\Big\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)]\varphi_k(X)\Big\}^2\bigg)}\times\sqrt{E_{P_0}\Big\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)]\varphi_k(X)\Big\}^2}$$
$$\le \|u\|_{L_2(P_0)}\sup_x\Big|\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)]\varphi_k(x)\Big|\cdot\eta_{n\varrho}(P,P_0)$$
$$\le \Big(\sup_k\|\varphi_k\|_\infty\Big)\|u\|_{L_2(P_0)}\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big|E_P\varphi_k(X)\big|\cdot\eta_{n\varrho}(P,P_0)$$
$$\le \Big(\sup_k\|\varphi_k\|_\infty\Big)\|u\|_{L_2(P_0)}\sqrt{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}}\sqrt{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)]^2}\cdot\eta_{n\varrho}(P,P_0)$$
$$\le C_7\|u\|_{L_2(P_0)}\varrho_n^{-\frac1{2s}}\eta^2_{n\varrho}(P,P_0).$$
Together, they imply that
$$\lim_{n\to\infty}\sup_{P\in\mathcal P(\Delta_n,\theta)}E_PW_2^2/W_3^2 \le 4\max\{1,C_7\}\bigg\{\Big(\lim_{n\to\infty}\inf_{P\in\mathcal P(\Delta_n,\theta)}n\eta^2_{n\varrho}(P,P_0)\Big)^{-1} + \lim_{n\to\infty}\sup_{P\in\mathcal P(\Delta_n,\theta)}\bigg(\frac{\|u\|_{L_2(P_0)}}{\varrho_n^{\frac1{2s}}n\eta^2_{n\varrho}(P,P_0)}\bigg)\bigg\} = 0,$$
under the assumption that $\lim_{n\to\infty}n^{\frac{2s}{4s+\theta+1}}\Delta_n = \infty$.
Proof of Theorem 4. The overall architecture is now standard in establishing minimax lower bounds for nonparametric hypothesis testing. The main idea is to carefully construct a set of distributions under the alternative hypothesis and argue that a mixture of these alternatives cannot be reliably distinguished from the null. See, e.g., Ingster (1993), Ingster and Suslina (2003) and Tsybakov (2008). Without loss of generality, assume $M = 1$ and $\Delta_n = cn^{-\frac{2s}{4s+\theta+1}}$ for some $c > 0$.
Let us consider the cases of $\theta = 0$ and $\theta > 0$ separately.
The case of $\theta = 0$. We first treat the case when $\theta = 0$. Let $B_n = \lfloor C_8\Delta_n^{-\frac1s}\rfloor$ for a sufficiently small constant $C_8 > 0$ and $\tau_n = \sqrt{\Delta_n^2/B_n}$. For any $\xi_n := (\xi_{n1},\xi_{n2},\cdots,\xi_{nB_n})^\top\in\{\pm1\}^{B_n}$, write
$$u_{n,\xi_n} = \tau_n\sum_{k=1}^{B_n}\xi_{nk}\varphi_k.$$
It is clear that
$$\|u_{n,\xi_n}\|^2_{L_2(P_0)} = B_n\tau_n^2 = \Delta_n^2$$
and
$$\|u_{n,\xi_n}\|_\infty \le \tau_nB_n\Big(\sup_k\|\varphi_k\|_\infty\Big) \asymp \Delta_n^{1-\frac1{2s}} \to 0.$$
By taking $C_8$ small enough, we can also ensure
$$\|u_{n,\xi_n}\|_K^2 = \tau_n^2\sum_{k=1}^{B_n}\lambda_k^{-1} \le 1.$$
Therefore, there exists a probability measure $P_{n,\xi_n}\in\mathcal P(\Delta_n,0)$ such that $dP_{n,\xi_n}/dP_0 = 1+u_{n,\xi_n}$. Following a standard argument for minimax lower bounds, it suffices to show that
$$\limsup_{n\to\infty}E_{P_0}\bigg(\frac1{2^{B_n}}\sum_{\xi_n\in\{\pm1\}^{B_n}}\Big\{\prod_{i=1}^n[1+u_{n,\xi_n}(X_i)]\Big\}\bigg)^2 < \infty. \qquad (6.11)$$
Note that
$$E_{P_0}\bigg(\frac1{2^{B_n}}\sum_{\xi_n\in\{\pm1\}^{B_n}}\Big\{\prod_{i=1}^n[1+u_{n,\xi_n}(X_i)]\Big\}\bigg)^2 = E_{P_0}\bigg(\frac1{2^{2B_n}}\sum_{\xi_n,\xi'_n\in\{\pm1\}^{B_n}}\Big\{\prod_{i=1}^n[1+u_{n,\xi_n}(X_i)]\Big\}\Big\{\prod_{i=1}^n[1+u_{n,\xi'_n}(X_i)]\Big\}\bigg)$$
$$= \frac1{2^{2B_n}}\sum_{\xi_n,\xi'_n\in\{\pm1\}^{B_n}}\prod_{i=1}^nE_{P_0}\big\{[1+u_{n,\xi_n}(X_i)][1+u_{n,\xi'_n}(X_i)]\big\} = \frac1{2^{2B_n}}\sum_{\xi_n,\xi'_n\in\{\pm1\}^{B_n}}\bigg(1+\tau_n^2\sum_{k=1}^{B_n}\xi_{nk}\xi'_{nk}\bigg)^n$$
$$\le \frac1{2^{2B_n}}\sum_{\xi_n,\xi'_n\in\{\pm1\}^{B_n}}\exp\bigg(n\tau_n^2\sum_{k=1}^{B_n}\xi_{nk}\xi'_{nk}\bigg) = \bigg\{\frac{\exp(n\tau_n^2)+\exp(-n\tau_n^2)}2\bigg\}^{B_n} \le \exp\Big(\frac12B_nn^2\tau_n^4\Big),$$
where the last inequality is ensured by the fact that
$$\cosh(t) \le \exp\Big(\frac{t^2}2\Big), \qquad \forall\,t\in\mathbb R.$$
See, e.g., Baraud (2002). With the particular choice of $B_n$, $\tau_n$, and the conditions on $\Delta_n$, this immediately implies (6.11).
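The chain of bounds above can be verified exactly for small $B_n$ and $n$ by brute-force enumeration over the sign vectors; the values below are arbitrary illustrative choices of ours, with `tau2` playing the role of $\tau_n^2$.

```python
import itertools, math

B, n, tau2 = 4, 6, 0.1   # small illustrative values
second_moment = 0.0
for xi in itertools.product((-1, 1), repeat=B):
    for xi_p in itertools.product((-1, 1), repeat=B):
        inner = sum(a * b for a, b in zip(xi, xi_p))
        second_moment += (1 + tau2 * inner) ** n
second_moment /= 2 ** (2 * B)                     # the exact chi-square-type second moment
cosh_bound = ((math.exp(n * tau2) + math.exp(-n * tau2)) / 2) ** B
final_bound = math.exp(0.5 * B * (n * tau2) ** 2)  # via cosh(t) <= exp(t^2/2)
```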
The case of $\theta > 0$. The main idea is similar to before. To find a set of probability measures in $\mathcal P(\Delta_n,\theta)$, we appeal to the following lemma.

Lemma 4. Let $u = \sum_ka_k\varphi_k$. If
$$\sup_{B\ge1}\bigg(\sum_{k=1}^B\lambda_k^{-1}a_k^2\bigg)^{1/\theta}\bigg(\sum_{k\ge B}a_k^2\bigg) \le M^2,$$
then $u\in\mathcal F(\theta,M)$.

Similar to before, we shall now take $B_n = \lfloor C_{10}\Delta_n^{-\frac{\theta+1}s}\rfloor$ and $\tau_n = \sqrt{\Delta_n^2/B_n}$. By Lemma 4, we can find $P_{n,\xi_n}\in\mathcal P(\Delta_n,\theta)$ such that $dP_{n,\xi_n}/dP_0 = 1+u_{n,\xi_n}$, for appropriately chosen $C_{10}$. Following the same argument as in the previous case, we can again verify (6.11).
Proof of Theorem 5. Without loss of generality, assume that $\Delta_n(\theta) = c_1\big(n^{-1}\sqrt{\log\log n}\big)^{\frac{2s}{4s+\theta+1}}$ for some constant $c_1 > 0$ to be determined later.

Type I Error. We first prove the first statement, which shows that the Type I error converges to 0. Following the same notation as defined in the proof of Theorem 2, let
$$\gamma_{n,2} = E\bigg\{\sum_{j=2}^nE\big(\bar\zeta_{nj}^2\,\big|\,\mathcal F_{j-1}\big) - 1\bigg\}^2, \qquad \delta_{n,2} = \sum_{j=2}^nE\bar\zeta_{nj}^4,$$
where $\bar\zeta_{nj} = \sqrt2\zeta_{nj}/(n\sqrt{v_n})$. As shown by Haeusler (1988),
$$\sup_t\big|P(T_{n,\varrho_n} > t) - \bar\Phi(t)\big| \le C_{11}(\delta_{n,2}+\gamma_{n,2})^{1/5},$$
where $\bar\Phi(t)$ is the survival function of the standard normal, i.e., $\bar\Phi(t) = P(Z > t)$ where $Z\sim N(0,1)$. Again by the argument from Hall (1984),
$$E\bigg\{\sum_{j=2}^nE\big(\zeta_{nj}^2\,\big|\,\mathcal F_{j-1}\big) - \frac12n(n-1)v_n\bigg\}^2 \le C_{12}\big[n^4EG_n^2(X,X') + n^3E\bar K_n^2(X,X')\bar K_n^2(X,X'')\big],$$
where $G_n(\cdot,\cdot)$ is defined in the proof of Theorem 2, and
$$\sum_{j=2}^nE\zeta_{nj}^4 \le C_{13}\big[n^2E\bar K_n^4(X,X') + n^3E\bar K_n^2(X,X')\bar K_n^2(X,X'')\big],$$
which ensures
$$\gamma_{n,2} = \frac{4E\Big\{\sum_{j=2}^nE\big(\zeta_{nj}^2\,\big|\,\mathcal F_{j-1}\big) - \frac12n(n-1)v_n - \frac12nv_n\Big\}^2}{n^4v_n^2} \le 8\max\Big\{C_{12},\frac14\Big\}\bigg\{\frac{EG_n^2(X,X')}{v_n^2} + \frac{E\bar K_n^2(X,X')\bar K_n^2(X,X'')}{nv_n^2} + \frac1{n^2}\bigg\},$$
and
$$\delta_{n,2} = \frac{4\sum_{j=2}^nE\zeta_{nj}^4}{n^4v_n^2} \le 4C_{13}\bigg\{\frac{E\bar K_n^4(X,X')}{n^2v_n^2} + \frac{E\bar K_n^2(X,X')\bar K_n^2(X,X'')}{nv_n^2}\bigg\}.$$
As shown in the proof of Theorem 2,
$$\frac{EG_n^2(X,X')}{v_n^2} \le C_3\varrho_n^{1/s}, \qquad \frac{E\bar K_n^4(X,X')}{n^2v_n^2} \le C_4n^{-2}\varrho_n^{-1/s}, \qquad \frac{E\bar K_n^2(X,X')\bar K_n^2(X,X'')}{nv_n^2} \le C_3n^{-1}.$$
Therefore,
$$\sup_t\big|P(T_{n,\varrho_n} > t) - \bar\Phi(t)\big| \le C_{14}\big(\varrho_n^{\frac1{5s}} + n^{-\frac15} + n^{-\frac25}\varrho_n^{-\frac1{5s}}\big),$$
which implies that
$$P\bigg(\sup_{0\le\ell\le L_n}T_{n,2^\ell\varrho_n^*} > t\bigg) \le L_n\bar\Phi(t) + C_{15}\Big(2^{\frac{L_n}{5s}}(\varrho_n^*)^{\frac1{5s}} + L_nn^{-\frac15} + n^{-\frac25}(\varrho_n^*)^{-\frac1{5s}}\Big), \qquad \forall\,t.$$
It is not hard to see, by the definitions of $L_n$ and $\varrho_n^*$, that
$$2^{L_n}\varrho_n^* \le 2\bigg(\frac{\sqrt{\log\log n}}n\bigg)^{\frac{2s}{4s+1}}$$
and
$$L_n = (\log2)^{-1}\bigg\{2s\log n - \frac{2s}{4s+1}\log n + o(\log n)\bigg\} = (\log2)^{-1}\frac{8s^2}{4s+1}\log n + o(\log n) \asymp \log n.$$
Together with the fact that $\bar\Phi(t)\le\frac12e^{-t^2/2}$ for $t\ge0$, we get
$$P\bigg(\sup_{0\le\ell\le L_n}T_{n,2^\ell\varrho_n^*} > \sqrt{3\log\log n}\bigg) \le C_{16}\bigg\{e^{-\frac32\log\log n}\log n + \bigg(\frac{\sqrt{\log n}}n\bigg)^{\frac2{5(4s+1)}} + n^{-\frac15}\log\log n + n^{-\frac25}\bigg(\frac{\sqrt{\log\log n}}n\bigg)^{-\frac25}\bigg\} \to 0,$$
as $n\to\infty$.
Type II Error. Next consider the Type II error. To this end, write $\varrho_n(\theta) = \big(\frac{\sqrt{\log\log n}}n\big)^{\frac{2s(\theta+1)}{4s+\theta+1}}$. Let
$$\hat\varrho_n(\theta) = \sup_{0\le\ell\le L_n}\big\{2^\ell\varrho_n^* : 2^\ell\varrho_n^* \le \varrho_n(\theta)\big\}.$$
It is clear that $T_n\ge T_{n,\hat\varrho_n(\theta)}$ for any $\theta\ge0$. It therefore suffices to show that
$$\lim_{n\to\infty}\inf_{\theta\ge0}\inf_{P\in\mathcal P(\Delta_n(\theta),\theta)}P\big\{T_{n,\hat\varrho_n(\theta)} \ge \sqrt{3\log\log n}\big\} = 1.$$
By the Markov inequality, this can be accomplished by verifying
$$\inf_{\theta\in[0,\infty)}\inf_{P\in\mathcal P(\Delta_n(\theta),\theta)}E_PT_{n,\hat\varrho_n(\theta)} \ge \tilde c\sqrt{\log\log n} \qquad (6.12)$$
for some $\tilde c > \sqrt3$, and
$$\lim_{n\to\infty}\sup_{\theta\ge0}\sup_{P\in\mathcal P(\Delta_n(\theta),\theta)}\frac{\mathrm{Var}\big(T_{n,\hat\varrho_n(\theta)}\big)}{\big(E_PT_{n,\hat\varrho_n(\theta)}\big)^2} = 0. \qquad (6.13)$$
We now show that both (6.12) and (6.13) hold with
$$\Delta_n(\theta) = c_1\bigg(\frac{\sqrt{\log\log n}}n\bigg)^{\frac{2s}{4s+\theta+1}}$$
for a sufficiently large $c_1 = c_1(s,\tilde c)$.

Note that for all $\theta\in[0,\infty)$,
$$\frac12\varrho_n(\theta) \le \hat\varrho_n(\theta) \le \varrho_n(\theta), \qquad (6.14)$$
which immediately suggests
$$\eta^2_{\hat\varrho_n(\theta)}(P,P_0) \ge \eta^2_{\varrho_n(\theta)}(P,P_0). \qquad (6.15)$$
Following the arguments in the proof of Theorem 3,
$$E_PT_{n,\hat\varrho_n(\theta)} \ge C_{17}n[\hat\varrho_n(\theta)]^{\frac1{2s}}\eta^2_{\hat\varrho_n(\theta)}(P,P_0) \ge 2^{-\frac1{2s}}C_{17}n[\varrho_n(\theta)]^{\frac1{2s}}\eta^2_{\varrho_n(\theta)}(P,P_0),$$
and for all $P\in\mathcal P(\Delta_n(\theta),\theta)$,
$$\eta^2_{\varrho_n(\theta)}(P,P_0) \ge \frac14\|u\|^2_{L_2(P_0)} \qquad (6.16)$$
provided that $\Delta_n(\theta)\ge C'(s)\big(\frac{\sqrt{\log\log n}}n\big)^{\frac{2s}{4s+\theta+1}}$. Therefore,
$$\inf_{P\in\mathcal P(\Delta_n(\theta),\theta)}E_PT_{n,\hat\varrho_n(\theta)} \ge C_{18}n[\varrho_n(\theta)]^{\frac1{2s}}\Delta_n^2(\theta) \ge C_{18}c_1\sqrt{\log\log n} \ge \tilde c\sqrt{\log\log n}$$
if $c_1\ge C_{18}^{-1}\tilde c$. Hence to ensure that (6.12) holds, it suffices to take
$$c_1 = \max\{C'(s),\,C_{18}^{-1}\tilde c\}.$$
With (6.14), (6.15) and (6.16), the results in the proof of Theorem 3 imply that for sufficiently large $n$,
$$\sup_{P\in\mathcal P(\Delta_n(\theta),\theta)}\frac{\mathrm{Var}\big(T_{n,\hat\varrho_n(\theta)}\big)}{\big(E_PT_{n,\hat\varrho_n(\theta)}\big)^2} \le C_{19}\Big\{\big([\varrho_n(\theta)]^{\frac1{2s}}n\Delta_n^2(\theta)\big)^{-2} + \big([\varrho_n(\theta)]^{\frac1s}n^2\Delta_n^2(\theta)\big)^{-1} + \big(n\Delta_n^2(\theta)\big)^{-1} + \big([\varrho_n(\theta)]^{\frac1{2s}}n\Delta_n^2(\theta)\big)^{-1}\Big\}$$
$$\le 2C_{19}\big([\varrho_n(\theta)]^{\frac1{2s}}n\Delta_n^2(\theta)\big)^{-1} \lesssim (c_1^2\log\log n)^{-\frac12} \to 0,$$
which shows (6.13).
Proof of Theorem 6. The main idea of the proof is similar to that for Theorem 4. Nevertheless, in order to show that
$$\inf_{\Phi_n}\Big[E_{P_0}\Phi_n + \sup_{\theta\in[\theta_1,\theta_2]}\beta(\Phi_n;\Delta_n(\theta),\theta)\Big]$$
converges to 1, rather than merely being bounded away from 0, we need to find $P_\pi$, the marginal distribution on $\mathcal X^n$ with conditional distribution selected from
$$\big\{P^{\otimes n} : P\in\cup_{\theta\in[\theta_1,\theta_2]}\mathcal P(\Delta_n(\theta),\theta)\big\}$$
and prior distribution $\pi$ on $\cup_{\theta\in[\theta_1,\theta_2]}\mathcal P(\Delta_n(\theta),\theta)$, such that the $\chi^2$ distance between $P_\pi$ and $P_0^{\otimes n}$ converges to 0. See Ingster (2000).
To this end, assume, without loss of generality, that
$$\Delta_n(\theta) = c_2\bigg(\frac n{\sqrt{\log\log n}}\bigg)^{-\frac{2s}{4s+\theta+1}}, \qquad \forall\,\theta\in[\theta_1,\theta_2],$$
where $c_2 > 0$ is a sufficiently small constant to be determined later.

Let $m_n = \lfloor C_{20}\log n\rfloor$ and $B_{n,1} = \lfloor C_{21}[\Delta_n(\theta_1)]^{-\frac{\theta_1+1}s}\rfloor$ for sufficiently small $C_{20},C_{21} > 0$. Set
$\theta_{n,1} = \theta_1$. For $2\le j\le m_n$, let
$$B_{n,j} = 2^{j-2}B_{n,1}$$
and let $\theta_{n,j}$ be selected such that the following equation holds:
$$B_{n,j} = \Big\lfloor C_{21}[\Delta_n(\theta_{n,j})]^{-\frac{\theta_{n,j}+1}s}\Big\rfloor.$$
Note that by choosing $C_{20}$ sufficiently small,
$$B_{n,m_n} = 2^{m_n-2}B_{n,1} \le \bigg\lfloor C_{21}\exp\bigg(\log\Big(\frac n{\sqrt{\log\log n}}\Big)\cdot\frac{2(\theta_1+1)}{4s+\theta_1+1} + (m_n-2)\log2\bigg)\bigg\rfloor$$
$$\le \bigg\lfloor C_{21}\exp\bigg(\log\Big(\frac n{\sqrt{\log\log n}}\Big)\cdot\frac{2(\theta_2+1)}{4s+\theta_2+1}\bigg)\bigg\rfloor = \Big\lfloor C_{21}[\Delta_n(\theta_2)]^{-\frac{\theta_2+1}s}\Big\rfloor$$
for sufficiently large $n$. Thus, we can guarantee that $\theta_{n,j}\in[\theta_1,\theta_2]$ for all $1\le j\le m_n$.
We now construct a finite subset of $\cup_{\theta\in[\theta_1,\theta_2]}\mathcal P(\Delta_n(\theta),\theta)$ as follows. Let $B^*_{n,0} = 0$ and $B^*_{n,j} = B_{n,1}+\cdots+B_{n,j}$ for $j\ge1$. For each $\xi_{n,j} = (\xi_{n,j,1},\cdots,\xi_{n,j,B_{n,j}})\in\{\pm1\}^{B_{n,j}}$, let
$$f_{n,j,\xi_{n,j}} = 1 + \sum_{k=B^*_{n,j-1}+1}^{B^*_{n,j}}\tau_{n,j}\,\xi_{n,j,k-B^*_{n,j-1}}\varphi_k,$$
with $\tau_{n,j} = \sqrt{\Delta_n^2(\theta_{n,j})/B_{n,j}}$. Following the same argument as that in the proof of Theorem 4, we can verify that, with a sufficiently small $C_{21}$, each $P_{n,j,\xi_{n,j}}\in\mathcal P(\Delta_n(\theta_{n,j}),\theta_{n,j})$, where $f_{n,j,\xi_{n,j}}$ is the Radon-Nikodym derivative $dP_{n,j,\xi_{n,j}}/dP_0$. With slight abuse of notation, write
$$f_n(x_1,x_2,\cdots,x_n) = \frac1{m_n}\sum_{j=1}^{m_n}f_{n,j}(x_1,x_2,\cdots,x_n),$$
where
$$f_{n,j}(x_1,x_2,\cdots,x_n) = \frac1{2^{B_{n,j}}}\sum_{\xi_{n,j}\in\{\pm1\}^{B_{n,j}}}\prod_{i=1}^nf_{n,j,\xi_{n,j}}(x_i).$$
It now suffices to show that
$$\|f_n-1\|^2_{L_2(P_0)} = \|f_n\|^2_{L_2(P_0)} - 1 \to 0, \quad \text{as } n\to\infty,$$
where $\|f_n\|^2_{L_2(P_0)} = E_{P_0}f_n^2(X_1,X_2,\cdots,X_n)$. Note that
$$\|f_n\|^2_{L_2(P_0)} = \frac1{m_n^2}\sum_{1\le j,j'\le m_n}\langle f_{n,j},f_{n,j'}\rangle_{L_2(P_0)} = \frac1{m_n^2}\sum_{1\le j\le m_n}\|f_{n,j}\|^2_{L_2(P_0)} + \frac1{m_n^2}\sum_{\substack{1\le j,j'\le m_n\\ j\ne j'}}\langle f_{n,j},f_{n,j'}\rangle_{L_2(P_0)}.$$
It is easy to verify that, for any $j\ne j'$,
$$\langle f_{n,j},f_{n,j'}\rangle_{L_2(P_0)} = 1.$$
It therefore suffices to show that
$$\sum_{1\le j\le m_n}\|f_{n,j}\|^2_{L_2(P_0)} = o(m_n^2).$$
Following the same derivation as that in the proof of Theorem 4, we can show that
$$\|f_{n,j}\|^2_{L_2(P_0)} \le \bigg(\frac{\exp(n\tau_{n,j}^2)+\exp(-n\tau_{n,j}^2)}2\bigg)^{B_{n,j}} \le \exp\Big(\frac12B_{n,j}n^2\tau_{n,j}^4\Big)$$
for sufficiently large $n$. By setting $c_2$ in the expression of $\Delta_n(\theta)$ sufficiently small, we have
$$B_{n,j}n^2\tau_{n,j}^4 \le \log m_n,$$
which ensures that
$$\sum_{1\le j\le m_n}\|f_{n,j}\|^2_{L_2(P_0)} \le m_n^{3/2} = o(m_n^2).$$
Proof of Theorem 7. We begin with (3.2). Note that $\gamma^2_{\nu_n}(\hat P_n,P_0)$ is a U-statistic, so we can apply the general techniques for U-statistics to establish its asymptotic normality. In particular, as shown in Hall (1984), it suffices to verify the following four conditions:
$$\Big(\frac{2\nu_n}\pi\Big)^{d/2}E\bar G^2_{\nu_n}(X_1,X_2) \to \|p_0\|^2_{L_2}, \qquad (6.17)$$
$$\frac{E\bar G^4_{\nu_n}(X_1,X_2)}{n^2\big[E\bar G^2_{\nu_n}(X_1,X_2)\big]^2} \to 0, \qquad (6.18)$$
$$\frac{E\big[\bar G^2_{\nu_n}(X_1,X_2)\bar G^2_{\nu_n}(X_1,X_3)\big]}{n\big[E\bar G^2_{\nu_n}(X_1,X_2)\big]^2} \to 0, \qquad (6.19)$$
$$\frac{EH^2_{\nu_n}(X_1,X_2)}{\big[E\bar G^2_{\nu_n}(X_1,X_2)\big]^2} \to 0, \qquad (6.20)$$
as $n\to\infty$, where
$$H_{\nu_n}(x,y) = E\bar G_{\nu_n}(x,X_3)\bar G_{\nu_n}(y,X_3), \qquad \forall\,x,y\in\mathbb R^d.$$
Verifying Condition (6.17). Note that
$$E\bar G^2_{\nu_n}(X_1,X_2) = EG^2_{\nu_n}(X_1,X_2) - 2E\big\{E[G_{\nu_n}(X_1,X_2)\,|\,X_1]\big\}^2 + \big[EG_{\nu_n}(X_1,X_2)\big]^2.$$
By Lemma 7,
$$EG_{\nu_n}(X_1,X_2) = \Big(\frac\pi{\nu_n}\Big)^{\frac d2}\int\exp\Big(-\frac{\|\omega\|^2}{4\nu_n}\Big)\big|\mathcal Fp_0(\omega)\big|^2\,d\omega,$$
which immediately yields
$$\Big(\frac{\nu_n}\pi\Big)^{\frac d2}EG_{\nu_n}(X_1,X_2) \to \|p_0\|^2_{L_2}$$
and, since $G^2_{\nu_n} = G_{2\nu_n}$,
$$\Big(\frac{2\nu_n}\pi\Big)^{\frac d2}EG^2_{\nu_n}(X_1,X_2) = \Big(\frac{2\nu_n}\pi\Big)^{\frac d2}EG_{2\nu_n}(X_1,X_2) \to \|p_0\|^2_{L_2},$$
as $\nu_n\to\infty$. On the other hand,
$$E\big\{E[G_{\nu_n}(X_1,X_2)\,|\,X_1]\big\}^2 = \int\bigg(\int G_{\nu_n}(x,x')G_{\nu_n}(x,x'')p_0(x)\,dx\bigg)p_0(x')p_0(x'')\,dx'dx''$$
$$= \int\bigg(\int G^2_{\nu_n}\big(x,(x'+x'')/2\big)p_0(x)\,dx\bigg)G_{\nu_n/2}(x',x'')p_0(x')p_0(x'')\,dx'dx''.$$
Let $W\sim N(0,4\nu_nI_d)$. Then
$$\int G^2_{\nu_n}\big(x,(x'+x'')/2\big)p_0(x)\,dx = (2\pi)^{d/2}E\bigg[\mathcal Fp_0(W)\exp\bigg(i\Big(\frac{x'+x''}2\Big)^\top W\bigg)\bigg] \le (2\pi)^{d/2}\sqrt{E\big|\mathcal Fp_0(W)\big|^2} \lesssim_d \|p_0\|_{L_2}\big/\nu_n^{d/4}.$$
Thus
$$E\big\{E[G_{\nu_n}(X_1,X_2)\,|\,X_1]\big\}^2 \lesssim_d \|p_0\|^3_{L_2}\big/\nu_n^{3d/4}.$$
Condition (6.17) then follows.
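For a concrete check of the scaling $(\nu_n/\pi)^{d/2}EG_{\nu_n}(X_1,X_2)\to\|p_0\|^2_{L_2}$ above, one may take $d = 1$ and $p_0$ the standard normal density, for which everything is available in closed form: $X_1-X_2\sim N(0,2)$, so $EG_\nu(X_1,X_2) = E\exp(-\nu(X_1-X_2)^2) = (1+4\nu)^{-1/2}$, while $\|p_0\|^2_{L_2} = 1/(2\sqrt\pi)$. A sketch of this special case:

```python
import math

def scaled_EG(nu):
    # (nu/pi)^{1/2} * E exp(-nu (X1 - X2)^2) for X1, X2 iid N(0,1);
    # X1 - X2 ~ N(0, 2) gives E exp(-nu Z^2) = (1 + 4 nu)^{-1/2}.
    return math.sqrt(nu / math.pi) / math.sqrt(1 + 4 * nu)

limit = 1 / (2 * math.sqrt(math.pi))  # ||p0||_{L2}^2 for the standard normal density
```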
Verifying Conditions (6.18) and (6.19). Since
$$E\bar G^2_{\nu_n}(X_1,X_2) \asymp_{d,p_0} \nu_n^{-d/2}$$
and
$$E\bar G^4_{\nu_n}(X_1,X_2) \lesssim EG^4_{\nu_n}(X_1,X_2) \lesssim_d \nu_n^{-d/2},$$
we obtain
$$n^{-2}E\bar G^4_{\nu_n}(X_1,X_2)\big/\big(E\bar G^2_{\nu_n}(X_1,X_2)\big)^2 \lesssim_{d,p_0} \nu_n^{d/2}/n^2 \to 0.$$
Similarly,
$$E\bar G^2_{\nu_n}(X_1,X_2)\bar G^2_{\nu_n}(X_1,X_3) \lesssim EG^2_{\nu_n}(X_1,X_2)G^2_{\nu_n}(X_1,X_3) = EG_{2\nu_n}(X_1,X_2)G_{2\nu_n}(X_1,X_3) \lesssim_{d,p_0} \nu_n^{-3d/4}.$$
This implies
$$n^{-1}E\bar G^2_{\nu_n}(X_1,X_2)\bar G^2_{\nu_n}(X_1,X_3)\big/\big(E\bar G^2_{\nu_n}(X_1,X_2)\big)^2 \lesssim_{d,p_0} \nu_n^{d/4}/n \to 0,$$
which verifies (6.19).
Verifying Condition (6.20). We now prove (6.20). It suffices to show that
$$\nu_n^dE\big(E(\bar G_{\nu_n}(X_1,X_2)\bar G_{\nu_n}(X_1,X_3)\,|\,X_2,X_3)\big)^2 \to 0$$
as $n\to\infty$. Note that
$$E\big(E(\bar G_{\nu_n}(X_1,X_2)\bar G_{\nu_n}(X_1,X_3)\,|\,X_2,X_3)\big)^2 \lesssim E\big(E(G_{\nu_n}(X_1,X_2)G_{\nu_n}(X_1,X_3)\,|\,X_2,X_3)\big)^2$$
$$= EG_{\nu_n}(X_1,X_2)G_{\nu_n}(X_1,X_3)G_{\nu_n}(X_4,X_2)G_{\nu_n}(X_4,X_3)$$
$$= E\big(G_{\nu_n}(X_1,X_4)G_{\nu_n}(X_2,X_3)E(G_{\nu_n}(X_1+X_4,X_2+X_3)\,|\,X_1-X_4,X_2-X_3)\big).$$
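The last equality rests on the elementary identity $\|x_1-x_2\|^2+\|x_1-x_3\|^2+\|x_4-x_2\|^2+\|x_4-x_3\|^2 = \|x_1-x_4\|^2+\|x_2-x_3\|^2+\|(x_1+x_4)-(x_2+x_3)\|^2$, which factors the product of the four Gaussian kernels accordingly. A quick numerical check at arbitrary points:

```python
import numpy as np

rng = np.random.default_rng(1)
x1, x2, x3, x4 = rng.normal(size=(4, 3))   # four arbitrary points in R^3
sq = lambda v: float(np.sum(v ** 2))       # squared Euclidean norm
lhs = sq(x1 - x2) + sq(x1 - x3) + sq(x4 - x2) + sq(x4 - x3)
rhs = sq(x1 - x4) + sq(x2 - x3) + sq((x1 + x4) - (x2 + x3))
```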
Since for any $\delta > 0$,
$$\nu_n^dE\Big(G_{\nu_n}(X_1,X_4)G_{\nu_n}(X_2,X_3)E\big(G_{\nu_n}(X_1+X_4,X_2+X_3)\,\big|\,X_1-X_4,X_2-X_3\big)\big(1_{\{\|X_1-X_4\|>\delta\}}+1_{\{\|X_2-X_3\|>\delta\}}\big)\Big) \to 0,$$
it remains to show that
$$\nu_n^dE\Big(G_{\nu_n}(X_1,X_4)G_{\nu_n}(X_2,X_3)E\big(G_{\nu_n}(X_1+X_4,X_2+X_3)\,\big|\,X_1-X_4,X_2-X_3\big)1_{\{\|X_1-X_4\|\le\delta,\,\|X_2-X_3\|\le\delta\}}\Big) \to 0$$
for some $\delta > 0$, which holds as long as
$$E\big(G_{\nu_n}(X_1+X_4,X_2+X_3)\,\big|\,X_1-X_4,X_2-X_3\big) \to 0 \qquad (6.21)$$
uniformly on $\{\|X_1-X_4\|\le\delta,\ \|X_2-X_3\|\le\delta\}$. Let
$$Y_1 = X_1-X_4, \quad Y_2 = X_2-X_3, \quad Y_3 = X_1+X_4, \quad Y_4 = X_2+X_3.$$
Then
$$E\big(G_{\nu_n}(X_1+X_4,X_2+X_3)\,\big|\,X_1-X_4,X_2-X_3\big) = \Big(\frac\pi{\nu_n}\Big)^{\frac d2}\int\exp\Big(-\frac{\|\omega\|^2}{4\nu_n}\Big)\mathcal Fq_{Y_1}(\omega)\overline{\mathcal Fq_{Y_2}(\omega)}\,d\omega$$
$$\le \sqrt{\Big(\frac\pi{\nu_n}\Big)^{\frac d2}\int\exp\Big(-\frac{\|\omega\|^2}{4\nu_n}\Big)\big|\mathcal Fq_{Y_1}(\omega)\big|^2\,d\omega}\cdot\sqrt{\Big(\frac\pi{\nu_n}\Big)^{\frac d2}\int\exp\Big(-\frac{\|\omega\|^2}{4\nu_n}\Big)\big|\mathcal Fq_{Y_2}(\omega)\big|^2\,d\omega},$$
where
$$q_y(y') = \frac{p(Y_1=y,\,Y_3=y')}{p(Y_1=y)} = \frac{p_0\big(\frac{y+y'}2\big)p_0\big(\frac{y'-y}2\big)}{\int p_0\big(\frac{y+y'}2\big)p_0\big(\frac{y'-y}2\big)\,dy'}$$
is the conditional density of $Y_3$ given $Y_1 = y$. Thus to prove (6.21), it suffices to show that
$$h_n(y) := \Big(\frac\pi{\nu_n}\Big)^{\frac d2}\int\exp\Big(-\frac{\|\omega\|^2}{4\nu_n}\Big)\big|\mathcal Fq_y(\omega)\big|^2\,d\omega = \pi^{\frac d2}\int\exp\Big(-\frac{\|\omega\|^2}4\Big)\big|\mathcal Fq_y(\sqrt{\nu_n}\,\omega)\big|^2\,d\omega \to 0$$
uniformly over $\{y : \|y\|\le\delta\}$.
Note that
$$h_n(y) = EG_{\nu_n}(Z,Z'),$$
where $Z,Z'\overset{\rm i.i.d.}\sim q_y$, which suggests that $h_n(y)\to0$ pointwise. To prove the uniform convergence of $h_n(y)$, we only need to show that
$$\lim_{y_1\to y}\sup_n|h_n(y_1)-h_n(y)| = 0$$
for any $y$. Since $p_0\in L_2$, the density of $Y_1$ is continuous. Therefore, the almost sure continuity of $p_0$ immediately suggests that for every $y$, $q_{y_1}(\cdot)\to q_y(\cdot)$ almost surely as $y_1\to y$. Considering that $q_{y_1}$
and $q_y$ are both densities, it follows that
$$\big|\mathcal Fq_{y_1}(\omega)-\mathcal Fq_y(\omega)\big| \le (2\pi)^{-d/2}\int\big|q_{y_1}(y')-q_y(y')\big|\,dy' \to 0,$$
i.e., $\mathcal Fq_{y_1}\to\mathcal Fq_y$ uniformly as $y_1\to y$. Therefore we have
$$\sup_n|h_n(y_1)-h_n(y)| \lesssim \big\|\mathcal Fq_{y_1}-\mathcal Fq_y\big\|_{L_\infty} \to 0,$$
which ensures the uniform convergence of $h_n(y)$ over $\{y : \|y\|\le\delta\}$, and hence (6.20).
Indeed, we have shown that
$$\frac{n\gamma^2_{\nu_n}(\hat P_n,P_0)}{\sqrt{2E[\bar G_{\nu_n}(X_1,X_2)]^2}} \xrightarrow{d} N(0,1).$$
By Slutsky's theorem, in order to prove (3.3), it suffices to show that
$$\hat\sigma^2_{n,\nu_n}\big/E[\bar G_{\nu_n}(X_1,X_2)]^2 \to_p 1,$$
which is equivalent to
$$s^2_{n,\nu_n}\big/E[\bar G_{\nu_n}(X_1,X_2)]^2 \to_p 1 \qquad (6.22)$$
since $1/n^2 = o\big(E[\bar G_{\nu_n}(X_1,X_2)]^2\big)$.
It follows from
$$E\big(s^2_{n,\nu_n}\big) = E[\bar G_{\nu_n}(X_1,X_2)]^2$$
and
$$\mathrm{var}\big(s^2_{n,\nu_n}\big) \lesssim n^{-4}\mathrm{var}\bigg(\sum_{1\le i\ne j\le n}G^2_{\nu_n}(X_i,X_j)\bigg) + n^{-6}\mathrm{var}\bigg(\sum_{\substack{1\le i,j_1,j_2\le n\\ |\{i,j_1,j_2\}|=3}}G_{\nu_n}(X_i,X_{j_1})G_{\nu_n}(X_i,X_{j_2})\bigg) + n^{-8}\mathrm{var}\bigg(\sum_{\substack{1\le i_1,i_2,j_1,j_2\le n\\ |\{i_1,i_2,j_1,j_2\}|=4}}G_{\nu_n}(X_{i_1},X_{j_1})G_{\nu_n}(X_{i_2},X_{j_2})\bigg)$$
$$\lesssim n^{-2}EG^4_{\nu_n}(X_1,X_2) + n^{-1}EG^2_{\nu_n}(X_1,X_2)G^2_{\nu_n}(X_1,X_3) + n^{-1}\big(EG^2_{\nu_n}(X_1,X_2)\big)^2 = o\Big(\big(E[\bar G_{\nu_n}(X_1,X_2)]^2\big)^2\Big)$$
that (6.22) holds.
Proof of Theorem 8. Recall that
$$\gamma^2_{\nu_n}(\hat P_n,P_0) = \frac1{n(n-1)}\sum_{i\ne j}\bar G_{\nu_n}(X_i,X_j;P_0)$$
$$= \gamma^2_{\nu_n}(P,P_0) + \frac1{n(n-1)}\sum_{i\ne j}\bar G_{\nu_n}(X_i,X_j;P)$$
$$\quad + \frac2n\sum_{i=1}^n\Big(E_{X\sim P}[G_{\nu_n}(X_i,X)\,|\,X_i] - E_{X\sim P_0}[G_{\nu_n}(X_i,X)\,|\,X_i] - E_{X,X'\sim_{\rm iid}P}G_{\nu_n}(X,X') + E_{(X,Y)\sim P\otimes P_0}G_{\nu_n}(X,Y)\Big).$$
Denote the last two terms on the rightmost hand side by $R^{(1)}_{\nu_n}$ and $R^{(2)}_{\nu_n}$, respectively. It is clear that $ER^{(1)}_{\nu_n} = ER^{(2)}_{\nu_n} = 0$. Then it suffices to show that
$$\sup_{\substack{p\in\mathcal W^{s,2}(M)\\ \|p-p_0\|_{L_2}\ge\Delta_n}}\frac{E\big(R^{(1)}_{\nu_n}\big)^2 + E\big(R^{(2)}_{\nu_n}\big)^2}{\gamma^4_{\nu_n}(P,P_0)} \to 0 \qquad (6.23)$$
and
$$\inf_{\substack{p\in\mathcal W^{s,2}(M)\\ \|p-p_0\|_{L_2}\ge\Delta_n}}\frac{n\gamma^2_{\nu_n}(P,P_0)}{\sqrt{E\big(\hat\sigma^2_{n,\nu_n}\big)}} \to \infty \qquad (6.24)$$
as $n\to\infty$.

We first prove (6.23). Write $f = p - p_0$, and note that $\|p\|_{L_2}\le\|p\|_{\mathcal W^{s,2}}\le M$. Following arguments similar to those in the proof of Theorem 7, we get
$$E\big(R^{(1)}_{\nu_n}\big)^2 \lesssim n^{-2}EG^2_{\nu_n}(X_1,X_2) \lesssim_d M^2n^{-2}\nu_n^{-d/2},$$
and
$$E\big(R^{(2)}_{\nu_n}\big)^2 \le \frac4nE\big[E_{X\sim P}[G_{\nu_n}(X_i,X)\,|\,X_i] - E_{X\sim P_0}[G_{\nu_n}(X_i,X)\,|\,X_i]\big]^2$$
$$= \frac4n\int\bigg(\int G^2_{\nu_n}\big(x,(x'+x'')/2\big)p(x)\,dx\bigg)G_{\nu_n/2}(x',x'')f(x')f(x'')\,dx'dx''$$
$$\lesssim_d \frac M{n\nu_n^{d/4}}\int G_{\nu_n/2}(x',x'')\,|f(x')|\,|f(x'')|\,dx'dx'' \lesssim_d \frac M{n\nu_n^{3d/4}}\|f\|^2_{L_2}.$$
By Lemma 8, there exists a constant $C > 0$ depending on $s$ and $d$ only such that for $f\in\mathcal W^{s,2}(M)$,
$$\int\exp\Big(-\frac{\|\omega\|^2}{4\nu_n}\Big)\big|\mathcal Ff(\omega)\big|^2\,d\omega \ge \frac14\|f\|^2_{L_2}$$
given that $\nu_n\ge C\|f\|^{-2/s}_{L_2}$. Because $\nu_n\Delta_n^{2/s}\to\infty$, we obtain
$$\gamma^2_{\nu_n}(P,P_0) \gtrsim_d \nu_n^{-d/2}\|f\|^2_{L_2}$$
for sufficiently large $n$. Thus
$$\sup_{\substack{p\in\mathcal W^{s,2}(M)\\ \|p-p_0\|_{L_2}\ge\Delta_n}}\frac{E\big(R^{(1)}_{\nu_n}\big)^2}{\gamma^4_{\nu_n}(P,P_0)} \lesssim_d M^2\big(n^2\nu_n^{-d/2}\Delta_n^4\big)^{-1} \to 0$$
and
$$\sup_{\substack{p\in\mathcal W^{s,2}(M)\\ \|p-p_0\|_{L_2}\ge\Delta_n}}\frac{E\big(R^{(2)}_{\nu_n}\big)^2}{\gamma^4_{\nu_n}(P,P_0)} \lesssim_d M\big(n\nu_n^{-d/4}\Delta_n^2\big)^{-1} \to 0,$$
as $n\to\infty$.

Next we prove (6.24). It follows from
$$E\big(\hat\sigma^2_{n,\nu_n}\big) \le E\max\big\{\big|s^2_{n,\nu_n}\big|,\,1/n^2\big\} \lesssim EG^2_{\nu_n}(X_1,X_2) + 1/n^2 \lesssim_d M^2\nu_n^{-d/2} + 1/n^2$$
that (6.24) holds.
Proof of Theorem 9. This result can, in a certain sense, be viewed as an extension of results from Ingster (1987), and the proof proceeds in a similar fashion. While Ingster (1987) considered the case when $p_0$ is the uniform distribution on $[0,1]$, we shall show that similar bounds hold for a wider class of $p_0$.

For any $\epsilon > 0$ and $p_0$ such that $\|p_0\|_{\mathcal W^{s,2}} < M$, let
$$H^{\rm GOF}_1(\Delta_n;s,M-\|p_0\|_{\mathcal W^{s,2}})^\circ := \big\{p\in\mathcal W^{s,2} : \|p-p_0\|_{\mathcal W^{s,2}}\le M-\|p_0\|_{\mathcal W^{s,2}},\ \|p-p_0\|_{L_2}\ge\Delta_n\big\}.$$
It is clear that $H^{\rm GOF}_1(\Delta_n;s) \supseteq H^{\rm GOF}_1(\Delta_n;s,M-\|p_0\|_{\mathcal W^{s,2}})^\circ$. Hence it suffices to prove Theorem 9 with $H^{\rm GOF}_1(\Delta_n;s)$ replaced by $H^{\rm GOF}_1(\Delta_n;s,\epsilon)^\circ$ for an arbitrary $\epsilon > 0$. We shall abbreviate $H^{\rm GOF}_1(\Delta_n;s,\epsilon)^\circ$ as $H^{\rm GOF}_1(\Delta_n;s)^\circ$ in the rest of the proof.
Since $p_0$ is almost surely continuous, there exist $x_0\in\mathbb{R}^d$ and $\delta,c>0$ such that
\[
p_0(x)\ge c>0,\qquad\forall\,\|x-x_0\|\le\delta.
\]
In light of this, we shall assume $p_0(x)\ge c>0$ for all $x\in[0,1]^d$ without loss of generality.

Let $\eta_n$ be a multivariate random index. As proved in Ingster (1987), in order to prove the existence of $\alpha\in(0,1)$ such that no asymptotic $\alpha$-level test can be consistent, it suffices to identify $p_{n,\eta_n}\in H_1^{\mathrm{GOF}}(\Delta_n;s)^\dagger$ for all possible values of $\eta_n$ such that
\[
\mathbb{E}_{p_0}\bigg(\frac{q_n(X_1,\cdots,X_n)}{\prod_{i=1}^n p_0(X_i)}\bigg)^2 = 1+o(1),
\tag{6.25}
\]
where
\[
q_n(x_1,\cdots,x_n)=\mathbb{E}_{\eta_n}\bigg(\prod_{i=1}^n p_{n,\eta_n}(x_i)\bigg),\qquad\forall\,x_1,\cdots,x_n,
\]
i.e., $q_n$ is the mixture of all $p_{n,\eta_n}$'s.

Let $1_{\{x\in[0,1]^d\}},\phi_{n,1},\cdots,\phi_{n,B_n}$ be an orthonormal set of functions in $L_2(\mathbb{R}^d)$ such that the supports of $\phi_{n,1},\cdots,\phi_{n,B_n}$ are disjoint and all included in $[0,1]^d$. Let $\eta_n=(\eta_{n,1},\cdots,\eta_{n,B_n})$ satisfy that $\eta_{n,1},\cdots,\eta_{n,B_n}$ are independent and that
\[
P(\eta_{n,k}=1)=P(\eta_{n,k}=-1)=\tfrac12,\qquad\forall\,1\le k\le B_n.
\]
Define
\[
p_{n,\eta_n}=p_0+\sigma_n\sum_{k=1}^{B_n}\eta_{n,k}\,\phi_{n,k}.
\]
Then
\[
\frac{p_{n,\eta_n}}{p_0}=1+\sigma_n\sum_{k=1}^{B_n}\eta_{n,k}\,\frac{\phi_{n,k}}{p_0},
\]
where $1,\phi_{n,1}/p_0,\cdots,\phi_{n,B_n}/p_0$ are orthogonal in $L_2(p_0)$.
By arguments similar to those in Ingster (1987), we find
\[
\mathbb{E}_{p_0}\bigg(\frac{q_n(X_1,\cdots,X_n)}{\prod_{i=1}^n p_0(X_i)}\bigg)^2
\le\exp\bigg(\frac12 B_n n^2\sigma_n^4\max_{1\le k\le B_n}\Big(\int\phi^2_{n,k}/p_0\,dx\Big)^2\bigg)
\le\exp\Big(\frac{1}{2c^2}B_n n^2\sigma_n^4\Big).
\]
In order to ensure (6.25), it suffices to have
\[
B_n^{1/2}\,n\,\sigma_n^2=o(1).
\tag{6.26}
\]
Therefore, given $\Delta_n=o\big(n^{-\frac{2s}{4s+d}}\big)$, once we can find proper $\sigma_n$, $B_n$ and $\phi_{n,1},\cdots,\phi_{n,B_n}$ such that $p_{n,\eta_n}\in H_1^{\mathrm{GOF}}(\Delta_n;s)^\dagger$ for all $\eta_n$ and (6.26) holds, the proof is finished.
Let $m_n=B_n^{1/d}$, let $\psi$ be an infinitely differentiable function supported on $[0,1]^d$ that is orthogonal to $1_{\{x\in[0,1]^d\}}$ in $L_2$, and for each $x_{n,k}\in\{0,1,\cdots,m_n-1\}^{\otimes d}$, let
\[
\phi_{n,k}(x)=\frac{m_n^{d/2}}{\|\psi\|_{L_2}}\,\psi(m_n x-x_{n,k}),\qquad\forall\,x\in\mathbb{R}^d.
\]
Then all $\phi_{n,k}$'s are supported on $[0,1]^d$ and
\[
\langle\phi_{n,k},1\rangle_{L_2}=\frac{m_n^{d/2}}{\|\psi\|_{L_2}}\int_{\mathbb{R}^d}\psi(m_n x-x_{n,k})\,dx
=\frac{1}{m_n^{d/2}\|\psi\|_{L_2}}\int_{\mathbb{R}^d}\psi(x)\,dx=0,
\]
\[
\|\phi_{n,k}\|^2_{L_2}=\frac{m_n^d}{\|\psi\|^2_{L_2}}\int_{[0,1/m_n]^d}\psi^2(m_n x)\,dx=1,
\]
\[
\|\phi_{n,k}\|^2_{W^{s,2}}\le m_n^{2s}\,\frac{\|\psi\|^2_{W^{s,2}}}{\|\psi\|^2_{L_2}}.
\]
Since for $k\ne k'$ the supports of $\phi_{n,k}$ and $\phi_{n,k'}$ are disjoint,
\[
\|p_{n,\eta_n}-p_0\|_\infty=\frac{\sigma_n m_n^{d/2}\,\|\psi\|_\infty}{\|\psi\|_{L_2}},
\]
and
\[
\langle\phi_{n,k},\phi_{n,k'}\rangle_{L_2}=0,\qquad\langle\phi_{n,k},\phi_{n,k'}\rangle_{W^{s,2}}=0,
\]
from which we immediately obtain
\[
\|p_{n,\eta_n}-p_0\|^2_{L_2}=\sigma_n^2 m_n^d,
\qquad
\|p_{n,\eta_n}-p_0\|^2_{W^{s,2}}\le\sigma_n^2 m_n^{d+2s}\,\frac{\|\psi\|^2_{W^{s,2}}}{\|\psi\|^2_{L_2}}.
\]
To ensure $p_{n,\eta_n}\in H_1^{\mathrm{GOF}}(\Delta_n;s)^\dagger$, it suffices to make
\[
\frac{\sigma_n m_n^{d/2}\,\|\psi\|_\infty}{\|\psi\|_{L_2}}\to0\ \text{ as }\ n\to\infty,
\tag{6.27}
\]
\[
\sigma_n^2 m_n^d=\Delta_n^2,
\tag{6.28}
\]
\[
\sigma_n^2 m_n^{d+2s}\,\frac{\|\psi\|^2_{W^{s,2}}}{\|\psi\|^2_{L_2}}\le\varsigma^2.
\tag{6.29}
\]
Let
\[
m_n=\bigg(\frac{\varsigma\,\|\psi\|_{L_2}}{\|\psi\|_{W^{s,2}}}\bigg)^{1/s}\Delta_n^{-1/s},
\qquad
\sigma_n=\frac{\Delta_n}{m_n^{d/2}}.
\]
Then (6.28) and (6.29) are satisfied. Moreover, given $\Delta_n=o\big(n^{-\frac{2s}{4s+d}}\big)$,
\[
B_n^{1/2}\,n\,\sigma_n^2=m_n^{-d/2}\,n\,\Delta_n^2\lesssim_{\psi,\varsigma,s}\ n\,\Delta_n^{\frac{4s+d}{2s}}=o(1),
\]
and
\[
\frac{\sigma_n m_n^{d/2}\,\|\psi\|_\infty}{\|\psi\|_{L_2}}\lesssim_\psi\Delta_n=o(1),
\]
ensuring both (6.26) and (6.27).
Finally, we show the existence of such $\psi$. Let
\[
\psi_0(x_1)=
\begin{cases}
\exp\Big(-\dfrac{1}{1-(4x_1-1)^2}\Big) & 0<x_1<\tfrac12,\\[2mm]
-\exp\Big(-\dfrac{1}{1-(4x_1-3)^2}\Big) & \tfrac12<x_1<1,\\[2mm]
0 & \text{otherwise}.
\end{cases}
\]
Then $\psi_0$ is supported on $[0,1]$, infinitely differentiable and orthogonal to the indicator function of $[0,1]$.

Let
\[
\psi(x)=\prod_{j=1}^d\psi_0(x_j),\qquad\forall\,x=(x_1,\cdots,x_d)\in\mathbb{R}^d.
\]
Then $\psi$ is supported on $[0,1]^d$, infinitely differentiable and $\langle\psi,1\rangle_{L_2}=\langle\psi_0,1\rangle^d_{L_2[0,1]}=0$.
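The two defining properties of $\psi_0$ (compact support in $[0,1]$ and orthogonality to the indicator of $[0,1]$) are easy to check numerically. The following is a sketch, not part of the proof; the function name `psi0` is ours:

```python
import math

def psi0(x1):
    """The smooth bump used in the lower-bound construction: a positive
    C^infinity bump on (0, 1/2), its mirrored negative on (1/2, 1),
    and zero elsewhere."""
    if 0.0 < x1 < 0.5:
        t = 4.0 * x1 - 1.0
        return math.exp(-1.0 / (1.0 - t * t))
    if 0.5 < x1 < 1.0:
        t = 4.0 * x1 - 3.0
        return -math.exp(-1.0 / (1.0 - t * t))
    return 0.0

# Midpoint-rule approximation of <psi0, 1>_{L2[0,1]}; by the antisymmetry
# of psi0 about x = 1/2 this is zero up to floating-point cancellation.
n_grid = 20000
integral = sum(psi0((k + 0.5) / n_grid) for k in range(n_grid)) / n_grid
```

The antisymmetry $\psi_0(1-x)=-\psi_0(x)$ is what makes the orthogonality exact, which is why a symmetric midpoint grid reproduces it to near machine precision.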
Proof of Theorem 10. Let $N=n+m$ denote the total sample size. It suffices to prove the result under the assumption that $n/N\to\rho\in(0,1)$.

Note that under $H_0$,
\[
\hat\gamma^2_{\nu_N}(\mathbb{P},\mathbb{Q})
=\frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}\bar G_{\nu_N}(X_i,X_j)
+\frac{1}{m(m-1)}\sum_{1\le i\ne j\le m}\bar G_{\nu_N}(Y_i,Y_j)
-\frac{2}{nm}\sum_{1\le i\le n}\sum_{1\le j\le m}\bar G_{\nu_N}(X_i,Y_j).
\]
Let $n/N=\rho_N$. Then we have
\[
\hat\gamma^2_{\nu_N}(\mathbb{P},\mathbb{Q})
=N^{-2}\bigg(\frac{1}{\rho_N(\rho_N-N^{-1})}\sum_{1\le i\ne j\le n}\bar G_{\nu_N}(X_i,X_j)
+\frac{1}{(1-\rho_N)(1-\rho_N-N^{-1})}\sum_{1\le i\ne j\le m}\bar G_{\nu_N}(Y_i,Y_j)
-\frac{2}{\rho_N(1-\rho_N)}\sum_{1\le i\le n}\sum_{1\le j\le m}\bar G_{\nu_N}(X_i,Y_j)\bigg).
\]
Let
\[
\hat\gamma^2_{\nu_N}(\mathbb{P},\mathbb{Q})'
=N^{-2}\bigg(\frac{1}{\rho^2}\sum_{1\le i\ne j\le n}\bar G_{\nu_N}(X_i,X_j)
+\frac{1}{(1-\rho)^2}\sum_{1\le i\ne j\le m}\bar G_{\nu_N}(Y_i,Y_j)
-\frac{2}{\rho(1-\rho)}\sum_{1\le i\le n}\sum_{1\le j\le m}\bar G_{\nu_N}(X_i,Y_j)\bigg).
\]
As we assume $\rho_N\to\rho$ as $N\to\infty$, Theorem 7 ensures that
\[
\frac{nm}{\sqrt2\,(n+m)}\big[\mathbb{E}\bar G^2_{\nu_N}(X_1,X_2)\big]^{-\frac12}\big(\hat\gamma^2_{\nu_N}(\mathbb{P},\mathbb{Q})-\hat\gamma^2_{\nu_N}(\mathbb{P},\mathbb{Q})'\big)=o_p(1).
\]
A slight adaptation of arguments in Hall (1984) suggests that
\[
\frac{\mathbb{E}\bar G^4_{\nu_N}(X_1,X_2)}{N^2\big[\mathbb{E}\bar G^2_{\nu_N}(X_1,X_2)\big]^2}
+\frac{\mathbb{E}\bar G^2_{\nu_N}(X_1,X_2)\bar G^2_{\nu_N}(X_1,X_3)}{N\big[\mathbb{E}\bar G^2_{\nu_N}(X_1,X_2)\big]^2}
+\frac{\mathbb{E}H^2_{\nu_N}(X_1,X_2)}{\big[\mathbb{E}\bar G^2_{\nu_N}(X_1,X_2)\big]^2}\to0,
\tag{6.30}
\]
where $H_{\nu_N}(x,y)=\mathbb{E}\,\bar G_{\nu_N}(x,X)\bar G_{\nu_N}(X,y)$, ensures that
\[
\frac{nm}{\sqrt2\,(n+m)}\big[\mathbb{E}\bar G^2_{\nu_N}(X_1,X_2)\big]^{-\frac12}\,\hat\gamma^2_{\nu_N}(\mathbb{P},\mathbb{Q})'\to_d N(0,1).
\]
Following arguments similar to those in the proof of Theorem 7, given $\nu_N\to\infty$ and $\nu_N/N^{4/d}\to0$, (6.30) holds and therefore
\[
\frac{nm}{\sqrt2\,(n+m)}\big[\mathbb{E}\bar G^2_{\nu_N}(X_1,X_2)\big]^{-\frac12}\,\hat\gamma^2_{\nu_N}(\mathbb{P},\mathbb{Q})\to_d N(0,1).
\]
Additionally, based on the same arguments as in the proof of Theorem 7,
\[
\hat s^2_{n,m,\nu_N}\big/\mathbb{E}[\bar G_{\nu_N}(X_1,X_2)]^2\to_p1.
\]
The proof is therefore concluded.
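The statistic studied above is the unbiased squared MMD with Gaussian kernel $G_\nu(x,y)=\exp(-\nu\|x-y\|^2)$: two within-sample U-statistics minus twice the cross-sample mean. A minimal computational sketch (Python/NumPy; the function names are ours, not the thesis notation, and the studentization from Theorem 10 is deliberately omitted):

```python
import numpy as np

def gaussian_kernel(X, Y, nu):
    """G_nu(x, y) = exp(-nu * ||x - y||^2), evaluated on all pairs."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-nu * sq)

def mmd2_unbiased(X, Y, nu):
    """Unbiased estimate of gamma^2_nu(P, Q): within-sample U-statistics
    (diagonal removed) minus twice the cross-sample mean."""
    n, m = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, nu)
    Kyy = gaussian_kernel(Y, Y, nu)
    Kxy = gaussian_kernel(X, Y, nu)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))  # i != j only
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * Kxy.mean()
```

Under $H_0$ the statistic is degenerate; per Theorem 10 it is the quantity $\frac{nm}{\sqrt2(n+m)}[\mathbb{E}\bar G^2_{\nu_N}]^{-1/2}\hat\gamma^2_{\nu_N}$, with the variance term replaced by its estimate, that is asymptotically standard normal.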
Proof of Theorem 11. With slight abuse of notation, we shall write
\[
\bar G_{\nu_N}(x,y;\mathbb{P},\mathbb{Q})=G_{\nu_N}(x,y)-\mathbb{E}_{Y\sim\mathbb{Q}}G_{\nu_N}(x,Y)-\mathbb{E}_{X\sim\mathbb{P}}G_{\nu_N}(X,y)+\mathbb{E}_{(X,Y)\sim\mathbb{P}\otimes\mathbb{Q}}G_{\nu_N}(X,Y).
\]
We consider the two parts separately.

Part (i). We first verify the consistency of $\Phi^{\mathrm{HOM}}_{N,\nu_N,\alpha}$ with $\nu_N\asymp N^{4/(d+4s)}$ given $\Delta_N\gg N^{-2s/(d+4s)}$. Observe the following decomposition of $\hat\gamma^2_{\nu_N}(\mathbb{P},\mathbb{Q})$:
\[
\hat\gamma^2_{\nu_N}(\mathbb{P},\mathbb{Q})=\gamma^2_{\nu_N}(\mathbb{P},\mathbb{Q})+\delta^{(1)}_{N,\nu_N}+\delta^{(2)}_{N,\nu_N},
\]
where
\[
\delta^{(1)}_{N,\nu_N}=\frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}\bar G_{\nu_N}(X_i,X_j;\mathbb{P})
-\frac{2}{nm}\sum_{1\le i\le n}\sum_{1\le j\le m}\bar G_{\nu_N}(X_i,Y_j;\mathbb{P},\mathbb{Q})
+\frac{1}{m(m-1)}\sum_{1\le i\ne j\le m}\bar G_{\nu_N}(Y_i,Y_j;\mathbb{Q})
\]
and
\[
\delta^{(2)}_{N,\nu_N}=\frac{2}{n}\sum_{i=1}^n\Big(\mathbb{E}[G_{\nu_N}(X_i,X)\mid X_i]-\mathbb{E}G_{\nu_N}(X,X')-\mathbb{E}[G_{\nu_N}(X_i,Y)\mid X_i]+\mathbb{E}G_{\nu_N}(X,Y)\Big)
\]
\[
+\ \frac{2}{m}\sum_{j=1}^m\Big(\mathbb{E}[G_{\nu_N}(Y_j,Y)\mid Y_j]-\mathbb{E}G_{\nu_N}(Y,Y')-\mathbb{E}[G_{\nu_N}(X,Y_j)\mid Y_j]+\mathbb{E}G_{\nu_N}(X,Y)\Big).
\]
In order to prove the consistency of $\Phi^{\mathrm{HOM}}_{N,\nu_N,\alpha}$, it suffices to show
\[
\sup_{\substack{p,q\in W^{s,2}(M)\\ \|p-q\|_{L_2}\ge\Delta_N}}
\frac{\mathbb{E}\big(\delta^{(1)}_{N,\nu_N}\big)^2+\mathbb{E}\big(\delta^{(2)}_{N,\nu_N}\big)^2}{\gamma^4_{\nu_N}(\mathbb{P},\mathbb{Q})}\to0,
\tag{6.31}
\]
\[
\inf_{\substack{p,q\in W^{s,2}(M)\\ \|p-q\|_{L_2}\ge\Delta_N}}
\frac{\gamma^2_{\nu_N}(\mathbb{P},\mathbb{Q})}{(1/n+1/m)\sqrt{\mathbb{E}\big(\hat s^2_{n,m,\nu_N}\big)}}\to\infty,
\tag{6.32}
\]
as $N\to\infty$. We now prove (6.31) and (6.32) with arguments similar to those in the proof of Theorem 8.

Note that
\[
\mathbb{E}\big(\delta^{(1)}_{N,\nu_N}\big)^2
\lesssim\mathbb{E}\bigg(\frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}\bar G_{\nu_N}(X_i,X_j;\mathbb{P})\bigg)^2
+\mathbb{E}\bigg(\frac{2}{nm}\sum_{1\le i\le n}\sum_{1\le j\le m}\bar G_{\nu_N}(X_i,Y_j;\mathbb{P},\mathbb{Q})\bigg)^2
+\mathbb{E}\bigg(\frac{1}{m(m-1)}\sum_{1\le i\ne j\le m}\bar G_{\nu_N}(Y_i,Y_j;\mathbb{Q})\bigg)^2
\]
\[
\lesssim\frac{1}{n^2}\,\mathbb{E}G^2_{\nu_N}(X_1,X_2)+\frac{1}{m^2}\,\mathbb{E}G^2_{\nu_N}(Y_1,Y_2).
\]
Given $p,q\in W^{s,2}(M)$,
\[
\mathbb{E}G^2_{\nu_N}(X_1,X_2)\lesssim_d M^2\nu_N^{-d/2},\qquad
\mathbb{E}G^2_{\nu_N}(Y_1,Y_2)\lesssim_d M^2\nu_N^{-d/2}.
\]
Hence
\[
\mathbb{E}\big(\delta^{(1)}_{N,\nu_N}\big)^2\lesssim_d M^2\nu_N^{-d/2}\Big(\frac{1}{n^2}+\frac{1}{m^2}\Big).
\tag{6.33}
\]
Now consider bounding $\delta^{(2)}_{N,\nu_N}$. Let $u=p-q$. Then we have
\[
\mathbb{E}\big(\delta^{(2)}_{N,\nu_N}\big)^2\lesssim_d \nu_N^{-\frac{3d}{4}}M\,\|u\|^2_{L_2}\Big(\frac1n+\frac1m\Big).
\tag{6.34}
\]
Since $\nu_N\asymp N^{4/(4s+d)}\asymp\Delta_N^{-2/s}$, Lemma 8 ensures that for sufficiently large $N$,
\[
\gamma^2_{\nu_N}(\mathbb{P},\mathbb{Q})\gtrsim_d \nu_N^{-d/2}\,\|u\|^2_{L_2},\qquad\forall\,p,q\in W^{s,2}(M).
\]
This together with (6.33) and (6.34) gives
\[
\sup_{\substack{p,q\in W^{s,2}(M)\\ \|p-q\|_{L_2}\ge\Delta_N}}
\frac{\mathbb{E}\big(\delta^{(1)}_{N,\nu_N}\big)^2+\mathbb{E}\big(\delta^{(2)}_{N,\nu_N}\big)^2}{\gamma^4_{\nu_N}(\mathbb{P},\mathbb{Q})}
\lesssim_d\frac{M^2\nu_N^{d/2}}{N^2\Delta_N^4}+\frac{M\nu_N^{d/4}}{N\Delta_N^2}\to0
\]
as $N\to\infty$, which proves (6.31).

Finally, consider (6.32). It follows from
\[
\mathbb{E}\big(\hat s^2_{n,m,\nu_N}\big)\le\mathbb{E}\max\big\{\big|s^2_{n,m,\nu_N}\big|,\,1/N^2\big\}
\lesssim\max\big\{\mathbb{E}G^2_{\nu_N}(X_1,X_2),\,\mathbb{E}G^2_{\nu_N}(Y_1,Y_2)\big\}+1/N^2
\lesssim_d M^2\nu_N^{-d/2}+1/N^2
\]
that (6.32) holds.
Part (ii). Next, we prove that if $\liminf_{N\to\infty}\Delta_N N^{2s/(d+4s)}<\infty$, then there exists some $\alpha\in(0,1)$ such that no asymptotic $\alpha$-level test can be consistent. To prove this, we shall verify that consistency of the homogeneity test is harder to achieve than that of the goodness-of-fit test.

Consider an arbitrary $q_0\in W^{s,2}(M/2)$. It immediately follows that
\[
H_1^{\mathrm{HOM}}(\Delta_N;s)\supseteq\big\{(p,q_0):p\in H_1^{\mathrm{GOF}}(\Delta_N;s)\big\}.
\]
Let $\{\Phi_N\}_{N\ge1}$ be any sequence of asymptotic $\alpha$-level homogeneity tests, where
\[
\Phi_N=\Phi_N(X_1,\cdots,X_n,Y_1,\cdots,Y_m).
\]
Then if $Y_1,\cdots,Y_m\sim_{\mathrm{iid}}q_0$, $\{\Phi_N\}_{N\ge1}$ can also be treated as a sequence of (random) goodness-of-fit tests
\[
\tilde\Phi_N(X_1,\cdots,X_n)=\Phi_N(X_1,\cdots,X_n,Y_1,\cdots,Y_m)
\]
whose probabilities of type I error with respect to $q_0$ are controlled at $\alpha$ asymptotically. Moreover,
\[
\mathrm{power}\{\Phi_N;H_1^{\mathrm{HOM}}(\Delta_N;s)\}\le\mathrm{power}\{\tilde\Phi_N;H_1^{\mathrm{GOF}}(\Delta_N;s)\}.
\]
Since $0<c\le n/m\le C<\infty$, Theorem 9 ensures that there exists some $\alpha\in(0,1)$ such that for any sequence of asymptotic $\alpha$-level tests $\{\Phi_N\}_{N\ge1}$,
\[
\liminf_{N\to\infty}\mathrm{power}\{\Phi_N;H_1^{\mathrm{HOM}}(\Delta_N;s)\}\le\liminf_{N\to\infty}\mathrm{power}\{\tilde\Phi_N;H_1^{\mathrm{GOF}}(\Delta_N;s)\}<1
\]
given $\liminf_{N\to\infty}\Delta_N N^{2s/(d+4s)}<\infty$.
Proof of Theorem 12. For brevity, we shall focus on the case when $k=2$ in the rest of the proof. Our argument, however, can be straightforwardly extended to the more general cases. The proof relies on the following decomposition of $\hat\gamma^2_{\nu_n}(\mathbb{P},\mathbb{P}^1\otimes\mathbb{P}^2)$ under $H_0^{\mathrm{IND}}$:
\[
\hat\gamma^2_{\nu_n}(\mathbb{P},\mathbb{P}^1\otimes\mathbb{P}^2)=\frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}G^\ast_{\nu_n}(X_i,X_j)+R_n,
\]
where
\[
G^\ast_{\nu_n}(x,y)=\bar G_{\nu_n}(x,y)-\sum_{1\le j\le 2}g_j(x^j,y)-\sum_{1\le j\le 2}g_j(y^j,x)+\sum_{1\le j_1,j_2\le 2}g_{j_1,j_2}(x^{j_1},y^{j_2})
\]
and the remainder $R_n$ satisfies
\[
\mathbb{E}(R_n)^2\lesssim\mathbb{E}G^2_{\nu_n}(X_1,X_2)/n^3\lesssim_d\|p\|^2_{L_2}\,\nu_n^{-d/2}/n^3.
\]
See Appendix B.4 for more details.
Moreover, borrowing arguments in the proof of Lemma 1, we obtain
\[
\mathbb{E}\big(G^\ast_{\nu_n}(X_1,X_2)-\bar G_{\nu_n}(X_1,X_2)\big)^2
\lesssim\sum_{1\le j\le 2}\mathbb{E}\big(g_j(X^j_1,X_2)\big)^2
+\sum_{1\le j_1,j_2\le 2}\mathbb{E}\big(g_{j_1,j_2}(X^{j_1}_1,X^{j_2}_2)\big)^2
\]
\[
\le\sum_{1\le j_1\ne j_2\le 2}\mathbb{E}G^2_{\nu_n}(X^{j_1}_1,X^{j_1}_2)\cdot\mathbb{E}\Big\{\mathbb{E}\big[G_{\nu_n}(X^{j_2}_1,X^{j_2}_2)\,\big|\,X^{j_2}_1\big]\Big\}^2
+\sum_{1\le j_1\ne j_2\le 2}\mathbb{E}G^2_{\nu_n}(X^{j_1}_1,X^{j_1}_2)\,\big[\mathbb{E}G_{\nu_n}(X^{j_2}_1,X^{j_2}_2)\big]^2
\]
\[
+\ 2\,\mathbb{E}\Big\{\mathbb{E}\big[G_{\nu_n}(X^1_1,X^1_2)\,\big|\,X^1_1\big]\Big\}^2\,\mathbb{E}\Big\{\mathbb{E}\big[G_{\nu_n}(X^2_1,X^2_2)\,\big|\,X^2_1\big]\Big\}^2
\]
\[
\lesssim_d \nu_n^{-d_1/2-3d_2/4}\,\|p_1\|^2_{L_2}\|p_2\|^3_{L_2}+\nu_n^{-3d_1/4-d_2/2}\,\|p_1\|^3_{L_2}\|p_2\|^2_{L_2}.
\]
Together with the fact that
\[
(2\nu_n/\pi)^{d/2}\,\mathbb{E}\bar G^2_{\nu_n}(X_1,X_2)\to\|p\|^2_{L_2}
\]
as $\nu_n\to\infty$, we conclude that
\[
\hat\gamma^2_{\nu_n}(\mathbb{P},\mathbb{P}^1\otimes\mathbb{P}^2)=D(\nu_n)+o_p\Big(\sqrt{\mathbb{E}D^2(\nu_n)}\Big),
\]
where
\[
D(\nu_n)=\frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}\bar G_{\nu_n}(X_i,X_j).
\]
Applying arguments similar to those in the proofs of Theorems 7 and 10, we have
\[
\frac{D(\nu_n)}{\sqrt{\mathbb{E}D^2(\nu_n)}}\to_d N(0,1).
\]
Since
\[
\mathbb{E}D^2(\nu_n)=\frac{2}{n(n-1)}\,\mathbb{E}[\bar G_{\nu_n}(X_1,X_2)]^2
\quad\text{and}\quad
\mathbb{E}[\bar G_{\nu_n}(X_1,X_2)]^2\big/\mathbb{E}[G^\ast_{\nu_n}(X_1,X_2)]^2\to1,
\]
it remains to prove
\[
\hat s^2_{n,\nu_n}\big/\mathbb{E}[G^\ast_{\nu_n}(X_1,X_2)]^2\to_p1,
\]
which immediately follows by observing
\[
s^2_{n,\nu_n}\big/\mathbb{E}[G^\ast_{\nu_n}(X_1,X_2)]^2=\prod_{j=1}^2 s^2_{n,j,\nu_n}\big/\mathbb{E}[\bar G_{\nu_n}(X^j_1,X^j_2)]^2\to_p1
\]
and $1/n^2=o\big(\mathbb{E}[G^\ast_{\nu_n}(X_1,X_2)]^2\big)$. The proof is therefore concluded.
Proof of Theorem 13. We prove the two parts separately.

Part (i). The proof of consistency of $\Phi^{\mathrm{IND}}_{n,\nu_n,\alpha}$ is very similar to its counterpart in the proof of Theorem 11. It suffices to show
\[
\sup_{p\in H_1^{\mathrm{IND}}(\Delta_n,s)}
\frac{\mathrm{var}\big(\hat\gamma^2_{\nu_n}(\mathbb{P},\mathbb{P}^1\otimes\mathbb{P}^2)\big)}{\gamma^4_{\nu_n}(\mathbb{P},\mathbb{P}^1\otimes\mathbb{P}^2)}\to0,
\tag{6.35}
\]
\[
\inf_{p\in H_1^{\mathrm{IND}}(\Delta_n,s)}
\frac{n\,\gamma^2_{\nu_n}(\mathbb{P},\mathbb{P}^1\otimes\mathbb{P}^2)}{\mathbb{E}\big(\hat s_{n,\nu_n}\big)}\to\infty,
\tag{6.36}
\]
as $n\to\infty$.

We begin with (6.35). Let $u=p-p_1\otimes p_2$. Lemma 8 then implies that there exists $C=C(s,M)>0$ such that
\[
\gamma^2_{\nu}(\mathbb{P},\mathbb{P}^1\otimes\mathbb{P}^2)\asymp_d\nu^{-d/2}\,\|u\|^2_{L_2}
\]
for $\nu\ge C\|u\|_{L_2}^{-2/s}$, which is satisfied by all $p\in H_1^{\mathrm{IND}}(\Delta_n,s)$ given $\nu=\nu_n$ and $\lim_{n\to\infty}\Delta_n n^{\frac{2s}{4s+d}}=\infty$.

On the other hand, we can still decompose $\hat\gamma^2_{\nu_n}(\mathbb{P},\mathbb{P}^1\otimes\mathbb{P}^2)$ as in Appendix B.4. We follow the same notations here.
Under the alternative hypothesis, the "first order" term
\[
D_1(\nu_n)
=\frac{2}{n}\sum_{1\le i\le n}\Big(\mathbb{E}_{X_i,X\sim_{\mathrm{iid}}\mathbb{P}}[G_{\nu_n}(X_i,X)\mid X_i]-\mathbb{E}_{X,X'\sim_{\mathrm{iid}}\mathbb{P}}G_{\nu_n}(X,X')\Big)
\]
\[
-\ \frac{2}{n}\sum_{1\le i\le n}\Big(\mathbb{E}_{X_i\sim\mathbb{P},\,Y\sim\mathbb{P}^1\otimes\mathbb{P}^2}[G_{\nu_n}(X_i,Y)\mid X_i]-\mathbb{E}_{X\sim\mathbb{P},\,Y\sim\mathbb{P}^1\otimes\mathbb{P}^2}G_{\nu_n}(X,Y)\Big)
\]
\[
-\ \sum_{1\le j\le 2}\bigg(\frac{2}{n}\sum_{1\le i\le n}\Big(\mathbb{E}_{X_i\sim\mathbb{P}^1\otimes\mathbb{P}^2,\,X\sim\mathbb{P}}[G_{\nu_n}(X_i,X)\mid X^j_i]-\mathbb{E}_{X\sim\mathbb{P},\,Y\sim\mathbb{P}^1\otimes\mathbb{P}^2}G_{\nu_n}(X,Y)\Big)\bigg)
\]
\[
+\ \sum_{1\le j\le 2}\bigg(\frac{2}{n}\sum_{1\le i\le n}\Big(\mathbb{E}_{X_i,Y\sim_{\mathrm{iid}}\mathbb{P}^1\otimes\mathbb{P}^2}[G_{\nu_n}(X_i,Y)\mid X^j_i]-\mathbb{E}_{Y,Y'\sim_{\mathrm{iid}}\mathbb{P}^1\otimes\mathbb{P}^2}G_{\nu_n}(Y,Y')\Big)\bigg)
\]
no longer vanishes, but based on arguments similar to those in the proof of Theorem 8,
\[
\mathbb{E}D_1^2(\nu_n)\lesssim_d M\,n^{-1}\nu_n^{-3d/4}\,\|u\|^2_{L_2}.
\]
Moreover, the "second order" term $D_2(\nu_n)$ is no longer solely $\sum_{1\le i\ne j\le n}G^\ast_{\nu_n}(X_i,X_j)/(n(n-1))$, but we still have
\[
\mathbb{E}D_2^2(\nu_n)\lesssim n^{-2}\max\big\{\mathbb{E}G^2_{\nu_n}(X_1,X_2),\ \mathbb{E}G^2_{\nu_n}(X^1_1,X^1_2)\,\mathbb{E}G^2_{\nu_n}(X^2_1,X^2_2)\big\}\lesssim_d M^2 n^{-2}\nu_n^{-d/2}.
\]
Similarly, define the third order term $D_3(\nu_n)$ and the fourth order term $D_4(\nu_n)$ as the aggregation of all 3-variate centered components and the aggregation of all 4-variate centered components in $\hat\gamma^2_{\nu_n}(\mathbb{P},\mathbb{P}^1\otimes\mathbb{P}^2)$, respectively, which together constitute $R_n$. Then we have
\[
\mathbb{E}D_3^2(\nu_n)\lesssim_d M^2 n^{-3}\nu_n^{-d/2},\qquad
\mathbb{E}D_4^2(\nu_n)\lesssim_d M^2 n^{-4}\nu_n^{-d/2}.
\]
Hence we finally obtain
\[
\hat\gamma^2_{\nu_n}(\mathbb{P},\mathbb{P}^1\otimes\mathbb{P}^2)=\gamma^2_{\nu_n}(\mathbb{P},\mathbb{P}^1\otimes\mathbb{P}^2)+\sum_{l=1}^4 D_l(\nu_n)
\]
and
\[
\mathrm{var}\big(\hat\gamma^2_{\nu_n}(\mathbb{P},\mathbb{P}^1\otimes\mathbb{P}^2)\big)
=\sum_{l=1}^4\mathbb{E}D_l^2(\nu_n)
\lesssim_d M\,n^{-1}\nu_n^{-3d/4}\,\|u\|^2_{L_2}+M^2 n^{-2}\nu_n^{-d/2},
\]
which proves (6.35).

Now consider (6.36). Since
\[
\hat s_{n,\nu_n}\le\max\Big\{\prod_{j=1}^2\sqrt{\big|s^2_{n,j,\nu_n}\big|},\ 1/n\Big\},
\]
we have
\[
\mathbb{E}\big(\hat s_{n,\nu_n}\big)\le\prod_{j=1}^2\sqrt{\mathbb{E}\big|s^2_{n,j,\nu_n}\big|}+1/n,
\]
where
\[
\prod_{j=1}^2\mathbb{E}\big|s^2_{n,j,\nu_n}\big|\lesssim\prod_{j=1}^2\mathbb{E}G^2_{\nu_n}(X^j_1,X^j_2)=\mathbb{E}_{X_1,X_2\sim_{\mathrm{iid}}\mathbb{P}^1\otimes\mathbb{P}^2}G^2_{\nu_n}(X_1,X_2)\lesssim_d M^2\nu_n^{-d/2}.
\]
Therefore (6.36) holds.
Part (ii). We now verify that $n^{2s/(d+4s)}\Delta_n\to\infty$ is also a necessary condition for the existence of consistent asymptotic $\alpha$-level tests for any $\alpha\in(0,1)$. Similarly to the proof of Theorem 11, the idea is to relate the existence of a consistent independence test to the existence of a consistent goodness-of-fit test.

Let $p_{j,0}\in W^{s,2}\big(\sqrt{M/2}\big)$ be a density on $\mathbb{R}^{d_j}$ for $j=1,2$ and let $p_0$ be the product of $p_{1,0}$ and $p_{2,0}$, i.e.,
\[
p_0(x^1,x^2)=p_{1,0}(x^1)\,p_{2,0}(x^2),\qquad\forall\,x^1\in\mathbb{R}^{d_1},\ x^2\in\mathbb{R}^{d_2}.
\]
Hence $p_0\in W^{s,2}(M/2)$.

Let
\[
H_1^{\mathrm{GOF}}(\Delta_n;s)':=\big\{p:\ p\in W^{s,2}(M),\ p_1=p_{1,0},\ p_2=p_{2,0},\ \|p-p_0\|_{L_2}\ge\Delta_n\big\}.
\]
We immediately have
\[
H_1^{\mathrm{IND}}(\Delta_n;s)\supseteq H_1^{\mathrm{GOF}}(\Delta_n;s)'.
\]
Let $\{\Phi_n\}_{n\ge1}$ be any sequence of asymptotic $\alpha$-level independence tests, where $\Phi_n=\Phi_n(X_1,\cdots,X_n)$. Then $\{\Phi_n\}_{n\ge1}$ can also be treated as a sequence of asymptotic $\alpha$-level goodness-of-fit tests with the null density being $p_0$. Moreover,
\[
\mathrm{power}\{\Phi_n;H_1^{\mathrm{IND}}(\Delta_n;s)\}\le\mathrm{power}\{\Phi_n;H_1^{\mathrm{GOF}}(\Delta_n;s)'\}.
\]
It remains to show that given $\liminf_{n\to\infty}n^{2s/(d+4s)}\Delta_n<\infty$, there exists some $\alpha\in(0,1)$ such that
\[
\liminf_{n\to\infty}\mathrm{power}\{\Phi_n;H_1^{\mathrm{GOF}}(\Delta_n;s)'\}<1,
\]
which cannot be directly obtained from Theorem 9 because of the additional constraints
\[
p_1=p_{1,0},\qquad p_2=p_{2,0}
\tag{6.37}
\]
in $H_1^{\mathrm{GOF}}(\Delta_n;s)'$.
However, by modifying the proof of Theorem 9, we only need to further require each $p_{n,\eta_n}$ in the proof of Theorem 9 to satisfy (6.37), or equivalently,
\[
\int_{\mathbb{R}^{d_2}}(p-p_0)(x^1,x^2)\,dx^2=0,\qquad
\int_{\mathbb{R}^{d_1}}(p-p_0)(x^1,x^2)\,dx^1=0.
\]
Recall that each $p_{n,\eta_n}=p_0+\sigma_n\sum_{k=1}^{B_n}\eta_{n,k}\phi_{n,k}$, where
\[
\phi_{n,k}(x)=\frac{m_n^{d/2}}{\|\psi\|_{L_2}}\,\psi(m_n x-x_{n,k}).
\]
Write $x_{n,k}=(x^1_{n,k},x^2_{n,k})\in\mathbb{R}^{d_1}\times\mathbb{R}^{d_2}$. Since $\psi$ can be decomposed as
\[
\psi(x^1,x^2)=\psi_1(x^1)\,\psi_2(x^2),
\]
we have
\[
\phi_{n,k}(x)=\frac{m_n^{d/2}}{\|\psi\|_{L_2}}\,\psi_1(m_n x^1-x^1_{n,k})\,\psi_2(m_n x^2-x^2_{n,k}).
\]
Hence
\[
\int_{\mathbb{R}^{d_2}}(p_{n,\eta_n}-p_0)(x^1,x^2)\,dx^2
=\sigma_n\sum_{k=1}^{B_n}\eta_{n,k}\int_{\mathbb{R}^{d_2}}\phi_{n,k}(x^1,x^2)\,dx^2
=\sigma_n\sum_{k=1}^{B_n}\eta_{n,k}\,\frac{m_n^{d/2}}{\|\psi\|_{L_2}}\cdot\psi_1(m_n x^1-x^1_{n,k})\cdot\frac{1}{m_n^{d_2}}\int_{\mathbb{R}^{d_2}}\psi_2(x^2)\,dx^2=0,
\]
since $\int_{\mathbb{R}^{d_2}}\psi_2(x^2)\,dx^2=0$. Similarly, $\int_{\mathbb{R}^{d_1}}(p_{n,\eta_n}-p_0)(x^1,x^2)\,dx^1=0$. The proof is therefore finished.
Proof of Theorem 14. The proof of Theorem 14 consists of two steps. First, we bound $q^{\mathrm{GOF}}_{n,\alpha}$. To be more specific, we show that there exists $C=C(p_0)>0$ such that
\[
q^{\mathrm{GOF}}_{n,\alpha}\le C(p_0)\log\log n
\]
for sufficiently large $n$, which holds if
\[
\lim_{n\to\infty}P\big(T^{\mathrm{GOF(adapt)}}_n\ge C(p_0)\log\log n\big)=0
\tag{6.38}
\]
under $H_0^{\mathrm{GOF}}$. Second, we show that there exists $c>0$ such that
\[
\liminf_{n\to\infty}\Delta_{n,s}\,\big(n/\log\log n\big)^{2s/(d+4s)}>c
\]
ensures
\[
\inf_{p\in H_1^{\mathrm{GOF(adapt)}}(\Delta_{n,s}:\,s\ge d/4)}P\big(T^{\mathrm{GOF(adapt)}}_n\ge C(p_0)\log\log n\big)\to1
\tag{6.39}
\]
as $n\to\infty$.

Verifying (6.38). In order to prove (6.38), we first show the following two lemmas. The first lemma suggests that $\hat s^2_{n,\nu_n}$ is a consistent estimator of $\mathbb{E}\bar G^2_{\nu_n}(X_1,X_2)$ uniformly over all $\nu_n\in[1,n^{2/d}]$. Recall we have shown in the proof of Theorem 7 that for $\nu_n$ increasing at a proper rate,
\[
\hat s^2_{n,\nu_n}\big/\mathbb{E}[\bar G_{\nu_n}(X_1,X_2)]^2\to_p1.
\]
Hence the first lemma is a uniform version of this result.

Lemma 5. We have that $\hat s^2_{n,\nu_n}/\mathbb{E}[\bar G_{\nu_n}(X_1,X_2)]^2$ converges to 1 uniformly over $\nu_n\in[1,n^{2/d}]$, i.e.,
\[
\sup_{1\le\nu_n\le n^{2/d}}\Big|\hat s^2_{n,\nu_n}\big/\mathbb{E}[\bar G_{\nu_n}(X_1,X_2)]^2-1\Big|=o_p(1).
\]
We defer the proof of Lemma 5 to the appendix. Note that
\[
T^{\mathrm{GOF(adapt)}}_n
=\sup_{1\le\nu_n\le n^{2/d}}\frac{n\,\hat\gamma^2_{\nu_n}(\mathbb{P},\mathbb{P}_0)}{\sqrt{2\,\mathbb{E}[\bar G_{\nu_n}(X_1,X_2)]^2}}\cdot\sqrt{\mathbb{E}[\bar G_{\nu_n}(X_1,X_2)]^2\big/\hat s^2_{n,\nu_n}}
\le\sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{n\,\hat\gamma^2_{\nu_n}(\mathbb{P},\mathbb{P}_0)}{\sqrt{2\,\mathbb{E}[\bar G_{\nu_n}(X_1,X_2)]^2}}\Bigg|\cdot\sup_{1\le\nu_n\le n^{2/d}}\sqrt{\mathbb{E}[\bar G_{\nu_n}(X_1,X_2)]^2\big/\hat s^2_{n,\nu_n}}.
\]
Lemma 5 first ensures that
\[
\sup_{1\le\nu_n\le n^{2/d}}\sqrt{\mathbb{E}[\bar G_{\nu_n}(X_1,X_2)]^2\big/\hat s^2_{n,\nu_n}}=1+o_p(1).
\]
It therefore suffices to show that under $H_0^{\mathrm{GOF}}$,
\[
\tilde T^{\mathrm{GOF(adapt)}}_n:=\sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{n\,\hat\gamma^2_{\nu_n}(\mathbb{P},\mathbb{P}_0)}{\sqrt{2\,\mathbb{E}[\bar G_{\nu_n}(X_1,X_2)]^2}}\Bigg|
\]
is also of order $\log\log n$. This is the crux of our argument, yet its proof is lengthy. For brevity, we shall state it as a lemma here and defer its proof to the appendix.

Lemma 6. There exists $C=C(p_0)>0$ such that
\[
\lim_{n\to\infty}P\Big(\tilde T^{\mathrm{GOF(adapt)}}_n\ge C\log\log n\Big)=0
\]
under $H_0^{\mathrm{GOF}}$.
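Operationally, the adaptive statistic is a supremum of the studentized statistic over a range of scaling parameters; in practice one scans a finite (e.g. geometric) grid spanning $[1,n^{2/d}]$. A schematic sketch only (the helper `adaptive_scan` and the grid construction are ours, and `stat_fn` stands in for the studentized statistic at a given bandwidth):

```python
import numpy as np

def adaptive_scan(stat_fn, nus):
    """Return sup_nu stat_fn(nu) over a finite grid of candidate
    bandwidths, together with the maximizing bandwidth."""
    values = np.array([stat_fn(nu) for nu in nus])
    k = int(np.argmax(values))
    return values[k], nus[k]

# Example grid: geometric between 1 and n^{2/d}, mirroring the
# theoretical range of the supremum.
n, d = 200, 2
nus = np.geomspace(1.0, n ** (2.0 / d), num=20)
```

The max over a grid is a lower bound on the continuum supremum; the chaining argument behind Lemmas 5 and 6 is what controls the gap between grid points.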
Verifying (6.39). Let
\[
\nu_n(s)'=\Big(\frac{\log\log n}{n}\Big)^{-4/(4s+d)},
\]
which is smaller than $n^{2/d}$ for $s\ge d/4$. Hence it suffices to show
\[
\inf_{s\ge d/4}\ \inf_{p\in H_1^{\mathrm{GOF}}(\Delta_{n,s};s)}P\big(T^{\mathrm{GOF}}_{n,\nu_n(s)'}\ge C(p_0)\log\log n\big)\to1
\]
as $n\to\infty$.

First of all, observe
\[
0\le\mathbb{E}\big(s^2_{n,\nu_n(s)'}\big)\le\mathbb{E}G^2_{\nu_n(s)'}(X_1,X_2)\le M^2\big(2\nu_n(s)'/\pi\big)^{-d/2}
\]
and
\[
\mathrm{var}\big(s^2_{n,\nu_n(s)'}\big)\lesssim_d M^3 n^{-1}\big(\nu_n(s)'\big)^{-3d/4}+M^2 n^{-2}\big(\nu_n(s)'\big)^{-d/2}
\]
for any $s$ and $p\in H_1^{\mathrm{GOF}}(\Delta_{n,s},s)$. Further considering $1/n^2=o\big(M^2(2\nu_n(s)'/\pi)^{-d/2}\big)$ uniformly over all $s$, we obtain that
\[
\inf_{s\ge d/4}\ \inf_{p\in H_1^{\mathrm{GOF}}(\Delta_{n,s};s)}P\Big(\hat s^2_{n,\nu_n(s)'}\le 2M^2\big(2\nu_n(s)'/\pi\big)^{-d/2}\Big)\to1.
\]
Let
\[
\Delta_{n,s}\ge c\big(\sqrt M+M\big)\big(\log\log n/n\big)^{2s/(d+4s)}
\]
for some sufficiently large $c=c(d)$. Then
\[
\mathbb{E}\hat\gamma^2_{\nu_n(s)'}(\mathbb{P},\mathbb{P}_0)=\gamma^2_{\nu_n(s)'}(\mathbb{P},\mathbb{P}_0)\ge\Big(\frac{\pi}{\nu_n(s)'}\Big)^{d/2}\cdot\frac{\|p-p_0\|^2_{L_2}}{4},
\]
as guaranteed by Lemma 8. Further considering that
\[
\mathrm{var}\big(\hat\gamma^2_{\nu_n(s)'}(\mathbb{P},\mathbb{P}_0)\big)\lesssim_d M^2 n^{-2}\big(\nu_n(s)'\big)^{-d/2}+M n^{-1}\big(\nu_n(s)'\big)^{-3d/4}\,\|p-p_0\|^2_{L_2},
\]
we immediately have
\[
\lim_{n\to\infty}\inf_{s\ge d/4}\ \inf_{p\in H_1^{\mathrm{GOF}}(\Delta_{n,s};s)}P\big(T^{\mathrm{GOF}}_{n,\nu_n(s)'}\ge C(p_0)\log\log n\big)
\ge\lim_{n\to\infty}\inf_{s\ge d/4}\ \inf_{p\in H_1^{\mathrm{GOF}}(\Delta_{n,s};s)}P\Bigg(\frac{n\,\hat\gamma^2_{\nu_n(s)'}(\mathbb{P},\mathbb{P}_0)/2}{\sqrt{2\,\hat s^2_{n,\nu_n(s)'}}}\ge C(p_0)\log\log n\Bigg)=1.
\]
Proof of Theorems 15 and 16. The proof of Theorems 15 and 16 is very similar to that of Theorem 14. Hence we only emphasize the main differences here.

For the adaptive homogeneity test: to verify that there exists $C=C(p_0)>0$ such that
\[
\lim_{N\to\infty}P\big(T^{\mathrm{HOM(adapt)}}_N\ge C\log\log N\big)=0
\]
under $H_0^{\mathrm{HOM}}$, observe that
\[
T^{\mathrm{HOM(adapt)}}_N\le\sup_{1\le\nu_N\le N^{2/d}}\sqrt{\frac{\mathbb{E}[\bar G_{\nu_N}(X_1,X_2)]^2}{\hat s^2_{n,m,\nu_N}}}\cdot\Big(\frac1n+\frac1m\Big)^{-1}\sup_{1\le\nu_N\le N^{2/d}}\frac{\big|\hat\gamma^2_{\nu_N}(\mathbb{P},\mathbb{Q})\big|}{\sqrt{2\,\mathbb{E}[\bar G_{\nu_N}(X_1,X_2)]^2}}.
\]
Denote $X_1,\cdots,X_n,Y_1,\cdots,Y_m$ as $Z_1,\cdots,Z_N$. Hence
\[
2\sum_{i=1}^n\sum_{j=1}^m G_{\nu_N}(X_i,Y_j)=\sum_{1\le i\ne j\le N}G_{\nu_N}(Z_i,Z_j)-\sum_{1\le i\ne j\le n}G_{\nu_N}(X_i,X_j)-\sum_{1\le i\ne j\le m}G_{\nu_N}(Y_i,Y_j)
\]
and
\[
\sup_{1\le\nu_N\le N^{2/d}}\frac{\big|\hat\gamma^2_{\nu_N}(\mathbb{P},\mathbb{Q})\big|}{\sqrt{2\,\mathbb{E}[\bar G_{\nu_N}(X_1,X_2)]^2}}
\le\Big(\frac{1}{n(n-1)}+\frac{1}{nm}\Big)\sup_{1\le\nu_N\le N^{2/d}}\Bigg|\frac{\sum_{1\le i\ne j\le n}\bar G_{\nu_N}(X_i,X_j)}{\sqrt{2\,\mathbb{E}[\bar G_{\nu_N}(X_1,X_2)]^2}}\Bigg|
\]
\[
+\ \Big(\frac{1}{m(m-1)}+\frac{1}{nm}\Big)\sup_{1\le\nu_N\le N^{2/d}}\Bigg|\frac{\sum_{1\le i\ne j\le m}\bar G_{\nu_N}(Y_i,Y_j)}{\sqrt{2\,\mathbb{E}[\bar G_{\nu_N}(X_1,X_2)]^2}}\Bigg|
+\frac{1}{nm}\sup_{1\le\nu_N\le N^{2/d}}\Bigg|\frac{\sum_{1\le i\ne j\le N}\bar G_{\nu_N}(Z_i,Z_j)}{\sqrt{2\,\mathbb{E}[\bar G_{\nu_N}(X_1,X_2)]^2}}\Bigg|.
\]
Apply Lemma 6 to bound each term on the right hand side of the above inequality. Then we conclude that for some $C=C(p_0)>0$,
\[
\lim_{N\to\infty}P\Bigg(\Big(\frac1n+\frac1m\Big)^{-1}\sup_{1\le\nu_N\le N^{2/d}}\frac{\big|\hat\gamma^2_{\nu_N}(\mathbb{P},\mathbb{Q})\big|}{\sqrt{2\,\mathbb{E}[\bar G_{\nu_N}(X_1,X_2)]^2}}\ge C\log\log N\Bigg)=0.
\]
For the adaptive independence test: to verify that there exists $C=C(p_0)>0$ such that
\[
\lim_{n\to\infty}P\big(T^{\mathrm{IND(adapt)}}_n\ge C\log\log n\big)=0
\tag{6.40}
\]
under $H_0^{\mathrm{IND}}$, recall the decomposition
\[
\hat\gamma^2_{\nu_n}(\mathbb{P},\mathbb{P}^1\otimes\mathbb{P}^2)=D_2(\nu_n)+R_n=\frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}G^\ast_{\nu_n}(X_i,X_j)+R_n,
\]
where we express $R_n$ as $R_n=D_3(\nu_n)+D_4(\nu_n)$ in the proof of Theorem 13.

Following arguments similar to those in the proof of Lemma 6, we obtain that there exists $C(p_0)>0$ such that for sufficiently large $n$,
\[
P\Bigg(\sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{n\,D_2(\nu_n)}{\sqrt{2\,\mathbb{E}[G^\ast_{\nu_n}(X_1,X_2)]^2}}\Bigg|\ge C(p_0)\big(\log\log n+t\log\log\log n\big)\Bigg)\lesssim\exp(-t^{2/3}).
\]
Similarly,
\[
P\Bigg(\sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{n^{3/2}D_3(\nu_n)}{\sqrt{2\,\mathbb{E}[G^\ast_{\nu_n}(X_1,X_2)]^2}}\Bigg|\ge C(p_0)\big(\log\log n+t\log\log\log n\big)\Bigg)\lesssim\exp(-t^{1/2}),
\]
\[
P\Bigg(\sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{n^2 D_4(\nu_n)}{\sqrt{2\,\mathbb{E}[G^\ast_{\nu_n}(X_1,X_2)]^2}}\Bigg|\ge C(p_0)\big(\log\log n+t\log\log\log n\big)\Bigg)\lesssim\exp(-t^{2/5})
\]
for sufficiently large $n$.

On the other hand, note that
\[
\mathbb{E}[G^\ast_{\nu_n}(X_1,X_2)]^2=\prod_{j=1}^2\mathbb{E}[\bar G_{\nu_n}(X^j_1,X^j_2)]^2,
\]
and based on results in the proof of Lemma 5, $\sup_{1\le\nu_n\le n^{2/d}}\big|s^2_{n,j,\nu_n}/\mathbb{E}[\bar G_{\nu_n}(X^j_1,X^j_2)]^2-1\big|=o_p(1)$ for $j=1,2$. Further considering that
\[
1/n^2=o\big(\mathbb{E}[G^\ast_{\nu_n}(X_1,X_2)]^2\big)
\]
uniformly over all $\nu_n\in[1,n^{2/d}]$, we obtain
\[
\sup_{1\le\nu_n\le n^{2/d}}\big|\hat s^2_{n,\nu_n}\big/\mathbb{E}[G^\ast_{\nu_n}(X_1,X_2)]^2-1\big|=o_p(1).
\]
Combined, these ensure that (6.40) holds.
To show that the detection boundary of $\Phi^{\mathrm{IND(adapt)}}$ is of order $O\big((n/\log\log n)^{-2s/(d+4s)}\big)$, observe that
\[
0\le\mathbb{E}\big(s^2_{n,j,\nu_n(s)'}\big)\le\mathbb{E}G^2_{\nu_n(s)'}(X^j_1,X^j_2)\le M^2_j\big(2\nu_n(s)'/\pi\big)^{-d_j/2}
\]
and
\[
\mathrm{var}\big(s^2_{n,j,\nu_n(s)'}\big)\lesssim_{d_j}M^3_j n^{-1}\big(\nu_n(s)'\big)^{-3d_j/4}+M^2_j n^{-2}\big(\nu_n(s)'\big)^{-d_j/2}
\]
for $j=1,2$, where $\nu_n(s)'=(\log\log n/n)^{-4/(4s+d)}$ as in the proof of Theorem 14. Therefore,
\[
\inf_{s\ge d/4}\ \inf_{p\in H_1^{\mathrm{IND}}(\Delta_{n,s};s)}P\Big(\big|s^2_{n,j,\nu_n(s)'}\big|\le\sqrt{3/2}\,M^2_j\big(2\nu_n(s)'/\pi\big)^{-d_j/2}\Big)\to1,\qquad j=1,2.
\]
Further considering $1/n^2=o\big(M^2(2\nu_n(s)'/\pi)^{-d/2}\big)$ uniformly over all $s$, we obtain that
\[
\inf_{s\ge d/4}\ \inf_{p\in H_1^{\mathrm{IND}}(\Delta_{n,s};s)}P\Big(\hat s^2_{n,\nu_n(s)'}\le 2M^2\big(2\nu_n(s)'/\pi\big)^{-d/2}\Big)\to1.
\]
References

L. Addario-Berry, N. Broutin, L. Devroye, and G. Lugosi (2010). "On combinatorial testing problems". In: The Annals of Statistics 38.5, pp. 3063–3092.
N. Ailon, M. Charikar, and A. Newman (2008). "Aggregating inconsistent information: ranking and clustering". In: Journal of the ACM 55.5, 23:1–23:27.
M. A. Arcones and E. Giné (1993). "Limit Theorems for U-Processes". In: The Annals of Probability 21.3, pp. 1494–1542.
Y. Baraud (2002). "Non-asymptotic minimax rates of testing in signal detection". In: Bernoulli 8.5, pp. 577–606.
M. V. Burnashev (1979). "On the minimax detection of an inaccurately known signal in a white Gaussian noise background". In: Theory of Probability & Its Applications 24.1, pp. 107–119.
N. Dunford and J. T. Schwartz (1963). Linear Operators, Part II: Spectral Theory: Self Adjoint Operators in Hilbert Space. Interscience Publishers.
M. S. Ermakov (1991). "Minimax detection of a signal in a Gaussian white noise". In: Theory of Probability & Its Applications 35.4, pp. 667–679.
M. Fromont and B. Laurent (2006). "Adaptive goodness-of-fit tests in a density model". In: The Annals of Statistics 34.2, pp. 680–720.
M. Fromont, B. Laurent, M. Lerasle, and P. Reynaud-Bouret (2012). "Kernels based tests with non-asymptotic bootstrap approaches for two-sample problem". In: JMLR: Workshop and Conference Proceedings. Vol. 23, pp. 23-1.
M. Fromont, B. Laurent, and P. Reynaud-Bouret (2013). "The two-sample problem for Poisson processes: adaptive tests with a nonasymptotic wild bootstrap approach". In: The Annals of Statistics 41.3, pp. 1431–1461.
K. Fukumizu, A. Gretton, G. R. Lanckriet, B. Schölkopf, and B. K. Sriperumbudur (2009). "Kernel choice and classifiability for RKHS embeddings of probability distributions". In: Advances in Neural Information Processing Systems, pp. 1750–1758.
G. G. Gregory (1977). "Large sample theory for U-statistics and tests of fit". In: The Annals of Statistics 5.1, pp. 110–123.
A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola (2012a). "A kernel two-sample test". In: Journal of Machine Learning Research 13.Mar, pp. 723–773.
A. Gretton, O. Bousquet, A. J. Smola, and B. Schölkopf (2005). "Measuring statistical dependence with Hilbert-Schmidt norms". In: International Conference on Algorithmic Learning Theory. Springer, pp. 63–77.
A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola (2008). "A kernel statistical test of independence". In: Advances in Neural Information Processing Systems, pp. 585–592.
A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, and B. K. Sriperumbudur (2012b). "Optimal kernel choice for large-scale two-sample tests". In: Advances in Neural Information Processing Systems, pp. 1205–1213.
E. Haeusler (1988). "On the rate of convergence in the central limit theorem for martingales with discrete and continuous time". In: The Annals of Probability 16.1, pp. 275–299.
P. Hall (1984). "Central limit theorem for integrated square error of multivariate nonparametric density estimators". In: Journal of Multivariate Analysis 14.1, pp. 1–16.
Z. Harchaoui, F. Bach, and E. Moulines (2007). "Testing for homogeneity with kernel Fisher discriminant analysis". In: Advances in Neural Information Processing Systems, pp. 609–616.
Y. I. Ingster (1987). "Minimax testing of nonparametric hypotheses on a distribution density in the L_p metrics". In: Theory of Probability & Its Applications 31.2, pp. 333–337.
— (1993). "Asymptotically minimax hypothesis testing for nonparametric alternatives. I, II, III". In: Mathematical Methods of Statistics 2.2, pp. 85–114.
— (2000). "Adaptive chi-square tests". In: Journal of Mathematical Sciences 99.2, pp. 1110–1119.
Y. I. Ingster and I. A. Suslina (2000). "Minimax nonparametric hypothesis testing for ellipsoids and Besov bodies". In: ESAIM: Probability and Statistics 4, pp. 53–135.
— (2003). Nonparametric Goodness-of-Fit Testing under Gaussian Models. New York, NY: Springer.
P. E. Jupp (2005). "Sobolev tests of goodness of fit of distributions on compact Riemannian manifolds". In: The Annals of Statistics 33.6, pp. 2957–2966.
E. L. Lehmann and J. P. Romano (2008). Testing Statistical Hypotheses. New York, NY: Springer Science & Business Media.
O. V. Lepski and V. G. Spokoiny (1999). "Minimax nonparametric hypothesis testing: the case of an inhomogeneous alternative". In: Bernoulli 5.2, pp. 333–358.
R. Lyons (2013). "Distance covariance in metric spaces". In: The Annals of Probability 41.5, pp. 3284–3305.
J. M. Mooij, J. Peters, D. Janzing, J. Zscheischler, and B. Schölkopf (2016). "Distinguishing cause from effect using observational data: methods and benchmarks". In: The Journal of Machine Learning Research 17.1, pp. 1103–1204.
K. Muandet, K. Fukumizu, B. K. Sriperumbudur, and B. Schölkopf (2017). "Kernel mean embedding of distributions: a review and beyond". In: Foundations and Trends® in Machine Learning 10.1-2, pp. 1–141.
J. Peters, J. M. Mooij, D. Janzing, and B. Schölkopf (2014). "Causal discovery with continuous additive noise models". In: The Journal of Machine Learning Research 15.1, pp. 2009–2053.
N. Pfister, P. Bühlmann, B. Schölkopf, and J. Peters (2018). "Kernel-based tests for joint independence". In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80.1, pp. 5–31.
D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu (2013). "Equivalence of distance-based and RKHS-based statistics in hypothesis testing". In: The Annals of Statistics 41.5, pp. 2263–2291.
R. J. Serfling (2009). Approximation Theorems of Mathematical Statistics. New York, NY: John Wiley & Sons.
V. G. Spokoiny (1996). "Adaptive hypothesis testing using wavelets". In: The Annals of Statistics 24.6, pp. 2477–2498.
B. Sriperumbudur, K. Fukumizu, A. Gretton, G. Lanckriet, and B. Schölkopf (2009). "Kernel choice and classifiability for RKHS embeddings of probability distributions". In: Advances in Neural Information Processing Systems 22, pp. 1750–1758.
B. K. Sriperumbudur, K. Fukumizu, and G. R. Lanckriet (2011). "Universality, characteristic kernels and RKHS embedding of measures". In: Journal of Machine Learning Research 12.Jul, pp. 2389–2410.
B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. Lanckriet (2010). "Hilbert space embeddings and metrics on probability measures". In: Journal of Machine Learning Research 11.Apr, pp. 1517–1561.
I. Steinwart and A. Christmann (2008). Support Vector Machines. Springer Science & Business Media.
I. Steinwart (2001). "On the influence of the kernel on the consistency of support vector machines". In: Journal of Machine Learning Research 2.Nov, pp. 67–93.
D. J. Sutherland, H.-Y. Tung, H. Strathmann, S. De, A. Ramdas, A. Smola, and A. Gretton (2017). "Generative models and model criticism via optimized maximum mean discrepancy". In: International Conference on Learning Representations.
G. J. Székely and M. L. Rizzo (2009). "Brownian distance covariance". In: The Annals of Applied Statistics 3.4, pp. 1236–1265.
G. J. Székely, M. L. Rizzo, and N. K. Bakirov (2007). "Measuring and testing dependence by correlation of distances". In: The Annals of Statistics 35.6, pp. 2769–2794.
M. Talagrand (2014). Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems. Springer Science & Business Media.
A. B. Tsybakov (2008). Introduction to Nonparametric Estimation. New York, NY: Springer Science & Business Media.
Appendix A: Some Technical Results and Proofs Related to Chapter 2

Proof of Lemma 3. We have
\[
G^2(x,x')=\sum_{j,k}\mu_j\mu_k\,\varphi_j(x)\varphi_k(x)\,\varphi_j(x')\varphi_k(x').
\]
Thus
\[
\int u(x)u(x')G^2(x,x')\,d\mu(x)\,d\mu(x')
=\sum_{j,k}\mu_j\mu_k\Big(\int u(x)\varphi_j(x)\varphi_k(x)\,d\mu(x)\Big)^2
\le\mu_1\sum_j\mu_j\sum_k\Big(\int u(x)\varphi_j(x)\varphi_k(x)\,d\mu(x)\Big)^2
\]
\[
\le\mu_1\Big(\sum_j\mu_j\int u^2(x)\varphi^2_j(x)\,d\mu(x)\Big)
\le\mu_1\Big(\sum_j\mu_j\Big)\Big(\sup_j\|\varphi_j\|_\infty\Big)^2\,\|u\|^2_{L_2(\mu)}.
\]

Proof of Lemma 4. For brevity, write
\[
r_K=\sum_{k=1}^K\frac{a_k^2}{\lambda_k}.
\]
By definition, it suffices to show that for all $R>0$ there exists $u_R\in\mathcal{H}(K)$ such that $\|u_R\|^2_K\le R^2$ and $\|u-u_R\|^2_{L_2(\mu_0)}\le c^2 R^{-2/\theta}$.

To this end, let $K$ be such that $r_K\le R^2\le r_{K+1}$, and denote
\[
u_R=\sum_{k=1}^K a_k\varphi_k+a^\ast_{K+1}(R)\,\varphi_{K+1},
\]
where
\[
a^\ast_{K+1}(R)=\mathrm{sgn}(a_{K+1})\sqrt{\lambda_{K+1}(R^2-r_K)}.
\]
Clearly,
\[
\|u_R\|^2_K=\sum_{k=1}^K\frac{a_k^2}{\lambda_k}+\frac{\big(a^\ast_{K+1}(R)\big)^2}{\lambda_{K+1}}=R^2,
\]
and
\[
\|u-u_R\|^2_{L_2(\mu_0)}=\sum_{k>K+1}a_k^2+\Big(|a_{K+1}|-\sqrt{\lambda_{K+1}(R^2-r_K)}\Big)^2\le\sum_{k\ge K+1}a_k^2.
\]
To ensure $u\in\mathcal{F}(\theta,c)$, it suffices to have
\[
\sup_{r_K\le R^2\le r_{K+1}}\|u-u_R\|^2_{L_2(\mu_0)}\,R^{2/\theta}\le c^2,\qquad\forall\,K\ge0,
\]
which concludes the proof.
Appendix B: Some Technical Results and Proofs Related to Chapter 3

B.1 Properties of Gaussian Kernel

We collect here a couple of useful properties of the Gaussian kernel that we use repeatedly in the proofs of the main results.

Lemma 7. For any $f\in L_2(\mathbb{R}^d)$,
\[
\int G_\nu(x,y)f(x)f(y)\,dx\,dy=\Big(\frac{\pi}{\nu}\Big)^{\frac d2}\int\exp\Big(-\frac{\|\omega\|^2}{4\nu}\Big)\,\|\mathcal{F}f(\omega)\|^2\,d\omega.
\]

Proof. Denote by $Z$ a Gaussian random vector with mean $0$ and covariance matrix $2\nu I_d$. Then
\[
\int G_\nu(x,y)f(x)f(y)\,dx\,dy
=\int\exp\big(-\nu\|x-y\|^2\big)f(x)f(y)\,dx\,dy
=\int\mathbb{E}\exp\big[iZ^\top(x-y)\big]f(x)f(y)\,dx\,dy
\]
\[
=\mathbb{E}\bigg|\int\exp\big(-iZ^\top x\big)f(x)\,dx\bigg|^2
=\int\frac{1}{(4\pi\nu)^{d/2}}\exp\Big(-\frac{\|\omega\|^2}{4\nu}\Big)\bigg|\int\exp\big(-i\omega^\top x\big)f(x)\,dx\bigg|^2\,d\omega
=\Big(\frac{\pi}{\nu}\Big)^{\frac d2}\int\exp\Big(-\frac{\|\omega\|^2}{4\nu}\Big)\,\|\mathcal{F}f(\omega)\|^2\,d\omega,
\]
which concludes the proof.
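For a concrete illustration of Lemma 7 (a numerical sketch, not part of the proof): take $d=1$, $\nu=1$ and $f(x)=e^{-x^2}$. The left-hand side is then the Gaussian integral $\iint e^{-(x-y)^2-x^2-y^2}\,dx\,dy=\pi/\sqrt3$, and by the lemma the right-hand side equals the same value. A crude Riemann sum reproduces it:

```python
import numpy as np

# Left-hand side of Lemma 7 for f(x) = exp(-x^2), nu = 1, d = 1:
# integral of exp(-(x - y)^2) f(x) f(y) over R^2, truncated to [-6, 6]^2
# (the integrand is negligible outside this square).
h = 0.01
x = np.arange(-6.0, 6.0, h)
X, Y = np.meshgrid(x, x)
lhs = np.exp(-((X - Y) ** 2) - X ** 2 - Y ** 2).sum() * h ** 2

# Closed form of this Gaussian integral: the quadratic form has matrix
# [[2, -1], [-1, 2]] with determinant 3, giving pi / sqrt(3).
rhs = np.pi / np.sqrt(3.0)
```

The agreement also pins down the Fourier convention implicit in the lemma: with the unitary transform, $\|\mathcal{F}f\|^2$ for this $f$ is $e^{-\omega^2/2}/2$, and the right-hand side integral indeed evaluates to $\pi/\sqrt3$.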
A useful consequence of Lemma 7 is a close connection between the Gaussian kernel MMD and the $L_2$ norm.
Lemma 8. For any $f\in W^{s,2}(M)$,
\[
\Big(\frac{\nu}{\pi}\Big)^{d/2}\int G_\nu(x,y)f(x)f(y)\,dx\,dy\ge\frac14\,\|f\|^2_{L_2},
\]
provided that
\[
\nu^s\ge\frac{4^{1-s}M^2}{(\log3)^s}\cdot\|f\|^{-2}_{L_2}.
\]

Proof. In light of Lemma 7,
\[
\Big(\frac{\nu}{\pi}\Big)^{d/2}\int G_\nu(x,y)f(x)f(y)\,dx\,dy=\int\exp\Big(-\frac{\|\omega\|^2}{4\nu}\Big)\,\|\mathcal{F}f(\omega)\|^2\,d\omega.
\]
By the Plancherel theorem, for any $\rho>0$,
\[
\int_{\|\omega\|\le\rho}\|\mathcal{F}f(\omega)\|^2\,d\omega=\|f\|^2_{L_2}-\int_{\|\omega\|>\rho}\|\mathcal{F}f(\omega)\|^2\,d\omega\ge\|f\|^2_{L_2}-\frac{M^2}{\rho^{2s}}.
\]
Choosing
\[
\rho=\Big(\frac{2M}{\|f\|_{L_2}}\Big)^{1/s}
\]
yields
\[
\int_{\|\omega\|\le\rho}\|\mathcal{F}f(\omega)\|^2\,d\omega\ge\frac34\,\|f\|^2_{L_2}.
\]
Hence
\[
\int\exp\Big(-\frac{\|\omega\|^2}{4\nu}\Big)\,\|\mathcal{F}f(\omega)\|^2\,d\omega
\ge\exp\Big(-\frac{\rho^2}{4\nu}\Big)\int_{\|\omega\|\le\rho}\|\mathcal{F}f(\omega)\|^2\,d\omega
\ge\frac34\exp\Big(-\frac{\rho^2}{4\nu}\Big)\,\|f\|^2_{L_2}.
\]
In particular, if
\[
\nu\ge\frac{(2M)^{2/s}}{4\log3}\cdot\|f\|^{-2/s}_{L_2},
\]
then
\[
\int\exp\Big(-\frac{\|\omega\|^2}{4\nu}\Big)\,\|\mathcal{F}f(\omega)\|^2\,d\omega\ge\frac14\,\|f\|^2_{L_2},
\]
which concludes the proof.
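Continuing the running example $f(x)=e^{-x^2}$ in $d=1$ (for which $\|f\|^2_{L_2}=\sqrt{\pi/2}$): the left-hand side of Lemma 8 has the closed form $\sqrt{\nu/\pi}\cdot\pi/\sqrt{1+2\nu}=\sqrt{\pi\nu/(1+2\nu)}$, which increases to $\|f\|^2_{L_2}$ as $\nu\to\infty$, so the $1/4$ bound can be checked directly for large $\nu$ (a sketch; the helper name is ours):

```python
import math

def lemma8_lhs(nu):
    """(nu/pi)^{1/2} times the double integral of exp(-nu (x-y)^2) f(x) f(y)
    for f(x) = exp(-x^2) in d = 1. The Gaussian double integral equals
    pi / sqrt(1 + 2 nu), so the whole expression is sqrt(pi nu / (1 + 2 nu))."""
    return math.sqrt(math.pi * nu / (1.0 + 2.0 * nu))

# ||f||_{L2}^2 = integral of exp(-2 x^2) = sqrt(pi / 2).
f_l2_sq = math.sqrt(math.pi / 2.0)
```

This also illustrates why the lemma needs a lower bound on $\nu$: for small $\nu$ the factor $\exp(-\rho^2/(4\nu))$ (here, $\sqrt{\nu}$-type decay of the closed form) can drive the left-hand side below $\|f\|^2_{L_2}/4$.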
B.2 Proof of Lemma 5

We first prove that $\sup_{1\le\nu_n\le n^{2/d}}\big|s^2_{n,\nu_n}/\mathbb{E}[\bar G_{\nu_n}(X_1,X_2)]^2-1\big|=o_p(1)$ and then show that the difference caused by the modification from $s^2_{n,\nu_n}$ to $\hat s^2_{n,\nu_n}$ is asymptotically negligible.

Note that
\[
\sup_{1\le\nu_n\le n^{2/d}}\big|s^2_{n,\nu_n}\big/\mathbb{E}[\bar G_{\nu_n}(X_1,X_2)]^2-1\big|
\le\Big(\inf_{1\le\nu_n\le n^{2/d}}\nu_n^{d/2}\,\mathbb{E}[\bar G_{\nu_n}(X_1,X_2)]^2\Big)^{-1}\cdot\sup_{1\le\nu_n\le n^{2/d}}\nu_n^{d/2}\big|s^2_{n,\nu_n}-\mathbb{E}[\bar G_{\nu_n}(X_1,X_2)]^2\big|.
\]
For $X\sim\mathbb{P}_0$, denote the distribution of $(X,X)$ as $\mathbb{P}_1$. Then we have
\[
\mathbb{E}[\bar G_{\nu_n}(X_1,X_2)]^2=\gamma^2_{\nu_n}(\mathbb{P}_1,\mathbb{P}_0\otimes\mathbb{P}_0).
\]
Hence $\mathbb{E}[\bar G_{\nu_n}(X_1,X_2)]^2>0$ for any $\nu_n>0$ since $G_{\nu_n}$ is characteristic.

In addition, $\nu_n^{d/2}\,\mathbb{E}[\bar G_{\nu_n}(X_1,X_2)]^2$ is continuous with respect to $\nu_n$ and
\[
\lim_{\nu_n\to\infty}\nu_n^{d/2}\,\mathbb{E}[\bar G_{\nu_n}(X_1,X_2)]^2=\Big(\frac{\pi}{2}\Big)^{d/2}\,\|p_0\|^2_{L_2}.
\]
Therefore,
\[
\inf_{1\le\nu_n\le n^{2/d}}\nu_n^{d/2}\,\mathbb{E}[\bar G_{\nu_n}(X_1,X_2)]^2\ge\inf_{\nu_n\in[1,\infty)}\nu_n^{d/2}\,\mathbb{E}[\bar G_{\nu_n}(X_1,X_2)]^2>0,
\]
106
and it remains to prove
sup1โคa๐โค๐2/๐
a๐/2๐
๏ฟฝ๏ฟฝ๐ 2๐,a๐ โ E[๏ฟฝ๏ฟฝa๐ (๐1, ๐2)]2๏ฟฝ๏ฟฝ = ๐๐ (1).Recall the expression of ๐ 2๐,a๐ . It suffcies to show that
\[
\sup_{1\le a_n\le n^{2/d}} a_n^{d/2}\left|\frac{1}{n(n-1)}\sum_{1\le i\ne j\le n} G^2_{a_n}(Z_i,Z_j) - \mathbb{E}G^2_{a_n}(Z_1,Z_2)\right| \tag{B.1}
\]
\[
\sup_{1\le a_n\le n^{2/d}} a_n^{d/2}\left|\frac{2(n-3)!}{n!}\sum_{\substack{1\le i,j_1,j_2\le n\\ |\{i,j_1,j_2\}|=3}} G_{a_n}(Z_i,Z_{j_1})G_{a_n}(Z_i,Z_{j_2}) - 2\,\mathbb{E}G_{a_n}(Z_1,Z_2)G_{a_n}(Z_1,Z_3)\right| \tag{B.2}
\]
\[
\sup_{1\le a_n\le n^{2/d}} a_n^{d/2}\left|\frac{(n-4)!}{n!}\sum_{\substack{1\le i_1,i_2,j_1,j_2\le n\\ |\{i_1,i_2,j_1,j_2\}|=4}} G_{a_n}(Z_{i_1},Z_{j_1})G_{a_n}(Z_{i_2},Z_{j_2}) - \left[\mathbb{E}G_{a_n}(Z_1,Z_2)\right]^2\right| \tag{B.3}
\]
are all $o_p(1)$. We shall first control (B.1), and then bound (B.2) and (B.3) in the same way.
Let
\[
\mathbb{E}_n G^2_{a_n}(Z,Z') = \frac{1}{n(n-1)}\sum_{1\le i\ne j\le n} G^2_{a_n}(Z_i,Z_j).
\]
In the rest of this proof, we abbreviate $\mathbb{E}_n G^2_{a_n}(Z,Z')$ and $\mathbb{E}G^2_{a_n}(Z_1,Z_2)$ as $\mathbb{E}_n G^2_{a_n}$ and $\mathbb{E}G^2_{a_n}$, respectively, when no confusion occurs.
Divide the whole interval $[1, n^{2/d}]$ into $A$ sub-intervals $[u_0,u_1], [u_1,u_2], \cdots, [u_{A-1},u_A]$ with $u_0 = 1$ and $u_A = n^{2/d}$. For any $a_n \in [u_{k-1},u_k]$,
\[
\begin{aligned}
a_n^{d/2}\mathbb{E}_n G^2_{a_n} - a_n^{d/2}\mathbb{E}G^2_{a_n}
&\ge -a_n^{d/2}\left|\mathbb{E}_n G^2_{u_k} - \mathbb{E}G^2_{u_k}\right| - a_n^{d/2}\left|\mathbb{E}G^2_{u_k} - \mathbb{E}G^2_{u_{k-1}}\right|\\
&\ge -u_k^{d/2}\left|\mathbb{E}_n G^2_{u_k} - \mathbb{E}G^2_{u_k}\right| - u_k^{d/2}\left|\mathbb{E}G^2_{u_k} - \mathbb{E}G^2_{u_{k-1}}\right|
\end{aligned}
\]
and
\[
a_n^{d/2}\mathbb{E}_n G^2_{a_n} - a_n^{d/2}\mathbb{E}G^2_{a_n} \le u_k^{d/2}\left|\mathbb{E}_n G^2_{u_{k-1}} - \mathbb{E}G^2_{u_{k-1}}\right| + u_k^{d/2}\left|\mathbb{E}G^2_{u_k} - \mathbb{E}G^2_{u_{k-1}}\right|,
\]
which together ensure that
\[
\begin{aligned}
\sup_{1\le a_n\le n^{2/d}} \left|a_n^{d/2}\mathbb{E}_n G^2_{a_n} - a_n^{d/2}\mathbb{E}G^2_{a_n}\right|
&\le \sup_{1\le k\le A}\left(\frac{u_k}{u_{k-1}}\right)^{d/2}\cdot \sup_{0\le k\le A} u_k^{d/2}\left|\mathbb{E}_n G^2_{u_k} - \mathbb{E}G^2_{u_k}\right| + \sup_{1\le k\le A} u_k^{d/2}\left|\mathbb{E}G^2_{u_k} - \mathbb{E}G^2_{u_{k-1}}\right| \\
&\le \sup_{1\le k\le A}\left(\frac{u_k}{u_{k-1}}\right)^{d/2}\cdot \sup_{0\le k\le A} u_k^{d/2}\left|\mathbb{E}_n G^2_{u_k} - \mathbb{E}G^2_{u_k}\right| + \sup_{1\le k\le A}\left|u_k^{d/2}\mathbb{E}G^2_{u_k} - u_{k-1}^{d/2}\mathbb{E}G^2_{u_{k-1}}\right| \\
&\qquad + \sup_{1\le k\le A}\left(\left(u_k^{d/2} - u_{k-1}^{d/2}\right)\mathbb{E}G^2_{u_{k-1}}\right).
\end{aligned}
\]
We bound the three terms on the right-hand side of the last inequality separately.
Let $\{u_k\}_{k\ge 0}$ be a geometric sequence. Namely, let
\[
A := \inf\{m \in \mathbb{N} : \rho^m \ge n^{2/d}\},
\]
and
\[
u_k =
\begin{cases}
\rho^k, & 0 \le k \le A-1,\\
n^{2/d}, & k = A,
\end{cases}
\]
with $\rho > 1$ to be determined later.
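The geometric grid above is straightforward to construct. The sketch below (with hypothetical values of $n$, $d$ and $\rho$, purely as an illustration) builds $\{u_k\}$ and confirms the two properties the argument relies on: the grid covers $[1, n^{2/d}]$, and consecutive ratios never exceed $\rho$.

```python
def geometric_grid(n, d, rho):
    """u_0 = 1 < u_1 < ... < u_A = n^{2/d}, with u_k = rho^k for k < A
    and A = inf{m : rho^m >= n^{2/d}}."""
    top = n ** (2.0 / d)
    A = 0
    while rho ** A < top:
        A += 1
    return [float(rho ** k) for k in range(A)] + [top]

u = geometric_grid(n=1000, d=2, rho=1.5)
```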
Since $\lim_{a\to\infty} a^{d/2}\mathbb{E}G^2_a = (\pi/2)^{d/2}\|p_0\|^2_{L_2}$ and $a^{d/2}\mathbb{E}G^2_a$ is continuous, we obtain that for any $\varepsilon > 0$, there exists a sufficiently small $\rho > 1$ such that
\[
\sup_{1\le k\le A}\left|u_k^{d/2}\mathbb{E}G^2_{u_k} - u_{k-1}^{d/2}\mathbb{E}G^2_{u_{k-1}}\right| \le \varepsilon.
\]
At the same time, we can also ensure
\[
\sup_{1\le k\le A}\left(\left(u_k^{d/2} - u_{k-1}^{d/2}\right)\mathbb{E}G^2_{u_{k-1}}\right) \le \left(\rho^{d/2} - 1\right)\left(\frac{\pi}{2}\right)^{d/2}\|p_0\|^2_{L_2} \le \varepsilon
\]
by choosing $\rho$ sufficiently small.
Finally, consider
\[
\sup_{1\le k\le A}\left(\frac{u_k}{u_{k-1}}\right)^{d/2}\cdot \sup_{0\le k\le A} u_k^{d/2}\left|\mathbb{E}_n G^2_{u_k} - \mathbb{E}G^2_{u_k}\right|.
\]
On the one hand,
\[
\sup_{1\le k\le A}\left(\frac{u_k}{u_{k-1}}\right)^{d/2} \le \rho^{d/2}.
\]
On the other hand, since
\[
\mathrm{var}\left(\mathbb{E}_n G^2_{a_n}\right) \lesssim \frac{1}{n}\mathbb{E}G^2_{a_n}(Z,Z')G^2_{a_n}(Z,Z'') + \frac{1}{n^2}\mathbb{E}G^4_{a_n}(Z,Z') \lesssim_d \frac{a_n^{-3d/4}\|p_0\|^3_{L_2}}{n} + \frac{a_n^{-d/2}\|p_0\|^2_{L_2}}{n^2}
\]
for any $a_n \in (0,\infty)$, we have
\[
P\left(\sup_{0\le k\le A} u_k^{d/2}\left|\mathbb{E}_n G^2_{u_k} - \mathbb{E}G^2_{u_k}\right| \ge \varepsilon\right) \le \sum_{k=0}^{A}\frac{u_k^{d}\,\mathrm{var}\left(\mathbb{E}_n G^2_{u_k}\right)}{\varepsilon^2} \lesssim_{\rho,d} \frac{1}{\varepsilon^2}\left(\frac{u_A^{d/4}\|p_0\|^3_{L_2}}{n} + \frac{u_A^{d/2}\|p_0\|^2_{L_2}}{n^2}\right) \to 0
\]
as $n\to\infty$. Hence we conclude that $\sup_{1\le a_n\le n^{2/d}}\big|a_n^{d/2}\mathbb{E}_n G^2_{a_n} - a_n^{d/2}\mathbb{E}G^2_{a_n}\big| = o_p(1)$. Considering that
\[
\lim_{a_n\to\infty} a_n^{d/2}\,\mathbb{E}G_{a_n}(Z_1,Z_2)G_{a_n}(Z_1,Z_3) = 0, \qquad \lim_{a_n\to\infty} a_n^{d/2}\left[\mathbb{E}G_{a_n}(Z_1,Z_2)\right]^2 = 0,
\]
we obtain that (B.2) and (B.3) are also $o_p(1)$, based on almost the same arguments. Hence
\[
\sup_{1\le a_n\le n^{2/d}}\left|\frac{s^2_{n,a_n}}{\mathbb{E}[\bar G_{a_n}(Z_1,Z_2)]^2} - 1\right| = o_p(1).
\]
On the other hand, since $\mathbb{E}[\bar G_{a_n}(Z_1,Z_2)]^2 \gtrsim_{p_0,d} a_n^{-d/2}$ for $a_n \in [1, n^{2/d}]$,
\[
\sup_{1\le a_n\le n^{2/d}} \frac{1}{n^2\,\mathbb{E}[\bar G_{a_n}(Z_1,Z_2)]^2} = o(1).
\]
Hence we finally conclude that
\[
\sup_{1\le a_n\le n^{2/d}}\left|\frac{\hat s^2_{n,a_n}}{\mathbb{E}[\bar G_{a_n}(Z_1,Z_2)]^2} - 1\right| = o_p(1).
\]
B.3 Proof of Lemma 6
Let
\[
K_{a_n}(x,x') = \frac{G_{a_n}(x,x')}{\sqrt{2\,\mathbb{E}G^2_{a_n}(Z_1,Z_2)}}, \qquad \forall\, x,x' \in \mathbb{R}^d,
\]
and accordingly,
\[
\bar K_{a_n}(x,x') = \frac{\bar G_{a_n}(x,x')}{\sqrt{2\,\mathbb{E}G^2_{a_n}(Z_1,Z_2)}}.
\]
Hence
\[
T^{\mathrm{GOF(adapt)}}_n = \sup_{1\le a_n\le n^{2/d}}\left|\frac{1}{n-1}\sum_{i\ne j}\bar K_{a_n}(Z_i,Z_j)\cdot\sqrt{\frac{\mathbb{E}G^2_{a_n}(Z_1,Z_2)}{\mathbb{E}[\bar G_{a_n}(Z_1,Z_2)]^2}}\right|.
\]
To finish the proof, we first bound
\[
\sup_{1\le a_n\le n^{2/d}}\left|\frac{1}{n-1}\sum_{i\ne j}\bar K_{a_n}(Z_i,Z_j)\right| \tag{B.4}
\]
and then control $T^{\mathrm{GOF(adapt)}}_n$.
Step (i). We borrow two main tools in this step. First, we apply results from Arcones and Giné (1993) to obtain a Bernstein-type inequality for
\[
\left|\frac{1}{n-1}\sum_{i\ne j}\bar K_{a_0}(Z_i,Z_j)\right| \quad\text{and}\quad \left|\frac{1}{n-1}\sum_{i\ne j}\left(\bar K_{a_n}(Z_i,Z_j) - \bar K_{a'_n}(Z_i,Z_j)\right)\right|
\]
for some $a_0$ and arbitrary $a_n, a'_n \in [1,\infty)$. Based on that, we use Talagrand's techniques for handling Bernstein-type inequalities (see, e.g., Talagrand, 2014) to give a generic chaining bound for (B.4).
To be more specific, for any $a_0, a_n, a'_n \in [1, n^{2/d}]$, define
\[
d_1(a_n, a'_n) = \left\|\bar K_{a'_n} - \bar K_{a_n}\right\|_{L_\infty}, \qquad d_2(a_n, a'_n) = \left\|\bar K_{a'_n} - \bar K_{a_n}\right\|_{L_2}.
\]
Then Proposition 2.3 (c) of Arcones and Giné (1993) ensures that for any $t > 0$,
\[
P\left(\left|\frac{1}{n-1}\sum_{i\ne j}\bar K_{a_0}(Z_i,Z_j)\right| \ge t\right) \le C\exp\left(-C\min\left\{\frac{t}{\|\bar K_{a_0}\|_{L_2}}, \left(\frac{\sqrt{n}\,t}{\|\bar K_{a_0}\|_{L_\infty}}\right)^{2/3}\right\}\right) \tag{B.5}
\]
and
\[
P\left(\left|\frac{1}{n-1}\sum_{i\ne j}\left(\bar K_{a_n}(Z_i,Z_j) - \bar K_{a'_n}(Z_i,Z_j)\right)\right| \ge t\right) \le C\exp\left(-C\min\left\{\frac{t}{d_2(a_n,a'_n)}, \left(\frac{\sqrt{n}\,t}{d_1(a_n,a'_n)}\right)^{2/3}\right\}\right)
\]
for some $C > 0$. Based on a chaining-type argument (see, e.g., Theorem 2.2.28 in Talagrand, 2014), the latter inequality suggests that there exists $C > 0$ such that
\[
P\left(\sup_{1\le a_n\le n^{2/d}}\left|\frac{1}{n-1}\sum_{i\ne j}\left(\bar K_{a_n}(Z_i,Z_j) - \bar K_{a_0}(Z_i,Z_j)\right)\right| \ge C\left(\frac{\gamma_{2/3}([1,n^{2/d}], d_1)\,t}{\sqrt{n}} + \gamma_1([1,n^{2/d}], d_2) + D_2 t\right)\right) \lesssim \exp(-t^{2/3}), \tag{B.6}
\]
where $\gamma_{2/3}([1,n^{2/d}], d_1)$ and $\gamma_1([1,n^{2/d}], d_2)$ are the so-called $\gamma$-functionals and
\[
D_2 = \sum_{m\ge 0} e_m([1,n^{2/d}], d_2)
\]
with $e_m$ being the so-called entropy numbers.
A straightforward combination of (B.5) and (B.6) then gives
\[
P\left(\sup_{1\le a_n\le n^{2/d}}\left|\frac{1}{n-1}\sum_{i\ne j}\bar K_{a_n}(Z_i,Z_j)\right| \ge C\left(\frac{\gamma_{2/3}([1,n^{2/d}], d_1)\,t}{\sqrt{n}} + \gamma_1([1,n^{2/d}], d_2) + D_2 t + \frac{\|\bar K_{a_0}\|_{L_\infty}\,t}{\sqrt{n}} + \|\bar K_{a_0}\|_{L_2}\,t\right)\right) \lesssim \exp(-t^{2/3}).
\]
Therefore, given that bounds on $\|\bar K_{a_0}\|_{L_2}$ and $\|\bar K_{a_0}\|_{L_\infty}$ can be obtained quite directly, e.g., with $a_0 = 1$,
\[
\|\bar K_{a_0}\|_{L_\infty} \le 4\|K_{a_0}\|_{L_\infty} = \frac{4}{\sqrt{2\,\mathbb{E}G^2_{1}(Z_1,Z_2)}}, \qquad \|\bar K_{a_0}\|_{L_2} \le \|K_{a_0}\|_{L_2} = \frac{\sqrt{2}}{2},
\]
the main focus is to bound $\gamma_{2/3}([1,n^{2/d}], d_1)$, $\gamma_1([1,n^{2/d}], d_2)$ and $D_2$ properly.
First consider $\gamma_{2/3}([1,n^{2/d}], d_1)$. Note that for any $1\le a_n < a'_n < \infty$,
\[
d_1(a_n, a'_n) \le 4\left\|K_{a_n} - K_{a'_n}\right\|_{L_\infty} \le 4\int_{a_n}^{a'_n}\left\|\frac{\partial K_u}{\partial u}\right\|_{L_\infty} du.
\]
Since for any $a_n$,
\[
\frac{\partial K_{a_n}}{\partial a_n} = -\|x-x'\|^2\, G_{a_n}(x,x')\left(2\,\mathbb{E}G^2_{a_n}(Z_1,Z_2)\right)^{-1/2} - G_{a_n}(x,x')\left(2\,\mathbb{E}G^2_{a_n}(Z_1,Z_2)\right)^{-3/2}\frac{\partial}{\partial a_n}\mathbb{E}G^2_{a_n}(Z_1,Z_2),
\]
where
\[
\left(\mathbb{E}G^2_{a_n}(Z_1,Z_2)\right)^{-1/2} = \left(\frac{\pi}{2}\right)^{-d/4} a_n^{d/4}\left(\int \exp\left(-\frac{\|\omega\|^2}{8a_n}\right)\left\|\mathcal{F}p_0(\omega)\right\|^2 d\omega\right)^{-1/2} \lesssim_d a_n^{d/4}\left(\int \exp\left(-\frac{\|\omega\|^2}{8}\right)\left\|\mathcal{F}p_0(\omega)\right\|^2 d\omega\right)^{-1/2},
\]
\[
\left(\mathbb{E}G^2_{a_n}(Z_1,Z_2)\right)^{-3/2} \lesssim_d a_n^{3d/4}\left(\int \exp\left(-\frac{\|\omega\|^2}{8}\right)\left\|\mathcal{F}p_0(\omega)\right\|^2 d\omega\right)^{-3/2},
\]
and
\[
\frac{\partial}{\partial a_n}\mathbb{E}G^2_{a_n}(Z_1,Z_2) = \left(\frac{\pi}{2}\right)^{d/2} a_n^{-d/2-1}\left(-\frac{d}{2}\int \exp\left(-\frac{\|\omega\|^2}{8a_n}\right)\left\|\mathcal{F}p_0(\omega)\right\|^2 d\omega + \int \exp\left(-\frac{\|\omega\|^2}{8a_n}\right)\frac{\|\omega\|^2}{8a_n}\left\|\mathcal{F}p_0(\omega)\right\|^2 d\omega\right),
\]
these together ensure
\[
\left\|\frac{\partial K_{a_n}}{\partial a_n}\right\|_{L_\infty} \lesssim_{d,p_0} a_n^{d/4-1}.
\]
Hence
\[
d_1(a_n, a'_n) \lesssim_{d,p_0} \left|a_n^{d/4} - (a'_n)^{d/4}\right|,
\]
and $\gamma_{2/3}([1,n^{2/d}], d_1) \lesssim_{d,p_0} \big|(n^{2/d})^{d/4} - 1^{d/4}\big| \le \sqrt{n}$.
Then consider $\gamma_1([1,n^{2/d}], d_2)$. We have
\[
d_2^2(a_n, a'_n) \le \left\|K_{a'_n} - K_{a_n}\right\|^2_{L_2} = 1 - \frac{\mathbb{E}G_{a_n}G_{a'_n}}{\sqrt{\mathbb{E}G^2_{a_n}\,\mathbb{E}G^2_{a'_n}}} \le -\log\left(\frac{\mathbb{E}G_{a_n}G_{a'_n}}{\sqrt{\mathbb{E}G^2_{a_n}\,\mathbb{E}G^2_{a'_n}}}\right).
\]
Let $g_1(a_n) = \int \exp\left(-\frac{\|\omega\|^2}{8a_n}\right)\left\|\mathcal{F}p_0(\omega)\right\|^2 d\omega$. Then
\[
\log\left(\mathbb{E}G^2_{a_n}\right) = \frac{d}{2}\log\left(\frac{\pi}{2a_n}\right) + \log g_1(a_n),
\]
and hence
\[
-\log\left(\frac{\mathbb{E}G_{a_n}G_{a'_n}}{\sqrt{\mathbb{E}G^2_{a_n}\,\mathbb{E}G^2_{a'_n}}}\right) = \frac{d}{2}\left(-\frac{\log a_n + \log a'_n}{2} + \log\left(\frac{a_n + a'_n}{2}\right)\right) + \left(\frac{\log g_1(a_n) + \log g_1(a'_n)}{2} - \log g_1\left(\frac{a_n + a'_n}{2}\right)\right).
\]
Note that
\[
\frac{\log g_1(a_n) + \log g_1(a'_n)}{2} - \log g_1\left(\frac{a_n + a'_n}{2}\right) = \frac{1}{2}\int_0^{\frac{a'_n - a_n}{2}}\int_{-u}^{u}\left(\log g_1\left(\frac{a'_n + a_n}{2} + v\right)\right)''\,dv\,du.
\]
For any $a_n \ge 1$,
\[
\left(\log g_1(a_n)\right)'' = \frac{g_1(a_n)\,g_1''(a_n) - \left(g_1'(a_n)\right)^2}{g_1^2(a_n)} \le \frac{g_1''(a_n)}{g_1(a_n)},
\]
and
\[
g_1''(a_n) = \int \exp\left(-\frac{\|\omega\|^2}{8a_n}\right)\left(\frac{\|\omega\|^4}{64a_n^4} - \frac{\|\omega\|^2}{4a_n^3}\right)\left\|\mathcal{F}p_0(\omega)\right\|^2 d\omega \lesssim a_n^{-2}\|p_0\|^2_{L_2}.
\]
Moreover, there exists $a^*_n = a^*_n(p_0) > 1$ such that $g_1(a^*_n) \ge \|p_0\|^2_{L_2}/2$, from which we obtain
\[
\left(\log g_1(a_n)\right)'' \lesssim
\begin{cases}
a_n^{-2}\,\|p_0\|^2_{L_2}/g_1(1), & 1 \le a_n \le a^*_n,\\
a_n^{-2}, & a^*_n < a_n \le n^{2/d},
\end{cases}
\]
which suggests that for any $a_n, a'_n \in [1, a^*_n]$,
\[
d_2^2(a_n, a'_n) \lesssim \left(\frac{d}{2} + \frac{\|p_0\|^2_{L_2}}{g_1(1)}\right)\left(-\frac{\log a_n + \log a'_n}{2} + \log\left(\frac{a_n + a'_n}{2}\right)\right) \lesssim \left(\frac{d}{2} + \frac{\|p_0\|^2_{L_2}}{g_1(1)}\right)\left|\log a_n - \log a'_n\right|,
\]
and for any $a_n, a'_n \in [a^*_n, n^{2/d}]$,
\[
d_2^2(a_n, a'_n) \lesssim \left(\frac{d}{2} + 1\right)\left|\log a_n - \log a'_n\right|.
\]
Note that in addition to the bound on $d_2$ obtained above, we also have
\[
d_2(a_n, a'_n) \le \left\|\bar K_{a_n}\right\|_{L_2} + \left\|\bar K_{a'_n}\right\|_{L_2} \le \left\|K_{a_n}\right\|_{L_2} + \left\|K_{a'_n}\right\|_{L_2} \le \sqrt{2}.
\]
Therefore,
\[
\begin{aligned}
\gamma_1([1,n^{2/d}], d_2) &\le \sum_{m\ge 0} 2^m e_m([1,n^{2/d}], d_2)\\
&\lesssim e_0([1,n^{2/d}], d_2) + \sum_{m\ge 0} 2^m e_m([1,a^*_n], d_2) + \sum_{m\ge 0} 2^m e_m([a^*_n, n^{2/d}], d_2)\\
&\lesssim 1 + \sqrt{\frac{d}{2} + \frac{\|p_0\|^2_{L_2}}{g_1(1)}}\,\sum_{m\ge 0} 2^m\sqrt{\frac{\log a^*_n - \log 1}{2^{2^m}}} + \sqrt{\frac{d}{2} + 1}\left(\sum_{m\ge 0} 2^m\min\left\{1, \sqrt{\frac{\log n^{2/d} - \log a^*_n}{2^{2^m}}}\right\}\right)\\
&\lesssim 1 + \sqrt{\frac{d}{2} + \frac{\|p_0\|^2_{L_2}}{g_1(1)}}\,\sqrt{\log a^*_n} + \sqrt{\frac{d}{2} + 1}\left(\sum_{m\ge 0} 2^m\min\left\{1, \sqrt{\frac{\log n^{2/d}}{2^{2^m}}}\right\}\right)\\
&\lesssim 1 + \sqrt{\frac{d}{2} + \frac{\|p_0\|^2_{L_2}}{g_1(1)}}\,\sqrt{\log a^*_n} + \sqrt{\frac{d}{2} + 1}\left(\sum_{0\le m < m^*} 2^m + \sum_{m\ge m^*} 2^m\sqrt{\frac{\log n^{2/d}}{2^{2^m}}}\right)\\
&\lesssim 1 + \sqrt{\frac{d}{2} + \frac{\|p_0\|^2_{L_2}}{g_1(1)}}\,\sqrt{\log a^*_n} + \sqrt{\frac{d}{2} + 1}\cdot 2^{m^*},
\end{aligned}
\]
where $m^*$ is the smallest $m$ such that $\sqrt{\log n^{2/d}/2^{2^m}} \le 1$. Hence $2^{m^*} \asymp \log\log n$, and there exists $C = C(d) > 0$ such that
\[
\gamma_1([1,n^{2/d}], d_2) \le C(d)\log\log n
\]
for sufficiently large $n$.
By a similar approach, we get that
\[
D_2 \lesssim 1 + \sqrt{\frac{d}{2} + \frac{\|p_0\|^2_{L_2}}{g_1(1)}}\,\sqrt{\log a^*_n} + \sqrt{\frac{d}{2} + 1}\cdot m^*,
\]
which is upper bounded by $C(d)\log\log\log n$ for sufficiently large $n$.
Therefore, we finally obtain that there exists $C(d) > 0$ such that for sufficiently large $n$,
\[
P\left(\sup_{1\le a_n\le n^{2/d}}\left|\frac{1}{n-1}\sum_{i\ne j}\bar K_{a_n}(Z_i,Z_j)\right| \ge C(d)\left(\log\log n + t\log\log\log n\right)\right) \lesssim \exp(-t^{2/3}). \tag{B.7}
\]
Step (ii). By a slight abuse of notation, there exists $a^*_n = a^*_n(p_0) > 1$ such that
\[
\frac{\mathbb{E}G^2_{a_n}(Z_1,Z_2)}{\mathbb{E}[\bar G_{a_n}(Z_1,Z_2)]^2} \le 2
\]
for $a_n \ge a^*_n$. Therefore,
\[
\begin{aligned}
T^{\mathrm{GOF(adapt)}}_n &\le \sup_{1\le a_n\le a^*_n}\sqrt{\frac{\mathbb{E}G^2_{a_n}(Z_1,Z_2)}{\mathbb{E}[\bar G_{a_n}(Z_1,Z_2)]^2}}\cdot\sup_{1\le a_n\le a^*_n}\left|\frac{1}{n-1}\sum_{i\ne j}\bar K_{a_n}(Z_i,Z_j)\right| + \sqrt{2}\sup_{a^*_n\le a_n\le n^{2/d}}\left|\frac{1}{n-1}\sum_{i\ne j}\bar K_{a_n}(Z_i,Z_j)\right|\\
&\le C(p_0)\sup_{1\le a_n\le a^*_n}\left|\frac{1}{n-1}\sum_{i\ne j}\bar K_{a_n}(Z_i,Z_j)\right| + \sqrt{2}\sup_{a^*_n\le a_n\le n^{2/d}}\left|\frac{1}{n-1}\sum_{i\ne j}\bar K_{a_n}(Z_i,Z_j)\right|
\end{aligned}
\]
for some $C(p_0) > 0$.
Based on arguments similar to those in the first step,
\[
P\left(\sup_{1\le a_n\le a^*_n}\left|\frac{1}{n-1}\sum_{i\ne j}\bar K_{a_n}(Z_i,Z_j)\right| \ge C(d,p_0)\,t\right) \lesssim \exp(-t^{2/3})
\]
for some $C(d,p_0) > 0$, and (B.7) still holds when $a_n$ is restricted to $[a^*_n, n^{2/d}]$. Together, these prove Lemma 6.
B.4 Decomposition of dHSIC and Its Variance Estimation
In this section, we first derive an approximation of $\gamma^2_a(\hat P, \hat P^1\otimes\cdots\otimes\hat P^k)$ under $H_0$ for general $k$, from which the approximation of $\mathrm{var}\big(\gamma^2_a(\hat P, \hat P^1\otimes\cdots\otimes\hat P^k)\big)$ can subsequently be obtained.
Note that
\[
\begin{aligned}
G_a(x,y) &= \int G_a(u,v)\,d(\delta_x - P + P)(u)\,d(\delta_y - P + P)(v)\\
&= \bar G_a(x,y) + \left(\mathbb{E}G_a(x,X) - \mathbb{E}G_a(X,X')\right) + \left(\mathbb{E}G_a(y,X) - \mathbb{E}G_a(X,X')\right) + \mathbb{E}G_a(X,X').
\end{aligned}
\]
Similarly, write
\[
G_a(x, (y^1,\cdots,y^k)) = \int G_a(u, (v^1,\cdots,v^k))\,d(\delta_x - P + P)(u)\,d(\delta_{y^1} - P^1 + P^1)(v^1)\cdots d(\delta_{y^k} - P^k + P^k)(v^k)
\]
and expand it as the summation of all $r$-variate centered components with $r \le k+1$. Apply the same expansion to $G_a((x^1,\cdots,x^k), (y^1,\cdots,y^k))$ and write it as the summation of all $r$-variate centered components with $r \le 2k$. Plug these expansions into $\gamma^2_a(\hat P, \hat P^1\otimes\cdots\otimes\hat P^k)$ and denote the summation of all $r$-variate centered components in the resulting expression by $D_r(a)$ for $r \le 2k$. Let the remainder be $R_n = \sum_{r=3}^{2k} D_r(a)$, so that
\[
\gamma^2_a(\hat P, \hat P^1\otimes\cdots\otimes\hat P^k) = \gamma^2_a(P, P^1\otimes\cdots\otimes P^k) + D_1(a) + D_2(a) + R_n.
\]
Straightforward calculation yields the following facts:

• $\mathbb{E}(R_n)^2 \lesssim_k n^{-3}\left(\mathbb{E}G^2_a(X_1,X_2) + \sum_{j=1}^{k}\mathbb{E}G^2_a(X^j_1,X^j_2)\right)$;

• under the null hypothesis, $D_1(a) = 0$ and
\[
D_2(a) = \frac{1}{n(n-1)}\sum_{1\le i\ne j\le n} G^*_a(X_i,X_j),
\]
where
\[
G^*_a(x,y) = \bar G_a(x,y) - \sum_{1\le j\le k} g_j(x^j, y) - \sum_{1\le j\le k} g_j(y^j, x) + \sum_{1\le j_1,j_2\le k} g_{j_1,j_2}(x^{j_1}, y^{j_2}).
\]
Proof of Lemma 1. Observe that under $H_0$,
\[
\mathrm{var}\left(\gamma^2_a(\hat P, \hat P^1\otimes\cdots\otimes\hat P^k)\right) = \mathbb{E}\left(D_2(a)\right)^2 + \mathbb{E}\left(R_n\right)^2 = \frac{2}{n(n-1)}\,\mathbb{E}[G^*_a(X_1,X_2)]^2 + \mathbb{E}\left(R_n\right)^2,
\]
\[
\mathbb{E}\left(R_n\right)^2 \lesssim_k n^{-3}\,\mathbb{E}G^2_a(X_1,X_2),
\]
and
\[
\begin{aligned}
\mathbb{E}[G^*_a(X_1,X_2)]^2 &= \mathbb{E}\left(\bar G_a(X_1,X_2) - \sum_{1\le j\le k} g_j(X^j_1, X_2)\right)^2 - \mathbb{E}\left(\sum_{1\le j\le k} g_j(X^j_2, X_1) + \sum_{1\le j_1,j_2\le k} g_{j_1,j_2}(X^{j_1}_1, X^{j_2}_2)\right)^2\\
&= \mathbb{E}\bar G^2_a(X_1,X_2) - 2\sum_{1\le j\le k}\mathbb{E}\left(g_j(X^j_1, X_2)\right)^2 + \sum_{1\le j_1,j_2\le k}\mathbb{E}\left(g_{j_1,j_2}(X^{j_1}_1, X^{j_2}_2)\right)^2.
\end{aligned}
\]
They together conclude the proof.
Below we shall further expand $\mathbb{E}\bar G^2_a(X_1,X_2)$, $\mathbb{E}\big(g_j(X^j_1,X_2)\big)^2$ and $\mathbb{E}\big(g_{j_1,j_2}(X^{j_1}_1,X^{j_2}_2)\big)^2$ in Lemma 1, based on which a consistent estimator of $\mathrm{var}\big(\gamma^2_a(\hat P, \hat P^1\otimes\cdots\otimes\hat P^k)\big)$ can be derived naturally.

First,
\[
\begin{aligned}
\mathbb{E}\bar G^2_a(X_1,X_2) &= \mathbb{E}G^2_a(X_1,X_2) - 2\,\mathbb{E}G_a(X_1,X_2)G_a(X_1,X_3) + \left(\mathbb{E}G_a(X_1,X_2)\right)^2\\
&= \prod_{1\le j\le k}\mathbb{E}G^2_a(X^j_1,X^j_2) - 2\prod_{1\le j\le k}\mathbb{E}G_a(X^j_1,X^j_2)G_a(X^j_1,X^j_3) + \prod_{1\le j\le k}\left(\mathbb{E}G_a(X^j_1,X^j_2)\right)^2.
\end{aligned}
\]
Second,
\[
\begin{aligned}
\mathbb{E}\left(g_j(X^j_1, X_2)\right)^2 &= \mathbb{E}G^2_a(X^j_1,X^j_2)\cdot\prod_{i\ne j}\mathbb{E}G_a(X^i_1,X^i_2)G_a(X^i_1,X^i_3) - \prod_{1\le i\le k}\mathbb{E}G_a(X^i_1,X^i_2)G_a(X^i_1,X^i_3)\\
&\quad - \mathbb{E}G_a(X^j_1,X^j_2)G_a(X^j_1,X^j_3)\cdot\prod_{i\ne j}\left(\mathbb{E}G_a(X^i_1,X^i_2)\right)^2 + \prod_{1\le i\le k}\left(\mathbb{E}G_a(X^i_1,X^i_2)\right)^2.
\end{aligned}
\]
Hence
\[
\begin{aligned}
\sum_{1\le j\le k}\mathbb{E}\left(g_j(X^j_1, X_2)\right)^2 &= \left(\prod_{1\le i\le k}\mathbb{E}G_a(X^i_1,X^i_2)G_a(X^i_1,X^i_3)\right)\left(\sum_{1\le j\le k}\frac{\mathbb{E}G^2_a(X^j_1,X^j_2)}{\mathbb{E}G_a(X^j_1,X^j_2)G_a(X^j_1,X^j_3)} - k\right)\\
&\quad - \left(\prod_{1\le i\le k}\left(\mathbb{E}G_a(X^i_1,X^i_2)\right)^2\right)\left(\sum_{1\le j\le k}\frac{\mathbb{E}G_a(X^j_1,X^j_2)G_a(X^j_1,X^j_3)}{\left(\mathbb{E}G_a(X^j_1,X^j_2)\right)^2} - k\right).
\end{aligned}
\]
Finally,
\[
\mathbb{E}\left(g_{j_1,j_2}(X^{j_1}_1, X^{j_2}_2)\right)^2 =
\begin{cases}
\mathbb{E}\left(\bar G_a(X^{j_1}_1, X^{j_1}_2)\right)^2\cdot\displaystyle\prod_{i\ne j_1}\left(\mathbb{E}G_a(X^i_1,X^i_2)\right)^2, & j_1 = j_2,\\[2ex]
\displaystyle\prod_{j\in\{j_1,j_2\}}\left(\mathbb{E}G_a(X^j_1,X^j_2)G_a(X^j_1,X^j_3) - \left(\mathbb{E}G_a(X^j_1,X^j_2)\right)^2\right)\prod_{i\notin\{j_1,j_2\}}\left(\mathbb{E}G_a(X^i_1,X^i_2)\right)^2, & j_1 \ne j_2.
\end{cases}
\]
Hence
\[
\begin{aligned}
\sum_{1\le j_1,j_2\le k}\mathbb{E}\left(g_{j_1,j_2}(X^{j_1}_1, X^{j_2}_2)\right)^2 &= \left(\prod_{1\le i\le k}\left(\mathbb{E}G_a(X^i_1,X^i_2)\right)^2\right)\left(\sum_{1\le j_1\le k}\frac{\mathbb{E}\left(\bar G_a(X^{j_1}_1, X^{j_1}_2)\right)^2}{\left(\mathbb{E}G_a(X^{j_1}_1, X^{j_1}_2)\right)^2}\right.\\
&\qquad\left. + \sum_{1\le j_1\ne j_2\le k}\ \prod_{j\in\{j_1,j_2\}}\left(\frac{\mathbb{E}G_a(X^j_1,X^j_2)G_a(X^j_1,X^j_3)}{\left(\mathbb{E}G_a(X^j_1,X^j_2)\right)^2} - 1\right)\right).
\end{aligned}
\]
Then a consistent estimator $s^2_{n,a}$ of $\mathbb{E}\left(G^*_a(X_1,X_2)\right)^2$ is constructed by replacing
\[
\mathbb{E}G^2_a(X^l_1,X^l_2), \qquad \mathbb{E}G_a(X^l_1,X^l_2)G_a(X^l_1,X^l_3), \qquad \left(\mathbb{E}G_a(X^l_1,X^l_2)\right)^2
\]
in the above expansions of
\[
\mathbb{E}\bar G^2_a(X_1,X_2), \qquad \sum_{1\le j\le k}\mathbb{E}\left(g_j(X^j_1,X_2)\right)^2, \qquad \sum_{1\le j_1,j_2\le k}\mathbb{E}\left(g_{j_1,j_2}(X^{j_1}_1,X^{j_2}_2)\right)^2
\]
with the corresponding unbiased estimators
\[
\frac{1}{n(n-1)}\sum_{1\le i\ne j\le n} G^2_{a_n}(X^l_i, X^l_j),\qquad \frac{(n-3)!}{n!}\sum_{\substack{1\le i,j_1,j_2\le n\\ |\{i,j_1,j_2\}|=3}} G_{a_n}(X^l_i, X^l_{j_1})G_{a_n}(X^l_i, X^l_{j_2}),\qquad \frac{(n-4)!}{n!}\sum_{\substack{1\le i_1,i_2,j_1,j_2\le n\\ |\{i_1,i_2,j_1,j_2\}|=4}} G_{a_n}(X^l_{i_1}, X^l_{j_1})G_{a_n}(X^l_{i_2}, X^l_{j_2})
\]
for $1\le l\le k$. Again, to avoid a negative estimate of the variance, we can replace $s^2_{n,a_n}$ with $1/n^2$ whenever it is negative or too small. Namely, let
\[
\hat s^2_{n,a_n} = \max\left\{s^2_{n,a_n},\, 1/n^2\right\},
\]
and estimate $\mathrm{var}\left(\gamma^2_a(\hat P, \hat P^1\otimes\cdots\otimes\hat P^k)\right)$ by $2\hat s^2_{n,a}/(n(n-1))$.
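For a single coordinate block, the three unbiased U-statistics above reduce to sums over the Gram matrix. The sketch below (a minimal illustration with hypothetical helper names, not the thesis's implementation) uses the standard identities $\sum_{|\{i,j_1,j_2\}|=3} G_{ij_1}G_{ij_2} = \sum_i r_i^2 - \sum_{i,j} G_{ij}^2$ and $\sum_{|\{i_1,i_2,j_1,j_2\}|=4} G_{i_1j_1}G_{i_2j_2} = S^2 - 4\sum_i r_i^2 + 2\sum_{i,j} G_{ij}^2$, where $G$ has zero diagonal, $r_i$ denotes its row sums, and $S$ its total sum.

```python
import numpy as np

def component_u_stats(X, a):
    """Unbiased estimators of E G_a^2(X_1,X_2), E G_a(X_1,X_2) G_a(X_1,X_3)
    and (E G_a(X_1,X_2))^2 for one coordinate block (X is an n x d_l array)."""
    n = len(X)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    G = np.exp(-a * sq)
    np.fill_diagonal(G, 0.0)               # exclude i = j terms
    S, T2 = G.sum(), (G ** 2).sum()
    B = (G.sum(axis=1) ** 2).sum()         # sum of squared row sums
    e_g2 = T2 / (n * (n - 1))
    e_gg = (B - T2) / (n * (n - 1) * (n - 2))
    e_gsq = (S ** 2 - 4 * B + 2 * T2) / (n * (n - 1) * (n - 2) * (n - 3))
    return e_g2, e_gg, e_gsq

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
e_g2, e_gg, e_gsq = component_u_stats(X, a=1.0)
# Population targets for X ~ N(0,1), a = 1:
# E G^2 = 1/3, E G G = 1/sqrt(21), (E G)^2 = 1/5.
```

Per the construction above, these per-block quantities are then combined into $s^2_{n,a}$, and the truncation $\hat s^2_{n,a} = \max\{s^2_{n,a}, 1/n^2\}$ guards against a negative or vanishing estimate.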
Therefore, for general $k$, the single-kernel test statistic and the adaptive test statistic are constructed as
\[
T^{\mathrm{IND}}_{n,a_n} = \frac{n}{\sqrt{2}}\,\hat s^{-1}_{n,a_n}\,\gamma^2_{a_n}(\hat P, \hat P^1\otimes\cdots\otimes\hat P^k) \qquad\text{and}\qquad T^{\mathrm{IND(adapt)}}_n = \max_{1\le a_n\le n^{2/d}} T^{\mathrm{IND}}_{n,a_n},
\]
respectively. Accordingly, $\Phi^{\mathrm{IND}}_{n,a_n,\alpha}$ and $\Phi^{\mathrm{IND(adapt)}}$ can be constructed as in the case of $k = 2$.
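Given $\gamma^2_{a_n}$ and $s^2_{n,a_n}$ evaluated on a grid of scaling parameters, assembling the two statistics is mechanical. The sketch below illustrates only that last step, with hypothetical input values standing in for quantities computed by the estimators above.

```python
import numpy as np

def single_and_adaptive_stats(gamma2, s2, n):
    """gamma2[i] and s2[i]: gamma^2_a and s^2_{n,a} on a grid of a-values.
    Returns the single-kernel statistics T_{n,a} and their maximum."""
    s2_hat = np.maximum(np.asarray(s2, dtype=float), 1.0 / n ** 2)  # truncation
    T = n / np.sqrt(2.0) * np.asarray(gamma2, dtype=float) / np.sqrt(s2_hat)
    return T, T.max()

T, T_adapt = single_and_adaptive_stats([0.10, 0.20], [0.04, 1e-9], n=10)
```

Note how the second grid point's tiny variance estimate is floored at $1/n^2 = 0.01$ before standardization.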
B.5 Theoretical Properties of Independence Tests for General $k$
In this section, with $\Phi^{\mathrm{IND}}_{n,a_n,\alpha}$ and $\Phi^{\mathrm{IND(adapt)}}$ constructed in Appendix B.4 for general $k$, we confirm that Theorem 12, Theorem 13 and Theorem 16 still hold. We shall only emphasize the main differences between the new proofs and the original proofs for the case $k = 2$.

Under the null hypothesis: we only need to re-establish that $s^2_{n,a_n}$ is a consistent estimator of $\mathbb{E}[G^*_{a_n}(X_1,X_2)]^2$. Specifically, we show that
\[
s^2_{n,a_n}/\mathbb{E}[G^*_{a_n}(X_1,X_2)]^2 \to_p 1
\]
given $1 \ll a_n \ll n^{4/d}$ for Theorem 12, and
\[
\sup_{1\le a_n\le n^{2/d}}\left|s^2_{n,a_n}/\mathbb{E}[G^*_{a_n}(X_1,X_2)]^2 - 1\right| = o_p(1)
\]
for Theorem 16.
To prove the former, since
\[
\frac{\mathbb{E}[G^*_{a_n}(X_1,X_2)]^2}{\left(\pi/(2a_n)\right)^{d/2}\|p\|^2_{L_2}} \to 1
\]
as $a_n\to\infty$, it suffices to show
\[
a_n^{d/2}\left|s^2_{n,a_n} - \mathbb{E}[G^*_{a_n}(X_1,X_2)]^2\right| = o_p(1),
\]
which follows considering that
\[
a_n^{d_j/2}\,\mathbb{E}G^2_{a_n}(X^j_1,X^j_2), \qquad a_n^{3d_j/4}\,\mathbb{E}G_{a_n}(X^j_1,X^j_2)G_{a_n}(X^j_1,X^j_3), \qquad a_n^{d_j}\left(\mathbb{E}G_{a_n}(X^j_1,X^j_2)\right)^2 \tag{B.8}
\]
are all bounded and are estimated consistently by their corresponding estimators. For example,
\[
a_n^{d_j/2}\,\mathbb{E}G^2_{a_n}(X^j_1,X^j_2) \to \left(\frac{\pi}{2}\right)^{d_j/2}\|p_j\|^2_{L_2}
\]
and
\[
\begin{aligned}
a_n^{d_j}\,\mathbb{E}\left(\frac{1}{n(n-1)}\sum_{1\le i\ne i'\le n} G^2_{a_n}(X^j_i,X^j_{i'}) - \mathbb{E}G^2_{a_n}(X^j_1,X^j_2)\right)^2 &= a_n^{d_j}\,\mathrm{var}\left(\frac{1}{n(n-1)}\sum_{1\le i\ne i'\le n} G^2_{a_n}(X^j_i,X^j_{i'})\right)\\
&\lesssim a_n^{d_j}\left(n^{-1}\,\mathbb{E}G^2_{a_n}(X^j_1,X^j_2)G^2_{a_n}(X^j_1,X^j_3) + n^{-2}\,\mathbb{E}G^4_{a_n}(X^j_1,X^j_2)\right)\\
&\lesssim_{d_j} n^{-1} a_n^{d_j/4}\|p_j\|^3_{L_2} + n^{-2} a_n^{d_j/2}\|p_j\|^2_{L_2} \to 0.
\end{aligned}
\]
The proof of the latter is similar. It suffices to have:

• each term in (B.8) is bounded for $a_n \in [1,\infty)$, which immediately follows since each term is continuous and converges as $a_n \to \infty$;

• the difference between each term in (B.8) and its corresponding estimator converges to $0$ uniformly over $a_n \in [1, n^{2/d}]$, the proof of which is the same as that of Lemma 5.

Under the alternative hypothesis: we only need to re-establish that $\hat s_{n,a_n}$ is bounded. Specifically, we show
\[
\inf_{p\in H^{\mathrm{IND}}_1(\Delta_{n,s})} \frac{n\,\gamma^2_{a_n}(P, P^1\otimes\cdots\otimes P^k)}{\left[\mathbb{E}\left(\hat s^2_{n,a_n}\right)^{1/k}\right]^{k/2}} \to \infty
\]
for Theorem 13, and
\[
\inf_{s\ge d/4}\ \inf_{p\in H^{\mathrm{IND}}_1(\Delta_{n,s};M)} P\left(\hat s^2_{n,a_n(s)'} \le 2M^2\left(2a_n(s)'/\pi\right)^{-d/2}\right) \to 1 \tag{B.9}
\]
for Theorem 16, where $a_n(s)' = (\log\log n/n)^{-4/(4s+d)}$.
The former holds because
\[
\mathbb{E}\left(\hat s^2_{n,a_n}\right)^{1/k} \le \mathbb{E}\left(\max\left\{\left|s^2_{n,a_n}\right|, 1/n^2\right\}\right)^{1/k} \le \mathbb{E}\left|s^2_{n,a_n}\right|^{1/k} + n^{-2/k} \lesssim_k \left(\prod_{j=1}^{k}\mathbb{E}G^2_{a_n}(X^j_1,X^j_2)\right)^{1/k} + n^{-2/k} \le \left(M^2\left(\pi/(2a_n)\right)^{d/2}\right)^{1/k} + n^{-2/k},
\]
where the second-to-last inequality follows from the generalized Hölder inequality. For example,
\[
\mathbb{E}\left(\prod_{j=1}^{k}\frac{1}{n(n-1)}\sum_{1\le i\ne i'\le n} G^2_{a_n}(X^j_i,X^j_{i'})\right)^{1/k} \le \left(\prod_{j=1}^{k}\mathbb{E}G^2_{a_n}(X^j_1,X^j_2)\right)^{1/k}.
\]
To prove the latter one, note that for a๐ = a๐ (๐ )โฒ, all three terms in (B.8) are bounded by
๐2๐(๐/2)๐๐/2 and the variances of their corresponding estimators are bounded by
๐ถ (๐๐)(๐โ1 (a๐ (๐ )โฒ)๐๐/4 ๐3
๐ + ๐โ2 (a๐ (๐ )โฒ)๐๐/2 ๐2
๐
)= ๐(1)
uniformly over all ๐ . Therefore,
inf๐ โฅ๐/4
inf๐โ๐ปIND
1 (ฮ๐,๐ ;๐ )๐
((a๐ (๐ )โฒ)๐/2
๏ฟฝ๏ฟฝ๏ฟฝ๐ 2๐,a๐ (๐ ) โฒ โ E[๐บโa๐ (๐ ) โฒ (๐1, ๐2)]2
๏ฟฝ๏ฟฝ๏ฟฝ โค ๐2(๐/2)๐/2)โ 1
123
where $Y_1, Y_2 \sim_{\mathrm{iid}} P^1\otimes\cdots\otimes P^k$. Further considering that
\[
\mathbb{E}[G^*_{a_n(s)'}(Y_1,Y_2)]^2 \le \mathbb{E}[\bar G_{a_n(s)'}(Y_1,Y_2)]^2 \le M^2\left(\pi/(2a_n(s)')\right)^{d/2}
\]
and that
\[
1/n^2 = o\left(\left(a_n(s)'\right)^{-d/2}\right)
\]
uniformly over all $s$, we prove (B.9).