On the Construction of Minimax Optimal Nonparametric Tests with Kernel Embedding Methods
Tong Li
Submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy under the Executive Committee
of the Graduate School of Arts and Sciences
COLUMBIA UNIVERSITY
2021
© 2021
Tong Li
All Rights Reserved
Abstract
On the Construction of Minimax Optimal Nonparametric Tests with Kernel Embedding Methods
Tong Li
Kernel embedding methods have witnessed a great deal of practical success in the area of
nonparametric hypothesis testing in recent years. Yet ever since their introduction, one question
has persisted: which kernel should be selected? The performance of the associated nonparametric
tests can vary dramatically with the choice of kernel, and kernel selection is usually done in an
ad hoc manner. We ask whether there is a principled way to select the kernel so as to ensure that
the associated nonparametric tests perform well. As consistency results against fixed alternatives do not tell the full story
about the power of the associated tests, we study their statistical performance within the minimax
framework. First, focusing on the case of goodness-of-fit tests, our analyses show that a vanilla
version of the kernel embedding based test could be suboptimal, and suggest a simple remedy by
moderating the kernel. We prove that the moderated approach provides optimal tests for a wide
range of deviations from the null and can also be made adaptive over a large collection of
interpolation spaces. Then, we study the asymptotic properties of goodness-of-fit, homogeneity
and independence tests using Gaussian kernels, arguably the most popular and successful among
such tests. Our results provide theoretical justifications for this common practice by showing that
tests using a Gaussian kernel with an appropriately chosen scaling parameter are minimax
optimal against smooth alternatives in all three settings. In addition, our analysis also pinpoints
the importance of choosing a diverging scaling parameter when using Gaussian kernels and
suggests a data-driven choice of the scaling parameter that yields tests optimal, up to an iterated
logarithmic factor, over a wide range of smooth alternatives. Numerical experiments are
presented to further demonstrate the practical merits of our methodology.
Table of Contents
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Kernel Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Nonparametric Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Minimax Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Kernel Selection and Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Chapter 2: Moderated Kernel Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Background and Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Operating Characteristics of MMD Based Test . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Asymptotics under 𝐻_0^GOF . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 Power Analysis for MMD Based Tests . . . . . . . . . . . . . . . . . . . . 12
2.3 Optimal Tests Based on Moderated MMD . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Moderated MMD Test Statistic . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.2 Operating Characteristics of η²_{𝜚_n}(P_n, P_0) Based Tests . . . . . . . . . . 15
2.3.3 Minimax Optimality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Chapter 3: Gaussian Kernel Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1 Test for Goodness-of-fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Test for Homogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Test for Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4 Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4.1 Test for Goodness-of-fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4.2 Test for Homogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4.3 Test for Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Chapter 4: Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1 Effect of Scaling Parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Efficacy of Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3 Data Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Chapter 5: Conclusion and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Chapter 6: Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Appendix A: Some Technical Results and Proofs Related to Chapter 2 . . . . . . . . . . . 102
Appendix B: Some Technical Results and Proofs Related to Chapter 3 . . . . . . . . . . . 104
B.1 Properties of Gaussian Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
B.2 Proof of Lemma 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
B.3 Proof of Lemma 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
B.4 Decomposition of dHSIC and Its Variance Estimation . . . . . . . . . . . . . . . . 116
B.5 Theoretical Properties of Independence Tests for General 𝑘 . . . . . . . . . . . . . 121
List of Tables
4.1 Frequency that each DAG in Figure 4.4 was selected by three tests. . . . . . . . . . 42
List of Figures
4.1 Observed power against log(a) in Experiment I (left) and Experiment II (right). . . 38
4.2 Observed power versus sample size in Experiment III for 𝑑 = 1, 10, 100, 1000 from
left to right. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3 Observed power versus sample size in Experiment IV for 𝑑 = 2, 10, 100, 1000 from
left to right. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4 DAGs with the top 3 highest probabilities of being selected. . . . . . . . . . . . . . 42
Acknowledgements
First of all, I want to express my sincere gratitude to my Ph.D. advisor Prof. Ming Yuan. His
guidance and support are invaluable to my research and my life. I have learned a lot from his deep
insights and decent work ethic.
I am very grateful to Prof. Bodhisattva Sen, Prof. Victor de la Pena, Prof. Cynthia Rush and
Prof. Bin Cheng for serving on the committee. Their comments and feedback were very
helpful in revising my thesis and have inspired directions for future research.
I also want to thank my close friends and my fellow students, Yi Li, Youran Qi, Chensheng
Kuang, Shulei Wang, Cuize Han, Fan Gao, Luxi Cao, Yuan Li, Yuqing Xu, Chaoyu Yuan, Guanhua
Fang, Yuanzhe Xu and Shun Xu. Their company has made this journey much more pleasurable.
Their suggestions and help have been indispensable during the hard moments of my life.
Finally, I want to thank my parents Xiuyuan Li and Ping Zhou. They have always been sup-
porting and encouraging me. I owe a lot to them.
Dedication
To my beloved parents who always give me unconditional support and encouragement.
Chapter 1: Introduction
1.1 Kernel Embedding
Tests for goodness-of-fit, homogeneity and independence are central to statistical inference.
Numerous techniques have been developed for these tasks and are routinely used in practice. In
recent years, there has been renewed interest in them from both statistics and related fields, as they
arise naturally in many modern applications where the performance of classical methods is less
than satisfactory. In particular, nonparametric inference via the embedding of distributions into
a reproducing kernel Hilbert space (RKHS) have emerged as a popular and powerful technique to
tackle these challenges. The approach immediately allows for easy access to the rich machinery
for RKHS and has found great successes in a wide range of applications from causal discovery to
deep learning. See, e.g., Muandet et al. (2017) for a recent review.
More specifically, let 𝐾 (·, ·) be a symmetric and positive definite function defined over X ×X,
that is 𝐾 (𝑥, 𝑦) = 𝐾 (𝑦, 𝑥) for all 𝑥, 𝑦 ∈ X, and the Gram matrix [𝐾 (𝑥𝑖, 𝑥 𝑗 )]1≤𝑖, 𝑗≤𝑛 is positive
definite for any distinct 𝑥1, . . . , 𝑥𝑛 ∈ X. The Moore-Aronszajn theorem ensures that such a
function, referred to as a kernel, can always be uniquely identified with an RKHS H_K of functions
over X. The embedding
μ_P(·) := ∫_X K(x, ·) P(dx),
maps a probability distribution P into H𝐾 . The difference between two probability distributions P
and Q can then be conveniently measured by
γ_K(P, Q) := ‖μ_P − μ_Q‖_{H_K}.
Under mild regularity conditions, it can be shown that 𝛾𝐾 (P,Q) is an integral probability metric so
that it is zero if and only if P = Q, and
γ_K(P, Q) = sup_{f ∈ H_K : ‖f‖_{H_K} ≤ 1} ∫_X f d(P − Q).
As such, γ_K(P, Q) is often referred to as the maximum mean discrepancy (MMD) between P and Q.
See, e.g., Sriperumbudur et al. (2010) or Gretton et al. (2012a) for details. In what follows, we shall
drop the subscript 𝐾 whenever its choice is clear from the context. It was noted recently that MMD
is also closely related to the so-called energy distance between random variables (Székely et al.,
2007; Székely and Rizzo, 2009) commonly used to measure independence. See, e.g., Sejdinovic
et al. (2013) and Lyons (2013).
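For a computational view, the squared MMD admits a closed form in kernel evaluations, γ²(P, Q) = E K(X, X′) + E K(Y, Y′) − 2 E K(X, Y) with X, X′ ∼ P and Y, Y′ ∼ Q, so its empirical (V-statistic) version is straightforward to evaluate. The sketch below is purely illustrative: the Gaussian kernel and the unit bandwidth are arbitrary choices, not prescriptions from this thesis.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel K(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def mmd_squared(X, Y, bandwidth=1.0):
    """V-statistic estimate of the squared MMD between samples X and Y."""
    Kxx = gaussian_kernel(X, X, bandwidth)
    Kyy = gaussian_kernel(Y, Y, bandwidth)
    Kxy = gaussian_kernel(X, Y, bandwidth)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 1))
Y = rng.normal(1.0, 1.0, size=(200, 1))  # mean-shifted alternative
print(mmd_squared(X, X))  # identical samples: 0 (up to floating point)
print(mmd_squared(X, Y))  # shifted samples: clearly positive
```

Identical samples give (numerically) zero, while a mean-shifted alternative yields a clearly positive value; how sensitive this discrepancy is to a given departure is governed by the kernel, which is exactly the kernel selection issue studied in the following chapters.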
1.2 Nonparametric Hypothesis Testing
Given a sample from P and/or Q, estimates of γ(P, Q) can be derived by replacing P and
Q with their respective empirical distributions. These estimates can subsequently be used for
nonparametric hypothesis testing. Here are several notable examples that we shall focus on in this
work.
Goodness-of-fit tests. The goal of goodness-of-fit tests is to check if a sample comes from
a pre-specified distribution. Let 𝑋1, · · · , 𝑋𝑛 be 𝑛 independent X-valued samples from a certain
distribution P. We are interested in testing whether the hypothesis 𝐻_0^GOF : P = P_0 holds for a fixed P_0.
Deviation from P_0 can be conveniently measured by γ(P, P_0), which can be readily estimated by
γ(P_n, P_0) := sup_{f ∈ H_K : ‖f‖_{H_K} ≤ 1} ∫_X f d(P_n − P_0),
where P𝑛 is the empirical distribution of 𝑋1, · · · , 𝑋𝑛. A natural procedure is to reject 𝐻0 if the
estimate exceeds a threshold calibrated to ensure a certain significance level, say 𝛼 (0 < 𝛼 < 1).
Homogeneity tests. Homogeneity tests check if two independent samples come from a com-
mon population. Given two independent samples 𝑋1, · · · , 𝑋𝑛 ∼iid P and 𝑌1, · · · , 𝑌𝑚 ∼iid Q, we are
interested in testing whether the null hypothesis 𝐻_0^HOM : P = Q holds. Discrepancy between P and Q can
be measured by 𝛾(P,Q), and similar to before, it can be estimated by the MMD between P𝑛 and
Q𝑚:
γ(P_n, Q_m) := sup_{f ∈ H_K : ‖f‖_{H_K} ≤ 1} ∫_X f d(P_n − Q_m).
Again we reject 𝐻0 if the estimate exceeds a threshold calibrated to ensure a certain significance
level.
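In practice this threshold is often calibrated by permutation: under 𝐻_0^HOM the pooled sample is exchangeable, so randomly re-splitting it reproduces the null distribution of the statistic. The following is a minimal, self-contained sketch; the function names are illustrative and the Gaussian kernel with unit bandwidth is an arbitrary choice.

```python
import numpy as np

def mmd2(X, Y, bw=1.0):
    """V-statistic estimate of the squared MMD with a Gaussian kernel."""
    def gram(a, b):
        d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2 * bw ** 2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean()

def permutation_test(X, Y, alpha=0.05, n_perm=200, seed=0):
    """Reject H0: P = Q if the observed MMD^2 exceeds the permutation (1-alpha)-quantile."""
    rng = np.random.default_rng(seed)
    n = len(X)
    pooled = np.vstack([X, Y])
    stat = mmd2(X, Y)
    null_stats = []
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))  # re-split the pooled sample at random
        null_stats.append(mmd2(pooled[idx[:n]], pooled[idx[n:]]))
    return stat > np.quantile(null_stats, 1 - alpha)

rng = np.random.default_rng(1)
same = permutation_test(rng.normal(size=(100, 1)), rng.normal(size=(100, 1)))
diff = permutation_test(rng.normal(size=(100, 1)), rng.normal(2.0, 1.0, size=(100, 1)))
print(same, diff)
```

With a mean shift of 2 the statistic lies far above the permutation quantile, while two samples from the same distribution are rejected only with probability roughly α.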
Independence tests. How to measure or test for independence among a set of random variables
is another classical problem in statistics. Let X = (X^1, . . . , X^k)^⊤ ∈ X_1 × · · · × X_k be a random
vector. If the random vectors X^1, . . . , X^k are jointly independent, then the distribution of X can be
factorized:

𝐻_0^IND : P_X = P_{X^1} ⊗ · · · ⊗ P_{X^k}.
Dependence among 𝑋1, . . . , 𝑋 𝑘 can be naturally measured by the difference between the joint
distribution and the product distribution evaluated under MMD:
γ(P_X, P_{X^1} ⊗ · · · ⊗ P_{X^k}) = ‖μ_{P_X} − μ_{P_{X^1} ⊗ ··· ⊗ P_{X^k}}‖_{H_K}.
When k = 2, γ²(P_X, P_{X^1} ⊗ P_{X^2}) can be expressed as the squared Hilbert-Schmidt norm of the
cross-covariance operator associated with X^1 and X^2, and is therefore referred to as the Hilbert-Schmidt
independence criterion (HSIC; Gretton et al., 2005). The more general case as given above is
sometimes referred to as dHSIC (see, e.g., Pfister et al., 2018). As before, we reject the
independence assumption when γ(P_n^X, P_n^{X^1} ⊗ · · · ⊗ P_n^{X^k}) exceeds a certain threshold,
where P_n^X and P_n^{X^j} are the empirical distributions of X and X^j, respectively.
1.3 Minimax Framework
In all these cases the test statistic, namely γ²(P_n, P_0), γ²(P_n, Q_m) or γ²(P_n^X, P_n^{X^1} ⊗ · · · ⊗ P_n^{X^k}),
is a V-statistic. Following standard asymptotic theory for V-statistics (see, e.g., Serfling, 2009), it
can be shown that under mild regularity conditions, when appropriately scaled by the sample size,
they converge to a mixture of χ²_1 distributions with weights determined jointly by the underlying
probability distribution and the choice of kernel K. In contrast, it can also be derived that for a
fixed alternative,
γ²(P_n, P_0) →_p γ²(P, P_0),  γ²(P_n, Q_m) →_p γ²(P, Q),
and γ²(P_n^X, P_n^{X^1} ⊗ · · · ⊗ P_n^{X^k}) →_p γ²(P_X, P_{X^1} ⊗ · · · ⊗ P_{X^k}).
This immediately suggests that all aforementioned tests are consistent against fixed alternatives in
that their power tends to one as sample sizes increase. Although useful, such consistency results
do not tell the full story about the power of these tests, and if there are yet more powerful methods.
Specifically, although consistency against any fixed alternative is established, the rate at which the
power converges to 1 may vary across alternatives, and it remains unclear whether a given sample
size is large enough to ensure good power against the underlying alternative. In other
words, can we detect the difference between the null and the alternative hypotheses with a high
probability even in the worst scenario?
This concern naturally brings in the notion of uniform consistency, meaning power converging to
1 uniformly over all alternatives under consideration, and leads us to adopt the minimax hypothesis
testing framework pioneered by Burnashev (1979), Ingster (1987), and Ingster (1993). See also
Ermakov (1991), Spokoiny (1996), Lepski and Spokoiny (1999), Ingster and Suslina (2000),
Ingster (2000), Baraud (2002), Fromont and Laurent (2006), Fromont et al. (2012), and Fromont et
al. (2013), and references therein. Within this framework, we consider testing against alternatives
getting closer and closer to the null hypothesis as the sample size increases. The smallest departure
from the null hypothesis that can be detected consistently, in a minimax sense, is referred to as
the optimal detection boundary, and a test that maintains uniform consistency as the departure
converges to 0 at the rate of the optimal detection boundary is called minimax rate optimal.
1.4 Kernel Selection and Adaptation
The critical importance of kernel selection is widely recognized in practice, as the statistical
performance of the associated tests can vary dramatically with different kernels. Yet the way
it is done is usually ad hoc, and how to do so in a more principled way remains one of the chief
practical challenges. See, e.g., Gretton et al. (2008), Fukumizu et al. (2009), Gretton et al. (2012b),
and Sutherland et al. (2017). In the following chapters, we address this problem by proposing two
kernel selection methods in different settings such that the associated tests are shown to be minimax
rate optimal.
However, such kernel selection methods depend on regularity conditions on the underlying
space of probability distributions, and whether kernel selection can be done in an agnostic manner
remains another challenge. This naturally brings about the issue of adaptation. To address it,
we introduce a simple testing procedure that maximizes a normalized MMD over a pre-specified
class of kernels. A similar idea of maximizing MMD over a class of kernels was first introduced
by Sriperumbudur et al. (2009). Our analysis, however, suggests that it is more desirable to
maximize the normalized MMD instead. More specifically, we show that the proposed procedure can attain
the optimal rate, up to an iterated logarithmic factor, over spaces of probability distributions with
different regularity conditions.
The rest of the thesis is organized as follows. In Chapter 2, focusing on the case of goodness-
of-fit tests, our analyses show that a vanilla version of the kernel embedding based test could be
suboptimal, and suggest a simple remedy by moderating the kernel. We prove that the moderated
approach provides optimal tests and can also be made adaptive over a wide range of deviations from
the null. Then, in Chapter 3 we study the asymptotic properties of goodness-of-fit, homogeneity
and independence tests using Gaussian kernels, arguably the most popular and successful among
such tests. Our results provide theoretical justifications for this common practice by showing that
tests using a Gaussian kernel with an appropriately chosen scaling parameter are minimax optimal
against smooth alternatives in all three settings. In addition, we suggest a data-driven choice of the
scaling parameter that yields tests optimal, up to an iterated logarithmic factor, over a wide range
of smooth alternatives. Numerical experiments are presented in Chapter 4 to further demonstrate
the practical merits of our methodology. We conclude with some summary discussion in Chapter
5. All the main proofs are relegated to Chapter 6. Other technical results and their proofs are
collected in the appendices.
Chapter 2: Moderated Kernel Embedding
2.1 Background and Problem Setting
In this chapter, we focus on goodness-of-fit testing. Specifically, with n independent X-valued
samples X_1, · · · , X_n from a certain distribution P, we are interested in testing whether the hypothesis

𝐻_0^GOF : P = P_0

holds for a fixed P_0. Problems of this kind have a long and illustrious history in statistics and are
often associated with household names such as the Kolmogorov-Smirnov test, Pearson’s chi-square
test or Neyman’s smooth test. A plethora of other techniques have also been proposed over the
years in both parametric and nonparametric settings (e.g., Ingster and Suslina, 2003; Lehmann
and Romano, 2008). Most of the existing techniques are developed with the domain X = R or
[0, 1] in mind and work the best in these cases. Modern applications, however, oftentimes involve
domains different from these traditional ones. For example, when dealing with directional data,
which arise naturally in applications such as diffusion tensor imaging, it is natural to consider X
as the unit sphere in R3 (e.g., Jupp, 2005). Another example occurs in the context of ranking or
preference data (e.g., Ailon et al., 2008). In these cases, X can be taken as the group of permuta-
tions. Furthermore, motivated by several applications, combinatorial testing problems have been
investigated recently (e.g., Addario-Berry et al., 2010), where the spaces under consideration are
specific combinatorially structured spaces.
We consider kernel embedding and maximum mean discrepancy (MMD) based goodness-of-fit
tests. Specifically, the goodness-of-fit test can be carried out conveniently by first constructing an
estimate γ(P_n, P_0) of γ(P, P_0), and then rejecting 𝐻_0^GOF if the estimate exceeds a threshold calibrated
to ensure a certain significance level, say 𝛼 (0 < 𝛼 < 1).
We adopt the minimax framework to evaluate the above-mentioned testing strategy. To fix
ideas, we assume in this chapter that P is dominated by P_0 under the alternative so that the Radon-
Nikodym derivative dP/dP_0 is well defined, and use the χ² divergence between P and P_0,
χ²(P, P_0) := ∫_X (dP/dP_0)² dP_0 − 1,
as the separation metric to quantify the departure from the null hypothesis. We are particularly
interested in the detection boundary, namely how close P and P0 can be in terms of 𝜒2 distance,
under the alternative, so that a test based on a sample of 𝑛 observations can still consistently dis-
tinguish between the null hypothesis and the alternative. For example, in the parametric setting
where P is known up to a finite-dimensional parameter under the alternative, the detection boundary
of the likelihood ratio test is n^{−1/2} under mild regularity conditions (e.g., Theorem 13.5.4 in
Lehmann and Romano, 2008, and the discussion leading to it). We are concerned here with alternatives
that are nonparametric in nature. Our first result suggests that the detection boundary for the
aforementioned γ_K(P_n, P_0) based test is of the order n^{−1/4}. However, our main results indicate,
perhaps surprisingly at first, that this rate is far from optimal and the gap between it and the usual
parametric rate can be largely bridged.
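As a quick sanity check on the separation metric: for discrete distributions the same formula reads χ²(P, P_0) = ∑_x p(x)²/p_0(x) − 1, which a few lines verify on a toy example of our own choosing.

```python
# chi-square divergence between two discrete distributions:
# chi2(P, P0) = sum_x p(x)^2 / p0(x) - 1
p = [0.5, 0.3, 0.2]       # P, a toy alternative
p0 = [1/3, 1/3, 1/3]      # P0, the uniform null
chi2 = sum(pi ** 2 / qi for pi, qi in zip(p, p0)) - 1
print(chi2)  # 0.14
```

The divergence is zero if and only if P = P_0, and grows as P moves away from the null, which is what makes it a natural separation metric here.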
In particular, we argue that the distinguishability between P and P0 depends on how close
𝑢 := 𝑑P/𝑑P0 − 1 is to the RKHS H𝐾 . The closeness of 𝑢 to H𝐾 can be measured by the distance
from u to an arbitrary ball in H_K. Specifically, we shall consider the case where H_K is dense in
𝐿2(P0), and focus on functions that are polynomially approximable by H𝐾 for concreteness. More
precisely, for some constants M, θ > 0, denote by F(θ; M) the collection of functions f ∈ L_2(P_0)
such that for any R > 0, there exists an f_R ∈ H_K such that

‖f_R‖_{H_K} ≤ R, and ‖f − f_R‖_{L_2(P_0)} ≤ M R^{−1/θ}.
We also adopt the convention that

F(0; M) = {f ∈ H_K : ‖f‖_{H_K} ≤ M}.
We investigate the optimal rate of detection for testing 𝐻_0^GOF : P = P_0 against

𝐻_1^GOF(Δ_n, θ, M) : P ∈ P(Δ_n, θ, M),  (2.1)

where P(Δ_n, θ, M) is the collection of distributions P on (X, B) satisfying

dP/dP_0 − 1 ∈ F(θ; M), and χ(P, P_0) ≥ Δ_n.
We call r_n the optimal rate of detection if for any c > 0, there exists no consistent test whenever
Δ_n ≤ c r_n; and, on the other hand, a consistent test exists as long as Δ_n ≫ r_n.
Throughout this chapter, we shall assume that

∫_{X×X} K²(x, x′) dP_0(x) dP_0(x′) < ∞.
Hence the Hilbert-Schmidt integral operator

L_K(f)(x) = ∫_X K(x, x′) f(x′) dP_0(x′), ∀ x ∈ X
is well-defined. The spectral decomposition theorem ensures that L_K admits an eigenvalue decomposition.
Let {φ_k}_{k≥1} denote the orthonormal eigenfunctions of L_K with eigenvalues λ_k such
that λ_1 ≥ λ_2 ≥ · · · ≥ λ_k ≥ · · · > 0. Then, as proved in, e.g., Dunford and Schwartz (1963),

K(x, x′) = ∑_{k≥1} λ_k φ_k(x) φ_k(x′)  (2.2)
in 𝐿2(P0 ⊗ P0). We further assume that 𝐾 is continuous and that P0 is nondegenerate, meaning the
support of P_0 is X. Then Mercer’s theorem ensures that (2.2) holds pointwise. See, e.g., Theorem
4.49 of Steinwart and Christmann (2008).
As shown in Gretton et al. (2012a), the squared MMD between two probability distributions P
and P0 can be expressed as
γ²_K(P, P_0) = ∫ K(x, x′) d(P − P_0)(x) d(P − P_0)(x′).  (2.3)
Write

K̄(x, x′) = K(x, x′) − E_{P_0} K(x, X) − E_{P_0} K(X, x′) + E_{P_0} K(X, X′),  (2.4)

where the subscript P_0 signifies that the expectation is taken over X, X′ ∼ P_0 independently.
By (2.4), γ²_K(P, P_0) = γ²_{K̄}(P, P_0). Therefore, without loss of generality, we can focus on kernels
that are degenerate under P_0, i.e.,

E_{P_0} K(X, ·) = 0.  (2.5)
Passing from a nondegenerate kernel to a degenerate one, however, presents a subtlety regarding
universality. Universality of a kernel is essential for MMD in that it ensures that dP/dP_0 − 1 resides
in the linear space spanned by its eigenfunctions. See, e.g., Steinwart (2001) for the definition
of universal kernel and Sriperumbudur et al. (2011) for a detailed discussion of different types of
universality. Observe that 𝑑P/𝑑P0 − 1 necessarily lies in the orthogonal complement of constant
functions in 𝐿2(P0). A degenerate kernel 𝐾 is universal if its eigenfunctions {𝜙𝑘 }𝑘≥1 form an
orthonormal basis of the orthogonal complement in L_2(P_0) of the linear space {c · φ_0 : c ∈ R}, where
φ_0(x) ≡ 1. In what follows, we shall assume that K is both degenerate and universal.
For the sake of concreteness, we shall also assume that K has infinitely many positive eigenvalues
decaying polynomially, i.e.,

0 < lim inf_{k→∞} k^{2s} λ_k ≤ lim sup_{k→∞} k^{2s} λ_k < ∞  (2.6)
for some s > 1/2. In addition, we also assume that the eigenfunctions of K are uniformly bounded,
i.e.,

sup_{k≥1} ‖φ_k‖_∞ < ∞.  (2.7)

Together with (2.6), Assumption (2.7) ensures that Mercer’s decomposition (2.2) holds uniformly.
2.2 Operating Characteristics of MMD Based Test
2.2.1 Asymptotics under 𝐻_0^GOF
Note that (2.5) implies E_{P_0} φ_k(X) = 0 for all k ≥ 1. Hence

γ²(P, P_0) = ∑_{k≥1} λ_k [E_P φ_k(X)]²

for any P. Accordingly, when P is replaced by the empirical distribution P_n, the empirical squared
MMD can be expressed as

γ²(P_n, P_0) = ∑_{k≥1} λ_k [(1/n) ∑_{i=1}^{n} φ_k(X_i)]².
Classic results on the asymptotics of V-statistics (Serfling, 2009) imply that, under 𝐻_0^GOF,

n γ²(P_n, P_0) →_d ∑_{k≥1} λ_k Z_k² =: W,

where Z_k ~ i.i.d. N(0, 1). Let Φ_MMD be an MMD based test, which rejects 𝐻_0^GOF if and
only if n γ²(P_n, P_0) exceeds the 1 − α quantile q_{W,1−α} of W, i.e.,

Φ_MMD = 1{n γ²(P_n, P_0) > q_{W,1−α}}.

The above limiting distribution of n γ²(P_n, P_0) immediately suggests that Φ_MMD is an asymptotic
α-level test.
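Since the limiting null law W = ∑_{k≥1} λ_k Z_k² depends on the eigenvalues, the quantile q_{W,1−α} is typically obtained by Monte Carlo. The sketch below assumes a polynomially decaying stand-in spectrum λ_k = k^{−2s} consistent with (2.6); it is an illustration, not a spectrum derived in this thesis.

```python
import numpy as np

def null_quantile(lambdas, alpha=0.05, n_sim=20000, seed=0):
    """Monte Carlo (1-alpha)-quantile of W = sum_k lambda_k * Z_k^2, Z_k iid N(0,1)."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n_sim, len(lambdas)))
    W = (Z ** 2) @ lambdas  # n_sim draws from the (truncated) mixture of chi-squares
    return np.quantile(W, 1 - alpha)

s = 1.0
lambdas = np.arange(1, 201) ** (-2 * s)  # truncated spectrum lambda_k = k^{-2s}
q95 = null_quantile(lambdas, alpha=0.05)
q99 = null_quantile(lambdas, alpha=0.01)
print(q95, q99)
```

Φ_MMD then rejects 𝐻_0^GOF whenever n γ²(P_n, P_0) exceeds the simulated quantile.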
2.2.2 Power Analysis for MMD Based Tests
We now investigate the power of Φ_MMD in testing 𝐻_0^GOF against 𝐻_1^GOF(Δ_n, θ, M) given by (2.1).
Recall that the type II error of a test Φ : X^n → [0, 1] for testing 𝐻_0 against a composite alternative
𝐻_1 : P ∈ P is given by

β(Φ; P) = sup_{P ∈ P} E_P[1 − Φ(X_1, . . . , X_n)],

where E_P denotes expectation over X_1, . . . , X_n ~ i.i.d. P. For brevity, we shall write β(Φ; Δ_n, θ, M)
instead of β(Φ; P(Δ_n, θ, M)) in what follows, with P(Δ_n, θ, M) defined right below (2.1). The
performance of a test Φ can then be evaluated by its detection boundary, that is, the smallest Δ_n
under which the type II error converges to 0 as n → ∞. Our first result establishes the convergence
rate of the detection boundary for Φ_MMD in the case θ = 0. Hereafter, we suppress M in
P(Δ_n, θ, M), 𝐻_1^GOF(Δ_n, θ, M) and β(Φ; Δ_n, θ, M), unless it is necessary to emphasize the
dependence.
Theorem 1. Consider testing 𝐻_0^GOF against 𝐻_1^GOF(Δ_n, 0) by Φ_MMD.

(i) If n^{1/4} Δ_n → ∞, then

β(Φ_MMD; Δ_n, 0) → 0 as n → ∞;

(ii) conversely, there exists a constant c_0 > 0 such that

lim inf_{n→∞} β(Φ_MMD; c_0 n^{−1/4}, 0) > 0.
Theorem 1 shows that when the alternative 𝐻_1^GOF(Δ_n, 0) is considered, the detection boundary
of Φ_MMD is of the order n^{−1/4}. It is instructive to compare this detection rate with that in a
parametric setting, where consistent tests are available whenever n^{1/2} Δ_n → ∞. See, e.g., Theorem
13.5.4 in Lehmann and Romano (2008) and the discussion leading to it. It is natural to ask to
what extent this gap can be attributed to the fundamental difference between parametric and
nonparametric testing problems. We shall now argue that the gap is largely due to the
sub-optimality of Φ_MMD, and that the detection boundary can be significantly improved through
a slight modification of the MMD.
2.3 Optimal Tests Based on Moderated MMD
2.3.1 Moderated MMD Test Statistic
The basic idea behind MMD is to project two probability measures onto a unit ball in H𝐾
and use the distance between the two projections to measure the distance between the original
probability measures. If the Radon-Nikodym derivative of P with respect to P0 is far away from
H𝐾 , the distance between the two projections may not honestly reflect the distance between them.
More specifically, γ²(P, P_0) = ∑_{k≥1} λ_k [E_P φ_k(X)]², while the χ² distance between P and P_0 is
χ²(P, P_0) = ∑_{k≥1} [E_P φ_k(X)]². Since λ_k decreases with k, γ²(P, P_0) can be much smaller
than χ²(P, P_0). To overcome this problem, we consider a moderated version of the MMD which
allows us to project the probability measures onto a larger ball in H_K. In particular, write
η_{K,𝜚}(P, Q; P_0) = sup_{f ∈ H_K : ‖f‖²_{L_2(P_0)} + 𝜚² ‖f‖²_{H_K} ≤ 1} ∫_X f d(P − Q)  (2.8)
for a given distribution P_0 and a constant 𝜚 > 0. A distance between probability measures of this
type was first introduced by Harchaoui et al. (2007) when considering kernel methods for the
two-sample test. A subtle difference between η_{K,𝜚}(P, Q; P_0) and the distance from Harchaoui et
al. (2007) is the set of f that we optimize over on the right-hand side of (2.8). In the case of the
two-sample test, there is no information about P_0 and therefore one needs to replace the norm
‖·‖_{L_2(P_0)} with the empirical L_2 norm.
It is worth noting that η_{K,𝜚}(P, Q; P_0) can also be identified with a particular type of MMD.
Specifically, η_{K,𝜚}(P, Q; P_0) = γ_{K̄_𝜚}(P, Q), where

K̄_𝜚(x, x′) := ∑_{k≥1} [λ_k / (λ_k + 𝜚²)] φ_k(x) φ_k(x′).
We shall nonetheless still refer to η_{K,𝜚}(P, Q; P_0) as a moderated MMD in what follows, to
emphasize the critical importance of moderation. We shall also suppress the dependence of η on K
and P_0 unless necessary. The unit ball in (2.8) is defined in terms of both the RKHS norm and the
L_2(P_0) norm. Recall that u = dP/dP_0 − 1, so that

sup_{‖f‖_{L_2(P_0)} ≤ 1} ∫_X f d(P − P_0) = sup_{‖f‖_{L_2(P_0)} ≤ 1} ∫_X f u dP_0 = ‖u‖_{L_2(P_0)} = χ(P, P_0).
We can therefore expect that a smaller 𝜚 will make η²_𝜚(P, P_0) closer to χ²(P, P_0), since the unit
ball being considered becomes more similar to the unit ball in L_2(P_0). This can also be verified
by noticing that

lim_{𝜚→0} η²_𝜚(P, P_0) = lim_{𝜚→0} ∑_{k≥1} [λ_k / (λ_k + 𝜚²)] [E_P φ_k(X)]² = ∑_{k≥1} [E_P φ_k(X)]² = χ²(P, P_0).
Therefore, we choose 𝜚 converging to 0 when constructing our test statistic.
Hereafter we shall attach the subscript n to 𝜚 to signify its dependence on n. We shall argue that
letting 𝜚_n converge to 0 at an appropriate rate as n increases indeed results in a test more powerful
than Φ_MMD. The test statistic we propose is the empirical version of η²_{𝜚_n}(P, P_0):
η²_{𝜚_n}(P_n, P_0) = (1/n²) ∑_{i,j=1}^{n} K̄_{𝜚_n}(X_i, X_j) = ∑_{k≥1} [λ_k / (λ_k + 𝜚_n²)] [(1/n) ∑_{i=1}^{n} φ_k(X_i)]².  (2.9)
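Once the eigenpairs of K are available, (2.9) can be evaluated by truncating the spectral sum. The sketch below adopts a toy setup of our own choosing, not a kernel analyzed in this thesis: P_0 uniform on [0, 1], λ_k = k^{−2s}, and φ_k(x) = √2 cos(kπx), which are orthonormal and mean zero under P_0.

```python
import numpy as np

def eta2(X, rho, s=1.0, k_max=200):
    """Truncated spectral evaluation of the moderated statistic (2.9):
    eta^2 = sum_k [lambda_k / (lambda_k + rho^2)] * [(1/n) sum_i phi_k(X_i)]^2."""
    k = np.arange(1, k_max + 1)
    lam = k ** (-2.0 * s)                                 # eigenvalues lambda_k = k^{-2s}
    phi = np.sqrt(2.0) * np.cos(np.pi * np.outer(k, X))   # phi_k(X_i), shape (k_max, n)
    coef = phi.mean(axis=1) ** 2                          # [(1/n) sum_i phi_k(X_i)]^2
    return np.sum(lam / (lam + rho ** 2) * coef)

rng = np.random.default_rng(0)
X = rng.uniform(size=500)  # a sample from P0 = Uniform[0, 1]
print(eta2(X, rho=1.0), eta2(X, rho=0.01))
```

Shrinking 𝜚 pushes every filter weight λ_k/(λ_k + 𝜚²) toward 1, so the statistic increases toward a (truncated) estimate of the χ² divergence, in line with the limit displayed earlier in this section.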
This test statistic is similar in spirit to the homogeneity test proposed previously by Harchaoui
et al. (2007), albeit motivated from a different viewpoint. In either case, it is intuitive to expect
improved performance over the vanilla version of the MMD when 𝜚_n converges to zero at an
appropriate rate. The main goal of the present work is to precisely characterize the amount of
moderation needed to ensure maximum power. We first argue that letting 𝜚_n converge to 0 at an
appropriate rate indeed results in a test more powerful than Φ_MMD.
2.3.2 Operating Characteristics of η²_{𝜚_n}(P_n, P_0) Based Tests
Although the expression for η²_{𝜚_n}(P_n, P_0) given by (2.9) looks similar to that of γ²(P_n, P_0),
their asymptotic behaviors are quite different. At a technical level, this is because the eigenvalues
of the underlying kernel,

λ_{nk} := λ_k / (λ_k + 𝜚_n²),

depend on n and may not be uniformly summable over n. As presented in the following theorem,
a certain type of asymptotic normality, instead of a sum of chi-squares as in the case of γ²(P_n, P_0),
holds for η²_{𝜚_n}(P_n, P_0) under P_0, which helps determine the rejection region of the η²_{𝜚_n} based test.
Theorem 2. Assume that 𝜚_n → 0 as n → ∞ in such a fashion that n 𝜚_n^{1/(2s)} → ∞. Then under
𝐻_0^GOF,

v_n^{−1/2} [n η²_{𝜚_n}(P_n, P_0) − A_n] →_d N(0, 2),

where

v_n = ∑_{k≥1} [λ_k / (λ_k + 𝜚_n²)]², and A_n = (1/n) ∑_{i=1}^{n} K̄_{𝜚_n}(X_i, X_i).
In light of Theorem 2, a test that rejects 𝐻_0^GOF if and only if

2^{−1/2} v_n^{−1/2} [n η²_{𝜚_n}(P_n, P_0) − A_n]

exceeds z_{1−α} is an asymptotic α-level test, where z_{1−α} stands for the 1 − α quantile of the
standard normal distribution. We refer to this test as Φ_M3d, where the subscript M3d stands for
Moderated MMD. The performance of Φ_M3d under the alternative hypothesis is characterized by
the following theorem, showing that its detection boundary is much improved when compared with
that of Φ_MMD.
Theorem 3. Consider testing 𝐻_0^GOF against 𝐻_1^GOF(Δ_n, θ) by Φ_M3d with 𝜚_n = c n^{−2s(θ+1)/(4s+θ+1)}
for an arbitrary constant c > 0. If n^{2s/(4s+θ+1)} Δ_n → ∞, then Φ_M3d is consistent in that

β(Φ_M3d; Δ_n, θ) → 0, as n → ∞.
Theorem 3 indicates that the detection boundary for $\Phi_{\mathrm{M3d}}$ is $n^{-2s/(4s+\theta+1)}$. In particular, when testing $H_0^{\mathrm{GOF}}$ against $H_1^{\mathrm{GOF}}(\Delta_n, 0)$, i.e., $\theta = 0$, it becomes $n^{-2s/(4s+1)}$. This is to be contrasted with the detection boundary for $\Phi_{\mathrm{MMD}}$, which, as suggested by Theorem 1, is of the order $n^{-1/4}$. It is also worth noting that the detection boundary for $\Phi_{\mathrm{M3d}}$ deteriorates as $\theta$ increases, implying that it is harder to test against a larger interpolation space.
2.3.3 Minimax Optimality
It is of interest to investigate if the detection boundary of ΦM3d can be further improved. We
now show that the answer is negative in a certain sense.
Theorem 4. Consider testing $H_0^{\mathrm{GOF}}$ against $H_1^{\mathrm{GOF}}(\Delta_n, \theta)$ for some $\theta < 2s - 1$. If $\limsup_{n\to\infty} \Delta_n n^{\frac{2s}{4s+\theta+1}} < \infty$, then there exists $\alpha \in (0,1)$ such that for any $\Phi_n$ of level $\alpha$ (asymptotically) based on $X_1, \cdots, X_n$,
\[
\limsup_{n\to\infty} \beta(\Phi_n; \Delta_n, \theta) > 0.
\]
Together with Theorem 3, this suggests that $\Phi_{\mathrm{M3d}}$ is rate optimal in the minimax sense, when considering the $\chi^2$ distance as the separation metric and $\mathcal{F}(\theta, M)$ as the regularity condition on the alternative space.
2.4 Adaptation
Despite the minimax optimality of $\Phi_{\mathrm{M3d}}$, a practical challenge in using it is the choice of an appropriate tuning parameter $\varrho_n$. In particular, Theorem 3 suggests that $\varrho_n$ needs to be taken at the order of $n^{-2s(\theta+1)/(4s+\theta+1)}$, which depends on the values of $s$ and $\theta$. On the one hand, since $\mathbb{P}_0$ and $K$ are known a priori, so is $s$. On the other hand, $\theta$ reflects the property of $d\mathbb{P}/d\mathbb{P}_0$, which is typically not known in advance. This naturally brings us to the issue of adaptation (see, e.g., Spokoiny, 1996; Ingster, 2000). In other words, we are interested in a single testing procedure that can achieve the detection boundary for testing $H_0^{\mathrm{GOF}}$ against $H_1^{\mathrm{GOF}}(\Delta_n(\theta), \theta)$ simultaneously over all $\theta \ge 0$. We emphasize the dependence of $\Delta_n$ on $\theta$ since the detection boundary may depend on $\theta$, as suggested by the results from the previous section. To this end, we build upon the test statistic introduced before.
More specifically, write
\[
\rho_* = \left(\frac{\sqrt{\log\log n}}{n}\right)^{2s}
\]
and
\[
m_* = \log_2\left[\rho_*^{-1}\left(\frac{\sqrt{\log\log n}}{n}\right)^{\frac{2s}{4s+1}}\right].
\]
Then our test statistic is taken to be the maximum of $T_{n,\varrho_n}$ for $\varrho_n = \rho_*, 2\rho_*, 2^2\rho_*, \ldots, 2^{m_*}\rho_*$:
\[
T_n^{\mathrm{GOF(adapt)}} := \sup_{0\le k\le m_*} T_{n,2^k\rho_*}, \tag{2.10}
\]
where
\[
T_{n,\varrho_n} = (2v_n)^{-1/2}\left[n\eta^2_{\varrho_n}(\mathbb{P}_n, \mathbb{P}_0) - A_n\right].
\]
It turns out that if an appropriate rejection threshold is chosen, $T_n^{\mathrm{GOF(adapt)}}$ can achieve a detection boundary very similar to the one we have before, but now simultaneously over all $\theta \ge 0$.
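The geometric grid underlying (2.10) can be constructed directly from $n$ and $s$. The sketch below does so; rounding $m_*$ up to an integer is our reading of the construction (so that the grid covers the whole range from $\rho_*$ up to roughly $(\sqrt{\log\log n}/n)^{2s/(4s+1)}$), since a non-integer exponent count would not define a finite grid.

```python
import math

def rho_grid(n, s):
    # base = sqrt(log log n) / n, so that rho_* = base^(2s) and the grid
    # tops out near base^(2s/(4s+1)); these match the displays above.
    base = math.sqrt(math.log(math.log(n))) / n
    rho_star = base ** (2 * s)
    # m_* = log2( rho_*^{-1} * base^{2s/(4s+1)} ), rounded up (our assumption)
    m_star = math.ceil(math.log2(base ** (2 * s / (4 * s + 1)) / rho_star))
    return [rho_star * 2 ** k for k in range(m_star + 1)]

grid = rho_grid(n=10_000, s=1.0)
```

Each candidate $\varrho$ in the grid corresponds to targeting a different (unknown) smoothness $\theta$ of the alternative; taking the maximum of the standardized statistics over the grid removes the need to know $\theta$.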
Theorem 5. (i) Under $H_0^{\mathrm{GOF}}$,
\[
\lim_{n\to\infty} P\left(T_n^{\mathrm{GOF(adapt)}} \ge \sqrt{3\log\log n}\right) = 0;
\]
(ii) on the other hand, there exists a constant $c_1 > 0$ such that
\[
\lim_{n\to\infty} \inf_{\mathbb{P}\in\cup_{\theta\ge 0}\mathcal{P}(\Delta_n(\theta),\theta)} P\left(T_n^{\mathrm{GOF(adapt)}} \ge \sqrt{3\log\log n}\right) = 1,
\]
provided that $\Delta_n(\theta) \ge c_1\left(n^{-1}\sqrt{\log\log n}\right)^{\frac{2s}{4s+\theta+1}}$.
Theorem 5 immediately suggests that a test that rejects $H_0^{\mathrm{GOF}}$ if and only if $T_n^{\mathrm{GOF(adapt)}} \ge \sqrt{3\log\log n}$ is consistent for testing it against $H_1^{\mathrm{GOF}}(\Delta_n(\theta), \theta)$ for all $\theta \ge 0$, provided that
\[
\Delta_n(\theta) \ge c_1\left(n^{-1}\sqrt{\log\log n}\right)^{\frac{2s}{4s+\theta+1}}.
\]
We note that the detection boundary given in Theorem 5 is similar to that from Theorem 4, but inferior by a factor of $(\log\log n)^{\frac{2s}{4s+\theta+1}}$. As our next result indicates, such an extra factor is indeed unavoidable and is the price one needs to pay for adaptation.
Theorem 6. Let $0 < \theta_1 < \theta_2 < 2s - 1$. Then there exists a positive constant $c_2$ such that if
\[
\limsup_{n\to\infty} \sup_{\theta\in[\theta_1,\theta_2]} \Delta_n(\theta)\left(\frac{n}{\sqrt{\log\log n}}\right)^{\frac{2s}{4s+\theta+1}} \le c_2,
\]
then
\[
\lim_{n\to\infty} \inf_{\Phi_n}\left[\mathbb{E}_{\mathbb{P}_0}\Phi_n + \sup_{\theta\in[\theta_1,\theta_2]}\beta(\Phi_n; \Delta_n(\theta), \theta)\right] = 1.
\]
Similar to Theorem 4, Theorem 6 shows that there is no consistent test for $H_0^{\mathrm{GOF}}$ against $H_1^{\mathrm{GOF}}(\Delta_n, \theta)$ simultaneously over all $\theta\in[\theta_1,\theta_2]$ if $\Delta_n(\theta) \le c_2\left(n^{-1}\sqrt{\log\log n}\right)^{\frac{2s}{4s+\theta+1}}$ for all $\theta\in[\theta_1,\theta_2]$ for a sufficiently small $c_2$. Together with Theorem 5, this suggests that the above mentioned adaptive test is indeed rate optimal.
Chapter 3: Gaussian Kernel Embedding
3.1 Test for Goodness-of-fit
Throughout this chapter, we shall consider goodness-of-fit, homogeneity and independence
tests. We focus on continuous data, e.g., X = R𝑑 , and Gaussian kernels, which are arguably the
most popular and successful choice in practice.
Among the three testing problems that we consider, it is instructive to begin with the case of
goodness-of-fit. Obviously, the choice of kernel 𝐾 plays an essential role in kernel embedding of
distributions. In particular, when data are continuous, Gaussian kernels are commonly used. More
specifically, a Gaussian kernel with a scaling parameter $\nu > 0$ is given by
\[
G_{d,\nu}(x, y) = \exp\left(-\nu\|x - y\|_d^2\right), \qquad \forall x, y \in \mathbb{R}^d.
\]
Hereafter $\|\cdot\|_d$ stands for the usual Euclidean norm in $\mathbb{R}^d$. For brevity, we shall suppress the subscript $d$ in both $\|\cdot\|$ and $G$ when the dimensionality is clear from the context. When $\mathbb{P}$ and $\mathbb{Q}$ are probability distributions defined over $\mathcal{X} = \mathbb{R}^d$, we shall write the MMD between them with a Gaussian kernel with scaling parameter $\nu$ as $\gamma_\nu(\mathbb{P}, \mathbb{Q})$, where the subscript signifies the specific value of the scaling parameter.
We shall restrict our attention to distributions with smooth densities. Denote by $\mathcal{W}^{s,2}_d$ the $s$th order Sobolev space in $\mathbb{R}^d$, that is,
\[
\mathcal{W}^{s,2}_d = \left\{ f: \mathbb{R}^d \to \mathbb{R} \;\middle|\; f \text{ is almost surely continuous and } \int (1+\|\omega\|^2)^s |\mathcal{F}(f)(\omega)|^2 \, d\omega < \infty \right\},
\]
where $\mathcal{F}(f)$ is the Fourier transform of $f$:
\[
\mathcal{F}(f)(\omega) = \frac{1}{(2\pi)^{d/2}}\int_{\mathbb{R}^d} f(x)e^{-ix^\top\omega}\,dx.
\]
In what follows, we shall again suppress the subscript $d$ in $\mathcal{W}^{s,2}_d$ when it is clear from the context. For any $f\in\mathcal{W}^{s,2}$, we shall write
\[
\|f\|^2_{\mathcal{W}^{s,2}} = \int_{\mathbb{R}^d}(1+\|\omega\|^2)^s|\mathcal{F}(f)(\omega)|^2\,d\omega.
\]
Let 𝑝 and 𝑝0 be the density functions of P and P0 respectively. We are interested in the case when
both 𝑝 and 𝑝0 are elements from W𝑠,2.
Note that we can rewrite the null hypothesis $H_0^{\mathrm{GOF}}$ in terms of density functions: $H_0^{\mathrm{GOF}}: p = p_0$ for some prespecified density $p_0\in\mathcal{W}^{s,2}$. To better quantify the power of a test, we shall consider testing against an alternative that is increasingly closer to the null as the sample size $n$ increases:
\[
H_1^{\mathrm{GOF}}(\Delta_n; s): p\in\mathcal{W}^{s,2}(M), \quad \|p - p_0\|_{L_2}\ge\Delta_n,
\]
where
\[
\mathcal{W}^{s,2}(M) = \left\{f\in\mathcal{W}^{s,2}: \|f\|_{\mathcal{W}^{s,2}}\le M\right\}
\]
and
\[
\|f\|^2_{L_2} = \int_{\mathbb{R}^d} f^2(x)\,dx.
\]
The alternative hypothesis $H_1^{\mathrm{GOF}}(\Delta_n; s)$ is composite, and the power of a test $\Phi$ based on $X_1,\ldots,X_n\sim p$ is therefore defined as
\[
\mathrm{power}(\Phi; H_1^{\mathrm{GOF}}(\Delta_n; s)) := \inf_{p\in\mathcal{W}^{s,2}(M),\,\|p-p_0\|_{L_2}\ge\Delta_n} P\{\Phi \text{ rejects } H_0^{\mathrm{GOF}}\}.
\]
Let
\[
\bar{G}_\nu(x, y; \mathbb{P}_0) = G_\nu(x, y) - \mathbb{E}_{X\sim\mathbb{P}_0}G_\nu(X, y) - \mathbb{E}_{X\sim\mathbb{P}_0}G_\nu(x, X) + \mathbb{E}_{X,X'\sim_{\mathrm{iid}}\mathbb{P}_0}G_\nu(X, X'),
\]
and recall that
\[
\gamma^2_\nu(\mathbb{P}_n, \mathbb{P}_0) = \frac{1}{n^2}\sum_{i,j=1}^n \bar{G}_\nu(X_i, X_j).
\]
As in Chapter 2, we correct for bias and use instead the following $U$-statistic:
\[
\hat{\gamma}^2_\nu(\mathbb{P}, \mathbb{P}_0) := \frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}\bar{G}_\nu(X_i, X_j),
\]
which we shall focus on in what follows.
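As a concrete sketch, the $U$-statistic above can be computed in closed form when $\mathbb{P}_0 = N(0,1)$ and $d = 1$ (an illustrative choice of ours, not one made in the text): the centering terms $\mathbb{E}_{X\sim N(0,1)}G_\nu(X,y) = (1+2\nu)^{-1/2}\exp(-\nu y^2/(1+2\nu))$ and $\mathbb{E}G_\nu(X,X') = (1+4\nu)^{-1/2}$ are standard Gaussian integrals.

```python
import numpy as np

def gof_mmd_ustat(X, nu):
    """U-statistic gamma_hat^2_nu(P, P0) with P0 = N(0,1) in one dimension.

    Uses the closed-form centering terms for the standard normal null."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    G = np.exp(-nu * (X[:, None] - X[None, :]) ** 2)      # G_nu(X_i, X_j)
    m1 = (1 + 2 * nu) ** -0.5 * np.exp(-nu * X ** 2 / (1 + 2 * nu))
    m0 = (1 + 4 * nu) ** -0.5                              # E G_nu(X, X')
    Gt = G - m1[:, None] - m1[None, :] + m0                # centered kernel
    np.fill_diagonal(Gt, 0.0)                              # keep i != j only
    return Gt.sum() / (n * (n - 1))

rng = np.random.default_rng(0)
stat_null = gof_mmd_ustat(rng.normal(size=500), nu=1.0)            # P = P0
stat_alt = gof_mmd_ustat(rng.normal(1.0, 1.0, size=500), nu=1.0)   # shifted P
```

Under the null the statistic concentrates near zero at rate $1/n$, while under a fixed alternative it converges to a positive population quantity, which is what the tests below exploit.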
The choice of the scaling parameter $\nu$ is essential when using RKHS embedding for goodness-of-fit testing. While the importance of a data-driven choice of $\nu$ is widely recognized in practice, almost all existing theoretical studies assume that a fixed kernel, and therefore a fixed scaling parameter, is used. Here we shall demonstrate the benefit of using a data-driven scaling parameter, and especially of choosing a scaling parameter that diverges with the sample size.

More specifically, we argue that, with appropriate scaling, $\hat{\gamma}^2_\nu(\mathbb{P}, \mathbb{P}_0)$ can be viewed as an estimate of $\|p - p_0\|^2_{L_2}$ when $\nu\to\infty$ as $n\to\infty$. Note that
\[
\int(p - p_0)^2 = \int p^2 - 2\int p\cdot p_0 + \int p_0^2.
\]
The first term can be estimated by
\[
\int p^2 \approx \frac{1}{n}\sum_{i=1}^n p(X_i) \approx \frac{1}{n}\sum_{i=1}^n \hat{p}_{h,-i}(X_i),
\]
where $\hat{p}_{h,-i}$ is a kernel density estimate of $p$ with the $i$th observation removed and bandwidth $h$:
\[
\hat{p}_{h,-i}(x) = \frac{1}{(n-1)(2\pi h^2)^{d/2}}\sum_{j\ne i} G_{(2h^2)^{-1}}(x, X_j).
\]
Thus, we can estimate $\int p^2$ by
\[
\frac{1}{n(n-1)(2\pi h^2)^{d/2}}\sum_{1\le i\ne j\le n} G_{(2h^2)^{-1}}(X_i, X_j).
\]
Similarly, the cross-product term can be estimated by
\[
\int p\cdot p_0 \approx \int \hat{p}_h(x)p_0(x)\,dx = \frac{1}{n(2\pi h^2)^{d/2}}\sum_{i=1}^n\int G_{(2h^2)^{-1}}(x, X_i)p_0(x)\,dx.
\]
Together, we can view
\[
\frac{1}{n(n-1)(2\pi h^2)^{d/2}}\sum_{1\le i\ne j\le n}\bar{G}_{(2h^2)^{-1}}(X_i, X_j)
\]
as an estimate of $\int(p - p_0)^2$. Following standard asymptotic properties of the kernel density estimator (see, e.g., Tsybakov, 2008), we know that
\[
(\pi/\nu)^{-d/2}\hat{\gamma}^2_\nu(\mathbb{P}, \mathbb{P}_0) \to_p \|p - p_0\|^2_{L_2}
\]
if $\nu\to\infty$ in such a fashion that $\nu = o(n^{4/d})$. Motivated by this observation, we shall now consider testing $H_0^{\mathrm{GOF}}$ using $\hat{\gamma}^2_\nu(\mathbb{P}, \mathbb{P}_0)$ with a diverging $\nu$. To signify the dependence of $\nu$ on the sample size, we shall add a subscript $n$ in what follows.
Under $H_0^{\mathrm{GOF}}$, it is clear that $\mathbb{E}\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{P}_0) = 0$. Note also that
\[
\mathrm{var}(\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{P}_0)) = \frac{2}{n(n-1)}\mathbb{E}\left[\bar{G}_{\nu_n}(X_1, X_2)\right]^2 = \frac{2}{n(n-1)}\left[\mathbb{E}G^2_{\nu_n}(X_1, X_2) - 2\mathbb{E}[G_{\nu_n}(X_1, X_2)G_{\nu_n}(X_1, X_3)] + \left(\mathbb{E}G_{\nu_n}(X_1, X_2)\right)^2\right]. \tag{3.1}
\]
Simple calculations yield
\[
\mathrm{var}(\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{P}_0)) = \frac{2(\pi/(2\nu_n))^{d/2}}{n^2}\cdot\|p_0\|^2_{L_2}\cdot(1+o(1)),
\]
assuming that $\nu_n\to\infty$. We shall show that
\[
\frac{n}{\sqrt{2}}\left(\frac{2\nu_n}{\pi}\right)^{d/4}\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{P}_0) \to_d N\left(0, \|p_0\|^2_{L_2}\right).
\]
To use this as a test statistic, however, we need to estimate $\mathrm{var}(\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{P}_0))$. To this end, it is natural to consider estimating each of the three terms on the right-hand side of (3.1) by $U$-statistics:
\[
s^2_{n,\nu_n} = \frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}G^2_{\nu_n}(X_i, X_j) - \frac{2(n-3)!}{n!}\sum_{\substack{1\le i,j_1,j_2\le n\\ |\{i,j_1,j_2\}|=3}}G_{\nu_n}(X_i, X_{j_1})G_{\nu_n}(X_i, X_{j_2}) + \frac{(n-4)!}{n!}\sum_{\substack{1\le i_1,i_2,j_1,j_2\le n\\ |\{i_1,i_2,j_1,j_2\}|=4}}G_{\nu_n}(X_{i_1}, X_{j_1})G_{\nu_n}(X_{i_2}, X_{j_2}).
\]
Note that $s^2_{n,\nu_n}$ is not always positive. To avoid a negative estimate of the variance, we can replace it with a sufficiently small value, say $1/n^2$, whenever it is negative or too small. Namely, let
\[
\hat{\sigma}^2_{n,\nu_n} = \max\{s^2_{n,\nu_n}, 1/n^2\},
\]
and consider the test statistic
\[
T^{\mathrm{GOF}}_{n,\nu_n} := \frac{n}{\sqrt{2}}\hat{\sigma}^{-1}_{n,\nu_n}\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{P}_0).
\]
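The three $U$-statistic terms of $s^2_{n,\nu_n}$ can be computed without explicit triple and quadruple loops: writing $r_i$ for the row sums of the (zero-diagonal) kernel matrix, the sums over distinct triples and quadruples reduce to combinations of $r_i$, entrywise squares, and the grand sum. The rearrangement below is ours (a standard inclusion-exclusion for symmetric matrices), not taken from the text.

```python
import numpy as np

def var_ustat(G):
    """s^2_{n,nu}: the three U-statistic terms computed from a symmetric
    kernel matrix G (its diagonal is zeroed internally)."""
    A = G.copy()
    np.fill_diagonal(A, 0.0)
    n = A.shape[0]
    r = A.sum(axis=1)                # row sums
    q = (A ** 2).sum(axis=1)         # row sums of squared entries
    T = q.sum()                      # sum_{i != j} G_ij^2
    U = (r ** 2 - q).sum()           # sum over distinct (i, j1, j2)
    S = r.sum()                      # sum_{i != j} G_ij
    quad = S ** 2 - 4 * U - 2 * T    # sum over four distinct indices
    return (T / (n * (n - 1))
            - 2 * U / (n * (n - 1) * (n - 2))
            + quad / (n * (n - 1) * (n - 2) * (n - 3)))

def t_gof(gamma2_hat, G, n):
    # Floor the variance estimate at 1/n^2, as in the text.
    sigma = max(var_ustat(G), 1.0 / n ** 2) ** 0.5
    return n / np.sqrt(2.0) * gamma2_hat / sigma
```

The vectorized form runs in $O(n^2)$ time, which matters since the statistic is recomputed for many scaling parameters in the adaptive procedures later.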
We have
Theorem 7. Let $\nu_n\to\infty$ as $n\to\infty$ in such a fashion that $\nu_n = o(n^{4/d})$. Then, under $H_0^{\mathrm{GOF}}$,
\[
\frac{n}{\sqrt{2}}\left(\frac{2\nu_n}{\pi}\right)^{d/4}\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{P}_0) \to_d N(0, \|p_0\|^2_{L_2}). \tag{3.2}
\]
Moreover,
\[
T^{\mathrm{GOF}}_{n,\nu_n} \to_d N(0, 1). \tag{3.3}
\]
Theorem 7 immediately implies that the test, denoted by $\Phi^{\mathrm{GOF}}_{n,\nu_n,\alpha}$ ($\alpha\in(0,1)$), that rejects $H_0^{\mathrm{GOF}}$ if and only if $T^{\mathrm{GOF}}_{n,\nu_n}$ exceeds $z_\alpha$, the upper $\alpha$ quantile of the standard normal distribution, is an asymptotic $\alpha$-level test.
We now proceed to study its power against a smooth alternative. Following the same argument as before, it can be shown that
\[
\frac{1}{n(n-1)(\pi/\nu_n)^{d/2}}\sum_{1\le i\ne j\le n}\bar{G}_{\nu_n}(X_i, X_j) \to_p \|p - p_0\|^2_{L_2}
\]
and
\[
(2\nu_n/\pi)^{d/2}\hat{\sigma}^2_{n,\nu_n} \to_p \|p\|^2_{L_2},
\]
so that
\[
n^{-1}(\nu_n/(2\pi))^{d/4}T^{\mathrm{GOF}}_{n,\nu_n} \to_p \|p - p_0\|^2_{L_2}/\|p\|_{L_2}.
\]
This immediately implies that, if $\nu_n\to\infty$ in such a manner that $\nu_n = o(n^{4/d})$, then $\Phi^{\mathrm{GOF}}_{n,\nu_n,\alpha}$ is consistent for a fixed $p\ne p_0$ in that its power converges to one. In fact, as $n$ increases, more and more subtle deviations from $p_0$ can be detected by $\Phi^{\mathrm{GOF}}_{n,\nu_n,\alpha}$. A refined analysis of the asymptotic behavior of $T^{\mathrm{GOF}}_{n,\nu_n}$ yields the following.
Theorem 8. Assume that $n^{2s/(d+4s)}\Delta_n\to\infty$. Then for any $\alpha\in(0,1)$,
\[
\lim_{n\to\infty}\mathrm{power}\{\Phi^{\mathrm{GOF}}_{n,\nu_n,\alpha}; H_1^{\mathrm{GOF}}(\Delta_n; s)\} = 1,
\]
provided that $\nu_n \asymp n^{4/(d+4s)}$.
In other words, $\Phi^{\mathrm{GOF}}_{n,\nu_n,\alpha}$ has a detection boundary of the order $O(n^{-2s/(d+4s)})$, which turns out to be minimax optimal in that no test can attain a detection boundary with a faster rate of convergence. More precisely, we have
Theorem 9. Assume that $\liminf_{n\to\infty} n^{2s/(d+4s)}\Delta_n < \infty$ and that $p_0$ is a density such that $\|p_0\|_{\mathcal{W}^{s,2}} < M$. Then there exists some $\alpha\in(0,1)$ such that for any test $\Phi_n$ of level $\alpha$ (asymptotically) based on $X_1,\ldots,X_n\sim p$,
\[
\liminf_{n\to\infty}\mathrm{power}\{\Phi_n; H_1^{\mathrm{GOF}}(\Delta_n; s)\} < 1.
\]
Together, Theorems 8 and 9 suggest that Gaussian kernel embedding of distributions is espe-
cially suitable for testing against smooth alternatives, and it yields a test that could consistently
detect the smallest departures, in terms of rate of convergence, from the null distribution. The idea
can also be readily applied to testing of homogeneity and independence which we shall examine
next.
3.2 Test for Homogeneity
As in the case of the goodness-of-fit test, we shall consider the case when the underlying distributions have smooth densities, so that we can rewrite the null hypothesis as $H_0^{\mathrm{HOM}}: p = q \in \mathcal{W}^{s,2}(M)$ and the alternative hypothesis as
\[
H_1^{\mathrm{HOM}}(\Delta_n; s): p, q\in\mathcal{W}^{s,2}(M), \quad \|p - q\|_{L_2}\ge\Delta_n.
\]
The power of a test $\Phi$ based on $X_1,\ldots,X_n\sim p$ and $Y_1,\ldots,Y_m\sim q$ is given by
\[
\mathrm{power}(\Phi; H_1^{\mathrm{HOM}}(\Delta_n; s)) := \inf_{p,q\in\mathcal{W}^{s,2}(M),\,\|p-q\|_{L_2}\ge\Delta_n} P\{\Phi \text{ rejects } H_0^{\mathrm{HOM}}\}.
\]
To fix ideas, we shall also assume that 𝑐 ≤ 𝑚/𝑛 ≤ 𝐶 for some constants 0 < 𝑐 ≤ 𝐶 < ∞.
In addition, we shall express explicitly only the dependence on 𝑛 and not 𝑚, for brevity. Our
treatment, however, can be straightforwardly extended to more general situations.
Recall that
\[
\gamma^2_{\nu_n}(\mathbb{P}_n, \mathbb{Q}_m) = \frac{1}{n^2}\sum_{1\le i,j\le n}G_{\nu_n}(X_i, X_j) + \frac{1}{m^2}\sum_{1\le i,j\le m}G_{\nu_n}(Y_i, Y_j) - \frac{2}{mn}\sum_{i=1}^n\sum_{j=1}^m G_{\nu_n}(X_i, Y_j).
\]
As before, to reduce bias, we shall focus instead on a closely related estimate of $\gamma_{\nu_n}(\mathbb{P}, \mathbb{Q})$:
\[
\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{Q}) = \frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}G_{\nu_n}(X_i, X_j) + \frac{1}{m(m-1)}\sum_{1\le i\ne j\le m}G_{\nu_n}(Y_i, Y_j) - \frac{2}{mn}\sum_{i=1}^n\sum_{j=1}^m G_{\nu_n}(X_i, Y_j).
\]
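The bias-corrected estimate above is a direct computation on three kernel matrices. A minimal sketch for $d = 1$ (the one-dimensional case is our simplification for readability):

```python
import numpy as np

def mmd2_ustat(X, Y, nu):
    """Bias-corrected Gaussian-MMD estimate gamma_hat^2_nu(P, Q):
    within-sample averages over distinct pairs minus twice the
    cross-sample average."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    n, m = len(X), len(Y)
    Kxx = np.exp(-nu * (X[:, None] - X[None, :]) ** 2)
    Kyy = np.exp(-nu * (Y[:, None] - Y[None, :]) ** 2)
    Kxy = np.exp(-nu * (X[:, None] - Y[None, :]) ** 2)
    np.fill_diagonal(Kxx, 0.0)   # exclude i == j within each sample
    np.fill_diagonal(Kyy, 0.0)
    return (Kxx.sum() / (n * (n - 1)) + Kyy.sum() / (m * (m - 1))
            - 2.0 * Kxy.sum() / (n * m))
```

Zeroing the diagonals implements the restriction to distinct pairs, which is exactly what removes the $O(1/n)$ bias of the plug-in version.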
It is easy to see that under $H_0^{\mathrm{HOM}}$,
\[
\mathbb{E}\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{Q}) = 0
\]
and
\[
\mathrm{var}\left(\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{Q})\right) = 2\left(\frac{1}{n(n-1)} + \frac{2}{mn} + \frac{1}{m(m-1)}\right)\mathbb{E}_{(X,Y)\sim\mathbb{P}\otimes\mathbb{Q}}\bar{G}^2_{\nu_n}(X, Y),
\]
where
\[
\bar{G}_{\nu_n}(x, y) = G_{\nu_n}(x, y) - \mathbb{E}_{X\sim\mathbb{P}}G_{\nu_n}(X, y) - \mathbb{E}_{Y\sim\mathbb{Q}}G_{\nu_n}(x, Y) + \mathbb{E}_{(X,Y)\sim\mathbb{P}\otimes\mathbb{Q}}G_{\nu_n}(X, Y).
\]
It is therefore natural to consider estimating the variance by $\hat{\sigma}^2_{n,m,\nu_n} = \max\{s^2_{n,m,\nu_n}, 1/n^2\}$, where
\[
s^2_{n,m,\nu_n} = \frac{1}{N(N-1)}\sum_{1\le i\ne j\le N}G^2_{\nu_n}(Z_i, Z_j) - \frac{2(N-3)!}{N!}\sum_{\substack{1\le i,j_1,j_2\le N\\ |\{i,j_1,j_2\}|=3}}G_{\nu_n}(Z_i, Z_{j_1})G_{\nu_n}(Z_i, Z_{j_2}) + \frac{(N-4)!}{N!}\sum_{\substack{1\le i_1,i_2,j_1,j_2\le N\\ |\{i_1,i_2,j_1,j_2\}|=4}}G_{\nu_n}(Z_{i_1}, Z_{j_1})G_{\nu_n}(Z_{i_2}, Z_{j_2}),
\]
$N = n + m$, and $Z_i = X_i$ if $i\le n$ and $Z_i = Y_{i-n}$ if $i > n$. This leads to the following test statistic:
\[
T^{\mathrm{HOM}}_{n,\nu_n} = \frac{nm}{\sqrt{2(n+m)}}\cdot\hat{\sigma}^{-1}_{n,m,\nu_n}\cdot\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{Q}).
\]
As before, we can show
Theorem 10. Let $\nu_n\to\infty$ as $n\to\infty$ in such a fashion that $\nu_n = o(n^{4/d})$. Then under $H_0^{\mathrm{HOM}}: p = q \in \mathcal{W}^{s,2}(M)$,
\[
T^{\mathrm{HOM}}_{n,\nu_n} \to_d N(0, 1), \quad \text{as } n\to\infty.
\]
Motivated by Theorem 10, we can consider a test, denoted by $\Phi^{\mathrm{HOM}}_{n,\nu_n,\alpha}$, that rejects $H_0^{\mathrm{HOM}}$ if and only if $T^{\mathrm{HOM}}_{n,\nu_n}$ exceeds $z_\alpha$. By construction, $\Phi^{\mathrm{HOM}}_{n,\nu_n,\alpha}$ is an asymptotic $\alpha$-level test. We now turn to studying its power against $H_1^{\mathrm{HOM}}$. As in the case of the goodness-of-fit test, we can prove that $\Phi^{\mathrm{HOM}}_{n,\nu_n,\alpha}$ is minimax optimal in that it can detect the smallest difference between $p$ and $q$ in terms of rate of convergence. More precisely, we have
Theorem 11. (i) Assume that $n^{2s/(d+4s)}\Delta_n\to\infty$. Then for any $\alpha\in(0,1)$,
\[
\lim_{n\to\infty}\mathrm{power}\{\Phi^{\mathrm{HOM}}_{n,\nu_n,\alpha}; H_1^{\mathrm{HOM}}(\Delta_n; s)\} = 1,
\]
provided that $\nu_n \asymp n^{4/(d+4s)}$.

(ii) Conversely, if $\liminf_{n\to\infty} n^{2s/(d+4s)}\Delta_n < \infty$, then there exists some $\alpha\in(0,1)$ such that for any test $\Phi_n$ of level $\alpha$ (asymptotically) based on $X_1,\ldots,X_n\sim p$ and $Y_1,\ldots,Y_m\sim q$,
\[
\liminf_{n\to\infty}\mathrm{power}\{\Phi_n; H_1^{\mathrm{HOM}}(\Delta_n; s)\} < 1.
\]
3.3 Test for Independence
Similarly, we can also use Gaussian kernel embedding to construct minimax optimal tests of independence. Let $X = (X^1,\ldots,X^k)^\top\in\mathbb{R}^d$ be a random vector whose subvectors $X^j\in\mathbb{R}^{d_j}$ for $j = 1,\ldots,k$, so that $d_1+\cdots+d_k = d$. Denote by $p$ the joint density function of $X$, and by $p_j$ the marginal density of $X^j$. We assume that both the joint density and the marginal densities are smooth. Specifically, we shall consider testing
\[
H_0^{\mathrm{IND}}: p = p_1\otimes\cdots\otimes p_k, \quad p_j\in\mathcal{W}^{s,2}(M_j), \quad 1\le j\le k
\]
against a smooth departure from independence:
\[
H_1^{\mathrm{IND}}(\Delta_n; s): p\in\mathcal{W}^{s,2}(M),\ p_j\in\mathcal{W}^{s,2}(M_j),\ 1\le j\le k, \ \text{and } \|p - p_1\otimes\cdots\otimes p_k\|_{L_2}\ge\Delta_n,
\]
where $M = \prod_{j=1}^k M_j$, so that $p_1\otimes\cdots\otimes p_k\in\mathcal{W}^{s,2}(M)$ under both the null and alternative hypotheses.
Given a sample $\{X_1,\ldots,X_n\}$ of independent copies of $X$, we can naturally estimate the so-called dHSIC $\gamma^2_{\nu_n}(\mathbb{P}, \mathbb{P}^{X^1}\otimes\cdots\otimes\mathbb{P}^{X^k})$ by
\[
\gamma^2_{\nu_n}(\mathbb{P}_n, \mathbb{P}^{X^1}_n\otimes\cdots\otimes\mathbb{P}^{X^k}_n) = \frac{1}{n^2}\sum_{1\le i,j\le n}G_{\nu_n}(X_i, X_j) + \frac{1}{n^{2k}}\sum_{1\le i_1,\ldots,i_k,j_1,\ldots,j_k\le n}G_{\nu_n}((X^1_{i_1},\ldots,X^k_{i_k}),(X^1_{j_1},\ldots,X^k_{j_k})) - \frac{2}{n^{k+1}}\sum_{1\le i,j_1,\ldots,j_k\le n}G_{\nu_n}(X_i,(X^1_{j_1},\ldots,X^k_{j_k})).
\]
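For $k = 2$ the estimate above is cheap to compute, because the Gaussian kernel factorizes over the blocks: $\exp(-\nu\|x-y\|^2) = \exp(-\nu\|x^1-y^1\|^2)\exp(-\nu\|x^2-y^2\|^2)$. Each of the three terms then reduces to the two component Gram matrices. The sketch below is our specialization to $k = 2$, not code from the text.

```python
import numpy as np

def gauss_gram(Z, nu):
    # Gram matrix of the Gaussian kernel over the rows of Z.
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-nu * sq)

def dhsic2(X1, X2, nu):
    """Plug-in estimate of gamma^2_nu(P, P^{X1} x P^{X2}) for k = 2,
    written in terms of the component Gram matrices K1, K2."""
    n = X1.shape[0]
    K1, K2 = gauss_gram(X1, nu), gauss_gram(X2, nu)
    term1 = (K1 * K2).sum() / n ** 2                 # joint-sample term
    term2 = K1.sum() * K2.sum() / n ** 4             # product-measure term
    term3 = (K1.sum(1) * K2.sum(1)).sum() / n ** 3   # cross term
    return term1 + term2 - 2.0 * term3
```

This is the usual HSIC-style $O(n^2)$ computation; the naive triple and quadruple sums in the display would cost $O(n^3)$ and $O(n^4)$.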
To correct for the bias, we shall consider the following estimate of $\gamma^2_{\nu_n}(\mathbb{P}, \mathbb{P}^{X^1}\otimes\cdots\otimes\mathbb{P}^{X^k})$ instead:
\[
\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{P}^{X^1}\otimes\cdots\otimes\mathbb{P}^{X^k}) = \frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}G_{\nu_n}(X_i, X_j) + \frac{(n-2k)!}{n!}\sum_{\substack{1\le i_1,\cdots,i_k,j_1,\cdots,j_k\le n\\ |\{i_1,\cdots,i_k,j_1,\cdots,j_k\}|=2k}}G_{\nu_n}((X^1_{i_1},\ldots,X^k_{i_k}),(X^1_{j_1},\ldots,X^k_{j_k})) - \frac{2(n-k-1)!}{n!}\sum_{\substack{1\le i,j_1,\cdots,j_k\le n\\ |\{i,j_1,\cdots,j_k\}|=k+1}}G_{\nu_n}(X_i,(X^1_{j_1},\ldots,X^k_{j_k})).
\]
Under $H_0^{\mathrm{IND}}$, we have
\[
\mathbb{E}\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{P}^{X^1}\otimes\cdots\otimes\mathbb{P}^{X^k}) = 0.
\]
Deriving its variance, however, requires a bit more work. Write
\[
h_j(x^j, y) = \mathbb{E}_{X\sim\mathbb{P}^{X^1}\otimes\cdots\otimes\mathbb{P}^{X^k}}G_{\nu_n}((X^1,\ldots,X^{j-1},x^j,X^{j+1},\ldots,X^k), y)
\]
and
\[
g_j(x^j, y) = h_j(x^j, y) - \mathbb{E}_{X^j\sim\mathbb{P}^{X^j}}h_j(X^j, y) - \mathbb{E}_{Y\sim\mathbb{P}}h_j(x^j, Y) + \mathbb{E}_{(X^j,Y)\sim\mathbb{P}^{X^j}\otimes\mathbb{P}}h_j(X^j, Y).
\]
With slight abuse of notation, also denote by
\[
h_{j_1,j_2}(x^{j_1}, y^{j_2}) = \mathbb{E}_{X,Y\sim_{\mathrm{iid}}\mathbb{P}^{X^1}\otimes\cdots\otimes\mathbb{P}^{X^k}}G_{\nu_n}((X^1,\ldots,X^{j_1-1},x^{j_1},X^{j_1+1},\ldots,X^k),(Y^1,\ldots,Y^{j_2-1},y^{j_2},Y^{j_2+1},\ldots,Y^k))
\]
and
\[
g_{j_1,j_2}(x^{j_1}, y^{j_2}) = h_{j_1,j_2}(x^{j_1}, y^{j_2}) - \mathbb{E}_{X^{j_1}\sim\mathbb{P}^{X^{j_1}}}h_{j_1,j_2}(X^{j_1}, y^{j_2}) - \mathbb{E}_{X^{j_2}\sim\mathbb{P}^{X^{j_2}}}h_{j_1,j_2}(x^{j_1}, X^{j_2}) + \mathbb{E}_{(X^{j_1},Y^{j_2})\sim\mathbb{P}^{X^{j_1}}\otimes\mathbb{P}^{X^{j_2}}}h_{j_1,j_2}(X^{j_1}, Y^{j_2}).
\]
Then we have

Lemma 1. Under $H_0^{\mathrm{IND}}$,
\[
\mathrm{var}\left(\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{P}^{X^1}\otimes\cdots\otimes\mathbb{P}^{X^k})\right) = \frac{2}{n(n-1)}\left(\mathbb{E}\bar{G}^2_{\nu_n}(X, Y) - 2\sum_{1\le j\le k}\mathbb{E}\left(g_j(X^j, Y)\right)^2 + \sum_{1\le j_1,j_2\le k}\mathbb{E}\left(g_{j_1,j_2}(X^{j_1}, Y^{j_2})\right)^2\right) + O(\mathbb{E}G^2_{\nu_n}(X, Y)/n^3). \tag{3.4}
\]
In light of Lemma 1, a variance estimator can be derived by estimating the leading term on the right-hand side of (3.4) term by term using $U$-statistics. Formulae for estimating the variance for general $k$ are tedious, and we defer them to the appendix for space considerations. In the special case when $k = 2$, the leading term on the right-hand side of (3.4) takes a much simplified form:
\[
\frac{2}{n(n-1)}\mathbb{E}\bar{G}^2_{\nu_n}(X^1, Y^1)\cdot\mathbb{E}\bar{G}^2_{\nu_n}(X^2, Y^2),
\]
where $X^j, Y^j\sim_{\mathrm{iid}}\mathbb{P}^{X^j}$ for $j = 1, 2$. Thus, we can estimate $\mathbb{E}\bar{G}^2_{\nu_n}(X^j, Y^j)$ by
\[
s^2_{n,j,\nu_n} = \frac{1}{n(n-1)}\sum_{1\le i_1\ne i_2\le n}G^2_{\nu_n}(X^j_{i_1}, X^j_{i_2}) - \frac{2(n-3)!}{n!}\sum_{\substack{1\le i,l_1,l_2\le n\\ |\{i,l_1,l_2\}|=3}}G_{\nu_n}(X^j_i, X^j_{l_1})G_{\nu_n}(X^j_i, X^j_{l_2}) + \frac{(n-4)!}{n!}\sum_{\substack{1\le i_1,i_2,l_1,l_2\le n\\ |\{i_1,i_2,l_1,l_2\}|=4}}G_{\nu_n}(X^j_{i_1}, X^j_{l_1})G_{\nu_n}(X^j_{i_2}, X^j_{l_2})
\]
and $\mathrm{var}(\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}))$ by $\frac{2}{n(n-1)}\hat{\sigma}^2_{n,\nu_n}$, where
\[
\hat{\sigma}^2_{n,\nu_n} := \max\{s^2_{n,1,\nu_n}s^2_{n,2,\nu_n}, 1/n^2\},
\]
so that a test statistic for $H_0^{\mathrm{IND}}$ is
\[
T^{\mathrm{IND}}_{n,\nu_n} := \frac{n}{\sqrt{2}}\hat{\sigma}^{-1}_{n,\nu_n}\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}).
\]
Test statistics for general 𝑘 > 2 can be defined accordingly. Again, we have
Theorem 12. Let $\nu_n\to\infty$ as $n\to\infty$ in such a fashion that $\nu_n = o(n^{4/d})$. Then under $H_0^{\mathrm{IND}}$,
\[
T^{\mathrm{IND}}_{n,\nu_n} \to_d N(0, 1), \quad \text{as } n\to\infty.
\]
Motivated by Theorem 12, we can consider a test, denoted by $\Phi^{\mathrm{IND}}_{n,\nu_n,\alpha}$, that rejects $H_0^{\mathrm{IND}}$ if and only if $T^{\mathrm{IND}}_{n,\nu_n}$ exceeds $z_\alpha$. By construction, $\Phi^{\mathrm{IND}}_{n,\nu_n,\alpha}$ is an asymptotic $\alpha$-level test. We now turn to studying its power against $H_1^{\mathrm{IND}}$. As in the case of the goodness-of-fit test, we can prove that $\Phi^{\mathrm{IND}}_{n,\nu_n,\alpha}$ is minimax optimal in that it can detect the smallest departure from independence in terms of rate of convergence. More precisely, we have
Theorem 13. (i) Assume that $n^{2s/(d+4s)}\Delta_n\to\infty$. Then for any $\alpha\in(0,1)$,
\[
\lim_{n\to\infty}\mathrm{power}\{\Phi^{\mathrm{IND}}_{n,\nu_n,\alpha}; H_1^{\mathrm{IND}}(\Delta_n; s)\} = 1,
\]
provided that $\nu_n \asymp n^{4/(d+4s)}$.

(ii) Conversely, if $\liminf_{n\to\infty} n^{2s/(d+4s)}\Delta_n < \infty$, then there exists some $\alpha\in(0,1)$ such that for any test $\Phi_n$ of level $\alpha$ (asymptotically) based on $X_1,\ldots,X_n\sim p$,
\[
\liminf_{n\to\infty}\mathrm{power}\{\Phi_n; H_1^{\mathrm{IND}}(\Delta_n; s)\} < 1.
\]
3.4 Adaptation
The results presented in the previous sections not only suggest that Gaussian kernel embedding of distributions is especially suitable for testing against smooth alternatives, but also indicate the importance of choosing an appropriate scaling parameter in order to detect small deviations from the null hypothesis. To achieve maximum power, the scaling parameter should be chosen according to the smoothness of the underlying density functions. This, however, presents a practical challenge because the level of smoothness is rarely known a priori. This naturally brings about the question of adaptation: can we devise an agnostic testing procedure that does not require such knowledge but still attains similar performance? We shall show in this section that this is possible, at least for sufficiently smooth densities.
3.4.1 Test for Goodness-of-fit
We again begin with the test for goodness-of-fit. As we show in Section 3.1, under $H_0^{\mathrm{GOF}}$, $T^{\mathrm{GOF}}_{n,\nu_n}\to_d N(0,1)$ if $1\ll\nu_n\ll n^{4/d}$; whereas for any $p\in\mathcal{W}^{s,2}$ such that $\|p - p_0\|_{L_2}\gg n^{-2s/(d+4s)}$, $T^{\mathrm{GOF}}_{n,\nu_n}\to\infty$ provided that $\nu_n\asymp n^{4/(d+4s)}$. This motivates us to consider the following test statistic:
\[
T^{\mathrm{GOF(adapt)}}_n = \max_{1\le\nu_n\le n^{2/d}} T^{\mathrm{GOF}}_{n,\nu_n}.
\]
In light of the earlier discussion, it is plausible that such a statistic could be used to detect any smooth departure from the null provided that the level of smoothness satisfies $s\ge d/4$. We now argue that this is indeed the case. More specifically, we shall reject $H_0^{\mathrm{GOF}}$ if and only if $T^{\mathrm{GOF(adapt)}}_n$ exceeds the upper $\alpha$ quantile, denoted by $q^{\mathrm{GOF}}_{n,\alpha}$, of its null distribution. In what follows, we shall call this test $\Phi^{\mathrm{GOF(adapt)}}$. Note that, even though it is hard to derive an analytic form for $q^{\mathrm{GOF}}_{n,\alpha}$, it can be readily evaluated via the Monte Carlo method.
To study the power of $\Phi^{\mathrm{GOF(adapt)}}$ against $H_1^{\mathrm{GOF}}$ with different levels of smoothness, we shall consider the following alternative hypothesis:
\[
H_1^{\mathrm{GOF(adapt)}}(\Delta_{n,s}: s\ge d/4): p\in\bigcup_{s\ge d/4}\left\{p\in\mathcal{W}^{s,2}(M): \|p - p_0\|_{L_2}\ge\Delta_{n,s}\right\}.
\]
The following theorem characterizes the power of ΦGOF(adapt) against 𝐻GOF(adapt)1 (Δ𝑛,𝑠 : 𝑠 ≥ 𝑑/4).
Theorem 14. There exists a constant $c > 0$ such that if
\[
\liminf_{n\to\infty}\Delta_{n,s}(n/\log\log n)^{2s/(d+4s)} > c,
\]
then
\[
\mathrm{power}\{\Phi^{\mathrm{GOF(adapt)}}; H_1^{\mathrm{GOF(adapt)}}(\Delta_{n,s}: s\ge d/4)\}\to 1.
\]
Theorem 14 shows that $\Phi^{\mathrm{GOF(adapt)}}$ has a detection boundary of the order $(\log\log n/n)^{\frac{2s}{d+4s}}$ when $p\in\mathcal{W}^{s,2}$ for any $s\ge d/4$. If $s$ is known in advance, then, as we show in Section 3.1, the optimal test is based on $T^{\mathrm{GOF}}_{n,\nu_n}$ with $\nu_n\asymp n^{4/(d+4s)}$ and has a detection boundary of the order $O(n^{-2s/(d+4s)})$. The extra factor $(\log\log n)^{2s/(d+4s)}$, polynomial in the iterated logarithm, is the price we pay to ensure that no knowledge of $s$ is required and that $\Phi^{\mathrm{GOF(adapt)}}$ is powerful against smooth alternatives for all $s\ge d/4$.
3.4.2 Test for Homogeneity
The treatment for homogeneity tests is similar. Instead of $T^{\mathrm{HOM}}_{n,\nu_n}$, we now consider a test based on
\[
T^{\mathrm{HOM(adapt)}}_n = \max_{1\le\nu_n\le n^{2/d}} T^{\mathrm{HOM}}_{n,\nu_n}.
\]
If $T^{\mathrm{HOM(adapt)}}_n$ exceeds the upper $\alpha$ quantile, denoted by $q^{\mathrm{HOM}}_{n,\alpha}$, of its null distribution, then we reject $H_0^{\mathrm{HOM}}$. In what follows, we shall refer to this test as $\Phi^{\mathrm{HOM(adapt)}}$. As before, we do not have a closed-form expression for $q^{\mathrm{HOM}}_{n,\alpha}$, and it needs to be evaluated via the Monte Carlo method. In particular, in the case of the homogeneity test, we can approximate $q^{\mathrm{HOM}}_{n,\alpha}$ by permutation: we randomly shuffle $\{X_1,\ldots,X_n,Y_1,\ldots,Y_m\}$ and compute the test statistic as if the first $n$ shuffled observations were from the first population and the other $m$ were from the second population. This is repeated multiple times in order to approximate the critical value $q^{\mathrm{HOM}}_{n,\alpha}$.
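The permutation scheme just described can be sketched as follows. For readability we use a simple plug-in Gaussian-MMD statistic as a stand-in for $T^{\mathrm{HOM}}_{n,\nu_n}$ (the self-normalized statistic would slot into the same loop), and work in one dimension; both simplifications are ours.

```python
import numpy as np

def mmd2_biased(X, Y, nu):
    # Plug-in Gaussian-MMD statistic (a stand-in for the text's T^HOM).
    Kxx = np.exp(-nu * (X[:, None] - X[None, :]) ** 2).mean()
    Kyy = np.exp(-nu * (Y[:, None] - Y[None, :]) ** 2).mean()
    Kxy = np.exp(-nu * (X[:, None] - Y[None, :]) ** 2).mean()
    return Kxx + Kyy - 2.0 * Kxy

def permutation_critical_value(X, Y, nu, alpha=0.05, B=100, seed=0):
    """Approximate the upper-alpha critical value: shuffle the pooled
    sample and treat the first n shuffled points as the first sample."""
    rng = np.random.default_rng(seed)
    Z = np.concatenate([X, Y])
    n = len(X)
    stats = [mmd2_biased((perm := rng.permutation(Z))[:n], perm[n:], nu)
             for _ in range(B)]
    return float(np.quantile(stats, 1.0 - alpha))
```

Because the pooled sample is exchangeable under $H_0^{\mathrm{HOM}}$, the shuffled statistics mimic the null distribution, so the test calibrated this way has (approximately) the targeted size regardless of the scaling parameter.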
The following theorem characterizes the power of $\Phi^{\mathrm{HOM(adapt)}}$ against an alternative with different levels of smoothness:
\[
H_1^{\mathrm{HOM(adapt)}}(\Delta_{n,s}: s\ge d/4): (p, q)\in\bigcup_{s\ge d/4}\left\{(p, q): p, q\in\mathcal{W}^{s,2}(M), \|p - q\|_{L_2}\ge\Delta_{n,s}\right\}.
\]
Theorem 15. There exists a constant $c > 0$ such that if
\[
\liminf_{n\to\infty}\Delta_{n,s}(n/\log\log n)^{2s/(d+4s)} > c,
\]
then
\[
\mathrm{power}\{\Phi^{\mathrm{HOM(adapt)}}; H_1^{\mathrm{HOM(adapt)}}(\Delta_{n,s}: s\ge d/4)\}\to 1.
\]
Similar to the case of the goodness-of-fit test, Theorem 15 shows that $\Phi^{\mathrm{HOM(adapt)}}$ has a detection boundary of the order $O((n/\log\log n)^{-2s/(d+4s)})$ when $p\ne q\in\mathcal{W}^{s,2}$ for any $s\ge d/4$. In light of the results from Section 3.2, this is optimal up to an extra factor polynomial in the iterated logarithm. The main advantage is that $\Phi^{\mathrm{HOM(adapt)}}$ is powerful against smooth alternatives simultaneously for all $s\ge d/4$.
3.4.3 Test for Independence
Similarly, for the independence test, we shall adopt the following test statistic:
\[
T^{\mathrm{IND(adapt)}}_n = \max_{1\le\nu_n\le n^{2/d}} T^{\mathrm{IND}}_{n,\nu_n},
\]
and reject $H_0^{\mathrm{IND}}$ if and only if $T^{\mathrm{IND(adapt)}}_n$ exceeds the upper $\alpha$ quantile, denoted by $q^{\mathrm{IND}}_{n,\alpha}$, of its null distribution. In what follows, we shall refer to this test as $\Phi^{\mathrm{IND(adapt)}}$. The critical value $q^{\mathrm{IND}}_{n,\alpha}$ can also be evaluated via a permutation test. See, e.g., Pfister et al. (2018) for detailed discussions.
We now show that $\Phi^{\mathrm{IND(adapt)}}$ is powerful in testing against the alternative with different levels of smoothness:
\[
H_1^{\mathrm{IND(adapt)}}(\Delta_{n,s}: s\ge d/4): p\in\bigcup_{s\ge d/4}\left\{p\in\mathcal{W}^{s,2}(M),\ p_j\in\mathcal{W}^{s,2}(M_j),\ 1\le j\le k,\ \|p - p_1\otimes\cdots\otimes p_k\|_{L_2}\ge\Delta_{n,s}\right\}.
\]
More specifically, we have
Theorem 16. There exists a constant $c > 0$ such that if
\[
\liminf_{n\to\infty}\Delta_{n,s}(n/\log\log n)^{2s/(d+4s)} > c,
\]
then
\[
\mathrm{power}\{\Phi^{\mathrm{IND(adapt)}}; H_1^{\mathrm{IND(adapt)}}(\Delta_{n,s}: s\ge d/4)\}\to 1.
\]
Similar to before, Theorem 16 shows that $\Phi^{\mathrm{IND(adapt)}}$ is optimal, up to an extra factor polynomial in the iterated logarithm, for detecting smooth departures from independence simultaneously for all $s\ge d/4$.
Chapter 4: Numerical Experiments
To further complement our theoretical development and demonstrate the practical merits of
the proposed methodology, we conducted several sets of numerical experiments. We shall mainly
consider Gaussian kernels in this chapter as they are the most popular choices in practice for
continuous data.
4.1 Effect of Scaling Parameter
Our first set of experiments was designed to illustrate the importance of the scaling parameter and to highlight the potential room for improvement over the “median” heuristic, one of the most common data-driven choices of the scaling parameter in practice (see, e.g., Gretton et al., 2008; Pfister et al., 2018).
• Experiment I: the homogeneity test with the underlying distributions being a normal distribution and a mixture of several normal distributions. Specifically,
\[
p(x) = f(x; 0, 1), \qquad q(x) = 0.5\times f(x; 0, 1) + 0.1\times\sum_{\mu\in\boldsymbol{\mu}} f(x; \mu, 0.05),
\]
where $f(x; \mu, \sigma)$ denotes the density of $N(\mu, \sigma^2)$ and $\boldsymbol{\mu} = \{-1, -0.5, 0, 0.5, 1\}$.
• Experiment II: the joint independence test of $X^1, \cdots, X^5$, where
\[
X^1, \cdots, X^4, (X^5)'\sim_{\mathrm{iid}} N(0, 1), \qquad X^5 = \left|(X^5)'\right|\times\mathrm{sign}\left(\prod_{l=1}^4 X^l\right).
\]
Clearly $X^1, \cdots, X^5$ are jointly dependent since $\prod_{l=1}^5 X^l\ge 0$.
In both experiments, our primary goal is to investigate how the power of the Gaussian MMD based test is influenced by a pre-fixed scaling parameter. These tests are also compared to ones with the scaling parameter selected via the “median” heuristic. In order to evaluate tests with different scaling parameters under a unified framework, we determined the critical values for each test via a permutation test.

For Experiment I we fixed the sample size at $n = m = 200$, and for Experiment II at $n = 400$. The number of permutations was set at 100, and the significance level at $\alpha = 0.05$. We first repeated the experiments 100 times under the null to verify that the permutation tests indeed yield the correct size, up to Monte Carlo error. Each experiment was then repeated 100 times, and the observed power ($\pm$ one standard error) was recorded for different choices of the scaling parameter. The results are summarized in Figure 4.1. It is perhaps not surprising that the scaling parameter selected via the “median” heuristic has little variation across the simulation runs, and we represent its performance by a single value.
Figure 4.1: Observed power against $\log(\nu)$ in Experiment I (left) and Experiment II (right); curves correspond to tests with a single fixed $\nu$, with the “median” heuristic shown for reference.
The importance of the scaling parameter is evident from Figure 4.1, with the observed power varying quite significantly across different choices. It is also of interest to note that in these settings the “median” heuristic typically does not yield a scaling parameter with great power. More specifically, in Experiment I, $\log(\nu_{\mathrm{median}})\approx 0.2$ while the maximum power is attained at $\log(\nu) = 4$; in Experiment II, $\log(\nu_{\mathrm{median}})\approx -2.15$ while the maximum power is attained at $\log(\nu) = 1$. This suggests that a more appropriate choice of the scaling parameter may lead to much improved performance.
4.2 Efficacy of Adaptation
Our second experiment aims to illustrate that the adaptive procedures we proposed in Section 3.4 indeed yield more powerful tests when compared with other alternatives that are commonly used in practice. In particular, we compare the proposed self-normalized adaptive test (S.A.) with a couple of data-driven approaches, namely the “median” heuristic (Median) and the unnormalized adaptive test (U.A.) proposed in Sriperumbudur et al. (2009). When computing both the self-normalized and unnormalized test statistics, we first rescaled the squared distance $\|X_i - X_j\|^2$ by the dimensionality $d$ before taking the maximum within a certain range of the scaling parameter.
We considered two experiment setups:
• Experiment III: the homogeneity test with the underlying distributions being
\[
P\sim N(0, I_d), \qquad Q\sim N\left(0, \left(1 + 2d^{-1/2}\right)I_d\right).
\]
As the “signal strength”, the ratio between the variances of $Q$ and $P$ in each single direction is set to decrease to 1 at the order $1/\sqrt{d}$ with $d$, which is the decreasing order of the variance ratio that can be detected by the classical $F$-test.
• Experiment IV: the independence test of $X^1, X^2\in\mathbb{R}^{d/2}$, where $X = (X^1, X^2)$ follows a mixture of
\[
N(0, I_d) \quad\text{and}\quad N\left(0, (1 + 6d^{-3/5})I_d\right)
\]
with mixture probability 0.5. Similarly, the ratio between the variances in each direction is set to decrease with $d$, but at a slightly higher rate.
To better compare the different methods, we considered different combinations of sample size and dimensionality for each experiment. More specifically, for Experiment III, the sample sizes were set to be $m = n = 25, 50, 75, \cdots, 200$ and the dimension $d = 1, 10, 100, 1000$; for Experiment IV, the sample sizes were $n = 100, 200, \cdots, 600$ and the dimension $d = 2, 10, 100, 1000$. In both experiments, we fixed the significance level at $\alpha = 0.05$ and used 100 permutations to calibrate the critical values as before. Again we simulated under $H_0$ to verify that the resulting tests have the targeted size, up to Monte Carlo error. The power of each method, estimated from 100 such experiments, is reported
in Figures 4.2 and 4.3.
Figure 4.2: Observed power versus sample size in Experiment III for $d = 1, 10, 100, 1000$ from left to right (methods: Median, U.A., S.A.).
Figure 4.3: Observed power versus sample size in Experiment IV for $d = 2, 10, 100, 1000$ from left to right (methods: Median, U.A., S.A.).
As Figures 4.2 and 4.3 show, for both experiments, these tests are comparable in low-dimensional
settings. But as 𝑑 increases, the proposed self-normalized adaptive test becomes more and more
preferable to the two alternatives. For example, for Experiment IV, when 𝑑 = 1000, the observed
power of the proposed self-normalized adaptive test is about 90% when 𝑛 = 600, while the other
two tests have power around only 15%.
4.3 Data Example
Finally, we applied the proposed self-normalized adaptive test in a data example from Mooij et al. (2016). The data set consists of three variables from different weather stations: altitude (Alt), average temperature (Temp), and average duration of sunshine (Sun). One goal of interest is to uncover the causal relationship among the three variables by identifying a suitable directed acyclic graph (DAG) among them. Following Peters et al. (2014), if a set of random variables $X^1, \cdots, X^d$ follows a DAG $\mathcal{G}_0$, then we assume that they follow a sequence of additive models:
\[
X^l = \sum_{r\in\mathrm{PA}_l} f_{l,r}(X^r) + N^l, \qquad \forall\, 1\le l\le d,
\]
where the $N^l$'s are independent Gaussian noises and $\mathrm{PA}_l$ denotes the collection of parent nodes of node $l$ specified by $\mathcal{G}_0$. As shown by Peters et al. (2014), $\mathcal{G}_0$ is identifiable from the joint distribution of $X^1, \cdots, X^d$ under the assumption that the $f_{l,r}$'s are non-linear. Therefore a natural method of deciding on a specific DAG underlying a set of random variables is to test the independence of the regression residuals after fitting the DAG-induced additive models. In our case, there are 25 possible DAGs in total for the three variables. We can apply independence tests to the residuals for each of the 25 DAGs and choose the one with the largest $p$-value as the most plausible underlying DAG. See Peters et al. (2014) for more details.
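The count of 25 candidate DAGs over three labeled nodes can be verified by brute force: enumerate the three possible states of each node pair (no edge, or an edge in either direction) and keep the acyclic graphs. This check is ours, purely to illustrate where the number 25 comes from.

```python
from itertools import product

def count_dags(p):
    """Count labeled DAGs on p nodes: try every orientation choice per
    unordered pair and keep the acyclic ones (Kahn-style peeling check)."""
    pairs = [(i, j) for i in range(p) for j in range(p) if i < j]
    count = 0
    for choice in product([0, 1, 2], repeat=len(pairs)):
        edges = set()
        for (i, j), c in zip(pairs, choice):
            if c == 1:
                edges.add((i, j))     # i -> j
            elif c == 2:
                edges.add((j, i))     # j -> i
        nodes, rem = set(range(p)), set(edges)
        while True:
            # Nodes with no incoming edges can be peeled off a DAG.
            free = {v for v in nodes if not any(e[1] == v for e in rem)}
            if not free:
                break
            nodes -= free
            rem = {e for e in rem if e[0] in nodes and e[1] in nodes}
        if not nodes:                 # everything peeled => acyclic
            count += 1
    return count
```

For three nodes there are $3^3 = 27$ orientation choices, of which only the two directed triangles are cyclic, leaving 25 DAGs.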
As before, we considered three different methods for the independence tests: the proposed self-normalized adaptive test (S.A.), the Gaussian kernel embedding based independence test with the scaling parameter determined by the “median” heuristic (Median), and the unnormalized adaptive test from Sriperumbudur et al. (2009) (U.A.). Note that the three variables have different scales, and we standardized them before applying the tests of independence.

The overall sample size of the data set is 349. Each time we randomly selected 150 samples and computed the $p$-value associated with each DAG. The $p$-value is again computed based on 100 permutations. We repeated the experiment 1000 times and recorded for each test the DAG with the largest $p$-value. All three tests agree on the top three most selected DAGs, and they are shown in Figure 4.4.
[Figure 4.4 displays three candidate DAGs over the nodes Alt, Temp, and Sun, labeled DAG I, DAG II, and DAG III.]
Figure 4.4: DAGs with the top 3 highest probabilities of being selected.
In addition, we report in Table 4.1 the frequencies with which these three DAGs were selected by each of the tests. The results are generally comparable, with the proposed method most consistently selecting DAG I, the one heavily favored by all three methods.
Test      DAG I    DAG II    DAG III
Median    78.5     4.7       14.5
U.A.      81.4     8.1       8.5
S.A.      83.4     9.8       4.7

Table 4.1: Frequency (%) with which each DAG in Figure 4.4 was selected by the three tests.
Chapter 5: Conclusion and Discussion
In this thesis, we address the problem of kernel selection in kernel embedding based nonparametric hypothesis testing. This is a question that researchers and practitioners have grappled with ever since kernel embedding methods were first proposed, and most existing solutions are ad hoc. We propose principled kernel selection procedures in two different settings and prove that they ensure minimax rate optimality for the associated tests. We also propose adaptive test statistics to address the dependence of the aforementioned kernel selection methods on the regularity of the underlying space of probability distributions; the sacrifice in terms of the detection boundary is only a power of an iterated logarithm of the sample size.
Many interesting problems in this area remain to be explored. For example, can we adopt fast computation techniques to approximate the kernel based test statistics, substantially reducing the computational complexity while maintaining statistical optimality? Parallel results have been derived in the context of regression, but such results appear to be lacking for hypothesis testing. A second direction involves resampling methods such as the permutation test and the bootstrap. In practice, when the sample size may not be large enough, resampling is commonly used to determine the rejection threshold. Can we still guarantee the statistical optimality of the proposed tests when they are combined with resampling?
It is also natural to ask whether similar principled kernel selection methods can be developed for a broader range of nonparametric testing problems, such as conditional independence testing, which can be very useful in Bayesian network learning and causal discovery.
Chapter 6: Proofs
Throughout this chapter, we shall write $a_n \lesssim b_n$ if there exists a universal constant $C > 0$ such that $a_n \le C b_n$. Similarly, we write $a_n \gtrsim b_n$ if $b_n \lesssim a_n$, and $a_n \asymp b_n$ if both $a_n \lesssim b_n$ and $a_n \gtrsim b_n$. When the constant depends on another quantity $D$, we shall write $a_n \lesssim_D b_n$; the relations $\gtrsim_D$ and $\asymp_D$ are defined accordingly.
Proof of Theorem 1. Part (i). The proof of the first part consists of two key steps. First, we show that the population counterpart $n\gamma^2(P, P_0)$ of the test statistic tends to $\infty$ uniformly, i.e.,
$$n \inf_{P \in \mathcal{P}(\Delta_n, 0)} \gamma^2(P, P_0) \to \infty.$$
Then, we argue that the deviation of $\gamma^2(\hat{P}_n, P_0)$ from $\gamma^2(P, P_0)$ is uniformly negligible compared with $\gamma^2(P, P_0)$ itself, where $\hat{P}_n$ denotes the empirical distribution of $X_1, \ldots, X_n$.
It is not hard to see that
$$\gamma(\hat{P}_n, P_0) = \sqrt{\sum_{k \ge 1} \lambda_k \Big[\frac{1}{n}\sum_{i=1}^{n} \phi_k(X_i)\Big]^2} \ge \sqrt{\sum_{k \ge 1} \lambda_k \big[\mathbb{E}_P \phi_k(X)\big]^2} - \sqrt{\sum_{k \ge 1} \lambda_k \Big[\frac{1}{n}\sum_{i=1}^{n} \phi_k(X_i) - \mathbb{E}_P \phi_k(X)\Big]^2}.$$
Thus,
$$P\big\{n\gamma^2(\hat{P}_n, P_0) < q_{w,1-\alpha}\big\} \le P\Bigg\{\sqrt{n\sum_{k\ge1}\lambda_k\Big[\frac{1}{n}\sum_{i=1}^n\phi_k(X_i)-\mathbb{E}_P\phi_k(X)\Big]^2} > \sqrt{n\sum_{k\ge1}\lambda_k\big[\mathbb{E}_P\phi_k(X)\big]^2} - \sqrt{q_{w,1-\alpha}}\Bigg\}.$$
Suppose that
$$n\sum_{k\ge1}\lambda_k\big[\mathbb{E}_P\phi_k(X)\big]^2 > q_{w,1-\alpha}.$$
Then
$$P\big\{n\gamma^2(\hat{P}_n, P_0) < q_{w,1-\alpha}\big\} \le \frac{\mathbb{E}_P\Big\{n\sum_{k\ge1}\lambda_k\Big[\frac{1}{n}\sum_{i=1}^n\phi_k(X_i)-\mathbb{E}_P\phi_k(X)\Big]^2\Big\}}{\Big\{\sqrt{n\sum_{k\ge1}\lambda_k\big[\mathbb{E}_P\phi_k(X)\big]^2}-\sqrt{q_{w,1-\alpha}}\Big\}^2}.$$
Observe that for any $P \in \mathcal{P}(\Delta_n, 0)$,
$$\mathbb{E}_P\Big\{n\sum_{k\ge1}\lambda_k\Big[\frac{1}{n}\sum_{i=1}^n\phi_k(X_i)-\mathbb{E}_P\phi_k(X)\Big]^2\Big\} = \sum_{k\ge1}\lambda_k\mathrm{Var}[\phi_k(X)] \le \sum_{k\ge1}\lambda_k\mathbb{E}_P\phi_k^2(X) \le \Big(\sup_{k\ge1}\|\phi_k\|_\infty\Big)^2\sum_{k\ge1}\lambda_k < \infty.$$
This implies that
$$\lim_{n\to\infty}\beta(\Phi_{\mathrm{MMD}};\Delta_n,0) = \lim_{n\to\infty}\sup_{P\in\mathcal{P}(\Delta_n,0)} P\big\{n\gamma^2(\hat{P}_n,P_0) < q_{w,1-\alpha}\big\} \le \lim_{n\to\infty}\frac{\sup\limits_{P\in\mathcal{P}(\Delta_n,0)}\mathbb{E}_P\Big\{n\sum_{k\ge1}\lambda_k\big[\frac1n\sum_{i=1}^n\phi_k(X_i)-\mathbb{E}_P\phi_k(X)\big]^2\Big\}}{\inf\limits_{P\in\mathcal{P}(\Delta_n,0)}\Big\{\sqrt{n\sum_{k\ge1}\lambda_k[\mathbb{E}_P\phi_k(X)]^2}-\sqrt{q_{w,1-\alpha}}\Big\}^2} = 0,$$
provided that
$$\inf_{P\in\mathcal{P}(\Delta_n,0)} n\sum_{k\ge1}\lambda_k\big[\mathbb{E}_P\phi_k(X)\big]^2 \to \infty, \quad \text{as } n\to\infty. \tag{6.1}$$
It now suffices to show that (6.1) holds if $n\Delta_n^4 \to \infty$ as $n\to\infty$. To this end, let $u = dP/dP_0 - 1$ and
$$a_k = \langle u, \phi_k\rangle_{L^2(P_0)} = \mathbb{E}_P\phi_k(X) - \mathbb{E}_{P_0}\phi_k(X) = \mathbb{E}_P\phi_k(X).$$
It is clear that
$$\sum_{k\ge1}\lambda_k^{-1}a_k^2 = \|u\|_K^2, \quad\text{and}\quad \sum_{k\ge1}a_k^2 = \|u\|^2_{L^2(P_0)} = \chi^2(P, P_0).$$
By the definition of $\mathcal{P}(\Delta_n,0)$,
$$\sup_{P\in\mathcal{P}(\Delta_n,0)}\sum_{k\ge1}\lambda_k^{-1}a_k^2 \le M^2, \quad\text{and}\quad \inf_{P\in\mathcal{P}(\Delta_n,0)}\sum_{k\ge1}a_k^2 \ge \Delta_n^2.$$
Since $n\Delta_n^4 \to \infty$ as $n\to\infty$, we get, by the Cauchy–Schwarz inequality,
$$\inf_{P\in\mathcal{P}(\Delta_n,0)} n\sum_{k\ge1}\lambda_k\big[\mathbb{E}_P\phi_k(X)\big]^2 = \inf_{P\in\mathcal{P}(\Delta_n,0)} n\sum_{k\ge1}\lambda_k a_k^2 \ge \inf_{P\in\mathcal{P}(\Delta_n,0)} \frac{n\big(\sum_{k\ge1}a_k^2\big)^2}{\sum_{k\ge1}\lambda_k^{-1}a_k^2} \ge \frac{n\Delta_n^4}{M^2} \to \infty$$
as $n\to\infty$.
Part (ii). In proving the second part, we will make use of the following lemma, which can be obtained by adapting the argument in Gregory (1977). It gives the limiting distribution of the V-statistic under a sequence $P_n$ converging to $P_0$ at the rate $n^{-1/4}$.

Lemma 2. Consider a sequence of probability measures $\{P_n : n\ge1\}$ contiguous to $P_0$ satisfying $u_n = dP_n/dP_0 - 1 \to 0$ in $L^2(P_0)$. Suppose that for any fixed $k$,
$$\lim_{n\to\infty}\sqrt{n}\,\langle u_n,\phi_k\rangle_{L^2(P_0)} = \tilde{a}_k, \quad\text{and}\quad \lim_{n\to\infty}\sum_{k\ge1}\lambda_k\big(\sqrt{n}\,\langle u_n,\phi_k\rangle_{L^2(P_0)}\big)^2 = \sum_{k\ge1}\lambda_k\tilde{a}_k^2 + \tilde{a}_0 < \infty,$$
for some sequence $\{\tilde{a}_k : k\ge0\}$. Then
$$\frac{1}{n}\sum_{k\ge1}\lambda_k\Big[\sum_{i=1}^n\phi_k(X_i)\Big]^2 \xrightarrow{d} \sum_{k\ge1}\lambda_k(Z_k+\tilde{a}_k)^2 + \tilde{a}_0,$$
where $X_1,\ldots,X_n \overset{i.i.d.}{\sim} P_n$, and the $Z_k$'s are independent standard normal random variables.
Write $L(k) = \lambda_k k^{2s}$. By assumption (2.6),
$$0 < \underline{L} := \inf_{k\ge1}L(k) \le \sup_{k\ge1}L(k) =: \overline{L} < \infty.$$
Consider a sequence $\{P_n : n\ge1\}$ such that
$$dP_n/dP_0 - 1 = C_1\sqrt{\lambda_{k_n}}\,[L(k_n)]^{-1}\phi_{k_n},$$
where $C_1$ is a positive constant and $k_n = \lfloor C_2 n^{\frac{1}{4s}}\rfloor$ for some positive constant $C_2$. Both $C_1$ and $C_2$ will be determined later. Since $\sup_{k\ge1}\|\phi_k\|_\infty < \infty$ and $\lim_{k\to\infty}\lambda_k = 0$, there exists $N_0 > 0$ such that the $P_n$'s are well-defined probability measures for any $n \ge N_0$.
Note that
$$\|u_n\|_K^2 = \frac{C_1^2}{L^2(k_n)} \le \underline{L}^{-2}C_1^2$$
and
$$\|u_n\|^2_{L^2(P_0)} = \frac{C_1^2\lambda_{k_n}}{L^2(k_n)} = \frac{C_1^2}{L(k_n)}k_n^{-2s} \ge \overline{L}^{-1}C_1^2k_n^{-2s} \sim \overline{L}^{-1}C_1^2C_2^{-2s}n^{-1/2},$$
where $A_n \sim B_n$ means that $\lim_{n\to\infty}A_n/B_n = 1$. Thus, by choosing $C_1$ sufficiently small and $c_0 = \frac{1}{2}\overline{L}^{-1}C_1^2C_2^{-2s}$, we ensure that $P_n \in \mathcal{P}(c_0 n^{-1/4}, 0)$ for sufficiently large $n$.
To apply Lemma 2, we note that
$$\lim_{n\to\infty}\|u_n\|^2_{L^2(P_0)} = \lim_{n\to\infty}\frac{C_1^2\lambda_{k_n}}{L^2(k_n)} = 0.$$
In addition, for any fixed $k$,
$$\tilde{a}_{n,k} = \sqrt{n}\,\langle u_n,\phi_k\rangle_{L^2(P_0)} = 0$$
for sufficiently large $n$, and
$$\sum_{k\ge1}\lambda_k\tilde{a}_{n,k}^2 = \frac{nC_1^2\lambda_{k_n}^2}{L^2(k_n)} = nC_1^2k_n^{-4s} \to C_1^2C_2^{-4s}$$
as $n\to\infty$. Thus, Lemma 2 implies that
$$n\gamma^2(\hat{P}_n, P_0) \xrightarrow{d} \sum_{k\ge1}\lambda_kZ_k^2 + C_1^2C_2^{-4s}.$$
Now take $C_2 = \big(2C_1^2/q_{w,1-\alpha}\big)^{\frac{1}{4s}}$ so that $C_1^2C_2^{-4s} = \frac{1}{2}q_{w,1-\alpha}$. Then
$$\liminf_{n\to\infty}\beta\big(\Phi_{\mathrm{MMD}}; c_0n^{-1/4}, 0\big) \ge \lim_{n\to\infty}P\big(n\gamma^2(\hat{P}_n,P_0) < q_{w,1-\alpha}\big) = P\Big(\sum_{k\ge1}\lambda_kZ_k^2 < \tfrac{1}{2}q_{w,1-\alpha}\Big) > 0,$$
which concludes the proof.
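The limiting behavior above is easy to illustrate numerically. The sketch below is a standalone illustration (the eigenvalue choice $\lambda_k = k^{-2}$, the truncation level, and the Monte Carlo sizes are assumptions, not taken from the thesis): it draws from the weighted chi-square law $\sum_k \lambda_k Z_k^2$, estimates a $95\%$ quantile playing the role of $q_{w,1-\alpha}$, and confirms that the event $\{\sum_k \lambda_k Z_k^2 < \frac{1}{2}q_{w,1-\alpha}\}$ has positive probability, which is exactly why the Type II error stays bounded away from zero along the constructed alternatives.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = np.arange(1, 201) ** -2.0           # eigenvalues lambda_k = k^{-2s} with s = 1 (assumed)
Z = rng.standard_normal((20_000, lam.size))
S = (Z ** 2) @ lam                        # draws from sum_k lambda_k Z_k^2

q = np.quantile(S, 0.95)                  # Monte Carlo stand-in for q_{w, 1 - alpha}
p_below = np.mean(S < q / 2)              # P( sum_k lambda_k Z_k^2 < q / 2 )
print(f"q_0.95 ~ {q:.3f}, P(S < q/2) ~ {p_below:.3f}")
```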
Proof of Theorem 2. Let $\bar{K}_n(\cdot,\cdot) := \bar{K}_{\varrho_n}(\cdot,\cdot)$. Note that
$$v_n^{-1/2}\big[n\eta^2_{\varrho_n}(\hat{P}_n, P_0) - A_n\big] = 2(n^2v_n)^{-1/2}\sum_{j=2}^n\sum_{i=1}^{j-1}\bar{K}_n(X_i, X_j).$$
Let $\zeta_{nj} = \sum_{i=1}^{j-1}\bar{K}_n(X_i,X_j)$, and consider the filtration $\{\mathcal{F}_j : j\ge1\}$ where $\mathcal{F}_j = \sigma\{X_i : 1\le i\le j\}$. Due to the assumption that $K$ is degenerate, we have $\mathbb{E}\phi_k(X) = 0$ for any $k\ge1$, which implies that
$$\mathbb{E}(\zeta_{nj}\,|\,\mathcal{F}_{j-1}) = \sum_{i=1}^{j-1}\mathbb{E}\big[\bar{K}_n(X_i,X_j)\,\big|\,\mathcal{F}_{j-1}\big] = \sum_{i=1}^{j-1}\mathbb{E}\big[\bar{K}_n(X_i,X_j)\,\big|\,X_i\big] = 0,$$
for any $j\ge2$. Write
$$U_{nm} = \begin{cases} 0 & m = 1,\\ \sum_{j=2}^m\zeta_{nj} & m\ge2.\end{cases}$$
Then for any fixed $n$, $\{U_{nm}\}_{m\ge1}$ is a martingale with respect to $\{\mathcal{F}_m : m\ge1\}$ and
$$v_n^{-1/2}\big[n\eta^2_{\varrho_n}(\hat{P}_n,P_0) - A_n\big] = 2(n^2v_n)^{-1/2}U_{nn}.$$
We now apply the martingale central limit theorem to $U_{nn}$. Following the argument from Hall (1984), it can be shown that
$$\Big[\tfrac{1}{2}n^2\mathbb{E}\bar{K}_n^2(X,X')\Big]^{-1/2}U_{nn} \xrightarrow{d} N(0,1), \tag{6.2}$$
provided that
$$\Big[\mathbb{E}G_n^2(X,X') + n^{-1}\mathbb{E}\bar{K}_n^2(X,X')\bar{K}_n^2(X,X'') + n^{-2}\mathbb{E}\bar{K}_n^4(X,X')\Big]\big/\big[\mathbb{E}\bar{K}_n^2(X,X')\big]^2 \to 0 \tag{6.3}$$
as $n\to\infty$, where $G_n(x,x') = \mathbb{E}\bar{K}_n(X,x)\bar{K}_n(X,x')$. Since
$$\mathbb{E}\bar{K}_n^2(X,X') = \sum_{k\ge1}\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^2 = v_n,$$
(6.2) implies that
$$v_n^{-1/2}\big[n\eta^2_{\varrho_n}(\hat{P}_n,P_0) - A_n\big] = \sqrt{2}\cdot\Big(\tfrac{1}{2}n^2\mathbb{E}\bar{K}_n^2(X,X')\Big)^{-1/2}U_{nn} \xrightarrow{d} N(0,2).$$
It therefore suffices to verify (6.3).
Note that
$$\mathbb{E}\bar{K}_n^2(X,X') = \sum_{k\ge1}\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^2 \ge \frac{1}{4}\big|\{k : \lambda_k \ge \varrho_n^2\}\big| + \frac{1}{4\varrho_n^4}\sum_{\lambda_k<\varrho_n^2}\lambda_k^2 \asymp \varrho_n^{-1/s},$$
where the last step holds because $\lambda_k \asymp k^{-2s}$. Similarly,
$$\mathbb{E}G_n^2(X,X') = \sum_{k\ge1}\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^4 \le \big|\{k : \lambda_k\ge\varrho_n^2\}\big| + \varrho_n^{-8}\sum_{\lambda_k<\varrho_n^2}\lambda_k^4 \asymp \varrho_n^{-1/s},$$
and
$$\mathbb{E}\bar{K}_n^2(X,X')\bar{K}_n^2(X,X'') = \mathbb{E}\bigg\{\sum_{k\ge1}\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^2\phi_k^2(X)\bigg\}^2 \le \Big(\sup_{k\ge1}\|\phi_k\|_\infty\Big)^4\bigg\{\sum_{k\ge1}\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^2\bigg\}^2 \asymp \varrho_n^{-2/s}.$$
Thus there exists a positive constant $C_3$ such that
$$\mathbb{E}G_n^2(X,X')\big/\big[\mathbb{E}\bar{K}_n^2(X,X')\big]^2 \le C_3\varrho_n^{1/s} \to 0, \tag{6.4}$$
and
$$n^{-1}\mathbb{E}\bar{K}_n^2(X,X')\bar{K}_n^2(X,X'')\big/\big[\mathbb{E}\bar{K}_n^2(X,X')\big]^2 \le C_3n^{-1} \to 0, \tag{6.5}$$
as $n\to\infty$. On the other hand,
$$\mathbb{E}\bar{K}_n^4(X,X') \le \|\bar{K}_n\|_\infty^2\,\mathbb{E}\bar{K}_n^2(X,X'),$$
where
$$\|\bar{K}_n\|_\infty = \sup_x\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\phi_k^2(x)\bigg\} \le \Big(\sup_{k\ge1}\|\phi_k\|_\infty\Big)^2\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2} \asymp \varrho_n^{-1/s}.$$
This implies that for some positive constant $C_4$,
$$n^{-2}\mathbb{E}\bar{K}_n^4(X,X')\big/\big[\mathbb{E}\bar{K}_n^2(X,X')\big]^2 \le n^{-2}\|\bar{K}_n\|_\infty^2\big/\mathbb{E}\bar{K}_n^2(X,X') \le C_4\big(n^2\varrho_n^{1/s}\big)^{-1} \to 0 \tag{6.6}$$
as $n\to\infty$. Together, (6.4), (6.5) and (6.6) ensure that condition (6.3) holds.
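The order assessments of the spectral sums used in this proof, e.g. $v_n = \sum_k\big(\lambda_k/(\lambda_k+\varrho_n^2)\big)^2 \asymp \varrho_n^{-1/s}$, are easy to check numerically. The snippet below is an illustration under the assumed polynomial decay $\lambda_k = k^{-2s}$ with $s = 1$ (a choice made here for the demonstration only): halving-free scaling predicts that shrinking $\varrho$ by a factor of 4 inflates the sum by roughly $4^{1/s} = 4$.

```python
import numpy as np

def v(rho, s=1.0, K=200_000):
    # v_n = sum_k ( lambda_k / (lambda_k + rho^2) )^2 with lambda_k = k^{-2s}
    lam = np.arange(1, K + 1, dtype=float) ** (-2 * s)
    return np.sum((lam / (lam + rho ** 2)) ** 2)

r = v(0.01) / v(0.04)   # expected to be close to (0.04 / 0.01)^(1/s) = 4
print(round(r, 2))
```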
Proof of Theorem 3. Note that
$$n\eta^2_{\varrho_n}(\hat{P}_n,P_0) - \frac{1}{n}\sum_{i=1}^n\bar{K}_n(X_i,X_i) = \frac{1}{n}\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\sum_{\substack{1\le i,j\le n\\ i\ne j}}\phi_k(X_i)\phi_k(X_j)$$
$$= \frac{1}{n}\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\sum_{\substack{1\le i,j\le n\\ i\ne j}}\big[\phi_k(X_i)-\mathbb{E}_P\phi_k(X)\big]\big[\phi_k(X_j)-\mathbb{E}_P\phi_k(X)\big]$$
$$\quad + \frac{2(n-1)}{n}\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big[\mathbb{E}_P\phi_k(X)\big]\sum_{1\le i\le n}\big[\phi_k(X_i)-\mathbb{E}_P\phi_k(X)\big] + \frac{n(n-1)}{n}\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big[\mathbb{E}_P\phi_k(X)\big]^2$$
$$:= V_1 + V_2 + V_3.$$
Obviously, $\mathbb{E}_PV_1V_2 = 0$. We first argue that the following three statements together imply the desired result:
$$\lim_{n\to\infty}\inf_{P\in\mathcal{P}(\Delta_n,\theta)}v_n^{-1/2}V_3 = \infty, \tag{6.7}$$
$$\sup_{P\in\mathcal{P}(\Delta_n,\theta)}\big(\mathbb{E}_PV_1^2/V_3^2\big) = o(1), \tag{6.8}$$
$$\sup_{P\in\mathcal{P}(\Delta_n,\theta)}\big(\mathbb{E}_PV_2^2/V_3^2\big) = o(1). \tag{6.9}$$
To see this, note that (6.7) implies that
$$\lim_{n\to\infty}\inf_{P\in\mathcal{P}(\Delta_n,\theta)}P\Big(v_n^{-1/2}\big[n\eta^2_{\varrho_n}(\hat{P}_n,P_0)-A_n\big] \ge \sqrt{2}z_{1-\alpha}\Big) \ge \lim_{n\to\infty}\inf_{P\in\mathcal{P}(\Delta_n,\theta)}P\Big(v_n^{-1/2}V_3 \ge 2\sqrt{2}z_{1-\alpha},\ V_1+V_2+V_3 \ge \tfrac{1}{2}V_3\Big) = \lim_{n\to\infty}\inf_{P\in\mathcal{P}(\Delta_n,\theta)}P\Big(V_1+V_2+V_3 \ge \tfrac{1}{2}V_3\Big).$$
On the other hand, (6.8) and (6.9) imply that
$$\lim_{n\to\infty}\inf_{P\in\mathcal{P}(\Delta_n,\theta)}P\Big(V_1+V_2+V_3 \ge \tfrac{1}{2}V_3\Big) = 1 - \lim_{n\to\infty}\sup_{P\in\mathcal{P}(\Delta_n,\theta)}P\Big(V_1+V_2+V_3 < \tfrac{1}{2}V_3\Big) \ge 1 - \lim_{n\to\infty}\sup_{P\in\mathcal{P}(\Delta_n,\theta)}\frac{\mathbb{E}_P(V_1+V_2)^2}{(V_3/2)^2} = 1.$$
This immediately implies that $\Phi_{\mathrm{M3d}}$ is consistent. We now show that (6.7)–(6.9) indeed hold.
Verifying (6.7). We begin with (6.7). Since $v_n \asymp \varrho_n^{-1/s}$ and $V_3 = (n-1)\eta^2_{\varrho_n}(P,P_0)$, (6.7) is equivalent to
$$\lim_{n\to\infty}\inf_{P\in\mathcal{P}(\Delta_n,\theta)}n\varrho_n^{\frac{1}{2s}}\eta^2_{\varrho_n}(P,P_0) = \infty.$$
For any $P\in\mathcal{P}(\Delta_n,\theta)$, let $u = dP/dP_0 - 1$ and $a_k = \langle u,\phi_k\rangle_{L^2(P_0)} = \mathbb{E}_P\phi_k(X)$. Based on the assumption that $K$ is universal, $u = \sum_{k\ge1}a_k\phi_k$. We consider the cases $\theta = 0$ and $\theta > 0$ separately.
(1) First consider $\theta = 0$. It is clear that
$$\eta^2_{\varrho_n}(P,P_0) = \sum_{k\ge1}a_k^2 - \sum_{k\ge1}\frac{\varrho_n^2}{\lambda_k+\varrho_n^2}a_k^2 \ge \|u\|^2_{L^2(P_0)} - \varrho_n^2\sum_{k\ge1}\frac{1}{\lambda_k}a_k^2 \ge \|u\|^2_{L^2(P_0)} - \varrho_n^2M^2.$$
Take $\varrho_n \le \sqrt{\Delta_n^2/(2M^2)}$ so that $\varrho_n^2M^2 \le \frac{1}{2}\Delta_n^2$. Then we have
$$\inf_{P\in\mathcal{P}(\Delta_n,0)}\eta^2_{\varrho_n}(P,P_0) \ge \frac{1}{2}\inf_{P\in\mathcal{P}(\Delta_n,0)}\|u\|^2_{L^2(P_0)} = \frac{1}{2}\Delta_n^2.$$
(2) Now consider the case when $\theta > 0$. For $P\in\mathcal{P}(\Delta_n,\theta)$ and any $R > 0$, there exists $f_R\in\mathcal{H}(K)$ such that $\|u-f_R\|_{L^2(P_0)} \le MR^{-1/\theta}$ and $\|f_R\|_K \le R$. Let $b_k = \langle f_R,\phi_k\rangle_{L^2(P_0)}$. Then
$$\eta^2_{\varrho_n}(P,P_0) = \sum_{k\ge1}a_k^2 - \sum_{k\ge1}\frac{\varrho_n^2}{\lambda_k+\varrho_n^2}a_k^2 \ge \|u\|^2_{L^2(P_0)} - 2\sum_{k\ge1}\frac{\varrho_n^2}{\lambda_k+\varrho_n^2}(a_k-b_k)^2 - 2\sum_{k\ge1}\frac{\varrho_n^2}{\lambda_k+\varrho_n^2}b_k^2$$
$$\ge \|u\|^2_{L^2(P_0)} - 2\sum_{k\ge1}(a_k-b_k)^2 - 2\varrho_n^2\sum_{k\ge1}\frac{1}{\lambda_k}b_k^2 = \|u\|^2_{L^2(P_0)} - 2\|u-f_R\|^2_{L^2(P_0)} - 2\varrho_n^2\|f_R\|_K^2.$$
Taking $R = \big(2M/\|u\|_{L^2(P_0)}\big)^\theta$ yields that
$$\eta^2_{\varrho_n}(P,P_0) \ge \|u\|^2_{L^2(P_0)} - 2M^2R^{-2/\theta} - 2\varrho_n^2R^2 = \frac{1}{2}\|u\|^2_{L^2(P_0)} - 2\varrho_n^2R^2.$$
Now by choosing
$$\varrho_n \le \frac{1}{2\sqrt{2}}(2M)^{-\theta}\Delta_n^{1+\theta},$$
we can ensure that $2\varrho_n^2R^2 \le \frac{1}{4}\|u\|^2_{L^2(P_0)}$, so that
$$\inf_{P\in\mathcal{P}(\Delta_n,\theta)}\eta^2_{\varrho_n}(P,P_0) \ge \inf_{P\in\mathcal{P}(\Delta_n,\theta)}\frac{1}{4}\|u\|^2_{L^2(P_0)} \ge \frac{1}{4}\Delta_n^2.$$
In both cases, with $\varrho_n \le C\Delta_n^{\theta+1}$ for a sufficiently small $C = C(M) > 0$, the condition $\lim_{n\to\infty}\varrho_n^{\frac{1}{2s}}n\Delta_n^2 = \infty$ suffices to ensure that (6.7) holds. Under the condition that $\lim_{n\to\infty}\Delta_n n^{\frac{2s}{4s+\theta+1}} = \infty$,
$$\varrho_n = cn^{-\frac{2s(\theta+1)}{4s+\theta+1}} \le C\Delta_n^{\theta+1}$$
for sufficiently large $n$, and $\lim_{n\to\infty}\varrho_n^{\frac{1}{2s}}n\Delta_n^2 = \infty$ holds as well.
Verifying (6.8). Rewrite $V_1$ as
$$V_1 = \frac{1}{n}\sum_{\substack{1\le i,j\le n\\ i\ne j}}\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big[\phi_k(X_i)-\mathbb{E}_P\phi_k(X)\big]\big[\phi_k(X_j)-\mathbb{E}_P\phi_k(X)\big] := \frac{1}{n}\sum_{\substack{1\le i,j\le n\\ i\ne j}}F_n(X_i,X_j).$$
Then
$$\mathbb{E}_PV_1^2 = \frac{1}{n^2}\sum_{\substack{i\ne j\\ i'\ne j'}}\mathbb{E}_PF_n(X_i,X_j)F_n(X_{i'},X_{j'}) = \frac{2n(n-1)}{n^2}\mathbb{E}_PF_n^2(X,X') \le 2\,\mathbb{E}_PF_n^2(X,X'),$$
where $X, X' \overset{i.i.d.}{\sim} P$. Recall that, for any two random variables $Y_1, Y_2$ such that $\mathbb{E}Y_1^2 < \infty$,
$$\mathbb{E}\big[Y_1 - \mathbb{E}(Y_1|Y_2)\big]^2 = \mathbb{E}Y_1^2 - \mathbb{E}\big[\mathbb{E}(Y_1|Y_2)^2\big] \le \mathbb{E}Y_1^2.$$
Together with the fact that
$$F_n(X,X') = \bar{K}_n(X,X') - \mathbb{E}_P\big[\bar{K}_n(X,X')\,\big|\,X\big] - \mathbb{E}_P\big[\bar{K}_n(X,X')\,\big|\,X'\big] + \mathbb{E}_P\bar{K}_n(X,X') = \bar{K}_n(X,X') - \mathbb{E}_P\big[\bar{K}_n(X,X')\,\big|\,X\big] - \mathbb{E}_P\Big[\bar{K}_n(X,X') - \mathbb{E}_P\big[\bar{K}_n(X,X')\,\big|\,X\big]\,\Big|\,X'\Big],$$
we have
$$\mathbb{E}_PF_n^2(X,X') \le \mathbb{E}_P\big\{\bar{K}_n(X,X') - \mathbb{E}_P\big[\bar{K}_n(X,X')\,\big|\,X\big]\big\}^2 \le \mathbb{E}_P\bar{K}_n^2(X,X').$$
Thus, to prove (6.8), it suffices to show that
$$\lim_{n\to\infty}\sup_{P\in\mathcal{P}(\Delta_n,\theta)}\mathbb{E}_P\bar{K}_n^2(X,X')/V_3^2 = 0.$$
For any $g\in L^2(P_0)$ and positive definite kernel $G(\cdot,\cdot)$ such that $\mathbb{E}_{P_0}G^2(X,X') < \infty$, let
$$\|g\|_G := \sqrt{\mathbb{E}_{P_0}\big[g(X)g(X')G(X,X')\big]}.$$
By the positive definiteness of $G(\cdot,\cdot)$, the triangle inequality holds for $\|\cdot\|_G$, i.e., for any $g_1, g_2\in L^2(P_0)$,
$$\big|\|g_1\|_G - \|g_2\|_G\big| \le \|g_1 - g_2\|_G.$$
Thus, by taking $G = \bar{K}_n^2$, $g_1 = dP/dP_0$ and $g_2 = 1$, we have
$$\bigg|\sqrt{\mathbb{E}_P\bar{K}_n^2(X,X')} - \sqrt{\mathbb{E}_{P_0}\bar{K}_n^2(X,X')}\bigg| \le \sqrt{\mathbb{E}_{P_0}\big[u(X)u(X')\bar{K}_n^2(X,X')\big]}. \tag{6.10}$$
We now appeal to the following lemma to bound the right-hand side of (6.10).

Lemma 3. Let $G$ be a Mercer kernel defined over $\mathcal{X}\times\mathcal{X}$ with eigenvalue–eigenfunction pairs $\{(\mu_k,\phi_k) : k\ge1\}$ with respect to $L^2(P)$ such that $\mu_1\ge\mu_2\ge\cdots$. If $G$ is a trace kernel in that $\mathbb{E}G(X,X) < \infty$, then for any $g\in L^2(P)$,
$$\mathbb{E}_P\big[g(X)g(X')G^2(X,X')\big] \le \mu_1\Big(\sum_{k\ge1}\mu_k\Big)\Big(\sup_{k\ge1}\|\phi_k\|_\infty\Big)^2\|g\|^2_{L^2(P)}.$$
By Lemma 3, we get
$$\mathbb{E}_{P_0}\big[u(X)u(X')\bar{K}_n^2(X,X')\big] \le C_5\bigg(\sum_k\frac{\lambda_k}{\lambda_k+\varrho_n^2}\bigg)\|u\|^2_{L^2(P_0)} \asymp \varrho_n^{-1/s}\|u\|^2_{L^2(P_0)}.$$
Recall that
$$\mathbb{E}_{P_0}\bar{K}_n^2(X,X') = \sum_k\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^2 \asymp \varrho_n^{-1/s}.$$
In the light of (6.10), these imply that
$$\mathbb{E}_P\bar{K}_n^2(X,X') \le 2\Big\{\mathbb{E}_{P_0}\bar{K}_n^2(X,X') + \mathbb{E}_{P_0}\big[u(X)u(X')\bar{K}_n^2(X,X')\big]\Big\} \le C_6\varrho_n^{-1/s}\big[1 + \|u\|^2_{L^2(P_0)}\big].$$
On the other hand, as already shown while verifying (6.7), $\varrho_n \lesssim \Delta_n^{\theta+1}$ suffices to ensure that for sufficiently large $n$,
$$\frac{1}{4}\|u\|^2_{L^2(P_0)} \le \eta^2_{\varrho_n}(P,P_0) \le \|u\|^2_{L^2(P_0)}, \quad \forall\, P\in\mathcal{P}(\Delta_n,\theta).$$
Thus
$$\lim_{n\to\infty}\sup_{P\in\mathcal{P}(\Delta_n,\theta)}\mathbb{E}_P\bar{K}_n^2(X,X')/V_3^2 \le 16C_6\Bigg\{\bigg(\lim_{n\to\infty}\inf_{P\in\mathcal{P}(\Delta_n,\theta)}\varrho_n^{1/s}n^2\|u\|^4_{L^2(P_0)}\bigg)^{-1} + \bigg(\lim_{n\to\infty}\inf_{P\in\mathcal{P}(\Delta_n,\theta)}\varrho_n^{1/s}n^2\|u\|^2_{L^2(P_0)}\bigg)^{-1}\Bigg\} = 0,$$
provided that $\lim_{n\to\infty}n^{\frac{2s}{4s+\theta+1}}\Delta_n = \infty$. This immediately implies (6.8).
Verifying (6.9). Observe that
$$\mathbb{E}_PV_2^2 \le 4n\,\mathbb{E}_P\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big[\mathbb{E}_P\phi_k(X)\big]\big[\phi_k(X)-\mathbb{E}_P\phi_k(X)\big]\bigg\}^2 \le 4n\,\mathbb{E}_P\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big[\mathbb{E}_P\phi_k(X)\big]\phi_k(X)\bigg\}^2$$
$$= 4n\,\mathbb{E}_{P_0}\Bigg(\big[1+u(X)\big]\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big[\mathbb{E}_P\phi_k(X)\big]\phi_k(X)\bigg\}^2\Bigg).$$
It is clear that
$$\mathbb{E}_{P_0}\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big[\mathbb{E}_P\phi_k(X)\big]\phi_k(X)\bigg\}^2 = \sum_{k,k'\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\cdot\frac{\lambda_{k'}}{\lambda_{k'}+\varrho_n^2}\,\mathbb{E}_P\phi_k(X)\,\mathbb{E}_P\phi_{k'}(X)\,\mathbb{E}_{P_0}\big[\phi_k(X)\phi_{k'}(X)\big] = \sum_{k\ge1}\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^2\big[\mathbb{E}_P\phi_k(X)\big]^2 \le \eta^2_{\varrho_n}(P,P_0).$$
On the other hand,
$$\mathbb{E}_{P_0}\Bigg(u(X)\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big[\mathbb{E}_P\phi_k(X)\big]\phi_k(X)\bigg\}^2\Bigg) \le \sqrt{\mathbb{E}_{P_0}\Bigg(u^2(X)\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big[\mathbb{E}_P\phi_k(X)\big]\phi_k(X)\bigg\}^2\Bigg)}\times\sqrt{\mathbb{E}_{P_0}\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big[\mathbb{E}_P\phi_k(X)\big]\phi_k(X)\bigg\}^2}$$
$$\le \|u\|_{L^2(P_0)}\sup_x\bigg|\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big[\mathbb{E}_P\phi_k(X)\big]\phi_k(x)\bigg|\cdot\eta_{\varrho_n}(P,P_0) \le \Big(\sup_k\|\phi_k\|_\infty\Big)\|u\|_{L^2(P_0)}\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big|\mathbb{E}_P\phi_k(X)\big|\cdot\eta_{\varrho_n}(P,P_0)$$
$$\le \Big(\sup_k\|\phi_k\|_\infty\Big)\|u\|_{L^2(P_0)}\sqrt{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}}\sqrt{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big[\mathbb{E}_P\phi_k(X)\big]^2}\cdot\eta_{\varrho_n}(P,P_0) \le C_7\|u\|_{L^2(P_0)}\,\varrho_n^{-\frac{1}{2s}}\,\eta^2_{\varrho_n}(P,P_0).$$
Together, they imply that
$$\lim_{n\to\infty}\sup_{P\in\mathcal{P}(\Delta_n,\theta)}\mathbb{E}_PV_2^2/V_3^2 \le 4\max\{1,C_7\}\Bigg\{\bigg(\lim_{n\to\infty}\inf_{P\in\mathcal{P}(\Delta_n,\theta)}n\eta^2_{\varrho_n}(P,P_0)\bigg)^{-1} + \lim_{n\to\infty}\sup_{P\in\mathcal{P}(\Delta_n,\theta)}\bigg(\frac{\|u\|_{L^2(P_0)}}{\varrho_n^{\frac{1}{2s}}\,n\,\eta^2_{\varrho_n}(P,P_0)}\bigg)\Bigg\} = 0,$$
under the assumption that $\lim_{n\to\infty}n^{\frac{2s}{4s+\theta+1}}\Delta_n = \infty$. This verifies (6.9) and completes the proof.
Proof of Theorem 4. The overall architecture of the proof is by now standard in establishing minimax lower bounds for nonparametric hypothesis testing. The main idea is to carefully construct a set of distributions under the alternative hypothesis and argue that a mixture of these alternatives cannot be reliably distinguished from the null. See, e.g., Ingster (1993), Ingster and Suslina (2003), and Tsybakov (2008). Without loss of generality, assume $M = 1$ and $\Delta_n = cn^{-\frac{2s}{4s+\theta+1}}$ for some $c > 0$. Let us consider the cases of $\theta = 0$ and $\theta > 0$ separately.
The case of $\theta = 0$. We first treat the case when $\theta = 0$. Let $B_n = \lfloor C_8\Delta_n^{-\frac{1}{s}}\rfloor$ for a sufficiently small constant $C_8 > 0$ and $a_n = \sqrt{\Delta_n^2/B_n}$. For any $\xi_n := (\xi_{n1},\xi_{n2},\ldots,\xi_{nB_n})^\top\in\{\pm1\}^{B_n}$, write
$$u_{n,\xi_n} = a_n\sum_{k=1}^{B_n}\xi_{nk}\varphi_k.$$
It is clear that
$$\|u_{n,\xi_n}\|^2_{L^2(P_0)} = B_na_n^2 = \Delta_n^2$$
and
$$\|u_{n,\xi_n}\|_\infty \le a_nB_n\Big(\sup_k\|\varphi_k\|_\infty\Big) \asymp \Delta_n^{\frac{2s-1}{2s}} \to 0.$$
By taking $C_8$ small enough, we can also ensure that
$$\|u_{n,\xi_n}\|_K^2 = a_n^2\sum_{k=1}^{B_n}\lambda_k^{-1} \le 1.$$
Therefore, there exists a probability measure $P_{n,\xi_n}\in\mathcal{P}(\Delta_n,0)$ such that $dP_{n,\xi_n}/dP_0 = 1 + u_{n,\xi_n}$.
Following a standard argument for minimax lower bounds, it suffices to show that
$$\limsup_{n\to\infty}\mathbb{E}_{P_0}\Bigg(\frac{1}{2^{B_n}}\sum_{\xi_n\in\{\pm1\}^{B_n}}\bigg\{\prod_{i=1}^n\big[1+u_{n,\xi_n}(X_i)\big]\bigg\}\Bigg)^2 < \infty. \tag{6.11}$$
Note that
$$\mathbb{E}_{P_0}\Bigg(\frac{1}{2^{B_n}}\sum_{\xi_n\in\{\pm1\}^{B_n}}\bigg\{\prod_{i=1}^n\big[1+u_{n,\xi_n}(X_i)\big]\bigg\}\Bigg)^2 = \frac{1}{2^{2B_n}}\sum_{\xi_n,\xi_n'\in\{\pm1\}^{B_n}}\prod_{i=1}^n\mathbb{E}_{P_0}\Big\{\big[1+u_{n,\xi_n}(X_i)\big]\big[1+u_{n,\xi_n'}(X_i)\big]\Big\}$$
$$= \frac{1}{2^{2B_n}}\sum_{\xi_n,\xi_n'\in\{\pm1\}^{B_n}}\bigg(1 + a_n^2\sum_{k=1}^{B_n}\xi_{nk}\xi_{nk}'\bigg)^n \le \frac{1}{2^{2B_n}}\sum_{\xi_n,\xi_n'\in\{\pm1\}^{B_n}}\exp\bigg(na_n^2\sum_{k=1}^{B_n}\xi_{nk}\xi_{nk}'\bigg)$$
$$= \bigg\{\frac{\exp(na_n^2)+\exp(-na_n^2)}{2}\bigg\}^{B_n} \le \exp\Big(\frac{1}{2}B_nn^2a_n^4\Big),$$
where the last inequality is ensured by the fact that
$$\cosh(t) \le \exp\Big(\frac{t^2}{2}\Big), \quad \forall\, t\in\mathbb{R}.$$
See, e.g., Baraud (2002). With the particular choices of $B_n$ and $a_n$, and the conditions on $\Delta_n$, this immediately implies (6.11).
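The elementary bound $\cosh(t)\le e^{t^2/2}$ is what collapses the average over the $2^{2B_n}$ sign pairs into the single factor $\exp(\frac{1}{2}B_nn^2a_n^4)$. Both inequalities are easy to sanity-check numerically; in the sketch below (a standalone illustration, with the particular values of $n$, $B$ and $a$ chosen arbitrarily), the first assertion checks the pointwise bound on a grid and the second checks the resulting mixture bound on a log scale.

```python
import numpy as np

# cosh(t) <= exp(t^2 / 2) for all t: check on a grid
t = np.linspace(-10, 10, 2001)
gap = np.exp(t ** 2 / 2) - np.cosh(t)

# the resulting mixture bound: cosh(n a^2)^B <= exp(B n^2 a^4 / 2), on a log scale
n, B, a = 500, 40, 0.05
lhs = B * np.log(np.cosh(n * a ** 2))
rhs = 0.5 * B * n ** 2 * a ** 4
print(gap.min(), lhs, rhs)
```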
The case of $\theta > 0$. The main idea is similar to before. To find a set of probability measures in $\mathcal{P}(\Delta_n,\theta)$, we appeal to the following lemma.

Lemma 4. Let $u = \sum_k a_k\varphi_k$. If
$$\sup_{B\ge1}\bigg(\sum_{k=1}^B\lambda_k^{-1}a_k^2\bigg)^{1/\theta}\bigg(\sum_{k\ge B}a_k^2\bigg) \le M^2,$$
then $u\in\mathcal{F}(\theta,M)$.
Similar to before, we shall now take $B_n = \lfloor C_{10}\Delta_n^{-\frac{\theta+1}{s}}\rfloor$ and $a_n = \sqrt{\Delta_n^2/B_n}$. By Lemma 4, for appropriately chosen $C_{10}$, we can find $P_{n,\xi_n}\in\mathcal{P}(\Delta_n,\theta)$ such that $dP_{n,\xi_n}/dP_0 = 1+u_{n,\xi_n}$. Following the same argument as in the previous case, we can again verify (6.11).
Proof of Theorem 5. Without loss of generality, assume that $\Delta_n(\theta) = c_1\big(n^{-1}\sqrt{\log\log n}\big)^{\frac{2s}{4s+\theta+1}}$ for some constant $c_1 > 0$ to be determined later.

Type I Error. We first prove the first statement, which shows that the Type I error converges to 0. Following the notation defined in the proof of Theorem 2, let
$$N_{n,2} = \mathbb{E}\bigg\{\sum_{j=2}^n\mathbb{E}\big(\tilde{\zeta}_{nj}^2\,\big|\,\mathcal{F}_{j-1}\big) - 1\bigg\}^2, \qquad L_{n,2} = \sum_{j=2}^n\mathbb{E}\tilde{\zeta}_{nj}^4,$$
where $\tilde{\zeta}_{nj} = \sqrt{2}\,\zeta_{nj}/(n\sqrt{v_n})$. As shown by Haeusler (1988),
$$\sup_t\big|P(T_{n,\varrho_n} > t) - \bar{\Phi}(t)\big| \le C_{11}(L_{n,2} + N_{n,2})^{1/5},$$
where $\bar{\Phi}(t)$ is the survival function of the standard normal, i.e., $\bar{\Phi}(t) = P(Z > t)$ with $Z\sim N(0,1)$. Again by the argument from Hall (1984),
$$\mathbb{E}\bigg\{\sum_{j=2}^n\mathbb{E}\big(\zeta_{nj}^2\,\big|\,\mathcal{F}_{j-1}\big) - \frac{1}{2}n(n-1)v_n\bigg\}^2 \le C_{12}\Big[n^4\mathbb{E}G_n^2(X,X') + n^3\mathbb{E}\bar{K}_n^2(X,X')\bar{K}_n^2(X,X'')\Big],$$
where $G_n(\cdot,\cdot)$ is defined in the proof of Theorem 2, and
$$\sum_{j=2}^n\mathbb{E}\zeta_{nj}^4 \le C_{13}\Big[n^2\mathbb{E}\bar{K}_n^4(X,X') + n^3\mathbb{E}\bar{K}_n^2(X,X')\bar{K}_n^2(X,X'')\Big],$$
which ensures that
$$N_{n,2} = \frac{4\,\mathbb{E}\Big\{\sum_{j=2}^n\mathbb{E}\big(\zeta_{nj}^2\,\big|\,\mathcal{F}_{j-1}\big) - \frac{1}{2}n(n-1)v_n - \frac{1}{2}nv_n\Big\}^2}{n^4v_n^2} \le 8\max\Big\{C_{12},\frac{1}{4}\Big\}\bigg\{\frac{\mathbb{E}G_n^2(X,X')}{v_n^2} + \frac{\mathbb{E}\bar{K}_n^2(X,X')\bar{K}_n^2(X,X'')}{nv_n^2} + \frac{1}{n^2}\bigg\},$$
and
$$L_{n,2} = \frac{4\sum_{j=2}^n\mathbb{E}\zeta_{nj}^4}{n^4v_n^2} \le 4C_{13}\bigg\{\frac{\mathbb{E}\bar{K}_n^4(X,X')}{n^2v_n^2} + \frac{\mathbb{E}\bar{K}_n^2(X,X')\bar{K}_n^2(X,X'')}{nv_n^2}\bigg\}.$$
As shown in the proof of Theorem 2,
$$\frac{\mathbb{E}G_n^2(X,X')}{v_n^2} \le C_3\varrho_n^{1/s}, \qquad \frac{\mathbb{E}\bar{K}_n^4(X,X')}{n^2v_n^2} \le C_4n^{-2}\varrho_n^{-1/s}, \qquad \frac{\mathbb{E}\bar{K}_n^2(X,X')\bar{K}_n^2(X,X'')}{nv_n^2} \le C_3n^{-1}.$$
Therefore,
$$\sup_t\big|P(T_{n,\varrho_n} > t) - \bar{\Phi}(t)\big| \le C_{14}\Big(\varrho_n^{\frac{1}{5s}} + n^{-\frac{1}{5}} + n^{-\frac{2}{5}}\varrho_n^{-\frac{1}{5s}}\Big),$$
which implies that
$$P\Big(\sup_{0\le k\le m_*}T_{n,2^k\varrho_*} > t\Big) \le m_*\bar{\Phi}(t) + C_{15}\Big(2^{\frac{m_*}{5s}}\varrho_*^{\frac{1}{5s}} + m_*n^{-\frac{1}{5}} + n^{-\frac{2}{5}}\varrho_*^{-\frac{1}{5s}}\Big), \quad \forall\, t.$$
It is not hard to see, by the definitions of $m_*$ and $\varrho_*$, that
$$2^{m_*}\varrho_* \le 2\bigg(\frac{\sqrt{\log\log n}}{n}\bigg)^{\frac{2s}{4s+1}}$$
and
$$m_* = (\log 2)^{-1}\bigg\{2s\log n - \frac{2s}{4s+1}\log n + o(\log n)\bigg\} = (\log 2)^{-1}\frac{8s^2}{4s+1}\log n + o(\log n) \asymp \log n.$$
Together with the fact that $\bar{\Phi}(t) \le \frac{1}{2}e^{-t^2/2}$ for $t\ge0$, we get
$$P\Big(\sup_{0\le k\le m_*}T_{n,2^k\varrho_*} > \sqrt{3\log\log n}\Big) \le C_{16}\Bigg[e^{-\frac{3}{2}\log\log n}\log n + \bigg(\frac{\sqrt{\log\log n}}{n}\bigg)^{\frac{2}{5(4s+1)}} + n^{-\frac{1}{5}}\log n + n^{-\frac{2}{5}}\bigg(\frac{\sqrt{\log\log n}}{n}\bigg)^{-\frac{2}{5}}\Bigg] \to 0,$$
as $n\to\infty$.
Type II Error. Next consider the Type II error. To this end, write $\varrho_n(\theta) = \big(\frac{\sqrt{\log\log n}}{n}\big)^{\frac{2s(\theta+1)}{4s+\theta+1}}$ and let
$$\hat{\varrho}_n(\theta) = \sup_{0\le k\le m_*}\big\{2^k\varrho_* : 2^k\varrho_* \le \varrho_n(\theta)\big\}.$$
It is clear that $T_n \ge T_{n,\hat{\varrho}_n(\theta)}$ for any $\theta\ge0$. It therefore suffices to show that
$$\lim_{n\to\infty}\inf_{\theta\ge0}\inf_{P\in\mathcal{P}(\Delta_n,\theta)}P\Big\{T_{n,\hat{\varrho}_n(\theta)} \ge \sqrt{3\log\log n}\Big\} = 1.$$
By Markov's inequality, this can be accomplished by verifying that
$$\inf_{\theta\in[0,\infty)}\inf_{P\in\mathcal{P}(\Delta_n(\theta),\theta)}\mathbb{E}_PT_{n,\hat{\varrho}_n(\theta)} \ge \tilde{c}\sqrt{\log\log n} \tag{6.12}$$
for some $\tilde{c} > \sqrt{3}$, and
$$\lim_{n\to\infty}\sup_{\theta\ge0}\sup_{P\in\mathcal{P}(\Delta_n(\theta),\theta)}\frac{\mathrm{Var}\big(T_{n,\hat{\varrho}_n(\theta)}\big)}{\big(\mathbb{E}_PT_{n,\hat{\varrho}_n(\theta)}\big)^2} = 0. \tag{6.13}$$
We now show that both (6.12) and (6.13) hold with
$$\Delta_n(\theta) = c_1\bigg(\frac{\sqrt{\log\log n}}{n}\bigg)^{\frac{2s}{4s+\theta+1}}$$
for a sufficiently large $c_1 = c_1(M,\tilde{c})$. Note that for all $\theta\in[0,\infty)$,
$$\frac{1}{2}\varrho_n(\theta) \le \hat{\varrho}_n(\theta) \le \varrho_n(\theta), \tag{6.14}$$
which immediately implies
$$\eta^2_{\hat{\varrho}_n(\theta)}(P,P_0) \ge \eta^2_{\varrho_n(\theta)}(P,P_0). \tag{6.15}$$
Following the arguments in the proof of Theorem 3,
$$\mathbb{E}_PT_{n,\hat{\varrho}_n(\theta)} \ge C_{17}n\big[\hat{\varrho}_n(\theta)\big]^{\frac{1}{2s}}\eta^2_{\hat{\varrho}_n(\theta)}(P,P_0) \ge 2^{-\frac{1}{2s}}C_{17}n\big[\varrho_n(\theta)\big]^{\frac{1}{2s}}\eta^2_{\varrho_n(\theta)}(P,P_0),$$
and for all $P\in\mathcal{P}(\Delta_n(\theta),\theta)$,
$$\eta^2_{\varrho_n(\theta)}(P,P_0) \ge \frac{1}{4}\|u\|^2_{L^2(P_0)} \tag{6.16}$$
provided that $\Delta_n(\theta) \ge C'(M)\big(\frac{\sqrt{\log\log n}}{n}\big)^{\frac{2s}{4s+\theta+1}}$. Therefore,
$$\inf_{P\in\mathcal{P}(\Delta_n(\theta),\theta)}\mathbb{E}_PT_{n,\hat{\varrho}_n(\theta)} \ge C_{18}n\big[\varrho_n(\theta)\big]^{\frac{1}{2s}}\Delta_n^2(\theta) = C_{18}c_1^2\sqrt{\log\log n} \ge \tilde{c}\sqrt{\log\log n}$$
provided $c_1^2 \ge C_{18}^{-1}\tilde{c}$. Hence, to ensure that (6.12) holds, it suffices to take $c_1 = \max\big\{C'(M), (C_{18}^{-1}\tilde{c})^{1/2}\big\}$.
With (6.14), (6.15) and (6.16), the results in the proof of Theorem 3 imply that for sufficiently large $n$,
$$\sup_{P\in\mathcal{P}(\Delta_n(\theta),\theta)}\frac{\mathrm{Var}\big(T_{n,\hat{\varrho}_n(\theta)}\big)}{\big(\mathbb{E}_PT_{n,\hat{\varrho}_n(\theta)}\big)^2} \le C_{19}\bigg\{\Big(\big[\varrho_n(\theta)\big]^{\frac{1}{2s}}n\Delta_n^2(\theta)\Big)^{-2} + \Big(\big[\varrho_n(\theta)\big]^{\frac{1}{s}}n^2\Delta_n^2(\theta)\Big)^{-1} + \big(n\Delta_n^2(\theta)\big)^{-1} + \Big(\big[\varrho_n(\theta)\big]^{\frac{1}{2s}}n\Delta_n(\theta)\Big)^{-1}\bigg\}$$
$$\le 2C_{19}\Big(\big[\varrho_n(\theta)\big]^{\frac{1}{2s}}n\Delta_n^2(\theta)\Big)^{-1} \lesssim (\log\log n)^{-\frac{1}{2}} \to 0,$$
which shows (6.13).
Proof of Theorem 6. The main idea of the proof is similar to that for Theorem 4. Nevertheless, in order to show that
$$\inf_{\Phi_n}\bigg[\mathbb{E}_{P_0}\Phi_n + \sup_{\theta\in[\theta_1,\theta_2]}\beta\big(\Phi_n;\Delta_n(\theta),\theta\big)\bigg]$$
converges to 1 rather than merely being bounded away from 0, we need to find $P_\pi$, the marginal distribution on $\mathcal{X}^n$ obtained by drawing the conditional distribution from
$$\big\{P^{\otimes n} : P\in\cup_{\theta\in[\theta_1,\theta_2]}\mathcal{P}(\Delta_n(\theta),\theta)\big\}$$
according to a prior distribution $\pi$ on $\cup_{\theta\in[\theta_1,\theta_2]}\mathcal{P}(\Delta_n(\theta),\theta)$, such that the $\chi^2$ distance between $P_\pi$ and $P_0^{\otimes n}$ converges to 0. See Ingster (2000). To this end, assume, without loss of generality, that
$$\Delta_n(\theta) = c_2\bigg(\frac{n}{\sqrt{\log\log n}}\bigg)^{-\frac{2s}{4s+\theta+1}}, \quad \forall\,\theta\in[\theta_1,\theta_2],$$
where $c_2 > 0$ is a sufficiently small constant to be determined later. Let $r_n = \lfloor C_{20}\log n\rfloor$ and $B_{n,1} = \big\lfloor C_{21}[\Delta_n(\theta_1)]^{-\frac{\theta_1+1}{s}}\big\rfloor$ for sufficiently small $C_{20}, C_{21} > 0$.
Set $\theta_{n,1} = \theta_1$. For $2\le r\le r_n$, let
$$B_{n,r} = 2^{r-2}B_{n,1},$$
and let $\theta_{n,r}$ be selected such that
$$B_{n,r} = \Big\lfloor C_{21}\big[\Delta_n(\theta_{n,r})\big]^{-\frac{\theta_{n,r}+1}{s}}\Big\rfloor.$$
Note that $[\Delta_n(\theta)]^{-\frac{\theta+1}{s}} = c_2^{-\frac{\theta+1}{s}}\big(n/\sqrt{\log\log n}\big)^{\frac{2(\theta+1)}{4s+\theta+1}}$, where the exponent $\frac{2(\theta+1)}{4s+\theta+1}$ is increasing in $\theta$. By choosing $C_{20}$ sufficiently small,
$$B_{n,r_n} = 2^{r_n-2}B_{n,1} \le C_{21}c_2^{-\frac{\theta_1+1}{s}}\exp\bigg(\frac{2(\theta_1+1)}{4s+\theta_1+1}\log\Big(\frac{n}{\sqrt{\log\log n}}\Big) + (r_n-2)\log 2\bigg) \le \Big\lfloor C_{21}\big[\Delta_n(\theta_2)\big]^{-\frac{\theta_2+1}{s}}\Big\rfloor$$
for sufficiently large $n$, since $(r_n-2)\log 2 \le C_{20}(\log 2)\log n$ is then dominated by the gap between the exponents for $\theta_1$ and $\theta_2$ multiplied by $\log\big(n/\sqrt{\log\log n}\big)$. Thus, we can guarantee that $\theta_{n,r}\in[\theta_1,\theta_2]$ for all $1\le r\le r_n$.
We now construct a finite subset of $\cup_{\theta\in[\theta_1,\theta_2]}\mathcal{P}(\Delta_n(\theta),\theta)$ as follows. Let $B^*_{n,0} = 0$ and $B^*_{n,r} = B_{n,1}+\cdots+B_{n,r}$ for $r\ge1$. For each $\xi_{n,r} = (\xi_{n,r,1},\ldots,\xi_{n,r,B_{n,r}})\in\{\pm1\}^{B_{n,r}}$, let
$$f_{n,r,\xi_{n,r}} = 1 + \sum_{k=B^*_{n,r-1}+1}^{B^*_{n,r}}a_{n,r}\,\xi_{n,r,k-B^*_{n,r-1}}\,\varphi_k,$$
where $a_{n,r} = \sqrt{\Delta_n^2(\theta_{n,r})/B_{n,r}}$. Following the same argument as in the proof of Theorem 4, we can verify that with a sufficiently small $C_{21}$, each $P_{n,r,\xi_{n,r}}\in\mathcal{P}(\Delta_n(\theta_{n,r}),\theta_{n,r})$, where $f_{n,r,\xi_{n,r}}$ is the Radon–Nikodym derivative $dP_{n,r,\xi_{n,r}}/dP_0$. With slight abuse of notation, write
$$f_n(X_1,X_2,\ldots,X_n) = \frac{1}{r_n}\sum_{r=1}^{r_n}f_{n,r}(X_1,X_2,\ldots,X_n),$$
where
$$f_{n,r}(X_1,X_2,\ldots,X_n) = \frac{1}{2^{B_{n,r}}}\sum_{\xi_{n,r}\in\{\pm1\}^{B_{n,r}}}\prod_{i=1}^nf_{n,r,\xi_{n,r}}(X_i).$$
It now suffices to show that
$$\|f_n - 1\|^2_{L^2(P_0)} = \|f_n\|^2_{L^2(P_0)} - 1 \to 0, \quad\text{as } n\to\infty,$$
where $\|f_n\|^2_{L^2(P_0)} = \mathbb{E}_{P_0}f_n^2(X_1,X_2,\ldots,X_n)$. Note that
$$\|f_n\|^2_{L^2(P_0)} = \frac{1}{r_n^2}\sum_{1\le r,r'\le r_n}\langle f_{n,r}, f_{n,r'}\rangle_{L^2(P_0)} = \frac{1}{r_n^2}\sum_{1\le r\le r_n}\|f_{n,r}\|^2_{L^2(P_0)} + \frac{1}{r_n^2}\sum_{\substack{1\le r,r'\le r_n\\ r\ne r'}}\langle f_{n,r}, f_{n,r'}\rangle_{L^2(P_0)}.$$
It is easy to verify that, for any $r\ne r'$,
$$\langle f_{n,r}, f_{n,r'}\rangle_{L^2(P_0)} = 1.$$
It therefore suffices to show that
$$\sum_{1\le r\le r_n}\|f_{n,r}\|^2_{L^2(P_0)} = o(r_n^2).$$
Following the same derivation as that in the proof of Theorem 4, we can show that
$$\|f_{n,r}\|^2_{L^2(P_0)} \le \bigg(\frac{\exp(na_{n,r}^2)+\exp(-na_{n,r}^2)}{2}\bigg)^{B_{n,r}} \le \exp\Big(\frac{1}{2}B_{n,r}n^2a_{n,r}^4\Big)$$
for sufficiently large $n$. By setting $c_2$ in the expression of $\Delta_n(\theta)$ sufficiently small, we have
$$B_{n,r}n^2a_{n,r}^4 \le \log r_n,$$
which ensures that
$$\sum_{1\le r\le r_n}\|f_{n,r}\|^2_{L^2(P_0)} \le r_n\cdot r_n^{1/2} = r_n^{3/2} = o(r_n^2).$$
Proof of Theorem 7. We begin with (3.2). Note that $\hat{\gamma}^2_{\nu_n}(P,P_0)$ is a U-statistic, so we can apply general techniques for U-statistics to establish its asymptotic normality. In particular, as shown in Hall (1984), it suffices to verify the following four conditions:
$$\Big(\frac{2\nu_n}{\pi}\Big)^{d/2}\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2) \to \|p_0\|^2_{L^2}, \tag{6.17}$$
$$\frac{\mathbb{E}\bar{G}^4_{\nu_n}(X_1,X_2)}{n^2\big[\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\big]^2} \to 0, \tag{6.18}$$
$$\frac{\mathbb{E}\big[\bar{G}^2_{\nu_n}(X_1,X_2)\bar{G}^2_{\nu_n}(X_1,X_3)\big]}{n\big[\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\big]^2} \to 0, \tag{6.19}$$
$$\frac{\mathbb{E}H^2_{\nu_n}(X_1,X_2)}{\big[\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\big]^2} \to 0, \tag{6.20}$$
as $n\to\infty$, where
$$H_{\nu_n}(x,y) = \mathbb{E}\bar{G}_{\nu_n}(x,X_3)\bar{G}_{\nu_n}(y,X_3), \quad \forall\, x,y\in\mathbb{R}^d.$$
Verifying Condition (6.17). Note that
$$\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2) = \mathbb{E}G^2_{\nu_n}(X_1,X_2) - 2\,\mathbb{E}\big\{\mathbb{E}\big[G_{\nu_n}(X_1,X_2)\,\big|\,X_1\big]\big\}^2 + \big[\mathbb{E}G_{\nu_n}(X_1,X_2)\big]^2.$$
By Lemma 7,
$$\mathbb{E}G_{\nu_n}(X_1,X_2) = \Big(\frac{\pi}{\nu_n}\Big)^{\frac{d}{2}}\int\exp\Big(-\frac{\|\omega\|^2}{4\nu_n}\Big)\big\|\mathcal{F}p_0(\omega)\big\|^2\,d\omega,$$
which immediately yields
$$\Big(\frac{\nu_n}{\pi}\Big)^{\frac{d}{2}}\mathbb{E}G_{\nu_n}(X_1,X_2) \to \|p_0\|^2_{L^2} \quad\text{and}\quad \Big(\frac{2\nu_n}{\pi}\Big)^{\frac{d}{2}}\mathbb{E}G^2_{\nu_n}(X_1,X_2) = \Big(\frac{2\nu_n}{\pi}\Big)^{\frac{d}{2}}\mathbb{E}G_{2\nu_n}(X_1,X_2) \to \|p_0\|^2_{L^2},$$
as $\nu_n\to\infty$.
On the other hand,
$$\mathbb{E}\big\{\mathbb{E}\big[G_{\nu_n}(X_1,X_2)\,\big|\,X_1\big]\big\}^2 = \int\bigg(\int G_{\nu_n}(x,x')G_{\nu_n}(x,x'')p_0(x)\,dx\bigg)p_0(x')p_0(x'')\,dx'dx'' = \int\bigg(\int G^2_{\nu_n}\big(x,(x'+x'')/2\big)p_0(x)\,dx\bigg)G_{\nu_n/2}(x',x'')p_0(x')p_0(x'')\,dx'dx''.$$
Let $Z\sim N(0, 4\nu_nI_d)$. Then
$$\int G^2_{\nu_n}\big(x,(x'+x'')/2\big)p_0(x)\,dx = (2\pi)^{d/2}\mathbb{E}\bigg[\mathcal{F}p_0(Z)\exp\Big(i\Big\langle\frac{x'+x''}{2}, Z\Big\rangle\Big)\bigg] \le (2\pi)^{d/2}\sqrt{\mathbb{E}\big\|\mathcal{F}p_0(Z)\big\|^2} \lesssim_d \|p_0\|_{L^2}\big/\nu_n^{d/4}.$$
Thus
$$\mathbb{E}\big\{\mathbb{E}\big[G_{\nu_n}(X_1,X_2)\,\big|\,X_1\big]\big\}^2 \lesssim_d \|p_0\|^3_{L^2}\big/\nu_n^{3d/4}.$$
Condition (6.17) then follows.
Verifying Conditions (6.18) and (6.19). Since
$$\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2) \asymp_{d,p_0} \nu_n^{-d/2} \quad\text{and}\quad \mathbb{E}\bar{G}^4_{\nu_n}(X_1,X_2) \lesssim \mathbb{E}G^4_{\nu_n}(X_1,X_2) \lesssim_d \nu_n^{-d/2},$$
we obtain
$$n^{-2}\mathbb{E}\bar{G}^4_{\nu_n}(X_1,X_2)\big/\big(\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\big)^2 \lesssim_{d,p_0} \nu_n^{d/2}/n^2 \to 0.$$
Similarly,
$$\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\bar{G}^2_{\nu_n}(X_1,X_3) \lesssim \mathbb{E}G^2_{\nu_n}(X_1,X_2)G^2_{\nu_n}(X_1,X_3) \lesssim_{d,p_0} \nu_n^{-3d/4}.$$
This implies
$$n^{-1}\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\bar{G}^2_{\nu_n}(X_1,X_3)\big/\big(\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\big)^2 \lesssim_{d,p_0} \nu_n^{d/4}/n \to 0,$$
which verifies (6.19).
Verifying Condition (6.20). We now prove (6.20). It suffices to show that
$$\nu_n^d\,\mathbb{E}\Big(\mathbb{E}\big(\bar{G}_{\nu_n}(X_1,X_2)\bar{G}_{\nu_n}(X_1,X_3)\,\big|\,X_2,X_3\big)\Big)^2 \to 0$$
as $n\to\infty$. Note that
$$\mathbb{E}\Big(\mathbb{E}\big(\bar{G}_{\nu_n}(X_1,X_2)\bar{G}_{\nu_n}(X_1,X_3)\,\big|\,X_2,X_3\big)\Big)^2 \lesssim \mathbb{E}\Big(\mathbb{E}\big(G_{\nu_n}(X_1,X_2)G_{\nu_n}(X_1,X_3)\,\big|\,X_2,X_3\big)\Big)^2 = \mathbb{E}G_{\nu_n}(X_1,X_2)G_{\nu_n}(X_1,X_3)G_{\nu_n}(X_4,X_2)G_{\nu_n}(X_4,X_3)$$
$$= \mathbb{E}\Big(G_{\nu_n}(X_1,X_4)G_{\nu_n}(X_2,X_3)\,\mathbb{E}\big(G_{\nu_n}(X_1+X_4,X_2+X_3)\,\big|\,X_1-X_4,X_2-X_3\big)\Big).$$
Since for any $\delta > 0$,
$$\nu_n^d\,\mathbb{E}\Big(G_{\nu_n}(X_1,X_4)G_{\nu_n}(X_2,X_3)\,\mathbb{E}\big(G_{\nu_n}(X_1+X_4,X_2+X_3)\,\big|\,X_1-X_4,X_2-X_3\big)\big(1_{\{\|X_1-X_4\|>\delta\}} + 1_{\{\|X_2-X_3\|>\delta\}}\big)\Big) \to 0,$$
it remains to show that
$$\nu_n^d\,\mathbb{E}\Big(G_{\nu_n}(X_1,X_4)G_{\nu_n}(X_2,X_3)\,\mathbb{E}\big(G_{\nu_n}(X_1+X_4,X_2+X_3)\,\big|\,X_1-X_4,X_2-X_3\big)1_{\{\|X_1-X_4\|\le\delta,\,\|X_2-X_3\|\le\delta\}}\Big) \to 0$$
for some $\delta > 0$, which holds as long as
$$\mathbb{E}\big(G_{\nu_n}(X_1+X_4,X_2+X_3)\,\big|\,X_1-X_4,X_2-X_3\big) \to 0 \tag{6.21}$$
uniformly on $\{\|X_1-X_4\|\le\delta,\ \|X_2-X_3\|\le\delta\}$.
Let
$$Y_1 = X_1 - X_4, \quad Y_2 = X_2 - X_3, \quad Y_3 = X_1 + X_4, \quad Y_4 = X_2 + X_3.$$
Then
$$\mathbb{E}\big(G_{\nu_n}(X_1+X_4,X_2+X_3)\,\big|\,X_1-X_4,X_2-X_3\big) = \Big(\frac{\pi}{\nu_n}\Big)^{\frac{d}{2}}\int\exp\Big(-\frac{\|\omega\|^2}{4\nu_n}\Big)\mathcal{F}p_{Y_1}(\omega)\,\overline{\mathcal{F}p_{Y_2}(\omega)}\,d\omega$$
$$\le \sqrt{\Big(\frac{\pi}{\nu_n}\Big)^{\frac{d}{2}}\int\exp\Big(-\frac{\|\omega\|^2}{4\nu_n}\Big)\big|\mathcal{F}p_{Y_1}(\omega)\big|^2\,d\omega}\cdot\sqrt{\Big(\frac{\pi}{\nu_n}\Big)^{\frac{d}{2}}\int\exp\Big(-\frac{\|\omega\|^2}{4\nu_n}\Big)\big|\mathcal{F}p_{Y_2}(\omega)\big|^2\,d\omega},$$
where
$$p_y(y') = \frac{p(Y_1 = y, Y_3 = y')}{p(Y_1 = y)} = \frac{p_0\big(\frac{y+y'}{2}\big)p_0\big(\frac{y'-y}{2}\big)}{\int p_0\big(\frac{y+y'}{2}\big)p_0\big(\frac{y'-y}{2}\big)\,dy'}$$
is the conditional density of $Y_3$ given $Y_1 = y$ (and $p_{Y_2}$ denotes, analogously, the conditional density of $Y_4$ given $Y_2$).
Thus, to prove (6.21), it suffices to show that
$$h_n(y) := \Big(\frac{\pi}{\nu_n}\Big)^{\frac{d}{2}}\int\exp\Big(-\frac{\|\omega\|^2}{4\nu_n}\Big)\big|\mathcal{F}p_y(\omega)\big|^2\,d\omega = \pi^{\frac{d}{2}}\int\exp\Big(-\frac{\|\omega\|^2}{4}\Big)\big|\mathcal{F}p_y(\sqrt{\nu_n}\,\omega)\big|^2\,d\omega \to 0$$
uniformly over $\{y : \|y\|\le\delta\}$.
Note that
$$h_n(y) = \mathbb{E}G_{\nu_n}(X,X'),$$
where $X, X'\overset{i.i.d.}{\sim}p_y$, which implies that $h_n(y)\to0$ pointwise. To prove the uniform convergence of $h_n(y)$, we only need to show that
$$\lim_{y_1\to y}\sup_n\big|h_n(y_1) - h_n(y)\big| = 0$$
for any $y$. Since $p_0\in L^2$, the marginal density of $Y_1$ is continuous. Therefore, the almost sure continuity of $p_0$ immediately implies that for every $y$, $p_{y_1}(\cdot)\to p_y(\cdot)$ almost surely as $y_1\to y$. Since $p_{y_1}$ and $p_y$ are both densities, it follows that
$$\big|\mathcal{F}p_{y_1}(\omega) - \mathcal{F}p_y(\omega)\big| \le (2\pi)^{-d/2}\int\big|p_{y_1}(y') - p_y(y')\big|\,dy' \to 0,$$
where the convergence follows from Scheffé's theorem; i.e., $\mathcal{F}p_{y_1}\to\mathcal{F}p_y$ uniformly as $y_1\to y$. Therefore we have
$$\sup_n\big|h_n(y_1) - h_n(y)\big| \lesssim \big\|\mathcal{F}p_{y_1} - \mathcal{F}p_y\big\|_{L^\infty} \to 0,$$
which ensures the uniform convergence of $h_n(y)$ to 0 over $\{y : \|y\|\le\delta\}$, and hence (6.20).
Indeed, we have shown that
$$\frac{n\hat{\gamma}^2_{\nu_n}(P,P_0)}{\sqrt{2\,\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)}} \xrightarrow{d} N(0,1).$$
By Slutsky's theorem, in order to prove (3.3), it suffices to show that
$$\hat{\sigma}^2_{n,\nu_n}\big/\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2) \xrightarrow{p} 1,$$
which is equivalent to
$$s^2_{n,\nu_n}\big/\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2) \xrightarrow{p} 1 \tag{6.22}$$
since $1/n^2 = o\big(\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\big)$. It follows from
$$\mathbb{E}\big(s^2_{n,\nu_n}\big) = \mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)$$
and
$$\mathrm{var}\big(s^2_{n,\nu_n}\big) \lesssim n^{-4}\,\mathrm{var}\bigg(\sum_{1\le i\ne j\le n}G^2_{\nu_n}(X_i,X_j)\bigg) + n^{-6}\,\mathrm{var}\Bigg(\sum_{\substack{1\le i,j_1,j_2\le n\\ |\{i,j_1,j_2\}|=3}}G_{\nu_n}(X_i,X_{j_1})G_{\nu_n}(X_i,X_{j_2})\Bigg) + n^{-8}\,\mathrm{var}\Bigg(\sum_{\substack{1\le i_1,i_2,j_1,j_2\le n\\ |\{i_1,i_2,j_1,j_2\}|=4}}G_{\nu_n}(X_{i_1},X_{j_1})G_{\nu_n}(X_{i_2},X_{j_2})\Bigg)$$
$$\lesssim n^{-2}\mathbb{E}G^4_{\nu_n}(X_1,X_2) + n^{-1}\mathbb{E}G^2_{\nu_n}(X_1,X_2)G^2_{\nu_n}(X_1,X_3) + n^{-1}\big(\mathbb{E}G^2_{\nu_n}(X_1,X_2)\big)^2 = o\Big(\big(\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\big)^2\Big)$$
that (6.22) holds.
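To make the normalization in (3.2)–(3.3) concrete, the sketch below computes a studentized Gaussian kernel goodness-of-fit statistic in one dimension with $P_0 = N(0,1)$, for which the centering of the kernel under $P_0$ has a closed form: $\mathbb{E}_{Y\sim N(0,1)}e^{-\nu(x-Y)^2} = (1+2\nu)^{-1/2}e^{-\nu x^2/(1+2\nu)}$ and $\mathbb{E}\,e^{-\nu(X-Y)^2} = (1+4\nu)^{-1/2}$ for $X, Y\overset{i.i.d.}{\sim}N(0,1)$. This is a hedged illustration: the simple plug-in variance estimate below is a stand-in for the thesis's estimator $\hat\sigma^2_{n,\nu_n}$, not its exact definition.

```python
import numpy as np

def centered_gram(x, nu):
    """Gram matrix of the Gaussian kernel centered under P0 = N(0, 1)."""
    G = np.exp(-nu * (x[:, None] - x[None, :]) ** 2)
    m = (1 + 2 * nu) ** -0.5 * np.exp(-nu * x ** 2 / (1 + 2 * nu))  # E_{Y~P0} G(x, Y)
    mm = (1 + 4 * nu) ** -0.5                                        # E G(X, Y), X, Y ~ P0
    return G - m[:, None] - m[None, :] + mm

def gof_statistic(x, nu=1.0):
    n = len(x)
    Gb = centered_gram(x, nu)
    off = Gb - np.diag(np.diag(Gb))
    gamma2 = off.sum() / (n * (n - 1))     # U-statistic estimate of gamma_nu^2(P, P0)
    s2 = (off ** 2).sum() / (n * (n - 1))  # crude plug-in estimate of E[Gbar^2] (assumption)
    return n * gamma2 / np.sqrt(2 * s2)

rng = np.random.default_rng(1)
t_null = gof_statistic(rng.standard_normal(200))       # H0 true: roughly N(0, 1)
t_alt = gof_statistic(rng.standard_normal(200) + 1.0)  # mean-shift alternative: large
print(t_null, t_alt)
```

Under the null the studentized statistic behaves like a standard normal draw, while under a fixed alternative it grows at the rate $n\gamma^2_{\nu}(P,P_0)$, which is why the test rejects.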
Proof of Theorem 8. Recall that
$$\hat{\gamma}^2_{\nu_n}(P,P_0) = \frac{1}{n(n-1)}\sum_{i\ne j}\bar{G}_{\nu_n}(X_i,X_j;P_0) = \gamma^2_{\nu_n}(P,P_0) + \frac{1}{n(n-1)}\sum_{i\ne j}\bar{G}_{\nu_n}(X_i,X_j;P)$$
$$\quad + \frac{2}{n}\sum_{i=1}^n\Big(\mathbb{E}_{X\sim P}\big[G_{\nu_n}(X_i,X)\,\big|\,X_i\big] - \mathbb{E}_{X\sim P_0}\big[G_{\nu_n}(X_i,X)\,\big|\,X_i\big] - \mathbb{E}_{X,X'\overset{i.i.d.}{\sim}P}G_{\nu_n}(X,X') + \mathbb{E}_{(X,Y)\sim P\otimes P_0}G_{\nu_n}(X,Y)\Big).$$
Denote the last two terms on the right-hand side by $V^{(1)}_{\nu_n}$ and $V^{(2)}_{\nu_n}$, respectively. It is clear that $\mathbb{E}V^{(1)}_{\nu_n} = \mathbb{E}V^{(2)}_{\nu_n} = 0$. It then suffices to show that
$$\sup_{\substack{p\in\mathcal{W}^{s,2}(M)\\ \|p-p_0\|\ge\Delta_n}}\frac{\mathbb{E}\big(V^{(1)}_{\nu_n}\big)^2 + \mathbb{E}\big(V^{(2)}_{\nu_n}\big)^2}{\gamma^4_{\nu_n}(P,P_0)} \to 0 \tag{6.23}$$
and
$$\inf_{\substack{p\in\mathcal{W}^{s,2}(M)\\ \|p-p_0\|\ge\Delta_n}}\frac{n\gamma^2_{\nu_n}(P,P_0)}{\sqrt{\mathbb{E}\big(\hat{\sigma}^2_{n,\nu_n}\big)}} \to \infty \tag{6.24}$$
as $n\to\infty$.
We first prove (6.23). Write $f = p - p_0$, and note that $\|p\|_{L^2} \le \|p\|_{\mathcal{W}^{s,2}} \le M$. Following arguments similar to those in the proof of Theorem 7, we get
$$\mathbb{E}\big(V^{(1)}_{\nu_n}\big)^2 \lesssim n^{-2}\mathbb{E}G^2_{\nu_n}(X_1,X_2) \lesssim_d M^2n^{-2}\nu_n^{-d/2},$$
and
$$\mathbb{E}\big(V^{(2)}_{\nu_n}\big)^2 \le \frac{4}{n}\mathbb{E}\Big[\mathbb{E}_{X\sim P}\big[G_{\nu_n}(X_i,X)\,\big|\,X_i\big] - \mathbb{E}_{X\sim P_0}\big[G_{\nu_n}(X_i,X)\,\big|\,X_i\big]\Big]^2$$
$$= \frac{4}{n}\int\bigg(\int G^2_{\nu_n}\big(x,(x'+x'')/2\big)p(x)\,dx\bigg)G_{\nu_n/2}(x',x'')f(x')f(x'')\,dx'dx'' \lesssim_d \frac{4M}{n\nu_n^{d/4}}\int G_{\nu_n/2}(x',x'')\big|f(x')\big|\,\big|f(x'')\big|\,dx'dx'' \lesssim_d \frac{4M}{n\nu_n^{3d/4}}\|f\|^2_{L^2}.$$
By Lemma 8, there exists a constant $C > 0$ depending on $s$ and $M$ only such that for $f\in\mathcal{W}^{s,2}(M)$,
$$\int\exp\Big(-\frac{\|\omega\|^2}{4\nu_n}\Big)\big\|\mathcal{F}f(\omega)\big\|^2\,d\omega \ge \frac{1}{4}\|f\|^2_{L^2}$$
given that $\nu_n \ge C\|f\|^{-2/s}_{L^2}$. Because $\nu_n\Delta_n^{2/s}\to\infty$, we obtain
$$\gamma^2_{\nu_n}(P,P_0) \gtrsim_d \nu_n^{-d/2}\|f\|^2_{L^2}$$
for sufficiently large $n$. Thus
$$\sup_{\substack{p\in\mathcal{W}^{s,2}(M)\\ \|p-p_0\|\ge\Delta_n}}\frac{\mathbb{E}\big(V^{(1)}_{\nu_n}\big)^2}{\gamma^4_{\nu_n}(P,P_0)} \lesssim_d M^2\big(n^2\nu_n^{-d/2}\Delta_n^4\big)^{-1} \to 0$$
and
$$\sup_{\substack{p\in\mathcal{W}^{s,2}(M)\\ \|p-p_0\|\ge\Delta_n}}\frac{\mathbb{E}\big(V^{(2)}_{\nu_n}\big)^2}{\gamma^4_{\nu_n}(P,P_0)} \lesssim_d M\big(n\nu_n^{-d/4}\Delta_n^2\big)^{-1} \to 0,$$
as $n\to\infty$. Next we prove (6.24). It follows from
$$\mathbb{E}\big(\hat{\sigma}^2_{n,\nu_n}\big) \le \mathbb{E}\max\big\{\big|s^2_{n,\nu_n}\big|,\ 1/n^2\big\} \lesssim \mathbb{E}G^2_{\nu_n}(X_1,X_2) + 1/n^2 \lesssim_d M^2\nu_n^{-d/2} + 1/n^2$$
that (6.24) holds.
Proof of Theorem 9. This result can, in a certain sense, be viewed as an extension of results from Ingster (1987), and the proof proceeds in a similar fashion. While Ingster (1987) considered the case where $p_0$ is the uniform distribution on $[0,1]$, we shall show that similar bounds hold for a wider class of $p_0$. For any $M > 0$ and $p_0$ such that $\|p_0\|_{\mathcal{W}^{s,2}} < M$, let
$$H_1^{\mathrm{GOF}}\big(\Delta_n; s, M - \|p_0\|_{\mathcal{W}^{s,2}}\big)^* := \big\{p\in\mathcal{W}^{s,2} : \|p-p_0\|_{\mathcal{W}^{s,2}} \le M - \|p_0\|_{\mathcal{W}^{s,2}},\ \|p-p_0\|_{L^2} \ge \Delta_n\big\}.$$
It is clear that $H_1^{\mathrm{GOF}}(\Delta_n; s) \supset H_1^{\mathrm{GOF}}(\Delta_n; s, M - \|p_0\|_{\mathcal{W}^{s,2}})^*$. Hence it suffices to prove Theorem 9 with $H_1^{\mathrm{GOF}}(\Delta_n; s)$ replaced by $H_1^{\mathrm{GOF}}(\Delta_n; s, M)^*$ for an arbitrary $M > 0$. We shall abbreviate $H_1^{\mathrm{GOF}}(\Delta_n; s, M)^*$ as $H_1^{\mathrm{GOF}}(\Delta_n; s)^*$ in the rest of the proof.
Since $p_0$ is almost surely continuous, there exist $x_0\in\mathbb{R}^d$ and $\delta, c > 0$ such that
$$p_0(x) \ge c > 0, \quad \forall\, \|x-x_0\|\le\delta.$$
In light of this, we shall assume $p_0(x) \ge c > 0$ for all $x\in[0,1]^d$ without loss of generality. Let $\boldsymbol{a}_n$ be a multivariate random index. As proved in Ingster (1987), in order to prove the existence of $\alpha\in(0,1)$ such that no asymptotically $\alpha$-level test can be consistent, it suffices to identify $p_{n,\boldsymbol{a}_n}\in H_1^{\mathrm{GOF}}(\Delta_n;s)^*$ for all possible values of $\boldsymbol{a}_n$ such that
$$\mathbb{E}_{p_0}\bigg(\frac{p_n(X_1,\ldots,X_n)}{\prod_{i=1}^np_0(X_i)}\bigg)^2 = O(1), \tag{6.25}$$
where
$$p_n(x_1,\ldots,x_n) = \mathbb{E}_{\boldsymbol{a}_n}\bigg(\prod_{i=1}^np_{n,\boldsymbol{a}_n}(x_i)\bigg), \quad \forall\, x_1,\ldots,x_n,$$
i.e., $p_n$ is the mixture of all the $p_{n,\boldsymbol{a}_n}$'s.
Let $1_{\{x\in[0,1]^d\}}, \phi_{n,1},\ldots,\phi_{n,B_n}$ be an orthonormal set of functions in $L^2(\mathbb{R}^d)$ such that the supports of $\phi_{n,1},\ldots,\phi_{n,B_n}$ are disjoint and all contained in $[0,1]^d$. Let $\boldsymbol{a}_n = (a_{n,1},\ldots,a_{n,B_n})$ be such that $a_{n,1},\ldots,a_{n,B_n}$ are independent with
$$P(a_{n,k} = 1) = P(a_{n,k} = -1) = \frac{1}{2}, \quad \forall\, 1\le k\le B_n.$$
Define
$$p_{n,\boldsymbol{a}_n} = p_0 + r_n\sum_{k=1}^{B_n}a_{n,k}\phi_{n,k}.$$
Then
$$\frac{p_{n,\boldsymbol{a}_n}}{p_0} = 1 + r_n\sum_{k=1}^{B_n}a_{n,k}\frac{\phi_{n,k}}{p_0},$$
where $1, \frac{\phi_{n,1}}{p_0},\ldots,\frac{\phi_{n,B_n}}{p_0}$ are orthogonal in $L^2(P_0)$.
By arguments similar to those in Ingster (1987), we find
$$\mathbb{E}_{p_0}\bigg(\frac{p_n(X_1,\ldots,X_n)}{\prod_{i=1}^np_0(X_i)}\bigg)^2 \le \exp\bigg(\frac{1}{2}B_nn^2r_n^4\max_{1\le k\le B_n}\Big(\int\phi^2_{n,k}/p_0\,dx\Big)^2\bigg) \le \exp\Big(\frac{1}{2c^2}B_nn^2r_n^4\Big).$$
In order to ensure (6.25), it suffices to have
$$B_n^{1/2}nr_n^2 = O(1). \tag{6.26}$$
Therefore, given $\Delta_n = O\big(n^{-\frac{2s}{4s+d}}\big)$, once we can find proper $r_n$, $B_n$ and $\phi_{n,1},\ldots,\phi_{n,B_n}$ such that $p_{n,\boldsymbol{a}_n}\in H_1^{\mathrm{GOF}}(\Delta_n;s)^*$ for all $\boldsymbol{a}_n$ and (6.26) holds, the proof is finished.
Let $b_n = B_n^{1/d}$, let $\phi$ be an infinitely differentiable function supported on $[0,1]^d$ that is orthogonal to $1_{\{x\in[0,1]^d\}}$ in $L^2$, and for each $x_{n,k}\in\{0,1,\ldots,b_n-1\}^d$, let
$$\phi_{n,k}(x) = \frac{b_n^{d/2}}{\|\phi\|_{L^2}}\phi(b_nx - x_{n,k}), \quad \forall\, x\in\mathbb{R}^d.$$
Then all the $\phi_{n,k}$'s are supported on $[0,1]^d$ and
$$\langle\phi_{n,k}, 1\rangle_{L^2} = \frac{b_n^{d/2}}{\|\phi\|_{L^2}}\int_{\mathbb{R}^d}\phi(b_nx - x_{n,k})\,dx = \frac{1}{b_n^{d/2}\|\phi\|_{L^2}}\int_{\mathbb{R}^d}\phi(x)\,dx = 0,$$
$$\|\phi_{n,k}\|^2_{L^2} = \frac{b_n^d}{\|\phi\|^2_{L^2}}\int_{[0,1/b_n]^d}\phi^2(b_nx)\,dx = 1,$$
$$\|\phi_{n,k}\|^2_{\mathcal{W}^{s,2}} \le b_n^{2s}\frac{\|\phi\|^2_{\mathcal{W}^{s,2}}}{\|\phi\|^2_{L^2}}.$$
Moreover, since for $k\ne k'$ the supports of $\phi_{n,k}$ and $\phi_{n,k'}$ are disjoint,
$$\|p_{n,\boldsymbol{a}_n} - p_0\|_\infty = r_nb_n^{d/2}\frac{\|\phi\|_\infty}{\|\phi\|_{L^2}}$$
and
$$\langle\phi_{n,k},\phi_{n,k'}\rangle_{L^2} = 0, \qquad \langle\phi_{n,k},\phi_{n,k'}\rangle_{\mathcal{W}^{s,2}} = 0,$$
from which we immediately obtain
$$\|p_{n,\boldsymbol{a}_n} - p_0\|^2_{L^2} = r_n^2b_n^d, \qquad \|p_{n,\boldsymbol{a}_n} - p_0\|^2_{\mathcal{W}^{s,2}} \le r_n^2b_n^{d+2s}\frac{\|\phi\|^2_{\mathcal{W}^{s,2}}}{\|\phi\|^2_{L^2}}.$$
To ensure $p_{n,\boldsymbol{a}_n}\in H_1^{\mathrm{GOF}}(\Delta_n;s)^*$, it suffices to make
$$r_nb_n^{d/2}\frac{\|\phi\|_\infty}{\|\phi\|_{L^2}} \to 0 \quad\text{as } n\to\infty, \tag{6.27}$$
$$r_n^2b_n^d = \Delta_n^2, \tag{6.28}$$
$$r_n^2b_n^{d+2s}\frac{\|\phi\|^2_{\mathcal{W}^{s,2}}}{\|\phi\|^2_{L^2}} \le M^2. \tag{6.29}$$
Let
\[
b_n = \left(\frac{M\|\phi\|_{L_2}}{\|\phi\|_{\mathcal{W}^{s,2}}}\right)^{1/s}\Delta_n^{-1/s},
\qquad
r_n = \frac{\Delta_n}{b_n^{d/2}}.
\]
Then (6.28) and (6.29) are satisfied. Moreover, given $\Delta_n = O\big(n^{-\frac{2s}{4s+d}}\big)$,
\[
B_n^{1/2}\,n\,r_n^2 = b_n^{-d/2}\,n\,\Delta_n^2 \lesssim_{d,\phi,M} n\,\Delta_n^{\frac{4s+d}{2s}} = O(1),
\]
and
\[
r_n b_n^{d/2}\,\frac{\|\phi\|_\infty}{\|\phi\|_{L_2}} \lesssim_\phi \Delta_n = o(1),
\]
ensuring both (6.26) and (6.27).
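The rate calculation above is pure exponent arithmetic. The following sketch (with illustrative values of $s$ and $d$ that are not taken from the thesis) verifies numerically that with $\Delta_n = n^{-2s/(4s+d)}$, $b_n = \Delta_n^{-1/s}$ (constant factors dropped) and $r_n = \Delta_n/b_n^{d/2}$, condition (6.28) holds by construction while the quantities in (6.26) and (6.27) stay bounded.

```python
# Sanity check of the exponent arithmetic in the lower-bound construction.
# With Delta_n = n^(-2s/(4s+d)), b_n = Delta_n^(-1/s) and r_n = Delta_n / b_n^(d/2),
# B_n^(1/2) * n * r_n^2 (condition (6.26)) equals n * Delta_n^((4s+d)/(2s)) = 1,
# and r_n * b_n^(d/2) (condition (6.27)) equals Delta_n -> 0.
# The values s = 2, d = 1 are illustrative only.

s, d = 2.0, 1.0

def construction(n):
    delta = n ** (-2 * s / (4 * s + d))
    b = delta ** (-1 / s)           # bandwidth parameter b_n (constants dropped)
    r = delta / b ** (d / 2)        # perturbation amplitude, so r^2 b^d = delta^2
    B = b ** d                      # number of bumps B_n = b_n^d
    return delta, B ** 0.5 * n * r ** 2, r * b ** (d / 2)

for n in (10 ** 3, 10 ** 6, 10 ** 9):
    delta, cond_626, cond_627 = construction(n)
    assert abs(cond_626 - 1.0) < 1e-8        # (6.26): exactly O(1)
    assert abs(cond_627 / delta - 1.0) < 1e-9  # (6.27): equals delta -> 0
```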
Finally, we show the existence of such $\phi$. Let
\[
\phi_0(x_1) =
\begin{cases}
\exp\left(-\dfrac{1}{1-(4x_1-1)^2}\right), & 0 < x_1 < \tfrac12,\\[4pt]
-\exp\left(-\dfrac{1}{1-(4x_1-3)^2}\right), & \tfrac12 < x_1 < 1,\\[4pt]
0, & \text{otherwise}.
\end{cases}
\]
Then $\phi_0$ is supported on $[0,1]$, infinitely differentiable and orthogonal to the indicator function of $[0,1]$. Let
\[
\phi(x) = \prod_{l=1}^{d}\phi_0(x_l), \qquad \forall\; x = (x_1,\cdots,x_d)\in\mathbb{R}^d.
\]
Then $\phi$ is supported on $[0,1]^d$, infinitely differentiable, and $\langle\phi,1\rangle_{L_2} = \langle\phi_0,1\rangle^d_{L_2[0,1]} = 0$.
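The bump $\phi_0$ is fully explicit and its defining properties can be checked numerically. The sketch below (our code, not part of the thesis) verifies that $\phi_0$ vanishes outside $(0,1)$ and integrates to zero over $[0,1]$, which is exactly the orthogonality to the indicator of $[0,1]$ used above; the integral vanishes because $\phi_0$ is antisymmetric about $x_1 = 1/2$.

```python
import math

def phi0(x):
    # Smooth bump from the proof: a positive bump on (0, 1/2) and its
    # negative mirror image on (1/2, 1); identically zero elsewhere.
    if 0 < x < 0.5:
        return math.exp(-1.0 / (1.0 - (4 * x - 1) ** 2))
    if 0.5 < x < 1:
        return -math.exp(-1.0 / (1.0 - (4 * x - 3) ** 2))
    return 0.0

# Midpoint rule on a fine grid: the grid midpoints pair up symmetrically
# about x = 1/2, so the numerical integral reflects the exact cancellation.
N = 200_000
h = 1.0 / N
integral = sum(phi0((i + 0.5) * h) for i in range(N)) * h
assert abs(integral) < 1e-10          # orthogonal to the indicator of [0, 1]
assert phi0(0.0) == 0.0 and phi0(1.0) == 0.0 and phi0(-0.1) == 0.0  # support
```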
Proof of Theorem 10. Let $N = m+n$ denote the total sample size. It suffices to prove the result under the assumption that $n/N \to r\in(0,1)$.

Note that under $H_0$,
\[
\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{Q}) = \frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}\bar{G}_{\nu_n}(X_i,X_j) + \frac{1}{m(m-1)}\sum_{1\le i\ne j\le m}\bar{G}_{\nu_n}(Y_i,Y_j) - \frac{2}{nm}\sum_{1\le i\le n}\sum_{1\le j\le m}\bar{G}_{\nu_n}(X_i,Y_j).
\]
Let $n/N = r_n$. Then we have
\[
\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{Q}) = N^{-2}\Bigg(\frac{1}{r_n(r_n-N^{-1})}\sum_{1\le i\ne j\le n}\bar{G}_{\nu_n}(X_i,X_j) + \frac{1}{(1-r_n)(1-r_n-N^{-1})}\sum_{1\le i\ne j\le m}\bar{G}_{\nu_n}(Y_i,Y_j) - \frac{2}{r_n(1-r_n)}\sum_{1\le i\le n}\sum_{1\le j\le m}\bar{G}_{\nu_n}(X_i,Y_j)\Bigg).
\]
Let
\[
\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{Q})' = N^{-2}\Bigg(\frac{1}{r^2}\sum_{1\le i\ne j\le n}\bar{G}_{\nu_n}(X_i,X_j) + \frac{1}{(1-r)^2}\sum_{1\le i\ne j\le m}\bar{G}_{\nu_n}(Y_i,Y_j) - \frac{2}{r(1-r)}\sum_{1\le i\le n}\sum_{1\le j\le m}\bar{G}_{\nu_n}(X_i,Y_j)\Bigg).
\]
As we assume $r_n\to r$ as $n\to\infty$, Theorem 7 ensures that
\[
\frac{nm}{\sqrt{2}(n+m)}\big[\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\big]^{-\frac12}\big(\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{Q}) - \hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{Q})'\big) = o_p(1).
\]
A slight adaptation of the arguments in Hall (1984) shows that
\[
\frac{\mathbb{E}\bar{G}^4_{\nu_n}(X_1,X_2)}{N^2\big[\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\big]^2} + \frac{\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\bar{G}^2_{\nu_n}(X_1,X_3)}{N\big[\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\big]^2} + \frac{\mathbb{E}H^2_{\nu_n}(X_1,X_2)}{\big[\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\big]^2} \to 0 \tag{6.30}
\]
ensures that
\[
\frac{nm}{\sqrt{2}(n+m)}\big[\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\big]^{-\frac12}\,\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{Q})' \to_d N(0,1).
\]
Following arguments similar to those in the proof of Theorem 7, given $\nu_n\to\infty$ and $\nu_n/n^{4/d}\to 0$, (6.30) holds and therefore
\[
\frac{nm}{\sqrt{2}(n+m)}\big[\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\big]^{-\frac12}\,\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{Q}) \to_d N(0,1).
\]
Additionally, based on the same arguments as in the proof of Theorem 7,
\[
\hat{s}^2_{n,m,\nu_n}\big/\mathbb{E}\big[\bar{G}_{\nu_n}(X_1,X_2)\big]^2 \to_p 1.
\]
The proof is therefore concluded.
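The two-sample quantity analyzed above is directly computable from Gram matrices. The following is a minimal sketch (our variable names; the plain unbiased Gaussian-kernel MMD estimator $\hat{\gamma}^2_\nu(\mathbb{P},\mathbb{Q})$, without the studentization treated in the theorem).

```python
import numpy as np

def gaussian_kernel(x, y, nu):
    # G_nu(x, y) = exp(-nu * ||x - y||^2), pairwise over rows of x and y.
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-nu * d2)

def mmd2_unbiased(x, y, nu):
    # Unbiased estimator of gamma_nu^2(P, Q): off-diagonal means of the
    # within-sample Gram matrices minus twice the cross-sample mean.
    n, m = len(x), len(y)
    kxx = gaussian_kernel(x, x, nu)
    kyy = gaussian_kernel(y, y, nu)
    kxy = gaussian_kernel(x, y, nu)
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_x + term_y - 2 * kxy.mean()

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(size=(500, 2)), rng.normal(size=(500, 2)), nu=1.0)
diff = mmd2_unbiased(rng.normal(size=(500, 2)),
                     rng.normal(2.0, 1.0, size=(500, 2)), nu=1.0)
# Under H0 the estimator is centered near zero; under a mean shift it is
# bounded away from zero.
assert abs(same) < 0.05 and diff > 0.05
```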
Proof of Theorem 11. With slight abuse of notation, we shall write
\[
\bar{G}_{\nu_n}(x,y;\mathbb{P},\mathbb{Q}) = G_{\nu_n}(x,y) - \mathbb{E}_{Y\sim\mathbb{Q}}G_{\nu_n}(x,Y) - \mathbb{E}_{X\sim\mathbb{P}}G_{\nu_n}(X,y) + \mathbb{E}_{(X,Y)\sim\mathbb{P}\otimes\mathbb{Q}}G_{\nu_n}(X,Y).
\]
We consider the two parts separately.

Part (i). We first verify the consistency of $\Phi^{\rm HOM}_{n,\nu_n,\alpha}$ with $\nu_n\asymp n^{4/(d+4s)}$ given $\Delta_n\gg n^{-2s/(d+4s)}$. Observe the following decomposition of $\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{Q})$:
\[
\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{Q}) = \gamma^2_{\nu_n}(\mathbb{P},\mathbb{Q}) + L^{(1)}_{n,\nu_n} + L^{(2)}_{n,\nu_n},
\]
where
\[
L^{(1)}_{n,\nu_n} = \frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}\bar{G}_{\nu_n}(X_i,X_j;\mathbb{P}) - \frac{2}{mn}\sum_{1\le i\le n}\sum_{1\le j\le m}\bar{G}_{\nu_n}(X_i,Y_j;\mathbb{P},\mathbb{Q}) + \frac{1}{m(m-1)}\sum_{1\le i\ne j\le m}\bar{G}_{\nu_n}(Y_i,Y_j;\mathbb{Q})
\]
and
\[
L^{(2)}_{n,\nu_n} = \frac{2}{n}\sum_{i=1}^n\big(\mathbb{E}[G_{\nu_n}(X_i,X)\,|\,X_i] - \mathbb{E}G_{\nu_n}(X,X') - \mathbb{E}[G_{\nu_n}(X_i,Y)\,|\,X_i] + \mathbb{E}G_{\nu_n}(X,Y)\big) + \frac{2}{m}\sum_{j=1}^m\big(\mathbb{E}[G_{\nu_n}(Y_j,Y)\,|\,Y_j] - \mathbb{E}G_{\nu_n}(Y,Y') - \mathbb{E}[G_{\nu_n}(X,Y_j)\,|\,Y_j] + \mathbb{E}G_{\nu_n}(X,Y)\big).
\]
In order to prove the consistency of $\Phi^{\rm HOM}_{n,\nu_n,\alpha}$, it suffices to show
\[
\sup_{\substack{p,q\in\mathcal{W}^{s,2}(M)\\ \|p-q\|_{L_2}\ge\Delta_n}}\frac{\mathbb{E}\big(L^{(1)}_{n,\nu_n}\big)^2 + \mathbb{E}\big(L^{(2)}_{n,\nu_n}\big)^2}{\gamma^4_{G_{\nu_n}}(\mathbb{P},\mathbb{Q})} \to 0, \tag{6.31}
\]
\[
\inf_{\substack{p,q\in\mathcal{W}^{s,2}(M)\\ \|p-q\|_{L_2}\ge\Delta_n}}\frac{\gamma^2_{G_{\nu_n}}(\mathbb{P},\mathbb{Q})}{(1/n+1/m)\sqrt{\mathbb{E}\big(\hat{s}^2_{n,m,\nu_n}\big)}} \to \infty, \tag{6.32}
\]
as $n\to\infty$. We now prove (6.31) and (6.32) with arguments similar to those in the proof of Theorem 8.

Note that
\[
\mathbb{E}\big(L^{(1)}_{n,\nu_n}\big)^2 \lesssim \mathbb{E}\Bigg(\frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}\bar{G}_{\nu_n}(X_i,X_j;\mathbb{P})\Bigg)^2 + \mathbb{E}\Bigg(\frac{2}{mn}\sum_{1\le i\le n}\sum_{1\le j\le m}\bar{G}_{\nu_n}(X_i,Y_j;\mathbb{P},\mathbb{Q})\Bigg)^2 + \mathbb{E}\Bigg(\frac{1}{m(m-1)}\sum_{1\le i\ne j\le m}\bar{G}_{\nu_n}(Y_i,Y_j;\mathbb{Q})\Bigg)^2
\lesssim \frac{1}{n^2}\mathbb{E}G^2_{\nu_n}(X_1,X_2) + \frac{1}{m^2}\mathbb{E}G^2_{\nu_n}(Y_1,Y_2).
\]
Given $p,q\in\mathcal{W}^{s,2}(M)$,
\[
\mathbb{E}G^2_{\nu_n}(X_1,X_2) \lesssim_d M^2\nu_n^{-d/2}, \qquad \mathbb{E}G^2_{\nu_n}(Y_1,Y_2) \lesssim_d M^2\nu_n^{-d/2}.
\]
Hence
\[
\mathbb{E}\big(L^{(1)}_{n,\nu_n}\big)^2 \lesssim_d M^2\nu_n^{-d/2}\left(\frac{1}{n^2}+\frac{1}{m^2}\right). \tag{6.33}
\]
Now consider bounding $L^{(2)}_{n,\nu_n}$. Let $f = p - q$. Then we have
\[
\mathbb{E}\big(L^{(2)}_{n,\nu_n}\big)^2 \lesssim_d \nu_n^{-\frac{3d}{4}}\,M\|f\|^2_{L_2}\left(\frac1n+\frac1m\right). \tag{6.34}
\]
Since $\nu_n\asymp n^{4/(4s+d)}\gg\Delta_n^{-2/s}$, Lemma 8 ensures that for sufficiently large $n$,
\[
\gamma^2_{G_{\nu_n}}(\mathbb{P},\mathbb{Q}) \gtrsim_d \nu_n^{-d/2}\|f\|^2_{L_2}, \qquad \forall\; p,q\in\mathcal{W}^{s,2}(M).
\]
This together with (6.33) and (6.34) gives
\[
\sup_{\substack{p,q\in\mathcal{W}^{s,2}(M)\\ \|p-q\|_{L_2}\ge\Delta_n}}\frac{\mathbb{E}\big(L^{(1)}_{n,\nu_n}\big)^2+\mathbb{E}\big(L^{(2)}_{n,\nu_n}\big)^2}{\gamma^4_{G_{\nu_n}}(\mathbb{P},\mathbb{Q})} \lesssim_d \frac{M^2\nu_n^{d/2}}{n^2\Delta_n^4} + \frac{M\nu_n^{d/4}}{n\Delta_n^2} \to 0
\]
as $n\to\infty$, which proves (6.31).

Finally, consider (6.32). It follows from
\[
\mathbb{E}\big(\hat{s}^2_{n,m,\nu_n}\big) \le \mathbb{E}\max\big\{\big|s^2_{n,m,\nu_n}\big|,\,1/n^2\big\} \lesssim \max\big\{\mathbb{E}G^2_{\nu_n}(X_1,X_2),\,\mathbb{E}G^2_{\nu_n}(Y_1,Y_2)\big\} + 1/n^2 \lesssim_d M^2\nu_n^{-d/2} + 1/n^2
\]
that (6.32) holds.
Part (ii). Next, we prove that if $\liminf_{n\to\infty}\Delta_n n^{2s/(d+4s)} < \infty$, then there exists some $\alpha\in(0,1)$ such that no asymptotic $\alpha$-level test can be consistent. To prove this, we shall verify that consistency of the homogeneity test is harder to achieve than that of the goodness-of-fit test.

Consider an arbitrary $p_0\in\mathcal{W}^{s,2}(M/2)$. It immediately follows that
\[
H^{\rm HOM}_1(\Delta_n;s) \supset \{(p,p_0) : p\in H^{\rm GOF}_1(\Delta_n;s)\}.
\]
Let $\{\Phi_n\}_{n\ge1}$ be any sequence of asymptotic $\alpha$-level homogeneity tests, where $\Phi_n = \Phi_n(X_1,\cdots,X_n,Y_1,\cdots,Y_m)$. Then if $Y_1,\cdots,Y_m\sim_{\rm iid}P_0$, $\{\Phi_n\}_{n\ge1}$ can also be treated as a sequence of (random) goodness-of-fit tests
\[
\tilde{\Phi}_n(X_1,\cdots,X_n) = \Phi_n(X_1,\cdots,X_n,Y_1,\cdots,Y_m)
\]
whose probabilities of type I error with respect to $P_0$ are controlled at $\alpha$ asymptotically. Moreover,
\[
{\rm power}\{\Phi_n;H^{\rm HOM}_1(\Delta_n;s)\} \le {\rm power}\{\tilde{\Phi}_n;H^{\rm GOF}_1(\Delta_n;s)\}.
\]
Since $0 < c \le m/n \le C < \infty$, Theorem 9 ensures that there exists some $\alpha\in(0,1)$ such that for any sequence of asymptotic $\alpha$-level tests $\{\Phi_n\}_{n\ge1}$,
\[
\liminf_{n\to\infty}{\rm power}\{\Phi_n;H^{\rm HOM}_1(\Delta_n;s)\} \le \liminf_{n\to\infty}{\rm power}\{\tilde{\Phi}_n;H^{\rm GOF}_1(\Delta_n;s)\} < 1
\]
given $\liminf_{n\to\infty}\Delta_n n^{2s/(d+4s)} < \infty$.
Proof of Theorem 12. For brevity, we shall focus on the case $k = 2$ in the rest of the proof. Our argument, however, can be straightforwardly extended to the more general cases. The proof relies on the following decomposition of $\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2})$ under $H^{\rm IND}_0$:
\[
\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}) = \frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}G^\ast_{\nu_n}(X_i,X_j) + R_n,
\]
where
\[
G^\ast_{\nu_n}(x,y) = \bar{G}_{\nu_n}(x,y) - \sum_{1\le j\le 2}g_j(x^j,y) - \sum_{1\le j\le 2}g_j(y^j,x) + \sum_{1\le j_1,j_2\le 2}g_{j_1,j_2}(x^{j_1},y^{j_2})
\]
and the remainder $R_n$ satisfies
\[
\mathbb{E}(R_n)^2 \lesssim \mathbb{E}G^2_{\nu_n}(X_1,X_2)/n^3 \lesssim_d \|p\|^2_{L_2}\,\nu_n^{-d/2}/n^3.
\]
See Appendix B.4 for more details.
Moreover, borrowing arguments in the proof of Lemma 1, we obtain
\[
\mathbb{E}\big(G^\ast_{\nu_n}(X_1,X_2) - \bar{G}_{\nu_n}(X_1,X_2)\big)^2 \lesssim \sum_{1\le j\le 2}\mathbb{E}\big(g_j(X^j_1,X_2)\big)^2 + \sum_{1\le j_1,j_2\le 2}\mathbb{E}\big(g_{j_1,j_2}(X^{j_1}_1,X^{j_2}_2)\big)^2
\]
\[
\le \sum_{1\le j_1\ne j_2\le 2}\mathbb{E}G^2_{\nu_n}(X^{j_1}_1,X^{j_1}_2)\cdot\mathbb{E}\big\{\mathbb{E}\big[G_{\nu_n}(X^{j_2}_1,X^{j_2}_2)\,\big|\,X^{j_2}_1\big]\big\}^2 + \sum_{1\le j_1\ne j_2\le 2}\mathbb{E}G^2_{\nu_n}(X^{j_1}_1,X^{j_1}_2)\big[\mathbb{E}G_{\nu_n}(X^{j_2}_1,X^{j_2}_2)\big]^2 + 2\,\mathbb{E}\big\{\mathbb{E}\big[G_{\nu_n}(X^1_1,X^1_2)\,\big|\,X^1_1\big]\big\}^2\,\mathbb{E}\big\{\mathbb{E}\big[G_{\nu_n}(X^2_1,X^2_2)\,\big|\,X^2_1\big]\big\}^2
\]
\[
\lesssim_d \nu_n^{-d_1/2-3d_2/4}\|p_1\|^2_{L_2}\|p_2\|^3_{L_2} + \nu_n^{-3d_1/4-d_2/2}\|p_1\|^3_{L_2}\|p_2\|^2_{L_2}.
\]
Together with the fact that
\[
(2\nu_n/\pi)^{d/2}\,\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2) \to \|p\|^2_{L_2}
\]
as $\nu_n\to\infty$, we conclude that
\[
\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}) = D(\nu_n) + o_p\Big(\sqrt{\mathbb{E}D^2(\nu_n)}\Big),
\]
where
\[
D(\nu_n) = \frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}\bar{G}_{\nu_n}(X_i,X_j).
\]
Applying arguments similar to those in the proofs of Theorems 7 and 10, we have
\[
\frac{D(\nu_n)}{\sqrt{\mathbb{E}D^2(\nu_n)}} \to_d N(0,1).
\]
Since
\[
\mathbb{E}D^2(\nu_n) = \frac{2}{n(n-1)}\mathbb{E}\big[\bar{G}_{\nu_n}(X_1,X_2)\big]^2 \quad\text{and}\quad \mathbb{E}\big[\bar{G}_{\nu_n}(X_1,X_2)\big]^2\big/\mathbb{E}\big[G^\ast_{\nu_n}(X_1,X_2)\big]^2 \to 1,
\]
it remains to prove
\[
\hat{s}^2_{n,\nu_n}\big/\mathbb{E}\big[G^\ast_{\nu_n}(X_1,X_2)\big]^2 \to_p 1,
\]
which immediately follows by observing
\[
s^2_{n,\nu_n}\big/\mathbb{E}\big[G^\ast_{\nu_n}(X_1,X_2)\big]^2 = \prod_{j=1}^2 s^2_{n,j,\nu_n}\big/\mathbb{E}\big[\bar{G}_{\nu_n}(X^j_1,X^j_2)\big]^2 \to_p 1
\]
and $1/n^2 = o\big(\mathbb{E}[G^\ast_{\nu_n}(X_1,X_2)]^2\big)$. The proof is therefore concluded.
Proof of Theorem 13. We prove the two parts separately.

Part (i). The proof of consistency of $\Phi^{\rm IND}_{n,\nu_n,\alpha}$ is very similar to its counterpart in the proof of Theorem 11. It suffices to show
\[
\sup_{p\in H^{\rm IND}_1(\Delta_n,s)}\frac{{\rm var}\big(\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2})\big)}{\gamma^4_{\nu_n}(\mathbb{P},\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2})} \to 0, \tag{6.35}
\]
\[
\inf_{p\in H^{\rm IND}_1(\Delta_n,s)}\frac{n\,\gamma^2_{\nu_n}(\mathbb{P},\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2})}{\mathbb{E}\big(\hat{s}_{n,\nu_n}\big)} \to \infty, \tag{6.36}
\]
as $n\to\infty$.

We begin with (6.35). Let $f = p - p_1\otimes p_2$. Lemma 8 then implies that there exists $C = C(s,M) > 0$ such that
\[
\gamma^2_{\nu}(\mathbb{P},\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}) \asymp_d \nu^{-d/2}\|f\|^2_{L_2}
\]
for $\nu \ge C\|f\|^{-2/s}_{L_2}$, which is satisfied by all $p\in H^{\rm IND}_1(\Delta_n,s)$ given $\nu = \nu_n$ and $\lim_{n\to\infty}\Delta_n n^{\frac{2s}{4s+d}} = \infty$. On the other hand, we can still decompose $\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2})$ as in Appendix B.4. We follow the same notations here.
Under the alternative hypothesis, the "first order" term
\[
D_1(\nu_n) = \frac2n\sum_{1\le i\le n}\big(\mathbb{E}_{X_i,X\sim_{\rm iid}\mathbb{P}}[G_{\nu_n}(X_i,X)\,|\,X_i] - \mathbb{E}_{X,X'\sim_{\rm iid}\mathbb{P}}G_{\nu_n}(X,X')\big)
- \frac2n\sum_{1\le i\le n}\big(\mathbb{E}_{X_i\sim\mathbb{P},\,Y\sim\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}}[G_{\nu_n}(X_i,Y)\,|\,X_i] - \mathbb{E}_{X\sim\mathbb{P},\,Y\sim\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}}G_{\nu_n}(X,Y)\big)
\]
\[
- \sum_{1\le j\le 2}\Bigg(\frac2n\sum_{1\le i\le n}\big(\mathbb{E}_{X_i\sim\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2},\,X\sim\mathbb{P}}[G_{\nu_n}(X_i,X)\,|\,X^j_i] - \mathbb{E}_{X\sim\mathbb{P},\,Y\sim\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}}G_{\nu_n}(X,Y)\big)\Bigg)
+ \sum_{1\le j\le 2}\Bigg(\frac2n\sum_{1\le i\le n}\big(\mathbb{E}_{X_i,Y\sim_{\rm iid}\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}}[G_{\nu_n}(X_i,Y)\,|\,X^j_i] - \mathbb{E}_{Y,Y'\sim_{\rm iid}\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}}G_{\nu_n}(Y,Y')\big)\Bigg)
\]
no longer vanishes, but based on arguments similar to those in the proof of Theorem 8,
\[
\mathbb{E}D_1^2(\nu_n) \lesssim_d M n^{-1}\nu_n^{-3d/4}\|f\|^2_{L_2}.
\]
Moreover, the "second order" term $D_2(\nu_n)$ is no longer simply $\sum_{1\le i\ne j\le n}G^\ast_{\nu_n}(X_i,X_j)/(n(n-1))$, but we still have
\[
\mathbb{E}D_2^2(\nu_n) \lesssim n^{-2}\max\big\{\mathbb{E}G^2_{\nu_n}(X_1,X_2),\ \mathbb{E}G^2_{\nu_n}(X^1_1,X^1_2)\,\mathbb{E}G^2_{\nu_n}(X^2_1,X^2_2)\big\} \lesssim_d M^2 n^{-2}\nu_n^{-d/2}.
\]
Similarly, define the third order term $D_3(\nu_n)$ and the fourth order term $D_4(\nu_n)$ as the aggregation of all 3-variate centered components and the aggregation of all 4-variate centered components in $\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2})$ respectively, which together constitute $R_n$. Then we have
\[
\mathbb{E}D_3^2(\nu_n) \lesssim_d M^2 n^{-3}\nu_n^{-d/2}, \qquad \mathbb{E}D_4^2(\nu_n) \lesssim_d M^2 n^{-4}\nu_n^{-d/2}.
\]
Hence we finally obtain
\[
\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}) = \gamma^2_{\nu_n}(\mathbb{P},\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}) + \sum_{l=1}^4 D_l(\nu_n)
\]
and
\[
{\rm var}\big(\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2})\big) = \sum_{l=1}^4\mathbb{E}D_l^2(\nu_n) \lesssim_d M n^{-1}\nu_n^{-3d/4}\|f\|^2_{L_2} + M^2 n^{-2}\nu_n^{-d/2},
\]
which proves (6.35).

Now consider (6.36). Since
\[
\hat{s}_{n,\nu_n} \le \max\Bigg\{\prod_{j=1}^2\sqrt{\big|s^2_{n,j,\nu_n}\big|},\ 1/n\Bigg\},
\]
we have
\[
\mathbb{E}\big(\hat{s}_{n,\nu_n}\big) \le \prod_{j=1}^2\sqrt{\mathbb{E}\big|s^2_{n,j,\nu_n}\big|} + 1/n,
\]
where
\[
\prod_{j=1}^2\mathbb{E}\big|s^2_{n,j,\nu_n}\big| \lesssim \prod_{j=1}^2\mathbb{E}G^2_{\nu_n}(X^j_1,X^j_2) = \mathbb{E}_{Y_1,Y_2\sim_{\rm iid}\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}}G^2_{\nu_n}(Y_1,Y_2) \lesssim_d M^2\nu_n^{-d/2}.
\]
Therefore (6.36) holds.
Part (ii). We now verify that $n^{2s/(d+4s)}\Delta_n\to\infty$ is also necessary for the existence of consistent asymptotic $\alpha$-level tests for any $\alpha\in(0,1)$. Similarly to the proof of Theorem 11, the idea is to relate the existence of a consistent independence test to the existence of a consistent goodness-of-fit test.

Let $p_{j,0}\in\mathcal{W}^{s,2}(M_j/\sqrt{2})$ be a density on $\mathbb{R}^{d_j}$ for $j=1,2$ and let $p_0$ be the product of $p_{1,0}$ and $p_{2,0}$, i.e.,
\[
p_0(x^1,x^2) = p_{1,0}(x^1)\,p_{2,0}(x^2), \qquad \forall\; x^1\in\mathbb{R}^{d_1},\ x^2\in\mathbb{R}^{d_2}.
\]
Hence $p_0\in\mathcal{W}^{s,2}(M/2)$. Let
\[
H^{\rm GOF}_1(\Delta_n;s)' := \{p : p\in\mathcal{W}^{s,2}(M),\ p_1 = p_{1,0},\ p_2 = p_{2,0},\ \|p-p_0\|_{L_2}\ge\Delta_n\}.
\]
We immediately have
\[
H^{\rm IND}_1(\Delta_n;s) \supset H^{\rm GOF}_1(\Delta_n;s)'.
\]
Let $\{\Phi_n\}_{n\ge1}$ be any sequence of asymptotic $\alpha$-level independence tests, where $\Phi_n = \Phi_n(X_1,\cdots,X_n)$. Then $\{\Phi_n\}_{n\ge1}$ can also be treated as a sequence of asymptotic $\alpha$-level goodness-of-fit tests with the null density being $p_0$. Moreover,
\[
{\rm power}\{\Phi_n;H^{\rm IND}_1(\Delta_n;s)\} \le {\rm power}\{\Phi_n;H^{\rm GOF}_1(\Delta_n;s)'\}.
\]
It remains to show that given $\liminf_{n\to\infty}n^{2s/(d+4s)}\Delta_n < \infty$, there exists some $\alpha\in(0,1)$ such that
\[
\liminf_{n\to\infty}{\rm power}\{\Phi_n;H^{\rm GOF}_1(\Delta_n;s)'\} < 1,
\]
which cannot be directly obtained from Theorem 9 because of the additional constraints
\[
p_1 = p_{1,0}, \qquad p_2 = p_{2,0} \tag{6.37}
\]
in $H^{\rm GOF}_1(\Delta_n;s)'$.
However, by modifying the proof of Theorem 9, we only need to further require each $p_{n,\boldsymbol{a}_n}$ in the proof of Theorem 9 to satisfy (6.37), or equivalently,
\[
\int_{\mathbb{R}^{d_2}}(p-p_0)(x^1,x^2)\,dx^2 = 0, \qquad \int_{\mathbb{R}^{d_1}}(p-p_0)(x^1,x^2)\,dx^1 = 0.
\]
Recall that each $p_{n,\boldsymbol{a}_n} = p_0 + r_n\sum_{k=1}^{B_n}a_{n,k}\phi_{n,k}$, where
\[
\phi_{n,k}(x) = \frac{b_n^{d/2}}{\|\phi\|_{L_2}}\,\phi(b_n x - x_{n,k}).
\]
Write $x_{n,k} = (x^1_{n,k},x^2_{n,k})\in\mathbb{R}^{d_1}\times\mathbb{R}^{d_2}$. Since $\phi$ can be decomposed as
\[
\phi(x^1,x^2) = \phi_1(x^1)\,\phi_2(x^2),
\]
we have
\[
\phi_{n,k}(x) = \frac{b_n^{d/2}}{\|\phi\|_{L_2}}\,\phi_1(b_n x^1 - x^1_{n,k})\,\phi_2(b_n x^2 - x^2_{n,k}).
\]
Hence
\[
\int_{\mathbb{R}^{d_2}}(p_{n,\boldsymbol{a}_n}-p_0)(x^1,x^2)\,dx^2 = r_n\sum_{k=1}^{B_n}a_{n,k}\int_{\mathbb{R}^{d_2}}\phi_{n,k}(x^1,x^2)\,dx^2
= r_n\sum_{k=1}^{B_n}a_{n,k}\,\frac{b_n^{d/2}}{\|\phi\|_{L_2}}\cdot\phi_1(b_n x^1 - x^1_{n,k})\cdot\frac{1}{b_n^{d_2}}\int_{\mathbb{R}^{d_2}}\phi_2(x^2)\,dx^2 = 0,
\]
since $\int_{\mathbb{R}^{d_2}}\phi_2(x^2)\,dx^2 = 0$. Similarly, $\int_{\mathbb{R}^{d_1}}(p_{n,\boldsymbol{a}_n}-p_0)(x^1,x^2)\,dx^1 = 0$. The proof is therefore finished.
Proof of Theorem 14. The proof of Theorem 14 consists of two steps. First, we bound $q^{\rm GOF}_{n,\alpha}$. To be more specific, we show that there exists $C = C(d) > 0$ such that
\[
q^{\rm GOF}_{n,\alpha} \le C(d)\log\log n
\]
for sufficiently large $n$, which holds if
\[
\lim_{n\to\infty}P\big(T^{\rm GOF(adapt)}_n \ge C(d)\log\log n\big) = 0 \tag{6.38}
\]
under $H^{\rm GOF}_0$. Second, we show that there exists $c > 0$ such that
\[
\liminf_{n\to\infty}\Delta_{n,s}\,(n/\log\log n)^{2s/(d+4s)} > c
\]
ensures
\[
\inf_{p\in H^{\rm GOF(adapt)}_1(\Delta_{n,s}:s\ge d/4)}P\big(T^{\rm GOF(adapt)}_n \ge C(d)\log\log n\big) \to 1 \tag{6.39}
\]
as $n\to\infty$.

Verifying (6.38). In order to prove (6.38), we first establish the following two lemmas. The first lemma states that $\hat{s}^2_{n,\nu_n}$ is a consistent estimator of $\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)$ uniformly over all $\nu_n\in[1,n^{2/d}]$. Recall that we have shown in the proof of Theorem 7 that for $\nu_n$ increasing at a proper rate,
\[
\hat{s}^2_{n,\nu_n}\big/\mathbb{E}\big[\bar{G}_{\nu_n}(X_1,X_2)\big]^2 \to_p 1.
\]
Hence the first lemma is a uniform version of this result.

Lemma 5. We have that $\hat{s}^2_{n,\nu_n}/\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2$ converges to 1 uniformly over $\nu_n\in[1,n^{2/d}]$, i.e.,
\[
\sup_{1\le\nu_n\le n^{2/d}}\big|\hat{s}^2_{n,\nu_n}\big/\mathbb{E}\big[\bar{G}_{\nu_n}(X_1,X_2)\big]^2 - 1\big| = o_p(1).
\]
We defer the proof of Lemma 5 to the appendix. Note that
\[
T^{\rm GOF(adapt)}_n = \sup_{1\le\nu_n\le n^{2/d}}\frac{n\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{P}_0)}{\sqrt{2\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2}}\cdot\sqrt{\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2\big/\hat{s}^2_{n,\nu_n}}
\le \sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{n\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{P}_0)}{\sqrt{2\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2}}\Bigg|\cdot\sup_{1\le\nu_n\le n^{2/d}}\sqrt{\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2\big/\hat{s}^2_{n,\nu_n}}.
\]
Lemma 5 first ensures that
\[
\sup_{1\le\nu_n\le n^{2/d}}\sqrt{\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2\big/\hat{s}^2_{n,\nu_n}} = 1 + o_p(1).
\]
It therefore suffices to show that under $H^{\rm GOF}_0$,
\[
\tilde{T}^{\rm GOF(adapt)}_n := \sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{n\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{P}_0)}{\sqrt{2\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2}}\Bigg|
\]
is also of order $\log\log n$. This is the crux of our argument, yet its proof is lengthy. For brevity, we shall state it as a lemma here and defer its proof to the appendix.

Lemma 6. There exists $C = C(d) > 0$ such that
\[
\lim_{n\to\infty}P\big(\tilde{T}^{\rm GOF(adapt)}_n \ge C\log\log n\big) = 0
\]
under $H^{\rm GOF}_0$.
Verifying (6.39). Let
\[
\nu_n(s)' = \left(\frac{\log\log n}{n}\right)^{-4/(4s+d)},
\]
which is smaller than $n^{2/d}$ for $s\ge d/4$. Hence it suffices to show
\[
\inf_{s\ge d/4}\inf_{p\in H^{\rm GOF}_1(\Delta_{n,s};s)}P\big(T^{\rm GOF}_{n,\nu_n(s)'} \ge C(d)\log\log n\big) \to 1
\]
as $n\to\infty$.

First of all, observe
\[
0 \le \mathbb{E}\big(s^2_{n,\nu_n(s)'}\big) \le \mathbb{E}G^2_{\nu_n(s)'}(X_1,X_2) \le M^2\big(2\nu_n(s)'/\pi\big)^{-d/2}
\]
and
\[
{\rm var}\big(s^2_{n,\nu_n(s)'}\big) \lesssim_d M^3 n^{-1}\big(\nu_n(s)'\big)^{-3d/4} + M^2 n^{-2}\big(\nu_n(s)'\big)^{-d/2}
\]
for any $s$ and $p\in H^{\rm GOF}_1(\Delta_{n,s},s)$. Further considering that $1/n^2 = o\big(M^2(2\nu_n(s)'/\pi)^{-d/2}\big)$ uniformly over all $s$, we obtain that
\[
\inf_{s\ge d/4}\inf_{p\in H^{\rm GOF}_1(\Delta_{n,s};s)}P\big(\hat{s}^2_{n,\nu_n(s)'} \le 2M^2(2\nu_n(s)'/\pi)^{-d/2}\big) \to 1.
\]
Let
\[
\Delta_{n,s} \ge c\big(\sqrt{M}+M\big)(\log\log n/n)^{2s/(d+4s)}
\]
for some sufficiently large $c = c(d)$. Then
\[
\mathbb{E}\hat{\gamma}^2_{\nu_n(s)'}(\mathbb{P},\mathbb{P}_0) = \gamma^2_{\nu_n(s)'}(\mathbb{P},\mathbb{P}_0) \ge \left(\frac{\pi}{\nu_n(s)'}\right)^{d/2}\cdot\frac{\|p-p_0\|^2_{L_2}}{4},
\]
as guaranteed by Lemma 8. Further considering that
\[
{\rm var}\big(\hat{\gamma}^2_{\nu_n(s)'}(\mathbb{P},\mathbb{P}_0)\big) \lesssim_d M^2 n^{-2}\big(\nu_n(s)'\big)^{-d/2} + M n^{-1}\big(\nu_n(s)'\big)^{-3d/4}\|p-p_0\|^2_{L_2},
\]
we immediately have
\[
\lim_{n\to\infty}\inf_{s\ge d/4}\inf_{p\in H^{\rm GOF}_1(\Delta_{n,s};s)}P\big(T^{\rm GOF}_{n,\nu_n(s)'} \ge C(d)\log\log n\big)
\ge \lim_{n\to\infty}\inf_{s\ge d/4}\inf_{p\in H^{\rm GOF}_1(\Delta_{n,s};s)}P\Bigg(\frac{n\gamma^2_{\nu_n(s)'}(\mathbb{P},\mathbb{P}_0)/2}{\sqrt{2\hat{s}^2_{n,\nu_n(s)'}}} \ge C(d)\log\log n\Bigg) = 1.
\]
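The adaptive statistic takes a supremum over a range of bandwidths. The sketch below (all names ours, with $d = 1$, $P_0 = N(0,1)$, a geometric grid over $[1, n^{2/d}]$, and the empirical second moment of $\bar{G}$ standing in for its population counterpart) illustrates the construction of such a sup-type goodness-of-fit statistic; for $P_0 = N(0,1)$ the centering terms of $\bar{G}$ are available in closed form.

```python
import numpy as np

def gbar(x, nu):
    # Gbar_nu(x_i, x_j) under P0 = N(0, 1):
    # G - E_Y G(x_i, Y) - E_X G(X, x_j) + E G, with the expectations in
    # closed form for the Gaussian kernel and Gaussian null.
    g = np.exp(-nu * (x[:, None] - x[None, :]) ** 2)
    mean_one = (1 + 2 * nu) ** -0.5 * np.exp(-nu * x ** 2 / (1 + 2 * nu))
    mean_all = (1 + 4 * nu) ** -0.5
    return g - mean_one[:, None] - mean_one[None, :] + mean_all

def adaptive_stat(x):
    # sup over a geometric bandwidth grid in [1, n^(2/d)] (here d = 1) of
    # |n * hat{gamma}^2_nu| / sqrt(2 * empirical second moment of Gbar).
    n, d = len(x), 1
    stats = []
    nu, ratio = 1.0, 1.5
    while nu <= n ** (2 / d):
        k = gbar(x, nu)
        off = k - np.diag(np.diag(k))
        u = off.sum() / (n - 1)                  # n * hat{gamma}^2_nu(P, P0)
        s2 = (off ** 2).sum() / (n * (n - 1))    # empirical E Gbar^2
        stats.append(abs(u) / np.sqrt(2 * s2))
        nu *= ratio
    return max(stats)

rng = np.random.default_rng(1)
t_null = adaptive_stat(rng.normal(size=400))           # data from N(0, 1)
t_alt = adaptive_stat(rng.normal(1.0, 1.0, size=400))  # mean-shifted alternative
assert t_alt > t_null
```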
Proof of Theorems 15 and 16. The proofs of Theorems 15 and 16 are very similar to that of Theorem 14. Hence we only emphasize the main differences here.

For the adaptive homogeneity test: to verify that there exists $C = C(d) > 0$ such that
\[
\lim_{n\to\infty}P\big(T^{\rm HOM(adapt)}_n \ge C\log\log n\big) = 0
\]
under $H^{\rm HOM}_0$, observe that
\[
T^{\rm HOM(adapt)}_n \le \sup_{1\le\nu_n\le n^{2/d}}\sqrt{\frac{\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2}{\hat{s}^2_{n,m,\nu_n}}}\cdot\left(\frac1n+\frac1m\right)^{-1}\sup_{1\le\nu_n\le n^{2/d}}\frac{|\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{Q})|}{\sqrt{2\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2}}.
\]
Denote $X_1,\cdots,X_n,Y_1,\cdots,Y_m$ by $Z_1,\cdots,Z_N$. Hence
\[
2\sum_{i=1}^n\sum_{j=1}^m G_{\nu_n}(X_i,Y_j) = \sum_{1\le i\ne j\le N}G_{\nu_n}(Z_i,Z_j) - \sum_{1\le i\ne j\le n}G_{\nu_n}(X_i,X_j) - \sum_{1\le i\ne j\le m}G_{\nu_n}(Y_i,Y_j)
\]
and
\[
\sup_{1\le\nu_n\le n^{2/d}}\frac{|\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{Q})|}{\sqrt{2\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2}}
\le \left(\frac{1}{n(n-1)}+\frac{1}{mn}\right)\sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{\sum_{1\le i\ne j\le n}\bar{G}_{\nu_n}(X_i,X_j)}{\sqrt{2\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2}}\Bigg|
+ \left(\frac{1}{m(m-1)}+\frac{1}{mn}\right)\sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{\sum_{1\le i\ne j\le m}\bar{G}_{\nu_n}(Y_i,Y_j)}{\sqrt{2\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2}}\Bigg|
+ \frac{1}{mn}\sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{\sum_{1\le i\ne j\le N}\bar{G}_{\nu_n}(Z_i,Z_j)}{\sqrt{2\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2}}\Bigg|.
\]
Applying Lemma 6 to bound each term on the right hand side of the above inequality, we conclude that for some $C = C(d) > 0$,
\[
\lim_{n\to\infty}P\Bigg(\left(\frac1n+\frac1m\right)^{-1}\sup_{1\le\nu_n\le n^{2/d}}\frac{|\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{Q})|}{\sqrt{2\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2}} \ge C\log\log n\Bigg) = 0.
\]
For the adaptive independence test: to verify that there exists $C = C(d) > 0$ such that
\[
\lim_{n\to\infty}P\big(T^{\rm IND(adapt)}_n \ge C\log\log n\big) = 0 \tag{6.40}
\]
under $H^{\rm IND}_0$, recall the decomposition
\[
\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}) = D_2(\nu_n) + R_n = \frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}G^\ast_{\nu_n}(X_i,X_j) + R_n,
\]
where we express $R_n$ as $R_n = D_3(\nu_n) + D_4(\nu_n)$ in the proof of Theorem 13.

Following arguments similar to those in the proof of Lemma 6, we obtain that there exists $C(d) > 0$ such that for sufficiently large $n$,
\[
P\Bigg(\sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{nD_2(\nu_n)}{\sqrt{2\mathbb{E}[G^\ast_{\nu_n}(X_1,X_2)]^2}}\Bigg| \ge C(d)(\log\log n + t\log\log\log n)\Bigg) \lesssim \exp(-t^{2/3}).
\]
Similarly,
\[
P\Bigg(\sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{n^{3/2}D_3(\nu_n)}{\sqrt{2\mathbb{E}[G^\ast_{\nu_n}(X_1,X_2)]^2}}\Bigg| \ge C(d)(\log\log n + t\log\log\log n)\Bigg) \lesssim \exp(-t^{1/2}),
\]
\[
P\Bigg(\sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{n^{2}D_4(\nu_n)}{\sqrt{2\mathbb{E}[G^\ast_{\nu_n}(X_1,X_2)]^2}}\Bigg| \ge C(d)(\log\log n + t\log\log\log n)\Bigg) \lesssim \exp(-t^{2/5})
\]
for sufficiently large $n$.

On the other hand, note that
\[
\mathbb{E}[G^\ast_{\nu_n}(X_1,X_2)]^2 = \prod_{j=1}^2\mathbb{E}[\bar{G}_{\nu_n}(X^j_1,X^j_2)]^2,
\]
and based on results in the proof of Lemma 5, $\sup_{1\le\nu_n\le n^{2/d}}\big|s^2_{n,j,\nu_n}/\mathbb{E}[\bar{G}_{\nu_n}(X^j_1,X^j_2)]^2 - 1\big| = o_p(1)$ for $j=1,2$. Further considering that
\[
1/n^2 = o\big(\mathbb{E}[G^\ast_{\nu_n}(X_1,X_2)]^2\big)
\]
uniformly over all $\nu_n\in[1,n^{2/d}]$, we obtain
\[
\sup_{1\le\nu_n\le n^{2/d}}\big|\hat{s}^2_{n,\nu_n}\big/\mathbb{E}[G^\ast_{\nu_n}(X_1,X_2)]^2 - 1\big| = o_p(1).
\]
Combined, these ensure that (6.40) holds.

To show that the detection boundary of $\Phi^{\rm IND(adapt)}$ is of order $O\big((n/\log\log n)^{-2s/(d+4s)}\big)$, observe that
\[
0 \le \mathbb{E}\big(s^2_{n,j,\nu_n(s)'}\big) \le \mathbb{E}G^2_{\nu_n(s)'}(X^j_1,X^j_2) \le M_j^2\big(2\nu_n(s)'/\pi\big)^{-d_j/2}
\]
and
\[
{\rm var}\big(s^2_{n,j,\nu_n(s)'}\big) \lesssim_{d_j} M_j^3 n^{-1}\big(\nu_n(s)'\big)^{-3d_j/4} + M_j^2 n^{-2}\big(\nu_n(s)'\big)^{-d_j/2}
\]
for $j=1,2$, where $\nu_n(s)' = (\log\log n/n)^{-4/(4s+d)}$ as in the proof of Theorem 14. Therefore,
\[
\inf_{s\ge d/4}\inf_{p\in H^{\rm IND}_1(\Delta_{n,s};s)}P\Big(\big|s^2_{n,j,\nu_n(s)'}\big| \le \sqrt{3/2}\,M_j^2\big(2\nu_n(s)'/\pi\big)^{-d_j/2}\Big) \to 1, \qquad j=1,2.
\]
Further considering $1/n^2 = o\big(M^2(2\nu_n(s)'/\pi)^{-d/2}\big)$ uniformly over all $s$, we obtain that
\[
\inf_{s\ge d/4}\inf_{p\in H^{\rm IND}_1(\Delta_{n,s};s)}P\big(\hat{s}^2_{n,\nu_n(s)'} \le 2M^2(2\nu_n(s)'/\pi)^{-d/2}\big) \to 1.
\]
References
L. Addario-Berry, N. Broutin, L. Devroye, and G. Lugosi (2010). "On combinatorial testing problems". In: The Annals of Statistics 38.5, pp. 3063–3092.

N. Ailon, M. Charikar, and A. Newman (2008). "Aggregating inconsistent information: ranking and clustering". In: Journal of the ACM 55.5, 23:1–23:27.

M. A. Arcones and E. Giné (1993). "Limit Theorems for U-Processes". In: The Annals of Probability 21.3, pp. 1494–1542.

Y. Baraud (2002). "Non-asymptotic minimax rates of testing in signal detection". In: Bernoulli 8.5, pp. 577–606.

M. V. Burnashev (1979). "On the minimax detection of an inaccurately known signal in a white Gaussian noise background". In: Theory of Probability & Its Applications 24.1, pp. 107–119.

N. Dunford and J. T. Schwartz (1963). Linear Operators, Part II: Spectral Theory: Self Adjoint Operators in Hilbert Space. Interscience Publishers.

M. S. Ermakov (1991). "Minimax detection of a signal in a Gaussian white noise". In: Theory of Probability & Its Applications 35.4, pp. 667–679.

M. Fromont and B. Laurent (2006). "Adaptive goodness-of-fit tests in a density model". In: The Annals of Statistics 34.2, pp. 680–720.

M. Fromont, B. Laurent, M. Lerasle, and P. Reynaud-Bouret (2012). "Kernels based tests with non-asymptotic bootstrap approaches for two-sample problem". In: JMLR: Workshop and Conference Proceedings. Vol. 23, pp. 23-1.

M. Fromont, B. Laurent, and P. Reynaud-Bouret (2013). "The two-sample problem for Poisson processes: Adaptive tests with a nonasymptotic wild bootstrap approach". In: The Annals of Statistics 41.3, pp. 1431–1461.

K. Fukumizu, A. Gretton, G. R. Lanckriet, B. Schölkopf, and B. K. Sriperumbudur (2009). "Kernel choice and classifiability for RKHS embeddings of probability distributions". In: Advances in Neural Information Processing Systems, pp. 1750–1758.

G. G. Gregory (1977). "Large sample theory for U-statistics and tests of fit". In: The Annals of Statistics 5.1, pp. 110–123.

A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola (2012a). "A kernel two-sample test". In: Journal of Machine Learning Research 13.Mar, pp. 723–773.

A. Gretton, O. Bousquet, A. J. Smola, and B. Schölkopf (2005). "Measuring statistical dependence with Hilbert-Schmidt norms". In: International Conference on Algorithmic Learning Theory. Springer, pp. 63–77.

A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola (2008). "A kernel statistical test of independence". In: Advances in Neural Information Processing Systems, pp. 585–592.

A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, and B. K. Sriperumbudur (2012b). "Optimal kernel choice for large-scale two-sample tests". In: Advances in Neural Information Processing Systems, pp. 1205–1213.

E. Haeusler (1988). "On the rate of convergence in the central limit theorem for martingales with discrete and continuous time". In: The Annals of Probability 16.1, pp. 275–299.

P. Hall (1984). "Central limit theorem for integrated square error of multivariate nonparametric density estimators". In: Journal of Multivariate Analysis 14.1, pp. 1–16.

Z. Harchaoui, F. Bach, and E. Moulines (2007). "Testing for homogeneity with kernel Fisher discriminant analysis". In: Advances in Neural Information Processing Systems, pp. 609–616.

Y. I. Ingster (1987). "Minimax testing of nonparametric hypotheses on a distribution density in the L_p metrics". In: Theory of Probability & Its Applications 31.2, pp. 333–337.

— (1993). "Asymptotically minimax hypothesis testing for nonparametric alternatives. I, II, III". In: Mathematical Methods of Statistics 2.2, pp. 85–114.

— (2000). "Adaptive chi-square tests". In: Journal of Mathematical Sciences 99.2, pp. 1110–1119.

Y. I. Ingster and I. A. Suslina (2000). "Minimax nonparametric hypothesis testing for ellipsoids and Besov bodies". In: ESAIM: Probability and Statistics 4, pp. 53–135.

— (2003). Nonparametric Goodness-of-Fit Testing under Gaussian Models. New York, NY: Springer.

P. E. Jupp (2005). "Sobolev tests of goodness of fit of distributions on compact Riemannian manifolds". In: The Annals of Statistics 33.6, pp. 2957–2966.

E. L. Lehmann and J. P. Romano (2008). Testing Statistical Hypotheses. New York, NY: Springer Science & Business Media.

O. V. Lepski and V. G. Spokoiny (1999). "Minimax nonparametric hypothesis testing: the case of an inhomogeneous alternative". In: Bernoulli 5.2, pp. 333–358.

R. Lyons (2013). "Distance covariance in metric spaces". In: The Annals of Probability 41.5, pp. 3284–3305.

J. M. Mooij, J. Peters, D. Janzing, J. Zscheischler, and B. Schölkopf (2016). "Distinguishing cause from effect using observational data: methods and benchmarks". In: The Journal of Machine Learning Research 17.1, pp. 1103–1204.

K. Muandet, K. Fukumizu, B. K. Sriperumbudur, and B. Schölkopf (2017). "Kernel mean embedding of distributions: a review and beyond". In: Foundations and Trends® in Machine Learning 10.1-2, pp. 1–141.

J. Peters, J. M. Mooij, D. Janzing, and B. Schölkopf (2014). "Causal discovery with continuous additive noise models". In: The Journal of Machine Learning Research 15.1, pp. 2009–2053.

N. Pfister, P. Bühlmann, B. Schölkopf, and J. Peters (2018). "Kernel-based tests for joint independence". In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80.1, pp. 5–31.

D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu (2013). "Equivalence of distance-based and RKHS-based statistics in hypothesis testing". In: The Annals of Statistics 41.5, pp. 2263–2291.

R. J. Serfling (2009). Approximation Theorems of Mathematical Statistics. New York, NY: John Wiley & Sons.

V. G. Spokoiny (1996). "Adaptive hypothesis testing using wavelets". In: The Annals of Statistics 24.6, pp. 2477–2498.

B. Sriperumbudur, K. Fukumizu, A. Gretton, G. Lanckriet, and B. Schölkopf (2009). "Kernel choice and classifiability for RKHS embeddings of probability distributions". In: Advances in Neural Information Processing Systems 22, pp. 1750–1758.

B. K. Sriperumbudur, K. Fukumizu, and G. R. Lanckriet (2011). "Universality, characteristic kernels and RKHS embedding of measures". In: Journal of Machine Learning Research 12.Jul, pp. 2389–2410.

B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. Lanckriet (2010). "Hilbert space embeddings and metrics on probability measures". In: Journal of Machine Learning Research 11.Apr, pp. 1517–1561.

I. Steinwart and A. Christmann (2008). Support Vector Machines. Springer Science & Business Media.

I. Steinwart (2001). "On the influence of the kernel on the consistency of support vector machines". In: Journal of Machine Learning Research 2.Nov, pp. 67–93.

D. J. Sutherland, H.-Y. Tung, H. Strathmann, S. De, A. Ramdas, A. Smola, and A. Gretton (2017). "Generative models and model criticism via optimized maximum mean discrepancy". In: International Conference on Learning Representations.

G. J. Székely and M. L. Rizzo (2009). "Brownian distance covariance". In: The Annals of Applied Statistics 3.4, pp. 1236–1265.

G. J. Székely, M. L. Rizzo, and N. K. Bakirov (2007). "Measuring and testing dependence by correlation of distances". In: The Annals of Statistics 35.6, pp. 2769–2794.

M. Talagrand (2014). Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems. Springer Science & Business Media.

A. B. Tsybakov (2008). Introduction to Nonparametric Estimation. New York, NY: Springer Science & Business Media.
Appendix A: Some Technical Results and Proofs Related to Chapter 2
Proof of Lemma 3. We have
\[
G^2(x,x') = \sum_{k,l}\mu_k\mu_l\,\varphi_k(x)\varphi_l(x)\varphi_k(x')\varphi_l(x').
\]
Thus
\[
\int g(x)g(x')G^2(x,x')\,dP(x)dP(x') = \sum_{k,l}\mu_k\mu_l\left(\int g(x)\varphi_k(x)\varphi_l(x)\,dP(x)\right)^2
\le \mu_1\sum_k\mu_k\sum_l\left(\int g(x)\varphi_k(x)\varphi_l(x)\,dP(x)\right)^2
\le \mu_1\left(\sum_k\mu_k\int g^2(x)\varphi_k^2(x)\,dP(x)\right)
\le \mu_1\left(\sum_k\mu_k\right)\left(\sup_k\|\varphi_k\|_\infty\right)^2\|g\|^2_{L_2(P)}.
\]
Proof of Lemma 4. For brevity, write
\[
l_K^2 = \sum_{k=1}^K\frac{a_k^2}{\lambda_k}.
\]
By definition, it suffices to show that for all $R > 0$ there exists $f_R\in\mathcal{H}(K)$ such that $\|f_R\|^2_K \le R^2$ and $\|u - f_R\|^2_{L_2(P_0)} \le M^2 R^{-2/\theta}$.

To this end, let $K$ be such that $l_K^2 \le R^2 \le l_{K+1}^2$, and set
\[
f_R = \sum_{k=1}^K a_k\varphi_k + a^\ast_{K+1}(R)\,\varphi_{K+1},
\]
where
\[
a^\ast_{K+1}(R) = {\rm sgn}(a_{K+1})\sqrt{\lambda_{K+1}\big(R^2 - l_K^2\big)}.
\]
Clearly,
\[
\|f_R\|^2_K = \sum_{k=1}^K\frac{a_k^2}{\lambda_k} + \frac{\big(a^\ast_{K+1}(R)\big)^2}{\lambda_{K+1}} = R^2,
\]
and
\[
\|u - f_R\|^2_{L_2(P_0)} = \sum_{k>K+1}a_k^2 + \Big(|a_{K+1}| - \sqrt{\lambda_{K+1}\big(R^2 - l_K^2\big)}\Big)^2 \le \sum_{k\ge K+1}a_k^2.
\]
To ensure $u\in\mathcal{F}(\theta,M)$, it suffices to have
\[
\sup_{l_K^2\le R^2\le l_{K+1}^2}\|u-f_R\|^2_{L_2(P_0)}\,R^{2/\theta} \le M^2, \qquad \forall\; K\ge 0,
\]
which concludes the proof.
Appendix B: Some Technical Results and Proofs Related to Chapter 3
B.1 Properties of Gaussian Kernel
We collect here a couple of useful properties of the Gaussian kernel that are used repeatedly in the proofs of the main results.
Lemma 7. For any $f\in L_2(\mathbb{R}^d)$,
\[
\int G_\nu(x,y)f(x)f(y)\,dxdy = \left(\frac{\pi}{\nu}\right)^{\frac d2}\int\exp\left(-\frac{\|\omega\|^2}{4\nu}\right)|\mathcal{F}f(\omega)|^2\,d\omega.
\]

Proof. Denote by $Z$ a Gaussian random vector with mean 0 and covariance matrix $2\nu I_d$. Then
\[
\int G_\nu(x,y)f(x)f(y)\,dxdy = \int\exp\big(-\nu\|x-y\|^2\big)f(x)f(y)\,dxdy
= \int\mathbb{E}\exp\big[iZ^\top(x-y)\big]f(x)f(y)\,dxdy
= \mathbb{E}\left|\int\exp\big(-iZ^\top x\big)f(x)\,dx\right|^2
\]
\[
= \int\frac{1}{(4\pi\nu)^{d/2}}\exp\left(-\frac{\|\omega\|^2}{4\nu}\right)\left|\int\exp\big(-i\omega^\top x\big)f(x)\,dx\right|^2 d\omega
= \left(\frac{\pi}{\nu}\right)^{\frac d2}\int\exp\left(-\frac{\|\omega\|^2}{4\nu}\right)|\mathcal{F}f(\omega)|^2\,d\omega,
\]
which concludes the proof.
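The identity of Lemma 7 can be verified numerically. The sketch below (our code; $d = 1$, $f$ the standard normal density, and the unitary Fourier transform convention — choices not taken from the thesis) computes both sides by quadrature and compares them with the common closed form $(1+4\nu)^{-1/2}$, which both sides equal for this particular $f$.

```python
import numpy as np

# Quadrature check of Lemma 7 in d = 1 for f the N(0, 1) density.
# LHS: X - Y ~ N(0, 2) gives E exp(-nu (X - Y)^2) = (1 + 4 nu)^(-1/2).
# RHS: F f(w) = (2 pi)^(-1/2) exp(-w^2 / 2) under the unitary transform,
# and the Gaussian integral evaluates to the same constant.

GRID = np.linspace(-8.0, 8.0, 1601)
H = GRID[1] - GRID[0]

def lhs(nu):
    f = np.exp(-GRID ** 2 / 2) / np.sqrt(2 * np.pi)
    g = np.exp(-nu * (GRID[:, None] - GRID[None, :]) ** 2)
    return float(f @ g @ f) * H * H

def rhs(nu):
    ff2 = np.exp(-GRID ** 2) / (2 * np.pi)   # |F f(w)|^2
    return float(np.sqrt(np.pi / nu) *
                 np.sum(np.exp(-GRID ** 2 / (4 * nu)) * ff2) * H)

for nu in (0.5, 1.0, 4.0):
    closed_form = (1 + 4 * nu) ** -0.5
    assert abs(lhs(nu) - closed_form) < 1e-6
    assert abs(rhs(nu) - closed_form) < 1e-6
```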
A useful consequence of Lemma 7 is a close connection between the Gaussian kernel MMD and the $L_2$ norm.
Lemma 8. For any $f\in\mathcal{W}^{s,2}(M)$,
\[
\left(\frac{\nu}{\pi}\right)^{d/2}\int G_\nu(x,y)f(x)f(y)\,dxdy \ge \frac14\|f\|^2_{L_2},
\]
provided that
\[
\nu^s \ge \frac{4^{1-s}M^2}{(\log 3)^s}\cdot\|f\|^{-2}_{L_2}.
\]

Proof. In light of Lemma 7,
\[
\left(\frac{\nu}{\pi}\right)^{d/2}\int G_\nu(x,y)f(x)f(y)\,dxdy = \int\exp\left(-\frac{\|\omega\|^2}{4\nu}\right)|\mathcal{F}f(\omega)|^2\,d\omega.
\]
By the Plancherel theorem, for any $T>0$,
\[
\int_{\|\omega\|\le T}|\mathcal{F}f(\omega)|^2\,d\omega = \|f\|^2_{L_2} - \int_{\|\omega\|>T}|\mathcal{F}f(\omega)|^2\,d\omega \ge \|f\|^2_{L_2} - \frac{M^2}{T^{2s}}.
\]
Choosing
\[
T = \left(\frac{2M}{\|f\|_{L_2}}\right)^{1/s}
\]
yields
\[
\int_{\|\omega\|\le T}|\mathcal{F}f(\omega)|^2\,d\omega \ge \frac34\|f\|^2_{L_2}.
\]
Hence
\[
\int\exp\left(-\frac{\|\omega\|^2}{4\nu}\right)|\mathcal{F}f(\omega)|^2\,d\omega \ge \exp\left(-\frac{T^2}{4\nu}\right)\int_{\|\omega\|\le T}|\mathcal{F}f(\omega)|^2\,d\omega \ge \frac34\exp\left(-\frac{T^2}{4\nu}\right)\|f\|^2_{L_2}.
\]
In particular, if
\[
\nu \ge \frac{(2M)^{2/s}}{4\log 3}\cdot\|f\|^{-2/s}_{L_2},
\]
then
\[
\int\exp\left(-\frac{\|\omega\|^2}{4\nu}\right)|\mathcal{F}f(\omega)|^2\,d\omega \ge \frac14\|f\|^2_{L_2},
\]
which concludes the proof.
B.2 Proof of Lemma 5
We first prove that $\sup_{1\le\nu_n\le n^{2/d}}\big|s^2_{n,\nu_n}/\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2 - 1\big| = o_p(1)$ and then show that the difference caused by the modification from $s^2_{n,\nu_n}$ to $\hat{s}^2_{n,\nu_n}$ is asymptotically negligible.

Note that
\[
\sup_{1\le\nu_n\le n^{2/d}}\big|s^2_{n,\nu_n}\big/\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2 - 1\big|
\le \Big(\inf_{1\le\nu_n\le n^{2/d}}\nu_n^{d/2}\,\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2\Big)^{-1}\cdot\sup_{1\le\nu_n\le n^{2/d}}\nu_n^{d/2}\big|s^2_{n,\nu_n} - \mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2\big|.
\]
For $X\sim\mathbb{P}_0$, denote the distribution of $(X,X)$ by $\mathbb{P}_1$. Then we have
\[
\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2 = \gamma^2_{\nu_n}(\mathbb{P}_1,\mathbb{P}_0\otimes\mathbb{P}_0).
\]
Hence $\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2 > 0$ for any $\nu_n > 0$ since $G_{\nu_n}$ is characteristic. In addition, $\nu_n^{d/2}\,\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2$ is continuous with respect to $\nu_n$ and
\[
\lim_{\nu_n\to\infty}\nu_n^{d/2}\,\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2 = \left(\frac{\pi}{2}\right)^{d/2}\|p_0\|^2_{L_2}.
\]
Therefore,
\[
\inf_{1\le\nu_n\le n^{2/d}}\nu_n^{d/2}\,\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2 \ge \inf_{\nu_n\in[1,\infty)}\nu_n^{d/2}\,\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2 > 0,
\]
and it remains to prove
\[
\sup_{1\le\nu_n\le n^{2/d}}\nu_n^{d/2}\big|s^2_{n,\nu_n} - \mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2\big| = o_p(1).
\]
Recall the expression of $s^2_{n,\nu_n}$. It suffices to show that
\[
\sup_{1\le\nu_n\le n^{2/d}}\nu_n^{d/2}\Bigg|\frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}G^2_{\nu_n}(X_i,X_j) - \mathbb{E}G^2_{\nu_n}(X_1,X_2)\Bigg|, \tag{B.1}
\]
\[
\sup_{1\le\nu_n\le n^{2/d}}\nu_n^{d/2}\Bigg|\frac{2(n-3)!}{n!}\sum_{\substack{1\le i,j_1,j_2\le n\\ |\{i,j_1,j_2\}|=3}}G_{\nu_n}(X_i,X_{j_1})G_{\nu_n}(X_i,X_{j_2}) - \mathbb{E}G_{\nu_n}(X_1,X_2)G_{\nu_n}(X_1,X_3)\Bigg|, \tag{B.2}
\]
\[
\sup_{1\le\nu_n\le n^{2/d}}\nu_n^{d/2}\Bigg|\frac{(n-4)!}{n!}\sum_{\substack{1\le i_1,i_2,j_1,j_2\le n\\ |\{i_1,i_2,j_1,j_2\}|=4}}G_{\nu_n}(X_{i_1},X_{j_1})G_{\nu_n}(X_{i_2},X_{j_2}) - \big[\mathbb{E}G_{\nu_n}(X_1,X_2)\big]^2\Bigg| \tag{B.3}
\]
are all $o_p(1)$. We shall first control (B.1) and then bound (B.2) and (B.3) in the same way.

Let
\[
\mathbb{E}_n G^2_{\nu_n}(X,X') = \frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}G^2_{\nu_n}(X_i,X_j).
\]
In the rest of this proof, abbreviate $\mathbb{E}_n G^2_{\nu_n}(X,X')$ and $\mathbb{E}G^2_{\nu_n}(X_1,X_2)$ as $\mathbb{E}_n G^2_{\nu_n}$ and $\mathbb{E}G^2_{\nu_n}$ respectively when no confusion occurs.

Divide the whole interval $[1,n^{2/d}]$ into $A$ sub-intervals $[u_0,u_1],[u_1,u_2],\cdots,[u_{A-1},u_A]$ with $u_0 = 1$, $u_A = n^{2/d}$. For any $\nu_n\in[u_{a-1},u_a]$,
\[
\nu_n^{d/2}\mathbb{E}_n G^2_{\nu_n} - \nu_n^{d/2}\mathbb{E}G^2_{\nu_n} \ge -\nu_n^{d/2}\big|\mathbb{E}_n G^2_{u_a} - \mathbb{E}G^2_{u_a}\big| - \nu_n^{d/2}\big|\mathbb{E}G^2_{u_a} - \mathbb{E}G^2_{u_{a-1}}\big|
\ge -u_a^{d/2}\big|\mathbb{E}_n G^2_{u_a} - \mathbb{E}G^2_{u_a}\big| - u_a^{d/2}\big|\mathbb{E}G^2_{u_a} - \mathbb{E}G^2_{u_{a-1}}\big|
\]
and
\[
\nu_n^{d/2}\mathbb{E}_n G^2_{\nu_n} - \nu_n^{d/2}\mathbb{E}G^2_{\nu_n} \le u_a^{d/2}\big|\mathbb{E}_n G^2_{u_{a-1}} - \mathbb{E}G^2_{u_{a-1}}\big| + u_a^{d/2}\big|\mathbb{E}G^2_{u_a} - \mathbb{E}G^2_{u_{a-1}}\big|,
\]
which together ensure that
\[
\sup_{1\le\nu_n\le n^{2/d}}\big|\nu_n^{d/2}\mathbb{E}_n G^2_{\nu_n} - \nu_n^{d/2}\mathbb{E}G^2_{\nu_n}\big|
\le \sup_{1\le a\le A}\left(\frac{u_a}{u_{a-1}}\right)^{d/2}\cdot\sup_{0\le a\le A}u_a^{d/2}\big|\mathbb{E}_n G^2_{u_a} - \mathbb{E}G^2_{u_a}\big| + \sup_{1\le a\le A}u_a^{d/2}\big|\mathbb{E}G^2_{u_a} - \mathbb{E}G^2_{u_{a-1}}\big|
\]
\[
\le \sup_{1\le a\le A}\left(\frac{u_a}{u_{a-1}}\right)^{d/2}\cdot\sup_{0\le a\le A}u_a^{d/2}\big|\mathbb{E}_n G^2_{u_a} - \mathbb{E}G^2_{u_a}\big| + \sup_{1\le a\le A}\big|u_a^{d/2}\mathbb{E}G^2_{u_a} - u_{a-1}^{d/2}\mathbb{E}G^2_{u_{a-1}}\big| + \sup_{1\le a\le A}\Big(\big(u_a^{d/2} - u_{a-1}^{d/2}\big)\mathbb{E}G^2_{u_{a-1}}\Big).
\]
We bound the three terms on the right hand side of the last inequality separately.

Let $\{u_a\}_{a\ge0}$ be a geometric sequence, namely,
\[
A := \inf\{a\in\mathbb{N} : r^a \ge n^{2/d}\}, \qquad
u_a = \begin{cases} r^a, & 0\le a\le A-1,\\ n^{2/d}, & a = A,\end{cases}
\]
with $r>1$ to be determined later.

Since $\lim_{\nu\to\infty}\nu^{d/2}\mathbb{E}G^2_\nu = (\pi/2)^{d/2}\|p_0\|^2$ and $\nu^{d/2}\mathbb{E}G^2_\nu$ is continuous, we obtain that for any $\varepsilon>0$ there exists sufficiently small $r>1$ such that
\[
\sup_{1\le a\le A}\big|u_a^{d/2}\mathbb{E}G^2_{u_a} - u_{a-1}^{d/2}\mathbb{E}G^2_{u_{a-1}}\big| \le \varepsilon.
\]
At the same time, we can also ensure
\[
\sup_{1\le a\le A}\Big(\big(u_a^{d/2} - u_{a-1}^{d/2}\big)\mathbb{E}G^2_{u_{a-1}}\Big) \le \big(r^{d/2}-1\big)\left(\frac{\pi}{2}\right)^{d/2}\|p_0\|^2 \le \varepsilon
\]
by choosing $r$ sufficiently close to 1.
Finally, consider
\[
\sup_{1\le a\le A}\left(\frac{u_a}{u_{a-1}}\right)^{d/2}\cdot\sup_{0\le a\le A}u_a^{d/2}\big|\mathbb{E}_n G^2_{u_a} - \mathbb{E}G^2_{u_a}\big|.
\]
On the one hand,
\[
\sup_{1\le a\le A}\left(\frac{u_a}{u_{a-1}}\right)^{d/2} \le r^{d/2}.
\]
On the other hand, since
\[
{\rm var}\big(\mathbb{E}_n G^2_{\nu_n}\big) \lesssim \frac1n\mathbb{E}G^2_{\nu_n}(X,X')G^2_{\nu_n}(X,X'') + \frac{1}{n^2}\mathbb{E}G^4_{\nu_n}(X,X')
\lesssim_d \frac{\nu_n^{-3d/4}\|p_0\|^3}{n} + \frac{\nu_n^{-d/2}\|p_0\|^2}{n^2}
\]
for any $\nu_n\in(0,\infty)$, we have
\[
P\Big(\sup_{0\le a\le A}u_a^{d/2}\big|\mathbb{E}_n G^2_{u_a} - \mathbb{E}G^2_{u_a}\big| \ge \varepsilon\Big)
\le \sum_{a=0}^A\frac{u_a^d\,{\rm var}\big(\mathbb{E}_n G^2_{u_a}\big)}{\varepsilon^2}
\lesssim_{d,r}\frac{1}{\varepsilon^2}\left(\frac{u_A^{d/4}\|p_0\|^3}{n} + \frac{u_A^{d/2}\|p_0\|^2}{n^2}\right) \to 0
\]
as $n\to\infty$. Hence we conclude $\sup_{1\le\nu_n\le n^{2/d}}\big|\nu_n^{d/2}\mathbb{E}_n G^2_{\nu_n} - \nu_n^{d/2}\mathbb{E}G^2_{\nu_n}\big| = o_p(1)$.

Considering that
\[
\lim_{\nu_n\to\infty}\nu_n^{d/2}\,\mathbb{E}G_{\nu_n}(X_1,X_2)G_{\nu_n}(X_1,X_3) = 0, \qquad \lim_{\nu_n\to\infty}\nu_n^{d/2}\big[\mathbb{E}G_{\nu_n}(X_1,X_2)\big]^2 = 0,
\]
we obtain that (B.2) and (B.3) are also $o_p(1)$, based on almost the same arguments. Hence
\[
\sup_{1\le\nu_n\le n^{2/d}}\big|s^2_{n,\nu_n}\big/\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2 - 1\big| = o_p(1).
\]
On the other hand, since $\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2 \gtrsim_{p_0,d}\nu_n^{-d/2}$ for $\nu_n\in[1,n^{2/d}]$,
\[
\sup_{1\le\nu_n\le n^{2/d}}\frac{1}{n^2\,\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2} = o(1).
\]
Hence we finally conclude that
\[
\sup_{1\le\nu_n\le n^{2/d}}\big|\hat{s}^2_{n,\nu_n}\big/\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2 - 1\big| = o_p(1).
\]
B.3 Proof of Lemma 6
Let
\[
K_{\nu_n}(x,x') = \frac{G_{\nu_n}(x,x')}{\sqrt{2\mathbb{E}G^2_{\nu_n}(X_1,X_2)}}, \qquad \forall\; x,x'\in\mathbb{R}^d,
\]
and accordingly,
\[
\bar{K}_{\nu_n}(x,x') = \frac{\bar{G}_{\nu_n}(x,x')}{\sqrt{2\mathbb{E}G^2_{\nu_n}(X_1,X_2)}}.
\]
Hence
\[
\tilde{T}^{\rm GOF(adapt)}_n = \sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{1}{n-1}\sum_{i\ne j}\bar{K}_{\nu_n}(X_i,X_j)\cdot\sqrt{\frac{\mathbb{E}G^2_{\nu_n}(X_1,X_2)}{\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2}}\Bigg|.
\]
To finish this proof, we first bound
\[
\sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{1}{n-1}\sum_{i\ne j}\bar{K}_{\nu_n}(X_i,X_j)\Bigg| \tag{B.4}
\]
and then control $\tilde{T}^{\rm GOF(adapt)}_n$.

Step (i). There are two main tools that we use in this step. First, we apply results in Arcones and Giné (1993) to obtain a Bernstein-type inequality for
\[
\Bigg|\frac{1}{n-1}\sum_{i\ne j}\bar{K}_{\nu_0}(X_i,X_j)\Bigg| \quad\text{and}\quad \Bigg|\frac{1}{n-1}\sum_{i\ne j}\big(\bar{K}_{\nu_n}(X_i,X_j) - \bar{K}_{\nu'_n}(X_i,X_j)\big)\Bigg|
\]
for some $\nu_0$ and arbitrary $\nu_n,\nu'_n\in[1,\infty)$. Building on these, we borrow Talagrand's techniques for handling Bernstein-type inequalities (e.g., see Talagrand, 2014) to give a generic chaining bound on (B.4).
To be more specific, for any $\nu_0,\nu_n,\nu'_n\in[1,n^{2/d}]$, define
\[
d_1(\nu_n,\nu'_n) = \|\bar{K}_{\nu'_n} - \bar{K}_{\nu_n}\|_{L_\infty}, \qquad d_2(\nu_n,\nu'_n) = \|\bar{K}_{\nu'_n} - \bar{K}_{\nu_n}\|_{L_2}.
\]
Then Proposition 2.3(c) of Arcones and Giné (1993) ensures that for any $t>0$,
\[
P\Bigg(\Bigg|\frac{1}{n-1}\sum_{i\ne j}\bar{K}_{\nu_0}(X_i,X_j)\Bigg| \ge t\Bigg) \le C\exp\Bigg(-C\min\Bigg\{\frac{t}{\|\bar{K}_{\nu_0}\|_{L_2}},\ \Big(\frac{\sqrt{n}\,t}{\|\bar{K}_{\nu_0}\|_{L_\infty}}\Big)^{\frac23}\Bigg\}\Bigg) \tag{B.5}
\]
and
\[
P\Bigg(\Bigg|\frac{1}{n-1}\sum_{i\ne j}\big(\bar{K}_{\nu_n}(X_i,X_j) - \bar{K}_{\nu'_n}(X_i,X_j)\big)\Bigg| \ge t\Bigg) \le C\exp\Bigg(-C\min\Bigg\{\frac{t}{d_2(\nu_n,\nu'_n)},\ \Big(\frac{\sqrt{n}\,t}{d_1(\nu_n,\nu'_n)}\Big)^{\frac23}\Bigg\}\Bigg)
\]
for some $C>0$. Based on a chaining type argument (see, e.g., Theorem 2.2.28 in Talagrand, 2014), the latter inequality implies that there exists $C>0$ such that
\[
P\Bigg(\sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{1}{n-1}\sum_{i\ne j}\big(\bar{K}_{\nu_n}(X_i,X_j) - \bar{K}_{\nu_0}(X_i,X_j)\big)\Bigg| \ge C\Big(\frac{\gamma_{2/3}([1,n^{2/d}],d_1)}{\sqrt{n}}\,t + \gamma_1([1,n^{2/d}],d_2) + D_2 t\Big)\Bigg) \lesssim \exp(-t^{2/3}), \tag{B.6}
\]
where $\gamma_{2/3}([1,n^{2/d}],d_1)$ and $\gamma_1([1,n^{2/d}],d_2)$ are the so-called $\gamma$-functionals and
\[
D_2 = \sum_{l\ge0}e_l([1,n^{2/d}],d_2)
\]
with $e_l$ being the so-called entropy numbers.
111
A straightforward combination of (B.5) and (B.6) then gives
$$P\left( \sup_{1\le\nu_n\le n^{2/d}} \left| \frac{1}{n-1}\sum_{i\ne j} \bar K_{\nu_n}(X_i,X_j) \right| \ge C\left( \frac{\gamma_{2/3}([1,n^{2/d}],d_1)}{\sqrt n}\,t + \gamma_1([1,n^{2/d}],d_2) + D_2\, t + \frac{\|\bar K_{\nu_0}\|_{L^\infty}}{\sqrt n} + \|\bar K_{\nu_0}\|_{L^2}\, t \right) \right) \lesssim \exp(-t^{2/3}).$$
Therefore, given that bounds on $\|\bar K_{\nu_0}\|_{L^2}$ and $\|\bar K_{\nu_0}\|_{L^\infty}$ can be obtained quite directly, e.g., with $\nu_0 = 1$,
$$\|\bar K_{\nu_0}\|_{L^\infty} \le 4\,\|K_{\nu_0}\|_{L^\infty} = \frac{4}{\sqrt{2\,\mathbb{E}G^2_1}}, \qquad \|\bar K_{\nu_0}\|_{L^2} \le \|K_{\nu_0}\|_{L^2} = \frac{\sqrt 2}{2},$$
the main task is to bound $\gamma_{2/3}([1,n^{2/d}],d_1)$, $\gamma_1([1,n^{2/d}],d_2)$ and $D_2$ properly.
First consider $\gamma_{2/3}([1,n^{2/d}],d_1)$. Note that for any $1 \le \nu_n < \nu'_n < \infty$,
$$d_1(\nu_n,\nu'_n) \le 4\,\|K_{\nu_n} - K_{\nu'_n}\|_{L^\infty} \le 4 \int_{\nu_n}^{\nu'_n} \left\| \frac{dK_u}{du} \right\|_{L^\infty} du.$$
Since for any $\nu_n$,
$$\frac{dK_{\nu_n}}{d\nu_n} = \bigl(-\|x-x'\|^2\bigr)\, G_{\nu_n}(x,x') \bigl(\mathbb{E}G^2_{\nu_n}(X_1,X_2)\bigr)^{-1/2} - \frac12\, G_{\nu_n}(x,x') \bigl(\mathbb{E}G^2_{\nu_n}(X_1,X_2)\bigr)^{-3/2}\, \frac{d}{d\nu_n}\,\mathbb{E}G^2_{\nu_n}(X_1,X_2)$$
(up to the constant factor $2^{-1/2}$, which does not affect the bounds below),
where
$$\bigl(\mathbb{E}G^2_{\nu_n}(X_1,X_2)\bigr)^{-1/2} = \left(\frac{\pi}{2}\right)^{-d/4} \nu_n^{d/4} \left( \int \exp\left(-\frac{\|\omega\|^2}{8\nu_n}\right) \|\mathcal F p_0(\omega)\|^2\, d\omega \right)^{-1/2} \lesssim_d \nu_n^{d/4} \left( \int \exp\left(-\frac{\|\omega\|^2}{8}\right) \|\mathcal F p_0(\omega)\|^2\, d\omega \right)^{-1/2},$$
$$\bigl(\mathbb{E}G^2_{\nu_n}(X_1,X_2)\bigr)^{-3/2} \lesssim_d \nu_n^{3d/4} \left( \int \exp\left(-\frac{\|\omega\|^2}{8}\right) \|\mathcal F p_0(\omega)\|^2\, d\omega \right)^{-3/2},$$
and
$$\frac{d}{d\nu_n}\,\mathbb{E}G^2_{\nu_n}(X_1,X_2) = \left(\frac{\pi}{2}\right)^{d/2} \nu_n^{-d/2-1} \left( -\frac d2 \int \exp\left(-\frac{\|\omega\|^2}{8\nu_n}\right) \|\mathcal F p_0(\omega)\|^2\, d\omega + \int \exp\left(-\frac{\|\omega\|^2}{8\nu_n}\right) \frac{\|\omega\|^2}{8\nu_n}\, \|\mathcal F p_0(\omega)\|^2\, d\omega \right),$$
which together ensure
$$\left\| \frac{dK_{\nu_n}}{d\nu_n} \right\|_{L^\infty} \lesssim_{d,p_0} \nu_n^{d/4-1}.$$
Hence
$$d_1(\nu_n,\nu'_n) \lesssim_{d,p_0} \bigl| \nu_n^{d/4} - (\nu'_n)^{d/4} \bigr|,$$
and $\gamma_{2/3}([1,n^{2/d}],d_1) \lesssim_{d,p_0} \bigl| (n^{2/d})^{d/4} - 1^{d/4} \bigr| \le \sqrt n$.
Then consider $\gamma_1([1,n^{2/d}],d_2)$. We have
$$d_2^2(\nu_n,\nu'_n) \le \|K_{\nu'_n} - K_{\nu_n}\|_{L^2}^2 = 1 - \frac{\mathbb{E}G_{\nu_n}G_{\nu'_n}}{\sqrt{\mathbb{E}G^2_{\nu_n}\,\mathbb{E}G^2_{\nu'_n}}} \le -\log\left( \frac{\mathbb{E}G_{\nu_n}G_{\nu'_n}}{\sqrt{\mathbb{E}G^2_{\nu_n}\,\mathbb{E}G^2_{\nu'_n}}} \right).$$
Let $f_1(\nu_n) = \int \exp\bigl(-\frac{\|\omega\|^2}{8\nu_n}\bigr)\, \|\mathcal F p_0(\omega)\|^2\, d\omega$. Then
$$\log\bigl(\mathbb{E}G^2_{\nu_n}\bigr) = \frac d2 \log\left(\frac{\pi}{2\nu_n}\right) + \log f_1(\nu_n)$$
and hence
$$-\log\left( \frac{\mathbb{E}G_{\nu_n}G_{\nu'_n}}{\sqrt{\mathbb{E}G^2_{\nu_n}\,\mathbb{E}G^2_{\nu'_n}}} \right) = \frac d2 \left( -\frac{\log\nu_n + \log\nu'_n}{2} + \log\left(\frac{\nu_n+\nu'_n}{2}\right) \right) + \left( \frac{\log f_1(\nu_n) + \log f_1(\nu'_n)}{2} - \log f_1\left(\frac{\nu_n+\nu'_n}{2}\right) \right).$$
Note that
$$\frac{\log f_1(\nu_n) + \log f_1(\nu'_n)}{2} - \log f_1\left(\frac{\nu_n+\nu'_n}{2}\right) = \frac12 \int_0^{\frac{\nu'_n-\nu_n}{2}} \int_{-u}^{u} \left( \log f_1\left( \frac{\nu'_n+\nu_n}{2} + v \right) \right)''\, dv\, du.$$
For any $\nu_n \ge 1$,
$$\bigl(\log f_1(\nu_n)\bigr)'' = \frac{f_1(\nu_n)\, f_1''(\nu_n) - (f_1'(\nu_n))^2}{f_1^2(\nu_n)} \le \frac{f_1''(\nu_n)}{f_1(\nu_n)},$$
and
$$f_1''(\nu_n) = \int \exp\left(-\frac{\|\omega\|^2}{8\nu_n}\right) \left( \frac{\|\omega\|^4}{64\nu_n^4} - \frac{\|\omega\|^2}{4\nu_n^3} \right) \|\mathcal F p_0(\omega)\|^2\, d\omega \lesssim \nu_n^{-2}\, \|p_0\|^2_{L^2}.$$
Moreover, there exists $\nu^*_n = \nu^*_n(p_0) > 1$ such that $f_1(\nu^*_n) \ge \|p_0\|^2_{L^2}/2$, from which we obtain
$$\bigl(\log f_1(\nu_n)\bigr)'' \lesssim \begin{cases} \nu_n^{-2}\, \|p_0\|^2_{L^2}\big/ f_1(1), & 1 \le \nu_n \le \nu^*_n,\\ \nu_n^{-2}, & \nu^*_n < \nu_n \le n^{2/d}, \end{cases}$$
which suggests that for any $\nu_n, \nu'_n \in [1, \nu^*_n]$,
$$d_2^2(\nu_n,\nu'_n) \lesssim \left( \frac d2 + \frac{\|p_0\|^2_{L^2}}{f_1(1)} \right) \left( -\frac{\log\nu_n + \log\nu'_n}{2} + \log\left(\frac{\nu_n+\nu'_n}{2}\right) \right) \lesssim \left( \frac d2 + \frac{\|p_0\|^2_{L^2}}{f_1(1)} \right) \bigl| \log\nu_n - \log\nu'_n \bigr|,$$
and for any $\nu_n, \nu'_n \in [\nu^*_n, n^{2/d}]$,
$$d_2^2(\nu_n,\nu'_n) \lesssim \left( \frac d2 + 1 \right) \bigl| \log\nu_n - \log\nu'_n \bigr|.$$
Note that in addition to the bound on $d_2$ obtained above, we also have
$$d_2(\nu_n,\nu'_n) \le \|\bar K_{\nu_n}\|_{L^2} + \|\bar K_{\nu'_n}\|_{L^2} \le \|K_{\nu_n}\|_{L^2} + \|K_{\nu'_n}\|_{L^2} \le \sqrt 2.$$
Therefore,
$$\begin{aligned}
\gamma_1([1,n^{2/d}],d_2) &\le \sum_{l\ge0} 2^l e_l([1,n^{2/d}],d_2)\\
&\lesssim e_0([1,n^{2/d}],d_2) + \sum_{l\ge0} 2^l e_l([1,\nu^*_n],d_2) + \sum_{l\ge0} 2^l e_l([\nu^*_n,n^{2/d}],d_2)\\
&\lesssim 1 + \sqrt{\frac d2 + \frac{\|p_0\|^2_{L^2}}{f_1(1)}}\, \sum_{l\ge0} 2^l \sqrt{\frac{\log\nu^*_n - \log 1}{2^{2^l}}} + \sqrt{\frac d2 + 1}\, \left( \sum_{l\ge0} 2^l \min\left\{ 1,\, \sqrt{\frac{\log n^{2/d} - \log\nu^*_n}{2^{2^l}}} \right\} \right)\\
&\lesssim 1 + \sqrt{\frac d2 + \frac{\|p_0\|^2_{L^2}}{f_1(1)}}\, \sqrt{\log\nu^*_n} + \sqrt{\frac d2 + 1}\, \left( \sum_{l\ge0} 2^l \min\left\{ 1,\, \sqrt{\frac{\log n^{2/d}}{2^{2^l}}} \right\} \right)\\
&\lesssim 1 + \sqrt{\frac d2 + \frac{\|p_0\|^2_{L^2}}{f_1(1)}}\, \sqrt{\log\nu^*_n} + \sqrt{\frac d2 + 1}\, \left( \sum_{0\le l<l^*} 2^l + \sum_{l\ge l^*} 2^l \sqrt{\frac{\log n^{2/d}}{2^{2^l}}} \right)\\
&\lesssim 1 + \sqrt{\frac d2 + \frac{\|p_0\|^2_{L^2}}{f_1(1)}}\, \sqrt{\log\nu^*_n} + \sqrt{\frac d2 + 1} \cdot 2^{l^*},
\end{aligned}$$
where $l^*$ is the smallest $l$ such that $\sqrt{\log n^{2/d}/2^{2^l}} \le 1$.
Hence $2^{l^*} \asymp \log\log n$, and there exists $C = C(d) > 0$ such that
$$\gamma_1([1,n^{2/d}],d_2) \le C(d)\, \log\log n$$
for sufficiently large $n$.
By a similar approach, we get
$$D_2 \lesssim 1 + \sqrt{\frac d2 + \frac{\|p_0\|^2_{L^2}}{f_1(1)}}\, \sqrt{\log\nu^*_n} + \sqrt{\frac d2 + 1} \cdot l^*,$$
which is upper bounded by $C(d)\, \log\log\log n$ for sufficiently large $n$.
Therefore, we finally obtain that there exists $C(d) > 0$ such that for sufficiently large $n$,
$$P\left( \sup_{1\le\nu_n\le n^{2/d}} \left| \frac{1}{n-1}\sum_{i\ne j} \bar K_{\nu_n}(X_i,X_j) \right| \ge C(d)\bigl( \log\log n + t\, \log\log\log n \bigr) \right) \lesssim \exp(-t^{2/3}). \tag{B.7}$$
Step (ii). With a slight abuse of notation, there exists $\nu^*_n = \nu^*_n(p_0) > 1$ such that
$$\frac{\mathbb{E}G^2_{\nu_n}(X_1,X_2)}{\mathbb{E}[\bar G_{\nu_n}(X_1,X_2)]^2} \le 2$$
for $\nu_n \ge \nu^*_n$. Therefore,
$$\begin{aligned}
T^{\mathrm{GOF(adapt)}}_n &\le \sup_{1\le\nu_n\le\nu^*_n} \sqrt{\frac{\mathbb{E}G^2_{\nu_n}(X_1,X_2)}{\mathbb{E}[\bar G_{\nu_n}(X_1,X_2)]^2}} \cdot \sup_{1\le\nu_n\le\nu^*_n} \left| \frac{1}{n-1}\sum_{i\ne j} \bar K_{\nu_n}(X_i,X_j) \right| + \sqrt 2 \sup_{\nu^*_n\le\nu_n\le n^{2/d}} \left| \frac{1}{n-1}\sum_{i\ne j} \bar K_{\nu_n}(X_i,X_j) \right|\\
&\le C(p_0) \sup_{1\le\nu_n\le\nu^*_n} \left| \frac{1}{n-1}\sum_{i\ne j} \bar K_{\nu_n}(X_i,X_j) \right| + \sqrt 2 \sup_{\nu^*_n\le\nu_n\le n^{2/d}} \left| \frac{1}{n-1}\sum_{i\ne j} \bar K_{\nu_n}(X_i,X_j) \right|
\end{aligned}$$
for some $C(p_0) > 0$.
Based on arguments similar to those in the first step,
$$P\left( \sup_{1\le\nu_n\le\nu^*_n} \left| \frac{1}{n-1}\sum_{i\ne j} \bar K_{\nu_n}(X_i,X_j) \right| \ge C(d,p_0)\, t \right) \lesssim \exp(-t^{2/3})$$
for some $C(d,p_0) > 0$, and (B.7) still holds when $\nu_n$ is restricted to $[\nu^*_n, n^{2/d}]$. Together, these two bounds prove Lemma 6.
B.4 Decomposition of dHSIC and Its Variance Estimation
In this section, we first derive an approximation of $\gamma^2_\nu(\hat{\mathbb P},\, \hat{\mathbb P}^{X^1}\otimes\cdots\otimes\hat{\mathbb P}^{X^k})$ under $H_0$ for general $k$, from which the approximation of $\mathrm{var}\bigl(\gamma^2_\nu(\hat{\mathbb P},\, \hat{\mathbb P}^{X^1}\otimes\cdots\otimes\hat{\mathbb P}^{X^k})\bigr)$ can be obtained subsequently.
Note that
$$G_\nu(x,y) = \int G_\nu(u,v)\, d(\delta_x - \mathbb P + \mathbb P)(u)\, d(\delta_y - \mathbb P + \mathbb P)(v) = \bar G_\nu(x,y) + \bigl( \mathbb{E}G_\nu(x,X) - \mathbb{E}G_\nu(X,X') \bigr) + \bigl( \mathbb{E}G_\nu(y,X) - \mathbb{E}G_\nu(X,X') \bigr) + \mathbb{E}G_\nu(X,X').$$
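This decomposition is purely algebraic: whatever values the expectations take, the centered term and the three lower-order terms recombine exactly to $G_\nu(x,y)$. A minimal numerical check of the identity follows; the Gaussian kernel, sample size, and variable names are our own illustration, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)

def G(x, y, nu=1.0):
    """Gaussian kernel G_nu(x, y) = exp(-nu * ||x - y||^2)."""
    return np.exp(-nu * np.sum((x - y) ** 2, axis=-1))

# Replace population expectations by averages over a reference sample X.
X = rng.normal(size=(500, 2))
x, y = rng.normal(size=2), rng.normal(size=2)

EG_xX = G(x, X).mean()                            # stands in for E G(x, X)
EG_yX = G(y, X).mean()                            # stands in for E G(y, X)
EG_XX = G(X[:, None, :], X[None, :, :]).mean()    # stands in for E G(X, X')

# Centered kernel: G_bar(x, y) = G(x, y) - E G(x, X) - E G(y, X) + E G(X, X')
G_bar = G(x, y) - EG_xX - EG_yX + EG_XX

# Recombine the four terms of the decomposition.
recomposed = G_bar + (EG_xX - EG_XX) + (EG_yX - EG_XX) + EG_XX
assert abs(recomposed - G(x, y)) < 1e-12
```

The check passes for any choice of the three "expectation" values, which is what makes the same decomposition valid with either population or empirical measures.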
Similarly write
$$G_\nu\bigl(x, (y^1,\ldots,y^k)\bigr) = \int G_\nu\bigl(u, (v^1,\ldots,v^k)\bigr)\, d(\delta_x - \mathbb P + \mathbb P)(u)\, d(\delta_{y^1} - \mathbb P^{X^1} + \mathbb P^{X^1})(v^1) \cdots d(\delta_{y^k} - \mathbb P^{X^k} + \mathbb P^{X^k})(v^k)$$
and expand it as the sum of all $l$-variate centered components with $l \le k+1$. Apply the same expansion to $G_\nu\bigl((x^1,\ldots,x^k), (y^1,\ldots,y^k)\bigr)$, writing it as the sum of all $l$-variate centered components with $l \le 2k$. Plug these expansions into $\gamma^2_\nu(\hat{\mathbb P},\, \hat{\mathbb P}^{X^1}\otimes\cdots\otimes\hat{\mathbb P}^{X^k})$ and denote the sum of all $l$-variate centered components in the resulting expression by $D_l(\nu)$ for $l \le 2k$. Let the remainder be $R_n = \sum_{l=3}^{2k} D_l(\nu)$, so that
$$\gamma^2_\nu(\hat{\mathbb P},\, \hat{\mathbb P}^{X^1}\otimes\cdots\otimes\hat{\mathbb P}^{X^k}) = \gamma^2_\nu(\mathbb P,\, \mathbb P^{X^1}\otimes\cdots\otimes\mathbb P^{X^k}) + D_1(\nu) + D_2(\nu) + R_n.$$
Straightforward calculation yields the following facts:

• $\mathbb E(R_n)^2 \lesssim_k n^{-3} \bigl( \mathbb{E}G^2_\nu(X_1,X_2) + \prod_{l=1}^k \mathbb{E}G^2_\nu(X^l_1,X^l_2) \bigr)$;

• under the null hypothesis, $D_1(\nu) = 0$ and
$$D_2(\nu) = \frac{1}{n(n-1)} \sum_{1\le i\ne j\le n} G^*_\nu(X_i,X_j),$$
where
$$G^*_\nu(x,y) = \bar G_\nu(x,y) - \sum_{1\le j\le k} g_j(x^j, y) - \sum_{1\le j\le k} g_j(y^j, x) + \sum_{1\le j_1,j_2\le k} g_{j_1,j_2}(x^{j_1}, y^{j_2}).$$
Proof of Lemma 1. Observe that under $H_0$,
$$\mathrm{var}\bigl( \gamma^2_\nu(\hat{\mathbb P},\, \hat{\mathbb P}^{X^1}\otimes\cdots\otimes\hat{\mathbb P}^{X^k}) \bigr) = \mathbb E(D_2(\nu))^2 + \mathbb E(R_n)^2 = \frac{2}{n(n-1)}\, \mathbb E[G^*_\nu(X_1,X_2)]^2 + \mathbb E(R_n)^2,$$
$$\mathbb E(R_n)^2 \lesssim_k n^{-3}\, \mathbb{E}G^2_\nu(X_1,X_2),$$
and
$$\begin{aligned}
\mathbb E[G^*_\nu(X_1,X_2)]^2 &= \mathbb E\left( \bar G_\nu(X_1,X_2) - \sum_{1\le j\le k} g_j(X^j_1, X_2) \right)^2 - \mathbb E\left( \sum_{1\le j\le k} g_j(X^j_2, X_1) + \sum_{1\le j_1,j_2\le k} g_{j_1,j_2}(X^{j_1}_1, X^{j_2}_2) \right)^2\\
&= \mathbb E\bar G^2_\nu(X_1,X_2) - 2 \sum_{1\le j\le k} \mathbb E\bigl( g_j(X^j_1, X_2) \bigr)^2 + \sum_{1\le j_1,j_2\le k} \mathbb E\bigl( g_{j_1,j_2}(X^{j_1}_1, X^{j_2}_2) \bigr)^2.
\end{aligned}$$
Together, these facts conclude the proof.
Below we further expand $\mathbb E\bar G^2_\nu(X_1,X_2)$, $\mathbb E\bigl(g_j(X^j_1,X_2)\bigr)^2$ and $\mathbb E\bigl(g_{j_1,j_2}(X^{j_1}_1,X^{j_2}_2)\bigr)^2$ from Lemma 1, based on which a consistent estimator of $\mathrm{var}\bigl(\gamma^2_\nu(\hat{\mathbb P},\, \hat{\mathbb P}^{X^1}\otimes\cdots\otimes\hat{\mathbb P}^{X^k})\bigr)$ can be derived naturally.
First,
$$\begin{aligned}
\mathbb E\bar G^2_\nu(X_1,X_2) &= \mathbb{E}G^2_\nu(X_1,X_2) - 2\,\mathbb{E}G_\nu(X_1,X_2)G_\nu(X_1,X_3) + \bigl(\mathbb{E}G_\nu(X_1,X_2)\bigr)^2\\
&= \prod_{1\le l\le k} \mathbb{E}G^2_\nu(X^l_1,X^l_2) - 2 \prod_{1\le l\le k} \mathbb{E}G_\nu(X^l_1,X^l_2)G_\nu(X^l_1,X^l_3) + \prod_{1\le l\le k} \bigl(\mathbb{E}G_\nu(X^l_1,X^l_2)\bigr)^2.
\end{aligned}$$
Second,
$$\begin{aligned}
\mathbb E\bigl( g_j(X^j_1, X_2) \bigr)^2 &= \mathbb{E}G^2_\nu(X^j_1,X^j_2) \cdot \prod_{l\ne j} \mathbb{E}G_\nu(X^l_1,X^l_2)G_\nu(X^l_1,X^l_3) - \prod_{1\le l\le k} \mathbb{E}G_\nu(X^l_1,X^l_2)G_\nu(X^l_1,X^l_3)\\
&\quad - \mathbb{E}G_\nu(X^j_1,X^j_2)G_\nu(X^j_1,X^j_3) \cdot \prod_{l\ne j} \bigl(\mathbb{E}G_\nu(X^l_1,X^l_2)\bigr)^2 + \prod_{1\le l\le k} \bigl(\mathbb{E}G_\nu(X^l_1,X^l_2)\bigr)^2.
\end{aligned}$$
Hence
$$\begin{aligned}
\sum_{1\le j\le k} \mathbb E\bigl( g_j(X^j_1, X_2) \bigr)^2 &= \left( \prod_{1\le l\le k} \mathbb{E}G_\nu(X^l_1,X^l_2)G_\nu(X^l_1,X^l_3) \right) \left( \sum_{1\le j\le k} \frac{\mathbb{E}G^2_\nu(X^j_1,X^j_2)}{\mathbb{E}G_\nu(X^j_1,X^j_2)G_\nu(X^j_1,X^j_3)} - k \right)\\
&\quad - \left( \prod_{1\le l\le k} \bigl(\mathbb{E}G_\nu(X^l_1,X^l_2)\bigr)^2 \right) \left( \sum_{1\le j\le k} \frac{\mathbb{E}G_\nu(X^j_1,X^j_2)G_\nu(X^j_1,X^j_3)}{\bigl(\mathbb{E}G_\nu(X^j_1,X^j_2)\bigr)^2} - k \right).
\end{aligned}$$
Finally,
$$\mathbb E\bigl( g_{j_1,j_2}(X^{j_1}_1, X^{j_2}_2) \bigr)^2 = \begin{cases} \mathbb E\bigl( \bar G_\nu(X^{j_1}_1, X^{j_1}_2) \bigr)^2 \cdot \prod_{l\ne j_1} \bigl(\mathbb{E}G_\nu(X^l_1,X^l_2)\bigr)^2, & j_1 = j_2,\\[6pt] \prod_{l\in\{j_1,j_2\}} \Bigl( \mathbb{E}G_\nu(X^l_1,X^l_2)G_\nu(X^l_1,X^l_3) - \bigl(\mathbb{E}G_\nu(X^l_1,X^l_2)\bigr)^2 \Bigr) \prod_{l\ne j_1,j_2} \bigl(\mathbb{E}G_\nu(X^l_1,X^l_2)\bigr)^2, & j_1 \ne j_2. \end{cases}$$
Hence
$$\begin{aligned}
\sum_{1\le j_1,j_2\le k} \mathbb E\bigl( g_{j_1,j_2}(X^{j_1}_1, X^{j_2}_2) \bigr)^2 = \left( \prod_{1\le l\le k} \bigl(\mathbb{E}G_\nu(X^l_1,X^l_2)\bigr)^2 \right) \Biggl( &\sum_{1\le j_1\le k} \frac{\mathbb E\bigl( \bar G_\nu(X^{j_1}_1, X^{j_1}_2) \bigr)^2}{\bigl(\mathbb{E}G_\nu(X^{j_1}_1,X^{j_1}_2)\bigr)^2}\\
&+ \sum_{1\le j_1\ne j_2\le k}\; \prod_{l\in\{j_1,j_2\}} \left( \frac{\mathbb{E}G_\nu(X^l_1,X^l_2)G_\nu(X^l_1,X^l_3)}{\bigl(\mathbb{E}G_\nu(X^l_1,X^l_2)\bigr)^2} - 1 \right) \Biggr).
\end{aligned}$$
Then the consistent estimator $s^2_{n,\nu}$ of $\mathbb E\bigl(G^*_\nu(X_1,X_2)\bigr)^2$ is constructed by replacing
$$\mathbb{E}G^2_\nu(X^l_1,X^l_2), \qquad \mathbb{E}G_\nu(X^l_1,X^l_2)G_\nu(X^l_1,X^l_3), \qquad \bigl(\mathbb{E}G_\nu(X^l_1,X^l_2)\bigr)^2$$
in the above expansions of
$$\mathbb E\bar G^2_\nu(X_1,X_2), \qquad \sum_{1\le j\le k} \mathbb E\bigl(g_j(X^j_1,X_2)\bigr)^2, \qquad \sum_{1\le j_1,j_2\le k} \mathbb E\bigl(g_{j_1,j_2}(X^{j_1}_1,X^{j_2}_2)\bigr)^2$$
with the corresponding unbiased estimators
$$\frac{1}{n(n-1)} \sum_{1\le i\ne j\le n} G^2_{\nu_n}(X^l_i, X^l_j), \qquad \frac{(n-3)!}{n!} \sum_{\substack{1\le i,j_1,j_2\le n\\ |\{i,j_1,j_2\}|=3}} G_{\nu_n}(X^l_i, X^l_{j_1})\, G_{\nu_n}(X^l_i, X^l_{j_2}),$$
$$\frac{(n-4)!}{n!} \sum_{\substack{1\le i_1,i_2,j_1,j_2\le n\\ |\{i_1,i_2,j_1,j_2\}|=4}} G_{\nu_n}(X^l_{i_1}, X^l_{j_1})\, G_{\nu_n}(X^l_{i_2}, X^l_{j_2})$$
for $1 \le l \le k$. Again, to avoid a negative estimate of the variance, we replace $s^2_{n,\nu_n}$ with $1/n^2$ whenever it is negative or too small. Namely, let
$$\hat s^2_{n,\nu_n} = \max\bigl\{ s^2_{n,\nu_n},\, 1/n^2 \bigr\},$$
and estimate $\mathrm{var}\bigl(\gamma^2_\nu(\hat{\mathbb P},\, \hat{\mathbb P}^{X^1}\otimes\cdots\otimes\hat{\mathbb P}^{X^k})\bigr)$ by $2\hat s^2_{n,\nu}\big/\bigl(n(n-1)\bigr)$.
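Each of the three unbiased estimators above can be computed in $O(n^2)$ time from the Gram matrix of a single component, using standard counting identities for sums over distinct index tuples. The sketch below is our own illustration (the function name and the vectorized identities are not from the text, though the identities are routine U-statistic algebra):

```python
import numpy as np

def u_stat_moments(g):
    """Given an n x n symmetric Gram matrix g with entries G(X_i, X_j), return
    unbiased estimates of E G(X1,X2)^2, E G(X1,X2)G(X1,X3), (E G(X1,X2))^2,
    i.e. the three quantities that are plugged into the expansions of s^2_{n,nu}."""
    n = g.shape[0]
    g0 = g - np.diag(np.diag(g))        # zero out the diagonal
    s1 = g0.sum()                       # sum over i != j of G_ij
    s2 = (g0 ** 2).sum()                # sum over i != j of G_ij^2
    row = g0.sum(axis=1)
    # sum over distinct triples (i, j1, j2) of G_{i j1} G_{i j2}
    t3 = (row ** 2).sum() - s2
    # sum over distinct quadruples (i1, i2, j1, j2) of G_{i1 j1} G_{i2 j2}
    t4 = s1 ** 2 - 4 * t3 - 2 * s2
    m_sq = s2 / (n * (n - 1))
    m_cross = t3 / (n * (n - 1) * (n - 2))
    m_mean_sq = t4 / (n * (n - 1) * (n - 2) * (n - 3))
    return m_sq, m_cross, m_mean_sq
```

For component $l$, `g` would be the matrix $\{G_{\nu_n}(X^l_i, X^l_j)\}_{i,j}$; the resulting moments are substituted into the expansions above, and the truncation $\hat s^2_{n,\nu_n} = \max\{s^2_{n,\nu_n}, 1/n^2\}$ is applied at the end.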
Therefore, for general $k$, the single-kernel test statistic and the adaptive test statistic are constructed as
$$T^{\mathrm{IND}}_{n,\nu_n} = \frac{n}{\sqrt 2}\, \hat s^{-1}_{n,\nu_n}\, \gamma^2_{\nu_n}(\hat{\mathbb P},\, \hat{\mathbb P}^{X^1}\otimes\cdots\otimes\hat{\mathbb P}^{X^k}) \qquad\text{and}\qquad T^{\mathrm{IND(adapt)}}_n = \max_{1\le\nu_n\le n^{2/d}} T^{\mathrm{IND}}_{n,\nu_n},$$
respectively. Accordingly, $\Phi^{\mathrm{IND}}_{n,\nu_n,\alpha}$ and $\Phi^{\mathrm{IND(adapt)}}$ can be constructed as in the case of $k = 2$.
B.5 Theoretical Properties of Independence Tests for General 𝑘
In this section, with $\Phi^{\mathrm{IND}}_{n,\nu_n,\alpha}$ and $\Phi^{\mathrm{IND(adapt)}}$ constructed in Appendix B.4 for general $k$, we confirm that Theorems 12, 13 and 16 still hold. We shall only emphasize the main differences between the new proofs and the original proofs for the case $k = 2$.

Under the null hypothesis: we only need to re-establish that $s^2_{n,\nu_n}$ is a consistent estimator of $\mathbb E[G^*_{\nu_n}(X_1,X_2)]^2$. Specifically, we show that
$$s^2_{n,\nu_n} \big/ \mathbb E[G^*_{\nu_n}(X_1,X_2)]^2 \to_p 1$$
given $1 \ll \nu_n \ll n^{4/d}$ for Theorem 12, and
$$\sup_{1\le\nu_n\le n^{2/d}} \bigl| s^2_{n,\nu_n} \big/ \mathbb E[G^*_{\nu_n}(X_1,X_2)]^2 - 1 \bigr| = o_p(1)$$
for Theorem 16.
To prove the former, since
$$\frac{\mathbb E[G^*_{\nu_n}(X_1,X_2)]^2}{(\pi/(2\nu_n))^{d/2}\, \|p\|^2_{L^2}} \to 1$$
as $\nu_n \to \infty$, it suffices to show
$$\nu_n^{d/2}\, \bigl| s^2_{n,\nu_n} - \mathbb E[G^*_{\nu_n}(X_1,X_2)]^2 \bigr| = o_p(1),$$
which follows considering that
$$\nu_n^{d_l/2}\, \mathbb{E}G^2_{\nu_n}(X^l_1,X^l_2), \qquad \nu_n^{d/2}\, \mathbb{E}G_{\nu_n}(X^l_1,X^l_2)G_{\nu_n}(X^l_1,X^l_3), \qquad \nu_n^{d_l/2}\, \bigl(\mathbb{E}G_{\nu_n}(X^l_1,X^l_2)\bigr)^2 \tag{B.8}$$
are all bounded and are estimated consistently by their corresponding estimators. For example,
$$\nu_n^{d_l/2}\, \mathbb{E}G^2_{\nu_n}(X^l_1,X^l_2) \to (\pi/2)^{d_l/2}\, \|p_l\|^2_{L^2}$$
and
$$\begin{aligned}
\nu_n^{d_l}\, \mathbb E\left( \frac{1}{n(n-1)} \sum_{1\le i\ne j\le n} G^2_{\nu_n}(X^l_i,X^l_j) - \mathbb{E}G^2_{\nu_n}(X^l_1,X^l_2) \right)^2 &= \nu_n^{d_l}\, \mathrm{var}\left( \frac{1}{n(n-1)} \sum_{1\le i\ne j\le n} G^2_{\nu_n}(X^l_i,X^l_j) \right)\\
&\lesssim \nu_n^{d_l} \Bigl( n^{-1}\, \mathbb{E}G^2_{\nu_n}(X^l_1,X^l_2)G^2_{\nu_n}(X^l_1,X^l_3) + n^{-2}\, \mathbb{E}G^4_{\nu_n}(X^l_1,X^l_2) \Bigr)\\
&\lesssim_{d_l} n^{-1}\, \nu_n^{d_l/4}\, \|p_l\|^3_{L^2} + n^{-2}\, \nu_n^{d_l/2}\, \|p_l\|^2_{L^2} \to 0.
\end{aligned}$$
The proof of the latter is similar. It suffices to have:

• each term in (B.8) is bounded for $\nu_n \in [1,\infty)$, which follows immediately since each term is continuous in $\nu_n$ and converges as $\nu_n \to \infty$;

• the difference between each term in (B.8) and its corresponding estimator converges to 0 uniformly over $\nu_n \in [1, n^{2/d}]$; the proof of this is the same as that of Lemma 5.
Under the alternative hypothesis: we only need to re-establish that $\hat s_{n,\nu_n}$ is bounded. Specifically, we show
$$\inf_{p\in H^{\mathrm{IND}}_1(\Delta_{n,s})} \frac{n\, \gamma^2_{\nu_n}(\mathbb P,\, \mathbb P^{X^1}\otimes\cdots\otimes\mathbb P^{X^k})}{\Bigl[ \mathbb E\bigl( \hat s^2_{n,\nu_n} \bigr)^{1/k} \Bigr]^{k/2}} \to \infty$$
for Theorem 13, and
$$\inf_{s\ge d/4}\; \inf_{p\in H^{\mathrm{IND}}_1(\Delta_{n,s};\, s)} P\Bigl( \hat s^2_{n,\nu_n(s)'} \le 2M^2 \bigl( 2\nu_n(s)'/\pi \bigr)^{-d/2} \Bigr) \to 1 \tag{B.9}$$
for Theorem 16, where $\nu_n(s)' = (\log\log n / n)^{-4/(4s+d)}$.
The former one holds because
E(��2𝑛,a𝑛
)1/𝑘≤ E
(max
{��𝑠2𝑛,a�� , 1/𝑛2})1/𝑘
≤ E��𝑠2𝑛,a��1/𝑘 + 𝑛−2/𝑘
.𝑘
(𝑘∏𝑙=1E𝐺2a𝑛 (𝑋 𝑙1, 𝑋
𝑙2)
)1/𝑘
+ 𝑛−2/𝑘
≤(𝑀2(𝜋/(2a𝑛))𝑑/2
)1/𝑘+ 𝑛−2/𝑘 .
where the second to last inequality follows from generalized Hölder’s inequality. For example,
E©«𝑘∏𝑙=1
1𝑛(𝑛 − 1)
∑1≤𝑖≠ 𝑗≤𝑛
𝐺2a𝑛 (𝑋 𝑙𝑖 , 𝑋 𝑙𝑗 )ª®¬
1/𝑘
≤(𝑘∏𝑙=1E𝐺2a𝑛 (𝑋 𝑙1, 𝑋
𝑙2)
)1/𝑘
.
To prove the latter, note that for $\nu_n = \nu_n(s)'$, all three terms in (B.8) are bounded by $M^2_l (\pi/2)^{d_l/2}$ and the variances of their corresponding estimators are bounded by
$$C(d_l) \Bigl( n^{-1} \bigl( \nu_n(s)' \bigr)^{d_l/4} M^3_l + n^{-2} \bigl( \nu_n(s)' \bigr)^{d_l/2} M^2_l \Bigr) = o(1)$$
uniformly over all $s$.
uniformly over all 𝑠. Therefore,
inf𝑠≥𝑑/4
inf𝑝∈𝐻IND
1 (Δ𝑛,𝑠;𝑠)𝑃
((a𝑛 (𝑠)′)𝑑/2
���𝑠2𝑛,a𝑛 (𝑠) ′ − E[𝐺∗a𝑛 (𝑠) ′ (𝑌1, 𝑌2)]2
��� ≤ 𝑀2(𝜋/2)𝑑/2)→ 1
123
where 𝑌1, 𝑌2 ∼iid P𝑋1 ⊗ · · · ⊗ P𝑋 𝑘 . Further considering that
E[𝐺∗a𝑛 (𝑠) ′ (𝑌1, 𝑌2)]2 ≤ E[��a𝑛 (𝑠) ′ (𝑌1, 𝑌2)]2 ≤ 𝑀2(𝜋/(2a𝑛 (𝑠)′))𝑑/2
and that
1/𝑛2 = 𝑜((a𝑛 (𝑠)′)−𝑑/2)
uniformly over all 𝑠, we prove (B.9).