On the Construction of Minimax Optimal Nonparametric Tests with Kernel Embedding Methods
Tong Li
Submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy under the Executive Committee
of the Graduate School of Arts and Sciences
COLUMBIA UNIVERSITY
2021
© 2021
Tong Li
All Rights Reserved
Abstract
On the Construction of Minimax Optimal Nonparametric Tests with Kernel Embedding Methods
Tong Li
Kernel embedding methods have witnessed a great deal of practical success in the area of
nonparametric hypothesis testing in recent years. Yet ever since their introduction, one question
has persisted: which kernel should be selected? The performance of the associated nonparametric
tests can vary dramatically with the choice of kernel, and kernel selection is usually done in an
ad hoc manner. We ask whether there is a principled way to select the kernel so as to ensure that
the associated nonparametric tests perform well. As consistency results against fixed alternatives do not tell the full story
about the power of the associated tests, we study their statistical performance within the minimax
framework. First, focusing on the case of goodness-of-fit tests, our analyses show that a vanilla
version of the kernel embedding based test could be suboptimal, and suggest a simple remedy by
moderating the kernel. We prove that the moderated approach provides optimal tests for a wide
range of deviations from the null and can also be made adaptive over a large collection of
interpolation spaces. Then, we study the asymptotic properties of goodness-of-fit, homogeneity
and independence tests using Gaussian kernels, arguably the most popular and successful among
such tests. Our results provide theoretical justifications for this common practice by showing that
tests using a Gaussian kernel with an appropriately chosen scaling parameter are minimax
optimal against smooth alternatives in all three settings. In addition, our analysis also pinpoints
the importance of choosing a diverging scaling parameter when using Gaussian kernels and
suggests a data-driven choice of the scaling parameter that yields tests optimal, up to an iterated
logarithmic factor, over a wide range of smooth alternatives. Numerical experiments are
presented to further demonstrate the practical merits of our methodology.
Table of Contents
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Kernel Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Nonparametric Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Minimax Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Kernel Selection and Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Chapter 2: Moderated Kernel Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Background and Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Operating Characteristics of MMD Based Test . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Asymptotics under 𝐻_0^GOF . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 Power Analysis for MMD Based Tests . . . . . . . . . . . . . . . . . . . . 12
2.3 Optimal Tests Based on Moderated MMD . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Moderated MMD Test Statistic . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.2 Operating Characteristics of η²_{𝜚_n}(P_n, P_0) Based Tests . . . . . . . . . . 15
2.3.3 Minimax Optimality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Chapter 3: Gaussian Kernel Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1 Test for Goodness-of-fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Test for Homogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Test for Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4 Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4.1 Test for Goodness-of-fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4.2 Test for Homogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4.3 Test for Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Chapter 4: Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1 Effect of Scaling Parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Efficacy of Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3 Data Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Chapter 5: Conclusion and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Chapter 6: Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Appendix A: Some Technical Results and Proofs Related to Chapter 2 . . . . . . . . . . . 102
Appendix B: Some Technical Results and Proofs Related to Chapter 3 . . . . . . . . . . . 104
B.1 Properties of Gaussian Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
B.2 Proof of Lemma 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
B.3 Proof of Lemma 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
B.4 Decomposition of dHSIC and Its Variance Estimation . . . . . . . . . . . . . . . . 116
B.5 Theoretical Properties of Independence Tests for General 𝑘 . . . . . . . . . . . . . 121
List of Tables
4.1 Frequency that each DAG in Figure 4.4 was selected by three tests. . . . . . . . . . 42
List of Figures
4.1 Observed power against log(a) in Experiment I (left) and Experiment II (right). . . 38
4.2 Observed power versus sample size in Experiment III for 𝑑 = 1, 10, 100, 1000 from
left to right. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3 Observed power versus sample size in Experiment IV for 𝑑 = 2, 10, 100, 1000 from
left to right. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4 DAGs with the top 3 highest probabilities of being selected. . . . . . . . . . . . . . 42
Acknowledgements
First of all, I want to express my sincere gratitude to my Ph.D. advisor Prof. Ming Yuan. His
guidance and support are invaluable to my research and my life. I have learned a lot from his deep
insights and decent work ethic.
I am very grateful to Prof. Bodhisattva Sen, Prof. Victor de la Pena, Prof. Cynthia Rush and
Prof. Bin Cheng for serving on the committee. Their comments and feedback were very
helpful in revising my thesis and have inspired directions for future research.
I also want to thank my close friends and my fellow students, Yi Li, Youran Qi, Chensheng
Kuang, Shulei Wang, Cuize Han, Fan Gao, Luxi Cao, Yuan Li, Yuqing Xu, Chaoyu Yuan, Guanhua
Fang, Yuanzhe Xu and Shun Xu. Their company has made this journey much more pleasurable.
Their suggestions and help have been indispensable during the hard moments of my life.
Finally, I want to thank my parents Xiuyuan Li and Ping Zhou. They have always been sup-
porting and encouraging me. I owe a lot to them.
Dedication
To my beloved parents who always give me unconditional support and encouragement.
Chapter 1: Introduction
1.1 Kernel Embedding
Tests for goodness-of-fit, homogeneity and independence are central to statistical inference.
Numerous techniques have been developed for these tasks and are routinely used in practice. In
recent years, there has been renewed interest in them from both statistics and related fields, as they
arise naturally in many modern applications where the performance of classical methods is less
than satisfactory. In particular, nonparametric inference via the embedding of distributions into
a reproducing kernel Hilbert space (RKHS) have emerged as a popular and powerful technique to
tackle these challenges. The approach immediately allows for easy access to the rich machinery
for RKHS and has found great successes in a wide range of applications from causal discovery to
deep learning. See, e.g., Muandet et al. (2017) for a recent review.
More specifically, let 𝐾 (·, ·) be a symmetric and positive definite function defined over X ×X,
that is 𝐾 (𝑥, 𝑦) = 𝐾 (𝑦, 𝑥) for all 𝑥, 𝑦 ∈ X, and the Gram matrix [𝐾 (𝑥𝑖, 𝑥 𝑗 )]1≤𝑖, 𝑗≤𝑛 is positive
definite for any distinct 𝑥1, . . . , 𝑥𝑛 ∈ X. The Moore-Aronszajn theorem ensures that such a
function, referred to as a kernel, can always be uniquely identified with an RKHS H_K of functions
over X. The embedding
μ_P(·) := ∫_X K(x, ·) P(dx),
maps a probability distribution P into H𝐾 . The difference between two probability distributions P
and Q can then be conveniently measured by
γ_K(P, Q) := ‖μ_P − μ_Q‖_{H_K}.
Under mild regularity conditions, it can be shown that 𝛾𝐾 (P,Q) is an integral probability metric so
that it is zero if and only if P = Q, and
γ_K(P, Q) = sup_{f ∈ H_K : ‖f‖_{H_K} ≤ 1} ∫_X f d(P − Q).
As such, γ_K(P, Q) is often referred to as the maximum mean discrepancy (MMD) between P and Q.
See, e.g., Sriperumbudur et al. (2010) or Gretton et al. (2012a) for details. In what follows, we shall
drop the subscript 𝐾 whenever its choice is clear from the context. It was noted recently that MMD
is also closely related to the so-called energy distance between random variables (Székely et al.,
2007; Székely and Rizzo, 2009) commonly used to measure independence. See, e.g., Sejdinovic
et al. (2013) and Lyons (2013).
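For a computational view, the squared MMD admits a closed form in kernel evaluations, γ²(P, Q) = E K(X, X′) + E K(Y, Y′) − 2 E K(X, Y) with X, X′ ∼ P and Y, Y′ ∼ Q, so its empirical (V-statistic) version is straightforward to evaluate. The sketch below is purely illustrative: the Gaussian kernel and the unit bandwidth are arbitrary choices, not prescriptions from this thesis.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel K(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def mmd_squared(X, Y, bandwidth=1.0):
    """V-statistic estimate of the squared MMD between samples X and Y."""
    Kxx = gaussian_kernel(X, X, bandwidth)
    Kyy = gaussian_kernel(Y, Y, bandwidth)
    Kxy = gaussian_kernel(X, Y, bandwidth)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 1))
Y = rng.normal(1.0, 1.0, size=(200, 1))  # mean-shifted alternative
print(mmd_squared(X, X))  # identical samples: 0 (up to floating point)
print(mmd_squared(X, Y))  # shifted samples: clearly positive
```

Identical samples give (numerically) zero, while a mean-shifted alternative yields a clearly positive value; how sensitive this discrepancy is to a given departure is governed by the kernel, which is exactly the kernel selection issue studied in the following chapters.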
1.2 Nonparametric Hypothesis Testing
Given a sample from P and/or Q, estimates of γ(P, Q) can be derived by replacing P and
Q with their respective empirical distributions. These estimates can subsequently be used for
nonparametric hypothesis testing. Here are several notable examples that we shall focus on in this
work.
Goodness-of-fit tests. The goal of goodness-of-fit tests is to check if a sample comes from
a pre-specified distribution. Let 𝑋1, · · · , 𝑋𝑛 be 𝑛 independent X-valued samples from a certain
distribution P. We are interested in testing whether the hypothesis 𝐻_0^GOF : P = P_0 holds for a fixed P_0.
Deviation from P_0 can be conveniently measured by γ(P, P_0), which can be readily estimated by
γ(P_n, P_0) := sup_{f ∈ H_K : ‖f‖_{H_K} ≤ 1} ∫_X f d(P_n − P_0),
where P𝑛 is the empirical distribution of 𝑋1, · · · , 𝑋𝑛. A natural procedure is to reject 𝐻0 if the
estimate exceeds a threshold calibrated to ensure a certain significance level, say 𝛼 (0 < 𝛼 < 1).
Homogeneity tests. Homogeneity tests check if two independent samples come from a com-
mon population. Given two independent samples 𝑋1, · · · , 𝑋𝑛 ∼iid P and 𝑌1, · · · , 𝑌𝑚 ∼iid Q, we are
interested in testing whether the null hypothesis 𝐻_0^HOM : P = Q holds. Discrepancy between P and Q can
be measured by 𝛾(P,Q), and similar to before, it can be estimated by the MMD between P𝑛 and
Q𝑚:
γ(P_n, Q_m) := sup_{f ∈ H_K : ‖f‖_{H_K} ≤ 1} ∫_X f d(P_n − Q_m).
Again we reject 𝐻0 if the estimate exceeds a threshold calibrated to ensure a certain significance
level.
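In practice this threshold is often calibrated by permutation: under 𝐻_0^HOM the pooled sample is exchangeable, so randomly re-splitting it reproduces the null distribution of the statistic. The following is a minimal, self-contained sketch; the function names are illustrative and the Gaussian kernel with unit bandwidth is an arbitrary choice.

```python
import numpy as np

def mmd2(X, Y, bw=1.0):
    """V-statistic estimate of the squared MMD with a Gaussian kernel."""
    def gram(a, b):
        d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2 * bw ** 2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean()

def permutation_test(X, Y, alpha=0.05, n_perm=200, seed=0):
    """Reject H0: P = Q if the observed MMD^2 exceeds the permutation (1-alpha)-quantile."""
    rng = np.random.default_rng(seed)
    n = len(X)
    pooled = np.vstack([X, Y])
    stat = mmd2(X, Y)
    null_stats = []
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))  # re-split the pooled sample at random
        null_stats.append(mmd2(pooled[idx[:n]], pooled[idx[n:]]))
    return stat > np.quantile(null_stats, 1 - alpha)

rng = np.random.default_rng(1)
same = permutation_test(rng.normal(size=(100, 1)), rng.normal(size=(100, 1)))
diff = permutation_test(rng.normal(size=(100, 1)), rng.normal(2.0, 1.0, size=(100, 1)))
print(same, diff)
```

With a mean shift of 2 the statistic lies far above the permutation quantile, while two samples from the same distribution are rejected only with probability roughly α.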
Independence tests. How to measure or test for independence among a set of random variables
is another classical problem in statistics. Let X = (X^1, . . . , X^k)^⊤ ∈ X_1 × · · · × X_k be a random
vector. If the random vectors X^1, . . . , X^k are jointly independent, then the distribution of X can be
factorized:

𝐻_0^IND : P_X = P_{X^1} ⊗ · · · ⊗ P_{X^k}.
Dependence among 𝑋1, . . . , 𝑋 𝑘 can be naturally measured by the difference between the joint
distribution and the product distribution evaluated under MMD:
γ(P_X, P_{X^1} ⊗ · · · ⊗ P_{X^k}) = ‖μ_{P_X} − μ_{P_{X^1} ⊗ ··· ⊗ P_{X^k}}‖_{H_K}.
When k = 2, γ²(P_X, P_{X^1} ⊗ P_{X^2}) can be expressed as the squared Hilbert-Schmidt norm of the
cross-covariance operator associated with X^1 and X^2, and is therefore referred to as the Hilbert-Schmidt
independence criterion (HSIC; Gretton et al., 2005). The more general case as given above is
sometimes referred to as dHSIC (see, e.g., Pfister et al., 2018). As before, we reject the
independence assumption when γ(P_n^X, P_n^{X^1} ⊗ · · · ⊗ P_n^{X^k}) exceeds a certain threshold,
where P_n^X and P_n^{X^j} are the empirical distributions of X and X^j, respectively.
1.3 Minimax Framework
In all these cases the test statistic, namely γ²(P_n, P_0), γ²(P_n, Q_m) or γ²(P_n^X, P_n^{X^1} ⊗ · · · ⊗ P_n^{X^k}),
is a V-statistic. Following standard asymptotic theory for V-statistics (see, e.g., Serfling, 2009), it
can be shown that under mild regularity conditions, when appropriately scaled by the sample size,
they converge to a mixture of χ²_1 distributions with weights determined jointly by the underlying
probability distribution and the choice of kernel K. In contrast, it can also be derived that for a
fixed alternative,
γ²(P_n, P_0) →_p γ²(P, P_0),  γ²(P_n, Q_m) →_p γ²(P, Q),
and γ²(P_n^X, P_n^{X^1} ⊗ · · · ⊗ P_n^{X^k}) →_p γ²(P_X, P_{X^1} ⊗ · · · ⊗ P_{X^k}).
This immediately suggests that all aforementioned tests are consistent against fixed alternatives in
that their power tends to one as sample sizes increase. Although useful, such consistency results
do not tell the full story about the power of these tests, and if there are yet more powerful methods.
Specifically, although consistency against any fixed alternative is established, the rate at which the
power converges to 1 may vary across alternatives, and it remains unclear whether a given sample
size is large enough to ensure good power against the underlying alternative. In other
words, can we detect the difference between the null and the alternative hypotheses with a high
probability even in the worst scenario?
This concern naturally brings in the notion of uniform consistency, meaning power converging to
1 uniformly over all alternatives under consideration, and leads us to adopt the minimax hypothesis
testing framework pioneered by Burnashev (1979), Ingster (1987), and Ingster (1993). See also
Ermakov (1991), Spokoiny (1996), Lepski and Spokoiny (1999), Ingster and Suslina (2000),
Ingster (2000), Baraud (2002), Fromont and Laurent (2006), Fromont et al. (2012), and Fromont et
al. (2013), and references therein. Within this framework, we consider testing against alternatives
getting closer and closer to the null hypothesis as the sample size increases. The smallest departure
from the null hypothesis that can be detected consistently, in a minimax sense, is referred to as
the optimal detection boundary, and a test that maintains uniform consistency as the departure
converges to 0 at the rate of the optimal detection boundary is called minimax rate optimal.
1.4 Kernel Selection and Adaptation
The critical importance of kernel selection is widely recognized in practice, as the statistical
performance of the associated tests can vary dramatically with different kernels. Yet the way
it is done is usually ad hoc, and how to do so in a more principled way remains one of the chief
practical challenges. See, e.g., Gretton et al. (2008), Fukumizu et al. (2009), Gretton et al. (2012b),
and Sutherland et al. (2017). In the following chapters, we address this problem by proposing two
kernel selection methods in different settings such that the associated tests are shown to be minimax
rate optimal.
However, such kernel selection methods depend on regularity conditions on the underlying
space of probability distributions, and whether kernel selection can be done in an agnostic manner
remains another challenge. This naturally brings about the issue of adaptation. To address it,
we introduce a simple testing procedure that maximizes a normalized MMD over a pre-specified
class of kernels. A similar idea of maximizing MMD over a class of kernels was first introduced
by Sriperumbudur et al. (2009). Our analysis, however, suggests that it is more desirable to
maximize the normalized MMD instead. More specifically, we show that the proposed procedure can attain
the optimal rate, up to an iterated logarithmic factor, over spaces of probability distributions with
different regularity conditions.
The rest of the thesis is organized as follows. In Chapter 2, focusing on the case of goodness-
of-fit tests, our analyses show that a vanilla version of the kernel embedding based test could be
suboptimal, and suggest a simple remedy by moderating the kernel. We prove that the moderated
approach provides optimal tests and can also be made adaptive over a wide range of deviations from
the null. Then, in Chapter 3 we study the asymptotic properties of goodness-of-fit, homogeneity
and independence tests using Gaussian kernels, arguably the most popular and successful among
such tests. Our results provide theoretical justifications for this common practice by showing that
tests using a Gaussian kernel with an appropriately chosen scaling parameter are minimax optimal
against smooth alternatives in all three settings. In addition, we suggest a data-driven choice of the
scaling parameter that yields tests optimal, up to an iterated logarithmic factor, over a wide range
of smooth alternatives. Numerical experiments are presented in Chapter 4 to further demonstrate
the practical merits of our methodology. We conclude with some summary discussion in Chapter
5. All the main proofs are relegated to Chapter 6. Other technical results and their proofs are
collected in the appendices.
Chapter 2: Moderated Kernel Embedding
2.1 Background and Problem Setting
In this chapter, we focus on goodness-of-fit testing. Specifically, with n independent X-valued
samples X_1, · · · , X_n from a certain distribution P, we are interested in testing whether the hypothesis

𝐻_0^GOF : P = P_0

holds for a fixed P_0. Problems of this kind have a long and illustrious history in statistics and are
often associated with household names such as the Kolmogorov-Smirnov test, Pearson’s chi-square
test or Neyman’s smooth test. A plethora of other techniques have also been proposed over the
years in both parametric and nonparametric settings (e.g., Ingster and Suslina, 2003; Lehmann
and Romano, 2008). Most of the existing techniques are developed with the domain X = R or
[0, 1] in mind and work the best in these cases. Modern applications, however, oftentimes involve
domains different from these traditional ones. For example, when dealing with directional data,
which arise naturally in applications such as diffusion tensor imaging, it is natural to consider X
as the unit sphere in R3 (e.g., Jupp, 2005). Another example occurs in the context of ranking or
preference data (e.g., Ailon et al., 2008). In these cases, X can be taken as the group of permuta-
tions. Furthermore, motivated by several applications, combinatorial testing problems have been
investigated recently (e.g., Addario-Berry et al., 2010), where the spaces under consideration are
specific combinatorially structured spaces.
We consider kernel embedding and maximum mean discrepancy (MMD) based goodness-of-fit
tests. Specifically, the goodness-of-fit test can be carried out conveniently by first constructing an
estimate γ(P_n, P_0) of γ(P, P_0), and then rejecting 𝐻_0^GOF if the estimate exceeds a threshold calibrated
to ensure a certain significance level, say 𝛼 (0 < 𝛼 < 1).
We adopt the minimax framework to evaluate the above-mentioned testing strategy. To fix
ideas, we assume in this chapter that P is dominated by P_0 under the alternative so that the Radon-
Nikodym derivative dP/dP_0 is well defined, and use the χ² divergence between P and P_0,
χ²(P, P_0) := ∫_X (dP/dP_0)² dP_0 − 1,
as the separation metric to quantify the departure from the null hypothesis. We are particularly
interested in the detection boundary, namely how close P and P0 can be in terms of 𝜒2 distance,
under the alternative, so that a test based on a sample of 𝑛 observations can still consistently dis-
tinguish between the null hypothesis and the alternative. For example, in the parametric setting
where P is known up to a finite-dimensional parameter under the alternative, the detection boundary
of the likelihood ratio test is n^{−1/2} under mild regularity conditions (e.g., Theorem 13.5.4 in
Lehmann and Romano, 2008, and the discussion leading to it). We are concerned here with alternatives
that are nonparametric in nature. Our first result suggests that the detection boundary for the
aforementioned γ_K(P_n, P_0) based test is of the order n^{−1/4}. However, our main results indicate,
perhaps surprisingly at first, that this rate is far from optimal and the gap between it and the usual
parametric rate can be largely bridged.
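As a quick sanity check on the separation metric: for discrete distributions the same formula reads χ²(P, P_0) = ∑_x p(x)²/p_0(x) − 1, which a few lines verify on a toy example of our own choosing.

```python
# chi-square divergence between two discrete distributions:
# chi2(P, P0) = sum_x p(x)^2 / p0(x) - 1
p = [0.5, 0.3, 0.2]       # P, a toy alternative
p0 = [1/3, 1/3, 1/3]      # P0, the uniform null
chi2 = sum(pi ** 2 / qi for pi, qi in zip(p, p0)) - 1
print(chi2)  # 0.14
```

The divergence is zero if and only if P = P_0, and grows as P moves away from the null, which is what makes it a natural separation metric here.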
In particular, we argue that the distinguishability between P and P0 depends on how close
𝑢 := 𝑑P/𝑑P0 − 1 is to the RKHS H𝐾 . The closeness of 𝑢 to H𝐾 can be measured by the distance
from u to an arbitrary ball in H_K. Specifically, we shall consider the case where H_K is dense in
𝐿2(P0), and focus on functions that are polynomially approximable by H𝐾 for concreteness. More
precisely, for some constants M, θ > 0, denote by F(θ; M) the collection of functions f ∈ L_2(P_0)
such that for any R > 0, there exists an f_R ∈ H_K such that

‖f_R‖_{H_K} ≤ R, and ‖f − f_R‖_{L_2(P_0)} ≤ M R^{−1/θ}.
We also adopt the convention that

F(0; M) = {f ∈ H_K : ‖f‖_{H_K} ≤ M}.
We investigate the optimal rate of detection for testing 𝐻_0^GOF : P = P_0 against

𝐻_1^GOF(Δ_n, θ, M) : P ∈ P(Δ_n, θ, M),  (2.1)

where P(Δ_n, θ, M) is the collection of distributions P on (X, B) satisfying

dP/dP_0 − 1 ∈ F(θ; M), and χ(P, P_0) ≥ Δ_n.
We call r_n the optimal rate of detection if for any c > 0, there exists no consistent test whenever
Δ_n ≤ c r_n; and, on the other hand, a consistent test exists as long as Δ_n ≫ r_n.
Throughout this chapter, we shall assume that

∫_{X×X} K²(x, x′) dP_0(x) dP_0(x′) < ∞.
Hence the Hilbert-Schmidt integral operator

L_K(f)(x) = ∫_X K(x, x′) f(x′) dP_0(x′), ∀ x ∈ X
is well-defined. The spectral decomposition theorem ensures that L_K admits an eigenvalue decomposition.
Let {φ_k}_{k≥1} denote the orthonormal eigenfunctions of L_K with eigenvalues λ_k such
that λ_1 ≥ λ_2 ≥ · · · ≥ λ_k ≥ · · · > 0. Then, as proved in, e.g., Dunford and Schwartz (1963),

K(x, x′) = ∑_{k≥1} λ_k φ_k(x) φ_k(x′)  (2.2)
in 𝐿2(P0 ⊗ P0). We further assume that 𝐾 is continuous and that P0 is nondegenerate, meaning the
support of P_0 is X. Then Mercer’s theorem ensures that (2.2) holds pointwise. See, e.g., Theorem
4.49 of Steinwart and Christmann (2008).
As shown in Gretton et al. (2012a), the squared MMD between two probability distributions P
and P0 can be expressed as
γ²_K(P, P_0) = ∫ K(x, x′) d(P − P_0)(x) d(P − P_0)(x′).  (2.3)
Write

K̄(x, x′) = K(x, x′) − E_{P_0} K(x, X) − E_{P_0} K(X, x′) + E_{P_0} K(X, X′),  (2.4)

where the subscript P_0 signifies that the expectation is taken over X, X′ ∼ P_0 independently.
By (2.4), γ²_K(P, P_0) = γ²_{K̄}(P, P_0). Therefore, without loss of generality, we can focus on kernels
that are degenerate under P_0, i.e.,

E_{P_0} K(X, ·) = 0.  (2.5)
Passing from a nondegenerate kernel to a degenerate one, however, presents a subtlety regarding
universality. Universality of a kernel is essential for MMD in that it ensures that dP/dP_0 − 1 resides
in the linear space spanned by its eigenfunctions. See, e.g., Steinwart (2001) for the definition
of universal kernel and Sriperumbudur et al. (2011) for a detailed discussion of different types of
universality. Observe that 𝑑P/𝑑P0 − 1 necessarily lies in the orthogonal complement of constant
functions in 𝐿2(P0). A degenerate kernel 𝐾 is universal if its eigenfunctions {𝜙𝑘 }𝑘≥1 form an
orthonormal basis of the orthogonal complement in L_2(P_0) of the linear space {c · φ_0 : c ∈ R}, where
φ_0(x) ≡ 1. In what follows, we shall assume that K is both degenerate and universal.
For the sake of concreteness, we shall also assume that K has infinitely many positive eigenvalues
decaying polynomially, i.e.,

0 < lim inf_{k→∞} k^{2s} λ_k ≤ lim sup_{k→∞} k^{2s} λ_k < ∞  (2.6)
for some s > 1/2. In addition, we also assume that the eigenfunctions of K are uniformly bounded,
i.e.,

sup_{k≥1} ‖φ_k‖_∞ < ∞.  (2.7)

Together with (2.6), Assumption (2.7) ensures that Mercer’s decomposition (2.2) holds uniformly.
2.2 Operating Characteristics of MMD Based Test
2.2.1 Asymptotics under 𝐻_0^GOF
Note that (2.5) implies E_{P_0} φ_k(X) = 0 for all k ≥ 1. Hence

γ²(P, P_0) = ∑_{k≥1} λ_k [E_P φ_k(X)]²

for any P. Accordingly, when P is replaced by the empirical distribution P_n, the empirical squared
MMD can be expressed as

γ²(P_n, P_0) = ∑_{k≥1} λ_k [(1/n) ∑_{i=1}^{n} φ_k(X_i)]².
Classic results on the asymptotics of V-statistics (Serfling, 2009) imply that, under 𝐻_0^GOF,

n γ²(P_n, P_0) →_d ∑_{k≥1} λ_k Z_k² =: W,

where Z_k ~ i.i.d. N(0, 1). Let Φ_MMD be an MMD based test, which rejects 𝐻_0^GOF if and
only if n γ²(P_n, P_0) exceeds the 1 − α quantile q_{W,1−α} of W, i.e.,

Φ_MMD = 1{n γ²(P_n, P_0) > q_{W,1−α}}.

The above limiting distribution of n γ²(P_n, P_0) immediately suggests that Φ_MMD is an asymptotic
α-level test.
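Since the limiting null law W = ∑_{k≥1} λ_k Z_k² depends on the eigenvalues, the quantile q_{W,1−α} is typically obtained by Monte Carlo. The sketch below assumes a polynomially decaying stand-in spectrum λ_k = k^{−2s} consistent with (2.6); it is an illustration, not a spectrum derived in this thesis.

```python
import numpy as np

def null_quantile(lambdas, alpha=0.05, n_sim=20000, seed=0):
    """Monte Carlo (1-alpha)-quantile of W = sum_k lambda_k * Z_k^2, Z_k iid N(0,1)."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n_sim, len(lambdas)))
    W = (Z ** 2) @ lambdas  # n_sim draws from the (truncated) mixture of chi-squares
    return np.quantile(W, 1 - alpha)

s = 1.0
lambdas = np.arange(1, 201) ** (-2 * s)  # truncated spectrum lambda_k = k^{-2s}
q95 = null_quantile(lambdas, alpha=0.05)
q99 = null_quantile(lambdas, alpha=0.01)
print(q95, q99)
```

Φ_MMD then rejects 𝐻_0^GOF whenever n γ²(P_n, P_0) exceeds the simulated quantile.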
2.2.2 Power Analysis for MMD Based Tests
We now investigate the power of Φ_MMD in testing 𝐻_0^GOF against 𝐻_1^GOF(Δ_n, θ, M) given by (2.1).
Recall that the type II error of a test Φ : X^n → [0, 1] for testing 𝐻_0 against a composite alternative
𝐻_1 : P ∈ P is given by

β(Φ; P) = sup_{P ∈ P} E_P[1 − Φ(X_1, . . . , X_n)],

where E_P denotes expectation over X_1, . . . , X_n ~ i.i.d. P. For brevity, we shall write β(Φ; Δ_n, θ, M)
instead of β(Φ; P(Δ_n, θ, M)) in what follows, with P(Δ_n, θ, M) defined right below (2.1). The
performance of a test Φ can then be evaluated by its detection boundary, that is, the smallest Δ_n
under which the type II error converges to 0 as n → ∞. Our first result establishes the convergence
rate of the detection boundary for Φ_MMD in the case θ = 0. Hereafter, we suppress M in
P(Δ_n, θ, M), 𝐻_1^GOF(Δ_n, θ, M) and β(Φ; Δ_n, θ, M), unless it is necessary to emphasize the
dependence.
Theorem 1. Consider testing 𝐻_0^GOF against 𝐻_1^GOF(Δ_n, 0) by Φ_MMD.

(i) If n^{1/4} Δ_n → ∞, then

β(Φ_MMD; Δ_n, 0) → 0 as n → ∞;

(ii) conversely, there exists a constant c_0 > 0 such that

lim inf_{n→∞} β(Φ_MMD; c_0 n^{−1/4}, 0) > 0.
Theorem 1 shows that when the alternative 𝐻_1^GOF(Δ_n, 0) is considered, the detection boundary
of Φ_MMD is of the order n^{−1/4}. It is instructive to compare this detection rate with that in a
parametric setting, where consistent tests are available whenever n^{1/2} Δ_n → ∞. See, e.g., Theorem
13.5.4 in Lehmann and Romano (2008) and the discussion leading to it. It is natural to ask to
what extent this gap can be attributed to the fundamental difference between parametric and
nonparametric testing problems. We shall now argue that the gap is largely due to the
sub-optimality of Φ_MMD, and that the detection boundary can be significantly improved through
a slight modification of the MMD.
2.3 Optimal Tests Based on Moderated MMD
2.3.1 Moderated MMD Test Statistic
The basic idea behind MMD is to project two probability measures onto a unit ball in H𝐾
and use the distance between the two projections to measure the distance between the original
probability measures. If the Radon-Nikodym derivative of P with respect to P0 is far away from
H𝐾 , the distance between the two projections may not honestly reflect the distance between them.
More specifically, γ²(P, P_0) = ∑_{k≥1} λ_k [E_P φ_k(X)]², while the χ² distance between P and P_0 is
χ²(P, P_0) = ∑_{k≥1} [E_P φ_k(X)]². Since λ_k decreases with k, γ²(P, P_0) can be much smaller
than χ²(P, P_0). To overcome this problem, we consider a moderated version of the MMD which
allows us to project the probability measures onto a larger ball in H_K. In particular, write
η_{K,𝜚}(P, Q; P_0) = sup_{f ∈ H_K : ‖f‖²_{L_2(P_0)} + 𝜚² ‖f‖²_{H_K} ≤ 1} ∫_X f d(P − Q)  (2.8)
for a given distribution P_0 and a constant 𝜚 > 0. A distance between probability measures of this
type was first introduced by Harchaoui et al. (2007) when considering kernel methods for the
two-sample test. A subtle difference between η_{K,𝜚}(P, Q; P_0) and the distance from Harchaoui et
al. (2007) is the set of f that we optimize over on the right-hand side of (2.8). In the case of the
two-sample test, there is no information about P_0 and therefore one needs to replace the norm
‖·‖_{L_2(P_0)} with the empirical L_2 norm.
It is worth noting that η_{K,𝜚}(P, Q; P_0) can also be identified with a particular type of MMD.
Specifically, η_{K,𝜚}(P, Q; P_0) = γ_{K̄_𝜚}(P, Q), where

K̄_𝜚(x, x′) := ∑_{k≥1} [λ_k / (λ_k + 𝜚²)] φ_k(x) φ_k(x′).
We shall nonetheless still refer to η_{K,𝜚}(P, Q; P_0) as a moderated MMD in what follows, to
emphasize the critical importance of moderation. We shall also suppress the dependence of η on K
and P_0 unless necessary. The unit ball in (2.8) is defined in terms of both the RKHS norm and the
L_2(P_0) norm. Recall that u = dP/dP_0 − 1, so that

sup_{‖f‖_{L_2(P_0)} ≤ 1} ∫_X f d(P − P_0) = sup_{‖f‖_{L_2(P_0)} ≤ 1} ∫_X f u dP_0 = ‖u‖_{L_2(P_0)} = χ(P, P_0).
We can therefore expect that a smaller 𝜚 will make η²_𝜚(P, P_0) closer to χ²(P, P_0), since the unit
ball being considered becomes more similar to the unit ball in L_2(P_0). This can also be verified
by noticing that

lim_{𝜚→0} η²_𝜚(P, P_0) = lim_{𝜚→0} ∑_{k≥1} [λ_k / (λ_k + 𝜚²)] [E_P φ_k(X)]² = ∑_{k≥1} [E_P φ_k(X)]² = χ²(P, P_0).
Therefore, we choose 𝜚 converging to 0 when constructing our test statistic.
Hereafter we shall attach the subscript n to 𝜚 to signify its dependence on n. We shall argue that
letting 𝜚_n converge to 0 at an appropriate rate as n increases indeed results in a test more powerful
than Φ_MMD. The test statistic we propose is the empirical version of η²_{𝜚_n}(P, P_0):
η²_{𝜚_n}(P_n, P_0) = (1/n²) ∑_{i,j=1}^{n} K̄_{𝜚_n}(X_i, X_j) = ∑_{k≥1} [λ_k / (λ_k + 𝜚_n²)] [(1/n) ∑_{i=1}^{n} φ_k(X_i)]².  (2.9)
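Once the eigenpairs of K are available, (2.9) can be evaluated by truncating the spectral sum. The sketch below adopts a toy setup of our own choosing, not a kernel analyzed in this thesis: P_0 uniform on [0, 1], λ_k = k^{−2s}, and φ_k(x) = √2 cos(kπx), which are orthonormal and mean zero under P_0.

```python
import numpy as np

def eta2(X, rho, s=1.0, k_max=200):
    """Truncated spectral evaluation of the moderated statistic (2.9):
    eta^2 = sum_k [lambda_k / (lambda_k + rho^2)] * [(1/n) sum_i phi_k(X_i)]^2."""
    k = np.arange(1, k_max + 1)
    lam = k ** (-2.0 * s)                                 # eigenvalues lambda_k = k^{-2s}
    phi = np.sqrt(2.0) * np.cos(np.pi * np.outer(k, X))   # phi_k(X_i), shape (k_max, n)
    coef = phi.mean(axis=1) ** 2                          # [(1/n) sum_i phi_k(X_i)]^2
    return np.sum(lam / (lam + rho ** 2) * coef)

rng = np.random.default_rng(0)
X = rng.uniform(size=500)  # a sample from P0 = Uniform[0, 1]
print(eta2(X, rho=1.0), eta2(X, rho=0.01))
```

Shrinking 𝜚 pushes every filter weight λ_k/(λ_k + 𝜚²) toward 1, so the statistic increases toward a (truncated) estimate of the χ² divergence, in line with the limit displayed earlier in this section.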
This test statistic is similar in spirit to the homogeneity test proposed previously by Harchaoui
et al. (2007), albeit motivated from a different viewpoint. In either case, it is intuitive to expect
improved performance over the vanilla version of the MMD when 𝜚_n converges to zero at an
appropriate rate. The main goal of the present work is to precisely characterize the amount of
moderation needed to ensure maximum power. We first argue that letting 𝜚_n converge to 0 at an
appropriate rate indeed results in a test more powerful than Φ_MMD.
2.3.2 Operating Characteristics of η²_{𝜚_n}(P_n, P_0) Based Tests
Although the expression for η²_{𝜚_n}(P_n, P_0) given by (2.9) looks similar to that of γ²(P_n, P_0),
their asymptotic behaviors are quite different. At a technical level, this is because the eigenvalues
of the underlying kernel,

λ_{nk} := λ_k / (λ_k + 𝜚_n²),

depend on n and may not be uniformly summable over n. As presented in the following theorem,
a certain type of asymptotic normality, instead of a sum of chi-squares as in the case of γ²(P_n, P_0),
holds for η²_{𝜚_n}(P_n, P_0) under P_0, which helps determine the rejection region of the η²_{𝜚_n} based test.
Theorem 2. Assume that 𝜚_n → 0 as n → ∞ in such a fashion that n 𝜚_n^{1/(2s)} → ∞. Then under
𝐻_0^GOF,

v_n^{−1/2} [n η²_{𝜚_n}(P_n, P_0) − A_n] →_d N(0, 2),

where

v_n = ∑_{k≥1} [λ_k / (λ_k + 𝜚_n²)]², and A_n = (1/n) ∑_{i=1}^{n} K̄_{𝜚_n}(X_i, X_i).
In light of Theorem 2, a test that rejects 𝐻_0^GOF if and only if

2^{−1/2} v_n^{−1/2} [n η²_{𝜚_n}(P_n, P_0) − A_n]

exceeds z_{1−α} is an asymptotic α-level test, where z_{1−α} stands for the 1 − α quantile of the
standard normal distribution. We refer to this test as Φ_M3d, where the subscript M3d stands for
Moderated MMD. The performance of Φ_M3d under the alternative hypothesis is characterized by
the following theorem, showing that its detection boundary is much improved when compared with
that of Φ_MMD.
Theorem 3. Consider testing 𝐻_0^GOF against 𝐻_1^GOF(Δ_n, θ) by Φ_M3d with 𝜚_n = c n^{−2s(θ+1)/(4s+θ+1)}
for an arbitrary constant c > 0. If n^{2s/(4s+θ+1)} Δ_n → ∞, then Φ_M3d is consistent in that

β(Φ_M3d; Δ_n, θ) → 0, as n → ∞.
Theorem 3 indicates that the detection boundary for $\Phi_{\mathrm{M3d}}$ is $n^{-2s/(4s+\theta+1)}$. In particular, when testing $H_0^{\mathrm{GOF}}$ against $H_1^{\mathrm{GOF}}(\Delta_n, 0)$, i.e., $\theta = 0$, it becomes $n^{-2s/(4s+1)}$. This is to be contrasted with the detection boundary for $\Phi_{\mathrm{MMD}}$, which, as suggested by Theorem 1, is of the order $n^{-1/4}$. It is also worth noting that the detection boundary for $\Phi_{\mathrm{M3d}}$ deteriorates as $\theta$ increases, implying that it is harder to test against a larger interpolation space.
2.3.3 Minimax Optimality
It is of interest to investigate if the detection boundary of ΦM3d can be further improved. We
now show that the answer is negative in a certain sense.
Theorem 4. Consider testing $H_0^{\mathrm{GOF}}$ against $H_1^{\mathrm{GOF}}(\Delta_n, \theta)$ for some $\theta < 2s - 1$. If $\limsup_{n\to\infty} \Delta_n n^{\frac{2s}{4s+\theta+1}} < \infty$, then there exists $\alpha \in (0,1)$ such that for any $\Phi_n$ of level $\alpha$ (asymptotically) based on $X_1, \cdots, X_n$,
\[
\limsup_{n\to\infty} \beta(\Phi_n; \Delta_n, \theta) > 0.
\]
Together with Theorem 3, this suggests that $\Phi_{\mathrm{M3d}}$ is rate optimal in the minimax sense, when considering the $\chi^2$ distance as the separation metric and $\mathcal{F}(\theta, M)$ as the regularity condition on the alternative space.
2.4 Adaptation
Despite the minimax optimality of $\Phi_{\mathrm{M3d}}$, a practical challenge in using it is the choice of an appropriate tuning parameter $\varrho_n$. In particular, Theorem 3 suggests that $\varrho_n$ needs to be taken at the order of $n^{-2s(\theta+1)/(4s+\theta+1)}$, which depends on the values of $s$ and $\theta$. On the one hand, since $\mathbb{P}_0$ and $K$ are known a priori, so is $s$. On the other hand, $\theta$ reflects the property of $d\mathbb{P}/d\mathbb{P}_0$, which is typically not known in advance. This naturally brings us to the issue of adaptation (see, e.g., Spokoiny, 1996; Ingster, 2000). In other words, we are interested in a single testing procedure that can achieve the detection boundary for testing $H_0^{\mathrm{GOF}}$ against $H_1^{\mathrm{GOF}}(\Delta_n(\theta), \theta)$ simultaneously over all $\theta \ge 0$. We emphasize the dependence of $\Delta_n$ on $\theta$ since the detection boundary may depend on $\theta$, as suggested by the results from the previous section. To this end, we build upon the test statistic introduced before.
More specifically, write
\[
\rho_* = \left(\frac{\sqrt{\log\log n}}{n}\right)^{2s}
\]
and
\[
m_* = \log_2\left[\rho_*^{-1}\left(\frac{\sqrt{\log\log n}}{n}\right)^{\frac{2s}{4s+1}}\right].
\]
Then our test statistic is taken to be the maximum of $T_{n,\varrho_n}$ for $\varrho_n = \rho_*, 2\rho_*, 2^2\rho_*, \ldots, 2^{m_*}\rho_*$:
\[
T_n^{\mathrm{GOF(adapt)}} := \sup_{0\le k\le m_*} T_{n,2^k\rho_*}, \tag{2.10}
\]
where
\[
T_{n,\varrho_n} = (2v_n)^{-1/2}\left[n\eta^2_{\varrho_n}(\mathbb{P}_n, \mathbb{P}_0) - A_n\right].
\]
It turns out that if an appropriate rejection threshold is chosen, $T_n^{\mathrm{GOF(adapt)}}$ can achieve a detection boundary very similar to the one we have before, but now simultaneously over all $\theta \ge 0$.
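The geometric grid underlying (2.10) can be constructed directly from $n$ and $s$. The sketch below does so; rounding $m_*$ up to an integer is our reading of the construction (so that the grid covers the whole range from $\rho_*$ up to roughly $(\sqrt{\log\log n}/n)^{2s/(4s+1)}$), since a non-integer exponent count would not define a finite grid.

```python
import math

def rho_grid(n, s):
    # base = sqrt(log log n) / n, so that rho_* = base^(2s) and the grid
    # tops out near base^(2s/(4s+1)); these match the displays above.
    base = math.sqrt(math.log(math.log(n))) / n
    rho_star = base ** (2 * s)
    # m_* = log2( rho_*^{-1} * base^{2s/(4s+1)} ), rounded up (our assumption)
    m_star = math.ceil(math.log2(base ** (2 * s / (4 * s + 1)) / rho_star))
    return [rho_star * 2 ** k for k in range(m_star + 1)]

grid = rho_grid(n=10_000, s=1.0)
```

Each candidate $\varrho$ in the grid corresponds to targeting a different (unknown) smoothness $\theta$ of the alternative; taking the maximum of the standardized statistics over the grid removes the need to know $\theta$.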
Theorem 5. (i) Under $H_0^{\mathrm{GOF}}$,
\[
\lim_{n\to\infty} P\left(T_n^{\mathrm{GOF(adapt)}} \ge \sqrt{3\log\log n}\right) = 0;
\]
(ii) on the other hand, there exists a constant $c_1 > 0$ such that
\[
\lim_{n\to\infty} \inf_{\mathbb{P}\in\cup_{\theta\ge 0}\mathcal{P}(\Delta_n(\theta),\theta)} P\left(T_n^{\mathrm{GOF(adapt)}} \ge \sqrt{3\log\log n}\right) = 1,
\]
provided that $\Delta_n(\theta) \ge c_1\left(n^{-1}\sqrt{\log\log n}\right)^{\frac{2s}{4s+\theta+1}}$.
Theorem 5 immediately suggests that a test that rejects $H_0^{\mathrm{GOF}}$ if and only if $T_n^{\mathrm{GOF(adapt)}} \ge \sqrt{3\log\log n}$ is consistent for testing it against $H_1^{\mathrm{GOF}}(\Delta_n(\theta), \theta)$ for all $\theta \ge 0$, provided that
\[
\Delta_n(\theta) \ge c_1\left(n^{-1}\sqrt{\log\log n}\right)^{\frac{2s}{4s+\theta+1}}.
\]
We note that the detection boundary given in Theorem 5 is similar to that from Theorem 4, but inferior by a factor of $(\log\log n)^{\frac{2s}{4s+\theta+1}}$. As our next result indicates, such an extra factor is indeed unavoidable and is the price one needs to pay for adaptation.
Theorem 6. Let $0 < \theta_1 < \theta_2 < 2s - 1$. Then there exists a positive constant $c_2$ such that if
\[
\limsup_{n\to\infty} \sup_{\theta\in[\theta_1,\theta_2]} \Delta_n(\theta)\left(\frac{n}{\sqrt{\log\log n}}\right)^{\frac{2s}{4s+\theta+1}} \le c_2,
\]
then
\[
\lim_{n\to\infty} \inf_{\Phi_n}\left[\mathbb{E}_{\mathbb{P}_0}\Phi_n + \sup_{\theta\in[\theta_1,\theta_2]}\beta(\Phi_n; \Delta_n(\theta), \theta)\right] = 1.
\]
Similar to Theorem 4, Theorem 6 shows that there is no consistent test for $H_0^{\mathrm{GOF}}$ against $H_1^{\mathrm{GOF}}(\Delta_n, \theta)$ simultaneously over all $\theta\in[\theta_1,\theta_2]$ if $\Delta_n(\theta) \le c_2\left(n^{-1}\sqrt{\log\log n}\right)^{\frac{2s}{4s+\theta+1}}$ for all $\theta\in[\theta_1,\theta_2]$ for a sufficiently small $c_2$. Together with Theorem 5, this suggests that the above mentioned adaptive test is indeed rate optimal.
Chapter 3: Gaussian Kernel Embedding
3.1 Test for Goodness-of-fit
Throughout this chapter, we shall consider goodness-of-fit, homogeneity and independence
tests. We focus on continuous data, e.g., X = R𝑑 , and Gaussian kernels, which are arguably the
most popular and successful choice in practice.
Among the three testing problems that we consider, it is instructive to begin with the case of
goodness-of-fit. Obviously, the choice of kernel 𝐾 plays an essential role in kernel embedding of
distributions. In particular, when data are continuous, Gaussian kernels are commonly used. More
specifically, a Gaussian kernel with a scaling parameter $\nu > 0$ is given by
\[
G_{d,\nu}(x, y) = \exp\left(-\nu\|x - y\|_d^2\right), \qquad \forall x, y \in \mathbb{R}^d.
\]
Hereafter $\|\cdot\|_d$ stands for the usual Euclidean norm in $\mathbb{R}^d$. For brevity, we shall suppress the subscript $d$ in both $\|\cdot\|$ and $G$ when the dimensionality is clear from the context. When $\mathbb{P}$ and $\mathbb{Q}$ are probability distributions defined over $\mathcal{X} = \mathbb{R}^d$, we shall write the MMD between them with a Gaussian kernel with scaling parameter $\nu$ as $\gamma_\nu(\mathbb{P}, \mathbb{Q})$, where the subscript signifies the specific value of the scaling parameter.
We shall restrict our attention to distributions with smooth densities. Denote by $\mathcal{W}^{s,2}_d$ the $s$th order Sobolev space in $\mathbb{R}^d$, that is,
\[
\mathcal{W}^{s,2}_d = \left\{ f: \mathbb{R}^d \to \mathbb{R} \;\middle|\; f \text{ is almost surely continuous and } \int (1+\|\omega\|^2)^s |\mathcal{F}(f)(\omega)|^2 \, d\omega < \infty \right\},
\]
where $\mathcal{F}(f)$ is the Fourier transform of $f$:
\[
\mathcal{F}(f)(\omega) = \frac{1}{(2\pi)^{d/2}}\int_{\mathbb{R}^d} f(x)e^{-ix^\top\omega}\,dx.
\]
In what follows, we shall again suppress the subscript $d$ in $\mathcal{W}^{s,2}_d$ when it is clear from the context. For any $f\in\mathcal{W}^{s,2}$, we shall write
\[
\|f\|^2_{\mathcal{W}^{s,2}} = \int_{\mathbb{R}^d}(1+\|\omega\|^2)^s|\mathcal{F}(f)(\omega)|^2\,d\omega.
\]
Let 𝑝 and 𝑝0 be the density functions of P and P0 respectively. We are interested in the case when
both 𝑝 and 𝑝0 are elements from W𝑠,2.
Note that we can rewrite the null hypothesis $H_0^{\mathrm{GOF}}$ in terms of density functions: $H_0^{\mathrm{GOF}}: p = p_0$ for some prespecified density $p_0\in\mathcal{W}^{s,2}$. To better quantify the power of a test, we shall consider testing against an alternative that is increasingly closer to the null as the sample size $n$ increases:
\[
H_1^{\mathrm{GOF}}(\Delta_n; s): p\in\mathcal{W}^{s,2}(M), \quad \|p - p_0\|_{L_2}\ge\Delta_n,
\]
where
\[
\mathcal{W}^{s,2}(M) = \left\{f\in\mathcal{W}^{s,2}: \|f\|_{\mathcal{W}^{s,2}}\le M\right\}
\]
and
\[
\|f\|^2_{L_2} = \int_{\mathbb{R}^d} f^2(x)\,dx.
\]
The alternative hypothesis $H_1^{\mathrm{GOF}}(\Delta_n; s)$ is composite, and the power of a test $\Phi$ based on $X_1,\ldots,X_n\sim p$ is therefore defined as
\[
\mathrm{power}(\Phi; H_1^{\mathrm{GOF}}(\Delta_n; s)) := \inf_{p\in\mathcal{W}^{s,2}(M),\,\|p-p_0\|_{L_2}\ge\Delta_n} P\{\Phi \text{ rejects } H_0^{\mathrm{GOF}}\}.
\]
Let
\[
\bar{G}_\nu(x, y; \mathbb{P}_0) = G_\nu(x, y) - \mathbb{E}_{X\sim\mathbb{P}_0}G_\nu(X, y) - \mathbb{E}_{X\sim\mathbb{P}_0}G_\nu(x, X) + \mathbb{E}_{X,X'\sim_{\mathrm{iid}}\mathbb{P}_0}G_\nu(X, X'),
\]
and recall that
\[
\gamma^2_\nu(\mathbb{P}_n, \mathbb{P}_0) = \frac{1}{n^2}\sum_{i,j=1}^n \bar{G}_\nu(X_i, X_j).
\]
As in Chapter 2, we correct for bias and use instead the following $U$-statistic:
\[
\hat{\gamma}^2_\nu(\mathbb{P}, \mathbb{P}_0) := \frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}\bar{G}_\nu(X_i, X_j),
\]
which we shall focus on in what follows.
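As a concrete sketch, the $U$-statistic above can be computed in closed form when $\mathbb{P}_0 = N(0,1)$ and $d = 1$ (an illustrative choice of ours, not one made in the text): the centering terms $\mathbb{E}_{X\sim N(0,1)}G_\nu(X,y) = (1+2\nu)^{-1/2}\exp(-\nu y^2/(1+2\nu))$ and $\mathbb{E}G_\nu(X,X') = (1+4\nu)^{-1/2}$ are standard Gaussian integrals.

```python
import numpy as np

def gof_mmd_ustat(X, nu):
    """U-statistic gamma_hat^2_nu(P, P0) with P0 = N(0,1) in one dimension.

    Uses the closed-form centering terms for the standard normal null."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    G = np.exp(-nu * (X[:, None] - X[None, :]) ** 2)      # G_nu(X_i, X_j)
    m1 = (1 + 2 * nu) ** -0.5 * np.exp(-nu * X ** 2 / (1 + 2 * nu))
    m0 = (1 + 4 * nu) ** -0.5                              # E G_nu(X, X')
    Gt = G - m1[:, None] - m1[None, :] + m0                # centered kernel
    np.fill_diagonal(Gt, 0.0)                              # keep i != j only
    return Gt.sum() / (n * (n - 1))

rng = np.random.default_rng(0)
stat_null = gof_mmd_ustat(rng.normal(size=500), nu=1.0)            # P = P0
stat_alt = gof_mmd_ustat(rng.normal(1.0, 1.0, size=500), nu=1.0)   # shifted P
```

Under the null the statistic concentrates near zero at rate $1/n$, while under a fixed alternative it converges to a positive population quantity, which is what the tests below exploit.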
The choice of the scaling parameter $\nu$ is essential when using RKHS embedding for goodness-of-fit testing. While the importance of a data-driven choice of $\nu$ is widely recognized in practice, almost all existing theoretical studies assume that a fixed kernel, and therefore a fixed scaling parameter, is used. Here we shall demonstrate the benefit of using a data-driven scaling parameter, and especially of choosing a scaling parameter that diverges with the sample size.

More specifically, we argue that, with appropriate scaling, $\hat{\gamma}^2_\nu(\mathbb{P}, \mathbb{P}_0)$ can be viewed as an estimate of $\|p - p_0\|^2_{L_2}$ when $\nu\to\infty$ as $n\to\infty$. Note that
\[
\int(p - p_0)^2 = \int p^2 - 2\int p\cdot p_0 + \int p_0^2.
\]
The first term can be estimated by
\[
\int p^2 \approx \frac{1}{n}\sum_{i=1}^n p(X_i) \approx \frac{1}{n}\sum_{i=1}^n \hat{p}_{h,-i}(X_i),
\]
where $\hat{p}_{h,-i}$ is a kernel density estimate of $p$ with the $i$th observation removed and bandwidth $h$:
\[
\hat{p}_{h,-i}(x) = \frac{1}{(n-1)(2\pi h^2)^{d/2}}\sum_{j\ne i} G_{(2h^2)^{-1}}(x, X_j).
\]
Thus, we can estimate $\int p^2$ by
\[
\frac{1}{n(n-1)(2\pi h^2)^{d/2}}\sum_{1\le i\ne j\le n} G_{(2h^2)^{-1}}(X_i, X_j).
\]
Similarly, the cross-product term can be estimated by
\[
\int p\cdot p_0 \approx \int \hat{p}_h(x)p_0(x)\,dx = \frac{1}{n(2\pi h^2)^{d/2}}\sum_{i=1}^n\int G_{(2h^2)^{-1}}(x, X_i)p_0(x)\,dx.
\]
Together, we can view
\[
\frac{1}{n(n-1)(2\pi h^2)^{d/2}}\sum_{1\le i\ne j\le n}\bar{G}_{(2h^2)^{-1}}(X_i, X_j)
\]
as an estimate of $\int(p - p_0)^2$. Following standard asymptotic properties of the kernel density estimator (see, e.g., Tsybakov, 2008), we know that
\[
(\pi/\nu)^{-d/2}\hat{\gamma}^2_\nu(\mathbb{P}, \mathbb{P}_0) \to_p \|p - p_0\|^2_{L_2}
\]
if $\nu\to\infty$ in such a fashion that $\nu = o(n^{4/d})$. Motivated by this observation, we shall now consider testing $H_0^{\mathrm{GOF}}$ using $\hat{\gamma}^2_\nu(\mathbb{P}, \mathbb{P}_0)$ with a diverging $\nu$. To signify the dependence of $\nu$ on the sample size, we shall add a subscript $n$ in what follows.
Under $H_0^{\mathrm{GOF}}$, it is clear that $\mathbb{E}\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{P}_0) = 0$. Note also that
\[
\mathrm{var}(\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{P}_0)) = \frac{2}{n(n-1)}\mathbb{E}\left[\bar{G}_{\nu_n}(X_1, X_2)\right]^2 = \frac{2}{n(n-1)}\left[\mathbb{E}G^2_{\nu_n}(X_1, X_2) - 2\mathbb{E}[G_{\nu_n}(X_1, X_2)G_{\nu_n}(X_1, X_3)] + \left(\mathbb{E}G_{\nu_n}(X_1, X_2)\right)^2\right]. \tag{3.1}
\]
Simple calculations yield
\[
\mathrm{var}(\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{P}_0)) = \frac{2(\pi/(2\nu_n))^{d/2}}{n^2}\cdot\|p_0\|^2_{L_2}\cdot(1+o(1)),
\]
assuming that $\nu_n\to\infty$. We shall show that
\[
\frac{n}{\sqrt{2}}\left(\frac{2\nu_n}{\pi}\right)^{d/4}\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{P}_0) \to_d N\left(0, \|p_0\|^2_{L_2}\right).
\]
To use this as a test statistic, however, we need to estimate $\mathrm{var}(\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{P}_0))$. To this end, it is natural to consider estimating each of the three terms on the right-hand side of (3.1) by $U$-statistics:
\[
s^2_{n,\nu_n} = \frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}G^2_{\nu_n}(X_i, X_j) - \frac{2(n-3)!}{n!}\sum_{\substack{1\le i,j_1,j_2\le n\\ |\{i,j_1,j_2\}|=3}}G_{\nu_n}(X_i, X_{j_1})G_{\nu_n}(X_i, X_{j_2}) + \frac{(n-4)!}{n!}\sum_{\substack{1\le i_1,i_2,j_1,j_2\le n\\ |\{i_1,i_2,j_1,j_2\}|=4}}G_{\nu_n}(X_{i_1}, X_{j_1})G_{\nu_n}(X_{i_2}, X_{j_2}).
\]
Note that $s^2_{n,\nu_n}$ is not always positive. To avoid a negative estimate of the variance, we can replace it with a sufficiently small value, say $1/n^2$, whenever it is negative or too small. Namely, let
\[
\hat{\sigma}^2_{n,\nu_n} = \max\{s^2_{n,\nu_n}, 1/n^2\},
\]
and consider the test statistic
\[
T^{\mathrm{GOF}}_{n,\nu_n} := \frac{n}{\sqrt{2}}\hat{\sigma}^{-1}_{n,\nu_n}\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{P}_0).
\]
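The three $U$-statistic terms of $s^2_{n,\nu_n}$ can be computed without explicit triple and quadruple loops: writing $r_i$ for the row sums of the (zero-diagonal) kernel matrix, the sums over distinct triples and quadruples reduce to combinations of $r_i$, entrywise squares, and the grand sum. The rearrangement below is ours (a standard inclusion-exclusion for symmetric matrices), not taken from the text.

```python
import numpy as np

def var_ustat(G):
    """s^2_{n,nu}: the three U-statistic terms computed from a symmetric
    kernel matrix G (its diagonal is zeroed internally)."""
    A = G.copy()
    np.fill_diagonal(A, 0.0)
    n = A.shape[0]
    r = A.sum(axis=1)                # row sums
    q = (A ** 2).sum(axis=1)         # row sums of squared entries
    T = q.sum()                      # sum_{i != j} G_ij^2
    U = (r ** 2 - q).sum()           # sum over distinct (i, j1, j2)
    S = r.sum()                      # sum_{i != j} G_ij
    quad = S ** 2 - 4 * U - 2 * T    # sum over four distinct indices
    return (T / (n * (n - 1))
            - 2 * U / (n * (n - 1) * (n - 2))
            + quad / (n * (n - 1) * (n - 2) * (n - 3)))

def t_gof(gamma2_hat, G, n):
    # Floor the variance estimate at 1/n^2, as in the text.
    sigma = max(var_ustat(G), 1.0 / n ** 2) ** 0.5
    return n / np.sqrt(2.0) * gamma2_hat / sigma
```

The vectorized form runs in $O(n^2)$ time, which matters since the statistic is recomputed for many scaling parameters in the adaptive procedures later.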
We have
Theorem 7. Let $\nu_n\to\infty$ as $n\to\infty$ in such a fashion that $\nu_n = o(n^{4/d})$. Then, under $H_0^{\mathrm{GOF}}$,
\[
\frac{n}{\sqrt{2}}\left(\frac{2\nu_n}{\pi}\right)^{d/4}\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{P}_0) \to_d N(0, \|p_0\|^2_{L_2}). \tag{3.2}
\]
Moreover,
\[
T^{\mathrm{GOF}}_{n,\nu_n} \to_d N(0, 1). \tag{3.3}
\]
Theorem 7 immediately implies that the test, denoted by $\Phi^{\mathrm{GOF}}_{n,\nu_n,\alpha}$ ($\alpha\in(0,1)$), that rejects $H_0^{\mathrm{GOF}}$ if and only if $T^{\mathrm{GOF}}_{n,\nu_n}$ exceeds $z_\alpha$, the upper $\alpha$ quantile of the standard normal distribution, is an asymptotic $\alpha$-level test.
We now proceed to study its power against a smooth alternative. Following the same argument as before, it can be shown that
\[
\frac{1}{n(n-1)(\pi/\nu_n)^{d/2}}\sum_{1\le i\ne j\le n}\bar{G}_{\nu_n}(X_i, X_j) \to_p \|p - p_0\|^2_{L_2}
\]
and
\[
(2\nu_n/\pi)^{d/2}\hat{\sigma}^2_{n,\nu_n} \to_p \|p\|^2_{L_2},
\]
so that
\[
n^{-1}(\nu_n/(2\pi))^{d/4}T^{\mathrm{GOF}}_{n,\nu_n} \to_p \|p - p_0\|^2_{L_2}/\|p\|_{L_2}.
\]
This immediately implies that, if $\nu_n\to\infty$ in such a manner that $\nu_n = o(n^{4/d})$, then $\Phi^{\mathrm{GOF}}_{n,\nu_n,\alpha}$ is consistent for a fixed $p\ne p_0$ in that its power converges to one. In fact, as $n$ increases, more and more subtle deviations from $p_0$ can be detected by $\Phi^{\mathrm{GOF}}_{n,\nu_n,\alpha}$. A refined analysis of the asymptotic behavior of $T^{\mathrm{GOF}}_{n,\nu_n}$ yields the following.
Theorem 8. Assume that $n^{2s/(d+4s)}\Delta_n\to\infty$. Then for any $\alpha\in(0,1)$,
\[
\lim_{n\to\infty}\mathrm{power}\{\Phi^{\mathrm{GOF}}_{n,\nu_n,\alpha}; H_1^{\mathrm{GOF}}(\Delta_n; s)\} = 1,
\]
provided that $\nu_n \asymp n^{4/(d+4s)}$.
In other words, $\Phi^{\mathrm{GOF}}_{n,\nu_n,\alpha}$ has a detection boundary of the order $O(n^{-2s/(d+4s)})$, which turns out to be minimax optimal in that no test can attain a detection boundary with a faster rate of convergence. More precisely, we have
Theorem 9. Assume that $\liminf_{n\to\infty} n^{2s/(d+4s)}\Delta_n < \infty$ and that $p_0$ is a density such that $\|p_0\|_{\mathcal{W}^{s,2}} < M$. Then there exists some $\alpha\in(0,1)$ such that for any test $\Phi_n$ of level $\alpha$ (asymptotically) based on $X_1,\ldots,X_n\sim p$,
\[
\liminf_{n\to\infty}\mathrm{power}\{\Phi_n; H_1^{\mathrm{GOF}}(\Delta_n; s)\} < 1.
\]
Together, Theorems 8 and 9 suggest that Gaussian kernel embedding of distributions is espe-
cially suitable for testing against smooth alternatives, and it yields a test that could consistently
detect the smallest departures, in terms of rate of convergence, from the null distribution. The idea
can also be readily applied to testing of homogeneity and independence which we shall examine
next.
3.2 Test for Homogeneity
As in the case of the goodness-of-fit test, we shall consider the case when the underlying distributions have smooth densities, so that we can rewrite the null hypothesis as $H_0^{\mathrm{HOM}}: p = q \in \mathcal{W}^{s,2}(M)$ and the alternative hypothesis as
\[
H_1^{\mathrm{HOM}}(\Delta_n; s): p, q\in\mathcal{W}^{s,2}(M), \quad \|p - q\|_{L_2}\ge\Delta_n.
\]
The power of a test $\Phi$ based on $X_1,\ldots,X_n\sim p$ and $Y_1,\ldots,Y_m\sim q$ is given by
\[
\mathrm{power}(\Phi; H_1^{\mathrm{HOM}}(\Delta_n; s)) := \inf_{p,q\in\mathcal{W}^{s,2}(M),\,\|p-q\|_{L_2}\ge\Delta_n} P\{\Phi \text{ rejects } H_0^{\mathrm{HOM}}\}.
\]
To fix ideas, we shall also assume that 𝑐 ≤ 𝑚/𝑛 ≤ 𝐶 for some constants 0 < 𝑐 ≤ 𝐶 < ∞.
In addition, we shall express explicitly only the dependence on 𝑛 and not 𝑚, for brevity. Our
treatment, however, can be straightforwardly extended to more general situations.
Recall that
\[
\gamma^2_{\nu_n}(\mathbb{P}_n, \mathbb{Q}_m) = \frac{1}{n^2}\sum_{1\le i,j\le n}G_{\nu_n}(X_i, X_j) + \frac{1}{m^2}\sum_{1\le i,j\le m}G_{\nu_n}(Y_i, Y_j) - \frac{2}{mn}\sum_{i=1}^n\sum_{j=1}^m G_{\nu_n}(X_i, Y_j).
\]
As before, to reduce bias, we shall focus instead on a closely related estimate of $\gamma_{\nu_n}(\mathbb{P}, \mathbb{Q})$:
\[
\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{Q}) = \frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}G_{\nu_n}(X_i, X_j) + \frac{1}{m(m-1)}\sum_{1\le i\ne j\le m}G_{\nu_n}(Y_i, Y_j) - \frac{2}{mn}\sum_{i=1}^n\sum_{j=1}^m G_{\nu_n}(X_i, Y_j).
\]
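The bias-corrected estimate above is a direct computation on three kernel matrices. A minimal sketch for $d = 1$ (the one-dimensional case is our simplification for readability):

```python
import numpy as np

def mmd2_ustat(X, Y, nu):
    """Bias-corrected Gaussian-MMD estimate gamma_hat^2_nu(P, Q):
    within-sample averages over distinct pairs minus twice the
    cross-sample average."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    n, m = len(X), len(Y)
    Kxx = np.exp(-nu * (X[:, None] - X[None, :]) ** 2)
    Kyy = np.exp(-nu * (Y[:, None] - Y[None, :]) ** 2)
    Kxy = np.exp(-nu * (X[:, None] - Y[None, :]) ** 2)
    np.fill_diagonal(Kxx, 0.0)   # exclude i == j within each sample
    np.fill_diagonal(Kyy, 0.0)
    return (Kxx.sum() / (n * (n - 1)) + Kyy.sum() / (m * (m - 1))
            - 2.0 * Kxy.sum() / (n * m))
```

Zeroing the diagonals implements the restriction to distinct pairs, which is exactly what removes the $O(1/n)$ bias of the plug-in version.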
It is easy to see that under $H_0^{\mathrm{HOM}}$,
\[
\mathbb{E}\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{Q}) = 0
\]
and
\[
\mathrm{var}\left(\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{Q})\right) = 2\left(\frac{1}{n(n-1)} + \frac{2}{mn} + \frac{1}{m(m-1)}\right)\mathbb{E}_{(X,Y)\sim\mathbb{P}\otimes\mathbb{Q}}\bar{G}^2_{\nu_n}(X, Y),
\]
where
\[
\bar{G}_{\nu_n}(x, y) = G_{\nu_n}(x, y) - \mathbb{E}_{X\sim\mathbb{P}}G_{\nu_n}(X, y) - \mathbb{E}_{Y\sim\mathbb{Q}}G_{\nu_n}(x, Y) + \mathbb{E}_{(X,Y)\sim\mathbb{P}\otimes\mathbb{Q}}G_{\nu_n}(X, Y).
\]
It is therefore natural to consider estimating the variance by $\hat{\sigma}^2_{n,m,\nu_n} = \max\{s^2_{n,m,\nu_n}, 1/n^2\}$, where
\[
s^2_{n,m,\nu_n} = \frac{1}{N(N-1)}\sum_{1\le i\ne j\le N}G^2_{\nu_n}(Z_i, Z_j) - \frac{2(N-3)!}{N!}\sum_{\substack{1\le i,j_1,j_2\le N\\ |\{i,j_1,j_2\}|=3}}G_{\nu_n}(Z_i, Z_{j_1})G_{\nu_n}(Z_i, Z_{j_2}) + \frac{(N-4)!}{N!}\sum_{\substack{1\le i_1,i_2,j_1,j_2\le N\\ |\{i_1,i_2,j_1,j_2\}|=4}}G_{\nu_n}(Z_{i_1}, Z_{j_1})G_{\nu_n}(Z_{i_2}, Z_{j_2}),
\]
$N = n + m$, and $Z_i = X_i$ if $i\le n$ and $Z_i = Y_{i-n}$ if $i > n$. This leads to the following test statistic:
\[
T^{\mathrm{HOM}}_{n,\nu_n} = \frac{nm}{\sqrt{2(n+m)}}\cdot\hat{\sigma}^{-1}_{n,m,\nu_n}\cdot\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{Q}).
\]
As before, we can show
Theorem 10. Let $\nu_n\to\infty$ as $n\to\infty$ in such a fashion that $\nu_n = o(n^{4/d})$. Then under $H_0^{\mathrm{HOM}}: p = q \in \mathcal{W}^{s,2}(M)$,
\[
T^{\mathrm{HOM}}_{n,\nu_n} \to_d N(0, 1), \quad \text{as } n\to\infty.
\]
Motivated by Theorem 10, we can consider a test, denoted by $\Phi^{\mathrm{HOM}}_{n,\nu_n,\alpha}$, that rejects $H_0^{\mathrm{HOM}}$ if and only if $T^{\mathrm{HOM}}_{n,\nu_n}$ exceeds $z_\alpha$. By construction, $\Phi^{\mathrm{HOM}}_{n,\nu_n,\alpha}$ is an asymptotic $\alpha$-level test. We now turn to studying its power against $H_1^{\mathrm{HOM}}$. As in the case of the goodness-of-fit test, we can prove that $\Phi^{\mathrm{HOM}}_{n,\nu_n,\alpha}$ is minimax optimal in that it can detect the smallest difference between $p$ and $q$ in terms of rate of convergence. More precisely, we have
Theorem 11. (i) Assume that $n^{2s/(d+4s)}\Delta_n\to\infty$. Then for any $\alpha\in(0,1)$,
\[
\lim_{n\to\infty}\mathrm{power}\{\Phi^{\mathrm{HOM}}_{n,\nu_n,\alpha}; H_1^{\mathrm{HOM}}(\Delta_n; s)\} = 1,
\]
provided that $\nu_n \asymp n^{4/(d+4s)}$.

(ii) Conversely, if $\liminf_{n\to\infty} n^{2s/(d+4s)}\Delta_n < \infty$, then there exists some $\alpha\in(0,1)$ such that for any test $\Phi_n$ of level $\alpha$ (asymptotically) based on $X_1,\ldots,X_n\sim p$ and $Y_1,\ldots,Y_m\sim q$,
\[
\liminf_{n\to\infty}\mathrm{power}\{\Phi_n; H_1^{\mathrm{HOM}}(\Delta_n; s)\} < 1.
\]
3.3 Test for Independence
Similarly, we can also use Gaussian kernel embedding to construct minimax optimal tests of independence. Let $X = (X^1,\ldots,X^k)^\top\in\mathbb{R}^d$ be a random vector whose subvectors $X^j\in\mathbb{R}^{d_j}$ for $j = 1,\ldots,k$, so that $d_1+\cdots+d_k = d$. Denote by $p$ the joint density function of $X$, and by $p_j$ the marginal density of $X^j$. We assume that both the joint density and the marginal densities are smooth. Specifically, we shall consider testing
\[
H_0^{\mathrm{IND}}: p = p_1\otimes\cdots\otimes p_k, \quad p_j\in\mathcal{W}^{s,2}(M_j), \quad 1\le j\le k
\]
against a smooth departure from independence:
\[
H_1^{\mathrm{IND}}(\Delta_n; s): p\in\mathcal{W}^{s,2}(M),\ p_j\in\mathcal{W}^{s,2}(M_j),\ 1\le j\le k, \ \text{and } \|p - p_1\otimes\cdots\otimes p_k\|_{L_2}\ge\Delta_n,
\]
where $M = \prod_{j=1}^k M_j$, so that $p_1\otimes\cdots\otimes p_k\in\mathcal{W}^{s,2}(M)$ under both the null and alternative hypotheses.
Given a sample $\{X_1,\ldots,X_n\}$ of independent copies of $X$, we can naturally estimate the so-called dHSIC $\gamma^2_{\nu_n}(\mathbb{P}, \mathbb{P}^{X^1}\otimes\cdots\otimes\mathbb{P}^{X^k})$ by
\[
\gamma^2_{\nu_n}(\mathbb{P}_n, \mathbb{P}^{X^1}_n\otimes\cdots\otimes\mathbb{P}^{X^k}_n) = \frac{1}{n^2}\sum_{1\le i,j\le n}G_{\nu_n}(X_i, X_j) + \frac{1}{n^{2k}}\sum_{1\le i_1,\ldots,i_k,j_1,\ldots,j_k\le n}G_{\nu_n}((X^1_{i_1},\ldots,X^k_{i_k}),(X^1_{j_1},\ldots,X^k_{j_k})) - \frac{2}{n^{k+1}}\sum_{1\le i,j_1,\ldots,j_k\le n}G_{\nu_n}(X_i,(X^1_{j_1},\ldots,X^k_{j_k})).
\]
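For $k = 2$ the estimate above is cheap to compute, because the Gaussian kernel factorizes over the blocks: $\exp(-\nu\|x-y\|^2) = \exp(-\nu\|x^1-y^1\|^2)\exp(-\nu\|x^2-y^2\|^2)$. Each of the three terms then reduces to the two component Gram matrices. The sketch below is our specialization to $k = 2$, not code from the text.

```python
import numpy as np

def gauss_gram(Z, nu):
    # Gram matrix of the Gaussian kernel over the rows of Z.
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-nu * sq)

def dhsic2(X1, X2, nu):
    """Plug-in estimate of gamma^2_nu(P, P^{X1} x P^{X2}) for k = 2,
    written in terms of the component Gram matrices K1, K2."""
    n = X1.shape[0]
    K1, K2 = gauss_gram(X1, nu), gauss_gram(X2, nu)
    term1 = (K1 * K2).sum() / n ** 2                 # joint-sample term
    term2 = K1.sum() * K2.sum() / n ** 4             # product-measure term
    term3 = (K1.sum(1) * K2.sum(1)).sum() / n ** 3   # cross term
    return term1 + term2 - 2.0 * term3
```

This is the usual HSIC-style $O(n^2)$ computation; the naive triple and quadruple sums in the display would cost $O(n^3)$ and $O(n^4)$.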
To correct for the bias, we shall consider the following estimate of $\gamma^2_{\nu_n}(\mathbb{P}, \mathbb{P}^{X^1}\otimes\cdots\otimes\mathbb{P}^{X^k})$ instead:
\[
\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{P}^{X^1}\otimes\cdots\otimes\mathbb{P}^{X^k}) = \frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}G_{\nu_n}(X_i, X_j) + \frac{(n-2k)!}{n!}\sum_{\substack{1\le i_1,\cdots,i_k,j_1,\cdots,j_k\le n\\ |\{i_1,\cdots,i_k,j_1,\cdots,j_k\}|=2k}}G_{\nu_n}((X^1_{i_1},\ldots,X^k_{i_k}),(X^1_{j_1},\ldots,X^k_{j_k})) - \frac{2(n-k-1)!}{n!}\sum_{\substack{1\le i,j_1,\cdots,j_k\le n\\ |\{i,j_1,\cdots,j_k\}|=k+1}}G_{\nu_n}(X_i,(X^1_{j_1},\ldots,X^k_{j_k})).
\]
Under $H_0^{\mathrm{IND}}$, we have
\[
\mathbb{E}\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{P}^{X^1}\otimes\cdots\otimes\mathbb{P}^{X^k}) = 0.
\]
Deriving its variance, however, requires a bit more work. Write
\[
h_j(x^j, y) = \mathbb{E}_{X\sim\mathbb{P}^{X^1}\otimes\cdots\otimes\mathbb{P}^{X^k}}G_{\nu_n}((X^1,\ldots,X^{j-1},x^j,X^{j+1},\ldots,X^k), y)
\]
and
\[
g_j(x^j, y) = h_j(x^j, y) - \mathbb{E}_{X^j\sim\mathbb{P}^{X^j}}h_j(X^j, y) - \mathbb{E}_{Y\sim\mathbb{P}}h_j(x^j, Y) + \mathbb{E}_{(X^j,Y)\sim\mathbb{P}^{X^j}\otimes\mathbb{P}}h_j(X^j, Y).
\]
With slight abuse of notation, also denote by
\[
h_{j_1,j_2}(x^{j_1}, y^{j_2}) = \mathbb{E}_{X,Y\sim_{\mathrm{iid}}\mathbb{P}^{X^1}\otimes\cdots\otimes\mathbb{P}^{X^k}}G_{\nu_n}((X^1,\ldots,X^{j_1-1},x^{j_1},X^{j_1+1},\ldots,X^k),(Y^1,\ldots,Y^{j_2-1},y^{j_2},Y^{j_2+1},\ldots,Y^k))
\]
and
\[
g_{j_1,j_2}(x^{j_1}, y^{j_2}) = h_{j_1,j_2}(x^{j_1}, y^{j_2}) - \mathbb{E}_{X^{j_1}\sim\mathbb{P}^{X^{j_1}}}h_{j_1,j_2}(X^{j_1}, y^{j_2}) - \mathbb{E}_{X^{j_2}\sim\mathbb{P}^{X^{j_2}}}h_{j_1,j_2}(x^{j_1}, X^{j_2}) + \mathbb{E}_{(X^{j_1},Y^{j_2})\sim\mathbb{P}^{X^{j_1}}\otimes\mathbb{P}^{X^{j_2}}}h_{j_1,j_2}(X^{j_1}, Y^{j_2}).
\]
Then we have

Lemma 1. Under $H_0^{\mathrm{IND}}$,
\[
\mathrm{var}\left(\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{P}^{X^1}\otimes\cdots\otimes\mathbb{P}^{X^k})\right) = \frac{2}{n(n-1)}\left(\mathbb{E}\bar{G}^2_{\nu_n}(X, Y) - 2\sum_{1\le j\le k}\mathbb{E}\left(g_j(X^j, Y)\right)^2 + \sum_{1\le j_1,j_2\le k}\mathbb{E}\left(g_{j_1,j_2}(X^{j_1}, Y^{j_2})\right)^2\right) + O(\mathbb{E}G^2_{\nu_n}(X, Y)/n^3). \tag{3.4}
\]
In light of Lemma 1, a variance estimator can be derived by estimating the leading term on the right-hand side of (3.4) term by term using $U$-statistics. Formulae for estimating the variance for general $k$ are tedious, and we defer them to the appendix for space considerations. In the special case when $k = 2$, the leading term on the right-hand side of (3.4) takes a much simplified form:
\[
\frac{2}{n(n-1)}\mathbb{E}\bar{G}^2_{\nu_n}(X^1, Y^1)\cdot\mathbb{E}\bar{G}^2_{\nu_n}(X^2, Y^2),
\]
where $X^j, Y^j\sim_{\mathrm{iid}}\mathbb{P}^{X^j}$ for $j = 1, 2$. Thus, we can estimate $\mathbb{E}\bar{G}^2_{\nu_n}(X^j, Y^j)$ by
\[
s^2_{n,j,\nu_n} = \frac{1}{n(n-1)}\sum_{1\le i_1\ne i_2\le n}G^2_{\nu_n}(X^j_{i_1}, X^j_{i_2}) - \frac{2(n-3)!}{n!}\sum_{\substack{1\le i,l_1,l_2\le n\\ |\{i,l_1,l_2\}|=3}}G_{\nu_n}(X^j_i, X^j_{l_1})G_{\nu_n}(X^j_i, X^j_{l_2}) + \frac{(n-4)!}{n!}\sum_{\substack{1\le i_1,i_2,l_1,l_2\le n\\ |\{i_1,i_2,l_1,l_2\}|=4}}G_{\nu_n}(X^j_{i_1}, X^j_{l_1})G_{\nu_n}(X^j_{i_2}, X^j_{l_2})
\]
and $\mathrm{var}(\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}))$ by $\frac{2}{n(n-1)}\hat{\sigma}^2_{n,\nu_n}$, where
\[
\hat{\sigma}^2_{n,\nu_n} := \max\{s^2_{n,1,\nu_n}s^2_{n,2,\nu_n}, 1/n^2\},
\]
so that a test statistic for $H_0^{\mathrm{IND}}$ is
\[
T^{\mathrm{IND}}_{n,\nu_n} := \frac{n}{\sqrt{2}}\hat{\sigma}^{-1}_{n,\nu_n}\hat{\gamma}^2_{\nu_n}(\mathbb{P}, \mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}).
\]
Test statistics for general 𝑘 > 2 can be defined accordingly. Again, we have
Theorem 12. Let $\nu_n\to\infty$ as $n\to\infty$ in such a fashion that $\nu_n = o(n^{4/d})$. Then under $H_0^{\mathrm{IND}}$,
\[
T^{\mathrm{IND}}_{n,\nu_n} \to_d N(0, 1), \quad \text{as } n\to\infty.
\]
Motivated by Theorem 12, we can consider a test, denoted by $\Phi^{\mathrm{IND}}_{n,\nu_n,\alpha}$, that rejects $H_0^{\mathrm{IND}}$ if and only if $T^{\mathrm{IND}}_{n,\nu_n}$ exceeds $z_\alpha$. By construction, $\Phi^{\mathrm{IND}}_{n,\nu_n,\alpha}$ is an asymptotic $\alpha$-level test. We now turn to studying its power against $H_1^{\mathrm{IND}}$. As in the case of the goodness-of-fit test, we can prove that $\Phi^{\mathrm{IND}}_{n,\nu_n,\alpha}$ is minimax optimal in that it can detect the smallest departure from independence in terms of rate of convergence. More precisely, we have
Theorem 13. (i) Assume that $n^{2s/(d+4s)}\Delta_n\to\infty$. Then for any $\alpha\in(0,1)$,
\[
\lim_{n\to\infty}\mathrm{power}\{\Phi^{\mathrm{IND}}_{n,\nu_n,\alpha}; H_1^{\mathrm{IND}}(\Delta_n; s)\} = 1,
\]
provided that $\nu_n \asymp n^{4/(d+4s)}$.

(ii) Conversely, if $\liminf_{n\to\infty} n^{2s/(d+4s)}\Delta_n < \infty$, then there exists some $\alpha\in(0,1)$ such that for any test $\Phi_n$ of level $\alpha$ (asymptotically) based on $X_1,\ldots,X_n\sim p$,
\[
\liminf_{n\to\infty}\mathrm{power}\{\Phi_n; H_1^{\mathrm{IND}}(\Delta_n; s)\} < 1.
\]
3.4 Adaptation
The results presented in the previous sections not only suggest that Gaussian kernel embedding of distributions is especially suitable for testing against smooth alternatives, but also indicate the importance of choosing an appropriate scaling parameter in order to detect small deviations from the null hypothesis. To achieve maximum power, the scaling parameter should be chosen according to the smoothness of the underlying density functions. This, however, presents a practical challenge because the level of smoothness is rarely known a priori. This naturally brings about the question of adaptation: can we devise an agnostic testing procedure that does not require such knowledge but still attains similar performance? We shall show in this section that this is possible, at least for sufficiently smooth densities.
3.4.1 Test for Goodness-of-fit
We again begin with the test for goodness-of-fit. As we show in Section 3.1, under $H_0^{\mathrm{GOF}}$, $T^{\mathrm{GOF}}_{n,\nu_n}\to_d N(0,1)$ if $1\ll\nu_n\ll n^{4/d}$; whereas for any $p\in\mathcal{W}^{s,2}$ such that $\|p - p_0\|_{L_2}\gg n^{-2s/(d+4s)}$, $T^{\mathrm{GOF}}_{n,\nu_n}\to\infty$ provided that $\nu_n\asymp n^{4/(d+4s)}$. This motivates us to consider the following test statistic:
\[
T^{\mathrm{GOF(adapt)}}_n = \max_{1\le\nu_n\le n^{2/d}} T^{\mathrm{GOF}}_{n,\nu_n}.
\]
In light of the earlier discussion, it is plausible that such a statistic could be used to detect any smooth departure from the null provided that the level of smoothness satisfies $s\ge d/4$. We now argue that this is indeed the case. More specifically, we shall reject $H_0^{\mathrm{GOF}}$ if and only if $T^{\mathrm{GOF(adapt)}}_n$ exceeds the upper $\alpha$ quantile, denoted by $q^{\mathrm{GOF}}_{n,\alpha}$, of its null distribution. In what follows, we shall call this test $\Phi^{\mathrm{GOF(adapt)}}$. Note that, even though it is hard to derive an analytic form for $q^{\mathrm{GOF}}_{n,\alpha}$, it can be readily evaluated via the Monte Carlo method.
To study the power of $\Phi^{\mathrm{GOF(adapt)}}$ against $H_1^{\mathrm{GOF}}$ with different levels of smoothness, we shall consider the following alternative hypothesis:
\[
H_1^{\mathrm{GOF(adapt)}}(\Delta_{n,s}: s\ge d/4): p\in\bigcup_{s\ge d/4}\left\{p\in\mathcal{W}^{s,2}(M): \|p - p_0\|_{L_2}\ge\Delta_{n,s}\right\}.
\]
The following theorem characterizes the power of ΦGOF(adapt) against 𝐻GOF(adapt)1 (Δ𝑛,𝑠 : 𝑠 ≥ 𝑑/4).
Theorem 14. There exists a constant $c > 0$ such that if
\[
\liminf_{n\to\infty}\Delta_{n,s}(n/\log\log n)^{2s/(d+4s)} > c,
\]
then
\[
\mathrm{power}\{\Phi^{\mathrm{GOF(adapt)}}; H_1^{\mathrm{GOF(adapt)}}(\Delta_{n,s}: s\ge d/4)\}\to 1.
\]
Theorem 14 shows that $\Phi^{\mathrm{GOF(adapt)}}$ has a detection boundary of the order $(\log\log n/n)^{\frac{2s}{d+4s}}$ when $p\in\mathcal{W}^{s,2}$ for any $s\ge d/4$. If $s$ is known in advance, then, as we show in Section 3.1, the optimal test is based on $T^{\mathrm{GOF}}_{n,\nu_n}$ with $\nu_n\asymp n^{4/(d+4s)}$ and has a detection boundary of the order $O(n^{-2s/(d+4s)})$. The extra factor $(\log\log n)^{2s/(d+4s)}$, polynomial in the iterated logarithm, is the price we pay to ensure that no knowledge of $s$ is required and that $\Phi^{\mathrm{GOF(adapt)}}$ is powerful against smooth alternatives for all $s\ge d/4$.
3.4.2 Test for Homogeneity
The treatment for homogeneity tests is similar. Instead of $T^{\mathrm{HOM}}_{n,\nu_n}$, we now consider a test based on
\[
T^{\mathrm{HOM(adapt)}}_n = \max_{1\le\nu_n\le n^{2/d}} T^{\mathrm{HOM}}_{n,\nu_n}.
\]
If $T^{\mathrm{HOM(adapt)}}_n$ exceeds the upper $\alpha$ quantile, denoted by $q^{\mathrm{HOM}}_{n,\alpha}$, of its null distribution, then we reject $H_0^{\mathrm{HOM}}$. In what follows, we shall refer to this test as $\Phi^{\mathrm{HOM(adapt)}}$. As before, we do not have a closed-form expression for $q^{\mathrm{HOM}}_{n,\alpha}$, and it needs to be evaluated via the Monte Carlo method. In particular, in the case of the homogeneity test, we can approximate $q^{\mathrm{HOM}}_{n,\alpha}$ by permutation: we randomly shuffle $\{X_1,\ldots,X_n,Y_1,\ldots,Y_m\}$ and compute the test statistic as if the first $n$ shuffled observations were from the first population and the other $m$ were from the second population. This is repeated multiple times in order to approximate the critical value $q^{\mathrm{HOM}}_{n,\alpha}$.
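The permutation scheme just described can be sketched as follows. For readability we use a simple plug-in Gaussian-MMD statistic as a stand-in for $T^{\mathrm{HOM}}_{n,\nu_n}$ (the self-normalized statistic would slot into the same loop), and work in one dimension; both simplifications are ours.

```python
import numpy as np

def mmd2_biased(X, Y, nu):
    # Plug-in Gaussian-MMD statistic (a stand-in for the text's T^HOM).
    Kxx = np.exp(-nu * (X[:, None] - X[None, :]) ** 2).mean()
    Kyy = np.exp(-nu * (Y[:, None] - Y[None, :]) ** 2).mean()
    Kxy = np.exp(-nu * (X[:, None] - Y[None, :]) ** 2).mean()
    return Kxx + Kyy - 2.0 * Kxy

def permutation_critical_value(X, Y, nu, alpha=0.05, B=100, seed=0):
    """Approximate the upper-alpha critical value: shuffle the pooled
    sample and treat the first n shuffled points as the first sample."""
    rng = np.random.default_rng(seed)
    Z = np.concatenate([X, Y])
    n = len(X)
    stats = [mmd2_biased((perm := rng.permutation(Z))[:n], perm[n:], nu)
             for _ in range(B)]
    return float(np.quantile(stats, 1.0 - alpha))
```

Because the pooled sample is exchangeable under $H_0^{\mathrm{HOM}}$, the shuffled statistics mimic the null distribution, so the test calibrated this way has (approximately) the targeted size regardless of the scaling parameter.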
The following theorem characterizes the power of $\Phi^{\mathrm{HOM(adapt)}}$ against an alternative with different levels of smoothness:
\[
H_1^{\mathrm{HOM(adapt)}}(\Delta_{n,s}: s\ge d/4): (p, q)\in\bigcup_{s\ge d/4}\left\{(p, q): p, q\in\mathcal{W}^{s,2}(M), \|p - q\|_{L_2}\ge\Delta_{n,s}\right\}.
\]
Theorem 15. There exists a constant $c > 0$ such that if
\[
\liminf_{n\to\infty}\Delta_{n,s}(n/\log\log n)^{2s/(d+4s)} > c,
\]
then
\[
\mathrm{power}\{\Phi^{\mathrm{HOM(adapt)}}; H_1^{\mathrm{HOM(adapt)}}(\Delta_{n,s}: s\ge d/4)\}\to 1.
\]
Similar to the case of the goodness-of-fit test, Theorem 15 shows that $\Phi^{\mathrm{HOM(adapt)}}$ has a detection boundary of the order $O((n/\log\log n)^{-2s/(d+4s)})$ when $p\ne q\in\mathcal{W}^{s,2}$ for any $s\ge d/4$. In light of the results from Section 3.2, this is optimal up to an extra factor polynomial in the iterated logarithm. The main advantage is that $\Phi^{\mathrm{HOM(adapt)}}$ is powerful against smooth alternatives simultaneously for all $s\ge d/4$.
3.4.3 Test for Independence
Similarly, for the independence test, we shall adopt the following test statistic:
\[
T^{\mathrm{IND(adapt)}}_n = \max_{1\le\nu_n\le n^{2/d}} T^{\mathrm{IND}}_{n,\nu_n},
\]
and reject $H_0^{\mathrm{IND}}$ if and only if $T^{\mathrm{IND(adapt)}}_n$ exceeds the upper $\alpha$ quantile, denoted by $q^{\mathrm{IND}}_{n,\alpha}$, of its null distribution. In what follows, we shall refer to this test as $\Phi^{\mathrm{IND(adapt)}}$. The critical value $q^{\mathrm{IND}}_{n,\alpha}$ can also be evaluated via a permutation test. See, e.g., Pfister et al. (2018) for detailed discussions.
We now show that $\Phi^{\mathrm{IND(adapt)}}$ is powerful in testing against the alternative with different levels of smoothness:
\[
H_1^{\mathrm{IND(adapt)}}(\Delta_{n,s}: s\ge d/4): p\in\bigcup_{s\ge d/4}\left\{p\in\mathcal{W}^{s,2}(M),\ p_j\in\mathcal{W}^{s,2}(M_j),\ 1\le j\le k,\ \|p - p_1\otimes\cdots\otimes p_k\|_{L_2}\ge\Delta_{n,s}\right\}.
\]
More specifically, we have
Theorem 16. There exists a constant $c > 0$ such that if
\[
\liminf_{n\to\infty}\Delta_{n,s}(n/\log\log n)^{2s/(d+4s)} > c,
\]
then
\[
\mathrm{power}\{\Phi^{\mathrm{IND(adapt)}}; H_1^{\mathrm{IND(adapt)}}(\Delta_{n,s}: s\ge d/4)\}\to 1.
\]
Similar to before, Theorem 16 shows that $\Phi^{\mathrm{IND(adapt)}}$ is optimal, up to an extra factor polynomial in the iterated logarithm, for detecting smooth departures from independence simultaneously for all $s\ge d/4$.
Chapter 4: Numerical Experiments
To further complement our theoretical development and demonstrate the practical merits of
the proposed methodology, we conducted several sets of numerical experiments. We shall mainly
consider Gaussian kernels in this chapter as they are the most popular choices in practice for
continuous data.
4.1 Effect of Scaling Parameter
Our first set of experiments was designed to illustrate the importance of the scaling parameter and to highlight the potential room for improvement over the “median” heuristic, one of the most common data-driven choices of the scaling parameter in practice (see, e.g., Gretton et al., 2008; Pfister et al., 2018).
• Experiment I: the homogeneity test with the underlying distributions being a normal distribution and a mixture of several normal distributions. Specifically,
\[
p(x) = f(x; 0, 1), \qquad q(x) = 0.5\times f(x; 0, 1) + 0.1\times\sum_{\mu\in\boldsymbol{\mu}} f(x; \mu, 0.05),
\]
where $f(x; \mu, \sigma)$ denotes the density of $N(\mu, \sigma^2)$ and $\boldsymbol{\mu} = \{-1, -0.5, 0, 0.5, 1\}$.
• Experiment II: the joint independence test of $X^1, \cdots, X^5$, where
\[
X^1, \cdots, X^4, (X^5)'\sim_{\mathrm{iid}} N(0, 1), \qquad X^5 = \left|(X^5)'\right|\times\mathrm{sign}\left(\prod_{l=1}^4 X^l\right).
\]
Clearly $X^1, \cdots, X^5$ are jointly dependent since $\prod_{l=1}^5 X^l\ge 0$.
In both experiments, our primary goal is to investigate how the power of the Gaussian MMD based test is influenced by a pre-fixed scaling parameter. These tests are also compared to ones with the scaling parameter selected via the “median” heuristic. In order to evaluate tests with different scaling parameters under a unified framework, we determined the critical values for each test via a permutation test.

For Experiment I we fixed the sample size at $n = m = 200$, and for Experiment II at $n = 400$. The number of permutations was set at 100, and the significance level at $\alpha = 0.05$. We first repeated the experiments 100 times under the null to verify that the permutation tests indeed yield the correct size, up to Monte Carlo error. Each experiment was then repeated 100 times, and the observed power ($\pm$ one standard error) was recorded for different choices of the scaling parameter. The results are summarized in Figure 4.1. It is perhaps not surprising that the scaling parameter selected via the “median” heuristic has little variation across the simulation runs, and we represent its performance by a single value.
Figure 4.1: Observed power against $\log(\nu)$ in Experiment I (left) and Experiment II (right); curves correspond to tests with a single fixed $\nu$, with the “median” heuristic shown for reference.
The importance of the scaling parameter is evident from Figure 4.1, with the observed power varying quite significantly across different choices. It is also of interest to note that in these settings the “median” heuristic typically does not yield a scaling parameter with great power. More specifically, in Experiment I, $\log(\nu_{\mathrm{median}})\approx 0.2$ while the maximum power is attained at $\log(\nu) = 4$; in Experiment II, $\log(\nu_{\mathrm{median}})\approx -2.15$ while the maximum power is attained at $\log(\nu) = 1$. This suggests that a more appropriate choice of the scaling parameter may lead to much improved performance.
4.2 Efficacy of Adaptation
Our second experiment aims to illustrate that the adaptive procedures we proposed in Section 3.4 indeed yield more powerful tests when compared with other alternatives that are commonly used in practice. In particular, we compare the proposed self-normalized adaptive test (S.A.) with a couple of data-driven approaches, namely the “median” heuristic (Median) and the unnormalized adaptive test (U.A.) proposed in Sriperumbudur et al. (2009). When computing both the self-normalized and unnormalized test statistics, we first rescaled the squared distance $\|X_i - X_j\|^2$ by the dimensionality $d$ before taking the maximum within a certain range of the scaling parameter.
We considered two experiment setups:
• Experiment III: the homogeneity test with the underlying distributions being
\[
P\sim N(0, I_d), \qquad Q\sim N\left(0, \left(1 + 2d^{-1/2}\right)I_d\right).
\]
As the “signal strength”, the ratio between the variances of $Q$ and $P$ in each single direction is set to decrease to 1 at the order $1/\sqrt{d}$ with $d$, which is the decreasing order of the variance ratio that can be detected by the classical $F$-test.
• Experiment IV: the independence test of $X^1, X^2\in\mathbb{R}^{d/2}$, where $X = (X^1, X^2)$ follows a mixture of
\[
N(0, I_d) \quad\text{and}\quad N\left(0, (1 + 6d^{-3/5})I_d\right)
\]
with mixture probability 0.5. Similarly, the ratio between the variances in each direction is set to decrease with $d$, but at a slightly higher rate.
To better compare the different methods, we considered different combinations of sample size and dimensionality for each experiment. More specifically, for Experiment III, the sample sizes were set to be $m = n = 25, 50, 75, \cdots, 200$ and the dimension $d = 1, 10, 100, 1000$; for Experiment IV, the sample sizes were $n = 100, 200, \cdots, 600$ and the dimension $d = 2, 10, 100, 1000$. In both experiments, we fixed the significance level at $\alpha = 0.05$ and used 100 permutations to calibrate the critical values as before. Again we simulated under $H_0$ to verify that the resulting tests have the targeted size, up to Monte Carlo error. The power of each method, estimated from 100 such experiments, is reported
in Figures 4.2 and 4.3.
Figure 4.2: Observed power versus sample size in Experiment III for $d = 1, 10, 100, 1000$ from left to right (methods: Median, U.A., S.A.).
Figure 4.3: Observed power versus sample size in Experiment IV for $d = 2, 10, 100, 1000$ from left to right (methods: Median, U.A., S.A.).
As Figures 4.2 and 4.3 show, for both experiments, these tests are comparable in low-dimensional
settings. But as 𝑑 increases, the proposed self-normalized adaptive test becomes more and more
preferable to the two alternatives. For example, for Experiment IV, when 𝑑 = 1000, the observed
power of the proposed self-normalized adaptive test is about 90% when 𝑛 = 600, while the other
two tests have power around only 15%.
4.3 Data Example
Finally, we applied the proposed self-normalized adaptive test in a data example from Mooij et al. (2016). The data set consists of three variables from different weather stations: altitude (Alt), average temperature (Temp), and average duration of sunshine (Sun). One goal of interest is to uncover the causal relationship among the three variables by identifying a suitable directed acyclic graph (DAG) among them. Following Peters et al. (2014), if a set of random variables $X^1, \cdots, X^d$ follows a DAG $\mathcal{G}_0$, then we assume that they follow a sequence of additive models:
\[
X^l = \sum_{r\in\mathrm{PA}_l} f_{l,r}(X^r) + N^l, \qquad \forall\, 1\le l\le d,
\]
where the $N^l$'s are independent Gaussian noises and $\mathrm{PA}_l$ denotes the collection of parent nodes of node $l$ specified by $\mathcal{G}_0$. As shown by Peters et al. (2014), $\mathcal{G}_0$ is identifiable from the joint distribution of $X^1, \cdots, X^d$ under the assumption that the $f_{l,r}$'s are non-linear. Therefore a natural method of deciding on a specific DAG underlying a set of random variables is to test the independence of the regression residuals after fitting the DAG-induced additive models. In our case, there are 25 possible DAGs in total for the three variables. We can apply independence tests to the residuals for each of the 25 DAGs and choose the one with the largest $p$-value as the most plausible underlying DAG. See Peters et al. (2014) for more details.
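The count of 25 candidate DAGs over three labeled nodes can be verified by brute force: enumerate the three possible states of each node pair (no edge, or an edge in either direction) and keep the acyclic graphs. This check is ours, purely to illustrate where the number 25 comes from.

```python
from itertools import product

def count_dags(p):
    """Count labeled DAGs on p nodes: try every orientation choice per
    unordered pair and keep the acyclic ones (Kahn-style peeling check)."""
    pairs = [(i, j) for i in range(p) for j in range(p) if i < j]
    count = 0
    for choice in product([0, 1, 2], repeat=len(pairs)):
        edges = set()
        for (i, j), c in zip(pairs, choice):
            if c == 1:
                edges.add((i, j))     # i -> j
            elif c == 2:
                edges.add((j, i))     # j -> i
        nodes, rem = set(range(p)), set(edges)
        while True:
            # Nodes with no incoming edges can be peeled off a DAG.
            free = {v for v in nodes if not any(e[1] == v for e in rem)}
            if not free:
                break
            nodes -= free
            rem = {e for e in rem if e[0] in nodes and e[1] in nodes}
        if not nodes:                 # everything peeled => acyclic
            count += 1
    return count
```

For three nodes there are $3^3 = 27$ orientation choices, of which only the two directed triangles are cyclic, leaving 25 DAGs.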
As before, we considered three different methods for the independence tests: the proposed self-normalized adaptive test (S.A.), the Gaussian kernel embedding based independence test with the scaling parameter determined by the “median” heuristic (Median), and the unnormalized adaptive test from Sriperumbudur et al. (2009) (U.A.). Note that the three variables have different scales, and we standardized them before applying the tests of independence.

The overall sample size of the data set is 349. Each time we randomly selected 150 samples and computed the $p$-value associated with each DAG. The $p$-value is again computed based on 100 permutations. We repeated the experiment 1000 times and recorded for each test the DAG with the largest $p$-value. All three tests agree on the top three most selected DAGs, and they are shown in Figure 4.4.
[Figure 4.4 displays three candidate DAGs over the nodes Alt, Temp, and Sun, labeled DAG I, DAG II, and DAG III.]
Figure 4.4: DAGs with the top 3 highest probabilities of being selected.
In addition, we report in Table 4.1 the frequencies with which these three DAGs were selected by each of the tests. The results are generally comparable, with the proposed method most consistently selecting DAG I, the one heavily favored by all three methods.
Test      DAG I    DAG II    DAG III
Median    78.5     4.7       14.5
U.A.      81.4     8.1       8.5
S.A.      83.4     9.8       4.7

Table 4.1: Frequency (%) with which each DAG in Figure 4.4 was selected by the three tests.
Chapter 5: Conclusion and Discussion
In this thesis, we address the problem of kernel selection in kernel embedding based nonparametric hypothesis testing. This is a question that researchers and practitioners have grappled with ever since kernel embedding methods were first proposed, and most existing solutions are ad hoc. We propose principled kernel selection procedures in two different settings and prove that they ensure minimax rate optimality for the associated tests. We also propose adaptive test statistics to address the dependence of the aforementioned kernel selection methods on the regularity of the underlying space of probability distributions; the sacrifice in terms of the detection boundary is only a power of an iterated logarithm of the sample size.
Many interesting problems in this area remain to be explored. For example, can we adopt fast computation techniques to approximate the kernel based test statistics, substantially reducing the computational complexity while maintaining statistical optimality? Parallel results have been derived in the context of regression, but such results appear to be lacking for hypothesis testing. A second direction involves resampling methods such as the permutation test and the bootstrap. In practice, when the sample size may not be large enough, resampling is commonly used to determine the rejection threshold. Can we still guarantee the statistical optimality of the proposed tests when they are combined with resampling?
It is also natural to ask whether similar principled kernel selection methods can be developed for a broader range of nonparametric testing problems, such as conditional independence testing, which can be very useful in Bayesian network learning and causal discovery.
Chapter 6: Proofs
Throughout this chapter, we shall write $a_n \lesssim b_n$ if there exists a universal constant $C > 0$ such that $a_n \le C b_n$. Similarly, we write $a_n \gtrsim b_n$ if $b_n \lesssim a_n$, and $a_n \asymp b_n$ if both $a_n \lesssim b_n$ and $a_n \gtrsim b_n$. When the constant depends on another quantity $D$, we shall write $a_n \lesssim_D b_n$; the relations $\gtrsim_D$ and $\asymp_D$ are defined accordingly.
Proof of Theorem 1. Part (i). The proof of the first part consists of two key steps. First, we show that the population counterpart $n\gamma^2(P, P_0)$ of the test statistic tends to $\infty$ uniformly, i.e.,
$$n \inf_{P \in \mathcal{P}(\Delta_n, 0)} \gamma^2(P, P_0) \to \infty.$$
Then, we argue that the deviation of $\gamma^2(\hat{P}_n, P_0)$ from $\gamma^2(P, P_0)$ is uniformly negligible compared with $\gamma^2(P, P_0)$ itself, where $\hat{P}_n$ denotes the empirical distribution of $X_1, \ldots, X_n$.
It is not hard to see that
$$\gamma(\hat{P}_n, P_0) = \sqrt{\sum_{k \ge 1} \lambda_k \Big[\frac{1}{n}\sum_{i=1}^{n} \phi_k(X_i)\Big]^2} \ge \sqrt{\sum_{k \ge 1} \lambda_k \big[\mathbb{E}_P \phi_k(X)\big]^2} - \sqrt{\sum_{k \ge 1} \lambda_k \Big[\frac{1}{n}\sum_{i=1}^{n} \phi_k(X_i) - \mathbb{E}_P \phi_k(X)\Big]^2}.$$
Thus,
$$P\big\{n\gamma^2(\hat{P}_n, P_0) < q_{w,1-\alpha}\big\} \le P\Bigg\{\sqrt{n\sum_{k\ge1}\lambda_k\Big[\frac{1}{n}\sum_{i=1}^n\phi_k(X_i)-\mathbb{E}_P\phi_k(X)\Big]^2} > \sqrt{n\sum_{k\ge1}\lambda_k\big[\mathbb{E}_P\phi_k(X)\big]^2} - \sqrt{q_{w,1-\alpha}}\Bigg\}.$$
Suppose that
$$n\sum_{k\ge1}\lambda_k\big[\mathbb{E}_P\phi_k(X)\big]^2 > q_{w,1-\alpha}.$$
Then
$$P\big\{n\gamma^2(\hat{P}_n, P_0) < q_{w,1-\alpha}\big\} \le \frac{\mathbb{E}_P\Big\{n\sum_{k\ge1}\lambda_k\Big[\frac{1}{n}\sum_{i=1}^n\phi_k(X_i)-\mathbb{E}_P\phi_k(X)\Big]^2\Big\}}{\Big\{\sqrt{n\sum_{k\ge1}\lambda_k\big[\mathbb{E}_P\phi_k(X)\big]^2}-\sqrt{q_{w,1-\alpha}}\Big\}^2}.$$
Observe that for any $P \in \mathcal{P}(\Delta_n, 0)$,
$$\mathbb{E}_P\Big\{n\sum_{k\ge1}\lambda_k\Big[\frac{1}{n}\sum_{i=1}^n\phi_k(X_i)-\mathbb{E}_P\phi_k(X)\Big]^2\Big\} = \sum_{k\ge1}\lambda_k\mathrm{Var}[\phi_k(X)] \le \sum_{k\ge1}\lambda_k\mathbb{E}_P\phi_k^2(X) \le \Big(\sup_{k\ge1}\|\phi_k\|_\infty\Big)^2\sum_{k\ge1}\lambda_k < \infty.$$
This implies that
$$\lim_{n\to\infty}\beta(\Phi_{\mathrm{MMD}};\Delta_n,0) = \lim_{n\to\infty}\sup_{P\in\mathcal{P}(\Delta_n,0)} P\big\{n\gamma^2(\hat{P}_n,P_0) < q_{w,1-\alpha}\big\} \le \lim_{n\to\infty}\frac{\sup\limits_{P\in\mathcal{P}(\Delta_n,0)}\mathbb{E}_P\Big\{n\sum_{k\ge1}\lambda_k\big[\frac1n\sum_{i=1}^n\phi_k(X_i)-\mathbb{E}_P\phi_k(X)\big]^2\Big\}}{\inf\limits_{P\in\mathcal{P}(\Delta_n,0)}\Big\{\sqrt{n\sum_{k\ge1}\lambda_k[\mathbb{E}_P\phi_k(X)]^2}-\sqrt{q_{w,1-\alpha}}\Big\}^2} = 0,$$
provided that
$$\inf_{P\in\mathcal{P}(\Delta_n,0)} n\sum_{k\ge1}\lambda_k\big[\mathbb{E}_P\phi_k(X)\big]^2 \to \infty, \quad \text{as } n\to\infty. \tag{6.1}$$
It now suffices to show that (6.1) holds if $n\Delta_n^4 \to \infty$ as $n\to\infty$. To this end, let $u = dP/dP_0 - 1$ and
$$a_k = \langle u, \phi_k\rangle_{L^2(P_0)} = \mathbb{E}_P\phi_k(X) - \mathbb{E}_{P_0}\phi_k(X) = \mathbb{E}_P\phi_k(X).$$
It is clear that
$$\sum_{k\ge1}\lambda_k^{-1}a_k^2 = \|u\|_K^2, \quad\text{and}\quad \sum_{k\ge1}a_k^2 = \|u\|^2_{L^2(P_0)} = \chi^2(P, P_0).$$
By the definition of $\mathcal{P}(\Delta_n,0)$,
$$\sup_{P\in\mathcal{P}(\Delta_n,0)}\sum_{k\ge1}\lambda_k^{-1}a_k^2 \le M^2, \quad\text{and}\quad \inf_{P\in\mathcal{P}(\Delta_n,0)}\sum_{k\ge1}a_k^2 \ge \Delta_n^2.$$
Since $n\Delta_n^4 \to \infty$ as $n\to\infty$, we get, by the Cauchy–Schwarz inequality,
$$\inf_{P\in\mathcal{P}(\Delta_n,0)} n\sum_{k\ge1}\lambda_k\big[\mathbb{E}_P\phi_k(X)\big]^2 = \inf_{P\in\mathcal{P}(\Delta_n,0)} n\sum_{k\ge1}\lambda_k a_k^2 \ge \inf_{P\in\mathcal{P}(\Delta_n,0)} \frac{n\big(\sum_{k\ge1}a_k^2\big)^2}{\sum_{k\ge1}\lambda_k^{-1}a_k^2} \ge \frac{n\Delta_n^4}{M^2} \to \infty$$
as $n\to\infty$.
Part (ii). In proving the second part, we will make use of the following lemma, which can be obtained by adapting the argument in Gregory (1977). It gives the limiting distribution of the V-statistic under a sequence $P_n$ converging to $P_0$ at the rate $n^{-1/4}$.

Lemma 2. Consider a sequence of probability measures $\{P_n : n\ge1\}$ contiguous to $P_0$ satisfying $u_n = dP_n/dP_0 - 1 \to 0$ in $L^2(P_0)$. Suppose that for any fixed $k$,
$$\lim_{n\to\infty}\sqrt{n}\,\langle u_n,\phi_k\rangle_{L^2(P_0)} = \tilde{a}_k, \quad\text{and}\quad \lim_{n\to\infty}\sum_{k\ge1}\lambda_k\big(\sqrt{n}\,\langle u_n,\phi_k\rangle_{L^2(P_0)}\big)^2 = \sum_{k\ge1}\lambda_k\tilde{a}_k^2 + \tilde{a}_0 < \infty,$$
for some sequence $\{\tilde{a}_k : k\ge0\}$. Then
$$\frac{1}{n}\sum_{k\ge1}\lambda_k\Big[\sum_{i=1}^n\phi_k(X_i)\Big]^2 \xrightarrow{d} \sum_{k\ge1}\lambda_k(Z_k+\tilde{a}_k)^2 + \tilde{a}_0,$$
where $X_1,\ldots,X_n \overset{i.i.d.}{\sim} P_n$, and the $Z_k$'s are independent standard normal random variables.
Write $L(k) = \lambda_k k^{2s}$. By assumption (2.6),
$$0 < \underline{L} := \inf_{k\ge1}L(k) \le \sup_{k\ge1}L(k) =: \overline{L} < \infty.$$
Consider a sequence $\{P_n : n\ge1\}$ such that
$$dP_n/dP_0 - 1 = C_1\sqrt{\lambda_{k_n}}\,[L(k_n)]^{-1}\phi_{k_n},$$
where $C_1$ is a positive constant and $k_n = \lfloor C_2 n^{\frac{1}{4s}}\rfloor$ for some positive constant $C_2$. Both $C_1$ and $C_2$ will be determined later. Since $\sup_{k\ge1}\|\phi_k\|_\infty < \infty$ and $\lim_{k\to\infty}\lambda_k = 0$, there exists $N_0 > 0$ such that the $P_n$'s are well-defined probability measures for any $n \ge N_0$.
Note that
$$\|u_n\|_K^2 = \frac{C_1^2}{L^2(k_n)} \le \underline{L}^{-2}C_1^2$$
and
$$\|u_n\|^2_{L^2(P_0)} = \frac{C_1^2\lambda_{k_n}}{L^2(k_n)} = \frac{C_1^2}{L(k_n)}k_n^{-2s} \ge \overline{L}^{-1}C_1^2k_n^{-2s} \sim \overline{L}^{-1}C_1^2C_2^{-2s}n^{-1/2},$$
where $A_n \sim B_n$ means that $\lim_{n\to\infty}A_n/B_n = 1$. Thus, by choosing $C_1$ sufficiently small and $c_0 = \frac{1}{2}\overline{L}^{-1}C_1^2C_2^{-2s}$, we ensure that $P_n \in \mathcal{P}(c_0 n^{-1/4}, 0)$ for sufficiently large $n$.
To apply Lemma 2, we note that
$$\lim_{n\to\infty}\|u_n\|^2_{L^2(P_0)} = \lim_{n\to\infty}\frac{C_1^2\lambda_{k_n}}{L^2(k_n)} = 0.$$
In addition, for any fixed $k$,
$$\tilde{a}_{n,k} = \sqrt{n}\,\langle u_n,\phi_k\rangle_{L^2(P_0)} = 0$$
for sufficiently large $n$, and
$$\sum_{k\ge1}\lambda_k\tilde{a}_{n,k}^2 = \frac{nC_1^2\lambda_{k_n}^2}{L^2(k_n)} = nC_1^2k_n^{-4s} \to C_1^2C_2^{-4s}$$
as $n\to\infty$. Thus, Lemma 2 implies that
$$n\gamma^2(\hat{P}_n, P_0) \xrightarrow{d} \sum_{k\ge1}\lambda_kZ_k^2 + C_1^2C_2^{-4s}.$$
Now take $C_2 = \big(2C_1^2/q_{w,1-\alpha}\big)^{\frac{1}{4s}}$ so that $C_1^2C_2^{-4s} = \frac{1}{2}q_{w,1-\alpha}$. Then
$$\liminf_{n\to\infty}\beta\big(\Phi_{\mathrm{MMD}}; c_0n^{-1/4}, 0\big) \ge \lim_{n\to\infty}P\big(n\gamma^2(\hat{P}_n,P_0) < q_{w,1-\alpha}\big) = P\Big(\sum_{k\ge1}\lambda_kZ_k^2 < \tfrac{1}{2}q_{w,1-\alpha}\Big) > 0,$$
which concludes the proof.
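The limiting behavior above is easy to illustrate numerically. The sketch below is a standalone illustration (the eigenvalue choice $\lambda_k = k^{-2}$, the truncation level, and the Monte Carlo sizes are assumptions, not taken from the thesis): it draws from the weighted chi-square law $\sum_k \lambda_k Z_k^2$, estimates a $95\%$ quantile playing the role of $q_{w,1-\alpha}$, and confirms that the event $\{\sum_k \lambda_k Z_k^2 < \frac{1}{2}q_{w,1-\alpha}\}$ has positive probability, which is exactly why the Type II error stays bounded away from zero along the constructed alternatives.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = np.arange(1, 201) ** -2.0           # eigenvalues lambda_k = k^{-2s} with s = 1 (assumed)
Z = rng.standard_normal((20_000, lam.size))
S = (Z ** 2) @ lam                        # draws from sum_k lambda_k Z_k^2

q = np.quantile(S, 0.95)                  # Monte Carlo stand-in for q_{w, 1 - alpha}
p_below = np.mean(S < q / 2)              # P( sum_k lambda_k Z_k^2 < q / 2 )
print(f"q_0.95 ~ {q:.3f}, P(S < q/2) ~ {p_below:.3f}")
```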
Proof of Theorem 2. Let $\bar{K}_n(\cdot,\cdot) := \bar{K}_{\varrho_n}(\cdot,\cdot)$. Note that
$$v_n^{-1/2}\big[n\eta^2_{\varrho_n}(\hat{P}_n, P_0) - A_n\big] = 2(n^2v_n)^{-1/2}\sum_{j=2}^n\sum_{i=1}^{j-1}\bar{K}_n(X_i, X_j).$$
Let $\zeta_{nj} = \sum_{i=1}^{j-1}\bar{K}_n(X_i,X_j)$, and consider the filtration $\{\mathcal{F}_j : j\ge1\}$ where $\mathcal{F}_j = \sigma\{X_i : 1\le i\le j\}$. Due to the assumption that $K$ is degenerate, we have $\mathbb{E}\phi_k(X) = 0$ for any $k\ge1$, which implies that
$$\mathbb{E}(\zeta_{nj}\,|\,\mathcal{F}_{j-1}) = \sum_{i=1}^{j-1}\mathbb{E}\big[\bar{K}_n(X_i,X_j)\,\big|\,\mathcal{F}_{j-1}\big] = \sum_{i=1}^{j-1}\mathbb{E}\big[\bar{K}_n(X_i,X_j)\,\big|\,X_i\big] = 0,$$
for any $j\ge2$. Write
$$U_{nm} = \begin{cases} 0 & m = 1,\\ \sum_{j=2}^m\zeta_{nj} & m\ge2.\end{cases}$$
Then for any fixed $n$, $\{U_{nm}\}_{m\ge1}$ is a martingale with respect to $\{\mathcal{F}_m : m\ge1\}$ and
$$v_n^{-1/2}\big[n\eta^2_{\varrho_n}(\hat{P}_n,P_0) - A_n\big] = 2(n^2v_n)^{-1/2}U_{nn}.$$
We now apply the martingale central limit theorem to $U_{nn}$. Following the argument from Hall (1984), it can be shown that
$$\Big[\tfrac{1}{2}n^2\mathbb{E}\bar{K}_n^2(X,X')\Big]^{-1/2}U_{nn} \xrightarrow{d} N(0,1), \tag{6.2}$$
provided that
$$\Big[\mathbb{E}G_n^2(X,X') + n^{-1}\mathbb{E}\bar{K}_n^2(X,X')\bar{K}_n^2(X,X'') + n^{-2}\mathbb{E}\bar{K}_n^4(X,X')\Big]\big/\big[\mathbb{E}\bar{K}_n^2(X,X')\big]^2 \to 0 \tag{6.3}$$
as $n\to\infty$, where $G_n(x,x') = \mathbb{E}\bar{K}_n(X,x)\bar{K}_n(X,x')$. Since
$$\mathbb{E}\bar{K}_n^2(X,X') = \sum_{k\ge1}\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^2 = v_n,$$
(6.2) implies that
$$v_n^{-1/2}\big[n\eta^2_{\varrho_n}(\hat{P}_n,P_0) - A_n\big] = \sqrt{2}\cdot\Big(\tfrac{1}{2}n^2\mathbb{E}\bar{K}_n^2(X,X')\Big)^{-1/2}U_{nn} \xrightarrow{d} N(0,2).$$
It therefore suffices to verify (6.3).
Note that
$$\mathbb{E}\bar{K}_n^2(X,X') = \sum_{k\ge1}\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^2 \ge \frac{1}{4}\big|\{k : \lambda_k \ge \varrho_n^2\}\big| + \frac{1}{4\varrho_n^4}\sum_{\lambda_k<\varrho_n^2}\lambda_k^2 \asymp \varrho_n^{-1/s},$$
where the last step holds because $\lambda_k \asymp k^{-2s}$. Similarly,
$$\mathbb{E}G_n^2(X,X') = \sum_{k\ge1}\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^4 \le \big|\{k : \lambda_k\ge\varrho_n^2\}\big| + \varrho_n^{-8}\sum_{\lambda_k<\varrho_n^2}\lambda_k^4 \asymp \varrho_n^{-1/s},$$
and
$$\mathbb{E}\bar{K}_n^2(X,X')\bar{K}_n^2(X,X'') = \mathbb{E}\bigg\{\sum_{k\ge1}\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^2\phi_k^2(X)\bigg\}^2 \le \Big(\sup_{k\ge1}\|\phi_k\|_\infty\Big)^4\bigg\{\sum_{k\ge1}\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^2\bigg\}^2 \asymp \varrho_n^{-2/s}.$$
Thus there exists a positive constant $C_3$ such that
$$\mathbb{E}G_n^2(X,X')\big/\big[\mathbb{E}\bar{K}_n^2(X,X')\big]^2 \le C_3\varrho_n^{1/s} \to 0, \tag{6.4}$$
and
$$n^{-1}\mathbb{E}\bar{K}_n^2(X,X')\bar{K}_n^2(X,X'')\big/\big[\mathbb{E}\bar{K}_n^2(X,X')\big]^2 \le C_3n^{-1} \to 0, \tag{6.5}$$
as $n\to\infty$. On the other hand,
$$\mathbb{E}\bar{K}_n^4(X,X') \le \|\bar{K}_n\|_\infty^2\,\mathbb{E}\bar{K}_n^2(X,X'),$$
where
$$\|\bar{K}_n\|_\infty = \sup_x\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\phi_k^2(x)\bigg\} \le \Big(\sup_{k\ge1}\|\phi_k\|_\infty\Big)^2\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2} \asymp \varrho_n^{-1/s}.$$
This implies that for some positive constant $C_4$,
$$n^{-2}\mathbb{E}\bar{K}_n^4(X,X')\big/\big[\mathbb{E}\bar{K}_n^2(X,X')\big]^2 \le n^{-2}\|\bar{K}_n\|_\infty^2\big/\mathbb{E}\bar{K}_n^2(X,X') \le C_4\big(n^2\varrho_n^{1/s}\big)^{-1} \to 0 \tag{6.6}$$
as $n\to\infty$. Together, (6.4), (6.5) and (6.6) ensure that condition (6.3) holds.
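The order assessments of the spectral sums used in this proof, e.g. $v_n = \sum_k\big(\lambda_k/(\lambda_k+\varrho_n^2)\big)^2 \asymp \varrho_n^{-1/s}$, are easy to check numerically. The snippet below is an illustration under the assumed polynomial decay $\lambda_k = k^{-2s}$ with $s = 1$ (a choice made here for the demonstration only): halving-free scaling predicts that shrinking $\varrho$ by a factor of 4 inflates the sum by roughly $4^{1/s} = 4$.

```python
import numpy as np

def v(rho, s=1.0, K=200_000):
    # v_n = sum_k ( lambda_k / (lambda_k + rho^2) )^2 with lambda_k = k^{-2s}
    lam = np.arange(1, K + 1, dtype=float) ** (-2 * s)
    return np.sum((lam / (lam + rho ** 2)) ** 2)

r = v(0.01) / v(0.04)   # expected to be close to (0.04 / 0.01)^(1/s) = 4
print(round(r, 2))
```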
Proof of Theorem 3. Note that
$$n\eta^2_{\varrho_n}(\hat{P}_n,P_0) - \frac{1}{n}\sum_{i=1}^n\bar{K}_n(X_i,X_i) = \frac{1}{n}\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\sum_{\substack{1\le i,j\le n\\ i\ne j}}\phi_k(X_i)\phi_k(X_j)$$
$$= \frac{1}{n}\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\sum_{\substack{1\le i,j\le n\\ i\ne j}}\big[\phi_k(X_i)-\mathbb{E}_P\phi_k(X)\big]\big[\phi_k(X_j)-\mathbb{E}_P\phi_k(X)\big]$$
$$\quad + \frac{2(n-1)}{n}\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big[\mathbb{E}_P\phi_k(X)\big]\sum_{1\le i\le n}\big[\phi_k(X_i)-\mathbb{E}_P\phi_k(X)\big] + \frac{n(n-1)}{n}\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big[\mathbb{E}_P\phi_k(X)\big]^2$$
$$:= V_1 + V_2 + V_3.$$
Obviously, $\mathbb{E}_PV_1V_2 = 0$. We first argue that the following three statements together imply the desired result:
$$\lim_{n\to\infty}\inf_{P\in\mathcal{P}(\Delta_n,\theta)}v_n^{-1/2}V_3 = \infty, \tag{6.7}$$
$$\sup_{P\in\mathcal{P}(\Delta_n,\theta)}\big(\mathbb{E}_PV_1^2/V_3^2\big) = o(1), \tag{6.8}$$
$$\sup_{P\in\mathcal{P}(\Delta_n,\theta)}\big(\mathbb{E}_PV_2^2/V_3^2\big) = o(1). \tag{6.9}$$
To see this, note that (6.7) implies that
$$\lim_{n\to\infty}\inf_{P\in\mathcal{P}(\Delta_n,\theta)}P\Big(v_n^{-1/2}\big[n\eta^2_{\varrho_n}(\hat{P}_n,P_0)-A_n\big] \ge \sqrt{2}z_{1-\alpha}\Big) \ge \lim_{n\to\infty}\inf_{P\in\mathcal{P}(\Delta_n,\theta)}P\Big(v_n^{-1/2}V_3 \ge 2\sqrt{2}z_{1-\alpha},\ V_1+V_2+V_3 \ge \tfrac{1}{2}V_3\Big) = \lim_{n\to\infty}\inf_{P\in\mathcal{P}(\Delta_n,\theta)}P\Big(V_1+V_2+V_3 \ge \tfrac{1}{2}V_3\Big).$$
On the other hand, (6.8) and (6.9) imply that
$$\lim_{n\to\infty}\inf_{P\in\mathcal{P}(\Delta_n,\theta)}P\Big(V_1+V_2+V_3 \ge \tfrac{1}{2}V_3\Big) = 1 - \lim_{n\to\infty}\sup_{P\in\mathcal{P}(\Delta_n,\theta)}P\Big(V_1+V_2+V_3 < \tfrac{1}{2}V_3\Big) \ge 1 - \lim_{n\to\infty}\sup_{P\in\mathcal{P}(\Delta_n,\theta)}\frac{\mathbb{E}_P(V_1+V_2)^2}{(V_3/2)^2} = 1.$$
This immediately implies that $\Phi_{\mathrm{M3d}}$ is consistent. We now show that (6.7)–(6.9) indeed hold.
Verifying (6.7). We begin with (6.7). Since $v_n \asymp \varrho_n^{-1/s}$ and $V_3 = (n-1)\eta^2_{\varrho_n}(P,P_0)$, (6.7) is equivalent to
$$\lim_{n\to\infty}\inf_{P\in\mathcal{P}(\Delta_n,\theta)}n\varrho_n^{\frac{1}{2s}}\eta^2_{\varrho_n}(P,P_0) = \infty.$$
For any $P\in\mathcal{P}(\Delta_n,\theta)$, let $u = dP/dP_0 - 1$ and $a_k = \langle u,\phi_k\rangle_{L^2(P_0)} = \mathbb{E}_P\phi_k(X)$. Based on the assumption that $K$ is universal, $u = \sum_{k\ge1}a_k\phi_k$. We consider the cases $\theta = 0$ and $\theta > 0$ separately.
(1) First consider $\theta = 0$. It is clear that
$$\eta^2_{\varrho_n}(P,P_0) = \sum_{k\ge1}a_k^2 - \sum_{k\ge1}\frac{\varrho_n^2}{\lambda_k+\varrho_n^2}a_k^2 \ge \|u\|^2_{L^2(P_0)} - \varrho_n^2\sum_{k\ge1}\frac{1}{\lambda_k}a_k^2 \ge \|u\|^2_{L^2(P_0)} - \varrho_n^2M^2.$$
Take $\varrho_n \le \sqrt{\Delta_n^2/(2M^2)}$ so that $\varrho_n^2M^2 \le \frac{1}{2}\Delta_n^2$. Then we have
$$\inf_{P\in\mathcal{P}(\Delta_n,0)}\eta^2_{\varrho_n}(P,P_0) \ge \frac{1}{2}\inf_{P\in\mathcal{P}(\Delta_n,0)}\|u\|^2_{L^2(P_0)} = \frac{1}{2}\Delta_n^2.$$
(2) Now consider the case when $\theta > 0$. For $P\in\mathcal{P}(\Delta_n,\theta)$ and any $R > 0$, there exists $f_R\in\mathcal{H}(K)$ such that $\|u-f_R\|_{L^2(P_0)} \le MR^{-1/\theta}$ and $\|f_R\|_K \le R$. Let $b_k = \langle f_R,\phi_k\rangle_{L^2(P_0)}$. Then
$$\eta^2_{\varrho_n}(P,P_0) = \sum_{k\ge1}a_k^2 - \sum_{k\ge1}\frac{\varrho_n^2}{\lambda_k+\varrho_n^2}a_k^2 \ge \|u\|^2_{L^2(P_0)} - 2\sum_{k\ge1}\frac{\varrho_n^2}{\lambda_k+\varrho_n^2}(a_k-b_k)^2 - 2\sum_{k\ge1}\frac{\varrho_n^2}{\lambda_k+\varrho_n^2}b_k^2$$
$$\ge \|u\|^2_{L^2(P_0)} - 2\sum_{k\ge1}(a_k-b_k)^2 - 2\varrho_n^2\sum_{k\ge1}\frac{1}{\lambda_k}b_k^2 = \|u\|^2_{L^2(P_0)} - 2\|u-f_R\|^2_{L^2(P_0)} - 2\varrho_n^2\|f_R\|_K^2.$$
Taking $R = \big(2M/\|u\|_{L^2(P_0)}\big)^\theta$ yields that
$$\eta^2_{\varrho_n}(P,P_0) \ge \|u\|^2_{L^2(P_0)} - 2M^2R^{-2/\theta} - 2\varrho_n^2R^2 = \frac{1}{2}\|u\|^2_{L^2(P_0)} - 2\varrho_n^2R^2.$$
Now by choosing
$$\varrho_n \le \frac{1}{2\sqrt{2}}(2M)^{-\theta}\Delta_n^{1+\theta},$$
we can ensure that $2\varrho_n^2R^2 \le \frac{1}{4}\|u\|^2_{L^2(P_0)}$, so that
$$\inf_{P\in\mathcal{P}(\Delta_n,\theta)}\eta^2_{\varrho_n}(P,P_0) \ge \inf_{P\in\mathcal{P}(\Delta_n,\theta)}\frac{1}{4}\|u\|^2_{L^2(P_0)} \ge \frac{1}{4}\Delta_n^2.$$
In both cases, with $\varrho_n \le C\Delta_n^{\theta+1}$ for a sufficiently small $C = C(M) > 0$, the condition $\lim_{n\to\infty}\varrho_n^{\frac{1}{2s}}n\Delta_n^2 = \infty$ suffices to ensure that (6.7) holds. Under the condition that $\lim_{n\to\infty}\Delta_n n^{\frac{2s}{4s+\theta+1}} = \infty$,
$$\varrho_n = cn^{-\frac{2s(\theta+1)}{4s+\theta+1}} \le C\Delta_n^{\theta+1}$$
for sufficiently large $n$, and $\lim_{n\to\infty}\varrho_n^{\frac{1}{2s}}n\Delta_n^2 = \infty$ holds as well.
Verifying (6.8). Rewrite $V_1$ as
$$V_1 = \frac{1}{n}\sum_{\substack{1\le i,j\le n\\ i\ne j}}\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big[\phi_k(X_i)-\mathbb{E}_P\phi_k(X)\big]\big[\phi_k(X_j)-\mathbb{E}_P\phi_k(X)\big] := \frac{1}{n}\sum_{\substack{1\le i,j\le n\\ i\ne j}}F_n(X_i,X_j).$$
Then
$$\mathbb{E}_PV_1^2 = \frac{1}{n^2}\sum_{\substack{i\ne j\\ i'\ne j'}}\mathbb{E}_PF_n(X_i,X_j)F_n(X_{i'},X_{j'}) = \frac{2n(n-1)}{n^2}\mathbb{E}_PF_n^2(X,X') \le 2\,\mathbb{E}_PF_n^2(X,X'),$$
where $X, X' \overset{i.i.d.}{\sim} P$. Recall that, for any two random variables $Y_1, Y_2$ such that $\mathbb{E}Y_1^2 < \infty$,
$$\mathbb{E}\big[Y_1 - \mathbb{E}(Y_1|Y_2)\big]^2 = \mathbb{E}Y_1^2 - \mathbb{E}\big[\mathbb{E}(Y_1|Y_2)^2\big] \le \mathbb{E}Y_1^2.$$
Together with the fact that
$$F_n(X,X') = \bar{K}_n(X,X') - \mathbb{E}_P\big[\bar{K}_n(X,X')\,\big|\,X\big] - \mathbb{E}_P\big[\bar{K}_n(X,X')\,\big|\,X'\big] + \mathbb{E}_P\bar{K}_n(X,X') = \bar{K}_n(X,X') - \mathbb{E}_P\big[\bar{K}_n(X,X')\,\big|\,X\big] - \mathbb{E}_P\Big[\bar{K}_n(X,X') - \mathbb{E}_P\big[\bar{K}_n(X,X')\,\big|\,X\big]\,\Big|\,X'\Big],$$
we have
$$\mathbb{E}_PF_n^2(X,X') \le \mathbb{E}_P\big\{\bar{K}_n(X,X') - \mathbb{E}_P\big[\bar{K}_n(X,X')\,\big|\,X\big]\big\}^2 \le \mathbb{E}_P\bar{K}_n^2(X,X').$$
Thus, to prove (6.8), it suffices to show that
$$\lim_{n\to\infty}\sup_{P\in\mathcal{P}(\Delta_n,\theta)}\mathbb{E}_P\bar{K}_n^2(X,X')/V_3^2 = 0.$$
For any $g\in L^2(P_0)$ and positive definite kernel $G(\cdot,\cdot)$ such that $\mathbb{E}_{P_0}G^2(X,X') < \infty$, let
$$\|g\|_G := \sqrt{\mathbb{E}_{P_0}\big[g(X)g(X')G(X,X')\big]}.$$
By the positive definiteness of $G(\cdot,\cdot)$, the triangle inequality holds for $\|\cdot\|_G$, i.e., for any $g_1, g_2\in L^2(P_0)$,
$$\big|\|g_1\|_G - \|g_2\|_G\big| \le \|g_1 - g_2\|_G.$$
Thus, by taking $G = \bar{K}_n^2$, $g_1 = dP/dP_0$ and $g_2 = 1$, we have
$$\bigg|\sqrt{\mathbb{E}_P\bar{K}_n^2(X,X')} - \sqrt{\mathbb{E}_{P_0}\bar{K}_n^2(X,X')}\bigg| \le \sqrt{\mathbb{E}_{P_0}\big[u(X)u(X')\bar{K}_n^2(X,X')\big]}. \tag{6.10}$$
We now appeal to the following lemma to bound the right-hand side of (6.10).

Lemma 3. Let $G$ be a Mercer kernel defined over $\mathcal{X}\times\mathcal{X}$ with eigenvalue–eigenfunction pairs $\{(\mu_k,\phi_k) : k\ge1\}$ with respect to $L^2(P)$ such that $\mu_1\ge\mu_2\ge\cdots$. If $G$ is a trace kernel in that $\mathbb{E}G(X,X) < \infty$, then for any $g\in L^2(P)$,
$$\mathbb{E}_P\big[g(X)g(X')G^2(X,X')\big] \le \mu_1\Big(\sum_{k\ge1}\mu_k\Big)\Big(\sup_{k\ge1}\|\phi_k\|_\infty\Big)^2\|g\|^2_{L^2(P)}.$$
By Lemma 3, we get
$$\mathbb{E}_{P_0}\big[u(X)u(X')\bar{K}_n^2(X,X')\big] \le C_5\bigg(\sum_k\frac{\lambda_k}{\lambda_k+\varrho_n^2}\bigg)\|u\|^2_{L^2(P_0)} \asymp \varrho_n^{-1/s}\|u\|^2_{L^2(P_0)}.$$
Recall that
$$\mathbb{E}_{P_0}\bar{K}_n^2(X,X') = \sum_k\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^2 \asymp \varrho_n^{-1/s}.$$
In the light of (6.10), these imply that
$$\mathbb{E}_P\bar{K}_n^2(X,X') \le 2\Big\{\mathbb{E}_{P_0}\bar{K}_n^2(X,X') + \mathbb{E}_{P_0}\big[u(X)u(X')\bar{K}_n^2(X,X')\big]\Big\} \le C_6\varrho_n^{-1/s}\big[1 + \|u\|^2_{L^2(P_0)}\big].$$
On the other hand, as already shown while verifying (6.7), $\varrho_n \lesssim \Delta_n^{\theta+1}$ suffices to ensure that for sufficiently large $n$,
$$\frac{1}{4}\|u\|^2_{L^2(P_0)} \le \eta^2_{\varrho_n}(P,P_0) \le \|u\|^2_{L^2(P_0)}, \quad \forall\, P\in\mathcal{P}(\Delta_n,\theta).$$
Thus
$$\lim_{n\to\infty}\sup_{P\in\mathcal{P}(\Delta_n,\theta)}\mathbb{E}_P\bar{K}_n^2(X,X')/V_3^2 \le 16C_6\Bigg\{\bigg(\lim_{n\to\infty}\inf_{P\in\mathcal{P}(\Delta_n,\theta)}\varrho_n^{1/s}n^2\|u\|^4_{L^2(P_0)}\bigg)^{-1} + \bigg(\lim_{n\to\infty}\inf_{P\in\mathcal{P}(\Delta_n,\theta)}\varrho_n^{1/s}n^2\|u\|^2_{L^2(P_0)}\bigg)^{-1}\Bigg\} = 0,$$
provided that $\lim_{n\to\infty}n^{\frac{2s}{4s+\theta+1}}\Delta_n = \infty$. This immediately implies (6.8).
Verifying (6.9). Observe that
$$\mathbb{E}_PV_2^2 \le 4n\,\mathbb{E}_P\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big[\mathbb{E}_P\phi_k(X)\big]\big[\phi_k(X)-\mathbb{E}_P\phi_k(X)\big]\bigg\}^2 \le 4n\,\mathbb{E}_P\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big[\mathbb{E}_P\phi_k(X)\big]\phi_k(X)\bigg\}^2$$
$$= 4n\,\mathbb{E}_{P_0}\Bigg(\big[1+u(X)\big]\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big[\mathbb{E}_P\phi_k(X)\big]\phi_k(X)\bigg\}^2\Bigg).$$
It is clear that
$$\mathbb{E}_{P_0}\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big[\mathbb{E}_P\phi_k(X)\big]\phi_k(X)\bigg\}^2 = \sum_{k,k'\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\cdot\frac{\lambda_{k'}}{\lambda_{k'}+\varrho_n^2}\,\mathbb{E}_P\phi_k(X)\,\mathbb{E}_P\phi_{k'}(X)\,\mathbb{E}_{P_0}\big[\phi_k(X)\phi_{k'}(X)\big] = \sum_{k\ge1}\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^2\big[\mathbb{E}_P\phi_k(X)\big]^2 \le \eta^2_{\varrho_n}(P,P_0).$$
On the other hand,
$$\mathbb{E}_{P_0}\Bigg(u(X)\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big[\mathbb{E}_P\phi_k(X)\big]\phi_k(X)\bigg\}^2\Bigg) \le \sqrt{\mathbb{E}_{P_0}\Bigg(u^2(X)\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big[\mathbb{E}_P\phi_k(X)\big]\phi_k(X)\bigg\}^2\Bigg)}\times\sqrt{\mathbb{E}_{P_0}\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big[\mathbb{E}_P\phi_k(X)\big]\phi_k(X)\bigg\}^2}$$
$$\le \|u\|_{L^2(P_0)}\sup_x\bigg|\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big[\mathbb{E}_P\phi_k(X)\big]\phi_k(x)\bigg|\cdot\eta_{\varrho_n}(P,P_0) \le \Big(\sup_k\|\phi_k\|_\infty\Big)\|u\|_{L^2(P_0)}\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big|\mathbb{E}_P\phi_k(X)\big|\cdot\eta_{\varrho_n}(P,P_0)$$
$$\le \Big(\sup_k\|\phi_k\|_\infty\Big)\|u\|_{L^2(P_0)}\sqrt{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}}\sqrt{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big[\mathbb{E}_P\phi_k(X)\big]^2}\cdot\eta_{\varrho_n}(P,P_0) \le C_7\|u\|_{L^2(P_0)}\,\varrho_n^{-\frac{1}{2s}}\,\eta^2_{\varrho_n}(P,P_0).$$
Together, they imply that
$$\lim_{n\to\infty}\sup_{P\in\mathcal{P}(\Delta_n,\theta)}\mathbb{E}_PV_2^2/V_3^2 \le 4\max\{1,C_7\}\Bigg\{\bigg(\lim_{n\to\infty}\inf_{P\in\mathcal{P}(\Delta_n,\theta)}n\eta^2_{\varrho_n}(P,P_0)\bigg)^{-1} + \lim_{n\to\infty}\sup_{P\in\mathcal{P}(\Delta_n,\theta)}\bigg(\frac{\|u\|_{L^2(P_0)}}{\varrho_n^{\frac{1}{2s}}\,n\,\eta^2_{\varrho_n}(P,P_0)}\bigg)\Bigg\} = 0,$$
under the assumption that $\lim_{n\to\infty}n^{\frac{2s}{4s+\theta+1}}\Delta_n = \infty$. This verifies (6.9) and completes the proof.
Proof of Theorem 4. The overall architecture of the proof is by now standard in establishing minimax lower bounds for nonparametric hypothesis testing. The main idea is to carefully construct a set of distributions under the alternative hypothesis and argue that a mixture of these alternatives cannot be reliably distinguished from the null. See, e.g., Ingster (1993), Ingster and Suslina (2003), and Tsybakov (2008). Without loss of generality, assume $M = 1$ and $\Delta_n = cn^{-\frac{2s}{4s+\theta+1}}$ for some $c > 0$. Let us consider the cases of $\theta = 0$ and $\theta > 0$ separately.
The case of $\theta = 0$. We first treat the case when $\theta = 0$. Let $B_n = \lfloor C_8\Delta_n^{-\frac{1}{s}}\rfloor$ for a sufficiently small constant $C_8 > 0$ and $a_n = \sqrt{\Delta_n^2/B_n}$. For any $\xi_n := (\xi_{n1},\xi_{n2},\ldots,\xi_{nB_n})^\top\in\{\pm1\}^{B_n}$, write
$$u_{n,\xi_n} = a_n\sum_{k=1}^{B_n}\xi_{nk}\varphi_k.$$
It is clear that
$$\|u_{n,\xi_n}\|^2_{L^2(P_0)} = B_na_n^2 = \Delta_n^2$$
and
$$\|u_{n,\xi_n}\|_\infty \le a_nB_n\Big(\sup_k\|\varphi_k\|_\infty\Big) \asymp \Delta_n^{\frac{2s-1}{2s}} \to 0.$$
By taking $C_8$ small enough, we can also ensure that
$$\|u_{n,\xi_n}\|_K^2 = a_n^2\sum_{k=1}^{B_n}\lambda_k^{-1} \le 1.$$
Therefore, there exists a probability measure $P_{n,\xi_n}\in\mathcal{P}(\Delta_n,0)$ such that $dP_{n,\xi_n}/dP_0 = 1 + u_{n,\xi_n}$.
Following a standard argument for minimax lower bounds, it suffices to show that
$$\limsup_{n\to\infty}\mathbb{E}_{P_0}\Bigg(\frac{1}{2^{B_n}}\sum_{\xi_n\in\{\pm1\}^{B_n}}\bigg\{\prod_{i=1}^n\big[1+u_{n,\xi_n}(X_i)\big]\bigg\}\Bigg)^2 < \infty. \tag{6.11}$$
Note that
$$\mathbb{E}_{P_0}\Bigg(\frac{1}{2^{B_n}}\sum_{\xi_n\in\{\pm1\}^{B_n}}\bigg\{\prod_{i=1}^n\big[1+u_{n,\xi_n}(X_i)\big]\bigg\}\Bigg)^2 = \frac{1}{2^{2B_n}}\sum_{\xi_n,\xi_n'\in\{\pm1\}^{B_n}}\prod_{i=1}^n\mathbb{E}_{P_0}\Big\{\big[1+u_{n,\xi_n}(X_i)\big]\big[1+u_{n,\xi_n'}(X_i)\big]\Big\}$$
$$= \frac{1}{2^{2B_n}}\sum_{\xi_n,\xi_n'\in\{\pm1\}^{B_n}}\bigg(1 + a_n^2\sum_{k=1}^{B_n}\xi_{nk}\xi_{nk}'\bigg)^n \le \frac{1}{2^{2B_n}}\sum_{\xi_n,\xi_n'\in\{\pm1\}^{B_n}}\exp\bigg(na_n^2\sum_{k=1}^{B_n}\xi_{nk}\xi_{nk}'\bigg)$$
$$= \bigg\{\frac{\exp(na_n^2)+\exp(-na_n^2)}{2}\bigg\}^{B_n} \le \exp\Big(\frac{1}{2}B_nn^2a_n^4\Big),$$
where the last inequality is ensured by the fact that
$$\cosh(t) \le \exp\Big(\frac{t^2}{2}\Big), \quad \forall\, t\in\mathbb{R}.$$
See, e.g., Baraud (2002). With the particular choices of $B_n$ and $a_n$, and the conditions on $\Delta_n$, this immediately implies (6.11).
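The elementary bound $\cosh(t)\le e^{t^2/2}$ is what collapses the average over the $2^{2B_n}$ sign pairs into the single factor $\exp(\frac{1}{2}B_nn^2a_n^4)$. Both inequalities are easy to sanity-check numerically; in the sketch below (a standalone illustration, with the particular values of $n$, $B$ and $a$ chosen arbitrarily), the first assertion checks the pointwise bound on a grid and the second checks the resulting mixture bound on a log scale.

```python
import numpy as np

# cosh(t) <= exp(t^2 / 2) for all t: check on a grid
t = np.linspace(-10, 10, 2001)
gap = np.exp(t ** 2 / 2) - np.cosh(t)

# the resulting mixture bound: cosh(n a^2)^B <= exp(B n^2 a^4 / 2), on a log scale
n, B, a = 500, 40, 0.05
lhs = B * np.log(np.cosh(n * a ** 2))
rhs = 0.5 * B * n ** 2 * a ** 4
print(gap.min(), lhs, rhs)
```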
The case of $\theta > 0$. The main idea is similar to before. To find a set of probability measures in $\mathcal{P}(\Delta_n,\theta)$, we appeal to the following lemma.

Lemma 4. Let $u = \sum_k a_k\varphi_k$. If
$$\sup_{B\ge1}\bigg(\sum_{k=1}^B\lambda_k^{-1}a_k^2\bigg)^{1/\theta}\bigg(\sum_{k\ge B}a_k^2\bigg) \le M^2,$$
then $u\in\mathcal{F}(\theta,M)$.
Similar to before, we shall now take $B_n = \lfloor C_{10}\Delta_n^{-\frac{\theta+1}{s}}\rfloor$ and $a_n = \sqrt{\Delta_n^2/B_n}$. By Lemma 4, for appropriately chosen $C_{10}$, we can find $P_{n,\xi_n}\in\mathcal{P}(\Delta_n,\theta)$ such that $dP_{n,\xi_n}/dP_0 = 1+u_{n,\xi_n}$. Following the same argument as in the previous case, we can again verify (6.11).
Proof of Theorem 5. Without loss of generality, assume that $\Delta_n(\theta) = c_1\big(n^{-1}\sqrt{\log\log n}\big)^{\frac{2s}{4s+\theta+1}}$ for some constant $c_1 > 0$ to be determined later.

Type I Error. We first prove the first statement, which shows that the Type I error converges to 0. Following the notation defined in the proof of Theorem 2, let
$$N_{n,2} = \mathbb{E}\bigg\{\sum_{j=2}^n\mathbb{E}\big(\tilde{\zeta}_{nj}^2\,\big|\,\mathcal{F}_{j-1}\big) - 1\bigg\}^2, \qquad L_{n,2} = \sum_{j=2}^n\mathbb{E}\tilde{\zeta}_{nj}^4,$$
where $\tilde{\zeta}_{nj} = \sqrt{2}\,\zeta_{nj}/(n\sqrt{v_n})$. As shown by Haeusler (1988),
$$\sup_t\big|P(T_{n,\varrho_n} > t) - \bar{\Phi}(t)\big| \le C_{11}(L_{n,2} + N_{n,2})^{1/5},$$
where $\bar{\Phi}(t)$ is the survival function of the standard normal, i.e., $\bar{\Phi}(t) = P(Z > t)$ with $Z\sim N(0,1)$. Again by the argument from Hall (1984),
$$\mathbb{E}\bigg\{\sum_{j=2}^n\mathbb{E}\big(\zeta_{nj}^2\,\big|\,\mathcal{F}_{j-1}\big) - \frac{1}{2}n(n-1)v_n\bigg\}^2 \le C_{12}\Big[n^4\mathbb{E}G_n^2(X,X') + n^3\mathbb{E}\bar{K}_n^2(X,X')\bar{K}_n^2(X,X'')\Big],$$
where $G_n(\cdot,\cdot)$ is defined in the proof of Theorem 2, and
$$\sum_{j=2}^n\mathbb{E}\zeta_{nj}^4 \le C_{13}\Big[n^2\mathbb{E}\bar{K}_n^4(X,X') + n^3\mathbb{E}\bar{K}_n^2(X,X')\bar{K}_n^2(X,X'')\Big],$$
which ensures that
$$N_{n,2} = \frac{4\,\mathbb{E}\Big\{\sum_{j=2}^n\mathbb{E}\big(\zeta_{nj}^2\,\big|\,\mathcal{F}_{j-1}\big) - \frac{1}{2}n(n-1)v_n - \frac{1}{2}nv_n\Big\}^2}{n^4v_n^2} \le 8\max\Big\{C_{12},\frac{1}{4}\Big\}\bigg\{\frac{\mathbb{E}G_n^2(X,X')}{v_n^2} + \frac{\mathbb{E}\bar{K}_n^2(X,X')\bar{K}_n^2(X,X'')}{nv_n^2} + \frac{1}{n^2}\bigg\},$$
and
$$L_{n,2} = \frac{4\sum_{j=2}^n\mathbb{E}\zeta_{nj}^4}{n^4v_n^2} \le 4C_{13}\bigg\{\frac{\mathbb{E}\bar{K}_n^4(X,X')}{n^2v_n^2} + \frac{\mathbb{E}\bar{K}_n^2(X,X')\bar{K}_n^2(X,X'')}{nv_n^2}\bigg\}.$$
As shown in the proof of Theorem 2,
$$\frac{\mathbb{E}G_n^2(X,X')}{v_n^2} \le C_3\varrho_n^{1/s}, \qquad \frac{\mathbb{E}\bar{K}_n^4(X,X')}{n^2v_n^2} \le C_4n^{-2}\varrho_n^{-1/s}, \qquad \frac{\mathbb{E}\bar{K}_n^2(X,X')\bar{K}_n^2(X,X'')}{nv_n^2} \le C_3n^{-1}.$$
Therefore,
$$\sup_t\big|P(T_{n,\varrho_n} > t) - \bar{\Phi}(t)\big| \le C_{14}\Big(\varrho_n^{\frac{1}{5s}} + n^{-\frac{1}{5}} + n^{-\frac{2}{5}}\varrho_n^{-\frac{1}{5s}}\Big),$$
which implies that
$$P\Big(\sup_{0\le k\le m_*}T_{n,2^k\varrho_*} > t\Big) \le m_*\bar{\Phi}(t) + C_{15}\Big(2^{\frac{m_*}{5s}}\varrho_*^{\frac{1}{5s}} + m_*n^{-\frac{1}{5}} + n^{-\frac{2}{5}}\varrho_*^{-\frac{1}{5s}}\Big), \quad \forall\, t.$$
It is not hard to see, by the definitions of $m_*$ and $\varrho_*$, that
$$2^{m_*}\varrho_* \le 2\bigg(\frac{\sqrt{\log\log n}}{n}\bigg)^{\frac{2s}{4s+1}}$$
and
$$m_* = (\log 2)^{-1}\bigg\{2s\log n - \frac{2s}{4s+1}\log n + o(\log n)\bigg\} = (\log 2)^{-1}\frac{8s^2}{4s+1}\log n + o(\log n) \asymp \log n.$$
Together with the fact that $\bar{\Phi}(t) \le \frac{1}{2}e^{-t^2/2}$ for $t\ge0$, we get
$$P\Big(\sup_{0\le k\le m_*}T_{n,2^k\varrho_*} > \sqrt{3\log\log n}\Big) \le C_{16}\Bigg[e^{-\frac{3}{2}\log\log n}\log n + \bigg(\frac{\sqrt{\log\log n}}{n}\bigg)^{\frac{2}{5(4s+1)}} + n^{-\frac{1}{5}}\log n + n^{-\frac{2}{5}}\bigg(\frac{\sqrt{\log\log n}}{n}\bigg)^{-\frac{2}{5}}\Bigg] \to 0,$$
as $n\to\infty$.
Type II Error. Next consider the Type II error. To this end, write $\varrho_n(\theta) = \big(\frac{\sqrt{\log\log n}}{n}\big)^{\frac{2s(\theta+1)}{4s+\theta+1}}$ and let
$$\hat{\varrho}_n(\theta) = \sup_{0\le k\le m_*}\big\{2^k\varrho_* : 2^k\varrho_* \le \varrho_n(\theta)\big\}.$$
It is clear that $T_n \ge T_{n,\hat{\varrho}_n(\theta)}$ for any $\theta\ge0$. It therefore suffices to show that
$$\lim_{n\to\infty}\inf_{\theta\ge0}\inf_{P\in\mathcal{P}(\Delta_n,\theta)}P\Big\{T_{n,\hat{\varrho}_n(\theta)} \ge \sqrt{3\log\log n}\Big\} = 1.$$
By Markov's inequality, this can be accomplished by verifying that
$$\inf_{\theta\in[0,\infty)}\inf_{P\in\mathcal{P}(\Delta_n(\theta),\theta)}\mathbb{E}_PT_{n,\hat{\varrho}_n(\theta)} \ge \tilde{c}\sqrt{\log\log n} \tag{6.12}$$
for some $\tilde{c} > \sqrt{3}$, and
$$\lim_{n\to\infty}\sup_{\theta\ge0}\sup_{P\in\mathcal{P}(\Delta_n(\theta),\theta)}\frac{\mathrm{Var}\big(T_{n,\hat{\varrho}_n(\theta)}\big)}{\big(\mathbb{E}_PT_{n,\hat{\varrho}_n(\theta)}\big)^2} = 0. \tag{6.13}$$
We now show that both (6.12) and (6.13) hold with
$$\Delta_n(\theta) = c_1\bigg(\frac{\sqrt{\log\log n}}{n}\bigg)^{\frac{2s}{4s+\theta+1}}$$
for a sufficiently large $c_1 = c_1(M,\tilde{c})$. Note that for all $\theta\in[0,\infty)$,
$$\frac{1}{2}\varrho_n(\theta) \le \hat{\varrho}_n(\theta) \le \varrho_n(\theta), \tag{6.14}$$
which immediately implies
$$\eta^2_{\hat{\varrho}_n(\theta)}(P,P_0) \ge \eta^2_{\varrho_n(\theta)}(P,P_0). \tag{6.15}$$
Following the arguments in the proof of Theorem 3,
$$\mathbb{E}_PT_{n,\hat{\varrho}_n(\theta)} \ge C_{17}n\big[\hat{\varrho}_n(\theta)\big]^{\frac{1}{2s}}\eta^2_{\hat{\varrho}_n(\theta)}(P,P_0) \ge 2^{-\frac{1}{2s}}C_{17}n\big[\varrho_n(\theta)\big]^{\frac{1}{2s}}\eta^2_{\varrho_n(\theta)}(P,P_0),$$
and for all $P\in\mathcal{P}(\Delta_n(\theta),\theta)$,
$$\eta^2_{\varrho_n(\theta)}(P,P_0) \ge \frac{1}{4}\|u\|^2_{L^2(P_0)} \tag{6.16}$$
provided that $\Delta_n(\theta) \ge C'(M)\big(\frac{\sqrt{\log\log n}}{n}\big)^{\frac{2s}{4s+\theta+1}}$. Therefore,
$$\inf_{P\in\mathcal{P}(\Delta_n(\theta),\theta)}\mathbb{E}_PT_{n,\hat{\varrho}_n(\theta)} \ge C_{18}n\big[\varrho_n(\theta)\big]^{\frac{1}{2s}}\Delta_n^2(\theta) = C_{18}c_1^2\sqrt{\log\log n} \ge \tilde{c}\sqrt{\log\log n}$$
provided $c_1^2 \ge C_{18}^{-1}\tilde{c}$. Hence, to ensure that (6.12) holds, it suffices to take $c_1 = \max\big\{C'(M), (C_{18}^{-1}\tilde{c})^{1/2}\big\}$.
With (6.14), (6.15) and (6.16), the results in the proof of Theorem 3 imply that for sufficiently large $n$,
$$\sup_{P\in\mathcal{P}(\Delta_n(\theta),\theta)}\frac{\mathrm{Var}\big(T_{n,\hat{\varrho}_n(\theta)}\big)}{\big(\mathbb{E}_PT_{n,\hat{\varrho}_n(\theta)}\big)^2} \le C_{19}\bigg\{\Big(\big[\varrho_n(\theta)\big]^{\frac{1}{2s}}n\Delta_n^2(\theta)\Big)^{-2} + \Big(\big[\varrho_n(\theta)\big]^{\frac{1}{s}}n^2\Delta_n^2(\theta)\Big)^{-1} + \big(n\Delta_n^2(\theta)\big)^{-1} + \Big(\big[\varrho_n(\theta)\big]^{\frac{1}{2s}}n\Delta_n(\theta)\Big)^{-1}\bigg\}$$
$$\le 2C_{19}\Big(\big[\varrho_n(\theta)\big]^{\frac{1}{2s}}n\Delta_n^2(\theta)\Big)^{-1} \lesssim (\log\log n)^{-\frac{1}{2}} \to 0,$$
which shows (6.13).
Proof of Theorem 6. The main idea of the proof is similar to that for Theorem 4. Nevertheless, in order to show that
$$\inf_{\Phi_n}\bigg[\mathbb{E}_{P_0}\Phi_n + \sup_{\theta\in[\theta_1,\theta_2]}\beta\big(\Phi_n;\Delta_n(\theta),\theta\big)\bigg]$$
converges to 1 rather than merely being bounded away from 0, we need to find $P_\pi$, the marginal distribution on $\mathcal{X}^n$ obtained by drawing the conditional distribution from
$$\big\{P^{\otimes n} : P\in\cup_{\theta\in[\theta_1,\theta_2]}\mathcal{P}(\Delta_n(\theta),\theta)\big\}$$
according to a prior distribution $\pi$ on $\cup_{\theta\in[\theta_1,\theta_2]}\mathcal{P}(\Delta_n(\theta),\theta)$, such that the $\chi^2$ distance between $P_\pi$ and $P_0^{\otimes n}$ converges to 0. See Ingster (2000). To this end, assume, without loss of generality, that
$$\Delta_n(\theta) = c_2\bigg(\frac{n}{\sqrt{\log\log n}}\bigg)^{-\frac{2s}{4s+\theta+1}}, \quad \forall\,\theta\in[\theta_1,\theta_2],$$
where $c_2 > 0$ is a sufficiently small constant to be determined later. Let $r_n = \lfloor C_{20}\log n\rfloor$ and $B_{n,1} = \big\lfloor C_{21}[\Delta_n(\theta_1)]^{-\frac{\theta_1+1}{s}}\big\rfloor$ for sufficiently small $C_{20}, C_{21} > 0$.
Set $\theta_{n,1} = \theta_1$. For $2\le r\le r_n$, let
$$B_{n,r} = 2^{r-2}B_{n,1},$$
and let $\theta_{n,r}$ be selected such that
$$B_{n,r} = \Big\lfloor C_{21}\big[\Delta_n(\theta_{n,r})\big]^{-\frac{\theta_{n,r}+1}{s}}\Big\rfloor.$$
Note that $[\Delta_n(\theta)]^{-\frac{\theta+1}{s}} = c_2^{-\frac{\theta+1}{s}}\big(n/\sqrt{\log\log n}\big)^{\frac{2(\theta+1)}{4s+\theta+1}}$, where the exponent $\frac{2(\theta+1)}{4s+\theta+1}$ is increasing in $\theta$. By choosing $C_{20}$ sufficiently small,
$$B_{n,r_n} = 2^{r_n-2}B_{n,1} \le C_{21}c_2^{-\frac{\theta_1+1}{s}}\exp\bigg(\frac{2(\theta_1+1)}{4s+\theta_1+1}\log\Big(\frac{n}{\sqrt{\log\log n}}\Big) + (r_n-2)\log 2\bigg) \le \Big\lfloor C_{21}\big[\Delta_n(\theta_2)\big]^{-\frac{\theta_2+1}{s}}\Big\rfloor$$
for sufficiently large $n$, since $(r_n-2)\log 2 \le C_{20}(\log 2)\log n$ is then dominated by the gap between the exponents for $\theta_1$ and $\theta_2$ multiplied by $\log\big(n/\sqrt{\log\log n}\big)$. Thus, we can guarantee that $\theta_{n,r}\in[\theta_1,\theta_2]$ for all $1\le r\le r_n$.
We now construct a finite subset of $\cup_{\theta\in[\theta_1,\theta_2]}\mathcal{P}(\Delta_n(\theta),\theta)$ as follows. Let $B^*_{n,0} = 0$ and $B^*_{n,r} = B_{n,1}+\cdots+B_{n,r}$ for $r\ge1$. For each $\xi_{n,r} = (\xi_{n,r,1},\ldots,\xi_{n,r,B_{n,r}})\in\{\pm1\}^{B_{n,r}}$, let
$$f_{n,r,\xi_{n,r}} = 1 + \sum_{k=B^*_{n,r-1}+1}^{B^*_{n,r}}a_{n,r}\,\xi_{n,r,k-B^*_{n,r-1}}\,\varphi_k,$$
where $a_{n,r} = \sqrt{\Delta_n^2(\theta_{n,r})/B_{n,r}}$. Following the same argument as in the proof of Theorem 4, we can verify that with a sufficiently small $C_{21}$, each $P_{n,r,\xi_{n,r}}\in\mathcal{P}(\Delta_n(\theta_{n,r}),\theta_{n,r})$, where $f_{n,r,\xi_{n,r}}$ is the Radon–Nikodym derivative $dP_{n,r,\xi_{n,r}}/dP_0$. With slight abuse of notation, write
$$f_n(X_1,X_2,\ldots,X_n) = \frac{1}{r_n}\sum_{r=1}^{r_n}f_{n,r}(X_1,X_2,\ldots,X_n),$$
where
$$f_{n,r}(X_1,X_2,\ldots,X_n) = \frac{1}{2^{B_{n,r}}}\sum_{\xi_{n,r}\in\{\pm1\}^{B_{n,r}}}\prod_{i=1}^nf_{n,r,\xi_{n,r}}(X_i).$$
It now suffices to show that
$$\|f_n - 1\|^2_{L^2(P_0)} = \|f_n\|^2_{L^2(P_0)} - 1 \to 0, \quad\text{as } n\to\infty,$$
where $\|f_n\|^2_{L^2(P_0)} = \mathbb{E}_{P_0}f_n^2(X_1,X_2,\ldots,X_n)$. Note that
$$\|f_n\|^2_{L^2(P_0)} = \frac{1}{r_n^2}\sum_{1\le r,r'\le r_n}\langle f_{n,r}, f_{n,r'}\rangle_{L^2(P_0)} = \frac{1}{r_n^2}\sum_{1\le r\le r_n}\|f_{n,r}\|^2_{L^2(P_0)} + \frac{1}{r_n^2}\sum_{\substack{1\le r,r'\le r_n\\ r\ne r'}}\langle f_{n,r}, f_{n,r'}\rangle_{L^2(P_0)}.$$
It is easy to verify that, for any $r\ne r'$,
$$\langle f_{n,r}, f_{n,r'}\rangle_{L^2(P_0)} = 1.$$
It therefore suffices to show that
$$\sum_{1\le r\le r_n}\|f_{n,r}\|^2_{L^2(P_0)} = o(r_n^2).$$
Following the same derivation as that in the proof of Theorem 4, we can show that
$$\|f_{n,r}\|^2_{L^2(P_0)} \le \bigg(\frac{\exp(na_{n,r}^2)+\exp(-na_{n,r}^2)}{2}\bigg)^{B_{n,r}} \le \exp\Big(\frac{1}{2}B_{n,r}n^2a_{n,r}^4\Big)$$
for sufficiently large $n$. By setting $c_2$ in the expression of $\Delta_n(\theta)$ sufficiently small, we have
$$B_{n,r}n^2a_{n,r}^4 \le \log r_n,$$
which ensures that
$$\sum_{1\le r\le r_n}\|f_{n,r}\|^2_{L^2(P_0)} \le r_n\cdot r_n^{1/2} = r_n^{3/2} = o(r_n^2).$$
Proof of Theorem 7. We begin with (3.2). Note that $\hat{\gamma}^2_{\nu_n}(P,P_0)$ is a U-statistic, so we can apply general techniques for U-statistics to establish its asymptotic normality. In particular, as shown in Hall (1984), it suffices to verify the following four conditions:
$$\Big(\frac{2\nu_n}{\pi}\Big)^{d/2}\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2) \to \|p_0\|^2_{L^2}, \tag{6.17}$$
$$\frac{\mathbb{E}\bar{G}^4_{\nu_n}(X_1,X_2)}{n^2\big[\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\big]^2} \to 0, \tag{6.18}$$
$$\frac{\mathbb{E}\big[\bar{G}^2_{\nu_n}(X_1,X_2)\bar{G}^2_{\nu_n}(X_1,X_3)\big]}{n\big[\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\big]^2} \to 0, \tag{6.19}$$
$$\frac{\mathbb{E}H^2_{\nu_n}(X_1,X_2)}{\big[\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\big]^2} \to 0, \tag{6.20}$$
as $n\to\infty$, where
$$H_{\nu_n}(x,y) = \mathbb{E}\bar{G}_{\nu_n}(x,X_3)\bar{G}_{\nu_n}(y,X_3), \quad \forall\, x,y\in\mathbb{R}^d.$$
Verifying Condition (6.17). Note that
$$\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2) = \mathbb{E}G^2_{\nu_n}(X_1,X_2) - 2\,\mathbb{E}\big\{\mathbb{E}\big[G_{\nu_n}(X_1,X_2)\,\big|\,X_1\big]\big\}^2 + \big[\mathbb{E}G_{\nu_n}(X_1,X_2)\big]^2.$$
By Lemma 7,
$$\mathbb{E}G_{\nu_n}(X_1,X_2) = \Big(\frac{\pi}{\nu_n}\Big)^{\frac{d}{2}}\int\exp\Big(-\frac{\|\omega\|^2}{4\nu_n}\Big)\big\|\mathcal{F}p_0(\omega)\big\|^2\,d\omega,$$
which immediately yields
$$\Big(\frac{\nu_n}{\pi}\Big)^{\frac{d}{2}}\mathbb{E}G_{\nu_n}(X_1,X_2) \to \|p_0\|^2_{L^2} \quad\text{and}\quad \Big(\frac{2\nu_n}{\pi}\Big)^{\frac{d}{2}}\mathbb{E}G^2_{\nu_n}(X_1,X_2) = \Big(\frac{2\nu_n}{\pi}\Big)^{\frac{d}{2}}\mathbb{E}G_{2\nu_n}(X_1,X_2) \to \|p_0\|^2_{L^2},$$
as $\nu_n\to\infty$.
On the other hand,
$$\mathbb{E}\big\{\mathbb{E}\big[G_{\nu_n}(X_1,X_2)\,\big|\,X_1\big]\big\}^2 = \int\bigg(\int G_{\nu_n}(x,x')G_{\nu_n}(x,x'')p_0(x)\,dx\bigg)p_0(x')p_0(x'')\,dx'dx'' = \int\bigg(\int G^2_{\nu_n}\big(x,(x'+x'')/2\big)p_0(x)\,dx\bigg)G_{\nu_n/2}(x',x'')p_0(x')p_0(x'')\,dx'dx''.$$
Let $Z\sim N(0, 4\nu_nI_d)$. Then
$$\int G^2_{\nu_n}\big(x,(x'+x'')/2\big)p_0(x)\,dx = (2\pi)^{d/2}\mathbb{E}\bigg[\mathcal{F}p_0(Z)\exp\Big(i\Big\langle\frac{x'+x''}{2}, Z\Big\rangle\Big)\bigg] \le (2\pi)^{d/2}\sqrt{\mathbb{E}\big\|\mathcal{F}p_0(Z)\big\|^2} \lesssim_d \|p_0\|_{L^2}\big/\nu_n^{d/4}.$$
Thus
$$\mathbb{E}\big\{\mathbb{E}\big[G_{\nu_n}(X_1,X_2)\,\big|\,X_1\big]\big\}^2 \lesssim_d \|p_0\|^3_{L^2}\big/\nu_n^{3d/4}.$$
Condition (6.17) then follows.
Verifying Conditions (6.18) and (6.19). Since
$$\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2) \asymp_{d,p_0} \nu_n^{-d/2} \quad\text{and}\quad \mathbb{E}\bar{G}^4_{\nu_n}(X_1,X_2) \lesssim \mathbb{E}G^4_{\nu_n}(X_1,X_2) \lesssim_d \nu_n^{-d/2},$$
we obtain
$$n^{-2}\mathbb{E}\bar{G}^4_{\nu_n}(X_1,X_2)\big/\big(\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\big)^2 \lesssim_{d,p_0} \nu_n^{d/2}/n^2 \to 0.$$
Similarly,
$$\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\bar{G}^2_{\nu_n}(X_1,X_3) \lesssim \mathbb{E}G^2_{\nu_n}(X_1,X_2)G^2_{\nu_n}(X_1,X_3) \lesssim_{d,p_0} \nu_n^{-3d/4}.$$
This implies
$$n^{-1}\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\bar{G}^2_{\nu_n}(X_1,X_3)\big/\big(\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\big)^2 \lesssim_{d,p_0} \nu_n^{d/4}/n \to 0,$$
which verifies (6.19).
Verifying Condition (6.20). We now prove (6.20). It suffices to show that
$$\nu_n^d\,\mathbb{E}\Big(\mathbb{E}\big(\bar{G}_{\nu_n}(X_1,X_2)\bar{G}_{\nu_n}(X_1,X_3)\,\big|\,X_2,X_3\big)\Big)^2 \to 0$$
as $n\to\infty$. Note that
$$\mathbb{E}\Big(\mathbb{E}\big(\bar{G}_{\nu_n}(X_1,X_2)\bar{G}_{\nu_n}(X_1,X_3)\,\big|\,X_2,X_3\big)\Big)^2 \lesssim \mathbb{E}\Big(\mathbb{E}\big(G_{\nu_n}(X_1,X_2)G_{\nu_n}(X_1,X_3)\,\big|\,X_2,X_3\big)\Big)^2 = \mathbb{E}G_{\nu_n}(X_1,X_2)G_{\nu_n}(X_1,X_3)G_{\nu_n}(X_4,X_2)G_{\nu_n}(X_4,X_3)$$
$$= \mathbb{E}\Big(G_{\nu_n}(X_1,X_4)G_{\nu_n}(X_2,X_3)\,\mathbb{E}\big(G_{\nu_n}(X_1+X_4,X_2+X_3)\,\big|\,X_1-X_4,X_2-X_3\big)\Big).$$
Since for any $\delta > 0$,
$$\nu_n^d\,\mathbb{E}\Big(G_{\nu_n}(X_1,X_4)G_{\nu_n}(X_2,X_3)\,\mathbb{E}\big(G_{\nu_n}(X_1+X_4,X_2+X_3)\,\big|\,X_1-X_4,X_2-X_3\big)\big(1_{\{\|X_1-X_4\|>\delta\}} + 1_{\{\|X_2-X_3\|>\delta\}}\big)\Big) \to 0,$$
it remains to show that
$$\nu_n^d\,\mathbb{E}\Big(G_{\nu_n}(X_1,X_4)G_{\nu_n}(X_2,X_3)\,\mathbb{E}\big(G_{\nu_n}(X_1+X_4,X_2+X_3)\,\big|\,X_1-X_4,X_2-X_3\big)1_{\{\|X_1-X_4\|\le\delta,\,\|X_2-X_3\|\le\delta\}}\Big) \to 0$$
for some $\delta > 0$, which holds as long as
$$\mathbb{E}\big(G_{\nu_n}(X_1+X_4,X_2+X_3)\,\big|\,X_1-X_4,X_2-X_3\big) \to 0 \tag{6.21}$$
uniformly on $\{\|X_1-X_4\|\le\delta,\ \|X_2-X_3\|\le\delta\}$.
Let
$$Y_1 = X_1 - X_4, \quad Y_2 = X_2 - X_3, \quad Y_3 = X_1 + X_4, \quad Y_4 = X_2 + X_3.$$
Then
$$\mathbb{E}\big(G_{\nu_n}(X_1+X_4,X_2+X_3)\,\big|\,X_1-X_4,X_2-X_3\big) = \Big(\frac{\pi}{\nu_n}\Big)^{\frac{d}{2}}\int\exp\Big(-\frac{\|\omega\|^2}{4\nu_n}\Big)\mathcal{F}p_{Y_1}(\omega)\,\overline{\mathcal{F}p_{Y_2}(\omega)}\,d\omega$$
$$\le \sqrt{\Big(\frac{\pi}{\nu_n}\Big)^{\frac{d}{2}}\int\exp\Big(-\frac{\|\omega\|^2}{4\nu_n}\Big)\big|\mathcal{F}p_{Y_1}(\omega)\big|^2\,d\omega}\cdot\sqrt{\Big(\frac{\pi}{\nu_n}\Big)^{\frac{d}{2}}\int\exp\Big(-\frac{\|\omega\|^2}{4\nu_n}\Big)\big|\mathcal{F}p_{Y_2}(\omega)\big|^2\,d\omega},$$
where
$$p_y(y') = \frac{p(Y_1 = y, Y_3 = y')}{p(Y_1 = y)} = \frac{p_0\big(\frac{y+y'}{2}\big)p_0\big(\frac{y'-y}{2}\big)}{\int p_0\big(\frac{y+y'}{2}\big)p_0\big(\frac{y'-y}{2}\big)\,dy'}$$
is the conditional density of $Y_3$ given $Y_1 = y$ (and $p_{Y_2}$ denotes, analogously, the conditional density of $Y_4$ given $Y_2$).
Thus, to prove (6.21), it suffices to show that
$$h_n(y) := \Big(\frac{\pi}{\nu_n}\Big)^{\frac{d}{2}}\int\exp\Big(-\frac{\|\omega\|^2}{4\nu_n}\Big)\big|\mathcal{F}p_y(\omega)\big|^2\,d\omega = \pi^{\frac{d}{2}}\int\exp\Big(-\frac{\|\omega\|^2}{4}\Big)\big|\mathcal{F}p_y(\sqrt{\nu_n}\,\omega)\big|^2\,d\omega \to 0$$
uniformly over $\{y : \|y\|\le\delta\}$.
Note that
$$h_n(y) = \mathbb{E}G_{\nu_n}(X,X'),$$
where $X, X'\overset{i.i.d.}{\sim}p_y$, which implies that $h_n(y)\to0$ pointwise. To prove the uniform convergence of $h_n(y)$, we only need to show that
$$\lim_{y_1\to y}\sup_n\big|h_n(y_1) - h_n(y)\big| = 0$$
for any $y$. Since $p_0\in L^2$, the marginal density of $Y_1$ is continuous. Therefore, the almost sure continuity of $p_0$ immediately implies that for every $y$, $p_{y_1}(\cdot)\to p_y(\cdot)$ almost surely as $y_1\to y$. Since $p_{y_1}$ and $p_y$ are both densities, it follows that
$$\big|\mathcal{F}p_{y_1}(\omega) - \mathcal{F}p_y(\omega)\big| \le (2\pi)^{-d/2}\int\big|p_{y_1}(y') - p_y(y')\big|\,dy' \to 0,$$
where the convergence follows from Scheffé's theorem; i.e., $\mathcal{F}p_{y_1}\to\mathcal{F}p_y$ uniformly as $y_1\to y$. Therefore we have
$$\sup_n\big|h_n(y_1) - h_n(y)\big| \lesssim \big\|\mathcal{F}p_{y_1} - \mathcal{F}p_y\big\|_{L^\infty} \to 0,$$
which ensures the uniform convergence of $h_n(y)$ to 0 over $\{y : \|y\|\le\delta\}$, and hence (6.20).
Indeed, we have shown that
$$\frac{n\hat{\gamma}^2_{\nu_n}(P,P_0)}{\sqrt{2\,\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)}} \xrightarrow{d} N(0,1).$$
By Slutsky's theorem, in order to prove (3.3), it suffices to show that
$$\hat{\sigma}^2_{n,\nu_n}\big/\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2) \xrightarrow{p} 1,$$
which is equivalent to
$$s^2_{n,\nu_n}\big/\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2) \xrightarrow{p} 1 \tag{6.22}$$
since $1/n^2 = o\big(\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\big)$. It follows from
$$\mathbb{E}\big(s^2_{n,\nu_n}\big) = \mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)$$
and
$$\mathrm{var}\big(s^2_{n,\nu_n}\big) \lesssim n^{-4}\,\mathrm{var}\bigg(\sum_{1\le i\ne j\le n}G^2_{\nu_n}(X_i,X_j)\bigg) + n^{-6}\,\mathrm{var}\Bigg(\sum_{\substack{1\le i,j_1,j_2\le n\\ |\{i,j_1,j_2\}|=3}}G_{\nu_n}(X_i,X_{j_1})G_{\nu_n}(X_i,X_{j_2})\Bigg) + n^{-8}\,\mathrm{var}\Bigg(\sum_{\substack{1\le i_1,i_2,j_1,j_2\le n\\ |\{i_1,i_2,j_1,j_2\}|=4}}G_{\nu_n}(X_{i_1},X_{j_1})G_{\nu_n}(X_{i_2},X_{j_2})\Bigg)$$
$$\lesssim n^{-2}\mathbb{E}G^4_{\nu_n}(X_1,X_2) + n^{-1}\mathbb{E}G^2_{\nu_n}(X_1,X_2)G^2_{\nu_n}(X_1,X_3) + n^{-1}\big(\mathbb{E}G^2_{\nu_n}(X_1,X_2)\big)^2 = o\Big(\big(\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\big)^2\Big)$$
that (6.22) holds.
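To make the normalization in (3.2)–(3.3) concrete, the sketch below computes a studentized Gaussian kernel goodness-of-fit statistic in one dimension with $P_0 = N(0,1)$, for which the centering of the kernel under $P_0$ has a closed form: $\mathbb{E}_{Y\sim N(0,1)}e^{-\nu(x-Y)^2} = (1+2\nu)^{-1/2}e^{-\nu x^2/(1+2\nu)}$ and $\mathbb{E}\,e^{-\nu(X-Y)^2} = (1+4\nu)^{-1/2}$ for $X, Y\overset{i.i.d.}{\sim}N(0,1)$. This is a hedged illustration: the simple plug-in variance estimate below is a stand-in for the thesis's estimator $\hat\sigma^2_{n,\nu_n}$, not its exact definition.

```python
import numpy as np

def centered_gram(x, nu):
    """Gram matrix of the Gaussian kernel centered under P0 = N(0, 1)."""
    G = np.exp(-nu * (x[:, None] - x[None, :]) ** 2)
    m = (1 + 2 * nu) ** -0.5 * np.exp(-nu * x ** 2 / (1 + 2 * nu))  # E_{Y~P0} G(x, Y)
    mm = (1 + 4 * nu) ** -0.5                                        # E G(X, Y), X, Y ~ P0
    return G - m[:, None] - m[None, :] + mm

def gof_statistic(x, nu=1.0):
    n = len(x)
    Gb = centered_gram(x, nu)
    off = Gb - np.diag(np.diag(Gb))
    gamma2 = off.sum() / (n * (n - 1))     # U-statistic estimate of gamma_nu^2(P, P0)
    s2 = (off ** 2).sum() / (n * (n - 1))  # crude plug-in estimate of E[Gbar^2] (assumption)
    return n * gamma2 / np.sqrt(2 * s2)

rng = np.random.default_rng(1)
t_null = gof_statistic(rng.standard_normal(200))       # H0 true: roughly N(0, 1)
t_alt = gof_statistic(rng.standard_normal(200) + 1.0)  # mean-shift alternative: large
print(t_null, t_alt)
```

Under the null the studentized statistic behaves like a standard normal draw, while under a fixed alternative it grows at the rate $n\gamma^2_{\nu}(P,P_0)$, which is why the test rejects.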
Proof of Theorem 8. Recall that
$$\hat{\gamma}^2_{\nu_n}(P,P_0) = \frac{1}{n(n-1)}\sum_{i\ne j}\bar{G}_{\nu_n}(X_i,X_j;P_0) = \gamma^2_{\nu_n}(P,P_0) + \frac{1}{n(n-1)}\sum_{i\ne j}\bar{G}_{\nu_n}(X_i,X_j;P)$$
$$\quad + \frac{2}{n}\sum_{i=1}^n\Big(\mathbb{E}_{X\sim P}\big[G_{\nu_n}(X_i,X)\,\big|\,X_i\big] - \mathbb{E}_{X\sim P_0}\big[G_{\nu_n}(X_i,X)\,\big|\,X_i\big] - \mathbb{E}_{X,X'\overset{i.i.d.}{\sim}P}G_{\nu_n}(X,X') + \mathbb{E}_{(X,Y)\sim P\otimes P_0}G_{\nu_n}(X,Y)\Big).$$
Denote the last two terms on the right-hand side by $V^{(1)}_{\nu_n}$ and $V^{(2)}_{\nu_n}$, respectively. It is clear that $\mathbb{E}V^{(1)}_{\nu_n} = \mathbb{E}V^{(2)}_{\nu_n} = 0$. It then suffices to show that
$$\sup_{\substack{p\in\mathcal{W}^{s,2}(M)\\ \|p-p_0\|\ge\Delta_n}}\frac{\mathbb{E}\big(V^{(1)}_{\nu_n}\big)^2 + \mathbb{E}\big(V^{(2)}_{\nu_n}\big)^2}{\gamma^4_{\nu_n}(P,P_0)} \to 0 \tag{6.23}$$
and
$$\inf_{\substack{p\in\mathcal{W}^{s,2}(M)\\ \|p-p_0\|\ge\Delta_n}}\frac{n\gamma^2_{\nu_n}(P,P_0)}{\sqrt{\mathbb{E}\big(\hat{\sigma}^2_{n,\nu_n}\big)}} \to \infty \tag{6.24}$$
as $n\to\infty$.
We first prove (6.23). Write $f = p - p_0$, and note that $\|p\|_{L^2} \le \|p\|_{\mathcal{W}^{s,2}} \le M$. Following arguments similar to those in the proof of Theorem 7, we get
$$\mathbb{E}\big(V^{(1)}_{\nu_n}\big)^2 \lesssim n^{-2}\mathbb{E}G^2_{\nu_n}(X_1,X_2) \lesssim_d M^2n^{-2}\nu_n^{-d/2},$$
and
$$\mathbb{E}\big(V^{(2)}_{\nu_n}\big)^2 \le \frac{4}{n}\mathbb{E}\Big[\mathbb{E}_{X\sim P}\big[G_{\nu_n}(X_i,X)\,\big|\,X_i\big] - \mathbb{E}_{X\sim P_0}\big[G_{\nu_n}(X_i,X)\,\big|\,X_i\big]\Big]^2$$
$$= \frac{4}{n}\int\bigg(\int G^2_{\nu_n}\big(x,(x'+x'')/2\big)p(x)\,dx\bigg)G_{\nu_n/2}(x',x'')f(x')f(x'')\,dx'dx'' \lesssim_d \frac{4M}{n\nu_n^{d/4}}\int G_{\nu_n/2}(x',x'')\big|f(x')\big|\,\big|f(x'')\big|\,dx'dx'' \lesssim_d \frac{4M}{n\nu_n^{3d/4}}\|f\|^2_{L^2}.$$
By Lemma 8, there exists a constant $C > 0$ depending on $s$ and $M$ only such that for $f\in\mathcal{W}^{s,2}(M)$,
$$\int\exp\Big(-\frac{\|\omega\|^2}{4\nu_n}\Big)\big\|\mathcal{F}f(\omega)\big\|^2\,d\omega \ge \frac{1}{4}\|f\|^2_{L^2}$$
given that $\nu_n \ge C\|f\|^{-2/s}_{L^2}$. Because $\nu_n\Delta_n^{2/s}\to\infty$, we obtain
$$\gamma^2_{\nu_n}(P,P_0) \gtrsim_d \nu_n^{-d/2}\|f\|^2_{L^2}$$
for sufficiently large $n$. Thus
$$\sup_{\substack{p\in\mathcal{W}^{s,2}(M)\\ \|p-p_0\|\ge\Delta_n}}\frac{\mathbb{E}\big(V^{(1)}_{\nu_n}\big)^2}{\gamma^4_{\nu_n}(P,P_0)} \lesssim_d M^2\big(n^2\nu_n^{-d/2}\Delta_n^4\big)^{-1} \to 0$$
and
$$\sup_{\substack{p\in\mathcal{W}^{s,2}(M)\\ \|p-p_0\|\ge\Delta_n}}\frac{\mathbb{E}\big(V^{(2)}_{\nu_n}\big)^2}{\gamma^4_{\nu_n}(P,P_0)} \lesssim_d M\big(n\nu_n^{-d/4}\Delta_n^2\big)^{-1} \to 0,$$
as $n\to\infty$. Next we prove (6.24). It follows from
$$\mathbb{E}\big(\hat{\sigma}^2_{n,\nu_n}\big) \le \mathbb{E}\max\big\{\big|s^2_{n,\nu_n}\big|,\ 1/n^2\big\} \lesssim \mathbb{E}G^2_{\nu_n}(X_1,X_2) + 1/n^2 \lesssim_d M^2\nu_n^{-d/2} + 1/n^2$$
that (6.24) holds.
Proof of Theorem 9. This result can, in a certain sense, be viewed as an extension of results from Ingster (1987), and the proof proceeds in a similar fashion. While Ingster (1987) considered the case where $p_0$ is the uniform distribution on $[0,1]$, we shall show that similar bounds hold for a wider class of $p_0$. For any $M > 0$ and $p_0$ such that $\|p_0\|_{\mathcal{W}^{s,2}} < M$, let
$$H_1^{\mathrm{GOF}}\big(\Delta_n; s, M - \|p_0\|_{\mathcal{W}^{s,2}}\big)^* := \big\{p\in\mathcal{W}^{s,2} : \|p-p_0\|_{\mathcal{W}^{s,2}} \le M - \|p_0\|_{\mathcal{W}^{s,2}},\ \|p-p_0\|_{L^2} \ge \Delta_n\big\}.$$
It is clear that $H_1^{\mathrm{GOF}}(\Delta_n; s) \supset H_1^{\mathrm{GOF}}(\Delta_n; s, M - \|p_0\|_{\mathcal{W}^{s,2}})^*$. Hence it suffices to prove Theorem 9 with $H_1^{\mathrm{GOF}}(\Delta_n; s)$ replaced by $H_1^{\mathrm{GOF}}(\Delta_n; s, M)^*$ for an arbitrary $M > 0$. We shall abbreviate $H_1^{\mathrm{GOF}}(\Delta_n; s, M)^*$ as $H_1^{\mathrm{GOF}}(\Delta_n; s)^*$ in the rest of the proof.
Since $p_0$ is almost surely continuous, there exist $x_0\in\mathbb{R}^d$ and $\delta, c > 0$ such that
$$p_0(x) \ge c > 0, \quad \forall\, \|x-x_0\|\le\delta.$$
In light of this, we shall assume $p_0(x) \ge c > 0$ for all $x\in[0,1]^d$ without loss of generality. Let $\boldsymbol{a}_n$ be a multivariate random index. As proved in Ingster (1987), in order to prove the existence of $\alpha\in(0,1)$ such that no asymptotically $\alpha$-level test can be consistent, it suffices to identify $p_{n,\boldsymbol{a}_n}\in H_1^{\mathrm{GOF}}(\Delta_n;s)^*$ for all possible values of $\boldsymbol{a}_n$ such that
$$\mathbb{E}_{p_0}\bigg(\frac{p_n(X_1,\ldots,X_n)}{\prod_{i=1}^np_0(X_i)}\bigg)^2 = O(1), \tag{6.25}$$
where
$$p_n(x_1,\ldots,x_n) = \mathbb{E}_{\boldsymbol{a}_n}\bigg(\prod_{i=1}^np_{n,\boldsymbol{a}_n}(x_i)\bigg), \quad \forall\, x_1,\ldots,x_n,$$
i.e., $p_n$ is the mixture of all the $p_{n,\boldsymbol{a}_n}$'s.
Let $1_{\{x\in[0,1]^d\}}, \phi_{n,1},\ldots,\phi_{n,B_n}$ be an orthonormal set of functions in $L^2(\mathbb{R}^d)$ such that the supports of $\phi_{n,1},\ldots,\phi_{n,B_n}$ are disjoint and all contained in $[0,1]^d$. Let $\boldsymbol{a}_n = (a_{n,1},\ldots,a_{n,B_n})$ be such that $a_{n,1},\ldots,a_{n,B_n}$ are independent with
$$P(a_{n,k} = 1) = P(a_{n,k} = -1) = \frac{1}{2}, \quad \forall\, 1\le k\le B_n.$$
Define
$$p_{n,\boldsymbol{a}_n} = p_0 + r_n\sum_{k=1}^{B_n}a_{n,k}\phi_{n,k}.$$
Then
$$\frac{p_{n,\boldsymbol{a}_n}}{p_0} = 1 + r_n\sum_{k=1}^{B_n}a_{n,k}\frac{\phi_{n,k}}{p_0},$$
where $1, \frac{\phi_{n,1}}{p_0},\ldots,\frac{\phi_{n,B_n}}{p_0}$ are orthogonal in $L^2(P_0)$.
By arguments similar to those in Ingster (1987), we find
$$\mathbb{E}_{p_0}\bigg(\frac{p_n(X_1,\ldots,X_n)}{\prod_{i=1}^np_0(X_i)}\bigg)^2 \le \exp\bigg(\frac{1}{2}B_nn^2r_n^4\max_{1\le k\le B_n}\Big(\int\phi^2_{n,k}/p_0\,dx\Big)^2\bigg) \le \exp\Big(\frac{1}{2c^2}B_nn^2r_n^4\Big).$$
In order to ensure (6.25), it suffices to have
$$B_n^{1/2}nr_n^2 = O(1). \tag{6.26}$$
Therefore, given $\Delta_n = O\big(n^{-\frac{2s}{4s+d}}\big)$, once we can find proper $r_n$, $B_n$ and $\phi_{n,1},\ldots,\phi_{n,B_n}$ such that $p_{n,\boldsymbol{a}_n}\in H_1^{\mathrm{GOF}}(\Delta_n;s)^*$ for all $\boldsymbol{a}_n$ and (6.26) holds, the proof is finished.
Let $b_n = B_n^{1/d}$, let $\phi$ be an infinitely differentiable function supported on $[0,1]^d$ that is orthogonal to $1_{\{x\in[0,1]^d\}}$ in $L^2$, and for each $x_{n,k}\in\{0,1,\ldots,b_n-1\}^d$, let
$$\phi_{n,k}(x) = \frac{b_n^{d/2}}{\|\phi\|_{L^2}}\phi(b_nx - x_{n,k}), \quad \forall\, x\in\mathbb{R}^d.$$
Then all the $\phi_{n,k}$'s are supported on $[0,1]^d$ and
$$\langle\phi_{n,k}, 1\rangle_{L^2} = \frac{b_n^{d/2}}{\|\phi\|_{L^2}}\int_{\mathbb{R}^d}\phi(b_nx - x_{n,k})\,dx = \frac{1}{b_n^{d/2}\|\phi\|_{L^2}}\int_{\mathbb{R}^d}\phi(x)\,dx = 0,$$
$$\|\phi_{n,k}\|^2_{L^2} = \frac{b_n^d}{\|\phi\|^2_{L^2}}\int_{[0,1/b_n]^d}\phi^2(b_nx)\,dx = 1,$$
$$\|\phi_{n,k}\|^2_{\mathcal{W}^{s,2}} \le b_n^{2s}\frac{\|\phi\|^2_{\mathcal{W}^{s,2}}}{\|\phi\|^2_{L^2}}.$$
Moreover, since for $k\ne k'$ the supports of $\phi_{n,k}$ and $\phi_{n,k'}$ are disjoint,
$$\|p_{n,\boldsymbol{a}_n} - p_0\|_\infty = r_nb_n^{d/2}\frac{\|\phi\|_\infty}{\|\phi\|_{L^2}}$$
and
$$\langle\phi_{n,k},\phi_{n,k'}\rangle_{L^2} = 0, \qquad \langle\phi_{n,k},\phi_{n,k'}\rangle_{\mathcal{W}^{s,2}} = 0,$$
from which we immediately obtain
$$\|p_{n,\boldsymbol{a}_n} - p_0\|^2_{L^2} = r_n^2b_n^d, \qquad \|p_{n,\boldsymbol{a}_n} - p_0\|^2_{\mathcal{W}^{s,2}} \le r_n^2b_n^{d+2s}\frac{\|\phi\|^2_{\mathcal{W}^{s,2}}}{\|\phi\|^2_{L^2}}.$$
To ensure $p_{n,\boldsymbol{a}_n}\in H_1^{\mathrm{GOF}}(\Delta_n;s)^*$, it suffices to make
$$r_nb_n^{d/2}\frac{\|\phi\|_\infty}{\|\phi\|_{L^2}} \to 0 \quad\text{as } n\to\infty, \tag{6.27}$$
$$r_n^2b_n^d = \Delta_n^2, \tag{6.28}$$
$$r_n^2b_n^{d+2s}\frac{\|\phi\|^2_{\mathcal{W}^{s,2}}}{\|\phi\|^2_{L^2}} \le M^2. \tag{6.29}$$
Let
\[
b_n = \left(\frac{M\|\phi\|_{L_2}}{\|\phi\|_{\mathcal{W}^{s,2}}}\right)^{1/s}\Delta_n^{-1/s},
\qquad
r_n = \frac{\Delta_n}{b_n^{d/2}}.
\]
Then (6.28) and (6.29) are satisfied. Moreover, given $\Delta_n = O\big(n^{-\frac{2s}{4s+d}}\big)$,
\[
B_n^{1/2}\,n\,r_n^2 = b_n^{-d/2}\,n\,\Delta_n^2 \lesssim_{d,\phi,M} n\,\Delta_n^{\frac{4s+d}{2s}} = O(1),
\]
and
\[
r_n b_n^{d/2}\,\frac{\|\phi\|_\infty}{\|\phi\|_{L_2}} \lesssim_\phi \Delta_n = o(1),
\]
ensuring both (6.26) and (6.27).
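The rate calculation above is pure exponent arithmetic. The following sketch (with illustrative values of $s$ and $d$ that are not taken from the thesis) verifies numerically that with $\Delta_n = n^{-2s/(4s+d)}$, $b_n = \Delta_n^{-1/s}$ (constant factors dropped) and $r_n = \Delta_n/b_n^{d/2}$, condition (6.28) holds by construction while the quantities in (6.26) and (6.27) stay bounded.

```python
# Sanity check of the exponent arithmetic in the lower-bound construction.
# With Delta_n = n^(-2s/(4s+d)), b_n = Delta_n^(-1/s) and r_n = Delta_n / b_n^(d/2),
# B_n^(1/2) * n * r_n^2 (condition (6.26)) equals n * Delta_n^((4s+d)/(2s)) = 1,
# and r_n * b_n^(d/2) (condition (6.27)) equals Delta_n -> 0.
# The values s = 2, d = 1 are illustrative only.

s, d = 2.0, 1.0

def construction(n):
    delta = n ** (-2 * s / (4 * s + d))
    b = delta ** (-1 / s)           # bandwidth parameter b_n (constants dropped)
    r = delta / b ** (d / 2)        # perturbation amplitude, so r^2 b^d = delta^2
    B = b ** d                      # number of bumps B_n = b_n^d
    return delta, B ** 0.5 * n * r ** 2, r * b ** (d / 2)

for n in (10 ** 3, 10 ** 6, 10 ** 9):
    delta, cond_626, cond_627 = construction(n)
    assert abs(cond_626 - 1.0) < 1e-8        # (6.26): exactly O(1)
    assert abs(cond_627 / delta - 1.0) < 1e-9  # (6.27): equals delta -> 0
```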
Finally, we show the existence of such $\phi$. Let
\[
\phi_0(x_1) =
\begin{cases}
\exp\left(-\dfrac{1}{1-(4x_1-1)^2}\right), & 0 < x_1 < \tfrac12,\\[4pt]
-\exp\left(-\dfrac{1}{1-(4x_1-3)^2}\right), & \tfrac12 < x_1 < 1,\\[4pt]
0, & \text{otherwise}.
\end{cases}
\]
Then $\phi_0$ is supported on $[0,1]$, infinitely differentiable and orthogonal to the indicator function of $[0,1]$. Let
\[
\phi(x) = \prod_{l=1}^{d}\phi_0(x_l), \qquad \forall\; x = (x_1,\cdots,x_d)\in\mathbb{R}^d.
\]
Then $\phi$ is supported on $[0,1]^d$, infinitely differentiable, and $\langle\phi,1\rangle_{L_2} = \langle\phi_0,1\rangle^d_{L_2[0,1]} = 0$.
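The bump $\phi_0$ is fully explicit and its defining properties can be checked numerically. The sketch below (our code, not part of the thesis) verifies that $\phi_0$ vanishes outside $(0,1)$ and integrates to zero over $[0,1]$, which is exactly the orthogonality to the indicator of $[0,1]$ used above; the integral vanishes because $\phi_0$ is antisymmetric about $x_1 = 1/2$.

```python
import math

def phi0(x):
    # Smooth bump from the proof: a positive bump on (0, 1/2) and its
    # negative mirror image on (1/2, 1); identically zero elsewhere.
    if 0 < x < 0.5:
        return math.exp(-1.0 / (1.0 - (4 * x - 1) ** 2))
    if 0.5 < x < 1:
        return -math.exp(-1.0 / (1.0 - (4 * x - 3) ** 2))
    return 0.0

# Midpoint rule on a fine grid: the grid midpoints pair up symmetrically
# about x = 1/2, so the numerical integral reflects the exact cancellation.
N = 200_000
h = 1.0 / N
integral = sum(phi0((i + 0.5) * h) for i in range(N)) * h
assert abs(integral) < 1e-10          # orthogonal to the indicator of [0, 1]
assert phi0(0.0) == 0.0 and phi0(1.0) == 0.0 and phi0(-0.1) == 0.0  # support
```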
Proof of Theorem 10. Let $N = m+n$ denote the total sample size. It suffices to prove the result under the assumption that $n/N \to r\in(0,1)$.

Note that under $H_0$,
\[
\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{Q}) = \frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}\bar{G}_{\nu_n}(X_i,X_j) + \frac{1}{m(m-1)}\sum_{1\le i\ne j\le m}\bar{G}_{\nu_n}(Y_i,Y_j) - \frac{2}{nm}\sum_{1\le i\le n}\sum_{1\le j\le m}\bar{G}_{\nu_n}(X_i,Y_j).
\]
Let $n/N = r_n$. Then we have
\[
\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{Q}) = N^{-2}\Bigg(\frac{1}{r_n(r_n-N^{-1})}\sum_{1\le i\ne j\le n}\bar{G}_{\nu_n}(X_i,X_j) + \frac{1}{(1-r_n)(1-r_n-N^{-1})}\sum_{1\le i\ne j\le m}\bar{G}_{\nu_n}(Y_i,Y_j) - \frac{2}{r_n(1-r_n)}\sum_{1\le i\le n}\sum_{1\le j\le m}\bar{G}_{\nu_n}(X_i,Y_j)\Bigg).
\]
Let
\[
\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{Q})' = N^{-2}\Bigg(\frac{1}{r^2}\sum_{1\le i\ne j\le n}\bar{G}_{\nu_n}(X_i,X_j) + \frac{1}{(1-r)^2}\sum_{1\le i\ne j\le m}\bar{G}_{\nu_n}(Y_i,Y_j) - \frac{2}{r(1-r)}\sum_{1\le i\le n}\sum_{1\le j\le m}\bar{G}_{\nu_n}(X_i,Y_j)\Bigg).
\]
As we assume $r_n\to r$ as $n\to\infty$, Theorem 7 ensures that
\[
\frac{nm}{\sqrt{2}(n+m)}\big[\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\big]^{-\frac12}\big(\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{Q}) - \hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{Q})'\big) = o_p(1).
\]
A slight adaptation of the arguments in Hall (1984) shows that
\[
\frac{\mathbb{E}\bar{G}^4_{\nu_n}(X_1,X_2)}{N^2\big[\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\big]^2} + \frac{\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\bar{G}^2_{\nu_n}(X_1,X_3)}{N\big[\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\big]^2} + \frac{\mathbb{E}H^2_{\nu_n}(X_1,X_2)}{\big[\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\big]^2} \to 0 \tag{6.30}
\]
ensures that
\[
\frac{nm}{\sqrt{2}(n+m)}\big[\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\big]^{-\frac12}\,\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{Q})' \to_d N(0,1).
\]
Following arguments similar to those in the proof of Theorem 7, given $\nu_n\to\infty$ and $\nu_n/n^{4/d}\to 0$, (6.30) holds and therefore
\[
\frac{nm}{\sqrt{2}(n+m)}\big[\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)\big]^{-\frac12}\,\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{Q}) \to_d N(0,1).
\]
Additionally, based on the same arguments as in the proof of Theorem 7,
\[
\hat{s}^2_{n,m,\nu_n}\big/\mathbb{E}\big[\bar{G}_{\nu_n}(X_1,X_2)\big]^2 \to_p 1.
\]
The proof is therefore concluded.
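The two-sample quantity analyzed above is directly computable from Gram matrices. The following is a minimal sketch (our variable names; the plain unbiased Gaussian-kernel MMD estimator $\hat{\gamma}^2_\nu(\mathbb{P},\mathbb{Q})$, without the studentization treated in the theorem).

```python
import numpy as np

def gaussian_kernel(x, y, nu):
    # G_nu(x, y) = exp(-nu * ||x - y||^2), pairwise over rows of x and y.
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-nu * d2)

def mmd2_unbiased(x, y, nu):
    # Unbiased estimator of gamma_nu^2(P, Q): off-diagonal means of the
    # within-sample Gram matrices minus twice the cross-sample mean.
    n, m = len(x), len(y)
    kxx = gaussian_kernel(x, x, nu)
    kyy = gaussian_kernel(y, y, nu)
    kxy = gaussian_kernel(x, y, nu)
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_x + term_y - 2 * kxy.mean()

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(size=(500, 2)), rng.normal(size=(500, 2)), nu=1.0)
diff = mmd2_unbiased(rng.normal(size=(500, 2)),
                     rng.normal(2.0, 1.0, size=(500, 2)), nu=1.0)
# Under H0 the estimator is centered near zero; under a mean shift it is
# bounded away from zero.
assert abs(same) < 0.05 and diff > 0.05
```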
Proof of Theorem 11. With slight abuse of notation, we shall write
\[
\bar{G}_{\nu_n}(x,y;\mathbb{P},\mathbb{Q}) = G_{\nu_n}(x,y) - \mathbb{E}_{Y\sim\mathbb{Q}}G_{\nu_n}(x,Y) - \mathbb{E}_{X\sim\mathbb{P}}G_{\nu_n}(X,y) + \mathbb{E}_{(X,Y)\sim\mathbb{P}\otimes\mathbb{Q}}G_{\nu_n}(X,Y).
\]
We consider the two parts separately.

Part (i). We first verify the consistency of $\Phi^{\rm HOM}_{n,\nu_n,\alpha}$ with $\nu_n\asymp n^{4/(d+4s)}$ given $\Delta_n\gg n^{-2s/(d+4s)}$. Observe the following decomposition of $\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{Q})$:
\[
\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{Q}) = \gamma^2_{\nu_n}(\mathbb{P},\mathbb{Q}) + L^{(1)}_{n,\nu_n} + L^{(2)}_{n,\nu_n},
\]
where
\[
L^{(1)}_{n,\nu_n} = \frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}\bar{G}_{\nu_n}(X_i,X_j;\mathbb{P}) - \frac{2}{mn}\sum_{1\le i\le n}\sum_{1\le j\le m}\bar{G}_{\nu_n}(X_i,Y_j;\mathbb{P},\mathbb{Q}) + \frac{1}{m(m-1)}\sum_{1\le i\ne j\le m}\bar{G}_{\nu_n}(Y_i,Y_j;\mathbb{Q})
\]
and
\[
L^{(2)}_{n,\nu_n} = \frac{2}{n}\sum_{i=1}^n\big(\mathbb{E}[G_{\nu_n}(X_i,X)\,|\,X_i] - \mathbb{E}G_{\nu_n}(X,X') - \mathbb{E}[G_{\nu_n}(X_i,Y)\,|\,X_i] + \mathbb{E}G_{\nu_n}(X,Y)\big) + \frac{2}{m}\sum_{j=1}^m\big(\mathbb{E}[G_{\nu_n}(Y_j,Y)\,|\,Y_j] - \mathbb{E}G_{\nu_n}(Y,Y') - \mathbb{E}[G_{\nu_n}(X,Y_j)\,|\,Y_j] + \mathbb{E}G_{\nu_n}(X,Y)\big).
\]
In order to prove the consistency of $\Phi^{\rm HOM}_{n,\nu_n,\alpha}$, it suffices to show
\[
\sup_{\substack{p,q\in\mathcal{W}^{s,2}(M)\\ \|p-q\|_{L_2}\ge\Delta_n}}\frac{\mathbb{E}\big(L^{(1)}_{n,\nu_n}\big)^2 + \mathbb{E}\big(L^{(2)}_{n,\nu_n}\big)^2}{\gamma^4_{G_{\nu_n}}(\mathbb{P},\mathbb{Q})} \to 0, \tag{6.31}
\]
\[
\inf_{\substack{p,q\in\mathcal{W}^{s,2}(M)\\ \|p-q\|_{L_2}\ge\Delta_n}}\frac{\gamma^2_{G_{\nu_n}}(\mathbb{P},\mathbb{Q})}{(1/n+1/m)\sqrt{\mathbb{E}\big(\hat{s}^2_{n,m,\nu_n}\big)}} \to \infty, \tag{6.32}
\]
as $n\to\infty$. We now prove (6.31) and (6.32) with arguments similar to those in the proof of Theorem 8.

Note that
\[
\mathbb{E}\big(L^{(1)}_{n,\nu_n}\big)^2 \lesssim \mathbb{E}\Bigg(\frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}\bar{G}_{\nu_n}(X_i,X_j;\mathbb{P})\Bigg)^2 + \mathbb{E}\Bigg(\frac{2}{mn}\sum_{1\le i\le n}\sum_{1\le j\le m}\bar{G}_{\nu_n}(X_i,Y_j;\mathbb{P},\mathbb{Q})\Bigg)^2 + \mathbb{E}\Bigg(\frac{1}{m(m-1)}\sum_{1\le i\ne j\le m}\bar{G}_{\nu_n}(Y_i,Y_j;\mathbb{Q})\Bigg)^2
\lesssim \frac{1}{n^2}\mathbb{E}G^2_{\nu_n}(X_1,X_2) + \frac{1}{m^2}\mathbb{E}G^2_{\nu_n}(Y_1,Y_2).
\]
Given $p,q\in\mathcal{W}^{s,2}(M)$,
\[
\mathbb{E}G^2_{\nu_n}(X_1,X_2) \lesssim_d M^2\nu_n^{-d/2}, \qquad \mathbb{E}G^2_{\nu_n}(Y_1,Y_2) \lesssim_d M^2\nu_n^{-d/2}.
\]
Hence
\[
\mathbb{E}\big(L^{(1)}_{n,\nu_n}\big)^2 \lesssim_d M^2\nu_n^{-d/2}\left(\frac{1}{n^2}+\frac{1}{m^2}\right). \tag{6.33}
\]
Now consider bounding $L^{(2)}_{n,\nu_n}$. Let $f = p - q$. Then we have
\[
\mathbb{E}\big(L^{(2)}_{n,\nu_n}\big)^2 \lesssim_d \nu_n^{-\frac{3d}{4}}\,M\|f\|^2_{L_2}\left(\frac1n+\frac1m\right). \tag{6.34}
\]
Since $\nu_n\asymp n^{4/(4s+d)}\gg\Delta_n^{-2/s}$, Lemma 8 ensures that for sufficiently large $n$,
\[
\gamma^2_{G_{\nu_n}}(\mathbb{P},\mathbb{Q}) \gtrsim_d \nu_n^{-d/2}\|f\|^2_{L_2}, \qquad \forall\; p,q\in\mathcal{W}^{s,2}(M).
\]
This together with (6.33) and (6.34) gives
\[
\sup_{\substack{p,q\in\mathcal{W}^{s,2}(M)\\ \|p-q\|_{L_2}\ge\Delta_n}}\frac{\mathbb{E}\big(L^{(1)}_{n,\nu_n}\big)^2+\mathbb{E}\big(L^{(2)}_{n,\nu_n}\big)^2}{\gamma^4_{G_{\nu_n}}(\mathbb{P},\mathbb{Q})} \lesssim_d \frac{M^2\nu_n^{d/2}}{n^2\Delta_n^4} + \frac{M\nu_n^{d/4}}{n\Delta_n^2} \to 0
\]
as $n\to\infty$, which proves (6.31).

Finally, consider (6.32). It follows from
\[
\mathbb{E}\big(\hat{s}^2_{n,m,\nu_n}\big) \le \mathbb{E}\max\big\{\big|s^2_{n,m,\nu_n}\big|,\,1/n^2\big\} \lesssim \max\big\{\mathbb{E}G^2_{\nu_n}(X_1,X_2),\,\mathbb{E}G^2_{\nu_n}(Y_1,Y_2)\big\} + 1/n^2 \lesssim_d M^2\nu_n^{-d/2} + 1/n^2
\]
that (6.32) holds.
Part (ii). Next, we prove that if $\liminf_{n\to\infty}\Delta_n n^{2s/(d+4s)} < \infty$, then there exists some $\alpha\in(0,1)$ such that no asymptotic $\alpha$-level test can be consistent. To prove this, we shall verify that consistency of the homogeneity test is harder to achieve than that of the goodness-of-fit test.

Consider an arbitrary $p_0\in\mathcal{W}^{s,2}(M/2)$. It immediately follows that
\[
H^{\rm HOM}_1(\Delta_n;s) \supset \{(p,p_0) : p\in H^{\rm GOF}_1(\Delta_n;s)\}.
\]
Let $\{\Phi_n\}_{n\ge1}$ be any sequence of asymptotic $\alpha$-level homogeneity tests, where $\Phi_n = \Phi_n(X_1,\cdots,X_n,Y_1,\cdots,Y_m)$. Then if $Y_1,\cdots,Y_m\sim_{\rm iid}P_0$, $\{\Phi_n\}_{n\ge1}$ can also be treated as a sequence of (random) goodness-of-fit tests
\[
\tilde{\Phi}_n(X_1,\cdots,X_n) = \Phi_n(X_1,\cdots,X_n,Y_1,\cdots,Y_m)
\]
whose probabilities of type I error with respect to $P_0$ are controlled at $\alpha$ asymptotically. Moreover,
\[
{\rm power}\{\Phi_n;H^{\rm HOM}_1(\Delta_n;s)\} \le {\rm power}\{\tilde{\Phi}_n;H^{\rm GOF}_1(\Delta_n;s)\}.
\]
Since $0 < c \le m/n \le C < \infty$, Theorem 9 ensures that there exists some $\alpha\in(0,1)$ such that for any sequence of asymptotic $\alpha$-level tests $\{\Phi_n\}_{n\ge1}$,
\[
\liminf_{n\to\infty}{\rm power}\{\Phi_n;H^{\rm HOM}_1(\Delta_n;s)\} \le \liminf_{n\to\infty}{\rm power}\{\tilde{\Phi}_n;H^{\rm GOF}_1(\Delta_n;s)\} < 1
\]
given $\liminf_{n\to\infty}\Delta_n n^{2s/(d+4s)} < \infty$.
Proof of Theorem 12. For brevity, we shall focus on the case $k = 2$ in the rest of the proof. Our argument, however, can be straightforwardly extended to the more general cases. The proof relies on the following decomposition of $\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2})$ under $H^{\rm IND}_0$:
\[
\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}) = \frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}G^\ast_{\nu_n}(X_i,X_j) + R_n,
\]
where
\[
G^\ast_{\nu_n}(x,y) = \bar{G}_{\nu_n}(x,y) - \sum_{1\le j\le 2}g_j(x^j,y) - \sum_{1\le j\le 2}g_j(y^j,x) + \sum_{1\le j_1,j_2\le 2}g_{j_1,j_2}(x^{j_1},y^{j_2})
\]
and the remainder $R_n$ satisfies
\[
\mathbb{E}(R_n)^2 \lesssim \mathbb{E}G^2_{\nu_n}(X_1,X_2)/n^3 \lesssim_d \|p\|^2_{L_2}\,\nu_n^{-d/2}/n^3.
\]
See Appendix B.4 for more details.
Moreover, borrowing arguments in the proof of Lemma 1, we obtain
\[
\mathbb{E}\big(G^\ast_{\nu_n}(X_1,X_2) - \bar{G}_{\nu_n}(X_1,X_2)\big)^2 \lesssim \sum_{1\le j\le 2}\mathbb{E}\big(g_j(X^j_1,X_2)\big)^2 + \sum_{1\le j_1,j_2\le 2}\mathbb{E}\big(g_{j_1,j_2}(X^{j_1}_1,X^{j_2}_2)\big)^2
\]
\[
\le \sum_{1\le j_1\ne j_2\le 2}\mathbb{E}G^2_{\nu_n}(X^{j_1}_1,X^{j_1}_2)\cdot\mathbb{E}\big\{\mathbb{E}\big[G_{\nu_n}(X^{j_2}_1,X^{j_2}_2)\,\big|\,X^{j_2}_1\big]\big\}^2 + \sum_{1\le j_1\ne j_2\le 2}\mathbb{E}G^2_{\nu_n}(X^{j_1}_1,X^{j_1}_2)\big[\mathbb{E}G_{\nu_n}(X^{j_2}_1,X^{j_2}_2)\big]^2 + 2\,\mathbb{E}\big\{\mathbb{E}\big[G_{\nu_n}(X^1_1,X^1_2)\,\big|\,X^1_1\big]\big\}^2\,\mathbb{E}\big\{\mathbb{E}\big[G_{\nu_n}(X^2_1,X^2_2)\,\big|\,X^2_1\big]\big\}^2
\]
\[
\lesssim_d \nu_n^{-d_1/2-3d_2/4}\|p_1\|^2_{L_2}\|p_2\|^3_{L_2} + \nu_n^{-3d_1/4-d_2/2}\|p_1\|^3_{L_2}\|p_2\|^2_{L_2}.
\]
Together with the fact that
\[
(2\nu_n/\pi)^{d/2}\,\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2) \to \|p\|^2_{L_2}
\]
as $\nu_n\to\infty$, we conclude that
\[
\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}) = D(\nu_n) + o_p\Big(\sqrt{\mathbb{E}D^2(\nu_n)}\Big),
\]
where
\[
D(\nu_n) = \frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}\bar{G}_{\nu_n}(X_i,X_j).
\]
Applying arguments similar to those in the proofs of Theorems 7 and 10, we have
\[
\frac{D(\nu_n)}{\sqrt{\mathbb{E}D^2(\nu_n)}} \to_d N(0,1).
\]
Since
\[
\mathbb{E}D^2(\nu_n) = \frac{2}{n(n-1)}\mathbb{E}\big[\bar{G}_{\nu_n}(X_1,X_2)\big]^2 \quad\text{and}\quad \mathbb{E}\big[\bar{G}_{\nu_n}(X_1,X_2)\big]^2\big/\mathbb{E}\big[G^\ast_{\nu_n}(X_1,X_2)\big]^2 \to 1,
\]
it remains to prove
\[
\hat{s}^2_{n,\nu_n}\big/\mathbb{E}\big[G^\ast_{\nu_n}(X_1,X_2)\big]^2 \to_p 1,
\]
which immediately follows by observing
\[
s^2_{n,\nu_n}\big/\mathbb{E}\big[G^\ast_{\nu_n}(X_1,X_2)\big]^2 = \prod_{j=1}^2 s^2_{n,j,\nu_n}\big/\mathbb{E}\big[\bar{G}_{\nu_n}(X^j_1,X^j_2)\big]^2 \to_p 1
\]
and $1/n^2 = o\big(\mathbb{E}[G^\ast_{\nu_n}(X_1,X_2)]^2\big)$. The proof is therefore concluded.
Proof of Theorem 13. We prove the two parts separately.

Part (i). The proof of consistency of $\Phi^{\rm IND}_{n,\nu_n,\alpha}$ is very similar to its counterpart in the proof of Theorem 11. It suffices to show
\[
\sup_{p\in H^{\rm IND}_1(\Delta_n,s)}\frac{{\rm var}\big(\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2})\big)}{\gamma^4_{\nu_n}(\mathbb{P},\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2})} \to 0, \tag{6.35}
\]
\[
\inf_{p\in H^{\rm IND}_1(\Delta_n,s)}\frac{n\,\gamma^2_{\nu_n}(\mathbb{P},\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2})}{\mathbb{E}\big(\hat{s}_{n,\nu_n}\big)} \to \infty, \tag{6.36}
\]
as $n\to\infty$.

We begin with (6.35). Let $f = p - p_1\otimes p_2$. Lemma 8 then implies that there exists $C = C(s,M) > 0$ such that
\[
\gamma^2_{\nu}(\mathbb{P},\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}) \asymp_d \nu^{-d/2}\|f\|^2_{L_2}
\]
for $\nu \ge C\|f\|^{-2/s}_{L_2}$, which is satisfied by all $p\in H^{\rm IND}_1(\Delta_n,s)$ given $\nu = \nu_n$ and $\lim_{n\to\infty}\Delta_n n^{\frac{2s}{4s+d}} = \infty$. On the other hand, we can still decompose $\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2})$ as in Appendix B.4. We follow the same notations here.
Under the alternative hypothesis, the "first order" term
\[
D_1(\nu_n) = \frac2n\sum_{1\le i\le n}\big(\mathbb{E}_{X_i,X\sim_{\rm iid}\mathbb{P}}[G_{\nu_n}(X_i,X)\,|\,X_i] - \mathbb{E}_{X,X'\sim_{\rm iid}\mathbb{P}}G_{\nu_n}(X,X')\big)
- \frac2n\sum_{1\le i\le n}\big(\mathbb{E}_{X_i\sim\mathbb{P},\,Y\sim\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}}[G_{\nu_n}(X_i,Y)\,|\,X_i] - \mathbb{E}_{X\sim\mathbb{P},\,Y\sim\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}}G_{\nu_n}(X,Y)\big)
\]
\[
- \sum_{1\le j\le 2}\Bigg(\frac2n\sum_{1\le i\le n}\big(\mathbb{E}_{X_i\sim\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2},\,X\sim\mathbb{P}}[G_{\nu_n}(X_i,X)\,|\,X^j_i] - \mathbb{E}_{X\sim\mathbb{P},\,Y\sim\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}}G_{\nu_n}(X,Y)\big)\Bigg)
+ \sum_{1\le j\le 2}\Bigg(\frac2n\sum_{1\le i\le n}\big(\mathbb{E}_{X_i,Y\sim_{\rm iid}\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}}[G_{\nu_n}(X_i,Y)\,|\,X^j_i] - \mathbb{E}_{Y,Y'\sim_{\rm iid}\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}}G_{\nu_n}(Y,Y')\big)\Bigg)
\]
no longer vanishes, but based on arguments similar to those in the proof of Theorem 8,
\[
\mathbb{E}D_1^2(\nu_n) \lesssim_d M n^{-1}\nu_n^{-3d/4}\|f\|^2_{L_2}.
\]
Moreover, the "second order" term $D_2(\nu_n)$ is no longer simply $\sum_{1\le i\ne j\le n}G^\ast_{\nu_n}(X_i,X_j)/(n(n-1))$, but we still have
\[
\mathbb{E}D_2^2(\nu_n) \lesssim n^{-2}\max\big\{\mathbb{E}G^2_{\nu_n}(X_1,X_2),\ \mathbb{E}G^2_{\nu_n}(X^1_1,X^1_2)\,\mathbb{E}G^2_{\nu_n}(X^2_1,X^2_2)\big\} \lesssim_d M^2 n^{-2}\nu_n^{-d/2}.
\]
Similarly, define the third order term $D_3(\nu_n)$ and the fourth order term $D_4(\nu_n)$ as the aggregation of all 3-variate centered components and the aggregation of all 4-variate centered components in $\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2})$ respectively, which together constitute $R_n$. Then we have
\[
\mathbb{E}D_3^2(\nu_n) \lesssim_d M^2 n^{-3}\nu_n^{-d/2}, \qquad \mathbb{E}D_4^2(\nu_n) \lesssim_d M^2 n^{-4}\nu_n^{-d/2}.
\]
Hence we finally obtain
\[
\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}) = \gamma^2_{\nu_n}(\mathbb{P},\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}) + \sum_{l=1}^4 D_l(\nu_n)
\]
and
\[
{\rm var}\big(\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2})\big) = \sum_{l=1}^4\mathbb{E}D_l^2(\nu_n) \lesssim_d M n^{-1}\nu_n^{-3d/4}\|f\|^2_{L_2} + M^2 n^{-2}\nu_n^{-d/2},
\]
which proves (6.35).

Now consider (6.36). Since
\[
\hat{s}_{n,\nu_n} \le \max\Bigg\{\prod_{j=1}^2\sqrt{\big|s^2_{n,j,\nu_n}\big|},\ 1/n\Bigg\},
\]
we have
\[
\mathbb{E}\big(\hat{s}_{n,\nu_n}\big) \le \prod_{j=1}^2\sqrt{\mathbb{E}\big|s^2_{n,j,\nu_n}\big|} + 1/n,
\]
where
\[
\prod_{j=1}^2\mathbb{E}\big|s^2_{n,j,\nu_n}\big| \lesssim \prod_{j=1}^2\mathbb{E}G^2_{\nu_n}(X^j_1,X^j_2) = \mathbb{E}_{Y_1,Y_2\sim_{\rm iid}\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}}G^2_{\nu_n}(Y_1,Y_2) \lesssim_d M^2\nu_n^{-d/2}.
\]
Therefore (6.36) holds.
Part (ii). We now verify that $n^{2s/(d+4s)}\Delta_n\to\infty$ is also necessary for the existence of consistent asymptotic $\alpha$-level tests for any $\alpha\in(0,1)$. Similarly to the proof of Theorem 11, the idea is to relate the existence of a consistent independence test to the existence of a consistent goodness-of-fit test.

Let $p_{j,0}\in\mathcal{W}^{s,2}(M_j/\sqrt{2})$ be a density on $\mathbb{R}^{d_j}$ for $j=1,2$ and let $p_0$ be the product of $p_{1,0}$ and $p_{2,0}$, i.e.,
\[
p_0(x^1,x^2) = p_{1,0}(x^1)\,p_{2,0}(x^2), \qquad \forall\; x^1\in\mathbb{R}^{d_1},\ x^2\in\mathbb{R}^{d_2}.
\]
Hence $p_0\in\mathcal{W}^{s,2}(M/2)$. Let
\[
H^{\rm GOF}_1(\Delta_n;s)' := \{p : p\in\mathcal{W}^{s,2}(M),\ p_1 = p_{1,0},\ p_2 = p_{2,0},\ \|p-p_0\|_{L_2}\ge\Delta_n\}.
\]
We immediately have
\[
H^{\rm IND}_1(\Delta_n;s) \supset H^{\rm GOF}_1(\Delta_n;s)'.
\]
Let $\{\Phi_n\}_{n\ge1}$ be any sequence of asymptotic $\alpha$-level independence tests, where $\Phi_n = \Phi_n(X_1,\cdots,X_n)$. Then $\{\Phi_n\}_{n\ge1}$ can also be treated as a sequence of asymptotic $\alpha$-level goodness-of-fit tests with the null density being $p_0$. Moreover,
\[
{\rm power}\{\Phi_n;H^{\rm IND}_1(\Delta_n;s)\} \le {\rm power}\{\Phi_n;H^{\rm GOF}_1(\Delta_n;s)'\}.
\]
It remains to show that given $\liminf_{n\to\infty}n^{2s/(d+4s)}\Delta_n < \infty$, there exists some $\alpha\in(0,1)$ such that
\[
\liminf_{n\to\infty}{\rm power}\{\Phi_n;H^{\rm GOF}_1(\Delta_n;s)'\} < 1,
\]
which cannot be directly obtained from Theorem 9 because of the additional constraints
\[
p_1 = p_{1,0}, \qquad p_2 = p_{2,0} \tag{6.37}
\]
in $H^{\rm GOF}_1(\Delta_n;s)'$.
However, by modifying the proof of Theorem 9, we only need to further require each $p_{n,\boldsymbol{a}_n}$ in the proof of Theorem 9 to satisfy (6.37), or equivalently,
\[
\int_{\mathbb{R}^{d_2}}(p-p_0)(x^1,x^2)\,dx^2 = 0, \qquad \int_{\mathbb{R}^{d_1}}(p-p_0)(x^1,x^2)\,dx^1 = 0.
\]
Recall that each $p_{n,\boldsymbol{a}_n} = p_0 + r_n\sum_{k=1}^{B_n}a_{n,k}\phi_{n,k}$, where
\[
\phi_{n,k}(x) = \frac{b_n^{d/2}}{\|\phi\|_{L_2}}\,\phi(b_n x - x_{n,k}).
\]
Write $x_{n,k} = (x^1_{n,k},x^2_{n,k})\in\mathbb{R}^{d_1}\times\mathbb{R}^{d_2}$. Since $\phi$ can be decomposed as
\[
\phi(x^1,x^2) = \phi_1(x^1)\,\phi_2(x^2),
\]
we have
\[
\phi_{n,k}(x) = \frac{b_n^{d/2}}{\|\phi\|_{L_2}}\,\phi_1(b_n x^1 - x^1_{n,k})\,\phi_2(b_n x^2 - x^2_{n,k}).
\]
Hence
\[
\int_{\mathbb{R}^{d_2}}(p_{n,\boldsymbol{a}_n}-p_0)(x^1,x^2)\,dx^2 = r_n\sum_{k=1}^{B_n}a_{n,k}\int_{\mathbb{R}^{d_2}}\phi_{n,k}(x^1,x^2)\,dx^2
= r_n\sum_{k=1}^{B_n}a_{n,k}\,\frac{b_n^{d/2}}{\|\phi\|_{L_2}}\cdot\phi_1(b_n x^1 - x^1_{n,k})\cdot\frac{1}{b_n^{d_2}}\int_{\mathbb{R}^{d_2}}\phi_2(x^2)\,dx^2 = 0,
\]
since $\int_{\mathbb{R}^{d_2}}\phi_2(x^2)\,dx^2 = 0$. Similarly, $\int_{\mathbb{R}^{d_1}}(p_{n,\boldsymbol{a}_n}-p_0)(x^1,x^2)\,dx^1 = 0$. The proof is therefore finished.
Proof of Theorem 14. The proof of Theorem 14 consists of two steps. First, we bound $q^{\rm GOF}_{n,\alpha}$. To be more specific, we show that there exists $C = C(d) > 0$ such that
\[
q^{\rm GOF}_{n,\alpha} \le C(d)\log\log n
\]
for sufficiently large $n$, which holds if
\[
\lim_{n\to\infty}P\big(T^{\rm GOF(adapt)}_n \ge C(d)\log\log n\big) = 0 \tag{6.38}
\]
under $H^{\rm GOF}_0$. Second, we show that there exists $c > 0$ such that
\[
\liminf_{n\to\infty}\Delta_{n,s}\,(n/\log\log n)^{2s/(d+4s)} > c
\]
ensures
\[
\inf_{p\in H^{\rm GOF(adapt)}_1(\Delta_{n,s}:s\ge d/4)}P\big(T^{\rm GOF(adapt)}_n \ge C(d)\log\log n\big) \to 1 \tag{6.39}
\]
as $n\to\infty$.

Verifying (6.38). In order to prove (6.38), we first establish the following two lemmas. The first lemma states that $\hat{s}^2_{n,\nu_n}$ is a consistent estimator of $\mathbb{E}\bar{G}^2_{\nu_n}(X_1,X_2)$ uniformly over all $\nu_n\in[1,n^{2/d}]$. Recall that we have shown in the proof of Theorem 7 that for $\nu_n$ increasing at a proper rate,
\[
\hat{s}^2_{n,\nu_n}\big/\mathbb{E}\big[\bar{G}_{\nu_n}(X_1,X_2)\big]^2 \to_p 1.
\]
Hence the first lemma is a uniform version of this result.

Lemma 5. We have that $\hat{s}^2_{n,\nu_n}/\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2$ converges to 1 uniformly over $\nu_n\in[1,n^{2/d}]$, i.e.,
\[
\sup_{1\le\nu_n\le n^{2/d}}\big|\hat{s}^2_{n,\nu_n}\big/\mathbb{E}\big[\bar{G}_{\nu_n}(X_1,X_2)\big]^2 - 1\big| = o_p(1).
\]
We defer the proof of Lemma 5 to the appendix. Note that
\[
T^{\rm GOF(adapt)}_n = \sup_{1\le\nu_n\le n^{2/d}}\frac{n\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{P}_0)}{\sqrt{2\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2}}\cdot\sqrt{\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2\big/\hat{s}^2_{n,\nu_n}}
\le \sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{n\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{P}_0)}{\sqrt{2\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2}}\Bigg|\cdot\sup_{1\le\nu_n\le n^{2/d}}\sqrt{\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2\big/\hat{s}^2_{n,\nu_n}}.
\]
Lemma 5 first ensures that
\[
\sup_{1\le\nu_n\le n^{2/d}}\sqrt{\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2\big/\hat{s}^2_{n,\nu_n}} = 1 + o_p(1).
\]
It therefore suffices to show that under $H^{\rm GOF}_0$,
\[
\tilde{T}^{\rm GOF(adapt)}_n := \sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{n\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{P}_0)}{\sqrt{2\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2}}\Bigg|
\]
is also of order $\log\log n$. This is the crux of our argument, yet its proof is lengthy. For brevity, we shall state it as a lemma here and defer its proof to the appendix.

Lemma 6. There exists $C = C(d) > 0$ such that
\[
\lim_{n\to\infty}P\big(\tilde{T}^{\rm GOF(adapt)}_n \ge C\log\log n\big) = 0
\]
under $H^{\rm GOF}_0$.
Verifying (6.39). Let
\[
\nu_n(s)' = \left(\frac{\log\log n}{n}\right)^{-4/(4s+d)},
\]
which is smaller than $n^{2/d}$ for $s\ge d/4$. Hence it suffices to show
\[
\inf_{s\ge d/4}\inf_{p\in H^{\rm GOF}_1(\Delta_{n,s};s)}P\big(T^{\rm GOF}_{n,\nu_n(s)'} \ge C(d)\log\log n\big) \to 1
\]
as $n\to\infty$.

First of all, observe
\[
0 \le \mathbb{E}\big(s^2_{n,\nu_n(s)'}\big) \le \mathbb{E}G^2_{\nu_n(s)'}(X_1,X_2) \le M^2\big(2\nu_n(s)'/\pi\big)^{-d/2}
\]
and
\[
{\rm var}\big(s^2_{n,\nu_n(s)'}\big) \lesssim_d M^3 n^{-1}\big(\nu_n(s)'\big)^{-3d/4} + M^2 n^{-2}\big(\nu_n(s)'\big)^{-d/2}
\]
for any $s$ and $p\in H^{\rm GOF}_1(\Delta_{n,s},s)$. Further considering that $1/n^2 = o\big(M^2(2\nu_n(s)'/\pi)^{-d/2}\big)$ uniformly over all $s$, we obtain that
\[
\inf_{s\ge d/4}\inf_{p\in H^{\rm GOF}_1(\Delta_{n,s};s)}P\big(\hat{s}^2_{n,\nu_n(s)'} \le 2M^2(2\nu_n(s)'/\pi)^{-d/2}\big) \to 1.
\]
Let
\[
\Delta_{n,s} \ge c\big(\sqrt{M}+M\big)(\log\log n/n)^{2s/(d+4s)}
\]
for some sufficiently large $c = c(d)$. Then
\[
\mathbb{E}\hat{\gamma}^2_{\nu_n(s)'}(\mathbb{P},\mathbb{P}_0) = \gamma^2_{\nu_n(s)'}(\mathbb{P},\mathbb{P}_0) \ge \left(\frac{\pi}{\nu_n(s)'}\right)^{d/2}\cdot\frac{\|p-p_0\|^2_{L_2}}{4},
\]
as guaranteed by Lemma 8. Further considering that
\[
{\rm var}\big(\hat{\gamma}^2_{\nu_n(s)'}(\mathbb{P},\mathbb{P}_0)\big) \lesssim_d M^2 n^{-2}\big(\nu_n(s)'\big)^{-d/2} + M n^{-1}\big(\nu_n(s)'\big)^{-3d/4}\|p-p_0\|^2_{L_2},
\]
we immediately have
\[
\lim_{n\to\infty}\inf_{s\ge d/4}\inf_{p\in H^{\rm GOF}_1(\Delta_{n,s};s)}P\big(T^{\rm GOF}_{n,\nu_n(s)'} \ge C(d)\log\log n\big)
\ge \lim_{n\to\infty}\inf_{s\ge d/4}\inf_{p\in H^{\rm GOF}_1(\Delta_{n,s};s)}P\Bigg(\frac{n\gamma^2_{\nu_n(s)'}(\mathbb{P},\mathbb{P}_0)/2}{\sqrt{2\hat{s}^2_{n,\nu_n(s)'}}} \ge C(d)\log\log n\Bigg) = 1.
\]
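The adaptive statistic takes a supremum over a range of bandwidths. The sketch below (all names ours, with $d = 1$, $P_0 = N(0,1)$, a geometric grid over $[1, n^{2/d}]$, and the empirical second moment of $\bar{G}$ standing in for its population counterpart) illustrates the construction of such a sup-type goodness-of-fit statistic; for $P_0 = N(0,1)$ the centering terms of $\bar{G}$ are available in closed form.

```python
import numpy as np

def gbar(x, nu):
    # Gbar_nu(x_i, x_j) under P0 = N(0, 1):
    # G - E_Y G(x_i, Y) - E_X G(X, x_j) + E G, with the expectations in
    # closed form for the Gaussian kernel and Gaussian null.
    g = np.exp(-nu * (x[:, None] - x[None, :]) ** 2)
    mean_one = (1 + 2 * nu) ** -0.5 * np.exp(-nu * x ** 2 / (1 + 2 * nu))
    mean_all = (1 + 4 * nu) ** -0.5
    return g - mean_one[:, None] - mean_one[None, :] + mean_all

def adaptive_stat(x):
    # sup over a geometric bandwidth grid in [1, n^(2/d)] (here d = 1) of
    # |n * hat{gamma}^2_nu| / sqrt(2 * empirical second moment of Gbar).
    n, d = len(x), 1
    stats = []
    nu, ratio = 1.0, 1.5
    while nu <= n ** (2 / d):
        k = gbar(x, nu)
        off = k - np.diag(np.diag(k))
        u = off.sum() / (n - 1)                  # n * hat{gamma}^2_nu(P, P0)
        s2 = (off ** 2).sum() / (n * (n - 1))    # empirical E Gbar^2
        stats.append(abs(u) / np.sqrt(2 * s2))
        nu *= ratio
    return max(stats)

rng = np.random.default_rng(1)
t_null = adaptive_stat(rng.normal(size=400))           # data from N(0, 1)
t_alt = adaptive_stat(rng.normal(1.0, 1.0, size=400))  # mean-shifted alternative
assert t_alt > t_null
```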
Proof of Theorems 15 and 16. The proofs of Theorems 15 and 16 are very similar to that of Theorem 14. Hence we only emphasize the main differences here.

For the adaptive homogeneity test: to verify that there exists $C = C(d) > 0$ such that
\[
\lim_{n\to\infty}P\big(T^{\rm HOM(adapt)}_n \ge C\log\log n\big) = 0
\]
under $H^{\rm HOM}_0$, observe that
\[
T^{\rm HOM(adapt)}_n \le \sup_{1\le\nu_n\le n^{2/d}}\sqrt{\frac{\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2}{\hat{s}^2_{n,m,\nu_n}}}\cdot\left(\frac1n+\frac1m\right)^{-1}\sup_{1\le\nu_n\le n^{2/d}}\frac{|\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{Q})|}{\sqrt{2\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2}}.
\]
Denote $X_1,\cdots,X_n,Y_1,\cdots,Y_m$ by $Z_1,\cdots,Z_N$. Hence
\[
2\sum_{i=1}^n\sum_{j=1}^m G_{\nu_n}(X_i,Y_j) = \sum_{1\le i\ne j\le N}G_{\nu_n}(Z_i,Z_j) - \sum_{1\le i\ne j\le n}G_{\nu_n}(X_i,X_j) - \sum_{1\le i\ne j\le m}G_{\nu_n}(Y_i,Y_j)
\]
and
\[
\sup_{1\le\nu_n\le n^{2/d}}\frac{|\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{Q})|}{\sqrt{2\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2}}
\le \left(\frac{1}{n(n-1)}+\frac{1}{mn}\right)\sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{\sum_{1\le i\ne j\le n}\bar{G}_{\nu_n}(X_i,X_j)}{\sqrt{2\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2}}\Bigg|
+ \left(\frac{1}{m(m-1)}+\frac{1}{mn}\right)\sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{\sum_{1\le i\ne j\le m}\bar{G}_{\nu_n}(Y_i,Y_j)}{\sqrt{2\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2}}\Bigg|
+ \frac{1}{mn}\sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{\sum_{1\le i\ne j\le N}\bar{G}_{\nu_n}(Z_i,Z_j)}{\sqrt{2\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2}}\Bigg|.
\]
Applying Lemma 6 to bound each term on the right hand side of the above inequality, we conclude that for some $C = C(d) > 0$,
\[
\lim_{n\to\infty}P\Bigg(\left(\frac1n+\frac1m\right)^{-1}\sup_{1\le\nu_n\le n^{2/d}}\frac{|\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{Q})|}{\sqrt{2\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2}} \ge C\log\log n\Bigg) = 0.
\]
For the adaptive independence test: to verify that there exists $C = C(d) > 0$ such that
\[
\lim_{n\to\infty}P\big(T^{\rm IND(adapt)}_n \ge C\log\log n\big) = 0 \tag{6.40}
\]
under $H^{\rm IND}_0$, recall the decomposition
\[
\hat{\gamma}^2_{\nu_n}(\mathbb{P},\mathbb{P}^{X^1}\otimes\mathbb{P}^{X^2}) = D_2(\nu_n) + R_n = \frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}G^\ast_{\nu_n}(X_i,X_j) + R_n,
\]
where we express $R_n$ as $R_n = D_3(\nu_n) + D_4(\nu_n)$ in the proof of Theorem 13.

Following arguments similar to those in the proof of Lemma 6, we obtain that there exists $C(d) > 0$ such that for sufficiently large $n$,
\[
P\Bigg(\sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{nD_2(\nu_n)}{\sqrt{2\mathbb{E}[G^\ast_{\nu_n}(X_1,X_2)]^2}}\Bigg| \ge C(d)(\log\log n + t\log\log\log n)\Bigg) \lesssim \exp(-t^{2/3}).
\]
Similarly,
\[
P\Bigg(\sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{n^{3/2}D_3(\nu_n)}{\sqrt{2\mathbb{E}[G^\ast_{\nu_n}(X_1,X_2)]^2}}\Bigg| \ge C(d)(\log\log n + t\log\log\log n)\Bigg) \lesssim \exp(-t^{1/2}),
\]
\[
P\Bigg(\sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{n^{2}D_4(\nu_n)}{\sqrt{2\mathbb{E}[G^\ast_{\nu_n}(X_1,X_2)]^2}}\Bigg| \ge C(d)(\log\log n + t\log\log\log n)\Bigg) \lesssim \exp(-t^{2/5})
\]
for sufficiently large $n$.

On the other hand, note that
\[
\mathbb{E}[G^\ast_{\nu_n}(X_1,X_2)]^2 = \prod_{j=1}^2\mathbb{E}[\bar{G}_{\nu_n}(X^j_1,X^j_2)]^2,
\]
and based on results in the proof of Lemma 5, $\sup_{1\le\nu_n\le n^{2/d}}\big|s^2_{n,j,\nu_n}/\mathbb{E}[\bar{G}_{\nu_n}(X^j_1,X^j_2)]^2 - 1\big| = o_p(1)$ for $j=1,2$. Further considering that
\[
1/n^2 = o\big(\mathbb{E}[G^\ast_{\nu_n}(X_1,X_2)]^2\big)
\]
uniformly over all $\nu_n\in[1,n^{2/d}]$, we obtain
\[
\sup_{1\le\nu_n\le n^{2/d}}\big|\hat{s}^2_{n,\nu_n}\big/\mathbb{E}[G^\ast_{\nu_n}(X_1,X_2)]^2 - 1\big| = o_p(1).
\]
Combined, these ensure that (6.40) holds.

To show that the detection boundary of $\Phi^{\rm IND(adapt)}$ is of order $O\big((n/\log\log n)^{-2s/(d+4s)}\big)$, observe that
\[
0 \le \mathbb{E}\big(s^2_{n,j,\nu_n(s)'}\big) \le \mathbb{E}G^2_{\nu_n(s)'}(X^j_1,X^j_2) \le M_j^2\big(2\nu_n(s)'/\pi\big)^{-d_j/2}
\]
and
\[
{\rm var}\big(s^2_{n,j,\nu_n(s)'}\big) \lesssim_{d_j} M_j^3 n^{-1}\big(\nu_n(s)'\big)^{-3d_j/4} + M_j^2 n^{-2}\big(\nu_n(s)'\big)^{-d_j/2}
\]
for $j=1,2$, where $\nu_n(s)' = (\log\log n/n)^{-4/(4s+d)}$ as in the proof of Theorem 14. Therefore,
\[
\inf_{s\ge d/4}\inf_{p\in H^{\rm IND}_1(\Delta_{n,s};s)}P\Big(\big|s^2_{n,j,\nu_n(s)'}\big| \le \sqrt{3/2}\,M_j^2\big(2\nu_n(s)'/\pi\big)^{-d_j/2}\Big) \to 1, \qquad j=1,2.
\]
Further considering $1/n^2 = o\big(M^2(2\nu_n(s)'/\pi)^{-d/2}\big)$ uniformly over all $s$, we obtain that
\[
\inf_{s\ge d/4}\inf_{p\in H^{\rm IND}_1(\Delta_{n,s};s)}P\big(\hat{s}^2_{n,\nu_n(s)'} \le 2M^2(2\nu_n(s)'/\pi)^{-d/2}\big) \to 1.
\]
References
L. Addario-Berry, N. Broutin, L. Devroye, and G. Lugosi (2010). "On combinatorial testing problems". In: The Annals of Statistics 38.5, pp. 3063–3092.

N. Ailon, M. Charikar, and A. Newman (2008). "Aggregating inconsistent information: ranking and clustering". In: Journal of the ACM 55.5, 23:1–23:27.

M. A. Arcones and E. Giné (1993). "Limit Theorems for U-Processes". In: The Annals of Probability 21.3, pp. 1494–1542.

Y. Baraud (2002). "Non-asymptotic minimax rates of testing in signal detection". In: Bernoulli 8.5, pp. 577–606.

M. V. Burnashev (1979). "On the minimax detection of an inaccurately known signal in a white Gaussian noise background". In: Theory of Probability & Its Applications 24.1, pp. 107–119.

N. Dunford and J. T. Schwartz (1963). Linear Operators, Part II: Spectral Theory: Self Adjoint Operators in Hilbert Space. Interscience Publishers.

M. S. Ermakov (1991). "Minimax detection of a signal in a Gaussian white noise". In: Theory of Probability & Its Applications 35.4, pp. 667–679.

M. Fromont and B. Laurent (2006). "Adaptive goodness-of-fit tests in a density model". In: The Annals of Statistics 34.2, pp. 680–720.

M. Fromont, B. Laurent, M. Lerasle, and P. Reynaud-Bouret (2012). "Kernels based tests with non-asymptotic bootstrap approaches for two-sample problem". In: JMLR: Workshop and Conference Proceedings. Vol. 23, pp. 23-1.

M. Fromont, B. Laurent, and P. Reynaud-Bouret (2013). "The two-sample problem for Poisson processes: Adaptive tests with a nonasymptotic wild bootstrap approach". In: The Annals of Statistics 41.3, pp. 1431–1461.

K. Fukumizu, A. Gretton, G. R. Lanckriet, B. Schölkopf, and B. K. Sriperumbudur (2009). "Kernel choice and classifiability for RKHS embeddings of probability distributions". In: Advances in Neural Information Processing Systems, pp. 1750–1758.

G. G. Gregory (1977). "Large sample theory for U-statistics and tests of fit". In: The Annals of Statistics 5.1, pp. 110–123.

A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola (2012a). "A kernel two-sample test". In: Journal of Machine Learning Research 13.Mar, pp. 723–773.

A. Gretton, O. Bousquet, A. J. Smola, and B. Schölkopf (2005). "Measuring statistical dependence with Hilbert-Schmidt norms". In: International Conference on Algorithmic Learning Theory. Springer, pp. 63–77.

A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola (2008). "A kernel statistical test of independence". In: Advances in Neural Information Processing Systems, pp. 585–592.

A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, and B. K. Sriperumbudur (2012b). "Optimal kernel choice for large-scale two-sample tests". In: Advances in Neural Information Processing Systems, pp. 1205–1213.

E. Haeusler (1988). "On the rate of convergence in the central limit theorem for martingales with discrete and continuous time". In: The Annals of Probability 16.1, pp. 275–299.

P. Hall (1984). "Central limit theorem for integrated square error of multivariate nonparametric density estimators". In: Journal of Multivariate Analysis 14.1, pp. 1–16.

Z. Harchaoui, F. Bach, and E. Moulines (2007). "Testing for homogeneity with kernel Fisher discriminant analysis". In: Advances in Neural Information Processing Systems, pp. 609–616.

Y. I. Ingster (1987). "Minimax testing of nonparametric hypotheses on a distribution density in the L_p metrics". In: Theory of Probability & Its Applications 31.2, pp. 333–337.

— (1993). "Asymptotically minimax hypothesis testing for nonparametric alternatives. I, II, III". In: Mathematical Methods of Statistics 2.2, pp. 85–114.

— (2000). "Adaptive chi-square tests". In: Journal of Mathematical Sciences 99.2, pp. 1110–1119.

Y. I. Ingster and I. A. Suslina (2000). "Minimax nonparametric hypothesis testing for ellipsoids and Besov bodies". In: ESAIM: Probability and Statistics 4, pp. 53–135.

— (2003). Nonparametric Goodness-of-Fit Testing under Gaussian Models. New York, NY: Springer.

P. E. Jupp (2005). "Sobolev tests of goodness of fit of distributions on compact Riemannian manifolds". In: The Annals of Statistics 33.6, pp. 2957–2966.

E. L. Lehmann and J. P. Romano (2008). Testing Statistical Hypotheses. New York, NY: Springer Science & Business Media.

O. V. Lepski and V. G. Spokoiny (1999). "Minimax nonparametric hypothesis testing: the case of an inhomogeneous alternative". In: Bernoulli 5.2, pp. 333–358.

R. Lyons (2013). "Distance covariance in metric spaces". In: The Annals of Probability 41.5, pp. 3284–3305.

J. M. Mooij, J. Peters, D. Janzing, J. Zscheischler, and B. Schölkopf (2016). "Distinguishing cause from effect using observational data: methods and benchmarks". In: The Journal of Machine Learning Research 17.1, pp. 1103–1204.

K. Muandet, K. Fukumizu, B. K. Sriperumbudur, and B. Schölkopf (2017). "Kernel mean embedding of distributions: a review and beyond". In: Foundations and Trends® in Machine Learning 10.1-2, pp. 1–141.

J. Peters, J. M. Mooij, D. Janzing, and B. Schölkopf (2014). "Causal discovery with continuous additive noise models". In: The Journal of Machine Learning Research 15.1, pp. 2009–2053.

N. Pfister, P. Bühlmann, B. Schölkopf, and J. Peters (2018). "Kernel-based tests for joint independence". In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80.1, pp. 5–31.

D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu (2013). "Equivalence of distance-based and RKHS-based statistics in hypothesis testing". In: The Annals of Statistics 41.5, pp. 2263–2291.

R. J. Serfling (2009). Approximation Theorems of Mathematical Statistics. New York, NY: John Wiley & Sons.

V. G. Spokoiny (1996). "Adaptive hypothesis testing using wavelets". In: The Annals of Statistics 24.6, pp. 2477–2498.

B. Sriperumbudur, K. Fukumizu, A. Gretton, G. Lanckriet, and B. Schölkopf (2009). "Kernel choice and classifiability for RKHS embeddings of probability distributions". In: Advances in Neural Information Processing Systems 22, pp. 1750–1758.

B. K. Sriperumbudur, K. Fukumizu, and G. R. Lanckriet (2011). "Universality, characteristic kernels and RKHS embedding of measures". In: Journal of Machine Learning Research 12.Jul, pp. 2389–2410.

B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. Lanckriet (2010). "Hilbert space embeddings and metrics on probability measures". In: Journal of Machine Learning Research 11.Apr, pp. 1517–1561.

I. Steinwart and A. Christmann (2008). Support Vector Machines. Springer Science & Business Media.

I. Steinwart (2001). "On the influence of the kernel on the consistency of support vector machines". In: Journal of Machine Learning Research 2.Nov, pp. 67–93.

D. J. Sutherland, H.-Y. Tung, H. Strathmann, S. De, A. Ramdas, A. Smola, and A. Gretton (2017). "Generative models and model criticism via optimized maximum mean discrepancy". In: International Conference on Learning Representations.

G. J. Székely and M. L. Rizzo (2009). "Brownian distance covariance". In: The Annals of Applied Statistics 3.4, pp. 1236–1265.

G. J. Székely, M. L. Rizzo, and N. K. Bakirov (2007). "Measuring and testing dependence by correlation of distances". In: The Annals of Statistics 35.6, pp. 2769–2794.

M. Talagrand (2014). Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems. Springer Science & Business Media.

A. B. Tsybakov (2008). Introduction to Nonparametric Estimation. New York, NY: Springer Science & Business Media.
Appendix A: Some Technical Results and Proofs Related to Chapter 2
Proof of Lemma 3. We have
\[
G^2(x,x') = \sum_{k,l}\mu_k\mu_l\,\varphi_k(x)\varphi_l(x)\varphi_k(x')\varphi_l(x').
\]
Thus
\[
\int g(x)g(x')G^2(x,x')\,dP(x)dP(x') = \sum_{k,l}\mu_k\mu_l\left(\int g(x)\varphi_k(x)\varphi_l(x)\,dP(x)\right)^2
\le \mu_1\sum_k\mu_k\sum_l\left(\int g(x)\varphi_k(x)\varphi_l(x)\,dP(x)\right)^2
\le \mu_1\left(\sum_k\mu_k\int g^2(x)\varphi_k^2(x)\,dP(x)\right)
\le \mu_1\left(\sum_k\mu_k\right)\left(\sup_k\|\varphi_k\|_\infty\right)^2\|g\|^2_{L_2(P)}.
\]
Proof of Lemma 4. For brevity, write
\[
l_K^2 = \sum_{k=1}^K\frac{a_k^2}{\lambda_k}.
\]
By definition, it suffices to show that for all $R > 0$ there exists $f_R\in\mathcal{H}(K)$ such that $\|f_R\|^2_K \le R^2$ and $\|u - f_R\|^2_{L_2(P_0)} \le M^2 R^{-2/\theta}$.

To this end, let $K$ be such that $l_K^2 \le R^2 \le l_{K+1}^2$, and set
\[
f_R = \sum_{k=1}^K a_k\varphi_k + a^\ast_{K+1}(R)\,\varphi_{K+1},
\]
where
\[
a^\ast_{K+1}(R) = {\rm sgn}(a_{K+1})\sqrt{\lambda_{K+1}\big(R^2 - l_K^2\big)}.
\]
Clearly,
\[
\|f_R\|^2_K = \sum_{k=1}^K\frac{a_k^2}{\lambda_k} + \frac{\big(a^\ast_{K+1}(R)\big)^2}{\lambda_{K+1}} = R^2,
\]
and
\[
\|u - f_R\|^2_{L_2(P_0)} = \sum_{k>K+1}a_k^2 + \Big(|a_{K+1}| - \sqrt{\lambda_{K+1}\big(R^2 - l_K^2\big)}\Big)^2 \le \sum_{k\ge K+1}a_k^2.
\]
To ensure $u\in\mathcal{F}(\theta,M)$, it suffices to have
\[
\sup_{l_K^2\le R^2\le l_{K+1}^2}\|u-f_R\|^2_{L_2(P_0)}\,R^{2/\theta} \le M^2, \qquad \forall\; K\ge 0,
\]
which concludes the proof.
Appendix B: Some Technical Results and Proofs Related to Chapter 3
B.1 Properties of Gaussian Kernel
We collect here a couple of useful properties of the Gaussian kernel that are used repeatedly in the proofs of the main results.
Lemma 7. For any $f\in L_2(\mathbb{R}^d)$,
\[
\int G_\nu(x,y)f(x)f(y)\,dxdy = \left(\frac{\pi}{\nu}\right)^{\frac d2}\int\exp\left(-\frac{\|\omega\|^2}{4\nu}\right)|\mathcal{F}f(\omega)|^2\,d\omega.
\]

Proof. Denote by $Z$ a Gaussian random vector with mean 0 and covariance matrix $2\nu I_d$. Then
\[
\int G_\nu(x,y)f(x)f(y)\,dxdy = \int\exp\big(-\nu\|x-y\|^2\big)f(x)f(y)\,dxdy
= \int\mathbb{E}\exp\big[iZ^\top(x-y)\big]f(x)f(y)\,dxdy
= \mathbb{E}\left|\int\exp\big(-iZ^\top x\big)f(x)\,dx\right|^2
\]
\[
= \int\frac{1}{(4\pi\nu)^{d/2}}\exp\left(-\frac{\|\omega\|^2}{4\nu}\right)\left|\int\exp\big(-i\omega^\top x\big)f(x)\,dx\right|^2 d\omega
= \left(\frac{\pi}{\nu}\right)^{\frac d2}\int\exp\left(-\frac{\|\omega\|^2}{4\nu}\right)|\mathcal{F}f(\omega)|^2\,d\omega,
\]
which concludes the proof.
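The identity of Lemma 7 can be verified numerically. The sketch below (our code; $d = 1$, $f$ the standard normal density, and the unitary Fourier transform convention — choices not taken from the thesis) computes both sides by quadrature and compares them with the common closed form $(1+4\nu)^{-1/2}$, which both sides equal for this particular $f$.

```python
import numpy as np

# Quadrature check of Lemma 7 in d = 1 for f the N(0, 1) density.
# LHS: X - Y ~ N(0, 2) gives E exp(-nu (X - Y)^2) = (1 + 4 nu)^(-1/2).
# RHS: F f(w) = (2 pi)^(-1/2) exp(-w^2 / 2) under the unitary transform,
# and the Gaussian integral evaluates to the same constant.

GRID = np.linspace(-8.0, 8.0, 1601)
H = GRID[1] - GRID[0]

def lhs(nu):
    f = np.exp(-GRID ** 2 / 2) / np.sqrt(2 * np.pi)
    g = np.exp(-nu * (GRID[:, None] - GRID[None, :]) ** 2)
    return float(f @ g @ f) * H * H

def rhs(nu):
    ff2 = np.exp(-GRID ** 2) / (2 * np.pi)   # |F f(w)|^2
    return float(np.sqrt(np.pi / nu) *
                 np.sum(np.exp(-GRID ** 2 / (4 * nu)) * ff2) * H)

for nu in (0.5, 1.0, 4.0):
    closed_form = (1 + 4 * nu) ** -0.5
    assert abs(lhs(nu) - closed_form) < 1e-6
    assert abs(rhs(nu) - closed_form) < 1e-6
```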
A useful consequence of Lemma 7 is a close connection between the Gaussian kernel MMD and the $L_2$ norm.
Lemma 8. For any $f\in\mathcal{W}^{s,2}(M)$,
\[
\left(\frac{\nu}{\pi}\right)^{d/2}\int G_\nu(x,y)f(x)f(y)\,dxdy \ge \frac14\|f\|^2_{L_2},
\]
provided that
\[
\nu^s \ge \frac{4^{1-s}M^2}{(\log 3)^s}\cdot\|f\|^{-2}_{L_2}.
\]

Proof. In light of Lemma 7,
\[
\left(\frac{\nu}{\pi}\right)^{d/2}\int G_\nu(x,y)f(x)f(y)\,dxdy = \int\exp\left(-\frac{\|\omega\|^2}{4\nu}\right)|\mathcal{F}f(\omega)|^2\,d\omega.
\]
By the Plancherel theorem, for any $T>0$,
\[
\int_{\|\omega\|\le T}|\mathcal{F}f(\omega)|^2\,d\omega = \|f\|^2_{L_2} - \int_{\|\omega\|>T}|\mathcal{F}f(\omega)|^2\,d\omega \ge \|f\|^2_{L_2} - \frac{M^2}{T^{2s}}.
\]
Choosing
\[
T = \left(\frac{2M}{\|f\|_{L_2}}\right)^{1/s}
\]
yields
\[
\int_{\|\omega\|\le T}|\mathcal{F}f(\omega)|^2\,d\omega \ge \frac34\|f\|^2_{L_2}.
\]
Hence
\[
\int\exp\left(-\frac{\|\omega\|^2}{4\nu}\right)|\mathcal{F}f(\omega)|^2\,d\omega \ge \exp\left(-\frac{T^2}{4\nu}\right)\int_{\|\omega\|\le T}|\mathcal{F}f(\omega)|^2\,d\omega \ge \frac34\exp\left(-\frac{T^2}{4\nu}\right)\|f\|^2_{L_2}.
\]
In particular, if
\[
\nu \ge \frac{(2M)^{2/s}}{4\log 3}\cdot\|f\|^{-2/s}_{L_2},
\]
then
\[
\int\exp\left(-\frac{\|\omega\|^2}{4\nu}\right)|\mathcal{F}f(\omega)|^2\,d\omega \ge \frac14\|f\|^2_{L_2},
\]
which concludes the proof.
B.2 Proof of Lemma 5
We first prove that $\sup_{1\le\nu_n\le n^{2/d}}\big|s^2_{n,\nu_n}/\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2 - 1\big| = o_p(1)$ and then show that the difference caused by the modification from $s^2_{n,\nu_n}$ to $\hat{s}^2_{n,\nu_n}$ is asymptotically negligible.

Note that
\[
\sup_{1\le\nu_n\le n^{2/d}}\big|s^2_{n,\nu_n}\big/\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2 - 1\big|
\le \Big(\inf_{1\le\nu_n\le n^{2/d}}\nu_n^{d/2}\,\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2\Big)^{-1}\cdot\sup_{1\le\nu_n\le n^{2/d}}\nu_n^{d/2}\big|s^2_{n,\nu_n} - \mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2\big|.
\]
For $X\sim\mathbb{P}_0$, denote the distribution of $(X,X)$ by $\mathbb{P}_1$. Then we have
\[
\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2 = \gamma^2_{\nu_n}(\mathbb{P}_1,\mathbb{P}_0\otimes\mathbb{P}_0).
\]
Hence $\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2 > 0$ for any $\nu_n > 0$ since $G_{\nu_n}$ is characteristic. In addition, $\nu_n^{d/2}\,\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2$ is continuous with respect to $\nu_n$ and
\[
\lim_{\nu_n\to\infty}\nu_n^{d/2}\,\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2 = \left(\frac{\pi}{2}\right)^{d/2}\|p_0\|^2_{L_2}.
\]
Therefore,
\[
\inf_{1\le\nu_n\le n^{2/d}}\nu_n^{d/2}\,\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2 \ge \inf_{\nu_n\in[1,\infty)}\nu_n^{d/2}\,\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2 > 0,
\]
and it remains to prove
\[
\sup_{1\le\nu_n\le n^{2/d}}\nu_n^{d/2}\big|s^2_{n,\nu_n} - \mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2\big| = o_p(1).
\]
Recall the expression of $s^2_{n,\nu_n}$. It suffices to show that
\[
\sup_{1\le\nu_n\le n^{2/d}}\nu_n^{d/2}\Bigg|\frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}G^2_{\nu_n}(X_i,X_j) - \mathbb{E}G^2_{\nu_n}(X_1,X_2)\Bigg|, \tag{B.1}
\]
\[
\sup_{1\le\nu_n\le n^{2/d}}\nu_n^{d/2}\Bigg|\frac{2(n-3)!}{n!}\sum_{\substack{1\le i,j_1,j_2\le n\\ |\{i,j_1,j_2\}|=3}}G_{\nu_n}(X_i,X_{j_1})G_{\nu_n}(X_i,X_{j_2}) - \mathbb{E}G_{\nu_n}(X_1,X_2)G_{\nu_n}(X_1,X_3)\Bigg|, \tag{B.2}
\]
\[
\sup_{1\le\nu_n\le n^{2/d}}\nu_n^{d/2}\Bigg|\frac{(n-4)!}{n!}\sum_{\substack{1\le i_1,i_2,j_1,j_2\le n\\ |\{i_1,i_2,j_1,j_2\}|=4}}G_{\nu_n}(X_{i_1},X_{j_1})G_{\nu_n}(X_{i_2},X_{j_2}) - \big[\mathbb{E}G_{\nu_n}(X_1,X_2)\big]^2\Bigg| \tag{B.3}
\]
are all $o_p(1)$. We shall first control (B.1) and then bound (B.2) and (B.3) in the same way.

Let
\[
\mathbb{E}_n G^2_{\nu_n}(X,X') = \frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}G^2_{\nu_n}(X_i,X_j).
\]
In the rest of this proof, abbreviate $\mathbb{E}_n G^2_{\nu_n}(X,X')$ and $\mathbb{E}G^2_{\nu_n}(X_1,X_2)$ as $\mathbb{E}_n G^2_{\nu_n}$ and $\mathbb{E}G^2_{\nu_n}$ respectively when no confusion occurs.

Divide the whole interval $[1,n^{2/d}]$ into $A$ sub-intervals $[u_0,u_1],[u_1,u_2],\cdots,[u_{A-1},u_A]$ with $u_0 = 1$, $u_A = n^{2/d}$. For any $\nu_n\in[u_{a-1},u_a]$,
\[
\nu_n^{d/2}\mathbb{E}_n G^2_{\nu_n} - \nu_n^{d/2}\mathbb{E}G^2_{\nu_n} \ge -\nu_n^{d/2}\big|\mathbb{E}_n G^2_{u_a} - \mathbb{E}G^2_{u_a}\big| - \nu_n^{d/2}\big|\mathbb{E}G^2_{u_a} - \mathbb{E}G^2_{u_{a-1}}\big|
\ge -u_a^{d/2}\big|\mathbb{E}_n G^2_{u_a} - \mathbb{E}G^2_{u_a}\big| - u_a^{d/2}\big|\mathbb{E}G^2_{u_a} - \mathbb{E}G^2_{u_{a-1}}\big|
\]
and
\[
\nu_n^{d/2}\mathbb{E}_n G^2_{\nu_n} - \nu_n^{d/2}\mathbb{E}G^2_{\nu_n} \le u_a^{d/2}\big|\mathbb{E}_n G^2_{u_{a-1}} - \mathbb{E}G^2_{u_{a-1}}\big| + u_a^{d/2}\big|\mathbb{E}G^2_{u_a} - \mathbb{E}G^2_{u_{a-1}}\big|,
\]
which together ensure that
\[
\sup_{1\le\nu_n\le n^{2/d}}\big|\nu_n^{d/2}\mathbb{E}_n G^2_{\nu_n} - \nu_n^{d/2}\mathbb{E}G^2_{\nu_n}\big|
\le \sup_{1\le a\le A}\left(\frac{u_a}{u_{a-1}}\right)^{d/2}\cdot\sup_{0\le a\le A}u_a^{d/2}\big|\mathbb{E}_n G^2_{u_a} - \mathbb{E}G^2_{u_a}\big| + \sup_{1\le a\le A}u_a^{d/2}\big|\mathbb{E}G^2_{u_a} - \mathbb{E}G^2_{u_{a-1}}\big|
\]
\[
\le \sup_{1\le a\le A}\left(\frac{u_a}{u_{a-1}}\right)^{d/2}\cdot\sup_{0\le a\le A}u_a^{d/2}\big|\mathbb{E}_n G^2_{u_a} - \mathbb{E}G^2_{u_a}\big| + \sup_{1\le a\le A}\big|u_a^{d/2}\mathbb{E}G^2_{u_a} - u_{a-1}^{d/2}\mathbb{E}G^2_{u_{a-1}}\big| + \sup_{1\le a\le A}\Big(\big(u_a^{d/2} - u_{a-1}^{d/2}\big)\mathbb{E}G^2_{u_{a-1}}\Big).
\]
We bound the three terms on the right hand side of the last inequality separately.

Let $\{u_a\}_{a\ge0}$ be a geometric sequence, namely,
\[
A := \inf\{a\in\mathbb{N} : r^a \ge n^{2/d}\}, \qquad
u_a = \begin{cases} r^a, & 0\le a\le A-1,\\ n^{2/d}, & a = A,\end{cases}
\]
with $r>1$ to be determined later.

Since $\lim_{\nu\to\infty}\nu^{d/2}\mathbb{E}G^2_\nu = (\pi/2)^{d/2}\|p_0\|^2$ and $\nu^{d/2}\mathbb{E}G^2_\nu$ is continuous, we obtain that for any $\varepsilon>0$ there exists sufficiently small $r>1$ such that
\[
\sup_{1\le a\le A}\big|u_a^{d/2}\mathbb{E}G^2_{u_a} - u_{a-1}^{d/2}\mathbb{E}G^2_{u_{a-1}}\big| \le \varepsilon.
\]
At the same time, we can also ensure
\[
\sup_{1\le a\le A}\Big(\big(u_a^{d/2} - u_{a-1}^{d/2}\big)\mathbb{E}G^2_{u_{a-1}}\Big) \le \big(r^{d/2}-1\big)\left(\frac{\pi}{2}\right)^{d/2}\|p_0\|^2 \le \varepsilon
\]
by choosing $r$ sufficiently close to 1.
Finally, consider
\[
\sup_{1\le a\le A}\left(\frac{u_a}{u_{a-1}}\right)^{d/2}\cdot\sup_{0\le a\le A}u_a^{d/2}\big|\mathbb{E}_n G^2_{u_a} - \mathbb{E}G^2_{u_a}\big|.
\]
On the one hand,
\[
\sup_{1\le a\le A}\left(\frac{u_a}{u_{a-1}}\right)^{d/2} \le r^{d/2}.
\]
On the other hand, since
\[
{\rm var}\big(\mathbb{E}_n G^2_{\nu_n}\big) \lesssim \frac1n\mathbb{E}G^2_{\nu_n}(X,X')G^2_{\nu_n}(X,X'') + \frac{1}{n^2}\mathbb{E}G^4_{\nu_n}(X,X')
\lesssim_d \frac{\nu_n^{-3d/4}\|p_0\|^3}{n} + \frac{\nu_n^{-d/2}\|p_0\|^2}{n^2}
\]
for any $\nu_n\in(0,\infty)$, we have
\[
P\Big(\sup_{0\le a\le A}u_a^{d/2}\big|\mathbb{E}_n G^2_{u_a} - \mathbb{E}G^2_{u_a}\big| \ge \varepsilon\Big)
\le \sum_{a=0}^A\frac{u_a^d\,{\rm var}\big(\mathbb{E}_n G^2_{u_a}\big)}{\varepsilon^2}
\lesssim_{d,r}\frac{1}{\varepsilon^2}\left(\frac{u_A^{d/4}\|p_0\|^3}{n} + \frac{u_A^{d/2}\|p_0\|^2}{n^2}\right) \to 0
\]
as $n\to\infty$. Hence we conclude $\sup_{1\le\nu_n\le n^{2/d}}\big|\nu_n^{d/2}\mathbb{E}_n G^2_{\nu_n} - \nu_n^{d/2}\mathbb{E}G^2_{\nu_n}\big| = o_p(1)$.

Considering that
\[
\lim_{\nu_n\to\infty}\nu_n^{d/2}\,\mathbb{E}G_{\nu_n}(X_1,X_2)G_{\nu_n}(X_1,X_3) = 0, \qquad \lim_{\nu_n\to\infty}\nu_n^{d/2}\big[\mathbb{E}G_{\nu_n}(X_1,X_2)\big]^2 = 0,
\]
we obtain that (B.2) and (B.3) are also $o_p(1)$, based on almost the same arguments. Hence
\[
\sup_{1\le\nu_n\le n^{2/d}}\big|s^2_{n,\nu_n}\big/\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2 - 1\big| = o_p(1).
\]
On the other hand, since $\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2 \gtrsim_{p_0,d}\nu_n^{-d/2}$ for $\nu_n\in[1,n^{2/d}]$,
\[
\sup_{1\le\nu_n\le n^{2/d}}\frac{1}{n^2\,\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2} = o(1).
\]
Hence we finally conclude that
\[
\sup_{1\le\nu_n\le n^{2/d}}\big|\hat{s}^2_{n,\nu_n}\big/\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2 - 1\big| = o_p(1).
\]
B.3 Proof of Lemma 6
Let
\[
K_{\nu_n}(x,x') = \frac{G_{\nu_n}(x,x')}{\sqrt{2\mathbb{E}G^2_{\nu_n}(X_1,X_2)}}, \qquad \forall\; x,x'\in\mathbb{R}^d,
\]
and accordingly,
\[
\bar{K}_{\nu_n}(x,x') = \frac{\bar{G}_{\nu_n}(x,x')}{\sqrt{2\mathbb{E}G^2_{\nu_n}(X_1,X_2)}}.
\]
Hence
\[
\tilde{T}^{\rm GOF(adapt)}_n = \sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{1}{n-1}\sum_{i\ne j}\bar{K}_{\nu_n}(X_i,X_j)\cdot\sqrt{\frac{\mathbb{E}G^2_{\nu_n}(X_1,X_2)}{\mathbb{E}[\bar{G}_{\nu_n}(X_1,X_2)]^2}}\Bigg|.
\]
To finish this proof, we first bound
\[
\sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{1}{n-1}\sum_{i\ne j}\bar{K}_{\nu_n}(X_i,X_j)\Bigg| \tag{B.4}
\]
and then control $\tilde{T}^{\rm GOF(adapt)}_n$.

Step (i). There are two main tools that we use in this step. First, we apply results in Arcones and Giné (1993) to obtain a Bernstein-type inequality for
\[
\Bigg|\frac{1}{n-1}\sum_{i\ne j}\bar{K}_{\nu_0}(X_i,X_j)\Bigg| \quad\text{and}\quad \Bigg|\frac{1}{n-1}\sum_{i\ne j}\big(\bar{K}_{\nu_n}(X_i,X_j) - \bar{K}_{\nu'_n}(X_i,X_j)\big)\Bigg|
\]
for some $\nu_0$ and arbitrary $\nu_n,\nu'_n\in[1,\infty)$. Building on these, we borrow Talagrand's techniques for handling Bernstein-type inequalities (e.g., see Talagrand, 2014) to give a generic chaining bound on (B.4).
To be more specific, for any $\nu_0,\nu_n,\nu'_n\in[1,n^{2/d}]$, define
\[
d_1(\nu_n,\nu'_n) = \|\bar{K}_{\nu'_n} - \bar{K}_{\nu_n}\|_{L_\infty}, \qquad d_2(\nu_n,\nu'_n) = \|\bar{K}_{\nu'_n} - \bar{K}_{\nu_n}\|_{L_2}.
\]
Then Proposition 2.3(c) of Arcones and Giné (1993) ensures that for any $t>0$,
\[
P\Bigg(\Bigg|\frac{1}{n-1}\sum_{i\ne j}\bar{K}_{\nu_0}(X_i,X_j)\Bigg| \ge t\Bigg) \le C\exp\Bigg(-C\min\Bigg\{\frac{t}{\|\bar{K}_{\nu_0}\|_{L_2}},\ \Big(\frac{\sqrt{n}\,t}{\|\bar{K}_{\nu_0}\|_{L_\infty}}\Big)^{\frac23}\Bigg\}\Bigg) \tag{B.5}
\]
and
\[
P\Bigg(\Bigg|\frac{1}{n-1}\sum_{i\ne j}\big(\bar{K}_{\nu_n}(X_i,X_j) - \bar{K}_{\nu'_n}(X_i,X_j)\big)\Bigg| \ge t\Bigg) \le C\exp\Bigg(-C\min\Bigg\{\frac{t}{d_2(\nu_n,\nu'_n)},\ \Big(\frac{\sqrt{n}\,t}{d_1(\nu_n,\nu'_n)}\Big)^{\frac23}\Bigg\}\Bigg)
\]
for some $C>0$. Based on a chaining type argument (see, e.g., Theorem 2.2.28 in Talagrand, 2014), the latter inequality implies that there exists $C>0$ such that
\[
P\Bigg(\sup_{1\le\nu_n\le n^{2/d}}\Bigg|\frac{1}{n-1}\sum_{i\ne j}\big(\bar{K}_{\nu_n}(X_i,X_j) - \bar{K}_{\nu_0}(X_i,X_j)\big)\Bigg| \ge C\Big(\frac{\gamma_{2/3}([1,n^{2/d}],d_1)}{\sqrt{n}}\,t + \gamma_1([1,n^{2/d}],d_2) + D_2 t\Big)\Bigg) \lesssim \exp(-t^{2/3}), \tag{B.6}
\]
where $\gamma_{2/3}([1,n^{2/d}],d_1)$ and $\gamma_1([1,n^{2/d}],d_2)$ are the so-called $\gamma$-functionals and
\[
D_2 = \sum_{l\ge0}e_l([1,n^{2/d}],d_2)
\]
with $e_l$ being the so-called entropy numbers.
111
A straightforward combination of (B.5) and (B.6) then gives
$$P\left( \sup_{1\le\nu_n\le n^{2/d}} \left| \frac{1}{n-1}\sum_{i\ne j} \bar K_{\nu_n}(X_i,X_j) \right| \ge C\left( \frac{\gamma_{2/3}([1,n^{2/d}],d_1)}{\sqrt n}\,t + \gamma_1([1,n^{2/d}],d_2) + D_2\, t + \frac{\|\bar K_{\nu_0}\|_{L^\infty}}{\sqrt n} + \|\bar K_{\nu_0}\|_{L^2}\, t \right) \right) \lesssim \exp(-t^{2/3}).$$
Therefore, given that bounds on $\|\bar K_{\nu_0}\|_{L^2}$ and $\|\bar K_{\nu_0}\|_{L^\infty}$ can be obtained quite directly, e.g., with $\nu_0 = 1$,
$$\|\bar K_{\nu_0}\|_{L^\infty} \le 4\,\|K_{\nu_0}\|_{L^\infty} = \frac{4}{\sqrt{2\,\mathbb{E}G^2_1}}, \qquad \|\bar K_{\nu_0}\|_{L^2} \le \|K_{\nu_0}\|_{L^2} = \frac{\sqrt 2}{2},$$
the main task is to bound $\gamma_{2/3}([1,n^{2/d}],d_1)$, $\gamma_1([1,n^{2/d}],d_2)$ and $D_2$ properly.
First consider $\gamma_{2/3}([1,n^{2/d}],d_1)$. Note that for any $1 \le \nu_n < \nu'_n < \infty$,
$$d_1(\nu_n,\nu'_n) \le 4\,\|K_{\nu_n} - K_{\nu'_n}\|_{L^\infty} \le 4 \int_{\nu_n}^{\nu'_n} \left\| \frac{dK_u}{du} \right\|_{L^\infty} du.$$
Since for any $\nu_n$,
$$\frac{dK_{\nu_n}}{d\nu_n} = \bigl(-\|x-x'\|^2\bigr)\, G_{\nu_n}(x,x') \bigl(\mathbb{E}G^2_{\nu_n}(X_1,X_2)\bigr)^{-1/2} - \frac12\, G_{\nu_n}(x,x') \bigl(\mathbb{E}G^2_{\nu_n}(X_1,X_2)\bigr)^{-3/2}\, \frac{d}{d\nu_n}\,\mathbb{E}G^2_{\nu_n}(X_1,X_2)$$
(up to the constant factor $2^{-1/2}$, which does not affect the bounds below),
where
$$\bigl(\mathbb{E}G^2_{\nu_n}(X_1,X_2)\bigr)^{-1/2} = \left(\frac{\pi}{2}\right)^{-d/4} \nu_n^{d/4} \left( \int \exp\left(-\frac{\|\omega\|^2}{8\nu_n}\right) \|\mathcal F p_0(\omega)\|^2\, d\omega \right)^{-1/2} \lesssim_d \nu_n^{d/4} \left( \int \exp\left(-\frac{\|\omega\|^2}{8}\right) \|\mathcal F p_0(\omega)\|^2\, d\omega \right)^{-1/2},$$
$$\bigl(\mathbb{E}G^2_{\nu_n}(X_1,X_2)\bigr)^{-3/2} \lesssim_d \nu_n^{3d/4} \left( \int \exp\left(-\frac{\|\omega\|^2}{8}\right) \|\mathcal F p_0(\omega)\|^2\, d\omega \right)^{-3/2},$$
and
$$\frac{d}{d\nu_n}\,\mathbb{E}G^2_{\nu_n}(X_1,X_2) = \left(\frac{\pi}{2}\right)^{d/2} \nu_n^{-d/2-1} \left( -\frac d2 \int \exp\left(-\frac{\|\omega\|^2}{8\nu_n}\right) \|\mathcal F p_0(\omega)\|^2\, d\omega + \int \exp\left(-\frac{\|\omega\|^2}{8\nu_n}\right) \frac{\|\omega\|^2}{8\nu_n}\, \|\mathcal F p_0(\omega)\|^2\, d\omega \right),$$
which together ensure
$$\left\| \frac{dK_{\nu_n}}{d\nu_n} \right\|_{L^\infty} \lesssim_{d,p_0} \nu_n^{d/4-1}.$$
Hence
$$d_1(\nu_n,\nu'_n) \lesssim_{d,p_0} \bigl| \nu_n^{d/4} - (\nu'_n)^{d/4} \bigr|,$$
and $\gamma_{2/3}([1,n^{2/d}],d_1) \lesssim_{d,p_0} \bigl| (n^{2/d})^{d/4} - 1^{d/4} \bigr| \le \sqrt n$.
Then consider $\gamma_1([1,n^{2/d}],d_2)$. We have
$$d_2^2(\nu_n,\nu'_n) \le \|K_{\nu'_n} - K_{\nu_n}\|_{L^2}^2 = 1 - \frac{\mathbb{E}G_{\nu_n}G_{\nu'_n}}{\sqrt{\mathbb{E}G^2_{\nu_n}\,\mathbb{E}G^2_{\nu'_n}}} \le -\log\left( \frac{\mathbb{E}G_{\nu_n}G_{\nu'_n}}{\sqrt{\mathbb{E}G^2_{\nu_n}\,\mathbb{E}G^2_{\nu'_n}}} \right).$$
Let $f_1(\nu_n) = \int \exp\bigl(-\frac{\|\omega\|^2}{8\nu_n}\bigr)\, \|\mathcal F p_0(\omega)\|^2\, d\omega$. Then
$$\log\bigl(\mathbb{E}G^2_{\nu_n}\bigr) = \frac d2 \log\left(\frac{\pi}{2\nu_n}\right) + \log f_1(\nu_n)$$
and hence
$$-\log\left( \frac{\mathbb{E}G_{\nu_n}G_{\nu'_n}}{\sqrt{\mathbb{E}G^2_{\nu_n}\,\mathbb{E}G^2_{\nu'_n}}} \right) = \frac d2 \left( -\frac{\log\nu_n + \log\nu'_n}{2} + \log\left(\frac{\nu_n+\nu'_n}{2}\right) \right) + \left( \frac{\log f_1(\nu_n) + \log f_1(\nu'_n)}{2} - \log f_1\left(\frac{\nu_n+\nu'_n}{2}\right) \right).$$
Note that
$$\frac{\log f_1(\nu_n) + \log f_1(\nu'_n)}{2} - \log f_1\left(\frac{\nu_n+\nu'_n}{2}\right) = \frac12 \int_0^{\frac{\nu'_n-\nu_n}{2}} \int_{-u}^{u} \left( \log f_1\left( \frac{\nu'_n+\nu_n}{2} + v \right) \right)''\, dv\, du.$$
For any $\nu_n \ge 1$,
$$\bigl(\log f_1(\nu_n)\bigr)'' = \frac{f_1(\nu_n)\, f_1''(\nu_n) - (f_1'(\nu_n))^2}{f_1^2(\nu_n)} \le \frac{f_1''(\nu_n)}{f_1(\nu_n)},$$
and
$$f_1''(\nu_n) = \int \exp\left(-\frac{\|\omega\|^2}{8\nu_n}\right) \left( \frac{\|\omega\|^4}{64\nu_n^4} - \frac{\|\omega\|^2}{4\nu_n^3} \right) \|\mathcal F p_0(\omega)\|^2\, d\omega \lesssim \nu_n^{-2}\, \|p_0\|^2_{L^2}.$$
Moreover, there exists $\nu^*_n = \nu^*_n(p_0) > 1$ such that $f_1(\nu^*_n) \ge \|p_0\|^2_{L^2}/2$, from which we obtain
$$\bigl(\log f_1(\nu_n)\bigr)'' \lesssim \begin{cases} \nu_n^{-2}\, \|p_0\|^2_{L^2}\big/ f_1(1), & 1 \le \nu_n \le \nu^*_n,\\ \nu_n^{-2}, & \nu^*_n < \nu_n \le n^{2/d}, \end{cases}$$
which suggests that for any $\nu_n, \nu'_n \in [1, \nu^*_n]$,
$$d_2^2(\nu_n,\nu'_n) \lesssim \left( \frac d2 + \frac{\|p_0\|^2_{L^2}}{f_1(1)} \right) \left( -\frac{\log\nu_n + \log\nu'_n}{2} + \log\left(\frac{\nu_n+\nu'_n}{2}\right) \right) \lesssim \left( \frac d2 + \frac{\|p_0\|^2_{L^2}}{f_1(1)} \right) \bigl| \log\nu_n - \log\nu'_n \bigr|,$$
and for any $\nu_n, \nu'_n \in [\nu^*_n, n^{2/d}]$,
$$d_2^2(\nu_n,\nu'_n) \lesssim \left( \frac d2 + 1 \right) \bigl| \log\nu_n - \log\nu'_n \bigr|.$$
Note that in addition to the bound on $d_2$ obtained above, we also have
$$d_2(\nu_n,\nu'_n) \le \|\bar K_{\nu_n}\|_{L^2} + \|\bar K_{\nu'_n}\|_{L^2} \le \|K_{\nu_n}\|_{L^2} + \|K_{\nu'_n}\|_{L^2} \le \sqrt 2.$$
Therefore,
$$\begin{aligned}
\gamma_1([1,n^{2/d}],d_2) &\le \sum_{l\ge0} 2^l e_l([1,n^{2/d}],d_2)\\
&\lesssim e_0([1,n^{2/d}],d_2) + \sum_{l\ge0} 2^l e_l([1,\nu^*_n],d_2) + \sum_{l\ge0} 2^l e_l([\nu^*_n,n^{2/d}],d_2)\\
&\lesssim 1 + \sqrt{\frac d2 + \frac{\|p_0\|^2_{L^2}}{f_1(1)}}\, \sum_{l\ge0} 2^l \sqrt{\frac{\log\nu^*_n - \log 1}{2^{2^l}}} + \sqrt{\frac d2 + 1}\, \left( \sum_{l\ge0} 2^l \min\left\{ 1,\, \sqrt{\frac{\log n^{2/d} - \log\nu^*_n}{2^{2^l}}} \right\} \right)\\
&\lesssim 1 + \sqrt{\frac d2 + \frac{\|p_0\|^2_{L^2}}{f_1(1)}}\, \sqrt{\log\nu^*_n} + \sqrt{\frac d2 + 1}\, \left( \sum_{l\ge0} 2^l \min\left\{ 1,\, \sqrt{\frac{\log n^{2/d}}{2^{2^l}}} \right\} \right)\\
&\lesssim 1 + \sqrt{\frac d2 + \frac{\|p_0\|^2_{L^2}}{f_1(1)}}\, \sqrt{\log\nu^*_n} + \sqrt{\frac d2 + 1}\, \left( \sum_{0\le l<l^*} 2^l + \sum_{l\ge l^*} 2^l \sqrt{\frac{\log n^{2/d}}{2^{2^l}}} \right)\\
&\lesssim 1 + \sqrt{\frac d2 + \frac{\|p_0\|^2_{L^2}}{f_1(1)}}\, \sqrt{\log\nu^*_n} + \sqrt{\frac d2 + 1} \cdot 2^{l^*},
\end{aligned}$$
where $l^*$ is the smallest $l$ such that $\sqrt{\log n^{2/d}/2^{2^l}} \le 1$.
Hence $2^{l^*} \asymp \log\log n$, and there exists $C = C(d) > 0$ such that
$$\gamma_1([1,n^{2/d}],d_2) \le C(d)\, \log\log n$$
for sufficiently large $n$.
By a similar approach, we get
$$D_2 \lesssim 1 + \sqrt{\frac d2 + \frac{\|p_0\|^2_{L^2}}{f_1(1)}}\, \sqrt{\log\nu^*_n} + \sqrt{\frac d2 + 1} \cdot l^*,$$
which is upper bounded by $C(d)\, \log\log\log n$ for sufficiently large $n$.
Therefore, we finally obtain that there exists $C(d) > 0$ such that for sufficiently large $n$,
$$P\left( \sup_{1\le\nu_n\le n^{2/d}} \left| \frac{1}{n-1}\sum_{i\ne j} \bar K_{\nu_n}(X_i,X_j) \right| \ge C(d)\bigl( \log\log n + t\, \log\log\log n \bigr) \right) \lesssim \exp(-t^{2/3}). \tag{B.7}$$
Step (ii). With a slight abuse of notation, there exists $\nu^*_n = \nu^*_n(p_0) > 1$ such that
$$\frac{\mathbb{E}G^2_{\nu_n}(X_1,X_2)}{\mathbb{E}[\bar G_{\nu_n}(X_1,X_2)]^2} \le 2$$
for $\nu_n \ge \nu^*_n$. Therefore,
$$\begin{aligned}
T^{\mathrm{GOF(adapt)}}_n &\le \sup_{1\le\nu_n\le\nu^*_n} \sqrt{\frac{\mathbb{E}G^2_{\nu_n}(X_1,X_2)}{\mathbb{E}[\bar G_{\nu_n}(X_1,X_2)]^2}} \cdot \sup_{1\le\nu_n\le\nu^*_n} \left| \frac{1}{n-1}\sum_{i\ne j} \bar K_{\nu_n}(X_i,X_j) \right| + \sqrt 2 \sup_{\nu^*_n\le\nu_n\le n^{2/d}} \left| \frac{1}{n-1}\sum_{i\ne j} \bar K_{\nu_n}(X_i,X_j) \right|\\
&\le C(p_0) \sup_{1\le\nu_n\le\nu^*_n} \left| \frac{1}{n-1}\sum_{i\ne j} \bar K_{\nu_n}(X_i,X_j) \right| + \sqrt 2 \sup_{\nu^*_n\le\nu_n\le n^{2/d}} \left| \frac{1}{n-1}\sum_{i\ne j} \bar K_{\nu_n}(X_i,X_j) \right|
\end{aligned}$$
for some $C(p_0) > 0$.
Based on arguments similar to those in the first step,
$$P\left( \sup_{1\le\nu_n\le\nu^*_n} \left| \frac{1}{n-1}\sum_{i\ne j} \bar K_{\nu_n}(X_i,X_j) \right| \ge C(d,p_0)\, t \right) \lesssim \exp(-t^{2/3})$$
for some $C(d,p_0) > 0$, and (B.7) still holds when $\nu_n$ is restricted to $[\nu^*_n, n^{2/d}]$. Together, these two bounds prove Lemma 6.
B.4 Decomposition of dHSIC and Its Variance Estimation
In this section, we first derive an approximation of $\gamma^2_\nu(\hat{\mathbb P},\, \hat{\mathbb P}^{X^1}\otimes\cdots\otimes\hat{\mathbb P}^{X^k})$ under $H_0$ for general $k$, from which the approximation of $\mathrm{var}\bigl(\gamma^2_\nu(\hat{\mathbb P},\, \hat{\mathbb P}^{X^1}\otimes\cdots\otimes\hat{\mathbb P}^{X^k})\bigr)$ can be obtained subsequently.
Note that
$$G_\nu(x,y) = \int G_\nu(u,v)\, d(\delta_x - \mathbb P + \mathbb P)(u)\, d(\delta_y - \mathbb P + \mathbb P)(v) = \bar G_\nu(x,y) + \bigl( \mathbb{E}G_\nu(x,X) - \mathbb{E}G_\nu(X,X') \bigr) + \bigl( \mathbb{E}G_\nu(y,X) - \mathbb{E}G_\nu(X,X') \bigr) + \mathbb{E}G_\nu(X,X').$$
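This decomposition is purely algebraic: whatever values the expectations take, the centered term and the three lower-order terms recombine exactly to $G_\nu(x,y)$. A minimal numerical check of the identity follows; the Gaussian kernel, sample size, and variable names are our own illustration, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)

def G(x, y, nu=1.0):
    """Gaussian kernel G_nu(x, y) = exp(-nu * ||x - y||^2)."""
    return np.exp(-nu * np.sum((x - y) ** 2, axis=-1))

# Replace population expectations by averages over a reference sample X.
X = rng.normal(size=(500, 2))
x, y = rng.normal(size=2), rng.normal(size=2)

EG_xX = G(x, X).mean()                            # stands in for E G(x, X)
EG_yX = G(y, X).mean()                            # stands in for E G(y, X)
EG_XX = G(X[:, None, :], X[None, :, :]).mean()    # stands in for E G(X, X')

# Centered kernel: G_bar(x, y) = G(x, y) - E G(x, X) - E G(y, X) + E G(X, X')
G_bar = G(x, y) - EG_xX - EG_yX + EG_XX

# Recombine the four terms of the decomposition.
recomposed = G_bar + (EG_xX - EG_XX) + (EG_yX - EG_XX) + EG_XX
assert abs(recomposed - G(x, y)) < 1e-12
```

The check passes for any choice of the three "expectation" values, which is what makes the same decomposition valid with either population or empirical measures.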
Similarly write
$$G_\nu\bigl(x, (y^1,\ldots,y^k)\bigr) = \int G_\nu\bigl(u, (v^1,\ldots,v^k)\bigr)\, d(\delta_x - \mathbb P + \mathbb P)(u)\, d(\delta_{y^1} - \mathbb P^{X^1} + \mathbb P^{X^1})(v^1) \cdots d(\delta_{y^k} - \mathbb P^{X^k} + \mathbb P^{X^k})(v^k)$$
and expand it as the sum of all $l$-variate centered components with $l \le k+1$. Apply the same expansion to $G_\nu\bigl((x^1,\ldots,x^k), (y^1,\ldots,y^k)\bigr)$, writing it as the sum of all $l$-variate centered components with $l \le 2k$. Plug these expansions into $\gamma^2_\nu(\hat{\mathbb P},\, \hat{\mathbb P}^{X^1}\otimes\cdots\otimes\hat{\mathbb P}^{X^k})$ and denote the sum of all $l$-variate centered components in the resulting expression by $D_l(\nu)$ for $l \le 2k$. Let the remainder be $R_n = \sum_{l=3}^{2k} D_l(\nu)$, so that
$$\gamma^2_\nu(\hat{\mathbb P},\, \hat{\mathbb P}^{X^1}\otimes\cdots\otimes\hat{\mathbb P}^{X^k}) = \gamma^2_\nu(\mathbb P,\, \mathbb P^{X^1}\otimes\cdots\otimes\mathbb P^{X^k}) + D_1(\nu) + D_2(\nu) + R_n.$$
Straightforward calculation yields the following facts:

• $\mathbb E(R_n)^2 \lesssim_k n^{-3} \bigl( \mathbb{E}G^2_\nu(X_1,X_2) + \prod_{l=1}^k \mathbb{E}G^2_\nu(X^l_1,X^l_2) \bigr)$;

• under the null hypothesis, $D_1(\nu) = 0$ and
$$D_2(\nu) = \frac{1}{n(n-1)} \sum_{1\le i\ne j\le n} G^*_\nu(X_i,X_j),$$
where
$$G^*_\nu(x,y) = \bar G_\nu(x,y) - \sum_{1\le j\le k} g_j(x^j, y) - \sum_{1\le j\le k} g_j(y^j, x) + \sum_{1\le j_1,j_2\le k} g_{j_1,j_2}(x^{j_1}, y^{j_2}).$$
Proof of Lemma 1. Observe that under $H_0$,
$$\mathrm{var}\bigl( \gamma^2_\nu(\hat{\mathbb P},\, \hat{\mathbb P}^{X^1}\otimes\cdots\otimes\hat{\mathbb P}^{X^k}) \bigr) = \mathbb E(D_2(\nu))^2 + \mathbb E(R_n)^2 = \frac{2}{n(n-1)}\, \mathbb E[G^*_\nu(X_1,X_2)]^2 + \mathbb E(R_n)^2,$$
$$\mathbb E(R_n)^2 \lesssim_k n^{-3}\, \mathbb{E}G^2_\nu(X_1,X_2),$$
and
$$\begin{aligned}
\mathbb E[G^*_\nu(X_1,X_2)]^2 &= \mathbb E\left( \bar G_\nu(X_1,X_2) - \sum_{1\le j\le k} g_j(X^j_1, X_2) \right)^2 - \mathbb E\left( \sum_{1\le j\le k} g_j(X^j_2, X_1) + \sum_{1\le j_1,j_2\le k} g_{j_1,j_2}(X^{j_1}_1, X^{j_2}_2) \right)^2\\
&= \mathbb E\bar G^2_\nu(X_1,X_2) - 2 \sum_{1\le j\le k} \mathbb E\bigl( g_j(X^j_1, X_2) \bigr)^2 + \sum_{1\le j_1,j_2\le k} \mathbb E\bigl( g_{j_1,j_2}(X^{j_1}_1, X^{j_2}_2) \bigr)^2.
\end{aligned}$$
Together, these facts conclude the proof.
Below we further expand $\mathbb E\bar G^2_\nu(X_1,X_2)$, $\mathbb E\bigl(g_j(X^j_1,X_2)\bigr)^2$ and $\mathbb E\bigl(g_{j_1,j_2}(X^{j_1}_1,X^{j_2}_2)\bigr)^2$ from Lemma 1, based on which a consistent estimator of $\mathrm{var}\bigl(\gamma^2_\nu(\hat{\mathbb P},\, \hat{\mathbb P}^{X^1}\otimes\cdots\otimes\hat{\mathbb P}^{X^k})\bigr)$ can be derived naturally.
First,
$$\begin{aligned}
\mathbb E\bar G^2_\nu(X_1,X_2) &= \mathbb{E}G^2_\nu(X_1,X_2) - 2\,\mathbb{E}G_\nu(X_1,X_2)G_\nu(X_1,X_3) + \bigl(\mathbb{E}G_\nu(X_1,X_2)\bigr)^2\\
&= \prod_{1\le l\le k} \mathbb{E}G^2_\nu(X^l_1,X^l_2) - 2 \prod_{1\le l\le k} \mathbb{E}G_\nu(X^l_1,X^l_2)G_\nu(X^l_1,X^l_3) + \prod_{1\le l\le k} \bigl(\mathbb{E}G_\nu(X^l_1,X^l_2)\bigr)^2.
\end{aligned}$$
Second,
$$\begin{aligned}
\mathbb E\bigl( g_j(X^j_1, X_2) \bigr)^2 &= \mathbb{E}G^2_\nu(X^j_1,X^j_2) \cdot \prod_{l\ne j} \mathbb{E}G_\nu(X^l_1,X^l_2)G_\nu(X^l_1,X^l_3) - \prod_{1\le l\le k} \mathbb{E}G_\nu(X^l_1,X^l_2)G_\nu(X^l_1,X^l_3)\\
&\quad - \mathbb{E}G_\nu(X^j_1,X^j_2)G_\nu(X^j_1,X^j_3) \cdot \prod_{l\ne j} \bigl(\mathbb{E}G_\nu(X^l_1,X^l_2)\bigr)^2 + \prod_{1\le l\le k} \bigl(\mathbb{E}G_\nu(X^l_1,X^l_2)\bigr)^2.
\end{aligned}$$
Hence
$$\begin{aligned}
\sum_{1\le j\le k} \mathbb E\bigl( g_j(X^j_1, X_2) \bigr)^2 &= \left( \prod_{1\le l\le k} \mathbb{E}G_\nu(X^l_1,X^l_2)G_\nu(X^l_1,X^l_3) \right) \left( \sum_{1\le j\le k} \frac{\mathbb{E}G^2_\nu(X^j_1,X^j_2)}{\mathbb{E}G_\nu(X^j_1,X^j_2)G_\nu(X^j_1,X^j_3)} - k \right)\\
&\quad - \left( \prod_{1\le l\le k} \bigl(\mathbb{E}G_\nu(X^l_1,X^l_2)\bigr)^2 \right) \left( \sum_{1\le j\le k} \frac{\mathbb{E}G_\nu(X^j_1,X^j_2)G_\nu(X^j_1,X^j_3)}{\bigl(\mathbb{E}G_\nu(X^j_1,X^j_2)\bigr)^2} - k \right).
\end{aligned}$$
Finally,
$$\mathbb E\bigl( g_{j_1,j_2}(X^{j_1}_1, X^{j_2}_2) \bigr)^2 = \begin{cases} \mathbb E\bigl( \bar G_\nu(X^{j_1}_1, X^{j_1}_2) \bigr)^2 \cdot \prod_{l\ne j_1} \bigl(\mathbb{E}G_\nu(X^l_1,X^l_2)\bigr)^2, & j_1 = j_2,\\[6pt] \prod_{l\in\{j_1,j_2\}} \Bigl( \mathbb{E}G_\nu(X^l_1,X^l_2)G_\nu(X^l_1,X^l_3) - \bigl(\mathbb{E}G_\nu(X^l_1,X^l_2)\bigr)^2 \Bigr) \prod_{l\ne j_1,j_2} \bigl(\mathbb{E}G_\nu(X^l_1,X^l_2)\bigr)^2, & j_1 \ne j_2. \end{cases}$$
Hence
$$\begin{aligned}
\sum_{1\le j_1,j_2\le k} \mathbb E\bigl( g_{j_1,j_2}(X^{j_1}_1, X^{j_2}_2) \bigr)^2 = \left( \prod_{1\le l\le k} \bigl(\mathbb{E}G_\nu(X^l_1,X^l_2)\bigr)^2 \right) \Biggl( &\sum_{1\le j_1\le k} \frac{\mathbb E\bigl( \bar G_\nu(X^{j_1}_1, X^{j_1}_2) \bigr)^2}{\bigl(\mathbb{E}G_\nu(X^{j_1}_1,X^{j_1}_2)\bigr)^2}\\
&+ \sum_{1\le j_1\ne j_2\le k}\; \prod_{l\in\{j_1,j_2\}} \left( \frac{\mathbb{E}G_\nu(X^l_1,X^l_2)G_\nu(X^l_1,X^l_3)}{\bigl(\mathbb{E}G_\nu(X^l_1,X^l_2)\bigr)^2} - 1 \right) \Biggr).
\end{aligned}$$
Then the consistent estimator $s^2_{n,\nu}$ of $\mathbb E\bigl(G^*_\nu(X_1,X_2)\bigr)^2$ is constructed by replacing
$$\mathbb{E}G^2_\nu(X^l_1,X^l_2), \qquad \mathbb{E}G_\nu(X^l_1,X^l_2)G_\nu(X^l_1,X^l_3), \qquad \bigl(\mathbb{E}G_\nu(X^l_1,X^l_2)\bigr)^2$$
in the above expansions of
$$\mathbb E\bar G^2_\nu(X_1,X_2), \qquad \sum_{1\le j\le k} \mathbb E\bigl(g_j(X^j_1,X_2)\bigr)^2, \qquad \sum_{1\le j_1,j_2\le k} \mathbb E\bigl(g_{j_1,j_2}(X^{j_1}_1,X^{j_2}_2)\bigr)^2$$
with the corresponding unbiased estimators
$$\frac{1}{n(n-1)} \sum_{1\le i\ne j\le n} G^2_{\nu_n}(X^l_i, X^l_j), \qquad \frac{(n-3)!}{n!} \sum_{\substack{1\le i,j_1,j_2\le n\\ |\{i,j_1,j_2\}|=3}} G_{\nu_n}(X^l_i, X^l_{j_1})\, G_{\nu_n}(X^l_i, X^l_{j_2}),$$
$$\frac{(n-4)!}{n!} \sum_{\substack{1\le i_1,i_2,j_1,j_2\le n\\ |\{i_1,i_2,j_1,j_2\}|=4}} G_{\nu_n}(X^l_{i_1}, X^l_{j_1})\, G_{\nu_n}(X^l_{i_2}, X^l_{j_2})$$
for $1 \le l \le k$. Again, to avoid a negative estimate of the variance, we replace $s^2_{n,\nu_n}$ with $1/n^2$ whenever it is negative or too small. Namely, let
$$\hat s^2_{n,\nu_n} = \max\bigl\{ s^2_{n,\nu_n},\, 1/n^2 \bigr\},$$
and estimate $\mathrm{var}\bigl(\gamma^2_\nu(\hat{\mathbb P},\, \hat{\mathbb P}^{X^1}\otimes\cdots\otimes\hat{\mathbb P}^{X^k})\bigr)$ by $2\hat s^2_{n,\nu}\big/\bigl(n(n-1)\bigr)$.
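Each of the three unbiased estimators above can be computed in $O(n^2)$ time from the Gram matrix of a single component, using standard counting identities for sums over distinct index tuples. The sketch below is our own illustration (the function name and the vectorized identities are not from the text, though the identities are routine U-statistic algebra):

```python
import numpy as np

def u_stat_moments(g):
    """Given an n x n symmetric Gram matrix g with entries G(X_i, X_j), return
    unbiased estimates of E G(X1,X2)^2, E G(X1,X2)G(X1,X3), (E G(X1,X2))^2,
    i.e. the three quantities that are plugged into the expansions of s^2_{n,nu}."""
    n = g.shape[0]
    g0 = g - np.diag(np.diag(g))        # zero out the diagonal
    s1 = g0.sum()                       # sum over i != j of G_ij
    s2 = (g0 ** 2).sum()                # sum over i != j of G_ij^2
    row = g0.sum(axis=1)
    # sum over distinct triples (i, j1, j2) of G_{i j1} G_{i j2}
    t3 = (row ** 2).sum() - s2
    # sum over distinct quadruples (i1, i2, j1, j2) of G_{i1 j1} G_{i2 j2}
    t4 = s1 ** 2 - 4 * t3 - 2 * s2
    m_sq = s2 / (n * (n - 1))
    m_cross = t3 / (n * (n - 1) * (n - 2))
    m_mean_sq = t4 / (n * (n - 1) * (n - 2) * (n - 3))
    return m_sq, m_cross, m_mean_sq
```

For component $l$, `g` would be the matrix $\{G_{\nu_n}(X^l_i, X^l_j)\}_{i,j}$; the resulting moments are substituted into the expansions above, and the truncation $\hat s^2_{n,\nu_n} = \max\{s^2_{n,\nu_n}, 1/n^2\}$ is applied at the end.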
Therefore, for general $k$, the single-kernel test statistic and the adaptive test statistic are constructed as
$$T^{\mathrm{IND}}_{n,\nu_n} = \frac{n}{\sqrt 2}\, \hat s^{-1}_{n,\nu_n}\, \gamma^2_{\nu_n}(\hat{\mathbb P},\, \hat{\mathbb P}^{X^1}\otimes\cdots\otimes\hat{\mathbb P}^{X^k}) \qquad\text{and}\qquad T^{\mathrm{IND(adapt)}}_n = \max_{1\le\nu_n\le n^{2/d}} T^{\mathrm{IND}}_{n,\nu_n},$$
respectively. Accordingly, $\Phi^{\mathrm{IND}}_{n,\nu_n,\alpha}$ and $\Phi^{\mathrm{IND(adapt)}}$ can be constructed as in the case of $k = 2$.
B.5 Theoretical Properties of Independence Tests for General 𝑘
In this section, with $\Phi^{\mathrm{IND}}_{n,\nu_n,\alpha}$ and $\Phi^{\mathrm{IND(adapt)}}$ constructed in Appendix B.4 for general $k$, we confirm that Theorems 12, 13 and 16 still hold. We shall only emphasize the main differences between the new proofs and the original proofs for the case $k = 2$.

Under the null hypothesis: we only need to re-establish that $s^2_{n,\nu_n}$ is a consistent estimator of $\mathbb E[G^*_{\nu_n}(X_1,X_2)]^2$. Specifically, we show that
$$s^2_{n,\nu_n} \big/ \mathbb E[G^*_{\nu_n}(X_1,X_2)]^2 \to_p 1$$
given $1 \ll \nu_n \ll n^{4/d}$ for Theorem 12, and
$$\sup_{1\le\nu_n\le n^{2/d}} \bigl| s^2_{n,\nu_n} \big/ \mathbb E[G^*_{\nu_n}(X_1,X_2)]^2 - 1 \bigr| = o_p(1)$$
for Theorem 16.
To prove the former, since
$$\frac{\mathbb E[G^*_{\nu_n}(X_1,X_2)]^2}{(\pi/(2\nu_n))^{d/2}\, \|p\|^2_{L^2}} \to 1$$
as $\nu_n \to \infty$, it suffices to show
$$\nu_n^{d/2}\, \bigl| s^2_{n,\nu_n} - \mathbb E[G^*_{\nu_n}(X_1,X_2)]^2 \bigr| = o_p(1),$$
which follows considering that
$$\nu_n^{d_l/2}\, \mathbb{E}G^2_{\nu_n}(X^l_1,X^l_2), \qquad \nu_n^{d/2}\, \mathbb{E}G_{\nu_n}(X^l_1,X^l_2)G_{\nu_n}(X^l_1,X^l_3), \qquad \nu_n^{d_l/2}\, \bigl(\mathbb{E}G_{\nu_n}(X^l_1,X^l_2)\bigr)^2 \tag{B.8}$$
are all bounded and are estimated consistently by their corresponding estimators. For example,
$$\nu_n^{d_l/2}\, \mathbb{E}G^2_{\nu_n}(X^l_1,X^l_2) \to (\pi/2)^{d_l/2}\, \|p_l\|^2_{L^2}$$
and
$$\begin{aligned}
\nu_n^{d_l}\, \mathbb E\left( \frac{1}{n(n-1)} \sum_{1\le i\ne j\le n} G^2_{\nu_n}(X^l_i,X^l_j) - \mathbb{E}G^2_{\nu_n}(X^l_1,X^l_2) \right)^2 &= \nu_n^{d_l}\, \mathrm{var}\left( \frac{1}{n(n-1)} \sum_{1\le i\ne j\le n} G^2_{\nu_n}(X^l_i,X^l_j) \right)\\
&\lesssim \nu_n^{d_l} \Bigl( n^{-1}\, \mathbb{E}G^2_{\nu_n}(X^l_1,X^l_2)G^2_{\nu_n}(X^l_1,X^l_3) + n^{-2}\, \mathbb{E}G^4_{\nu_n}(X^l_1,X^l_2) \Bigr)\\
&\lesssim_{d_l} n^{-1}\, \nu_n^{d_l/4}\, \|p_l\|^3_{L^2} + n^{-2}\, \nu_n^{d_l/2}\, \|p_l\|^2_{L^2} \to 0.
\end{aligned}$$
The proof of the latter is similar. It suffices to have:

• each term in (B.8) is bounded for $\nu_n \in [1,\infty)$, which follows immediately since each term is continuous in $\nu_n$ and converges as $\nu_n \to \infty$;

• the difference between each term in (B.8) and its corresponding estimator converges to 0 uniformly over $\nu_n \in [1, n^{2/d}]$; the proof of this is the same as that of Lemma 5.
Under the alternative hypothesis: we only need to re-establish that $\hat s_{n,\nu_n}$ is bounded. Specifically, we show
$$\inf_{p\in H^{\mathrm{IND}}_1(\Delta_{n,s})} \frac{n\, \gamma^2_{\nu_n}(\mathbb P,\, \mathbb P^{X^1}\otimes\cdots\otimes\mathbb P^{X^k})}{\Bigl[ \mathbb E\bigl( \hat s^2_{n,\nu_n} \bigr)^{1/k} \Bigr]^{k/2}} \to \infty$$
for Theorem 13, and
$$\inf_{s\ge d/4}\; \inf_{p\in H^{\mathrm{IND}}_1(\Delta_{n,s};\, s)} P\Bigl( \hat s^2_{n,\nu_n(s)'} \le 2M^2 \bigl( 2\nu_n(s)'/\pi \bigr)^{-d/2} \Bigr) \to 1 \tag{B.9}$$
for Theorem 16, where $\nu_n(s)' = (\log\log n / n)^{-4/(4s+d)}$.
The former one holds because
E(��2𝑛,a𝑛
)1/𝑘≤ E
(max
{��𝑠2𝑛,a�� , 1/𝑛2})1/𝑘
≤ E��𝑠2𝑛,a��1/𝑘 + 𝑛−2/𝑘
.𝑘
(𝑘∏𝑙=1E𝐺2a𝑛 (𝑋 𝑙1, 𝑋
𝑙2)
)1/𝑘
+ 𝑛−2/𝑘
≤(𝑀2(𝜋/(2a𝑛))𝑑/2
)1/𝑘+ 𝑛−2/𝑘 .
where the second to last inequality follows from generalized Hölder’s inequality. For example,
E©«𝑘∏𝑙=1
1𝑛(𝑛 − 1)
∑1≤𝑖≠ 𝑗≤𝑛
𝐺2a𝑛 (𝑋 𝑙𝑖 , 𝑋 𝑙𝑗 )ª®¬
1/𝑘
≤(𝑘∏𝑙=1E𝐺2a𝑛 (𝑋 𝑙1, 𝑋
𝑙2)
)1/𝑘
.
To prove the latter, note that for $\nu_n = \nu_n(s)'$, all three terms in (B.8) are bounded by $M^2_l (\pi/2)^{d_l/2}$ and the variances of their corresponding estimators are bounded by
$$C(d_l) \Bigl( n^{-1} \bigl( \nu_n(s)' \bigr)^{d_l/4} M^3_l + n^{-2} \bigl( \nu_n(s)' \bigr)^{d_l/2} M^2_l \Bigr) = o(1)$$
uniformly over all $s$.
uniformly over all 𝑠. Therefore,
inf𝑠≥𝑑/4
inf𝑝∈𝐻IND
1 (Δ𝑛,𝑠;𝑠)𝑃
((a𝑛 (𝑠)′)𝑑/2
���𝑠2𝑛,a𝑛 (𝑠) ′ − E[𝐺∗a𝑛 (𝑠) ′ (𝑌1, 𝑌2)]2
��� ≤ 𝑀2(𝜋/2)𝑑/2)→ 1
123
where 𝑌1, 𝑌2 ∼iid P𝑋1 ⊗ · · · ⊗ P𝑋 𝑘 . Further considering that
E[𝐺∗a𝑛 (𝑠) ′ (𝑌1, 𝑌2)]2 ≤ E[��a𝑛 (𝑠) ′ (𝑌1, 𝑌2)]2 ≤ 𝑀2(𝜋/(2a𝑛 (𝑠)′))𝑑/2
and that
1/𝑛2 = 𝑜((a𝑛 (𝑠)′)−𝑑/2)
uniformly over all 𝑠, we prove (B.9).