
4 Chaining
    4.1 What is chaining?
    4.2 Covering and packing numbers
    4.3 Bounding sums by integrals
    4.4 Orlicz norms of maxima of finitely many variables
    4.5 Chaining with norms and packing numbers
    4.6 An alternative to packing: majorizing measures
    4.7 Chaining with tail probabilities
        4.7.1 Tail chaining via packing
        4.7.2 Tail chaining via majorizing measures
    4.8 Brownian Motion, for example
        4.8.1 Control by (conditional) expectations
        4.8.2 Control by Ψ2-norm
        4.8.3 Control by tail probabilities
    4.9 Problems
    4.10 Notes

Mini-empirical (version 20nov15) © David Pollard


Chapter 4

Chaining

This chapter introduces some of the tools that are used to construct approximations to stochastic processes by chaining.

Section 4.1 describes two related methods for quantifying the complexity of a metric space: by means of covering/packing numbers or by means of majorizing measures.

Section 4.2 defines covering and packing numbers.

Section 4.3 explains why chaining bounds are often expressed as integrals.

Section 4.4 presents a few simple ways to bound the Orlicz norm of a maximum of finitely many random variables.

Section 4.5 illustrates the method for combining chaining with maximal inequalities like those from Section 4.4 to control the norm of the oscillation of a stochastic process.

Section 4.6 mentions a more subtle alternative to packing/covering, which is discussed in detail in Chapter 7.

Section 4.7 discusses chaining with tail probabilities.

Section 4.8 compares the various chaining methods by applying them to the humble example of Brownian motion on the unit interval.

4.1 What is chaining?

This chapter begins the task of developing various probabilistic bounds for stochastic processes, X = {Xt : t ∈ T}, where the index set T is equipped with a metric (or semi-metric) d that gives some control over the increments of the process. For example, we might have ‖Xs − Xt‖ ≤ C d(s, t) for some norm (or semi-norm) ‖·‖ and some constant C, or we might have a tail bound

P{|Xs − Xt| ≥ η d(s, t)} ≤ β(η)   for all s, t ∈ T,

for some decreasing function β. The leading case—the example that drove the development of much theory—is the centered Gaussian process with (semi-)metric defined by d_X(s, t)^2 = P|Xs − Xt|^2.

Remark. Especially if T is finite or countably infinite, it is not essential that d(s, t) = 0 should imply that s = t. It usually suffices to have Xs = Xt almost surely if d(s, t) = 0. More elegantly, we could partition T into equivalence classes [t] = {s ∈ T : d(s, t) = 0} then replace T by a subset T0 consisting of one point from each equivalence class. The restriction of d to T0 would then be a metric. Like most probabilists, I tend to ignore these niceties unless they start causing trouble. When dealing with finite index sets we may assume d is a metric.

The central task in many probabilistic and statistical problems is to find good upper bounds for quantities such as supt∈T |Xt| or the oscillation,

<1>   osc(δ, X, T) := sup{|Xs − Xt| : s, t ∈ T, d(s, t) < δ}.

For example, we might seek bounds on ‖supt∈T |Xt|‖ or ‖osc(δ, X, T)‖, for various norms, or on their tail probability analogs

P{supt∈T |Xt| > η}   or   P{osc(δ, X, T) > η}.

Such quantities play an important role in the theory of stochastic processes, empirical process theory, and statistical asymptotic theory. As explained in Section 5.4, bounds for the oscillation are essential for the construction of processes with continuous sample paths and for the study of convergence in distribution of sequences of stochastic processes. In particular, oscillation control is the key ingredient in the proofs of Donsker theorems for empirical processes. In the literature on the asymptotic theory for estimators defined by optimization over random processes, oscillation bounds (or something similar) have played a major role under various names, such as "stochastic asymptotic equicontinuity".

If T is uncountable the suprema in the previous paragraph need not be measurable, which leads to difficulties of the type discussed in Chapter 5. Some authors sidestep this difficulty by interpreting the inequalities as statements about arbitrarily large finite subsets of T. For example,

interpret P supt∈T Xt to mean sup{P supt∈S Xt : finite S ⊂ T}.


With such a convention the real challenge becomes: find bounds for finite S that do not grow unhelpfully large as the size (cardinality) of S increases. For example, if T* is any countable subset of T one might hope to get a reasonable bound for P supt∈T* Xt by passing to the limit along a sequence of finite subsets Sn that increase to T*. If T* is dense in T the methods from Chapter 5 then take over, leading to bounds for P supt∈T Xt if the version of X has suitable sample paths.

The workhorse of the modern approach to approximating stochastic processes is called chaining. Suppose you wish to approximate maxt∈S Xt on a (possibly very large) finite subset S of T. You could try a union bound, such as

P{maxt∈S |Xt| ≥ η} ≤ ∑_{t∈S} P{|Xt| ≥ η},

but typically the upper bound grows larger than 1 as the size of S increases. Instead you could creep up on S through a sequence of finite subsets, S0, S1, . . . , Sm = S, breaking the process on S into a contribution from S0 plus a sum of increments across each Si+1 to Si pair.

To carry out such a strategy you need maps ℓi : Si+1 → Si for i = 0, . . . , m−1. The composition Lp = ℓp ∘ ℓp+1 ∘ · · · ∘ ℓm−1 maps Sm into Sp, for 0 ≤ p < m. Each t = tm in Sm is connected to a point t0 = L0t in S0 by a chain of points

<2>   tm = t → tm−1 = Lm−1(t) → tm−2 = Lm−2(t) → · · · → t0 = L0(t),

in which the step from ti+1 down to ti applies the link map ℓi, with a corresponding decomposition of the process into a sum of increments,

<3>   X(t) − X(t0) = ∑_{i=0}^{m−1} X(ti+1) − X(ti)   with ti = Li(t).

[Figure: the chain from t = tm in Sm through tm−1 = ℓm−1 t and tm−2 = ℓm−2 tm−1, down to t0 = L0(t) in S0.]
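To make the bookkeeping concrete, here is a minimal Python sketch (my own construction, not from the text): T = [0, 1], dyadic sets Si, and the rounding-down link maps that reappear in Section 4.8. It merely verifies the telescoping identity <3>; the names are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    m = 5
    n = 2 ** m + 1                          # S_m has n points k/(n-1); S_i is every 2**(m-i)-th of them
    X = np.cumsum(rng.normal(size=n))       # stand-in for a process observed on S_m

    def chain(k):
        """Indices in S_m of the chain t_m, t_{m-1}, ..., t_0 for the k-th point of S_m."""
        path, j = [k], k
        for i in reversed(range(m)):        # j currently indexes a point of S_{i+1}
            j = j // 2                      # link map l_i: round down into S_i
            path.append(j * 2 ** (m - i))   # position of that S_i point within S_m
        return path

    # Decomposition <3>: X(t) - X(t_0) telescopes into the sum of link increments.
    # Many different t share each link, which is the whole point of chaining.
    for k in range(n):
        path = chain(k)
        increments = sum(X[path[i]] - X[path[i + 1]] for i in range(m))
        assert np.isclose(X[path[0]] - X[path[-1]], increments)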


Remark. To me ℓ stands for link. The pair (t, ℓit) with t ∈ Si+1 defines a link in the chain connecting t to L0t.

The point of decomposition <3> is that many different t’s in S might share the same increment X(s) − X(ℓi(s)) with s ∈ Si+1, but you need to control that increment only once. For example, for each t in S the triangle inequality applied to <3> gives

|X(t) − X(t0)| ≤ ∑_{i=0}^{m−1} |X(ti+1) − X(ti)| ≤ ∑_{i=0}^{m−1} max_{s∈Si+1} |X(s) − X(ℓi(s))|.

Remark. Does |·| stand for the absolute value of a real-valued random variable or the norm of some random element of a normed vector space? It really doesn’t matter which interpretation you prefer. The chaining argument generalizes easily to more exotic random elements of vector spaces.

The final sum is the same for every t; it also provides an upper bound for the maximum over t:

<4>   maxt∈S |X(t) − X(t0)| ≤ ∑_{i=0}^{m−1} max_{s∈Si+1} |X(s) − X(ℓi(s))|.

If ε = ε0 + · · · + εm−1 then a simple union bound provides some control over the quantity on the left-hand side of <4>:

P{maxt∈S |X(t) − X(t0)| ≥ ε}
   ≤ P ∪_{i<m} ∪_{s∈Si+1} {|X(s) − X(ℓi(s))| ≥ εi}
   ≤ ∑_{i=0}^{m−1} ∑_{s∈Si+1} P{|X(s) − X(ℓi(s))| ≥ εi}.

The tail probability P{maxt∈S |Xt| ≥ η + ε} is less than P{maxt∈S0 |Xt| ≥ η} plus the double sum over i and Si+1. If the growth in the size of the Si’s could be offset by a decrease in the d(s, ℓis) distances you might get probabilistic bounds that do not depend explicitly on the size of S.

The extra details involving the choice of the εi’s make chaining with tail probabilities seem more complicated than the analogous argument where one merely takes some norm ρ of both sides of inequality <4>, giving

<5>   ρ(maxt∈S |X(t) − X(t0)|) ≤ ∑_{i=0}^{m−1} ρ(max_{s∈Si+1} |X(s) − X(ℓi(s))|).

Here I am assuming that ρY1 ≤ ρY2 if 0 ≤ Y1 ≤ Y2. More precisely, from now on I assume that ρ has the following properties:


<6> Assumption. Let M+ = M+(Ω, F, P) denote the set of all F-measurable functions on a probability space (Ω, F, P) taking values in [0, ∞]. Assume ρ is a map from M+ into [0, ∞] that satisfies

(i) ρ(X + Y) ≤ ρ(X) + ρ(Y);

(ii) ρ(cX) = cρ(X) for all c ∈ R+;

(iii) ρ(X) ≤ ρ(Y) if X ≤ Y almost surely;

(iv) ρ(X) = 0 iff X = 0 almost surely.

Remark. If we intend to pass to the limit as a sequence of finite subsets increases to a countable subset of T we also need something like: if 0 ≤ X1 ≤ X2 ≤ . . . ↑ X almost surely then ρ(Xn) → ρ(X).

In place of union bounds we now need ways to control ρ of the maximum over finite sets. See Section 4.4 for some simple methods when ρ is an Orlicz norm.

The traditional approach to chaining chooses the Si sets for their uniform approximation properties. Typically one starts with a decreasing sequence {δi}, such as δi = δ/2^i for some δ > 0, then seeks sets Si with min_{s∈Si} d(t, s) ≤ δi for every t in S, with each Si as small as possible. The smallest cardinality is called the covering number. Section 4.2 gives the formal definition, as well as describing a slightly more convenient concept called the packing number.

Section 4.5 illustrates the method for combining bounds on packing numbers with bounds on the norms of maxima of finitely many variables (Section 4.4) to obtain useful bounds for a norm of maxt∈S |X(t) − X(t0)|. Section 4.7 replaces norms by tail probabilities.

Section 4.8 returns to the relatively simple case of Brownian motion on the unit interval as a test case for comparing the different approaches.

For each method I use the task of controlling the oscillation of a stochastic process as a nontrivial example of what can be done with chaining.

4.2 Covering and packing numbers

The simplest strategy for chaining is to minimize the size (cardinality) of Si subject to a given upper bound on supt∈T min_{s∈Si} d(t, s). That idea translates into a statement about covering numbers.


Remark. The logarithm of the covering number is sometimes called the metric entropy, probably because it appears naturally when dealing with processes whose increments have tail probabilities that decrease exponentially fast.

<7> Definition. For a subset F of T write NT(δ, F, d) for the δ-covering number, the smallest number of closed δ-balls needed to cover F. That is, the covering number is the smallest N for which there exist points t1, . . . , tN in T with mini≤N d(t, ti) ≤ δ for each t in F. The set of centers {ti} is called a δ-net for F.

Remark. Notice a small subtlety related to the subscript T in the definition. If we regard F as a metric space in its own right, not just as a subset of T, then the covering numbers might be larger because the centers ti would be forced to lie in F. It is an easy exercise (select a point of F from each covering ball that actually intersects F) to show that NF(2δ, F, d) ≤ NT(δ, F, d). The extra factor of 2 would usually be of little consequence. When in doubt, you should interpret covering numbers to refer to NF.

Some metric spaces (such as the whole real line under its usual metric) cannot be covered by a finite set of balls of a fixed radius. A metric space T for which NT(δ, T, d) < ∞ for every δ > 0 is said to be totally bounded. A metric space is compact if and only if it is both complete and totally bounded (Dudley, 2003, Section 2.3).

I prefer to work with the packing number pack(δ, F, d), defined as the largest N for which there exist points t1, . . . , tN in F that are δ-separated, that is, for which d(ti, tj) > δ if i ≠ j. Notice the lack of a subscript T; the packing numbers are an intrinsic property of F, and do not depend on T except through the metric it defines on F.

<8> Lemma. For each δ > 0,

NF(δ, F, d) ≤ pack(δ, F, d) ≤ NT(δ/2, F, d) ≤ NF(δ/2, F, d).

Proof For the middle inequality, observe that no closed ball of radius δ/2 can contain points more than δ apart. Each of the centers for pack(δ, F, d) must lie in a distinct δ/2 covering ball. The other inequalities have similarly simple proofs.

<9> Example. Let ‖·‖ denote any norm on R^k. For example, it might be ordinary Euclidean distance (the ℓ2 norm), or the ℓ1 norm, ‖x‖1 = ∑_{i≤k} |xi|. The covering numbers for such norms share a common geometric bound.


Write BR for the ball of radius R centered at the origin. For a fixed ε, with 0 < ε ≤ 1, how many balls of radius εR does it take to cover BR? Equivalently, what are the packing numbers for BR?

Let x1, . . . , xN be any set of points in BR that is εR-separated. The closed balls B[xi, εR/2] of radius εR/2 centered at the xi are disjoint and their union lies within B_{R+εR/2}. Write Λ for the Lebesgue measure of the unit ball B1. Each B[xi, εR/2] has Lebesgue measure (εR/2)^k Λ and B_{R+εR/2} has Lebesgue measure (R + εR/2)^k Λ. It follows that

N ≤ (R + εR/2)^k Λ / (εR/2)^k Λ = ((2 + ε)/ε)^k ≤ (3/ε)^k.

That is, pack(εR, BR, d) ≤ (3/ε)^k for 0 < ε ≤ 1, where d denotes the metric corresponding to ‖·‖.

Finite packing numbers lend themselves in a simple way to the construction of increasing sequences of approximating subsets. We start with a maximal subset S0 of points that are δ0-separated, then enlarge to a maximal subset of points that are δ1-separated, and so on.

If S is finite, the whole procedure must stop after a finite number of steps, leaving S itself as the final approximating set. In any case, we have constructed nested finite subsets S0 ⊆ S1 ⊆ S2 ⊆ S3 ⊆ . . . and we have a bound on the sizes, #Si ≤ pack(δi, S, d). If δ0 = diam(S) then S0 consists of only one point.

The natural choice for ℓi is the nearest neighbor map from Si+1 to Si, for which d(t, ℓi(t)) = min_{s∈Si} d(t, s) ≤ δi, with a convention such as choosing the s that appeared earliest in the construction to handle ties.
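The greedy construction is easy to carry out numerically. Here is a small sketch (mine; the names are invented) that grows nested maximal δi-separated subsets of a finite S ⊂ [0, 1]^2. By maximality every point of S lies within δi of Si, so the nearest-neighbor links satisfy d(t, ℓit) ≤ δi, and the printed sizes grow roughly like δi^{-2}, consistent with the volume bound of Example <9>.

    import numpy as np

    def extend_separated(points, delta, seed):
        """Grow seed into a maximal delta-separated subset of points (greedy scan)."""
        chosen = list(seed)
        for p in points:
            if all(np.linalg.norm(p - q) > delta for q in chosen):
                chosen.append(p)
        return chosen

    rng = np.random.default_rng(1)
    S = rng.random((500, 2))                 # a finite index set in [0,1]^2
    current = []
    for i in range(6):
        delta_i = 0.5 / 2 ** i               # delta_i = delta_0 / 2**i
        current = extend_separated(S, delta_i, current)   # seeds keep S_i nested
        print(i, len(current))               # #S_i <= pack(delta_i, S, d)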

4.3 Bounding sums by integrals

Typically we do not know the exact values of the packing numbers but instead only have an upper bound, pack(δ, S, d) ≤ M(δ). The chaining strategy outlined in Section 4.1 often leads to upper bounds of the form ∑_{i≤k} G(δi)δi−1, with G(δ) a decreasing function, such as Ψ^{-1}(M(δ)). If the δi’s decrease geometrically, say δi = δ0/2^i, then the sum can be bounded above by an integral,

∑_{i≤k} G(δi)δi−1 ≤ 4 ∫_{δk+1}^{δ1} G(r) dr ≤ 4 ∫_0^{δ1} G(r) dr.

[Figure: the graph of a decreasing G, with a rectangle of height G(δi) over [δi+1, δi], of area δi−1 G(δi)/4, lying below the curve.]


If G is integrable the replacement of the lower terminal by 0 costs little. Note the similarity to Dudley’s entropy integral assumption <24> in Section 4.6. In the early empirical process literature, some authors chose the δi’s to make the G(δi)’s increase geometrically, for similar reasons.
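A quick numerical sanity check (my own, not from the text), using the decreasing integrable choice G(r) = √(log(1 + r^{-2})), the integrand that reappears in Section 4.8:

    import math
    import numpy as np

    def G(r):
        return math.sqrt(math.log(1.0 + 1.0 / r ** 2))

    delta0, k = 0.5, 12
    delta = [delta0 / 2 ** i for i in range(k + 2)]     # delta_0, ..., delta_{k+1}

    lhs = sum(G(delta[i]) * delta[i - 1] for i in range(1, k + 1))

    # trapezoid rule for 4 * integral from delta_{k+1} to delta_1 of G(r) dr
    xs = np.linspace(delta[k + 1], delta[1], 200_001)
    gs = np.array([G(x) for x in xs])
    rhs = 4.0 * float(np.sum(0.5 * (gs[1:] + gs[:-1]) * np.diff(xs)))

    print(f"sum = {lhs:.4f} <= 4*integral = {rhs:.4f}")   # the bound holds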

4.4 Orlicz norms of maxima of finitely many variables

The Orlicz norms are natural candidates for the ρ of Assumption <6>. The next Theorem collects a few simple inequalities well suited to inequality <5>. See Section 4.8 for an example that highlights the relative merits of the inequalities.

<10> Theorem. Let X1, . . . , XN be random variables defined on the same (Ω, F, P) with maxi≤N ‖Xi‖Ψ ≤ τ. Define M := maxi≤N Xi.

(i) PM ≤ τ Ψ^{-1}(N).

(ii) For each event B with PB > 0,

PBM ≤ τ Ψ^{-1}(N/PB),

where PB denotes conditional expectation given B.

(iii) Let Φ be another Young function that is related to Ψ by the inequality Φ(α)Φ(β) ≤ Ψ(αβ) whenever Φ(α ∧ β) ≥ 1. Then

‖M‖Φ ≤ 2τ Φ^{-1}(N).

(iv) In particular, if there exist positive constants K0, c0, and c1 for which Ψ(α)Ψ(β) ≤ K0 Ψ(c0 αβ) whenever α ∧ β ≥ c1 then

‖M‖Ψ ≤ C0 τ Ψ^{-1}(N)

for a constant C0 that depends only on Ψ.

Remark. Note well that the Theorem makes no independence assumptions.

For each p ≥ 1 the Young function Ψ(t) = t^p satisfies condition (iv) with equality when K0 = c0 = 1 and c1 = 0. Problem [??] shows that each of the Ψα functions from Section 2.3, which grow like exp(t^α), also satisfies (iv).


Proof Without loss of generality, suppose τ = 1 and each Xi is nonnegative, so that PΨ(Xi) ≤ 1 for each i.

Even though assertion (i) is a special case of assertion (ii), I think it helps understanding to see both proofs. For (i), Jensen’s inequality then monotonicity of Ψ imply

Ψ(PM) ≤ P maxi≤N Ψ(Xi) ≤ ∑_{i≤N} PΨ(Xi) ≤ N.

For (ii), partition B into disjoint subsets Bi such that maxj Xj = Xi on Bi. Without loss of generality assume PBi > 0 for each i. (Alternatively, just discard those Bi with zero probability.) By definition,

PBM = PB(∑_i Bi Xi) = ∑_i (PBi/PB) PBi Xi.

By Jensen’s inequality,

Ψ(PBi Xi) ≤ PBi Ψ(Xi) = P(Bi Ψ(Xi))/PBi ≤ PΨ(Xi)/PBi ≤ 1/PBi.

Thus

Ψ(PBM) = Ψ(∑_i (PBi/PB) PBi Xi)
   ≤ ∑_i (PBi/PB) Ψ(PBi Xi)   by convexity of Ψ
   ≤ ∑_i (PBi/PB)(1/PBi) = N/PB.

Perhaps it would be better to write the last equality as an inequality, to cover the case where some PBi are zero.

For (iii), let β = Φ^{-1}(N), so that Φ(β) = N. Trivially,

{Φ(M/β) ≤ 1} Φ(M/β) ≤ 1

and, by the relationship between Φ and Ψ,

{Φ(M/β) > 1} Φ(M/β) Φ(β) ≤ Ψ(M) ≤ ∑_i Ψ(Xi).

Divide both sides of the second inequality by N, add, then take expectations to deduce that

PΦ(M/β) ≤ 1 + (1/N) ∑_i PΨ(Xi) ≤ 2,


which implies ‖M‖Φ ≤ 2β (see Section 2.3).

For (iv), first note that K0Ψ(x) ≤ Ψ(K1x) for K1 = max(1, K0). If L ≥ 1 the assumed bound gives

Ψ(α)Ψ(β) ≤ Ψ(Lα)Ψ(Lβ) ≤ Ψ(K1 c0 L^2 αβ)   if (Lα) ∧ (Lβ) ≥ c1.

Choose L ≥ 1 so that LΨ^{-1}(1) ≥ c1. Then (Lα) ∧ (Lβ) ≥ c1 if Ψ(α) ∧ Ψ(β) ≥ 1 and

Ψ(α)Ψ(β) ≤ Ψ~(αβ) := Ψ(c0 K1 L^2 αβ)   if Ψ(α) ∧ Ψ(β) ≥ 1.

Invoke the result from (iii), noting that ‖Y‖Ψ~ = c0 K1 L^2 ‖Y‖Ψ for each random variable Y.

4.5 Chaining with norms and packing numbers

Consider a process {Xt : t ∈ T} indexed by a (necessarily totally bounded) metric space for which we have some control over the packing numbers. Instead of working via an assumption like ‖Xs − Xt‖Ψ ≤ C d(s, t), combined with one of the bounds from Section 4.4, I’ll cut out the middle man by assuming directly there is a norm ρ that satisfies Assumption <6> and an increasing function H : N → R+ for which

<11>   ρ(maxi≤N |Xsi − Xti| / d(si, ti)) ≤ H(N)

for all finite sets of increments with d(si, ti) > 0. For example, if ρ were expected value and ‖Xs − Xt‖Ψ2 ≤ d(s, t) then H(N) = Ψ2^{-1}(N) = √(log(1 + N)).

<12> Lemma. Suppose X, ρ, and H satisfy <11> and S0 ⊆ S1 ⊆ · · · ⊆ Sm are finite subsets of T for which d(s, Si) ≤ δi for all s ∈ Si+1 and #Si ≤ Ni. Then, in the chaining notation from <2>,

ρ(maxt∈Sm |X(t) − X(L0t)|) ≤ ∑_{i=0}^{m−1} δi H(Ni+1).

In particular, if δi = δ0/2^i and Si is δi-separated then Ni ≤ pack(δi, T, d) and

ρ(maxt∈Sm |X(t) − X(L0t)|) ≤ 4 ∫_{δm+1}^{δ1} H(pack(r, T, d)) dr.


Proof The first inequality results from applying ρ to both sides of the inequality

maxt∈Sm |X(t) − X(L0t)| ≤ ∑_{i=0}^{m−1} δi max_{s∈Si+1} |X(s) − X(ℓis)| / d(s, ℓis).

The method described in Section 4.3 provides the integral bound.
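In practice one just sums the series. A sketch (my own names, and a crude packing bound) for the subgaussian case on T = [0, 1] under d2(s, t) = √|s − t|, where r-separated points are at least r^2 apart so pack(r, T, d2) ≤ 1 + 1/r^2:

    import math

    def chaining_bound(delta0, m, pack, H):
        """Compute sum_{i=0}^{m-1} delta_i * H(pack(delta_{i+1})), delta_i = delta0/2**i."""
        return sum(
            (delta0 / 2 ** i) * H(pack(delta0 / 2 ** (i + 1)))
            for i in range(m)
        )

    pack = lambda r: math.ceil(1.0 / r ** 2) + 1    # packing bound for d2 on [0,1]
    H = lambda N: math.sqrt(math.log(1.0 + N))      # Psi_2^{-1}(N)

    for m in [5, 10, 20, 40]:
        print(m, round(chaining_bound(0.5, m, pack, H), 3))
    # values stabilize as m grows, matching the convergent entropy integral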

<13> Example. Suppose the {Xt : t ∈ T} process has subgaussian increments with ‖Xs − Xt‖Ψ2 ≤ d(s, t). The function Ψ2(t) = e^{t^2} − 1 satisfies the assumptions of Theorem <10> part (iv). Inequality <11> holds with ρ equal to ‖·‖Ψ2 and H(N) a constant multiple of Ψ2^{-1}(N) = √(log(1 + N)).

Because pack(r, T, d) = 1 for all r > diam(T) we may as well assume δ0 ≤ diam(T), in which case pack(r, T, d) ≥ 2 for all r ≤ δ1. That lets us absorb the pesky 1+ from the log() into the constant, leaving

‖maxt∈Sm |X(t) − X(L0t)|‖Ψ2 ≤ C ∫_0^{δ1} √(log pack(r, T, d)) dr,

for some universal constant C. Remember that for each k ≥ 1 there is a constant Ck for which ‖Z‖k ≤ Ck ‖Z‖Ψ2 (Section 2.4). Thus the upper bounds for the ‖·‖Ψ2 norm of maxt∈Sm |X(t) − X(L0t)| automatically imply similar upper bounds for its ‖·‖k. In particular, both Theorems 3.2 and 3.3 in the expository paper of Pollard (1989) follow from the result for the Ψ2-norm.

Lemma <12> captures the main idea for chaining with norms. The next Theorem uses the example of oscillation control to show why the uniform approximation of {Xs : s ∈ Sm} by {Xs : s ∈ S0} is such a powerful tool.

<14> Theorem. Suppose {Xt : t ∈ T} is a stochastic process whose increments satisfy <11>, that is, ρ(maxi≤N |Xsi − Xti|/d(si, ti)) ≤ H(N) for all finite sets of increments. Suppose also that

∫_0^D H(pack(r, T, d)) dr < ∞   where D := diam(T).

Then for each ε > 0 there exists a δ > 0 for which ρ(osc(δ, X, S)) < ε for every finite subset S of T.


Proof Choose δ0 so that

<15>   4 ∫_0^{δ0} H(pack(r, T, d)) dr < ε/5.

Chain with δi = δ0/2^i, finding a sequence of δi-separated subsets Si for which S0 ⊆ · · · ⊆ Sm = S with

Ni = #Si ≤ pack(δi, S, d) ≤ pack(δi, T, d).

Define ∆ := maxt∈Sm |X(t) − X(L0t)|. Lemma <12> gives ρ(∆) < ε/5. The value N0 = #S0, which only depends on ε, is now fixed.

There might be many pairs s, t in Sm for which d(s, t) < δ, but they correspond to at most N0(N0 − 1)/2 pairs L0s, L0t in S0 for which

<16>   d(L0s, L0t) ≤ ∑_{i=0}^{m−1} d(Li+1s, Lis) + δ + ∑_{i=0}^{m−1} d(Li+1t, Lit) ≤ 4δ0 + δ.

For the subgaussian case, where H(N) grows like √(log N), we could afford to invoke a finite maximal inequality to control the contributions from the pairs in S0 by a constant multiple of H(N0^2)(4δ0 + δ), which could be controlled by <15> if δ = δ0 because H(N0^2) ≤ constant × H(N0). Without such behavior for H we would need something stronger than <15>.

For general H(·) it pays to make a few more trips up and down the chains, using a clever idea of Ledoux and Talagrand (1991, Theorem 11.6).

[Figure: chains from s and t down to L0s = L0tE,F and L0t = L0tF,E, with four increments controlled by ∆ and the link between tE,F and tF,E controlled by δΛ.]

The map L0 defines an equivalence relation on Sm, with t ∼ t′ if and only if L0t = L0t′. The corresponding equivalence classes define a partition πm of Sm into at most N0 subsets. For each distinct pair E, F from πm choose points tE,F ∈ E and tF,E ∈ F such that

d(tE,F, tF,E) = d(E, F) = min{d(s, t) : s ∈ E, t ∈ F},

then define

Λ := max{ |X(tE,F) − X(tF,E)| / d(E, F) : E, F ∈ πm, E ≠ F }.

Assumption <11> implies ρ(Λ) ≤ H(N0^2).

Remark. The definition of Λ might be awkward if d were a semi-metric and d(E, F) = 0 for some pair E ≠ F. By working with equivalence classes we avoid such awkwardness.

Now consider any pair s, t ∈ Sm for which d(s, t) < δ. If s ∼ t, so that L0s = L0t, we have

|X(s) − X(t)| ≤ |X(s) − X(L0s)| + 0 + |X(L0t) − X(t)| ≤ 2∆.

If s and t belong to different sets, E and F, from πm then

|X(s) − X(tE,F)| ≤ 2∆   and   |X(t) − X(tF,E)| ≤ 2∆.

We also have d(E, F) ≤ d(s, t) < δ, so that |X(tE,F) − X(tF,E)| ≤ δΛ and |X(s) − X(t)| is bounded by

|X(s) − X(tE,F)| + |X(tE,F) − X(tF,E)| + |X(tF,E) − X(t)| ≤ 4∆ + δΛ.

The upper bound does not depend on the particular s, t pair for which d(s, t) < δ. It follows that

ρ(osc(δ, X, S)) ≤ 4ρ(∆) + δρ(Λ) < 4ε/5 + H(N0^2)δ < ε   for δ small enough.

The upper bound has no hidden dependence on the size of S.

The final Example shows that even very weak control over the increments of a process can lead to powerful inequalities if the index set has small packing numbers. The Example makes use of some properties of the Hellinger distance h(P, Q) between two probability measures P and Q, defined by

h^2(P, Q) = µ(√(dP/dµ) − √(dQ/dµ))^2 = 2 − 2µ√((dP/dµ)(dQ/dµ)),

where µ is any measure dominating both P and Q. For product measures the distance reduces to a simple expression involving the marginals,

<17>   h^2(P^n, Q^n) = 2 − 2(1 − (1/2)h^2(P, Q))^n ≤ n h^2(P, Q).

See, for example, Pollard (2001, Problem 4.18).
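A quick numerical check (mine, not from the text) of <17> for Bernoulli marginals:

    import math

    def hell2(p, q):
        """Squared Hellinger distance between Bernoulli(p) and Bernoulli(q)."""
        return (math.sqrt(p) - math.sqrt(q)) ** 2 + (math.sqrt(1 - p) - math.sqrt(1 - q)) ** 2

    p, q, n = 0.3, 0.35, 50
    product = 2.0 - 2.0 * (1.0 - hell2(p, q) / 2.0) ** n   # h^2(P^n, Q^n) from <17>
    print(product, "<=", n * hell2(p, q))                  # the bound in <17>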


<18> Example. Let {Pθ : θ ∈ [0, 1]} be a family of probability measures having densities p(x, θ) with respect to a fixed measure µ on some set X. Let Pθ,n = Pθ ⊗ · · · ⊗ Pθ and µn = µ ⊗ · · · ⊗ µ, both n-fold product measures. For a fixed θ0, abbreviate Pθ0,n to P. Under Pθ0,n the coordinates of the typical point x = (x1, . . . , xn) in X^n become identically distributed random variables, each with distribution Pθ0.

To avoid distracting measurability issues, I suppose θ → p(x, θ) is continuous. I also ignore irrelevant questions regarding uniqueness and measurability of the maximum likelihood estimator (MLE) θn = θn(x1, . . . , xn), which by definition maximizes the likelihood process Ln(θ) := ∏_{i≤n} p(xi, θ).

Suppose there exist positive constants C1 and C2 for which

<19>   C1|θ − θ′| ≤ h(Pθ, Pθ′) ≤ C2|θ − θ′|   for all θ, θ′ ∈ [0, 1].

Remark. Notice that such an inequality can hold only for a bounded parameter set, because h is a bounded metric. Typically it would follow from a slight strengthening of a Hellinger differentiability condition. In a sense, the right metric to use is h. The upper bound in Assumption <19> ensures that the packing numbers under h behave like packing numbers under ordinary Euclidean distance. The lower bound ensures that the expected value of the root-likelihood ratio Zn(t), defined below, decays rapidly as t increases—see inequality <22> below.

Then a chaining argument will show that, for each θ0 in [0, 1], the estimator θn converges at an n^{-1/2}-rate to θ0. More precisely, there exist constants C3 and C4 for which

<20>   Pθ0,n{√n |θn − θ0| ≥ y} ≤ C3 exp(−C4 y^2)   for all y ≥ 0 and all n.

The MLE also maximizes the square root of the likelihood process. The standardized estimator tn = √n(θn − θ0) maximizes, over the interval Tn = {t ∈ R : θ0 + t/√n ∈ [0, 1]}, the process

Zn(t) = √(Ln(θ0 + t/√n) / Ln(θ0)) = ∏_{i≤n} ξ(xi, t)   where ξ(z, t) := √(p(z, θ0 + t/√n) / p(z, θ0)).

Remark. For a more careful treatment, which has no significant effect on the final bound, one should include an indicator function {z ∈ S}, where S = {p(·, θ0) > 0}, into the definition of ξ. And a few equalities should actually be inequalities.


Inequality <20> will come from the fact that Zn(tn) ≥ Zn(0) = 1, which implies for each y0 ≥ 0 that

Pθ0,n{|tn| ≥ y0} ≤ Pθ0,n{sup_{|t|≥y0} Zn(t) ≥ 1} ≤ Pθ0,n sup_{|t|≥y0} Zn(t).

The last expected value involves a supremum over an uncountable set, which sample path continuity allows us to calculate as a limit of maxima over finite sets.

Let me show how Lemma <12> handles half the range, sup_{t≥y0} Zn(t). The argument for the other half is analogous.

To simplify notation, abbreviate Pθ0,n to P, with θ0 fixed for the rest of the Example. Split the set {t ∈ Tn : t ≥ y0} into a union of intervals Jk := [yk, yk+1) ∩ Tn, where yk = y0 + k for k ∈ N. Then

<21>   P sup_{t≥y0} Zn(t) ≤ ∑_{k≥0} P sup_{t∈Jk} Zn(t).

Remark. This technique is sometimes called stratification.

Inequality <19> provides the means for bounding the kth term in the sum <21>. The lower bound from <19> gives us some control over the expected value of Zn:

<22>   PZn(t) ≤ µn ∏_{i≤n} √(p(xi, θ0 + t/√n) p(xi, θ0)) = (1 − (1/2)h^2(Pθ0, Pθ0+t/√n))^n ≤ exp(−(1/2)C1^2 t^2)

for each t in Tn. The upper bound in <19> gives some control over the increments of the Zn process: for t1, t2 ∈ Tn,

P|Zn(t1) − Zn(t2)|^2 = µn|√Ln(θ1) − √Ln(θ2)|^2   where θi = θ0 + ti/√n
   = h^2(Pθ1,n, Pθ2,n)
   ≤ C2^2 |t1 − t2|^2   by <17> and <19>.

By Theorem <10> part (i), it then follows that

<23>   P maxi≤N |Zn(si) − Zn(ti)| / |si − ti| ≤ H(N) := C2 √N


for all finite sets of increments.

Consider the kth summand in <21>. With d as the usual Euclidean metric, Jk has diameter at most 1 and pack(r, Jk, d) ≤ 1/r for r ≤ 1. For a δ0 that needs to be specified, invoke Lemma <12> then pass to the limit as Sm expands up to a countable dense subset to get

P sup_{t∈Jk} Zn(t) ≤ P maxt∈S0 Zn(t) + 4C2 ∫_0^{δ1} √(1/r) dr
   ≤ ∑_{t∈S0} e^{−(1/2)C1^2 t^2} + 4C2 × 2δ1^{1/2}
   ≤ (1/δ0) e^{−(1/2)C1^2 yk^2} + C2 √8 δ0^{1/2}.

The last sum is approximately minimized by choosing δ0 = e^{−(1/3)C1^2 yk^2}, which leads to

P sup_{t≥y0} Zn(t) ≤ (1 + C2 √8) ∑_{k≥0} exp(−C1^2 (y0 + k)^2 / 6).

After some fiddling around with constants, the bound <20> emerges.
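As a concrete test case (my choice, not from the text), take Pθ = N(θ, 1) with θ ∈ [0, 1], for which <19> holds; a short simulation showing the subgaussian behavior of √n|θn − θ0| claimed in <20>:

    import numpy as np

    rng = np.random.default_rng(3)
    theta0, n, reps = 0.5, 100, 20000

    # For N(theta, 1) with theta restricted to [0, 1] the MLE is the clipped mean.
    samples = rng.normal(theta0, 1.0, size=(reps, n))
    mle = np.clip(samples.mean(axis=1), 0.0, 1.0)
    dev = np.sqrt(n) * np.abs(mle - theta0)

    for y in [0.5, 1.0, 1.5, 2.0, 2.5]:
        print(y, f"{(dev >= y).mean():.4f}")   # empirical tails, decaying like exp(-C4*y^2)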

4.6 An alternative to packing: majorizing measures

The modern theory owes much to the work of Dudley (1973), who applied chaining arguments to establish fine control over the sample paths of the so-called isonormal Gaussian process, and Dudley (1978), who initiated the modern theory of empirical processes by adapting the Gaussian methods to processes more general than the classical empirical distribution function. In both cases he constrained the complexity of the index set by using bounds on the covering numbers N(δ) = NT(δ, T, d). For example, for a Gaussian process he proved existence of a version with continuous sample paths under the assumption that

∫_0^D √(log N(r)) dr < ∞   where D = diam(T), the diameter of T.

Equivalently, but more suggestively, the last inequality can be written as

<24>   ∫_0^D Ψ2^{-1}(N(r)) dr < ∞   where Ψ2(x) = exp(x^2) − 1,

which highlights the role of the Ψ2-Orlicz norm in controlling the increments of Gaussian processes.


Work by Fernique and Talagrand showed that assumption <24> can be weakened slightly to get improved bounds, which in some cases are optimal up to multiplicative constants. As originally formulated, the improvements assumed existence of a majorizing measure for a Young function Ψ (= Ψ2 for the Gaussian case): a probability measure µ on the Borel sigma-field of T for which

<25>   sup_{t∈T} ∫_0^D Ψ^{-1}(1/µB[t, r]) dr < ∞,

where B[t, r] denotes the closed ball of radius r and center t.

Remark. When I first encountered <25> I was puzzled by how asingle µ could control the sample paths of a stochastic process. Itseemed not at all obvious that µ could be used to replace the familiarconstructions with covering numbers. It helped me to dig inside someof the proofs to find that the key fact is: if B1, . . . , Bn are disjoint Borelsets with µBi ≥ ε for each i then n ≤ 1/ε. The same idea appears inthe very classical setting of Example <9>, where Lebesgue measureon Rk plays a similar role in bounding covering numbers.
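To see the definition in action, here is a small numerical sketch (entirely my own construction): the uniform probability measure on a finite grid T ⊂ [0, 1], with the integral in <25> (for Ψ = Ψ2) approximated by a Riemann sum over r.

    import numpy as np

    def psi2_inv(y):
        return np.sqrt(np.log(1.0 + y))

    pts = np.linspace(0.0, 1.0, 65)            # a finite T inside [0, 1]
    mu = np.full(pts.size, 1.0 / pts.size)     # uniform probability on T
    radii = np.linspace(0.0, 1.0, 401)         # integrate r from 0 to D = 1
    dist = np.abs(pts[:, None] - pts[None, :])

    worst = 0.0
    for i in range(pts.size):
        # mass of the closed ball B[t_i, r] for each r on the grid
        mass = np.array([mu[dist[i] <= r].sum() for r in radii[1:]])
        integral = float(np.sum(psi2_inv(1.0 / mass) * np.diff(radii)))
        worst = max(worst, integral)
    print(worst)    # the supremum in <25>, finite (and the same for every t here)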

Fernique (1975) proved many results about the sample paths of a centered Gaussian process X. For example, he proved (his Section 6) a result stronger than: existence of a majorizing measure implies that

<26>   sup_{t∈T*} |X(ω, t)| < ∞   for almost all ω.

Slight strengthenings of <25> also imply existence of a version of X with almost all sample paths uniformly continuous.

Talagrand (1987) brilliantly demonstrated the central role of majorizing measures by showing that a condition like <26> implies existence of some probability µ for which <25> holds with d(s, t) = ‖Xs − Xt‖2. He subsequently showed (Talagrand, 2001) that the majorizing measures work their magic by generating a nested sequence {πi} of finite partitions of T, which can be used for chaining arguments analogous to those used by Dudley. (See subsection 4.7.2 and Chapter 7 for details.) Each πi+1 is a refinement of the previous πi. Each point t of T belongs to exactly one member, denoted by Ei(t), of πi. Talagrand allowed the size #πi to grow very rapidly. For Young functions Ψ that grow like exp(x^α) for some α > 0, he showed that <25> is equivalent to the existence of partitions πi with #πi ≤ ni := 2^{2^i} for which

<27>   sup_{t∈T} ∑_{i∈N} Ψ^{-1}(ni) diam(Ei(t)) < ∞.


More precisely, he showed that the ratio of

(the infimum of the sum in <27> over all partitions {πi} with #πi ≤ ni) to (the infimum of the integral in <25> over all majorizing measures)

is bounded above and below by positive constants that depend only on α.

Remark. The subtle point, which distinguishes <27> from the analogous constructions based on covering numbers, is that the uniform bound on the sum does not come via a bound on max_{E∈πi} diam(E).

As Talagrand (2005, bottom of page 13) has pointed out, such uniform control of the diameters is not the best way to proceed, because it tacitly assumes some sort of homogeneity of ‘complexity’ over different subregions of T. By contrast, the partitions constructed from majorizing measures are much more adaptive to local complexities, subdividing more finely in regions where the stochastic process fluctuates more wildly.

In more recent work Talagrand established analogous equivalences between majorizing measures and partitions for nongaussian processes. He has even declared (Talagrand, 2005, Section 1.4) that the measures µ themselves are now “totally obsolete”, with everything one needs to know coming from the nested partitions.

Despite the convincing case made by Talagrand for majorizing measures without measures, I still think there might be a role for the measures themselves in some statistical applications. And in any case, it does help to know something about how an idea evolved when struggling with its fancier modern refinements.

4.7 Chaining with tail probabilities

Start once again from a sequence of finite subsets S0 ⊆ S1 ⊆ · · · ⊆ Sm, with #Si = Ni and the chains defined as in Section 4.1. Again we have a decomposition for each t in Sm,

X(t) − X(t0) = ∑_{i=0}^{m−1} X(ti+1) − X(ti)   with ti = Li(t).

And again assume we have increment control via tail probabilities,

P{|Xs − Xt| ≥ η d(s, t)} ≤ β(η)   for all s, t ∈ T,

with β a decreasing function.


By analogy with the approach followed in Section 4.5 define

Mi+1 := max_{s∈Si+1} |X(s) − X(ℓis)| / δi(s)   where δi(s) := d(s, ℓis),

with the convention that any terms for which δi(s) = 0 are omitted from the max—it matters only that |X(s) − X(ℓis)| ≤ Mi+1 δi(s) for all s ∈ Si+1. For ηi ≥ 0,

P{Mi+1 ≥ ηi} ≤ ∑_{s∈Si+1} P{|X(s) − X(ℓis)| ≥ ηi δi(s)} ≤ Ni+1 β(ηi).

The event B := {Mi+1 ≥ ηi for some 0 ≤ i ≤ m − 1} has probability at most ∑_{i=0}^{m−1} Ni+1 β(ηi). On the event B^c we have

∆m(t) := |X(t) − X(L0t)| ≤ ∑_{i=0}^{m−1} ηi δi(t)   for each t ∈ Sm,

where δi(t) here denotes the length d(Li+1t, Lit) of the ith link in the chain for t. Put another way,

<28>   P{∃t ∈ Sm : ∆m(t) > ∑_{i=0}^{m−1} ηi δi(t)} ≤ ∑_{i=0}^{m−1} Ni+1 β(ηi).

Remark. We could replace Ni+1β(ηi), which is an upper bound onthe probability of a union, by min(1, Ni+1β(ηi)). Of course such amodification typically gains us nothing because <28> is trivial if evenone of the summands on the right-hand side equals 1. Nevertheless, themodification does eliminate some pesky trivial cases from the followingarguments.

Now comes the clever part: How should we choose the Si’s and the ηi’s? I do not know any systematic way to handle that question but I do know some heuristics that have proven useful.

Let me attempt to explain for the specific case of Ψα-norms, for α > 0, with

<29>   β(η) = min(1, 1/Ψα(η)) ≤ Cα exp(−η^α),

first for the traditional approach based on packing numbers, then for the majorizing measure alternative.

Remark. You should not be misled by what follows into believing that all chaining arguments with tail probabilities involve a lot of tedious fiddling. I present some of the messy details just because so few authors seem to explain the reasons for their clever choices of constants.


4.7.1 Tail chaining via packing

Choose the Si as δi-separated subsets with, for example, δi = δ0/2^i. That ensures Ni ≤ pack(δi, T, d) and maxt δi(t) ≤ δi. Inequality <28> becomes

<30>   P{∃t ∈ Sm : ∆m(t) > ∑_{i=0}^{m−1} ηi δi} ≤ ∑_{i=0}^{m−1} Ni+1 β(ηi),

that is,

P{maxt∈Sm |X(t) − X(L0t)| > ∑_{i=0}^{m−1} ηi δi} ≤ ∑_{i=0}^{m−1} Ni+1 β(ηi).

We could choose the ηi to control ∑ ηiδi, then hope for a useful ∑ Ni+1β(ηi); or we could control ∑ Ni+1β(ηi), hoping for a useful ∑ ηiδi.

Here are some heuristics for β as in <29>. To make the right-hand side of <30> small we should expect ∑_{i=0}^{m−1} ηiδi to be larger than P∆m, which an analog of Lemma <12> with H = CΨα^{-1} bounds by a constant multiple of ∑_{i=0}^{m−1} Ψα^{-1}(Ni+1)δi. That suggests we should make ηi a bit bigger than Ψα^{-1}(Ni+1). Try

ηi = Ψα^{-1}(Ni+1 Ψα(yi)) ≤ cα (Ψα^{-1}(Ni+1) + yi)   for some yi > 0.

I write the extra factor in that strange way because it gives a clean bound,

min(1, Ni+1 β(ηi)) = min(1, Ni+1/Ψα(ηi)) ≤ min(1, 1/Ψα(yi)) ≤ Cα e^{−yi^α}.

Inequality <30> then implies

<31>   P{maxt ∆m(t) > cα ∑_{i=0}^{m−1} (Ψα^{-1}(Ni+1) + yi) δi} ≤ Cα ∑_{i=0}^{m−1} e^{−yi^α}.

Now we need to choose a yi sequence to make the sums tractable, then maybe bound sums by integrals to get neater expressions, and so on. If you find such games amusing, look at Problem [2], which guides you to a more elegant bound. For the special case of subgaussian increments it gives existence of positive constants K1, K2, and c for which

P{maxt ∆m(t) > K1 ∫_0^{δp} (y + √(log pack(r, T, d)) + √(log(1/r))) dr} ≤ K2 e^{−cy^2}.

Remark. The integral of √(log pack(r, T, d)) is a surrogate for P maxt ∆m(t).

The presence of a factor K1, which might be much larger than 1, disappoints. With a lot more effort, and sharper techniques, one can get bounds for P{maxt ∆m(t) ≥ P maxt ∆m(t) + . . . }. See, for example, Massart (2000), who showed how tensorization methods (see Chapter 12) can be used to rederive a concentration inequality due to Talagrand (1996).
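Putting the pieces together for α = 2, with the yi sequence suggested by Problem [2] and a packing bound of my own choosing (so the constants are illustrative only):

    import math

    def psi2(x): return math.exp(x * x) - 1.0
    def psi2_inv(y): return math.sqrt(math.log(1.0 + y))

    y, delta0, m = 2.0, 0.5, 15
    deltas = [delta0 / 2 ** i for i in range(m)]
    Ns = [math.ceil(1.0 / d ** 2) + 1 for d in deltas]   # N_i <= pack(delta_i), assumed

    level, prob = 0.0, 0.0
    for i in range(m - 1):
        y_i = math.sqrt(y * y + i)                  # y_i^2 = y^2 + i, cf. Problem [2]
        eta_i = psi2_inv(Ns[i + 1] * psi2(y_i))     # the "strange" choice behind <31>
        level += eta_i * deltas[i]
        prob += min(1.0, Ns[i + 1] / psi2(eta_i))   # each term equals 1/psi2(y_i), summable

    print(f"P{{max_t Delta_m(t) > {level:.2f}}} <= {prob:.4f}")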


4.7.2 Tail chaining via majorizing measures

The calculations depend on two simplifying facts about Ψα:

(i) Ψα(x) ≤ exp(x^α) for all x > 0, so Ψα^{-1}(y) ≥ log^{1/α}(y) for all y > 0;

(ii) there exists an x0 for which Ψα(x) ≥ (1/2)exp(x^α) for all x ≥ x0, so there exists a y0 for which Ψα^{-1}(y) ≤ log^{1/α}(2y) for all y ≥ y0.

As mentioned in Section 4.6, the majorizing measure can be used to construct a nested sequence of partitions πi of the index set T with #πi ≤ Ni := 2^{2^i} such that

KX := sup_{t∈T} ∑_{i∈N} Ψα^{-1}(Ni+1) diam(Ei(t)) < ∞,

where Ei(t) denotes the unique member of πi that contains the index point t.

Remark. If you check back you might notice that originally the ith summand contained Ψα^{-1}(Ni). The change has only a trivial effect on the sum because Ψα^{-1}(y) grows like log^{1/α}(y).

For each E in πi select a point tE ∈ E then define Si = {tE : E ∈ πi}. Clearly we have #Si = #πi. With a little care we can also ensure Si ⊆ Si+1: if F ∈ πi and tF ∈ E ∈ πi+1 then make sure to choose tE = tF. The link function ℓi maps each tE in Si+1 onto the tF for which E ⊆ F ∈ πi.

[Figure: the nested partitions π0, π1, π2, π3 of T, with the corresponding point sets S0 ⊆ S1 ⊆ S2 ⊆ S3.]

Remark. We could also ensure, for any given finite subset S of T, that S0 ⊆ S1 ⊆ · · · ⊆ Sm = S, for some m.

For each t in Sm the chain t → Lm−1(t) → . . . → L0(t) then follows a path through the corresponding sets Em(t) ⊆ Em−1(t) ⊆ · · · ⊆ E0(t) from the partitions. The diameter of the set Ei(t) provides an upper bound for the link length, δi(t) = d(Li+1(t), Li(t)) ≤ diam(Ei(t)). If we choose ηi = yΨα^{-1}(Ni+1) with y large enough then

β(ηi) ≤ 1/Ψα(y log^{1/α}(Ni+1)) ≤ 2 exp(−y^α log Ni+1),

so that inequality <28> implies

P{∃t ∈ Sm : ∆m(t) > yKX}
   ≤ ∑_{i=0}^{m−1} Ni+1/Ψα(ηi)
   ≤ ∑_{i=0}^{m−1} 2 exp(log Ni+1 − y^α log Ni+1)
   ≤ kα exp(−y^α/2)   for some constant kα, if y^α ≥ 2.

With a suitable increase in kα the final inequality holds for all y ≥ 0. That is,

P{maxt |X(t) − X(L0t)| > yKX} ≤ kα e^{−y^α/2}   for all y ≥ 0.

Remark. When integrated ∫_0^∞ . . . dy, the last inequality implies that P maxt∈S Xt is bounded above, uniformly over all finite subsets S of T, by a constant times KX. See Chapter 7 for a converse when X is a centered Gaussian process.

4.8 Brownian Motion, for example

This Section provides one way of comparing the merits of the various chaining methods described in Sections 4.5 and 4.7, using the Brownian motion process {Xt : 0 ≤ t ≤ 1} as a test case.

Remember that X is a stochastic process for which

(i) X0 = 0

(ii) each increment Xs −Xt has a N(0, |s− t|) distribution

(iii) for all 0 ≤ s1 < t1 ≤ s2 < t2 ≤ · · · < tn ≤ 1 the increments X(si) −X(ti) are independent

(iv) each sample path X(·, ω) is a continuous function on [0, 1].

For current purposes it suffices to know that

<32>   P{|Xs − Xt| ≥ ε√|s − t|} ≤ 2e^{−ε^2/2}   for ε ≥ 0,

<33>   ‖Xs − Xt‖Ψ2 ≤ c0 √|s − t|   where c0 = √(8/3).


Of course Ψ2(x) = exp(x^2) − 1, as before. The oscillation (also known as the modulus of continuity) is defined by

osc1(σ) := sup{|Xs − Xt| : |s − t| < σ}.

I have added the subscript 1 as a reminder that the oscillation is calculated for the usual metric d1(s, t) := |s − t| on [0, 1]. Inequalities <32> and <33> suggest that it would be cleaner to work with the metric d2(s, t) := √|s − t|. Note that

osc2(σ) := sup{|Xs − Xt| : d2(s, t) < σ} = osc1(σ^2).

Remark. Note that d2(s, t) is just the L2(Leb) distance between the indicator functions of the intervals [0, s] and [0, t].

A beautiful result of Lévy asserts that

<34>   lim_{σ→0} osc1(σ)/h1(σ) = 1   almost surely, where h1(σ) = √(2σ log(1/σ)).

See McKean (1969, Section 1.6) for a detailed proof (which is similar in spirit to the tail chaining described in subsection 4.8.3) of Lévy’s theorem.

Under the d2 metric the result becomes

lim_{σ→0} osc2(σ)/h2(σ) = 1   almost surely, where h2(σ) = h1(σ^2) = 2σ√(log(1/σ)).

To avoid too much detail I will settle for something less than <34>, namely upper bounds that recover the O(h2(σ)) behavior. For reasons that should soon be clear, it simplifies matters to replace the function h2 by the increasing function (see Problem [5])

h(σ) := σΨ2^{-1}(1/σ^2) = σ√(log(1 + σ^{-2})).

Continuity of the sample paths of the Brownian motion process ensures that osc(σ, Sm, d2) ↑ osc2(σ) as m → ∞, so it suffices to obtain probabilistic bounds that hold uniformly in m before passing to the limit as m tends to infinity.

Define δi = 2^{-i} and sj,i = jδi^2 and Si = {sj,i : j = 0, 1, . . . , 4^i − 1}, so that Ni = #Si = 4^i = (1/δi)^2. The function ℓi that rounds down to an integer multiple of δi^2 has the property that d2(s, ℓis) ≤ δi for all s ∈ [0, 1].


Remark. I omitted the endpoint s = 1 just to avoid having #Si = 1 + 4^i. Not important. You might object that a smaller covering set could be obtained by shifting Si−1 slightly to the right then taking ℓi to map to the nearest integer multiple of δi−1^2. The slight improvement would only affect the constant in the O(h2(σ)), at the cost of a messier argument.

Given a σ < 1 let p be the integer for which δp+1 ≤ σ < δp. The chains will only extend to the Sp-level, rather than to the S0-level. If the change from S0 to Sp disturbs you, you could work with S~i := Si+p and δ~i := δi+p for i = 0, 1, . . . .

The map Lp takes points s < t in Sm to points Lps ≤ Lpt in Sp with d2(Lps, Lpt) ≤ δp + d2(s, t). If d2(s, t) < σ then d2(Lps, Lpt) < 2δp, which means that either Lps = Lpt or Lpt = Lps + δp^2. Define

∆m,p := maxt∈Sm |X(t) − X(Lpt)| ≤ ∑_{i=p}^{m−1} max_{s∈Si+1} |X(s) − X(ℓis)|,
Λp := max{|X(sj,p) − X(sj+1,p)| : j = 0, 1, . . . , 4^p − 1}.

Then for s, t ∈ Sm,

|Xs − Xt| ≤ |X(s) − X(Lps)| + |X(Lps) − X(Lpt)| + |X(Lpt) − X(t)| ≤ ∆m,p + Λp + ∆m,p,

which leads to the bound

<35>   osc(σ, Sm, d2) ≤ 2∆m,p + Λp.

From here on it is just a matter of applying the various maximal inequalitiesto the ∆m,p and Λp terms.
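Before the formal bounds, a Monte Carlo look (mine; a discrete grid, so only an approximation to the supremum) at how osc2(σ) tracks h(σ):

    import numpy as np

    rng = np.random.default_rng(4)
    n = 1 << 14                                  # grid of n steps on [0, 1]
    X = np.concatenate([[0.0], np.cumsum(rng.normal(scale=np.sqrt(1.0 / n), size=n))])

    def osc2(sigma):
        """Approximate sup{|Xs - Xt| : |s - t| < sigma^2}, restricted to the grid."""
        w = max(1, int(sigma ** 2 * n))          # grid lags with d1 about sigma^2
        return max(float(np.abs(X[lag:] - X[:-lag]).max()) for lag in range(1, w + 1))

    for k in range(2, 8):
        sigma = 2.0 ** -k
        h = sigma * np.sqrt(np.log(1.0 + sigma ** -2.0))
        o = osc2(sigma)
        print(f"sigma = 2^-{k}: osc2 = {o:.3f}, h(sigma) = {h:.3f}, ratio = {o / h:.2f}")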

4.8.1 Control by (conditional) expectations

For each event B with β := PB > 0, inequality <33> and Theorem <10> part (ii) give

<36>   PB maxi≤N |X(si) − X(ti)| / d2(si, ti) ≤ H(N) := c0 Ψ2^{-1}(N/β)   where c0 = √(8/3).

The inequality Ψ2^{-1}(u) + Ψ2^{-1}(v) ≥ Ψ2^{-1}(uv) (see the Problems for Chapter 2) separates the N and β contributions. In particular,

δi H(Ni) = c0 δi Ψ2^{-1}(δi^{-2} β^{-1}) ≤ c0 δi (Ψ2^{-1}(1/δi^2) + Ψ2^{-1}(1/β)) = c0 (h(δi) + δi γ)   where γ := Ψ2^{-1}(1/β).


Inequality <36> gives a neat expression for the conditional expectation of the upper bound in <35>:

PB Λp ≤ δp H(Np) ≤ c0 h(δp) + c0 δp γ

and, by a minor modification of Lemma <12>,

PB ∆m,p ≤ ∑_{i=p}^{m−1} δi H(Ni+1) ≤ 2 ∑_{i=p}^{∞} c0 (h(δi+1) + δi+1 γ).

By Problem [5] the h(δi) decrease geometrically fast: h(δi+1) < 0.77 h(δi) for all i. It follows that there exists a universal constant C for which

PB osc(σ, Sm, d2) ≤ (C/2)(h(δp+1) + δp+1 Ψ2^{-1}(1/β)) ≤ C(h(σ) + σΨ2^{-1}(1/PB)).

Now let m tend to infinity to obtain an analogous upper bound for PB osc2(σ). If we choose B equal to the whole sample space the Ψ2^{-1}(1/PB) term is superfluous. We have P osc2(σ) ≤ Ch(σ), which is sharp up to a multiplicative constant: Problem [6] uses the independence of the Brownian motion increments to show that P osc2(σ) ≥ ch(σ) for all 0 < σ ≤ 1/2, where c is a positive constant.

Now for the surprising part. If, for an x > 0, we choose

B = {osc2(σ) ≥ Ch(σ) + Cσx}

then

Ch(σ) + Cσx ≤ PB osc2(σ) ≤ Ch(σ) + CσΨ2^{-1}(1/PB).

Solve for PB to conclude that

<37>   P{osc2(σ) ≥ Ch(σ) + Cσx} ≤ 1/Ψ2(x),

a clean subgaussian tail bound.

4.8.2 Control by Ψ2-norm

Theorem <10> part (iv) gives

‖maxi≤N |X(si) − X(ti)| / d2(si, ti)‖Ψ2 ≤ H(N) := C0 Ψ2^{-1}(N),

which is the same as <36> except that PB has been replaced by ‖·‖Ψ2 and β = 1 (and the constant is different). With those modifications, repeat the


argument from subsection 4.8.1 to deduce that ‖osc(σ, Sm, d2)‖Ψ2 ≤ Ch(σ), for some new constant C, that is,

PΨ2(osc(σ, Sm, d2) / Ch(σ)) ≤ 1.

Again let m increase to infinity to deduce that ‖osc2(σ)‖Ψ2 ≤ Ch(σ), which is clearly an improvement over P osc2(σ) ≤ Ch(σ). For example, we can immediately deduce that ‖osc2(σ)‖r ≤ Cr h(σ) for each r ≥ 1. We also get a tail bound,

P{osc2(σ) ≥ xCh(σ)} ≤ 1/Ψ2(x)   for x > 0,

which is inferior to inequality <37>.

4.8.3 Control by tail probabilities

Once again start from inequality <35>,

osc(σ, Sm, d2) ≤ 2∆m,p + Λp.

For nonnegative constants λp and ηi, whose choice will soon be considered, define y = δpλp + ∑_{i=p}^{m−1} δiηi. Then, as in Section 4.1,

<38>   P{osc(σ, Sm, d2) ≥ y}
      ≤ P{Λp ≥ δpλp} + ∑_{i=p}^{m−1} P{max_{s∈Si+1} |X(s) − X(ℓis)| ≥ δiηi}
      ≤ 2Np e^{−λp^2/2} + ∑_{i=p}^{m−1} 2Ni+1 e^{−ηi^2/2}
      ≤ 2 exp(log Np − λp^2/2) + ∑_{i=p}^{m−1} 2 exp(log Ni+1 − ηi^2/2).

How to choose the ηi’s and λp? For the sake of comparison with the inequality <37> obtained by the clever choice of the conditioning event B, let me try for an upper bound that is a constant multiple of e^{−x^2/2}, for a given x ≥ 0.

Consider the first term in the bound <38>. To make the exponential exactly equal to e^{−x^2/2} we should choose

λp = √(2 log Np + x^2) ≤ √(2 log Np) + x.

The contribution to y is then at most

δp √(2 log(1/δp^2)) + δp x ≤ √2 h(δp) + 2σx.


A similar idea works for each of the ηi’s, except that we need to add on a little bit more to keep the sum bounded as m goes off to infinity. If the terms were to decrease geometrically then the whole sum would be bounded by a multiple of the first term, which would roughly match the Λp contribution. With those thoughts in mind, choose

ηi = √(2 log Ni+1 + x^2 + 2 log 2^{i−p}) ≤ √(2 log Ni+1) + x + √(2(i − p) log 2)

so that

∑_{i=p}^{m−1} 2 exp(log Ni+1 − ηi^2/2) ≤ 4e^{−x^2/2}

and ∑_{i=p}^{m−1} δiηi is less than

∑_{i=p}^{∞} (2δi+1 √(2 log(1/δi+1^2)) + xδi) + δp ∑_{k=0}^{∞} δk √(2k log 2)
   ≤ √8 ∑_{i=p}^{∞} h(δi+1) + δp(x + c1) ≤ c2 (h(σ) + σx)

for universal constants c1 and c2. (I absorbed the σc1 into the h(σ) term.) With these simplifications, inequality <38> gives a clean bound,

P{osc(σ, Sm, d2) > Ch(σ) + Cσx} ≤ 5e^{−x^2/2}

for some universal constant C. Here I very cunningly changed the ≥ to a > to ensure a clean passage to the limit, via

{osc(σ, Sm, d2) > y} ↑ {osc2(σ) > y}   as m → ∞.

In the limit we get an inequality very similar to <37>.

4.9 Problems

[1] Prove the 1934 result of Kolmogorov that is cited at the start of Section 4.10. (It might help to look at Chapter 5 first.)

[2] In inequality <31> put yi = (y^α + i − p)^{1/α} for some nonnegative y. Use the fact that for each decreasing nonnegative function g on R+ and nonnegative integers a and b,

∑_{i=a}^{b} g(i) ≤ g(a) + ∫_a^b g(r) dr,

to deduce the existence of positive constants K1, K2, and c for which

P{∆m,p > K1 ∫_0^{δp} (y + Ψα^{-1}(pack(r, T, d)) + log^{1/α}(1/r)) dr} ≤ K2 e^{−cy^α}

for all y ≥ 0.

[3] (How to build a majorizing measure via packing numbers.) Suppose $\Psi$ is a Young function for which $\int_0^1 \Psi^{-1}(1/r)\,dr < \infty$ and
\[
\Psi^{-1}(uv) \le C_0\bigl(\Psi^{-1}(u) + \Psi^{-1}(v)\bigr) \qquad\text{if } \min(u,v) \ge 1,
\]
for some positive constant $C_0$. Suppose also that
\[
\int_0^D \Psi^{-1}\bigl(\mathrm{pack}(r, T, d)\bigr)\,dr < \infty \qquad\text{where } D = \mathrm{diam}(T).
\]
Define $\delta_i = D/2^i$ for $i = 0, 1, 2, \dots$ and let $S_i$ be a maximal $\delta_i$-separated subset of $T$, with $N_i := \#S_i = \mathrm{pack}(\delta_i, T, d)$, as described near the end of Section 4.2. Define $\mu_i$ to be the uniform probability distribution on $S_i$, that is, mass $1/N_i$ at each point of $S_i$, and $\mu = \sum_{i\ge 0} 2^{-i-1}\mu_i$.

(i) For each $t \in T$ and $r > \delta_i$ show that $\mu B[t,r] \ge 2^{-i-1}\mu_i B[t,r] \ge (2^{i+1}N_i)^{-1}$. Hint: Could $S_i \cap B[t,r]$ be empty?

(ii) By splitting the range of integration into intervals where $\delta_{i-1} \ge r > \delta_i$, deduce (cf. Section 4.3) that
\[
\int_0^D \Psi^{-1}\Bigl(\frac{1}{\mu B[t,r]}\Bigr)\,dr \le \sum\nolimits_{i\ge 1} 2C_0\delta_i\bigl(\Psi^{-1}(2^{i+1}) + \Psi^{-1}(N_i)\bigr) < \infty.
\]
That is, $\mu$ is a majorizing measure.
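For concreteness, here is a minimal Python sketch of the construction, under assumptions made only for this illustration: a finite grid in $[0,1]$ stands in for $T$, $d(s,t) = |s-t|$, $\Psi = \Psi_2$ (so $\Psi^{-1}(u) = \sqrt{\log(1+u)}$), and $\mu$ is truncated at ten levels.

\begin{verbatim}
import math

def greedy_packing(points, delta):
    # a maximal delta-separated subset, built greedily
    S = []
    for p in points:
        if all(abs(p - s) >= delta for s in S):
            S.append(p)
    return S

D = 1.0
grid = [k / 1000 for k in range(1001)]          # stand-in for T
levels = []                                     # pairs (2^{-i-1}, S_i)
for i in range(10):                             # truncate mu at ten levels
    levels.append((2.0 ** (-i - 1), greedy_packing(grid, D / 2 ** i)))

def mu_ball(t, r):
    # mu B[t,r] for mu = sum_i 2^{-i-1} * uniform(S_i)
    return sum(w * sum(abs(s - t) <= r for s in S) / len(S) for w, S in levels)

def mm_integral(t, steps=200):
    # midpoint Riemann sum for int_0^D Psi_2^{-1}(1 / mu B[t,r]) dr
    dr = D / steps
    return sum(math.sqrt(math.log(1 + 1 / mu_ball(t, (k + 0.5) * dr))) * dr
               for k in range(steps))

print(max(mm_integral(t) for t in (0.0, 0.25, 0.5, 1.0)))
\end{verbatim}

The printed value stays modest, as part (ii) predicts.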

[4] Suppose $\{\pi_i : i \in \mathbb{N}\}$ is a nested sequence of partitions of $T$ with $\#\pi_i \le n_i = 2^{2^i}$ and
\[
\sup_{t\in T} \sum\nolimits_{i\in\mathbb{N}} \Psi^{-1}(2^i n_i)\,\mathrm{diam}(E_i(t)) < \infty,
\]
where $E_i(t)$ denotes the member of $\pi_i$ that contains $t$. Define $S_i$ to consist of exactly one point from each $E$ in $\pi_i$. Mimic the method from Problem [3] to show that $\sup_{t\in T} \int_0^D \Psi^{-1}(1/\mu B[t,r])\,dr < \infty$ for some Borel probability measure $\mu$.

[5] From the fact that $g(y) = \Psi_2(y)/y^2$ is an increasing function on $\mathbb{R}^+$, deduce that the function
\[
h(\sigma) = 1\Big/\sqrt{g\bigl(\Psi_2^{-1}(1/\sigma^2)\bigr)} = \sigma\Psi_2^{-1}(1/\sigma^2) = \sigma\sqrt{\log(1 + \sigma^{-2})}
\]
is increasing on $\mathbb{R}^+$. More precisely, for each $\alpha \in (0,1)$, show that $h(\alpha y)/h(y) = \sqrt{g_\alpha(y)}$ where $y \mapsto g_\alpha(y)$ is a decreasing function with $\sqrt{g_{1/2}(1)} < 0.77$.
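Both claims are easy to confirm numerically; a minimal sketch (the grid of $\sigma$ values is an arbitrary choice):

\begin{verbatim}
import math

h = lambda s: s * math.sqrt(math.log(1 + s ** -2))

sigmas = [0.01 * k for k in range(1, 301)]
vals = [h(s) for s in sigmas]
assert all(u < v for u, v in zip(vals, vals[1:]))  # increasing on the grid

print(h(0.5) / h(1.0))  # sqrt(g_{1/2}(1)) = about 0.762 < 0.77
\end{verbatim}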

[6] Suppose $Z_1, \dots, Z_n$ are independent random variables, each distributed $N(0,1)$ (with density $\phi$). Define $M_n = \max_{i\le n} |Z_i|$. For a positive constant $c$, let $x_n = x_n(c)$ be the value for which $\mathbb{P}\{|Z_i| \ge x_n\} = c/n$.

(i) Show that $\mathbb{P}\{M_n \le x_n\} = (1 - c/n)^n \to e^{-c}$ as $n \to \infty$.

(ii) If $0 \le x + 1 \le \sqrt{2\log n}$, show that $\mathbb{P}\{|Z_i| \ge x\} \ge 2\phi(1 + x) \ge n^{-1}\sqrt{2/\pi}$. Deduce that $x_n(\sqrt{2/\pi}\,) + 1 > \sqrt{2\log n}$.

(iii) Deduce that there exists some positive constant $C$ for which $\mathbb{P}M_n \ge C\sqrt{\log n}$ for all $n$.

(iv) For $X$ a standard Brownian motion as in Section 4.8, deduce from (iii) that $\mathbb{P}\,\mathrm{osc}(\sigma, X) \ge c\sqrt{\sigma\log(1/\sigma)}$ for all $0 < \sigma \le 1/2$, where $c$ is a positive constant. Hint: Write $2^{m/2}X_1$ as a sum of $2^m$ independent standard normals.
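A small Monte Carlo illustration of parts (i) and (ii), with $x_n$ computed from the exact normal tail via erfc; the sample sizes are arbitrary choices for this sketch.

\begin{verbatim}
import math
import random

def x_n(c, n):
    # solve P{|Z| >= x} = erfc(x / sqrt(2)) = c/n by bisection
    lo, hi = 0.0, 10.0
    for _ in range(80):
        mid = (lo + hi) / 2
        if math.erfc(mid / math.sqrt(2)) > c / n:
            lo = mid
        else:
            hi = mid
    return lo

random.seed(0)
n, c, reps = 2000, 1.0, 1000
xn = x_n(c, n)
hits = sum(max(abs(random.gauss(0, 1)) for _ in range(n)) <= xn
           for _ in range(reps))
print(f"x_n = {xn:.3f}, sqrt(2 log n) = {math.sqrt(2 * math.log(n)):.3f}")
print(f"P(M_n <= x_n) about {hits / reps:.3f}; exp(-c) = {math.exp(-c):.3f}")
\end{verbatim}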

[7] (A sharpening of the result from the previous Problem.) The classical bounds (Feller, 1968, Section VII.1 and Problem 7.1) show that the normal tail probability $\bar\Phi(x) = \mathbb{P}\{N(0,1) > x\}$ behaves asymptotically like $\phi(x)/x$. More precisely, $1 - x^{-2} \le x\bar\Phi(x)/\phi(x) \le 1$ for all $x > 0$. Less precisely, $-\log\bigl(c_0\bar\Phi(x)\bigr) = \tfrac12 x^2 + \log x + O(x^{-2})$ as $x \to \infty$, where $c_0 = \sqrt{2\pi}$.

(i) (Compare with Leadbetter et al. 1983, Theorem 1.5.3.) Define $a_n = \sqrt{2\log n}$ and $L_n = \log a_n$. For each constant $\tau$ define $m_{\tau,n} = a_n - (1+\tau)L_n/a_n$. Show that
\[
c_0\bar\Phi(m_{\tau,n}) = n^{-1}\exp\bigl(\tau L_n + o(1)\bigr).
\]

(ii) Define $M_n = \max_{i\le n} Z_i$ for iid $N(0,1)$ random variables $Z_1, Z_2, \dots$. Show that
\[
\mathbb{P}\{M_n \le m_{\tau,n}\} = \bigl(1 - \bar\Phi(m_{\tau,n})\bigr)^n = \exp\bigl(-a_n^\tau(c_0^{-1} + o(1))\bigr).
\]

(iii) If $n$ increases rapidly enough, show that there exists an interval $I_n$ of length $o(L_n/a_n)$ containing the point $a_n - L_n/a_n$ for which $\mathbb{P}\{M_n \notin I_n\} \to 0$ very fast.
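Part (i) invites a numerical check; here is a minimal sketch using the exact normal tail (the values of $n$ and $\tau$ are arbitrary choices).

\begin{verbatim}
import math

c0 = math.sqrt(2 * math.pi)
tail = lambda x: 0.5 * math.erfc(x / math.sqrt(2))   # P{N(0,1) > x}

for n in (10 ** 4, 10 ** 8, 10 ** 12):
    a_n = math.sqrt(2 * math.log(n))
    L_n = math.log(a_n)
    for tau in (-1.0, 0.0, 1.0):
        m = a_n - (1 + tau) * L_n / a_n
        ratio = c0 * tail(m) / (math.exp(tau * L_n) / n)
        print(f"n = 1e{round(math.log10(n))}, tau = {tau:+.0f}: "
              f"ratio = {ratio:.3f}")
\end{verbatim}

The ratio drifts toward 1 as $n$ grows, in line with the $o(1)$ term.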

4.10 Notes

Credit for the idea of chaining as a method of successive approximations clearly belongs to Kolmogorov, at least for the case of a one-dimensional index set. For example, at the start of the paper of Chentsov (1956):


In 1934 A. N. Kolmogorov proved the following

Theorem. If $\xi(t)$ is a separable (see [1]) stochastic process, $0 \le t \le 1$, and
\[
(1)\qquad M|\xi(t_1) - \xi(t_2)|^p < C|t_1 - t_2|^{1+r},
\]
where $p > 0$, $r > 0$ and $C$ is a constant independent of $t$, then the trajectories of the process are continuous with probability 1.

A generalization of this theorem is the following proposition which was suggested to the author by A. N. Kolmogorov: . . .

The statement of the Theorem was footnoted by the comment “This theorem was first published in a paper by E. E. Slutskii [2]”, with a reference to a 1937 paper that I have not seen. See Billingsley (1968, Section 12) for a small generalization of the theorem (with credit to Kolmogorov, via Slutsky, and Chentsov) and a chaining proof.

See Dudley (1973, Section 1) and Dudley (1999, Section 1.2 and Notes) for more about packing and covering. The definitive early work is due to Kolmogorov and Tikhomirov (1959).

Dudley (1973) used chaining with packing/covering numbers and tail inequalities to establish various probabilistic bounds for Gaussian processes. Dudley (1978) adapted the methods, using the Bernstein inequality together with metric entropy and inclusion assumptions (now called bracketing; see Chapter 13), to extend the Gaussian techniques to empirical processes indexed by collections of sets. He also derived bounds for processes indexed by VC classes of sets (see Chapter 9) via symmetrization (see Chapter 8) arguments. In each case he controlled the increments of the empirical processes by exponential inequalities like those in the chapter on the Hoeffding and Bennett inequalities.

Pisier (1983) is usually credited with realizing that the entropy methods used for Gaussian processes could also be extended to non-Gaussian processes with Orlicz-norm control of the increments. However, as Pisier (page 127) remarked:

For the proof of this theorem, we follow essentially [10]; I have included a slight improvement over [10] which was kindly pointed out to me by X. Fernique. Moreover, I should mention that N. Kono [6] proved a result which is very close to the above; at the time of [10], I was not aware of Kono's paper [6].

Here [10] = Pisier (1980) and [6] = Kono (1980). The earlier paper [10] included extensive discussion of other precursors for the idea. See also the Notes to Section 2.6 of Dudley (1999).


Using methods like those in Section 4.5, Nolan and Pollard (1988) proved a functional central limit theorem for the U-statistic analog of the empirical process. Kim and Pollard (1990) and Pollard (1990) proved limit theorems for a variety of statistical estimators using second-moment control for suprema of empirical processes.

My analysis in Example <18> is based on arguments of Ibragimov and Has'minskii (1981, Section 1.5), with the chaining bound replacing their method for deriving maximal inequalities. The analysis could be extended to unbounded subsets of $\mathbb{R}$ by similar adaptations of their arguments for unbounded sets.

See Pollard (1985) for one way to use a form of oscillation bound (under the name stochastic differentiability) to establish central limit theorems for M-estimators. Pakes and Pollard (1989, Lemma 2.17) used a property more easily recognized as oscillation around a fixed index point.

References

Billingsley, P. (1968). Convergence of Probability Measures. New York: Wiley.

Chentsov, N. N. (1956). Weak convergence of stochastic processes whose trajectories have no discontinuities of the second kind and the “heuristic” approach to the Kolmogorov-Smirnov tests. Theory of Probability and Its Applications 1 (1), 140–144.

Dudley, R. M. (1973). Sample functions of the Gaussian process. Annals of Probability 1, 66–103.

Dudley, R. M. (1978). Central limit theorems for empirical measures. Annals of Probability 6, 899–929.

Dudley, R. M. (1999). Uniform Central Limit Theorems. Cambridge University Press.

Dudley, R. M. (2003). Real Analysis and Probability (2nd ed.). Cambridge Studies in Advanced Mathematics. Cambridge University Press.

Feller, W. (1968). An Introduction to Probability Theory and Its Applications (third ed.), Volume 1. New York: Wiley.


Fernique, X. (1975). Régularité des trajectoires des fonctions aléatoires gaussiennes. Springer Lecture Notes in Mathematics 480, 1–97. École d'Été de Probabilités de Saint-Flour IV, 1974.

Ibragimov, I. A. and R. Z. Has'minskii (1981). Statistical Estimation: Asymptotic Theory. New York: Springer. (English translation of the 1979 Russian edition.)

Kim, J. and D. Pollard (1990). Cube root asymptotics. Annals of Statistics 18, 191–219.

Kolmogorov, A. N. and V. M. Tikhomirov (1959). ε-entropy and ε-capacity of sets in function spaces. Uspekhi Mat. Nauk 14 (2 (86)), 3–86. Review by G. G. Lorentz at MathSciNet MR0112032. Included as paper 7 in Volume 3 of the Selected Works of A. N. Kolmogorov.

Kono, N. (1980). Sample path properties of stochastic processes. J. Math. Kyoto Univ. 20 (2), 295–313.

Leadbetter, M. R., G. Lindgren, and H. Rootzén (1983). Extremes and Related Properties of Random Sequences and Processes. Springer-Verlag.

Ledoux, M. and M. Talagrand (1991). Probability in Banach Spaces: Isoperimetry and Processes. New York: Springer.

Massart, P. (2000). About the constants in Talagrand's concentration inequalities for empirical processes. The Annals of Probability 28 (2), 863–884.

McKean, H. P. (1969). Stochastic Integrals. Academic Press.

Nolan, D. and D. Pollard (1988). Functional limit theorems for U-processes. Annals of Probability 16, 1291–1298.

Pakes, A. and D. Pollard (1989). Simulation and the asymptotics of optimization estimators. Econometrica 57, 1027–1058.

Pisier, G. (1980). Conditions d'entropie assurant la continuité de certains processus et applications à l'analyse harmonique. In Séminaire d'analyse fonctionnelle, 1979–80, pp. 1–41. École Polytechnique, Palaiseau. Available from http://archive.numdam.org/.

Pisier, G. (1983). Some applications of the metric entropy condition to harmonic analysis. Springer Lecture Notes in Mathematics 995, 123–154.


Pollard, D. (1985). New ways to prove central limit theorems. Econometric Theory 1, 295–314.

Pollard, D. (1989). Asymptotics via empirical processes (with discussion). Statistical Science 4, 341–366.

Pollard, D. (1990). Empirical Processes: Theory and Applications, Volume 2 of NSF-CBMS Regional Conference Series in Probability and Statistics. Hayward, CA: Institute of Mathematical Statistics.

Pollard, D. (2001). A User's Guide to Measure Theoretic Probability. Cambridge University Press.

Talagrand, M. (1987). Regularity of Gaussian processes. Acta Mathematica 159, 99–149.

Talagrand, M. (1996). New concentration inequalities in product spaces. Inventiones Mathematicae 126, 505–563.

Talagrand, M. (2001). Majorizing measures without measures. The Annals of Probability 29 (1), 411–417.

Talagrand, M. (2005). The Generic Chaining: Upper and Lower Bounds of Stochastic Processes. Springer-Verlag.
