
Chapter 2

Moments and tails

Moments capture useful information about the tail of a random variable. In this chapter we recall a few inequalities quantifying this intuition. Although they are often straightforward to derive, such inequalities are surprisingly powerful. We illustrate their use on a range of applications. In particular we discuss three of the most fundamental tools in discrete probability: the first moment method, the second moment method, and the Chernoff-Cramér method.

2.1 Background

We start with a few basic definitions and standard inequalities.

2.1.1 Definitions

Moments As a quick reminder, let $X$ be a random variable with $\mathbb{E}|X|^k < +\infty$ for some non-negative integer $k$. In that case we write $X \in L^k$. Recall that the quantities $\mathbb{E}[X^k]$ and $\mathbb{E}[(X - \mathbb{E}X)^k]$, which are well-defined when $X \in L^k$, are called respectively the $k$-th moment and $k$-th central moment of $X$. The first moment and the second central moment are the mean and variance, the square root of which is the standard deviation. A random variable is said to be centered if its mean is $0$. Recall that for a non-negative random variable $X$, the $k$-th moment can be expressed as
$$\mathbb{E}[X^k] = \int_0^{+\infty} k x^{k-1}\, \mathbb{P}[X > x]\, \mathrm{d}x. \qquad (2.1)$$

The moment-generating function of $X$ is the function
$$M_X(s) = \mathbb{E}\left[e^{sX}\right],$$
defined for all $s \in \mathbb{R}$ where it is finite, which includes at least $s = 0$. If $M_X(s)$ is defined on $(-s_0, s_0)$ for some $s_0 > 0$ then $X$ has finite moments of all orders and the following expansion holds
$$M_X(s) = \sum_{k \ge 0} \frac{s^k}{k!}\, \mathbb{E}[X^k], \qquad |s| < s_0.$$
One more piece of notation: if $A$ is an event and $X \in L^1$, then we use the shorthand
$$\mathbb{E}[X; A] = \mathbb{E}[X \mathbf{1}_A].$$

Tails We refer to a probability of the form $\mathbb{P}[X \ge x]$ as an upper tail (or right tail) probability. Typically $x$ is greater than the mean or median of $X$. Similarly we refer to $\mathbb{P}[X \le x]$ as a lower tail (or left tail) probability. Our general goal in this chapter is to bound tail probabilities using moments and moment-generating functions.

Tail probabilities arise naturally in many contexts, as events of interest can often be framed in terms of a random variable being unusually large or small. Such probabilities are often hard to compute directly however. As we will see in this chapter, moments offer an effective means to control tail probabilities for two main reasons: 1) moments contain information about tails, as (2.1) makes explicit for instance; and 2) they are typically easier to compute, or at least to estimate.

2.1.2 Basic inequalities

Markov's inequality Our first bound on the tail of a random variable is Markov's inequality. In words, for a non-negative random variable: the heavier the tail, the larger the expectation. This simple inequality is in fact a key ingredient in more sophisticated tail bounds as we will see.

Theorem 2.1 (Markov's inequality). Let $X$ be a non-negative random variable. Then, for all $b > 0$,
$$\mathbb{P}[X \ge b] \le \frac{\mathbb{E}X}{b}. \qquad (2.2)$$

Proof.
$$\mathbb{E}X \ge \mathbb{E}[X; X \ge b] \ge \mathbb{E}[b; X \ge b] = b\,\mathbb{P}[X \ge b].$$

Figure 2.1: Proof of Markov's inequality: taking expectations of the two functions depicted above yields the inequality.

See Figure 2.1 for a proof by picture. Note that this inequality is non-trivial only when $b > \mathbb{E}X$.

Chebyshev's inequality An application of Markov's inequality to the random variable $|X - \mathbb{E}X|^2$ gives a classical tail inequality featuring the second moment of a random variable.

Theorem 2.2 (Chebyshev's inequality). Let $X$ be a random variable with $\mathbb{E}X^2 < +\infty$. Then, for all $\beta > 0$,
$$\mathbb{P}[|X - \mathbb{E}X| > \beta] \le \frac{\mathrm{Var}[X]}{\beta^2}. \qquad (2.3)$$

Proof. This follows immediately by applying (2.2) to $|X - \mathbb{E}X|^2$ with $b = \beta^2$.

Of course this bound is non-trivial only when $\beta$ is larger than the standard deviation.

Example 2.3. Let $X$ be a Gaussian random variable with mean $0$ and variance $\sigma^2$. A direct computation shows that $\mathbb{E}|X| = \sigma\sqrt{2/\pi}$. Hence Markov's inequality gives
$$\mathbb{P}[|X| \ge b] \le \frac{\mathbb{E}|X|}{b} = \sqrt{\frac{2}{\pi}} \cdot \frac{\sigma}{b},$$
while Chebyshev's inequality gives
$$\mathbb{P}[|X| \ge b] \le \left(\frac{\sigma}{b}\right)^2.$$

Figure 2.2: Comparison of Markov's and Chebyshev's inequalities: the squared deviation from the mean (in green) gives a better approximation of the indicator function (in grey) close to the mean (here 0) than the absolute deviation (in orange).

For $b$ large enough, Chebyshev's inequality produces a stronger bound. See Figure 2.2 for some insight.
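As a quick numerical illustration (not part of the formal development), the following Python sketch compares the exact two-sided Gaussian tail with the Markov and Chebyshev bounds of Example 2.3; the unit variance and the grid of values of $b$ are arbitrary choices of this sketch.

```python
# Compare the exact Gaussian tail P[|X| >= b] with the Markov and Chebyshev
# bounds of Example 2.3, for X ~ N(0, sigma^2).
import math

sigma = 1.0
for b in [1.0, 2.0, 4.0, 8.0]:
    exact = math.erfc(b / (sigma * math.sqrt(2)))   # exact P[|X| >= b]
    markov = math.sqrt(2 / math.pi) * sigma / b     # E|X| / b
    chebyshev = (sigma / b) ** 2                    # Var[X] / b^2
    print(f"b={b:4.1f}  exact={exact:.2e}  markov={markov:.2e}  chebyshev={chebyshev:.2e}")
```

For small $b$ Markov's bound is the better of the two, while for large $b$ Chebyshev's bound wins, as noted above; both are far from the exact exponential decay.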

Example 2.4 (Coupon collector's problem). Let $(X_i)$ be i.i.d. uniform random variables over $[n]$. Let $T_{n,k}$ be the first time that $k$ elements of $[n]$ have been picked, i.e.,
$$T_{n,k} = \inf\{i : |\{X_1, \ldots, X_i\}| = k\},$$
with $T_{n,0} := 0$. We prove that the time it takes to pick all elements at least once, or "collect each coupon," has the following tail. For any $\varepsilon > 0$, we have as $n \to +\infty$:

Claim 2.5.
$$\mathbb{P}\left[\left|T_{n,n} - n\sum_{j=1}^n j^{-1}\right| \ge \varepsilon\, n \log n\right] \to 0.$$

To prove this claim we note that the time elapsed between $T_{n,k-1}$ and $T_{n,k}$, which we denote by $\tau_{n,k} := T_{n,k} - T_{n,k-1}$, is geometric with success probability $1 - \frac{k-1}{n}$. And all $\tau_{n,k}$'s are independent. So, by standard results on geometric variables (e.g. [Dur10, Example 1.6.5]), the expectation and variance of $T_{n,n}$ are
$$\mathbb{E}[T_{n,n}] = \sum_{i=1}^n \left(1 - \frac{i-1}{n}\right)^{-1} = n \sum_{j=1}^n j^{-1} \sim n \log n, \qquad (2.4)$$
and
$$\mathrm{Var}[T_{n,n}] \le \sum_{i=1}^n \left(1 - \frac{i-1}{n}\right)^{-2} = n^2 \sum_{j=1}^n j^{-2} \le n^2 \sum_{j=1}^{+\infty} j^{-2} = \Theta(n^2). \qquad (2.5)$$
So by Chebyshev's inequality (Theorem 2.2)
$$\mathbb{P}[|T_{n,n} - \mathbb{E}[T_{n,n}]| \ge \varepsilon\, n \log n] \le \frac{\mathrm{Var}[T_{n,n}]}{(\varepsilon\, n \log n)^2} \le \frac{n^2 \sum_{j=1}^{+\infty} j^{-2}}{(\varepsilon\, n \log n)^2} \to 0,$$
by (2.4) and (2.5).
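For readers who like to experiment, here is a small simulation sketch (an illustration, not from the text) comparing samples of $T_{n,n}$ with $n\sum_j j^{-1}$ from (2.4); the value $n = 1000$, the number of samples, and the helper name collect_time are arbitrary choices.

```python
# Sample the coupon collector time T_{n,n} and compare with n * H_n, where
# H_n is the n-th harmonic number, as in (2.4).
import numpy as np

rng = np.random.default_rng(0)
n = 1000
harmonic = sum(1.0 / j for j in range(1, n + 1))

def collect_time(n, rng):
    seen = set()
    t = 0
    while len(seen) < n:
        seen.add(int(rng.integers(n)))   # pick a uniform coupon in [n]
        t += 1
    return t

samples = [collect_time(n, rng) for _ in range(200)]
print("n * H_n           =", n * harmonic)
print("empirical mean    =", np.mean(samples))
print("empirical std dev =", np.std(samples))   # O(n), much smaller than n log n
```

The empirical standard deviation of order $n$ is what makes the $\varepsilon\, n\log n$ window in Claim 2.5 easy to satisfy.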

2.2 First moment method

We begin with techniques based on the first moment. First recall that the expectation of a random variable has an elementary, yet handy property: linearity. That is, if random variables $X_1, \ldots, X_n$ defined on a joint probability space have finite first moments, then
$$\mathbb{E}[X_1 + \cdots + X_n] = \mathbb{E}[X_1] + \cdots + \mathbb{E}[X_n], \qquad (2.6)$$
without any further assumption. In particular linearity holds whether or not the $X_i$'s are independent.

2.2.1 The probabilistic method

A key technique of probabilistic combinatorics is the so-called probabilistic method. The idea is that one can establish the existence of an object satisfying a certain property without having to construct one explicitly. Instead one argues that a randomly chosen object exhibits the given property with positive probability. The following "obvious" observation, sometimes referred to as the first moment principle, plays a key role in this context.

Theorem 2.6 (First moment principle). Let $X$ be a random variable with finite expectation. Then, for any $\mu \in \mathbb{R}$,
$$\mathbb{E}X \le \mu \implies \mathbb{P}[X \le \mu] > 0.$$

Proof. We argue by contradiction: assume $\mathbb{E}X \le \mu$ and $\mathbb{P}[X \le \mu] = 0$. Write $\{X \le \mu\} = \bigcap_{n \ge 1}\{X < \mu + 1/n\}$. That implies by monotonicity that, for any $\varepsilon \in (0,1)$, $\mathbb{P}[X < \mu + 1/n] < \varepsilon$ for $n$ large enough. Hence
$$\mu \ge \mathbb{E}X = \mathbb{E}[X; X < \mu + 1/n] + \mathbb{E}[X; X \ge \mu + 1/n] \ge \mu\,\mathbb{P}[X < \mu + 1/n] + (\mu + 1/n)(1 - \mathbb{P}[X < \mu + 1/n]) > \mu,$$
a contradiction.

The power of this principle is easier to appreciate on an example.

Example 2.7 (Balancing vectors). Let $\mathbf{v}_1, \ldots, \mathbf{v}_n$ be arbitrary unit vectors in $\mathbb{R}^n$. How small can we make the norm of the combination
$$x_1 \mathbf{v}_1 + \cdots + x_n \mathbf{v}_n$$
by appropriately choosing $x_1, \ldots, x_n \in \{-1, +1\}$? We claim that it can be as small as $\sqrt{n}$, for any collection of $\mathbf{v}_i$'s. At first sight, this may appear to be a complicated geometry problem. But the proof is trivial once one thinks of choosing the $x_i$'s at random. Let $X_1, \ldots, X_n$ be independent random variables uniformly distributed in $\{-1, +1\}$. Then
$$\mathbb{E}\|X_1 \mathbf{v}_1 + \cdots + X_n \mathbf{v}_n\|^2 = \mathbb{E}\left[\sum_{i,j} X_i X_j\, \mathbf{v}_i \cdot \mathbf{v}_j\right] = \sum_{i,j} \mathbb{E}[X_i X_j\, \mathbf{v}_i \cdot \mathbf{v}_j] \qquad (2.7)$$
$$= \sum_{i,j} \mathbf{v}_i \cdot \mathbf{v}_j\, \mathbb{E}[X_i X_j] = \sum_i \|\mathbf{v}_i\|^2 = n,$$
where we used the linearity of expectation in (2.7). But note that a discrete random variable $Z = \|X_1 \mathbf{v}_1 + \cdots + X_n \mathbf{v}_n\|^2$ with expectation $\mathbb{E}Z = n$ must take a value $\le n$ with positive probability by the first moment principle (Theorem 2.6). In other words, there must be a choice of $X_i$'s such that $Z \le n$. That proves the claim.
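A quick illustrative sketch (not from the text): random signs achieve squared norm about $n$ on average, so some choice achieves squared norm at most $n$. The random unit vectors, the dimension, and the number of trials below are arbitrary.

```python
# Random +/-1 signs balance arbitrary unit vectors: E||sum x_i v_i||^2 = n.
import numpy as np

rng = np.random.default_rng(1)
n = 50
V = rng.normal(size=(n, n))
V /= np.linalg.norm(V, axis=1, keepdims=True)      # n arbitrary unit vectors (rows)

sq_norms = []
for _ in range(2000):
    x = rng.choice([-1.0, 1.0], size=n)
    sq_norms.append(np.linalg.norm(x @ V) ** 2)    # ||sum_i x_i v_i||^2

print("average squared norm :", np.mean(sq_norms))   # close to n
print("best squared norm    :", np.min(sq_norms))    # at most n with high prob.
print("target n             :", n)
```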

Here is a slightly more subtle example of the probabilistic method, where one has to modify the original random choice.

Example 2.8 (Independent sets). Let $G = (V, E)$ be a $d$-regular graph with $n$ vertices and $m = nd/2$ edges, where $d \ge 1$. Our goal is to derive a lower bound on the size, $\alpha(G)$, of the largest independent set in $G$. Recall that an independent set is a set of vertices in a graph, no two of which are adjacent. Again, at first sight, this may seem like a rather complicated graph-theoretic problem. But an appropriate random choice gives a non-trivial bound. Specifically, we claim that $\alpha(G) \ge n/2d$.

The proof proceeds in two steps:

1. We first prove the existence of a subset $S$ of vertices with relatively few edges.

2. We remove vertices from $S$ to obtain an independent set.

Let $0 < p < 1$ to be chosen below. To form the set $S$, pick each vertex in $V$ independently with probability $p$. Letting $X$ be the number of vertices in $S$, we have by the linearity of expectation that
$$\mathbb{E}X = \mathbb{E}\left[\sum_{v \in V} \mathbf{1}_{v \in S}\right] = np,$$
where we used that $\mathbb{E}[\mathbf{1}_{v \in S}] = p$. Letting $Y$ be the number of edges between vertices in $S$, we have by the linearity of expectation that
$$\mathbb{E}Y = \mathbb{E}\left[\sum_{\{i,j\} \in E} \mathbf{1}_{i \in S}\mathbf{1}_{j \in S}\right] = \frac{nd}{2}\, p^2,$$
where we also used that $\mathbb{E}[\mathbf{1}_{i \in S}\mathbf{1}_{j \in S}] = p^2$ by independence. Hence, subtracting,
$$\mathbb{E}[X - Y] = np - \frac{nd}{2}\, p^2,$$
which, as a function of $p$, is maximized at $p = 1/d$ where it takes the value $n/2d$. As a result, by the first moment principle (Theorem 2.6) there must exist a set $S$ of vertices in $G$ such that
$$|S| - |\{\{i,j\} \in E : i, j \in S\}| \ge n/2d. \qquad (2.8)$$

For each edge $e$ connecting two vertices in $S$, remove one of the end-vertices of $e$. Then, by (2.8), the resulting subset of vertices has at least $n/2d$ vertices, with no edge between them. This proves the claim.

Note that a graph $G$ made of $n/(d+1)$ cliques of size $d+1$ (with no edge between the cliques) has $\alpha(G) = n/(d+1)$, showing that our bound is tight up to a constant. This is known as a Turán graph.

Remark 2.9. The previous result can be strengthened to
$$\alpha(G) \ge \sum_{v \in V} \frac{1}{d_v + 1},$$
for a general graph $G = (V, E)$, where $d_v$ is the degree of $v$. This bound is achieved for Turán graphs. See, e.g., [AS11, The probabilistic lens: Turán's theorem].
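Here is a small sketch (an illustration, not from the text) of the deletion argument in Example 2.8: sample $S$ with $p = 1/d$, then delete one endpoint of every edge inside $S$. The circulant test graph, its parameters, and the helper names are arbitrary choices standing in for a generic $d$-regular graph.

```python
# Random-subset-then-delete argument of Example 2.8 on a simple d-regular graph.
import numpy as np

rng = np.random.default_rng(2)

def circulant_d_regular(n, d):
    # deterministic circulant graph with even degree d, used as a test instance
    edges = set()
    for v in range(n):
        for k in range(1, d // 2 + 1):
            edges.add(tuple(sorted((v, (v + k) % n))))
    return edges

n, d = 200, 4
edges = circulant_d_regular(n, d)

best = 0
for _ in range(200):
    S = set(v for v in range(n) if rng.random() < 1 / d)
    for i, j in edges:
        if i in S and j in S:
            S.discard(i)            # remove one endpoint of each bad edge
    best = max(best, len(S))        # S is now an independent set

print("independent set found:", best, "  guarantee n/(2d) =", n / (2 * d))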

The previous example also illustrates the important indicator trick, i.e., writing a random variable as a sum of indicators, which is often used in combination with the linearity of expectation.

2.2.2 Union bound

Markov's inequality (Theorem 2.1) can be interpreted as a quantitative version of the first moment principle (Theorem 2.6). In this context, it is often stated in the following special form.

Theorem 2.10 (First moment method). If $X$ is a non-negative, integer-valued random variable, then
$$\mathbb{P}[X > 0] \le \mathbb{E}X. \qquad (2.9)$$

Proof. Take $b = 1$ in Markov's inequality (Theorem 2.1).

Theorem 2.6 implies that, if a non-negative integer-valued random variable $X$ has expectation smaller than $1$, then its value is $0$ with positive probability. Theorem 2.10 adds: if $X$ has "small" expectation, then its value is $0$ with "large" probability. This simple fact is typically used in the following manner: one wants to show that a certain "bad event" does not occur with probability approaching $1$; the random variable $X$ then counts the number of such "bad events." In that case, $X$ is a sum of indicators and Theorem 2.10 reduces to the standard union bound, also known as Boole's inequality.

Corollary 2.11. Let $B_n = A_{n,1} \cup \cdots \cup A_{n,m_n}$, where $A_{n,1}, \ldots, A_{n,m_n}$ is a collection of events for each $n$. Then, letting
$$\mu_n := \sum_{i=1}^{m_n} \mathbb{P}[A_{n,i}],$$
we have $\mathbb{P}[B_n] \le \mu_n$. In particular, if $\mu_n \to 0$ as $n \to +\infty$, then $\mathbb{P}[B_n] \to 0$.

Proof. This is of course a fundamental property of probability measures. (Or take $X := X_n = \sum_{i=1}^{m_n} \mathbf{1}_{A_{n,i}}$ in Theorem 2.10.)

A useful generalization of the union bound is given in Exercise 2.1. Applications of Theorem 2.10 are often referred to as the first moment method. We give a few more examples.

Example 2.12 (Random $k$-SAT threshold). For $r \in \mathbb{R}_+$, let $\Phi_{n,r} : \{0,1\}^n \to \{0,1\}$ be a random $k$-CNF formula on $n$ Boolean variables $z_1, \ldots, z_n$ with $rn$ clauses. (For simplicity, assume $rn$ is an integer.) That is, $\Phi_{n,r}$ is an AND of $rn$ ORs, each obtained by picking independently $k$ literals uniformly at random (with replacement). Recall that a literal is a variable $z_i$ or its negation $\bar{z}_i$. The formula $\Phi_{n,r}$ is said to be satisfiable if there exists an assignment $z$ such that $\Phi_{n,r}(z) = 1$. Clearly the higher the value of $r$, the less likely it is for $\Phi_{n,r}$ to be satisfiable. In fact it is conjectured that there exists an $r_k^* \in \mathbb{R}_+$ such that
$$\lim_{n \to \infty} \mathbb{P}[\Phi_{n,r} \text{ is satisfiable}] = \begin{cases} 0, & \text{if } r > r_k^*, \\ 1, & \text{if } r < r_k^*. \end{cases}$$
Studying such threshold phenomena is a major theme of modern discrete probability. Using the first moment method, we give an upper bound on this conjectured threshold. Specifically, we claim that
$$r > 2^k \ln 2 \implies \limsup_{n \to \infty} \mathbb{P}[\Phi_{n,r} \text{ is satisfiable}] = 0.$$
How to start the proof should be obvious: let $X_n$ be the number of satisfying assignments of $\Phi_{n,r}$. Applying the first moment method, since
$$\mathbb{P}[\Phi_{n,r} \text{ is satisfiable}] = \mathbb{P}[X_n > 0],$$
it suffices to show that $\mathbb{E}X_n \to 0$. To compute $\mathbb{E}X_n$, we use the indicator trick
$$X_n = \sum_{z \in \{0,1\}^n} \mathbf{1}_{\{z \text{ satisfies } \Phi_{n,r}\}}.$$
There are $2^n$ possible assignments, each of which satisfies $\Phi_{n,r}$ with probability $(1 - 2^{-k})^{rn}$. Indeed note that the $rn$ clauses are independent and each clause literal picked is satisfied with probability $1/2$. Therefore, by the assumption on $r$, for some $\varepsilon > 0$
$$\mathbb{E}X_n = 2^n (1 - 2^{-k})^{rn} < 2^n (1 - 2^{-k})^{(2^k \ln 2)(1+\varepsilon)n} < 2^n e^{-(\ln 2)(1+\varepsilon)n} = 2^{-\varepsilon n} \to 0.$$

Remark 2.13. For $k \ge 3$, it has been shown that if $r < 2^k \ln 2 - k$ then
$$\liminf_{n \to \infty} \mathbb{P}[\Phi_{n,r} \text{ is satisfiable}] = 1.$$
See [ANP05]. For the $k = 2$ case, the conjecture above has been established and $r_2^* = 1$ [CR92].
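As a rough empirical sketch (not from the text, and with strong finite-size effects at this scale), one can brute-force the satisfiability probability of very small random $k$-CNF formulas around the first-moment bound $2^k \ln 2$. All parameters and helper names below are arbitrary choices.

```python
# Brute-force satisfiability probability of small random k-CNF formulas.
import itertools, math, random

random.seed(0)
k, n, trials = 3, 10, 30

def random_formula(n, r):
    m = int(r * n)
    # each clause: k literals (variable index, required value) with replacement
    return [[(random.randrange(n), random.choice([0, 1])) for _ in range(k)]
            for _ in range(m)]

def satisfiable(formula, n):
    for z in itertools.product([0, 1], repeat=n):
        if all(any(z[i] == s for i, s in clause) for clause in formula):
            return True
    return False

for r in [2.0, 4.0, 2 ** k * math.log(2)]:      # first-moment bound ~ 5.55 for k=3
    frac = sum(satisfiable(random_formula(n, r), n) for _ in range(trials)) / trials
    print(f"r = {r:5.2f}   fraction satisfiable = {frac:.2f}")
```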

2.2.3 Random permutations: longest increasing subsequence

In this section, we bound the expected length of a longest increasing subsequence in a random permutation. Let $\sigma_n$ be a uniformly random permutation of $[n] := \{1, \ldots, n\}$ and let $L_n$ be the length of a longest increasing subsequence of $\sigma_n$.

Claim 2.14.
$$\mathbb{E}L_n = \Theta(\sqrt{n}).$$

Proof. We first prove that
$$\limsup_{n \to \infty} \frac{\mathbb{E}L_n}{\sqrt{n}} \le e,$$
which implies half of the claim. Bounding the expectation of $L_n$ is not straightforward as it is the expectation of a maximum. A natural way to proceed is to find a value $\ell$ for which $\mathbb{P}[L_n \ge \ell]$ is "small." More formally, we bound the expectation as follows
$$\mathbb{E}L_n \le \ell\,\mathbb{P}[L_n < \ell] + n\,\mathbb{P}[L_n \ge \ell] \le \ell + n\,\mathbb{P}[L_n \ge \ell], \qquad (2.10)$$
for an $\ell$ chosen below. To bound the probability on the r.h.s., we appeal to the first moment method by letting $X_n$ be the number of increasing subsequences of length $\ell$. We also use the indicator trick, i.e., we think of $X_n$ as a sum of indicators over subsequences (not necessarily increasing) of length $\ell$. There are $\binom{n}{\ell}$ such subsequences, each of which is increasing with probability $1/\ell!$. Note that these subsequences are not independent. Nevertheless, by the linearity of expectation and the first moment method,
$$\mathbb{P}[L_n \ge \ell] = \mathbb{P}[X_n > 0] \le \mathbb{E}X_n = \frac{1}{\ell!}\binom{n}{\ell} \le \frac{n^\ell}{(\ell!)^2} \le \frac{n^\ell}{e^2 [\ell/e]^{2\ell}} \le \left(\frac{e\sqrt{n}}{\ell}\right)^{2\ell},$$
where we used a standard bound on factorials recalled in Section A.1.1. Note that, in order for this bound to go to $0$, we need $\ell > e\sqrt{n}$. The first claim follows by taking $\ell = (1+\delta) e\sqrt{n}$ in (2.10), for an arbitrarily small $\delta > 0$.

For the other half of the claim, we show that
$$\frac{\mathbb{E}L_n}{\sqrt{n}} \ge 1.$$
This part does not rely on the first moment method (and may be skipped). We seek a lower bound on the expected length of a longest increasing subsequence. The proof uses the following two ideas. First observe that there is a natural symmetry between the lengths of the longest increasing and decreasing subsequences: they are identically distributed. Moreover if a permutation has a "short" longest increasing subsequence, then intuitively it must have a "long" decreasing subsequence, and vice versa. Combining these two observations gives a lower bound on the expectation of $L_n$. Formally, let $D_n$ be the length of a longest decreasing subsequence. By symmetry and the arithmetic mean-geometric mean inequality, note that
$$\mathbb{E}L_n = \mathbb{E}\left[\frac{L_n + D_n}{2}\right] \ge \mathbb{E}\sqrt{L_n D_n}.$$
We show that $L_n D_n \ge n$, which proves the claim. We use a clever combinatorial argument. Let $L_n^{(k)}$ be the length of a longest increasing subsequence ending at position $k$, and similarly for $D_n^{(k)}$. It suffices to show that the pairs $(L_n^{(k)}, D_n^{(k)})$, $1 \le k \le n$, are distinct. Indeed, noting that $L_n^{(k)} \le L_n$ and $D_n^{(k)} \le D_n$, the number of pairs in $[L_n] \times [D_n]$ is at most $L_n D_n$, which must then be at least $n$. Let $1 \le j < k \le n$. If $\sigma_n(k) > \sigma_n(j)$ then we see that $L_n^{(k)} > L_n^{(j)}$ by appending $\sigma_n(k)$ to the subsequence ending at position $j$ achieving $L_n^{(j)}$. The opposite holds for the decreasing case, which implies that $(L_n^{(j)}, D_n^{(j)})$ and $(L_n^{(k)}, D_n^{(k)})$ must be distinct. This combinatorial argument is known as the Erdős-Szekeres theorem. That concludes the proof of the second claim.

Remark 2.15. It has been shown that in fact
$$\mathbb{E}L_n = 2\sqrt{n} + c\, n^{1/6} + o(n^{1/6}),$$
where $c = -1.77\ldots$ [BDJ99].
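A quick empirical sketch (not from the text): estimate $\mathbb{E}[L_n]/\sqrt{n}$ using the standard patience-sorting algorithm for longest increasing subsequences, which is not discussed here but computes $L_n$ in $O(n\log n)$ time. The sample sizes are arbitrary.

```python
# Estimate E[L_n]/sqrt(n) for a uniform random permutation via patience sorting.
import bisect
import numpy as np

rng = np.random.default_rng(3)

def lis_length(perm):
    piles = []                          # piles[i] = smallest possible tail of an
    for x in perm:                      # increasing subsequence of length i+1
        i = bisect.bisect_left(piles, x)
        if i == len(piles):
            piles.append(x)
        else:
            piles[i] = x
    return len(piles)

n, trials = 2000, 100
vals = [lis_length(rng.permutation(n)) for _ in range(trials)]
print("E[L_n]/sqrt(n) estimate:", np.mean(vals) / np.sqrt(n))   # known limit is 2
```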

2.2.4 Percolation on $\mathbb{Z}^2$: non-trivial threshold

We use the first moment method to prove the existence of a non-trivial threshold in bond percolation on lattices.

Threshold in bond percolation Consider bond percolation on the two-dimensional lattice $\mathbb{L}^2$ with density $p$ and let $\mathbb{P}_p$ denote the corresponding measure. Writing $x \leftrightarrow y$ if $x, y \in \mathbb{L}^2$ are connected by an open path, recall that the open cluster of $x$ is
$$\mathcal{C}_x := \{y \in \mathbb{Z}^2 : x \leftrightarrow y\}.$$
The percolation function is defined as
$$\theta(p) := \mathbb{P}_p[|\mathcal{C}_0| = +\infty],$$
i.e., $\theta(p)$ is the probability that the origin is connected by open paths to infinitely many vertices. It is intuitively clear that the function $\theta(p)$ is non-decreasing. Indeed consider the following alternative representation of the percolation process: to each edge $e$, assign a uniform $[0,1]$ random variable $U_e$ and declare the edge open if $U_e \le p$. Using the same $U_e$'s for densities $p_1 < p_2$, it follows immediately from the monotonicity of the construction that $\theta(p_1) \le \theta(p_2)$. (We will have much more to say about this type of "coupling" argument in Chapter 4.) Moreover note that $\theta(0) = 0$ and $\theta(1) = 1$. The critical value is defined as
$$p_c(\mathbb{L}^2) = \sup\{p \ge 0 : \theta(p) = 0\},$$
the point at which the probability that the origin is contained in an infinite open cluster becomes positive. Note that by a union bound over all vertices, when $\theta(p) = 0$, we have that $\mathbb{P}_p[\exists x, |\mathcal{C}_x| = +\infty] = 0$. Conversely, because $\{\exists x, |\mathcal{C}_x| = +\infty\}$ is a tail event, by Kolmogorov's 0-1 law (e.g. [Dur10, Theorem 2.5.1]) it holds that $\mathbb{P}_p[\exists x, |\mathcal{C}_x| = +\infty] = 1$ when $\theta(p) > 0$.

Using the first moment method we show that the critical value is non-trivial, i.e., it is strictly between $0$ and $1$. This is another example of a threshold phenomenon.

Claim 2.16.
$$p_c(\mathbb{L}^2) \in (0, 1).$$

Proof. We first show that, for any $p < 1/3$, $\theta(p) = 0$. In order to apply the first moment method, roughly speaking, we need to reduce the problem to counting the number of instances of an appropriately chosen sub-structure. The key observation is the following:

An infinite $\mathcal{C}_0$ must contain an open self-avoiding path starting at $0$ of infinite length and, as a result, of all lengths.

Hence, we let $X_n$ be the number of open self-avoiding paths of length $n$ starting at $0$. Then, by monotonicity,
$$\mathbb{P}[|\mathcal{C}_0| = +\infty] \le \mathbb{P}[\cap_n \{X_n > 0\}] = \lim_n \mathbb{P}[X_n > 0] \le \limsup_n \mathbb{E}[X_n], \qquad (2.11)$$
where the last inequality follows from Theorem 2.10. We bound the number of self-avoiding paths by noting that they cannot backtrack. That gives $4$ choices at the first step, and at most $3$ choices at each subsequent step. Hence, we get the following bound
$$\mathbb{E}X_n \le 4(3^{n-1}) p^n. \qquad (2.12)$$
The r.h.s. in (2.12) goes to $0$ when $p < 1/3$. When combined with (2.11), that proves the first part of the claim. (Why did we use self-avoiding, rather than unconstrained, paths?)

For the other direction, we show that $\theta(p) > 0$ for $p$ close enough to $1$. This time, we count "dual cycles." This type of proof is known as a contour argument, or Peierls' argument, and is based on the following construction. Consider the dual lattice $\widehat{\mathbb{L}}^2$ whose vertices are $\mathbb{Z}^2 + (1/2, 1/2)$ and whose edges connect vertices $u, v$ with $\|u - v\|_1 = 1$. See Figure 2.3. Note that each edge in the primal lattice $\mathbb{L}^2$ has a unique corresponding edge in the dual lattice which crosses it perpendicularly. We make the same assignment, open or closed, for corresponding primal and dual edges. The following graph-theoretic lemma, whose proof is sketched below, forms the basis of contour arguments.

Lemma 2.17 (Contour lemma). If $|\mathcal{C}_0| < +\infty$, then there is a closed self-avoiding cycle around the origin in the dual lattice $\widehat{\mathbb{L}}^2$.

Figure 2.3: Primal (black) and dual (blue) lattices.

To prove that $\theta(p) > 0$ for $p$ close enough to $1$, the idea is to use the first moment method with $X_n$ equal to the number of closed self-avoiding dual cycles of length $n$ surrounding the origin. We bound from above the number of dual cycles of length $n$ around the origin by the number of choices for the starting edge across the upper $y$-axis and for each of the $n-1$ subsequent non-backtracking choices. Namely,
$$\mathbb{P}[|\mathcal{C}_0| < +\infty] \le \mathbb{P}[\exists n \ge 4,\ X_n > 0] \le \sum_{n \ge 4} \mathbb{P}[X_n > 0] \le \sum_{n \ge 4} \mathbb{E}X_n \le \sum_{n \ge 4} \frac{n}{2}\, 3^{n-1} (1-p)^n$$
$$= \frac{3^3(1-p)^4}{2} \sum_{m \ge 1} (m+3)(3(1-p))^{m-1} = \frac{3^3(1-p)^4}{2}\left(\frac{1}{(1 - 3(1-p))^2} + 3\cdot\frac{1}{1 - 3(1-p)}\right),$$
when $p > 2/3$ (where the first term on the last line comes from differentiating the geometric series). This expression can be taken smaller than $1$ if we let $p \to 1$. We have shown that $\theta(p) > 0$ for $p$ close enough to $1$, and that concludes the proof. (Exercise 2.2 sketches a proof that $\theta(p) > 0$ for $p > 2/3$.)
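As a rough finite-box illustration (not from the text), the following sketch estimates the probability that the origin of an $L \times L$ box is connected to the boundary by open edges, a crude finite proxy for $\theta(p)$; the box size, trial count, and helper names are arbitrary, and the true critical value $p_c(\mathbb{L}^2) = 1/2$ is quoted from Remark 2.18 below.

```python
# Finite-box proxy for theta(p): does the origin reach the boundary of the box?
import numpy as np
from collections import deque

rng = np.random.default_rng(4)

def origin_reaches_boundary(L, p, rng):
    # vertices (x, y), 0 <= x, y <= 2L; the origin sits at (L, L);
    # each nearest-neighbor edge is open independently with probability p
    open_right = rng.random((2 * L + 1, 2 * L)) < p    # edge (x,y) -- (x,y+1)
    open_down = rng.random((2 * L, 2 * L + 1)) < p     # edge (x,y) -- (x+1,y)
    seen, queue = {(L, L)}, deque([(L, L)])
    while queue:
        x, y = queue.popleft()
        if x in (0, 2 * L) or y in (0, 2 * L):
            return True                                # reached the boundary
        nbrs = []
        if open_right[x, y]:     nbrs.append((x, y + 1))
        if open_right[x, y - 1]: nbrs.append((x, y - 1))
        if open_down[x, y]:      nbrs.append((x + 1, y))
        if open_down[x - 1, y]:  nbrs.append((x - 1, y))
        for v in nbrs:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False

L, trials = 30, 200
for p in [0.3, 0.5, 0.7]:
    hits = sum(origin_reaches_boundary(L, p, rng) for _ in range(trials))
    print(f"p = {p:.1f}   P[0 <-> boundary of box] ~ {hits / trials:.2f}")
```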

It is straightforward to extend the claim to $\mathbb{L}^d$. Exercise 2.3 asks for the details.

Proof of the contour lemma We conclude this section by sketching the proof of the contour lemma, which relies on topological arguments.

Proof of Lemma 2.17. Assume $|\mathcal{C}_0| < +\infty$. Imagine identifying each vertex in $\mathbb{L}^2$ with a square of side $1$ centered around it so that the sides line up with dual edges. Paint green the squares of vertices in $\mathcal{C}_0$. Paint red the squares of vertices in $\mathcal{C}_0^c$ which share a side with a green square. Leave the other squares white. Let $u_0$ be the highest vertex in $\mathcal{C}_0$ along the $y$-axis and let $v_0$ be the dual vertex corresponding to the upper left corner of the square of $u_0$. Because $u_0$ is highest, it must be that the square above it is red. Walk along the dual edge $\{v_0, v_1\}$ separating the squares of $u_0$ and $u_0 + (0,1)$ from $v_0$ to $v_1$. Notice that this edge satisfies what we call the red-green property: a red square sits on your left and a green square is on your right. Proceed further by iteratively walking along an incident dual edge with the following rule. Choose an edge satisfying the red-green property, with the edges to your left, straight ahead, and to your right in decreasing order of priority (i.e., choose the highest priority edge that satisfies the red-green property). Stop when a previously visited dual vertex is reached. The claim is that this procedure constructs the desired cycle. Let $v_0, v_1, v_2, \ldots$ be the dual vertices visited. By construction $\{v_{i-1}, v_i\}$ is a dual edge for all $i$.

- (A dual cycle is produced) We first argue that this procedure cannot get stuck. Let $\{v_{i-1}, v_i\}$ be the edge just crossed and assume that it has the red-green property. If there is a green square to the left ahead, then the edge to the left, which has highest priority, has the red-green property. If the left square ahead is not green, but the right one is, then the left square must in fact be red by construction. In that case, the edge straight ahead has the red-green property. Finally, if neither square ahead is green, then the right square must in fact be red because the square behind to the right is green by assumption. That implies that the edge to the right has the red-green property. Hence we have shown that the procedure does not get stuck. Moreover, because by assumption the number of green squares is finite, this procedure must eventually terminate when a previously visited dual vertex is reached, forming a cycle.

- (The origin lies within the cycle) The inside of a cycle in the plane is well-defined by the Jordan curve theorem. So the dual cycle produced above has its adjacent green squares either on the inside (negative orientation) or on the outside (positive orientation). In the former case, the origin must lie inside the cycle as otherwise the vertices corresponding to the green squares on the inside would not be in $\mathcal{C}_0$. So it remains to consider the latter case where, for similar reasons, the origin is outside the cycle.

  Let $v_j$ be the repeated dual vertex. Assume first that $v_j \ne v_0$ and let $v_{j-1}$ and $v_{j+1}$ be the dual vertices preceding and following $v_j$ during the first visit to $v_j$. Let $v_k$ be the dual vertex preceding $v_j$ on the second visit. After traversing $\{v_{j-1}, v_j\}$, $v_k$ cannot be to the left or to the right because in those cases the red-green property of the two corresponding edges are not compatible. So $v_k$ is straight ahead and, by the priority rules, $v_{j+1}$ must be to the left. But in that case, for the origin to lie outside the cycle and for the cycle to avoid the path $v_0, \ldots, v_{j-1}$, we must traverse the cycle with a negative orientation, i.e., the green squares adjacent to the cycle must be on the inside, a contradiction.

  So, finally, assume $v_0$ is the repeated vertex. If the cycle is traversed with a positive orientation and the origin is on the outside, it must be that the cycle crosses the $y$-axis at least once above $u_0 + (0,1)$, a contradiction.

Hence we have shown that the origin is inside the cycle. That concludes the proof.

Remark 2.18. It turns out that $p_c(\mathbb{L}^2) = 1/2$. We will prove $p_c(\mathbb{L}^2) \ge 1/2$, known as Harris' theorem, in Section 4.3.6. The other direction is due to Kesten [Kes80].

2.3 Second moment method

The first moment method gives an upper bound on the probability that a non-negative, integer-valued random variable is positive, provided its expectation is small enough. In this section we seek a lower bound on that probability. We first note that a large expectation does not suffice in general. Say $X_n$ is $n^2$ with probability $1/n$, and $0$ otherwise. Then $\mathbb{E}X_n = n \to +\infty$, yet $\mathbb{P}[X_n > 0] \to 0$. That is, although the expectation diverges, the probability that $X_n$ is positive can be arbitrarily small.

So we turn to the second moment. Intuitively the basis for the so-called second moment method is that, if the expectation of $X_n$ is large and its variance is relatively small, then we can bound the probability that $X_n$ is close to $0$. As we will see in applications, the first and second moment methods often work hand in hand.

2.3.1 Paley-Zygmund inequality

As an immediate corollary of Chebyshev's inequality (Theorem 2.2), we get a first version of the second moment method: if the standard deviation of $X$ is less than its expectation, then the probability that $X$ is $0$ is bounded away from $1$. See Figure 2.4. Formally, let $X$ be a non-negative, integer-valued random variable (not identically zero). Then
$$\mathbb{P}[X > 0] \ge 1 - \frac{\mathrm{Var}[X]}{(\mathbb{E}X)^2}. \qquad (2.13)$$
Indeed, by (2.3),
$$\mathbb{P}[X = 0] \le \mathbb{P}[|X - \mathbb{E}X| \ge \mathbb{E}X] \le \frac{\mathrm{Var}[X]}{(\mathbb{E}X)^2}.$$

Figure 2.4: Second moment method: if the standard deviation $\sigma_X$ of $X$ is less than its expectation $\mu_X$, then the probability that $X$ is $0$ is bounded away from $1$.

The following tail inequality, a simple application of Cauchy-Schwarz, leads to an improved version of the second moment method.

Theorem 2.19 (Paley-Zygmund inequality). Let $X$ be a non-negative random variable. For all $0 < \theta < 1$,
$$\mathbb{P}[X \ge \theta\,\mathbb{E}X] \ge (1 - \theta)^2\, \frac{(\mathbb{E}X)^2}{\mathbb{E}[X^2]}. \qquad (2.14)$$

Proof. We have
$$\mathbb{E}X = \mathbb{E}[X \mathbf{1}_{\{X < \theta \mathbb{E}X\}}] + \mathbb{E}[X \mathbf{1}_{\{X \ge \theta \mathbb{E}X\}}] \le \theta\,\mathbb{E}X + \sqrt{\mathbb{E}[X^2]\,\mathbb{P}[X \ge \theta\,\mathbb{E}X]},$$
where we used Cauchy-Schwarz. Rearranging gives the result.

As an immediate application:

Theorem 2.20 (Second moment method). Let $X$ be a non-negative random variable (not identically zero). Then
$$\mathbb{P}[X > 0] \ge \frac{(\mathbb{E}X)^2}{\mathbb{E}[X^2]}. \qquad (2.15)$$

Proof. Take $\theta \downarrow 0$ in (2.14).

Since
$$\frac{(\mathbb{E}X)^2}{\mathbb{E}[X^2]} = 1 - \frac{\mathrm{Var}[X]}{(\mathbb{E}X)^2 + \mathrm{Var}[X]},$$
we see that (2.15) is stronger than (2.13). We typically apply the second moment method to a sequence of random variables $(X_n)$. The previous theorem gives a uniform lower bound on the probability that $\{X_n > 0\}$ when $\mathbb{E}[X_n^2] \le C(\mathbb{E}[X_n])^2$ for some $C > 0$.

Just like the first moment method, the second moment method is often applied to a sum of indicators (but see Section 2.3.3 for a weighted case).
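A tiny numerical sanity check (not part of the text) of the second moment method (2.15) for a Binomial variable; the parameters are arbitrary.

```python
# For X ~ Bin(n, lambda/n), compare P[X > 0] with the bounds (2.15) and (2.13).
n, lam = 1000, 2.0
p = lam / n
mean = n * p
var = n * p * (1 - p)
second_moment = var + mean ** 2

prob_positive = 1 - (1 - p) ** n            # exact P[X > 0]
bound_2_15 = mean ** 2 / second_moment      # second moment method (2.15)
bound_2_13 = 1 - var / mean ** 2            # weaker bound (2.13)

print("P[X > 0]         =", prob_positive)
print("(EX)^2 / E[X^2]  =", bound_2_15)
print("1 - Var/(EX)^2   =", bound_2_13)
```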

Corollary 2.21. Let $B_n = A_{n,1} \cup \cdots \cup A_{n,m_n}$, where $A_{n,1}, \ldots, A_{n,m_n}$ is a collection of events for each $n$. Write $i \overset{n}{\sim} j$ if $i \ne j$ and $A_{n,i}$ and $A_{n,j}$ are not independent. Then, letting
$$\mu_n := \sum_{i=1}^{m_n} \mathbb{P}[A_{n,i}], \qquad \gamma_n := \sum_{i \overset{n}{\sim} j} \mathbb{P}[A_{n,i} \cap A_{n,j}],$$
where the second sum is over ordered pairs, we have $\lim_n \mathbb{P}[B_n] > 0$ whenever $\mu_n \to +\infty$ and $\gamma_n \le C \mu_n^2$ for some $C > 0$. If moreover $\gamma_n = o(\mu_n^2)$ then $\lim_n \mathbb{P}[B_n] = 1$.

Proof. Take $X := X_n = \sum_{i=1}^{m_n} \mathbf{1}_{A_{n,i}}$ in the second moment method (Theorem 2.20). Note that
$$\mathrm{Var}[X_n] = \sum_i \mathrm{Var}[\mathbf{1}_{A_{n,i}}] + \sum_{i \ne j} \mathrm{Cov}[\mathbf{1}_{A_{n,i}}, \mathbf{1}_{A_{n,j}}],$$
where
$$\mathrm{Var}[\mathbf{1}_{A_{n,i}}] = \mathbb{E}[(\mathbf{1}_{A_{n,i}})^2] - (\mathbb{E}[\mathbf{1}_{A_{n,i}}])^2 \le \mathbb{P}[A_{n,i}],$$
and, if $A_{n,i}$ and $A_{n,j}$ are independent,
$$\mathrm{Cov}[\mathbf{1}_{A_{n,i}}, \mathbf{1}_{A_{n,j}}] = 0,$$
whereas, if $i \overset{n}{\sim} j$,
$$\mathrm{Cov}[\mathbf{1}_{A_{n,i}}, \mathbf{1}_{A_{n,j}}] = \mathbb{E}[\mathbf{1}_{A_{n,i}} \mathbf{1}_{A_{n,j}}] - \mathbb{E}[\mathbf{1}_{A_{n,i}}]\,\mathbb{E}[\mathbf{1}_{A_{n,j}}] \le \mathbb{P}[A_{n,i} \cap A_{n,j}].$$
Hence
$$\frac{\mathrm{Var}[X_n]}{(\mathbb{E}X_n)^2} \le \frac{\mu_n + \gamma_n}{\mu_n^2} = \frac{1}{\mu_n} + \frac{\gamma_n}{\mu_n^2}.$$
Noting
$$\frac{(\mathbb{E}X_n)^2}{\mathbb{E}[X_n^2]} = \frac{(\mathbb{E}X_n)^2}{(\mathbb{E}X_n)^2 + \mathrm{Var}[X_n]} = \frac{1}{1 + \mathrm{Var}[X_n]/(\mathbb{E}X_n)^2},$$
and applying Theorem 2.20 gives the result.

We give applications of Theorem 2.20 and Corollary 2.21 in Sections 2.3.2 and 2.3.3.

2.3.2 Erdős-Rényi: subgraph containment and connectivity

We have seen examples of threshold phenomena in constraint satisfaction problems and percolation. Such thresholds are also common in random graphs. We consider here the Erdős-Rényi random graph. Formally, a threshold function for a graph property $P$ is a function $r(n)$ such that
$$\lim_n \mathbb{P}_{n, p_n}[G_n \text{ has property } P] = \begin{cases} 0, & \text{if } p_n \ll r(n), \\ 1, & \text{if } p_n \gg r(n), \end{cases}$$
where, under $\mathbb{P}_{n, p_n}$, $G_n \sim G_{n, p_n}$ is an Erdős-Rényi graph with $n$ vertices and density $p_n$. In this section, we illustrate this type of phenomenon on two properties: the containment of small subgraphs and connectivity.

Small subgraphs: the containment problem

We first consider the clique number, then we turn to the more general subgraph containment problem.

Cliques Let $\omega(G)$ be the clique number of a graph $G$, i.e., the size of its largest clique.

Claim 2.22. The property $\omega(G) \ge 4$ has threshold function $n^{-2/3}$.

Proof. Let $X_n$ be the number of $4$-cliques in the Erdős-Rényi graph $G_n \sim G_{n, p_n}$. Then, noting that there are $\binom{4}{2} = 6$ edges in a $4$-clique,
$$\mathbb{E}_{n, p_n}[X_n] = \binom{n}{4} p_n^6 = \Theta(n^4 p_n^6),$$
which goes to $0$ when $p_n \ll n^{-2/3}$. Hence the first moment method (Theorem 2.10) gives one direction.

For the other direction, we apply the second moment method for sums of indicators, Corollary 2.21. For an enumeration $S_1, \ldots, S_m$ of the $4$-tuples of vertices in $G_n$, let $A_1, \ldots, A_m$ be the events that the corresponding $4$-cliques are present. By the calculation above we have $\mu_m = \Theta(n^4 p_n^6)$ which goes to $+\infty$ when $p_n \gg n^{-2/3}$. Also $\mu_m^2 = \Theta(n^8 p_n^{12})$ so it suffices to show that $\gamma_m = o(n^8 p_n^{12})$. Note that two $4$-cliques with disjoint edge sets (but possibly sharing one vertex) are independent. Suppose $S_i$ and $S_j$ share $3$ vertices. Then
$$\mathbb{P}_{n, p_n}[A_i \,|\, A_j] = p_n^3,$$
as the event $A_j$ implies that all edges between three of the vertices in $S_i$ are present, and there are $3$ edges between the remaining vertex and the rest of $S_i$. Similarly if $|S_i \cap S_j| = 2$, $\mathbb{P}_{n, p_n}[A_i \,|\, A_j] = p_n^5$. Putting these together we get
$$\gamma_m = \sum_{i \sim j} \mathbb{P}_{n, p_n}[A_j]\, \mathbb{P}_{n, p_n}[A_i \,|\, A_j] = \binom{n}{4} p_n^6 \left[\binom{4}{3}(n-4) p_n^3 + \binom{4}{2}\binom{n-4}{2} p_n^5\right]$$
$$= O(n^5 p_n^9) + O(n^6 p_n^{11}) = O\left(\frac{n^8 p_n^{12}}{n^3 p_n^3}\right) + O\left(\frac{n^8 p_n^{12}}{n^2 p_n}\right) = o(n^8 p_n^{12}) = o(\mu_m^2),$$
where we used that $p_n \gg n^{-2/3}$ (so that for example $n^3 p_n^3 \gg 1$). Corollary 2.21 gives the result.

Roughly speaking, the first and second moments suffice to pinpoint the threshold in this case because the indicators in $X_n$ are "mostly" pairwise independent and, as a result, the sum is concentrated around its mean.
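A small simulation sketch (not from the text): the fraction of $G(n, p)$ samples containing a $4$-clique for $p$ below and above the $n^{-2/3}$ threshold of Claim 2.22. At this small $n$ the transition is blurry; the graph size, constants, and helper names are arbitrary.

```python
# Fraction of G(n, p) samples containing a 4-clique, for p = c * n^(-2/3).
import itertools
import numpy as np

rng = np.random.default_rng(5)

def has_4_clique(adj):
    # adj: symmetric 0/1 adjacency matrix as a list of lists
    n = len(adj)
    for quad in itertools.combinations(range(n), 4):
        if all(adj[a][b] for a, b in itertools.combinations(quad, 2)):
            return True
    return False

n, trials = 30, 20
for c in [0.3, 1.0, 3.0]:
    p = c * n ** (-2 / 3)
    hits = 0
    for _ in range(trials):
        upper = np.triu(rng.random((n, n)) < p, 1)
        adj = (upper | upper.T).tolist()
        hits += has_4_clique(adj)
    print(f"p = {c:.1f} * n^(-2/3)   fraction with a 4-clique = {hits / trials:.2f}")
```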

General subgraphs The methods of Claim 2.22 can be applied to more general subgraphs. However the situation is somewhat more complicated than it is for cliques. For a graph $H_0$, let $v_{H_0}$ and $e_{H_0}$ be the number of vertices and edges of $H_0$ respectively. Let $X_n$ be the number of copies of $H_0$ in $G_n \sim G_{n, p_n}$. By the first moment method,
$$\mathbb{P}[X_n > 0] \le \mathbb{E}[X_n] = \Theta(n^{v_{H_0}} p_n^{e_{H_0}}) \to 0,$$
when $p_n \ll n^{-v_{H_0}/e_{H_0}}$. (The constant factor, which does not play a role in the asymptotics, accounts in particular for the number of automorphisms of $H_0$.) From the proof of Claim 2.22, one might guess that the threshold function is $n^{-v_{H_0}/e_{H_0}}$. That is not the case in general. To see what can go wrong, consider the graph of Figure 2.5 whose edge density is $\frac{e_{H_0}}{v_{H_0}} = \frac{6}{5}$. When $p_n \gg n^{-5/6}$, the expected number of copies of $H_0$ tends to $+\infty$. But observe that the subgraph $H$ of $H_0$ has the higher density $5/4$ and, hence, when $n^{-5/6} \ll p_n \ll n^{-4/5}$ the expected number of copies of $H$ tends to $0$. By the first moment method, the probability that a copy of $H_0$ (and therefore $H$) is present in that regime is asymptotically negligible despite its diverging expectation. This leads to the following definition
$$r_{H_0} := \max\left\{\frac{e_H}{v_H} : H \subseteq H_0,\ v_H > 0\right\}.$$
Assume $H_0$ has at least one edge.

Claim 2.23. "Having a copy of $H_0$" has threshold $n^{-1/r_{H_0}}$.

Proof. We proceed as in Claim 2.22. Let $H_0^*$ be a subgraph of $H_0$ achieving $r_{H_0}$. When $p_n \ll n^{-1/r_{H_0}}$, the probability that a copy of $H_0^*$ is in $G_n$ tends to $0$ by the argument above. Therefore the same conclusion holds for $H_0$ itself.

Assume $p_n \gg n^{-1/r_{H_0}}$. Let $I_1, \ldots, I_m$ be an enumeration of the copies of $H_0$ in a complete graph on the vertices of $G_n$. Let $A_i$ be the event that $I_i \subseteq G_n$. Using the notation of Corollary 2.21,
$$\mu_m = \Theta(n^{v_{H_0}} p_n^{e_{H_0}}) = \Omega(\Phi_{H_0}),$$
where
$$\Phi_{H_0} := \min_{H \subseteq H_0,\ e_H > 0} n^{v_H} p_n^{e_H}.$$
Note that $\Phi_{H_0} \to +\infty$. The events $A_i$ and $A_j$ are independent if $I_i$ and $I_j$ share no edge. Otherwise we write $i \sim j$. Note that there are $\Theta(n^{v_H} n^{2(v_{H_0} - v_H)})$ pairs $I_i, I_j$ whose intersection is isomorphic to $H$. The probability that both elements of such a pair are present in $G_n$ is $\Theta(p_n^{e_H} p_n^{2(e_{H_0} - e_H)})$. Hence
$$\gamma_m = \sum_{i \sim j} \mathbb{P}[A_i \cap A_j] = \sum_{H \subseteq H_0,\ e_H > 0} \Theta\left(n^{2 v_{H_0} - v_H} p_n^{2 e_{H_0} - e_H}\right) = \frac{\Theta(\mu_m^2)}{\Theta(\Phi_{H_0})} = o(\mu_m^2).$$
The result follows from Corollary 2.21.

Figure 2.5: Graph $H_0$ and subgraph $H$.

Going back to the example of Figure 2.5, the proof above confirms that when $n^{-5/6} \ll p_n \ll n^{-4/5}$ the second moment method fails for $H_0$ since $\Phi_{H_0} \to 0$. In that regime, although there is in expectation a large number of copies of $H_0$, those copies are highly correlated as they are produced from a (vanishingly) small number of copies of $H$, explaining the failure of the second moment method.

Connectivity threshold

In this section, we use the second moment method to show that the threshold function for connectivity in the Erdős-Rényi random graph is $\frac{\log n}{n}$. In fact we prove this result by deriving the threshold function for the presence of isolated vertices. The connection between the two is obvious in one direction. Isolated vertices imply a disconnected graph. What is less obvious is that it also works the other way in the following sense: the two thresholds actually coincide.

Isolated vertices We begin with isolated vertices.

Claim 2.24. "Not having an isolated vertex" has threshold function $\frac{\log n}{n}$.

Proof. Let $X_n$ be the number of isolated vertices in the Erdős-Rényi graph $G_n \sim G_{n, p_n}$. Using $1 - x \le e^{-x}$ for all $x \in \mathbb{R}$,
$$\mathbb{E}_{n, p_n}[X_n] = n(1 - p_n)^{n-1} \le e^{\log n - (n-1)p_n} \to 0,$$
when $p_n \gg \frac{\log n}{n}$. So the first moment method gives one direction: $\mathbb{P}_{n, p_n}[X_n > 0] \to 0$.

For the other direction, we use the second moment method. Let $A_j$ be the event that vertex $j$ is isolated. By the computation above, using $1 - x \ge e^{-x - x^2}$ for $x \in [0, 1/2]$ (Exercise: Check),
$$\mu_n = \sum_i \mathbb{P}_{n, p_n}[A_i] = n(1 - p_n)^{n-1} \ge e^{\log n - n p_n - n p_n^2}, \qquad (2.16)$$
which goes to $+\infty$ when $p_n \ll \frac{\log n}{n}$. Note that $i \sim j$ for all $i \ne j$ and
$$\mathbb{P}_{n, p_n}[A_i \cap A_j] = (1 - p_n)^{2(n-2)+1},$$
so that
$$\gamma_n = \sum_{i \ne j} \mathbb{P}_{n, p_n}[A_i \cap A_j] = n(n-1)(1 - p_n)^{2n-3}.$$
Because $\gamma_n$ is not a $o(\mu_n^2)$, we cannot apply Corollary 2.21. Instead we use Theorem 2.20 directly. We have
$$\frac{\mathbb{E}_{n, p_n}[X_n^2]}{(\mathbb{E}_{n, p_n}[X_n])^2} = \frac{\mu_n + \gamma_n}{\mu_n^2} \le \frac{n(1-p_n)^{n-1} + n^2(1-p_n)^{2n-3}}{n^2(1-p_n)^{2n-2}} \le \frac{1}{n(1-p_n)^{n-1}} + \frac{1}{1 - p_n}, \qquad (2.17)$$
which is $1 + o(1)$ when $p_n \ll \frac{\log n}{n}$.

Connectivity We use Claim 2.24 to study the threshold for connectivity.

Claim 2.25. Connectivity has threshold function $\frac{\log n}{n}$.

Proof. We start with the easy direction. If $p_n \ll \frac{\log n}{n}$, Claim 2.24 implies that the graph has isolated vertices, and therefore is disconnected, with probability going to $1$ as $n \to +\infty$.

Assume that $p_n \gg \frac{\log n}{n}$. Let $D_n$ be the event that $G_n$ is disconnected. To bound $\mathbb{P}_{n, p_n}[D_n]$, for $k \in \{1, \ldots, n/2\}$ we let $Y_k$ be the number of subsets of $k$ vertices that are disconnected from all other vertices in the graph. Then, by the first moment method,
$$\mathbb{P}_{n, p_n}[D_n] \le \mathbb{P}_{n, p_n}\left[\sum_{k=1}^{n/2} Y_k > 0\right] \le \sum_{k=1}^{n/2} \mathbb{E}_{n, p_n}[Y_k].$$
The expectation of $Y_k$ is straightforward to estimate. Using that $k \le n/2$ and $\binom{n}{k} \le n^k$,
$$\mathbb{E}_{n, p_n}[Y_k] = \binom{n}{k}(1 - p_n)^{k(n-k)} \le \left(n(1 - p_n)^{n/2}\right)^k.$$
The expression in parentheses is $o(1)$ when $p_n \gg \frac{\log n}{n}$. Summing over $k$,
$$\mathbb{P}_{n, p_n}[D_n] \le \sum_{k=1}^{+\infty} \left(n(1 - p_n)^{n/2}\right)^k = O(n(1 - p_n)^{n/2}) = o(1),$$
where we used that the geometric series (started at $k = 1$) is dominated asymptotically by its first term.

A closer look We have shown that connectivity and the absence of isolated vertices have the same threshold function. In fact, in a sense, isolated vertices are the "last obstacle" to connectivity. A slight modification of the proof above leads to the following more precise result. For $k \in \{1, \ldots, n/2\}$, let $Z_k$ be the number of connected components of size $k$ in $G_n$. In particular, $Z_1$ is the number of isolated vertices. We consider the critical window $p_n = \frac{c_n}{n}$ where $c_n := \log n + s$ for some fixed $s \in \mathbb{R}$. We show that, in that regime, the graph is composed of a large connected component together with some isolated vertices.

Claim 2.26.
$$\mathbb{P}_{n, p_n}[Z_1 > 0] \ge \frac{1}{1 + e^s} + o(1) \quad \text{and} \quad \mathbb{P}_{n, p_n}\left[\sum_{k=2}^{n/2} Z_k > 0\right] = o(1).$$

Proof. We first consider isolated vertices. From (2.16) and (2.17),
$$\mathbb{P}_{n, p_n}[Z_1 > 0] \ge \left(e^{-\log n + n p_n + n p_n^2} + \frac{1}{1 - p_n}\right)^{-1} = \frac{1}{1 + e^s} + o(1),$$
as $n \to +\infty$.

To bound the number of components of size $k > 1$, we note first that the random variable $Y_k$ used in the previous claim (which imposes no condition on the edges between the vertices) is too loose to provide a suitable bound. Instead, to bound the probability that a set of $k$ vertices forms a connected component, we observe that a connected component is characterized by two properties: it is disconnected from the rest of the graph; and it contains a spanning tree. Formally, for $k = 2, \ldots, n/2$, we let $Z_k'$ be the number of (not necessarily induced) maximal trees of size $k$ or, put differently, the number of spanning trees of connected components of size $k$. Then, by the first moment method, the probability that a connected component of size $> 1$ is present in $G_n$ is bounded by
$$\mathbb{P}_{n, p_n}\left[\sum_{k=2}^{n/2} Z_k > 0\right] \le \mathbb{P}_{n, p_n}\left[\sum_{k=2}^{n/2} Z_k' > 0\right] \le \sum_{k=2}^{n/2} \mathbb{E}_{n, p_n}[Z_k']. \qquad (2.18)$$
To estimate the expectation of $Z_k'$, we use Cayley's theorem (e.g. [LP, Corollary 4.5]) which implies that there are $k^{k-2}$ labelled trees on a set of $k$ vertices. Recall further that a tree on $k$ vertices has $k - 1$ edges. Hence,
$$\mathbb{E}_{n, p_n}[Z_k'] = \underbrace{\binom{n}{k} k^{k-2}}_{(a)}\ \underbrace{p_n^{k-1}}_{(b)}\ \underbrace{(1 - p_n)^{k(n-k)}}_{(c)},$$
where (a) is the number of trees of size $k$, (b) is the probability that such a tree is present in the graph, and (c) is the probability that this tree is disconnected from every other vertex in the graph. Using that $k! \ge (k/e)^k$ (see Section A.1.1),
$$\mathbb{E}_{n, p_n}[Z_k'] \le \frac{n^k}{k!} k^{k-2} p_n^{k-1} (1 - p_n)^{k(n-k)} \le n\left(e\, c_n\, e^{-(1 - \frac{k}{n}) c_n}\right)^k.$$
For $k \le n/2$, the expression in parentheses is $o(1)$. In fact, for $k \ge 2$, $\mathbb{E}_{n, p_n}[Z_k'] = o(1)$. Furthermore, summing over $k > 2$,
$$\sum_{k=3}^{n/2} \mathbb{E}_{n, p_n}[Z_k'] \le \sum_{k=3}^{+\infty} n\left(e\, c_n\, e^{-\frac{1}{2} c_n}\right)^k = O(n^{-1/2} \log^3 n) = o(1).$$
Plugging this back into (2.18) concludes the proof.

The limit of $\mathbb{P}_{n, p_n}[Z_1 > 0]$ can be computed explicitly using the method of moments. See Exercise 2.14.

2.3.3 Percolation on trees: critical value and branching number

Consider bond percolation on the infinite $d$-regular tree $T_d$. Root the tree arbitrarily at a vertex $0$ and let $\mathcal{C}_0$ be the open connected component containing $0$. In this section we illustrate the use of the first and second moment methods on the identification of the critical value
$$p_c(T_d) = \sup\{p \in [0,1] : \theta(p) = 0\},$$
where recall that the percolation function is $\theta(p) = \mathbb{P}_p[|\mathcal{C}_0| = +\infty]$. We then consider general trees, introduce the branching number, and present a weighted version of the second moment method.

Regular tree Our main result for $T_d$ is the following.

Theorem 2.27.
$$p_c(T_d) = \frac{1}{d - 1}.$$

Proof. Let $\partial_n$ be the $n$-th level of $T_d$, i.e., the set of vertices at graph distance $n$ from $0$. Let $X_n$ be the number of vertices in $\partial_n \cap \mathcal{C}_0$. In order for the component of the root to be infinite, there must be at least one vertex on the $n$-th level connected to the root by an open path. By the first moment method,
$$\theta(p) \le \mathbb{P}_p[X_n > 0] \le \mathbb{E}_p X_n = d(d-1)^{n-1} p^n \to 0, \qquad (2.19)$$
when $p < \frac{1}{d-1}$. Here we used that there is a unique path between $0$ and any vertex in the tree to deduce that $\mathbb{P}_p[x \in \mathcal{C}_0] = p^n$ for $x \in \partial_n$. Equation (2.19) implies $p_c(T_d) \ge \frac{1}{d-1}$.

The second moment method gives a lower bound on $\mathbb{P}_p[X_n > 0]$. To simplify the notation, it is convenient to introduce the "branching ratio" $b := d - 1$. We say that $x$ is a descendant of $z$ if the path between $0$ and $x$ goes through $z$. Each $z \ne 0$ has $d - 1$ descendant subtrees, i.e., subtrees of $T_d$ rooted at $z$ made of all descendants of $z$. Let $x \wedge y$ be the most recent common ancestor of $x$ and $y$, i.e., the furthest vertex from $0$ that lies on both the path from $0$ to $x$ and the path from $0$ to $y$; see Figure 2.6.

Figure 2.6: Most recent common ancestor of $x$ and $y$.

Letting $\mu_n := \mathbb{E}_p[X_n]$,
$$\mathbb{E}_p[X_n^2] = \sum_{x, y \in \partial_n} \mathbb{P}_p[x, y \in \mathcal{C}_0] = \sum_{x \in \partial_n} \mathbb{P}_p[x \in \mathcal{C}_0] + \sum_{m=0}^{n-1} \sum_{x, y \in \partial_n} \mathbf{1}_{\{x \wedge y \in \partial_m\}}\, p^m p^{2(n-m)}$$
$$= \mu_n + (b+1) b^{n-1} \sum_{m=0}^{n-1} (b-1) b^{(n-m)-1} p^{2n-m} \le \mu_n + (b+1)(b-1) b^{2n-2} p^{2n} \sum_{m=0}^{+\infty} (bp)^{-m}$$
$$= \mu_n + \mu_n^2 \cdot \frac{b-1}{b+1} \cdot \frac{1}{1 - (bp)^{-1}},$$
where, on the third line, we used that all vertices on the $n$-th level are equivalent and that, for a fixed $x$, the set $\{y : x \wedge y \in \partial_m\}$ is composed of those vertices in $\partial_n$ that are descendants of $x \wedge y$ but not in the descendant subtree of $x \wedge y$ containing $x$. When $p > \frac{1}{d-1} = \frac{1}{b}$, dividing by $(\mathbb{E}_p X_n)^2 = \mu_n^2 \to +\infty$, we get
$$\frac{\mathbb{E}_p[X_n^2]}{(\mathbb{E}_p X_n)^2} \le \frac{1}{\mu_n} + \frac{b-1}{b+1} \cdot \frac{1}{1 - (bp)^{-1}} \qquad (2.20)$$
$$\le 1 + \frac{b-1}{b+1} \cdot \frac{1}{1 - (bp)^{-1}} =: C_{b,p},$$
for $n$ large enough. By the second moment method and monotonicity,
$$\theta(p) = \mathbb{P}_p[\forall n,\ X_n > 0] = \lim_n \mathbb{P}_p[X_n > 0] \ge C_{b,p}^{-1} > 0,$$
which concludes the proof. (Note that the version of the second moment method in Equation (2.13) does not work here. Subtract $1$ in (2.20) and take $p$ close to $1/b$.)

The argument above relies crucially on the fact that, in a tree, any two vertices are connected by a unique path. For instance, estimating $\mathbb{P}_p[x \in \mathcal{C}_0]$ is much harder on a lattice. Note furthermore that, intuitively, the reason why the first moment captures the critical threshold exactly in this case is that bond percolation on $T_d$ is a branching process, where $X_n$ represents the population size at generation $n$. The qualitative behavior of a branching process is governed by its expectation: when the mean number of offspring $bp$ exceeds $1$, the process grows exponentially on average and "explodes" with positive probability. We will come back to this point of view in Chapter 5 where branching processes are used to give a more refined analysis of bond percolation on $T_d$.
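Following the branching-process point of view just described, here is a level-by-level simulation sketch (not from the text) of percolation on $T_d$: the survival probability up to a fixed depth is a finite stand-in for $\theta(p)$, which vanishes for $p \le 1/(d-1)$. The degree, depth, and trial count are arbitrary.

```python
# Percolation on T_d simulated level by level as a branching process.
import numpy as np

rng = np.random.default_rng(7)
d, depth, trials = 3, 30, 2000        # critical value is 1/(d-1) = 0.5

def reaches_depth(p, rng):
    # X_n = number of level-n vertices joined to the root by open paths:
    # the root has d child edges, every other open vertex has d-1 child edges
    x = rng.binomial(d, p)
    for _ in range(depth - 1):
        if x == 0:
            return False
        x = rng.binomial(x * (d - 1), p)
    return x > 0

for p in [0.4, 0.5, 0.6, 0.7]:
    est = np.mean([reaches_depth(p, rng) for _ in range(trials)])
    print(f"p = {p:.1f}   P[C_0 reaches level {depth}] ~ {est:.3f}")
```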

General trees Let $T$ be a locally finite tree (i.e., all its degrees are finite) with root $0$. For an edge $e$, let $v_e$ be the endvertex of $e$ furthest from the root. We denote by $|e|$ the graph distance between $0$ and $v_e$. A cutset separating $0$ and $+\infty$ is a set of edges $\Pi$ such that all infinite self-avoiding paths starting at $0$ go through $\Pi$. For a cutset $\Pi$, we let $\Pi_v := \{v_e : e \in \Pi\}$. Repeating the argument in (2.19), for any cutset $\Pi$, by the first moment method
$$\theta(p) \le \mathbb{P}_p[\mathcal{C}_0 \cap \Pi_v \ne \emptyset] \le \sum_{u \in \Pi_v} \mathbb{P}_p[u \in \mathcal{C}_0] = \sum_{e \in \Pi} p^{|e|}. \qquad (2.21)$$
This bound naturally leads to the following definition.

Definition 2.28 (Branching number). The branching number of $T$ is given by
$$\mathrm{br}(T) = \sup\left\{\lambda \ge 1 : \inf_{\text{cutset } \Pi} \sum_{e \in \Pi} \lambda^{-|e|} > 0\right\}. \qquad (2.22)$$

Remark 2.29. For locally finite trees, it suffices to consider finite cutsets. See the proof of [LP, Theorem 3.1].

Equation (2.21) implies that $p_c(T) \ge \frac{1}{\mathrm{br}(T)}$. Remarkably, this bound is tight. The proof is based on a weighted second moment method.

Theorem 2.30. For any rooted, locally finite tree $T$,
$$p_c(T) = \frac{1}{\mathrm{br}(T)}.$$

Proof. As we argued above, $p_c(T) \ge \frac{1}{\mathrm{br}(T)}$ follows from (2.21) and (2.22).

Let $p > \frac{1}{\mathrm{br}(T)}$, $p^{-1} < \lambda < \mathrm{br}(T)$, and $\varepsilon > 0$ such that
$$\sum_{e \in \Pi} \lambda^{-|e|} \ge \varepsilon \qquad (2.23)$$
for all cutsets $\Pi$. The existence of such an $\varepsilon$ is guaranteed by the definition of the branching number. As in the proof of Theorem 2.27, we use that $\theta(p)$ is the limit as $n \to +\infty$ of the probability that $\mathcal{C}_0$ reaches the $n$-th level. However, this time, we use a weighted count on the $n$-th level. Let $T_n$ be the first $n$ levels of $T$ and, as before, let $\partial_n$ be the vertices on the $n$-th level. For a probability measure $\nu$ on $\partial_n$, we define the weighted count
$$X_n = \sum_{z \in \partial_n} \nu(z)\, \mathbf{1}_{\{z \in \mathcal{C}_0\}}\, \frac{1}{\mathbb{P}_p[z \in \mathcal{C}_0]}.$$
The purpose of the denominator is normalization:
$$\mathbb{E}_p X_n = \sum_{z \in \partial_n} \nu(z) = 1.$$
(Note that multiplying all weights by a constant does not affect the event $\{X_n > 0\}$ and that the constant cancels out in the ratio of $(\mathbb{E}_p X_n)^2$ and $\mathbb{E}_p X_n^2$ in the second moment method.) Because of (2.23), a natural choice of $\nu$ follows from the max-flow min-cut theorem which guarantees the existence of a unit flow $\phi$ from $0$ to $\partial_n$ satisfying the capacity constraint $\phi(e) \le \varepsilon^{-1} \lambda^{-|e|}$, for all edges $e$ in $T_n$. Define $\nu(z)$ to be the flow entering $z \in T_n$ under $\phi$. In particular, because $\phi$ is a unit flow, $\nu$ restricted to $\partial_n$ defines a probability measure. It remains to bound the second moment of $X_n$ under this choice. We have
$$\mathbb{E}_p X_n^2 = \sum_{x, y \in \partial_n} \nu(x)\nu(y)\, \frac{\mathbb{P}_p[x, y \in \mathcal{C}_0]}{\mathbb{P}_p[x \in \mathcal{C}_0]\,\mathbb{P}_p[y \in \mathcal{C}_0]} = \sum_{m=0}^{n} \sum_{x, y \in \partial_n} \mathbf{1}_{\{x \wedge y \in \partial_m\}}\, \nu(x)\nu(y)\, \frac{p^m p^{2(n-m)}}{p^{2n}}$$
$$= \sum_{m=0}^{n} p^{-m} \sum_{z \in \partial_m} \left(\sum_{x, y \in \partial_n} \mathbf{1}_{\{x \wedge y = z\}}\, \nu(x)\nu(y)\right).$$
In the expression in parentheses, for each $x$ the sum over $y$ is at most $\nu(x)\nu(z)$ by the definition of a flow. Hence
$$\mathbb{E}_p X_n^2 \le \sum_{m=0}^{n} p^{-m} \sum_{z \in \partial_m} (\nu(z))^2 \le \sum_{m=0}^{n} p^{-m} \sum_{z \in \partial_m} (\varepsilon^{-1}\lambda^{-m})\, \nu(z) \le \varepsilon^{-1} \sum_{m=0}^{+\infty} (p\lambda)^{-m} = \frac{\varepsilon^{-1}}{1 - (p\lambda)^{-1}} =: C_{\varepsilon,\lambda,p} < +\infty,$$
where the second line follows from the capacity constraint, and we used $p\lambda > 1$ on the last line. From the second moment method (recall that $\mathbb{E}_p X_n = 1$), it follows that
$$\theta(p) \ge C_{\varepsilon,\lambda,p}^{-1} > 0,$$
and $p_c(T) \le \frac{1}{\mathrm{br}(T)}$.

2.4 Chernoff-Cramér method

Chebyshev's inequality (Theorem 2.2) gives a bound on the concentration around the mean of a square integrable random variable that is, in general, best possible. Indeed take $X$ to be $\mu + b\sigma$ or $\mu - b\sigma$ with probability $(2b^2)^{-1}$ respectively, and $\mu$ otherwise. Then $\mathbb{E}X = \mu$, $\mathrm{Var}X = \sigma^2$, and for $\beta = b\sigma$,
$$\mathbb{P}[|X - \mathbb{E}X| \ge \beta] = \mathbb{P}[|X - \mathbb{E}X| = \beta] = \frac{1}{b^2} = \frac{\mathrm{Var}X}{\beta^2}.$$

However, in many cases, much stronger bounds can be derived. For instance, if $X \sim N(0,1)$, by the following lemma
$$\mathbb{P}[|X - \mathbb{E}X| \ge \beta] \sim \sqrt{\frac{2}{\pi}}\, \beta^{-1} \exp(-\beta^2/2) \ll \frac{1}{\beta^2}, \qquad (2.24)$$
as $\beta \to +\infty$. Indeed:

Lemma 2.31. For $x > 0$,
$$(x^{-1} - x^{-3})\, e^{-x^2/2} \le \int_x^{+\infty} e^{-y^2/2}\, \mathrm{d}y \le x^{-1}\, e^{-x^2/2}.$$

Proof. By the change of variable $y = x + z$ and using $e^{-z^2/2} \le 1$,
$$\int_x^{+\infty} e^{-y^2/2}\, \mathrm{d}y \le e^{-x^2/2} \int_0^{+\infty} e^{-xz}\, \mathrm{d}z = e^{-x^2/2}\, x^{-1}.$$
For the other direction, by differentiation
$$\int_x^{+\infty} (1 - 3y^{-4})\, e^{-y^2/2}\, \mathrm{d}y = (x^{-1} - x^{-3})\, e^{-x^2/2}.$$

In this section we discuss the Chernoff-Cramér method, which produces exponential tail inequalities, provided the moment-generating function is finite in a neighborhood of $0$.

2.4.1 Tail bounds via the moment-generating function

Under a finite variance, squaring within Markov's inequality (Theorem 2.1) produces Chebyshev's inequality (Theorem 2.2). This "boosting" can be pushed further when stronger integrability conditions hold. We refer to (2.25) in the next lemma as the Chernoff-Cramér bound.

Lemma 2.32 (Chernoff-Cramér bound). Assume $X$ is a centered random variable such that $M_X(s) < +\infty$ for $s \in (-s_0, s_0)$ for some $s_0 > 0$. For any $\beta > 0$ and $s > 0$,
$$\mathbb{P}[X \ge \beta] \le \exp\left[-\left\{s\beta - \Psi_X(s)\right\}\right], \qquad (2.25)$$
where
$$\Psi_X(s) = \log M_X(s),$$
is the cumulant-generating function of $X$.

Proof. Exponentiating within Markov's inequality gives
$$\mathbb{P}[X \ge \beta] = \mathbb{P}[e^{sX} \ge e^{s\beta}] \le \frac{M_X(s)}{e^{s\beta}} = \exp\left[-\left\{s\beta - \Psi_X(s)\right\}\right].$$

Returning to the Gaussian case, let $X \sim N(0, \nu)$ where $\nu > 0$ is the variance and note that
$$M_X(s) = \int_{-\infty}^{+\infty} e^{sx}\, \frac{1}{\sqrt{2\pi\nu}}\, e^{-\frac{x^2}{2\nu}}\, \mathrm{d}x = \int_{-\infty}^{+\infty} e^{\frac{s^2\nu}{2}}\, \frac{1}{\sqrt{2\pi\nu}}\, e^{-\frac{(x - s\nu)^2}{2\nu}}\, \mathrm{d}x = \exp\left(\frac{s^2\nu}{2}\right).$$
By straightforward calculus, the optimal choice of $s$ in (2.25) gives the exponent
$$\sup_{s > 0}\, (s\beta - s^2\nu/2) = \frac{\beta^2}{2\nu}, \qquad (2.26)$$
achieved at $s_\beta = \beta/\nu$. For $\beta > 0$, this leads to the bound
$$\mathbb{P}[X \ge \beta] \le \exp\left(-\frac{\beta^2}{2\nu}\right), \qquad (2.27)$$
which is much sharper than Chebyshev's inequality for large $\beta$; compare to (2.24).

As another toy example, we consider simple random walk on $\mathbb{Z}$.

which is much sharper than Chebyshev’s inequality for large �—compare to (2.24).As another toy example, we consider simple random walk on Z.

Lemma 2.33 (Chernoff bound for simple random walk on Z). Let Z1, . . . , Zn beindependent {�1, 1}-valued random variables with P[Zi = 1] = P[Zi = �1] =

1/2. Let Sn =P

inZi. Then, for any � > 0,

P[Sn � �] e��2/2n. (2.28)

Proof. The moment-generating function of Z1 can be bounded as follows

MZ1(s) =es + e�s

2=

X

j�0

s2j

(2j)!

X

j�0

(s2/2)j

j!= es

2/2. (2.29)

Taking s = �/n in the Chernoff-Cramer bound (2.25), we get

P[Sn � �] exp (�s� + n Z1(s))

exp��s� + ns2/2

= e��2/2n,

which concludes the proof.


Observe the similarity between (2.28) and the Gaussian bound (2.27), if one takes $\nu$ to be the variance of $S_n$:

\[
\nu := \mathrm{Var}[S_n] = n\,\mathrm{Var}[Z_1] = n\,E[Z_1^2] = n,
\]

where we used that $Z_1$ is centered. The central limit theorem says that simple random walk is well approximated by a Gaussian in the bulk of the distribution; the bound above extends the approximation to the large deviation regime. The bounding technique used in the proof of Lemma 2.33 will be substantially extended in Section 2.4.3.

Example 2.34 (Set balancing). Let $\mathbf{v}_1, \ldots, \mathbf{v}_m$ be arbitrary non-zero vectors in $\{0,1\}^n$. Think of $\mathbf{v}_i = (v_{i,1}, \ldots, v_{i,n})$ as representing subsets of $[n] = \{1, \ldots, n\}$: $v_{i,j} = 1$ indicates that $j$ is in subset $i$. Suppose we want to partition $[n]$ into two groups such that the subsets corresponding to the $\mathbf{v}_i$'s are as balanced as possible, that is, are as close as possible to having the same number of elements from each group. More formally, we seek a vector $\mathbf{x} = (x_1, \ldots, x_n) \in \{-1, +1\}^n$ such that $B^* = \max_{i = 1, \ldots, m} |\mathbf{x} \cdot \mathbf{v}_i|$ is as small as possible. Once again we select each $x_i$ independently, uniformly at random in $\{-1, +1\}$. Fix $\varepsilon > 0$. We claim that

\[
P\Big[B^* \ge \sqrt{2n(\log m + \log(2\varepsilon^{-1}))}\Big] \le \varepsilon. \qquad (2.30)
\]

Indeed, by (2.28) (considering only the non-zero entries of $\mathbf{v}_i$),

\[
P\Big[|\mathbf{x} \cdot \mathbf{v}_i| \ge \sqrt{2n(\log m + \log(2\varepsilon^{-1}))}\Big]
\le 2 \exp\left(-\frac{2n(\log m + \log(2\varepsilon^{-1}))}{2\|\mathbf{v}_i\|_1}\right)
\le \frac{\varepsilon}{m},
\]

where we used $\|\mathbf{v}_i\|_1 \le n$. Taking a union bound over the $m$ vectors gives the result. In (2.30), the $\sqrt{n}$ term on the right-hand side of the inequality is to be expected since it is the standard deviation of $|\mathbf{x} \cdot \mathbf{v}_i|$ in the worst case. The power of the exponential tail bound (2.28) appears in the logarithmic terms, which would have been much larger if one had used Chebyshev's inequality (Theorem 2.2) instead. (Try it!) ◁
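The guarantee (2.30) is easy to probe empirically. The following minimal sketch, with arbitrary choices of $m$, $n$ and $\varepsilon$ and assuming numpy, draws random subsets and a random signing and compares $B^*$ to the threshold:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, eps = 1000, 50, 0.1

# random 0/1 incidence vectors v_1, ..., v_m (rows)
V = rng.integers(0, 2, size=(m, n))

# uniform random signs x in {-1, +1}^n
x = rng.choice([-1, 1], size=n)

B_star = np.max(np.abs(V @ x))
threshold = np.sqrt(2 * n * (np.log(m) + np.log(2 / eps)))
print(f"B* = {B_star}, threshold = {threshold:.1f}")  # B* exceeds the threshold with prob. <= eps
```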

The Chernoff-Cramér bound is particularly useful for sums of independent random variables as the moment-generating function then factorizes. Let

\[
\Psi_X^*(\beta) = \sup_{s \in \mathbb{R}_+} (s\beta - \Psi_X(s))
\]

be the Fenchel-Legendre dual of the cumulant-generating function of $X$.

Theorem 2.35 (Chernoff-Cramér method). Let $S_n = \sum_{i \le n} X_i$, where the $X_i$'s are i.i.d. centered random variables. Assume $M_{X_1}(s) < +\infty$ on $s \in (-s_0, s_0)$ for some $s_0 > 0$. For any $\beta > 0$,

\[
P[S_n \ge \beta] \le \exp\left(-n\,\Psi_{X_1}^*\!\left(\frac{\beta}{n}\right)\right). \qquad (2.31)
\]

In particular, in the large deviations regime, i.e., when $\beta = bn$ for some $b > 0$, we have

\[
-\limsup_n \frac{1}{n} \log P[S_n \ge bn] \ge \Psi_{X_1}^*(b). \qquad (2.32)
\]

Proof. Observe that

\[
\Psi_{S_n}^*(\beta) = \sup_{s > 0} \left(s\beta - n\Psi_{X_1}(s)\right) = \sup_{s > 0}\, n\left(s\,\frac{\beta}{n} - \Psi_{X_1}(s)\right) = n\,\Psi_{X_1}^*\!\left(\frac{\beta}{n}\right),
\]

and optimize over $s$ in (2.25).

2.4.2 Some examples

We use the Chernoff-Cramér method to derive a few standard bounds.

Poisson variables Let $Z \sim \mathrm{Poi}(\lambda)$ be Poisson with mean $\lambda$ and recall that, letting $X = Z - \lambda$,

\[
\Psi_X(s) = \log\left(\sum_{\ell \ge 0} \frac{e^{-\lambda}\lambda^\ell}{\ell!}\, e^{s(\ell - \lambda)}\right)
= \log\left(e^{-(1+s)\lambda} \sum_{\ell \ge 0} \frac{(e^{s}\lambda)^\ell}{\ell!}\right)
= \log\left(e^{-(1+s)\lambda} e^{e^{s}\lambda}\right)
= \lambda(e^{s} - s - 1),
\]

so that straightforward calculus gives for $\beta > 0$

\[
\Psi_X^*(\beta) = \sup_{s > 0}\left(s\beta - \lambda(e^{s} - s - 1)\right)
= \lambda\left[\left(1 + \frac{\beta}{\lambda}\right)\log\left(1 + \frac{\beta}{\lambda}\right) - \frac{\beta}{\lambda}\right]
=: \lambda\, h\!\left(\frac{\beta}{\lambda}\right),
\]

achieved at $s_\beta = \log\left(1 + \frac{\beta}{\lambda}\right)$, where $h$ is defined as the expression in square brackets. Plugging $\Psi_X^*(\beta)$ into Theorem 2.35 leads for $\beta > 0$ to the bound

\[
P[Z \ge \lambda + \beta] \le \exp\left(-\lambda\, h\!\left(\frac{\beta}{\lambda}\right)\right). \qquad (2.33)
\]

A similar calculation for $-(Z - \lambda)$ gives for $\beta < 0$

\[
P[Z \le \lambda + \beta] \le \exp\left(-\lambda\, h\!\left(\frac{\beta}{\lambda}\right)\right). \qquad (2.34)
\]

If $S_n$ is a sum of $n$ i.i.d. $\mathrm{Poi}(\lambda)$ variables, then by (2.32) for $a > \lambda$

\[
-\limsup_n \frac{1}{n}\log P[S_n \ge an] \ge \lambda\, h\!\left(\frac{a - \lambda}{\lambda}\right)
= a \log\left(\frac{a}{\lambda}\right) - a + \lambda =: I^{\mathrm{Poi}}_\lambda(a), \qquad (2.35)
\]

and similarly for $a < \lambda$

\[
-\limsup_n \frac{1}{n}\log P[S_n \le an] \ge I^{\mathrm{Poi}}_\lambda(a). \qquad (2.36)
\]

In fact, these bounds follow immediately from (2.33) and (2.34) by noting that $S_n \sim \mathrm{Poi}(n\lambda)$.
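A quick numerical comparison of the bound (2.33) with the exact Poisson upper tail, assuming scipy is available (the value of $\lambda$ is an arbitrary illustrative choice):

```python
import numpy as np
from scipy.stats import poisson

lam = 10.0
h = lambda x: (1 + x) * np.log(1 + x) - x   # the function h appearing in (2.33)

for beta in [5.0, 10.0, 20.0]:
    exact = poisson.sf(np.ceil(lam + beta) - 1, mu=lam)   # P[Z >= lam + beta]
    bound = np.exp(-lam * h(beta / lam))                  # right-hand side of (2.33)
    print(f"beta={beta:5.1f}  P[Z >= lam+beta] = {exact:.3e}  bound = {bound:.3e}")
```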

Binomial variables and Chernoff bounds Let $Z$ be a binomial random variable with parameters $n$ and $p$. Recall that $Z$ is a sum of i.i.d. indicators $Y_1, \ldots, Y_n$ and, letting $X_i = Y_i - p$ and $S_n = Z - np$,

\[
\Psi_{X_1}(s) = \log\left(p e^{s} + (1 - p)\right) - ps.
\]

For $b \in (0, 1 - p)$, letting $a = b + p$, direct calculation gives

\[
\Psi_{X_1}^*(b) = \sup_{s > 0}\left(sb - \left(\log\left[p e^{s} + (1 - p)\right] - ps\right)\right)
= (1 - a)\log\frac{1 - a}{1 - p} + a\log\frac{a}{p} =: \mathrm{D}(a \| p), \qquad (2.37)
\]

achieved at $s_b = \log\frac{(1 - p)a}{p(1 - a)}$. The function $\mathrm{D}(a \| p)$ in (2.37) is the so-called Kullback-Leibler divergence or relative entropy of Bernoulli variables with parameters $a$ and $p$ respectively. By (2.31), for $\beta > 0$

\[
P[Z \ge np + \beta] \le \exp\left(-n\,\mathrm{D}\left(p + \beta/n \,\|\, p\right)\right).
\]

Applying the same argument to $Z' = n - Z$ gives a bound in the other direction.

Remark 2.36. In the large deviations regime, it can be shown that the previous bound is tight in the sense that

\[
-\frac{1}{n}\log P[Z \ge np + bn] \to \mathrm{D}(p + b \,\|\, p) =: I^{\mathrm{Bin}}_{n,p}(b),
\]

as $n \to +\infty$. The theory of large deviations, which deals with asymptotics of probabilities of rare events, provides general results along those lines. See e.g. [Dur10, Section 2.6]. Upper bounds will be enough for our purposes. A similar remark applies to (??) and (2.35).

The following related bounds, proved in Exercise 2.6, are often useful.

Theorem 2.37 (Chernoff bounds for Poisson trials). Let $Y_1, \ldots, Y_n$ be independent $\{0,1\}$-valued random variables with $P[Y_i = 1] = p_i$ and $\mu = \sum_i p_i$. These are called Poisson trials. Let $Z = \sum_i Y_i$. Then:

1. Above the mean

(a) For any $\delta > 0$,
\[
P[Z \ge (1 + \delta)\mu] \le \left(\frac{e^{\delta}}{(1 + \delta)^{(1 + \delta)}}\right)^{\mu}.
\]

(b) For any $0 < \delta \le 1$,
\[
P[Z \ge (1 + \delta)\mu] \le e^{-\mu\delta^2/3}.
\]

2. Below the mean

(a) For any $0 < \delta < 1$,
\[
P[Z \le (1 - \delta)\mu] \le \left(\frac{e^{-\delta}}{(1 - \delta)^{(1 - \delta)}}\right)^{\mu}.
\]

(b) For any $0 < \delta < 1$,
\[
P[Z \le (1 - \delta)\mu] \le e^{-\mu\delta^2/2}.
\]
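The bounds of Theorem 2.37 can be checked numerically; the sketch below uses arbitrary heterogeneous success probabilities and assumes numpy only.

```python
import numpy as np

rng = np.random.default_rng(2)
p = rng.uniform(0.05, 0.3, size=200)   # heterogeneous success probabilities p_i
mu = p.sum()
trials = 100_000

Z = (rng.random((trials, p.size)) < p).sum(axis=1)   # Z = sum of Poisson trials

for delta in [0.2, 0.5, 1.0]:
    empirical = np.mean(Z >= (1 + delta) * mu)
    bound_a = (np.exp(delta) / (1 + delta) ** (1 + delta)) ** mu   # bound 1(a)
    bound_b = np.exp(-mu * delta**2 / 3)                           # bound 1(b), valid for delta <= 1
    print(f"delta={delta:.1f}  empirical={empirical:.2e}  1(a)={bound_a:.2e}  1(b)={bound_b:.2e}")
```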

There are innumerable applications of the Chernoff-Cramér method. We discuss a few in the next subsections.

2.4.3 Sub-Gaussian random variables

The bounds in Section 2.4.2 were obtained by computing the moment-generating function explicitly (possibly with some approximations). This is seldom possible. In this section, we give some important examples of concentration inequalities derived from the Chernoff-Cramér method for broad classes of random variables under natural conditions on their distributions.

Sub-Gaussian random variables Here is our key definition.

Definition 2.38 (Sub-Gaussian random variables). We say that a random variable $X$ with mean $\mu$ is sub-Gaussian with variance factor $\nu$ if

\[
\Psi_{X - \mu}(s) \le \frac{s^2\nu}{2}, \qquad \forall s \in \mathbb{R}, \qquad (2.38)
\]

for some $\nu > 0$. We use the notation $X \in \mathrm{sG}(\nu)$.

Note that the r.h.s. in (2.38) is the cumulant-generating function of a $N(0, \nu)$. By the Chernoff-Cramér method and (2.26) it follows immediately that

\[
P[X - \mu \le -\beta] \vee P[X - \mu \ge \beta] \le \exp\left(-\frac{\beta^2}{2\nu}\right), \qquad (2.39)
\]

where we used that $X \in \mathrm{sG}(\nu)$ implies $-X \in \mathrm{sG}(\nu)$. As a quick example, note that this is the approach that we took in Lemma 2.33, that is, we showed that a uniform random variable in $\{-1, 1\}$ is sub-Gaussian with variance factor 1.

When considering (weighted) sums of independent sub-Gaussian random variables, we get the following.

Theorem 2.39 (General Hoeffding inequality). Suppose $X_1, \ldots, X_n$ are independent random variables where, for each $i$, $X_i \in \mathrm{sG}(\nu_i)$ with $0 < \nu_i < +\infty$. For $(w_1, \ldots, w_n) \in \mathbb{R}^n$, let $S_n = \sum_{i \le n} w_i X_i$. Then

\[
S_n \in \mathrm{sG}\left(\sum_{i=1}^n w_i^2 \nu_i\right).
\]

In particular, for all $\beta > 0$,

\[
P[S_n - E S_n \ge \beta] \le \exp\left(-\frac{\beta^2}{2\sum_{i=1}^n w_i^2 \nu_i}\right).
\]

Proof. By independence,

\[
\Psi_{S_n}(s) = \sum_{i \le n} \Psi_{w_i X_i}(s) = \sum_{i \le n} \Psi_{X_i}(s w_i)
\le \sum_{i \le n} \frac{(s w_i)^2 \nu_i}{2} = \frac{s^2 \sum_{i \le n} w_i^2 \nu_i}{2}.
\]

Bounded random variables For bounded random variables, the previous inequality reduces to a standard bound.

Theorem 2.40 (Hoeffding's inequality). Let $X_1, \ldots, X_n$ be independent random variables where, for each $i$, $X_i$ takes values in $[a_i, b_i]$ with $-\infty < a_i \le b_i < +\infty$. Let $S_n = \sum_{i \le n} X_i$. For all $\beta > 0$,

\[
P[S_n - E S_n \ge \beta] \le \exp\left(-\frac{2\beta^2}{\sum_{i \le n} (b_i - a_i)^2}\right).
\]

By Theorem 2.39, it suffices to show that $X_i - E X_i \in \mathrm{sG}(\nu_i)$ with $\nu_i = \frac{1}{4}(b_i - a_i)^2$.

We first give a quick proof of a weaker version that uses a trick called symmetrization. Suppose the $X_i$'s are centered and satisfy $|X_i| \le c_i$ for some $c_i > 0$. Let $X_i'$ be an independent copy of $X_i$ and let $Z_i$ be an independent uniform random variable in $\{-1, 1\}$. For any $s$, by Jensen's inequality and (2.29),

\begin{align*}
E\left[e^{s X_i}\right]
&= E\left[e^{s E[X_i - X_i' \,|\, X_i]}\right]\\
&\le E\left[E\left[e^{s(X_i - X_i')} \,\Big|\, X_i\right]\right]\\
&= E\left[e^{s(X_i - X_i')}\right]\\
&= E\left[E\left[e^{s(X_i - X_i')} \,\Big|\, Z_i\right]\right]\\
&= E\left[E\left[e^{s Z_i (X_i - X_i')} \,\Big|\, Z_i\right]\right]\\
&= E\left[e^{s Z_i (X_i - X_i')}\right]\\
&= E\left[E\left[e^{s Z_i (X_i - X_i')} \,\Big|\, X_i - X_i'\right]\right]\\
&\le E\left[e^{(s(X_i - X_i'))^2/2}\right]\\
&\le e^{4 c_i^2 s^2/2},
\end{align*}

where on the fifth line we used the fact that $X_i - X_i'$ is identically distributed to $-(X_i - X_i')$ and that $Z_i$ is independent of both $X_i$ and $X_i'$ (together with Lemma A.6). That is, $X_i$ is sub-Gaussian with variance factor $4c_i^2$. By Theorem 2.39, $S_n$ is sub-Gaussian with variance factor $\sum_{i \le n} 4 c_i^2$ and

\[
P[S_n \ge t] \le \exp\left(-\frac{t^2}{8 \sum_{i \le n} c_i^2}\right).
\]

Proof of Theorem 2.40. As pointed out above, it suffices to show that $X_i - E X_i$ is sub-Gaussian with variance factor $\frac{1}{4}(b_i - a_i)^2$. This is the content of Hoeffding's lemma (which we will use again in Chapter 3). First an observation:

Lemma 2.41 (Variance of bounded random variables). For any random variable $Z$ taking values in $[a, b]$ with $-\infty < a \le b < +\infty$, we have

\[
\mathrm{Var}[Z] \le \frac{1}{4}(b - a)^2.
\]

Proof. Indeed

\[
\left|Z - \frac{a + b}{2}\right| \le \frac{b - a}{2},
\]

and

\[
\mathrm{Var}[Z] = \mathrm{Var}\left[Z - \frac{a + b}{2}\right] \le E\left[\left(Z - \frac{a + b}{2}\right)^2\right] \le \left(\frac{b - a}{2}\right)^2.
\]

Lemma 2.42 (Hoeffding's lemma). Let $X$ be a random variable taking values in $[a, b]$ for $-\infty < a \le b < +\infty$. Then $X - E X \in \mathrm{sG}\left(\frac{1}{4}(b - a)^2\right)$.

Proof. Note first that $X - E X \in [a - E X, b - E X]$ and $\frac{1}{4}\left((b - E X) - (a - E X)\right)^2 = \frac{1}{4}(b - a)^2$. So w.l.o.g. we assume $E X = 0$. Because $X$ is bounded, $M_X(s)$ is finite for all $s \in \mathbb{R}$. From standard results on moment-generating functions (e.g., [Bil12, Section 21]; see also [Dur10, Theorem A.5.1]), for any non-negative integer $k$,

\[
M_X^{(k)}(s) = E\left[X^k e^{sX}\right].
\]

Hence

\[
\Psi_X(0) = \log M_X(0) = 0, \qquad \Psi_X'(0) = \frac{M_X'(0)}{M_X(0)} = E X = 0,
\]

and by a Taylor expansion

\[
\Psi_X(s) = \Psi_X(0) + s\,\Psi_X'(0) + \frac{s^2}{2}\,\Psi_X''(s_*) = \frac{s^2}{2}\,\Psi_X''(s_*),
\]

for some $s_* \in [0, s]$. Therefore it suffices to show that for all $s$

\[
\Psi_X''(s) \le \frac{1}{4}(b - a)^2. \qquad (2.40)
\]

Note that

\begin{align*}
\Psi_X''(s) &= \frac{M_X''(s)}{M_X(s)} - \left(\frac{M_X'(s)}{M_X(s)}\right)^2\\
&= \frac{1}{M_X(s)}\,E\left[X^2 e^{sX}\right] - \left(\frac{1}{M_X(s)}\,E\left[X e^{sX}\right]\right)^2\\
&= E\left[X^2\,\frac{e^{sX}}{M_X(s)}\right] - \left(E\left[X\,\frac{e^{sX}}{M_X(s)}\right]\right)^2.
\end{align*}

The trick to conclude is to notice that $\frac{e^{sx}}{M_X(s)}$ defines a density (i.e., a Radon-Nikodym derivative) on $[a, b]$ with respect to the law of $X$. The variance under this density, which is the last line above, is less than $\frac{1}{4}(b - a)^2$ by Lemma 2.41. This establishes (2.40) and concludes the proof.

Remark 2.43. The change of measure above is known as tilting and is a standard trick in large deviation theory. See, e.g., [Dur10, Section 2.6].

Example: A knapsack problem To be written. See [FR98, Section 5.3].

2.4.4 Sub-exponential random variables

Unfortunately, not every random variable of interest is sub-Gaussian. A simple example is the square of a Gaussian variable. Indeed, suppose $X \sim N(0, 1)$. Then $W = X^2$ is $\chi^2$-distributed and its moment-generating function can be computed explicitly. Using the change of variable $u = x\sqrt{1 - 2s}$, for $s < 1/2$,

\[
M_W(s) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{+\infty} e^{sx^2} e^{-x^2/2}\, dx
= \frac{1}{\sqrt{1 - 2s}} \times \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{+\infty} e^{-u^2/2}\, du
= \frac{1}{(1 - 2s)^{1/2}}. \qquad (2.41)
\]

When $s \ge 1/2$, however, $M_W(s) = +\infty$. In particular, $W$ cannot be sub-Gaussian for any variance factor $\nu > 0$. (Note that centering $W$ produces an additional factor of $e^{-s}$ in the moment-generating function, which does not prevent it from diverging.)

Further confirming this observation, arguing as in (2.24), the upper tail of $W$ decays as

\[
P[W \ge \beta] = P\left[|X| \ge \sqrt{\beta}\right] \sim \sqrt{\frac{2}{\pi}}\,\left[\sqrt{\beta}\right]^{-1}\exp\left(-\left[\sqrt{\beta}\right]^2/2\right) = \sqrt{\frac{2}{\pi\beta}}\,\exp(-\beta/2),
\]

as $\beta \to +\infty$. That is, it decays exponentially with $\beta$, but much more slowly than the Gaussian tail.

Sub-exponential random variables We now define a broad class of distributions which have such exponential tail decay.

Definition 2.44 (Sub-exponential random variable). We say that a random variable $X$ with mean $\mu$ is sub-exponential with parameters $(\nu, \alpha)$ if

\[
\Psi_{X - \mu}(s) \le \frac{s^2\nu}{2}, \qquad \forall |s| \le \frac{1}{\alpha}, \qquad (2.42)
\]

for some $\nu, \alpha > 0$. We write $X \in \mathrm{sE}(\nu, \alpha)$.

Observe that the key difference between (2.38) and (2.42) is the interval of $s$ over which the bound holds. As we will see below, the parameter $\alpha$ dictates the exponential decay rate of the tail. The specific form of the bound in (2.42) is natural once one notices that, as $|s| \to 0$, a centered random variable with variance $\nu$ should roughly satisfy

\[
\log E[e^{sX}] \approx \log\left\{1 + s\,E[X] + \frac{s^2}{2}\,E[X^2]\right\} \approx \log\left\{1 + \frac{s^2\nu}{2}\right\} \approx \frac{s^2\nu}{2}.
\]

Returning to the $\chi^2$ distribution, note that from (2.41) we have, for $|s| \le 1/4$,

\begin{align*}
\Psi_{W - 1}(s) &= -s - \frac{1}{2}\log(1 - 2s)\\
&= -s - \frac{1}{2}\left[-\sum_{i=1}^{+\infty}\frac{(2s)^i}{i}\right]\\
&= \frac{s^2}{2}\left[4\sum_{i=2}^{+\infty}\frac{(2s)^{i-2}}{i}\right]\\
&\le \frac{s^2}{2}\left[2\sum_{i=2}^{+\infty}|1/2|^{i-2}\right]\\
&\le \frac{s^2}{2}\times 4.
\end{align*}

Hence $W \in \mathrm{sE}(4, 4)$.

Using the Chernoff-Cramér bound (Lemma 2.32), we obtain the following tail bound for sub-exponential variables.

Theorem 2.45 (Sub-exponential tail bound). Suppose the random variable $X$ with mean $\mu$ is sub-exponential with parameters $(\nu, \alpha)$. Then for all $\beta \in \mathbb{R}_+$

\[
P[X - \mu \ge \beta] \le
\begin{cases}
\exp\left(-\frac{\beta^2}{2\nu}\right), & \text{if } 0 \le \beta \le \nu/\alpha,\\[4pt]
\exp\left(-\frac{\beta}{2\alpha}\right), & \text{if } \beta > \nu/\alpha.
\end{cases} \qquad (2.43)
\]

In words, the tail decays exponentially at large deviations but behaves as in the sub-Gaussian case for smaller deviations. We will see below that this double tail allows one to interpolate naturally between the two regimes. First we prove the claim.

Proof of Theorem 2.45. We start by applying the Chernoff-Cramér bound (Lemma 2.32). For any $\beta > 0$ and $|s| \le 1/\alpha$,

\[
P[X - \mu \ge \beta] \le \exp\left(-s\beta + \Psi_{X - \mu}(s)\right) \le \exp\left(-s\beta + s^2\nu/2\right).
\]

At this point, the proof diverges from the sub-Gaussian case: the optimal choice of $s$ now depends on $\beta$ through the additional constraint $|s| \le 1/\alpha$. When $s_* = \beta/\nu$ satisfies $s_* \le 1/\alpha$, the quadratic function of $s$ in the exponent is minimized at $s_*$, giving the bound

\[
P[X - \mu \ge \beta] \le \exp\left(-\frac{\beta^2}{2\nu}\right),
\]

for $0 \le \beta \le \nu/\alpha$. On the other hand, when $\beta > \nu/\alpha$, the exponent is strictly decreasing over the interval $s \le 1/\alpha$. Hence the optimal choice is $s_* = 1/\alpha$, which produces the bound

\[
P[X - \mu \ge \beta] \le \exp\left(-\frac{\beta}{\alpha} + \frac{\nu}{2\alpha^2}\right)
< \exp\left(-\frac{\beta}{\alpha} + \frac{\beta}{2\alpha}\right)
= \exp\left(-\frac{\beta}{2\alpha}\right),
\]

where we used that $\nu < \beta\alpha$ on the second line.
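For the $\chi^2$ example above, with $W \in \mathrm{sE}(4, 4)$ and mean 1, the two regimes of (2.43) can be compared to the exact tail. A rough numerical illustration, assuming scipy is available:

```python
import numpy as np
from scipy.stats import chi2

nu, alpha = 4.0, 4.0   # sub-exponential parameters derived above for W = X^2, X ~ N(0,1)

for beta in [0.5, 2.0, 8.0]:
    exact = chi2.sf(1 + beta, df=1)          # P[W - 1 >= beta]
    if beta <= nu / alpha:                    # Gaussian-like regime
        bound = np.exp(-beta**2 / (2 * nu))
    else:                                     # exponential regime
        bound = np.exp(-beta / (2 * alpha))
    print(f"beta={beta:4.1f}  P[W-1 >= beta] = {exact:.3e}  bound = {bound:.3e}")
```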

For (weighted) sums of independent sub-exponential random variables, we get the following.

Theorem 2.46 (General Bernstein inequality). Suppose $X_1, \ldots, X_n$ are independent random variables where, for each $i$, $X_i \in \mathrm{sE}(\nu_i, \alpha_i)$ with $0 < \nu_i, \alpha_i < +\infty$. For $(w_1, \ldots, w_n) \in \mathbb{R}^n$, let $S_n = \sum_{i \le n} w_i X_i$. Then

\[
S_n \in \mathrm{sE}\left(\sum_{i=1}^n w_i^2 \nu_i,\ \max_i |w_i|\alpha_i\right).
\]

In particular, for all $\beta > 0$,

\[
P[S_n - E S_n \ge \beta] \le
\begin{cases}
\exp\left(-\frac{\beta^2}{2\sum_{i=1}^n w_i^2\nu_i}\right), & \text{if } 0 \le \beta \le \frac{\sum_{i=1}^n w_i^2\nu_i}{\max_i |w_i|\alpha_i},\\[6pt]
\exp\left(-\frac{\beta}{2\max_i |w_i|\alpha_i}\right), & \text{if } \beta > \frac{\sum_{i=1}^n w_i^2\nu_i}{\max_i |w_i|\alpha_i}.
\end{cases}
\]

Proof. By independence,

\[
\Psi_{S_n}(s) = \sum_{i \le n} \Psi_{w_i X_i}(s) = \sum_{i \le n} \Psi_{X_i}(s w_i)
\le \sum_{i \le n} \frac{(s w_i)^2 \nu_i}{2} = \frac{s^2\sum_{i \le n} w_i^2 \nu_i}{2},
\]

provided $|s w_i| < 1/\alpha_i$ for all $i$, that is,

\[
|s| < \frac{1}{\max_i |w_i| \alpha_i}.
\]

Bounded random variables We apply the previous result to bounded random variables.

Theorem 2.47 (Bernstein's inequality). Let $X_1, \ldots, X_n$ be independent random variables where, for each $i$, $X_i$ has mean $\mu_i$, variance $\nu_i$ and satisfies $|X_i - \mu_i| \le c$ for some $0 < c < +\infty$. Let $S_n = \sum_{i \le n} X_i$. For all $\beta > 0$,

\[
P[S_n - E S_n \ge \beta] \le
\begin{cases}
\exp\left(-\frac{\beta^2}{4\sum_{i=1}^n \nu_i}\right), & \text{if } 0 \le \beta \le \frac{\sum_{i=1}^n \nu_i}{c},\\[6pt]
\exp\left(-\frac{\beta}{4c}\right), & \text{if } \beta > \frac{\sum_{i=1}^n \nu_i}{c}.
\end{cases}
\]

Proof. We claim that $X_i \in \mathrm{sE}(2\nu_i, 2c)$. To establish the claim, we derive a bound on all moments of $X_i$. Note that for all integers $k \ge 2$

\[
E|X_i - \mu_i|^k \le c^{k-2}\, E|X_i - \mu_i|^2 = c^{k-2}\nu_i.
\]

Hence, first applying the dominated convergence theorem (see e.g. [Bil12, Section 21] for details), we have for $|s| < \frac{1}{2c}$

\begin{align*}
E[e^{s(X_i - \mu_i)}] &= \sum_{k=0}^{+\infty} \frac{s^k}{k!}\, E[(X_i - \mu_i)^k]\\
&\le 1 + s\,E[(X_i - \mu_i)] + \sum_{k=2}^{+\infty} \frac{|s|^k}{k!}\, c^{k-2}\nu_i\\
&\le 1 + \frac{s^2\nu_i}{2} + \frac{s^2\nu_i}{3!}\sum_{k=3}^{+\infty} (c|s|)^{k-2}\\
&= 1 + \frac{s^2\nu_i}{2}\left\{1 + \frac{1}{3}\,\frac{c|s|}{1 - c|s|}\right\}\\
&\le 1 + \frac{s^2\nu_i}{2}\left\{1 + \frac{1}{3}\,\frac{1/2}{1 - 1/2}\right\}\\
&\le 1 + \frac{s^2}{2}\,2\nu_i\\
&\le \exp\left(\frac{s^2}{2}\,2\nu_i\right).
\end{align*}

That proves the claim. Using the General Bernstein inequality (Theorem 2.46) gives the result.

It may seem counter-intuitive to derive a tail bound based on the sub-exponential property of bounded random variables when we have already done so using their sub-Gaussian behavior. After all, the latter is on the surface a strengthening of the former. However, note that we have obtained a better bound in Theorem 2.47 than we did in Theorem 2.40 when $\beta$ is not too large. That improvement stems from the use of the variance for moderate deviations, at the expense of the sub-Gaussian tail for large deviations. We give an example below.

Example 2.48 (Erdős-Rényi: maximum degree). Let $G_n = (V_n, E_n) \sim G_{n, p_n}$ be an Erdős-Rényi graph with $n$ vertices and density $p_n$. Recall that two vertices $u, v \in V_n$ are adjacent if $\{u, v\} \in E_n$ and that the set of adjacent vertices of $v$, denoted by $N(v)$, is called the neighborhood of $v$. The degree of $v$ is the size of its neighborhood, i.e. $\delta(v) = |N(v)|$. Here we study the maximum degree of $G_n$,

\[
D_n = \max_{v \in V_n} \delta(v).
\]

We focus on the regime $n p_n = \omega(\log n)$. Note that, for any vertex $v \in V_n$, its degree is $\mathrm{Bin}(n - 1, p_n)$ by independence of the edges. In particular its expected degree is $(n - 1)p_n$. To prove a high-probability upper bound on the maximum $D_n$, we need to control the deviation of each vertex's degree from its expectation. Observe that the degrees are not independent. Instead we apply a union bound over all vertices, after using a tail bound.

Claim 2.49. For any $\varepsilon > 0$, as $n \to +\infty$,

\[
P\left[|D_n - n p_n| \ge 2\sqrt{(1 + \varepsilon)\, n p_n \log n}\right] \to 0.
\]

Proof. For a fixed vertex $v$, think of $\delta(v) = S_{n-1} \sim \mathrm{Bin}(n - 1, p_n)$ as a sum of $n - 1$ independent $\{0, 1\}$-valued random variables, one for each possible edge. That is, $S_{n-1} = \sum_{i=1}^{n-1} X_i$ where the $X_i$'s are bounded random variables. The mean of $X_i$ is $p_n$ and its variance is $p_n(1 - p_n)$. So in Bernstein's inequality (Theorem 2.47), we can take $\mu_i := p_n$, $\nu_i := p_n(1 - p_n)$ and $c := 1$ for all $i$. We get

\[
P[S_{n-1} \ge (n-1) p_n + \beta] \le
\begin{cases}
\exp\left(-\frac{\beta^2}{4\nu}\right), & \text{if } 0 \le \beta \le \nu,\\[4pt]
\exp\left(-\frac{\beta}{4}\right), & \text{if } \beta > \nu,
\end{cases}
\]

where $\nu = (n-1) p_n (1 - p_n) = \omega(\log n)$ by assumption. The value of $\beta$ is chosen to be the smallest that will produce a tail probability less than $n^{-1-\varepsilon}$ for $\varepsilon > 0$, that is,

\[
\beta = \sqrt{4(n-1) p_n (1 - p_n)} \times \sqrt{(1 + \varepsilon)\log n} = o(\nu),
\]

which falls in the lower regime of the tail bound. In particular, $\beta = o(n p_n)$. Finally by a union bound over $v \in V_n$

\[
P\left[D_n \ge (n-1) p_n + \sqrt{4(1 + \varepsilon) p_n (1 - p_n)(n-1)\log n}\right] \le n \times \frac{1}{n^{1+\varepsilon}} \to 0.
\]

The same holds in the other direction. That proves the claim.

Had we used Hoeffding's inequality (Theorem 2.40) in the proof of Claim 2.49, we would have had to take $\beta$ of order $\sqrt{(1 + \varepsilon)\, n \log n}$. That would have produced a much weaker bound when $p_n = o(1)$. Indeed the advantage of Bernstein's inequality is that it makes explicit use of the variance, which when $p_n = o(1)$ is much smaller than the worst case for bounded variables. ◁
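A small simulation of Claim 2.49, with an arbitrary choice of $n$ and $p_n$ in the regime $n p_n = \omega(\log n)$ and assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
p = np.log(n) ** 2 / n        # regime n*p = omega(log n)

# sample the upper triangle of the adjacency matrix of G(n, p)
upper = np.triu(rng.random((n, n)) < p, k=1)
adj = upper | upper.T
degrees = adj.sum(axis=1)

D_n = degrees.max()
bound = n * p + 2 * np.sqrt((1 + 0.1) * n * p * np.log(n))
print(f"max degree = {D_n}, n*p = {n*p:.1f}, high-probability bound ~ {bound:.1f}")
```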

2.4.5 Epsilon-nets and chaining

Suppose we are interested in bounding the expectation or tail of the supremum of a stochastic process

\[
\sup_{t \in T} X_t,
\]

where $T$ is an arbitrary index set and the $X_t$'s are real-valued random variables. To avoid measurability issues, we assume throughout that $T$ is countable.

So far we have developed tools that can handle cases where $T$ is finite. When the supremum is over an infinite index set, however, new ideas are required. One way to proceed is to apply a tail inequality to a sufficiently dense finite subset of the index set and then extend the resulting bound by a Lipschitz continuity argument. We present this type of approach in this section, as well as a multi-scale version known as chaining.

First we summarize one important special case that will be useful below: $T$ is finite and $X_t$ is sub-Gaussian.

Theorem 2.50 (Maximal inequalities: sub-Gaussian case). Let $\{X_t\}_{t \in T}$ be a stochastic process with finite index set $T$. Assume that there is $\nu > 0$ such that, for all $t$, $X_t \in \mathrm{sG}(\nu)$ and that $E[X_t] = 0$. Then

\[
E\left[\sup_{t \in T} X_t\right] \le \sqrt{2\nu\log|T|},
\]

and, for all $\beta > 0$,

\[
P\left[\sup_{t \in T} X_t \ge \sqrt{2\nu\log|T|} + \beta\right] \le \exp\left(-\frac{\beta^2}{2\nu}\right).
\]

Proof. For the expectation, we apply a variation on the Chernoff-Cramér method (Section 2.4). Naively, we could bound the supremum $\sup_{t \in T} X_t$ by the sum $\sum_{t \in T}|X_t|$, but that would lead to a bound growing linearly with the cardinality $|T|$. Instead we first take an exponential, which tends to amplify the largest term and produces a much stronger bound. Specifically, by Jensen's inequality, for any $s > 0$

\[
E\left[\sup_{t \in T} X_t\right] = \frac{1}{s}\,E\left[\sup_{t \in T} s X_t\right] \le \frac{1}{s}\log E\left[\exp\left(\sup_{t \in T} s X_t\right)\right].
\]

Since $e^{a \vee b} \le e^{a} + e^{b}$ by the non-negativity of the exponential, we can bound

\[
E\left[\sup_{t \in T} X_t\right] \le \frac{1}{s}\log\left[\sum_{t \in T} E\left[\exp(s X_t)\right]\right]
= \frac{1}{s}\log\left[\sum_{t \in T} M_{X_t}(s)\right]
\le \frac{1}{s}\log\left[|T|\, e^{\frac{s^2\nu}{2}}\right]
= \frac{\log|T|}{s} + \frac{s\nu}{2}.
\]

The optimal choice of $s$ is obtained when the two terms in the sum above are equal, that is, $s = \sqrt{2\nu^{-1}\log|T|}$, which gives finally

\[
E\left[\sup_{t \in T} X_t\right] \le \sqrt{2\nu\log|T|},
\]

as claimed.

For the tail inequality, we use a union bound and (2.39):

\begin{align*}
P\left[\sup_{t \in T} X_t \ge \sqrt{2\nu\log|T|} + \beta\right]
&\le \sum_{t \in T} P\left[X_t \ge \sqrt{2\nu\log|T|} + \beta\right]\\
&\le |T|\exp\left(-\frac{(\sqrt{2\nu\log|T|} + \beta)^2}{2\nu}\right)\\
&\le \exp\left(-\frac{\beta^2}{2\nu}\right),
\end{align*}

as claimed, where we used that $\beta > 0$ on the last line.
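For i.i.d. standard Gaussians ($\nu = 1$), the expectation bound of Theorem 2.50 can be compared to a Monte Carlo estimate; a quick illustration assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(4)
trials = 20_000

for size in [10, 100, 1000]:
    X = rng.standard_normal((trials, size))   # |T| = size i.i.d. N(0,1) variables
    est = X.max(axis=1).mean()                # Monte Carlo estimate of E[max_t X_t]
    print(f"|T|={size:5d}  E[max] ~ {est:.3f}  bound sqrt(2 log|T|) = {np.sqrt(2*np.log(size)):.3f}")
```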

Epsilon-nets and covering numbers

Moving on to infinite index sets, we first define the notion of an $\varepsilon$-net. This notion requires that a pseudometric $\rho$ (i.e. $\rho : T \times T \to \mathbb{R}_+$ is symmetric and satisfies the triangle inequality) be defined over $T$.

Definition 2.51 ($\varepsilon$-net). Let $T$ be a subset of a pseudometric space $(M, \rho)$ and let $\varepsilon > 0$. The collection of points $N \subseteq M$ is called an $\varepsilon$-net of $T$ if

\[
T \subseteq \bigcup_{t \in N} B_\rho(t, \varepsilon),
\]

where $B_\rho(t, \varepsilon) = \{s \in T : \rho(s, t) \le \varepsilon\}$, that is, each element of $T$ is within distance $\varepsilon$ of an element in $N$. The smallest cardinality of an $\varepsilon$-net of $T$ is called the covering number

\[
\mathcal{N}(T, \rho, \varepsilon) = \inf\{|N| : N \text{ is an } \varepsilon\text{-net of } T\}.
\]

A natural way to construct an $\varepsilon$-net is the following greedy algorithm. Start with $N = \emptyset$ and successively add a point from $T$ to $N$ at distance at least $\varepsilon$ from all previous points, until it is not possible to do so anymore. Provided $T$ is compact, this procedure will terminate after a finite number of steps. This leads to the following dual perspective.
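The greedy construction just described is straightforward to implement. The sketch below is a toy illustration on random points in the plane (an arbitrary choice), assuming numpy; by maximality, the returned set is simultaneously an $\varepsilon$-packing and an $\varepsilon$-net.

```python
import numpy as np

def greedy_net(points, eps):
    """Greedily keep points pairwise at distance > eps; the result is an eps-net by maximality."""
    net = []
    for p in points:
        if all(np.linalg.norm(p - q) > eps for q in net):
            net.append(p)
    return np.array(net)

rng = np.random.default_rng(5)
T = rng.random((500, 2))              # 500 points in the unit square
N = greedy_net(T, eps=0.1)
print(f"|T| = {len(T)}, |N| = {len(N)}")   # every point of T is within 0.1 of some point of N
```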

Definition 2.52 ($\varepsilon$-packing). Let $T$ be a subset of a metric space $(M, \rho)$ and let $\varepsilon > 0$. The collection of points $N \subseteq T$ is called an $\varepsilon$-packing of $T$ if

\[
B_\rho(t, \varepsilon/2) \cap B_\rho(t', \varepsilon/2) = \emptyset, \qquad \forall t \ne t' \in N,
\]

that is, every pair of elements of $N$ is at distance more than $\varepsilon$. The largest cardinality of an $\varepsilon$-packing of $T$ is called the packing number

\[
\mathcal{P}(T, \rho, \varepsilon) = \sup\{|N| : N \text{ is an } \varepsilon\text{-packing of } T\}.
\]

Lemma 2.53 (Covering and packing numbers). For any $T \subseteq M$ and all $\varepsilon > 0$,

\[
\mathcal{N}(T, \rho, \varepsilon) \le \mathcal{P}(T, \rho, \varepsilon).
\]

Proof. Observe that a maximal $\varepsilon$-packing $N$ is an $\varepsilon$-net. Indeed, by maximality, any element of $T \setminus N$ is at distance at most $\varepsilon$ from an element of $N$.

Example 2.54 (Sphere in $\mathbb{R}^k$). Let $\mathbb{S}^{k-1}$ be the sphere of radius 1 centered around the origin in $\mathbb{R}^k$ with the Euclidean metric. Let $0 < \varepsilon < 1$.

Claim 2.55.

\[
\mathcal{N}(\mathbb{S}^{k-1}, \rho, \varepsilon) \le \left(\frac{3}{\varepsilon}\right)^k.
\]

Let $N$ be any maximal $\varepsilon$-packing of $\mathbb{S}^{k-1}$. We show that $|N| \le (3/\varepsilon)^k$, which implies the claim by Lemma 2.53. The balls of radius $\varepsilon/2$ around points in $N$, $\{B^k(x_i, \varepsilon/2) : x_i \in N\}$, satisfy two properties:

1. They are pairwise disjoint: if $z \in B^k(x_i, \varepsilon/2) \cap B^k(x_j, \varepsilon/2)$, then $\|x_i - x_j\|_2 \le \|x_i - z\|_2 + \|x_j - z\|_2 \le \varepsilon$, a contradiction.

2. They are included in the ball of radius $3/2$ around the origin: if $z \in B^k(x_i, \varepsilon/2)$, then $\|z\|_2 \le \|z - x_i\|_2 + \|x_i\|_2 \le \varepsilon/2 + 1 \le 3/2$.

The volume of a ball of radius $\varepsilon/2$ is $\frac{\pi^{k/2}(\varepsilon/2)^k}{\Gamma(k/2 + 1)}$ and that of a ball of radius $3/2$ is $\frac{\pi^{k/2}(3/2)^k}{\Gamma(k/2 + 1)}$. Dividing one by the other proves the claim. ◁

The basic approach to using an $\varepsilon$-net for controlling the supremum of a stochastic process is the following. We say that a stochastic process $\{X_t\}_{t \in T}$ is Lipschitz for the metric $\rho$ on $T$ if there is a random variable $0 < K < +\infty$ such that

\[
|X_t - X_s| \le K\rho(s, t), \qquad \forall s, t \in T.
\]

If, in addition, $X_t$ is sub-Gaussian for all $t$, then we can bound the expectation or tail probability of the supremum of $\{X_t\}_{t \in T}$ if we can bound the expectation or tail probability of the Lipschitz constant $K$. Indeed, let $N$ be an $\varepsilon$-net of $T$ and, for each $t \in T$, let $\pi(t)$ be the closest element of $N$ to $t$. We will refer to $\pi$ as the projection map of $N$. We then have the inequality

\[
\sup_{t \in T} X_t \le \sup_{t \in T}(X_t - X_{\pi(t)}) + \sup_{t \in T} X_{\pi(t)} \le K\varepsilon + \sup_{s \in N} X_s,
\]

where we can use Theorem 2.50 to bound the last term. We give an example next. Another example can be found in Section 2.4.6.

Spectral norm of a random matrix For an $m \times n$ matrix $A \in \mathbb{R}^{m \times n}$, recall that the spectral norm is defined as

\[
\|A\| := \sup_{x \in \mathbb{R}^n \setminus \{0\}} \frac{\|Ax\|_2}{\|x\|_2} = \sup_{x \in \mathbb{S}^{n-1},\, y \in \mathbb{S}^{m-1}} \langle Ax, y\rangle, \qquad (2.44)
\]

where $\mathbb{S}^{n-1}$ is the sphere of radius 1 around the origin in $\mathbb{R}^n$. (To see the equality above, note that Cauchy-Schwarz implies $\langle Ax, y\rangle \le \|Ax\|_2\|y\|_2$ and that one can take $y = Ax/\|Ax\|_2$ for any $x$ such that $Ax \ne 0$ in the rightmost expression.)

Theorem 2.56. Let $A \in \mathbb{R}^{m \times n}$ be a random matrix whose entries are centered, independent and sub-Gaussian with variance factor $\nu$. Then there exists a constant $0 < C < +\infty$ such that, for all $t > 0$,

\[
\|A\| \le C\sqrt{\nu}\,(\sqrt{m} + \sqrt{n} + t),
\]

with probability at least $1 - e^{-t^2}$.

Proof. Fix $\varepsilon = 1/4$. By Claim 2.55, there is an $\varepsilon$-net $N$ (respectively $M$) of $\mathbb{S}^{n-1}$ (respectively $\mathbb{S}^{m-1}$) with $|N| \le 12^n$ (respectively $|M| \le 12^m$). We proceed in two steps:

1. We first apply the general Hoeffding inequality (Theorem 2.39) to control the deviations of the supremum in (2.44) restricted to $N$ and $M$.

2. We then extend the bound to the full supremum by Lipschitz continuity.

Formally, the result follows from the following two lemmas.

Lemma 2.57 (Spectral norm: $\varepsilon$-net). Let $N$ and $M$ be as above. For $C$ large enough, for all $t > 0$,

\[
P\left[\max_{x \in N,\, y \in M} \langle Ax, y\rangle \ge \frac{1}{2}C\sqrt{\nu}\,(\sqrt{m} + \sqrt{n} + t)\right] \le e^{-t^2}.
\]

Lemma 2.58 (Spectral norm: Lipschitz constant). For any $\varepsilon$-nets $N$ and $M$ of $\mathbb{S}^{n-1}$ and $\mathbb{S}^{m-1}$ respectively, the following inequalities hold:

\[
\sup_{x \in N,\, y \in M} \langle Ax, y\rangle \le \|A\| \le \frac{1}{1 - 2\varepsilon}\,\sup_{x \in N,\, y \in M} \langle Ax, y\rangle.
\]

Proof of Lemma 2.57. Note that the quantity $\langle Ax, y\rangle$ is a linear combination of independent random variables:

\[
\langle Ax, y\rangle = \sum_{i, j} x_i y_j A_{ij}.
\]

By the general Hoeffding inequality (Theorem 2.39), $\langle Ax, y\rangle$ is sub-Gaussian with variance factor

\[
\sum_{i, j} (x_i y_j)^2\, \nu = \|x\|_2^2\,\|y\|_2^2\, \nu = \nu,
\]

for all $x \in N$ and $y \in M$. In particular, for all $\beta > 0$,

\[
P\left[\langle Ax, y\rangle \ge \beta\right] \le \exp\left(-\frac{\beta^2}{2\nu}\right).
\]

Hence, by a union bound over $N$ and $M$,

\begin{align*}
P\left[\max_{x \in N,\, y \in M} \langle Ax, y\rangle \ge \frac{1}{2}C\sqrt{\nu}\,(\sqrt{m} + \sqrt{n} + t)\right]
&\le \sum_{x \in N,\, y \in M} P\left[\langle Ax, y\rangle \ge \frac{1}{2}C\sqrt{\nu}\,(\sqrt{m} + \sqrt{n} + t)\right]\\
&\le |N||M|\exp\left(-\frac{1}{2\nu}\left\{\frac{1}{2}C\sqrt{\nu}\,(\sqrt{m} + \sqrt{n} + t)\right\}^2\right)\\
&\le 12^{n + m}\exp\left(-\frac{C^2}{8}\left(m + n + t^2\right)\right)\\
&\le e^{-t^2},
\end{align*}

for $C^2/8 \ge \ln 12 \vee 1$, where in the third inequality we ignored all cross-products since they are non-negative.

Proof of Lemma 2.58. The first inequality is immediate. For the second inequality, we will use the following observation:

\[
\langle Ax, y\rangle - \langle Ax', y'\rangle = \langle Ax, y - y'\rangle + \langle A(x - x'), y'\rangle. \qquad (2.45)
\]

Fix $x \in \mathbb{S}^{n-1}$ and $y \in \mathbb{S}^{m-1}$ such that $\langle Ax, y\rangle = \|A\|$, and let $x' \in N$ and $y' \in M$ be such that

\[
\|x - x'\|_2 \le \varepsilon \quad \text{and} \quad \|y - y'\|_2 \le \varepsilon.
\]

Then (2.45), Cauchy-Schwarz and the definition of the spectral norm imply

\[
\|A\| - \langle Ax', y'\rangle \le \|A\|\,\|x\|_2\,\|y - y'\|_2 + \|A\|\,\|x - x'\|_2\,\|y'\|_2 \le 2\varepsilon\|A\|.
\]

Rearranging gives the claim.

Putting the two lemmas together concludes the proof of Theorem 2.56.

Chaining method

We go back to the inequality

\[
\sup_{t \in T} X_t \le \sup_{t \in T}(X_t - X_{\pi(t)}) + \sup_{t \in T} X_{\pi(t)}. \qquad (2.46)
\]

We consider cases where we may not have a good almost sure bound on the Lipschitz constant, but where we can control increments uniformly in the following probabilistic sense. We say that a stochastic process $\{X_t\}_{t \in T}$ has sub-Gaussian increments on $(T, \rho)$ if there exists $0 \le K < +\infty$ such that

\[
X_t - X_s \in \mathrm{sG}(K^2\rho(s, t)^2), \qquad \forall s, t \in T.
\]

In (2.46), the first term on the r.h.s. remains a supremum over an infinite set. To control it, the chaining method repeats the argument at progressively smaller scales, leading to the following inequality. The diameter of $T$, denoted by $\mathrm{diam}(T)$, is defined as

\[
\mathrm{diam}(T) = \sup\{\rho(s, t) : s, t \in T\}.
\]

We apply the discrete Dudley inequality in Section 2.4.8.

Theorem 2.59 (Discrete Dudley's Inequality). Let $\{X_t\}_{t \in T}$ be a zero-mean stochastic process with sub-Gaussian increments on $(T, \rho)$ and assume $\mathrm{diam}(T) \le 1$. Then

\[
E\left[\sup_{t \in T} X_t\right] \le C\sum_{k=0}^{+\infty} 2^{-k}\sqrt{\log\mathcal{N}(T, \rho, 2^{-k})},
\]

for some constant $0 \le C < +\infty$.

Proof. Recall that we assume there is a countable $T_0 \subseteq T$ such that $\sup_{t \in T} X_t = \sup_{t \in T_0} X_t$ almost surely. Let $T_k \subseteq T_0$, $k \ge 1$, be a sequence of finite sets such that $T_k \uparrow T_0$. By monotone convergence

\[
E\left[\sup_{t \in T_0} X_t\right] = \sup_{k \ge 1} E\left[\sup_{t \in T_k} X_t\right].
\]

Moreover, $\mathcal{N}(T_k, \rho, \varepsilon) \le \mathcal{N}(T, \rho, \varepsilon)$ for any $\varepsilon > 0$ since $T_k \subseteq T$. Hence it suffices to handle the case $|T| < +\infty$.

"-nets at all scales For each k � 0, let Nk be an 2�k-net with |Nk| = N (T, ⇢, 2�k

)

and projection map ⇡k. Because diam(T ) 1, N0 = {t0} where t0 2 T can betaken arbitrarily. Moreover, because T is finite, there is 1 < +1 such thatNk = T for all k � . In particular, ⇡(t) = t for all t 2 T . Hence, by atelescoping argument,

Xt = Xt0 +

�1X

k=0

�X⇡k+1(t)

�X⇡k(t)

�.

Taking a supremum and an expectation gives

Esupt2T

Xt

�=

�1X

k=0

Esupt2T

�X⇡k+1(t)

�X⇡k(t)

��, (2.47)

where we used E[Xt0 ] = 0.

Sub-Gaussian bound We use Theorem 2.50 to bound the expectation in (2.47).The number of elements in the supremum is at most

|{(⇡k(t),⇡k+1(t)) : t 2 T}| |Nk ⇥Nk+1|

= |Nk|⇥ |Nk+1|

(N (T, ⇢, 2�k�1))

2.

For any t 2 T , by the triangle inequality

⇢(⇡k(t),⇡k+1(t)) ⇢(⇡k(t), t) + ⇢(t,⇡k+1(t)) 2�k

+ 2�k�1

2�k+1,

so thatX⇡k+1(t)

�X⇡k(t)2 sG(K2

2�2k+2

),

for some 0 K < +1 by the sub-Gaussian increments assumption. We cantherefore apply Theorem 2.50 to get

Esupt2T

�X⇡k+1(t)

�X⇡k(t)

��

q2K22�2k+2 log(N (T, ⇢, 2�k�1))2

C2�k�1

qlogN (T, ⇢, 2�k�1),

for some constant 0 C < +1. Plugging back into (2.47),

Esupt2T

Xt

�1X

k=0

C2�k�1

qlogN (T, ⇢, 2�k�1),

which implies the claim.

69

Page 54: Moments and tails - UW-Madison Department of Mathematicsroch/mdp/roch-mdp-chap2.pdf · moments central moment are of course the mean and variance, the square root of which is the

Using a similar argument, one can derive a tail inequality.

Theorem 2.60 (Chaining Tail Inequality). Let $\{X_t\}_{t \in T}$ be a zero-mean stochastic process with sub-Gaussian increments on $(T, \rho)$ and assume $\mathrm{diam}(T) \le 1$. Then, for all $t_0 \in T$ and $\beta > 0$,

\[
P\left[\sup_{t \in T}(X_t - X_{t_0}) \ge C\sum_{k=0}^{+\infty} 2^{-k}\sqrt{\log\mathcal{N}(T, \rho, 2^{-k})} + \beta\right] \le C\exp\left(-\frac{\beta^2}{C}\right),
\]

for some constant $0 \le C < +\infty$.

2.4.6 Data science: Johnson-Lindenstrauss and application to sparse recovery

In this section we discuss an application of the Chernoff-Cramér method to dimensionality reduction. We use once again an $\varepsilon$-net argument.

Johnson-Lindenstrauss lemma The Johnson-Lindenstrauss lemma states roughly that, for any collection of points in a high-dimensional Euclidean space, one can find an embedding of much lower dimension that preserves the metric relationships of the points, i.e., their norms and distances. Remarkably, no structure is assumed on the points and the result is independent of the original dimension. The method of proof simply involves applying a well-chosen random linear mapping.

Before stating and proving the lemma, we define the mapping employed. Let $A$ be a $d \times n$ matrix whose entries are independent $N(0, 1)$. Note that, for any fixed $z \in \mathbb{R}^n$,

\[
E\|Az\|_2^2 = E\left[\sum_{i=1}^d\left(\sum_{j=1}^n A_{ij}z_j\right)^2\right] = d\,\mathrm{Var}\left[\sum_{j=1}^n A_{1j}z_j\right] = d\|z\|_2^2, \qquad (2.48)
\]

where we used the independence of the $A_{1j}$'s and

\[
E\left[\sum_{j=1}^n A_{ij}z_j\right] = 0. \qquad (2.49)
\]

Hence the linear mapping $L = \frac{1}{\sqrt{d}}A$ is an isometry "on average." We use the Chernoff-Cramér method to prove a high probability result.

Theorem 2.61. Let $x^{(1)}, \ldots, x^{(m)}$ be arbitrary points in $\mathbb{R}^n$. Fix $\delta, \theta > 0$ and $d \ge \frac{8}{3}\theta^{-2}\left(\log m + \frac{1}{2}\log\delta^{-1}\right)$. Let $A$ be a $d \times n$ matrix whose entries are independent $N(0, 1)$ and consider the linear mapping $L = \frac{1}{\sqrt{d}}A$. Then the following hold with probability at least $1 - \delta$:

\[
(1 - \theta)\|x^{(i)}\|_2 \le \|Lx^{(i)}\|_2 \le (1 + \theta)\|x^{(i)}\|_2, \qquad \forall i,
\]

and

\[
(1 - \theta)\|x^{(i)} - x^{(j)}\|_2 \le \|Lx^{(i)} - Lx^{(j)}\|_2 \le (1 + \theta)\|x^{(i)} - x^{(j)}\|_2, \qquad \forall i, j.
\]

Proof. Fix a $z \in \mathbb{R}^n$ with $\|z\|_2 = 1$. To prove the theorem it suffices to show that $\|Lz\|_2 \notin [1 - \theta, 1 + \theta]$ with probability at most $\delta/(m + \binom{m}{2})$ and use a union bound over all points $z = x^{(i)}/\|x^{(i)}\|_2$ and pairwise differences $z = (x^{(i)} - x^{(j)})/\|x^{(i)} - x^{(j)}\|_2$.

Lemma 2.62 (Johnson-Lindenstrauss lemma). Fix a $z \in \mathbb{R}^n$ with $\|z\|_2 = 1$. Let $A$ be a $d \times n$ matrix whose entries are independent $N(0, 1)$ and consider the linear mapping $L = \frac{1}{\sqrt{d}}A$. Then, for all $\theta > 0$,

\[
P\left[\|Lz\|_2 > 1 + \theta\right] \le \exp\left(-\frac{3}{4}d\theta^2\right). \qquad (2.50)
\]

Proof. Recall that a sum of independent Gaussians is Gaussian. So

\[
(Az)_k \sim N(0, \|z\|_2^2) = N(0, 1), \qquad \forall k,
\]

where we used (2.48) and (2.49) to compute the variance. Hence

\[
W = \|Az\|_2^2 = \sum_{k=1}^d \{(Az)_k\}^2
\]

is a sum of squares of independent Gaussians (also known as a $\chi^2$-distribution with $d$ degrees of freedom, or $\Gamma(d/2, 2)$). Using independence and the change of variable $u = x\sqrt{1 - 2s}$, for $s < 1/2$,

\[
M_W(s) = \left(\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{+\infty} e^{sx^2} e^{-x^2/2}\, dx\right)^d
= \left(\frac{1}{\sqrt{1 - 2s}} \times \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{+\infty} e^{-u^2/2}\, du\right)^d
= \frac{1}{(1 - 2s)^{d/2}}.
\]

Applying the Chernoff-Cramér bound (2.25) with $s = \frac{1}{2}(1 - d/\beta)$ gives

\[
P[W \ge \beta] \le \frac{M_W(s)}{e^{s\beta}} = \frac{1}{e^{s\beta}(1 - 2s)^{d/2}} = e^{(d - \beta)/2}\left(\frac{\beta}{d}\right)^{d/2}.
\]

Finally, take $\beta = d(1 + \theta)^2$. Rearranging we get

\begin{align*}
P[\|Lz\|_2 \ge 1 + \theta] &= P[\|Az\|_2^2 \ge d(1 + \theta)^2]\\
&= P[W \ge \beta]\\
&\le e^{d\{1 - (1 + \theta)^2\}/2}\left((1 + \theta)^2\right)^{d/2}\\
&= \exp\left(-d\left(\theta + \theta^2/2 - \log(1 + \theta)\right)\right)\\
&\le \exp\left(-\frac{3}{4}d\theta^2\right),
\end{align*}

where we used that $\log(1 + x) \le x - x^2/4$ on $[0, 1]$. (To check the inequality, note that the two sides are equal at 0 and compare the derivatives on $[0, 1]$.)

Our choice of $d$ gives that the r.h.s. in (2.50) is less than $\delta/m^2$.
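The mapping $L = A/\sqrt{d}$ of Theorem 2.61 is easy to test empirically; the following sketch uses arbitrary values of $m$, $n$, $\theta$, $\delta$ and assumes numpy:

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 100, 5000
theta, delta = 0.2, 0.05
d = int(np.ceil(8 / (3 * theta**2) * (np.log(m) + 0.5 * np.log(1 / delta))))

X = rng.standard_normal((m, n))           # m arbitrary points in R^n
A = rng.standard_normal((d, n))
L = A / np.sqrt(d)

orig = np.linalg.norm(X, axis=1)
proj = np.linalg.norm(X @ L.T, axis=1)
ratios = proj / orig
print(f"d = {d}, norm ratios in [{ratios.min():.3f}, {ratios.max():.3f}] (target [1-theta, 1+theta])")
```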

Remark 2.63. The Johnson-Lindenstrauss lemma is essentially optimal [Alo03]. Note however that it relies crucially on the use of the Euclidean norm [BC03].

To give some further geometric insight into the Johnson-Lindenstrauss lemma, we make a series of observations:

1. The rows of $\frac{1}{\sqrt{n}}A$ are "on average" orthonormal. Indeed, note that for $i \ne j$

\[
E\left[\frac{1}{n}\sum_{k=1}^n A_{ik}A_{jk}\right] = E[A_{i1}]\,E[A_{j1}] = 0,
\]

by independence and the fact that the $A_{ik}$'s have 0 mean, and

\[
E\left[\frac{1}{n}\sum_{k=1}^n A_{ik}^2\right] = E[A_{i1}^2] = 1.
\]

When $n$ is large, these two quantities are concentrated around their means.

2. So, roughly speaking, $\frac{1}{\sqrt{n}}Az$ corresponds to an orthogonal projection of $z$ onto a uniformly random subspace of dimension $d$ and $\frac{1}{\sqrt{n}}\|Az\|_2$ gives the norm of that projection. Then (2.48) shows that $\frac{1}{\sqrt{n}}\|Az\|_2$ has expectation roughly $\sqrt{d/n}$.

3. A more geometric way to come to this last conclusion is to notice that projecting on a uniform random subspace of dimension $d$ is equivalent to performing a uniform rotation of $z$ first and then projecting the resulting vector onto the first $d$ dimensions. In other words, $\frac{1}{\sqrt{n}}\|Az\|_2$ is approximately distributed as the norm of the first $d$ components of a uniform unit vector in $\mathbb{R}^n$. To analyze this quantity, observe that a vector in $\mathbb{R}^n$ whose components are independent $N(0, 1)$, when divided by its norm, produces a uniform unit vector in $\mathbb{R}^n$. When $d$ is large, the norm of the first $d$ components is therefore a ratio whose numerator is concentrated around $\sqrt{d}$ and whose denominator is concentrated around $\sqrt{n}$.

4. Either way, $\|Lz\|_2 = \sqrt{\frac{n}{d}} \times \frac{1}{\sqrt{n}}\|Az\|_2$ can be expected to be close to 1.

The Johnson-Lindenstrauss lemma makes it possible to solve certain computational problems, e.g., finding the nearest point to a query, more efficiently by working in a smaller dimension. We discuss a different type of application next.

Restricted isometries In the compressed sensing problem, one seeks to recover a signal $x \in \mathbb{R}^n$ from a small number of measurements $(Lx)_i$, $i = 1, \ldots, d$. In complete generality, one needs $n$ such measurements to recover any $x \in \mathbb{R}^n$, as the sensing matrix $L$ must be invertible (or, more precisely, injective). However, by imposing extra structure on the signal, much better results can be obtained. We consider the case of sparse signals.

Definition 2.64 (Sparse vectors). We say that a vector $z \in \mathbb{R}^n$ is $k$-sparse if it has at most $k$ non-zero entries. We let $\mathcal{S}^n_k$ be the set of $k$-sparse vectors in $\mathbb{R}^n$. Note that $\mathcal{S}^n_k$ is a union of $\binom{n}{k}$ linear subspaces, one for each support of the non-zero entries.

To solve the compressed sensing problem in this case, it suffices to find a sensing matrix $L$ such that all subsets of $2k$ columns are linearly independent. Indeed, if $x, x' \in \mathcal{S}^n_k$, then $x - x'$ has at most $2k$ non-zero entries. Hence, in order to have $L(x - x') = 0$, it must be that $x - x' = 0$ under the previous condition on $L$. That implies the required injectivity. The implication goes in the other direction as well. Observe for instance that the matrix used in the Johnson-Lindenstrauss lemma satisfies this property as long as $d \ge 2k$: because of the continuous density of its entries, the probability that $2k$ of its columns are linearly dependent is 0 when $d \ge 2k$. For practical applications, however, other requirements must be met: computational efficiency and robustness to measurement noise as well as to the sparsity assumption. We discuss the first two issues (and see Remark 2.68 for the last one, which can be dealt with along the same lines).

The following definition will play a key role.

Definition 2.65 (Restricted isometry property). A $d \times n$ linear mapping $L$ satisfies the $(k, \theta)$-restricted isometry property (RIP) if for all $z \in \mathcal{S}^n_k$

\[
(1 - \theta)\|z\|_2 \le \|Lz\|_2 \le (1 + \theta)\|z\|_2. \qquad (2.51)
\]

We say that $L$ is $(k, \theta)$-RIP.

Given a $(k, \theta)$-RIP matrix $L$, can we recover $z \in \mathcal{S}^n_k$ from $Lz$? And how small can $d$ be? The next two claims answer these questions.

Claim 2.66. Let $L$ be $(10k, 1/3)$-RIP. Then:

1. (Sparse case) For any $x \in \mathcal{S}^n_k$, the unique solution to the following minimization problem

\[
\min_{z \in \mathbb{R}^n} \|z\|_1 \quad \text{subject to} \quad Lz = Lx, \qquad (2.52)
\]

is $z^* = x$.

2. (Almost sparse case) For any $x \in \mathbb{R}^n$, the solution to (2.52) satisfies $\|z^* - x\|_2 = O(\eta(x)/\sqrt{k})$, where $\eta(x) := \min_{x' \in \mathcal{S}^n_k}\|x - x'\|_1$.

Claim 2.67. Let $A$ be a $d \times n$ matrix whose entries are i.i.d. $N(0, 1)$ and let $L = \frac{1}{\sqrt{d}}A$. There is a constant $0 < C < +\infty$ such that if $d \ge Ck\log n$ then $L$ is $(k, 1/3)$-RIP with probability at least $1 - 1/n$.

Roughly speaking, a restricted isometry preserves enough of the metric structure of $\mathcal{S}^n_k$ to be invertible on its image. The purpose of the $\ell_1$ minimization in (2.52) is to promote sparsity. See Figure 2.7 for an illustration. It may seem that a more natural approach would be to minimize the number of non-zero entries in $z$. However the advantage of $\ell_1$ minimization is that it can be formulated as a linear program, i.e., the minimization of a linear objective subject to linear inequalities. This permits much faster computation of the solution using standard techniques; a small example is sketched below. See Exercise 2.10.
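As an illustration of the linear-programming reformulation mentioned above (and not of the proof), here is a minimal sparse-recovery sketch assuming scipy is available. Writing $z = z^+ - z^-$ with $z^+, z^- \ge 0$ gives $\|z\|_1 = \sum_j (z^+_j + z^-_j)$ while keeping the constraint $Lz = Lx$ linear. The dimensions and sparsity level are arbitrary choices.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(7)
n, d, k = 200, 60, 5

# k-sparse signal and Gaussian sensing matrix L = A / sqrt(d)
x = np.zeros(n)
x[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
L = rng.standard_normal((d, n)) / np.sqrt(d)
b = L @ x

# minimize sum(z_plus + z_minus) subject to L(z_plus - z_minus) = b, z_plus, z_minus >= 0
c = np.ones(2 * n)
A_eq = np.hstack([L, -L])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n), method="highs")
z_star = res.x[:n] - res.x[n:]
print(f"recovery error ||z* - x||_2 = {np.linalg.norm(z_star - x):.2e}")
```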

In Claim 2.67, note that $d$ is much smaller than $n$ and not far from the $2k$ bound we derived above. Note also that Claim 2.67 does not follow immediately from the Johnson-Lindenstrauss lemma. Indeed that lemma shows that a matrix with i.i.d. Gaussian entries is an approximate isometry on a finite set of points. Here we need a linear mapping that is an approximate isometry for all vectors in $\mathcal{S}^n_k$. The proof of this stronger statement uses an $\varepsilon$-net argument.

Figure 2.7: Because $\ell_1$ balls (in orange) have corners, minimizing the $\ell_1$ norm over a linear subspace (in green) tends to produce sparse solutions.

"-net argument We start with the proof of Claim 2.67. For a subset of indicesJ ✓ [n] and a vector y 2 Rn, we let yJ be the vector y restricted to the entriesin J , i.e., the sub-vector (yj)j2J . Fix a subset of indices I ✓ [n] of size k. Aswe mentioned above, the Johnson-Lindenstrauss lemma only applies to a finitecollection of vectors. However we need the RIP condition to hold for all z 2 Rn

with non-zero entries in I (and all such I). The way out is to use an "-net argument,as described in Section ??. Notice that, for z 6= 0, the function kLzk2/kzk2

1. does not depend on the norm of z, so that we can restrict ourselves to thecompact set @BI := {z : z[n]\I = 0, kzk2 = 1}, and

2. is continuous on @BI , so that it suffices to construct a fine enough coveringof @BI by a finite collection of balls and apply the Johnson-Lindenstrausslemma to the centers of these balls.

Proof of Claim 2.67. Let $I \subseteq [n]$ be a subset of indices of size $k$. There are $\binom{n}{k} \le n^k = \exp(k\log n)$ such subsets and we denote their collection by $\binom{[n]}{k}$. We let $N_I$ be an $\varepsilon$-net of $\partial B_I$. The proof strategy is to apply the Johnson-Lindenstrauss lemma to each $N_I$, $I \in \binom{[n]}{k}$, and use a continuity argument to extend the RIP property to all $k$-sparse vectors.

- Bounding the Lipschitz constant. We first choose an appropriate value for $\varepsilon$. Let $A^*$ be the largest entry of $A$ in absolute value. For all $y \in N_I$ and all $z \in \partial B_I$ within distance $\varepsilon$ of $y$, by the triangle inequality, we have $\|Lz\|_2 \le \|Ly\|_2 + \|L(z - y)\|_2$ and $\|Lz\|_2 \ge \|Ly\|_2 - \|L(z - y)\|_2$. Moreover,

\[
\|L(z - y)\|_2^2 = \sum_{i=1}^d\left(\sum_{j=1}^n L_{ij}(z_j - y_j)\right)^2
\le \sum_{i=1}^d\left(\sum_{j=1}^n L_{ij}^2\right)\left(\sum_{j=1}^n (z_j - y_j)^2\right)
\le \|z - y\|_2^2 \cdot dn\left(\frac{1}{\sqrt{d}}A^*\right)^2
\le (\varepsilon A^*)^2 n,
\]

where we used Cauchy-Schwarz. It remains to bound $A^*$. For this we use the Chernoff-Cramér bound for Gaussians in (2.27), which implies

\[
P\left[\exists i, j,\ |A_{ij}| \ge C\sqrt{\log n}\right] \le n^2 e^{-(C\sqrt{\log n})^2/2} \le \frac{1}{2n}, \qquad (2.53)
\]

for a $C > 0$ large enough. Hence we take

\[
\varepsilon = \frac{1}{6C\sqrt{n\log n}},
\]

and, assuming that the event in (2.53) does not hold, we get

\[
\left|\|Lz\|_2 - \|Ly\|_2\right| \le \frac{1}{6}. \qquad (2.54)
\]

- Applying Johnson-Lindenstrauss to the $\varepsilon$-nets. By Claim 2.55,

\[
|\{N_I\}_I| \le n^k\left(\frac{3}{\varepsilon}\right)^k \le \exp(C'k\log n),
\]

for some $C' > 0$. Apply the Johnson-Lindenstrauss lemma to $\{N_I\}_I$ with $\theta = 1/6$, $\delta = \frac{1}{2n}$, and

\[
d = \frac{8}{3}\theta^{-2}\left(\log|\{N_I\}_I| + \frac{1}{2}\log(2n)\right) = \Theta(k\log n).
\]

Then

\[
\frac{5}{6} = \frac{5}{6}\|y\|_2 \le \|Ly\|_2 \le \frac{7}{6}\|y\|_2 = \frac{7}{6}, \qquad \forall I,\ \forall y \in N_I. \qquad (2.55)
\]

Assuming (2.54) and (2.55) hold, an event of probability at least $1 - 2(1/2n) = 1 - 1/n$, we finally get

\[
\frac{2}{3} \le \|Lz\|_2 \le \frac{4}{3}, \qquad \forall I,\ \forall z \in \partial B_I.
\]

That concludes the proof.

$\ell_1$ minimization It remains to prove Claim 2.66.

Proof of Claim 2.66. We only prove the sparse case $x \in \mathcal{S}^n_k$. For the almost sparse case, see Exercise 2.11. Let $z^*$ be a solution to (2.52) and note that such a solution exists because $z = x$ satisfies the constraint in (2.52). W.l.o.g. assume that only the first $k$ entries of $x$ are non-zero, i.e., $x_{[n] \setminus [k]} = 0$. Moreover, order the remaining coordinates so that the entries of the residual $r = z^* - x$ over $[n] \setminus [k]$ are non-increasing in absolute value. Our goal is to show that $\|r\|_2 = 0$.

In order to leverage the RIP condition, we break up the vector $r$ into $9k$-long sub-vectors. Let

\[
I_0 = [k], \qquad I_i = \{(9(i - 1) + 1)k + 1, \ldots, (9i + 1)k\}, \quad \forall i \ge 1,
\]

and let $I_{01} = I_0 \cup I_1$, $\bar{I}_i = \bigcup_{j > i} I_j$ and $\bar{I}_{01} = \bar{I}_1$.

We first use the optimality of $z^*$. Note that $x_{\bar{I}_0} = 0$ implies that

\[
\|z^*\|_1 = \|z^*_{I_0}\|_1 + \|z^*_{\bar{I}_0}\|_1 = \|z^*_{I_0}\|_1 + \|r_{\bar{I}_0}\|_1,
\]

and

\[
\|x\|_1 = \|x_{I_0}\|_1 \le \|z^*_{I_0}\|_1 + \|r_{I_0}\|_1,
\]

by the triangle inequality. Since $\|z^*\|_1 \le \|x\|_1$ we then have

\[
\|r_{\bar{I}_0}\|_1 \le \|r_{I_0}\|_1. \qquad (2.56)
\]

On the other hand, the RIP condition gives a similar inequality in the other direction. Indeed notice that $Lr = 0$ by the constraint in (2.52), so that $Lr_{I_{01}} = -\sum_{i \ge 2} Lr_{I_i}$. Then, by the RIP condition and the triangle inequality, we have that

\[
\frac{2}{3}\|r_{I_{01}}\|_2 \le \|Lr_{I_{01}}\|_2 \le \sum_{i \ge 2}\|Lr_{I_i}\|_2 \le \frac{4}{3}\sum_{i \ge 2}\|r_{I_i}\|_2. \qquad (2.57)
\]

We note that, by the ordering of the entries of $r$,

\[
\|r_{I_{i+1}}\|_2^2 \le 9k\left(\frac{\|r_{I_i}\|_1}{9k}\right)^2, \qquad (2.58)
\]

where we bounded $r_{I_{i+1}}$ entrywise by the expression in parentheses. Combining (2.56) and (2.58), and using that $\|r_{I_0}\|_1 \le \sqrt{k}\|r_{I_0}\|_2$ by Cauchy-Schwarz, we have

\[
\sum_{i \ge 2}\|r_{I_i}\|_2 \le \sum_{j \ge 1}\frac{\|r_{I_j}\|_1}{\sqrt{9k}} = \frac{\|r_{\bar{I}_0}\|_1}{3\sqrt{k}} \le \frac{\|r_{I_0}\|_1}{3\sqrt{k}} \le \frac{\|r_{I_0}\|_2}{3} \le \frac{\|r_{I_{01}}\|_2}{3}.
\]

Plugging this back into (2.57) gives

\[
\|r_{I_{01}}\|_2 \le 2\sum_{i \ge 2}\|r_{I_i}\|_2 \le \frac{2}{3}\|r_{I_{01}}\|_2,
\]

which implies $r_{I_{01}} = 0$. In particular $r_{I_0} = 0$ and, by (2.56), $r_{\bar{I}_0} = 0$ as well. We have shown that $r = 0$, i.e., $z^* = x$.

Remark 2.68. Claim 2.66 can be extended to noisy measurements (using a slight modification of (2.52)). See [CRT06b].

2.4.7 Markov chains: Varopoulos-Carne and application to mixing times

If $(S_t)$ is simple random walk on $\mathbb{Z}$, then Lemma 2.33 implies that, for any $x, y \in \mathbb{Z}$,

\[
P^t(x, y) \le e^{-|x - y|^2/2t}, \qquad (2.59)
\]

where $P$ is the transition matrix of $(S_t)$. Interestingly, a bound similar to (2.59) holds for any reversible Markov chain, and Lemma 2.33 plays an unexpected role in its proof. An application to mixing times is discussed below.

Varopoulos-Carne Our main bound is the following.

Theorem 2.69 (Varopoulos-Carne). Let $P$ be the transition matrix of an irreducible Markov chain $(X_t)$ on the countable state space $V$. Assume further that $P$ is reversible with respect to the stationary measure $\pi$ and that the corresponding network $\mathcal{N}$ is locally finite. Then the following holds:

\[
\forall x, y \in V,\ \forall t \in \mathbb{N}, \qquad P^t(x, y) \le 2\sqrt{\frac{\pi(y)}{\pi(x)}}\, e^{-\rho(x, y)^2/2t},
\]

where $\rho(x, y)$ is the graph distance between $x$ and $y$ on $\mathcal{N}$.

As a sanity check before proving the theorem, note that if the chain is aperiodic and $\pi$ is a stationary distribution then

\[
P^t(x, y) \to \pi(y) \le 2\sqrt{\frac{\pi(y)}{\pi(x)}}, \qquad \text{as } t \to +\infty,
\]

since $\pi(x), \pi(y) \le 1$.

Proof of Theorem 2.69. The idea of the proof is to show that

\[
P^t(x, y) \le 2\sqrt{\frac{\pi(y)}{\pi(x)}}\, P[S_t \ge \rho(x, y)],
\]

where $S_t$ is simple random walk on $\mathbb{Z}$ started at 0, and use Lemma 2.33.

By assumption only a finite number of states can be reached by time $t$. Hence we can reduce the problem to a finite state space. More precisely, let $\bar{V} = \{z \in V : \rho(x, z) \le t\}$ and

\[
\bar{P}(z, w) =
\begin{cases}
P(z, w), & \text{if } z \ne w,\\
P(z, z) + P(z, V \setminus \bar{V}), & \text{otherwise.}
\end{cases}
\]

By construction $\bar{P}$ is reversible with respect to $\bar{\pi} = \pi/\pi(\bar{V})$ on $\bar{V}$. Because in time $t$ one never reaches a state $z$ where $P(z, V \setminus \bar{V}) > 0$, by Chapman-Kolmogorov and using the fact that $\bar{\pi}(y)/\bar{\pi}(x) = \pi(y)/\pi(x)$, it suffices to prove the result for $\bar{P}$. Hence we assume without loss of generality that $V$ is finite with $|V| = n$.

To relate $(X_t)$ to simple random walk on $\mathbb{Z}$, we use a special representation of $P^t$ based on Chebyshev polynomials. For $\xi = \cos\theta \in [-1, 1]$, $T_k(\xi) = \cos k\theta$ is a Chebyshev polynomial of the first kind. Note that $|T_k(\xi)| \le 1$ on $[-1, 1]$. The classical trigonometric identity (to see this, write it in complex form)

\[
\cos((k + 1)\theta) + \cos((k - 1)\theta) = 2\cos\theta\cos(k\theta)
\]

implies the recursion

\[
T_{k+1}(\xi) + T_{k-1}(\xi) = 2\xi\, T_k(\xi),
\]

which in turn implies that $T_k$ is indeed a polynomial. It has degree $k$ by induction and the fact that $T_0(\xi) = 1$ and $T_1(\xi) = \xi$. The connection to simple random walk on $\mathbb{Z}$ comes from the following somewhat miraculous representation (which does not rely on reversibility). Let $T_k(P)$ denote the polynomial $T_k$ evaluated at $P$ as a matrix polynomial.

Lemma 2.70.

\[
P^t = \sum_{k=-t}^{t} P[S_t = k]\, T_{|k|}(P).
\]

Proof. This essentially follows from the binomial theorem. It suffices to prove

\[
\xi^t = \sum_{k=-t}^{t} P[S_t = k]\, T_{|k|}(\xi),
\]

as an identity of polynomials. Since $P[S_t = k] = 2^{-t}\binom{t}{(t + k)/2}$ for $|k| \le t$ of the same parity as $t$ (and 0 otherwise), this follows immediately from the expansion

\[
\xi^t = \left(\frac{e^{i\theta} + e^{-i\theta}}{2}\right)^t
= \sum_{\ell=0}^t 2^{-t}\binom{t}{\ell}(e^{i\theta})^\ell(e^{-i\theta})^{t - \ell}
= \sum_{k=-t}^{t} P[S_t = k]\, e^{ik\theta},
\]

where we used that the probability that

\[
S_t = -t + 2\ell = (+1)\ell + (-1)(t - \ell)
\]

is the probability of making $\ell$ steps to the right and $t - \ell$ steps to the left. (Put differently, $(\cos\theta)^t$ is the characteristic function of $S_t$.) Take real parts and use $\cos(k\theta) = \cos(-k\theta)$.
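Lemma 2.70 is easy to verify numerically on a small chain; the sketch below, assuming numpy, uses an arbitrary lazy walk on a path, builds $T_k(P)$ via the recursion $T_{k+1} = 2PT_k - T_{k-1}$, and compares both sides.

```python
import numpy as np
from math import comb

# lazy simple random walk on a path with 5 states (reversible w.r.t. the uniform measure)
P = np.zeros((5, 5))
for z in range(5):
    P[z, z] = 0.5
    for w in (z - 1, z + 1):
        if 0 <= w < 5:
            P[z, w] += 0.25
        else:
            P[z, z] += 0.25

t = 6
# Chebyshev matrix polynomials via T_0 = I, T_1 = P, T_{k+1} = 2 P T_k - T_{k-1}
T = [np.eye(5), P.copy()]
for k in range(2, t + 1):
    T.append(2 * P @ T[-1] - T[-2])

# right-hand side of Lemma 2.70: sum_k P[S_t = k] T_|k|(P)
rhs = sum(comb(t, (t + k) // 2) / 2**t * T[abs(k)]
          for k in range(-t, t + 1) if (t + k) % 2 == 0)
print(np.allclose(np.linalg.matrix_power(P, t), rhs))   # True
```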

We bound $T_k(P)(x, y)$ as follows.

Lemma 2.71.

\[
T_k(P)(x, y) = 0, \qquad \forall k < \rho(x, y),
\]

and

\[
T_k(P)(x, y) \le \sqrt{\frac{\pi(y)}{\pi(x)}}, \qquad \forall k \ge \rho(x, y).
\]

Proof. Note that $T_k(P)(x, y) = 0$ when $k < \rho(x, y)$ because $T_k(P)(x, y)$ is a function of the entries $P^\ell(x, y)$ for $\ell \le k$.

Let $f_1, \ldots, f_n$ be a right eigenvector decomposition of $P$, orthonormal with respect to the inner product

\[
\langle f, g\rangle_\pi = \sum_{x \in V}\pi(x)f(x)g(x),
\]

with eigenvalues $\lambda_1, \ldots, \lambda_n \in [-1, 1]$. Such a decomposition exists by the reversibility of $P$. See Lemma A.7. Then $f_1, \ldots, f_n$ is also an eigenvector decomposition of the polynomial $T_k(P)$ with eigenvalues $T_k(\lambda_1), \ldots, T_k(\lambda_n) \in [-1, 1]$ by the definition of Chebyshev polynomials. By decomposing any $f = \sum_{i=1}^n \alpha_i f_i$ according to this eigenbasis, that implies that

\[
\|T_k(P)f\|_\pi^2 = \sum_{i=1}^n \alpha_i^2\, T_k(\lambda_i)^2\langle f_i, f_i\rangle_\pi \le \sum_{i=1}^n \alpha_i^2\langle f_i, f_i\rangle_\pi = \|f\|_\pi^2, \qquad (2.60)
\]

where we used the norm $\|f\|_\pi$ associated with the inner product $\langle\cdot,\cdot\rangle_\pi$, the orthonormality of the eigenvector basis under this inner product, and the fact that $T_k(\lambda_i)^2 \in [0, 1]$.

Let $\delta_x$ denote the point mass at $x$. By Cauchy-Schwarz and (2.60),

\[
T_k(P)(x, y) = \frac{\langle\delta_x, T_k(P)\delta_y\rangle_\pi}{\pi(x)}
\le \frac{\|\delta_x\|_\pi\|\delta_y\|_\pi}{\pi(x)}
= \frac{\sqrt{\pi(x)}\sqrt{\pi(y)}}{\pi(x)}
= \sqrt{\frac{\pi(y)}{\pi(x)}},
\]

for $k \ge \rho(x, y)$.

Combining the two lemmas gives the result.

Remark 2.72. The local finiteness assumption is made for simplicity only. The result holds for any countable-space, reversible chain. See [LP, Section 13.2].

Lower bound on mixing Let $(X_t)$ be an irreducible, aperiodic Markov chain with finite state space $V$ and stationary distribution $\pi$. Recall that, for a fixed $0 < \varepsilon < 1/2$, the mixing time is

\[
t_{\mathrm{mix}}(\varepsilon) = \min\{t : d(t) \le \varepsilon\},
\]

where

\[
d(t) = \max_{x \in V}\|P^t(x, \cdot) - \pi\|_{\mathrm{TV}}.
\]

It is intuitively clear that $t_{\mathrm{mix}}(\varepsilon)$ is at least of the order of the "diameter" of the transition graph of $P$. For $x, y \in V$, let $\rho(x, y)$ be the graph distance between $x$ and $y$ on the undirected version of the transition graph, i.e., ignoring the orientation of the edges. With this definition, a shortest directed path from $x$ to $y$ contains at least $\rho(x, y)$ edges. Here we define the diameter of the transition graph as $\Delta := \max_{x, y \in V}\rho(x, y)$. Let $x_0, y_0$ be a pair of vertices achieving the diameter. Then we claim that $P^{\lfloor(\Delta - 1)/2\rfloor}(x_0, \cdot)$ and $P^{\lfloor(\Delta - 1)/2\rfloor}(y_0, \cdot)$ are supported on disjoint sets. To see this let

\[
A = \{z \in V : \rho(x_0, z) < \rho(y_0, z)\}.
\]

See Figure 2.8. By the triangle inequality for $\rho$, any $z$ such that $\rho(x_0, z) \le \lfloor(\Delta - 1)/2\rfloor$ is in $A$; otherwise we would have $\rho(y_0, z) \le \rho(x_0, z) \le \lfloor(\Delta - 1)/2\rfloor$ and hence $\rho(x_0, y_0) \le \rho(x_0, z) + \rho(y_0, z) \le 2\lfloor(\Delta - 1)/2\rfloor < \Delta$. Similarly, if $\rho(y_0, z) \le \lfloor(\Delta - 1)/2\rfloor$, then $z \in A^c$. By the triangle inequality for the total variation distance,

\[
d(\lfloor(\Delta - 1)/2\rfloor) \ge \frac{1}{2}\left\|P^{\lfloor(\Delta - 1)/2\rfloor}(x_0, \cdot) - P^{\lfloor(\Delta - 1)/2\rfloor}(y_0, \cdot)\right\|_{\mathrm{TV}}
\ge \frac{1}{2}\left\{P^{\lfloor(\Delta - 1)/2\rfloor}(x_0, A) - P^{\lfloor(\Delta - 1)/2\rfloor}(y_0, A)\right\}
= \frac{1}{2}\{1 - 0\} = \frac{1}{2}, \qquad (2.61)
\]

so that:

Claim 2.73.

\[
t_{\mathrm{mix}}(\varepsilon) \ge \frac{\Delta}{2}.
\]

Figure 2.8: The supports of $P^{\lfloor(\Delta - 1)/2\rfloor}(x_0, \cdot)$ and $P^{\lfloor(\Delta - 1)/2\rfloor}(y_0, \cdot)$ are contained in $A$ and $A^c$ respectively.

This bound is often far from the truth. Consider for instance simple random walk on a cycle of size $n$. The diameter is $\Delta = n/2$. But Lemma 2.33 suggests that it takes time of order $\Delta^2$ to even reach the antipode of the starting point, let alone achieve stationarity.

More generally, when $P$ is reversible, we use the Varopoulos-Carne bound to show that the mixing time does indeed scale at least as the square of the diameter.

Assume that $P$ is reversible with respect to $\pi$ and has diameter $\Delta$. Letting $n = |V|$ and $\pi_{\min} = \min_{x \in V}\pi(x)$, we have the following.

Claim 2.74.

\[
t_{\mathrm{mix}}(\varepsilon) \ge \frac{\Delta^2}{12\log n + 4|\log\pi_{\min}|}, \qquad \text{if } n \ge \frac{16}{(1 - 2\varepsilon)^2}.
\]

Proof. The proof is based on the same argument we used to derive our first diameter-based bound, except that the Varopoulos-Carne bound gives a better dependence on the diameter. Namely, let $x_0$, $y_0$, and $A$ be as above. By the Varopoulos-Carne bound,

\[
P^t(x_0, A^c) = \sum_{z \in A^c} P^t(x_0, z) \le \sum_{z \in A^c} 2\sqrt{\frac{\pi(z)}{\pi(x_0)}}\, e^{-\frac{\rho^2(x_0, z)}{2t}} \le 2n\pi_{\min}^{-1/2}\, e^{-\frac{\Delta^2}{8t}},
\]

where we used that $|A^c| \le n$ and $\rho(x_0, z) \ge \frac{\Delta}{2}$ for $z \in A^c$. For any

\[
t < \frac{\Delta^2}{12\log n + 4|\log\pi_{\min}|},
\]

we get that

\[
P^t(x_0, A^c) \le 2n\pi_{\min}^{-1/2}\exp\left(-\frac{3\log n + |\log\pi_{\min}|}{2}\right) = \frac{2}{\sqrt{n}}.
\]

Similarly, $P^t(y_0, A) \le \frac{2}{\sqrt{n}}$, so that, arguing as in (2.61),

\[
d(t) \ge \frac{1}{2}\left\{1 - \frac{2}{\sqrt{n}} - \frac{2}{\sqrt{n}}\right\} = \frac{1}{2} - \frac{2}{\sqrt{n}} \ge \varepsilon,
\]

for $n$ as in the statement.

Remark 2.75. The dependence on $\Delta$ and $\pi_{\min}$ in Claim 2.74 cannot be improved. See [LP, Section 13.3].

2.4.8 Data science: classification, empirical risk minimization and VC dimension

In binary classification, one is given samples $S_n = \{(X_i, C(X_i))\}_{i=1}^n$ where $X_i \in \mathbb{R}^d$ is a feature vector and $C(X_i) \in \{0,1\}$ is a label. The feature vectors are assumed to be independent samples from an unknown probability measure $\mu$ and $C : \mathbb{R}^d \to \{0,1\}$ is a measurable Boolean function. For instance, the feature vector might be an image (encoded as a vector) and the label might indicate cat (label 0) or dog (label 1). Our goal is to "learn" the concept $C$. More precisely, we seek to construct a hypothesis $h : \mathbb{R}^d \to \{0,1\}$ that is a good approximation to $C$ in the sense that it predicts the label well on a fresh sample.

Put differently, we want $h$ to have small risk, or generalization error,
\[
R(h) = \mathbb{P}[h(X) \ne C(X)],
\]
where $X \sim \mu$. Because we only have access to the distribution $\mu$ through the samples, it is natural to estimate the risk of the hypothesis $h$ using the samples as

\[
R_n(h) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{h(X_i) \ne C(X_i)\},
\]

which is called the empirical risk. Indeed observe that $\mathbb{E} R_n(h) = R(h)$ and, by the law of large numbers, $R_n(h) \to R(h)$ almost surely as $n \to +\infty$. Ignoring computational considerations, one can then formally define an empirical risk minimizer
\[
h^* \in \mathrm{ERM}_{\mathcal{H}}(S_n) = \{h \in \mathcal{H} : R_n(h) \le R_n(h'),\ \forall h' \in \mathcal{H}\},
\]
where $\mathcal{H}$, the hypothesis class, is a collection of Boolean functions over $\mathbb{R}^d$. We assume further that $h^*$ can be defined as a measurable function of the samples.
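As a purely illustrative sketch (the data, the grid of candidate thresholds and all names below are hypothetical, not part of the text), here is what empirical risk minimization looks like over a toy finite class of one-dimensional threshold hypotheses $h_a(x) = \mathbf{1}\{x \ge a\}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a hypothetical concept C(x) = 1{x >= 0.3}, features uniform on [0, 1].
n = 200
X = rng.uniform(0, 1, size=n)
y = (X >= 0.3).astype(int)

# Finite hypothesis class: thresholds h_a(x) = 1{x >= a} on a grid of candidate a's.
candidates = np.linspace(0, 1, 101)

def empirical_risk(a, X, y):
    """R_n(h_a) = (1/n) sum_i 1{h_a(X_i) != C(X_i)}."""
    return np.mean((X >= a).astype(int) != y)

risks = np.array([empirical_risk(a, X, y) for a in candidates])
a_star = candidates[np.argmin(risks)]        # an empirical risk minimizer
print(f"ERM threshold: {a_star:.2f}, empirical risk: {risks.min():.3f}")
```

On noiseless data like this the empirical risk can be driven to essentially 0; whether that says anything about the true risk $R(h^*)$ is the question taken up in the rest of the section.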

Overfitting  Why restrict the hypothesis class? The following theorem shows that minimizing the empirical risk over all Boolean functions makes it impossible to achieve an arbitrarily small risk. Intuitively, considering too rich a class of functions, that is, functions that too intricately model the data, leads to overfitting: the learned hypothesis will fit the sampled data, but it may not generalize well to unseen examples. A learner $A$ is a map from samples to measurable Boolean functions over $\mathbb{R}^d$, that is, for any $n$ and any $S_n \in (\mathbb{R}^d \times \{0,1\})^n$, the learner outputs a function $A(\cdot, S_n) : \mathbb{R}^d \to \{0,1\}$.

Theorem 2.76 (No Free Lunch). For any learner $A$ and any $\mathcal{X} \subseteq \mathbb{R}^d$ with $|\mathcal{X}| = 2m > 4$, there exist a concept $C : \mathcal{X} \to \{0,1\}$ and a distribution $\mu$ over $\mathcal{X}$ such that
\[
\mathbb{P}[R(A(\cdot, S_m)) \ge 1/8] \ge 1/8, \tag{2.62}
\]
where $S_m = \{(X_i, C(X_i))\}_{i=1}^m$ with independent $X_i \sim \mu$.


Proof. We let $\mu$ be uniform over $\mathcal{X}$. To prove the existence of a concept satisfying (2.62), we use the probabilistic method (Section 2.2.1) and pick $C$ at random. For each $x \in \mathcal{X}$, we set $C(x) = Y_x$ where the $Y_x$'s are i.i.d. uniform in $\{0,1\}$.

We first bound $\mathbb{E}[R(A(\cdot, S_m))]$, where the expectation runs over both the random labels $\{Y_x\}_{x \in \mathcal{X}}$ and the samples $S_m = \{(X_i, C(X_i))\}_{i=1}^m$. For an additional independent sample $X \sim \mu$, we will need the event that the learner, given samples $S_m$, makes an incorrect prediction on $X$,
\[
B = \{A(X, S_m) \ne Y_X\},
\]
and the event that $X$ is observed in the samples $S_m$,
\[
O = \{X \in \{X_1, \ldots, X_m\}\}.
\]

By the law of total expectation,
\[
\begin{aligned}
\mathbb{E}[R(A(\cdot, S_m))] &= \mathbb{P}[B] \\
&= \mathbb{E}\big[\mathbb{P}[B \mid S_m]\big] \\
&= \mathbb{E}\big[\mathbb{P}[B \mid O, S_m]\, \mathbb{P}[O \mid S_m] + \mathbb{P}[B \mid O^c, S_m]\, \mathbb{P}[O^c \mid S_m]\big] \\
&\ge \mathbb{E}\big[\mathbb{P}[B \mid O^c, S_m]\, \mathbb{P}[O^c \mid S_m]\big] \\
&\ge \frac{1}{2} \times \frac{1}{2},
\end{aligned}
\]
where we used that:

• $\mathbb{P}[O^c \mid S_m] \ge 1/2$ because $|\mathcal{X}| = 2m$ and $\mu$ is uniform, and

• $\mathbb{P}[B \mid O^c, S_m] = 1/2$ because for any $x \notin \{X_1, \ldots, X_m\}$ the prediction $A(x, S_m) \in \{0,1\}$ is independent of $Y_x$ and the latter is uniform.

Conditioning over the concept, we have proved that
\[
\mathbb{E}\big[\mathbb{E}[R(A(\cdot, S_m)) \mid \{Y_x\}_{x \in \mathcal{X}}]\big] \ge \frac{1}{4}.
\]
Hence, by the first moment principle (Theorem 2.6),
\[
\mathbb{P}\big[\mathbb{E}[R(A(\cdot, S_m)) \mid \{Y_x\}_{x \in \mathcal{X}}] \ge 1/4\big] > 0,
\]
where the probability is taken over $\{Y_x\}_{x \in \mathcal{X}}$. That is, there exists a choice $\{y_x\}_{x \in \mathcal{X}} \in \{0,1\}^{\mathcal{X}}$ such that
\[
\mathbb{E}[R(A(\cdot, S_m)) \mid \{Y_x = y_x\}_{x \in \mathcal{X}}] \ge \frac{1}{4}. \tag{2.63}
\]


Finally, to prove (2.62), we use a variation on Markov's inequality (Theorem 2.1) for $[0,1]$-valued random variables. If $Z \in [0,1]$ is a random variable with $\mathbb{E}[Z] = \mu$ and $\alpha \in [0,1]$, then
\[
\mathbb{E}[Z] \le \alpha \times \mathbb{P}[Z < \alpha] + 1 \times \mathbb{P}[Z \ge \alpha] \le \mathbb{P}[Z \ge \alpha] + \alpha.
\]
Taking $\alpha = \mu/2$ gives
\[
\mathbb{P}[Z \ge \mu/2] \ge \mu/2.
\]
Going back to (2.63), we obtain
\[
\mathbb{P}\left[ R(A(\cdot, S_m)) \ge \frac{1}{8} \,\middle|\, \{Y_x = y_x\}_{x \in \mathcal{X}} \right] \ge \frac{1}{8},
\]
establishing the claim.

The gist of the proof is intuitive. In essence, if the target concept is arbitrary and we only get to see half of the possible instances, then we have learned nothing about the other half and cannot expect low generalization error.
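A small simulation in the spirit of the proof (our own illustration, with a hypothetical "memorizing" learner that repeats observed labels and guesses 0 elsewhere; the particular learner does not matter for the theorem):

```python
import numpy as np

rng = np.random.default_rng(1)
m, trials = 20, 2000          # |X| = 2m points, m samples per trial
risks = []

for _ in range(trials):
    labels = rng.integers(0, 2, size=2 * m)        # random concept: C(x) = Y_x i.i.d. uniform
    sample_idx = rng.integers(0, 2 * m, size=m)    # m i.i.d. uniform samples (with repeats)
    # Memorizing learner: repeat the observed label, guess 0 on unseen points.
    predictions = np.zeros(2 * m, dtype=int)
    predictions[sample_idx] = labels[sample_idx]
    risks.append(np.mean(predictions != labels))   # R(A(., S_m)) under the uniform mu

risks = np.array(risks)
print(f"average risk ~ {risks.mean():.3f}   P[R >= 1/8] ~ {np.mean(risks >= 0.125):.3f}")
```

Averaged over random concepts, the risk hovers around $0.3$ (each point is unseen with probability roughly $e^{-1/2}$ and is then misclassified with probability $1/2$), consistent with the $\ge 1/4$ average established in the proof.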

So, as we mentioned before, one way out is to limit the complexity of the hypotheses. For instance, we could restrict ourselves to half-spaces
\[
\mathcal{H}_H = \left\{ h(x) = \mathbf{1}\{x^T u \ge \alpha\} : u \in \mathbb{R}^d,\ \alpha \in \mathbb{R} \right\},
\]
or axis-aligned boxes
\[
\mathcal{H}_B = \left\{ h(x) = \mathbf{1}\{x_i \in [\alpha_i, \beta_i],\ \forall i\} : -\infty \le \alpha_i \le \beta_i \le +\infty,\ \forall i \right\}.
\]

In order for the empirical risk minimizer $h^*$ to have a generalization error close to the best achievable error, we need the empirical risk of the learned hypothesis, $R_n(h^*)$, to be close to its expectation $R(h^*)$, which is guaranteed by the law of large numbers for sufficiently large $n$. But that is not enough: we also need that same property to hold for all hypotheses in $\mathcal{H}$ simultaneously. Otherwise we could be fooled by a poorly performing hypothesis with unusually good empirical risk on the samples. The hypothesis class is typically infinite and, therefore, controlling empirical risk deviations from their expectations uniformly over $\mathcal{H}$ is not straightforward.

Uniform deviations  Our goal in this section is to show how to bound
\[
\mathbb{E}\left[ \sup_{h \in \mathcal{H}} \left( R_n(h) - R(h) \right) \right]
= \mathbb{E}\left[ \sup_{h \in \mathcal{H}} \left( \frac{1}{n} \sum_{i=1}^n \ell(h, X_i) - \mathbb{E}[\ell(h, X)] \right) \right] \tag{2.64}
\]


in terms of a measure of complexity of the class $\mathcal{H}$, where we defined the loss $\ell(h,x) = \mathbf{1}\{h(x) \ne C(x)\}$ to simplify the notation. We assume that $\mathcal{H}$ is countable. Observe for instance that, for $\mathcal{H}_H$ and $\mathcal{H}_B$, nothing is lost by assuming that the parameters defining the hypotheses are rational-valued.

Controlling deviations uniformly over $\mathcal{H}$ as in (2.64) allows one to provide guarantees on the empirical risk minimizer. Indeed, for any $h_0 \in \mathcal{H}$,
\[
\begin{aligned}
R(h^*) &\le R_n(h^*) + \sup_{h \in \mathcal{H}} \left( R(h) - R_n(h) \right) \\
&\le R_n(h_0) + \sup_{h \in \mathcal{H}} \left( R(h) - R_n(h) \right) \\
&\le R(h_0) + \sup_{h \in \mathcal{H}} \left( R_n(h) - R(h) \right) + \sup_{h \in \mathcal{H}} \left( R(h) - R_n(h) \right),
\end{aligned}
\]
where, on the second line, we used the definition of the empirical risk minimizer. Taking an infimum over $h_0$ and then taking an expectation gives
\[
\mathbb{E}[R(h^*)] \le \inf_{h \in \mathcal{H}} R(h)
+ \mathbb{E}\left[ \sup_{h \in \mathcal{H}} \left( R_n(h) - R(h) \right) \right]
+ \mathbb{E}\left[ \sup_{h \in \mathcal{H}} \left( R(h) - R_n(h) \right) \right].
\]

Observe that the supremum in (2.64) is inside the expectation and that the random variables $R_n(h) - R(h)$ are highly correlated. Indeed, two similar hypotheses will produce similar predictions. While the absence of independence in some sense makes bounding this type of expectation harder, the correlation is ultimately what allows us to tackle infinite classes $\mathcal{H}$.

We will use the methods of Section 2.4.5. As we did in that section, we assume that there is a countable collection $\mathcal{H}_0 \subseteq \mathcal{H}$ such that all suprema considered here can be taken over $\mathcal{H}_0$. As a first step, we use symmetrization, which we introduced in Section 2.4.3 to give a proof of Hoeffding's lemma (Lemma 2.42). Let $(\varepsilon_i)_{i=1}^n$ be i.i.d. uniform random variables in $\{-1,+1\}$ and let $(X_i')_{i=1}^n$ be an independent copy of $(X_i)_{i=1}^n$. Then

\[
\begin{aligned}
\mathbb{E}&\left[ \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \left( \ell(h, X_i) - \mathbb{E}[\ell(h, X)] \right) \right] \\
&= \mathbb{E}\left[ \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \left( \ell(h, X_i) - \mathbb{E}[\ell(h, X_i') \mid (X_i)_{i=1}^n] \right) \right] \\
&\le \mathbb{E}\left[ \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \left[ \ell(h, X_i) - \ell(h, X_i') \right] \right] \\
&= \mathbb{E}\left[ \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i \left[ \ell(h, X_i) - \ell(h, X_i') \right] \right] \\
&\le \mathbb{E}\left[ \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i\, \ell(h, X_i) + \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n (-\varepsilon_i)\, \ell(h, X_i') \right] \\
&\le 2\, \mathbb{E}\left[ \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i\, \ell(h, X_i) \right],
\end{aligned}
\]

where, on the third line, we used $\sup_h \mathbb{E} Y_h \le \mathbb{E}[\sup_h Y_h]$ and the tower property, on the fourth line, we used that $[\ell(h, X_i) - \ell(h, X_i')]$ is symmetric and independent of $\varepsilon_i$, and on the fifth line, we used that $\varepsilon_i$ and $-\varepsilon_i$ are identically distributed. Define

\[
Z_n(h) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i\, \ell(h, X_i). \tag{2.65}
\]

Our task reduces to bounding
\[
\mathbb{E}\left[ \sup_{h \in \mathcal{H}} Z_n(h) \right]. \tag{2.66}
\]

VC dimension  We make two observations about $Z_n(h)$:

1. As a weighted sum of independent random variables in $[-1,1]$, $Z_n(h)$ is sub-Gaussian with variance factor 1 by the general Hoeffding inequality (Theorem 2.39) and Hoeffding's lemma (Lemma 2.42).

2. The random variable $Z_n(h)$ only depends on the values of $h$ at a finite number of points, $X_1, \ldots, X_n$. Hence, while the supremum in (2.66) is over a potentially infinite class of functions $\mathcal{H}$, it is in effect a supremum over at most $2^n$ functions, that is, all the possible restrictions of the $h$'s to $(X_i)_{i=1}^n$.


A naive application of the maximal inequality in Lemma 2.50, together with the two observations above, unfortunately gives the very weak bound
\[
\mathbb{E}\left[ \sup_{h \in \mathcal{H}} Z_n(h) \right] \le \sqrt{2 \log 2^n} = \sqrt{2 n \log 2}.
\]

To obtain a better bound, we show that in general the number of distinct restrictions of $\mathcal{H}$ to $n$ points can grow much slower than $2^n$.

Definition 2.77 (Shattering). Let $\Lambda = \{\ell_1, \ldots, \ell_n\} \subseteq \mathbb{R}^d$ be a finite set and let $\mathcal{H}$ be a class of Boolean functions on $\mathbb{R}^d$. The restriction of $\mathcal{H}$ to $\Lambda$ is
\[
\mathcal{H}_\Lambda = \{ (h(\ell_1), \ldots, h(\ell_n)) : h \in \mathcal{H} \}.
\]
We say that $\Lambda$ is shattered by $\mathcal{H}$ if $|\mathcal{H}_\Lambda| = 2^{|\Lambda|}$, that is, if all Boolean functions over $\Lambda$ can be obtained by restricting a function in $\mathcal{H}$ to the points in $\Lambda$.

Definition 2.78 (VC dimension). Let $\mathcal{H}$ be a class of Boolean functions on $\mathbb{R}^d$. The VC dimension of $\mathcal{H}$, denoted $\mathrm{vc}(\mathcal{H})$, is the maximum cardinality of a set shattered by $\mathcal{H}$.

We prove the following combinatorial lemma at the end of this section.

Lemma 2.79 (Sauer's lemma). Let $\mathcal{H}$ be a class of Boolean functions on $\mathbb{R}^d$. For any finite set $\Lambda = \{\ell_1, \ldots, \ell_n\} \subseteq \mathbb{R}^d$,
\[
|\mathcal{H}_\Lambda| \le \left( \frac{e n}{\mathrm{vc}(\mathcal{H})} \right)^{\mathrm{vc}(\mathcal{H})}.
\]
That is, the number of distinct restrictions of $\mathcal{H}$ to any $n$ points grows at most as $\propto n^{\mathrm{vc}(\mathcal{H})}$.
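To see the gap concretely, here is a quick count (our own sketch; the enumeration via midpoints and the variable names are ours) of the distinct restrictions of the half-line class $\mathcal{H}_H$ in dimension $d = 1$, whose VC dimension is 2 by Claim 2.82 below, compared with the Sauer bound and with $2^n$:

```python
import numpy as np

rng = np.random.default_rng(2)
vc = 2  # vc(H_H) = d + 1 = 2 in dimension 1 (Claim 2.82)

for n in [5, 10, 20, 40]:
    pts = np.sort(rng.uniform(0, 1, size=n))
    # All labelings of pts realized by half-lines 1{x >= a} or 1{x <= a}.
    labelings = set()
    thresholds = np.concatenate(([-np.inf], (pts[:-1] + pts[1:]) / 2, [np.inf]))
    for a in thresholds:
        labelings.add(tuple((pts >= a).astype(int)))
        labelings.add(tuple((pts <= a).astype(int)))
    sauer = (np.e * n / vc) ** vc
    print(f"n={n:2d}  |H_Lambda|={len(labelings):3d}  Sauer bound={sauer:8.1f}  2^n={2**n}")
```

The number of restrictions grows only linearly here, comfortably below the polynomial Sauer bound, while $2^n$ explodes.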

Returning to $\mathbb{E}[\sup_{h \in \mathcal{H}} Z_n(h)]$, we get the following inequality using Lemma 2.50.

Lemma 2.80. There exists a constant $0 < C < +\infty$ such that, for any countable class of measurable Boolean functions $\mathcal{H}$ over $\mathbb{R}^d$,
\[
\mathbb{E}\left[ \sup_{h \in \mathcal{H}} \left( R_n(h) - R(h) \right) \right] \le C \sqrt{\frac{\mathrm{vc}(\mathcal{H}) \log n}{n}}. \tag{2.67}
\]

Proof. Let $Z_n(h)$ be as in (2.65) and recall that $Z_n(h) \in \mathrm{sG}(1)$. Since the supremum over $\mathcal{H}$, when seen as restricted to $\{X_1, \ldots, X_n\}$, is in fact a supremum over at most $(en)^{\mathrm{vc}(\mathcal{H})}$ functions by Sauer's lemma (Lemma 2.79), we have by Lemma 2.50
\[
\mathbb{E}\left[ \sup_{h \in \mathcal{H}} Z_n(h) \right] \le \sqrt{2 \log (en)^{\mathrm{vc}(\mathcal{H})}}.
\]
Symmetrization then proves the claim.
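As a rough sanity check on the last display (a Monte Carlo sketch under assumptions of our own: the half-line class in $d = 1$, an arbitrary target concept, and brute-force enumeration of the restrictions), one can estimate $\mathbb{E}[\sup_{h \in \mathcal{H}} Z_n(h)]$ and compare it with $\sqrt{2 \log(en)^{\mathrm{vc}(\mathcal{H})}}$ and with the naive $\sqrt{2 n \log 2}$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, trials, vc = 200, 300, 2

def concept(x):
    """Hypothetical target concept C(x) = 1{x >= 0.3}."""
    return (x >= 0.3).astype(int)

sups = []
for _ in range(trials):
    X = np.sort(rng.uniform(0, 1, size=n))
    eps = rng.choice([-1.0, 1.0], size=n)
    # Enumerate the restrictions of H_H (half-lines, d = 1) to X via cut positions.
    cuts = np.concatenate(([-np.inf], (X[:-1] + X[1:]) / 2, [np.inf]))
    preds = (X[None, :] >= cuts[:, None]).astype(int)      # 1{x >= a}
    preds = np.vstack([preds, 1 - preds])                  # opposite-direction half-lines
    losses = (preds != concept(X)[None, :]).astype(float)  # ell(h, X_i) for each restriction
    sups.append((losses @ eps).max() / np.sqrt(n))         # sup_h Z_n(h)

print("Monte Carlo E[sup_h Z_n(h)] ~", round(float(np.mean(sups)), 2))
print("sqrt(2 log (en)^vc)          =", round(float(np.sqrt(2 * vc * np.log(np.e * n))), 2))
print("naive sqrt(2 n log 2)        =", round(float(np.sqrt(2 * n * np.log(2))), 2))
```

The estimate should land well below both bounds and, unlike the naive bound, it does not grow like $\sqrt{n}$.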


We give some examples.

Example 2.81 (VC dimension of half-spaces). Consider the class of half-spaces.

Claim 2.82. $\mathrm{vc}(\mathcal{H}_H) = d + 1$.

We only prove the case $d = 1$, where $\mathcal{H}_H$ reduces to half-lines $(-\infty, \beta]$ or $[\beta, +\infty)$. Clearly any set $\Lambda = \{\ell_1, \ell_2\} \subseteq \mathbb{R}$ with two distinct elements is shattered by $\mathcal{H}_H$. On the other hand, for any $\Lambda = \{\ell_1, \ell_2, \ell_3\}$ with $\ell_1 < \ell_2 < \ell_3$, any half-line containing $\ell_1$ and $\ell_3$ necessarily includes $\ell_2$ as well. Hence no set of size 3 is shattered by $\mathcal{H}_H$.

For the proof in general dimension $d$, see e.g. [SSBD14, Section 9.1.3].

Example 2.83 (VC dimension of boxes). Consider the class of axis-aligned boxes.

Claim 2.84. $\mathrm{vc}(\mathcal{H}_B) = 2d$.

We only prove the case $d = 2$, where $\mathcal{H}_B$ reduces to rectangles. The set $\Lambda = \{(-1,0), (1,0), (0,-1), (0,1)\}$ is shattered by $\mathcal{H}_B$. Indeed, the rectangle $[-1,1] \times [-1,1]$ contains $\Lambda$, with each side of the rectangle containing one of the points. Moving any side inward by $\varepsilon$ removes the corresponding point from the rectangle. Hence, any subset of $\Lambda$ can be obtained by this procedure.

On the other hand, let $\Lambda = \{\ell_1, \ldots, \ell_5\} \subseteq \mathbb{R}^2$ be any set with five distinct points. Consider the rectangle with smallest area containing $\Lambda$. Then there must be four distinct points in $\Lambda$, each touching a different side of the rectangle (including degenerate sides). Now observe that any rectangle containing those four points must also contain the fifth one. That proves the claim.

The proof in general dimension $d$ follows along the same lines.
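A brute-force check of the $d = 2$ claim (an illustrative sketch; the function name and the bounding-box criterion are ours): a labeling that marks a subset $S$ positive is realizable by an axis-aligned box exactly when the bounding box of $S$ contains no other point of the set.

```python
import itertools
import numpy as np

def shattered_by_boxes(points):
    """True iff 'points' (shape (k, 2)) is shattered by axis-aligned rectangles.
    A subset S is realizable iff the bounding box of S contains no point outside S."""
    pts = np.asarray(points, dtype=float)
    k = len(pts)
    for r in range(1, k):                    # empty and full subsets are always realizable
        for S in itertools.combinations(range(k), r):
            box_min = pts[list(S)].min(axis=0)
            box_max = pts[list(S)].max(axis=0)
            inside = np.all((pts >= box_min) & (pts <= box_max), axis=1)
            if inside.sum() > r:             # a point outside S falls inside the bounding box
                return False
    return True

diamond = [(-1, 0), (1, 0), (0, -1), (0, 1)]
print(shattered_by_boxes(diamond))           # True: this 4-point set is shattered

five_points = np.random.default_rng(3).uniform(-1, 1, size=(5, 2))
print(shattered_by_boxes(five_points))       # False: no 5-point set is shattered (Claim 2.84)
```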

These two examples also provide insight into Sauer's lemma. Consider the case of rectangles for instance. Over a collection of $n$ sample points, a rectangle defines the same $\{0,1\}$-labeling as the minimal-area rectangle containing the same points. Because each side of a minimal-area rectangle must touch at least one point in the sample, there are at most $n^4$ such rectangles, and hence there are at most $n^4 \ll 2^n$ restrictions of $\mathcal{H}_B$ to these sample points.

Chaining  It turns out that the $\sqrt{\log n}$ factor in (2.67) is not optimal. We use chaining (Section 2.4.5) to improve the bound. We return to symmetrization, that is, we consider again the process $Z_n(h)$ defined in (2.65).


We claim that $(Z_n(h))_{h \in \mathcal{H}}$ has sub-Gaussian increments under an appropriately defined pseudometric. Indeed, conditioning on $(X_i)_{i=1}^n$, by the general Hoeffding inequality (Theorem 2.39) and Hoeffding's lemma (Lemma 2.42),
\[
Z_n(g) - Z_n(h) = \sum_{i=1}^n \varepsilon_i\, \frac{\ell(g, X_i) - \ell(h, X_i)}{\sqrt{n}}
\]
is sub-Gaussian with variance factor
\[
\sum_{i=1}^n \left( \frac{\ell(g, X_i) - \ell(h, X_i)}{\sqrt{n}} \right)^2 \times 1
= \frac{1}{n} \sum_{i=1}^n \left[ \ell(g, X_i) - \ell(h, X_i) \right]^2.
\]

Define the pseudometric
\[
\rho_n(g, h)
= \left[ \frac{1}{n} \sum_{i=1}^n \left[ \ell(g, X_i) - \ell(h, X_i) \right]^2 \right]^{1/2}
= \left[ \frac{1}{n} \sum_{i=1}^n \left[ g(X_i) - h(X_i) \right]^2 \right]^{1/2},

\]
which satisfies the triangle inequality since it can be expressed as a Euclidean norm. In fact, it will be useful to recast it in a more general setting. For a probability measure $\eta$ over $\mathbb{R}^d$, define
\[
\|g - h\|_{L^2(\eta)}^2 = \int_{\mathbb{R}^d} (g(x) - h(x))^2 \, d\eta(x).
\]

Let $\mu_n$ be the empirical measure
\[
\mu_n = \mu_{\{X_i\}_{i=1}^n} := \frac{1}{n} \sum_{i=1}^n \delta_{X_i}, \tag{2.68}
\]
where $\delta_x$ is the probability measure that puts mass 1 on $x$. Then we can rewrite
\[
\rho_n(g, h) = \|g - h\|_{L^2(\mu_n)}.
\]

Hence we have shown that $(Z_n(h))_{h \in \mathcal{H}}$ has sub-Gaussian increments with respect to $\|\cdot\|_{L^2(\mu_n)}$. Note that the pseudometric here is random as it depends on the samples. However, by the law of large numbers, $\|g - h\|_{L^2(\mu_n)}$ approaches its expectation, $\|g - h\|_{L^2(\mu)}$, as $n \to +\infty$.
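For instance (an illustration of ours, with two hypothetical threshold classifiers and $\mu$ uniform on $[0,1]$), the random pseudometric is easy to compute and indeed settles down to its population value:

```python
import numpy as np

rng = np.random.default_rng(4)

def g(x):
    return (x >= 0.3).astype(int)    # hypothetical classifier g

def h(x):
    return (x >= 0.5).astype(int)    # hypothetical classifier h

# Population distance under mu = Uniform[0, 1]: ||g - h||_{L2(mu)} = sqrt(P[0.3 <= X < 0.5]).
pop_dist = np.sqrt(0.2)

for n in [10, 100, 1000, 10000]:
    X = rng.uniform(0, 1, size=n)
    emp_dist = np.sqrt(np.mean((g(X) - h(X)) ** 2))   # ||g - h||_{L2(mu_n)}
    print(f"n={n:6d}  empirical={emp_dist:.3f}  population={pop_dist:.3f}")
```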

Applying the discrete Dudley's inequality (Theorem 2.59), we obtain the following bound.

Lemma 2.85. There exists a constant $0 < C < +\infty$ such that, for any countable class of measurable Boolean functions $\mathcal{H}$ over $\mathbb{R}^d$,
\[
\mathbb{E}\left[ \sup_{h \in \mathcal{H}} \left( R_n(h) - R(h) \right) \right]
\le \frac{C}{\sqrt{n}}\, \mathbb{E}\left[ \sum_{k=0}^{+\infty} 2^{-k} \sqrt{\log \mathcal{N}(\mathcal{H}, \|\cdot\|_{L^2(\mu_n)}, 2^{-k})} \right],
\]
where $\mu_n$ is the empirical measure over the samples $(X_i)_{i=1}^n$.


Proof. Because $\mathcal{H}$ comprises only Boolean functions, it follows that under the pseudometric $\|\cdot\|_{L^2(\mu_n)}$ the diameter is bounded by 1. We apply the discrete Dudley's inequality (Theorem 2.59) conditioned on $(X_i)_{i=1}^n$. Then we take an expectation over the samples which, together with symmetrization, gives the claim.

Our use of symmetrization above is more intuitive than it may appear at first. The central limit theorem indicates that the fluctuations of centered averages such as
\[
(R_n(g) - R(g)) - (R_n(h) - R(h))
\]
tend to cancel out and that, in the limit, the variance alone characterizes the overall behavior. The $\varepsilon_i$'s in some sense explicitly capture the canceling part of this phenomenon while $\rho_n$ captures the scale of the resulting global fluctuations in the increments.

Bounding the covering numbers using the VC dimension  Our final task is to bound the covering numbers $\mathcal{N}(\mathcal{H}, \|\cdot\|_{L^2(\mu_n)}, 2^{-k})$.

Theorem 2.86 (Covering numbers via VC dimension). There exists a constant $0 < C < +\infty$ such that, for any class of measurable Boolean functions $\mathcal{H}$ over $\mathbb{R}^d$, any probability measure $\eta$ over $\mathbb{R}^d$ and any $\varepsilon \in (0,1)$,
\[
\mathcal{N}(\mathcal{H}, \|\cdot\|_{L^2(\eta)}, \varepsilon) \le \left( \frac{2}{\varepsilon} \right)^{C\, \mathrm{vc}(\mathcal{H})}.
\]

Before proving Theorem 2.86, we derive its implications on uniform deviations. Compare the following bound to Lemma 2.80.

Lemma 2.87. There exists a constant $0 < C < +\infty$ such that, for any countable class of measurable Boolean functions $\mathcal{H}$ over $\mathbb{R}^d$,
\[
\mathbb{E}\left[ \sup_{h \in \mathcal{H}} \left( R_n(h) - R(h) \right) \right] \le C \sqrt{\frac{\mathrm{vc}(\mathcal{H})}{n}}.
\]


Proof. By Lemma 2.85,
\[
\begin{aligned}
\mathbb{E}\left[ \sup_{h \in \mathcal{H}} \left( R_n(h) - R(h) \right) \right]
&\le \frac{C}{\sqrt{n}}\, \mathbb{E}\left[ \sum_{k=0}^{+\infty} 2^{-k} \sqrt{\log \mathcal{N}(\mathcal{H}, \|\cdot\|_{L^2(\mu_n)}, 2^{-k})} \right] \\
&\le \frac{C}{\sqrt{n}}\, \mathbb{E}\left[ \sum_{k=0}^{+\infty} 2^{-k} \sqrt{\log \left( \frac{2}{2^{-k}} \right)^{C' \mathrm{vc}(\mathcal{H})}} \right] \\
&\le \frac{C \sqrt{\mathrm{vc}(\mathcal{H})}}{\sqrt{n}}\, \mathbb{E}\left[ \sum_{k=0}^{+\infty} 2^{-k} \sqrt{k+1}\, \sqrt{C' \log 2} \right] \\
&\le C'' \sqrt{\frac{\mathrm{vc}(\mathcal{H})}{n}},
\end{aligned}
\]
for some $0 < C'' < +\infty$.
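The only thing left implicit in the last step is that the series is a finite constant; a one-line numerical check (ours):

```python
import numpy as np

# sum_{k >= 0} 2^{-k} * sqrt(k + 1) converges (to roughly 2.7), which is why the
# last inequality holds with a finite constant C''.
ks = np.arange(60)
print(np.sum(2.0 ** (-ks) * np.sqrt(ks + 1)))
```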

It remains to prove Theorem 2.86.

Proof of Theorem 2.86. Let $G = \{g_1, \ldots, g_N\} \subseteq \mathcal{H}$ be a maximal $\varepsilon$-packing of $\mathcal{H}$ with $N \ge \mathcal{N}(\mathcal{H}, \|\cdot\|_{L^2(\eta)}, \varepsilon)$, which exists by Lemma 2.53. We use the probabilistic method (Section 2.2) and Hoeffding's inequality (Theorem 2.40) to show that there exists a small number of points $\{x_1, \ldots, x_m\}$ such that $G$ is still a good packing when $\mathcal{H}$ is restricted to the $x_i$'s. Then we use Sauer's lemma (Lemma 2.79) to conclude.

1. Restriction. By construction, the collection $G$ satisfies
\[
\|g_i - g_j\|_{L^2(\eta)} > \varepsilon, \qquad \forall i \ne j.
\]
For an integer $m$ that we will choose as small as possible below, let $\mathcal{X} = \{X_1, \ldots, X_m\}$ be i.i.d. samples from $\eta$ and let $\eta_{\mathcal{X}}$ be the corresponding empirical measure (as defined in (2.68)). Observe that, for any $i \ne j$,
\[
\mathbb{E}\left[ \|g_i - g_j\|_{L^2(\eta_{\mathcal{X}})}^2 \right]
= \mathbb{E}\left[ \frac{1}{m} \sum_{k=1}^m [g_i(X_k) - g_j(X_k)]^2 \right]
= \|g_i - g_j\|_{L^2(\eta)}^2.
\]
Moreover $[g_i(X_k) - g_j(X_k)]^2 \in [0,1]$. Hence, by Hoeffding's inequality (Theorem 2.40), there exist a constant $0 < C < +\infty$ and an $m \le C \varepsilon^{-4} \log N$ such that
\[
\mathbb{P}\left[ \|g_i - g_j\|_{L^2(\eta)}^2 - \|g_i - g_j\|_{L^2(\eta_{\mathcal{X}})}^2 \ge \frac{3\varepsilon^2}{4} \right]
\le \exp\left( -\frac{2 (m \cdot 3\varepsilon^2/4)^2}{m} \right)
= \exp\left( -\frac{9}{8} m \varepsilon^4 \right)
< \frac{1}{N^2}.
\]


That implies that, for this choice of $m$,
\[
\mathbb{P}\left[ \|g_i - g_j\|_{L^2(\eta_{\mathcal{X}})} > \frac{\varepsilon}{2}, \ \forall i \ne j \right] > 0,
\]
where the probability is over the samples. Therefore, there must be a set $\mathcal{X} = \{x_1, \ldots, x_m\} \subseteq \mathbb{R}^d$ such that
\[
\|g_i - g_j\|_{L^2(\eta_{\mathcal{X}})} > \frac{\varepsilon}{2}, \qquad \forall i \ne j. \tag{2.69}
\]

2. VC bound. In particular, by (2.69), the functions in $G$ restricted to $\mathcal{X}$ are distinct. By Sauer's lemma (Lemma 2.79),
\[
N = |G_{\mathcal{X}}| \le |\mathcal{H}_{\mathcal{X}}|
\le \left( \frac{e m}{\mathrm{vc}(\mathcal{H})} \right)^{\mathrm{vc}(\mathcal{H})}
\le \left( \frac{e C \varepsilon^{-4} \log N}{\mathrm{vc}(\mathcal{H})} \right)^{\mathrm{vc}(\mathcal{H})}. \tag{2.70}
\]

Using that $\frac{1}{2D} \log N = \log N^{1/2D} \le N^{1/2D}$ where $D = \mathrm{vc}(\mathcal{H})$, we get
\[
\left( \frac{e C \varepsilon^{-4} \log N}{\mathrm{vc}(\mathcal{H})} \right)^{\mathrm{vc}(\mathcal{H})}
\le \left( C' \varepsilon^{-4} \right)^{\mathrm{vc}(\mathcal{H})} N^{1/2}, \tag{2.71}
\]
where $C' = 2 e C$. Plugging (2.71) back into (2.70) and rearranging gives
\[
N \le \left( C' \varepsilon^{-4} \right)^{2\, \mathrm{vc}(\mathcal{H})}.
\]
That concludes the proof.

Proof of Sauer’s lemma To be written. See [Ver18, Section 8.3.3].

Exercises

Exercise 2.1 (Bonferroni inequalities). Let $A_1, \ldots, A_n$ be events and $B_n := \cup_i A_i$. Define
\[
S^{(r)} := \sum_{1 \le i_1 < \cdots < i_r \le n} \mathbb{P}[A_{i_1} \cap \cdots \cap A_{i_r}],
\]
and
\[
X_n := \sum_{i=1}^n \mathbf{1}_{A_i}.
\]
a) Let $x_0 \le x_1 \le \cdots \le x_s \ge x_{s+1} \ge \cdots \ge x_m$ be a unimodal sequence of non-negative reals such that $\sum_{j=0}^m (-1)^j x_j = 0$. Show that $\sum_{j=0}^{\ell} (-1)^j x_j$ is $\ge 0$ for even $\ell$ and $\le 0$ for odd $\ell$.


b) Show that, for all $r$,
\[
\sum_{1 \le i_1 < \cdots < i_r \le n} \mathbf{1}_{A_{i_1}} \mathbf{1}_{A_{i_2}} \cdots \mathbf{1}_{A_{i_r}} = \binom{X_n}{r}.
\]
c) Use a) and b) to show that when $\ell \in [n]$ is odd
\[
\mathbb{P}[B_n] \le \sum_{r=1}^{\ell} (-1)^{r-1} S^{(r)},
\]
and when $\ell \in [n]$ is even
\[
\mathbb{P}[B_n] \ge \sum_{r=1}^{\ell} (-1)^{r-1} S^{(r)}.
\]
These inequalities are called Bonferroni inequalities. The case $\ell = 1$ is Boole's inequality.

Exercise 2.2 (Percolation on $\mathbb{Z}^2$: better bound [Ste]). Let $E_1$ be the event that all edges are open in $[-N, N] \times [-N, N]$ and $E_2$ be the event that there is no closed self-avoiding dual cycle surrounding $[-N, N]^2$. By looking at $E_1 \cap E_2$, show that $\theta(p) > 0$ for $p > 2/3$.

Exercise 2.3 (Percolation on $\mathbb{Z}^d$: existence of critical threshold). Consider bond percolation on $\mathbb{L}^d$.

a) Show that $p_c(\mathbb{L}^d) > 0$. [Hint: Count self-avoiding paths.]

b) Show that $p_c(\mathbb{L}^d) < 1$. [Hint: Use the result for $\mathbb{L}^2$.]

Exercise 2.4 (Sums of uncorrelated variables). Centered random variables $X_1, \ldots, X_n$ are pairwise uncorrelated if
\[
\mathbb{E}[X_r X_s] = 0, \qquad \forall r \ne s.
\]
Assume further that $\mathrm{Var}[X_r] \le C < +\infty$ for all $r$. Show that
\[
\mathbb{P}\left[ \left| \frac{1}{n} \sum_{r \le n} X_r \right| \ge \beta \right] \le \frac{C}{\beta^2 n}.
\]

Exercise 2.5 (Pairwise independence: lack of concentration [LW06]). Let $U = (U_1, \ldots, U_\ell)$ be uniformly distributed over $\{0,1\}^\ell$. Let $n = 2^\ell - 1$. For all $v \in \{0,1\}^\ell \setminus \{0\}$, define
\[
X_v = (U \cdot v) \bmod 2.
\]


a) Show that the random variables $X_v$, $v \in \{0,1\}^\ell \setminus \{0\}$, are uniformly distributed in $\{0,1\}$ and pairwise independent.

b) Show that for any event $A$ measurable with respect to $\sigma(X_v, v \in \{0,1\}^\ell \setminus \{0\})$, $\mathbb{P}[A]$ is either $0$ or $\ge 1/(n+1)$.

Exercise 2.4 shows that pairwise independence implies "polynomial concentration" of the average of square-integrable $X_v$'s. On the other hand, the current exercise suggests that in general pairwise independence cannot imply "exponential concentration."

Exercise 2.6 (Chernoff bound for Poisson trials). Using the Chernoff-Cramér method, prove part (a) of Theorem 2.37. Show that part (b) follows from part (a).

Exercise 2.7 (A proof of Pólya's theorem). Let $(X_t)$ be simple random walk on $\mathbb{L}^d$ started at the origin $\mathbf{0}$.

a) For $d = 1$, use Stirling's formula to show that $\mathbb{P}_0[X_{2n} = 0] = \Theta(n^{-1/2})$.

b) For $j = 1, \ldots, d$, let $N_t^{(j)}$ be the number of steps in the $j$-th coordinate by time $t$. Show that
\[
\mathbb{P}\left[ N_n^{(j)} \in \left[ \frac{n}{2d}, \frac{3n}{2d} \right],\ \forall j \right] \ge 1 - \exp(-c_d\, n),
\]
for some constant $c_d > 0$.

c) Use a) and b) to show that, for any $d \ge 3$, $\mathbb{P}_0[X_{2n} = 0] = O(n^{-d/2})$.

Exercise 2.8 (Maximum degree). Let $G_n = (V_n, E_n) \sim G_{n, p_n}$ be an Erdős-Rényi graph with $n$ vertices and density $p_n$. Suppose $n p_n = C \log n$ for some $C > 0$. Let $D_n$ be the maximum degree of $G_n$. Use Bernstein's inequality to show that for any $\varepsilon > 0$
\[
\mathbb{P}\left[ D_n \ge (n-1) p_n + \max\{C, 4(1+\varepsilon)\} \log n \right] \to 0,
\]
as $n \to +\infty$.

Exercise 2.9 (RIP v. orthogonality). Show that a $(k, 0)$-RIP matrix with $k \ge 2$ is orthogonal, i.e., its columns are orthonormal.

Exercise 2.10 (Compressed sensing: linear programming formulation). Formulate (2.52) as a linear program, i.e., the minimization of a linear objective subject to linear inequalities.


Exercise 2.11 (Compressed sensing: almost sparse case). Prove the almost sparse case of Claim 2.66 by adapting the proof of the sparse case.

Exercise 2.12 (Chaining tail inequality). Prove Theorem 2.60.

Exercise 2.13 (Poisson convergence: method of moments). Let $A_1, \ldots, A_n$ be events and $A := \cup_i A_i$. Define
\[
S^{(r)} := \sum_{1 \le i_1 < \cdots < i_r \le n} \mathbb{P}[A_{i_1} \cap \cdots \cap A_{i_r}],
\]
and
\[
X_n := \sum_{i=1}^n \mathbf{1}_{A_i}.
\]
Assume that there is $\mu > 0$ such that, for all $r$,
\[
S^{(r)} \to \frac{\mu^r}{r!}.
\]
Use Exercise 2.1 and a Taylor expansion of $e^{-\mu}$ to show that
\[
\mathbb{P}[X_n = 0] \to e^{-\mu}.
\]
In fact, $X_n \stackrel{d}{\to} \mathrm{Poi}(\mu)$ (no need to prove this). This is a special case of the method of moments. See e.g. [Dur10, Section 3.3.5] and [JLR11, Section 6.1].

Exercise 2.14 (Connectivity: critical window). Using Exercise 2.13, show that, when $p_n = \frac{\log n + s}{n}$, the probability that an Erdős-Rényi graph $G_n \sim G_{n, p_n}$ contains no isolated vertex converges to $e^{-e^{-s}}$.

Bibliographic remarks

Section 2.2  The examples in Section 2.2.1 are taken from [AS11, Sections 2.4, 3.2]. A fascinating account of the longest increasing subsequence problem is given in [Rom14], from which the material in Section 2.2.3 is taken. The contour lemma, Lemma 2.17, is attributed to Whitney [Whi32] and is usually proved "by picture" [Gri10a, Figure 3.1]. A formal proof of the lemma can be found in [Kes82, Appendix A]. For much more on percolation, see [Gri10b]. A gentler introduction is provided in [Ste].


Section 2.3  The presentation in Section 2.3.2 follows [AS11, Section 4.4] and [JLR11, Section 3.1]. The result for general subgraphs is due to Bollobás [Bol81]. A special case (including cliques) was proved by Erdős and Rényi [ER60]. For variants of the small subgraph containment problem involving copies that are induced, disjoint, isolated, etc., see e.g. [JLR11, Chapter 3]. For corresponding results for larger graphs, such as cycles or matchings, see e.g. [Bol01]. The result in that section is due to Erdős and Rényi [ER60]. The connectivity threshold in Section 2.3.2 is also due to the same authors [ER59]. The presentation here follows [vdH14, Section 5.2]. Theorem 2.30 is due to R. Lyons [Lyo90].

Section 2.4  The use of the moment-generating function to derive tail bounds for sums of independent random variables was pioneered by Cramér [Cra38], Bernstein [Ber46], and Chernoff [Che52]. For much more on concentration inequalities, see e.g. [BLM13]. The basics of large deviation theory are covered in [Dur10, Section 2.6]. See also [RAS] and [DZ10]. The presentation in Section 2.4.6 is based on [Har, Lectures 6 and 8] and [Tao]. Section 2.4.4 is based partly on [Ver18] and [Lug, Section 3.2]. A very insightful, and much deeper, treatment of the material in Section 2.4.5 can be found in [Ver18, vH16]. The Johnson-Lindenstrauss lemma was first proved by Johnson and Lindenstrauss using non-probabilistic arguments [JL84]. The idea of using random projections to simplify the proof was introduced by Frankl and Maehara [FM88], and the proof presented here, based on Gaussian projections, is due to Indyk and Motwani [IM98]. See [Ach03] for an overview of the various proofs known. For more on the random projection method, see [Vem04]. For algorithmic applications of the Johnson-Lindenstrauss lemma, see e.g. [Har, Lecture 7]. Compressed sensing emerged in the works of Donoho [Don06] and Candès, Romberg and Tao [CRT06a, CRT06b]. The restricted isometry property was introduced by Candès and Tao [CT05]. Claim 2.66 is due to Candès, Romberg and Tao [CRT06b]. The proof of Claim 2.67 presented here is due to Baraniuk et al. [BDDW08]. A survey of compressed sensing can be found in [CW08]. A thorough mathematical introduction to compressed sensing can be found in [FR13]. The presentation in Section 2.4.7 follows [KP, Section 3] and [LP, Section 13.3]. The Varopoulos-Carne bound is due to Carne [Car85] and Varopoulos [Var85]. For a probabilistic approach to the Varopoulos-Carne bound, see Peyre's proof [Pey08]. The application to mixing times is from [LP]. The material in Section 2.4.3 can be found in [BLM13, Chapter 2]. Hoeffding's lemma and inequality are due to Hoeffding [Hoe63]. Bennett's inequality is due to Bennett [Ben62]. Section 2.4.8 borrows from [Ver18, vH16, SSBD14, Haz16].
