
Contents

2.1 Tail bounds and concentration
2.2 Moment generating functions
2.3 Orlicz norms
2.4 Subgaussian tails
2.5 Bennett's inequality
2.6 Bernstein's inequality
2.7 Subexponential distributions?
2.8 The Hanson-Wright inequality for subgaussians
2.9 Problems
2.10 Notes


Chapter 2

A few good inequalities

Section 2.1 introduces exponential tail bounds.
Section 2.2 introduces the method for bounding tail probabilities using moment generating functions.
Section 2.3 describes analogs of the usual Lp norms, which have proved useful in empirical process theory.
Section 2.4 discusses subgaussianity, a most useful extension of the gaussianity property.
Section 2.5 discusses the Bennett inequality for sums of independent random variables that are bounded above by a constant.
Section 2.6 shows how the Bennett inequality implies one form of the Bernstein inequality, then discusses extensions to unbounded summands.
Section 2.7 describes two ways to capture the idea of tails that decrease like those of the exponential distribution.
Section 2.8 illustrates the ideas from the previous Sections by deriving a subgaussian/subexponential tail bound for quadratic forms in independent subgaussian random variables.

2.1 Tail bounds and concentration

Sums of independent random variables are often approximately normally distributed. In particular, their tail probabilities often decrease in ways similar to the tails of a normal distribution. Several classical inequalities make this vague idea more precise. Such inequalities provided the foundation on which empirical process theory was built.


We know a lot about the normal distribution. For example, if W has a N(µ, σ²) distribution then (Feller, 1968, Section 7.1)

(1/x − 1/x³)·e^{−x²/2}/√(2π) ≤ P{W ≥ µ + σx} ≤ (1/x)·e^{−x²/2}/√(2π)  for all x > 0.

Clearly the inequalities are useful only for larger x: as x decreases to zero the lower bound goes to −∞ and the upper bound goes to +∞. For many purposes the exp(−x²/2) factor matters the most. Indeed, the simpler tail bound P{W ≥ µ + σx} ≤ exp(−x²/2) for x ≥ 0 often suffices for asymptotic arguments. Sometimes we need a better inequality showing that the distribution concentrates most of its probability mass in a small region, such as

P{|W − µ| ≥ σx} ≤ 2e^{−x²/2}  for all x ≥ 0,

a concentration inequality.
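The bounds are easy to explore numerically. Here is a quick sketch (not from the text; it uses only the Python standard library) comparing Feller's two bounds and the cruder exp(−x²/2) bound with the exact standard normal tail:

```python
import math

def normal_upper_tail(x):
    # P{Z >= x} for Z ~ N(0,1), via the complementary error function
    return 0.5 * math.erfc(x / math.sqrt(2))

for x in [0.5, 1.0, 2.0, 4.0]:
    phi = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    lower = (1 / x - 1 / x**3) * phi   # Feller's lower bound (negative for small x)
    upper = phi / x                    # Feller's upper bound (blows up as x -> 0)
    crude = math.exp(-x * x / 2)       # the simpler tail bound
    print(f"x={x}: {lower:.3e} <= {normal_upper_tail(x):.3e} <= {upper:.3e};"
          f" crude bound {crude:.3e}")
```

At x = 0.5 the lower bound is already negative, illustrating why the inequalities matter only for larger x.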

Similar-looking tail bounds exist for random variables that are not exactly normally distributed. Particularly useful are the so-called exponential tail bounds, which take the form

<1>  P{X ≥ x + PX} ≤ e^{−g(x)}  for x ≥ 0,

with g a function that typically increases like a power of x. Sometimes g behaves like a quadratic for some range of x and like a linear function further out in the tails; sometimes some logarithmic terms creep in; and sometimes it depends on var(X) or on some more exotic measure of size, such as an Orlicz norm (see Section 2.3).

For example, suppose S = X1 + ··· + Xn, a sum of independent random variables with PXi = 0 and each Xi bounded above by some constant b. The Bernstein inequality (Section 2.6) tells us that

P{S ≥ x√V} ≤ exp( (−x²/2)/(1 + (1/3)bx/√V) )  for x ≥ 0,

where V is any upper bound for var(S). If each Xi is also bounded below by −b then a similar tail bound exists for −S, which leads to a concentration inequality, an upper bound for P{|S| ≥ x√V}.

For x much smaller than √V/b the exponent in the Bernstein inequality behaves like −x²/2; for much larger x the exponent behaves like a negative multiple of x. In a sense that is made more precise in Sections 2.6 and 2.7, the inequality gives a subgaussian bound for x not too big and a subexponential bound for very large x.


2.2 Moment generating functions

The most common way to get a bound like <1> uses the moment generating function, M(θ) := Pe^{θX}. From the fact that exp is nonnegative and exp(θ(X − w)) ≥ 1 when X ≥ w and θ ≥ 0 we have

<2>  P{X ≥ w} ≤ inf_{θ≥0} Pe^{θ(X−w)} = inf_{θ≥0} e^{−θw} M(θ).
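As a sketch of <2> in action (an illustration, not part of the text), the following computes the bound by a crude grid search for the standard exponential distribution, whose moment generating function M(θ) = (1 − θ)^{−1} for θ < 1 appears later in this Section; the grid minimum can be checked against the closed form w·e^{1−w}, valid for w ≥ 1:

```python
import math

def chernoff_bound(w, mgf, thetas):
    # inf over theta >= 0 of exp(-theta*w) * M(theta), as in <2>
    return min(math.exp(-t * w) * mgf(t) for t in thetas)

mgf = lambda t: 1.0 / (1.0 - t)            # standard exponential, valid for t < 1
thetas = [i / 1000 for i in range(999)]    # grid on [0, 1), avoiding the pole
for w in [1.0, 2.0, 5.0]:
    exact = math.exp(-w)                   # P{X >= w} for Exp(1)
    closed = w * math.exp(1 - w)           # minimum achieved at theta = 1 - 1/w
    print(w, exact, chernoff_bound(w, mgf, thetas), closed)
```

The output shows the usual picture: the bound captures the e^{−w} rate of decay but loses a polynomial factor.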

<3> Example. If X has a N(µ, σ²) distribution then M(θ) = exp(θµ + σ²θ²/2) is finite for all real θ. For x ≥ 0 inequality <2> gives

P{X ≥ µ + σx} ≤ inf_{θ≥0} exp(−θ(µ + σx) + θµ + θ²σ²/2) = exp(−x²/2),

the minimum being achieved by θ = x/σ. Analogous arguments, with X − µ replaced by µ − X, give an analogous bound for the lower tail,

P{X ≤ µ − σx} ≤ exp(−x²/2)  for all x > 0,

leading to the concentration inequality P{|X − µ| ≥ σx} ≤ 2e^{−x²/2}.

Remark. Of course the algebra would have been a tad simpler if I hadworked with the standardized variable (X − µ)/σ. I did things themessier way in order to make a point about the centering in Section 2.4.

Exactly the same arguments work for any random variable whose moment generating function is bounded above by exp(θµ + τ²θ²/2), for all real θ and some finite constants µ and τ. This property essentially defines the subgaussian distributions, which are discussed in Section 2.4.

The function L(θ) := log M(θ) is called the cumulant generating function. The upper bound in <2> is often rewritten as

e^{−L*(w)}  where L*(w) := sup_{θ≥0}(θw − L(θ)).

Remark. As Boucheron et al. (2013, Section 2.2) and others have noted, the function L* is the Fenchel-Legendre transform of L, which is also known as the conjugate function (Boyd and Vandenberghe, 2004, Section 3.3).


Clearly L(0) = 0. The Hölder inequality shows that L is convex: if θ is a convex combination α1θ1 + α2θ2 then

e^{L(θ)} = M(α1θ1 + α2θ2) = P(e^{α1θ1X} e^{α2θ2X}) ≤ (Pe^{θ1X})^{α1} (Pe^{θ2X})^{α2} = e^{α1L(θ1) + α2L(θ2)}.

More precisely (Problem [1]), the set J = {θ : M(θ) < ∞} is an interval, possibly degenerate or unbounded, containing the origin. For example, for the Cauchy distribution, M(θ) = ∞ for all real θ ≠ 0; for the standard exponential distribution (density e^{−x}{x > 0} with respect to Lebesgue measure),

M(θ) = (1 − θ)^{−1} for θ < 1, and M(θ) = ∞ otherwise.

Problem [1] also justifies the interchange of differentiation and expectation on the interior of J, which leads to

L′(θ) = M′(θ)/M(θ) = P( X e^{θX}/M(θ) )

L″(θ) = M″(θ)/M(θ) − (M′(θ)/M(θ))² = P( X² e^{θX}/M(θ) ) − ( P X e^{θX}/M(θ) )².

If we define a probability measure Pθ by its density, dPθ/dP = e^{θX}/M(θ), then the derivatives of L can be re-expressed using moments of X under Pθ:

L′(θ) = PθX  and  L″(θ) = Pθ(X − PθX)² = varθ(X) ≥ 0.

In particular, L′(0) = PX, so that Taylor expansion gives

<4>  L(θ) = θPX + ½θ² var_{θ*}(X), with θ* lying between 0 and θ.

Bounds on the Pθ variance of X lead to bounds on L and tail bounds for X − PX. The Hoeffding inequality (Example <14>) provides a good illustration of this line of attack.

2.3 Orlicz norms

There is a fruitful way to generalize the concept of an Lp norm by means of a Young function, that is, a convex, increasing function Ψ : R⁺ → R⁺ with Ψ(0) = 0 and Ψ(x) → ∞ as x → ∞. The concept applies with any measure µ but I need it mostly for probability measures.


Remark. The assumption that Ψ(x) → ∞ as x → ∞ is needed only to rule out the trivial case where Ψ ≡ 0. Indeed, if Ψ(x0) > 0 for some x0 then, for x = x0/α with α < 1,

Ψ(x0) = Ψ(αx + (1 − α)0) ≤ αΨ(x).

That is, Ψ(x)/x ≥ Ψ(x0)/x0 for all x ≥ x0. If Ψ is not identically zero then it must increase at least linearly with increasing x.

At one time Orlicz norms became very popular in the empirical processliterature, in part (I suspect) because chaining arguments (see Chapter 4)are cleaner with norms than with tail probabilities.

The most important Young functions for my purposes are the power functions, Ψ(x) = x^p for a fixed p ≥ 1, and the exponential power functions Ψα(x) := exp(x^α) − 1 for α ≥ 1. The function Ψ2 is particularly useful (see Section 2.4) for working with distributions whose tails decrease like a gaussian. It is also possible (Problem [6]) to define a Young function Ψα, for 0 < α < 1, for which

Ψα(x) ≤ exp(x^α) ≤ Kα + Ψα(x)  for all x ≥ 0,

where Kα is a positive constant. Some linear interpolation near 0 is needed to overcome the annoyance that x ↦ exp(x^α) is concave on the interval [0, zα], for zα := ((1 − α)/α)^{1/α}.

<5> Definition. Let (X, A, µ) be a measure space and f be an A-measurable real function on X. For each Young function Ψ the corresponding Orlicz norm ‖f‖Ψ (or Ψ-norm) is defined by

‖f‖Ψ = inf{c > 0 : µΨ(|f|/c) ≤ 1},

with ‖f‖Ψ = ∞ if the infimum runs over an empty set.

Remark. It is not hard to show that ‖f‖Ψ < ∞ if and only if µΨ(|f|/C0) < ∞ for at least one finite constant C0. Moreover, if 0 < c = ‖f‖Ψ < ∞ then µΨ(|f|/c) ≤ 1; the infimum is achieved.

When restricted to the set L_Ψ(X, A, µ) of all measurable real functions on X with finite Ψ-norm, ‖·‖Ψ is a seminorm. It fails to be a norm only because ‖f‖Ψ = 0 if and only if µ{x : f(x) ≠ 0} = 0. The corresponding set of µ-equivalence classes is a Banach space. It is traditional and benign to call ‖·‖Ψ a norm even though it is only a seminorm on L_Ψ(X, A, µ).

The assumption Ψ(0) = 0 can be discarded if µ is a finite measure. For each finite constant A > Ψ(0), the quantity inf{c > 0 : µΨ(|f|/c) ≤ A} also defines a seminorm.


Finiteness of the Ψ-norm for a random variable Z immediately implies a tail bound: if 0 < σ = ‖Z‖Ψ < ∞ then

<6>  P{|Z| ≥ xσ} ≤ min(1, PΨ(|Z|/σ)/Ψ(x)) ≤ min(1, 1/Ψ(x))  for x ≥ 0.

For example, for each α > 0 there is a constant Cα for which

P{|Z| ≥ x‖Z‖_{Ψα}} ≤ Cα e^{−x^α}  for x ≥ 0.

The constant can be chosen so that Cα e^{−x^α} ≥ 1 for all x so small that Ψα(x)e^{−x^α} is not close to 1.

When seeking to bound an Orlicz norm I often find it easier to first obtain a constant C0 for which PΨ(|X|/C0) ≤ K0, where K0 is a constant strictly larger than 1. As the next Lemma shows, a simple convexity argument turns such an inequality into a bound on ‖X‖Ψ.

<7> Lemma. Suppose Ψ is a Young function. If PΨ(|X|/C0) ≤ K0 for some finite constants C0 and K0 > 1 then ‖X‖Ψ ≤ C0K0.

Proof. For each θ in [0, 1] convexity of Ψ gives

PΨ(θ|X|/C0) ≤ θPΨ(|X|/C0) + (1 − θ)Ψ(0) ≤ θK0.

The choice θ = 1/K0 makes the last bound equal to 1, so that ‖X‖Ψ ≤ C0/θ = C0K0.

Beware. The Lemma does not give a foolproof way of determining reasonable bounds for the Ψ-norm. For example, if X has a N(0, 1) distribution then simple integration shows that

PΨ2(|X|/c) = g(c) := c/√(c² − 2) − 1  for c > √2.

Thus ‖X‖_{Ψ2} = √(8/3), but cg(c) → ∞ as c decreases to √2.
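A short numerical check (an illustration only) of the Beware example: bisection on the closed form g(c) above recovers ‖X‖_{Ψ2} = √(8/3) for a N(0, 1) variable:

```python
import math

def g(c):
    # P Psi_2(|X|/c) for X ~ N(0,1), from the closed form above (valid for c > sqrt(2))
    return c / math.sqrt(c * c - 2) - 1

lo, hi = math.sqrt(2) + 1e-12, 10.0
for _ in range(100):          # bisection for the root of g(c) = 1
    mid = 0.5 * (lo + hi)
    if g(mid) > 1:            # g is decreasing, so the root lies to the right
        lo = mid
    else:
        hi = mid
print(hi, math.sqrt(8 / 3))   # both approximately 1.632993
```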

2.4 Subgaussian tails

As noted in Section 2.2, if X is a random variable for which there exist constants µ and σ² such that Pe^{θX} ≤ exp(µθ + ½σ²θ²) for all real θ, then

<8>  P{X ≥ µ + σx} ≤ e^{−x²/2}  for all x ≥ 0.


The approximations for θ near zero,

Pe^{θX} = 1 + θPX + ½θ²PX² + o(θ²),
exp(µθ + ½σ²θ²) = 1 + µθ + ½(σ² + µ²)θ² + o(θ²),

would force PX = µ and PX² ≤ σ² + µ², that is, var(X) ≤ σ². As Problem [11] shows, σ² might need to be much larger than the variance. To avoid any hint of the convention that σ² denotes a variance, I feel it is safer to replace σ by another symbol.

<9> Definition. Say that a random variable X with PX = µ has a subgaussian distribution with scale factor τ < ∞ if Pe^{θX} ≤ exp(µθ + τ²θ²/2) for all real θ. Write τ(X) for the smallest τ for which such a bound holds.

Remark. Some authors call such an X a τ -subgaussian randomvariable. Some authors require µ = 0 for subgaussianity; others usethe qualifier centered to refer to the case where µ is zero.

For a subgaussian X, the two-sided analog of <8> gives

<10>  P{|X − µ| ≥ x} ≤ 2 exp(−x²/(2β²))  for all x ≥ 0,

for some β with 0 < β ≤ τ(X). In fact (Theorem <16>) existence of such a tail bound is equivalent to subgaussianity.

<11> Example. Suppose Y = (Y1, ..., Yn) has a multivariate normal distribution. Define M := max_{i≤n} |Yi| and σ² := max_{i≤n} var(Yi). Chapter 6 explains why

P{|M − PM| ≥ σx} ≤ 2 exp(−x²/2)  for all x ≥ 0.

That is, M is subgaussian, a most surprising and wonderful fact.

Many useful results that hold for gaussian variables can be extended tosubgaussians. In empirical process theory, conditional subgaussianity playsa key role in symmetrization arguments (see Chapter 8). The first examplecaptures the main idea.

<12> Example. Suppose X = Σ_{i≤n} b_i s_i, where the b_i's are constants and the s_i's are independent Rademacher random variables, that is, P{s_i = +1} = ½ = P{s_i = −1}. Then

Pe^{θb_i s_i} = ½(e^{θb_i} + e^{−θb_i}) = Σ_{k≥0} (θb_i)^{2k}/(2k)! ≤ exp(θ²b_i²/2),

so that Pe^{θX} ≤ Π_{i≤n} Pe^{θb_i s_i} ≤ e^{θ²τ²/2} with τ² = Σ_{i≤n} b_i². That is, X has a centered subgaussian distribution with τ(X)² ≤ Σ_{i≤n} b_i².

More generally, if X is a sum of independent subgaussians X1, ..., Xn then the equality

Pe^{θ(X − PX)} = Π_i Pe^{θ(X_i − PX_i)}

shows that X is subgaussian with τ²(X) ≤ Σ_i τ²(X_i). For sums of independent variables we need consider only the subgaussianity of each summand.
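A Monte Carlo sketch (an illustration only, with arbitrarily chosen coefficients b_i) of the one-sided subgaussian tail bound for the Rademacher sum of Example <12>:

```python
import math, random

random.seed(0)
b = [1.0, 0.5, 2.0, 1.5, 0.25]             # arbitrary coefficients
tau = math.sqrt(sum(bi * bi for bi in b))   # tau(X) <= sqrt(sum of b_i^2)
n_sims = 100_000
for x in [1.0, 2.0, 3.0]:
    hits = sum(
        1 for _ in range(n_sims)
        if sum(bi * random.choice((-1, 1)) for bi in b) >= x * tau
    )
    print(x, hits / n_sims, math.exp(-x * x / 2))  # empirical tail vs bound <8>
```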

<13> Example. Suppose X is a random variable with PX = µ and a ≤ X ≤ b, for finite constants a and b. By representation <4>, its moment generating function M(θ) satisfies

log M(θ) = θPX + ½θ² var_{θ*}(X), with θ* lying between 0 and θ.

Write µ* for P_{θ*}X and m for the midpoint (a + b)/2. Note that |X − m| ≤ (b − a)/2. Thus

var_{θ*}(X) = P_{θ*}|X − µ*|² ≤ P_{θ*}|X − m|² ≤ (b − a)²/4

and log Pe^{θ(X−µ)} ≤ (b − a)²θ²/8. The random variable X is subgaussian with τ(X) ≤ (b − a)/2.

Remark. Hoeffding (1963, page 22) derived the same bound on the moment generating function by a direct appeal to convexity of the exponential function:

Pe^{θX} ≤ qe^{θa} + pe^{θb}  where 1 − q = p = (PX − a)/(b − a).

In effect, the bound represents the extremal case where X concentrates on the two-point set {a, b}. He then calculated a second derivative for L, without interpreting the result as a variance but using an inequality that can be given a variance interpretation.

<14> Example. (Hoeffding's inequality for independent summands) Suppose random variables X1, ..., Xn are independent with PX_i = µ_i and a_i ≤ X_i ≤ b_i for each i, for constants a_i and b_i.

By Example <13>, each X_i is subgaussian with τ(X_i) ≤ (b_i − a_i)/2. The sum S = Σ_{i≤n} X_i is also subgaussian, with expected value µ = Σ_{i≤n} µ_i and τ²(S) ≤ Σ_{i≤n} τ²(X_i) ≤ Σ_{i≤n} (b_i − a_i)²/4. The one-sided version of inequality <10> gives

<15>  P{Σ_{i≤n}(X_i − µ_i) ≥ x} ≤ exp( −2x²/Σ_{i≤n}(b_i − a_i)² )  for each x ≥ 0.
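A small check (not from the text) of <15> against an exact computation, taking each X_i to be Ber(p) so that every summand lies in [0, 1] and Σ_i(b_i − a_i)² = n:

```python
import math

def binom_upper_tail(n, p, k0):
    # P{Bin(n, p) >= k0}
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k0, n + 1))

n, p = 100, 0.3
for x in [5, 10, 20]:
    exact = binom_upper_tail(n, p, math.ceil(n * p + x))
    hoeffding = math.exp(-2 * x * x / n)   # <15> with sum of (b_i - a_i)^2 equal to n
    print(x, exact, hoeffding)
```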

See Chapter 3 for an extension of Hoeffding’s inequality <15> to sumsof bounded martingale differences.

The next Theorem provides other characterizations of subgaussianity, using the Orlicz norm for the Young function Ψ2(x) = exp(x²) − 1 and moment inequalities. The proof of the Theorem provides a preview of a "symmetrization" argument that plays a major role in Chapter 8. I prefer to write the argument using product measures rather than conditional expectations. Here is the idea. Suppose X1 is a random variable with PX1 = µ, defined on a probability space (Ω1, F1, P1). Make an "independent copy" (Ω2, F2, P2) of the probability space then carry X1 over to a new random variable X2 on Ω2. Under P1 ⊗ P2 the random variables X1 and X2 are independent and identically distributed.

Remark. You might prefer to start with an X defined on (Ω,F,P)then define random variables on Ω × Ω by X1(ω1, ω2) = X(ω1) andX2(ω1, ω2) = X(ω2). Under P⊗P the random variables X1 and X2 areindependent and each has the same distribution as X (under P). I oncereferred to independent copies during a seminar in a Math department.One member of the audience protested vehemently that a copy of X isthe same as X and hence can only be independent of X in trivial cases.

Note that P2X2 = P1X1 = µ. For each convex function f,

P1 f(X1 − µ) = P1^{ω1} f[X1(ω1) − P2^{ω2} X2(ω2)]
             = P1^{ω1} f[P2^{ω2}(X1(ω1) − X2(ω2))]
             ≤ P1^{ω1} P2^{ω2} f[X1(ω1) − X2(ω2)],

the last line coming from Jensen's inequality for P2.

<16> Theorem. For each random variable X with expected value µ, the following assertions are equivalent.

(i) X has a subgaussian distribution (with τ(X) < ∞);

(ii) L(X) := sup_{r≥1} ‖X‖_r/√r < ∞;

(iii) ‖X‖_{Ψ2} < ∞;

(iv) there exists a positive constant β for which P{|X − µ| ≥ x} ≤ 2 exp(−x²/(2β²)) for all x ≥ 0.

More precisely,

(a) c0 ‖X‖_{Ψ2} ≤ L(X) ≤ ‖X‖_{Ψ2} with c0 = 1/√(4e);

(b) if β(X) denotes the smallest β for which inequality (iv) holds, then

c1 ‖X − µ‖_{Ψ2} ≤ β(X) ≤ τ(X) ≤ 4L(X − µ) with c1 = 1/√6.

Remark. Finiteness of L(X) is also equivalent to finiteness of L(X−µ)because |µ| ≤ ‖X‖1 ≤ L(X) and |L(X)−L(X−µ)| ≤ |µ|. ConsequentlyL(X − µ) ≤ 2L(X), but there can be no universal inequality in theother direction: L(X − µ) is unchanged by addition of a constant to Xbut L(X) can be made arbitrarily large.

The precise values of the constants c_i are not important. Most likely they could be improved, but I can see no good reason for trying.

Proof. Implicit in the sequences of inequalities is the assertion that finiteness of one quantity implies finiteness of another.

Proof of (a). Problem [8] shows that k^k ≥ k! ≥ (k/e)^k for all k ∈ N.

Suppose PΨ2(|X|/D) ≤ 1. Then Σ_{k∈N} P(X/D)^{2k}/k! ≤ 1, so that

‖X‖_{2k} ≤ D(k!)^{1/(2k)} ≤ D√k  for each k ∈ N,

and L(X) ≤ √2 sup_{k∈N} ‖X‖_{2k}/√(2k) ≤ D by Problem [9].

Conversely, if ∞ > L ≥ ‖X‖_r/√r for all r ≥ 1 then

PΨ2(|X|/D) = Σ_{k∈N} ‖X/D‖_{2k}^{2k}/k! ≤ Σ_{k∈N} (L√(2k)/D)^{2k} (e/k)^k,

which equals 1 if D = √(4e) L.

Proof of (b). None of the quantities appearing in the inequalities depend on µ, so we may as well assume µ = 0.

Inequality <10> gives β(X) ≤ τ(X).


Simplify notation by writing β for β(X), τ for τ(X), and L for L(X), and define γ := ‖X‖_{Ψ2}. Also, assume that X is not a constant, to avoid annoying trivial cases.

To show that γ ≤ β√6:

PΨ2( |X − µ|/(β√6) ) = P∫_0^∞ e^x {|X − µ|² ≥ 6β²x} dx
                     = ∫_0^∞ e^x P{|X − µ| ≥ β√(6x)} dx
                     ≤ 2∫_0^∞ exp(x − 3x) dx = 1.

To show that τ ≤ 4L: Symmetrize. Construct a random variable X′ with the same distribution as X but independent of X. By the triangle inequality, ‖X − X′‖_r ≤ 2‖X‖_r ≤ 2L√r for all r ≥ 1. By Jensen's inequality,

Pe^{θX} ≤ Pe^{θ(X − X′)} = 1 + Σ_{k∈N} θ^{2k} P(X − X′)^{2k}/(2k)!   (symmetry kills the odd moments)
        ≤ 1 + Σ_{k∈N} (2L√(2k) θ)^{2k}/(k^k k!) = 1 + Σ_{k∈N} (8θ²L²)^k/k! = e^{8L²θ²},

and so on.

2.5 Bennett's inequality

The Hoeffding inequality depends on the somewhat crude bound

var(W) ≤ (b − a)²/4  if a ≤ W ≤ b.

Such an inequality can be too pessimistic if W has only a small probability of taking values near the endpoints of the interval [a, b]. The inequalities can sometimes be improved by introducing second moments explicitly into the upper bound. Bennett's one-sided tail bound provides a particularly elegant illustration of the principle.


<17> Theorem. Suppose X1, ..., Xn are independent random variables for which X_i ≤ b and PX_i = µ_i and PX_i² ≤ v_i for each i, for nonnegative constants b and v_i. Let W equal Σ_{i≤n} v_i. Then

P{ Σ_{i≤n}(X_i − µ_i) ≥ x } ≤ exp( −(x²/(2W)) ψBenn(bx/W) )  for x ≥ 0,

where ψBenn denotes the function defined on [−1, ∞) by

<18>  ψBenn(t) := ((1 + t)log(1 + t) − t)/(t²/2) for t ≠ 0, and ψBenn(0) = 1.

Remark. By Problem [14], ψBenn is positive, decreasing, and convex,with ψBenn(−1) = 2 and ψBenn(t) decreasing like (2/t) log t as t→∞.

When bx/W is close to zero the ψBenn contribution to the upper bound is close to 1 and the exponent is close to a subgaussian −x²/(2W). (If b = 0 then it is exactly subgaussian.) Further out in the tail, for larger x, the exponent decreases like −x log(bx/W)/b, which is slightly faster than exponential.
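The function ψBenn is simple to implement. The following sketch (an illustration only) checks a few values against the lower bound ψBenn(t) ≥ (1 + t/3)^{−1} cited from Problem [14]:

```python
import math

def psi_benn(t):
    # <18>: psi_Benn(t) = ((1+t)log(1+t) - t)/(t^2/2), with psi_Benn(0) = 1
    if t == 0:
        return 1.0
    return ((1 + t) * math.log(1 + t) - t) / (t * t / 2)

for t in [-0.99, -0.5, 0.0, 0.5, 2.0, 10.0, 100.0]:
    print(t, psi_benn(t), 1 / (1 + t / 3))  # psi_Benn(t) >= 1/(1 + t/3)
```

The printed values also show ψBenn approaching 2 near t = −1 and decaying slowly, like (2/t)log t, for large t.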

Proof. Define Φ1(x) = e^x − 1 − x, the convex function Ψ1 adjusted to have zero derivative at the origin, as in Problem [7]. From Problem [14], the positive function

<19>  ∆(x) = Φ1(x)/(x²/2) for x ≠ 0, and ∆(0) = 1,

is continuous and increasing on the whole real line. For θ ≥ 0,

Pe^{θX_i} = 1 + θPX_i + PΦ1(θX_i)
          = 1 + θµ_i + P( ½(θX_i)² ∆(θX_i) )
          ≤ 1 + θµ_i + ½θ² P(X_i²) ∆(θb)   because ∆ is increasing
<20>      ≤ exp( θµ_i + ½θ² v_i ∆(θb) ).

That is,

<21>  Pe^{θ(X_i − µ_i)} ≤ exp( ½θ² v_i ∆(θb) ) = exp( (v_i/b²) Φ1(θb) )  for θ ≥ 0.

Remark. In the last expression I am implicitly assuming that b is nonzero. We can recover the bound when b = 0 by a passage to the limit as b decreases to zero. Actually, the b = 0 case is quite noteworthy because the ∆ contribution disappears from the upper bound for P exp(θX_i) and the ψBenn disappears from the final inequality.


Abbreviate Σ_{i≤n}(X_i − µ_i) to T. By independence and <21>,

Pe^{θT} = Π_{i≤n} Pe^{θ(X_i − µ_i)} ≤ exp( (W/b²) Φ1(θb) )

and, by <2>,

P{T ≥ x} ≤ inf_{θ≥0} exp( −xθ + (W/b²)Φ1(θb) )
         = exp( −(W/b²) sup_{bθ≥0}( (bx/W)·bθ − Φ1(θb) ) )
         = exp( −(W/b²) Φ1*(bx/W) ),

where Φ1* denotes the conjugate of Φ1,

Φ1*(y) := sup_{t∈R}(ty − Φ1(t)) = (1 + y)log(1 + y) − y if y ≥ −1, and +∞ otherwise.

When y ≥ −1 the supremum is achieved at t = log(1 + y). Thus ½y²ψBenn(y) = Φ1*(y) for y ≥ −1 and (W/b²)Φ1*(bx/W) = (x²/(2W))ψBenn(bx/W), as asserted.

Remark. The inequality still holds if 0 > b ≥ −x/W , although theproof changes slightly. What happens for smaller b?

Unlike the case of the two-sided Hoeffding inequality, there is no automatic lower-tail analog of the Bennett inequality without an explicit lower bound for the summands. One particularly interesting case occurs when X_i ≥ 0:

P{ Σ_{i≤n}(X_i − µ_i) ≤ −x } = P{ Σ_{i≤n}(µ_i − X_i) ≥ x } ≤ exp( −x²/(2W) )

for all x ≥ 0, a clean (one-sided) subgaussian tail bound.

The Bennett inequality also works for the Poisson distribution. If X has a Poisson(λ) distribution then (Problem [15])

P{X ≥ λ + x} ≤ exp( −(x²/(2λ)) ψBenn(x/λ) )  for x ≥ 0

and, for 0 ≤ x ≤ λ,

P{X ≤ λ − x} ≤ exp( −(x²/(2λ)) ψBenn(−x/λ) ) ≤ exp( −x²/(2λ) ),


a perfect subgaussian lower tail. See Problem [16] if you have any doubts about the boundedness of Poisson random variables.
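A quick numerical comparison (an illustration only, using the standard library) of the Bennett-type Poisson upper-tail bound with the exact tail:

```python
import math

def pois_upper_tail(lam, k0):
    # P{X >= k0} for X ~ Poisson(lam)
    return 1 - sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(k0))

def psi_benn(t):
    return 1.0 if t == 0 else ((1 + t) * math.log(1 + t) - t) / (t * t / 2)

lam = 10.0
for x in [2.0, 5.0, 10.0]:
    exact = pois_upper_tail(lam, math.ceil(lam + x))
    bound = math.exp(-(x * x / (2 * lam)) * psi_benn(x / lam))
    print(x, exact, bound)
```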

See Chapter 3 for an extension of the Bennett inequality to sums ofmartingale differences.

2.6 Bernstein's inequality

As shown by Problem [14], ψBenn(t) ≥ (1 + t/3)^{−1} for t ≥ −1. Thus the Bennett inequality implies a weaker result: if X1, ..., Xn are independent with PX_i = µ_i and PX_i² ≤ v_i and X_i ≤ b then

<22>  P{ Σ_{i≤n}(X_i − µ_i) ≥ x } ≤ exp( (−x²/2)/(W + bx/3) )  for x ≥ 0,

a form of Bernstein's inequality. When bx/W is close to zero the exponent is again close to −x²/(2W); when bx/W is large, the Bernstein inequality loses the extra log term in the exponent. If the summands X_i are bounded in absolute value (not just bounded above) we get an analogous two-sided inequality.

The assumptions PX_i² ≤ v_i and |X_i| ≤ b imply

P|X_i|^k ≤ P(X_i²) b^{k−2} ≤ v_i b^{k−2}  for k ≥ 2.

A much weaker condition on the growth of P|X_i|^k with k, without any assumption of boundedness, leads to a useful upper bound for the moment generating function and a more general version of the Bernstein inequality.

<23> Theorem. (Bernstein inequality) Suppose X1, ..., Xn are independent random variables with PX_i = µ_i and

<24>  P|X_i|^k ≤ ½ v_i B^{k−2} k!  for k ≥ 2.

Define W = Σ_{i≤n} v_i. Then, for x ≥ 0,

P{ Σ_{i≤n}(X_i − µ_i) ≥ x } ≤ e^{−HBenn(x,B,W)} ≤ e^{−HBern(x,B,W)}

where

HBern(x, B, W) = (x²/W)/(2(1 + xB/W))
HBenn(x, B, W) = (x²/W)/(1 + xB/W + √(1 + 2xB/W)).


Remark. The B here does not represent an upper bound for X_i. The traditional form of Bernstein's inequality, with HBern, corresponds to the bound <22> derived from the Bennett inequality under the assumption X_i ≤ b, with B = b/3.

Proof. As before,

F := P{ Σ_i X_i ≥ x + Σ_i µ_i } ≤ inf_{θ≥0} e^{−(x + Σ_i µ_i)θ} Π_{i≤n} Pe^{θX_i}.

Use the moment inequality <24> to bound the moment generating function for X_i:

Pe^{θX_i} = 1 + θµ_i + PΦ1(θX_i)  where Φ1(t) := e^t − 1 − t

and, for θ ≥ 0,

<25>  PΦ1(θX_i) ≤ PΦ1(θ|X_i|) = Σ_{k≥2} θ^k P|X_i|^k/k! ≤ ½ v_iθ² Σ_{k≥2}(θB)^{k−2} = ½ v_iθ²/(1 − Bθ),  provided 0 ≤ θ < 1/B.

Thus

Pe^{θX_i} ≤ 1 + θµ_i + ½ v_iθ²/(1 − Bθ) ≤ exp( θµ_i + ½ v_iθ²/(1 − Bθ) )  for 0 ≤ θB < 1

and

F ≤ exp( −sup_{0≤Bθ<1} g(θ) )  where g(θ) := xθ − ½ Wθ²/(1 − Bθ).

The HBern bound. As an approximate maximizer for g choose θ0 = x/(W + xB), so that

F ≤ e^{−g(θ0)} = exp( (−x²/2)/(W + xB) ).

Bennett (1962, equation (7)) made this strange choice by defining G = 1 − θB then minimizing −xθ + ½θ²W/G while ignoring the fact that G depends on θ. That gives θ = xG/W = (1 − θB)x/W, a linear equation with solution θ0. Very confusing. As Bennett noted, a boundedness assumption |X_i| ≤ b actually implies <24> with B = b/3. The exponential upper bound then agrees with <22>.

The HBenn bound. Following Bennett (1962, equation (7a)), this time we really maximize g(θ). The calculation is cleaner if we rewrite g with u := xB/W and s := 1 − θB. With those substitutions,

g(θ) = g0(s) = (W/(2B²))( 2u(1 − s) − (1 − s)²/s ) = (W/(2B²))( −s^{−1} − (1 + 2u)s + 2(1 + u) ).

The maximum is achieved at s0 = (1 + 2u)^{−1/2}, giving F ≤ e^{−g0(s0)} where

<26>  g0(s0) = (W/B²)( 1 + u − √(1 + 2u) ) = (W/B²)·u²/( 1 + u + √(1 + 2u) ).

Both inequalities in Theorem <23> have companion lower bounds, obtained by replacing X_i by −X_i in all the calculations. This works because the Bernstein condition <24> also holds with X_i replaced by −X_i. The inequalities could all have been written as two-sided bounds, that is, as upper bounds for P{ |Σ_i(X_i − µ_i)| ≥ x }.

If we are really only interested in a one-sided bound then it seems superfluous to control the size of |X_i|. For example, for the upper tail it should be enough to control X_i⁺. Boucheron et al. (2013, page 37), hereafter BLM, derived such a bound (which they attributed to Emmanuel Rio) by means of the elementary inequality

e^x − 1 − x ≤ ½x² + Σ_{k≥3} (x⁺)^k/k!  for all real x.

(For x ≥ 0 both sides are equal. For x < 0 it reflects the fact that g(x) = e^x − (1 + x + ½x²) has a positive derivative with g(0) = 0.) They replaced the assumption <24> by

PX_i² = v_i and P(X_i⁺)^k ≤ ½ v_i B^{k−2} k! for k ≥ 3.

For 0 ≤ θ < 1/B, we then get

Pe^{θX_i} ≤ 1 + θµ_i + ½θ²v_i + ½ v_iθ² Σ_{k≥3}(θB)^{k−2} ≤ exp( θµ_i + ½ v_iθ²/(1 − θB) ),


which is exactly the same inequality as in the proof of Theorem <23>. As with the HBenn calculation it follows that

P{ Σ_i X_i ≥ x + Σ_i µ_i } ≤ exp( −(W/B²)( 1 + u − √(1 + 2u) ) )

where W = Σ_i v_i and u = xB/W. BLM observed that the upper bound can be inverted by solving a quadratic in u,

1 + 2u = (1 + u − tB²/W)²,

which has a positive solution u = tB²/W + √(2tB²/W), corresponding to x = Wu/B = tB + √(2tW). The one-sided Bernstein bound then takes an elegant form,

<27>  P{ Σ_i(X_i − µ_i) ≥ tB + √(2tW) } ≤ e^{−t}  for t ≥ 0.
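The inversion is easy to confirm numerically. In the following sketch (with arbitrarily chosen W and B) the exponent evaluated at u = tB²/W + √(2tB²/W) returns t, and Wu/B matches tB + √(2tW):

```python
import math

W, B = 4.0, 1.5                        # arbitrary variance and scale factors
for t in [0.1, 1.0, 7.0]:
    c = t * B * B / W
    u = c + math.sqrt(2 * c)           # positive root of 1 + 2u = (1 + u - c)^2
    exponent = (W / (B * B)) * (1 + u - math.sqrt(1 + 2 * u))
    threshold = W * u / B
    print(t, exponent, threshold, t * B + math.sqrt(2 * t * W))
```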

By splitting according to whether tB ≥ √(2tW) or not, and by introducing an extraneous factor of 2, we can split the elegant form into two more suggestive inequalities,

P{ Σ_i(X_i − µ_i) ≥ 2tB } ≤ e^{−t}  for t ≥ 2W/B²
P{ Σ_i(X_i − µ_i) ≥ 2t√W } ≤ e^{−t²/2}  for t²/2 ≤ 2W/B².

These inequalities make clear that the subexponential part (large t) of the tail is controlled by B and the subgaussian part (moderate t) is controlled by W.

2.7 Subexponential distributions?

To me, subexponential behavior refers to functions that are bounded above by a constant multiple of exp(−c|x|), for some positive constant c. For example, the Bernstein inequality shows that, far enough from the mean, the tail probabilities for a large class of sums of independent random variables are subexponential. Closer to the mean the tails are subgaussian, which one might think is better than subexponential. As another example, if a random variable X takes values in [−1, +1] we have P{|X| ≥ x} = 0 ≤ exp(−10¹⁰x) for all x ≥ 1, but I would hesitate to call that behavior subexponential.

What then does it mean for a whole distribution to be subexponential? And how is subexponentiality related to the Bernstein moment condition,

<28>  P|X|^k/(B^k k!) ≤ ½ v/B²  for k ≥ 2?


As you already know, condition <28> implies

<29>  Pe^{θX} − 1 − θPX = PΦ1(θX) ≤ PΦ1(θ|X|) ≤ ½ vθ²/(1 − θB)  for 0 ≤ θB < 1,

which implies

Pe^{θX} ≤ exp( θµ + ½ vθ²/(1 − θB) )  for 0 ≤ θB < 1 and µ = PX.

Van der Vaart and Wellner (1996, page 103) pointed out that condition <28> is implied by

½ v/B² ≥ PΦ1(|X|/B) = Σ_{k≥2} P|X|^k/(B^k k!)  where Φ1(x) := e^x − 1 − x.

And, if <28> holds, then

PΦ1(|X|/(2B)) = Σ_{k≥2} P|X|^k/(2^k B^k k!) ≤ ½(2v)/(2B)²,

which is just <29> with θ = 1/(2B). It would seem that the moment condition is somehow related to ‖X‖_{Φ1}. Indeed, Vershynin (2010, Section 5.2.4) defined a random variable X to be subexponential if sup_{k∈N} ‖X‖_k/k < ∞. By Problems [7] and [10], this condition is equivalent to finiteness of the Orlicz norm ‖X‖_{Φ1}.

Boucheron et al. (2013, Section 2.4) defined a centered random variable X to be sub-gamma on the right tail with variance factor v and scale factor B if

<30>  Pe^{θX} ≤ exp( (vθ²/2)/(1 − θB) )  for 0 ≤ θB < 1.

They pointed out that if X has a gamma(α, β) distribution (with mean µ = αβ and variance v = αβ²) then

Pe^{θ(X−µ)} = e^{−θµ}(1 − θβ)^{−α} = exp( −θαβ − α log(1 − θβ) ) ≤ exp( (αβ²θ²/2)/(1 − θβ) )  for 0 ≤ θβ < 1.

Hence the name. The special case where α = 1 corresponds to the exponential distribution.


Let me try to tie these definitions and facts together.

First note that the upper bound in <29> equals 1 if θ = 2/(B + √(B² + 2v)). Thus the Bernstein assumption <28> implies

γ := ‖X‖_{Φ1} ≤ (B + √(B² + 2v))/2.

If, as is sometimes assumed, v = PX², then <28> also implies

v^{3/2} ≤ P|X|³ ≤ ½ v·3!·B = 3vB.

That is, v ≤ (3B)² and B < γ ≤ 2.7B. It appears that the B, or ‖X‖_{Φ1}, controls the subexponential part of the tail probabilities, at least as far as the Bernstein inequality is concerned. Inequality <27> conveyed the same message. Compare with the subexponential bound

P{|X| ≥ x} ≤ min(1, 1/Φ1(x/γ))  for γ = ‖X‖_{Φ1}.

The inequality PΦ1(|X|/γ) ≤ 1 also implies

P|X|^k/(γ^k k!) ≤ 1 ≤ ½(2γ²)/γ²  for k ≥ 2.

That is, <28> holds with v = 2γ² and B = γ. Theorem <23> then gives

P{X − µ ≥ x} ≤ exp( −x²/(4γ² + 2xγ) ) ≤ exp( −x²/(8γ²) )  for 0 ≤ x ≤ 2γ.

Is that not a subgaussian bound? Unfortunately, the bound, subgaussian or not, has little value for such a small range near zero. The upper bound always exceeds 1/√e.

The useful subgaussian part of the bound is contributed by the v in <28>. The effect is clearer for a sum of independent random variables X1, ..., Xn, each distributed like X:

P exp( θ Σ_{i≤n} X_i ) ≤ exp( (nvθ²/2)/(1 − θB) )  for 0 ≤ θB < 1,

which leads to tail bounds like

P{ Σ_i X_i ≥ x } ≤ exp( (−x²/2)/(nv + xB) ),


useful subgaussian behavior when 0 ≤ x ≤ nv/B.

In my opinion, the sub-gamma property <30> is better thought of as a restricted subgaussian assumption that automatically implies subexponential behavior far out in the tails. For each α in (0, 1) it implies

Pe^{θX} ≤ exp( (vθ²/2)/(1 − α) )  for 0 ≤ θB ≤ α < 1.

Remark. Uspensky (1937, page 204) used this approach for theBernstein inequality, before making the choice α = xB/(W + xB) torecover the traditional HBern form of the upper bound.

More generally, if we have a bound

<31>  Pe^{θ(X−µ)} ≤ e^{vθ²/2}  for 0 ≤ θ ≤ ρ,

with v and ρ positive constants and µ = PX, then for x ≥ 0,

<32>  P{X ≥ µ + x} ≤ min_{0≤θ≤ρ} exp( −xθ + ½vθ² )
      = exp( −½x²/v )  for 0 ≤ x ≤ ρv,
      = exp( −ρx + ½vρ² ) < e^{−xρ/2}  for x > ρv.

The minimum is achieved by θ = min(ρ, x/v).
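A small sketch (an illustration only) confirming the closed form in <32> against a grid minimization over 0 ≤ θ ≤ ρ:

```python
import math

def closed_form(x, v, rho):
    # the right side of <32>
    if x <= rho * v:
        return math.exp(-0.5 * x * x / v)
    return math.exp(-rho * x + 0.5 * v * rho * rho)

v, rho = 2.0, 1.5
thetas = [rho * i / 10_000 for i in range(10_001)]
for x in [1.0, 3.0, 6.0]:
    numeric = min(math.exp(-x * t + 0.5 * v * t * t) for t in thetas)
    print(x, numeric, closed_form(x, v, rho))
```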

The proof in the next Section of a generalized version of the Hanson-Wright inequality, showing that a quadratic Σ_{i,j} X_i A_{i,j} X_j in independent subgaussians has a subgaussian/subexponential tail, illustrates this approach.

2.8 The Hanson-Wright inequality for subgaussians

The following is based on an elegant argument by Rudelson and Vershynin (2013), hereafter R&V. To shorten the proof I assume that the matrix A has zeros down its diagonal, so that P Σ_{i,j} X_i A_{i,j} X_j = 0 and I can skip the part of the argument (see Problem [17]) that deals with Σ_i A_{i,i}(X_i² − PX_i²).

<33> Theorem. Let X1, ..., Xn be independent, centered subgaussian random variables with τ ≥ max_i τ(X_i). Let A be an n × n matrix of real numbers with A_{ii} = 0 for each i. Then

P{X′AX ≥ t} ≤ exp( −min( t²/(64τ⁴‖A‖²HS), t/(8√2 τ²‖A‖₂) ) )  for t ≥ 0,

where ‖A‖HS := √trace(A′A) = √(Σ_{i,j} A_{i,j}²) and ‖A‖₂ := sup_{‖u‖₂≤1} ‖Au‖₂.


Remark. There are no assumptions of symmetry or positive definiteness on A. The proof also works with A replaced by −A, leading to a similar upper bound for P{|X′AX| ≥ t}. The Hilbert-Schmidt norm ‖A‖HS is also called the Frobenius norm. The constants 64 and 8√2 are not important.

Proof. Without loss of generality, suppose τ = 1, so that Pe^{θX_i} ≤ e^{θ²/2} for all i and all real θ, and P exp(α′X) ≤ exp(‖α‖₂²/2) for each α ∈ Rⁿ.

As explained at the end of Section 2.7, it suffices to show

Pe^{θX′AX} ≤ exp(½vθ²)  for 0 ≤ θ ≤ ρ,

where v = 16‖A‖²HS and ρ = (√32 ‖A‖₂)^{−1}.

To avoid conditioning arguments, assume P = ⊗_{i∈[n]} P_i on B(R^[n]), where [n] := {1, 2, ..., n}, and each X_i is a coordinate map, that is, X_i(x) = x_i and X(x) = x. Write G for the N(0, I_n) distribution, another probability measure on B(R^[n]). Of course G = ⊗_{i∈[n]} G_i with each G_i equal to N(0, 1). For each subset J of [n] write P_J for ⊗_{i∈J} P_i and G_J for ⊗_{i∈J} G_i and, for generic vectors w = (w_i : i ∈ [n]) in R^[n], write w_J for (w_i : i ∈ J) ∈ R^J.

The proof works by bounding PeθX′AX by the G expected value of ananalogous quadratic, using a decoupling argument that R&V attributed toBourgain (1998).

For each δ = (δ_i : i ∈ [n]) ∈ {0, 1}^[n] define A_δ to be the n × n matrix with (i, j)th element δ_i(1 − δ_j)A_{i,j}. Write D_δ for {i ∈ [n] : δ_i = 1} and N_δ for [n]\D_δ. Then X′A_δX = X′_{D_δ} E_δ X_{N_δ} where E_δ := A[D_δ, N_δ]. If the rows and columns of A were permuted so that i < i′ for all i ∈ D_δ and i′ ∈ N_δ then A_δ would look like

A_δ = ( 0  E_δ
        0  0  )   where E_δ := A[D_δ, N_δ].

Now let Q denote the uniform distribution on {0, 1}^[n]. Under Q the δ_i's are independent Ber(1/2) random variables, so that QA_δ = ¼A. For each θ, Jensen's inequality and Fubini give

P exp(θX′AX) = P^x exp(4θ Q^δ X′A_δX) ≤ Q^δ P^x exp(4θ X′A_δX).

Write Q_δ for X′A_δX. It therefore suffices to show that

P exp(4θQ_δ) ≤ exp(½vθ²)  for each δ and all 0 ≤ θ ≤ ρ.

From this point onwards the δ is held fixed. To simplify notation I drop most of the subscript δ's, writing X_D instead of X_{D_δ}, and so on. Of course A_δ cannot drop its subscript.


Under P_D, with X_N held fixed, the quadratic Q_δ := X′A_δX = X′_D E X_N is a linear combination of the independent subgaussians X_D. Thus

P_D e^{4θQ_δ} ≤ exp( ½(4θ)² ‖EX_N‖₂² ) = G_D exp(4θQ_δ).

Similarly, P_N exp(4θQ_δ) ≤ G_N exp(4θQ_δ). Thus

P exp(4θQ_δ) = P_D P_N exp(4θQ_δ)
             ≤ P_D G_N exp(4θQ_δ) = G_N P_D exp(4θQ_δ)
             ≤ G_N G_D exp(4θQ_δ) = G_N e^{8θ² X′_N E′E X_N}.

The problem is now reduced to its gaussian analog, for which rotational symmetry of G_N allows further simplification by means of a diagonalization of the positive semidefinite matrix E′E = L′ΛL, where L is orthogonal and Λ = diag(λ1, ..., λn), with λ1 ≥ λ2 ≥ ... ≥ λn ≥ 0 the eigenvalues of E′E.

<34>  G_N e^{8θ² X′_N E′E X_N} = G_N exp( 8θ² X′_N Λ X_N ) = Π_{i∈N} G_i exp(8θ²λ_i x_i²) ≤ exp( Σ_{i∈N} 16θ²λ_i ),  provided 8θ² max_{i∈N} λ_i ≤ ¼.

The last inequality uses the fact that, if Z ∼ N(0, 1),

Pe^{sZ²} = (1 − 2s)^{−1/2} = exp( −½ log(1 − 2s) ) ≤ exp(2s)  if 0 ≤ s ≤ ¼.

Now we need some matrix facts to bring everything back to an expression involving A. For notational simplicity again assume that the rows D all precede the rows N, so that

A = ( B1  E           A′A = ( B1′B1 + B2′B2   B1′E + B2′B3
      B2  B3 )  and           E′B1 + B3′B2    E′E + B3′B3 ).

It follows that

‖A‖²HS = trace(A′A) ≥ trace(E′E) = Σ_{i∈N} λ_i

and

‖A‖₂ = sup_{‖v‖₂≤1} ‖Av‖₂ ≥ sup_{‖u‖₂≤1} ‖Eu‖₂ = √λ1.

Combine the various inequalities to conclude that

P exp(θX′AX) ≤ exp( 16θ² ‖A‖²HS )  provided 0 ≤ θ ≤ (√32 ‖A‖₂)^{−1}.

An appeal to inequality <32> completes the proof.
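To see the theorem in action, here is a Monte Carlo sketch (an illustration only; it assumes numpy is available) with Rademacher coordinates, for which τ = 1 by Example <12>, and a fixed matrix A with zero diagonal:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
A = rng.standard_normal((n, n))
np.fill_diagonal(A, 0.0)                        # A_ii = 0, as in Theorem <33>
hs = np.linalg.norm(A, "fro")                   # Hilbert-Schmidt (Frobenius) norm
op = np.linalg.norm(A, 2)                       # operator norm ||A||_2

X = rng.choice([-1.0, 1.0], size=(200_000, n))  # Rademacher rows, tau <= 1
quad = np.einsum("ki,ij,kj->k", X, A, X)        # X'AX for each simulated X
for t in [20.0, 50.0, 100.0]:
    emp = float(np.mean(quad >= t))
    bound = float(np.exp(-min(t * t / (64 * hs**2), t / (8 * np.sqrt(2) * op))))
    print(t, emp, bound)
```

The empirical tail sits well below the bound, as expected from the generous constants.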


2.9 Problems

[1] Define a = sup{θ < 0 : M(θ) = +∞} and b = inf{θ > 0 : M(θ) = +∞}, where M(θ) = Pe^{θX}. For simplicity, suppose both a and b are finite and a < b.

(i) Show that the restriction of M to the interval [a, b] is continuous. (For example, if M(b) = +∞ you should show that M(θ) → +∞ as θ increases to b.)

(ii) Suppose θ ∈ (a, b). Choose δ > 0 so that [θ − 2δ, θ + 2δ] ⊂ (a, b). For |h| < δ, establish the domination bound

|e^{(θ+h)X} − e^{θX}|/|h| ≤ | ∫_0^1 X e^{(θ+sh)X} ds | ≤ max( e^{(θ+2δ)X}, e^{θX}, e^{(θ−2δ)X} )/δ.

Deduce via a Dominated Convergence argument that M′(θ) = P(Xe^{θX}).

(iii) Argue similarly to justify another differentiation inside the expectation, leading to the expression for M″(θ).

[2] Suppose h(x) = e^{g(x)}, where g is a twice differentiable real-valued function on R⁺.

(i) Show that h is convex on any interval J for which (g′(x))² + g″(x) ≥ 0 for x ∈ int(J).

[3] For a fixed α ≥ 1 define Ψα(x) = e^{x^α} − 1 for x ≥ 0.

(i) Show that Ψα is a Young function.

(ii) Show that

Ψα(x)Ψα(y) ≤ exp(x^α + y^α) − 1 ≤ Ψα(x + y)  for all x, y ≥ 0,

with exp(x^α + y^α) − 1 ≤ Ψα(2^{1/α}xy) for x ∧ y ≥ 1.

(iii) Deduce from the inequality Ψα(x)Ψα(y) ≤ Ψα(x + y) that

Ψα^{−1}(uv) ≤ Ψα^{−1}(u) + Ψα^{−1}(v)  for all u, v ≥ 0.

[4] Suppose f is a convex, nonnegative, increasing function on an interval [a, ∞), with a > 0. Suppose also that there exists an x0 ∈ [a, ∞) for which f(x0)/x0 = τ := inf_{x≥a} f(x)/x. Show that

h(x) := τx{0 ≤ x < x0} + f(x){x ≥ x0}

is a Young function.

[5] Suppose f : R⁺ → R⁺ is strictly increasing with f(0) = 0 and

f(u1u2) ≤ c0( f(u1) + f(u2) )  for min(u1, u2) ≥ v0,

for some constants c0 and v0 > 0. For each v1 in (0, v0) prove the existence of a larger constant c1 for which

f(u1u2) ≤ c1( f(u1) + f(u2) )  for min(u1, u2) ≥ v1.

Hint: For u*_i = max(u_i, v0) show that f(u*_i) ≤ f(u_i)f(v0)/f(v1).

[6] For a fixed α ∈ (0, 1) define fα(x) = e^{x^α} for x ≥ 0.

[Figure: graph of fα(x) = exp(x^α) for α = 1/2, showing the concave region [0, zα] and the point xα.]

(i) Show that x ↦ fα(x) is convex for x ≥ zα := (α^{−1} − 1)^{1/α} and concave for 0 ≤ x ≤ zα.

(ii) Define xα = (1/α)^{1/α} and τα = fα(xα)/xα = (αe)^{1/α}. Show that

Ψα(x) := ταx{x < xα} + fα(x){x ≥ xα}

is a Young function.

(iii) Show that there exists a constant Kα for which

Ψα(x) ≤ exp(x^α) ≤ Kα + Ψα(x)  for all x ≥ 0.

(iv) Show that Ψα(x)Ψα(y) ≤ Ψα(2^{1/α}xy) for x ∧ y ≥ xα.

(v) Deduce that there exists a constant Cα for which

Cα( Ψα^{−1}(u) + Ψα^{−1}(v) ) ≥ Ψα^{−1}(uv)  when u ∧ v ≥ 1.

[7] Suppose Ψ is a Young function with right-hand derivative τ > 0 at 0. Define Φ(x) = Ψ(x) − τx for x ≥ 0. If Φ is not identically zero then it is a Young function. For that case show that

‖f‖Ψ ≥ ‖f‖Φ ≥ ‖f‖Ψ/K  where K = 1 + τΦ^{−1}(1),

for each measurable real function f.


[8] For each k in N prove that e^k ≥ k^k/k! ≥ 1. Compare with the sharper estimate via Stirling's formula (Feller, 1968, Section 2.9),

k^k e^{−k}/k! = e^{−r_k}/√(2πk)  where 1/(12k) > r_k > 1/(12k + 1).

[9] Suppose 1 = n0 ≤ n1 < n2 < ... is an increasing sequence of positive integers with sup_{k∈N} n_k/n_{k−1} = B < ∞, and f : [1, ∞) → R⁺ is an increasing function with sup_{t≥1} f(Bt)/f(t) = A < ∞. For each random variable X show that sup_{r≥1} ‖X‖_r/f(r) ≤ A sup_{k∈N} ‖X‖_{n_k}/f(n_k).

[10] With Ψα as in Problems [3] and [6], prove the existence of positive constants cα and Cα for which

cα ‖X‖_{Ψα} ≤ sup_{r≥1} ‖X‖_r/r^{1/α} ≤ Cα ‖X‖_{Ψα}

for every random variable X.

[11] Theorem <16> suggests that ‖X‖₂ might be comparable to β(X) if X has a subgaussian distribution. It is easy to deduce from inequality <10> that ‖X‖₂ ≤ 2β(X). Show that there is no companion inequality in the other direction by considering the bounded, symmetric random variable X for which P{X = ±M} = δ and P{X = 0} = 1 − 2δ. If 2δ = M^{−2} show that PX² = 1 but

log( 1 + θ²/2! + θ⁴M²/4! ) ≤ log Pe^{θX} ≤ τ²θ²/2  for all real θ

would force τ² ≥ 2 log( 3/2 + M²/24 ).

[12] Show that a random variable X that has either of the following two properties is subgaussian.

(i) For some positive constants c1, c2, and c3,

P{|X| ≥ x} ≤ c1 exp(−c2x²)  for all x ≥ c3.

(ii) For some positive constants c1, c2, and c3,

Pe^{θX} ≤ c1 exp(c2θ²)  for all |θ| ≥ c3.

[13] Suppose {Xn} is a sequence of random variables for which

P exp(θXn) ≤ exp(τ²θ²/2)  for all n and all real θ,

where τ is a fixed positive constant. If Xn → X almost surely, show that Pe^{θX} ≤ exp(τ²θ²/2) for all real θ.

[14] Suppose f is a sufficiently smooth real-valued function defined at least in a neighborhood of the origin of the real line. Define

G(x) = ( f(x) − f(0) − xf′(0) )/(x²/2)  if x ≠ 0, and G(0) = f″(0).

Prove the following facts.

(i) f(x) − f(0) − xf′(0) = x² ∫∫ f″(xs){0 < s < t < 1} ds dt.

(ii) If f is convex then G is nonnegative.

(iii) If f″ is an increasing function then so is G.

(iv) If f″ is a convex function then so is G. Moreover G(x) ≥ f″(x/3). Hint: Jensen and 2∫∫ s{0 < s < t < 1} ds dt = 1/3.

(v) For the special case where f(x) = (1 + x)log(1 + x) − x show that G = ψBenn from <18> and G(x) ≥ (1 + x/3)^{−1}.

[15] Recall the notation F(x) = e^x − 1 − x for x ∈ R and

ψBenn(y) = F*(y)/(y²/2)  where F*(y) = (1 + y)log(1 + y) − y for y ≥ −1.

Suppose X has a Poisson(λ) distribution.

(i) Show that L(θ) = log Pe^{θX} = λθ + λF(θ).

(ii) For x ≥ 0, deduce that

P{X ≥ λ + x} ≤ exp( −λF*(x/λ) ) = exp( −(x²/(2λ)) ψBenn(x/λ) ).

(iii) For 0 ≤ x ≤ λ show that

P{X ≤ λ − x} ≤ inf_{θ≥0} exp( λF(−θ) − θx ) = exp( −(x²/(2λ)) ψBenn(−x/λ) ) ≤ exp( −x²/(2λ) ).

[16] To what extent can the bounds from Problem [15] be recovered as limiting cases of the bounds derived via the Bennett inequality applied to the Bin(n, λ/n) distribution?

[17] Suppose ∞ > γ = ‖X‖_{Ψ2}. Use Theorem <16> to deduce that PX^{2k} ≤ (√(2k) γ)^{2k} ≤ (2eγ²)^k k!. Use Bernstein's inequality from Theorem <23> to deduce a tail bound analogous to Theorem <33> for Σ_i α_i(X_i² − PX_i²), with independent subgaussians X1, ..., Xn and constants α_i.


2.10 Notes

Anyone who has read Massart’s Saint-Flour lectures (Massart, 2003) or Lu-gosi’s lecture notes (Lugosi, 2003) will realize how much I have been influ-enced by those authors. Many of the ideas in those lectures reappear in thebook by Boucheron, Lugosi, and Massart (2013).

Many authors seem to credit Chernoff (1952) with the moment generating function trick in <2>, even though the idea is obviously much older: compare with the treatment of the Bernstein inequality by Uspensky (1937, pages 204–206). Chernoff himself gave credit to Cramér for "extremely more powerful results" obtained under stronger assumptions.

The characterizations for subgaussians in Theorem <16> are essentiallydue to Kahane (1968, Exercise 6.10).

Hoeffding (1963) established many tail bounds for sums of independentrandom variables. He also commented that the results extend to martin-gales.

The Bennett inequality for independent summands was proved by Ben-nett (1962) by a more complicated method.

Section 2.6 on the Bernstein inequality is based on the exposition byUspensky (1937, pages 204–206), with refinements from Bennett (1962) andMassart (2003, Section 1.2.3). See also Boucheron et al. (2013, Sections 2.4,2.8) for a slightly more detailed exposition, including the idea of using thepositive part to get a one-sided bound, which they credited to EmmanuelRio. Van der Vaart and Wellner (1996, page 103) noted the connectionbetween the moment assumption <24> and the behavior of PΦ1(θ|Xi|).

Dudley (1999, Appendix H) used the name Young-Orlicz modulus for what I am calling a Young function.

Every Young function can be written as Ψ(t) = ∫_0^t ψ(x) dx, where ψ is the right-hand derivative of Ψ. The function ψ is increasing and right-continuous. For their N-functions, Krasnosel'skii and Rutickii (1961, Section 1.3) added the requirements that ψ(0) = 0 < ψ(x) for x > 0 and ψ(x) → ∞ as x → ∞, which ensures that Ψ is strictly increasing with Ψ(t)/t → ∞ as t → ∞. Dudley called such a function an Orlicz modulus.

When ψ(0) = 0 there is a measure µ on B(R⁺) for which ψ(b) = µ(0, b] for all intervals (0, b]. In that case PΨ(X) = µ^s P(X − s)⁺, a representation closely related to Lemma 1 of Panchenko (2003).


References

Bennett, G. (1962). Probability inequalities for the sum of independent random variables. Journal of the American Statistical Association 57, 33–45.

Boucheron, S., G. Lugosi, and P. Massart (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.

Bourgain, J. (1998). Random points in isotropic convex sets. MSRI Publications: Convex Geometric Analysis 34, 53–58.

Boyd, S. and L. Vandenberghe (2004). Convex Optimization. Cambridge University Press.

Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on a sum of observations. Annals of Mathematical Statistics 23(4), 493–507.

Dudley, R. M. (1999). Uniform Central Limit Theorems. Cambridge University Press.

Feller, W. (1968). An Introduction to Probability Theory and Its Applications (third ed.), Volume 1. New York: Wiley.

Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58, 13–30.

Kahane, J.-P. (1968). Some Random Series of Functions. Heath. Second edition: Cambridge University Press, 1985.

Krasnosel'skii, M. A. and Y. B. Rutickii (1961). Convex Functions and Orlicz Spaces. Noordhoff. Translated from the first Russian edition by Leo F. Boron.

Lugosi, G. (2003). Concentration-of-measure inequalities. Notes from the Summer School on Machine Learning, Australian National University. Available at http://www.econ.upf.es/~lugosi/.

Massart, P. (2003). Concentration Inequalities and Model Selection, Volume 1896 of Lecture Notes in Mathematics. Springer-Verlag. Lectures given at the 33rd Probability Summer School in Saint-Flour.

Panchenko, D. (2003). Symmetrization approach to concentration inequalities for empirical processes. Annals of Probability 31, 2068–2081.

Rudelson, M. and R. Vershynin (2013). Hanson-Wright inequality and sub-gaussian concentration. Electronic Communications in Probability 18(82), 1–9.

Uspensky, J. V. (1937). Introduction to Mathematical Probability. McGraw-Hill.

van der Vaart, A. W. and J. A. Wellner (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer-Verlag.

Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027.