A Short Introduction to Stein's Method

Gesine Reinert
Department of Statistics, University of Oxford

Michaelmas Term 2011
Outline
Lecture 1: Normal approximation
  Motivation
  Distributional distances
  The Stein equation
  Local dependence
  Couplings

Lecture 2: Poisson approximation and other distributions
  Poisson approximation
  Poisson approximation: the local approach
  Poisson approximation: size bias couplings
  A general framework
  Chisquare distributions
  One-dimensional Gibbs distributions
  Final Remarks
Lecture 1: Normal approximation
Recall two famous distributional approximations. Example: let X_1, X_2, . . . , X_n be i.i.d. with P(X_i = 1) = p = 1 − P(X_i = 0). Then

n^{−1/2} ∑_{i=1}^n (X_i − p) ≈_d N(0, p(1 − p)),

but also

∑_{i=1}^n X_i ≈_d Poisson(np).

Which approximation should we choose? We would like to assess the distance between distributions via explicit bounds.
Weak convergence
Often distributional distances are phrased in terms of test functions. This approach is based on weak convergence. For c.d.f.s F_n, n ≥ 0, and F on the line we say that F_n converges weakly (converges in distribution) to F,

F_n →_D F,

if

F_n(x) → F(x)  (n → ∞)

for all continuity points x of F. For the associated probability distributions we write P_n →_D P.
Facts of weak convergence
P_n →_D P is equivalent to each of the following:

1. P_n(A) → P(A) for each P-continuity set A (i.e. P(∂A) = 0);
2. ∫ f dP_n → ∫ f dP for all bounded, continuous, real-valued functions f;
3. ∫ f dP_n → ∫ f dP for all bounded, infinitely often differentiable, real-valued functions f.

For a random variable X with distribution L(X),

L(X_n) →_D L(X)  ⟺  E f(X_n) → E f(X)

for all bounded, infinitely often differentiable, real-valued functions f.
Total variation distance
Let L(X) = P, L(Y) = Q; define the total variation distance

d_TV(P, Q) = sup_{A measurable} |P(A) − Q(A)|.

If the underlying space is discrete then convergence in total variation is equivalent to convergence in distribution. For probability measures on Z_+ we can write

d_TV(P, Q) = (1/2) ∑_{j≥0} |P{j} − Q{j}|.

If the underlying space is continuous then the total variation distance is very strong. For example, when approximating a discrete integer-valued random variable by a normal random variable we could choose as set A the set of integers; then the total variation distance would be 1.
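As a quick numerical illustration (our addition, not from the slides; numpy/scipy assumed, parameter choices arbitrary), the Z_+ formula lets us compute the distance in the motivating example exactly:

import numpy as np
from scipy import stats

n, p = 50, 0.05
ks = np.arange(n + 1)                      # support of the Binomial

binom_pmf = stats.binom.pmf(ks, n, p)
pois_pmf = stats.poisson.pmf(ks, n * p)

# d_TV = (1/2) sum_{j >= 0} |P{j} - Q{j}|; the Poisson mass beyond n
# enters as |0 - Q{j}| since the Binomial puts no mass there.
d_tv = 0.5 * (np.abs(binom_pmf - pois_pmf).sum() + stats.poisson.sf(n, n * p))
print(f"d_TV(Bin({n},{p}), Po({n * p})) = {d_tv:.5f}")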
Wasserstein distance
Consider the set of Lipschitz-continuous functions with Lipschitz constant 1,

L = {g : R → R; |g(y) − g(x)| ≤ |y − x|},

and the Wasserstein distance

d_W(P, Q) = sup_{g∈L} |E g(Y) − E g(X)| = inf E|Y − X|,

where the infimum is over all couplings (X, Y) such that L(X) = P, L(Y) = Q.
Using

F = {f ∈ L absolutely continuous, f(0) = f′(0) = 0}

we also have

d_W(P, Q) = sup_{f∈F} |E f′(Y) − E f′(X)|.
Stein’s Method for Normal Approximation
Stein (1972, 1986): Z ∼ N(µ, σ²) if and only if for all smooth functions f,

E(Z − µ) f(Z) = σ² E f′(Z).

For a random variable W with EW = µ, Var W = σ²: if

σ² E f′(W) − E(W − µ) f(W)

is close to zero for many functions f, then W should be close to Z in distribution.
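A minimal Monte Carlo sketch of this characterisation (our addition; f = sin is an arbitrary smooth test function): for N(0, 1) samples the two sides of E Z f(Z) = E f′(Z) agree up to simulation error, while for a skewed variable with the same mean and variance they do not.

import numpy as np

rng = np.random.default_rng(1)
N = 1_000_000

z = rng.standard_normal(N)
# For Z ~ N(0,1): E Z sin(Z) = E cos(Z) = exp(-1/2) ~ 0.6065.
print(np.mean(z * np.sin(z)), np.mean(np.cos(z)))

w = rng.exponential(size=N) - 1.0        # mean 0, variance 1, but skewed
print(np.mean(w * np.sin(w)), np.mean(np.cos(w)))   # the two sides differ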
Sketch of proof for µ = 0, σ2 = 1:
Assume Z ∼ N(0, 1). Integration by parts:

(1/√(2π)) ∫ f′(x) e^{−x²/2} dx
  = [ (1/√(2π)) f(x) e^{−x²/2} ]_{−∞}^{∞} + (1/√(2π)) ∫ x f(x) e^{−x²/2} dx
  = (1/√(2π)) ∫ x f(x) e^{−x²/2} dx,

and so E f′(Z) = E Z f(Z).
Conversely, assume E Z f(Z) = E f′(Z). We solve the differential equation

f′(x) − x f(x) = g(x),   lim_{x→−∞} f(x) e^{−x²/2} = 0,

for any bounded function g, giving

f(y) = e^{y²/2} ∫_{−∞}^{y} g(x) e^{−x²/2} dx.

Take g(x) = 1(x ≤ x_0) − Φ(x_0); then

0 = E( f′(Z) − Z f(Z) ) = P(Z ≤ x_0) − Φ(x_0),

so Z ∼ N(0, 1).
The Stein equation
Let µ = 0. Given a test function h, let Nh = E h(Z/σ), and solve for f in the Stein equation

σ² f′(w) − w f(w) = h(w/σ) − Nh,

giving

f(y) = e^{y²/2} ∫_{−∞}^{y} ( h(x/σ) − Nh ) e^{−x²/2} dx.

Now evaluate the expectation of the r.h.s. via the expectation of the l.h.s. We can bound the solution f and its derivatives in terms of the test function h and its derivatives, e.g. ‖f′′‖ ≤ 2‖h′‖.
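For σ = 1 this solution can be checked numerically (our sketch; h = tanh is an arbitrary smooth choice): evaluating f by quadrature, the finite-difference derivative f′(w) − w f(w) should reproduce h(w) − Nh.

import numpy as np
from scipy import integrate, stats

h = np.tanh
Nh = integrate.quad(lambda x: h(x) * stats.norm.pdf(x), -np.inf, np.inf)[0]

def f(y):
    # f(y) = e^{y^2/2} * integral_{-inf}^{y} (h(x) - Nh) e^{-x^2/2} dx
    val, _ = integrate.quad(lambda x: (h(x) - Nh) * np.exp(-x**2 / 2),
                            -np.inf, y)
    return np.exp(y**2 / 2) * val

for w in (-1.0, 0.3, 2.0):
    eps = 1e-5
    f_prime = (f(w + eps) - f(w - eps)) / (2 * eps)
    print(f_prime - w * f(w), "vs", h(w) - Nh)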
Example: the sum of i.i.d. random variables
Let X, X_1, . . . , X_n be i.i.d. with EX = 0, Var X = 1/n; put W = ∑_{i=1}^n X_i and W_i = W − X_i = ∑_{j≠i} X_j. Then

E W f(W) = ∑_{i=1}^n E X_i f(W)
  = ∑_{i=1}^n E X_i f(W_i) + ∑_{i=1}^n E X_i² f′(W_i) + R
  = (1/n) ∑_{i=1}^n E f′(W_i) + R,

where the first sum vanishes because X_i is independent of W_i and E X_i = 0, and E X_i² f′(W_i) = (1/n) E f′(W_i) for the same reason.
So

E f′(W) − E W f(W) = (1/n) ∑_{i=1}^n E{ f′(W) − f′(W_i) } + R,

and we can bound the remainder term R.
Theorem: explicit bound on distance
For any smooth h,

|E h(W) − Nh| ≤ ‖h′‖ ( 2/√n + ∑_{i=1}^n E|X_i|³ ).

Note: this bound is true for any n; nothing goes to infinity.

This theorem extends to local dependence.
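A sanity check of the theorem in code (our addition): take standardised Bernoulli summands X_i = (B_i − p)/√(np(1 − p)), so that Var X_i = 1/n, and h = tanh with ‖h′‖ = 1; the simulated |Eh(W) − Nh| should sit below the bound.

import numpy as np
from scipy import integrate, stats

rng = np.random.default_rng(5)
n, p = 200, 0.3
h = np.tanh                                 # ||h'|| = 1
Nh = integrate.quad(lambda x: h(x) * stats.norm.pdf(x), -np.inf, np.inf)[0]

scale = np.sqrt(n * p * (1 - p))            # so that Var X_i = 1/n
E_abs_X3 = (p * (1 - p)**3 + (1 - p) * p**3) / scale**3   # E|X_i|^3

W = (rng.binomial(n, p, size=1_000_000) - n * p) / scale
lhs = abs(np.mean(h(W)) - Nh)
rhs = 2 / np.sqrt(n) + n * E_abs_X3
print(f"{lhs:.5f} <= {rhs:.5f}")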
Local dependence
Let X_1, . . . , X_n have mean zero and finite variances; put W = ∑_{i=1}^n X_i and assume Var W = 1. Suppose that for each i = 1, . . . , n there exist sets A_i ⊂ B_i ⊂ {1, . . . , n} such that X_i is independent of ∑_{j∉A_i} X_j, and ∑_{j∈A_i} X_j is independent of ∑_{j∉B_i} X_j. Define

η_i = ∑_{j∈A_i} X_j  and  τ_i = ∑_{j∈B_i} X_j.

Theorem
For any smooth h with ‖h′‖ ≤ 1,

|E h(W) − Nh| ≤ 2 ∑_{i=1}^n ( E|X_i η_i τ_i| + |E(X_i η_i)| E|τ_i| ) + ∑_{i=1}^n E|X_i η_i²|.
Example: word counts in DNA sequences
Let B = B_1 B_2 . . . B_n be a string of letters which are i.i.d. from the alphabet A = {A, C, G, T}. Let w = w_1 w_2 . . . w_m be a word of length m composed of letters from A. Let V = V(w) be the number of times that w occurs in the sequence B. Words occur in clumps; for example consider

B = T G A A C A A A C A A C A A A G A A C A A A A
w = A A C A A

Here V(w) = 4; there are two clumps of occurrences.
For j = 1, . . . , n − m + 1 define the word indicators

I_j = 1 if B_j B_{j+1} . . . B_{j+m−1} = w, and I_j = 0 otherwise.

Let p(x) = P(B_1 = x) for x = A, C, G, T; then

E I_j = p(w) := ∏_{i=1}^m p(w_i)

and

E V = (n − m + 1) p(w).
The variance
Define, for ℓ = 1, . . . , m,

p_w(ℓ) = p(w_{m−ℓ+1}) p(w_{m−ℓ+2}) · · · p(w_m),

which is the probability that w occurs, given that w_1 w_2 · · · w_{m−ℓ} has occurred. Define the self-overlap indicator

ξ(j) = 1( w_{j+1} = w_1, . . . , w_m = w_{m−j} ).

This overlap indicator equals 1 if the last m − j letters of w overlap exactly the first m − j letters of w. Then the variance Var(V) = σ² is given by

(n − m + 1) p(w)(1 − p(w)) + 2 p(w) ∑_{j=1}^{m−1} (n − m − j + 1) ξ(j) p_w(j) − 2(m − 1)(n − m + 1) p(w)².

The variance is of order n when m = o(n).
The neighbourhoods of dependence
Put

X_j = (I_j − p(w))/σ

and

A_i = { j ∈ {1, . . . , n} : |j − i| ≤ m − 1 }

as well as

B_i = { j ∈ {1, . . . , n} : |j − i| ≤ 2(m − 1) }.

Then X_i is independent of ∑_{j∉A_i} X_j, and ∑_{j∈A_i} X_j is independent of ∑_{j∉B_i} X_j. Moreover

η_i = ∑_{j:|j−i|≤m−1} X_j  and  τ_i = ∑_{j:|j−i|≤2(m−1)} X_j.
Consider the statement of the theorem,

|E h(W) − Nh| ≤ 2 ∑_{i=1}^n ( E|X_i η_i τ_i| + |E(X_i η_i)| E|τ_i| ) + ∑_{i=1}^n E|X_i η_i²|.

We bound |X_i| ≤ 1/σ and |η_i| ≤ (2m − 1)/σ as well as |τ_i| ≤ (4m − 1)/σ, and hence the right-hand side is at most

(2m − 1)( 4(4m − 1) + 1 ) n / σ³.
Couplings
The local approach does not work well when there is (weak) global dependence, for example as in simple random sampling. Instead we make use of couplings: changing one random variable should not have much effect on the other random variables if the dependence is weak. Here we consider two such couplings: the size-bias coupling and the zero-bias coupling. There are more, such as the exchangeable pair coupling.
Size-Bias coupling: µ > 0
If W ≥ 0 and EW > 0, then W^s has the W-size biased distribution if

E W f(W) = EW E f(W^s)

for all f for which both sides exist.

Example: If X ∼ Bernoulli(p), then E X f(X) = p f(1), and so X^s = 1.

Example: If X ∼ Poisson(λ), then

P(X^s = k) = k e^{−λ} λ^k / (k! λ) = e^{−λ} λ^{k−1} / (k − 1)!,

and so X^s = X + 1, where the equality is in distribution.
Using

E W f(W) = EW E f(W^s)

with f(x) = x we get

E W^s = E W² / µ.
With f(x) = x² we get

E (W^s)² = E W³ / µ.
Construction
(Goldstein and Rinott 1997) Suppose W = ∑_{i=1}^n X_i with X_i ≥ 0 and E X_i > 0 for all i. Choose an index V with probability proportional to the means, P(V = v) ∝ E X_v. If V = v: replace X_v by X_v^s having the X_v-size biased distribution, independently, and if X_v^s = x: adjust X_u, u ≠ v, such that

L(X_u, u ≠ v) = L(X_u, u ≠ v | X_v = x).

Then W^s = ∑_{u≠V} X_u + X_V^s has the W-size bias distribution.

Example: sum of independent Bernoullis, X_i ∼ Be(p_i) for i = 1, . . . , n. Then W^s = ∑_{u≠V} X_u + 1; see the simulation sketch below.
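A simulation sketch of this construction for independent Bernoullis (our addition; f(w) = w² and the p_i are arbitrary choices): the identity E W f(W) = EW · E f(W^s) should hold up to Monte Carlo error.

import numpy as np

rng = np.random.default_rng(4)
p = np.array([0.2, 0.5, 0.7, 0.1])       # arbitrary success probabilities
N = 500_000
f = lambda w: w**2                        # any test function will do

X = rng.binomial(1, p, size=(N, len(p)))
W = X.sum(axis=1)

# Pick V with P(V = i) proportional to p_i and set X_V := 1 (= X_V^s);
# by independence the remaining coordinates need no adjustment.
V = rng.choice(len(p), size=N, p=p / p.sum())
Ws = W - X[np.arange(N), V] + 1

print(np.mean(W * f(W)), p.sum() * np.mean(f(Ws)))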
How does the size-bias coupling help?
Let X, X_1, . . . , X_n be i.i.d. non-negative with EX = µ, Var X = σ², and W = ∑_{i=1}^n X_i. Then, heuristically,

E(W − µ) f(W) = µ E( f(W^s) − f(W) )
  ≈ µ E(W^s − W) f′(W)
  = µ (1/n) ∑_{i=1}^n E(X_i^s − X_i) f′(W)
  ≈ µ E f′(W) E(X^s − µ)
  = µ E f′(W) { (1/µ) E X² − µ }
  = σ² E f′(W).
As a theorem
(Goldstein and Rinott 1997) Let W ≥ 0 have mean µ > 0 and variance 1. Let W^s have the W-size bias distribution and assume that W^s is defined on the same probability space as W. Let h be a smooth function. Then

|E h(W − µ) − E h(Z)| ≤ (sup h − inf h) µ √( Var E(W^s − W | W) ) + (1/6) ‖h′‖ µ E[ (W^s − W)² ].
Example: sum of i.i.d.’s
Let X, X_1, . . . , X_n be i.i.d. non-negative with EX = µ/n, Var X = 1/n, and W = ∑_{i=1}^n X_i. Then

E(W^s − W | W) = (1/n) ∑_{i=1}^n E(X_i^s − X_i | W)
  = (1/n) ( ∑_{i=1}^n n E X²/µ − ∑_{i=1}^n X_i )
  = n E X²/µ − (1/n) W,

and so

Var E(W^s − W | W) = 1/n².
Moreover

E[ (W^s − W)² ] = (1/n) ∑_{i=1}^n E(X_i − X_i^s)²
  = E X² − 2 EX E X^s + E(X^s)²
  = E X² − 2 E X² + n E X³/µ
  ≤ n E X³/µ.

Thus the bound gives

|E h(W − µ) − E h(Z)| ≤ (1/2) ‖h′′‖ µ (1/n) + (1/6) ‖h^(3)‖ n E X³.

We think of X ≍ n^{−1/2}; then the overall bound is of order n^{−1/2}.
Example: Vertex degrees in Bernoulli random graphs
Let G(n, p) be a random graph on n vertices where each pair of vertices has probability p of making up an edge, independently. The degree D(v) of a vertex v is the number of its neighbours. Let W count the number of vertices with degree d. Then

W = ∑_{v=1}^n 1(D(v) = d).

The indicators are not independent: if we know that D(v) = 0 for v = 1, . . . , n − 1, then D(n) = 0 also.
Coupling
Take a random graph and pick a vertex v at random. Set its degree equal to d. If its degree was d already, nothing needs to be adjusted. If its degree was D(v) > d, then choose D(v) − d of its edges uniformly at random and delete them. If the degree was D(v) < d, then choose d − D(v) vertices among the remaining n − D(v) − 1 non-neighbours of v and create an edge between each of these and v.
Goldstein and Rinott (1997) calculate the bound; see also Lin and R. (2012) for a random graph model where p = p(u, v) may depend on the vertices.
For mean-zero random variables the size bias coupling is not easy to implement. In particular it would not be applicable to normally distributed random variables. This drawback led to the definition of the zero-bias coupling.
Zero bias coupling
Let X be a mean zero random variable with finite, nonzero variance σ². We say that X* has the X-zero biased distribution if for all differentiable f for which E X f(X) exists,

E X f(X) = σ² E f′(X*).

The zero bias distribution X* exists for all X that have mean zero and finite variance (Goldstein and R. 1997). It is easy to verify that W* has density

p*(w) = σ^{−2} E{ W 1(W > w) }.
Examples
Example: If X ∼ N(0, σ²), then X has the X-zero bias distribution; the normal is the only fixed point of the zero bias transformation.

Example: If X ∼ Bernoulli(p) − p, then

E{ X 1(X > x) } = p(1 − p) for −p < x < 1 − p

and is zero elsewhere, so X* ∼ Uniform(−p, 1 − p).
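A quick numerical check of the Bernoulli example (our addition; f(x) = x³ is an arbitrary differentiable test function): simulate X = Bernoulli(p) − p and X* ∼ Uniform(−p, 1 − p) and compare the two sides of the zero-bias identity.

import numpy as np

rng = np.random.default_rng(3)
p = 0.3
sigma2 = p * (1 - p)                 # Var(Bernoulli(p) - p)
N = 1_000_000

x = rng.binomial(1, p, N) - p        # X = Bernoulli(p) - p
x_star = rng.uniform(-p, 1 - p, N)   # claimed zero-bias distribution

# E X f(X) should equal sigma^2 E f'(X*) for f(x) = x^3, f'(x) = 3 x^2.
print(np.mean(x * x**3), sigma2 * np.mean(3 * x_star**2))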
Connection with Wasserstein distance
For W with mean zero and variance 1,

|E h(W) − Nh| = | E[ f′(W) − W f(W) ] |
  = | E[ f′(W) − f′(W*) ] |
  ≤ ‖f′′‖ E|W − W*|,

where ‖·‖ is the supremum norm. By (Stein 1986), for h absolutely continuous we have ‖f′′‖ ≤ 2‖h′‖ and hence

|E h(W) − Nh| ≤ 2‖h′‖ E|W − W*|;

thus

d_W(L(W), N(0, 1)) ≤ 2 E|W − W*|.
Construction: sum of i.i.d.s
Let W = ∑_{i=1}^n X_i, where the X_i's are independent mean zero variables with finite variances σ_i². Choose an index I with probability proportional to the variance, P(I = i) = σ_i²/σ², and zero-bias in that variable:

W* = W − X_I + X_I*.

Then, for any smooth f,

E W f(W) = ∑_{i=1}^n E X_i f(W) = ∑_{i=1}^n E X_i f( X_i + ∑_{t≠i} X_t )
  = ∑_{i=1}^n σ_i² E f′( X_i* + ∑_{t≠i} X_t ) = σ² ∑_{i=1}^n (σ_i²/σ²) E f′( W − X_i + X_i* )
  = σ² E f′( W − X_I + X_I* ) = σ² E f′(W*),

where we have used the independence of X_i and X_t, t ≠ i.
And here is the theorem
Let X_1, . . . , X_n be independent mean zero variables with variances σ_1², . . . , σ_n² and finite third moments, and let W = (X_1 + . . . + X_n)/σ, where σ² = σ_1² + . . . + σ_n². Then for all absolutely continuous test functions h,

|E h(W) − Nh| ≤ (2‖h′‖/σ³) ∑_{i=1}^n ( σ_i² E|X_i| + (1/2) E|X_i|³ ).

When the variables are i.i.d. with variance 1,

|E h(W) − Nh| ≤ 3‖h′‖ E|X_1|³ / √n.
The construction is (much) more involved under dependence. When third moments vanish and fourth moments exist: bounds of order n^{−1}. Also: Berry-Esseen bounds, combinatorial central limit theorem (Goldstein 2004).
Some remarks
◮ Left out:
  1. Exchangeable pair couplings, also used for variance reduction in simulations
  2. Multivariate versions, also coupling approaches
  3. Infinite-dimensional variables (Malliavin calculus), random measures
◮ Key feature: bounds in the presence of dependence
◮ In the i.i.d. case: the Berry-Esseen inequality is not quite recovered
Further reading
◮ Barbour, A.D. (1990). Stein's method for diffusion approximations. Probability Theory and Related Fields 84, 297-322.
◮ Barbour, A.D. and Chen, L.H.Y., eds (2005). An Introduction to Stein's Method. Lecture Notes Series 4, Inst. Math. Sciences, World Scientific Press, Singapore.
◮ Barbour, A.D. and Chen, L.H.Y., eds (2005). Stein's Method and Applications. Lecture Notes Series 5, Inst. Math. Sciences, World Scientific Press, Singapore.
◮ Chen, L.H.Y., Goldstein, L., and Shao, Q.-M. (2011). Normal Approximation by Stein's Method. Springer, Berlin and Heidelberg.
◮ Diaconis, P., and Holmes, S., eds (2004). Stein's Method: Expository Lectures and Applications. IMS Lecture Notes 46.
More further reading
◮ Reinert, G. (2011). Gaussian approximation of functionals: Malliavin calculus and Stein's method. In: Surveys in Stochastic Processes. J. Blath, P. Imkeller, and S. Roelly, eds. European Mathematical Society, Zurich, pp. 107-126.
◮ Reinert, G. (1998). Couplings for normal approximations with Stein's method. In: Microsurveys in Discrete Probability. D. Aldous, J. Propp, eds. DIMACS Series, AMS, 193-207.
◮ Reinert, G., Schbath, S., and Waterman, M.S. (2005). Probabilistic and statistical properties of finite words in finite sequences. In: Lothaire: Applied Combinatorics on Words. J. Berstel, D. Perrin, eds. Cambridge University Press, 268-352.
◮ Stein, C. (1972). A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. Proc. Sixth Berkeley Symp. Math. Statist. Probab. 2, 583-602. Univ. California Press, Berkeley.
◮ Stein, C. (1986). Approximate Computation of Expectations. IMS, Hayward, CA.
Poisson approximation and other distributions
We first look at Poisson approximation; then we will put the approximations in a general framework. We then apply the general framework to chisquare distributions as well as one-dimensional Gibbs measures.
Poisson approximation
The standard reference for this section is Barbour, Holst and Janson (1992). Recall the Poisson(λ) distribution,

p_k = e^{−λ} λ^k / k!,  k = 0, 1, . . . .

This is a discrete distribution, and for approximations on Z_+ we would usually take the total variation distance,

d_TV(P, Q) = sup_{A⊂Z_+} |P(A) − Q(A)|.
A Stein characterisation
Fact: Z ∼ Po(λ) ⟺ for all bounded functions g : Z_+ → R,

E{ λ g(Z + 1) − Z g(Z) } = 0.

This suggests using as test functions the indicators 1(j ∈ A) for sets A ⊂ Z_+, and as Stein equation

λ g(j + 1) − j g(j) = 1(j ∈ A) − Po{λ}(A).

For any random variable W we hence have

λ g(W + 1) − W g(W) = 1(W ∈ A) − Po{λ}(A),

and taking expectations,

d_TV(L(W), Po{λ}) ≤ sup_g | E{ λ g(W + 1) − W g(W) } |.

Here the sup is over all functions g which solve the Stein equation.
The solution of the Stein equation
λ g(j + 1) − j g(j) = 1(j ∈ A) − Po{λ}(A)

can be calculated iteratively. We only need the results that

sup_j |g(j)| ≤ min(1, λ^{−1/2})

and

Δg := sup_j |g(j + 1) − g(j)| ≤ min(1, λ^{−1}).

Here λ^{−1} is often referred to as the magic factor.
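The iteration can be summed in closed form, g(j) = ( ∑_{k<j} Po_λ{k} h(k) ) / ( λ Po_λ{j − 1} ) with h(k) = 1(k ∈ A) − Po_λ(A), which gives a quick numerical check of both bounds (our sketch; λ and A are arbitrary, and we keep j moderate since this forward form suffers cancellation for large j):

import numpy as np
from scipy import stats

lam = 4.0
A = [0, 2, 5, 7]                     # an arbitrary subset of Z_+
po = stats.poisson(lam)
PoA = po.pmf(A).sum()

def g(j):
    # Solves lam*g(j+1) - j*g(j) = 1(j in A) - Po_lam(A), with g(0) := 0.
    if j == 0:
        return 0.0
    ks = np.arange(j)
    h = np.isin(ks, A).astype(float) - PoA
    return (po.pmf(ks) * h).sum() / (lam * po.pmf(j - 1))

vals = np.array([g(j) for j in range(25)])
print(np.abs(vals).max(), "<=", min(1.0, lam ** -0.5))           # sup |g|
print(np.abs(np.diff(vals[1:])).max(), "<=", min(1.0, 1 / lam))  # sup |Delta g|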
Example: sum of independent Bernoulli variables
Let X_1, X_2, . . . be independent Bernoulli(p_i) and W = ∑_{i=1}^n X_i. With λ = ∑_{i=1}^n p_i,

E W g(W) = ∑ E X_i g(W) = ∑ E X_i g(W − X_i + 1) = ∑ p_i E g(W − X_i + 1),

and thus

E{ λ g(W + 1) − W g(W) } = ∑ p_i E( g(W + 1) − g(W − X_i + 1) )
  = ∑ p_i² E( g(W + 1) − g(W − X_i + 1) | X_i = 1 )

and

| E{ λ g(W + 1) − W g(W) } | ≤ Δg ∑ p_i² ≤ min(1, λ^{−1}) ∑ p_i².
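In code (our addition; the p_i are arbitrary): the exact law of W is a convolution of Bernoulli pmfs, so the bound can be compared with the true total variation distance.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
p = rng.uniform(0.01, 0.1, size=40)    # arbitrary success probabilities
lam = p.sum()

pmf = np.array([1.0])                  # exact pmf of W by convolution
for pi in p:
    pmf = np.convolve(pmf, [1 - pi, pi])

ks = np.arange(len(pmf))
d_tv = 0.5 * np.abs(pmf - stats.poisson.pmf(ks, lam)).sum()
d_tv += 0.5 * stats.poisson.sf(len(pmf) - 1, lam)   # Poisson tail beyond n
bound = min(1, 1 / lam) * (p**2).sum()
print(f"d_TV = {d_tv:.5f} <= {bound:.5f}")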
The local approach
Again the argument generalises to local dependence. Let X_1, . . . , X_n be Bernoulli random variables, X_i ∼ Be(p_i), and put W = ∑_{i=1}^n X_i. Let λ = ∑_{i=1}^n p_i. Suppose that for each i = 1, . . . , n there exists a set A_i ⊂ {1, . . . , n} such that X_i is independent of ∑_{j∉A_i} X_j. Define

η_i = ∑_{j∈A_i} X_j.

Theorem
With λ = EW,

d_TV(L(W), Po(λ)) ≤ ∑_{i=1}^n ( p_i E η_i + E(X_i η_i) ) min(1, λ^{−1}).
Example: word counts in DNA sequences
Let B = B_1 B_2 . . . B_n be a string of letters which are i.i.d. from the alphabet A = {A, C, G, T}. Let w = w_1 w_2 . . . w_m be a word of length m composed of letters from A. Let V = V(w) be the number of times that w occurs in the sequence B. For j = 1, . . . , n − m + 1 define the word indicators

X_j = 1 if B_j B_{j+1} . . . B_{j+m−1} = w, and X_j = 0 otherwise.

Let p(x) = P(B_1 = x) for x = A, C, G, T; then

E X_j = p(w) := ∏_{i=1}^m p(w_i).
The neighbourhood of dependence
Put A_i = { j ∈ {1, . . . , n} : |j − i| ≤ m − 1 }. Then X_i is independent of ∑_{j∉A_i} X_j. Moreover η_i = ∑_{j:|j−i|≤m−1} X_j.
Consider the statement of the theorem,

d_TV(L(W), Po(λ)) ≤ ∑_{i=1}^n ( p_i E η_i + E(X_i η_i) ) min(1, λ^{−1}).

Here p_i = p(w) for all i, and we think of λ = O(1), so that roughly p(w) = O(n^{−1}). Moreover E η_i ≤ (2m − 1) p(w), and E(X_i η_i) = ∑_{j:|j−i|≤m−1} E X_i X_j depends on the amount of self-overlap in w. If w does not have any self-overlap then E X_i X_j = 0 for 0 < |i − j| ≤ m − 1, and hence

d_TV(L(W), Po(λ)) ≤ n(2m − 1) p(w)² min(1, λ^{−1}).

If w has considerable self-overlap then E(X_i η_i) will not be small, and indeed a compound Poisson approximation would be advised.
Example: sequencing by hybridisation
(Arratia et al. 1996) Sequencing by hybridisation is a tool to determine a DNA sequence from the unordered list of all l-tuples in the sequence; typically l = 8, 10, 12. Assume that the multiset of all l-tuples is known. The sequence is then uniquely recoverable unless one of the following occurs in the sequence:

◮ a non-trivial rotation
◮ a non-trivial transposition using three way repeats
◮ a non-trivial transposition using two interleaved pairs of repeats.
Using a Poisson approximation for pairs and three way repeats we can bound the probability of the sequence being uniquely recoverable. For example, if all letters are equally likely, the probability of unique recoverability is at least .95 if

l = 8 and the sequence length is at most 85;
l = 10 and the sequence length is at most 469;
l = 12 and the sequence length is at most 2,288.
Recall: if W ≥ 0 and EW = λ, then W^s has the W-size biased distribution if

E W f(W) = λ E f(W^s)

for all f for which both sides exist.

Example: sum of independent Bernoullis, X_i ∼ Be(p_i) for i = 1, . . . , n; pick V ∝ p_i, then W^s = ∑_{u≠V} X_u + 1.

Theorem
For a sum of possibly dependent Bernoullis, with L(X_j, j ≠ i) = L(X_j, j ≠ i | X_i = 1),

d_TV(L(W), Po(λ)) = | ∑_{i=1}^n p_i ( E g(W + 1) − E g( ∑_{j≠i} X_j + 1 ) ) |
  ≤ Δg ∑_{i=1}^n p_i E| W − ∑_{j≠i} X_j |.
If the Bernoulli random variables are independent then

E| W − ∑_{j≠i} X_j | = E X_i = p_i

and we obtain the same bound as from the local approach.
A general framework
Start with a target distribution µ.

1. Find a characterization: an operator A such that X ∼ µ if and only if EAf(X) = 0 for all smooth functions f.
2. For each smooth function h find the solution f = f_h of the Stein equation

h(x) − ∫ h dµ = Af(x).

3. Then for any variable W,

E h(W) − ∫ h dµ = E Af(W).

We usually need to bound f, f′, or Δf. Here h is a smooth test function; for nonsmooth functions see techniques used by Shao, Chen, Rinott and Rotar, Götze.
The generator approach
Barbour 1989, 1990; Götze 1993. Choose A as the generator of a Markov process with stationary distribution µ, that is: let (X_t)_{t≥0} be a homogeneous Markov process and put T_t f(x) = E( f(X_t) | X(0) = x ). The generator is defined as

Af(x) = lim_{t↓0} (1/t) ( T_t f(x) − f(x) ).
Generator facts
(Ethier and Kurtz (1986), for example)

1. If µ is the stationary distribution, then X ∼ µ if and only if EAf(X) = 0 for all f for which Af is defined.
2. T_t h − h = A( ∫_0^t T_u h du ), and formally

∫ h dµ − h = A( ∫_0^∞ T_u h du )

if the r.h.s. exists.
Examples
1. Ah(x) = h′′(x) − x h′(x) is the generator of the Ornstein-Uhlenbeck process, with stationary distribution N(0, 1).

2. Ah(x) = λ( h(x + 1) − h(x) ) + x( h(x − 1) − h(x) ), or, with f(x) = h(x) − h(x − 1),

Af(x) = λ f(x + 1) − x f(x),

is the generator of an immigration-death process with immigration rate λ and unit per capita death rate; the stationary distribution is Poisson(λ).

Advantage: generalisations to multivariate settings, diffusions, measure spaces, . . .
Chisquare distributions
A generator for χ²_p is

Af(x) = x f′′(x) + (1/2)(p − x) f′(x).

Luk 1994: a Stein operator for Gamma(r, λ) is

Af(x) = x f′′(x) + (r − λx) f′(x),

and χ²_p = Gamma(p/2, 1/2). For χ²_p, A is the generator of a Markov process given by the solution of the stochastic differential equation

X_t = x + (1/2) ∫_0^t (p − X_s) ds + ∫_0^t √(2X_s) dB_s,

where B_s is standard Brownian motion.
The Stein equation is then

(χ²_p)  h(x) − χ²_p h = x f′′(x) + (1/2)(p − x) f′(x),

where χ²_p h is the expectation of h under the χ²_p-distribution.

Lemma
(Pickett 2002). Suppose h : R → R is absolutely bounded, |h(x)| ≤ c e^{ax} for some c > 0, a ∈ R, and the first k derivatives of h are bounded. Then the equation (χ²_p) has a solution f = f_h such that

‖f^(j)‖ ≤ ( √(2π)/√p ) ‖h^(j−1)‖,

with h^(0) = h.

(Improvement over Luk 1994 by a factor 1/√p.)
Example: squared sum (Pickett and R.)
Let X_i, i = 1, . . . , n, be i.i.d. with mean zero, variance one, and existing 8th moment. Put

S = (1/√n) ∑_{i=1}^n X_i  and  W = S².

Pickett and R. (2004): the bound on the distance to Chisquare(1) for smooth test functions is of order 1/n.

Also: Pearson's chisquare statistic. Robert Gaunt: variance-gamma distributions.
One-dimensional Gibbs distributions
(Eichelsbacher and R. (2008)) We consider discrete univariate Gibbs measures µ with supp(µ) = {0, . . . , N}, where N ∈ N_0 ∪ {∞}. Such a Gibbs measure can be written as

µ(k) = (1/Z) exp(V(k)) ω^k / k!,  k = 0, 1, . . . , N,

with Z = ∑_{k=0}^N exp(V(k)) ω^k / k!, where ω > 0 is fixed. We shall assume that the normalizing constant Z exists.
Conversely, for a given probability distribution (µ(k))_{k∈N_0} we can find a representation as a Gibbs measure by choosing

V(k) = log µ(k) + log k! + log Z − k log ω,  k = 0, 1, . . . , N,

with V(0) = log µ(0) + log Z. We have some freedom in the choice of ω and thus of V. Fix a representation.
A Markov process
To each Gibbs measure we associate a Markovian birth-death process with unit per-capita death rate d_k = k and birth rate

b_k = ω exp{ V(k + 1) − V(k) } = (k + 1) µ(k + 1) / µ(k)

for k, k + 1 ∈ supp(µ). It is easy to see that this birth-death process has invariant measure µ.
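A two-line check of the birth rate formula (our addition): computing (k + 1)µ(k + 1)/µ(k) from the pmfs recovers the constant rate λ for Poisson(λ), and the rate (n − k)p/(1 − p) for Binomial(n, p) that reappears in the Stein operators below.

import numpy as np
from scipy import stats

def birth_rates(pmf):
    # b_k = (k + 1) mu(k + 1) / mu(k); unit per-capita death rate d_k = k.
    k = np.arange(len(pmf) - 1)
    return (k + 1) * pmf[1:] / pmf[:-1]

lam, n, p = 3.0, 10, 0.4
print(birth_rates(stats.poisson.pmf(np.arange(12), lam)))    # constant lam
print(birth_rates(stats.binom.pmf(np.arange(n + 1), n, p)))  # (n-k) p/(1-p)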
The generator
Following the generator approach to Stein's method, we would therefore choose as generator

(Ah)(k) = ( h(k + 1) − h(k) ) ω exp{ V(k + 1) − V(k) } + k ( h(k − 1) − h(k) )

or, with the simplification f(k) = h(k) − h(k − 1),

(Af)(k) = f(k + 1) ω exp{ V(k + 1) − V(k) } − k f(k).
Examples
1. The Poisson distribution with parameter λ > 0: we use ω = λ, V(k) = −λ, Z = 1. The Stein operator is

(Af)(k) = λ f(k + 1) − k f(k),

as before.

2. The Binomial distribution with parameters n and 0 < p < 1: we use ω = p/(1 − p), V(k) = −log((n − k)!), and Z = ( n! (1 − p)^n )^{−1}. The Stein operator is

(Af)(k) = ( p(n − k)/(1 − p) ) f(k + 1) − k f(k).

3. The Hypergeometric distribution: the Stein operator is

(Af)(k) = (a − k)(n − k) f(k + 1) − (b − n + k) k f(k).
The solution of the Stein equation
For a given function h, a solution f of the Stein equation is given by f(0) = 0, f(k) = 0 for k ∉ supp(µ), and

f(j + 1) = ( j!/ω^{j+1} ) e^{−V(j+1)} ∑_{k=0}^{j} e^{V(k)} (ω^k/k!) ( h(k) − µ(h) )
  = −( j!/ω^{j+1} ) e^{−V(j+1)} ∑_{k=j+1}^{N} e^{V(k)} (ω^k/k!) ( h(k) − µ(h) ).

We can bound the solution and its first difference Δf.
Exploiting size-biasing
Recall that for W ≥ 0 with EW > 0 we say that W* has the W-size biased distribution if

E W g(W) = EW E g(W*)

for all g for which both sides exist. In particular we obtain that

E{ ω exp{V(X + 1) − V(X)} g(X + 1) − X g(X) }
  = E{ ω exp{V(X + 1) − V(X)} g(X + 1) } − EX E g(X*),

and, taking g ≡ 1, for X ∼ µ,

EX = ω E e^{V(X+1)−V(X)}.
A Gibbs measure characterisation
Let X ≥ 0 be such that 0 < E(X) < ∞, and let µ be a discrete univariate Gibbs measure on the non-negative integers. Then X ∼ µ if and only if for all bounded g we have that

ω E e^{V(X+1)−V(X)} g(X + 1) = ω E e^{V(X+1)−V(X)} E g(X*).
And here is the theorem
For any W ≥ 0 with 0 < EW < ∞ we have that

E h(W) − µ(h) = ω{ E e^{V(W+1)−V(W)} f(W + 1) − E e^{V(W+1)−V(W)} E f(W*) },

where f is the solution of the Stein equation.
Example: lattice approximation in statistical physics
Chayes and Klein (1994): Assume A ⊂ R^d is a rectangle; denote its volume by |A|. Consider the intersection of A with the d-dimensional lattice n^{−1}Z^d. With each site m in this intersection we associate a Bernoulli random variable X^n_m which takes the value 1, with probability p^n_m, if a particle is present at m, and 0 otherwise. If the collection (X^n_m)_m is independent, the joint distribution can be interpreted as the Gibbs distribution for an ideal gas on the lattice, and we have Poisson convergence.
We provide bounds on the approximation for such particles while allowing for interactions.
More details
The model is as follows. Pick n ∈ N and suppose that A can be partitioned into a regular array of d(n) sub-rectangles {S^n_1, . . . , S^n_{d(n)}} with volumes v(S^n_m) = z^n_m / z^n. For each m, n choose a point q^n_m ∈ S^n_m. We consider a sequence of functions (f_k)_k satisfying f_0 ≡ 1 and, for each k ≥ 1, f_k(x_1, . . . , x_k) is a nonnegative function, Riemann integrable on A^k, such that f_k is a symmetric function for each k.
Let X^n_m, 1 ≤ m ≤ d(n), be 0-1 random variables with joint density function

P( X^n_1 = a_1, . . . , X^n_{d(n)} = a_{d(n)} ) ∝ f_k( q^n_{i_1}, . . . , q^n_{i_k} ) ∏_{m=1}^{d(n)} (z^n_m)^{a_m},

where each a_i ∈ {0, 1} and the indices on the right-hand side are determined by k = ∑_{m=1}^{d(n)} a_m. Define S_n = ∑_{m=1}^{d(n)} X^n_m.
Then, due to the symmetry of f_k,

P(S_n = k) = [ ((z^n)^k / k!) ∑_{i_1=1}^{d(n)} · · · ∑_{i_k=1}^{d(n)} f_k( q^n_{i_1}, . . . , q^n_{i_k} ) ∏_{m=1}^{k} v(S^n_{i_m}) ]
  / [ ∑_{k=0}^{d(n)} ((z^n)^k / k!) ∑_{i_1=1}^{d(n)} · · · ∑_{i_k=1}^{d(n)} f_k( q^n_{i_1}, . . . , q^n_{i_k} ) ∏_{m=1}^{k} v(S^n_{i_m}) ].

Note that we can write this distribution as a Gibbs measure µ_n.
Let S be a nonnegative integer valued random variable defined by

P(S = k) = [ (z^k / k!) ∫_{A^k} f_k(x_1, . . . , x_k) dx_1 · · · dx_k ]
  / [ ∑_{k=0}^{∞} (z^k / k!) ∫_{A^k} f_k(x_1, . . . , x_k) dx_1 · · · dx_k ].

We write this distribution as a Gibbs measure µ. Using our approach we can bound the total variation distance between µ_n and µ.
Comparing distributions via generators
Let µ have generator A and corresponding (ω, V), and let µ_2 have generator A_2 and corresponding (ω_2, V_2). Assume that both birth-death processes have unit per-capita death rates. Then, for X ∼ µ_2, if the solution f of the Stein equation for µ is such that A_2 f exists,

E h(X) − µ(h) = E Af(X) = E(A − A_2) f(X).

Using that X* has the X-size bias distribution,

|E h(X) − µ(h)| ≤ ‖f‖ E(X) { |ω − ω_2|/ω_2 + (ω/ω_2) E| e^{ (V(X*)−V(X*−1)) − (V_2(X*)−V_2(X*−1)) } − 1 | }.
For example, for two Poisson distributions Poisson(λ_1) and Poisson(λ_2) this approach gives

| E h(X) − ∫ h dµ | ≤ ‖f‖ |λ_1 − λ_2|.

Note that the normalising constant Z in the Gibbs distribution, which is often difficult to calculate, is not needed explicitly in the Stein approach. This is one of the main advantages of the above approach.
Final Remarks
In the first example, X_1, X_2, . . . , X_n i.i.d. with P(X_i = 1) = p = 1 − P(X_i = 0), using Stein's method we can show that

sup_x | P( (np(1 − p))^{−1/2} ∑_{i=1}^n (X_i − p) ≤ x ) − P(N(0, 1) ≤ x) | ≤ 6 √( p(1 − p)/n )

and

sup_x | P( ∑_{i=1}^n X_i = x ) − P(Po(np) = x) | ≤ min(np², p).

So, if p < 36/(n + 36), the bound on the Poisson approximation is smaller than the bound on the normal approximation.
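Putting the two bounds into code (our sketch; the function names are our own) makes the crossover at p = 36/(n + 36) visible:

import numpy as np

def normal_bound(n, p):
    return 6 * np.sqrt(p * (1 - p) / n)      # bound from the slide above

def poisson_bound(n, p):
    return min(n * p**2, p)                  # bound from the slide above

n = 1000
for p in (0.01, 36 / (n + 36), 0.1, 0.3):
    print(f"p={p:.4f}  normal: {normal_bound(n, p):.4f}"
          f"  Poisson: {poisson_bound(n, p):.4f}")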
Application areas
◮ sequence analysis
◮ random graph statistics, and network statistics
◮ number theory
◮ MCMC convergence
◮ exceedances in time series
◮ interacting particle systems
◮ epidemics
◮ permutation tests
◮ Many more distributions
◮ Exchangeable pair couplings, also used for variance reduction in simulations
◮ Multivariate versions, also coupling approaches
◮ Infinite-dimensional: random measures, functionals of Gaussian random fields
◮ General distributional transformations
◮ Bounds in the presence of dependence
◮ In the i.i.d. case the Berry-Esseen inequality is not quite recovered.
Further reading
1. Arratia, R., Goldstein, L., and Gordon, L. (1989). Two moments suffice for Poisson approximation: The Chen-Stein method. Ann. Probab. 17, 9-25.
2. Barbour, A.D., Holst, L., and Janson, S. (1992). Poisson Approximation. Oxford University Press.
3. Reinert, G. (2005). Three general approaches to Stein's method. In: A Program in Honour of Charles Stein: Tutorial Lecture Notes. A.D. Barbour, L.H.Y. Chen, eds. World Scientific, Singapore, 183-221.