Introduction
Introduction To Laws of Large Numbers
Weak Law of Large Numbers
Strong Law
Strongest Law
Examples
Information Theory
Statistical Learning
Appendix
Random Variables
Working with R.V.’s
Independence
Limits of Random Variables
Modes of Convergence
Chebyshev
An Introduction to Laws of Large Numbers
John, CVGMI Group
September 13, 2012
Contents
1 Introduction
2 Introduction To Laws of Large Numbers
   Weak Law of Large Numbers
   Strong Law
   Strongest Law
3 Examples
   Information Theory
   Statistical Learning
4 Appendix
   Random Variables
   Working with R.V.’s
   Independence
   Limits of Random Variables
   Modes of Convergence
   Chebyshev
Intuition
We’re working with random variables. What could we observe? A sequence of random variables {X_n}_{n=1}^∞ ... More specifically:
A Bernoulli sequence, looking at the probability that the average of the sum of N random events lands in an interval: P(x_1 < (1/N) ∑_{i=1}^N X_i < x_2).
Coin flipping, anybody? As N increases we see that the observed proportions of 0s and 1s become approximately equal.
This is intuitive: as the number of samples increases, the average observation should tend toward the theoretical mean.
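The slides contain no code, so here is a minimal simulation sketch of the coin-flipping intuition (the function name, seed, and sample sizes are my own choices, not from the slides): the sample average of fair-coin flips settles near the theoretical mean 1/2 as N grows.

```python
import random

def running_mean(n_flips, seed=0):
    """Average of n_flips fair-coin flips (heads = 1, tails = 0)."""
    rng = random.Random(seed)
    return sum(rng.randint(0, 1) for _ in range(n_flips)) / n_flips

# As N grows, the sample average should settle near the theoretical mean 1/2.
for n in (10, 1_000, 100_000):
    print(n, running_mean(n))
```

Running this shows the deviation from 1/2 shrinking as the number of flips increases, which is exactly the behavior the laws below make precise.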
Weak Law
Let’s try to work out this most basic fundamental theorem of random variables arising from repeatedly observing a random event. How do we build such a theorem?
In a black-box scenario we want to be able to use independence.
We want to be able to use variance and expectation: so let’s take X_n ∈ L² for all n, and associate with each X_n its mean, denoted µ_n, and its variance, denoted σ_n².
We will create new random variables from the sequence and work with these; aim for results in terms of them.
This seems like a start; let’s try to prove a theorem...
Weak Law
Theorem
Let (X_n) be a sequence of independent L² random variables with means µ_n and variances σ_n². Then ∑_{i=1}^n (X_i − µ_i) → 0.
So a sequence of functions on Ω, the sample space, is going toward the function that is identically zero...
Can we prove this?
We haven’t used (σ_n): we’ll clearly need these, since otherwise the X_i are wildly unpredictable.
IDEA: Constrain lim_n ∑_{i=1}^n σ_i² = 0.
Counterexample
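The figure for this slide did not survive extraction. As a hedged sketch (my own construction, not necessarily the slide’s figure) of one natural counterexample to the unnormalized claim: for fair ±1 coins the centered sum ∑_{i=1}^n (X_i − µ_i) is a random walk whose typical magnitude grows like √n, so it does not tend to 0 without the 1/n normalization.

```python
import random

def centered_sum(n, seed=0):
    """Sum of n centered fair coin flips: X_i in {-1, +1} with mean 0."""
    rng = random.Random(seed)
    return sum(rng.choice((-1, 1)) for _ in range(n))

def mean_abs(n, trials=100):
    """Average |sum| over independent runs; grows roughly like sqrt(n)."""
    return sum(abs(centered_sum(n, seed=s)) for s in range(trials)) / trials

# The unnormalized centered sum is a random walk: its typical magnitude
# grows with n instead of shrinking to 0, while the 1/n-normalized
# average does go to 0.
print(mean_abs(100), mean_abs(10_000))
```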
Weak Law: Pitfalls
We haven’t specified a mode of convergence: this is a technical point, but if you recall uniform, pointwise, a.e., and normed convergence from analysis, the goal of the proof can change radically.
IDEA: We want our rule to hold with high probability, that is, it should hold for essentially all values of ω ∈ Ω. Convergence in the normed space is pretty ambitious at this point.
We haven’t specified a rate of convergence.
The rate of convergence of the σ_n should imply something about the rate of convergence of the X_n − µ_n. The X_n − µ_n should converge more slowly than the σ_n.
IDEA: Weaken the constraint to lim_n n⁻² ∑_{i=1}^n σ_i² = 0.
The X_n − µ_n should converge on average.
Law of Large Numbers 2.0
Theorem
Let (X_n) be a sequence of independent L² random variables with means µ_n and variances σ_n². Then (1/n) ∑_{i=1}^n (X_i − µ_i) → 0 in measure if lim_n (1/n²) ∑_{i=1}^n σ_i² = 0.
Let Y_n(ω) = (1/n) ∑_{i=1}^n (X_i(ω) − µ_i). Then E(Y_n) = 0 and σ²(Y_n) = (1/n²) ∑_{i=1}^n σ_i² by independence. Now we can use the limit of σ²(Y_n) from the hypothesis, and need to prove
P(ω : |Y_n(ω)| > ε) ≤ σ²(Y_n)/ε² → 0 as n → ∞.
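As a quick numerical sanity check (my own sketch; the distribution, sample sizes, and seed are assumptions), we can compare the empirical frequency of |Y_n| > ε against the Chebyshev-style bound σ²(Y_n)/ε² for centered fair ±1 coins, where σ²(Y_n) = 1/n.

```python
import random

def freq_exceeds(n, eps, trials=2000, seed=0):
    """Empirical P(|Y_n| > eps) for Y_n = (1/n) * sum of centered fair coins."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        y = sum(rng.choice((-1, 1)) for _ in range(n)) / n
        if abs(y) > eps:
            hits += 1
    return hits / trials

# For X_i in {-1, +1}, sigma^2(Y_n) = 1/n, so the bound is 1/(n * eps^2).
n, eps = 400, 0.2
print(freq_exceeds(n, eps), 1 / (n * eps * eps))
```

The empirical frequency sits well below the bound, which is loose but sufficient for the proof.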
Aside: Lp bounds
What does
P(ω : |Y_n(ω)| > ε) ≤ σ²(Y_n)/ε² → 0 as n → ∞
actually say?
The measure of the part of Ω where |Y_n| > ε is bounded by the integral of Y_n² divided by ε². Thinking about this makes the inequality transparent: it is Chebyshev’s inequality.
Better bounds can be obtained: Chernoff bounds.
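To make “better bounds” concrete, here is a small comparison sketch (my own example, not from the slides) for the mean of n fair {0,1} coins: Chebyshev gives a polynomial-in-n bound, while Hoeffding’s inequality, a Chernoff-type bound, is exponential in n.

```python
import math

def chebyshev_bound(n, eps):
    """Chebyshev: P(|mean - 1/2| >= eps) <= Var/eps^2 = 1/(4 n eps^2) for fair {0,1} coins."""
    return 1 / (4 * n * eps * eps)

def hoeffding_bound(n, eps):
    """Hoeffding's inequality, a Chernoff-type exponential bound: 2 exp(-2 n eps^2)."""
    return 2 * math.exp(-2 * n * eps * eps)

# The exponential bound eventually crushes the polynomial one as n grows.
for n in (100, 1_000, 10_000):
    print(n, chebyshev_bound(n, 0.1), hoeffding_bound(n, 0.1))
```

Note that for small n the Chebyshev bound can actually be tighter; the Chernoff-type bound wins asymptotically.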
Law of Large Numbers 2.1
Theorem
Let (X_n) be a sequence of independent L² random variables with means µ_n and variances σ_n². Then (1/n) ∑_{i=1}^n (X_i − µ_i) → 0 in measure if lim_n n⁻² ∑_{i=1}^n σ_i² = 0.
⇓ weakened; use Kolmogorov’s maximal inequality P(max_k |∑_{i=1}^k (X_i − µ_i)| ≥ ε) ≤ ε⁻² ∑_{i=1}^n σ_i²
Theorem
Let (X_n) be a sequence of independent L² random variables with means µ_n and variances σ_n². Then (1/n) ∑_{i=1}^n (X_i − µ_i) → 0 a.e. if ∑_{i=1}^∞ σ_i²/i² < ∞.
Khinchine’s Law
Theorem
Let (X_n) be a sequence of independent, identically distributed L¹ random variables with common mean µ. Then (1/n) ∑_{i=1}^n (X_i − µ) → 0 a.e. as n → ∞.
Large Sample Size in Coding
Now we need to use the laws. Start with an example in source coding.
C : X(Ω) → Σ*, encoding events as strings over an alphabet Σ.
It is interesting to know how complex C must be to encode (Y_i)_{i=1}^N.
Entropy: H(X) = E[−lg p(X)]; a representation of how uncertain a r.v. is.
Problem: p(·) is unknown to the encoder and the distribution needs to be learned. The WLLN can be used to answer: how uncertain is the distribution?
Asymptotic Equipartition Property
Theorem
(Asymptotic Equipartition Property) If X_1, ..., X_n are i.i.d. ∼ p(x), then
−(1/n) lg p(X_1, ..., X_n) → H(X).
The X_i are i.i.d. ∼ p(x), so the lg p(X_i) are i.i.d. The weak law of large numbers says that
P(| −(1/n) lg(∏_i p(X_i)) − E[−lg p(X)] | > ε) → 0
P(| −(1/n) ∑_i lg p(X_i) − E[−lg p(X)] | > ε) → 0
so the sample entropy approaches the true entropy in probability as the sample size increases.
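A minimal simulation sketch of the AEP (the Bernoulli parameter, seed, and sample size are my own choices): the per-symbol log-loss of an i.i.d. Bernoulli source converges to its entropy.

```python
import math
import random

def sample_entropy(n, p=0.3, seed=0):
    """-(1/n) lg p(x_1, ..., x_n) for n i.i.d. Bernoulli(p) draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = 1 if rng.random() < p else 0
        total += -math.log2(p if x == 1 else 1 - p)
    return total / n

# True entropy of a Bernoulli(0.3) source, H = E[-lg p(X)].
H = -(0.3 * math.log2(0.3) + 0.7 * math.log2(0.7))
print(sample_entropy(100_000), H)
```

For large n the two numbers agree closely, which is the AEP in action.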
Statistical Learning
We want to minimize R(f) = E(l(x, y, f(x))), e.g. l : (x, y, f) → (1/2)|f(x) − y|. We are stuck minimizing over all f, under a distribution we don’t know... hopeless...
IDEA: Take the Law of Large Numbers and apply it to this framework, and hopefully R_emp(f) = (1/m) ∑_i l(x_i, y_i, f(x_i)) → R(f).
Use Lp bounds to prove convergence results on testing error.
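A toy sketch of R_emp(f) → R(f) (entirely my own setup: the predictor f ≡ 0, the label distribution y ~ Uniform(−1, 1), and the seed are assumptions chosen so the true risk is computable by hand):

```python
import random

def empirical_risk(m, seed=0):
    """R_emp(f) = (1/m) sum_i (1/2)|f(x_i) - y_i| for the toy predictor f == 0."""
    rng = random.Random(seed)
    return sum(0.5 * abs(0.0 - rng.uniform(-1, 1)) for _ in range(m)) / m

# With y ~ Uniform(-1, 1) and f(x) = 0, the true risk is E[(1/2)|y|] = 1/4;
# by the law of large numbers R_emp(f) should approach it as m grows.
print(empirical_risk(100), empirical_risk(100_000), 0.25)
```

This is the LLN for a single fixed f; uniform convergence over a class of f’s is the harder question that statistical learning theory addresses.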
Mini Appendix: Measure Theory
I won’t go into too much detail in this regard. If we’ve made it this far, then great.
(X, A) is a measurable space if A ⊂ 2^X is closed under countable unions and complements. These correspond to ’measurable events’.
(X, A) with µ is a measure space if µ : A → [0, ∞) is countably additive over disjoint sets: µ(∪_i A_i) = ∑_i µ(A_i) if A_i ∩ A_j = ∅ for i ≠ j.
More great properties fall out of this quickly: the measure of the empty set is zero, measures are monotonic under the containment (partial) ordering, and even dominated and monotone convergence (of sets) come out of this.
Chebyshev’s Inequality: let f ∈ Lp(R) and denote E_α = {x : |f(x)| > α}; then ||f||_p^p ≥ ∫_{E_α} |f|^p ≥ α^p µ(E_α) by monotonicity and positivity of measures.
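The chain ||f||_p^p ≥ ∫_{E_α} |f|^p ≥ α^p µ(E_α) can be checked mechanically on a finite sample space with uniform measure (a sketch with made-up sample values; the function name is mine):

```python
def chebyshev_sides(values, alpha, p=2):
    """Return (alpha^p * mu(E_alpha), ||f||_p^p) for f on a uniform finite measure."""
    n = len(values)
    lhs = (alpha ** p) * sum(1 for v in values if abs(v) > alpha) / n
    rhs = sum(abs(v) ** p for v in values) / n
    return lhs, rhs

# The inequality alpha^p * mu(E_alpha) <= ||f||_p^p holds for any sample:
# on E_alpha, |f|^p > alpha^p pointwise, and extending the integral to all
# of X only adds nonnegative mass.
vals = [0.1, -0.4, 2.0, 3.5, -1.2, 0.0, 5.0]
lhs, rhs = chebyshev_sides(vals, alpha=1.0)
print(lhs, rhs)
```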
L2 R.V.’s: Definition
We say that a function f : X → R belongs to the function space Lp(X) if ||f||_p = (∫ |f(x)|^p dx)^{1/p} < ∞, and say f has finite Lp norm.
Definition
The fundamental object being considered in statistics is the random variable. Given a measure space (Ω, B, µ), X : Ω → R is an L² random variable if
{X < r} ∈ B ∀r ∈ R,
µ(Ω) = 1, and (∫ X(ω)² dµ(ω))^{1/2} < ∞.
We may also say that a random variable with finite 2nd moment is a Borel-measurable real-valued function in L²(Ω).
L2 R.V.’s: Questions
Why B? Physical paradoxes, e.g. Banach-Tarski. We want to talk about physical events ∼ measurable sets, and B is as descriptive as we’d like most of the time.
What does µ do here? µ weights the events ∼ measurable sets; it puts values to {ω ∈ Ω : X(ω) < r}, where Ω is the sample space. X maps random events (patterns occurring in data) to the reals, which have a good standard notion of measure. So X induces the distribution P(B) = µ(X⁻¹(B)), which gives
P(X(ω) ∈ B) = µ(X⁻¹(B))
as expected. This is usually written without the sample variable.
For more sophisticated questions/answers see MAA 6616 or STA 7466, but discussion is encouraged.
Statistics: Functions of Measured Events
Now we can start building.
Theorem
If f : R → R is measurable, then ∫ f(x) dP = ∫ f(X(ω)) dµ,
i.e. the random variable pushes forward a measure onto R, and the integrals of measurable functions of random variables are therefore ultimately determined by real integration.
This may be proved using Dirac measures and measure properties. Following these lines we may develop the basics of the theory of probability from measure theory. (OK, homework.)
Statistics: Moments
Definition
Define the first moment as the linear functional
E(X) = ∫ t dP.
Then the pth central moment is a functional given by
m_p(X) = ∫ (t − E(X))^p dP.
Note that when X ∈ L² these are well defined (for p ≤ 2) by the theorem above. Pullbacks are omitted. Since we’re talking about L², let’s define the second central moment as σ² for convenience. If we need higher moments, sadly, mathematics says no: an L² variable need not have them.
Statistics: Independence
Now that we know what random variables are, let’s try to define how a collection of them interact in the most basic way possible.
Definition
Let X = {X_α} be a collection of random variables. Then X is independent if
P(∩_α X_α⁻¹(B_α)) = P((X_1, ..., X_n)⁻¹(B_1 × ... × B_n))
= (µ ∘ X_1⁻¹ × ... × µ ∘ X_n⁻¹)(B_1 × ... × B_n).
(Head explodes.) Really this just means that the X_α are independent if their joint distribution is given by the product measure of the induced measures.
Limits: Repeating Independent Trials
The significance of the belabored definition of independence is that a joint distribution (just a distribution over several random variables) of independent variables contains zero information about how the variables are related.
If we have an infinite collection of random variables, independent of each other, then in spite of independence we would like to be able to infer information about the lower order moments from the higher order moments and vice versa.
If we have ideal conditions on the random variables, then we should be able to deduce information about moments from a very large sequence.
The type of convergence we get should be sensitive to the hypotheses.
Modes of Convergence
We clearly need to specify how the means are converging, right? Why?
So how could ∑_{i=1}^N (X_i − µ_i)/N → 0?
In measure: lim_N P(|∑_{i=1}^N (X_i − µ_i)/N| ≥ ε) → 0
In norm: lim_N (∫ |∑_{i=1}^N (X_i − µ_i)/N|^p dµ(ω))^{1/p} → 0
A.E.: P({ω : lim_N |∑_{i=1}^N (X_i(ω) − µ_i)/N| > 0}) = 0
[Diagram: implications among the modes of convergence (a.e., in measure µ, Lp); convergence in measure yields an a.e.-convergent subsequence f_{n_k}.]
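The distinction between convergence in measure and a.e. convergence is captured by the classical “typewriter” sequence of indicator functions on [0, 1) (a standard example, coded as my own sketch; function names are mine):

```python
def typewriter_interval(n):
    """n-th block of the typewriter sequence: write n = 2^k + j, use [j/2^k, (j+1)/2^k)."""
    k = n.bit_length() - 1
    j = n - 2 ** k
    return j / 2 ** k, (j + 1) / 2 ** k

def f(n, x):
    """Indicator of the n-th typewriter interval, a function on [0, 1)."""
    a, b = typewriter_interval(n)
    return 1.0 if a <= x < b else 0.0

# The support has measure 1/2^k -> 0, so f_n -> 0 in measure ...
print([typewriter_interval(n)[1] - typewriter_interval(n)[0] for n in (1, 2, 4, 8, 16)])
# ... yet at any fixed x the sequence hits 1 infinitely often, so f_n(x)
# does not converge pointwise; only a subsequence f_{n_k} converges a.e.
print([n for n in range(1, 64) if f(n, 0.5) == 1.0])
```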
Chebyshev