
Welcome to

STAB52

Instructor: Dr. Ken Butler


Contact information

(on Intranet: intranet.utsc.utoronto.ca, My Courses)

• E-mail: [email protected]

• Office: H 417

• Office hours: to be announced

• Phone: 5654 (416-287-5654)


Probability Models


Measuring uncertainty


Random Variables and Distributions


Random Variables

Suppose we flip two (fair) coins, and note whether each coin (ordered) comes up H or T.

  • Sample space is S = {HH, HT, TH, TT}.

  • Probability measure gives probability 1/4 to each of the 4 outcomes.

What about “number of heads”? Could be 0, 1 or 2:

  • P(0 heads) = P(TT) = 1/4

  • P(1 head) = P(TH) + P(HT) = 1/2

  • P(2 heads) = P(HH) = 1/4.
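
A minimal Python sketch (not part of the slides) that lists the four equally likely outcomes and tallies the distribution of the number of heads:

```python
from itertools import product
from fractions import Fraction

# All ordered outcomes of two fair coin flips; each has probability 1/4.
outcomes = list(product("HT", repeat=2))
p = Fraction(1, 4)

# X = number of heads: add up the probability of every outcome giving each value.
dist = {}
for s in outcomes:
    x = s.count("H")
    dist[x] = dist.get(x, 0) + p

print(dist)   # {2: Fraction(1, 4), 1: Fraction(1, 2), 0: Fraction(1, 4)}
```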


“Number of heads” is random variable: function from S to R. That

is, given outcome, get value of random variable.

A random variable can be any function from S to R. If S = {rain, snow, clear}, a random variable X could be

X(rain) = 3

X(snow) = 6

X(clear) = −2.7.


Some more examples of random variables

Roll a fair 6-sided die, so that S = {1, 2, 3, 4, 5, 6}. Let X be the number of spots showing, and let Y be the square of the number of spots. If s is the number of spots, let W = s + 10, let U = s^2 − 5s + 3, etc.

In previous situation, let C = 3 regardless of s. C is constant

random variable.

Suppose have event A, only interested in whether A happens or

not. Define indicator random variable I to be 1 if A happens, 0

otherwise. Example (rolling die) I6(s) = 1 if s = 6, 0 otherwise.


≥, =, sum for random variables

Imagine rolling a fair die again, S = {1, 2, 3, 4, 5, 6}. Let X = s,

and let Y = X + I6.

X is number of spots, I6 is 1 if you roll a 6 and 0 otherwise. What

does Y mean?

Eg. roll a 4, X = 4, Y = 4 + 0 = 4. But if you roll a 6,

Y = 6 + 1 = 7. (That is, Y is the number of spots plus a “bonus

point” if you roll a 6.)

Sum of random variables (like Y here) for any outcome is sum of

their values for that outcome.


Also: if s = 1, 2, 3, 4, 5, values of X and Y are same. If s = 6,

X < Y .

Say that random variable X ≤ Y if value of X ≤ value of Y for

every single outcome. True in example.

Say that random variable X = Y if value of X equals value of Y

for every single outcome. Not true in example (different when

outcome is s = 6).

For a constant random variable c, X ≤ c if all possible values of X are ≤ c.


When S is infinite

When S infinite, random variable can take infinitely many different

values (but may not).

Example: S = {1, 2, 3, . . .}. If X = s, X takes all infinitely many

values in S. But define Y = 3 if s ≤ 4, Y = 2 if 4 < s ≤ 10,

Y = 1 when s > 10. Y has only finitely many (3) different values.


Distributions of random variables

A random variable can be described by listing all its possible values and their probabilities. Started this chapter with a coin-flipping example:

Flip two (fair) coins, and note whether each coin (ordered) comes up

H or T.

Let X be “number of heads”. Could be 0, 1 or 2:

  • P(X = 0) = P(TT) = 1/4

  • P(X = 1) = P(TH) + P(HT) = 1/2

  • P(X = 2) = P(HH) = 1/4.

Called the distribution of X .


Notice how can talk about P (X = s) for some s. In this case,

listing all the s for which P (X = s) > 0 describes distribution.

Consider now a random variable X taking values in [0, 1] with

P(a ≤ X ≤ b) = b − a

for 0 ≤ a ≤ b ≤ 1. Try to figure out eg. P(X = 0.4): it is P(0.4 ≤ X ≤ 0.4) = 0.4 − 0.4 = 0.

Can’t define probability of a value, but still can define probability of

landing in subset of R (namely interval).


To account for all of this, define distribution of random variable X

as: collection of probabilities P (X ∈ B) for all subsets B of

real numbers.

Works for both examples above. Eg. in the first example, P(X ≤ 1) = P(X = 0) + P(X = 1) = 3/4.

In practice, often messy to define probabilities for “all possible

subsets”. Think first about examples like 1st, “discrete”, where can

talk about probabilities of individual values. Then consider

“continuous” case (like 2nd), where have to look at intervals.


Discrete distributions

Often it makes sense to talk about individual probs, P(X = x). When all the probability is included in these probs, ie.

∑_{x∈R} P(X = x) = 1,

don't need to look at anything else.

Another way to look at it: there is a finite or countable set of x values, x_1, x_2, . . ., each having probability p_i = P(X = x_i), such that ∑_i p_i = 1.

Either of these is definition of discrete distribution.


Compare case where P (a ≤ X ≤ b) = b− a: P (X = x) = 0

for all x, so not discrete distribution.

Another example: suppose X = −1 with prob 1/2, and for 0 ≤ a ≤ b ≤ 1, P(a ≤ X ≤ b) = (b − a)/2. Can talk about P(X = −1) = 1/2, but P(X = x) = 0 for any other x. So not a discrete distribution.

Notation for discrete distributions (emphasize function):

pX(x) = P (X = x)

called probability function or mass function.

Now look at some important discrete distributions.


Degenerate distributions

If random variable C is constant, equal to c, then P(C = c) = 1 and P(C = x) = 0 for any x ≠ c. Since ∑_{x∈R} P(C = x) = P(C = c) = 1, this is a proper (though dull) discrete distribution. Called the degenerate distribution or point mass.


Bernoulli distribution

Flip a coin once, let X be number of heads (has to be 0 or 1).

Suppose P (head) = θ, so P (tail) = 1− θ. Then

pX(1) = P (X = 1) = P (head) = θ;

pX(0) = P (X = 0) = P (tail) = 1− θ.

X said to have Bernoulli distribution; write X ∼ Bernoulli(θ).

Application: any kind of “success/failure”. Denote “success” by 1,

“failure” by 0. Or selection from population with two kinds of

individual like male/female.


Binomial distribution

Now suppose we flip the coin n times (independently) and again

count number of heads. Probability of exactly x heads is

pX(x) = P(X = x) = (n choose x) θ^x (1 − θ)^(n−x).

X said to have binomial distribution, written

X ∼ Binomial(n, θ).

Applications: as for Bernoulli. Eg. randomly select 100 Canadian

adults, let X be number of females.


Let X ∼ Binomial(4, 0.5), Y ∼ Binomial(4, 0.2). Then

x P(X=x) P(Y=x)

0 0.0625 0.4096

1 0.2500 0.4096

2 0.3750 0.1536

3 0.2500 0.0256

4 0.0625 0.0016

X probs symmetric about x = 2, Y more likely to be 0 or 1

because successes less likely.
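
This table is easy to reproduce in a few lines; a sketch, assuming scipy is available:

```python
from scipy.stats import binom

# pmf of Binomial(4, 0.5) and Binomial(4, 0.2) at x = 0, 1, ..., 4
for x in range(5):
    px = binom.pmf(x, 4, 0.5)   # P(X = x)
    py = binom.pmf(x, 4, 0.2)   # P(Y = x)
    print(f"{x}  {px:.4f}  {py:.4f}")
# e.g. 0.0625 and 0.4096 at x = 0, 0.3750 and 0.1536 at x = 2, as in the table.
```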

Bernoulli and binomial count successes in fixed number of trials.

Could also look at waiting time problem: fix successes, count

number of trials needed to get them.


Geometric distribution

Same situation as for binomial: independent trials, equal prob. θ of success. Let X now be the number of tails before the 1st head.

X = k means we observe k tails, and then a head, so

pX(k) = P(X = k) = (1 − θ)^k θ, k = 0, 1, 2, . . .

X can be as large as you like, since you might wait a long time for

the first head. (Compare binomial: can’t have more than n

successes in n trials).

X has geometric distribution, prob. θ, written X ∼ Geometric(θ).

Applications: number of working light bulbs tested until first one that

fails; number of at-bats for baseball player until first hit.


Examples: suppose X1 ∼ Geometric(0.8) and

X2 ∼ Geometric(0.5).

k P (X1 = k) P (X2 = k)

0 0.8 0.5

1 0.16 0.25

2 0.032 0.125

3 0.0064 0.0625

4 0.00128 0.03125

. . . . . .

When θ larger, 1st success probably sooner.

Also: probabilities form geometric series, hence the name.


Negative binomial distribution

To take geometric one stage further: Let r be a fixed number, let Y

be the number of tails before the r-th head.

Y = k only if we observe r − 1 heads and k tails, in any order, followed by a head (must finish with a head). That is r + k − 1 flips before the final head. Prob. of this is

pY(k) = P(Y = k) = ((r + k − 1) choose (r − 1)) θ^(r−1) (1 − θ)^k · θ
                 = ((r + k − 1) choose k) θ^r (1 − θ)^k.

Write this Y ∼ Negative-Binomial(r, θ).


Applications: can re-use geometric distribution examples. Thus:

number of working lightbulbs tested until 5th non-working one

encountered; number of at-bats until baseball player achieves 10th

hit.

Numerical examples: let Y1 ∼ Negative-Binomial(4, 0.8) and

Y2 ∼ Negative-Binomial(3, 0.5).

k P(Y1=k) P(Y2=k)

0 0.4096 0.1250

1 0.3276 0.1875

2 0.1638 0.1875

3 0.0655 0.1562

4 0.0229 0.1171

5 0.0073 0.0820

6 0.0022 0.0546


With Y1, “heads” are likely so probably won’t see many tails before

4th H. With Y2, heads not so likely but only need to see 3 before

stopping.

General note: some books count total number of trials until first (or

r-th) head for geometric and negative binomial distributions. Gives

random variables 1 + X and r + Y as defined above.


Poisson distribution

Suppose X ∼ Binomial(n, λ/n). We’ll think of λ as being fixed

and see what happens as n →∞. That is, what if the number of

trials gets very large but the prob. of success gets very small?

Then

P(X = x) = (n choose x) (λ/n)^x (1 − λ/n)^(n−x)
         = [n! / (x! (n − x)! n^x)] λ^x (1 − λ/n)^n (1 − λ/n)^(−x).


Thinking of x as fixed (for now) and letting n → ∞: the behaviour of the factorials is determined by the highest power of n. Thus n! behaves like n^n, (n − x)! behaves like n^(n−x), and hence

n! / ((n − x)! n^x) → 1.

Also,

(1 − λ/n)^(−x) → 1

because 1 − λ/n → 1 and raising it to a fixed power changes nothing.


Finally,

lim_{n→∞} (1 − λ/n)^n

is a famous limit from calculus; it is e^(−λ). Thus

lim_{n→∞} P(X = x) = e^(−λ) λ^x / x!.

A random variable Y with P(Y = y) = e^(−λ) λ^y / y! is said to have a Poisson(λ) distribution, written Y ∼ Poisson(λ).

The Poisson distribution is a good model for rare events: that is,

events which have a large number of “chances” to happen, but have

a very small probability of happening at each “chance”. λ represents

“rate” at which events happen; doesn’t have to be integer.
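
The limit can also be seen numerically; a sketch (assuming scipy) comparing Binomial(n, λ/n) with the Poisson(λ) limit for λ = 2:

```python
from scipy.stats import binom, poisson

lam, x = 2.0, 3                            # compare P(X = 3) for lambda = 2
for n in (10, 100, 1000, 10000):
    print(n, binom.pmf(x, n, lam / n))     # approaches the Poisson value as n grows
print("Poisson:", poisson.pmf(x, lam))     # e^(-2) 2^3 / 3! = 0.1804
```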


Applications of Poisson distribution are things like: number of house

fires in a city on a given day, number of phone calls arriving at a

switchboard in an hour, number of radioactive events recorded by a

Geiger counter.

Let X ∼ Poisson(2), Y ∼ Poisson(0.8):


lam=2 lam=0.8

x P(X=x) P(Y=x)

0 0.1353 0.4493

1 0.2707 0.3595

2 0.2707 0.1438

3 0.1804 0.0383

4 0.0902 0.0077

5 0.0361 0.0012

6 0.0120 0.0002

... ...

• When λ is integer, highest prob at that integer and next lower

• Otherwise, highest prob at next lower integer (so when λ < 1,

highest prob at x = 0).


Hypergeometric distribution

Introduction

Imagine a pot containing 10 balls, 7 red and 3 green. Prob. of

drawing a red ball is 0.7 (7/10). If we put the ball drawn back in the

pot, prob. of drawing a red ball the next time is still 0.7.

Thus, drawing with replacement, the number of red balls in 4 draws is R ∼ Binomial(4, 0.7). Therefore

P(R = 4) = (4 choose 4)(0.7)^4 (0.3)^0 = 0.2401.


Now suppose we draw without replacement: that is, don’t put balls

back in pot after drawing. If we draw a red ball 1st time, there are

only 6 red balls out of 9 balls left.

Should be harder to draw 4 red balls in 4 draws because there are

fewer left after we draw each one: now

P(R = 4) = (7/10) · (6/9) · (5/8) · (4/7) = 0.1667.

This is not so bad, but suppose we now want P (R = 3), say?

Need general principle for drawing without replacement.


The hypergeometric formula

Introduce symbols: suppose draw n balls out of a pot containing N

total. Suppose M of the balls in the pot are red. Let X be number

of red balls drawn. What is P (X = x)?

Need to count ways:

  • Number of ways to draw n balls out of the N in the pot: (N choose n).

  • Number of ways to draw x red balls out of the M red balls in the pot: (M choose x).

  • Number of ways to draw n − x green balls out of the N − M green balls in the pot: ((N − M) choose (n − x)).

P(X = x) is the number of ways to draw the red and green balls divided by the number of ways to draw n balls out of N:

P(X = x) = (M choose x) ((N − M) choose (n − x)) / (N choose n).

X said to have hypergeometric distribution:

X ∼ Hypergeometric(N, M, n). Checks:

M + (N −M) = N and x + (n− x) = n. Restrictions on x?

• Number of red balls: x ≤ n and x ≤ M so x ≤ min(n,M).

  • Number of green balls: n − x ≤ n and n − x ≤ N − M, so x ≥ 0 and x ≥ n + M − N, so x ≥ max(0, n + M − N).


Example 1: let X ∼ Hypergeometric(10, 7, 4):

x P(X=x)

0 0.0000

1 0.0333

2 0.3000

3 0.5000

4 0.1667

10 balls in pot, 7 red, 4 drawn. Cannot draw 0 red, because that

would mean drawing 4 green, and only 3 in pot. (Also cannot draw

more than 4 red because only drawing 4).
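
A quick check of this table, assuming scipy is available (note scipy's argument order: population size, number of red balls, number of draws):

```python
from scipy.stats import hypergeom

rv = hypergeom(10, 7, 4)              # 10 balls in the pot, 7 red, 4 drawn
for x in range(5):
    print(x, round(rv.pmf(x), 4))     # 0.0, 0.0333, 0.3, 0.5, 0.1667
```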


Example 2: let Y ∼ Hypergeometric(5, 3, 4):

y P(Y=y)

0 0.0

1 0.0

2 0.6

3 0.4

4 0.0

5 0.0

5 balls in pot, 3 red and 2 green, draw 4. Cannot draw more than 3

red. But also cannot draw only 0 or 1 red, because that would mean

drawing 4 or 3 green, and aren’t that many in the pot.


Applications

Anything that involves drawing without replacement from a finite set

of elements. Includes sampling, eg. selecting people to include in

opinion poll. (Don’t want to select same person twice). People

sampled from might agree (red ball) or disagree (green ball) with

question asked.

Large N

If N large, might imagine that it doesn’t matter much whether you

replace balls in pot or not. In other words, for large N , binomial

would be decent approximation. Turns out to be true:

If X ∼ Hypergeometric(N, M, n) and N is large, then X has approximately the same distribution as Y ∼ Binomial(n, M/N).


Continuous distributions

Suppose, for random variable X,

P(a ≤ X ≤ b) = b − a

for 0 ≤ a ≤ b ≤ 1.

Is a legitimate probability since 0 ≤ b − a ≤ 1. But P(X = a) = a − a = 0 for any a, so not a discrete distribution.

Where did the probability go?


Cumulative distribution functions


One-dimensional change of variable


Joint Distributions

Know how to describe random variables one at a time: probability

function (discrete), density function (continuous), cumulative

distribution function (either).

But two random variables X , Y might be related. Don’t have a way

to describe this.

Example: X ∼ Bernoulli(2/3). Let Y = 1−X .

Y ∼ Bernoulli(1/3) (count failures not successes). X, Y

related, but doesn’t show in individual probability functions.


Joint probability functions

Can simply find probability of all possible combinations of values for

X, Y . Uses individual probability functions and relationship.

In example: if X = 0, then Y = 1; if X = 1, then Y = 0.

Possible values for Y depend on value of X . Also,

P (X = 1) = 2/3.

Notation: pX,Y (x, y) = P (X = x, Y = y) (comma is “and”),

called joint probability function. In example:

pX,Y (1, 0) = 2/3; pX,Y (0, 1) = 1/3.

Are only possible combinations of X and Y values.


Often convenient to depict as a table. Above example:

x \ y     0     1
0         0    1/3
1        2/3    0

Another:

u \ v     0     1     2
0        1/3   1/6   1/6
1        1/6  1/12  1/12

Note that all the probabilities sum to 1, because the joint probability function covers all possibilities.


Joint density functions

If random variables continuous, joint probability function makes no

sense; instead, define joint density function f(x, y) that

expresses chance of being “near” (X = x, Y = y).

Joint density function also covers all possible values of X, Y , so

integrates to 1 when integrated over both x and y.

Example: f(x, y) = 4x^2 y + 2y^5, 0 ≤ x, y ≤ 1 (page 85).


Sometimes possible values of Y depend on value of X . Account for

in integration.

Example: f(x, y) = 120x^3 y for x ≥ 0, y ≥ 0, x + y ≤ 1. (Thus if X = 0.6, Y cannot exceed 0.4.) Region forms a triangle: Figure 2.7.3 of text (p. 85). Verify the density by letting the y limits of integration depend on x (upper limit y = 1 − x).


Bivariate normal distribution

Suppose X, Y both have standard normal distributions, and suppose −1 < ρ < 1. Then the bivariate standard normal distribution with correlation ρ has joint density function

f(x, y) = [1 / (2π√(1 − ρ^2))] exp{ −(x^2 + y^2 − 2ρxy) / (2(1 − ρ^2)) }.

Plotting in 3D (Figure 2.7.4) gives a 3D bell shape.

ρ measures relationship between X and Y :

• ρ = 0: no relationship

• ρ > 0: when X > 0, Y likely > 0

• ρ < 0: when X > 0, Y likely < 0.


Bivariate standard normal has peak at (0, 0). Replacing x by

(x− µ1)/σ1 and y by (y − µ2)/σ2 shifts peak to (µ1, µ2) and

changes decrease of density away from peak (larger σ values mean

slower decrease).


Calculating probabilities

For a continuous random variable X, calculate probabilities by integrating, eg. P(a < X ≤ b) = ∫_a^b f(x) dx.

Same idea for continuous joint distribution, integrating over x and y.

Example: f(x, y) = 120x^3 y for x ≥ 0, y ≥ 0, x + y ≤ 1. Find P(0.5 ≤ X ≤ 0.7, Y > 0.2).

Draw a picture. The region is a trapezoid: y between 0.2 and the diagonal line y = 1 − x, x between the given limits. Integrate over y first, then x, to get 0.294.
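
A sketch using sympy that reproduces the 0.294, with the inner y integral running from 0.2 up to the line y = 1 − x:

```python
import sympy as sp

x, y = sp.symbols("x y")
f = 120 * x**3 * y                     # joint density on x >= 0, y >= 0, x + y <= 1

prob = sp.integrate(f,
                    (y, sp.Rational(1, 5), 1 - x),                  # y from 0.2 to 1 - x
                    (x, sp.Rational(1, 2), sp.Rational(7, 10)))     # x from 0.5 to 0.7
print(prob, float(prob))               # 0.294
```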


Marginal distributions

Started from individual distributions for X, Y plus relationship. But:

start from joint, get individual?

One way: get distribution of X by “averaging” over distribution of Y .

Discrete: simply row and column totals. Example:

u \ v     0     1     2   Sum
0        1/3   1/6   1/6   2/3
1        1/6  1/12  1/12   1/3
Sum      1/2   1/4   1/4    1


Without knowledge of V , U twice as likely 0 as 1; without

knowledge of U , V twice as likely 0 as 1 or 2.

Row totals here give marginal distribution of U ; column totals

here marginal distribution of V . Each marginal distribution is

proper probability distribution (probs sum to 1).


Continuous: integrate over other variable. Get marginal density

function.

Example: f(x, y) = 120x^3 y for x ≥ 0, y ≥ 0, x + y ≤ 1.

Marginal density for X: integrate over y, limits 0 to 1 − x; get

fX(x) = ∫_0^(1−x) 120x^3 y dy = 60x^3 (1 − x)^2.

For Y: integrate over x, limits 0 to 1 − y:

fY(y) = ∫_0^(1−y) 120x^3 y dx = 30y(1 − y)^4.

“Integrating out” unwanted variable.

Alternative approach via cumulative; text page 79.
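
These marginal densities can be checked with sympy; a sketch:

```python
import sympy as sp

x, y = sp.symbols("x y")
f = 120 * x**3 * y                      # joint density on the triangle x + y <= 1

fX = sp.integrate(f, (y, 0, 1 - x))     # integrate out y
fY = sp.integrate(f, (x, 0, 1 - y))     # integrate out x
print(sp.factor(fX))                    # 60*x**3*(x - 1)**2, i.e. 60 x^3 (1 - x)^2
print(sp.factor(fY))                    # 30*y*(y - 1)**4,    i.e. 30 y (1 - y)^4
print(sp.integrate(fX, (x, 0, 1)))      # 1, so the marginal is a proper density
```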


Example 2: bivariate standard normal. Recall the standard normal density; it integrates to 1, so

∫_{−∞}^{∞} (1/√(2π)) exp[−u^2/2] du = 1.

Marginal distribution of X in the bivariate standard normal: integrate out y:

fX(x) = ∫_{−∞}^{∞} [1 / (2π√(1 − ρ^2))] exp[ −(x^2 + y^2 − 2ρxy) / (2(1 − ρ^2)) ] dy.

Substitution: let u = (y − ρx)/√(1 − ρ^2), so du = dy/√(1 − ρ^2). Then

u^2 = (y^2 − 2ρxy + ρ^2 x^2) / (1 − ρ^2),


which is nearly what appears inside the "exp". Precisely:

fX(x) = ∫_{−∞}^{∞} (1/(2π)) exp[ −(u^2 + x^2)/2 ] du
      = (1/√(2π)) exp(−x^2/2) ∫_{−∞}^{∞} (1/√(2π)) exp(−u^2/2) du.

The integral is 1 (of a standard normal density), so

fX(x) = (1/√(2π)) exp(−x^2/2):

that is, the marginal distribution of X is standard normal.


Conditioning and Independence

Marginal distribution: of one variable, ignorant about other.

But what if we knew X ; what then about distribution of Y ?

Example 1:

x \ y     0     1
0         0    1/3
1        2/3    0

Suppose X = 1. Then ignore the 1st row.


But the 2nd row is not a probability distribution (sum 2/3, not 1). Idea: divide by the sum. Then if X = 1, P(Y = 0) = 1 and P(Y = 1) = 0: that is, if X = 1, Y is certain to be 0. Called the conditional distribution of Y given X = 1.

If X = 0, Y certain to be 1. Conditional distribution of Y different

for different X : Y depends on X .

Notation: as for conditional probability. Eg. above:

P (Y = 1|X = 0) = 1.


Example 2:

u \ v     0     1     2
0        1/3   1/6   1/6
1        1/6  1/12  1/12

Conditional distribution of V given U = 0? Use the U = 0 row. This sums to 2/3, so divide by this to get P(V = 0|U = 0) = 1/2, P(V = 1|U = 0) = 1/4, P(V = 2|U = 0) = 1/4.

The U = 1 row sums to 1/3; the conditional distribution of V given U = 1 is the same as given U = 0.

In example 2, it does not matter what U is – the conditional distribution of V is the same. Say that V and U are independent.


Two examples give extreme cases. In Example 1, knowing X gave

Y with certainty; in example 2, knowing U said nothing about V .

Most cases in between: knowing one variable has some effect on

distribution of other.

Symbols:

P(Y = b|X = a) = P(X = a, Y = b) / ∑_y P(X = a, Y = y)
               = P(X = a, Y = b) / P(X = a).

Denominator is marginal probability that X = a.


Conditioning on continuous random variables

Continuous case: no probabilities, so replace with density functions;

replace sum by integral. This gives conditional density function:

fY|X(y|x) = fX,Y(x, y) / ∫_{−∞}^{∞} fX,Y(x, y) dy = fX,Y(x, y) / fX(x),

replacing infinities by actual limits for y. Denominator depends on x

only; is marginal density function for X .

Then use conditional density to evaluate conditional probabilities.


Example: fX,Y(x, y) = 4x^2 y + 2y^5 for x, y between 0 and 1, 0 otherwise. Find P(0.2 ≤ Y ≤ 0.3|X = 0.8).

Steps: find the marginal density of X, use it to find the conditional density of Y given X, integrate the conditional density to find the probability.

Answers: the marginal density of X is fX(x) = 2x^2 + 1/3 for 0 ≤ x ≤ 1, 0 otherwise. The conditional density of Y|X is

fY|X(y|x) = (4x^2 y + 2y^5) / (2x^2 + 1/3);

then integrate over 0.2 ≤ y ≤ 0.3 and put in x = 0.8 to get P(0.2 ≤ Y ≤ 0.3|X = 0.8) = 0.0395.


Followup: what happens to P (0.2 ≤ Y ≤ 0.3) if X changes?

One answer: P (0.2 ≤ Y ≤ 0.3|X = 0.4) = 0.0242. So

probability does change as X changes; Y does depend on X .

However, change in probability quite small; dependence is not very

strong.


Law of total probability

Because

fY|X(y|x) = fX,Y(x, y) / ∫_{−∞}^{∞} fX,Y(x, y) dy = fX,Y(x, y) / fX(x),

it is also true that

fX,Y(x, y) = fX(x) fY|X(y|x).


So

P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_c^d ∫_a^b fX,Y(x, y) dx dy
                        = ∫_c^d ∫_a^b fX(x) fY|X(y|x) dx dy.

In words: can find probabilities either using joint density or using a

marginal and a conditional density. Can use whichever easier.


Independence of random variables

Recall this joint distribution:

u \ v     0     1     2
0        1/3   1/6   1/6
1        1/6  1/12  1/12
Sum      1/2   1/4   1/4

Conditional distribution of V same given U = 0 and given U = 1.

Also same as marginal distribution of V . Knowing U says nothing

about V .

(Also, conditional dist. of U same for all V and same as marginal for

U .)


Suggests definition: random variables independent if conditional

distribution always same, and always same as marginal.

Mathematics: X, Y independent if

pY(y) = pY|X(y|x) = pX,Y(x, y) / pX(x)

so that

pX,Y(x, y) = pX(x) pY(y).

This is usually the easiest check:

  • if pX,Y(x, y) = pX(x) pY(y) for all x, y, then X, Y independent.

  • if pX,Y(x, y) ≠ pX(x) pY(y) for any one (x, y) pair, then X, Y not independent.


For the example above: P(U = 0) = 2/3, P(U = 1) = 1/3; P(V = 0) = 1/2, P(V = 1) = P(V = 2) = 1/4. Also,

P(U = 0) P(V = 0) = (2/3) · (1/2) = 1/3 = P(U = 0, V = 0).

Repeat for all u and v: proves independence.
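
The same check can be done computationally; a sketch using Python fractions (the variable names are mine):

```python
from fractions import Fraction as F

# Joint probabilities p(u, v) from the table.
joint = {(0, 0): F(1, 3), (0, 1): F(1, 6),  (0, 2): F(1, 6),
         (1, 0): F(1, 6), (1, 1): F(1, 12), (1, 2): F(1, 12)}

pU = {u: sum(p for (uu, vv), p in joint.items() if uu == u) for u in (0, 1)}
pV = {v: sum(p for (uu, vv), p in joint.items() if vv == v) for v in (0, 1, 2)}

# Independent iff the joint probability equals the product of marginals everywhere.
print(all(joint[u, v] == pU[u] * pV[v] for (u, v) in joint))   # True
```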


Compare this joint distribution:

x \ y     0     1
0         0    1/3
1        2/3    0

Now,

P(X = 0) P(Y = 0) = (1/3) · (2/3) = 2/9

and P(X = 0, Y = 0) = 0 ≠ 2/9. One calculation shows X, Y not independent.


Independence of continuous random variables

As usual, turn probability into density. If

fX,Y (x, y) = fX(x)fY (y)

for all x, y, then continuous random variables X, Y independent. If

it fails for any (x, y) pair, not independent.

Example: suppose fX(x) = 2x^2 + 1/3, fY(y) = (4/3)y + 2y^5, and fX,Y(x, y) = 4x^2 y + 2y^5 for 0 ≤ x, y ≤ 1. Then

fX(x) fY(y) = (2x^2 + 1/3)((4/3)y + 2y^5),

which cannot be simplified to fX,Y(x, y). So X, Y not independent.


Order statistics

Suppose that X1, X2, . . . , Xn all, independently, have same

distribution (a sample from distribution). Suppose common cdf

FX(x).

For example: take 20 people, give each IQ test. Without knowing

about individuals, use same distribution for each. What might

highest score in sample be?

Idea: more people sampled, higher the highest score could be (get

more chances to see a very high score).


Let M = max(X1, X2, . . . , Xn). Then

P(M ≤ m) = P(X1 ≤ m, X2 ≤ m, . . . , Xn ≤ m)
         = P(X1 ≤ m) P(X2 ≤ m) · · · P(Xn ≤ m)
         = [FX(m)]^n.

If X continuous, differentiate to get the density.

Example: each Xi ∼ Uniform[0, 1]. Then FX(x) = x, so P(M ≤ m) = m^n.

If n = 5, P(M ≤ 0.9) = 0.9^5 = 0.59; if n = 20, P(M ≤ 0.9) = 0.9^20 = 0.1216, much smaller. That is, with more observations, the maximum is likely to be higher (less likely to be low).
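
A small simulation sketch (not from the slides) agrees with the [FX(m)]^n formula:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, reps = 20, 0.9, 100_000

samples = rng.uniform(0, 1, size=(reps, n))   # reps samples, each of n Uniform[0,1] values
estimate = np.mean(samples.max(axis=1) <= m)  # estimated P(M <= 0.9)
print(estimate, m ** n)                       # both close to 0.1216
```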


Similar idea for the minimum: let K = min(X1, X2, . . . , Xn). Then

P(K ≤ k) = 1 − P(K > k)
         = 1 − P(X1 > k, X2 > k, . . . , Xn > k)
         = 1 − P(X1 > k) P(X2 > k) · · · P(Xn > k)
         = 1 − (1 − FX(k))^n.

Example: if n = 10, Xi ∼ Uniform[0, 1], then P(K ≤ 0.2) = 1 − (1 − 0.2)^10 = 0.8926.


Simulating probability distributions

So far, considered mathematical properties of distributions:

probabilities, densities, cdf’s etc. But some distributions difficult to

understand or use.

Generate random values from a distribution. Uses:

  • approximation of difficult-to-calculate quantities

  • simulation of complex systems

  • generating potential solutions for difficult problems

  • random choices for quizzes, computer games

  • understanding behaviour of samples (chapter 4)


Pseudo-random numbers

In practice, don’t get actual random numbers, but pseudo-random

numbers. These follow recipe, but look random. (Paradox?)

Not so bad, because crucial feature: unpredictable – cannot easily

say what comes next.

Typical method: multiplicative congruential generator. Start with

initial “seed” value R0, then, for n = 0, 1, . . .:

Rn+1 = 106Rn + 1283 (mod 6075)

(“take remainder on division by 6075”).


Eg. start with R0 = 1001:

R1 = 106(1001) + 1283 = 107389 (mod 6075) = 4114

R2 = 106(4114) + 1283 = 437367 (mod 6075) = 6042

R3 = 106(6042) + 1283 = 641735 (mod 6075) = 3860

and so on, with 0 ≤ Ri < 6075.
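
The recipe is a few lines of code; a sketch that reproduces the three values above (the function name lcg is mine):

```python
def lcg(seed, a=106, c=1283, m=6075):
    """Yield pseudo-random integers R1, R2, ... from the congruential recipe."""
    r = seed
    while True:
        r = (a * r + c) % m   # take the remainder on division by 6075
        yield r

gen = lcg(1001)
print([next(gen) for _ in range(3)])   # [4114, 6042, 3860]
```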

Gives up to 6075 different random integers before repeating itself.

Suitable choice of constants gives long “period” and unpredictable

sequence. (Number theory.)

In practice, use much larger constants – get many more possible

random numbers.


Continuous uniform on [0, 1]

To get (pseudo-) random values from Uniform[0, 1], take

pseudo-random integers and divide by maximum. Result has

approx. uniform distribution.

With generator above, max value is 6075, so random uniform values

are 4114/6075 = 0.677, 6042/6075 = 0.995,

3860/6075 = 0.635. (Only 6075 possible values, so only 3 or so

digits trustworthy.)

“Random numbers” in calculators, Excel etc. of this kind.

Random Uniform[0, 1] values are used as building block for

random values from other distributions. Eg. random

Y ∼ Uniform[0, b]: multiply a random Uniform[0, 1] by b.


Bernoulli distribution

Suppose we want to simulate X ∼ Bernoulli(0.4): single trial,

prob. 0.4 of success.

Take single random uniform U . If U ≤ 0.4, take X = 1 (success),

otherwise take X = 0 (failure).

Works because U ≤ 0.4 about 0.4 of the time, so will get

successes about 0.4 of the time (long run).

In general, for X ∼ Bernoulli(θ), take X = 1 if U ≤ θ, 0

otherwise.


Binomial and geometric distributions

If Y ∼ Binomial(n, θ), Y = X1 + X2 + · · ·+ Xn where

Xi ∼ Bernoulli(θ). So just generate n random Bernoullis and

add them up.

Similarly, if Z ∼ Geometric(θ), Z is number of failures (in

Bernoulli trials) before 1st success. So get random value of Z like

this:

1. set Z = 0

2. generate U from Uniform[0, 1]

3. if U ≤ θ, stop with current Z

4. otherwise, add 1 to Z and return to step 2.
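
A sketch of both recipes, built only on Uniform[0, 1] values from random.random() (the helper names are mine):

```python
import random

def bernoulli(theta):
    """1 with probability theta, else 0, from a single uniform."""
    return 1 if random.random() <= theta else 0

def binomial(n, theta):
    """Sum of n independent Bernoulli(theta) values."""
    return sum(bernoulli(theta) for _ in range(n))

def geometric(theta):
    """Number of failures before the first success (the loop above)."""
    z = 0
    while random.random() > theta:   # failure: add 1 to Z and try again
        z += 1
    return z

random.seed(0)
print(binomial(5, 0.4), geometric(0.5))
```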


Inverse-CDF method

Cdf F (x) = P (X ≤ x) defined for all x.

Also, in set of possible X-values (where f(x) > 0), F (x)

invertible: for any p, exactly one x where F (x) = p.

Example: X ∼ Exponential(λ). Then F(x) = 1 − e^(−λx). For x > 0, write p = F(x) and solve for x to get

x = −(1/λ) ln(1 − p).

Then generate a random p from Uniform[0, 1], and put it in the

formula to get a random X .


For instance, if λ = 2, might have p = 0.7, and hence the random X is −(1/2) ln(1 − 0.7) = 0.602.

Why does this work in general?

Let Y be any random variable; let F(y) = P(Y ≤ y) be the cdf of Y. Define the random variable W = F(Y). Then

P(W ≤ w) = P(F(Y) ≤ w) = P(Y ≤ F^(−1)(w)) = F{F^(−1)(w)} = w.

That is, W ∼ Uniform[0, 1] whatever the distribution of Y.


So: to simulate Y, simulate W ∼ Uniform[0, 1], then use the relationship Y = F^(−1)(W) (using the simulated uniform in place of W).

This was done above for the exponential. Called the inverse-CDF method.
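
A sketch of the recipe for the exponential with λ = 2, reproducing the 0.602 from the worked example and then averaging many simulated values:

```python
import math, random

def rexp(lam, p=None):
    """Inverse-CDF sample from Exponential(lam); p defaults to a fresh Uniform[0,1]."""
    if p is None:
        p = random.random()
    return -math.log(1 - p) / lam

print(rexp(2, p=0.7))                    # 0.6020 (the example above)
random.seed(1)
xs = [rexp(2) for _ in range(100_000)]
print(sum(xs) / len(xs))                 # close to the exponential mean 1/2
```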


Also works for discrete. Example: Poisson(0.7) has this cdf:

x 0 1 2 3 4

P (X ≤ x) 0.497 0.844 0.966 0.994 0.999

Procedure: get random U ∼ Uniform[0, 1]. If U ≤ 0.497, take

random X = 0; else if U ≤ 0.844, take X = 1, . . . , else if

U > 0.999, take X = 5.

(Higher values possible, but very unlikely; for more accuracy use

more digits.)


Normal distribution

Difficult to simulate from (cannot invert cdf).

But consider X, Y with a bivariate standard normal distribution, correlation 0. The joint density is

fX,Y(x, y) = (1/(2π)) exp{ −(x^2 + y^2)/2 }.

Thinking of (x, y) as a point in R^2, note that the density depends only on the distance from the origin (r^2 = x^2 + y^2), not on the angle.

So generate random (x, y) pair by generating random angle

θ ∼ Uniform[0, 2π], random distance, separately.

(details: 2-variable transformation using Jacobian determinant.)


Density function for the distance R is

fR(r) = r e^(−r^2/2)

and the cdf is

FR(r) = ∫_0^r t e^(−t^2/2) dt = 1 − e^(−r^2/2)

(eg. use the substitution u = t^2/2, du = t dt).

FR(r) is invertible; let p = FR(r), solve for r to get

r = √(−2 ln(1 − p)).

Get a random R by taking U ∼ Uniform[0, 1] and using it for p above.


Finally, convert the random R, θ to (X, Y) using the polar coordinate formulas

X = R cos θ; Y = R sin θ.

Example: suppose random θ = 1.8 (radians), U = 0.3. Then R = √(−2 ln(1 − 0.3)) = 0.8446. So

X = 0.8446 cos 1.8 = −0.19; Y = 0.8446 sin 1.8 = 0.82.
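
A sketch of the whole recipe (random angle, random distance, then polar coordinates); the function name is mine, and the last lines re-check the worked example:

```python
import math, random

def std_normal_pair():
    """One pair of independent standard normals from two Uniform[0,1] values."""
    theta = 2 * math.pi * random.random()                # random angle on [0, 2*pi]
    r = math.sqrt(-2 * math.log(1 - random.random()))    # distance with cdf 1 - e^(-r^2/2)
    return r * math.cos(theta), r * math.sin(theta)

random.seed(3)
print(std_normal_pair())

r = math.sqrt(-2 * math.log(1 - 0.3))            # the example: U = 0.3, theta = 1.8
print(r, r * math.cos(1.8), r * math.sin(1.8))   # 0.8446, -0.19, 0.82
```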


Rejection methods

Inverse-CDF method doesn't always work – the cdf can be too complicated to invert. Example: X ∼ Gamma(3, 1), with density function

f(x) = (x^2/2) e^(−x).

This has maximum 2e^(−2) = 0.2707 at x = 2. Density "small" beyond x = 10.


Idea: sample a random point (X, Y) in a rectangle enclosing f(x), with 0 ≤ X ≤ 10, 0 ≤ Y ≤ 2e^(−2) (using uniform distributions):

• if point below density function (Y ≤ f(X)), take X as random

value from distribution

• otherwise, reject (X, Y ) pair and try again.

Chance of X-value being accepted proportional to density f(X):

when value more likely in distribution, more likely to be accepted.


Example:

X 7.3 1.0 2.7 1.7 9.4 5.5

Y 0.206 0.130 0.023 0.256 0.197 0.203

f(X) 0.018 0.184 0.245 0.264 0.004 0.062

reject y n n n y y

Values 7.3, 9.4, 5.5 rejected; 1.0, 2.7, 1.7 random values from

Gamma(3, 1).

Needed 12 random uniforms to generate 3 random gammas.
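
A sketch that applies the accept/reject rule to the six (X, Y) pairs above and then wraps the idea in a loop (the function names are mine):

```python
import math, random

def f(x):
    """Gamma(3, 1) density."""
    return x * x * math.exp(-x) / 2

# The six candidate points from the table.
for X, Y in [(7.3, 0.206), (1.0, 0.130), (2.7, 0.023),
             (1.7, 0.256), (9.4, 0.197), (5.5, 0.203)]:
    print(X, "accept" if Y <= f(X) else "reject")   # accepts 1.0, 2.7, 1.7

def rgamma3():
    """Propose uniform points in the enclosing rectangle until one falls under f."""
    while True:
        X = random.uniform(0, 10)
        Y = random.uniform(0, 2 * math.exp(-2))
        if Y <= f(X):
            return X

random.seed(2)
print(rgamma3())
```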


Can be made more sophisticated. Let g(x) be a density function that is easy to sample from, such that f(x) ≤ c g(x) for all x (choose c). Above, g(x) = 1, c = 2e^(−2).

Generate random value X from distribution with density g(x).

Generate random Y ∼ Uniform[0, cg(X)]. If Y ≤ f(X),

accept X ; otherwise, reject and try again.

Efficiency of rejection method greatest when cg(x) only slightly

greater than f(x); then, very little rejection.


Simulation in Minitab

Minitab can generate random values from many distributions (using

methods above or variations).

Basic procedure:

• Select Calc, Random Data

• Select desired distribution

• Fill in number of random values to generate

• Fill in (empty) column to store values

• Fill in parameters of distribution (if any)

• Click OK.


Examples: Uniform[0, 1], Bernoulli(0.4), Binomial(5, 0.4),

Exponential(2), Poisson(0.7), Normal(0, 1).

To generate random values from another distribution, generate

column of values from Uniform[0, 1], then use Calculator to create

desired values (p. 47–48 of manual).

Recall random values actually “pseudo-random”: starting at same

seed value gives same sequence of random values. Can set seed

value in Minitab (Calc, Set Base) to get reproducible random values.


Expectation


Introduction

Game: toss a fair coin, win $2 for a head, lose $1 for a tail.

Amount you win is a random variable W with P(W = 2) = P(W = −1) = 1/2.

Could win or lose on any one play, but (a) winning and losing equally

likely, (b) amount won greater than amount lost.

Would probably play this game given chance, because expect to win

in long run, on average over many plays, even though anything

possible.


Expected value of a random variable is its long-run average. For W above, expect equal numbers of 2's and −1's, so the expected value would be

E(W) = (2 + (−1))/2 = 1/2.

Another: suppose Y = 7 always (ie. P(Y = 7) = 1, P(Y = k) = 0 for k ≠ 7). Then E(Y) should be 7.

Another: roll 2 dice. Win $30 for double 6, lose $1 otherwise. Looks

good because potential win greater than potential loss, but win very

unlikely. How to balance? For winnings random variable V , what is

E(V )?


Expectation for discrete random variables

Define the expected value (expectation) of a random variable X:

E(X) = ∑_x x P(X = x),

"sum of value times probability". Sum over all possible x.

Check for the above examples:

E(W) = 2 · (1/2) + (−1) · (1/2) = 1/2
E(Y) = 7 · 1 = 7
E(V) = 30 · (1/36) + (−1) · (35/36) = −5/36


First 2 as expected.

For V, the prob. of a double 6 is 1/36, so the chance of losing is 1 − 1/36. Even though the prize is large (win $30 for double 6), E(V) < 0, so would lose in the long run, because the win prob is even smaller than the prize is large.

Formula much easier than reasoning out – less thought!

Now suppose X ∼ Bernoulli(θ). What is E(X)?

X = 1 with prob θ, 0 with prob 1− θ, so:

E(X) = 1 · θ + 0 · (1− θ) = θ.

In long run, average X equal to success probability.

Makes sense (think of θ = 0 and θ = 1 as extreme cases).


Expectation for geometric and Poisson distributions

To find more complicated expectations, cleverness can be needed

to figure out sum.

Suppose Z ∼ Geometric(θ), so P(Z = k) = θ(1 − θ)^k. Then

E(Z) = ∑_{k=0}^{∞} k θ(1 − θ)^k = (1 − θ)/θ.

Method: write (1− θ)E(Z) to look like E(Z) but with k − 1 in

place of k, subtract.

Mean is odds against success: if failure 4 times more likely than

success, on average get 4 failures before 1st success.


If X ∼ Poisson(λ), then

E(X) = ∑_{k=0}^{∞} k · e^(−λ) λ^k / k!.

Note that the k = 0 term is 0, so start the sum at k = 1, then let l = k − 1 to get

E(X) = λ ∑_{l=0}^{∞} e^(−λ) λ^l / l!.

The sum is of all the probabilities from a Poisson distribution, so it is 1. (Or, ∑_{l=0}^{∞} λ^l/l! is the Maclaurin series for e^λ.)

So for X ∼ Poisson(λ), E(X) = λ. Thus the parameter λ is in fact the mean.


St Petersburg Paradox

Game: toss a fair coin, let Z be the number of tails before the 1st head. Win 2^Z dollars. Thus for TTTH, win 2^3 = $8. Expected winnings (fair price to pay to play)?

∑_{k=0}^{∞} 2^k · (1/2^k) · (1/2) = ∑_{k=0}^{∞} 1/2 = ∞.

How can this be? Only ever win finite amount.

Play game 10 times:

Z 0 1 0 0 3 0 3 0 6 1

Winnings 1 2 1 1 8 1 8 1 64 2

Mean winnings $8.90, larger than actual winnings 90% of time!
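
A simulation sketch of the game (not from the slides) shows the same behaviour: the running average keeps jumping whenever a rare big payoff occurs, and it does not settle down as the number of plays grows.

```python
import random

def winnings():
    """Play once: count tails before the first head, win 2^Z dollars."""
    z = 0
    while random.random() > 0.5:
        z += 1
    return 2 ** z

random.seed(4)
for n in (100, 10_000, 1_000_000):
    print(n, sum(winnings() for _ in range(n)) / n)   # the average does not settle down
```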


Problem is that any one big payoff completely dominates average,

and by playing game enough times, can make it very likely that a

very big payoff will occur.

If there is a maximum payoff, say $2^30, expectation is finite ($15.50).

When random variable can be arbitrarily large, expectation may not

be finite. But can be finite – compare Poisson, where probabilities

decrease faster than values increase. Similarly, lotteries with very

big prizes still have expected winnings less than ticket price

(because chance of winning big prize small enough).


Utility and Kelly betting

In St Petersburg paradox, expectation didn’t tell story, because “fair

price” ought to be finite. Changing game by a little changed

expected winnings a lot.

Most bets look like this: win known $w if you win, lose $1 if you lose.

Suppose probability of winning is θ. Then expectation is

E = wθ + (−1)(1− θ) = θ(w + 1)− 1

which is positive if θ > 1/(w + 1).

For instance, if w = 2, E > 0 if θ > 1/3. That is, if you believe your chance of winning is better than 1/3, you should bet, because in the long run you win more than you lose.


If bet more than $1, wins and losses increase in proportion: on bet

of $b, win $wb or lose $b.

Positive expectation seems to say “bet everything you have”: far too

risky for most! Always possibility of losing.

Idea: consider utility of money, not same as money itself. If you

only have $10, $1 is a lot of money (has great utility), but if you have

$1 million, $1 almost meaningless.

Utility of money varies between people, but could be proportional to

current fortune. Then, utility of money depends on log of $ amount.


Suppose we currently have $c, and want to choose b for bet above,

assuming all else known. Then fortune after the bet is F = c + bw

if we win (prob θ), F = c− b if we lose (prob 1− θ). Utility idea:

choose b to maximize E(ln F ):

E(ln F ) = θ ln(c + bw) + (1− θ) ln(c− b).

Take the derivative (in b) and set it to 0:

dE(ln F)/db = wθ/(c + bw) − (1 − θ)/(c − b)
            = [θw(c − b) − (1 − θ)(c + bw)] / [(c + bw)(c − b)].

Zero when the numerator is zero; solve for b to get

b = c{θ(w + 1) − 1}/w = cE/w.

This is called the Kelly bet. (If negative, don’t bet anything!)


Examples, with c = 100:

  • w = 9, θ = 1/8. E = θ(w + 1) − 1 = 0.25, so Kelly bet b = 100(0.25)/9 = $2.78.

  • w = 1.5, θ = 1/2. E = 0.25 again; Kelly bet b = 100(0.25)/1.5 = $16.67.

Note: expected winnings same in both cases, but bet less when

w = 9: more risk because less likely to win.

In general, bet fraction of current fortune that is bigger when

expected winnings bigger and chance of winning bigger.
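
The formula b = cE/w is one line of code; a sketch reproducing the two examples (the function name is mine):

```python
def kelly_bet(c, w, theta):
    """Bet that maximizes expected log fortune: b = c*E/w, or 0 if E <= 0."""
    E = theta * (w + 1) - 1          # expected winnings per $1 bet
    return max(c * E / w, 0)

print(kelly_bet(100, 9, 1/8))        # 2.78
print(kelly_bet(100, 1.5, 1/2))      # 16.67
```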


Expectation of functions of random variables

In the St Petersburg problem above, the random variable was the number of tails Z, but the winnings were 2^Z. In effect, found that E(2^Z) was infinite. Method: sum values of 2^Z times probability.

Formally: let g(X) be some function of random variable X. Then

E(g(X)) = ∑_x g(x) P(X = x).


Linearity of expected values

Suppose we have two random variables X, Y . What is

E(X + Y )?

Go back to definition, bearing in mind that X,Y might be related,

so have to use joint probability function:

E(X + Y) = Σ_x Σ_y (x + y) P(X = x, Y = y)
         = Σ_x x P(X = x) + Σ_y y P(Y = y)
         = E(X) + E(Y).

Details: expand out (x + y) in first sum, recognize (eg.) that
Σ_y P(X = x, Y = y) = P(X = x) (marginal distribution).


Same logic shows that E(aX + bY ) = aE(X) + bE(Y ).

Likewise,

E(X1 + X2 + · · ·+ Xn) = E(X1) + E(X2) + · · ·+ E(Xn).

Also, if Y = 1 always, we get E(aX + b) = aE(X) + b.


Expectation for binomial distribution

If Y ∼ Binomial(n, θ), then Y actually sum of Bernoullis:

Y = X1 + X2 + · · ·+ Xn, where Xi ∼ Bernoulli(θ).

Know that E(Xi) = θ, so (by result on previous page)

E(Y ) = θ + θ + · · ·+ θ = nθ.

Makes sense: eg. if you succeed on one-third of trials on average

(θ = 1/3), and you have n = 30 trials, you’d expect 10 successes,

and nθ = 10.


Independence and E(XY )

Since E(X + Y ) = E(X) + E(Y ) for all X and Y , tempting to

claim that E(XY ) = E(X)E(Y ). But is this true?

Consider this joint distribution:

          Y = 1   Y = 2   Total
  X = 0    1/3     1/6     1/2
  X = 1    1/4     1/4     1/2
  Total    7/12    5/12      1

Using marginal distributions, E(X) = 1/2 and E(Y) = 17/12. What is
E(XY)?


When X = 0, XY = 0 for all Y. So P(XY = 0) = 1/3 + 1/6 = 1/2.
XY = 1 when X = 1, Y = 1, so P(XY = 1) = 1/4. Likewise,
XY = 2 when X = 1, Y = 2, so P(XY = 2) = 1/4. Hence

E(XY) = 0 · (1/2) + 1 · (1/4) + 2 · (1/4) = 3/4.

But

E(X)E(Y) = (1/2) · (17/12) = 17/24 ≠ 3/4.

So E(XY) ≠ E(X)E(Y) in general.
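A quick Python check of this table (illustrative only; not part of the original slides):

from fractions import Fraction as F

joint = {(0, 1): F(1, 3), (0, 2): F(1, 6), (1, 1): F(1, 4), (1, 2): F(1, 4)}
EX = sum(x * p for (x, y), p in joint.items())       # 1/2
EY = sum(y * p for (x, y), p in joint.items())       # 17/12
EXY = sum(x * y * p for (x, y), p in joint.items())  # 3/4
print(EXY == EX * EY)                                # False: 3/4 is not 17/24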


But what if X,Y independent? Then

E(XY) = Σ_x Σ_y x y P(X = x) P(Y = y) = E(X)E(Y),

rearranging, because joint prob is product of marginals.

So, if X, Y independent, then E(XY ) = E(X)E(Y ), but not

necessarily otherwise.

See later (in “covariance”) that difference E(XY )− E(X)E(Y )

measures extent of non-independence of X and Y .


Monotonicity of expectation

Suppose X, Y discrete random variables such that X ≤ Y . (That

is, for any event giving X = x and Y = y, x ≤ y always.

Example: roll 2 dice, let X be score on 1st die, Y be total score on 2

dice.)

How do E(X), E(Y ) compare?

Idea: let Z = Y −X . Then Z ≥ 0, discrete, and

E(Z) = Σ_{z≥0} z P(Z = z). All terms in sum positive or 0, so

E(Z) ≥ 0. But E(Z) = E(Y −X) = E(Y )− E(X). Hence

E(Y )− E(X) ≥ 0.

Conclusion: if X ≤ Y , then E(X) ≤ E(Y ).


Expectation for continuous random

variables

Can’t use formula

E(X) = Σ_x x P(X = x)

because probability of particular value not meaningful for continuous

X .

Standard procedure: replace probability by density function, replace

sum by integral.


That is, if X continuous random variable, define

E(X) = ∫_{−∞}^{∞} x f(x) dx.

In integral, replace infinite limits by actual upper and lower limits.


Examples

Suppose X ∼ Uniform[0, 1], so f(x) = 1, 0 ≤ x ≤ 1. Then

E(X) = ∫_0^1 x · 1 dx = [x²/2]_0^1 = 1/2.

As you would have guessed.

Suppose W ∼ Exponential(λ). Then

E(W) = ∫_0^∞ w λe^{−λw} dw.

Integrate by parts with u = w, v′ = λe^{−λw}: E(W) = 1/λ.

If W represents time between events, E(W ) in units of time, so λ

in units of 1 / time: a rate, number of events per unit time.


Suppose Z ∼ N(0, 1), so f(z) = (1/√(2π)) e^{−z²/2}. Then

E(Z) = ∫_{−∞}^{∞} (1/√(2π)) z e^{−z²/2} dz.

Replacing z by −z gives the negative of the integrand, ie. z f(z) is an
odd function. Hence integral is 0, so E(Z) = 0. (Alternative:
substitute u = z²/2.)


As for discrete, expectation may not be finite.

f(x) = 1/x², x ≥ 1 is a proper density, but for random variable X
with this distribution:

E(X) = ∫_1^∞ x · (1/x²) dx = ∫_1^∞ (1/x) dx = [ln x]_1^∞ = ∞.

Problem: though density decreases as x increases, does not do so

fast enough to make E(X) integral converge.


Properties of expectation for continuous random

variables

These are same as for discrete variables. Proofs use integrals and

densities not sums, but otherwise very similar. Suppose X has

density fX(x) and X,Y have joint density fX,Y (x, y):

• E(g(X)) = ∫_{−∞}^{∞} g(x) fX(x) dx

• E(h(X, Y)) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x, y) fX,Y(x, y) dx dy.

• E(aX + bY ) = aE(X) + bE(Y )

• If X,Y independent, then E(XY ) = E(X)E(Y )

• If X ≤ Y , then E(X) ≤ E(Y ).


Expectations for general uniform and normal

distributions

Suppose X ∼ Uniform[a, b]. Then

U = (X − a)/(b − a) ∼ Uniform[0, 1], so E(U) = 1/2.

Write in terms of X : X = a + (b− a)U , so

E(X) = a + (b− a)E(U) = (a + b)/2. Again as expected.

Now suppose X ∼ Normal(µ, σ2). Then

Z = (X − µ)/σ ∼ N(0, 1). Write X = µ + σZ ; then

E(X) = µ + σE(Z) = µ + σ(0) = µ.

That is, parameter µ in normal distribution is the mean.


Variance, covariance and correlation

Compare random variables:

Z = 10 with prob 1; Y = 5 or 15, each with prob 1/2.

E(Z) = E(Y ) = 10, but Y further from mean than Z .

Expectation only gives long-run average of random variable, not how

much higher/lower than average it could be. For this, use variance:

Var(X) = E[(X − µX)2], µX = E(X).


For discrete X, Var(X) = Σ_x (x − µX)² P(X = x). So:

Var(Z) = (10 − 10)² · 1 = 0;
Var(Y) = (5 − 10)² · (1/2) + (15 − 10)² · (1/2) = 25.

Here, Var(Y ) > Var(Z) because Y tends to be further from its

mean than Z does.

(Here, Y always further from mean than Z . But in general,

Var(Y ) > Var(Z) means Y likely to be further from mean than

Z .)


More about variance

Because (X − µX)2 ≥ 0, Var(X) ≥ 0 for all random variables

X .

Var(X) = 0 only if X does not vary (compare Z). No upper limit

on variance; larger variance means more unpredictable (can get

further from mean).

Why square? Cannot just omit: E(X − µX) = E(X)− µX = 0

always. Absolute value E(|X − µX |) possible, but hard to work

with (not differentiable).


Standard deviation

If random variable X in metres, Var(X) in metres-squared. For

interpretation, suggests using square root of variance:

SD(X) = √Var(X)

which would be in metres. Called standard deviation of X .

SD easier for interpretation, variance easier for algebra.


Variance of Bernoulli

If X ∼ Bernoulli(θ), E(X) = θ, and

Var(X) = Σ_x (x − θ)² P(X = x)
       = (1 − θ)² θ + (0 − θ)² (1 − θ)
       = θ(1 − θ)(1 − θ + θ) = θ(1 − θ).

This is 0 if θ = 0 or 1 (when results completely predictable) and
maximum, 1/4, when θ = 1/2.


Useful properties of variance

Var(aX + b) = a² Var(X).

Because variance in squared units, changing X eg. from metres to

feet multiplies variance not by 3.3 but by that squared.

Also, adding b changes mean of X , but doesn’t change how spread

out distribution is (shifts left/right).

Var(X) = E(X²) − µX².

Useful result for finding variances in practice, since E(X2) not

usually too hard.


Proofs: use definition of variance as expectation, then rules of

expectation.

Bernoulli revisited: E(X²) = 1²·θ + 0²·(1 − θ) = θ, so
Var(X) = θ − θ² = θ(1 − θ) as before.


Variance of exponential distribution

For continuous distributions, find E(X2) or variance using integral.

W ∼ Exponential(λ): already know E(W ) = 1/λ. Find

Var(W ) by first finding E(W 2), using integration by parts:

E(W²) = ∫_0^∞ w² λe^{−λw} dw = [−w² e^{−λw}]_0^∞ + (2/λ) ∫_0^∞ w λe^{−λw} dw.

Square brackets 0; integral is E(W) = 1/λ. Hence
E(W²) = (2/λ)(1/λ) = 2/λ², and

Var(W) = 2/λ² − (1/λ)² = 1/λ².

For exponential distribution, variance is square of mean.


Variance of normal random variable

Suppose Z ∼ N(0, 1). Know that E(Z) = 0, so

Var(Z) = E(Z²) − 0² = E(Z²). Thus

Var(Z) = ∫_{−∞}^{∞} z² (1/√(2π)) e^{−z²/2} dz.

To tackle by parts: let u = z/√(2π), v′ = z e^{−z²/2}. v′ has
antiderivative v = −e^{−z²/2}. Gives

Var(Z) = [−(z/√(2π)) e^{−z²/2}]_{−∞}^{∞} + ∫_{−∞}^{∞} (1/√(2π)) e^{−z²/2} dz.

Square bracket 0 (e^{−z²/2} → 0 very fast); integral is that of density of
Z, so 1. Hence Var(Z) = 1.

Suppose now X ∼ N(µ, σ²). Then Z = (X − µ)/σ, so
X = µ + σZ. So Var(X) = σ² Var(Z) = σ². That is,
parameter σ² in normal distribution is variance.


Covariance

Consider discrete joint distribution:

          Y = 1   Y = 2   sum
  X = 0    0.4     0.2    0.6
  X = 1    0.1     0.3    0.4
  sum      0.5     0.5    1.0

If X = 0, Y more likely to be small; if X = 1, Y more likely to be

large. X, Y vary together.

Idea: covariance Cov(X, Y ) = E[(X − µX)(Y − µY )].


Here, µX = E(X) = 0.4, µY = E(Y ) = 1.5, so take all

combinations of (X − µX , Y − µY ) values and their probs:

Cov(X, Y)
= (0 − 0.4)(1 − 1.5)(0.4) + (0 − 0.4)(2 − 1.5)(0.2)
+ (1 − 0.4)(1 − 1.5)(0.1) + (1 − 0.4)(2 − 1.5)(0.3)
= 0.08 − 0.04 − 0.03 + 0.09 = 0.10.

Result positive. (X, Y ) combinations where (X − µX)(Y − µY )

positive outweigh those where negative. That is, when X large, Y

more likely to be large as well (and small with small).

Covariance can be negative: then large X goes with small Y and

vice versa. Covariance 0: no trend.


Calculating covariances

Useful formula:

Cov(X, Y ) = E(XY )− E(X)E(Y ).

Proof: definition of covariance, properties of expectation.

Previous example revisited:

E(XY ) = (0)(1)(0.4)+(0)(2)(0.2)+(1)(1)(0.1)+(1)(2)(0.3) = 0.7;

Cov(X, Y ) = 0.7− (0.4)(1.5) = 0.1.

As with corresponding variance formula, useful for calculations.
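A hedged Python sketch of the same calculation (illustrative only; the probabilities are those in the table above):

joint = {(0, 1): 0.4, (0, 2): 0.2, (1, 1): 0.1, (1, 2): 0.3}
EX = sum(x * p for (x, y), p in joint.items())       # 0.4
EY = sum(y * p for (x, y), p in joint.items())       # 1.5
EXY = sum(x * y * p for (x, y), p in joint.items())  # 0.7
print(EXY - EX * EY)                                 # 0.10 (up to rounding)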


Covariance and independence

If X,Y independent, then E(XY ) = E(X)E(Y ), so

Cov(X, Y ) = E(XY )− E(X)E(Y ) = 0.

But covariance could be 0 without independence. Example:

(X, Y) = (−1, 1), (0, 0), (1, 1), each prob 1/3. E(X) = 0,
E(Y) = 2/3, E(XY) = (−1)(1/3) + (0)(1/3) + (1)(1/3) = 0, so
Cov(X, Y) = 0 − (0)(2/3) = 0. But X, Y not independent: given
X, know Y exactly.

Relationship between X, Y not a trend: as X increases, Y

decreases then increases. No general statement about Y

large/small as X increases.

Fact: if X,Y bivariate normal, covariance 0 implies independence.


Variance of sum

Previously found that E(X + Y ) = E(X) + E(Y ) for all X,Y .

Corresponding formula for variances?

Derive formula for Var(X + Y ) by writing as expectation,

expanding out square, recognizing terms:

Var(X + Y ) = Var(X) + Var(Y ) + 2 Cov(X,Y ).

Logic: if Cov(X,Y ) > 0, X, Y big/small together, sum could be

very big/small, variance large. If Cov(X, Y ) < 0, large X

compensates small Y and vice versa, sum of moderate size,

variance small.

If X,Y independent, then Var(X + Y ) = Var(X) + Var(Y ).


Variance of binomial distribution

Suppose X ∼ Binomial(n, θ). Then can write

X = Y1 + Y2 + · · ·+ Yn,

where Yi ∼ Bernoulli(θ) independently. So

Var(X) = Var(Y1) + Var(Y2) + · · ·+ Var(Yn)

= θ(1− θ) + θ(1− θ) + · · ·+ θ(1− θ)

= nθ(1− θ).

Variance increases as n increases (fixed θ) because range of

possible #successes becomes wider.


Correlation

Covariance hard to interpret. Eg. size of a positive covariance says
little about strength of the X, Y relationship.

Suppose X height (metres), Y weight (kg). Units of covariance m

× kg. Measure height in inches, weight in lbs: covariance in

different units.

Try for scale-free quantity. Covariance measures how X, Y vary

together: suggests use of variances. Var(X) in m², Var(Y) in kg², so
right scaling is by sq root of each. Define correlation:

Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)).


Example: (X, Y) = (0, 1), (1, 3), each prob 1/2.

E(X) = 0.5, E(Y) = 2; XY = 0, 3 each prob 1/2, so
Cov(X, Y) = 3/2 − (0.5)(2) = 1/2.

Also, Var(X) = 1/4, Var(Y) = 1, so

Corr(X, Y) = (1/2) / √((1/4)(1)) = 1.

When X larger (1 vs. 0), Y also larger (3 vs. 1) for certain: a perfect

trend. So this should be largest possible correlation.

(Proof later: Cauchy-Schwartz inequality.)


More about correlation

Smallest possible correlation is −1, when larger X always goes
with smaller Y (eg. (X, Y) = (0, 1), (1, −3), each prob 1/2).

If X,Y independent, covariance 0, so correlation 0 also.

In-between values represent in-between trends. Eg.

Corr(X, Y ) = 0.5: larger X with larger Y most of the time, but

not always.

Correlation actually measures extent of linear relationship between

random variables. X, Y in example related by Y = 2X + 1.

Perfect nonlinear relationship won’t give correlation±1.


Viewing correlation by simulation

Useful to have sense of what correlation “looks like”.

Generate random normals with required correlation, plot.

Suppose X, Y ∼ N(0, 1) independently. Then use X and
Z = αX + Y for suitable choice of α: correlated if α ≠ 0
because X in both. Can show Cov(X, αX + Y) = α and
Corr(X, αX + Y) = α/√(1 + α²).

Choose α to get desired correlation ρ: α = ±ρ/√(1 − ρ²).
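The slides do this simulation in another package; a minimal Python sketch of the same idea (seed, sample size and ρ = 0.95 are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
rho = 0.95
alpha = rho / np.sqrt(1 - rho**2)
x = rng.standard_normal(1000)
z = alpha * x + rng.standard_normal(1000)
print(np.corrcoef(x, z)[0, 1])   # sample correlation, close to 0.95
# plotting z against x gives pictures like the ones on the next slides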


Correlation 0.95: [scatterplot of z against x for the simulated pairs]


Correlation -0.8: [scatterplot of z against x for the simulated pairs]


Correlation 0.5: [scatterplot of z against x for the simulated pairs]


Correlation -0.2: [scatterplot of z against x for the simulated pairs]


Moment-generating functions

Means and variances (and eg. E(X³)) can be messy: each one
needs an integral (sum) to be solved. Would be nice to have a function
that gives E(X^k) more easily than by integration (summing).

Consider mX(s) = E(e^{sX}). Function of s.

Maclaurin series for exp function:

mX(s) = E(1) + s E(X) + (s²/2!) E(X²) + (s³/3!) E(X³) + · · · .


Differentiate both sides (as function of s):

m′X(s) = E(X) + s E(X²) + (s²/2!) E(X³) + · · ·

Putting s = 0 gives m′X(0) = E(X). Differentiate again:

m′′X(s) = E(X²) + s E(X³) + · · ·

so that m′′X(0) = E(X²).

By same process, find E(X^k) by differentiating mX(s) k times,
and setting s = 0. Differentiating easier than integrating!

E(X^k) called k-th moment of distribution of X; function mX(s),

used to get moments, called moment generating function for X .


If X discrete,

mX(s) = E(e^{sX}) = Σ_x e^{sx} P(X = x)

and if X continuous,

mX(s) = E(e^{sX}) = ∫_{−∞}^{∞} e^{sx} fX(x) dx.


Examples of moment generating functions

Bernoulli is easiest of all:

mX(s) = e^{s·0} P(X = 0) + e^{s·1} P(X = 1) = 1 − θ + θe^s.

So:

m′X(s) = θe^s ⇒ E(X) = θ
m′′X(s) = θe^s ⇒ E(X²) = θ

and indeed E(X^k) = θ for all k. Also,

Var(X) = E(X²) − [E(X)]² = θ − θ² = θ(1 − θ).


Now try X ∼ Exponential(λ), continuous:

mX(s) = E(e^{sX}) = ∫_0^∞ e^{sx} λe^{−λx} dx = λ(λ − s)^{−1}

after some algebra. (Requires s < λ.)

m′X(s) = λ(λ − s)^{−2}, so E(X) = m′X(0) = 1/λ.

m′′X(s) = 2λ(λ − s)^{−3}, so E(X²) = m′′X(0) = 2/λ². Hence

Var(X) = 2/λ² − (1/λ)² = 1/λ².
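A sympy check of this mgf calculation (illustrative; the symbol names are arbitrary):

import sympy as sp

s, lam = sp.symbols('s lam', positive=True)
m = lam / (lam - s)                  # mgf of Exponential(lam), valid for s < lam
EX = sp.diff(m, s, 1).subs(s, 0)     # 1/lam
EX2 = sp.diff(m, s, 2).subs(s, 0)    # 2/lam**2
print(sp.simplify(EX2 - EX**2))      # variance: equals 1/lam**2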


More about moment-generating functions

If X ∼ Poisson(λ), then

mX(s) = e^{λ(e^s − 1)}.

If X ∼ N(0, 1), then

mX(s) = e^{s²/2}.

Facts:

• If X, Y independent, mX+Y(s) = mX(s) mY(s). (Mgf of sum is product of
  moment-generating functions.)

• maX+b(s) = e^{bs} mX(as). (Mgf of linear function related to
  mgf of original random variable.)


Proofs from definition.

First result very useful: distribution of sum very difficult to find, but

can get moments for sum much more easily.

If X ∼ Binomial(n, θ), then X = Y1 + Y2 + · · · + Yn where
the Yi ∼ Bernoulli(θ) independently. Hence

mX(s) = [mYi(s)]^n = (1 − θ + θe^s)^n.

If X ∼ N(µ, σ²), X = µ + σZ where Z ∼ N(0, 1). Thus

mX(s) = mσZ+µ(s) = e^{µs} mZ(σs) = e^{µs + σ²s²/2}.


Using mgfs to recognize distributions

Important result, called uniqueness theorem. Suppose X has mgf

finite for−s0 < s < s0; suppose mX(s) = mY (s) for

−s0 < s < s0. Then X , Y have same distribution.

In other words: if mgf of X is that of known distribution, then X

must have that distribution.

Example: X, Y ∼ Poisson(λ) independently. X + Y has mgf

mX+Y(s) = {e^{λ(e^s − 1)}}² = e^{2λ(e^s − 1)}.

This is mgf of Poisson(2λ), so X + Y ∼ Poisson(2λ).


Conditional Expectation

Consider this joint distribution (Ex. 3.5.2):

          X = 5   X = 8   sum
  Y = 0    1/7     3/7    4/7
  Y = 3    1/7      0     1/7
  Y = 4    1/7     1/7    2/7
  sum      3/7     4/7     1

X, Y related: if Y = 0, then X more likely to be 8.


Suppose Y = 3. Then P(X = 5|Y = 3) = (1/7)/(1/7) = 1,
P(X = 8|Y = 3) = 0/(1/7) = 0. If Y = 3, then X certain to be
5, so E(X|Y = 3) = 5.

Now suppose Y = 4:

P(X = 5|Y = 4) = (1/7) / (1/7 + 1/7) = 1/2 = P(X = 8|Y = 4).

If Y = 4, then average X is E(X|Y = 4) = 5 · (1/2) + 8 · (1/2) = 6.5.

Likewise, E(X|Y = 0) = 7.25.
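A short Python version of these conditional expectations (illustrative only; the probabilities are the table above):

from fractions import Fraction as F

joint = {(5, 0): F(1, 7), (8, 0): F(3, 7), (5, 3): F(1, 7),
         (8, 3): F(0, 1), (5, 4): F(1, 7), (8, 4): F(1, 7)}

def cond_exp(y):
    # E(X | Y = y): restrict to the row Y = y and renormalize
    p_y = sum(p for (x, yy), p in joint.items() if yy == y)
    return sum(x * p for (x, yy), p in joint.items() if yy == y) / p_y

print([cond_exp(y) for y in (3, 4, 0)])   # 5, 13/2, 29/4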


These expectations from conditional distribution called conditional

expectations. E(X|Y = y) varies from 5 to 7.25 depending on

value of Y ; “on average, X depends on Y ”.

In general, if X, Y related, then mean of X depends on Y .

Calculate conditional distribution of X|Y , find X-expectation. This

is conditional expectation.


Conditional expectation: continuous case

Same principle: find expectation of conditional distribution. Now use

joint and marginal densities to find conditional density; then

integrate to get expectation.

Example: fX,Y(x, y) = 4x²y + 2y⁵, 0 ≤ x, y ≤ 1.

Conditional density fX|Y(x, y) = fX,Y(x, y)/fY(y). So first find
marginal density fY(y) by integrating out x from joint density:
fY(y) = (4/3)y + 2y⁵. Has no x. Hence

fX|Y(x, y) = (4x²y + 2y⁵) / ((4/3)y + 2y⁵).


Note: only x in numerator, so not so hard. Thus

E(X|Y = y) = ∫_0^1 x · (4x²y + 2y⁵) / ((4/3)y + 2y⁵) dx = (1 + y⁴) / (4/3 + 2y⁴).

Depends slightly on Y : E(X|Y = 0) = 0.75,

E(X|Y = 0.5) = 0.729, E(X|Y = 1) = 0.6. As Y increases,

X decreases, on average.


Conditional expectations as random variables

Without particular Y -value in mind, can define E(X|Y ) by taking

E(X|Y = y) and replacing y by Y . Above example:

E(X|Y) = (1 + Y⁴) / (4/3 + 2Y⁴).

This kind of conditional expectation is random variable (function of

random variable Y ).


As random variable, E(X|Y ) must have expectation,

E[E(X|Y )]. What is it? Directly, as function of y:

E[E(X|Y)] = ∫_0^1 E(X|Y = y) fY(y) dy = 2/3

(much cancellation). Now: marginal density of X is
fX(x) = 2x² + 1/3 (integrate out y from joint density), so

E(X) = ∫_0^1 x (2x² + 1/3) dx = 2/3 = E[E(X|Y)].

Not a coincidence. Illustrates theorem of total expectation:

E[E(X|Y )] = E(X). In words: effect of varying Y is to change

E(X|Y ), but E[E(X|Y )] averages out these effects, leaving only

overall average of X .


Conditional variance

Conditional variance is variance of conditional distribution.

Return to previous discrete example:

          X = 5   X = 8   sum
  Y = 0    1/7     3/7    4/7
  Y = 3    1/7      0     1/7
  Y = 4    1/7     1/7    2/7
  sum      3/7     4/7     1

If Y = 3, X certain to be 5, so Var(X|Y = 3) = 0.

But if Y = 4, X equally likely 5 or 8; Var(X|Y = 4) = 2.25.


(Calculation: E(X|Y = 4) = 6.5, E(X²|Y = 4) = 44.5,
Var(X|Y = 4) = 44.5 − (6.5)² = 2.25.)

Another expression of how Y affects X . If know Y = 3, know X

exactly, but if Y = 4, more uncertain about possible X .


Inequalities relating probability, mean and

variance

Mean and variance closely related to probabilities. There are general
relationships, true for a wide range of random variables and
distributions.

Markov inequality: If X cannot be negative, then

P(X ≥ a) ≤ E(X)/a.

In words: if mean small, X unlikely to be very large.


Chebychev inequality:

P(|Y − µY| ≥ a) ≤ Var(Y)/a².

In words: if variance small, Y unlikely to be far from mean.

(Variations in spelling: best English transliteration from Russian

probably “Chebyshov”.)


Example: suppose X = 0, 1, 2 each with probability 1/3. Then
E(X) = 1, E(X²) = 5/3, so Var(X) = 2/3.

Markov with a = 1.5 says P(X ≥ 1.5) ≤ 1/1.5 = 2/3. Actual
P(X ≥ 1.5) = P(X = 2) = 1/3, which is indeed ≤ 2/3.

Chebychev with a = 0.9:
P(|X − 1| ≥ 0.9) ≤ (2/3)/(0.9)² = 0.823. Actual
P(|X − 1| ≥ 0.9) = P(X ≤ 0.1) + P(X ≥ 1.9) = P(X = 0) + P(X = 2) = 2/3.

Bounds from Markov and Chebychev inequalities often not very

close to truth, but guaranteed, so can use inequalities to prove

results.


Proof of Markov inequality

Uses idea that if Z ≤ X , then E(Z) ≤ E(X).

Define random variable Z = a if X ≥ a, 0 otherwise. Because

X ≥ 0, value of Z always≤ that of X : Z ≤ X .

E(Z) = aP (X ≥ a) + 0P (X < a) = aP (X ≥ a).

But Z ≤ X so E(Z) ≤ E(X) and therefore

aP (X ≥ a) ≤ E(X). Divide both sides by a. Done.


Proof of Chebychev inequality

This uses Markov’s inequality with clever choice of random variable.

Let X = (Y − µY)²; X ≥ 0. Then Markov’s inequality (with a²
replacing a) says

P(X ≥ a²) ≤ E(X)/a²  ⇒  P[(Y − µY)² ≥ a²] ≤ E[(Y − µY)²]/a².

In last inequality, E[·] is Var(Y). On left, both quantities inside the probability are
≥ 0, so can square-root both sides. Gives

P(|Y − µY| ≥ a) ≤ Var(Y)/a²

which is Chebychev’s inequality. Done.


Cauchy-Schwartz and Jensen inequalities

Cauchy-Schwartz:

|Cov(X, Y)| ≤ √(Var(X) Var(Y))  ⇒  |Corr(X, Y)| ≤ 1.

Proof: page 188 of text. Idea, for X, Y having mean 0: write
E[(X − λY)²] in terms of variances and covariances; result must
be ≥ 0.

Jensen’s inequality relates E(g(X)) and g(E(X)). Specifically,

if g(x) is concave up (that is, g′′(x) > 0), then

g(E(X)) ≤ E(g(X)).


Proof: Tangent line to concave-up function always≤ function

(picture). Consider tangent line to g(x) at x = E(X); suppose

equation is a + bx. Then g(E(X)) = a + bE(X). Also, line

≤ g(x) everywhere else, so

a + bX ≤ g(X) ⇒ E(a + bX) ≤ E(g(X))

⇒ a + bE(X) ≤ E(g(X))

⇒ g(E(X)) ≤ E(g(X)).

Done.

(Note: text uses “convex” for “concave up”.)


Consequences of Jensen’s inequality

Take g(x) = x². Then (E(X))² ≤ E(X²). But
Var(X) = E(X²) − (E(X))² ≥ 0, so knew that anyway.

Another: suppose X = 1, 2, 3, each prob 1/3. Then E(X) = 2.
But get another kind of average by multiplying the 3 possible values and
taking 3rd root. This is called the geometric mean. Here it is
(1·2·3)^{1/3} = 1.817. Ordinary mean greater than geometric mean.

Look at log of geometric mean:

ln{(1·2·3)^{1/3}} = (1/3) ln(1·2·3) = (1/3)(ln 1 + ln 2 + ln 3) = E(ln X).

Thus geometric mean is e^{E(ln X)}.


Jensen: − ln x is concave up for x > 0, so

− ln(E(X)) ≤ E(− ln X)  ⇒  ln(E(X)) ≥ E(ln X).

Exponentiate both sides (e^{ln y} = y):

E(X) ≥ e^{E(ln X)}.

This says that for any positive random variable X, the ordinary
mean will always be ≥ the geometric mean.


Sampling Distributions and

Limits


Introduction: roulette

See http://tinyurl.com/238p5 for intro to game.

Basic idea: bet on number or number combination. Roulette wheel

spun, one number is winner. Your bet wins if it contains winning

number.

Wheel also contains numbers 0, 00. Winning bets paid as if 0, 00

absent (advantage to casino).

Bet 1: “high number”: win with 19–36, lose otherwise. Bet $1, win
$1 if you win. Let W be winnings on one play; P(W = 1) = 18/38,
P(W = −1) = 20/38. Then

E(W) = 1 · (18/38) + (−1) · (20/38) = −2/38 ≈ −$0.05.


Bet 2: “lucky number”: win if 24 comes up, lose otherwise. Win $35

for $1 bet. Now P (W = 35) = 1/38, P (W = −1) = 37/38, so

E(W) = 35 · (1/38) + (−1) · (37/38) = −2/38 ≈ −$0.05.

In both bets, lose 5 cents per $ bet in long run.

Play game not once but many times. Interested in total winnings, or
mean winnings per play. Let Wi be winnings on play i; then mean
winnings per play Mn over n plays is

Mn = (1/n) Σ_{i=1}^{n} Wi.

Investigate behaviour of Mn by simulation.
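The slides’ simulations were done in other software; a minimal Python sketch of the same idea (the seed and number of plays are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
n = 1000
high = rng.choice([1, -1], size=n, p=[18/38, 20/38])    # high-number winnings
lucky = rng.choice([35, -1], size=n, p=[1/38, 37/38])   # lucky-number winnings
plays = np.arange(1, n + 1)
M_high = np.cumsum(high) / plays     # running mean winnings per play
M_lucky = np.cumsum(lucky) / plays
print(M_high[-1], M_lucky[-1])       # both near E(W) = -2/38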


High-number, 30 plays: [plot of Mn against n]


High-number, 1000 plays: [plot of Mn against n]


Lucky-number, 1000 plays: [plot of Mn against n]


Notes about roulette simulation

1st graph: in high-number bet, fortune goes up/down by $1 per play;

winnings/play pattern similar. On this sequence, in profit after 30

plays, but losing after 15.

2nd graph: same bet, 1000 plays. Less fluctuation after more trials;

winnings per play apparently tending to dotted line, E(W ). (Other

simulations have different shape but similar end behaviour.)

3rd graph: lucky-number bet, 1000 plays. Large jump upwards on

each win. Picture more erratic than for high-number bet; long-term

behaviour not clear yet. (Need more plays.)


Understanding Mn mathematically: mean, variance

Mn = (1/n) Σ_{i=1}^{n} Wi

is a sum. Wi in sum independent, each same distribution (one spin of
wheel has no effect on other spins). So can calculate E(Mn) and
Var(Mn).

Already found E(Wi) = −2/38 for both our bets.

Find variances for bets: for high-number bet, Var(Wi) = 0.9972;

for lucky-number bet Var(Wi) = 33.21.


For mean:

E(Mn) = (1/n) Σ_{i=1}^{n} E(Wi) = (1/n) Σ_{i=1}^{n} (−2/38) = −2/38,

since there are n terms in the sum, all the same.

That is, regardless of how long you play, you will lose 5 cents per $

bet on average.


Var(Mn) = (1/n²) Σ_{i=1}^{n} Var(Wi) = Var(Wi)/n.

Sum has n terms all equal to variance of one play’s winnings. So for

high-number bet, Var(Mn) = 0.9972/n, for lucky-number bet,

Var(Mn) = 33.21/n.

For any particular n, variance for high-number bet lower. Supports

simulation: high-number bet results more predictable.

In both cases, as n →∞, Var(Mn) → 0. Longer you play, more

predictable Mn is.


Distribution of Mn

Mean and variance not whole story – want to know things like

P (Mn > 0) (chance of profit). For this, need distribution of Mn.

Start with M2 (2 plays). Do lucky-number bet (P(W = 35) = 1/38,
P(W = −1) = 37/38).

4 possibilities:

• win both times. M2 = (35 + 35)/2 = 35;
  P(M2 = 35) = (1/38)² = 1/1444 ≈ 0.0007.

• win on 1st, lose on 2nd. M2 = (35 + (−1))/2 = 17; prob is
  (1/38)(37/38) = 37/1444.


• lose on 1st, win on 2nd. Again M2 = 17 and prob is same as
  above. Thus overall P(M2 = 17) = 74/1444 ≈ 0.0512.

• lose on both. M2 = ((−1) + (−1))/2 = −1;
  P(M2 = −1) = (37/38)² = 1369/1444 ≈ 0.9480.

Calculation complicated, even for n = 2, because have to consider

all possible combinations.

In general: this kind of distribution very difficult to find exactly. So

look for approximations to it.


Sampling distributions

Suppose X1, X2, . . . , Xn are random variables, each independent

and with same distribution. For example:

• Xi is winnings from i-th play of a roulette bet.

• Xi is height of i-th randomly chosen Canadian.

• Xi = 1 if randomly chosen voter supports Liberal party,

Xi = 0 otherwise.

• Xi is randomly generated value from a distribution with density

fX(x).

In each case: underlying phenomenon of interest, collect data at

random to help understand phenomenon.


Summarize Xi values using random variable

Yn = h(X1, X2, . . . , Xn) for some function h (eg. mean, like

Mn).

Some jargon:

• total collection of individuals (all possible spins of roulette

wheel, all Canadians, all possible values) called population.

• particular individuals selected, or Xi values obtained from

them, called sample.

• Yn defined above called sample statistic.

Usually don’t know about population, so draw conclusion about it

based on sample.


First: opposite problem: if we know population, find out what

samples from it look like.

“At random” important, and specific. Each individual value in

population must have correct chance of being in sample (same

chance, for human populations), and each must be in sample or not

independently of others.

Aim: learn about distribution of Yn, called sampling distribution.

General statements difficult. Approach: find what happens as

n →∞, then use result as approximation for finite n.


Convergence in probability; weak law of

large numbers

In mathematics, accustomed to convergence ideas. Eg. if

an = 1− 1/n, so that a1 = 0, a2 = 12, a3 = 2

3, etc., an → 1

(converges to 1) as n →∞ because, by taking n large enough, all

values after an as close to 1 as desired.

For sequence X1, X2, . . . of random variables, what is meaning of

Xn → Y , where Y is random variable?


Different possibilities. One idea: “prob of Xn being far from Y goes

to 0 as n gets large”. Leads to definition:

Sequence {Xn} converges in probability to Y if, for all ε > 0,
lim_{n→∞} P(|Xn − Y| ≥ ε) = 0. Notation: Xn →P Y.

Example: suppose U ∼ Uniform[0, 1]. Let Xn = 3 when
U ≤ (2/3)(1 − 1/n) and 8 otherwise.

Thus when n = 1, X1 must be 8. If U > 2/3, Xn remains 8 forever,
but if U ≤ 2/3, U ≤ (2/3)(1 − 1/n) eventually, so Xn becomes 3 for
some n, then remains 3 forever.

(Cannot know which will happen since U random variable.)


Now define Y = 3 if U ≤ 2/3 and Y = 8 otherwise. Same as
“eventual” Xn, so should have Xn →P Y. Correct?

P(|Xn − Y| ≥ ε) ≤ P(Xn ≠ Y)
                = P((2/3)(1 − 1/n) < U < 2/3)
                = 2/(3n).

This tends to 0 as n → ∞, so Xn →P Y.


Convergence to a constant

What if Y not random variable, but number?

Example: suppose Zn ∼ Exponential(n). Then E(Zn) = 1/n,
suggesting that Zn typically gets smaller and smaller. Does
Zn →P 0?

P(|Zn − 0| ≥ ε) = P(Zn ≥ ε) = ∫_ε^∞ n e^{−nx} dx = e^{−nε}.

For any fixed ε, P(|Zn − 0| ≥ ε) → 0, so Zn →P 0.

Important special case (usually easier to handle).


Convergence to mean

Suppose sequence {Yn} has E(Yn) = µ for all n. Then Yn →P µ
if P(|Yn − µ| ≥ ε) → 0.

But recall Chebychev’s inequality,
P(|Y − µY| ≥ a) ≤ Var(Y)/a². Here:

P(|Yn − µ| ≥ ε) ≤ Var(Yn)/ε².

For fixed ε, right side (and hence left side) tends to 0 if
Var(Yn) → 0, in which case Yn →P µ.

(Logically: if Var(Yn) getting smaller, Yn becoming closer to their

mean µ.)


Weak Law of Large Numbers

Return to X1, X2, . . . , Xn being a random sample from some

population with mean E(Xi) = µ and variance Var(Xi) = v.

Consider sample mean

Mn = (1/n) Σ_{i=1}^{n} Xi.

Intuitively, expect Mn to be “close” to population mean µ, and to get
closer as n increases (more information in larger sample).

Does Mn →P µ? Re-do roulette calculations to show that
E(Mn) = µ and Var(Mn) = Var(Xi)/n = v/n.


Now, {Mn} is sequence of random variables with same mean µ.

Result of section “convergence to mean” says that Mn →P µ if
Var(Mn) → 0. But here, Var(Mn) = v/n → 0. This proves
that Mn →P µ.

This justifies use of sample mean as estimate of the population

mean. Can estimate average height of all Canadians by measuring

average height of sample of Canadians; the larger the sample,

closer estimate will likely be.

Important result, called weak law of large numbers.


To generalize: suppose now that Xn do not all have same variance,

but Var(Xi) = vi. Then

Var(Mn) = (1/n²) Σ_{i=1}^{n} vi.

This might not → 0. But suppose that vi ≤ v for all i. Then

Var(Mn) = (1/n²) Σ_{i=1}^{n} vi ≤ (1/n²) Σ_{i=1}^{n} v = v/n → 0.

In other words, Mn →P µ even if the variances are not all equal,
provided that they are bounded.


Convergence with probability 1

Previous example: suppose U ∼ Uniform[0, 1]. Let Xn = 3
when U ≤ (2/3)(1 − 1/n) and 8 otherwise. Let Y = 3 if U ≤ 2/3 and
Y = 8 otherwise. Concluded that Xn →P Y.

Take another approach. Suppose we knew U, eg. suppose
U = 0.4. Then

0.4 ≤ (2/3)(1 − 1/n)  ⇒  n ≥ 5/2.

Thus X1 = X2 = 8, X3 = X4 = · · · = 3. This is an ordinary
sequence of numbers, converges to 3. Also, if U = 0.4, Y = 3.


In general: if U < 2/3, Xn = 8 for n < 2/(2 − 3U) and Xn = 3
after that. If U > 2/3, Xn = 8 for all n.

In both cases, Xn → Y as an ordinary sequence for any particular
value of U. Potentially different idea of convergence of random
variables.

Definition: Xn converges to Y with probability 1 if
P(lim_{n→∞} Xn = Y) = 1. Also “converges almost surely”;
notation Xn →a.s. Y.

In words: consider all ways to get (number) sequences {Xn}; for
each, consider the corresponding Y. If Xn → Y always, then
Xn →a.s. Y.


Is it same as convergence in probability?

Example: let U ∼ Uniform[0, 1], and define {Xn} like this:

• X1 = 1 if 0 ≤ U < 1/2, 0 otherwise

• X2 = 1 if 1/2 ≤ U < 1, 0 otherwise

• X3 = 1 if 0 ≤ U < 1/4, 0 otherwise

• X4 = 1 if 1/4 ≤ U < 1/2, 0 otherwise

• X5 = 1 if 1/2 ≤ U < 3/4, 0 otherwise

• X6 = 1 if 3/4 ≤ U < 1, 0 otherwise

• X7 = 1 if 0 ≤ U < 1/8, 0 otherwise

• X8 = 1 if 1/8 ≤ U < 1/4, 0 otherwise, etc.


(Divided [0, 1] into 2, then 4, then 8, . . . intervals.)

Intervals getting shorter, so P(Xn = 1) decreasing. Indeed, for
ε < 1, P(|Xn − 0| ≥ ε) = P(Xn = 1) → 0, so Xn →P 0.

Suppose U = 0.2. Then Xn = 0 except for
X1 = X3 = X8 = · · · = 1. Beyond any n, always another
Xn = 1 (always another interval containing 0.2). So for U = 0.2,
the number sequence {Xn} has no limit. Hence not true that Xn →a.s. 0.

Example shows that the two convergence ideas are different:
convergence with probability 1 is harder to achieve.


Strong law of large numbers

Random sample X1, X2, . . . , Xn with E(Xi) = µ,
Var(Xi) ≤ v; let Mn = (Σ_{i=1}^{n} Xi)/n be sample mean.

Already showed that Mn →P µ (“weak law of large numbers”).

Also strong law of large numbers: Mn →a.s. µ. Proof difficult.

In words: of the (infinitely) many different sequences {Mn}
obtainable, every one of them (with probability 1) converges to µ.


Convergence in distribution

Consider independent sequence of random variables {Xn} with
P(Xn = 1) = 1/2 + 1/n and P(Xn = 0) = 1/2 − 1/n. Also, let
P(Y = 0) = P(Y = 1) = 1/2 independently of the Xn.

Now, take ε < 1. Then P(|Xn − Y| ≥ ε) = P(Xn ≠ Y). Could
have Xn = 0, Y = 1 or Xn = 1, Y = 0; use independence:

P(Xn ≠ Y) = (1/2 − 1/n)(1/2) + (1/2 + 1/n)(1/2) = 1/2.

Not → 0, so not true that Xn →P Y.


But Xn does converge to Y in the sense that
P(Xn = 1) → 1/2 = P(Y = 1) and
P(Xn = 0) → 1/2 = P(Y = 0). Called convergence in
distribution.

To make definition: note that P(Xn = x) meaningless for
continuous Xn, so work with P(Xn ≤ x) instead.

Then: {Xn} converges in distribution to Y if
P(Xn ≤ x) → P(Y ≤ x) for all x. Notation: Xn →D Y.


Example: Poisson approximation to binomial

Suppose Xn ∼ Binomial(n, λ/n) (that is, trials increasing but
success prob decreasing so that E(Xn) = n(λ/n) = λ constant).
Then

P(Xn = j) = (n choose j) (λ/n)^j (1 − λ/n)^{n−j} → e^{−λ} λ^j / j!,

which is P(Y = j) when Y ∼ Poisson(λ). That is,
Xn →D Poisson(λ).

(Proof based on lim_{n→∞} (1 − (x/n))^n = e^{−x}.)

Suggests that if n large and θ small, Poisson is good approx to
binomial.


Try this: take λ = 1.5 for n = 2, 5, 10, 20, 100:

x n=2 n=5 n=10 n=20 n=100 Poisson

0 0.0625 0.1680 0.1968 0.2102 0.2206 0.2231

1 0.3750 0.3601 0.3474 0.3410 0.3359 0.3346

2 0.5625 0.3087 0.2758 0.2626 0.2532 0.2510

3 0.0000 0.1323 0.1298 0.1277 0.1259 0.1255

4 0.0000 0.0283 0.0400 0.0440 0.0465 0.0470

5 0.0000 0.0024 0.0084 0.0114 0.0136 0.0141

6 0.0000 0.0000 0.0012 0.0023 0.0032 0.0035

Approx for n = 20 not bad; for n = 100 is very good.
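A scipy sketch that regenerates a table like this one (illustrative; not the software used for the slides):

from scipy.stats import binom, poisson

lam = 1.5
for x in range(7):
    binom_probs = [round(binom.pmf(x, n, lam / n), 4) for n in (2, 5, 10, 20, 100)]
    print(x, binom_probs, round(poisson.pmf(x, lam), 4))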


Convergence in distribution and moment generating

functions

Moment-generating function mY (s) for random variable Y is

function of s.

Uniqueness theorem: if mX(s) = mY (s) for all s where both

finite, then X, Y have same distribution.

Suggests following (true) result: if {Xn} is sequence of random

variables with mXn(s) → mY (s) (for all s where both sides finite),

then Xn →D Y.


Central Limit Theorem

Return to “random sample” X1, X2, . . . , Xn; suppose E(Xi) = 0

and Var(Xi) = 1.

Define Mn = (Σ_{i=1}^{n} Xi)/n. Does Mn converge in distribution to
anything interesting?

Well, E(Mn) = 0 but Var(Mn) = 1/n → 0. So look instead at
Zn = √n Mn: E(Zn) = 0 and Var(Zn) = 1. Then
Zn = (Σ_{i=1}^{n} Xi)/√n.


Moment-generating function for Xi is

mXi(s) = 1 + s E(Xi) + (s²/2!) E(Xi²) + (s³/3!) E(Xi³) + · · · ;

here E(Xi) = 0, Var(Xi) = 1 so E(Xi²) = 1, giving

mXi(s) = 1 + s²/2 + (s³/3!) E(Xi³) + · · · .

Now, by rules for mgfs,

mZn(s) = mX1(s/√n) · mX2(s/√n) · · · · · mXn(s/√n)
       = {mXi(s/√n)}^n
       = (1 + s²/(2n) + (s³/(3! n^{3/2})) E(Xi³) + · · ·)^n.


Recall that as n → ∞, (1 + y/n)^n → e^y. Above, the terms in s³
and higher contribute less and less as n increases, so only the 1
and s²/(2n) terms in the bracket have any effect. Thus

lim_{n→∞} mZn(s) = lim_{n→∞} (1 + s²/(2n))^n = e^{s²/2}

which is mgf of standard normal distribution.

Thus, remarkable fact: regardless of distribution of Xi,
Zn →D N(0, 1).

Also works for Xi with any mean and variance: standardized
Mn →D N(0, 1). Called central limit theorem.


Exact distribution of Mn very difficult to find. But if n “large”,

distribution can be approximated very well by normal distribution,

easier to work with.

This is reason for studying normal distribution.

Note that theorem uses convergence in distribution, so that it is the

cdf that converges, not the density function. Important if Xi discrete.

Also, for approximation, don’t need to be so careful about

standardization. Any sum/mean for large n works.


CLT by simulation

Let U1, U2, . . . ∼ Uniform[0, 1]; investigate distribution of

Yn = (U1 + U2 + · · ·+ Un)/n for various n. Uniform[0, 1]

distribution completely unlike normal. Do by simulation:

1. choose “large” number of Yn’s to simulate (eg. nsim = 10, 000)

2. in each of n columns, generate nsim random values from

Uniform[0, 1]

3. calculate simulated Yn values as row means. Eg. for n = 5,

let c10=rmean(c1-c5).

4. Draw histogram of results, compare normal distribution shape.

Normal good if curve through top middle of histogram bars.
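For readers not using Minitab, a minimal Python sketch of the same simulation (nsim and n are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
nsim, n = 10_000, 5
y = rng.uniform(0, 1, size=(nsim, n)).mean(axis=1)   # nsim simulated values of Y_n
print(y.mean(), y.var())      # near 1/2 and 1/(12n)
# a histogram of y with a normal curve overlaid gives pictures like the ones below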


[Histogram of the simulated y values, with a normal density curve superimposed.]

n = 2: normal too high at top, too low elsewhere.


[Histogram of the simulated y values, with a normal density curve superimposed.]

n = 5: much closer approx.


[Histogram of the simulated y values, with a normal density curve superimposed.]

n = 20: almost perfect.


Normal approx to binomial

Binomial is sum of Bernoullis, so CLT should apply if #trials n large.

Suppose Y ∼ Binomial(4, 0.5). Then E(Y ) = 2, Var(Y ) = 1.

Exact P(Y ≤ 1):

P(Y ≤ 1) = (4 choose 0)(0.5)^0 (1 − 0.5)^4 + (4 choose 1)(0.5)^1 (0.5)^3 = 0.3125.

Take X ∼ N(2, 1) (same mean, variance as Y). P(X ≤ 1)?

P(X ≤ 1) = P(Z ≤ (1 − 2)/√1) = P(Z ≤ −1) = 0.1587.

Not very close!


Problem: X continuous, but Y discrete. Y ≤ 1 really “Y ≤anything rounding to 1”. Suggests approximating P (Y ≤ 1) by

P (X ≤ 1.5):

P (X ≤ 1.5) = P

(Z ≤ 1.5− 2√

1

)= P (Z ≤ −0.5) = 0.3085.

For such small n, really very close to P (Y ≤ 1) = 0.3125.

In general, add 0.5 for ≤ and subtract 0.5 for <. Called continuity

correction; do whenever discrete distribution approximated by

continuous.

(Alternatively: for binomial, P(Y ≤ 1) ≠ P(Y < 1), but for

normal, P(X ≤ 1) = P(X < 1).)
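A quick numerical check of the continuity correction (a hedged Python/scipy sketch; scipy is an assumption, not part of the course's Minitab workflow):

from scipy.stats import binom, norm

n, p = 4, 0.5
mu, sd = n * p, (n * p * (1 - p)) ** 0.5

exact = binom.cdf(1, n, p)                    # P(Y <= 1) = 0.3125
crude = norm.cdf(1, loc=mu, scale=sd)         # no correction: about 0.1587
corrected = norm.cdf(1.5, loc=mu, scale=sd)   # continuity correction: about 0.3085
print(exact, crude, corrected)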

211


Compare Y ∼ Binomial(20, 0.5); E(Y ) = 10, Var(Y ) = 5.

Then exact P (Y ≤ 8) = 0.2517; approx by X ∼ N(10, 5) as

P(Y ≤ 8) ≈ P(X ≤ 8.5) = P(Z ≤ (8.5 − 10)/√5) = P(Z ≤ −0.67) = 0.2514.

Now, approx very good.

212


If p ≠ 0.5, binomial skewed; skewness decreases as n increases.

So need larger n for p far from 0.5.

Example: n = 20, p = 0.1. Simulate and plot using Minitab:

MTB > random 1000 c3;

SUBC> binomial 20 0.1.

MTB > hist c3

Shape clearly skewed, not normal. n = 20 not large enough here.

Rule of thumb: normal approx OK if np ≥ 5 and n(1 − p) ≥ 5.

Examples: n = 4, p = 0.5: np = 2 < 5, no good.

n = 20, p = 0.5: np = n(1 − p) = 10 ≥ 5, good;

n = 20, p = 0.1: np = 2 < 5, no good.

213


Monte Carlo integration

Integral I = ∫_0^1 sin(x^4) dx: impossible algebraically (no closed-form antiderivative). Get approximate answer numerically, e.g. by Simpson's rule. But can also recognize that

I = E{sin(U^4)}

where U ∼ Uniform[0, 1]. I is “average” of sin(U^4), suggesting procedure:

1. Generate U randomly from Uniform[0, 1].

2. Calculate T = sin(U^4).

3. Repeat steps 1 and 2 many times, find mean value m of T.

214


Minitab commands to do this (U in c1, T in c2):

MTB > random 1000 c1;

SUBC> uniform 0 1.

MTB > let c2=sin(c1**4)

MTB > mean c2

I got m = 0.19704. How accurate?

m observed value of random variable M. M mean of 1000 values,

so central limit theorem applies: approx normal distribution.

Mean, variance unknown but estimate using sample mean 0.19704, sample SD 0.25221: E(M) ≈ 0.19704,

Var(M) = σ²/n ≈ 0.25221²/1000 = 6.36 × 10^(−5).

215


Now, 99.7% of a normal distribution is within mean ± 3 × SD, so I is almost certainly in

0.19704 ± 3√(6.36 × 10^(−5)) = (0.173, 0.221).

To get more accurate answer, get more simulated values.
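For comparison, a hedged Python sketch of the same Monte Carlo estimate (the slides use Minitab; numpy is an assumption):

import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(0.0, 1.0, size=1000)
t = np.sin(u ** 4)                      # T = sin(U^4)

m = t.mean()
se = t.std(ddof=1) / np.sqrt(len(t))    # estimated SD of the mean M
print(m, (m - 3 * se, m + 3 * se))      # estimate and "almost certain" interval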

216


Recognizing as expectation

Consider now I = ∫_0^∞ 5x cos(x²) e^(−5x) dx.

Again impossible algebraically; because of limits, can’t use previous

trick.

Idea: use distribution with right limits and density in integral. Here, Exponential(5) has density 5e^(−5x) on correct interval, so

I = E{X cos(X²)} where X ∼ Exponential(5).

Minitab annoyance: its exponential dist has parameter 1/λ, so we

have to feed in 1/5 = 0.2.

217


Commands:

MTB > random 1000 c1;

SUBC> exponential 0.2.

MTB > let c2=c1*cos(c1**2)

MTB > describe c2

I got mean 0.1884, SD 0.1731, so this area almost certainly in

0.1884 ± 3 × 0.1731/√1000 = (0.1720, 0.2048).
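A hedged Python version of the same calculation (note numpy's exponential generator also takes the scale = 1/λ parameterization):

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 5, size=1000)   # X ~ Exponential(5); scale = 1/lambda
t = x * np.cos(x ** 2)

m = t.mean()
se = t.std(ddof=1) / np.sqrt(len(t))
print(m, (m - 3 * se, m + 3 * se))            # estimate and 3-SD interval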

218


Approximating sampling distributions

Central Limit Theorem only applies to means (sums), so is no help

for other quantities (median, variance etc).

Can approximate sampling distributions for these by simulation.

Idea:

1. simulate random sample from population

2. calculate sample quantity

3. repeat steps 1 and 2 many times, summarize results.

219


Sampling distribution of sample median in normal

population

Suppose X_1, X_2, . . . , X_n is random sample from normal population with mean 10, SD 2; take n = 3.

MTB > Random 500 c1-c3;

SUBC> Normal 10 2.

MTB > RMedian c1-c3 c4.

Samples in rows; use “row statistics” to get sample medians.
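Equivalent hedged Python sketch (rows are samples, matching the Minitab layout above; numpy is an assumption):

import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=10, scale=2, size=(500, 3))   # 500 samples of size n = 3
medians = np.median(samples, axis=1)                   # one sample median per row
print(medians.mean(), medians.std(ddof=1))             # summarize; histogram looks normal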

220


Shape is very like normal, even for such small sample.

221


Sampling distribution of sample variance in normal

population

Again suppose X_1, X_2, . . . , X_n ∼ N(10, 2²). Now take n = 5:

MTB > Random 500 c1-c5;

SUBC> Normal 10 2.

MTB > RStDev c1-c5 c6.

MTB > let c7=c6*c6

MTB > histogram c7

(samples in rows again; variance as square of SD.)
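Again, a hedged Python sketch of the same simulation:

import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=10, scale=2, size=(500, 5))   # 500 samples of size n = 5
variances = samples.var(axis=1, ddof=1)                # sample variance S^2 for each row
print(variances.mean())                                # near sigma^2 = 4; histogram skewed right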

222


Shape definitely skewed right: not normal-shaped.

223


Normal distribution theory

Normal distribution arises often from CLT, so worth knowing

properties and related distributions. These used frequently in

Chapter 5 and beyond (STAB57).

First: suppose U, V are independent. Then Cov(U, V ) = E(UV ) − E(U)E(V ) = E(U)E(V ) − E(U)E(V ) = 0, as expected.

But: now suppose that Cov(U, V ) = 0. If U, V jointly (bivariate) normal, then (fact) U, V independent.

That is, for jointly normal U, V , Cov(U, V ) = 0 if and only if U, V independent. Not true for other distributions.

224


The chi-squared distribution

Suppose Z ∼ N(0, 1). What is distribution of W = Z²? Can't use usual transformation because Z² neither increasing nor decreasing.

F_W(w) = P(W ≤ w) = P(Z² ≤ w) = P(−√w ≤ Z ≤ √w).

This as integral is

F_W(w) = ∫_{−√w}^{√w} e^(−z²/2)/√(2π) dz = ∫_{−∞}^{√w} e^(−z²/2)/√(2π) dz − ∫_{−∞}^{−√w} e^(−z²/2)/√(2π) dz.

225


Differentiate both sides and simplify to get

f_W(w) = (1/√(2πw)) e^(−w/2).

This is called the chi-squared distribution with 1 degree of freedom (df). Written W ∼ χ²_1.

Now suppose Z_1, Z_2, . . . , Z_n ∼ N(0, 1) independently. Distribution of W = Z²_1 + Z²_2 + · · · + Z²_n called chi-squared with n degrees of freedom. Written W ∼ χ²_n.

What is E(W )?

E(W ) = E(Σ_{i=1}^n Z²_i) = Σ_{i=1}^n E(Z²_i) = n(1) = n

since E(Z²_i) = Var(Z_i) = 1.
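A hedged simulation check of the definition and of E(W) = n (Python with numpy/scipy assumed; not course software):

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 5
w = (rng.normal(size=(100_000, n)) ** 2).sum(axis=1)   # sums of n squared N(0,1)'s
print(w.mean(), chi2.mean(n))                          # both close to n = 5
print(w.var(), chi2.var(n))                            # both close to 2n = 10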

226


To get density function of χ²_n, compare gamma density with χ²_1:

λ^α w^(α−1) e^(−λw) / Γ(α) = (1/√(2πw)) e^(−w/2)

if α = 1/2 and λ = 1/2. That is, χ²_1 = Gamma(1/2, 1/2).

If Z²_i ∼ χ²_1, use mgf formula for gamma dist to write

m_{Z²_i}(s) = (1/2)^(1/2) (1/2 − s)^(−1/2).

227


If W = Σ_{i=1}^n Z²_i ∼ χ²_n, mgf of W is n copies of m_{Z²_i}(s) multiplied together, i.e.

m_W(s) = (1/2)^(n/2) (1/2 − s)^(−n/2)

which is mgf of Gamma(n/2, 1/2). Using formula for gamma density, then, for W ∼ χ²_n,

f_W(w) = (1/(2^(n/2) Γ(n/2))) w^(n/2−1) e^(−w/2).

Has skew-to-right shape (picture page 225).

228


Distribution of sample variance

Suppose X_1, X_2, . . . , X_n ∼ N(µ, σ²). Define X̄ = Σ_{i=1}^n X_i/n to be sample mean, S² = Σ_{i=1}^n (X_i − X̄)²/(n − 1) to be sample variance.

Know that X̄ ∼ N(µ, σ²/n). Distribution of S²?

Actually look at (n − 1)S²/σ² = Σ_{i=1}^n (X_i − X̄)²/σ². Can write (p. 235) as sum of n − 1 squared N(0, 1)'s, so

(n − 1)S²/σ² ∼ χ²_{n−1}.

Fact: E(S²) = σ² (explains division by n − 1).

229


The t distribution

Standardize X̄: (X̄ − µ)/√(σ²/n) ∼ N(0, 1).

But what if σ² unknown? Idea: replace σ² by sample variance S². Distribution of result no longer normal (even though X_i are).

(X̄ − µ)/√(S²/n) = [(X̄ − µ)/√(σ²/n)] · 1/√{[(n − 1)S²/σ²]/(n − 1)} = Z/√(Y/(n − 1))

where Z ∼ N(0, 1) and Y ∼ χ²_{n−1}.

This called the t distribution with n − 1 degrees of freedom, written t_{n−1}.

230


What happens as n increases? Write Y/(n − 1) = Σ_{i=1}^{n−1} Z²_i/(n − 1) where Z_i ∼ N(0, 1). Then E(Y/(n − 1)) = 1. Let k = Var(Z²_i); then

Var(Y/(n − 1)) = (n − 1)k/(n − 1)² = k/(n − 1) → 0.

That is, Y/(n − 1) →_P 1 and therefore

Z/√(Y/(n − 1)) →_D N(0, 1);

that is, for large n, the t distribution with n − 1 df well approximated by N(0, 1).

t distribution hard to work with; use tables/software for probabilities.
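A hedged scipy check that t with growing df approaches N(0, 1) (scipy assumed, not course software):

from scipy.stats import norm, t

for df in (2, 5, 30, 100):
    print(df, t.ppf(0.975, df), norm.ppf(0.975))   # upper 2.5% points approach 1.96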

231


The F distribution

Suppose S²_1 and S²_2 are sample variances from independent samples of sizes m, n, both from normal populations with variance σ². Then might compare variances by looking at ratio R = S²_1/S²_2:

R = S²_1/S²_2 = {[(m − 1)S²_1/σ²] / [(n − 1)S²_2/σ²]} · {[1/(m − 1)] / [1/(n − 1)]} = [X/(m − 1)] / [Y/(n − 1)]

where X ∼ χ²_{m−1} and Y ∼ χ²_{n−1}.

This defined to have the F distribution with m − 1 and n − 1 degrees of freedom, written F(m − 1, n − 1).

232


Properties of F distribution

Ratio could have been S²_2/S²_1 = 1/R with similar result: therefore, if R ∼ F(m − 1, n − 1), then 1/R ∼ F(n − 1, m − 1).

Suppose T = X/√(Y/(n − 1)) ∼ t_{n−1}. Then

T² = (X²/1) / [Y/(n − 1)]

is a χ²_1/1 over a χ²_{n−1}/(n − 1); that is, T² ∼ F(1, n − 1).

233


In

R = [X/(m − 1)] / [Y/(n − 1)]:

if n → ∞, know that Y/(n − 1) →_P 1, and numerator of R is χ²_{m−1}/(m − 1).

Hence, as n → ∞,

(m − 1)R →_D χ²_{m−1}.

Thus χ²_{m−1} (scaled by 1/(m − 1)) is useful approx to F(m − 1, n − 1) if n large.

234


Stochastic Processes

235


Random walks

Consider gambling game: win $1 with prob p, lose $1 with prob q (p + q = 1). Each play independent. Start with fortune a; let X_n denote fortune after n plays.

Thus X_0 = a; X_1 = a + 1 if win (prob p), X_1 = a − 1 if lose (prob q).

Sequence {X_n} of random variables called a random walk.

236


Properties of random walk

At each step, two possible outcomes (win/lose), same prob p of winning, independent. So number of wins W_n ∼ Binomial(n, p).

With W_n wins, must be n − W_n losses, so fortune after W_n wins is

X_n = a + (1)W_n + (−1)(n − W_n) = a + 2W_n − n.

Since E(W_n) = np, have

E(X_n) = a + 2np − n = a + 2n(p − 1/2).

Also

Var(X_n) = 2² Var(W_n) = 4np(1 − p).
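A hedged simulation check of these formulas (Python/numpy assumed; the values of a, p, n are chosen just for illustration):

import numpy as np

rng = np.random.default_rng(0)
a, p, n, nsim = 5, 0.25, 10, 100_000

steps = rng.choice([1, -1], size=(nsim, n), p=[p, 1 - p])   # +1 w.p. p, -1 w.p. q
x_n = a + steps.sum(axis=1)

print(x_n.mean(), a + 2 * n * (p - 0.5))   # both near 0
print(x_n.var(), 4 * n * p * (1 - p))      # both near 7.5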

237


Since W_n ∼ Binomial(n, p), have

P(W_n = j) = (n choose j) p^j q^(n−j);

write in terms of X_n to get

P(X_n = a + k) = P(a + k = a + 2W_n − n) = P(W_n = (n + k)/2) = (n choose (n + k)/2) p^((n+k)/2) q^((n−k)/2).

Only certain values of X_n possible; formula fails for impossible values.

238


Examples

Suppose a = 5, p = 1/4. Then E(X_n) = 5 + 2n(1/4 − 1/2) = 5 − n/2. Expect fortune to decrease on average.

What is P(X_3 = 6)? Write 6 = 5 + 1 so k = 1, n = 3; (n + k)/2 = 2 and (n − k)/2 = 1:

P(X_3 = 6) = (3 choose 2)(1/4)²(3/4)¹ = 9/64.

How about P(X_9 = 7)? This is P(X_9 = 5 + 2), so n = 9 and k = 2. But (n + k)/2 = (9 + 2)/2 not integer, so formula fails. X_9 cannot be 7 (in fact X_9 must be even).
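A hedged brute-force check of P(X_3 = 6) = 9/64 by listing all 2^3 possible paths (plain Python, an assumption):

from itertools import product

a, p = 5, 0.25
total = 0.0
for steps in product([1, -1], repeat=3):     # all 8 possible 3-step paths
    prob = 1.0
    for s in steps:
        prob *= p if s == 1 else (1 - p)
    if a + sum(steps) == 6:
        total += prob
print(total, 9 / 64)                         # both 0.140625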

239


Now suppose a = 20, p = 2/3. Then

E(X_n) = 20 + 2n(2/3 − 1/2) = 20 + n/3,

increasing with n.

Find P(X_5 = 21), with 21 = 20 + 1: n = 5, k = 1 so (n + k)/2 = 3, (n − k)/2 = 2 and

P(X_5 = 21) = (5 choose 3)(2/3)³(1/3)² ≈ 0.329,

fairly likely.

240


Gambler’s ruin

Suppose we gamble with aim to reach fortune c > 0. How likely are we to succeed before fortune reaches 0 (run out of money)?

Hard to see answer: no idea how long it takes to reach c or 0.

Idea: let S(a) be prob of reaching c first starting from fortune a. Then for all c > 0, S(0) = 0, S(c) = 1. Also, if current fortune a, fortune at next step either a + 1 or a − 1, leading to

S(a) = pS(a + 1) + qS(a − 1).

241


Solve above recurrence relation to get formula: if p = 1/2, S(a) = a/c; otherwise,

S(a) = [1 − (q/p)^a] / [1 − (q/p)^c].

Example: start with $20, want to win $50. If p = 1/2, chance of success is 20/50 = 0.4. If p = 0.51, chance of success is

S(20) = [1 − (0.49/0.51)^20] / [1 − (0.49/0.51)^50] ≈ 0.637.

Even a very small edge makes success much more likely. (Even

small disadvantage makes eventual failure much more likely.)
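The formula as a small hedged Python function, reproducing the two numbers above:

def success_prob(a, c, p):
    """Probability of reaching fortune c before 0, starting from fortune a."""
    if p == 0.5:
        return a / c
    q = 1 - p
    return (1 - (q / p) ** a) / (1 - (q / p) ** c)

print(success_prob(20, 50, 0.5))    # 0.4
print(success_prob(20, 50, 0.51))   # about 0.637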

242


Markov Chains

Simple model of weather:

• if sunny today, prob 0.7 of sunny tomorrow, prob 0.3 of rainy.

• if rainy today, prob 0.4 of sunny tomorrow, prob 0.6 of rainy.

Weather has two states (sunny, rainy). From one day to next,

weather may change state.

Probs above called transition probabilities. This kind of probability

model called Markov chain.

243


Can write as matrix:

P = [ 0.7  0.3
      0.4  0.6 ]

where element p_ij is P(go to state j | currently in state i).

Note assumption: only need to know weather today to predict

weather tomorrow. (If weather today known, past weather

irrelevant). Called Markov property.

Suppose sunny today. Chance of sun in two days?

One idea: list possibilities. Two: SSS, SRS. Use transition probs to

get (0.7)(0.7) + (0.3)(0.4) = 0.61.

244


Another: calculate matrix P^2:

P^2 = [ 0.7  0.3 ] [ 0.7  0.3 ]   [ 0.61  0.39 ]
      [ 0.4  0.6 ] [ 0.4  0.6 ] = [ 0.52  0.48 ].

Note that top-left calculation same as 1st idea above.

Matrix P^2 gives two-step transition probs. That is, if sunny today, prob of sunny in 2 days' time 0.61; if rainy today, almost even chance of being rainy in 2 days.

In general, P^n gives n-step transition probs (weather in n days' time given weather today).
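A hedged numpy sketch of the n-step calculation (the slides use Minitab for matrix work; numpy is an assumption):

import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
print(np.linalg.matrix_power(P, 2))   # two-step matrix; top-left entry 0.61
print(np.linalg.matrix_power(P, 8))   # rows approach (4/7, 3/7)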

245


Another example

“Ehrenfest’s Urn”: Two urns, containing total of 4 balls. Choose one

ball at random, take out of current urn, place in other urn. Keep

track of number of balls in urn 1.

Transition matrix (states 0, 1, 2, 3, 4 balls in urn 1):

P = [ 0    1    0    0    0
      1/4  0    3/4  0    0
      0    2/4  0    2/4  0
      0    0    3/4  0    1/4
      0    0    0    1    0 ]

Apparent tendency for number of balls in 2 urns to even out.

246


Find likely number of balls in urn 1 after 9 steps by finding P^9. (Use Minitab: see section E.1 of manual, p. 162.) Answer (rounded):

P^9 = [ 0      0.5  0     0.5  0
        0.125  0    0.75  0    0.125
        0      0.5  0     0.5  0
        0.125  0    0.75  0    0.125
        0      0.5  0     0.5  0 ]

Start with even number of balls in urn 1: end with either odd

number, equally likely. Start with odd number: end with even

number, most likely 2.

247


Stationary distributions

Instead of starting from particular state, pick starting state from prob. distribution θ = (θ_1, θ_2, . . .).

In weather example: suppose 80% chance today sunny, so θ = (0.80, 0.20).

To get prob of each state n steps later, multiply θ as row vector by P^n. Weather example, for n = 2 days later:

(0.8  0.2) P^2 = (0.8  0.2) [ 0.61  0.39 ] = (0.592  0.408).
                            [ 0.52  0.48 ]

248


Suppose we could find θ such that θP = θ. Then starting

distribution θ would be stationary: (marginal) prob of sunny day

same for all days.

Can try directly for weather example:

(θ_1  θ_2) P = (0.7θ_1 + 0.4θ_2   0.3θ_1 + 0.6θ_2) = (θ_1  θ_2).

2 equations in 2 unknowns, collapse into one equation 0.3θ_1 − 0.4θ_2 = 0, but θ_i are probs so that θ_1 + θ_2 = 1 also. Solve: θ_1 = 4/7, θ_2 = 3/7.

More generally: solve θP = θ by transposing both sides to get P^T θ^T = θ^T. Like solution to Av = λv with λ = 1: stationary prob θ is an eigenvector of P^T with eigenvalue 1.
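A hedged numpy sketch of that eigenvector calculation for the weather chain:

import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
vals, vecs = np.linalg.eig(P.T)                    # eigen-decomposition of P transpose
v = np.real(vecs[:, np.argmin(np.abs(vals - 1))])  # column for the eigenvalue closest to 1
print(v / v.sum())                                 # scale to sum to 1: about (4/7, 3/7)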

249


Can use Minitab to get eigenvalues/vectors (manual p. 167). Usually

need to scale eigenvector to get probs summing to 1.

Ehrenfest urn example: 5 eigenvectors; the one with eigenvalue 1 is (0.120, 0.478, 0.717, 0.478, 0.120), scaling to (1/16, 4/16, 6/16, 4/16, 1/16).

(Actually binomial probs: see text p. 595).

250


Limiting distributions

If initial state chosen from stationary distribution, then prob of each

state remains same for all time.

Also: if watch Markov chain for many steps, should not matter much

which state we began in.

Weather example: 8-step transition matrix is

P^8 = [ 0.57146  0.42854 ] ≈ [ 4/7  3/7 ]
      [ 0.57139  0.42861 ]   [ 4/7  3/7 ]

Starting either from sunny or rainy day, chance of sunny day in 8 days' time is about 4/7. Same as stationary distribution.

251


Compare Ehrenfest urn example:

P^8 ≈ [ 0.125  0    0.75  0    0.125
        0      0.5  0     0.5  0
        0.125  0    0.75  0    0.125
        0      0.5  0     0.5  0
        0.125  0    0.75  0    0.125 ]

not getting stationary distribution in each row.

Problem here: number of balls in urn 1 always goes from odd to

even or vice versa. So e.g. P(1 ball in urn 1 after n steps)

alternates between 0 and positive; cannot have limit. Chain called

periodic.

252


Consider a third example:

P = [ 0.5   0.5   0
      0.75  0.25  0
      0     0     1 ].

Has two eigenvectors for eigenvalue 1: (0.6, 0.4, 0) and (0, 0, 1).

Note: start in state 1 or 2, can never reach state 3. Start in state 3, can never reach states 1 or 2.

Such chain called reducible: can split up into two chains, {1, 2} and {3}, and treat each separately.

253


Markov chain limit theorem

Previous work suggests following theorem:

Suppose a Markov chain has a stationary distribution, is not

reducible, and is not periodic. Then its stationary distribution also

gives the probability, as n → ∞, of being in any particular state

after n steps.

In effect, the stationary distribution gives approx to long-term

behaviour of chain.

254


... that’s all, folks!

255