
Welcome to

STAB52

Instructor: Dr. Ken Butler


Contact information

(on Intranet: intranet.utsc.utoronto.ca, My Courses)

• E-mail: [email protected]

• Office: H 417

• Office hours: to be announced

• Phone: 5654 (416-287-5654)


Probability Models


Measuring uncertainty


Random Variables and Distributions


Random Variables

Suppose we flip two (fair) coins, and note whether each coin (ordered) comes up H or T.

  • Sample space is S = {HH, HT, TH, TT}.

  • Probability measure gives probability 1/4 to each of the 4 outcomes.

What about “number of heads”? Could be 0, 1 or 2:

  • P(0 heads) = P(TT) = 1/4

  • P(1 head) = P(TH) + P(HT) = 1/2

  • P(2 heads) = P(HH) = 1/4.
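
A minimal Python sketch (not part of the slides) that lists the four equally likely outcomes and tallies the distribution of the number of heads:

```python
from itertools import product
from fractions import Fraction

# All ordered outcomes of two fair coin flips; each has probability 1/4.
outcomes = list(product("HT", repeat=2))
p = Fraction(1, 4)

# X = number of heads: add up the probability of every outcome giving each value.
dist = {}
for s in outcomes:
    x = s.count("H")
    dist[x] = dist.get(x, 0) + p

print(dist)   # {2: Fraction(1, 4), 1: Fraction(1, 2), 0: Fraction(1, 4)}
```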


“Number of heads” is random variable: function from S to R. That

is, given outcome, get value of random variable.

A random variable can be any function from S to R. If S = {rain, snow, clear}, a random variable X could be

X(rain) = 3

X(snow) = 6

X(clear) = −2.7.


Some more examples of random variables

Roll a fair 6-sided die, so that S = {1, 2, 3, 4, 5, 6}. Let X be the number of spots showing, and let Y be the square of the number of spots. If s is the number of spots, let W = s + 10, let U = s^2 − 5s + 3, etc.

In previous situation, let C = 3 regardless of s. C is constant

random variable.

Suppose have event A, only interested in whether A happens or

not. Define indicator random variable I to be 1 if A happens, 0

otherwise. Example (rolling die) I6(s) = 1 if s = 6, 0 otherwise.


≥, =, sum for random variables

Imagine rolling a fair die again, S = {1, 2, 3, 4, 5, 6}. Let X = s,

and let Y = X + I6.

X is number of spots, I6 is 1 if you roll a 6 and 0 otherwise. What

does Y mean?

Eg. roll a 4, X = 4, Y = 4 + 0 = 4. But if you roll a 6,

Y = 6 + 1 = 7. (That is, Y is the number of spots plus a “bonus

point” if you roll a 6.)

Sum of random variables (like Y here) for any outcome is sum of

their values for that outcome.


Also: if s = 1, 2, 3, 4, 5, values of X and Y are same. If s = 6,

X < Y .

Say that random variable X ≤ Y if value of X ≤ value of Y for

every single outcome. True in example.

Say that random variable X = Y if value of X equals value of Y

for every single outcome. Not true in example (different when

outcome is s = 6).

For a constant random variable c, X ≤ c if all possible values of X are ≤ c.


When S is infinite

When S infinite, random variable can take infinitely many different

values (but may not).

Example: S = {1, 2, 3, . . .}. If X = s, X takes all infinitely many

values in S. But define Y = 3 if s ≤ 4, Y = 2 if 4 < s ≤ 10,

Y = 1 when s > 10. Y has only finitely many (3) different values.


Distributions of random variables

A random variable can be described by listing all its possible values and their probabilities. Started this chapter with a coin-flipping example:

Flip two (fair) coins, and note whether each coin (ordered) comes up

H or T.

Let X be “number of heads”. Could be 0, 1 or 2:

  • P(X = 0) = P(TT) = 1/4

  • P(X = 1) = P(TH) + P(HT) = 1/2

  • P(X = 2) = P(HH) = 1/4.

Called the distribution of X .


Notice how can talk about P (X = s) for some s. In this case,

listing all the s for which P (X = s) > 0 describes distribution.

Consider now a random variable X taking values in [0, 1] with

P(a ≤ X ≤ b) = b − a

for 0 ≤ a ≤ b ≤ 1. Try to figure out eg. P(X = 0.4): it is P(0.4 ≤ X ≤ 0.4) = 0.4 − 0.4 = 0.

Can’t define probability of a value, but still can define probability of

landing in subset of R (namely interval).


To account for all of this, define distribution of random variable X

as: collection of probabilities P (X ∈ B) for all subsets B of

real numbers.

Works for both examples above. Eg. in the first example, P(X ≤ 1) = P(X = 0) + P(X = 1) = 3/4.

In practice, often messy to define probabilities for “all possible

subsets”. Think first about examples like 1st, “discrete”, where can

talk about probabilities of individual values. Then consider

“continuous” case (like 2nd), where have to look at intervals.


Discrete distributions

Often it makes sense to talk about individual probs, P(X = x). When all the probability is included in these probs, ie.

∑_{x∈R} P(X = x) = 1,

don't need to look at anything else.

Another way to look at it: there is a finite or countable set of x values, x_1, x_2, . . ., each having probability p_i = P(X = x_i), such that ∑_i p_i = 1.

Either of these is definition of discrete distribution.


Compare case where P (a ≤ X ≤ b) = b− a: P (X = x) = 0

for all x, so not discrete distribution.

Another example: suppose X = −1 with prob 1/2, and for 0 ≤ a ≤ b ≤ 1, P(a ≤ X ≤ b) = (b − a)/2. Can talk about P(X = −1) = 1/2, but P(X = x) = 0 for any other x. So not a discrete distribution.

Notation for discrete distributions (emphasize function):

pX(x) = P (X = x)

called probability function or mass function.

Now look at some important discrete distributions.


Degenerate distributions

If random variable C is constant, equal to c, then P(C = c) = 1 and P(C = x) = 0 for any x ≠ c. Since ∑_{x∈R} P(C = x) = P(C = c) = 1, this is a proper (though dull) discrete distribution. Called the degenerate distribution or point mass.


Bernoulli distribution

Flip a coin once, let X be number of heads (has to be 0 or 1).

Suppose P (head) = θ, so P (tail) = 1− θ. Then

pX(1) = P (X = 1) = P (head) = θ;

pX(0) = P (X = 0) = P (tail) = 1− θ.

X said to have Bernoulli distribution; write X ∼ Bernoulli(θ).

Application: any kind of “success/failure”. Denote “success” by 1,

“failure” by 0. Or selection from population with two kinds of

individual like male/female.


Binomial distribution

Now suppose we flip the coin n times (independently) and again

count number of heads. Probability of exactly x heads is

pX(x) = P(X = x) = (n choose x) θ^x (1 − θ)^(n−x).

X said to have binomial distribution, written

X ∼ Binomial(n, θ).

Applications: as for Bernoulli. Eg. randomly select 100 Canadian

adults, let X be number of females.


Let X ∼ Binomial(4, 0.5), Y ∼ Binomial(4, 0.2). Then

x P(X=x) P(Y=x)

0 0.0625 0.4096

1 0.2500 0.4096

2 0.3750 0.1536

3 0.2500 0.0256

4 0.0625 0.0016

X probs symmetric about x = 2, Y more likely to be 0 or 1

because successes less likely.
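
This table is easy to reproduce in a few lines; a sketch, assuming scipy is available:

```python
from scipy.stats import binom

# pmf of Binomial(4, 0.5) and Binomial(4, 0.2) at x = 0, 1, ..., 4
for x in range(5):
    px = binom.pmf(x, 4, 0.5)   # P(X = x)
    py = binom.pmf(x, 4, 0.2)   # P(Y = x)
    print(f"{x}  {px:.4f}  {py:.4f}")
# e.g. 0.0625 and 0.4096 at x = 0, 0.3750 and 0.1536 at x = 2, as in the table.
```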

Bernoulli and binomial count successes in fixed number of trials.

Could also look at waiting time problem: fix successes, count

number of trials needed to get them.


Geometric distribution

Same situation as for binomial: independent trials, equal prob. θ of success. Let X now be the number of tails before the 1st head.

X = k means we observe k tails, and then a head, so

pX(k) = P(X = k) = (1 − θ)^k θ, k = 0, 1, 2, . . .

X can be as large as you like, since you might wait a long time for

the first head. (Compare binomial: can’t have more than n

successes in n trials).

X has geometric distribution, prob. θ, written X ∼ Geometric(θ).

Applications: number of working light bulbs tested until first one that

fails; number of at-bats for baseball player until first hit.


Examples: suppose X1 ∼ Geometric(0.8) and

X2 ∼ Geometric(0.5).

k P (X1 = k) P (X2 = k)

0 0.8 0.5

1 0.16 0.25

2 0.032 0.125

3 0.0064 0.0625

4 0.00128 0.03125

. . . . . .

When θ larger, 1st success probably sooner.

Also: probabilities form geometric series, hence the name.


Negative binomial distribution

To take geometric one stage further: Let r be a fixed number, let Y

be the number of tails before the r-th head.

Y = k only if we observe r − 1 heads and k tails, in any order, followed by a head (must finish with a head). That is r + k − 1 flips before the final head. Prob. of this is

pY(k) = P(Y = k) = ((r + k − 1) choose (r − 1)) θ^(r−1) (1 − θ)^k · θ
                 = ((r + k − 1) choose k) θ^r (1 − θ)^k.

Write this Y ∼ Negative-Binomial(r, θ).


Applications: can re-use geometric distribution examples. Thus:

number of working lightbulbs tested until 5th non-working one

encountered; number of at-bats until baseball player achieves 10th

hit.

Numerical examples: let Y1 ∼ Negative-Binomial(4, 0.8) and

Y2 ∼ Negative-Binomial(3, 0.5).

k P(Y1=k) P(Y2=k)

0 0.4096 0.1250

1 0.3276 0.1875

2 0.1638 0.1875

3 0.0655 0.1562

4 0.0229 0.1171

5 0.0073 0.0820

6 0.0022 0.0546


With Y1, “heads” are likely so probably won’t see many tails before

4th H. With Y2, heads not so likely but only need to see 3 before

stopping.

General note: some books count total number of trials until first (or

r-th) head for geometric and negative binomial distributions. Gives

random variables 1 + X and r + Y as defined above.


Poisson distribution

Suppose X ∼ Binomial(n, λ/n). We’ll think of λ as being fixed

and see what happens as n →∞. That is, what if the number of

trials gets very large but the prob. of success gets very small?

Then

P(X = x) = (n choose x) (λ/n)^x (1 − λ/n)^(n−x)
         = [n! / (x! (n − x)! n^x)] λ^x (1 − λ/n)^n (1 − λ/n)^(−x).


Thinking of x as fixed (for now) and letting n → ∞: the behaviour of the factorials is determined by the highest power of n. Thus n! behaves like n^n, (n − x)! behaves like n^(n−x), and hence

n! / ((n − x)! n^x) → 1.

Also,

(1 − λ/n)^(−x) → 1

because 1 − λ/n → 1 and raising it to a fixed power changes nothing.


Finally,

lim_{n→∞} (1 − λ/n)^n

is a famous limit from calculus; it is e^(−λ). Thus

lim_{n→∞} P(X = x) = e^(−λ) λ^x / x!.

A random variable Y with P(Y = y) = e^(−λ) λ^y / y! is said to have a Poisson(λ) distribution, written Y ∼ Poisson(λ).

The Poisson distribution is a good model for rare events: that is,

events which have a large number of “chances” to happen, but have

a very small probability of happening at each “chance”. λ represents

“rate” at which events happen; doesn’t have to be integer.
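
The limit can also be seen numerically; a sketch (assuming scipy) comparing Binomial(n, λ/n) with the Poisson(λ) limit for λ = 2:

```python
from scipy.stats import binom, poisson

lam, x = 2.0, 3                            # compare P(X = 3) for lambda = 2
for n in (10, 100, 1000, 10000):
    print(n, binom.pmf(x, n, lam / n))     # approaches the Poisson value as n grows
print("Poisson:", poisson.pmf(x, lam))     # e^(-2) 2^3 / 3! = 0.1804
```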


Applications of Poisson distribution are things like: number of house

fires in a city on a given day, number of phone calls arriving at a

switchboard in an hour, number of radioactive events recorded by a

Geiger counter.

Let X ∼ Poisson(2), Y ∼ Poisson(0.8):


lam=2 lam=0.8

x P(X=x) P(Y=x)

0 0.1353 0.4493

1 0.2707 0.3595

2 0.2707 0.1438

3 0.1804 0.0383

4 0.0902 0.0077

5 0.0361 0.0012

6 0.0120 0.0002

... ...

• When λ is integer, highest prob at that integer and next lower

• Otherwise, highest prob at next lower integer (so when λ < 1,

highest prob at x = 0).


Hypergeometric distribution

Introduction

Imagine a pot containing 10 balls, 7 red and 3 green. Prob. of

drawing a red ball is 0.7 (7/10). If we put the ball drawn back in the

pot, prob. of drawing a red ball the next time is still 0.7.

Thus, drawing with replacement, the number of red balls in 4 draws is R ∼ Binomial(4, 0.7). Therefore

P(R = 4) = (4 choose 4)(0.7)^4 (0.3)^0 = 0.2401.


Now suppose we draw without replacement: that is, don’t put balls

back in pot after drawing. If we draw a red ball 1st time, there are

only 6 red balls out of 9 balls left.

Should be harder to draw 4 red balls in 4 draws because there are

fewer left after we draw each one: now

P(R = 4) = (7/10) · (6/9) · (5/8) · (4/7) = 0.1667.

This is not so bad, but suppose we now want P (R = 3), say?

Need general principle for drawing without replacement.


The hypergeometric formula

Introduce symbols: suppose draw n balls out of a pot containing N

total. Suppose M of the balls in the pot are red. Let X be number

of red balls drawn. What is P (X = x)?

Need to count ways:

  • Number of ways to draw n balls out of the N in the pot: (N choose n).

  • Number of ways to draw x red balls out of the M red balls in the pot: (M choose x).

  • Number of ways to draw n − x green balls out of the N − M green balls in the pot: ((N − M) choose (n − x)).

P(X = x) is the number of ways to draw the red and green balls divided by the number of ways to draw n balls out of N:

P(X = x) = (M choose x) ((N − M) choose (n − x)) / (N choose n).

X said to have hypergeometric distribution:

X ∼ Hypergeometric(N, M, n). Checks:

M + (N −M) = N and x + (n− x) = n. Restrictions on x?

• Number of red balls: x ≤ n and x ≤ M so x ≤ min(n,M).

  • Number of green balls: n − x ≤ n and n − x ≤ N − M, so x ≥ 0 and x ≥ n + M − N, so x ≥ max(0, n + M − N).


Example 1: let X ∼ Hypergeometric(10, 7, 4):

x P(X=x)

0 0.0000

1 0.0333

2 0.3000

3 0.5000

4 0.1667

10 balls in pot, 7 red, 4 drawn. Cannot draw 0 red, because that

would mean drawing 4 green, and only 3 in pot. (Also cannot draw

more than 4 red because only drawing 4).
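
A quick check of this table, assuming scipy is available (note scipy's argument order: population size, number of red balls, number of draws):

```python
from scipy.stats import hypergeom

rv = hypergeom(10, 7, 4)              # 10 balls in the pot, 7 red, 4 drawn
for x in range(5):
    print(x, round(rv.pmf(x), 4))     # 0.0, 0.0333, 0.3, 0.5, 0.1667
```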


Example 2: let Y ∼ Hypergeometric(5, 3, 4):

y P(Y=y)

0 0.0

1 0.0

2 0.6

3 0.4

4 0.0

5 0.0

5 balls in pot, 3 red and 2 green, draw 4. Cannot draw more than 3

red. But also cannot draw only 0 or 1 red, because that would mean

drawing 4 or 3 green, and aren’t that many in the pot.


Applications

Anything that involves drawing without replacement from a finite set

of elements. Includes sampling, eg. selecting people to include in

opinion poll. (Don’t want to select same person twice). People

sampled from might agree (red ball) or disagree (green ball) with

question asked.

Large N

If N large, might imagine that it doesn’t matter much whether you

replace balls in pot or not. In other words, for large N , binomial

would be decent approximation. Turns out to be true:

If X ∼ Hypergeometric(N, M, n) and N is large, then X has approximately the same distribution as Y ∼ Binomial(n, M/N).


Continuous distributions

Suppose, for random variable X,

P(a ≤ X ≤ b) = b − a

for 0 ≤ a ≤ b ≤ 1.

Is a legitimate probability since 0 ≤ b − a ≤ 1. But P(X = a) = a − a = 0 for any a, so not a discrete distribution.

Where did the probability go?


Cumulative distribution functions


One-dimensional change of variable


Joint Distributions

Know how to describe random variables one at a time: probability

function (discrete), density function (continuous), cumulative

distribution function (either).

But two random variables X , Y might be related. Don’t have a way

to describe this.

Example: X ∼ Bernoulli(2/3). Let Y = 1−X .

Y ∼ Bernoulli(1/3) (count failures not successes). X, Y

related, but doesn’t show in individual probability functions.


Joint probability functions

Can simply find probability of all possible combinations of values for

X, Y . Uses individual probability functions and relationship.

In example: if X = 0, then Y = 1; if X = 1, then Y = 0.

Possible values for Y depend on value of X . Also,

P (X = 1) = 2/3.

Notation: pX,Y (x, y) = P (X = x, Y = y) (comma is “and”),

called joint probability function. In example:

pX,Y (1, 0) = 2/3; pX,Y (0, 1) = 1/3.

Are only possible combinations of X and Y values.


Often convenient to depict as a table. Above example:

x \ y     0     1
0         0    1/3
1        2/3    0

Another:

u \ v     0     1     2
0        1/3   1/6   1/6
1        1/6  1/12  1/12

Note that all the probabilities sum to 1, because the joint probability function covers all possibilities.


Joint density functions

If random variables continuous, joint probability function makes no

sense; instead, define joint density function f(x, y) that

expresses chance of being “near” (X = x, Y = y).

Joint density function also covers all possible values of X, Y , so

integrates to 1 when integrated over both x and y.

Example: f(x, y) = 4x^2 y + 2y^5, 0 ≤ x, y ≤ 1 (page 85).


Sometimes possible values of Y depend on value of X . Account for

in integration.

Example: f(x, y) = 120x^3 y for x ≥ 0, y ≥ 0, x + y ≤ 1. (Thus if X = 0.6, Y cannot exceed 0.4.) Region forms a triangle: Figure 2.7.3 of text (p. 85). Verify the density by letting the y limits of integration depend on x (upper limit y = 1 − x).


Bivariate normal distribution

Suppose X, Y both have standard normal distributions, and suppose −1 < ρ < 1. Then the bivariate standard normal distribution with correlation ρ has joint density function

f(x, y) = [1 / (2π√(1 − ρ^2))] exp{ −(x^2 + y^2 − 2ρxy) / (2(1 − ρ^2)) }.

Plotting in 3D (Figure 2.7.4) gives a 3D bell shape.

ρ measures relationship between X and Y :

• ρ = 0: no relationship

• ρ > 0: when X > 0, Y likely > 0

• ρ < 0: when X > 0, Y likely < 0.


Bivariate standard normal has peak at (0, 0). Replacing x by

(x− µ1)/σ1 and y by (y − µ2)/σ2 shifts peak to (µ1, µ2) and

changes decrease of density away from peak (larger σ values mean

slower decrease).


Calculating probabilities

For a continuous random variable X, calculate probabilities by integrating, eg. P(a < X ≤ b) = ∫_a^b f(x) dx.

Same idea for continuous joint distribution, integrating over x and y.

Example: f(x, y) = 120x^3 y for x ≥ 0, y ≥ 0, x + y ≤ 1. Find P(0.5 ≤ X ≤ 0.7, Y > 0.2).

Draw a picture. The region is a trapezoid: y between 0.2 and the diagonal line y = 1 − x, x between the given limits. Integrate over y first, then x, to get 0.294.
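
A sketch using sympy that reproduces the 0.294, with the inner y integral running from 0.2 up to the line y = 1 − x:

```python
import sympy as sp

x, y = sp.symbols("x y")
f = 120 * x**3 * y                     # joint density on x >= 0, y >= 0, x + y <= 1

prob = sp.integrate(f,
                    (y, sp.Rational(1, 5), 1 - x),                  # y from 0.2 to 1 - x
                    (x, sp.Rational(1, 2), sp.Rational(7, 10)))     # x from 0.5 to 0.7
print(prob, float(prob))               # 0.294
```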


Marginal distributions

Started from individual distributions for X, Y plus relationship. But:

start from joint, get individual?

One way: get distribution of X by “averaging” over distribution of Y .

Discrete: simply row and column totals. Example:

u \ v     0     1     2   Sum
0        1/3   1/6   1/6   2/3
1        1/6  1/12  1/12   1/3
Sum      1/2   1/4   1/4    1


Without knowledge of V , U twice as likely 0 as 1; without

knowledge of U , V twice as likely 0 as 1 or 2.

Row totals here give marginal distribution of U ; column totals

here marginal distribution of V . Each marginal distribution is

proper probability distribution (probs sum to 1).


Continuous: integrate over other variable. Get marginal density

function.

Example: f(x, y) = 120x^3 y for x ≥ 0, y ≥ 0, x + y ≤ 1.

Marginal density for X: integrate over y, limits 0 to 1 − x; get

fX(x) = ∫_0^(1−x) 120x^3 y dy = 60x^3 (1 − x)^2.

For Y: integrate over x, limits 0 to 1 − y:

fY(y) = ∫_0^(1−y) 120x^3 y dx = 30y(1 − y)^4.

“Integrating out” unwanted variable.

Alternative approach via cumulative; text page 79.
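
These marginal densities can be checked with sympy; a sketch:

```python
import sympy as sp

x, y = sp.symbols("x y")
f = 120 * x**3 * y                      # joint density on the triangle x + y <= 1

fX = sp.integrate(f, (y, 0, 1 - x))     # integrate out y
fY = sp.integrate(f, (x, 0, 1 - y))     # integrate out x
print(sp.factor(fX))                    # 60*x**3*(x - 1)**2, i.e. 60 x^3 (1 - x)^2
print(sp.factor(fY))                    # 30*y*(y - 1)**4,    i.e. 30 y (1 - y)^4
print(sp.integrate(fX, (x, 0, 1)))      # 1, so the marginal is a proper density
```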


Example 2: bivariate standard normal. Recall the standard normal density; it integrates to 1, so

∫_{−∞}^{∞} (1/√(2π)) exp[−u^2/2] du = 1.

Marginal distribution of X in the bivariate standard normal: integrate out y:

fX(x) = ∫_{−∞}^{∞} [1 / (2π√(1 − ρ^2))] exp[ −(x^2 + y^2 − 2ρxy) / (2(1 − ρ^2)) ] dy.

Substitution: let u = (y − ρx)/√(1 − ρ^2), so du = dy/√(1 − ρ^2). Then

u^2 = (y^2 − 2ρxy + ρ^2 x^2) / (1 − ρ^2),


which is nearly what appears inside the "exp". Precisely:

fX(x) = ∫_{−∞}^{∞} (1/(2π)) exp[ −(u^2 + x^2)/2 ] du
      = (1/√(2π)) exp(−x^2/2) ∫_{−∞}^{∞} (1/√(2π)) exp(−u^2/2) du.

The integral is 1 (of a standard normal density), so

fX(x) = (1/√(2π)) exp(−x^2/2):

that is, the marginal distribution of X is standard normal.


Conditioning and Independence

Marginal distribution: of one variable, ignorant about other.

But what if we knew X ; what then about distribution of Y ?

Example 1:

x \ y     0     1
0         0    1/3
1        2/3    0

Suppose X = 1. Then ignore the 1st row.


But the 2nd row is not a probability distribution (sum 2/3, not 1). Idea: divide by the sum. Then if X = 1, P(Y = 0) = 1 and P(Y = 1) = 0: that is, if X = 1, Y is certain to be 0. Called the conditional distribution of Y given X = 1.

If X = 0, Y certain to be 1. Conditional distribution of Y different

for different X : Y depends on X .

Notation: as for conditional probability. Eg. above:

P (Y = 1|X = 0) = 1.


Example 2:

u \ v     0     1     2
0        1/3   1/6   1/6
1        1/6  1/12  1/12

Conditional distribution of V given U = 0? Use the U = 0 row. This sums to 2/3, so divide by this to get P(V = 0|U = 0) = 1/2, P(V = 1|U = 0) = 1/4, P(V = 2|U = 0) = 1/4.

The U = 1 row sums to 1/3; the conditional distribution of V given U = 1 is the same as given U = 0.

In example 2, it does not matter what U is – the conditional distribution of V is the same. Say that V and U are independent.


Two examples give extreme cases. In Example 1, knowing X gave

Y with certainty; in example 2, knowing U said nothing about V .

Most cases in between: knowing one variable has some effect on

distribution of other.

Symbols:

P(Y = b|X = a) = P(X = a, Y = b) / ∑_y P(X = a, Y = y)
               = P(X = a, Y = b) / P(X = a).

Denominator is marginal probability that X = a.


Conditioning on continuous random variables

Continuous case: no probabilities, so replace with density functions;

replace sum by integral. This gives conditional density function:

fY|X(y|x) = fX,Y(x, y) / ∫_{−∞}^{∞} fX,Y(x, y) dy = fX,Y(x, y) / fX(x),

replacing infinities by actual limits for y. Denominator depends on x

only; is marginal density function for X .

Then use conditional density to evaluate conditional probabilities.


Example: fX,Y(x, y) = 4x^2 y + 2y^5 for x, y between 0 and 1, 0 otherwise. Find P(0.2 ≤ Y ≤ 0.3|X = 0.8).

Steps: find the marginal density of X, use it to find the conditional density of Y given X, integrate the conditional density to find the probability.

Answers: the marginal density of X is fX(x) = 2x^2 + 1/3 for 0 ≤ x ≤ 1, 0 otherwise. The conditional density of Y|X is

fY|X(y|x) = (4x^2 y + 2y^5) / (2x^2 + 1/3);

then integrate over 0.2 ≤ y ≤ 0.3 and put in x = 0.8 to get P(0.2 ≤ Y ≤ 0.3|X = 0.8) = 0.0395.


Followup: what happens to P (0.2 ≤ Y ≤ 0.3) if X changes?

One answer: P (0.2 ≤ Y ≤ 0.3|X = 0.4) = 0.0242. So

probability does change as X changes; Y does depend on X .

However, change in probability quite small; dependence is not very

strong.


Law of total probability

Because

fY|X(y|x) = fX,Y(x, y) / ∫_{−∞}^{∞} fX,Y(x, y) dy = fX,Y(x, y) / fX(x),

it is also true that

fX,Y(x, y) = fX(x) fY|X(y|x).


So

P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_c^d ∫_a^b fX,Y(x, y) dx dy
                        = ∫_c^d ∫_a^b fX(x) fY|X(y|x) dx dy.

In words: can find probabilities either using joint density or using a

marginal and a conditional density. Can use whichever easier.


Independence of random variables

Recall this joint distribution:

u \ v     0     1     2
0        1/3   1/6   1/6
1        1/6  1/12  1/12
Sum      1/2   1/4   1/4

Conditional distribution of V same given U = 0 and given U = 1.

Also same as marginal distribution of V . Knowing U says nothing

about V .

(Also, conditional dist. of U same for all V and same as marginal for

U .)


Suggests definition: random variables independent if conditional

distribution always same, and always same as marginal.

Mathematics: X, Y independent if

pY(y) = pY|X(y|x) = pX,Y(x, y) / pX(x)

so that

pX,Y(x, y) = pX(x) pY(y).

This is usually the easiest check:

  • if pX,Y(x, y) = pX(x) pY(y) for all x, y, then X, Y independent.

  • if pX,Y(x, y) ≠ pX(x) pY(y) for any one (x, y) pair, then X, Y not independent.


For the example above: P(U = 0) = 2/3, P(U = 1) = 1/3; P(V = 0) = 1/2, P(V = 1) = P(V = 2) = 1/4. Also,

P(U = 0) P(V = 0) = (2/3) · (1/2) = 1/3 = P(U = 0, V = 0).

Repeat for all u and v: proves independence.
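
The same check can be done computationally; a sketch using Python fractions (the variable names are mine):

```python
from fractions import Fraction as F

# Joint probabilities p(u, v) from the table.
joint = {(0, 0): F(1, 3), (0, 1): F(1, 6),  (0, 2): F(1, 6),
         (1, 0): F(1, 6), (1, 1): F(1, 12), (1, 2): F(1, 12)}

pU = {u: sum(p for (uu, vv), p in joint.items() if uu == u) for u in (0, 1)}
pV = {v: sum(p for (uu, vv), p in joint.items() if vv == v) for v in (0, 1, 2)}

# Independent iff the joint probability equals the product of marginals everywhere.
print(all(joint[u, v] == pU[u] * pV[v] for (u, v) in joint))   # True
```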


Compare this joint distribution:

x \ y     0     1
0         0    1/3
1        2/3    0

Now,

P(X = 0) P(Y = 0) = (1/3) · (2/3) = 2/9

and P(X = 0, Y = 0) = 0 ≠ 2/9. One calculation shows X, Y not independent.


Independence of continuous random variables

As usual, turn probability into density. If

fX,Y (x, y) = fX(x)fY (y)

for all x, y, then continuous random variables X, Y independent. If

it fails for any (x, y) pair, not independent.

Example: suppose fX(x) = 2x^2 + 1/3, fY(y) = (4/3)y + 2y^5, and fX,Y(x, y) = 4x^2 y + 2y^5 for 0 ≤ x, y ≤ 1. Then

fX(x) fY(y) = (2x^2 + 1/3)((4/3)y + 2y^5),

which cannot be simplified to fX,Y(x, y). So X, Y not independent.


Order statistics

Suppose that X1, X2, . . . , Xn all, independently, have same

distribution (a sample from distribution). Suppose common cdf

FX(x).

For example: take 20 people, give each IQ test. Without knowing

about individuals, use same distribution for each. What might

highest score in sample be?

Idea: more people sampled, higher the highest score could be (get

more chances to see a very high score).


Let M = max(X1, X2, . . . , Xn). Then

P(M ≤ m) = P(X1 ≤ m, X2 ≤ m, . . . , Xn ≤ m)
         = P(X1 ≤ m) P(X2 ≤ m) · · · P(Xn ≤ m)
         = [FX(m)]^n.

If X continuous, differentiate to get the density.

Example: each Xi ∼ Uniform[0, 1]. Then FX(x) = x, so P(M ≤ m) = m^n.

If n = 5, P(M ≤ 0.9) = 0.9^5 = 0.59; if n = 20, P(M ≤ 0.9) = 0.9^20 = 0.1216, much smaller. That is, with more observations, the maximum is likely to be higher (less likely to be low).
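
A small simulation sketch (not from the slides) agrees with the [FX(m)]^n formula:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, reps = 20, 0.9, 100_000

samples = rng.uniform(0, 1, size=(reps, n))   # reps samples, each of n Uniform[0,1] values
estimate = np.mean(samples.max(axis=1) <= m)  # estimated P(M <= 0.9)
print(estimate, m ** n)                       # both close to 0.1216
```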


Similar idea for the minimum: let K = min(X1, X2, . . . , Xn). Then

P(K ≤ k) = 1 − P(K > k)
         = 1 − P(X1 > k, X2 > k, . . . , Xn > k)
         = 1 − P(X1 > k) P(X2 > k) · · · P(Xn > k)
         = 1 − (1 − FX(k))^n.

Example: if n = 10, Xi ∼ Uniform[0, 1], then P(K ≤ 0.2) = 1 − (1 − 0.2)^10 = 0.8926.


Simulating probability distributions

So far, considered mathematical properties of distributions:

probabilities, densities, cdf’s etc. But some distributions difficult to

understand or use.

Generate random values from a distribution. Uses:

  • approximation of difficult-to-calculate quantities

  • simulation of complex systems

  • generating potential solutions for difficult problems

  • random choices for quizzes, computer games

  • understanding behaviour of samples (chapter 4)


Pseudo-random numbers

In practice, don’t get actual random numbers, but pseudo-random

numbers. These follow recipe, but look random. (Paradox?)

Not so bad, because crucial feature: unpredictable – cannot easily

say what comes next.

Typical method: multiplicative congruential generator. Start with

initial “seed” value R0, then, for n = 0, 1, . . .:

Rn+1 = 106Rn + 1283 (mod 6075)

(“take remainder on division by 6075”).


Eg. start with R0 = 1001:

R1 = 106(1001) + 1283 = 107389 (mod 6075) = 4114

R2 = 106(4114) + 1283 = 437367 (mod 6075) = 6042

R3 = 106(6042) + 1283 = 641735 (mod 6075) = 3860

and so on, with 0 ≤ Ri < 6075.
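
The recipe is a few lines of code; a sketch that reproduces the three values above (the function name lcg is mine):

```python
def lcg(seed, a=106, c=1283, m=6075):
    """Yield pseudo-random integers R1, R2, ... from the congruential recipe."""
    r = seed
    while True:
        r = (a * r + c) % m   # take the remainder on division by 6075
        yield r

gen = lcg(1001)
print([next(gen) for _ in range(3)])   # [4114, 6042, 3860]
```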

Gives up to 6075 different random integers before repeating itself.

Suitable choice of constants gives long “period” and unpredictable

sequence. (Number theory.)

In practice, use much larger constants – get many more possible

random numbers.


Continuous uniform on [0, 1]

To get (pseudo-) random values from Uniform[0, 1], take

pseudo-random integers and divide by maximum. Result has

approx. uniform distribution.

With generator above, max value is 6075, so random uniform values

are 4114/6075 = 0.677, 6042/6075 = 0.995,

3860/6075 = 0.635. (Only 6075 possible values, so only 3 or so

digits trustworthy.)

“Random numbers” in calculators, Excel etc. of this kind.

Random Uniform[0, 1] values are used as building block for

random values from other distributions. Eg. random

Y ∼ Uniform[0, b]: multiply a random Uniform[0, 1] by b.


Bernoulli distribution

Suppose we want to simulate X ∼ Bernoulli(0.4): single trial,

prob. 0.4 of success.

Take single random uniform U . If U ≤ 0.4, take X = 1 (success),

otherwise take X = 0 (failure).

Works because U ≤ 0.4 about 0.4 of the time, so will get

successes about 0.4 of the time (long run).

In general, for X ∼ Bernoulli(θ), take X = 1 if U ≤ θ, 0

otherwise.


Binomial and geometric distributions

If Y ∼ Binomial(n, θ), Y = X1 + X2 + · · ·+ Xn where

Xi ∼ Bernoulli(θ). So just generate n random Bernoullis and

add them up.

Similarly, if Z ∼ Geometric(θ), Z is number of failures (in

Bernoulli trials) before 1st success. So get random value of Z like

this:

1. set Z = 0

2. generate U from Uniform[0, 1]

3. if U ≤ θ, stop with current Z

4. otherwise, add 1 to Z and return to step 2.
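
A sketch of both recipes, built only on Uniform[0, 1] values from random.random() (the helper names are mine):

```python
import random

def bernoulli(theta):
    """1 with probability theta, else 0, from a single uniform."""
    return 1 if random.random() <= theta else 0

def binomial(n, theta):
    """Sum of n independent Bernoulli(theta) values."""
    return sum(bernoulli(theta) for _ in range(n))

def geometric(theta):
    """Number of failures before the first success (the loop above)."""
    z = 0
    while random.random() > theta:   # failure: add 1 to Z and try again
        z += 1
    return z

random.seed(0)
print(binomial(5, 0.4), geometric(0.5))
```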


Inverse-CDF method

Cdf F (x) = P (X ≤ x) defined for all x.

Also, in set of possible X-values (where f(x) > 0), F (x)

invertible: for any p, exactly one x where F (x) = p.

Example: X ∼ Exponential(λ). Then F(x) = 1 − e^(−λx). For x > 0, write p = F(x) and solve for x to get

x = −(1/λ) ln(1 − p).

Then generate a random p from Uniform[0, 1], and put it in the

formula to get a random X .


For instance, if λ = 2, might have p = 0.7, and hence the random X is −(1/2) ln(1 − 0.7) = 0.602.

Why does this work in general?

Let Y be any random variable; let F(y) = P(Y ≤ y) be the cdf of Y. Define the random variable W = F(Y). Then

P(W ≤ w) = P(F(Y) ≤ w) = P(Y ≤ F^(−1)(w)) = F{F^(−1)(w)} = w.

That is, W ∼ Uniform[0, 1] whatever the distribution of Y.


So: to simulate Y, simulate W ∼ Uniform[0, 1], then use the relationship Y = F^(−1)(W) (using the simulated uniform in place of W).

This was done above for the exponential. Called the inverse-CDF method.
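
A sketch of the recipe for the exponential with λ = 2, reproducing the 0.602 from the worked example and then averaging many simulated values:

```python
import math, random

def rexp(lam, p=None):
    """Inverse-CDF sample from Exponential(lam); p defaults to a fresh Uniform[0,1]."""
    if p is None:
        p = random.random()
    return -math.log(1 - p) / lam

print(rexp(2, p=0.7))                    # 0.6020 (the example above)
random.seed(1)
xs = [rexp(2) for _ in range(100_000)]
print(sum(xs) / len(xs))                 # close to the exponential mean 1/2
```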


Also works for discrete. Example: Poisson(0.7) has this cdf:

x 0 1 2 3 4

P (X ≤ x) 0.497 0.844 0.966 0.994 0.999

Procedure: get random U ∼ Uniform[0, 1]. If U ≤ 0.497, take

random X = 0; else if U ≤ 0.844, take X = 1, . . . , else if

U > 0.999, take X = 5.

(Higher values possible, but very unlikely; for more accuracy use

more digits.)


Normal distribution

Difficult to simulate from (cannot invert cdf).

But consider X, Y with a bivariate standard normal distribution, correlation 0. The joint density is

fX,Y(x, y) = (1/(2π)) exp{ −(x^2 + y^2)/2 }.

Thinking of (x, y) as a point in R^2, note that the density depends only on the distance from the origin (r^2 = x^2 + y^2), not on the angle.

So generate random (x, y) pair by generating random angle

θ ∼ Uniform[0, 2π], random distance, separately.

(details: 2-variable transformation using Jacobian determinant.)


Density function for the distance R is

fR(r) = r e^(−r^2/2)

and the cdf is

FR(r) = ∫_0^r t e^(−t^2/2) dt = 1 − e^(−r^2/2)

(eg. use the substitution u = t^2/2, du = t dt).

FR(r) is invertible; let p = FR(r), solve for r to get

r = √(−2 ln(1 − p)).

Get a random R by taking U ∼ Uniform[0, 1] and using it for p above.


Finally, convert the random R, θ to (X, Y) using the polar coordinate formulas

X = R cos θ; Y = R sin θ.

Example: suppose random θ = 1.8 (radians), U = 0.3. Then R = √(−2 ln(1 − 0.3)) = 0.8446. So

X = 0.8446 cos 1.8 = −0.19; Y = 0.8446 sin 1.8 = 0.82.
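
A sketch of the whole recipe (random angle, random distance, then polar coordinates); the function name is mine, and the last lines re-check the worked example:

```python
import math, random

def std_normal_pair():
    """One pair of independent standard normals from two Uniform[0,1] values."""
    theta = 2 * math.pi * random.random()                # random angle on [0, 2*pi]
    r = math.sqrt(-2 * math.log(1 - random.random()))    # distance with cdf 1 - e^(-r^2/2)
    return r * math.cos(theta), r * math.sin(theta)

random.seed(3)
print(std_normal_pair())

r = math.sqrt(-2 * math.log(1 - 0.3))            # the example: U = 0.3, theta = 1.8
print(r, r * math.cos(1.8), r * math.sin(1.8))   # 0.8446, -0.19, 0.82
```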


Rejection methods

Inverse-CDF method doesn't always work – the cdf can be too complicated to invert. Example: X ∼ Gamma(3, 1), with density function

f(x) = (x^2/2) e^(−x).

This has maximum 2e^(−2) = 0.2707 at x = 2. Density "small" beyond x = 10.


Idea: sample a random point (X, Y) in a rectangle enclosing f(x), with 0 ≤ X ≤ 10, 0 ≤ Y ≤ 2e^(−2) (using uniform distributions):

• if point below density function (Y ≤ f(X)), take X as random

value from distribution

• otherwise, reject (X, Y ) pair and try again.

Chance of X-value being accepted proportional to density f(X):

when value more likely in distribution, more likely to be accepted.


Example:

X 7.3 1.0 2.7 1.7 9.4 5.5

Y 0.206 0.130 0.023 0.256 0.197 0.203

f(X) 0.018 0.184 0.245 0.264 0.004 0.062

reject y n n n y y

Values 7.3, 9.4, 5.5 rejected; 1.0, 2.7, 1.7 random values from

Gamma(3, 1).

Needed 12 random uniforms to generate 3 random gammas.
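
A sketch that applies the accept/reject rule to the six (X, Y) pairs above and then wraps the idea in a loop (the function names are mine):

```python
import math, random

def f(x):
    """Gamma(3, 1) density."""
    return x * x * math.exp(-x) / 2

# The six candidate points from the table.
for X, Y in [(7.3, 0.206), (1.0, 0.130), (2.7, 0.023),
             (1.7, 0.256), (9.4, 0.197), (5.5, 0.203)]:
    print(X, "accept" if Y <= f(X) else "reject")   # accepts 1.0, 2.7, 1.7

def rgamma3():
    """Propose uniform points in the enclosing rectangle until one falls under f."""
    while True:
        X = random.uniform(0, 10)
        Y = random.uniform(0, 2 * math.exp(-2))
        if Y <= f(X):
            return X

random.seed(2)
print(rgamma3())
```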


Can be made more sophisticated. Let g(x) be a density function that is easy to sample from, such that f(x) ≤ c g(x) for all x (choose c). Above, g(x) = 1, c = 2e^(−2).

Generate random value X from distribution with density g(x).

Generate random Y ∼ Uniform[0, cg(X)]. If Y ≤ f(X),

accept X ; otherwise, reject and try again.

Efficiency of rejection method greatest when cg(x) only slightly

greater than f(x); then, very little rejection.


Simulation in Minitab

Minitab can generate random values from many distributions (using

methods above or variations).

Basic procedure:

• Select Calc, Random Data

• Select desired distribution

• Fill in number of random values to generate

• Fill in (empty) column to store values

• Fill in parameters of distribution (if any)

• Click OK.


Examples: Uniform[0, 1], Bernoulli(0.4), Binomial(5, 0.4),

Exponential(2), Poisson(0.7), Normal(0, 1).

To generate random values from another distribution, generate

column of values from Uniform[0, 1], then use Calculator to create

desired values (p. 47–48 of manual).

Recall random values actually “pseudo-random”: starting at same

seed value gives same sequence of random values. Can set seed

value in Minitab (Calc, Set Base) to get reproducible random values.


Expectation


Introduction

Game: toss a fair coin, win $2 for a head, lose $1 for a tail.

Amount you win is a random variable W with P(W = 2) = P(W = −1) = 1/2.

Could win or lose on any one play, but (a) winning and losing equally

likely, (b) amount won greater than amount lost.

Would probably play this game given chance, because expect to win

in long run, on average over many plays, even though anything

possible.


Expected value of a random variable is its long-run average. For W above, expect equal numbers of 2's and −1's, so the expected value would be

E(W) = (2 + (−1))/2 = 1/2.

Another: suppose Y = 7 always (ie. P(Y = 7) = 1, P(Y = k) = 0 for k ≠ 7). Then E(Y) should be 7.

Another: roll 2 dice. Win $30 for double 6, lose $1 otherwise. Looks

good because potential win greater than potential loss, but win very

unlikely. How to balance? For winnings random variable V , what is

E(V )?


Expectation for discrete random variables

Define the expected value (expectation) of a random variable X:

E(X) = ∑_x x P(X = x),

"sum of value times probability". Sum over all possible x.

Check for the above examples:

E(W) = 2 · (1/2) + (−1) · (1/2) = 1/2
E(Y) = 7 · 1 = 7
E(V) = 30 · (1/36) + (−1) · (35/36) = −5/36


First 2 as expected.

For V, the prob. of a double 6 is 1/36, so the chance of losing is 1 − 1/36. Even though the prize is large (win $30 for double 6), E(V) < 0, so would lose in the long run, because the win prob is even smaller than the prize is large.

Formula much easier than reasoning out – less thought!

Now suppose X ∼ Bernoulli(θ). What is E(X)?

X = 1 with prob θ, 0 with prob 1− θ, so:

E(X) = 1 · θ + 0 · (1− θ) = θ.

In long run, average X equal to success probability.

Makes sense (think of θ = 0 and θ = 1 as extreme cases).


Expectation for geometric and Poisson distributions

To find more complicated expectations, cleverness can be needed

to figure out sum.

Suppose Z ∼ Geometric(θ), so P(Z = k) = θ(1 − θ)^k. Then

E(Z) = ∑_{k=0}^{∞} k θ(1 − θ)^k = (1 − θ)/θ.

Method: write (1− θ)E(Z) to look like E(Z) but with k − 1 in

place of k, subtract.

Mean is odds against success: if failure 4 times more likely than

success, on average get 4 failures before 1st success.


If X ∼ Poisson(λ), then

E(X) = ∑_{k=0}^{∞} k · e^(−λ) λ^k / k!.

Note that the k = 0 term is 0, so start the sum at k = 1, then let l = k − 1 to get

E(X) = λ ∑_{l=0}^{∞} e^(−λ) λ^l / l!.

The sum is of all the probabilities from a Poisson distribution, so it is 1. (Or, ∑_{l=0}^{∞} λ^l/l! is the Maclaurin series for e^λ.)

So for X ∼ Poisson(λ), E(X) = λ. Thus the parameter λ is in fact the mean.


St Petersburg Paradox

Game: toss a fair coin, let Z be the number of tails before the 1st head. Win 2^Z dollars. Thus for TTTH, win 2^3 = $8. Expected winnings (fair price to pay to play)?

∑_{k=0}^{∞} 2^k · (1/2^k) · (1/2) = ∑_{k=0}^{∞} 1/2 = ∞.

How can this be? Only ever win finite amount.

Play game 10 times:

Z 0 1 0 0 3 0 3 0 6 1

Winnings 1 2 1 1 8 1 8 1 64 2

Mean winnings $8.90, larger than actual winnings 90% of time!
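
A simulation sketch of the game (not from the slides) shows the same behaviour: the running average keeps jumping whenever a rare big payoff occurs, and it does not settle down as the number of plays grows.

```python
import random

def winnings():
    """Play once: count tails before the first head, win 2^Z dollars."""
    z = 0
    while random.random() > 0.5:
        z += 1
    return 2 ** z

random.seed(4)
for n in (100, 10_000, 1_000_000):
    print(n, sum(winnings() for _ in range(n)) / n)   # the average does not settle down
```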


Problem is that any one big payoff completely dominates average,

and by playing game enough times, can make it very likely that a

very big payoff will occur.

If there is a maximum payoff, say $2^30, expectation is finite ($15.50).

When random variable can be arbitrarily large, expectation may not

be finite. But can be finite – compare Poisson, where probabilities

decrease faster than values increase. Similarly, lotteries with very

big prizes still have expected winnings less than ticket price

(because chance of winning big prize small enough).


Utility and Kelly betting

In St Petersburg paradox, expectation didn’t tell story, because “fair

price” ought to be finite. Changing game by a little changed

expected winnings a lot.

Most bets look like this: win known $w if you win, lose $1 if you lose.

Suppose probability of winning is θ. Then expectation is

E = wθ + (−1)(1− θ) = θ(w + 1)− 1

which is positive if θ > 1/(w + 1).

For instance, if w = 2, E > 0 if θ > 1/3. That is, if you believe your chance of winning is better than 1/3, you should bet, because in the long run you win more than you lose.


If bet more than $1, wins and losses increase in proportion: on bet

of $b, win $wb or lose $b.

Positive expectation seems to say “bet everything you have”: far too

risky for most! Always possibility of losing.

Idea: consider utility of money, not same as money itself. If you

only have $10, $1 is a lot of money (has great utility), but if you have

$1 million, $1 almost meaningless.

Utility of money varies between people, but could be proportional to

current fortune. Then, utility of money depends on log of $ amount.


Suppose we currently have $c, and want to choose b for bet above,

assuming all else known. Then fortune after the bet is F = c + bw

if we win (prob θ), F = c− b if we lose (prob 1− θ). Utility idea:

choose b to maximize E(ln F ):

E(ln F ) = θ ln(c + bw) + (1− θ) ln(c− b).

Take the derivative (in b) and set it to 0:

dE(ln F)/db = wθ/(c + bw) − (1 − θ)/(c − b)
            = [θw(c − b) − (1 − θ)(c + bw)] / [(c + bw)(c − b)].

Zero when the numerator is zero; solve for b to get

b = c{θ(w + 1) − 1}/w = cE/w.

This is called the Kelly bet. (If negative, don’t bet anything!)


Examples, with c = 100:

  • w = 9, θ = 1/8. E = θ(w + 1) − 1 = 0.25, so Kelly bet b = 100(0.25)/9 = $2.78.

  • w = 1.5, θ = 1/2. E = 0.25 again; Kelly bet b = 100(0.25)/1.5 = $16.67.

Note: expected winnings same in both cases, but bet less when

w = 9: more risk because less likely to win.

In general, bet fraction of current fortune that is bigger when

expected winnings bigger and chance of winning bigger.
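
The formula b = cE/w is one line of code; a sketch reproducing the two examples (the function name is mine):

```python
def kelly_bet(c, w, theta):
    """Bet that maximizes expected log fortune: b = c*E/w, or 0 if E <= 0."""
    E = theta * (w + 1) - 1          # expected winnings per $1 bet
    return max(c * E / w, 0)

print(kelly_bet(100, 9, 1/8))        # 2.78
print(kelly_bet(100, 1.5, 1/2))      # 16.67
```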


Expectation of functions of random variables

In the St Petersburg problem above, the random variable was the number of tails Z, but the winnings were 2^Z. In effect, found that E(2^Z) was infinite. Method: sum values of 2^Z times probability.

Formally: let g(X) be some function of random variable X. Then

E(g(X)) = ∑_x g(x) P(X = x).


Linearity of expected values

Suppose we have two random variables X, Y . What is

E(X + Y )?

Go back to definition, bearing in mind that X,Y might be related,

so have to use joint probability function:

E(X + Y) = Σ_x Σ_y (x + y) P(X = x, Y = y)
         = Σ_x x P(X = x) + Σ_y y P(Y = y)
         = E(X) + E(Y).

Details: expand out (x + y) in first sum, recognize (eg.) that
Σ_y P(X = x, Y = y) = P(X = x) (marginal distribution).


Same logic shows that E(aX + bY ) = aE(X) + bE(Y ).

Likewise,

E(X1 + X2 + · · ·+ Xn) = E(X1) + E(X2) + · · ·+ E(Xn).

Also, if Y = 1 always, we get E(aX + b) = aE(X) + b.


Expectation for binomial distribution

If Y ∼ Binomial(n, θ), then Y actually sum of Bernoullis:

Y = X1 + X2 + · · ·+ Xn, where Xi ∼ Bernoulli(θ).

Know that E(Xi) = θ, so (by result on previous page)

E(Y ) = θ + θ + · · ·+ θ = nθ.

Makes sense: eg. if you succeed on one-third of trials on average

(θ = 1/3), and you have n = 30 trials, you’d expect 10 successes,

and nθ = 10.


Independence and E(XY )

Since E(X + Y ) = E(X) + E(Y ) for all X and Y , tempting to

claim that E(XY ) = E(X)E(Y ). But is this true?

Consider this joint distribution:

          Y = 1   Y = 2   Total
  X = 0    1/3     1/6     1/2
  X = 1    1/4     1/4     1/2
  Total    7/12    5/12      1

Using marginal distributions, E(X) = 1/2 and E(Y) = 17/12. What is
E(XY)?


When X = 0, XY = 0 for all Y. So P(XY = 0) = 1/3 + 1/6 = 1/2.
XY = 1 when X = 1, Y = 1, so P(XY = 1) = 1/4. Likewise,
XY = 2 when X = 1, Y = 2, so P(XY = 2) = 1/4. Hence

E(XY) = 0 · (1/2) + 1 · (1/4) + 2 · (1/4) = 3/4.

But

E(X)E(Y) = (1/2) · (17/12) = 17/24 ≠ 3/4.

So E(XY) ≠ E(X)E(Y) in general.
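A quick Python check of this table (illustrative only; not part of the original slides):

from fractions import Fraction as F

joint = {(0, 1): F(1, 3), (0, 2): F(1, 6), (1, 1): F(1, 4), (1, 2): F(1, 4)}
EX = sum(x * p for (x, y), p in joint.items())       # 1/2
EY = sum(y * p for (x, y), p in joint.items())       # 17/12
EXY = sum(x * y * p for (x, y), p in joint.items())  # 3/4
print(EXY == EX * EY)                                # False: 3/4 is not 17/24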


But what if X,Y independent? Then

E(XY) = Σ_x Σ_y x y P(X = x) P(Y = y) = E(X)E(Y),

rearranging, because joint prob is product of marginals.

So, if X, Y independent, then E(XY ) = E(X)E(Y ), but not

necessarily otherwise.

See later (in “covariance”) that difference E(XY )− E(X)E(Y )

measures extent of non-independence of X and Y .


Monotonicity of expectation

Suppose X, Y discrete random variables such that X ≤ Y . (That

is, for any event giving X = x and Y = y, x ≤ y always.

Example: roll 2 dice, let X be score on 1st die, Y be total score on 2

dice.)

How do E(X), E(Y ) compare?

Idea: let Z = Y −X . Then Z ≥ 0, discrete, and

E(Z) = Σ_{z≥0} z P(Z = z). All terms in sum positive or 0, so

E(Z) ≥ 0. But E(Z) = E(Y −X) = E(Y )− E(X). Hence

E(Y )− E(X) ≥ 0.

Conclusion: if X ≤ Y , then E(X) ≤ E(Y ).


Expectation for continuous random

variables

Can’t use formula

E(X) = Σ_x x P(X = x)

because probability of particular value not meaningful for continuous

X .

Standard procedure: replace probability by density function, replace

sum by integral.


That is, if X continuous random variable, define

E(X) = ∫_{−∞}^{∞} x f(x) dx.

In integral, replace infinite limits by actual upper and lower limits.


Examples

Suppose X ∼ Uniform[0, 1], so f(x) = 1, 0 ≤ x ≤ 1. Then

E(X) = ∫_0^1 x · 1 dx = [x²/2]_0^1 = 1/2.

As you would have guessed.

Suppose W ∼ Exponential(λ). Then

E(W) = ∫_0^∞ w λe^{−λw} dw.

Integrate by parts with u = w, v′ = λe^{−λw}: E(W) = 1/λ.

If W represents time between events, E(W ) in units of time, so λ

in units of 1 / time: a rate, number of events per unit time.


Suppose Z ∼ N(0, 1), so f(z) = (1/√(2π)) e^{−z²/2}. Then

E(Z) = ∫_{−∞}^{∞} (1/√(2π)) z e^{−z²/2} dz.

Replacing z by −z gives the negative of the integrand, ie. z f(z) is an
odd function. Hence integral is 0, so E(Z) = 0. (Alternative:
substitute u = z²/2.)


As for discrete, expectation may not be finite.

f(x) = 1/x², x ≥ 1 is a proper density, but for random variable X
with this distribution:

E(X) = ∫_1^∞ x · (1/x²) dx = ∫_1^∞ (1/x) dx = [ln x]_1^∞ = ∞.

Problem: though density decreases as x increases, does not do so

fast enough to make E(X) integral converge.


Properties of expectation for continuous random

variables

These are same as for discrete variables. Proofs use integrals and

densities not sums, but otherwise very similar. Suppose X has

density fX(x) and X,Y have joint density fX,Y (x, y):

• E(g(X)) = ∫_{−∞}^{∞} g(x) fX(x) dx

• E(h(X, Y)) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x, y) fX,Y(x, y) dx dy.

• E(aX + bY ) = aE(X) + bE(Y )

• If X,Y independent, then E(XY ) = E(X)E(Y )

• If X ≤ Y , then E(X) ≤ E(Y ).


Expectations for general uniform and normal

distributions

Suppose X ∼ Uniform[a, b]. Then

U = (X − a)/(b − a) ∼ Uniform[0, 1], so E(U) = 1/2.

Write in terms of X : X = a + (b− a)U , so

E(X) = a + (b− a)E(U) = (a + b)/2. Again as expected.

Now suppose X ∼ Normal(µ, σ2). Then

Z = (X − µ)/σ ∼ N(0, 1). Write X = µ + σZ ; then

E(X) = µ + σE(Z) = µ + σ(0) = µ.

That is, parameter µ in normal distribution is the mean.


Variance, covariance and correlation

Compare random variables:

Z = 10 with prob 1; Y = 5 or 15, each with prob 1/2.

E(Z) = E(Y ) = 10, but Y further from mean than Z .

Expectation only gives long-run average of random variable, not how

much higher/lower than average it could be. For this, use variance:

Var(X) = E[(X − µX)2], µX = E(X).


For discrete X, Var(X) = Σ_x (x − µX)² P(X = x). So:

Var(Z) = (10 − 10)² · 1 = 0;
Var(Y) = (5 − 10)² · (1/2) + (15 − 10)² · (1/2) = 25.

Here, Var(Y ) > Var(Z) because Y tends to be further from its

mean than Z does.

(Here, Y always further from mean than Z . But in general,

Var(Y ) > Var(Z) means Y likely to be further from mean than

Z .)


More about variance

Because (X − µX)2 ≥ 0, Var(X) ≥ 0 for all random variables

X .

Var(X) = 0 only if X does not vary (compare Z). No upper limit

on variance; larger variance means more unpredictable (can get

further from mean).

Why square? Cannot just omit: E(X − µX) = E(X)− µX = 0

always. Absolute value E(|X − µX |) possible, but hard to work

with (not differentiable).


Standard deviation

If random variable X in metres, Var(X) in metres-squared. For

interpretation, suggests using square root of variance:

SD(X) = √Var(X)

which would be in metres. Called standard deviation of X .

SD easier for interpretation, variance easier for algebra.


Variance of Bernoulli

If X ∼ Bernoulli(θ), E(X) = θ, and

Var(X) = Σ_x (x − θ)² P(X = x)
       = (1 − θ)² θ + (0 − θ)² (1 − θ)
       = θ(1 − θ)(1 − θ + θ) = θ(1 − θ).

This is 0 if θ = 0 or 1 (when results completely predictable) and
maximum, 1/4, when θ = 1/2.


Useful properties of variance

Var(aX + b) = a² Var(X).

Because variance in squared units, changing X eg. from metres to

feet multiplies variance not by 3.3 but by that squared.

Also, adding b changes mean of X , but doesn’t change how spread

out distribution is (shifts left/right).

Var(X) = E(X²) − µX².

Useful result for finding variances in practice, since E(X2) not

usually too hard.


Proofs: use definition of variance as expectation, then rules of

expectation.

Bernoulli revisited: E(X²) = 1²·θ + 0²·(1 − θ) = θ, so
Var(X) = θ − θ² = θ(1 − θ) as before.


Variance of exponential distribution

For continuous distributions, find E(X2) or variance using integral.

W ∼ Exponential(λ): already know E(W ) = 1/λ. Find

Var(W ) by first finding E(W 2), using integration by parts:

E(W²) = ∫_0^∞ w² λe^{−λw} dw = [−w² e^{−λw}]_0^∞ + (2/λ) ∫_0^∞ w λe^{−λw} dw.

Square brackets 0; integral is E(W) = 1/λ. Hence
E(W²) = (2/λ)(1/λ) = 2/λ², and

Var(W) = 2/λ² − (1/λ)² = 1/λ².

For exponential distribution, variance is square of mean.


Variance of normal random variable

Suppose Z ∼ N(0, 1). Know that E(Z) = 0, so

Var(Z) = E(Z²) − 0² = E(Z²). Thus

Var(Z) = ∫_{−∞}^{∞} z² (1/√(2π)) e^{−z²/2} dz.

To tackle by parts: let u = z/√(2π), v′ = z e^{−z²/2}. v′ has
antiderivative v = −e^{−z²/2}. Gives

Var(Z) = [−(z/√(2π)) e^{−z²/2}]_{−∞}^{∞} + ∫_{−∞}^{∞} (1/√(2π)) e^{−z²/2} dz.

Square bracket 0 (e^{−z²/2} → 0 very fast); integral is that of density of
Z, so 1. Hence Var(Z) = 1.

Suppose now X ∼ N(µ, σ²). Then Z = (X − µ)/σ, so
X = µ + σZ. So Var(X) = σ² Var(Z) = σ². That is,
parameter σ² in normal distribution is variance.


Covariance

Consider discrete joint distribution:

          Y = 1   Y = 2   sum
  X = 0    0.4     0.2    0.6
  X = 1    0.1     0.3    0.4
  sum      0.5     0.5    1.0

If X = 0, Y more likely to be small; if X = 1, Y more likely to be

large. X, Y vary together.

Idea: covariance Cov(X, Y ) = E[(X − µX)(Y − µY )].


Here, µX = E(X) = 0.4, µY = E(Y ) = 1.5, so take all

combinations of (X − µX , Y − µY ) values and their probs:

Cov(X, Y)
= (0 − 0.4)(1 − 1.5)(0.4) + (0 − 0.4)(2 − 1.5)(0.2)
+ (1 − 0.4)(1 − 1.5)(0.1) + (1 − 0.4)(2 − 1.5)(0.3)
= 0.08 − 0.04 − 0.03 + 0.09 = 0.10.

Result positive. (X, Y ) combinations where (X − µX)(Y − µY )

positive outweigh those where negative. That is, when X large, Y

more likely to be large as well (and small with small).

Covariance can be negative: then large X goes with small Y and

vice versa. Covariance 0: no trend.


Calculating covariances

Useful formula:

Cov(X, Y ) = E(XY )− E(X)E(Y ).

Proof: definition of covariance, properties of expectation.

Previous example revisited:

E(XY ) = (0)(1)(0.4)+(0)(2)(0.2)+(1)(1)(0.1)+(1)(2)(0.3) = 0.7;

Cov(X, Y ) = 0.7− (0.4)(1.5) = 0.1.

As with corresponding variance formula, useful for calculations.
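A hedged Python sketch of the same calculation (illustrative only; the probabilities are those in the table above):

joint = {(0, 1): 0.4, (0, 2): 0.2, (1, 1): 0.1, (1, 2): 0.3}
EX = sum(x * p for (x, y), p in joint.items())       # 0.4
EY = sum(y * p for (x, y), p in joint.items())       # 1.5
EXY = sum(x * y * p for (x, y), p in joint.items())  # 0.7
print(EXY - EX * EY)                                 # 0.10 (up to rounding)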


Covariance and independence

If X,Y independent, then E(XY ) = E(X)E(Y ), so

Cov(X, Y ) = E(XY )− E(X)E(Y ) = 0.

But covariance could be 0 without independence. Example:

(X, Y) = (−1, 1), (0, 0), (1, 1), each prob 1/3. E(X) = 0,
E(Y) = 2/3, E(XY) = (−1)(1/3) + (0)(1/3) + (1)(1/3) = 0, so
Cov(X, Y) = 0 − (0)(2/3) = 0. But X, Y not independent: given
X, know Y exactly.

Relationship between X, Y not a trend: as X increases, Y

decreases then increases. No general statement about Y

large/small as X increases.

Fact: if X,Y bivariate normal, covariance 0 implies independence.


Variance of sum

Previously found that E(X + Y ) = E(X) + E(Y ) for all X,Y .

Corresponding formula for variances?

Derive formula for Var(X + Y ) by writing as expectation,

expanding out square, recognizing terms:

Var(X + Y ) = Var(X) + Var(Y ) + 2 Cov(X,Y ).

Logic: if Cov(X,Y ) > 0, X, Y big/small together, sum could be

very big/small, variance large. If Cov(X, Y ) < 0, large X

compensates small Y and vice versa, sum of moderate size,

variance small.

If X,Y independent, then Var(X + Y ) = Var(X) + Var(Y ).


Variance of binomial distribution

Suppose X ∼ Binomial(n, θ). Then can write

X = Y1 + Y2 + · · ·+ Yn,

where Yi ∼ Bernoulli(θ) independently. So

Var(X) = Var(Y1) + Var(Y2) + · · ·+ Var(Yn)

= θ(1− θ) + θ(1− θ) + · · ·+ θ(1− θ)

= nθ(1− θ).

Variance increases as n increases (fixed θ) because range of

possible #successes becomes wider.


Correlation

Covariance hard to interpret. Eg. size of a positive covariance says
little about strength of the X, Y relationship.

Suppose X height (metres), Y weight (kg). Units of covariance m

× kg. Measure height in inches, weight in lbs: covariance in

different units.

Try for scale-free quantity. Covariance measures how X, Y vary

together: suggests use of variances. Var(X) in m², Var(Y) in kg², so
right scaling is by sq root of each. Define correlation:

Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)).


Example: (X, Y) = (0, 1), (1, 3), each prob 1/2.

E(X) = 0.5, E(Y) = 2; XY = 0, 3 each prob 1/2, so
Cov(X, Y) = 3/2 − (0.5)(2) = 1/2.

Also, Var(X) = 1/4, Var(Y) = 1, so

Corr(X, Y) = (1/2) / √((1/4)(1)) = 1.

When X larger (1 vs. 0), Y also larger (3 vs. 1) for certain: a perfect

trend. So this should be largest possible correlation.

(Proof later: Cauchy-Schwartz inequality.)


More about correlation

Smallest possible correlation is −1, when larger X always goes
with smaller Y (eg. (X, Y) = (0, 1), (1, −3), each prob 1/2).

If X,Y independent, covariance 0, so correlation 0 also.

In-between values represent in-between trends. Eg.

Corr(X, Y ) = 0.5: larger X with larger Y most of the time, but

not always.

Correlation actually measures extent of linear relationship between

random variables. X, Y in example related by Y = 2X + 1.

Perfect nonlinear relationship won’t give correlation±1.


Viewing correlation by simulation

Useful to have sense of what correlation “looks like”.

Generate random normals with required correlation, plot.

Suppose X, Y ∼ N(0, 1) independently. Then use X and
Z = αX + Y for suitable choice of α: correlated if α ≠ 0
because X in both. Can show Cov(X, αX + Y) = α and
Corr(X, αX + Y) = α/√(1 + α²).

Choose α to get desired correlation ρ: α = ±ρ/√(1 − ρ²).
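The slides do this simulation in another package; a minimal Python sketch of the same idea (seed, sample size and ρ = 0.95 are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
rho = 0.95
alpha = rho / np.sqrt(1 - rho**2)
x = rng.standard_normal(1000)
z = alpha * x + rng.standard_normal(1000)
print(np.corrcoef(x, z)[0, 1])   # sample correlation, close to 0.95
# plotting z against x gives pictures like the ones on the next slides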


Correlation 0.95: [scatterplot of z against x for the simulated pairs]


Correlation -0.8: [scatterplot of z against x for the simulated pairs]


Correlation 0.5: [scatterplot of z against x for the simulated pairs]


Correlation -0.2: [scatterplot of z against x for the simulated pairs]


Moment-generating functions

Means and variances (and eg. E(X³)) can be messy: each one
needs an integral (sum) to be solved. Would be nice to have a function
that gives E(X^k) more easily than by integration (summing).

Consider mX(s) = E(e^{sX}). Function of s.

Maclaurin series for exp function:

mX(s) = E(1) + s E(X) + (s²/2!) E(X²) + (s³/3!) E(X³) + · · · .


Differentiate both sides (as function of s):

m′X(s) = E(X) + s E(X²) + (s²/2!) E(X³) + · · ·

Putting s = 0 gives m′X(0) = E(X). Differentiate again:

m′′X(s) = E(X²) + s E(X³) + · · ·

so that m′′X(0) = E(X²).

By same process, find E(X^k) by differentiating mX(s) k times,
and setting s = 0. Differentiating easier than integrating!

E(X^k) called k-th moment of distribution of X; function mX(s),

used to get moments, called moment generating function for X .


If X discrete,

mX(s) = E(e^{sX}) = Σ_x e^{sx} P(X = x)

and if X continuous,

mX(s) = E(e^{sX}) = ∫_{−∞}^{∞} e^{sx} fX(x) dx.


Examples of moment generating functions

Bernoulli is easiest of all:

mX(s) = e^{s·0} P(X = 0) + e^{s·1} P(X = 1) = 1 − θ + θe^s.

So:

m′X(s) = θe^s ⇒ E(X) = θ
m′′X(s) = θe^s ⇒ E(X²) = θ

and indeed E(X^k) = θ for all k. Also,

Var(X) = E(X²) − [E(X)]² = θ − θ² = θ(1 − θ).


Now try X ∼ Exponential(λ), continuous:

mX(s) = E(e^{sX}) = ∫_0^∞ e^{sx} λe^{−λx} dx = λ(λ − s)^{−1}

after some algebra. (Requires s < λ.)

m′X(s) = λ(λ − s)^{−2}, so E(X) = m′X(0) = 1/λ.

m′′X(s) = 2λ(λ − s)^{−3}, so E(X²) = m′′X(0) = 2/λ². Hence

Var(X) = 2/λ² − (1/λ)² = 1/λ².
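A sympy check of this mgf calculation (illustrative; the symbol names are arbitrary):

import sympy as sp

s, lam = sp.symbols('s lam', positive=True)
m = lam / (lam - s)                  # mgf of Exponential(lam), valid for s < lam
EX = sp.diff(m, s, 1).subs(s, 0)     # 1/lam
EX2 = sp.diff(m, s, 2).subs(s, 0)    # 2/lam**2
print(sp.simplify(EX2 - EX**2))      # variance: equals 1/lam**2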


More about moment-generating functions

If X ∼ Poisson(λ), then

mX(s) = e^{λ(e^s − 1)}.

If X ∼ N(0, 1), then

mX(s) = e^{s²/2}.

Facts:

• If X, Y independent, mX+Y(s) = mX(s) mY(s). (Mgf of sum is product of
  moment-generating functions.)

• maX+b(s) = e^{bs} mX(as). (Mgf of linear function related to
  mgf of original random variable.)


Proofs from definition.

First result very useful: distribution of sum very difficult to find, but

can get moments for sum much more easily.

If X ∼ Binomial(n, θ), then X = Y1 + Y2 + · · · + Yn where
the Yi ∼ Bernoulli(θ) independently. Hence

mX(s) = [mYi(s)]^n = (1 − θ + θe^s)^n.

If X ∼ N(µ, σ²), X = µ + σZ where Z ∼ N(0, 1). Thus

mX(s) = mσZ+µ(s) = e^{µs} mZ(σs) = e^{µs + σ²s²/2}.


Using mgfs to recognize distributions

Important result, called uniqueness theorem. Suppose X has mgf

finite for−s0 < s < s0; suppose mX(s) = mY (s) for

−s0 < s < s0. Then X , Y have same distribution.

In other words: if mgf of X is that of known distribution, then X

must have that distribution.

Example: X, Y ∼ Poisson(λ) independently. X + Y has mgf

mX+Y(s) = {e^{λ(e^s − 1)}}² = e^{2λ(e^s − 1)}.

This is mgf of Poisson(2λ), so X + Y ∼ Poisson(2λ).


Conditional Expectation

Consider this joint distribution (Ex. 3.5.2):

          X = 5   X = 8   sum
  Y = 0    1/7     3/7    4/7
  Y = 3    1/7      0     1/7
  Y = 4    1/7     1/7    2/7
  sum      3/7     4/7     1

X, Y related: if Y = 0, then X more likely to be 8.


Suppose Y = 3. Then P(X = 5|Y = 3) = (1/7)/(1/7) = 1,
P(X = 8|Y = 3) = 0/(1/7) = 0. If Y = 3, then X certain to be
5, so E(X|Y = 3) = 5.

Now suppose Y = 4:

P(X = 5|Y = 4) = (1/7) / (1/7 + 1/7) = 1/2 = P(X = 8|Y = 4).

If Y = 4, then average X is E(X|Y = 4) = 5 · (1/2) + 8 · (1/2) = 6.5.

Likewise, E(X|Y = 0) = 7.25.
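A short Python version of these conditional expectations (illustrative only; the probabilities are the table above):

from fractions import Fraction as F

joint = {(5, 0): F(1, 7), (8, 0): F(3, 7), (5, 3): F(1, 7),
         (8, 3): F(0, 1), (5, 4): F(1, 7), (8, 4): F(1, 7)}

def cond_exp(y):
    # E(X | Y = y): restrict to the row Y = y and renormalize
    p_y = sum(p for (x, yy), p in joint.items() if yy == y)
    return sum(x * p for (x, yy), p in joint.items() if yy == y) / p_y

print([cond_exp(y) for y in (3, 4, 0)])   # 5, 13/2, 29/4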


These expectations from conditional distribution called conditional

expectations. E(X|Y = y) varies from 5 to 7.25 depending on

value of Y ; “on average, X depends on Y ”.

In general, if X, Y related, then mean of X depends on Y .

Calculate conditional distribution of X|Y , find X-expectation. This

is conditional expectation.


Conditional expectation: continuous case

Same principle: find expectation of conditional distribution. Now use

joint and marginal densities to find conditional density; then

integrate to get expectation.

Example: fX,Y(x, y) = 4x²y + 2y⁵, 0 ≤ x, y ≤ 1.

Conditional density fX|Y(x, y) = fX,Y(x, y)/fY(y). So first find
marginal density fY(y) by integrating out x from joint density:
fY(y) = (4/3)y + 2y⁵. Has no x. Hence

fX|Y(x, y) = (4x²y + 2y⁵) / ((4/3)y + 2y⁵).


Note: only x in numerator, so not so hard. Thus

E(X|Y = y) = ∫_0^1 x · (4x²y + 2y⁵) / ((4/3)y + 2y⁵) dx = (1 + y⁴) / (4/3 + 2y⁴).

Depends slightly on Y : E(X|Y = 0) = 0.75,

E(X|Y = 0.5) = 0.729, E(X|Y = 1) = 0.6. As Y increases,

X decreases, on average.


Conditional expectations as random variables

Without particular Y -value in mind, can define E(X|Y ) by taking

E(X|Y = y) and replacing y by Y . Above example:

E(X|Y) = (1 + Y⁴) / (4/3 + 2Y⁴).

This kind of conditional expectation is random variable (function of

random variable Y ).


As random variable, E(X|Y ) must have expectation,

E[E(X|Y )]. What is it? Directly, as function of y:

E[E(X|Y)] = ∫_0^1 E(X|Y = y) fY(y) dy = 2/3

(much cancellation). Now: marginal density of X is
fX(x) = 2x² + 1/3 (integrate out y from joint density), so

E(X) = ∫_0^1 x (2x² + 1/3) dx = 2/3 = E[E(X|Y)].

Not a coincidence. Illustrates theorem of total expectation:

E[E(X|Y )] = E(X). In words: effect of varying Y is to change

E(X|Y ), but E[E(X|Y )] averages out these effects, leaving only

overall average of X .


Conditional variance

Conditional variance is variance of conditional distribution.

Return to previous discrete example:

          X = 5   X = 8   sum
  Y = 0    1/7     3/7    4/7
  Y = 3    1/7      0     1/7
  Y = 4    1/7     1/7    2/7
  sum      3/7     4/7     1

If Y = 3, X certain to be 5, so Var(X|Y = 3) = 0.

But if Y = 4, X equally likely 5 or 8; Var(X|Y = 4) = 2.25.


(Calculation: E(X|Y = 4) = 6.5, E(X²|Y = 4) = 44.5,
Var(X|Y = 4) = 44.5 − (6.5)² = 2.25.)

Another expression of how Y affects X . If know Y = 3, know X

exactly, but if Y = 4, more uncertain about possible X .


Inequalities relating probability, mean and

variance

Mean and variance closely related to probabilities. There are general
relationships, true for a wide range of random variables and
distributions.

Markov inequality: If X cannot be negative, then

P(X ≥ a) ≤ E(X)/a.

In words: if mean small, X unlikely to be very large.


Chebychev inequality:

P(|Y − µY| ≥ a) ≤ Var(Y)/a².

In words: if variance small, Y unlikely to be far from mean.

(Variations in spelling: best English transliteration from Russian

probably “Chebyshov”.)


Example: suppose X = 0, 1, 2 each with probability 1/3. Then
E(X) = 1, E(X²) = 5/3, so Var(X) = 2/3.

Markov with a = 1.5 says P(X ≥ 1.5) ≤ 1/1.5 = 2/3. Actual
P(X ≥ 1.5) = P(X = 2) = 1/3, which is indeed ≤ 2/3.

Chebychev with a = 0.9:
P(|X − 1| ≥ 0.9) ≤ (2/3)/(0.9)² = 0.823. Actual
P(|X − 1| ≥ 0.9) = P(X ≤ 0.1) + P(X ≥ 1.9) = P(X = 0) + P(X = 2) = 2/3.

Bounds from Markov and Chebychev inequalities often not very

close to truth, but guaranteed, so can use inequalities to prove

results.


Proof of Markov inequality

Uses idea that if Z ≤ X , then E(Z) ≤ E(X).

Define random variable Z = a if X ≥ a, 0 otherwise. Because

X ≥ 0, value of Z always≤ that of X : Z ≤ X .

E(Z) = aP (X ≥ a) + 0P (X < a) = aP (X ≥ a).

But Z ≤ X so E(Z) ≤ E(X) and therefore

aP (X ≥ a) ≤ E(X). Divide both sides by a. Done.


Proof of Chebychev inequality

This uses Markov’s inequality with clever choice of random variable.

Let X = (Y − µY)²; X ≥ 0. Then Markov’s inequality (with a²
replacing a) says

P(X ≥ a²) ≤ E(X)/a²  ⇒  P[(Y − µY)² ≥ a²] ≤ E[(Y − µY)²]/a².

In last inequality, E[·] is Var(Y). On left, both quantities inside the probability are
≥ 0, so can square-root both sides. Gives

P(|Y − µY| ≥ a) ≤ Var(Y)/a²

which is Chebychev’s inequality. Done.


Cauchy-Schwartz and Jensen inequalities

Cauchy-Schwartz:

|Cov(X, Y)| ≤ √(Var(X) Var(Y))  ⇒  |Corr(X, Y)| ≤ 1.

Proof: page 188 of text. Idea, for X, Y having mean 0: write
E[(X − λY)²] in terms of variances and covariances; result must
be ≥ 0.

Jensen’s inequality relates E(g(X)) and g(E(X)). Specifically,

if g(x) is concave up (that is, g′′(x) > 0), then

g(E(X)) ≤ E(g(X)).


Proof: Tangent line to concave-up function always≤ function

(picture). Consider tangent line to g(x) at x = E(X); suppose

equation is a + bx. Then g(E(X)) = a + bE(X). Also, line

≤ g(x) everywhere else, so

a + bX ≤ g(X) ⇒ E(a + bX) ≤ E(g(X))

⇒ a + bE(X) ≤ E(g(X))

⇒ g(E(X)) ≤ E(g(X)).

Done.

(Note: text uses “convex” for “concave up”.)


Consequences of Jensen’s inequality

Take g(x) = x². Then (E(X))² ≤ E(X²). But
Var(X) = E(X²) − (E(X))² ≥ 0, so knew that anyway.

Another: suppose X = 1, 2, 3, each prob 1/3. Then E(X) = 2.
But get another kind of average by multiplying the 3 possible values and
taking 3rd root. This is called the geometric mean. Here it is
(1·2·3)^{1/3} = 1.817. Ordinary mean greater than geometric mean.

Look at log of geometric mean:

ln{(1·2·3)^{1/3}} = (1/3) ln(1·2·3) = (1/3)(ln 1 + ln 2 + ln 3) = E(ln X).

Thus geometric mean is e^{E(ln X)}.


Jensen: − ln x is concave up for x > 0, so

− ln(E(X)) ≤ E(− ln X)  ⇒  ln(E(X)) ≥ E(ln X).

Exponentiate both sides (e^{ln y} = y):

E(X) ≥ e^{E(ln X)}.

This says that for any positive random variable X, the ordinary
mean will always be ≥ the geometric mean.


Sampling Distributions and

Limits


Introduction: roulette

See http://tinyurl.com/238p5 for intro to game.

Basic idea: bet on number or number combination. Roulette wheel

spun, one number is winner. Your bet wins if it contains winning

number.

Wheel also contains numbers 0, 00. Winning bets paid as if 0, 00

absent (advantage to casino).

Bet 1: “high number”: win with 19–36, lose otherwise. Bet $1, win
$1 if you win. Let W be winnings on one play; P(W = 1) = 18/38,
P(W = −1) = 20/38. Then

E(W) = 1 · (18/38) + (−1) · (20/38) = −2/38 ≈ −$0.05.


Bet 2: “lucky number”: win if 24 comes up, lose otherwise. Win $35

for $1 bet. Now P (W = 35) = 1/38, P (W = −1) = 37/38, so

E(W) = 35 · (1/38) + (−1) · (37/38) = −2/38 ≈ −$0.05.

In both bets, lose 5 cents per $ bet in long run.

Play game not once but many times. Interested in total winnings, or
mean winnings per play. Let Wi be winnings on play i; then mean
winnings per play Mn over n plays is

Mn = (1/n) Σ_{i=1}^{n} Wi.

Investigate behaviour of Mn by simulation.
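The slides’ simulations were done in other software; a minimal Python sketch of the same idea (the seed and number of plays are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
n = 1000
high = rng.choice([1, -1], size=n, p=[18/38, 20/38])    # high-number winnings
lucky = rng.choice([35, -1], size=n, p=[1/38, 37/38])   # lucky-number winnings
plays = np.arange(1, n + 1)
M_high = np.cumsum(high) / plays     # running mean winnings per play
M_lucky = np.cumsum(lucky) / plays
print(M_high[-1], M_lucky[-1])       # both near E(W) = -2/38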


High-number, 30 plays: [plot of Mn against n]


High-number, 1000 plays: [plot of Mn against n]


Lucky-number, 1000 plays: [plot of Mn against n]


Notes about roulette simulation

1st graph: in high-number bet, fortune goes up/down by $1 per play;

winnings/play pattern similar. On this sequence, in profit after 30

plays, but losing after 15.

2nd graph: same bet, 1000 plays. Less fluctuation after more trials;

winnings per play apparently tending to dotted line, E(W ). (Other

simulations have different shape but similar end behaviour.)

3rd graph: lucky-number bet, 1000 plays. Large jump upwards on

each win. Picture more erratic than for high-number bet; long-term

behaviour not clear yet. (Need more plays.)


Understanding Mn mathematically: mean, variance

Mn = (1/n) Σ_{i=1}^{n} Wi

is a sum. Wi in sum independent, each same distribution (one spin of
wheel has no effect on other spins). So can calculate E(Mn) and
Var(Mn).

Already found E(Wi) = −2/38 for both our bets.

Find variances for bets: for high-number bet, Var(Wi) = 0.9972;

for lucky-number bet Var(Wi) = 33.21.


For mean:

E(Mn) = (1/n) Σ_{i=1}^{n} E(Wi) = (1/n) Σ_{i=1}^{n} (−2/38) = −2/38,

since there are n terms in the sum, all the same.

That is, regardless of how long you play, you will lose 5 cents per $

bet on average.


Var(Mn) = (1/n²) Σ_{i=1}^{n} Var(Wi) = Var(Wi)/n.

Sum has n terms all equal to variance of one play’s winnings. So for

high-number bet, Var(Mn) = 0.9972/n, for lucky-number bet,

Var(Mn) = 33.21/n.

For any particular n, variance for high-number bet lower. Supports

simulation: high-number bet results more predictable.

In both cases, as n →∞, Var(Mn) → 0. Longer you play, more

predictable Mn is.


Distribution of Mn

Mean and variance not whole story – want to know things like

P (Mn > 0) (chance of profit). For this, need distribution of Mn.

Start with M2 (2 plays). Do lucky-number bet (P(W = 35) = 1/38,
P(W = −1) = 37/38).

4 possibilities:

• win both times. M2 = (35 + 35)/2 = 35;
  P(M2 = 35) = (1/38)² = 1/1444 ≈ 0.0007.

• win on 1st, lose on 2nd. M2 = (35 + (−1))/2 = 17; prob is
  (1/38)(37/38) = 37/1444.


• lose on 1st, win on 2nd. Again M2 = 17 and prob is same as
  above. Thus overall P(M2 = 17) = 74/1444 ≈ 0.0512.

• lose on both. M2 = ((−1) + (−1))/2 = −1;
  P(M2 = −1) = (37/38)² = 1369/1444 ≈ 0.9480.

Calculation complicated, even for n = 2, because have to consider

all possible combinations.

In general: this kind of distribution very difficult to find exactly. So

look for approximations to it.


Sampling distributions

Suppose X1, X2, . . . , Xn are random variables, each independent

and with same distribution. For example:

• Xi is winnings from i-th play of a roulette bet.

• Xi is height of i-th randomly chosen Canadian.

• Xi = 1 if randomly chosen voter supports Liberal party,

Xi = 0 otherwise.

• Xi is randomly generated value from a distribution with density

fX(x).

In each case: underlying phenomenon of interest, collect data at

random to help understand phenomenon.


Summarize Xi values using random variable

Yn = h(X1, X2, . . . , Xn) for some function h (eg. mean, like

Mn).

Some jargon:

• total collection of individuals (all possible spins of roulette

wheel, all Canadians, all possible values) called population.

• particular individuals selected, or Xi values obtained from

them, called sample.

• Yn defined above called sample statistic.

Usually don’t know about population, so draw conclusion about it

based on sample.


First: opposite problem: if we know population, find out what

samples from it look like.

“At random” important, and specific. Each individual value in

population must have correct chance of being in sample (same

chance, for human populations), and each must be in sample or not

independently of others.

Aim: learn about distribution of Yn, called sampling distribution.

General statements difficult. Approach: find what happens as

n →∞, then use result as approximation for finite n.


Convergence in probability; weak law of

large numbers

In mathematics, accustomed to convergence ideas. Eg. if

an = 1− 1/n, so that a1 = 0, a2 = 12, a3 = 2

3, etc., an → 1

(converges to 1) as n →∞ because, by taking n large enough, all

values after an as close to 1 as desired.

For sequence X1, X2, . . . of random variables, what is meaning of

Xn → Y , where Y is random variable?


Different possibilities. One idea: “prob of Xn being far from Y goes

to 0 as n gets large”. Leads to definition:

Sequence {Xn} converges in probability to Y if, for all ε > 0,
lim_{n→∞} P(|Xn − Y| ≥ ε) = 0. Notation: Xn →P Y.

Example: suppose U ∼ Uniform[0, 1]. Let Xn = 3 when
U ≤ (2/3)(1 − 1/n) and 8 otherwise.

Thus when n = 1, X1 must be 8. If U > 2/3, Xn remains 8 forever,
but if U ≤ 2/3, U ≤ (2/3)(1 − 1/n) eventually, so Xn becomes 3 for
some n, then remains 3 forever.

(Cannot know which will happen since U random variable.)


Now define Y = 3 if U ≤ 2/3 and Y = 8 otherwise. Same as
“eventual” Xn, so should have Xn →P Y. Correct?

P(|Xn − Y| ≥ ε) ≤ P(Xn ≠ Y)
                = P((2/3)(1 − 1/n) < U < 2/3)
                = 2/(3n).

This tends to 0 as n → ∞, so Xn →P Y.


Convergence to a constant

What if Y not random variable, but number?

Example: suppose Zn ∼ Exponential(n). Then E(Zn) = 1/n,
suggesting that Zn typically gets smaller and smaller. Does
Zn →P 0?

P(|Zn − 0| ≥ ε) = P(Zn ≥ ε) = ∫_ε^∞ n e^{−nx} dx = e^{−nε}.

For any fixed ε, P(|Zn − 0| ≥ ε) → 0, so Zn →P 0.

Important special case (usually easier to handle).


Convergence to mean

Suppose sequence {Yn} has E(Yn) = µ for all n. Then Yn →P µ
if P(|Yn − µ| ≥ ε) → 0.

But recall Chebychev’s inequality,
P(|Y − µY| ≥ a) ≤ Var(Y)/a². Here:

P(|Yn − µ| ≥ ε) ≤ Var(Yn)/ε².

For fixed ε, right side (and hence left side) tends to 0 if
Var(Yn) → 0, in which case Yn →P µ.

(Logically: if Var(Yn) getting smaller, Yn becoming closer to their

mean µ.)


Weak Law of Large Numbers

Return to X1, X2, . . . , Xn being a random sample from some

population with mean E(Xi) = µ and variance Var(Xi) = v.

Consider sample mean

Mn = (1/n) Σ_{i=1}^{n} Xi.

Intuitively, expect Mn to be “close” to population mean µ, and to get
closer as n increases (more information in larger sample).

Does Mn →P µ? Re-do roulette calculations to show that
E(Mn) = µ and Var(Mn) = Var(Xi)/n = v/n.


Now, {Mn} is sequence of random variables with same mean µ.

Result of section “convergence to mean” says that Mn →P µ if
Var(Mn) → 0. But here, Var(Mn) = v/n → 0. This proves
that Mn →P µ.

This justifies use of sample mean as estimate of the population

mean. Can estimate average height of all Canadians by measuring

average height of sample of Canadians; the larger the sample,

closer estimate will likely be.

Important result, called weak law of large numbers.


To generalize: suppose now that Xn do not all have same variance,

but Var(Xi) = vi. Then

Var(Mn) = (1/n²) Σ_{i=1}^{n} vi.

This might not → 0. But suppose that vi ≤ v for all i. Then

Var(Mn) = (1/n²) Σ_{i=1}^{n} vi ≤ (1/n²) Σ_{i=1}^{n} v = v/n → 0.

In other words, Mn →P µ even if the variances are not all equal,
provided that they are bounded.


Convergence with probability 1

Previous example: suppose U ∼ Uniform[0, 1]. Let Xn = 3
when U ≤ (2/3)(1 − 1/n) and 8 otherwise. Let Y = 3 if U ≤ 2/3 and
Y = 8 otherwise. Concluded that Xn →P Y.

Take another approach. Suppose we knew U, eg. suppose
U = 0.4. Then

0.4 ≤ (2/3)(1 − 1/n)  ⇒  n ≥ 5/2.

Thus X1 = X2 = 8, X3 = X4 = · · · = 3. This is an ordinary
sequence of numbers, converges to 3. Also, if U = 0.4, Y = 3.


In general: if U < 2/3, Xn = 8 for n < 2/(2 − 3U) and Xn = 3
after that. If U > 2/3, Xn = 8 for all n.

In both cases, Xn → Y as an ordinary sequence for any particular
value of U. Potentially different idea of convergence of random
variables.

Definition: Xn converges to Y with probability 1 if
P(lim_{n→∞} Xn = Y) = 1. Also “converges almost surely”;
notation Xn →a.s. Y.

In words: consider all ways to get (number) sequences {Xn}; for
each, consider the corresponding Y. If Xn → Y always, then
Xn →a.s. Y.


Is it same as convergence in probability?

Example: let U ∼ Uniform[0, 1], and define {Xn} like this:

• X1 = 1 if 0 ≤ U < 1/2, 0 otherwise

• X2 = 1 if 1/2 ≤ U < 1, 0 otherwise

• X3 = 1 if 0 ≤ U < 1/4, 0 otherwise

• X4 = 1 if 1/4 ≤ U < 1/2, 0 otherwise

• X5 = 1 if 1/2 ≤ U < 3/4, 0 otherwise

• X6 = 1 if 3/4 ≤ U < 1, 0 otherwise

• X7 = 1 if 0 ≤ U < 1/8, 0 otherwise

• X8 = 1 if 1/8 ≤ U < 1/4, 0 otherwise, etc.


(Divided [0, 1] into 2, then 4, then 8, . . . intervals.)

Intervals getting shorter, so P(Xn = 1) decreasing. Indeed, for
ε < 1, P(|Xn − 0| ≥ ε) = P(Xn = 1) → 0, so Xn →P 0.

Suppose U = 0.2. Then Xn = 0 except for
X1 = X3 = X8 = · · · = 1. Beyond any n, always another
Xn = 1 (always another interval containing 0.2). So for U = 0.2,
the number sequence {Xn} has no limit. Hence not true that Xn →a.s. 0.

Example shows that the two convergence ideas are different:
convergence with probability 1 is harder to achieve.


Strong law of large numbers

Random sample X1, X2, . . . , Xn with E(Xi) = µ,
Var(Xi) ≤ v; let Mn = (Σ_{i=1}^{n} Xi)/n be sample mean.

Already showed that Mn →P µ (“weak law of large numbers”).

Also strong law of large numbers: Mn →a.s. µ. Proof difficult.

In words: of the (infinitely) many different sequences {Mn}
obtainable, every one of them (with probability 1) converges to µ.


Convergence in distribution

Consider independent sequence of random variables {Xn} with
P(Xn = 1) = 1/2 + 1/n and P(Xn = 0) = 1/2 − 1/n. Also, let
P(Y = 0) = P(Y = 1) = 1/2 independently of the Xn.

Now, take ε < 1. Then P(|Xn − Y| ≥ ε) = P(Xn ≠ Y). Could
have Xn = 0, Y = 1 or Xn = 1, Y = 0; use independence:

P(Xn ≠ Y) = (1/2 − 1/n)(1/2) + (1/2 + 1/n)(1/2) = 1/2.

Not → 0, so not true that Xn →P Y.


But Xn does converge to Y in the sense that
P(Xn = 1) → 1/2 = P(Y = 1) and
P(Xn = 0) → 1/2 = P(Y = 0). Called convergence in
distribution.

To make definition: note that P(Xn = x) meaningless for
continuous Xn, so work with P(Xn ≤ x) instead.

Then: {Xn} converges in distribution to Y if
P(Xn ≤ x) → P(Y ≤ x) for all x. Notation: Xn →D Y.


Example: Poisson approximation to binomial

Suppose Xn ∼ Binomial(n, λ/n) (that is, trials increasing but
success prob decreasing so that E(Xn) = n(λ/n) = λ constant).
Then

P(Xn = j) = (n choose j) (λ/n)^j (1 − λ/n)^{n−j} → e^{−λ} λ^j / j!,

which is P(Y = j) when Y ∼ Poisson(λ). That is,
Xn →D Poisson(λ).

(Proof based on lim_{n→∞} (1 − (x/n))^n = e^{−x}.)

Suggests that if n large and θ small, Poisson is good approx to
binomial.


Try this: take λ = 1.5 for n = 2, 5, 10, 20, 100:

x n=2 n=5 n=10 n=20 n=100 Poisson

0 0.0625 0.1680 0.1968 0.2102 0.2206 0.2231

1 0.3750 0.3601 0.3474 0.3410 0.3359 0.3346

2 0.5625 0.3087 0.2758 0.2626 0.2532 0.2510

3 0.0000 0.1323 0.1298 0.1277 0.1259 0.1255

4 0.0000 0.0283 0.0400 0.0440 0.0465 0.0470

5 0.0000 0.0024 0.0084 0.0114 0.0136 0.0141

6 0.0000 0.0000 0.0012 0.0023 0.0032 0.0035

Approx for n = 20 not bad; for n = 100 is very good.
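A scipy sketch that regenerates a table like this one (illustrative; not the software used for the slides):

from scipy.stats import binom, poisson

lam = 1.5
for x in range(7):
    binom_probs = [round(binom.pmf(x, n, lam / n), 4) for n in (2, 5, 10, 20, 100)]
    print(x, binom_probs, round(poisson.pmf(x, lam), 4))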


Convergence in distribution and moment generating

functions

Moment-generating function mY (s) for random variable Y is

function of s.

Uniqueness theorem: if mX(s) = mY (s) for all s where both

finite, then X, Y have same distribution.

Suggests following (true) result: if {Xn} is sequence of random

variables with mXn(s) → mY (s) (for all s where both sides finite),

then Xn →D Y.


Central Limit Theorem

Return to “random sample” X1, X2, . . . , Xn; suppose E(Xi) = 0

and Var(Xi) = 1.

Define Mn = (Σ_{i=1}^{n} Xi)/n. Does Mn converge in distribution to
anything interesting?

Well, E(Mn) = 0 but Var(Mn) = 1/n → 0. So look instead at
Zn = √n Mn: E(Zn) = 0 and Var(Zn) = 1. Then
Zn = (Σ_{i=1}^{n} Xi)/√n.


Moment-generating function for Xi is

mXi(s) = 1 + s E(Xi) + (s²/2!) E(Xi²) + (s³/3!) E(Xi³) + · · · ;

here E(Xi) = 0, Var(Xi) = 1 so E(Xi²) = 1, giving

mXi(s) = 1 + s²/2 + (s³/3!) E(Xi³) + · · · .

Now, by rules for mgfs,

mZn(s) = mX1(s/√n) · mX2(s/√n) · · · · · mXn(s/√n)
       = {mXi(s/√n)}^n
       = (1 + s²/(2n) + (s³/(3! n^{3/2})) E(Xi³) + · · ·)^n.


Recall that as n → ∞, (1 + y/n)^n → e^y. Above, the terms in s³
and higher contribute less and less as n increases, so only the 1
and s²/(2n) terms in the bracket have any effect. Thus

lim_{n→∞} mZn(s) = lim_{n→∞} (1 + s²/(2n))^n = e^{s²/2}

which is mgf of standard normal distribution.

Thus, remarkable fact: regardless of distribution of Xi,
Zn →D N(0, 1).

Also works for Xi with any mean and variance: standardized
Mn →D N(0, 1). Called central limit theorem.


Exact distribution of Mn very difficult to find. But if n “large”,

distribution can be approximated very well by normal distribution,

easier to work with.

This is reason for studying normal distribution.

Note that theorem uses convergence in distribution, so that it is the

cdf that converges, not the density function. Important if Xi discrete.

Also, for approximation, don’t need to be so careful about

standardization. Any sum/mean for large n works.


CLT by simulation

Let U1, U2, . . . ∼ Uniform[0, 1]; investigate distribution of

Yn = (U1 + U2 + · · ·+ Un)/n for various n. Uniform[0, 1]

distribution completely unlike normal. Do by simulation:

1. choose “large” number of Yn’s to simulate (eg. nsim = 10, 000)

2. in each of n columns, generate nsim random values from

Uniform[0, 1]

3. calculate simulated Yn values as row means. Eg. for n = 5,

let c10=rmean(c1-c5).

4. Draw histogram of results, compare normal distribution shape.

Normal good if curve through top middle of histogram bars.
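For readers not using Minitab, a minimal Python sketch of the same simulation (nsim and n are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
nsim, n = 10_000, 5
y = rng.uniform(0, 1, size=(nsim, n)).mean(axis=1)   # nsim simulated values of Y_n
print(y.mean(), y.var())      # near 1/2 and 1/(12n)
# a histogram of y with a normal curve overlaid gives pictures like the ones below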


[Histogram of the simulated y values, with a normal density curve superimposed.]

n = 2: normal too high at top, too low elsewhere.


[Histogram of the simulated y values, with a normal density curve superimposed.]

n = 5: much closer approx.


[Histogram of the simulated y values, with a normal density curve superimposed.]

n = 20: almost perfect.


Normal approx to binomial

Binomial is sum of Bernoullis, so CLT should apply if #trials n large.

Suppose Y ∼ Binomial(4, 0.5). Then E(Y ) = 2, Var(Y ) = 1.

Exact P(Y ≤ 1):

P(Y ≤ 1) = (4 choose 0)(0.5)^0 (1 − 0.5)^4 + (4 choose 1)(0.5)^1 (0.5)^3 = 0.3125.

Take X ∼ N(2, 1) (same mean, variance as Y). P(X ≤ 1)?

P(X ≤ 1) = P(Z ≤ (1 − 2)/√1) = P(Z ≤ −1) = 0.1587.

Not very close!


Problem: X continuous, but Y discrete. Y ≤ 1 really “Y ≤anything rounding to 1”. Suggests approximating P (Y ≤ 1) by

P (X ≤ 1.5):

P (X ≤ 1.5) = P

(Z ≤ 1.5− 2√

1

)= P (Z ≤ −0.5) = 0.3085.

For such small n, really very close to P (Y ≤ 1) = 0.3125.

In general, add 0.5 for ≤ and subtract 0.5 for <. Called continuity

correction; do whenever discrete distribution approximated by

continuous.

(Alternatively: for binomial, P(Y ≤ 1) ≠ P(Y < 1), but for

normal, P(X ≤ 1) = P(X < 1).)
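A quick numerical check of the continuity correction (a hedged Python/scipy sketch; scipy is an assumption, not part of the course's Minitab workflow):

from scipy.stats import binom, norm

n, p = 4, 0.5
mu, sd = n * p, (n * p * (1 - p)) ** 0.5

exact = binom.cdf(1, n, p)                    # P(Y <= 1) = 0.3125
crude = norm.cdf(1, loc=mu, scale=sd)         # no correction: about 0.1587
corrected = norm.cdf(1.5, loc=mu, scale=sd)   # continuity correction: about 0.3085
print(exact, crude, corrected)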

211


Compare Y ∼ Binomial(20, 0.5); E(Y ) = 10, Var(Y ) = 5.

Then exact P (Y ≤ 8) = 0.2517; approx by X ∼ N(10, 5) as

P(Y ≤ 8) ≈ P(X ≤ 8.5) = P(Z ≤ (8.5 − 10)/√5) = P(Z ≤ −0.67) = 0.2514.

Now, approx very good.

212


If p ≠ 0.5, binomial skewed; skewness decreases as n increases.

So need larger n for p far from 0.5.

Example: n = 20, p = 0.1. Simulate and plot using Minitab:

MTB > random 1000 c3;

SUBC> binomial 20 0.1.

MTB > hist c3

Shape clearly skewed, not normal. n = 20 not large enough here.

Rule of thumb: normal approx OK if np ≥ 5 and n(1 − p) ≥ 5.

Examples: n = 4, p = 0.5: np = 2 < 5, no good.

n = 20, p = 0.5: np = n(1 − p) = 10 ≥ 5, good;

n = 20, p = 0.1: np = 2 < 5, no good.

213


Monte Carlo integration

Integral I = ∫_0^1 sin(x^4) dx: impossible algebraically (no closed-form antiderivative). Get approximate answer numerically, e.g. by Simpson's rule. But can also recognize that

I = E{sin(U^4)}

where U ∼ Uniform[0, 1]. I is “average” of sin(U^4), suggesting procedure:

1. Generate U randomly from Uniform[0, 1].

2. Calculate T = sin(U^4).

3. Repeat steps 1 and 2 many times, find mean value m of T.

214


Minitab commands to do this (U in c1, T in c2):

MTB > random 1000 c1;

SUBC> uniform 0 1.

MTB > let c2=sin(c1**4)

MTB > mean c2

I got m = 0.19704. How accurate?

m observed value of random variable M. M mean of 1000 values,

so central limit theorem applies: approx normal distribution.

Mean, variance unknown but estimate using sample mean 0.19704, sample SD 0.25221: E(M) ≈ 0.19704,

Var(M) = σ²/n ≈ 0.25221²/1000 = 6.36 × 10^(−5).

215


Now, 99.7% of a normal distribution is within mean ± 3 × SD, so I is almost certainly in

0.19704 ± 3√(6.36 × 10^(−5)) = (0.173, 0.221).

To get more accurate answer, get more simulated values.
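For comparison, a hedged Python sketch of the same Monte Carlo estimate (the slides use Minitab; numpy is an assumption):

import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(0.0, 1.0, size=1000)
t = np.sin(u ** 4)                      # T = sin(U^4)

m = t.mean()
se = t.std(ddof=1) / np.sqrt(len(t))    # estimated SD of the mean M
print(m, (m - 3 * se, m + 3 * se))      # estimate and "almost certain" interval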

216


Recognizing as expectation

Consider now I = ∫_0^∞ 5x cos(x²) e^(−5x) dx.

Again impossible algebraically; because of limits, can’t use previous

trick.

Idea: use distribution with right limits and density in integral. Here, Exponential(5) has density 5e^(−5x) on correct interval, so

I = E{X cos(X²)} where X ∼ Exponential(5).

Minitab annoyance: its exponential dist has parameter 1/λ, so we

have to feed in 1/5 = 0.2.

217


Commands:

MTB > random 1000 c1;

SUBC> exponential 0.2.

MTB > let c2=c1*cos(c1**2)

MTB > describe c2

I got mean 0.1884, SD 0.1731, so this area almost certainly in

0.1884 ± 3 × 0.1731/√1000 = (0.1720, 0.2048).
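A hedged Python version of the same calculation (note numpy's exponential generator also takes the scale = 1/λ parameterization):

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 5, size=1000)   # X ~ Exponential(5); scale = 1/lambda
t = x * np.cos(x ** 2)

m = t.mean()
se = t.std(ddof=1) / np.sqrt(len(t))
print(m, (m - 3 * se, m + 3 * se))            # estimate and 3-SD interval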

218


Approximating sampling distributions

Central Limit Theorem only applies to means (sums), so is no help

for other quantities (median, variance etc).

Can approximate sampling distributions for these by simulation.

Idea:

1. simulate random sample from population

2. calculate sample quantity

3. repeat steps 1 and 2 many times, summarize results.

219


Sampling distribution of sample median in normal

population

Suppose X_1, X_2, . . . , X_n is random sample from normal population with mean 10, SD 2; take n = 3.

MTB > Random 500 c1-c3;

SUBC> Normal 10 2.

MTB > RMedian c1-c3 c4.

Samples in rows; use “row statistics” to get sample medians.
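Equivalent hedged Python sketch (rows are samples, matching the Minitab layout above; numpy is an assumption):

import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=10, scale=2, size=(500, 3))   # 500 samples of size n = 3
medians = np.median(samples, axis=1)                   # one sample median per row
print(medians.mean(), medians.std(ddof=1))             # summarize; histogram looks normal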

220


Shape is very like normal, even for such small sample.

221


Sampling distribution of sample variance in normal

population

Again suppose X_1, X_2, . . . , X_n ∼ N(10, 2²). Now take n = 5:

MTB > Random 500 c1-c5;

SUBC> Normal 10 2.

MTB > RStDev c1-c5 c6.

MTB > let c7=c6*c6

MTB > histogram c7

(samples in rows again; variance as square of SD.)
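Again, a hedged Python sketch of the same simulation:

import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=10, scale=2, size=(500, 5))   # 500 samples of size n = 5
variances = samples.var(axis=1, ddof=1)                # sample variance S^2 for each row
print(variances.mean())                                # near sigma^2 = 4; histogram skewed right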

222


Shape definitely skewed right: not normal-shaped.

223


Normal distribution theory

Normal distribution arises often from CLT, so worth knowing

properties and related distributions. These used frequently in

Chapter 5 and beyond (STAB57).

First: suppose U, V are independent. Then Cov(U, V ) = E(UV ) − E(U)E(V ) = E(U)E(V ) − E(U)E(V ) = 0, as expected.

But: now suppose that Cov(U, V ) = 0. If U, V jointly (bivariate) normal, then (fact) U, V independent.

That is, for jointly normal U, V , Cov(U, V ) = 0 if and only if U, V independent. Not true for other distributions.

224


The chi-squared distribution

Suppose Z ∼ N(0, 1). What is distribution of W = Z²? Can't use usual transformation because Z² neither increasing nor decreasing.

F_W(w) = P(W ≤ w) = P(Z² ≤ w) = P(−√w ≤ Z ≤ √w).

This as integral is

F_W(w) = ∫_{−√w}^{√w} e^(−z²/2)/√(2π) dz = ∫_{−∞}^{√w} e^(−z²/2)/√(2π) dz − ∫_{−∞}^{−√w} e^(−z²/2)/√(2π) dz.

225


Differentiate both sides and simplify to get

f_W(w) = (1/√(2πw)) e^(−w/2).

This is called the chi-squared distribution with 1 degree of freedom (df). Written W ∼ χ²_1.

Now suppose Z_1, Z_2, . . . , Z_n ∼ N(0, 1) independently. Distribution of W = Z²_1 + Z²_2 + · · · + Z²_n called chi-squared with n degrees of freedom. Written W ∼ χ²_n.

What is E(W )?

E(W ) = E(Σ_{i=1}^n Z²_i) = Σ_{i=1}^n E(Z²_i) = n(1) = n

since E(Z²_i) = Var(Z_i) = 1.
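A hedged simulation check of the definition and of E(W) = n (Python with numpy/scipy assumed; not course software):

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 5
w = (rng.normal(size=(100_000, n)) ** 2).sum(axis=1)   # sums of n squared N(0,1)'s
print(w.mean(), chi2.mean(n))                          # both close to n = 5
print(w.var(), chi2.var(n))                            # both close to 2n = 10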

226


To get density function of χ²_n, compare gamma density with χ²_1:

λ^α w^(α−1) e^(−λw) / Γ(α) = (1/√(2πw)) e^(−w/2)

if α = 1/2 and λ = 1/2. That is, χ²_1 = Gamma(1/2, 1/2).

If Z²_i ∼ χ²_1, use mgf formula for gamma dist to write

m_{Z²_i}(s) = (1/2)^(1/2) (1/2 − s)^(−1/2).

227


If W = Σ_{i=1}^n Z²_i ∼ χ²_n, mgf of W is n copies of m_{Z²_i}(s) multiplied together, i.e.

m_W(s) = (1/2)^(n/2) (1/2 − s)^(−n/2)

which is mgf of Gamma(n/2, 1/2). Using formula for gamma density, then, for W ∼ χ²_n,

f_W(w) = (1/(2^(n/2) Γ(n/2))) w^(n/2−1) e^(−w/2).

Has skew-to-right shape (picture page 225).

228


Distribution of sample variance

Suppose X_1, X_2, . . . , X_n ∼ N(µ, σ²). Define X̄ = Σ_{i=1}^n X_i/n to be sample mean, S² = Σ_{i=1}^n (X_i − X̄)²/(n − 1) to be sample variance.

Know that X̄ ∼ N(µ, σ²/n). Distribution of S²?

Actually look at (n − 1)S²/σ² = Σ_{i=1}^n (X_i − X̄)²/σ². Can write (p. 235) as sum of n − 1 squared N(0, 1)'s, so

(n − 1)S²/σ² ∼ χ²_{n−1}.

Fact: E(S²) = σ² (explains division by n − 1).

229


The t distribution

Standardize X̄: (X̄ − µ)/√(σ²/n) ∼ N(0, 1).

But what if σ² unknown? Idea: replace σ² by sample variance S². Distribution of result no longer normal (even though X_i are).

(X̄ − µ)/√(S²/n) = [(X̄ − µ)/√(σ²/n)] · 1/√{[(n − 1)S²/σ²]/(n − 1)} = Z/√(Y/(n − 1))

where Z ∼ N(0, 1) and Y ∼ χ²_{n−1}.

This called the t distribution with n − 1 degrees of freedom, written t_{n−1}.

230


What happens as n increases? Write Y/(n − 1) = Σ_{i=1}^{n−1} Z²_i/(n − 1) where Z_i ∼ N(0, 1). Then E(Y/(n − 1)) = 1. Let k = Var(Z²_i); then

Var(Y/(n − 1)) = (n − 1)k/(n − 1)² = k/(n − 1) → 0.

That is, Y/(n − 1) →_P 1 and therefore

Z/√(Y/(n − 1)) →_D N(0, 1);

that is, for large n, the t distribution with n − 1 df well approximated by N(0, 1).

t distribution hard to work with; use tables/software for probabilities.
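A hedged scipy check that t with growing df approaches N(0, 1) (scipy assumed, not course software):

from scipy.stats import norm, t

for df in (2, 5, 30, 100):
    print(df, t.ppf(0.975, df), norm.ppf(0.975))   # upper 2.5% points approach 1.96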

231


The F distribution

Suppose S²_1 and S²_2 are sample variances from independent samples of sizes m, n, both from normal populations with variance σ². Then might compare variances by looking at ratio R = S²_1/S²_2:

R = S²_1/S²_2 = {[(m − 1)S²_1/σ²] / [(n − 1)S²_2/σ²]} · {[1/(m − 1)] / [1/(n − 1)]} = [X/(m − 1)] / [Y/(n − 1)]

where X ∼ χ²_{m−1} and Y ∼ χ²_{n−1}.

This defined to have the F distribution with m − 1 and n − 1 degrees of freedom, written F(m − 1, n − 1).

232


Properties of F distribution

Ratio could have been S²_2/S²_1 = 1/R with similar result: therefore, if R ∼ F(m − 1, n − 1), then 1/R ∼ F(n − 1, m − 1).

Suppose T = X/√(Y/(n − 1)) ∼ t_{n−1}. Then

T² = (X²/1) / [Y/(n − 1)]

is a χ²_1/1 over a χ²_{n−1}/(n − 1); that is, T² ∼ F(1, n − 1).

233


In

R = [X/(m − 1)] / [Y/(n − 1)]:

if n → ∞, know that Y/(n − 1) →_P 1, and numerator of R is χ²_{m−1}/(m − 1).

Hence, as n → ∞,

(m − 1)R →_D χ²_{m−1}.

Thus χ²_{m−1} (scaled by 1/(m − 1)) is useful approx to F(m − 1, n − 1) if n large.

234


Stochastic Processes

235


Random walks

Consider gambling game: win $1 with prob p, lose $1 with prob q (p + q = 1). Each play independent. Start with fortune a; let X_n denote fortune after n plays.

Thus X_0 = a; X_1 = a + 1 if win (prob p), X_1 = a − 1 if lose (prob q).

Sequence {X_n} of random variables called a random walk.

236


Properties of random walk

At each step, two possible outcomes (win/lose), same prob p of winning, independent. So number of wins W_n ∼ Binomial(n, p).

With W_n wins, must be n − W_n losses, so fortune after W_n wins is

X_n = a + (1)W_n + (−1)(n − W_n) = a + 2W_n − n.

Since E(W_n) = np, have

E(X_n) = a + 2np − n = a + 2n(p − 1/2).

Also

Var(X_n) = 2² Var(W_n) = 4np(1 − p).
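A hedged simulation check of these formulas (Python/numpy assumed; the values of a, p, n are chosen just for illustration):

import numpy as np

rng = np.random.default_rng(0)
a, p, n, nsim = 5, 0.25, 10, 100_000

steps = rng.choice([1, -1], size=(nsim, n), p=[p, 1 - p])   # +1 w.p. p, -1 w.p. q
x_n = a + steps.sum(axis=1)

print(x_n.mean(), a + 2 * n * (p - 0.5))   # both near 0
print(x_n.var(), 4 * n * p * (1 - p))      # both near 7.5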

237


Since W_n ∼ Binomial(n, p), have

P(W_n = j) = (n choose j) p^j q^(n−j);

write in terms of X_n to get

P(X_n = a + k) = P(a + k = a + 2W_n − n) = P(W_n = (n + k)/2) = (n choose (n + k)/2) p^((n+k)/2) q^((n−k)/2).

Only certain values of X_n possible; formula fails for impossible values.

238


Examples

Suppose a = 5, p = 1/4. Then E(X_n) = 5 + 2n(1/4 − 1/2) = 5 − n/2. Expect fortune to decrease on average.

What is P(X_3 = 6)? Write 6 = 5 + 1 so k = 1, n = 3; (n + k)/2 = 2 and (n − k)/2 = 1:

P(X_3 = 6) = (3 choose 2)(1/4)²(3/4)¹ = 9/64.

How about P(X_9 = 7)? This is P(X_9 = 5 + 2), so n = 9 and k = 2. But (n + k)/2 = (9 + 2)/2 not integer, so formula fails. X_9 cannot be 7 (in fact X_9 must be even).
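A hedged brute-force check of P(X_3 = 6) = 9/64 by listing all 2^3 possible paths (plain Python, an assumption):

from itertools import product

a, p = 5, 0.25
total = 0.0
for steps in product([1, -1], repeat=3):     # all 8 possible 3-step paths
    prob = 1.0
    for s in steps:
        prob *= p if s == 1 else (1 - p)
    if a + sum(steps) == 6:
        total += prob
print(total, 9 / 64)                         # both 0.140625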

239


Now suppose a = 20, p = 2/3. Then

E(X_n) = 20 + 2n(2/3 − 1/2) = 20 + n/3,

increasing with n.

Find P(X_5 = 21), with 21 = 20 + 1: n = 5, k = 1 so (n + k)/2 = 3, (n − k)/2 = 2 and

P(X_5 = 21) = (5 choose 3)(2/3)³(1/3)² ≈ 0.329,

fairly likely.

240


Gambler’s ruin

Suppose we gamble with aim to reach fortune c > 0. How likely are we to succeed before fortune reaches 0 (run out of money)?

Hard to see answer: no idea how long it takes to reach c or 0.

Idea: let S(a) be prob of reaching c first starting from fortune a. Then for all c > 0, S(0) = 0, S(c) = 1. Also, if current fortune a, fortune at next step either a + 1 or a − 1, leading to

S(a) = pS(a + 1) + qS(a − 1).

241


Solve above recurrence relation to get formula: if p = 1/2, S(a) = a/c; otherwise,

S(a) = [1 − (q/p)^a] / [1 − (q/p)^c].

Example: start with $20, want to win $50. If p = 1/2, chance of success is 20/50 = 0.4. If p = 0.51, chance of success is

S(20) = [1 − (0.49/0.51)^20] / [1 − (0.49/0.51)^50] ≈ 0.637.

Even a very small edge makes success much more likely. (Even

small disadvantage makes eventual failure much more likely.)
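The formula as a small hedged Python function, reproducing the two numbers above:

def success_prob(a, c, p):
    """Probability of reaching fortune c before 0, starting from fortune a."""
    if p == 0.5:
        return a / c
    q = 1 - p
    return (1 - (q / p) ** a) / (1 - (q / p) ** c)

print(success_prob(20, 50, 0.5))    # 0.4
print(success_prob(20, 50, 0.51))   # about 0.637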

242


Markov Chains

Simple model of weather:

• if sunny today, prob 0.7 of sunny tomorrow, prob 0.3 of rainy.

• if rainy today, prob 0.4 of sunny tomorrow, prob 0.6 of rainy.

Weather has two states (sunny, rainy). From one day to next,

weather may change state.

Probs above called transition probabilities. This kind of probability

model called Markov chain.

243


Can write as matrix:

P = [ 0.7  0.3
      0.4  0.6 ]

where element p_ij is P(go to state j | currently in state i).

Note assumption: only need to know weather today to predict

weather tomorrow. (If weather today known, past weather

irrelevant). Called Markov property.

Suppose sunny today. Chance of sun in two days?

One idea: list possibilities. Two: SSS, SRS. Use transition probs to

get (0.7)(0.7) + (0.3)(0.4) = 0.61.

244


Another: calculate matrix P^2:

P^2 = [ 0.7  0.3 ] [ 0.7  0.3 ]   [ 0.61  0.39 ]
      [ 0.4  0.6 ] [ 0.4  0.6 ] = [ 0.52  0.48 ].

Note that top-left calculation same as 1st idea above.

Matrix P^2 gives two-step transition probs. That is, if sunny today, prob of sunny in 2 days' time 0.61; if rainy today, almost even chance of being rainy in 2 days.

In general, P^n gives n-step transition probs (weather in n days' time given weather today).
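A hedged numpy sketch of the n-step calculation (the slides use Minitab for matrix work; numpy is an assumption):

import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
print(np.linalg.matrix_power(P, 2))   # two-step matrix; top-left entry 0.61
print(np.linalg.matrix_power(P, 8))   # rows approach (4/7, 3/7)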

245


Another example

“Ehrenfest’s Urn”: Two urns, containing total of 4 balls. Choose one

ball at random, take out of current urn, place in other urn. Keep

track of number of balls in urn 1.

Transition matrix (states 0, 1, 2, 3, 4 balls in urn 1):

P = [ 0    1    0    0    0
      1/4  0    3/4  0    0
      0    2/4  0    2/4  0
      0    0    3/4  0    1/4
      0    0    0    1    0 ]

Apparent tendency for number of balls in 2 urns to even out.

246


Find likely number of balls in urn 1 after 9 steps by finding P^9. (Use Minitab: see section E.1 of manual, p. 162.) Answer (rounded):

P^9 = [ 0      0.5  0     0.5  0
        0.125  0    0.75  0    0.125
        0      0.5  0     0.5  0
        0.125  0    0.75  0    0.125
        0      0.5  0     0.5  0 ]

Start with even number of balls in urn 1: end with either odd

number, equally likely. Start with odd number: end with even

number, most likely 2.

247


Stationary distributions

Instead of starting from particular state, pick starting state from prob. distribution θ = (θ_1, θ_2, . . .).

In weather example: suppose 80% chance today sunny, so θ = (0.80, 0.20).

To get prob of each state n steps later, multiply θ as row vector by P^n. Weather example, for n = 2 days later:

(0.8  0.2) P^2 = (0.8  0.2) [ 0.61  0.39 ] = (0.592  0.408).
                            [ 0.52  0.48 ]

248


Suppose we could find θ such that θP = θ. Then starting

distribution θ would be stationary: (marginal) prob of sunny day

same for all days.

Can try directly for weather example:

(θ_1  θ_2) P = (0.7θ_1 + 0.4θ_2   0.3θ_1 + 0.6θ_2) = (θ_1  θ_2).

2 equations in 2 unknowns, collapse into one equation 0.3θ_1 − 0.4θ_2 = 0, but θ_i are probs so that θ_1 + θ_2 = 1 also. Solve: θ_1 = 4/7, θ_2 = 3/7.

More generally: solve θP = θ by transposing both sides to get P^T θ^T = θ^T. Like solution to Av = λv with λ = 1: stationary prob θ is an eigenvector of P^T with eigenvalue 1.
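A hedged numpy sketch of that eigenvector calculation for the weather chain:

import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
vals, vecs = np.linalg.eig(P.T)                    # eigen-decomposition of P transpose
v = np.real(vecs[:, np.argmin(np.abs(vals - 1))])  # column for the eigenvalue closest to 1
print(v / v.sum())                                 # scale to sum to 1: about (4/7, 3/7)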

249


Can use Minitab to get eigenvalues/vectors (manual p. 167). Usually

need to scale eigenvector to get probs summing to 1.

Ehrenfest urn example: 5 eigenvectors; the one with eigenvalue 1 is (0.120, 0.478, 0.717, 0.478, 0.120), scaling to (1/16, 4/16, 6/16, 4/16, 1/16).

(Actually binomial probs: see text p. 595).

250


Limiting distributions

If initial state chosen from stationary distribution, then prob of each

state remains same for all time.

Also: if watch Markov chain for many steps, should not matter much

which state we began in.

Weather example: 8-step transition matrix is

P^8 = [ 0.57146  0.42854 ] ≈ [ 4/7  3/7 ]
      [ 0.57139  0.42861 ]   [ 4/7  3/7 ]

Starting either from sunny or rainy day, chance of sunny day in 8 days' time is about 4/7. Same as stationary distribution.

251


Compare Ehrenfest urn example:

P^8 ≈ [ 0.125  0    0.75  0    0.125
        0      0.5  0     0.5  0
        0.125  0    0.75  0    0.125
        0      0.5  0     0.5  0
        0.125  0    0.75  0    0.125 ]

not getting stationary distribution in each row.

Problem here: number of balls in urn 1 always goes from odd to

even or vice versa. So e.g. P(1 ball in urn 1 after n steps)

alternates between 0 and positive; cannot have limit. Chain called

periodic.

252


Consider a third example:

P = [ 0.5   0.5   0
      0.75  0.25  0
      0     0     1 ].

Has two eigenvectors for eigenvalue 1: (0.6, 0.4, 0) and (0, 0, 1).

Note: start in state 1 or 2, can never reach state 3. Start in state 3, can never reach states 1 or 2.

Such chain called reducible: can split up into two chains, {1, 2} and {3}, and treat each separately.

253


Markov chain limit theorem

Previous work suggests following theorem:

Suppose a Markov chain has a stationary distribution, is not

reducible, and is not periodic. Then its stationary distribution also

gives the probability, as n → ∞, of being in any particular state

after n steps.

In effect, the stationary distribution gives approx to long-term

behaviour of chain.

254


... that’s all, folks!

255