Chapter 1 Probability and Distributions
1.1 Introduction
Boxiang Wang, The University of Iowa Chapter 1 STAT 4100 Fall 2018
Question: What is probability?
Answer:
I Probability is the chance for something to happen, so it is always a number in [0, 1].
I In addition, in this class, probability is also the name of the mathematical tool we are going to use.
How do we learn probability?
I Mathematics is the language we use to communicate with the universe. Its grammar consists of rules called axioms and theorems.
I When we learn probability, we learn those rules. So prepare for a bit of a headache at the beginning, similar to learning a second language.
Relationship between probability and statistics?
I Probability: from population to sample.
I Statistics: from sample to population.
Random experiment
I The experiment can be repeated under the same conditions.
I Each experiment terminates with an outcome.
I The outcome cannot be predicted with certainty prior to the performance of the experiment.
I The collection of all possible outcomes can be described prior to the performance of the experiment.
This collection is called the sample space, usually denoted by C.
Example (1.1.1)
1 Toss a coin once, sample space C = {T, H}.
2 Toss a coin twice, sample space?
Event
A subset of the sample space C, usually denoted by C, is called an event. If an outcome belongs to C, then we say that the event C has occurred.
Example (1.1.2)
1 Cast one red and one white die. The sample space consists of 36 ordered pairs,
C = {(1, 1), . . . , (1, 6), (2, 1), . . . , (2, 6), . . . , (6, 6)}.
2 Event: the sum of the numbers on the dice is 9,
C = {(3, 6), (4, 5), (5, 4), (6, 3)}.
Probability
Probability quantifies the notion of chance or likelihood of an event. Relative frequency is an empirical definition of probability:
I Suppose the experiment is repeated N times.
I Let kN denote the number of times the event C actually occurred.
I Then fN = kN/N is the relative frequency of the event C in the repeated experiments.
Probability: frequentist
I Relative frequency: fN = kN/N.
I Suppose that N increases.
I Suppose that p = lim_{N→∞} fN exists. Note p ∈ [0, 1].
I Then p is the probability of the event C.
Example (1.1.2, cont’d)
Sum of numbers on dice is 9:
C = {(3, 6), (4, 5), (5, 4), (6, 3)}
If each of the 36 outcomes is equally likely, then p = 4/36.
I We denote the probability of an event C by P(C).
I It is the long-run relative frequency of the event C in a very large number of independent replications of the experiment.
Subjective probability
Consider the event C = {Hawkeye wins the NCAA basketball championship in 2019}. Suppose that I offer you two lottery tickets, and you can choose between them.
I If you pick lottery ticket 1, then we spin a roulette wheel that has 100 slots and has been declared “fair” by the Nevada Gaming Commission. If the ball lands in a slot from 1 through 10, you get $100 and otherwise you get $0.
I If you pick lottery ticket 2, then you get $100 if C occurs and otherwise you get $0.
If you choose lottery ticket 1, then your subjective p ≤ 0.10, and if you choose lottery ticket 2, then your subjective p ≥ 0.10.
I The advantage of subjective probability is that it can be extended to experiments that cannot be repeated. (Think about betting in sports, investing your money, . . . .)
I Mathematically, the two concepts of probability are identical. In many situations we will not need to make the distinction.
Chapter 1 Probability and Distributions
1.2 Set Theory
Sets
I A set is a collection of objects. If an element x belongs to a set C, then we write x ∈ C.
I If each element of a set C1 is also an element of another set C2, then C1 is called a subset of C2, written as C1 ⊂ C2.
I If C1 ⊂ C2 and C2 ⊂ C1, then C1 = C2.
Example (1.2.1)
Define sets C1 = {x : 0 ≤ x ≤ 1} and C2 = {x : −1 ≤ x ≤ 2}. We have C1 ⊂ C2.
Example (1.2.2)
Define sets C1 = {(x, y) : 0 ≤ x = y ≤ 1} and C2 = {(x, y) : 0 ≤ x ≤ 1, 0 ≤ y ≤ 1}. We have C1 ⊂ C2.
I If a set C has no elements, then C is called the null set, written as C = ∅.
Union
I The set of all elements that belong to at least one of the sets C1 and C2 is called the union of C1 and C2, written as C1 ∪ C2.
I For example, if C1 = {1, 2, 3}, C2 = {2, 3, 5}, then C1 ∪ C2 = {1, 2, 3, 5}.
I The set of elements that belong to at least one of the sets C1, . . . , Ck is C1 ∪ C2 ∪ · · · ∪ Ck, also written ∪_{i=1}^k Ci.
Example
I C ∪ ∅ = C.
I C ∪ C = C.
I If C1 ⊂ C2, then C1 ∪ C2 = C2.
I If Ci = {x : x ∈ [i − 1, i]}, i = 1, . . . , k, then ∪_{i=1}^k Ci = {x : x ∈ [0, k]}.
Example (1.2.7)
Ck = {x : 1/(k + 1) ≤ x ≤ 1}, k = 1, 2, . . . .
Then C1 ∪ C2 ∪ C3 ∪ · · · = (0, 1].
Intersection
I The set of all elements that belong to each of the sets C1 and C2 is called the intersection of C1 and C2, written as C1 ∩ C2.
I If C1 = {1, 2, 3}, C2 = {2, 3, 5}, then C1 ∩ C2 = {2, 3}.
I If C1 = [0, 1] and C2 = [−1, 0], then C1 ∩ C2 = {0}.
I If Ci = (0, 1/i), i = 1, . . . , k, then ∩_{i=1}^k Ci = (0, 1/k).
Example (1.2.11)
Let Ck = {x : 0 < x < 1/k}, k = 1, 2, . . . .
Then C1 ∩ C2 ∩ C3 ∩ · · · = ∅.
Example
Use Venn diagrams to depict the sets C1 ∪ C2, C1 ∩ C2, (C1 ∪ C2) ∩ C3 and (C1 ∩ C2) ∪ C3.
Space & Complement
I The set of all elements under consideration is called the space, written as C.
I Number of heads in tossing a coin ten times: the space is C = {0, 1, . . . , 10}.
I Let C be the sample space and C be its subset. The set that consists of all elements of C that are not elements of C is called the complement of C, written as C^c.
I Number of heads in tossing a coin ten times: if C = {0, 1, 2, 3, 4}, then C^c = {5, 6, 7, 8, 9, 10}.
1 C ∪ C^c = C.
2 C ∩ C^c = ∅.
3 (C^c)^c = C.
Important basic rules
1 DeMorgan’s laws: Let C denote the space and suppose C1, C2 ⊂ C. Then
A: (C1 ∩ C2)^c = C1^c ∪ C2^c.
B: (C1 ∪ C2)^c = C1^c ∩ C2^c.
2 Distributive laws: Let C denote the space and suppose C1, C2, C3 ⊂ C. Then
A: C1 ∪ (C2 ∩ C3) = (C1 ∪ C2) ∩ (C1 ∪ C3).
B: C1 ∩ (C2 ∪ C3) = (C1 ∩ C2) ∪ (C1 ∩ C3).
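Both sets of laws can be checked mechanically on any finite example; a quick sketch with Python sets (the particular space and sets below are arbitrary choices for illustration):

```python
# Check DeMorgan's laws and the distributive laws on a small finite space.
space = set(range(10))
C1, C2, C3 = {1, 2, 3}, {2, 3, 5}, {3, 5, 7}

def comp(A):
    """Complement of A within the space."""
    return space - A

# DeMorgan's laws
assert comp(C1 & C2) == comp(C1) | comp(C2)
assert comp(C1 | C2) == comp(C1) & comp(C2)
# Distributive laws
assert C1 | (C2 & C3) == (C1 | C2) & (C1 | C3)
assert C1 & (C2 | C3) == (C1 & C2) | (C1 & C3)
print("all identities hold")
```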
Set functions
I An ordinary function maps each point to a real number.
I A set function maps each set to a real number.
More specifically, let C be a space and C be its subset. A mapping Q that assigns a value Q(C) to the subset C (rather than to an element x) is called a set function.
Example (1.2.18)
Let C be a set in one-dimensional space and let Q(C) be equal to the number of points in C which correspond to positive integers. Then Q(C) is a function of the set C. Thus:
I if C = {x : 0 < x < 5}, then Q(C) = 4;
I if C = {−2, −1}, then Q(C) = 0;
I if C = {x : −∞ < x < 6}, then Q(C) = 5.
Example (1.2.23)
Let C be a set in one-dimensional space and let
Q(C) = ∫_C e^{−x} dx.
If C = {x : 0 ≤ x < ∞}, then
Q(C) = ∫_0^∞ e^{−x} dx = 1.
If C = {x : 1 < x ≤ 3}, then
Q(C) = ∫_1^3 e^{−x} dx = e^{−1} − e^{−3}.
Example (1.2.24)
Let C be a set in n-dimensional space and let
Q(C) = ∫ · · · ∫_C dx1 · · · dxn.
If C = {(x1, x2, . . . , xn) : 0 ≤ x1, x2, . . . , xn ≤ 1}, then
Q(C) = ∫_0^1 ∫_0^1 · · · ∫_0^1 dx1 dx2 · · · dxn = 1.
If C = {(x1, x2, . . . , xn) : 0 ≤ x1 ≤ x2 ≤ · · · ≤ xn ≤ 1}, then
Q(C) = ∫_0^1 ∫_0^{xn} · · · ∫_0^{x3} ∫_0^{x2} dx1 dx2 · · · dxn−1 dxn = 1/n!.
Chapter 1 Probability and Distributions
1.3 The Probability Set Function
σ-field
Definition (σ-field)
A collection of subsets of C that contains C itself and is closed under complementation and countable unions of its members is a σ-field. This collection of events is usually denoted by B.
I By DeMorgan’s laws, the collection is also closed under countable intersections.
Probability set function
Definition (Probability set function)
Let C be a sample space and B be a σ-field defined on C. Let P be a real-valued function defined on B. Then P is a probability set function if
1 P(C) ≥ 0 for all C ∈ B;
2 P(C) = 1 for the sample space C;
3 If Ci ∈ B (i = 1, 2, . . .) and Ci ∩ Cj = ∅ for all i ≠ j, then P(∪_{i=1}^∞ Ci) = Σ_{i=1}^∞ P(Ci).
I A collection of events whose members are pairwise disjoint is said to be mutually exclusive.
I The collection is further said to be exhaustive if the union of its events is the sample space.
I These three axioms imply many properties of the probability set function. Let us see a few examples.
Basic results on probability functions
Theorem (1.3.1)
P(C) = 1 − P(C^c).
Theorem (1.3.2)
P (φ) = 0.
Proof: Apply Theorem 1.3.1 with C = ∅; since ∅^c is the sample space, whose probability is 1, we get P(∅) = 1 − 1 = 0.
Theorem (1.3.3)
If C1 ⊂ C2, then P (C1) ≤ P (C2).
Proof: Let T = C1^c ∩ C2. Notice that C2 = C1 ∪ T and C1 ∩ T = ∅, so P(C2) = P(C1) + P(T) ≥ P(C1).
Theorem (1.3.4)
0 ≤ P (C) ≤ 1.
Proof: Since ∅ ⊂ C ⊂ C (the sample space), Theorem 1.3.3 gives 0 = P(∅) ≤ P(C) ≤ 1.
Theorem (1.3.5)
For two arbitrary events C1 and C2, it holds that
P (C1 ∪ C2) = P (C1) + P (C2)− P (C1 ∩ C2).
Inclusion Exclusion Formula
For three arbitrary events C1, C2 and C3, it holds that
P (C1 ∪ C2 ∪ C3) = P (C1) + P (C2) + P (C3)
− P (C1 ∩ C2)− P (C1 ∩ C3)− P (C2 ∩ C3)
+ P (C1 ∩ C2 ∩ C3).
I Boole’s inequality:
P (C1) + P (C2) + . . .+ P (Ck) ≥ P (C1 ∪ C2 ∪ . . . ∪ Ck).
I Bonferroni’s inequality:
P (C1 ∩ C2) ≥ P (C1) + P (C2)− 1.
If an experiment can result in any one of N different, equally likely outcomes, and if exactly n of those outcomes correspond to event C, then the probability of event C is
P(C) = n/N.
Example 1.3.2
An unbiased coin is to be tossed twice and the outcomes are recorded as an ordered pair. Then C = {(TT), (TH), (HT), (HH)}.
Write C1 = {the first toss results in a head}, C2 = {the second toss results in a tail}. Then, P(C1 ∪ C2) = ?
Method 1: C1 = {HH, HT}, C2 = {HT, TT}, so C1 ∪ C2 = {HH, HT, TT}. We see P(C1 ∪ C2) = 3/4.
Method 2: C1 = {HH, HT}, so P(C1) = 1/2; C2 = {HT, TT}, so P(C2) = 1/2. We have C1 ∩ C2 = {HT}, so P(C1 ∩ C2) = 1/4. Thus, by the inclusion-exclusion formula, P(C1 ∪ C2) = P(C1) + P(C2) − P(C1 ∩ C2) = 3/4.
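Both methods can be checked by enumerating the sample space (a small sketch using exact fractions):

```python
from fractions import Fraction
from itertools import product

# The 4 equally likely outcomes of two tosses, as ordered pairs.
space = list(product("HT", repeat=2))

def P(event):
    return Fraction(sum(1 for o in space if event(o)), len(space))

C1 = lambda o: o[0] == "H"   # first toss results in a head
C2 = lambda o: o[1] == "T"   # second toss results in a tail

p_direct = P(lambda o: C1(o) or C2(o))                  # Method 1
p_ie = P(C1) + P(C2) - P(lambda o: C1(o) and C2(o))     # Method 2
print(p_direct, p_ie)   # both 3/4
```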
Counting rules
I Suppose we have two experiments. The first experiment results in m outcomes while the second results in n outcomes. The composite experiment (the first experiment followed by the second) has mn outcomes. This is called the multiplication rule.
I Let A be a set with n elements. Suppose we are interested in k-tuples whose components are elements of A. Then, by the extended multiplication rule, there are n^k such k-tuples.
Permutation
Suppose k ≤ n and we are interested in k-tuples whose components are distinct elements of A. By the multiplication rule, there are n(n − 1) · · · (n − (k − 1)) such k-tuples with distinct elements.
Example
For the integers 1, 2, 3, 4, 5, the ordered subsets of size 2 are:
(1,2), (1,3), (1,4), (1,5), (2, 1), (2,3), (2,4), (2,5), (3, 1), (3, 2),
(3,4), (3,5), (4, 1), (4, 2), (4, 3), (4, 5), (5, 1), (5, 2), (5, 3), (5, 4).
The number of permutations of n things taken k at a time is
P^n_k = n(n − 1) · · · (n − k + 1) = n!/(n − k)!.
I Apply to the previous example: 5!/3! = 20.
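The count can be verified with the standard-library counting functions (a quick sketch; `math.perm` and `math.comb` require Python 3.8+):

```python
from math import comb, factorial, perm

# Ordered subsets of size 2 from {1,...,5}: P(5,2) = 5!/(5-2)! = 20.
assert perm(5, 2) == factorial(5) // factorial(3) == 20
# Unordered subsets of size 2: C(5,2) = 10; each is counted 2! times above.
assert comb(5, 2) == perm(5, 2) // factorial(2) == 10
print(perm(5, 2), comb(5, 2))   # 20 10
```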
Combination
I Suppose we have n objects, a1, . . . , an. How many subsets of size k, without regard for order, can we choose from these objects? For the previous example we have 10.
I In general, the number of combinations of n things taken k at a time is
C(n, k) = n!/(k!(n − k)!).
Example
Poker hand example: 52 cards, 5 cards in a hand.
1 Probability of any specific hand: 1/C(52, 5).
2 Probability that all cards are hearts: C(13, 5)/C(52, 5).
3 Probability of a flush (all cards same suit): C(4, 1) C(13, 5)/C(52, 5).
4 Probability of a full house (three kings and two queens): C(4, 3) C(4, 2)/C(52, 5).
Theorem
The number of distinct permutations of n objects of which n1 are of one kind, n2 of a second kind, . . . , nk of a kth kind is
n!/(n1! n2! · · · nk!).
Example
How many distinct permutations can be made from the word INTERNET?
I The 8 letters in the word “INTERNET” occur with counts
I: 1, N: 2, T: 2, E: 2, R: 1.
I Number of different permutations:
8!/(1! 2! 2! 2! 1!) = 5040.
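The formula can be cross-checked by brute force, since 8! orderings are cheap to enumerate (a sketch):

```python
from collections import Counter
from itertools import permutations
from math import factorial

word = "INTERNET"
n_distinct = factorial(len(word))
for count in Counter(word).values():   # I:1, N:2, T:2, E:2, R:1
    n_distinct //= factorial(count)

# Brute-force check: generate all 8! orderings and deduplicate.
assert n_distinct == len(set(permutations(word)))
print(n_distinct)   # 5040
```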
A different point of view
I If you put n distinct objects into k cells, with n1 in cell 1, n2 in cell 2, . . . , nk in cell k, where n = n1 + n2 + . . . + nk, then the number of ways is the same:
n!/(n1! n2! · · · nk!).
I Multiplication rule:
C(n; n1, n2, . . . , nk) = C(n, n1) C(n − n1, n2) · · · C(n − n1 − · · · − nk−1, nk) = n!/(n1! n2! · · · nk!).
Chapter 1 Probability and Distributions
1.4 Conditional Probability and Independence
Motivation of conditional probability
I An experiment involves three tosses of a coin. The sample space consists of 8 possible outcomes, all equally likely:
HHH, HHT, HTH, HTT, THH, THT, TTH, TTT.
I Define two events:
C1 = {first toss results in a head},
C2 = {at least two heads}.
What is the probability of event C2?
P(C2) = 4/8 = 1/2.
I Suppose we know event C1 occurs. Now what is the probability of event C2?
P(C2|C1) = 3/4
(the conditional probability of C2 given that C1 occurs).
Definition of conditional probability
Definition. For two events C1 and C2, with C1 satisfying P(C1) > 0, the conditional probability of C2 given C1 is
P(C2|C1) = P(C1 ∩ C2)/P(C1).
Properties of conditional probabilities:
1 P(C2|C1) ≥ 0;
2 P(C2 ∪ C3 ∪ · · · |C1) = P(C2|C1) + P(C3|C1) + · · · , provided that C2, C3, . . . are mutually disjoint;
3 P(C1|C1) = 1.
Example (1.4.2)
A bowl contains eight chips. Three of the chips are red and the remaining five are blue. Two chips are to be drawn successively, at random and without replacement. Compute the probability that the first draw results in a red chip (C1) and the second draw results in a blue chip (C2).
Solution:
P(C1 ∩ C2) = P(C1)P(C2|C1) = (3/8)(5/7) = 15/56.
The other factorization, P(C1 ∩ C2) = P(C2)P(C1|C2), is of no help here: both factors are hard to compute directly.
The multiplication rule can be extended to three or more events. For example, if P(C1 ∩ C2) > 0, hence P(C1) > 0, we have
P(C1 ∩ C2 ∩ C3) = P(C1)P(C2|C1)P(C3|C1 ∩ C2).
Example (1.4.4)
Four cards are drawn successively, at random and without replacement, from an ordinary deck of playing cards. Compute the probability of receiving a spade, a heart, a diamond, and a club, in that order.
Solution:
(13/52)(13/51)(13/50)(13/49).
What is the probability of receiving a spade, a heart, a diamond, and a club in any order?
Solution:
4! · (13/52)(13/51)(13/50)(13/49).
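A quick numerical evaluation of both answers with exact fractions (a sketch):

```python
from fractions import Fraction
from math import factorial

# Spade, heart, diamond, club in that exact order:
p_order = (Fraction(13, 52) * Fraction(13, 51)
           * Fraction(13, 50) * Fraction(13, 49))
# In any order: the 4! orderings are disjoint and equally likely.
p_any_order = factorial(4) * p_order
print(float(p_order), float(p_any_order))   # ≈ 0.0044, ≈ 0.1055
```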
Law of total probability
Suppose C1, . . . , Ck form a partition of the sample space C, i.e.
1 C1, . . . , Ck are mutually exclusive;
2 C1, . . . , Ck are exhaustive, i.e. P(C1 ∪ . . . ∪ Ck) = 1.
Then for any event C ∈ B,
P(C) = P(C|C1)P(C1) + P(C|C2)P(C2) + · · · + P(C|Ck)P(Ck).
Proof:
P(C) = P(C ∩ C) (intersecting with the sample space C changes nothing)
= P[(C1 ∪ C2 ∪ . . . ∪ Ck) ∩ C]
= P[(C1 ∩ C) ∪ . . . ∪ (Ck ∩ C)]
= P(C1 ∩ C) + · · · + P(Ck ∩ C)
= P(C|C1)P(C1) + · · · + P(C|Ck)P(Ck).
Bayes’s theorem
Suppose C1, . . . , Ck form a partition of the sample space C. Then for any event C ∈ B,
P(Cj|C) = P(C ∩ Cj)/P(C) = P(C|Cj)P(Cj) / Σ_{i=1}^k P(C|Ci)P(Ci).
Example
On Friday evening 10% of the drivers in Iowa City are drunk. The probability that a drunk driver will be involved in a traffic accident is 0.01% and the probability that a sober driver will be involved in a traffic accident is 0.002%. If you read in the morning paper that a particular individual was involved in a traffic accident, what is the probability that this individual was drunk?
Solution:
Let C = accident occurs; C1 = drunk; C2 = sober. Then
P(C1|C) = P(C1)P(C|C1) / [P(C1)P(C|C1) + P(C2)P(C|C2)]
= (0.10)(0.0001) / [(0.10)(0.0001) + (0.90)(0.00002)] = 0.357.
Four inspectors at a film factory are stamping the expiration date on each package of film at the end of the assembly line.
I John stamps 20% of the packages and fails to stamp 1 in 200;
I Tom stamps 60% of the packages and fails to stamp 1 in 100;
I Jeff stamps 15% of the packages and fails to stamp 1 in 90;
I Pat stamps 5% of the packages and fails to stamp 1 in 200.
If a customer complains that her package of film does not show the expiration date, what is the probability that it was inspected by John?
Solution:
I Let B1 be the event that John inspected the package. Similarly, denote the events that the package was inspected by Tom, Jeff, and Pat by B2, B3, B4, respectively. Then B1, B2, B3, B4 form a partition of the sample space. Let F be the event that the package failed to be stamped.
I P(B1) = 0.2, P(B2) = 0.6, P(B3) = 0.15, P(B4) = 0.05.
P(F|B1) = 1/200, P(F|B2) = 1/100, P(F|B3) = 1/90, P(F|B4) = 1/200.
I The probability that the package was inspected by John given the failure is
P(B1|F) = P(F|B1)P(B1) / [P(F|B1)P(B1) + P(F|B2)P(B2) + P(F|B3)P(B3) + P(F|B4)P(B4)]
= (0.2)(1/200) / [(0.2)(1/200) + (0.6)(1/100) + (0.15)(1/90) + (0.05)(1/200)]
= 0.112.
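The law of total probability and Bayes's theorem over a partition translate directly into a pair of dictionary computations (a sketch using the numbers from the slide):

```python
from fractions import Fraction

# Priors P(B_i) and failure rates P(F | B_i) from the slide.
prior = {"John": Fraction(20, 100), "Tom": Fraction(60, 100),
         "Jeff": Fraction(15, 100), "Pat": Fraction(5, 100)}
fail = {"John": Fraction(1, 200), "Tom": Fraction(1, 100),
        "Jeff": Fraction(1, 90), "Pat": Fraction(1, 200)}

p_F = sum(prior[i] * fail[i] for i in prior)              # total probability
posterior = {i: prior[i] * fail[i] / p_F for i in prior}  # Bayes's theorem
print(float(posterior["John"]))   # ≈ 0.112
```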
Suppose 5 out of 10000 employees at Lawrence Livermore National Laboratory are spies. According to experts, a polygraph test has sensitivity of 0.91 and specificity of 0.94. Mathematically,
Prevalence = P(S) = 0.0005,
Sensitivity = P(+|S) = 0.91,
Specificity = P(−|NS) = 0.94.
1 Find the probability of a false positive test.
Solution:
P(+|NS) = 1 − P(−|NS) = 1 − 0.94 = 0.06.
The complement rule can be used for conditional probabilities if what is being conditioned on is held fixed (i.e. NS was held fixed in the above computation). In other words, P(+|NS) ≠ 1 − P(+|S).
2 Find the probability of a false negative test.
Solution:
P(−|S) = 1 − P(+|S) = 1 − 0.91 = 0.09.
1 Suppose an employee is randomly selected. Find the probability that he/she tests positive for being a spy.
Solution: By the law of total probability,
P(+) = P(+|S)P(S) + P(+|NS)P(NS)
= (0.91)(0.0005) + (0.06)(0.9995) = 0.060425.
2 Given that a randomly selected employee tested positive, find the probability that he/she actually is a spy.
Solution: By Bayes’s theorem,
P(S|+) = P(+ ∩ S)/P(+)
= P(+|S)P(S) / [P(+|S)P(S) + P(+|NS)P(NS)]
= (0.91)(0.0005)/0.060425 = 0.00753.
Therefore, if an employee tests positive for being a spy, there is only a 0.753% chance that he/she actually is a spy!
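The whole calculation fits in three lines once prevalence, sensitivity, and specificity are named (a sketch with the slide's numbers):

```python
prevalence, sensitivity, specificity = 0.0005, 0.91, 0.94

# Law of total probability over the partition {spy, not spy}.
p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
# Bayes's theorem: posterior probability of being a spy given a positive test.
p_spy_given_pos = sensitivity * prevalence / p_pos
print(round(p_pos, 6), round(p_spy_given_pos, 5))   # 0.060425 0.00753
```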
Independence
Definition
Two events C1 and C2 with P(C1) > 0 and P(C2) > 0 are independent if and only if P(C1 ∩ C2) = P(C1)P(C2).
Remark:
I Two events C1, C2 are independent if and only if P(C1 ∩ C2) = P(C1)P(C2), which is equivalent to P(C1|C2) = P(C1) when P(C2) > 0, and also equivalent to P(C2|C1) = P(C2) when P(C1) > 0.
I Two events are independent if the occurrence (or non-occurrence) of one event does not influence the likelihood of occurrence of the other event.
I If the two events C1 and C2 are independent, then the following pairs of events are also independent: (i) C1 and C2^c; (ii) C1^c and C2; (iii) C1^c and C2^c.
Proof.
It suffices to show that
P(C1^c ∩ C2) = P(C1^c)P(C2) = (1 − P(C1))P(C2).
This follows because
P(C2) = P(C ∩ C2) (C denotes the space)
= P((C1 ∪ C1^c) ∩ C2)
= P[(C1 ∩ C2) ∪ (C1^c ∩ C2)] (distributive law)
= P(C1 ∩ C2) + P(C1^c ∩ C2) (axiom of probability)
= P(C1) · P(C2) + P(C1^c ∩ C2), (independence)
and rearranging gives P(C1^c ∩ C2) = (1 − P(C1))P(C2).
Mutual independence
Suppose now we have three events, C1, C2, and C3. We say that they are mutually independent if
P(C1 ∩ C2) = P(C1)P(C2),
P(C1 ∩ C3) = P(C1)P(C3),
P(C2 ∩ C3) = P(C2)P(C3),
and
P(C1 ∩ C2 ∩ C3) = P(C1)P(C2)P(C3).
More generally, we say the n events C1, C2, . . . , Cn are mutually independent if, for any collection of distinct integers d1, d2, . . . , dk from {1, 2, . . . , n}, it holds that
P(Cd1 ∩ Cd2 ∩ · · · ∩ Cdk) = P(Cd1)P(Cd2) · · · P(Cdk).
Example
Suppose a circuit board contains 3 modules. The probability that the first module works properly is 0.98, while the second and third modules work properly with probability 0.95 and 0.92, respectively. Modules are independent.
1 Find the probability that all 3 modules work properly.
P(W1 and W2 and W3) = P(W1)P(W2)P(W3) (independence)
= (0.98)(0.95)(0.92) = 0.8565.
2 Find the probability that one or more modules work.
P(one or more work) = 1 − P(none work) = 1 − P(W1^c)P(W2^c)P(W3^c) (independence)
= 1 − (1 − 0.98)(1 − 0.95)(1 − 0.92)
= 1 − 0.00008 = 0.99992.
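Both reliability calculations reduce to products over the module probabilities (a sketch):

```python
p_work = [0.98, 0.95, 0.92]   # modules work independently

p_all = 1.0
p_none = 1.0
for p in p_work:
    p_all *= p                # all three work
    p_none *= (1 - p)         # all three fail

p_at_least_one = 1 - p_none
print(round(p_all, 4), round(p_at_least_one, 5))   # 0.8565 0.99992
```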
Chapter 1 Probability and Distributions
1.5 Random Variables
Definition of random variable
Definition A random variable X is a map from the sample space C to the set of real numbers R. It assigns to each element c ∈ C one and only one value X(c) = x.
1 X induces a new sample space:
D = {x = X(c) : c ∈ C} ⊂ R.
Example 1: Coin flip: C = {T, H}, and we define X(T) = 0, X(H) = 1.
Example 2: C = {c1, . . . , cn}, the students at the University of Iowa. X(ci) = ci’s height and Y(ci) = ci’s weight.
2 X induces a probability set function on R:
PX(B) = P{c ∈ C : X(c) ∈ B} for B ⊂ R.
In particular, for D ⊂ D, we have P(X ∈ D) = P({c ∈ C : X(c) ∈ D}).
Example
Flip a coin twice. Let X = total number of heads. The sample space is C = {HH, HT, TH, TT} and all outcomes are equally likely, so
x : 0 1 2
P(X = x) : 1/4 2/4 1/4
Discrete Random Variables
Discrete random variables
Definition
A random variable X is discrete if it can assume a finite or countably infinite number of values (countably infinite: you can create a one-to-one correspondence with the positive integers).
Example
1 Let X be the number of broken eggs in a dozen randomly selected eggs. Possible values for X are x = 0, 1, . . . , 12, so X is discrete.
2 Let X be the number of accidents at an intersection in a year. Possible values for X are x = 0, 1, 2, . . ., so X is discrete.
3 Let X be the number of flips of a coin until you obtain a head. Possible values for X are x = 1, 2, . . ., so X is discrete.
Probability mass function
For discrete random variables, the probability mass function (pmf)
pX(di) = P(X = di), i = 1, 2, . . . ,
determines the probability set function PX.
The pmf pX(x) has the following properties:
1 0 ≤ pX(x) ≤ 1;
2 Σ_{x∈D} pX(x) = 1;
3 P[X ∈ A] = Σ_{x∈A} pX(x).
Definition
The support of a discrete random variable is {x : P(X = x) > 0}, sometimes denoted S.
Suppose we have a bowl with 1 chip labeled “1”, 2 chips labeled “2”, and 3 chips labeled “3”. Draw 2 chips without replacement. Let X = sum of the two draws.
1 Find the probability mass function (p.m.f.) of X.
x : 3 4 5 6
pX(x) = P(X = x) : 2/15 4/15 6/15 3/15
2 Find the probability that the sum of the two draws is 5 or more.
P(X ≥ 5) = P(X = 5) + P(X = 6) = 9/15.
3 Find the probability that the sum equals 6 given that the sum is 4 or more.
P(X = 6|X ≥ 4) = P(X = 6 ∩ X ≥ 4)/P(X ≥ 4) = P(X = 6)/P(X ≥ 4) = 3/13.
4 Find the probability that the sum is 5 or less given that the sum is more than 4.
P(X ≤ 5|X > 4) = P(X ≤ 5 ∩ X > 4)/P(X > 4) = P(X = 5)/P(X > 4) = 2/3.
Cumulative distribution function
Let X be a random variable. Then its cumulative distribution function (cdf) is defined by
FX(x) = PX((−∞, x]) = P(X ≤ x).
For a discrete random variable, this is a step function:
FX(x0) = Σ_{x≤x0} P(X = x). (cdf is a sum of the pmf)
For the previous example,
F(x) = 0 for x < 3; 2/15 for 3 ≤ x < 4; 6/15 for 4 ≤ x < 5; 12/15 for 5 ≤ x < 6; 1 for x ≥ 6,
given the pmf
x : 3 4 5 6
pX(x) = P(X = x) : 2/15 4/15 6/15 3/15
Continuous Random Variables
Continuous random variables
Definition
A random variable is a continuous random variable if its cumulative distribution function FX(x) is a continuous function for all x ∈ R.
The cumulative distribution function (cdf) defined for discrete random variables can be used to describe continuous random variables too.
Example
Let X denote a real number chosen at random between 0 and 1. Since any number can be chosen equally likely, it is reasonable to assume
PX[(a, b)] = b − a, for 0 < a < b < 1. (1)
Describe its cdf.
Probability density function
I The notion of probability mass function (pmf) defined for discrete random variables does NOT work here: FX(x) − FX(x−) = 0 for all x ∈ R, so P(X = x) = 0 for all x ∈ R.
I Instead, for the continuous case, suppose there is a non-negative function fX such that
FX(x) = ∫_{−∞}^x fX(t) dt, for all −∞ < x < ∞. (cdf is an integral of the pdf)
I We call fX the probability density function (pdf) of the random variable X. It is easy to see that the following relation usually holds:
fX(x) = F′X(x), for all −∞ < x < ∞.
A pdf fX(x) always has the properties:
1 f(x) ≥ 0 for all x ∈ R.
2 ∫_R f(x) dx = 1.
3 If A ⊂ R, then P(A) = ∫_A f(x) dx. In other words,
P(a < X < b) = ∫_a^b f(x) dx.
Definition
The support of a continuous random variable is {x : fX(x) > 0}, sometimes denoted S.
Let
f(x) = k(1 − x²) for −1 < x < 1, and 0 otherwise.
Find
1 the constant k such that f(x) is a pdf.
2 P (−0.5 < X < 0.5).
3 P (X ≤ 0.1).
4 P (X > 0.7|X > 0.1).
5 Evaluate F (x).
Solution:
1 We must have
1 = ∫_{−∞}^∞ f(x) dx = ∫_{−1}^1 k(1 − x²) dx = k ∫_{−1}^1 (1 − x²) dx.
Thus k · 4/3 = 1 and k = 3/4.
2 By the definition of the pdf, we have
P(−0.5 < X < 0.5) = ∫_{−0.5}^{0.5} (3/4)(1 − x²) dx = 11/16 = 0.6875.
3 P(X ≤ 0.1) = P(X < 0.1) = ∫_{−1}^{0.1} (3/4)(1 − x²) dx = 0.57475.
4 For random variables, the convention is to use a comma instead of ∩:
P(X > 0.7|X > 0.1) = P(X > 0.7, X > 0.1)/P(X > 0.1) = P(X > 0.7)/P(X > 0.1)
= [∫_{0.7}^1 (3/4)(1 − x²) dx] / [∫_{0.1}^1 (3/4)(1 − x²) dx] = 0.143.
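Since the density is a polynomial, its antiderivative gives the cdf in closed form, and all four answers follow from it (a sketch):

```python
def F(x):
    """cdf of f(x) = (3/4)(1 - x^2) on (-1, 1), via the antiderivative."""
    if x <= -1:
        return 0.0
    if x >= 1:
        return 1.0
    return 0.75 * (x - x ** 3 / 3) + 0.5

print(round(F(0.5) - F(-0.5), 4))              # 0.6875
print(round(F(0.1), 5))                        # 0.57475
print(round((1 - F(0.7)) / (1 - F(0.1)), 3))   # 0.143
```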
Percentile & quartile
Definition
Let X be a continuous-type random variable with pdf f(x) and cdf F(x). The (100p)th percentile is a number πp such that the area under f(x) to the left of πp is p. That is,
p = ∫_{−∞}^{πp} f(x) dx = F(πp).
The 50th percentile is called the median (m = π0.50). The 25th and 75th percentiles are called the first and third quartiles, respectively, denoted by q1 = π0.25 and q3 = π0.75. Of course, the median m = π0.50 = q2 is also called the second quartile.
Example
Let Y be a continuous random variable with probability density function given by
f(y) = 3y² for 0 ≤ y ≤ 1, and 0 otherwise.
Find the first, second and third quartiles.
Solution:
∫_0^{π0.25} 3y² dy = 1/4, so π0.25 = 1/∛4.
∫_0^{π0.5} 3y² dy = 1/2, so π0.5 = 1/∛2.
∫_0^{π0.75} 3y² dy = 3/4, so π0.75 = ∛3/∛4.
Basic Properties of CDF (for both discrete and continuous r.v.)
Theorem 1.5.2
Let X be a random variable with cdf FX . Then,
PX((a, b]) = P(a < X ≤ b) = FX(b) − FX(a).
Proof.
It holds that
{−∞ < X ≤ b} = {−∞ < X ≤ a} ∪ {a < X ≤ b}.
The theorem follows from the third axiom of probability since the two sets on the right-hand side are disjoint.
Example 1.5.7 & 1.5.8
Determine the constant c and the probability P(2 < X ≤ 4) in the following questions:
1 X has a pmf
pX(x) = cx for x = 1, 2, . . . , 10, and 0 otherwise.
2 X has a pdf
fX(x) = cx for 0 < x < 10, and 0 otherwise.
Example
Consider an urn which contains balls, each with one of the numbers 1, 2, 3, 4 on it. Suppose there are i balls with the number i, for i = 1, 2, 3, 4. Suppose one ball is drawn at random. Let X be the number on the ball.
(a) Determine the pmf of X;
(b) Compute P (X ≤ 3);
(c) Determine the cdf of X.
Theorem 1.3.6 Continuity of Probability
For an increasing sequence of events {Cn}, define its limit as lim_{n→∞} Cn = ∪_{n=1}^∞ Cn. It holds that
lim_{n→∞} P(Cn) = P(lim_{n→∞} Cn) = P(∪_{n=1}^∞ Cn).
Symmetrically, for a decreasing sequence of events {Cn}, define its limit as lim_{n→∞} Cn = ∩_{n=1}^∞ Cn. It holds that
lim_{n→∞} P(Cn) = P(lim_{n→∞} Cn) = P(∩_{n=1}^∞ Cn).
Theorem 1.5.1
Let X be a random variable with cdf F .
(a) If a < b, then F(a) ≤ F(b).
(b) lim_{x→−∞} F(x) = 0.
(c) lim_{x→∞} F(x) = 1.
(d) lim_{x↘x0} F(x) = F(x0).
Proof.
(a) If a < b, then {X ≤ a} ⊂ {X ≤ b}. The result then follows from the monotonicity of the probability.
(b) If {xn} is a decreasing sequence such that xn → −∞, then Cn = {X ≤ xn} is decreasing with ∩_{n=1}^∞ Cn = ∅. From the continuity of probability theorem,
lim_{n→∞} F(xn) = P(∩_{n=1}^∞ Cn) = P(∅) = 0.
Theorem 1.5.1 (cont’d)
Let X be a random variable with cdf F.
(c) lim_{x→∞} F(x) = 1.
(d) lim_{x↘x0} F(x) = F(x0) (right continuity).
Proof.
(c) If {xn} is an increasing sequence such that xn → ∞, then Cn = {X ≤ xn} is increasing with ∪_{n=1}^∞ Cn = {X < ∞}, the whole sample space. From the continuity of probability theorem,
lim_{n→∞} F(xn) = P(∪_{n=1}^∞ Cn) = 1.
(d) Let {xn} be any decreasing sequence of real numbers such that xn ↓ x0. Then the sequence of sets Cn = {X ≤ xn} is decreasing and ∩_{n=1}^∞ Cn = {X ≤ x0}. The continuity of probability theorem implies that
lim_{n→∞} F(xn) = P(∩_{n=1}^∞ Cn) = F(x0).
Theorem 1.5.3
For any random variable,
P (X = x) = FX(x)− FX(x−), ∀x ∈ R.
Proof.
For any x, we have {x} = ∩_{n=1}^∞ (x − 1/n, x]. By the continuity of the probability function,
P(X = x) = P[∩_{n=1}^∞ {x − 1/n < X ≤ x}]
= lim_{n→∞} P[x − 1/n < X ≤ x]
= lim_{n→∞} [FX(x) − FX(x − 1/n)]
= FX(x) − FX(x−).
Boxiang Wang, The University of Iowa Chapter 1 STAT 4100 Fall 2018
Summary
I A random variable X is a function from a sample space C into the real numbers R.
I Every random variable is associated with a cdf: F_X(x) = P_X(X ≤ x) for −∞ < x < ∞.
I F_X(x) is defined for all x, not just those in its domain D.
I Notation-wise, random variables will always be denoted by uppercase letters and the realized values of the variable will be denoted by the corresponding lowercase letters. Thus the random variable X can take the value x.
I We say a random variable X is discrete if F_X(x) is a step function: F(x_0) = Σ_{x ≤ x_0} P(X = x).
I We say a random variable X is continuous if F_X(x) is a continuous function: F(x_0) = ∫_{−∞}^{x_0} f(x) dx.
Chapter 1 Probability and Distributions
1.6 Discrete Random Variables
Review of discrete random variable

The space D = {x = X(c) : c ∈ C} is either finite or countable. (“D is countable” means that there is a one-to-one correspondence between D and the positive integers.)

The support of a discrete random variable X is {x : p_X(x) > 0}. (In English: the support of X consists of all points x such that p_X(x) > 0.)
Example: geometric distribution

Consider a sequence of independent flips of a coin, each resulting in a head with probability p and a tail with probability q = 1 − p. Let the random variable X be the number of tails before the first head appears. Determine the pmf of X and the probability that X is even.
Solution:

P(X = x) = q^x p, x = 0, 1, 2, . . .

Σ_{k=0}^∞ P(X = 2k) = p Σ_{k=0}^∞ (q²)^k = p/(1 − q²).
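A quick numerical check of the even-sum formula, using an illustrative value p = 1/3 (my choice, not from the slides):

```python
from fractions import Fraction

# Geometric pmf P(X = x) = q^x p, with X = number of tails before the first head.
p = Fraction(1, 3)
q = 1 - p

# Partial sum of P(X even) versus the closed form p / (1 - q^2).
partial = sum(q ** (2 * k) * p for k in range(200))
closed = p / (1 - q ** 2)
```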
Example 1.6.2: hypergeometric distribution

A lot consists of m good fuses and n defective fuses. Choose k, k ≤ min{m, n}, fuses at random from the lot. Let the random variable X be the number of defective fuses among the k. Determine the pmf of X.
Solution: choosing x of the n defective and k − x of the m good fuses,

p_X(x) = \binom{n}{x} \binom{m}{k−x} / \binom{m+n}{k} for x = 0, 1, . . . , k, and 0 otherwise.
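The pmf can be evaluated with `math.comb`; the lot sizes below are illustrative values of my own, and the pmf should sum to 1 by Vandermonde's identity:

```python
from math import comb

# Hypergeometric pmf: X = number of defective fuses among k drawn
# from a lot of m good and n defective fuses.
def hyper_pmf(x, m, n, k):
    return comb(n, x) * comb(m, k - x) / comb(m + n, k)

m, n, k = 7, 5, 4                              # illustrative lot sizes
pmf = [hyper_pmf(x, m, n, k) for x in range(k + 1)]
mean = sum(x * pmf[x] for x in range(k + 1))   # equals k * n / (m + n)
```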
Transformation of discrete random variables

Let X be a discrete random variable with pmf p_X(x) and support D_X = {x : p_X(x) > 0}. Suppose we are interested in Y = g(X). Then Y is also a random variable, and we want to determine its pmf.

Let D_Y = {y : y = g(x) for some x ∈ D_X}. We have

p_Y(y) = P(Y = y) = P(g(X) = y) = P(X ∈ g^{−1}(y)) = Σ_{x : g(x) = y} p_X(x),

where g^{−1}(y) = {x : g(x) = y}. If g is one-to-one, this reduces to p_Y(y) = p_X(g^{−1}(y)).
Example

Consider the geometric random variable X with pmf P(X = x) = q^x p, x = 0, 1, 2, . . ..

1 Let Y be the number of flips needed to obtain the first head. Then Y = X + 1, so Y = g(X) = X + 1 and X = g^{−1}(Y) = Y − 1. Determine the pmf of Y.
Solution:

p_Y(y) = p_X(g^{−1}(y)) = q^{y−1} p, y = 1, 2, 3, . . .

2 Let Y = (X − 2)². Determine the pmf of Y.
Solution:

p_Y(y) = p_X(2) if y = 0,
p_Y(y) = p_X(1) + p_X(3) if y = 1,
p_Y(y) = p_X(0) + p_X(4) if y = 4,
p_Y(y) = p_X(√y + 2) if y = 9, 16, 25, . . .
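The pmf of Y = (X − 2)² can also be obtained mechanically by pushing the mass of X through g, as in the general formula; a sketch with an illustrative p = 0.4:

```python
from collections import defaultdict

p, q = 0.4, 0.6
N = 500                       # truncation point; the geometric tail beyond N is negligible
pX = lambda x: q ** x * p     # geometric pmf on x = 0, 1, 2, ...

# Group the mass of X by the common value of g(x) = (x - 2)^2.
pY = defaultdict(float)
for x in range(N):
    pY[(x - 2) ** 2] += pX(x)
```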
Chapter 1 Probability and Distributions
1.7 Continuous Random Variables
Review of continuous random variable

I Recall that a random variable is continuous if its cdf F_X(x) is a continuous function for all x ∈ R. Hence, if the random variable X is continuous, then P(X = x) = F_X(x) − F_X(x−) = 0 for all x ∈ R. This means that there is no point of discrete mass.
I Recall that a nonnegative function f_X is a pdf of the random variable X if

∫_{−∞}^{x} f_X(t) dt = F_X(x), for all −∞ < x < ∞.

I The support of a continuous random variable X is {x : f_X(x) > 0} = D. (In English: the support of X consists of all points x such that f_X(x) > 0.)
Example 1.7.1

Suppose we select a point at random in the interior of a circle of radius 1. Let X be the distance of the selected point from the origin. Determine the cdf and pdf of X and the probability that the selected point falls in the ring with radii 1/4 and 1/2.
Solution:

P(X ≤ x) = x² for 0 ≤ x < 1, so

F_X(x) = 0 for x < 0; x² for 0 ≤ x < 1; 1 for x ≥ 1,

f_X(x) = 2x for 0 ≤ x < 1, and 0 otherwise,

P(1/4 < X ≤ 1/2) = ∫_{1/4}^{1/2} 2t dt = 3/16.
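A Monte Carlo check of P(1/4 < X ≤ 1/2) = 3/16, sampling uniformly in the disk by rejection from the enclosing square (a sketch, not from the slides):

```python
import random

random.seed(1)

hits = trials = 0
while trials < 200_000:
    u, v = random.uniform(-1, 1), random.uniform(-1, 1)
    r2 = u * u + v * v
    if r2 <= 1:                    # the point landed inside the unit disk
        trials += 1
        if 1 / 16 < r2 <= 1 / 4:   # equivalent to 1/4 < r <= 1/2
            hits += 1

estimate = hits / trials           # should be close to 3/16 = 0.1875
```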
Example 1.7.2 (Exponential Distribution)

Let the random variable X be the time in seconds between incoming telephone calls at a busy switchboard. Suppose that X has the pdf

f_X(x) = λe^{−λx} for x > 0, and 0 otherwise.

This is an exponential distribution with parameter (rate) λ > 0. Knowing that the parameter λ = 1/4, compute the probability that the time between successive phone calls exceeds 4 seconds.
Solution:

P(X > 4) = ∫_4^∞ (1/4) e^{−x/4} dx = e^{−1} ≈ 0.3679.
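The closed form can be cross-checked with a midpoint Riemann sum of the pdf (the cutoff 60 is an arbitrary large value of mine; the truncated tail is negligible):

```python
import math

lam = 1 / 4
closed_form = math.exp(-4 * lam)   # P(X > 4) = e^{-1}

# Midpoint Riemann sum of the pdf on (4, 60).
n, a, b = 100_000, 4.0, 60.0
h = (b - a) / n
riemann = sum(lam * math.exp(-lam * (a + (i + 0.5) * h)) * h for i in range(n))
```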
Transformation of Continuous R.V.
Sometimes we know the distribution of a continuous random variable X and are interested in a random variable Y which is a transformation (function) of X, say Y = g(X); we want to determine the distribution of Y.

Example 1.7.3. Let X be the random variable with the pdf

f_X(x) = 2x for 0 < x < 1, and 0 otherwise.

Determine the cdf and pdf of Y = X².
Solution: for 0 < y < 1,

F_Y(y) = P(Y ≤ y) = P(X² ≤ y) = P(X ≤ √y) = F_X(√y) = y,

f_Y(y) = 1 for 0 < y < 1, and 0 otherwise.
Theorem 1.7.1: Transformation of continuous r.v.

Assume that X is a continuous random variable with pdf f_X(x) and support S_X. Suppose Y = g(X), where g is one-to-one. Then

f_Y(y) = f_X(h(y)) |h′(y)|, where h = g^{−1}.

Proof: Suppose that g (and thus h) is monotone increasing. Then

F_Y(y) = P(Y ≤ y) = P[g(X) ≤ y] = P[X ≤ h(y)] = F_X[h(y)].

Therefore,

f_Y(y) = F_Y′(y) = F_X′[h(y)] · h′(y) = f_X[h(y)] h′(y) = f_X[h(y)] (dx/dy).

Note that since h is increasing, h′(y) is positive on S_Y (h′(y) = |h′(y)|).
Proof (cont’d)

Suppose that g (and therefore h) is monotone decreasing. Then

F_Y(y) = P(Y ≤ y) = P[g(X) ≤ y] = P[X ≥ h(y)] = 1 − F_X[h(y)].

Therefore,

f_Y(y) = F_Y′(y) = −F_X′[h(y)] · h′(y) = f_X[h(y)] (−h′(y)).

Note that since h is decreasing, h′(y) is negative on S_Y (−h′(y) = |h′(y)|).

We refer to dx/dy = J as the Jacobian of the transformation.
Example

Assume that f(x) = 1 when 0 < x < 1 and 0 otherwise. Therefore S_X = (0, 1).

1 Let Y = X². g(x) = x² is a one-to-one function on (0, 1), h(y) = y^{1/2}, and h′(y) = (1/2) y^{−1/2}. Therefore,

f_Y(y) = f_X(h(y)) |h′(y)| = 1 · (1/2) y^{−1/2} = (1/2) y^{−1/2},

when 0 < y < 1 (S_Y = (0, 1)).

2 Suppose now that Y = g(X) = − log(X). Then h(y) = exp(−y), and |h′(y)| = exp(−y). Therefore

f_Y(y) = f_X(h(y)) |h′(y)| = 1 · exp(−y) = exp(−y), y > 0.

You can verify that ∫_0^∞ f_Y(y) dy = 1.
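Part 2 says that if X ~ Uniform(0, 1), then Y = −log X is exponential with rate 1; a simulation can check the implied cdf value P(Y ≤ 1) = 1 − e^{−1}:

```python
import math
import random

random.seed(0)

n = 200_000
# 1 - random.random() lies in (0, 1], so the log is always defined.
ys = [-math.log(1 - random.random()) for _ in range(n)]

empirical = sum(y <= 1 for y in ys) / n
theoretical = 1 - math.exp(-1)
```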
Extended theorem: Transformation of continuous r.v.

Assume X is a continuous random variable with pdf f_X(x) and support S_X. Suppose Y = g(X) and the support has a partition A_1, . . . , A_k. Further, suppose there exist functions g_1(x), . . . , g_k(x) such that

I g(x) = g_i(x) for x ∈ A_i,
I g_i(x) is one-to-one and monotone on each A_i,
I with Y_i = {y : y = g_i(x) for some x ∈ A_i}, we have Y_1 = Y_2 = · · · = Y_k = Y.

Then

f_Y(y) = Σ_{i=1}^k f_X(g_i^{−1}(y)) |(d/dy) g_i^{−1}(y)| for y ∈ Y, and 0 otherwise.
Example

I Assume that f(x) = 1/2 when −1 < x < 1 and 0 otherwise. Therefore S_X = (−1, 1).
I Let Y = X² and g(x) = x². Then g_1(x) = x² is one-to-one on A_1 = (−1, 0) and g_2(x) = x² is one-to-one on A_2 = (0, 1). We see Y_i = {y : y = g_i(x), x ∈ A_i} = (0, 1).
I Then h_1(y) = g_1^{−1}(y) = −y^{1/2}, with h_1′(y) = −(1/2) y^{−1/2}, and h_2(y) = g_2^{−1}(y) = y^{1/2}, with h_2′(y) = (1/2) y^{−1/2}.
I Therefore,

f_Y(y) = f_X(h_1(y)) |h_1′(y)| + f_X(h_2(y)) |h_2′(y)|
= (1/2) · (1/2) y^{−1/2} + (1/2) · (1/2) y^{−1/2}
= (1/2) y^{−1/2},

when 0 < y < 1 (S_Y = (0, 1)).
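The two-branch result f_Y(y) = (1/2) y^{−1/2} implies the cdf F_Y(y) = √y on (0, 1), which a simulation can verify:

```python
import random

random.seed(2)

# X ~ Uniform(-1, 1), Y = X^2.
n = 200_000
ys = [random.uniform(-1, 1) ** 2 for _ in range(n)]

# Empirical cdf at a couple of points, to compare with F_Y(y) = sqrt(y).
empirical = {y0: sum(y <= y0 for y in ys) / n for y0 in (0.25, 0.81)}
```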
Mixture of Discrete and Continuous R.V.
Example 1.7.6

Study the cdf

F_X(x) = 0 for x < 0; (x + 1)/2 for 0 ≤ x < 1; 1 for x ≥ 1.

I What is P(−3 < X ≤ 1/2)?
I What is P(X = 0)?
I What is P(−3 < X ≤ 0)?
I What is P(−3 < X < 0)?

It always holds from the definition of the cdf that

P(a < X ≤ b) = P(X ≤ b) − P(X ≤ a) = F(b) − F(a).
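Working the four bullets numerically from F and its left limits (the answers are computed here, not stated on the slide):

```python
# cdf of Example 1.7.6
def F(x):
    if x < 0:
        return 0.0
    if x < 1:
        return (x + 1) / 2
    return 1.0

def F_minus(x, eps=1e-9):      # numerical stand-in for the left limit F(x-)
    return F(x - eps)

p1 = F(0.5) - F(-3)            # P(-3 < X <= 1/2) = 3/4
p2 = F(0) - F_minus(0)         # P(X = 0) = 1/2, the jump at 0
p3 = F(0) - F(-3)              # P(-3 < X <= 0) = 1/2
p4 = p3 - p2                   # P(-3 < X < 0) = 0
```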
Example 1.7.7

Let X equal the size of a wind loss in millions of dollars, and suppose it has the cdf

F_X(x) = 0 for −∞ < x < 0; 1 − (10/(10 + x))³ for 0 ≤ x < ∞.

If losses beyond $10,000,000 are reported as 10, then the cdf of this censored distribution is

F_Y(y) = 0 for −∞ < y < 0; 1 − (10/(10 + y))³ for 0 ≤ y < 10; 1 for 10 ≤ y < ∞,

where Y = min(X, 10).
I This is an example of a mixed continuous and discrete random variable. The particular example chosen is known as censoring, because it creates the discrete part by lumping one end of the distribution into a single point.
I The continuous part of the random variable has the same pdf as the pdf of X on (0, 10), i.e.,

(3 · 10³/(y + 10)⁴) 1_{(0,10)}(y).

I The discrete part has the pmf

P(Y = 10) = 1 − F_X(10) = (10/20)³ = 1/8.
Chapter 1 Probability and Distributions
1.8 Expectation of a Random Variable
Definition of expectation

Let X be a random variable.
I If X is continuous with pdf f(x), then the expectation of X is

E(X) = ∫_{−∞}^{∞} x f(x) dx,

provided that ∫_{−∞}^{∞} |x| f(x) dx < ∞. We say the expectation does not exist if ∫_{−∞}^{∞} |x| f(x) dx = ∞.

I If X is discrete with pmf p(x), then the expectation of X is

E(X) = Σ_x x p(x),

provided that Σ_x |x| p(x) < ∞. We say the expectation does not exist if Σ_x |x| p(x) = ∞.

I Sometimes the expectation of X is called the mathematical expectation of X, the expected value of X, or the mean of X. We often denote E(X) by µ (µ = E(X)).
Example

Suppose we have a bowl with 1 chip labeled “1”, 2 chips labeled “2”, and 3 chips labeled “3”. Draw 2 chips without replacement. Let X = sum of the two draws. The pmf (derived before) is as follows:

x : 3, 4, 5, 6
p_X(x) = P(X = x) : 2/15, 4/15, 6/15, 3/15

We now find the expected value (mean) of the random variable X:

µ = Σ_{x=3}^{6} x p(x) = 3(2/15) + 4(4/15) + 5(6/15) + 6(3/15) = 14/3 ≈ 4.667.

If the random variable X were observed many times and the realizations were recorded, the expected value µ describes the mean of the observed realizations. If I repeatedly drew 2 chips from the bowl without replacement and recorded the sum of the two draws, the mean of the recorded sums would approach µ ≈ 4.667.
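A simulation makes the long-run interpretation concrete; each chip appears in 5 of the C(6, 2) = 15 equally likely pairs, so the exact mean is 70/15 = 14/3:

```python
import random
from itertools import combinations

random.seed(3)

bowl = [1, 2, 2, 3, 3, 3]

# Exact mean over all equally likely unordered pairs of positions.
pairs = list(combinations(range(6), 2))
exact_mean = sum(bowl[i] + bowl[j] for i, j in pairs) / len(pairs)

# Long-run average of repeatedly recorded sums.
n = 100_000
sim_mean = sum(sum(random.sample(bowl, 2)) for _ in range(n)) / n
```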
Exponential distribution

Let X be a random variable that follows the exponential distribution with rate λ, that is,

f_X(x) = λe^{−λx} for x > 0, and 0 otherwise.

Compute its expectation.
Solution: integrating by parts,

E(X) = ∫_0^∞ λx e^{−λx} dx = −∫_0^∞ x d(e^{−λx}) = −x e^{−λx} |_0^∞ + ∫_0^∞ e^{−λx} dx = 1/λ.
Two formulations of the exponential distribution

1 The parameter λ is defined as a rate:

f_X(x) = λe^{−λx} for x > 0, and 0 otherwise.

In this case, E(X) = 1/λ.

2 The parameter λ is defined as a scale:

f_X(x) = (1/λ) e^{−x/λ} for x > 0, and 0 otherwise.

In this case, E(X) = λ.

3 The first (rate) formulation is used in this course.
Expectation of a Function of X
Let X be a random variable and let Y = g(X) for some function g. We can calculate E(Y) = E[g(X)] in two different ways.

Suppose X is discrete and Σ_{x ∈ D_X} |g(x)| p_X(x) < ∞. Then
1 E[g(X)] = Σ_{x ∈ D_X} g(x) p_X(x).
2 Let p_Y(y) be the pmf of Y. Then E[g(X)] = E(Y) = Σ_{y ∈ D_Y} y p_Y(y).

Suppose X is continuous and ∫_{−∞}^{∞} |g(t)| f_X(t) dt < ∞. Then
1 E[g(X)] = ∫_{−∞}^{∞} g(x) f_X(x) dx.
2 Let Y = g(X) and let f_Y(y) be the pdf of Y. Then E[g(X)] = E(Y) = ∫_{−∞}^{∞} y f_Y(y) dy.
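A discrete check that both routes agree, reusing the chip-sum pmf from the earlier example with g(x) = (x − 4)² (my choice of g):

```python
from collections import defaultdict

pX = {3: 2/15, 4: 4/15, 5: 6/15, 6: 3/15}
g = lambda x: (x - 4) ** 2

# Route 1: sum g(x) p_X(x) over the support of X.
way1 = sum(g(x) * px for x, px in pX.items())

# Route 2: build the pmf of Y = g(X), then sum y p_Y(y).
pY = defaultdict(float)
for x, px in pX.items():
    pY[g(x)] += px
way2 = sum(y * py for y, py in pY.items())
```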
Assume X has the pdf f(x) = 1 for 0 < x < 1. Let Y = − log X. What is E(Y)?
Solution:

E(Y) = ∫_0^1 (− log x) dx = −[x log x − x] |_0^1 = 1.
For certain ore samples, the proportion X of impurities per sample is a random variable with density function given by

f(x) = (3/2)x² + x for 0 ≤ x ≤ 1, and 0 elsewhere.

The dollar value of each sample is Y = 5 − 0.5X. Find the means of X and Y.
Solution:

E(X) = ∫_0^1 x((3/2)x² + x) dx = 17/24 ≈ 0.708.

E(Y) = ∫_0^1 (5 − 0.5x)((3/2)x² + x) dx = 5 − 17/48 ≈ 4.646.
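Both integrals reduce to power-rule terms, and E(Y) also follows from linearity of expectation; an exact check with fractions:

```python
from fractions import Fraction

# E(X) = int_0^1 (3/2 x^3 + x^2) dx = 3/8 + 1/3 = 17/24
EX = Fraction(3, 8) + Fraction(1, 3)

# Linearity: E(Y) = E(5 - 0.5 X) = 5 - (1/2) E(X) = 5 - 17/48
EY = 5 - Fraction(1, 2) * EX
```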
Linearity of expectation

Theorem (1.8.2)

Let g_1(X) and g_2(X) be functions of a random variable X. Suppose the expectations of g_1(X) and g_2(X) exist. Then for any constants k_0, k_1, and k_2, the expectation of k_0 + k_1 g_1(X) + k_2 g_2(X) exists and is given by

E[k_0 + k_1 g_1(X) + k_2 g_2(X)] = k_0 + k_1 E[g_1(X)] + k_2 E[g_2(X)].

Corollary: If g(X) = a + bX, i.e., g is a linear function, then

E[g(X)] = a + b E(X) = g[E(X)].
Example

Assume X has the pdf f(x) = 1 for 0 < x < 1. Then

E(X + 3X²) = E(X) + 3E(X²) = 1/2 + 3 · (1/3) = 3/2.

Warning: If g is a nonlinear function, then in general E[g(X)] ≠ g[E(X)]. For the previous example,

E(X²) = 1/3 ≠ (E(X))² = (1/2)² = 1/4.

Which one is larger? Jensen’s inequality...
Example

Let X have the pdf f(x) = 3x², 0 < x < 1, zero elsewhere. Consider a random rectangle whose sides are X and (1 − X). Determine the expected value of the area of the rectangle.
Solution:

E[X(1 − X)] = ∫_0^1 3x² · x(1 − x) dx = 0.15.
Suppose X is a random variable such that E(X²) < ∞. Consider the function

h(b) = E[(X − b)²].

The value of b that minimizes h(b) is b = E(X): expanding, h(b) = E(X²) − 2b E(X) + b², so h′(b) = −2E(X) + 2b = 0 gives b = E(X).
Let X be a continuous random variable with cdf F(x). Determine the expectation of F(X).
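The exercise is left for class; for intuition, a simulation with one concrete choice of F (exponential with rate 1, my choice) suggests what the distribution-free answer should be:

```python
import math
import random

random.seed(4)

# X exponential with rate 1, so F(x) = 1 - e^{-x}; estimate E[F(X)].
n = 200_000
xs = [-math.log(1 - random.random()) for _ in range(n)]
estimate = sum(1 - math.exp(-x) for x in xs) / n   # F(X) is Uniform(0, 1) here
```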
Chapter 1 Probability and Distributions
1.9 Some Special Expectations
Mean and variance

Definition (Mean)
Let X be a random variable whose expectation exists. The mean value µ of X is defined to be µ = E(X).

Definition (Variance)
Let X be a random variable with finite mean µ. Then the variance of X is defined to be σ² = Var(X) = E[(X − µ)²]. The positive square root σ = √Var(X) is called the standard deviation of X.

Computation of Var(X):

Var(X) = E(X²) − (E(X))²,

which follows because E is a linear operator.
Coin-flipping example

Define X = 1 if the flip is a head and X = 0 if it is a tail. Assume P(X = 1) = p, so that P(X = 0) = 1 − p. Find the mean and variance of X.

Solution:

E(X) = 1 · p + 0 · (1 − p) = p,
E(X²) = 1² · p + 0² · (1 − p) = p,
σ² = p − p² = p(1 − p), σ = √(p(1 − p)).
Assume a > 0 and let

f_X(x) = 1/a for 0 < x < a, and 0 otherwise.

Find the mean and variance of X.
Solution:

µ = ∫_0^a x (1/a) dx = a/2,
E(X²) = ∫_0^a x² (1/a) dx = a²/3,
σ² = a²/3 − a²/4 = a²/12.
Example 1.9.2

If X has the pdf

f_X(x) = 1/x² for 1 < x < ∞, and 0 otherwise,

then the mean value of X does not exist, since

∫_1^∞ x · (1/x²) dx = ∫_1^∞ (1/x) dx = ∞.
Linear transformation

I Suppose that Y = a + bX. Then,
1 The mean of Y is E(Y) = µ_Y = a + b E(X) = a + b µ_X.
2 The variance of Y is

Var(Y) = E(Y − µ_Y)² = E[b²(X − µ_X)²] = b² σ_X².

In fact,

Var(Y) = Var(a + bX) = Var(bX) = b² Var(X).

3 The standard deviation of Y is |b| σ_X.

I Let the random variable X have mean µ and standard deviation σ. Show that

E[((X − µ)/σ)²] = 1.
Expectation of non-negative random variables

I Let X be a random variable of the discrete type with pmf p(x), x = 0, 1, 2, . . .. It holds that

E(X) = Σ_{x=0}^∞ [1 − F(x)].

I Let X be a continuous random variable with pdf f(x). Suppose that f(x) = 0 for x < 0. It holds that

E(X) = ∫_0^∞ [1 − F(x)] dx.
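A numerical check of the discrete tail-sum formula for a geometric pmf, where P(X > x) = q^{x+1} and E(X) = q/p:

```python
p, q = 0.3, 0.7
N = 400                                   # truncation; q^N is negligible

pmf = [q ** x * p for x in range(N)]
mean_direct = sum(x * pmf[x] for x in range(N))

# sum_{x >= 0} [1 - F(x)] = sum_{x >= 0} P(X > x) = sum_{x >= 0} q^{x+1}
tail_sum = sum(q ** (x + 1) for x in range(N))
```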
Proof for discrete case

Σ_{x=0}^∞ [1 − F(x)] = Σ_{x=0}^∞ [1 − P(X ≤ x)] = Σ_{x=0}^∞ P(X > x)
= Σ_{x=0}^∞ Σ_{t=x+1}^∞ P(X = t)
= Σ_{x=0}^∞ Σ_{t=0}^∞ 1_{t>x} P(X = t)
= Σ_{t=0}^∞ Σ_{x=0}^∞ 1_{x<t} P(X = t) (interchangeable due to absolute convergence)
= Σ_{t=0}^∞ t P(X = t) = E(X).
Proof for continuous case

In this 4000-level class, we only prove

E(X) = ∫_0^b [1 − F(x)] dx,

where the support of X is (0, b) and b < ∞, although the equality also holds for b = ∞.

Proof. Integrating by parts,

∫_0^b [1 − F(x)] dx = [x − x F(x)] |_0^b + ∫_0^b x f(x) dx = ∫_0^b x f(x) dx = E(X).
Moments

I The m-th raw moment of X is defined as E(X^m), if the expectation exists.
I The m-th central moment of X is defined as E[(X − µ)^m], if the expectation exists.

Example

1 Coin flipping: P(X = 1) = p,

E(X^m) = 1^m · p + 0^m · (1 − p) = p.

2 Assume a > 0 and let

f_X(x) = 1/a for 0 < x < a, and 0 otherwise.

E(X^m) = ∫_0^a x^m (1/a) dx = a^m/(m + 1).
Moment generating function

I Let X be a random variable. Assume there is a positive number h such that E[exp(tX)] exists for all t ∈ (−h, h).
I The moment generating function (mgf) of X is defined to be the function

M_X(t) = E(e^{tX}), −h < t < h.

Example

1 Assume P(X = 0) = 1 − p and P(X = 1) = p. Then

M_X(t) = E[e^{tX}] = p · e^t + (1 − p) · e^0 = p e^t + 1 − p, for −∞ < t < ∞.

2 For

f_X(x) = 1/a for 0 < x < a, and 0 otherwise,

M_X(t) = E[e^{tX}] = (1/a) ∫_0^a e^{tx} dx = (1/(at)) e^{tx} |_0^a = (e^{at} − 1)/(at)

when t ≠ 0, and M_X(t) = 1 when t = 0.
Properties of mgf

1 It is always true that M_X(0) = 1.
2 The moments of X can be found (or “generated”) from the successive derivatives of M_X(t):

M_X′(0) = E(X), M_X″(0) = E(X²), M_X^{(n)}(0) = E(X^n).

I We have M_X′(0) = E(X) = ∫_{−∞}^{∞} x f(x) dx = µ, since

M′(t) = (d/dt) ∫_{−∞}^{∞} e^{tx} f(x) dx = ∫_{−∞}^{∞} (d/dt) e^{tx} f(x) dx = ∫_{−∞}^{∞} x e^{tx} f(x) dx.

I We then see

M″(0) = E(X²) = ∫_{−∞}^{∞} x² f(x) dx = µ² + σ²,

so σ² = E(X²) − µ² = M″(0) − [M′(0)]².
The pdf of X is

f_X(x) = λe^{−λx} for x > 0, and 0 otherwise.

Find the mgf of X and use it to find the mean and the variance.
Solution:
The mgf is given as follows:

M(t) = E(e^{tX}) = ∫_0^∞ e^{tx} f(x) dx = λ/(λ − t), t < λ,

and

M′(t) = λ/(λ − t)², M″(t) = 2λ/(λ − t)³.

Thus

µ = M′(0) = 1/λ, σ² = M″(0) − µ² = 2/λ² − 1/λ² = 1/λ².
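A finite-difference sanity check of M′(0) = 1/λ and M″(0) = 2/λ², with an illustrative λ = 2:

```python
lam = 2.0
M = lambda t: lam / (lam - t)             # exponential mgf, valid for t < lam

h = 1e-4
d1 = (M(h) - M(-h)) / (2 * h)             # central difference for M'(0)
d2 = (M(h) - 2 * M(0) + M(-h)) / h ** 2   # central difference for M''(0)

mu = d1                                   # should be near 1/lam = 0.5
var = d2 - d1 ** 2                        # should be near 1/lam^2 = 0.25
```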
Theorem 1.9.1

Theorem. Let X and Y be random variables with moment generating functions M_X and M_Y, respectively, existing in open intervals about 0. If M_X(t) = M_Y(t) for all t in an interval containing t = 0, then X and Y have identical probability distributions.
Example

Suppose X is a random variable of the continuous type with mgf

M(t) = 1/(1 − 3t), t < 1/3.

Identify its distribution.

Solution:
Since M(t) = (1/3)/((1/3) − t), the random variable X follows the exponential distribution with rate λ = 1/3.
Characteristic function

I Distributions may not have an mgf.
I Can you show that the mgf of the Cauchy distribution does not exist?

f(x) = (1/π) · 1/(x² + 1), −∞ < x < ∞.

I Characteristic function (not in exams):

φ(t) = E(e^{itX}), for an arbitrary real value t.

I Every distribution has a unique characteristic function, and E(X) = −iφ′(0) and E(X²) = −φ″(0).
I Uniqueness of the characteristic function is due to the uniqueness of the Fourier transform. In fact, uniqueness of the mgf comes from the uniqueness of the Laplace transform.
Chapter 1 Probability and Distributions
1.10 Important Inequalities
Chebyshev’s inequality
Motivation

I The variance of a random variable tells us something about the variability of the observations about the mean.
I If a random variable has a small variance or standard deviation, we would expect most of the values to be grouped around the mean.
I For any random variable, the probability between any two values symmetric about the mean should be related to the standard deviation.
Chebyshev’s Theorem

Theorem (1.10.3)

Suppose that for a random variable X, E(X²) exists. Then for any constant k > 0,

P(µ − kσ < X < µ + kσ) ≥ 1 − 1/k².
Example

A random variable X has mean µ = 8, variance σ² = 9, and an unknown probability distribution. Find bounds for

1 P(−4 < X < 20),
2 P(|X − 8| ≥ 6).

Solution:

P(−4 < X < 20) = P(8 − 4 × 3 < X < 8 + 4 × 3) ≥ 15/16.

P(|X − 8| ≥ 6) = 1 − P(|X − 8| < 6) = 1 − P(−6 < X − 8 < 6) = 1 − P(8 − 2 × 3 < X < 8 + 2 × 3) ≤ 1/4.
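To see how conservative the k = 4 bound is, compare it with the exact probability under one assumed distribution, say Normal(8, 3) (an illustration of mine; the slide's X is unknown):

```python
import math

def normal_cdf(x, mu=8.0, sigma=3.0):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

bound = 1 - 1 / 4 ** 2                           # Chebyshev: P(-4 < X < 20) >= 15/16
exact_normal = normal_cdf(20) - normal_cdf(-4)   # exact value under normality
```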
Comments on Chebyshev’s Theorem

I Chebyshev’s theorem holds for any distribution of observations.
I For this reason, the results are usually weak. The value given by the theorem is a lower bound only.
I We know the probability of a random variable falling within two standard deviations of its mean will be no less than 3/4, but we never know how much more it might actually be, unless we can determine exact probabilities.
I Chebyshev’s theorem is thus called a distribution-free result. The results will be less conservative when specific distributions are known.
Example

Compute P(µ − 2σ < X < µ + 2σ), where X has the density function

f(x) = 6x(1 − x) for 0 < x < 1, and 0 elsewhere,

and compare with the result given in Chebyshev’s theorem.

Solution:
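The solution is worked in class; the computation itself is short, since µ = 1/2, σ² = 1/20, and the cdf is F(x) = 3x² − 2x³ on (0, 1):

```python
import math

mu, var = 0.5, 0.05          # from E(X) = 1/2 and E(X^2) = 3/10
sigma = math.sqrt(var)
a, b = mu - 2 * sigma, mu + 2 * sigma

F = lambda x: 3 * x ** 2 - 2 * x ** 3   # cdf of f(x) = 6x(1 - x) on (0, 1)
prob = F(b) - F(a)                      # exact P(mu - 2sigma < X < mu + 2sigma)
```

The exact probability is roughly 0.98, noticeably above Chebyshev's guarantee of 3/4.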
Jensen’s inequality
Definition 1.10.1

A function φ defined on an interval (a, b) is said to be a convex function if for all x and y in (a, b) and all 0 < γ < 1,

φ[γx + (1 − γ)y] ≤ γφ(x) + (1 − γ)φ(y).

We say φ is strictly convex if the above inequality is strict.

If φ is a (strictly) convex function, then −φ is a (strictly) concave function.
Theorem 1.10.4

Suppose φ is differentiable on (a, b). Then

(a) (first-order condition) φ is convex if and only if, for all x_1, x_2 ∈ (a, b),
φ(x_1) ≥ φ(x_2) + φ′(x_2)(x_1 − x_2);
(b) φ is strictly convex if and only if the inequality in (a) is strict whenever x_1 ≠ x_2.

If φ is twice differentiable on (a, b), then
(c) (second-order condition) φ is convex if and only if φ″(x) ≥ 0 for all x ∈ (a, b);
(d) φ is strictly convex if φ″(x) > 0 for all x ∈ (a, b).

The second-order condition is usually used to check convexity, provided that the second derivative exists.
Jensen’s inequality

Let φ be a convex function on an open interval I, and let X be a random variable whose support is contained in I. If µ = E(X) exists, then

φ[E(X)] ≤ E[φ(X)].

The inequality reverses if φ is a concave function.

Proof.
By the first-order condition,

φ(x) ≥ φ(µ) + φ′(µ)(x − µ).

Taking expectations of both sides leads to the result. The inequality is strict if φ is strictly convex and X is nondegenerate.

Example

We have µ² < E(X²) for nondegenerate X, as φ(t) = t² is strictly convex.
Example

Let X be a positive, nondegenerate random variable. Argue that

1 E(1/X) > 1/E(X).
2 E(√X) < √E(X).

Solution:

1 When x > 0, φ″(x) = 2x^{−3} > 0, so φ(x) = 1/x is strictly convex. The result follows from Jensen’s inequality.
2 When x > 0, φ″(x) = −(1/4) x^{−3/2} < 0, so φ(x) = √x is strictly concave. The result follows from Jensen’s inequality.
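A two-point numerical illustration of both strict inequalities (the values 1 and 4 are my choice):

```python
import math

# Positive, nondegenerate X taking values 1 and 4 with equal probability.
pmf = {1: 0.5, 4: 0.5}

EX = sum(x * p for x, p in pmf.items())                 # 2.5
E_inv = sum((1 / x) * p for x, p in pmf.items())        # E(1/X) = 0.625 > 1/2.5
E_sqrt = sum(math.sqrt(x) * p for x, p in pmf.items())  # E(sqrt(X)) = 1.5 < sqrt(2.5)
```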