
Lecture Notes for Probability (135A)

Rafael Granero Belinchon,[email protected],Department of Mathematics,University of California, Davis,

Fall Quarter 2014


User’s guide

These are the lecture notes of my 135A course on basic probability. These notes are based on the book Probability and random processes, 3rd edition, by G. Grimmett and D. Stirzaker. The goal of these notes is to help the students understand the material covered during the classes and to serve as a supplement to the textbook. These notes are not written to replace a careful reading of the previously mentioned book; consequently, it is highly recommended to study the book as well. On the one hand, probability is, in general, a rather intuitive subject. On the other hand, there are quite a few cases where probability can be very unintuitive (even counterintuitive). As an example, we will discuss several paradoxical situations. Due to this fact, this course should be taken cautiously, in the sense that the average student must study several hours per week. In particular, during the lectures and in these notes there are a number of exercises and examples that every student should try.

There are two types of text in these notes other than the standard one. There are several reminders of material from other courses; I call them memento (not only from the English word 'memento', but also from the Latin sentence 'memento studere'). In these notes I also provide advanced material called one step forward. Finally, let me emphasize that it is important to become familiar with all the material in these lecture notes as a first step towards the proficiency in the subject required for the exams and real-life situations.


Contents

1 Introduction and preliminaries
  1.1 An intuitive approach to probability
  1.2 Counting
  1.3 The story of a problem with history
  1.4 Maximizing our chances in a tv show
  1.5 Definitions and axioms
  1.6 Bertrand's paradox
  1.7 Conditional probability
  1.8 Evaluation of forensic evidence: the Dreyfus case
  1.9 Evaluation of forensic evidence: the O.J. Simpson case
  1.10 Independence
  1.11 Evaluation of forensic evidence (revisited)
  1.12 Matlab Code
  1.13 Conclusion

2 Random variables
  2.1 Random variables
  2.2 Examples of random variables
      2.2.1 Constant variables
      2.2.2 Bernoulli variables
  2.3 Monte Carlo
  2.4 Some examples
  2.5 Conclusion

3 Discrete random variables
  3.1 Expectation and variance
  3.2 Pairs of random variables
  3.3 An introduction to the random walk

4 Continuous random variables
  4.1 Introduction
  4.2 Independence
  4.3 Expectation and Variance
  4.4 Examples
  4.5 Pairs of random variables

5 Characteristic functions


List of Figures

1.1 Random endpoint
1.2 Random midpoint (I)
1.3 Random midpoint (II)
1.4 J'accuse by Emile Zola


"What I want to promote more than anything else in the world is rational thought, as opposed to irrational thought, which I think there is far too much of. For instance, there are always people who know about probabilities, who will take advantage of people who don't. It's hard to be a functioning member of society and not understand how probabilities work. The lottery is not a tax on the poor; it's a tax on people who don't do well in mathematics. I want to promote a world where people are trained in how to think rationally about how the world works. Knowing facts about science is important, but knowing how to think matters more."

'A perfect world', Neil deGrasse Tyson.


Chapter 1

Introduction and preliminaries

1.1 An intuitive approach to probability

If someone approaches you and asks you about the probability of getting a head when tossing a perfectly fair coin, you will answer almost instantly "one half". The thing becomes a little bit more tricky when the probability asked for is the probability of getting 5 heads [*** try to compute this probability!!! ***].

The thought process that most of you are following may be summarized as "Well, there are two possible outcomes. One is a head and the other one is a tail. Consequently, I will get a head, more or less, half of the time". This is a perfectly correct argument.

Or, in other words, first we consider every possible outcome or event and we count how many possible events there are. We write T (for total cases) for this number. Then, we select the ones that are interesting for the problem that we are addressing and we count how many of these interesting events there are. We write I (for interesting cases) for this number. We say that the probability in this random experiment is

P = I/T.

In particular, the probability of a head is

P(Head) = (number of interesting cases)/(number of total cases) = 1/2.

The previous definition of probability is (sometimes) called Laplace's rule. We will go over it in more depth below. First, more examples:

Example 1.1. Compute the probability of getting a '4' when throwing a perfectly fair, six-faced die.

Solution: We have

P(Getting a 4) = (number of interesting cases)/(number of total cases) = 1/6.


Example 1.2. Compute the probability of getting two heads when tossing a perfectly fair coin twice.

Solution: We have that the possible outcomes are

HH,HT, TH, TT.

Consequently, T = 4. The case ’two heads’ corresponds to the event HH, so

P(Two heads) = (number of interesting cases)/(number of total cases) = 1/4.

Example 1.3. Compute the probability of getting one head and one tail when tossing a perfectly fair coin twice.

Solution: As before, we have that the possible outcomes are

HH,HT, TH, TT,

thus, T = 4. The case ’one head and one tail’ corresponds to the events TH and HT , so

P(One head and one tail) = (number of interesting cases)/(number of total cases) = 2/4 = 0.5.

Example 1.4. Compute the probability of getting an even number when throwing a perfectly fair die with six sides.

Solution: We have that the possible outcomes are

1, 2, 3, 4, 5, 6

thus, T = 6. The case ’even number’ corresponds to the events 2, 4, 6, so

P(Even number) = (number of interesting cases)/(number of total cases) = 3/6 = 0.5.

In particular, the interesting outcome may be formed by several different events.

Example 1.5. Mr. Jones has two children. The older child is a girl. What is the probability that both children are girls?

Solution: Let’s clarify our assumptions:

• Each child is either male or female.

• Each child has the same chance of being male as of being female.

• The sex of each child is independent of the sex of the other.

The cases are

Younger   Older   Fail/ok
boy       boy     fail!
boy       girl    ok!
girl      girl    ok!
girl      boy     fail!


Notice that it seems that there are 4 cases (one per row in the previous table); however, two of them are not possible due to the assumptions in the problem ('The older child is a girl.'). As a consequence, T = 2. Within these two cases, only the case girl-girl is interesting to us, so I = 1. Consequently,

P (Two girls) = 0.5.

Example 1.6. Mr. Smith has two children. At least one of them is a boy. What is the probability that both children are boys? (Let's explain how we realize this experiment: within the families with two children such that at least one is a boy, we choose one family at random.)

Solution: The cases are

Younger   Older   Fail/ok
boy       boy     ok!
boy       girl    ok!
girl      girl    fail!
girl      boy     ok!

Notice that it seems that there are 4 cases (one per row in the previous table); however, one of them is not possible due to the assumptions in the problem ('At least one of them is a boy.'). As a consequence, T = 3. Within these three cases, only the case boy-boy is interesting to us, so I = 1. Consequently,

P(Two boys) = 1/3.

As you can imagine, if the number of total cases in a given experiment is big, the idea of writing down and counting all of them is not very 'doable'. To overcome this difficulty, in the next section we explain some counting techniques.
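Even before any general technique, these tiny examples can be checked by brute force. The sketch below (a Matlab illustration in the spirit of the code announced in Section 1.12; the encoding 0 = boy, 1 = girl and all variable names are my own) enumerates the four equally likely (younger, older) combinations and applies Laplace's rule to Examples 1.5 and 1.6.

% Enumerate the four equally likely (younger, older) combinations.
% 0 = boy, 1 = girl.
cases = [0 0; 0 1; 1 0; 1 1];

% Example 1.5: condition 'the older child is a girl'.
keep = cases(:,2) == 1;                               % T = admissible cases
p_jones = sum(keep & cases(:,1) == 1) / sum(keep);    % I/T = 1/2

% Example 1.6: condition 'at least one child is a boy'.
keep = any(cases == 0, 2);
p_smith = sum(keep & all(cases == 0, 2)) / sum(keep); % I/T = 1/3

fprintf('P(two girls | older is a girl)  = %.4f\n', p_jones);
fprintf('P(two boys  | at least one boy) = %.4f\n', p_smith);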

1.2 Counting

In general, if the statement of the problem is hard, counting the possible events may be tricky.

Permutations: The number of ways of matching n elements with another n ordered elements is n! = n · (n − 1) · (n − 2) · · · 3 · 2 · 1.

Let's illustrate this with an example. Let's consider the letters a, b, c. We want to match them with the (ordered) numbers 1, 2, 3. For the first number we have three possible situations:

(1, a), (1, b), (1, c).

Let's assume that we assign (1, c) [*** we are not losing generality here. Why? ***]. Once the first case is assigned, we cannot use the same letter again. So, for the next number we have only two possibilities, let's say

(2, a), (2, b).

Once we assign (2, a), the third pair is mandatory: (3, b). Consequently, there are

3 ways for the first number × 2 ways for the second number × 1 way for the third number.

The number of ways of matching n elements with another k ≤ n ordered elements is n!/(n − k)! = n · (n − 1) · (n − 2) · · · (n − k + 1).

Example 1.7. In how many different orders could 16 pool balls be arranged? (Notice that, once you use a ball, you cannot use it again.)

Solution: We have 16! orderings. This is due to the fact that, after choosing the first one, to choose the second one we only have 15 (remaining) balls.

If, instead of ordering 16 balls, we want to order just 5 balls within the full set of 16 balls, the answer is 16 · 15 · 14 · 13 · 12. The reasoning in this case is similar.

Example 1.8. In a chess tournament with 20 participants, in how many ways can the first table in the first round be chosen? (There are two players per table and, by usage, to be the first player means that you play with the white pieces.)

Solution: We have to choose the white player among 20 possible players. After that, we can choose the second player among the 19 remaining players (as we remove the white player). Consequently, we have 20 · 19 ways.

Notice that [*** when the order is important we have to deal with permutations ***].
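These products are easy to check with a couple of lines of Matlab; the snippet below is only an illustrative sketch (factorial and prod are standard Matlab functions, the variable names are mine).

% Orderings of 16 pool balls (Example 1.7): 16!
n_orderings = factorial(16);

% Ordering only 5 balls out of 16: 16!/(16-5)! = 16*15*14*13*12
n_partial = factorial(16) / factorial(16 - 5);   % same as prod(12:16)

% First table in a 20-player chess round (Example 1.8): 20*19
n_tables = 20 * 19;

fprintf('16 balls: %d orderings\n', n_orderings);
fprintf('5 of 16 balls: %d ordered choices\n', n_partial);
fprintf('first chess table: %d ways\n', n_tables);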

Combinations: Given n elements, the number of subsets of k elements is

C(n, k) = n!/(k!(n − k)!).

We can argue as follows: there are n!/(n − k)! ways of choosing k elements within a set with n elements. However, the previous number was obtained assuming that the order is important. This means that we are considering {a, b, c} and {b, c, a} as different ways of choosing a group of three letters from the (English) alphabet. Notice that, as subsets, they are identical because the elements are exactly the same. Consequently, if order is not important, we should count both ways only once. So, we need to take into account how many ways of ordering k elements we have. We know from the previous arguments that this latter number is k!. Consequently, the final answer will be C(n, k).

Example 1.9. Given the English alphabet, let's compute the number of 'words' of three letters without repetition that you may form (by 'word' I mean that the order is important, even if a given group of letters has no real meaning). Then compute the number of subsets of three letters that you may form.

Solution: The English alphabet has 26 letters. Within 26 letters you may form

26 · 25 · 24 words with three different letters.


As we can distinguish the word ’ash’ from the word ’has’, here the order is important.

Now, as subsets of letters, the subset {a, s, h} is exactly the same as the subset {h, a, s}. In particular, we should count it only once. By the previous ideas, there are 3! ways of ordering a group of 3 letters. So we obtain

C(26, 3) = 26!/(3!(26 − 3)!) = (26 · 25 · 24)/(3 · 2) subsets of three letters.

Notice that [*** when the order is not important we have to deal with combinations ***].
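Matlab's built-in nchoosek computes C(n, k) directly, so Example 1.9 can be double-checked as follows (an illustrative sketch with names of my own choosing).

% 'Words' of three different letters from a 26-letter alphabet (order matters)
n_words = 26 * 25 * 24;              % 15600

% Subsets of three letters (order does not matter): C(26, 3)
n_subsets = nchoosek(26, 3);         % 2600 = 15600 / 3!

fprintf('ordered words:  %d\n', n_words);
fprintf('unordered sets: %d (= %d / 3!)\n', n_subsets, n_words);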

Let me finish this section by emphasizing a paragraph in the paper 'Generalizing Galileo's Passedix Game' by Prof. Vassilios C. Hombas [2]:

Complexity in combinations on the one hand and permutations on the other has been problematic in teaching probabilities and remains so.

1.3 The story of a problem with history

In the next two sections we will explain in full detail some famous problems using the previous ideas. Consequently, you should think of these two sections as examples, and I recommend that you try to solve them by yourself.

Let me start this section with a quote from Leibniz [3]. Leibniz believed that

"with two dice, it is as feasible to throw a total 12 points as to throw a total 11: as either one can only be achieved in one way."

[*** Is Leibniz's thought true or not? Why? ***]

At this point it may be interesting to speak a little bit about the origins (so to speak) of probability.

In the XVI century, the Duke of Tuscany [4], Ferdinando dei Medici, asked Galileo to explain:

"Why, although there were an equal number of 6 partitions of the numbers 9 and 10, did experience state that the chance of throwing a total 9 with three fair dice was less than that of throwing a total of 10?"

By partitions, the Duke meant the possible ways of summing up to 9 or 10, respectively, i.e., in the case of total sum 9 we have

(3, 3, 3), (1, 2, 6), (1, 3, 5), (1, 4, 4), (2, 2, 5), (2, 3, 4),

[2] Available at http://interstat.statjournals.net/YEAR/2004/articles/0401001.pdf

[3] For further details, the interested student may read the paper 'Generalizing Galileo's Passedix Game' by Prof. Vassilios C. Hombas (available at http://interstat.statjournals.net/YEAR/2004/articles/0401001.pdf).

[4] Tuscany is a region in the center of Italy whose capital is Florence. If you are able, you should visit it!


while in the case of total sum 10 we have

(1, 3, 6), (1, 4, 5), (2, 2, 6), (2, 3, 5), (2, 4, 4), (3, 3, 4).

Using the same naive approach as the Duke, we are tempted to say that the probability should be the same, since the number of partitions is exactly the same.

Let's assume now that we have three fair dice, but these dice are of different colors: red, blue and green. In that case, we realize that not every partition has the same probability! For instance, the partition (3,3,3) is only realized with every die showing a three; however, the partition (1,2,6) is obtained in more ways. Indeed, we have

(1, 2, 6), (1, 6, 2), (2, 1, 6), (2, 6, 1), (6, 1, 2), (6, 2, 1).

This new splitting suggests that the partition (1,2,6) is 6 times more likely than the partition (3,3,3).

The important thing that the Duke did not realize is that order matters in this experiment. So, for a given partition with different elements, let's say (1, 2, 6), we have to compute all the possible ways to match these three numbers with the three dice. In other words, we are talking about permutations of 3 elements. As we studied before, this number is 3! = 6, which is the number of possibilities that we obtained arguing with coloured dice.

Let's study now the case of the partition (1, 4, 4). Using coloured dice, we have

(1, 4, 4), (4, 1, 4), (4, 4, 1).

If, instead of having a partition with three different elements, we have a partition with only 2 distinct elements (i.e. one of the elements appears twice, as in, say, (1, 4, 4)), the situation is not a permutation. Instead, we have 3 elements (the coloured dice) and we have to match them with another three elements (the numbers), but there are repeated elements. In other words, if we just take the number of permutations of three elements, 3! = 6, as the number of ways that the partition (1,4,4) appears, we are overestimating the real answer. The reason is that we are splitting

(1, 4, 4) and (1, 4, 4),

but they are the same situation (the two 4s are simply exchanged)! This implies that we are counting every case twice, so the real number of ways leading to (1, 4, 4) is 3!/2 = 6/2 = 3 (as in our answer using coloured dice).

With these arguments we can, finally, count the total number of ways that a total sum of 9 appears. We have 1 case that is only possible in one way (the (3,3,3)), three cases which appear in 6 different ways each (like (1,2,6)) and 2 cases appearing in 3 different ways (the cases (2,2,5) and (1,4,4)). Summing up all this, we have

number of ways of getting 9 as total sum = 1 + 3 · 6 + 2 · 3 = 1 + 18 + 6 = 25 ways.

[*** Can you compute the number of ways of getting a total sum equal to 10? ***]


Let's go back to Leibniz's comment. With two dice, we have only one partition leading to 11 or 12 (in this part he is obviously right), namely (5,6) and (6,6). However, using coloured dice, we have

(5, 6), (6, 5),

and (6, 6).

Consequently, the probability of having 11 is twice the probability of having 12, and Leibniz was wrong!

This example may not be as straightforward as desired, so I recommend that you read it again and convince yourself that you understand it.
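If you want to convince yourself numerically, the counting in this section can be verified by listing all ordered outcomes: 6^3 = 216 triples for three coloured dice and 36 pairs for two dice. A possible Matlab sketch (my own, not from the text):

% All ordered outcomes of three dice: 6^3 = 216 triples.
[d1, d2, d3] = ndgrid(1:6, 1:6, 1:6);
s3 = d1(:) + d2(:) + d3(:);
fprintf('ways to get 9:  %d\n', sum(s3 == 9));   % 25
fprintf('ways to get 10: %d\n', sum(s3 == 10));  % 27

% Two dice: 36 ordered outcomes, Leibniz's 11 vs 12.
[e1, e2] = ndgrid(1:6, 1:6);
s2 = e1(:) + e2(:);
fprintf('ways to get 11: %d\n', sum(s2 == 11));  % 2
fprintf('ways to get 12: %d\n', sum(s2 == 12));  % 1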

1.4 Maximizing our chances in a tv show

In this section we are going to explain another (very famous) problem in basic probability. Let me state the problem:

You are in a tv show. In this show, there are three closed doors (door 1, door 2 and door 3, let's say) and you are asked to choose one of these doors. You know that behind one of these doors there is a car, which is the desired prize (if you already have a new car, motivate yourself with your favourite object of desire!), while the other two doors are empty.

After you choose one of these doors, the presenter opens one of the empty doors. Notice that he knows in advance where the car is and he will never open that door. He will never open the door that you have chosen, either.

Then, he asks you if you want to change your choice to the remaining closed door (so far not chosen).

The question is: should you change your choice or, on the other hand, should you maintain your original choice? And why?

Let me explain again how the tv show works with an example. Let's assume that the car is behind door 1. Then

• if you choose door 1, the presenter will open door 2 or door 3 at random (both are empty and he knows it),

• if you choose door 2 (so, your door is empty), he will open door 3 (another empty door),

• if you choose door 3 (so, your door is empty), he will open door 2 (another empty door).

[*** Try to guess what the better choice is. ***]

In a naive way, we can think that, once the presenter opens one of the remaining doors, since there are only two remaining doors, the probability of having the car behind any of these doors is 0.5. But this is not correct.


Let's start from the beginning. The probability of having the car behind our first choice is 1/3 (since there are three doors and the car is behind one of them with the same probability). No matter which door we choose, the probability of winning with our first choice is 1/3.

In other words, the probability of having an empty door as our first choice is 2/3.

Now the presenter opens a door and we are asked to decide whether or not we change our initial choice. Notice that our door remains closed and nobody moved the car during the game. Consequently, the probability that our door hides the car remains unchanged! This suggests that we should change our initial choice to maximize our probability of winning.

[*** Before continuing to prove that this argument is correct, let me give a different, clearer version of the game.

You are in a tv show. In this show, there are 100 closed doors (door 1, door 2, door 3, etc., let's say) and you are asked to choose one of these doors. You know that behind one of these doors there is a car, which is the desired prize (if you already have a new car, motivate yourself with your favourite object of desire!), while the other 99 doors are empty.

After you choose one of these doors, the presenter opens 98 of the empty doors. Notice that he knows in advance where the car is and he will never open that door. He will never open the door that you have chosen, either.

Then, he asks you if you would change your choice to the remaining closed door (so far not chosen).

The question is: should you change your choice or, on the other hand, should you maintain your original choice? And why?

This example seems clearer, right? The probability of having the car behind our first choice is 1/100. However, once the other 98 doors are open, and knowing that the presenter knows where the car is, it seems clear that the remaining closed, not chosen door will (most likely) hide the car. ***]

Let's study now the different cases with only three doors. Let's rename the door with the car as door A and the other two doors as doors B and C, respectively.

First, we are going to compute the probability of winning the car without changing our initial choice, i.e. we choose a door and we stick with it all along the game.

Initial choice   Door opened by presenter   Final choice   Win/lose
A                B/C                        A              Win
B                C                          B              Lose
C                B                          C              Lose

Now, let’s assume that you change the door:


Initial choice   Door opened by presenter   Final choice   Win/lose
A                B/C                        C/B            Lose
B                C                          A              Win
C                B                          A              Win

It seems that, by switching the door, our probability rises from 1/3 to 2/3.

Notice that one of our cases (the first one in each table) in fact represents two different cases. These two (sub)cases have probability 1/6 each [*** Can you explain why? ***], so, when added together, the probability of each row in our table is 1/3. We will go over it later.
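If the tables above still feel suspicious, a short simulation settles the matter. The following Matlab sketch (entirely my own illustration) plays the game many times with both strategies and estimates the two winning probabilities.

% Monte Carlo check of the switching argument.
N = 100000;                        % number of simulated games
win_stay = 0; win_switch = 0;
for k = 1:N
    car  = randi(3);               % door hiding the car
    pick = randi(3);               % our first choice
    % presenter opens an empty door different from our pick
    options = setdiff(1:3, [car pick]);
    opened  = options(randi(numel(options)));
    % staying keeps 'pick'; switching takes the remaining closed door
    switched = setdiff(1:3, [pick opened]);
    win_stay   = win_stay   + (pick == car);
    win_switch = win_switch + (switched == car);
end
fprintf('P(win | stay)   ~ %.3f\n', win_stay / N);    % about 1/3
fprintf('P(win | switch) ~ %.3f\n', win_switch / N);  % about 2/3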

This particular example with 3 doors is not very intuitive. It appeared as a logic game in some magazine. Once the answer was provided, some people got mad about it. For instance, the author of the problem received letters like this one (notice that the writer uses 'reveals a goat' for what we call opening an empty door):

"You blew it, and you blew it big! Since you seem to have difficulty grasping the basic principle at work here, I'll explain. After the host reveals a goat, you now have a one-in-two chance of being correct. Whether you change your selection or not, the odds are the same. There is enough mathematical illiteracy in this country, and we don't need the world's highest IQ propagating more. Shame!"

S.S., Ph.D., University of Florida

1.5 Definitions and axioms

In the previous sections we introduced the basic ideas of discrete probability. Now we are going to make them rigorous, but first we need some tools from set theory.

Memento 1.1. Let us recall some notation:

• A ∪ B means the union of these two sets, i.e. the elements that are in any of the two sets.

• A ∩ B means the intersection of these two sets, i.e. the elements that are in both of the two sets.

• Ac means the complement of the set A.

• A − B means the difference between the sets A and B, i.e. the elements that are in A but not in B.

In particular, given the sets A = {a, b, c}, A′ = {a} and B = {1, 2}, we have

A ∪ B = {a, b, c, 1, 2}, A ∩ A′ = {a}, (A′)c = {b, c}, A ∩ B = ∅, A − A′ = {b, c}.

Example 1.10. Is

A ∪ (B ∩ C) = (A ∪B) ∩ (A ∪ C)

true?


Solution: Yes. Let me explain how to prove it. [*** The idea is to obtain that the set on the left hand side A ∪ (B ∩ C) is included in the set on the right hand side (A ∪ B) ∩ (A ∪ C), i.e.

A ∪ (B ∩ C) ⊂ (A ∪B) ∩ (A ∪C).

After that, we will prove the opposite inclusion

A ∪ (B ∩ C) ⊃ (A ∪B) ∩ (A ∪C).

As both inclusions are true, the sets should be the same. ***]

Take x ∈ A ∪ (B ∩ C). We split this situation into two cases:

1. x ∈ B ∩ C, or

2. x ∈ A.

If x is in B ∩ C then x ∈ B and x ∈ C. Consequently, x ∈ (A ∪ B) and x ∈ (A ∪ C). We conclude the inclusion in this case. We have to consider now that x ∈ A. In this latter case, we obtain the inclusion with the same argument. So far, we have proved

A ∪ (B ∩ C) ⊂ (A ∪B) ∩ (A ∪C).

Take now y ∈ (A ∪B) ∩ (A ∪ C), then we can split in two cases:

1. y /∈ A, or

2. y ∈ A.

If y ∉ A but y ∈ (A ∪ B) ∩ (A ∪ C), then y ∈ B and y ∈ C. Consequently, y ∈ B ∩ C and we get the desired inclusion. If y ∈ A the inclusion is obtained straightforwardly.

As this is the first example, let me try to rephrase what we have been doing previously. Notice that on the left hand side we have the elements that are (at the same time) in B and C, together with the elements in A. On the right hand side, each union set is formed with the elements in A together with the elements in B and C, respectively. In the intersection, the elements in A are contained for sure, because A is contained in both of the intersecting union sets. Consequently, the only elements that we have to check are the elements in B and C. Let me illustrate these words with an example. Let's consider

A = {a, b, c, d}, B = {1, 2, 3}, C = {2, 3, 4}.

We have

A ∪ (B ∩ C) = {a, b, c, d} ∪ {2, 3} = {a, b, c, d, 2, 3},

and

(A ∪ B) ∩ (A ∪ C) = {a, b, c, d, 1, 2, 3} ∩ {a, b, c, d, 2, 3, 4} = {a, b, c, d, 2, 3}.

[*** Convince yourself with a plot. ***]

Example 1.11. Is

A ∪ (B ∩ C) = (A ∪B) ∩C

true?


Solution: In general, this is not true. We can consider

A = {a, b, c, d}, B = {a, b, c, d}, C = {1, 2, 3}.

Then B ∩ C = ∅, so A ∪ (B ∩ C) = A, while (A ∪ B) ∩ C = ∅. In other words, a ∈ A but a ∉ C, so a appears in the set on the left hand side while it does not appear as an element of the set on the right hand side. However, sometimes the identity does hold. For instance, let's consider the example

A = {a, b, c, d}, B = {1, 2, 3}, C = {a, b, c, d}.

We have B ∩ C = ∅, so A ∪ (B ∩ C) = A, but now C = A, so (A ∪ B) ∩ C = A as well.

Example 1.12. Is

A ∩ (B ∩ C) = (A ∩B) ∩C

true?

Solution: Yes. Notice that both sets are formed with the elements that are, at the same time, in the three sets A, B, C.

Example 1.13. Is

A− (B ∩ C) = (A−B) ∪ (A− C)

true?

Solution: Yes. The key remark is that if b ∈ B and b ∈ A but b ∉ C, then b ∉ A − B but b ∈ A − C, so b ∈ (A − B) ∪ (A − C).

So, to make this argument rigorous, let's consider an arbitrary element x ∈ A − (B ∩ C). By definition, x ∈ A but x ∉ B ∩ C. This implies that x ∉ B or x ∉ C (or both). In particular, x ∈ A − C or x ∈ A − B [*** Convince yourself! ***]. We get A − (B ∩ C) ⊂ (A − B) ∪ (A − C).

Now, we have to prove the opposite inclusion. Let y ∈ (A − B) ∪ (A − C) be an element. In particular, y ∈ (A − B) or y ∈ (A − C). We obtain that y ∈ A and we can study two cases: either y ∈ B ∩ C or y ∈ (B ∩ C)c. In the first case we have that y ∉ (A − B) ∪ (A − C) and we arrive at a contradiction. So, y ∈ (B ∩ C)c, or, in other words, y ∉ B ∩ C. In this case we have y ∈ A − (B ∩ C) and we conclude the second inclusion A − (B ∩ C) ⊃ (A − B) ∪ (A − C). Both inclusions guarantee that the sets match.

Example 1.14. Prove De Morgan's laws:

(∪_{i} Ai)^c = ∩_{i} Ai^c,    (∩_{i} Ai)^c = ∪_{i} Ai^c.

Solution: Let's prove it by induction [5]. Let's consider first the case with only two sets A1, A2. Then, given y ∈ (A1 ∪ A2)^c, we have that y ∉ A1 and y ∉ A2 (otherwise, we arrive at a contradiction [*** Why?? ***]). Then y ∈ A1^c and y ∈ A2^c, so y ∈ A1^c ∩ A2^c and we get (A1 ∪ A2)^c ⊂ A1^c ∩ A2^c.

[5] "Why then is this view [the induction principle] imposed upon us with such an irresistible weight of evidence? It is because it is only the affirmation of the power of the mind which knows it can conceive of the indefinite repetition of the same act, when that act is once possible." Henri Poincaré


The opposite inclusion is obtained in a similar way. Let y ∈ A1^c ∩ A2^c. This means that y ∉ A1 and y ∉ A2. Consequently, y ∉ A1 ∪ A2 and (A1 ∪ A2)^c ⊃ A1^c ∩ A2^c.

Once we have proved the case with only two sets, we proceed by induction. Let's assume that (∪_{i=1}^{n} Ai)^c = ∩_{i=1}^{n} Ai^c is true and let's prove the case n + 1. We have

(A1 ∪ ... ∪ An ∪ An+1)^c = ((∪_{i=1}^{n} Ai) ∪ An+1)^c = (Ã1 ∪ Ã2)^c,

with Ã1 = A1 ∪ ... ∪ An and Ã2 = An+1. Then,

(Ã1 ∪ Ã2)^c = Ã1^c ∩ Ã2^c = (A1 ∪ ... ∪ An)^c ∩ An+1^c.

Using the induction hypothesis, we have

(Ã1 ∪ Ã2)^c = (A1 ∪ ... ∪ An)^c ∩ An+1^c = A1^c ∩ ... ∩ An^c ∩ An+1^c.

This concludes the proof of the first De Morgan law [*** Prove the second one by yourself! ***].
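Before (or after) proving identities like these, it can be reassuring to test them on concrete finite sets with Matlab's set functions union, intersect and setdiff. The check below (my own choice of sets, with complements taken inside a finite 'universe' U) tests Example 1.10 and the first De Morgan law with two sets.

% Concrete sets (encoded as vectors of integers).
A = [1 2 3 4];  B = [5 6 7];  C = [6 7 8];
U = 1:10;                                    % a finite universe for complements

% Distributive law: A u (B n C) = (A u B) n (A u C)
lhs = union(A, intersect(B, C));
rhs = intersect(union(A, B), union(A, C));
fprintf('distributive law holds: %d\n', isequal(lhs, rhs));

% First De Morgan law with two sets: (A u B)^c = A^c n B^c
lhs = setdiff(U, union(A, B));               % complement taken inside U
rhs = intersect(setdiff(U, A), setdiff(U, B));
fprintf('De Morgan law holds:    %d\n', isequal(lhs, rhs));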

In what follows we assume that we are realizing some sort of random experiment.

Definition 1.1. The set of all possible outcomes of the experiment is called the sample space. Typically, we denote it by Ω.

Definition 1.2. A collection F of subsets of Ω is called a σ-field if the following three conditions hold (all of them):

• ∅ ∈ F,

• if Ai ∈ F for i = 1, 2, ..., then B = ∪_{i=1}^{∞} Ai ∈ F,

• if A ∈ F then Ac ∈ F.

Definition 1.3. Given Ω and F, the elements of F are called events. In particular, every event is a subset of Ω; however, not every subset of Ω needs to be an event.

Notice that, given the set Ω, there is more than one σ-field associated with it. In particular, F = {∅, Ω} is a σ-field [*** Check this claim!! ***].

Example 1.15. Write the sample space corresponding to tossing a coin once. Then define a σ-algebra adapted to this sample space.

Solution: We have

Ω = {H, T}.

One σ-algebra may be defined as

F = {∅, {H, T}, {H}, {T}}.

Example 1.16. Write the sample space corresponding to throwing a (six-faced) die once. Then define a σ-algebra adapted to this sample space.

Solution: We have

Ω = {1, 2, 3, 4, 5, 6}.


One σ−algebra may be defined as

F = {∅, {1, 2, 3}, {4, 5, 6}, {1, 2, 3, 4, 5, 6}}. [*** Convince yourself!! ***]

At first sight, the condition on the countable union of sets (condition 2 in the definition of σ-algebra) may seem obscure. Let's use the same motivation as the book does (Example 4, page 3). We toss a coin until the first head appears and we are interested in the event 'first head after an even number of tails'. Notice that this event is a countable union of elements of Ω.

We understand the sample space and the events, but we do not yet know the precise definition of a probability measure or probability function, P(·). Let's motivate some of its properties.

Example 1.17. You toss a fair coin twice. Compute the following probabilities:

1. What is the probability that both coins show heads?

2. What is the probability that both coins show tails?

3. Compute the probability of having two tails or two heads.

4. Compute the probability of having two tails or two heads or one head and one tail (in any order).

5. What is the probability that two tails are not shown?

Solution: The possible cases are

Ω = {(H, H), (H, T), (T, H), (T, T)}.

The probability of two heads is 1/4. The probability of two tails is 1/4. The probability of two heads or, alternatively, two tails is the probability of having

{(H, H)} ∪ {(T, T)},

which is

P({(H, H)} ∪ {(T, T)}) = P({(H, H)}) + P({(T, T)}) = 1/4 + 1/4 = 1/2.

The probability of having two tails or two heads or one head and one tail (in any order) is the probability of

{(H, H)} ∪ {(T, T)} ∪ {(H, T)} ∪ {(T, H)}.

This is the sample space Ω, with probability P(Ω) = 1. Finally, the probability of not having two tails is the probability of

{(H, H)} ∪ {(H, T)} ∪ {(T, H)} = Ω − {(T, T)}.

This probability equals

P({(H, H)} ∪ {(H, T)} ∪ {(T, H)}) = P(Ω) − P({(T, T)}) = 1 − 1/4 = 3/4.

We proceed now with the definition:


Definition 1.4. A probability measure P on (Ω, F) is a function

P : F → [0, 1],

such that

• P(Ω) = 1,

• if the Ai are pairwise disjoint events, then

P(∪_{i=0}^{∞} Ai) = Σ_{i=0}^{∞} P(Ai).

Example 1.18. What is the probability of ∅?

Solution: For a given sample space, we have

Ω = Ω ∪ ∅,

with Ω ∩ ∅ = ∅. Thus, using the definition, we have

1 = P(Ω) = P(Ω ∪ ∅) = P(Ω) + P(∅) = 1 + P(∅).

Consequently, P(∅) = 0.

Example 1.19. In our example of the fair coin, what is the probability function?

Solution: We only have two possible outcomes, head or tail. Consequently, it is enough to consider the σ-algebra given by

{∅, {H, T}, {H}, {T}}.

It is enough to define the probability of the last two events:

P (Head) = P (Tail) = 0.5.

Example 1.20. In our example of the six-faced fair die, what is the probability function?

Solution: The good σ-algebra in this case is bigger (for instance, it allows questions like computing the probability of getting an even number). We define the probability as

P({1}) = P({2}) = P({3}) = P({4}) = P({5}) = P({6}) = 1/6.

Once we know these basic bricks, due to the properties of probability measures, we have that, for a given event A, its probability is

P(A) = (cardinality of A)/6.

This set A ∈ F is an arbitrary event. In particular, we can use the previous definition to compute the probability of getting an even number:

P(even number) = (number of even numbers from 1 to 6)/6 = 3/6 = 0.5.

[*** Compare this approach with the one in Example 1.4 ***]
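On a finite sample space, a probability function is just a vector of non-negative weights summing to one, and an event is a subset of outcomes; the following sketch (my own encoding of the fair-die example, not taken from the text) makes this concrete.

% Fair six-faced die: Omega = {1,...,6}, each outcome has weight 1/6.
Omega = 1:6;
p = ones(1, 6) / 6;            % probability of each elementary outcome

% An event is a subset of Omega; its probability is the sum of the weights.
even = [2 4 6];
P_even = sum(p(even));         % 3/6 = 0.5

fprintf('sum of all weights: %.4f\n', sum(p));   % must be 1 (axiom P(Omega) = 1)
fprintf('P(even number):     %.4f\n', P_even);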

Let’s write some consequences of our definition of probability function as our first lemma:


Lemma 1.1. We have

1. P(Ac) = 1 − P(A),

2. if A ⊂ B, then P(A) ≤ P(B),

3. P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Proof. 1. We use A ∪Ac = Ω, A ∩Ac = ∅ to write

1 = P (Ω) = P (A ∪Ac) = P (A) + P (Ac).

2. We have B = A ∪ (B − A) and A ∩ (B − A) = ∅. So

P (B) = P (A ∪ (B −A)) = P (A) + P (B −A) ≥ P (A).

3. We define the sets A − B, A ∩ B and B − A. These sets are pairwise disjoint:

(A ∩B) ∩ (A−B) = (A ∩B) ∩ (B −A) = (B −A) ∩ (A−B) = ∅.

We also have

A = (A ∩B) ∪ (A−B), B = (A ∩B) ∪ (B −A),

so

P (A) = P (A ∩B) + P (A−B), P (B) = P (A ∩B) + P (B −A).

Finally we have

A ∪B = (A ∩B) ∪ (A−B) ∪ (B −A),

or, in other words, an element of the union of two sets is either in both sets or in exactly one of them. So A ∪ B is formed by elements of A and B (at the same time), elements of A only (i.e. elements of A that are not elements of B) and elements of B only (i.e. elements of B that are not elements of A). Due to this, we have

P(A ∪ B) = P((A ∩ B) ∪ (A − B) ∪ (B − A))

= P(A ∩ B) + P(A − B) + P(B − A)

= P(A ∩ B) + P(A − B) + P(B − A) + P(A ∩ B) − P(A ∩ B)

= P(A) + P(B) − P(A ∩ B).

Lemma 1.2 (Boole’s inequalities). We have


1. P(∪_{i=1}^{n+1} Ai) ≤ Σ_{i=1}^{n+1} P(Ai),

2. P(∩_{i=1}^{n} Ai) ≥ 1 − Σ_{i=1}^{n} P(Ai^c).

Proof. 1. We prove it by induction. Using point 3 in Lemma 1.1, we have the case with two sets. The general case is then

P(∪_{i=1}^{n+1} Ai) ≤ P(∪_{i=1}^{n} Ai) + P(An+1) ≤ Σ_{i=1}^{n+1} P(Ai).

2. The second one is a consequence of the first one. We have

P(∩_{i=1}^{n} Ai) = 1 − P(∪_{i=1}^{n} Ai^c) ≥ 1 − Σ_{i=1}^{n} P(Ai^c).

Lemma 1.3 (Inclusion-exclusion principle). Given a family of events Ai, 1 ≤ i ≤ n, then

P(∪_{i=1}^{n} Ai) = Σ_{i=1}^{n} P(Ai) − Σ_{1≤i<j≤n} P(Ai ∩ Aj) + Σ_{1≤i<j<k≤n} P(Ai ∩ Aj ∩ Ak) + · · · + (−1)^{n−1} P(∩_{i=1}^{n} Ai).

Proof. As a warm-up, let's first do the case with three sets, i.e. n = 3. We have three sets A, B, C and we consider the sets formed by

• the elements in A that are NOT elements of B,C:

A ∩Bc ∩Cc,

• the elements in B that are NOT elements of A,C:

Ac ∩B ∩Cc,

• the elements in C that are NOT elements of A,B:

Ac ∩Bc ∩ C,

• the elements in A AND B that are NOT elements of C:

A ∩B ∩ Cc,


• the elements in A AND C that are NOT elements of B:

A ∩Bc ∩C,

• the elements in B AND C that are NOT elements of A:

Ac ∩B ∩C,

• the elements in A AND B AND C:

A ∩B ∩ C.

Notice that d ∈ A ∪ B ∪ C implies exactly one of the following alternatives:

1. d appears only in one of the three sets,

2. d appears only in two of the three sets,

3. d appears in every set.

The first case corresponds to the sets

A ∩Bc ∩ Cc, Ac ∩B ∩Cc, Ac ∩Bc ∩ C.

The second case corresponds to the sets

A ∩B ∩ Cc, A ∩Bc ∩ C, Ac ∩B ∩ C.

The third case corresponds to

A ∩B ∩C.

Consequently, as all these sets are disjoint, we have

P (A ∪B ∪ C) = P (A ∩Bc ∩Cc) + P (Ac ∩B ∩ Cc) + P (Ac ∩Bc ∩ C)

+P (A ∩B ∩Cc) + P (A ∩Bc ∩ C) + P (Ac ∩B ∩C)

+P (A ∩B ∩C).

Now notice that

P (A) = P (A ∩Bc ∩ Cc) + P (A ∩B ∩ Cc) + P (A ∩Bc ∩ C) + P (A ∩B ∩ C),

P (B) = P (Ac ∩B ∩ Cc) + P (A ∩B ∩ Cc) + P (Ac ∩B ∩ C) + P (A ∩B ∩ C),

P (C) = P (Ac ∩Bc ∩C) + P (A ∩Bc ∩ C) + P (Ac ∩B ∩ C) + P (A ∩B ∩ C).


We can complete the previous expression and we get

P(A ∪ B ∪ C) = P(A) + P(Ac ∩ B ∩ Cc) + P(Ac ∩ Bc ∩ C) + P(Ac ∩ B ∩ C)

= P(A) + P(B) + P(Ac ∩ Bc ∩ C) − P(A ∩ B ∩ Cc) − P(A ∩ B ∩ C)

= P(A) + P(B) + P(C) − P(A ∩ B ∩ Cc) − P(A ∩ B ∩ C)
  − P(A ∩ Bc ∩ C) − P(Ac ∩ B ∩ C) − P(A ∩ B ∩ C)

= P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ Bc ∩ C) − P(B ∩ C)

= P(A) + P(B) + P(C) − P(A ∩ B) − P(B ∩ C)
  − P(A ∩ Bc ∩ C) − P(A ∩ B ∩ C) + P(A ∩ B ∩ C)

= P(A) + P(B) + P(C) − P(A ∩ B) − P(B ∩ C) − P(A ∩ C) + P(A ∩ B ∩ C).

This concludes the case with 3 sets. To prove the general case, we assume that the sample space is formed by a finite number of elements, N, and we define the characteristic function of the set Ai as

fi(x) = 1 if x ∈ Ai and 0 otherwise.

We define

F(x) = Π_{i=1}^{n} (1 − fi(x)).

Notice that F(x) = 0 for every x ∈ ∪_{i=1}^{n} Ai. So

(1/N) Σ_{x∈Ω} F(x) = P(Ω − ∪_{i=1}^{n} Ai) = 1 − P(∪_{i=1}^{n} Ai).

We expand the product defining F(x) and we have

F(x) = Σ_{I∈ℐ} (−1)^{|I|} Π_{I} fi(x),

where I ranges over the set ℐ of n-tuples of ones and zeroes. By definition, the j-th coordinate of I = (I1, I2, ..., Ij, ..., In) records the contribution of the term (1 − fj(x)) with the following convention: if Ij = 0, the factor (1 − fj(x)) contributes with 1, and if Ij = 1, it contributes with fj(x). For a given I ∈ ℐ, we define

Π_{I} fi(x) = Π_{j : Ij=1} fj(x).

[*** To clarify our notation and what we are doing by expanding F(x) in this second way, let's assume that n = 2. Then we compute

F(x) = (1 − f1(x))(1 − f2(x)) = 1 − f2(x) − f1(x) + f1(x)f2(x),

so ℐ = {(0, 0), (0, 1), (1, 0), (1, 1)}.


The first component of the pair (I1, I2) indicates the contribution of (1 − f1(x)). If this term contributed with the 1, we write 0, while if the contribution is −f1(x), we write 1. In the same way, the second component of the pair (I1, I2) indicates the contribution of (1 − f2(x)): if this term contributed with the 1, we write 0, while if the contribution is −f2(x), we write 1. So, summarizing, we identify

1 = (0, 0), −f1(x) = (1, 0), −f2(x) = (0, 1), f1(x)f2(x) = (1, 1).

We also have

|I| = Σ_{j≤n} Ij.

***] [*** Can you write the case with n = 3? ***]

Notice that the term I = (0, 0, ..., 0) corresponds to the term identically 1 on the right hand side of the definition of F(x). Now we compute

(1/N) Σ_{x∈Ω} F(x) = Σ_{I∈ℐ} (−1)^{|I|} (1/N) Σ_{x∈Ω} Π_{I} fi(x).

Notice that

(1/N) Σ_{x∈Ω} Π_{I} fi(x) = P(∩_{i : Ii=1} Ai) if I ≠ (0, 0, ..., 0), and 1 otherwise.

Putting it all together, we have

1 − P(∪_{i=1}^{n} Ai) = 1 + Σ_{I∈ℐ, I≠(0,...,0)} (−1)^{|I|} P(∩_{i : Ii=1} Ai).

We are done with this proof. [*** Convince yourself! ***]

The proof of the general inclusion-exclusion principle can also be obtained by induction. As messy as it is, we only sketch how to approach it. Applying point 3 in Lemma 1.1, we have

P(∪_{i=1}^{n+1} Ai) = P((∪_{i=1}^{n} Ai) ∪ An+1) = P(∪_{i=1}^{n} Ai) + P(An+1) − P((∪_{i=1}^{n} Ai) ∩ An+1).

By the induction hypothesis, we have

P(∪_{i=1}^{n+1} Ai) = Σ_{i=1}^{n} P(Ai) − Σ_{1≤i<j≤n} P(Ai ∩ Aj) + Σ_{1≤i<j<k≤n} P(Ai ∩ Aj ∩ Ak) + · · ·
  + (−1)^{n−1} P(∩_{i=1}^{n} Ai) + P(An+1) − P((∪_{i=1}^{n} Ai) ∩ An+1),


and, by the induction hypothesis again,

P((∪_{i=1}^{n} Ai) ∩ An+1) = P(∪_{i=1}^{n} (Ai ∩ An+1))

= Σ_{i=1}^{n} P(Ai ∩ An+1) − Σ_{1≤i<j≤n} P((Ai ∩ An+1) ∩ (Aj ∩ An+1))
  + Σ_{1≤i<j<k≤n} P((Ai ∩ An+1) ∩ (Aj ∩ An+1) ∩ (Ak ∩ An+1)) + · · ·
  + (−1)^{n−1} P(∩_{i=1}^{n} (Ai ∩ An+1)).

Now one should consider the term with l intersections.
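For small concrete cases the inclusion-exclusion formula is easy to sanity-check numerically. Below, a sketch for one throw of a fair die with the events A = {even}, B = {1, 2, 3}, C = {3, 4} (my own choice of example); both sides of the three-set formula agree.

% One throw of a fair die; probability of an event = (#elements)/6.
P = @(E) numel(E) / 6;
A = [2 4 6];  B = [1 2 3];  C = [3 4];

lhs = P(union(union(A, B), C));
rhs = P(A) + P(B) + P(C) ...
      - P(intersect(A, B)) - P(intersect(A, C)) - P(intersect(B, C)) ...
      + P(intersect(intersect(A, B), C));

fprintf('P(A u B u C)        = %.4f\n', lhs);
fprintf('inclusion-exclusion = %.4f\n', rhs);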

Definition 1.5. Given a sample space Ω, a σ-algebra F and a probability function P, the triple (Ω, F, P) is called a probability space.

Definition 1.6. Given an event A, if P(A) = 1 we say that A occurs almost surely. If P(A) = 0, we say that A is a null event.

Example 1.21. Consider an increasing sequence of events A1 ⊂ A2 ⊂ A3 ⊂ .... Compute P(∪_{i=1}^{∞} Ai).

Solution: We have

P(∪_{i=1}^{∞} Ai) = lim_{n→∞} P(∪_{i=1}^{n} Ai) = lim_{n→∞} P(An).

(The first equality is the continuity of the probability measure; the second one holds because the sequence is increasing, so ∪_{i=1}^{n} Ai = An.)

Example 1.22. Consider a decreasing sequence of events A1 ⊃ A2 ⊃ A3 ⊃ .... Compute P(∩_{i=1}^{∞} Ai).

Solution: We have

P(∩_{i=1}^{∞} Ai) = lim_{n→∞} P(∩_{i=1}^{n} Ai) = lim_{n→∞} P(An).

Example 1.23 (Exercise 1.3.1). Let A, B be two events such that P(A) = 0.75 and P(B) = 1/3. Show that 1/12 ≤ P(A ∩ B) ≤ 1/3.

Solution: Notice that P(A ∩ B) ≠ 0 [*** Why? ***]. We have

1 ≥ P(A ∪ B) = P(A) + P(B) − P(A ∩ B),

so

P(A ∩ B) ≥ P(A) + P(B) − 1 = 1/12.

On the other hand, we have A ∩ B ⊂ A, B and

P(A ∩ B) ≤ min{P(B), P(A)} ≤ P(B) = 1/3.

[*** Can you give examples that show that both extremes are possible? ***] [*** Find corresponding bounds for A ∪ B. ***]

Example 1.24 (Exercise 1.3.2). A fair coin is tossed repeatedly. Show that with probability one a head eventually turns up.


Solution: We have

P(Not having a head in the first n tosses) = P(n tails) = 2^{−n}.

Now, due to the continuity of probability, we have

P(No head ever) = lim_{n→∞} 2^{−n} = 0.
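A quick simulation makes the statement concrete: estimate the probability of seeing no head in the first n tosses and compare it with 2^(−n); the estimate collapses to zero as n grows. This is only an illustrative sketch (sample sizes and names are my own).

% Estimate P(no head in the first n tosses) and compare with 2^(-n).
N = 200000;                          % simulated sequences per value of n
for n = [1 2 5 10 15]
    tosses  = rand(N, n) < 0.5;      % true = head, each with probability 1/2
    no_head = mean(~any(tosses, 2));
    fprintf('n = %2d: estimate %.6f, exact %.6f\n', n, no_head, 2^(-n));
end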

Example 1.25 (Exercise 1.3.5). Consider a collection of events Ai such that P(Ai) = 1 for every i. Show that P(∩_{i=1}^{∞} Ai) = 1.

Solution: We have

P(∩_{i=1}^{∞} Ai) = lim_{n→∞} P(∩_{i=1}^{n} Ai) = 1 − lim_{n→∞} P([∩_{i=1}^{n} Ai]^c) = 1 − lim_{n→∞} P(∪_{i=1}^{n} Ai^c) ≥ 1 − lim_{n→∞} Σ_{i=1}^{n} P(Ai^c) = 1,

since P(Ai^c) = 0 for every i.

Example 1.26 (Exercise 1.3.6). You are given that at least one of the events Ai, 1 ≤ i ≤ n, is certain to occur, but certainly no more than 2 occur. If P(Ai) = p and P(Ai ∩ Aj) = q for i ≠ j, show that p ≥ 1/n and q ≤ 2/n.

Solution: The fact that no more than two events occur implies, in particular, that P(Ai ∩ Aj ∩ Ak) = 0 for distinct i, j, k. We have

1 = P(∪_{i=1}^{n} Ai) = Σ_{i=1}^{n} P(Ai) − Σ_{i<j} P(Ai ∩ Aj) ≤ np,

and we conclude p ≥ 1/n. We have to count the number of terms appearing in Σ_{i<j} P(Ai ∩ Aj). To do that notice that, fixing i = 1, we have n − 1 options to choose j. Then, for i = 2, we have n − 2 possible choices of j, and so on. In this way we have to add

N = (n− 1) + (n− 2) + (n− 3) + ...+ 3 + 2 + 1.

We have [*** Why? ***]

N = n(n − 1)/2.

Now we have

1 = Σ_{i=1}^{n} P(Ai) − Σ_{i<j} P(Ai ∩ Aj) = np − [n(n − 1)/2] q,

so

n − 1 ≥ np − 1 = [n(n − 1)/2] q,

and we conclude q ≤ 2/n.


Figure 1.1: Random endpoint

1.6 Bertrand’s paradox

The notion of probability space seems simpler than it actually is. The tricky point is that the meaning of the word 'random' is encoded in the notion of probability space. We are going to show this with a historical example: Bertrand's paradox.

The statement is the following

'Consider an equilateral triangle inscribed in a circle. Suppose a chord of the circle is chosen at random. What is the probability that the chord is longer than a side of the triangle?'

Notice that the words 'at random', by themselves, have no practical implication on how to choose the chord. In particular, we are going to give three different ways of choosing the chords, all of them random, and we are going to obtain three different numbers.

1. The random endpoint method: We fix one of the vertices of the triangle. We choose a random point over the circle and form the chord between them (see Figure 1.1). We observe that the desired probability is the same as the probability of having the second point in the grey region of the circumference. In this way, we obtain that the desired probability is exactly 1/3 [*** Why? ***].

2. The random midpoint method (I): We fix a radius and we assume that the radius is perpendicular to one of the sides of the triangle. Now, we choose a random point over the radius and form the chord with this point as midpoint (see Figure 1.2). We notice that the desired probability is the same as the probability of having the point in the white region of the radius. In this way, we obtain that the desired probability is exactly 1/2 [*** Why? ***].

Figure 1.2: Random midpoint (I)

3. The random midpoint method (II): In the previous case, the random midpoint was chosen over one of the radii of the circle. Now, we randomly choose a point in the circle and we construct the chord with the chosen point as its midpoint (see Figure 1.3). We see that the desired probability is the same as the probability of having the point in the dark grey inner circle. In this way, we obtain that the desired probability is exactly 1/4 [*** Why? ***].

Ok, now we run into trouble. If everything is correct, we have three different (correct) answers to the same question. In general, the good thing about science is that it predicts the behaviour of nature. However, if, for the same question, we find three different answers, we cannot predict anything.

The answer to this strange situation relies on the notion of probability space. In fact, each answer assumes a different probability space when computing the probability; consequently, the probabilities are different. Let's notice that different probability spaces imply different problems, in the sense that this notion gives meaning to randomness in the particular problem that we are addressing.

The probability spaces in each solution are

1. Here we are choosing a point over the circumference. Consequently,

Ω = {points over the circumference} = (0, 2π)

[*** Why? ***]. The σ-algebra is rather tricky, so let's just say the intervals (a, b), 0 < a < b < 2π, generate it somehow [*** Even if it is not rigorous or precise, you can intuitively identify the σ-algebra with the previous intervals. ***]. Finally, the probability function is defined through

P((a, b)) = (b − a)/(2π).


Figure 1.3: Random midpoint (II)

[*** Check that the previous function is a valid probability function. ***]

2. Here we are choosing a point over a given radius (with total length equal to 1). Thus, Ω = {points over the radius} = (0, 1) [*** Why? ***]. The σ-algebra is again rather tricky. As before, let's just say the intervals (0, b), 0 < b < 1, generate it. Finally, the probability function is defined through

P ((0, b)) = b.

[*** Check that the previous function is a valid probability function. ***]

3. Here we are choosing a point in the circle (with radius one). So, Ω = {points in the circle} = {(x, y) with x² + y² ≤ 1} [*** Why? ***]. In this case, the σ-algebra is generated by the rectangles (a, b) × (c, d) with a² + b², c² + d² ≤ 1. Finally, the probability function is defined through

P((a, b) × (c, d)) = Area((a, b) × (c, d))/π = ((b − a) · (d − c))/π.

[*** Check that the previous function is a valid probability function. ***]

[*** Can you recover the previous values for the desired probability using the definition of the probability function? ***]
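The three answers can also be reproduced numerically: each method below draws chords 'at random' according to its own probability space (unit circle, so the triangle side is sqrt(3)) and estimates the probability that the chord is longer than the side. An illustrative Matlab sketch, not part of the original text:

% Monte Carlo version of Bertrand's paradox: unit circle, triangle side sqrt(3).
N = 200000;  side = sqrt(3);

% 1) Random endpoints: two uniform angles on the circumference.
t1 = 2*pi*rand(N,1);  t2 = 2*pi*rand(N,1);
len1 = 2*abs(sin((t1 - t2)/2));

% 2) Random midpoint on a fixed radius: distance to the centre uniform on (0,1).
d = rand(N,1);
len2 = 2*sqrt(1 - d.^2);

% 3) Random midpoint in the disc: uniform point inside the circle (via rejection).
x = 2*rand(4*N,1) - 1;  y = 2*rand(4*N,1) - 1;
in = x.^2 + y.^2 <= 1;  r = sqrt(x(in).^2 + y(in).^2);  r = r(1:N);
len3 = 2*sqrt(1 - r.^2);

fprintf('random endpoint:         %.3f (exact 1/3)\n', mean(len1 > side));
fprintf('random radial midpoint:  %.3f (exact 1/2)\n', mean(len2 > side));
fprintf('random midpoint in disc: %.3f (exact 1/4)\n', mean(len3 > side));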


1.7 Conditional probability

The amount of available information is very important when dealing with the probability of some event. In particular, in Example 1.6 we observe how the information 'at least one is a boy' changes the probability from 1/4 (without the information) to 1/3 (with the information) [*** Compare also with Example 1.5 ***]. To deal with the extra information provided, we introduce the concept of conditional probability.

Definition 1.7. The conditional probability of an event A given another event B is

P(A|B) = P(A ∩ B)/P(B).

[*** In this definition, B is assumed to occur. In particular P(B) > 0. ***] Let us explain the definition using different words. Let's assume (Ω, F, P) to be a given probability space. As B is given (and this means that B is going to occur), our original sample space is going to change from Ω to Ωnew = B ⊂ Ω. Then, we can ask questions on the probability of events A ⊂ B. Notice that if we use our original probability function, P, to compute the probability of A we are not computing it correctly. This is due to the fact that P(Ωnew) = P(B) ≠ 1, i.e. our 'new' probability function has not been normalized. To fix this, we introduce the term P(B) in the denominator. Consequently, we get a new probability function

Pnew(A) = P(A ∩ B)/P(B).

The value of this new probability function is the conditional probability defined above.

Example 1.27. You are throwing a six-faced die. The die shows an even number. What is the probability that a '2' turns up?

Solution: As we have discussed before, our sample space is

Ω = {1, 2, 3, 4, 5, 6}.

However, as we know that the die shows an even number, some of these events are null events. In particular, after discarding the null events, our 'new' sample space is

Ωnew = Ω ∩ {even} = {2, 4, 6}.

Then, the probability of having exactly '2' is 1/3.

Example 1.28. You are throwing a six-faced die. The die shows an even number. What is the probability that either '2' or '1' turns up?

Solution: Our original sample space is

Ω = {1, 2, 3, 4, 5, 6}.

As before, some of these events are null events. In particular, as '1' is odd, the event A = {1, 2} has the same conditional probability as Ã = {2}. This example shows that even if the event A is 'bigger' than Ã as a subset of the original sample space Ω, once we use conditional probability, our intuition should change.


Example 1.29. You are tossing two fair coins. You know that they show at least one head. What is the probability of both heads?

Solution: The probability of 'at least one head' is 3/4 [*** why? ***] and the probability of both heads is 1/4. Consequently,

P('both heads' | 'at least one head') = P('both heads' ∩ 'at least one head') / P('at least one head') = (1/4)/(3/4) = 1/3.
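The same conditioning can be checked by simulation: among simulated pairs of tosses showing at least one head, roughly one third show two heads. A small sketch (my own illustration):

% Estimate P(both heads | at least one head) by simulation.
N = 200000;
tosses   = rand(N, 2) < 0.5;            % true = head
at_least = any(tosses, 2);              % condition: at least one head
both     = all(tosses, 2);              % event: both heads
estimate = sum(both & at_least) / sum(at_least);
fprintf('P(both heads | at least one head) ~ %.3f (exact 1/3)\n', estimate);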

Example 1.30. You are throwing two six-faced dice. What is the probability that the total sum is an even number?

Solution: Let's write the event

A = {the sum is even}.

The possible values of the sum are 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, but these values are not equally likely. Counting over the 36 equally likely ordered outcomes of the two dice (18 of them give an even sum), we conclude that the desired probability is P(A) = 1/2. Now let's compute it differently. Notice that an even number can be obtained either as the sum of two even numbers or as the sum of two odd numbers. We consider the events

Ae = {sum of two even numbers},   Ao = {sum of two odd numbers}.

The desired event can be written as A = Ae ∪ Ao. Furthermore, Ae ∩ Ao = ∅. Thus, we know (see Lemma 1.1) that

P(A) = P(A ∩ Ae) + P(A ∩ Ao) = (1/2)(1/2) + (1/2)(1/2) = 1/2.

Now notice that

P (A) = P (A ∩Ae) + P (A ∩Ao) = P (A|Ae)P (Ae) + P (A|Ao)P (Ao).

This formula will be helpful.

Lemma 1.4 (Bayes's formula (v1)). Given an event B, we have

P (A) = P (A|B)P (B) + P (A|Bc)P (Bc).

Furthermore, given a finite collection of disjoint events Bi such that ∪i Bi = Ω [*** (such a collection is called a partition of Ω) ***], we have

P(A) = Σ_{i=1}^{n} P(A|Bi) P(Bi).


Proof. Let's prove the first formula first. We know that

A = (A ∩B) ∪ (A ∩Bc).

and (A ∩B) ∩ (A ∩Bc) = ∅ [*** why? ***]. Thus, using Lemma 1.1, we get

P (A) = P (A ∩B) + P (A ∩Bc) = P (A|B)P (B) + P (A|Bc)P (Bc).

[*** Complete this proof!! ***].

Notice that the Bi play the role of hypotheses. The idea of this formula is to reduce the problem to a finite number of probabilities that are easy to compute. In this way, the goal is to obtain a partition Bi such that the P(Bi) are somehow easy to compute (or a priori known) and such that, given Bi, we can easily obtain P(A ∩ Bi).

There is another interesting formula here. Notice that

P(A|B) = P(A ∩ B)/P(B),   P(B|A) = P(A ∩ B)/P(A).

Consequently, [***

P(A|B) P(B)/P(A) = P(B|A).

This formula is (also) called Bayes's formula (see Exercise 1.8.14). ***] Let's state a related formula as a lemma:

Lemma 1.5 (Bayes's formula (v2)). With the hypotheses of Lemma 1.4, we have

P(Bi|A) = P(A ∩ Bi)/P(A) = P(A|Bi) P(Bi) / Σ_{j} P(A|Bj) P(Bj).

Example 1.31 (Exercise 1.4.5). Solve the problem presented in Section 1.4 using conditional probability.

Solution: We have the following events:

D1 = {prize behind door 1}, D2 = {prize behind door 2}, D3 = {prize behind door 3}.

These three events form a partition. We have P(Di) = 1/3 (all the doors are equally likely). Let's assume that you initially choose door 1 and the host opens door 2 (we are not losing generality here). Notice that we have

P(host opens 2) = P(host opens 2|D1) P(D1) + P(host opens 2|D2) P(D2) + P(host opens 2|D3) P(D3)
               = (1/2)(1/3) + 0 + 1 · (1/3)
               = 1/2.


[*** Why? ***] Then

P(D1|host opens 2) = P(host opens 2|D1) P(D1) / P(host opens 2) = (1/2 · 1/3)/(1/2) = 1/3,

and

P(D3|host opens 2) = P(host opens 2|D3) P(D3) / P(host opens 2) = (1 · 1/3)/(1/2) = 2/3.

We conclude that we should switch the door.

Example 1.32 (Example 1.7.14). The probability of being infected with some virus is 0.001. Assume that the test is 99% accurate and you test positive :-(. How likely is it that you actually have the virus?

Solution: Let's write V for the event 'you have the virus'. We have P(V) = 0.001, P(positive|V) = 0.99 and P(positive|Vc) = 0.01. We compute

P(V|positive) = P(positive|V) P(V) / P(positive) = 0.99 · 0.001 / P(positive).

To obtain P (positive), we use that V and V c form a partition, so

P (positive) = P (positive|V )P (V ) + P (positive|V c)P (V c)

= P (positive|V )P (V ) + P (positive|V c)(1− P (V ))

= 0.99 · 0.001 + 0.01(1 − 0.001).
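In the spirit of the Matlab section at the end of this chapter, the computation can be carried out directly (a sketch; the variable names are mine and the numbers are the ones assumed above):

pV   = 0.001;                          % P(V), prevalence of the virus
sens = 0.99;                           % P(positive | V)
fpos = 0.01;                           % P(positive | V^c)
ppos = sens*pV + fpos*(1 - pV);        % P(positive), via the partition {V, V^c}
pV_given_pos = sens*pV / ppos          % Bayes's formula, approximately 0.09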

Plugging this information into the previous formula, we obtain P(V|positive) = 0.00099/0.01098 ≈ 0.09: even after testing positive, the probability of actually having the virus is only about 9%.

Example 1.33 (Exercise 1.4.7). Given n urns, of which the r-th urn contains r − 1 red balls and n − r magenta balls. You pick one urn at random and remove two balls at random without replacement. Find the probability that the second ball is magenta.

Solution: The events

Ai = you choose the i-th urn

form a partition of the sample space. We are going to use Bayes's formula (v1) with them as a partition. We also define the events

M1 = {1st ball is magenta}, R1 = {1st ball is red}, M2 = {2nd ball is magenta}.

We have

P(M2) = Σ_{i=1}^{n} P(M2|Ai)P(Ai).

As every urn has the same probability, P(Ai) = 1/n. Now we are going to use R1 and M1 as a partition:

P (M2|Ai) = P (M2|Ai, R1)P (R1|Ai) + P (M2|Ai,M1)P (M1|Ai).


Let’s consider the first urn. As the first urn has n − 1 magenta balls and 0 red balls, wehave

P (M2|A1, R1) = 0, P (R1|A1) = 0, P (M2|A1,M1) = 1, P (M1|A1) = 1,

soP (M2|A1) = 0 · 0 + 1 · 1.

For the second urn, the number of magenta balls is n − 2 and there is also one red ball. The probabilities now are

P(R1|A2) = 1/(n − 1), P(M1|A2) = (n − 2)/(n − 1).

Now we have to take into account that, after we extract the first ball, the total number of balls in the urn is n − 2:

P(M2|A2, R1) = (n − 2)/(n − 2) = 1, P(M2|A2, M1) = (n − 3)/(n − 2),

thus

P(M2|A2) = 1 · 1/(n − 1) + (n − 3)/(n − 2) · (n − 2)/(n − 1).

Let’s consider now the general case i ≥ 2. As the i-th urn has i − 1 red balls and n − imagenta balls (in total there are n− 1 balls in the urn) ,

P (R1|Ai) =i− 1

n− 1, P (M1|Ai) =

n− i

n− 1.

So, after extracting the first ball, the total number of balls in the urn is n− 2 and we get

P (M2|Ai, R1) =n− i

n− 2, P (M2|Ai,M1) =

n− i− 1

n− 2.

We obtain

P(M2|Ai) = (n − i)/(n − 2) · (i − 1)/(n − 1) + (n − i − 1)/(n − 2) · (n − i)/(n − 1) = (n − i)/(n − 1),

since (i − 1) + (n − i − 1) = n − 2. (Notice that P(M2|Ai) = P(M1|Ai), as one would expect by symmetry, and that this formula also covers the case i = 1.) Putting it all together, we have

P(M2) = Σ_{i=1}^{n} (n − i)/(n − 1) · 1/n = (1/(n(n − 1))) Σ_{i=1}^{n} (n − i) = (1/(n(n − 1))) · n(n − 1)/2 = 1/2.
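A quick simulation supports the answer 1/2 (a sketch; the choice n = 10 and the encoding of the balls are only illustrative):

n = 10; N = 1e5; magenta2 = 0;
for k = 1:N
    r = ceil(n*rand);                          % pick an urn uniformly at random
    urn = [ones(1, r-1), zeros(1, n-r)];       % 1 = red ball, 0 = magenta ball
    idx = randperm(n-1);                       % draw the balls in random order
    magenta2 = magenta2 + (urn(idx(2)) == 0);  % is the second ball magenta?
end
magenta2/N                                     % close to 1/2 for any n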

1.8 Evaluation of forensic evidence: the Dreyfus case

In a March 26, 2013 New York Times opinion piece about the Amanda Knox trial you can read


The challenge is to make sure that the math behind the legal reasoning is fun-damentally sound. Good math can help reveal the truth. But in inexperiencedhands, math can become a weapon that impedes justice and destroys innocentlives.

First, some notation: let us write Hd for the hypothesis that the accused is innocent (the defense hypothesis), Hp for the hypothesis that he is guilty (the prosecution hypothesis), and E for the evidence. We are interested in the probability P(Hp|E), i.e. the probability of being guilty given the evidence.

Let us review a very interesting historic affair. We quote the book Statistics and theevaluation of evidence for forensic scientists:

Dreyfus, an officer in the French Army assigned to the War Ministry, was accused in 1894 of selling military secrets to the German military attache. Part of the evidence against Dreyfus centred on a document called the bordereau, admitted to have been written by him, and said by his enemies to contain cipher messages. This assertion was made because of the examination of the position of words in the bordereau. In fact, after reconstructing the bordereau and tracing on it with 4 mm interval vertical lines, Alphonse Bertillon showed that four pairs of polysyllabic words (among 26 pairs) had the same relative position with respect to the grid. Then, with reference to probability theory, Bertillon stated that the coincidences described could not be attributed to normal handwriting. Therefore, the bordereau was a forged document. Bertillon submitted probability calculations to support his conclusion. His statistical argument can be expressed as follows: if the probability for one coincidence equals 0.2, then the probability of observing N coincidences is 0.2^N. Bertillon calculated that the four coincidences observed by him had, then, a probability of 0.2^4, or 1/625, a value that was so small as to demonstrate that the bordereau was a forgery (Charpentier, 1933). However, this value of 0.2 was chosen purely for illustration and had no evidential foundation; for a comment on this point, see Darboux et al. (1908). Bertillon's deposition included not only this simple calculation but also an extensive argument to identify Dreyfus as the author of the bordereau on the basis of other measurements and a complex construction of hypotheses. (For an extensive description of the case, see the literature quoted in Taroni et al., 1998, p. 189.)

The story ends badly for Dreyfus, who was in jail for several years until he was finally released due, in part, to the pressure exerted by intellectuals like Emile Zola.

Then, it seems that Bertillon is computing

P(Hd|E) = P(E ∩ Hd)/P(E) = 0.2^4,

so

P(Hp|E) = P(E ∩ Hp)/P(E) = 1 − 0.2^4.

However, what Bertillon is really computing is

P(E|Hd).


Figure 1.4: J’accuse by Emile Zola

This is called the prosecutor's fallacy. [*** Notice that, using Bayes's formula, we have

P(A|B) = P(A ∩ B)/P(B) = P(B|A)P(A)/P(B),

so

P(A|B) = P(B|A) ⇒ P(B)/P(A) = 1.

Consequently, we can NOT exchange P(A|B) with P(B|A) in general (see Exercise 1.4.6). ***]

Memento 1.2. Notice that Hd = Hp^c. Consequently,

P(E) = P(E ∩ Hd) + P(E ∩ Hp),

thus

1 = P(E ∩ Hd)/P(E) + P(E ∩ Hp)/P(E) = P(Hd|E) + P(Hp|E).

This part of the argument was correct!

This argument is the same as the following

There is a 10% chance that the defendant would have the crime blood type ifhe were innocent. Thus there is a 90% chance that he is guilty.

[*** Can you see it here? ***]


1.9 Evaluation of forensic evidence: the O.J. Simpson case

Nicole Brown was murdered at her home in Los Angeles on the night of June 12, 1994. The suspect was her husband, O.J. Simpson. This is a very famous murder trial. The fact that the murder suspect had previously physically abused his wife played an important role in the trial. The famous defense lawyer Alan Dershowitz, a member of the team of lawyers defending the accused, claimed that this fact is not probative. He commented on Larry King's television program:

”Nobody dismisses spousal abuse. It’s a serious and horrible crime in thiscountry. The statistics demonstrate that only one-tenth of one percent ofmen who abuse their wives go on to murder them. And therefore, it’s veryimportant for that fact to be put into empirical perspective, and for the jurynot to be led to believe that a single instance, or two instances of alleged abusenecessarily means that the person then killed. That just doesn’t follow.”

His argument relies on the statistic that only 0.1% of the men who physically abuse their wives actually end up murdering them.

[*** If you were one of the jurors, do you think that Mr. Dershowitz isright? ***]

We are going to argue as in J. F. Merz and J. C. Caulkins, Chance (Vol. 8, 1995, pg. 14)(see also the lecture notes by Prof. Gravner).

Let’s write

N = husband end up killing his wife,

Hp = woman murdered by her husband,

Hd = woman murdered by someone else,

E = woman was abused by her husband.

I = woman was murdered.

Notice that, if we assume that the woman was already murdered, Hcp = Hd.

[*** Before continuing to read, can you express the probability that Mr. Dershowitz used? ***]

Mr. Dershowitz used

P(N|E) = P(abuse leads to murder) = 0.001 (i.e. 0.1%).

In particular, the sample space here is formed by the women who are abused by their husbands, regardless of whether they have been killed or not.

We want to compute

P (Hp|E, I),

i.e. the probability that a woman was killed by her husband given that the woman has been killed and the husband abused her. In other words, the sample space here is formed


by the women who have been murdered. In what follows we drop I from the probability and we consider the sample space formed by murdered women.

Let’s drop I from our notation. Let’s use real statistics: in 1992, 4.936 women weremurdered. Within this total amount, 1.430 were killed by their husbands. Consequently,we can estimate

P (Hp) =1430

4936= 0.28971.

The common statistics is that 5% of women are abused by their husbands, so, let’s assumethat to be true. We have P (Hd) = 1 − P (Hp) = 0.71029. Finally, we have to estimateP (E|Hp). In other words, we have to estimate the number of husband who kill their wivesand also abuse them firstly. Mr. Dershowitz claimed that this probability is P (E|Hp) =0.5, so, again, let’s assume that to be true. Then we end up with

P(Hp|E) = P(E|Hp)P(Hp)/P(E) = P(E|Hp)·0.28971 / (P(E|Hp)P(Hp) + P(E|Hd)P(Hd)),

P(Hp|E) = P(E|Hp)·0.28971 / (P(E|Hp)·0.28971 + 0.05·0.71029),

P(Hp|E) = 0.5·0.28971 / (0.5·0.28971 + 0.05·0.71029) = 0.8031.

This number is much larger than the original figure of 0.1% (0.001).

1.10 Independence

We have seen that the information provided plays an important role in the probability of some event. For instance, if we know 'a priori' that the coin is two-headed, then our initial probability P(H) = 0.5 changes. However, one might think that the information is not always helpful; in other words, that the probability is somehow independent of the information provided.

Definition 1.8. Given two events A, B, they are independent when

P (A ∩B) = P (A)P (B).

In general, a given family of events is independent if

P(∩_i Ai) = ∏_i P(Ai).

Example 1.34. You roll two dice at the same time. Show that the outcomes of the two dice are independent.

Solution: We write A for the outcome of the first die and B for the outcome of the second die. Then, for x, y = 1, ..., 6,

P(A = x ∩ B = y) = 1/36 = (1/6)·(1/6) = P(A = x)P(B = y).


Definition 1.9. Given three events A, B, C we say that A and B are conditionally independent given C when

P(A ∩ B|C) = P(A|C)P(B|C).

In general, a given family of events is conditionally independent given C if

P(∩_i Ai | C) = ∏_i P(Ai | C).

Example 1.35 (Exercise 1.5.7). Jane has three children, each of whom is equally likely to be a boy or a girl independently of the others. Define the events:

A = {all the children are of the same sex},

B = {there is at most one boy},

C = {the family includes a boy and a girl}.

1. Show that A is independent of B and that B is independent of C.

2. Is A independent of C?

3. Do these results hold if boys and girls are not equally likely?

4. Do these results hold if Jane has four children?

Solution:

1. There are 2^3 = 8 possible outcomes. We have P(A) = 2/8 = 1/4, P(B) = 4/8 and

P(A ∩ B) = 1/8 = (1/4)·(4/8) = P(A)P(B).

[*** Compute the other case. ***]

2. No. P (A ∩ C) = 0 but both events have positive probability.

3. In general no.[*** Why? ***]

4. No. There are 2^4 = 16 possible outcomes. We have P(A) = 2/16 = 1/8, P(B) = 5/16 and

P(A ∩ B) = 1/16 ≠ (1/8)·(5/16) = P(A)P(B).
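Since there are only 2^3 equally likely outcomes with three children, part 1 can also be checked by brute-force enumeration; a sketch (boys are encoded as 1 and girls as 0, an encoding of my own):

% the rows of 'kids' are the 8 equally likely outcomes
kids = dec2bin(0:7) - '0';
A = all(kids == 1, 2) | all(kids == 0, 2);   % all children of the same sex
B = sum(kids, 2) <= 1;                       % at most one boy
PA = mean(A); PB = mean(B); PAB = mean(A & B);
[PA*PB, PAB]                                 % both equal 1/8, so A and B are independent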

Example 1.36 (Exercise 1.5.8). You flip three fair coins. At least two are alike. Thethird has probability 0.5 of head and probability 0.5 of tail. Therefore P (all alike) = 0.5.Do you agree?

Solution: No.

P(all alike) = P(H)·P(the other two coins show two heads) + P(T)·P(the other two coins show two tails) = (1/2)·(1/4) + (1/2)·(1/4) = 1/4.


1.11 Evaluation of forensic evidence (revisited)

Let’s quote the book Statistics and the evaluation of evidence for forensic scientists foranother interesting real application:

A further example of the fallacy occurred in a case which has achieved a certainnotoriety in the probabilistic legal literature, namely that of People v. Collins(Kingston, 1965a, b, 1966; Fairley and Mosteller, 1974, 1977). In this case,probability values, for which there was no objective justification, were quotedin court. Briefly, the crime was as follows. An old lady, Juanita Brooks, waspushed to the ground in an alleyway in the San Pedro area of Los Angelesby someone whom she neither saw nor heard. According to Mrs Brooks, ablond-haired woman wearing dark clothing grabbed her purse and ran away.John Bass, who lived at the end of the alley, heard the commotion and saw ablond- haired woman wearing dark clothing run from the scene. He also no-ticed that the woman had a ponytail and that she entered a yellow car drivenby a black man who had a beard and a moustache (Koehler, 1997a). (...)Theprosecutor called as a witness an instructor of mathematics at a state collegein an attempt to bolster the identifications. This witness testified to the prod-uct rule for multiplying together the probabilities of independent events (...)The instructor of mathematics then applied this rule to the characteristics astestified to by the other witnesses. Values to be used for the probabilities ofthe individual characteristics were suggested by the prosecutor without anyjustification, a procedure which would not now go unchallenged. The jurorswere invited to choose their own values but, naturally, there is no record ofwhether they did or not. (...) Using the product rule for independent char-acteristics, the prosecutor calculated the probability that a couple selected atrandom from a population would exhibit all these characteristics as 1 in 12million.

The accused were found guilty. This verdict was finally overturned.

There is a very straightforward error. The prosecutor and his team claimed independence of events which are not a priori independent. For instance, the events 'having a beard' and 'having a moustache' hardly seem independent. Hence, the probability that they computed is

P (E|Hd),

which, as we saw before, is not in general the same as

P (Hd|E).

Let’s see another sad example of misuse of a probabilistic argument:

Lest it be thought that matters have improved, consider the case of R. v. Clark.Sally Clark’s first child, Christopher, died unexpectedly at the age of about 3months when Clark was the only other person in the house. The death wasinitially treated as a case of sudden infant death syndrome (SIDS). Her second


child, Harry, was born the following year. He died in similar circumstances.Sally Clark was arrested and charged with murdering both her children. Attrial, a professor of paediatrics quoted from a report (Fleming et al., 2000)that, in a family like the Clarks, the probability that two babies would bothdie of SIDS was around 1 in 73 million. This was based on a study whichestimated the probability of a single SIDS death in such a family as 1 in 8500,and then squared this to obtain the probability of two deaths, a mathematicaloperation which assumed the two deaths were independent. In an open letterto the Lord Chancellor, copied to the President of the Law Society of Englandand Wales, the Royal Statistical Society expressed its concern at the statisticalerrors which can occur in the courts, with particular reference to Clark. Toquote from the letter:

One focus of the public attention was the statistical evidence given by a medicalexpert witness, who drew on a published study [Confidential Enquiry intoStillbirths and Deaths in Infancy] to obtain an estimate [1 in 8543] of thefrequency of sudden infant death syndrome (SIDS, or ‘cot death’) in familieshaving some of the characteristics of the defendant’s family. The witness wenton to square this estimate to obtain a value of 1 in 73 million for the frequencyof two cases of SIDS in such a family. Some press reports at the time statedthat this was the chance that the deaths of Sally Clark’s two children wereaccidental. This (mis-)interpretation is a serious error of logic known as theProsecutor’s Fallacy.

1.12 Matlab Code

One step forward 1.1. Here is a code to simulate N different outcomes of a six-faced die

N=1;                      % choose N, the number of rolls to simulate
D=floor(6*rand(1,N))+1;   % row vector of outcomes, each uniform on {1,...,6}

Here is a code to simulate the random endpoint method in Bertrand's paradox:

theta=-pi*rand(1);      % random angle in (-pi,0), in radians (tan expects radians)
m=tan(theta);           % slope of the chord through the fixed endpoint (0,1)

%Y=m*X+1; this is the equation for the straight line

plot(cos([0:0.01:2*pi]),sin([0:0.01:2*pi])); hold on   % the unit circle

x=[-1:0.01:1];

plot(x,m*x+1,'r')       % the random chord (drawn as a full line)

1.13 Conclusion

There are a bunch of basic things that you can not forget:

1. Think carefully what is the probability space (sample space, probability functionand σ−algebra).


2. In general, you may think of the probability of an event as the size or the measure of that set. If the set is discrete, you can also think of the probability as the number of elements in that set over the total number of elements in the sample space. In other words, if the set is discrete, the probability 'reduces to counting', roughly speaking.

3. If you can not compute a desired probability, you may want to look for a good partition and make good use of conditional probability.

4. Remember, in general, P(A|B) ≠ P(B|A)!!

5. Do not use independence of events without careful reasoning. Sometimes it is not obvious whether two events are independent (see Example 1.35).


Chapter 2

Random variables

2.1 Random variables

So far, we have seen the basic concepts in probability. In particular, we have seen the definitions of sample space, σ−algebra, and probability measure [*** check those definitions! ***]. However, in general, the sample space is not observable. This means that in most real applications, we can not figure out every possible event.

For instance, you may think of the evolution of the price of some asset in the stock market. You see how this price evolves, and you can guess how likely it is that the price rises or falls. However, the sample space (which you may think of as formed by every possible event that alters the price of the material, the transport, the workers' wages...) is so complicated that you can not really know what it is.

Alternatively, you may also think of the steam in a hot pot. Every molecule of steam is moving around, spinning and hitting the others and, consequently, it is very complicated to know the sample space (which you may think of as formed by every possible position, velocity, collision between different molecules and so on).

The common point in all these situations is that, even if we don't know the sample space, we can see the effects of the random outcome.

Example 2.1. A worker in a casino is tossing a coin. If the outcome is heads, we have to pay the casino 1 dollar, while if the outcome is tails, the casino pays us 0.5 dollars.

Solution: We have

Ω = {H, T}, P(H) = 0.5, P(T) = 0.5.

Now define X(ω), ω ∈ Ω (this means ω = H or ω = T ) as

X(H) = −1, X(T ) = 0.5.

This function

X : Ω → R,

is called a random variable [*** Yes, even though it is a function, it is called a random


variable. It is arguably the worst possible name for this object :-). ***] and it reflects the outcome of the random experiment.

[*** Notice that, in general, what we can see in reality is X(ω), not ω itself! ***] It is important to observe that the original probability measure P(ω) and the random variable X(ω) induce a probability with sample space R. For instance, in the previous example, we can ask the question

Compute P({ω ∈ Ω : X(ω) = 0.5}) = P(T) = f(0.5) = 0.5.

Even if this is very intuitive, the appropriate concept is

F(x) = Probability({ω ∈ Ω : X(ω) ≤ x}).

So we obtain the following definitions:

Definition 2.1. Let (Ω, F, P) be a probability space. Then, a function

X : Ω → R

such that, for each x ∈ R, Ax = {ω ∈ Ω : X(ω) ≤ x} ∈ F is called a random variable.

Definition 2.2. Given X a random variable, we define its distribution function

F : R → [0, 1]

as

F(x) = P({ω ∈ Ω : X(ω) ≤ x}).

For the sake of a better notation, we write

P(X ≤ x) = P({ω ∈ Ω : X(ω) ≤ x}).

[*** If g is a sufficiently smooth function, Y = g(X) is a random variable. ***]

Example 2.2. Let's assume that we toss two coins and we receive 1 dollar for every head appearing. Compute the distribution function in this example.

Solution: We have

F(x) =
0 if x < 0,
0.25 if 0 ≤ x < 1,
0.75 if 1 ≤ x < 2,
1 if 2 ≤ x.

Let's explain these numbers. If X denotes the number of heads,

F(x) = P(number of heads ≤ x),

consequently, if x ∈ [0, 1),

F(x) = P({TT}) = 0.25.

If x ∈ [1, 2),

F(x) = P({TT} ∪ {TH} ∪ {HT}) = 0.75.


If x ∈ [2,∞),

F (x) = P (Ω) = 1.
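The staircase shape of F can be seen by simulating the experiment and plotting the empirical distribution function; a sketch in the Matlab style of Section 1.12 (the sample size and grid are illustrative choices of mine):

N = 1e4;
X = (rand(1,N) < 0.5) + (rand(1,N) < 0.5);   % number of heads in two fair coins
x = -1:0.01:3;
F = arrayfun(@(t) mean(X <= t), x);          % empirical F(x) = fraction of X <= x
plot(x, F)                                   % staircase with jumps at 0, 1 and 2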

There are a bunch of interesting consequences of the definitions:

Lemma 2.1. Let F(x) be the distribution function of the random variable X. Then

• lim_{y→∞} F(y) = 1,

• lim_{y→−∞} F(y) = 0,

• if x < y, then F(x) ≤ F(y),

• F is right-continuous:

lim_{h→0+} F(y + h) = F(y).

Proof. Parts 1) and 2) rely on the fact that

lim_{y→∞} {ω ∈ Ω : X(ω) ≤ y} = Ω, and lim_{y→−∞} {ω ∈ Ω : X(ω) ≤ y} = ∅.

The third part is a consequence of the embedding

{ω ∈ Ω : X(ω) ≤ x} ⊂ {ω ∈ Ω : X(ω) ≤ y},

so

P({ω ∈ Ω : X(ω) ≤ x}) ≤ P({ω ∈ Ω : X(ω) ≤ y}).

The fun is in the last part. For simplicity, let's take h = 1/n. Notice that

{ω ∈ Ω : X(ω) ≤ y} = ∩_{n=1}^{∞} {ω ∈ Ω : X(ω) ≤ y + 1/n}.

Consequently,

F(y) = P({ω ∈ Ω : X(ω) ≤ y}) = lim_{n→∞} P({ω ∈ Ω : X(ω) ≤ y + 1/n}) = lim_{n→∞} F(y + 1/n).

Lemma 2.2. The distribution function verifies:


• P(X > x) = 1 − F(x),

• P(x < X ≤ y) = F(y) − F(x),

• P(X = x) = F(x) − lim_{n→∞} F(x − 1/n).

Proof. Part 1) follows from the (well-known) fact

{X ≤ x} = {X > x}^c,

so

F(x) = P(X ≤ x) = 1 − P(X > x).

Part 2) follows the same idea. We have

{x < X ≤ y} = {X ≤ y} \ {X ≤ x},

so

P(x < X ≤ y) = P(X ≤ y) − P(X ≤ x) = F(y) − F(x).

Part 3 is obtained from

{X = x} = ∩_{n=1}^{∞} {x − 1/n < X ≤ x},

and we conclude using part 2.

Example 2.3 (Exercise 2.1.5). Given F(x) a distribution function and r > 0 a positive integer, show that F(x)^r is a distribution function.

Solution: As 0 ≤ F(x) ≤ 1, we have 0 ≤ F(x)^r ≤ 1. The limits as x → ±∞ also hold for F(x)^r. The last property is the right continuity, but this one is obtained because the function g(y) = y^r is continuous, and consequently the composition is right continuous. [*** More generally, if g(y) is a continuous, non-decreasing function such that g(0) = 0 and g(1) = 1, then g(F(x)) is a distribution function. ***]

Example 2.4 (Exercise 2.1.1). Given X a random variable, show that aX with a ∈ R is a random variable.

Solution: We have that Y = aX : Ω → R. Also, for a > 0 we have

{aX ≤ x} = {X ≤ x/a} ∈ F,

where the last inclusion is due to the hypothesis that X is already a random variable (the cases a < 0 and a = 0 are similar).

Example 2.5 (Exercise 2.1.2). Given X a random variable with distribution F(x), compute the distribution function of Y = aX + b with 0 < a, b ∈ R.


Solution:

P(Y ≤ y) = P(X ≤ (y − b)/a) = F((y − b)/a).

Definition 2.3. A random variable X is called discrete when X(Ω) is a countable subset of R (of Z or Q, say). In this case, we define the mass function as

f(x) = P({ω ∈ Ω s.t. X(ω) = x}).

Lemma 2.3. The mass function f(x) of a random variable verifies

• f(x) ∈ [0, 1],

• Σ_{y∈X(Ω)} f(y) = 1,

• F(x) = Σ_{y≤x} f(y).

Proof. [*** The proof is easy :-). Can you write it? ***]

The last point in the previous Lemma allows us to define an important concept:

Definition 2.4. A random variable X is called continuous when its distribution function F(x) can be written as

F(x) = ∫_{−∞}^{x} f(y) dy.

The function f(y) is called the probability density function (we will sometimes refer to it as the pdf).

Example 2.6. Let's toss a coin and define

X(H) = 1, X(T) = −1.

Then X is a discrete random variable.

Example 2.7. Let's spin a watch hand and measure the angle, θ, with the horizontal. We define

X = θ.

Is X a continuous random variable?

Solution: We have

Ω = [0, 2π) and P((a, b)) = (b − a)/(2π),

and

F(x) = P({ω ∈ Ω : X(ω) = θ ≤ x}) = min{2π, max{x, 0}}/(2π).

We get

F(x) = ∫_{−∞}^{x} f(y) dy,


with

f(y) = (1/(2π)) 1_{[0,2π)}(y).

Consequently, X is a continuous random variable.

[*** As you can imagine, there are random variables that are neither continuous nor discrete. ***] To clarify this, let's study the example from the book. We toss a coin; if the outcome is heads, a rod is flung on the ground and we measure the angle (θ). Then we define

X(T) = −1, X(H, θ) = θ.

Notice that P(X = −1) = 0.5. [*** Can you explain why there is no probability density function in this case? Maybe you want to compare the behaviour of ∫_0^δ log(x) dx and ∫_0^δ dx/x with the behaviour of ∫_{−1}^δ f(x) dx, where f is the p.d.f. ***]

Notice that, even if we use the same notation f(x) for the probability mass function (for a discrete random variable) and for the pdf (for a continuous random variable), the concepts are not exactly the same. In particular, for the pdf, we have

F(x + δ) − F(x − δ) = ∫_{x−δ}^{x+δ} f(y) dy = P(x − δ ≤ X ≤ x + δ),

and, if we take the limit δ → 0, we have

0 = lim_{δ→0} ∫_{x−δ}^{x+δ} f(y) dy = P(X = x).

In particular, [*** the probability that a continuous random variable hits a single, exact value x is zero! ***]

2.2 Examples of random variables

Let (Ω,F , P ) be a probability space. This space is fixed and implicit in what follows.

2.2.1 Constant variables

Let's fix c ∈ R and define the random variable

X(ω) = c ∀ω ∈ Ω.

The distribution function of this random variable is

F (x) = 1x≥c.


2.2.2 Bernoulli variables

Let’s fix an event A ∈ F with P (A) = p, and define the random variable

X =
1 if ω ∈ A,
0 if ω ∈ A^c.

As a particular example we can consider a biased coin. This coin shows H with probability p and T with probability 1 − p. Then, we can design a game such that if the coin shows H the payment is one dollar, while if the coin shows T, nothing happens. Then A = {H} and A^c = {T}. The distribution function now is

F(x) =
0 if x < 0,
1 − p if 0 ≤ x < 1,
1 if 1 ≤ x.

Let’s continue with the example with a biased coin. Let’s consider the following experi-ment: we toss a coin n times and count the total number of heads. We write Sn for thenumber of heads [*** The letter S is from sum. We can think that we areadding 1 each time a head appears. ***]. Then we are interested in the ratio

Sn/n when n → ∞.

[*** Can you guess the answer? ***] Let's fix ε > 0 and notice first that

P(Sn/n ≥ p + ε) = P(Sn ≥ n(p + ε)) = Σ_{k≥n(p+ε)} P(Sn = k).

We have

P(Sn = k) = (n choose k) p^k (1 − p)^{n−k},

so, for any λ > 0,

P(Sn/n ≥ p + ε) = Σ_{k≥n(p+ε)} (n choose k) p^k (1 − p)^{n−k}

≤ Σ_{k≥n(p+ε)} (n choose k) e^{λ(k−n(p+ε))} p^k (1 − p)^{n−k}

= e^{−λnε} Σ_{k≥n(p+ε)} (n choose k) e^{λk(p+1−p)} e^{−λnp} p^k (1 − p)^{n−k}

= e^{−λnε} Σ_{k≥n(p+ε)} (n choose k) (p e^{λ(1−p)})^k [e^{−λp}(1 − p)]^{n−k}

≤ e^{−λnε} Σ_{k=0}^{n} (n choose k) (p e^{λ(1−p)})^k [e^{−λp}(1 − p)]^{n−k}

= e^{−λnε} (p e^{λ(1−p)} + e^{−λp}(1 − p))^n,

where in the next-to-last step we used that the added terms are non-negative.


Now we use e^x ≤ x + e^{x^2} to obtain

P(Sn/n ≥ p + ε) ≤ e^{−λnε} (p e^{λ^2(1−p)^2} + e^{λ^2 p^2}(1 − p))^n ≤ e^{−λnε} e^{λ^2 n}.

Now notice that f(λ) = λ^2 n − λnε verifies

f(0) = 0, f′(0) = −nε < 0,

consequently, if we take a fixed 0 < λ ≪ 1, we obtain

P(Sn/n ≥ p + ε) → 0 as n → ∞.

The probability P(Sn/n ≤ p − ε) can be bounded in a similar way, and we get

Sn/n → p as n → ∞.

2.3 Monte Carlo

We have seen that if we repeat a given experiment a large number of times (tossing a coin and looking for heads, say) and compute the average, we have

lim_{n→∞} Sn/n = p

in the sense that

P(|Sn/n − p| ≤ ε) → 1 for every ε > 0.

[*** The limit of a random variable (or any other function) is, in general, a delicate issue. ***] Consequently, we can estimate the probability of a particular outcome in a given experiment just by simulating the experiment a million times and checking the frequency of the desired outcome.

Example 2.8. You have a coin and you want to know if this coin is biased or not. How can you do that without any sensitive measurement?

Solution: We can toss the coin 100 times, counting how many heads we obtain. We know that

(Number of heads)/100 ≈ P(H).
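This is exactly what a computer does when we 'measure' a probability by simulation; a sketch, where p_true plays the role of the unknown bias and the number of tosses is an illustrative choice:

p_true = 0.3;                    % unknown bias we pretend not to know
N = 1000;                        % number of tosses
tosses = rand(1,N) < p_true;     % 1 = head, 0 = tail
p_hat = sum(tosses)/N            % ratio of heads, close to p_true for large N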

Example 2.9 (Exercise 2.2.1). You ask an embarrassing question to a big group of people. Let's assume that people try to avoid the answer 'yes' because it is too embarrassing. Then we design the following procedure: when the question is asked, the interviewee tosses a coin out of sight of the questioner. Then, if the answer would have been 'no' and the coin shows a head, the answer given is 'yes'; otherwise, the interviewee responds truthfully. Is this a good method to estimate the proportion of embarrassed people?


Solution: In fact, yes. Notice that, if p is the probability of a person answering 'yes' truthfully and independently of the other members in the sample, then the probability of an interviewee answering 'yes' is

P = p + 0.5(1 − p) = 0.5(p + 1).

After asking every member of the sample and computing the ratio of 'yes' answers (writing q for this ratio), we can approximate p by 2q − 1.

Example 2.10. Let

g(x) : [−1, 1] → R

be a continuous, positive function. We want to approximate with a computer the value of ∫_{−1}^{1} g(x) dx and we can not use any sophisticated integrators from numerical analysis.

Solution: We can use the idea of probability as area. We define the rectangle

R = [−1, 1] × [0, max_x g(x)],

with total area |R| = 2 max_x g(x). We randomly choose N points within this rectangle. Here the word randomly means that all points are equally likely (this is called a uniform distribution). Then, for every point, we check whether the y component is bigger or smaller than g(x) (in other words, we check if the point is above or below the curve (x, g(x))). Then we have

(Number of points below (x, g(x)))/N ≈ P(point below (x, g(x))) = (∫_{−1}^{1} g(x) dx)/|R|.

This method was invented by Stanislaw Ulam at Los Alamos (during the forties). It was originally used to compute the integrals that arise in the theory of nuclear chain reactions.
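Here is a sketch of the hit-or-miss method for an illustrative choice of g (the exact value of this particular integral is about 1.49, so the estimate should land close to it):

g = @(x) exp(-x.^2);                 % any continuous positive function works
N = 1e5; M = 1;                      % here max g = g(0) = 1, so R = [-1,1]x[0,M]
x = -1 + 2*rand(1,N);                % uniform points in the rectangle R
y = M*rand(1,N);
below = y < g(x);                    % points below the curve (x, g(x))
integral_estimate = 2*M*mean(below)  % area(R) * fraction of points below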

2.4 Some examples

Example 2.11 (Exercise 2.7.1). Let's toss a biased coin (that shows heads with probability p) until the first head shows up. Let X be the total number of tosses. What is P(X > m)? Find the distribution function for X.

Solution: We can compute

P(X > m) = P(m tails) = (1 − p)^m.

Then, for x ≥ 0,

F(x) = P(X ≤ x) = 1 − (1 − p)^{⌊x⌋},

while F(x) = 0 if x < 0.

Example 2.12 (Exercise 2.7.2.a). Show that any discrete random variable may be written as a linear combination of indicator variables.


Solution: Let’s write xi, i ≥ 1 for the elements of X(Ω) (we can do this because of thecountability). Then we have

X(ω) =

∞∑

i=1

xi1X(ω)=xi.

Example 2.13 (Exercise 2.7.2.b). Show that any random variable may be expressed asthe limit of an increasing sequence of discrete random variables.

Solution: As a warm up, let's assume first that the random variable takes values in a closed interval, say [−N, N]. Then, we can write

[0, N] = [0, 1/n] ∪ [1/n, 2/n] ∪ ... ∪ [(Nn − 1)/n, Nn/n],

[−N, 0] = [−1/n, 0] ∪ [−2/n, −1/n] ∪ ... ∪ [−Nn/n, −(Nn − 1)/n].

We are splitting the interval [−N, N] into 2nN subintervals of length 1/n. Removing some of the extremal points we can make those subintervals disjoint. Now we can define

Xn(ω) = Σ_{j=−nN}^{nN} (j/n) 1_{j/n ≤ X < (j+1)/n}.

[*** Can you plot some example of X and Xn? ***] A general random variable takes values in the whole real line R. Consequently, we can take the limit N → ∞, and we get

Xn(ω) = Σ_{j=−∞}^{∞} (j/n) 1_{j/n ≤ X < (j+1)/n}.

Example 2.14 (Exercise 2.7.2.c). Show that the limit of any increasing convergent se-quence of random variables is a random variable.

Solution: We have to check that, for every x, {X ≤ x} ∈ F holds. Assume that for a given ω ∈ Ω we have X(ω) ≤ x. Then, since Xn tends to X from below, Xn(ω) ≤ x for every n. Conversely, if Xn(ω) ≤ x for every n, then X(ω) = lim Xn(ω) ≤ x. Consequently,

{X ≤ x} = ∩_n {Xn ≤ x}.

Now we conclude using that the countable intersection of elements in the σ−field is in the σ−field.

Example 2.15 (Exercise 2.7.8). A fairground performer claims the power of telekinesis. The crowd throws coins and he wills them to fall heads up. He succeeds five times out of six. What chance would he have of doing at least as well if he had no supernatural powers?


Solution: The probability of having at least 5 heads equals the probability of having exactly 5 heads and a tail plus the probability of having 6 heads (notice that both sets are disjoint). These probabilities are

P(six heads) = 1/2^6, P(five heads) = 6 · (1/2^5) · (1/2).

Notice that the factor 6 appears because there are 6 ways of placing the tail within the 6 tosses. Consequently, the final answer is

7/2^6.

Example 2.16 (Exercise 2.7.19). Is

F(x) = e^{−x} 1_{x≥0},

a probability distribution function?

Solution: Nope. Notice that every probability distribution function represents the accumulation of probability; in other words, it is non-decreasing. However, F(x) tends to zero as x tends to infinity. Furthermore, the limit as x tends to infinity of a probability distribution function is exactly 1 (not zero!).

f(x) = e^{−x} 1_{x≥0},

define a probability density function?

Solution: Yes. f(x) is positive, and

∫_{−∞}^{∞} f(x) dx = 1.

Example 2.18 (Exercise 2.7.3). Let X and Y be two random variables defined on thesame probability space. Show that

• X + Y

• min{X, Y}

• c1X + c2Y

are random variables.

Solution: We have to express the set

{X + Y ≤ x}

as a countable union/intersection of sets in the σ−field; then we can conclude that {X + Y ≤ x} ∈ F because of the definition of σ−field. We write

{X + Y > x} = ∪_{q∈Q} ({X > q} ∩ {Y > x − q}),

so {X + Y ≤ x} = {X + Y > x}^c ∈ F. In the same way,

{min{X, Y} ≤ x} = {min{X, Y} > x}^c, and {min{X, Y} > x} = {X > x} ∩ {Y > x}.

The latter case reduces to the first one just by changing X by c1X and Y by c2Y.

Solution: Let’s write TW for the number of passengers in the Teeny Weeny’s airplanesand B for the passengers in the Blockbuster’s airplanes. Overbooking occurs when

TW = 10, B = 19, 20.

This happens with probability

P(TW = 10) = 0.9^10.

The probability for B is more complicated. We have

P(B = k) = (20 choose k) 0.9^k 0.1^{20−k}.

So

P(B is overbooked) = P(B = 19) + P(B = 20) = 0.9^20 + (20 choose 19) · 0.9^19 · 0.1.

Numerically, P(TW = 10) = 0.9^10 ≈ 0.349 while P(B is overbooked) ≈ 0.392, so Blockbuster Airways is overbooked more often.

Example 2.20 (Exercise 2.7.6). Buses arrive at ten minute intervals starting at noon. A man arrives at the bus stop a random number X of minutes after noon, where X has distribution function

F(x) =
0 if x < 0,
x/60 if 0 ≤ x < 60,
1 if x ≥ 60.

What is the probability that he waits less than five minutes for a bus?

Solution: Prior to every bus arrival there is an interval of 5 minutes, and we have 6 such intervals (before the buses at 12:10, 12:20, 12:30, 12:40, 12:50 and 13:00). The probability that the man arrives in a 5 minute time interval (y, y + 5) is

F(y + 5) − F(y) = P(y ≤ X ≤ y + 5) = 5/60 = 1/12.

The six intervals are equally likely (due to the form of F(x)). Consequently, as they are disjoint, we have the final answer

P = 6/12 = 0.5.
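The answer can be checked by simulating the arrival time; a sketch (sample size is an illustrative choice of mine):

N = 1e5;
X = 60*rand(1,N);                        % arrival time, uniform on (0,60)
wait = 10 - mod(X,10);                   % minutes until the next bus
mean(wait < 5)                           % close to 0.5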


2.5 Conclusion

There is a whole bunch of important concepts in this chapter that you can not forget. In particular, there is a common mistake that we need to avoid. This mistake is to think that

F(x) = P(X = x).

Recall that F(x) is the accumulated (up to x) probability, while, at least for a discrete variable, we have

P(X = x) = f(x).

In the case of a continuous random variable, recall that the probability that the variable hits a single, exact value x is zero!


Chapter 3

Discrete random variables

3.1 Expectation and variance

Recall the notion of probability mass function (pmf from this point onwards):

Definition 3.1. For a given random variable X, we define the pmf as

f(x) = P (X = x).

Example 3.1 (Bernoulli variable). Let’s fix an event A ∈ F with P (A) = p, and definethe random variable

X =
1 if ω ∈ A,
0 if ω ∈ A^c.

This random variable is a good model of a black/white experiment, i.e. an experimentwith only two possible outcomes.

An example of these experiments can be tossing a coin. In this case we can have

X =
1 if H,
0 if T.

Consequently, we have

f(x) =
1/2 if x = 1,
1/2 if x = 0.

More generally, if the coin is biased,

f(x) =
p if x = 1,
q = 1 − p if x = 0.

Example 3.2 (Binomial variable). The next step is to repeat a black/white experiment.In this case, we are interested in the number of black (say) outcomes. For instance, onecan think that we are tossing a coin N times and we are interested in the total numberof heads. [*** Every time we toss the coin, we have a Bernoulli variable,


consequently, we can think that a Binomial variable is just the sum of Bernoulli variables ***]. Then, we have

f(k) = P(k heads in n tossings) = (n choose k) p^k q^{n−k}, 0 ≤ k ≤ n.

Notice that

Σ_{k=0}^{n} f(k) = Σ_{k=0}^{n} (n choose k) p^k q^{n−k} = (p + q)^n = 1.

Example 3.3 (Poisson variable). Given λ > 0, a random variable taking values in the non-negative (positive + zero) integers with pmf

f(k) = (λ^k/k!) e^{−λ}

is called a Poisson random variable with parameter λ.

Notice that

Σ_{k=0}^{∞} f(k) = Σ_{k=0}^{∞} (λ^k/k!) e^{−λ} = e^{−λ} e^{λ} = 1.

This random variable gives us the probability of a given number of events occurring in a fixed interval of time assuming that these events occur with a known average rate and independently of the time since the last event, i.e. the process has no memory. For instance, the number of customers at the CoHo can be modeled by this random variable.

Example 3.4 (Exercise 3.1.1.a). Find C such that C·2^{−x} is a pmf on the positive integers.

Solution: This is a geometric series. There is a well-known trick to sum it. Let's start with the n-th partial sum Sn. We have

Sn = 2^{−1} + 2^{−2} + ... + 2^{−n},

Sn/2 = 2^{−2} + 2^{−3} + ... + 2^{−n−1}.

Subtracting, we have

Sn − Sn/2 = Sn/2 = 2^{−1} − 2^{−n−1},

so Sn = 1 − 2^{−n}. Taking the limit n → ∞, we have

Σ_{n=1}^{∞} 2^{−n} = 1,

so C = 1.

Example 3.5 (Exercise 3.1.1.b). Find C such that C·x^{−2} is a pmf on the positive integers.


Solution: This is the celebrated Basel problem

Σ_{n=1}^{∞} 1/n^2.

This problem was solved first by Euler... the bad part is that his first solution was not rigorous (even if it was correct!!). He continued working on it until he got a fully rigorous answer. His first idea (the one that was wrong) is great. He considered the function sin(x)/x. He observed that for a polynomial p(x) with roots ri, you can always write

p(x) = (x − r1)(x − r2)(x − r3)...(x − rn) = ∏_{i=1}^{n} (x − ri).

Applying this idea [*** This is the problem. Can you find a counterexample to this fact? ***],

g(x) = sin(x)/x

has zeroes at x = iπ, i = ±1, ±2, ..., so

g(x) = ∏_{i≠0} (1 − x/(iπ)),

and reordering the terms i = ±1, ±2, ...,

g(x) = (1 − x/π)(1 + x/π)(1 − x/(2π))(1 + x/(2π)) ...,

so

g(x) = (1 − x^2/π^2)(1 − x^2/(4π^2)) ...

We compare the coefficient of x^2 in this product with the second order coefficient of the Taylor series,

g(x) = 1 − x^2/6 + ...

We obtain

−(1/π^2) (Σ_{i=1}^{∞} 1/i^2) = −1/6.

Consequently

Σ_{i=1}^{∞} 1/i^2 = π^2/6,

so C = 6/π^2.

[*** This problem is also related to Riemann's zeta function. ***]

Example 3.6 (Exercise 3.1.1.b). Find C such that C·2^{−x}/x is a pmf on the positive integers.


Solution: For this case, let's compute the Taylor expansion of

g(x) = log(x/(x − 1))

and then evaluate at x = 2: since Σ_{x=1}^{∞} 2^{−x}/x = log 2, we get C = 1/log 2. [*** Can you fill in the details? Can you find a better way of solving this problem? ***]

Definition 3.2. X, Y are independent random variables if and only if the events

{X = x}, {Y = y}

are independent for every x, y ∈ R.

Proposition 3.1. Let X, Y be two independent random variables. Then g(X), h(Y) with

g, h : R → R

are also independent random variables.

Proof. Let’s prove the case where X,Y are discrete random variables. We have

g(X) = u, h(Y ) = v = X = x, Y = y, g(x) = u, h(y) = v.

So

P (g(X) = u, h(Y ) = v) =∑

x,y,g(x)=u,h(y)=v

P (X = x, Y = y).

By hypothesis

x,y,g(x)=u,h(y)=v

P (X = x, Y = y) =∑

x,g(x)=u

P (X = x)∑

y,h(y)=v

P (Y = y).

and then

x,y,g(x)=u,h(y)=v

P (X = x, Y = y) = P (g(X) = u)P (h(Y ) = v).

Definition 3.3. Given X a random variable with mass function f, we define its mean value (also expectation or expected value) as

E(X) = Σ_x x f(x).

Proposition 3.2. Let X be a random variable. Then g(X) has expected value given by

E(g(X)) = Σ_x g(x) f(x).


Proof. Applying the definition we have

E(g(X)) = Σ_y y P(g(X) = y) = Σ_y Σ_{x : g(x)=y} y P(X = x) = Σ_x g(x) f(x).

A particularly interesting case is when g(x) = x^k for some k. [*** This is called the k-th moment. ***]

Proposition 3.3. Assume that X is a random variable taking values in the non-negative integers, then

E(X) = Σ_{k=1}^{∞} P(X ≥ k).

Proof. We have

Σ_{k=1}^{∞} P(X ≥ k) = Σ_{k=1}^{∞} Σ_{r=k}^{∞} P(X = r).

Changing the order of summation, we have

Σ_{k=1}^{∞} Σ_{r=k}^{∞} P(X = r) = Σ_{r=1}^{∞} Σ_{k=1}^{r} P(X = r) = Σ_{r=1}^{∞} r P(X = r) = E(X).

Definition 3.4. For a random variable X we define the variance of X as

V ar(X) = E((X − E(X))2).

Definition 3.5.

σ = √(Var(X))

is called the standard deviation.

[*** Notice that E(X) may be negative but Var(X) is always non-negative. ***]

Example 3.7. Let X be the random variable with pmf given by

f(x) = (6/π^2) (1/x^2).

Find E(X).

Solution: We have

E(X) = Σ_x x f(x) = (6/π^2) Σ_x x·(1/x^2) = (6/π^2) Σ_x 1/x = ∞!!

[*** Not every random variable has finite expectation. The existence of ex-pectation is related to the decay of the pmf. ***]


Example 3.8. Let X be the random variable with pmf given by

f(x) = (6/π^2) (1/x^3).

Find E(X).

Solution: We have

E(X) = Σ_x x f(x) = (6/π^2) Σ_x x·(1/x^3) = (6/π^2) Σ_x 1/x^2 = 1.

Example 3.9. Let X be the random variable with pmf given by

f(x) = (6/π^2) (1/x^3).

Find V ar(X).

Solution: We have

E((X − E(X))^2) = E((X − 1)^2) = Σ_x (x − 1)^2 f(x) = (6/π^2) Σ_x (x − 1)^2/x^3 = ∞!!

[*** The moral of these examples is that

• there are random variables with no finite expectation;

• even with a finite expectation, a random variable may have infinite variance or higher moments.

***]

Example 3.10 (Bernoulli variable). Find the expected value of a Bernoulli random variable

X =
1 if ω ∈ A,
0 if ω ∈ A^c.

Solution:

E(X) = 1f(1) = 1P (X = 1) = 1P (A) = p.

Example 3.11 (Binomial variable). Find the expected value of a binomial random variable with parameters n and p (this means n is the total number of Bernoulli experiments, each with probability p).

Solution: We have

f(x) = (n choose x) p^x q^{n−x}.


Consequently

E(X) = Σ_{x=0}^{n} x (n choose x) p^x q^{n−x} = q^n Σ_{x=0}^{n} x (n choose x) (p/q)^x.

By the binomial theorem, we have

g(y) = Σ_{x=0}^{n} (n choose x) y^x = (1 + y)^n,

so

g′(y) = Σ_{x=0}^{n} x (n choose x) y^{x−1} = n(1 + y)^{n−1},

and

y g′(y) = Σ_{x=0}^{n} x (n choose x) y^x = ny(1 + y)^{n−1}.

We take now y = p/q:

Σ_{x=0}^{n} x (n choose x) (p/q)^x = n(p/q)(1 + p/q)^{n−1},

and

q^n Σ_{x=0}^{n} x (n choose x) (p/q)^x = n p q^{n−1}(1 + p/q)^{n−1} = np,

since p + q = 1.

Memento 3.1. We have

(1 + x/n)^n → e^x.

Example 3.12 (Poisson variable). We already know well the Bernoulli Ber(p) and the binomial Bin(n, p) random variables from the previous examples. The Poisson random variable Pois(λ) appears as the probability of a given number of events occurring in a fixed interval of time once we know that the process has no memory and the average rate (for instance, the number of clients entering a store). Let's motivate the Poisson random variable following a different approach. Let's consider a binomial with parameters n and p = λ/n where n = O(10^5) and λ = O(1).

We fix k = O(1) and we compute

P(Bin(n, p) = k) = (n choose k) p^k q^{n−k} ≈ (n^k/k!) (λ/n)^k (1 − λ/n)^n ≈ e^{−λ} λ^k/k!.

The random variable whose pmf is given by

f(x) = e^{−λ} λ^x/x!

is called a Poisson random variable with parameter λ.
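The approximation above can be checked numerically; a sketch (n = 10^5 and λ = 3 are illustrative values, and the binomial pmf is computed on the log scale to avoid huge binomial coefficients):

n = 1e5; lambda = 3; k = 0:10;
logb  = gammaln(n+1) - gammaln(k+1) - gammaln(n-k+1) ...
        + k*log(lambda/n) + (n-k)*log(1-lambda/n);
binom   = exp(logb);                             % pmf of Bin(n, lambda/n)
poisson = exp(-lambda) * lambda.^k ./ factorial(k);
[binom; poisson]                                 % the two rows agree to several decimals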

Find the expected value of a Poisson random variable with parameter λ.


Solution:

E(X) = e^{−λ} Σ_{x=0}^{∞} x λ^x/x! = e^{−λ} λ Σ_{x=1}^{∞} λ^{x−1}/(x − 1)! = e^{−λ} λ Σ_{k=0}^{∞} λ^k/k! = e^{−λ} λ e^λ = λ.

Example 3.13 (Geometric variable). Given a sequence of independent Bernoulli trials, we are interested in the number of experiments needed to see the first success. This waiting time is a random variable called geometric (with parameter p). The pmf of this random variable is given by

f(x) = (1 − p)^{x−1} p = q^{x−1} p.

Compute the expected value of a geometric random variable with parameter p.

Solution: We have

E(X) = Σ_{k=1}^{∞} k P(X = k) = Σ_{k=1}^{∞} k q^{k−1} p.

We have

Σ_{k=1}^{∞} k x^{k−1} = d/dx Σ_{k=0}^{∞} x^k = d/dx (1/(1 − x)) = 1/(1 − x)^2.

Evaluating this in q we find

Σ_{k=1}^{∞} k q^{k−1} = 1/p^2,

so

E(X) = 1/p.
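The value 1/p can also be seen empirically; a sketch (p = 0.2 and the sample size are illustrative choices of mine):

p = 0.2; N = 1e5;
X = zeros(1,N);
for i = 1:N
    k = 1;
    while rand >= p      % repeat Bernoulli trials until the first success
        k = k + 1;
    end
    X(i) = k;
end
mean(X)                  % close to 1/p = 5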

Theorem 3.1. The expectation verifies

• if X ≥ 0 then E(X) ≥ 0,

• for constants a, b and X,Y two random variables, we have E(aX + bY ) = aE(X) +bE(Y ),

• E(1) = 1,

• if X,Y are two independent random variables, we have

E(XY ) = E(X)E(Y ).

[*** THIS IS NOT TRUE IN GENERAL!!! ***]


Proof. The proof relies on the fact that

E(X) = Σ_x x f(x).

As f is the pmf, we have f ≥ 0, so if X ≥ 0 we obtain E(X) ≥ 0. For the second point, we have

E(aX + bY) = Σ_{x,y} (ax + by) P(X = x, Y = y).

Now notice that (as P(Ω) = 1)

Σ_y P(X = x, Y = y) = P(X = x), Σ_x P(X = x, Y = y) = P(Y = y),

so

E(aX + bY) = Σ_{x,y} ax P(X = x, Y = y) + Σ_{x,y} by P(X = x, Y = y) = Σ_x ax P(X = x) + Σ_y by P(Y = y) = aE(X) + bE(Y).

The third part is again the fact that P(Ω) = 1. The last part is

E(XY) = Σ_{x,y} xy P(X = x, Y = y) = Σ_x x P(X = x) Σ_y y P(Y = y) = E(X)E(Y),

where we used the independence P(X = x, Y = y) = P(X = x)P(Y = y).

[*** THIS IS NOT TRUE IN GENERAL!!! ***][*** X,Y such that E(XY ) =E(X)E(Y ) are called uncorrelated random variables. ***]

[*** As a consequence, the variance can be written as

Var(X) = E((X − E(X))^2) = E(X^2 − 2X E(X) + (E(X))^2) = E(X^2) − E(X)^2.

***]

Theorem 3.2. The variance operator verifies

• Var(aX) = a^2 Var(X),

• if X, Y are uncorrelated, we have Var(X + Y) = Var(X) + Var(Y). [*** THIS IS NOT TRUE IN GENERAL!!! ***]

Proof. The first part can be obtained using

Var(aX) = a^2 E(X^2) − a^2 E(X)^2 = a^2 Var(X).

For the second part, we have

Var(X + Y) = E((X + Y)^2) − (E(X) + E(Y))^2 = E(X^2) + E(Y^2) + 2E(XY) − E(X)^2 − E(Y)^2 − 2E(X)E(Y).

Using the hypothesis of the statement, we obtain the result.

Using the hypothesis of the statement, we obtain the result.

[*** Notice that E(·), being (kind of) an integral operator, is a linear operator. However, Var(·) is NOT a linear operator. ***]


Example 3.14 (Bernoulli variable). Find the variance of a Bernoulli random variable

X =
1 if ω ∈ A,
0 if ω ∈ A^c.

Solution:

E(X^2) = 1^2·f(1) = 1·P(X = 1) = P(A) = p.

Consequently, the variance is

Var(X) = E(X^2) − E(X)^2 = p − p^2 = pq.

Example 3.15 (Binomial variable). Find the variance of a binomial random variablewith parameters n and p (this means n is the total number of Bernoulli experiment withprobability p).

Solution: We have that the binomial random variable is a sum of n independent Bernoulli random variables. Consequently, applying the previous Theorem, we have

V ar(X) = npq.

Example 3.16 (Poisson variable). Find the variance of a Poisson random variable withparameter λ.

Solution:

E(X^2) = e^{−λ} Σ_{x=0}^{∞} x^2 λ^x/x! = e^{−λ} λ Σ_{x=1}^{∞} x λ^{x−1}/(x − 1)! = e^{−λ} λ ( Σ_{x=2}^{∞} λ^{x−1}/(x − 2)! + Σ_{x=1}^{∞} λ^{x−1}/(x − 1)! ).

Using Taylor's theorem, we have

E(X^2) = e^{−λ} λ (λ e^λ + e^λ) = λ^2 + λ.

So,

Var(X) = E(X^2) − E(X)^2 = λ^2 + λ − λ^2 = λ.

Example 3.17 (Geometric variable). Compute the variance of a geometric random vari-able with parameter p.


Solution: We have

E(X^2) = Σ_{k=1}^{∞} k^2 P(X = k) = Σ_{k=1}^{∞} k^2 q^{k−1} p.

We have

Σ_{k=1}^{∞} k^2 x^{k−1} = Σ_{k=1}^{∞} [(k + 1)k x^{k−1} − k x^{k−1}].

We can split this in two terms:

− Σ_{k=1}^{∞} k x^{k−1} = − 1/(1 − x)^2,

and

Σ_{k=1}^{∞} (k + 1)k x^{k−1} = d/dx Σ_{k=1}^{∞} (k + 1) x^k.

Notice that

Σ_{k=1}^{∞} (k + 1) x^k = d/dx Σ_{k=1}^{∞} x^{k+1} = d/dx (1/(1 − x) − 1 − x) = 1/(1 − x)^2 − 1.

We conclude

Σ_{k=1}^{∞} k^2 x^{k−1} = 2/(1 − x)^3 − 1/(1 − x)^2.

Evaluating this in q we find

Σ_{k=1}^{∞} k^2 q^{k−1} = 2/p^3 − 1/p^2 = (1/p^2)·(2 − p)/p,

so

E(X^2) = (2 − p)/p^2.

Then

Var(X) = (2 − p)/p^2 − 1/p^2 = q/p^2.

3.2 Pairs of random variables

In most real world applications there is more than one random variable involved. Consequently, it is interesting to study the case with at least two random variables. The good news is that at this point of the course we can generalize from one to two random variables very easily.


Definition 3.6. Given X, Y two random variables, we define the joint distribution function

FX,Y : R^2 → [0, 1]

as

FX,Y(x, y) = P(X ≤ x and Y ≤ y).

The joint mass function is then

fX,Y : R^2 → [0, 1]

with

fX,Y(x, y) = P(X = x and Y = y).

Lemma 3.1. X,Y are independent if and only if

fX,Y (x, y) = fX(x)fY (y).

The proof of this result is left as an exercise.

Definition 3.7. fX(x), fY(y) are called the marginal mass functions.

Lemma 3.2. We have

Σ_y fX,Y(x, y) = fX(x), Σ_x fX,Y(x, y) = fY(y).

Proof. This is due to the fact that

{X = x} = ∪_y [{X = x} ∩ {Y = y}].

These sets are disjoint.

Lemma 3.3.

E(g(X, Y)) = Σ_{x,y} g(x, y) fX,Y(x, y).

For the sake of completeness, we introduce the definition of covariance here.

Definition 3.8. Given two random variables, we define the covariance as

Cov(X,Y ) = E((X − E(X))(Y − E(Y ))) = E(XY )− E(X)E(Y ).

Recall that this concept is not new: it appeared when computing Var(X + Y).

Definition 3.9. Given two random variables, we define the correlation as

ρ(X, Y) = Cov(X, Y) / (√(Var(X)) √(Var(Y))).


[*** Now the term uncorrelated seems more clear. ***]

There is an inequality which is ubiquitous in mathematical analysis (from harmonic analysis to partial differential equations). It is called the Cauchy-Schwartz-Bunyakowski inequality. In the probability context it reads

E(XY) ≤ √(E(X^2)) √(E(Y^2)).

To prove this inequality we consider Z = aX + Y and compute the polynomial (in the variable a)

p(a) = E(Z^2) ≥ 0.

As p(a) may have at most one real root, its discriminant must be non-positive. The discriminant is

4(E(XY))^2 − 4E(X^2)E(Y^2).

As a corollary, we have

Lemma 3.4.

Cov(X, Y)^2 ≤ Var(X) Var(Y).

Example 3.18. Given two random variables X, Y and their joint mass function, obtain the pmf of Z = X + Y.

Solution: We have

{Z = z} = {X + Y = z} = ∪_x ({X = x} ∩ {Y = z − x}),

and these sets are disjoint. So

fZ(z) = P(Z = z) = Σ_x fX,Y(x, z − x).

In the particular case where X, Y are independent, the previous formula can be simplified further:

fZ(z) = P(Z = z) = Σ_x fX(x) fY(z − x).

One step forward 3.1. The function

h(z) = Σ_x f(x) g(z − x)

is called the convolution of f and g. Usually we denote the convolution as

h = f ∗ g.
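For instance, the pmf of the sum of two fair dice from Example 1.30 is the convolution of the die pmf with itself; a short sketch using Matlab's built-in conv:

f = ones(1,6)/6;          % pmf of one die on the values 1,...,6
h = conv(f, f);           % pmf of the sum, supported on 2,...,12
[(2:12); h]               % e.g. P(sum = 7) = 6/36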

Example 3.19. Given two random variables X,Y taking values in the non-negative in-tegers and its joint mass function, obtain the pmf of Z = XY .


Solution: We have

{Z = z} = {XY = z} = ∪_{x dividing z} ({X = x} ∩ {Y = z/x}),

and these sets are disjoint. So

fZ(z) = P(Z = z) = Σ_{x dividing z} fX,Y(x, z/x).

Given two random variables, we can use conditional probability to modify the distribution. We define:

Definition 3.10. Given X, Y two random variables and x a real number such that P(X = x) ≠ 0, we define the distribution function of Y given X = x as

FY |X(y|x) = P (Y ≤ y|X = x).

The conditional probability mass function of Y given X is then

fY |X(y|x) = P (Y = y|X = x) =P (Y = y,X = x)

P (X = x)=

fX,Y (x, y)

fX(x).

For this random variable we can compute the expectation

E(Y|X = x) = Σ_y y fY|X(y|x).

Remarkably, [*** the conditional expectation is a random variable. To see this, recall that you can move x. We have

E(E(Y|X = x)) = Σ_x E(Y|X = x) fX(x) = Σ_{x,y} y (fX,Y(x, y)/fX(x)) fX(x) = E(Y).

***] Notice that, given two random variables X, Y, the object 'Y given X' is very useful. You may think that the parameter present in Y is replaced by a random value given by X. For instance, we can consider the case where Y = Bin(X, p) and X = Pois(λ). This Y is a model of the number of double espresso coffees that the CoHo serves before you actually order.

3.3 An introduction to the random walk

We define the random variable

X =
1 with probability p,
−1 with probability q = 1 − p.

Then we consider the sum of n independent copies of the previous random variable X:

Sn = x + Σ_{k=1}^{n} Xk.

Sn is called a random walk and can be understood as a model of a particle with initial position x, moving somehow randomly.
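Random walks are easy to simulate; here is a sketch that draws a few sample paths (p = 0.5, x = 0 and the path length are illustrative choices of mine):

p = 0.5; x = 0; n = 1000;
steps = 2*(rand(n,5) < p) - 1;        % five independent columns of +/-1 steps
S = x + cumsum(steps);                % S(k,j) = position of walk j after k steps
plot(S)                               % each column is one trajectory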


Lemma 3.5. The random walk verifies

1. (Space homogeneous) P(Sn = s|S0 = x) = P(Sn = s + a|S0 = x + a).

2. (Time homogeneous) P(Sn = s|S0 = x) = P(Sn+m = s|Sm = x).

3. (Markov) P(Sn+m = s|S0, S1, S2, ..., Sm) = P(Sn+m = s|Sm).

Proof. 1. The probability in both cases is equal to

P(Σ_{k=1}^{n} Xk = s − x).

2. Here the probabilities are

P(Σ_{k=1}^{n} Xk = s − x) and P(Σ_{k=m+1}^{m+n} Xk = s − x).

Both are the same because of the definition of the Xk.

3. Due to the definition of the random walk and the independence of the Xk, the future positions only depend on the last known position and not on the whole history.


Chapter 4

Continuous random variables

4.1 Introduction

In this chapter we are going to extend the results from Chapter 3 to the case of continuous random variables. Recall that a random variable X is called continuous if its distribution function F verifies

F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(s) ds,

where

f(x) : R → [0, ∞)

is called the probability density function of X. Notice that for a continuous random variable, we have

P(X = x) = 0 ∀x ∈ R,

lim_{x→∞} F(x) = ∫_{−∞}^{∞} f(s) ds = 1,

and

P(a ≤ X ≤ b) = ∫_{a}^{b} f(s) ds.

In general, for a set A ⊂ R

P (X ∈ A) =

Af(x)dx.

4.2 Independence

For two discrete random variables, the notion of independence is written in terms of the independence of the events
$$\{X = x\}, \quad \{Y = y\};$$
however, for a continuous random variable these events have probability zero, so they are useless here. We restate the notion of independence as


Definition 4.1. X, Y are independent random variables if and only if the events
$$\{X \le x\}, \quad \{Y \le y\}$$
are independent for all $x, y \in \mathbb{R}$.

Example 4.1. Show that, for discrete random variables, the previous definition is equivalent to Definition 3.2.

Solution: Notice that if two discrete random variables verify the definition with = instead of ≤ (Definition 3.2), then, since
$$\{X \le x\} \cap \{Y \le y\} = \bigcup_{k \le x} \bigcup_{j \le y} \{X = k\} \cap \{Y = j\},$$
we have
$$P(\{X \le x\} \cap \{Y \le y\}) = \sum_{k \le x} \sum_{j \le y} P(X = k) P(Y = j) = \sum_{k \le x} P(X = k) \sum_{j \le y} P(Y = j),$$
so
$$P(\{X \le x\} \cap \{Y \le y\}) = P(X \le x) P(Y \le y).$$
The converse implication is also true. We have
$$P(\{X \le x\} \cap \{Y \le y\}) = \sum_{k \le x} \sum_{j \le y} P(\{X = k\} \cap \{Y = j\}) = P(X \le x) P(Y \le y)$$
and
$$P(\{X \le x - 1\} \cap \{Y \le y\}) = \sum_{k \le x - 1} \sum_{j \le y} P(\{X = k\} \cap \{Y = j\}) = P(X \le x - 1) P(Y \le y),$$
so, subtracting,
$$P(\{X = x\} \cap \{Y \le y\}) = \sum_{j \le y} P(\{X = x\} \cap \{Y = j\}) = P(X = x) P(Y \le y).$$
In the same way,
$$P(\{X = x\} \cap \{Y \le y - 1\}) = \sum_{j \le y - 1} P(\{X = x\} \cap \{Y = j\}) = P(X = x) P(Y \le y - 1).$$
Subtracting again,
$$P(\{X = x\} \cap \{Y = y\}) = P(X = x) P(Y = y).$$

We also have

Proposition 4.1. Let X, Y be two independent random variables. Then $g(X)$, $h(Y)$, with
$$g, h : \mathbb{R} \to \mathbb{R},$$
are also independent random variables.

This proposition is the (generalized) alter ego of Proposition 3.1.


4.3 Expectation and Variance

Definition 4.2. Given a continuous random variable with density function f, the expectation is
$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx,$$
as long as this integral exists.

Proposition 4.2. Assume that X is a random variable taking values in the non-negative real numbers. Then
$$E(X) = \int_{0}^{\infty} P(X > x)\,dx.$$

Proof.
$$\int_{0}^{\infty} P(X > x)\,dx = \int_{0}^{\infty} \int_{x}^{\infty} f(y)\,dy\,dx = \int_{0}^{\infty} \int_{0}^{y} f(y)\,dx\,dy = E(X).$$

This result is the alter ego of Proposition 3.3.

Proposition 4.3. Assume that X is a continuous random variable with density f and that g is a continuous function. Then
$$E(g(X)) = \int_{-\infty}^{\infty} g(x) f(x)\,dx.$$

Example 4.2 (Exercise 4.3.1). For which values of α is $E(|X|^\alpha)$ finite if the density function is given by
$$f(x) = e^{-x} 1_{x \ge 0}.$$

Solution: We have
$$E(|X|^\alpha) = \int_{-\infty}^{\infty} |x|^\alpha e^{-x} 1_{x \ge 0}\,dx = \int_{0}^{\infty} x^\alpha e^{-x}\,dx.$$
Notice that, for $x \ge 0$ and $\alpha \ge 0$,
$$x^\alpha \le 1 + x^{\lfloor \alpha \rfloor + 1},$$
so, after several integrations by parts,
$$E(|X|^\alpha) \le \int_{0}^{\infty} \left(1 + x^{\lfloor \alpha \rfloor + 1}\right) e^{-x}\,dx < \infty$$
for every $\alpha \ge 0$.

Example 4.3 (Exercise 4.3.2). For which values of α is $E(|X|^\alpha)$ finite if the density function is given by
$$f(x) = C(1 + x^2)^{-m}.$$


Solution: We have
$$E(|X|^\alpha) = \int_{-\infty}^{\infty} |x|^\alpha \frac{C}{(1 + x^2)^{m}}\,dx.$$
For large $|x|$ we have
$$|x|^\alpha \frac{C}{(1 + x^2)^{m}} \le \frac{c}{|x|^{2m - \alpha}},$$
so the integral is finite at infinity precisely when
$$2m - \alpha > 1.$$

Example 4.4 (Exercise 4.3.3). Prove that, for a non-negative random variable,
$$E(X^r) = \int_{0}^{\infty} r x^{r-1} P(X > x)\,dx.$$

Solution: We have
$$\int_{0}^{\infty} r x^{r-1} P(X > x)\,dx = \int_{0}^{\infty} \int_{x}^{\infty} r x^{r-1} f(y)\,dy\,dx = \int_{0}^{\infty} \int_{0}^{y} r x^{r-1}\,dx\, f(y)\,dy = \int_{0}^{\infty} y^r f(y)\,dy = E(X^r).$$

Definition 4.3. The variance of a continuous random variable X with density f is
$$\mathrm{Var}(X) = \int_{\mathbb{R}} \left(x^2 - E(X)^2\right) f(x)\,dx = E(X^2) - E(X)^2.$$

4.4 Examples

Example 4.5. Compute the expectation and the variance of the uniform distribution.

Solution: The random variable such that
$$F(x) = \begin{cases} 0 & \text{if } x \le a,\\ \frac{x - a}{b - a} & \text{if } a < x \le b,\\ 1 & \text{if } b < x, \end{cases}$$
is called uniform on $(a, b)$. In this case the density function is
$$f(x) = \frac{1}{b - a} 1_{a < x \le b}.$$
The expectation is then
$$E(X) = \frac{1}{b - a} \int_{a}^{b} x\,dx = \frac{1}{2} \frac{b^2 - a^2}{b - a} = \frac{b + a}{2}.$$
We have
$$E(X^2) = \frac{1}{b - a} \int_{a}^{b} x^2\,dx = \frac{1}{3} \frac{b^3 - a^3}{b - a}.$$
The variance is
$$\mathrm{Var}(X) = E(X^2) - E(X)^2 = \frac{1}{3} \frac{b^3 - a^3}{b - a} - \left(\frac{b + a}{2}\right)^2 = \frac{(b - a)^2}{12}.$$
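These formulas can be verified numerically. Here is a Matlab sketch (with arbitrary endpoints a = 2, b = 5) that computes the two integrals with the built-in integral command and compares them with $(b + a)/2$ and $(b - a)^2/12$.

% Sketch: numerical check of the mean and variance of the uniform distribution
a = 2; b = 5;                                 % hypothetical interval endpoints
f    = @(x) ones(size(x)) / (b - a);          % density on (a, b)
EX   = integral(@(x) x .* f(x), a, b);        % should equal (b + a)/2 = 3.5
EX2  = integral(@(x) x.^2 .* f(x), a, b);     % E(X^2)
VarX = EX2 - EX^2;                            % should equal (b - a)^2/12 = 0.75
disp([EX (b + a)/2; VarX (b - a)^2/12])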


Example 4.6. Compute the expectation and the variance of the exponential distribution.

Solution: For this random variable, the distribution is given by
$$F(x) = \left(1 - e^{-\lambda x}\right) 1_{x \ge 0}.$$
The density is then
$$f(x) = \lambda e^{-\lambda x} 1_{x \ge 0}.$$
We can compute the expectation
$$E(X) = \int_{-\infty}^{\infty} \lambda x e^{-\lambda x} 1_{x \ge 0}\,dx = \int_{0}^{\infty} \lambda x e^{-\lambda x}\,dx;$$
we integrate by parts:
$$\int_{0}^{\infty} x \lambda e^{-\lambda x}\,dx = -x e^{-\lambda x}\Big|_{0}^{\infty} + \int_{0}^{\infty} e^{-\lambda x}\,dx = \frac{1}{\lambda}.$$
Integrating by parts twice, we obtain
$$E(X^2) = \int_{-\infty}^{\infty} \lambda x^2 e^{-\lambda x} 1_{x \ge 0}\,dx = \frac{2}{\lambda^2},$$
and therefore
$$\mathrm{Var}(X) = E(X^2) - E(X)^2 = \frac{1}{\lambda^2}.$$
This distribution arises as a model of the waiting time between unpredictable events. Let's assume that we perform a Bernoulli (Ber(p)) trial every δ seconds, and write N for the time that we have to wait until the first head. Then
$$P(N > k\delta) = (1 - p)^k.$$
Now we fix a time t and consider the case where $p = \lambda\delta$ for a constant $\lambda > 0$. Then
$$P(N > t) = P\big(N > \lfloor t/\delta \rfloor\, \delta\big) = (1 - p)^{\lfloor t/\delta \rfloor} = (1 - \lambda\delta)^{\lfloor t/\delta \rfloor}.$$
If we take the limit $\delta \to 0$, we obtain
$$P(N > t) \to e^{-\lambda t}.$$
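This convergence is easy to observe numerically. A Matlab sketch (with arbitrary values λ = 2 and t = 1):

% Sketch: (1 - lambda*delta)^floor(t/delta) approaches exp(-lambda*t) as delta -> 0
lambda = 2; t = 1;                               % hypothetical values
delta  = [1e-1 1e-2 1e-3 1e-4];
approx = (1 - lambda*delta).^floor(t ./ delta);  % geometric tail probabilities
disp([delta' approx' repmat(exp(-lambda*t), 4, 1)])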

Example 4.7. The Cauchy distribution is given by
$$f(x) = \frac{1}{\pi} \frac{1}{1 + x^2}.$$
Show that the Cauchy distribution is a valid distribution.


Solution: Notice that
$$\int_{-\infty}^{\infty} f(x)\,dx = \frac{1}{\pi} \arctan(x)\Big|_{-\infty}^{\infty} = \frac{1}{\pi}\left(\frac{\pi}{2} - \left(-\frac{\pi}{2}\right)\right) = 1;$$
consequently, f is a bona fide density function.

Notice that the moments $E(X^\alpha)$ for $\alpha \ge 1$ do not exist. The problem for the expectation ($\alpha = 1$) is that the integrand of
$$E(X) = \int_{\mathbb{R}} x f(x)\,dx$$
is of order $1/x$ at infinity; consequently, the improper integral is not well defined.
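A quick numerical check that the Cauchy density integrates to 1, sketched in Matlab:

% Sketch: the Cauchy density integrates to 1
f = @(x) (1/pi) ./ (1 + x.^2);
disp(integral(f, -Inf, Inf))    % should print 1 (up to numerical error)
% the integrand of E(X), x.*f(x), decays only like 1/x, so E(X) is not defined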

Example 4.8. The normal distribution $N(\mu, \sigma^2)$ is given by
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}.$$
The parameters $\mu$ and $\sigma^2$ are the mean and the variance. Check that µ is the expected value.

Solution: We have
$$E(X) = \int_{\mathbb{R}} x f(x)\,dx = \int_{\mathbb{R}} (x - \mu) \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}\,dx + \mu \int_{\mathbb{R}} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}\,dx.$$
The first integral vanishes because the integrand is odd about $x = \mu$, and the second equals µ because the density integrates to 1. Hence
$$E(X) = \mu.$$
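A quick Monte Carlo sanity check in Matlab (the values of µ, σ and the sample size are arbitrary):

% Sketch: Monte Carlo check that the mean of N(mu, sigma^2) is mu
mu = 3; sigma = 2; m = 1e6;       % hypothetical parameters and sample size
X = mu + sigma * randn(m, 1);     % m samples from N(mu, sigma^2)
disp([mean(X) mu])                % the sample mean should be close to mu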

4.5 Pairs of random variables

Definition 4.4. Given two random variables X, Y, we define the joint distribution function
$$F_{X,Y} : \mathbb{R}^2 \to [0, 1]$$
as
$$F_{X,Y}(x, y) = P(X \le x \text{ and } Y \le y).$$
The joint density function is then
$$f_{X,Y} : \mathbb{R}^2 \to [0, \infty)$$
such that
$$F_{X,Y}(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{X,Y}(a, b)\,da\,db.$$


Notice that, if we can compute the two derivatives, we have
$$f_{X,Y}(x, y) = \partial_x \partial_y F_{X,Y}(x, y).$$
We then have
$$P(a \le X \le b,\; c \le Y \le d) = \int_{a}^{b} \int_{c}^{d} f_{X,Y}(x, y)\,dy\,dx = F_{X,Y}(b, d) - F_{X,Y}(a, d) - F_{X,Y}(b, c) + F_{X,Y}(a, c).$$
In general, for a set $A \subset \mathbb{R}^2$,
$$P((X, Y) \in A) = \int_{A} f_{X,Y}(x, y)\,dx\,dy.$$
The marginal distributions are then
$$F_X(x) = F_{X,Y}(x, \infty), \qquad F_Y(y) = F_{X,Y}(\infty, y).$$
Equivalently,
$$f_X(x) = \int_{\mathbb{R}} f_{X,Y}(x, y)\,dy, \qquad f_Y(y) = \int_{\mathbb{R}} f_{X,Y}(x, y)\,dx.$$

Example 4.9. Given
$$f_{X,Y}(x, y) = \frac{1}{y} e^{-y - x/y}, \qquad 0 < x, y < \infty,$$
compute $f_Y(y)$.

Solution: We have
$$f_Y(y) = \int_{\mathbb{R}} f_{X,Y}(x, y)\,dx = \int_{0}^{\infty} \frac{1}{y} e^{-y - x/y}\,dx = e^{-y}.$$

As in the case of discrete random variables, we have

Lemma 4.1. X, Y are independent if and only if
$$F_{X,Y}(x, y) = F_X(x) F_Y(y).$$

Taking the two derivatives $\partial_x \partial_y$, we see that, if the random variables are independent,
$$f_{X,Y}(x, y) = f_X(x) f_Y(y).$$
The converse implication is also true; it can be proved by integrating.

Definition 4.5. For continuous random variables, we define the covariance as
$$\mathrm{Cov}(X, Y) = E(XY) - E(X) E(Y).$$
In fact, for continuous random variables we also have the Cauchy-Schwarz inequality
$$E(XY)^2 \le E(X^2) E(Y^2).$$


Example 4.10. Show that every random variable X with finite second moment has a well-defined expected value.

Solution: By the Cauchy-Schwarz inequality,
$$E(|X|) = E(|X| \cdot 1) \le \sqrt{E(X^2) E(1^2)} = \sqrt{E(X^2)} < \infty.$$

Example 4.11. Prove that there exist random variables whose expectation is finite but whose second moment is infinite.

Solution: We define the density function
$$f_X(x) = \frac{C}{x^3} 1_{x \ge 1}.$$
Notice that C is a constant such that
$$C \int_{1}^{\infty} \frac{1}{x^3}\,dx = 1.$$
[*** Compute that constant explicitly! ***] For this density function, we have
$$E(X) = C \int_{1}^{\infty} \frac{x}{x^3}\,dx < \infty,$$
and
$$E(X^2) = C \int_{1}^{\infty} \frac{x^2}{x^3}\,dx = \infty.$$

The definition of the conditional distribution of Y given X is a little bit more involved than in the discrete case.

Definition 4.6. Given two random variables X, Y and a real number x such that $f_X(x) > 0$, we define the distribution function of Y given X = x as
$$F_{Y|X}(y|x) = \int_{-\infty}^{y} \frac{f_{X,Y}(x, b)}{f_X(x)}\,db.$$
The conditional density function of Y given X is then
$$f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)}.$$
For this random variable we can compute the expectation
$$E(Y \mid X = x) = \int_{\mathbb{R}} y\, f_{Y|X}(y|x)\,dy.$$
As in the case of discrete random variables, [*** the conditional expectation is a random variable: as x varies, so does $E(Y \mid X = x)$. We have
$$E(E(Y \mid X)) = \int_{\mathbb{R}} E(Y \mid X = x) f_X(x)\,dx = \int_{\mathbb{R}^2} y\, \frac{f_{X,Y}(x, y)}{f_X(x)} f_X(x)\,dy\,dx = E(Y).$$


***] Notice that, given two random variables X, Y, the object "Y given X" is very useful. You may think of the parameter present in Y as being replaced by a random value given by X. Let's recall the example from Chapter 3: we consider the case where Y = Bin(X, p) and X = Pois(λ); this Y is a model of the number of double espresso coffees that the CoHo serves before you order.

Example 4.12 (Exercise 4.6.1.a). A point is picked uniformly at random on the surface of a unit sphere. Writing θ and φ for its longitude and latitude, find the conditional density function of θ given φ.

Solution: As the point X is picked uniformly on the sphere, we have
$$P(X \in A) = \frac{1}{4\pi} \int_{A} dS.$$
Using spherical coordinates
$$x = \cos(\phi)\cos(\theta), \quad y = \sin(\theta)\cos(\phi), \quad z = \sin(\phi),$$
we can write the surface integral as
$$P(X \in A) = \frac{1}{4\pi} \int_{A} |\cos(\phi)|\,d\theta\,d\phi,$$
where we have used
$$J = \|\partial_\phi(x, y, z) \times \partial_\theta(x, y, z)\| = |\cos(\phi)|.$$
The joint density is then
$$f_{\Theta,\Phi}(\theta, \phi) = \frac{|\cos(\phi)|}{4\pi}.$$
The marginal is
$$f_\Phi(\phi) = \int_{0}^{2\pi} \frac{|\cos(\phi)|}{4\pi}\,d\theta = \frac{|\cos(\phi)|}{2}.$$
Then
$$f_{\Theta|\Phi}(\theta|\phi) = \frac{f_{\Theta,\Phi}}{f_\Phi} = \frac{1}{2\pi}.$$

This implies that, for every φ, Θ given Φ = φ is uniformly distributed on $(0, 2\pi)$.

Example 4.13 (Exercise 4.6.1.b). A point is picked uniformly at random on the surface of a unit sphere. Writing θ and φ for its longitude and latitude, find the conditional density function of φ given θ.

Solution: Using the results from the previous example, we have that the marginal is
$$f_\Theta(\theta) = \int_{-\pi/2}^{\pi/2} \frac{|\cos(\phi)|}{4\pi}\,d\phi = \frac{1}{2\pi}.$$
Then
$$f_{\Phi|\Theta}(\phi|\theta) = \frac{f_{\Theta,\Phi}}{f_\Theta} = \frac{|\cos(\phi)|}{2}.$$
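Both conclusions can be checked by simulation: sampling uniform points on the sphere (by normalizing standard Gaussian vectors, a standard trick), the longitude should look uniform while the latitude should follow the density $|\cos(\phi)|/2$. A Matlab sketch:

% Sketch: Monte Carlo check of the longitude/latitude densities on the sphere
m = 1e5;
P = randn(m, 3);
P = P ./ repmat(sqrt(sum(P.^2, 2)), 1, 3);  % normalized rows: uniform points on the sphere
theta = atan2(P(:,2), P(:,1));              % longitude in (-pi, pi]
phi   = asin(P(:,3));                       % latitude in [-pi/2, pi/2]
subplot(1,2,1); hist(theta, 50); title('longitude: flat histogram')
subplot(1,2,2); hist(phi, 50);   title('latitude: proportional to |cos(phi)|')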


Example 4.14 (Exercise 4.6.4.a). Find the conditional density function and expectation of Y given X when they have joint density function
$$f_{X,Y}(x, y) = \lambda^2 e^{-\lambda y} 1_{0 \le x \le y < \infty}.$$

Solution: We have that the marginal is
$$f_X(x) = \int_{\mathbb{R}} f_{X,Y}(x, y)\,dy = \int_{x}^{\infty} \lambda^2 e^{-\lambda y}\,dy = -\lambda e^{-\lambda y}\Big|_{x}^{\infty} = \lambda e^{-\lambda x}, \qquad 0 \le x < \infty.$$
Then
$$f_{Y|X}(y|x) = \frac{\lambda^2 e^{-\lambda y}}{\lambda e^{-\lambda x}} = \lambda e^{\lambda(x - y)}, \qquad 0 \le x \le y < \infty.$$
The conditional expectation is then
$$E(Y \mid X = x) = \int_{x}^{\infty} y\, \lambda e^{\lambda(x - y)}\,dy = x + \frac{1}{\lambda}.$$
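These computations can be double-checked numerically. The following Matlab sketch (with an arbitrary λ and a fixed point x0) integrates the joint density over y and also approximates $E(Y \mid X = x_0)$:

% Sketch: numerical check of the marginal and of E(Y | X = x0)
lambda = 1.5; x0 = 0.7;                        % hypothetical values
fXY   = @(y) lambda^2 * exp(-lambda * y);      % joint density as a function of y, for y >= x0
fXnum = integral(fXY, x0, Inf);                % should equal lambda*exp(-lambda*x0)
EYgX  = integral(@(y) y .* lambda .* exp(lambda*(x0 - y)), x0, Inf);  % should equal x0 + 1/lambda
disp([fXnum lambda*exp(-lambda*x0); EYgX x0 + 1/lambda])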

Given two random variables X, Y, the random variable Z = X + Y appears very often in applications. Notice that, in the case of discrete random variables, we have
$$\{Z = z\} = \bigcup_{x} \{X = x,\, Y = z - x\}.$$
So the probability mass function is
$$f_Z(z) = \sum_{x} f_{X,Y}(x, z - x).$$

If we have continuous random variables, the situation is very similar. We have
$$P(Z \le z) = \iint_{u + v \le z} f_{X,Y}(u, v)\,dv\,du = \int_{-\infty}^{\infty} \int_{-\infty}^{z - u} f_{X,Y}(u, v)\,dv\,du.$$
By changing variables ($x = u$, $y = u + v$) and the order of integration, we obtain
$$P(Z \le z) = \int_{-\infty}^{\infty} \int_{-\infty}^{z} f_{X,Y}(x, y - x)\,dy\,dx = \int_{-\infty}^{z} \int_{-\infty}^{\infty} f_{X,Y}(x, y - x)\,dx\,dy.$$
Consequently,
$$F_Z(z) = \int_{-\infty}^{z} f_Z(y)\,dy = \int_{-\infty}^{z} \int_{-\infty}^{\infty} f_{X,Y}(x, y - x)\,dx\,dy.$$
We take a derivative $\partial_z$ to conclude the result.

We have proved the following result.

Lemma 4.2. If X, Y have a joint density function $f_{X,Y}$, then Z = X + Y has a density function given by
$$f_Z(z) = \int_{\mathbb{R}} f_{X,Y}(x, z - x)\,dx.$$
Furthermore,


Lemma 4.3. If X, Y are two independent random variables with joint density function $f_{X,Y}$, then Z = X + Y has a density function given by
$$f_Z(z) = \int_{\mathbb{R}} f_{X,Y}(x, z - x)\,dx = \int_{\mathbb{R}} f_X(x) f_Y(z - x)\,dx.$$

One step forward 4.1. Given two functions f, g, we define their convolution as
$$(f * g)(z) = \int_{\mathbb{R}} g(x) f(z - x)\,dx.$$
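On a grid, this convolution integral can be approximated with the same conv command used for mass functions. A Matlab sketch for two independent Exp(λ) variables, chosen because the exact answer, the Gamma(2, λ) density $\lambda^2 z e^{-\lambda z}$, is known:

% Sketch: density of Z = X + Y for independent X, Y ~ Exp(lambda),
% obtained by discretizing the convolution integral on a grid
lambda = 1; dx = 0.01;
x  = 0:dx:20;
fX = lambda * exp(-lambda * x);        % density of X (and of Y)
fZ = conv(fX, fX) * dx;                % Riemann-sum approximation of the convolution
z  = 0:dx:40;                          % grid on which fZ lives
plot(z, fZ, z, lambda^2 * z .* exp(-lambda * z), '--')   % compare with the exact density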


Chapter 5

Characteristic functions

Given a random variable (continuous or discrete, it does not matter at this point), we may be interested in some composition g(X). Due to their good properties, there are several common choices for g. The first one is

Definition 5.1. Given a random variable X, we define the moment generating function as
$$M(t) = E(e^{tX}).$$
[*** Notice that M(t) is a function of t. ***]

One step forward 5.1. The moment generating function is related to the Laplace transform.

Memento 5.1. Recall that, for a given g,
$$E(g(X)) = \sum_x g(x) f_X(x) \quad \text{(discrete random variable)}$$
and
$$E(g(X)) = \int_{\mathbb{R}} g(x) f_X(x)\,dx \quad \text{(continuous random variable)}.$$

Example 5.1 (Bernoulli random variable). Compute the moment generating function of a Bernoulli random variable with parameter p.

Solution: We have
$$M(t) = E(e^{tX}) = e^{t \cdot 1} p + e^{t \cdot 0}(1 - p) = e^t p + (1 - p).$$

Example 5.2 (Binomial random variable). Compute the moment generating function of a binomial random variable with parameters n and p.

Solution: We have
$$M(t) = E(e^{tX}) = \sum_{k=0}^{n} \binom{n}{k} e^{tk} p^k (1 - p)^{n - k} = \sum_{k=0}^{n} \binom{n}{k} \left(e^t p\right)^k (1 - p)^{n - k} = (e^t p + 1 - p)^n.$$


Example 5.3 (Uniform random variable). Compute the moment generating function of a uniform random variable on the interval (0, 1).

Solution: We have
$$M(t) = E(e^{tX}) = \int_{\mathbb{R}} e^{tx} f_X(x)\,dx = \int_{0}^{1} e^{tx}\,dx = \frac{1}{t} e^{tx}\Big|_{0}^{1} = \frac{1}{t}\left(e^t - 1\right).$$

Now notice that
$$M'(t) = \int_{\mathbb{R}} \frac{d}{dt} e^{tx} f_X(x)\,dx = \int_{\mathbb{R}} x e^{tx} f_X(x)\,dx.$$
Consequently, [***
$$M'(0) = \int_{\mathbb{R}} x f_X(x)\,dx = E(X).$$
***] We also have
$$M''(t) = \int_{\mathbb{R}} \frac{d^2}{dt^2} e^{tx} f_X(x)\,dx = \int_{\mathbb{R}} x^2 e^{tx} f_X(x)\,dx,$$
so [***
$$M''(0) = E(X^2).$$
***] In general,
$$M^{(k)}(0) = E(X^k),$$
and the name "moment generating function" becomes clearer. As a consequence, using Taylor's theorem, we have
$$M(t) = \sum_{k=0}^{\infty} \frac{E(X^k)}{k!} t^k.$$
Of course, in the case of discrete random variables, we have the exact same result.
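For the uniform example above, $M(t) = (e^t - 1)/t$ and the identity $M'(0) = E(X) = 1/2$ can be checked with a finite difference. A Matlab sketch:

% Sketch: numerical check that M'(0) = E(X) = 1/2 for X ~ Uniform(0,1)
M   = @(t) (exp(t) - 1) ./ t;      % moment generating function computed above
h   = 1e-5;
dM0 = (M(h) - M(-h)) / (2*h);      % central-difference approximation of M'(0)
disp([dM0 0.5])                    % should be close to 1/2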

We have

Lemma 5.1. If X, Y are independent, then
$$M_{X+Y}(t) = M_X(t) M_Y(t).$$

Proof. We have
$$M_{X+Y}(t) = E(e^{t(X+Y)}) = E(e^{tX} e^{tY}) = E(e^{tX}) E(e^{tY}) = M_X(t) M_Y(t).$$

Another interesting choice for g is

Definition 5.2. Given a random variable X, we define the characteristic function as
$$\phi(t) = E(e^{itX}), \qquad i = \sqrt{-1}.$$

One step forward 5.2. The characteristic function is related to the Fourier transform.


Example 5.4 (Bernoulli random variable). Compute the characteristic function of a Bernoulli random variable with parameter p.

Solution: We have
$$\phi(t) = E(e^{itX}) = e^{it \cdot 1} p + e^{it \cdot 0}(1 - p) = e^{it} p + (1 - p).$$

Example 5.5 (Binomial random variable). Compute the characteristic function of a binomial random variable with parameters n and p.

Solution: We have
$$\phi(t) = E(e^{itX}) = \sum_{k=0}^{n} \binom{n}{k} e^{itk} p^k (1 - p)^{n - k} = \sum_{k=0}^{n} \binom{n}{k} \left(e^{it} p\right)^k (1 - p)^{n - k} = (e^{it} p + 1 - p)^n.$$

We have

Lemma 5.2. If X, Y are independent, then
$$\phi_{X+Y}(t) = \phi_X(t) \phi_Y(t).$$

Proof. We have
$$\phi_{X+Y}(t) = E(e^{it(X+Y)}) = E(e^{itX} e^{itY}) = E(e^{itX}) E(e^{itY}) = \phi_X(t) \phi_Y(t).$$

[*** For the joint characteristic function φ(s, t) defined below, the analogous factorization φ(s, t) = φX(s)φY(t) is actually an 'if and only if' characterization of independence. ***]

Lemma 5.3. Given a random variable X, the random variable Z = aX + b has characteristic function given by

$$\phi_Z(t) = e^{itb} \phi_X(at).$$

Proof. We have
$$\phi_Z(t) = E(e^{it(aX+b)}) = e^{itb} E(e^{itaX}) = e^{itb} \phi_X(at).$$

Lemma 5.4. Given a random variable X, its characteristic function $\phi_X$ verifies
$$\phi_X(0) = 1, \qquad |\phi_X(t)| \le 1 \quad \forall t,$$
and $\phi_X$ is continuous.

Proof. We have
$$\phi_X(0) = E(e^{i0X}) = 1,$$
and
$$|\phi_X(t)| = |E(e^{itX})| \le E(|e^{itX}|) = E(1) = 1.$$


The continuity is obtained using that
$$|\phi_X(t + h) - \phi_X(t)| \le E\left(|e^{itX}(e^{ihX} - 1)|\right) \to 0$$
as $h \to 0$.

Definition 5.3. Given two random variables X, Y, we define the joint characteristic function as
$$\phi(s, t) = E(e^{isX} e^{itY}), \qquad i = \sqrt{-1}.$$

[*** Notice that we have two variables, s, t!! ***]

Finally, we are going to state a very interesting (and delicate) result.

Theorem 5.1. Let M(t) be the moment generating function and φ the characteristic function of X. Then, for any a > 0, the following conditions are equivalent:

• |M(t)| < ∞ for |t| < a;

• φ is analytic on the strip |Im z| < a;

• the moments $E(X^k)$ exist (and satisfy a suitable growth criterion).

[*** Basically this theorem says that, under any of these three conditions, M may be extended to complex arguments and we have
$$\phi(t) = M(it).$$

***]

The main point of the characteristic function $\phi_X$ is that knowing it is equivalent to knowing the distribution.

Theorem 5.2. Let X be a continuous random variable with characteristic function $\phi_X$. Then the probability density function is
$$f_X(x) = \frac{1}{2\pi} \int_{\mathbb{R}} \phi_X(t) e^{-itx}\,dt$$

(as long as everything is well defined).

In particular, [*** if X, Y are two random variables with the same characteristic function, then they have the same distribution. ***]

The convergence of functions is a rather complicated issue. Let's start with the basics:

Definition 5.4. We say that the sequence of distributions $F_i$ converges towards the distribution F if, for every point x where F is continuous, we have
$$F(x) = \lim_{i \to \infty} F_i(x).$$

Definition 5.5. We say that a sequence of random variables $X_i$ converges IN DISTRIBUTION to X if their distribution functions converge:
$$F_i \to F.$$


Then we have

Lemma 5.5. Let $F_i$ have characteristic function $\phi_i$ and let F have characteristic function φ. If $F_i \to F$, then $\phi_i \to \phi$.

and, conversely,

Lemma 5.6. Let $\phi_i$ be a sequence of characteristic functions (for random variables with distributions $F_i$) converging to φ. If φ is continuous at t = 0, then φ is the characteristic function of a random variable with some distribution F, and $F_i \to F$.

Now we state two main results in probability:

Theorem 5.3 (Law of large numbers). Let $X_i$ be a sequence of independent random variables with the same distribution and finite mean µ. Then
$$\frac{S_n}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n} \to \mu$$
in distribution.

and

Theorem 5.4 (Central limit theorem). Let $X_i$ be a sequence of independent random variables with the same distribution, finite mean µ and finite variance σ². Then
$$\frac{S_n - n\mu}{\sqrt{n\sigma^2}} = \frac{(X_1 + X_2 + \cdots + X_n) - n\mu}{\sqrt{n\sigma^2}} \to N(0, 1)$$
in distribution.
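Both theorems are easy to visualize by simulation. The following Matlab sketch (the values of p, n and the number of samples are arbitrary) standardizes sums of Bernoulli(p) variables and compares their histogram with the standard normal density:

% Sketch: illustration of the central limit theorem with Bernoulli(p) summands
p = 0.3; n = 1000; m = 5000;                 % hypothetical parameters
X = (rand(m, n) < p);                        % m independent samples of n Bernoulli(p) trials
S = sum(X, 2);                               % the sums S_n
Z = (S - n*p) ./ sqrt(n*p*(1 - p));          % standardized sums
[counts, centers] = hist(Z, 40);             % empirical histogram
binw = centers(2) - centers(1);
bar(centers, counts / (m * binw)); hold on   % normalize the histogram to a density
zz = linspace(-4, 4, 200);
plot(zz, exp(-zz.^2/2) / sqrt(2*pi), 'r')    % standard normal density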
