
Lecture Notes: Discrete Probability (ST111 Probability A)

Warwick University, 2010

Julia Brettschneider

Version β.1 (16.4.2010)


What others have said about probability

Many returning Warwick mathematics graduates: Why didn't someone tell us that probability and statistics are the key mathematical subjects in applied quantitative work?

A randomly selected first year student: Given I passed the exam, what is the probability I studied for it?

Another first year student: Is there any way I can buy that hat off you?

Another student: You keep telling me to read books.

A rumour: The exam will only be taken by a sample of 40 students selected at random from this class.

A coin: This module tossed me around in my sleep!

Pierre Simon Laplace: It is remarkable that a science which began with the consideration of games of chance should have become the most important object of human knowledge.

Niels Bohr: Einstein, stop telling God what to do with his dice.

Stephen Jay Gould: Misunderstanding of probability may be the greatest of all impediments to scientific literacy.

ebay: Shadowfist CCG Probability Manipulator SE Rare. 0 Bids. £0.99 + Postage £0.75. 12h 21m.

Oscar Wilde: Always be a little improbable.


1 Preface

Lecture notes

It is very difficult to simultaneously listen to a lecturer, read what is being written on the blackboard and copy the essentials of both sources. First year modules are a good opportunity to experiment with note taking strategies, because you can get hold of most of the material (and more) from textbooks and/or lecture notes.

Ω: What are you talking about? Thatt fonny ackzenntt iss a gott reesonn to giff us de stuff in writingg! Ah, sorry, I forgot to introduce myself. I'm a rather polite fellow under regular circumstances, actually. My name is Omega. I have this little sister omi. She is small, and very funny. Well, omi is actually her nickname. Her real name is (small) omega, like this ω. We have Greek names, because a lot of mathematical objects have Greek names.

May I continue my lecture please? I do remember getting very frustrated in my first year as an undergraduate student about having to take notes. There were no lecture notes, and there were these indistinguishable indices i, j and ι in the form of 1cm high wet chalk sculptures on a dusty blackboard. I now have to acknowledge that my own blackboards have, in the past, missed the qualification for the RBBCCM [1]. I will do my best to write sufficiently big and readable and to speak loud enough. If I forget, try waving or shouting.

This is the first time I'm giving probability lectures to Warwick students and I'm trying to optimally adapt existing material and methods to both the Mathematics and the MORSE students. These typed lecture notes are work in progress. I will be grateful to you for pointing out any kinds of errors as well as for making any other relevant suggestions (can this be explained better in a different way? anything missing? etc.)

These lectures are based on a number of other sources. Firstly, there are previous versions of this module taught by my colleagues. I thank Saul Jacka for sharing his lecture notes with me – some parts are taken straight from his notes – and I thank Roger Tribe for providing a sketch about content and motivation for his lectures. Secondly, some of the fascinating probability textbooks; see below. Thirdly, my notes from teaching at University of California at Berkeley, Technical University Berlin and Humboldt University Berlin. Finally, I have always been inspired by the lectures of Hans Föllmer, who taught me probability in the first place. I thank my daughter Chaya as well as Charlie and Lola [2] for fascinating discussions on the topic.

My grandmother passed away unexpectedly in the first week of this term. That reminded me that the only practical experience I have with chance games is playing dice and cards with her in my childhood. It also reminds me of a children's set theory box my grandparents gave to me. In the wake of the 1960s New Math [3] movement, ∩ and ∪ between sets of colourful triangles, squares and circles had become a regular part of the primary school mathematics curriculum and eventually found their way into toy stores. The same goes for Greek letters. My grandmother would have enjoyed Ω and ω. I chose ω's nickname omi in memory of my grandmother.

[1] Royal Blackboard Beauty Contest for the Counties of the Midlands
[2] http://www.bbc.co.uk/cbeebies/charlieandlola/stories/
[3] get the idea at http://www.youtube.com/watch?v=I8aW4YuFSiY&feature=related


Websites

Confusingly enough, you may come across a few different websites when searching for this module:
http://www2.warwick.ac.uk/fac/sci/statistics/courses/modules/year1/st111/

http://www2.warwick.ac.uk/fac/sci/statistics/courses/modules/year1/st111/resources

The first one is very general and is intended to help students with their module choice or with anticipating the role of this module within the course programme. The second website, the resources website, is the one that is relevant for you now. There, I post lecture notes, exercise sheets, information about the module organisation, old exams, computer code and interesting links about probability. In previous years, some of the lecturers have used the MathStuff website by the Warwick Mathematics Department to post exercises, lecture notes, old exams and other information. This could still be useful for you and you can find it there under ST111 > Module Pages > Archived Material; or at this address:
http://mathstuff.warwick.ac.uk/ST111/archive

Finally, there is my website:
http://www2.warwick.ac.uk/fac/sci/statistics/staff/academic/brettschneider/

Textbooks

Please check out the list on the resources website. Which one is best? Some people say the one by Ross tends to more or less suit the majority of students. The other obvious one is the textbook by Pitman.

By the way, have you experienced the law of the second source? The second source where you read about a piece of mathematics new to you tends to be the one that explains it better – regardless of the order of the sources. Interestingly, there is no such law for the third source. In fact, there seems to be a k ≥ 3 such that while reading the kth source you end up being more confused than while reading the first one, again regardless of the order of the sources. Without getting to that point you will never know how well you had already understood the material before getting hold of source k. For this and many other reasons,

∑_{i=1}^{∞} "read another book"

Picture sources

The top cover picture showing the foyer of the Warwick Mathematics and Statistics Department, known as The Street, is from
http://www.maths.warwick.ac.uk/general/news/images/wall-400.jpg

Other picture sources are mentioned in the figure captions. If no source is mentioned, such as for the bottom picture on the cover, I made the picture. For simulations I am typically using the statistical programming language R, which is publicly available at
http://www.r-project.org


The one most frequently asked question

How do I study for the exams? There is a lot of advice on the resources website. In a nutshell:

∑_{i=1}^{∞} "do another exercise"

Synopsis

Week I

Lecture 1: Randomness, probability as mathematical discipline
Lecture 2: A few questions (birthday problem, door problem, random sequences)

Week II

Lecture 3: Terminology for random events, classical examples for random experiments
Lecture 4: Classical probability
Lecture 5: Combinatorics

Week III

Lecture 6: Axioms of probability
Lecture 7: Conditional probability
Lecture 8: Total probability theorem, Bayes theorem

Week IV

Lecture 9: General multiplication rule, independence, binomial distribution
Lecture 10: Geometric distribution, negative binomial distribution
Lecture 11: Poisson approximation to the binomial, Poisson distribution

Week V

Lecture 12: In-class test
Lecture 13: Random variables
Lecture 14: Expectation, variance

Have a great journey into probability spaces!


2 Motivation

2.1 Observing random phenomena

We start by looking at a few pictures from the real world that exhibit random phenomena. The slide show used during this lecture can be downloaded from the ST111 resources website. It includes ice crystals, animal patterns, genetics, traffic, networks, statistical mechanics, stock markets, gambling, knitting and more.

The common definition of mathematics as the science of patterns goes back to G. H. Hardy, if not further. Number theorists look for patterns in the integers, topologists in shapes, analysts in motion and change, and so on.

Look at the bottom picture on the cover sheet of these lecture notes. Are the black and white dot sequences regular or not? In which sense (not)?

Row 1: Yes, it is constant.
Row 2: Yes, it is periodic.
Row 3: No, but almost. It is the sequence from row (2) with a few errors.
Row 4: No, there seems to be no rule that generates this sequence.
Row 5: No, there seems to be no rule that generates this sequence.

Question 2.1. Statistical regularity in dot sequences. The sequences in Row 4 and Row 5 clearly are not regular in the classical sense: There is no deterministic rule that generates them. But are they statistically regular in some sense? And if so, do they have the same kind of statistical regularity in the sense that they show similar statistical characteristics? One of them has actually been generated by flipping a fair coin 100 times and printing a white dot for head and a black dot for tail. Can you say which one? And why do you think so?
See the end of this section for an answer.

ω: But excuse me, I do not like black and white. My very most favourite colour is totally tomato red and my other favourite colour is bitter chocolate brown with lime green sparkles in it and YOU should really try this link: http://biscuitsandjam.com/stripe_maker.php

And this one about statistical mechanics makes some pretty shapes, too:
http://physics.ucsc.edu/~peter/ising/ising.html

Ω: Speaking of black and white, have you heard that half of the students taking this module loves us and the other half hates us? Don't cry, omi, you know as well as me that we can be quite annoying and silly... I have decided that the most sensible step for us is to turn grey and to shrink a bit, so we do not overstay our welcome.

2.2 Studying random phenomena

It seems desirable to get more insight into the phenomenon of randomness, but what kind of discoveries can we expect from studying something as inconceivable as randomness?

The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful, none of which (fortunately) we have to reason on. Therefore the true logic for this world is the calculus of probabilities, which takes account of the magnitude of the probability which is, or ought to be, in a reasonable man's mind.
— James Clerk Maxwell

How dare we speak of the laws of chance? Is not chance the antithesis of all law?
— Joseph Bertrand, Calcul des probabilités

Why would anyone want to study randomness? There are a number of reasons including

• describing and understanding patterns in random phenomena,

• unravelling scientific principles,

• predicting and forecasting events,

• learning and developing degrees of belief about events,

• quantifying and comparing risks,

• taking decisions under uncertain circumstances.

Probability theory provides a formal theory to describe and analyse random phenomena.

2.3 Probability is pure mathematics (with some probability p > 0)

Probability can be practiced as pure mathematics. The birth of probability theory as a mathematical discipline is often associated with the axiomatisation the field underwent during the twenties and early thirties of the 20th century.

It borders other fields in mathematics. Most obvious are the overlaps with measure theory, analysis, functional analysis, PDEs, geometry, combinatorics, computing, ergodic theory, number theory and mathematical physics. The last International Congress of Mathematicians (2006 in Madrid) recognised and honoured such connections. Andrei Okounkov from Russia received a Fields Medal for his contributions bridging probability, representation theory and algebraic geometry. Wendelin Werner from France received a Fields Medal for his contributions to the development of stochastic Loewner evolution, the geometry of two-dimensional Brownian motion, and conformal field theory.

The probability group at Warwick is among the leading ones in the UK. Their major home is P@W; see http://www2.warwick.ac.uk/fac/sci/statistics/paw/

2.4 Probability is applied mathematics, with some probability q > 0

Probability can be practiced as applied mathematics. (To keep things open we will not impose p + q = 1.)

The instrument that mediates between theory and practice, between thought and observation, is mathematics; it builds the bridge and makes it stronger and stronger. Thus it happens that our entire present day culture, to the degree that it reflects intellectual achievement and the harnessing of nature, is founded on mathematics.

— David Hilbert, radio speech (1930)

Probability theory has been serving the human mind's inquiries and activities in many areas. Besides statistics (see below) these include the following:

Physical world: e.g. astronomy (in particular, theories about measurement error), mechanics, statistical physics, quantum mechanics.


Living world: e.g. integrative biology, evolution, genetics, genomics, medicine, neuroscience.

Social world: e.g. economics (in particular, financial markets), sociology, demography.

Engineering: e.g. computer science (in particular, machine learning), electrical engineering, risk assessment, reliability, operations research, information theory, communication theory, control theory, traffic engineering.

Humanities: e.g. epistemology, determinism, philosophy of language, philosophy of religion, philosophy of logic, political philosophy, belief systems.

Games of chance: e.g. roulette, dice, card games.

Arts: e.g. stochastic music, visual art (inspired by chaos theory and fractals, patterns simulated by stochastic algorithms).

Probability theory is based on a set of axioms that can be summarised in just a few lines. What makes it applicable to so many different kinds of applications is a huge repertoire of probabilistic models. For example: Classical discrete probability models in games of chance, genetic inheritance, quality control, measurement error, electrical circuits and lattice models for particle interaction. Discrete time stochastic processes (often involving a certain depth of dependency on the past) in queuing, coding and stochastic composition algorithms. Continuous stochastic processes for particle motion and stock prices. Probabilistic networks in sociology, mobile phone networks and genomics.

Before trying our models in the real world, let us state some thoughts on the application of probability theory from Saul Jacka's lecture notes: "It is a very powerful language, for if we accept a few simple propositions, such as involve mapping a practical situation on to some axioms, the whole battery of a theory's deductions follow as true, conditional upon the validity of that mapping. Of course, if that first mapping is nonsense then, no matter how good our mathematics is, the result is almost certainly rubbish and can have disastrous practical consequences."

2.5 Probability as the mathematical foundation of statistics.

A quick note: You will need every bit of Probability A and Probability B (and more) to understand Mathematical Statistics A and B in your second year. The latter modules are the basis for a lot of exciting modules, theoretical and applied, ahead of you in the final years. Please make sure you are on top of the probability modules.

Probability and statistics are linked by studying similar objects from different perspectives: Probability starts out with a model for a random experiment that describes potential outcomes and assumes certain basic parameters. Probabilities for events of interest are deduced from these model assumptions. For statistics, the starting point is the observed outcomes (aka observations or data) of an experiment that has already been performed. The task is to infer characteristics of an optimal model for the unknown (or only partially known) mechanism of the experiment. That means, we are trying to find the model that is, among a certain class of models, most likely to have created the observations.

“Probability and statistics used to be married, then they separated, then they got divorced and now they hardly see each other,” says David Williams. At Warwick they have moved on to be good friends offering a wide range of research activities. They cover any possible blend of the two disciplines from pure probability to applied statistics and extending to interdisciplinary collaborations with other Warwick research groups such as Complexity, Systems Biology and Finance.

2.6 How are we going to study probability in this module?

In his classical book An Introduction to Probability Theory and its Applications, William Feller stresses that in probability, as in other mathematical disciplines, we must distinguish three aspects of the theory:

(a) the formal logical content,

(b) the intuitive background,

(c) the applications.

He continues: "The character, and the charm, of the whole structure cannot be appreciated without considering all three aspects in their proper relation."

Considering both the history of probability and the fact that this is a first year module, we start with models for equally likely outcomes. This approach is minimalistic in the formal sense but already provides enough mathematical foundation to properly dive into some interesting examples to motivate probability as a field. While some students are extremely good at solving probability problems intuitively, others prefer to see a more formal approach first, which is why we will then introduce the modern axiomatic approach to probability, including the notions of σ-algebras, probability measures, and random variables. Besides these objects, the mathematical theory of probability includes crucial concepts such as conditioning and independence, expectations and variances. In the easier examples, the introduction of such formalism may feel like breaking a butterfly on a wheel. On other occasions it serves as an introduction to the formalism needed for more complex questions, some of which even have counterintuitive answers. Along with the theory we will get to know classical probability distributions such as Bernoulli, uniform, binomial, geometric and Poisson. All of this will be motivated by real world questions and illustrated by further applications and by examples constructed from probabilistic experiments (such as coin tossing or gambling examples).

2.7 Some examples to surprise and enthuse you

ω: I am sure that this is a typo. It should say confuse you.

Ω: Close your eyes and ears if you’re worried.

ω: But confusion completely does not scare me. When I am big I will be working as a researching researcher. And also it is your birthday and Mum said I should help you make sure you have an extremely lovely, happy birthday, so that's why I will stick around.

Ω: OK. Thank you omi. Let us sort out a problem together then. . .

2.7.1 Birthday problem

Question 2.2. Suppose there are r students in this class. What is the probability that at least two students in the class have the same birthday?

ω: But I need to know everybody’s birthday first to find out whether any two are on the same day!

Ω: Listen, in the question they don't care about exactly which students these are. They just want to know what are the chances of two identical birthdays in whatever class of r people.


ω: But when it's your birthday that means you are really special. How can you feel really special if all you know is your odds of having a birthday today are, say, one in two, or one in a million or whatever?

Ω: Don't be silly. I'm just curious to find out how likely it is that a random class of r students has at least two students in it with identical birthdays.

ω: You are the one being silly. A random class is what our school teacher calls us when we are naughty.

Ω: Not that kind of random. I mean a class picked at random from all the classes of r students in the universe.

ω: Universe, outer space, probability space – you are just making up funny words. How is the universe going to help you out? You do not even know who is out there. Maybe on Mars everybody is born in May, and on Saturn everybody is born on a Saturday, and in the Milky Way everybody is born on a Lucky Day!

Ω: OK. That is a very good point. We need a realistic probability space to answer this question on Earth. I will make the following assumptions:

• Every year has 365 days.

• Every student is equally likely to be born on any of these days.

• Birthdays are independent of each other.

ω: But excuse me, there are leap years and everybody knows there are more birthdays during some parts of the year and twins have the same birthday.

Ω: You're absolutely right, but I just want to start somewhere and my assumptions still capture the essence of the problem. Once we have solved it this way we can see how important the assumptions are in the calculation and then tackle it with a different set of assumptions.

ω: There are 23 kids in my class and they are all very special. I'm not worried that any two of us have the same birthday. The risk Alphie is born on my birthday is only 1 in 365, for Betty it is also 1 in 365, and so on, makes 22/365. Now, Alphie makes the same calculation for her, and so does Betty and so on, makes 23 times 22/365.

Ω: That is about 1.39. In other words 139%. Did you know that probabilities cannot be bigger than 100%?

ω: OK, I see, it's just that I counted some things twice, like Alphie on the same day as me is the same as me on the same day as Alphie. I will try to answer that question again. . . .

Ω: I know you like questions. But this is actually my question. I was just going to look at a simpler version of this problem. There are three students in a class and we want to know what are the chances that at least two of them have their birthday in the same month. I will call the months a, b, c, . . . , l. Now I will figure out all possibilities for the birthday months of the three students and count how many of them have at least two identical months in it: aaa, aab, aac, . . . , aal obviously all belong in this group, makes 12. Now, aba, abb, abc, . . . , abl. Of those, only two have at least two identical months. Same for all the ones starting with ac or ad and so on up to al. So far, we have 12 plus 2 times 11. Now, there are baa, bab, . . . , bal and bba, bbb, . . . , bbl. . .

Problem solving technique: Consider all options.

ω: And bla, bla, bla and. . . Omega, I am hungry, and there’s got to be a better way for this!

Answer to Question 2.2: We make three assumptions.

(i) The year has n = 365 days. This is correct for 3 in 4 years; otherwise it is almost correct.

(ii) Every student in this class is equally likely to be born on any day of the year. In fact, there is a surge of birthdays in autumn in the UK, but it's not that big and we will neglect it for now. To be sure the model suits, we would also need to check out the data from overseas to account for the students not born here. Finally, we need to think about how the class was selected from the general population and whether this might in any way be affected by birthdays. It is reasonable to assume this is not the case for this class, but it could be different. For example, if the university had identification numbers involving birthdays and then split the students taking the probability module into classes based on identification numbers, rather than based on degree as currently done.

(iii) The birthdays are independent of each other. Again, a reasonable assumption for this class, but it could be different in a situation where the class was composed by a different process.

We order the students in some way. With the above assumptions our model is that the birthdays are equally likely to form any ordered r-tuple with numbers chosen from {1, 2, . . . , n}.

We want to compute the probability p_{n,r} that there are at least two students with the same birthday among the r students in the class. Obviously, if r > n this has to happen, so p_{n,r} = 1. Otherwise, we can compute it by dividing the number of r-tuples containing repeats by the total number n^r of r-tuples from a set of n elements.

Omega attempted to list all possible r-tuples, identify which of them have repeats in them and count those. Whereas this is a correct way of solving the problem, it is more efficient to just list and count those r-tuples that do not have any repeats in them. This will allow us to compute the probability q_{n,r} of the opposite event, that is, that no two students have the same birthday.
Problem solving technique: Check whether the opposite event is easier to handle.

In how many ways can birthdays be assigned without ever repeating one? Let the first student have whatever birthday. Avoiding that day for the second student leaves n − 1 options. The third student has to avoid both previously taken birthdays, which leaves n − 2 choices. The fourth has n − 3 choices and so on, up to the n − r + 1 choices for the rth student. So we have

q_{n,r} = n · (n − 1) · (n − 2) · . . . · (n − r + 1) / n^r,

which yields

p_{n,r} = 1 − n · (n − 1) · (n − 2) · . . . · (n − r + 1) / n^r.

Let us look at some numerical results:

p_{365,10} ≈ 11.7%   p_{365,40} ≈ 89.1%
p_{365,20} ≈ 41.1%   p_{365,50} ≈ 97.1%
p_{365,30} ≈ 70.6%   p_{365,60} ≈ 99.4%

In particular, we can see that the smallest number of students r such that the probability of having at least two of them with the same birthday exceeds 1/2 is 23:

p_{365,22} ≈ 47.6%   p_{365,23} ≈ 50.7%
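For readers who want to check these values themselves, here is a minimal sketch in R (the language mentioned in the preface for simulations); the function name birthday_prob is just an illustrative choice, not part of the notes.

```r
# Minimal sketch: p_{n,r} computed from the formula above,
# assuming n = 365 equally likely and independent birthdays.
birthday_prob <- function(r, n = 365) {
  if (r > n) return(1)
  q <- prod((n - 0:(r - 1)) / n)   # q_{n,r}: probability of no repeated birthday
  1 - q                            # p_{n,r} = 1 - q_{n,r}
}

round(100 * sapply(c(10, 22, 23, 30, 60), birthday_prob), 1)
# these percentages agree with the table above (e.g. about 50.7 for r = 23)
```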

2.7.2 Door problem

An old kind of problem that became famous worldwide via the Monty Hall game show in 1990. You are being shown three closed doors. Behind one of the doors is a car; behind each of the other doors is a goat. You choose one of the doors. The show's host, who knows which door conceals the car, opens one of the two remaining doors which he knows will definitely reveal a goat. Then he asks you whether or not you want to switch your choice to the remaining closed door. What are you going to do? Here are some attempts to answer this question.


(i) Either the prize is behind the door you bet on originally or behind the other still closed door, which makes it a fifty-fifty chance for each, so it doesn't matter whether you change or not.

(ii) All doors were equally likely originally. That is not going to change because of whatever the show's host did. So it doesn't matter whether you change or not.

(iii) Imagine the same problem but with 100 doors instead of 3. Behind one of the doors is a car, behind all the other 99 doors is a goat. The show's host knows where the prize is and opens all doors but the one you picked and one other one. Now it seems intuitive that you would want to switch.
Problem solving technique: Exaggerating the original question helps to reveal the essential characteristics of a problem. Often, the qualitative aspects of the answer to the exaggerated question can be carried over to answer the original question.

(iv) We compare the two different strategies stick (with your original choice) and switch. There are two possibilities:
Case 1: Original choice is the door with the car. If you switch, you will get a goat.
Case 2: Original choice is a door with a goat. If you switch, you will get the car.
The probability for Case 1 is 1/3, the probability for Case 2 is 2/3. If your strategy is stick then your chance of winning the car is 1/3. If your strategy is switch then your chance of winning the car is 2/3.
Problem solving technique: Distinguishing cases allows us to find solutions of the problem under more constrained conditions. In probability, it is usually done to get rid of (some of) the randomness.

(v) A qualitative approach. The show's host knows where the car is and opens, on purpose, a door that reveals a goat. If you make use of that information it should increase your chances of getting the car. At least it should not decrease it.

(vi) A variation of the problem. The show's host does not know where the car is, he just happens to reveal a goat when he opens the door. Does this change your answer to the problem?
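Here is a hedged simulation sketch in R of approach (iv): since the host always reveals a goat, switching wins exactly when the original choice was a goat, so the host's move need not be simulated explicitly. The function name monty_hall is illustrative.

```r
# Sketch: estimating the winning chances of the stick and switch strategies.
monty_hall <- function(n_games = 100000) {
  car    <- sample(1:3, n_games, replace = TRUE)  # door hiding the car
  choice <- sample(1:3, n_games, replace = TRUE)  # contestant's first pick
  c(stick  = mean(choice == car),   # sticking wins iff the first pick was the car
    switch = mean(choice != car))   # switching wins iff the first pick was a goat
}
monty_hall()
# roughly stick = 0.33 and switch = 0.67, matching the 1/3 vs 2/3 argument
```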

2.7.3 Two short questions

Question 2.3. Which one of the following birth orders in a family with six children is more likely, or are they the same? (G means girl, B means boy.)

GBGBBG BGBBBB

Question 2.4. A six-sided die with 2 red and 4 green faces is thrown 20 times. You have to bet on the occurrence of one of the following patterns. Which one would you choose?

(i) RGRRR   (ii) GRGRRR   (iii) GRRRRR

Answers to these two questions are given below. Before you read them, try to come up with your own answers.

About Question 2.3: Both are equally likely. Experiments with people who have no training in probability have shown that they tend to believe that the first order is more likely than the second order. The second sequence is perceived to be too regular. The answers suggest that the respondents compare the frequency of families with 3 girls and 3 boys with the frequency of families with 5 boys and 1 girl. Yet both exact orders of birth, GBGBBG and BGBBBB, are equally likely, because they both represent one of 64 equally likely possibilities. (See Kahneman and Tversky, 1972b, p.432; see Chapter 5, Neglecting exact birth order.)

About Question 2.4: Your best bet is (i). Most people bet on (ii), because G is more likely than R and (ii) has two Gs in it. Yet (i) is more likely, simply because RGRRR is nested in GRGRRR. This is an example of committing the conjunction fallacy without realising that there is a conjunction. (See Kahneman and Tversky, 1983, p.303, Failing to detect a hidden binary sequence.)

The conjunction fallacy has been studied by many researchers using the example about the (hypothetical) person Linda. The description reads:

Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.

Then people are asked to rank statements about Linda by their probability, in particular the following two:

Linda is a bank teller.
Linda is a bank teller and is active in the feminist movement.

People assigned higher probabilities to the second statement. This is not in tune with the mathematical rules about probability and has stimulated a discussion about how real people deal with probabilities. (See Kahneman and Tversky, 1982b, p.92.)

ω: People's minds don't follow the usual probability rules, small particles do not follow them either; are they of any use?

Ω: Oh yes, they work in many situations. Besides, I believe that human minds do follow them for the most part, it's just that we enjoy the examples where we're getting mixed up. And these fractions of elementary particles, well, that's why quantum probability was invented.

About Question 2.1: Most likely, the second one (Row 5) is the coin toss. Does the first one look more random to you? Well, it does switch more often between black and white, and we seem to take that as a sign of proper randomness. But could this be too much of a good thing? Considering that staying with the same colour is as likely as switching to another colour, we would expect that a sequence of 100 coin tosses has about 50 colour changes. However, there are 66 colour changes in the first one (Row 4) and 46 in the second (Row 5). While we cannot say so with certainty, we have argued that the difference in the number of observed colour changes makes Row 5 much more likely to have been generated by coin tossing than Row 4.

This is an example of the thinking used in hypothesis testing, a branch of statistics. This particular argument is used in the runs test, which is used to check the independence assumption.
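The colour-change count is easy to reproduce in R; this is a small sketch under the fair-coin assumption, and the seed is an arbitrary choice for reproducibility.

```r
# Sketch: number of colour changes in 100 simulated fair coin tosses
# (1 = white dot, 0 = black dot), as in the comparison of Rows 4 and 5.
set.seed(111)                          # arbitrary seed
tosses  <- sample(c(0, 1), 100, replace = TRUE)
changes <- sum(diff(tosses) != 0)      # switches between neighbouring dots
changes                                # typically near 50, rarely as high as 66
```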

3 Models for equally probable events

3.1 Terminology for random experiments

The set of all outcomes is called the outcome space or sample space and is usually denoted by Ω. A generic outcome is denoted by ω. Before the experiment, ω is not known; after the experiment, it is known. However, we will often calculate just probabilities for ω to belong to a certain subset of Ω. These subsets are called events.

3.1.1 Some classical random experiments

Example 3.1. Toss a coin and see what it is facing. Ω = {h, t} with h for the coin facing head and t for the coin facing tail.

Example 3.2. Roll a die and observe the number of dots it faces. Ω = {1, 2, 3, 4, 5, 6} and events can be any subsets of Ω. For example, "rolling an odd number" is {1, 3, 5}, "rolling a number bigger than 4" is {5, 6} and "rolling a six" is {6}.

Example 3.3. The sides of an icosahedral die are 20 equilateral triangles, all of the same size, with the numbers 1 to 20 written on them. Roll an icosahedral die and observe the number it faces. Ω = {1, 2, 3, . . . , 20} and events can be any subsets of Ω. For more information on different kinds of dice and a nice picture of all the Platonic solid shaped dice you may look under "dice" in Wikipedia.

Example 3.4. There are several ways of modelling the experiment of tossing two coins. For example:
(i) Observe which sides are up. This corresponds to Ω = {hh, ht, th, tt}.
(ii) Count the total number of heads. This corresponds to Ω_heads = {0, 1, 2}.
Note that (i) contains more information than (ii).

Example 3.5. The deck most often seen in English-speaking cultures, and common in other countries where the deck has been introduced, is the Anglo-American poker deck. This deck contains 52 unique cards in the four French suits, Spade (♠), Heart (♥), Diamond (♦) and Club (♣), and thirteen ranks running from two (deuce) to ten, Jack, Queen, King, and Ace. Draw a card from the deck and observe the suit and the rank.
For simplicity, introduce the coding Ace=1, Jack=11, Queen=12, King=13 and choose Ω = {Sk | S ∈ {♠,♥,♦,♣}; k = 1, 2, . . . , 13}. Events can be any subsets of Ω, for example, "an Ace" = {♠1,♥1,♦1,♣1}, or "a Spade" = {♠k | k = 1, 2, . . . , 13}.

Alternatively, we can define outcome spaces that reflect only a partial observation of the characteristics of the cards. For example Ω_suit = {♠,♥,♦,♣} or Ω_number = {1, 2, . . . , 13}.

Example 3.6. A box contains a finite number n of tickets enumerated 1, 2, . . . , n. Draw one at random. The standard choice for an outcome space is Ω = {1, 2, . . . , n}. Events can be subsets of Ω, for example, "a 3" = {3}, "a number between 5 and 10" = {5, 6, 7, 8, 9, 10}, or "an even number" = {2k | k ∈ ℕ, k ≤ n/2}.

Note that the previous example is a general form of a random experiment with finitely many outcomes. Examples 3.1 to 3.5 can be represented in this form. The next two examples are both infinite models. However, the first one is countable whereas the second one is continuous.

Example 3.7. Coin tossing until heads come up.
A coin is tossed until heads come up. The outcomes are all sequences consisting of some number of tails t and then ending with a head h, so the outcome space can be represented as Ω = {h, th, tth, ttth, tttth, ttttth, . . .}. Note that while there are infinitely many different outcomes, each individual outcome is finite. Some examples for events are: "it takes 3 trials to get a head", "it takes at least 3 trials to get a head", "it takes at most 3 trials to get a head".


Example 3.8. Infinitely many coin tosses.
A coin is tossed infinitely many times. In each toss, it is recorded which face shows up. The outcomes are all sequences consisting of h and t, so Ω = {h, t}^ℕ. Some examples for events are: "first toss is a tail", "at least 20 heads", "the pattern hhhthhh occurs at least once", "infinitely many heads".
Since there is a one-to-one map of the set of binary sequences onto the interval [0, 1], this corresponds to picking a number at random between 0 and 1, as random number generators do. It is now obvious that this is an example for a continuous outcome space.

In Examples 3.1 to 3.6 we can define probabilities representing a random experiment in a finite outcome space Ω = {ω_1, . . . , ω_n} with equally likely outcomes as follows:

P({ω_1}) = 1/n, P({ω_2}) = 1/n, . . . , P({ω_n}) = 1/n. (1)

ω: What about the infinite examples?

Ω: Mmh, I would say that in Example 3.8 I would use the approach for finitely many outcomes to define probabilities for events that just specify the first n tosses. Then let n go to infinity. Honestly, though, I have no idea what it means for probabilities to converge. These concepts will come up in the modules Probability B and Random Events...

ω: For Example 3.7 I'm sure about one thing; it is absolutely completely not possible to give every outcome the same chance. Why not? You just have to imagine they did. Then there was this number p > 0 and the probability that one of the first n outcomes occurred was n · p and the probability that one of the first n + 1 of them occurred was (n + 1) · p. And so on. That goes to infinity. But the probability that one of all of them occurred should be at most 1. So it just can't be! And this is why I'm never ever allowed to invite infinitely many friends to my birthday party. You cannot cut a cake into infinitely many equally big pieces.

To prepare for the next sections, review set theory notation.

In particular, recall de Morgan's rules.

3.2 Classical probability

The axioms for classical probability we will define below are only a slight extension of a model with equally likely, or equiprobable, outcomes as in (1). Classical probability models are based on symmetric characteristics of the random mechanism generating the outcomes. Often, these are physical characteristics such as a die with faces of the same shape or a box with balls of the same size. Calculations can usually be performed directly or indirectly using the equiprobable model.

Definition 3.9. Algebra of sets.
Let Ω be a set of points. A system A of subsets of Ω is called an algebra if
(a) Ω ∈ A,
(b) A, B ∈ A =⇒ A ∪ B ∈ A, A ∩ B ∈ A,
(c) A ∈ A =⇒ A^c ∈ A.

Example 3.10. Algebra generated by a partition.
The algebra A generated by a partition B_1, . . . , B_n of Ω consists of the empty set and all possible unions of any number of elements of the partition. A consists of 2^n subsets of Ω.

Definition 3.11. Classical measurable space.
Let Ω be a non-empty set of points and F the algebra generated by a partition B_1, . . . , B_n of Ω. Then (Ω, F) is called a classical measurable space. Ω is called the sample space and B_1, . . . , B_n are called basic events.


Definition 3.12. Axioms for classical probability.
(Ω, F, P) is called a classical probability space if the following two axioms are fulfilled:

C1: (Ω, F) is a classical measurable space.

C2: P : F −→ [0, 1] is a set function such that, for any event A ∈ F which comprises exactly k distinct basic events B_i, it is P(A) = k/n.

Obvious examples for classical probability spaces are Examples 3.1 to 3.6 with P defined as in (1).
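As a quick illustration of axiom C2, here is a sketch in R that counts basic events for the fair die of Example 3.2; the helper classical_prob is an illustrative name, not part of the notes.

```r
# Sketch: classical probability P(A) = k/n by counting basic events.
omega <- 1:6                                      # the six equally likely faces
classical_prob <- function(A) length(A) / length(omega)

classical_prob(c(1, 3, 5))   # "rolling an odd number": 3/6
classical_prob(c(5, 6))      # "rolling a number bigger than 4": 2/6
classical_prob(6)            # "rolling a six": 1/6
```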

Ω: Stop this equally likely approach to probability! David Williams (you can find him in Wikipedia) calls it an invitation to disaster.

ω: But I like it. I think it’s more fair to give everybody the same chance.

Ω: Ah, are you bringing socialism back on the stage? Anyway, from the mathematical point of view, this equally likely approach to probability requires only a minimal amount of techniques and terminology.

All you need is counts, counts, counts is all you need.

All you need is counts (all together now),

All you need is counts (everybody).

3.3 Some results from combinatorics

3.3.1 Counting

"I think you're begging the question," said Haydock, "and I can see looming ahead one of those terrible exercises where six men have white hats and six men have black hats and you have to work it out by mathematics how likely it is that the hats will get mixed up and in what proportion. If you start thinking about things like that, you would go round the bend. Let me assure you of that!" — Agatha Christie, The Mirror Crack'd

Question 3.13. Campus residences.
A campus residence unit consists of three two-storey buildings with four rooms in each storey. What is the total number of rooms in this unit?

Answer: 3 buildings times 2 storeys times 4 rooms makes 24 rooms altogether.

A way to visualise this is drawing a tree shaped graph. First, it shows three branches, one for each building. Then, each of these branches grows two branches, one for each storey. Finally, each of these further branches out into four limbs, one for each room in that storey.

Let us now formally state the counting methods we have just applied intuitively.

Theorem 3.14. Fundamental Rule.
(i) Given a set of m distinct elements a_1, . . . , a_m and a set of n distinct elements b_1, . . . , b_n, there exist m · n distinct ordered pairs (a_i, b_j) comprising one element from each set.
(ii) Given a set of n_1 distinct elements a_1, . . . , a_{n_1}; a set of n_2 distinct elements b_1, . . . , b_{n_2}; up to a set of n_v distinct elements x_1, . . . , x_{n_v}; there exist ∏_{i=1}^{v} n_i distinct ordered v-tuples (a_{i_1}, . . . , x_{i_v}).

ω: That's a funny theorem. It only applies to tuples of lengths 2 or 24.

Ω: I see you are taking the notation very literally. Look, just forget about using the Latin alphabet completely and exclusively. You can squeeze any number of characters and more between a and x. For example, a, b, c, x makes quadruples, and a, α, β, γ, δ, b, c, . . . , x makes 28-tuples. a and x are just used in this way to avoid more indices; otherwise we would have something like a_{1,i_1}, . . . , a_{n_v,i_v}, you see? It's all fine, there's no accounting for taste.

Proof.
(i) is obvious: Arrange the pairs in a rectangular array with pair (a_i, b_j) at the intersection of the ith row and the jth column.
(ii) By induction. True for v = 2 from (i). Now suppose it is true for v − 1. Expressing the v-tuple as a pair ((a_{i_1}, . . . , w_{i_{v−1}}), x_{i_v}) and applying (i) with m = n_1 · n_2 · . . . · n_{v−1} and n = n_v shows the claim for v.
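The Fundamental Rule can also be checked by brute force; here is a tiny R sketch that lists all ordered triples for Question 3.13 with expand.grid (the building labels are made up for illustration).

```r
# Sketch: all ordered (building, storey, room) triples; the count is 3 * 2 * 4.
buildings <- c("B1", "B2", "B3")   # hypothetical labels
storeys   <- 1:2
rooms     <- 1:4
nrow(expand.grid(buildings, storeys, rooms))   # 24, as in Question 3.13
```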

Question 3.15. Group picture.
You want to take a picture of seven people arranged on chairs in a row. How many choices do you have?

Answer: Put the seven chairs in a row. The first person has the choice between seven chairs. For the second person, there are six chairs left. (Note: For the number of choices it does not matter which chair the first person chose.) For the third person, there are five chairs left. And so on. The sixth person only has a choice between two chairs, and the seventh person has no choice at all. So, the total number of arrangements is 7 · 6 · 5 · . . . · 2 · 1 = 5040.

ω: But I want to be next to you!

Ω: Well, eh, that will reduce the number of options. I guess that's a great exercise for first year students learning about combinatorics. . .

Definition 3.16. A permutation of a finite set is any ordering of its elements in a list.

Ω: Actually, this is not what I learned in my algebra class. They said, a permutation of a finite set is a bijective map of that set onto itself.

ω: It doesn't matter to me. Bijective maps uniquely describe how to reorder the set, and list orderings are the results of these reordering processes. Practically speaking, it's all the same.

Theorem 3.17. Total number of permutations.
A set with n elements has n! different permutations.

Proof. Let a_1, . . . , a_n be the n distinct elements of the set. We compute the number of permutations following the idea of the calculation in Question 3.15. There are n choices for where a_n goes, n − 1 choices for a_{n−1}, n − 2 choices for a_{n−2}, and so on, up to 2 choices for a_2 and just one option for a_1. Using the Fundamental Rule, that makes n! different orderings.

3.3.2 Sampling

Ω: I am collecting a collection.

Use Saul Jacka’s notes p.9-11, 13.

Here is some motivation for the multinomial theorem.

Question 3.18. Head like a sieve.
You secured your bicycle with a combination lock involving seven rotating discs arranged in a row, each showing the digits 0 to 9. After a day of revising material for an economics class, your memory of the lock combination is rather vague. All you remember is that the (unique) combination for opening the lock has three times the digit 3, twice the digit 2 and again twice the digit 7. Making use of whatever little you remember, you decide to have a go. You estimate that it takes you about three seconds to try a combination. How long will it take you to open the lock in the worst case scenario that you won't find it until your last trial?
Answer: There are 7! ways to order the seven digits you remember. However, that's not yet the answer to the question. You need to divide by the number of orderings of the three 3s and the two 2s and the two 7s. So the total number of combinations you've got to try is 7!/(3! · 2! · 2!) = 7 · 6 · 5 = 210. Trying them all would take 10 1/2 minutes.

ω: That’s silly. First she cares about order, then she doesn’t.

Ω: Look, over here are seven M&Ms. Three of them are green, two are orange and two are violet. You would care about the order of the colours as colours go, but you would not care about the order of the individual pieces within one colour. For example, you can reorder the green ones among themselves without even noticing. . .

ω: Did you notice I just ate all seven while you weren't looking? I was much faster than three seconds and I cannot even taste the order.

4 Axioms of probability

Use Saul Jacka’s notes p.6-7.

Remark 4.1. When can we use the power set?
If Ω is countable the default is to use the power set P(Ω) for F. However, the fact that you can define measures on all subsets of the space does not mean you always should. In some situations it is actually desirable to work with a submodel (Ω, F*) of the measurable space (Ω, F), that is, F* ⊂ F, as will be discussed below. The example about chess and bridge in Saul Jacka's notes is such a situation. In order to define a probability measure based on the partial information available we need to use a submodel.
For continuous Ω the situation is very different, because it is typically not possible to define set functions simultaneously on all subsets of Ω that fulfil all the characteristics of a measure (in the sense of the axioms). (If you would like to find out more about this, wait for advanced modules in probability theory and measure theory or read about the Banach-Tarski paradox now.)

Example 4.2. Bernoulli experiment.
This is the simplest possible random experiment that is not trivial. There are two outcomes, often called 0 and 1, with the latter referred to as "success". There is a fixed p ∈ [0, 1], the probability for success. Use Ω = {0, 1} and the σ-algebra generated by the partition ({0}, {1}), that is, F = {∅, {0}, {1}, {0, 1}}. Then P({1}) = p uniquely defines a probability measure by P({0}) = 1 − P({1}) = 1 − p.

ω: With p = 1 I’ve won! I always win... always, always, always!

Ω: Obviously, this case is very boring, and so is the case p = 0. They are included in the model, but they are the extreme cases. Such experiments are not actually random, they are deterministic.
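A short R sketch of the Bernoulli experiment; the value of p below is an arbitrary illustration, and p = 0 or p = 1 reproduce the deterministic extremes just mentioned.

```r
# Sketch: simulating Bernoulli(p) outcomes with rbinom.
p <- 0.3                                   # illustrative success probability
rbinom(10, size = 1, prob = p)             # ten independent 0/1 outcomes
mean(rbinom(100000, size = 1, prob = p))   # long-run success frequency, close to p
```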

Example 4.3. Finitely many outcomes.
Given weights p_i ≥ 0 for i = 1, . . . , n with p_1 + . . . + p_n = 1,

P({ω_i}) = p_i   (i = 1, . . . , n)   (2)

defines a probability measure on Ω = {ω_1, . . . , ω_n} with σ-algebra P(Ω).
If p_i = 1/n for all i = 1, . . . , n then this is called the uniform distribution.
Here are some explicit situations:

(i) An n-faced die with p_i specifying the probability for the ith face to show up. If p_i = 1/n for all i = 1, . . . , n then the die is fair. Otherwise it is loaded. (This refers to methods of inserting small quantities of metal into some of the sides of the die to increase the likelihood it lands the other side up.)

(ii) Drawing a ball at random from a box with a finite number of coloured balls. With m different colours and N_i (i = 1, . . . , m) balls of colour i, N = N_1 + . . . + N_m, the probabilities are p_i = N_i/N.
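Both situations can be simulated with R's sample() function; the weights below are made-up numbers chosen only to illustrate formula (2).

```r
# Sketch: drawing from the finite models of Example 4.3.
p_loaded <- c(0.1, 0.1, 0.1, 0.1, 0.1, 0.5)            # a (hypothetical) loaded die
table(sample(1:6, 10000, replace = TRUE, prob = p_loaded))

N <- c(red = 3, green = 5, blue = 2)                    # balls per colour, N = 10
sample(names(N), 5, replace = TRUE, prob = N / sum(N))  # p_i = N_i / N as in (ii)
```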

Example 4.4. Countable number of outcomes.
Given weights p_i ≥ 0 for i = 1, 2, . . . with ∑_{i=1}^{∞} p_i = 1,

P({ω_i}) = p_i   (i = 1, 2, . . .)   (3)

defines a probability measure on Ω = {ω_1, ω_2, . . .} with σ-algebra P(Ω).

ω: I need to invite infinitely many friends to my birthday party. Because, whenever I think I will invite these n friends I realise I forgot one. But how can I divide my birthday cake into infinitely many equally large pieces?

Ω: That would not be LARGE pieces, omega, but small pieces! You are even in much bigger trouble here. Just read on. . .

Remark 4.5. No uniform distribution on a countable set.
There is no uniform distribution on a countably infinite set. Why? If there was a uniform distribution then there would be a c ∈ [0, 1] such that P({ω_i}) = c for all i = 1, 2, . . . By the properties of a measure, P(Ω) = ∑_{i=1}^{∞} c (that means adding up c infinitely often). If c > 0 the series diverges and if c = 0 the series equals 0. Both cases contradict the axiom that P(Ω) = 1.

Ω: In part B of this module you can make a model for equally sharing a cake with a continuum of friends. Is that what you would like to do for your next birthday, omega?

ω: But I do not have a continuum of friends. Even if I added all my imaginary ones.

Ω: Try nextgenerationfacebook.

5 Conditioning

Find a way to get 6 male and 4 female students to come to the front and form a circle. Offer 3 of the women and 2 of the men to put a hat on their head. Get another student and call him Mr. Randomness. Bring him into the centre of the circle, blindfold him and let him spin around until the audience calls him to stop. He faces one of the other 10 students, say X. What is the probability X is wearing a hat? Five of the ten have a hat, so it's a half.

Now Mr. Random is asked to exchange "hi" with the student he is facing. What is the probability that X is wearing a hat? Is there anything different? Well, Mr. Random did get some information. From the voice saying "hi" he can tell whether X is male or female. As Mr. Random had seen the students with their hats before getting blindfolded, he knows that 3 of the 4 women wear hats, whereas only 2 of the 6 men wear hats. In this run of the experiment it turns out that X is a woman. What he is going to calculate now is the probability that the person has a hat on given he already knows the person is a woman. Among the women, what is the chance she has got a hat on? He takes the number of women with hats and divides by the number of women.
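In numbers, the classroom demonstration looks as follows; this short R sketch anticipates the conditional probabilities defined in the next subsection.

```r
# Sketch: 6 men and 4 women, of whom 2 men and 3 women wear hats.
p_hat       <- 5 / 10    # before the "hi": 5 of the 10 wear a hat
p_hat_woman <- 3 / 4     # after learning X is a woman
p_hat_man   <- 2 / 6     # after learning X is a man
# averaging the two conditional chances recovers the unconditional one:
p_hat_woman * 4/10 + p_hat_man * 6/10   # = 0.5 = p_hat
```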

5.1 Conditional probability.

Definition 5.1. Conditional probability.
Let (Ω, F, P) be a probability space and A, B ∈ F events. Assume P(B) > 0. The conditional probability of A with respect to B is given by

P(A | B) := P(A ∩ B) / P(B).

Remark 5.2. Conditional probability is a probability measure.
We call P(A | B) conditional probability, but is it actually a probability as defined in Section 4? The answer is yes, because for any fixed B, the set function A ↦ P(A | B) defines a probability measure. The proof amounts to showing the two properties in (P2).
(i) is obvious: P(Ω | B) = P(Ω ∩ B)/P(B) = 1.
(ii) follows from basic set operations: For A_1, A_2, . . . mutually exclusive,

P(⋃_{i∈ℕ} A_i | B) = P((⋃_{i∈ℕ} A_i) ∩ B) / P(B) = P(⋃_{i∈ℕ} (A_i ∩ B)) / P(B) = ∑_{i∈ℕ} P(A_i ∩ B) / P(B) = ∑_{i∈ℕ} P(A_i | B).

ω: Why does the left-hand side in Definition 5.1 look symmetric, but then the right-hand side is not?!

Ω: P(A | B) is just a short form for the right-hand side and it is not symmetric. But I agree that the symbolic expression (A | B) is misleading in terms of its symmetric appearance. But what actually is the relationship between P(A | B) and P(B | A)?

Some implications of the definition are both easy to prove and very frequently used:

Multiplication rule

P(A ∩ B) = P(A | B) · P(B)   ∀ A, B ∈ F with P(B) > 0   (4)
         = P(B | A) · P(A)   ∀ A, B ∈ F with P(A) > 0   (5)

"Flip around" formula

P(A | B) = [P(A) / P(B)] · P(B | A)   ∀ A, B ∈ F with P(A) > 0, P(B) > 0   (6)

Averaging conditional probabilities

P(A) = P(A | B) · P(B) + P(A | B^c) · P(B^c)   ∀ A, B ∈ F with 0 < P(B) < 1   (7)

Bayes' rule

P(B | A) = P(A | B) · P(B) / [P(A | B) · P(B) + P(A | B^c) · (1 − P(B))]   (8)
           ∀ A, B ∈ F with P(A) > 0, 0 < P(B) < 1


The first two statements are immediate consequences of Definition 5.1. The last two statements are special cases of the more general Theorems 5.6 and 5.7. Before we state and prove these theorems, we will show a few examples for the use of the rules (4) to (8).

Question 5.3. Pick a box and pick a ticket.
There are two boxes. One has three tickets with the numbers 1, 3 and 5, the other one has tickets with the numbers 2 and 4. First toss a fair coin. If heads comes up choose the box with the odd numbers, if tails comes up choose the box with the even numbers. Then you pick a ticket at random from the chosen box. What is the probability you get a 3?

We will present two solutions, a naive approach and one based on the multiplication rule.

Answer 1: There are five different outcomes defined by the number of the ticket drawn, so Ω = {1, 2, 3, 4, 5}. Since the two boxes are equally likely, P({1, 3, 5}) = 1/2 = P({2, 4}). Since the tickets within each box are equally likely, using the addition axiom, P({1, 3, 5}) = P({1}) + P({3}) + P({5}) = 3 · P({3}). Putting this all together yields P({3}) = 1/6.

Answer 2: Denote the outcomes by two-character combinations: bo and be for the boxes with odd and even numbered tickets in them, followed by the number on the ticket. So the outcome space is Ω = {bo1, bo3, bo5, be2, be4}. We want to find the probability for bo3. Define events:

Bo := "box with odd numbered tickets was chosen" = {bo1, bo3, bo5},
Be := "box with even numbered tickets was chosen" = {be2, be4},
Ti := "ticket with number i was drawn" (i = 1, 2, 3, 4, 5).

By assumption, P(Bo) = P(Be), so both must be 1/2. The only way to draw a ticket with the number 3 is by choosing the box with odd numbers first. If drawing from that box, chances for each of the numbers are equal; in other words, P(Ti | Bo) = 1/3 for i = 1, 3, 5. Using (4), P({bo3}) = P(T3 | Bo) · P(Bo) = 1/3 · 1/2 = 1/6.

ω: The first answer looks a million times smarter.

Ω: Well, you're right... but you're not. To come up with the first answer you need an idea about how to exploit the specific conditions of this example. The second answer is a standard application of the multiplication rule. It should be in your repertoire, in particular when things get a bit more complex. What if the coin is biased? What if there is more than one ticket numbered 3, and in both boxes? What if the tickets in a box are not equally likely to be drawn? You can still use the first approach, but it won't look so elegant anymore.
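For readers who prefer to see the multiplication rule at work empirically, here is a simulation sketch in R of Question 5.3; draw_ticket is an illustrative helper name.

```r
# Sketch: toss a fair coin to pick a box, then draw a ticket at random.
set.seed(1)                                # arbitrary seed
draw_ticket <- function() {
  if (runif(1) < 1/2) sample(c(1, 3, 5), 1) else sample(c(2, 4), 1)
}
tickets <- replicate(100000, draw_ticket())
mean(tickets == 3)                         # close to 1/6, as computed above
```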

Question 5.4. Golden tickets.

There are two shops with Wonka chocolate bars and it is impossible to recognise whether or not they contain a golden ticket to visit Willy Wonka's factory. In one of them, the chance to find a golden ticket is a half, in the other one it is a third. Omega bought one chocolate bar and found a golden ticket inside. Guess which shop Omega got it from and explain why you make that guess.

Answer: Denote the outcomes by two-character combinations: se for the shop with equal probabilities, su for the other shop, g for a golden ticket and b for a blank. So the outcome space is Ω = {seg, seb, sug, sub}. Define events:

Se := "shop with equal chances was chosen" = {seg, seb},
Su := "shop with unequal chances was chosen" = {sug, sub},
G := "golden ticket was found" = {seg, sug}.

By assumption, P(Se) = P(Su) = 1/2, and P(G | Se) = 1/2, and P(G | Su) = 1/3. To answer the question we need to compare the probability of Se given G with the probability of Su given G. Averaging conditional probabilities according to (7) yields

P(G) = P(G | Se) · P(Se) + P(G | Su) · P(Su) = 1/2 · 1/2 + 1/3 · 1/2 = 5/12.


Using (6) and putting this all together yields

P(Se | G) = [P(Se) / P(G)] · P(G | Se) = (1/2)/(5/12) · 1/2 = 3/5.

Furthermore, P(Su | G) = 1 − P(Se | G) = 2/5. Therefore it is 50% more likely that Omega's golden ticket came from the shop with equal probabilities than from the other shop.

Remark 5.5. About the proof.
The answer above uses (7) and (6) explicitly to show what is going on. This actually amounts to proving (8) for that example. Alternatively, one could just plug numbers into (8) and obtain the same result.
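Plugging the numbers into (8) directly, as just suggested, can be sketched in R as follows.

```r
# Sketch: posterior probability that the ticket came from the equal-chance shop.
prior_Se <- 1/2    # P(Se)
lik_Se   <- 1/2    # P(G | Se)
lik_Su   <- 1/3    # P(G | Su)
post_Se  <- lik_Se * prior_Se / (lik_Se * prior_Se + lik_Su * (1 - prior_Se))
post_Se              # 3/5, as above
1 - post_Se          # P(Su | G) = 2/5
```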

ω: I know how I can answer Question 5.4 much faster! Say there's a place with only one Wonka bar with a golden ticket in it, and there is a place with a million Wonka bars one of which has a golden ticket in it whereas the other 999,999 have none. The chance of getting a golden ticket when you search in place one is, well, there isn't even any chance, you surely get one. But when you are searching for a golden ticket in the second place, it's like 1 in a million, so your chances are incy wincy small. If you actually got a ticket from one of those places, well, chances are it's the first place where you got it from. The shops in the example are not quite as extreme, but the answer still goes in the same direction.

Problem solving technique: Exaggerating the original question (see also the dialogue following Question 2.7.2).

Ω: Yes. And did you notice you argued the other way round from usual? Here is another example. Who knitted the Klein bottle hat, the lecturer or her husband? Only a small fraction of men know how to knit, whereas, at least in their generation, a decent fraction of women do. So it's more likely she knitted that hat than that he did. This kind of thinking is a probabilistic version of reverse engineering, also known as statistical inference. Regular people actually use it in everyday life all the time.

ω: Like when you blamed me for destroying your favourite sunglasses. Just because I broke your fabulous prize-winning rocket, I dropped your report card in the loo, I reset your password and I played frisbee with your data CD, you think that I also stepped on your glasses. But it wasn't me!

Theorem 5.6. Total probability
Let (Ω, F, P) be a probability space and B1, . . . , Bn ∈ F a partition of Ω. Assume that P(Bi) > 0 for i = 1, . . . , n. Then, for every A ∈ F,

P(A) = \sum_{i=1}^{n} P(A | Bi) P(Bi)   (9)

Proof. Since A ∩ B1, . . . , A ∩ Bn are mutually exclusive and their union is A, the addition axiom gives P(A) = \sum_{i=1}^{n} P(A ∩ Bi). Applying Definition 5.1 transforms the summands into P(A | Bi) P(Bi).

5.2 Bayes theorem and applications

Theorem 5.7. Bayes
Let (Ω, F, P) be a probability space and B1, . . . , Bn ∈ F a partition of Ω. Assume that P(Bk) > 0 for k = 1, . . . , n. Then, for every A ∈ F with P(A) > 0,

P(Bk | A) = P(A | Bk) P(Bk) / [ P(A | B1) P(B1) + . . . + P(A | Bn) P(Bn) ]   (10)

Proof. Fix k ∈ {1, . . . , n}. Formula (6) yields P(Bk | A) = P(A | Bk) P(Bk) / P(A), which results in (10) after replacing P(A) by the right-hand side of (9) in Theorem 5.6.

Terminology


P(Bi) are called prior probabilities – that's because it's before knowing about A.
P(A | Bi) are called likelihoods – probabilities of A given the different options Bi.
P(Bi | A) are called posterior probabilities – that's because it's after knowing about A.

ω: How come a theorem that is so easy to prove became so famous?

Ω: Sometimes, it's not the technique that makes a result famous, but the idea to even think of looking at the mathematical objects in a new way. The key idea of first the "flip around" formula and then also Bayes' theorem is to use the information about the probability for A given a few different Bi's to sort of reverse engineer which of the Bi's happened in the first place.

ω: You mean which one is most likely to have occurred given A did. Actually, the theorem calculates the probability for each of the Bi's given we know A happened. My teacher is using this all the time. Whenever it's noisy in the classroom while she's writing at the board, she blames me! But just because I'm talking a lot doesn't mean it's always me!

Example 5.8. Medical diagnostics.
A lab test for the disease VerySick is conducted on blood samples and has the potential outcomes "positive" and "negative". We consider a certain population that has an incidence rate of 1% for the disease. We have experience with applying the test in this population: in the past, 95% of the people with the disease produced a positive test result, and 2% of the people without the disease produced a positive test result.
What is the probability that an individual chosen at random from that population will have the disease given his or her test was positive?

Let D be the event that the individual has the disease and let + and − be the events that the individual has a positive or negative test, respectively.

Using Bayes' theorem,

P(D | +) = P(+ | D) P(D) / [ P(+ | D) P(D) + P(+ | Dc)(1 − P(D)) ] = (0.95 · 0.01) / (0.95 · 0.01 + 0.02 · 0.99) ≈ 32%
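As a minimal sketch of this calculation, the following Python lines evaluate the same posterior probability and make it easy to play with the prevalence and the error rates. The function name and its arguments are just illustrative choices, not part of the notes.

from fractions import Fraction

def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    # P(D | +) via Bayes' theorem: numerator P(+ | D) P(D),
    # denominator the total probability of a positive test, P(+)
    num = sensitivity * prevalence
    den = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
    return num / den

# Example 5.8: prevalence 1%, sensitivity 95%, false positive rate 2%
print(positive_predictive_value(0.01, 0.95, 0.02))   # approximately 0.324
# with a much higher prior probability of disease (cf. Example 5.9 below)
print(positive_predictive_value(0.30, 0.95, 0.02))   # approximately 0.953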

ω: So what?

Ω: You are being told you've got this disease that's going to kill you within 5 years. You spend a few weeks being horrified and you end up dying of a heart attack. So, indeed, you died. However, the likelihood that you do not have the disease given your test was positive is actually about two in three!

ω: Strange, how can this come about?

Ω: Not actually having the disease despite being tested positive is an event called a false positive. Related to that are the false positive rate and the specificity and the sensitivity of the test and...

ω: I don't care about all this medical terminology. I just want to know this: if they gave me a positive test result, what is the probability that I actually have the disease?

Ω: That's what the physicians call the predictive value of a positive test and it's just what we computed above. It does depend on the quality of the test but also on the disease prevalence (i.e., the incidence of the disease) in the population you are drawn from. In the example above, the incidence was only 1%. This made the numerator small, but inflated the denominator because of the term 1 − P(D) = 99%.

ω: What’s the point of such a test then?

Ω: Actually, I was being a bit cynical about the heart attack... There are ways to apply the test in a more meaningful way.

Example 5.9. Medical diagnostics (cont.)
Say a patient who already feels sick walks into a surgery. The physician conducts a careful examination excluding a number of diseases or conditions. Based on experience, the physician estimates there is a probability of about 30% that the patient has the disease VerySick. The physician thinks of the patient as a member of a certain population of similarly sick people. In that population, P(D) = 30% and P(Dc) = 70%. Let us make the same assumptions for the test accuracies. Applying Bayes' theorem in the same way as before yields a predictive value of a positive test of P(D | +) ≈ 95.3%.

The medical diagnostics example is a very common situation. The false positive issue is a concern when the prevalence is low, even with good quality tests. This is often the case for a screening test for a disease in an unspecific population, unless the disease is very common. Depending on the nature of the diagnosis, such news may create stress and anxiety in the individual who receives it, even if it is understood that the diagnosis has a lot of uncertainty attached to it. Often, the individual will have to wait long periods of time until further tests can be done and deliver refined results. The aim of avoiding stress for the patient has to be weighed against the usefulness of potentially true news, for example if early treatment is crucial for survival. Examples of population-wide screening with low incidence rates of the condition tested for may be found in antenatal care or in the detection of drug users.

ω: I want to tell you something. I really very much hate word problems.

Ω: Why?

ω: Either I do math or I do writing.

Ω: You mean one part of your brain is in charge of math and the other part of your brain is in charge of the rest of the world, and the two don't mix? You are basically saying that applied mathematics cannot be done. Either you come up with the most beautiful piece of math to describe a certain real world application but you neglect the point of the application, or you pour all your thoughts into an excellent description of a piece of the real world but then you have no brains left to work out mathematically rigorous answers to questions? I suggest you activate a third part of your brain to get the first two parts to act in concert.

5.3 General multiplication rule and independence

If the occurrence of the event B has no impact on the occurrence of the event A then P(A | B) = P(A). Using the multiplication rule (4), this can be expressed as P(A ∩ B) = P(A) · P(B). Or it could be expressed as P(B | A) = P(B). Note that in formulations using conditional probabilities we need to assume that the probability of the event that comprises the condition is not zero.

Definition 5.10. Independence of two events.
Let (Ω, F, P) be a probability space and A ∈ F and B ∈ F events. A and B are called independent if

P (A ∩B) = P (A) · P (B) (11)

and dependent otherwise.

ω: It just means they have nothing to do with each other. For example, if I toss a coin and I want it to come up heads and you want it to come up tails, then what you and I want is independent...

Ω: Eh, I'm not so sure about that. Let's see... P({h}) · P({t}) = 1/2 · 1/2 = 1/4 and P({h} ∩ {t}) = P(∅) = 0, so they are disjoint, but not independent.

ω: Ah sure, they have a lot to do with each other. Because if it’s heads it can’t be tails!

Example 5.11. Independent and dependent events.

Roll a die. In Example 3.2 define the events Odd = {1, 3, 5} and Gk = {1, . . . , k}. Then Odd and Gk are independent if k is even, otherwise they are dependent. Just check (11),


for example: P(Odd ∩ G2) = P({1}) = 1/6 and P(Odd) · P(G2) = 1/2 · 1/3 = 1/6, so they are independent, but P(Odd ∩ G1) = P({1}) = 1/6 and P(Odd) · P(G1) = 1/2 · 1/6 = 1/12.

Toss two coins. See Example 3.4. The assumption of equiprobable outcomes implies independence of the events Hk = "kth toss is heads" (k = 1, 2), because
P(H1 ∩ H2) = P({hh}) = 1/4 = 1/2 · 1/2 = P({hh, ht}) · P({hh, th}) = P(H1) · P(H2).
This technical property reflects our idea that, if the coin tossing is conducted properly, the coin has no memory.
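If you want to check condition (11) mechanically for events in a finite, equiprobable outcome space, a few lines of Python suffice. The sketch below, assuming the equiprobable model of the die example, checks Odd against Gk for k = 1, . . . , 6; the helper name prob is just an illustrative choice.

from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}          # one roll of a fair die

def prob(event):
    # equally likely outcomes: |event| / |omega|
    return Fraction(len(event & omega), len(omega))

odd = {1, 3, 5}
for k in range(1, 7):
    g_k = set(range(1, k + 1))      # G_k = {1, ..., k}
    independent = prob(odd & g_k) == prob(odd) * prob(g_k)
    print(k, independent)           # True exactly when k is even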

Remark 5.12. Real world applications.
Independence is an assumption very often made for one or more of the following reasons:
(i) It is justified by theoretical considerations.
(ii) It is empirically justified (i.e., checked on data).
(iii) It simplifies calculations.

Reason (iii) is fine as long as the real world application only serves as an inspiration for a mathematical mind's desire to create beautiful problems and subsequently solve them. Before applying the results of such creative processes to make statements about the real world, additional study is required. In particular, the magnitude and the direction of the error introduced by the simplification need to be addressed. A recent example of substantial errors caused by inappropriate independence assumptions occurred in the modelling of the default risks of mortgages (even when they were secured by houses located in socio-economically highly homogeneous neighbourhoods). Such oversimplified risk models led to the subprime mortgage crisis in the US real-estate market that is believed to have triggered the global financial crisis.
Reason (i) is often used successfully, especially in scientific or technological applications. However, it does bear the risk that partial knowledge or an inappropriate weighing of existing knowledge leads to incorrect models.
Reason (ii) is convincing and sometimes the only choice. However, the collection of data and the testing of independence in data bear their own challenges. Among other issues, model development based on empirical knowledge only can be misled by chance variation or bias in data. A simple but striking example is the empirical "proof" that the area of a rectangle is a linear function of the sum of the lengths of its edges (see the statistics textbook by Freedman, Pisani and Purves).

Ideally, independence assumptions are based on both theoretical and empirical justifications. In some applications, independence assumptions are wrong, but can still play a role, for example in obtaining upper or lower bounds on an unknown true probability.

The multiplication rule (4) can be generalized to more than two events. This will be used to define a notion of independence for more than two events.

Theorem 5.13. General multiplication rule.
Let (Ω, F, P) be a probability space and A1, . . . , An ∈ F with P(A1 ∩ . . . ∩ An−1) > 0. Then

P (A1 ∩ . . . ∩An) = P (A1) · P (A2 |A1) · . . . · P (An |A1 ∩ . . . ∩An−1). (12)

Proof. Exercise.
(Hint: The result can be shown by induction based on repeated application of (4).)

Definition 5.14. (Mutual) independence.
Let (Ω, F, P) be a probability space and A1, . . . , An ∈ F events. The events are called (mutually) independent if

P(A_{i_1} ∩ . . . ∩ A_{i_k}) = P(A_{i_1}) · . . . · P(A_{i_k})   (13)

for every choice of indices 1 ≤ i_1 < . . . < i_k ≤ n. In particular, taking all n events, P(A1 ∩ . . . ∩ An) = P(A1) · . . . · P(An).


Definition 5.15. Pairwise independence.
Let (Ω, F, P) be a probability space and A1, . . . , An ∈ F events. The events are called pairwise independent if

P(Aj ∩ Ak) = P(Aj) · P(Ak)   for all j, k ∈ {1, . . . , n} with j ≠ k.   (14)

Remark 5.16. Pairwise and mutual independence are different.
Mutual independence implies pairwise independence. The converse is not true. For example, take two independent coin tosses and consider the events Hk from Example 5.11 and the event "both tosses show the same face"; see Exercise sheet 3, Part II, Problem 1.
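A quick numerical check of this classical counterexample (two fair coin tosses; the third event below is "the two tosses agree") can look as follows. This is only an illustrative sketch under the equiprobable model, not the official solution of the exercise sheet problem.

from fractions import Fraction
from itertools import product

omega = set(product("ht", repeat=2))          # {('h','h'), ('h','t'), ...}

def prob(event):
    return Fraction(len(event), len(omega))   # equally likely outcomes

H1 = {w for w in omega if w[0] == "h"}        # first toss heads
H2 = {w for w in omega if w[1] == "h"}        # second toss heads
C  = {w for w in omega if w[0] == w[1]}       # both tosses show the same face

# pairwise independence: every product condition for two events holds
print(prob(H1 & H2) == prob(H1) * prob(H2))   # True
print(prob(H1 & C)  == prob(H1) * prob(C))    # True
print(prob(H2 & C)  == prob(H2) * prob(C))    # True
# but the triple product condition fails, so they are not mutually independent
print(prob(H1 & H2 & C) == prob(H1) * prob(H2) * prob(C))   # False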

6 Repeated experiments

This section is about doing the same experiment several times. We consider a simple, but not trivial, situation: a series of independent Bernoulli trials (see Example 4.2).

Model 6.1. Bernoulli trials process.
A Bernoulli trials process is a finite or infinite sequence of random experiments with the following characteristics:

(i) Each experiment has two possible outcomes, referred to as “success” and “failure”.

(ii) The probability p ∈ [0, 1] for "success" is the same for each experiment; in particular, it is not affected by the outcome of any of the other experiments.

The probability 1 − p for failure is often denoted by q.
The standard outcome space for a Bernoulli trials process of length n is Ω = {0, 1}^{1,...,n} and F = P(Ω). (In other words, Ω is the set of all n-tuples of zeros and ones.) The standard outcome space for an infinite Bernoulli trials process is Ω = {0, 1}^N. (In other words, Ω is the set of all sequences of zeros and ones.)

This is an abstract version of a model for independent coin tosses.

In the following sections we will study probability distributions relating to a number of different kinds of events. In the finite case, the standard probability space (Ω, P(Ω)) will be used. In the case of infinitely many trials, submodels are used (see Remark 4.1), for example in Sections 6.2 and 6.3.


In this module, we only consider independent repetitions of experiments. In advanced probability modules about stochastic processes you will encounter more general models. In particular, there will be models of processes that allow future repetitions to depend on the present (Markov processes) or on the present and a finite window of the past (higher order Markov processes).

6.1 Number of successes

Consider a Bernoulli trials process of length n with success probability p ∈ [0, 1] and failure probability q = 1 − p as described in Model 6.1. We want to compute probabilities for the outcomes. Let us illustrate this with a special case.
Problem solving technique: Consider a special case or an example.


Example 6.2. Three independent Bernoulli trials.
Let n = 3. Using independence we observe

ω      111    110    101    100    011    010    001    000
P(ω)   p^3    p^2 q  p^2 q  p q^2  p^2 q  p q^2  p q^2  q^3

We have P(111) = p^3 and P(000) = q^3.

The other probabilities fall into two groups:
P(110) = P(101) = P(011) = p^2 q  and  P(001) = P(010) = P(100) = p q^2.

Go back to the general case of a Bernoulli trials process of length n. For ω ∈ Ω let ω(i) be the ith place in the tuple. A closer look, using independence, shows that the probability of an outcome only depends on the number of successes and failures. Given that the total number of trials is fixed, it is enough to record only the number of successes. The events

Sk = “k successes in n trials” (k = 0, 1, . . . , n) (15)

can be represented using the 0-1-coding we chose to represent failure and success (check yourself):

Sk = { ω ∈ Ω | \sum_{i=1}^{n} ω(i) = k }   (k = 0, 1, . . . , n).   (16)

This implies

P(ω) = p^k q^{n−k}   for all ω ∈ Sk.

To calculate the probability of the event Sk it remains to count the number of outcomes in Sk. In how many ways can we choose (exactly) k successes in n trials? According to Section 3.3, using the formula for "without replacement, not ordered", this is \binom{n}{k}, so

P(Sk) = \binom{n}{k} p^k q^{n−k}   (k = 0, 1, . . . , n).   (17)

This defines a probability measure (see Example 4.3) on the set of possible outcomes for the number of successes.

Definition 6.3. Binomial distribution.
The measure

P(k) = \binom{n}{k} p^k q^{n−k}   (k = 0, 1, . . . , n)   (18)

on {0, 1, . . . , n} is called the binomial distribution.

Remark 6.4. Special case p = 1/2.
For p = 1/2 formula (17) simplifies to the following form:

P(k) = (1/2^n) \binom{n}{k}   (k = 0, 1, . . . , n).   (19)


Example 6.5. Library books.
You go to the library with a list of 7 books you need for your class. You would like to know the probability that you can get at least 5 of them. Questions like this one can be answered, but some assumptions have to be made. Here, we assume that the probability that a book is checked out is 10%, for any one of the books and independently of any of the other books. (This is a simplification and whether or not such assumptions are at least approximately true would need to be verified for a particular library.)

Now the model for a Bernoulli trials process of length 7 with success probability 1/10 (see Model 6.1) applies and we can compute the probability of the events Sk for k successes using (17). (Here "success" is being used for the book being checked out, because the computation becomes shorter this way.)

P("at least 5 are not checked out") = P("at most 2 are checked out")
= P( \bigcup_{k ∈ {0,1,2}} Sk ) = \sum_{k=0}^{2} P(Sk) = \sum_{k=0}^{2} \binom{7}{k} (1/10)^k (9/10)^{7−k}
= (9^5 / 10^7) · (9^2 + 7 · 9 + 21) ≈ 0.9743

So the probability that you can get at least 5 of the books from your list is about 97.43%.
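As a rough check of this arithmetic, here is a small Python sketch that evaluates the binomial probabilities (17) directly. The function name binomial_pmf is an illustrative choice, and math.comb requires Python 3.8 or later.

from math import comb

def binomial_pmf(n, p, k):
    # P(S_k) = C(n, k) p^k (1 - p)^(n - k), formula (17)
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Example 6.5: 7 books, each checked out with probability 1/10
n, p = 7, 0.1
prob_at_most_2_checked_out = sum(binomial_pmf(n, p, k) for k in range(3))
print(prob_at_most_2_checked_out)                          # approximately 0.9743
# sanity check: the binomial weights (18) add up to 1
print(sum(binomial_pmf(n, p, k) for k in range(n + 1)))    # 1.0 (up to rounding)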

Remark 6.6. Normal approximation.
What if in the last example we had asked about the availability of 20 out of 70 books? Or 200 out of 700? The same methods would apply, but the computations become cumbersome. Even though computers nowadays facilitate such cumbersome calculations, in practice binomial probabilities are often approximated using the normal distribution. See the slide show for this topic (it can be downloaded from the module resources website).

The approximation of discrete distributions is a fundamental theme in probability theory.The most well known result of this kind is the Central Limit Theorem. Such results arecrucial for applications in statistics, where empirical distributions (based on finite datasets) are compared with continuous distributions that arise in models or as limits.

6.2 Waiting for success

In the last section we calculated the probability for a certain number of successes in a fixed number of trials. In this section, we work with the model for infinitely many Bernoulli trials (see Model 6.1). We keep trying until the first success. That means Ω = {1, 01, 001, 0001, 00001, . . .}. Ω is countable. This can be shown, for example, by using the one-to-one map between Ω and N0 that is defined by counting the number of zeros in any outcome ω ∈ Ω.

Define events based on the waiting time:

Wk = “it takes k trials until first success” for k = 1, 2, . . . (20)

To calculate the probability for Wk just look at the meaning: the first k − 1 trials are failures, the kth trial is a success. Since the trials are independent, the probability factorizes:

P(Wk) = q^{k−1} p   (k = 1, 2, . . .).   (21)


According to Example 4.4, formula (21) does define a probability measure on (Ω, P(Ω)). The weights pk = P(Wk) are non-negative, and their sum equals 1, because (Wk)_{k∈N} form a partition of Ω. The probability measure on (Ω, P(Ω)) can also be represented as a probability measure on (N, P(N)). Probability measures on spaces like N or R are often called distributions.

Definition 6.7. Geometric distribution.
The measure P(k) = q^{k−1} p (k = 1, 2, . . .) on {1, 2, . . .} is called the geometric distribution.

To get a feeling for this distribution it is helpful to look at the ratios of consecutive probabilities (they are also useful to quickly sketch a histogram of the distribution).

P(k + 1) / P(k) = q   (k = 1, 2, . . .).   (22)

Define the events Vk = "it takes at most k trials until first success" for k = 1, 2, . . . Let k ∈ N. Then Vk = W1 ∪ W2 ∪ . . . ∪ Wk with W1, W2, . . . , Wk mutually exclusive. Using results about geometric sums (see Analysis textbooks) we get

P(Vk) = \sum_{i=1}^{k} P(Wi) = \sum_{i=1}^{k} q^{i−1} · p = ((1 − q^k)/(1 − q)) · p = 1 − q^k.   (23)

Finally, let us put this into practice in an everyday life example.

Example 6.8. Can you spare some change?
Penny needs some cash. She positions herself on the pavement and keeps asking everybody who passes by for some change until she succeeds. Assume that people react to her request independently of each other and that each of her requests is successful with probability 1/10. Calculate the probability for the following events:

A = "She has to ask at least 5 people until the first success"
B = "She has to ask exactly 5 people until the first success"
C = "She has to ask at most 4 people until the first success"

Because of the assumptions, the situation can be described by a model for infinitely many Bernoulli trials with success probability 1/10 (see Model 6.1).

Another way of describing A is to say that the first 4 trials are failures (and the fifth and later trials may or may not be successes). This implies P(A) = (9/10)^4.

Since B = W5, Definition 6.7 yields P(B) = (9/10)^4 · 1/10.

C is the complement of A, so P(C) = 1 − (9/10)^4. Or use C = V4 and (23) to get the same result.
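The geometric probabilities in this example are easy to tabulate, and a short simulation gives an independent sanity check. The sketch below is only illustrative; the success probability and the variable names follow the Penny example.

import random

p, q = 0.1, 0.9
# exact values from the geometric distribution (Definition 6.7) and (23)
P_A = q**4                      # at least 5 requests needed
P_B = q**4 * p                  # exactly 5 requests needed (W_5)
P_C = 1 - q**4                  # at most 4 requests needed (V_4)
print(P_A, P_B, P_C)
# crude Monte Carlo check of P(B): simulate waiting times for the first success
random.seed(1)
trials = 100_000
def waiting_time():
    k = 1
    while random.random() >= p:   # failure happens with probability q
        k += 1
    return k
print(sum(waiting_time() == 5 for _ in range(trials)) / trials)   # close to P_B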

ω: 'Excuse me UK, can you spare some cash? Excuse me EU, can you spare some cash?' Is the Penny example about illiquidity?

Ω: I wasn’t going to talk about finance here.

ω: Why not? The totally most important application of probability theory...

Ω: It certainly is an application of no less importance than Archeology, Biology, Coding, Demography, Electrical engineering, Genetics, Hurricanes, Insurances, Juggling (most times), Knitting (sometimes), Logic, Magnetism, Networks, Operations research, Psychology, Queuing, Risk, Statistics, Trees, Uncertainty, Volcano science, Walks, some random applications X and Y, and the Zeta function.

Question 6.9. Waiting for some more change.
Penny in Example 6.8 needs some more cash. She keeps asking people in the street until she succeeds in receiving cash from at least 5 of them. What is the probability that she has to ask exactly k people?


6.3 Waiting for multiple successes

As in the last section, consider the model for infinitely many Bernoulli trials (see Model 6.1). Now keep trying until you have obtained r successes. The outcome space Ωr now consists of all 0-1-patterns that have exactly r 1's in them, one of them at the end.
For example, for r = 2: Ωr = {11, 011, 101, 0011, 0101, 1001, 00011, 00101, 01001, 10001, . . .}.

Lemma 6.10. Ωr is countable (for any r ∈ N).

Proof. There is no canonical candidate for a one-to-one map between Ωr and the integers as in the last section, but it is still easy to do. One could arrange the outcomes in a triangle and then define a map from the integers to Ωr by enumerating the outcomes row by row. For example, for r = 2 use

11
011 101
0011 0101 1001
00011 00101 01001 10001
. . .

and define f(1) = 11, f(2) = 011, f(3) = 101, f(4) = 0011, . . . It works similarly for larger r, but the explicit construction of such a map becomes more cumbersome as r gets bigger. Here is another way to show that Ωr is countable: every ω ∈ Ωr can be described by its length and the locations of the first r − 1 successes. This suggests a surjection from the countable set N^{r−1} × N onto Ωr, which implies that Ωr is countable.

We are interested in the probabilities for waiting a certain number of trials until the rth success. Proceed similarly to the last section.

Define events based on the waiting time until the rth success:

Wk = "it takes k trials until the rth success"   for k = r, r + 1, . . .   (24)

To calculate the probability for Wk just look at the meaning: the kth trial is a success, and there are r − 1 successes and k − 1 − (r − 1) failures among the first k − 1 trials. For any outcome of this kind the probability is p^{r−1} q^{k−1−(r−1)} · p = p^r q^{k−r}. So, similarly to the derivation of (17), we obtain

P(Wk) = \binom{k − 1}{r − 1} p^r q^{k−r}   (k = r, r + 1, . . .).   (25)

This probability measure on (Ωr, P(Ωr)) can also be represented as a probability measure on (N, P(N)).

Definition 6.11. Negative binomial distribution.
The measure P(k) = \binom{k−1}{r−1} p^r q^{k−r} (k = r, r + 1, . . .) on {r, r + 1, r + 2, . . .} is called the negative binomial distribution.

To get a feeling for the probabilities of these events it is helpful to look at the ratios of consecutive probabilities (they are also handy to quickly sketch a histogram). A simple calculation yields

P(W_{k+1}) / P(Wk) = (k / (k − r + 1)) · q   (k = r, r + 1, . . .).   (26)


Here are some applications of the negative binomial distribution. In other words, the Bernoulli trials process may be an appropriate model and we are interested in looking at the waiting time for the rth success:
• A rope consists of several cables. It has to be replaced after 3 of them have broken.
• How likely is it that milk bottles get stolen more than twice in a year?
• Parents of pupils who arrive late more than 3 times a term need to see the head teacher.
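For concreteness, here is a small Python sketch that evaluates formula (25). The names and the choice r = 3, p = 0.05 (say, the chance that a cable breaks in a given month in the rope example) are purely illustrative assumptions.

from math import comb

def negative_binomial_pmf(r, p, k):
    # P(W_k): the r-th success occurs on trial k, formula (25)
    q = 1 - p
    return comb(k - 1, r - 1) * p**r * q**(k - r)

r, p = 3, 0.05
# probability that the 3rd "success" (broken cable) happens on trial k, for small k
for k in range(r, r + 5):
    print(k, negative_binomial_pmf(r, p, k))
# the weights sum to 1 (approximated here by a long but finite sum)
print(sum(negative_binomial_pmf(r, p, k) for k in range(r, 2000)))  # close to 1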

6.4 Occurrences of rare independent events

6.4.1 Poisson approximation to the binomial (for rare events)

Use Saul Jacka’s lecture notes p.24/25.
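These notes defer the derivation, but the effect of the approximation is easy to see numerically. The sketch below compares binomial probabilities for large n and small p with the Poisson weights e^{−µ} µ^k / k! (defined in the next subsection), taking µ = np; it is only an illustration of the idea, not a substitute for the cited derivation.

from math import comb, exp, factorial

def binomial_pmf(n, p, k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(mu, k):
    return exp(-mu) * mu**k / factorial(k)

# rare events: n large, p small, mu = n * p moderate
n, p = 1000, 0.0032
mu = n * p                      # 3.2, as in the radioactive decay example below
for k in range(5):
    print(k, binomial_pmf(n, p, k), poisson_pmf(mu, k))   # the two columns are close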

6.4.2 Poisson distribution

It turns out that the numbers e^{−µ} µ^k / k! add up to 1, because \sum_{k=0}^{∞} µ^k / k! = e^µ. Furthermore, they are non-negative, so they actually define a probability mass function.

Definition 6.12. Poisson distribution.
Let µ > 0. The measure Pµ(k) = e^{−µ} · µ^k / k! (k = 0, 1, 2, . . .) on {0, 1, 2, . . .} is called the Poisson distribution with intensity parameter µ.

ω: We use this for counting fish. And the Sauterelle distribution for counting grasshoppers and the de Chevalier's distribution for counting horsemen.

Ω: Oh you're being silly. Read this old story about Chevalier de Mere... e.g. at http://www.ualberta.ca/MATH/gauss/fcm/BscIdeas/probability/DeMere.htm.

Here are some of the traditional applications of the Poisson distribution.

Example 6.13. Misprints.
From experience it is known that misprints in books by the publisher PROP (ProofReadOncePress) follow a Poisson distribution with a rate of 0.5 misprints per page. What is the probability that in a book from PROP there is at least one misprint on page 20? Let N be the number of misprints on page 20. Then

P(N ≥ 1) = 1 − P(N = 0) = 1 − e^{−0.5} · 0.5^0 / 0! = 1 − e^{−0.5} ≈ 0.393.

In other words, there is a probability of almost 40% for a misprint on page 20 (or on any other page for that matter).

Example 6.14. Radioactive decay.
Let N be the number of α-particles given off by a 1-gram block of radioactive material during a 1-second interval. From previous experiments it is known that, on average, 3.2 α-particles are emitted per second. 1 gram of material consists of n atoms for some large n. Each atom has a probability of 3.2/n of disintegrating and sending off an α-particle during the next second. We assume independence, which justifies the use of a binomial model. Using the Poisson approximation to the binomial helps with explicit computations. For example, we will approximate the probability that no more than 2 α-particles will be


emitted during the next second.

P(N ≤ 2) = P(N = 0) + P(N = 1) + P(N = 2)
= e^{−3.2} + 3.2 · e^{−3.2} + (3.2^2 / 2!) · e^{−3.2}
= (1 + 3.2 + 3.2^2 / 2!) · e^{−3.2} ≈ 0.37989.

Example 6.15. Earthquakes.
Earthquakes are very unpredictable. A common simple model is to assume they follow a Poisson distribution. For a region in the West of the US it is known that the rate is about 2 earthquakes per week. We will now calculate a few different kinds of probabilities using this model.

(i) What is the probability that there are at least 3 earthquakes next week? Let N be the number of earthquakes next week.
P(N ≥ 3) = 1 − \sum_{k=0}^{2} P(N = k) = 1 − (2^2/2! + 2^1/1! + 2^0/0!) · e^{−2} = 1 − 5 e^{−2}.

(ii) What is the probability that there are exactly k earthquakes during the next 3 weeks? This corresponds to changing the unit to 3 weeks with a new intensity parameter 6. The probability is therefore e^{−6} · 6^k / k!.

More generally, what is the probability that there are exactly k earthquakes during the next m weeks? Reasoning as before we obtain e^{−2m} · (2m)^k / k!.

(iii) Let us now change the perspective from which we look at the situation and define T to be the time (in weeks) until the next earthquake. Let Nm be the number of earthquakes during the next m weeks. The crucial point here is to understand that the sets {T > m} and {Nm = 0} are identical. Therefore P(T > m) = P(Nm = 0) = e^{−2m}.
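The three quantities above are easy to evaluate with a few lines of Python; the following sketch is only a numerical companion to the calculations, with the rate of 2 earthquakes per week taken from the example.

from math import exp, factorial

def poisson_pmf(mu, k):
    return exp(-mu) * mu**k / factorial(k)

rate = 2.0                                        # earthquakes per week
# (i) at least 3 earthquakes next week: 1 - 5 e^{-2}
print(1 - sum(poisson_pmf(rate, k) for k in range(3)))
# (ii) exactly k earthquakes in the next m weeks: the intensity becomes 2m
m, k = 3, 4
print(poisson_pmf(rate * m, k))
# (iii) no earthquake for more than m weeks: P(T > m) = e^{-2m}
print(poisson_pmf(rate * m, 0), exp(-rate * m))   # the two expressions agree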

Some other situations where a Poisson distribution is a good candidate for a suitable model:
• number of people who survive age 100
• number of floods in the UK per year
• number of occurrences of a certain pattern in a stretch of DNA

6.4.3 Random scatter

Some comments about the T-shirt with the random scatter — not relevant for the exam.

7 Random variables and distributions

In many examples the aim is to compute the probability of an expression of the form

N = “number of <something that occurs randomly>”.

For example:

N = "number of heads in n coin tosses"
N = "number of coin tosses until third head"
N = "number of α-particles emitted by 1 gram of radioactive material in 1 second"
N = "number of people taking the 18:22 bus to Leamington Spa"


N = "number of goals scored by Coventry City FC in the 2009-2010 season"
N = "number of broken eggs in an egg carton selected at random from your grocery store"

More generally, a lot of examples call for the probability of an expression of the form

X = “<some mathematical function> of <an outcome of a random experiment>”.

For example:

X = "sum of two dice"
X = "test score of a randomly selected student from this class"
X = "phone number selected at random from the Coventry phone book"
X = "weight (rounded off to the nearest stone) of a randomly selected British citizen"
X = "minimum due date of ten books you borrowed from the library"

In other words, we are interested in events that are characterised by the level sets of some function of the outcomes of the random experiment. Formally, this is captured by the following concept.

Definition 7.1. Discrete random variable.
Let (Ω, F) be a measurable space and let S be a countable set. A function X from Ω to S is called a discrete random variable (RV) if

{X = x} ∈ F   for all x ∈ S.   (27)

ω: I will not ever never use a random variable. They are totally unnecessary, because I completely already know everything from the sample space. If I want to know the number of heads in three coin tosses then I choose Ω = {0, 1, 2, 3} and my probabilities are defined right there.

Ω: Okay, whether or not you use random variables is your choice, but consider that I'm witnessing the exact same coin tossing experiment, and it would be nice if we could simply share the outcome space. I'm interested in whether or not the last toss is a head, and your Ω = {0, 1, 2, 3} does NOT tell me that.

Remark 7.2. About notation and about the rationale for this definition.

(i) The first term in (27) is not a typographic error, but a very common short form used in probability theory and measure theory. Here is what it means: {X = s} = {ω ∈ Ω | X(ω) = s} = X^{−1}(s). Expressions like {X > s} and {X ≥ s} are defined accordingly.

(ii) Typical choices for S are finite sets, N or Z. S does not need to be numerical; for example, a random variable describing the colour of a stripe picked at random from the mural in the Street takes values in a finite set of colours. Another example of a colour-valued random variable is shown in Example 7.8 below.

(iii) Condition (27) is called measurability of the function X, a basic concept in measure theory that will be used extensively in modules on advanced probability theory and measure theory. The immediate use here is that, because of the measurability of X, any probability measure P on the space (Ω, F) canonically defines probabilities for the level sets of X via the equation pX := P ∘ X^{−1}, as detailed in Definition 7.6.

(iv) Condition (27) may seem artificial in the discrete setting, because one could always force it to be true simply by choosing F = P(Ω). By contrast, this is not always possible in the continuous case and not always appropriate in the discrete case (see Remark 4.1).


Example 7.3. Indicator functions.
Let (Ω, F) be a measurable space and A ∈ F. The function 1_A on Ω defined by

1_A(ω) = 1 for ω ∈ A,  0 for ω ∈ A^c   (28)

is a discrete random variable with range S = {0, 1}.

Note the trivial cases 1_Ω ≡ 1 and 1_∅ ≡ 0, which do not actually depend on ω. In other words, these random variables have degenerated into something deterministic (that is, non-random).

Lemma 7.4. Properties of indicator functions.

(i) 1A + 1Ac = 1 for any A ∈ F .

(ii) 1A∪B = 1A + 1B − 1A∩B for any A,B ∈ F .

Can you prove this? Try before you read on.
Or try this hint: distinguish the cases (ω ∈ A, ω ∈ A^c etc.).

Proof of the lemma.
(i) Distinguish two cases.
1. If ω ∈ A then 1_A(ω) + 1_{A^c}(ω) = 1 + 0 = 1.
2. If ω ∈ A^c then 1_A(ω) + 1_{A^c}(ω) = 0 + 1 = 1.
(ii) If ω ∈ (A ∪ B)^c then all indicator functions are 0 and the equality is obvious.
Otherwise, 1_{A∪B}(ω) = 1 and we have to show that 1_A(ω) + 1_B(ω) − 1_{A∩B}(ω) = 1.
We distinguish three cases.
1. ω ∈ A ∩ B^c : 1_A(ω) + 1_B(ω) − 1_{A∩B}(ω) = 1 + 0 − 0 = 1.
2. ω ∈ B ∩ A^c : 1_A(ω) + 1_B(ω) − 1_{A∩B}(ω) = 0 + 1 − 0 = 1.
3. ω ∈ A ∩ B : 1_A(ω) + 1_B(ω) − 1_{A∩B}(ω) = 1 + 1 − 1 = 1.

ω: These indicator functions are a very usefulish invention.

Ω: After your dislike of random variables earlier, that comment does surprise me.

Example 7.5. Some random variables from three coin tosses.
All of the following functions are discrete random variables on the measurable space given by Ω = {hhh, hht, hth, htt, thh, tht, tth, ttt} and F = P(Ω) (= power set of Ω).

X(ω) = “number of heads in ω”

Y is the pay-off of the following game: for every toss that lands heads you get £1, for every toss that lands tails you lose £1. In other words,

Y (ω) = “number of heads in ω” − “number of tails in ω”

Z(ω) = 1 if the third toss lands heads, and Z(ω) = 0 otherwise.

T is the time until the first head comes up. If no heads come up in the three tosses, set T(ω) = 4.

The table below shows the values the random variables take for the different outcomes.

ω      hhh  hht  hth  htt  thh  tht  tth  ttt
X(ω)    3    2    2    1    2    1    1    0
Y(ω)    3    1    1   -1    1   -1   -1   -3
Z(ω)    1    0    1    0    1    0    1    0
T(ω)    1    1    1    1    2    2    3    4


Assume that the coin is fair and the tosses are independent, so P(ω) = 1/8 for all ω ∈ Ω. By Remark 7.2(iii), this defines probabilities for the level sets of the random variables. For example:
pX(3) = P(X^{−1}(3)) = P({hhh}) = 1/8, pX(2) = P(X^{−1}(2)) = 3/8 etc.
pT(1) = P(T^{−1}(1)) = P({hhh, hht, hth, htt}) = 4/8, pT(2) = P(T^{−1}(2)) = 2/8 etc.

The table below shows the probabilities of all non-empty level sets for the four random variables.

x       0    1    2    3
pX(x)   1/8  3/8  3/8  1/8

y      -3   -1    1    3
pY(y)   1/8  3/8  3/8  1/8

z       0    1
pZ(z)   1/2  1/2

t       1    2    3    4
pT(t)   1/2  1/4  1/8  1/8
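These tables can also be generated mechanically. The following Python sketch builds the outcome space of three fair coin tosses and tabulates the PMFs of X, Y, Z and T; it is just a companion to the example, and the helper name pmf is an illustrative choice.

from fractions import Fraction
from itertools import product
from collections import defaultdict

omega = ["".join(w) for w in product("ht", repeat=3)]     # hhh, hht, ...
P = {w: Fraction(1, 8) for w in omega}                     # fair coin, independent tosses
X = lambda w: w.count("h")                                 # number of heads
Y = lambda w: w.count("h") - w.count("t")                  # pay-off of the game
Z = lambda w: 1 if w[2] == "h" else 0                      # third toss heads?
T = lambda w: w.index("h") + 1 if "h" in w else 4          # waiting time for first head

def pmf(rv):
    # p_X(x) = P(X = x), obtained by summing P(omega) over the level sets
    table = defaultdict(Fraction)
    for w in omega:
        table[rv(w)] += P[w]
    return dict(sorted(table.items()))

for name, rv in [("X", X), ("Y", Y), ("Z", Z), ("T", T)]:
    print(name, pmf(rv))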

We now define formally what we just computed in the last example.

Definition 7.6. Probability mass function (or probability distribution)
Let (Ω, F, P) be a probability space and X : Ω −→ S a discrete random variable. Then

pX(x) = P (X = x) (x ∈ S) (29)

is called the probability mass function (PMF) or probability distribution of X. We also say that X is distributed according to pX or, for short, X ∼ pX.

Example 7.7. Indicator functions.
Let (Ω, F, P) be a probability space, A ∈ F and X := 1_A the indicator function of A as defined in Example 7.3. Then range(X) = {0, 1} and pX(1) = P(1_A = 1) = P(A) and pX(0) = P(1_A = 0) = P(A^c) = 1 − P(A).

The above example is the easiest non-trivial example. We have already seen a number of more complicated probability mass functions for random variables, such as the binomial distribution (X = "number of successes"), the geometric distribution (X = "number of trials until the first success") and the negative binomial distribution (X = "number of trials until the rth success"). The following example shows a probability mass function obtained from (real world) data.

Example 7.8. Colour of flowers (a PMF defined by data).
Poinsettias can be red, pink, or white. In one study of the hereditary mechanism controlling the colour, 182 progeny of a certain parental cross were categorised by colour, resulting in 108 red ones, 34 pink ones and 40 white ones.
Let X be the colour of a poinsettia picked at random. Then the PMF of X is given by:

x       red   pink  white
pX(x)   0.59  0.19  0.22

Lemma 7.9. Transformation of a random variable.
Let (Ω, F, P) be a probability space and X a discrete random variable with range S. Let ψ be a numerical function on S. Then Y := ψ ∘ X defines a discrete random variable.

Proof. The range of Y is ψ(S). It is discrete because its cardinality is at most the cardinality of S. It remains to show condition (27) for Y. For any y ∈ ψ(S),

{Y = y} = {ω ∈ Ω | ψ(X(ω)) = y} = {ω ∈ Ω | X(ω) ∈ ψ^{−1}(y)} = \bigcup_{x ∈ ψ^{−1}(y)} {X = x}.

By (27) for X, each of the individual sets is in F, and since F is a σ-algebra, so is their (countable) union.

Using the addition axiom we find that the distribution of Y is given by

pY(y) = P(ψ(X) = y) = \sum_{x: ψ(x)=y} pX(x)   (30)

Example 7.5. Some random variables from three coin tosses (cont.)
Y = ψ ∘ X for ψ(x) = 2x − n (here n = 3) and pY(y) = \sum_{x: ψ(x)=y} pX(x).
In particular, pY(3) = pX(3), pY(1) = pX(2), pY(−1) = pX(1), pY(−3) = pX(0).
Now choose instead Y = ψ ∘ X for ψ(x) = |2x − n|. In other words, Y is the gain in the game, regardless of who receives it. Again pY(y) = \sum_{x: ψ(x)=y} pX(x), but now pY(3) = pX(3) + pX(0) and pY(1) = pX(2) + pX(1).

Lemma 7.10. Combination of random variables.
Let (Ω, F) be a measurable space. Let X and Y be discrete random variables with ranges S and R that are discrete subsets of the real numbers. Then the following expressions are also discrete random variables: max(X, Y), min(X, Y), X + Y, X − Y, X · Y and, if Y(ω) ≠ 0 for all ω ∈ Ω, X/Y.

Proof. Let Z := (X, Y). This is a discrete random variable from Ω to the countable product set S × R, because {Z = (x, y)} = {X = x} ∩ {Y = y} ∈ F. The operations max, min, +, −, · and / between the two random variables can be understood as a function ψ of the pair Z. For example, X + Y = ψ1 ∘ Z, where ψ1 is the function on S × R defined by ψ1(x, y) := x + y, and max(X, Y) = ψ2 ∘ Z, for ψ2(x, y) := max(x, y). Since S and R are discrete subsets of the reals, so is the range of ψ ∘ Z. By Lemma 7.9, ψ ∘ Z is a discrete random variable.

Remark 7.11. Continuous analogues.
Both Lemma 7.9 and Lemma 7.10 have analogues for continuous random variables, but additional assumptions are needed and more care needs to be taken in the proofs.

Example 7.12. Simple random variables built from indicator functions.
Let αi ∈ R (i = 1, . . . , n) and let Ai ∈ F (i = 1, . . . , n). By Example 7.3, the latter define discrete random variables. Applying Lemma 7.9 to ψi(x) = αi x yields that αi 1_{Ai} (i = 1, . . . , n) are also discrete random variables, and multiple applications of Lemma 7.10 ensure that \sum_{i=1}^{n} αi 1_{Ai} is also a discrete random variable. Random variables with such a representation are called simple random variables.
A common technique used in proofs is to first show a result for such simple random variables, then approximate more general random variables with simple random variables and finally work out how to carry over the result to the limit.

Note that in the proof above we made use of the pair of the two random variables, viewed as a single random variable with values in the product set. Thinking about them in this way will often come in handy when computing their probability mass function. The following definition helps to formalise this.

Definition 7.13. Joint PMF/joint distribution.
Let (Ω, F, P) be a probability space and let X and Y be discrete random variables. Then

p(x, y) = P (X = x, Y = y) ((x, y) ∈ range(X)× range(Y )) (31)

is called the joint distribution (or joint PMF) of X and Y.


Any set of non-negative numbers p(x, y) (x ∈ S, y ∈ R) with \sum_{x∈S, y∈R} p(x, y) = 1 defines a joint distribution for X and Y. Given such a joint distribution, we can obtain distributions for the individual random variables by the column sums and the row sums of the joint probabilities:

pX(x) := \sum_{y∈R} p(x, y)   and   pY(y) := \sum_{x∈S} p(x, y).   (32)

They are called the marginal distributions of X and Y.

Remark 7.14. What does the comma mean?
It's just a short form: {X = x, Y = y} = {X = x} ∩ {Y = y}.

Example 7.15. Golden tickets (continued).
In the situation of Question 5.4 define two random variables, one denotes the chosen shop and the other one says whether or not a golden ticket was found in the chocolate bar. For example, set X = 1_{Se} and Y = 1_G. Obviously, range(X) = range(Y) = {0, 1}.
The marginal distribution of X is given by the assumption and can also be recovered as the sums of the columns: pX(0) = pX(1) = 1/2.
The joint distribution of X and Y can be computed using the multiplication rule.
p(0, 0) = P(X = 0, Y = 0) = P(Su ∩ G^c) = P(Su) · P(G^c | Su) = 1/2 · 2/3 = 1/3.
p(0, 1) = P(X = 0, Y = 1) = P(Su ∩ G) = P(Su) · P(G | Su) = 1/2 · 1/3 = 1/6.
p(1, 0) = P(X = 1, Y = 0) = P(Se ∩ G^c) = P(Se) · P(G^c | Se) = 1/2 · 1/2 = 1/4.
p(1, 1) = P(X = 1, Y = 1) = P(Se ∩ G) = P(Se) · P(G | Se) = 1/2 · 1/2 = 1/4.

                     x = 0   x = 1   distribution of Y
 y = 0               4/12    3/12    7/12
 y = 1               2/12    3/12    5/12
 distribution of X   6/12    6/12                        (33)

The distribution of Y is obtained by the sums of the rows: pY(0) = 7/12 and pY(1) = 5/12.
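A tiny Python sketch that stores the joint distribution (33) and recomputes the marginals as column and row sums; the dictionary layout is just one convenient, illustrative choice.

from fractions import Fraction

F = Fraction
# joint distribution p(x, y) from (33); keys are (x, y) with x = 1_Se, y = 1_G
p = {(0, 0): F(4, 12), (0, 1): F(2, 12),
     (1, 0): F(3, 12), (1, 1): F(3, 12)}
assert sum(p.values()) == 1
# marginal of X: sum over y (column sums); marginal of Y: sum over x (row sums)
p_X = {x: sum(v for (a, b), v in p.items() if a == x) for x in (0, 1)}
p_Y = {y: sum(v for (a, b), v in p.items() if b == y) for y in (0, 1)}
print(p_X)   # {0: 1/2, 1: 1/2}
print(p_Y)   # {0: 7/12, 1: 5/12}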

7.1 Expectation

Ω: Look, I've got a box with two 50p-coins, one £1-coin and two £2-coins. You can pick one at random. What do you expect to get?

What do you expect to get?

ω: One of the five coins from that box!

Ω: You're always taking things so literally. Think of the average outcome...

ω: OK. I can do that. £(2 · 0.5 + 1 · 1 + 2 · 2)/5 = £6/5 = £1.20.

Ω: This can be rewritten as p(0.5) · 0.5 + p(1) · 1 + p(2) · 2, where p(ω) = P({ω}) for ω ∈ Ω = {0.5, 1, 2}, with p(0.5) = P(£0.50) = 2/5, p(1) = P(£1) = 1/5 and p(2) = P(£2) = 2/5.

Definition 7.16. Expectation of a discrete random variable.
Let (Ω, F, P) be a probability space and X a random variable with countable range S and PMF pX. Then

E[X] = \sum_{x∈S} pX(x) · x   (34)

is called the expected value of X (provided the sum is absolutely convergent).


In other words, E[X] is the average of all possible values of X weighted by their probabilities.

Example 7.17. Tickets in a box.
There is a box with n tickets with numbers z1, . . . , zn ∈ N. One ticket is drawn at random. To compute the expected value of the number shown on the ticket, use the model in which each of the n tickets is equally likely to be drawn, and let X be the number on the ticket. X is a random variable and E[X] = (\sum_{i=1}^{n} zi)/n. Some explicit examples:

(i) The numbers are 1, 1, 2, 4. Then E[X] = (1 + 1 + 2 + 4)/4 = 2.

(ii) There are 10 tickets in the box. One of them has the number 1, all others have a 0. E[X] = (9 · 0 + 1 · 1)/10 = 0.1.

(iii) There are 100 tickets in the box. One of them has the number 1000, all others have the number 1. E[X] = (99 · 1 + 1 · 1000)/100 = 1099/100 = 10.99.

Remark 7.18. Is the expectation a good summary of data?
Imagine you live in a country where 1% of the population earns 1000 Dollars a day and 99% of the population earn 1 Dollar a day. Picking a person at random from the population and asking about his or her salary corresponds to (iii). Now imagine you are one of the 99% and you read in the newspaper that in your country the average daily salary is about 11 Dollars. That is 11 times more than yours, even though you feel like you earn just the same as pretty much everybody else around you.

What happened? The one outlier value pulled the expected value up. It now gives an unrealistic picture of the salary distribution in that country.

What can be done? Remove outliers, use alternative measures of location (e.g. the median) that are more robust to outliers, always also report a measure of spread, etc.

In other examples, the expectation may give you a good first idea about the magnitude of the numbers in a data set. It depends!

Example 7.19. Dice.
Roll a die once. Let X be the number shown. The model is the same as for tickets in a box. E[X] = (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5. This was for a hexahedral die.
For a tetrahedral die: E[X] = (1 + 2 + 3 + 4)/4 = 2.5.
For an octahedral die: E[X] = (1 + 2 + 3 + 4 + 5 + 6 + 7 + 8)/8 = 4.5.
For a dodecahedral die: E[X] = (1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 11 + 12)/12 = 78/12 = 6.5.
For an icosahedral die: DIY.
See Section 5.2.1 on Wikipedia under "dice" for more info and a beautiful picture of a Platonic solid set of dice.

Example 7.20. Expected value of an indicator function.
Let (Ω, F, P) be a probability space, A ∈ F and 1_A the indicator function of A defined in Example 7.3. Then, using (34) and Example 7.7,

E[1_A] = p_{1_A}(0) · 0 + p_{1_A}(1) · 1 = P(A).

Lemma 7.21. Linearity of the expected value.
Let X and Y be random variables on a probability space (Ω, F, P). Then

E[X + Y ] = E[X] + E[Y ] and E[αX] = αE[X] for all α ∈ R (35)


Remark 7.22. Properties of the expected value.
(i) Note that E[X · Y] = E[X] · E[Y] is not generally true. In fact, this is an additional property of X and Y; if it holds, X and Y are called uncorrelated.
(ii) Using induction, (35) can be generalised as follows. For any random variables Xi (i = 1, . . . , n) on (Ω, F, P) and all αi ∈ R (i = 1, . . . , n),

E[ \sum_{i=1}^{n} αi Xi ] = \sum_{i=1}^{n} αi E[Xi]   (36)

Example 7.23. Many rolls of a die.
Roll a tetrahedral die n times. Let Ω = {1, 2, 3, 4}^n and let Xi (i = 1, . . . , n) be the number shown in the ith roll. The Xi are random and so is their sum Sn = \sum_{i=1}^{n} Xi and their average An := (1/n) · Sn. Let us compute their expected values using the linearity of E.

E[Sn] = \sum_{i=1}^{n} E[Xi] = n · 2.5.

E[An] = E[(1/n) Sn] = (1/n) E[Sn] = 2.5.

Question 7.24. Expected number of working components.
In a system of n components, each component works with a certain probability pi (i = 1, . . . , n). Compute the expected number of working components.

Answer: Use Ω0 = {w, f} as an outcome space describing the state of an individual component, using w for "works" and f for "fails". Let Ω = Ω0^n be the outcome space describing the state of the whole system. Let Ai (i = 1, . . . , n) be the event that the ith component works. (Formally, this looks for example as follows: A1 = {(w, ω2, . . . , ωn) | ωi ∈ Ω0 for i = 2, . . . , n}.) The number X of working components can be represented as the sum of indicator functions: X(ω) = 1_{A1}(ω) + . . . + 1_{An}(ω). Using (36) yields

E[X] = E[ \sum_{i=1}^{n} 1_{Ai} ] = \sum_{i=1}^{n} E[1_{Ai}] = \sum_{i=1}^{n} pi.   (37)

The method used in the answer above is very general and has got a name:

Lemma 7.25. Method of indicators.
Let (Ω, F, P) be a probability space and Ai ∈ F (i = 1, . . . , n), and let X be the number of these events that occur. Then E[X] = \sum_{i=1}^{n} P(Ai).

Proof. Using X(ω) = \sum_{i=1}^{n} 1_{Ai}(ω) (ω ∈ Ω), the statement follows from (37).

The proof is just a formalisation of the simple idea used in the explicit example above. It deserves to be a lemma, because the technique is often quite useful. However, it is a bit of an art to recognise when it can be applied.

Ω: They forgot the independence assumption.

ω: That is not at all true, because it was not at all needed!

Example 7.26. Expected number of successes in Bernoulli trials.
Let X be the number of successes in n independent Bernoulli trials with success parameter p. X has a binomial distribution and it is possible to compute the expected value of X directly using the weights of the binomial distribution. Much easier, however, is to just use the method of indicators. It yields E[X] = np.
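To see the two routes agree numerically, the sketch below computes E[X] once directly from the binomial weights (18) and once as np; the parameter values are arbitrary illustrative choices.

from math import comb

def binomial_pmf(n, p, k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 7, 0.1
# direct computation of E[X] = sum_k k * P(X = k), formula (34)
direct = sum(k * binomial_pmf(n, p, k) for k in range(n + 1))
print(direct, n * p)   # both give 0.7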


7.2 Variance

Definition 7.27. Variance of a discrete random variable.
Let (Ω, F, P) be a probability space and X a random variable with countable range S, PMF pX and expected value µ. Then

Var(X) = E[(X − µ)^2]   (38)

is called the variance of X.

In other words, we measure how much the outcomes deviate from the expected value by the square of their difference. Then we take the expected value of that expression. Using the square makes the expression always non-negative. Alternatively, absolute values could be used (and they are used in the so-called L1-norm approach to statistics).

Remark 7.28. Alternative expression for the variance.

Var(X) = E[X^2] − E[X]^2

This follows from the following computation using the linearity of the expected value and the fact that µ is just a constant (not actually random): E[(X − µ)^2] = E[X^2 − 2 · X · µ + µ^2] = E[X^2] − 2 · µ · E[X] + µ^2 = E[X^2] − 2 · µ · µ + µ^2 = E[X^2] − 2 · µ^2 + µ^2 = E[X^2] − µ^2.

Example 7.29. Tickets in a box.
Use the same notation and numbers as in Example 7.17.

(i) The numbers are 1, 1, 2, 4. E[X^2] = (1^2 + 1^2 + 2^2 + 4^2)/4 = 22/4 = 5.5, E[X]^2 = 4, so Var(X) = 5.5 − 4 = 1.5.

(ii) There are 10 tickets in the box. One of them has the number 1, all others have a 0. E[X^2] = (9 · 0^2 + 1 · 1^2)/10 = 0.1 and E[X]^2 = 0.1^2 = 0.01, so Var(X) = 0.09.

(iii) There are 100 tickets in the box. One of them has the number 1000, all others have the number 1. E[X^2] = (99 · 1^2 + 1 · 1000^2)/100 = 1,000,099/100 = 10,000.99 and E[X]^2 = 10.99^2, so Var(X) = 10,000.99 − 10.99^2.

The variance in (iii) is high. This is related to the issue discussed in Remark 7.18.
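The following sketch recomputes the three variances directly from the ticket numbers, using definition (38) and the alternative expression from Remark 7.28; the list-based representation of a box is just an illustrative choice.

from fractions import Fraction

def expectation(tickets):
    # each ticket equally likely, so E[X] is the plain average
    return Fraction(sum(tickets), len(tickets))

def variance(tickets):
    mu = expectation(tickets)
    # definition (38): E[(X - mu)^2]
    return sum((z - mu) ** 2 for z in tickets) / len(tickets)

boxes = {
    "(i)":   [1, 1, 2, 4],
    "(ii)":  [1] + [0] * 9,
    "(iii)": [1000] + [1] * 99,
}
for name, tickets in boxes.items():
    mu, var = expectation(tickets), variance(tickets)
    # check the alternative expression Var(X) = E[X^2] - E[X]^2
    alt = expectation([z * z for z in tickets]) - mu ** 2
    print(name, float(mu), float(var), var == alt)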

The expectation is a measure of location and the variance is a measure of spread. They are examples of one-parameter summaries of distributions or data sets. Their mathematical expressions can be computed by the recipes provided. However, whether they provide good summaries or whether they give an oversimplified or even misleading picture of the situation has to be checked in each particular case.