BACKGROUND NOTES
FYS 4550/FYS9550 -
EXPERIMENTAL HIGH ENERGY PHYSICS
AUTUMN 2017
PROBABILITY
A. STRANDLIE
NTNU
AND
UNIVERSITY OF OSLO
Probability
• Before embarking on the concept of probability, we will first define a
set of other concepts.
• A stochastic experiment is characterized by:
– All possible elementary outcomes of the experiment are known
– Only one of the outcomes can occur in a single experiment
– The outcome of an experiment is not known a priori
• Example: throwing a die
– Outcomes are: S={1,2,3,4,5,6}
– Can only observe one of these each time you throw
– Don’t know beforehand what you will observe
• The set S is called the sample space of the experiment
Probability
• An event A is one or more outcomes which satisfy certain
specifications
• Example: A=”odd number” when throwing a die
• An event is therefore also a subset of S
• Here: A={1,3,5}
• If B=”even number”, what is the subset of S describing B?
• The probability of occurrence of an event A, P(A), is a number
between 0 and 1
• Intuitively, a value of P(A) close to 0 means that A is expected to
occur very rarely in an experiment, whereas a value close to 1
means that A occurs very often
Probability
• There are three ways of quantifying probability
1. Classical approach, valid when all outcomes can be assumed equally
likely. Probability is defined as the number of favourable outcomes for a
given event divided by the total number of outcomes. Example: throwing
a die has N=6 different outcomes. Assume the event A =
”observing six spots”. Only n=1 of the outcomes is favourable for A.
P(A)=n/N=1/6≈0.167.
2. Approach based on the convergence value of the relative frequency for a
very large number of repeated, identical experiments. Example:
throwing a die, recording the relative frequency of occurrence of A for
various numbers of trials
3. Subjective approach, reflecting the ”degree of belief” of occurrence of a
certain event A. Possible guideline: convergence value of a large
number of hypothetical experiments
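The convergence described in approach 2 is easy to demonstrate with a short simulation. The sketch below (plain Python; the seed and trial counts are arbitrary choices) records the relative frequency of the event A = ”six spots” for a growing number of throws and compares it with the classical value 1/6:

```python
import random

random.seed(42)  # arbitrary seed, for reproducibility

# Relative frequency of A = "six spots" for growing numbers of throws.
# The classical answer is P(A) = 1/6 ~ 0.167.
for n in (100, 10_000, 1_000_000):
    hits = sum(1 for _ in range(n) if random.randint(1, 6) == 6)
    print(f"n = {n:>9}: relative frequency = {hits / n:.4f}")
```

The frequencies scatter widely for small n and settle close to 0.167 as n grows, which is what the convergence plot on the following slide illustrates.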
Probability
[Figure: convergence of the relative frequency towards the true probability, plotted against the logarithm (base 10) of the number of trials.]
Probability
• Approach 2) forms the basis of frequentist statistics, whereas
approach 3) is the baseline of Bayesian statistics
– Two different schools
• When estimating parameters from a set of data, the two approaches
usually give the same numbers for the estimates if there is a large
amount of data
• If there is little available data, estimates might differ
– No easy way of determining which approach is ”best”
– Both approaches advocated in high-energy physics experiments
• Will not enter any further into such questions in this course
Probability
• Will now look at probabilities of combinations of events
• Need some concepts from set theory:
• The union A ∪ B is a new event which occurs if A or B or both events
occur.
• Two events are disjoint if they cannot occur simultaneously.
• The intersection A ∩ B is a new event which occurs if both A and B
occur.
• The complement Ā is a new event which occurs if A does not occur.
Probability
[Venn diagram of the sample space S: events A and B overlap in A ∩ B; event C is disjoint with both A and B; the points are individual outcomes.]
Probability
• The mathematical axioms of probability:
1. Probability is never negative, P(A) ≥ 0
2. The probability for the event which corresponds to the entire
sample space S (i.e. the probability of observing any of the
possible outcomes of the experiment) is equal to the unit
value, i. e. P(S) = 1
3. Probability must comply with the addition rule of disjoint
events:
P(A1 ∪ A2 ∪ … ∪ An) = P(A1) + P(A2) + … + P(An)
• A couple of useful formulas which can be derived from the
axioms:
P(Ā) = 1 − P(A)
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
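Both derived formulas can be verified by direct enumeration on a die, using exact fractions. The events chosen below are arbitrary illustrations:

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # "even number"
B = {4, 5, 6}   # "at least four spots"

def P(event):
    # Classical probability: favourable outcomes over total outcomes.
    return Fraction(len(event), len(S))

# Complement rule: P(not A) = 1 - P(A)
assert P(S - A) == 1 - P(A)

# General addition rule: P(A or B) = P(A) + P(B) - P(A and B)
assert P(A | B) == P(A) + P(B) - P(A & B)
print(P(A | B))  # → 2/3
```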
Probability
[Venn diagram: events A and B overlapping in A ∩ B.]
Concept of conditional probability:
What is the probability of occurrence of A given that
we know B will occur, i.e. P(A|B)?
Probability
• Recalling the definition of probability as the number of favourable
outcomes divided by the total number of outcomes, we get:
P(A|B) = N_(A∩B) / N_B = (N_(A∩B)/N_tot) / (N_B/N_tot) = P(A ∩ B) / P(B)
• Example: throwing a die. A = {2, 4, 6}, B = {3, 4, 5, 6}
– What is P(A|B)?
A ∩ B = {4, 6}, so P(A ∩ B) = 1/3 and P(B) = 2/3, giving
P(A|B) = P(A ∩ B) / P(B) = (1/3) / (2/3) = 1/2
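The same answer comes out of a few lines of Python that count outcomes directly, mirroring the dice example above:

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}       # even number
B = {3, 4, 5, 6}    # at least three spots

P = lambda E: Fraction(len(E), len(S))   # classical probability

# Conditional probability P(A|B) = P(A and B) / P(B)
p_cond = P(A & B) / P(B)
print(p_cond)  # → 1/2
```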
Probability
[Venn diagram: the event A is split by B into the parts A ∩ B and A ∩ B̄.]
Important observation: A ∩ B and A ∩ B̄ are disjoint!
Probability
• Therefore:
P(A) = P((A ∩ B) ∪ (A ∩ B̄)) = P(A ∩ B) + P(A ∩ B̄)
     = P(A|B) P(B) + P(A|B̄) P(B̄)
• Expressing P(A) in terms of a subdivision of S in a set of other, disjoint events is called the law of total probability. The general formulation of this law is:
P(A) = Σi P(A|Bi) P(Bi)
where all {Bi} are disjoint and span the entire sample space S.
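A minimal sketch checking the law of total probability on the dice events used earlier, where B and its complement partition S:

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}        # even number
B = {3, 4, 5, 6}     # at least three spots
B_c = S - B          # complement of B

P = lambda E: Fraction(len(E), len(S))
cond = lambda E, F: P(E & F) / P(F)      # P(E|F)

# Law of total probability: P(A) = P(A|B)P(B) + P(A|B_c)P(B_c)
total = cond(A, B) * P(B) + cond(A, B_c) * P(B_c)
assert total == P(A)
print(total)  # → 1/2
```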
Probability
• From the definition of conditional probability it follows:
P(A ∩ B) = P(B) P(A|B) = P(A) P(B|A)
• A quick manipulation gives:
P(B|A) = P(A|B) P(B) / P(A)
which is called Bayes’ theorem.
Probability
• By using the law of total probability, one ends up with the general
formulation of Bayes’ theorem:
P(Bj|A) = P(A|Bj) P(Bj) / Σi P(A|Bi) P(Bi)
which is an extremely important result in statistics. Particularly in
Bayesian statistics this theorem is often used to update or refine the
knowledge about a set of unknown parameters by the introduction of
information from new data.
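Bayes’ theorem can likewise be checked by enumeration on the dice events from the conditional-probability example (a toy illustration, not a full Bayesian analysis):

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}
B = {3, 4, 5, 6}

P = lambda E: Fraction(len(E), len(S))
cond = lambda E, F: P(E & F) / P(F)      # P(E|F)

# Bayes' theorem: P(B|A) = P(A|B) P(B) / P(A)
assert cond(B, A) == cond(A, B) * P(B) / P(A)
print(cond(B, A))  # → 2/3
```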
Probability
• This can be explained by a rewrite of Bayes’ theorem:
P(parameters|data) ∝ P(data|parameters) × P(parameters).
P(data|parameters) is often called the likelihood, P(parameters)
denotes the prior knowledge of the parameters, whereas
P(parameters|data) is the posterior probability of the parameters
given the data.
• If P(parameters) cannot be deduced by any objective means, a
subjective belief of its value is used in Bayesian statistics.
• Since there is no fundamental rule describing how to deduce this
prior probability, Bayesian statistics is still debated (also in high-
energy physics!)
Probability
• Definition of independence of events A and B: P(A|B) = P(A), i.e.
any given information about B does not affect the probability of
observing A.
• Physically this means that the events A and B are uncorrelated.
• For practical applications such independence cannot be derived but
rather has to be assumed, given the nature of the physical problem
one intends to model.
• General multiplication rule for independent events A1, A2, …, An:
P(A1 ∩ A2 ∩ … ∩ An) = P(A1) P(A2) … P(An)
Probability
• Stochastic or random variable:
– Number which can be attached to all outcomes of an experiment
• Example: throwing two dice, sum of number of spots
– Mathematical terminology: real-valued function defined over the
elements of the sample space S of an experiment
– A capital letter is often used to denote a random variable, for instance X
• Simulation experiment: throwing two dice N times, recording the sum of
spots each time and calculating the relative frequency of occurrence
for each of the outcomes
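A possible implementation of this simulation experiment in plain Python (seed and N are arbitrary choices):

```python
import random
from collections import Counter

random.seed(1)  # arbitrary seed

N = 100_000
# Throw two dice N times and record the sum of spots.
counts = Counter(random.randint(1, 6) + random.randint(1, 6) for _ in range(N))

for s in range(2, 13):
    exact = (6 - abs(s - 7)) / 36   # e.g. P(X = 7) = 6/36
    print(f"X = {s:>2}: observed {counts[s] / N:.4f}, expected {exact:.4f}")
```

The observed column approaches the expected column as N grows, which is what the series of figures that follows illustrates.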
Probability
[Figure series: observed relative frequencies (blue columns) and theoretically expected relative frequencies (red columns) of the sum of two dice, for N = 10, 20, 100, 1000, 10000, 100000, 1000000 and 10000000 throws.]
Probability
• The relative frequencies seem to converge towards the theoretically
expected probabilities
• Such a diagram is an expression of a probability distribution:
– A list of all different values of a random variable together with the
associated probabilities
– Mathematically: a function f(x) = P(X=x) defined for all possible values x
of X (given by the experiment at hand)
– The values of X can be discrete (as in the previous example), or
continuous
– For continuous x, f(x) is called a probability density function
• Simulation experiment: height of Norwegian men
• Collecting data, calculating relative frequencies of occurrences in
intervals of various widths
Probability
[Figure series: relative-frequency histograms of the heights for interval widths of 10 cm, 5 cm, 1 cm and 0.5 cm; as the interval width approaches 0, the histogram approaches a continuous probability distribution.]
Probability
• Cumulative distribution function: F(a) = P(X ≤ a)
• For discrete random variables:
F(a) = P(X ≤ a) = Σ_(xi ≤ a) f(xi)
• For continuous random variables:
F(a) = ∫_(−∞)^a f(x) dx
Probability
• It follows:
P(a < X ≤ b) = F(b) − F(a)
• For continuous variables:
P(a < X ≤ b) = ∫_a^b f(x) dx
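For a discrete variable these relations reduce to finite sums; a quick check for a single die:

```python
from fractions import Fraction

# Probability distribution of a single die: f(x) = 1/6 for x = 1..6.
f = {x: Fraction(1, 6) for x in range(1, 7)}

def F(a):
    # Cumulative distribution function F(a) = P(X <= a)
    return sum(p for x, p in f.items() if x <= a)

assert F(3) == Fraction(1, 2)
assert F(5) - F(2) == Fraction(1, 2)   # P(2 < X <= 5) = F(5) - F(2)
print(F(3), F(5) - F(2))  # → 1/2 1/2
```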
Probability
[Figures: areas under a continuous pdf, with shaded regions showing P(a < X < b), P(X < b) and P(X > a).]
Probability
• A function u(X) of a random variable X is also a random variable.
• The expectation value of such a function is:
E(u(X)) = ∫ u(x) f(x) dx
• Two very important special cases are:
μ = E(X) = ∫ x f(x) dx   (the mean)
σ² = Var(X) = E((X − μ)²) = ∫ (x − μ)² f(x) dx   (the variance)
Probability
• The mean μ is the most important measure of the centre of the
distribution of X.
• The variance, or its square root σ, the standard deviation, is the
most important measure of the spread of the distribution of X around
the mean.
• The mean is the first moment of X, whereas the variance is the
second central moment of X.
• In general, the n’th moment of X is
μn = E(X^n) = ∫ x^n f(x) dx
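For a discrete variable the integrals become sums; computing the mean and variance of a single die directly from the definitions:

```python
from fractions import Fraction

f = {x: Fraction(1, 6) for x in range(1, 7)}   # pdf of a single die

# mu = E(X) = sum over x of x f(x)
mu = sum(x * p for x, p in f.items())

# Var(X) = E((X - mu)^2)
var = sum((x - mu) ** 2 * p for x, p in f.items())

print(mu, var)  # → 7/2 35/12
```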
Probability
• The n’th central moment is
mn = E((X − μ)^n) = ∫ (x − μ)^n f(x) dx
• Another measure of the centre of the distribution of X is the median,
defined as
F(x_med) = 1/2
or, in words, the value of X above which half of the probability lies
and below which the other half lies.
Probability
• Assume now that X and Y are two random variables with a joint
probability density function (pdf) f(x,y).
• The marginal pdf of X is
f1(x) = ∫ f(x, y) dy
whereas the marginal pdf of Y is
f2(y) = ∫ f(x, y) dx
Probability
• The mean values of X and Y are
μX = E(X) = ∫∫ x f(x, y) dx dy = ∫ x f1(x) dx
μY = E(Y) = ∫∫ y f(x, y) dx dy = ∫ y f2(y) dy
• The covariance of X and Y is
cov(X, Y) = E((X − μX)(Y − μY)) = E(XY) − μX μY
Probability
• If several random variables are considered simultaneously, one
frequently arranges the variables in a stochastic or random vector
X = (X1, X2, …, Xn)^T
• The covariances are then naturally displayed in a covariance matrix:
          | cov(X1, X1)  cov(X1, X2)  …  cov(X1, Xn) |
cov X  =  | cov(X2, X1)  cov(X2, X2)  …  cov(X2, Xn) |
          |      ⋮             ⋮       ⋱       ⋮      |
          | cov(Xn, X1)  cov(Xn, X2)  …  cov(Xn, Xn) |
Probability
• If two variables X and Y are independent, the joint pdf can be written
f(x, y) = f1(x) f2(y)
• The covariance of X and Y vanishes in this case (why?), and the
variances add: V(X+Y) = V(X) + V(Y).
• If X and Y are not independent, the general formula is:
V(X+Y) = V(X) + V(Y) + 2 cov(X, Y).
• For n mutually independent random variables the covariance matrix
becomes diagonal (i.e. all off-diagonal terms are identically zero).
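These relations can be verified exactly for two independent dice, where the joint pdf factorizes into 1/6 × 1/6 for each pair of outcomes:

```python
from fractions import Fraction

# All 36 equally likely outcomes of two independent dice.
outcomes = [(x, y) for x in range(1, 7) for y in range(1, 7)]
p = Fraction(1, 36)                      # f(x, y) = f1(x) f2(y) = 1/36

E = lambda g: sum(g(x, y) * p for x, y in outcomes)

mx, my = E(lambda x, y: x), E(lambda x, y: y)
cov = E(lambda x, y: (x - mx) * (y - my))
var = lambda g, m: E(lambda x, y: (g(x, y) - m) ** 2)

assert cov == 0                                            # covariance vanishes
assert var(lambda x, y: x + y, mx + my) == \
       var(lambda x, y: x, mx) + var(lambda x, y: y, my)   # variances add
print(var(lambda x, y: x + y, mx + my))  # → 35/6
```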
Probability
• If a random vector Y = (Y1, Y2, …, Yn) is related to a vector X (with
pdf f(x)) by a function Y(X), the pdf of Y is
g(y) = f(x(y)) |J|
where |J| is the absolute value of the determinant of the matrix J.
• This matrix is the so-called Jacobian of the transformation from Y to
X:
     | ∂x1/∂y1  …  ∂x1/∂yn |
J =  |    ⋮     ⋱     ⋮    |
     | ∂xn/∂y1  …  ∂xn/∂yn |
Probability
• The transformation of the covariance matrix is
cov(Y) = J⁻¹ cov(X) (J⁻¹)^T
where the inverse of J is
       | ∂y1/∂x1  …  ∂y1/∂xn |
J⁻¹ =  |    ⋮     ⋱     ⋮    |
       | ∂yn/∂x1  …  ∂yn/∂xn |
• The transformation from x to y must be one-to-one, such that the inverse functional relationship exists.
Probability
• Obtaining cov(Y) from cov(X) as in the previous slide is a widely
used technique in high-energy physics data analysis.
• It is called linear error propagation and is applicable any time one
wants to transform from one set of estimated parameters to another
– Transformation between different sets of parameters describing a
reconstructed particle track
– Transport of track parameters from one location in a detector to another
– ………….
• Will see examples later in the course
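As a concrete sketch, consider propagating the covariance of polar parameters (r, φ) to Cartesian coordinates (x, y) = (r cos φ, r sin φ). All numbers below are made up purely for illustration; the matrix of derivatives ∂y/∂x plays the role of J⁻¹ in the formula cov(Y) = J⁻¹ cov(X) (J⁻¹)ᵀ:

```python
import math

r, phi = 10.0, 0.5                 # assumed measured values
cov_x = [[0.04, 0.0],              # assumed cov of (r, phi):
         [0.0, 0.0001]]            # uncorrelated, var(r)=0.04, var(phi)=1e-4

# Matrix of derivatives of (x, y) with respect to (r, phi);
# this is J^-1 in the notation of the slides.
Jinv = [[math.cos(phi), -r * math.sin(phi)],
        [math.sin(phi),  r * math.cos(phi)]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

transpose = [list(row) for row in zip(*Jinv)]
cov_y = matmul(matmul(Jinv, cov_x), transpose)
print(cov_y)
```

The resulting cov_y is symmetric, and its off-diagonal element shows how the originally uncorrelated polar errors become correlated Cartesian errors.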
Probability
• The characteristic function Φ(u) associated with the pdf f(x) is the Fourier transform of f(x):
Φ(u) = E(e^(iuX)) = ∫ e^(iux) f(x) dx
• Such functions are useful in deriving results about moments of random variables.
• The relation between Φ(u) and the moments of X is
(1/i^n) d^nΦ/du^n |_(u=0) = ∫ x^n f(x) dx = μn
• If Φ(u) is known, all moments of f(x) can be calculated without knowledge of f(x) itself
Probability
• Some common probability distributions:
– Binomial distribution
– Poisson distribution
– Gaussian distribution
– Chisquare distribution
– Student’s t distribution
– Gamma distribution
• We will take a closer look at some of them
Probability
• Binomial distribution:
• Assume that we make n identical experiments with only two possible
outcomes: ”success” or ”no success”
• The probability of success p is the same for all experiments
• The individual experiments are independent of each other
• The probability of x successes out of n trials is then
P(X = x) = C(n, x) p^x (1 − p)^(n−x),   where C(n, x) = n!/(x!(n−x)!)
• Example: throwing a die n times
• Defining the event of success to be the occurrence of six spots in a throw
• Probability p = 1/6
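A direct implementation of the binomial probabilities for this dice example:

```python
from math import comb

def binom_pmf(x, n, p):
    # P(X = x) = C(n, x) p^x (1 - p)^(n - x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 5, 1 / 6          # five throws, success = six spots
for x in range(n + 1):
    print(f"P(X = {x}) = {binom_pmf(x, n, p):.4f}")

# The probabilities sum to one, and the mean equals np = 5/6.
mean = sum(x * binom_pmf(x, n, p) for x in range(n + 1))
print(f"mean = {mean:.4f}")
```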
Probability
[Figures: binomial probability distributions for the number of successes in 5, 15 and 50 throws. Anything familiar about the shape of the last distribution?]
Probability
• Mean value and variance:
E(X) = np
Var(X) = np(1 − p)
• Five throws with a die:
– E(# six spots) = 5/6
– Var(# six spots) = 25/36
– Std(# six spots) = 5/6
Probability
• Poisson distribution:
– The average number of occurrences of event A per given time (length,
area, volume, …) interval is constant and equal to λ.
– The probability distribution of observing x occurrences in the interval is
P(X = x) = λ^x e^(−λ) / x!
– Both the mean value and the variance of X are λ.
– Example: the number of particles in a beam passing through a given area in
a given time must be Poisson distributed. If the average number λ is
known, the probabilities for all x can be calculated according to the
formula above.
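A sketch of the Poisson probabilities, with an assumed average rate λ = 3 occurrences per interval (the value is arbitrary):

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    # P(X = x) = lambda^x e^(-lambda) / x!
    return lam**x * exp(-lam) / factorial(x)

lam = 3.0
probs = [poisson_pmf(x, lam) for x in range(25)]   # tail beyond 25 is negligible

mean = sum(x * p for x, p in enumerate(probs))
var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
print(f"mean ≈ {mean:.3f}, variance ≈ {var:.3f}")   # both ≈ lambda
```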
Probability
• Gaussian distribution:
– Most frequently occurring distribution in nature.
– Most measurement uncertainties, disturbances of directions of charged
particles when penetrating through (enough) matter, number of
ionizations created by charged particle in a slab of material etc. follow
Gaussian distribution.
– Main reason: CENTRAL LIMIT THEOREM
– States that sum of n independent random variables converges to a
Gaussian distribution when n is ”large enough”, irrespective of the
individual distributions of the variables.
– Abovementioned examples are typically of this type.
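The central limit theorem can be made visible with a small simulation: sums of n uniform variables (individually far from Gaussian) quickly acquire the Gaussian ”68 % within one σ” property. Seed and sample sizes below are arbitrary:

```python
import random
import statistics

random.seed(7)  # arbitrary seed

n, trials = 50, 20_000
# Sum of n independent uniform(0, 1) variables, repeated many times.
sums = [sum(random.random() for _ in range(n)) for _ in range(trials)]

mean = statistics.fmean(sums)
std = statistics.stdev(sums)
print(f"mean ≈ {mean:.2f} (exact {n / 2}), std ≈ {std:.2f} (exact {(n / 12) ** 0.5:.2f})")

# For a Gaussian, about 68 % of the distribution lies within one sigma.
within = sum(1 for s in sums if abs(s - mean) < std) / trials
print(f"fraction within one sigma: {within:.3f}")
```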
Probability
• Gaussian probability density function with mean value μ and
standard deviation σ:
f(x; μ, σ²) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))
• For a random vector X of size n with mean value μ and covariance
matrix V the function is (multivariate Gaussian distribution):
f(x; μ, V) = (1/((2π)^(n/2) √(det V))) exp(−½ (x − μ)^T V⁻¹ (x − μ))
Probability
• Usual terminology: X ~ N(μ,σ): ”X is distributed according to a Gaussian (normal) with mean value μ and standard deviation σ”.
• 68 % of the distribution lies within plus/minus one σ.
• 95 % of the distribution lies within plus/minus two σ.
• 99.7 % of the distribution lies within plus/minus three σ.
• Standard normal variable Z ~ N(0,1): Z = (X − μ)/σ
• Quantiles of the standard normal distribution:
P(Z ≤ z_α) = 1 − α
• The value z_α is denoted the ”100·α % quantile of the standard normal distribution”
• Such quantiles can be found in tables or by computer programs
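Python’s standard library can produce these quantiles directly (statistics.NormalDist, available from Python 3.8):

```python
from statistics import NormalDist

Z = NormalDist(0, 1)   # standard normal

# z_alpha such that P(Z <= z_alpha) = 1 - alpha
for alpha in (0.10, 0.05, 0.025):
    print(f"{100 * alpha:g} % quantile: {Z.inv_cdf(1 - alpha):.3f}")
```

This reproduces the 1.64 (5 %) and 1.96 (2.5 %) values quoted on the following slides.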
Probability
[Figures: standard normal density with shaded tail areas illustrating the 10 % quantile, the 5 % quantile (1.64), and 95 % of the area lying within plus/minus the 2.5 % quantile (1.96).]
Probability
• distribution:
• If are independent, Gaussian random variables, then
follow a distribution with n degrees of freedom.
• Often used in evaluating level of compatibility between observed
data and assumed pdf of the data
• Example: is position of measurement in a particle detector
compatible with the assumed distribution of the measurement?
• Mean value is n and variance 2n.
nXX ,,1
n
i i
iiX
12
2
2
2
2
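A quick simulation sketch of these properties (arbitrary seed; n = 10 degrees of freedom, matching the figure that follows):

```python
import random
import statistics

random.seed(3)  # arbitrary seed

n, trials = 10, 50_000
# Sum of squares of n independent standard normal variables.
samples = [sum(random.gauss(0, 1) ** 2 for _ in range(n)) for _ in range(trials)]

print(f"mean ≈ {statistics.fmean(samples):.2f} (expected {n})")
print(f"variance ≈ {statistics.variance(samples):.2f} (expected {2 * n})")
```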
Probability
[Figure: χ² distribution with 10 degrees of freedom.]