BACKGROUND NOTES
FYS 4550/FYS9550 -
EXPERIMENTAL HIGH ENERGY PHYSICS
AUTUMN 2017
PROBABILITY
A. STRANDLIE
NTNU
AND
UNIVERSITY OF OSLO
Probability
• Before embarking on the concept of probability, we will first define a
set of other concepts.
• A stochastic experiment is characterized by:
– All possible elementary outcomes of the experiment are known
– Only one of the outcomes can occur in a single experiment
– The outcome of an experiment is not known a priori
• Example: throwing a die
– Outcomes are: S={1,2,3,4,5,6}
– Can only observe one of these each time you throw
– Don’t know beforehand what you will observe
• The set S is called the sample space of the experiment
Probability
• An event A is one or more outcomes which satisfy certain
specifications
• Example: A=”odd number” when throwing a die
• An event is therefore also a subset of S
• Here: A={1,3,5}
• If B=”even number”, what is the subset of S describing B?
• The probability of occurrence of an event A, P(A), is a number
between 0 and 1
• Intuitively, a value of P(A) close to 0 means that A is expected to
occur very rarely in an experiment, whereas a value close to 1
means that A occurs very often
Probability
• There are three ways of quantifying probability
1. Classical approach, valid when all outcomes can be assumed equally
likely. Probability is defined as the number of favourable outcomes for a
given event divided by the total number of outcomes. Example: throwing
a die has N=6 different outcomes. Assume the event A =
”observing six spots”. Only n=1 of the outcomes is favourable for A.
P(A)=n/N=1/6≈0.167.
2. Approach based on the convergence value of the relative frequency for a
very large number of repeated, identical experiments. Example:
throwing a die, recording the relative frequency of occurrence of A for
various numbers of trials
3. Subjective approach, reflecting the ”degree of belief” of occurrence of a
certain event A. Possible guideline: convergence value of a large
number of hypothetical experiments
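The convergence described in approach 2 is easy to demonstrate with a short simulation. The sketch below (plain Python; the seed and trial counts are arbitrary choices) records the relative frequency of the event A = ”six spots” for a growing number of throws and compares it with the classical value 1/6:

```python
import random

random.seed(42)  # arbitrary seed, for reproducibility

# Relative frequency of A = "six spots" for growing numbers of throws.
# The classical answer is P(A) = 1/6 ~ 0.167.
for n in (100, 10_000, 1_000_000):
    hits = sum(1 for _ in range(n) if random.randint(1, 6) == 6)
    print(f"n = {n:>9}: relative frequency = {hits / n:.4f}")
```

The frequencies scatter widely for small n and settle close to 0.167 as n grows, which is what the convergence plot on the following slide illustrates.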
Probability
[Figure: convergence of the relative frequency towards the true probability, plotted against the logarithm (base 10) of the number of trials.]
Probability
• Approach 2) forms the basis of frequentist statistics, whereas
approach 3) is the baseline of Bayesian statistics
– Two different schools
• When estimating parameters from a set of data, the two approaches
usually give the same numbers for the estimates if there is a large
amount of data
• If there is little available data, estimates might differ
– No easy way of determining which approach is ”best”
– Both approaches advocated in high-energy physics experiments
• Will not enter any further into such questions in this course
Probability
• Will now look at probabilities of combinations of events
• Need some concepts from set theory:
• The union A ∪ B is a new event which occurs if A or B or both events
occur.
• Two events are disjoint if they cannot occur simultaneously.
• The intersection A ∩ B is a new event which occurs if both A and B
occur.
• The complement Ā is a new event which occurs if A does not occur.
Probability
[Venn diagram of the sample space S: events A and B overlap in A ∩ B; event C is disjoint with both A and B; the points are individual outcomes.]
Probability
• The mathematical axioms of probability:
1. Probability is never negative, P(A) ≥ 0
2. The probability for the event which corresponds to the entire
sample space S (i.e. the probability of observing any of the
possible outcomes of the experiment) is equal to the unit
value, i. e. P(S) = 1
3. Probability must comply with the addition rule of disjoint
events:
P(A1 ∪ A2 ∪ … ∪ An) = P(A1) + P(A2) + … + P(An)
• A couple of useful formulas which can be derived from the
axioms:
P(Ā) = 1 − P(A)
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
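Both derived formulas can be verified by direct enumeration on a die, using exact fractions. The events chosen below are arbitrary illustrations:

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # "even number"
B = {4, 5, 6}   # "at least four spots"

def P(event):
    # Classical probability: favourable outcomes over total outcomes.
    return Fraction(len(event), len(S))

# Complement rule: P(not A) = 1 - P(A)
assert P(S - A) == 1 - P(A)

# General addition rule: P(A or B) = P(A) + P(B) - P(A and B)
assert P(A | B) == P(A) + P(B) - P(A & B)
print(P(A | B))  # → 2/3
```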
Probability
[Venn diagram: events A and B overlapping in A ∩ B.]
Concept of conditional probability:
What is the probability of occurrence of A given that
we know B will occur, i.e. P(A|B)?
Probability
• Recalling the definition of probability as the number of favourable
outcomes divided by the total number of outcomes, we get:
P(A|B) = N_(A∩B) / N_B = (N_(A∩B)/N_tot) / (N_B/N_tot) = P(A ∩ B) / P(B)
• Example: throwing a die. A = {2, 4, 6}, B = {3, 4, 5, 6}
– What is P(A|B)?
A ∩ B = {4, 6}, so P(A ∩ B) = 1/3 and P(B) = 2/3, giving
P(A|B) = P(A ∩ B) / P(B) = (1/3) / (2/3) = 1/2
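The same answer comes out of a few lines of Python that count outcomes directly, mirroring the dice example above:

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}       # even number
B = {3, 4, 5, 6}    # at least three spots

P = lambda E: Fraction(len(E), len(S))   # classical probability

# Conditional probability P(A|B) = P(A and B) / P(B)
p_cond = P(A & B) / P(B)
print(p_cond)  # → 1/2
```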
Probability
[Venn diagram: the event A is split by B into the parts A ∩ B and A ∩ B̄.]
Important observation: A ∩ B and A ∩ B̄ are disjoint!
Probability
• Therefore:
P(A) = P((A ∩ B) ∪ (A ∩ B̄)) = P(A ∩ B) + P(A ∩ B̄)
     = P(A|B) P(B) + P(A|B̄) P(B̄)
• Expressing P(A) in terms of a subdivision of S in a set of other, disjoint events is called the law of total probability. The general formulation of this law is:
P(A) = Σi P(A|Bi) P(Bi)
where all {Bi} are disjoint and span the entire sample space S.
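A minimal sketch checking the law of total probability on the dice events used earlier, where B and its complement partition S:

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}        # even number
B = {3, 4, 5, 6}     # at least three spots
B_c = S - B          # complement of B

P = lambda E: Fraction(len(E), len(S))
cond = lambda E, F: P(E & F) / P(F)      # P(E|F)

# Law of total probability: P(A) = P(A|B)P(B) + P(A|B_c)P(B_c)
total = cond(A, B) * P(B) + cond(A, B_c) * P(B_c)
assert total == P(A)
print(total)  # → 1/2
```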
Probability
• From the definition of conditional probability it follows:
P(A ∩ B) = P(B) P(A|B) = P(A) P(B|A)
• A quick manipulation gives:
P(B|A) = P(A|B) P(B) / P(A)
which is called Bayes’ theorem.
Probability
• By using the law of total probability, one ends up with the general
formulation of Bayes’ theorem:
P(Bj|A) = P(A|Bj) P(Bj) / Σi P(A|Bi) P(Bi)
which is an extremely important result in statistics. Particularly in
Bayesian statistics this theorem is often used to update or refine the
knowledge about a set of unknown parameters by the introduction of
information from new data.
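Bayes’ theorem can likewise be checked by enumeration on the dice events from the conditional-probability example (a toy illustration, not a full Bayesian analysis):

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}
B = {3, 4, 5, 6}

P = lambda E: Fraction(len(E), len(S))
cond = lambda E, F: P(E & F) / P(F)      # P(E|F)

# Bayes' theorem: P(B|A) = P(A|B) P(B) / P(A)
assert cond(B, A) == cond(A, B) * P(B) / P(A)
print(cond(B, A))  # → 2/3
```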
Probability
• This can be explained by a rewrite of Bayes’ theorem:
P(parameters|data) ∝ P(data|parameters) × P(parameters).
P(data|parameters) is often called the likelihood, P(parameters)
denotes the prior knowledge of the parameters, whereas
P(parameters|data) is the posterior probability of the parameters
given the data.
• If P(parameters) cannot be deduced by any objective means, a
subjective belief of its value is used in Bayesian statistics.
• Since there is no fundamental rule describing how to deduce this
prior probability, Bayesian statistics is still debated (also in high-
energy physics!)
Probability
• Definition of independence of events A and B: P(A|B) = P(A), i.e.
any given information about B does not affect the probability of
observing A.
• Physically this means that the events A and B are uncorrelated.
• For practical applications such independence cannot be derived but
rather has to be assumed, given the nature of the physical problem
one intends to model.
• General multiplication rule for independent events A1, A2, …, An:
P(A1 ∩ A2 ∩ … ∩ An) = P(A1) P(A2) … P(An)
Probability
• Stochastic or random variable:
– Number which can be attached to all outcomes of an experiment
• Example: throwing two dice, sum of number of spots
– Mathematical terminology: real-valued function defined over the
elements of the sample space S of an experiment
– A capital letter is often used to denote a random variable, for instance X
• Simulation experiment: throwing two dice N times, recording the sum of
spots each time and calculating the relative frequency of occurrence
for each of the outcomes
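A possible implementation of this simulation experiment in plain Python (seed and N are arbitrary choices):

```python
import random
from collections import Counter

random.seed(1)  # arbitrary seed

N = 100_000
# Throw two dice N times and record the sum of spots.
counts = Counter(random.randint(1, 6) + random.randint(1, 6) for _ in range(N))

for s in range(2, 13):
    exact = (6 - abs(s - 7)) / 36   # e.g. P(X = 7) = 6/36
    print(f"X = {s:>2}: observed {counts[s] / N:.4f}, expected {exact:.4f}")
```

The observed column approaches the expected column as N grows, which is what the series of figures that follows illustrates.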
Probability
[Figure series: observed relative frequencies (blue columns) and theoretically expected relative frequencies (red columns) of the sum of two dice, for N = 10, 20, 100, 1000, 10000, 100000, 1000000 and 10000000 throws.]
Probability
• The relative frequencies seem to converge towards the theoretically
expected probabilities
• Such a diagram is an expression of a probability distribution:
– A list of all different values of a random variable together with the
associated probabilities
– Mathematically: a function f(x) = P(X=x) defined for all possible values x
of X (given by the experiment at hand)
– The values of X can be discrete (as in the previous example), or
continuous
– For continuous x, f(x) is called a probability density function
• Simulation experiment: height of Norwegian men
• Collecting data, calculating relative frequencies of occurrences in
intervals of various widths
Probability
[Figure series: relative-frequency histograms of the heights for interval widths of 10 cm, 5 cm, 1 cm and 0.5 cm; as the interval width approaches 0, the histogram approaches a continuous probability distribution.]
Probability
• Cumulative distribution function: F(a) = P(X ≤ a)
• For discrete random variables:
F(a) = P(X ≤ a) = Σ_(xi ≤ a) f(xi)
• For continuous random variables:
F(a) = ∫_(−∞)^a f(x) dx
Probability
• It follows:
P(a < X ≤ b) = F(b) − F(a)
• For continuous variables:
P(a < X ≤ b) = ∫_a^b f(x) dx
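For a discrete variable these relations reduce to finite sums; a quick check for a single die:

```python
from fractions import Fraction

# Probability distribution of a single die: f(x) = 1/6 for x = 1..6.
f = {x: Fraction(1, 6) for x in range(1, 7)}

def F(a):
    # Cumulative distribution function F(a) = P(X <= a)
    return sum(p for x, p in f.items() if x <= a)

assert F(3) == Fraction(1, 2)
assert F(5) - F(2) == Fraction(1, 2)   # P(2 < X <= 5) = F(5) - F(2)
print(F(3), F(5) - F(2))  # → 1/2 1/2
```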
Probability
[Figures: areas under a continuous pdf, with shaded regions showing P(a < X < b), P(X < b) and P(X > a).]
Probability
• A function u(X) of a random variable X is also a random variable.
• The expectation value of such a function is:
E(u(X)) = ∫ u(x) f(x) dx
• Two very important special cases are:
μ = E(X) = ∫ x f(x) dx   (the mean)
σ² = Var(X) = E((X − μ)²) = ∫ (x − μ)² f(x) dx   (the variance)
Probability
• The mean μ is the most important measure of the centre of the
distribution of X.
• The variance, or its square root σ, the standard deviation, is the
most important measure of the spread of the distribution of X around
the mean.
• The mean is the first moment of X, whereas the variance is the
second central moment of X.
• In general, the n’th moment of X is
μn = E(X^n) = ∫ x^n f(x) dx
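For a discrete variable the integrals become sums; computing the mean and variance of a single die directly from the definitions:

```python
from fractions import Fraction

f = {x: Fraction(1, 6) for x in range(1, 7)}   # pdf of a single die

# mu = E(X) = sum over x of x f(x)
mu = sum(x * p for x, p in f.items())

# Var(X) = E((X - mu)^2)
var = sum((x - mu) ** 2 * p for x, p in f.items())

print(mu, var)  # → 7/2 35/12
```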
Probability
• The n’th central moment is
mn = E((X − μ)^n) = ∫ (x − μ)^n f(x) dx
• Another measure of the centre of the distribution of X is the median,
defined as
F(x_med) = 1/2
or, in words, the value of X above which half of the probability lies
and below which the other half lies.
Probability
• Assume now that X and Y are two random variables with a joint
probability density function (pdf) f(x,y).
• The marginal pdf of X is
f1(x) = ∫ f(x, y) dy
whereas the marginal pdf of Y is
f2(y) = ∫ f(x, y) dx
Probability
• The mean values of X and Y are
μX = E(X) = ∫∫ x f(x, y) dx dy = ∫ x f1(x) dx
μY = E(Y) = ∫∫ y f(x, y) dx dy = ∫ y f2(y) dy
• The covariance of X and Y is
cov(X, Y) = E((X − μX)(Y − μY)) = E(XY) − μX μY
Probability
• If several random variables are considered simultaneously, one
frequently arranges the variables in a stochastic or random vector
X = (X1, X2, …, Xn)^T
• The covariances are then naturally displayed in a covariance matrix:
          | cov(X1, X1)  cov(X1, X2)  …  cov(X1, Xn) |
cov X  =  | cov(X2, X1)  cov(X2, X2)  …  cov(X2, Xn) |
          |      ⋮             ⋮       ⋱       ⋮      |
          | cov(Xn, X1)  cov(Xn, X2)  …  cov(Xn, Xn) |
Probability
• If two variables X and Y are independent, the joint pdf can be written
f(x, y) = f1(x) f2(y)
• The covariance of X and Y vanishes in this case (why?), and the
variances add: V(X+Y) = V(X) + V(Y).
• If X and Y are not independent, the general formula is:
V(X+Y) = V(X) + V(Y) + 2 cov(X, Y).
• For n mutually independent random variables the covariance matrix
becomes diagonal (i.e. all off-diagonal terms are identically zero).
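These relations can be verified exactly for two independent dice, where the joint pdf factorizes into 1/6 × 1/6 for each pair of outcomes:

```python
from fractions import Fraction

# All 36 equally likely outcomes of two independent dice.
outcomes = [(x, y) for x in range(1, 7) for y in range(1, 7)]
p = Fraction(1, 36)                      # f(x, y) = f1(x) f2(y) = 1/36

E = lambda g: sum(g(x, y) * p for x, y in outcomes)

mx, my = E(lambda x, y: x), E(lambda x, y: y)
cov = E(lambda x, y: (x - mx) * (y - my))
var = lambda g, m: E(lambda x, y: (g(x, y) - m) ** 2)

assert cov == 0                                            # covariance vanishes
assert var(lambda x, y: x + y, mx + my) == \
       var(lambda x, y: x, mx) + var(lambda x, y: y, my)   # variances add
print(var(lambda x, y: x + y, mx + my))  # → 35/6
```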
Probability
• If a random vector Y = (Y1, Y2, …, Yn) is related to a vector X (with
pdf f(x)) by a function Y(X), the pdf of Y is
g(y) = f(x(y)) |J|
where |J| is the absolute value of the determinant of the matrix J.
• This matrix is the so-called Jacobian of the transformation from Y to
X:
     | ∂x1/∂y1  …  ∂x1/∂yn |
J =  |    ⋮     ⋱     ⋮    |
     | ∂xn/∂y1  …  ∂xn/∂yn |
Probability
• The transformation of the covariance matrix is
cov(Y) = J⁻¹ cov(X) (J⁻¹)^T
where the inverse of J is
       | ∂y1/∂x1  …  ∂y1/∂xn |
J⁻¹ =  |    ⋮     ⋱     ⋮    |
       | ∂yn/∂x1  …  ∂yn/∂xn |
• The transformation from x to y must be one-to-one, such that the inverse functional relationship exists.
Probability
• Obtaining cov(Y) from cov(X) as in the previous slide is a widely
used technique in high-energy physics data analysis.
• It is called linear error propagation and is applicable any time one
wants to transform from one set of estimated parameters to another
– Transformation between different sets of parameters describing a
reconstructed particle track
– Transport of track parameters from one location in a detector to another
– ………….
• Will see examples later in the course
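As a concrete sketch, consider propagating the covariance of polar parameters (r, φ) to Cartesian coordinates (x, y) = (r cos φ, r sin φ). All numbers below are made up purely for illustration; the matrix of derivatives ∂y/∂x plays the role of J⁻¹ in the formula cov(Y) = J⁻¹ cov(X) (J⁻¹)ᵀ:

```python
import math

r, phi = 10.0, 0.5                 # assumed measured values
cov_x = [[0.04, 0.0],              # assumed cov of (r, phi):
         [0.0, 0.0001]]            # uncorrelated, var(r)=0.04, var(phi)=1e-4

# Matrix of derivatives of (x, y) with respect to (r, phi);
# this is J^-1 in the notation of the slides.
Jinv = [[math.cos(phi), -r * math.sin(phi)],
        [math.sin(phi),  r * math.cos(phi)]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

transpose = [list(row) for row in zip(*Jinv)]
cov_y = matmul(matmul(Jinv, cov_x), transpose)
print(cov_y)
```

The resulting cov_y is symmetric, and its off-diagonal element shows how the originally uncorrelated polar errors become correlated Cartesian errors.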
Probability
• The characteristic function Φ(u) associated with the pdf f(x) is the Fourier transform of f(x):
Φ(u) = E(e^(iuX)) = ∫ e^(iux) f(x) dx
• Such functions are useful in deriving results about moments of random variables.
• The relation between Φ(u) and the moments of X is
(1/i^n) d^nΦ/du^n |_(u=0) = ∫ x^n f(x) dx = μn
• If Φ(u) is known, all moments of f(x) can be calculated without knowledge of f(x) itself
Probability
• Some common probability distributions:
– Binomial distribution
– Poisson distribution
– Gaussian distribution
– Chisquare distribution
– Student’s t distribution
– Gamma distribution
• We will take a closer look at some of them
Probability
• Binomial distribution:
• Assume that we make n identical experiments with only two possible
outcomes: ”success” or ”no success”
• The probability of success p is the same for all experiments
• The individual experiments are independent of each other
• The probability of x successes out of n trials is then
P(X = x) = C(n, x) p^x (1 − p)^(n−x),   where C(n, x) = n!/(x!(n−x)!)
• Example: throwing a die n times
• Defining the event of success to be the occurrence of six spots in a throw
• Probability p = 1/6
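A direct implementation of the binomial probabilities for this dice example:

```python
from math import comb

def binom_pmf(x, n, p):
    # P(X = x) = C(n, x) p^x (1 - p)^(n - x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 5, 1 / 6          # five throws, success = six spots
for x in range(n + 1):
    print(f"P(X = {x}) = {binom_pmf(x, n, p):.4f}")

# The probabilities sum to one, and the mean equals np = 5/6.
mean = sum(x * binom_pmf(x, n, p) for x in range(n + 1))
print(f"mean = {mean:.4f}")
```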
Probability
[Figures: binomial probability distributions for the number of successes in 5, 15 and 50 throws. Anything familiar about the shape of the last distribution?]
Probability
• Mean value and variance:
E(X) = np
Var(X) = np(1 − p)
• Five throws with a die:
– E(# six spots) = 5/6
– Var(# six spots) = 25/36
– Std(# six spots) = 5/6
Probability
• Poisson distribution:
– The average number of occurrences of event A per given time (length,
area, volume, …) interval is constant and equal to λ.
– The probability distribution of observing x occurrences in the interval is
P(X = x) = λ^x e^(−λ) / x!
– Both the mean value and the variance of X are λ.
– Example: the number of particles in a beam passing through a given area in
a given time must be Poisson distributed. If the average number λ is
known, the probabilities for all x can be calculated according to the
formula above.
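A sketch of the Poisson probabilities, with an assumed average rate λ = 3 occurrences per interval (the value is arbitrary):

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    # P(X = x) = lambda^x e^(-lambda) / x!
    return lam**x * exp(-lam) / factorial(x)

lam = 3.0
probs = [poisson_pmf(x, lam) for x in range(25)]   # tail beyond 25 is negligible

mean = sum(x * p for x, p in enumerate(probs))
var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
print(f"mean ≈ {mean:.3f}, variance ≈ {var:.3f}")   # both ≈ lambda
```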
Probability
• Gaussian distribution:
– Most frequently occurring distribution in nature.
– Most measurement uncertainties, disturbances of directions of charged
particles when penetrating through (enough) matter, number of
ionizations created by charged particle in a slab of material etc. follow
Gaussian distribution.
– Main reason: CENTRAL LIMIT THEOREM
– States that sum of n independent random variables converges to a
Gaussian distribution when n is ”large enough”, irrespective of the
individual distributions of the variables.
– Abovementioned examples are typically of this type.
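The central limit theorem can be made visible with a small simulation: sums of n uniform variables (individually far from Gaussian) quickly acquire the Gaussian ”68 % within one σ” property. Seed and sample sizes below are arbitrary:

```python
import random
import statistics

random.seed(7)  # arbitrary seed

n, trials = 50, 20_000
# Sum of n independent uniform(0, 1) variables, repeated many times.
sums = [sum(random.random() for _ in range(n)) for _ in range(trials)]

mean = statistics.fmean(sums)
std = statistics.stdev(sums)
print(f"mean ≈ {mean:.2f} (exact {n / 2}), std ≈ {std:.2f} (exact {(n / 12) ** 0.5:.2f})")

# For a Gaussian, about 68 % of the distribution lies within one sigma.
within = sum(1 for s in sums if abs(s - mean) < std) / trials
print(f"fraction within one sigma: {within:.3f}")
```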
Probability
• Gaussian probability density function with mean value μ and
standard deviation σ:
f(x; μ, σ²) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))
• For a random vector X of size n with mean value μ and covariance
matrix V the function is (multivariate Gaussian distribution):
f(x; μ, V) = (1/((2π)^(n/2) √(det V))) exp(−½ (x − μ)^T V⁻¹ (x − μ))
Probability
• Usual terminology: X ~ N(μ,σ): ”X is distributed according to a Gaussian (normal) with mean value μ and standard deviation σ”.
• 68 % of the distribution lies within plus/minus one σ.
• 95 % of the distribution lies within plus/minus two σ.
• 99.7 % of the distribution lies within plus/minus three σ.
• Standard normal variable Z ~ N(0,1): Z = (X − μ)/σ
• Quantiles of the standard normal distribution:
P(Z ≤ z_α) = 1 − α
• The value z_α is denoted the ”100·α % quantile of the standard normal distribution”
• Such quantiles can be found in tables or by computer programs
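Python’s standard library can produce these quantiles directly (statistics.NormalDist, available from Python 3.8):

```python
from statistics import NormalDist

Z = NormalDist(0, 1)   # standard normal

# z_alpha such that P(Z <= z_alpha) = 1 - alpha
for alpha in (0.10, 0.05, 0.025):
    print(f"{100 * alpha:g} % quantile: {Z.inv_cdf(1 - alpha):.3f}")
```

This reproduces the 1.64 (5 %) and 1.96 (2.5 %) values quoted on the following slides.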
Probability
[Figures: standard normal density with shaded tail areas illustrating the 10 % quantile, the 5 % quantile (1.64), and 95 % of the area lying within plus/minus the 2.5 % quantile (1.96).]
Probability
• distribution:
• If are independent, Gaussian random variables, then
follow a distribution with n degrees of freedom.
• Often used in evaluating level of compatibility between observed
data and assumed pdf of the data
• Example: is position of measurement in a particle detector
compatible with the assumed distribution of the measurement?
• Mean value is n and variance 2n.
nXX ,,1
n
i i
iiX
12
2
2
2
2
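A quick simulation sketch of these properties (arbitrary seed; n = 10 degrees of freedom, matching the figure that follows):

```python
import random
import statistics

random.seed(3)  # arbitrary seed

n, trials = 10, 50_000
# Sum of squares of n independent standard normal variables.
samples = [sum(random.gauss(0, 1) ** 2 for _ in range(n)) for _ in range(trials)]

print(f"mean ≈ {statistics.fmean(samples):.2f} (expected {n})")
print(f"variance ≈ {statistics.variance(samples):.2f} (expected {2 * n})")
```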
Probability
[Figure: χ² distribution with 10 degrees of freedom.]