Probability and Statistics
Basic concepts
(from a physicist's point of view)
Laura Ferraris-Bouchez UGA / LPSC
What are statistics?
1) Descriptive statistics: non-interpreted data (poll results, weather observations, …)
2) Explicative statistics: estimation, correlation of variables
3) Decisional statistics: statistical tests
4) Forecasting statistics (time is an additional parameter): weather, health, …
Bibliography
• Kendall's Advanced Theory of Statistics, Hodder Arnold:
  Volume 1: Distribution Theory, A. Stuart and K. Ord
  Volume 2a: Classical Inference and the Linear Model, A. Stuart, K. Ord, S. Arnold
  Volume 2b: Bayesian Inference, A. O'Hagan, J. Forster
• The Review of Particle Physics, M. Tanabashi et al. (Particle Data Group), Phys. Rev. D 98, 030001 (2018), pdg.lbl.gov
• Data Analysis: A Bayesian Tutorial, D. Sivia and J. Skilling, Oxford Science Publications
• Statistical Data Analysis, Glen Cowan, Oxford Science Publications
• Analyse statistique des données expérimentales, K. Protassov, EDP Sciences
• Probabilités, analyse des données et statistiques, G. Saporta, Technip
• Analyse de données en sciences expérimentales, B.C., Dunod
Sample and population
SAMPLE
- Finite size
- Selected through a random process
  e.g. measurement results
POPULATION
- Potentially infinite size
  e.g. all possible results
Characterizing the sample, the population, and the sampling process → probability theory (first lecture).
INFERENCE: using the sample (realizations $x_i$ of an observable, produced by the EXPERIMENT) to estimate the characteristics of the population (described by a pdf $f(x;\theta)$, whose parameters $\theta$ carry the PHYSICS) → statistical inference (second lecture).
Random process
A random process ("measurement" or "experiment") is a process whose outcome cannot be predicted with certainty.
It will be described by:
- Universe ($\Omega$): the set of all possible outcomes
- Event: a logical condition on an outcome. It can either be true or false; it splits the universe in 2 subsets.
An event $A$ will be identified with the subset $A \subseteq \Omega$ for which $A$ is true; its complement $\bar{A}$ corresponds to "not $A$".
Probability
A probability function $P$ is defined by (Kolmogorov, 1933):
$$P: \text{events} \to [0, 1], \qquad A \mapsto P(A)$$
satisfying:
$$P(\Omega) = 1$$
$$P(A \cup B) = P(A) + P(B) \quad \text{if } A \cap B = \emptyset$$
Interpretation:
- Frequentist: $\lim_{n\to\infty} \frac{n_A}{n} = P(A)$ defines a probability if we repeat a random process $n$ times ($n$ large) and count the outcomes $n_A$ satisfying the event $A$.
- Bayesian: probability is a measure of the credibility associated with the event.
Simple logic
Event "not $A$" is associated with the complement $\bar{A}$:
$$P(\bar{A}) = 1 - P(A), \qquad P(\emptyset) = 1 - P(\Omega) = 0$$
Event "$A$ and $B$" is associated with the intersection $A \cap B$, with probability $P(A \cap B)$.
Event "$A$ or $B$" is associated with the union $A \cup B$:
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
Conditional probability
If an event $B$ is known to be true, one can define a new probability function on the restrained universe $\Omega' = B$, the conditional probability $P(A|B)$ = "probability of $A$ given $B$":
$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$
By definition of the conditional probability:
$$P(A \cap B) = P(A|B)\,P(B) = P(B|A)\,P(A)$$
which yields Bayes' theorem:
$$P(B|A) = \frac{P(A|B)\,P(B)}{P(A)}$$
Incompatibility / Independence
Two incompatible events cannot be true simultaneously:
$$A \cap B = \emptyset \quad\Rightarrow\quad P(A \cap B) = 0, \qquad P(A \cup B) = P(A) + P(B)$$
Two events are independent if the realization of one is not linked in any way to the realization of the other:
$$P(A|B) = P(A) \ \text{ and } \ P(B|A) = P(B) \quad\Leftrightarrow\quad P(A \cap B) = P(A)\,P(B)$$
Random variable
When the random process outcome is a number $x$, we associate to the process a random variable $X$. Each realization leads to a particular result $X = x$: $x$ is a realization of $X$.
- Discrete variable: probability law $p(x) = P(X = x)$
- Real variable: $P(X = x) = 0$, so one uses the cumulative distribution function $F(x) = P(X < x)$:
$$dF = F(x + dx) - F(x) = P(X < x + dx) - P(X < x) = P(X < x) + P(x < X < x + dx) - P(X < x) = P(x < X < x + dx) = f(x)\,dx$$
$$f(x) = \frac{dF}{dx} \quad\text{: probability density function (pdf)}$$
Density function
[Figure: a pdf $f(x)$ and the corresponding cdf $F(x)$, which rises from 0 to 1]
By construction:
$$F(-\infty) = P(\emptyset) = 0, \qquad F(+\infty) = P(\Omega) = 1$$
$$F(a) = \int_{-\infty}^{a} f(x)\,dx, \qquad \int_{-\infty}^{+\infty} f(x)\,dx = P(\Omega) = 1$$
$$P(a < X < b) = F(b) - F(a) = \int_a^b f(x)\,dx$$
Discrete variables can also be described by a pdf using Dirac distributions:
$$f(x) = \sum_i p(i)\,\delta(i - x), \qquad \sum_i p(i) = 1$$
Moments
For any function $g(x)$, the expectation of $g$ is given by:
$$E[g(x)] = \int g(x)\,f(x)\,dx$$
It is the mean value of $g$. Moments $\mu_k$ are the expectations of $X^k$:
- 0th moment: $\mu_0 = 1$ (pdf normalization)
- 1st moment: $\mu_1 = \mu$ (mean)
$X' = X - \mu_1$ is a central variable.
- 2nd central moment: $\mu'_2 = \sigma^2$ (variance)
Characteristic function:
$$\varphi(t) = E[e^{ixt}] = \int f(x)\,e^{ixt}\,dx = \mathrm{FT}^{-1}[f]$$
From the Taylor expansion,
$$\varphi(t) = \int \sum_k \frac{(itx)^k}{k!}\,f(x)\,dx = \sum_k \frac{(it)^k}{k!}\,\mu_k
\quad\Rightarrow\quad
\mu_k = (-i)^k \left.\frac{d^k\varphi}{dt^k}\right|_{t=0}$$
A pdf is entirely defined by its moments; the CF is a useful tool for demonstrations.
Sample PDF
A sample is obtained from a random drawing within a population, described by a pdf.
We're going to discuss how to characterize, independently from one another:
- a population
- a sample
It is useful to consider a sample as a finite set from which one can randomly draw elements, with equiprobability. We can then associate to this process a pdf, the empirical density (or sample density):
$$f_{\text{sample}}(x) = \frac{1}{n}\sum_i \delta(x - x_i)$$
It will be useful to translate properties of distributions to a finite sample.
Characterizing a distribution
How to reduce a distribution or a sample to a finite number of values?
- Measure of location: reducing the distribution to one central value → the result
- Measure of dispersion: spread of the distribution around the central value → the uncertainty (error)
- Frequency table / histogram (for a finite sample)
Location and dispersion
Mean: sum of all possible values weighted by their probabilities.
Continuous: $\mu = \bar{x} = \int_{-\infty}^{+\infty} x\,f(x)\,dx$ ; discrete: $\mu = \bar{x} = \sum_i x_i\,p(x_i)$
Standard deviation $\sigma$ / variance $v = \sigma^2$: average value of the squared deviation from the mean.
Continuous: $\sigma^2 = \int_{-\infty}^{+\infty} (x - \mu)^2 f(x)\,dx$ ; discrete: $\sigma^2 = \sum_i (x_i - \mu)^2\,p(x_i)$
Koenig's theorem
$$\sigma^2 = \int_{-\infty}^{+\infty} (x - \mu)^2 f(x)\,dx = \int x^2 f(x)\,dx + \mu^2 \int f(x)\,dx - 2\mu \int x f(x)\,dx = \overline{x^2} + \mu^2 - 2\mu\bar{x} = \overline{x^2} - \bar{x}^2$$
i.e. the variance is the mean of the square minus the square of the mean:
$$\sigma^2 = \mu_{x^2} - \mu^2$$
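A quick numerical check of Koenig's theorem; a minimal sketch (not part of the original lecture), assuming NumPy is available:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)  # any distribution works

# Direct definition: mean squared deviation from the mean
var_direct = np.mean((x - x.mean()) ** 2)
# Koenig's theorem: E[x^2] - E[x]^2
var_koenig = np.mean(x ** 2) - x.mean() ** 2

print(var_direct, var_koenig)  # the two agree to machine precision
```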
Discrete distributions
Bernoulli distribution: random variable $K$ associated to a random process with 2 possible outcomes, "success" ($K = 1$) or "failure" ($K = 0$), with success probability $p$.
Variable: $K$
Parameter: $p$
Law: $p(k) = p^k (1-p)^{1-k}$
Mean: $p$
Variance: $p(1-p)$
[Figure: Bernoulli law for $p = 0.35$]

Binomial distribution: repetition of $n$ Bernoulli processes. Randomly choosing $K$ objects within a finite set of $n$, with a fixed drawing probability $p$.
Variable: $K$
Parameters: $n$, $p$
Law: $P(k; n, p) = \frac{n!}{k!\,(n-k)!}\,p^k (1-p)^{n-k}$
Mean: $np$
Variance: $np(1-p)$
[Figure: binomial law for $n = 10$, $p = 0.65$]

Poisson distribution: binomial limit when $n \to +\infty$, $p \to 0$, $np = \lambda$. Counting $k$ events with a probability per time/space unit $\lambda$ ($\lambda$ is fixed and independent of the time/space).
Variable: $k$
Parameter: $\lambda$
Law: $P(k; \lambda) = \frac{e^{-\lambda}\lambda^k}{k!}$
Mean: $\lambda$
Variance: $\lambda$
[Figure: Poisson law for $\lambda = 6.5$]
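The Poisson limit of the binomial can be checked numerically; a small sketch (not from the lecture), assuming SciPy is available:

```python
import numpy as np
from scipy import stats

n, p = 1000, 0.006           # large n, small p, np = lambda = 6
lam = n * p

k = np.arange(0, 20)
binom_pmf = stats.binom.pmf(k, n, p)
poisson_pmf = stats.poisson.pmf(k, lam)

# The binomial pmf approaches the Poisson pmf in this limit
print(np.max(np.abs(binom_pmf - poisson_pmf)))  # small (~1e-5)

# Means and variances match the formulas on this slide
print(stats.binom.stats(n, p))   # (np, np(1-p))
print(stats.poisson.stats(lam))  # (lambda, lambda)
```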
Real distributions
Uniform distribution: equiprobability over a finite range $[a, b]$.
Parameters: $a$, $b$
Law: $f(x; a, b) = \frac{1}{b-a}$ if $a < x < b$, 0 otherwise
Mean: $\mu = \frac{a+b}{2}$
Variance: $\sigma^2 = \frac{(b-a)^2}{12}$
Normal distribution (Gaussian): limit of many processes; centered and symmetric distribution.
Parameters: $\mu$, $\sigma$
Law: $f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
Mean: $\mu$
Variance: $\sigma^2$
Chi-square distribution: sum of the squares of $n$ reduced normal variables.
Variable: $C = \sum_{k=1}^{n} \frac{(X_k - \mu_k)^2}{\sigma_k^2}$
Parameter: $n$, the number of degrees of freedom (dof)
Law: $f(c; n) = \frac{c^{n/2 - 1}\,e^{-c/2}}{\Gamma(n/2)\,2^{n/2}}$
Mean: $n$
Variance: $2n$
Convergence
- Binomial law: $P(k; n, p) = C_n^k\,p^k(1-p)^{n-k}$, with $\mu = np$ and $\sigma = \sqrt{np(1-p)}$
- Poisson law: $P(k; \lambda) = \frac{e^{-\lambda}\lambda^k}{k!}$, with $\mu = \lambda$ and $\sigma = \sqrt{\lambda}$
- Normal law: $f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
- Chi-square law: $f(c; n) = \frac{c^{n/2-1}}{2^{n/2}\Gamma(n/2)}\,e^{-c/2}$, with $\mu = n$ and $\sigma = \sqrt{2n}$
Limits:
- Binomial → Poisson for $p$ small, $k \ll n$, $np = \lambda$
- Binomial → Normal for $n > 30$
- Poisson → Normal for $\lambda > 25$
- Chi-square → Normal for $n > 50$
Multidimensional PDF
Random variables → random vectors: $\vec{X} = (X_1, X_2, \ldots, X_n)$.
The probability density function becomes:
$$f(\vec{x})\,d\vec{x} = f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n = P(x_1 < X_1 < x_1 + dx_1 \cap \cdots \cap x_n < X_n < x_n + dx_n)$$
$$P(a < X < b \cap c < Y < d) = \int_a^b dx \int_c^d dy\, f(x, y)$$
Marginal density: probability of only one of the components:
$$f_X(x)\,dx = P(x < X < x + dx \cap -\infty < Y < +\infty) = \left(\int f(x, y)\,dy\right) dx
\quad\Rightarrow\quad f_X(x) = \int f(x, y)\,dy$$
Multidimensional PDF
If $Y = y_0$, one can define the conditional density for $X$:
$$f(x|y_0)\,dx = P(x < X < x + dx \,|\, Y = y_0)$$
It is proportional to $f(x, y_0)$ and normalized, $\int f(x|y)\,dx = 1$, so:
$$f(x|y) = \frac{f(x, y)}{\int f(x, y)\,dx} = \frac{f(x, y)}{f_Y(y)}$$
$X$ and $Y$ are independent if all events of the form $x < X < x + dx$ are independent from $y < Y < y + dy$:
$$f(x|y) = f_X(x) \ \text{ and } \ f(y|x) = f_Y(y) \quad\Leftrightarrow\quad f(x, y) = f_X(x)\,f_Y(y)$$
In terms of pdfs, Bayes' theorem becomes:
$$f(y|x) = \frac{f(x|y)\,f_Y(y)}{f_X(x)} = \frac{f(x|y)\,f_Y(y)}{\int f(x|y)\,f_Y(y)\,dy}$$
Covariance and correlation
$X, Y$ can be treated as 2 separate variables via the marginal densities, with a mean and a variance for each variable: $\mu_x, \mu_y, \sigma_x, \sigma_y$. BUT this does not take correlations between the variables into account!
Generalized measure of dispersion → covariance of $X$ and $Y$:
Real: $\mathrm{Cov}(X, Y) = \int (x - \mu_x)(y - \mu_y)\,f(x, y)\,dx\,dy = \rho\sigma_x\sigma_y = \mu_{xy} - \mu_x\mu_y$
Discrete: $\mathrm{Cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu_x)(y_i - \mu_y)$
Correlation coefficient: $\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_x\sigma_y}$; uncorrelated means $\rho = 0$.
Independent ⇒ uncorrelated (the converse is not true).
[Figure: scatter plots for $\rho = -0.5$, $\rho = 0$, $\rho = 0.9$, and a nonlinearly dependent case with $\rho = 0$]
Decorrelation
Covariance matrix for $n$ variables $X_i$:
$$\Sigma_{ij} = \mathrm{Cov}(X_i, X_j), \qquad
\Sigma = \begin{pmatrix} \sigma_1^2 & \cdots & \rho_{1n}\sigma_1\sigma_n \\ \vdots & \ddots & \vdots \\ \rho_{1n}\sigma_1\sigma_n & \cdots & \sigma_n^2 \end{pmatrix}$$
- Uncorrelated variables ⇒ $\Sigma$ diagonal
- The matrix is real and symmetric ⇒ it can be diagonalized: $n$ new uncorrelated variables $Y_i$
$$\Sigma' = \begin{pmatrix} \sigma_1'^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma_n'^2 \end{pmatrix} = B^{-1}\Sigma B, \qquad \vec{Y} = B\vec{X}$$
- The $\sigma_i'^2$ are the eigenvalues of $\Sigma$
- $B$ contains the orthonormal eigenvectors
- The $Y_i$ are the principal components
- Sorted from the largest to the smallest $\sigma'$, they allow dimensional reduction
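A minimal numerical sketch of this diagonalization (not from the lecture), assuming NumPy; the orthonormal eigenvector matrix returned by numpy.linalg.eigh plays the role of $B$, up to the transposition convention:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two correlated variables
x1 = rng.normal(size=10_000)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=10_000)
X = np.vstack([x1, x2])

sigma = np.cov(X)                   # covariance matrix Sigma
eigvals, B = np.linalg.eigh(sigma)  # sigma'^2 eigenvalues, orthonormal eigenvectors

Y = B.T @ X                         # decorrelated variables (principal components)
print(np.cov(Y).round(6))           # off-diagonal terms vanish
```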
Regression
Measure of location: point $(\mu_X, \mu_Y)$ / curve closest to the points.
Fully correlated ($\rho = 1$) or fully anti-correlated ($\rho = -1$) ⇒ $Y = aX + b$ exactly.
Linear regression: minimizing the dispersion between the line $y = ax + b$ and the distribution:
$$w(a, b) = \iint (y - ax - b)^2 f(x, y)\,dx\,dy = \frac{1}{n}\sum_i (y_i - ax_i - b)^2$$
$$\frac{\partial w}{\partial a} = 0 = \iint x\,(y - ax - b)\,f(x, y)\,dx\,dy, \qquad
\frac{\partial w}{\partial b} = 0 = \iint (y - ax - b)\,f(x, y)\,dx\,dy$$
$$\Leftrightarrow\quad
\begin{cases} a(\sigma_x^2 + \mu_x^2) + b\mu_x = \rho\sigma_x\sigma_y + \mu_x\mu_y \\ a\mu_x + b = \mu_y \end{cases}
\quad\Leftrightarrow\quad
\begin{cases} a = \rho\,\dfrac{\sigma_y}{\sigma_x} \\[2mm] b = \mu_y - \rho\,\dfrac{\sigma_y}{\sigma_x}\,\mu_x \end{cases}$$
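A small sketch checking these moment formulas for $a$ and $b$ against a direct fit (not from the lecture), assuming NumPy; np.polyfit is used only as a cross-check:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=5000)
y = 1.5 * x + 2.0 + rng.normal(scale=0.5, size=5000)

# Slope and intercept from the moments, as derived above
rho = np.corrcoef(x, y)[0, 1]
a = rho * y.std() / x.std()
b = y.mean() - a * x.mean()

# Cross-check against a direct least-squares fit
print((a, b), np.polyfit(x, y, 1))  # the two agree
```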
Multidimensional Pdfs
Multinomial distribution: randomly choosing $K_1, K_2, \ldots, K_s$ objects within a finite set of $n$, with a fixed drawing probability for each category, $p_1, p_2, \ldots, p_s$, with $\sum K_i = n$ and $\sum p_i = 1$.
Ex: in a 3-way election, A got 20% of the votes, B 30% and C 50%. 6 voters are randomly selected. What is the probability that there are 1 A supporter, 2 B supporters and 3 C supporters in the sample?
(Answer: $\frac{6!}{1!\,2!\,3!}\,0.2^1\,0.3^2\,0.5^3 = 60 \times 0.00225 = 0.135$.)
Parameters: $n, p_1, p_2, \ldots, p_s$
Law: $P(\vec{k}; n, \vec{p}) = \frac{n!}{k_1!\,k_2!\cdots k_s!}\,p_1^{k_1} p_2^{k_2} \cdots p_s^{k_s}$
Mean: $\mu_i = np_i$
Variance: $\sigma_i^2 = np_i(1 - p_i)$, and $\mathrm{Cov}(K_i, K_j) = -np_ip_j$
Rem: the variables are not independent. The binomial corresponds to $s = 2$, but has only one independent variable.
Multinormal distribution
Generalization of the normal law to a random variable vector.
Parameters: $\vec{\mu}$, $\Sigma$
Law: $f(\vec{x}; \vec{\mu}, \Sigma) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}}\,e^{-\frac{1}{2}(\vec{x}-\vec{\mu})^T \Sigma^{-1} (\vec{x}-\vec{\mu})}$
If uncorrelated, it factorizes:
$$f(\vec{x}; \vec{\mu}, \Sigma) = \prod_i \frac{1}{\sigma_i\sqrt{2\pi}}\,e^{-\frac{(x_i-\mu_i)^2}{2\sigma_i^2}}$$
For a multinormal law, uncorrelated ⇔ independent.
Sum of random variables
A sum of random variables is a new random variable: $S = \sum_{i=1}^{n} X_i$.
Assuming the mean and variance of each variable exist:
Mean of $S$:
$$\mu_S = \int \sum_{i=1}^{n} x_i\, f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n = \sum_{i=1}^{n} \int x_i\, f_{X_i}(x_i)\,dx_i = \sum_{i=1}^{n} \mu_i$$
The mean is an additive quantity.
Variance of $S$:
$$\sigma_S^2 = \int \Big(\sum_{i=1}^{n} (x_i - \mu_{X_i})\Big)^2 f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n = \sum_{i=1}^{n} \sigma_{X_i}^2 + 2\sum_i \sum_{j<i} \mathrm{Cov}(X_i, X_j)$$
For uncorrelated variables, the variance is additive: $\sigma_S^2 = \sum_{i=1}^{n} \sigma_{X_i}^2$ → used for error combinations.
Sum of random variables
The pdf $f_S(s)$ can be found using the characteristic function:
$$\varphi_S(t) = \int f_S(s)\,e^{its}\,ds = \int f_{\vec{X}}(\vec{x})\,e^{it\sum x_i}\,d\vec{x}$$
For independent variables, the characteristic function factorizes:
$$\varphi_S(t) = \prod_k \int f_{X_k}(x_k)\,e^{itx_k}\,dx_k = \prod_k \varphi_{X_k}(t)$$
Finally, the pdf is the Fourier transform of the CF, so the pdf of the sum is a convolution:
$$f_S = f_{X_1} * f_{X_2} * \cdots * f_{X_n}$$
- Sum of normal variables ($\mu_1, \sigma_1$ and $\mu_2, \sigma_2$) → normal, with $\mu = \mu_1 + \mu_2$ and $\sigma^2 = \sigma_1^2 + \sigma_2^2$
- Sum of Poisson variables ($\lambda_1$ and $\lambda_2$) → Poisson, with $\lambda = \lambda_1 + \lambda_2$
- Sum of chi-square variables ($n_1$ and $n_2$) → chi-square, with $n = n_1 + n_2$
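The stability of the Poisson law under addition can be verified by sampling; a minimal sketch (not from the lecture), assuming NumPy and SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
s = rng.poisson(2.0, 100_000) + rng.poisson(3.5, 100_000)  # sum of two Poissons

# Compare the empirical pmf of the sum with Poisson(lambda1 + lambda2)
k = np.arange(20)
empirical = np.bincount(s, minlength=20)[:20] / len(s)
print(np.max(np.abs(empirical - stats.poisson.pmf(k, 5.5))))  # ~1e-3, statistical noise
```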
Sum of independent variables
Weak law of large numbers: a sample of size $n$ is the realization of $n$ independent variables with the same distribution (mean $\mu$, variance $\sigma^2$). The sample mean is a realization of
$$M = \frac{S}{n} = \frac{1}{n}\sum X_i$$
Mean of $M$: $\mu_M = \mu$. Variance of $M$: $\sigma_M^2 = \frac{\sigma^2}{n}$.
Central-limit theorem: for $n$ independent random variables of mean $\mu_i$ and variance $\sigma_i^2$, the sum of the reduced variables
$$C = \frac{1}{\sqrt{n}}\sum \frac{X_i - \mu_i}{\sigma_i}$$
has a pdf that converges to a reduced normal distribution:
$$f_C(c) \xrightarrow{n\to+\infty} \frac{1}{\sqrt{2\pi}}\,e^{-c^2/2}$$
The sum of many random fluctuations is normally distributed.
Central limit theorem
[Figures: sums of several reduced uniform distributions, and of several reduced exponential laws, compared to a normal pdf]
The central limit theorem ensures the convergence, but not its speed.
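A minimal illustration of the theorem (not from the lecture), assuming NumPy; 12 reduced uniforms are already very close to a standard normal:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 12  # number of summed variables

# Reduced uniforms: mean 0, variance 1
u = (rng.uniform(size=(100_000, n)) - 0.5) * np.sqrt(12)
c = u.sum(axis=1) / np.sqrt(n)

print(c.mean(), c.std())  # ~0 and ~1
# A histogram of c is already very close to a standard normal;
# the same sum of reduced exponentials converges more slowly (skewed pdf).
```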
Dispersion and uncertainty
Any measurement (or combination of measurements) is a realization of a random variable.
- Measured value: $\theta$
- True value: $\theta_0$
Uncertainty: quantifying the difference between $\theta$ and $\theta_0$ → a measure of dispersion.
We will postulate $\Delta\theta = \alpha\sigma_\theta$: the absolute error, always positive.
Usually one differentiates:
- Statistical error: due to the measurement pdf.
- Systematic errors or bias: fixed but unknown deviation (equipment, assumptions, …)
Systematic errors can be seen as statistical errors in a set of similar experiments.
Error sources
Observation error $\Delta_O$, position error $\Delta_P$, scaling error $\Delta_S$:
$$\theta = \theta_0 + \delta_O + \delta_P + \delta_S$$
Each $\delta_i$ is a realization of a random variable, with $\mu_i = 0$ and variance $\sigma_i^2$; $\Delta_O = \alpha\sigma_O$, $\Delta_P = \alpha\sigma_P$, $\Delta_S = \alpha\sigma_S$.
For uncorrelated error sources:
$$\Delta_{\mathrm{tot}}^2 = \alpha^2\sigma_{\mathrm{tot}}^2 = \alpha^2\left(\sigma_O^2 + \sigma_P^2 + \sigma_S^2\right) = \Delta_O^2 + \Delta_P^2 + \Delta_S^2$$
Choice of $\alpha$? If there are many sources, the central-limit theorem gives a normal distribution:
- $\alpha = 1$ gives (approximately) a 68% confidence interval
- $\alpha = 2$ gives 95% CL (and at least 75% from Bienaymé-Chebyshev)
Error propagation
Measure: $x \pm \Delta x$. Compute: $f(x)$. What is $\Delta f$?
[Figure: propagation of the interval $\Delta x$ through $f$ via the local slope $f'$]
Assuming small errors, using a Taylor expansion:
$$f(x + \Delta x) = f(x) + \frac{df}{dx}\Delta x + \frac{1}{2}\frac{d^2f}{dx^2}\Delta x^2 + O(\Delta x^3)$$
$$f(x - \Delta x) = f(x) - \frac{df}{dx}\Delta x + \frac{1}{2}\frac{d^2f}{dx^2}\Delta x^2 + O(\Delta x^3)$$
$$\Delta f = \frac{1}{2}\left|f(x + \Delta x) - f(x - \Delta x)\right| = \left|\frac{df}{dx}\right|\Delta x$$
Error propagation
Measure: $x \pm \Delta x$, $y \pm \Delta y$, … Compute: $f(x, y, \ldots)$. What is $\Delta f$?
Idea: treat the effect of each variable as a separate error source:
error due to $x$: $\Delta f_x = \left|\frac{\partial f}{\partial x}\right|\Delta x$; error due to $y$: $\Delta f_y = \left|\frac{\partial f}{\partial y}\right|\Delta y$.
Then:
$$\Delta f^2 = \Delta f_x^2 + \Delta f_y^2 + 2\rho_{xy}\Delta f_x\Delta f_y
= \left(\frac{\partial f}{\partial x}\Delta x\right)^2 + \left(\frac{\partial f}{\partial y}\Delta y\right)^2 + 2\rho_{xy}\frac{\partial f}{\partial x}\frac{\partial f}{\partial y}\Delta x\,\Delta y$$
- Uncorrelated: $\Delta f^2 = \left(\frac{\partial f}{\partial x}\Delta x\right)^2 + \left(\frac{\partial f}{\partial y}\Delta y\right)^2$
- Correlated ($\rho = 1$): $\Delta f = \left|\frac{\partial f}{\partial x}\Delta x + \frac{\partial f}{\partial y}\Delta y\right|$
- Anticorrelated ($\rho = -1$): $\Delta f = \left|\frac{\partial f}{\partial x}\Delta x - \frac{\partial f}{\partial y}\Delta y\right|$
In the general case, for $k$ variables:
$$\Delta f^2 = \sum_{i=1}^{k}\left(\frac{\partial f}{\partial x_i}\Delta x_i\right)^2 + \sum_{i=1}^{k}\sum_{j\neq i}\rho_{x_ix_j}\frac{\partial f}{\partial x_i}\frac{\partial f}{\partial x_j}\Delta x_i\,\Delta x_j$$
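A small, self-contained sketch of this propagation formula with numerical derivatives (not from the lecture), assuming NumPy; the function propagate and its arguments are illustrative:

```python
import numpy as np

def propagate(f, x, dx, rho=None):
    """Uncertainty on f(*x) from uncertainties dx via linear propagation.

    f: function of k scalar arguments; x, dx: length-k sequences;
    rho: optional k x k correlation matrix (identity if None).
    """
    x, dx = np.asarray(x, float), np.asarray(dx, float)
    k = len(x)
    # Numerical partial derivatives df/dx_i (central differences)
    grad = np.empty(k)
    for i in range(k):
        h = 1e-6 * max(abs(x[i]), 1.0)
        xp, xm = x.copy(), x.copy()
        xp[i] += h
        xm[i] -= h
        grad[i] = (f(*xp) - f(*xm)) / (2 * h)
    rho = np.eye(k) if rho is None else np.asarray(rho, float)
    # Delta f^2 = sum_ij rho_ij (df/dx_i dx_i)(df/dx_j dx_j)
    terms = grad * dx
    return np.sqrt(terms @ rho @ terms)

# Example: f = x*y with uncorrelated errors -> relative errors add in quadrature
print(propagate(lambda x, y: x * y, [2.0, 3.0], [0.02, 0.03]))  # ~0.085
```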
Parametric estimation
From a finite sample $\{x_i\}$, estimate a parameter $\theta$.
Statistic: a function of the sample, $S = f(\{x_i\})$. Any statistic can be considered as an estimator of $\theta$. To be a good estimator it needs to satisfy:
• Consistency: the estimator converges to the true value in the limit of an infinite sample.
• Bias: difference between the expectation of the estimator and the true value.
• Efficiency: speed of convergence.
• Robustness: (in)sensitivity to statistical fluctuations.
A good estimator should at least be consistent and asymptotically unbiased.
Efficient / unbiased / robust often contradict each other → different choices for different applications.
Bias and consistency
A sample is a set of realizations of random variables (or vectors); so is the estimator: $\hat\theta$ is a realization of $\hat\Theta$. It has a mean, a variance, … and a probability density function.
Bias: average difference between the estimator and the true value: $b(\theta) = E[\hat\Theta] - \theta_0 = \mu_{\hat\Theta} - \theta_0$
- Unbiased: $b(\theta) = 0$
- Asymptotically unbiased: $b(\theta) \xrightarrow{n\to+\infty} 0$
Consistency, formally: $P(|\hat\theta - \theta| > \varepsilon) \xrightarrow{n\to+\infty} 0, \ \forall\varepsilon$;
in practice, for an asymptotically unbiased estimator: $\sigma_{\hat\Theta} \xrightarrow{n\to\infty} 0$.
[Figure: pdfs of a biased, an asymptotically unbiased, and an unbiased estimator]
Efficiency
The efficiency of a convergent estimator is given by its variance.
For any unbiased estimator of $\theta$, the variance cannot go below the Cramér-Rao bound:
$$\sigma_{\hat\Theta}^2 \geq \frac{1}{E\left[\left(\frac{\partial\ln\mathcal{L}}{\partial\theta}\right)^2\right]} = \frac{-1}{E\left[\frac{\partial^2\ln\mathcal{L}}{\partial\theta^2}\right]}$$
An efficient estimator reaches the Cramér-Rao bound (at least asymptotically) → minimal variance estimator.
Such an estimator will often be biased, but asymptotically unbiased.
Empirical mean estimator
The sample mean $\bar{x}$ is a good estimator of the population mean $\mu$:
$$\hat\mu = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Using the weak law of large numbers:
$$\mu_{\hat\mu} = E[\hat\mu] = E\left[\frac{1}{n}\sum_{i=1}^{n} x_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[x_i] = \frac{1}{n}\sum_{i=1}^{n}\mu = \mu
\quad\Rightarrow\quad b(\hat\mu) = \mu_{\hat\mu} - \mu = 0 \ \text{: unbiased}$$
$$\sigma_{\hat\mu}^2 = \mathrm{Var}[\hat\mu] = \mathrm{Var}\left[\frac{1}{n}\sum_{i=1}^{n} x_i\right] = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}[x_i] = \frac{\sigma^2}{n}
\xrightarrow{n\to+\infty} 0 \ \text{: convergent}$$
Empirical variance estimator
Sample variance as an estimator of the population variance:
$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat\mu)^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - E[x_i])^2 - (\hat\mu - E[\hat\mu])^2$$
$$E[\hat\sigma^2] = \mathrm{Var}[x_i] - \mathrm{Var}[\hat\mu] = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\sigma^2$$
→ biased, but asymptotically unbiased.
Unbiased variance estimator:
$$s^2 = \frac{n}{n-1}\hat\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \hat\mu)^2$$
Variance of the estimator (convergence):
$$\sigma_{s^2}^2 = \frac{\sigma^4}{n-1}\left(\frac{n-1}{n}\gamma_2 + 2\right)
\xrightarrow{\text{normal sample}} \frac{2\sigma^4}{n-1} \xrightarrow{n\to+\infty} 0$$
($\gamma_2$ denotes the excess kurtosis.)
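The bias factor $(n-1)/n$ is easy to see numerically; a minimal sketch (not from the lecture), assuming NumPy, where ddof selects the divisor:

```python
import numpy as np

rng = np.random.default_rng(5)
estimates_biased, estimates_unbiased = [], []
for _ in range(20_000):
    x = rng.normal(loc=0.0, scale=1.0, size=5)     # small sample, sigma^2 = 1
    estimates_biased.append(np.var(x))             # divides by n
    estimates_unbiased.append(np.var(x, ddof=1))   # divides by n - 1

print(np.mean(estimates_biased))    # ~ (n-1)/n = 0.8
print(np.mean(estimates_unbiased))  # ~ 1.0
```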
Errors on these estimators
Uncertainty → standard deviation estimator.
Mean: $\hat\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$, with $\sigma_{\hat\mu}^2 = \frac{\sigma^2}{n}$ → $\Delta\hat\mu = \sqrt{\frac{s^2}{n}}$
Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \hat\mu)^2$, with $\sigma_{s^2}^2 \approx \frac{2\sigma^4}{n}$ → $\Delta s^2 = s^2\sqrt{\frac{2}{n}}$
Central-limit theorem: for large enough samples, the mean and variance empirical estimators are normally distributed, so $\hat\mu \pm \Delta\hat\mu$ and $s^2 \pm \Delta s^2$ define 68% confidence intervals.
To quote an error on $\sigma$ itself, one uses the estimator $\hat{s} = \sqrt{s^2}$: beware, it is biased!
Likelihood function
Consider a generic function $k(x, \theta)$, where $x$ is the random variable(s) and $\theta$ the parameter(s).
- Fix $\theta = \theta_0$ (true value) → probability density function: $f(x; \theta) = k(x, \theta_0)$, with $\int f(x;\theta)\,dx = 1$. For Bayesians, $f(x|\theta) = f(x;\theta)$.
- Fix $x = u$ (one realization of the random variable) → likelihood function: $\mathcal{L}(\theta) = k(u, \theta)$, with $\int \mathcal{L}(\theta)\,d\theta = ?$ (not normalized in general). For Bayesians, $f(\theta|x) = \frac{\mathcal{L}(\theta)}{\int \mathcal{L}(\theta)\,d\theta}$.
For a sample of $n$ independent realizations of the same variable $X$:
$$\mathcal{L}(\theta) = \prod_i k(x_i, \theta) = \prod_i f(x_i; \theta)$$
Maximum likelihood (1)
For a sample of measurements $\{x_i\}$: the analytical form of the density is known, but it depends on several unknown parameters $\theta$.
E.g. event counting follows a Poisson distribution, with a physics-dependent parameter $\lambda(\theta)$:
$$\mathcal{L}(\theta) = \prod_i \frac{e^{-\lambda(\theta)}\,\lambda(\theta)^{x_i}}{x_i!}$$
The maximum-likelihood estimators $\hat\theta$ of the parameters $\theta$ are the ones that maximize the probability of observing the observed result, i.e. the maximum of the likelihood function:
$$\left.\frac{\partial\mathcal{L}}{\partial\theta}\right|_{\theta=\hat\theta} = 0$$
(a system of equations for several parameters). One often minimizes $-\ln\mathcal{L}$ instead → simpler expressions.
Maximum likelihood (2)
Minimize $-\ln\mathcal{L}$:
$$\mathcal{L}(\theta) = \prod_{i=1}^{n} \frac{e^{-\lambda(\theta)}\,\lambda(\theta)^{x_i}}{x_i!}
\quad\Rightarrow\quad
-\ln\mathcal{L}(\theta) = \sum_{i=1}^{n}\left[\lambda(\theta) - x_i\ln\lambda(\theta) + \ln x_i!\right]$$
$$-\frac{\partial\ln\mathcal{L}}{\partial\theta} = \sum_{i=1}^{n}\left[\lambda'(\theta) - x_i\frac{\lambda'(\theta)}{\lambda(\theta)}\right] = n\lambda'(\theta) - \frac{\lambda'(\theta)}{\lambda(\theta)}\sum_{i=1}^{n} x_i$$
$$-\frac{\partial\ln\mathcal{L}}{\partial\theta} = 0
\;\Leftrightarrow\; n = \frac{1}{\lambda(\hat\theta)}\sum_{i=1}^{n} x_i
\;\Leftrightarrow\; \lambda(\hat\theta) = \frac{1}{n}\sum_{i=1}^{n} x_i = \hat\mu$$
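A numerical version of this example, as a sketch (not from the lecture) assuming SciPy; here $\lambda$ itself is taken as the parameter, so the MLE should land on the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

rng = np.random.default_rng(6)
x = rng.poisson(lam=4.2, size=500)  # counting data, true lambda = 4.2

def nll(lam):
    # -ln L for a Poisson sample (gammaln(x + 1) = ln x!)
    return np.sum(lam - x * np.log(lam) + gammaln(x + 1))

res = minimize_scalar(nll, bounds=(0.1, 20.0), method="bounded")
print(res.x, x.mean())  # the MLE coincides with the sample mean
```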
Properties of MLE
Mostly asymptotic properties, valid for large samples; often assumed in any case for lack of better information.
- Asymptotically unbiased
- Asymptotically efficient (reaches the Cramér-Rao bound)
- Asymptotically normally distributed, following a multinormal law
$$f(\hat\theta; \theta, \Sigma) = \frac{1}{\sqrt{(2\pi)^{n_\theta}|\Sigma|}}\,e^{-\frac{1}{2}(\hat\theta - \theta)^T\Sigma^{-1}(\hat\theta - \theta)}$$
with covariance given by the generalization of the Cramér-Rao bound:
$$\Sigma_{ij}^{-1} = -E\left[\frac{\partial^2\ln\mathcal{L}}{\partial\theta_i\,\partial\theta_j}\right] = E\left[\frac{\partial\ln\mathcal{L}}{\partial\theta_i}\frac{\partial\ln\mathcal{L}}{\partial\theta_j}\right]$$
Errors on MLE
Errors on the parameters come from the covariance matrix, $\Sigma_{ij}^{-1} = -E\left[\frac{\partial^2\ln\mathcal{L}}{\partial\theta_i\,\partial\theta_j}\right]$.
For 1 parameter (with only one realization of the estimator, the expectation is replaced by the empirical mean of 1 value):
$$\Delta\theta = \hat\sigma_{\hat\theta} = \sqrt{\frac{-1}{\partial^2\ln\mathcal{L}/\partial\theta^2}}$$
This defines a 68% interval. More generally, expanding around the maximum:
$$\Delta\ln\mathcal{L} = \ln\mathcal{L}(\hat\theta) - \ln\mathcal{L}(\theta) = \frac{1}{2}\sum_i\sum_j \Sigma_{ij}^{-1}(\hat\theta_i - \theta_i)(\hat\theta_j - \theta_j) + \mathcal{O}(\theta^3)$$
Confidence contours are defined by the equation $\Delta\ln\mathcal{L} = \beta(n_\theta, \alpha)$, with
$$\alpha = \int_0^{2\beta} f_{\chi^2}(x; n_\theta)\,dx$$
Values of $\beta$ for different numbers of parameters $n_\theta$ and confidence levels $\alpha$ (for $n_\theta = 1$, $\beta = n_\sigma^2/2$):

α (%)   n_θ = 1   n_θ = 2   n_θ = 3
68.3    0.5       1.15      1.76
95.4    2         3.09      4.01
99.7    4.5       5.92      7.08
Goodness of fit
$-2\ln\mathcal{L}$ is the realization of a random variable following (asymptotically) a chi-square law with $N_{DF}$ degrees of freedom, where $n$ = number of fitted points (sample size), $n_\theta$ = number of fitted parameters, and $N_{DF} = n - n_\theta$.
(Maximum likelihood here assumes each $y_i$ is normally distributed with mean $f(x_i, \theta)$ and standard deviation $\Delta y_i$.)
3 criteria:
1) Chi-square law mean $\mu = N_{DF}$: expect $-2\ln\mathcal{L} \sim N_{DF}$, i.e. $\frac{-2\ln\mathcal{L}}{N_{DF}} \sim 1$.
2) Chi-square law variance $\sigma^2 = 2N_{DF}$: if $N_{DF} > 30$, the chi-square approaches a normal law, and $-2\ln\mathcal{L}$ has 95% probability to be in $\left[N_{DF} \pm 2\sqrt{2N_{DF}}\right]$.
3) Probability of getting a worse agreement, the p-value:
$$p\text{-value} = \int_{-2\ln\mathcal{L}}^{+\infty} f_{\chi^2}(x; N_{DF})\,dx$$
Least squares
Set of measurements $(x_i, y_i)$ with vertical uncertainties $\Delta y_i$; theoretical law $y = f(x, \theta)$.
- Naive approach, regression: $w(\theta) = \sum_i \left(y_i - f(x_i, \theta)\right)^2$, solve $\frac{\partial w}{\partial\theta_i} = 0$
- Weight by the error, least squares: $K^2(\theta) = \sum_i \left(\frac{y_i - f(x_i, \theta)}{\Delta y_i}\right)^2$, solve $\frac{\partial K^2}{\partial\theta_i} = 0$
The least-squares (or chi-square) fit is the MLE for Gaussian errors:
$$\mathcal{L}(\theta) = \prod_i \frac{1}{\sqrt{2\pi}\,\Delta y_i}\,e^{-\frac{1}{2}\left(\frac{y_i - f(x_i, \theta)}{\Delta y_i}\right)^2}
\quad\Rightarrow\quad
\frac{\partial\mathcal{L}}{\partial\theta} = 0 \;\Leftrightarrow\; -2\frac{\partial\ln\mathcal{L}}{\partial\theta} = \frac{\partial K^2}{\partial\theta} = 0$$
Generic case, with correlations:
$$K^2(\theta) = \left(\vec{y} - \vec{f}(\vec{x}, \theta)\right)^T \Sigma^{-1} \left(\vec{y} - \vec{f}(\vec{x}, \theta)\right)$$
Example: fitting a line
For $f(x) = ax$, define
$$A = \sum_i \frac{x_i^2}{\Delta y_i^2}, \qquad B = \sum_i \frac{x_i y_i}{\Delta y_i^2}, \qquad C = \sum_i \frac{y_i^2}{\Delta y_i^2}$$
$$K^2(a) = Aa^2 - 2Ba + C = -2\ln\mathcal{L}$$
$$\frac{\partial K^2}{\partial a} = 2Aa - 2B = 0 \quad\Rightarrow\quad \hat{a} = \frac{B}{A}$$
$$\frac{\partial^2 K^2}{\partial a^2} = 2A = \frac{2}{\sigma_a^2} \quad\Rightarrow\quad \Delta\hat{a} = \frac{1}{\sqrt{A}}$$
Example: fitting a line
For $f(x) = ax + b$, define
$$A = \sum_i \frac{x_i^2}{\Delta y_i^2},\quad B = \sum_i \frac{1}{\Delta y_i^2},\quad C = \sum_i \frac{x_i}{\Delta y_i^2},\quad
D = \sum_i \frac{x_i y_i}{\Delta y_i^2},\quad E = \sum_i \frac{y_i}{\Delta y_i^2},\quad F = \sum_i \frac{y_i^2}{\Delta y_i^2}$$
$$K^2(a, b) = Aa^2 + Bb^2 + 2Cab - 2Da - 2Eb + F = -2\ln\mathcal{L}$$
$$\frac{\partial K^2}{\partial a} = 2Aa + 2Cb - 2D = 0, \qquad
\frac{\partial K^2}{\partial b} = 2Ca + 2Bb - 2E = 0$$
$$\Rightarrow\quad \hat{a} = \frac{BD - EC}{AB - C^2}, \qquad \hat{b} = \frac{AE - DC}{AB - C^2}$$
$$\frac{\partial^2 K^2}{\partial a^2} = 2A = 2\Sigma^{-1}_{11}, \qquad
\frac{\partial^2 K^2}{\partial b^2} = 2B = 2\Sigma^{-1}_{22}, \qquad
\frac{\partial^2 K^2}{\partial a\,\partial b} = 2C = 2\Sigma^{-1}_{12}$$
$$\Sigma^{-1} = \begin{pmatrix} A & C \\ C & B \end{pmatrix}, \qquad
\Sigma = \frac{1}{AB - C^2}\begin{pmatrix} B & -C \\ -C & A \end{pmatrix}
\quad\Rightarrow\quad
\Delta\hat{a} = \sqrt{\frac{B}{AB - C^2}}, \qquad \Delta\hat{b} = \sqrt{\frac{A}{AB - C^2}}$$
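These closed-form expressions translate directly into code; a minimal sketch (not from the lecture), assuming NumPy, with simulated data for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 20)
dy = np.full_like(x, 0.5)
y = 1.3 * x + 0.7 + rng.normal(scale=dy)

w = 1.0 / dy**2
A, B, C = np.sum(w * x**2), np.sum(w), np.sum(w * x)
D, E = np.sum(w * x * y), np.sum(w * y)

det = A * B - C**2
a_hat, b_hat = (B * D - E * C) / det, (A * E - D * C) / det
da, db = np.sqrt(B / det), np.sqrt(A / det)
print(f"a = {a_hat:.3f} +/- {da:.3f}, b = {b_hat:.3f} +/- {db:.3f}")
```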
Example: fitting a line
[Figure: 2-dimensional error contours on $a$ and $b$]
Non-parametric estimation
Directly estimating the probability density function. Used for:
• Likelihood ratio discriminants
• Separating power of variables
• Data / MC agreement
• …
Frequency table: for a sample $\{x_i\}$, $i = 1 \ldots n$:
1. Define successive intervals (bins) $C_k = [a_k, a_{k+1})$
2. Count the number of events $n_k$ in $C_k$
Histogram: graphical representation of the frequency table:
$$h(x) = n_k \ \text{if}\ x \in C_k$$
Histogram
Example dataset: N/Z for stable heavy nuclei:
1.321, 1.357, 1.392, 1.410, 1.428, 1.446,
1.464, 1.421, 1.438, 1.344, 1.379, 1.413,
1.448, 1.389, 1.366, 1.383, 1.400, 1.416,
1.433, 1.466, 1.500, 1.322, 1.370, 1.387,
1.403, 1.419, 1.451, 1.483, 1.396, 1.428,
1.375, 1.406, 1.421, 1.437, 1.453, 1.468,
1.500, 1.446, 1.363, 1.393, 1.424, 1.439,
1.454, 1.469, 1.484, 1.462, 1.382, 1.411,
1.441, 1.455, 1.470, 1.500, 1.449, 1.400,
1.428, 1.442, 1.457, 1.471, 1.485, 1.514,
1.464, 1.478, 1.416, 1.444, 1.458, 1.472,
1.486, 1.500, 1.465, 1.479, 1.432, 1.459,
1.472, 1.486, 1.513, 1.466, 1.493, 1.421,
1.447, 1.460, 1.473, 1.486, 1.500, 1.526,
1.480, 1.506, 1.435, 1.461, 1.487, 1.500,
1.512, 1.538, 1.493, 1.450, 1.475, 1.500,
1.512, 1.525, 1.550, 1.506, 1.530, 1.487,
1.512, 1.524, 1.536, 1.518, 1.577, 1.554,
1.586, 1.586
Histogram as PDF estimator
Statistical description: the $n_k$ are multinomial random variables, with $n = \sum_k n_k$ and
$$p_k = P(x \in C_k) = \int_{C_k} f_X(x)\,dx, \qquad \mu_k = np_k, \qquad
\sigma_k^2 = np_k(1 - p_k) \underset{p_k \ll 1}{\approx} \mu_k, \qquad
\mathrm{Cov}(n_k, n_r) = -np_kp_r \underset{p_k \ll 1}{\approx} 0$$
Each bin can be described by a Poisson density. The $1\sigma$ error on $n_k$ is then:
$$\Delta n_k = \sqrt{\hat\sigma_k^2} = \sqrt{\hat\mu_k} = \sqrt{n_k}$$
The histogram is an estimator of the probability density:
for a large sample, $\lim_{n\to+\infty}\frac{n_k}{n} = \frac{\mu_k}{n} = p_k$; for small classes (width $\delta$), $p_k = \int_{C_k} f_X(x)\,dx \approx \delta f_X(x_c)$, so $\lim_{\delta\to 0}\frac{p_k}{\delta} = f_X(x)$.
And since $h(x) = n_k$ if $x \in C_k$:
$$f(x) = \lim_{\substack{n\to+\infty \\ \delta\to 0}} \frac{1}{n\delta}\,h(x)$$
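A minimal sketch of this estimator (not from the lecture), assuming NumPy: the histogram is normalized by $n\delta$, and each bin carries a $\sqrt{n_k}$ Poisson error:

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(size=10_000)

counts, edges = np.histogram(x, bins=40)
delta = edges[1] - edges[0]
centers = 0.5 * (edges[:-1] + edges[1:])

f_hat = counts / (len(x) * delta)             # h(x) / (n * delta): pdf estimate
df_hat = np.sqrt(counts) / (len(x) * delta)   # Poisson error sqrt(n_k), propagated

# Compare to the true standard normal pdf at the bin centers
true_pdf = np.exp(-centers**2 / 2) / np.sqrt(2 * np.pi)
print(np.max(np.abs(f_hat - true_pdf)))  # small, within a few df_hat
```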
Confidence interval
Generalization of the concept of uncertainty: an interval that contains the true value with a given probability.
For a random variable $X$, a confidence interval $[a, b]$ with confidence level $\alpha$ is such that the probability of finding a realization of $X$ inside the interval is:
$$P(X \in [a, b]) = \int_a^b f_X(x)\,dx = \alpha$$
Frequentists and Bayesians use slightly different concepts:
- For Bayesians, the posterior density is the probability density of the true value. It can be used to derive an interval such that $P(\theta \in [a, b]) = \alpha$.
- No such thing for a frequentist: the interval itself becomes the random variable. $[a, b]$ is a realization of $[A, B]$, with $P(A < \theta \text{ and } B > \theta) = \alpha$, independently of $\theta$.
Confidence interval
Common choices:
- Mean-centered, symmetric interval $[\mu - a, \mu + a]$: $\int_{\mu-a}^{\mu+a} f(x)\,dx = \alpha$
- Mean-centered, probability-symmetric interval $[a, b]$: $\int_a^{\mu} f(x)\,dx = \int_{\mu}^{b} f(x)\,dx = \alpha/2$
- Highest Probability Density (HPD) interval $[a, b]$: $\int_a^b f(x)\,dx = \alpha$ with $f(x) > f(y)$ for $x \in [a, b]$ and $y \notin [a, b]$
Confidence Belt
To build a frequentist interval for an estimator $\hat\lambda$ of $\lambda$:
1. Make pseudo-experiments for several values of $\lambda$ and compute the estimator $\hat\lambda$ for each (MC sampling of the estimator pdf).
2. For each $\lambda$, determine $\Xi(\lambda)$ and $\Omega(\lambda)$ such that:
$\hat\lambda < \Xi(\lambda)$ for a fraction $(1-\alpha)/2$ of the pseudo-experiments, and
$\hat\lambda > \Omega(\lambda)$ for a fraction $(1-\alpha)/2$ of the pseudo-experiments.
These 2 curves form the confidence belt, for a CL $\alpha$.
3. Invert these functions. The interval $[\Omega^{-1}(\hat\lambda), \Xi^{-1}(\hat\lambda)]$ satisfies:
$$P\left(\Omega^{-1}(\hat\lambda) < \lambda < \Xi^{-1}(\hat\lambda)\right)
= 1 - P\left(\Xi^{-1}(\hat\lambda) < \lambda\right) - P\left(\Omega^{-1}(\hat\lambda) > \lambda\right)
= 1 - P\left(\hat\lambda < \Xi(\lambda)\right) - P\left(\hat\lambda > \Omega(\lambda)\right) = \alpha$$
[Figure: confidence belt for a Poisson parameter $\lambda$ estimated with the empirical mean of 3 realizations (68% CL)]
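A Monte Carlo sketch of steps 1-3 for the Poisson example in the figure (not from the lecture), assuming NumPy; empirical quantiles stand in for the belt curves:

```python
import numpy as np

rng = np.random.default_rng(9)
alpha = 0.683
lambdas = np.linspace(0.5, 10.0, 40)   # scanned true values
belt = []
for lam in lambdas:
    # estimator: empirical mean of 3 Poisson realizations
    lam_hat = rng.poisson(lam, size=(100_000, 3)).mean(axis=1)
    lo = np.quantile(lam_hat, (1 - alpha) / 2)      # Xi(lambda)
    hi = np.quantile(lam_hat, 1 - (1 - alpha) / 2)  # Omega(lambda)
    belt.append((lo, hi))

# For an observed lam_hat, the interval is the set of lambdas whose
# belt [lo, hi] contains it (the inversion of step 3).
observed = 3.0
covered = [lam for lam, (lo, hi) in zip(lambdas, belt) if lo <= observed <= hi]
print(covered[0], covered[-1])
```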
Dealing with systematics
The variance of the estimator only measures the statistical uncertainty.
Often, we will have to deal with some parameters whose value is known with limited precision → systematic uncertainties.
The likelihood function becomes $\mathcal{L}(\theta, \nu)$, with $\nu = \nu_0 \pm \Delta\nu$ or $\nu = \nu_0\,^{+\Delta\nu_+}_{-\Delta\nu_-}$.
The imperfectly known parameters $\nu$ are nuisance parameters.
Bayesian inference
In Bayesian statistics, nuisance parameters are dealt with by assigning them a prior $\pi(\nu)$. Usually, a multinormal law is used, with mean $\nu_0$ and covariance matrix estimated from $\Delta\nu_0$ (+ correlations, if needed):
$$f(\theta, \nu|x) = \frac{f(x|\theta, \nu)\,\pi(\theta)\,\pi(\nu)}{\iint f(x|\theta, \nu)\,\pi(\theta)\,\pi(\nu)\,d\theta\,d\nu}$$
The final posterior is obtained by marginalization over the nuisance parameters:
$$f(\theta|x) = \int f(\theta, \nu|x)\,d\nu = \frac{\int f(x|\theta, \nu)\,\pi(\theta)\,\pi(\nu)\,d\nu}{\iint f(x|\theta, \nu)\,\pi(\theta)\,\pi(\nu)\,d\theta\,d\nu}$$
Profile Likelihood
There is no true frequentist way to add systematic effects. The popular method of the day is profiling:
deal with nuisance parameters as realizations of random variables, and extend the likelihood:
$$\mathcal{L}(\theta, \nu) \to \mathcal{L}'(\theta, \nu) = \mathcal{L}(\theta, \nu)\,G(\nu)$$
$G(\nu)$ is the likelihood of the nuisance parameters (identical in form to the prior).
For each value of $\theta$, maximize the likelihood with respect to the nuisances → the profile likelihood $PL(\theta)$.
$PL(\theta)$ has the same asymptotic statistical properties as the regular likelihood.
Statistical Tests
Statistical tests aim at:
• Checking the compatibility of a dataset $\{x_i\}$ with a given distribution
• Checking the compatibility of 2 datasets $\{x_i\}$, $\{y_i\}$: are they drawn from the same distribution?
• Comparing different hypotheses: background vs signal + background
In every case:
• build a statistic that quantifies the agreement with the hypothesis
• convert it into a probability of compatibility/incompatibility: the p-value
Pearson test
Test for binned data: use the Poisson limit of the histogram.
• Sort the sample into $k$ bins $C_i$ → counts $n_i$
• Compute the probability of each class: $p_i = \int_{C_i} f(x)\,dx$
• The test statistic compares, for each bin, the deviation of the observation (data $n_i$) from the expected Poisson mean $np_i$ to the theoretical Poisson standard deviation $\sqrt{np_i}$:
$$\chi^2 = \sum_{i \in \text{bins}} \frac{(n_i - np_i)^2}{np_i}$$
Then $\chi^2$ follows (asymptotically) a chi-square law with $k - 1$ degrees of freedom (1 constraint: $\sum n_i = n$).
p-value: probability of doing worse,
$$p\text{-value} = \int_{\chi^2}^{+\infty} f_{\chi^2}(x; k-1)\,dx$$
For a "good" agreement, $\chi^2/(k-1) \sim 1$; more precisely ($1\sigma$ interval, ~68% CL):
$$\chi^2 \in \left[(k-1) \pm \sqrt{2(k-1)}\right]$$
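A minimal sketch of the test (not from the lecture), assuming SciPy, for a uniform hypothesis with $k = 10$ bins:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
x = rng.uniform(size=1000)

k = 10
n_i, edges = np.histogram(x, bins=k, range=(0, 1))
p_i = np.full(k, 1.0 / k)          # tested law: uniform -> p_i = 1/k

chi2 = np.sum((n_i - len(x) * p_i) ** 2 / (len(x) * p_i))
pvalue = stats.chi2.sf(chi2, df=k - 1)  # integral of the chi-square pdf above chi2
print(chi2 / (k - 1), pvalue)           # ~1, and a non-small p-value for a correct hypothesis
```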
Kolmogorov-Smirnov test
Test for unbinned data: compare the sample cumulative distribution function to the tested one.
Sample pdf (ordered sample): $f_S(x) = \frac{1}{n}\sum_i \delta(x - x_i)$, so
$$F_S(x) = \begin{cases} 0 & x < x_1 \\ k/n & x_k \leq x < x_{k+1} \\ 1 & x \geq x_n \end{cases}$$
The Kolmogorov statistic is the largest deviation:
$$D_n = \sup_x \left|F_S(x) - F(x)\right|$$
The test distribution has been computed by Kolmogorov:
$$P(D_n\sqrt{n} > \beta) = 2\sum_{r=1}^{\infty} (-1)^{r-1}\,e^{-2r^2\beta^2}$$
$[0, \beta]$ defines a confidence interval for $D_n$:
$\beta = 0.9584/\sqrt{n}$ for 68.3% CL; $\beta = 1.3754/\sqrt{n}$ for 95.4% CL.
Example
Test the compatibility of the following sample ($n = 120$) with an exponential law, $f(x) = \lambda e^{-\lambda x}$ with $\lambda = 0.4$:
0.008, 0.036, 0.112, 0.115, 0.133, 0.178, 0.189, 0.238, 0.274, 0.323, 0.364, 0.386, 0.406, 0.409, 0.418, 0.421,
0.423, 0.455, 0.459, 0.496, 0.519, 0.522, 0.534, 0.582, 0.606, 0.624, 0.649, 0.687, 0.689, 0.764, 0.768, 0.774,
0.825, 0.843, 0.921, 0.987, 0.992, 1.003, 1.004, 1.015, 1.034, 1.064, 1.112, 1.159, 1.163, 1.208, 1.253, 1.287,
1.317, 1.320, 1.333, 1.412, 1.421, 1.438, 1.574, 1.719, 1.769, 1.830, 1.853, 1.930, 2.041, 2.053, 2.119, 2.146,
2.167, 2.237, 2.243, 2.249, 2.318, 2.325, 2.349, 2.372, 2.465, 2.497, 2.553, 2.562, 2.616, 2.739, 2.851, 3.029,
3.327, 3.335, 3.390, 3.447, 3.473, 3.568, 3.627, 3.718, 3.720, 3.814, 3.854, 3.929, 4.038, 4.065, 4.089, 4.177,
4.357, 4.403, 4.514, 4.771, 4.809, 4.827, 5.086, 5.191, 5.928, 5.952, 5.968, 6.222, 6.556, 6.670, 7.673, 8.071,
8.165, 8.181, 8.383, 8.557, 8.606, 9.032, 10.482, 14.174
$D_n = 0.069$ → p-value = 0.617
$1\sigma$ interval for $D_n$: $[0, 0.0875]$ → the sample is compatible with the tested law at 68% CL.
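The same test is available in SciPy; a minimal sketch on simulated data standing in for the sample above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
lam = 0.4
x = rng.exponential(scale=1 / lam, size=120)  # stand-in for the sample above

# Kolmogorov-Smirnov test against the exponential cdf F(x) = 1 - exp(-lam*x)
D_n, pvalue = stats.kstest(x, lambda t: 1.0 - np.exp(-lam * t))
print(D_n, pvalue)

# 1-sigma bound on D_n for n = 120: beta = 0.9584 / sqrt(n)
print(0.9584 / np.sqrt(len(x)))  # ~0.0875, as quoted above
```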