Probability and Statistics
Basic concepts
(from a physicist's point of view)
Laura Ferraris-Bouchez UGA / LPSC
What are statistics?
1) Descriptive statistics: non-interpreted data (poll results, weather observations, …)
2) Explicative statistics: estimation, correlation of variables
3) Decisional statistics: statistical tests
4) Forecasting statistics (time is an additional parameter): weather, health, …
Bibliography
• Kendall's Advanced Theory of Statistics, Hodder Arnold:
  Volume 1: Distribution Theory, A. Stuart and K. Ord
  Volume 2a: Classical Inference and the Linear Model, A. Stuart, K. Ord, S. Arnold
  Volume 2b: Bayesian Inference, A. O'Hagan, J. Forster
• The Review of Particle Physics, M. Tanabashi et al. (Particle Data Group), Phys. Rev. D 98, 030001 (2018), pdg.lbl.gov
• Data Analysis: A Bayesian Tutorial, D. Sivia and J. Skilling, Oxford Science Publications
• Statistical Data Analysis, Glen Cowan, Oxford Science Publications
• Analyse statistique des données expérimentales, K. Protassov, EDP Sciences
• Probabilités, analyse des données et statistiques, G. Saporta, Technip
• Analyse de données en sciences expérimentales, B.C., Dunod
Sample and population
SAMPLE
- Finite size
- Selected through a random process
  e.g. measurement results
POPULATION
- Potentially infinite size
  e.g. all possible results
Characterizing the sample, the population, and the sampling process → probability theory (first lecture).
INFERENCE: using the sample (realizations $x_i$ of an observable, produced by the EXPERIMENT) to estimate the characteristics of the population (described by a pdf $f(x;\theta)$, whose parameters $\theta$ carry the PHYSICS) → statistical inference (second lecture).
Random process
A random process ("measurement" or "experiment") is a process whose outcome cannot be predicted with certainty.
It will be described by:
- Universe ($\Omega$): the set of all possible outcomes
- Event: a logical condition on an outcome. It can either be true or false; it splits the universe in 2 subsets.
An event $A$ will be identified with the subset $A \subseteq \Omega$ for which $A$ is true; its complement $\bar{A}$ corresponds to "not $A$".
Probability
A probability function $P$ is defined by (Kolmogorov, 1933):
$$P: \text{events} \to [0, 1], \qquad A \mapsto P(A)$$
satisfying:
$$P(\Omega) = 1$$
$$P(A \cup B) = P(A) + P(B) \quad \text{if } A \cap B = \emptyset$$
Interpretation:
- Frequentist: $\lim_{n\to\infty} \frac{n_A}{n} = P(A)$ defines a probability if we repeat a random process $n$ times ($n$ large) and count the outcomes $n_A$ satisfying the event $A$.
- Bayesian: probability is a measure of the credibility associated with the event.
Simple logic
Event "not $A$" is associated with the complement $\bar{A}$:
$$P(\bar{A}) = 1 - P(A), \qquad P(\emptyset) = 1 - P(\Omega) = 0$$
Event "$A$ and $B$" is associated with the intersection $A \cap B$, with probability $P(A \cap B)$.
Event "$A$ or $B$" is associated with the union $A \cup B$:
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
Conditional probability
If an event $B$ is known to be true, one can define a new probability function on the restrained universe $\Omega' = B$, the conditional probability $P(A|B)$ = "probability of $A$ given $B$":
$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$
By definition of the conditional probability:
$$P(A \cap B) = P(A|B)\,P(B) = P(B|A)\,P(A)$$
which yields Bayes' theorem:
$$P(B|A) = \frac{P(A|B)\,P(B)}{P(A)}$$
Incompatibility / Independence
Two incompatible events cannot be true simultaneously:
$$A \cap B = \emptyset \quad\Rightarrow\quad P(A \cap B) = 0, \qquad P(A \cup B) = P(A) + P(B)$$
Two events are independent if the realization of one is not linked in any way to the realization of the other:
$$P(A|B) = P(A) \ \text{ and } \ P(B|A) = P(B) \quad\Leftrightarrow\quad P(A \cap B) = P(A)\,P(B)$$
Random variable
When the random process outcome is a number $x$, we associate to the process a random variable $X$. Each realization leads to a particular result $X = x$: $x$ is a realization of $X$.
- Discrete variable: probability law $p(x) = P(X = x)$
- Real variable: $P(X = x) = 0$, so one uses the cumulative distribution function $F(x) = P(X < x)$:
$$dF = F(x + dx) - F(x) = P(X < x + dx) - P(X < x) = P(X < x) + P(x < X < x + dx) - P(X < x) = P(x < X < x + dx) = f(x)\,dx$$
$$f(x) = \frac{dF}{dx} \quad\text{: probability density function (pdf)}$$
Density function
[Figure: a pdf $f(x)$ and the corresponding cdf $F(x)$, which rises from 0 to 1]
By construction:
$$F(-\infty) = P(\emptyset) = 0, \qquad F(+\infty) = P(\Omega) = 1$$
$$F(a) = \int_{-\infty}^{a} f(x)\,dx, \qquad \int_{-\infty}^{+\infty} f(x)\,dx = P(\Omega) = 1$$
$$P(a < X < b) = F(b) - F(a) = \int_a^b f(x)\,dx$$
Discrete variables can also be described by a pdf using Dirac distributions:
$$f(x) = \sum_i p(i)\,\delta(i - x), \qquad \sum_i p(i) = 1$$
Moments
For any function $g(x)$, the expectation of $g$ is given by:
$$E[g(x)] = \int g(x)\,f(x)\,dx$$
It is the mean value of $g$. Moments $\mu_k$ are the expectations of $X^k$:
- 0th moment: $\mu_0 = 1$ (pdf normalization)
- 1st moment: $\mu_1 = \mu$ (mean)
$X' = X - \mu_1$ is a central variable.
- 2nd central moment: $\mu'_2 = \sigma^2$ (variance)
Characteristic function:
$$\varphi(t) = E[e^{ixt}] = \int f(x)\,e^{ixt}\,dx = \mathrm{FT}^{-1}[f]$$
From the Taylor expansion,
$$\varphi(t) = \int \sum_k \frac{(itx)^k}{k!}\,f(x)\,dx = \sum_k \frac{(it)^k}{k!}\,\mu_k
\quad\Rightarrow\quad
\mu_k = (-i)^k \left.\frac{d^k\varphi}{dt^k}\right|_{t=0}$$
A pdf is entirely defined by its moments; the CF is a useful tool for demonstrations.
Sample PDF
A sample is obtained from a random drawing within a population, described by a pdf.
We're going to discuss how to characterize, independently from one another:
- a population
- a sample
It is useful to consider a sample as a finite set from which one can randomly draw elements, with equiprobability. We can then associate to this process a pdf, the empirical density (or sample density):
$$f_{\text{sample}}(x) = \frac{1}{n}\sum_i \delta(x - x_i)$$
It will be useful to translate properties of distributions to a finite sample.
Characterizing a distribution
How to reduce a distribution or a sample to a finite number of values?
- Measure of location: reducing the distribution to one central value → the result
- Measure of dispersion: spread of the distribution around the central value → the uncertainty (error)
- Frequency table / histogram (for a finite sample)
Location and dispersion
Mean: sum of all possible values weighted by their probabilities.
Continuous: $\mu = \bar{x} = \int_{-\infty}^{+\infty} x\,f(x)\,dx$ ; discrete: $\mu = \bar{x} = \sum_i x_i\,p(x_i)$
Standard deviation $\sigma$ / variance $v = \sigma^2$: average value of the squared deviation from the mean.
Continuous: $\sigma^2 = \int_{-\infty}^{+\infty} (x - \mu)^2 f(x)\,dx$ ; discrete: $\sigma^2 = \sum_i (x_i - \mu)^2\,p(x_i)$
Koenig's theorem
$$\sigma^2 = \int_{-\infty}^{+\infty} (x - \mu)^2 f(x)\,dx = \int x^2 f(x)\,dx + \mu^2 \int f(x)\,dx - 2\mu \int x f(x)\,dx = \overline{x^2} + \mu^2 - 2\mu\bar{x} = \overline{x^2} - \bar{x}^2$$
i.e. the variance is the mean of the square minus the square of the mean:
$$\sigma^2 = \mu_{x^2} - \mu^2$$
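A quick numerical check of Koenig's theorem; a minimal sketch (not part of the original lecture), assuming NumPy is available:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)  # any distribution works

# Direct definition: mean squared deviation from the mean
var_direct = np.mean((x - x.mean()) ** 2)
# Koenig's theorem: E[x^2] - E[x]^2
var_koenig = np.mean(x ** 2) - x.mean() ** 2

print(var_direct, var_koenig)  # the two agree to machine precision
```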
Discrete distributions
Bernoulli distribution: random variable $K$ associated to a random process with 2 possible outcomes, "success" ($K = 1$) or "failure" ($K = 0$), with success probability $p$.
Variable: $K$
Parameter: $p$
Law: $p(k) = p^k (1-p)^{1-k}$
Mean: $p$
Variance: $p(1-p)$
[Figure: Bernoulli law for $p = 0.35$]

Binomial distribution: repetition of $n$ Bernoulli processes. Randomly choosing $K$ objects within a finite set of $n$, with a fixed drawing probability $p$.
Variable: $K$
Parameters: $n$, $p$
Law: $P(k; n, p) = \frac{n!}{k!\,(n-k)!}\,p^k (1-p)^{n-k}$
Mean: $np$
Variance: $np(1-p)$
[Figure: binomial law for $n = 10$, $p = 0.65$]

Poisson distribution: binomial limit when $n \to +\infty$, $p \to 0$, $np = \lambda$. Counting $k$ events with a probability per time/space unit $\lambda$ ($\lambda$ is fixed and independent of the time/space).
Variable: $k$
Parameter: $\lambda$
Law: $P(k; \lambda) = \frac{e^{-\lambda}\lambda^k}{k!}$
Mean: $\lambda$
Variance: $\lambda$
[Figure: Poisson law for $\lambda = 6.5$]
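The Poisson limit of the binomial can be checked numerically; a small sketch (not from the lecture), assuming SciPy is available:

```python
import numpy as np
from scipy import stats

n, p = 1000, 0.006           # large n, small p, np = lambda = 6
lam = n * p

k = np.arange(0, 20)
binom_pmf = stats.binom.pmf(k, n, p)
poisson_pmf = stats.poisson.pmf(k, lam)

# The binomial pmf approaches the Poisson pmf in this limit
print(np.max(np.abs(binom_pmf - poisson_pmf)))  # small (~1e-5)

# Means and variances match the formulas on this slide
print(stats.binom.stats(n, p))   # (np, np(1-p))
print(stats.poisson.stats(lam))  # (lambda, lambda)
```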
Real distributions
Uniform distribution: equiprobability over a finite range $[a, b]$.
Parameters: $a$, $b$
Law: $f(x; a, b) = \frac{1}{b-a}$ if $a < x < b$, 0 otherwise
Mean: $\mu = \frac{a+b}{2}$
Variance: $\sigma^2 = \frac{(b-a)^2}{12}$
Normal distribution (Gaussian): limit of many processes; centered and symmetric distribution.
Parameters: $\mu$, $\sigma$
Law: $f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
Mean: $\mu$
Variance: $\sigma^2$
Chi-square distribution: sum of the squares of $n$ reduced normal variables.
Variable: $C = \sum_{k=1}^{n} \frac{(X_k - \mu_k)^2}{\sigma_k^2}$
Parameter: $n$, the number of degrees of freedom (dof)
Law: $f(c; n) = \frac{c^{n/2 - 1}\,e^{-c/2}}{\Gamma(n/2)\,2^{n/2}}$
Mean: $n$
Variance: $2n$
Convergence
- Binomial law: $P(k; n, p) = C_n^k\,p^k(1-p)^{n-k}$, with $\mu = np$ and $\sigma = \sqrt{np(1-p)}$
- Poisson law: $P(k; \lambda) = \frac{e^{-\lambda}\lambda^k}{k!}$, with $\mu = \lambda$ and $\sigma = \sqrt{\lambda}$
- Normal law: $f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
- Chi-square law: $f(c; n) = \frac{c^{n/2-1}}{2^{n/2}\Gamma(n/2)}\,e^{-c/2}$, with $\mu = n$ and $\sigma = \sqrt{2n}$
Limits:
- Binomial → Poisson for $p$ small, $k \ll n$, $np = \lambda$
- Binomial → Normal for $n > 30$
- Poisson → Normal for $\lambda > 25$
- Chi-square → Normal for $n > 50$
Multidimensional PDF
Random variables → random vectors: $\vec{X} = (X_1, X_2, \ldots, X_n)$.
The probability density function becomes:
$$f(\vec{x})\,d\vec{x} = f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n = P(x_1 < X_1 < x_1 + dx_1 \cap \cdots \cap x_n < X_n < x_n + dx_n)$$
$$P(a < X < b \cap c < Y < d) = \int_a^b dx \int_c^d dy\, f(x, y)$$
Marginal density: probability of only one of the components:
$$f_X(x)\,dx = P(x < X < x + dx \cap -\infty < Y < +\infty) = \left(\int f(x, y)\,dy\right) dx
\quad\Rightarrow\quad f_X(x) = \int f(x, y)\,dy$$
Multidimensional PDF
If $Y = y_0$, one can define the conditional density for $X$:
$$f(x|y_0)\,dx = P(x < X < x + dx \,|\, Y = y_0)$$
It is proportional to $f(x, y_0)$ and normalized, $\int f(x|y)\,dx = 1$, so:
$$f(x|y) = \frac{f(x, y)}{\int f(x, y)\,dx} = \frac{f(x, y)}{f_Y(y)}$$
$X$ and $Y$ are independent if all events of the form $x < X < x + dx$ are independent from $y < Y < y + dy$:
$$f(x|y) = f_X(x) \ \text{ and } \ f(y|x) = f_Y(y) \quad\Leftrightarrow\quad f(x, y) = f_X(x)\,f_Y(y)$$
In terms of pdfs, Bayes' theorem becomes:
$$f(y|x) = \frac{f(x|y)\,f_Y(y)}{f_X(x)} = \frac{f(x|y)\,f_Y(y)}{\int f(x|y)\,f_Y(y)\,dy}$$
Covariance and correlation
$X, Y$ can be treated as 2 separate variables via the marginal densities, with a mean and a variance for each variable: $\mu_x, \mu_y, \sigma_x, \sigma_y$. BUT this does not take correlations between the variables into account!
Generalized measure of dispersion → covariance of $X$ and $Y$:
Real: $\mathrm{Cov}(X, Y) = \int (x - \mu_x)(y - \mu_y)\,f(x, y)\,dx\,dy = \rho\sigma_x\sigma_y = \mu_{xy} - \mu_x\mu_y$
Discrete: $\mathrm{Cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu_x)(y_i - \mu_y)$
Correlation coefficient: $\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_x\sigma_y}$; uncorrelated means $\rho = 0$.
Independent ⇒ uncorrelated (the converse is not true).
[Figure: scatter plots for $\rho = -0.5$, $\rho = 0$, $\rho = 0.9$, and a nonlinearly dependent case with $\rho = 0$]
Decorrelation
Covariance matrix for $n$ variables $X_i$:
$$\Sigma_{ij} = \mathrm{Cov}(X_i, X_j), \qquad
\Sigma = \begin{pmatrix} \sigma_1^2 & \cdots & \rho_{1n}\sigma_1\sigma_n \\ \vdots & \ddots & \vdots \\ \rho_{1n}\sigma_1\sigma_n & \cdots & \sigma_n^2 \end{pmatrix}$$
- Uncorrelated variables ⇒ $\Sigma$ diagonal
- The matrix is real and symmetric ⇒ it can be diagonalized: $n$ new uncorrelated variables $Y_i$
$$\Sigma' = \begin{pmatrix} \sigma_1'^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma_n'^2 \end{pmatrix} = B^{-1}\Sigma B, \qquad \vec{Y} = B\vec{X}$$
- The $\sigma_i'^2$ are the eigenvalues of $\Sigma$
- $B$ contains the orthonormal eigenvectors
- The $Y_i$ are the principal components
- Sorted from the largest to the smallest $\sigma'$, they allow dimensional reduction
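A minimal numerical sketch of this diagonalization (not from the lecture), assuming NumPy; the orthonormal eigenvector matrix returned by numpy.linalg.eigh plays the role of $B$, up to the transposition convention:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two correlated variables
x1 = rng.normal(size=10_000)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=10_000)
X = np.vstack([x1, x2])

sigma = np.cov(X)                   # covariance matrix Sigma
eigvals, B = np.linalg.eigh(sigma)  # sigma'^2 eigenvalues, orthonormal eigenvectors

Y = B.T @ X                         # decorrelated variables (principal components)
print(np.cov(Y).round(6))           # off-diagonal terms vanish
```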
Regression
Measure of location: point $(\mu_X, \mu_Y)$ / curve closest to the points.
Fully correlated ($\rho = 1$) or fully anti-correlated ($\rho = -1$) ⇒ $Y = aX + b$ exactly.
Linear regression: minimizing the dispersion between the line $y = ax + b$ and the distribution:
$$w(a, b) = \iint (y - ax - b)^2 f(x, y)\,dx\,dy = \frac{1}{n}\sum_i (y_i - ax_i - b)^2$$
$$\frac{\partial w}{\partial a} = 0 = \iint x\,(y - ax - b)\,f(x, y)\,dx\,dy, \qquad
\frac{\partial w}{\partial b} = 0 = \iint (y - ax - b)\,f(x, y)\,dx\,dy$$
$$\Leftrightarrow\quad
\begin{cases} a(\sigma_x^2 + \mu_x^2) + b\mu_x = \rho\sigma_x\sigma_y + \mu_x\mu_y \\ a\mu_x + b = \mu_y \end{cases}
\quad\Leftrightarrow\quad
\begin{cases} a = \rho\,\dfrac{\sigma_y}{\sigma_x} \\[2mm] b = \mu_y - \rho\,\dfrac{\sigma_y}{\sigma_x}\,\mu_x \end{cases}$$
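A small sketch checking these moment formulas for $a$ and $b$ against a direct fit (not from the lecture), assuming NumPy; np.polyfit is used only as a cross-check:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=5000)
y = 1.5 * x + 2.0 + rng.normal(scale=0.5, size=5000)

# Slope and intercept from the moments, as derived above
rho = np.corrcoef(x, y)[0, 1]
a = rho * y.std() / x.std()
b = y.mean() - a * x.mean()

# Cross-check against a direct least-squares fit
print((a, b), np.polyfit(x, y, 1))  # the two agree
```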
Multidimensional Pdfs
Multinomial distribution: randomly choosing $K_1, K_2, \ldots, K_s$ objects within a finite set of $n$, with a fixed drawing probability for each category, $p_1, p_2, \ldots, p_s$, with $\sum K_i = n$ and $\sum p_i = 1$.
Ex: in a 3-way election, A got 20% of the votes, B 30% and C 50%. 6 voters are randomly selected. What is the probability that there are 1 A supporter, 2 B supporters and 3 C supporters in the sample?
(Answer: $\frac{6!}{1!\,2!\,3!}\,0.2^1\,0.3^2\,0.5^3 = 60 \times 0.00225 = 0.135$.)
Parameters: $n, p_1, p_2, \ldots, p_s$
Law: $P(\vec{k}; n, \vec{p}) = \frac{n!}{k_1!\,k_2!\cdots k_s!}\,p_1^{k_1} p_2^{k_2} \cdots p_s^{k_s}$
Mean: $\mu_i = np_i$
Variance: $\sigma_i^2 = np_i(1 - p_i)$, and $\mathrm{Cov}(K_i, K_j) = -np_ip_j$
Rem: the variables are not independent. The binomial corresponds to $s = 2$, but has only one independent variable.
Multinormal distribution
Generalization of the normal law to a random variable vector.
Parameters: $\vec{\mu}$, $\Sigma$
Law: $f(\vec{x}; \vec{\mu}, \Sigma) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}}\,e^{-\frac{1}{2}(\vec{x}-\vec{\mu})^T \Sigma^{-1} (\vec{x}-\vec{\mu})}$
If uncorrelated, it factorizes:
$$f(\vec{x}; \vec{\mu}, \Sigma) = \prod_i \frac{1}{\sigma_i\sqrt{2\pi}}\,e^{-\frac{(x_i-\mu_i)^2}{2\sigma_i^2}}$$
For a multinormal law, uncorrelated ⇔ independent.
Sum of random variables
A sum of random variables is a new random variable: $S = \sum_{i=1}^{n} X_i$.
Assuming the mean and variance of each variable exist:
Mean of $S$:
$$\mu_S = \int \sum_{i=1}^{n} x_i\, f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n = \sum_{i=1}^{n} \int x_i\, f_{X_i}(x_i)\,dx_i = \sum_{i=1}^{n} \mu_i$$
The mean is an additive quantity.
Variance of $S$:
$$\sigma_S^2 = \int \Big(\sum_{i=1}^{n} (x_i - \mu_{X_i})\Big)^2 f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n = \sum_{i=1}^{n} \sigma_{X_i}^2 + 2\sum_i \sum_{j<i} \mathrm{Cov}(X_i, X_j)$$
For uncorrelated variables, the variance is additive: $\sigma_S^2 = \sum_{i=1}^{n} \sigma_{X_i}^2$ → used for error combinations.
Sum of random variables
The pdf $f_S(s)$ can be found using the characteristic function:
$$\varphi_S(t) = \int f_S(s)\,e^{its}\,ds = \int f_{\vec{X}}(\vec{x})\,e^{it\sum x_i}\,d\vec{x}$$
For independent variables, the characteristic function factorizes:
$$\varphi_S(t) = \prod_k \int f_{X_k}(x_k)\,e^{itx_k}\,dx_k = \prod_k \varphi_{X_k}(t)$$
Finally, the pdf is the Fourier transform of the CF, so the pdf of the sum is a convolution:
$$f_S = f_{X_1} * f_{X_2} * \cdots * f_{X_n}$$
- Sum of normal variables ($\mu_1, \sigma_1$ and $\mu_2, \sigma_2$) → normal, with $\mu = \mu_1 + \mu_2$ and $\sigma^2 = \sigma_1^2 + \sigma_2^2$
- Sum of Poisson variables ($\lambda_1$ and $\lambda_2$) → Poisson, with $\lambda = \lambda_1 + \lambda_2$
- Sum of chi-square variables ($n_1$ and $n_2$) → chi-square, with $n = n_1 + n_2$
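The stability of the Poisson law under addition can be verified by sampling; a minimal sketch (not from the lecture), assuming NumPy and SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
s = rng.poisson(2.0, 100_000) + rng.poisson(3.5, 100_000)  # sum of two Poissons

# Compare the empirical pmf of the sum with Poisson(lambda1 + lambda2)
k = np.arange(20)
empirical = np.bincount(s, minlength=20)[:20] / len(s)
print(np.max(np.abs(empirical - stats.poisson.pmf(k, 5.5))))  # ~1e-3, statistical noise
```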
Sum of independent variables
Weak law of large numbers: a sample of size $n$ is the realization of $n$ independent variables with the same distribution (mean $\mu$, variance $\sigma^2$). The sample mean is a realization of
$$M = \frac{S}{n} = \frac{1}{n}\sum X_i$$
Mean of $M$: $\mu_M = \mu$. Variance of $M$: $\sigma_M^2 = \frac{\sigma^2}{n}$.
Central-limit theorem: for $n$ independent random variables of mean $\mu_i$ and variance $\sigma_i^2$, the sum of the reduced variables
$$C = \frac{1}{\sqrt{n}}\sum \frac{X_i - \mu_i}{\sigma_i}$$
has a pdf that converges to a reduced normal distribution:
$$f_C(c) \xrightarrow{n\to+\infty} \frac{1}{\sqrt{2\pi}}\,e^{-c^2/2}$$
The sum of many random fluctuations is normally distributed.
Central limit theorem
[Figures: sums of several reduced uniform distributions, and of several reduced exponential laws, compared to a normal pdf]
The central limit theorem ensures the convergence, but not its speed.
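A minimal illustration of the theorem (not from the lecture), assuming NumPy; 12 reduced uniforms are already very close to a standard normal:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 12  # number of summed variables

# Reduced uniforms: mean 0, variance 1
u = (rng.uniform(size=(100_000, n)) - 0.5) * np.sqrt(12)
c = u.sum(axis=1) / np.sqrt(n)

print(c.mean(), c.std())  # ~0 and ~1
# A histogram of c is already very close to a standard normal;
# the same sum of reduced exponentials converges more slowly (skewed pdf).
```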
Dispersion and uncertainty
Any measurement (or combination of measurements) is a realization of a random variable.
- Measured value: $\theta$
- True value: $\theta_0$
Uncertainty: quantifying the difference between $\theta$ and $\theta_0$ → a measure of dispersion.
We will postulate $\Delta\theta = \alpha\sigma_\theta$: the absolute error, always positive.
Usually one differentiates:
- Statistical error: due to the measurement pdf.
- Systematic errors or bias: fixed but unknown deviation (equipment, assumptions, …)
Systematic errors can be seen as statistical errors in a set of similar experiments.
Error sources
Observation error $\Delta_O$, position error $\Delta_P$, scaling error $\Delta_S$:
$$\theta = \theta_0 + \delta_O + \delta_P + \delta_S$$
Each $\delta_i$ is a realization of a random variable, with $\mu_i = 0$ and variance $\sigma_i^2$; $\Delta_O = \alpha\sigma_O$, $\Delta_P = \alpha\sigma_P$, $\Delta_S = \alpha\sigma_S$.
For uncorrelated error sources:
$$\Delta_{\mathrm{tot}}^2 = \alpha^2\sigma_{\mathrm{tot}}^2 = \alpha^2\left(\sigma_O^2 + \sigma_P^2 + \sigma_S^2\right) = \Delta_O^2 + \Delta_P^2 + \Delta_S^2$$
Choice of $\alpha$? If there are many sources, the central-limit theorem gives a normal distribution:
- $\alpha = 1$ gives (approximately) a 68% confidence interval
- $\alpha = 2$ gives 95% CL (and at least 75% from Bienaymé-Chebyshev)
Error propagation
Measure: $x \pm \Delta x$. Compute: $f(x)$. What is $\Delta f$?
[Figure: propagation of the interval $\Delta x$ through $f$ via the local slope $f'$]
Assuming small errors, using a Taylor expansion:
$$f(x + \Delta x) = f(x) + \frac{df}{dx}\Delta x + \frac{1}{2}\frac{d^2f}{dx^2}\Delta x^2 + O(\Delta x^3)$$
$$f(x - \Delta x) = f(x) - \frac{df}{dx}\Delta x + \frac{1}{2}\frac{d^2f}{dx^2}\Delta x^2 + O(\Delta x^3)$$
$$\Delta f = \frac{1}{2}\left|f(x + \Delta x) - f(x - \Delta x)\right| = \left|\frac{df}{dx}\right|\Delta x$$
Error propagation
Measure: $x \pm \Delta x$, $y \pm \Delta y$, … Compute: $f(x, y, \ldots)$. What is $\Delta f$?
Idea: treat the effect of each variable as a separate error source:
error due to $x$: $\Delta f_x = \left|\frac{\partial f}{\partial x}\right|\Delta x$; error due to $y$: $\Delta f_y = \left|\frac{\partial f}{\partial y}\right|\Delta y$.
Then:
$$\Delta f^2 = \Delta f_x^2 + \Delta f_y^2 + 2\rho_{xy}\Delta f_x\Delta f_y
= \left(\frac{\partial f}{\partial x}\Delta x\right)^2 + \left(\frac{\partial f}{\partial y}\Delta y\right)^2 + 2\rho_{xy}\frac{\partial f}{\partial x}\frac{\partial f}{\partial y}\Delta x\,\Delta y$$
- Uncorrelated: $\Delta f^2 = \left(\frac{\partial f}{\partial x}\Delta x\right)^2 + \left(\frac{\partial f}{\partial y}\Delta y\right)^2$
- Correlated ($\rho = 1$): $\Delta f = \left|\frac{\partial f}{\partial x}\Delta x + \frac{\partial f}{\partial y}\Delta y\right|$
- Anticorrelated ($\rho = -1$): $\Delta f = \left|\frac{\partial f}{\partial x}\Delta x - \frac{\partial f}{\partial y}\Delta y\right|$
In the general case, for $k$ variables:
$$\Delta f^2 = \sum_{i=1}^{k}\left(\frac{\partial f}{\partial x_i}\Delta x_i\right)^2 + \sum_{i=1}^{k}\sum_{j\neq i}\rho_{x_ix_j}\frac{\partial f}{\partial x_i}\frac{\partial f}{\partial x_j}\Delta x_i\,\Delta x_j$$
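A small, self-contained sketch of this propagation formula with numerical derivatives (not from the lecture), assuming NumPy; the function propagate and its arguments are illustrative:

```python
import numpy as np

def propagate(f, x, dx, rho=None):
    """Uncertainty on f(*x) from uncertainties dx via linear propagation.

    f: function of k scalar arguments; x, dx: length-k sequences;
    rho: optional k x k correlation matrix (identity if None).
    """
    x, dx = np.asarray(x, float), np.asarray(dx, float)
    k = len(x)
    # Numerical partial derivatives df/dx_i (central differences)
    grad = np.empty(k)
    for i in range(k):
        h = 1e-6 * max(abs(x[i]), 1.0)
        xp, xm = x.copy(), x.copy()
        xp[i] += h
        xm[i] -= h
        grad[i] = (f(*xp) - f(*xm)) / (2 * h)
    rho = np.eye(k) if rho is None else np.asarray(rho, float)
    # Delta f^2 = sum_ij rho_ij (df/dx_i dx_i)(df/dx_j dx_j)
    terms = grad * dx
    return np.sqrt(terms @ rho @ terms)

# Example: f = x*y with uncorrelated errors -> relative errors add in quadrature
print(propagate(lambda x, y: x * y, [2.0, 3.0], [0.02, 0.03]))  # ~0.085
```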
Parametric estimation
From a finite sample $\{x_i\}$, estimate a parameter $\theta$.
Statistic: a function of the sample, $S = f(\{x_i\})$. Any statistic can be considered as an estimator of $\theta$. To be a good estimator it needs to satisfy:
• Consistency: the estimator converges to the true value in the limit of an infinite sample.
• Bias: difference between the expectation of the estimator and the true value.
• Efficiency: speed of convergence.
• Robustness: (in)sensitivity to statistical fluctuations.
A good estimator should at least be consistent and asymptotically unbiased.
Efficient / unbiased / robust often contradict each other → different choices for different applications.
Bias and consistency
A sample is a set of realizations of random variables (or vectors); so is the estimator: $\hat\theta$ is a realization of $\hat\Theta$. It has a mean, a variance, … and a probability density function.
Bias: average difference between the estimator and the true value: $b(\theta) = E[\hat\Theta] - \theta_0 = \mu_{\hat\Theta} - \theta_0$
- Unbiased: $b(\theta) = 0$
- Asymptotically unbiased: $b(\theta) \xrightarrow{n\to+\infty} 0$
Consistency, formally: $P(|\hat\theta - \theta| > \varepsilon) \xrightarrow{n\to+\infty} 0, \ \forall\varepsilon$;
in practice, for an asymptotically unbiased estimator: $\sigma_{\hat\Theta} \xrightarrow{n\to\infty} 0$.
[Figure: pdfs of a biased, an asymptotically unbiased, and an unbiased estimator]
Efficiency
The efficiency of a convergent estimator is given by its variance.
For any unbiased estimator of $\theta$, the variance cannot go below the Cramér-Rao bound:
$$\sigma_{\hat\Theta}^2 \geq \frac{1}{E\left[\left(\frac{\partial\ln\mathcal{L}}{\partial\theta}\right)^2\right]} = \frac{-1}{E\left[\frac{\partial^2\ln\mathcal{L}}{\partial\theta^2}\right]}$$
An efficient estimator reaches the Cramér-Rao bound (at least asymptotically) → minimal variance estimator.
Such an estimator will often be biased, but asymptotically unbiased.
Empirical mean estimator
The sample mean $\bar{x}$ is a good estimator of the population mean $\mu$:
$$\hat\mu = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Using the weak law of large numbers:
$$\mu_{\hat\mu} = E[\hat\mu] = E\left[\frac{1}{n}\sum_{i=1}^{n} x_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[x_i] = \frac{1}{n}\sum_{i=1}^{n}\mu = \mu
\quad\Rightarrow\quad b(\hat\mu) = \mu_{\hat\mu} - \mu = 0 \ \text{: unbiased}$$
$$\sigma_{\hat\mu}^2 = \mathrm{Var}[\hat\mu] = \mathrm{Var}\left[\frac{1}{n}\sum_{i=1}^{n} x_i\right] = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}[x_i] = \frac{\sigma^2}{n}
\xrightarrow{n\to+\infty} 0 \ \text{: convergent}$$
Empirical variance estimator
Sample variance as an estimator of the population variance:
$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat\mu)^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - E[x_i])^2 - (\hat\mu - E[\hat\mu])^2$$
$$E[\hat\sigma^2] = \mathrm{Var}[x_i] - \mathrm{Var}[\hat\mu] = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\sigma^2$$
→ biased, but asymptotically unbiased.
Unbiased variance estimator:
$$s^2 = \frac{n}{n-1}\hat\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \hat\mu)^2$$
Variance of the estimator (convergence):
$$\sigma_{s^2}^2 = \frac{\sigma^4}{n-1}\left(\frac{n-1}{n}\gamma_2 + 2\right)
\xrightarrow{\text{normal sample}} \frac{2\sigma^4}{n-1} \xrightarrow{n\to+\infty} 0$$
($\gamma_2$ denotes the excess kurtosis.)
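The bias factor $(n-1)/n$ is easy to see numerically; a minimal sketch (not from the lecture), assuming NumPy, where ddof selects the divisor:

```python
import numpy as np

rng = np.random.default_rng(5)
estimates_biased, estimates_unbiased = [], []
for _ in range(20_000):
    x = rng.normal(loc=0.0, scale=1.0, size=5)     # small sample, sigma^2 = 1
    estimates_biased.append(np.var(x))             # divides by n
    estimates_unbiased.append(np.var(x, ddof=1))   # divides by n - 1

print(np.mean(estimates_biased))    # ~ (n-1)/n = 0.8
print(np.mean(estimates_unbiased))  # ~ 1.0
```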
Errors on these estimators
Uncertainty → standard deviation estimator.
Mean: $\hat\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$, with $\sigma_{\hat\mu}^2 = \frac{\sigma^2}{n}$ → $\Delta\hat\mu = \sqrt{\frac{s^2}{n}}$
Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \hat\mu)^2$, with $\sigma_{s^2}^2 \approx \frac{2\sigma^4}{n}$ → $\Delta s^2 = s^2\sqrt{\frac{2}{n}}$
Central-limit theorem: for large enough samples, the mean and variance empirical estimators are normally distributed, so $\hat\mu \pm \Delta\hat\mu$ and $s^2 \pm \Delta s^2$ define 68% confidence intervals.
To quote an error on $\sigma$ itself, one uses the estimator $\hat{s} = \sqrt{s^2}$: beware, it is biased!
Likelihood function
Consider a generic function $k(x, \theta)$, where $x$ is the random variable(s) and $\theta$ the parameter(s).
- Fix $\theta = \theta_0$ (true value) → probability density function: $f(x; \theta) = k(x, \theta_0)$, with $\int f(x;\theta)\,dx = 1$. For Bayesians, $f(x|\theta) = f(x;\theta)$.
- Fix $x = u$ (one realization of the random variable) → likelihood function: $\mathcal{L}(\theta) = k(u, \theta)$, with $\int \mathcal{L}(\theta)\,d\theta = ?$ (not normalized in general). For Bayesians, $f(\theta|x) = \frac{\mathcal{L}(\theta)}{\int \mathcal{L}(\theta)\,d\theta}$.
For a sample of $n$ independent realizations of the same variable $X$:
$$\mathcal{L}(\theta) = \prod_i k(x_i, \theta) = \prod_i f(x_i; \theta)$$
Maximum likelihood (1)
For a sample of measurements $\{x_i\}$: the analytical form of the density is known, but it depends on several unknown parameters $\theta$.
E.g. event counting follows a Poisson distribution, with a physics-dependent parameter $\lambda(\theta)$:
$$\mathcal{L}(\theta) = \prod_i \frac{e^{-\lambda(\theta)}\,\lambda(\theta)^{x_i}}{x_i!}$$
The maximum-likelihood estimators $\hat\theta$ of the parameters $\theta$ are the ones that maximize the probability of observing the observed result, i.e. the maximum of the likelihood function:
$$\left.\frac{\partial\mathcal{L}}{\partial\theta}\right|_{\theta=\hat\theta} = 0$$
(a system of equations for several parameters). One often minimizes $-\ln\mathcal{L}$ instead → simpler expressions.
Maximum likelihood (2)
Minimize $-\ln\mathcal{L}$:
$$\mathcal{L}(\theta) = \prod_{i=1}^{n} \frac{e^{-\lambda(\theta)}\,\lambda(\theta)^{x_i}}{x_i!}
\quad\Rightarrow\quad
-\ln\mathcal{L}(\theta) = \sum_{i=1}^{n}\left[\lambda(\theta) - x_i\ln\lambda(\theta) + \ln x_i!\right]$$
$$-\frac{\partial\ln\mathcal{L}}{\partial\theta} = \sum_{i=1}^{n}\left[\lambda'(\theta) - x_i\frac{\lambda'(\theta)}{\lambda(\theta)}\right] = n\lambda'(\theta) - \frac{\lambda'(\theta)}{\lambda(\theta)}\sum_{i=1}^{n} x_i$$
$$-\frac{\partial\ln\mathcal{L}}{\partial\theta} = 0
\;\Leftrightarrow\; n = \frac{1}{\lambda(\hat\theta)}\sum_{i=1}^{n} x_i
\;\Leftrightarrow\; \lambda(\hat\theta) = \frac{1}{n}\sum_{i=1}^{n} x_i = \hat\mu$$
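A numerical version of this example, as a sketch (not from the lecture) assuming SciPy; here $\lambda$ itself is taken as the parameter, so the MLE should land on the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

rng = np.random.default_rng(6)
x = rng.poisson(lam=4.2, size=500)  # counting data, true lambda = 4.2

def nll(lam):
    # -ln L for a Poisson sample (gammaln(x + 1) = ln x!)
    return np.sum(lam - x * np.log(lam) + gammaln(x + 1))

res = minimize_scalar(nll, bounds=(0.1, 20.0), method="bounded")
print(res.x, x.mean())  # the MLE coincides with the sample mean
```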
Properties of MLE
Mostly asymptotic properties, valid for large samples; often assumed in any case for lack of better information.
- Asymptotically unbiased
- Asymptotically efficient (reaches the Cramér-Rao bound)
- Asymptotically normally distributed, following a multinormal law
$$f(\hat\theta; \theta, \Sigma) = \frac{1}{\sqrt{(2\pi)^{n_\theta}|\Sigma|}}\,e^{-\frac{1}{2}(\hat\theta - \theta)^T\Sigma^{-1}(\hat\theta - \theta)}$$
with covariance given by the generalization of the Cramér-Rao bound:
$$\Sigma_{ij}^{-1} = -E\left[\frac{\partial^2\ln\mathcal{L}}{\partial\theta_i\,\partial\theta_j}\right] = E\left[\frac{\partial\ln\mathcal{L}}{\partial\theta_i}\frac{\partial\ln\mathcal{L}}{\partial\theta_j}\right]$$
Errors on MLE
Errors on the parameters come from the covariance matrix, $\Sigma_{ij}^{-1} = -E\left[\frac{\partial^2\ln\mathcal{L}}{\partial\theta_i\,\partial\theta_j}\right]$.
For 1 parameter (with only one realization of the estimator, the expectation is replaced by the empirical mean of 1 value):
$$\Delta\theta = \hat\sigma_{\hat\theta} = \sqrt{\frac{-1}{\partial^2\ln\mathcal{L}/\partial\theta^2}}$$
This defines a 68% interval. More generally, expanding around the maximum:
$$\Delta\ln\mathcal{L} = \ln\mathcal{L}(\hat\theta) - \ln\mathcal{L}(\theta) = \frac{1}{2}\sum_i\sum_j \Sigma_{ij}^{-1}(\hat\theta_i - \theta_i)(\hat\theta_j - \theta_j) + \mathcal{O}(\theta^3)$$
Confidence contours are defined by the equation $\Delta\ln\mathcal{L} = \beta(n_\theta, \alpha)$, with
$$\alpha = \int_0^{2\beta} f_{\chi^2}(x; n_\theta)\,dx$$
Values of $\beta$ for different numbers of parameters $n_\theta$ and confidence levels $\alpha$ (for $n_\theta = 1$, $\beta = n_\sigma^2/2$):

α (%)   n_θ = 1   n_θ = 2   n_θ = 3
68.3    0.5       1.15      1.76
95.4    2         3.09      4.01
99.7    4.5       5.92      7.08
Goodness of fit
$-2\ln\mathcal{L}$ is the realization of a random variable following (asymptotically) a chi-square law with $N_{DF}$ degrees of freedom, where $n$ = number of fitted points (sample size), $n_\theta$ = number of fitted parameters, and $N_{DF} = n - n_\theta$.
(Maximum likelihood here assumes each $y_i$ is normally distributed with mean $f(x_i, \theta)$ and standard deviation $\Delta y_i$.)
3 criteria:
1) Chi-square law mean $\mu = N_{DF}$: expect $-2\ln\mathcal{L} \sim N_{DF}$, i.e. $\frac{-2\ln\mathcal{L}}{N_{DF}} \sim 1$.
2) Chi-square law variance $\sigma^2 = 2N_{DF}$: if $N_{DF} > 30$, the chi-square approaches a normal law, and $-2\ln\mathcal{L}$ has 95% probability to be in $\left[N_{DF} \pm 2\sqrt{2N_{DF}}\right]$.
3) Probability of getting a worse agreement, the p-value:
$$p\text{-value} = \int_{-2\ln\mathcal{L}}^{+\infty} f_{\chi^2}(x; N_{DF})\,dx$$
Least squares
Set of measurements $(x_i, y_i)$ with vertical uncertainties $\Delta y_i$; theoretical law $y = f(x, \theta)$.
- Naive approach, regression: $w(\theta) = \sum_i \left(y_i - f(x_i, \theta)\right)^2$, solve $\frac{\partial w}{\partial\theta_i} = 0$
- Weight by the error, least squares: $K^2(\theta) = \sum_i \left(\frac{y_i - f(x_i, \theta)}{\Delta y_i}\right)^2$, solve $\frac{\partial K^2}{\partial\theta_i} = 0$
The least-squares (or chi-square) fit is the MLE for Gaussian errors:
$$\mathcal{L}(\theta) = \prod_i \frac{1}{\sqrt{2\pi}\,\Delta y_i}\,e^{-\frac{1}{2}\left(\frac{y_i - f(x_i, \theta)}{\Delta y_i}\right)^2}
\quad\Rightarrow\quad
\frac{\partial\mathcal{L}}{\partial\theta} = 0 \;\Leftrightarrow\; -2\frac{\partial\ln\mathcal{L}}{\partial\theta} = \frac{\partial K^2}{\partial\theta} = 0$$
Generic case, with correlations:
$$K^2(\theta) = \left(\vec{y} - \vec{f}(\vec{x}, \theta)\right)^T \Sigma^{-1} \left(\vec{y} - \vec{f}(\vec{x}, \theta)\right)$$
Example: fitting a line
For $f(x) = ax$, define
$$A = \sum_i \frac{x_i^2}{\Delta y_i^2}, \qquad B = \sum_i \frac{x_i y_i}{\Delta y_i^2}, \qquad C = \sum_i \frac{y_i^2}{\Delta y_i^2}$$
$$K^2(a) = Aa^2 - 2Ba + C = -2\ln\mathcal{L}$$
$$\frac{\partial K^2}{\partial a} = 2Aa - 2B = 0 \quad\Rightarrow\quad \hat{a} = \frac{B}{A}$$
$$\frac{\partial^2 K^2}{\partial a^2} = 2A = \frac{2}{\sigma_a^2} \quad\Rightarrow\quad \Delta\hat{a} = \frac{1}{\sqrt{A}}$$
Example: fitting a line
For $f(x) = ax + b$, define
$$A = \sum_i \frac{x_i^2}{\Delta y_i^2},\quad B = \sum_i \frac{1}{\Delta y_i^2},\quad C = \sum_i \frac{x_i}{\Delta y_i^2},\quad
D = \sum_i \frac{x_i y_i}{\Delta y_i^2},\quad E = \sum_i \frac{y_i}{\Delta y_i^2},\quad F = \sum_i \frac{y_i^2}{\Delta y_i^2}$$
$$K^2(a, b) = Aa^2 + Bb^2 + 2Cab - 2Da - 2Eb + F = -2\ln\mathcal{L}$$
$$\frac{\partial K^2}{\partial a} = 2Aa + 2Cb - 2D = 0, \qquad
\frac{\partial K^2}{\partial b} = 2Ca + 2Bb - 2E = 0$$
$$\Rightarrow\quad \hat{a} = \frac{BD - EC}{AB - C^2}, \qquad \hat{b} = \frac{AE - DC}{AB - C^2}$$
$$\frac{\partial^2 K^2}{\partial a^2} = 2A = 2\Sigma^{-1}_{11}, \qquad
\frac{\partial^2 K^2}{\partial b^2} = 2B = 2\Sigma^{-1}_{22}, \qquad
\frac{\partial^2 K^2}{\partial a\,\partial b} = 2C = 2\Sigma^{-1}_{12}$$
$$\Sigma^{-1} = \begin{pmatrix} A & C \\ C & B \end{pmatrix}, \qquad
\Sigma = \frac{1}{AB - C^2}\begin{pmatrix} B & -C \\ -C & A \end{pmatrix}
\quad\Rightarrow\quad
\Delta\hat{a} = \sqrt{\frac{B}{AB - C^2}}, \qquad \Delta\hat{b} = \sqrt{\frac{A}{AB - C^2}}$$
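These closed-form expressions translate directly into code; a minimal sketch (not from the lecture), assuming NumPy, with simulated data for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 20)
dy = np.full_like(x, 0.5)
y = 1.3 * x + 0.7 + rng.normal(scale=dy)

w = 1.0 / dy**2
A, B, C = np.sum(w * x**2), np.sum(w), np.sum(w * x)
D, E = np.sum(w * x * y), np.sum(w * y)

det = A * B - C**2
a_hat, b_hat = (B * D - E * C) / det, (A * E - D * C) / det
da, db = np.sqrt(B / det), np.sqrt(A / det)
print(f"a = {a_hat:.3f} +/- {da:.3f}, b = {b_hat:.3f} +/- {db:.3f}")
```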
Example: fitting a line
[Figure: 2-dimensional error contours on $a$ and $b$]
Non-parametric estimation
Directly estimating the probability density function. Used for:
• Likelihood ratio discriminants
• Separating power of variables
• Data / MC agreement
• …
Frequency table: for a sample $\{x_i\}$, $i = 1 \ldots n$:
1. Define successive intervals (bins) $C_k = [a_k, a_{k+1})$
2. Count the number of events $n_k$ in $C_k$
Histogram: graphical representation of the frequency table:
$$h(x) = n_k \ \text{if}\ x \in C_k$$
Histogram
Example dataset: N/Z for stable heavy nuclei:
1.321, 1.357, 1.392, 1.410, 1.428, 1.446,
1.464, 1.421, 1.438, 1.344, 1.379, 1.413,
1.448, 1.389, 1.366, 1.383, 1.400, 1.416,
1.433, 1.466, 1.500, 1.322, 1.370, 1.387,
1.403, 1.419, 1.451, 1.483, 1.396, 1.428,
1.375, 1.406, 1.421, 1.437, 1.453, 1.468,
1.500, 1.446, 1.363, 1.393, 1.424, 1.439,
1.454, 1.469, 1.484, 1.462, 1.382, 1.411,
1.441, 1.455, 1.470, 1.500, 1.449, 1.400,
1.428, 1.442, 1.457, 1.471, 1.485, 1.514,
1.464, 1.478, 1.416, 1.444, 1.458, 1.472,
1.486, 1.500, 1.465, 1.479, 1.432, 1.459,
1.472, 1.486, 1.513, 1.466, 1.493, 1.421,
1.447, 1.460, 1.473, 1.486, 1.500, 1.526,
1.480, 1.506, 1.435, 1.461, 1.487, 1.500,
1.512, 1.538, 1.493, 1.450, 1.475, 1.500,
1.512, 1.525, 1.550, 1.506, 1.530, 1.487,
1.512, 1.524, 1.536, 1.518, 1.577, 1.554,
1.586, 1.586
Histogram as PDF estimator
Statistical description: the $n_k$ are multinomial random variables, with $n = \sum_k n_k$ and
$$p_k = P(x \in C_k) = \int_{C_k} f_X(x)\,dx, \qquad \mu_k = np_k, \qquad
\sigma_k^2 = np_k(1 - p_k) \underset{p_k \ll 1}{\approx} \mu_k, \qquad
\mathrm{Cov}(n_k, n_r) = -np_kp_r \underset{p_k \ll 1}{\approx} 0$$
Each bin can be described by a Poisson density. The $1\sigma$ error on $n_k$ is then:
$$\Delta n_k = \sqrt{\hat\sigma_k^2} = \sqrt{\hat\mu_k} = \sqrt{n_k}$$
The histogram is an estimator of the probability density:
for a large sample, $\lim_{n\to+\infty}\frac{n_k}{n} = \frac{\mu_k}{n} = p_k$; for small classes (width $\delta$), $p_k = \int_{C_k} f_X(x)\,dx \approx \delta f_X(x_c)$, so $\lim_{\delta\to 0}\frac{p_k}{\delta} = f_X(x)$.
And since $h(x) = n_k$ if $x \in C_k$:
$$f(x) = \lim_{\substack{n\to+\infty \\ \delta\to 0}} \frac{1}{n\delta}\,h(x)$$
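A minimal sketch of this estimator (not from the lecture), assuming NumPy: the histogram is normalized by $n\delta$, and each bin carries a $\sqrt{n_k}$ Poisson error:

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(size=10_000)

counts, edges = np.histogram(x, bins=40)
delta = edges[1] - edges[0]
centers = 0.5 * (edges[:-1] + edges[1:])

f_hat = counts / (len(x) * delta)             # h(x) / (n * delta): pdf estimate
df_hat = np.sqrt(counts) / (len(x) * delta)   # Poisson error sqrt(n_k), propagated

# Compare to the true standard normal pdf at the bin centers
true_pdf = np.exp(-centers**2 / 2) / np.sqrt(2 * np.pi)
print(np.max(np.abs(f_hat - true_pdf)))  # small, within a few df_hat
```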
Confidence interval
Generalization of the concept of uncertainty: an interval that contains the true value with a given probability.
For a random variable $X$, a confidence interval $[a, b]$ with confidence level $\alpha$ is such that the probability of finding a realization of $X$ inside the interval is:
$$P(X \in [a, b]) = \int_a^b f_X(x)\,dx = \alpha$$
Frequentists and Bayesians use slightly different concepts:
- For Bayesians, the posterior density is the probability density of the true value. It can be used to derive an interval such that $P(\theta \in [a, b]) = \alpha$.
- No such thing for a frequentist: the interval itself becomes the random variable. $[a, b]$ is a realization of $[A, B]$, with $P(A < \theta \text{ and } B > \theta) = \alpha$, independently of $\theta$.
Confidence interval
Common choices:
- Mean-centered, symmetric interval $[\mu - a, \mu + a]$: $\int_{\mu-a}^{\mu+a} f(x)\,dx = \alpha$
- Mean-centered, probability-symmetric interval $[a, b]$: $\int_a^{\mu} f(x)\,dx = \int_{\mu}^{b} f(x)\,dx = \alpha/2$
- Highest Probability Density (HPD) interval $[a, b]$: $\int_a^b f(x)\,dx = \alpha$ with $f(x) > f(y)$ for $x \in [a, b]$ and $y \notin [a, b]$
Confidence Belt
To build a frequentist interval for an estimator $\hat\lambda$ of $\lambda$:
1. Make pseudo-experiments for several values of $\lambda$ and compute the estimator $\hat\lambda$ for each (MC sampling of the estimator pdf).
2. For each $\lambda$, determine $\Xi(\lambda)$ and $\Omega(\lambda)$ such that:
$\hat\lambda < \Xi(\lambda)$ for a fraction $(1-\alpha)/2$ of the pseudo-experiments, and
$\hat\lambda > \Omega(\lambda)$ for a fraction $(1-\alpha)/2$ of the pseudo-experiments.
These 2 curves form the confidence belt, for a CL $\alpha$.
3. Invert these functions. The interval $[\Omega^{-1}(\hat\lambda), \Xi^{-1}(\hat\lambda)]$ satisfies:
$$P\left(\Omega^{-1}(\hat\lambda) < \lambda < \Xi^{-1}(\hat\lambda)\right)
= 1 - P\left(\Xi^{-1}(\hat\lambda) < \lambda\right) - P\left(\Omega^{-1}(\hat\lambda) > \lambda\right)
= 1 - P\left(\hat\lambda < \Xi(\lambda)\right) - P\left(\hat\lambda > \Omega(\lambda)\right) = \alpha$$
[Figure: confidence belt for a Poisson parameter $\lambda$ estimated with the empirical mean of 3 realizations (68% CL)]
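A Monte Carlo sketch of steps 1-3 for the Poisson example in the figure (not from the lecture), assuming NumPy; empirical quantiles stand in for the belt curves:

```python
import numpy as np

rng = np.random.default_rng(9)
alpha = 0.683
lambdas = np.linspace(0.5, 10.0, 40)   # scanned true values
belt = []
for lam in lambdas:
    # estimator: empirical mean of 3 Poisson realizations
    lam_hat = rng.poisson(lam, size=(100_000, 3)).mean(axis=1)
    lo = np.quantile(lam_hat, (1 - alpha) / 2)      # Xi(lambda)
    hi = np.quantile(lam_hat, 1 - (1 - alpha) / 2)  # Omega(lambda)
    belt.append((lo, hi))

# For an observed lam_hat, the interval is the set of lambdas whose
# belt [lo, hi] contains it (the inversion of step 3).
observed = 3.0
covered = [lam for lam, (lo, hi) in zip(lambdas, belt) if lo <= observed <= hi]
print(covered[0], covered[-1])
```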
Dealing with systematics
The variance of the estimator only measures the statistical uncertainty.
Often, we will have to deal with some parameters whose value is known with limited precision → systematic uncertainties.
The likelihood function becomes $\mathcal{L}(\theta, \nu)$, with $\nu = \nu_0 \pm \Delta\nu$ or $\nu = \nu_0\,^{+\Delta\nu_+}_{-\Delta\nu_-}$.
The imperfectly known parameters $\nu$ are nuisance parameters.
Bayesian inference
In Bayesian statistics, nuisance parameters are dealt with by assigning them a prior $\pi(\nu)$. Usually, a multinormal law is used, with mean $\nu_0$ and covariance matrix estimated from $\Delta\nu_0$ (+ correlations, if needed):
$$f(\theta, \nu|x) = \frac{f(x|\theta, \nu)\,\pi(\theta)\,\pi(\nu)}{\iint f(x|\theta, \nu)\,\pi(\theta)\,\pi(\nu)\,d\theta\,d\nu}$$
The final posterior is obtained by marginalization over the nuisance parameters:
$$f(\theta|x) = \int f(\theta, \nu|x)\,d\nu = \frac{\int f(x|\theta, \nu)\,\pi(\theta)\,\pi(\nu)\,d\nu}{\iint f(x|\theta, \nu)\,\pi(\theta)\,\pi(\nu)\,d\theta\,d\nu}$$
Profile Likelihood
There is no true frequentist way to add systematic effects. The popular method of the day is profiling:
deal with nuisance parameters as realizations of random variables, and extend the likelihood:
$$\mathcal{L}(\theta, \nu) \to \mathcal{L}'(\theta, \nu) = \mathcal{L}(\theta, \nu)\,G(\nu)$$
$G(\nu)$ is the likelihood of the nuisance parameters (identical in form to the prior).
For each value of $\theta$, maximize the likelihood with respect to the nuisances → the profile likelihood $PL(\theta)$.
$PL(\theta)$ has the same asymptotic statistical properties as the regular likelihood.
Statistical Tests
Statistical tests aim at:
• Checking the compatibility of a dataset $\{x_i\}$ with a given distribution
• Checking the compatibility of 2 datasets $\{x_i\}$, $\{y_i\}$: are they drawn from the same distribution?
• Comparing different hypotheses: background vs signal + background
In every case:
• build a statistic that quantifies the agreement with the hypothesis
• convert it into a probability of compatibility/incompatibility: the p-value
Pearson test
Test for binned data: use the Poisson limit of the histogram.
• Sort the sample into $k$ bins $C_i$ → counts $n_i$
• Compute the probability of each class: $p_i = \int_{C_i} f(x)\,dx$
• The test statistic compares, for each bin, the deviation of the observation (data $n_i$) from the expected Poisson mean $np_i$ to the theoretical Poisson standard deviation $\sqrt{np_i}$:
$$\chi^2 = \sum_{i \in \text{bins}} \frac{(n_i - np_i)^2}{np_i}$$
Then $\chi^2$ follows (asymptotically) a chi-square law with $k - 1$ degrees of freedom (1 constraint: $\sum n_i = n$).
p-value: probability of doing worse,
$$p\text{-value} = \int_{\chi^2}^{+\infty} f_{\chi^2}(x; k-1)\,dx$$
For a "good" agreement, $\chi^2/(k-1) \sim 1$; more precisely ($1\sigma$ interval, ~68% CL):
$$\chi^2 \in \left[(k-1) \pm \sqrt{2(k-1)}\right]$$
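A minimal sketch of the test (not from the lecture), assuming SciPy, for a uniform hypothesis with $k = 10$ bins:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
x = rng.uniform(size=1000)

k = 10
n_i, edges = np.histogram(x, bins=k, range=(0, 1))
p_i = np.full(k, 1.0 / k)          # tested law: uniform -> p_i = 1/k

chi2 = np.sum((n_i - len(x) * p_i) ** 2 / (len(x) * p_i))
pvalue = stats.chi2.sf(chi2, df=k - 1)  # integral of the chi-square pdf above chi2
print(chi2 / (k - 1), pvalue)           # ~1, and a non-small p-value for a correct hypothesis
```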
Kolmogorov-Smirnov test
Test for unbinned data: compare the sample cumulative distribution function to the tested one.
Sample pdf (ordered sample): $f_S(x) = \frac{1}{n}\sum_i \delta(x - x_i)$, so
$$F_S(x) = \begin{cases} 0 & x < x_1 \\ k/n & x_k \leq x < x_{k+1} \\ 1 & x \geq x_n \end{cases}$$
The Kolmogorov statistic is the largest deviation:
$$D_n = \sup_x \left|F_S(x) - F(x)\right|$$
The test distribution has been computed by Kolmogorov:
$$P(D_n\sqrt{n} > \beta) = 2\sum_{r=1}^{\infty} (-1)^{r-1}\,e^{-2r^2\beta^2}$$
$[0, \beta]$ defines a confidence interval for $D_n$:
$\beta = 0.9584/\sqrt{n}$ for 68.3% CL; $\beta = 1.3754/\sqrt{n}$ for 95.4% CL.
Example
Test the compatibility of the following sample ($n = 120$) with an exponential law, $f(x) = \lambda e^{-\lambda x}$ with $\lambda = 0.4$:
0.008, 0.036, 0.112, 0.115, 0.133, 0.178, 0.189, 0.238, 0.274, 0.323, 0.364, 0.386, 0.406, 0.409, 0.418, 0.421,
0.423, 0.455, 0.459, 0.496, 0.519, 0.522, 0.534, 0.582, 0.606, 0.624, 0.649, 0.687, 0.689, 0.764, 0.768, 0.774,
0.825, 0.843, 0.921, 0.987, 0.992, 1.003, 1.004, 1.015, 1.034, 1.064, 1.112, 1.159, 1.163, 1.208, 1.253, 1.287,
1.317, 1.320, 1.333, 1.412, 1.421, 1.438, 1.574, 1.719, 1.769, 1.830, 1.853, 1.930, 2.041, 2.053, 2.119, 2.146,
2.167, 2.237, 2.243, 2.249, 2.318, 2.325, 2.349, 2.372, 2.465, 2.497, 2.553, 2.562, 2.616, 2.739, 2.851, 3.029,
3.327, 3.335, 3.390, 3.447, 3.473, 3.568, 3.627, 3.718, 3.720, 3.814, 3.854, 3.929, 4.038, 4.065, 4.089, 4.177,
4.357, 4.403, 4.514, 4.771, 4.809, 4.827, 5.086, 5.191, 5.928, 5.952, 5.968, 6.222, 6.556, 6.670, 7.673, 8.071,
8.165, 8.181, 8.383, 8.557, 8.606, 9.032, 10.482, 14.174
$D_n = 0.069$ → p-value = 0.617
$1\sigma$ interval for $D_n$: $[0, 0.0875]$ → the sample is compatible with the tested law at 68% CL.
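The same test is available in SciPy; a minimal sketch on simulated data standing in for the sample above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
lam = 0.4
x = rng.exponential(scale=1 / lam, size=120)  # stand-in for the sample above

# Kolmogorov-Smirnov test against the exponential cdf F(x) = 1 - exp(-lam*x)
D_n, pvalue = stats.kstest(x, lambda t: 1.0 - np.exp(-lam * t))
print(D_n, pvalue)

# 1-sigma bound on D_n for n = 120: beta = 0.9584 / sqrt(n)
print(0.9584 / np.sqrt(len(x)))  # ~0.0875, as quoted above
```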