
Statistical estimation
Statistical modelling: theory and practice

Gilles Guillot

gigu@dtu.dk

September 3, 2013


1 Introductory example

2 Principles of estimation

3 Likelihood theory

4 Reading

5 Exercises


Introductory example


A batch of 1000 electronic components contains some faulty items. One takes a sample of size 100 with replacement, of which 3 are faulty. What is the proportion of faulty items in the batch?

Arriving in a new city, you see a tram passing in the street with the number 16. How many tram lines are there in this city?

For a set of measurements y1, ..., yn of temperatures at dates t1, ..., tn observed at a certain location, we want to fit a line y = at + b. What are the values of a and b that best fit the data?


A common setup

We have some data.

There is a “mechanism” that generates the data.

This mechanism depends on an unknown parameter that we want to estimate.


The Statistics way

We relate the unknown parameter to the data by means of a probability distribution.

Proportion of faulty items: the number of faulty items in the sample can be assumed to follow a binomial distribution B(n, p).

Tram lines: the number observed can be assumed to follow a uniform distribution U{1, ..., N}.
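A minimal sketch simulating the two "mechanisms" above (the parameter values p = 0.05 and N = 20 are assumptions of the illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# Faulty items: the count of faulty items in a sample of n = 100
# is modelled as binomial B(n, p); p = 0.05 is an assumed value.
faulty_count = rng.binomial(n=100, p=0.05)

# Tram lines: the observed line number is modelled as uniform on
# {1, ..., N}; N = 20 is an assumed value. integers() excludes the
# upper endpoint, hence N + 1.
tram_number = rng.integers(1, 20 + 1)

print(faulty_count, tram_number)
```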


Principles of estimation

Estimator, estimate

Estimator

Denoting generically θ the unknown parameter value, an estimator is a rule (or algorithm) allowing us to “guess” θ from the data. From a mathematical point of view, it is a function

θ̂ : R^n −→ R^d

(x1, ..., xn) ↦ θ̂(x1, ..., xn)


d is the dimension of the parameter space (often d = 1 for us)

Since we assume that the data are random, we will often stress this by denoting them (X1, ..., Xn).

The number θ̂ is an estimate of θ. Viewed as a function of the random data, it is a random variable, sometimes denoted θ̂(X1, ..., Xn).
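In code, an estimator is literally such a function: it maps a sample to a number. A minimal sketch for the proportion example (the names are illustrative, not from the slides):

```python
import numpy as np

def proportion_estimator(x: np.ndarray) -> float:
    """An estimator: a rule mapping the data (x1, ..., xn) to a guess of θ."""
    return float(x.mean())

# The estimate for one observed sample: 3 faulty items out of 100.
sample = np.array([1] * 3 + [0] * 97)
theta_hat = proportion_estimator(sample)
print(theta_hat)  # 0.03
```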


Bias of an estimator

Definition: bias

The bias of an estimator is the average discrepancy between the estimate and the true parameter value:

Bias(θ̂) = E[θ̂ − θ] = E[θ̂] − θ

An estimator is said to be unbiased if Bias(θ̂) = 0


Precision and accuracy

Precision and accuracy are two concepts from science and engineering, best explained by the classic target figure (not reproduced here): precise means the shots are tightly clustered; accurate means they are centred on the target.

In statistics, we have two related concepts: variance and mean square error.


Definition: variance of an estimator

V[θ̂] = E[(θ̂ − E[θ̂])²]

V[θ̂] is a measure of how much θ̂ is scattered around its mean (which may differ from the true value θ).

Definition: mean square error of an estimator

MSE[θ̂] = E[(θ̂ − θ)²]

MSE[θ̂] is a measure of how much θ̂ is scattered around the true value θ.

When θ̂ is unbiased, E[θ̂] = θ, hence V[θ̂] = MSE[θ̂].
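A Monte Carlo sketch illustrating these definitions on a deliberately biased estimator p̃ = (x + 1)/(n + 2) (the estimator and parameter values are chosen for illustration, not from the slides); it also checks the standard decomposition MSE = V + Bias²:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 0.03
reps = 200_000

x = rng.binomial(n, p, size=reps)      # many replicated experiments
p_tilde = (x + 1) / (n + 2)            # a deliberately biased estimator

bias = p_tilde.mean() - p
var = p_tilde.var()
mse = np.mean((p_tilde - p) ** 2)
print(bias, var, mse, var + bias**2)   # mse ≈ var + bias²
```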


Confidence interval

Definition: confidence interval

A confidence interval at level (1 − α) ∈ [0, 1] is an interval that contains the true unknown parameter value θ with probability 1 − α.
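The slides do not construct any interval here; as an illustration only, a sketch of the standard normal-approximation (Wald) interval for a proportion, which is approximate and behaves poorly when p̂ is close to 0 or 1:

```python
import numpy as np
from scipy.stats import norm

n, x = 100, 3
p_hat = x / n
alpha = 0.05

# Wald interval: p̂ ± z_{1−α/2} · sqrt(p̂(1−p̂)/n); only approximate.
z = norm.ppf(1 - alpha / 2)
half_width = z * np.sqrt(p_hat * (1 - p_hat) / n)
print(p_hat - half_width, p_hat + half_width)
```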


Example: estimation of a proportion

We have a sample of n objects of which x are faulty. We estimate the unknown proportion p by p̂ = x/n.

Exercise: give the bias and variance of p̂.
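A simulation sketch you can use to check your analytic answer (the true value p = 0.03 is an assumption of the illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 0.03

# 100 000 replicated experiments, one estimate p̂ = x/n per experiment.
p_hat = rng.binomial(n, p, size=100_000) / n
print(p_hat.mean() - p)   # empirical bias, compare with your formula
print(p_hat.var())        # empirical variance, compare with your formula
```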


Likelihood theory

A new look at the probability of the data

We consider again the problem of estimating a proportion with binomial sampling.

P(X = x) = p^x (1 − p)^(n−x) is the probability of obtaining x faulty objects.

If we consider p^x (1 − p)^(n−x) as a function of p, it can be interpreted as the likelihood of the unknown parameter p.

To acknowledge the dependence on p, we write L(x; p) = p^x (1 − p)^(n−x), or L(p) for short.

[Figure: the likelihood L(p) = p^x (1 − p)^(n−x) plotted against p ∈ [0, 1]; x-axis: p, y-axis: probability of data.]
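Such a curve can be reproduced with a few lines (a sketch for the running example n = 100, x = 3; matplotlib is assumed available):

```python
import numpy as np
import matplotlib.pyplot as plt

n, x = 100, 3
p = np.linspace(0.001, 0.999, 500)
L = p**x * (1 - p) ** (n - x)      # likelihood, binomial coefficient omitted

plt.plot(p, L)
plt.axvline(x / n, linestyle="--") # the curve peaks at p = x/n = 0.03
plt.xlabel("p")
plt.ylabel("Probability of data")
plt.show()
```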


The maximum likelihood principle

The above suggests a method to estimate the unknown parameter p:

p̂ = Argmax_p p^x (1 − p)^(n−x)

p̂ is the parameter value that makes our data most probable.

It is known as the maximum likelihood estimate of p and denoted p̂_ML.

Note that defining p̂ as Argmax_p C(n, x) p^x (1 − p)^(n−x) would lead to the same result, since the binomial coefficient C(n, x) does not depend on p.


Likelihood in a general statistical model

Definition: likelihood function

We consider a dataset consisting of n observations (x1, ..., xn)

We assume that the probability density function or probability mass function of (x1, ..., xn), denoted fθ(x1, ..., xn), is known up to an unknown parameter θ.

The likelihood function L is defined as

L(x1, ..., xn; θ) = fθ(x1, ..., xn)


Examples of likelihood functions

Poisson counts

We observe the number of phone calls at various calling centres over a given period and denote them by (x1, ..., xn).

We assume that the xi are independent realizations of a Poisson random variable Xi with parameter λ, i.e. P(Xi = x) = exp(−λ) λ^x / x!

NB: x ∈ N and λ ∈ R+

E[Xi] = λ and V [Xi] = λ


Likelihood for i.i.d Poisson observations

Remember: "Likelihood = probability of data for a given parameter value "

L(x1, ..., xn; λ) = ∏_{i=1}^{n} exp(−λ) λ^{xi} / xi!

= exp(−nλ) λ^(∑i xi) / ∏i xi!

∝ exp(−nλ) λ^(∑i xi)


Likelihood for i.i.d Normal observations

"Likelihood = probability of data for a given parameter value "Parameter: θ = (µ, σ) ∈ R× R+

L(x1, ..., xn;µ, σ) =

n∏i=1

1

σ√

2πexp[−1

2

(xi − µ)2

σ2]

∝n∏i=1

1

σexp[−1

2

(xi − µ)2

σ2]


General maximum likelihood principle

Maximum likelihood estimator

We consider a dataset consisting of n observations (x1, ..., xn)

We assume that we know the likelihood function L(x1, ..., xn; θ) = fθ(x1, ..., xn)

The maximum likelihood estimator of θ is defined as

θ̂_ML(x1, ..., xn) = Argmax_θ L(x1, ..., xn; θ)
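In practice the Argmax is usually found numerically; a minimal sketch for the binomial example, minimizing −ln L with scipy (assumed available):

```python
import numpy as np
from scipy.optimize import minimize_scalar

n, x = 100, 3

def neg_log_lik(p):
    # -ln L(p) for the binomial example, constant term omitted
    return -(x * np.log(p) + (n - x) * np.log(1 - p))

res = minimize_scalar(neg_log_lik, bounds=(1e-9, 1 - 1e-9), method="bounded")
print(res.x)   # ≈ 0.03 = x/n, matching the analytic derivation below
```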


Deriving p̂ explicitly for the previous binomial sampling

We want to maximize L(p) = p^x (1 − p)^(n−x). We could work on L(p) directly in this case, but let us denote l(p) = ln L(p).

l(p) = ln[p^x (1 − p)^(n−x)] = x ln p + (n − x) ln(1 − p)

l′(p) = x/p − (n − x)/(1 − p)

l′(p) = 0 if p = x/n

p̂ = x/n is the estimate of p

For a generic sample with random outcome X, p̂ = X/n is the estimator of p; it is a random variable.


Maximum likelihood estimator for i.i.d Poisson observations

Omitting the term that does not depend on λ, we have

l(x1, ..., xn; λ) = ln L(x1, ..., xn; λ) = ln[exp(−nλ) λ^(∑i xi)] = −nλ + (∑i xi) ln λ

Hence

d/dλ l(x1, ..., xn; λ) = −n + ∑i xi / λ

and

d/dλ l(x1, ..., xn; λ) = 0 ⟺ λ = ∑i xi / n

The MLE of λ is λ̂_ML = ∑i xi / n = x̄
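A numerical check of this result on synthetic data (a sketch; the true value λ = 4.2 and the use of scipy are assumptions of the illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

rng = np.random.default_rng(3)
x = rng.poisson(lam=4.2, size=50)     # synthetic call counts

def neg_log_lik(lam):
    return -poisson.logpmf(x, lam).sum()

res = minimize_scalar(neg_log_lik, bounds=(1e-9, 100), method="bounded")
print(res.x, x.mean())                # the optimizer recovers the sample mean
```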


Likelihood for i.i.d Normal observations

"Likelihood = probability of data for a given parameter value "Parameter: θ = (µ, σ) ∈ R× R+

L(x1, ..., xn;µ, σ) =

n∏i=1

1

σ√

2πexp[−1

2

(xi − µ)2

σ2]

∝n∏i=1

1

σexp[−1

2

(xi − µ)2

σ2]

l(x1, ..., xn;µ, σ) =

n∑i=1

[− lnσ − 1

2

(xi − µ)2

σ2

]

= −n lnσ − 1

2σ2

n∑i=1

(xi − µ)2


d/dµ l(x1, ..., xn; µ, σ) = (1/σ²) ∑_{i=1}^{n} (xi − µ)

d/dσ l(x1, ..., xn; µ, σ) = −n/σ + (1/σ³) ∑_{i=1}^{n} (xi − µ)²

d/dµ l = 0 and d/dσ l = 0 give

∑_{i=1}^{n} (xi − µ) = 0 and −nσ² + ∑_{i=1}^{n} (xi − µ)² = 0

hence

µ̂_ML = (1/n) ∑_{i=1}^{n} xi = x̄ and σ̂²_ML = (1/n) ∑_{i=1}^{n} (xi − x̄)²

NB: The ML estimator of the variance is biased. This bias tends to zero as n −→ ∞.
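A quick simulation sketch of this bias, comparing the ML estimator (divisor n) with the usual unbiased estimator (divisor n − 1); the parameter values are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, n, reps = 0.0, 2.0, 5, 200_000

x = rng.normal(mu, sigma, size=(reps, n))
var_ml = x.var(axis=1, ddof=0)        # (1/n) Σ (xi − x̄)², the MLE
var_unbiased = x.var(axis=1, ddof=1)  # divisor n − 1

print(var_ml.mean())        # ≈ σ²(n−1)/n = 3.2, biased downwards
print(var_unbiased.mean())  # ≈ σ² = 4.0
```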


Remarks on the MLE

The rule “likelihood = product of marginal densities” applies only when the observations are independent.

Taking the log:

linearizes the product into a sum

greatly simplifies the mathematical expressions when the density is proportional to exp(a x^b) x^c

avoids numerical instabilities in numerical computation (see the sketch below)
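The last point is easy to demonstrate (a sketch with standard normal data; scipy assumed): the raw product of many densities underflows long before the sum of logs does.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
x = rng.normal(size=1000)

print(np.prod(norm.pdf(x)))    # 0.0: the product underflows
print(np.sum(norm.logpdf(x)))  # a perfectly ordinary finite number
```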


Remarks on the MLE (cont.)

Deriving the MLE in closed form is often impossible in real-life problems. One has to resort to numerical optimization; hence the importance of optimization methods in statistics.

If the parameter θ belongs to a discrete set, differentiating l(θ) is meaningless. One has to resort to discrete optimization methods.
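For instance, the tram example from the introduction has a discrete parameter N: with a single observation x = 16 from U{1, ..., N}, the likelihood can simply be evaluated on a grid (a sketch; the grid range is an arbitrary choice):

```python
import numpy as np

x = 16                       # the observed tram line number
N_grid = np.arange(1, 201)   # candidate values of N

# Likelihood under U{1, ..., N}: P(X = x) = 1/N if N ≥ x, else 0.
lik = np.where(N_grid >= x, 1.0 / N_grid, 0.0)
N_hat = N_grid[np.argmax(lik)]
print(N_hat)                 # 16: the MLE is the observed number
```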

The likelihood L(x1, ..., xn; θ) is sometimes denoted L(θ|x1, ..., xn).

This notation is misleading and mathematically wrong, since in likelihood theory θ is not a random variable.


Reading


To go beyond these slides, you can read the first two chapters of In All Likelihood, Yudi Pawitan, Oxford Science Publications, 2001. This book is not in the DTU digital library, but it is almost entirely available on Google Books.


Exercises


1 We assume that we have recorded the life durations of n light bulbs, denoted x1, ..., xn. We assume that they are n iid replicates of an exponential E(α) distribution. Derive analytically the expression of the MLE of α.

2 Derive analytically the MLE of a for a dataset consisting of n iid replicates of a U[0, a] distribution. Evaluate the bias of this estimator. What is the limit of the bias when n tends to +∞?

3 For a distribution fθ, the expectation of X under fθ can be expressed as a function φ(θ). The moment method consists in identifying φ(θ) with the empirical mean. Apply this principle to the case above and discuss the resulting estimator: bias, variance, other remarks?
