Data Analysis Class 6: Hypothesis testing and confidence intervals



What to ‘expect’?

• We have studied various distributions:
  – Bernoulli – binary
  – Geometric – number of Bernoulli experiments to first success
  – Binomial – number of successes in n Bernoulli experiments
  – Gaussian – real-valued, bell-shaped curve
  – Exponential – real-valued time to first event
  – Poisson – number of events in a unit time interval

• Tells us what to expect from measurements
• What if we are not sure about the parameters?
• Can we use measurements as evidence?
  – Fitting distributions to data
  – Testing hypotheses based on data
  – Confidence intervals

Remember…

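Type: distribution / density function; mean; variance

• Bernoulli: $P(X=1) = p$, $P(X=0) = 1-p$; mean $p$; variance $p(1-p)$
• Geometric: $P(X=x) = (1-p)^{x-1} p$; mean $1/p$; variance $(1-p)/p^2$
• Binomial: $P(X=x) = \binom{n}{x} p^x (1-p)^{n-x}$; mean $np$; variance $np(1-p)$
• Gaussian: $P(X=x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$; mean $\mu$; variance $\sigma^2$
• Exponential: $P(X=x) = \lambda \exp(-\lambda x)$; mean $1/\lambda$; variance $1/\lambda^2$
• Poisson: $P(X=x) = \frac{\lambda^x \exp(-\lambda)}{x!}$; mean $\lambda$; variance $\lambda$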

Properties of distributions

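• Mean: $\mu = E(X) = \sum_x x\,P(X=x)$ (discrete) or $\int x\,P(X=x)\,dx$ (continuous)

• Sample mean (average): $\hat\mu = \frac{1}{n}\sum_i x_i$

• Variance: $\sigma^2 = E\left((X-\mu)^2\right) = \sum_x (x-\mu)^2\,P(X=x)$ (discrete) or $\int (x-\mu)^2\,P(X=x)\,dx$ (continuous)

• Sample variance: $\hat\sigma^2 = \frac{1}{n}\sum_i (x_i-\hat\mu)^2 = \frac{1}{n}\sum_i x_i^2 - \hat\mu^2$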
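The sample mean and sample variance can be sketched in a few lines of plain Python (the slides themselves use MATLAB). The function names and the toy data are made up for this illustration, and the variance is normalized by n, which is an assumption; a 1/(n-1) normalization is also common.

```python
def sample_mean(xs):
    """Average of the observations: (1/n) * sum_i x_i."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """Average squared deviation from the sample mean (1/n normalization)."""
    mu = sample_mean(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


# Toy data, chosen only so the result is easy to check by hand.
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(sample_mean(xs))      # 5.0
print(sample_variance(xs))  # 4.0
```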

Moment matching

• Choosing parameters such that the first, second, … order moments are equal to their empirical estimates

...

Bernoulli / geometric / binomial

• Bernoulli first order moment: $E(X) = p$, so $\hat p = \frac{1}{n}\sum_i x_i$

• (Similar for geometric and binomial)

• Only one parameter, so higher order moments are not needed

Bernoulli / geometric / binomial

Empirical mean: p = 0.34, estimated from 100*10 Bernoulli outcomes
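Matching the first moment for the three one-parameter distributions could look as follows. This is a Python sketch with hypothetical function names and toy data that are not the data behind the figure; it uses the means from the overview (p for Bernoulli, 1/p for geometric, np for binomial).

```python
def mean(xs):
    return sum(xs) / len(xs)


def fit_bernoulli(outcomes):
    """E(X) = p, so p is estimated by the sample mean of the 0/1 outcomes."""
    return mean(outcomes)


def fit_geometric(trials_to_first_success):
    """E(X) = 1/p, so p is estimated by 1 / sample mean."""
    return 1.0 / mean(trials_to_first_success)


def fit_binomial(success_counts, n):
    """E(X) = n*p, so p is estimated by sample mean / n (n assumed known)."""
    return mean(success_counts) / n


# Toy data, made up for this example.
p_bernoulli = fit_bernoulli([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
p_geometric = fit_geometric([2, 5, 1, 4, 3, 3])
p_binomial = fit_binomial([3, 4, 2, 3], n=10)
print(p_bernoulli, p_geometric, p_binomial)
```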

Gaussian

• First and second order moments:

$\mu = \frac{1}{n}\sum_i x_i$

$\sigma^2 + \mu^2 = \frac{1}{n}\sum_i x_i^2$

• Only two parameters, so no need for higher moments

Gaussian

Empirical means: 3.2 and 15.6

Empirical stds: 1.1 and 2.0
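A minimal Python sketch of Gaussian moment matching is given below. It matches E(X) and E(X^2) to their empirical values and solves for the mean and standard deviation. The synthetic samples (true mean 3.0, true standard deviation 1.0, fixed seed) are an assumption made for this example and are unrelated to the data in the figure.

```python
import random


def fit_gaussian_by_moments(xs):
    """Match the first two moments: mu = m1 and sigma^2 + mu^2 = m2."""
    n = len(xs)
    m1 = sum(xs) / n                 # empirical first moment
    m2 = sum(x * x for x in xs) / n  # empirical second moment
    mu = m1
    sigma2 = m2 - m1 * m1
    return mu, sigma2 ** 0.5


# Synthetic samples from a Gaussian with known parameters (assumed values).
rng = random.Random(0)
samples = [rng.gauss(3.0, 1.0) for _ in range(10000)]
mu_hat, sigma_hat = fit_gaussian_by_moments(samples)
print(round(mu_hat, 2), round(sigma_hat, 2))
```

With this many samples the estimates should land close to the parameters used to generate the data.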

Multivariate Gaussian

• Multivariate Gaussian density function:

$P(X=x) = \frac{1}{\sqrt{|2\pi\Sigma|}} \exp\left(-\frac{(x-\mu)^T \Sigma^{-1} (x-\mu)}{2}\right)$

• Parameters:
  – Mean vector mu ($\mu$)
  – Covariance matrix Sigma ($\Sigma$)

Multivariate Gaussian

• Parameters can be estimated from a set of samples {x1,x2,…,xn}:

$\hat\mu = \frac{1}{n}\sum_i x_i$

$\hat\Sigma = \frac{1}{n}\sum_i (x_i-\hat\mu)(x_i-\hat\mu)^T$
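The mean vector and covariance matrix estimates can be sketched in plain Python, with samples stored as lists of coordinates. The function names and the three toy points are hypothetical, and the 1/n normalization is an assumption.

```python
def mean_vector(samples):
    """Component-wise average of the sample vectors."""
    n, d = len(samples), len(samples[0])
    return [sum(x[j] for x in samples) / n for j in range(d)]


def covariance_matrix(samples):
    """(1/n) * sum_i (x_i - mu)(x_i - mu)^T, returned as a list of rows."""
    n, d = len(samples), len(samples[0])
    mu = mean_vector(samples)
    return [
        [sum((x[j] - mu[j]) * (x[k] - mu[k]) for x in samples) / n
         for k in range(d)]
        for j in range(d)
    ]


# Toy 2-dimensional samples, made up for this example.
points = [[1.0, 2.0], [3.0, 1.0], [5.0, 6.0]]
mu_hat = mean_vector(points)
sigma_hat = covariance_matrix(points)
print(mu_hat)
print(sigma_hat)
```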

Exponential / Poisson

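• Exponential mean: $\frac{1}{\lambda} = \frac{1}{n}\sum_i x_i$ (with $x_i$ the times between events)

• Poisson mean: $\lambda = \frac{1}{n}\sum_i x_i$ (with $x_i$ the number of events per unit time interval)

• Both will give the same result (lambda = empirical number of events per unit of time)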

Exponential / Poisson

http://news.bbc.co.uk/1/hi/world/europe/2008892.stm

Significant plane crashes since 1 January 1998…

Exponential / Poisson

Lambda = 0.015 = 1/mean(time between crashes)

Exponential / Poisson

Lambda = 1.5 = average number of crashes in 100 days

(note: 100 times larger, because the unit time interval is 100 times longer too)
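The two ways of estimating lambda can be compared with a small Python sketch. The event days below are invented for illustration and are not the BBC crash data. The observation period (7 intervals of 100 days) and the function names are assumptions as well.

```python
def rate_from_gaps(event_times):
    """Exponential view: lambda = 1 / mean(time between events)."""
    gaps = [b - a for a, b in zip(event_times, event_times[1:])]
    return 1.0 / (sum(gaps) / len(gaps))


def rate_from_counts(event_times, interval_length, n_intervals):
    """Poisson view: lambda = mean number of events per unit time interval."""
    counts = [0] * n_intervals
    for t in event_times:
        counts[int(t // interval_length)] += 1
    return sum(counts) / n_intervals


# Hypothetical event days within a 700-day observation period.
crash_days = [10, 60, 150, 210, 290, 330, 410, 480, 560, 610]
lam_per_day = rate_from_gaps(crash_days)
lam_per_100_days = rate_from_counts(crash_days, 100, 7)
print(lam_per_day * 100, lam_per_100_days)
```

On a short series like this one the two estimates agree only approximately, because the gaps ignore the time before the first and after the last event.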

Hypothesis testing

• Given a distribution (and parameters)

• E.g. binomial: number of faulty items in a lot of a factory pipeline

• Empirical data may urge us to revise our hypothesis

Binomial distribution

• Consider a pipeline in a factory

• If the expected probability of a fault is (should be) 0.01, what can we conclude if we see a batch with 10% faults?

• p=0.01, n=100 (batch size), x=10 (number faulty)

• Probability to see something equally or more surprising = the p-value: P(X>=10)?

$P(X \ge 10) = \sum_{x=10}^{n} \binom{n}{x} p^x (1-p)^{n-x} = 7.6 \times 10^{-8}$

• Extremely small, so we should reject the hypothesis that p=0.01: the pipeline must be broken!?

Binomial distribution

• In practice:

$P(X \ge 10) = \sum_{x=10}^{n} \binom{n}{x} p^x (1-p)^{n-x} = 7.6 \times 10^{-8}$

• This can be computed by the cumulative binomial distribution function

• In MATLAB (with p=0.01): 1-binocdf(9,100,p)
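For readers without MATLAB, the same tail probability can be computed by summing the binomial probabilities directly. The following is a Python sketch using only the standard library; `binomial_tail` is a name chosen for this example, not a library function.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x)
               for x in range(k, n + 1))


# Same setting as on the slide: n=100, p=0.01, at least 10 faulty items.
p_value = binomial_tail(10, 100, 0.01)
print(f"{p_value:.1e}")
```

This plays the role of 1-binocdf(9,100,p): the probability of 10 or more faults equals one minus the probability of 9 or fewer.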

Poisson distribution

• Assume the expected number of plane crashes in 100 days is supposed to be 1.5. What can we conclude if there are 5 in a given 100 days? (This is the case for the 16th unit time interval.)

• The p-value = P(X>=5)

$P(X \ge 5) = \sum_{x=5}^{\infty} \frac{\lambda^x \exp(-\lambda)}{x!} = 0.02$

• P-value is small – should we reject the null hypothesis that lambda=1.5?

Poisson distribution

• In practice:

$P(X \ge 5) = \sum_{x=5}^{\infty} \frac{\lambda^x \exp(-\lambda)}{x!} = 0.02$

• This can be computed by means of the cumulative Poisson distribution function

• In MATLAB (with lambda=1.5): 1-poisscdf(4,lambda)
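The Poisson tail can be sketched in Python in the same way, as one minus the cumulative distribution function. The helper names `poisson_cdf` and `poisson_tail` are made up for this example.

```python
from math import exp, factorial


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(lam ** x * exp(-lam) / factorial(x) for x in range(k + 1))


def poisson_tail(k, lam):
    """P(X >= k) = 1 - P(X <= k - 1)."""
    return 1.0 - poisson_cdf(k - 1, lam)


# Same setting as on the slide: lambda=1.5, at least 5 crashes.
p_value = poisson_tail(5, 1.5)
print(round(p_value, 2))  # 0.02
```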

Hypothesis testing

• In general:
  – Assume a null hypothesis for the data
    • Faults are Bernoulli random variables with given p
    • Crashes occur at a fixed rate lambda per unit time interval
  – Gather data
  – Compute a test statistic of the data
    • Number of faults in a batch of n
    • Number of crashes in a unit time interval
  – Compute the p-value: the probability that the test statistic is at least as large on random data from the null hypothesis
  – If the p-value is smaller than a threshold (0.01, 0.05…), reject the null hypothesis

Hypothesis testing

• In general:
  – Hypothesis testing quantifies that a random variable will typically be close to its mean
  – This holds more strongly as the standard deviation is smaller

Permutation testing

• Sometimes, the distribution of the test statistic can be too complex

• Then: permutation testing
  – Generate random data sets by permuting the sampled one (1000 times)
  – Compute the fraction of times the test statistic is at least as large in those permuted versions
  – This is an approximation of the p-value

• (Assumption of this approach: permuted versions of the data are equally likely under the null hypothesis)

Permutation testing

• Test statistic = number of plane crashes in 16th unit time interval of 100 days

Permutation testing

• Generate 1000 random crash time series with the same number of crashes in the same period (e.g. by permuting the days)

• Compute the number of crashes in the 16th unit time interval (of 100 days)

• Compute the proportion of those 1000 permutations where the number of crashes in this interval was at least 5

• This is the p-value estimate!

• Result (in my experiment): 0.018 – very close to 0.02 as computed using Poisson
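The permutation procedure can be sketched in Python as follows. The crash series here is a synthetic stand-in, not the real data: 2000 days with 30 crash days (1.5 per 100 days on average) and at most one crash per day, all of which are assumptions. The function name and the fixed seed are also choices made for this example.

```python
import random


def permutation_p_value(day_flags, interval_index, interval_length,
                        observed, n_perm=1000, seed=0):
    """Fraction of day permutations in which the chosen interval
    contains at least `observed` crashes."""
    rng = random.Random(seed)
    flags = list(day_flags)
    start = interval_index * interval_length
    stop = start + interval_length
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(flags)  # permute the days
        if sum(flags[start:stop]) >= observed:
            hits += 1
    return hits / n_perm


# Synthetic series: 1 marks a day with a crash, 0 a day without.
day_flags = [1] * 30 + [0] * 1970

# 16th interval of 100 days (index 15), 5 crashes observed there.
p_hat = permutation_p_value(day_flags, interval_index=15,
                            interval_length=100, observed=5)
print(p_hat)
```

Because only the number of crash days matters once the days are shuffled, the original ordering of the synthetic series does not affect the estimate.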

Confidence intervals

• Rather than computing a point estimate for the mean …

• … we can compute an interval for the mean

• A range of values that contains the mean with high confidence

Confidence intervals

• Consider a pipeline in a factory

• If the expected probability of a fault is (should be) 0.01, what can we conclude if we see a batch with 10% faults?

• n=100 (batch size), x=10 (number faulty)

• Let’s say: we reject the null hypothesis if p-value < delta=0.05

• p-values P(X>=10) for different values of p:
  – p=0.01: p-value = 7.6E-8
  – p=0.05: p-value = 0.028
  – p=0.055: p-value = 0.05
  – p=0.1: p-value = 0.54

• The set of all values for p for which the p-value >= 0.05 is the confidence interval with confidence level 1-delta = 0.95 (delta=0.05): [0.055,1]
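The lower end of this one-sided interval can be found numerically, since the p-value P(X>=10) grows with p. A Python sketch using bisection follows; the function names and the tolerance are assumptions for this example.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x)
               for x in range(k, n + 1))


def lower_confidence_bound(k, n, delta, tol=1e-6):
    """Smallest p whose p-value P(X >= k) reaches delta, by bisection.
    Relies on the tail probability being increasing in p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binomial_tail(k, n, mid) >= delta:
            hi = mid  # mid is not rejected, so the bound is at or below mid
        else:
            lo = mid  # mid is rejected, so the bound is above mid
    return hi


# Same setting as on the slide: 10 faulty out of 100, delta=0.05.
p_low = lower_confidence_bound(10, 100, 0.05)
print(round(p_low, 3))
```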

Confidence intervals

• Assume in a given unit time interval of 100 days, there are 5 crashes

• p-value threshold used: delta=0.01

• p-values P(X>=5) for different values of lambda:
  – lambda = 1: p-value = 0.0037
  – lambda = 1.28: p-value = 0.01
  – lambda = 2: p-value = 0.053
  – lambda = 4: p-value = 0.37

• Confidence interval with confidence level 1-delta = 0.99 (delta=0.01): [1.28, infinity)

Confidence intervals

• This was one-sided

• Two-sided: for all lambda values in the interval:
  – P(at least 5 crashes) >= 0.005
  – P(at most 4 crashes) >= 0.005

• Two-sided confidence interval with confidence level 1-delta = 0.99 (delta=0.01): [1.08,12.6]

• Indeed:
  – 1-poisscdf(4,1.08) = 0.005
  – poisscdf(4,12.6) = 0.005
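The interval end points for the Poisson example can be recovered numerically in Python. The sketch below solves the same two conditions as the poisscdf calls on the slide, so the upper end uses P(X<=4) as the slide does. The helper names, the search range [0, 50] and the tolerance are assumptions.

```python
from math import exp, factorial


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(lam ** x * exp(-lam) / factorial(x) for x in range(k + 1))


def solve_monotone(f, target, lo, hi, tol=1e-6):
    """Bisection for f(x) = target, with f monotone on [lo, hi]."""
    increasing = f(hi) > f(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def upper_tail(lam):
    """P(X >= 5), the counterpart of 1-poisscdf(4,lambda)."""
    return 1.0 - poisson_cdf(4, lam)


def lower_tail(lam):
    """P(X <= 4), the counterpart of poisscdf(4,lambda)."""
    return poisson_cdf(4, lam)


one_sided_lower = solve_monotone(upper_tail, 0.01, 0.0, 50.0)
two_sided_lower = solve_monotone(upper_tail, 0.005, 0.0, 50.0)
two_sided_upper = solve_monotone(lower_tail, 0.005, 0.0, 50.0)
print(round(one_sided_lower, 2), round(two_sided_lower, 2),
      round(two_sided_upper, 1))
```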

Confidence intervals

• Other (more common) interpretation of confidence intervals:
  – With probability (over the sampled data) at least 1-delta, the confidence interval will contain the actual value

• You can verify that this is the case… (think about it)

[For any mean outside the interval, the probability of the observed test statistic (or more extreme) is less than delta. Hence, the probability over the data that the interval contains the actual mean is at least 1-delta]

Lab session

• On the temperature time series data:
  – Compute the 12-dimensional mean temperature over the year
  – Compute the covariance matrix
  – Visualize both in the report (using plot and imagesc)

• On the Titanic data:
  – Compute the probability of having died among first class passengers (report)
  – Compute the p-value for the null hypothesis that the probability of having died for third class passengers is the same (report)
  – Compute the probability of survival among all male passengers (report)
  – Compute the p-value for the null hypothesis that the probability of survival for female passengers is the same (report)

• On the plane crash data:
  – Make a histogram of number of plane crashes per time unit, starting on 1/1/1998, and fit a Poisson to it (as in the lecture), but with unit time interval equal to 50 days (report)
  – Find the unit time interval with the largest number of crashes (report)
  – Compute the p-value for this time interval both analytically, using the Poisson cumulative distribution function, and using permutation testing. What can you conclude, e.g. with p-value threshold equal to 0.01? (report)
