Data Analysis Class 6: Hypothesis testing and confidence intervals

Page 1: Data Analysis Class 6: Hypothesis testing and confidence intervals

Data Analysis

Class 6: Hypothesis testing and confidence intervals

Page 2: Data Analysis Class 6: Hypothesis testing and confidence intervals

What to ‘expect’?

• We have studied various distributions:
  – Bernoulli – binary
  – Geometric – number of Bernoulli experiments to first success
  – Binomial – number of successes in n Bernoulli experiments
  – Gaussian – real-valued, bell-shaped curve
  – Exponential – real-valued time to first event
  – Poisson – number of events in a unit time interval

• Tells us what to expect from measurements

• What if we are not sure about the parameters?

• Can we use measurements as evidence?
  – Fitting distributions to data
  – Testing hypotheses based on data
  – Confidence intervals

Page 3: Data Analysis Class 6: Hypothesis testing and confidence intervals

Remember…

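Type: distribution / density function; mean; variance

• Bernoulli: $P(X=1) = p$, $P(X=0) = 1-p$; mean $p$; variance $p(1-p)$

• Geometric: $P(X=x) = (1-p)^{x-1} p$; mean $\frac{1}{p}$; variance $\frac{1-p}{p^2}$

• Binomial: $P(X=x) = \binom{n}{x} p^x (1-p)^{n-x}$; mean $np$; variance $np(1-p)$

• Gaussian: $P(X=x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$; mean $\mu$; variance $\sigma^2$

• Exponential: $P(X=x) = \lambda \exp(-\lambda x)$; mean $\frac{1}{\lambda}$; variance $\frac{1}{\lambda^2}$

• Poisson: $P(X=x) = \frac{\lambda^x \exp(-\lambda)}{x!}$; mean $\lambda$; variance $\lambda$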

Page 4: Data Analysis Class 6: Hypothesis testing and confidence intervals

Properties of distributions

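• Mean: $\mu = E(X) = \sum_x x\,P(X=x)$ (discrete) or $\int x\,P(X=x)\,dx$ (continuous)

• Sample mean (average): $\hat{\mu} = \frac{1}{n}\sum_i x_i$

• Variance: $\sigma^2 = E\left((X-\mu)^2\right) = \sum_x (x-\mu)^2 P(X=x)$ (discrete) or $\int (x-\mu)^2 P(X=x)\,dx$ (continuous)

• Sample variance: $\hat{\sigma}^2 = \frac{1}{n}\sum_i x_i^2 - \hat{\mu}^2$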
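The sample mean and sample variance can be sketched in a few lines of Python (standard library only). The data values are hypothetical, and the variance uses the 1/n normalisation in its centred form, which equals the mean of the squares minus the squared sample mean.

```python
def sample_mean(xs):
    """Average of the observations: (1/n) * sum_i x_i."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """Mean squared deviation from the sample mean (1/n normalisation)."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


# Hypothetical measurements, for illustration only.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(sample_mean(data))      # 5.0
print(sample_variance(data))  # 4.0
```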

Page 5: Data Analysis Class 6: Hypothesis testing and confidence intervals

Moment matching

• Choose the parameters such that the first, second, … order moments are equal to their empirical estimates

...

Page 6: Data Analysis Class 6: Hypothesis testing and confidence intervals

Bernoulli / geometric / binomial

• Bernoulli first order moment: $\mu = p$ and $\hat{\mu} = \frac{1}{n}\sum_i x_i$, so $p = \frac{1}{n}\sum_i x_i$

• (Similar for geometric and binomial)

• Only one parameter, so higher order moments are not needed

Page 7: Data Analysis Class 6: Hypothesis testing and confidence intervals

Bernoulli / geometric / binomial

Empirical mean: p = 0.34, estimated from 100*10 Bernoulli outcomes

Page 8: Data Analysis Class 6: Hypothesis testing and confidence intervals

Gaussian

• First and second order moments: $\mu = \frac{1}{n}\sum_i x_i$ and $\sigma^2 = \frac{1}{n}\sum_i x_i^2 - \mu^2$

• Only two parameters, so no need for higher moments

Page 9: Data Analysis Class 6: Hypothesis testing and confidence intervals

Gaussian

Empirical means: 3.2 and 15.6

Empirical stds: 1.1 and 2.0

Page 10: Data Analysis Class 6: Hypothesis testing and confidence intervals

Multivariate Gaussian

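• Multivariate Gaussian density function: $P(\mathbf{X} = \mathbf{x}) = \frac{1}{\sqrt{|2\pi\boldsymbol{\Sigma}|}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$

• Parameters:
  – Mean vector mu
  – Covariance matrix Sigma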

Page 11: Data Analysis Class 6: Hypothesis testing and confidence intervals

Multivariate Gaussian

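• Parameters can be estimated from a set of samples {x1,x2,…,xn}: $\boldsymbol{\mu} = \frac{1}{n}\sum_i \mathbf{x}_i$ and $\boldsymbol{\Sigma} = \frac{1}{n}\sum_i (\mathbf{x}_i-\boldsymbol{\mu})(\mathbf{x}_i-\boldsymbol{\mu})^T$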
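A minimal Python sketch of these two estimates, written with plain lists so that it needs no external library. The sample values are hypothetical, and the function names are illustrative rather than taken from the course material.

```python
def mean_vector(samples):
    """Component-wise average of the sample vectors."""
    n = len(samples)
    d = len(samples[0])
    return [sum(x[j] for x in samples) / n for j in range(d)]


def covariance_matrix(samples):
    """(1/n) * sum_i (x_i - mu)(x_i - mu)^T, returned as a list of rows."""
    n = len(samples)
    mu = mean_vector(samples)
    d = len(mu)
    return [
        [sum((x[j] - mu[j]) * (x[k] - mu[k]) for x in samples) / n
         for k in range(d)]
        for j in range(d)
    ]


# Hypothetical 2-dimensional samples, for illustration only.
samples = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
mu = mean_vector(samples)           # [3.0, 6.0]
sigma = covariance_matrix(samples)  # symmetric 2 x 2 matrix
print(mu)
print(sigma)
```

In the lab session the same computation would run on 12-dimensional temperature vectors, where a matrix library is the more practical choice.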

Page 12: Data Analysis Class 6: Hypothesis testing and confidence intervals

Exponential / Poisson

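• Exponential mean: $\frac{1}{\lambda} = \frac{1}{n}\sum_i x_i$ (the $x_i$ are the times between events)

• Poisson mean: $\lambda = \frac{1}{n}\sum_i x_i$ (the $x_i$ are the numbers of events per unit time interval)

• Both will give approximately the same result (lambda = empirical number of events per unit of time)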
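The two routes to lambda can be compared on a small example. The sketch below is in Python; the event days are made up for illustration (they are not the plane crash data), and the function names are hypothetical. On a finite sample the two estimates agree only approximately.

```python
def rate_from_gaps(event_times):
    """Exponential moment matching: 1 / (mean time between consecutive events)."""
    gaps = [b - a for a, b in zip(event_times, event_times[1:])]
    return len(gaps) / sum(gaps)


def rate_from_counts(event_times, start, end, unit):
    """Poisson moment matching: mean number of events per interval of length `unit`."""
    n_intervals = int((end - start) // unit)
    counts = [0] * n_intervals
    for t in event_times:
        idx = int((t - start) // unit)
        if 0 <= idx < n_intervals:
            counts[idx] += 1
    return sum(counts) / n_intervals


# Hypothetical event days within a 400-day observation period.
event_days = [10, 50, 60, 130, 200, 260, 300, 390]

lam_per_day = rate_from_gaps(event_days)                    # events per day
lam_per_100 = rate_from_counts(event_days, 0, 400, 100)     # events per 100 days
print(lam_per_day * 100, lam_per_100)
```

Multiplying the per-day rate by 100 puts both estimates on the same unit time interval, which is the rescaling used on the plane crash slides.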

Page 13: Data Analysis Class 6: Hypothesis testing and confidence intervals

Exponential / Poisson

http://news.bbc.co.uk/1/hi/world/europe/2008892.stm

Significant plane crashes since 1 January 1998…

Page 14: Data Analysis Class 6: Hypothesis testing and confidence intervals

Exponential / Poisson

Lambda = 0.015 = 1/mean(time between crashes)

Page 15: Data Analysis Class 6: Hypothesis testing and confidence intervals

Exponential / Poisson

Lambda = 1.5 = average number of crashes in 100 days

(note: 100 times larger, because the unit time interval is 100 times larger too)

Page 16: Data Analysis Class 6: Hypothesis testing and confidence intervals

Hypothesis testing

• Given a distribution (and parameters)

• E.g. binomial: number of faulty items in a lot of a factory pipeline

• Empirical data may urge us to revise our hypothesis

Page 17: Data Analysis Class 6: Hypothesis testing and confidence intervals

Binomial distribution

• Consider a pipeline in a factory

• If the expected probability of a fault is (should be) 0.01, what can we conclude if we see a batch with 10% faults?

• p=0.01, n=100 (batch size), x=10 (number faulty)

• Probability to see something equally or more surprising = the p-value: P(X>=10)?

$P(X \ge 10) = \sum_{x=10}^{n} \binom{n}{x} p^x (1-p)^{n-x} \approx 7.6 \times 10^{-8}$

• Extremely small, so we should reject the hypothesis that p=0.01: the pipeline must be broken!?

Page 18: Data Analysis Class 6: Hypothesis testing and confidence intervals

Binomial distribution

• In practice:

• This can be computed by the cumulative binomial distribution function: $P(X \ge 10) = \sum_{x=10}^{n} \binom{n}{x} p^x (1-p)^{n-x} = 1 - P(X \le 9) \approx 7.6 \times 10^{-8}$

• In MATLAB (with p=0.01): 1-binocdf(9,100,p)
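For readers without MATLAB, the same tail probability can be sketched in Python by summing the binomial probability mass function directly. The helper name `binomial_tail` is an illustrative choice, not part of any library.

```python
from math import comb


def binomial_tail(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p), summing the probability mass function."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))


# Batch of n=100 items, 10 faulty, null hypothesis p=0.01.
p_value = binomial_tail(10, 100, 0.01)
print(f"{p_value:.1e}")  # 7.6e-08
```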

Page 19: Data Analysis Class 6: Hypothesis testing and confidence intervals

Poisson distribution

• Assume the expected number of plane crashes in 100 days is supposed to be 1.5. What can we conclude if there are 5 in a given 100 days? (This is the case for the 16th unit time interval.)

• The p-value = P(X>=5):

$P(X \ge 5) = \sum_{x=5}^{\infty} \frac{\lambda^x \exp(-\lambda)}{x!} \approx 0.02$

• The p-value is small: should we reject the null hypothesis that lambda=1.5?

Page 20: Data Analysis Class 6: Hypothesis testing and confidence intervals

Poisson distribution

• In practice:

• This can be computed by means of the cumulative Poisson distribution function: $P(X \ge 5) = \sum_{x=5}^{\infty} \frac{\lambda^x \exp(-\lambda)}{x!} = 1 - P(X \le 4) \approx 0.02$

• In MATLAB (with lambda=1.5): 1-poisscdf(4,lambda)
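A Python sketch of the same calculation, using only the standard library. The helper names `poisson_cdf` and `poisson_tail` are illustrative; the infinite sum is avoided by taking one minus the finite cumulative sum.

```python
from math import exp, factorial


def poisson_cdf(x, lam):
    """P(X <= x) for X ~ Poisson(lam)."""
    return sum(exp(-lam) * lam**k / factorial(k) for k in range(x + 1))


def poisson_tail(x, lam):
    """P(X >= x) = 1 - P(X <= x - 1)."""
    return 1.0 - poisson_cdf(x - 1, lam)


# 5 crashes observed in a 100-day interval, null hypothesis lambda=1.5.
p_value = poisson_tail(5, 1.5)
print(round(p_value, 3))  # 0.019
```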

Page 21: Data Analysis Class 6: Hypothesis testing and confidence intervals

Hypothesis testing

• In general:
  – Assume a null hypothesis for the data
    • Faults are Bernoulli random variables with given p
    • Crashes occur at a fixed rate lambda per unit time interval
  – Gather data
  – Compute a test statistic of the data
    • Number of faults in a batch of n
    • Number of crashes in a unit time interval
  – Compute the p-value: the probability that the test statistic is at least as large on random data from the null hypothesis
  – If the p-value is smaller than a threshold (0.01, 0.05…), reject the null hypothesis
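The recipe can be made concrete by simulating data from the null hypothesis and counting how often the test statistic is at least as large as the observed one. The Python sketch below does this for the factory example, under the assumption of a null value p=0.05 rather than 0.01, because a p-value near 7.6E-8 cannot be resolved with a few thousand simulations. The exact binomial tail for these values is about 0.028.

```python
import random


def simulated_p_value(observed, n, p, trials=20000, seed=0):
    """Fraction of batches drawn under the null with at least `observed` faults."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Test statistic: number of faults in one batch of n Bernoulli(p) items.
        faults = sum(rng.random() < p for _ in range(n))
        if faults >= observed:
            hits += 1
    return hits / trials


threshold = 0.05
estimate = simulated_p_value(observed=10, n=100, p=0.05)
reject = estimate < threshold
print(estimate, reject)
```

When the distribution of the test statistic is known, as it is here, the exact tail probability is preferable; the simulation only illustrates what the p-value measures.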

Page 22: Data Analysis Class 6: Hypothesis testing and confidence intervals

Hypothesis testing

• In general:
  – Hypothesis testing quantifies that a random variable will typically be close to its mean
  – This holds more strongly as the standard deviation is smaller

Page 23: Data Analysis Class 6: Hypothesis testing and confidence intervals

Permutation testing

• Sometimes, the distribution of the test statistic can be too complex

• Then: permutation testing
  – Generate random data sets by permuting the one sampled (1000 times)
  – Compute the fraction of times the test statistic is at least as large in those permuted versions
  – This is an approximation of the p-value

• (Assumption of this approach: permuted versions of the data are equally likely under the null hypothesis)

Page 24: Data Analysis Class 6: Hypothesis testing and confidence intervals

Permutation testing

• Test statistic = number of plane crashes in 16th unit time interval of 100 days

Page 25: Data Analysis Class 6: Hypothesis testing and confidence intervals

Permutation testing

• Generate 1000 random crash time series with the same number of crashes in the same period (e.g. by permuting the days)

• Compute the number of crashes in the 16th unit time (of 100 days)

• Compute the proportion of those 1000 permutations where the number of crashes in this interval was at least 5

• This is the p-value estimate!

• Result (in my experiment): 0.018 – very close to 0.02 as computed using Poisson
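The permutation procedure can be sketched in Python as follows. The crash days are synthetic (30 crashes over 2000 days, 5 of them in the 16th interval of 100 days) because the real crash dates are not part of these slides, so the resulting number is only illustrative and is not the 0.018 reported above.

```python
import random


def permutation_p_value(daily_counts, unit, interval_index,
                        n_permutations=1000, seed=0):
    """Estimate P(count in the chosen interval >= observed) by permuting the days."""
    rng = random.Random(seed)
    start = interval_index * unit
    observed = sum(daily_counts[start:start + unit])
    shuffled = list(daily_counts)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # same crashes, same period, random days
        if sum(shuffled[start:start + unit]) >= observed:
            hits += 1
    return observed, hits / n_permutations


# Synthetic series: 25 crashes before day 1500, 5 crashes in days 1500-1599.
n_days = 2000
crash_days = list(range(40, 1500, 60)) + [1510, 1530, 1550, 1570, 1590]
daily_counts = [0] * n_days
for day in crash_days:
    daily_counts[day] += 1

# interval_index=15 is the 16th unit time interval (0-based indexing).
observed, p_estimate = permutation_p_value(daily_counts, unit=100, interval_index=15)
print(observed, p_estimate)
```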

Page 26: Data Analysis Class 6: Hypothesis testing and confidence intervals

Confidence intervals

• Rather than computing a point estimate for the mean …

• … we can compute an interval for the mean

• A range of values that contains the mean with high confidence

Page 27: Data Analysis Class 6: Hypothesis testing and confidence intervals

Confidence intervals

• Consider a pipeline in a factory

• If the expected probability of a fault is (should be) 0.01, what can we conclude if we see a batch with 10% faults?

• n=100 (batch size), x=10 (number faulty)

• Let’s say: we reject the null hypothesis if p-value < delta=0.05

• p=0.01: p-value = 7.6E-8
  p=0.05: p-value = 0.028
  p=0.055: p-value = 0.05
  p=0.1: p-value = 0.55

• The set of all values of p for which the p-value >= 0.05 is the confidence interval for threshold delta=0.05 (confidence level 1-delta = 0.95):

[0.055,1]
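The lower end of this interval can be found numerically. The Python sketch below uses bisection, relying on the fact that P(X >= x) grows as p grows; the function names are illustrative and the tolerance is an arbitrary choice.

```python
from math import comb


def binomial_tail(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))


def lower_bound(x, n, delta, tol=1e-6):
    """Smallest p whose p-value P(X >= x) reaches delta, found by bisection."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binomial_tail(x, n, mid) >= delta:
            hi = mid   # mid is not rejected, so the bound is at or below mid
        else:
            lo = mid   # mid is rejected, so the bound is above mid
    return hi


# 10 faulty items in a batch of 100, threshold delta=0.05.
p_low = lower_bound(10, 100, 0.05)
print(round(p_low, 3))
```

Every p between `p_low` and 1 has a p-value of at least delta and therefore belongs to the one-sided interval.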

Page 28: Data Analysis Class 6: Hypothesis testing and confidence intervals

Confidence intervals

• Assume in a given unit time interval of 100 days, there are 5 crashes

• p-value threshold used: delta=0.01

• lambda = 1: p-value = 0.0037
  lambda = 1.28: p-value = 0.01
  lambda = 2: p-value = 0.053
  lambda = 4: p-value = 0.37

• Confidence interval for threshold delta=0.01 (confidence level 1-delta = 0.99): [1.28, infinity)

Page 29: Data Analysis Class 6: Hypothesis testing and confidence intervals

Confidence intervals

• This was one-sided

• Two-sided: for all lambda values in the interval:
  P(at least 5 crashes) >= 0.005
  P(at most 4 crashes) >= 0.005

• Two-sided confidence interval for threshold delta=0.01 (confidence level 1-delta = 0.99):

[1.08,12.6]

• Indeed:
  1-poisscdf(4,1.08) = 0.005
  poisscdf(4,12.6) = 0.005
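Both end points can be recovered by solving for the lambda at which each tail probability equals delta/2. The Python sketch below does so by bisection, mirroring the two poisscdf checks; the helper names and the search range [0, 50] are assumptions made for the example.

```python
from math import exp, factorial


def poisson_cdf(x, lam):
    """P(X <= x) for X ~ Poisson(lam)."""
    return sum(exp(-lam) * lam**k / factorial(k) for k in range(x + 1))


def solve_increasing(f, target, lo, hi, tol=1e-6):
    """Find x in [lo, hi] with f(x) = target, for a function f increasing in x."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) >= target:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


delta = 0.01
# Lower end: P(X >= 5) = 1 - P(X <= 4) increases with lambda.
lam_low = solve_increasing(lambda lam: 1.0 - poisson_cdf(4, lam), delta / 2, 0.0, 50.0)
# Upper end: P(X <= 4) decreases with lambda, so its negative increases.
lam_high = solve_increasing(lambda lam: -poisson_cdf(4, lam), -delta / 2, 0.0, 50.0)
print(round(lam_low, 2), round(lam_high, 1))
```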

Page 30: Data Analysis Class 6: Hypothesis testing and confidence intervals

Confidence intervals

• Other (more common) interpretation of confidence intervals:
  – With probability (over the sampled data) of at least the confidence level 1-delta, the confidence interval will contain the actual value

• You can verify that this is the case… (think about it)

[For any mean outside the interval, the probability of the observed test statistic (or more extreme) is less than delta. Hence, the probability over the data that the interval contains the actual mean is at least 1-delta]

Page 31: Data Analysis Class 6: Hypothesis testing and confidence intervals

Lab session

• On the temperature time series data:
  – Compute the 12-dimensional mean temperature over the year
  – Compute the covariance matrix
  – Visualize both in the report (using plot and imagesc)

• On the Titanic data:
  – Compute the probability of having died among first class passengers (report)
  – Compute the p-value for the null hypothesis that the probability of having died for third class passengers is the same (report)
  – Compute the probability of survival among all male passengers (report)
  – Compute the p-value for the null hypothesis that the probability of survival for female passengers is the same (report)

• On the plane crash data:
  – Make a histogram of number of plane crashes per time unit, starting on 1/1/1998, and fit a Poisson to it (as in the lecture), but with unit time interval equal to 50 days (report)
  – Find the unit time interval with the largest number of crashes (report)
  – Compute the p-value for this time interval both analytically using the Poisson cumulative distribution function as well as using permutation testing. What can you conclude, e.g. with p-value threshold equal to 0.01? (report)