
4 Hypothesis Testing, Oxford University Statistics (...myers/stats_materials/pdf_notes/hypothesisT.pdf)


4 Hypothesis Testing

Rather than looking at confidence intervals associated with model parameters, we might formulate a question associated with the data in terms of a hypothesis. In particular, we have a so-called null hypothesis, which refers to some basic premise to which we will adhere unless evidence from the data causes us to abandon it.

Example 4.1 In a clinical trial, data may be collected to compare two treatments (old v. new).

The null hypothesis is likely to be

no difference between treatments

The alternative hypothesis might be:

a) treatments are different (2-sided),
b) new treatment is better (1-sided),
c) old treatment is better (1-sided).

In general we are often in a position to specify the form of the p.m.f., $p(x;\theta)$, say, or the p.d.f., $f(x;\theta)$, but there is doubt about the value of the parameter $\theta$. All that is known is that $\theta$ is some element of a specified parameter space $\Theta$. We assume that the null hypothesis of interest specifies that $\theta$ is an element of some subset $\Theta_0$ of $\Theta$, and so is true if $\theta \in \Theta_0$ but false if $\theta \notin \Theta_0$.

Example 4.2

A coin is tossed and we hypothesise that it is fair. Hence $\Theta_0$ is the set $\left\{\frac{1}{2}\right\}$, containing just one element of the parameter space $\Theta = [0,1]$.

As a convention we shall denote the complement of $\Theta_0$ in $\Theta$ by $\Theta_1$. We call the original hypothesis that $\theta \in \Theta_0$ the null hypothesis and denote it by $H_0$. The hypothesis that $\theta \in \Theta_1$ is referred to as the alternative hypothesis and denoted by $H_1$.

4.1 Data and questions

Data set 2.3 (which we have seen before) Silver content of Byzantine coins

A number of coins from the reign of King Manuel I, Comnenus (1143-80) were discovered in Cyprus. They arise from four different coinages at intervals throughout his reign. The question of interest is whether there is any significant difference in their silver content with the passage of time; there is a suspicion that it was deliberately and steadily reduced. The data give the silver content (%Ag) of the coins.

Table 2.3 Silver content of coins

First  Second  Third  Fourth
5.9    6.9     4.9    5.3
6.8    9.0     5.5    5.6
6.4    6.6     4.6    5.5
7.0    8.1     4.5    5.1
6.6    9.3            6.2
7.7    9.2            5.8
7.2    8.6            5.8
6.9
6.2

On the face of it the suspicion could be correct, in that the fourth coinage would seem to have lower silver content than, say, the first coinage, but there is a need for firm statistical evidence if it is to be confirmed. Suppose the true percentage of silver in coinage $i$ is $\mu_i$. The null hypothesis would be

$$H_0 : \mu_1 = \mu_4$$

versus the alternative hypothesis

$$H_1 : \mu_1 \neq \mu_4.$$

If we believed, right from the start, that no monarch of the period would ever increasethe silver content, and that the only possibility of its changing would be in the directionof reduction, then the alternative hypothesis would be

$$H_1 : \mu_1 > \mu_4.$$

Data set 4.1 Patients with glaucoma in one eye

The following data (Ehlers, N., Acta Ophthalmologica, 48) give corneal thicknesses in microns for patients with one glaucomatous eye and one normal eye.

Table 4.1 Glaucoma in one eye

Corneal thickness
Glaucoma  Normal  Difference
488       484       4
478       478       0
480       492     -12
426       444     -18
440       436       4
410       398      12
458       464      -6
460       476     -16


Is there a difference in corneal thickness between the eyes? To answer this we take the differences Glaucoma $-$ Normal for each patient and test for the mean of those differences being zero.

$$H_0 : \mu = 0 \quad\text{against}\quad H_1 : \mu \neq 0.$$

Data set 4.2 Shoshoni rectangles

Most individuals, if asked to draw a rectangle, would produce something instinctively "not too square and not too long", something pleasing to the eye.

Figure 4.1 A rectangle of width $w$ and length $l$

The ancient Greeks called a rectangle golden if the ratio of its width to its length was
$$\frac{w}{l} = \frac{1}{2}\left(\sqrt{5} - 1\right) \approx 0.618.$$

This ratio is called the golden ratio.

Shoshoni Indians used beaded rectangles to decorate their leather goods. The table belowgives width-to-length ratios for 20 rectangles, analysed as part of a study in experimentalaesthetics.

Table 4.2 Shoshoni bead rectangles

Width-to-length ratios
0.693  0.670  0.654  0.749
0.606  0.553  0.601  0.609
0.672  0.662  0.606  0.615
0.844  0.570  0.933  0.576
0.668  0.628  0.690  0.611

The data are taken from Lowie's Selected Papers in Anthropology, ed. Dubois, University of California Press.

Did the Shoshoni instinctively make their rectangles conform to the golden ratio? In terms of hypothesis testing, we would like to test

$$H_0 : \mu = 0.618 \quad\text{against}\quad H_1 : \mu \neq 0.618.$$


Data set 4.3 Etruscan and Italian skull widths

The data comprise measurements of maximum head breadth (in cm) which were made in order to ascertain whether Etruscans were native Italians. They were published by Barnicott, N.A. and Brothwell, D.R. (1959), The Evaluation of Metrical Data in the Comparison of Ancient and Modern Bones, in Medical Biology and Etruscan Origins, Little, Brown and Co.

Table 4.3 Ancient Etruscan and modern Italian skull widths

Ancient Etruscan skulls       | Modern Italian skulls
141 147 126 140 141 150 142   | 133 124 129 139 144 140
148 148 140 146 149 132 137   | 138 132 125 132 137 130
132 144 144 142 148 142 134   | 130 132 136 130 140 137
138 150 142 137 135 142 144   | 138 125 131 132 136 134
154 149 141 148 148 143 146   | 134 139 132 128 135 130
142 145 140 154 152 153 147   | 127 127 127 139 126 148
150 149 145 137 143 149 140   | 128 133 129 135 139 135
146 158 135 139 144 146 142   | 138 136 132 133 131 138
155 143 147 143 141 149 140   | 136 121 116 128 133 135
158 141 146 140 143 138 137   | 131 131 134 130 138 138
150 144 141 131 147 142 152   | 126 125 125 130 133
140 144 136 143 146 149 145   | 120 130 128 143 137

The question asked by the archaeologists is whether there is evidence of different race from these two sets of measurements. The null hypothesis is that the two samples of measurements come from the same distribution; the alternative hypothesis is that there is a difference between the two samples.

Data set 4.4 Pielou's data on Armillaria root rot in Douglas fir trees

The data below were collected by the ecologist E.C. Pielou, who was interested in the pattern of healthy and diseased trees. The subject of her research was Armillaria root rot in a plantation of Douglas firs. She recorded the lengths of 109 runs of diseased trees and these are given below.

Table 4.4 Run lengths of diseased trees

Run length      1   2   3   4   5   6
Number of runs  71  28  5   2   2   1

On biological grounds, Pielou proposed a geometric distribution as a probability model. Is this plausible? Can we formally test the goodness of fit of a hypothesised distribution to data?
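Pielou's proposal can be checked informally by fitting the geometric distribution by maximum likelihood and comparing observed with expected run counts. The following Python sketch is ours, not part of the notes (the variable names are our own):

```python
# Sketch (not from the notes): fitting a geometric distribution to
# Pielou's run-length data of Table 4.4 by maximum likelihood.
run_lengths = [1, 2, 3, 4, 5, 6]
n_runs = [71, 28, 5, 2, 2, 1]          # 109 runs in total

total_runs = sum(n_runs)               # 109
total_trees = sum(l * n for l, n in zip(run_lengths, n_runs))
p_hat = total_runs / total_trees       # MLE of p is 1 / (mean run length)

# Expected number of runs of each length under Geometric(p_hat) on {1, 2, ...}
expected = [total_runs * p_hat * (1 - p_hat) ** (l - 1) for l in run_lengths]
print(total_trees, round(p_hat, 3), [round(e, 1) for e in expected])
```

The expected counts come out close to the observed ones, which is consistent with Pielou's model; a formal goodness-of-fit test would compare them with a chi-squared statistic.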


Data set 4.5 Flying bomb hits on London

The following data give the number of flying bomb hits recorded in each of 576 small areas of $\frac{1}{4}$ km$^2$ in the south of London during World War II.

Table 4.5 Flying bomb hits on London

Number of hits in an area  0    1    2   3   4   5   ≥6
Frequency                  229  211  93  35  7   1   0

Propaganda broadcasts claimed that the weapon could be aimed accurately. If, however, this was not the case, the hits should be randomly distributed over the area and should therefore be fitted by a Poisson distribution. Is this the case?
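As a quick check, the Poisson fit can be computed directly. The following sketch is ours, not from the notes, and uses only the Python standard library: it estimates the mean from Table 4.5 and compares observed with expected frequencies.

```python
# Sketch (not from the notes): fitting a Poisson distribution to the
# flying-bomb data of Table 4.5 and comparing observed with expected counts.
from math import exp, factorial

counts = [0, 1, 2, 3, 4, 5]
freqs = [229, 211, 93, 35, 7, 1]           # the ">= 6" category has frequency 0
n_areas = sum(freqs)                       # 576 areas
total_hits = sum(c * f for c, f in zip(counts, freqs))
lam = total_hits / n_areas                 # MLE of the Poisson mean

def pois_pmf(k, mean):
    """Poisson probability P(X = k) with the given mean."""
    return exp(-mean) * mean ** k / factorial(k)

expected = [n_areas * pois_pmf(k, lam) for k in counts]
chi_sq = sum((o - e) ** 2 / e for o, e in zip(freqs, expected))
print(total_hits, round(lam, 3), round(chi_sq, 2))
```

The chi-squared discrepancy turns out to be small, i.e. the Poisson model fits these data remarkably well, which is the classical conclusion for this data set.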

Data set 4.6 A famous and historic data set

These are Pearson's 1909 data on crime and drinking.

Table 4.6 Crime and drinking

Crime     Drinker  Abstainer
Arson     50       43
Rape      88       62
Violence  155      110
Stealing  379      300
Coining   18       14
Fraud     63       144

Is crime related to drinking?
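A two-way table of this kind is examined with a chi-squared test of independence. Here is a minimal sketch (ours, not from the notes, assuming scipy is available) using the standard `chi2_contingency` routine:

```python
# Sketch (not from the notes): a chi-squared test of independence for
# Pearson's crime/drinking data of Table 4.6.
from scipy.stats import chi2_contingency

#            Drinker  Abstainer
table = [[50, 43],    # Arson
         [88, 62],    # Rape
         [155, 110],  # Violence
         [379, 300],  # Stealing
         [18, 14],    # Coining
         [63, 144]]   # Fraud

chi_sq, p_value, dof, expected = chi2_contingency(table)
print(round(chi_sq, 1), dof, p_value)
```

The test has 5 degrees of freedom and gives a very small p-value, so there is strong evidence of association between crime type and drinking.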

Data set 4.7 Snoring and heart disease

The data in the table below come from a study which investigated whether snoring was related to heart disease (Norton, P.G. and Dunn, E.V. (1985), British Medical Journal, 291, pages 630-632).

Table 4.7 Snoring frequency and heart disease

Heart     Non-      Occasional  Snore nearly  Snore every
disease   snorers   snorers     every night   night         Total
Yes       24        35          21            30            110
No        1355      603         192           224           2374
Total     1379      638         213           254           2484

Is there an association between snoring frequency and heart disease?


4.2 Basic ideas

The first, and main, idea is that we need to use statistics which contain all of the relevant information about the parameter (or parameters) we are going to test: in other words we will be looking towards using sufficient statistics. Therefore it is hardly surprising that we usually use the same statistics as we would in calculating confidence intervals. Let us try to work out how we might do this by applying common sense to an example.

Example 4.3 Shoshoni bead rectangles

We want to test whether the Shoshoni instinctively made their rectangles conform to the golden ratio. That is, we want to test
$$H_0 : \mu = 0.618 \quad\text{against}\quad H_1 : \mu \neq 0.618.$$
Let us start by assuming the data are normally distributed (we haven't checked this, but let us proceed anyway) with mean $\mu$ and variance $\sigma^2$.

$$X_i \sim N(\mu, \sigma^2) \;\Rightarrow\; \bar{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right) \;\Rightarrow\; \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \sim N(0,1)$$

Since we do not know $\sigma^2$ we use the t-statistic,
$$\frac{\sqrt{n}(\bar{X} - \mu)}{S} \sim t(n-1).$$

We have 20 measurements, so under the null hypothesis $\mu = 0.618$ gives
$$\frac{\sqrt{20}(\bar{X} - 0.618)}{S} \sim t(19),$$
where $S^2 = \dfrac{1}{19}\sum_{i=1}^{20}(X_i - \bar{X})^2$.

For these data, $\bar{x} = 0.660$, $s = 0.093$. Let us see what evidence the data provide about the null hypothesis.

Look at the graph of the t-statistic with 19 degrees of freedom.


Figure 4.2 Graph of t(19) p.d.f.

Suppose we place the observed value of the above t-statistic on the graph. Where shouldit be? The observed value of T is

$$\frac{\sqrt{20}(\bar{x} - 0.618)}{s} = \frac{\sqrt{20}(0.660 - 0.618)}{0.093} = 2.019.$$

The t-distribution is symmetric and the observed value is to the right. Under the assumption that the null hypothesis holds, we can calculate the probability that a measurement of $T$ gives a value at least as extreme as the observed value. Because the alternative hypothesis is 2-sided this means calculating the following probability:

$$P(|T| \ge 2.019) = 0.058.$$

This says that (for a $t(19)$ distribution) the probability of a measurement of $T$ being at least as far into the tails as the observed value is 0.058. We write $p = 0.058$, and refer to $p$ as the p-value or significance level of the test. It tells us how far into the tails of the distribution our observed value of the test statistic $T$ lies under the null hypothesis; in this case $H_0 : \mu = 0.618$. A small p-value gives grounds for rejecting the null hypothesis in favour of the alternative. However $p = 0.058$ (interpreted as a roughly 1 in 17 chance) is not particularly small. There is some evidence for rejecting the null hypothesis, but the case for it is very weak.
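The calculation above can be reproduced numerically. Here is a minimal sketch (ours, not from the notes, assuming scipy is available); note that working from the raw data of Table 4.2 rather than the rounded values $\bar{x} = 0.660$, $s = 0.093$ gives $t \approx 2.05$ and a slightly different p-value.

```python
# Sketch (not from the notes): the one-sample t-test of Example 4.3,
# computed from the raw width-to-length ratios of Table 4.2.
from scipy.stats import ttest_1samp

ratios = [0.693, 0.670, 0.654, 0.749, 0.606, 0.553, 0.601, 0.609,
          0.672, 0.662, 0.606, 0.615, 0.844, 0.570, 0.933, 0.576,
          0.668, 0.628, 0.690, 0.611]

# H0: mu = 0.618 against the 2-sided alternative H1: mu != 0.618
t_stat, p_value = ttest_1samp(ratios, popmean=0.618)
print(round(t_stat, 3), round(p_value, 3))
```

The p-value again lands just above 0.05, matching the "weak evidence" conclusion in the text.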

Example 4.4 Patients with glaucoma in one eye

Here is Table 4.1 again, and we ask "Is there a difference in corneal thickness between the eyes?"

Table 4.1 Glaucoma in one eye

Corneal thickness
Glaucoma  Normal  Difference
488       484       4
478       478       0
480       492     -12
426       444     -18
440       436       4
410       398      12
458       464      -6
460       476     -16

Formally we are testing the difference $\mu$ between the corneal thicknesses.

$$H_0 : \mu = 0 \quad\text{against}\quad H_1 : \mu \neq 0.$$


Assuming the data to be normally distributed (for the sake of this example), the mean difference is $\bar{x} = -4$ and the estimated standard deviation is $s = 10.744$. Under $H_0$ we obtain a t-statistic of

$$t = \frac{\sqrt{n}\,\bar{x}}{s} = \frac{-4\sqrt{8}}{10.744} = -1.053.$$

The t-statistic has 7 degrees of freedom, for a p-value of 0.327. We cannot reject the null hypothesis of no difference in corneal thickness.

Figure 4.3 Graph of the $t(7)$ p.d.f.; the two tails beyond $\pm 1.053$ have total probability 0.327.

The p-value is different under different alternative hypotheses.

In Example 4.4 above the natural alternative hypothesis is $H_1 : \mu \neq 0$. However it is possible that we may believe that glaucoma can only reduce corneal thickness, and that no other outcome is possible. In such a case the alternative hypothesis would be $H_1 : \mu < 0$. Does this affect the p-value?

In such a case the tail of interest in the t-distribution would be the lower tail. For the one-sided alternative the upper tail no longer provides evidence against the null hypothesis, so the p-value becomes

$$p = P(T < -1.053) = 0.1635.$$
In this example, even with the alternative hypothesis $H_1 : \mu < 0$, there is no strong evidence to suggest that the null hypothesis is false. Whatever the alternative, we have no grounds to reject the hypothesis of no difference in the corneal thickness.
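Both the two-sided and one-sided p-values for Example 4.4 can be checked numerically. A sketch (ours, not from the notes; it assumes a scipy version recent enough, 1.6 or later, to accept the `alternative` argument):

```python
# Sketch (not from the notes): the paired test of Example 4.4, showing how
# the p-value changes between the 2-sided and 1-sided alternatives.
from scipy.stats import ttest_1samp

glaucoma = [488, 478, 480, 426, 440, 410, 458, 460]
normal   = [484, 478, 492, 444, 436, 398, 464, 476]
diffs = [g - n for g, n in zip(glaucoma, normal)]   # Glaucoma - Normal

t_stat, p_two = ttest_1samp(diffs, popmean=0)                      # H1: mu != 0
_,      p_one = ttest_1samp(diffs, popmean=0, alternative='less')  # H1: mu < 0
print(round(t_stat, 3), round(p_two, 3), round(p_one, 4))
```

The one-sided p-value is half the two-sided one here, because the observed t-statistic lies in the lower tail.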

Definition 4.1 Hypothesis test

A hypothesis test is conducted using a test statistic whose distribution is known under the null hypothesis $H_0$, and is used to consider the likely truth of the null hypothesis as opposed to a stated alternative hypothesis $H_1$.

Definition 4.2 p-value


The p-value (or significance level or size) is the probability of the test statistic taking a value, in the light of the alternative hypothesis, at least as extreme as its observed value. It is calculated under the assumption that the test statistic has the distribution which it would have if the null hypothesis were true.

If the alternative hypothesis is two-sided it will usually be the case that extreme values occur in two disjoint regions, corresponding to the two tails of the distribution under the null hypothesis.

4.3 Testing Normally distributed samples

The considerations of whether Shoshoni bead rectangles had proportions which conformed with the golden ratio, and whether glaucoma has an effect on corneal thickness, were each an example of a t-test, which is defined below.

Definition 4.3 t-test

A t-test is used for observations independently drawn from a normal distribution $N(\mu, \sigma^2)$ with unknown parameters. Given the sample mean $\bar{X}$ and the sample variance $S^2$, the test statistic is

$$T = \frac{\sqrt{n}(\bar{X} - \mu_0)}{S} \sim t(n-1)$$

under the null hypothesis $H_0 : \mu = \mu_0$.


Definition 4.4 Z-test

For observations drawn from a normal distribution $N(\mu, \sigma^2)$, but with $\sigma^2$ known, we use a Z-test of $H_0 : \mu = \mu_0$ with test statistic

$$Z = \frac{\sqrt{n}(\bar{X} - \mu_0)}{\sigma} \sim N(0,1)$$

under $H_0$.

Definition 4.5 Paired t-test

Suppose that we have pairs of random variables $(X_i, Y_i)$ and that $D_i = X_i - Y_i$, $i = 1, \ldots, n$, is a random sample from a normal distribution $N(\mu, \sigma^2)$ with unknown parameters. We use the test statistic

$$\frac{\sqrt{n}(\bar{D} - \mu_0)}{S_D} \sim t(n-1)$$

under the null hypothesis $H_0 : \mu = \mu_0$. Here $S_D^2$ is the sample variance of the differences $D_i$.

Example 4.4 (revisited) Patients with glaucoma in one eye

You have already seen how this operates by forming a single sample from the differences in corneal thickness between glaucomatous and normal eyes. Then the sample of differences is tested for zero mean. Note, however, that the differences need to be normally distributed, a point we rather glossed over earlier. We can check the validity of this assumption with a normal probability plot.

Figure 4.4 Normal probability plot of differences for glaucoma data


It's a bit rough, but it could be worse.

One particularly important use of a t-statistic occurs when we have two samples andwe wish to compare the means under the assumption of each sample having the sameunknown variance.

Definition 4.6 The two-sample t-test

Consider two random samples $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ which are independent, normally distributed with the same variance. The null hypothesis is $H_0 : \mu_X = \mu_Y$. Under $H_0$ we can construct a test statistic $T$ such that

$$T \sim t(m+n-2).$$

For the two-sample test, T is constructed under H0 as follows.

Step 1: Under $H_0$,
$$\bar{X} - \bar{Y} \sim N\!\left(0, \; \sigma^2\left(\frac{1}{m} + \frac{1}{n}\right)\right),$$
where $\sigma^2$ is the common variance.

Step 2:
$$\frac{(m-1)S_X^2}{\sigma^2} \sim \chi^2(m-1), \qquad \frac{(n-1)S_Y^2}{\sigma^2} \sim \chi^2(n-1)$$
$$\Longrightarrow \; \frac{(m-1)S_X^2 + (n-1)S_Y^2}{\sigma^2} \sim \chi^2(m+n-2).$$

Step 3: Thus, writing
$$S^2 = \frac{(m-1)S_X^2 + (n-1)S_Y^2}{m+n-2},$$
we obtain
$$T = \frac{\bar{X} - \bar{Y}}{S\sqrt{\frac{1}{m} + \frac{1}{n}}} \sim t(m+n-2)$$
under the null hypothesis, $H_0$.
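The three steps above can be sketched directly in code. For illustration only (the choice of samples is ours, not the notes'), we apply the pooled test to the first and fourth coinages of Data set 2.3, where a difference was suspected:

```python
# Sketch (not from the notes): the two-sample t-test of Definition 4.6,
# applied to the first and fourth coinages of Data set 2.3.
from math import sqrt
from scipy.stats import t as t_dist

first  = [5.9, 6.8, 6.4, 7.0, 6.6, 7.7, 7.2, 6.9, 6.2]
fourth = [5.3, 5.6, 5.5, 5.1, 6.2, 5.8, 5.8]
m, n = len(first), len(fourth)

def mean(xs):
    return sum(xs) / len(xs)

def ssq(xs):
    """Sum of squared deviations about the sample mean."""
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs)

# Pooled variance estimate S^2 = ((m-1)S_X^2 + (n-1)S_Y^2) / (m+n-2)
s2 = (ssq(first) + ssq(fourth)) / (m + n - 2)
t_stat = (mean(first) - mean(fourth)) / sqrt(s2 * (1 / m + 1 / n))
p_value = 2 * t_dist.sf(abs(t_stat), df=m + n - 2)  # 2-sided alternative
print(round(t_stat, 2), round(p_value, 4))
```

The resulting p-value is tiny, so on this pairing the data strongly support the suspicion of reduced silver content.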


Example 4.5 Etruscan and Italian skull widths

Table 4.3 Ancient Etruscan and modern Italian skull widths

Ancient Etruscan skulls       | Modern Italian skulls
141 147 126 140 141 150 142   | 133 124 129 139 144 140
148 148 140 146 149 132 137   | 138 132 125 132 137 130
132 144 144 142 148 142 134   | 130 132 136 130 140 137
138 150 142 137 135 142 144   | 138 125 131 132 136 134
154 149 141 148 148 143 146   | 134 139 132 128 135 130
142 145 140 154 152 153 147   | 127 127 127 139 126 148
150 149 145 137 143 149 140   | 128 133 129 135 139 135
146 158 135 139 144 146 142   | 138 136 132 133 131 138
155 143 147 143 141 149 140   | 136 121 116 128 133 135
158 141 146 140 143 138 137   | 131 131 134 130 138 138
150 144 141 131 147 142 152   | 126 125 125 130 133
140 144 136 143 146 149 145   | 120 130 128 143 137

The width measurements are taken with the aim of comparing modern-day Italians with ancient Etruscans. The null hypothesis is therefore that the mean skull width is the same. In what follows $X$ refers to the ancient Etruscan measurements and $Y$ to the modern Italian.

$$\bar{x} - \bar{y} = 11.33, \qquad m = 84, \qquad n = 70.$$

Using the formula in Definition 4.6 above, the value of the test statistic turns out to be 11.92. As we are just asking "is there a difference?", we need a 2-sided alternative hypothesis, and so we test against a $t(152)$ distribution and obtain

$$P(|T| \ge 11.92) = 0.0000.$$

The test provides overwhelming evidence to suggest that the two populations are ancestrally of different origin. Of course, we need to verify the plausibility of the data being normally distributed. This is easily done by calculating $X_i - \bar{X}$ for $i = 1, \ldots, m$ and $Y_i - \bar{Y}$ for $i = 1, \ldots, n$ (so that both samples now have the same mean, namely zero) and combining the whole lot into a single sample: then just look at a normal probability plot. Figure 4.5 shows such a plot for the skulls data, which look convincingly normal.


Figure 4.5 Normal probability plot for skulls data

Note that
$$P\left(-t_{\alpha/2}(m+n-2) \;<\; \frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{S\sqrt{\frac{1}{m} + \frac{1}{n}}} \;<\; t_{\alpha/2}(m+n-2)\right) = 1 - \alpha,$$
so the confidence interval for $\mu_X - \mu_Y$ at the 95% level is given by
$$\left(\left(\bar{X} - \bar{Y}\right) - S\,t_{\alpha/2}(m+n-2)\sqrt{\frac{1}{m} + \frac{1}{n}}, \;\; \left(\bar{X} - \bar{Y}\right) + S\,t_{\alpha/2}(m+n-2)\sqrt{\frac{1}{m} + \frac{1}{n}}\right).$$

For the skull data this works out to be $(9.45, 13.21)$. Note that we have a p-value less than 0.05 and the 95% confidence interval does not contain 0, the value of $\mu_X - \mu_Y$ under the null hypothesis. This is not a coincidence; there is a link here.


4.4 Hypothesis testing and confidence intervals

The connection here can best be illustrated with an example. Basically, when considering random samples with mean $\mu$ and null hypothesis $\mu = \mu_0$, a p-value of less than $\alpha$ is equivalent to the appropriate confidence interval at the $(1-\alpha)100\%$ level not containing $\mu_0$.

Example 4.6 Normal distribution

Suppose that we have a normal random sample of size $n$ with sample mean $\bar{X}$ and sample variance $S^2$, and suppose also that the alternative hypothesis is 2-sided. We can either (i) test for the mean $\mu = \mu_0$ using a t-test and calculate a p-value; or (ii) construct a central confidence interval for the mean and see whether or not it contains $\mu_0$. We use a central interval because the test is two-sided.

(i) Carry out a t-test,

$$T = \frac{\sqrt{n}(\bar{X} - \mu_0)}{S} \sim t(n-1), \quad\text{under } H_0 : \mu = \mu_0.$$

Let's suppose that it has observed value $t_0$ and corresponding p-value $p < \alpha$. Then

$$P(|T| \ge |t_0|) = p.$$

(ii) Construct a $(1-\alpha)100\%$ confidence interval. Here, remember, we make no assumptions about $\mu$. Given that $X_i$ has a normal distribution $N(\mu, \sigma^2)$, then $\frac{\sqrt{n}(\bar{X} - \mu)}{S}$ has a $t(n-1)$ distribution. Choose $t\,(>0)$ such that

$$P\left(\left|\frac{\sqrt{n}(\bar{X} - \mu)}{S}\right| \ge t\right) = \alpha.$$

Then the observed confidence interval for $\mu$ is $\left(\bar{x} - \dfrac{ts}{\sqrt{n}}, \; \bar{x} + \dfrac{ts}{\sqrt{n}}\right)$, where $\bar{x}, s$ are the observed values of $\bar{X}, S$. We must have

$$t_0 = \frac{\sqrt{n}(\bar{x} - \mu_0)}{s},$$
as $t_0$ is the observed value of the t-statistic. Since $p < \alpha$ we know

$$|t_0| > t \;\Longleftrightarrow\; t_0 \notin (-t, t) \;\Longleftrightarrow\; \mu_0 = \bar{x} - \frac{t_0 s}{\sqrt{n}} \notin \left(\bar{x} - \frac{ts}{\sqrt{n}}, \; \bar{x} + \frac{ts}{\sqrt{n}}\right).$$

Hence we can see that there is an equivalence between the test and the interval.
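The equivalence can be checked numerically. A sketch (ours, not from the notes, assuming scipy) revisits the Shoshoni data of Table 4.2 with $\mu_0 = 0.618$ and $\alpha = 0.05$: since the p-value there is just above 0.05, the 95% interval should just contain 0.618.

```python
# Sketch (not from the notes): checking the test/interval equivalence
# on the Shoshoni data, with mu0 = 0.618 and alpha = 0.05.
from math import sqrt
from scipy.stats import ttest_1samp, t as t_dist

ratios = [0.693, 0.670, 0.654, 0.749, 0.606, 0.553, 0.601, 0.609,
          0.672, 0.662, 0.606, 0.615, 0.844, 0.570, 0.933, 0.576,
          0.668, 0.628, 0.690, 0.611]
mu0, alpha, n = 0.618, 0.05, len(ratios)

x_bar = sum(ratios) / n
s = sqrt(sum((x - x_bar) ** 2 for x in ratios) / (n - 1))
t_crit = t_dist.ppf(1 - alpha / 2, df=n - 1)
ci = (x_bar - t_crit * s / sqrt(n), x_bar + t_crit * s / sqrt(n))

_, p_value = ttest_1samp(ratios, popmean=mu0)
# p < alpha exactly when mu0 falls outside the central (1 - alpha) interval
print(round(p_value, 3), tuple(round(c, 4) for c in ci))
```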


In each of the following we can construct a test statistic for $H_0 : \mu = \mu_0$ or construct a confidence interval for $\mu$.

(a) Basic t-test:
$$\frac{\sqrt{n}(\bar{X} - \mu)}{S} \sim t(n-1).$$

(b) Paired t-test: $D_i = X_i - Y_i$, $S^2 = \frac{1}{n-1}\sum\left(D_i - \bar{D}\right)^2$,
$$\frac{\sqrt{n}(\bar{D} - \mu)}{S} \sim t(n-1).$$

(c) Two-sample t-test, equal variance: $S^2 = \dfrac{(m-1)S_X^2 + (n-1)S_Y^2}{m+n-2}$,
$$\frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{S\sqrt{\frac{1}{m} + \frac{1}{n}}} \sim t(m+n-2).$$

For all of these test statistics we have, for each H1,

2-tailed            upper-tailed      lower-tailed
$\mu \neq \mu_0$    $\mu > \mu_0$     $\mu < \mu_0$

4.5 The magic 5% significance level (or p-value of 0.05)

The question arises in each example considered so far: what is the critical level for the p-value? Is there some generally accepted level at which null hypotheses are automatically rejected? Alas, the literature is filled with what purports to be the definitive answer to this question, which is so misleading and ridiculous that it needs special mention.

A significance level of $p < 0.05$ is often taken to be of interest, because it is below the "magic" level of 0.05. For example, suppose that we had tested a new drug (new drug versus standard drug) which, under the null hypothesis of no difference between the two drugs, gave $p = 0.04$. This says that the probability of the apparent difference between the two drugs being due to chance is less than 1 in 20. The p-value of 0.05 is the watershed used by the American control board (the FDA, which stands for Food and Drug Administration) which licenses new drugs from pharmaceutical companies. As a result it has been almost universally accepted right across the board in all walks of life.

However this level can be, to say the least, inappropriate and possibly even catastrophic.


Suppose, for example, we were considering test data for safety-critical software for a nuclear power station, $N$ representing the number of faults detected in the first 10 years. Would we be happy with a p-value on trials which suggests that

$$P(N \ge 1) = 0.05?$$

We might be more comfortable if $p = 0.0001$, but even then, given the number of power stations (over 1000 in Europe alone), we would be justified in worrying. The significance level which should be used in deciding whether or not to reject a null hypothesis ought to depend entirely on the question being asked; it quite properly should depend upon the consequences of being wrong. At the very least we should qualify our rejection with something like the following.

$0.05 < p \le 0.06$    "Weak evidence for rejection"
$0.03 < p \le 0.05$    "Reasonable evidence for rejection"
$0.01 < p \le 0.03$    "Good evidence for rejection"
$0.005 < p \le 0.01$   "Strong evidence for rejection"
$0.001 < p \le 0.005$  "Very strong evidence for rejection"
$0.0005 < p \le 0.001$ "Extremely strong evidence for rejection"
$p \le 0.0005$         "Overwhelming evidence for rejection"

4.6 The critical region

Suppose we have data $\mathbf{x} = x_1, x_2, \ldots, x_n$, $\mathbf{x} \in R_{\mathbf{x}}$, which constitute evidence about the truth or falsehood of a null hypothesis $H_0$. Suppose further that we have decided to formulate our test as a decision rule by selecting a p-value in advance, say $\alpha$, and rejecting the null hypothesis in situations where the data lead to a p-value less than or equal to $\alpha$. In such circumstances we can decide, in advance, on a region $C_1 \subseteq R_{\mathbf{x}}$ such that $H_0$ is rejected if $\mathbf{x} \in C_1$. Should $\mathbf{x} \in C_0$, the complement of $C_1$, $H_0$ is not rejected.

$C_1$ is called the critical region of the test and $\alpha$ is called the significance level.

Note that $\alpha$ is the probability of rejection of $H_0$ given that it is true. In other words,

$$P(\mathbf{x} \in C_1 \mid H_0) = \alpha.$$

Example 4.4 (revisited) Patients with glaucoma in one eye

The significance level is set at 0.05 and $H_0$ is rejected if $t \le -2.365$ or $t \ge 2.365$. Here $t = -1.053$ and the null hypothesis is not rejected.


Figure 4.6 The critical region: the tails beyond $\pm 2.365$ have total probability 0.05, and the observed value $-1.053$ lies between them.

4.7 Errors in hypothesis testing

There are two types of possible error. A Type I error is the error of rejecting the null hypothesis when it is, in fact, true. A Type II error is the error of not rejecting the null hypothesis when it is, in fact, false.

              $H_0$ not rejected   $H_0$ rejected
$H_0$ true    no error             Type I error
$H_0$ false   Type II error       no error

Thus
$$P(\text{Type I error}) = P(\mathbf{x} \in C_1 \mid H_0) = \alpha,$$
$$P(\text{Type II error}) = P(\mathbf{x} \in C_0 \mid H_1) = \beta.$$

The probability that $H_0$ is correctly rejected, $P(\mathbf{x} \in C_1 \mid H_1) = 1 - \beta$, is called the power of the test.

Example 4.7 Do air bags save lives?

Suppose that deaths in crashes involving a particular make of car have been at an average rate of 6 per week and that the company has introduced air bags. They want to use the figures over the next year (i.e. 52 weeks) to test their effectiveness. Assume the data are from a Poisson distribution with mean $\lambda$. The company plans to test

$$H_0 : \lambda = 6 \quad\text{against}\quad H_1 : \lambda < 6,$$

using a signi�cance level of 0.05.

$$Y = \sum_{i=1}^{52} X_i \sim \text{Poisson}(52\lambda)$$

and we use a critical region of the form

$$C_1 = \{y : y \le k\}.$$


Now the distribution of $Y$ may be approximated by $N(52\lambda, 52\lambda)$ or, under $H_0$, $N(312, 312)$.

$$0.05 = P(Y \le k) \simeq P\left(Z \le \frac{k - 312}{\sqrt{312}}\right),$$
where $Z \sim N(0,1)$, so that
$$\frac{k - 312}{\sqrt{312}} \simeq -1.645,$$
giving $k = 283$ to the nearest integer. The power of the test is $P(Y \le 283)$ where $Y \sim \text{Poisson}(52\lambda)$. Thus

$$\text{Power} \simeq P\left(Z \le \frac{283 - 52\lambda}{\sqrt{52\lambda}}\right) = \Phi\left(\frac{283 - 52\lambda}{\sqrt{52\lambda}}\right).$$

Note that at $\lambda = 6$ the power has value 0.05, and the power increases as $\lambda$ decreases, approaching 1 as $\lambda$ approaches 0.

Figure 4.7 The power function, which rises from 0.05 at $\lambda = 6$ towards 1 as $\lambda$ decreases to 0.
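The power function is easy to compute under the normal approximation. A standard-library sketch (ours, not from the notes; the function name is our own):

```python
# Sketch (not from the notes): the power function of Example 4.7 under
# the normal approximation, using only the Python standard library.
from math import sqrt
from statistics import NormalDist

def power(lam, k=283, n_weeks=52):
    """Approximate P(Y <= k) when Y ~ Poisson(n_weeks * lam),
    using the normal approximation N(n_weeks*lam, n_weeks*lam)."""
    mean = n_weeks * lam
    return NormalDist().cdf((k - mean) / sqrt(mean))

# Significance level at lam = 6, with the power rising as lam decreases
print(round(power(6), 3), round(power(5), 3), round(power(4), 3))
```

This reproduces the shape of Figure 4.7: about 0.05 at $\lambda = 6$, already above 0.9 at $\lambda = 5$, and essentially 1 by $\lambda = 4$.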

4.8 The Neyman-Pearson Lemma

Jerzy Neyman (1894 - 1981)


Lemma 4.1 The Neyman-Pearson Lemma

Let $\mathbf{X} = X_1, \ldots, X_n$ be a random sample from a distribution with parameter $\theta$, where $\theta \in \Theta = \{\theta_0, \theta_1\}$, and let $L(\theta; \mathbf{x})$ be the likelihood function. If there exists a test at significance level $\alpha$ such that, for some positive constant $k$,

(i) $\dfrac{L(\theta_0; \mathbf{x})}{L(\theta_1; \mathbf{x})} \le k$ for each $\mathbf{x} \in C_1$

and

(ii) $\dfrac{L(\theta_0; \mathbf{x})}{L(\theta_1; \mathbf{x})} > k$ for each $\mathbf{x} \in C_0$,

then this test is most powerful at significance level $\alpha$ for testing the null hypothesis $H_0 : \theta = \theta_0$ against the alternative hypothesis $H_1 : \theta = \theta_1$.

Proof The proof is given for sampling from continuous distributions.

Suppose there exists a critical region $C_1$ with the properties in the statement of the lemma. Let $A_1$ be the critical region of any other test at significance level $\alpha$. Then
$$\int_{C_1} f(\mathbf{x}; \theta_0)\,d\mathbf{x} = \int_{A_1} f(\mathbf{x}; \theta_0)\,d\mathbf{x} = \alpha,$$
where $f(\mathbf{x}; \theta_0)$ is the joint p.d.f. of the random sample $\mathbf{X}$.

Now
$$A_1 \cup C_1 = A_1 \cup (C_1 \setminus A_1) = C_1 \cup (A_1 \setminus C_1).$$
It follows therefore that
$$\int_{A_1} f(\mathbf{x}; \theta_1)\,d\mathbf{x} + \int_{C_1 \setminus A_1} f(\mathbf{x}; \theta_1)\,d\mathbf{x} = \int_{C_1} f(\mathbf{x}; \theta_1)\,d\mathbf{x} + \int_{A_1 \setminus C_1} f(\mathbf{x}; \theta_1)\,d\mathbf{x},$$
or
$$\int_{C_1} f(\mathbf{x}; \theta_1)\,d\mathbf{x} - \int_{A_1} f(\mathbf{x}; \theta_1)\,d\mathbf{x} = \int_{C_1 \setminus A_1} f(\mathbf{x}; \theta_1)\,d\mathbf{x} - \int_{A_1 \setminus C_1} f(\mathbf{x}; \theta_1)\,d\mathbf{x}.$$

Now, by property (i), $f(\mathbf{x}; \theta_1) \ge \frac{1}{k} f(\mathbf{x}; \theta_0)$ for each point of $C_1$ and hence for each point of $C_1 \setminus A_1$. Furthermore, by property (ii), $f(\mathbf{x}; \theta_1) < \frac{1}{k} f(\mathbf{x}; \theta_0)$ for each point of $C_0$ and hence for each point of $A_1 \setminus C_1$. Therefore
$$\int_{C_1 \setminus A_1} f(\mathbf{x}; \theta_1)\,d\mathbf{x} \ge \frac{1}{k} \int_{C_1 \setminus A_1} f(\mathbf{x}; \theta_0)\,d\mathbf{x}$$


and
$$\int_{A_1 \setminus C_1} f(\mathbf{x}; \theta_1)\,d\mathbf{x} < \frac{1}{k} \int_{A_1 \setminus C_1} f(\mathbf{x}; \theta_0)\,d\mathbf{x}.$$
Substituting, we obtain
$$\int_{C_1} f(\mathbf{x}; \theta_1)\,d\mathbf{x} - \int_{A_1} f(\mathbf{x}; \theta_1)\,d\mathbf{x} \ge \frac{1}{k}\left(\int_{C_1 \setminus A_1} f(\mathbf{x}; \theta_0)\,d\mathbf{x} - \int_{A_1 \setminus C_1} f(\mathbf{x}; \theta_0)\,d\mathbf{x}\right)$$
$$= \frac{1}{k}\left(\int_{C_1} f(\mathbf{x}; \theta_0)\,d\mathbf{x} - \int_{A_1} f(\mathbf{x}; \theta_0)\,d\mathbf{x}\right) = 0.$$
That is, the power of $C_1$ is at least that of $A_1$, which is what we set out to prove.

Example 4.8 Insect traps

Gilchrist (1984) refers to an experiment in which a total of 33 insect traps were set out across sand dunes and the numbers of insects caught in a fixed time were counted. The table gives the number of traps containing various numbers of the taxa Staphylinoidea.

Count      0   1   2   3   4   5   6   $\ge 7$
Frequency  10  9   5   5   1   2   1   0

Assuming the data to come from a Poisson distribution with mean $\theta$, we wish to test the null hypothesis $H_0: \theta = \theta_0 = 1$ against the alternative $H_1: \theta = \theta_1 > 1$.

The likelihood is
$$L(\theta; \mathbf{x}) = \frac{e^{-n\theta}\,\theta^{\sum x_i}}{\prod x_i!}$$
so the Neyman-Pearson Lemma gives
$$\frac{e^{-n\theta_0}\,\theta_0^{\sum x_i}}{\prod x_i!} \bigg/ \frac{e^{-n\theta_1}\,\theta_1^{\sum x_i}}{\prod x_i!} \le k$$
or
$$\frac{e^{-n\theta_0}\,\theta_0^{\sum x_i}}{e^{-n\theta_1}\,\theta_1^{\sum x_i}} \le k.$$
Taking logs of both sides,
$$-n\theta_0 + n\theta_1 + (\log\theta_0 - \log\theta_1)\sum x_i \le \log k$$
and, rearranging,
$$(\log\theta_0 - \log\theta_1)\sum x_i \le \log k + n\theta_0 - n\theta_1.$$
Since $\theta_1 > \theta_0$, so that $\log\theta_0 - \log\theta_1 < 0$, we obtain from the Neyman-Pearson Lemma a best critical region of the form
$$\sum x_i \ge C.$$


$\sum X_i \sim \text{Poisson}(33\theta)$, so, under $H_0$,
$$\sum X_i \sim \text{Poisson}(33).$$
In fact, the total number of insects counted was 54 and
$$P\left(\sum X_i \ge 54 \mid \theta_0 = 1\right) = 0.000487.$$
This is strong evidence for rejection.

Note that the observed mean is $54/33 = 1.64$, which does not appear to be all that far from 1, but appearances can be deceptive. □
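As a check on the arithmetic, the tail probability can be computed directly by summing the Poisson p.m.f.; the sketch below is plain Python (no external libraries) and accumulates the p.m.f. terms recursively to avoid large factorials.

```python
from math import exp

def poisson_sf(k, mu):
    """P(X >= k) for X ~ Poisson(mu), via the complementary c.d.f."""
    term = exp(-mu)          # P(X = 0)
    cdf = 0.0
    for j in range(k):       # add P(X = 0), ..., P(X = k - 1)
        cdf += term
        term *= mu / (j + 1)
    return 1.0 - cdf

# Example 4.8: 33 traps, 54 insects in total; under H0 the total is Poisson(33)
p_value = poisson_sf(54, 33.0)
print(f"P(sum X_i >= 54 | theta_0 = 1) = {p_value:.6f}")  # close to the 0.000487 quoted above
```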

Example 4.9 Teenagers with anorexia

An experiment in the treatment of anorexia by cognitive behavioural therapy was carried out on 29 teenage females. Weights in kilograms before treatment and after 6 weeks of treatment were recorded and the change in weight of each individual, namely (weight after treatment $-$ weight before treatment), was calculated. The mean weight difference of the sample was 3.007 kg with standard deviation 7.309 kg. We want to test
$$H_0: \mu = 0 \quad\text{against}\quad H_1: \mu = \mu_1 > 0.$$
Note that the test is one-sided because we are actually asking the question "does the therapy have a beneficial effect?" We assume that it either results in weight gain or makes no difference.

Let us look at the question from a Neyman-Pearson point of view. Start by assuming the data are normally distributed (we haven't checked this, but let us proceed anyway) with mean $\mu$ and variance $\sigma^2$.

$$X_i \sim N(\mu, \sigma^2) \;\Rightarrow\; L(\mu, \sigma^2; \mathbf{x}) = \left(2\pi\sigma^2\right)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right).$$

From the Neyman-Pearson lemma, the test has the form
$$\frac{(2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2\right)}{(2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu_1)^2\right)} \le k.$$

Taking logs,
$$-\frac{1}{2\sigma^2}\sum_{i=1}^n\left(x_i^2 - (x_i - \mu_1)^2\right) \le \log k$$
or
$$\sum_{i=1}^n\left(x_i^2 - (x_i - \mu_1)^2\right) \ge -2\sigma^2\log k.$$
This may be written
$$\sum_{i=1}^n \mu_1(2x_i - \mu_1) \ge -2\sigma^2\log k$$
or, since $\mu_1 > 0$,
$$\sum_{i=1}^n x_i \ge \text{constant}$$
or, equivalently, $\bar{x} \ge \text{constant}$.

Under $H_0$ the random variable $\bar{X} \sim N(0, \sigma^2/n)$. Since we do not know $\sigma^2$ we use the t-statistic,
$$\frac{\sqrt{n}\,\bar{X}}{S} \sim t(n-1).$$

For a test at the 5% level of significance, the upper 5% tail of $t(28)$ is cut off by 1.701, so the critical region is given by
$$\bar{x} \ge \frac{1.701\,s}{\sqrt{n}} = \frac{1.701 \times 7.309}{\sqrt{29}} = 2.309.$$

The observed value of $\bar{x}$ is 3.007, so we reject the null hypothesis at the 5% level. Note that the 2% tail is cut off by 2.154, giving a critical region of
$$\bar{x} \ge 2.923,$$
so, with an observed value of 3.007, we can actually reject at the 2% level. □

Notice that we have said absolutely nothing about the alternative hypothesis in the example above. Apart from its being positive, so that the direction of the inequality was not altered when we divided through by it, it hasn't figured in the calculation. We shall have more to say about this in the next section.
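The two critical regions can be reproduced with a few lines of arithmetic; the sketch below takes the tabulated $t(28)$ points quoted above (1.701 and 2.154) as given rather than computing them.

```python
from math import sqrt

n, xbar, s = 29, 3.007, 7.309   # summary statistics from Example 4.9

# Tabulated upper-tail points of t(28): 5% -> 1.701, 2% -> 2.154
for alpha, t_upper in [(0.05, 1.701), (0.02, 2.154)]:
    critical = t_upper * s / sqrt(n)   # reject H0: mu = 0 when xbar >= critical
    decision = "reject" if xbar >= critical else "do not reject"
    print(f"alpha = {alpha}: reject if xbar >= {critical:.3f} -> {decision} H0")
```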

4.9 Uniformly most powerful tests

We have seen that the Neyman-Pearson test applies to a single point null hypothesis against a single point alternative. You might think that this is of no practical use whatsoever because just about all applications involve composite alternatives and sometimes even a composite null hypothesis as well. You saw this with Example 4.8 on insect traps, where the alternative was a point value of $\theta_1 > 1$, and with Example 4.9 on anorexia, where the alternative was $\mu_1 > 0$.

But suppose we were to regard a composite alternative hypothesis as being made up of a set of simple alternatives and we were to test each simple hypothesis in turn according to the theorem. Could we obtain a uniformly most powerful test?

Definition 4.6 Uniformly most powerful test

A uniformly most powerful test at significance level $\alpha$ is a test such that its power function $Q$ satisfies


(i) $Q(\theta_0) = \alpha$;

(ii) $Q(\theta)$ is at least as large as the power of any other test at significance level $\alpha$ for each $\theta \in \Theta_1$.

This is illustrated below, where the solid curve represents the power function of a uniformly most powerful test and the broken curve relates to any other test at the same significance level.

[Figure 4.8 Power functions: both curves pass through $\alpha$ at $\theta_0$; the solid curve (uniformly most powerful test) lies above the broken curve (any other test) for $\theta \in \Theta_1$.]

4.9.1 Tests involving one-sided alternative hypotheses

Let us re-examine Example 4.8 involving insect traps. We obtained the critical region
$$\sum x_i \ge \text{constant}$$
where, under $H_0$, $\sum X_i \sim \text{Poisson}(33\theta_0)$. Writing
$$\Phi_\theta(k) = \sum_{j=0}^{k} \frac{e^{-33\theta}(33\theta)^j}{j!}$$
for the c.d.f. of $\sum X_i$, we have a critical region of size $\alpha$,
$$C_1 = \left\{\mathbf{x} : \sum x_i > k\right\}, \qquad \alpha = 1 - \Phi_{\theta_0}(k).$$

The crucial feature is that this critical region does not depend upon the value of $\theta_1$.

The test is simultaneously most powerful at significance level $\alpha$ for testing
$$H_0: \theta = \theta_0 \quad\text{against}\quad H_1: \theta = \theta_0 + \delta$$
as $\delta$ ranges over $\mathbb{R}^+$. In other words the test is for $H_0: \theta = \theta_0$ against the composite alternative $H_1: \theta > \theta_0$.


Example 4.10 Uniform distribution

$X = (X_1, \ldots, X_n)$ constitute a random sample from $U(0, \theta)$. Let us construct a test at significance level 0.1 of $H_0: \theta = 1$ against the composite alternative $H_1: \theta > 1$. We have
$$L(\theta; \mathbf{x}) = \begin{cases} \theta^{-n}, & 0 < x_i < \theta,\ 1 \le i \le n, \\ 0, & \text{otherwise.} \end{cases}$$
The Neyman-Pearson lemma gives
$$\frac{L(1; \mathbf{x})}{L(\theta_1; \mathbf{x})} = \begin{cases} \theta_1^n, & 0 < x_{(1)},\ x_{(n)} < 1, \\ 0, & 0 < x_{(1)},\ 1 < x_{(n)} < \theta_1, \end{cases}$$
and a critical region $C_1 = \{\mathbf{x} : x_{(n)} \ge c\}$. We need
$$P(X \in C_1; \theta = 1) = 0.1,$$
that is,
$$P(X_{(n)} \ge c; \theta = 1) = 0.1.$$
$X_{(n)}$ has c.d.f. $F_{(n)}(x) = x^n$, $x \in (0, 1)$, so
$$1 - c^n = 0.1, \qquad c = 0.9^{1/n}.$$
The test with critical region $C_1 = \{\mathbf{x} : x_{(n)} \ge 0.9^{1/n}\}$ is most powerful. Since this critical region does not depend upon $\theta_1$, the test is uniformly most powerful against the composite hypothesis $H_1: \theta > 1$.

The power function of this test is
$$Q(\theta) = P(X_{(n)} \ge 0.9^{1/n}; \theta), \qquad \theta \in [1, \infty),$$
$$= 1 - F_{(n)}\left(0.9^{1/n}\right) = 1 - \left(\frac{0.9^{1/n}}{\theta}\right)^n = 1 - \frac{0.9}{\theta^n}.$$
The graphs of this function for $n = 1$ and $n = 5$ are given below. Note that the larger the value of $n$, the better the power.

[Figure 4.9 Power functions for $n = 1$ and $n = 5$: both curves start at 0.1 when $\theta = 1$, with the $n = 5$ curve rising towards 1 much faster.]
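The closed-form power function $Q(\theta) = 1 - 0.9/\theta^n$ is easy to explore numerically; this small sketch checks the two facts the figure illustrates (size 0.1 at $\theta = 1$, and more power for larger $n$).

```python
def power_uniform(theta, n):
    """Power Q(theta) = 1 - 0.9 / theta**n of the level-0.1 UMP test (Example 4.10)."""
    return 1.0 - 0.9 / theta ** n

print(power_uniform(1.0, 1), power_uniform(1.0, 5))   # both equal the size 0.1 at theta = 1
print(power_uniform(1.5, 1), power_uniform(1.5, 5))   # larger n gives more power
```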


4.9.2 Tests involving two-sided alternative hypotheses

Now consider the problem of testing $H_0: \theta = \theta_0$ against the two-sided alternative $H_1: \theta \ne \theta_0$.

It is easy to see that failure to specify the direction of the alternative hypothesis makes the methods we have been using inapplicable. There is no uniformly most powerful test. When the true value of $\theta$ is greater than $\theta_0$, a two-tailed test cannot be more powerful than a one-tailed test which takes account of that information.

We can, however, have an unbiased test.

A test at significance level $\alpha$ for testing $H_0: \theta \in \Theta_0$ against $H_1: \theta \in \Theta_1$ is said to be unbiased if
$$Q(\theta) \ge \alpha \quad\text{for all } \theta \in \Theta_1.$$
This means that the power is never less than the significance level; the probability of rejection of $H_0$ at any element of $\Theta_0$ is necessarily no greater than the probability of rejection at any element of $\Theta_1$.

4.10 Summary of hypothesis testing

There are four main ingredients to a test.

- The critical region $C_1$.
- The sample size $n$.
- The significance level (or size) $\alpha = P(\mathbf{x} \in C_1 \mid H_0)$.
- The power $Q = P(\mathbf{x} \in C_1 \mid H_1)$.

If any two of these are known, the other two may be determined.

Example 4.8 (revisited) Sample size calculation

Suppose that, before carrying out the test for the insect traps $H_0: \theta = \theta_0 = 1$ against $H_1: \theta = \theta_1 > 1$, we wanted to determine a suitable sample size. Suppose further that we wanted to specify a significance level of $\alpha = 0.01$ and that we wished to ensure that the test would be powerful enough to reject the null hypothesis by specifying a power of 0.95 for a value of $\theta_1 = 1.5$. We know that $\sum X_i \sim \text{Poisson}(n\theta)$, so, under $H_0$, $\sum X_i \sim \text{Poisson}(n)$, and under $H_1$ with $\theta_1 = 1.5$ we know that $\sum X_i \sim \text{Poisson}(1.5n)$.

A normal approximation would give us, under $H_0$, $\sum X_i \approx N(n, n)$ so that
$$\frac{\sum X_i - n}{\sqrt{n}} \approx N(0, 1) \;\Rightarrow\; P\left(\frac{\sum X_i - n}{\sqrt{n}} \ge 2.326\right) \simeq 0.01$$
and the critical region is
$$\sum x_i \ge n + 2.326\sqrt{n}.$$

For a power of 0.95, we require
$$P\left(\sum X_i \ge n + 2.326\sqrt{n} \;\Big|\; \theta_1 = 1.5\right) = 0.95,$$
which may be re-written
$$P\left(\frac{\sum X_i - 1.5n}{\sqrt{1.5n}} \ge \frac{-0.5n + 2.326\sqrt{n}}{\sqrt{1.5n}}\right) = 0.95.$$

For a standard normal distribution, $P(Z \ge -1.645) = 0.95$, so the approximate sample size can be calculated from
$$\frac{-0.5n + 2.326\sqrt{n}}{\sqrt{1.5n}} \simeq -1.645$$
giving
$$\sqrt{n} = 8.681, \qquad n = 75.367,$$
so the recommended sample size would be 76. □
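The algebra above reduces to solving a linear equation in $\sqrt{n}$; a short sketch using the same normal-approximation points 2.326 and 1.645:

```python
from math import sqrt, ceil

z_alpha, z_beta = 2.326, 1.645   # upper 1% and 5% standard normal points
theta0, theta1 = 1.0, 1.5

# Rearranging -(theta1 - theta0)*n + z_alpha*sqrt(theta0*n) = -z_beta*sqrt(theta1*n)
root_n = (z_alpha * sqrt(theta0) + z_beta * sqrt(theta1)) / (theta1 - theta0)
n = root_n ** 2
print(root_n, n, ceil(n))   # ~8.681, ~75.37, recommended n = 76
```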

4.11 The Likelihood Ratio Test

4.11.1 The likelihood ratio

We often want to test in situations where the adopted probability model involves several unknown parameters. Thus we may denote an element of the parameter space by
$$\theta = (\theta_1, \theta_2, \ldots, \theta_k).$$
Some of these parameters may be nuisance parameters (e.g. testing hypotheses on the unknown mean of a normal distribution with unknown variance, where the variance is regarded as a nuisance parameter).

We use the likelihood ratio, $\lambda(\mathbf{x})$, defined as
$$\lambda(\mathbf{x}) = \frac{\sup\{L(\theta; \mathbf{x}) : \theta \in \Theta_0\}}{\sup\{L(\theta; \mathbf{x}) : \theta \in \Theta\}}, \qquad \mathbf{x} \in \mathbb{R}^n_X.$$

The informal argument for this is as follows.

For a realisation $\mathbf{x}$, determine its best chance of occurrence under $H_0$ and also its best chance overall. The ratio of these two chances can never exceed unity, but, if small, would constitute evidence for rejection of the null hypothesis.

A likelihood ratio test for testing $H_0: \theta \in \Theta_0$ against $H_1: \theta \in \Theta_1$ is a test with critical region of the form
$$C_1 = \{\mathbf{x} : \lambda(\mathbf{x}) \le k\},$$
where $k$ is a real number between 0 and 1.


Clearly the test will be at significance level $\alpha$ if $k$ can be chosen to satisfy
$$\sup\{P(\lambda(X) \le k; \theta) : \theta \in \Theta_0\} = \alpha.$$
If $H_0$ is a simple hypothesis with $\Theta_0 = \{\theta_0\}$, we have the simpler form
$$P(\lambda(X) \le k; \theta_0) = \alpha.$$
To determine $k$, we must look at the c.d.f. of the random variable $\lambda(X)$, where the random sample $X$ has joint p.d.f. $f_X(\mathbf{x}; \theta_0)$.

Example 4.11 Exponential distribution

Test $H_0: \theta = \theta_0$ against $H_1: \theta > \theta_0$.

Here $\Theta_0 = \{\theta_0\}$ and $\Theta = [\theta_0, \infty)$. The likelihood function is
$$L(\theta; \mathbf{x}) = \prod_{i=1}^n f(x_i; \theta) = \theta^n e^{-\theta\sum x_i}.$$
The numerator of the likelihood ratio is
$$L(\theta_0; \mathbf{x}) = \theta_0^n e^{-n\theta_0\bar{x}}.$$
We need to find the supremum as $\theta$ ranges over the interval $[\theta_0, \infty)$. Now
$$l(\theta; \mathbf{x}) = n\log\theta - n\theta\bar{x}$$
so that
$$\frac{\partial l(\theta; \mathbf{x})}{\partial\theta} = \frac{n}{\theta} - n\bar{x},$$
which is zero only when $\theta = 1/\bar{x}$. Since $L(\theta; \mathbf{x})$ is an increasing function for $\theta < 1/\bar{x}$ and decreasing for $\theta > 1/\bar{x}$,
$$\sup\{L(\theta; \mathbf{x}) : \theta \in \Theta\} = \begin{cases} \bar{x}^{-n} e^{-n}, & \text{if } 1/\bar{x} \ge \theta_0, \\ \theta_0^n e^{-n\theta_0\bar{x}}, & \text{if } 1/\bar{x} < \theta_0. \end{cases}$$


[Figure 4.10 Likelihood function]

$$\lambda(\mathbf{x}) = \begin{cases} \dfrac{\theta_0^n e^{-n\theta_0\bar{x}}}{\bar{x}^{-n} e^{-n}}, & 1/\bar{x} \ge \theta_0, \\ 1, & 1/\bar{x} < \theta_0, \end{cases} \;=\; \begin{cases} \theta_0^n\,\bar{x}^n e^{-n\theta_0\bar{x}} e^n, & 1/\bar{x} \ge \theta_0, \\ 1, & 1/\bar{x} < \theta_0. \end{cases}$$

Since
$$\frac{d}{d\bar{x}}\left(\bar{x}^n e^{-n\theta_0\bar{x}}\right) = n\bar{x}^{n-1} e^{-n\theta_0\bar{x}}(1 - \theta_0\bar{x})$$
is positive for values of $\bar{x}$ between 0 and $1/\theta_0$, where $\theta_0 > 0$, it follows that $\lambda(\mathbf{x})$ is a non-decreasing function of $\bar{x}$. Therefore the critical region of the likelihood ratio test is of the form
$$C_1 = \left\{\mathbf{x} : \sum_{i=1}^n x_i \le c\right\}.$$
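The monotonicity claim can be checked numerically; the sketch below evaluates the piecewise expression for $\lambda$ just derived on a grid of $\bar{x}$ values ($\theta_0 = 2$ and $n = 5$ are arbitrary illustrative choices, not from the text).

```python
from math import exp

def lam(xbar, n, theta0):
    """Likelihood ratio lambda as a function of xbar (Example 4.11)."""
    if 1 / xbar < theta0:
        return 1.0
    return (theta0 * xbar) ** n * exp(n - n * theta0 * xbar)

vals = [lam(x / 10, 5, 2.0) for x in range(1, 11)]   # xbar = 0.1, ..., 1.0
print([round(v, 4) for v in vals])   # non-decreasing, reaching 1 at xbar = 1/theta0
```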

Example 4.12 The one-sample t-test

The null hypothesis is $H_0: \mu = \mu_0$ for the mean of a normal distribution with unknown variance $\sigma^2$.

We have
$$\Theta = \{(\mu, \sigma^2) : \mu \in \mathbb{R},\ \sigma^2 \in \mathbb{R}^+\}, \qquad \Theta_0 = \{(\mu, \sigma^2) : \mu = \mu_0,\ \sigma^2 \in \mathbb{R}^+\}$$
and
$$f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right), \qquad x \in \mathbb{R}.$$


The likelihood function is
$$L(\mu, \sigma^2; \mathbf{x}) = (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right).$$
Since
$$l(\mu_0, \sigma^2; \mathbf{x}) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu_0)^2$$
and
$$\frac{\partial l}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (x_i - \mu_0)^2,$$
which is zero when
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \mu_0)^2,$$
we conclude that
$$\sup_{\Theta_0} L(\mu_0, \sigma^2; \mathbf{x}) = \left(\frac{2\pi}{n}\sum_{i=1}^n (x_i - \mu_0)^2\right)^{-n/2} e^{-n/2}.$$
For the denominator, we already know from previous examples that the m.l.e. of $\mu$ is $\bar{x}$, so
$$\sup_{\Theta} L(\mu, \sigma^2; \mathbf{x}) = \left(\frac{2\pi}{n}\sum_{i=1}^n (x_i - \bar{x})^2\right)^{-n/2} e^{-n/2}$$
and
$$\lambda(\mathbf{x}) = \left(\frac{\sum_{i=1}^n (x_i - \mu_0)^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)^{-n/2}.$$

This may be written in a more convenient form. Note that
$$\sum_{i=1}^n (x_i - \mu_0)^2 = \sum_{i=1}^n \left((x_i - \bar{x}) + (\bar{x} - \mu_0)\right)^2 = \sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu_0)^2$$
so that
$$\lambda(\mathbf{x}) = \left(1 + \frac{n(\bar{x} - \mu_0)^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)^{-n/2}.$$
The critical region is
$$C_1 = \{\mathbf{x} : \lambda(\mathbf{x}) \le k\}$$
so it follows that $H_0$ is to be rejected when the value of
$$\frac{|\bar{x} - \mu_0|}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}}$$
exceeds some constant.

Now we have already seen that
$$\frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1)$$
where
$$S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2.$$
Therefore it makes sense to write the critical region in the form
$$C_1 = \left\{\mathbf{x} : \frac{|\bar{x} - \mu_0|}{s/\sqrt{n}} \ge c\right\},$$
which is the standard form of the two-sided t-test for a single sample. □
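The equivalence between a small $\lambda(\mathbf{x})$ and a large $|t|$ can be verified numerically; the sketch below uses three made-up samples (my own illustrative data, with $\mu_0 = 0$) and checks that ordering the samples by $\lambda$ reverses their ordering by $|t|$.

```python
from math import sqrt

def lam_and_abs_t(xs, mu0=0.0):
    """Return (lambda(x), |t|) for a sample xs, using the formulae of Example 4.12."""
    n = len(xs)
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)               # sum of squares about xbar
    lam = (1 + n * (xbar - mu0) ** 2 / ss) ** (-n / 2)
    t = (xbar - mu0) / (sqrt(ss / (n - 1)) / sqrt(n))   # one-sample t-statistic
    return lam, abs(t)

samples = [[0.1, -0.2, 0.3, 0.05], [1.0, 0.8, 1.2, 0.9], [2.0, 2.1, 1.9, 2.2]]
results = sorted(lam_and_abs_t(xs) for xs in samples)   # sorted by lambda
print(results)   # as lambda increases, |t| decreases
```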

4.11.2 The likelihood ratio statistic

Since $-2\log\lambda(\mathbf{x})$ is a decreasing function of $\lambda(\mathbf{x})$, it follows that the critical region of the likelihood ratio test can also be expressed in the form
$$C_1 = \{\mathbf{x} : -2\log\lambda(\mathbf{x}) \ge c\}.$$
Writing
$$\Lambda(\mathbf{x}) = -2\log\lambda(\mathbf{x}) = 2\left[l(\hat{\theta}; \mathbf{x}) - l(\theta_0; \mathbf{x})\right],$$
the critical region may be written as
$$C_1 = \{\mathbf{x} : \Lambda(\mathbf{x}) \ge c\}$$
and $\Lambda(X)$ is called the likelihood ratio statistic.

We have been using the idea that values of $\theta$ close to $\hat{\theta}$ are well supported by the data so, if $\theta_0$ is a possible value of $\theta$, then it turns out that, for large samples,
$$\Lambda(X) \xrightarrow{D} \chi^2_p$$
where $p = \dim(\Theta)$.

Let us see why.

Let us see why.

4.11.3 The asymptotic distribution of the likelihood ratio statistic

Write
$$l(\theta_0) = l(\hat{\theta}) + (\theta_0 - \hat{\theta})\,l'(\hat{\theta}) + \tfrac{1}{2}(\theta_0 - \hat{\theta})^2\,l''(\hat{\theta}) + \cdots$$
and, remembering that $l'(\hat{\theta}) = 0$, we have
$$\Lambda \simeq (\hat{\theta} - \theta_0)^2\left[-l''(\hat{\theta})\right] = (\hat{\theta} - \theta_0)^2 J(\hat{\theta}) = (\hat{\theta} - \theta_0)^2 I(\theta_0)\,\frac{J(\hat{\theta})}{I(\theta_0)}.$$
But
$$(\hat{\theta} - \theta_0)\,I(\theta_0)^{1/2} \xrightarrow{D} N(0, 1) \quad\text{and}\quad \frac{J(\hat{\theta})}{I(\theta_0)} \xrightarrow{P} 1$$
so
$$(\hat{\theta} - \theta_0)^2\,I(\theta_0) \xrightarrow{D} \chi^2_1$$
or
$$\Lambda \xrightarrow{D} \chi^2_1,$$
provided $\theta_0$ is the true value of $\theta$.

Example 4.13 Poisson distribution

Let $X = (X_1, \ldots, X_n)$ be a random sample from a Poisson distribution with parameter $\theta$, and test $H_0: \theta = \theta_0$ against $H_1: \theta \ne \theta_0$ at significance level 0.05.

The p.m.f. is
$$p(x; \theta) = \frac{e^{-\theta}\theta^x}{x!}, \qquad x = 0, 1, \ldots,$$
so that
$$l(\theta; \mathbf{x}) = -n\theta + \sum_{i=1}^n x_i\log\theta - \log\prod_{i=1}^n x_i!$$
and
$$\frac{\partial l(\theta; \mathbf{x})}{\partial\theta} = -n + \frac{1}{\theta}\sum_{i=1}^n x_i,$$
giving $\hat{\theta} = \bar{x}$.

Therefore
$$\Lambda = 2n\left[\theta_0 - \bar{x} + \bar{x}\log\left(\frac{\bar{x}}{\theta_0}\right)\right].$$
The distribution of $\Lambda$ under $H_0$ is approximately $\chi^2_1$ and $\chi^2_1(0.95) = 3.84$, so the critical region of the test is
$$C_1 = \left\{\mathbf{x} : 2n\left[\theta_0 - \bar{x} + \bar{x}\log\left(\frac{\bar{x}}{\theta_0}\right)\right] \ge 3.84\right\}.$$
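Applying this statistic to the insect-trap data of Example 4.8 ($n = 33$, $\bar{x} = 54/33$) gives a two-sided check of the earlier one-sided conclusion; the numerical value in the comment is my own calculation, not quoted from the text.

```python
from math import log

def lr_stat_poisson(xbar, n, theta0):
    """Likelihood ratio statistic 2n[theta0 - xbar + xbar*log(xbar/theta0)] (Example 4.13)."""
    return 2 * n * (theta0 - xbar + xbar * log(xbar / theta0))

# Insect-trap data from Example 4.8: n = 33 traps, 54 insects in total
stat = lr_stat_poisson(54 / 33, 33, 1.0)
print(stat, stat >= 3.84)   # ~11.2, so H0: theta = 1 is again rejected
```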


4.11.4 Testing goodness-of-fit for discrete distributions

Example 4.14 Pielou's data on Armillaria root rot in Douglas fir trees

You have already seen the data below as Data set 4.4. They were collected by the ecologist E.C. Pielou, who was interested in the pattern of healthy and diseased trees. The subject of her research was Armillaria root rot in a plantation of Douglas firs. She recorded the lengths of 109 runs of diseased trees.

Table 4.4 Run lengths of diseased trees
Run length      1   2   3   4   5   6
Number of runs  71  28  5   2   2   1

On biological grounds, Pielou proposed a geometric distribution as a probability model. Is this plausible? □

Let's try to answer this by first looking at the general case.

Suppose we have $k$ groups with $n_i$ in the $i$th group. Thus

Group   1      2      3      4      $\cdots$  $k$
Number  $n_1$  $n_2$  $n_3$  $n_4$  $\cdots$  $n_k$

where $\sum_i n_i = n$.

Suppose further that we have a probability model such that $\pi_i(\theta)$, $i = 1, 2, \ldots, k$, is the probability of being in the $i$th group. Clearly $\sum_i \pi_i(\theta) = 1$.

The likelihood is
$$L(\theta) = n!\prod_{i=1}^k \frac{\pi_i(\theta)^{n_i}}{n_i!}$$
and the log-likelihood is
$$l(\theta) = \sum_{i=1}^k n_i\log\pi_i(\theta) + \log n! - \sum_{i=1}^k \log n_i!$$
Suppose $\hat{\theta}$ maximises $l(\theta)$, being the solution of $l'(\hat{\theta}) = 0$.

The general alternative is to take the $\pi_i$ as unrestricted by the model and subject only to $\sum_i \pi_i = 1$. Thus we maximise
$$l(\pi) = \sum_{i=1}^k n_i\log\pi_i + \log n! - \sum_{i=1}^k \log n_i! \quad\text{with}\quad g(\pi) = \sum_i \pi_i = 1.$$
Using a Lagrange multiplier $\gamma$ we obtain the set of $k$ equations
$$\frac{\partial l}{\partial\pi_i} - \gamma\frac{\partial g}{\partial\pi_i} = 0, \qquad 1 \le i \le k,$$
or
$$\frac{n_i}{\pi_i} - \gamma = 0, \qquad 1 \le i \le k.$$
Writing this as
$$n_i - \gamma\pi_i = 0, \qquad 1 \le i \le k,$$
and summing over $i$ we find $\gamma = n$ and
$$\hat{\pi}_i = \frac{n_i}{n}.$$
The likelihood ratio statistic is
$$\Lambda = 2\left[\sum_{i=1}^k n_i\log\frac{n_i}{n} - \sum_{i=1}^k n_i\log\pi_i(\hat{\theta})\right] = 2\sum_{i=1}^k n_i\log\left(\frac{n_i}{n\pi_i(\hat{\theta})}\right).$$

General statement of asymptotic result for the likelihood ratio statistic

Testing $H_0: \theta \in \Theta_0 \subset \Theta$ against $H_1: \theta \in \Theta$, the likelihood ratio statistic
$$\Lambda = 2\left[\sup_{\theta\in\Theta} l(\theta) - \sup_{\theta\in\Theta_0} l(\theta)\right] \xrightarrow{D} \chi^2_p,$$
where
$$p = \dim\Theta - \dim\Theta_0.$$

In the case above, where we are looking at the fit of a one-parameter distribution,
$$\Lambda = 2\sum_{i=1}^k n_i\log\left(\frac{n_i}{n\pi_i(\hat{\theta})}\right),$$
the restriction $\sum_{i=1}^k \pi_i = 1$ means that $\dim\Theta = k - 1$. Clearly $\dim\Theta_0 = 1$, so $p = k - 2$ and
$$\Lambda \xrightarrow{D} \chi^2_{k-2}.$$

Example 4.14 (revisited) Pielou's data on Armillaria root rot in Douglas fir trees

The data are

Run length      1   2   3   4   5   6
Number of runs  71  28  5   2   2   1

and Pielou proposed a geometric model with p.m.f.
$$p(x) = (1 - \theta)^{x-1}\theta, \qquad x = 1, 2, \ldots,$$
where $x$ is the observed run length. Thus, if $x_j$, $1 \le j \le n$, are the observed run lengths, the log-likelihood for Pielou's model is
$$l(\theta) = \sum_{j=1}^n (x_j - 1)\log(1 - \theta) + n\log\theta$$
and, maximising,
$$\frac{\partial l(\theta)}{\partial\theta} = -\frac{\sum_{j=1}^n x_j - n}{1 - \theta} + \frac{n}{\theta},$$
which gives
$$\hat{\theta} = \frac{1}{\bar{x}}.$$
By the invariance property of m.l.e.s,
$$\pi_i(\hat{\theta}) = (1 - \hat{\theta})^{i-1}\hat{\theta} = \frac{(\bar{x} - 1)^{i-1}}{\bar{x}^i}.$$
The data give $\bar{x} = 1.523$. We can therefore use the expression for $\pi_i(\hat{\theta})$ to calculate
$$\Lambda = 2\sum_{i=1}^k n_i\log\left(\frac{n_i}{n\pi_i(\hat{\theta})}\right) = 3.547.$$
There are six groups, so $p = 6 - 1 - 1 = 4$. The approximate distribution of $\Lambda$ is therefore $\chi^2_4$ and
$$P(\Lambda \ge 3.547) = 0.471.$$
There is no evidence against Pielou's conjecture that a geometric distribution is an appropriate model. □
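The quoted $\bar{x} = 1.523$ and $\Lambda = 3.547$ can be reproduced as follows. Note the sketch fits the final cell with the geometric tail probability $P(X \ge 6) = (1 - \hat{\theta})^5$; that this is how the cells were fitted is my assumption, but it is what recovers the text's value.

```python
from math import log

counts = [71, 28, 5, 2, 2, 1]    # runs of length 1..5 and >= 6 (Table 4.4)
n = sum(counts)                  # 109 runs
xbar = sum((i + 1) * c for i, c in enumerate(counts)) / n
theta = 1 / xbar                 # m.l.e. of the geometric parameter

probs = [(1 - theta) ** i * theta for i in range(5)]   # P(X = 1), ..., P(X = 5)
probs.append((1 - theta) ** 5)                         # tail P(X >= 6), an assumption

lam = 2 * sum(c * log(c / (n * p)) for c, p in zip(counts, probs))
print(round(xbar, 3), round(lam, 3))   # 1.523 and ~3.546
```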

Example 4.15 Flying bomb hits on London

Data set 4.5 gave the number of flying bomb hits recorded in each of 576 small areas of ¼ km² in the south of London during World War II.

Table 4.5 Flying bomb hits on London
Number of hits in an area  0    1    2   3   4  5  $\ge 6$
Frequency                  229  211  93  35  7  1  0

Propaganda broadcasts claimed that the weapon could be aimed accurately. If, however, this was not the case, the hits should be randomly distributed over the area and should therefore be fitted by a Poisson distribution. Is this the case? □


The first thing to do is to calculate the m.l.e. of the Poisson parameter. The likelihood function for a sample of size $n$ is
$$L(\lambda) = \prod_{i=1}^n \frac{\lambda^{x_i} e^{-\lambda}}{x_i!},$$
so that the log-likelihood is
$$l(\lambda) = \sum_{i=1}^n x_i\log\lambda - n\lambda - \sum_{i=1}^n \log x_i!$$
Then
$$\frac{dl}{d\lambda} = \frac{\sum_{i=1}^n x_i}{\lambda} - n = 0$$
and
$$\hat{\lambda} = \bar{x} = \frac{535}{576} = 0.929.$$
Using the Poisson probability mass function with $\lambda = 0.929$ we therefore obtain

$i$                      0       1       2       3       4       5       $\ge 6$
$\pi_i(\hat{\lambda})$   0.3949  0.3669  0.1704  0.0528  0.0123  0.0023  0.0004

and hence
$$\Lambda = 2\sum_{i=1}^k n_i\log\left(\frac{n_i}{n\pi_i(\hat{\lambda})}\right) = 1.4995.$$
This is tested against $\chi^2(\nu)$ where $\nu = k - 2 = 7 - 2 = 5$. This gives $P(\Lambda \ge 1.4995) = 0.913$. Clearly there is not a shred of evidence in favour of rejection. □
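The same computation for the flying-bomb data; small differences from the text's 1.4995 appear to come from rounding in the tabulated cell probabilities, so the sketch only checks the order of magnitude rather than exact agreement.

```python
from math import exp, log

counts = [229, 211, 93, 35, 7, 1]   # areas with 0..5 hits; the ">= 6" cell is empty
n = sum(counts)                     # 576 areas
lam_hat = 535 / 576                 # m.l.e. 0.929

# Poisson cell probabilities for 0..5 hits (the empty cell contributes nothing to the sum)
probs, term = [], exp(-lam_hat)
for j in range(6):
    probs.append(term)
    term *= lam_hat / (j + 1)

stat = 2 * sum(c * log(c / (n * p)) for c, p in zip(counts, probs))
print(round(stat, 3))   # ~1.5, tested against chi-square with 7 - 2 = 5 d.f.
```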

4.11.5 The approximate $\chi^2$ distribution

The tests carried out in Examples 4.14 and 4.15 are not, strictly speaking, correct. The reason for this is that the $\chi^2$-distribution we have used to calculate the p-values is an approximation, and the quality of that approximation depends upon the sample size. Happily there is a general rule of thumb you can use.

Rule of thumb for the $\chi^2$ approximation

An approximate $\chi^2$-distribution may be used for testing count data provided that the expected value of each cell in the table is at least 5. If the expected value of a cell is less than 5, it should be pooled with an adjacent cell or cells to obtain a suitable value. □


Example 4.14 (revisited) Pielou's data on Armillaria root rot in Douglas fir trees

Look at the table with the expected values written in.

Run length                                     1       2       3      4      5      6      $\ge 7$
Number of runs $n_i$                           71      28      5      2      2      1      0
Expected number of runs $n\pi_i(\hat{\theta})$  71.569  24.577  8.440  2.898  0.995  0.342  0.218

Clearly we need to pool cells to obtain

Run length                                     1       2       $\ge 3$
Number of runs $n_i$                           71      28      10
Expected number of runs $n\pi_i(\hat{\theta})$  71.569  24.577  12.893

The test statistic is now re-calculated to obtain $\Lambda = 1.087$, which is tested as $\chi^2(1)$ to give a p-value of 0.297. The conclusion that there is no evidence against Pielou's conjecture that the underlying distribution is geometric is unaltered. □
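Re-computing $\Lambda$ and its p-value from the pooled table (taking the expected counts as printed above); the $\chi^2(1)$ tail probability is obtained from the standard normal via $P(\chi^2_1 \ge x) = 2P(Z \ge \sqrt{x})$, using `math.erf`.

```python
from math import log, sqrt, erf

observed = [71, 28, 10]              # pooled counts (run lengths 1, 2, >= 3)
expected = [71.569, 24.577, 12.893]  # pooled expected counts from the text

stat = 2 * sum(o * log(o / e) for o, e in zip(observed, expected))
p_value = 1 - erf(sqrt(stat / 2))    # P(chi2_1 >= stat)
print(round(stat, 3), round(p_value, 3))   # ~1.089 and ~0.297
```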

Example 4.15 (revisited) Flying bomb hits on London

Including the expected values in the table for flying bomb hits, we obtain the table below.

Number of hits in an area  0        1        2       3       4      5      $\ge 6$
Frequency                  229      211      93      35      7      1      0
Expected frequency         227.462  211.334  98.150  30.413  7.027  1.325  0.230

After pooling, we obtain

Number of hits in an area  0        1        2       3       $\ge 4$
Frequency                  229      211      93      35      8
Expected frequency         227.462  211.334  98.150  30.413  8.582

The test statistic is now re-calculated to obtain $\Lambda = 1.101$, which is tested as $\chi^2(3)$ to give a p-value of 0.777. Again we find no evidence for rejection of the null hypothesis. We have therefore found no evidence that the V1 flying bomb could be aimed with any degree of precision. □

4.11.6 Two-way contingency tables

Data are obtained by cross-classifying a fixed number of individuals according to two criteria. They are therefore displayed as counts $n_{ij}$ in a table with $r$ rows and $c$ columns as follows.

$$\begin{array}{cccc|c} n_{11} & n_{12} & \cdots & n_{1c} & n_{1\cdot} \\ \vdots & & \ddots & \vdots & \vdots \\ n_{r1} & n_{r2} & \cdots & n_{rc} & n_{r\cdot} \\ \hline n_{\cdot 1} & n_{\cdot 2} & \cdots & n_{\cdot c} & n \end{array}$$


The aim is to investigate the independence of the two classifications.

Example 4.16 A famous and historic data set

These are Pearson's 1909 data on crime and drinking. The data were introduced in Data set 4.6.

Table 4.6 Crime and drinking
Crime     Drinker  Abstainer
Arson     50       43
Rape      88       62
Violence  155      110
Stealing  379      300
Coining   18       14
Fraud     63       144

Is crime related to drink? □

Suppose the $k$th individual goes into cell $(X_k, Y_k)$, $k = 1, 2, \ldots, n$, and that individuals are independent. Let
$$P((X_k, Y_k) = (i, j)) = \pi_{ij}, \qquad i = 1, 2, \ldots, r,\; j = 1, 2, \ldots, c,$$
where $\sum_{ij} \pi_{ij} = 1$. The null hypothesis of independence of classifiers can be written $H_0: \pi_{ij} = \alpha_i\beta_j$.

This is on Problem Sheet 6 so here are a few hints.

The likelihood function is
$$L(\pi) = n!\prod_{i,j} \frac{\pi_{ij}^{n_{ij}}}{n_{ij}!}$$
so the log-likelihood is
$$l(\pi) = \sum_{i,j} n_{ij}\log\pi_{ij} + \log n! - \sum_{i,j} \log n_{ij}!$$


Under $H_0$, put $\pi_{ij} = \alpha_i\beta_j$ and maximise with respect to $\alpha_i$ and $\beta_j$ subject to $\sum_i \alpha_i = \sum_j \beta_j = 1$. You will obtain
$$\hat{\alpha}_i = \frac{n_{i\cdot}}{n}, \qquad \hat{\beta}_j = \frac{n_{\cdot j}}{n}.$$
Under $H_1$, maximise with respect to $\pi_{ij}$ subject to $\sum_{ij} \pi_{ij} = 1$. You will obtain
$$\hat{\pi}_{ij} = \frac{n_{ij}}{n}$$
and, finally,
$$\Lambda = 2\sum_{i=1}^r \sum_{j=1}^c n_{ij}\log\left(\frac{n_{ij}\,n}{n_{i\cdot}\,n_{\cdot j}}\right).$$

Example 4.16 (continued) A famous and historic data set

For these data, $\Lambda = 50.52$.

Under $H_0$, $\Lambda \sim \chi^2_p$ approximately, where $p = \dim\Theta - \dim\Theta_0$. In the notation used earlier, there are apparently 6 values of $\alpha_i$ to estimate, but in fact there are only 5 because $\sum_i \alpha_i = 1$. Similarly there is $2 - 1 = 1$ value of $\beta_j$. Thus $\dim\Theta_0 = 6$. Because $\sum_{ij} \pi_{ij} = 1$, $\dim\Theta = 12 - 1 = 11$ so, therefore, $p = 11 - 6 = 5$.

Testing against a $\chi^2$-distribution with 5 degrees of freedom, note that the 0.9999 quantile is 25.75 and we can reject at the 0.0001 level of significance. There is overwhelming evidence that crime and drink are related. □
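The value $\Lambda = 50.52$ can be reproduced directly from Table 4.6:

```python
from math import log

# Crime vs drinking counts (Table 4.6): columns are Drinker, Abstainer
table = [[50, 43], [88, 62], [155, 110], [379, 300], [18, 14], [63, 144]]
n = sum(map(sum, table))
row_tot = [sum(row) for row in table]
col_tot = [sum(row[j] for row in table) for j in range(2)]

stat = 2 * sum(
    nij * log(nij * n / (row_tot[i] * col_tot[j]))
    for i, row in enumerate(table)
    for j, nij in enumerate(row)
)
print(n, round(stat, 2))   # 1426 and ~50.52
```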

Degrees of freedom

It is clear from the above that, when testing contingency tables, the number of degrees of freedom of the resulting $\chi^2$-distribution is given, in general, by
$$p = rc - 1 - [(r - 1) + (c - 1)] = rc - r - c + 1 = (r - 1)(c - 1).$$


4.11.7 Pearson's statistic

For testing independence in contingency tables, let $O_{ij}$ be the observed number in cell $(i, j)$, $i = 1, 2, \ldots, r$, $j = 1, 2, \ldots, c$, and $E_{ij}$ be the expected number in cell $(i, j)$. Pearson's statistic is
$$P = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \sim \chi^2_{(r-1)(c-1)} \text{ approximately.}$$
The expected number $E_{ij}$ in cell $(i, j)$ is calculated under the null hypothesis of independence.

If $n_{i\cdot}$ is the total for the $i$th row and the overall total is $n$, then the probability of an observation being in the $i$th row is estimated by
$$P(i\text{th row}) = \frac{n_{i\cdot}}{n}.$$
Similarly
$$P(j\text{th column}) = \frac{n_{\cdot j}}{n}$$
and
$$E_{ij} = n \times P(i\text{th row}) \times P(j\text{th column}) = \frac{n_{i\cdot}\,n_{\cdot j}}{n}.$$

Example 4.16 (revisited) A famous and historic data set

These are the data on crime and drinking with the row and column totals.

Crime     Drinker  Abstainer  Total
Arson     50       43         93
Rape      88       62         150
Violence  155      110        265
Stealing  379      300        679
Coining   18       14         32
Fraud     63       144        207
Total     753      673        1426

The $E_{ij}$ are easily calculated:
$$E_{11} = \frac{93 \times 753}{1426} = 49.11, \text{ and so on.}$$
Pearson's statistic turns out to be $P = 49.73$, which is tested against a $\chi^2$-distribution with $(6 - 1) \times (2 - 1) = 5$ degrees of freedom, and the conclusion is, of course, the same as before. □
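And Pearson's statistic for the same table:

```python
# Pearson's chi-square statistic for Table 4.6 (crime vs drinking)
table = [[50, 43], [88, 62], [155, 110], [379, 300], [18, 14], [63, 144]]
n = sum(map(sum, table))
row_tot = [sum(row) for row in table]
col_tot = [sum(row[j] for row in table) for j in range(2)]

pearson = sum(
    (o - row_tot[i] * col_tot[j] / n) ** 2 / (row_tot[i] * col_tot[j] / n)
    for i, row in enumerate(table)
    for j, o in enumerate(row)
)
print(round(pearson, 2))   # ~49.73, on (6 - 1) * (2 - 1) = 5 degrees of freedom
```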


4.11.8 Pearson's statistic and the likelihood ratio statistic

$$P = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = \sum_{i,j} \frac{\left(n_{ij} - \frac{n_{i\cdot}n_{\cdot j}}{n}\right)^2}{\frac{n_{i\cdot}n_{\cdot j}}{n}}.$$

Consider the Taylor expansion of $x\log(x/a)$ about $x = a$:
$$x\log\left(\frac{x}{a}\right) = (x - a) + \frac{(x - a)^2}{2a} - \frac{(x - a)^3}{6a^2} + \cdots$$
Now put $x = n_{ij}$ and $a = \frac{n_{i\cdot}n_{\cdot j}}{n}$ so that
$$n_{ij}\log\left(\frac{n_{ij}\,n}{n_{i\cdot}n_{\cdot j}}\right) = \left(n_{ij} - \frac{n_{i\cdot}n_{\cdot j}}{n}\right) + \frac{\left(n_{ij} - \frac{n_{i\cdot}n_{\cdot j}}{n}\right)^2}{2\,\frac{n_{i\cdot}n_{\cdot j}}{n}} + \cdots$$
Thus
$$\sum_{i,j} n_{ij}\log\left(\frac{n_{ij}\,n}{n_{i\cdot}n_{\cdot j}}\right) = n - n\sum_i \frac{n_{i\cdot}}{n}\sum_j \frac{n_{\cdot j}}{n} + \frac{1}{2}\sum_{i,j}\frac{(O_{ij} - E_{ij})^2}{E_{ij}} + \cdots \simeq \frac{1}{2}P,$$
since the first two terms cancel, or
$$\Lambda \simeq P.$$

Example 4.17 Snoring and heart disease

You saw the data in the table below in Data set 4.7.

Table 4.7 Snoring frequency and heart disease
Heart    Non-     Occasional  Snore nearly  Snore every
disease  snorers  snorers     every night   night        Total
Yes      24       35          21            30           110
No       1355     603         192           224          2374
Total    1379     638         213           254          2484

Is there an association between snoring frequency and heart disease?

You might like to practise on this data set by calculating both
$$\Lambda = 2\sum_{i=1}^r \sum_{j=1}^c n_{ij}\log\left(\frac{n_{ij}\,n}{n_{i\cdot}n_{\cdot j}}\right) \quad\text{and}\quad P = \sum_{i,j}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}.$$
You should get
$$\Lambda = 65.904 \quad\text{and}\quad P = 72.782.$$
Each is approximately distributed $\chi^2(3)$ and in each case the p-value is effectively zero. The conclusion is that there is an association between snoring and heart disease. □
