Chi-square test or χ² test

What if we are interested in seeing if my “crazy” dice are considered “fair”?

What can I do?

Chi-square test

• Used to test the counts of categorical data
• Three types:
  – Goodness of fit (univariate)
  – Independence (bivariate)
  – Homogeneity (univariate with two samples)

Chi-square distributions

[Figure: chi-square density curves for df = 1, 2, 3, 4, 5, 8, 10, and 15, with x running from 0 to 25.]

Upper-tail Areas for Chi-square Distributions

Right-tail area   df = 1    df = 2    df = 3    df = 4    df = 5
> .100            < 2.70    < 4.60    < 6.25    < 7.77    < 9.23
0.100             2.70      4.60      6.25      7.77      9.23
0.095             2.78      4.70      6.36      7.90      9.37
0.090             2.87      4.81      6.49      8.04      9.52
0.085             2.96      4.93      6.62      8.18      9.67
0.080             3.06      5.05      6.75      8.33      9.83
0.075             3.17      5.18      6.90      8.49      10.00
0.070             3.28      5.31      7.06      8.66      10.19
0.065             3.40      5.46      7.22      8.84      10.38
0.060             3.53      5.62      7.40      9.04      10.59
0.055             3.68      5.80      7.60      9.25      10.82
0.050             3.84      5.99      7.81      9.48      11.07
0.045             4.01      6.20      8.04      9.74      11.34
0.040             4.21      6.43      8.31      10.02     11.64
0.035             4.44      6.70      8.60      10.34     11.98
0.030             4.70      7.01      8.94      10.71     12.37
0.025             5.02      7.37      9.34      11.14     12.83
0.020             5.41      7.82      9.83      11.66     13.38
0.015             5.91      8.39      10.46     12.33     14.09
0.010             6.63      9.21      11.34     13.27     15.08
0.005             7.87      10.59     12.83     14.86     16.74
0.001             10.82     13.81     16.26     18.46     20.51
< .001            > 10.82   > 13.81   > 16.26   > 18.46   > 20.51

Right-tail area   df = 6    df = 7    df = 8    df = 9    df = 10
> .100            < 10.64   < 12.01   < 13.36   < 14.68   < 15.98
0.100             10.64     12.01     13.36     14.68     15.98
0.095             10.79     12.17     13.52     14.85     16.16
0.090             10.94     12.33     13.69     15.03     16.35
0.085             11.11     12.50     13.87     15.22     16.54
0.080             11.28     12.69     14.06     15.42     16.75
0.075             11.46     12.88     14.26     15.63     16.97
0.070             11.65     13.08     14.48     15.85     17.20
0.065             11.86     13.30     14.71     16.09     17.44
0.060             12.08     13.53     14.95     16.34     17.71
0.055             12.32     13.79     15.22     16.62     17.99
0.050             12.59     14.06     15.50     16.91     18.30
0.045             12.87     14.36     15.82     17.24     18.64
0.040             13.19     14.70     16.17     17.60     19.02
0.035             13.55     15.07     16.56     18.01     19.44
0.030             13.96     15.50     17.01     18.47     19.92
0.025             14.44     16.01     17.53     19.02     20.48
0.020             15.03     16.62     18.16     19.67     21.16
0.015             15.77     17.39     18.97     20.51     22.02
0.010             16.81     18.47     20.09     21.66     23.20
0.005             18.54     20.27     21.95     23.58     25.18
0.001             22.45     24.32     26.12     27.87     29.58
< .001            > 22.45   > 24.32   > 26.12   > 27.87   > 29.58
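The table entries can be spot-checked numerically. The sketch below is an illustration rather than the method used to build the table: it relies on the closed form for the right-tail area that holds when df is even, and it bisects for the x value matching a chosen tail area. The function names are made up for this example.

```python
import math

def upper_tail_even_df(x, df):
    """Right-tail area beyond x; this closed form is valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive even df")
    half = x / 2
    term = total = 1.0
    # exp(-x/2) * sum over k = 0 .. df/2 - 1 of (x/2)^k / k!
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

def critical_value(area, df, hi=200.0):
    """Bisect for the x whose right-tail area equals `area`."""
    lo = 0.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if upper_tail_even_df(mid, df) > area:
            lo = mid   # tail still too large, so move right
        else:
            hi = mid
    return (lo + hi) / 2

# Compare with the 0.050 row of the table for the even df columns.
for df in (2, 4, 6, 8, 10):
    print(f"df = {df}: x = {critical_value(0.05, df):.3f}")
```

The printed values should agree with the 0.050 row to within about 0.01.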

χ² distribution

• Different df have different curves
• Skewed right
• Cannot take on negative values
• As df increases, the curve shifts toward the right and becomes more like a normal curve
• Each curve has a mean at df and, for df ≥ 2, a mode at df − 2
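The mode and mean claims can be checked numerically. The following is a rough sketch, not part of the original slides: it evaluates the standard chi-square density on a fine grid and reads off where the curve peaks and where it balances. The helper names and the grid settings are assumptions chosen for illustration.

```python
import math

def chi2_pdf(x, df):
    """Height of the chi-square density curve with df degrees of freedom at x > 0."""
    k = df / 2
    return x ** (k - 1) * math.exp(-x / 2) / (2 ** k * math.gamma(k))

def grid_mode_and_mean(df, upper=100.0, step=0.001):
    """Approximate the mode and the mean on a fine grid of midpoints."""
    best_x, best_height, mean = 0.0, -1.0, 0.0
    for i in range(int(upper / step)):
        x = (i + 0.5) * step
        height = chi2_pdf(x, df)
        if height > best_height:
            best_x, best_height = x, height
        mean += x * height * step  # midpoint-rule piece of the integral of x * f(x)
    return best_x, mean

for df in (4, 8, 15):
    mode, mean = grid_mode_and_mean(df)
    print(f"df = {df}: mode ~ {mode:.2f}, mean ~ {mean:.2f}")
```

For each df tried here, the peak should land near df − 2 and the mean near df.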

χ² assumptions

• SRS – reasonably random sample
• Have counts of categorical data, and we expect each category to happen at least once
• Sample size – to ensure that the sample size is large enough, we should expect at least five in each category

***Be sure to list expected counts!!

Combine these together: All expected counts are at least 5.
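As a small illustration of the expected-count condition, the sketch below computes expected counts from a sample size and hypothesized proportions, then checks that every one is at least 5. The function names are hypothetical; the numbers come from the CEO zodiac example in this deck (256 CEOs, 12 equally likely signs).

```python
def expected_counts(sample_size, proportions):
    """Expected count for each category = sample size * hypothesized proportion."""
    return [sample_size * p for p in proportions]

def all_at_least_five(expected):
    """True when the sample is large enough by the 'all expected counts >= 5' rule."""
    return all(count >= 5 for count in expected)

# 256 CEOs spread evenly over 12 zodiac signs
ceo_expected = expected_counts(256, [1 / 12] * 12)
print([round(count, 2) for count in ceo_expected])
print("Condition met:", all_at_least_five(ceo_expected))
```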

χ² formula

$$\chi^2 = \sum \frac{(\text{obs} - \text{exp})^2}{\text{exp}} = \sum \frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}}$$

χ² goodness of fit test

• Uses univariate data (one sample, one variable)
• Want to see how well the observed counts “fit” what we expect the counts to be
• Use the χ²cdf function on the calculator to find p-values
• Based on df: df = number of categories − 1
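For readers without the calculator, the right-tail area that χ²cdf(x, 10^99, df) reports can be approximated in plain Python. This is a minimal sketch under the assumption that df is a positive integer; `chi2_upper_tail` is a made-up name, and the recurrence it uses is a standard identity for chi-square tail areas, not the calculator's documented algorithm.

```python
import math

def chi2_upper_tail(x, df):
    """Right-tail area beyond x >= 0 for a chi-square curve with integer df >= 1."""
    if df < 1 or x < 0:
        raise ValueError("need integer df >= 1 and x >= 0")
    half = x / 2
    # Known starting points: df = 1 (odd case) and df = 2 (even case).
    if df % 2 == 1:
        area, nu = math.erfc(math.sqrt(half)), 1
    else:
        area, nu = math.exp(-half), 2
    # Raising df by 2 adds (x/2)^(nu/2) * exp(-x/2) / Gamma(nu/2 + 1) to the tail area.
    while nu < df:
        area += half ** (nu / 2) * math.exp(-half) / math.gamma(nu / 2 + 1)
        nu += 2
    return area

# Spot checks against the 0.050 row of the upper-tail table
print(round(chi2_upper_tail(3.84, 1), 3))
print(round(chi2_upper_tail(11.07, 5), 3))
```

Both spot checks should come out close to 0.05, matching the table.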

Let’s test our dice!

Hypotheses – written in words

H0: proportions are equal

Ha: at least one proportion is not the same

Be sure to write in context!

Does your zodiac sign determine how successful you will be? Fortune magazine collected the zodiac signs of 256 heads of the largest 400 companies. Is there sufficient evidence to claim that successful people are more likely to be born under some signs than others?

Aries    23    Libra        18    Leo       20
Taurus   20    Scorpio      21    Virgo     19
Gemini   18    Sagittarius  19    Aquarius  24
Cancer   23    Capricorn    22    Pisces    29

How many would you expect in each sign if there were no difference between them?

How many degrees of freedom?

I would expect CEOs to be born equally under all signs, so the expected count for each sign is 256/12 = 21.333.

Since there are 12 signs, df = 12 – 1 = 11.

Assumptions:

•Have a random sample of CEOs

•All expected counts are greater than 5. (I expect 21.33 CEOs to be born under each sign.)

H0: The proportions of CEOs born under each sign are the same.

Ha: At least one of the proportions of CEOs born under a sign is different.

2.) Compute the residuals. (Observed – Expected)

Sign          Observed   Expected (256/12)   Residual (Obs − Exp)
Aries         23         21.333               1.667
Taurus        20         21.333              -1.333
Gemini        18         21.333              -3.333
Cancer        23         21.333               1.667
Leo           20         21.333              -1.333
Virgo         19         21.333              -2.333
Libra         18         21.333              -3.333
Scorpio       21         21.333              -0.333
Sagittarius   19         21.333              -2.333
Capricorn     22         21.333               0.667
Aquarius      24         21.333               2.667
Pisces        29         21.333               7.667

3.) Square the residuals

Sign          Observed   Expected (256/12)   Residual (Obs − Exp)   (Obs − Exp)²
Aries         23         21.333               1.667                  2.778889
Taurus        20         21.333              -1.333                  1.776889
Gemini        18         21.333              -3.333                  11.108889
Cancer        23         21.333               1.667                  2.778889
Leo           20         21.333              -1.333                  1.776889
Virgo         19         21.333              -2.333                  5.442889
Libra         18         21.333              -3.333                  11.108889
Scorpio       21         21.333              -0.333                  0.110889
Sagittarius   19         21.333              -2.333                  5.442889
Capricorn     22         21.333               0.667                  0.444889
Aquarius      24         21.333               2.667                  7.112889
Pisces        29         21.333               7.667                  58.782889

4.) Compute the components for each cell

Sign          Observed   Expected (256/12)   Residual (Obs − Exp)   (Obs − Exp)²   (Obs − Exp)² / Exp
Aries         23         21.333               1.667                  2.778889       0.130262
Taurus        20         21.333              -1.333                  1.776889       0.083293
Gemini        18         21.333              -3.333                  11.108889      0.520737
Cancer        23         21.333               1.667                  2.778889       0.130262
Leo           20         21.333              -1.333                  1.776889       0.083293
Virgo         19         21.333              -2.333                  5.442889       0.255139
Libra         18         21.333              -3.333                  11.108889      0.520737
Scorpio       21         21.333              -0.333                  0.110889       0.005198
Sagittarius   19         21.333              -2.333                  5.442889       0.255139
Capricorn     22         21.333               0.667                  0.444889       0.020854
Aquarius      24         21.333               2.667                  7.112889       0.333422
Pisces        29         21.333               7.667                  58.782889      2.755491

5. Find the sum of the components (that’s the chi-square statistic)


Σ = 5.094

$$\chi^2 = \frac{(23 - 21.3)^2}{21.3} + \frac{(20 - 21.3)^2}{21.3} + \dots + \frac{(29 - 21.3)^2}{21.3} = 5.094$$

P-value = χ²cdf(5.094, 10^99, 11) = .9265    α = .05

Since p-value > α, I fail to reject H0. There is not sufficient evidence to suggest that CEOs are born under some signs more than under others.
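The residual, squared-residual, and component steps of the zodiac example can be reproduced in a few lines. This sketch uses the observed counts from the slides; the variable names are illustrative. Because it keeps the expected count at full precision (256/12) instead of the rounded 21.333, individual components may differ from the slide tables in the later decimal places, while the total still rounds to 5.094.

```python
observed = {
    "Aries": 23, "Taurus": 20, "Gemini": 18, "Cancer": 23,
    "Leo": 20, "Virgo": 19, "Libra": 18, "Scorpio": 21,
    "Sagittarius": 19, "Capricorn": 22, "Aquarius": 24, "Pisces": 29,
}

n = sum(observed.values())        # total number of CEOs
expected = n / len(observed)      # same expected count for every sign under H0

rows = []
for sign, count in observed.items():
    residual = count - expected            # step 2: observed - expected
    squared = residual ** 2                # step 3: square the residual
    component = squared / expected         # step 4: divide by the expected count
    rows.append((sign, count, residual, squared, component))

chi_square = sum(row[4] for row in rows)   # step 5: add up the components
df = len(observed) - 1
largest = max(rows, key=lambda row: row[4])

for sign, count, residual, squared, component in rows:
    print(f"{sign:<12}{count:>4}{residual:>9.3f}{squared:>11.4f}{component:>10.4f}")
print(f"chi-square = {chi_square:.3f}, df = {df}, largest component: {largest[0]}")
```

Listing the components this way also shows which category contributes most to the statistic, which here is Pisces.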

Offspring of certain fruit flies may have yellow or ebony bodies and normal wings or short wings. Genetic theory predicts that these traits will appear in the ratio 9:3:3:1 (yellow & normal, yellow & short, ebony & normal, ebony & short). A researcher checks 100 such flies and finds the distribution of traits to be 59, 20, 11, and 10, respectively. What are the expected counts? What is the df?

Are the results consistent with the theoretical distribution predicted by the genetic model? (see next page)

Expected counts:
Y & N = 56.25
Y & S = 18.75
E & N = 18.75
E & S = 6.25

We expect 9/16 of the 100 flies to have yellow bodies and normal wings (Y & N).

Since there are 4 categories, df = 4 – 1 = 3.

Assumptions:

•Have a random sample of fruit flies

•All expected counts are greater than 5. Expected counts: Y & N = 56.25, Y & S = 18.75, E & N = 18.75, E & S = 6.25

H0: The proportions of fruit flies are the same as the theoretical model.

Ha: At least one of the proportions of fruit flies is not the same as the theoretical model.

$$\chi^2 = \frac{(59 - 56.25)^2}{56.25} + \frac{(20 - 18.75)^2}{18.75} + \dots + \frac{(10 - 6.25)^2}{6.25} = 5.671$$

P-value = χ²cdf(5.671, 10^99, 3) = .129    α = .05

Since p-value > α, I fail to reject H0. There is not sufficient evidence to suggest that the distribution of fruit flies is not the same as the theoretical model.
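The fruit-fly calculation can be sketched from the 9:3:3:1 ratio directly. The code below uses the counts given in the problem; the variable names are illustrative, and the p-value line uses a closed form that applies only when df = 3, which is an assumption of this sketch rather than how the calculator computes it.

```python
import math

ratio = [9, 3, 3, 1]              # Y & N, Y & S, E & N, E & S
observed = [59, 20, 11, 10]

n = sum(observed)
expected = [n * part / sum(ratio) for part in ratio]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

# Right-tail area for df = 3 only: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)
p_value = (math.erfc(math.sqrt(chi_square / 2))
           + math.sqrt(2 * chi_square / math.pi) * math.exp(-chi_square / 2))

alpha = 0.05
print("Expected counts:", expected)
print(f"chi-square = {chi_square:.3f}, df = {df}, p-value = {p_value:.3f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```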

A company says its premium mixture of nuts contains 10% Brazil nuts, 20% cashews, 20% almonds, 10% hazelnuts and 40% peanuts. You buy a large can and separate the nuts. Upon weighing them, you find there are 112 g of Brazil nuts, 183 g of cashews, 207 g of almonds, 71 g of hazelnuts, and 446 g of peanuts. You wonder whether your mix is significantly different from what the company advertises.

Why is the chi-square goodness-of-fit test NOT appropriate here?

What might you do instead of weighing the nuts in order to use chi-square?

Because we do NOT have counts of the type of nuts.

We could count the number of each type of nut and then perform a χ² test.

Example: Does the color of a car influence the chance that it will be stolen? Of 830 cars reported stolen, 140 were white, 100 were blue, 270 were red, 230 were black, and 90 were other colors. It is known that 15% of all cars are white, 15% are blue, 35% are red, 30% are black, and 5% are other colors.

Category   Color   Observed   Expected
1          White   140        .15*830 = 124.5
2          Blue    100        .15*830 = 124.5
3          Red     270        .35*830 = 290.5
4          Black   230        .30*830 = 249
5          Other   90         .05*830 = 41.5


Let π1, π2, . . ., π5 denote the true proportions of stolen cars that fall into the 5 color categories.

Ho: π1 = .15, π2 = .15, π3 = .35, π4 = .30, π5 = .05

Ha: Ho is not true.

α = .01

Test statistic:

$$\chi^2 = \sum \frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}}$$

Assumptions: The sample was a random sample of stolen cars. All expected counts are greater than 5, so the sample size is large enough to use the chi-square test.

Calculations:

$$\chi^2 = \frac{(140 - 124.5)^2}{124.5} + \frac{(100 - 124.5)^2}{124.5} + \frac{(270 - 290.5)^2}{290.5} + \frac{(230 - 249.0)^2}{249.0} + \frac{(90 - 41.5)^2}{41.5}$$

= 1.93 + 4.82 + 1.45 + 1.45 + 56.68 = 66.33

P-value: All expected counts exceed 5, so the P-value can be based on a chi-square distribution with 4 df. The computed value is larger than 18.46, so P-value < .001.

Because P-value < α, Ho is rejected. There is convincing evidence that at least one of the color proportions for stolen cars differs from the corresponding proportion for all cars.
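The stolen-car test can be sketched the same way, this time with unequal hypothesized proportions. The counts and proportions come from the example; the variable names are illustrative, and the p-value uses a closed form that is valid only for df = 4 (an assumption of this sketch, standing in for the table lookup used above).

```python
import math

proportions = {"White": 0.15, "Blue": 0.15, "Red": 0.35, "Black": 0.30, "Other": 0.05}
observed = {"White": 140, "Blue": 100, "Red": 270, "Black": 230, "Other": 90}

n = sum(observed.values())
expected = {color: n * p for color, p in proportions.items()}

components = {
    color: (observed[color] - expected[color]) ** 2 / expected[color]
    for color in observed
}
chi_square = sum(components.values())
df = len(observed) - 1

# Right-tail area for df = 4 only: exp(-x/2) * (1 + x/2)
p_value = math.exp(-chi_square / 2) * (1 + chi_square / 2)

alpha = 0.01
reject = p_value < alpha

for color in observed:
    print(f"{color:<6}{observed[color]:>5}{expected[color]:>8.1f}{components[color]:>8.2f}")
print(f"chi-square = {chi_square:.2f}, df = {df}, p-value < .001: {p_value < 0.001}")
print("Reject Ho" if reject else "Fail to reject Ho")
```

Most of the statistic comes from the "Other" category, where far more cars were stolen than the 5% share would predict.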
