Chi-Square Tests Categorical data 1-sample, compared to theoretical distribution Goodness-of-Fit Test 2+ samples, 2+ levels of response variable Chi-square Test Chi-Square Tests Slide # 1

Chi-Square Tests

• Categorical data

• 1-sample, compared to theoretical distribution– Goodness-of-Fit Test

• 2+ samples, 2+ levels of response variable– Chi-square Test

Slide #1

Chi-square Slide #2

Chi-Square -- Examples

• Do the dominant plants in plots differ between two locations?

• Does the frequency of females in majors differ between majors in the natural sciences, social sciences, and humanities?

• Does the occurrence of a food item in the stomachs of lake trout and chinook salmon differ?

Chi-square Slide #3

What do those examples have in common?

• A categorical response variable
– dominant plant in a plot
– sex of student (male or female)
– occurrence of a food item (Y/N)

• Compare response frequencies among 2+ groups
– between two locations
– among three divisions
– between lake trout and chinook salmon

Chi-square Slide #4

An Illustrative Example

• When Chinook Salmon were first introduced to Lake Superior there was concern that they would compete with native Lake Trout for Lake Herring. Preliminarily, fisheries biologists classified the diets of 50 Lake Trout and 40 Chinook Salmon as containing Lake Herring or not. They found 36 Lake Trout and 24 Chinook Salmon contained Lake Herring. Test (at the 10% level) if there is a difference in the proportion of Lake Trout and Chinook Salmon that had Lake Herring.

Chi-square Slide #5

Observed Table

• Recall: “… the diets of 50 Lake Trout and 40 Chinook Salmon … found 36 Lake Trout and 24 Chinook Salmon contained Lake Herring”

             LH    no LH   Total
Lake Trout   36    14      50
Ch. Salmon   24    16      40
Total        60    30      90

Chi-square Slide #6

Observed Table

• If there is no difference between rows (i.e., Ho is true) then the Total row could represent either row.

• Thus, the proportion of predators (regardless of type) that consumed Lake Herring is estimated to be 60/90, or 0.67.

             LH    no LH   Total
Lake Trout   36    14      50
Ch. Salmon   24    16      40
Total        60    30      90

Chi-square Slide #7

Expectations if Ho is true

• If there is no difference and the common proportion is estimated by 0.67, then how many ….
– LT do we expect to have LH = 50*0.67 = 60*50/90
– LT … … to not have LH = 50*0.33 = 30*50/90
– CS … … to have LH = 40*0.67 = 60*40/90
– CS … … to not have LH = 40*0.33 = 30*40/90

Chi-square Slide #8

Create Expected Table

• LT to have LH = 60*50/90 = 33.3

             LH     no LH   Total
Lake Trout   33.3           50
Ch. Salmon                  40
Total        60     30      90

Chi-square Slide #9

Create Expected Table

• LT to NOT have LH = 30*50/90 = 16.7

             LH     no LH   Total
Lake Trout   33.3   16.7    50
Ch. Salmon   26.7   13.3    40
Total        60     30      90

• Expected counts are the product of the marginal totals divided by the table total.
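
The rule above can be sketched in a few lines of Python. This is only an illustration: the function name `expected_table` and the list-of-rows layout are assumptions made here, not part of the slides or the R handout.

```python
def expected_table(observed):
    """Expected counts under Ho: (row total * column total) / table total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Observed counts from the example.
# Rows: Lake Trout, Chinook Salmon. Columns: LH, no LH.
observed = [[36, 14], [24, 16]]
expected = expected_table(observed)

for label, row in zip(["Lake Trout", "Ch. Salmon"], expected):
    print(label, [round(x, 1) for x in row])
```

The printed values should agree with the expected table on the slide (33.3 and 16.7 for Lake Trout, 26.7 and 13.3 for Chinook Salmon).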

Chi-Square Tests Slide #10

A New Test Statistic

$\chi^2 = \sum_{\text{table}} \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$

df = (rows-1)*(cols-1)
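
As a rough sketch, the statistic and its df can be computed for the Lake Trout and Chinook Salmon table as follows. The function name `chi_square_statistic` is hypothetical, and the expected counts are rebuilt from the marginal totals inside the function so that the block stands alone.

```python
def chi_square_statistic(observed):
    """Return (chi-square statistic, df) for a two-way table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n  # expected count under Ho
            stat += (obs - exp) ** 2 / exp           # one cell's contribution
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df

# Rows: Lake Trout, Chinook Salmon. Columns: LH, no LH.
stat, df = chi_square_statistic([[36, 14], [24, 16]])
print(round(stat, 2), df)
```

For this table the four cell contributions are roughly 0.21, 0.43, 0.27, and 0.53, which sum to 1.44 with 1 df.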

Chi-Square Tests Slide #11

Chi-Square Distribution

• Right-skewed (all values are positive)

• Less sharply skewed with increasing df

– df are related to the size of the table, not n

• All p-values are “right-ofs” – no “one-tailed” tests with chi-square

• Examine HO – page 1

[Figure: chi-square density curves for Chi(3), Chi(10), and Chi(20); x-axis “Chi-square” from 0 to 50]
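
One way to put numbers on "less sharply skewed with increasing df" is the skewness of a chi-square distribution, which is sqrt(8/df). That formula is a standard result that does not appear on the slide, and the function name below is made up for this sketch.

```python
import math

def chi_square_skewness(df):
    """Skewness of a chi-square distribution with df degrees of freedom."""
    return math.sqrt(8 / df)

# The three distributions shown in the figure
for df in (3, 10, 20):
    print(f"Chi({df}): skewness = {chi_square_skewness(df):.2f}")
```

The skewness falls from about 1.63 at 3 df to about 0.63 at 20 df, but it stays positive, so the distribution remains right-skewed.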

Chi-square Slide #12

Chi-Square Test

• Ho: “distribution of individuals into the levels is the same for each population”

• HA: “distribution of individuals into levels is different for at least one pair of populations”

• Assume: at least 5 in each cell of expected table

• Statistic: Observed frequency table

• Test Statistic: $\chi^2 = \sum_{\text{table}} \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$

• df: (rows-1)*(columns-1)

• When: categorical variable, 2+ populations/groups

Chi-square Slide #13

A Full Example

• When Chinook Salmon were first introduced to Lake Superior there was concern that they would compete with native Lake Trout for Lake Herring. Preliminarily, fisheries biologists classified the diets of 50 Lake Trout and 40 Chinook Salmon as containing Lake Herring or not. They found 36 Lake Trout and 24 Chinook Salmon contained Lake Herring. Test (at the 10% level) if there is a difference in the proportion of Lake Trout and Chinook Salmon that had Lake Herring.
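
One possible way to work through this example end to end is sketched below in plain Python. It is not the course's R handout. The variable names are assumptions, and the p-value line relies on a closed form that holds only when df = 1 (a chi-square with 1 df is the square of a standard normal).

```python
import math

# Rows: Lake Trout, Chinook Salmon. Columns: LH, no LH.
observed = [[36, 14], [24, 16]]
alpha = 0.10

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Assumption of the test: at least 5 in each cell of the expected table
assumption_met = all(e >= 5 for row in expected for e in row)

chi2 = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Right-tail p-value; this shortcut is valid only for df = 1
p_value = math.erfc(math.sqrt(chi2 / 2))
reject_ho = p_value < alpha

print(f"assumption met: {assumption_met}")
print(f"chi-square = {chi2:.2f}, df = {df}, p-value = {p_value:.4f}")
print("reject Ho" if reject_ho else "do not reject Ho")
```

With these counts the statistic is 1.44 on 1 df and the p-value is about 0.23, which is larger than 0.10, so the sketch does not reject Ho.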

Chi-square Slide #14

Another Full Example

• Modification -- the researchers recorded what the dominant food item was. Do the dominant food items in Lake Trout and Chinook Salmon differ at the 5% level?

• See R HO Page 2.

             LH    smelt   Mysis   Total
Lake Trout   32    10      8       50
Ch. Salmon   18    18      4       40
Total        50    28      12      90
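
For readers without the R handout, the same calculation can be sketched in plain Python. The names are illustrative assumptions, and the p-value uses the fact that the right-tail area of a chi-square distribution with exactly 2 df is exp(-x/2); that shortcut does not apply to other df.

```python
import math

# Rows: Lake Trout, Chinook Salmon. Columns: LH, smelt, Mysis.
observed = [[32, 10, 8], [18, 18, 4]]
alpha = 0.05

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Smallest expected count, to check the "at least 5" assumption
min_expected = min(e for row in expected for e in row)

chi2 = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Right-tail p-value; exp(-x/2) is valid only for df = 2
p_value = math.exp(-chi2 / 2)

print(f"smallest expected count = {min_expected:.2f}")
print(f"chi-square = {chi2:.2f}, df = {df}, p-value = {p_value:.4f}")
print("reject Ho" if p_value < alpha else "do not reject Ho")
```

Here the smallest expected count is about 5.33, the statistic is about 6.51 on 2 df, and the p-value is about 0.039, so at the 5% level the sketch rejects Ho.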

Chi-Square Tests

Examine HO – Page 3

Slide #15

Chi-Square Tests Slide #17

Goodness-of-Fit Test

• Compare observed to theoretical frequencies of individuals in categories.

• Examples
– Test whether responses are “random” (e.g., preference)
– Test Mendelian genetics (e.g., 3:1 and 9:3:3:1 theories).
– Test use of available resources (e.g., compare habitat usage to availability).

Chi-Square Tests Slide #18

An Illustrative Example

• Determine, at the 10% level, if Northland students prefer the Chris Duarte Group (CDG), Ronnie Baker Brooks (RBB), or Bernard Allison (BA).

• Hypotheses?
– Ha: “different # of students prefer each artist”
– Ho: “same # of students prefer each artist”

Chi-Square Tests Slide #19

An Illustrative Example

• Under Ho, what proportion prefer each artist? 1/3

• If n=78, how many students prefer each artist if Ho is true? 26

Expected Table

Artist   CDG   RBB   BA
Freq     26    26    26
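
The expected counts for a goodness-of-fit test are the sample size times each theoretical proportion. A small sketch follows; the function name `gof_expected` is hypothetical, and exact fractions are used only to avoid rounding noise with 1/3.

```python
from fractions import Fraction

def gof_expected(n, proportions):
    """Expected counts under Ho: n times each theoretical proportion."""
    if sum(proportions) != 1:
        raise ValueError("theoretical proportions must sum to 1")
    return [n * p for p in proportions]

# Ho: the same number of students prefer each of the three artists
artists = ["CDG", "RBB", "BA"]
expected = gof_expected(78, [Fraction(1, 3)] * 3)

for artist, count in zip(artists, expected):
    print(artist, float(count))
```

The same function would handle unequal theoretical proportions, such as a 3:1 ratio passed as `[Fraction(3, 4), Fraction(1, 4)]`.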

Chi-Square Tests Slide #20

An Illustrative Example

• Suppose these results were obtained:

Observed Table

Artist   CDG   RBB   BA
Freq     24    38    16

• Is there a preference – i.e., are these observations significantly different from what was expected when assuming no preference?

Chi-Square Tests Slide #21

A New Test Statistic

$\chi^2 = \sum_{\text{table}} \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$

df = cells - 1

Chi-Square Tests Slide #22

An Illustrative Example

Observed Table

Artist   CDG   RBB   BA
#        24    38    16

Expected Table

Artist   CDG   RBB   BA
#        26    26    26

$\chi^2 = \frac{(24-26)^2}{26} + \frac{(38-26)^2}{26} + \frac{(16-26)^2}{26}$

$\chi^2 = 0.15 + 5.54 + 3.85 = 9.54$

df = (3-1) = 2 p-value = 0.00848

Conclusion?
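
The arithmetic in this example can be checked with a short Python sketch. The variable names are assumptions, and the p-value uses the closed form exp(-x/2), which holds only for a chi-square distribution with 2 df.

```python
import math

observed = [24, 38, 16]   # CDG, RBB, BA
expected = [26, 26, 26]   # n/3 each under Ho, with n = 78
alpha = 0.10

# Contribution of each cell: (observed - expected)^2 / expected
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi2 = sum(contributions)
df = len(observed) - 1    # cells - 1

# Right-tail p-value; exp(-x/2) is valid only for df = 2
p_value = math.exp(-chi2 / 2)

print([round(c, 2) for c in contributions])
print(f"chi-square = {chi2:.2f}, df = {df}, p-value = {p_value:.5f}")
print("reject Ho" if p_value < alpha else "do not reject Ho")
```

The unrounded statistic is about 9.538, which gives a p-value near 0.00849; the 0.00848 on the slide is consistent with plugging in the rounded 9.54. Either way the p-value is well below 0.10, so the sketch rejects Ho.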

Chi-Square Tests Slide #23

Goodness-of-Fit Test

• Ho: distribution of individuals into levels follows the theoretical distribution

• HA: distribution of individuals into levels does NOT follow the theoretical distribution

• Sample: randomized, single variable of size n

• Assume: at least 5 in each cell of expected table

• Statistic: Observed frequency table

Chi-Square Tests Slide #24

Goodness-of-Fit Test

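• Test Statistic: $\chi^2 = \sum_{\text{table}} \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$

• df: cells-1

• Confidence Region: $\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$, where $\hat{p}$ is the sample proportion in the level of interest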
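
As an illustration of the confidence region, the sketch below builds an interval for the proportion of students preferring RBB (38 of 78 in the artist example). The 90% confidence level is an assumption chosen here to match that example's 10% test level, and the function name `proportion_ci` is hypothetical.

```python
import math
from statistics import NormalDist

def proportion_ci(count, n, level):
    """Interval p-hat +/- z* sqrt(p-hat (1 - p-hat) / n) for one level."""
    p_hat = count / n
    # z* leaves (1 - level)/2 in each tail of the standard normal
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Level of interest: RBB, preferred by 38 of the 78 students
low, high = proportion_ci(38, 78, 0.90)
print(f"{low:.3f} to {high:.3f}")
```

The interval runs from roughly 0.394 to 0.580, which lies entirely above the 1/3 expected under Ho.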

Chi-Square Tests

Examine HO – Page 5

Slide #25