
Cases of Categorical Count Data. Chi Square.

Stephen L Cowen and Bryan Kromenacker

2017

The case studies:

This case study will focus on a few datasets, because the Chi-square test is applied differently to different kinds of data. In each case it asks whether the number of observations differs from what would be expected under a null distribution (called 'Expected' in Chi-square speak).

First question: Cars


Has the distribution of car classes produced (compact, pickups, etc...) changed between 1999 and 2008? (These are the 2 dates in the mpg dataset.)

Research hypothesis: The proportion of small cars (subcompacts and compacts) will be greater in 2008 than in 1999, given increased environmental awareness on the consumer end.

H0: There is no difference in the distribution of car types between these two years.

Second question: Drug versus placebo

Does treatment with a drug increase the likelihood of successful therapy over placebo?

Topics covered:

This guy:

$$\chi^2 = \sum_{k=1}^{n} \frac{(O_k - E_k)^2}{E_k}$$

• Chi square Goodness of Fit test for count data.
• Chi square Test for Independence (contingency tables and relationships).
• Interpreting Chi-tests and their output.
• Effect sizes for Chi-square tests.

Let's GO:

If you are running this code on your own, don't bother with the knitr command.

knitr::opts_chunk$set(fig.width=5, fig.height=3, echo=TRUE, warning=FALSE, message=FALSE)

library('tidyverse')
rm(list = ls()) # this clears variables in case any are already loaded.
data('mpg')
head(mpg)

## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr>
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p
## 3 audi         a4      2.0  2008     4 manual(m6) f        20    31 p
## 4 audi         a4      2.0  2008     4 auto(av)   f        21    30 p
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p
## # ... with 1 more variables: class <chr>

Topic. What you can and cannot do with Chi Square.

Step 1. Load the dataset...

Load the mpg dataset into R. See, it's very easy to load these datasets. Look at the available datasets with the data function. Right now, type ?mpg and read about this dataset. This way you will know what each column means.

data('mpg')
head(mpg)

## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr>
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p
## 3 audi         a4      2.0  2008     4 manual(m6) f        20    31 p
## 4 audi         a4      2.0  2008     4 auto(av)   f        21    30 p
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p
## # ... with 1 more variables: class <chr>

I am going to get rid of cars/trucks with 5 cylinders because I refuse to believe that such an abomination even exists. Actually, they are a weird category and I've never heard of them, so I am going to take them out of this demo for simplicity.

mpg = subset(mpg, cyl !=5) # Subsetting.

You might recall that in a previous case study I used the droplevels command to get rid of unused levels. I don't have to do it here because cyl is stored as a number, not as a factor.

Generating tables of counts in R.

Let's have a look at some distributions...

table counts the number of elements in each level of the given factor.
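As a small illustration of what table returns, here is a sketch on a made-up vector. The toy values below are invented for this example and are not part of mpg.

```r
# Hypothetical toy data, not from the mpg dataset.
toy = factor(c('suv', 'compact', 'suv', 'pickup', 'suv', 'compact'))

toy_counts = table(toy)   # one count per level; levels default to alphabetical order
toy_counts

names(toy_counts)         # the level names
as.vector(toy_counts)     # the bare counts
```

The result behaves like a named vector of counts, which is why it can later be divided by its own sum to get proportions.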

Car type

table(mpg$class)

##
##    2seater    compact    midsize    minivan     pickup subcompact
##          5         45         41         11         33         33
##        suv
##         62

Engine type

table(mpg$cyl)

##
##  4  6  8
## 81 79 70

Drivetrain (4 wheel, front, or rear)

table(mpg$drv)

##
##   4   f   r
## 103 102  25

Fuel type

table(mpg$fl)

##
##   c   d   e   p   r
##   1   5   8  52 164

Year

table(mpg$year)

##
## 1999 2008
##  117  113

Two-way tables (contingency tables or crosstabs or even pivot tables in Excel):

These are for looking for RELATIONSHIPS between 2 factors.

table(mpg$class,mpg$cyl)

##
##               4  6  8
##   2seater     0  0  5
##   compact    32 13  0
##   midsize    16 23  2
##   minivan     1 10  0
##   pickup      3 10 20
##   subcompact 21  7  5
##   suv         8 16 38

I will now reorder the levels of class roughly by the size of the vehicle, which makes more sense for plotting. factor will reorder the levels to the order you specify in c(...).

mpg$class = factor(mpg$class, c('2seater' , 'subcompact', 'compact' , 'midsize' , 'minivan' , 'suv', 'pickup' ))
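To see the reordering in isolation, here is a minimal sketch on a made-up vector (the size values are hypothetical). It only shows that the second argument of factor sets the level order, which table and plots then follow.

```r
# Hypothetical toy vector; the values are made up for illustration.
size = c('large', 'small', 'medium', 'small')

levels(factor(size))   # default order is alphabetical

size_f = factor(size, levels = c('small', 'medium', 'large'))
levels(size_f)         # now in the order we asked for
table(size_f)          # counts are reported in that same order
```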

TBL = table(mpg$class,mpg$cyl)
TBL

##
##               4  6  8
##   2seater     0  0  5
##   subcompact 21  7  5
##   compact    32 13  0
##   midsize    16 23  2
##   minivan     1 10  0
##   suv         8 16 38
##   pickup      3 10 20

Q:Do you see a pattern here? What is it?

Marginal totals

These are what they sound like - the totals for rows and columns.

Sum up the columns

colSums(TBL)

##  4  6  8
## 81 79 70

Sum up the rows

rowSums(TBL)

##    2seater subcompact    compact    midsize    minivan        suv
##          5         33         45         41         11         62
##     pickup
##         33

Sum the entire table

sum(TBL)

## [1] 230
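If you want all of the marginal totals in one view, base R's addmargins is one option. This is a sketch that rebuilds the table from the counts printed above so it runs on its own; the name TBL_margins is made up here.

```r
# Counts copied from the class-by-cylinder table printed above (filled column by column).
TBL = as.table(matrix(c(0, 21, 32, 16,  1,  8,  3,    # 4 cylinders
                        0,  7, 13, 23, 10, 16, 10,    # 6 cylinders
                        5,  5,  0,  2,  0, 38, 20),   # 8 cylinders
                      nrow = 7,
                      dimnames = list(c('2seater', 'subcompact', 'compact', 'midsize',
                                        'minivan', 'suv', 'pickup'),
                                      c('4', '6', '8'))))

TBL_margins = addmargins(TBL)   # appends a 'Sum' row and a 'Sum' column
TBL_margins
```

The bottom-right cell is the grand total, the last row holds the column totals, and the last column holds the row totals.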

Now that we have the marginal totals we can express each cell as a percentage of each row, column, and table.

Percentage of each car category by engine type (for all 4 cylinder engines compact car is most common)

round(prop.table(TBL, 2),2) # each cell divided by its column total (TBL/colSums(TBL) would recycle the totals down the rows instead)

##
##                 4    6    8
##   2seater    0.00 0.00 0.07
##   subcompact 0.26 0.09 0.07
##   compact    0.40 0.16 0.00
##   midsize    0.20 0.29 0.03
##   minivan    0.01 0.13 0.00
##   suv        0.10 0.20 0.54
##   pickup     0.04 0.13 0.29

Percentage of each engine type by car category (for midsize cars 6 cylinder is most common)

round(TBL/rowSums(TBL),2) # table divided by marginal row totals

##
##                 4    6    8
##   2seater    0.00 0.00 1.00
##   subcompact 0.64 0.21 0.15
##   compact    0.71 0.29 0.00
##   midsize    0.39 0.56 0.05
##   minivan    0.09 0.91 0.00
##   suv        0.13 0.26 0.61
##   pickup     0.09 0.30 0.61

Percentage of each engine type/car type combination in the table (8 cylinder SUVs are most common)

round(TBL/sum(TBL),2)

##
##                 4    6    8
##   2seater    0.00 0.00 0.02
##   subcompact 0.09 0.03 0.02
##   compact    0.14 0.06 0.00
##   midsize    0.07 0.10 0.01
##   minivan    0.00 0.04 0.00
##   suv        0.03 0.07 0.17
##   pickup     0.01 0.04 0.09
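The three kinds of proportions can also be requested with prop.table, whose margin argument states explicitly which total each cell is divided by. The sketch below rebuilds the counts from the table above so it is self-contained; by_row, by_col, overall and by_col_sweep are names invented for this example.

```r
# Counts copied from the class-by-cylinder table printed above (filled column by column).
TBL = matrix(c(0, 21, 32, 16,  1,  8,  3,
               0,  7, 13, 23, 10, 16, 10,
               5,  5,  0,  2,  0, 38, 20),
             nrow = 7,
             dimnames = list(c('2seater', 'subcompact', 'compact', 'midsize',
                               'minivan', 'suv', 'pickup'),
                             c('4', '6', '8')))

by_row  = prop.table(TBL, margin = 1)   # each row sums to 1
by_col  = prop.table(TBL, margin = 2)   # each column sums to 1
overall = prop.table(TBL)               # the whole table sums to 1

# sweep() is an explicit alternative for the column case.
by_col_sweep = sweep(TBL, 2, colSums(TBL), '/')

round(by_col, 2)
```

A quick sanity check for any of these tables is that the relevant sums come out to 1, for example colSums(by_col).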

Now you can answer these questions (write these in your homework file):

Q: What proportion of 4 cylinder cars are midsize?

Q: What proportion of 8 cylinder cars are pickups?

Q: What proportion of minivans are 6 cylinder?

Compare the 2 tables for two different years and ask if the cell counts have changed.

mpg1999 = subset(mpg,year == 1999)
mpg2008 = subset(mpg,year == 2008)

Table for 1999

table(mpg1999$class,mpg1999$cyl)


Table for 2008

table(mpg2008$class,mpg2008$cyl)


Pretty plotting of tables

If you would like a pretty graphical plot of the tables use grid.table from gridExtra... - e.g. for a figure or powerpoint.

Both years

library(gridExtra)
grid.table(TBL)

# Plotting side by side.

1999 versus 2008

grid.arrange(tableGrob(table(mpg1999$class,mpg1999$cyl)), tableGrob(table(mpg2008$class,mpg2008$cyl)), nrow = 1)

Now that we have done some exploration, let's work towards testing hypotheses.

Chi-squared goodness of fit test (GOF test)

Technically a "test of homogeneity" in this case as we are comparing longitudinal data.

Both the goodness of fit test and the test of homogeneity ask really similar questions: Is the observed distribution of count data significantly different from the expected distribution? For "goodness of fit" the expected distribution is a theoretical one (normal distribution, etc.), while for the "test of homogeneity" the expected distribution comes from a previous actual observation (car data from 1999, etc.). Both use the same Chi-squared calculation, and both take the null hypothesis to be that there is no statistical difference between the expected and observed distributions.

H1: The distribution of car types (classes) differed between 1999 and 2008. We will ignore interactions for now. (FYI: this is similar to the cheese curd example we talked about in class)

t1 = table(mpg1999$class)
t2 = table(mpg2008$class)
t1

##
##    2seater subcompact    compact    midsize    minivan        suv
##          2         19         25         20          6         29
##     pickup
##         16

sum(t1)

## [1] 117

t2

##
##    2seater subcompact    compact    midsize    minivan        suv
##          3         14         20         21          5         33
##     pickup
##         17

sum(t2)

## [1] 113

Calculate the Expected (NULL) from a previous set of data:

Step 1: estimate probabilities (risks) of observing each class in the NULL distribution

Looking at sum(t1) and sum(t2), the total number of cars recorded differs slightly between the years (117 versus 113). That does not matter: the null and observed distributions do not need the same total, because we can calculate the null counts from the probabilities of observing each level in the null distribution.

t_null_prob = t1/sum(t1)
round(t_null_prob,2) # Do you understand this table?

##
##    2seater subcompact    compact    midsize    minivan        suv
##       0.02       0.16       0.21       0.17       0.05       0.25
##     pickup
##       0.14

This is how we convert those probabilities from t1 to expected counts (pseudo counts) for t2.

t_null_counts = t_null_prob * sum(t2)
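To make the conversion concrete, here is a self-contained sketch that retypes the class counts from the two tables above and lines up the observed 2008 counts against the counts expected under the 1999 proportions. The classes vector is just a helper name for this example.

```r
# Class counts copied from the 1999 (t1) and 2008 (t2) tables printed above.
classes = c('2seater', 'subcompact', 'compact', 'midsize', 'minivan', 'suv', 'pickup')
t1 = setNames(c(2, 19, 25, 20, 6, 29, 16), classes)
t2 = setNames(c(3, 14, 20, 21, 5, 33, 17), classes)

t_null_prob   = t1 / sum(t1)            # 1999 proportions; these sum to 1
t_null_counts = t_null_prob * sum(t2)   # expected 2008 counts if nothing changed

round(rbind(observed = t2, expected = t_null_counts), 1)
sum(t_null_counts)                      # matches sum(t2) by construction
```

Because the expected counts are rescaled to the 2008 total, the two rows are directly comparable even though the yearly totals differ.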

Let's calculate chi square!

$$\chi^2 = \sum_{k=1}^{n} \frac{(O_k - E_k)^2}{E_k}$$

Chi_sq = sum((t2 - t_null_counts)^2/t_null_counts) # Simple isn't it. Note that the denominator is the expected count.

Q: Now use the chisq.test() function and compare to the calculated value above. Is it the same?

Do you understand how to give the chisq.test function the null distribution probabilities? do ?chisq.test or google to get some extra help. Report the chi-square value and p value here:

Let's plot the sampling distribution of Chi-square vals for the df of the test... (df = n categories (called "k") minus 1). df = 7-1 = 6. It is also in the output of chisq.test

x2 = seq(0,15,.1)
qplot(x2, dchisq(x2,6), geom = "line") + geom_vline(xintercept = Chi_sq) + ggtitle("Not even close") + ylab("p")

Recall, we can calculate the p value ourselves if we know the Chi-sq value and the df: pchisq (recall that there is also a pt, pnorm, pf and a dt, dnorm, df)

pval = 1-pchisq(Chi_sq,6)
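The same tail area can also be requested directly through the lower.tail argument of pchisq. The sketch below uses the 5% critical value for df = 6 as a stand-in statistic (an assumption made only so that the answer is known in advance); df_demo, crit, p_upper_a and p_upper_b are made-up names.

```r
df_demo = 6
crit = qchisq(0.95, df_demo)   # the value with 5% of the distribution to its right

p_upper_a = 1 - pchisq(crit, df_demo)                    # the form used above
p_upper_b = pchisq(crit, df_demo, lower.tail = FALSE)    # same area, asked for directly

c(crit = crit, p_upper_a = p_upper_a, p_upper_b = p_upper_b)
```

Both forms return 0.05 here; the lower.tail = FALSE form tends to be more accurate when the p value is extremely small.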

Q: Why do we need to do 1-pchisq and not just pchisq?

Even though the number of each car type hasn't changed...we can ask if the engine types have changed over this time.

t1 = table(mpg1999$cyl)
t2 = table(mpg2008$cyl)
t1

##
##  4  6  8
## 45 45 27

t2

##
##  4  6  8
## 36 34 43

From the table it looks as if there is a visual trend toward more 8 cylinders (so much for environmental awareness... :<)

t_null_prob = t1/sum(t1)
t_null_counts = t_null_prob * sum(t2)

$$\chi^2 = \sum_{k=1}^{n} \frac{(O_k - E_k)^2}{E_k}$$

Chi_sq = sum( (t2 - t_null_counts)^2/t_null_counts) # divide by the expected counts, not by t1

We do in fact see a significant difference

chisq.test(t2, p = t_null_prob)

##
##  Chi-squared test for given probabilities
##
## data:  t2
## X-squared = 14.323, df = 2, p-value = 0.0007758

Q: Is the "hand"-calculated Chi-square (Chi_sq) the same as R's version from chisq.test?

Calculate the Chi-squared value from the formula by hand and plot p-value cutoff line on the appropriate Chi-squared distribution as we did above in the previous example.
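One way to set this up, as a sketch: the cylinder counts are retyped from the tables above so the block runs on its own, and the cutoff uses alpha = 0.05, which is an assumption (pick whatever alpha your analysis calls for). Chi_sq_hand, Chi_sq_R and cutoff are names made up for this example.

```r
# Cylinder counts copied from the 1999 (t1) and 2008 (t2) tables printed above.
t1 = c('4' = 45, '6' = 45, '8' = 27)
t2 = c('4' = 36, '6' = 34, '8' = 43)

t_null_prob   = t1 / sum(t1)
t_null_counts = t_null_prob * sum(t2)   # expected counts E

Chi_sq_hand = sum((t2 - t_null_counts)^2 / t_null_counts)          # (O - E)^2 / E, summed
Chi_sq_R    = unname(chisq.test(t2, p = t_null_prob)$statistic)    # R's value

cutoff = qchisq(0.95, df = length(t2) - 1)   # assumed alpha = 0.05, df = 3 - 1

c(hand = Chi_sq_hand, R = Chi_sq_R, cutoff = cutoff)
```

The cutoff value can be passed to geom_vline as a second vertical line on the df = 2 density to mark where the rejection region starts.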

Q: Can we conclude from this result that there are significantly fewer 4 cylinders and more 8 cylinders?

Chi-squared test of independence: Chi for tables of data - relationships

Let's see how we examine specific relationships between rows and columns in our table...

Let's begin by building a data frame of count data for DRUGS: Prozac in particular. (It's made up, so don't believe the results.)

Prozac = data.frame(x = factor(c(rep(1,49),rep(0,44))), y = factor(c(rep(1,13),rep(0,36),rep(1,14),rep(0,30))))
# rep just repeats stuff many times.
ptab = table(Prozac)
colnames(ptab) = c('relapse','success')
rownames(ptab) = c('placebo','drug')
ptab

##          y
## x         relapse success
##   placebo      30      14
##   drug         36      13

Nice! This is the number of cases of people tested on either prozac or placebo who did or did not relapse.

Let's append the colsums and rowsums.

ptab = cbind(ptab,rowSums(ptab))
# You could also do it this way:
# ptab = cbind(ptab,c(sum(ptab[1,]),sum(ptab[2,])))
ptab = rbind(ptab,colSums(ptab))
# or this way:
# ptab = rbind(ptab,c(sum(ptab[,1]),sum(ptab[,2]),sum(ptab[,3])))
# the joys of cbind and rbind...
colnames(ptab) = c('relapse','success','row totals')
rownames(ptab) = c('placebo','drug','column totals')

ptab

##               relapse success row totals
## placebo            30      14         44
## drug               36      13         49
## column totals      66      27         93

Computing expected values in contingency tables.

We can see the "observed" values for each cell, but what is the "expected"?

$$E_{i,j} = \frac{Row_i \times Col_j}{N}$$

etab = ptab
etab[1,1] = round((ptab[1,3]*ptab[3,1])/ptab[3,3],1)
etab[1,2] = round((ptab[1,3]*ptab[3,2])/ptab[3,3],1)
etab[2,1] = round((ptab[2,3]*ptab[3,1])/ptab[3,3],1)
etab[2,2] = round((ptab[2,3]*ptab[3,2])/ptab[3,3],1)

Observed

grid.table(ptab)

Expected

grid.table(etab)
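The cell-by-cell arithmetic can also be done in one step with outer, which multiplies every row total by every column total. This is a sketch on the four observed counts only (no totals appended); obs, expected and expected_R are names made up for this example.

```r
# Observed 2x2 counts copied from the Prozac table above (filled column by column).
obs = matrix(c(30, 36, 14, 13), nrow = 2,
             dimnames = list(c('placebo', 'drug'), c('relapse', 'success')))

# E[i,j] = Row_i * Col_j / N for every cell at once
expected = outer(rowSums(obs), colSums(obs)) / sum(obs)
round(expected, 1)

# chisq.test() stores the same expected counts in its result
expected_R = chisq.test(obs)$expected
```

Comparing expected with expected_R is a handy check that the hand calculation and R agree before moving on to the statistic itself.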

This lets us ask if there is a relationship between taking the drug and having treatment success.

Chi_sq = sum( (ptab - etab)^2/etab)
Chi_sq

## [1] 0.3014416

To get our degrees of freedom to calculate significance we take (num_rows-1)*(num_cols-1) = 1

x2 = seq(0,15,0.1)
qplot(x2, dchisq(x2,1), geom = "line") + geom_vline(xintercept = Chi_sq) + ggtitle("Not looking good") + ylab("p")

Risks and Odds

Contingency tables can also be used to measure "risk" and "odds".

Taking the proportions by row gives you the risk:

rtab = ptab
rtab[1,1] = round(ptab[1,1]/ptab[1,3],2)
rtab[1,2] = round(ptab[1,2]/ptab[1,3],2)
rtab[2,1] = round(ptab[2,1]/ptab[2,3],2)
rtab[2,2] = round(ptab[2,2]/ptab[2,3],2)
grid.table(rtab)

So we can say that the risk of relapse with the drug was actually higher than with placebo...not great

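Odds compare the two outcomes within a row: the odds of relapse are the number of relapses divided by the number of successes. The table below rescales both cells of a row by a common column total, so the ratio of the two cells within a row gives those odds.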

otab = ptab
otab[1,1] = round(ptab[1,1]/ptab[3,1],2)
otab[1,2] = round(ptab[1,2]/ptab[3,1],2)
otab[2,1] = round(ptab[2,1]/ptab[3,2],2)
otab[2,2] = round(ptab[2,2]/ptab[3,2],2)
grid.table(otab)

The odds of an outcome are the chance of it happening divided by the chance of it not happening. Here that is relapse over success within each row.

ORrelapse_placebo = otab[1,1]/otab[1,2]

Placebo

ORrelapse_placebo

## [1] 2.142857

ORrelapse_drug = otab[2,1]/otab[2,2]

Drug

ORrelapse_drug

## [1] 2.770833

These odds tell us that relapse is more than twice as likely as success in both groups, AND that the odds of relapse are greater for the drug than for the placebo.
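For reference, here is a compact sketch of risk and odds computed straight from the four observed counts, without the rounding steps. The names obs, risk, odds_relapse and odds_from_risk are made up for this example.

```r
# Observed 2x2 counts copied from the Prozac table above (filled column by column).
obs = matrix(c(30, 36, 14, 13), nrow = 2,
             dimnames = list(c('placebo', 'drug'), c('relapse', 'success')))

risk = prop.table(obs, margin = 1)                   # row proportions: risk of each outcome
odds_relapse = obs[, 'relapse'] / obs[, 'success']   # odds of relapse within each row

# Odds can equally be written from the risk: p / (1 - p)
odds_from_risk = risk[, 'relapse'] / (1 - risk[, 'relapse'])

round(risk, 2)
round(odds_relapse, 2)
```

Risk lives between 0 and 1, while odds can be any non-negative number, which is why the two tables look so different for the same counts.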

Effect sizes and contingency tables.

The ODDS RATIO is the ratio of the odds in the first row to the odds in the second row. It is one measure of effect size in contingency tables.

Q: What is the odds ratio for this table?

We can calculate an effect size for the Chi-squared test.

$$\text{Cramer's } V = \sqrt{\frac{\chi^2 / n}{\min(\text{num rows} - 1,\ \text{num cols} - 1)}}$$

There is no built in function in basic R, but we can build our own...

Cramers_V = function (XT){
  ch = chisq.test(XT)
  chsq_v = ch$statistic
  N = sum(XT)
  k = min(c(nrow(XT), ncol(XT)))
  cv = sqrt(chsq_v/(N*(k-1)))
  return (cv)
}
# Call our fancy function. (yeah, or use a packaged version such as cramersV from the lsr
# library, but making our own function builds character)
cv = Cramers_V(ptab)

tiny

cv

##  X-squared
## 0.02056277
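One thing to watch: by this point ptab has had the totals row and column appended, so the call above sees a 3x3 table whose entries sum to 372 rather than 93. Here is a sketch of the same calculation on just the four observed counts. It assumes correct = FALSE (no Yates continuity correction) so that the statistic matches the plain Chi-square formula; that choice, and the names obs, chsq and cv_2x2, are assumptions of this example.

```r
# Observed 2x2 counts copied from the Prozac table above, without the totals.
obs = matrix(c(30, 36, 14, 13), nrow = 2,
             dimnames = list(c('placebo', 'drug'), c('relapse', 'success')))

# correct = FALSE is an assumption here: it switches off the continuity correction
# that chisq.test() applies to 2x2 tables by default.
chsq = unname(chisq.test(obs, correct = FALSE)$statistic)

N = sum(obs)                     # 93 people
k = min(nrow(obs), ncol(obs))    # 2
cv_2x2 = sqrt(chsq / (N * (k - 1)))

round(c(chi_sq = chsq, cramers_v = cv_2x2), 3)
```

The effect is still small either way, but the size of V depends on which table and which correction you feed in, so it is worth stating both when you report it.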

Connecting Chi-square to Binomial tests. Brief digression.

If you recall from the readings, Chi-square is an extension of the Binomial distribution and test. Let's prove it - somewhat. Let's determine the Chi-square and p value for different counts of coin tosses and compare to the binomial test (flipping coins) for the same distributions.

Heads = c(1,1,1,1,1,0,0,0,0,0) # 50/50 so Chi square p value better be pretty high!
print(length(Heads))

## [1] 10

chisq.test(c(sum(Heads==1),sum(Heads==0)), p = c(.5, .5))

##
##  Chi-squared test for given probabilities
##
## data:  c(sum(Heads == 1), sum(Heads == 0))
## X-squared = 0, df = 1, p-value = 1

Heads = c(1,1,1,0,0,0,0,0,0,0) # 3 heads and 7 tails, so the p value should drop.
chisq.test(c(sum(Heads==1),sum(Heads==0)), p = c(.5, .5))

##
##  Chi-squared test for given probabilities
##
## data:  c(sum(Heads == 1), sum(Heads == 0))
## X-squared = 1.6, df = 1, p-value = 0.2059





Try this with the binomial test.

binom.test(5,10,p=.5)

##
##  Exact binomial test
##
## data:  5 and 10
## number of successes = 5, number of trials = 10, p-value = 1
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.187086 0.812914
## sample estimates:
## probability of success
##                    0.5


binom.test(3,10,p=.5)

##
##  Exact binomial test
##
## data:  3 and 10
## number of successes = 3, number of trials = 10, p-value = 0.3438
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.06673951 0.65245285
## sample estimates:
## probability of success
##                    0.3


binom.test(2,10,p=.5)

##
##  Exact binomial test
##
## data:  2 and 10
## number of successes = 2, number of trials = 10, p-value = 0.1094
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.02521073 0.55609546
## sample estimates:
## probability of success
##                    0.2


binom.test(1,10,p=.5)

##
##  Exact binomial test
##
## data:  1 and 10
## number of successes = 1, number of trials = 10, p-value = 0.02148
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.002528579 0.445016117
## sample estimates:
## probability of success
##                    0.1
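To line the two tests up without reading through each printout, both p values can be pulled out with $p.value and put side by side. This is a sketch; heads, p_binom and p_chisq are names made up for this example.

```r
heads = c(5, 3, 2, 1)   # number of heads out of 10 tosses, as in the outputs above

# p value from the exact binomial test for each count
p_binom = sapply(heads, function(h) binom.test(h, 10, p = 0.5)$p.value)

# p value from the Chi-square goodness of fit test for the same counts
p_chisq = sapply(heads, function(h) chisq.test(c(h, 10 - h), p = c(0.5, 0.5))$p.value)

round(data.frame(heads, p_binom, p_chisq), 4)
```

Try changing the 10 tosses to a much larger number (keeping the same proportions of heads) and watch what happens to the two columns.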

Q: Are the p values from the binomial test the same as those from the chi-square test?

Q: Is this the same p as the binom.test? How would you fix it?
