The Analysis of Categorical Data

Categorical variables

• When both predictor and response variables are categorical:

• Presence or absence

• Color, etc.

• The data in such a study are counts (frequencies) of observations in each category

Analysis

Data Analysis

A single categorical predictor variable

Organized as two way contingency tables, and tested with chi-square or G-test

Multiple predictor variables (or complex models)

Organized as multi-way contingency tables, and analyzed using either log-linear models or classification trees

Two way Contingency Tables

• Analysis of contingency tables is done correctly only on the raw counts, not on the percentages, proportions, or relative frequencies of the data

Wildebeest carcasses from the Serengeti (Sinclair and Arcese 1995)

Sex, cause of death, and bone marrow type

• Sex (males / females)

• Cause of death (predation / other)

• Bone marrow type:

1. Solid white fatty (healthy animal)

2. Opaque gelatinous

3. Translucent gelatinous

Data

Sex Marrow Death by predation

Male SWF Yes

Male OG Yes

Male TG Yes

… … …

Brief format

SEX MARROW DEATH COUNT

FEMALE SWF PRED 26

MALE SWF PRED 14

FEMALE OG PRED 32

MALE OG PRED 43

FEMALE TG PRED 8

MALE TG PRED 10

FEMALE SWF NPRED 6

MALE SWF NPRED 7

FEMALE OG NPRED 26

MALE OG NPRED 12

FEMALE TG NPRED 16

MALE TG NPRED 26
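The brief format can be collapsed into any two-way table by summing the counts over the variable that is left out. A minimal sketch in plain Python, assuming the twelve rows are typed in by hand; the names `ROWS`, `FIELDS` and `crosstab` are illustrative and not part of the original slides:

```python
from collections import defaultdict

# (sex, marrow, death, count) rows copied from the brief-format table.
ROWS = [
    ("FEMALE", "SWF", "PRED", 26), ("MALE", "SWF", "PRED", 14),
    ("FEMALE", "OG", "PRED", 32), ("MALE", "OG", "PRED", 43),
    ("FEMALE", "TG", "PRED", 8), ("MALE", "TG", "PRED", 10),
    ("FEMALE", "SWF", "NPRED", 6), ("MALE", "SWF", "NPRED", 7),
    ("FEMALE", "OG", "NPRED", 26), ("MALE", "OG", "NPRED", 12),
    ("FEMALE", "TG", "NPRED", 16), ("MALE", "TG", "NPRED", 26),
]
FIELDS = {"sex": 0, "marrow": 1, "death": 2}


def crosstab(row_var, col_var):
    """Sum the counts over every variable except row_var and col_var."""
    table = defaultdict(int)
    for row in ROWS:
        key = (row[FIELDS[row_var]], row[FIELDS[col_var]])
        table[key] += row[3]
    return dict(table)


sex_by_death = crosstab("sex", "death")
for key in sorted(sex_by_death):
    print(key, sex_by_death[key])
```

Calling `crosstab("sex", "marrow")` or `crosstab("death", "marrow")` gives the other two marginal tables in the same way.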

Contingency table

Sex * Death Crosstabulation

Dead

Sex NPRED PRED Total

FEMALE 48 66 114

MALE 45 67 112

Total 93 133 226

Contingency table

Sex * Marrow Crosstabulation

Marrow

Sex OG SWF TG Total

FEMALE 58 32 24 114

MALE 55 21 36 112

Total 113 53 60 226

Contingency table

Death * Marrow Crosstabulation

Marrow

Death OG SWF TG Total

NPRED 38 13 42 93

PRED 75 40 18 133

Total 113 53 60 226

Are the variables independent?

We want to know, for example, whether males are more likely to die by predation than females

• Specifying the null hypothesis:

• The predictor and response variables are not associated with each other. The two variables are independent of each other, and the observed degree of association is no stronger than we would expect by chance or random sampling

Calculating the expected values

• The expected value is the total number of observations (N) times the probability that an individual is both male and dead by predation

$$\hat{Y}_{\text{male, dead by predation}} = N \times P(\text{male} \cap \text{dead by predation})$$

The probability of two independent events

$$P(\text{male} \cap \text{dead by predation}) = P(\text{male}) \times P(\text{dead by predation})$$

Because we have no information other than the data, we estimate each probability on the right-hand side of this equation from the marginal totals

Contingency table

Sex * Death expected values

Dead

Sex      NPRED    PRED     Total   P
FEMALE   46.91    67.09    114     0.5044
MALE     46.09    65.91    112     0.4956
Total    93       133      N=226
P        0.4115   0.5885

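$$\hat{Y}_{\text{female, not predated}} = N \times P(\text{female} \cap \text{not predated})$$

$$\hat{Y}_{ij} = \frac{\text{row total}_i \times \text{column total}_j}{\text{sample size}}$$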
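The expected values in the Sex * Death table follow directly from the row and column totals. A minimal sketch, assuming the observed counts are entered as a nested list; the helper name `expected_counts` is illustrative:

```python
# Observed Sex * Death counts; rows FEMALE, MALE; columns NPRED, PRED.
OBSERVED = [[48, 66], [45, 67]]


def expected_counts(observed):
    """Expected count per cell = row total * column total / sample size."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_counts(OBSERVED)
for label, row in zip(("FEMALE", "MALE"), expected):
    print(label, [round(value, 2) for value in row])
```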

Testing the hypothesis: Pearson’s Chi-square test

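$$X^2_{\text{Pearson}} = \sum_{\text{all cells}} \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} = 0.0866, \quad P = 0.7685$$

$$X^2_{\text{Yates}} = \sum_{\text{all cells}} \frac{(|\text{Observed} - \text{Expected}| - 0.5)^2}{\text{Expected}} = 0.0253, \quad P = 0.8736$$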
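Both statistics can be checked by hand for the Sex * Death table. The sketch below is one possible implementation, not the software used for the slides; it relies on the fact that, for 1 degree of freedom, the upper tail of the chi-square distribution equals `erfc(sqrt(x / 2))`. Function names are illustrative.

```python
from math import erfc, sqrt

# Observed Sex * Death counts; rows FEMALE, MALE; columns NPRED, PRED.
OBSERVED = [[48, 66], [45, 67]]


def expected_counts(observed):
    """Expected count per cell = row total * column total / sample size."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_x2(observed, yates=False):
    """Pearson chi-square; yates=True subtracts 0.5 from each |O - E|."""
    expected = expected_counts(observed)
    correction = 0.5 if yates else 0.0
    return sum(
        (abs(o - e) - correction) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


def p_value_1df(x2):
    """Upper tail of a chi-square distribution with 1 degree of freedom."""
    return erfc(sqrt(x2 / 2.0))


for label, use_yates in (("Pearson", False), ("Yates", True)):
    x2 = pearson_x2(OBSERVED, yates=use_yates)
    print(f"{label}: X2 = {x2:.4f}, P = {p_value_1df(x2):.4f}")
```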

The degrees of freedom

$$df = (\text{number of rows} - 1) \times (\text{number of columns} - 1) = 1$$

Calculating the P-value

• We find the probability of obtaining a value of Χ2 as large or larger than 0.0866 relative to a Χ2 distribution with 1 degree of freedom

• P = 0.769

[Figure: plot of tcount by sex (female, male) and cause of death (non predator, predator), shaded by standardized residuals; legend bins: <-4, -4:-2, -2:0, 0:2, 2:4, >4]

An alternative

• The likelihood ratio test: It compares observed values with the distribution of expected values based on the multinomial probability distribution

$$G = 2 \sum_{\text{all cells}} \text{Observed} \times \ln\left(\frac{\text{Observed}}{\text{Expected}}\right) = 0.0866$$
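The G statistic uses the same expected values as the Pearson test. A minimal sketch for the Sex * Death table, with illustrative function names; cells with an observed count of zero are skipped because their contribution to the sum is zero:

```python
from math import log

# Observed Sex * Death counts; rows FEMALE, MALE; columns NPRED, PRED.
OBSERVED = [[48, 66], [45, 67]]


def expected_counts(observed):
    """Expected count per cell = row total * column total / sample size."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def g_statistic(observed):
    """Likelihood ratio statistic G = 2 * sum(O * ln(O / E))."""
    expected = expected_counts(observed)
    return 2.0 * sum(
        o * log(o / e)
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
        if o > 0
    )


print(f"G = {g_statistic(OBSERVED):.4f}")
```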

Two way contingency tables

• Sex * Death Crosstabulation: $X^2_{\text{Pearson}} = 0.087$, d.f. = 1, P = 0.769; $G = 0.087$, d.f. = 1, P = 0.769

• Sex * Marrow Crosstabulation: $X^2_{\text{Pearson}} = 4.745$, d.f. = 2, P = 0.093; $G = 4.778$, d.f. = 2, P = 0.092

• Marrow * Death Crosstabulation: $X^2_{\text{Pearson}} = 29.308$, d.f. = 2, P < 0.001; $G = 29.520$, d.f. = 2, P < 0.001

Which test to choose?

Model   Rows/Columns      Sample size   Test
I       Not fixed         small         G-test, with corrections
II      Fixed/not fixed   small         G-test, with corrections
I       Not fixed         large         G-test, Chi-square test
II      Fixed/not fixed   large         G-test, Chi-square test
III     Fixed                           Fisher exact test
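For a 2 x 2 table with both margins fixed (Model III), Fisher's exact test sums hypergeometric probabilities instead of relying on the chi-square approximation. The sketch below assumes one common two-sided convention (add up every table that is no more probable than the observed one). It is applied to the Sex * Death counts purely as an illustration, since the wildebeest survey was not a fixed-margin design. The function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(table):
    """Two-sided Fisher exact P-value for a 2 x 2 table of counts."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # given the fixed row and column totals.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance so ties with the observed table are included.
    return sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Rows FEMALE, MALE; columns NPRED, PRED.
print(f"P = {fisher_exact_two_sided([[48, 66], [45, 67]]):.3f}")
```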

Log-linear models

Multi-way Contingency Tables

Multiple two-way tables

Females Marrow

Death OG SWF TG Total

PRED 32 26 8 66

NPRED 26 6 16 48

Total 58 32 24 114

Males Marrow

Death OG SWF TG Total

PRED 43 14 10 67

NPRED 12 7 26 45

Total 55 21 36 112

Log-linear models

• They treat the cell frequencies as counts distributed as a Poisson random variable

• The expected cell frequencies are modeled against the variables using the log-link and Poisson error term

• They are fit and parameters estimated using maximum likelihood techniques

Log-linear models

• Do not distinguish response and predictor variables: all the variables are considered equally as response variables

However

• A logit model with categorical variables can be analyzed as a log-linear model

Two way tables

• For a two way table (I by J) we can fit two log-linear models

• The first is a saturated (full) model:

$$\log f_{ij} = \text{constant} + \lambda_i^{X} + \lambda_j^{Y} + \lambda_{ij}^{XY}$$

• $f_{ij}$ is the expected frequency in cell ij

• $\lambda_i^{X}$ is the effect of category i of variable X

• $\lambda_j^{Y}$ is the effect of category j of variable Y

• $\lambda_{ij}^{XY}$ is the effect of any interaction between X and Y

• This model fits the observed frequencies perfectly

Note

• The effect does not imply any causality, just the influence of a variable or interaction between variables on the log of the expected number of observations in a cell

Two way tables

• The second log-linear model represents independence of the two variables (X and Y) and is a reduced model:

$$\log f_{ij} = \text{constant} + \lambda_i^{X} + \lambda_j^{Y}$$

• The interpretation of this model is that the log of the expected frequency in any cell is a function of the mean of the log of all the expected frequencies plus the effect of variable x and the effect of variable y. This is an additive linear model with no interactions between the two variables

Interpretation

• The parameters of the log-linear models are the effects of a particular category of each variable on the expected frequencies:

• i.e. a larger λ means that the expected frequencies will be larger for that category.

• These parameters are also deviations from the mean of the log of all expected frequencies
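To make the "deviation from the mean" reading concrete, the sketch below computes the parameters of the independence model for the Sex * Death table. It assumes sum-to-zero constraints (each set of λ values sums to zero), which is one common parameterisation; other software may use a reference-category coding and report different numbers for the same fitted values. Names are illustrative.

```python
from math import exp, log

# Observed Sex * Death counts; rows FEMALE, MALE; columns NPRED, PRED.
OBSERVED = [[48, 66], [45, 67]]


def independence_parameters(observed):
    """Constant and lambda effects of log f_ij under independence."""
    rows = [sum(row) for row in observed]
    cols = [sum(col) for col in zip(*observed)]
    n = sum(rows)
    n_rows, n_cols = len(rows), len(cols)
    # Log of the expected frequencies under independence.
    log_f = [[log(r * c / n) for c in cols] for r in rows]
    # The constant is the mean of all log expected frequencies.
    constant = sum(map(sum, log_f)) / (n_rows * n_cols)
    # Each effect is a row (or column) mean minus the overall mean.
    lam_x = [sum(log_f[i]) / n_cols - constant for i in range(n_rows)]
    lam_y = [
        sum(log_f[i][j] for i in range(n_rows)) / n_rows - constant
        for j in range(n_cols)
    ]
    return constant, lam_x, lam_y


constant, lam_x, lam_y = independence_parameters(OBSERVED)
print("constant:", round(constant, 4))
print("lambda sex (FEMALE, MALE):", [round(v, 4) for v in lam_x])
print("lambda death (NPRED, PRED):", [round(v, 4) for v in lam_y])
# Back-transforming recovers the expected frequency for FEMALE, NPRED.
print("f(FEMALE, NPRED):", round(exp(constant + lam_x[0] + lam_y[0]), 2))
```

The larger λ for PRED than for NPRED reflects the larger column total (133 against 93), which is what the interpretation above describes.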

Null hypothesis of independence

• The Ho is that the sampling or experimental units come from a population of units in which the two variables (rows and columns) are independent of each other in terms of the cell frequencies

• It is also a test that $\lambda_{ij}^{XY} = 0$:

• There is NO interaction between two variables

Test

• We can test this Ho by comparing the fit of the model without this term to the saturated model that includes this term

• We determine the fit of each model by calculating the expected frequencies under each model, comparing the observed and expected frequencies and calculating the log-likelihood of each model

Test

• We then compare the fit of the two models with the likelihood ratio test statistic ∆

• However, the sampling distribution of this ratio (∆) is not well known, so instead we calculate the G² statistic:

$$G^2 = -2 \log \Delta$$

• G² follows a Χ² distribution for reasonable sample sizes and can be generalized to

$$G^2 = -2\,(\text{log-likelihood of reduced model} - \text{log-likelihood of full model})$$
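As a worked check, the two log-likelihoods can be computed directly for the Sex * Death table. The sketch assumes Poisson cell counts, as stated for log-linear models, and uses the fact that the saturated model's fitted values equal the observed counts; function names are illustrative.

```python
from math import lgamma, log

# Observed Sex * Death counts; rows FEMALE, MALE; columns NPRED, PRED.
OBSERVED = [[48, 66], [45, 67]]


def independence_fit(observed):
    """Fitted values of the reduced (independence) model."""
    rows = [sum(row) for row in observed]
    cols = [sum(col) for col in zip(*observed)]
    n = sum(rows)
    return [[r * c / n for c in cols] for r in rows]


def poisson_loglik(observed, fitted):
    """Sum over cells of log Poisson(observed | mean = fitted)."""
    return sum(
        o * log(f) - f - lgamma(o + 1)  # lgamma(o + 1) = log(o!)
        for obs_row, fit_row in zip(observed, fitted)
        for o, f in zip(obs_row, fit_row)
    )


ll_reduced = poisson_loglik(OBSERVED, independence_fit(OBSERVED))
ll_full = poisson_loglik(OBSERVED, OBSERVED)  # saturated: fitted = observed
g2 = -2.0 * (ll_reduced - ll_full)
print(f"log-likelihood reduced = {ll_reduced:.4f}, full = {ll_full:.4f}")
print(f"G2 = {g2:.4f}")
```

The result agrees with the G statistic computed from observed and expected counts for the same table, because the terms that do not depend on the model cancel in the difference.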

Degrees of freedom

• The calculated G2 is compared to a Χ2 distribution with (I-1)(J-1) df.

• This df, (I-1)(J-1), is the difference between the df for the full model (IJ-1) and the df for the reduced model [(I-1)+(J-1)]

Akaike information criterion

$$AIC = -2 \log L(\hat{\theta} \mid \text{data}) + 2K$$

Hirotugu Akaike

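The full model

$$\log f_{ijk} = C + \lambda_i^{\text{death}} + \lambda_j^{\text{sex}} + \lambda_k^{\text{marrow}} + \lambda_{ij}^{\text{death} \times \text{sex}} + \lambda_{ik}^{\text{death} \times \text{marrow}} + \lambda_{jk}^{\text{sex} \times \text{marrow}} + \lambda_{ijk}^{\text{death} \times \text{sex} \times \text{marrow}}$$

$$AIC_{\text{particular model}} = G^2_{\text{particular model}} - 2\,df_{\text{particular model}}$$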

Complete table

Model G2 df P AIC

1 D+S+M 42.76 7 0.001 28.76

2 D*S 42.68 6 0.001 30.68

3 D*M 13.24 5 0.021 3.24

4 S*M 37.98 5 0.001 27.98

5 D*S+D*M 13.16 4 0.01 5.16

6 D*S+S*M 37.89 4 0.001 29.89

7 D*M+S*M 8.46 3 0.037 2.46

8 D*S+D*M+S*M 7.19 2 0.027 3.19

9 Saturated full model 0 0
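The G², df and AIC columns can be approximated without a statistics package. The sketch below swaps in iterative proportional fitting (IPF), which yields the maximum likelihood fitted values for hierarchical log-linear models, and uses the slide's definition AIC = G² - 2 df. The axis coding, the `MODELS` dictionary and all function names are assumptions made for this illustration; small differences in the last digit from the table are possible.

```python
from itertools import combinations
from math import log

# Axes: 0 = death (PRED, NPRED), 1 = sex (FEMALE, MALE), 2 = marrow (SWF, OG, TG).
LEVELS = (2, 2, 3)
COUNTS = {
    (0, 0, 0): 26, (0, 1, 0): 14, (0, 0, 1): 32, (0, 1, 1): 43,
    (0, 0, 2): 8, (0, 1, 2): 10, (1, 0, 0): 6, (1, 1, 0): 7,
    (1, 0, 1): 26, (1, 1, 1): 12, (1, 0, 2): 16, (1, 1, 2): 26,
}
# Each model is listed by the margins it fits (its highest-order terms).
MODELS = {
    "1 D+S+M": [(0,), (1,), (2,)],
    "2 D*S": [(0, 1), (2,)],
    "3 D*M": [(0, 2), (1,)],
    "4 S*M": [(1, 2), (0,)],
    "5 D*S+D*M": [(0, 1), (0, 2)],
    "6 D*S+S*M": [(0, 1), (1, 2)],
    "7 D*M+S*M": [(0, 2), (1, 2)],
    "8 D*S+D*M+S*M": [(0, 1), (0, 2), (1, 2)],
    "9 Saturated": [(0, 1, 2)],
}


def margin(table, axes):
    """Sum a table of cells over every axis not listed in axes."""
    out = {}
    for cell, value in table.items():
        key = tuple(cell[a] for a in axes)
        out[key] = out.get(key, 0.0) + value
    return out


def fit_ipf(margins, n_iter=100):
    """Scale the fitted table until it matches each observed margin."""
    fitted = {cell: 1.0 for cell in COUNTS}
    for _ in range(n_iter):
        for axes in margins:
            observed, current = margin(COUNTS, axes), margin(fitted, axes)
            for cell in fitted:
                key = tuple(cell[a] for a in axes)
                fitted[cell] *= observed[key] / current[key]
    return fitted


def g2(fitted):
    """Deviance of a fitted table against the observed counts."""
    return 2.0 * sum(n * log(n / fitted[c]) for c, n in COUNTS.items() if n > 0)


def residual_df(margins):
    """Cells minus one, minus the parameters of every term in the model."""
    terms = set()
    for axes in margins:
        for size in range(1, len(axes) + 1):
            terms.update(combinations(axes, size))
    n_params = 0
    for term in terms:
        product = 1
        for axis in term:
            product *= LEVELS[axis] - 1
        n_params += product
    return len(COUNTS) - 1 - n_params


RESULTS = {}
for name, margins in MODELS.items():
    deviance, df = g2(fit_ipf(margins)), residual_df(margins)
    RESULTS[name] = (deviance, df, deviance - 2 * df)
    print(f"{name:15s} G2 = {deviance:6.2f}  df = {df}  AIC = {deviance - 2 * df:6.2f}")
```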

Two way interactions (marginal independence)

Term    Models compared   G2        Difference in G2          d.f.        P
D+S+M   reference         42.76
D*S     1 vs 2            42.6759   42.76 - 42.68 = 0.084     7 - 6 = 1   0.769
D*M     1 vs 3            13.24     42.76 - 13.24 = 29.520    7 - 5 = 2   <0.001
S*M     1 vs 4            37.98     42.76 - 37.98 = 4.778     7 - 5 = 2   0.092

Three way interaction

• Death*Sex*Marrow

• Models compared 8 vs 9

• G2= 7.19

• df 2

• P=0.027

Conditional independence

term Models compared G2 df P

D*S 7 vs 8 1.28 1 0.259

D*M 6 vs 8 30.71 2 0.001

S*M 5 vs 8 5.97 2 0.051
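Each row of this table is a difference between two lines of the complete table, referred to a chi-square distribution with the difference in df. The sketch below takes the rounded G² values from the complete table as input and evaluates the chi-square upper tail with the standard closed-form series for integer df; the names `FIT`, `compare` and `chi2_upper_tail` are illustrative.

```python
from math import erfc, exp, pi, sqrt


def chi2_upper_tail(x, df):
    """P(chi-square with integer df >= x), from the finite closed-form series."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^r / r! for r = 0 .. df/2 - 1.
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= x / (2 * r)
            total += term
        return exp(-x / 2) * total
    # Odd df: normal tail plus a correction series in odd powers of sqrt(x).
    term, total = sqrt(x), 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return erfc(sqrt(x / 2)) + 2 * exp(-x / 2) / sqrt(2 * pi) * total


# (G2, df) copied from the complete table, as rounded there.
FIT = {5: (13.16, 4), 6: (37.89, 4), 7: (8.46, 3), 8: (7.19, 2), 9: (0.0, 0)}


def compare(reduced, full):
    """Difference in G2 and df between two nested models, with its P-value."""
    g_reduced, df_reduced = FIT[reduced]
    g_full, df_full = FIT[full]
    delta_g, delta_df = g_reduced - g_full, df_reduced - df_full
    return delta_g, delta_df, chi2_upper_tail(delta_g, delta_df)


for term, reduced in (("D*S", 7), ("D*M", 6), ("S*M", 5)):
    delta_g, delta_df, p = compare(reduced, 8)
    print(f"{term}: G2 = {delta_g:.2f}, df = {delta_df}, P = {p:.3f}")
```

Because the inputs are already rounded, the D*S difference comes out as 1.27 rather than the tabulated 1.28, which was presumably computed from unrounded G² values.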

Death and marrow have a partial association

Females Marrow

Death OG SWF TG Total

PRED 32 26 8 66

NPRED 26 6 16 48

Total 58 32 24 114

Males Marrow

Death OG SWF TG Total

PRED 43 14 10 67

NPRED 12 7 26 45

Total 55 21 36 112

$$\hat{\theta}_{XY(k)} = \frac{n_{11k}\, n_{22k}}{n_{12k}\, n_{21k}}$$

Conditional independence

Comparison Males 95% CI Females 95% CI

OG vs TG 0.107 0.041-0.283 0.406 0.150-1.097

SWF vs TG 0.192 0.060-0.616 0.115 0.034-0.395

SWF vs OG 0.558 0.184-1.693 3.521 1.261-9.836

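$$\hat{\theta}_{\text{SWF vs OG, male}} = \frac{14 \times 12}{43 \times 7} = 0.558$$

$$\hat{\theta}_{\text{SWF vs OG, female}} = \frac{26 \times 26}{32 \times 6} = 3.521$$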
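The conditional odds ratios can be recomputed from the partial tables. The slides do not say how the 95% confidence intervals were obtained; the sketch below assumes a Wald interval on the log odds ratio (standard error = square root of the sum of reciprocal cell counts), which appears to reproduce the tabulated limits to within rounding. The function name and data layout are illustrative.

```python
from math import exp, log, sqrt

# Death by marrow counts within each sex, copied from the partial tables.
PARTIAL = {
    "MALE": {
        "PRED": {"OG": 43, "SWF": 14, "TG": 10},
        "NPRED": {"OG": 12, "SWF": 7, "TG": 26},
    },
    "FEMALE": {
        "PRED": {"OG": 32, "SWF": 26, "TG": 8},
        "NPRED": {"OG": 26, "SWF": 6, "TG": 16},
    },
}


def odds_ratio_ci(sex, first, second, z=1.96):
    """Odds of predation for marrow type `first` relative to `second`.

    Returns (odds ratio, lower limit, upper limit); the interval is a
    Wald interval on the log scale, which is an assumption of this sketch.
    """
    table = PARTIAL[sex]
    n11, n12 = table["PRED"][first], table["PRED"][second]
    n21, n22 = table["NPRED"][first], table["NPRED"][second]
    theta = (n11 * n22) / (n12 * n21)
    se = sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    return theta, exp(log(theta) - z * se), exp(log(theta) + z * se)


for sex in ("MALE", "FEMALE"):
    theta, lower, upper = odds_ratio_ci(sex, "SWF", "OG")
    print(f"SWF vs OG, {sex}: {theta:.3f} ({lower:.3f}-{upper:.3f})")
```

With this argument order, the rows labelled "OG vs TG" and "SWF vs TG" in the table correspond to `odds_ratio_ci(sex, "TG", "OG")` and `odds_ratio_ci(sex, "TG", "SWF")`.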

Complete independence

• Models compared 1 vs 8

• G2=35.57

• df= 5

• P < 0.001

Warning

• Always fit a saturated model first, containing all the variables of interest and all the interactions involving the (potential) nuisance variables. Only delete from the model the interactions that involve the variables of interest.
